“Finite Field Arithmetic.” Chapter 9: "Exodus from Egypt" with Comba's Algorithm.

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical "Open Sores" abomination, in that -- rather than trusting the author blindly with their lives -- prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

Add the above vpatch and seal to your V-set, and press to ffa_ch9_exodus.kv.vpatch.

You should end up with the same directory structure as previously.

Now compile ffacalc:

cd ffacalc
gprbuild

But do not run it quite yet.


The "Egyptian" algorithm for multiplication (in its simplest Chapter 5 version, and likewise in the refined form given in Chapter 7) has a time cost that is proportional to the square of the operand bitness -- i.e. it runs in O(N^2), or quadratic time. In Chapter 10, we will meet a substantially faster -- subquadratic -- algorithm for multiplication. However, for reasons which will become apparent to the reader, we would first like to obtain the fastest quadratic-time multiplier that we possibly can (without compromising any of the FFA design principles laid out in Chapter 1).

The ancient Egyptians came up with an easily-taught and reasonably efficient-method for multiplication. However, they ran it "on" human scribes, rather than automatic machinery. And it turns out that the latter, especially in the form of recently-made CPUs, is quite sensitive to the particular time-and-space "shape" of an algorithm. But we will expand on this point later.

Mathematician and astronomer Paul G. Comba (1926 -- 2017) devised an elegant multiplication algorithm which superficially resembles the familiar "grade school" long-multiplication -- but with the important difference that it does not require the generation of intermediate per-row temporary results. Instead, it generates each column -- or digit -- (or, in FFA parlance, Word) of the product, sequentially, from lowest to highest, using only a handful of words of memory for temporary storage. Let's implement Comba's algorithm:

FZ_Mul.ads:

with FZ_Type; use FZ_Type;
 
package FZ_Mul is
 
   pragma Pure;
 
   -- Comba's multiplier.
   procedure FZ_Mul_Comba(X     : in  FZ;
                          Y     : in  FZ;
                          XY_Lo : out FZ;
                          XY_Hi : out FZ);
   pragma Precondition(X'Length = Y'Length and
                         XY_Lo'Length = XY_Hi'Length and
                         XY_Lo'Length = ((X'Length + Y'Length) / 2));
 
end FZ_Mul;

FZ_Mul.adb:

with Words;    use Words;
with Word_Ops; use Word_Ops;
with W_Mul;    use W_Mul;
 
 
package body FZ_Mul is
 
   -- Comba's multiplier.
   procedure FZ_Mul_Comba(X     : in  FZ;
                          Y     : in  FZ;
                          XY_Lo : out FZ;
                          XY_Hi : out FZ) is
 
      -- Words in each multiplicand
      L : constant Word_Index := X'Length;
 
      -- Length of Product, i.e. double the length of either multiplicand
      LP : constant Word_Index := 2 * L;
 
      -- 3-word Accumulator
      A2, A1, A0 : Word := 0;
 
      -- Register holding Product; indexed from zero
      XY : FZ(0 .. LP - 1);
 
      -- Type for referring to a column of XY
      subtype ColXY is Word_Index range XY'Range;
 
      -- Compute the Nth (indexed from zero) column of the Product
      procedure Col(N : in ColXY; U : in ColXY; V : in ColXY) is
 
         -- The outputs of a Word multiplication
         Lo, Hi : Word;
 
         -- Carry for the Accumulator addition
         C      : WBool;
 
         -- Sum for Accumulator addition
         Sum    : Word;
 
      begin
 
         -- For lower half of XY, will go from 0 to N
         -- For upper half of XY, will go from N - L + 1 to L - 1
         for j in U .. V loop
 
            -- Hi:Lo := j-th Word of X  *  (N - j)-th Word of Y
            Mul_Word(X(X'First + j),
                     Y(Y'First - j + N),
                     Lo, Hi);
 
            -- Now add Hi:Lo into the Accumulator:
 
            -- A0 += Lo; C := Carry
            Sum := A0 + Lo;
            C   := W_Carry(A0, Lo, Sum);
            A0  := Sum;
 
            -- A1 += Hi + C; C := Carry
            Sum := A1 + Hi + C;
            C   := W_Carry(A1, Hi, Sum);
            A1  := Sum;
 
            -- A2 += A2 + C
            A2  := A2 + C;
 
         end loop;
 
         -- We now have the Nth (indexed from zero) word of XY
         XY(N) := A0;
 
         -- Right-Shift the Accumulator by one Word width:
         A0 := A1;
         A1 := A2;
         A2 := 0;
 
      end Col;
      pragma Inline_Always(Col);
 
   begin
 
      -- Compute the lower half of the Product:
      for i in 0 .. L - 1 loop
 
         Col(i, 0, i);
 
      end loop;
 
      -- Compute the upper half (sans last Word) of the Product:
      for i in L .. LP - 2 loop
 
         Col(i, i - L + 1, L - 1);
 
      end loop;
 
      -- The very last Word of the Product:
      XY(XY'Last) := A0;
 
      -- Output the Product's lower and upper FZs:
      XY_Lo := XY(0 .. L - 1);
      XY_Hi := XY(L .. XY'Last);
 
   end FZ_Mul_Comba;
   pragma Inline_Always(FZ_Mul_Comba);
 
end FZ_Mul;

I will omit the detailed analysis of why Comba's Multiplication works, and leave the proof as an exercise for the reader. However, do observe that this routine demands a Mul_Word procedure. Which the reader will correctly guess, lives in the unit W_Mul:

W_Mul.ads:

with Words; use Words;
 
package W_Mul is
 
   pragma Pure;
 
   -- The bitness of a Half-Word
   HalfBitness : constant Positive := Bitness / 2;
   subtype HalfWord is Word range 0 .. 2**HalfBitness;
 
   -- The number of bytes in a Half-Word
   HalfByteness : constant Positive := Byteness / 2;
 
   -- Multiply half-words X and Y, producing a Word-sized product
   function Mul_HalfWord(X : in HalfWord; Y : in HalfWord) return Word;
 
   -- Get the bottom half of a Word
   function BottomHW(W : in Word) return HalfWord;
 
   -- Get the top half of a Word
   function TopHW(W : in Word) return HalfWord;
 
   -- Carry out X*Y mult, return lower word XY_LW and upper word XY_HW.
   procedure Mul_Word(X : in Word; Y : in Word; XY_LW : out Word; XY_HW : out Word);
 
end W_Mul;

W_Mul.adb ("Iron" Variant):

with W_Shifts; use W_Shifts;
 
 
package body W_Mul is
 
   function Mul_HalfWord(X : in HalfWord;
                         Y : in HalfWord) return Word is
   begin
      return X * Y;
   end Mul_HalfWord;
   pragma Inline_Always(Mul_HalfWord);
 
 
   -- Get the bottom half of a Word
   function BottomHW(W : in Word) return HalfWord is
   begin
      return W and (2**HalfBitness - 1);
   end BottomHW;
   pragma Inline_Always(BottomHW);
 
 
   -- Get the top half of a Word
   function TopHW(W : in Word) return HalfWord is
   begin
      return Shift_Right(W, HalfBitness);
   end TopHW;
   pragma Inline_Always(TopHW);
 
 
   -- Carry out X*Y mult, return lower word XY_LW and upper word XY_HW.
   procedure Mul_Word(X       : in  Word;
                      Y       : in  Word;
                      XY_LW   : out Word;
                      XY_HW   : out Word) is
 
      -- Bottom half of multiplicand X
      XL : constant HalfWord := BottomHW(X);
 
      -- Top half of multiplicand X
      XH : constant HalfWord := TopHW(X);
 
      -- Bottom half of multiplicand Y
      YL : constant HalfWord := BottomHW(Y);
 
      -- Top half of multiplicand Y
      YH : constant HalfWord := TopHW(Y);
 
      -- XL * YL
      LL : constant Word := Mul_HalfWord(XL, YL);
 
      -- XL * YH
      LH : constant Word := Mul_HalfWord(XL, YH);
 
      -- XH * YL
      HL : constant Word := Mul_HalfWord(XH, YL);
 
      -- XH * YH
      HH : constant Word := Mul_HalfWord(XH, YH);
 
      -- Carry
      CL : constant Word := TopHW(TopHW(LL) + BottomHW(LH) + BottomHW(HL));
 
   begin
 
      -- Get the bottom half of the Product:
      XY_LW := LL + Shift_Left(LH + HL, HalfBitness);
 
      -- Get the top half of the Product:
      XY_HW := HH + TopHW(HL) + TopHW(LH) + CL;
 
   end Mul_Word;
   pragma Inline_Always(Mul_Word);
 
end W_Mul;

Mul_Word, it turns out, is merely a Word*Word -> double-Word "schoolbook" multiplier. And: it is abundantly obvious that it could be replaced with a single IMUL instruction on x86/x86-64. But we won't do this in standard FFA -- recall the design principles from Chapter 1! We will not be "marrying" any particular iron.

A very sharp reader may have already noticed the "problematic" element in the above Mul_Word routine. Namely, Mul_HalfWord as given here creates a danger of timing sidechannel leakage on certain machine architectures!

Ostorozhno s vilami!

CPUs widely known to suffer from this "helpful" "optimization" include x86-compatible VIA, old Intels (386, 486), and most of the commonly-met PowerPC and ARM devices. Others archs, as well as obscure implementations of well-known archs, may also belong on this list. (Please write in if you happen to know of one such that I have not listed here.)

If FFA were to continue with the above variant (let's call it iron) of Mul_HalfWord, every machine on which it is put to use would have to be first probed for non-constanttime iron multiplication: every newly-purchased CPU is arguably "guilty until proven innocent". But it turns out that there is a simple and relatively-inexpensive cure for this headache:

W_Mul.adb ("Soft" Variant, non-unrolled):

   -- Multiply half-words X and Y, producing a Word-sized product
   function Mul_HalfWord(X : in HalfWord; Y : in HalfWord) return Word is
 
      -- X-Slide
      XS : Word := X;
 
      -- Y-Slide
      YS : Word := Y;
 
      -- Gate Mask
      GM : Word;
 
      -- The Product
      XY : Word := 0;
 
   begin
 
      -- For each bit of the Y-Slide:
      for b in 1 .. HalfBitness loop
 
         -- Compute the gate mask
         GM := 0 - (YS and 1);
 
         -- Perform the gated addition
         XY := XY + (XS and GM);
 
         -- Crank the next Y-slide bit into position
         YS := Shift_Right(YS, 1);
 
         -- Advance the X-slide by 1 bit
         XS := Shift_Left(XS, 1);
 
      end loop;
 
      -- Return the Product
      return XY;
 
   end Mul_HalfWord;
   pragma Inline_Always(Mul_HalfWord);

The above Mul_HalfWord replaces the "iron" multiplication with a miniature version of the Chapter 5 "Egyptian" algorithm. Because there is no multiwordism and no carry calculation, this safe (bitwise logical operations and shifts by one bit are constant-time on all CPUs known to me) alternative to the "leaky" MUL, when implemented in the partially unrolled form as follows -- comes at a cost of a less than 6 percent (see further below) increase in required CPU time(see comments):

W_Mul.adb ("Soft" Variant, unrolled):

   -- Multiply half-words X and Y, producing a Word-sized product
   function Mul_HalfWord(X : in HalfWord; Y : in HalfWord) return Word is
 
      -- X-Slide
      XS : Word := X;
 
      -- Y-Slide
      YS : Word := Y;
 
      -- Gate Mask
      GM : Word;
 
      -- The Product
      XY : Word := 0;
 
      -- Performed for each bit of HalfWord's bitness:
      procedure Bit is
      begin
 
         -- Compute the gate mask
         GM := 0 - (YS and 1);
 
         -- Perform the gated addition
         XY := XY + (XS and GM);
 
         -- Crank the next Y-slide bit into position
         YS := Shift_Right(YS, 1);
 
         -- Advance the X-slide by 1 bit
         XS := Shift_Left(XS, 1);
 
      end Bit;
      pragma Inline_Always(Bit);
 
   begin
 
      -- For each bit of the Y-Slide (unrolled) :
      for b in 1 .. HalfByteness loop
 
         Bit; Bit; Bit; Bit; Bit; Bit; Bit; Bit;
 
      end loop;
 
      -- Return the Product
      return XY;
 
   end Mul_HalfWord;
   pragma Inline_Always(Mul_HalfWord);

And now, on the same machine as in previous Chapters, timed and plotted logarithmically:

Cost of Ch.9 'X' Operation, vs FFA Bitness.

Or, for those who prefer the raw numbers,

FFA Bitness Ch.6 Egyptian 'X' (sec) Ch.7 Egyptian 'X' (sec) Ch.9 "Iron" Comba 'X' (sec) Ch.9 "Soft" Comba 'X' (sec)
256 0.072 0.040 0.028 0.029
512 0.505 0.240 0.163 0.169
1024 3.685 1.672 1.125 1.177
2048 27.975 12.024 7.975 8.387
4096 217.966 90.439 59.356 62.725
8192 1720.127 699.979 457.103 483.993

Despite this small increase in runtime cost, standard FFA will from this point on use only the "soft" variant of Mul_HalfWord, so as to maintain the guarantee of constant-time operation on all known CPU architectures. Individual readers, can, of course, at their own risk -- replace the entirety of Mul_Word -- or even FZ_Mul_Comba -- with an inline-assembly routine known to work safely on a particular CPU. But this type of optimization falls outside the scope of FFA as per the design principles introduced in Chapter 1.

~To be continued!~


Answer to Homework Problem 1 from Chapter 8:

Let's suppose that you wish to test multiplication by generating a Python program via FFACalc. Your tape could look like this:

tape.txt:

??``*[print 0x]##[ == 0x]#[ * 0x]#

Which would produce a valid Python program that can be fed straight into a Python interpreter, e.g. :

./bin/ffa_calc 1048576 32 /dev/urandom < tape.txt | tr -d '\n' > multest.py

... will pseudorandomly generate two numbers, each being one megabit in size, and produce a statement regarding their (two-megabit-wide, at most) product, that can be verified by the Python interpreter, in turn producing an output after around half a minute on the test box used thus far.

Now let's feed this output into Python:

$ python < multest.py

Prints:

True

... after less than one fifth of a second, on the same machine.

The bulk of the remaining material in this series will concern methods of bridging the speed gap between our routines and traditional "bignum" mechanisms (such as those used in the Python interpreter) to the extent possible without compromising any of the design principles of FFA.


Homework:

1. Prove that Comba's Algorithm works. And in particular, that the three-word Accumulator space given, will always suffice.

2. Given that both "Egyptian" and Comba's multiplication run in O(N^2), why does Comba's method win speedwise? Does it continue to win for all possible operand bitnesses, or only over a particular range?

3. Write an inline assembly variant of FZ_Mul_Comba for your CPU architecture. Demonstrate how you determined that the resulting program, when operating on the exponentiation test tape from Chapter 6, still runs in constant time (i.e. its CPU time cost is unaffected by any aspect of the input other than the choice of FFA bitness) on your particular machine.


~To be continued!~

This entry was written by Stanislav , posted on Sunday January 28 2018 , filed under Ada, Bitcoin, Cold Air, Computation, Cryptography, FFA, Friends, Mathematics, ShouldersGiants, SoftwareArchaeology, SoftwareSucks . Bookmark the permalink . Post a comment below or leave a trackback: Trackback URL.

6 Responses to ““Finite Field Arithmetic.” Chapter 9: "Exodus from Egypt" with Comba's Algorithm.”

  • apeloyee says:

    The rush in preparation of this chapter shows.
    The commentary contains a howler.
    The impact of replacing hardware multiply by single-word Egyptian multiply is of course much bigger than 6%, but is masked here by the slowness of Egyptian division.
    A well.

    • Stanislav says:

      Dear apeloyee,

      The ~6% statement is correct for the measurements given here, which concern the X operation in the particular program of Ch. 9, on the particular test machine. I thought this were obvious ?

      I'ma put a pure-MUL comparison in the next Chapter. (Impact on multiplication per se, on my iron, is of course around 6-times, not 6%, yes)

      Yours,
      -S

  • PeterL says:

    Is your captcha broken? I am seeing the folowing text in the box:

    V1 UNSUPPORTED Please direct siteowner to g.co/recaptcha/upgrade

    but then I get the error: That reCAPTCHA response was incorrect.

    And it gives me what looks like a normal captcha?

    • Stanislav says:

      Dear PeterL,

      I have absolutely no idea.

      And no intention to "upgrade" any such thing.

      The old heathen-powered spamfilter is slated for replacement in the near future; probably with MP's.

      Yours,
      -S

    • apeloyee says:

      Yes, captcha sometimes doesn't work, seemingly randomly.
      I deal with it the shamanic way (save my comment somewhere else, and refresh the page).

      • Stanislav says:

        Dear apeloyee,

        Indeed, the googlecaptcha thing is liquishit, and I'd like to be rid of it.

        This however will involve opening up a gnarly and rusted mechanism I haven't touched for over a decade; thus it will likely have to wait until the end of this series.

        Yours,
        -S

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre lang="" line="" escaped="" highlight="">


MANDATORY: Please prove that you are human:

79 xor 3 = ?

What is the serial baud rate of the FG device ?


Answer the riddle correctly before clicking "Submit", or comment will NOT appear! Not in moderation queue, NOWHERE!