(Multiplication algorithms:)

What it does

The Montgomery multiplication algorithm, described in [1], is a fast method for computing

x × y mod M,
for large x, y and M (typically up to a few thousands of bits), when M is odd. Many important algorithms of asymmetric cryptography -- Diffie-Hellman algorithm is probably the most important example -- require many such operations, making Montgomery multiplication an important tool. Montgomery observed that, while multiplication is not too expensive, modular reduction is. So a variant of the plain old multiplication algorithm is used, but the modulo operation is eliminated totally.

Where it does it

In what follows, assume M is n bits long (and think e.g. n=1536). Compute

r = 2^n mod M = 2^n-M.
Additional required constants are r-1 (the inverse of r) and r2 -- both modulo M. If changing M is an infrequent operation (and for Diffie-Hellman, it is) then this precomputation is essentially free.

Montgomery multiplication defines a new representation of large numbers modulo M

a ⇒ R(a) = a × r mod M,
as well as a new multiplication operation
ab = a × b × r-1 mod M.
The neat property of all this is that
R(a × b) = a × a × r mod M = R(a) × R(b) × r = R(a) ⊗ R(b)
So R is a weird isomorphism of (Z/MZ)*, if you us the operator ⊗. If the preceding sentence means nothing to you, ignore it. It's just saying that ⊗ multiplies in the R-world the same way that × multiplies in the "real" modulo M world.

If we're going to be computing a sequence of multiplications, we don't need to convert to and from the R-world all the time, either. For instance, if computing g^x mod M (the basic operation for Diffie-Hellman), we just need to compute R(g) (which is a precomputed constant for computing a private key), then repeatedly use Montgomery multiplications (up to 2n-1 times), and convert the end result back to the real world.

Even nicer, R can be expressed in terms of ⊗ on bit-strings, using a precomputed constant. We have that

x ⊗ r2 = x × r2 × r-1 mod M = x × r mod M = R(x),
and
R(x) ⊗ 1 = x × r ×1 × r-1 = x,
so conversion to and from the R-world is just another application of our ⊗ operation.

A full Diffie-Hellman exchange is at most 2n ⊗ operations!

How it does it

So the trick is to show how to compute R(⋅) and ⊗ rapidly -- without recourse to modulo operations.

Well, the elementary algorithm for multiplying X=(xn-1...x0) by Y=(yn-1...y0) is just

Z = 0;
for i in 0..n-1:
  if (xi is set) then Z = Z + (Y << i);
done;
output: X×Y = Z.
We want to perform the above multiplication, simultaneously dividing by 2n. Since there are n steps, we divide by 2 at each step.

And here's the big deal. Dividing Z by 2 is easy -- if Z is even: just right shift. It even works when computing modulo M. But if 0 ≤ Z < M is odd, it is true that Z / 2 mod M = (Z+M) / 2 (where the RHS is computed as normal base-2 arithmetic). And we can compute this last division using a shift.

So the algorithm is just

Z = 0;
for i in 0..n-1:
  if (xi is set) then Z = Z + Y;
  if (Z is odd) then Z = Z + M;
  Z = Z >> 1;
done;
if (Z ≥ M) then Z = Z - M;
output: X⊗Y = Z

It gets even better...

Fewer branches are required in the loop than might appear. In a circuit, the first "if" operation can be performed directly: add xi ⋅ Y (an AND gate will do) instead of the first if. The algorithm is particularly suitable to storing Z in a redundant notation: "carry save", where carry bits are stored separately from Z. And no large operands need to be kept lying around -- n bits each is enough.

The final reduction after the loop (if (Z ≥ M) then Z = Z - M;) is an annoyance: it's another branch delay, right at the end. Luckily, if a sequence of multiplications is required, it need only be performed after the sequence ends (the proof of this is based on showing that if one input is bounded by 2M-1 and the other by M, then the output is also bounded by 2M-1) -- this appeared in [2]

References

  1. P. Montgomery, Modular multiplication without trial division, Mathematics of computation, vol. 44, 1985, pp. 519-521.
  2. Colin D. Walter, Montgomery exponentiation needs no final subtractions, Electronics Letters, 35(21):1831--1832, October 1999.