What it does
The Montgomery multiplication algorithm, described in , is a fast method for computing
x × y mod M,
for large x, y and M (typically up to a few thousands of bits), when M is odd
. Many important algorithms of asymmetric cryptography
algorithm is probably the most important example -- require many such operations, making Montgomery multiplication an important tool. Montgomery observed that, while multiplication is not too expensive, modular reduction is
. So a variant of the plain old multiplication algorithm is used, but the modulo operation is eliminated
Where it does it
In what follows, assume M is n bits long (and think e.g. n=1536). Compute
r = 2^n mod M = 2^n-M.
Additional required constants are r-1
(the inverse of r) and r2
-- both modulo M. If changing M is an infrequent operation (and for Diffie-Hellman, it is)
then this precomputation
is essentially free.
Montgomery multiplication defines a new representation of large numbers modulo M
a ⇒ R(a) = a × r mod M,
as well as a new multiplication operation
a ⊗ b =
a × b × r-1 mod M.
The neat property of all this is that
R(a × b) =
a × a × r mod M =
R(a) × R(b) × r =
R(a) ⊗ R(b)
So R is a weird isomorphism
)*, if you us the operator
⊗. If the preceding sentence means nothing to you, ignore it. It's just saying that ⊗ multiplies in the R-world the same way that × multiplies in the "real" modulo M world.
If we're going to be computing a sequence of multiplications, we don't need to convert to and from the R-world all the time, either. For instance, if computing g^x mod M (the basic operation for Diffie-Hellman), we just need to compute R(g) (which is a precomputed constant for computing a private key), then repeatedly use Montgomery multiplications (up to 2n-1 times), and convert the end result back to the real world.
Even nicer, R can be expressed in terms of ⊗ on bit-strings, using a precomputed constant. We have that
x ⊗ r2 = x × r2 × r-1 mod M = x × r mod M = R(x),
R(x) ⊗ 1 = x × r ×1 × r-1 = x,
so conversion to and from the R-world is just another application of our ⊗ operation.
A full Diffie-Hellman exchange is at most 2n ⊗ operations!
How it does it
So the trick is to show how to compute R(⋅) and ⊗ rapidly -- without recourse to modulo operations.
Well, the elementary algorithm for multiplying X=(xn-1...x0) by Y=(yn-1...y0) is just
Z = 0;
for i in 0..n-1:
if (xi is set) then Z = Z + (Y << i);
output: X×Y = Z.
We want to perform the above multiplication, simultaneously dividing by 2n
. Since there are n steps, we divide by 2 at each step.
And here's the big deal. Dividing Z by 2 is easy -- if Z is even: just right shift. It even works when computing modulo M. But if 0 ≤ Z < M is odd, it is true that Z / 2 mod M = (Z+M) / 2 (where the RHS is computed as normal base-2 arithmetic). And we can compute this last division using a shift.
So the algorithm is just
Z = 0;
for i in 0..n-1:
if (xi is set) then Z = Z + Y;
if (Z is odd) then Z = Z + M;
Z = Z >> 1;
if (Z ≥ M) then Z = Z - M;
output: X⊗Y = Z
It gets even better...
Fewer branches are required in the loop than might appear. In a circuit, the first "if" operation can be performed directly: add
xi ⋅ Y (an AND gate will do) instead of the first if. The algorithm is particularly suitable to storing Z in a redundant notation: "carry save", where carry bits are stored separately from Z. And no large operands need to be kept lying around -- n bits each is enough.
The final reduction after the loop (
if (Z ≥ M) then Z = Z - M;) is an annoyance: it's another branch delay, right at the end. Luckily, if a sequence of multiplications is required, it need only be performed after the sequence ends (the proof of this is based on showing that if one input is bounded by 2M-1 and the other by M, then the output is also bounded by 2M-1) -- this appeared in 
- P. Montgomery, Modular multiplication without trial division, Mathematics of computation, vol. 44, 1985, pp. 519-521.
- Colin D. Walter, Montgomery exponentiation needs no final subtractions, Electronics Letters, 35(21):1831--1832, October 1999.