### What it does

The Montgomery multiplication algorithm, described in [1], is a fast method for computing

x × y mod M,

for large x, y and M (typically up to a few thousands of bits), when M is

odd. Many important algorithms of

asymmetric cryptography --

Diffie-Hellman algorithm is probably the most important example -- require many such operations, making Montgomery multiplication an important tool. Montgomery observed that, while multiplication is not too expensive,

modular reduction *is*. So a variant of the plain old multiplication algorithm is used, but the modulo operation is

eliminated totally.

### Where it does it

In what follows, assume M is n bits long (and think e.g. n=1536). Compute

r = 2^n mod M = 2^n-M.

Additional required constants are r

^{-1} (the inverse of r) and r

^{2} -- both modulo M. If changing M is an infrequent operation

(and for Diffie-Hellman, it is) then this

precomputation is essentially free.

Montgomery multiplication defines a new *representation* of large numbers modulo M

`a` ⇒ R(`a`) = `a` × r mod M,

as well as a new multiplication operation

`a` ⊗ `b` =
`a` × `b` × r^{-1} mod M.

The neat property of all this is that

R(`a` × `b`) =
`a` × `a` × r mod M =
R(`a`) × R(`b`) × r =
R(`a`) ⊗ R(`b`)

So R is a weird

isomorphism of (

**Z**/M

**Z**)*, if you us the

operator ⊗. If the preceding sentence means nothing to you, ignore it. It's just saying that ⊗ multiplies in the R-world the same way that × multiplies in the "real" modulo M world.

If we're going to be computing a sequence of multiplications, we don't need to convert to and from the R-world all the time, either. For instance, if computing g^x mod M (the basic operation for Diffie-Hellman), we just need to compute R(g) (which is a precomputed constant for computing a private key), then repeatedly use Montgomery multiplications (up to 2n-1 times), and convert the end result back to the real world.

Even nicer, R can be expressed in terms of ⊗ on bit-strings, using a precomputed constant. We have that

x ⊗ r^{2} = x × r^{2} × r^{-1} mod M = x × r mod M = R(x),

and

R(x) ⊗ 1 = x × r ×1 × r^{-1} = x,

so conversion to and from the R-world is just another application of our ⊗ operation.

A full Diffie-Hellman exchange is at most 2n ⊗ operations!

### How it does it

So the trick is to show how to compute R(⋅) and ⊗ rapidly -- without recourse to modulo operations.

Well, the elementary algorithm for multiplying X=(x_{n-1}...x_{0}) by Y=(y_{n-1}...y_{0}) is just

Z = 0;
for i in 0..n-1:
if (x_{i} is set) then Z = Z + (Y << i);
done;
output: X×Y = Z.

We want to perform the above multiplication, simultaneously dividing by 2

^{n}. Since there are n steps, we divide by 2 at each step.

And here's the big deal. Dividing Z by 2 is easy -- *if Z is even*: just right shift. It even works when computing modulo M. But if 0 ≤ Z < M is odd, it is true that Z / 2 mod M = (Z+M) / 2 (where the RHS is computed as normal base-2 arithmetic). And we can compute this last division using a shift.

So the algorithm is just

Z = 0;
for i in 0..n-1:
if (x_{i} is set) then Z = Z + Y;
if (Z is odd) then Z = Z + M;
Z = Z >> 1;
done;
if (Z ≥ M) then Z = Z - M;
output: X⊗Y = Z

### It gets even better...

Fewer branches are required in the loop than might appear. In a circuit, the first "if" operation can be performed directly: add `x`_{i} ⋅ Y

(an AND gate will do) instead of the first if. The algorithm is particularly suitable to storing Z in a redundant notation: "carry save", where carry bits are stored separately from Z. And no large operands need to be kept lying around -- n bits each is enough.

The final reduction after the loop (`if (Z ≥ M) then Z = Z - M;`

) is an annoyance: it's another branch delay, right at the end. Luckily, if a sequence of multiplications is required, it need only be performed after the sequence ends (the proof of this is based on showing that if one input is bounded by 2M-1 and the other by M, then the output is also bounded by 2M-1) -- this appeared in [2]

### References

- P. Montgomery,
*Modular multiplication without trial division*, **Mathematics of computation**, vol. 44, 1985, pp. 519-521.
- Colin D. Walter,
*Montgomery exponentiation needs no final subtractions*, **Electronics Letters**, 35(21):1831--1832, October 1999.