T = t + (m-t)/p
Brent's theorem specifies for a sequential algorithm with t time steps, and a total of m operations, that a run time T is definitely possible on a shared memory machine (PRAM) with p processors. There may be an algorithm that solves this problem faster, or it may be possible to implement this algorithm faster (by scheduling instruction differently to minimise idle processors, for instance), but it is definitely possible to implement this algorithm in this time given p processors.
Key to understanding Brent's theorem is understanding time steps. In a single time step every instruction that has no dependencies is executed, and therefore t is equal to the length of the longest chain of instructions that depend on the results of other instructions (as any shorter chains will be finished executing by (or before) the time the longest chain has).
Say we are summing an array. We could step through the array, adding each value in turn to an accumulator. In pseudocode:
for (i=0;i < length(a); i++)
{
sum = sum + a(i);
}
Using this algorithm, each add operation depends on the result of the previous one, forming a chain of length n, thus t=n. There are n operations, so m = n.
T = n + 0/p.
(m-t) = 0, so no matter how many processors are available, this algorithm will take time n.
Now consider the adding the array recursively:
sum(a) = ( (A0 + A1) + (A2 + A3) ) + ( (A4 + A5) + (A6 + A7) ) etc...
We can add A2 to A3 without needing to know what A0 + A1 is. To calculate the sum from 0-3, we only need the results of A0 + A1, and A2 + A3. Instead of one chain of length 4, we have many chains of length 2. For an array of length n, the longest chain(s) will be of length log(n). t = log(n). m = n (as before).
T = log(n) + (n - log(n))/p.
This tells us many useful things about the algorithm:
- No matter how many processors used, there can be no implementation of this algorithm that runs faster than Olog(n).
- If we have n processors, the algorithm can be implemented in log(n) time
- If we have log(n) processors, the algorithm can be implemented in 2log(n) time (asymptotically this is the same as log(n) time)
- If we have one processor, the algorithm can be implemented in n time.
If we consider the amount of work done in each case, with one processor, we do n work, with log(n) processors we do n work, but with n processors we do nlog(n) work. The implementations with 1 or log(n) processors, therefore are
cost optimal, while the implementation with n processors is not.
It is important to remember that Brent's theorem does not tell us how to implement any of these algorithms in parallel; it merely tells us what is possible. The Brent's theorem implementation may be hideously ugly compared to the naive implementation.