The "classic" control scheme for out-of-order superscalar processor
micro-architectures, Tomasulo's Algorithm provides a framework
for organising and coordinating the simultaneous execution
of independent instructions on multiple execution units, with
dynamic scheduling around data hazards and structural hazards
such as multi-cycle pipelines, as well as cache misses.
Many modern microprocessors are designed around variations on Tomasulo's
Algorithm, perhaps most notably Intel's P6 core (on which Andy Glew worked), found in the Pentium Pro,
Pentium II, Pentium III and Pentium M. Other implementations include several PowerPC processors, and
famously the DEC Alpha 21264.
The first implementation of Tomasulo's algorithm was the floating point unit of the IBM System/360 Model 91, a mainframe processor; the earlier CDC 6600 mainframe also scheduled instructions dynamically, but using a different scheme, scoreboarding.
The Problem
A microprocessor will have a variety of different functional units,
corresponding roughly to the variety of different instructions
found in the instruction set architecture it implements: memory accesses,
simple integer arithmetic, multiplication, floating point arithmetic and
so on. In a simple, "scalar" pipelined microprocessor, only one instruction
is actually executed at a time (although others may be in the process of
being fetched from instruction memory, having operands fetched from
registers, and so on), and so most of the actual functional units will
be idle for most of the time. Furthermore, where functional units take
more than a single cycle to compute their results, they can potentially
hold up following instructions unnecessarily.
So, it's desirable to be able to make use of these idle units, and execute
more than one instruction at the same time to improve performance, without
adding the cost of any more execution units. So long as successive instructions
don't rely on each other's results too often (i.e. there is a sufficient amount
of instruction level parallelism), we should see a reasonable improvement
in performance.
This all sounds very reasonable so far, and in fact it is the key idea behind
superscalar processors. The trouble is that providing the ability for multiple instructions
to execute at once, in a manner that actually allows simultaneous execution
to occur a useful proportion of the time, turns out to be really quite complex.
Operand dependencies between instructions restrict which instructions
can execute at the same time. Instructions which can take more than
one cycle to execute (for example, multiplications, memory accesses
and floating point operations) make things far more complex to
organise since they can affect scheduling of instructions in more than
one way: resolving a read-after-write dependency requires stalling
the second instruction until the first has produced its result, but a
write-after-read hazard requires only that the first instruction read
its operand before the second overwrites it.
There are a multitude of issues to contend with, so much so that
attempting to arrange the control of a superscalar processor in an
ad-hoc manner is fraught with danger: it is almost certain to result in a
subtly buggy implementation that is difficult to program efficiently, and
so is largely doomed to failure in the commercial world.
Tomasulo's Algorithm provides a robust, correct algorithm for the
superscalar execution of a sequential instruction stream, which can be
realistically implemented in hardware, and which can expose a
reasonable and useful amount of instruction level parallelism from a
program.
The Algorithm
The basis of the algorithm is, in a quick summary form: instructions
are "issued" (a process which also involves register renaming) to
"reservation stations" attached to functional units, in program order;
the reservation stations decide when to execute the instructions based
on the availability of their operands, and broadcast the results of
the operations to all the other functional units as they are
calculated.
So, let's follow the life of an instruction through the stages of a
typical Tomasulo's processor. Note that each "stage" here may
correspond to more than one pipeline stage:
- Fetch
Just like in a simple scalar processor, instructions are fetched
from instruction memory. Unlike a simple scalar processor,
instructions are likely to be fetched in batches of two or more at a
time.
- Issue
This is where things get interesting. In the issue stage, the
processor works out which reservation station to issue the
instruction to, based on what type of functional unit it requires, and
the availability of spaces in the reservation stations.
Instructions can be issued to reservation stations regardless of
whether or not their operands are available. If an operand isn't
available to be read from the register file immediately, this must
be because the value associated with that register has not yet been
calculated. If this is the case, then it will be written with the
result of an instruction which has already been issued, and is
therefore assigned to a reservation station.
Hence, as the issue unit
issues the instruction to its reservation station, it renames any references to outstanding registers with an
identifier tag which indicates which functional unit will produce the
result, and which virtual register identifier the result will be
identified as. The issue unit also renames any result registers
associated with the new instruction to a virtual register identifier
so that it can tell subsequent instructions where to find the results
of this instruction.
- Execute
The execution stage of a processor implementing Tomasulo's
algorithm consists of a number of functional units, each with their
own reservation station. The reservation station holds a small number
of instructions which have been issued, but cannot yet be
executed. When an instruction becomes ready to be executed because of
operand values becoming available (by being broadcast on the common
data bus... more on that later), and the functional unit is ready to
accept a new instruction, the reservation station passes the
instruction to the functional unit, where the instruction's real
execution takes place: arithmetic operations are calculated, memory is
accessed. You know, the real work.
- Writeback
The final stage an instruction goes through is writeback. This is
similar in many ways to a simple pipelined machine: when the result of
an instruction has been calculated, the value is driven on one of a
number of common data buses,
to be sent to the register file. Much like a bypass, this bus
is also monitored by other parts of the machine so that the value may
be used immediately by waiting instructions.
Example
Let's try a small example now to see how this works in practice. We'll
use a very simple bit of code, which computes the factorial of its input:
int factorial(int x) {
    int f = 1;
loop:
    f = f * x;
    x = x - 1;
    if (x >= 2) goto loop;
    return f;
}
The above is C code, but extremely simple C, such that each statement will likely correspond to a single machine instruction, much like GIMPLE.
For the purposes of keeping the example simple, we'll assume that our processor has the bare minimum set of functional units necessary to compute this function: a multiplier, and an arithmetic unit capable of subtraction and comparison (which is, after all, just subtraction). We'll assume a realistic set of times for these units, too: the arithmetic unit will take a single cycle, whereas the multiplier will take three (which is a reasonable figure for most modern processors). We'll also assume, to keep things nice and simple, that we have an ideal branch predictor which will always fetch the correct instruction. To keep things nice and brief, we'll set the initial value of 'x' to 3, implying two iterations of the loop.
First, let's try executing this on a processor using a simple scalar control model, with the execution units grouped together into a single 'Execute' stage. The pipeline looks a little like this:
Fetch -> Issue -> Execute -> Writeback
and so as we execute the code, the following instructions are in each pipeline stage in each cycle:
Cycle  Issue unit            Execute unit
1      f=1;
2      f=f*x;                f=1;
3      x=x-1;                f=1*3; {1}
4      x=x-1;                f=1*3; {2}
5      x=x-1;                f=1*3; {3}
6      if (x>=2) goto loop;  x=3-1;
7      f=f*x;                if (2>=2) goto loop;
8      x=x-1;                f=3*2; {1}
9      x=x-1;                f=3*2; {2}
10     x=x-1;                f=3*2; {3}
11     if (x>=2) goto loop;  x=2-1;
12     return f;             if (1>=2) goto loop;
13                           return 6;
So, we can see that it took 13 cycles to compute that factorial. During these 13 cycles, the multiplier is busy less than half the time, despite holding up following instructions for around a third of them. Let's try Tomasulo's algorithm instead. To make it a fair comparison, we'll keep the same fetch unit, and so we'll still issue only a single instruction per cycle. We rearrange the pipeline slightly to look like this:
.-> Arithmetic -.
Fetch -> Issue -| |-> Writeback
'-> Multiplier -'
Note that we haven't added any new execution resources: we've just rearranged the pipeline around the execution units that were there already. Using Tomasulo's Algorithm, our new pipeline will execute the factorial code as follows:
Cycle  Issue unit            Arithmetic unit       Multiplier   Writeback
1      f=1;
2      f=f*x;                f=1;
3      x=x-1;                                      f=1*3; {1}   f=1;
4      if (x>=2) goto loop;  x=3-1;                f=1*3; {2}
5      f=f*x;                if (2>=2) goto loop;  f=1*3; {3}   x=2;
6      x=x-1;                                      f=3*2; {1}   f=3;
7      if (x>=2) goto loop;  x=2-1;                f=3*2; {2}
8      return f;             if (1>=2) goto loop;  f=3*2; {3}   x=1;
9                            return 6;                          f=6;
That's a lot more like it. We've taken only 9 cycles (about 70% of the original time) to compute the result, and we've kept the multiplier busy two-thirds of the time rather than less than half. In fact, when we look at the inner loop in this calculation, we can see that it takes only 3 cycles, which is the theoretical limit for this multiplier since it takes 3 cycles to perform its multiplication. Our 9 cycles are made up of 2 iterations at 3 cycles each, and 3 cycles of pipeline overhead. For higher values of x, we'll approach the theoretical limit of 3 cycles per iteration. That's not bad going.
This only happens because the example was pretty well balanced: each iteration requires 3 instructions from the fetch unit, and 3 cycles of time in the multiplier. As such, it doesn't illustrate the use of register renaming or reservation stations: every cycle, the instruction issued goes directly to the appropriate execution unit, and doesn't spend any time waiting in a reservation station. So, let's modify our processor by adding a faster multiplier unit that takes only two cycles, and giving our fetch unit the ability to fetch two instructions per cycle. We'll give the arithmetic unit and the multiplier two reservation stations each, which we'll denote as 'ra0' and 'ra1', and 'rm0' and 'rm1' respectively, and we'll show these explicitly on each cycle.
As values are broadcast on the common data bus, we'll substitute those values into the code in the reservation stations to indicate that the values are now held in the reservation station ready for execution.
Cycle  Registers  Issue unit            Arithmetic unit               Multiply unit     Writeback
1                 f=1;                  ra0:                          rm0:
                  f=f*x;                ra1:                          rm1:
                                        exe:                          exe:
2      f=rm0      x=x-1;                ra0: f=1;                     rm0: f=ra0:f*3;
                  if (x>=2) goto loop;  ra1:                          rm1:
                                        exe: f=1;                     exe:
3      f=rm0,     f=f*x;                ra0: x=3-1;                   rm0: f=1*3;       ra0:f=1;
       x=ra0      x=x-1;                ra1: if (ra0:x>=2) goto loop; rm1:
                                        exe: x=3-1;                   exe: f=1*3; {1}
4      f=rm1,     if (x>=2) goto loop;  ra0: x=2-1;                   rm0: f=1*3;       ra0:x=2;
       x=ra0      return f;             ra1: if (2>=2) goto loop;     rm1: f=rm0:f*2;
                                        exe: if (2>=2) goto loop;     exe: f=1*3; {2}
5      f=rm1,     if (x>=2) goto loop;  ra0: x=2-1;                   rm0:              rm0:f=3;
       x=ra0      return f;             ra1: goto loop;               rm1: f=3*2;
                                        exe: x=2-1;                   exe: f=3*2; {1}
6      f=rm1,                           ra0: if (1>=2) goto loop;     rm0:              ra0:x=1;
       x=ra0                            ra1: return f;                rm1: f=3*2;
                                        exe: if (1>=2) goto loop;     exe: f=3*2; {2}
7      f=rm1                            ra0:                          rm0:              rm1:f=6;
                                        ra1: return f;                rm1:
                                        exe: return f;                exe:
Job done, in approximately 7 cycles. A few observations:
- There are a fair few cycles there in which the issue logic is shown as empty. In the later cycles, these could potentially be busy issuing the next few instructions.
- We have a few fewer cycles, but not that many fewer. Not the 1/3 reduction we'd expect to see given that we've increased the multiplier throughput by that much, and certainly not the 50% reduction that we might hope for given that we've increased the fetch and issue width by that much. We are still limited to an aggregate throughput of 1 instruction per cycle because there's still only a single common data bus.
- Tracing execution like this, it looks very, very complex.
The third point there is a rather important one: Tomasulo's Algorithm gives us a fairly straightforward way to describe and implement some very, very complex behaviour. But implementations can still be very complicated.
Other issues that we can probably see from here are branch prediction and exception precision. I've used an idealised branch prediction algorithm which happens to get the correct answer every time and manages to keep the issue unit fully supplied with instructions. Branch prediction becomes incredibly important as issue widths go up: every branch misprediction causes inefficiency proportional to the issue width, since that number of instructions could potentially be issued in every cycle wasted by the branch misprediction.
Exception precision is a more subtle issue. If we supplement our multiplier with a divider, and allow it to raise an exception on a divide by zero error, how would our processor be able to recover from such an exception? By the time a divide instruction is actually executed, instructions following it may have already been issued to the integer arithmetic unit or to other reservation stations in the multiplier / divider. Determining which following instructions to cancel can be difficult, and different implementations take different approaches.
The P6 core, for example, retires instructions in program order after their actual execution, holding un-committed results in a reorder buffer until the instructions are to be committed. Other control schemes may keep instructions in-order in the pipeline, like a scalar machine, until they pass the point at which they may raise exceptions: division by zero, for example, can be detected early, while most memory faults can be detected before the access is attempted.