Pipelining is a method to improve performance used in almost all modern processors of the last 20 years. The normal way to make a processor faster would be to reduce the time one instruction needs for being executed. This may work very well in the beginning, but becomes harder and harder, as one comes nearer to barriers defined by physics. Pipelining does not improve the performance of one instruction, but it increases the throughput of instructions. The secret behind this increasment is, that most of the stages of the execution of instructions do not differ. For every instruction you have to fetch the instruction, decode it, load the register values, ... and finally write the result back:
x86 asm:
add ax, bx
sub ax, bx
fetching instruction, decoding instruction, loading register values and writing result are identical
The same in MIPS asm:
add $s0, $s0, $s1
sub $s0, $s0, $s1
After realizing this, pipelining is a straight-forward approach: Split the instructions, and execute these splitted instructions parallel: You fetch instruction 1, now you go to the next stage of the pipeline, decoding the instruction. Now the fetching part of the CPU is free and it can be filled with fetching the next instruction. This increases performance dramatically, with making the instructions themselves faster: Your CPU needs 10 ns for executing one instruction, so you can execute a new instruction every 10 ns. If you split your CPU into a 5-stage pipeline, every step takes only 2 ns and you finish an instruction every 2 ns instead of 10 ns!
I will show you now a short example for a pipeline. I take the MIPS pipeline as it is really simple: The MIPS pipeline consists of 5 stages: Instruction Fetch(IF), Instruction Decode + Register load(REG), Execute(Ex), Memory Access(Mem), Register write (RW). A filled pipeline may look like this:
         
                    2ns 2ns 2ns 2ns 2ns 2ns 2ns
add $s1, $s1, $s2   IF  REG Ex  Mem RW 
lw $s3, 0($s2)          IF  REG Ex  Mem RW
xor $s4, $s4, $s4           IF  REG Ex  Mem RW
----------------------------------------------> time
A short explanation of this: The first column is the assembler operation, where as the other columns represent time slices. In the first time slice only the instruction fetch of the first operation is executed, while in the second slice two independent things are done (Instruction decode and register load for first operation and instruction fetch for the second operation) and in the third three things are done. Of course such a great method also has some drawbacks: Your instruction set should be chosen carefully. MIPS is a good example for this. Fixed length instructions make fetching and decoding very easy. A bad example is the x86, as instruction lengths vary widely, making a fast instruction fetch and decode very difficult. This means, that you have to slow down the whole pipeline to the speed of the instruction fetch or you have to use a faster, more complicated IF stage.
Another problem occurs if you replace the $s2 in the second instruction with $s1. $s1 is written in the first instruction and the new value does not stand in the register. This is called data hazard You have to stall the second instruction until the first one has finished the last stage in order to get valid results. There are two ways to overcome this stall. You could switch instructions 2 and 3 as they are not connected to reduce the stall. The second method is called forwarding. If you look at the first instruction you may see, that the value is already calculated after the 3th stage. To reduce the stall one can read this value without waiting for it to appear in the register.
The third problem are branches. Branches are used to control the instruction flow, so this problem is called control hazard. Branching means, that jump is preformed depending on the result of a boolean expression:
beq $s1,$s2, equal
This would result in a jump to the label equal, if the values of $s1 and $s2 are equal. If they are unequal the normal instruction order is used. Let's look at an example on how this can lead to problems:
                    2ns 2ns 2ns 2ns 2ns 2ns 2ns
add $s1, $s1, $s2   IF  REG Ex  Mem RW 
beq $s5, $s6, equal     IF  REG Ex  Mem RW
lw $s3, 0($s2)              IF  REG Ex  Mem RW
xor $s4, $s4, $s4               IF  REG Ex  Mem RW
...
equal:
...
----------------------------------------------> time

If $s5 and $s6 are not equal, nothing bad happens, but if they are equal, all the pipelined instructions after the branch are useless, resulting in a lower throughput, as some stages are clear now. One tries to overcome this, by predicting if the branch is taken or not. The easiest way is that you predict: Branches are not taken. Obviously this does not work very well. Take for example the following C for loop:

for(int i = 0; i < 20; i++) {};

The branch for this loop is executed 21 times. 20 times it is taken, 1 time not. That's pretty bad. So let's just assume that every branch is taken! This will not work, too. It dramatically improves performance for this loop, but it surely slows down others. Other methods are needed. One is that one saves how the last branch at this address (normally not exactly this address as only some LSBs are used) was executed. If this branch prediction buffer is large enough, and the looping structure of the program is not too complicated, this works pretty well.

Sources:
Computer Organization & Design: The Hardware / Software Interface, David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, San Francisco, California
Technische Grundlagen der Informatik II: Script, Prof. Dr. Michael Gössel, University of Potsdam
Spim Tutorial: Einführung in die Assemblerprogrammierung mit MIPS, Reinhard Nitzsche, avaiable online
Grundlagen der Digitaltechnik, Prof. Dr.-Ing. Dr.-Ing h.c. Dr. h.c. Hans Martin Lipp, Oldenbourg Verlag, München and Wien
Technische Informatik I, Wolfram Schiffmann and Peter Schmitz, Springer Lehrbuch, Berlin
Spim Documentation, avaiable at http://www.cs.wisc.edu/~larus/spim.html

Log in or register to write something here or to contact authors.