Problems began to arise with VLIW, EPIC, and VelociTI architectures when compilers were not able to find enough instruction-level parallelism anywhere but tight loops. Solution: If you can't find enough ILP in one thread, run more threads.

Most programs severely underutilize a CPU's functional units because of data and control hazards and memory latency. Some architectures maintain more than one execution state (program counter, page table, and register contents) and steal instructions from other threads whenever there's a pipeline bubble (idle functional units, branch delay, RAM latency, and so on). Surprisingly, Rollo's writeup in branch prediction has a shred of truth: you can fill a load's or branch's delay slots with instructions from another process.
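To make the bubble-filling idea concrete, here is a minimal sketch in Python. It is a toy model, not a description of any real chip: the pipeline issues one instruction per cycle, each thread is just a list of made-up latencies, and the function name run is hypothetical. Because it is single-issue, it is closer to switching threads on a stall than to full SMT, but it shows how a second thread soaks up cycles the first one would waste.

```python
def run(threads):
    """Simulate a single-issue pipeline shared by one or more threads.

    Each thread is a list of latencies (an assumption of this toy model):
    after issuing an instruction with latency k, that thread cannot issue
    again for k cycles. Returns (total_cycles, bubbles).
    """
    pc = [0] * len(threads)        # one program counter per thread
    ready_at = [0] * len(threads)  # first cycle each thread may issue again
    cycle = 0
    bubbles = 0
    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                # Issue the next instruction from the first ready thread.
                ready_at[t] = cycle + threads[t][pc[t]]
                pc[t] += 1
                break
        else:
            # No thread was ready: this issue slot is a pipeline bubble.
            bubbles += 1
        cycle += 1
    return cycle, bubbles


# Latency 1 = simple ALU op, latency 3 = a load that stalls its own thread.
program = [1, 3, 1, 3]
print(run([program]))           # one thread alone
print(run([program, program]))  # two threads sharing the pipeline
```

With these made-up latencies, the lone thread leaves two empty slots while it waits on its first load, and the two-thread run fills them with the other thread's instructions.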

Compaq's planned Alpha 21464 (EV8) was designed to do this. IBM's Power4 processor does something similar called "chip multiprocessing": it puts two complete cores on one die, which is a bit simpler but involves static (not dynamic) allocation of functional units to processes.

Simultaneous Multithreading (SMT) is another evolution in processor architecture that allows the CPU to process a greater number of instructions per clock cycle.

Out-of-order processors can execute instructions in whatever order their operands become ready, with some instructions running in parallel. But only one process or thread can run at a time, and a pipeline flush has to occur to switch to another thread. Often the hardware can find only a small amount of parallelism in a single thread.

SMT combines out-of-order processing with the ability to run multiple processes or threads "at the same time." Since the threads don't use the same registers or memory space(*), the processor can run many more instructions in parallel than it could with a single thread: there are no dependencies between instructions from different threads. The additional complexity added to the processor is not trivial, but the performance increase can be very large.
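A rough way to see where the gain comes from, as a sketch under loud assumptions: suppose the core can issue up to width instructions per cycle, and each thread can supply at most ilp independent instructions per cycle because of its own internal dependencies. The function cycles_to_finish and the (instruction_count, ilp) pairs are invented for illustration; real schedulers are far more involved.

```python
def cycles_to_finish(threads, width):
    """Count cycles for a toy superscalar core to drain all threads.

    threads: list of (instruction_count, ilp) pairs, where ilp is the most
             independent instructions that thread can offer per cycle
             (an assumption of this model, not a measured quantity).
    width:   issue slots available per cycle, shared by all threads.
    """
    remaining = [count for count, _ in threads]
    cycles = 0
    while any(remaining):
        slots = width
        for i, (_, ilp) in enumerate(threads):
            # A thread is limited by its own ILP and by the slots left over.
            issued = min(remaining[i], ilp, slots)
            remaining[i] -= issued
            slots -= issued
        cycles += 1
    return cycles


one = cycles_to_finish([(100, 2)], width=4)
two = cycles_to_finish([(100, 2), (100, 2)], width=4)
print(one, two)
```

In this model a single thread with an ILP of 2 leaves half of a 4-wide core idle every cycle, while two such threads finish twice the work in the same number of cycles, because instructions from different threads never depend on each other.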

Intel's version of SMT is called Hyperthreading, and it can run 2 threads or processes simultaneously. The processor keeps both threads' instructions in the same buffers, giving each process its own register set and a few other separate buffers.

The original P4 contained a full SMT implementation that actually worked, but in a few corner cases it slowed the entire processor to a crawl. Intel decided to release the processor with "Hyperthreading" turned off until they fixed the performance issues bogging down these exceptional cases.

(*) The SMT architecture appears as two separate processors to the operating system, so it's like having two processing units, two sets of registers, two memory spaces, etc. The reality is that they are sharing the same pool of registers and processing units. So program A's register 1 is not the same register as program B's register 1. This means that any instruction from program A and any instruction from program B can be run simultaneously even if they use the 'same' architectural registers (with some minor exceptions regarding locks, memory access, etc.). This is not as difficult as it may seem, since out-of-order processing, used in processors for many years, renames the registers as they come into the processor to a shared pool of registers.
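The renaming trick in the footnote can be sketched in a few lines of Python. Everything here is hypothetical and simplified: the Renamer class, its method names, and the register names are made up for illustration, and physical registers are never freed (real hardware recycles them as instructions retire). The point is only that each thread gets its own map from architectural names to a shared physical pool.

```python
class Renamer:
    """Toy rename stage: one map table per thread, one shared physical pool."""

    def __init__(self, num_threads, num_physical):
        self.free = list(range(num_physical))         # shared physical registers
        self.maps = [{} for _ in range(num_threads)]  # per thread: arch name -> physical

    def _alloc(self):
        if not self.free:
            # Real hardware would stall the front end until a register frees up.
            raise RuntimeError("out of physical registers")
        return self.free.pop(0)

    def rename(self, thread, dest, sources):
        """Rename 'dest = op(sources)' for one thread.

        Returns (phys_dest, phys_sources).
        """
        table = self.maps[thread]
        phys_sources = []
        for s in sources:
            if s not in table:
                # First use of this name: give the thread's initial value a home.
                table[s] = self._alloc()
            phys_sources.append(table[s])
        # Every write gets a fresh physical register.
        phys_dest = self._alloc()
        table[dest] = phys_dest
        return phys_dest, phys_sources


r = Renamer(num_threads=2, num_physical=8)
a = r.rename(0, "r1", [])      # program A writes r1
b = r.rename(1, "r1", [])      # program B writes "the same" r1
c = r.rename(0, "r2", ["r1"])  # A reads its own r1
d = r.rename(1, "r2", ["r1"])  # B reads its own r1
print(a, b, c, d)
```

A's r1 and B's r1 end up in different physical registers, so after renaming the execution units see no conflict between the two programs.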
