The Marketing Definition

A superpipelined microprocessor is one that is pipelined to a greater extent than the manufacturer believes is implied by simply describing it as "pipelined".

A "pipelined" microprocessor is merely one in which at least two sequentially-ordered units of the processor are active on different instructions during the same clock cycle. To put it another way, "pipelined" means it has at least two "pipeline stages".
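
To see the overlap concretely, here's a toy sketch in Python (the stage names and "instructions" are invented for illustration, not taken from any real chip) of four instructions flowing through a two-stage pipeline:

    # Toy two-stage pipeline; stage names and "instructions" are made up.
    instructions = ["i0", "i1", "i2", "i3"]
    STAGES = ["fetch", "execute"]

    for cycle in range(len(instructions) + len(STAGES) - 1):
        active = []
        for depth, stage in enumerate(STAGES):
            i = cycle - depth                # which instruction is in this stage
            if 0 <= i < len(instructions):
                active.append(f"{stage}:{instructions[i]}")
        print(f"cycle {cycle}:  " + "   ".join(active))

Cycle 1 prints "fetch:i1   execute:i0" -- two sequentially-ordered units busy with different instructions at once, which is all that "pipelined" actually promises.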

That's not a terribly impressive thing for chip shops to say, particularly once we'd been used to three- or five-stage pipelined microprocessors for a few years already. Continuing to describe their newer seven- and nine-stage microprocessors as simply "pipelined" sells them short. It's a lot like listing "wheels" as a selling point for a car.

Cue the introduction of the prefix "super"! It simply indicates that the extent to which a processor is pipelined is greater than you'd expect from the word "pipelined".

The Technical Definition

This isn't quite a separate definition; it's a way of looking at the same thing from a technical point of view: more pipelined than simply "pipelined", with the cut-off point generally taken to be whether some logically singular function (an arithmetic operation, a memory or cache access, and so on) is carried out over more than a single pipeline stage.

Because cache access is generally very time-consuming in comparison with plain old integer arithmetic, the determining feature of a "superpipelined" processor is usually how many pipeline stages it takes to access the cache. Consider the following two processor pipelines:

                +-------------------------------+
                | Instruction cache access      |
                +-------------------------------+
                               |
                               V
                +-------------------------------+
                | Instruction decode            |
                +-------------------------------+
                               |             +-----+ 
                               V             V     |
                +-------------------------------+  |
                | Execute / Address Calculation |--+
                +-------------------------------+  |
                               |                   |
                               V                   |
                +-------------------------------+  |
                | Data cache access             |--+
                +-------------------------------+

This looks a lot like a StrongARM or an ARM9, or something of that time period. Data cache and instruction cache accesses each occupy a single pipeline stage and take a single clock cycle, so our marketing department wouldn't really get away with calling it "superpipelined".

Also, the limiting factor on our processor's clock period is likely to be the cache access, since we've allowed only a single cycle in which to do that costly job. So, let's superpipeline the design!
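
Before drawing the new pipeline, a quick back-of-the-envelope sketch (in Python, with invented stage delays; real numbers vary wildly with process and design) shows roughly what splitting each cache access in two could buy:

    # Hypothetical stage delays in nanoseconds -- invented for illustration.
    classic = {"I-cache access": 2.0, "decode": 1.0,
               "execute": 1.3, "D-cache access": 2.0}
    period = max(classic.values())        # the clock must cover the slowest stage
    print(f"classic: {1e3 / period:.0f} MHz")          # 500 MHz

    # Split each cache access into two stages of roughly half the work,
    # plus a little overhead for the extra pipeline latch.
    split = {"I-cache decode/tag": 1.1, "I-cache read": 1.1,
             "decode": 1.0, "execute": 1.3,
             "D-cache decode/tag": 1.1, "D-cache read": 1.1}
    period = max(split.values())          # now "execute" is the bottleneck
    print(f"superpipelined: {1e3 / period:.0f} MHz")   # ~769 MHz, ~54% faster

The resulting pipeline looks like this: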

                +-------------------------------+
                | Instruction cache decode, tag |
                +-------------------------------+
                               |
                               V
                +-------------------------------+
                | Instruction cache read        |
                +-------------------------------+
                               |              +----+
                               V              V    |
                +-------------------------------+  |
                | Instruction decode            |  |
                +-------------------------------+  |
                               |                   |
                               V                   |
                +-------------------------------+  |
                | Execute / Address Calculation |--+
                +-------------------------------+  |
                               |                   |
                               V                   |
                +-------------------------------+  |
                | Data cache decode, tag        |  |
                +-------------------------------+  |
                               |                   |
                               V                   |
                +-------------------------------+  |
                | Data cache read/write         |--+
                +-------------------------------+

Now we've split the address decode and access parts of our cache access into separate pipeline stages, so we have about half the work to do in each of those stages. More likely than not, we can now clock the processor a good 50% faster, and we can happily report to the marketing department that they can slap a big, impressive "superpipelined" all over the brochures for the chip.

Clocking it faster will up the power consumption a bit, which is bad, but we can mitigate that by gating access to the cache ways in the "read/write" stages: by the beginning of the access we will have completed the tag comparison, so we know which cache way, if any, contains the data. In a 4-way set-associative cache, that means we only have to drive a quarter of the cache, saving a significant amount of power.
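
Here's a small Python model of that way-gating trick (sizes and names invented for illustration): the tag stage compares all four tags, and the read stage drives only the way that matched:

    # Toy 4-way set-associative cache split across the two pipeline stages.
    # Sizes and names are invented for illustration.
    NUM_SETS, NUM_WAYS, LINE_BYTES = 64, 4, 32

    # tags[set][way] and data[set][way] stand in for the tag and data RAMs.
    tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
    data = [[bytes(LINE_BYTES) for _ in range(NUM_WAYS)] for _ in range(NUM_SETS)]

    def cache_decode_tag(addr):
        """The 'decode, tag' stage: pick the set, compare all four tags."""
        index = (addr // LINE_BYTES) % NUM_SETS
        tag = addr // (LINE_BYTES * NUM_SETS)
        for way in range(NUM_WAYS):
            if tags[index][way] == tag:
                return index, way            # hit: remember which way matched
        return index, None                   # miss

    def cache_read(index, way):
        """The 'read' stage: drive only the matching way's data RAM;
        the other three ways stay idle, which is where the power saving comes from."""
        if way is None:
            return None                      # miss: off to memory we go
        return data[index][way]

    index, way = cache_decode_tag(0x1234)    # stage one, this cycle
    line = cache_read(index, way)            # stage two, next cycle

In a conventional single-cycle access there's no time for that: all four data ways have to be read in parallel with the tags, and the right one selected at the end.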

Everybody wins.

Oh, unless you're an assembly language programmer or a compiler, in which case you're probably upset at having to rewrite critical sections of your code, because the new chip has a longer load-to-use latency on memory accesses than the old one did.
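
A sketch of that problem, again in Python (the three-field "instruction" format is invented for this sketch, not any real ISA): count the stall cycles a sequence suffers for a given load-to-use latency, before and after the sort of rescheduling the compiler now has to do:

    # Toy stall counter: instructions are (op, dest, srcs) tuples.
    def stalls(program, load_latency):
        ready = {}                   # register -> cycle its value becomes ready
        cycle = total = 0
        for op, dest, srcs in program:
            start = max([cycle] + [ready.get(r, 0) for r in srcs])
            total += start - cycle   # cycles spent waiting for a load result
            cycle = start + 1
            ready[dest] = cycle + (load_latency - 1 if op == "load" else 0)
        return total

    naive = [("load", "r0", ["r9"]), ("add", "r1", ["r0", "r2"]),
             ("load", "r3", ["r9"]), ("add", "r4", ["r3", "r2"])]
    scheduled = [("load", "r0", ["r9"]), ("load", "r3", ["r9"]),
                 ("add", "r1", ["r0", "r2"]), ("add", "r4", ["r3", "r2"])]

    print(stalls(naive, 1), stalls(scheduled, 1))   # old chip:  0 0
    print(stalls(naive, 2), stalls(scheduled, 2))   # new chip:  2 0

On the old chip both orderings run stall-free; on the new one, the naive ordering eats a stall after every load until someone finds it an independent instruction to chew on in the meantime.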
