Is it considered nerdy to have a favourite functional unit in a microprocessor? Well, you can call me whatever names you like, but for my money, I'd have to pick the barrel shifter as being my favourite.
Where other functional units calculate such dull and banal things as
x|y, the barrel shifter gets the exciting task of calculating the results of shift operations like
x>>y (or, if you prefer, x/2^y), as well as the somewhat more exotic and seldom-seen
(x>>y) | (x<<(8*sizeof(int)-y)) 'rotate' operation. These operations make it possible for us to do multiplication, division, cyclic redundancy checks, all manner of cryptography things, floating point arithmetic, and to store Boolean true/false values in single bits within words; and all of the myriad algorithms that depend on the ability to do these things. A barrel shifter can perform these operations in a stateless, combinatorial manner that can easily complete in a single cycle even on aggressively clocked modern microprocessors.
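As an aside, the rotate expression above needs a little care when written in real C, because shifting a value by its full width is undefined behaviour. Here's a minimal sketch of a 32-bit rotate-right (the name `rotr32` is just mine for illustration), with the shift amount masked to avoid that trap:

```c
#include <assert.h>
#include <stdint.h>

/* A sketch of a 32-bit rotate-right in portable C. The naive expression
   (x >> y) | (x << (32 - y)) invokes undefined behaviour when y == 0,
   because shifting a 32-bit value by 32 places is not defined in C;
   masking both shift amounts into the range 0..31 avoids this. */
static uint32_t rotr32(uint32_t x, unsigned y)
{
    y &= 31;                              /* keep the shift amount in 0..31 */
    return (x >> y) | (x << ((32 - y) & 31));
}
```

Compilers are generally good at recognising this idiom and emitting a single rotate instruction where the target has one.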
Of course, shifts can be calculated by a good old-fashioned shift register, taking one cycle for every bit position shifted. And for many algorithms (notably multiplication) this is perfectly fine, because the multiplier will be used at each shift position in succession, and applying a little bit of strength reduction, each successive position can be calculated using only a single-position shift from the previously shifted value of the multiplier.
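The multiplication case can be sketched in a few lines: each step derives the next shifted multiplicand from the previous one with a single-position shift, which is exactly what a one-bit-per-cycle shift register gives you for free.

```c
#include <assert.h>
#include <stdint.h>

/* A sketch of shift-and-add multiplication using only single-position
   shifts. Each iteration derives the next shifted multiplicand from the
   previous one (the strength reduction described above), and adds it
   into the product when the corresponding multiplier bit is set. */
static uint32_t shift_add_mul(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)
            product += a;   /* this bit of the multiplier is set */
        a <<= 1;            /* single-position shift per step */
        b >>= 1;            /* consume one multiplier bit */
    }
    return product;
}
```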
So for many applications, a barrel shifter doesn't make things much quicker; but barrel shifters are large structures to put onto silicon; far larger than a simple adder or shift register. Hence many early processors, for example all Intel x86 processors before the 386, had simple shift registers instead of barrel shifters.
Almost all modern processors since the i386 have included barrel shifters, with the exception of some embedded processors, and rather more bizarrely the Willamette and Northwood generations of Pentium 4s. Barrel shifters are also to be found as components in the normalisation stages of FPUs.
So, what does a barrel shifter actually look like, and how do we build one?
Well, to implement a shifter, we must have the ability to move any bit in the input word to any position in the output word. So, if you have any grounding at all in electronics, you're probably already thinking to yourself "Well, we could implement it with a bundle of multiplexors, one for each output bit, each with an input for every input bit, and arrange the connections between input and output so that the control inputs of all the multiplexors can be the same, sharing the decode."
If this is what you're thinking, you may well have followed that up with "Ah, but for an n-bit input/output word, this requires of the order of n² gates, which is a hell of a lot of gates for a 32-bit or, heaven help us, 64-bit input word" and abandoned your line of reasoning in the expectation that you're about to be told not to be so silly, and that a barrel shifter implementation is actually a lot smaller and more efficient than that.
Except of course, you would have been absolutely right. A barrel shifter is essentially that: an array of multiplexors, with a size (in terms of gates and silicon area) related to the product of the number of bits in the input and output words. Additional logic on the inputs and outputs extends this basic functionality to implement the full range of shifts provided by a processor's instruction set by masking off low or high bits in the output, sign-extending the result, and selecting between left and right shifts.
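The multiplexor-array view can be modelled in software; the function below is a bit-level sketch (illustrative only, not how you'd actually shift in C) of a 32-bit logical right shift, where every output bit has its own "multiplexor" selecting a source bit, all sharing the same select value:

```c
#include <assert.h>
#include <stdint.h>

/* A bit-level model of the multiplexor-array view of a barrel shifter:
   for a logical right shift by s, output bit i is driven by a
   multiplexor selecting input bit i + s, with all n multiplexors
   sharing the same select lines. n muxes of up to n inputs each is
   where the O(n^2) gate count comes from. */
#define N 32

static uint32_t mux_array_shr(uint32_t x, unsigned s)
{
    uint32_t out = 0;
    for (unsigned i = 0; i < N; i++) {                 /* one mux per output bit */
        unsigned src = i + s;                          /* shared select value */
        unsigned bit = (src < N) ? ((x >> src) & 1u)   /* pick the source bit */
                                 : 0u;                 /* mask off high bits */
        out |= (uint32_t)bit << i;
    }
    return out;
}
```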
We can see why barrel shifters are expensive in terms of silicon area, and why they were omitted from early microprocessors where silicon area was at a premium. It's also obvious why barrel shifters are fast: the longest signal path through the shifter only goes through a single multiplexor, albeit one with a rather high fan-in.
And it's incredibly lucky that barrel shifters happen to be so quick, because it's not feasible to pipeline a barrel shifter: in order to split one across more than one pipeline stage, the internal state which would have to be registered between stages is of the order of n² bits, which for a 32-bit word width means something like a kilobit of register state.
So, overall, a barrel shifter is O(n²) in terms of silicon area, and approximately O(1) in terms of time.
If you're thinking ahead, you're probably thinking that O(n²) is rather awful, and that there must, surely, be a better way. If you're really on the ball, in fact, you've probably come up with the idea of a logarithmic shifter. A logarithmic shifter performs the same function as a barrel shifter, but in a slightly different manner, using fewer logic gates and less silicon area.
A logarithmic shifter works by successively shifting (or not, depending on a bit of the shift index) the input by powers of two, such that the result, the composition of these successive shifts, is shifted overall by the value of the shift index: the first stage shifts by 1 place if bit 0 of the index is set; the next by 2 places if bit 1 is set; the next by 4 places if bit 2 is set; and so on until the last stage, which shifts by n/2 places if bit (log₂n)−1 of the index is set. This can be implemented very simply with 2-to-1 multiplexors picking their inputs from the relevant bits of the input.
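Modelled in C, a 32-bit logarithmic right shifter is just five conditional fixed-distance shifts, one per index bit:

```c
#include <assert.h>
#include <stdint.h>

/* A sketch of a 32-bit logarithmic right shifter: five stages, each a
   row of 2-to-1 multiplexors, conditionally shifting by 1, 2, 4, 8 and
   16 places according to successive bits of the shift index. The total
   shift is the sum of the enabled stages, i.e. the index itself. */
static uint32_t log_shr32(uint32_t x, unsigned s)
{
    for (unsigned stage = 0; stage < 5; stage++) {
        if (s & (1u << stage))        /* bit k of the index... */
            x >>= (1u << stage);      /* ...enables a shift by 2^k places */
    }
    return x;
}
```

Each `if` here stands in for one row of multiplexors; in hardware all five rows exist unconditionally and the index bit merely selects which input each row passes through.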
Since there are log2n stages in this circuit, each proportional in size to the input word width, the number of gates is O(n log n), far smaller than a barrel shifter. However, because the stages operate successively, the propagation delay overall is O(log n). For a 32-bit shifter, this implies at least 5 gate delays before any additional logic is added, substantially slower than a barrel shifter, and possibly making the shifter slow enough overall that it might violate timing on a fast microprocessor.
Logarithmic shifters are, however, easy to pipeline, since the intermediate state at any given point in the shifter is only n bits. If a microprocessor's pipeline structure can support early completion of results, a simple zero-detection operating in the first pipeline stage of the shifter can indicate whether the shifter's result has completed, or whether further activity is needed; thus, by performing the most commonly used shifts (or the components thereof) first, a logarithmic shifter can operate, in practice, almost as quickly as a barrel shifter.
I'm only speculating, but I suspect it's an arrangement like this that was used in the early Intel Pentium 4 processors, where shifts could allegedly take a variable number of cycles to complete, depending on the index.
It can be observed that using 4-to-1 multiplexors instead of 2-to-1 multiplexors will create a shifter that consumes two bits of the shift index per stage, resulting in a faster but slightly larger circuit. Applying this inductively, we can observe that this forms a series of shifter implementations: at one end, the logarithmic shifter with 2-to-1 multiplexors on each of log₂n stages; at the other, the full barrel shifter with a single stage of n-to-1 multiplexors. These can be freely mixed and matched, too. For example, adding a single logarithmic shift stage can halve the number of connections required in a barrel shift stage, for only a comparatively small delay penalty.
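To make that middle ground concrete, here's a sketch of a radix-4 arrangement for a 32-bit right shifter (three stages instead of five, with the last stage only needing one index bit since 32 = 4 × 4 × 2):

```c
#include <assert.h>
#include <stdint.h>

/* A sketch of the radix-4 middle ground: each stage is a row of 4-to-1
   multiplexors consuming two bits of the shift index, so a 32-bit
   right shifter needs only three stages rather than five. */
static uint32_t radix4_shr32(uint32_t x, unsigned s)
{
    x >>= (s & 3);             /* stage 1: index bits 1..0, shift 0..3      */
    x >>= ((s >> 2) & 3) * 4;  /* stage 2: index bits 3..2, shift 0/4/8/12  */
    x >>= ((s >> 4) & 1) * 16; /* stage 3: index bit 4, shift 0 or 16       */
    return x;
}
```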
No, I don't know where the name "barrel shifter" comes from, actually. Perhaps the structure somehow resembles a barrel, in the eyes of the original inventor. Perhaps it's simply because, as a functional unit, it's more fun than a barrel of monkeys. My best guess is that it's related to the fact that on the surface of a barrel (or any cylinder), rotating around its axis, everything moves with the same rotational velocity at every point, simultaneously, and this property is more or less shared with a barrel shifter.