The Intel XScale micro-architecture is a high-speed (700MHz and upwards), full-custom implementation of the ARM V5 instruction set, including the Thumb compressed instruction set. It's an evolution of the DEC StrongARM, originally slated as 'StrongARM II'.
The micro-architecture is in-order, single-issue, and quite deeply
superpipelined. Cache accesses are each pipelined across 2 cycles
rather than the single-cycle access of the StrongARM and others,
giving a longer load-use penalty and more painful worst-case branch
penalty. A branch target buffer helps with this (so long as branches
are predictable) and allows zero-cost branches in tight loops; a big
plus for DSP.
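As a rough illustration of why the two-cycle cache access lengthens the load-use penalty, here is a minimal sketch. It assumes an idealised in-order, single-issue pipe with full bypassing, where a load computes its address in the ALU stage and the consumer needs the loaded value at the start of its own ALU stage. The function name and the timing model are my own assumptions, not anything taken from the Intel document.

```python
def load_use_stalls(cache_stages: int) -> int:
    """Stall cycles seen by an instruction that uses a loaded value
    immediately after the load (hypothetical idealised model).

    Assumptions: the load does address generation in its ALU stage at
    cycle 0, then spends `cache_stages` cycles in the data cache; the
    result is bypassed straight to the consumer's ALU stage.
    """
    load_alu_cycle = 0
    # The data is available at the end of the last cache stage.
    data_ready_end_of = load_alu_cycle + cache_stages
    # The consumer's ALU stage can start no earlier than the next cycle.
    earliest_consumer_alu = data_ready_end_of + 1
    # Without any dependency the consumer would be one cycle behind the load.
    unstalled_consumer_alu = load_alu_cycle + 1
    return earliest_consumer_alu - unstalled_consumer_alu


if __name__ == "__main__":
    print("single-cycle cache:", load_use_stalls(1), "stall cycle(s)")
    print("two-stage cache:   ", load_use_stalls(2), "stall cycle(s)")
```

Under these assumptions the penalty simply equals the number of cache stages, so pipelining the cache over 2 cycles costs one extra stall per immediately-used load.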
Basic pipeline structure
From the textual descriptions in the Intel technical summary, the pipeline structure looks something like the following:
     Branch target buffer
               ↓
           Icache 1
               ↓
           Icache 2
               ↓
        Register/shift
+----------------------------+
↓                            ↓
Integer ALU                  MAC1
+---------------+            |
↓               ↓            ↓
State/buffer    Dcache 1     MAC2
↓               ↓            ↓
Writeback       Dcache 2     MAC3
                |            |
                |            ↓
                |            MAC4
                +------------+
                             ↓
                       DC Writeback
After the common instruction fetch/decode stages, the pipeline splits into 3 separate pipes: integer arithmetic, load/store, and a multiply-accumulate pipe that implements the extension MAC instructions and the coprocessor interface.
The load/store pipe allows for hit-under-miss operation while cache misses
are serviced by external memory.
The pre-ALU shift, which has become a painful throwback in recent ARM
implementations, is subsumed into the register fetch pipe stage.
There shall now follow some educated guesswork/wild speculation about
implementation details.
The effect of this on instruction timing isn't mentioned by the Intel
technical summary, but it seems fairly safe to assume that the traditional
extra stall cycle for shift-by-register will still be in effect;
shifting by non-trivial constants may or may not require an extra cycle.
The 'State/buffer' stage above is referred to by Intel as 'State Execute',
which is probably equivalent to the 'Buffer' stage in the StrongARM. This
sort of implies that the machine runs in-order up to the boundary
between State Execute/Dcache 1/MAC2 and Writeback/Dcache 2/MAC3;
any memory faults will be determined by the end of Dcache 1, allowing
integer/MAC instructions to be safely aborted before writeback. Beyond
this, the mismatched lengths of the MAC and Dcache pipes imply that result
completion is out-of-order.
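The out-of-order completion can be seen with a toy model of the three pipes. This is only a sketch built from the stage names in the diagram above; it assumes one instruction issues per cycle with no stalls, every load hits the cache, and every MAC-pipe instruction walks all four MAC stages. The names PIPES, completion_cycle and completion_order are hypothetical.

```python
# Stages each pipe passes through after the shared Register/shift stage,
# read off the diagram above (an interpretation, not Intel's own table).
PIPES = {
    "integer": ["Integer ALU", "State/buffer", "Writeback"],
    "load": ["Integer ALU", "Dcache 1", "Dcache 2", "DC Writeback"],
    "mac": ["MAC1", "MAC2", "MAC3", "MAC4", "DC Writeback"],
}


def completion_cycle(issue_cycle, pipe):
    """Cycle in which an instruction occupies its final (writeback) stage,
    given the cycle it spent in Register/shift."""
    return issue_cycle + len(PIPES[pipe])


def completion_order(program):
    """program: list of (name, pipe) pairs, issued in order, one per cycle.

    Returns the instruction names sorted by completion cycle, with ties
    broken by issue order.
    """
    timed = [
        (completion_cycle(issue, pipe), issue, name)
        for issue, (name, pipe) in enumerate(program)
    ]
    return [name for _, _, name in sorted(timed)]


if __name__ == "__main__":
    program = [("mac0", "mac"), ("add1", "integer"), ("ldr2", "load")]
    print(completion_order(program))
```

In this model the integer instruction issued one cycle after the MAC writes back one cycle before it, which is the reordering the pipe lengths imply.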
The DC Writeback is presumably a second (set of) GP reg write port(s)
to match the differences in the lengths of integer, MAC and load/store pipes
without the excessive increase in the number of bypass points that would be
needed if the StrongARM-esque buffer-stage scheme were to be extended to
match the length of the MAC pipe.
Since the DC Writeback stage is shared between the MAC and load/store pipes,
this implies potential contention between these pipes when both want to
return an integer register result; this shouldn't be a performance problem
since the MAC pipe will return integer results comparatively infrequently,
but the arbitration logic may be complex. If not for hit-under-miss
cache operation, it would be possible to perform the arbitration by simply
stalling a load operation at issue if a MAR operation was issued in the
previous cycle.
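To make the contention concrete, here is a small sketch of the simple issue-time stall rule described above. It deliberately assumes the case the rule would work for: every load hits the cache (so no hit-under-miss reordering), and, as a worst case, every MAC-pipe instruction returns a register result through DC Writeback. The latencies come from counting stages in the diagram; all names are hypothetical.

```python
# Cycles from Register/shift to DC Writeback, counted from the diagram.
# Integer instructions use their own Writeback stage and are ignored here.
WB_LATENCY = {"load": 4, "mac": 5}


def issue_cycles(program, stall_load_after_mac):
    """Assign an issue cycle to each pipe name in `program` (in order).

    With `stall_load_after_mac` set, a load is held back one cycle if a
    MAC-pipe instruction issued in the immediately preceding cycle.
    """
    cycles = []
    cycle = 0
    prev = None  # pipe of the instruction issued in the previous cycle
    for pipe in program:
        if stall_load_after_mac and pipe == "load" and prev == "mac":
            cycle += 1  # insert a one-cycle bubble
        cycles.append(cycle)
        cycle += 1
        prev = pipe
    return cycles


def dc_writeback_conflicts(program, cycles):
    """Map each cycle in which more than one instruction wants the
    DC Writeback port to the indices of those instructions."""
    users = {}
    for idx, (pipe, issue) in enumerate(zip(program, cycles)):
        if pipe in WB_LATENCY:
            users.setdefault(issue + WB_LATENCY[pipe], []).append(idx)
    return {c: u for c, u in users.items() if len(u) > 1}


if __name__ == "__main__":
    program = ["mac", "load", "integer", "mac", "mac", "load"]
    for stall in (False, True):
        cycles = issue_cycles(program, stall)
        print("stall rule:", stall, "conflicts:", dc_writeback_conflicts(program, cycles))
```

Once a load can miss and return its data at an arbitrary later cycle, the arrival time at DC Writeback is no longer known at issue, so a fixed rule like this stops being sufficient.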
...and I shall stop right there, as this has already gotten to be too
dull and rambling for anyone other than another comp.arch geek to read.
References:
Intel XScale Microarchitecture Technical Summary, http://developer.intel.com/design/intelxscale/ixm.htm