Intro and History
The Intel StrongARM is a family of high-speed microprocessors, targeted at
embedded systems, and implementing v4 of the ARM instruction set. It's
widely used in PDAs running Windows CE, thin clients such as the
NetWinder and some RISC OS machines such as the Acorn RiscPC.
The StrongARM project was announced in early February 1995 as a joint venture
between ARM Limited and Digital Semiconductor, with the goal of combining the
strengths of Digital's Alpha implementations (high performance), and ARM's
own processor implementations (low power consumption).
The first devices with StrongARM cores, the SA-110, were announced almost exactly
a year later in early February 1996, soon followed by the SA-1110, targeted at highly
cost-conscious embedded systems and featuring the same processor core but with a reduced
cache and onboard peripherals including a serial
UART, a memory/PCMCIA controller, an LCD controller, a USB
interface and many general-purpose I/O pins.
In the sale of Digital Corporation to Compaq, the rights to StrongARM were
snatched from Compaq by Intel amid some wrangling over patent infringements,
so the product line is now manufactured and sold as the 'Intel StrongARM' platform,
with continued development of the StrongARM II re-labelled as 'XScale'.
Technical Summary
Its suitability for embedded machines stems from its small die area,
low power consumption and high performance vs. power dissipation ratio. Core clock speeds range
from 160MHz to 233MHz on a 0.35 micron silicon process, with maximum power
dissipation from 300mW to 1000mW;
performance scales linearly with clock speed, from 115 to 268 Dhrystone MIPS.
As would be expected from the target market, no floating point unit is provided,
so Whetstone results depend on the quality of the software floating point implementation,
and are naturally very poor in comparison with any hardware implementation.
All StrongARM devices share a similar StrongARM processor core, so
have common features: the same 5-stage integer pipeline, with its non-pipelined so-called
'fast multiplier' which does 32-bit unsigned multiplication in a maximum of
3 clock cycles. The core architecture requires a Harvard architecture cache system.
The archetypical StrongARM devices, the variants of the SA-110, have
a Harvard architecture cache and memory management system, with a 16KB data cache and
a 16KB instruction cache, both virtually mapped and 32-way set-associative. Address translation
for the two caches is handled by separate instruction and data MMUs, each with a 32-entry TLB
supporting variable page sizes (4KB, 64KB or 1MB). There's also an 8-entry write buffer for writes
to external memory. The external memory bus is 32 bits wide (cf. the Pentium, whose external
bus is 64 bits wide) to keep pin-counts and system costs down.
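To put some numbers on that, here is a rough sketch that derives the cache geometry and the TLB reach from the figures above. The 32-byte cache line size is an assumption (the text does not state it), and the function names are made up for this example.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Return (sets, offset_bits, index_bits, tag_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2, valid for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, offset_bits, index_bits, tag_bits


def tlb_reach(entries, page_bytes):
    """Memory covered when every TLB entry maps a page of the given size."""
    return entries * page_bytes


KB = 1024
MB = 1024 * KB

# 16KB, 32-way set-associative; LINE_BYTES = 32 is an assumption, not from the text.
LINE_BYTES = 32
sets, offset_bits, index_bits, tag_bits = cache_geometry(16 * KB, 32, LINE_BYTES)
print("sets:", sets, "offset bits:", offset_bits,
      "index bits:", index_bits, "tag bits:", tag_bits)

for page in (4 * KB, 64 * KB, 1 * MB):
    print(page // KB, "KB pages ->", tlb_reach(32, page) // KB, "KB reach")
```

Under that line-size assumption each cache has only 16 sets, so just 4 address bits select the set and the rest of the work is done by the 32-way tag compare. A 32-entry TLB covers 128KB with 4KB pages, 2MB with 64KB pages and 32MB with 1MB pages.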
The SA-1100's cache and memory architecture is essentially similar; however, the data cache is
reduced in size to only 8KB (to reduce die area), but is complemented by a direct-mapped 512
byte mini-cache to maintain reasonable performance for typical application code.
Core Microarchitecture
The fun part! Be warned, a background in computer architecture or electronics is probably useful here :)
The StrongARM core is vastly different from all prior ARM implementations:
- Abandons Stephen Furber's beloved 3-stage pipeline and datapath in favour of
a more aggressive, traditional RISC 5-stage pipeline.
- Relies on a cache architecture capable of delivering an instruction AND
performing a data access in a single cycle: a multi-port cache (impractical!)
or a Harvard architecture cache.
- 26-bit mode support severely reduced by starting all exceptions in 32-bit mode
regardless of the previous operating mode. For the Acorn RiscPC's
StrongARM processor card, this meant an updated version of RISC OS had to
be shipped with the cards.
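For anyone who hasn't met 26-bit mode: the older ARMs packed the program counter and the status flags into R15, which is what code written for them tends to rely on. A minimal sketch of unpacking that combined register follows. The function name is hypothetical, and the bit positions are those of the classic 26-bit ARM programmer's model rather than anything stated above.

```python
def unpack_r15_26bit(r15):
    """Split a 26-bit-mode R15 into flags, interrupt disables, PC and mode."""
    return {
        "N": (r15 >> 31) & 1,
        "Z": (r15 >> 30) & 1,
        "C": (r15 >> 29) & 1,
        "V": (r15 >> 28) & 1,
        "I": (r15 >> 27) & 1,      # IRQ disable
        "F": (r15 >> 26) & 1,      # FIQ disable
        "pc": r15 & 0x03FFFFFC,    # bits 25..2: word-aligned address, 64MB range
        "mode": r15 & 0x3,         # 0=USR, 1=FIQ, 2=IRQ, 3=SVC
    }


# Example value: N set, both interrupts disabled, PC = 0x8000, SVC mode.
fields = unpack_r15_26bit(0x8C008003)
print(hex(fields["pc"]), fields["mode"], fields["N"], fields["I"], fields["F"])
```

An exception handler entered in 32-bit mode finds the flags and mode in a separate status register instead, so handlers that expect everything in R15 have to be rewritten.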
Integer Pipeline
The blissful simplicity of the traditional 3-stage ARM pipeline and datapath, and the sometimes
quirky timing model associated with it, were the first things to go in the design of
the StrongARM. It's a completely different machine, built from the ground up. The StrongARM pipeline looks instead like an early MIPS or similar
pipeline, and with it comes a less quirky but more hazardous timing and control model.
The machine is sequential, single-issue and in-order with respect to
instruction issue and completion. All instructions pass through all five pipe stages:
- Fetch: instruction cache access.
- Decode: instruction control decode and register fetch.
- Execute: ALU operations (including pre-ALU shift operations),
and load/store address calculation.
- Buffer: Load/store data cache access; multiplication completion.
- Writeback: Register results are written back to the register file.
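The five stages above can be pictured with a small sketch that prints which stage each instruction occupies in each cycle. It assumes an ideal stream with no stalls or branches, and the instruction names are only labels.

```python
STAGES = ["F", "D", "E", "B", "W"]  # Fetch, Decode, Execute, Buffer, Writeback


def pipeline_diagram(instructions):
    """Return one (name, row) pair per instruction; row[c] is its stage in cycle c."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = ["."] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage   # instruction i enters Fetch in cycle i
        rows.append((name, "".join(row)))
    return rows


for name, row in pipeline_diagram(["ADD", "SUB", "LDR", "MOV"]):
    print(f"{name:4s} {row}")
```

Each row is shifted one cycle to the right of the one above it, so once the pipe is full one instruction completes per cycle.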
Register file and bypasses
The register file (banked in traditional ARM fashion) is designed to match the
maximum requirements of any simple ARM instruction, allowing a sustained
throughput of one instruction per clock where possible, without the complex
control logic and buffering that would otherwise be needed when an instruction
reads more than two registers:
it has three read ports (A and B operands and shift index are the maximum
requirement) and two write ports (for a load instruction that also performs base address writeback).
This design decision is an easy win on the StrongARM, whereas it wasn't on previous
ARM implementations: the StrongARM has a whole clock cycle to perform a read from the
comparatively small register file (27 registers), whereas every previous ARM has had to accommodate
this within the same cycle as the barrel shifter, ALU and register writeback operations.
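As a quick illustration of the port budget, the sketch below checks a few instructions against three read ports and two write ports. The instruction list and the per-instruction counts are hand-written examples for this note, not taken from any manual.

```python
READ_PORTS = 3
WRITE_PORTS = 2

# Hand-counted requirements: (register reads, register writes).
EXAMPLES = {
    "ADD r0, r1, r2":         (2, 1),
    "ADD r0, r1, r2, LSL r3": (3, 1),  # A operand, B operand, shift amount
    "LDR r0, [r1, r2]!":      (2, 2),  # loaded value plus base writeback
}


def fits_in_one_pass(reads, writes):
    """True if one trip through the register file satisfies the instruction."""
    return reads <= READ_PORTS and writes <= WRITE_PORTS


for text, (reads, writes) in EXAMPLES.items():
    print(f"{text:24s} reads={reads} writes={writes} ok={fits_in_one_pass(reads, writes)}")
```

All three fit, which is the point: nothing here needs a second cycle just to get at its registers.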
With the register file being read and written in different pipeline stages,
register bypass paths are needed to deliver the results of previous instructions
to the start of the execution stage. The StrongARM pipeline may produce
only two results in any given cycle: a single result from each of Execute
and Buffer, so to reduce complexity each of these data sources is bypassed
to each of the three register read ports.
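In software terms, each read port behaves roughly like the selection below: take the result from Execute if it targets the register being read, otherwise the one from Buffer, otherwise the register file. This is a sketch of the idea only. The function and parameter names are invented, and the priority order (newest result wins) is the usual forwarding rule, assumed here rather than quoted from the manual.

```python
def read_operand(reg, regfile, execute_result=None, buffer_result=None):
    """Return the freshest value of `reg` seen by one register read port.

    execute_result and buffer_result are (dest_reg, value) pairs produced by
    the instructions currently ahead in the pipe, or None if that stage has
    no register result this cycle.
    """
    # The instruction in Execute is younger than the one in Buffer,
    # so its result must take priority when both write the same register.
    if execute_result is not None and execute_result[0] == reg:
        return execute_result[1]
    if buffer_result is not None and buffer_result[0] == reg:
        return buffer_result[1]
    return regfile[reg]


regfile = {"r1": 10, "r2": 7}
print(read_operand("r1", regfile))                                     # from the register file
print(read_operand("r1", regfile, buffer_result=("r1", 20)))           # bypassed from Buffer
print(read_operand("r1", regfile, ("r1", 30), ("r1", 20)))             # Execute wins
```

With two sources and three ports that makes six bypass paths in all.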
Control Logic
Control logic is fairly simple. A register scoreboard ensures that instructions which
have a dependency on an outstanding register result (from an instruction which is in Execute or stalled
in Buffer) will stall in Decode until the result is available. This stall will cause
the Fetch stage to stall also, but the Execute, Buffer and Writeback stages proceed as normal.
The stall will cause a pipeline bubble (a nop, in essence) to be introduced in the wake of
the stalled instruction. The pipe is inelastic so once introduced, a bubble will also pass
through all stages of the pipe like any other instruction.
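The effect of the scoreboard can be sketched by working out the cycle in which each instruction gets into Execute. The latencies are assumptions for the sake of the example: an ALU result is bypassed to the very next instruction, a load's data arrives one stage later (from Buffer), and every load hits in the cache. The tuple format and function name are made up.

```python
def schedule(program):
    """Return (name, cycle) pairs: the cycle each instruction occupies Execute.

    program: list of (name, dest_reg, source_regs, is_load) tuples.
    """
    ready = {}              # register -> first cycle a consumer may be in Execute
    execute_cycles = []
    cycle = 2               # first instruction: Fetch in 0, Decode in 1, Execute in 2
    for name, dest, sources, is_load in program:
        # Stall in Decode until every source register is available.
        cycle = max([cycle] + [ready.get(r, 0) for r in sources])
        execute_cycles.append((name, cycle))
        if dest is not None:
            ready[dest] = cycle + (2 if is_load else 1)
        cycle += 1          # in-order, single-issue: the next one is at least a cycle behind
    return execute_cycles


program = [
    ("LDR r0, [r1]",   "r0", ["r1"],       True),
    ("ADD r2, r0, r3", "r2", ["r0", "r3"], False),  # needs the loaded r0
    ("SUB r4, r2, r5", "r4", ["r2", "r5"], False),  # ALU result is bypassed
]
for name, cycle in schedule(program):
    print(f"{name:16s} Execute in cycle {cycle}")
```

The LDR is in Execute in cycle 2 but the ADD cannot follow until cycle 4, so a bubble occupies Execute in cycle 3. The SUB follows the ADD directly in cycle 5.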
Stateful control logic is required for multiple-cycle instructions. Because of the
StrongARM's many read/write ports, the only instructions which have to spend more
than one cycle in any datapath pipestage are multiplication, and load/store multiple (LDM/STM)
instructions. These again stall following instructions in Decode (and for multiplies, also in Execute).
Branch instructions calculate the branch target address in Execute, and Fetch will
respond in the next cycle. For taken branch instructions, this means that the instructions previously in
Fetch and Decode are from the untaken path and will be discarded; hence a branch penalty of two cycles.
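A back-of-envelope cycle count shows what that two-cycle penalty costs. This is a simplified model that ignores cache misses and multi-cycle instructions. The function name and the loop figures are invented for the example.

```python
BRANCH_PENALTY = 2    # the instructions in Fetch and Decode are discarded
PIPELINE_DEPTH = 5


def total_cycles(instruction_count, taken_branches, stall_cycles=0):
    """Cycles to run an instruction stream on an otherwise ideal pipeline."""
    fill = PIPELINE_DEPTH - 1   # cycles before the first instruction completes
    return fill + instruction_count + BRANCH_PENALTY * taken_branches + stall_cycles


# A 4-instruction loop ending in a branch, run 100 times:
# 400 instructions, and the branch is taken on all but the last pass.
cycles = total_cycles(400, 99)
print(cycles, "cycles,", round(cycles / 400, 2), "cycles per instruction")
```

For a tight loop like this the branch penalty dominates everything else, which is why short loops are worth unrolling on this core.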
References:
- SA-110 Microprocessor Technical Reference Manual
- http://www.intel.com/design/strong/manuals/278058.htm
- SA-1110 Synopsis
- http://developer.intel.com/design/strong/1110_brf.htm
- Digital Semi and ARM press releases
- http://www.lri.fr/archi/mirror/CIC/otherpr/StrongARM
- http://www.arm.com/news.ns4/iwpList107/9C362E0DF59587858025693100396C45?OpenDocument