Intro and History

The Intel StrongARM is a family of high-speed microprocessors, targeted at embedded systems, and implementing v4 of the ARM instruction set. It's widely used in PDAs running Windows CE, thin clients such as the NetWinder, and some RISC OS machines such as the Acorn RiscPC.

The StrongARM project was announced in early February 1995 as a joint venture between ARM Limited and Digital Semiconductor, with the goal of combining the strengths of Digital's Alpha implementations (high performance) and ARM's own processor implementations (low power consumption).

The first devices with StrongARM cores, the SA-110, were announced almost exactly a year later in early February 1996. They were soon followed by the SA-1110, targeted at highly cost-conscious embedded systems, which features the same processor core but with a reduced cache and onboard peripherals including a serial UART, a memory/PCMCIA controller, an LCD controller, a USB interface and many general-purpose I/O pins.

In the sale of Digital Corporation to Compaq, the rights to StrongARM were snatched from Compaq by Intel amid some wrangling over patent infringements, so the product line is now manufactured and sold as the 'Intel StrongARM' platform, with continued development of the StrongARM II relabelled as 'XScale'.

Technical Summary

The StrongARM's suitability for embedded machines stems from its small die area, low power consumption and high ratio of performance to power dissipation. Core clock speeds range from 160MHz to 233MHz on a 0.35 micron silicon process, with maximum power dissipation from 300mW to 1000mW; performance scales linearly with clock speed, from 115 to 268 Dhrystone MIPS. As would be expected from the target market, no floating point unit is provided, so Whetstone results depend on the quality of the software implementation, and are naturally very poor in comparison with any hardware implementation.

All StrongARM devices share a similar StrongARM processor core, and so have common features: the same 5-stage integer pipeline, with its non-pipelined so-called 'fast multiplier', which does 32-bit unsigned multiplication in a maximum of 3 clock cycles. The core architecture requires a Harvard architecture cache system.

The archetypical StrongARM devices, the variants of the SA-110, have a Harvard architecture cache and memory management system, with a 16KB data cache and a 16KB instruction cache, both virtually mapped and 32-way set-associative. Address translation for the two caches is handled by separate instruction and data MMUs, each with a 32-entry TLB mapping variable-size pages (4KB, 64KB or 1MB). There's also an 8-entry write buffer for writes to external memory. The external memory bus is 32 bits wide (cf. the Pentium, whose external bus is 64 bits) to keep pin counts and system costs down.
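
To make the cache geometry concrete, here is a minimal sketch of how the number of sets and the address field widths follow from cache size and associativity. The text above does not give the cache line size, so the 32-byte line used here is an assumption for illustration, and the function and parameter names are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    return sets, offset_bits, index_bits


# 16KB, 32-way set-associative, with an ASSUMED 32-byte line size
sets, offset_bits, index_bits = cache_geometry(16 * 1024, 32, 32)
print(sets, offset_bits, index_bits)
```

With so many ways, only a handful of address bits select the set; most of the work is in comparing tags across the 32 ways.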

The SA-1100's cache and memory architecture is essentially similar; however, the data cache is reduced in size to only 8KB (to reduce die area), but is complemented by a direct-mapped 512-byte mini-cache to maintain reasonable performance for typical application code.

Core Microarchitecture

The fun part! Be warned, a background in computer architecture or electronics is probably useful here :)

The StrongARM core is vastly different from all prior ARM implementations:

  • Abandons Stephen Furber's beloved 3-stage pipeline and datapath in favour of a more aggressive, traditional RISC 5-stage pipeline.
  • Relies on a cache architecture capable of delivering an instruction AND performing a data access in a single cycle: a multi-port cache (impractical!) or a Harvard architecture cache.
  • 26-bit mode support severely reduced by starting all exceptions in 32-bit mode regardless of the previous operating mode. For the Acorn RiscPC's StrongARM processor card, this meant an updated version of RISC OS had to be shipped with the cards.

Integer Pipeline

The blissful simplicity of the traditional 3-stage ARM pipeline and datapath, and the sometimes quirky timing model associated with it, were the first things to go in the design of the StrongARM. It's a completely different machine, built from the ground up. The StrongARM pipeline looks instead like an early MIPS or similar pipeline, and with it comes a less quirky but more hazardous timing and control model. The machine is sequential, single-issue and in-order with respect to instruction issue and completion. All instructions pass through all five pipe stages:

  1. Fetch: icache access.
  2. Decode: instruction control decode and register fetch.
  3. Execute: ALU operations (including pre-ALU shift operations), and load/store address calculation.
  4. Buffer: Load/store data cache access; multiplication completion.
  5. Writeback: Register results are written back to the register file.
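
The five stages above can be visualised with a small sketch that prints which stage each instruction occupies in each cycle, assuming an ideal flow with no stalls or branches. The single-letter stage labels and the function name are illustrative choices, not anything taken from the StrongARM documentation.

```python
# F=Fetch, D=Decode, E=Execute, B=Buffer, W=Writeback
STAGES = ["F", "D", "E", "B", "W"]


def pipeline_diagram(n_instructions):
    """Return one string per instruction; character c is the stage held in cycle c."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        # Instruction i enters Fetch in cycle i and advances one stage per cycle.
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append("".join(row))
    return rows


for line in pipeline_diagram(3):
    print(line)
# FDEBW..
# .FDEBW.
# ..FDEBW
```

Reading down any column shows at most one instruction per stage, which is what single-issue, in-order operation means in practice.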

Register file and bypasses

The register file (banked in traditional ARM fashion) is designed to match the maximum requirements of any simple ARM instruction, allowing a sustained throughput of one instruction per clock where possible, without the complex control logic and buffering that would otherwise be needed when an instruction reads more than two registers. It has three read ports (the A and B operands plus a shift amount are the maximum requirement) and two write ports (for a load instruction with base address writeback).

This design decision is an easy win on the StrongARM, whereas it wasn't on previous ARM implementations: the StrongARM has a whole clock cycle to perform a read from the comparatively small register file (27 registers), whereas every previous ARM has had to accommodate this within the same cycle as the barrel shifter, ALU and register writeback operations.

With the register file being read and written in different pipeline stages, register bypass paths are needed to deliver the results of previous instructions to the start of the execution stage. The StrongARM pipeline may produce only two results in any given cycle: a single result from each of Execute and Buffer, so to reduce complexity each of these data sources is bypassed to each of the three register read ports.
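
As a behavioural sketch only (not the actual multiplexer logic), the bypass choice for a single read port might look like this. The assumption here is that the Execute result takes priority over the Buffer result because it belongs to the younger instruction in program order; the function name and the (register, value) tuple representation are hypothetical.

```python
def read_operand(reg, regfile, execute_result=None, buffer_result=None):
    """Select the value seen by one register read port.

    execute_result and buffer_result are (dest_reg, value) tuples, or None
    when that stage produces no register result this cycle.
    """
    # Youngest producer first: the instruction in Execute is later in
    # program order than the one in Buffer (assumed priority).
    if execute_result is not None and execute_result[0] == reg:
        return execute_result[1]
    if buffer_result is not None and buffer_result[0] == reg:
        return buffer_result[1]
    # No in-flight producer: the register file holds the current value.
    return regfile[reg]


regfile = {1: 10, 2: 20}
print(read_operand(1, regfile, execute_result=(1, 99), buffer_result=(1, 55)))
print(read_operand(2, regfile, execute_result=(1, 99)))
```

The same selection would be replicated for each of the three read ports, which is why two result sources by three ports gives six bypass paths.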

Control Logic

Control logic is fairly simple. A register scoreboard ensures that instructions which have a dependency on an outstanding register result (from an instruction which is in Execute or stalled in Buffer) will stall in Decode until the result is available. This stall will cause the Fetch stage to stall also, but the Execute, Buffer and Writeback stages proceed as normal. The stall will cause a pipeline bubble (a nop, in essence) to be introduced in the wake of the stalled instruction. The pipe is inelastic, so once introduced, a bubble will also pass through all stages of the pipe like any other instruction.
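
The scoreboard's effect on timing can be sketched with a toy model that works out the cycle in which each instruction enters Execute. The latencies are assumptions for illustration: an ALU result is taken to be bypassable one cycle after entering Execute, and a load result two cycles after (once it has passed through Buffer). The operation names, register names and tuple format are all hypothetical.

```python
def execute_cycles(program, latency):
    """Return the cycle in which each instruction enters Execute.

    program: list of (op, dest, srcs) tuples, in program order.
    latency[op]: cycles from entering Execute until the result can be
                 bypassed to a following instruction (assumed values).
    """
    ready = {}     # register -> first cycle a consumer may enter Execute
    next_free = 0  # in-order, single-issue: one instruction per cycle at best
    cycles = []
    for op, dest, srcs in program:
        # Wait in Decode until Execute is free and every source is available.
        start = max([next_free] + [ready.get(r, 0) for r in srcs])
        cycles.append(start)
        if dest is not None:
            ready[dest] = start + latency[op]
        next_free = start + 1
    return cycles


LATENCY = {"alu": 1, "load": 2}  # assumed, see lead-in
program = [
    ("load", "r1", []),           # r1 <- memory
    ("alu", "r2", ["r1", "r3"]),  # needs r1: waits one extra cycle in Decode
    ("alu", "r4", ["r2", "r1"]),  # r2 is bypassed, so no further stall
]
print(execute_cycles(program, LATENCY))  # [0, 2, 3]
```

The gap between cycles 0 and 2 is the bubble: nothing enters Execute in cycle 1, and that empty slot travels down the rest of the pipe.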

Stateful control logic is required for multiple-cycle instructions. Because of the StrongARM's many read/write ports, the only instructions which have to spend more than one cycle in any datapath pipestage are multiplication and load/store multiple (LDM/STM) instructions. These again stall following instructions in Decode (and for multiplies, also in Execute).

Branch instructions calculate the branch target address in Execute, and Fetch will respond in the next cycle. For taken branch instructions, this means that the instructions previously in Fetch and Decode are from the untaken path and will be discarded; hence a branch penalty of two cycles.
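
As a rough back-of-envelope sketch, the two-cycle penalty can be folded into a cycle count for a stretch of code. This deliberately ignores dependency stalls, multi-cycle instructions and cache misses, and the function name and example figures are made up for illustration.

```python
BRANCH_PENALTY = 2  # instructions in Fetch and Decode discarded per taken branch
PIPELINE_DEPTH = 5  # Fetch, Decode, Execute, Buffer, Writeback


def total_cycles(n_instructions, taken_branches):
    """Cycles to complete n_instructions, counting only taken-branch penalties."""
    fill = PIPELINE_DEPTH - 1  # cycles before the first instruction completes
    return n_instructions + fill + BRANCH_PENALTY * taken_branches


# Hypothetical example: 100 instructions, 10 of them taken branches
cycles = total_cycles(100, 10)
print(cycles, cycles / 100)
```

Even in this simplified model, one taken branch in every ten instructions adds 20% to the instruction count's worth of cycles.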


References:

SA-110 Microprocessor Technical Reference Manual
http://www.intel.com/design/strong/manuals/278058.htm
SA-1110 Synopsis
http://developer.intel.com/design/strong/1110_brf.htm
Digital Semi and ARM press releases
http://www.lri.fr/archi/mirror/CIC/otherpr/StrongARM
http://www.arm.com/news.ns4/iwpList107/9C362E0DF59587858025693100396C45?OpenDocument