A low-power, high-performance RISC processor from ARM Limited, targeted primarily at portable appliances such as mobile phones and personal digital assistants.

An evolution of the concepts first used in the ARM8, the ARM9 superseded its predecessor in late 1997. The key improvement over the ARM8 was support for the popular TDMI extensions; the 5-stage pipeline design introduced with the ARM8 was retained, and a StrongARM-like Harvard memory architecture was introduced.

Memory Architecture

As realised during the design stages of the ARM8, increasing ARM's performance comes down to reducing the number of cycles per instruction (CPI). The most effective way to do this is to alter the memory architecture such that multiple 32-bit accesses are possible within one cycle. The ARM9 core allows for this by requiring separate instruction and data memories, which is normally achieved through two separate instruction and data caches connected to a single unified main memory.

The ARM8 uses a double-bandwidth memory, which has two main advantages. It lets a prefetch buffer fetch instructions predictively and later feed them to the pipeline at one per cycle, and it speeds up Load Multiple (LDM) instructions by reading two 32-bit words into registers in one cycle. However, double-bandwidth caches are expensive and fairly complicated to produce, and the performance gains are only seen when the prefetch unit correctly predicts instruction flow.
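
The LDM benefit described above can be illustrated with a little arithmetic. The following is a minimal sketch, not a model of the real ARM8 timing: it counts only data-transfer cycles and ignores address calculation, cache misses and alignment. The function name and parameters are made up for illustration.

```python
import math


def ldm_transfer_cycles(num_registers: int, words_per_cycle: int) -> int:
    """Data-transfer cycles for a Load Multiple of num_registers words.

    Simplified assumption: the memory delivers a fixed number of 32-bit
    words every cycle and nothing else delays the instruction.
    """
    if num_registers < 0 or words_per_cycle < 1:
        raise ValueError("need num_registers >= 0 and words_per_cycle >= 1")
    return math.ceil(num_registers / words_per_cycle)


if __name__ == "__main__":
    # Compare a single-bandwidth memory with a double-bandwidth one.
    for n in (1, 2, 5, 8):
        single = ldm_transfer_cycles(n, words_per_cycle=1)
        double = ldm_transfer_cycles(n, words_per_cycle=2)
        print(f"LDM of {n} registers: {single} cycles vs {double} cycles")
```

Under these assumptions the saving approaches one half for long register lists, and vanishes for a single-register load.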

ARM9, like the DEC StrongARM, allows simultaneous access to instruction and data memories. In the 5-stage ARM pipeline, this is critically useful. The two pipeline stages which access memory, instruction fetch and memory read/write, are three stages apart. As shown below, the memory access of one instruction falls in the same cycle as the fetch of the instruction three places behind it, so simultaneous instruction and data memory access is necessary to avoid a pipeline stall.

PC                ARM9 Pipeline
     -------- -------- -------- -------- --------
 0  | Fetch  | Decode |  ALU   | Memory | Write  |
     -------- -------- -------- -------- --------
              -------- -------- -------- --------
 4           | Fetch  | Decode |  ALU   | Memory | ...
              -------- -------- -------- --------
                       -------- -------- --------
 8                    | Fetch  | Decode |  ALU   | ...
                       -------- -------- --------
                                -------- --------
 12                            | Fetch  | Decode | ...
                                -------- --------
                                         --------
 16                                     | Fetch  | ...
                                         --------
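
The overlap in the diagram can be reproduced with a few lines of code. This is a rough sketch under simplifying assumptions: instructions issue one per cycle with no stalls, and every instruction is treated as if it used the Memory stage for a data access (in practice only loads and stores do). The names used are illustrative only.

```python
from typing import List, Optional, Tuple

STAGES = ["Fetch", "Decode", "ALU", "Memory", "Write"]


def stage_in_cycle(instr_index: int, cycle: int) -> Optional[str]:
    """Stage occupied by an instruction in a given cycle, or None.

    Instruction i enters Fetch in cycle i (both counted from 0) and
    advances one stage per cycle.
    """
    offset = cycle - instr_index
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None


def memory_port_conflicts(num_instructions: int) -> List[Tuple[int, int, int]]:
    """Return (cycle, fetching_instr, memory_instr) for every cycle in
    which one instruction is in Fetch while another is in Memory."""
    conflicts = []
    last_cycle = (num_instructions - 1) + (len(STAGES) - 1)
    for cycle in range(last_cycle + 1):
        for f in range(num_instructions):
            if stage_in_cycle(f, cycle) != "Fetch":
                continue
            for m in range(num_instructions):
                if stage_in_cycle(m, cycle) == "Memory":
                    conflicts.append((cycle, f, m))
    return conflicts


if __name__ == "__main__":
    for cycle, f, m in memory_port_conflicts(5):
        # Instructions are 4 bytes long, so index i sits at PC = 4 * i.
        print(f"cycle {cycle}: fetch of PC {4 * f} overlaps "
              f"memory access of PC {4 * m}")
```

For the five instructions drawn above, the first overlap reported is the fetch at PC 12 against the memory access of the instruction at PC 0. With a single-ported unified memory one of the two accesses would have to wait; separate instruction and data memories let both proceed.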

Pipeline Alterations

The pipeline used in the ARM8 is almost identical to that used in the ARM9. However, since simultaneous instruction and data memory accesses are possible, it is no longer necessary to have a prefetch unit buffering sequences of instructions. Furthermore, the branch prediction capability of the prefetch unit was dropped from the ARM9, because practical usage had shown it to be mostly redundant. ARM architect Guy Larri noted:

"We had branch prediction in the ARM8. But we found that compilers using conditional execution could simply eliminate many of the branches we were trying to predict."

With a standard 5-stage RISC pipeline with register forwarding, and a Harvard memory architecture, the ARM9 appears to be very similar to the DEC StrongARM. In fact, the primary difference between the two cores is that the StrongARM has a dedicated branch target adder which runs in parallel with the instruction decode stage, while the ARM9 uses the standard ALU to calculate the branch offset. Omitting the dedicated adder removes a critical path from the pipeline, allowing for a simpler and more portable core, but at the expense of a single cycle delay for a taken branch.
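
The cost of that taken-branch delay can be estimated with the usual back-of-envelope CPI formula. This is only a sketch: the branch frequency and taken ratio below are hypothetical placeholders, not measurements of any ARM9 workload, and the function name is made up for illustration.

```python
def effective_cpi(base_cpi: float,
                  branch_fraction: float,
                  taken_fraction: float,
                  taken_penalty_cycles: float) -> float:
    """Average cycles per instruction once taken-branch delays are added.

    base_cpi             -- CPI with no branch penalty at all
    branch_fraction      -- fraction of executed instructions that are branches
    taken_fraction       -- fraction of those branches that are taken
    taken_penalty_cycles -- extra cycles lost on each taken branch
    """
    for name, value in (("branch_fraction", branch_fraction),
                        ("taken_fraction", taken_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return base_cpi + branch_fraction * taken_fraction * taken_penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 15% branches, 60% of them taken.
    with_adder = effective_cpi(1.0, 0.15, 0.6, taken_penalty_cycles=0)
    without_adder = effective_cpi(1.0, 0.15, 0.6, taken_penalty_cycles=1)
    print(f"no extra delay:   CPI = {with_adder:.2f}")
    print(f"one-cycle delay:  CPI = {without_adder:.2f}")
```

The fewer branches a compiler emits, for example by using conditional execution, the smaller the branch_fraction term and the less the extra cycle matters.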

Practical Implications

The modifications made to the pipeline and memory architecture allowed the ARM9 to exceed the performance of the ARM8, to move more readily to a smaller process and higher clock speed, and to consume less power. Perhaps more importantly, the addition of the TDMI features (Thumb instruction set, fast multiply hardware, and debug and in-circuit emulation) allowed the core to be more readily targeted at the embedded market.

ARM9 performance is around 220 Dhrystone 2.1 MIPS at 200MHz, with power usage ranging from 560mW for the ARM920T down to an incredible 150mW for the ARM9TDMI. This performance/power ratio makes it ideal for battery-powered portable devices, such as cell phones and PDAs. The ARM9TDMI, when fabricated at 0.18µm, will also operate on a core voltage of just 1.2 volts.
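
The performance/power ratio can be worked out directly from the figures quoted above. The sketch below assumes that the 220 Dhrystone MIPS figure applies to both parts and that both power figures were taken at 200MHz; the text does not state either explicitly, so treat the results as rough indications only.

```python
# Figures quoted in the text above.
DMIPS = 220.0        # Dhrystone 2.1 MIPS
CLOCK_MHZ = 200.0    # clock speed at which that figure was quoted
POWER_MW = {"ARM920T": 560.0, "ARM9TDMI": 150.0}


def dmips_per_mhz(dmips: float, clock_mhz: float) -> float:
    """Performance normalised by clock speed."""
    return dmips / clock_mhz


def dmips_per_watt(dmips: float, power_mw: float) -> float:
    """Performance normalised by power (milliwatts converted to watts)."""
    return dmips / (power_mw / 1000.0)


if __name__ == "__main__":
    print(f"{dmips_per_mhz(DMIPS, CLOCK_MHZ):.2f} DMIPS/MHz")
    for part, power in POWER_MW.items():
        # Assumption: same DMIPS figure for every part listed.
        print(f"{part}: {dmips_per_watt(DMIPS, power):.0f} DMIPS/W")
```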

Fact Sheet

  • Used in: Gamepark GP32 handheld games console, modern cell phones (Nokia 7650, Nokia 9210, Sony-Ericsson P800)
  • Processors available:
    • ARM9TDMI (cell with TDMI extensions)
    • ARM920T (cell, MMU, dual 16KB instruction and data caches, write buffer)
    • ARM940T (ARM920T with Memory Protection Unit instead of MMU)
  • Fabrication: 0.35µm, 0.25µm, 0.18µm
  • Clock: 0-200MHz
  • Cache: split instruction/data
  • Addressing: 26-bit, 32-bit
  • Architecture: ARMv4T, ARMv5TE
  • Notable features: Harvard architecture, TDMI extensions, highly portable core, excellent performance/power ratio

References:

"ARM System-on-Chip Architecture", Furber, Addison-Wesley, 2000
"Complex RISC collides with IA-64 parallelism", Wilson and Wolfe, EE Times, 1997
"ARM9 Data Sheet", Advanced RISC Machines Ltd, www.arm.com