FirePath is a RISC/DSP processor architecture targeted at low-power and embedded applications.
The FirePath instruction set is, in essence, the supercharged bastard son of ARM. It began life at Acorn some time after ARM, and was designed by Sophie Wilson, who also designed the ARM instruction set.
When the remnants of Acorn were broken up in 1999 under Stan Boland, the FirePath project formed the core of the start-up which Stan spun out of the remains: Element 14 Inc.
Element 14 focussed on DSL as a target market for applications of FirePath. Broadcom bought out Element 14 for $640 million worth of BRCM stock in late 2000, forming the bulk of the Broadcom DSL Business unit. Shortly afterwards, the initial silicon implementation appeared in mysterious demo boxes in side rooms at trade shows in the US.
FirePath cores are used in Broadcom's 12-line central office DSLAM chipset (a single chip runs the DSP for 12 lines of ADSL), the BCM6410RA0.
This is the part I could go on for ages and ages about. But which, sadly, I can't because I'd get in trouble with the lawyers for breach of NDAs.
Welcome to buzzword central. FirePath is a rigorously SIMD, LIW architecture, somewhere between a conventional RISC architecture and a general purpose DSP core.
A single instruction is a 64-bit packet capable of encoding two operations, each executing in parallel using FirePath's two symmetrical 64-bit SIMD datapaths. This symmetry is the major differentiating factor between FirePath and VLIW architectures, so much so that Sophie coined the term 'DIMD', or 'Dual Instruction, Multiple Data' to describe FirePath.
The SIMD nature of the instruction set is complemented by the way FirePath predication works. Rather than predicate an entire instruction's execution on a single bit, like IA-64's predication, FirePath has 8-bit predicate registers, each bit controlling the conditional writeback of a separate byte in an operation's result, thus allowing conditional operations to be performed in a SIMD fashion.
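To make the per-byte predication concrete, here's a C sketch of the idea (my own illustrative model, not Broadcom's hardware or naming): bit i of an 8-bit predicate decides whether byte i of a result is written back to the destination.

```c
#include <stdint.h>

/* Illustrative model of FirePath-style per-byte predicated writeback:
 * bit i of the 8-bit predicate controls whether byte i of 'result'
 * replaces byte i of 'dest'. Function name is mine, for illustration. */
uint64_t predicated_writeback(uint64_t dest, uint64_t result, uint8_t pred)
{
    uint64_t out = dest;
    for (int i = 0; i < 8; i++) {
        if (pred & (1u << i)) {
            uint64_t mask = (uint64_t)0xFF << (8 * i);
            out = (out & ~mask) | (result & mask);
        }
    }
    return out;
}
```

With pred = 0xFF every byte of the result lands in the destination; with pred = 0x00 the destination is untouched, which is how a single conditional operation can act on eight lanes at once.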
Let's try a little code example. Since bits of code have been presented at HotChips and Embedded Microprocessor Forum, there shouldn't be any harm in presenting a little bit here. Let's say we want to take two arrays of 8 unsigned bytes, and write the maximum of each byte to another array.
In ARM, we could do it like this...
mov r0, #8
loop:
ldrb r1, [rAp], #1
ldrb r2, [rBp], #1
cmp r1, r2
movhi r2, r1
strb r2, [rMp], #1
subs r0, r0, #1
bne loop
I'm not claiming this is the best way to do it, of course. We could do the load and stores 4 bytes at a time, and unroll the loop, for instance.
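For reference, the same byte-max operation written as plain scalar C (what both assembly versions are computing):

```c
#include <stddef.h>

/* Scalar reference: for each of n unsigned bytes, write the larger
 * of a[i] and b[i] to m[i]. n is 8 in the examples here. */
void byte_max(const unsigned char *a, const unsigned char *b,
              unsigned char *m, size_t n)
{
    for (size_t i = 0; i < n; i++)
        m[i] = (a[i] > b[i]) ? a[i] : b[i];
}
```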
Now here's the same thing coded in the 'obvious' way in FirePath assembler.
ldl r1, [rAp], #8 : ldl r2, [rBp], #8
cmphib p0, r1, r2 : nop
p0.movb r2, r1 : nop
stl r2, [rMp], #8 : nop
This is a good example of SIMD parallelism in FirePath: all eight 'A' bytes are loaded in parallel by the first 'ldl' operation; all eight 'B' bytes are loaded by the second one (which, written on the same line like this, actually executes in parallel with the first). All eight comparisons are done in parallel by the 'cmphib' operation, which sets one bit of predicate register 'p0' for each byte's comparison. The predicated 'movb' operation does the work of all eight executions of the ARM 'movhi' instruction, and the 'stl' writes back all eight 'maximum' bytes.
The 'nop' operations in the above could be replaced by useful operations if there were sufficient parallelism in the algorithm to find some independent code to execute there.
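If you want to trace through the FirePath sequence step by step, here is a rough C model of it (my sketch of the visible semantics, not the real microarchitecture; the function names mirror the mnemonics but are my own):

```c
#include <stdint.h>
#include <string.h>

/* Model of 'cmphib p0, r1, r2': set bit i of the predicate
 * iff byte i of a is unsigned-higher than byte i of b. */
uint8_t cmphib(uint64_t a, uint64_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t ab = (uint8_t)(a >> (8 * i));
        uint8_t bb = (uint8_t)(b >> (8 * i));
        if (ab > bb)
            p |= (uint8_t)(1u << i);
    }
    return p;
}

/* Model of 'p0.movb dest, src': per-byte predicated move. */
uint64_t pred_movb(uint64_t dest, uint64_t src, uint8_t p)
{
    uint64_t out = dest;
    for (int i = 0; i < 8; i++)
        if (p & (1u << i)) {
            uint64_t m = (uint64_t)0xFF << (8 * i);
            out = (out & ~m) | (src & m);
        }
    return out;
}

/* The whole four-instruction sequence, one 64-bit word at a time. */
void firepath_byte_max(const unsigned char a[8], const unsigned char b[8],
                       unsigned char out[8])
{
    uint64_t r1, r2;
    memcpy(&r1, a, 8);               /* ldl r1, [rAp], #8 */
    memcpy(&r2, b, 8);               /* ldl r2, [rBp], #8 */
    uint8_t p0 = cmphib(r1, r2);     /* cmphib p0, r1, r2 */
    r2 = pred_movb(r2, r1, p0);      /* p0.movb r2, r1    */
    memcpy(out, &r2, 8);             /* stl r2, [rMp], #8 */
}
```

The model is serial, of course; the point of the real hardware is that each of these steps acts on all eight byte lanes in a single cycle.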