New CPU architectures do not crop up nearly as often as new software does. Even though each iteration of a CPU may differ from its predecessor (a different number of registers, a different pipeline, different latencies, new hardware features, etc.), the fundamental nature of the CPU at the programmer level does not change drastically. For example, a program compiled for x86 that ran on the older Ivy Bridge or Haswell chips will also run on the newer Broadwell. A completely new type of architecture, a complete redesign or reinvention, however, is something that happens much less often (though FPGAs are helping to make it less of a rarity).

Enter the Mill CPU, a new architecture currently in development. Its designers are still years away from physical hardware, yet they maintain that this is a commercial, not an academic, enterprise. The architecture is a complete redesign of what it means to be a General Purpose CPU. Some of their ideas are incredibly creative, but at the same time you can't help but think, "What the hell are these guys smoking?!". The design was inspired by the observation that programs spend most of their time in loops. Accordingly, the Mill intends to be wide issue like a VLIW (Very Long Instruction Word) machine and run general purpose code at DSP (Digital Signal Processor) speeds and power requirements.

The general philosophy of the Mill is: remove from the chip, as much as possible, anything that doesn't directly contribute to computation at runtime; perform those tasks once, and optimally, in the compiler; and use the freed space for more computation units. This results in vastly improved single core performance through more instruction level parallelism, as well as more room for more cores. It tackles the problem traditional architectures have in actually utilizing the large amounts of instruction level parallelism their ALUs provide.



The Belt
The Mill does not use any General Purpose Registers. Instead, it uses The Belt. The Belt can be thought of as a FIFO (First In First Out) queue that serves as the source of operands and the destination of results for instructions, as well as very short term data storage. It differs from registers in two respects: first, it is read-only, which makes it easy to keep track of data dependencies; second, when a new value is created, the oldest value on The Belt is dropped, all other values advance one position, and the new value is placed at the front of the belt. Therefore, instead of using register numbers as operands for hardware operations, The Mill uses belt position indices; values are referred to not by fixed locations but relative to when they were dropped onto the front of the belt. This is known as Temporal Addressing.
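To make the belt semantics concrete, here is a minimal Python sketch. The Belt class, its length of 8, and the drop/read names are my own toy constructions, not Mill terminology; real belts have model-dependent lengths.

    from collections import deque

    class Belt:
        # Toy model of the Mill's belt: a fixed-length queue of read-only
        # values. New results drop onto the front; the oldest falls off.
        def __init__(self, length=8):
            self.slots = deque([0] * length, maxlen=length)  # oldest at the back

        def drop(self, value):
            # Dropping a result logically advances every existing value by
            # one position and discards the oldest; nothing is overwritten.
            self.slots.appendleft(value)

        def read(self, pos):
            # Operands are named temporally: b0 is the newest value,
            # b1 the one dropped just before it, and so on.
            return self.slots[pos]

    belt = Belt()
    belt.drop(3)                             # b0 = 3
    belt.drop(4)                             # b0 = 4, b1 = 3
    belt.drop(belt.read(0) + belt.read(1))   # add b0, b1; result drops at b0
    print(belt.read(0))                      # 7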

When the oldest value on The Belt is dropped, it is lost forever. To deal with long lived values, The Mill has a Scratchpad that holds values for longer periods. The Mill also supports multiple return values; if an instruction returns several values, they are all placed at the front of the belt (the exact order is model dependent and something only the compiler should worry about). Additionally, since each function call creates a new stack frame, it also creates a new belt for that frame, preloaded with copies of the values from the argument list of the call operation. As a security feature, the callee cannot access the caller's belt and vice versa.
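A rough sketch of those two mechanisms, reusing the toy Belt class from above; the Scratchpad offsets and the call helper are illustrative inventions, not the Mill's actual spill/fill or call interface.

    class Scratchpad:
        # Toy scratchpad: values that must outlive their belt lifetime are
        # explicitly spilled here and filled back onto the belt later.
        def __init__(self):
            self.cells = {}

        def spill(self, offset, value):
            self.cells[offset] = value

        def fill(self, offset):
            return self.cells[offset]

    def call(args, belt_length=8):
        # Each call gets a fresh belt, preloaded with copies of the call's
        # arguments; the caller's belt is invisible to the callee and vice
        # versa. (The order the arguments land in is model dependent.)
        callee_belt = Belt(belt_length)
        for a in args:
            callee_belt.drop(a)
        return callee_belt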

The Belt is only a programming model and an interface for the compiler. In reality, no values are moved around on a "belt". It is implemented as a set of tagged store cells connected to all the functional units and other producers and consumers through a crossbar switch. The tags control access and data flow; each consists of the frame ID and the current belt index, which is incremented whenever a new value is dropped into the frame.
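The key point, that values sit still while only the naming changes, can be shown in a few lines; the BeltCells class and its counters are my own simplification of the tagged-cell scheme.

    class BeltCells:
        # Toy of the real implementation: values never move. Each result
        # lands in a tagged cell, and "advancing the belt" is just bumping
        # the frame's drop counter.
        def __init__(self):
            self.cells = {}   # (frame_id, drop_number) -> value
            self.drops = {}   # frame_id -> total drops so far

        def drop(self, frame, value):
            n = self.drops.get(frame, 0)
            self.cells[(frame, n)] = value   # stored once, never copied
            self.drops[frame] = n + 1

        def read(self, frame, pos):
            # Translate the temporal name b{pos} into a cell tag:
            # b0 is the most recent drop in this frame.
            return self.cells[(frame, self.drops[frame] - 1 - pos)]

    cells = BeltCells()
    cells.drop(frame=1, value=42)
    cells.drop(frame=1, value=7)
    print(cells.read(frame=1, pos=1))   # 42, still in its original cell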


Instruction Encoding
The Mill is very wide issue; it is able to sustain decode rates of 30 operations per cycle. It resembles a VLIW in that each instruction contains multiple operations that are issued together. Unlike a VLIW, however, it does not have a fixed width. It manages this by using Split Instruction Streams. Code on The Mill is organized into EBBs (Extended Basic Blocks: blocks of code with one entry point and multiple exit points). Within each EBB, the operations are split into an exu stream and a flow stream. The exu stream carries all arithmetic and logic computations, while the flow stream carries all control flow and memory accesses. Because there are two streams, there are two PCs (Program Counters). The Mill takes only one address as a branch target, and both PCs start from there: one stream walks up memory while the other walks down, and the two sides can be of different lengths. As a result, while the branch target is logically the top of an EBB, in memory it points into the middle of one.
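The two-PC idea is easier to see in code. This sketch treats memory as a Python list with one operation per slot, which is a big simplification (real Mill instruction halves are variable-length bit streams), and the opcode names are arbitrary.

    def fetch_ebb(code, entry, cycles):
        # Toy split-stream fetch: 'entry' is the single branch-target
        # address. The flow PC walks up from the entry while the exu PC
        # walks down, so the entry point sits in the middle of the EBB's
        # bytes even though it is logically the EBB's top.
        flow_pc = entry        # control flow and memory operations
        exu_pc = entry - 1     # arithmetic and logic operations
        for _ in range(cycles):
            yield code[exu_pc], code[flow_pc]   # issued together
            flow_pc += 1
            exu_pc -= 1

    # exu ops laid out in reverse below the entry, flow ops above it:
    code = ["mul", "add", "load", "branch"]   # entry at index 2
    print(list(fetch_ebb(code, 2, 2)))
    # [('add', 'load'), ('mul', 'branch')]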

For a true understanding of exactly how this works, I recommend you watch their video on Encoding; it is just as mind-boggling as it sounds.


Memory
The Mill is a 64-bit architecture; it does not support 32-bit. As a result, it uses a Single Address Space model. For some quick background: most General Purpose CPUs use multiple address spaces, one per process. Two processes might therefore access memory that each thinks is at address 0x0000000012345678; that is only a virtual address, the real physical address will be somewhere else, and the OS maps virtual addresses to physical addresses with the help of a Translation Lookaside Buffer (TLB). The Mill also uses virtual addressing, but within a single address space: it has a unified 60-bit virtual address space that the operating system distributes among processes, and it can be mapped to physical addresses in the traditional way. The benefit of this approach is that the TLB can be removed from the critical path and placed close to the physical memory (e.g. DRAM). In this scheme, two processes that access address 0x0000000012345678 access the same memory location.
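The contrast with per-process address spaces fits in a few lines; the dict-as-memory and the addresses are illustrative only.

    # Toy single address space: one global mapping, no per-process page
    # tables. Addresses are illustrative, not real Mill layouts.
    memory = {}

    def store(addr, value):
        memory[addr] = value

    def load(addr):
        return memory.get(addr, 0)   # unmapped reads matter below

    # Two "processes" naming the same virtual address touch the same cell;
    # whether each is *allowed* to is a protection question (see below).
    store(0x0000000012345678, 42)    # process A writes
    print(load(0x0000000012345678))  # process B reads 42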

There are also the concepts of "Virtual Zero" and "Implicit Zero". Virtual Zero occurs when a load misses the caches and misses the TLB with no Page Table Entry for the address. Since this means there have been no stores to that address, the Memory Management Unit (MMU) simply returns 0, saving the OS the trouble of zeroing out pages. Implicit Zero is an optimization of this for data stacks. When allocating a new stack frame, the Mill simply sets markers in implicit zero registers for the cache lines associated with that frame. When a load hits such a line, the Mill doesn't bother going through the usual route of checking the caches; it just returns 0. This not only speeds up loads but also makes it impossible to snoop on old stack frames.
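A toy model of both behaviors follows; the ToyMMU class, its fixed 64-byte lines, and its method names are my own, and the real hardware paths are of course far richer.

    class ToyMMU:
        # Toy of Virtual Zero and Implicit Zero; line size fixed at 64.
        LINE = 64

        def __init__(self):
            self.backed = {}            # address -> stored value
            self.implicit_zero = set()  # lines marked at frame allocation

        def alloc_frame(self, base, n_lines):
            # Allocating a stack frame touches no memory at all: its lines
            # are just marked, so old frame contents become unreachable.
            for i in range(n_lines):
                self.implicit_zero.add(base + i * self.LINE)

        def store(self, addr, value):
            self.backed[addr] = value

        def load(self, addr):
            line = addr - addr % self.LINE
            if addr in self.backed:
                return self.backed[addr]
            if line in self.implicit_zero:
                return 0    # Implicit Zero: no cache lookup needed
            return 0        # Virtual Zero: miss with no PTE returns 0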

The Data Cache (dCache) and Instruction Cache (iCache) are implemented differently. On the dCache side, there are no cache misses on a store, because stores are always performed at the top of the cache. Cache lines get evicted to lower levels (L1, L2, etc.) and are hoisted back up on a load, all purely within the cache hierarchy and without any physical memory involved. An actual page backed by physical memory is only allocated when a line is evicted from the lowest cache level. The iCaches, by contrast, are read-only. All memory accesses occur in program order, which means there is no need for fence instructions; this order is maintained across cores on a multi-core machine.
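Here is a sketch of the store-at-top idea with lazy page allocation. The two-level LRU hierarchy, its tiny capacities, and the page size are all assumptions of mine, and load hoisting is omitted for brevity.

    from collections import OrderedDict

    class ToyDCache:
        # Toy of the store-at-top dCache: stores always land in the top
        # level, lines trickle down on eviction, and a physical page is
        # allocated only when a line falls out of the lowest level.
        def __init__(self, capacities=(2, 4)):
            self.levels = [OrderedDict() for _ in capacities]
            self.caps = capacities
            self.dram_pages = set()   # pages given real backing storage

        def store(self, line, data):
            self._insert(0, line, data)   # stores never miss

        def _insert(self, lvl, line, data):
            if lvl == len(self.levels):
                # Evicted from the lowest cache: only now allocate a page.
                self.dram_pages.add(line // 4096)
                return
            level = self.levels[lvl]
            level[line] = data
            level.move_to_end(line)            # mark most recently used
            if len(level) > self.caps[lvl]:
                old, old_data = level.popitem(last=False)  # evict LRU line
                self._insert(lvl + 1, old, old_data)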

Protection
The Mill's security and protection apply to the virtual address space, using a Protection Lookaside Buffer (PLB). Because all processes share a single address space, the Mill introduces the concepts of Regions and Turfs. A region is a contiguous stretch of memory; a turf is a collection of regions. The PLB holds the information about regions and turfs. Threads always execute within their protection domain: a thread has exactly one turf at any given time, but that turf can change over time and can be shared among multiple threads. Well Known Regions are defined via registers to save the time needed to go through the PLB for common accesses. The Stack region grows and shrinks along with the stack pointer, so no one can access memory above the stack pointer. Changing turfs is done through Portals, which are not task switches but a way of getting a thread to run in a new protection environment.
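The access check itself is simple to sketch; the Region field names, the rights strings, and the example addresses are invented for illustration, not the real PLB entry format.

    from collections import namedtuple

    # Toy PLB check. Field names are mine; real PLB entries differ.
    Region = namedtuple("Region", "base length rights turf")

    plb = [
        Region(0x1000, 0x1000, "rw", turf=1),
        Region(0x8000, 0x0800, "r",  turf=1),
        Region(0x8000, 0x0800, "rw", turf=2),  # same memory, other turf's rights
    ]

    def allowed(turf, addr, access):
        # A thread runs in exactly one turf; an access is legal if some
        # region of that turf covers the address and grants the right.
        return any(r.turf == turf and r.base <= addr < r.base + r.length
                   and access in r.rights
                   for r in plb)

    print(allowed(1, 0x8010, "w"))  # False: turf 1 only has read here
    print(allowed(2, 0x8010, "w"))  # True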

For a true understanding of exactly how this all works, I recommend you watch their video on Memory.


Prediction
Prediction on the Mill differs from traditional architectures in that it predicts not branch entries but branch exits. To be precise, instead of tracking whether each branch is taken, it tracks the branch at which control flow exits an EBB. Its Exit Table can therefore hold more information per entry than a traditional branch predictor's, because there are fewer things it needs to keep track of. From an EBB's predicted exit, the address of the next EBB is constructed, which is used as the key for the next lookup, and so on. In this manner, the Mill builds up Exit Chains that help the Prefetcher stay ahead of the instruction stream.
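Chain-building reduces to repeated table lookups, as this sketch shows; the dict-based table and the EBB addresses are placeholders of my own.

    def prefetch_chain(exit_table, entry, depth):
        # Toy exit-chain walk: one prediction per EBB (where it exits),
        # and each predicted exit keys the next lookup, letting the
        # prefetcher run ahead of execution.
        chain = []
        addr = entry
        for _ in range(depth):
            pred = exit_table.get(addr)   # predicted exit for this EBB
            if pred is None:
                break                     # cold entry: stop the chain
            chain.append(pred)
            addr = pred                   # predicted target keys next EBB
        return chain

    exit_table = {0x100: 0x240, 0x240: 0x100}   # a hot two-EBB loop
    print(prefetch_chain(exit_table, 0x100, 4))
    # [576, 256, 576, 256], i.e. 0x240, 0x100, 0x240, 0x100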

On a mispredict, the delay ranges from five cycles (when the correct code is in the prediction cache) to more than a hundred (when it must come from DRAM), depending on how far away the needed memory is; the exit table is then updated with the correct exit. The initial exit table is filled from a table saved in the program's load section. This can be produced by the compiler, which has full knowledge of all unconditional branches and can infer a decent amount about conditional ones as well; profiling can improve the predictions further. Finally, this initial seeding of the exit table is done in bulk, which is cheaper than faulting entries in on demand.
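Continuing the toy exit table from above, the update and seeding steps might look like this; the penalty constant and the seed entries are stand-ins, not measured Mill numbers.

    MISPREDICT_PENALTY = 5  # cycles at best; far more if code is in DRAM

    def resolve_exit(exit_table, ebb, predicted, actual):
        # On a wrong exit, just record what actually happened; the next
        # chain walk through this EBB uses the corrected entry.
        if predicted != actual:
            exit_table[ebb] = actual
            return MISPREDICT_PENALTY
        return 0

    # Seeding: the compiler emits an initial table into the binary's load
    # section, so the predictor starts warm instead of learning on demand.
    compiler_seed = {0x100: 0x240, 0x240: 0x380}   # hypothetical entries
    exit_table = dict(compiler_seed)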



For those who find the very low-level interesting, I recommend checking out their videos and their wiki. While I am somewhat skeptical about this endeavor, it is genuinely interesting to learn about the architecture. I look forward to more of their videos, and to seeing whether they manage to gain market penetration and turn a profit if and when the Mill finally becomes physically and commercially available.


Sources:
http://millcomputing.com/docs/
http://millcomputing.com/wiki/Belt
http://millcomputing.com/wiki/Encoding
http://millcomputing.com/wiki/Memory
http://millcomputing.com/wiki/Protection
http://millcomputing.com/wiki/Prediction
