Background

At the heart of most modern computers lies the central processing unit or CPU. It is quite often the fastest and most expensive part of the machine. The CPU operates on data and programs which reside in the computer's main memory chips. The CPU fetches programs from the main memory one instruction at a time and executes them. Quite often, the CPU will have to fetch additional information from the memory, such as data on which a calculation is to be performed. The CPU runs according to a clock. This clock sends out pulses. At each pulse, the CPU will go one step further in this fetch decode execute cycle.

From this very simple sketch, it should be clear that the communication with main memory needs to be very fast: it would be unacceptable if it took a CPU only one clock cycle to execute an instruction, but ten clockcycles to fetch the instruction from main memory. It would be possible to construct memory in such a manner that its access time is equal to that of the CPU's internal registers. This way, a fetch would take only one clock cycle. Unfortunately, such memory is very expensive. For reasons beyond the scope of this text, it would in addition be very complicated, if not impossible, to build a large memory like this. That is why normal computers use slow, cheap dynamic RAM for memory. This RAM is in fact about a factor ten slower than the CPU and it is the biggest bottleneck in modern computers. When buying a computer, it would thus be wise to check out its bus speed instead of the CPU speed.

Principle

To soften the problem outlined above, processors are equipped with a very small amount of expensive on-chip memory, called a cache. The word cache (pronounced 'cash') derives from the French word 'cacher' (to conceal). The cache effectively hides the slower main memory from the CPU. The cache keeps copies of small pieces of data that reside in the main memory. When the CPU requests the data in a certain memory location, the hardware first checks if a copy resides in the cache. If so, the cached data is used. If not, the hardware loads a region of memory into a slot in the cache.

Why cache improves speed

Computer scientists have observed long ago that the memory access behaviour of computer programs is far from random. The CPU does not request values from all over the memory. Instead, there are always some hot spots, some parts of memory that the CPU just keeps accessing repeatedly. This is not surprising since computers are programmed by people that like to organise data and computations into lists, variables, procedures, loops, etc. A program usually executes the same procedures repeatedly, accesses the same variables repeatedly, works its way through lists and pieces of code, etc. This phenomenon is known as locality of reference. In a sentence: locality of reference is the observed statistical property that a program is most likely to request data equal to or close to previously requested data.

With the property of locality of reference in mind, it makes sense to cache an entire region of memory when a location from that region is requested. With luck, the processor will access many locations in this region in the future. Since the cache memory is very fast, this strategy can increase the effective throughput of the CPU considerably.

Caching policy issues

Write Through versus Write Back

Obviously, when the CPU changes data in a cached region, it needs to be written back to main memory at some point in time. There are two policies that can be followed here. They are write back and write through. With write back, a region is marked as 'dirty' once its data have been altered. When the cache is full and regions need to be evicted, the region is written back to memory. If it was not dirty, it is just discarded, since the main memory holds the same copy.With write through, the region is written to memory as soon as it is altered. The former policy holds the best performance increase, since a region may be written many times over for the cost of only one complete memory transfer, whereas the latter costs one complete region transfer per write. In some situations, like multiprocessor environments or situations where consistency and robustness are issues, it might be necessary to write through.

Region replacement

When the cache memory fills up, regions need to be evicted to make room for new ones. The question arises which strategy should be used in order to make the best use of the cache memory. The problem is essentially the same as that of page replacement in a virtual memory system. Belady stated an algorithm in 1966 which Mattson proved to be optimal in 1970. That is, Belady's algorithm causes the least amount of cache misses of all conceivable algorithms. (A cache miss is when a piece of requested data is not in the cache and has to be fetched from memory. The opposite is called a cache hit.) The algorithm is an obscenely simple one: "evict the region that will not be used for the longest period of time". Unfortunately, this algorithm is not causal. It requires knowledge of future events, which makes it impossible to implement (at the time of writing). Many algorithms exist that approach the optimal algorithm, like the Least Recently Used (LRU) algorithm, which evicts the least recently used region, or the First In First Out (FIFO) algorithm and many others. Preferably, hardware designers would use LRU, because it is statistically the closest approximation of the optimal algorithm. Unfortunately, implementing LRU is rather expensive and a compromise is often chosen. To see why, read on!

Mapping

When a region is brought into the cache, there are two conflicting factors involved when choosing a place to store the region. Those factors are efficiency and money. First of all, we could have a fully associative mapping. This means that the region can be placed anywhere in the cache. This would permit an implementation of the LRU algorithm. However, such a cache would make it somewhat difficult to find a specific piece of data within the cache. Since any requested address can be anywhere in the cache, the hardware required to translate linear addresses requested by the CPU to cache locations needs to be fairly sophisticated, read: expensive.

Another strategy is to use direct mapping. In this scheme, each memory region can be in at most one location in the cache. Another way to look at this is that certain memory regions share the same slot in the cache. So if a certain region is in the cache, we always know exaclty where it is, since it has only one place to go. This makes for simple hardware, but it is extremely inefficient. When two regions share the same cache slot they have to be continually swapped in and out if the CPU requires both at the same time. This resembles the thrashing phenomenon found in virtual memory systems that employ a local replacement policy, which was described by Denning in 1968. It is especially unnecessary if the rest of the cache is fully available.

A compromise can be reached in the form of a set-associative mapping. In this scheme, each region has a limited amount (a 'set') of cache regions in which it can reside. The hardware thus doesn't have to look through the entire cache, because it knows there are just a few locations where the sought after region can reside. The first few bits of the absolute memory address are often used as a means to group regions. They are called the tag bits. This way, each chunk of memory maps onto its own set of cache regions, but within that region the mapping is fully associative.

Remark that both direct mapping and set-associative mapping inhibit the implementation of a full LRU replacement strategy. They are cheaper to implement however. An interesting way to look at this is that a direct mapping allows no policy at all. The replacement algorithm is trivial since each requested region has only one way to go. Thus, direct mapping requires no special policy hardware. It also implies one of the worst possible policies.

Region size

Another important design consideration for cache memories is the size of the cached regions. Again, many factors conflict. It is desirable to have a large region size because it increases the chances of another hit on that region. Large regions also make the controlling hardware considerably easier and cheaper. On the other hand, since cache space is limited, a small cache would be more desirable. Further, transferring regions to and from memory takes time; the longer the region, the more time it takes. In practice, finding a good region size is very much an empirical matter. The size of the cache itself is an economical one.

Leftovers

Multiple level caching

Effectively, a computer consists of layer upon layer of caches: the memory caches disk blocks, the on-chip cache caches memory, the registers cache the on-chip cache. There is really no limitation to add yet another layer of cache somewhere. For instance, between the on-chip cache and the main memory we could have another layer of memory that is faster and more expensive than main memory, but cheaper and slower than the on-chip cache. This too comes at an obvious price, but this kind of level 2 caching is quite common. So is level 3 caching. There are some natural limits however. At some point, building a large, fast memory will be cheaper than building dozens of layers of cache.

Translation Look-aside Buffers

In a virtual memory system, which will not be described in this node, a special kind of cache is used in conjunction with the normal cache described above. In a virtual memory system it is necessary to translate virtual addresses to physical ones. It is also necessary to check if a physical address is actually in memory or if it is residing on the hard disk. Because this translation takes time (clock cycles) it pays off to keep the information on recently accessed memory pages in a special cache: the Translation Look-aside Buffer or TLB. On exactly the same grounds of locality of reference, the CPU is likely to need access to the same page in the future as it has just accessed. This is why the page descriptor of such a page is kept in the TLB.

Some typical figures

If you have carefully read the above, you will have noticed that there is much talk about probability and little about certainty. This needs to be stressed: in most of the cases, caching increases speed and in most of the cases, LRU works best. However, these are not laws of nature! There are programs for which LRU performs very poorly, especially when traversing long lists, and there are situations in which caching is slower than direct memory access. These situations do occur in the field so to speak. Be warned. That aside: your average application will have just about 90 percent of cache hits for every memory reference on your average computer. Pretty good! The same holds for the memory by the way: every memory reference that the cache makes in a virtual memory system usually has about 90 percent of hits. The chance of going to the disk is thus very dim. In a properly equipped system that is. This is just to give an impression of course. Your mileage will definitely vary.

Remarks

Caching, and the problems involved, is very similar to the mechanics of paging in a virtual memory system and to the mechanics of block access on disks. In fact, the main memory itself is used for caching disk blocks by the disk driver, since disks are much slower than main memory. Each has its own problems and peculiarities however, but generally, this type of managing limitations in speed and capacity is what makes designing hardware and operating system software so complex, and challenging perhaps. Problems like these are found in operating systems on many, many levels. The manner in which the implementer deals with them makes up for a great deal of the speed and efficiency with which your machine operates. It also makes it very difficult do compare, say a Macintosh with a G4 processor running MacOS X to a PC with an Athlon running Linux.

Another thing to note about the LRU replacement algorithm described above, is that it requires the hardware to keep statistics on each region. It requires continuous knowledge of when each region was used. This causes considerable overhead. Many times a simple approximation is used that will only track the usage of each region for one to four cache misses. Still, updating all the statistics requires complex hardware.

Finally, the computer model described above had to be simplified somewhat. The process of fetching, decoding and executing is a fairly accurate representation of computing on a RISC architecture, like the PowerPC. On a CISC architecture, things are a bit more complex, since each instruction might execute a number of microinstructions, which in turn might access memory.