A buffering algorithm used in complex database systems to allow the caching of data based on the actual number of times a particular piece of data is accessed by user processes.

Databases typically store data in database blocks which are read from physical disk. Since physical disk access takes a relatively long time compared to reading from memory, a data cache is used to reduce the number of physical disk accesses required. If ten processes are all executing more or less the same task, they may all require access to the same data. A disk cache means that blocks which are accessed many times over can be held in memory, so that potentially only the first of the ten processes needs to access the disk.

One of the principal problems in implementing a cache system is deciding which blocks are stored in memory and which are held on disk. If you have 100GB of database and 8GB of memory, it's obvious that you can't store all of your data in memory. Many database and OS caching systems use what is called a Least Recently Used (LRU) algorithm to determine which data areas stay in memory. This algorithm is fairly simple in its basic form: a list is maintained of all blocks accessed, and the most recently accessed blocks are held in memory. When more blocks are read from disk, the least recently used blocks in memory are dropped out to make room for the new, most recently used blocks. The problem here is that the least recently used blocks are not necessarily the least likely to be needed next.
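The basic LRU policy described above can be sketched as follows (a minimal illustration, not a real database implementation; the block IDs and the `read_from_disk` callback are hypothetical):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU block cache: recently used blocks stay in memory,
    the least recently used block is evicted when the cache is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> data, ordered oldest to newest

    def access(self, block_id, read_from_disk):
        if block_id in self.blocks:
            # Cache hit: move the block to the most-recently-used end.
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Cache miss: evict the least recently used block if the cache is full.
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)
        self.blocks[block_id] = read_from_disk(block_id)
        return self.blocks[block_id]
```

Note that a single large scan through many distinct blocks will push everything else out of a cache like this, which is exactly the weakness discussed next.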

Imagine a situation where most users are entering data into an accounting system. In doing so they constantly read from the customers table which contains details of customer names and addresses. In theory these processes should keep the most popular customer details at the head of the LRU. They are always being accessed and are therefore always near the Most Recently Used end of the cache. But halfway through the working day Joe The Manager starts running some reports on the monthly accounting data. These reports are historical reports that scan large amounts of data from the database to calculate totals etc. The reports will cause this historical data, which is accessed only by these reports, to be read from disk and placed at the head of the cache. The commonly accessed data for the customer addresses is now further down the queue and at risk of being put out of memory altogether. A single process has knocked the performance of ten other processes for the sake of reading something that will not be read again.

This kind of problem has always existed in LRU algorithms, but was never seen as a large issue. As more and more large database and data warehousing applications came online, the problem of mixing single large accesses to data with lots of small data accesses began to impact the scalability of the LRU. Mixed implementations of the LRU were tried, for example by splitting the cache into areas to store single large accesses and other areas to store smaller accesses. The success of these changes was severely impacted by the administration burden they added in choosing which data areas were stored in which cache.

The Touch Count Buffer algorithm attempts to fix this problem by assigning a value to data areas based on the number of times the data is accessed rather than when it was last accessed. By assigning a simple counter to each data block and incrementing that counter each time the block is accessed, the database can tell which blocks are accessed most often and try to keep those in memory. If Joe The Manager runs his reports, the data accessed will only overwrite the least popular parts of the cache, preserving the areas that are used most often by other processes.
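The counting idea can be sketched like this (a simplified illustration; a real database adds thresholds and timing rules on top of the raw counter):

```python
class TouchCountCache:
    """Sketch of a touch-count buffer: each block carries an access counter,
    and the block with the lowest count is evicted when space is needed."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}  # block_id -> data
        self.counts = {}  # block_id -> number of accesses ("touches")

    def access(self, block_id, read_from_disk):
        if block_id in self.blocks:
            self.counts[block_id] += 1  # touch: bump the popularity counter
            return self.blocks[block_id]
        if len(self.blocks) >= self.capacity:
            # Evict the least popular block, not the least recently used one.
            victim = min(self.counts, key=self.counts.get)
            del self.blocks[victim], self.counts[victim]
        self.blocks[block_id] = read_from_disk(block_id)
        self.counts[block_id] = 1
        return self.blocks[block_id]
```

Here Joe's report blocks, each touched once, can only ever displace other once-touched blocks; the frequently touched customer blocks survive.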

The Touch Count Buffer is typically implemented around an existing LRU algorithm to allow the two to mix. The cache is split into segments of Hot and Cool areas. The Hot cache contains data which is accessed most often and is least likely to need flushing. The Cool area contains data that has only been accessed a few times and is therefore a good candidate for flushing. The LRU is used within the Hot and Cool cache areas to help determine which blocks are kept when many blocks are no more or less popular than others.
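A sketch of this segmented design might look like the following (the promotion threshold `HOT_THRESHOLD` and the eviction order are assumptions for illustration, not a specific product's behaviour):

```python
from collections import OrderedDict

HOT_THRESHOLD = 2  # hypothetical: touches needed to promote a block to the hot area

class SegmentedCache:
    """Sketch of a hot/cool segmented cache. New blocks enter the cool area;
    blocks touched often enough are promoted to the hot area. Eviction takes
    the least recently used block from the cool area first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cool = OrderedDict()  # block_id -> touch count, in LRU order
        self.hot = OrderedDict()

    def access(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            self.hot[block_id] += 1
            return
        if block_id in self.cool:
            count = self.cool.pop(block_id) + 1
            if count >= HOT_THRESHOLD:
                self.hot[block_id] = count   # promoted: now protected from scans
            else:
                self.cool[block_id] = count  # stays cool, now most recently used
            return
        # New block: evict the LRU cool block if the cache is full.
        if len(self.cool) + len(self.hot) >= self.capacity:
            if self.cool:
                self.cool.popitem(last=False)
            else:
                self.hot.popitem(last=False)
        self.cool[block_id] = 1
```

Because a large scan only ever churns the cool area, the hot area shields popular blocks from it while LRU ordering still breaks ties within each segment.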

The addition of the Touch Count Buffer to databases will hopefully improve the performance of large, multi-user database systems. It adds CPU burden to the machine in calculating which blocks to keep and which to discard, but its benefits in large, complex systems should outweigh these costs.

Again, the straightforward implementation of this technique has a severe drawback: it doesn't cope well with access frequencies that change over time. Why should something be kept in the buffer that is accessed infrequently, just because it used to be accessed frequently last week?

The obvious solution would be to count only the accesses within, say, the last five hours, but this is not feasible: it would mean keeping not just a single counter for each block, but a (possibly huge) number of access timestamps.

A good alternative is a strategy that "ages" the counters, e.g. by decreasing all counters once per minute, but that would require touching every counter in the cache each minute, which would be bad for performance. A tolerable implementation might subtract a fixed larger amount from all counters, or divide them by a constant, at larger intervals. Another good idea might be a hybrid touch count / LRU strategy, in which the "importance" of a block is computed as the average of its touch count and the inverse of the time since it was last accessed.
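Both ideas can be sketched in a few lines (the divisor and the 50/50 weighting in the hybrid score are hypothetical choices that a real system would tune):

```python
def age_counters(counts, divisor=2):
    """Decay all touch counts at a large interval so stale popularity fades.
    Dividing by a constant is far cheaper than tracking per-access timestamps."""
    for block_id in counts:
        counts[block_id] //= divisor

def importance(touch_count, last_access, now):
    """Hybrid touch count / LRU score: the average of the touch count and
    the inverse of the time since the block was last accessed."""
    age = max(now - last_access, 1e-9)  # guard against division by zero
    return (touch_count + 1.0 / age) / 2.0
```

Under the hybrid score, a block that was hammered last week but untouched since scores low, while a block touched moderately often and recently scores high.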
