Hash tables are used in computer software as a data structure for storing and retrieving data very quickly.

Data to be stored in a hash table requires a key. For example, the key for a person's data record might be their social security number.

Given the key, the hash table data structure can find a piece of data in one or at most a few key comparisons (expected O(1) time, independent of the size of the table or the number of entries in it).

At its heart, a hash table consists of an array of slots. The number of slots in the array is usually somewhat larger than the expected number of stored entries.

Each slot can hold a linked list of the data (this is the normal case) or just the data you wish to access (rarer; this is known as an open-addressed hash table).

To insert a data entry into the hash table, a hash function is run on the key of the data to generate an integer called a hash. The hash, taken modulo the length of the table, is used to index into the array. The key and data are then placed in the linked list at that position (or, in the case of an open-addressed table, inserted into the array at that point, or as near to it as possible).

Similarly, to retrieve an entry, its key is hashed to determine the position at which to search. The list at that position is then searched until the entry with the matching key is found. There are usually only one or a very few data records at or around that point in the table, so this is very fast.
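
To make this concrete, here is a minimal sketch in Python of the insert and retrieve steps just described, using the linked-list approach; the table size, function names and example key are purely illustrative.

    TABLE_SIZE = 11                           # usually larger than the expected number of entries
    table = [[] for _ in range(TABLE_SIZE)]   # each slot holds a list of (key, data) pairs

    def insert(key, data):
        index = hash(key) % TABLE_SIZE        # hash the key, then take it modulo the table length
        table[index].append((key, data))      # add the pair to the end of the list at that slot

    def retrieve(key):
        index = hash(key) % TABLE_SIZE        # the same hash computation locates the right slot
        for k, d in table[index]:             # search the (usually very short) list
            if k == key:
                return d
        return None                           # no entry with this key

    insert("078-05-1120", "Alice's record")   # a made-up social security number as the key
    print(retrieve("078-05-1120"))            # -> "Alice's record"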

One complication with hash functions is known as a collision. A collision occurs when two keys hash to the same array location (which usually happens to some degree, unless a perfect hash function can be employed). With the linked-list approach, the key/data pair just gets added to the end of the list already at that location. If open addressing is used, the key/data pair is placed in the next free hole in the array.

Because the hash function typically runs quickly and an array lookup is also a fast operation, inserting into, deleting from and retrieving from a hash table are all very fast; the hash function is often the slowest operation performed. However, the key comparisons can be slow too, for example when the keys are long strings. Note that issues such as paging and cache limitations can arise on large tables.

In practice, open addressing is rarely employed: an open-addressed hash table has a maximum size, whereas linked-list hash tables have no limit to their capacity except available memory; also, performance degrades much more slowly with linked lists as more entries are added (open-addressed tables basically crawl to a halt as the table fills). Still, open-addressed tables have certain advantages, and sometimes the extra space and extra indirection used by linked lists is bothersome.

Even though a hash table is a fairly wonderful data structure, it has been used somewhat infrequently: in most cases, modern computers are fast enough to get away with linear searching. It is only when the combination of more than a few hundred data entries and extreme speed is called for that a hash table becomes necessary. Still, languages like Perl have the data structure built in, and hence its use is in fact increasing.

For more details about hash tables, I recommend you consult The Art of Computer Programming by Donald Knuth.

The hash table belongs to the abstract type of data structures called the table or dictionary. In tables, information is stored in (key, value) pairs, in which the key is the 'handle' used to get at the useful data stored as the value.

If the keys obey certain criteria, this functionality is very easy to achieve. All that is required is to use the key's value to directly specify a cell in a simple vector.
For example, if you wanted to store the numbers of wheels for different vehicles:

        ----------------------------
keys:   |    2    |    3     |  4  |
        ----------------------------
values: | bicycle | tricycle | car |
        ----------------------------

So that get(3) would return "tricycle" as required. This operation would only require a single lookup into an array, making the entire process O(1).
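
In Python, for instance, this scheme amounts to nothing more than an ordinary list lookup (a minimal sketch; padding unused indices with None is just one way to do it):

    values = [None, None, "bicycle", "tricycle", "car"]   # index 2 -> "bicycle", 3 -> "tricycle", 4 -> "car"

    def get(key):
        return values[key]    # the key is used directly as the array index: a single O(1) lookup

    print(get(3))             # -> "tricycle"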

However, this implementation makes some massive assumptions about the keys that are unreasonable in the real world.
The criteria that need to be met are:

- the keys must be non-negative integers, so that they can be used directly as array indices;
- the keys must fall within a known and reasonably small range, so that the vector can be allocated up front;
- the keys must be densely packed within that range, otherwise most of the vector's space is wasted.
These requirements are so restrictive as to make this implementation impractical in all but a very limited set of cases.

An alternative approach keeps the pairs in a vector, sorted according to the keys' values. This means that the keys can be any comparable type - a much less strict requirement than those detailed above.
If the inverse of the above table were to be created, the sorted vector would look like this:

        ----------------------------
  keys: | bicycle | car | tricycle |
        ----------------------------
values: |    2    |  4  |    3     |
        ----------------------------

So that get("tricycle") returns the value 3, as required.
This lookup would be performed using a binary chop on the vector, making the operation O(log n).
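
A sketch of this in Python, using the standard bisect module for the binary chop (the parallel keys/values vectors mirror the table above):

    import bisect

    keys   = ["bicycle", "car", "tricycle"]   # kept sorted
    values = [2, 4, 3]                        # values[i] belongs to keys[i]

    def get(key):
        i = bisect.bisect_left(keys, key)     # binary chop: O(log n) comparisons
        if i < len(keys) and keys[i] == key:
            return values[i]
        raise KeyError(key)

    print(get("tricycle"))                    # -> 3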

However, hash tables boast even better performance than this, but through a totally different method: while sorted arrays rely on order, a hash table relies on chaos.

The reason why hash tables work so well is that the key is converted, simply and neatly, from any sort of object into an integer obeying the three requirements listed before.
To do this, the key is fed into a hash function, which returns an integer that is, ideally, unique to that key. This integer is then used to index an array directly, as in the example outlined above. The beauty of this system is that the hash function is very cheap to compute, as is the array lookup, so a get(key) or set(key, value) operation is O(1) on average!
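
What such a hash function looks like varies from implementation to implementation; the sketch below shows one common style, a simple polynomial hash over a string's characters, and should not be taken as the hash function any particular library actually uses:

    def string_hash(key, table_size):
        h = 0
        for ch in key:
            h = (h * 31 + ord(ch)) % table_size   # mix each character into the running hash
        return h                                  # an index in the range [0, table_size)

    print(string_hash("tricycle", 101))           # some bucket index between 0 and 100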

If a hash table is being used to map the type of vehicle onto the values "two", "three", "four", etc., this is how it would look:

                   (key, value)
                        |
                        |
                  Hash Function
                        |
                        |
             --------------------
             |          |       |
             V          V       V
        -----------------------------
  keys: | bicycle | tricycle | car  |
        -----------------------------
values: |   two   |  three   | four |
        -----------------------------

When an operation is called, the key is converted by the hash function into a hash value which is then used to access a cell in the vector, or bucket, all in O(1) time (on average).

However, this is a much simplified overview of the workings of a hash table. There are a few 'gotchas' and glitches which need to be sorted out. I will look at one here.

Collisions

One problem is that of an imperfect hash function. In reality, an infinite number of different keys can map to the same hash value, and something must be done to cope with the situation where two different keys produce the same result from the hash function. This is called a collision.
There are two different methods employed here, giving rise to two types of hash table:

Open hash table

Open hash tables don't simply store the (key, value) pairs in an array; instead, they store an array of pointers, each leading to a linked list. Each element in the list contains a (key, value) pair and a pointer to the next list element.

The hash table's array of pointers:
        -------------------
... pointer | pointer | pointer ...
        -------------------
                 |
                 |
                 V
      ----------------------------
L     | key1 | value1 | pointer1 |
I     ----------------------------
N                |
K                |
E                V
D     ----------------------------
      | key2 | value2 | pointer2 |
L     ----------------------------
I                |
S                |
T                V
               null

When an operation is performed, the pointer is followed into the linked list, which is searched linearly for the relevant (key, value) pair.
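
A minimal Python sketch of this structure, with explicit list nodes as in the diagram (the size and names are illustrative; here new pairs are prepended to the chain rather than appended, which works just as well):

    class Node:                                     # one element of a bucket's linked list
        def __init__(self, key, value, next_node=None):
            self.key, self.value, self.next = key, value, next_node

    SIZE = 8
    buckets = [None] * SIZE                         # the array of pointers; None marks an empty chain

    def put(key, value):
        i = hash(key) % SIZE
        buckets[i] = Node(key, value, buckets[i])   # prepend a new node to the chain

    def get(key):
        node = buckets[hash(key) % SIZE]            # follow the pointer into the linked list
        while node is not None:                     # linear search along the chain
            if node.key == key:
                return node.value
            node = node.next
        return None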

This approach has the advantage that the hash table has no maximum capacity. As more and more entries are added, and more and more collisions occur, the linked lists just grow in size. This means performance will degrade, but the hash table won't die.

Closed Hash Table

This implementation of a hash table has a fixed size, but will offer better performance when sparsely filled.
Instead of having a list of pairs for each bucket, when a collision occurs, some algorithm is employed to work out an alternative bucket for this key. One simple, and quite inefficient, method is to traverse the vector linearly until a free bucket is found, and use that one; i.e., if a bucket is full already, the next empty one is used instead.
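
A sketch of this linear-probing scheme in Python (illustrative only; it omits deletion and resizing, and assumes the table never completely fills):

    SIZE = 8
    slots = [None] * SIZE                      # each bucket holds a single (key, value) pair or None

    def put(key, value):
        i = hash(key) % SIZE
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % SIZE                 # bucket full: try the next one along
        slots[i] = (key, value)                # first free (or matching) bucket found

    def get(key):
        i = hash(key) % SIZE
        while slots[i] is not None:            # an empty bucket means the key is absent
            if slots[i][0] == key:
                return slots[i][1]
            i = (i + 1) % SIZE
        return None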

If a malicious user or cracker learns what your hash function is, then the pathological case for both open and closed hash tables can easily be triggered by constructing keys that all hash to the same value. If this happens, operations are O(n) instead of O(1): a disastrous performance hit, which could have massive consequences for the effectiveness of the entire system.

As open hash tables offer better performance on average and are guaranteed not to saturate, they are chosen more often than closed hash tables.

The Uses of Hash Tables

Although the general idea behind hash tables is simple, there are a number of tricky implementation details that must be adequately handled to achieve an acceptable level of robustness. For example: what hash function should you use? How should you dynamically resize the table to strike a balance between size and speed? Open or closed? If closed, what collision-recovery algorithm?
Fortunately, implementations now exist for almost all mainstream languages (Java, C++, C, Perl and Python, to name some), so their use is increasing.
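
In Python, for example, the built-in dict type is a ready-made hash table:

    wheels = {"bicycle": 2, "tricycle": 3, "car": 4}   # dict is a hash table under the hood
    wheels["unicycle"] = 1                             # set(key, value)
    print(wheels["tricycle"])                          # get(key) -> 3
    print("car" in wheels)                             # membership test, O(1) on average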

Why use hash tables?

There exists a common, erroneous attitude that because modern computers are so fast, high-performance data structures and algorithms are becoming unnecessary. However, if one program runs 10 times faster than another on an 8086, it will probably run 10 times faster on a modern CPU. The absolute difference is smaller, but the relative speed will remain fairly constant. As modern processing jobs are much larger than the jobs the 8086 was expected to do, the relative speed of a bad programme versus a good one is just as relevant now as it was back then.
For this reason, the most efficient data structures and algorithms should be used in all but the simplest cases.

Although hash tables boast O(1) operations, this does not mean they are as fast as plain O(1) array lookups, or even as fast as some small O(log n) sorted-vector or binary-tree lookups. The invisible constant factor in big-O notation is increased by the need to compute the hash function every time an operation is performed. For reasons discussed below, when the pattern and type of use of a data structure is well known at design time, it may well be that slightly more exotic data structures offer better performance. That said, a well-implemented hash table offers excellent scalability, relaxed key requirements and, above all, efficiency. Hash tables are a good choice for a general-purpose yet functional data structure.

Why not use hash tables?

As mentioned above, if the data structure is going to be used for a very well defined and specific task, choosing a hash table could be the wrong way to go. For example, when storing a large set of words (as an OCR programme might), ternary search trees typically offer better performance than hash tables.

Thanks to pfft for this point:
Another problem with hash tables is due to the pattern in which they store data in memory or on disk.
It has been known for a while that, very often, data is accessed in long, continuous stretches. Think how many times you have looped along the length of an array, performing a task on each element!
For this reason, computer designers now use caching to speed up reads and writes to areas of memory and disk which are close to recent accesses. Caching does cost CPU time and bus bandwidth, but in the long run, under normal usage patterns, it vastly improves performance.

With hash tables too, people will often iterate through an entire table, accessing each element. However, due to the randomizing effect of the hash function, sequential keys will be scattered all over the system: numerous caches, memory, hard disk. As a result, even though sequential keys are being accessed, as the designers expected, the memory accesses aren't sequential, so not only is caching useless, it is expensive!

To combat this problem, the hash table would have to be combined with something like dynamic hashing, reducing disk reads. Otherwise, some other data structure where items are kept in an order could be used, such as 2-3-4 trees, heaps or splay trees.


NB: This writeup describes open and closed hash tables the other way around from some other sources. This reflects the fact that there is no standard terminology for which means what.

Thanks to pfft and isogolem for informed advice.


