3.3.1. Hashing and hashed types

Learning objectives

Hashing
Hash functions
Hashing in LEDA
Hashed types
The LEDA rule “Requirements for hashed types

The basic idea of hashing

How we can store a number i and find it again as fast as possible, that is, how can we decide whether we have stored this number or not? If we had an infinitely long array A at our disposal, we simply could write the number i into the i-th element A[i] of the array. The test whether we have stored i then would take time O (1) only: All we would have to do is to look at position A[i]. If a 0 is standing there, we have not stored i. If the number i is standing there, we have stored i. (We neglect the problem of initializing an infinitely long array with zeros in this description.)

However, there is no infinitely long array A. Therefore we manage with a trick: Conceptually, we hash (cf. also with the French word “Hashé”) the array A into pieces, in front of the positions m, 2m, 3m, etc., where m is the size of an array T that we can actually afford with respect to the size of our main memory, for example m=1000. We put the single pieces one below the other (conceptually) so that the elements 0, m, 2m, 3m, etc. of the original array are placed one below the other: They are all “hashed” to the element T[0]. Figure 3.4 illustrates this.

Figure 3.4. Hashing

Hashing

The basic idea of hashing: Hash an infinitely large array into pieces and put the single pieces below each other. Positions i with an equal hash value are placed one below the other. Here the hash value is i mod m.


Now, if we want to store a number i, we store it in the array T at position i mod m, that is in the element T[i mod m]. In this context, T is denoted as a hash table or simply as a hash. The method itself is denoted as hashing. The individual positions in the hash table are denoted as slots. (As the name suggests and as we will see at once, several numbers may fall into one slot, that is, they may be hashed to the same position.)

A function like h(i)=i mod m that, for an integral i, calculates the number of the slot in a hash table is called a hash function; the value h(i) calculated by it is called a hash value.

Hashing with chaining and table doubling

A new problem appears now: Several i's can be hashed to the same position. For example, all multiples of m are hashed to position 0. Hashing with chaining is a method to handle such collisions. Here the array T does not consist of integer numbers, but rather of linked lists of integer numbers, one list for every slot. Arbitrarily many numbers can fall into one slot now; they are stored in the corresponding list. Figure 3.5 shows hashing with chaining in a hash table with 4 slots and 7 numbers stored.

Figure 3.5. Hashing with chaining

Hashing with chaining

Hashing with chaining: A hash table with 4 slots and 7 numbers stored. The hash function is h(i)=i mod 4. It is, for example, h(17)=1 and h(-4)=0. The load factor is 7/4. With a doubling strategy, insertion of another number would lead to a rehash into a table of size 8.


To store a number i, h(i) is calculated and i is appended to the h(i)-th list. This does never take longer than time O(1). Likewise, to look up the number i in the hash, h(i) is calculated, too, and the h(i)-th list is searched for the number i. The time for this is proportional to the length of the h(i)-th list.

So how long does the h(i)-th list become? In the worst case, all numbers are hashed to the same slot; the structure then degenerates into an ordinary linear list. In the best case, the hash function distributes the numbers uniformly over all m lists. Then, with n numbers stored, the length of every list is n/m and therefore the mean access time is O(n/m). The quotient n/m is denoted as the load factor. With a uniform distribution, if n is smaller than m, that is, if there are at most as many numbers stored altogether as there are slots, the mean access time is also O(1), that is constant.

If more and more numbers are stored, that is, if n grows further, also the load factor, and with it the mean access time O(n/m), become larger and larger. Sometime, n would be much larger than m and the access time then could not be considered as constant any more. To prevent this, another trick is used: To keep the load factor always smaller than 2, the size of the table is doubled if n is twice as large as m. This means that all elements of the old table are “rehashed” into a new one containing twice as many slots. This is called a rehash. Contrary, if great many numbers are deleted from the hash and the load factor sinks to 1/2 or below thereby, the size of the table is halved and a rehash also takes place. Immediately after a rehash, the load factor is always 1, be it after a doubling or a halving of the table size.

Such a rehash absolutely may take time - it is proportional to n. One can, however, easily show by an amortized analysis that the costs of a rehash hardly manifest themselves on the average: If one distributes its costs on the access operations that can be carried out after it and before the next rehash takes place, the analysis shows that the amortized time of an operation is always O(2·n/m) and therefore can be considered as constant because of n/m < 2. Finally, this is the reason why hashing is such a fast method for storing and retrieving information.

Numbers and other keys

Until now, we have only talked about numbers to be stored. However, hashing can actually be used to store and to retrieve arbitrary information - only a hash value h that can be calculated by an algorithm must be assignable to every information to be stored. The implementation of the hash then stores not only this hash value h, but together with it also a reference to the original (unhashed) information, which is denoted as a key in this context.

Every compiler, for example, stores the identifiers occurring in the program text, like variable names, function names etc., in a hash table. For this purpose, a function h is necessary that turns a string into a number that can be used as a hash value. The simplest way to do this is to interpret the ASCII values of the characters as a long bit string (to be more precise: as the coefficients of a polynomial with base 27).

Hashing is also used as a method to implement an associative container type. For this purpose, a reference to the value associated to the key is additionally stored with every key in the table.

For example, for every variable name a compiler stores the type of the variable as it is defined in its declaration.

Good and bad hash functions

Of course the overall method all depends on the assumption that the keys are uniformly distributed over the individual slots. For this assumption to be satisfied, the probability distribution with which the keys are drawn from their domain and fed into the hash and the hash function used must match each other.

More formally speaking: If P(k) is the probability with which key k is fed into the method, and if all P(k)'s are summed up for the k's that are hashed to slot j, then a sum of 1/m should result, and this should hold for all slots j. This means in other words that 1/m-th of all keys are hashed to each slot.

The above function h(i) = i mod m, for example, distributes numbers uniformly over the hash table if only the numbers themselves originate from a very large (compared to the number of slots) interval and if they are all drawn from it with equal probability.

However, this is not always the case. If one knows for example from the start that only even numbers will occur as keys, this function would leave every second slot empty (if m is a power of 2 as a result of doublings and halvings). Here it would be better to use a hash function that cuts off the last bit of the key first.

Moreover, the previous considerations assume that the hash value itself can be calculated in constant time. Of course, the speed at which the hash function works also plays a role in the overall method. The function introduced above, which makes a number from every string, takes the more time the longer the string is.

There is no hash function that is suitable for every type of input key and that works equally well for every distribution of the keys. It is always an art to find a good hash function for a certain kind of key and a certain distribution. It would go beyond the scope of this introduction to delve into the construction of good hash functions.

Hashing in LEDA

LEDA offers hashing with concatenation and table doubling, encapsulated into the class h_array. The size of the first table is m=1, that is, the data structure starts with one slot only. (This initial size can be adapted to a different value in the constructor.) The hash function used by LEDA is h(i) = i mod m, in which m is doubled or halved with every doubling or halving, respectively. So m is always a power of 2 (unless the initial table size was set to a value l different from 1; then it always has the form l·2k). The input keys i must originate from the interval that is covered by the C++ type int, that is normally the interval [-231..231-1].

To give the user the possibility to influence the computation of the hash value, he may prepend a function g of his own to the function h computing the hash value. LEDA then hashes the key i to the value h(g(i))). g must return integer numbers of the type int. So this is a two-level method.

For the number types of C++, g is predefined as the identity function g(i)=i. For pointers to void, g is predefined as the function that interprets the address of the pointer as an integer number and returns it. If the user wants to modify g for these types, he must overwrite the function

int Hash(const int& i);

or its equivalents for the other number types.

According to the LEDA rule “Requirements for hashed types”, to make a self-defined type T a hashed type also and thereby to be able to use it as the key type of an h_array, the user must implement the function

int Hash(const T& i);

This function must be put into the namespace leda. It is then used by LEDA as the input g(i) of the hashing h.

Furthermore, the equality operator

bool operator == (const T&, const T&);

must be defined for the type T. This is due to the following: If two different keys k1 and k2 with the associated values v1 and v2, respectively, are hashed to the same slot in the associative container of type h_array, then v1 and v2 shall nevertheless be accessible later via the keys k1 and k2, respectively, without mixing these values up. However, for this k1 and k2 must be distinguishable within the implementation of the hash, which cannot be carried out by means of their hash values - these are equal. Therefore, the operator == must implement the possibility for distinction.

Additionally, the operator == must realize an equivalence relation on T. In addition, for all objects x and y of the type T the following must hold: If x == y, then also Hash(x) == Hash(y).

If, in contrast, two objects x and y are equal, that is if x == y holds, the hash function should not consider them as being different. Then Hash(x) == Hash(y) must also hold.

Furthermore, with the class map LEDA offers a particularly fast implementation for keys of integral type, of pointer type, or of item type. The method used therein is one-level and cannot be influenced by the user: The keys k are interpreted as 32 bit values and k is hashed to the position k mod 2m, in which m is the current number of slots. This method also works with table doubling. In general, it is fundamentally faster than the two-step method of h_array because the hash function is hardwired; in exchange, one is very restricted here when choosing the key type.

Exercises

Exercise 32.

Work out in detail why the load factor really always remains between the values 1/2 and 2 with the doubling and halving strategy, and why, in the long run (that is with respect to many operations to follow), the costs of a rehash do not manifest themselves fundamentally (that is asymptotically by not more than a constant factor).

Exercise 33.

Identifiers that do not differ by more than one or some few characters, like counter0 and counter1, frequently occur coexisting in the same source code. A good hash function should minimize the probability that these identifiers are hashed to the same slot. Design a hash function that accomplishes this.