Hashing in LEDA
The LEDA rule “Requirements for hashed types”
How can we store a number i and find it again as fast as possible? That is, how can we decide whether we have stored this number or not? If we had an infinitely long array A at our disposal, we could simply write the number i into the i-th element A[i] of the array. The test of whether we have stored i would then take only time O(1): all we would have to do is look at position A[i]. If a 0 stands there, we have not stored i; if the number i stands there, we have stored it. (We neglect the problem of initializing an infinitely long array with zeros in this description.)
However, there is no infinitely long array A. Therefore we resort to a trick: conceptually, we chop (cf. the French word “haché”, chopped) the array A into pieces, in front of the positions m, 2m, 3m, etc., where m is the size of an array T that we can actually afford given the size of our main memory, for example m = 1000. We place the single pieces one below the other (conceptually), so that the elements 0, m, 2m, 3m, etc. of the original array come to lie one below the other: they are all “hashed” to the element T[0]. Figure 3.4 illustrates this.
Now, if we want to store a number i, we store it in the array T at position i mod m, that is, in the element T[i mod m]. In this context, T is called a hash table or simply a hash, and the method itself is called hashing. The individual positions in the hash table are called slots. (As the name suggests, and as we will see shortly, several numbers may fall into one slot, that is, they may be hashed to the same position.)
A new problem now appears: several numbers i can be hashed to the same position h(i) = i mod m. For example, all multiples of m are hashed to position 0. Hashing with chaining is a method for handling such collisions: here the array T does not consist of integer numbers, but of linked lists of integer numbers, one list for every slot. Arbitrarily many numbers can now fall into one slot; they are stored in the corresponding list. Figure 3.5 shows hashing with chaining in a hash table with 4 slots and 7 numbers stored.
To store a number i, h(i) is calculated and i is appended to the h(i)-th list. This never takes longer than time O(1). Likewise, to look up the number i in the hash, h(i) is calculated and the h(i)-th list is searched for the number i. The time for this is proportional to the length of the h(i)-th list.
So how long does the h(i)-th list become? In the worst case, all numbers are hashed to the same slot; the structure then degenerates into an ordinary linear list. In the best case, the hash function distributes the numbers uniformly over all m lists. Then, with n numbers stored, every list has length n/m, and the mean access time is therefore O(n/m). The quotient n/m is called the load factor. With a uniform distribution, if n is smaller than m, that is, if at most as many numbers are stored altogether as there are slots, the mean access time is O(1), that is, constant.
If more and more numbers are stored, that is, if n grows further, the load factor, and with it the mean access time O(n/m), becomes larger and larger. At some point n would be much larger than m, and the access time could no longer be considered constant. To prevent this, another trick is used: to keep the load factor always smaller than 2, the size of the table is doubled as soon as n is twice as large as m. All elements of the old table are then “rehashed” into a new one containing twice as many slots; this is called a rehash. Conversely, if a great many numbers are deleted from the hash and the load factor thereby sinks to 1/2 or below, the size of the table is halved and a rehash also takes place. Immediately after a rehash, the load factor is therefore always 1, whether the table size was doubled or halved.
Such a rehash certainly takes time: it is proportional to n. One can, however, easily show by an amortized analysis that on average the costs of a rehash hardly show up: if one distributes its costs over the access operations that are carried out after it and before the next rehash, the analysis shows that the amortized time of an operation is always O(2·n/m), which can be considered constant because n/m < 2. This, finally, is the reason why hashing is such a fast method for storing and retrieving information.
Until now, we have only talked about storing numbers. However, hashing can be used to store and retrieve arbitrary information; all that is required is that a hash value computable by an algorithm can be assigned to every piece of information to be stored. The implementation of the hash then stores not only this hash value, but together with it a reference to the original (unhashed) information, which in this context is called a key.
Every compiler, for example, stores the identifiers occurring in the program text, such as variable names and function names, in a hash table. For this purpose, a function h is needed that turns a string into a number usable as a hash value. The simplest way to do this is to interpret the character codes of the string as one long bit string (more precisely: as the coefficients of a polynomial with base 27).
Hashing is also used as a method to implement an associative container type. For this purpose, a reference to the value associated with the key is additionally stored with every key in the table.
For example, for every variable name a compiler stores the type of the variable as it is defined in its declaration.
Of course, the whole method depends on the assumption that the keys are distributed uniformly over the individual slots. For this assumption to be satisfied, the probability distribution with which the keys are drawn from their domain and fed into the hash must match the hash function used.
More formally: if P(k) is the probability with which key k is fed into the method, and if the P(k) are summed over all k that are hashed to slot j, the sum should be 1/m, and this should hold for every slot j. In other words, 1/m-th of all keys should be hashed to each slot.
The above function h(i) = i mod m, for example, distributes numbers uniformly over the hash table provided the numbers themselves originate from an interval that is very large compared to the number of slots and are all drawn from it with equal probability.
However, this is not always the case. If one knows from the start, for example, that only even numbers will occur as keys, this function would leave every second slot empty (if m is a power of 2 as a result of doublings and halvings). Here it would be better to use a hash function that first cuts off the last bit of the key.
Moreover, the previous considerations assume that the hash value itself can be calculated in constant time. Of course, the speed at which the hash function works also enters into the overall method. The function introduced above, which turns a string into a number, takes time proportional to the length of the string.
There is no hash function that is suitable for every type of input key and that works equally well for every distribution of the keys. It is always an art to find a good hash function for a certain kind of key and a certain distribution. It would go beyond the scope of this introduction to delve into the construction of good hash functions.
LEDA offers hashing with chaining and table doubling, encapsulated into the class h_array. The size of the first table is m = 1, that is, the data structure starts with one slot only. (This initial size can be set to a different value in the constructor.) The hash function used by LEDA is h(i) = i mod m, in which m is doubled or halved with every doubling or halving of the table, respectively. So m is always a power of 2 (unless the initial table size was set to a value l different from 1; then it always has the form l·2^k). The input keys i must originate from the interval that is covered by the C++ type int, that is, normally the interval [-2^31, 2^31 - 1].
To give the user the possibility to influence the computation of the hash value, he may prepend a function g of his own to the function h computing the hash value. LEDA then hashes the key i to the value h(g(i)). g must return integer numbers of the type int. So this is a two-step method.
For the number types of C++, g is predefined as the identity function g(i) = i. For pointers to void, g is predefined as the function that interprets the address of the pointer as an integer number and returns it. If the user wants to modify g for these types, he must overwrite the function
int Hash(const int& i);
According to the LEDA rule “Requirements for hashed types”, to make a self-defined type T a hashed type as well, and thereby to be able to use it as the key type of an h_array, the user must implement the function
int Hash(const T& i);
This function must be put into the namespace leda. It is then used by LEDA as the input g(i) of the hashing h.
Furthermore, the equality operator
bool operator == (const T&, const T&);
must be defined for the type T. This is due to the following: if two different keys k1 and k2 with the associated values v1 and v2, respectively, are hashed to the same slot in the associative container of type h_array, then v1 and v2 shall nevertheless be accessible later via the keys k1 and k2, respectively, without mixing these values up. For this, however, k1 and k2 must be distinguishable within the implementation of the hash, which cannot be done by means of their hash values - these are equal. Therefore, the operator == must provide the possibility of distinguishing them.
Additionally, the operator == must realize an equivalence relation on T, and for all objects x and y of the type T the following must hold: if x == y, then also Hash(x) == Hash(y). In other words, if two objects x and y are equal, the hash function must not consider them as being different.
Furthermore, with the class map, LEDA offers a particularly fast implementation for keys of integral type, of pointer type, or of item type. The method used there is one-step and cannot be influenced by the user: the keys k are interpreted as 32 bit values, and k is hashed to the position k mod 2m, in which m is the current number of slots. This method also works with table doubling. In general, it is fundamentally faster than the two-step method of h_array because the hash function is hardwired; in exchange, one is very restricted here in the choice of the key type.
Work out in detail why, with the doubling and halving strategy, the load factor really always remains between the values 1/2 and 2, and why, in the long run (that is, with respect to many subsequent operations), the costs of a rehash do not show up fundamentally (that is, asymptotically by no more than a constant factor).
Identifiers that do not differ by more than one or a few characters, like