*Learning objectives*

- Hashing
- Hash functions
- Hashing in LEDA
- Hashed types
- The LEDA rule “Requirements for hashed types”

How can we store a number i and find it again
*as fast as possible*? That is, how can we
decide whether we have stored this number or not? If we had an
infinitely long array A at our disposal, we could simply write
the number i into the i-th element A[i] of the array. The test
whether we have stored i would then take only O(1) time: all
we would have to do is look at position A[i]. If a 0 stands
there, we have not stored i; if the number i stands there, we
have stored it. (In this description, we neglect the problem of
initializing an infinitely long array with zeros.)

However, there is no infinitely long array A. Therefore
we make do with a trick: Conceptually, we
*hash* the array A into pieces (cf. the French verb
“hacher”, to chop), cutting it in front of the
positions m, 2m, 3m, etc., where m is the size of an array T
that we can actually afford given the size of our
main memory, for example m=1000. We place the individual pieces one
below the other (conceptually) so that the elements 0, m,
2m, 3m, etc. of the original array come to lie one below the
other: they are all “hashed” to the element
T[0]. Figure 3.4 illustrates this.

Now, if we want to store a number i, we store it in the
array T at position i mod m, that is, in the element T[i mod
m]. In this context, T is called a *hash table* or simply a
*hash*. The method itself
is called *hashing*. The individual positions
in the hash table are called *slots*. (As the name suggests, and as
we will see shortly, several numbers may fall into one slot,
that is, they may be hashed to the same position.)

A function like h(i)=i mod m that, for an integer i,
calculates the number of the slot in a hash table is called a
*hash function*;
the value h(i) it calculates is called a *hash value*.

A new problem now appears: several i's can be hashed to
the same position. For example, all multiples of m are hashed
to position 0. *Hashing with chaining*
is a method for handling such *collisions*. Here the array T
does not consist of integers, but rather of linked lists
of integers, one list for every slot. Arbitrarily many
numbers can now fall into one slot; they are stored in the
corresponding list. Figure 3.5 shows hashing
with chaining in a hash table with 4 slots and 7 numbers
stored.

To store a number i, h(i) is calculated and i is appended to the h(i)-th list. This never takes longer than O(1) time. Likewise, to look up the number i in the hash, h(i) is calculated and the h(i)-th list is searched for i. The time for this is proportional to the length of the h(i)-th list.

So how long does the h(i)-th list become? In the worst
case, all numbers are hashed to the same slot; the structure
then degenerates into an ordinary linear list. In the best
case, the hash function distributes the numbers
*uniformly* over all m lists. Then, with n
numbers stored, the expected length of every list is n/m and
the mean access time is therefore O(n/m). The quotient n/m is called
the *load
factor*. With a uniform distribution, if n is
smaller than m, that is, if at most as many numbers are
stored altogether as there are slots, the mean access time is
also O(1), that is, constant.

If more and more numbers are stored, that is, if n grows
further, the load factor, and with it the mean access
time O(n/m), grows as well. Eventually, n would be
much larger than m, and the access time could no longer be
considered constant. To prevent this, another
trick is used: to keep the load factor always smaller than 2,
the size of the table is doubled as soon as n is twice as large as
m. This means that all elements of the old table are
“rehashed” into a new one containing twice as
many slots. This is called a *rehash*. Conversely, if a great many
numbers are deleted from the hash and the load factor thereby sinks to
1/2 or below, the size of the table is halved and a
rehash also takes place. Immediately after a rehash, the load
factor is always 1, whether the table size was doubled or
halved.

Such a rehash certainly takes time: it is
proportional to n. One can, however, easily show by an amortized analysis that
on average the costs of a rehash hardly matter:
if its costs are distributed over the access operations
that are carried out after it and
*before* the next rehash takes place, the
analysis shows that the amortized time of an operation is
always O(2·n/m) and can therefore be considered
constant because n/m < 2. Ultimately, this is the reason
why hashing is such a fast method for storing and retrieving
information.

Until now, we have only talked about storing numbers.
However, hashing can actually be used to store and
retrieve arbitrary information; all that is required is that a hash
value, computable by an algorithm, can be assigned to every piece of
information to be stored. The
implementation of the hash then stores not only this hash
value, but together with it a reference to the original
(unhashed) information, which is called a
*key* in this context.

Every compiler, for example, stores the identifiers
occurring in the program text, such as variable names, function
names, etc., in a hash table. For this purpose, a function h
is needed that turns a string into a number that can be
used as a hash value. The simplest way to do this is to
interpret the ASCII values of the characters as one long bit
string (more precisely: as the coefficients of a polynomial
with base 2^{7}).

Hashing is also used as a method to implement an associative container type. For this purpose, a reference to the value associated with the key is additionally stored with every key in the table. For example, for every variable name a compiler stores the type of the variable as given in its declaration.

Of course, the overall method depends on the assumption that the keys are uniformly distributed over the individual slots. For this assumption to be satisfied, the hash function must match the probability distribution with which the keys are drawn from their domain and fed into the hash.

More formally: if P(k) is the probability with which key k is fed into the method, and the P(k) are summed over all k that are hashed to slot j, the sum should be 1/m, and this should hold for every slot j. In other words, 1/m-th of all keys should be hashed to each slot.

The above function h(i) = i mod m, for example, distributes the numbers uniformly over the hash table provided that the numbers originate from an interval that is very large compared to the number of slots and are drawn from it with equal probability.

However, this is not always the case. If one knows from the start, for example, that only even numbers will occur as keys, this function leaves every second slot empty (if m is a power of 2, as it is after doublings and halvings). Here it would be better to use a hash function that first cuts off the last bit of the key.

Moreover, the previous considerations assume that the hash value itself can be calculated in constant time. Of course, the speed at which the hash function works also plays a role in the overall method. The function introduced above, which turns a string into a number, takes more time the longer the string is.

There is no hash function that is suitable for every type of input key and that works equally well for every distribution of the keys. It is always an art to find a good hash function for a certain kind of key and a certain distribution. It would go beyond the scope of this introduction to delve into the construction of good hash functions.

LEDA offers hashing with chaining and table
doubling, encapsulated in the class `h_array`. The
size of the first table is m=1, that is, the data structure
starts with one slot only. (This initial size can be set
to a different value in the constructor.) The hash function
used by LEDA is h(i) = i mod m, in which m is doubled or
halved with every doubling or halving of the table, respectively. So m is
always a power of 2 (unless the initial table size was set to
a value l different from 1; then it always has the form
l·2^{k}). The
input keys i must originate from the interval that is
covered by the C++ type int, which is normally the interval
[-2^{31}..2^{31}-1].

To give the user the possibility to influence the
computation of the hash value, he may prepend a function g of
his own to the function h computing the hash value. LEDA then
hashes the key i to the value h(g(i)). g must return integer
numbers of the type `int`. So this is a
*two-level method*.

For the number types of C++, g is predefined as the
identity function g(i)=i. For pointers to `void`, g is
predefined as the function that interprets the address stored in the
pointer as an integer number and returns it. If the user wants
to modify g for these types, he must overwrite the function

int Hash(const int& i);

or its equivalents for the other number types.

According to the LEDA rule “Requirements for hashed types”, to make a self-defined
type `T` a *hashed
type* and thereby be able to use it as
the key type of an `h_array`, the user
must implement the function

int Hash(const T& i);

This function must be put into the namespace
`leda`. It is then used by LEDA as the input g(i) of
the hashing h.

Furthermore, the equality operator

bool operator == (const T&, const T&);

must be defined for the type `T`. This is due to the
following: if two different keys k1 and k2 with the associated values
v1 and v2, respectively, are hashed to the same slot in the
associative container of type `h_array`, then v1
and v2 must nevertheless be accessible later via the keys k1 and k2,
respectively, without mixing these values up. For this, however, k1 and
k2 must be distinguishable within the implementation of the hash,
which cannot be done by means of their hash values, because these are
equal. Therefore, the operator `==` must provide
the means of distinction.

Additionally, the operator `==` must
realize an equivalence relation on `T`. Moreover,
for all objects `x` and `y` of the type `T` the
following must hold: if `x == y`, then also
`Hash(x) == Hash(y)`. In other words, if two objects
are equal, the hash function must not consider them
to be different.

Furthermore, with the class `map`
LEDA offers a particularly fast implementation for keys
of integral type, of pointer type, or of item type. The
method used there is *one-level* and
cannot be influenced by the user: the keys k are interpreted
as 32-bit values and k is hashed to the position k mod
2^{m}, where 2^{m} is the current
number of slots. This method also works with table
doubling. In general, it is considerably faster than the
two-level method of `h_array` because the
hash function is hardwired; in exchange, one is very
restricted here when choosing the key type.