Learning objectives
The class h_array |
| Persistent and non-persistent variables |
As we already mentioned in the previous section, LEDA
offers an associative container type h_array that
stores keys according to the principle of “hashing with chaining and
table doubling”. For this, the key type must be a hashed type, that is, the functions
namespace leda {
int Hash(const T&);
}and
bool operator == (const T&, const T&);
must be defined.
![]() | Important |
|---|---|
Since the version 4.4 of LEDA the function |
The name h_array stands for
hashing array. The
interface of h_array is almost the same as the
one of d_array.
As a first working example, we once again implement the
program that counts strings from standard input. This time,
however, we use the type h_array to store
the strings and the multiplicity with which these occur. So the
keys are strings, the values are integer numbers:
#include <LEDA/h_array.h>
#include <LEDA/string.h>
#include <iostream>
using leda::h_array;
using leda::string;
namespace leda { // must be in the namespace leda!
int Hash(const string& s) // must be const!
{
return s.length() > 0 ? s[0] : 0;
}
}
int main()
{
h_array<string,int> counter(0);
string s;
while (std::cin >> s) counter[s]++;
std::cout << "I have counted the following strings:\n";
forall_defined (s,counter)
std::cout << s << " " << counter[s] << std::endl;
}
An example run looks as follows:
Karin Julia Marion Hanne Karin Julia Karin Julia
Hanne Marion Julia Julia Andrea
Miriam Julia Julia Hanne Andrea Karin Hanne
I have counted the following strings:
Hanne 4
Marion 2
Miriam 1
Andrea 2
Julia 7
Karin 4
Since the class string is
not a hashed type by default, we must make it
one: We must define the function Hash() on
strings. As a very simple possibility, we calculate the hash value of
a string as the code of the first character in the underlying
character encoding. (In general, this is not a
good hash function! Also see Exercise 35.)
![]() | Note |
|---|---|
However, it is not possible to overwrite
this default version of |
The interface of h_array is exactly
the same as the one of d_array. There is an
additional constructor
h_array<K,V> H(V x, int table_size)
with which, besides the default value x, also the
size of the first hash table can be defined. If this size is not
specified, the hash table starts with one slot.
![]() | Note |
|---|---|
There is no method
|
As we can see by the above output, the macros
forall and forall_defined
iterate over the containers neither in the (lexicographical) order of
the keys, nor in the order of the insertion times of the keys, nor in
the (arithmetical) order of the values.
Of course this is due to the implementation of the data structure as a hash: The key type does not have even to be ordered, but rather hashed, and hashing just has the property to bring (different) keys with an equal hash value together.
Therefore, the ladies Marion and Miriam appear side by side in the above output, although they are far away from one another in the input and although they occur there with a different multiplicity: They have the same hash value and therefore fall into the same slot. Although Julia appears in the input most frequently, she does not occur in the output in first position. Although Karin was inserted first, she does not occur in the output first either. Although Andrea starts with an “A”, she is not the first one either. The first place is due to Hanne - for no particular performances.
As a very nice application of
h_array in the implementation of an algorithm,
we want to write a program that creates Latin (or also German,
English, etc.) random texts. “Random text” means that the
text only looks as if it were Latin,
however, it is not real Latin, like the beautiful phrase “sita
usvila teinis esa vanet”[33].
The basic idea starts out from the following observation: In a long original text, some combinations of characters occur very frequently, some only seldom, some others never. So, for example, in an English text, the characters “so” are very frequently followed by an “m” (among other reasons because of the frequency of the word “some”), but extremely seldom by a “q” (there are only a few, very uncommon English words containing the character substring “soq”).
Now to create a random text in a certain language, one take a very long original text, compute for every occurring combination of a prefix p and a following suffix character s the absolute frequency, choose an arbitrary prefix p and start then to throw suffixes according to the given frequencies. One output the suffix c created last and make it the last character of the next prefix p'. One forget the first character of the old prefix (so c is shifted from the right into p, yielding p'). The length of the prefix is a parameter with which one can play. Depending on the language, different lengths achieve differently good results here.
The following program counts how often a certain
character c follows a prefix string p of length 4 in Cesar's “De
Bello Gallico”. It uses an h_array
H for counting. The key type is
string, the value is itself an
h_array that just counts the frequency
for every character c. (Note how elegantly the subscript operators
[][] can be written behind each other in the
source code - so the name “hashing array” is definitely
justified.)
In a second h_array the program
counts how often each prefix has occurred altogether; this information
is needed to be able to throw the suffixes later. The throwing itself
simulates a non-uniformly distributed random variable here, which, for a
given prefix p, creates a character according to the distribution
given by the relative frequencies of the characters
occurring behind this prefix p in the original text. For this, it
(conceptually) assigns an interval on the x-axis to every character
whose length corresponds to the absolute frequency. These intervals
are put side by side (conceptually); then one of them is randomly
selected, which corresponds to the next suffix to be output.
#include <LEDA/h_array.h>
#include <LEDA/string.h>
#include <LEDA/random_source.h>
#include <cstdlib>
using leda::h_array;
using leda::string;
using leda::random_source;
using std::cout;
h_array<string,h_array<char,int> > H;
h_array<string, int> N;
void shift_in(string& s, char c);
namespace leda { // must be in the namespace leda!
int Hash(const string& s) // must be const!
{
switch(s.length()) {
case 0: return 0;
case 1: return s[0];
case 2: return (s[0] << 8) | s[1];
case 3: return (s[0] << 16) | (s[1] << 8) | s[2];
default: return (s[0] << 24) | (s[1] << 16) | (s[2] << 8) | s[3];
}
}
}
int main(int argc, char* argv[1])
{
// command line arguments: number of characters and length of prefix
int chars_to_generate;
if(argc==1) chars_to_generate = 500;
else chars_to_generate = std::atoi(argv[1]);
int prefix_length;
if(argc<3) prefix_length = 4;
else prefix_length = std::atoi(argv[2]);
// original text
std::ifstream is("Data/DeBelloGallico_preformatted.txt");
char c;
string prefix;
// read and count all characters, do not ignore white space
while (is >> std::noskipws >> c) {
// fill prefix initially
if(prefix.length() < prefix_length) {
prefix += c;
continue;
}
// count this combination and the frequency of this prefix
H[prefix][c]++;
N[prefix]++;
// c becomes last character of next prefix
shift_in(prefix, c);
}
// "Caesar" is a good word to start with ;-)
prefix = "Caesar venit, vidit, vicit";
prefix = prefix.head(prefix_length);
cout << prefix;
int num_gen_chars = 0;
char ch = '#';
// generate chars_to_generate random characters
// always ending with '.'
random_source S;
int rand_int;
while(num_gen_chars < chars_to_generate || ch != '.') {
S.set_range(0,N[prefix]);
// use an implicit random variate here
// could be done better with binary search
// or a combination of arrays and random_variates
S >> rand_int;
forall_defined(ch, H[prefix]) {
rand_int -= H[prefix][ch];
if(rand_int <= 0) {
cout << ch;
num_gen_chars++;
break;
}
}
shift_in(prefix, ch);
}
cout << "\n";
}
// shift all character of s 1 position to the left
// dropping the first one, make c the last character of s
void shift_in(string& s, char c)
{
int i;
for(i = 0; i < s.length() - 1; i++)
s[i] = s[i+1];
s[i] = c;
}
One of the outputs of the program was the following:
Caesare coeperaret, sed omnes palus petum revertilitum esse coactam inferat, copia populis persiarum obtemperat venisset omnibusque Vsipetunt atque celerit Volusent re multo reperunt si a manu munitionis praeda quanis esse negotio praemis mittit cis locis ad fluminis cognitate opinione quae subito passus in oportaliae approponitum incommittit, reliquanos et Drappelli in viarum miserunt. Hoc animo prone circuitus Ambioribus qui nihil nisi proptereundissent et ne qui perfecti sese fossae, praedando Galliae provinciter a Caesarem.
Who feels like it and is good in Latin, may translate this now. However, he should not despair if he finds words or inflections therein that are unknown to him.
Here we use a fundamentally improved version of the hash function for strings (compared to our first one): We include the first 4 characters of a string into the computing of the hash value by mapping every string to the number that is given according to the coding of its first 4 characters. The differences compared with the first version stand out if one compares running times: On the machine of the author the simple version takes 90 seconds to generate a text of 500 characters (including the preprocessing), the version with the improved hash function, however, only 9 seconds!
The access via the subscript operator
[] looks very elegant; its
syntax makes an h_array look like
an ordinary array - after all, the class is called
“hashing array”. It hides a big danger for the
programmer, though: A variable H[i]
referenced by this operator is not persistent:
![]() | Warning |
|---|---|
In an V& x = H[42];
H[42] = y;
if(x == y) { ... }
the test One should therefore never use a reference or a
pointer to a variable in an |
The reason for this inpersistence is due to the following: The subscript operator actually returns a reference to an element of a linked list, namely the one that implements the corresponding slot in the underlying implementation. Such elements may disappear during a rehash (since they are hashed into different slots).
More information about the class h_array can be
found on the corresponding manual page.
Exercise 34. | Experiment in the above program with the length of the prefix (which can be specified on the command line) and with different original texts from different languages. Send the author the funniest outputs! |
Exercise 35. | Compare the above program on your machine with the program in which the simple hash function int Hash(const string& s) { return s.length() > 0 ? s[0] : 0; } is used. Experiment with different hash functions. Who writes the best hash function, that is, the one with which an output of 500 characters is created the fastest? Also compare your versions with the version in
which the container type |
Exercise 36. | Algorithmically speaking, the method employed here to determine the next
character is not optimal. LEDA offers the
class |
Exercise 37. | The hash function for
int Hash(string s, int table_size = MAXINT) {
int size = s.length();
int step = size / 10;
step++;
int i(size), j;
for(j = 0; j < size; j += step) {
i = 131*i + s[j];
i %= table_size;
}
return i;
}Discuss advantages and disadvantages of this implementation. Find inputs on which this hash function behaves particularly badly in the above algorithm. If this is difficult for you, this should convince you that this method behaves quite well in the average case. |
[33] Pronounced in the beautiful dialect of Saarland this sentence means as much as “Looks like Latin, however, it is not.” The author found it impossible to translate this game of words from German to English, and even many Germans may not understand it (in German).