3.3.2. Hashing arrays (class h_array)

Learning objectives

The class h_array
Persistent and non-persistent variables

As we already mentioned in the previous section, LEDA offers an associative container type h_array that stores keys according to the principle of “hashing with chaining and table doubling”. For this, the key type must be a hashed type, that is, the functions

namespace leda {
int Hash(const T&);
}

and

bool operator == (const T&, const T&);

must be defined.

[Important]Important

Since the version 4.4 of LEDA the function Hash() must be put into the namespace LEDA.

The name h_array stands for hashing array. The interface of h_array is almost the same as the one of d_array.

As a first working example, we once again implement the program that counts strings from standard input. This time, however, we use the type h_array to store the strings and the multiplicity with which these occur. So the keys are strings, the values are integer numbers:

Filename: StringCount.C
LEDA users as of version 5.0: Note that header files are now partitioned into modules (subfolders)!
#include <LEDA/h_array.h>
#include <LEDA/string.h>
#include <iostream>

using leda::h_array;
using leda::string;

namespace leda { // must be in the namespace leda!

int Hash(const string& s) // must be const!
{
  return s.length() > 0 ? s[0] : 0; 
}

}

int main()
{
  h_array<string,int> counter(0);
  string s;

  while (std::cin >> s) counter[s]++;

  std::cout << "I have counted the following strings:\n";

  forall_defined (s,counter) 
    std::cout << s << " " << counter[s] << std::endl;
}

An example run looks as follows:

Karin Julia Marion Hanne Karin Julia Karin Julia
Hanne Marion Julia Julia Andrea
Miriam Julia Julia Hanne Andrea Karin Hanne
I have counted the following strings:
Hanne 4
Marion 2
Miriam 1
Andrea 2
Julia 7
Karin 4

Since the class string is not a hashed type by default, we must make it one: We must define the function Hash() on strings. As a very simple possibility, we calculate the hash value of a string as the code of the first character in the underlying character encoding. (In general, this is not a good hash function! Also see Exercise 35.)

[Note]Note

string is a hashed type as of version 6.2 of LEDA. The built-in hash function works as follows: It conceptually partitions the string into 10 partial strings of equal length and then lets the character code of the first character of every partial string influence the computing of the hash value. See also Exercise 37.

However, it is not possible to overwrite this default version of Hash() in order to write a version of one's own that is suited to one's special problem definition.

The interface of h_array is exactly the same as the one of d_array. There is an additional constructor

h_array<K,V> H(V x, int table_size)

with which, besides the default value x, also the size of the first hash table can be defined. If this size is not specified, the hash table starts with one slot.

[Note]Note

There is no method set_default_value() in version 6.2 of LEDA with which the default value could be modified afterwards. However, this shall be available as of one of next versions.

Order of iteration

As we can see by the above output, the macros forall and forall_defined iterate over the containers neither in the (lexicographical) order of the keys, nor in the order of the insertion times of the keys, nor in the (arithmetical) order of the values.

Of course this is due to the implementation of the data structure as a hash: The key type does not have even to be ordered, but rather hashed, and hashing just has the property to bring (different) keys with an equal hash value together.

Therefore, the ladies Marion and Miriam appear side by side in the above output, although they are far away from one another in the input and although they occur there with a different multiplicity: They have the same hash value and therefore fall into the same slot. Although Julia appears in the input most frequently, she does not occur in the output in first position. Although Karin was inserted first, she does not occur in the output first either. Although Andrea starts with an “A”, she is not the first one either. The first place is due to Hanne - for no particular performances.

A working example: Creating Latin random texts

As a very nice application of h_array in the implementation of an algorithm, we want to write a program that creates Latin (or also German, English, etc.) random texts. “Random text” means that the text only looks as if it were Latin, however, it is not real Latin, like the beautiful phrase “sita usvila teinis esa vanet[33].

The basic idea starts out from the following observation: In a long original text, some combinations of characters occur very frequently, some only seldom, some others never. So, for example, in an English text, the characters “so” are very frequently followed by an “m” (among other reasons because of the frequency of the word “some”), but extremely seldom by a “q” (there are only a few, very uncommon English words containing the character substring “soq”).

Now to create a random text in a certain language, one take a very long original text, compute for every occurring combination of a prefix p and a following suffix character s the absolute frequency, choose an arbitrary prefix p and start then to throw suffixes according to the given frequencies. One output the suffix c created last and make it the last character of the next prefix p'. One forget the first character of the old prefix (so c is shifted from the right into p, yielding p'). The length of the prefix is a parameter with which one can play. Depending on the language, different lengths achieve differently good results here.

The following program counts how often a certain character c follows a prefix string p of length 4 in Cesar's “De Bello Gallico”. It uses an h_array H for counting. The key type is string, the value is itself an h_array that just counts the frequency for every character c. (Note how elegantly the subscript operators [][] can be written behind each other in the source code - so the name “hashing array” is definitely justified.)

In a second h_array the program counts how often each prefix has occurred altogether; this information is needed to be able to throw the suffixes later. The throwing itself simulates a non-uniformly distributed random variable here, which, for a given prefix p, creates a character according to the distribution given by the relative frequencies of the characters occurring behind this prefix p in the original text. For this, it (conceptually) assigns an interval on the x-axis to every character whose length corresponds to the absolute frequency. These intervals are put side by side (conceptually); then one of them is randomly selected, which corresponds to the next suffix to be output.

Filename: TextGenerator.C
LEDA users as of version 5.0: Note that header files are now partitioned into modules (subfolders)!
#include <LEDA/h_array.h>
#include <LEDA/string.h>
#include <LEDA/random_source.h>
#include <cstdlib>

using leda::h_array;
using leda::string;
using leda::random_source;
using std::cout;

h_array<string,h_array<char,int> > H;
h_array<string, int> N;

void shift_in(string& s, char c);

namespace leda { // must be in the namespace leda!

int Hash(const string& s) // must be const!
{
  switch(s.length()) {
  case 0:  return 0;
  case 1:  return s[0];
  case 2:  return (s[0] << 8) | s[1];
  case 3:  return (s[0] << 16) | (s[1] << 8) | s[2];
  default: return (s[0] << 24) | (s[1] << 16) | (s[2] << 8) | s[3];
  }
}

}


int main(int argc, char* argv[1])
{
  // command line arguments: number of characters and length of prefix
  int chars_to_generate;
  if(argc==1) chars_to_generate = 500;
  else chars_to_generate = std::atoi(argv[1]);
  
  int prefix_length;
  if(argc<3) prefix_length = 4;
  else prefix_length = std::atoi(argv[2]);

  // original text
  std::ifstream is("Data/DeBelloGallico_preformatted.txt");

  char c;
  string prefix;
  
  // read and count all characters, do not ignore white space
  while (is >> std::noskipws >> c) {

    // fill prefix initially
    if(prefix.length() < prefix_length) { 
      prefix += c; 
      continue; 
    }

    // count this combination and the frequency of this prefix
    H[prefix][c]++;
    N[prefix]++;

    // c becomes last character of next prefix
    shift_in(prefix, c);
  }
  
  // "Caesar" is a good word to start with ;-)
  prefix = "Caesar venit, vidit, vicit"; 
  prefix = prefix.head(prefix_length);
  cout << prefix;

  int num_gen_chars = 0;
  char ch = '#';

  // generate chars_to_generate random characters
  // always ending with '.'
  random_source S;
  int rand_int;

  while(num_gen_chars < chars_to_generate || ch != '.') {
    S.set_range(0,N[prefix]);
    // use an implicit random variate here
    // could be done better with binary search
    // or a combination of arrays and random_variates
    S >> rand_int;
    forall_defined(ch, H[prefix]) {
      rand_int -= H[prefix][ch];
      if(rand_int <= 0) {
	cout << ch;
	num_gen_chars++;
	break;
      }
    }
    shift_in(prefix, ch);
  }
  cout << "\n";
}

// shift all character of s 1 position to the left
// dropping the first one, make c the last character of s
void shift_in(string& s, char c) 
{
  int i;
  for(i = 0; i < s.length() - 1; i++)
    s[i] = s[i+1];
  s[i] = c;
}

One of the outputs of the program was the following:

Caesare coeperaret, sed omnes palus petum revertilitum esse coactam
inferat, copia populis persiarum obtemperat venisset omnibusque
Vsipetunt atque celerit Volusent re multo reperunt si a manu
munitionis praeda quanis esse negotio praemis mittit cis locis ad
fluminis cognitate opinione quae subito passus in oportaliae
approponitum incommittit, reliquanos et Drappelli in viarum
miserunt. Hoc animo prone circuitus Ambioribus qui nihil nisi
proptereundissent et ne qui perfecti sese fossae, praedando Galliae
provinciter a Caesarem.

Who feels like it and is good in Latin, may translate this now. However, he should not despair if he finds words or inflections therein that are unknown to him.

Here we use a fundamentally improved version of the hash function for strings (compared to our first one): We include the first 4 characters of a string into the computing of the hash value by mapping every string to the number that is given according to the coding of its first 4 characters. The differences compared with the first version stand out if one compares running times: On the machine of the author the simple version takes 90 seconds to generate a text of 500 characters (including the preprocessing), the version with the improved hash function, however, only 9 seconds!

Persistence of variables

The access via the subscript operator [] looks very elegant; its syntax makes an h_array look like an ordinary array - after all, the class is called “hashing array”. It hides a big danger for the programmer, though: A variable H[i] referenced by this operator is not persistent:

[Warning]Warning

In an h_array H (and in a map) one cannot rely on that in a sequence of statements like

V& x = H[42];
H[42] = y;
if(x == y) { ... }

the test x==y always returns the value true.

One should therefore never use a reference or a pointer to a variable in an h_array or in a map.

The reason for this inpersistence is due to the following: The subscript operator actually returns a reference to an element of a linked list, namely the one that implements the corresponding slot in the underlying implementation. Such elements may disappear during a rehash (since they are hashed into different slots).

Further information

More information about the class h_array can be found on the corresponding manual page.

Exercises

Exercise 34.

Experiment in the above program with the length of the prefix (which can be specified on the command line) and with different original texts from different languages. Send the author the funniest outputs!

Exercise 35.

Compare the above program on your machine with the program in which the simple hash function

int Hash(const string& s) { return s.length() > 0 ? s[0] : 0; } 

is used. Experiment with different hash functions. Who writes the best hash function, that is, the one with which an output of 500 characters is created the fastest?

Also compare your versions with the version in which the container type d_array is used instead of the container type h_array.

Exercise 36.

Algorithmically speaking, the method employed here to determine the next character is not optimal. LEDA offers the class random_variate in order to implement non-uniformly distributed random variables. Replace the computing of the next character in the above algorithm by a method that works with random_variate and compare the running times.

Exercise 37.

The hash function for string, supposed to be built-in as of one of the next versions of LEDA, will read approximately as follows:

int Hash(string s, int table_size =  MAXINT) {
  int size = s.length();
  int step = size / 10;
  step++;
  int i(size), j;
  for(j = 0; j < size; j += step) {
    i = 131*i + s[j];
    i %=  table_size;
  }
  return i;
}

Discuss advantages and disadvantages of this implementation. Find inputs on which this hash function behaves particularly badly in the above algorithm. If this is difficult for you, this should convince you that this method behaves quite well in the average case.



[33] Pronounced in the beautiful dialect of Saarland this sentence means as much as “Looks like Latin, however, it is not.” The author found it impossible to translate this game of words from German to English, and even many Germans may not understand it (in German).




Imprint-Dataprotection