3.2. Dictionary arrays (class d_array)

Learning objectives

The class d_array
Graphs and breadth first search

The next associative container type that we want to turn to offers essentially the same interface as dictionary; it differs in the underlying implementation, though. It is the class d_array.

The name of the class stands for dictionary array. As the name suggests, this class also implements dictionaries and offers an interface that resembles the one of an ordinary array, that is, here the associated value is read- and write-accessed by means of the subscript operator [], starting from a key.

The following program clarifies this with the task of counting all strings occurring in standard input and then outputting which string occurred how often.

Filename: StringCountDArray.C
LEDA users as of version 5.0: Note that header files are now partitioned into modules (subfolders)!
#include <LEDA/d_array.h>
#include <LEDA/string.h>
#include <iostream>

using leda::d_array;
using leda::string;

int main()
{
  d_array<string,int> counter(0);
  string s;

  while (std::cin >> s) counter[s]++;

  std::cout << "I have counted the following strings:\n";

  forall_defined(s,counter) 
    std::cout << s << " " << counter[s] << std::endl;
}

An example run looks as follows:

Ulla Julia Julia Ulla
Karin Antje Antje Bettina
Ulla Ulla Bettina
I have counted the following strings:
Antje 2
Bettina 2
Julia 2
Karin 1
Ulla 4

The interface of d_array is quite small. The only access to key-value pairs is carried out via the subscript operator

D[key];

This operator returns a reference to the associated value, starting out from a key key (being the index of the operator).

A key-value pair is inserted into a d_array by defining the value to be associated to a key. This can be achieved by a statement like

D[key] = value;

or

D[key]++;

Concerning the last statement, the question arises which previous value the dictionary array assumes in the very first increment. This value, the so-called default value, can be specified in the constructor as it is the case in the example program. It is the value that is used in a read-access if no value was associated to the corresponding key yet; if it is not specified in the constructor, the dictionary array uses the default value of the value type.

The method

D.set_default_value(x);

allows to set the default value to an arbitrary value x.

How can a key-value (k,v) pair be deleted from the dictionary array again? This is achieved by a

D.undefine(k);

So only the key k has to be specified. In other words, a D.undefine(k) resets the value associated to k to the default value.

A

D.clear();

deletes all key-value pairs.

Whether a certain key k occurs in D can be tested by

D.defined(k);

Finally,

D.size();

returns the number of key-value pairs in a dictionary array.

Requirements on the key type

Since the class d_array is implemented just like the class dictionary by search trees, the key parameter must be a linearly ordered type.

Iteration and iteration order

The macro

forall_defined(k, D)

iterates over all keys k that are currently contained in D. In contrast, the macro

forall(v, D)

iterates over all values v contained in D. It sets the variable v to a copy of the value. So the actual values cannot be changed this way; this is possible only with the operator [] and the respective key.

As in the case of the class dictionary, the forall macros iterate over the keys and the values, respectively, in the order given according to the linear order.

A working example: Breadth first search in a graph

How far is it from LOVE to PAIN? For computer scientists a problem that is easier to solve than to deal with in reality. Yes, particularly for computer scientists. In reality. But we do not want to digress: A known riddle consists in deriving the word PAIN from the word LOVE by successively replacing one letter by a different one; however, one must always reach words of the English language in every intermediate step. So the start of a derivation could look like this: LOVE → LONE → CONE → COKE → COPE → ... and then he was lost! The reader may try himself to find a solution by hand. Anyway, the author did not succeed: He remembered after a long and frustrating hour being a computer scientist. And to be more precise: in reality.

So how do we solve this problem by an algorithm? At first, we get ourselves a list as long as possible of 4-letter words[32] of the English language. For this purpose, the author took the list of English words that are allowed in the game of Scrabble™. This list contains 3862 words. It can be obtained from the download area of the tutorial for own experiments.

When we write down all these words on a sheet of paper and draw lines connecting each two words that differ by one letter only, a construct arises like the one in Figure 3.2. (There, only 35 of the 3862 words are drawn and only the lines connecting these 35 words).

Figure 3.2. A graph

A graph

A graph with 35 nodes and 80 edges. Two nodes are adjacent if and only if their words differ by exactly one letter.


Such a construct that consists of nodes connected by edges is called a graph. This structure frequently appears in computer science, everywhere where relations between single objects are modeled. The objects correspond to the nodes, the relations to the edges of the graph.

The question now is whether there is a way from the node LOVE to the node PAIN in the graph of Figure 3.2. How can we find such a way, in such an unmanageable set of 3862 nodes, for which there is hardly space even on a large drawing sheet, not to mention the unmanageable entanglement of the edges? Perhaps there is no way at all? And if there is a way actually, what is the shortest way, that is, how do we get from LOVE to PAIN as fast as possible, that is to say, with as few intermediate steps as possible?

One possible algorithm is the following: The basic idea is to first visit all nodes directly adjacent to the node LOVE. Then starting from these nodes, we visit again all their neighbors, then the neighbors of these neighbors etc. For this purpose, we work with a queue: At the beginning the queue contains only the node LOVE. We pull this node out of the queue and append all neighbors of LOVE to the queue in exchange. We then take the next node from the queue (a neighbor of LOVE, for example LOSE) and in turn append its neighbors to the queue. But caution! Since LOVE is also a neighbor of LOSE we would now append LOVE to the queue again, then remove it from the queue, then append it etc., over and over again. We would be trapped in an endless loop!

To avoid this, we must keep in mind whether a node was already in the queue; we must not append it a second time. We could color it on the paper, for example.

This algorithm visits all reachable nodes, one after another, starting from the start node. It visits them in layers: First the nodes with distance 1 are visited, then those with distance 2, then those with distance 3 etc. The distance of a node v from a node w is the length of a shortest path, that is a smallest possible sequence of edges leading from node v to node w. So the algorithm searches into breadth, exactly like waves spread out around a stone (the start node) thrown into water. It is therefore called breadth first search. Figure 3.3 illustrates this.

Figure 3.3. Breadth first search

Breadth first search

The individual nodes are visited starting from LOVE in layers. Thus, their distance from LOVE and a path back to LOVE results.


When we have visited every node reachable from LOVE in this manner, then PAIN is either in the set of the nodes visited or it is not. If it is, where do we then know the way back to LOVE from (or the other way round)? For this purpose, all we have to do is to memorize where we came from in the moment of appending a node to the queue; so we must memorize the predecessor on the way from LOVE. On the paper we can mark this by little arrows that are directed to the predecessor node backward, see Figure 3.3; the path resulting from this is a shortest path!

How can we implement this algorithm with the help of LEDA now? LEDA, of course, contains an (extremely powerful) data structure for the implementation of graphs. This class is called - how should it be different - graph. We will get to know it in Chapter 5, Graphs and graph algorithms. And of course, LEDA also contains the breadth first search algorithm whose use we also will get to know there. Furthermore, LEDA offers even a class by means of which graphs can be visualized and edited, the class GraphWin. For example, Figure 3.2 was created with that class.

Here however, we want to implement breadth first search once by ourselves, getting by on the means we got to know until now. The following program implements this algorithm with the help of two dictionary arrays and a queue. The first d_array is called was_in_queue and serves two purposes: We memorize therein whether a string was already in the queue and which strings there are in the graph at all. The second d_array is called pred; therein, we memorize the predecessor string for a string on the way back to LOVE. So the program does not build up the graph explicitly as a data structure.

To find all neighbor nodes starting from a node like LOVE, we first replace the first letter (L) by all other 25 letters of the alphabet, then - keeping the first letter L - we replace the second letter (O) by all other letters, then the third one and at last the fourth one. For each of the strings created in this manner, we test by means of the method defined() whether this string occurs as a node in the graph at all, that is in the dictionary array was_in_queue. If it does, it is a neighbor of the original string.

As long as the queue is not empty, we pull the first node out of the queue. If this is the node we are searching for, we stop and output the way back to the start node. Otherwise, we append all its adjacent nodes that were not in the queue yet to the queue, mark these nodes as inserted into the queue and memorize the way back to the node we came from.

Filename: SearchEnglishWordChainDArray.C
LEDA users as of version 5.0: Note that header files are now partitioned into modules (subfolders)!
#include <LEDA/d_array.h>
#include <LEDA/queue.h>
#include <LEDA/string.h>
#include <istream>

using leda::d_array;
using leda::queue;
using leda::string;
using std::ifstream;
using std::cout;


int main()
{
  ifstream is("Data/4LetterWords.txt");

  d_array<string,bool> was_in_queue;
  d_array<string,string> pred;

  string word;
  while (is >> word) 
    was_in_queue[word] = false;
  
  string from = "LOVE";
  string to = "PAIN";
  
  bool found = false;

  queue<string> Q;
  Q.append(from);
  was_in_queue[from] = true;

  while( !Q.empty() && !found) {

    string s = Q.pop(); // get next node from the queue

    if(s == to) { // we have found a path
      // and output it in reverse order
      cout << "Here is a path from " << to 
           << " back to " << from << ":\n";
      string t = to;
      cout << t << "\n";
      while(t != from) {
        t = pred[t];
        cout << t << "\n";
      }
      found = true;
    }

    else // insert all not yet queued neighbours into queue
      for(int i = 0 ; i < 4 ; i++) {
        string s_adjacent = s;
        // step through all letters of the alphabet
        for(char c = 'A' ; c <= 'Z' ; c++) {
          if(c == s[i]) continue; // don't generate s itself
          // candidate adjacent word: replace letter i
          s_adjacent[i] = c;
          // is s_adjacent really adjacent and was not yet in queue?
          if(was_in_queue.defined(s_adjacent) 
             && !was_in_queue[s_adjacent]) {
            // append s_adjacent to the queue for later processing
            // mark it as queued and memorize where we came from
            Q.append(s_adjacent);
            was_in_queue[s_adjacent] = true;
            pred[s_adjacent] = s;
          }
        }      
      }
  }

  if(!found)
    cout << "No path found from " << from << " to " << to << "\n";
}

The program outputs the following:

Here is a path from PAIN back to LOVE:
PAIN
LAIN
LOIN
LORN
LORE
LOVE

There you are! As a matter of fact, LOVE can come to PAIN, and the fastest one to get there is the computer scientist because he knows the shortest way!

Implementation

The class d_array is implemented by randomized search trees. If n key-value pairs are stored in a d_array, an access via the operator [] takes time O(log n). The space consumption is O(n).

Further information

More about the class d_array is found on the corresponding manual page.

Exercises

Exercise 26.

For printing out the way back, the loop can be formulated also as follows:

      string t = to;
      do {
        cout << t << "\n";
        t = pred[t];
      }
      while(t != "");      

What is this due to?

Exercise 27.

Modify the above program so that it outputs the path in reversed order, that is in the “correct” order.

Exercise 28.

Expand the above program so that it additionally outputs the distance to the start word for every word visited in the graph. Then expand it so that, starting from the start word, the layers are output, that is, first the nodes with a distance of 1, then a separating line, then the nodes with a distance of 2, then a separating line, etc.

Exercise 29.

Think over why breadth first search actually always calculates shortest paths and why these paths do not always have to be unique.

Exercise 30.

Another method for traversing a graph in a systematic way is depth first search. Here, starting from a node just visited, immediately all non-visited neighbors are visited by recursion. This method first runs deeply from the start node into the graph instead of running around this first node as in the case of breadth first search.

Implement depth first search with a recursive function first. Then simulate the recursion with the help of a stack on which the nodes to be worked on are pushed. Try to find a derivation as long as possible from LOVE to PAIN with that.

Exercise 31.

In the download area of the tutorial you find a file containing approximately 118,000 words of the English language. Use these words as input for your own experiments. Expand the above notion of “distance” so that two words are regarded as adjacent if they can be transformed into one another by insertion or deletion of one letter. Then search for derivation chains that are as long as possible and send the author your best results.



[32] Remember Bob Dylan? “LOVE is just a 4-letter word...




Imprint-Dataprotection