2.8.1. Inserting elements and membership test

Learning objectives

The class set
Inserting, deleting, and choosing elements
Membership test

Let us assume that we operate a web server. We have extracted a file from the server's access logfile access.log with the following structure:

139.19.1.151
139.19.1.151
207.46.249.27
139.19.1.153
139.19.1.151
139.19.1.154
198.182.196.56
139.19.1.151
139.19.1.154
139.19.1.151
139.19.1.151

The entries specify the IP-addresses of the computers by which our server was accessed. Since we have a subnet and publish important information for our employees on our web server, most of the entries originate from this subnet and are equal. There are only a few entries from outside of the subnet, but there are many entries altogether.

We want to know now by which machines our server was actually accessed, that is, we want to know which different entries there are in the logfile. These are only a few entries, although, as we said, there are many entries altogether: Most entries are equal.

This is a unification problem as we have seen it in Section 2.3.4 before. Exactly like there, we could throw all IP-addresses into a list, sort this list, and then invoke the list method unique(). However, this does not make sense with a very large number of IP-addresses because the list becomes enormously long and the sorting function then takes (too) much time. Another possibility would be to always keep the list sorted and to insert every new IP-address just in the right position. Then, however, for every insertion a previous search is necessary, starting from the beginning of the list. This is also too time consuming.

Here it is much more reasonable to work with a set of IP-addresses: We read line for line and test at every new IP-address whether it is already an element of the set. If not, we insert it into the set. Thus the structure always contains only one single copy of the corresponding element. (It cannot contain multiple copies.) At the end, we output all elements; they then correspond to the unique, different IP-addresses. We rely on the implementation of LEDA's class set performing single insert and search operations very efficiently. (What is indeed so because sets are implemented by search trees.)

The following program implements this procedure with the help of the class set. We feed it with the file access.log by redirecting standard input:

Filename: IPUnique.C
LEDA users as of version 5.0: Note that header files are now partitioned into modules (subfolders)!
#include <LEDA/set.h>
#include <LEDA/string.h>
#include <iostream>

using leda::set;
using leda::string;


int main()
{
  set<string> S;
  string IPaddress;
  
  while(std::cin >> IPaddress) 
    if(!S.member(IPaddress)) // this test could be left out
      S.insert(IPaddress);
  
  string unique_addr;
  forall(unique_addr,S) 
    std::cout << unique_addr << std::endl;  
}

Given the input above, the output of the program is

139.19.1.151
139.19.1.153
139.19.1.154
198.182.196.56
207.46.249.27

The method

S.insert(v);

adds an element with value v to S. If the value v is already contained in the set, nothing happens, that is, no elements are duplicated; every value can be contained in S at most once.

The method

S.member(v);

tests whether a value v is already a member of the set S. This test also could be left out in the above program since insert(), as already said, does not insert a value twice.

With the macro

forall(x, S) { ... }

the individual elements of a set can be iterated over. The macro assigns copies of the elements to the iteration variable x.

The above program does not make use of the methods

S.del(v);

and

S.choose();

S.del(v) deletes an element with value v from the set. S. choose() chooses an element arbitrarily, that is, the method returns a copy of an arbitrary element (under the precondition that the set is not empty). This arbitrary selection is used in some algorithms.

[Note]Note

The method choose() does not guarantee that an element of the set is selected at random and that all elements are selected with the same probability. It is definitely possible that the same element is selected over and over again (provided that it is not deleted).

The LEDA class set is, like the class array, a non-item-based, simple-structured type.

Tips

  • One uses a set if the number of search operations is fundamentally larger than the number of insertion and deletion operations.
  • The class set is useful if one wants to eliminate multiple copies of the same object from a collection of objects.
  • If the objects are integer numbers, the classes int_set or dynamic_integer_set may be better choices.
  • If the objects have to be stored in a given order, or if certain objects have to be accessed particularly often, an array or a list may be the structure of choice.



Imprint-Dataprotection