AHT documentation
Copyright (C) 2000-2001 by Salvatore Sanfilippo
<antirez@invece.org>

				   INDEX

	[1.0] The first example
  	  [1.1] ht_init()
	  [1.2] ht_add()
	  [1.3] ht_search()
	  [1.4] ht_destroy()
	[2.0] The hashtable and ht_element structures
	[3.0] Information on aht internals
	[4.0] Advanced usage of aht
	  [4.1] Using your own hash function and multi-hashing
	  [4.2] Element destructors and the ht_add_generic() function
	  [4.3] ht_add_generic()
	  [4.4] Collisions, resizing and multi-hashing
	    [4.4.1] ht_resize()
	    [4.4.2] ht_expand()
	[5.0] Running the table
	  [5.1] ht_get_byindex()

				 OVERVIEW

  Aht is a small library that implements dynamic in-memory hash tables.
  A hash table, also known as a "dictionary", is a data structure used to
  associate an object with a key, so that looking up a given key inside
  the table is fast, even if the table contains a large number of elements.

  Using aht you can create a hash table and insert all the key-value
  pairs you want through a simple programming interface, without caring
  about the size of the hash table, which is expanded automatically
  when needed.

[1.0] The first example
~~~~~~~~~~~~~~~~~~~~~

  To show how aht works see the following code:

  #include <stdio.h>
  #include "aht.h"

  /* errors are not checked, to keep the example simple */
  int main(void)
  {
  	struct hashtable t;
	unsigned int index;

	/* Initialize the hash table */
	ht_init(&t);

	/* Add three elements: (key --> value)
	 * one   --> 1
	 * two   --> 2
	 * three --> 3
	 */
	ht_add(&t, "one", 3, "1", 2);
	ht_add(&t, "two", 3, "2", 2);
	ht_add(&t, "three", 5, "3", 2);

	/* Print some information about the hashtable */
	printf("Elements in table: %u\n", t.used);
	printf("Hash table current size: %u\n", t.size);

	/* Search for the key "two" and print the associated data,
	 * which in this case is the string "2" */
	 ht_search(&t, "two", 3, NULL, &index);

	 printf("two is associated with %s\n", (char*) t.table[index]->data);

	 /* Destroy the table, deallocating the memory */
	 ht_destroy(&t);

	 return 0;
  }

  Compiling and running this code produces the following output:

  Elements in table: 3
  Hash table current size: 257
  two is associated with 2

  Let's go through the functions used:

  [1.1] void ht_init(struct hashtable *t);
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    The function ht_init() must be called before any other
    ht_*() function. It initializes the hash table pointed to
    by `t', setting up the structure fields.
    ht_init() never fails, so it returns no value.

  [1.2] int ht_add(struct hashtable *t, void *key, size_t keysize,
  				  void *data, size_t datasize);
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   The function ht_add() is used to add an element (a key/value pair)
   to the hash table. Both the key and the data can be anything you
   want, so void pointers are used, with one exception: the key
   can't be zero length. In the above example we used the code:

	ht_add(&t, "one", 3, "1", 2);

   The first argument, &t, is a pointer to the hash table structure,
   "one" is the key, and 3 is the length of the key. Finally,
   "1" and 2 are the data to associate with the key and the size of this
   data. We used a data size of 2 in order to include the string's
   terminating nul byte.
   ht_add() can return the following codes:

   	HT_OK		Element added successfully
	HT_INVALID	The data size is zero
	HT_BUSY		The given key already exists in the table
	HT_NOMEM	System ran out of memory

   The function ht_add() automatically allocates the memory for
   the key and the data of the new element. When you destroy
   the hash table all this memory is automatically freed.
   If this behaviour is not desired you can use the
   ht_add_generic() function instead. See the `Advanced usage' section
   for more information.

  [1.3] int ht_search(struct hashtable *t, void *key, unsigned int keysize,
						unsigned int *avail_index,
						unsigned int *found_index);
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    ht_search() is the function used to search inside the hash table
    for a given key. It can also be used to obtain the index at which
    a given key must be stored in the hash table: this is used
    internally by the aht library and is normally not useful to
    the end user. Detailed information about this usage can be
    found in the `Advanced usage' section.

    If the key exists, ht_search() stores the index of the given key in
    the hash table in the unsigned integer pointed to by `found_index',
    and returns HT_FOUND. Otherwise HT_NOTFOUND is returned, and the index
    at which the given key must be stored is written to the unsigned integer
    pointed to by `avail_index'.

    You can select what information ht_search() should return by
    passing a NULL pointer for the information you don't need.
    For example, to search for an element we used in the code above:

    	ht_search(&t, "two", 3, NULL, &index);

   `&t' is as usual the pointer to the hash table, "two" is the key,
   and 3 is the key size. Since we were not interested in the available
   index, NULL was passed as the 4th argument. `&index' is the pointer
   to the unsigned integer that will contain the index of the found
   element if ht_search() returns HT_FOUND.

   If the `unsigned int *avail_index' argument is not NULL and
   the hash table is full, the table will be silently expanded.

   ht_search() can return the following codes:

   	HT_FOUND	The key was found, and the index stored in
			*found_index, if it is not NULL.
	HT_NOTFOUND	The key does not exist, and the index in which
			to store the new key was stored in *avail_index.
	HT_NOMEM	avail_index was not NULL and the hash table was
			full, but the hash table expansion failed because
			the system ran out of memory.

  [1.4] int ht_destroy(struct hashtable *t);
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The ht_destroy() function is used to destroy a hash table:
  to destroy means to remove and deallocate all the elements in
  the hash table and to resize the hash table itself to zero length.
  After calling this function you don't need to call ht_init()
  again to reuse the hash table: ht_destroy() does this for you.

  error codes:

	HT_OK		ht_destroy() never fails.

[2.0] The hashtable and ht_element structures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  As you can see, when ht_search() finds a given key it gives us
  an "index", not a pointer to the data associated with the key.
  To access the data, some background about the aht structures
  is needed.

  Every hash table is a `struct hashtable'. The structure is the following:

  struct hashtable {
          struct ht_ele **table;
          unsigned int size;
          unsigned int used;
          unsigned int collisions;
          u_int32_t (*hashf[HT_FUNCTIONS])(unsigned char *buf, size_t len);
  };

  The first member, `struct ht_ele **table', is the most important:
  it is an array of pointers to ht_ele structures.
  If we define a hash table like this:

  struct hashtable t;

  we can get the pointer to the structure that contains a given element
  using t.table[element_index]. If the hash table contains an element
  at index `n', t.table[n] isn't a NULL pointer, otherwise it is NULL.

  The structure that contains the element's data is the structure ht_ele:

  struct ht_ele {
          void *key;
          unsigned int keysize;
          void *data;
          unsigned int datasize;
          void (*destructor)(void *obj, unsigned int size);
  };

  It contains a pointer to the key and the key size, and a pointer
  to the data of the element and the data size. It also contains a
  function pointer, used to define an element destructor different from
  the default (see the `Advanced usage' section for more information).

  So if ht_search() finds an element for some key, you can access
  the element's data using the returned index with:

  t.table[index]->data		/* pointer to the element data */
  t.table[index]->datasize	/* integer that contains the data size */

  It may be useful to quote two lines from the example program above:

	ht_search(&t, "two", 3, NULL, &index);
	printf("two is associated with %s\n", (char*) t.table[index]->data);

  ht_search() stores the index of the element in the integer
  `index', so the printf() function can access the element data
  using t.table[index]->data.

  This was just an example: in real programs you must check the
  return code of ht_search(). For example, the above lines
  can be written as:

	if (ht_search(&t, "two", 3, NULL, &index) == HT_NOTFOUND) {
		printf("Key not found!\n");
		return 0;
	} else {
		printf("two is associated with %s\n",
			(char*) t.table[index]->data);
	}

[3.0] Informations on aht internals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  This implementation of hash tables has the following features:

  o Dynamic size
  o User supplied initial size and run-time resize
  o Ability to pass your own hash function, and your own element destructor
  o Optional double and triple hashing
  o Collisions counter for profiling

  The size of the hash table is always a prime number. The default
  hashing function is the djb hash, developed by D. J. Bernstein.
  An alternative hashing function is supplied: it is very different
  from the djb hash, so you can use it to perform double hashing.
  If all the hash functions collide, a linear rehash is used.

  Note: from my tests it seems that with normal string keys the default
  (the djb hash alone) is faster.
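
  For reference, the hash commonly known as "djb2" can be sketched as
  below. This is only an illustration: the name djb2_sketch is
  hypothetical, uint32_t stands in for u_int32_t, and aht's own
  djb_hash() may differ in its details (for example in the starting
  value or in how each byte is mixed in).

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the classic djb2 string hash (hash * 33 + byte).
 * Hypothetical name; not taken from the aht sources. */
uint32_t djb2_sketch(unsigned char *buf, size_t len)
{
	uint32_t hash = 5381;	/* conventional djb2 starting value */

	while (len--)
		hash = hash * 33 + *buf++;	/* wraps modulo 2^32 */
	return hash;
}
```

  With a function like this, the first candidate slot for a key in a
  table of `size' entries would be djb2_sketch(key, keysize) % size.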

[4.0] Advanced usage of aht
~~~~~~~~~~~~~~~~~~~~~~~~~~~

  [4.1] Using your own hash function and multi-hashing
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The struct hashtable contains an array of function pointers:
  every element of this array (which contains three elements) may
  contain a pointer to a hash function. After ht_init()
  is called the contents of the array are the following:

  [0] = djb_hash()
  [1] = NULL
  [2] = NULL

  If you want to use double hashing all you need to do is:

  struct hashtable t;

  ht_init(&t);
  t.hashf[1] = alt_hash;

  The prototype of a hash function is the following:

	u_int32_t new_hash(unsigned char *buf, size_t len)

  The function alt_hash is a hashing function very different
  from the djb one. Aht will use all the functions that are present
  in the hashf[] array to perform the search, from the first
  to the last.
  You can even replace the djb hash function with one of your own
  that works better for your particular keys:

  ht_init(&t);
  t.hashf[0] = my_own_hash;
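
  As an illustration of a replacement function with the required
  prototype, here is a sketch of the well-known 32-bit FNV-1a hash.
  The name my_fnv1a_hash is hypothetical and uint32_t is used where
  aht.h uses u_int32_t. Whether it suits your keys better than the
  default is something to measure, not something this sketch claims.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: xor each byte in, then multiply by the FNV prime. */
uint32_t my_fnv1a_hash(unsigned char *buf, size_t len)
{
	uint32_t hash = 2166136261u;	/* FNV offset basis */

	while (len--) {
		hash ^= *buf++;
		hash *= 16777619u;	/* FNV prime, wraps modulo 2^32 */
	}
	return hash;
}
```

  Assuming uint32_t and u_int32_t are the same type on your platform,
  such a function would be installed by assigning its name (without
  parentheses) to t.hashf[0] after ht_init(&t).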

  Note: from my tests, for normal string keys the highest
  performance is reached using the djb function without
  double hashing, i.e. the aht default.
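
  The lookup order described in this section (every function in
  hashf[] from the first to the last, then a linear rehash) can be
  pictured with the following sketch. It is not the aht code: the
  names find_slot, hash_fn, N_HASH and the two toy hash functions are
  hypothetical, keys are plain C strings, and starting the linear
  rehash from the last candidate slot is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_HASH 3	/* three hashf[] entries, as described above */

typedef uint32_t (*hash_fn)(unsigned char *buf, size_t len);

/* Two toy hash functions, only to make the sketch self-contained. */
uint32_t first_byte_hash(unsigned char *buf, size_t len)
{
	return len ? buf[0] : 0;
}

uint32_t length_hash(unsigned char *buf, size_t len)
{
	(void)buf;
	return (uint32_t)len;
}

/* Return the slot holding `key', or a free slot where it can be
 * stored; -1 if every slot is taken by other keys. */
long find_slot(const char **slots, size_t size, hash_fn *hashf,
	       const char *key)
{
	size_t len = strlen(key), idx = 0, i;
	int f;

	/* Try each hash function in order, skipping unset entries. */
	for (f = 0; f < N_HASH; f++) {
		if (hashf[f] == NULL)
			continue;
		idx = hashf[f]((unsigned char *)key, len) % size;
		if (slots[idx] == NULL || strcmp(slots[idx], key) == 0)
			return (long)idx;
	}
	/* Every candidate collided: step linearly from the last one. */
	for (i = 0; i < size; i++) {
		size_t probe = (idx + i) % size;

		if (slots[probe] == NULL || strcmp(slots[probe], key) == 0)
			return (long)probe;
	}
	return -1;
}
```

  With a second function set, a key that collides on the first
  candidate slot gets one more independent candidate before falling
  back to the linear scan, which is why related keys can benefit.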

  [4.2] Element destroiers and the ht_add_generic() function
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The behaviour of the ht_add() function is to allocate
  with malloc() the memory needed for the key and for the
  associated value. As already explained, when you destroy
  the hash table with the function ht_destroy(), or when
  you call ht_free() to remove an element, the destructor
  associated with the element is called. For the elements
  added with ht_add() the destructor is set to ht_free_destructor(),
  which calls free() to deallocate the malloc()ated memory.

  So with ht_add() you can't add a statically allocated or
  a stack allocated element without ht_add() allocating
  memory and making a copy of the element.

  To do this you can use the more generic ht_add_generic() function:

  [4.3] int ht_add_generic(struct hashtable *t, void *key, size_t keysize,
			  void *data, size_t datasize,
			  void (*destructor) (void *obj, unsigned int size));
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  ht_add_generic() is very similar to ht_add(), with these two differences:

  o The data pointed to by the pointer that you pass to ht_add_generic()
    as the 4th argument isn't copied to dynamically allocated memory:
    it is taken as it is and inserted in the hash table.

  o You can specify a destructor, that is, a function returning nothing
    that takes as arguments a pointer to the object to destroy and
    its size. This function will be called with the `data' and
    `datasize' arguments that you passed to ht_add_generic()
    when the element is destroyed.
    If you specify a NULL pointer, a special ht_no_destroier that
    does not perform any operation will be used.
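
  The destructor mechanism described by the two points above can be
  pictured with this stand-alone sketch. It does not use aht: struct
  elem, elem_set(), elem_release(), free_destructor() and
  no_destructor() are hypothetical names that only mirror the
  behaviour documented here (value stored as it is, NULL selecting a
  do-nothing destructor, destructor called with the data and its size).

```c
#include <stdlib.h>

/* Hypothetical element holding a value and its destructor. */
struct elem {
	void *data;
	unsigned int datasize;
	void (*destructor)(void *obj, unsigned int size);
};

/* Destructor for malloc()ated values. */
void free_destructor(void *obj, unsigned int size)
{
	(void)size;
	free(obj);
}

/* Destructor that does nothing, for static or stack values. */
void no_destructor(void *obj, unsigned int size)
{
	(void)obj;
	(void)size;
}

/* Store the value as it is; a NULL destructor selects the no-op one. */
void elem_set(struct elem *e, void *data, unsigned int datasize,
	      void (*destructor)(void *obj, unsigned int size))
{
	e->data = data;
	e->datasize = datasize;
	e->destructor = destructor ? destructor : no_destructor;
}

/* Destroy the value by calling its destructor with data and size. */
void elem_release(struct elem *e)
{
	e->destructor(e->data, e->datasize);
	e->data = NULL;
	e->datasize = 0;
}
```

  Keeping the destructor next to each value is what lets a single
  table mix malloc()ated, static and stack allocated values safely.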

  Example: how to add a statically allocated value for a given key.

  	struct hashtable t;
	static char *string = "Just a string";

	ht_init(&t);
	ht_add_generic(&t, "mykey", 5, string, strlen(string)+1, NULL);
	ht_destroy(&t);

  This code avoids duplicating the string, and ensures that when
  ht_destroy() is called aht does not try to free() a statically
  allocated portion of memory.

  Another use of ht_add_generic() is to set up a destructor that does
  some specific work, like freeing the members of structures
  used as the values of the keys. This is very useful and makes
  your program simpler to manage.

  Example:

	struct my {
		char *name;
		char *address;
		char *note;
	};

  	struct hashtable t;
	struct my myentry;

	void special_destroier(void *obj, unsigned int datasize);

	myentry.name = strdup("Salvatore Sanfilippo");
	myentry.address = strdup("Myaddress");
	myentry.note = strdup("Note field");

	ht_init(&t);
	ht_add_generic(&t, "antirez", 7, &myentry, 0, special_destroier);
	ht_destroy(&t);

	void special_destroier(void *obj, unsigned int datasize)
	{
		struct my *t = obj;

		free(t->name);
		free(t->address);
		free(t->note);
	}

  [4.4] Collisions, resizing and multi hasing
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  struct hashtable {
          struct ht_ele **table;
          unsigned int size;
          unsigned int used;
          unsigned int collisions;
          u_int32_t (*hashf[HT_FUNCTIONS])(unsigned char *buf, size_t len);
  };

  The struct hashtable isn't opaque. Accessing its members directly
  you can get information about the size of the hash table, the number
  of elements inside the table, and the number of collisions that have
  occurred since the creation of the table during element insertion,
  internal resizing and key lookup.

  While it is not possible to resize a hash table to get a
  size/used ratio lower than 2, you can use the function
  ht_resize() to obtain the smallest size (always a prime number)
  that is closest to the size/used = 2 ratio, optimizing the hash
  table for size.
  The ht_resize() function is very simple to use:

  [4.4.1] int ht_resize(struct hashtable *t);
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  A very simple example of ht_resize() usage:

int main(void)
{
        char buffer[32];
        struct hashtable t;
        int i;

        ht_init(&t);

        for (i = 0; i < 10000; i++) {
                sprintf(buffer, "key-%d", rand());
                ht_add(&t, buffer, strlen(buffer), "value", 5);
        }
        printf("Adding 10000 elements\n\n");
        printf("Hash table size: %u\n", t.size);
        printf("size/used: %f\n", t.size/(float)t.used);
        printf("collisions: %u\n", t.collisions);
        printf("\nResizing...\n\n");

        t.collisions = 0; /* We must reset the collision counter */
        ht_resize(&t);

        printf("Hash table size: %u\n", t.size);
        printf("size/used: %f\n", t.size/(float)t.used);
        printf("collisions: %u\n", t.collisions);

        ht_destroy(&t);

        return 0;
}

  The output is the following:

	Adding 10000 elements

	Hash table size: 33703
	size/used: 3.370300
	collisions: 2105

	Resizing...

	Hash table size: 20011
	size/used: 2.001100
	collisions: 4643

  As you can see, after ht_resize() was called the size/used ratio
  became very close to 2, while the number of collisions
  about doubled.
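
  The size printed above is consistent with a simple rule: take the
  smallest prime that is not below twice the number of elements
  (2 * 10000 = 20000 gives 20011). How aht actually selects the prime
  is not described in this document, so the following is only a
  sketch of that rule, with the hypothetical names is_prime() and
  next_prime().

```c
#include <stddef.h>

/* Return 1 if n is prime, 0 otherwise (plain trial division). */
int is_prime(size_t n)
{
	size_t d;

	if (n < 2)
		return 0;
	for (d = 2; d <= n / d; d++)
		if (n % d == 0)
			return 0;
	return 1;
}

/* Smallest prime that is greater than or equal to n. */
size_t next_prime(size_t n)
{
	while (!is_prime(n))
		n++;
	return n;
}
```

  Trial division is slow for huge numbers but adequate here, since a
  new size only has to be computed when the table is resized.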

  [4.4.2] int ht_expand(struct hashtable *t, size_t size);
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  You can also optimize the hash table for speed, using the
  function ht_expand().

  Replacing the function call `ht_resize(&t)' with:

  	ht_expand(&t, t.size * 2);

  the output becomes:

	Adding 10000 elements

	Hash table size: 33703
	size/used: 3.370300
	collisions: 2105

	Resizing...

	Hash table size: 67409
	size/used: 6.740900
	collisions: 880

  Fewer collisions often mean more speed, but you should
  run some benchmarks with your application's typical table size
  and keys.

  Also note that the program above uses keys that are barely related
  to each other, since they are generated using the rand() function.
  If you replace the line

	sprintf(buffer, "key-%d", rand());

  with

  	sprintf(buffer, "key-%d", i);

  and run the program, you can see that the number of collisions
  after the hash table expansion becomes much smaller than before
  the expansion. The new output is:

	Adding 10000 elements

	Hash table size: 33703
	size/used: 3.370300
	collisions: 118086

	Resizing...

	Hash table size: 67409
	size/used: 6.740900
	collisions: 2802

  A lot of collisions, as you can see. This is the perfect example
  to show how double hashing can help with closely related keys.
  Adding the line `t.hashf[1] = alt_hashf' just after `ht_init(&t)'
  you'll get this output, which looks very different:

	Adding 10000 elements

	Hash table size: 33703
	size/used: 3.370300
	collisions: 3888

	Resizing...

	Hash table size: 67409
	size/used: 6.740900
	collisions: 477

  On an AMD K6-III 420 MHz, adding the 10000 elements with these
  related keys takes:

  108097 microseconds without double hashing
   69703 microseconds with double hashing

  But double hashing brings no advantage with unrelated keys.
  Generating the keys with the rand() function the same program takes:

   76416 microseconds without double hashing
   77621 microseconds with double hashing

[5.0] Running the table
~~~~~~~~~~~~~~~~~~~~~~~

  In some applications you will want to be able to run through the
  entire table, getting all the elements inside. Aht provides
  a simple interface for this: the ht_get_byindex() function.

  [5.1] int ht_get_byindex(struct hashtable *t, unsigned int index,
			   struct ht_ele **e);
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The ht_get_byindex() function allows you to get an element given
  its index. The first argument is the pointer to the hashtable
  structure, the second is the index you want to get, and the last
  argument, a pointer to a struct ht_ele pointer, is used to store
  the pointer to the element stored at the given index.

  ht_get_byindex returns the following values:

	 0  -- the given index is empty, skip it
	 1  -- the index was found
	-1  -- you are out of the range of the hash table elements

  So you can run through the entire hash table using a for loop like this:

	struct hashtable t;
	int index;

	ht_init(&t);
	...
	add your elements here
	...

        for (index = 0; ;) {
                int ret;
                struct ht_ele *e;

                ret = ht_get_byindex(&t, index, &e);
                if (ret == -1)
                        break;
                if (ret) {
			/* Do what you want with the element */
                        printf("%s\n", (char*)e->data); /* assuming a string */
                }
                index++;
        }
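
  The same loop shape can be tried without the library using a
  stand-in that follows the return convention listed above (0 for an
  empty index, 1 for a found element, -1 when out of range). The names
  get_byindex_sketch() and count_elements() are hypothetical, and the
  table is reduced to an array of string pointers.

```c
#include <stddef.h>

/* Stand-in with the same return convention as ht_get_byindex(). */
int get_byindex_sketch(const char **slots, unsigned int size,
		       unsigned int index, const char **e)
{
	if (index >= size)
		return -1;	/* out of the range of the table */
	if (slots[index] == NULL)
		return 0;	/* empty index, skip it */
	*e = slots[index];
	return 1;		/* element found */
}

/* Walk every index and count the elements actually stored. */
unsigned int count_elements(const char **slots, unsigned int size)
{
	unsigned int index, n = 0;
	const char *e;
	int ret;

	for (index = 0;
	     (ret = get_byindex_sketch(slots, size, index, &e)) != -1;
	     index++) {
		if (ret == 1)
			n++;
	}
	return n;
}
```

  Stopping on -1 rather than on a known size keeps the loop from
  depending on the layout of the table.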

  You may also want to use ht_get_byindex() after a call to
  ht_search() instead of accessing t.table[] directly: this keeps
  your code portable, since it lets you treat the hashtable structure
  as opaque. In any case, all the structure members that users are
  expected to access are intended to be kept in all future versions
  of the library.

EOF
