Algorithms

Understanding Hash table Data Structures

Hash fuction

What is Hashtable?

Hashtable is a data structure which records are stored in buckets using hash keys.

Advantage of using Hashtable is that a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.

 There are various ways where hashing is used in our daily lives, for example. Generally, every company assign their employee an employee code when they join. Whenever they need any information, they can use employee code to see all related data for that employee. Here is an example which I have taken from a StackOverflow answer:

Now to understand it better please read the following example taken from an answer on StackOverflow.

Let’s assume you want to fill up a library with books and not just stuff them in there, but you want to be able to easily find them again when you need them.

So, you decide that if the person that wants to read a book knows the title of the book and the exact title to boot, then that’s all it should take. With the title, the person, with the aid of the librarian, should be able to find the book easily and quickly.

So, how can you do that? Well, obviously you can keep some kind of list of where you put each book, but then you have the same problem as searching the library, you need to search the list. Granted, the list would be smaller and easier to search, but still you don’t want to search sequentially from one end of the library (or list) to the other.

You want something that, with the title of the book, can give you the right spot at once, so all you have to do is just stroll over to the right shelf, and pick up the book.

But how can that be done? Well, with a bit of forethought when you fill up the library and a lot of work when you fill up the library.

Instead of just starting to fill up the library from one end to the other, you devise a clever little method. You take the title of the book, run it through a small computer program, which spits out a shelf number and a slot number on that shelf. This is where you place the book.

The beauty of this program is that later on, when a person comes back in to read the book, you feed the title through the program once more, and get back the same shelf number and slot number that you were originally given, and this is where the book is located.

Now to map details with a hash key, a hash function is used to generate a hash key for information. But what’s a hash function?

What is a Hash function?

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or only hashes. It just takes things from one space and maps them to something useful for indexing.

I have taken the following example diagram from Wikipedia, which demonstrates the Hash Table pretty neatly. There is a Hash function which takes Names as input and converts them into easily mappable hashes.

Hash fuction

To create a good hashtable, It is essential to have a good hash function with the following basic requirements:

  1. Easy to compute: It should be easy to calculate and must not become an algorithm in itself.
  2. Uniform distribution: It should provide a uniform distribution across the hash table and should not result in clustering.
  3. Fewer collisions: Collisions occur when pairs of elements are mapped to the same hash value. These should be avoided.

No matter how good your hash function is, collisions are bound to occur. Unless you are doing “perfect hashing” you have to have a collision resolution strategy, to deal with collisions in the table. There are multiple ways, like 1. Linear probing 2. Double hashing 3. Random hashing 4.Separate chaining. We will cover that in future articles.

What is the load factor and Rehashing?

The load factor is the number of keys stored in the hash table divided by the capacity.

So if there are n entries and b is the size of the array there would be n/b entries on each index. This value n/b is called the load factor that represents the load that is there on the map. This Load Factor should be kept low, so that number of entries at one index is less and so our hash table is still efficient.

Now, remember this. Low load factor means less possible collisions and efficient Hashtable, but if load factor increases, so do the time complexity for searching. So we need to do Rehashing for it. So if load factor increases and crosses a particular threshold value, we should increase the array size and reinsert values in this broader array as the size of array increases, the load factor decreases and results in less collision in the data table.

Complexity of Hashing

The worst-case occurs if the hash achieves poor distribution (for any alpha), and as you stated in your original post, is O(N). The best-case occurs when you have a well-distributed hash and alpha is relatively large (>1.0), and as you said, that is O(1). So we agree on the best case and worst case.

 

Leave a Reply

Your email address will not be published. Required fields are marked *