Speed Up Your Searches

Usually, in Computer Science classes, sorting algorithms are studied in depth. The reason is quite simple: they are widely used and can be applied to several contexts. Moreover they represent a good way to introduce the concept of computational complexity.

In the real world, one of the reasons to sort items is for searching. On a large set of data, finding an item by scanning every element to see if it matches can be a very slow operation. If the array is sorted by a key, we can perform binary search in order to lower complexity on worst case from O(N) to O(log N). This is a good improvement on large set of data, isn't it? Now, what if I say that I can lower the computational complexity to O(1)?

No Magic, Just Some Math

A small phone book as a hash table (image by Jorge Stolfi)
The trick is to find a way (or an algorithm, if you prefer) to convert the key (that can be a number, a string or whatever) into an index that directly refers to the desired information. This "way" is called hash function and the data structure hash table.

There are hundreds of articles about how hash tables works under the hood and at least the same number of implementations. In many high level languages hash tables are available even as built-in types, often called associative arrays, dictionaries or maps.

Hash tables are widely used when the data set is huge and high response speed is needed. For example, in some RDBMS, indexes are implemented with hash tables. In addition, unlike arrays, you are not forced to have object of the same size stored into hash tables, so the used space (in memory or disk) is optimized and this also affects the average speed of a search.

All That Glitters Is Not Gold

Unfortunately there are some drawbacks with hash tables; the two biggest are both related to the hash function: speed and collisions. The speed issue is related to the execution time of the hash function itself and it's critical when the hash table doesn't contain so much data, so the cost of the linear search in the medium case may be comparable with execution of the hash function.

The collision is a more subtle problem: the hash function may produce the same output for different keys and exist various ways to store different data in this case but they all take some time, reducing the performances of the hash table.

The good news is that many implementations let you choose to use the built-in hash function or provide your own, so, if the speed is a critical requirement of your project, you can create a hash function that better fits your needs.

Conclusions

Hash tables are a very powerful and flexible data structure that can increase the speed of your searches on a huge set of data.

If you are wondering which hash table implementation I use in C, the answer is GLib Hash Table.


Post a Comment