
Survival of Existence between Search Trees and Hash Tables

1.1 Introduction:

Two of the fundamental data structures for searching are Search Trees and Hash Tables.

Search trees are a basic tool for solving combinatorial problems. The underlying idea is quite simple: decompose a given problem into finer and finer partial problems (applying a suitable branching operation) such that the generated partial problems together solve the original problem.

A search tree is a structure that stores objects, each identified by a key value, in a tree structure (Cormen 2009). The key values are typically integers that can be compared in constant time, and these comparisons are used to guide access to a specific object by its key. Every search starts at the root; each node contains a key value that is compared with the query key, and depending on whether the query key is smaller or larger one moves to a different next node, until a node containing the right key is found. Each subtree of a search tree is itself again a search tree. Search trees are one way to implement the abstract structure called a dictionary. Some examples of search-tree data structures are AVL trees, red-black trees, B-trees, B+ trees, tries, and many more.
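As a minimal sketch of this idea (a hypothetical illustration, not taken from any of the cited sources), a search-tree node and a key-guided lookup might look like this:

```python
# Minimal binary-search-tree sketch (hypothetical example).
# Each node stores a key, the associated object, and left/right children.

class Node:
    def __init__(self, key, obj):
        self.key = key      # integer key used to guide the search
        self.obj = obj      # the stored object identified by the key
        self.left = None    # subtree containing keys smaller than self.key
        self.right = None   # subtree containing keys larger than self.key

def search(node, query_key):
    """Return the object stored under query_key, or None if it is absent."""
    while node is not None:
        if query_key < node.key:
            node = node.left      # query key is smaller: go left
        elif query_key > node.key:
            node = node.right     # query key is larger: go right
        else:
            return node.obj       # keys are equal: object found
    return None
```

Each step costs one or two key comparisons, so the total cost of a lookup is proportional to the depth of the node that is finally reached.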

Hash tables are a dictionary structure of great practical importance and can be very efficient. The underlying idea is quite simple: we have a universe U and want to store a set of objects with keys from U. We also have s buckets and a function h from U to S = {0, ..., s − 1}, and we store the object with key u in the h(u)-th bucket (Cormen 2009). The idea of hash tables is quite old, apparently originating in several groups at IBM in 1953 (Brass 2008: 374). A hash table uses a hash function to map identifying values, known as keys, to array positions: the hash function transforms the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought. Hashing allows much better search performance than search trees in the typical case, which raises a question: should search trees be deprecated in favour of hash tables? This essay can be seen as a quest to answer that question.
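A minimal sketch of the key-to-bucket mapping described above (hypothetical names; a real implementation also needs the collision handling discussed later) might look like this:

```python
# Sketch of the bucket mapping h: U -> {0, ..., s-1} (hypothetical example, no collision handling).
s = 8                                  # number of buckets
table = [None] * s                     # one slot per bucket

def h(key):
    """Toy hash function: map an integer key to a bucket index in {0, ..., s-1}."""
    return key % s

def store(key, obj):
    table[h(key)] = (key, obj)         # the object with key u goes into bucket h(u)

def lookup(key):
    entry = table[h(key)]
    return entry[1] if entry is not None and entry[0] == key else None
```

The lookup cost is one hash evaluation plus one array access, independent of how many objects are stored, as long as no two stored keys map to the same bucket.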

Search Trees: In each node we compare the query key with the key contained in the node, following the left branch if the query key is smaller and the right branch if it is larger. But what happens if the two keys are equal? Two models of search trees handle this issue. In the first model, we take the left branch if the query key is smaller than the node key and otherwise the right branch, until we reach a leaf of the tree; the keys in the interior nodes are used only for comparison, and all the objects are stored in the leaves. In the second model, we take the left branch if the query key is smaller than the node key, the right branch if it is larger, and the object contained in the node itself if they are equal.

Model 1 gives a binary tree, whereas in model 2 each node is effectively a ternary node with a special middle neighbour. In model 1 each interior node has a left and a right subtree, whereas in model 2 we must allow incomplete nodes in which the left or right subtree may be missing, and only the comparison key and the object are guaranteed to exist. Model 1 is therefore used more often than model 2 and has a clear advantage: traversing an interior node requires only one comparison in model 1, whereas model 2 needs two comparisons to distinguish the three possibilities.

The central property that distinguishes the different combinatorial types of search trees is the height. The height of a search tree is the maximum length of a path from the root to a leaf, the maximum being taken over all leaves. Not all leaves are at the same distance from the root; the distance of a specific node from the root is called the depth of that node. The maximum number of leaves of a search tree of height h is 2^h, and at the other end, the minimum number of leaves is h + 1. A search tree for n objects therefore has height at least log n and at most n − 1, and average depth at least log n and at most ((n − 1)(n + 2))/2n ≈ n/2. The time taken by various operations in common balanced search trees is shown in Table 1.1.

Operation | AVL tree  | (2,4) tree | Red-black tree | Splay tree
find      | O(log n)  | O(log n)   | O(log n)       | O(log n)
delete    | O(log n)  | O(log n)   | O(log n)       | O(log n)
insert    | O(log n)  | O(log n)   | O(log n)       | O(log n)

Table 1.1 Running-time analysis of various operations in search trees (bounds for splay trees are amortized)
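As a companion to the lookup sketch above, here is a hypothetical sketch of plain, unbalanced insertion, reusing the Node class from that sketch. It runs in O(height) time; inserting keys in sorted order degrades the tree to the worst-case height of n − 1, which is exactly what the balanced variants in Table 1.1 are designed to prevent.

```python
def insert(root, key, obj):
    """Insert (key, obj) into the tree rooted at root and return the (possibly new) root.
    Runs in O(height) time; without rebalancing the height can degrade to n - 1."""
    if root is None:
        return Node(key, obj)
    if key < root.key:
        root.left = insert(root.left, key, obj)
    elif key > root.key:
        root.right = insert(root.right, key, obj)
    else:
        root.obj = obj        # equal keys: overwrite the stored object
    return root

# Inserting keys 1..n in sorted order produces a single path of height n - 1,
# whereas a balanced insertion order keeps the height close to log2(n).
```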

Another kind of find operation on a tree is a range query: we are given a key interval [a, b) and want to find all keys contained in this interval. Other dictionary structures, such as hash tables, cannot support this type of query.
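A sketch of such a range query (hypothetical, reusing the Node class from the earlier sketch): it reports every key in [a, b) in increasing order, something a hash table cannot do without scanning all buckets.

```python
def range_query(node, a, b, out=None):
    """Collect all (key, obj) pairs with a <= key < b, in increasing key order."""
    if out is None:
        out = []
    if node is None:
        return out
    if a < node.key:                  # keys in [a, b) may exist in the left subtree
        range_query(node.left, a, b, out)
    if a <= node.key < b:             # the node's own key lies in the interval
        out.append((node.key, node.obj))
    if node.key < b:                  # keys in [a, b) may exist in the right subtree
        range_query(node.right, a, b, out)
    return out
```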

Hash Table: A hash table for a given key type consists of a hash function h and an array (called the table) of size N; maps are commonly implemented with hash tables. If several objects that we want to store are mapped to the same bucket, we have a collision between these objects. If there are no collisions, then we can realize the buckets simply as an array, each array entry having space for one object. There are two common ways to handle collisions. The first is to give each s ∈ S a secondary structure that stores all the elements x ∈ X with h(x) = s: each of the s buckets then contains another dictionary, but because the buckets should contain only a few elements, this secondary dictionary can be very simple. The simplest method is just a linked list of the elements; this is called "chaining." The second is to give each u ∈ U a sequence of alternative addresses in S: if h(u) = h1(u) is already used by a colliding element, we try h2(u), h3(u), ... until we find an empty bucket. This is called "open addressing," and linear probing and double hashing are common ways of generating the probe sequence. The goal of the hash function is to "disperse" the keys in an apparently random way. A hash function is perfect if it causes no collisions for the set it stores, but the practical importance of perfect hash functions remains dubious, in spite of the numerous papers on them. In the worst case, searches, insertions and removals on a hash table take O(n) time.
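A minimal chaining sketch (hypothetical example, with no resizing): each bucket holds a small list of the colliding (key, object) pairs.

```python
class ChainedHashTable:
    """Minimal hash table with chaining (hypothetical sketch; no resizing)."""

    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Python's built-in hash() stands in for the hash function h.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, obj):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, obj)   # key already present: replace the object
                return
        bucket.append((key, obj))        # otherwise chain the new entry

    def find(self, key):
        for k, obj in self._bucket(key):
            if k == key:
                return obj
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
```

As long as the chains stay short, each operation costs one hash evaluation plus a scan of a few entries; if many keys collide in one bucket, the cost degrades toward O(n).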

1.2 Hash Tables versus Search Trees: Hash tables are faster in most cases, but can be very bad in their worst case. Search trees have many advantages but are somewhat slower in typical cases. Balanced binary search trees have a fairly uniform complexity: each element takes one node in the tree (typically four words of memory), and an access takes about log2(n) comparisons, with a guaranteed bound. Hash tables are more variable. They require an array of around 2n pointers, and access to an element depends on the quality of the hash function. A hash table "works" if all the elements you want to store in it have different hashes; in that case the basic operations (lookup, insertion, deletion) take O(1) time with a fairly small constant (one hash calculation plus one pointer lookup).

In a hash table, however, the O(1) complexity is not guaranteed. There is a point at which the table becomes full; when that happens (or, better, a little before it happens), the table needs to be enlarged, which requires moving all of its elements at an O(n) cost. This can introduce odd behaviour when many elements are added. It is also possible for the input to collide on a few hash values. This issue has led some programming language implementations (such as Perl and Python) to switch from a plain hash table to one whose hash function involves a random number chosen when the table is built, together with a hash function that spreads this random datum well (which increases the multiplicative constant hidden in the O(1)), or to fall back to a binary search tree. Collisions could be made negligible by using a cryptographic hash, but this is not done in practice because cryptographic hashes are comparatively very slow to compute.

When data locality is thrown into the mix, hash tables do poorly. They work precisely because they store related elements far apart, which means that if the application looks up elements sharing a prefix in sequence, it will not benefit from cache effects. This does not matter if the application makes essentially random lookups. Search trees can also be used as immutable (persistent) structures: if you need to take a copy of a tree and change a few elements in it, you can share most of the data structure, whereas copying a hash table means copying the whole array of pointers. For this reason, hash tables are often not an option in purely functional languages.

Beyond strings, hash tables and binary search trees make different requirements on the keys: a hash table requires a hash function, while a binary search tree requires a total order. Hashes can sometimes be cached, if there is enough room in the data structure where the key is stored; caching the result of comparisons (a binary operation) is usually impractical. If you want to list the keys in alphabetical order, hash tables are of no help, whereas you can straightforwardly traverse a search tree in order. Similarly, hash tables are of no use for computing shortest paths in graphs, whereas the tree that a graph search unfolds into can be used to recover the paths. Binary search trees and hash tables can also be combined in the form of hash trees; this is useful, for example, in a purely functional programming language where you want to work on data that does not have an easy-to-compute order relation (Cormen 2009). A trie can be used when the keys are strings. A trie is a tree, but it is indexed differently from a search tree: the key is written in binary, and one goes left for a 0 and right for a 1, so the cost of an access is proportional to the length of the key rather than to the number of stored keys.
Tries can be compressed by removing intermediate nodes that have only one child; the result is known as a Patricia tree or radix tree. In terms of implementation effort, binary search trees are of moderate complexity, while a well-tuned hash table is harder to get right. A tree implementation also grows trivially to the required size, whereas a hash table starts to degrade when it gets full (for most implementations, reportedly at around 70% of the buckets filled): you then either need to rehash the entire table (again, bad for real-time applications) or incrementally migrate to a new table, which is not a simple implementation.
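A sketch of the bitwise trie described above (hypothetical example): keys are treated as fixed-width bit strings, going left on a 0 bit and right on a 1 bit, so an access costs O(key length) rather than a number of comparisons that grows with n.

```python
class TrieNode:
    def __init__(self):
        self.children = [None, None]   # children[0] for bit 0 (left), children[1] for bit 1 (right)
        self.obj = None                # object stored at the end of a complete key, if any

def trie_insert(root, key, obj, width=8):
    """Insert an integer key, read as a width-bit binary string, into the trie."""
    node = root
    for i in range(width - 1, -1, -1):
        bit = (key >> i) & 1           # go left for a 0 bit, right for a 1 bit
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.obj = obj

def trie_find(root, key, width=8):
    """Follow the bits of the key; return the stored object or None."""
    node = root
    for i in range(width - 1, -1, -1):
        node = node.children[(key >> i) & 1]
        if node is None:
            return None
    return node.obj
```

A Patricia or radix tree applies the same idea but skips the chains of single-child nodes, so each stored key is reached in at most as many steps as it has bits, usually far fewer.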

1.3 Conclusion: There is a famous rule from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." Search trees are useful for maintaining records of data without much extra space. A hash table requires more memory than is needed to hold its data: in general, a hash table should not be more than 75%-80% full, so to hold 100 elements we need a table with room for roughly 125 to 133 elements. Rehashing is a very expensive operation that requires O(n) work, so the constant-time insert becomes O(n) whenever rehashing is needed, which is far worse than the performance of a tree. Trees can save memory because they use only the amount of memory needed to hold their items, there is no need for rehashing because trees grow and shrink as needed, and it is simple to extract items in their natural order. Use a hash table if you know approximately how many items the collection must hold and you do not need to extract items in a natural order. Go for a tree if memory is tight, if you do not know how many items will be stored, or if you want to extract objects in a natural order. In other words, if items are constantly added and removed and the final size of the collection is unknown, we may be willing to accept O(log n) search time (instead of the constant time of a hash table) to avoid potential rehashing at runtime.

From the above discussion, I conclude that even though hashing offers better average search performance than search trees, hash tables have their own disadvantages and cannot be used everywhere that search trees can. Each structure rules in its own domain, neither can replace the other, and both are used for specific purposes. Hence search trees should not, and cannot, be deprecated unless some more powerful data structure appears that combines the strengths of both hash tables and search trees.

References:

[1] Ahuja, R., et al. (1993) Network Flows: Theory, Algorithms and Applications. Prentice Hall

[2] Bang-Jensen, J., and Gutin, G. "Section 2.3.4: The Bellman-Ford-Moore algorithm". Digraphs: Theory, Algorithms and Applications, 1st edn

[3] Brass, P. (2008) Advanced Data Structures. Cambridge and New York: Cambridge University Press

[4] Brown, S.G. crc32.c [online] available from [1 July 2012]

[5] Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2009) Introduction to Algorithms, 3rd edn. New Delhi: Prentice Hall of India

[6] Dasgupta, S., Papadimitriou, C.H., and Vazirani, U.V. (2006) Algorithms, July 18, 2006

[7] Dial, R.B., et al. (November 1969) Algorithm 360: Shortest-Path Forest with Topological Ordering. Communications of the ACM, 12(11), 632-633

[8] Fredman, M.L., and Tarjan, R.E. (July 1987) Fibonacci Heaps and Their Uses in Improved Network Optimization Algorithms. Journal of the Association for Computing Machinery, 34, 596-615

[9] Gupta, P., Aggarwal, V., and Varshney, M. Design and Analysis of Algorithms, Eastern Economy edn

[10] Knuth, D.E. (1998) The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd edn. New Delhi: Addison-Wesley Indian Reprint

[11] Skiena, S.S. The Algorithm Design Manual, 2nd edn. Springer Science and Business Media

[12] Tanenbaum, A.S. Computer Networks, 3rd edn. Pearson

[13] Wang, T. (Jan 1997) Integer Hash Function, Version 3.1, last update Mar 2007