Array mapped hash tries and the nature of data structures

Clojure provides several immutable collection types: ordinary lists, associative arrays, and "a hash map and vector both based upon array mapped hash tries". Array mapped hash tries? What are those?

The short answer is that they're hashtables whose bodies are tries (prefix trees) made of bitmapped sparse arrays. ("Array mapped" refers, rather confusingly, to the last part.) The long answer can be found in two of Phil Bagwell's papers: Fast and Space Efficient Trie Searches and Ideal Hash Trees.

Nowadays most programmers are used to thinking of hashtables as a primitive data structure, but they're not as simple as they look. Ordinary hashtables combine three ideas in their implementation: arrays, hashing, and reprobing. Arrays are the most space-efficient collection type, so naturally you want to use them for everything, but they have the limitation that they can only be indexed with (a contiguous range of) integers. Hashing fixes this - it turns anything into an integer, suitable for indexing arrays or whatever data structure you please, at the cost of introducing collisions. Reprobing deals with those collisions, without the inefficiency of chaining. The combination has the coveted trait of fast constant-time lookup, so almost everyone uses it, but it's not the only way to build a hashtable.
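
To make the combination concrete, here is a minimal sketch in Java (Clojure's implementation language) of an ordinary destructive hashtable with linear reprobing. The name ReprobingMap is made up for illustration, and resizing and deletion are omitted:

    // A minimal sketch of hashing + reprobing (linear probing), for
    // illustration only: null keys are unsupported, and a real hashtable
    // would also resize and support deletion.
    class ReprobingMap<K, V> {
        private final Object[] keys = new Object[64];  // capacity a power of two
        private final Object[] vals = new Object[64];

        void put(K key, V val) {
            int mask = keys.length - 1;
            int i = key.hashCode() & mask;       // hashing: any key -> array index
            while (keys[i] != null && !keys[i].equals(key))
                i = (i + 1) & mask;              // reprobing: skip colliding slots
            keys[i] = key;                       // destructive update, note
            vals[i] = val;
        }

        @SuppressWarnings("unchecked")
        V get(K key) {
            int mask = keys.length - 1;
            int i = key.hashCode() & mask;
            while (keys[i] != null) {
                if (keys[i].equals(key))
                    return (V) vals[i];
                i = (i + 1) & mask;              // probe the next slot
            }
            return null;                         // an empty slot means absence
        }
    }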

Each of those three concepts can be used independently of the others. In particular, hashes can index structures other than arrays. This is what Clojure does. Rather than backing its hashtables with (reprobed) arrays, it backs them with tries, treating the hashcode as a sequence of 5-bit integers. Tries are less efficient than arrays, of course, but they have the advantage of nondestructive update in log time. Clojure's collections are immutable, so efficient nondestructive update is a must.
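
Here is a toy sketch of the chunking, not Clojure's actual code (the class HashPath and its childIndex function are made up for illustration):

    // Each 5-bit chunk of the hashcode selects one of 32 children, so a
    // 32-bit hash gives a trie at most ceil(32/5) = 7 levels deep - the
    // "log time" in question.
    class HashPath {
        static int childIndex(int hash, int level) {
            return (hash >>> (5 * level)) & 0x1f;  // the level'th 5-bit chunk
        }

        public static void main(String[] args) {
            int h = 0b01011_00001_00101;           // chunks, low to high: 5, 1, 11
            System.out.println(childIndex(h, 0));  // 5  - index into the root node
            System.out.println(childIndex(h, 1));  // 1  - index one level down
            System.out.println(childIndex(h, 2));  // 11 - and so on
        }
    }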

(Mini-rant: I refuse to call nondestructive update "persistence". There are plenty of other words for it - "nondestructive", "functional", "incremental". Do we have to appropriate a term whose well-established meaning is both important and likely to come up in the same context?)

Tries have a singularly confusing name, which may explain how little attention they get. They also have a space problem: each node may have many children, which must be accessed quickly, so an array is the obvious representation. But most nodes have only a few children, so arrays would waste a lot of space. Bagwell's (and Clojure's) solution is to use bitmapped sparse arrays. Each node of the trie has a bitmap showing which children are present, followed by a small array containing only those entries. For example, in a base-32 trie, a node with children at positions 1, 5, and 11 would be represented by the bitmap 100000100010 and a 3-element array of pointers to those children. This represents a 32-element sparse array in only 4 words! It's not even inefficient to index into: the compact index of an entry is the count of the 1 bits below it in the bitmap. Bit counting (population count) is supported in hardware on most architectures - but not, alas, on x86.
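
In code, the bitmap trick looks roughly like this (a sketch; Clojure's actual nodes are more elaborate, but the indexing idea is the same):

    // A bitmapped sparse array node: the bitmap records which of 32 possible
    // children exist, and the compact array stores only those, in order.
    class BitmapNode {
        int bitmap;         // bit i set  <=>  child i is present
        Object[] children;  // one slot per set bit in the bitmap

        Object child(int i) {
            int bit = 1 << i;
            if ((bitmap & bit) == 0)
                return null;  // child i is absent
            // The compact index is the number of 1 bits below bit i.
            // Integer.bitCount is the population count discussed above.
            return children[Integer.bitCount(bitmap & (bit - 1))];
        }
    }

With the example bitmap 100000100010, looking up child 11 counts the two 1 bits below bit 11 (those of children 1 and 5), so the pointer is found at children[2].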

But even with hardware support, these tries are considerably slower than ordinary hashtables - fast log time rather than fast constant time. That is the unavoidable price of nondestructive updates: they require keeping more information around than destructive ones. We functional programmers are used to thinking of state as a valuable (if hazardous) optimization tool, but we often mistake the reason. The main performance benefit of side effects isn't from the obvious operations like nconc, which use state to recycle old objects and slightly reduce allocation. Much more important is the freedom they give us to use different data structures and different algorithms, and to express our programs in different ways.

3 comments:

  1. As a long-term radix-tree fan(atic), this is JUST THE THING I've been looking for for years!

    Wow, thanks!

  2. There are actually a bunch of clever algorithms for counting the number of set bits, as well as a special Intel instruction (I'm guessing the earlier x86 chips didn't have it though).

    Fast Bit Counting Algorithms

  3. Clojure uses Integer.bitCount, so it may well take advantage of hardware where available. But of course bit-counting is logic, not memory access, so even without hardware it's probably not a significant cost; I probably shouldn't have mentioned it.
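
    (Without hardware support, bit counting falls back to tricks like this classic divide-and-conquer sketch, roughly the algorithm behind Integer.bitCount:)

        // The classic divide-and-conquer popcount: sum adjacent bits, then
        // pairs, then nibbles, then let a multiply add up the four bytes.
        static int popcount(int x) {
            x = x - ((x >>> 1) & 0x55555555);                 // 2-bit sums
            x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);  // 4-bit sums
            x = (x + (x >>> 4)) & 0x0f0f0f0f;                 // 8-bit sums
            return (x * 0x01010101) >>> 24;                   // total in top byte
        }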

