Hashtables, Pigeonholes, and Birthdays

cough
http://www.codinghorror.com/blog/archives/000834.html
Point number 2.

Amusingly, if I use your search box to search for either “pictures” or “photos”, it returns every single post you’ve ever made. I think that’s because the search engine is indexing some invisible text in the template for every page, but I’m not sure.

In defense of the English Police, he/she probably meant the “eg” in this sentence: “As you place more and more items in each bucket (eg, “collisions”)”

Ah, fair enough. I fixed that.

If we’re talking about storing every person’s DNA and running out of hash addresses, I think we’d first need to talk about running out of space in general before running out of buckets.

Actually, we aren’t storing the DNA, we’re just storing the hash. With enough bits in the hash (128?) this would work fine as a fingerprint; it’d be equivalent in uniqueness to the DNA, at least in theory.
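
Just to make the fingerprint idea concrete, here’s a minimal sketch assuming MD5 as the 128-bit hash; the sequence below is a stand-in, not real genome data:

```python
import hashlib

# Hypothetical illustration: store a 128-bit digest as a fingerprint of a much
# larger sequence instead of storing the sequence itself.
genome = "ACGT" * 1_000_000                        # stand-in for a real DNA string
fingerprint = hashlib.md5(genome.encode()).hexdigest()
print(len(fingerprint) * 4, "bits:", fingerprint)  # 32 hex chars = 128 bits
```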

@Cowherd – the picture is there to show the visual symmetry between a set of physical pigeonholes, and a calendar. What is a calendar, after all, except 365 arbitrary pigeonholes?

You can derive the 1.2 times the square root of n approximation by using the binomial coefficient form and invoking Stirling’s approximation.
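
The constant is actually sqrt(2 ln 2) ≈ 1.18, which rounds to the 1.2 quoted. If you’d rather skip Stirling, a quick numerical sketch finds the 50% crossover directly:

```python
import math

def collision_probability(k, n):
    """Chance that at least two of k random items land in the same of n buckets."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (n - i) / n
    return 1.0 - p_distinct

def items_for_half_chance(n):
    """Smallest k whose collision probability reaches 50%."""
    k = 1
    while collision_probability(k, n) < 0.5:
        k += 1
    return k

for n in (365, 2**16, 2**20):
    approx = math.sqrt(2 * math.log(2) * n)   # ~1.18 * sqrt(n)
    print(f"n={n}: exact={items_for_half_chance(n)}, approx={approx:.1f}")
```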

You can see how bad Feb. 29 birthdays can be by seeing The Pirates of Penzance.

There’s a visualization of the birthday problem at http://demonstrations.wolfram.com/TheBirthdayProblem/ which shows probabilities for both a pair and a trio of “collisions.”

I wrote a little about this a few months ago (http://blog.perfectapi.com/2007/10/hashtable-of-hashtables.html). If you have a scenario where you are likely to encounter hash-key duplicates, a simple solution is to have a second (or third, fourth, etc.) hash key.

So if you need a 320-bit key to avoid collisions, you can use 10x32-bit unique hash keys (or even 9x32 + 4x8). The only challenge remaining is finding a bunch of unique hashes (MD5, SHA-something, CRC, etc).
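
As a rough sketch of that idea (the particular digests here are just examples, not a recommendation), you could concatenate a few independent hashes to build a 320-bit composite key:

```python
import hashlib
import zlib

def composite_key(data: bytes) -> bytes:
    """Build a wider key by concatenating several independent hashes:
    128-bit MD5 + 160-bit SHA-1 + 32-bit CRC = 320 bits."""
    md5 = hashlib.md5(data).digest()           # 16 bytes
    sha1 = hashlib.sha1(data).digest()         # 20 bytes
    crc = zlib.crc32(data).to_bytes(4, "big")  # 4 bytes
    return md5 + sha1 + crc

key = composite_key(b"some record identifier")
print(len(key) * 8, "bit key:", key.hex())
```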

Actually, we aren’t storing the DNA, we’re just storing the hash. With enough bits in the hash (128?) this would work fine as a fingerprint; it’d be equivalent in uniqueness to the DNA, at least in theory.

It would be equivalent to the uniqueness of DNA in practice, but not in theory. A hash rarely has the same theoretical uniqueness as the range of inputs it summarizes; if it did, using the hash instead of the original data would be pointless.

Once again: theoretical DNA uniqueness is 1 in (number of possible nucleotides per base)^(number of bases in the genome), which for a human genome is roughly 4^(3 billion); theoretical 128-bit hash uniqueness is 1 in 2^128. To say that 2^128 is the smaller number is a great understatement.

Even if you’re one in a million, there are still more than 8 of you in New York City alone.

When talking about performance, people often forget that hash tables have some limitations.

For example, when facing hash collisions (which lower performance), many would naturally assume that the best answer is a larger hash table. After all, fewer collisions mean more O(1) accesses.

However, your array has to be as large as the hash key space.
That means if your keys cover, for example, the entire 24-bit range, you need an array of 2^24 slots (be they pointers, lists, or the actual stored objects).
Even if you insert only 100 objects into the table, a good hash function will probably spread them evenly across that whole array.

A hash table requires AND WASTES a lot more memory than other data structures if you don’t fill up the table. More memory means more page faults and poor cache locality.

In one story I heard, when porting a compiler to a system that had a lot more memory, the compiler team enlarged the hash table used for symbol names so that more symbols could be inserted with fewer collisions.
To their surprise, performance dropped significantly, even though the programs weren’t using that many symbols. When they profiled, they discovered that the slowdown came from heavy paging.

Hash tables are fun :)

As for collision handling, there are some good options: linear probing, quadratic probing, double hashing, etc. I prefer chaining using linked lists or trees, but to each his own.
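
For anyone who hasn’t seen chaining spelled out, here’s a minimal sketch; it uses plain lists for the per-bucket chains rather than linked lists or trees:

```python
class ChainedHashTable:
    """Toy hash table with separate chaining: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                    # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))         # new key (or collision): append to the chain

    def get(self, key):
        for k, v in self._bucket(key):      # linear scan of the (hopefully short) chain
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(num_buckets=5)
for name, ext in [("jeff", 101), ("simon", 102), ("ravi", 103)]:
    table.put(name, ext)
print(table.get("simon"))  # 102
```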

@Ravi: Ah dammit, missed the one in the actual text (no periods, and the comment included them). So I only spotted the two in the Wikipedia quote. Score one for easy quoting. I’m not sure that it should be “i.e.” there either though, because “i.e.” is used to give the explanation of a word, not the word itself. So it should probably be:
[With more and more “collisions” (ie, placing items into a bucket that already contains items)]
or just remove the i.e. completely.
And knowing is half the battle. Of course, the other half is to stop correcting poor Jeff’s grammar for the purposes of showing off one’s intelligence and knowledge of English grammar…

First, I see a heck of a lot of people advocating a single hashmap implementation with regard to collisions (linked list, tree, sorted array, linear search, exponential search, rehashing). This is like arguing that a Vector, linked list, unsorted array, sorted array, etc. is always the best option in every case. The analysis isn’t as trivial as in the latter case, but the choice of what kind of hashmap to use depends on how it is used: number of insertions, memory costs, retrieval frequency, etc. Kudos to K for touching on some easily missed considerations.

Good catch on the O(1) abuse, Andy; I hadn’t considered the formal definition, and it appears that no one else here has either. I’d still contend, however, that the colloquial use employed with regard to hashmaps has some utility. The problem with the formal definition as applied to hashmaps is that it assumes arbitrarily large N independent of any context, whereas any hashmap analysis requires the size of N to be considered.

Yes, sure, some theorists may avoid hashmaps because of the difficulty of analyzing them in absolutes, but these are probably the same purists who would never have accepted TCP because it is too empirically based and too difficult to model with absolute certainty using simple logical constructs. Theory has fantastic instructional and research-oriented uses, but it should never be used to the detriment of practical application. In the end, it is usable applications that are the true goal of all computer science theories. This isn’t meant to be computer philosophy, after all.

If you try to put 6 pigeons in 5 holes, at least one of the holes will contain at least two pigeons.

I have a small rant about big-oh abuses and hash tables here: http://hnr.dnsalias.net/wp/?p=29 if anyone is interested…

Now 2 to the power 32 approximately equals 4.3 billion?
No longer 4.0 billion?
Thank you. We needed that.

The hash function mentioned before:

f(x) = ((ax + b) mod p) mod 2^n

is really a good function, but it’s not called perfect hashing (since perfect hashing would never yield collisions and can only be achieved if the input values are known beforehand); it’s called universal hashing.

http://en.wikipedia.org/wiki/Universal_hashing
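
A minimal sketch of one member of that universal family, assuming integer keys smaller than p (the Mersenne prime 2^61 - 1 is just a convenient choice; a and b are drawn once, when the table is set up):

```python
import random

p = (1 << 61) - 1            # a prime larger than any key we intend to hash
n = 16                       # hash size in bits, i.e. a table of 2**16 slots
a = random.randrange(1, p)   # random multiplier, fixed for this table
b = random.randrange(0, p)   # random offset, fixed for this table

def universal_hash(x: int) -> int:
    """h(x) = ((a*x + b) mod p) mod 2**n"""
    return ((a * x + b) % p) % (1 << n)

print(universal_hash(42), universal_hash(43))
```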

The hash function mentioned before:

f(x) = ((ax + b) mod p) mod 2^n

Note that if p divides 2^n this formula loses some of its value. If p and 2^n are coprime then I believe this produces a repeating sequence of length 2^n with no repetitions between cycles.

I suspect 2^n is there because of the optimisation x mod 2^n = x & (2^n - 1). The “(ax + b) mod p” part is the important one… and has uses other than hashing (e.g. generating a non-repeating sequence of p numbers).
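
That optimisation is easy to sanity-check: the mask 2^n - 1 is a run of n one-bits, so ANDing with it keeps exactly the low n bits:

```python
n = 10
mask = (1 << n) - 1                          # 2**n - 1

for x in (0, 7, 1023, 1024, 123456789):
    assert x % (1 << n) == (x & mask)        # x mod 2**n == x & (2**n - 1)
print("x mod 2**n matches x & (2**n - 1) for every value tested")
```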

There are other performance considerations as well - how long does the hash take to compute? And how is its ‘hash performance’ affected by restricting its values to the index table’s size?

I’ve found that a C++ standard library ‘map’ container can outperform a ‘hash_map’ container (that’s part of the TR1 version of the standard library) when you have strings as keys. I think the reason for this is that the string values were reasonably evenly distributed across (in this case) the alphabet, so most of the map lookups only compared one or two characters before failing, which isn’t going to take too long. And I suspect that the ability of the hash function to produce unique values may have been adversely affected by reducing it to fit in the index table size, leading to multiple resultant key comparisons. I don’t know - all I know is that the application performed noticeably better with a ‘map’ vs. a ‘hash_map’. Noticeably enough that I decided to measure it.

Jeff,

Something that might interest you is that you can actually define an “optimal” hash function that provably has no bias.

f(x) = ((ax + b) mod p) mod 2^n

p is a prime larger than any message you want to hash. a and b are random natural numbers between zero and p and unique for each hash operation. n is the size of the hash in bits.

There is no way with this scheme that the choice of x can bias the result of f(x).

While this is a perfect hashing algorithm, if x is larger than just a few bytes, this calculation starts to have a serious performance penalty.

This scheme is the hashing equivalent of the one-time pad and can be used to provide provably secure authentication.

Simon.

I haven’t encountered a situation where 70000 entries are stored in a hashtable yet. It boggles my imagination to think of a business rule that requires such.

It could be possible if it was a simulation. Are there any Swarm or Repast users out there who can confirm this?

Uh, guys, the .NET hash table doesn’t use chaining. It’s a moot point whether you’d use a linked list or a sorted list (and if you’re going to go the memory-bloat route, why not use something even more efficient like another hash table or a btree?).

Almost every real-world hash table uses some form of re-hashing, and keeps its load factor below 50%. I know you CS types love to argue about this stuff, but the problem’s already been solved for 99% of plausible real-world scenarios. And if it doesn’t solve yours, then you probably need a more consistently-efficient data structure, like a tree or graph.
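
For what it’s worth, here’s a minimal sketch of that rehashing behaviour; it uses linear probing for simplicity (the real .NET Hashtable uses double hashing) and skips deletion entirely:

```python
class OpenAddressingTable:
    """Toy open-addressing hash table: linear probing, grows (rehashes)
    whenever an insert pushes the load factor above 0.5."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity       # each slot is (key, value) or None
        self.count = 0

    def _probe(self, key, slots):
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)         # linear probing: try the next slot
        return i

    def _resize(self):
        live = [s for s in self.slots if s is not None]
        self.slots = [None] * (len(self.slots) * 2)
        for key, value in live:              # re-insert everything at the new size
            self.slots[self._probe(key, self.slots)] = (key, value)

    def put(self, key, value):
        i = self._probe(key, self.slots)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)
        if self.count / len(self.slots) > 0.5:
            self._resize()

    def get(self, key):
        entry = self.slots[self._probe(key, self.slots)]
        if entry is None:
            raise KeyError(key)
        return entry[1]

t = OpenAddressingTable()
for i in range(20):
    t.put(f"key{i}", i)
print(t.get("key7"), len(t.slots))  # value 7; capacity grew to keep the load factor <= 0.5
```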