Hashtables, Pigeonholes, and Birthdays

One of the most beloved of all data structures in computer science is the hash table.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2007/12/hashtables-pigeonholes-and-birthdays.html

Joseph said:
“With a sufficiently large number of twins or triplets, we might be able to store the DNA of everyone in the planet in a single hashtable.”

…or a few nukes.

Shameless plug: http://blogs.msdn.com/dcook/archive/2007/09/09/hashes-and-tables-and-primes-oh-my.aspx

Modeling the number of collisions in a hash table is somewhat more involved than might be obvious. You end up using Poisson distributions and other techniques that you thought you’d never use again after finishing your college statistics class. See http://burtleburtle.net/bob/hash/birthday.html#half for a bit more information.

Here are a few numbers that I worked out (take them with a grain of salt – I’m a code monkey, not a statistician). If you need to hash 100 items, what results can you expect from various sized hash tables? (Note that this assumes some kind of chaining for collision resolution; otherwise things would get hairy after 100% load.) Cross your fingers that the table comes out OK in the blog :slight_smile:

| Load Factor | Buckets | Expected Collisions | Expected # of Probes |
|------------:|--------:|--------------------:|---------------------:|
| 50% | 200.0 | 21.3 | 1.14 |
| 60% | 166.7 | 24.8 | 1.16 |
| 70% | 142.9 | 28.0 | 1.20 |
| 80% | 125.0 | 31.2 | 1.23 |
| 90% | 111.1 | 34.1 | 1.26 |
| 100% | 100.0 | 36.8 | 1.29 |
| 110% | 90.9 | 39.4 | 1.32 |
| 120% | 83.3 | 41.8 | 1.36 |
| 130% | 76.9 | 44.0 | 1.39 |
| 140% | 71.4 | 46.2 | 1.43 |
| 150% | 66.7 | 48.2 | 1.47 |
| 160% | 62.5 | 50.1 | 1.50 |
| 170% | 58.8 | 51.9 | 1.54 |
| 180% | 55.6 | 53.6 | 1.58 |
| 190% | 52.6 | 55.2 | 1.62 |
| 200% | 50.0 | 56.7 | 1.66 |
| 210% | 47.6 | 58.1 | 1.70 |
| 220% | 45.5 | 59.5 | 1.74 |
| 230% | 43.5 | 60.7 | 1.78 |
| 240% | 41.7 | 61.9 | 1.82 |
| 250% | 40.0 | 63.1 | 1.87 |
| 260% | 38.5 | 64.1 | 1.91 |
| 270% | 37.0 | 65.1 | 1.95 |
| 280% | 35.7 | 66.1 | 2.00 |
| 290% | 34.5 | 66.9 | 2.04 |
| 300% | 33.3 | 67.8 | 2.09 |
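
If anyone wants to play with numbers like these, here's a rough sketch of one way to get them. This is my own reconstruction using the Poisson approximation Bob Jenkins describes, not necessarily the exact calculation behind the table: with n items hashed uniformly into m buckets, the expected number of occupied buckets is about m * (1 - e^(-n/m)), the expected collisions are n minus that, and the expected probes per successful lookup come out to roughly half the average non-empty chain length, plus a half. The helper name `chaining_stats` is mine.

```python
import math

def chaining_stats(items, buckets):
    """Approximate separate-chaining behaviour with a Poisson model.

    items   -- number of keys inserted (n)
    buckets -- number of buckets in the table (m)
    Returns (expected collisions, expected probes per successful lookup).
    """
    lam = items / buckets                      # load factor n/m
    occupied = buckets * (1 - math.exp(-lam))  # expected non-empty buckets
    collisions = items - occupied              # items that landed on an occupied bucket
    avg_chain = items / occupied               # average length of the non-empty chains
    probes = (avg_chain + 1) / 2               # average position within a chain
    return collisions, probes

# Reproduce a few rows of the table above for 100 items.
for load in (0.5, 1.0, 1.5, 2.0):
    buckets = 100 / load
    c, p = chaining_stats(100, buckets)
    print(f"{load:4.0%}  buckets={buckets:6.1f}  collisions={c:5.1f}  probes={p:4.2f}")
```

The printed rows land right on (or within a rounding hair of) the table above, so the model seems close enough for back-of-the-envelope work.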

@Telos: “Heh. I was just thinking that seasonal variations would have an effect when I read the last line.” I was wondering if there were any twins in the class so I could be snarky and say 100%…

Well… to be honest, you’d have lost if you had been snarky. I went to school with a pair of twins - who had different birthdays. One was born at 11:50PM or so, the other didn’t show up till 12:05AM the next day.

There was also a recent news item: at the end of Daylight Saving Time, a pair of twins showed up and the one born second was listed as “older” - the first one arrived at 2:50AM, then the clock fell back an hour, so the second arrived at 2:10AM…

Being a leap-year baby has its downsides. Many people don’t know which day to celebrate it on, but a few of the more happy-go-lucky will just congratulate me on both days! And that is just plain fun. :slight_smile:


Danny.

For 32-bit hashes you don’t have a 50% collision rate at 77k entries. That is, the probability of a collision when inserting one more entry, which is what one might think of as the collision rate, is about 0.002%. But there is a 50% probability that somewhere within those 77k entries there is at least one collision.
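
To put numbers on that distinction, here's my own quick check using the standard birthday approximation (the constants and variable names are just for illustration): the chance that the next insert collides with one of 77,000 existing 32-bit hash values is tiny, while the chance that some pair among those 77,000 has already collided is roughly even.

```python
import math

SLOTS = 2 ** 32      # possible 32-bit hash values
ENTRIES = 77_000     # roughly the 50% birthday threshold for 32 bits

# Probability that one more insert collides with an existing entry.
per_insert = ENTRIES / SLOTS
print(f"next-insert collision chance: {per_insert:.4%}")      # ~0.0018%

# Probability that at least one collision already exists among ENTRIES items,
# using the birthday approximation 1 - exp(-k*(k-1)/(2n)).
any_collision = 1 - math.exp(-ENTRIES * (ENTRIES - 1) / (2 * SLOTS))
print(f"chance of any collision so far: {any_collision:.1%}")  # ~50%
```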

Heh. I was just thinking that seasonal variations would have an effect when I read the last line.

"For 32bit hashes you don’t have 50% collision rate at 77k entries."
Imagine if you did, and it kept increasing. That’d be one hellishly unbalanced hash.

Jeff, I love your blog, so please forgive me for being the language police: you need to stop mixing up “e.g.” and “i.e.”. If you can’t keep the difference straight, just stick to English - “for example” and “that is” are perfectly serviceable.

Ants/Tom – thanks. I corrected the text. What’s the math behind estimating the actual collision rate?

With a sufficiently large number of twins or triplets, we might be able to store the DNA of everyone in the planet in a single hashtable.

Jeff,

maybe you missed the other major point of hash functions. Besides trying to avoid collisions in general, most good implementations are designed to avoid collisions for similar inputs, i.e. given two inputs of the same size, even a single-bit difference will change the hash significantly.

So the chance of a collision might be unexpectedly high per se, but the probability of two similar inputs yielding the same hash value are next to 0 (provided that you use a good hash function). It’s exactly this property that makes hash functions suitable for data integrity checks.
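
Here’s a quick way to see that avalanche property, using SHA-256 from the standard library just because it’s handy (the input strings are arbitrary examples): flip a single input bit and roughly half of the output bits change.

```python
import hashlib

a = hashlib.sha256(b"hashtable").hexdigest()
b = hashlib.sha256(b"iashtable").hexdigest()  # 'h' -> 'i': a single input bit flipped

print(a)
print(b)

# Count how many of the 256 output bits differ.
diff = bin(int(a, 16) ^ int(b, 16)).count("1")
print(f"{diff} of 256 bits differ")   # typically right around half
```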

If the hash function is good and you have a uniform distribution of hash values for your input data, then the collision rate is the number of entries divided by the size of the table. If the hash function is not good, then the math is more complicated and depends on the hash.

Imagine, for instance, a 256-entry hash table of byte strings where the hash function is the sum of the bytes in the string modulo 256. Now if you store only strings composed of ‘@’ characters (ASCII 64) in there, all your hashes will be 0, 64, 128, or 192. That means there is nearly a 50% chance of collision when there are only 2 entries in the table.
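
You can watch that degenerate behaviour in a couple of lines of Python; this is just a throwaway version of the ‘@’-strings example above (the name `bad_hash` is mine):

```python
def bad_hash(s: bytes) -> int:
    """Toy hash: sum of the byte values modulo the table size (256 buckets)."""
    return sum(s) % 256

# Strings made up only of '@' (ASCII 64), lengths 1 through 20.
buckets = {bad_hash(b"@" * n) for n in range(1, 21)}
print(sorted(buckets))   # only [0, 64, 128, 192] ever show up
```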

Jeff, the math behind the calculation is pretty straightforward.

For n available slots and m occupying items,
Prob. of no collision on first insertion = 1
Prob. of no collision on 2nd insertion = ( n - 1 ) / n
Prob. of no collision on 3rd insertion = ( n - 2 ) / n

Prob. of no collision on mth insertion = ( n - ( m - 1 ) ) / n

So the probability of no collision on ANY of m insertions is the product of the above values, which works out to

( n - 1 )! / ( ( n - m )! * n^( m - 1 ) )

and conversely, the likelihood of at least one collision is just 1 minus the above value.
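
If you want to evaluate that in code, the running-product form is friendlier than the factorials, which overflow long before the numbers get interesting. A quick sketch (my own function name, and the 77,000 figure is just the 32-bit example from earlier in the thread):

```python
def p_collision(slots: int, items: int) -> float:
    """Probability of at least one collision when hashing `items` keys
    uniformly into `slots` possible values (the birthday calculation above)."""
    p_no_collision = 1.0
    for i in range(items):
        # Probability the (i+1)th insertion avoids the i already-used slots.
        p_no_collision *= (slots - i) / slots
    return 1.0 - p_no_collision

# For 32-bit hashes, the 50% mark lands near 77,000 entries.
print(p_collision(2 ** 32, 77_000))   # roughly 0.50
```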

Even in the worst case with one bucket, the performance of a well-written hash table shouldn’t be O(n). The elements in a bucket should be stored in a sorted array to permit a binary search, so that your lookup is at most O(log n). No sense taking a good scheme like a hash and messing it up with a linear search for multiple values per bucket.
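
For what it’s worth, here’s a minimal toy sketch of that sorted-bucket idea (my own code, assuming the keys are mutually comparable so they can be kept in order): inserts keep each bucket sorted, and lookups binary-search the bucket instead of walking it. The insert still pays to shift the array, so this mainly helps read-heavy tables.

```python
import bisect

class SortedBucketHashTable:
    """Toy hash table whose buckets are sorted lists searched with bisect."""

    def __init__(self, buckets=16):
        self._buckets = [[] for _ in range(buckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def insert(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, key)
        if i == len(bucket) or bucket[i] != key:
            bucket.insert(i, key)                 # keep the bucket sorted

    def contains(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, key)       # O(log n) even if everything
        return i < len(bucket) and bucket[i] == key  # hashed to one bucket

t = SortedBucketHashTable(buckets=1)   # pathological case: a single bucket
for word in ["pigeon", "hole", "birthday", "hash"]:
    t.insert(word)
print(t.contains("birthday"), t.contains("collision"))   # True False
```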

The examples of e.g. in the quoted text are indeed correct. What he is more likely referring to is Jeff’s actual usage, which has probably been missed because it is not punctuated.

‘As you place more and more items in each bucket (eg, “collisions”)’ should be “As you place more and more items in each bucket (i.e., “collisions”)”.

It is not ‘As you place more and more items in each bucket (for example, “collisions”)’ but probably more correctly ‘As you place more and more items in each bucket (in other words, “collisions”)’

It is not true that every person has completely unique DNA. In fact, identical twins (and triplets, quadruplets, etc.) share identical DNA.

Apologies for defending the Language Police, but I believe it was the following that he was referring to:

“As you place more and more items in each bucket (eg, “collisions”) you edge closer”

Back on topic: not so long ago, the architects of our company’s product didn’t understand what the hashCode function was for, and we had a fair few classes going around with hashCode set to a constant value…

How about the following “HashTable” implementation? http://worsethanfailure.com/Articles/It-Had-Too-Many-Functions.aspx

:wink:

I’m not sure if this is a regional thing or not, but you are really abusing big-O notation. Statements like “offering O(1) performance in most cases” and “best-case performance: an O(1) lookup” don’t make any sense. The big-O performance of a general hash table is exactly the same as that of the data structure you use to handle collisions.

Your two extremes aren’t extremes either. It is highly likely that you will have more buckets than you expect to have items.

I kind of agree with your main point, but from what I’ve found, computer science purists tend to ignore hash tables (for theoretical stuff) because they are hard to analyze, and their asymptotic performance characteristics are exactly the same as the collision resolving structure. And typically no one seems to care about actual performance :slight_smile:

One of the other interesting things about hash functions is the randomized ones in Perl and other languages, which change the hash function every time your program runs to prevent malicious users from forcing lots of collisions :slight_smile:
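
Here’s a toy sketch of that seeded-hash idea - an FNV-1a-style mix XORed with a per-process random seed. This is not Perl’s actual implementation (and real runtimes tend to use stronger keyed hashes such as SipHash); the point is just that an attacker who doesn’t know the seed can’t precompute a set of keys that all land in the same bucket.

```python
import secrets

# Fresh random seed each time the process starts.
_SEED = secrets.randbits(64)

def seeded_hash(key: bytes, buckets: int) -> int:
    """FNV-1a-style string hash mixed with a per-process random seed (toy code)."""
    h = 14695981039346656037 ^ _SEED                  # FNV-1a offset basis XOR seed
    for byte in key:
        h = ((h ^ byte) * 1099511628211) % (1 << 64)  # xor the byte, multiply by the FNV prime
    return h % buckets

print(seeded_hash(b"hello", 1024))   # lands in a different bucket on different runs
```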

There are other problems with hashes. Microsoft says: “Hash tables that use dynamically allocated linked lists can degrade performance. By extension, hash tables that use dynamically allocated linked lists to store their contents might perform substantially worse. In fact, in the final analysis, a simple linear search through an array might actually be faster (depending on the circumstances). Array-based hash tables (so-called closed hashing) is an often-overlooked implementation which frequently has superior performance.”
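
For reference, here’s a bare-bones sketch of the closed-hashing / open-addressing approach that quote is describing: a flat array with linear probing, no linked lists or per-entry allocation. This is my own toy code that ignores deletion and resizing; the appeal is that probing walks contiguous memory instead of chasing per-node pointers.

```python
class LinearProbingTable:
    """Toy closed-hashing table: entries live in one flat array,
    and collisions probe forward to the next free slot."""

    _EMPTY = object()

    def __init__(self, capacity=64):
        self._keys = [self._EMPTY] * capacity
        self._values = [None] * capacity

    def _probe(self, key):
        i = hash(key) % len(self._keys)
        while self._keys[i] is not self._EMPTY and self._keys[i] != key:
            i = (i + 1) % len(self._keys)   # linear probe: just try the next slot
        return i

    def put(self, key, value):
        i = self._probe(key)
        self._keys[i] = key
        self._values[i] = value

    def get(self, key, default=None):
        i = self._probe(key)
        return self._values[i] if self._keys[i] is not self._EMPTY else default

t = LinearProbingTable()
t.put("pigeon", 1)
t.put("hole", 2)
print(t.get("pigeon"), t.get("birthday", "missing"))   # 1 missing
```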