Hashtables, Pigeonholes, and Birthdays

“Heh. I was just thinking that seasonal variations would have an effect when I read the last line.”

I was wondering if there were any twins in the class so I could be snarky and say 100%…

I’m a Java programmer and programming without hash tables would have been so boring!

Thanks for the post!

Hey Now Jeff,
I heard Frosty say Happy Birthday. I’ve always liked the way you quote with a different color background to set the quoted text apart. I thought of the B-Day puzzle when you posted http://www.codinghorror.com/blog/archives/000951.html, 23.
Coding Horror Fan,
Catto

The birthday paradox is a paradox because most people suck at statistics and combinatorics. They see “chance” not in terms of independent tries, but in terms of patterns: if the casino wheel comes up red 4 times in a row, you can split people into those who fanatically believe it has to come up red a 5th time to complete the pattern, and those who believe it has to come up black because there’s no way the pattern can continue.

I can think of lots of business cases for hashtables with well over 70,000 entries. Most databases use hashtables to perform large join operations, and these can very quickly require millions of values. The point about hashtables is that they scale very well. Thus if you are joining tables with millions of rows together, hashtables are the ideal method.
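
For illustration, here’s a minimal sketch of that hash-join idea in Java (the customer/order data and names are made up): build a HashMap over the smaller table’s join key, then stream the larger table and probe the map for each row.

```java
import java.util.*;

public class HashJoinSketch {
    // Build side: the smaller input, keyed by the join column.
    // Probe side: stream the larger input and look each key up, O(1) on average.
    static List<String> hashJoin(Map<Integer, String> customers, List<int[]> orders) {
        List<String> joined = new ArrayList<>();
        for (int[] order : orders) {                    // order = {orderId, customerId}
            String customer = customers.get(order[1]);  // constant-time probe on average
            if (customer != null) {
                joined.add("order " + order[0] + " -> " + customer);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> customers = new HashMap<>();
        customers.put(1, "Alice");
        customers.put(2, "Bob");
        List<int[]> orders = List.of(new int[]{100, 1}, new int[]{101, 2}, new int[]{102, 3});
        hashJoin(customers, orders).forEach(System.out::println);
    }
}
```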

In the worst case, with everything in one bucket, the performance of a well-written hash table shouldn’t be O(n). The elements should be stored in a sorted array to permit a binary search, so that your performance is at most O(log n). No sense taking a good scheme like a hash and messing it up with a linear search for multiple values per bucket.
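
A rough sketch of that scheme in Java (assuming integer keys and a fixed bucket count, just to show the shape of it):

```java
import java.util.Arrays;

public class SortedBucketLookup {
    // buckets[i] holds all keys that hash to i, kept in sorted order.
    static boolean contains(int[][] buckets, int key) {
        int[] bucket = buckets[Math.floorMod(key, buckets.length)];
        // Binary search keeps a degenerate bucket at O(log n) rather than O(n).
        return bucket != null && Arrays.binarySearch(bucket, key) >= 0;
    }
}
```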

No sense taking a bad scheme like a 90+% full hash table and pimping it up from O(n) to O(log n) when it should have been O(1) in the first place.

When I had to write my own hash table in college, we were required to make it double in size once it reached 50% capacity. We also used rehashing whenever we encountered a collision.

The hash function we used was pretty simple, though. I think we were just storing integers, so we took (int % hash_table_size) to get the placement.
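
Something along those lines, as an untested sketch in Java (integer keys, with linear probing standing in for whatever rehashing scheme the course actually used):

```java
public class SimpleIntHashSet {
    private Integer[] slots = new Integer[16];
    private int count = 0;

    public void add(int key) {
        if ((count + 1) * 2 > slots.length) {    // double once we would pass 50% capacity
            grow();
        }
        int i = Math.floorMod(key, slots.length);
        while (slots[i] != null && slots[i] != key) {
            i = (i + 1) % slots.length;           // probe the next slot on a collision
        }
        if (slots[i] == null) {
            slots[i] = key;
            count++;
        }
    }

    public boolean contains(int key) {
        int i = Math.floorMod(key, slots.length);
        while (slots[i] != null) {
            if (slots[i] == key) return true;
            i = (i + 1) % slots.length;
        }
        return false;
    }

    private void grow() {
        Integer[] old = slots;
        slots = new Integer[old.length * 2];
        count = 0;
        for (Integer k : old) {
            if (k != null) add(k);                // rehash every existing key
        }
    }
}
```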

If you’re really worried about hashtable performance you need to look at the type of hashtable you’re using. There are two kinds I see used:

  1. Hashtable is just an array of objects. If a collision occurs you walk down the table until you find an empty bucket and you use that one.

  2. Hashtable is an array of collections of objects. When you have a collision you simply add the object to that bucket’s collection.

Unfortunately, the first one is the one I see used the most (at least when people implement their own hash tables; I don’t know what the built-in implementations in C#, Java, etc. use). This one is the easiest to write, but it requires that your hashtable’s array be at least as large as the set of items you’re storing, and it has worse performance. A collision causes the new object to go into a different bucket, effectively adding to the number of hash values that have collisions.

The second one can actually retain near constant performance while saving memory. The hashtable can be much smaller than the actual data set’s size, and as long as the hash function creates an even distribution then your lookups could still be constant time (as opposed to the first option, which would be linear).

For example, say you have N objects to hash and a hashtable of size N/2. Assuming you have a good hash function, each bucket will only have 2 objects in it, so the lookup becomes O(1) + O(2) = O(1). Abstracted out, if your table size is N/M your lookup becomes O(M), and you can keep M small very easily without having to rehash your table nearly as often as you would with the first method (where any time your data set grew beyond the size of the hash table you’d have to rehash).
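
A minimal sketch of that second approach in Java (separate chaining, with integer keys and a bucket count fixed at construction):

```java
import java.util.ArrayList;
import java.util.List;

public class ChainedIntSet {
    private final List<List<Integer>> buckets;

    public ChainedIntSet(int bucketCount) {
        buckets = new ArrayList<>(bucketCount);
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    private List<Integer> bucketFor(int key) {
        return buckets.get(Math.floorMod(key, buckets.size()));
    }

    public void add(int key) {
        List<Integer> bucket = bucketFor(key);
        if (!bucket.contains(key)) {
            bucket.add(key);        // collisions just append to this bucket's list
        }
    }

    public boolean contains(int key) {
        // With a good hash and N/M buckets, each bucket holds about M entries,
        // so this scan is O(M) rather than O(N).
        return bucketFor(key).contains(key);
    }
}
```

With N keys and N/2 buckets, the inner scan touches about two entries per lookup, which is the O(1) + O(2) figure above.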

Probably a pointless discussion as most languages have good built-in hashtables, but I felt compelled to post this for some reason. :)

The only way to achieve O(1) performance for a get operation is if the hashing function is effectively perfect and the hash table can expand when it becomes sufficiently loaded. The former is up to you. The latter is usually a matter of using a decent hash table implementation.

For hash functions, unless you’ve studied hashing, don’t write them yourself. Let the IDE generate them, or use a third-party utility class like HashCodeBuilder (Java) to do it. This goes for equals methods as well, because unless you write decent unit tests, you will screw them up.
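
Something like this, assuming Apache Commons Lang 3 on the classpath and a made-up Person class just for illustration:

```java
import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;

public class Person {
    private final String name;      // hypothetical fields for illustration
    private final int birthDay;
    private final int birthMonth;

    public Person(String name, int birthDay, int birthMonth) {
        this.name = name;
        this.birthDay = birthDay;
        this.birthMonth = birthMonth;
    }

    @Override
    public int hashCode() {
        // Two arbitrary odd seed numbers; append every significant field.
        return new HashCodeBuilder(17, 37)
                .append(name)
                .append(birthDay)
                .append(birthMonth)
                .toHashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Person)) return false;
        Person other = (Person) obj;
        // Compare the same fields used in hashCode so the two methods stay consistent.
        return new EqualsBuilder()
                .append(name, other.name)
                .append(birthDay, other.birthDay)
                .append(birthMonth, other.birthMonth)
                .isEquals();
    }
}
```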

As for the implementation, Java has HashMap, which supports expanding based on a load factor. I’m pretty sure C# and the C++ STL have the same thing. Feel free to write a hash table implementation yourself, but do it as an academic exercise rather than creating something for production code. This wheel has been studied extensively and is nearly perfected for general usage.
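
For example, HashMap lets you pick both the initial capacity and the load factor (the class name and values below are just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // 64 initial buckets; the map rehashes itself once it is 75% full.
        Map<String, Integer> birthdays = new HashMap<>(64, 0.75f);
        birthdays.put("Jeff", 23);
        System.out.println(birthdays.get("Jeff"));
    }
}
```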

I have very little programming knowledge.

However, I’m pretty well versed in Japanese.

That’s either not a calendar, or it’s a crazy year this year. Someone please educate me…

Jeff,

If we’re talking about storing every person’s DNA and running out of hash addresses, I think we’d first need to talk about running out of space in general before running out of buckets.

At around 3 billion bases in length, with a base taking on one of four values (so let’s make that 2 bits), we need 3,000,000,000 * 2 / 8 = 750 million bytes per person.

We’re out of space after just a few people, never mind 6 billion of them.

Just being sort of argumentative this morning I guess. =)

jeff. i can read that chinese calendar! it’s lunar too.

Very interesting. What are the mathematics for determining the probability of 2 students sharing a birthday?

Hi Val-- there are lots of sites that explain the Birthday Paradox math in more detail. Here’s one:

http://betterexplained.com/articles/understanding-the-birthday-paradox/

I think the pigeonhole principle is just a little bit different than what you stated (but I think it’s an important distinction):

“If you try to put 6 pigeons in 5 holes, one of the holes will contain two pigeons.”

(This also better illustrates the point you’re trying to make WRT hash tables.)

Note that the O(1) claim is not correct, in reality it’s O(size of key in bits). There are other data structures storing key-value-pairs (commonly called “finite maps”) like tries or patricia trees that have the same asymptotic complexity, even in the worst case (hash tables are average case only).
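
As a sketch of what that looks like (a toy bitwise trie over int keys, not any particular library’s patricia tree): lookups walk one node per key bit, so the cost is bounded by the key size in bits even in the worst case.

```java
public class BitTrieSet {
    private static final int BITS = 32;   // int keys

    private static final class Node {
        Node zero, one;
        boolean present;                   // marks the end of a stored key
    }

    private final Node root = new Node();

    public void add(int key) {
        Node node = root;
        for (int i = BITS - 1; i >= 0; i--) {
            boolean bit = ((key >>> i) & 1) == 1;
            if (bit) {
                if (node.one == null) node.one = new Node();
                node = node.one;
            } else {
                if (node.zero == null) node.zero = new Node();
                node = node.zero;
            }
        }
        node.present = true;
    }

    public boolean contains(int key) {
        Node node = root;
        // At most 32 steps, independent of how many keys are stored.
        for (int i = BITS - 1; i >= 0 && node != null; i--) {
            node = ((key >>> i) & 1) == 1 ? node.one : node.zero;
        }
        return node != null && node.present;
    }
}
```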

I used to work on a computer system called Cado, back in the early 1980s. The standard system had 48K of memory on an 8085A processor, and could support up to four users. This was done by leaving out unnecessary things like a file allocation table.

You defined a file by telling the system what drive and track the file started on, how long the key for each record was, how many records you planned to store, and the average size of a record. The system would calculate everything out and give you the ending drive and track for the file. It was your job to make sure you didn’t allocate another file in the same area.

Cado files contained records, and each record had a “key” associated with it (although not necessarily a unique key. For example, a key could be a customer number, and a customer might have multiple records.) The hash function looked at the key and calculated the track the record was on. The entire track was read into an 8Kb system buffer, and the key sectors (which could be between 2 and 6 sectors on each track, depending upon the estimated average length of the record) were searched for the key. This gave the offset on the track where the record was located. The record was then passed to that user’s read buffer where the programmer could manipulate it.

Considering that you only had 48K of memory, slow 8" floppy drives, and a processor running at a blazing 3 MHz, it was a pretty fast system. Thanks to the hash lookup, it could pull up the information almost instantly.

The biggest problem was file sizing due to the way the hashing worked. A file took up a fixed number of tracks on a disk – even if it was empty. Store more records than planned, and you had to resize the file. This meant redefining the file elsewhere on the system, and having the system copy the old file – record by record – to the new location. This could take hours. The other issue is that the file had to occupy contiguous tracks. That meant as you moved files around, you’d end up with empty tracks here and there. Every once in a while, we’d have to take the whole system in and redo all the files to consolidate tracks. In bigger sites, this could take days.

Of course, as I mentioned before, there was no file allocation table. We wrote down on a piece of paper where each file was located. One mistake, and we’d lose the customer data. No pressure, no pressure.

Andy – the trick about analyzing hash tables is that you have to define “average case.” This depends a lot on your inputs! English dictionary words are much different than, say, Esperanto dictionary words. My field of numerical analysis has a similar challenge when defining “average matrices” as test inputs: for example, (0,1) uniformly distributed random matrices are not characteristic of problems that many people would like to solve.

Don’t forget the useful trick of perfect hashing (and even minimal perfect hashing), which is handy for building read-only fast lookup tables at compile time. GNU Gperf is one of many tools that lets you do this:

http://www.gnu.org/software/gperf/

Hi Jeff, thanks for the mention! The 30-second summary:

With 23 people there are 23*22/2 = 253 unique comparisons to make. The chance of 1 comparison being unique is 364/365. The chance of every comparison being unique (i.e. no overlap) is (364/365)^253 ≈ 0.4995.

So, with 23 people there’s about a 50% chance of having 1 or more collisions. There are a few subtleties in there, but that’s the general approach.
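
A quick Java sketch that reproduces those numbers, along with the exact product form for comparison:

```java
public class BirthdayParadox {
    public static void main(String[] args) {
        int people = 23;
        int pairs = people * (people - 1) / 2;                  // 253 unique comparisons
        double noCollision = Math.pow(364.0 / 365.0, pairs);
        System.out.printf("Approx P(no shared birthday) = %.4f%n", noCollision);     // ~0.4995
        System.out.printf("Approx P(at least one match) = %.4f%n", 1 - noCollision);

        // The exact calculation multiplies 365/365 * 364/365 * ... for each person added.
        double exact = 1.0;
        for (int i = 0; i < people; i++) {
            exact *= (365.0 - i) / 365.0;
        }
        System.out.printf("Exact P(no shared birthday)  = %.4f%n", exact);           // ~0.4927
    }
}
```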

My professor once said that it’s good to make the magic number a prime. Many hash functions use division, and that can screw up the distribution with a non-prime one.
One of the most basic, yet somewhat effective, hash functions is to use the mod operator (% in C), something like this:

#define magical_number 13 /* a prime number */
#define hash(key) ((key) % magical_number)

And for collisions, you can simply keep a dynamic list for every possible return value of the hash function. It’s a pretty efficient way to make a dynamic list faster at lookup, and besides, collisions won’t slow it down much.
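
A tiny Java demonstration of the prime-modulus point (the key pattern here is contrived): keys that share a factor with the table size pile into a fraction of the buckets, while a prime table size spreads them out.

```java
import java.util.HashSet;
import java.util.Set;

public class PrimeModulusDemo {
    public static void main(String[] args) {
        // Keys that all share a common factor with a composite table size: multiples of 4.
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++) keys[i] = i * 4;

        System.out.println("buckets used with size 12: " + bucketsUsed(keys, 12)); // only 3
        System.out.println("buckets used with size 13: " + bucketsUsed(keys, 13)); // all 13
    }

    static int bucketsUsed(int[] keys, int tableSize) {
        Set<Integer> used = new HashSet<>();
        for (int key : keys) used.add(key % tableSize);
        return used.size();
    }
}
```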