Hashtables, Pigeonholes, and Birthdays

Awesome description. I thought you mis-stated a couple things, but they all turned out to be my mistakes. Thanks for fixing my head!

Will - Twins don’t have identidal DNA, nor your cells. Twins started with very close DNA like your cells, but variation is guaranteed because copying isn’t perfect, and DNA gets damaged.

Jon Abaca - Special hash implementations let us go WAY beyond 70k, which is useful in estimating DNA backgrounds and variation.

The bit-depth you need for hashing DNA depends on the k-mer length you’re looking at, so it varies from application to application. You can store all possible 2-mers with a 3-bit address at a load factor of 1, but this is a highly specialized and useless hashing function.

Nobody hashes complete human genomes because they haven’t been sequenced yet. You could however store fingerprints or patterns of k-mers that appear to identify people uniquely.

Your bit-depth doesn’t need to model the possible DNA-space exhaustively, instead it needs to be matched to the sample space that you’re trying to identify uniquely.

(better late than never…) There was a comment above something to the effect that if the distribution of the hash was not uniform (unbiased) then the math gets a whole lot messier and depends on the hash. But as I was diagraming the calculation for probability of at least one collision it occurred to me that the math must already be messy (not so clean as the simple factorial calculation of the series)…

Think about it… if you have 10 slots total and 5 are occupied, you would calculate a probability of a collision on the next insert at 50%. No problem there - we stated clearly that there are 5 occupied slots. BUT if we wanted to calculate the probability of at least one collision so far, the straight math gives you 70%… but only if you make the paradoxical assumption that there are 5 occupied slots - in which case you beat the odds because you had no collisions. If you in fact have a collision along the way, then the number of open slots is greater than you are basing the calculations on. See? If along the way of inserting 5 items into a 10 slot hashtable I in fact have a collision (and odds are good I do), then after 5 insertions I have 6 open slots, not 5. That make the calculation for the next insertion and the additive probability so far change (quite a bit for such a small hashtable).

Someone please tell me how to reconcile this so that some “real world” numbers come out of this. The worst case example of this is that you never have a collision until you hit m = n + 1 at which point the calculation is moot because every insert from then on must collide… but up until then you could have been beating the odds even at probabilities approaching 1… but that isn’t realistic. So the pure math does not reflect reality.

bh