The Danger of Naïveté

Tom_Dibble · December 8, 2007, 12:00am

“Does anyone know of some algorithm that gets for example 10 random items (without duplicates) out of a list of for example 100 items?”

The first approach I’d try would be to make a Set, fill it with all items, and for 1…10 pick one of the items out (at random), remove it from the set, and store it in your combination. This might work well enough for you (is simple and unlikely to house unfortunate bugs), or it might yield insufficient performance.

A more performant approach would be to devise a combination machine (an engine which will give you an entire series of combinations for a particular ‘n’ and ‘r’), pick an index at random, and have it produce the particular combination at that index. Explaining precisely how to do this would be beyond the scope of this text box, but suffice to say that a combination machine is fairly easy to write once you recognize the pattern in the series is based off diagonals of Pascal’s triangle. This approach would yield your 10 random items in very short order, although it’s really more efficient if you want to produce multiple such sets as the setup involved highly outweighs the per-combination cost.

A related approach would short-circuit the CM/index approach but use the base materials (pick an index in the position-0 series, then using that value pick an index in the corresponding position-1 series, etc), which would slightly increase the per-combination cost in favor of practically eliminating the setup cost.

The WRONG approach would be to pick one in the list at random, then another after it at random, etc, as this would unduly weight the later-in-list items.

None of these, by my thinking are “naive”. I’m not sure what Jeff’s definition of “naive” is, but it seems to be “missing a fundamental point of logic” aka “wrong”. So, I guess only the last would be “naive”. The other two certainly each have their place. Always favor straight-forwardness and simplicity over complexity. But, if performance is a problem then it is likely that you will need to increase the complexity (and thus raise the bar on any engineer who can step into that code to debug it) to achieve your goals.

codinghorror · December 9, 2007, 12:00am

I fixed the off-by-one summary of the rand.Next() calls. I also fixed the missing table cell tag. Sorry about these errors!

JFred · December 9, 2007, 12:00am

The kfy algorithm is itself a bit naive. Depending.

If you are trying to shuffle a 52 card deck the number of permutations is 52! which is nearly 10**68 (10 to the 68th, or more exactly, 80658175170943878571660636856403766975289505440883277824000000000000. If you are seeding your random number generator with a 32 bit number, you are limited to 2 32 or about 4.2 billion permutations (4.294967296 X 109).

That’s 1068 vs 109. Not good enough. Ruby, among other systems, uses a rand function that is a Mersenne Twister with a period of around 10 ** 6002, which is plenty large enough. So it depends on your PRNG.

JFred · December 9, 2007, 12:00am

The various algorithms are all short and prettier in Ruby:

  # Knuth's pretty good algorithm for shuffling an array.
  # From his text, I think.  Or maybe from Programming Pearls.
def knuth1(ary)
  0.upto(ary.length - 2) { |n|       # n is in [0..MaxSubscript - 1],
                                     # inclusive
    r = n + rand((ary.length - n));  # r is [n..MaxSubscript], incl
    ary[n], ary[r] = ary[r], ary[n]; # swap [n] and [r]
  }
end

and

  # Jeff Atwoods presentation of the Knuth-Fisher-Yates
  # shuffling algorithm, in Ruby
def kfy_shuffle(ary)  # I think this is equivalent to knuth1,
                      # above, but it's easier to present it
                      # than prove it.
  (ary.length - 1).downto(0) { |i|         # i is in
                                           # [0..MaxSubscript], incl.
     n = rand(i + 1)                       # n is in [0..i], inclusive
     ary[n], ary[i] = ary[i], ary[n]       # swap elements [n] and [i]
  }
end

def sort_shuffle(ary)            # Sort by random keys using key-sort
                                 # (aka Shwartzian transform)
  ary = ary.sort_by{|w| rand()}  # Is this a better shuffle,
                                 # than the kfy algorithm...
end

You could use the sort-shuffle before or after one of the others without compromizing randomness. Although it wouldn’t guarantee a seeing each permutation once before seeing any duplicates. Going through each permutation exactly once before repeating makes the last few shuffles very predictable, anyway. For shuffling decks in a gambling situation you want the most unpredictable, where previous permutations have are independent of what permutation will come up next.

JFred · December 9, 2007, 12:00am

Is there any way to put properly formatted code into comments on this site? My pretty code isn’t pretty any more.

ComputerG · December 9, 2007, 12:00am

Jeff, mind sharing what program you used for those graphs? In particular the one for the six-card deck shuffle?

ComputerG · December 9, 2007, 12:00am

Gah, hate to double-post, but anyway:

This is what I was trying (and miserably failing to) point out in the previous post:

Both end-results are as “random” as each other, but the second
method gets you a more “equal” randomness - which is something that
counts in cryptography.

ICR63 · December 9, 2007, 12:00am

Computer Guru - I don’t know but you might be interested in http://code.google.com/apis/chart/

Reimar · December 9, 2007, 12:00am

@alvin
well, “obviously” picking 10 unique numbers out of 100 is exactly the same as this shuffle problem, just that you only care about the first 10 (which also means you can stop after 10 steps).

@Tom Dibble
"The WRONG approach would be to pick one in the list at random, then another after it at random, etc, as this would unduly weight the later-in-list items."

Actually no (unless I misunderstood what you meant).
In the first step the probability of selecting a specific item is 1/n. The probability of selecting a specific one in the second step is ((n-1)/n)*(1/(n-1)) = 1/n as well - the first part is the probability of not having it picked in the first step, the second is the probability of picking it now.
In addition this is exactly the same thing that it is done with the “Knuth-Fisher-Yates shuffle algorithm” (never heard that name before).

Btw. one shocking thing about random numbers:
Assume you have a random number generator that gives you equally distributed random numbers in the range [1…n] but you want equally distributed numbers in the range [1…m] where m n and m does not divide n.
Then there is no deterministic algorithm that can do this mapping, and basically the only way is to test if your random number is m and if yes try again (you can optimize this a bit if m n/2).
Which is an algorithm that can not be proven to terminate if you use truly random numbers (though in practice that is not really a problem except in hard-realtime applications)…
This and other things are IMO nicely explained at http://java.sun.com/javase/6/docs/api/java/util/Random.html (see esp. the description of “public int nextInt(int n)”).

Reimar · December 9, 2007, 12:00am

Arg. This ate my smaller sign and half the following text.
The pre-last paragraph should have been:
Assume you have a random number generator that gives you equally distributed random numbers in the range [1…n] but you want equally distributed numbers in the range [1…m] where m smaller than n and m does not divide n.
Then there is no deterministic algorithm to do that, basically you have to check if that random number you got is larger than m and if yes try again (you can optimize this a bit if m is smaller than n/2).

gogole · December 9, 2007, 12:00am

David A. Lessnau : Oh!My bad.thanks for the correction.

JFred · December 9, 2007, 12:00am

N.B.:

All the major languages, and most of the minor ones, have crypto packages you can use to get better random numbers than the default rand function. I assumed you all knew this but then I guessed not.

JFred

Luthos · December 9, 2007, 12:00am

I completely dissagree with the statement that more
shuffling yields worse results with the naive algorithm.

Going with the 3 card example:
If you stick another loop outside the naive algorithm and
run the 600,000 shuffle simulation again you get nice, neat
100,000 bin sizes for all 6 combinations. I just ran it a
bunch of times - no bullshit.

for (int x=0; x cards.Length; x++)
for (int i = 0; i cards.Length; i++)
{
int n = rand.Next(cards.Length);
Swap(ref cards[i], ref cards[n]);
}

This is much less efficient, but more representative
of the way a naive code would have implemented the
shuffle in the first place.

JanB · December 9, 2007, 12:00am

I’ve worked in online gaming and shuffling is a moot point. You dont shuffle you simply draw each card (or whatever) out as you need it using a truly random number from the remaining population.

But good example.

Reimar · December 9, 2007, 12:00am

@Jan Bannister
That is exactly what the correct shuffling algorithm actually does, just that it keeps the drawn cards and the remaining population in the same array, the one in the front and the one in the end.
It ends when the remaining population is empty.

Dave · December 9, 2007, 12:00am

Never, ever read algorithm statistics before your morning coffee.

This is why you test, and re-test before product launch. At least this is evidence that my geek nature to complicate things and avoid the overly simple has some merit - on occasion…

nik3 · December 9, 2007, 12:00am

Luthos:
I think you’re correct, more shuffling with the naive algorithm will lead to more evenly distributed cards. But no-one is going to shuffle cards.Length^2 times as you have in your code, that’s way too expensive.

cheong · December 9, 2007, 12:00am

Regarding randomness… I noticed that the “Enter the word” image stays the same more than 3 times consecutively (I have no idea when did it begin). Aren’t that image supposed to be changed every time I load the page?

Albert · December 9, 2007, 12:00am

Great post, Jeff. I found the explanation very concise and logical. It’s always interesting to see how such a small difference in approach can affect an algorithm.

Oh, and talking about Naive. When I saw the list permutations each algorithm produced, I was immediately thinking of adding code to keep track of each permutation created by the Naive algorithm and ignoring a permutation if it has already been produced. You’ll end up with 6 perms, like the Knuth shuffle, but goodbye efficiency (in terms of time and space).

The thing I like about shuffling is that it is the opposite of sorting, yet you’d be able to change a sorting algorithm in a shuffle algorithm by editing only a few lines.

gogole · December 9, 2007, 12:00am

For all non C# programmers saying rand.Next(2) should be rand.Next(3) listen :
Arrays in c# are zero (0) based, that is they store values according to the real number system (0,1,2,3 …) not the counting number system (1,2,3,4 …)thus the length of an array in c# will be n-1 where n is the number of elements in the array.Comprende?