The Danger of Naïveté

Funny, I usually don’t do well with complex math issues, but I understood this problem easily. Maybe it’s because I ran into something similar many, many years back, when I tried to generate a random starfield in assembler and had to write my own RNG (which turned out to be surprisingly complicated).

It’s still hard for me to understand, but your post helped me a lot. Thanks Jeff :slight_smile:

Does anyone know of some algorithm that gets for example 10 random items (without duplicates) out of a list of for example 100 items?
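One standard answer is a partial Fisher–Yates shuffle: run only the first 10 steps of the shuffle and take those 10 positions. A minimal sketch in Python (the function name and the 100/10 sizes are just illustrative; the standard library’s random.sample does the same job):

import random

def pick_without_duplicates(items, k):
    # Partial Fisher-Yates: only the first k positions need to be settled.
    pool = list(items)                      # copy so the original list is untouched
    for i in range(k):
        j = random.randrange(i, len(pool))  # pick from the not-yet-chosen tail
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]

print(pick_without_duplicates(range(100), 10))
print(random.sample(range(100), 10))        # built-in equivalent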

Interesting article! When I read your shuffling post, I thought, ‘How would I solve it?’ My first thought was to create a new array and, for each position, generate a random number to use as the index of the card to place there. If that index was already taken, you redo the random draw until you get a unique index.

I forgot to test it till I saw this post. Was I naive? Well, it’s definitely not fast hehe :slight_smile: I wrote a very bare-bones implementation and it’s somewhat fair, but not as fair as I’d hoped:
Results:
312: 155360
132: 155582
123: 176399
213: 167727
321: 178951
231: 165981
Randcounters:
0: 989745
1: 994988
2: 1015270

The randcounters show the number of times each index came up on the first draw.

So I guess my algorithm was also naive. I went back to check whether I could rework it a bit to make it less naive, but that turns out to be very hard for a random naive algorithm like this. I think that’s the pitfall we should all watch out for, no matter whether you’re an expert or a beginner. So lesson learned: use algorithms that are already proven not to be naive :wink:
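For anyone who wants to poke at this, here is a minimal sketch of the retry-until-unique approach described above (not the original poster’s actual code, and the 600,000-iteration count is arbitrary). As far as I can tell, with a decent RNG this scheme actually produces each permutation about equally often; its real cost is all the wasted redraws.

import random
from collections import Counter

def retry_shuffle(deck):
    result = [None] * len(deck)
    used = set()
    for card in deck:
        # Draw indices until we hit one that is not taken yet.
        i = random.randrange(len(deck))
        while i in used:
            i = random.randrange(len(deck))
        used.add(i)
        result[i] = card
    return result

counts = Counter(tuple(retry_shuffle([1, 2, 3])) for _ in range(600000))
print(counts)   # each of the 6 permutations should land near 100000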

This flawed shuffling algorithm was actually used at online poker sites:
http://www.cigital.com/papers/download/developer_gambling.php

http://en.wikipedia.org/wiki/Shuffling

Python is amazing… the very same test program that I made for your previous article passes this test too, since random.shuffle() is well written.

Only three lines of code:

import random
deck = [1,2,3,4]
random.shuffle(deck)
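For anyone curious what such a test program might look like, here is a minimal sketch (just an illustration, not the poster’s actual test): shuffle a small deck many times and count how often each permutation shows up.

import random
from collections import Counter

counts = Counter()
for _ in range(600000):
    deck = [1, 2, 3]
    random.shuffle(deck)
    counts[tuple(deck)] += 1

for perm, n in sorted(counts.items()):
    print(perm, n)   # all six permutations should come out near 100000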

You should write something about well-designed programming languages; I think Python is one of the best ones out there.

I would explain the problem with the naive algorithm differently. It builds an n-ary tree with n levels and so n^n leaves, each of which is a possible outcome. There are always more leaves than distinct permutations, and for n ≥ 3 the number of leaves is never evenly divisible by the number of permutations (n! contains a prime factor that n^n does not), so some permutations occur as outcomes more often than others and the naive algorithm can’t be unbiased.
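You can make the leaf-counting argument concrete by brute-forcing every sequence of draws for n = 3. A small sketch, where the swap logic mirrors the naive ‘swap element i with a random element j anywhere in the array’ algorithm:

from itertools import product
from collections import Counter

n = 3
counts = Counter()
# Each leaf of the tree is one sequence of n draws, each in range(n).
for draws in product(range(n), repeat=n):
    deck = list(range(1, n + 1))
    for i, j in enumerate(draws):
        deck[i], deck[j] = deck[j], deck[i]   # naive swap with any position
    counts[tuple(deck)] += 1

print(counts)   # 27 leaves spread unevenly over the 6 permutations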

Nice post. One comment about the implementation: I think it’s probably more efficient to implement a pre-decrement rather than a post-decrement.

The post-decrement you have implemented evaluates to true following the end of the iteration with i = 1, and then performs a pointless swap of the first item of the array with itself.

If you implement a pre-decrement, the loop condition will evaluate to false and that final, pointless swap won’t take place.
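In Python terms (just to illustrate the same point, since Python has no decrement operators), the fix amounts to a loop that simply never visits i = 0, where the only possible swap is an element with itself:

import random

def fisher_yates(deck):
    # range(len(deck) - 1, 0, -1) stops at i = 1, so the guaranteed
    # self-swap at i = 0 never happens.
    for i in range(len(deck) - 1, 0, -1):
        j = random.randint(0, i)            # 0 <= j <= i
        deck[i], deck[j] = deck[j], deck[i]
    return deck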

That’s not a “naïve” algorithm, that is a WRONG algorithm!

A naïve algorithm gets the job done. It is just suboptimal and simple. This does not get the job done. And the correct algorithm is also pretty naïve, as the problem is not complex enough to call for anything but a naïve algorithm.

Jeff, great post, but I beg to differ with your conclusion, or at least the way you express it. You wrap up with, “I suppose the real lesson lies in testing.” The real lesson is not to assume that just because you wrote the code, you know what it’s doing.

Testing is a wonderful tool, but for too many people that translates into “unit testing”. The problem with unit tests is that, usually, the person writing the test is the same person who wrote the code. Such tests will often suffer from the same biases and blind spots as went into the original code. Those unit tests give developers a false sense of security. The only way to really know what’s going on is to manually step through the code, either on paper or, if you must, in the debugger. Only then will you really know what the code is doing.

And knowing, as they say, is half the battle. :wink:

I wrote about the same thing a while ago: http://szeryf.wordpress.com/2007/06/19/a-simple-shuffle-that-proved-not-so-simple-after-all/

You’re naive, alright. Why are you writing a supposedly informational blog?

Great post Jeff, your picture-perfect presentation helped in understanding it clearly.

One other follow-up here. Don’t assume rand() is very random either. The standard ANSI C library implementation of rand() generates an extremely poor distribution of numbers from any seed. Generate a distribution of, say, 10 million numbers and you’ll see groups of some numbers coming up way more often than they should.
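If you want to see that kind of weakness for yourself without trusting any particular C library, here is a minimal sketch using a deliberately weak linear congruential generator (the textbook constants, purely as a stand-in for a poor rand()): the low-order bit just alternates.

def weak_rand(seed=42):
    # Classic low-quality LCG of the kind many old rand() implementations used.
    x = seed
    while True:
        x = (1103515245 * x + 12345) % 2**31
        yield x

gen = weak_rand()
print([next(gen) & 1 for _ in range(20)])   # lowest bit alternates 0,1,0,1,...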

rand.Next(2);
rand.Next(2);
rand.Next(2);

I think that should read

rand.Next(3);
rand.Next(3);
rand.Next(3);

As someone pointed out on reddit:

That’s not a “naïve” algorithm, that is a wrong algorithm!

If the person writing the algorithm doesn’t know math, what’s the point …

… I found this result strikingly counterintuitive

I suspect you are being overly specific and actually mean: ‘I didn’t expect this result’.

There’s a problem with your table containing the possible outputs of the two algorithms. You’re missing a tag somewhere, and it’s messing up my livejournal friends page (I use LJ for tracking feeds). I’d appreciate it if you could fix that when you get a chance. Thanks!

Also, very interesting article! When you think about it, the bias is trivially visible in the context of the pigeonhole principle: if there are 27 possible outputs but only 6 holes to put those outputs in, then some results are obviously going to be over-represented (at least 3 of them, in fact, since 27 % 6 = 3).

I’ve noticed that the majority of comments on this article seem to fall into two categories for me: those that get the point of the article, and those that think that the incorrectness of the first algorithm is mathematically obvious and wonder what all the fuss is about. I’ll give this latter category the benefit of the doubt and assume that they are not just trying to make themselves look smarter than the rest of the herd. If there are developers out there who really are wondering why the example given seems like such a legitimate trap to the rest of us, I’d like to remind you of two things.

First, I’ve found that one of the most common (although not universal) weaknesses in developers without a formal computer science education is in the practical application of complexity theory and formal proofs to everyday code. Programmers (such as myself) who fall into this category need all the reminders we can get.

Second, if your main reaction to the given example is that its wrongness is obvious, then you’ve missed the point of the article, which Mr. Atwood so kindly emphasized in boldface for us. Slightly rephrased: it is possible for code to appear both simple and correct while still being subtly wrong. If you would never stumble over the example given, rest assured that there’s a different one waiting around the corner just for you.

In the naive algorithm an array element can change its value many times, but in KFY it can change only once! That’s the difference. That’s why the naive algorithm leaks!