The Danger of Naïveté

@Reimar:

"@Tom Dibble
'The WRONG approach would be to pick one in the list at random, then another after it at random, etc, as this would unduly weight the later-in-list items.'

Actually no (unless I misunderstood what you meant)."

You did, but mainly because I wrote it poorly :slight_smile:

I meant to describe an approach I’d fixed before, in which a programmer searching for a random combination of “r” items in a group of “n” would, essentially:

until ( found one )
    int minindex = 0;
    for ( i from 1..r )
        minindex = random number between minindex and n
        add item at minindex to combination
        minindex++

Obviously, that’s a bit pseudo-codish, but the main thing is that the “next index” is always a (random number of indexes) AFTER the “last index”. So, on average, the first index chosen skips half the list and eliminates that first half from consideration …
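Roughly, in C# (a sketch only; rand is assumed to be a System.Random, items an int[] of n elements, and the bias is the point, not a transcription error):

    // BROKEN by design: each pick is forced to land at or after the previous
    // one, so early items get skipped far more often than late ones.
    var combination = new List<int>();
    do
    {
        combination.Clear();
        int minIndex = 0;
        for (int i = 0; i < r && minIndex < n; i++)
        {
            minIndex = rand.Next(minIndex, n); // random index at or after the last pick
            combination.Add(items[minIndex]);
            minIndex++;
        }
    } while (combination.Count < r); // ran off the end of the list? try again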

In any case, my point was that someone felt this approach was either elegant (which I can’t imagine they really convinced themselves of … any time you have an “until this works” loop, you’ve pretty much written off “elegant”) or simple to understand despite possible performance problems, which I suppose might fit Jeff’s definition of “naive”, but definitely fits my definition of “WRONG”.

@Jeff:
"A nave algorithm is a very simple solution to a problem. It is meant to describe a suboptimal algorithm compared to a “clever” (but less simple) algorithm. Nave algorithms usually consume larger amounts of resources (time, space, memory accesses, …), but are simple to devise and implement.

It shouldn’t be wrong, just inefficient."

Well, the article spent a lot of time discussing what happens when something is “wrong”, not “correct yet naive”. I suppose that’s the basis of the confusion here in the comments.

I’d suggest there are two orthogonal issues here: “wrong” versus “right” and “simple” versus “complex”. It is just as incorrect to conflate “simple” with “wrong” as it is to conflate “simple” with “right”. “Naive” appears to be a synonym for “simple” (as opposed to “simple AND wrong”), but I’m not sure I’d agree with the connotations of that. Sure, “simple” isn’t always the best answer. The vast majority of the time, though, it is.

Using the shuffle approach, and pretending there isn’t a perfectly-simple-yet-properly-complex implementation of “shuffle” in pretty much any modern language, let’s examine some options here:

Simple, Wrong: The buggy “exchange items” approach listed in this article.
Simple, Right: Assign a random number to each item, then sort by that random number. Doesn’t perform well, but works. And it is easy to explain to the next guy.
Elegant, Right: Use the published “exchange” algorithm. Performs well and works, but it is hard to explain to the next guy why each line is precisely the way it is without saying “just read the reference material!” (Both “Right” options are sketched just below.)
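A minimal sketch of those two “Right” options in C# (assuming using System.Linq and a Random named rand; the method names are mine):

    // Simple, Right: tag each item with a random key and sort by it.
    // O(n log n), but anyone can follow it.
    static T[] SortShuffle<T>(T[] items, Random rand)
    {
        return items.OrderBy(_ => rand.Next()).ToArray();
    }

    // Elegant, Right: the published exchange (Knuth/Fisher-Yates) shuffle.
    // O(n), but the loop bounds have to be exactly right.
    static void ExchangeShuffle<T>(T[] items, Random rand)
    {
        for (int i = items.Length - 1; i > 0; i--)
        {
            int n = rand.Next(i + 1); // 0..i inclusive
            (items[i], items[n]) = (items[n], items[i]);
        }
    }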

There’s a place where performance matters above maintenance, and there’s a place where the reverse is true. The trick for us as developers is figuring out which place (or where in the continuum between the two) we are in, and thus which approach to the problem makes the most sense for us.

In my previous post (in response to the question about pulling one of the 100C10 combinations) I gave two solutions, one of which was simple and easy to understand, the other of which would take several pages with diagrams to explain (but performs and scales significantly better than the simpler solution).

Which is the right one to use? I’d say you always start with the simple one, then move on to more complex implementations only when you’ve identified that performance is an issue. Moving forward, the difference is often between “anyone working on the application can touch this algorithm” and “any change to this algorithm needs to be looked at by YOU, yes, YOU, the original developer”. The former tends to foster sanity far more than the latter.

“Naive” tends to connote an incorrect choice. “Simple” is not always, and in fact is not even often, an incorrect choice.

gogole: unless someone changed something while I wasn’t looking, array.Length gives the number of elements in the array, not the highest index. So, if an array has 3 elements, array.Length is 3, not 2. As written, Jeff’s “naive” code should run through:

rand.Next(3);
rand.Next(3);
rand.Next(3);

while the “non-naive” code should go through:

rand.Next(3);
rand.Next(2);
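That is, the two loops have roughly this shape (a sketch for a three-element cards array, rand being a System.Random):

    // "Naive" shuffle: every pick ranges over the whole array.
    for (int i = 0; i < cards.Length; i++)
    {
        int n = rand.Next(cards.Length); // always rand.Next(3)
        (cards[i], cards[n]) = (cards[n], cards[i]);
    }

    // KFY shuffle: the range shrinks by one each pass.
    for (int i = cards.Length - 1; i > 0; i--)
    {
        int n = rand.Next(i + 1); // rand.Next(3), then rand.Next(2)
        (cards[i], cards[n]) = (cards[n], cards[i]);
    }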

“It’s possible for code to be both simple and wrong”. Sorry, but I can’t understand the central point here. If complexity and correctness are not correlated, is that surprising?

Rather than using the swap-based algorithm, I went with the following (sketched in code after the list):

  1. Start with a deck as a stack
  2. Create an array of items
  3. Take the top card from the deck (Pop)
  4. Insert it into a random array index
  5. If the array index was not available, discard the random and try again
  6. Repeat until all cards in deck gone
  7. Load that array back into a new deck
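A minimal C# sketch of that approach (the deck is assumed to be a Stack<T>; the method name is mine):

    // Pop each card off the deck and drop it into a random empty slot,
    // discarding the random index and retrying whenever the slot is taken.
    static Stack<T> PopAndPlaceShuffle<T>(Stack<T> deck, Random rand)
    {
        var slots = new T[deck.Count];
        var used = new bool[slots.Length];
        while (deck.Count > 0)
        {
            T card = deck.Pop();
            int i;
            do { i = rand.Next(slots.Length); } while (used[i]); // retry on collision
            slots[i] = card;
            used[i] = true;
        }
        return new Stack<T>(slots); // load the array back into a new deck
    }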

Results on 6 cards, 600,000 iterations:

0|99777
1|100086
2|100058
3|99879
4|100163
5|100037

Not bad. And on a 6,000,000 sample:

*** RAND CHECK ***
0|1000514
1|1000070
2|999574
3|999222
4|999939
5|1000681

Seems like a pretty easy way to do it, though I doubt it’s as fast as the KFY way, since it keeps drawing random indexes that are already taken.

Sorry, that was 3 cards, with 0-5 representing each of the six possible orderings.

Comparison on 10,000 shuffles in Milliseconds:

MyShuffle (very similar to PokerStars): 5984.375
KFY Shuffle: 1312.5

I suppose performance isn’t the key thing for online poker rooms.
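For anyone who wants to reproduce that kind of comparison, a rough timing harness (Shuffle stands in for whichever implementation is under test; assumes using System.Linq and System.Diagnostics):

    // Time 10,000 shuffles of a 52-card deck.
    var deck = Enumerable.Range(0, 52).ToArray();
    var rand = new Random();
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000; i++)
    {
        Shuffle(deck, rand); // the shuffle under test
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.TotalMilliseconds);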

It is interesting that most people cannot see why the naive shuffle algorithm is wrong. But I was watching one of the TED talks (can’t remember the link anymore) about statistics, and I remember it saying that statistics is one of the most unintuitive subjects (probability being part of statistics), and that most people don’t get it right.

It’s not the case that the problem with a naive algorithm is that it could be wrong; any algorithm can be wrong, naive or not. To program properly, one must also understand what he or she is doing, including the maths behind it. If all one is making is a simple business data-entry app, there is no real need for any deep understanding of the mathematics behind the program. But for anything more complex, such as rendering, scientific calculation, or games, the programmer needs to be really educated in their domain, and thus these things cannot be done by the average “script kiddie” who picks up code from the net and patches it together. That is what I would call naive.

Code that’s simple and wrong can be found in testing. Code that’s complex and wrong, all too often, cannot have effective tests written for it, IMHO.

The particular difficulty here is that checking the results of the algorithm for this seemingly simple problem is not obvious.
Like checking that an encryption algorithm is secure…
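One workable check, along the lines of the counts posted above, is to shuffle a tiny deck many times and compare the frequency of each resulting ordering (a sketch; Shuffle is whichever implementation is being tested):

    // Count how often each ordering of three cards appears; an unbiased
    // shuffle should give roughly equal counts (about trials / 6 here).
    var counts = new Dictionary<string, int>();
    var rand = new Random();
    for (int trial = 0; trial < 600000; trial++)
    {
        int[] cards = { 1, 2, 3 };
        Shuffle(cards, rand);
        string key = string.Join(",", cards);
        counts[key] = counts.TryGetValue(key, out int c) ? c + 1 : 1;
    }
    foreach (var kv in counts)
        Console.WriteLine(kv.Key + "|" + kv.Value);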

“The naïve shuffle algorithm is biased and fundamentally broken.”

I can’t imagine that standard shuffling by hand fares better than this algorithm. There’s a big difference between “not the Platonic ideal” and “fundamentally broken”.

Which is not to take anything away from the important point illustrated here: “It’s a minute difference in the code, but a profound difference in results.”

(…although “profound” may be a bit strong…)

Smart pointers are C++ objects that simulate plain pointers by implementing operator-> and the unary operator*.

Hey Now Jeff,
This post really made me think. Very interesting everyone has to agree testing sure is important.
Coding Horror Fan,
Catto

Why shuffle the deck when you can draw random cards? :slight_smile:

“For every complex problem, there is a solution that is simple, elegant, and wrong.”

BTW, as a side topic, here’s a procedure for getting a non-modulo-biased index from 0 to N-1 out of a 32-bit random number.

private readonly RNGCryptoServiceProvider cry = new RNGCryptoServiceProvider();

private int _rndIndex(int items)
{
    // Uses a 32-bit int as the random, so we have to handle negative
    // and positive values, i.e. -2,147,483,648 to +2,147,483,647.

    // Note: we need to -1 from the max, as otherwise we would have one
    // too many mod = 0 cases: mod = 0 at the negative end, mod = 0 at 0,
    // and mod = 0 at the positive end.

    byte[] randomNum = new byte[4];

    // Calculate the acceptable random range to avoid modulo bias.
    int minNum = (Int32.MinValue - (Int32.MinValue % items));
    int maxNum = (Int32.MaxValue - (Int32.MaxValue % items)) - 1;

    // Keep going until we get a number between min and max.
    int rTemp = 0;
    do
    {
        cry.GetBytes(randomNum);
        rTemp = BitConverter.ToInt32(randomNum, 0);
    }
    while (rTemp < minNum || rTemp > maxNum);

    // Get the mod; this will never reach items in magnitude, i.e. if items
    // is 52 it can only return -51 to 51.
    int r = (rTemp % items);
    // Always return the positive version for use as an index.
    return (r < 0) ? r * -1 : r;

    // Note: this is not biased against 0. Mod of a negative number returns
    // a negative number or 0; mod of a positive number returns a positive
    // number or 0. There is twice as much chance of getting 0 as getting any
    // specific positive or negative number, so taking negatives and
    // positives as both positives removes the bias.
}

Note: the int32 output from the RNG is just the resulting random number; the RNGCryptoServiceProvider’s internals aren’t public spec, so I’m not sure how many bits of internal state it has, but I would assume it’s a lot.
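For instance, a hypothetical use of _rndIndex to drive the exchange-style shuffle discussed above:

    // Unbiased Fisher-Yates style loop using _rndIndex for the pick.
    for (int i = cards.Length - 1; i > 0; i--)
    {
        int n = _rndIndex(i + 1); // unbiased index in 0..i
        (cards[i], cards[n]) = (cards[n], cards[i]);
    }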

This is an excellent example of H. L. Mencken’s law: to every problem there is a simple, obvious solution that just does not work.

I mentioned the KFY algorithm in a short paper in “Software – Practice and Experience”, way back in 1977 (Vol 7, No 2, pp 201-203). I couldn’t find it in Knuth at the time and so failed to give credit. My version extracted a random combination of items from an array and shuffled the array at the same time. I’ve used it in half-a-dozen languages and many programs.

Minor point:

Arrays in C# are zero (0) based; that is, they index elements by the whole numbers (0, 1, 2, 3, …), not the counting numbers (1, 2, 3, 4, …).

Though that is the common way to conceptualize this feature of C/C++/C#/etc, I actually prefer to think of the so-called ‘index’ as actually ‘the distance from the beginning of the array’. a[0] is not the ‘zeroth’ item, it is the item zero away from the first item, and hence the first item.

(Good article, Jeff!)

:slight_smile:

Being pragmatic: if you want to shuffle, you use Python’s random.shuffle(cards), and if you want to draw n cards, you use Python’s random.sample(cards, n). No need to reinvent any wheels… unless for some very, very special applications, maybe.