Groundhog Day, or, the Problem with A/B Testing

Phil’s problem was not A/B testing; it was the original premise that the perfect date would result in Rita falling in love with him. That is a problem with a lot of testing. You are not necessarily testing for the right things.

It’s a movie. She rejects him, not because of some inherent failure in the method he uses, but because it was written that way in the script. While it may be our (or the author’s) idea of the “purest form of A/B testing imaginable,” it’s not real. If the author had chosen a different outcome, would you then change your opinion of A/B testing? I know it would have changed your opinion of the movie.

In the end the author chooses to have Rita reject Phil because on a fundamental level we want A/B testing to fail. We want to believe that we operate on some higher esthetic than our base instincts. We want to believe that we can’t be manipulated by the satisfaction of our material desires. The success of companies like Amazon, Google, and eBay that use it, though, belies this hope. In the end, we are all self-interested beings that only occasionally act outside those interests.

Two things. (1) I loved this movie. (2) I think Tim (above) brings up an excellent point. Imagine an alternate outcome of this movie – imagine that after the perfect date, Rita fell in love with Phil, they got married, and lived happily ever after; how would this change in the movie affect this post about A/B Testing? Would you then be posting about how A/B Testing is the perfect solution to make users fall in love with your site / application / new toothbrush?

Ultimately, Groundhog Day is a scripted movie determined by a group of writers and editors and movie people - not real life. And while many of the ideas presented in movies ring true to us, they’re often just playing on what we hope is true, and not what is actually true. We all hope that we can’t be manipulated into love, that the feeling of love is sincere and beautiful and above corruption - but we should all know that this is not the case. Our feelings can be manipulated. Throw a baby or a puppy into any commercial and we’re likely to associate the feelings we have for babies and puppies with whatever product the advertisement is peddling. We’re simple and predictable - sad but true.

I like your blog, but you’ve got to be kidding me with this post. Are you saying that A/B testing doesn’t work because of a plot line in a comedy? Remember when your blog used to have numbers?

The thing is, on one level I agree with you; Google puts so much faith in numbers that it’s virtually a religion for them, which at the same time makes them seem cold and unfeeling. They don’t even provide end-user support; they have forums, which I’m sure they analyze to death.

On the other hand, not doing A/B testing is like flying blind. In this case, Rita is just one subject, whereas A/B testing is meant to find the optimal way to achieve a goal across a diverse group, which is clearly not the situation here. Arguably, the multiple instances of Rita could be seen as the target audience, but perhaps we are reading too much into this.

Someone once said, “People don’t buy what you do, they buy why you do it.”

I think that’s a great explanation for why A/B Testing doesn’t yield miracles.
When you’re doing something just for the numbers, you become the mainstream, at best. Yet the underdog will almost always have a more passionate following. That’s all because of the branding applied and the credo - or perceived credo - behind each operation.

Of course A/B Testing makes sense, but sometimes, trusting your gut makes even more sense.

Hmmmm.

sounds familiar…

http://blog.asmartbear.com/local-minimum.html

This planet’s biological evolution is iterative A/B testing, no doubt based on a prerelease viewing of Groundhog Day, and that’s worked out pretty well.

I’m surprised it didn’t occur to you that her rejection was just one more step in the A/B testing.

Phil could then move on and put the “Contrived perfect date” into the “doesn’t work” pile.

It seems like this post is trying to shoehorn an example that doesn’t work into a preconceived notion of your own.

I think Sam made a really good point (“You are not necessarily testing for the right things.”). In order for A/B testing to be useful, you have to be testing for the right things, but you also have to already have the right things there.

Testing the difference between a “Buy now” button and a “Buy this now” button is useless if your web site is trying to sell cow’s blood and it is marketed to vegans.

It seems like Bill Murray’s premise was flawed: He was offering a product (his heart/love/companionship) to someone who did not want it. Through A/B testing, he was able to maximize the enjoyment of her experience in “the store” (the date), but when the time came for a buy-or-fly decision, she flew because she was not interested in the first place.

Which is where Jaryl’s point comes in - A/B testing isn’t about grabbing a specific person (that’s what sales calls are for!), it’s about making your website more friendly/more usable/more in tune with what the larger number of people are looking for.

Yikes, you guys are taking this too seriously. Groundhog Day is an analogy for Jeff’s thoughts on A/B testing, not evidence that led him to his conclusions.

I think the biggest problem with A/B testing is that people who want it don’t understand why they want it. What are they testing? What do they hope to accomplish? What will they do with the data they get? All they know is that such-and-such company did A/B testing and increased sales 0.01% so they want it too. Without a plan of action, any tool will fail.

The “usability is like Sandpaper” quote should probably be attributed to Alan Cooper, who says the following on page 206 (not sure which edition I have) of “The Inmates are Running the Asylum”:

“To me, usability methods seem like sandpaper. If you are making a chair, the sandpaper can make it smoother. If you are making a table, the sandpaper can also make it smoother. But no amount of sanding will turn a table into a chair. Yet I see thousands of well-intentioned people diligently sanding away at their tables with usability methods, trying to make chairs.”

It’s a good book. You should own a copy if you don’t already. (He said, generally, to everybody in the universe.)

Jeff, A/B testing is, incidentally, what the universe uses when turning tables into chairs with a lot less than sandpaper. It’s called evolution. We can argue whether the universe has a soul or not, but I think it’s basically honest.

But your essay is still good! We don’t have time to play universe with our lives. And the monads that fill our hearts as this world turns (where they break and get twisted and eventually stop) will vibrate all the more sweetly if we stop pretending that we can judge the effects of our agency much and just try to connect … using our hearts, and minds and souls and all the ingredients of what make us humble humans, not the cosmos, let alone machines.

Beautiful. My thoughts exactly. I never thought of the connection to Groundhog Day though. That’s pretty clever.

But yeah, a couple years ago I was a little peeved about some SEO & marketing practices I was observing at the time. SEO, marketing, and A/B testing are all important to the success of an online business, of course. But there’s a right way and a wrong way to go about it.

I wrote down my thoughts on the subject in my article titled, “Give and You Shall Receive: A Guide to Improving Your Website.” The article is all about building the best possible user experience in all possible aspects. I bring up this idea that what’s best for your user is also best for you.

A/B testing is absolutely imperative! But it is possible to overanalyze it and eventually kill the entire point.

Truth is, most businesses don’t have the time or resources to dedicate to continuous A/B testing so the next best thing is to find what hits and let it run until it doesn’t work… then repeat the process over again.

Wow, some great discussion on here. The post here definitely is interesting, but it seems to throw out the baby with the bathwater: there are certainly things that A/B testing is good for and bad for, but what are those? And, for the things that A/B testing is bad for, how do you figure out which is which? I wrote an incomplete post on this once that used a restaurant as an example. There are plenty of things that a restaurant can’t test with a typical A/B test. Or even worse, if they optimize for local minima (just as the author pointed out), they might be missing some other, longitudinal measure.
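To make the local-optimum worry concrete, here is a minimal sketch of a greedy "keep whichever neighbor tests better" search. Everything in it is an assumption made for illustration: the `LANDSCAPE` scores are invented, and `score` and `hill_climb` are hypothetical helpers, not anything taken from the post.

```python
import math

# Hypothetical "conversion score" for ten candidate designs laid out in a row.
# Index 2 is a small local peak; index 8 is the best design overall.
LANDSCAPE = [1, 2, 3, 2, 1, 2, 4, 6, 8, 6]


def score(i: int) -> float:
    """Score of design i; anything off either end of the row is unusable."""
    return LANDSCAPE[i] if 0 <= i < len(LANDSCAPE) else -math.inf


def hill_climb(start: int) -> int:
    """Repeatedly move to the better neighbor; stop when neither neighbor wins."""
    current = start
    while True:
        # Listing `current` first means ties keep us where we are.
        best = max((current, current - 1, current + 1), key=score)
        if best == current:
            return current
        current = best


stuck = hill_climb(start=0)  # climbs to the nearby small peak and stops
lucky = hill_climb(start=5)  # happens to start on the slope of the big peak
print(stuck, score(stuck))
print(lucky, score(lucky))
```

Starting from design 0, the search settles on design 2 and never discovers design 8, because every single step away from design 2 looks like a loss. Nothing in the step-by-step comparison itself tells you a better peak exists elsewhere.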

However, even though some numbers may not be immediately measurable, they’re always measurable at some time: it might just be hard to pinpoint a drop in sales or traffic (or not getting the girl) to any single thing.

Sometimes you gotta go with your gut, but only after you’ve analyzed the heck out of the numbers.

I agree with Tim VanFosson and Russell Uresti – the only reason Phil did not succeed with a simple approach to A/B testing dates is that it’s a movie. And a movie should deliver on the hopes of its audience, which likes to believe in pure, non-rational love.

In real life (I mean in a real Groundhog Day situation :-)) Phil would succeed in ~30 days or less.

That’s a very well thought out post, and some very interesting comments. Great discussion.

Like some commenters, I disagree with the premise that the movie has anything to teach us about A/B testing. I don’t have to tell anyone here that A/B testing is comparing two or more subtly different variants and measuring which one performs better. I don’t think Phil did any measuring.
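For anyone wondering what the "measuring" part can look like, here is a rough sketch using a pooled two-proportion z-test, which is one common way (not the only one) to compare two conversion rates. The visitor and conversion counts are made up, and `two_proportion_z_test` is a hypothetical helper written for this example.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Uses the pooled standard error, i.e. it assumes under the null
    hypothesis that both variants share one underlying conversion rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented numbers: variant A converts 200 of 5000 visitors, B converts 260 of 5000.
z, p = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Note that this only works because there are thousands of independent visitors. Phil has a sample of one person, repeated, which is part of why the analogy strains.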

If anything this is a better example of iteration, more akin to Agile programming. He made small improvements to his approach every day. Refine, deploy, refine, deploy, etc.

I agree with the commenter that mentioned Neil Strauss’ The Game. Basically, in about 1 year, the author became an expert pickup artist. I have no doubt that Phil could have conquered his target in 30 days as Dennis said, or even in less than a year, in reality. But this is fiction.

I suppose this concept was repeated in the movie, 50 First Dates. Instead of the day repeating, the woman had short term memory loss, so Adam Sandler could take her out again and again and try different things and she wouldn’t remember. What did you think of that? (Also not A/B testing.)

Now that you mention it, Jeff, I feel that Phil’s behavior more closely matches programming methods than marketing testing. Code, Test, Debug, Correct, Code, Test, Debug, Correct…

Never thought of associating Groundhog Day with anything much, really. But it’s an interesting thought you have, although I think it stretches the boundaries of what A/B testing is a little.

I have to agree with Michael Dorfman here.

In particular, I don’t think anything about conducting A/B testing suggests the origin of the inspiration. There’s no reason why you can’t A/B test a totally novel design. You could argue that the final progress Phil makes is as much an A/B test as the failed, disingenuous A/B test that Rita rejects. The motivation and source of the test are totally different, but that’s a failed experiment, not a failure of the method.

This is a really great post. I’m not going to argue with the premise (it is a movie after all!), but I do have some problems with the analysis. Most of the problem with split testing is in its application. Most people focus on conversion. In Phil’s pursuit of Rita, it’s worse: he is focusing on click-through! Focusing on click-through and conversion is easy, but the focus should be on lifetime value. By the end of the movie, Phil has focused on improving the lifetime value of everyone in the town, including himself. Is the movie satisfying because this leads to Rita’s conversion? No, she is just the icing on the cake. It’s satisfying because Phil has learned through trial and error how to be a better person and how to improve his lifetime value. Ultimately, split testing is about learning. Even unsuccessful tests help you learn about your market, and winning conversion tests need to be weighed against the lifetime value. Unfortunately, most of us don’t have 40 or 50 years to spend on one test.
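A small sketch of the conversion-versus-lifetime-value point, with entirely invented figures. The `Variant` class and its field names are hypothetical, and lifetime value is simplified here to "conversion rate times average lifetime revenue per buyer", which is an assumption rather than a standard definition.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    visitors: int
    buyers: int
    lifetime_revenue_per_buyer: float  # assumed known; in practice it takes a long time to observe

    @property
    def conversion_rate(self) -> float:
        return self.buyers / self.visitors

    @property
    def lifetime_value_per_visitor(self) -> float:
        # Simplified model: expected long-run revenue from one visitor.
        return self.conversion_rate * self.lifetime_revenue_per_buyer


# Invented numbers: B converts more visitors, but its buyers are worth less over time.
variants = [
    Variant("A", visitors=1000, buyers=50, lifetime_revenue_per_buyer=120.0),
    Variant("B", visitors=1000, buyers=70, lifetime_revenue_per_buyer=60.0),
]

by_conversion = max(variants, key=lambda v: v.conversion_rate)
by_lifetime_value = max(variants, key=lambda v: v.lifetime_value_per_visitor)

print("Wins on conversion:", by_conversion.name)
print("Wins on lifetime value:", by_lifetime_value.name)
```

With these made-up figures the two metrics pick different winners, which is the whole complaint: a test scored only on conversion would ship B.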