In Finally, a Definition of Programming I Can Actually Understand I marvelled at a particularly strange and wonderful comment left on this blog. Some commenters wondered if that comment was generated through Markov chains. I considered that, but I had a hard time imagining a text corpus input that could possibly produce output so profoundly weird.
There's some confusion here about Markov models and Bayesian statistics in this blog post and in the presentation to which it refers.
A Markov model is just a model where the current output probability depends only on the previous output's value.
A hidden Markov model has an unobserved (aka latent or hidden) state for each output, with the state depending only on the previous state (that is, the states form a Markov model), and the output depending only on the state.
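To make the first of those definitions concrete, here is a minimal sketch of a (non-hidden) Markov text generator, roughly the idea behind the bots mentioned later in this thread. The order-2 word model and the toy corpus are my own illustrative choices, not anything from the post:

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to the words observed after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=25):
    """Walk the chain: each new word depends only on the current state (the last `order` words)."""
    state = random.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break
        out.append(random.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# Tiny stand-in corpus; any real text (blog comments, comic transcripts, IRC logs) works the same way.
corpus = ("i love lasagna and i love naps and i hate mondays "
          "and jon hates mondays too and jon loves lasagna").split()
print(generate(build_chain(corpus)))
```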
"Bayesian" spam filtering uses Bayes's rule and marginalization to "invert" estimates:

p(ham | e-mail) = p(e-mail, ham) / ( p(e-mail, ham) + p(e-mail, spam) )
where p(e-mail,C) = p(C) * p(e-mail|C) is defined by the chain rule. Note the "inversion" in moving from directly modeling p(e-mail|ham) and p(ham) [the likelihood a message is ham without knowing anything about it] to inferring p(ham|e-mail) through probability theory (aka algebra).
The most common model for the text portions of p(e-mail|ham) is a Markov model!
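As a rough, purely illustrative sketch of that inversion: the word probabilities, priors, and the unigram likelihood below (simpler than the Markov model just mentioned) are toy assumptions of mine, but the final division is exactly the Bayes's-rule step written above.

```python
from math import exp, log

# Toy per-class word probabilities and class priors (illustrative numbers only).
p_word = {
    "ham":  {"meeting": 0.05, "viagra": 0.0001, "report": 0.04},
    "spam": {"meeting": 0.001, "viagra": 0.06,  "report": 0.002},
}
prior = {"ham": 0.7, "spam": 0.3}
FLOOR = 1e-6  # probability assigned to words the model has never seen

def log_joint(words, c):
    """log p(e-mail, C) = log p(C) + sum of log p(word | C) under a unigram likelihood."""
    return log(prior[c]) + sum(log(p_word[c].get(w, FLOOR)) for w in words)

def p_ham(words):
    """Bayes's rule with marginalization: p(ham | e-mail) = p(e-mail, ham) / sum over classes."""
    joint = {c: exp(log_joint(words, c)) for c in ("ham", "spam")}
    return joint["ham"] / (joint["ham"] + joint["spam"])

print(p_ham("meeting report".split()))   # close to 1
print(p_ham("viagra".split()))           # close to 0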
A more accurate spam filter can be built using discriminative estimation techniques like logistic regression (as opposed to generative techniques like Markov models). See this paper by Goodman (the organizer of the annual spam research conference and Bill Gates's technical assistant) and Yih:
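For contrast, here is a hedged sketch of what a discriminative filter does: it scores p(spam | e-mail) directly with weights on word features instead of modeling how ham and spam text is generated. The weights below are invented for illustration; a real filter would learn them from labelled mail.

```python
from math import exp

# Made-up per-word weights; positive pushes toward spam, negative toward ham.
weights = {"viagra": 3.2, "meeting": -1.5, "report": -0.8}
bias = -0.5

def p_spam(words):
    """Logistic regression: estimate p(spam | e-mail) directly from word features."""
    score = bias + sum(weights.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + exp(-score))

print(p_spam("viagra".split()))          # high spam probability
print(p_spam("meeting report".split()))  # low spam probability
```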
"Fully" Bayesian statistics, as understood by statisticians, is a framework for inference. To be truly Bayesian, you need a full (joint) probability model of all variables of interest, and you need to propagate the uncertainty in estimates through inference.
Either read the intro to Gelman et al.'s Bayesian Data Analysis book or this nice intro by David MacKay, a leading Bayesian:
Is it a coincidence that I wrote a chatbot with MegaHAL (another 4th-order Markov chain) for a game lobby this morning, or are there spy cameras in my cereal!?
I have seen a lot of generated Markovian text in the comments sections of many blogs and websites.
Somebody out there is trying hard to beat the system.
The problem is that we have to read two or three sentences of the text just to realize we are reading artificial prose. If the internet starts to get saturated with websites full of generated text (which can be used to attract search engine traffic), we are in deep trouble. The signal-to-noise ratio of the web will decrease further.
That's my IRC bot, which uses a Markovian-style algorithm. It sits in hundreds of IRC channels learning everything it hears, then generates replies based on that. Some of the best quotes are randomised on the front page (refresh to see more). You can find more at:
Everything there is generated from its knowledge - it is not simply repeating back sentences it has heard. The database is 4.6 GB, almost 70 million rows.
If you play with the Markov text synthesizer, you'll quickly find that Markov methods are only as good as their input corpus. Input a bunch of the same words, or random gibberish, and that's what you'll get back.
That's more a feature than a bug. By using a specific text as input, you can produce output that sounds like the input: try texts in old English, poems, and so on.
As ChrisL suggested, it would be more interesting to try a Markov model on music than on text.
Generated music, if it is a function of the previous 5-6 notes, would be great, I think.
The generated text doesn't make any sense anyway.
One barrier to implementation is that it's way too big to fit in memory on a single (reasonable) machine.
Also related is frac.notdot.net, an amusing free-association game that uses a 1st-order Markov model to "play" free association with a human player, expanding its corpus as it goes. The corpus is quite sizeable by now.
Only because you brought it up (in a way), Jeff: Are you still using POPFile?
No, I've completely outsourced all my mail to Google / Microsoft / Yahoo now that they allow hosted email. I figure they're a lot smarter than I am. Fighting spam is interesting, but thankless and repetitive. Gmail does the best job, though occasionally (about once every two weeks) something will creep through.
The generator always correctly predicts that a carriage return is followed by one of these prefixes, so the generated text still looks like an IM conversation.
That's the amazing thing about Markov chains. As Josh Millard, the author of Garkov, pointed out in this very thread: they're so ridiculously simple, yet they work almost eerily well.
Also thanks, Josh, for making an appearance and offering such useful insights on your Markov experience.
They're a reminder that Garfield CANNOT TALK. When you remember that Arbuckle cannot know what Garfield is thinking, the strip gains a new level of ridiculousness.
"That's more a feature than a bug. By using a specific text as input, you can produce output that sounds like the input: try texts in old English, poems, and so on."
Yes, that is a fantastic feature. If you start talking to Kooky in other languages (German and Danish are the ones I know about; there are probably others), he will reply in that language. You never see these languages show up in English channels simply because there are no relationships with the foreign words (apart from a few exceptions: "die", "bin" in German, etc.).
phatmonkey, there were some requests to get kookybot on Twitter, though I'm not sure how that would work, as it's a radically different environment than IRC.
Those top kookybot quotes are genius. I cannot remember the last time I laughed so hard. WARNING, however: do not click on any links the KookyBot provides. You'll be sorry.