In Finally, a Definition of Programming I Can Actually Understand I marvelled at a particularly strange and wonderful comment left on this blog. Some commenters wondered if that comment was generated through Markov chains. I considered that, but I had a hard time imagining a text corpus input that could possibly produce output so profoundly weird.
There's some confusion here about Markov models and Bayesian statistics in this blog post and in the presentation to which it refers.
A Markov model is just a model where the current output probability depends only on the previous output's value.
A hidden Markov model has an unobserved (aka latent or hidden) state for each output, with the state depending only on the previous state (that is, the states form a Markov model), and the output depending only on the state.
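To make the first of those definitions concrete, here is a minimal sketch of a (non-hidden) Markov text generator, roughly the idea behind the bots mentioned later in this thread. The order-2 word model and the toy corpus are my own illustrative choices, not anything from the post:

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to the words observed after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=25):
    """Walk the chain: each new word depends only on the current state (the last `order` words)."""
    state = random.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break
        out.append(random.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# Tiny stand-in corpus; any real text (blog comments, comic transcripts, IRC logs) works the same way.
corpus = ("i love lasagna and i love naps and i hate mondays "
          "and jon hates mondays too and jon loves lasagna").split()
print(generate(build_chain(corpus)))
```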
"Bayesian" spam filtering uses Bayes's rule and marginalization to "invert" estimates:

p(ham | e-mail) = p(e-mail, ham) / ( p(e-mail, ham) + p(e-mail, spam) )
where p(e-mail,C) = p(C) * p(e-mail|C) is defined by the chain rule. Note the "inversion" in moving from directly modeling p(e-mail|ham) and p(ham) [the likelihood a message is ham without knowing anything about it] to inferring p(ham|e-mail) through probability theory (aka algebra).
The most common model for the text portions of p(e-mail|ham) is a Markov model!
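As a rough, purely illustrative sketch of that inversion: the word probabilities, priors, and the unigram likelihood below (simpler than the Markov model just mentioned) are toy assumptions of mine, but the final division is exactly the Bayes's-rule step written above.

```python
from math import exp, log

# Toy per-class word probabilities and class priors (illustrative numbers only).
p_word = {
    "ham":  {"meeting": 0.05, "viagra": 0.0001, "report": 0.04},
    "spam": {"meeting": 0.001, "viagra": 0.06,  "report": 0.002},
}
prior = {"ham": 0.7, "spam": 0.3}
FLOOR = 1e-6  # probability assigned to words the model has never seen

def log_joint(words, c):
    """log p(e-mail, C) = log p(C) + sum of log p(word | C) under a unigram likelihood."""
    return log(prior[c]) + sum(log(p_word[c].get(w, FLOOR)) for w in words)

def p_ham(words):
    """Bayes's rule with marginalization: p(ham | e-mail) = p(e-mail, ham) / sum over classes."""
    joint = {c: exp(log_joint(words, c)) for c in ("ham", "spam")}
    return joint["ham"] / (joint["ham"] + joint["spam"])

print(p_ham("meeting report".split()))   # close to 1
print(p_ham("viagra".split()))           # close to 0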
A more accurate spam filter can be built using discriminative estimation techniques like logistic regression (as opposed to generative techniques like Markov models). See this paper by Goodman (the organizer of the annual spam research conference and Bill Gates's technical assistant) and Yih:
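For contrast, here is a hedged sketch of what a discriminative filter does: it scores p(spam | e-mail) directly with weights on word features instead of modeling how ham and spam text is generated. The weights below are invented for illustration; a real filter would learn them from labelled mail.

```python
from math import exp

# Made-up per-word weights; positive pushes toward spam, negative toward ham.
weights = {"viagra": 3.2, "meeting": -1.5, "report": -0.8}
bias = -0.5

def p_spam(words):
    """Logistic regression: estimate p(spam | e-mail) directly from word features."""
    score = bias + sum(weights.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + exp(-score))

print(p_spam("viagra".split()))          # high spam probability
print(p_spam("meeting report".split()))  # low spam probability
```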
"Fully" Bayesian statistics, as understood by statisticians, is a framework for inference. To be truly Bayesian, you need a full (joint) probability model of all variables of interest, and you need to propagate the uncertainty in estimates through inference.
Either read the intro to Gelman et al.'s Bayesian Data Analysis book or this nice intro by David MacKay, a leading Bayesian:
Is it a coincidence that I wrote a chatbot with MegaHAL (another 4th-order Markov chain) for a game lobby this morning, or are there spy cameras in my cereal!?
I have seen a lot of generated Markovian text in the comments sections of many blogs and websites.
Somebody out there is trying hard to beat the system.
The problem is that we have to read two or three sentences of the text just to realize we are reading artificial prose. If the internet starts to get saturated with websites full of generated text (which can be used to attract search engine traffic), we are in deep trouble. The signal-to-noise ratio of the web will decrease further.
That's my IRC bot, which uses a Markovian-style algorithm. It sits in hundreds of IRC channels learning everything it hears, then generates replies based on that. Some of the best quotes are randomised on the front page (refresh to see more). You can find more at:
Everything there is generated from its knowledge - it is not simply repeating back sentences it has heard. The database is 4.6 GB, almost 70 million rows.
If you play with the Markov text synthesizer, you'll quickly find that Markov methods are only as good as their input corpus. Input a bunch of the same words, or random gibberish, and that's what you'll get back.
That's more a feature than a bug. By using a specific text as input, you can produce output that sounds like the input: try texts in old English, poems, and so on.
As ChrisL suggested, it would be more interesting to try a Markov model on music than on text.
Generated music, if it is a function of the previous 5-6 notes, would be great, I think.
The generated text doesn't make any sense anyway.
One barrier to implementation is that it's way too big to fit in memory on a single (reasonable) machine.
Also related is frac.notdot.net, an amusing free-association game that uses a 1st-order Markov model to "play" free association with a human player, expanding its corpus as it goes. The corpus is quite sizeable by now.
Only because you brought it up (in a way), Jeff: Are you still using POPFile?
No, I've completely outsourced all my mail to Google / Microsoft / Yahoo now that they allow hosted email. I figure they're a lot smarter than I am. Fighting spam is interesting, but thankless and repetitive. Gmail does the best job, though occasionally (about once every two weeks) something will creep through.
The generator always correctly predicts that a carriage return is followed by one of these prefixes, so the generated text still looks like an IM conversation.
That's the amazing thing about Markov chains. As Josh Millard, the author of Garkov, pointed out in this very thread: they're so ridiculously simple, yet they work almost eerily well.
Also thanks, Josh, for making an appearance and offering such useful insights on your Markov experience.
They're a reminder that Garfield CANNOT TALK. When you remember that Arbuckle cannot know what Garfield is thinking, the strip gains a new level of ridiculousness.
"That's more a feature than a bug. By using a specific text as input, you can produce output that sounds like the input: try texts in old English, poems, and so on."
Yes, that is a fantastic feature. If you start talking to Kooky in other languages (German and Danish are the ones I know about; there are probably others), he will reply in that language. You never see these languages show up in English channels simply because there are no relationships with the foreign words (apart from a few exceptions: "die", "bin" in German, etc.).
phatmonkey, there were some requests to get kookybot on Twitter, though I'm not sure how that would work, as it's a radically different environment than IRC.
Those top kookybot quotes are genius. I cannot remember the last time I laughed so hard. WARNING, however: do not click on any links the KookyBot provides. You'll be sorry.