In Finally, a Definition of Programming I Can Actually Understand I marvelled at a particularly strange and wonderful comment left on this blog. Some commenters wondered whether that comment was generated through Markov chains. I considered that, but I had a hard time imagining a text corpus input that could possibly produce output so profoundly weird.
What if we use Markov chains, feeding them all of our codebase, hoping it will generate a program that can think and take over the world a la The Matrix, Skynet, etc…
There’s some confusion here about Markov models and Bayesian statistics in this blog post and in the presentation to which it refers.
A Markov model is just a model where the current output probability depends only on the previous output’s value.
A hidden Markov model has an unobserved (aka latent or hidden) state for each output, with the state depending only on the previous state (that is, the states form a Markov model), and the output depending only on the state.
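To make the first definition concrete, here is a minimal sketch of a word-level Markov text generator of the kind these comments speculate about. The function names and the toy corpus are my own illustration, not anything from the post:

```python
import random
from collections import defaultdict

def build_markov_model(words, order=1):
    """Map each length-`order` tuple of words to the words observed after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].append(words[i + order])
    return model

def generate(model, length=20, seed=0):
    """Random-walk the model: each next word depends only on the current state."""
    rng = random.Random(seed)
    state = rng.choice(list(model.keys()))
    out = list(state)
    for _ in range(length):
        choices = model.get(state)
        if not choices:
            break
        out.append(rng.choice(choices))
        state = tuple(out[-len(state):])
    return " ".join(out)
```

A higher `order` (like the "4th order" chains mentioned below) makes the output locally more coherent at the cost of needing far more training text.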
“Bayesian” spam filtering uses Bayes’s rule and marginalization to “invert” estimates:

p(spam|e-mail) = p(e-mail|spam) p(spam) / p(e-mail)
The most common model for the text portions of p(e-mail|ham) is a Markov model!
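As a rough sketch of that inversion, here is a toy spam filter that combines per-word likelihoods under a naive independence assumption (the word-probability tables and smoothing constant are illustrative, not from the comment):

```python
def posterior_spam(email_words, word_probs_spam, word_probs_ham, p_spam=0.5):
    # Bayes's rule: p(spam | e-mail) = p(e-mail | spam) p(spam) / p(e-mail),
    # with p(e-mail) marginalized as p(e-mail|spam)p(spam) + p(e-mail|ham)p(ham).
    joint_spam = p_spam
    joint_ham = 1.0 - p_spam
    for w in email_words:
        # Unseen words get a tiny smoothed probability instead of zero.
        joint_spam *= word_probs_spam.get(w, 1e-6)
        joint_ham *= word_probs_ham.get(w, 1e-6)
    return joint_spam / (joint_spam + joint_ham)
```

Replacing the independent per-word likelihood with a Markov model over word sequences gives the "most common model" described above.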
A more accurate spam filter can be built using discriminative estimation techniques like logistic regression (as opposed to generative techniques like Markov models). See this paper by Goodman (the organizer of the annual spam research conference and Bill Gates’s technical assistant) and Yih:
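For contrast with the generative approach, here is a minimal discriminative sketch: logistic regression trained by stochastic gradient descent on binary word features. This is a generic textbook formulation, not the specific method from the Goodman and Yih paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=200):
    """X: list of feature vectors; y: 0/1 labels (1 = spam)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b
```

The key difference: this models p(spam|e-mail) directly, rather than modeling p(e-mail|spam) and inverting it with Bayes’s rule.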
“Fully” Bayesian statistics, as understood by statisticians, is a framework for inference. To be truly Bayesian, you need a full (joint) probability model of all variables of interest, and you need to propagate the uncertainty in estimates through inference.
Either read the intro to Gelman et al.'s Bayesian Data Analysis book or this nice intro by David MacKay, a leading Bayesian:
Is it a coincidence that I wrote a chatbot with MegaHAL (another 4th-order Markov chain) for a game lobby this morning, or are there spy cameras in my cereal!?
I have seen a lot of generated Markovian text in the comments sections of many blogs and websites.
Somebody out there is trying hard to beat the system…
The problem is that we have to read two or three sentences of the text even to realize that we are reading artificial prose.
If the internet starts to get saturated with websites containing generated text (which can be used to get search engine traffic), we are in deep trouble. The signal-to-noise ratio of the web will decrease further.
It’s interesting to feed your homepage to it as well:
“When we produce output so many times they work done by generating Paul Graham essays from industry experts. Free shipping.”
That’s my IRC bot which uses a Markovian style algorithm. It sits in hundreds of IRC channels learning everything it hears, then generates replies based on that. Some of the best quotes are randomised on the front page (refresh to see more). You can find more at:
Everything there is generated from its knowledge - it is not simply repeating back sentences it has heard. The database is 4.6 GB, almost 70 million rows.
What is this post about? I don’t want to read it through to get the point in case it through to read it is not worth reading. @Silvercode
What is not worth reading. @Silvercode
What is this post about? I don’t want to get the point in case it is this post about? I don’t want to read it through to read it through to read it through to get the point in case it is this post about? I don’t want to read it is this post about? I don’t want to get the point