a companion discussion area for blog.codinghorror.com

Parsing Html The Cthulhu Way


The HTML::Sanitizer 0.04 module is available on BackPAN at http://backpan.perl.org/authors/id/N/NE/NESTING/. However, it does not appear to pass its own test suite (2 of 4 tests fail in t/03security.t) using Perl 5.10.1 on Mac OS X 10.5.8. Sadly, that makes it of limited relevance.


Many programmers have a RegEx hammer and don’t want to learn a DOM/XPath-based screwdriver and ratchet set.

Sadly, (X)HTML is mostly nuts, bolts and screws. Yeah, you can hammer it together, but it will fall back apart soon enough.
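To make the hammer/screwdriver contrast concrete, here is a minimal sketch using only the Python standard library. The XHTML fragment and URLs are made up for illustration, and `xml.etree.ElementTree` stands in for a DOM/XPath toolkit (it supports only a limited XPath subset and needs well-formed input).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment (not taken from any real page).
XHTML = """<div>
  <a class="ext" href="http://example.com/a">first</a>
  <a href='http://example.com/b' class="ext">second</a>
  <!-- <a href="http://example.com/old">commented out</a> -->
</div>"""

# The "hammer": quietly assumes double-quoted href values and knows nothing about comments.
regex_links = re.findall(r'<a[^>]*href="([^"]*)"', XHTML)

# The "screwdriver": build the tree, then select <a> elements that carry an href.
root = ET.fromstring(XHTML)
parsed_links = [a.get("href") for a in root.findall(".//a[@href]")]

print(regex_links)   # misses the single-quoted link, picks up the commented-out one
print(parsed_links)  # both real links, comment ignored
```

The regex could be patched for single quotes and comments, but each patch is another nail; the tree-based version did not need to know about either case.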


Personally, I always use an HTML parser whenever possible.

As a beginner in regular expressions, it’s a huge pain in the arse to write a regular expression - let alone one to parse HTML.


Good points, but I think you left one important piece of advice out: don’t do it at all. Both the library and the regex approach are broken solutions if your source HTML isn’t up to the standard. It is therefore far preferable to tap into a structured data source, like XML, RSS, JSON, or an RDBMS. The HTML has to come from somewhere, right?

Of course, there are scenarios where you do not have that kind of access to the original data source, like when you write your own search engine :)


Ie! Ie! Microsoft Fhtagn!


What a timely post. You’ve just convinced me to abandon my RegEx parsing hack and try to find a more ‘stable’ approach.

Found Html Agility Pack on CodePlex (http://htmlagilitypack.codeplex.com/). Had working code in 10 minutes. Hmm, maybe there’s a lesson to be learned here…


lol. See a much better, more sophisticated treatment over at esr’s blog.


You say “becoming a follower of Cthulhu” like it’s a bad thing?


I really enjoyed this article today. You really nailed being a good developer.


I almost always use regular expressions to sanitize scraped content (add missing quotes, remove attributes that my parser of choice chokes on, etc.) and then run it through the parser. So far, so good.
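A rough sketch of that clean-then-parse workflow in Python, for the "add missing quotes" case only. The helper name `quote_bare_attributes`, the sample markup, and the choice of the strict `xml.etree.ElementTree` parser as the "parser that chokes" are all assumptions for illustration, not the setup described above.

```python
import re
import xml.etree.ElementTree as ET

def quote_bare_attributes(markup: str) -> str:
    """Wrap unquoted attribute values in double quotes.

    Naive on purpose: it would also rewrite text such as ' x=1' outside of tags.
    """
    return re.sub(r'(\s[\w-]+)=([^\s"\'>]+)', r'\1="\2"', markup)

# Hypothetical scraped fragment with unquoted attribute values.
scraped = "<p><a href=/docs/intro class=nav>Intro</a></p>"

try:
    ET.fromstring(scraped)            # the strict parser rejects the raw input
    raw_parses = True
except ET.ParseError:
    raw_parses = False

cleaned = quote_bare_attributes(scraped)
link = ET.fromstring(cleaned).find(".//a")

print(raw_parses)
print(cleaned)
print(link.get("href"), link.text)
```

The regex only has to get the input over the parser's threshold; the actual extraction still happens on the parsed tree.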


I don’t waste time debating how to parse HTML since finding BeautifulSoup.


I use Regex to scrape HTML that is purposefully malformed to muck up the scraping process. I had been using the DOM structure, but that has its own problems.

If it works…


There are no absolutes here. Most people parsing HTML are doing it for a specific set of pages, usually all in the same format. No, RegEx cannot parse arbitrary HTML perfectly, but it can parse HTML when you know its exact form.

I started a project intending to use a library to parse the HTML, but it became more trouble than it was worth. I knew the sections of information I wanted to pull out, and I knew the WYSIWYG editor only allowed a small set of HTML for formatting and links, e.g. strong, italic, underline, links, bullets, numbers… In the end I needed nothing more than a simple bit of code to pull out the same content as plain text.
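For what it’s worth, here is roughly what such a "simple bit of code" could look like in Python. This is a guess at the general shape, not the code from that project; the tag list, the function name `to_plain_text`, and the sample markup are assumptions, and it only holds up while the input really is limited to the known tags (an attribute value containing `>` would already break it).

```python
import re
from html import unescape

# Assumed whitelist: the small set of tags the editor is known to emit.
ALLOWED = ("strong", "em", "u", "a", "ul", "ol", "li", "p", "br")
TAG = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(ALLOWED), re.IGNORECASE)
# Opening tags that should start a new line in the plain-text output.
BREAK = re.compile(r"<(?:li|p|br)\b[^>]*>", re.IGNORECASE)

def to_plain_text(markup: str) -> str:
    text = BREAK.sub("\n", markup)   # turn block-ish tags into line breaks
    text = TAG.sub("", text)         # drop the remaining known tags
    return unescape(text).strip()    # decode entities such as &amp;

sample = ("<p>Read the <strong>manual</strong> &amp; the "
          "<a href='/faq'>FAQ</a>.</p><ul><li>one</li><li>two</li></ul>")
plain = to_plain_text(sample)
print(plain)
```

Anything outside the whitelist is left untouched in the output, which at least makes unexpected markup visible instead of silently mangling it.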



The problem is that (X)HTML is not a markup language; it’s an ad hoc, hacked-together AST notation. Malformed HTML in particular is difficult because the rules for resolving it into its tree structure are complicated and obtuse, and working them out involves painful reverse engineering of multiple browsers. (It works in IE, so my markup must be correct!)

And so, if all you wanted to do was build a simple markup language, and a simple stylesheet language for sending your technical manual to the printers, yes, that’s drop dead simple for any slightly “competant” programmer. But if you’re Donald Knuth (You’ve heard of him, right?!), it takes about 10-20 years.

However, using that markup language to extract useful information is an entirely different task, one that a markup language is not really designed for. HTML was hacked into doing that task in the form of XML, but malformed tag soup, the sort of HTML you’d find out in the wild? Well, let’s just look at the facts: it takes a team of hundreds of developers several years to make a tolerably compatible HTML parser/renderer. And you’re just gonna hack one up in a day, are you?


So the conclusion that can be drawn here is that dogmatism is always bad? And yes, I realize the irony in that statement.




You forgot to close the tag! Luckily, I think I got in there before all hell was unleashed.


What an awesome painting of Cthulhu.


I bet Chuck Norris can parse HTML using RegEx.

