Parsing Html The Cthulhu Way

JonathanL · November 16, 2009, 12:00am

The HTML::Sanitizer 0.04 module is available on BackPAN at http://backpan.perl.org/authors/id/N/NE/NESTING/. However, it does not appear to pass its own test suite (2 of 4 tests fail in t/03security.t) using Perl 5.10.1 on Mac OS X 10.5.8. Sadly, that makes it of limited relevance.

John_Lopez · November 16, 2009, 12:00am

Many programmers have a RegEx hammer and don’t want to learn a DOM/XPath based screwdriver and ratchet set.

Sadly, (X)HTML is mostly nuts, bolts and screws. Yeah, you can hammer it together, but it will fall back apart soon enough.
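
To make the hammer-versus-screwdriver point concrete, here is a minimal sketch. It assumes Python and uses the standard library's event-based `html.parser` as a stand-in for a full DOM/XPath toolkit; the sample markup and the names `NAIVE_LINK`, `LinkCollector`, `links_with_regex`, and `links_with_parser` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: three links written three different, browser-accepted ways.
SAMPLE = (
    '<p>See <a href="/one">one</a>, '
    "<A HREF='/two' class=x>two</A> and "
    '<a class="y" href=/three>three</a>.</p>'
)

# The "hammer": silently assumes lowercase tags, href as the only attribute,
# and double quotes around the value.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def links_with_regex(markup):
    return NAIVE_LINK.findall(markup)


def links_with_parser(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


print(links_with_regex(SAMPLE))   # ['/one']
print(links_with_parser(SAMPLE))  # ['/one', '/two', '/three']
```

The regex can of course be widened to cover each of these cases, but every new variation in the wild means another patch to the pattern.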

Tangr · November 16, 2009, 12:00am

Personally, I always use an HTML parser whenever possible.

As a beginner in regular expressions, it’s a huge pain in the arse to write a regular expression - let alone one to parse HTML.

Ferdy · November 16, 2009, 12:00am

Good points, but I think you left out one important piece of advice: don’t do it at all. Both the library and the regex approach are broken solutions if your source HTML isn’t up to the standard. Therefore, it is far preferable to tap into a structured data source, like XML, RSS, JSON, or an RDBMS. The HTML has to come from somewhere, right?

Of course, there are scenarios where you do not have that kind of access to the original data source, like when you write your own search engine.

Soren · November 16, 2009, 12:00am

Ie! Ie! Microsoft Fhtagn!

Adam_Lacey · November 16, 2009, 12:00am

What a timely post. You’ve just convinced me to abandon my RegEx parsing hack and try to find a more ‘stable’ approach.

Found Html Agility Pack on codeplex - http://htmlagilitypack.codeplex.com/ Had working code in 10 minutes. Hmm, maybe there’s a lesson to be learned here…

Andrew · November 16, 2009, 12:00am

lol. See a much better, more sophisticated treatment over at esr’s blog.

mgb1 · November 16, 2009, 12:00am

You say becoming a follower of Cthulhu like it’s a bad thing?

Gabe · November 16, 2009, 12:00am

I really enjoyed this article today. You really nailed being a good developer.

GustafS · November 16, 2009, 12:00am

I almost always use regular expressions to sanitize scraped content (add missing quotes, remove attributes that my parser of choice chokes on, etc.) and then run it through the parser. So far, so good.
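
A rough sketch of that clean-then-parse workflow, assuming Python with the strict `xml.etree.ElementTree` parser standing in for the "parser of choice". The sample markup, the two patterns, and the helper names `pre_clean` and `parse_scraped` are all hypothetical; real scraped content would need its own clean-up rules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical scraped fragment: a valueless attribute and an unquoted value,
# both of which a strict XML parser rejects.
SAMPLE = "<table><tr><td nowrap class=num>42</td></tr></table>"

# Remove the bare `nowrap` attribute the parser chokes on.
DROP_BARE_NOWRAP = re.compile(r"\s+nowrap(?=[\s>])")

# Add the missing quotes around unquoted attribute values.
# Caveat: this can also match `name=value` text outside of tags.
QUOTE_UNQUOTED = re.compile(r"""(\s[\w:-]+)=([^\s"'<>=]+)(?=[\s>])""")


def pre_clean(markup):
    """Apply the regex fixes that make the fragment well-formed."""
    markup = DROP_BARE_NOWRAP.sub("", markup)
    return QUOTE_UNQUOTED.sub(r'\1="\2"', markup)


def parse_scraped(markup):
    """Sanitize first, then hand the result to the real parser."""
    return ET.fromstring(pre_clean(markup))


cell = parse_scraped(SAMPLE).find(".//td")
print(cell.get("class"), cell.text)  # num 42
```

The regexes here only patch up known defects; the actual structure is still read by the parser, which is what keeps the approach manageable.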

jojo · November 16, 2009, 12:00am

I don’t waste time debating how to parse HTML since finding BeautifulSoup

Steve · November 16, 2009, 12:00am

Using Regex, I scrape HTML that is purposefully malformed to muck up the scraping process. I had been using the DOM structure, but that has its own problems.

If it works…

pete · November 17, 2009, 12:00am

There are no definitives really to this. The thing is, most people parsing HTML are doing it for a specific set of pages, usually in the same format. No, RegEx cannot perfectly parse HTML, but it can parse it when you know the exact form of the HTML.

I started a project intending to use a library to parse the HTML but it became more trouble than it was worth. I knew the sections of information I wanted to pull out and I knew the WYSIWYG editor only allowed a small set of HTML for formatting and links e.g. strong, italic, underline, a link, bullets, numbers… In the end I used nothing more than a simple bit of code to pull out the same content as plain text.
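
For illustration, a "simple bit of code" of this kind might look like the sketch below. It assumes Python and assumes the editor emits only the tags listed (`strong`, `i`, `u`, `a`, `p`, `ul`, `ol`, `li`); the tag list, the sample fragment, and the name `to_plain_text` are guesses for the example, not the code from the project described.

```python
import html
import re

# Inline formatting tags: drop them and keep the text they wrap.
INLINE_TAGS = re.compile(r"</?(?:strong|i|u|a)\b[^>]*>", re.IGNORECASE)

# Block-level tags: treat each one as a line break.
BLOCK_TAGS = re.compile(r"</?(?:p|ul|ol|li)\b[^>]*>", re.IGNORECASE)


def to_plain_text(fragment):
    """Reduce editor output (a known, tiny HTML subset) to plain text."""
    text = INLINE_TAGS.sub("", fragment)
    text = BLOCK_TAGS.sub("\n", text)
    text = html.unescape(text)  # &amp; -> &, and so on
    # Collapse runs of whitespace and drop the empty lines left by the tags.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


SAMPLE = (
    '<p>Read the <strong>manual</strong> &amp; the <a href="/faq">FAQ</a>.</p>'
    "<ul><li>First</li><li><i>Second</i></li></ul>"
)

print(to_plain_text(SAMPLE))
# Read the manual & the FAQ.
# First
# Second
```

This only holds up because the input is known to come from one editor; any tag outside the assumed list passes through untouched.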

Breton · November 17, 2009, 12:00am

@craigybear

The problem is that (x)html is not a markup language, it’s an ad hoc, hacked-together AST notation, and malformed html in particular is difficult because the rules for properly resolving html into its requisite tree structure are complicated and obtuse, and involve painful reverse engineering of multiple browsers. (It works in IE, so my markup must be correct!)

And so, if all you wanted to do was build a simple markup language, and a simple stylesheet language for sending your technical manual to the printers, yes, that’s drop dead simple for any slightly “competant” programmer. But if you’re Donald Knuth (You’ve heard of him, right?!), it takes about 10-20 years.

However, using that markup language to extract useful information is an entirely different task, one a markup language is not really designed for. html was hacked into doing that task in the form of xml, but malformed tag soup, the sort of html you’d find out in the wild? Well, let’s just look at the facts: It takes a team of hundreds of developers several years to make a tolerably compatible html parser/renderer. And you’re just gonna hack one up in a day, are you?

Julian · November 17, 2009, 12:00am

So the conclusion that can be drawn here is that dogmatism is always bad? And yes, I realize the irony in that statement.

Skizz · November 17, 2009, 12:00am

You forgot to close the tag! Luckily, I think I got in there before all hell was unleashed.

NickW · November 17, 2009, 12:00am

What an awesome painting of Cthulhu.

ClutchC · November 17, 2009, 12:00am

I bet Chuck Norris can parse HTML using RegEx.

ClutchC · November 17, 2009, 12:00am

<ElderSign>
I bet Chuck Norris can parse HTML using RegEx.
</ElderSign>