a companion discussion area for blog.codinghorror.com

Parsing Html The Cthulhu Way


The only crime committed by most novice developers is not knowing what the alternatives are. This post, on the other hand, does nothing to help promote what those alternatives are. No cherry-picked recommendations for a few common web platforms? PHP, RoR, etc.?

Hpricot and Rubyful Soup help me a good bit in RoR.


Now, I have that song in my head! Go Metallica!


“Even Jon Skeet cannot parse HTML using regular expressions.”

Them’s fightin’ words.


ESR disagrees :)


Is it just me or does this blog post try to argue both sides of the same issue?


“Even Jon Skeet cannot parse HTML using regular expressions.”

I lol’d, what a great way to put this in perspective.


But the link to CPAN HTML::Sanitizer is broken.


Link behind “HTML::Sanitizer” is dead.


So what is the preferred method for dealing with XSS (Cross Site Scripting) issues then, particularly if you’re using a Rich Text Editor that saves formatting as HTML?


The last time I went for an HTML library to parse some HTML, the HTML was so broken I had to resort to regex.

The regex broke afterwards, after the generated HTML was slightly changed. It was trivially fixed.

So, while I agree that HTML (and, particularly, XML) should be parsed appropriately, YMMV. I get the feeling a lot of people who complain about regex have never bothered to LEARN it as the complex language it is.
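To make the trade-off above concrete, here is a minimal sketch in Python of the "regex broke after the generated HTML was slightly changed" scenario. The pattern, the sample markup, and the names `HREF_RE`, `LinkCollector`, `links_via_regex`, and `links_via_parser` are all made up for illustration; only the standard library's `re` and `html.parser` modules are assumed.

```python
import re
from html.parser import HTMLParser

# Hypothetical regex: assumes href is the first attribute and is double-quoted.
HREF_RE = re.compile(r'<a href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, regardless of attribute order or quote style."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_via_regex(html):
    return HREF_RE.findall(html)


def links_via_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Illustrative markup before and after a "slight change" by the generator.
# The <p> is deliberately left unclosed in both samples.
before = '<p><a href="/one">one</a> <a href="/two">two</a>'
after = '<p><a class="nav" href=\'/one\'>one</a> <a href=\'/two\' class="nav">two</a>'

print(links_via_regex(before), links_via_parser(before))
print(links_via_regex(after), links_via_parser(after))
```

In this sketch the regex finds both links in `before` and none in `after`, while the parser-based version finds both in each. That does not settle the argument: a more tolerant regex would cope with this particular change too, which is the "trivially fixed" part of the story.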


What about HTMLTidy? http://tidy.sourceforge.net/ Convert stuff to proper XHTML and then use your XML processing mechanism of choice to parse the data.
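A rough sketch of the second half of that pipeline, in Python, assuming Tidy has already been run and produced well-formed XHTML. The `tidied` string is a made-up stand-in for Tidy's output and `table_rows` is a hypothetical helper. Only the standard library's `xml.etree.ElementTree` is used, and no Tidy binding is assumed here.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so queries need a prefix mapping.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Made-up sample standing in for what Tidy might emit as XHTML.
tidied = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<table>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</table>
</body>
</html>"""


def table_rows(xhtml):
    """Return the text of every table row as a list of cell strings."""
    root = ET.fromstring(xhtml)
    rows = []
    for tr in root.iterfind(".//x:tr", XHTML_NS):
        cells = ["".join(td.itertext()).strip() for td in tr.iterfind("x:td", XHTML_NS)]
        rows.append(cells)
    return rows


print(table_rows(tidied))
```

One caveat with this approach: a strict XML parser rejects anything that is not well-formed, and named HTML entities such as `&nbsp;` are undefined to it unless a DTD supplies them, so the cleanup step has to produce output the XML parser will actually accept.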


This would be a lot more helpful if some specific libs besides the Perl solution were posted. I had a non-trivial time trying to find ready-to-use stable libraries on various platforms (e.g. PHP). Any suggestions?


Can someone get me one of those T-shirts with “I parse HTML with RE” on front?


I think that’s just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.

Maybe HTML processing isn’t trivial, Jeff.


http://search.cpan.org/~nesting/HTML-Sanitizer-0.04/Sanitizer.pm tells me “not found”, btw. Pretty surprising, as we can still find the code in Nesting’s archives, and it is still referred to at, e.g., http://search.cpan.org/~podmaster/HTML-Scrubber-0.08/Scrubber.pm


I’m torn here. I mean, there’s jwz’s famous quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

On the other hand, CPAN isn’t particularly useful if you aren’t using Perl. And “Hell is other people’s Perl”.


I have no joke here, I just like saying http://discordianquotes.com/quote/8008


“The only crime being perpetrated is not knowing what the alternatives are.”

Yup - that just about covers the whole darn thing! Who hasn’t worked with developers who rolled their own XYZ when that program is already out there and supported by some other community?


This is the most useful thing you’ve said in a long time. Glad to have you back, Jeff.


I have a daily job that scrapes a page. Unfortunately, it looks like the author of that page is emitting it with Word or something similarly horrible. Tables within tables within tables, without sanity (or regularity). The faint whisper of “ia, ia” curdles around the mind when contemplating the source.

A DOM parser would gibber insanely to itself, quietly screaming at the brokenness of the form.

Since I didn’t particularly want to do that to a parser, I regexed away and was able to do it. Only minor loss of sanity points…