Parsing Html The Cthulhu Way

phreakre · November 16, 2009, 12:00am

The only crime committed by most novice developers is not knowing what the alternatives are. This post, on the other hand, does nothing to help promote what those alternatives are. No cherry picked recommendations for a few common web languages? PHP, RoR, etc?

Hpricot and Rubyful Soup help me a good bit in RoR.

Wayne · November 16, 2009, 12:00am

Now, I have that song in my head! Go Metallica!

Chris_F · November 16, 2009, 12:00am

“Even Jon Skeet cannot parse HTML using regular expressions.”

Them’s fightin’ words.

PadraigB · November 16, 2009, 12:00am

ESR disagrees
http://www.jgc.org/blog/2009/11/parsing-html-in-python-with.html

R__Bemrose2 · November 16, 2009, 12:00am

Is it just me or does this blog post try to argue both sides of the same issue?

Patrick · November 16, 2009, 12:00am

“Even Jon Skeet cannot parse HTML using regular expressions.”

I lol’d, what a great way to put this in perspective.

RobertC · November 16, 2009, 12:00am

But the link to CPAN HTML::Sanitizer is broken.

DennisG · November 16, 2009, 12:00am

Link behind “HTML::Sanitizer” is dead.

DominicP · November 16, 2009, 12:00am

So what is the preferred method for dealing with XSS (Cross Site Scripting) issues then, particularly if you’re using a Rich Text Editor that saves formatting as HTML?

DanielS · November 16, 2009, 12:00am

The last time I went for an HTML library to parse some HTML, the HTML was so broken I had to resort to regex.

The regex broke afterwards, after the generated HTML was slightly changed. It was trivially fixed.

So, while I agree that HTML (and, particularly, XML) should be parsed appropriately, YMMV. I get the feeling a lot of people who complain about regex have never bothered to LEARN it as the complex language it is.
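
As a minimal sketch of the trade-off described above (a regex that works until the generated HTML shifts slightly), here is one way it can look in Python, assuming only the standard library. The pattern, the `LinkCollector` class, and the sample markup are all hypothetical illustrations, not the code from the job in question.

```python
import re
from html.parser import HTMLParser

# Hypothetical regex tuned to one exact markup shape: <a href="...">
HREF_RE = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whatever the attribute order or quoting."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_via_regex(html):
    """Return hrefs matched by the narrow regex above."""
    return HREF_RE.findall(html)


def links_via_parser(html):
    """Return hrefs found by the tolerant standard-library parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# The same link, before and after a "slight change" in the generator.
original = '<p><a href="/a">A</a></p>'
changed = "<p><A class='x' HREF='/a'>A</a></p>"

print(links_via_regex(original), links_via_parser(original))
print(links_via_regex(changed), links_via_parser(changed))
```

The regex can of course be patched for the new shape, which is the "trivially fixed" case; the parser version simply does not need the patch for changes in case, quoting, or attribute order.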

Arethuza · November 16, 2009, 12:00am

What about HTMLTidy? http://tidy.sourceforge.net/ Convert stuff to proper XHTML and then use your XML processing mechanism of choice to parse the data.
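
A rough sketch of the second half of that pipeline, assuming the tidying step has already run and produced well-formed XHTML. The `tidied` string is a hypothetical stand-in for that output, and `extract_links` is an illustrative helper; the XML parsing uses Python's standard `xml.etree.ElementTree`. Real tidied output may also carry a DOCTYPE and named entities such as `&nbsp;`, which this sketch does not attempt to handle.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so queries must be namespace-aware.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for what a tidying step might emit.
tidied = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<p>See <a href="/one">one</a> and <a href="/two">two</a>.</p>
</body>
</html>"""


def extract_links(xhtml):
    """Return (href, link text) pairs for every <a> element in the document."""
    root = ET.fromstring(xhtml)
    return [
        (a.get("href"), "".join(a.itertext()))
        for a in root.iterfind(".//x:a", XHTML_NS)
    ]


print(extract_links(tidied))
```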

Joe · November 16, 2009, 12:00am

This would be a lot more helpful if some specific libs besides the Perl solution were posted. I had a non-trivial time trying to find ready-to-use stable libraries on various platforms (e.g. PHP). Any suggestions?

Goran · November 16, 2009, 12:00am

Can someone get me one of those T-shirts with “I parse HTML with RE” on front?

Anonymous · November 16, 2009, 12:00am

I think that’s just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.

Maybe HTML processing isn’t trivial, Jeff.

sylvainulg · November 16, 2009, 12:00am

http://search.cpan.org/~nesting/HTML-Sanitizer-0.04/Sanitizer.pm tells me “not found”, btw. Pretty surprising, as we can still find the code in Nesting’s archives, and it is still referred to at e.g. http://search.cpan.org/~podmaster/HTML-Scrubber-0.08/Scrubber.pm

Rev_Matt · November 16, 2009, 12:00am

I’m torn here. I mean, there’s jwz’s famous quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

On the other hand, CPAN isn’t particularly useful if you aren’t using Perl. And “Hell is other people’s Perl”.

chaos · November 16, 2009, 12:00am

I have no joke here, I just like saying http://discordianquotes.com/quote/8008

JarrettM · November 16, 2009, 12:00am

“The only crime being perpetrated is not knowing what the alternatives are.”

Yup - that just about covers the whole darn thing! Who hasn’t worked with developers who rolled their own XYZ when that program is already out there and supported by some other community?

LinusT · November 16, 2009, 12:00am

This is the most useful thing you’ve said in a long time. Glad to have you back, Jeff.

Paul_N · November 16, 2009, 12:00am

I have a daily job that scrapes a page. Unfortunately, it looks like the author of that page is emitting it with Word or something similarly horrible. Tables within tables within tables, without sanity (or regularity). The faint whisper of “ia, ia” curdles around the mind when contemplating the source.

A DOM parser would gibber insanely to itself, quietly screaming at the brokenness of the form.

Since I didn’t particularly want to do that to a parse, I regexed away and was able to do it. Only minor loss of sanity points…