Parsing Html The Cthulhu Way

The only crime committed by most novice developers is not knowing what the alternatives are. This post, on the other hand, does nothing to help promote what those alternatives are. No cherry picked recommendations for a few common web languages? PHP, RoR, etc?

Hpricot and Rubyful Soup help me a good bit in RoR.

Now, I have that song in my head! Go Metallica!

“Even Jon Skeet cannot parse HTML using regular expressions.”

Them’s fightin’ words.

ESR disagrees :)
http://www.jgc.org/blog/2009/11/parsing-html-in-python-with.html

Is it just me or does this blog post try to argue both sides of the same issue?

“Even Jon Skeet cannot parse HTML using regular expressions.”

I lol’d, what a great way to put this in perspective.

But the link to CPAN HTML::Sanitizer is broken.

Link behind “HTML::Sanitizer” is dead.

So what is the preferred method for dealing with XSS (Cross Site Scripting) issues then, particularly if you’re using a Rich Text Editor that saves formatting as HTML?

The last time I went for an HTML library to parse some HTML, the HTML was so broken I had to resort to regex.

The regex broke afterwards, after the generated HTML was slightly changed. It was trivially fixed.
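A minimal sketch of that failure mode, assuming a scraper that only wants link targets. The `before` and `after` snippets, the `HREF_RE` pattern, and the `LinkCollector` class are all hypothetical illustrations, not the commenter's actual code. It compares a narrow regex against Python's standard-library `html.parser` when the markup changes slightly:

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: "after" is the same link with slightly changed HTML
# (an extra attribute and single quotes instead of double quotes).
before = '<a href="/report.pdf">Report</a>'
after = "<a class='dl' href='/report.pdf'>Report</a>"

# A deliberately narrow pattern of the kind that tends to get written in a hurry.
HREF_RE = re.compile(r'<a href="([^"]+)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quoting style is already handled.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_via_parser(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


regex_before = HREF_RE.findall(before)
regex_after = HREF_RE.findall(after)
parser_before = links_via_parser(before)
parser_after = links_via_parser(after)

print("regex: ", regex_before, regex_after)
print("parser:", parser_before, parser_after)
```

The regex can of course be patched to cope with the new markup, which is the "trivially fixed" part; the parser version simply does not need the patch for this particular change.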

So, while I agree that HTML (and, particularly, XML) should be parsed appropriately, YMMV. I get the feeling a lot of people who complain about regex have never bothered to LEARN it as the complex language it is.

What about HTMLTidy? http://tidy.sourceforge.net/ Convert stuff to proper XHTML and then use your XML processing mechanism of choice to parse the data.
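A rough sketch of the second half of that pipeline, assuming Tidy has already been run and produced well-formed XHTML. The `xhtml` string below is a hypothetical stand-in, not real Tidy output, and `extract_links` is an illustrative helper name. It uses Python's standard `xml.etree.ElementTree` as the "XML processing mechanism of choice":

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for cleaned-up XHTML; not actual HTML Tidy output.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>See <a href="http://example.com/a">A</a> and
       <a href="http://example.com/b">B</a>.</p>
  </body>
</html>"""

# XHTML elements live in this namespace, so lookups must be namespace-qualified.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_links(document):
    """Return the href of every <a> element in a well-formed XHTML string."""
    root = ET.fromstring(document)
    return [a.get("href") for a in root.iterfind(".//x:a", XHTML_NS) if a.get("href")]


links = extract_links(xhtml)
print(links)
```

This only works if the cleanup step really does yield well-formed XML; anything that is not (unclosed tags, undeclared entities) will make `ET.fromstring` raise a `ParseError` rather than guess.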

This would be a lot more helpful if some specific libs besides the Perl solution were posted. I had a non-trivial time trying to find ready-to-use stable libraries on various platforms (e.g. PHP). Any suggestions?

Can someone get me one of those T-shirts with “I parse HTML with RE” on the front?

I think that’s just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.

Maybe HTML processing isn’t trivial, Jeff.

http://search.cpan.org/~nesting/HTML-Sanitizer-0.04/Sanitizer.pm tells me “not found”, btw. Pretty surprising, as we can still find the code in Nesting’s archives, and it is still referred to at e.g. http://search.cpan.org/~podmaster/HTML-Scrubber-0.08/Scrubber.pm

I’m torn here. I mean there’s jwz’s famous quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

On the other hand, CPAN isn’t particularly useful if you aren’t using Perl. And “Hell is other people’s Perl”.

I have no joke here, I just like saying http://discordianquotes.com/quote/8008

“The only crime being perpetrated is not knowing what the alternatives are.”

Yup, that just about covers the whole darn thing! Who hasn’t worked with developers who rolled their own XYZ when that program is already out there and supported by some other community?

This is the most useful thing you’ve said in a long time. Glad to have you back, Jeff.

I have a daily job that scrapes a page. Unfortunately, it looks like the author of that page is emitting it with Word or something similarly horrible. Tables within tables within tables without sanity (or regularity). The faint whisper of “ia, ia” curdles around the mind when contemplating the source.

A DOM parser would gibber insanely to itself, quietly screaming at the brokenness of the form.

Since I didn’t particularly want to do that to a parser, I regexed away and was able to do it. Only minor loss of sanity points…