Parsing HTML the Cthulhu Way

Rob_Mueller · November 17, 2009, 12:00am

If you want a Perl-based HTML parser specifically designed to remove XSS-style attacks, check out HTML::Defang (http://search.cpan.org/~kurianja/HTML-Defang-1.02/).

HTML::Defang uses a custom HTML tag parser. The parser has been designed and tested to work with nasty real-world HTML and to emulate as closely as possible what browsers actually do with strange-looking constructs. The test suite has been built from examples from a range of sources, such as http://ha.ckers.org/xss.html and http://imfo.ru/csstest/css_hacks/import.php, to ensure that as many XSS attack scenarios as possible have been dealt with.
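
For readers who have not seen this kind of sanitizer, here is a minimal sketch of the general allowlist idea, written in Python rather than Perl. It is not HTML::Defang's algorithm or API. The names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `SAFE_SCHEMES` and `sanitize` are made up for illustration, and Python's `html.parser` makes no attempt to emulate browser quirks, so treat it as a toy and not as a defence against the attack lists mentioned above.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists, chosen only for this example.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowlisted tags and attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and friends
            value = value or ""
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

The design choice worth noticing is that nothing from the input is copied through verbatim: every tag is rebuilt from parsed pieces, so anything the sanitizer does not recognise simply never reaches the output.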

toettoe · November 17, 2009, 12:00am

We only allow XHTML to be saved, and it is validated before it is saved. All problems are solved this way. The WYSIWYG editor only allows valid XHTML to be created.
Standard parsers for X(HT)ML are available in abundance and are rock solid. Of course, wandering through the DOM is not as easy as it seems at first glance, but once you understand the subtleties, the knowledge is useful for many tasks. Code written with this approach is performant, safe and correct.
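
A rough sketch of what a "check before saving" step could look like, using Python's standard XML parser. The function names and the wrapping `<div>` are assumptions made for the example, not a description of the setup above. Note that this only checks well-formedness; validating against the XHTML DTD or schema needs more than the standard library, and a well-formed fragment can still contain a `<script>` element, so the second function walks the tree looking for elements outside a chosen set.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xhtml(fragment):
    """True if the fragment parses as XML once wrapped in a single root."""
    try:
        ET.fromstring(f"<div>{fragment}</div>")
    except ET.ParseError:
        return False
    return True


def disallowed_elements(fragment, allowed):
    """Names of elements in the fragment that are not in `allowed`.

    Raises ET.ParseError if the fragment is not well-formed.
    """
    root = ET.fromstring(f"<div>{fragment}</div>")
    found = []
    for element in root.iter():
        if element is root:
            continue  # skip the wrapper added above
        name = element.tag.rsplit("}", 1)[-1]  # strip any namespace prefix
        if name not in allowed:
            found.append(name)
    return found
```

One practical catch: HTML named entities such as `&nbsp;` are not defined in plain XML, so a fragment containing them is rejected by this check unless the editor emits numeric character references instead.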

Punky · November 17, 2009, 12:00am

101’st

If you use xhtml it’s pretty straightforward.

Anyway, everything’s just a heap of div tags these days.

Santi · November 18, 2009, 12:00am

Per request.

Santi · November 18, 2009, 12:00am

/cthulhu

differentC · November 18, 2009, 12:00am

What HTML are you parsing? Your own, someone you knows, or a whole engine for spidering? The first two could get away with regex; the last one I’m not so sure.

differentC · November 18, 2009, 12:00am

My last comment “Someone you knows” wasn’t a West Country accent but a typo

RobertS · November 18, 2009, 12:00am

Hello Coding Horror,

My name is Robert Sullivan and I am the advertising director for Dark Recesses Enterprise (www.darkrecesses.com). Dark Recesses is an on-line horror fiction periodical, published by Boyd E. Harris and edited by Bailey Hunter, among others.

Dark Recesses Enterprises wants to expand the contemporary definition of horror, to push the boundaries beyond the commercial marketplace definitions, by providing quality horror industry news and articles, and by publishing the best in short fiction, by today’s up and coming writers.

I am sending you this message just to make contact, to establish a line of communication. I do want to sell advertising space on our website and in our periodicals, but at this point I am taking a low pressure approach.

Please contact me at your earliest convenience.

Sincerely,

Robert Sullivan
Dark Recesses Enterprise
(Home) 256-747-8683
(Cell) 334-220-4117

Victor · November 18, 2009, 12:00am

Webrat?

http://github.com/brynary/webrat

wow · November 18, 2009, 12:00am

^

spam at its best

Steve_O2 · November 18, 2009, 12:00am

“The only crime being perpetrated is not knowing what the alternatives are.”

I commit this crime regularly. In some cases, there are just so many options for everything you could think of doing… sometimes picking the ‘right one’ for the job takes longer than picking the first and hacking up a solution.

keldorn · November 19, 2009, 12:00am

Haven’t you ever heard of CGIProxy, Glype, or PHProxy? These do exceptionally well at mirroring websites, especially Glype and CGIProxy, by modifying the HTML with regexes.

Scott · November 20, 2009, 12:00am

In the beginning I parsed XML with regex, then I learned XSLT, and there was much rejoicing!

CJH___esper · November 21, 2009, 12:00am

Jeff, I believe you dropped this:
/cthulhu

Ratty · November 21, 2009, 12:00am

You guys are just quitters… I parse HTML with Regular Expressions all the time. The trick is to do it in two passes. The first pass extracts the tag, the second pass processes the tag. I use this approach in PHP to import external web pages into a CMS together with all their referenced stylesheets, images, media, and javascript files. It also recursively parses the stylesheet external file references.

Actually, now that I think about it, this is kind of a compromise, since I use individual regexes for each tag. In other words, it’s a kind of halfway house between a pure hand-crafted heuristic and a more orthogonal approach… This is probably the way the HTML parsing libraries do it anyway.
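
To make the two-pass idea concrete, here is a small sketch in Python (the original is described as PHP, and none of this is the actual code from that CMS importer). `TAG_RE`, `ATTR_RE` and `referenced_files` are names invented for the example. Pass one pulls out whole opening tags; pass two looks inside each tag's attribute text for `src` or `href`.

```python
import re

# Pass 1: find opening <img>, <link> and <script> tags and capture
# everything between the tag name and the first '>'.
TAG_RE = re.compile(r"<\s*(img|link|script)\b([^>]*)>", re.IGNORECASE)

# Pass 2: inside that captured text, find src/href with a double-quoted,
# single-quoted or unquoted value.
ATTR_RE = re.compile(
    r"""\b(src|href)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def referenced_files(html_text):
    """Return (tag, attribute, value) triples for external file references."""
    refs = []
    for tag_match in TAG_RE.finditer(html_text):
        tag = tag_match.group(1).lower()
        attr_text = tag_match.group(2)
        for attr_match in ATTR_RE.finditer(attr_text):
            name = attr_match.group(1).lower()
            # Exactly one of the three value groups is not None.
            value = next(g for g in attr_match.groups()[1:] if g is not None)
            refs.append((tag, name, value))
    return refs
```

It also shows where this style gives out: `[^>]*` ends the tag at the first `>`, so a tag like `<img alt="a > b" src="x.png">` is cut short and its `src` is never seen, and `\bsrc` will happily match inside `data-src`. Each such case can be patched, which is how the per-tag regexes tend to accumulate.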

Kahl · November 22, 2009, 12:00am

@Ratty: And what’s the advantage of not using an existing parser??? Too much spare time?

John · November 22, 2009, 12:00am

Can you please stop using PHP in a derogatory manner? After all, you’re the one actually advocating writing enterprise apps on Windows.

BobSmall · December 1, 2009, 12:00am

Parsing HTML with regular expressions = bad idea, granted, but locating something specific within HTML with regex = good idea. Want to find A tags: use regex. Want to locate images: use regex. Want to apply XSLT to HTML for the purpose of converting it to an RSS feed: use a dedicated parser. Html Agility Pack, Tidy, System.Html: all fine parsers, all easy to use, and they give 99% of the result with 1% of the effort.
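
For comparison, this is roughly what "find the A tags and the images" looks like with a stock parser instead of a regex. It is a sketch using Python's built-in `html.parser`, not any of the libraries named above; `LinkAndImageCollector` and `collect` are names made up for the example.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def collect(html_text):
    collector = LinkAndImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links, collector.images
```

The parser deals with upper-case tags, unquoted values, self-closing tags and a `>` inside a quoted attribute, which are the cases a quick regex usually has to be patched for one at a time.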

“A good artist copies. A great artist steals”. Leverage an API!

Bryan · December 1, 2009, 12:00am

It’s not really about using regex vs. some other parsing method; it’s really just about the cohesion between the search mechanism and the rest of the software.

Regex has its place as a simple search mechanism. It’s easy to implement and generally gets the job done in a productive fashion. If the searching is complex, then a different mechanism should be used.

The only thing that would irk me is if the searching function call was located deep within a 1000 line module. I wouldn’t care at all if I had to replace a single search class.

If the project had unit tests, that makes replacing the algorithm even easier.
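
A small sketch of what that might look like in Python, assuming a made-up `LinkFinder` interface with two interchangeable implementations. All class and function names here are hypothetical. The calling code only knows about `find_links`, so swapping the regex version for the parser version is a one-line change, and the same test can be run against both.

```python
import re
from html.parser import HTMLParser
from typing import List, Protocol


class LinkFinder(Protocol):
    """The only thing the rest of the program is allowed to depend on."""

    def find_links(self, html_text: str) -> List[str]: ...


class RegexLinkFinder:
    # Deliberately simple: double-quoted href values on <a> tags only.
    _HREF = re.compile(r'<a\b[^>]*?\bhref\s*=\s*"([^"]*)"', re.IGNORECASE)

    def find_links(self, html_text: str) -> List[str]:
        return self._HREF.findall(html_text)


class ParserLinkFinder:
    def find_links(self, html_text: str) -> List[str]:
        links: List[str] = []

        class _Collector(HTMLParser):
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    links.extend(v for k, v in attrs if k == "href" and v)

        collector = _Collector()
        collector.feed(html_text)
        collector.close()
        return links


def page_summary(html_text: str, finder: LinkFinder) -> str:
    """Example caller: it never touches a regex or a parser directly."""
    links = finder.find_links(html_text)
    return f"{len(links)} link(s): {', '.join(links)}"


def check_finder(finder: LinkFinder) -> None:
    """One shared test that any replacement implementation has to pass."""
    sample = '<p><a href="/a">A</a> and <a class="x" href="/b">B</a></p>'
    assert finder.find_links(sample) == ["/a", "/b"]
```

Running `check_finder(RegexLinkFinder())` and `check_finder(ParserLinkFinder())` is the whole migration test; if the search call were instead buried deep in a 1000-line module, there would be nothing this small to swap or to test.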

I’m posting the Cthulhu picture on my wall at work anyway.

Great post.