a companion discussion area for blog.codinghorror.com

Parsing Html The Cthulhu Way


There is a big difference between parsing and simply extracting. Sure, you can’t parse HTML with regex, but if you simply want to extract a bit of data, it works better. Sanitizing HTML, parsing it into a DOM, traversing it to extract data, and then crossing your fingers that it will work for all the broken HTML out there seems like a big mistake when you just want to get a specific data value from a page. Not to mention you end up with bloatware that may not even work. This is the problem with having a CS degree: you tend to think that theory trumps practice. I remember a co-worker who once implemented a whole PostScript parser just to get at a single data value on page X of a document.


Depends on what you want to do with HTML. If you want to extract text, for example, regexes not only work, they work really well.
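For a sense of scale, that kind of one-off extraction is a couple of lines; a minimal sketch in Python (the sample HTML and pattern are illustrative, not from any particular site):

```python
import re

html = '<html><head><title>Coding Horror</title></head><body>...</body></html>'

# A narrow, anchored pattern is fine for one-off extraction,
# even though it would fail badly as a general HTML parser.
match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
title = match.group(1) if match else None
print(title)  # Coding Horror
```

The non-greedy `.*?` and the `DOTALL` flag keep it working across line breaks inside the tag, which is about as far as a pattern like this should be stretched.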


cough BeautifulSoup cough

cough Extremely slow cough
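BeautifulSoup’s draw is its tolerance for tag soup; for simple jobs, the standard-library `html.parser` (which BeautifulSoup can itself use as a backend) is a lighter-weight option. A sketch, with deliberately sloppy input:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from anchor tags, tolerating sloppy markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkExtractor()
# Note the unclosed first <a> and the unquoted attribute value --
# html.parser shrugs both off rather than raising an error.
parser.feed('<p><a href=/one>first <a href="/two">second</a></p>')
print(parser.links)  # ['/one', '/two']
```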


If you want a Perl-based HTML parser specifically designed to remove XSS-type attacks, check out HTML::Defang (http://search.cpan.org/~kurianja/HTML-Defang-1.02/)

HTML::Defang uses a custom HTML tag parser. The parser has been designed and tested to work with nasty real-world HTML and to try to emulate as closely as possible what browsers actually do with strange-looking constructs. The test suite has been built from examples drawn from a range of sources, such as http://ha.ckers.org/xss.html and http://imfo.ru/csstest/css_hacks/import.php, to ensure that as many XSS attack scenarios as possible have been dealt with.


We only allow XHTML to be saved, and it is validated before being saved. All problems are solved this way. The WYSIWYG editor only allows valid XHTML to be created.
Standard parsers for X(HT)ML are widely available and rock-stable. Of course, wandering through the DOM is not as easy as it seems at first glance, but once you understand the subtleties, the knowledge is useful for many tasks. Code written this way leads to performant, safe, and correct apps.
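Since valid XHTML is well-formed XML, any standard XML parser can walk it; a sketch with Python’s `xml.etree` (the namespace qualification is one of the subtleties mentioned above, and the sample document is made up):

```python
import xml.etree.ElementTree as ET

xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div><p>Hello</p><p>World</p></div></body></html>'
)

# fromstring() raises ParseError on malformed input -- which is
# exactly the guarantee validated XHTML gives you up front.
root = ET.fromstring(xhtml)

# XHTML elements live in a namespace, so queries must qualify tag names.
ns = {'x': 'http://www.w3.org/1999/xhtml'}
paragraphs = [p.text for p in root.findall('.//x:p', ns)]
print(paragraphs)  # ['Hello', 'World']
```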



If you use XHTML, it’s pretty straightforward.

Anyway, everything’s just a heap of div tags these days.




Per request.




What HTML are you parsing? Your own, someone you knows, or a whole engine for spidering? The first two could get away with regex; the last, I’m not so sure.


My last comment’s “someone you knows” wasn’t a West Country accent but a typo.


Hello Coding Horror,

My name is Robert Sullivan and I am the advertising director for Dark Recesses Enterprise (www.darkrecesses.com). Dark Recesses is an on-line horror fiction periodical, published by Boyd E. Harris and edited by Bailey Hunter, among others.

Dark Recesses Enterprises wants to expand the contemporary definition of horror, to push the boundaries beyond the commercial marketplace definitions, by providing quality horror industry news and articles, and by publishing the best in short fiction, by today’s up and coming writers.

I am sending you this message just to make contact, to establish a line of communication. I do want to sell advertising space on our website and in our periodicals, but at this point I am taking a low pressure approach.

Please contact me at your earliest convenience.


Robert Sullivan
Dark Recesses Enterprise
(Home) 256-747-8683
(Cell) 334-220-4117






spam at its best


“The only crime being perpetrated is not knowing what the alternatives are.”

I commit this crime regularly. In some cases, there are just so many options for everything you could think of doing that picking the ‘right one’ for the job sometimes takes longer than picking the first and hacking up a solution.


Haven’t you ever heard of CGIProxy, Glype, or PHProxy? These do exceptionally well at mirroring websites, especially Glype and CGIProxy, by modifying the HTML with regexes.
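Mirroring proxies of that sort largely come down to rewriting URLs inside attributes. A hypothetical sketch of that kind of substitution (the proxy prefix and pattern are invented for illustration, not taken from any of those projects):

```python
import re

# Hypothetical proxy endpoint that fetches and re-serves a target URL.
PROXY = 'https://proxy.example/fetch?url='

def rewrite_links(html: str) -> str:
    """Prefix absolute http(s) URLs in href/src attributes with a proxy URL."""
    pattern = re.compile(r'(href|src)="(https?://[^"]+)"', re.IGNORECASE)
    return pattern.sub(lambda m: f'{m.group(1)}="{PROXY}{m.group(2)}"', html)

page = '<a href="http://example.com/page"><img src="http://example.com/x.png"></a>'
print(rewrite_links(page))
```

Real proxies also have to handle unquoted attributes, relative URLs, CSS `url()` references, and JavaScript-generated links, which is where the regex approach starts to strain.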


In the beginning I parsed XML with regex, then I learned XSLT, and there was much rejoicing!


Jeff, I believe you dropped this:


You guys are just quitters… I parse HTML with regular expressions all the time. The trick is to do it in two passes: the first pass extracts the tag, the second pass processes the tag. I use this approach in PHP to import external web pages into a CMS together with all their referenced stylesheets, images, media, and JavaScript files. It also recursively parses the stylesheets’ external file references.

Actually, now that I think about it, this is kind of a compromise, since I use individual regexes for each tag. In other words, it’s a kind of halfway house between a pure hand-crafted heuristic and a more orthogonal approach… This is probably the way the HTML parsing libraries do it anyway.
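The two-pass idea described above might look like this in Python (the original was PHP; the patterns and per-tag handling here are a guess at the approach, not the poster’s actual code):

```python
import re

html = '<p>Intro</p><img src="logo.png"><link rel="stylesheet" href="site.css">'

# Pass 1: a deliberately dumb pattern that just lifts out whole tags.
tags = re.findall(r'<[^>]+>', html)

# Pass 2: individual per-tag regexes pick apart only the tags we care about.
assets = []
for tag in tags:
    m = re.match(r'<img\b[^>]*\bsrc="([^"]+)"', tag, re.IGNORECASE)
    if m:
        assets.append(m.group(1))
        continue
    m = re.match(r'<link\b[^>]*\bhref="([^"]+)"', tag, re.IGNORECASE)
    if m:
        assets.append(m.group(1))

print(assets)  # ['logo.png', 'site.css']
```

Splitting the work this way keeps each per-tag regex small, at the cost of pass 1 silently mangling anything unusual, such as a literal `>` inside an attribute value.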


@Ratty: And what’s the advantage of not using an existing parser? Too much spare time?