There is a big difference between parsing and simply extracting. Sure, you can’t parse HTML with regex, but if you simply want to extract a bit of data, regex works fine. Sanitizing HTML, parsing it into a DOM, traversing it, extracting the data, and then crossing your fingers that it will work for all the broken HTML out there seems like a big mistake when you just want one specific value from a page. Not to mention you end up with bloatware that may not even work. This is the problem with having a CS degree: you tend to think that theory trumps practice. I remember a co-worker who implemented a whole PostScript parser just to get at a single data value on page X of a document.
Depends on what you want to do with HTML. If you want to extract text, for example, regexes not only work, they work really well.
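For example, here’s a quick sketch of the kind of extraction being described (Python; the sample markup is made up for illustration):

```python
import re

html = ('<html><head><title>Coding Horror</title></head>'
        '<body><p>Hi &amp; bye</p></body></html>')

# Grab one specific value with a single non-greedy pattern -- no DOM needed.
title = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
print(title.group(1))  # Coding Horror

# Strip all tags to get plain text. Fine for well-behaved markup,
# though it will happily mangle comments, CDATA, and script bodies.
text = re.sub(r'<[^>]+>', '', html)
```

For "get the title of this page" it’s hard to beat three lines of regex; it only becomes a liability once you need real structure.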
cough BeautifulSoup cough
cough Extremely slow cough
If you want a Perl-based HTML parser specifically designed to remove XSS-type attacks, check out HTML::Defang (http://search.cpan.org/~kurianja/HTML-Defang-1.02/)
HTML::Defang uses a custom HTML tag parser. The parser has been designed and tested to work with nasty real-world HTML and to emulate as closely as possible what browsers actually do with strange-looking constructs. The test suite was built on examples from a range of sources, such as http://ha.ckers.org/xss.html and http://imfo.ru/csstest/css_hacks/import.php, to ensure that as many XSS attack scenarios as possible have been dealt with.
We only allow XHTML to be saved, and it is validated before saving. That solves all of these problems. The WYSIWYG editor only allows valid XHTML to be created.
Standard parsers for X(HT)ML are available in droves and are rock-stable. Of course, wandering through the DOM is not as easy as it looks at first glance, but once you understand the subtleties, the knowledge is useful for many tasks. Code written this way leads to performant, safe, and correct apps.
If you use xhtml it’s pretty straightforward.
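To illustrate the point about XHTML: since valid XHTML is well-formed XML, any standard XML parser handles it. A minimal sketch in Python (the page content is invented; the namespace lookup is the main subtlety alluded to above):

```python
import xml.etree.ElementTree as ET

xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Valid page</title></head>
  <body><p class='intro'>Hello</p></body>
</html>"""

root = ET.fromstring(xhtml)

# XHTML elements live in a namespace, so lookups must say so explicitly.
ns = {'x': 'http://www.w3.org/1999/xhtml'}
title = root.find('.//x:title', ns).text
intro = root.find(".//x:p[@class='intro']", ns).text
print(title, intro)
```

No guessing, no heuristics: either the document parses or it doesn’t, which is exactly the appeal of requiring validation up front.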
Anyway, everything’s just a heap of div tags these days.
What HTML are you parsing? Your own, someone you knows, or a whole engine for spidering? The first two could get away with regex; the last I’m not so sure.
My last comment “Someone you knows” wasn’t a West Country accent but a typo
Hello Coding Horror,
My name is Robert Sullivan and I am the advertising director for Dark Recesses Enterprise (www.darkrecesses.com). Dark Recesses is an on-line horror fiction periodical, published by Boyd E. Harris and edited by Bailey Hunter, among others.
Dark Recesses Enterprises wants to expand the contemporary definition of horror, to push the boundaries beyond the commercial marketplace definitions, by providing quality horror industry news and articles, and by publishing the best in short fiction by today’s up-and-coming writers.
I am sending you this message just to make contact, to establish a line of communication. I do want to sell advertising space on our website and in our periodicals, but at this point I am taking a low pressure approach.
Please contact me at your earliest convenience.
Dark Recesses Enterprise
spam at its best
“The only crime being perpetrated is not knowing what the alternatives are.”
I commit this crime regularly. In some cases, there are just so many options for everything you could think of doing… sometimes picking the ‘right one’ for the job takes longer than picking the first one and hacking up a solution.
Haven’t you ever heard of CGIProxy, Glype, or PHProxy? They do an exceptionally good job of mirroring websites, especially Glype and CGIProxy, by modifying the HTML with regexes.
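The core trick those proxy scripts use is simple enough to sketch. Here’s a toy Python version of the regex link-rewriting idea (the proxy URL and sample page are hypothetical, and real scripts handle far more cases):

```python
import re

PROXY = 'https://proxy.example/fetch?u='  # hypothetical proxy endpoint

html = '<a href="http://example.com/page">link</a> <img src="/logo.png"/>'

def rewrite(match):
    # Route every href/src through the proxy, the way Glype-style
    # scripts rewrite pages so all traffic flows back through them.
    attr, url = match.group(1), match.group(2)
    return '%s="%s%s"' % (attr, PROXY, url)

mirrored = re.sub(r'(href|src)="([^"]+)"', rewrite, html)
print(mirrored)
```

It doesn’t parse anything; it just finds quoted href/src attributes and prefixes the proxy URL, which is plenty for most real-world pages.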
In the beginning I parsed XML with regex, then I learned XSLT, and there was much rejoicing!
Jeff, I believe you dropped this:
Actually, now that I think about it, this is a kind of compromise, since I use individual regexes for each tag. In other words, it’s a halfway house between a pure hand-crafted heuristic and a more orthogonal approach… This is probably the way the HTML parsing libraries do it anyway.
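That "one regex per tag" halfway house might look something like this sketch (Python; the tag set and sample page are invented for illustration):

```python
import re

# Each tag of interest gets its own tailored pattern, instead of
# one grand grammar or a full parser.
PATTERNS = {
    'title': re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S),
    'h1':    re.compile(r'<h1[^>]*>(.*?)</h1>', re.I | re.S),
    'link':  re.compile(r'<a\s[^>]*href="([^"]*)"', re.I),
}

def extract(html, what):
    """Return every match for one tag-specific pattern."""
    return PATTERNS[what].findall(html)

page = '<h1>Post</h1><a href="/a">x</a><a href="/b">y</a>'
print(extract(page, 'link'))  # ['/a', '/b']
```

Each pattern can be tuned independently as odd markup turns up, which is exactly the hand-crafted-heuristic appeal, and exactly why it stops scaling once the tag list grows.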