Parsing Html The Cthulhu Way

The code of Cthulhu…

Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?

Well, ASP.NET uses regex to parse HTML, and it works quite well. In fact, open up Reflector and point it at the System.Web.UI.BaseParser class and the ParseStringInternal method on the System.Web.UI.TemplateParser class. You will see that it can work when used properly.

Arrrrggghhhhhhh…I was too late. The open cthulhu tag has gained enough power to swallow attempts to close it. Run. Run for your lives. Chaos is coming.

Regular expressions are based on math (formal language theory, actually:
http://en.wikipedia.org/wiki/Regular_expressions#Formal_language_theory
). (X)HTML is based on a tree, which is a data structure. Those two fields are not related, which is why it’s awkward to use regex to parse HTML.
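To make the mismatch concrete: a regex can grab tags, but it can’t count how deep it is, so nested elements defeat it. A minimal Python sketch (standard library only; the sample markup is made up):

    import re
    from html.parser import HTMLParser

    markup = "<div>outer <div>inner</div> tail</div>"

    # Naive regex: the non-greedy .*? stops at the FIRST </div>,
    # cutting the outer element short.
    print(re.search(r"<div>(.*?)</div>", markup).group(1))
    # -> 'outer <div>inner'  (unbalanced: the regex can't count depth)

    # A real parser keeps a stack, which is exactly what a regular
    # language lacks.
    class DivDepth(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = self.max_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag == "div":
                self.depth += 1
                self.max_depth = max(self.max_depth, self.depth)

        def handle_endtag(self, tag):
            if tag == "div":
                self.depth -= 1

    parser = DivDepth()
    parser.feed(markup)
    print(parser.max_depth)  # -> 2: the nesting the regex missed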

What I do is HTML Tidy the user input/document and then use XSLT to whitelist acceptable parts. No scripts, styles, or proprietary shit makes it through.

I’ve even added a third step, before running it through the XSLT, that adds cool features similar to Markdown or Textile.

You can take my code and run with it if you want; I wrote this as an extension for Symphony CMS: http://github.com/rowan-lewis/htmlformatter/

OK, I’m an idiot: the code above no longer uses the XSLT whitelist, but what the hell, you get the idea, right?
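For what it’s worth, here is a minimal sketch of the tidy-then-whitelist idea described above, in Python with lxml rather than the Symphony extension’s actual code; the allow-list of elements is illustrative:

    from lxml import etree, html

    # XSLT whitelist: copy only the allowed elements (attributes are
    # dropped for simplicity), erase script/style along with their
    # contents, and unwrap everything else.
    whitelist = etree.XSLT(etree.XML(b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <xsl:template match="p|em|strong|a|ul|ol|li|blockquote">
        <xsl:copy><xsl:apply-templates/></xsl:copy>
      </xsl:template>
      <xsl:template match="script|style"/>
      <xsl:template match="*">
        <xsl:apply-templates/>
      </xsl:template>
    </xsl:stylesheet>"""))

    dirty = '<p>hi <script>evil()</script><em>there</em> <span style="x">ok</span></p>'
    tree = html.fromstring(dirty)  # the lenient "tidy" step
    print(str(whitelist(tree)))    # -> <p>hi <em>there</em> ok</p>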

The Pingback 1.0 specification actually uses a regexp for parsing HTML to autodiscover the pingback URL.

However, in this case I don’t think it’s a bad choice, because it greatly simplifies the handling and the code.
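For reference, that autodiscovery boils down to a single search. A rough Python rendition (the pattern is simplified from the spec’s, and the page is made up):

    import re

    page = '''<html><head><title>A post</title>
    <link rel="pingback" href="http://example.com/xmlrpc" />
    </head><body>...</body></html>'''

    m = re.search(r'<link rel="pingback" href="([^"]+)" ?/?>', page)
    print(m.group(1) if m else "no pingback server advertised")
    # -> http://example.com/xmlrpc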

Who else thinks there should be a Cthulhu badge for StackOverflow?

Am I the only one who noticed that Cthulhu is not a god, but a Great Old One?

cough BeautifulSoup cough
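For anyone who hasn’t met it, that cough expands to roughly this (Python; pip install beautifulsoup4):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<div><p>Hello, <b>world</b></p></div>", "html.parser")
    print(soup.find("p").get_text())              # -> Hello, world
    print([t.name for t in soup.find_all(True)])  # -> ['div', 'p', 'b']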

Are you all insane? HTML is easy to parse using any language that has good string handling/matching (even VB works, although it gets slow).

How do you think a browser manages this? Typesetting programs have been doing the same for decades, with the same type of tags (think SGML), long before HTML. Evaluating a bunch of tags is a trivial first-principles task for any competent programmer.
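In that first-principles spirit, a toy tokenizer really is a few lines, with the large caveat that comments, CDATA, attribute values containing '>', and error recovery are where the real work hides. A Python sketch:

    import re

    # Split tag soup into tags and text runs. Deliberately naive.
    TOKEN = re.compile(r"<[^>]+>|[^<]+")

    def tokenize(markup):
        for m in TOKEN.finditer(markup):
            tok = m.group(0)
            yield ("tag" if tok.startswith("<") else "text"), tok

    for kind, tok in tokenize("<p>Hello <b>world</b></p>"):
        print(kind, repr(tok))
    # -> tag '<p>', text 'Hello ', tag '<b>', text 'world', ...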

Naturally, using a library is the quickest and most reliable way of doing this and there are lots of 'em.

What is the big deal?

I remember, years ago, writing a web app and needing a back-end piece to parse some HTML. It started out so simple and naive, and then a month later I had built this monstrous library of Perl regexes to parse the HTML. It was a tar pit that allowed no escape.

@Rob
"Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?"

See Charles Stross’s “The Atrocity Archives” (http://www.amazon.com/Atrocity-Archives-Charles-Stross/dp/0441016685/) and “The Jennifer Morgue” for an idea of what software might interest Cthulhu and his kin.

There should be a filter on Stack Overflow that automatically deletes/reports/marks as dangerous any post that includes “html parsing regular expressions”, and redirects the submitter to this post.

Nice to see a respectable minority familiar with the Mythos. Tekeli-li, motherfuckers!

“Only minor loss of sanity points…”
–Paul N on November 16, 2009 9:36 AM

But Paul N goes the extra step, exposing his familiarity with the beloved RPG; see you at GenCon! Now roll 3d20 and ignore the result.

Cthulhu for President 2012: Why settle for the lesser of two evils?

@Phil Brass: Regular expressions don’t permit recursion. You’ve smuggled it in by using the semantics of the language in which you’re defining the regex. You have to go to type-2 grammars in order to parse nested constructs (like HTML, or parentheses).

Turing completeness is a stronger computational class than that of a type-2 grammar (which corresponds, IIRC, to a pushdown automaton; regexes are nondeterministic finite automata), so it’s not really surprising that you can parse HTML with regular expressions plus glue code in Perl or whatever, but it’s still not a good idea compared to writing a proper recursive-descent parser.
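The pushdown-automaton point in one picture: checking that tags balance needs a stack, which is precisely what a finite automaton (and therefore a true regular expression) doesn’t have. A toy Python illustration over pre-tokenized tags:

    # Balanced-tag check: push on open, pop and compare on close.
    def balanced(tags):
        stack = []
        for kind, name in tags:  # e.g. ("open", "div")
            if kind == "open":
                stack.append(name)
            elif not stack or stack.pop() != name:
                return False
        return not stack

    print(balanced([("open", "div"), ("open", "b"),
                    ("close", "b"), ("close", "div")]))  # -> True
    print(balanced([("open", "div"), ("close", "b")]))   # -> False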

You can do whatever you like, even if it seems stupid, but only if you do it well.

There is a big difference between parsing and simply extracting. Sure, you can’t parse HTML with regexes, but if you simply want to extract a bit of data, they work better. Sanitizing HTML, parsing it into a DOM, traversing it, extracting the data, and then crossing your fingers that it will work for all the broken HTML out there seems like a big mistake when you just want to get one specific value from a page. Not to mention you end up with bloatware that may not even work.

This is the problem with having a CS degree: you tend to think that theory trumps practice. I remember a co-worker who implemented a whole PostScript parser to get at a single data value on page X of a document.

Depends on what you want to do with HTML. If you want to extract text, for example, regexes not only work, they work really well.
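A toy example of the extraction case (the page snippet and pattern are made up):

    import re

    page = '<span id="price">$19.99</span> plus two megabytes of other markup'
    m = re.search(r'id="price">\$([\d.]+)<', page)
    print(m.group(1) if m else "not found")  # -> 19.99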

cough BeautifulSoup cough

cough Extremely slow cough