Parsing Html The Cthulhu Way

The code of Cthulhu…

Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?

Well, ASP.NET uses regex to parse HTML, and it works quite well. In fact, open up Reflector and point it at the System.Web.UI.BaseParser class and the ParseStringInternal method on the System.Web.UI.TemplateParser class. You will see that it can work when used properly.

Arrrrggghhhhhhh…I was too late. The open cthulhu tag has gained enough power to swallow attempts to close it. Run. Run for your lives. Chaos is coming.

Regular expressions are based on math (formal language theory, actually:
http://en.wikipedia.org/wiki/Regular_expressions#Formal_language_theory
). (X)HTML is based on a tree, which is a data structure. Those two fields are not related, which is why it’s awkward to use regex to parse HTML.
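To make the mismatch concrete: a regex can grab tags, but it can’t count how deep it is, so nested elements defeat it. A minimal Python sketch (standard library only; the sample markup is made up):

    import re
    from html.parser import HTMLParser

    markup = "<div>outer <div>inner</div> tail</div>"

    # Naive regex: the non-greedy .*? stops at the FIRST </div>,
    # cutting the outer element short.
    print(re.search(r"<div>(.*?)</div>", markup).group(1))
    # -> 'outer <div>inner'  (unbalanced: the regex can't count depth)

    # A real parser keeps a stack, which is exactly what a regular
    # language lacks.
    class DivDepth(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = self.max_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag == "div":
                self.depth += 1
                self.max_depth = max(self.max_depth, self.depth)

        def handle_endtag(self, tag):
            if tag == "div":
                self.depth -= 1

    parser = DivDepth()
    parser.feed(markup)
    print(parser.max_depth)  # -> 2: the nesting the regex missed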

What I do is HTML Tidy the user input/document and then use XSLT to whitelist acceptable parts. No scripts, styles, or proprietary shit makes it through.

I’ve even added a third step, before running it through the XSLT, that adds cool features similar to Markdown or Textile.

You can take my code and run with it if you want; I wrote this as an extension for Symphony CMS: http://github.com/rowan-lewis/htmlformatter/

OK, I’m an idiot: the code above no longer uses the XSLT whitelist, but what the hell, you get the idea, right?
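For what it’s worth, here is a minimal sketch of the tidy-then-whitelist idea described above, in Python with lxml rather than the Symphony extension’s actual code; the allow-list of elements is illustrative:

    from lxml import etree, html

    # XSLT whitelist: copy only the allowed elements (attributes are
    # dropped for simplicity), erase script/style along with their
    # contents, and unwrap everything else.
    whitelist = etree.XSLT(etree.XML(b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <xsl:template match="p|em|strong|a|ul|ol|li|blockquote">
        <xsl:copy><xsl:apply-templates/></xsl:copy>
      </xsl:template>
      <xsl:template match="script|style"/>
      <xsl:template match="*">
        <xsl:apply-templates/>
      </xsl:template>
    </xsl:stylesheet>"""))

    dirty = '<p>hi <script>evil()</script><em>there</em> <span style="x">ok</span></p>'
    tree = html.fromstring(dirty)  # the lenient "tidy" step
    print(str(whitelist(tree)))    # -> <p>hi <em>there</em> ok</p>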

The Pingback 1.0 specification actually uses a regexp for parsing HTML to autodiscover the pingback URL.

However, in this case I don’t think it’s a bad choice, because it greatly simplifies the handling and the code.
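For reference, that autodiscovery boils down to a single search. A rough Python rendition (the pattern is simplified from the spec’s, and the page is made up):

    import re

    page = '''<html><head><title>A post</title>
    <link rel="pingback" href="http://example.com/xmlrpc" />
    </head><body>...</body></html>'''

    m = re.search(r'<link rel="pingback" href="([^"]+)" ?/?>', page)
    print(m.group(1) if m else "no pingback server advertised")
    # -> http://example.com/xmlrpc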

Who else thinks there should be a Cthulhu badge for StackOverflow?

Am I the only one who noticed that Cthulhu is not a god, but a Great Old One?

cough BeautifulSoup cough
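For anyone who hasn’t met it, that cough expands to roughly this (Python; pip install beautifulsoup4):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<div><p>Hello, <b>world</b></p></div>", "html.parser")
    print(soup.find("p").get_text())              # -> Hello, world
    print([t.name for t in soup.find_all(True)])  # -> ['div', 'p', 'b']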

Are you all insane? HTML is easy to parse using any language that has good string handling/matching (even VB works, although it gets slow).

How do you think a browser manages this? Typesetting programs have been doing the same for decades, with the same type of tags (think SGML), long before HTML. Evaluating a bunch of tags is a trivial first-principles task for any competent programmer.
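In that first-principles spirit, a toy tokenizer really is a few lines, with the large caveat that comments, CDATA, attribute values containing '>', and error recovery are where the real work hides. A Python sketch:

    import re

    # Split tag soup into tags and text runs. Deliberately naive.
    TOKEN = re.compile(r"<[^>]+>|[^<]+")

    def tokenize(markup):
        for m in TOKEN.finditer(markup):
            tok = m.group(0)
            yield ("tag" if tok.startswith("<") else "text"), tok

    for kind, tok in tokenize("<p>Hello <b>world</b></p>"):
        print(kind, repr(tok))
    # -> tag '<p>', text 'Hello ', tag '<b>', text 'world', ...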

Naturally, using a library is the quickest and most reliable way of doing this and there are lots of 'em.

What is the big deal?

I remember, years ago, writing a web app and needing a back-end piece to parse some HTML. It started out so simple and naive, and then a month later I had built this monstrous library of Perl regexes to parse the HTML. It was a tar pit that allowed no escape.

@Rob
"Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?"

See Charles Stross’s “The Atrocity Archives” (http://www.amazon.com/Atrocity-Archives-Charles-Stross/dp/0441016685/) and “The Jennifer Morgue” for an idea of what software might interest Cthulhu and his kin.

There should be a filter on Stack Overflow that automatically deletes/reports/marks as dangerous any post that includes “html parsing regular expressions”, and redirects the submitter to this post.

Nice to see a respectable minority familiar with the Mythos. Tekeli-li, motherfuckers!

“Only minor loss of sanity points…”
–Paul N on November 16, 2009 9:36 AM

But Paul N goes the extra step, exposing his familiarity with the beloved RPG; see you at GenCon! Now roll 3d20 and ignore the result.

Cthulhu for President 2012: Why settle for the lesser of two evils?

@Phil Brass: Regular expressions don’t permit recursion. You’ve smuggled it in by using the semantics of the language in which you’re defining the regex. You have to go to type-2 grammars in order to parse nested constructs (like HTML, or parentheses).

Turing completeness is a stronger computational class than that of a type-2 grammar (which corresponds, IIRC, to a pushdown automaton; regexes are nondeterministic finite automata), so it’s not really surprising that you can parse HTML with regular expressions plus glue code in Perl or whatever, but it’s still not a good idea compared to writing a proper recursive-descent parser.
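The pushdown-automaton point in one picture: checking that tags balance needs a stack, which is precisely what a finite automaton (and therefore a true regular expression) doesn’t have. A toy Python illustration over pre-tokenized tags:

    # Balanced-tag check: push on open, pop and compare on close.
    def balanced(tags):
        stack = []
        for kind, name in tags:  # e.g. ("open", "div")
            if kind == "open":
                stack.append(name)
            elif not stack or stack.pop() != name:
                return False
        return not stack

    print(balanced([("open", "div"), ("open", "b"),
                    ("close", "b"), ("close", "div")]))  # -> True
    print(balanced([("open", "div"), ("close", "b")]))   # -> False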

You can do whatever you like, even if it seems stupid, but only if you do it well.

There is a big difference between parsing and simply extracting. Sure, you can’t parse HTML with regexes, but if you simply want to extract a bit of data, they work better. Sanitizing HTML, parsing it into a DOM, traversing it, extracting the data, and then crossing your fingers that it will work for all the broken HTML out there seems like a big mistake when you just want to get one specific value from a page. Not to mention you end up with bloatware that may not even work.

This is the problem with having a CS degree: you tend to think that theory trumps practice. I remember a co-worker who implemented a whole PostScript parser to get at a single data value on page X of a document.

Depends on what you want to do with HTML. If you want to extract text, for example, regexes not only work, they work really well.
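A toy example of the extraction case (the page snippet and pattern are made up):

    import re

    page = '<span id="price">$19.99</span> plus two megabytes of other markup'
    m = re.search(r'id="price">\$([\d.]+)<', page)
    print(m.group(1) if m else "not found")  # -> 19.99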

cough BeautifulSoup cough

cough Extremely slow cough