a companion discussion area for blog.codinghorror.com

Parsing Html The Cthulhu Way


#81

ElderSign
I bet Chuck Norris can parse HTML using RegEx.
/ElderSign


#82

The code of Cthulhu…

Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?


#83

Well, ASP.NET uses regex to parse HTML, and it works quite well. In fact, open up Reflector and point it at the “System.Web.UI.BaseParser” class and the “ParseStringInternal” method of the “System.Web.UI.TemplateParser” class. You will see that it can work when used properly.


#84

Arrrrggghhhhhhh…I was too late. The open cthulhu tag has gained enough power to swallow attempts to close it. Run. Run for your lives. Chaos is coming.


#85

Regular expressions are based on math (formal language theory, actually: http://en.wikipedia.org/wiki/Regular_expressions#Formal_language_theory). (X)HTML is based on a tree structure, which is a data structure. Those two fields are not related; that is why it’s awkward to use regex to parse HTML.
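To make the awkwardness concrete, here is a small Python sketch (the sample strings and the helper name `outer_div_content` are made up for illustration). A single pattern either stops too early or runs too far on nested tags, while even a crude depth counter gets it right. It only understands bare `<div>` tags, so it is a demonstration, not a parser.

```python
import re

html = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# Non-greedy: stops at the first closing tag and cuts the outer div short.
lazy = re.search(r"<div>(.*?)</div>", html).group(1)

# Greedy: runs to the last closing tag and swallows the sibling div.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)


def outer_div_content(text):
    """Return the content of the first top-level <div>, tracking nesting depth."""
    depth = 0
    start = None
    for m in re.finditer(r"<(/?)div>", text):
        if m.group(1) == "":          # opening tag
            if depth == 0:
                start = m.end()
            depth += 1
        else:                         # closing tag
            depth -= 1
            if depth == 0 and start is not None:
                return text[start:m.start()]
    return None
```

The counter is the part a pure regular expression cannot express: it has to remember how many tags are still open.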


#86

What I do is run HTML Tidy on the user input/document and then use XSLT to whitelist the acceptable parts. No scripts, styles, or proprietary shit make it through.
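For readers who want to see the whitelist idea in code, here is a rough Python sketch. It swaps in the standard library's `html.parser` for the Tidy + XSLT pipeline described above, so it is not the extension's actual code. The tag and attribute lists are hypothetical, and it does not check URL schemes in `href`, so treat it as an illustration rather than a production sanitizer.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "em", "strong", "a", "ul", "ol", "li"}  # hypothetical whitelist
ALLOWED_ATTRS = {"a": {"href"}}                              # hypothetical, per tag
DROP_WITH_CONTENT = {"script", "style"}                      # drop tag and its body


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            allowed = ALLOWED_ATTRS.get(tag, set())
            rendered = "".join(
                f' {name}="{escape(value)}"'
                for name, value in attrs
                if name in allowed and value is not None
            )
            self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text outside script/style is kept, re-escaped for output.
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Keep whitelisted tags and attributes, keep text, drop everything else."""
    parser = WhitelistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Tags that are not on the list are dropped but their text survives; `script` and `style` lose their content as well.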

I’ve even added a third step before running it through the XSLT that adds cool features similar to markdown or textile.

You can take my code and run, if you want, I wrote this as an extension for Symphony CMS: http://github.com/rowan-lewis/htmlformatter/


#87

Ok, I’m a fucking retard, the code above no longer uses the XSLT whitelist, but what the hell, you get the idea right?


#88

The pingback 1.0 specification actually uses regexp for parsing HTML to autodiscover the pingback URL.

However, in this case I think it’s not a bad choice, because it greatly simplifies handling and code.
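A sketch of what that autodiscovery looks like in Python. The pattern is the one I recall from the Pingback 1.0 specification, so treat it as an assumption and check the spec before relying on it. The function name `discover_pingback` and the example URLs are made up. It works only because the spec requires the `link` element to appear in one exact form, which is what makes a regex tolerable here.

```python
import re
from html import unescape

# Pattern as recalled from the Pingback 1.0 spec (assumption, verify against the spec).
PINGBACK_LINK = re.compile(r'<link rel="pingback" href="([^"]+)" ?/?>')


def discover_pingback(page_source):
    """Return the pingback server URL, or None if no exact-form link element exists."""
    match = PINGBACK_LINK.search(page_source)
    if match is None:
        return None
    # The href value is HTML-escaped in the page, so decode entities such as &amp;.
    return unescape(match.group(1))
```

A page that writes the attributes in a different order is simply not discovered, which is the trade-off for keeping the client code this small.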


#89

Who else thinks there should be a Cthulhu badge for StackOverflow?


#90

Am I the only one who noticed that Cthulhu is not a god, but a Great Old One?


#91

cough BeautifulSoup cough


#92

Are you all insane? HTML is easy to parse using any language that has good string handling/matching (even VB works, although it gets slow).

How do you think a browser manages this? Typesetting programmes have been doing the same for decades, with the same type of tags (think SGML), long before HTML. Evaluating a bunch of tags is a trivial first-principle task to any competent programmer.

Naturally, using a library is the quickest and most reliable way of doing this and there are lots of 'em.
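As one example of the library route, here is a minimal sketch using Python's standard-library `html.parser` to pull link targets out of a page. The class and function names are made up for illustration; any real HTML library would do the same job.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Mixed case, single quotes, and unquoted attributes are handled by the parser, which is exactly the tedium you do not want to reimplement by hand.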

What is the big deal?


#93

I remember years ago writing a web app and needing a back-end piece to parse some HTML. It started out simple and naive, and then a month later I had built this monstrous library of Perl regexes to parse the HTML. It was a tar pit that held no escape.


#96

@Rob
"Am I the only one here wondering just WHAT EXACTLY THE HELL would Cthulhu actually be developing?"

See Charles Stross’s “The Atrocity Archives” (http://www.amazon.com/Atrocity-Archives-Charles-Stross/dp/0441016685/) and “The Jennifer Morgue” to see what software might interest Cthulhu and his kin.


#97

There should be a filter in Stack Overflow that automatically deletes, reports, or marks as dangerous any post that includes “html parsing regular expressions”, and redirects the submitter to this post.


#98

Nice to see a respectable minority familiar with the mythos. Tekeli-li, mother fuckers!

“Only minor loss of sanity points…”
–Paul N on November 16, 2009 9:36 AM

But Paul N goes the extra step, exposing his familiarity with the beloved RPG; see you at GenCon! Now roll 3d20 and ignore the result.

Cthulhu for president 2012 : Why settle for the lesser of two evils?


#99

@Phil Brass: Regular expressions don’t permit recursion. You’ve smuggled it in by using the semantics of the language in which you’re defining the regex. You have to go to type-2 grammars in order to parse nested constructs (like HTML, or parentheses).

Turing-complete is a stronger computational class than that of a type-2 grammar (which is, IIRC, recognized by a pushdown automaton; regexes are equivalent to nondeterministic finite state machines), so it’s not really surprising that you can parse HTML with regular expressions + glue code in Perl or whatever, but it’s still not really a good idea compared to writing a proper recursive-descent parser.
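For anyone who hasn't written one, here is a minimal Python sketch of a recursive-descent parser for a toy tag language (no attributes, no void elements, no error recovery, so nowhere near real HTML). The grammar and every function name are made up for illustration. Note that the regex is used only to split the input into tokens, which is a regular problem; the nesting is handled by the recursion.

```python
import re

# Tokens: an opening or closing tag, or a run of text without "<".
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


def tokenize(text):
    """Split text into ('open', name), ('close', name) and ('text', value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.group(3) is not None:
            tokens.append(("text", m.group(3)))
        elif m.group(1):
            tokens.append(("close", m.group(2)))
        else:
            tokens.append(("open", m.group(2)))
        pos = m.end()
    return tokens


def parse_nodes(tokens, pos):
    """nodes := (text | element)*   Stops at a closing tag or end of input."""
    nodes = []
    while pos < len(tokens) and tokens[pos][0] != "close":
        kind, value = tokens[pos]
        if kind == "text":
            nodes.append(value)
            pos += 1
        else:
            node, pos = parse_element(tokens, pos)
            nodes.append(node)
    return nodes, pos


def parse_element(tokens, pos):
    """element := open(name) nodes close(name)"""
    name = tokens[pos][1]
    children, pos = parse_nodes(tokens, pos + 1)  # the recursion handles nesting
    if pos >= len(tokens) or tokens[pos] != ("close", name):
        raise ValueError(f"missing </{name}>")
    return (name, children), pos + 1


def parse(text):
    """Parse text into a list of strings and (name, children) tuples."""
    tokens = tokenize(text)
    nodes, pos = parse_nodes(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected closing tag </{tokens[pos][1]}>")
    return nodes
```

The call stack plays the role of the pushdown automaton's stack: each open element is a pending call to `parse_element` waiting for its matching close.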


#100

You can do whatever you like, even if it seems stupid, but only if you do it well.