Parsing Html The Cthulhu Way

“Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.”

Nice post, good to see other developers considering the wrath of the Old Ones whilst they are working. Having Shub-Niggurath show up due to faulty error trapping is in no-one’s interests.

It’s not Cthulhu. It’s ZA̡͊͠͝LGΌ.

Instead of HTML::Sanitizer, just point to http://search.cpan.org/search?query=html+parser&mode=all and let people pick one.

Meh… use the right tool for the job. And sometimes, that means using regexes - if you’re dealing with a consistently formed XML or HTML file, a simple regex may be a lot less effort than using a dedicated parser…
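For instance, a minimal sketch of pulling values out of a consistently machine-generated file (the feed and the pattern here are hypothetical, just to show the shape of the tradeoff):

```python
import re

# Hypothetical, consistently machine-generated feed; names and pattern are
# illustrative, not from any real project.
xml = """
<item><title>First post</title></item>
<item><title>Second post</title></item>
"""

# Fine for a fixed format you control; brittle for arbitrary HTML in the wild.
titles = re.findall(r"<title>([^<]*)</title>", xml)
print(titles)  # ['First post', 'Second post']
```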

Didn’t we argue about this a year ago, and you dismissed me with “programming is hard, let’s go shopping”?

Yes, that’s right, you did: http://www.codinghorror.com/blog/archives/001172.html

Instead of putting your time into improving a working, open source HTML parser (which just recently added a selector engine), you wrote a bunch of hacky regexes. Now you have ~~2 problems~~ wasted valuable development hours, and you deserve the pain for taunting my warnings.

Also, your wack-ass busted old moveeeablee typee cobol blogg enginne hath wacked my comment formatting. Bah.

I got downvoted on StackOverflow for saying that Regex is not the right solution for parsing HTML. It was offset by 11 upvotes, but some people will just never get it. It’s one thing to use a regex to tokenize HTML, but another thing entirely to use them as if HTML were a regular grammar.
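To illustrate the distinction (a toy sketch, not from the Stack Overflow answer in question):

```python
import re

# Tokenizing markup is the regular part: each tag or text run matches in
# isolation, with no memory of nesting required.
TOKEN = re.compile(r"<[^>]+>|[^<]+")
for tok in TOKEN.finditer("<p>Hello <b>world</b></p>"):
    print(tok.group(0))

# What is NOT regular is pairing each <div> with its own </div> at
# arbitrary depth -- that takes a stack, i.e. a parser, not a regex.
```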

Jeff, didn’t you spend a considerable amount of time in one of the StackOverflow podcasts trying to convince Joel that it was OK for you to try and parse Markup with a bunch of regular expressions, despite the fact that it’s not a regular language and runs into a bunch of the same types of problems?

Whoops heh, that’s what I get for not looking at the date of the post… for some reason this just popped up in my RSS reader again.

Back in the day, before HTML parsing was a solved problem, I wrote my own C HTML parser. I even had my own version of XPath for it.

I was seduced by the RegExHtmlMonster. I woke up screaming and decided it was time to parse the nightmares away.

Jeff, how do you explain the popularity of syntax highlighters that use regular expressions?

http://pygments.org
http://qbnz.com/highlighter
http://code.google.com/p/google-code-prettify/
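For what it’s worth, highlighters get away with it because they only ever tokenize, line by line, and tokenizing is the regular part; they never try to match nested structure. A toy sketch (token names and patterns are mine, not taken from any of the projects above):

```python
import re

# Illustrative token classes; each pattern matches one flat token type.
TOKEN_PATTERNS = [
    ("keyword", r"\b(?:def|return|if|else)\b"),
    ("number", r"\b\d+\b"),
]

def highlight(line):
    # Wrap each recognized token; one line at a time, no nesting needed.
    for name, pattern in TOKEN_PATTERNS:
        line = re.sub(pattern, rf"<span class={name}>\g<0></span>", line)
    return line

print(highlight("def f(x): return x + 42"))
```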

This guy slaps Cthulhu across the face and laughs heartily!
http://jmrware.com/articles/2009/uri_regexp/URI_regex.html

If the ideology of this post and most of the comments is to be taken as gospel, then the following book will certainly make the baby Jesus cry…

LOL… now I understand bobince’s persistence in MY post about regex vs HTML: http://stackoverflow.com/questions/3951485/regex-extracting-only-the-visible-page-text-from-a-html-source-document

(…and maybe some of you would be amused by my own persistence, too :slight_smile:)

However, as I stated numerous times in my comments, I wasn’t out to parse the HTML per se, but “merely” interested in a much coarser extraction. And for my purposes, the regex approach works - it’s a tradeoff between efficiency and total robustness. But the outcome is surprisingly solid. The final implementation can be found here: http://www.martinwardener.com/regex/

Mind you, regarding the “secondary” issue (extracting all links/URLs from an HTML document), it is of no concern that this implementation is over-eager (by design, btw) and picks out a few invalid URLs (mostly pertaining to script blocks) - those will be filtered out during the subsequent URL validation anyway.
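To make that concrete, a minimal sketch of the extract-then-validate idea (the pattern and the plausibility check are simplified stand-ins, not the linked implementation):

```python
import re
from urllib.parse import urlparse

html = '<a href="http://example.com/a">x</a> <script>s = "http://junk";</script>'

# Deliberately over-eager: grab anything URL-shaped, script blocks included...
candidates = re.findall(r'https?://[^\s"\'<>]+', html)

# ...then let the later validation pass filter out the garbage.
def plausible(url):
    host = urlparse(url).netloc
    return "." in host  # toy check; real validation would do much more

print([u for u in candidates if plausible(u)])  # ['http://example.com/a']
```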

I was recently working on a Java project to retrieve all the separate unique words (content) found on a specified HTML page, and to print them alphabetically along with their frequency on that page.

My program, instead of using regular expressions, reads the file line by line. Any text that falls between a closing ‘>’ and the next opening ‘<’ bracket is read into a new variable. This new variable then contains all of the words found (visible text, not alt tags) on that web page, separated by spaces.

Using this method, the only text that is really left out are image alt tags and meta descriptions and keywords. Three regular expressions, since you love them so much, could get those before or after the fact.

My program then builds a binary search tree from the words found in that HTML file, along with their frequencies. Being a web developer, I have found this a neat tool for evaluating the keywords of a website, as it works quite well. I’m not saying it’s the ‘perfect parser’, but it works with HTML, BROKEN HTML, PHP, ASP, or most any kind of web page out there.
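The same idea fits in a dozen lines of Python, as a hedged sketch (a `Counter` standing in for the binary search tree; the input and names are illustrative):

```python
import re
from collections import Counter

html = "<html><body><h1>Hello world</h1><p>hello again, world</p></body></html>"

# Grab the runs of text between a closing '>' and the next opening '<'.
visible = re.findall(r">([^<>]+)<", html)

# A Counter stands in here for the comment's binary search tree.
words = Counter(
    w.lower().strip(".,!?")
    for chunk in visible
    for w in chunk.split()
)

# Print the unique words alphabetically with their frequencies.
for word in sorted(words):
    print(word, words[word])
```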

I am a beginning Python coder. I wanted to be able to go to www.fictionpress.com (a website containing stories people write) and turn raw HTML into the story I am trying to extract. Is there any better method than using regular expressions? Using the RE module allowed me to not only parse the HTML, but also remove headers and footers I didn’t want to see.

What should I be using instead of regular expressions for HTML?
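For reference, a minimal sketch of the parser-based alternative the thread keeps recommending, using Python’s built-in `html.parser` (the class and its names here are just an illustration; BeautifulSoup or lxml would also do):

```python
from html.parser import HTMLParser

# Collect visible text while skipping <script>/<style> blocks, which
# regex-based stripping tends to get wrong.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

p = TextExtractor()
p.feed("<p>story text</p><script>var x = 1;</script>")
print(" ".join(p.chunks))  # 'story text'
```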

If the HTML changes, then almost any scraper or parser will fail. Anyone who says otherwise is bullshitting in order to justify doing the ‘right thing’.

It is the right thing to use a DOM or similar method, because the code can be easier to read, and there are often other useful functions hanging about that can make future development easier. However, this ‘robust’ BS needs to stop.

If the HTML changes then any parser breaks, unless it’s your own HTML you’re parsing and you design it in a careful way, i.e. using lots of unique ‘id’ attributes in tags so changing the structure of the HTML doesn’t break anything.
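A quick sketch of that id-anchored style (hypothetical markup; BeautifulSoup is just one example of a parser with an id lookup):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

old = '<div><span id="price">42</span></div>'
new = '<table><tr><td><b id="price">42</b></td></tr></table>'  # restructured page

# An id lookup survives the restructuring; a regex or positional query
# keyed to the old layout would not.
for page in (old, new):
    print(BeautifulSoup(page, "html.parser").find(id="price").get_text())
```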

I agree with Chris S from November 17, 2009. His comment still stands now.