Parsing Html The Cthulhu Way

Also, your wack-ass busted old moveeeablee typee cobol blogg enginne hath wacked my comment formatting. Bah.

I got downvoted on StackOverflow for saying that regex is not the right solution for parsing HTML. It was offset by 11 upvotes, but some people will just never get it. It’s one thing to use regexes to tokenize HTML, but another thing entirely to use them as if HTML were a regular grammar.

Jeff, didn’t you spend a considerable amount of time in one of the StackOverflow podcasts trying to convince Joel that it was OK for you to try and parse Markup with a bunch of regular expressions, despite the fact that it’s not a regular language and runs into a bunch of the same types of problems?

Whoops, heh, that’s what I get for not looking at the date of the post… for some reason this just popped up in my RSS reader again.

Back in the day I wrote my own HTML parser in C, before it was a solved problem. I even had my own version of XPath for it.

I was seduced by the RegExHtmlMonster. I woke up screaming and decided it was time to parse the nightmares away.

Jeff, how do you explain the popularity of syntax highlighters that use regular expressions?

http://pygments.org
http://qbnz.com/highlighter
http://code.google.com/p/google-code-prettify/

This guy slaps Cthulhu across the face and laughs heartily!
http://jmrware.com/articles/2009/uri_regexp/URI_regex.html

If the ideology of this post and most of the comments is to be believed as gospel, then the following book will certainly make the baby Jesus cry…

LOL… now I understand bobince’s persistence in MY post about regex vs HTML: http://stackoverflow.com/questions/3951485/regex-extracting-only-the-visible-page-text-from-a-html-source-document

(…and maybe some of you would be amused by my own persistence, too :)

However, as I stated numerous times in my comments, I wasn’t out to parse the HTML per se, but “merely” interested in a much coarser extraction. And for my purposes, the regex approach works - it’s a tradeoff between efficiency and total robustness. But the outcome is surprisingly solid. The final implementation can be found here: http://www.martinwardener.com/regex/

Mind you, regarding the “secondary” issue (extracting all links/URLs from an HTML document), it is of no concern that this implementation is over-eager (by design, btw) and picks out a few invalid URLs (mostly pertaining to script blocks) - those will be filtered out during the subsequent URL validation anyway.

I was recently working on a Java project to retrieve all the unique words found in the visible content of a specified HTML page, and print them alphabetically along with their frequency on that page.

My program, instead of using regular expressions, reads the file line by line. Any text that is within the ending ‘>’ and beginning ‘<’ HTML brackets is read into a new variable. This new variable then contains all of the words found (visible, not alt tags) on that web page, separated by spaces.

Using this method, the only text that is really left out is image alt text, meta descriptions, and keywords. Three regular expressions, since you love them so much, could get those before or after the fact.

My program then built a Binary Search Tree based on the words found in that HTML file, along with their frequency. Being a web developer, I have found this a neat tool to have to evaluate keywords of a website, as it works quite well. Not saying it’s the ‘perfect parser’, but it works with HTML, BROKEN HTML, PHP, ASP, or most any kind of web page out there.
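For anyone curious, here is a rough Python sketch of the approach described above. It is not the original Java program, which isn't shown here. The names `visible_text` and `word_frequencies` are made up for illustration, and a `Counter` plus `sorted` stands in for the binary search tree. As written it also counts script and style contents as words, and a literal `<` or `>` inside an attribute value will confuse it.

```python
import re
from collections import Counter


def visible_text(lines):
    """Keep only the characters that sit outside <...> brackets."""
    pieces = []
    in_tag = False  # carried across lines, so a tag may span several lines
    for line in lines:
        for ch in line:
            if ch == "<":
                in_tag = True
                pieces.append(" ")  # a tag boundary also separates words
            elif ch == ">":
                in_tag = False
                pieces.append(" ")
            elif not in_tag:
                pieces.append(ch)
        pieces.append(" ")  # so does a line break
    return "".join(pieces)


def word_frequencies(lines):
    """Return (word, count) pairs in alphabetical order."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", visible_text(lines).lower())
    return sorted(Counter(words).items())


if __name__ == "__main__":
    sample = ["<p>The cat", "and the <b>dog</b></p>"]
    for word, count in word_frequencies(sample):
        print(word, count)
```

Reading from a file would just mean passing the open file object as `lines`.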

I am a beginning Python coder. I wanted to be able to go to www.fictionpress.com (a website containing stories people write) and turn raw HTML into the story I am trying to extract. Is there any better method than using regular expressions? Using the re module allowed me to not only parse the HTML, but also remove headers and footers I didn’t want to see.

What should I be using instead of regular expressions for HTML?
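One option that needs no extra install is the `html.parser` module in Python's standard library. The sketch below is only a starting point: the class name `TextExtractor` and the helper `extract_text` are made up for illustration, and it knows nothing about how any particular site marks up its story text, headers, or footers. It simply collects text while skipping the contents of `script` and `style` elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><h1>Title</h1><p>Once upon a <b>time</b>.</p>"
        '<script>var x = "<p>hidden</p>";</script></body></html>'
    )
    print(extract_text(page))
```

Dropping headers and footers would mean tracking which element you are inside, which depends entirely on the page you are reading.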

If the HTML changes then almost any scraper or parser will fail. Anyone that says otherwise is bullshitting in order to justify doing the ‘right thing’.

It is the right thing to use a DOM or other similar method because it can be easier to read the code, and there are often other useful functions hanging about that can make any future development easier. However this ‘robust’ BS needs to stop.

If the HTML changes then any parser breaks, unless it’s your own HTML you’re parsing and you design it in a careful way, i.e. using lots of unique ‘id’ attributes in tags so changing the structure of the HTML doesn’t break anything.

I agree with Chris S from November 17, 2009. His comment still stands now.

Hi, Jeff.

You say “It’s considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that’s just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.” Well, let me tell you my story.

Once, I had a rep of 1486 on Stack Overflow. I was so excited because finally, FINALLY, I could create my own tags. This was the objective of my life. I had earned 616 rep points in one month. I deleted my Twitter and Google+ accounts so as not to lose a second. I needed a mere fourteen points! My question at http://stackoverflow.com/q/6873945 would finally have a “mozmill” tag; http://stackoverflow.com/q/6797631 and http://stackoverflow.com/q/6797779 would have the “rhinounit” tag; I could solve problems such as http://meta.stackoverflow.com/q/98584 by myself whenever I found them. I rejoiced in anticipation.

Then, I found a quite innocent question about extracting some data from HTML. It seemed to be a pretty stably structured document, so I answered with a regex that could solve the problem: http://stackoverflow.com/q/6878032#6878203 Note that I emphasized that the solution was quick’n’dirty, and that an unstable document would require a more sophisticated tool.

And I got a downvote. I could see my dreamed-of tags slipping away. I had just taken two steps back; my journey would be longer. What if more people found my answer and downvoted it too? What if I lost hundreds of rep points?! My tags! MY TAGS! I panicked. I barely managed to hold back my mourning long enough to give my testimony here, between hiccups.

There is a clear lesson here: do not parse HTML with regular expressions in any way. It can destroy your dreams, your soul, your life. If you do it, you’ll end up smoking crack. I learned the lesson and am trying to rebuild my life, maybe - MAYBE - with the ability to create tags on SO. Do not make my mistake. It is not worth it.

Jeff, I really enjoyed your article. I posted an answer to the question on SO you referred to in this article here http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/7564061#7564061. Seeing as there are so many answers, it may never be read, but what do you think about Balancing Group Definitions? I just find it interesting b/c it allows a regex engine to have state and act as a PDA.

Holler if you find my response interesting.

I see all the discussion about parsing HTML, but I still haven’t been able to find an example that would parse this:

<CLIOutput>
  <Results>
    <ReturnCode>0</ReturnCode>
    <EventCode>23000</EventCode>
    <EventSummary>CLI command completed successfully.</EventSummary>
  </Results>
  <Data>
    <Row>
      <Group>DNS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>GBS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>CSCi_7PM_Group</Group>
      <Domain>/</Domain>
      <Type>Normal</Type>
    </Row>
....   Etc
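Since this is XML rather than HTML, one way to read it is with `xml.etree.ElementTree` from Python's standard library. This is a minimal sketch, not a complete solution: the snippet above is cut off, so the `SAMPLE` below keeps only the first two rows and adds the closing `</Data>` and `</CLIOutput>` tags as an assumption about how the real output ends. The function name `parse_cli_output` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the real output closes <Data> and <CLIOutput> like this.
SAMPLE = """\
<CLIOutput>
  <Results>
    <ReturnCode>0</ReturnCode>
    <EventCode>23000</EventCode>
    <EventSummary>CLI command completed successfully.</EventSummary>
  </Results>
  <Data>
    <Row>
      <Group>DNS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>GBS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
  </Data>
</CLIOutput>
"""


def parse_cli_output(xml_text):
    """Return (return_code, rows); each row is a dict keyed by tag name."""
    root = ET.fromstring(xml_text)
    return_code = int(root.findtext("Results/ReturnCode"))
    rows = [
        {child.tag: child.text for child in row}
        for row in root.iterfind("Data/Row")
    ]
    return return_code, rows


if __name__ == "__main__":
    code, rows = parse_cli_output(SAMPLE)
    print(code)
    for row in rows:
        print(row["Group"], row["Domain"], row["Type"])
```

If the real output is not well-formed XML, `ET.fromstring` raises `ParseError` instead of guessing.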

HtmlCleaner is a parser library that I have used in the past to handle malformed HTML; it also provides limited XPath support.

I think we should use the time saved by using regex to argue about why not to use regex.

Someone said:

“Simple things like finding all the href attributes in a document are easily accomplished with a regex.”

Not even that is true.

Say I have a document that includes the following:

...
<script>
const pwn='<a href="example.com">fail</a>';
</script>
...

The Cthulhu regexp will most likely extract the “example.com” contained in the script, which was probably not the intent.
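To make that concrete, here is a small Python sketch comparing a naive `href` regex with the standard library's `html.parser`, which treats the body of a `script` element as raw text rather than markup. The regex, the `LinkCollector` class, and the extra `example.org` link in `DOC` are all made up for illustration; the script block is the one from above.

```python
import re
from html.parser import HTMLParser

# Hypothetical test document: one real link plus the script from above.
DOC = """<p><a href="https://example.org/real">ok</a></p>
<script>
const pwn='<a href="example.com">fail</a>';
</script>
"""

# A deliberately naive pattern of the "find all hrefs" kind.
HREF_RE = re.compile(r'href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Record href values of <a> start tags the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_via_regex(html):
    return HREF_RE.findall(html)


def links_via_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(links_via_regex(DOC))   # includes the string inside the script
    print(links_via_parser(DOC))  # only the real anchor
```

The regex has no idea it is looking inside a JavaScript string literal; the parser does, because it knows where the script element begins and ends.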
