a companion discussion area for blog.codinghorror.com

Parsing Html The Cthulhu Way


#142

I was recently working on a Java project to retrieve all the unique words in the content of a specified HTML page and print them alphabetically, along with their frequency on that page.

My program, instead of using regular expressions, reads the file line by line. Any text that is within the ending ‘>’ and beginning ‘<’ HTML brackets is read into a new variable. This new variable then contains all of the words found (visible, not alt tags) on that web page, separated by spaces.

Using this method, the only text that is really left out is image alt text, meta descriptions, and keywords. Three regular expressions, since you love them so much, could get those before or after the fact.

My program then built a Binary Search Tree based on the words found in that HTML file, along with their frequency. Being a web developer, I have found this a neat tool to have to evaluate keywords of a website, as it works quite well. Not saying it’s the ‘perfect parser’, but it works with HTML, BROKEN HTML, PHP, ASP, or most any kind of web page out there.
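
For readers who want to try the idea, here is a minimal sketch of the same word-frequency approach. It is not the Java program described above: it uses Python, the standard library's `html.parser` instead of a hand-rolled line reader, and a `Counter` plus `sorted` in place of the binary search tree. The names `TextCollector` and `word_frequencies` are made up for illustration, and the choice to skip `script` and `style` bodies is an assumption about what counts as visible text.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text found between tags, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def word_frequencies(html):
    """Return (word, count) pairs for the page text, sorted alphabetically."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks).lower()
    words = re.findall(r"[a-z0-9']+", text)  # assumption: ASCII words only
    return sorted(Counter(words).items())


if __name__ == "__main__":
    page = "<p>Hello <b>world</b>, hello!</p><script>var x = 1;</script>"
    for word, count in word_frequencies(page):
        print(word, count)
```

`html.parser` is lenient about unclosed or mismatched tags, so this kind of sketch should tolerate fairly broken markup, though it makes no attempt to handle alt text or meta tags.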


#143

I am a beginning Python coder. I wanted to be able to go to www.fictionpress.com (a website containing stories people write) and turn raw HTML into the story I am trying to extract. Is there any better method than using regular expressions? Using the RE module allowed me to not only parse the HTML, but also remove headers and footers I didn’t want to see.

What should I be using instead of regular expressions for HTML?


#144

If the HTML changes, then almost any scraper or parser will fail. Anyone who says otherwise is bullshitting in order to justify doing the ‘right thing’.

It is the right thing to use a DOM or a similar method because the code can be easier to read, and there are often other useful functions available that make future development easier. However, this ‘robust’ BS needs to stop.

If the HTML changes then any parser breaks, unless it’s your own HTML you’re parsing and you design it in a careful way, i.e. using lots of unique ‘id’ attributes in tags so changing the structure of the HTML doesn’t break anything.
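
To illustrate the point about unique ‘id’ attributes, here is a rough sketch in Python using only the standard library's `html.parser`. The helper `text_by_id`, the `price` id, and both sample documents are hypothetical. The depth counting assumes reasonably well-nested markup, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextById(HTMLParser):
    """Collects the text inside the element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # > 0 while inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_id(html, target_id):
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    old_layout = '<div><p id="price">42 USD</p></div>'
    new_layout = ('<section><table><tr><td>'
                  '<span id="price">42 <b>USD</b></span>'
                  '</td></tr></table></section>')
    print(text_by_id(old_layout, "price"))
    print(text_by_id(new_layout, "price"))
```

Both layouts yield the same text because the lookup depends only on the id, not on where the element sits in the tree. If the id itself is renamed or removed, this breaks like anything else.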

I agree with Chris S from November 17, 2009. His comment still stands now.


#145

Hi, Jeff.

You say “It’s considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that’s just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.” Well, let me tell you my story.

Once, I had a rep of 1486 on Stack Overflow. I was so excited because finally, FINALLY, I could create my own tags. This was the objective of my life. I got 616 rep points in one month. I deleted my Twitter and Google+ accounts so as not to lose a second. I needed a mere fourteen points! My question at http://stackoverflow.com/q/6873945 would finally have a “mozmill” tag; http://stackoverflow.com/q/6797631 and http://stackoverflow.com/q/6797779 would have the “rhinounit” tag; I could solve problems such as http://meta.stackoverflow.com/q/98584 by myself whenever I found them. I rejoiced in anticipation.

Then, I found a quite innocent question about extracting some data from HTML. It seemed to be a pretty stably structured document, so I answered with a regex that could solve the problem: http://stackoverflow.com/q/6878032#6878203 Note that I emphasized that the solution was quick’n’dirty and that a less stable document would require a more sophisticated tool.

And I got a downvote. I could see my dreamed-of tags slipping away. I had just taken two steps back; my journey would be longer. What if more people found my answer and downvoted it too? What if I lost hundreds of rep points?! My tags! MY TAGS! I panicked. I barely managed to hold back my mourning long enough to give my testimony here, between hiccups.

There is a clear lesson here: do not parse HTML with regular expressions in any way. It can destroy your dreams, your soul, your life. If you do it, you’ll end up smoking crack. I learned the lesson and am trying to rebuild my life, maybe - MAYBE - with the ability to create tags on SO. Do not make my mistake. It is not worth it.


#146

Jeff, I really enjoyed your article. I posted an answer to the question on SO you referred to in this article here http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/7564061#7564061. Seeing as there are so many answers, it may never be read, but what do you think about Balancing Group Definitions? I just find it interesting b/c it allows a regex engine to have state and act as a PDA.

Holler if you find my response interesting.


#147

I see all the discussion about parsing HTML, but I still haven’t been able to find an example that would parse this:

<CLIOutput>
  <Results>
    <ReturnCode>0</ReturnCode>
    <EventCode>23000</EventCode>
    <EventSummary>CLI command completed successfully.</EventSummary>
  </Results>
  <Data>
    <Row>
      <Group>DNS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>GBS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>CSCi_7PM_Group</Group>
      <Domain>/</Domain>
      <Type>Normal</Type>
    </Row>
....   Etc
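
For well-formed XML like this, a regex is probably not needed at all. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`. Since the sample above is cut off, the `SAMPLE` string here keeps only the first two rows, and its closing `</Data>` and `</CLIOutput>` tags are an assumption about how the real output ends. The function name `parse_cli_output` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the output above; the closing </Data> and </CLIOutput>
# tags are assumed, because the original sample is truncated.
SAMPLE = """\
<CLIOutput>
  <Results>
    <ReturnCode>0</ReturnCode>
    <EventCode>23000</EventCode>
    <EventSummary>CLI command completed successfully.</EventSummary>
  </Results>
  <Data>
    <Row>
      <Group>DNS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>GBS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
  </Data>
</CLIOutput>
"""


def parse_cli_output(xml_text):
    """Return the return code and one dict per <Row> element."""
    root = ET.fromstring(xml_text)
    return_code = int(root.findtext("Results/ReturnCode"))
    rows = [
        {child.tag: child.text for child in row}
        for row in root.iterfind("Data/Row")
    ]
    return return_code, rows


if __name__ == "__main__":
    code, rows = parse_cli_output(SAMPLE)
    print("return code:", code)
    for row in rows:
        print(row["Group"], row["Domain"], row["Type"])
```

`ET.fromstring` raises `ParseError` on input that is not well formed, so truncated output like the sample above would need to be complete before this works.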

#148

HtmlCleaner is a parser library that I used in the past to handle malformed HTML; it also provides a limited set of XPath selectors.