I was recently working on a Java project to retrieve all the unique words in the content of a specified HTML page and print them alphabetically, along with their frequency on that page.
My program, instead of using regular expressions, reads the file line by line. Any text that falls between a closing ‘>’ and the next opening ‘<’ is read into a new variable. That variable ends up holding all of the visible words on the page (not the alt text), separated by spaces.
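To make the idea concrete, here is a minimal sketch of that kind of scan. It is not the original program: the class and method names (`TagTextExtractor`, `feedLine`, `extract`) are made up for illustration, and it assumes a tag can run over several lines, so the "inside a tag" flag is carried from one line to the next.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical helper: keeps only the characters that sit outside '<' ... '>'.
class TagTextExtractor {
    private final StringBuilder text = new StringBuilder();
    private boolean insideTag = false; // survives line breaks, since a tag may span lines

    // Feed one line of the file; text outside tags is appended to the buffer.
    void feedLine(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '<') {
                insideTag = true;
                appendSeparator();      // a tag boundary also separates words
            } else if (c == '>') {
                insideTag = false;
                appendSeparator();
            } else if (!insideTag) {
                if (Character.isWhitespace(c)) {
                    appendSeparator();  // collapse runs of whitespace to one space
                } else {
                    text.append(c);
                }
            }
        }
        appendSeparator();              // the line break separates words too
    }

    // Adds a single space unless the buffer is empty or already ends with one.
    private void appendSeparator() {
        int n = text.length();
        if (n > 0 && text.charAt(n - 1) != ' ') {
            text.append(' ');
        }
    }

    // All text seen so far, words separated by single spaces.
    String text() {
        return text.toString().trim();
    }

    // Convenience wrapper: run the scan over a whole reader, line by line.
    static String extract(BufferedReader reader) throws IOException {
        TagTextExtractor extractor = new TagTextExtractor();
        String line;
        while ((line = reader.readLine()) != null) {
            extractor.feedLine(line);
        }
        return extractor.text();
    }
}
```

Calling `TagTextExtractor.extract(new BufferedReader(new FileReader("page.html")))` (the file name is just a placeholder) would return the page text as one space-separated string. A scan this simple also keeps whatever sits inside `<script>` and `<style>` blocks, and a literal ‘>’ inside an attribute value will end the tag early.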
Using this method, the only text really left out is image alt text, meta descriptions, and meta keywords. Three regular expressions, since you love them so much, could get those before or after the fact.
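For what it's worth, those three expressions might look something like the sketch below. These are my guesses at workable patterns, not something taken from the program, and they make simplifying assumptions: attribute values are double-quoted, and in the meta tags `name` comes before `content`. The class name `AttributeTextPatterns` and the `findAll` helper are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical patterns for the text the tag scan skips.
// Assumes double-quoted attribute values and name="..." before content="...".
class AttributeTextPatterns {
    static final Pattern IMG_ALT = Pattern.compile(
            "<img\\b[^>]*\\balt\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static final Pattern META_DESCRIPTION = Pattern.compile(
            "<meta\\b[^>]*\\bname\\s*=\\s*\"description\"[^>]*\\bcontent\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static final Pattern META_KEYWORDS = Pattern.compile(
            "<meta\\b[^>]*\\bname\\s*=\\s*\"keywords\"[^>]*\\bcontent\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the captured attribute value of every match, in document order.
    static List<String> findAll(Pattern pattern, String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }
}
```

Since `[^>]*` also matches line breaks, these want the whole page as one string rather than one line at a time.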
My program then builds a binary search tree from the words found in that HTML file, storing each word along with its frequency. Being a web developer, I have found this a neat tool for evaluating a website’s keywords, and it works quite well. I’m not saying it’s the ‘perfect parser’, but it works with HTML, BROKEN HTML, PHP, ASP, or most any kind of web page out there.
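A rough sketch of such a tree follows. Again, this is an illustration and not the original code: the names (`WordFrequencyTree`, `add`, `report`) are invented, and lowercasing each word before inserting it is my own assumption so that "The" and "the" count as one word. An in-order walk of the tree gives the alphabetical listing for free.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical word-count tree: one node per unique word, ordered alphabetically.
class WordFrequencyTree {
    private static final class Node {
        final String word;
        int count = 1;      // a node is created on the word's first occurrence
        Node left;
        Node right;

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    // Inserts the word, or bumps its count if it is already in the tree.
    void add(String word) {
        String key = word.toLowerCase(); // assumption: counting is case-insensitive
        if (root == null) {
            root = new Node(key);
            return;
        }
        Node current = root;
        while (true) {
            int cmp = key.compareTo(current.word);
            if (cmp == 0) {
                current.count++;
                return;
            }
            if (cmp < 0) {
                if (current.left == null) {
                    current.left = new Node(key);
                    return;
                }
                current = current.left;
            } else {
                if (current.right == null) {
                    current.right = new Node(key);
                    return;
                }
                current = current.right;
            }
        }
    }

    // In-order traversal: returns "word: count" entries in alphabetical order.
    List<String> report() {
        List<String> lines = new ArrayList<>();
        collect(root, lines);
        return lines;
    }

    private static void collect(Node node, List<String> out) {
        if (node == null) {
            return;
        }
        collect(node.left, out);
        out.add(node.word + ": " + node.count);
        collect(node.right, out);
    }
}
```

Feeding it the space-separated words one at a time with `add` and then printing each entry of `report()` produces the alphabetical word list with frequencies. The tree is not balanced, so a page whose words happen to arrive in sorted order would degrade it toward a linked list.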