The only crime committed by most novice developers is not knowing what the alternatives are. This post, on the other hand, does nothing to help promote what those alternatives are. No cherry-picked recommendations for a few common web languages? PHP, RoR, etc.?
Hpricot and Rubyful Soup help me a good bit in RoR.
So what is the preferred method for dealing with XSS (Cross Site Scripting) issues then, particularly if you're using a Rich Text Editor that saves formatting as HTML?
The last time I went for an HTML library to parse some HTML, the HTML was so broken I had to resort to regex.
The regex broke afterwards, after the generated HTML was slightly changed. It was trivially fixed.
So, while I agree that HTML (and, particularly, XML) should be parsed appropriately, YMMV. I get the feeling a lot of people who complain about regex have never bothered to LEARN it as the complex language it is.
What about HTMLTidy? http://tidy.sourceforge.net/ Convert stuff to proper XHTML and then use your XML processing mechanism of choice to parse the data.
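For anyone wondering what the second half of that might look like, here is a minimal sketch in Python. It assumes the tidy step has already run and produced well-formed XHTML; the `xhtml` string below is made-up sample data standing in for that output, not something Tidy actually emitted. One caveat: a plain XML parser can choke on named entities such as `&nbsp;`, so the cleaned output may need those converted to numeric form first.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so queries must be namespace-aware.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for the output of an HTML-to-XHTML cleanup step.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Sample</title></head>"
    '<body><p><a href="/a">A</a> and <a href="/b">B</a></p></body>'
    "</html>"
)

root = ET.fromstring(xhtml)

# Collect (href, link text) pairs for every anchor in the document.
links = [(a.get("href"), a.text) for a in root.iterfind(".//x:a", XHTML_NS)]
print(links)
```

The point is only that once the markup is well-formed, ordinary XML tooling applies and no pattern matching against raw tags is needed.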
This would be a lot more helpful if some specific libs besides the Perl solution were posted. I had a non-trivial time trying to find ready-to-use stable libraries on various platforms (e.g. PHP). Any suggestions?
"The only crime being perpetrated is not knowing what the alternatives are."
Yup - that just about covers the whole darn thing! Who hasn't worked with developers who rolled their own XYZ, when that program is already out there and supported by some other community?
I have a daily job that scrapes a page. Unfortunately, it looks like the author of that page is generating it with Word or something similarly horrible. Tables within tables within tables without sanity (or regularity). The faint whisper of "ia, ia" curdles around the mind when contemplating the source.
A DOM parser would gibber insanely to itself, quietly screaming at the brokenness of the form.
Since I didn't particularly want to do that to a parser, I regexed away and was able to do it. Only minor loss of sanity points…
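For comparison, a middle ground between a strict DOM parser and raw regex is an event-based tokenizer that never builds a tree and so does not care whether tags are balanced. The sketch below uses Python's standard `html.parser`; the `CellText` class, the `cell_texts` helper, and the sample markup are all made up for illustration, and there is no claim it would survive the actual page described above.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of table cells, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = []
        self._in_cell = False

    def _flush(self):
        # Normalize whitespace and store the cell if it held any text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.cells.append(text)
        self._buf = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # A new cell implicitly ends the previous, possibly unclosed, one.
        if tag in ("td", "th"):
            self._flush()
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)


def cell_texts(html):
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.cells


# Hypothetical broken markup: nested tables, missing </td>, </tr>, and </b>.
broken = (
    "<table><tr><td>Name<td><table><tr><td><b>Alice</td></tr></table>"
    "<tr><td>Age<td>30</table>"
)
print(cell_texts(broken))
```

This flattens nested tables into one list of cell texts, which may or may not be what a given scraping job needs; it trades structure for tolerance.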