Regular Expressions: Now You Have Two Problems

  1. DON’T use regular expressions to parse markup (HTML/XML/whatever).
  2. I agree with others who said that regular expressions are not a good way to sanitize HTML.
  3. Sanitisation is an extremely hard problem, which can only really be solved using a proper parser.

You can solve it in regex if you define the solution very, very strictly as we have. It’s really a special case. There are a few regexes I use to accomplish this. See the actual code here:

http://refactormycode.com/codes/333-sanitize-html#refactor_11455

Comment there if you test the code and find it doesn’t work. I think you’ll be pleasantly surprised.

Actual tag balance has to be achieved in another, unrelated routine. Perfectly safe HTML can have unbalanced tags.

As a seasoned Perlmonger, I regularly deal with complicated regexes that do some very tricky stuff. Fortunately, Perl is excellent at providing you with nice syntax to make regexes both readable and scalable.

Here’s how I would have implemented your example:
http://pastebin.com/f467492d4

Note that Perl makes it extremely easy to build a regex from sections, defining each part separately, with full commenting. Much of the body of the regex can be easily factored out into arrays, which are considerably easier to modify!

Perl also provides a natural syntax for including comments within your regexes. Both are valuable techniques for building large but usable regexes.
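For readers who don't use Perl, here is a rough sketch of the same two techniques in Python: the alternatives live in a list, and the pattern itself is laid out with comments via re.VERBOSE. The tag list and the names SIMPLE_TAGS, SIMPLE_TAG_RE and is_simple_tag are made up for illustration; this is not the code from the pastebin link above.

```python
import re

# Hypothetical tag list: each entry is one small, separately editable piece.
SIMPLE_TAGS = ["p", "b", "strong", "i", "em", "pre", "code", "ul", "ol", "li"]

# Join the pieces into one alternation, escaping each entry in case a
# future addition contains a regex metacharacter.
tag_alternation = "|".join(re.escape(tag) for tag in SIMPLE_TAGS)

SIMPLE_TAG_RE = re.compile(
    rf"""
    <                        # opening angle bracket
    /?                       # optional slash for a closing tag
    (?:{tag_alternation})    # any tag name from SIMPLE_TAGS
    >                        # closing angle bracket
    """,
    re.VERBOSE,
)


def is_simple_tag(text: str) -> bool:
    """Return True if text is exactly one attribute-free listed tag."""
    return SIMPLE_TAG_RE.fullmatch(text) is not None
```

Adding or removing a tag then means editing the list, not the pattern.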

Admittedly, your example is probably a bit too simplistic for the slightly verbose treatment I’ve given it. But imagine a more complicated regex…

The way I see it is, if you think you need something like RegexBuddy, you probably need to refactor your regex into easily-understandable (and easily-testable) component parts instead. I can see how it might be useful if you’re trying to reverse-engineer someone’s badly-written opaque regex, or if you’re trying to match a very complicated pattern. But in general I would say if you need it, you’re doing it wrong.

(What were you thinking? Talking about regexes and taking a poke at Perl in the same sentence? You really brought it on yourself! :))

HATE regular expressions.

HATE HATE HATE.

It drove me nuts when I ran across them and couldn’t figure them out, so I spent about a year learning to use them very well. I wrote some moderately complex ones, some simple, and then I just stopped using them.

My problem wasn’t so much understanding what they did as knowing whether they were correct.
It is very easy to write a regex that looks like it should work but misses a few cases.
Just go to regexlib.com and search for currency; you’ll find 30+ different ways to parse or format US currency.
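A small Python sketch of what "looks like it should work" can mean in practice. Both patterns here are my own illustrations, not ones taken from regexlib.com; the only difference between them is whether the dot is escaped.

```python
import re

# Unescaped dot: "." matches ANY character, not just a decimal point.
LOOKS_RIGHT = re.compile(r"^\d+.\d{2}$")
# Escaped dot: matches a literal "." only.
INTENDED = re.compile(r"^\d+\.\d{2}$")

SAMPLES = ["12.34", "12x34", "1234", "12.3"]


def accepted(pattern, values):
    """Return the values that the pattern matches, in their original order."""
    return [value for value in values if pattern.search(value)]


print(accepted(LOOKS_RIGHT, SAMPLES))  # ['12.34', '12x34', '1234']
print(accepted(INTENDED, SAMPLES))     # ['12.34']
```

Run against only the "happy path" input, both patterns pass, which is exactly why the bug is easy to miss.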

How easily can you tell the difference between these two?
^\d*.\d{2}$
^\d+(?:.\d{0,2})?$

What about these two?
^$( )\d(.\d{1,2})?$
([^,0-9]\D*)([0-9]|\d,\d*)$

Or God forbid these two?
^$?-?([1-9]{1}[0-9]{0,2}(,\d{3})(.\d{0,2})?|[1-9]{1}\d{0,}(.\d{0,2})?|0(.\d{0,2})?|(.\d{1,2}))$|^-?$?([1-9]{1}\d{0,2}(,\d{3})(.\d{0,2})?|[1-9]{1}\d{0,}(.\d{0,2})?|0(.\d{0,2})?|(.\d{1,2}))$|^($?([1-9]{1}\d{0,2}(,\d{3})*(.\d{0,2})?|[1-9]{1}\d{0,}(.\d{0,2})?|0(.\d{0,2})?|(.\d{1,2})))$

^$([0]|([1-9]\d{1,2})|([1-9]\d{0,1},\d{3,3})|([1-9]\d{2,2},\d{3,3})|([1-9],\d{3,3},\d{3,3}))([.]\d{1,2})?$|^($([0]|([1-9]\d{1,2})|([1-9]\d{0,1},\d{3,3})|([1-9]\d{2,2},\d{3,3})|([1-9],\d{3,3},\d{3,3}))([.]\d{1,2})?)$|^($)?(-)?([0]|([1-9]\d{0,6}))([.]\d{1,2})?$

I’d much rather bank on writing a ParseCurrency function to parse or format the data using standard string manipulation.
That’s way easier to look at in 3 months or 3 years.
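To make that concrete, here is one possible sketch of such a function in Python, using only plain string operations. The name parse_currency and the exact rules (optional leading minus, optional dollar sign, optional comma grouping, at most two decimal places, at least one digit before the point) are my assumptions about what "US currency" should mean; adjust them to your own requirements.

```python
from decimal import Decimal


def _all_ascii_digits(text: str) -> bool:
    """True if text is non-empty and consists only of the characters 0-9."""
    return text != "" and all(ch in "0123456789" for ch in text)


def parse_currency(text: str) -> Decimal:
    """Parse a US-style amount such as "$1,234.50" or "-$12" into a Decimal.

    Raises ValueError for anything that does not fit the assumed rules.
    """
    s = text.strip()
    negative = s.startswith("-")
    if negative:
        s = s[1:]
    if s.startswith("$"):
        s = s[1:]

    # Split off the cents; if a decimal point is present, require 1 or 2 digits.
    whole, dot, cents = s.partition(".")
    if dot and not (_all_ascii_digits(cents) and len(cents) <= 2):
        raise ValueError(f"bad decimal part in {text!r}")

    # With thousands separators, the first group has 1-3 digits and every
    # later group has exactly 3. Without separators, any digit run is fine.
    groups = whole.split(",")
    if not all(_all_ascii_digits(group) for group in groups):
        raise ValueError(f"bad digits in {text!r}")
    if len(groups) > 1:
        if len(groups[0]) > 3 or any(len(group) != 3 for group in groups[1:]):
            raise ValueError(f"bad thousands grouping in {text!r}")

    value = Decimal("".join(groups) + "." + (cents or "0"))
    return -value if negative else value
```

It is longer than any of the one-liners above, but each rule sits on its own line with its own error message.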

There is nothing that can be done with a regex that can’t be done with a function call. The function call may be 10 more lines than a single regex, but will always be 100 times easier to read and debug.

I feel that it goes along with Code Complete’s Self Documenting Code idea. If your code or regex can’t be understood without several lines of comments or a separate tool to parse it then there must be a better way.

Another good regex tool is Expresso:
http://www.ultrapico.com/Expresso.htm

It has really made some tricky regex easy to understand.

Intelligently adding whitespace helps, because before we read something we subconsciously observe the shape of its layout. This gives us an important clue to the underlying data hierarchy; it provides a means of navigating the text.

Without whitespace, we have to read the text in its entirety before seeing the forest for the trees.

As an aside, that’s also why USING ALL CAPITAL LETTERS makes things more difficult to read – it removes the shape of words, so we don’t get those free visual hints.

@Sean

What’s your language of choice? Whitespace? (http://en.wikipedia.org/wiki/Whitespace_(programming_language) ?)

Just thought I’d mention that if you want to get really good, get a copy of Mastering Regular Expressions by Jeffrey Friedl. Everything mentioned in Mike’s blog posts above is covered pretty exhaustively in the first 3 chapters, and chapters 4-6 will take you well beyond that, into understanding the underlying regex engines and working with them to optimise your regex. That matters if the regex is going to be used over and over again, as would be the case in the above example. Then there are chapters on 4 implementations (Perl, Java, .NET and PCRE as used in PHP).

A relevant example of efficiency optimisation - if the regex engine is aware that all tags start with ‘<’ then it will not even bother to start trying to match except where there is a ‘<’ character. In many cases this optimisation means the regex is never applied, for the cost of a quick indexof() call.

To make it easy for the regex to spot that all matches start with the same character, take the first character out of the alternatives bit. This would give

var whitelist =
@"< (?# opening angle bracket - here so that regex engine can spot it)
( (?# start alternative)
br\s?/? | (?# allow space at end)
/?p |
/?b |
/?strong |
/?i |
/?em |
/?s |
/?strike |
/?blockquote |
/?sub |
/?super |
/?h(1|2|3) | (?# h1,h2,h3)
/?pre |
hr\s?/? | (?# allow space at end)
/?code |
/?ul |
/?ol |
/?li |
/a |
a[^>]+ | (?# allow attribs)
img[^>]+/? (?# allow attribs)
)
> (?# closing angle bracket)
";

(Hope the formatting survives …)

You could go a little further and factor out the ‘/?’ that starts most of the lines. It will be repeatedly tested in the current format, and factoring it out would mean it was only tested once, though you will lose a little readability by doing that. A little benchmarking with two alternatives would let you know how much difference that change would make …
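Here is a sketch of that comparison in Python, in case anyone wants to try the benchmark. It covers only a subset of the tags (no a or img), and the names WHITELIST_REPEATED, WHITELIST_FACTORED, SAMPLE_TAGS and time_pattern are mine. Python's engine is not the .NET one, so any timings it gives are only a rough hint, not a verdict on the original.

```python
import re
import timeit

# Variant 1: "/?" repeated at the start of most alternatives.
WHITELIST_REPEATED = re.compile(r"""
    <                   # opening angle bracket, kept outside the alternation
    (?:
        br\s?/?  |      # allow space at end
        hr\s?/?  |      # allow space at end
        /?p      |
        /?b      |
        /?strong |
        /?i      |
        /?em     |
        /?pre    |
        /?code   |
        /?h[123] |      # h1, h2, h3
        /?ul     |
        /?ol     |
        /?li
    )
    >                   # closing angle bracket
    """, re.VERBOSE)

# Variant 2: the leading "/?" factored out so it is written once.
WHITELIST_FACTORED = re.compile(r"""
    <
    (?:
        br\s?/?  |
        hr\s?/?  |
        /? (?: p | b | strong | i | em | pre | code | h[123] | ul | ol | li )
    )
    >
    """, re.VERBOSE)

SAMPLE_TAGS = ["<b>", "</b>", "<br />", "<hr/>", "<h2>", "</pre>",
               "<script>", "<bx>", "<h4>"]


def time_pattern(pattern, samples=SAMPLE_TAGS, repeats=10_000):
    """Return the seconds taken to run every sample through pattern, repeats times."""
    return timeit.timeit(
        lambda: [pattern.fullmatch(sample) for sample in samples],
        number=repeats,
    )


if __name__ == "__main__":
    print("repeated:", time_pattern(WHITELIST_REPEATED))
    print("factored:", time_pattern(WHITELIST_FACTORED))
```

Before comparing speed, it is worth checking that both variants accept and reject exactly the same inputs, since a refactored regex that is faster but subtly different is no improvement.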

Great post.

Another free regex tool: http://www.gskinner.com/RegExr/
It also has an offline version.

I know Perl is the traditional soft target when it comes to observations about the folly of overusing regular expressions - and based on past atrocities this reputation may have been deserved a few years ago.

But these days well-written Perl (no, kids, that’s not an oxymoron) tends not to rely too heavily on them. I just grabbed some of my code at random. I seem to average about 0 to 5 regular expressions per 1,000 lines of code - although of course it depends what I’m doing.

And Perl’s regular expressions (which are actually not regular expressions in the formal sense - they’re more general than that) are now pretty highly evolved; features like named captures, expanded syntax (which as a previous commenter notes allows patterns to be laid out quite readably) and support for matching recursive syntaxes make them safely and expressively useful - at least in the hands of someone capable of restraint :)

You know Jeff, sometime when we’re on the same continent I’d like to sit down and show you ‘modern’ Perl (again, not an oxymoron). Based on your general approach to problem solving and apparent philosophy of coding I think you might actually like it…

Anyway, enough with the sales pitch. Keep up the good work.

Oh for goodness sake stop slagging off Perl. Perl is like English, a bit hard to learn, but very expressive and very very useful. Also the CPAN module Regexp::Assemble (http://search.cpan.org/~dland/Regexp-Assemble/) is insanely useful, and fast.

@Rob Assembly is easy, I could do that at 15, but I still have trouble understanding many regular expressions.

Even though I don’t fully understand them, they are very cool for stuff like this:
http://wincue.cvs.sourceforge.net/wincue/wincue/src/filename_formats.txt?revision=1.2&view=markup

The file is used for guessing album, artist, track number and track titles from file names. The older version was a hand-written parser which a friend of mine reimplemented with regular expressions, making it much easier to maintain and customize.
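A minimal sketch of how that kind of table-of-patterns approach can look in Python, using named groups. The three filename layouts and the names FILENAME_FORMATS and guess_tags are invented for illustration; they are not taken from the wincue file linked above.

```python
import re

# Hypothetical filename layouts, tried in order from most to least specific.
FILENAME_FORMATS = [
    # "Artist - Album - 03 - Title.mp3"
    re.compile(r"^(?P<artist>.+?) - (?P<album>.+?) - (?P<track>\d{1,2}) - (?P<title>.+)\.\w+$"),
    # "03 - Artist - Title.mp3"
    re.compile(r"^(?P<track>\d{1,2}) - (?P<artist>.+?) - (?P<title>.+)\.\w+$"),
    # "03 Title.mp3"
    re.compile(r"^(?P<track>\d{1,2})[ ._-]+(?P<title>.+)\.\w+$"),
]


def guess_tags(filename: str) -> dict:
    """Return the named groups of the first format that matches, or {} if none does."""
    for pattern in FILENAME_FORMATS:
        match = pattern.match(filename)
        if match:
            return match.groupdict()
    return {}
```

Supporting a new naming scheme means adding one line to the list, which is presumably what made the regex version easier to customize than the hand-written parser.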

I don’t think regular expressions are necessary, unless you’ve got a nightmare of a parsing task ahead of you. It’s just one more syntax to learn, and I sure as hell don’t need that. I’d rather hand-write it. Sure it’s a little more code, but more is less.

If you have .NET 2.0, I would recommend this FREE tool. It also generates a DLL containing the regex once you have developed it.

http://tools.osherove.com/CoolTools/Regulator/tabid/185/Default.aspx

Hey, Regex Buddy is built with Delphi!

I couldn’t agree with you more. When I first met regex I thought either I was too stupid to understand it or the guy that wrote it was a genius.

Once I found the right tool and toyed with it a little, I realized what a powerful weapon it can be.

The tool I use is pretty simple and has no flashy effects, but it works inside Eclipse, and that convenience is why I chose it. http://regex-util.sourceforge.net/update/

Well, I think Jeff is spot on with this regular expression business.

I ran into the same problem a while back, and did the exact same thing, using the same tool and all.

Good to know I’m doing SOMETHING good.

I always wished my college had a course in regexes. I’ve used them a few times, but it’s always been such a pain. I think I just need to make a project that really emphasizes them, so that they get ingrained into me.

If you drench your plate in hot sauce, you’re going to be very, very sorry later.

I beg to differ. I love hot sauce. I put it on almost everything, in the amounts that would kill normal people or at least cause a major permanent injury. I eat raw habaneros, too.

Although I absolutely cherish regular expressions (Viva la PCRE!) as one of the most lethal tools in my batman belt of programming tricks (I’m the regular expression go-to guy in my office), I do completely agree that it is extraordinarily easy to overuse them.

@Jeff: I think you could have done your less regex-savvy readers a slightly better service by noting the alternatives when you warn them not to regex themselves to death. A good follow-up post would point those folks toward their languages’ built-in string manipulation functionality. While regex can, in some scenarios, save you hours of pointless string twiddling, it’s important to note that with great power comes great responsibility. For simple and sometimes even medium-level tasks, smart use of string manipulation will scream past regex performance-wise. Otherwise, great stuff, as usual!
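As a tiny illustration of the kind of task I mean, here is the same check written both ways in Python. The function names and the JPEG example are made up; which version is faster will depend on your language and inputs, so measure before assuming.

```python
import re

JPEG_RE = re.compile(r"\.jpe?g\Z", re.IGNORECASE)


def is_jpeg_regex(filename: str) -> bool:
    """Regex version: the name ends in .jpg or .jpeg, ignoring case."""
    return JPEG_RE.search(filename) is not None


def is_jpeg_plain(filename: str) -> bool:
    """Built-in string version of the same check, no regex involved."""
    return filename.lower().endswith((".jpg", ".jpeg"))
```

The plain version says what it does in ordinary method names, which for a check this simple is arguably the bigger win than any speed difference.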