January 2006
Nice post.
I’ve always been highly suspicious of regular expressions and hence have never used them. When I considered what it would take to write the exact same code to perform the required parsing it took my breath away. Surely all of this complex parsing logic must come at a pretty hefty price. Apparently it does…
January 2006
I’d be hesitant to toss out regular expressions because of a few catastrophic examples. They are quite rare in the wild. For example, the “(x+x+)+y” is egregiously contrived and could be replaced with “xx+y”.
So yes, it’s important to watch out for those rare exponential cases, but let’s not toss out the baby with the bathwater and rubber ducky.
January 2006
Nice post.
Nitpicking: This is not n-squared behavior. It is exponential – 2-to-the-nth. Every number in the series is almost exactly double its predecessor.
And an important note: For regexes to be really useful for parsing, we need to match groups; and there lies the crux of the problem. If you remove the requirement to match groups, it is not very hard to write a non-backtracking regex engine, in which the complexity of parsing can be exponential in the pattern size, but it is linear in the matched text size.
January 2006
Surely all of this complex parsing logic must come at a pretty hefty price.
Well, no, I wouldn’t say that.
This post is about specific backtracking scenarios that can cause exponential behavior. It’s not very common. But if it happens the results can be quite painful.
Also the hotmail devs are writing code that probably gets 10,000 times more pressure under use than anything we’ll ever write. So keep that in mind, too.
January 2006
Surely all of this complex parsing logic must come at a pretty hefty price.
Yup. That’s why I stopped using the for statement. I mean, I could do something like this:
for(int i = 0; i 1; i = i + 0)
That would just kill my performance.
You’ve never ever used regexes before? Yikes. You do know, at the very least, there are lots of places where you can get pre-formed regexes that work just fine? (Hint: use Google 
January 2006
The trouble is that people regard regexes as some kind of magic instead of simply a kind of programming language that provides text matching features.
I’m with foobar on this. If a novice programmer codes up for(;;); and it spins for ever, no-one would blame the programming language.
Programming does have to be learned. Regexes are no exception.
January 2006
I do not know how works the Regex stuff. Does it generate some code JIT compiled, the same way as Serializer ? Anyone remembers Lex and Yacc tools (code generator based on lexical and grammatical rules) that belongs to UX word ? I think that code generation may help to get rid with tricky performances issues. What is your opinion about it Jeff ?
January 2006
And then the requisite bad use of regex when a couple small string operations would have been much smarter.
Here’s an example of how not to validate whether or not the first 2 chars in a string are alpha’s:
http://lamecode.com/?p=152
June 2006
For a great treatment in regex in general, and for avoiding catastrophic backtracking in particular, I highly recommend Jeffrey Friedl’s book Mastering Regular Expressions (2nd Ed.): http://www.oreilly.com/catalog/regex2/
Some of this is covered on http://www.regular-expressions.info/atomic.html , but the book goes into more topics, more depth, and more analysis.
The two biggest innovations that made dealing with these problems so far have been non-greedy quantifiers, and to a greater extent, possesive quantifiers. These innovations, like most things in the modern regex world, came out of perl*. They have also influenced the planning for perl 6’s regex engine. I’m not holding my breath on perl 6’s release, but the plans make interesting reading (if you use regex a lot):
http://dev.perl.org/perl6/doc/design/syn/S05.html
http://dev.perl.org/perl6/doc/design/apo/A05.html
http://dev.perl.org/perl6/doc/design/exe/E05.html
*: Friedl’s book actually had some influence here as well. He pointed out that if you could tell the regex engine “backtracking is useless here”, you could get rid of a lot of pathological situations. Hence, possesive quantifiers. (Friedl’s examples are in perl-ese, but different regex engines (and types of engines) are covered in the book.)
July 2006
Jeff Atwood: “Looks like n-squared behavior to me.”
matt: “For example, the ‘(x+x+)+y’ is egregiously contrived and could be replaced with ‘xx+y’.”
Dare I propose that the Programming Crisis bloggers are always worried about might be due simply to the fact that blogger-programmers couldn’t pass Algorithms 101 with a laxative?
“2, 4, 8, 16, 32, 64,…” is 2^n, not n^2; and “(x+x+)+y” is equivalent to “(xx)+y”, not “xx+y”.
July 2006
Dare I propose that the Programming Crisis bloggers are always worried about might be due simply to the fact that blogger-programmers couldn’t pass Algorithms 101 with a laxative? “2, 4, 8, 16, 32, 64,…” is 2^n, not n^2; and “(x+x+)+y” is equivalent to “(xx)+y”, not “xx+y”.
Actually, “(x+x+)+y” matches two or more x’s, as does “xx+y”. “(xx)+y” matches an even number of x’s. Good luck with those laxatives Anonymous Coward.
January 2008
just a rectification, because you’re being so anal it hurts:
(x+x+)+y IS a sintact equivalent in matching xx+y
for a semantic equivalent version, you could use [xx]+y
and if you need grouping, just go for (x+)y
then, someone could tell me please WHY everyone need to fix an example?
is a TRIVIAL example for this behaviour, for the twelve gods!
did you have a prograsm in fixing this???
January 2008
To the Anonymous Coward from 2008-01-03 who criticized people for correcting examples while doing so himself, (x+x+)+y is not equivalent to [xx]+y (no need to repeat characters in character classes) or x+y (with or without the parens). Rather, it is equivalent to xx+y (as you noted correctly) or x{2,}y
The pertinent resources for information on how to avoid pathological regex matching cases with a backtracking engine have already been mentioned (JFriedl’s book and regular-expressions.info are among the most thorough yet accessible guides). A regex expert should not really have to worry about these cases though, because they are generally quite obvious if one truly understands how backtracking works.
@Jacob J, possessive quantifiers were neither invented by Perl (v5.8 doesn’t even support them) nor JFriedl (who recommends in his book that regex engine implementers optimize by automatically possessifying quantifiers when this can be determined to not change potential matches).
A recent, related regex innovation is backtracking control verbs included in Perl 5.10 and recent versions of PCRE, which afford greater control over backtracking than ever.
April 2014
The essential part of the customer example he gives is converting a dot (any character) match to a [^,rn]
(not comma, not carriage return, not newline) match. In other words – be as specific as possible!
Are you sure you this RegEx won’t match CR and LF? I always thought you had to use \r
and \n
for that, like this: [^,\r\n]
.