Regex Performance

I was intrigued by a recent comment from a Microsoft Hotmail developer on the ptifalls they've run into while upgrading Hotmail to .NET 2.0:


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2006/01/regex-performance.html

Nice post.

I’ve always been highly suspicious of regular expressions and hence have never used them. When I considered what it would take to write the exact same code to perform the required parsing it took my breath away. Surely all of this complex parsing logic must come at a pretty hefty price. Apparently it does…

I’d be hesitant to toss out regular expressions because of a few catastrophic examples. They are quite rare in the wild. For example, the “(x+x+)+y” is egregiously contrived and could be replaced with “xx+y”.

So yes, it’s important to watch out for those rare exponential cases, but let’s not toss out the baby with the bathwater and rubber ducky.

Nice post.

Nitpicking: This is not n-squared behavior. It is exponential – 2-to-the-nth. Every number in the series is almost exactly double its predecessor.

And an important note: For regexes to be really useful for parsing, we need to match groups; and there lies the crux of the problem. If you remove the requirement to match groups, it is not very hard to write a non-backtracking regex engine, in which the complexity of parsing can be exponential in the pattern size, but it is linear in the matched text size.

Surely all of this complex parsing logic must come at a pretty hefty price.

Well, no, I wouldn’t say that.

This post is about specific backtracking scenarios that can cause exponential behavior. It’s not very common. But if it happens the results can be quite painful.

Also the hotmail devs are writing code that probably gets 10,000 times more pressure under use than anything we’ll ever write. So keep that in mind, too.

Surely all of this complex parsing logic must come at a pretty hefty price.

Yup. That’s why I stopped using the for statement. I mean, I could do something like this:

for(int i = 0; i 1; i = i + 0)

That would just kill my performance.

You’ve never ever used regexes before? Yikes. You do know, at the very least, there are lots of places where you can get pre-formed regexes that work just fine? (Hint: use Google :slight_smile:

The trouble is that people regard regexes as some kind of magic instead of simply a kind of programming language that provides text matching features.

I’m with foobar on this. If a novice programmer codes up for(;;); and it spins for ever, no-one would blame the programming language.

Programming does have to be learned. Regexes are no exception.

I do not know how works the Regex stuff. Does it generate some code JIT compiled, the same way as Serializer ? Anyone remembers Lex and Yacc tools (code generator based on lexical and grammatical rules) that belongs to UX word ? I think that code generation may help to get rid with tricky performances issues. What is your opinion about it Jeff ?

And then the requisite bad use of regex when a couple small string operations would have been much smarter.

Here’s an example of how not to validate whether or not the first 2 chars in a string are alpha’s:

http://lamecode.com/?p=152

For a great treatment in regex in general, and for avoiding catastrophic backtracking in particular, I highly recommend Jeffrey Friedl’s book Mastering Regular Expressions (2nd Ed.): http://www.oreilly.com/catalog/regex2/

Some of this is covered on http://www.regular-expressions.info/atomic.html , but the book goes into more topics, more depth, and more analysis.

The two biggest innovations that made dealing with these problems so far have been non-greedy quantifiers, and to a greater extent, possesive quantifiers. These innovations, like most things in the modern regex world, came out of perl*. They have also influenced the planning for perl 6’s regex engine. I’m not holding my breath on perl 6’s release, but the plans make interesting reading (if you use regex a lot):

http://dev.perl.org/perl6/doc/design/syn/S05.html
http://dev.perl.org/perl6/doc/design/apo/A05.html
http://dev.perl.org/perl6/doc/design/exe/E05.html

*: Friedl’s book actually had some influence here as well. He pointed out that if you could tell the regex engine “backtracking is useless here”, you could get rid of a lot of pathological situations. Hence, possesive quantifiers. (Friedl’s examples are in perl-ese, but different regex engines (and types of engines) are covered in the book.)

Jeff Atwood: “Looks like n-squared behavior to me.”

matt: “For example, the ‘(x+x+)+y’ is egregiously contrived and could be replaced with ‘xx+y’.”

Dare I propose that the Programming Crisis bloggers are always worried about might be due simply to the fact that blogger-programmers couldn’t pass Algorithms 101 with a laxative?
“2, 4, 8, 16, 32, 64,…” is 2^n, not n^2; and “(x+x+)+y” is equivalent to “(xx)+y”, not “xx+y”.

Dare I propose that the Programming Crisis bloggers are always worried about might be due simply to the fact that blogger-programmers couldn’t pass Algorithms 101 with a laxative? “2, 4, 8, 16, 32, 64,…” is 2^n, not n^2; and “(x+x+)+y” is equivalent to “(xx)+y”, not “xx+y”.

Actually, “(x+x+)+y” matches two or more x’s, as does “xx+y”. “(xx)+y” matches an even number of x’s. Good luck with those laxatives Anonymous Coward.

just a rectification, because you’re being so anal it hurts:

(x+x+)+y IS a sintact equivalent in matching xx+y
for a semantic equivalent version, you could use [xx]+y
and if you need grouping, just go for (x+)y

then, someone could tell me please WHY everyone need to fix an example?

is a TRIVIAL example for this behaviour, for the twelve gods!

did you have a prograsm in fixing this???

To the Anonymous Coward from 2008-01-03 who criticized people for correcting examples while doing so himself, (x+x+)+y is not equivalent to [xx]+y (no need to repeat characters in character classes) or x+y (with or without the parens). Rather, it is equivalent to xx+y (as you noted correctly) or x{2,}y

The pertinent resources for information on how to avoid pathological regex matching cases with a backtracking engine have already been mentioned (JFriedl’s book and regular-expressions.info are among the most thorough yet accessible guides). A regex expert should not really have to worry about these cases though, because they are generally quite obvious if one truly understands how backtracking works.

@Jacob J, possessive quantifiers were neither invented by Perl (v5.8 doesn’t even support them) nor JFriedl (who recommends in his book that regex engine implementers optimize by automatically possessifying quantifiers when this can be determined to not change potential matches).

A recent, related regex innovation is backtracking control verbs included in Perl 5.10 and recent versions of PCRE, which afford greater control over backtracking than ever.

The essential part of the customer example he gives is converting a dot (any character) match to a [^,rn] (not comma, not carriage return, not newline) match. In other words – be as specific as possible!

Are you sure you this RegEx won’t match CR and LF? I always thought you had to use \r and \n for that, like this: [^,\r\n].

It’s fix.

The MatchTimeout property defines the approximate maximum time interval for a Regex instance to execute a single matching operation before the operation times out. The regular expression engine throws a RegexMatchTimeoutException exception during its next timing check after the time-out interval has elapsed. This prevents the regular expression engine from processing input strings that require excessive backtracking. For more information, see Backtracking in Regular Expressions and Best Practices for Regular Expressions in the .NET Framework.

1 Like