Regular expressions are a very powerful tool that every developer should know, but you can fall into deep, subtle pits of despair if you don’t know EXACTLY what you are doing.
The most important things I discovered one month ago are:
[1] NOT ALL REGEXP ENGINES USE THE SAME SYNTAX AND/OR MATCHING ALGORITHM
[2] SOMETIMES, REGEXP ENGINES CHEAT!
For [1], just check the regular expressions section of the XML Schema specification at W3C (http://www.w3.org/TR/xmlschema11-2/#regexs). They decided that, since most people would want a full match on a RegExp, a schema processor would automatically anchor every expression at both ends (WORST. IDEA. EVER).
So, if you decide (like I did, fool me, fool me) to define a RegExp in a Schema for validation, and then reuse the same RegExp in another part of your application, you will have lots of trouble.
Basically, in XSD you get a Perl-like RegExp syntax, but without ^ and $ (which will be treated as NORMAL CHARACTERS) and without \A and \Z (which will BREAK your RegExp), and you get an automatic anchor at both ends instead…
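To make the anchoring problem concrete, here is a minimal sketch in Python. It only imitates the implicit anchoring (using re.fullmatch), not the real XSD syntax, and the pattern and the helper names xsd_style_match and unanchored_match are made up for illustration:

```python
import re

# Hypothetical pattern shared between a schema and application code.
PATTERN = r"[0-9]{3}"


def xsd_style_match(pattern: str, text: str) -> bool:
    """Imitate XSD's implicit anchoring: the WHOLE string must match."""
    return re.fullmatch(pattern, text) is not None


def unanchored_match(pattern: str, text: str) -> bool:
    """Typical application-side usage: a match anywhere in the string counts."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    for text in ("123", "abc123def"):
        print(text, xsd_style_match(PATTERN, text), unanchored_match(PATTERN, text))
```

The same pattern accepts "abc123def" when searched and rejects it when fully matched, which is exactly the kind of silent disagreement you get when one RegExp is reused in both places.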
For [2], some engines (e.g. the .NET Regex engine) cheat on some expressions, to make things work almost every time. Basically, I had two expressions that should have returned different matches (according to Perl syntax), but they returned the same matches (in .NET Match). I’m sorry I can’t remember the exact expressions right now, but I remember shouting the loudest WTF ever when I checked this… and I will not even start on the differences between the .NET engine and the various Java engines.
So, I would add this advice to the list in this post:
- Always check (double-triple-check) your expressions IN THE ENVIRONMENT where they will be executed (or with the right options in your tool of choice).
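One way to follow this advice is to keep a small table of inputs and expected results and run it through the engine you will actually ship with. A rough sketch in Python, where the cases and the check_cases helper are hypothetical and the matcher argument stands in for whatever your target environment really does:

```python
import re

# Hypothetical cases: (pattern, input, expected result of a whole-string match).
CASES = [
    (r"[a-z]+", "hello", True),
    (r"[a-z]+", "hello world", False),
    (r"\d{4}-\d{2}", "2024-05", True),
]


def check_cases(cases, matcher=re.fullmatch):
    """Run every case through `matcher` and return the ones that disagree."""
    failures = []
    for pattern, text, expected in cases:
        actual = matcher(pattern, text) is not None
        if actual != expected:
            failures.append((pattern, text, expected, actual))
    return failures


if __name__ == "__main__":
    print("fullmatch failures:", check_cases(CASES))
    print("search failures:", check_cases(CASES, matcher=re.search))
```

Swapping the matcher (here re.search instead of re.fullmatch) shows which cases change behaviour when the matching mode differs. For a different engine you would port the same table to that engine's own test runner instead of trusting results obtained elsewhere.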