Regular Expressions: Now You Have Two Problems

Regular Expressions are a very powerful tool that all developers should know, but sometimes you can fall into deep subtle pits of despair if you don’t know PERFECTLY what you are doing.
The most important things I discovered one month ago are:
[1] NOT ALL REGEXPR ENGINES USE THE SAME SYNTAX AND/OR MATCHING ALGORITHM
[2] SOMETIMES, REGEXPR ENGINES CHEAT!

For [1], just check the RegExp section in Xml Schema Specification at W3C (http://www.w3.org/TR/xmlschema11-2/#regexs). They decided that, since most people would want a full match on a RegExp, their parser would automatically anchor it (WORST. IDEA. EVER).
So, if you decide (like I did, fool me, fool me) to define a RegExp in a Schema for validation, and then use it in another part of your application as well, you will have lots of trouble.
Basically, in XSD you get a largely Perl-like RegExp syntax, without ^ $ (which will be treated as NORMAL CHARACTERS) and \A \Z (which will BREAK your RegExp), and you will get an automatic anchor instead…
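
To make the anchoring difference concrete, here is a small Python sketch. It is not an XSD engine: `re.fullmatch` only approximates the implicit anchoring described above, and `re.search` stands in for an unanchored Perl-style match. The pattern, the sample strings, and the function names are made up for illustration.

```python
import re

# Hypothetical pattern: exactly three digits.
PATTERN = r"\d{3}"


def xsd_style_match(pattern: str, text: str) -> bool:
    """Approximate XSD's implicit anchoring: the whole string must match."""
    return re.fullmatch(pattern, text) is not None


def perl_style_match(pattern: str, text: str) -> bool:
    """Unanchored search: the pattern may match anywhere in the string."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    for sample in ("123", "abc123def"):
        print(sample, xsd_style_match(PATTERN, sample), perl_style_match(PATTERN, sample))
```

The same pattern accepts "abc123def" in one mode and rejects it in the other, which is the kind of surprise you get when a pattern is shared between a schema and application code.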

For [2], some engines (e.g. the .NET Regex engine) cheat on some expressions, to make things work almost any time. Basically, I had 2 expressions that should have returned different matches (by Perl syntax), but they returned the same matches (in .NET Match). I’m sorry I can’t remember the exact expressions right now, but I remember shouting the loudest WTF ever when I checked this… and I will not tell you about the differences between the .NET parser and the various Java parsers :slight_smile:

So, I would add this advice to the list of this post:

  • Always check (double-triple-check) your Expressions IN THE ENVIRONMENT they will be executed (or with the right options in your tool of choice).

I hate regexes. Don’t get me wrong on this, I have used them very often, I know how to write very, very advanced regexes. I love the idea behind them… but I hate their syntax. The syntax simply sucks. Also there are about 20 different flavors and each time I use them in a different language or with a different tool, there are different pitfalls I have to fix.

I would like them a lot more if they had a better, cleaner syntax and if there was one, and exactly one standard that defines them and all tools and languages either stick to this standard or should not offer them in the first place.

Also note that regexes are nothing more than shortcuts. Whatever you do in a regex could be done with /normal/ code as well, you would just need a lot more code to do it. E.g. every regex can be written as a simple state machine parser. Sometimes such a parser can be much more powerful, easier to extend and … this is an important thing to consider … it can also be much faster. Depending on how good the regex compiler is, it might create better or much worse code.

E.g. your sanitize regex is nothing more than (pseudo code):

for (i = 0; i < lengthOfString(string); i++) {
    if (charAt(string, i) == '<') {
        // Handle HTML tag
    }
}

Inside the if, you skip the first character if it is /, then you can if-elsif-else about all known tags, check if they allow whitespace or other characters up to the > character and then loop till you see the > character. It might be harder to read at first, but that way you can see how the string is really processed, what is going on where, you can optimize the process (e.g. instead of if’s you can switch, place the tags in a sorted table, look it up with binary search, have a number assigned to every tag, jump to the right code with switch-case).
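
A runnable version of that loop might look like the Python sketch below. It only collects tag names rather than sanitizing anything, and the function name `find_tags` and its handling of unterminated tags are assumptions made for illustration, not part of the original pseudo code.

```python
def find_tags(text: str) -> list:
    """Return the tag names found by a hand-written scan, no regex involved."""
    tags = []
    i = 0
    while i < len(text):
        if text[i] == "<":
            end = text.find(">", i + 1)
            if end == -1:
                break  # unterminated tag: stop scanning
            body = text[i + 1:end]
            if body.startswith("/"):
                body = body[1:]  # skip the slash of a closing tag
            # The tag name runs up to the first whitespace, if any.
            parts = body.split(None, 1)
            if parts:
                tags.append(parts[0].lower())
            i = end + 1  # continue after the closing '>'
        else:
            i += 1
    return tags


if __name__ == "__main__":
    print(find_tags("<b>hi</b> <a href='x'>y</a>"))
```

Every step of the scan is visible here, which is the point being argued, but it is already noticeably longer than the regex it would replace.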

Using regexes is like SQL. You only say what you want the computer to do and the computer magically presents the result. You have zero influence on how it gets there, how fast it gets there, etc.

Mecki, you have no idea how many times I’ve refactored out for-loops with long, tortuous string processing inside them in favour of a simple regular expression.

You are right though - sometimes the compilation of the regex can make it slower than just coding - but in my experience, that’s a pretty rare state, and normally it’s 'cos I’ve crafted a very slow regex.

I would strongly recommend Friedl’s book ‘Mastering Regular Expressions’.

I do agree with the other posters who mention using already existing solutions where possible, for things like HTML validation.

RegExes are like XML: good when used appropriately, bad when used inappropriately …

Don’t use in place of a parser … it is not a full parser and should not be used as such

Don’t use to do simple string manipulation, your language should have better/faster tools to do this

Do use to do simple pattern matching, it’s what it was designed for

XML great for storing/sending structured data between programs, terrible to read, terrible to write, use an interface!

They should absolutely be a key part of every modern coder’s toolkit.

If you avoid web and data chores, then find and replace is just as good. I’ve never really found much use for regexes when coding my win32 c++ apps… every now and again I will want to change something a bit complicated… but it’s pretty rare that find/replace won’t cut it. The last example I can think of was changing a function pointer type… I had to regex the various static function definitions… but only because I had used inconsistent variable naming. :slight_smile:

Regexes are great for manipulating data rather than code imo, and HTML is more typical of data than code…

Allowing parts of HTML that are safe and blocking unsafe parts is a classic problem. Best to block it all if at all possible, by just escaping every special char. It’s a lot easier, and is extremely secure too. :slight_smile:
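
In Python, for instance, the "escape everything" approach is a one-liner around the standard library's `html.escape`. This is a minimal sketch; the wrapper name `escape_all` is made up for illustration.

```python
import html


def escape_all(user_input: str) -> str:
    """Escape &, <, > and both quote characters so no markup survives."""
    return html.escape(user_input, quote=True)


if __name__ == "__main__":
    print(escape_all('<script>alert("x")</script>'))
```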

Actually RegExps come from pure computer science.

Every Deterministic Finite State Machine is equivalent
to a RegExp. (I know them under the acronym DFA, which
stands for Deterministic Finite Automaton.)
Actually they are even equivalent to the seemingly more powerful Nondeterministic
Finite State Machines.
(But then DFAs and NDFAs are actually the same thing -
don’t bother, that is real computer science …
nondeterministic means that there are nasty operators
that are allowed not to eat any input - RegExps have
these nasty operations in the form of the asterisk, the question
mark, …)

So while using RegExps for String Matching Operations is a valid
use case, there is much more to RegExps than this.
You can build any DFA/NDFA using a RegExp. Any!
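
As a tiny illustration of that equivalence, here is a hand-built DFA in Python for the regular expression (ab)*, checked against Python's own `re` engine. The state names and the choice of example are assumptions made for this sketch; it shows one instance of the correspondence, not a general construction.

```python
import re

# Hand-built DFA for the regex (ab)* over the alphabet {a, b}.
# Missing entries in the table act as an implicit dead state.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "start",
}
ACCEPTING = {"start"}


def dfa_accepts(text: str) -> bool:
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False  # no transition defined: reject
    return state in ACCEPTING


if __name__ == "__main__":
    for sample in ("", "ab", "abab", "a", "ba", "abb"):
        regex_says = re.fullmatch(r"(ab)*", sample) is not None
        print(repr(sample), dfa_accepts(sample), regex_says)
```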

Regexes are fine in code and all that, but for the past few years, I’ve been using a regex tool I built to make writing the code itself easier. ( http://www.hova.org/regexhelper )

One of the key features (probably the only one) is the idea of a match mode. Given a large input, most of which is garbage, the regular expression match/replace is run on only the matches. The rest is discarded. What does this allow you to do? Well, it lets you perform manipulation of things you’re interested in, while ignoring the rest.

A good example is using this new method to strip all the element ID’s from a large HTML document so that you can then convert it to server-side code.
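
The idea can be approximated in a few lines of Python. This is only a sketch of the concept as described above, not the actual implementation of the tool; the function name `match_mode` and the sample pattern are hypothetical.

```python
import re


def match_mode(pattern: str, template: str, text: str) -> list:
    """Apply the replacement template to each match and discard everything else."""
    return [m.expand(template) for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    page = '<div id="header"><p id="intro">x</p>'
    # Keep only the element IDs; the rest of the document is thrown away.
    print(match_mode(r'id="(\w+)"', r"\1", page))
```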

If you are a python developer two regex tools that are quite useful:

http://www.pythonregex.com (online)

http://kodos.sourceforge.net/ (offline)

Disclaimer, I wrote the pythonregex website while playing around with the Google App Engine.

There are so many people complaining about how RegEx is too complicated, etc. But the point of this was to keep them short, to a focused point.

Yeah, you can write functions to validate or clean your input, but really, is writing an expression like [^a-z0-9\s] really that hard to use to clean up some input? (yes, there is a shorter version than that, but just as an example)

The actual function could effectively do the same thing, but even the shortest and simplest Regular Expressions can save a lot of development time.
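
For example, in Python the expression above turns into a two-line cleaner. A minimal sketch: the names `DISALLOWED` and `clean` are made up, and note that the pattern as written also strips uppercase letters.

```python
import re

# Anything that is not a lowercase letter, a digit, or whitespace.
DISALLOWED = re.compile(r"[^a-z0-9\s]")


def clean(text: str) -> str:
    """Remove every character the pattern does not allow."""
    return DISALLOWED.sub("", text)


if __name__ == "__main__":
    print(clean("hello, world! 42"))
```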

To me it’s perfectly readable.

Ha ha ha ha ha

I once had to write a WML sanitizer- It essentially took user-created HTML (the whole thing was to create a mobile view of pages entered in my employer’s proprietary CMS) and checked the entries for what would amount to invalid WML, and fixed it. I can honestly tell you that at least as far as *ml sanitizing, regexes aren’t good enough by themselves, but used in conjunction with parsing, they’re much easier than parsing alone.

All this complaining about Orange… you ought to put the link to that post next to the Orange captcha so people know and will quit littering the comments about the broken captcha.

You may find it a little odd that a hack who grew up using a language with the ain’t keyword would fall so head over heels in love with something as obtuse and arcane as regular expressions.

Ummm, well, no. That seems totally consistent :slight_smile:

They should absolutely be a key part of every modern coder’s toolkit.

Well, ish. You should know what they are, and when to use them. Whether you ever need to is a completely different thing. It’s so long since I used one I can’t remember how long it’s been. So calling them a key part of my toolkit, well, I dunno.

@Alex

regexes aren’t good enough by themselves, but used in conjunction with parsing, they’re much easier than parsing alone.

Lots of (most?) parsers use regexes somewhere, often for extracting tokens (eg numbers, identifiers, etc)
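
A typical regex-based tokenizer looks roughly like the Python sketch below. The token classes (NUMBER, IDENT, OP) and the tiny grammar are assumptions chosen for illustration, not taken from any particular parser.

```python
import re

# One alternative per token class; whitespace is matched but skipped.
TOKEN_RE = re.compile(
    r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[-+*/=()])
  | (?P<SKIP>\s+)
    """,
    re.VERBOSE,
)


def tokenize(source: str) -> list:
    """Split source into (kind, text) pairs, raising on unknown characters."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    print(tokenize("x = 3.14 * (y + 2)"))
```

The regex handles the flat, lexical part of the job; the nesting of the parentheses is left to whatever parser consumes the tokens.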

regextester.com is the only RegEx tool I’ve ever bothered to use.

It’s free, it’s online, it’s good.

@kenneth, @GoA:

Do you guys mind revealing this connection you see between XML and Regular Expressions?
XML, a data description format, and Regular Expressions, a pattern recognition engine … I fail to see the connection that you two see.

Does anyone else see why it is odd that one should like both or none at all?

Just a tip to the people who somehow think writing their own loops to manipulate a string is faster than a PROPER regex … in 99% of the cases writing your own will not give any performance benefits and will probably be slower. Sloppy regular expressions can be slow, but if you make sure your regex is short and precise they are usually very fast.

In Perl, and probably in other languages, all static regular expressions in a program are compiled once, making them very fast, especially in cases where you need to use the same regex to match against multiple strings. The occurrences where regular expressions can really cause a hit are when they are built dynamically, like as part of an eval statement, or contain a variable, which there is no way to compile ahead of time.
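
The same distinction can be sketched in Python, where the compilation step is explicit via `re.compile`. The function names and patterns here are made up for illustration; Python's `re` module also keeps an internal cache of recently compiled patterns, so the cost of the dynamic case is often smaller in practice than this sketch suggests.

```python
import re

# Static pattern: compiled once at import time and reused for every call.
WORD_RE = re.compile(r"\b\w+\b")


def count_words(lines) -> int:
    """Count words across many strings with the precompiled pattern."""
    return sum(len(WORD_RE.findall(line)) for line in lines)


def count_occurrences(lines, needle: str) -> int:
    """Count literal occurrences of needle across many strings."""
    # Dynamic pattern: it depends on a runtime value, so it cannot be
    # built ahead of time. re.escape keeps the needle literal.
    pattern = re.compile(re.escape(needle))
    return sum(len(pattern.findall(line)) for line in lines)


if __name__ == "__main__":
    print(count_words(["a b", "c"]))
    print(count_occurrences(["a.b a.b", "axb"], "a.b"))
```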

Arguing that you shouldn’t use them because you don’t know them is no argument at all. Everyone starts off not knowing any programming languages, yet they learn them because it is a good idea to know the most effective tools for the job.

Before you complain how hard regular expressions are to deal with, get the right tools!

Like most such programming needs, emacs ships with a built-in mode for this. :slight_smile:

See http://www.emacswiki.org/cgi-bin/wiki/ReBuilder for more info

I posted this on refactormycode but figured I’d post it here too.

<SCRIPT SRC=http://myscript/xss.js?<B> is vulnerable to XSS.

As I posted on the refactor site (under a different name, oops):

HTMLEncode your string and then replace the entities. Doing it the way you are trying to do it is REALLY, REALLY hard. And as others have said, you have to know about every browser quirk. I can’t emphasize enough how dangerous this is.
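
A rough Python sketch of the encode-then-restore idea follows. The allowlist, the function name `sanitize`, and the decision to restore only exact lowercase, attribute-free tags are all assumptions made for illustration; a real sanitizer would need more care than this.

```python
import html

# Hypothetical allowlist of harmless, attribute-free tags.
ALLOWED_TAGS = ("b", "i", "em", "strong")


def sanitize(user_input: str) -> str:
    """Escape everything, then turn back only the exact allowed tags."""
    escaped = html.escape(user_input, quote=True)
    for tag in ALLOWED_TAGS:
        escaped = escaped.replace(f"&lt;{tag}&gt;", f"<{tag}>")
        escaped = escaped.replace(f"&lt;/{tag}&gt;", f"</{tag}>")
    return escaped


if __name__ == "__main__":
    print(sanitize("<b>bold</b> <script>alert(1)</script>"))
```

Because everything starts out escaped, anything the allowlist does not recognize stays inert text instead of slipping through as markup.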

Oh, and REs rule! :slight_smile:

Count me in as another one who uses Expresso http://www.ultrapico.com/Expresso.htm.

Also, the site www.regular-expressions.info is very helpful as a tutorial. It’s from the author of RegexBuddy.