Regular Expressions: Now You Have Two Problems

codinghorror · June 27, 2008, 12:00am

I love regular expressions. No, I'm not sure you understand: I really love regular expressions.

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

Kenneth · June 27, 2008, 12:00am

Rail against XML yet embrace Regex!?! Me no understand so good.

Of course, I will admit to using regular expressions very rarely, whereas I utilize XML rather extensively.

RyanB · June 27, 2008, 12:00am

In the same vein as RegexBuddy, IntelliJ has a fantastic regex plugin:

http://www.twoqubed.com/blog/

RyanB · June 27, 2008, 12:00am

Oops, wrong link:

http://plugins.intellij.net/plugin/?id=19

RichardD · June 27, 2008, 12:00am

@Randy - there’s an old trial version (2.04) of RegexBuddy floating around various shareware sites. I couldn’t recommend it more - it really is a fantastic tool. Even their website is valuable as a reference for writing regular expressions.

Tim_B · June 27, 2008, 12:00am

A coworker and I cracked up laughing reading this post. We have someone on our team who overuses regular expressions and hot sauce.

Rev_Matt · June 27, 2008, 12:00am

Brilliant, well said. I struggle with RE but I do value them. JWZ is eminently quotable, but he is usually quoted wildly out of context. Another of his great lines is "Linux is only free if your time has no value," often used by MS zealots to bash Linux zealots.

Kris · June 27, 2008, 12:00am

First, a great cheat sheet for writing these things…

http://www.ilovejackdaniels.com/cheat-sheets/regular-expressions-cheat-sheet/

I find them incredibly useful. Mostly I use them for validating input… basic stuff like making sure the user entered a valid email address or phone number or date. Maybe I haven’t used them enough, but I feel like bashing my head on the desk whenever I’m trying to write them.

I had a class in college where we were given regular expressions and we had to walk through them with specific terms too. That wasn’t much better.

Bossy_Joe · June 27, 2008, 12:00am

Eclipse users can get QuickREx for free. It’s fantastic.

http://www.bastian-bergerhoff.com/eclipse/features/web/QuickREx/toc.html

HB153 · June 27, 2008, 12:00am

There is one called Expresso that is pretty slick. It’s free as far as I’m aware (just a nag screen).

Rich · June 27, 2008, 12:00am

I agree with others who said that regular expressions are not a good way to sanitize HTML. Somewhat appropriately, IIRC, the JWZ quote was about someone using RegEx for parsing HTML. JWZ wasn’t complaining in general about overuse of RegEx’es (though he has), but the context of that quote I believe was specifically for using them in HTML where they’re not the right tool. He suggested more of a true parser for HTML. I’m sure JavaScript and XSS exploits have just made that statement even more true.

Grain of salt warning - I’m not 100% sure this was about HTML, though it was either HTML or SGML/XML. The original post was on Usenet, and a Google search only shows the quote, not the full context. If it was HTML, it’s somewhat ironic that the example is HTML yet pulls in the JWZ quote recommending against it.

Nicolas · June 27, 2008, 12:00am

DON’T use regular expressions to parse markup (HTML/XML/whatever).

http://htmlparsing.icenine.ca/

http://wiki.hypexr.org/wikka.php?wakka=/RegexFAQ
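
A minimal sketch of the "use a real parser" advice, assuming Python and its standard-library `html.parser` module (the thread does not name a language or a parser). `TagCollector` and `start_tags` are hypothetical names made up for this illustration; the sketch only lists start tags and is not a sanitizer.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser hands us the tag name already lowercased.
        self.tags.append(tag)


def start_tags(markup: str) -> list:
    """Return the start-tag names found in markup, in document order."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

For example, `start_tags("<p><!-- <b>not a tag</b> --></p>")` reports only the `p` tag, because the parser treats the comment as a comment. A pattern that simply searches for `<b>` would also match the text inside the comment.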

PaoloB · June 27, 2008, 12:00am

Am I the only one still curious what the official verdict on that Mensa page is?

Jeff, are you going to post your thoughts on it?

mike70 · June 27, 2008, 12:00am

As with other languages you mention (e.g. Ruby), you have to use regular expressions often enough that you internalize some of the less, um, user-friendly aspects of the expressions. Classic cognitive problem in regular expressions: reserved characters that mean different things in different contexts. (Well, unless it’s inside square brackets. Then it means …) An interesting dilemma for a) regex novices working on b) straightforward problems is that, by the time you take into account all the debugging time for your regular expression, it can be just as fast to write a parsing function in Your Language Of Choice. Not as elegant, of course, but for one-offs, it can be awfully tempting to forgo the headaches of line-noise syntax …
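
One illustration of the context problem, as a small sketch in Python's `re` module. The caret is my choice of example character (the comment does not say which one it had in mind), and the variable names are made up for this sketch.

```python
import re

# '^' outside brackets anchors the match to the start of the string.
starts_with_digit = re.compile(r"^\d")

# '^' as the first character inside brackets negates the character class.
non_digit = re.compile(r"[^\d]")

# '^' anywhere else inside brackets is just a literal caret.
digit_or_caret = re.compile(r"[\d^]")

print(bool(starts_with_digit.search("7 apples")))  # True
print(non_digit.findall("a1b2"))                   # ['a', 'b']
print(digit_or_caret.findall("2^8"))               # ['2', '^', '8']
```

Same character, three meanings, and the only thing that changes is where it sits in the pattern.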

JuanZ · June 27, 2008, 12:00am

I abuse hot sauce, but I’m Mexican so it’s OK for me. And please, please, no more Tabasco: that’s not real salsa, it’s not even hot. If you want something hot and delicious, try salsa de chile habanero (http://www.salsasetc.com/graphics/H-175A%20large.jpg); it will burn your tushi like never before.

Juan Zamudio

Rob · June 27, 2008, 12:00am

I never, ever want to hear from anyone, ever again, that programming in assembly language is hard or useless.

SteveS · June 27, 2008, 12:00am

Of course then you always get the guy asking you for help making a Regular Expression to match strings of balanced parentheses.
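
The reason this request is a running joke: arbitrarily nested balanced parentheses are not a regular language, so a classic regular expression cannot match them (some engines add recursion extensions, but Python's `re` is not one of them). A plain counter does the job. This is a minimal sketch in Python, and `is_balanced` is a hypothetical helper name.

```python
def is_balanced(text: str) -> bool:
    """Return True if every '(' in text has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    # Anything left open means a '(' was never closed.
    return depth == 0
```

Characters other than parentheses are ignored, so `is_balanced("(a(b)c)")` is true while `is_balanced(")(")` and `is_balanced("(()")` are false.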

ValentinG · June 27, 2008, 12:00am

This is the reg expression to rule all reg expressions: %s\n !

Just kidding! But do take a look at this small post on how to read quoted strings with scanf: http://narg.eu/?p=6 - in one simple reg exp.

Justin · June 27, 2008, 12:00am

I ~love~ regular expressions. I do know that they’re slow compared to strstr, strpos, or whatever your language’s equivalents are, because they are yet another language that has to be PARSED and COMPILED. At their basic level, they represent a finite state machine (this is less true of modern regex, but it holds for the basic operators in, say, POSIX regex).

Therefore - long regular expressions are going to be SLOWER than shorter ones. Personally… I’d have taken the article’s long ‘or’ string and broken it up into a loop over a list of allowed elements (easier to add onto later too).
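
A rough sketch of that idea in Python, shown next to the single-alternation version for comparison. The tag list is a hypothetical stand-in (the article's actual list is not reproduced here), the function names are made up, and no speed claim is implied: which version is faster depends on the engine and was not measured.

```python
import re

# Hypothetical whitelist; the article's actual list of tags is not reproduced here.
ALLOWED_TAGS = ["b", "i", "em", "strong", "code", "pre"]

# Generating the alternation from the list keeps the pattern in sync with the data.
ALLOWED_RE = re.compile(r"</?(?:%s)>" % "|".join(map(re.escape, ALLOWED_TAGS)))


def is_allowed_regex(tag: str) -> bool:
    """Check a tag such as '<b>' or '</em>' against one generated alternation."""
    return ALLOWED_RE.fullmatch(tag) is not None


def is_allowed_loop(tag: str) -> bool:
    """Same check as a plain loop over the list, with no regex involved."""
    for name in ALLOWED_TAGS:
        if tag == "<%s>" % name or tag == "</%s>" % name:
            return True
    return False
```

Either way, adding a tag later means appending one string to `ALLOWED_TAGS` rather than editing a long pattern by hand.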

Something to note is that every regex engine is different. Some optimize things differently (i.e., if your engine is naive about building the FSM, the long ‘or’ statement above will result in a huge structure, with a branch for every OR), some have wholly different functionality (though, in general, ‘keyword’ characters are consistent), and some have different ‘shortcuts’, especially for character classes. The most widely used is probably PCRE (Perl Compatible Regular Expressions), a C library that is used in a number of different places and works much like Perl. Its syntax is a little different from, say, Java’s or JavaScript’s, which in turn is very different from BRE (basic regex), etc.

It’s super powerful though, however you cut it. It’s one of those tools that programmers need to know, because oftentimes it’s the best tool for the job - especially in the text-based web world.

Kyle · June 27, 2008, 12:00am

Jeff, any thoughts on introducing BNF? It is the perfect complement to fill in the gaps of regular expressions. Simplest possible way to get balanced matching and parsing. Arguably easier to use than regular expressions. It is supported in all major programming languages. Whenever turning text into a data structure, reach for a BNF first.

Besides, the complexity of regular expressions seems to grow at length^2. BNFs feel more like log(length).

http://en.wikipedia.org/wiki/Backus-Naur_form
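
To make "turning text into a data structure" concrete, here is a small sketch in Python. It is a hand-written recursive-descent parser, not a BNF tool: the grammar sits in the comment and each rule becomes one function. The grammar, the token pattern, and all names are assumptions invented for this illustration.

```python
import re

# Grammar this sketch assumes (illustrative, not taken from the comment):
#   <expr> ::= <atom> | "(" <list> ")"
#   <list> ::= <expr> <list> | ""
#   <atom> ::= one or more characters other than whitespace and parentheses

# A regex is still handy for the flat part of the job: splitting into tokens.
TOKEN_RE = re.compile(r"[()]|[^\s()]+")


def parse(text: str):
    """Parse text into nested Python lists, one function per grammar rule."""
    tokens = TOKEN_RE.findall(text)
    expr, pos = _parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos])
    return expr


def _parse_expr(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        items, pos = _parse_list(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing ')'")
        return items, pos + 1
    if tok == ")":
        raise ValueError("unexpected ')'")
    return tok, pos + 1  # a bare atom


def _parse_list(tokens, pos):
    items = []
    # Keep reading expressions until the closing ')' or the end of input.
    while pos < len(tokens) and tokens[pos] != ")":
        item, pos = _parse_expr(tokens, pos)
        items.append(item)
    return items, pos
```

With this, `parse("(a (b c) d)")` yields `['a', ['b', 'c'], 'd']`, and unbalanced input such as `"(a (b"` raises `ValueError`. The nesting is handled by the recursion, which is the part a plain regular expression cannot express.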