Regex use vs. Regex abuse

I'm a huge fan of regular expressions; they're the Swiss Army knife of web-era development tools. I'm always finding new places to use them in my code. Although other developers I work with may be uncomfortable with regular expressions at first, I convert them to the regex religion sooner or later. If you're working with strings in any capacity at all (and what developer isn't?), it's hard to deny the flexibility of regular expressions. Why use 6 lines of procedural If..Then blocks to process a string when you can do the same thing in a concise, 20-character regex pattern? And if you put that same pattern in a .config file, you can now change the behavior of your app without recompiling. It's less code doing more work.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2005/02/regex-use-vs-regex-abuse.html

Somewhat? LOL.

Good topic. I try to use them when I can without getting myself confused. I hate getting one to work and then having another developer look at it and ask me to explain what it’s doing, when all I really did was use the Expresso designer.

Reg-exps aren’t hard. They’re covered in undergrad courses on state machines. Not understanding the tool you are using can be dangerous. Letting a wizard handle it for you means you no longer understand what the code is doing. You end up doing cargo-cult programming.

What’s hard is using reg-exps to deal with fuzzy real-world data.

Your phone number example:

"^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$"

fails when confronted with a number with an international dialing prefix, or any non-US number.

Your domain example

"[^\\]+$"

fails when confronted with the e-mail address form (user@domain) or a forward-slash form (user/domain)

It’s dealing with all this that makes reg-exps hard. It’s why the RFC-822 parser is so huge. E-mail addresses aren’t always foo@bar.com.


fails when confronted with a number with an international dialing prefix, or any non-US number.

By design! I usually work on apps that are only deployed to US users. The more interesting thing I forgot is, this is valid: (((((919) 555-1212. Unlikely to happen in the real world (eg real users doing real data entry), but it’s a flaw.
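
The flaw is easy to confirm. Here is a minimal sketch, assuming Python's re module in place of whatever engine the original post used. The pattern is the one quoted above; the helper name is_phone and the sample inputs are made up for illustration.

```python
import re

# The phone pattern quoted earlier in the thread, unchanged.
PHONE = re.compile(r"^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$")

def is_phone(text):
    """Return True if text matches the US-style phone pattern."""
    return PHONE.match(text) is not None

print(is_phone("(919) 555-1212"))      # ordinary US number
print(is_phone("(((((919) 555-1212"))  # stray opening parentheses
print(is_phone("+44 20 7946 0958"))    # international prefix
```

Because \(* and \)* are independent "zero or more" quantifiers, nothing forces the parentheses to balance, which is why the second input is accepted.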

fails when confronted with the e-mail address form (user@domain) or a forward-slash form (user/domain)

Again by design. The only goal there is to parse the LOGON_USER http header, which is always in that format AFAIK.

It’s dealing with all this that makes reg-exps hard.

No, trying to solve all known conditions in a single regexp, as you’re implying, is what makes them “hard”. It’s simpler in a lot of cases to break your testing into 3-4 different regexps, with comments, rather than one Uber-regex that God himself can barely decipher.

Like I said, it’s a balancing act.

That parser bears a stunning resemblance to the code I have to maintain.

validating a phone number

^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$

Who cares about the separators? Strip out non-numerics, since they’re meaningless:

s/\D//g

Then do whatever validation you want on a string of digits. You still have the international problem, but it works for (((((919) 555-1212, or a slightly more realistic 919.555.1212.
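
A rough sketch of that strip-then-validate idea, assuming Python rather than Perl. The function names and the ten-digit rule are assumptions chosen for illustration, not something the thread specifies.

```python
import re

def digits_only(text):
    # Same effect as Perl's s/\D//g: remove every non-digit character.
    return re.sub(r"\D", "", text)

def looks_like_us_number(text):
    # Assumed rule: a US number is exactly 10 digits once separators are gone.
    return len(digits_only(text)) == 10

print(digits_only("(((((919) 555-1212"))
print(digits_only("919.555.1212"))
print(looks_like_us_number("555-1212"))
```

Both sample inputs reduce to the same digit string, so the validation step never has to know which separators the user typed.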

I don’t think that the RFC822 regex is a good example of poor regular expression design. Because it’s mechanically generated code, one would not edit the regex, in the same way one wouldn’t edit an executable generated by gcc. One would, however, work on the source code that generates that RE.

Of course, something like Perl 6’s rules would make RFC822 human readable; when such things become production ready, things are going to be so much better for all.

Ermmm… your upgrade of Movable Type might just have made your example regexes look like laughable riddles…?

I take it your domain example hasn’t always looked like this

"[^]+$"

or you would surely have gotten some replies as to that :slight_smile:

PS. Now that orange is gone (yay) you might want to consider putting preview on the agenda (for, say, 2014). That will prevent me and others from making fools of ourselves by posting a nitpick that will probably (Murphy?) come out just as distorted as the original post…

Hallelujah! It made it - uncrippled.

One further note, even the corrected regex (someone posted that, I hope it comes out correctly if I specify it here)

"[^\\]+$"

should probably include some limiter to make it non-greedy, as in Perl5 and .NET:

"[^\\]+?$"

Other flavours of regexpen will require you to specify something funny like

/[^\\]\{-1,}$/

Otherwise, something like \\SQLSERVER01\DBINST03\SchemaUser would match DBINST03\SchemaUser instead of SchemaUser. As a rule of thumb, with every quantifier I write I consider whether it really needs to be greedy. More often than not, it doesn’t. Greedy quantifiers can do silly things and gobble up way more than you expect, making the rest of your application fail in miserable ways, or (worse) making the regex run in exponential time on worst-case (or evilly crafted…) input. DoS, anyone?

Once you use greedy quantifiers it is almost always possible to craft an input that exploits that to do something bad, unexpected or both.
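
It is worth running both variants before relying on a rule like this. The sketch below assumes Python's re module (the thread talks about Perl 5 and .NET), and the sample tag string is made up. With the end anchor and a class that can never match a backslash, the greedy and lazy forms of this particular pattern return the same final segment. The greedy/lazy difference appears with an unrestricted pattern such as .+.

```python
import re

path = r"\\SQLSERVER01\DBINST03\SchemaUser"

# Negated class plus end anchor: the match cannot span a backslash either way.
greedy = re.search(r"[^\\]+$", path).group()
lazy = re.search(r"[^\\]+?$", path).group()
print(greedy, lazy)

# An unrestricted dot is where greediness changes the result.
html = "<b>bold</b>"
greedy_tag = re.search(r"<.+>", html).group()  # runs to the last '>'
lazy_tag = re.search(r"<.+?>", html).group()   # stops at the first '>'
print(greedy_tag, lazy_tag)
```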

In Perl, "^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$"

would be

/
^ # start from the beginning
\(* # any number of opening parentheses
( # capture the following
 [[:digit:]]{3} # exactly 3 digits
) # (ends capture)
\)* # any number of closing parentheses
(?: # then look for this group (but don't capture it)
 \s* | - # possibly some whitespace, or a single hyphen
)? # only once, and not necessarily
( # capture the following
 [[:digit:]]{3} # exactly 3 digits
) # (ends capture)
(?: # then look for this group (but don't capture it)
 \s* | - # possibly some whitespace, or a single hyphen
)? # only once, and not necessarily
( # then capture the following
 [[:digit:]]{4} # exactly 4 digits
) # (ends capture)
$ # test if we are at the end of a line, don't match otherwise
/x # allow comments, ignore whitespace.

Wow, major disconnect here. :slight_smile: I actually believe that Friedl’s regex book is one of the best programming books I’ve read! It breaks down the seemingly confusing world of regular expressions into elemental pieces, and then builds the language back up from the base, so that when you look at a line of Q-Bert vomit, you have a clear plan of attack for analyzing the line. That book gives one a strong fundamental understanding of how regex works.

Interestingly, out of the first 5 or so websites that test regexes online, only 1 would take a regex that long.

I wonder how long it took that guy to make a regex that huge. And how many comments he put into the code in which it resides.

Rather than dealing with the expressions as one lump of text, if you went about building them from the various pieces that they represent, they’d be a little easier to digest, as well as maintain.

Something like what this guy did in this paper:

http://www.cs.sfu.ca/~cameron/REX.html

I remember that e-mail regex from somewhere before… I remember it was said that it’s wrong, and automatically generated. I can’t remember the real one, but it’s at most ~500 chars (/^[]+(\.[]+)*@[]+(\.[]+)*$/ should be enough from memory, though).
Peter’s example fails to distinguish +12 (03) 555 1111 (which could be valid) from +120 (35) 551 111 (which is totally screwy)

make that regex: /^[(somechars)]+(\.[(somechars)]+)*@[(someotherchars)]+(\.[(someotherchars)]+)*$/ (stupid HTML stripping… what’s wrong with escaping?)

I’m not a programmer. I wish I was, but I’m not.

My brother’s an application support developer, and when I posed my query on developing a Chrome extension that could translate the tablature info on websites (e.g. 320033) into note form (GBDGDG), he instantly suggested regexp, although he confessed he was not personally skilled in that fine art…

So I have two parts to my problem

  1. Identify frets on each string, ie
    xx0232 = x x 0 2 3 2,
    911111099 = 9 11 11 10 9 9,
    etc

  2. Translate fret-number into note, for each string, ie
    x x 0 2 3 2 = xx D A D F#
    8 10 8 7 x x = C G A# D x x

I don’t know where to start. Thoughts welcome!

  1. Have an array of the 12 chromatic note names (indexed 0 to 11)
    sharps = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
    You could alternatively use a flats representation:
    flats = ['C','Db','D','Eb','E','F','Gb','G','Ab','A','Bb','B']
  2. Create an array that represents your tuning (for standard guitar tuning: E A D G B E):
    tuning = [4,9,2,7,11,4]
  3. Trim trailing whitespace from your chord and split it at spaces to get the 6 string fret values
  4. For each string (0 to 5):
    a. If fret value is “x”, output “x”
    b. If fret value can be converted to a number, calculate the offset of the note in the sharps (or flats) array:
    offset = ( tuning[string] + int(fret) ) mod 12
    c. Output sharps[offset] (or flats[offset], alternatively).
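
The steps above can be sketched in a few lines. This is a minimal illustration, assuming Python and space-separated fret values. The function name chord_to_notes and the length check are additions for the example, not part of the steps as written.

```python
# Chromatic note names, indexed 0 to 11, starting at C.
SHARPS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

# Standard guitar tuning (E A D G B E) as indexes into SHARPS.
STANDARD_TUNING = [4, 9, 2, 7, 11, 4]

def chord_to_notes(chord, tuning=STANDARD_TUNING, names=SHARPS):
    """Translate space-separated fret values into note names, string by string."""
    frets = chord.strip().split()
    if len(frets) != len(tuning):
        raise ValueError("expected one fret value per string")
    notes = []
    for string, fret in enumerate(frets):
        if fret.lower() == 'x':
            notes.append('x')  # muted string, pass through unchanged
        else:
            notes.append(names[(tuning[string] + int(fret)) % 12])
    return notes

print(chord_to_notes("x x 0 2 3 2"))
print(chord_to_notes("8 10 8 7 x x"))
```

Passing a flats list as names gives the alternative spelling without touching the arithmetic.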

Regular expressions would be useful to EXTRACT the chords in the tablature text, though! But that was not what you asked! Have a good learn!
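
For the extraction part, here is a hedged sketch in Python. It assumes chords appear in the compact six-character form (one digit or x per string), so it cannot handle frets above 9, and it will also pick up any unrelated six-digit number in the text. The pattern and the name extract_chords are illustrative assumptions.

```python
import re

# Assumed shape: exactly six characters, each a single-digit fret or x (muted).
CHORD = re.compile(r"\b[xX0-9]{6}\b")

def extract_chords(text):
    """Return each compact chord found in text as a list of six fret tokens."""
    return [list(match.lower()) for match in CHORD.findall(text)]

print(extract_chords("G is 320033 and D is xx0232"))
```

A run-together chord with two-digit frets, such as 911111099, is deliberately not matched here, because splitting it needs more than a character class.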


Oh darn, I saw Nov 14 on the previous post, I thought your post was only a couple weeks old…


How can this shirt not be available in orange?!?!
