Regular Expressions: Now You Have Two Problems

@N
But that’s because of my shortcoming.
Disagree. What if you’re not planning to write a regexp, but you want to use an existing one, where there’s something complex within it? Knowing roughly what it does helps. This is yet another case of overuse = failure, non-use = failure.

Also, no, I knew not of RMC before this. Thanks, Jeff!

  1. If you use it all the time, regex is great.

  2. If you use it once in a while, avoid it. Seems like I have to re-learn it each time I hit a case where I need it. Even with the tools.

  3. If other people who don’t use regex all the time will be supporting the code, don’t use it.

But this is a chicken/egg story. If you use it lots, you know it. If you don’t… avoid.

Regex always seems like going back to assembler. That’s why you have to use all those utilities. But gee… we are already working in a compiled environment. Why go back for regex with its unobvious shift-numeric syntax (!@#$%^*)? I’d prefer to use something with regex’s power, but with a more obvious syntax - a regex compiler - but with the original code part of the real code. It probably exists.

Remember, if you seldom use regex but have a case for one, you can often find the expression on the internet. I couldn’t be bothered to work out how to parse a date in the exact format dd/mm/yy, so I just looked it up on the internet, pasted it in and checked that it works. Great: it even stops you going past the maximum days of each month and past 12 months, and it handles the leap day.
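For anyone curious what such a dd/mm/yy check involves, here is a minimal Python sketch. It is not the expression the commenter found; it swaps in a different technique, using a short regex only for the shape and leaving the calendar rules (days per month, leap day) to the standard library. The names `DATE_SHAPE` and `is_valid_ddmmyy` are made up for illustration, and the two-digit year is interpreted however `strptime`'s `%y` interprets it.

```python
import re
from datetime import datetime

# Shape check only: two digits, slash, two digits, slash, two digits.
DATE_SHAPE = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{2}")


def is_valid_ddmmyy(text: str) -> bool:
    """Return True if text is a real calendar date written as dd/mm/yy."""
    if DATE_SHAPE.fullmatch(text) is None:
        return False
    try:
        # strptime rejects impossible dates such as 31/04
        # or 29/02 in a year that is not a leap year.
        datetime.strptime(text, "%d/%m/%y")
    except ValueError:
        return False
    return True
```

Splitting the work this way keeps the regex short enough to read at a glance, at the cost of a second validation step.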

orange - it is all the time

Can somebody tell me (from personal experience) the scenarios

  • where to use regex
  • where to avoid

I only find it useful for validating email and for searching input to strip out dangerous HTML.

Regular expressions rock. They should absolutely be a key part of every modern coder’s toolkit.

I always find myself disagreeing with you whenever you say something should apply to every programmer, regardless of their area of expertise. As just one of many examples, what if the coder is in the video game industry? Sometimes it seems like people forget that there’s more to programming than processing text files and validating inputs.

@Josh Stodola: One of the nice things is that the underlying engines will also have been optimized for speed, so you don’t have to. Sure you can hand-write a simple parser, but can you hand-write it to make it fast?

Coding language does seem important to the level of regex use. For some thoughts on using regexes in Perl vs Python, see

http://www.fluidinfo.com/terry/2007/06/13/resorting-to-regular-expressions/

Terry

While I don’t use regex much, I do see its beauty for some problems - plus I’ve never really understood the quote about having two problems, but given the context that it’s saying regex isn’t a solution for everything… obviously I agree.

Here’s the thing though, I just don’t see an HTML sanitizer as being a good example for regex… sure, regex in this scenario can bring results quickly… and it does the base things perfectly… but sanitizing HTML is more than matching patterns… and while I’m sure you could build more regex to progressively pull everything apart and back together… it kind of makes me wonder if you are then almost trying to parse with regex…

Naively speaking, because I’ve never actually written a complicated sanitizer… I would say that a traditional programming approach - although much slower to see results at first… would be more flexible… and given the recursive patterns that exist… once you start to hit a point… you’ll see results, and get past the problems where regex would become more troublesome… faster.

I’d be hugely surprised if there wasn’t a .NET lib out there for doing this already… if not, then I think a codeplex/sf project is called for. Obviously, from your posting on refactormycode - and on here… there are a lot of hugely knowledgeable people with regard to how sanitization should work… and it would be really interesting to see a product from it.

I’d do it myself, but I know anything I put up would be ripped apart instantly - but hey, if it triggers people to do something in a (gah, give it here, this is how you do it!) kinda way… then maybe I should :P
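To make the "traditional programming approach" above a bit more concrete, here is a rough Python sketch of a whitelist sanitizer built on a real HTML parser instead of regex. It only illustrates the idea and is not a vetted sanitizer: the whitelist, the class name `WhitelistSanitizer` and the function `sanitize` are all made up for illustration, every attribute is dropped, and nothing checks that tags are balanced.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, chosen for illustration only.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit whitelisted tags without attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded entirely, so onclick= and friends vanish.
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is escaped again on the way out.
        self.parts.append(escape(data))


def sanitize(html_text: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

The point of the design is that the parser deals with nesting and odd tag syntax, so the only decision left in your own code is which tags to let through.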

mihondo: You said regex feels like assembly, and you want something higher level. Guess what the following does:

number :: '0'...'9'*
phoneNumber :: [ '(' number ')' ] number '-' number

This is a BNF (Backus-Naur Form) grammar that matches phone numbers. BNF is a high-level notation for describing grammars. You can do just about anything a regex can do, though BNFs tend to be self-documenting. The typical use case is turning a program’s source code into an abstract syntax tree, but it fits really well for simple stuff too. The Wikipedia page has a simple example for matching a US postal address. The nice thing about BNF is that it turns the text into a data structure. The really nice thing about BNF is that it is designed to deal with things like nested tags/parens, the weakest part of regexes.

Common extended-BNF parsers include Yacc/bison (for the unixes) and ANTLR (for Java). My personal favorite is PyParsing, as it has some tasty syntactic sugar.
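To show how the phone number grammar above turns text into a data structure, here is a small hand-written recursive-descent sketch in plain Python. It uses none of the tools named above, only the standard library. The names `PhoneNumber`, `ParseError` and `parse_phone_number` are invented for illustration, and it assumes `number` means one or more digits, whereas the grammar as written would also allow an empty number.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class ParseError(ValueError):
    """Raised when the input does not match the grammar."""


@dataclass
class PhoneNumber:
    area: Optional[str]
    prefix: str
    line: str


def _number(text: str, pos: int) -> Tuple[str, int]:
    # number: one or more ASCII digits (an assumption, see the note above).
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == pos:
        raise ParseError(f"expected a digit at position {pos}")
    return text[pos:end], end


def _expect(text: str, pos: int, char: str) -> int:
    if pos >= len(text) or text[pos] != char:
        raise ParseError(f"expected {char!r} at position {pos}")
    return pos + 1


def parse_phone_number(text: str) -> PhoneNumber:
    # phoneNumber: [ '(' number ')' ] number '-' number
    pos = 0
    area = None
    if text.startswith("("):
        pos = _expect(text, pos, "(")
        area, pos = _number(text, pos)
        pos = _expect(text, pos, ")")
    prefix, pos = _number(text, pos)
    pos = _expect(text, pos, "-")
    line, pos = _number(text, pos)
    if pos != len(text):
        raise ParseError(f"unexpected trailing text at position {pos}")
    return PhoneNumber(area, prefix, line)
```

Each grammar rule maps to one function, which is roughly the work a parser generator does for you from the BNF.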

I can also show you something written in the very same medium that is so beautiful it will make your eyes water

Ok, I’m calling you out on the Klingon. Let’s see that beautiful eye-watering Klingon.

I once heard a good piece of advice that seems to work for me and my code:

Try to use a lot of vertical space and
very little horizontal space!

This applies to regular expressions as well. Readable code is all about whitespace, comments and proper naming.
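As a sketch of what "vertical" regex writing can look like, here is a phone number pattern laid out with Python's `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. The pattern and the group names (`area`, `prefix`, `line`) are just an example, not a recommended phone number format.

```python
import re

PHONE = re.compile(
    r"""
    ^
    (?: \( (?P<area> [0-9]{3} ) \) )?   # optional area code in parentheses
    (?P<prefix> [0-9]{3} )              # three-digit prefix
    -                                   # literal hyphen
    (?P<line> [0-9]{4} )                # four-digit line number
    $
    """,
    re.VERBOSE,
)

match = PHONE.match("(555)123-4567")
```

The same pattern on one line would be `^(?:\((?P<area>[0-9]{3})\))?(?P<prefix>[0-9]{3})-(?P<line>[0-9]{4})$`, which is a lot harder to review.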

Orange

I’ve yet to get a handle on Regex but I do appreciate expressions that others have published and just work for me. Really, really appreciate it. Checking email addresses, post (zip) codes, phone numbers, etc. All this validation of text allows my code to remain concise and also allows me to get on with my job. The language independence is a big bonus in this respect.

I enjoy writing the unit tests against them to make sure all is good and this allows me to know exactly what is and is not covered in each expression.

Having said that, readability is atrocious even with whitespace and this adds further importance to the unit testing as this now doubles up as documentation.
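A rough sketch of the tests-as-documentation idea in Python's `unittest`. The ZIP code pattern and the test names are made up for illustration; this is not one of the published expressions mentioned above.

```python
import re
import unittest

# Hypothetical pattern under test: five digits, optional -four digits.
ZIP_CODE = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")


class ZipCodePatternTests(unittest.TestCase):
    def test_accepts_five_digits(self):
        self.assertIsNotNone(ZIP_CODE.fullmatch("12345"))

    def test_accepts_zip_plus_four(self):
        self.assertIsNotNone(ZIP_CODE.fullmatch("12345-6789"))

    def test_rejects_too_few_digits(self):
        self.assertIsNone(ZIP_CODE.fullmatch("1234"))

    def test_rejects_letters(self):
        self.assertIsNone(ZIP_CODE.fullmatch("12a45"))

    def test_rejects_dangling_hyphen(self):
        self.assertIsNone(ZIP_CODE.fullmatch("12345-"))
```

Run it with `python -m unittest`. The test names then read as a plain-language list of what the expression does and does not cover.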

+1 for Expresso (http://www.ultrapico.com/Expresso.htm)

A more simple one is Regulazy (http://tools.osherove.com/CoolTools/Regulazy/tabid/182/Default.aspx)

Hi Jeff… I have similar feelings about the overuse of ajax as you do about the overuse of regular expressions

http://blog.pnbconsulting.com.au/?p=134

The best regex advice I’ve heard is to not try to write your own parser…FWIW

Should you try to solve every problem you encounter with a regular expression? Well, no. Then you’d be writing Perl

Shame on you for perpetuating this tired old piece of nonsense.

Are you also writing your own webserver and C library?

I couldn’t find any good C# HTML sanitizing code that wasn’t a huge, dumb dependency. Now I can, because I wrote it!

Try to use a lot of vertical space and very little horizontal space!

Agree, see flattening arrow code
http://www.codinghorror.com/blog/archives/000486.html

I’d prefer to use something with regex’s power, but with a more obvious syntax

Maybe fluent interface? But I disagree.
http://www.codinghorror.com/blog/archives/000989.html

Saw some replies asking about open source regex editor:

KDE regular expression editor manual:
http://docs.kde.org/kde3/en/kdeutils/KRegExpEditor/index.html

Redet:
http://billposer.org/Software/redet.html

Simple version:
http://www.arachnoid.com/regex_lab/

One of the best books to learn how to use regex: http://oreilly.com/catalog/9780596528126/
Before reading it, I thought I knew regular expressions. It made me change my mind.

On the subject of good regular expression tools, I would like to recommend a free online one that is designed for .NET programmers:
http://www.lastdomainnameonearth.com.