Testing With "The Force"

Peter · July 8, 2009, 12:00am

I guess the point of the article is to use real test data for tests, so, good on you.

But it would have been nice to hear a word about BNF grammars and actual, real text parsing, instead of hacking it together from regex. Even if Markdown isn’t defined in a deterministic way, it still would have been nice to hear a little peep about parsing and how it would have solved all those nasty little edge cases, and why we can’t use the traditional approach today.

Jim · July 8, 2009, 12:00am

How about saving a “markup version” with each comment? Then you could render old submissions without any formatting or the old way… and new submissions with the current method…

Or… if you have a way of escaping your newly introduced control chars, escape all old comments…
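A minimal sketch of the "markup version" idea, assuming each stored comment carries a version number and that version 1 means "no formatting". The names `Comment`, `markup_version`, `RENDERERS` and the pattern `ITALIC` are hypothetical, not taken from the site's code:

```python
import html
import re
from dataclasses import dataclass

# Hypothetical italic pattern, adapted to Python's re module.
ITALIC = re.compile(r"(?:^|(?<=[\s,(]))\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])")


def render_v1(body: str) -> str:
    """Version 1: escape HTML only, asterisks stay literal."""
    return html.escape(body)


def render_v2(body: str) -> str:
    """Version 2: escape HTML, then turn *text* into <em>text</em>."""
    return ITALIC.sub(r"<em>\1</em>", html.escape(body))


# Each stored comment is rendered by the rules it was written under.
RENDERERS = {1: render_v1, 2: render_v2}
CURRENT_VERSION = 2


@dataclass
class Comment:
    body: str
    markup_version: int = CURRENT_VERSION


def render(comment: Comment) -> str:
    return RENDERERS[comment.markup_version](comment.body)
```

Old rows would keep `markup_version=1` and render exactly as before, while new submissions pick up the current rules.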

securityhorror · July 8, 2009, 12:00am

I hope you will one day find the problem you are trying to solve with this mindless post.

Regular expressions are useful. We get it. You are cool. Your web site is cool. You will make a lot of $$$ (or $#%@%^@, it remains to be seen).

Kenneth · July 8, 2009, 12:00am

Testing is for pansies! Test this, test that…blah, blah, blah. Real codeslingers just throw their product out to the masses and deal with the - ahem - very small number of bugs that their exceptional code will somehow manage to generate. What’s the worst that could happen, eh?

BradG · July 8, 2009, 12:00am

You should have just posted a question on Stack Overflow asking for help. I know I would have seen some of the problems with the first one, and I think there may still be some problems with the last one.

(The question should also have had a link to the Meta SO question where people could comment on the idea.)

Tim · July 8, 2009, 12:00am

This won’t match

Testing is good.

You may want:

(?=^|[\s^,(])\*(?=\S)(.+?)(?=\S)\*(?=[\s$,.?!])

Kost · July 8, 2009, 12:00am

Well, here is an excellent example of what your Stack Overflow dumps can be used for: TEST CASES FOR FREE

Secure · July 8, 2009, 12:00am

“Especially when we eventually have *italics* #bold# @underline@ %hidden% etc.”

Back from the old BBS days, I find /italics/ *bold* _underline_ much more intuitive.

LXj · July 8, 2009, 12:00am

This will work until you decide to add some escaping. Oh wait… you have already added it: code!

So now you need to modify your regexp so that text *some* text inside code would not match

Mecki · July 8, 2009, 12:00am

Don’t forget to keep track of which mode you are currently in. E.g. within a code block you probably don’t want * to be matched at all (people will most likely not use italics within a code block, will they?). This is probably the hardest thing to do – if it is possible with regex alone at all… I’m not sure about that (not unless you pack all three components, code, italic and bold, into a single regex).

BTW, I don’t understand why *xxx* is italic and **xxx** is bold. Ages ago, when nobody had ever heard of HTML, we styled our read-mes, mails and Usenet posts with: /italic/, *bold*, _underlined_
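One way to picture the "mode" problem is to split the text into code and non-code chunks first and only run the italic regex on the non-code ones. This is a rough sketch, assuming code is delimited by single backticks; `CODE_SPAN`, `ITALIC` and `render_italics` are made-up names, and the pattern is adapted to Python's `re` module:

```python
import re

# The capturing group makes re.split() keep the code spans in its result.
CODE_SPAN = re.compile(r"(`[^`]*`)")

# Hypothetical italic pattern for text outside code.
ITALIC = re.compile(r"(?:^|(?<=[\s,(]))\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])")


def render_italics(text: str) -> str:
    """Apply italics only outside backtick-delimited code spans."""
    parts = CODE_SPAN.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd indices are the captured code spans: leave them untouched.
            out.append(part)
        else:
            # Simplification: each chunk is matched on its own, so chunk
            # boundaries behave like line boundaries for ^ and $.
            out.append(ITALIC.sub(r"<em>\1</em>", part))
    return "".join(out)
```

Emphasis that spans a code span (an opening * before the code and the closing * after it) is not handled here. That is the kind of case where a real tokenizer would probably start to pay off.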

(apart from the fact that I think italic is rather useless. To me italic text is not emphasized at all, it is rather slimmer and less emphasized, de-emphasized so to say. Sometimes it’s also just plain ugly or hardly even distinguishable from normal text, depending on the font being used. So I always use bold to emphasize and if / dies tomorrow, I will not really miss it)

Matt · July 8, 2009, 12:00am

Sure. The best use of your time is to stare at the code and ponder all the possible holes in it. Do this for hours so you can be SURE you thought of everything. Definitely do NOT run a simple quick query on a huge amount of real world test data. That would be stupid.

While you are at it, I think you should go back through your code base, ignore all profiling you have done so far, and start pre-optimizing code. That’s always a good use of time.

Because as a team of 3, I’m sure you guys have enough diversity to think like every person in every culture out there and fully know exactly what they will click on, in what order, what type of data they will enter, etc.

After you then get your perfect code, release it. When people complain of problems, go back and fix your code (what? I thought I was so smart I’d found every possible bug! that’s unpossible!).

But for Dennis’ sake, definitely do NOT use your test data first!

Come on, people. Using a huge amount of real world data as your first cut is the best possible use. You will see how people are using the *, and then, after you solve all those edge cases, you can THEN stare at your code and analyze each special case you had to consider. This will actually help you think of uses of the * character you have not encountered yet.

Some people are so incredibly arrogant it amazes me.
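For what it's worth, the "quick query on real data" can be tiny. Here is a sketch that runs two candidate patterns over a pile of comment bodies and reports only the ones where they disagree. `OLD`, `NEW` and `SAMPLE` are illustrative stand-ins; a real run would read the bodies from a data dump instead:

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Hypothetical "before" and "after" patterns, adapted to Python's re module.
OLD = re.compile(r"\*(?=\S)(.+?)(?<=\S)\*")
NEW = re.compile(r"(?:^|(?<=[\s,(]))\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])")

# Stand-in for real comment bodies.
SAMPLE = [
    "*Testing* is good.",
    "char **p; int *q;",
    "2*3*4 = 24",
    "no stars here",
]


def disagreements(
    bodies: Iterable[str],
    old: "re.Pattern[str]" = OLD,
    new: "re.Pattern[str]" = NEW,
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (body, old_matches, new_matches) wherever the patterns differ."""
    for body in bodies:
        old_hits = old.findall(body)
        new_hits = new.findall(body)
        if old_hits != new_hits:
            yield body, old_hits, new_hits


if __name__ == "__main__":
    for body, old_hits, new_hits in disagreements(SAMPLE):
        print(f"{body!r}: old={old_hits} new={new_hits}")
```

On the sample above only the arithmetic line differs: the looser pattern would italicize the 3, the stricter one leaves it alone.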

Martin · July 8, 2009, 12:00am

While I agree with the original idea of “this is exactly the stuff regex was born to do” when the problem seemed trivial enough, I will have to agree with tb: in this case, regexes are not the best tool for the job (although I smiled at the fact that using “**” instead of “*” doesn’t solve the problem, because things like “char **p;” could happen too).

On the other hand, maybe you are not looking for the perfect solution, maybe something that works right 99.98% of the time is good enough, in which case you have already (almost) solved your problem. Furthermore, maybe not spending two days writing some context-free grammar or a state machine could be one of those “bad for the software, good for the business” things, right?

Having nitpicked enough, +1 for the article.

Side note: you should consider some form of comment moderation. I’m not saying you should get rid of anyone who doesn’t agree with you (Dennis Forbes, for instance, disagrees with you, but in a good way), but are insults and such really necessary (securityhorror, I’m looking at you)?

NickJ · July 8, 2009, 12:00am

The real solution would be to realize that regexes are a poor tool for this, and use a proper parser instead. Continue down this path and you have something like MediaWiki markup - an incredibly irregular markup language that can only be properly parsed by the one and only canonical implementation, using a horrid mishmash of regex and other functions.

John_H · July 8, 2009, 12:00am

I really think this is one of those situations where regex is not the answer. By the time you’ve coaxed out all the tricksy situations, you’ve got a monster unmaintainable regex.

Much better to do it in “normal” code. Of course that normal code might use simple regexes.

jeffH1 · July 8, 2009, 12:00am

Midichlorians?

gag

I thought we had all agreed that “The Phantom Menace” didn’t really happen.

DennisF · July 8, 2009, 12:00am

@Kenneth-

Who are you directing your sarcasm at? No comment that I can see in here implies that testing is a bad thing. Testing is a great thing.

However Jeff is pursuing Test Driven-Off-A-Cliff Development here. Worse, it isn’t even really tests at all because the tests “pass” based upon him eyeballing generated output to see if it’s getting closer to expectations. It is classic hackery-in-the-bad-way sort of coding.
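To make the distinction concrete: a test in the usual sense pins down the expected output ahead of time, so nobody has to eyeball anything. A small table-driven sketch, where the pattern `ITALIC`, the helper `to_html` and the cases themselves are illustrative assumptions rather than anything from the site's code:

```python
import re

# Hypothetical italic pattern, adapted to Python's re module.
ITALIC = re.compile(r"(?:^|(?<=[\s,(]))\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])")


def to_html(text: str) -> str:
    return ITALIC.sub(r"<em>\1</em>", text)


# (input, expected output) pairs, written down before running anything.
CASES = [
    ("*Testing* is good.", "<em>Testing</em> is good."),
    ("(see *this*, please)", "(see <em>this</em>, please)"),
    ("char **p;", "char **p;"),
    ("2*3*4", "2*3*4"),
    ("a * b * c", "a * b * c"),
]


def failures(cases=CASES):
    """Return (input, expected, actual) for every case that does not match."""
    return [
        (source, expected, to_html(source))
        for source, expected in cases
        if to_html(source) != expected
    ]


if __name__ == "__main__":
    for source, expected, actual in failures():
        print(f"FAIL {source!r}: expected {expected!r}, got {actual!r}")
```

Every odd case dug out of real data could then be added to the table as another pair, so it stays fixed.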

@Matt-

Groan. It isn’t worth replying.

JonathanD · July 8, 2009, 12:00am

This is why I use HTML. It’s slightly more time-consuming than Markdown, especially for unordered lists, but as someone very familiar with HTML I’m not really bothered.

code_monkey7 · July 8, 2009, 12:00am

Jeff, what kind of shoes do you wear? I’ll buy the same because I want to be like you. And please tell me more about you.

Noah1 · July 8, 2009, 12:00am

Tim was right. When you put ^ or $ inside a character class (like [\s^,(] or [\s$,.?!]), they no longer match positions, but those literal characters. \s may or may not mean “any whitespace” inside a character class, depending on your regex engine (some allow it, some interpret it as “either \ or s”).

So I believe what you meant was:

(?=^|[\s,(])\*(?=\S)(.+?)(?=\S)\*(?=$|[\s,.?!])

(and this matches one or more characters inside the asterisks, not “more than one character in total”).

However, it seems to be working well so far! Thanks for the new feature!
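The character-class point is easy to check in any engine. Below is a small sketch using Python's `re` module (an assumption, the site itself may use a different engine). Note that as printed above the leading group is a lookahead, which cannot inspect the character before the asterisk, so the sketch assumes a lookbehind was intended. Python only allows fixed-width lookbehinds, hence the slightly different spelling. `ITALIC` is a made-up name:

```python
import re

# Inside a character class, ^ (when not first) and $ are plain characters.
LEADING_CLASS = re.compile(r"[\s^,(]")
TRAILING_CLASS = re.compile(r"[\s$,.?!]")

assert LEADING_CLASS.fullmatch("^")          # matches a literal caret
assert TRAILING_CLASS.fullmatch("$")         # matches a literal dollar sign
assert LEADING_CLASS.search("") is None      # never matches a bare position

# Anchors moved outside the classes. Assumption: the context checks are
# meant as "preceded by" and "followed by".
ITALIC = re.compile(
    r"(?:^|(?<=[\s,(]))"    # start of string, or preceded by space , or (
    r"\*(?=\S)(.+?)(?<=\S)\*"
    r"(?=$|[\s,.?!])"       # end of string, or followed by space , . ? !
)

assert ITALIC.findall("*Testing* is good.") == ["Testing"]
assert ITALIC.findall("This is *good*") == ["good"]
assert ITALIC.findall("char **p;") == []
```

In Python's engine \s keeps its "any whitespace" meaning inside a class. Other engines may differ, as noted above.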

HearWa · July 8, 2009, 12:00am

It’s funny, but with each new article discussing regex I seem to dislike it more and more. I’ve toyed with regexes before, but to me writing code with my chosen language’s native string functions is much easier to read, modify and maintain, especially as complexity grows.

Sure, often I do in ten lines of code what can be done in one, but hasn’t C taught us that that isn’t always the brightest idea?