Mastering GUIDs with Occam's Razor

Do you remember the scene from the movie Full Metal Jacket where the marines recite the USMC creed?


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2005/09/mastering-guids-with-occams-razor.html

Speanking of Full Metal Jacket… if you’re having trouble meeting your development milestones, have you considered the R. Lee Ermey motivational figure?

http://www.rleeermey.com/motivationalfigure.php

Personally, I like to welcome all new developers to our team with a slightly modified but highly inspirational R. Lee Ermey speech:

“If you ladies survive recruit training, you will be a [DEVELOPER]. You will be a minister of death praying for [MORE DEVELOPMENT TASKS]. But until that day you are pukes. You are the lowest form of life on Earth. You are not even human fucking beings. You are nothing but unorganized grabastic pieces of amphibian shit. Because I am hard you will not like me. But the more you hate me the more you will learn. I am hard but I am fair. There is no racial bigotry here. I do not look down on niggers, kikes, wops or greasers. Here you are all equally worthless. And my orders are to weed out all non-hackers who do not pack the gear to serve on my beloved [DEVELOPMENT TEAM]. Do you maggots understand that?”

I have no idea why there’s so much turnover.

Well, per the documentation, the GUID constructor supports the following:


32 contiguous digits:

dddddddddddddddddddddddddddddddd

-or-

Groups of 8, 4, 4, 4, and 12 digits with hyphens between the groups. The entire GUID can optionally be enclosed in matching braces or parentheses:

dddddddd-dddd-dddd-dddd-dddddddddddd

-or-

{dddddddd-dddd-dddd-dddd-dddddddddddd}

-or-

(dddddddd-dddd-dddd-dddd-dddddddddddd)

-or-

Groups of 8, 4, and 4 digits, and a subset of eight groups of 2 digits, with each group prefixed by “0x” or “0X”, and separated by commas. The entire GUID, as well as the subset, is enclosed in matching braces:

{0xdddddddd, 0xdddd, 0xdddd,{0xdd,0xdd,0xdd,0xdd,0xdd,0xdd,0xdd,0xdd}}

All braces, commas, and “0x” prefixes are required. All embedded spaces are ignored. All leading zeroes in a group are ignored.

For every case but the last one, this improved regex should work:

“^[A-Fa-f0-9]{32}$|^({|()?[A-Fa-f0-9]{8}-([A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}(}|))?$”

And with the final optimization (IsGuid2 becomes a one-liner .IsMatch) this is still slightly faster than the GUID constructor in my testing.

So you’re saying you liked the .Parse solution better than the Regex solution in this situation?

Faster, shmaster. What about the elegance? Couldn’t we use a design pattern of some sort to increase the elegance? I’m all about the elegance.

Also I’m concerned about heap contention and fragmentation. Which of these makes the fewest number of memory allocations? Bzzzzt! Wrong answer, we’re doing it my way now…

Yes.

I realized I missed another optimization, too! IsGuid2 should be a one-liner:

return r.IsMatch(guidString);

That shaves another 18% off:

0.5468
0.4531

And it means the regex, in the final compiled non-debugger run, is now consistently faster (!) than using the Guid constructor.

And that’s with zero exceptions, which means if there were exceptions, the perf gap would widen considerably.

Not that it matters. But I’m just saying.

Sorry, for a second there I channeled the ghost of a guy I used to work with in discussion eerily similar to this.

^[A-Fa-f0-9]{32}$|({|()?[A-Fa-f0-9]{8}-([A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}(}|))?$|^({)?[0xA-Fa-f0-9]{3,10}(, {0,1}[0xA-Fa-f0-9]{3,6}){2}, {0,1}({)([0xA-Fa-f0-9]{3,4}, {0,1}){7}[0xA-Fa-f0-9]{3,4}(}})$

Tested against the guid values documented here:

a href="http://msdn2.microsoft.com/en-us/library/96ff78dc.aspx"http://msdn2.microsoft.com/en-us/library/96ff78dc.aspx/a

Cheers,
Colin

Hey,

If anyone follows C# 3.0, if MS forgets it, one can actualy add a .TryParse() method to Guid if I understand correctly with Extension methods. I opt that we nag them into including it in the next release!

Rick

I am getting the reverse results. The RegEx method never ends up being faster. The try/catch method seems to be 2x faster than the RegEx method.

Are you running on 32 or 64 bit?

The RegEx method never ends up being faster. The try/catch method seems to be 2x faster than the RegEx method.

The code sample has to be modified. Follow the instructions given above:

  1. Increase iterations to 100,000 to get a better measurement.

  2. Hoist the creation of the regex out of the loop. It’s insane to create 100,000 regex objects.

  3. Make IsGuid2 a one-liner: return r.IsMatch(guidString)

  4. Run with CTRL+F5 (sans debugger)

After implementing those changes, on my current machine I end up with 0.073 for the first method and 0.017 for the second method (regex).

Switching to System.Diagnostics.Stopwatch (new to .NET 2.0) produces the same results.

One thing you’re doing on IsGuid() is creating 100,000 Guid objects in the first version, but not the second. Those objects will need to get collected by the garbage collector, which might be running periodically. That might be screwing up your metrics. Is there a way to do the cast to an existing object? Or maybe you can inhibit garbage collection.

What the F!!!

There is a bigger reason this benchmark is bogus - it never tries an invalid GUID !!

Should I submit this blog post to the Daily WTF?

You need a benchmark that fails at least some of the time - then your algorithm will kick the ass off the try/catch one.

Geez

Well, I ran the code with Jeff’s changes using good and bad GUIDs and got the following results:

Catching the exception took 0.1902736 seconds for 100,000 good GUIDs, but 1.8226208 seconds for 100,000 bad GUIDs.

Using a regular expression took 0.250360 for 100,000 good GUIDs, but only 0.0600864 seconds for 100,000 bad GUIDs.

The original code never constructs a Guid object in the Regex case. Presumably, if you are going to test the validity of a Guid, you’re likely going to want to do something with it requiring constructing a Guid object anyways. Granted a Guid.TryParse would be nicer.

And then there’s the issue of the all those Guids that gave their lives for these tests. The horror! :slight_smile:

you’re likely going to want to do something with it requiring constructing a Guid object anyways

Well, maybe, but not really. The string representation of a GUID is usually all you need.

And the RegEx problem is even worse in the pre-Whidbey releases of the CLR before LCG (lightweight code generation) existed. Temporary assemblies are generated during construction and those couldn’t be GC’d prior to Whidbey. This same problem applies to XmlSerializer and XSLTransform as Tess Ferrandez discusses on her blog (which I think is second only to yours and Rico Mariani’s for sheer geektasticity):

[snip]

If we take a look at the XmlSerializer constructor with Reflector we find that it calls

this.tempAssembly = XmlSerializer.GenerateTempAssembly(this.mapping, type, defaultNamespace, location, evidence);

which generates a temp (dynamic) assembly. So every time this code runs (i.e. every time the page is hit) it will generate a new assembly.

The reason it generates an assembly is that it needs to generate functions for serializing and deserializing and these need to reside somewhere.

Ok, fine… it creates an assembly, so what? When we’re done with it, it should just disappear right?

Well… an assembly is not an object on the GC Heap, the GC is really unaware of assemblies, so it won’t get garbage collected. The only way to get rid of assemblies in 1.0 and 1.1 is to unload the app domain in which it resides.

And therein lies the problem Dr Watson.

Temporary assemblies are also used for regular expressions, as well as script blocks in XSL Transforms. In the case of the script blocks I would suggest using XSLT extension objects or cache the transform.

[snip]

Here’s the original post for anyone who is interested:

http://blogs.msdn.com/tess/archive/2006/02/15/532804.aspx

Clarification…I didn’t mean to imply assemblies can be gc’d starting with Whidbey (obviously they are not allocated on the GC heap but rather in internal CLR memory so it’s apples and oranges). The point is assemblies can’t be unloaded without tearing down the app domain, but with LCG, they’re no longer generated in the first place.

http://msdn.microsoft.com/msdnmag/issues/06/01/CLRInsideOut/#S7

Well, maybe, but not really. The string representation of a GUID is usually all you need.

From a type safety perspective, it’s better to pass around Guid objects. If you’re going to consume the Guid immediately, then yeah the string will work.

What a cool thread! I returned to college in my 40’s and graduated with a BS Computer Science in 2002. Now I am a semi-senior C# programmer, but discussions like this make me feel like an infant. Thanks, gentlemen, for the stimulating discussion, and the information.