Java vs. .NET RegEx performance

I was intrigued when I saw a cryptic reference to " the lackluster RegEx performance in .NET 1.1" on Don Park's blog. Don referred me to this page, which displays some really crazy benchmark results from a Java regex test class-- calling C#'s regex support "20 times slower than [Java]." First of all, them's fightin' words, and second, those results are just too crazy to really make sense. Why would C# be over an order of magnitude slower than Java at a classic computer science problem like a regular expression parser? I don't believe it.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2004/08/java-vs-net-regex-performance.html

As promised here are some benchmarks from a few machines around the house. I ran each benchmark 4-5 times and took the best reported score.

Athlon 64 3800+ (2.4ghz)
.NET 170ms
Java 265ms

P4 3.2 (3.2ghz), w/hyperthreading
.NET 185ms
Java 330ms

Athlon XP 2400+ (1.7 ghz), dual proc
.NET 250ms
Java -

Pentium M (centrino) 1.2ghz
.NET 320ms
Java 541ms

  • I didn’t want to install the JDK on my server

Jeff,

The benchmarks I did were done back in Feb. of 2003, so are relevant in that period. So accusing me of smoking crack is a little below the belt.

Now maybe you need to provide the .NET source afterall you’re jumping through some peculiar hoops to get your timing “right”. How exactly are you timing both versions? Don’t time individual calls, time the aggregate result. Finally, let us all see the code!

Jeff,

Take a very good look at your own results and you’ll realize that Java is indeed faster at Regex than .NET.

Look at the averages and the max timings and you’ll see the Java bests .NET in almost all the cases. Now for some reason the aggregate times show the opposite. This of course can only be attributed to the way you measure time. That is you use a different routine and this should account for the disparity.

So, the two conclusion that you can make with your results are:

(1) Java still is faster with Regex than .NET
(2) Using QueryPerformanceCounter can gives you more precision and faster timing than DateTime.Now.Ticks.

Finally, why don’t you try using 1,000,000 iterations instead of the measly 10,000. The result will surely put you in complete shock!

Carlos

Carlos,

  1. There is a bug in the original code’s innermost timing loop, as I pointed out in my blog entry, and in my commented port of the source to VB.NET. Each iteration of that loop is calculating elapsed time using a start time from OUTSIDE the loop rather than at the start of each iteration. See my comments in the source. This will definitely cause the aggregate total times (which are added up from each regex time) to be off. What can I say, I didn’t write the code.

  2. The java System.currentTimeMillis timings seem to have the same precision as DateTime.Now.Ticks, eg, about 10ms. Notice how the “minimum” execution time in the Java code is 0ms? That’s because it was too fast to be measured with that granularity (or, maybe it executed in zero time? lol). The average per-regex time on today’s fast machines is going to be highly unreliable with a timing granularity of 10ms. My .NET port uses the QueryPerformanceCounter call which harnesses low-level processor support for nanosecond type timing accuracy-- those timings are granular to a few nanoseconds and definitely valid (minus the bug in the original source). Notice my minimum times are actual numbers… Basically both System.currentTimeMillis and DateTime.Now.Ticks are worthless for timing these kinds of single-regex operations because they happen way too fast. And there’s a bug on top of that!

  3. the total loop count is 3 (regexes) * 4 (strings) * 10,000 (iterations) = 120,000. I tried this using 100,000 iterations for a total of 1,200,000 calls. Results (on the Athlon FX-53):

Java: 2531ms
.NET: 1772ms

  1. Precisely, with the timing code in the inner loop you are in essence benchmarking your time code! There’s an enormous difference between counting seconds and figuring out exactly what time it is. You are measuring the former versus the latter. Get rid of the inner loop timing and just get the aggregate time.

  2. Your result show Java being faster.

  3. Do 1,000,000 iterations * 3 (regex) * 4 (string) and just get the aggregate timed result. There’s absolutely no need for high resolution timing when you are recording aggregate results that may take a couple of seconds.

Carlos

Do you actually read anything I post? As I said in the original blog entry that you are commenting on, I do TWO PASSES.

Pass #1 - 120,000 regexes
Low-level timing of each regex call via QueryPerformanceCounter. No total time is recorded.

Pass #2 - 120,000 regexes
No timing of the regex calls. Only total execution time is recorded.

All the total times you have seen are recorded in pass #2. Trust me, these numbers are correct. Don’t take my word for it-- download the source and try it yourself. The output shows the results from both passes. Detailed timing from pass #1 (with the source bug, which is why the totals don’t add up), and total time from pass #2.

You’re right that the overhead of calling the timing code 120,000 times intereferes with total execution time. I pointed out the same thing in the original blog post:

“One side effect of this is that I have to make two passes to get all the benchmark results: the first pass times each of the 120,000 regex calls individually into an array, and the second pass times just the total execution time. You’ll notice that the first pass takes twice as long; that’s due to the overhead of calling QueryPerformanceCounter 120,000 times. The upside is that I provide far more accurate timing results, as you can see in the table above.”

Also I did run 1,000,000 iterations. The results were predictable. Multiply the result by 10…

10,000 iterations : ~170 ms
100,000 iterations : ~1.7 seconds
1,000,000 iterations : ~17 seconds

Same for Java. Just multiply time by 10-- still 40% slower.

It’s in the article:

“You may be interested in my VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so you can run this test on your own machine.”

http://www.codinghorror.com/files/code/regexbenchmark.zip

Also your article is dated December 2003!

http://www.manageability.org/blog/archive/20030210%23p_i_found_a_a/document_view

Jeff,

There you go, that the reason you VB code is faster, you don’t check for time in the inner loop when you get the aggregate time.

That is why when you measure each call Java consistently is faster.

That is why when you measure the aggregate call, Java is longer, that’s because you have the timing code inside the Java version and in the VB version you DO NOT have it.

If you want to compare apples to apples then you should compare indentically functioning code. For example here is one you should run on your machines:

http://www.manageability.org/blog/archive/20030520%23p_the_problem_with_cameron/view

Carlos

Jeff,

I double checked your Java code and removed timing information to see what the difference would be. The timing code slows Java down code by about 12%.

The original benchmarks (Feb 2003) see: http://www.jroller.com/page/ceperez/20030210#p_i_found_a_a/document_view
you pointed to a version that was archived in Dec 2003. To the best of my knowledge was correct at the time. Since then Microsoft may have made improvements to the regex class.

However, the orginal benchmarks do not stress regex enough to be informative. The benchmark link in my previous message should do a better job at that.

Carlos

It’s true that I didn’t run a version of the Java code without per-regex timings. So I commented out the individual regex times:

// long iterStarTime = System.currentTimeMillis();
// timeTaken[regnum][itter][strnum] = ( System.currentTimeMillis() - iterStarTime ) ;

Total time is still 265ms with or without these lines. I tried it both ways, 4-5 times.

Bear in mind that QueryPerformanceCounter is a windows API call so it has a lot of overhead associated with it… DateTime.Now.Ticks is much, much faster, and it looks like System.currentTimeMillis() is also too fast to make a difference.

But these calls both lack timing precision…

In case your interested I ran Jeff’s benchmarks using the beta JDK 1.5 and VS 2005 beta using 100000 iterations, and measured the following total time ( I disabled the inner time measurements).

Java 1.5 JDK: 3282 secs.
VS 2005: 2637 secs.

I’m seeing results simular to what jeff is seeing.

Carlos, in you work do you typicall find Java to be signficantly faster then .NET? The reason I ask is that I do a lot of performance oriented work, and I generally find .Net to be 5 to 20 faster in most cases. There may be a library here or there where the java implementation is better (for example most of the collection classes), and more performant, but for the most part .net has a slight performance advantaged. I will say the 1.5 JDK java is get closer.

Also the your link to more comprehensive regex bench mark showing that java is faster is a bit miss leading. This particular benchmark creates 1000’s random of regex, which the .net implementation attempts to cache. Due to the randomness of the regex’s this caching causes signfincant overhead. In other more real world senarious .nets caching may be a benifit.

terry

Sorry for the confusion.
The previous post sould have said:
“I generally find .Net to be 5 to 20 percent faster in most cases.”

Just to calrify, I do like java, and I think it is very well done. Same goes for .net. So any time I see something like x is 5x faster then y. I get suspious. I hope I didn’t add anything to the hysteria.

Terry,

Interesting results, thanks for posting them.

What are your PC specs just out of curiosity? Did you notice that Java REALLY likes the Athlon 64-- it is disproportionately faster on the A64 compared to the Pentium 4.

Did you do any testing of compiling regular expressions vs running the regular expressions? ie, is the regular expression compiled each time you use it, or are you just compiling each once and running the tests on the resulting state machine?

That’s a function of the regex engine. I know how .NET does it (you have to pass a special flag to get compiled regexes), but I can’t speak for Java.

I did a direct 1-1 port of the existing code. I suggest downloading the sample solution linked in the post, which contains both the .NET and Java classes used.

jeff,
i am interested in your VS.NET 2003 console solution which includes both the stripped down java class and the VB.NET equivalent, so that i can run this test on my own machine, but the link u provided is having a corrupt zip file, can u plz provide me the link

It was a problem with the server config, ZIP files were being compressed with GZIP compression. I think Windows Update overwrote part of my metabase.xml IIS config file…

For what it’s worth, I get nearly identical timing results in VS.NET 2005 (final RTM) as I do in VS.NET 2003.