Exploring Wide Finder

codinghorror · June 9, 2008, 12:00am

I have decidedly mixed feelings about the book Beautiful Code, but one of the better chapters is Tim Bray's "Finding Things". In it, he outlines the creation of a small Ruby program:

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/06/exploring-wide-finder.html

Bill99 · June 10, 2008, 12:00am

First of all, to even consider code size makes me wonder about the goals of the original author. Complexity and readability are critical, but they have virtually no relationship to code size. The only time you would want to consider code size without other variables would be to promote a language.

In fact, if you count all the utilities that the ruby solution uses, it’s probably thousands of line of code long. If I write and use a utility, I can make my code much smaller than that ruby solution! Or is there some arbitrary rule that nothing counts if it ships with the language?

I haven’t seen anyone approach the real problem–the fact that all the solutions I’ve seen move text, and most read text in a line at a time.

If you really had to speed it up, there are some definite areas to attack:

read the whole thing into a buffer at once. Use a call that makes use of DMA if you have control of that.
Run through the buffer without moving code/creating objects for each line.
Avoid regular expressions.

You probably jumped straight to “Oh, that’s complicated. Lots more code”. If you write your code correctly, it will be longer (should be), but it should also be more readable than the ruby code above. The main function would probably be cleaner because the sub-functions you use will be tailored to your requirements, not generic in-language solutions.

If you still want to go multi-threaded, calculate the center of the buffer and search for the first carriage return after it; mark that spot. Pass a pointer to the buffer and an index/length of each half to different threads.

If you want to use more than two threads, just recurse.

See, the real problem here is the language. Ruby and all these other “Elegant” languages give you these tricks that make you think you are implementing a great solution because your ruby code looks small, but actually these tricks often lure you into a pretty poor solution that no good programmer would use if he could actually see what he was doing.

I’m not saying optimizing is a good idea–it’s a last resort; and if ruby is fast enough for you and you feel it makes you code faster–great…

But I also really agree with the callout in the article saying that it’s an illusion. Personally I worked with Ruby for a year and never got the hang of it–I can’t read the example given because I forget all the tricks.

I prefer a language with as few surprises, syntactical exceptions and tricks as it can possibly have. Explicit, clear, long (if necessary) code is fine. I can type a lot faster than I can troubleshoot someone elses’ “Short” code, and if the code is done right, there is no reason any given method should be significantly longer.

Kenneth · June 10, 2008, 12:00am

I know this is kind of off-topic but, Jeff, I just read your response to the offer to contribute to Beautiful Code. I really enjoyed it. Especially where you explain how even highly intelligent, gifted programmers can’t hope to keep pace with the complexity of modern projects. And if that’s true then a programmer like me doesn’t have a snowflake’s chance in hell!! So I learn as much as I can and follow good practices and strive not for beauty but function. Keep up the good work.

Jorge · June 10, 2008, 12:00am

It’s funny to read a pragmatic blog about a unpractical language such the “beauty” ruby.

Ray_Waldin · June 10, 2008, 12:00am

My favorite quote on this subject comes third hand via http://rickyclarkson.blogspot.com/2008/01/in-defence-of-0l-in-scala.html

“Optimising your notation to not confuse people in the first 10 minutes of seeing it but to hinder readability ever after is a really bad mistake.”

-David MacIver

Davidf8 · June 10, 2008, 12:00am

Step 1: Use Java
Step 2: Use Jakarta Oro for the regex using precompiled pattern matchers before the entrance to the main loop. It’s ridiculously faster then the default Java regex and precompiling the patterns also is a huge speed increase.
Step 3: Use threads to do the matching… but actually retrieving the line might be complicated if called from multiple threads… if we could abstract (yes gasp) it into a datasource like object which was thread safe this would be ideal (oh no, class explosion!!!)
Step 4: Use a Concurrent HashMap which uses Threadlocal to store the results from each of the threads
Step 5: Profit. Seriously, I’ve seen Jakarta ORO beat optimized Perl regex, that library kicks ass.

Weighing up what the merits of the above solution, it would probably be more complicated in the threading aspect - it appears that little asterisk thingo in Ruby is something I’m going to have to look at… but apart from that, this code should be pretty damn fast.

mrm · June 10, 2008, 12:00am

I don’t get it regarding size. Okay it’s a contest of sorts, and it’s good to have constraints, but I feel it’s not a fair indicator. A large standard library gives you much more power, but it’s the old ‘my program’s only 2k, but just download the 22M runtime first’

The solution method that invokes the processing should be short and readable. In class, many moons ago, I had to create a hashtable in C. It certainly wasn’t easy, but why should a c implementation be removed just because the guys who wrote perl decided to include hashmaps in the language as first class citizens.

I’m not dissing newer languages that have these constructs built in. I get a hell of a lot more done with C# than I ever could in C, and I can focus more on the problem, not the data structure. It’s a win for me, and all I care about at the end of the day is getting work done by writing expressive code and not rewriting a hashtable or linked list for the millionth time. C# isn’t perfect by any stretch, but it’s fun to work in.

In any event, this sounds like a good candidate for MapReduce (http://labs.google.com/papers/mapreduce.html).

Guy · June 10, 2008, 12:00am

Am I allowed to use SQL?

I’m sure, on a correctly configured SQL Server or Oracle DB, that I can produce results with an “elegant” SELECT FROM GROUP BY (Oracle Analytics anyone?) that will return results a lot quicker than 8 minutes.

Use the right tool for the job. A DB is inherently multi-processor friendly.

codinghorror · June 10, 2008, 12:00am

P.S. Kev: That fad has been around since the invention of Perl. That’s why it’s called Perl Golf.

Ah yes, Perl Golf!

http://www.codinghorror.com/blog/archives/001025.html

owen1 · June 10, 2008, 12:00am

I’m surprised nobody has mentioned the playstation 3 in here. It seems that some hardware-nut thought “hey, if we put on more processors, things should be faster shouldn’t it?”. Unfortunately current programming theories weren’t designed to be “multi-core”. Multi-core performace as far as I can see will only be achieved (if at all) in a multi-core language that has no code blocks or streams - at totally new (or maybe unknown to me) programming technique.

Saw that 3d language that they had in the movie “Contact”? yeah something like that.

Somethings benefit from it like complicated stream systems example video but faster what we want - not more.

codinghorror · June 10, 2008, 12:00am

–
programmers like being able to say “Run this code on each line of this data” and that’s reasonable. So what we want is a really cheap way for them to add "… and I don’t care what order you run ’em in, or if a bunch run at once”.

http://www.tbray.org/ongoing/When/200x/2007/11/12/WF-Conclusions

This is sort of what Parallel Linq extensions (mentioned by a few commenters, above) does.

http://msdn.microsoft.com/en-us/magazine/cc163329.aspx

MarkA · June 10, 2008, 12:00am

@Cedric

The quality and speed of code produce is completely
unrelated to the size of the source code

This is completely correct, when the source code is being compiled.

However, interpreters don’t produce code, and Ruby is an interpreter.

Kris · June 10, 2008, 12:00am

Not sure about the Ruby (coz’ I don’t really speak it but it doesn’t look bad to me). The regex however I’d do differently

by changing this:
GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)

into something like
GET /ongoing/when/[\d]{3}x([\d]{4}/[\d]{2}/{[\d]2}[^.]+)
or rather go for [0-9] instead of \d like so:
GET /ongoing/when/[0-9]{3}x([0-9]{4}/[0-9]{2}/{[0-9]2}[^.]+)

So you can read how many digits are required instead of having to count them. (which isn’t as bad in a monospace font on account of the forward slashes doing some grouping)

If I didn’t make too many typos al regexes should do the same, but I’d go for 3 for being the easiest on my eyes. (personal preference, i know)

I also agree with a previous poster it would be more “scalable” (love the buzzword) to maintain statistics at runtime, on the other hand… that would deprive us all of this little exercise.

Anyhow, thanks for being the gazillionth person to remind me to look into Ruby a bit.

Kris

Kris · June 10, 2008, 12:00am

And immediately after posting… I see I did make typos:

GET /ongoing/when/[\d]{3}x([\d]{4}/[\d]{2}/[\d]{2}[^.]+)
or rather go for [0-9] instead of \d like so:
GET /ongoing/when/[0-9]{3}x([0-9]{4}/[0-9]{2}/[0-9]{2}[^.]+)

(there shouldn’t be accolades around the class, just directly after it to indicate how many characters of the described class to match)

Kris

Matt_V · June 10, 2008, 12:00am

A translation for non-rubyists, in comments

// creates a new hash
// this hash will return zero if the key or value does not exist

// for each line in the arguments, with that line do the following
// if the line matches the given regex
// increment the value for the key matching the backreference
// (end the if)
// (end the iteration)

// sort the hash keys in order of descending hash values (the part in the {} is the sorting algorithm)
// for each key from keys_by_count[0] to keys_by_count[9]
// output the key value followed by the key itself
// end the iteration

Matt_V · June 10, 2008, 12:00am

actually - to clarify one part:
{ |a, b| counts[b] = counts[a] }

this is a comparator, it compares a and b using counts[b] and counts[a] and = returns -1, 0 or +1 based on the comparison

MitchB · June 10, 2008, 12:00am

Very interesting, but what is it about our fascination (read: total obsession) with programming languages? Why not take the time to design the solution first and then determine what is the most appropriate programming language (read: tool) for the job?

I think we software developers (or is that engineers?) could focus more on the design and less about which (read: insert your favorite) programming language to use. As a suggestion, why not use Alloy to design the solution first. You can try it for free at: http://alloy.mit.edu/ Maybe the issue is that the design is wrong in the first place?

Jheriko · June 10, 2008, 12:00am

“Even if you have no idea what Ruby is, this simple little program isn’t too difficult to decipher.”

I hate to completely disagree… but I do. I had to look more than one thing up to work this out. I do not normally have this problem with Java or PHP, or anything else which at least sticks to relatively normal grammar and syntax (i.e. C/Pascal style)

Surrounding things with | and using = ???

I have to say I was put off of Ruby by the learning curve… its not impossible, but I don’t feel its worth it.

It is one of two languages to have achieved this for me, along with Haskell… I picked up and ran with Java, JS, C, C++, VB, C#, PHP, Python and numerous other small ones not worth mentioning. I was only ever taught Pascal and Ada…

Not to ignore the rest of the blog. Its a shame that compilers can’t solve the parellisation problem yet… but its to be expected. The necessary information just isn’t there for the compiler to optimise most of the time. Its difficult to identify a canonical for loop for example…

Personally I feel the immediate future for this is something like OpenMP where the programmer personally marks up what is easy to parallelise to help the compiler out. A compiler that parallelises general procedural code may in fact be impossible… it has not been shown to be impossible afaik, but its not been shown to be possible either…

Niyaz_PK · June 10, 2008, 12:00am

Let me try.

Niyaz_PK · June 10, 2008, 12:00am

Meanwhile you could read the article:
Which Programming Language Should I Use?
http://www.diovo.com/?p=49

Exploring Wide Finder

– programmers like being able to say “Run this code on each line of this data” and that’s reasonable. So what we want is a really cheap way for them to add "… and I don’t care what order you run ’em in, or if a bunch run at once”.

–
programmers like being able to say “Run this code on each line of this data” and that’s reasonable. So what we want is a really cheap way for them to add "… and I don’t care what order you run ’em in, or if a bunch run at once”.