Exploring Wide Finder

Jheriko · June 10, 2008, 12:00am

that should be “using <=> ???”

seems that the anti-injection code here is a bit fussy about the greater and less thans…

Jeff: can you not just escape all of the characters in the text? there are normally some helpful functions for this sort of thing… or is it to protect against some awesome attack that I am not aware of?

Jheriko · June 10, 2008, 12:00am

got my lt and gt backwards…

speaking of which, it says here above the comment input box “(no HTML)”. But isn’t &gt; HTML? for instance…

Scott · June 10, 2008, 12:00am

I don’t know… I think it’s very easy to get into Ruby code that makes no sense at all if you aren’t familiar with the language.

“each” is simple. “collect”, not so much.

ChrisB · June 10, 2008, 12:00am

GET /ongoing/when/\d{3}x/(\d{4}/\d{2}/\d{2}[^.]+)

Jheriko · June 10, 2008, 12:00am

i am probably too harsh, it’s mainly the minutiae that get me… i can see the gist of the program, but understanding exactly what it does needs me to look things up.

“keys_by_count[0 … 9].each do |key| … end”

looks nothing like anything i see in another programming language… and it’s not unambiguous as to what it could do until i look it up…

some kind of for each loop… /i guess/.

“keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }”

is even more confusing…

what does this do… well i have to guess massively and say that it sorts the keys and stores them in a new array… but the bit inside the {} is completely ambiguous. it does something to do with pairs of values and something which /looks like/ it might be a swapping operation.
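For anyone stuck on the same two lines, here is a minimal sketch of what they do. It assumes the operator the comment filter mangled is Ruby's `<=>` comparison, which returns -1, 0 or 1, so the block is a comparator rather than a swap. The `counts` hash is made-up sample data, not anything from the original post.

```ruby
# Made-up sample data: article key => number of hits.
counts = { "a" => 3, "b" => 10, "c" => 7 }

# sort with a block: the block receives two keys and must return -1, 0 or 1.
# Comparing counts[b] against counts[a] (b first) gives descending order.
keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }

# keys_by_count[0 .. 9] is a slice: the first ten elements, or fewer if
# the array is shorter. each then runs the block once per element.
top = keys_by_count[0 .. 9]
top.each do |key|
  puts "#{counts[key]}: #{key}"
end
```

The same ranking can also be written as `counts.sort_by { |_, n| -n }.first(10)`, which returns key/count pairs and which some readers may find easier to follow.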

It would help if i had more patience… but I prefer to think of it as “every other programming language does fine, this one must just suck” rather than “i must suck because this one language is slightly more difficult”. as arrogant as that position may be… it seems more reasonable

If Ruby offered something new I would have learned it fine tbh… it’s just difficult enough to not be able to “pick up and run with” like almost everything else out there… but honestly, it wouldn’t let me do anything I can’t already do.

Mike_Arthur · June 10, 2008, 12:00am

Maybe I’m being too harsh but this article sums up perfectly why I’ll never use Ruby for anything and why I wouldn’t recommend using it for anything that is either in production or other programmers need to read.

Ruby is very, very, very slow in seemingly every situation. I’ll happily accept the performance loss to go from C++ to Java and eliminate memory-management errors but I’m incapable of seeing what benefits would offset the pathetic performance of Ruby. People keep telling me “in a lot of projects speed doesn’t matter” but in no serious business do resources never matter.
As mentioned above, C/C++/Java/C#/PHP all share very similar syntax and most developers know at least one of the above so I just can’t understand why Ruby’s vastly different syntax is so massively superior?

Twitter is the worst possible evangelist for Ruby. Their uptime is fairly pathetic for a serious outfit and they’d probably have been quicker to rewrite Twitter in a “serious” language by now rather than desperately trying to work out how to hack RoR into a serious production environment.

I welcome your imminent flames

HB166 · June 10, 2008, 12:00am

That certainly does seem to use several more “Special Characters” than other languages I’ve worked with. I thought C++ was a little annoying with the goofy -> and :: stuff…

if line =~ %r{GET… WTF?
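The line is less exotic than it looks. As a rough sketch: `%r{...}` is an alternative regex literal, `=~` is the match operator, and `$1` is the first capture group of the most recent match. The pattern and the log line below are illustrative stand-ins, not necessarily the exact ones from the post.

```ruby
# %r{...} is another way to write a regex literal; unlike /.../ it lets
# the pattern contain slashes without escaping them.
pattern = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Made-up log line in roughly the shape the pattern expects.
line = '1.2.3.4 - - [10/Jun/2008] "GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1" 200'

# =~ returns the match position (or nil). On success Ruby also sets $1
# to the text matched by the first parenthesised group.
from_global = nil
if line =~ pattern
  from_global = $1
end

# The same capture without the implicit global variable.
match = pattern.match(line)
from_match = match && match[1]

puts from_global
```

Using `match[1]` instead of `$1` costs one extra line and makes it obvious where the value comes from, which may suit readers who dislike the Perl-style globals.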

Rob_Janssen · June 10, 2008, 12:00am

This is a program to count the most common HTTP GET URL entries in a webserver log file

Why not do that directly when the HTTP GET requests are made? Log files are for post-mortem stuff, statistics should be updated immediately.

I’m reminded of the electric bicycle steering handlebar grip-heater (I think I saw this on TheDailyWTF, but I’m not certain). Correct me if I’m wrong.

Someone on a corporate newsgroup was complaining about having cold fingers when he arrived cycling to work. Someone else replied with sympathy and said that maybe it was a good idea to use a little electric circuit to heat the handles. This concept mushroomed in scope and grandiosity until someone sane - after half an hour or so - told the original topicstarter to just wear gloves.

Now -that- is beautiful code.

titrat · June 10, 2008, 12:00am

I couldn’t help it, but I found this C#-Linq solution to be more readable for me. Reading means understanding here, instead of translating in my mind.
There is too much magic in Ruby for my taste, like the $1 - where does it come from? Heaven?

Source:
http://jcheng.wordpress.com/2007/10/02/wide-finder-with-linq/

IEnumerable<string> data = new LineReader(args[0]);
// (LineReader is not a built in function)

Regex regex = new Regex(@"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/([^ ]+) ",
    RegexOptions.Compiled | RegexOptions.CultureInvariant);

var result = from line in data
             let match = regex.Match(line)
             where match.Success
             group match by match.Groups[1].Value into grp
             orderby grp.Count() descending
             select new { Article = grp.Key, Count = grp.Count() };

foreach (var v in result.Take(10))
    Console.WriteLine("{0}: {1}", v.Article, v.Count);

Kev · June 10, 2008, 12:00am

“Tim calls Ruby “the most readable of languages”; I think that’s a bit of a stretch, but I’m probably the wrong person to ask,”

I have a reasonable smattering of Ruby under my belt but I find Ruby no more readable than any other language that I have familiarity with. It all comes down to how the programmer expresses his or her intent. I’ve seen super readable C#, ASP, VB, Perl, Python but also conversely, code that’s an utter mess, the same goes with Ruby. Ruby is nothing special. It’s just another language in a sea of languages. Yes it may have some cool features to make the expression of a programmer’s intent more concise, but abuse/misuse of these features can make even just a few lines of code look sociopathic. You only have to look as far as ternary operators in C++ or C# to see where that can lead.

There’s also a current fad just now with new Ruby aficionados, converts and zealots to see how much they can compress their intent into as few lines as possible using every trick in the book. I find this obfuscates Ruby code as much as it does C#, Perl or any other language that lets you pull off these tricks. You just end up with another blob of ‘write-only’ code.

Just my 2c from current observations.

Jeff_Davis · June 10, 2008, 12:00am

You won’t get any flames here. I haven’t yet figured out why any programmer would like Ruby. I understand why the web designers like it, but that’s because they are… well… designers, not programmers.

Designers like pretty things, programmers like things that work. It’s the fundamental difference of the job description.

Zeroth · June 10, 2008, 12:00am

Well, here’s the thing. Ruby still has a naive interpreter, and as optimized as you can get such things, they are still deadly slow vs the VM interpreters. Python, Java, and the .Net languages are all compiled to bytecode, which is then run on a VM. You can optimize VM’s to be hideously fast, and achieve huge performance gains there.

So no matter how clever you can get with the code, until Ruby moves to a VM implementation, it will still be the last of the pack. It also makes it particularly difficult to efficiently instantiate and manage threads in a naive interpreter, vs a VM.

I don’t know how Python achieves such speed gains, since it’s still single-threaded, unless you use stackless… checks the link

Craig · June 10, 2008, 12:00am

in a perfect world, a one-character change to the original Ruby program would be all it takes to enable all the necessary multicore optimizations.

In a perfect world, the compiler would handle the multicore optimizations without a code change.

bandini · June 10, 2008, 12:00am

I guess without OS or language support, there won’t be any progress in this field.

I only know python and I use it as an enhancement over shell script, but I guess the map function for example would be the perfect place for the interpreter to automagically spawn some thread…
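To make the idea concrete, here is a minimal sketch of a thread-spawning map, written in Ruby to match the rest of the thread. `pmap` is a hypothetical helper, not a built-in. In the standard Ruby interpreter, threads do not run Ruby code on several cores at once, so this shows the shape of the idea rather than a real speed-up.

```ruby
# Hypothetical helper: a map that runs the block for each element in its
# own thread and collects the results in the original order.
def pmap(items, &block)
  threads = items.map do |item|
    # Pass item as an argument so each thread gets its own copy of the reference.
    Thread.new(item) { |x| block.call(x) }
  end
  # Thread#value waits for the thread to finish and returns its block's result.
  threads.map(&:value)
end

squares = pmap([1, 2, 3, 4]) { |n| n * n }
puts squares.inspect
```

One thread per element is only reasonable for small inputs; a real implementation would presumably hand chunks of the list to a fixed pool of workers.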

Bill100 · June 10, 2008, 12:00am

Well Jeff, once again you’ve shown that it is a bad idea to have any sort of digression in your blog posts because the comments will inevitably be mostly about the digression and not the main point of the post. grumble, grumble

PS. Yeah, I found parts of the ruby code example to be impenetrable as well. Shocking!

ElvisM · June 10, 2008, 12:00am

Disclaimer: Ruby neophyte here.

I’ll have to side with Jheriko on this one. That snippet of code is not the clearest of them all. I had to scrutinize the code for a few minutes before I could understand its purpose. Personally, I think that’s the last thing you want as a developer. Aren’t we expected to strive for clarity?

PeterB · June 10, 2008, 12:00am

I saw some interesting thoughts on multi-threading presented by a fancy researcher at Microsoft’s 2008 products launch.

At the time there seemed to be some great ideas there for the future - but I can’t remember what he said now

MikeD · June 10, 2008, 12:00am

I don’t think core parallelism will help you much on this problem, actually. That’s why it runs like a dog on almost any language. The problem is, at its core, one of disk and memory caching and of paging. If your processing - whatever it is - can’t beat the disk in retrieving the next block, you have a problem. It doesn’t help to parallelize the CPU-bound parts if your problem is overall I/O-bound. A ‘warm’ run will likely still have the file in the disk cache, so you have to be careful to do a complete flush before testing. If you don’t, the numbers are meaningless. You’d also do well to ensure the file is defragmented and contiguous - disks are better at sequential access than at random access - and ensure that there won’t be other random I/O on that disk.

I expect the difficulty largely to be in computing the hash and sizing the hash table, particularly if you, or your environment, try to resize the hash table while it’s running, causing all the hashes to be recalculated. Imperative languages tend to give you more control, at the cost of having to be more explicit about the algorithm. As the hashtable gets big the OS will start paging your HT out anyway.

If there is scope for parallelism, you have to be very careful to avoid causing locks on shared data structures. For this problem you, or your environment, would be better placed to accumulate results on each thread and combine them together when each part is completed so they’re not touching a shared data structure. An interlocked operation (‘lock-free’ programming) is not free by any means, it will stall the core for many cycles as it has to go and hit main memory directly while asserting exclusive control to that address across all processors.
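The accumulate-then-combine approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: the pattern and log lines are made up, `count_chunk` and `parallel_counts` are hypothetical names, and the whole input is already in memory. Ruby threads in the standard interpreter will not actually use several cores, so the point is the structure (no shared hash, one merge at the end), not the timing.

```ruby
# Illustrative pattern, not necessarily the one from the original post.
PATTERN = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Count matches in one chunk using a hash private to the calling thread,
# so no locking is needed while counting.
def count_chunk(lines)
  local = Hash.new(0)
  lines.each do |line|
    local[$1] += 1 if line =~ PATTERN
  end
  local
end

# Split the lines into roughly equal chunks, count each chunk in its own
# thread, then merge the partial results once every thread has finished.
def parallel_counts(lines, workers)
  slice_size = [(lines.size.to_f / workers).ceil, 1].max
  threads = lines.each_slice(slice_size).map do |chunk|
    Thread.new(chunk) { |c| count_chunk(c) }
  end
  # The merge is the only place where results from different threads meet.
  threads.map(&:value).reduce(Hash.new(0)) do |total, partial|
    partial.each { |key, n| total[key] += n }
    total
  end
end

sample = [
  '"GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1"',
  '"GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1"',
  '"GET /ongoing/When/200x/2006/01/01/Other HTTP/1.1"',
  '"GET /favicon.ico HTTP/1.1"'
]
totals = parallel_counts(sample, 2)
puts totals.inspect
```

The alternative, one shared hash guarded by a mutex, would take the lock once per matching line, which is exactly the contention the comment warns about.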

You and Tim would probably do well to watch Herb Sutter’s presentation to the Northwest C++ Users’ Group “Machine Architecture: Things Your Programming Language Never Told You”, which you can find at http://www.nwcpp.org/Meetings/2007/09.html. (Scroll down for the Google Video recording of the presentation and the PDF of the slides.)

MikeD · June 10, 2008, 12:00am

@Aaron G:

You’d need maybe 20 extra lines, if that, to create some worker threads or dip into the thread pool in C#, and synchronize access to the hash table.

Your program would be blocked on the hash table most of the time. Look up “lock convoy”.

Hk18 · June 10, 2008, 12:00am

I just wonder why there are no submissions in C or similar languages.

I guess, I will give SNet a try sometime for that.

SNet (www.snet-home.org) has the goal to make programming for these environments easier. It creates boxes that communicate via streams, a box is implemented in any language with a language binding (granted, currently, only C and SAC are supported languages).
In the next step, you connect the boxes. With some assumptions (a box has a single input and a single output) you do not use wire mappings (that is, Output 1 from Box 1 to input 11 of Box 42, Output 98 of box 72 to input 55 of box 23, …), but rather nice clean statements like: A…B, that is, the output of A is being piped into box B, or things like A*(termination condition) - A star - , that is, feed the output of A into A again until the termination condition becomes true.
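As a rough analogy only, the two combinators can be mimicked in plain Ruby with boxes modelled as lambdas. This is not SNet syntax, `serial` and `star` are hypothetical names, and checking the termination condition before the first pass is an assumption of this sketch.

```ruby
# Serial composition: feed the output of box a into box b.
def serial(a, b)
  ->(input) { b.call(a.call(input)) }
end

# Star: keep feeding a's output back into a until done.call(value) is true.
# The condition is tested before each pass, including the first one.
def star(a, done)
  lambda do |input|
    value = input
    value = a.call(value) until done.call(value)
    value
  end
end

# Two toy boxes, made up for the example.
add_one = ->(n) { n + 1 }
double  = ->(n) { n * 2 }

# add_one piped into (double repeated until the value reaches 100).
pipeline = serial(add_one, star(double, ->(n) { n >= 100 }))
result = pipeline.call(2)
puts result
```

What this analogy leaves out is the point of the comment: in the real system the boxes communicate via streams and the runtime decides where each box runs.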

Given that, you can do such things fairly nicely.
At first, you have a bottleneck, because the machine has just one IO-device, thus, you create a box that reads data and pumps them into the network. (Or rather, the runtime system will do this automagically, heh.)

After that, you just implement a box B that contains an encoding of a single step for an automaton that parses the regex up there and marks a certain exit condition and possible outputs after finishing. This box is put into a star, that is, B* and bang, you are done.

The only speed caveat currently is the runtime system, as it is not implemented for multiple processes yet, currently it only works for multiple threads in a single process.

However, the nice thing about SNet is that you just need little more work to get that stuff parallel. Granted, it is more work than just adding a tiny star somewhere, but once you have this done, and our research continues well, you will be able to scale as mad as you want.
Let me rephrase that.
If our SNet-Runtime system works properly someday, you can create software that runs on a shared memory machine with … 5 threads, but you can scale it up to run on millions of computers all over the internet just like Seti@Home did, without touching your production code (after a little bit of reorganization)!
And do not fear. The changing of production code mostly is splitting modules apart, which should be easy with enforced, solid APIs.

/This/ is what I call beauty, as I can boldly answer YES to your question, especially because this can stomp other parallelization methods into the ground for more complicated things.

Greetings, Hk