Sorting for Humans : Natural Sort Order

You’re absolutely right, Jeff - there’s no reason that there shouldn’t, somewhere in the libraries for .NET or Java, be a natural sorting algorithm. I know that in Java there’s more than enough locale specific functionality that the argument of different character sets or sorting rules (someone in one of the links mentions that, in Sweden, ‘V’ and ‘W’ sort together) doesn’t really hold water. Besides, it’s Sun and Microsoft - don’t they have motivated interns to do that sort of work?

Though I don’t disagree with the theme that we should do what we can as programmers to reduce cognative friction, make lives easier, and exceed user expectations, I have to wonder, what exactly is coming out unsorted here? Obviously the file names are a contrived example, because let’s be honest, no (rational) human would name files with such a scheme if they ever intended those files for human consumption.

I can think of only a few situations (though admittedly I may be missing some) where sorting ASCIIbetically produces counter-intuitive results. For example, numbers/dates tacked on to the beginning (or end) of file names as an ad hoc method of version control. Maybe, a user ID number (perhaps an integer) that’s used as a primary key in a database might sort out of order. In answer to these “incorrect” sorts, I would argue that the users are not using the proper tools (perhaps because we as programmers haven’t supplied them, or due to other reasons such as cost).

My point is that the types of things that tend to sort poorly in ASCIIbetical are the type of things that we as programmers should be abstracting away in our software so that users don’t have to consider whether something’s sorted “incorrectly.” Certainly there will be situations where abstracting such things is infeasible, or unnecessary, and such a “natural” sort might be appropriate, but when that happens, write the sort, port it out to a library assembly (especially now with .NET 3.0 extension methods this becomes tractable), and get on with it. If you come across a situation where it’s necessary later, link the library into your assembly.

Just don’t put the cart before the horse Jeff.

Leading zeros are your friends.

because let’s be honest, no (rational) human would name files with such a scheme if they ever intended those files for human consumption.

We are thinking as programmers here, ask any normal person to write a list of file names and he/she will write
photo1
photo2
photo3
photo…
photo10
photo11

Natural sort is the way!

If you want to sort like Explorer, you are apparently supposed to use StrCmpLogicalW. See: (http://msdn2.microsoft.com/en-us/library/bb759947.aspx) and (http://www.pinvoke.net/default.aspx/shlwapi/StrCmpLogicalW.html). I think it is only suited to sorting filenames though, because it may not take account of sorting properly for different languages/cultures.

I’m sorry, but I don’t agree. Let’s call it alphanumerical.

z100 should always go BEFORE z2, the same way that table goes before tar.

Also, anyone naming their files name-number and not putting in the extra digits should not order by name, but by date.

And I take offense at Kate Rhodes’ comments. It’s yet another case of stupid users.

Off the top of my head…

Can’t you just treat the whole “name” as a base-62* numbering system?

The number line would look something like:
0 1 2 3 4 5 6 7 8 9 a A b B c C d D e E f F g G h H i I … etc

Convert to decimal in the usual way so:

z1 = (61 * 62) + (1) = 3783
z2 = 3784
z10 = (61 * 62^2) + (1 * 62) + 0 = 234546
z11 = 234547
z100 = (61 * 62^3) + (1 * 62^2) + (0 * 62) + 0 = 14541852
z101 = 14541853

That seems to work.

Downsides are that you need to be able to handle very large numbers to get support a useful filename length and base-62 only covers alphanumeric, you’d need a much bigger base to cover punctuation and unicode.

(Apologies if my maths is a bit wonky on the above. It’s before lunch and my coffee isn’t kicking in. But I think the principle is sound.)

We are thinking as programmers here, ask any normal person to write a list of file names and he/she will write
photo1
photo2
photo3
photo…
photo10
photo11

Why are they choosing file names like that though? Could it be because we’ve failed as developers to provide them tools to give their files more descriptive names?
Why not name those same photos:
birthday cake.jpg
football cheerleader.jpg
Pam smiling.jpg
zoo animals.jpg

That certainly seems more like the kind of names a user would pick. I would wager if you asked a person if they’d rather have files named rationally with a description of their content, rather than some arbitrary numbering scheme, just about everyone would rather have descriptive file names.

Let’s solve the root of the problem before so that we don’t have to continually solve the symptoms.

My Ruby version of Dave Koelle’s algorithm:
http://rikkus.info/arch/sensible_sort.rb

Leading zeros are your friends

How many leading zeros?? 3, 4, 10??

Then you have to type them in to access whatever the thing is you’ve indexed with them - really unfriendly when there are more than about 3.

More thought required methinks.

Why are they choosing file names like that though?

How many keyboards have you ever seen on a digital camera?

Certainly computers should always do what the user wants, but in this circumstance I doubt that you would find a consensus amongst users at what the proper order should be anyhow.

Appending numbers to the end of filenames is a dead give away that the file was produced by a machine and in that case the name itself is not relevant. I mean if every single one of your digital camera pictures start with “PIC” and you had dozens of folders with PIC1.JPG then putting them together and sorting them will just end up with a nonsensical order anyhow.

Sorting them by the date would be much more meaningful to the user.

I also agree with one of the earliest posters who said this wasn’t a sorting problem at all but a compare routine problem that the sort program calls.

How many keyboards have you ever seen on a digital camera?

Well, duh! The camera has no way of knowing if you’re taking a picture of a pumpkin or a polar bear, and without a keyboard you cannot tell it. But most digital cameras these days come with a suite of software that you use to sink the photos from the camera to the computer. Why is that software not gently prodding those users to provide descriptions of the content of those files so that an intelligible value can be used for the file name? If a user chooses not to provide those descriptions, then they have to remember that DSC01722.jpg is the picture of Pam from the office party. If that’s what they want to do, you can’t stop them. But you can put the right tools in the hands of users who do want the feature.

Cocoa has numerical comparison as follows: [someString compare:someOtherString options:NSNumericalSearch];.

It’s trivial to implement a function for NSArray’s sortUsingFunction:context: that uses it, or a category on NSString to add a -numericalCompare: method to be used with sortUsingSelector:.

Anyone coding for European languages will have certainly come across a variant of this where bugs crop up complaining that accented characters (like ) end up at the bottom of a sort instead of next to their a-z equivalents. It gets even more fun when you start considering compound letters like ‘’.

To the poster who suggested treating the name in base-62, what you’re essentially doing is sorting by string length first, then alphabetically - not addressing the core problem, e.g. if you filenames were all the same length (z1…, z10., z100) the system falls down.

Davide wrote: “Certainly computers should always do what the user wants, but in this circumstance I doubt that you would find a consensus amongst users at what the proper order should be anyhow.”

No one seems to have noticed that Kate Rhodes is actually Kay Rhodes. He’s actually a guy!

1 Like

Ooof. I need to edit this post to fix that, then!

Found this out when I was making my first version of a jpeg player to play some of the 7 million jpegs that Zoneminder has created.

A regular array_sort will put some images in the wrong order so when my app is playing it will flash a previous image… Playing jpg: 198.jpg, 199.jpg, 2.jpg, 200.jpg, 2000.jpg, 201.jpg, 202.jpg…

And yes PHP does have a natural sort! My first version of jpeg player was done using Javascript and PHP. I updated it to use the natural sort command instead of depending on the scandir() to sort the files.

Now I am developing the same thing in an Adobe Air app. But I am finding that my same code in PHP is definitely slower in Adobe Air. PHP takes a few seconds to get a stack of 5,000-10,000 files but Adobe Air takes 5-10 minutes for 1,000 using the same code I used in PHP. I spent late nights trying to optimize my code for air and now hopefully a Natural Sort won’t take 5 minutes to sort.

Hey Jeff, thanks for this, it’s perfect for a couple use-cases we have in Hue! We’d like to add a slightly modified version of your code (to handle both list and dictionary collections), is it covered by any sort of copyright license (Creative Commons or otherwise)? We’d link to this post as credit in a comment above the snippet.

Jenny

hi,

i tested a few of the above solutions and ended up writing my own, thinking a more concise and faster solution would be possible by employing a simple state machine to divide and conquer complexity and limit the number of tests. it beats other C/C++ i’ve tested by a factor of 2 to 5, depending on the nature of the input. see https://www.o-rho.com/naturalsort

Interesting reading.
As much as it may be “stupid users”, it very much is a problem I’m staring at right now

They want wards in numerical order based on only descriptions
Ward 1
Ward 2
Ward 9
Ward 10
COVID-19 HDU Ward 15
Ward 17

Yeah… this isn’t fun

1 Like