Sorting for Humans : Natural Sort Order

The default sort functions in almost every programming language are poorly suited for human consumption. What do I mean by that? Well, consider the difference between sorting filenames in Windows explorer, and sorting those very same filenames via Array.Sort() code:


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html

This discussion is funny.

The problem isn’t with sorting. The problem is with the data structure. If you want a hierarchy use one. Make a tree and sort the levels of the tree.

Take this shopping list for an example:

Wheat bread
White Bread
White Milk
Chocolate Milk

It’s a list. If i sort the list i’ll get:
Chocolate milk
Wheat Bread
White Bread
White Milk

Whaaah /cry the user says 'cause milk is seperated and not sorted propery!
Whaaah /cry the developer says “get used to it” this is the way computers have worked since the iron age.

the problem isnt the sort, the problem is the list. What we want is a Tree:

Bread:
Wheat
White
Milk:
Chocolate
White

Then when we sort we sort each level of the tree.

When you sort the strings:

dumbuser10
dumbuser2
dumbdeveloper10
dumbdeveloper2

If you expect the result:

dumbdeveloper2
dumbdeveloper10
dumbuser2
dumbuser10

Then you are expecting the tree:

levels of the tree.

Take this shopping list for an example:

Wheat bread
White Bread
White Milk
Chocolate Milk

It’s a list. If i sort the list i’ll get:
Chocolate milk
Wheat Bread
White Bread
White Milk

Whaaah /cry the user says 'cause milk is seperated and not sorted propery!
Whaaah /cry the developer says “get used to it” this is the way computers have worked since the iron age.

the problem isnt the sort, the problem is the list. What we want is a Tree:

Bread:
Wheat
White
Milk:
Chocolate
White

Then when we sort we sort each level of the tree.

When you sort the strings:

dumbuser10
dumbuser2
dumbdeveloper15
dumbdeveloper3
smartdeveloper1

If you expect the result:

dumbdeveloper3
dumbdeveloper15
dumbuser2
dumbuser10
smartdeveloper1

You are expecting a tree:
dumb:
developer:
3
15
user:
2
10

smart:
developer:
1

So fricken use a tree user and developer alike.

omg sorry about that post, i guess i’m dumb user 2 'cause i cut and pasted all fooked

“natural sort groups numerical substrings”

Yes, but if “natural” means “the way humans do it”, it has to do much more and handle other popular ordered substrings…

Days of week:
•tuesday
•wednesday
•thursday

Date/time:
•15-8-2004
•1-10-2004
•1-1-05

•5:00 AM
•14:00
•4:00 PM
(special tricky for 12:00-1:00 AM/PM)

Street adresses:
•9, James street
•10, James street
•10 bis, James street
•10 ter, James street
•5, Rudolph street

Product reference, or anywhere digits are not numerical, like phone “number”:
-01K305
-1K205

Hexadecimal:
0001
F01A
f01b
(0-9 and A-Z/a-z in straight order)

Version:
•v1.9
•v1.10
•v1.10.1
(a period is not a decimal separator)

Sorting with or without leading article (Mr, Miss, St, “The” etc).

etc etc

And then there is the issue that some sorts depend on locale.
And then there is the issue that some strings were built with other, untold locale (like filenames).
And then there is the issue that some strings are ambiguous and require looking at the entire set.
And then there is the issue that some sets are fundamentally ambiguous.
And then there is the issue that since the sort depends on the set, it is not stable nor predictable.

If you want to make good human-like sort… good luck.

And in the case of the version numbering… thats a tree:

v:
–1:
----9:

Windows Explorer in your example sorts by file name, not by file name with extension. Hence the difference.

If you apply Array.Sort() to file names only – you would get result that is pleasant for users.

I have both learned to use padding zeroes in the filenames, as well as written thesorting a couple of times in my profession.

Hmm… I bet there’s an ISO standard for this somewhere. If people liked communicating across borders more, they’d be less exited about localization and more into internationalization. There is, seriously, no need to have a local way or writing dates. But then again, what do you expect from people who don’t use the metric system…

Musaran, I agree with you totally - natural sorting is not well-defined. I was simply describing how most natural sort algorithms are implemented.

Aaron G, funny you should say nobody needs to sort decimal numbers as a sub-string. This natural sort algorithm actually supports sorting decimal sub-strings by NOT ignoring leading zeroes (although that behaviour can be disabled):
http://sourcefrog.net/projects/natsort/
Before you say “well, that’s just some random guy’s sort routine”, he claims that his code has been incorporated into PHP:
http://us3.php.net/manual/en/function.strnatcmp.php

Actually, this whole discussion boils down to a central issue in human factors: How Software Works (well-defined algorithms) versus What Users Want (fuzzy, ill-defined specifications).

Once AI is smart enough to figure what users really want, we’ll all be out of jobs. (The code will write itself).

Daniel ‘Dang’ Griffiths:

Didn’t your teacher tell you you’d go blind if you continued to “Embrace the Python”.

Maybe you could try “polishing the Ruby”?

On the topic of this post:

One sort approach is never going to work in all situations. Let your users try out the sort with various filenames etc, get feedback from them, change the sort to behave as close to their expectations as possible. Easy really.

Ash wrote:
“One sort approach is never going to work in all situations. Let your users try out the sort with various filenames etc, get feedback from them, change the sort to behave as close to their expectations as possible. Easy really.”

Aaron G wrote:

“As if you’d ever need to sort hexadecimal in a system with actual users. If you find yourself encountering this issue frequently, it’s a strong indication of awful UI design and potentially terrible back-end design.”

Peterh wrote: “If people liked communicating across borders more, they’d be less exited about localization and more into internationalization.”

Hey Jeff, first time commenting. :slight_smile:

Great blog, your recent blog about the 80/20 actually inspired me to start my own blog.

Saw this one and have to give it my own try:
Quick Java Comparator:
http://simplesql.tigris.org/servlets/ProjectDocumentList?folderID=0

Cheers!

@Tom:
“What? Really? Hex?
Not only do only only you and DEAD people know Hex, but most of them don’t use it that regularly.”

That’s not the point. The point is what is the natural order for hex strings? The ‘natural order’ is under defined, you can’t have it in the language (or in its standard library).
Only the programmer can decide what is the natural order. No, the user can’t, the user sees only what she’d like you have to take into account all your users.

@ Aaron G:

"If some of you are really so concerned about this obscure edge case, put in an heuristic that checks for a string composed of only decimal digits and same-case letters A-F, with at least two characters from each category. "

Oh yes… right… lets put there an heuristcs so that we use two different ordering depending on the elements of the set. Wait… what appens if only some strings are hexadecimal?

“Jesus guys, it IS important to consider edge cases, but that doesn’t mean you should scrap the whole idea if a few of those edge cases will produce incorrect results. You just make sure that you’ve defined this behaviour, and set expectations accordingly”

No, this is the problem. A lexicographical ordering is the same everywhere. The ‘natural order’ is not, you can’t have it as a function in the programming language, it depends on the domain.

If you want to show a sorted list of strings to the user you have to choose one ordering. If, depending on the domain, there is an ordering that is alwayscorrect* you should implement that. If the list of strings is general enough you have to choose an ordering and screw some of your users. The lexicographical ordering has the undeniable advantage of screwing the users in a predictable fashion with a solid mathematical background.
The natural order is just something you pulled off your @ss, subject to change when you find a “bug”.

E[X], well whether you like it or not, strnatcmp seems to be standard in PHP:
http://ca.php.net/strnatcmp

Just because “natural sorting” is ambiguous, doesn’t mean someone can’t make up a “standard” for it.

@Will
well PHP is also the language where “no one needs namespaces”, calling that a standard is quite a bit far fetched. Is just the way they think it should be.

E[X] wrote: “well PHP is also the language where “no one needs namespaces”, calling that a standard is quite a bit far fetched. Is just the way they think it should be.”

Leading zeros are your friends.

PHP has a built in natural sort function as well: http://be2.php.net/natsort

Would be nice to have it built in .NET as well.