I {entity} Unicode

Jaster · March 28, 2008, 12:00am

Have some fun with Unicode …

http://www.revfad.com/flip.html

uʍop ǝpısdn ƃuıʇıɹʍ ʎɹʇ
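A page like that presumably works by reversing the text and swapping each letter for an upside-down look-alike from elsewhere in Unicode. A minimal sketch, assuming a hand-picked partial mapping (the `FLIP` table and `flip` function are hypothetical names, not the site's actual code):

```python
# Hypothetical partial mapping: only the letters needed for the sample phrase.
FLIP = {
    "t": "\u0287",  # LATIN SMALL LETTER TURNED T
    "r": "\u0279",  # LATIN SMALL LETTER TURNED R
    "y": "\u028e",  # LATIN SMALL LETTER TURNED Y
    "w": "\u028d",  # LATIN SMALL LETTER TURNED W
    "i": "\u0131",  # LATIN SMALL LETTER DOTLESS I
    "g": "\u0183",  # LATIN SMALL LETTER B WITH TOPBAR (passes for a flipped g)
    "e": "\u01dd",  # LATIN SMALL LETTER TURNED E
    "n": "u",
    "u": "n",
    "p": "d",
    "d": "p",
}


def flip(text: str) -> str:
    """Reverse the text and swap each letter for an upside-down look-alike."""
    return "".join(FLIP.get(ch, ch) for ch in reversed(text.lower()))


flipped = flip("try writing upside down")
print(flipped)
```

Letters missing from the table (such as "s" and "o") pass through unchanged, which happens to work because they look much the same either way up.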

SeanP · March 28, 2008, 12:00am

ROFL!!! I Love it!!!

I just printed this out to add to my “cubicle bumperstickers” right next to:

/(bb|[^b]{2})/

A nice piece of friday humor for sure…

Mo116 · March 28, 2008, 12:00am

Wojtek:

That’s because MySQL (used to, at least—I don’t know if it still does) defaults to the Swedish Latin-1 locale and encoding.

Absolutely everything should use some encoding of Unicode. If they disagree about that, things will break. At the very least, databases and file storage should use one of the Unicode encodings even if your front-end doesn’t (provided you convert your encodings properly), which will mean you won’t have to mass-convert your database later…

Chris Chubb:

For those “99%” who just use ASCII, UTF-8 is byte-for-byte identical.
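A quick way to check this for yourself, sketched in Python (the sample strings are made up for illustration):

```python
text = "Plain ASCII text, 99% of the time."

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII input the two encodings produce exactly the same bytes.
same = ascii_bytes == utf8_bytes

# Only characters outside ASCII take more than one byte in UTF-8.
non_ascii = "na\u00efve".encode("utf-8")  # 5 characters, 6 bytes

print(same, len(non_ascii))
```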

Jeff:

The only reserved characters in XML are <, >, &, and the appropriate quote mark within an attribute value. Everything else is fair game, provided it’s properly represented by your encoding (which should be UTF-8 anyway). That’s why XML only provides named entities for &lt;, &gt;, and &amp; (possibly &quot;? I don’t think so, though); you don’t need any others.
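To illustrate the escaping point, here is a small sketch using Python's standard `xml.sax.saxutils` helpers; the sample strings are invented. For what it's worth, XML 1.0 does predefine five named entities in total: lt, gt, amp, apos, and quot.

```python
from xml.sax.saxutils import escape, quoteattr

# Markup-significant characters in text content get escaped.
escaped = escape("if a < b && b > c")

# Curly quotes, dashes and accented letters need no escaping at all,
# as long as the document's declared encoding can represent them.
untouched = escape("caf\u00e9 \u2014 \u201cquoted\u201d")

# quoteattr picks a quote style so the attribute value stays well-formed.
attr = quoteattr('say "hi"')

print(escaped)
print(untouched)
print(attr)
```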

Mo117 · March 28, 2008, 12:00am

Also, where you say “no HTML” preceding this comment box, it would imply that comments are interpreted as plain text.

The output of my previous comment suggests that it’s not

Joe · March 28, 2008, 12:00am

As long as `this’ practice is ``done away with’’ … using backticks and foot marks to fake up quotation marks is ugly and pointless

offler · March 28, 2008, 12:00am

it is not funny.

PaulC · March 29, 2008, 12:00am

LOL!

Seriously, though, I am looking at a number of “time saving” tools for bulk file operations (sequenced renaming, cropping, applying colour profiles, etc) that I have on my desktop, which, after all these years, are still saving me no time at all because they can’t be used with Unicode filenames (a lot of my files that need to be organised are in Japanese). So, I have to rename the files to ASCII, run the batch process, then rename them back again. I often have to temporarily move them elsewhere to avoid renaming an entire folder tree.

The pain of this is, they claim (as it turns out, falsely) to be ready for WinNT, Win2K, WinXP and/or Vista.

Not to mention that Unicode is nice for English filenames when you need colons, quotes, and question marks in them.

Yes, I think after all these years we are entitled to have natural English filenames even though some folk still insist on using 8.3 as a “badge of honour”.

“I don’t want to do a lot of extra work for the 99% of the apps that I write that have a 99% North American audience.”

This is a common objection, but it is a self-fulfilling prophecy. Perhaps if your applications actually worked in other countries you’d have a bigger market? And perhaps the market outside the US, being billions of people more, might actually be more profitable?

(Yes, I realise you are probably referring to in-house applications, which are a different kettle of fish, but it’s fun to niggle (;P) ).

And I think people should rethink this arbitrary 1% plucked from the air: 25%-80% of households in the US in any given area are multilingual (based on a quick Google, may not be totally accurate, but enough to prove the point). Most western countries are absolutely multicultural.

That’s a lot of frustrated customers to ignore.

And that’s not taking into account multinational companies and government departments, either. I wonder if Homeland Security needs to be multilingual? Hmmm… “sorry Sir, that latest message from Al Qaeda was in Unicode… and we can’t read it.”

Mo118 · March 29, 2008, 12:00am

@Shawn: Are you absolutely sure about that?

I could have sworn Windows used UCS-2 (and didn’t handle characters beyond 0xFFFF), rather than UTF-16.

Konrad · March 29, 2008, 12:00am

I don’t care for “” versus "", or — versus --. Mainly because they have to be escaped in XML whereas the simple ASCII equivalents don’t.

This, in response to your own article decrying ignorance of Unicode, is saddening (or brilliant satire). Perhaps you were laughing for the wrong reasons after all.

Chris_Nahr · March 29, 2008, 12:00am

“I could have sworn Windows used UCS-2 (and didn’t handle characters beyond 0xFFFF), rather than UTF-16.”

I believe NT originally used UCS-2 but that was extended at some point to handle UTF-16 strings with multi-word characters (whatever they’re called). The .NET Framework has always supported UTF-16, not just UCS-2.
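The difference between the two is easy to see with a character above 0xFFFF. A small Python sketch (the sample characters are chosen arbitrarily): UTF-16 spends two 16-bit code units on such a character, which a strict UCS-2 implementation has no way to represent.

```python
bmp_char = "\u20ac"         # EURO SIGN, fits in a single 16-bit unit
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, beyond 0xFFFF

# Count 16-bit code units by encoding to UTF-16 (no BOM) and halving the byte count.
bmp_units = len(bmp_char.encode("utf-16-le")) // 2
astral_units = len(astral_char.encode("utf-16-le")) // 2

# The two units form a surrogate pair: a high surrogate followed by a low one.
raw = astral_char.encode("utf-16-be").hex()

print(bmp_units, astral_units, raw)
```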

codinghorror · March 29, 2008, 12:00am

This, in response to your own article decrying ignorance of Unicode, is saddening (or brilliant satire). Perhaps you were laughing for the wrong reasons after all.

You tell me, then-- the RSS feed for this blog is marked “UTF-8”

http://feeds.feedburner.com/codinghorror/
<?xml version="1.0" encoding="UTF-8"?>

Yet when any “” or — characters make their way into my feed (usually in quoted text), those characters are flagged by the feed validator.

http://feedvalidator.org/

These characters are tagged by the validator as “description contains bad characters” with “The XML encoding does not appear to match the characters used.” Each instance of “” or — results in one error from the validator.

I also get emails from people who subscribe via FeedBurner’s email service; these “” or — characters are unprintable in the emails.
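One possible explanation, offered only as a guess: the characters reach the feed as Windows-1252 bytes while the header promises UTF-8. The Python sketch below reproduces that kind of mismatch under that assumption; it is not a diagnosis of this particular feed, and the sample text is made up.

```python
quoted = "\u201cquoted\u201d \u2014 text"  # curly quotes and an em dash

utf8_bytes = quoted.encode("utf-8")
cp1252_bytes = quoted.encode("cp1252")  # assumption: how the text was actually stored

# Correctly encoded UTF-8 round-trips without any escaping.
round_trips = utf8_bytes.decode("utf-8") == quoted

# The same characters as Windows-1252 bytes are not valid UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    mismatch = False
except UnicodeDecodeError:
    mismatch = True

print(round_trips, mismatch)
```

If that is what is happening, the fix belongs in whatever step writes the feed bytes, not in avoiding the characters.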

hellyeahdudec · March 30, 2008, 12:00am

I do not know why, but I do find it funny. My mind inserts happiness where I have the possibility to. Good topic though. We are all robots.

DavidP · March 31, 2008, 12:00am

@ Fred Fnord,

Many programmers’ English skills seem to deteriorate in direct proportion to their programming skills. This is sad.

Have you ever written a line of code? Based on that comment I would think no. The best programmers I have ever had the pleasure to meet could care less whether or not anybody can understand their day-to-day English skills. It’s not their English skills that make them ROCKSTAR hackers, it’s their hacker skills that make them ROCKSTAR hackers. No, there is ZERO correlation between the two. Sorry.

DavidP · March 31, 2008, 12:00am

Oh, and BTW… I’m *#$!ing Jeff Atwood!

DavidP · March 31, 2008, 12:00am

Oh, and one last BTW… Am I the only one who finds it ironic that you chose to compare English skills and programming skills in a post focused towards the (in)proper usage of UTF-* encoding?

DavidP · March 31, 2008, 12:00am

s/inproper/improper

Vadim · March 31, 2008, 12:00am

The solution: the entire world should learn Esperanto.

Andy · March 31, 2008, 12:00am

Isn’t it funny because both the question mark and the box mean that the current encoding doesn’t know about that code point?

So literally translated both sentences say “I ‘don’t know’ Unicode”

Please don’t eat me.
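The question mark half of this is easy to reproduce. A rough Python sketch: with the "replace" error handler, a character the target encoding cannot represent becomes "?", and bytes that cannot be decoded become U+FFFD. The empty box, as far as I know, usually comes from a font that lacks the glyph, which is a separate failure.

```python
heart = "\u2665"  # BLACK HEART SUIT

# ASCII has no heart, so the replace handler substitutes a question mark.
as_ascii = heart.encode("ascii", errors="replace")

# Going the other way, an undecodable byte becomes U+FFFD REPLACEMENT CHARACTER.
as_text = b"\xff".decode("utf-8", errors="replace")

print(as_ascii, repr(as_text))
```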

Jheriko · March 31, 2008, 12:00am

I ? Unicode too…

Seriously though, it’s a pain to make your Unicode app work nicely with the rest of the world. I mean, what character do you substitute for “”? Clearly “n”. But what about something like “→”? Do we expect our program to substitute it with ASCII art like “–”? And that’s before we even get onto the subject of foreign characters…

There is no pain the other way at least…
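One common approach to the substitution problem, sketched in Python: decompose each character, drop the combining accents, and fall back to a placeholder for anything that still is not ASCII. The `to_ascii` helper and its `fallback` parameter are hypothetical, and the n-with-tilde sample is an assumption about the kind of letter meant.

```python
import unicodedata


def to_ascii(text: str, fallback: str = "?") -> str:
    """Best-effort ASCII version of text: strip accents, replace the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop combining accents left over from decomposition
        out.append(ch if ord(ch) < 128 else fallback)
    return "".join(out)


accented = to_ascii("ma\u00f1ana")  # n with tilde loses its tilde
arrow = to_ascii("a \u2192 b")      # code point 8594 has no ASCII decomposition

print(accented, arrow)
```

The arrow case shows the limit: nothing in Unicode says which ASCII art to use, so the program needs its own lookup table for symbols like that.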

DelphiUser · April 1, 2008, 12:00am

It’s funny because programmers think with Unicode the box will be filled with a heart symbol that is typographically correct.

Until such colossal yet reliable Unicode character sets exist, typesetters will still be struggling with multi-language texts.