I {entity} Unicode

Rhywun · March 27, 2008, 12:00am

I use -- all the time too. Sometimes it gets converted into a dash by whatever software I’m using, sometimes not. Either way, the message is clear. The double-hyphen is a pretty well-known internet convention for expressing a “dash”, similar to asterisks for bold or underscores for underlining.

Harold · March 27, 2008, 12:00am

Unicode is da BOM

Josh53 · March 27, 2008, 12:00am

It’s a follow up to a previous post on Internationalisation - Does your software pass the Turkey Test? Or at least an addition. Handy things to know for software developers. That’s why it was blogged about.

JamesJ · March 27, 2008, 12:00am

I just wish people knew UTF-16 was a variable-width encoding, like UTF-8. It is not a fixed-width encoding like UTF-32! Why do so many people not know that?
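
The point about UTF-16 can be checked with a minimal Python sketch. The helper name `encoded_sizes` is made up for illustration; it counts how many bytes a single character takes in each encoding. A character outside the Basic Multilingual Plane needs a surrogate pair (two 16-bit units) in UTF-16, while UTF-32 always uses four bytes.

```python
def encoded_sizes(ch: str) -> tuple[int, int, int]:
    """Return the byte length of one character in UTF-8, UTF-16, and UTF-32.

    The "-le" codec variants are used so that no byte-order mark is
    prepended and the lengths reflect the character alone.
    """
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


if __name__ == "__main__":
    # "A" is ASCII, "é" is in the BMP, U+1D11E (musical G clef) is outside it.
    for ch in ("A", "\u00e9", "\U0001D11E"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the UTF-32 column stays constant; both UTF-8 and UTF-16 vary with the character.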

JamesJ · March 27, 2008, 12:00am

“Mainly because they have to be escaped in XML …”

WTF? Does some tired old blogger need to write an article titled, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About XML (No Excuses!)” for you?

Sean · March 27, 2008, 12:00am

Single-alphabet minded bastards…

PeterisK · March 27, 2008, 12:00am

Why do you think anyone finds it funny?

codinghorror · March 27, 2008, 12:00am

For extra credit: what is the BOM?

http://en.wikipedia.org/wiki/Byte-order_mark

dannygutters · March 27, 2008, 12:00am

Really, that question is not a rhetorical question, you expect me to answer, “This is funny because the box is representative of displaying a unicode value with an incorrect encoding scheme”.

Maybe the rhetorical question is what kind of person am I that finds this funny.

codinghorror · March 27, 2008, 12:00am

rhetorical question n. A question to which no answer is expected, often used for rhetorical effect.

No answer is expected because every developer worth his or her salt should already know why it’s funny-- no explanation required.

Rhywun · March 27, 2008, 12:00am

It’s funny because, fingers crossed, someday UTF-8 will take over the world and we’ll never have to read another article like “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” again. Text will flow seamlessly anywhere and in any language. And that will be a beautiful day. And these slogans will be a badge of honor to those of us who had to deal with crap like installing an auxiliary library and learning a whole new API just to be able to output a different language.

Jaster · March 27, 2008, 12:00am

I agree: Unicode should be implemented in all software. No Excuses! It’s not difficult…

Windows is a problem in that cutting and pasting between standard Windows apps fails… Why does copying a Unicode string from an HTML editor to Word to Outlook not work when they are all written by Microsoft?

The one I love is your comment “As it turns out, Windows-1252 can be a better default for web strings than UTF-8”

This is false except on Windows, because Windows seems to make a distinction between ASCII text and Unicode text, since it uses UTF-16.

If it used UTF-8 then (at least in America and the UK) ASCII 0-127 and Unicode are the same …
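
A small Python sketch of the claim above, assuming nothing beyond the standard codecs: text limited to the ASCII range encodes to exactly the same bytes in UTF-8, whereas UTF-16 produces a different byte sequence with a zero byte after every ASCII character (little-endian shown here).

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte-order mark

# For code points 0-127, the ASCII and UTF-8 encodings are byte-for-byte identical.
same_in_utf8 = ascii_bytes == utf8_bytes

# UTF-16 uses two bytes per character here, so the bytes differ from ASCII.
same_in_utf16 = ascii_bytes == utf16_bytes

if __name__ == "__main__":
    print(utf8_bytes, same_in_utf8)
    print(utf16_bytes, same_in_utf16)
```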

Chris_Chubb · March 27, 2008, 12:00am

On one hand, I understand the need to support characters outside the standard American alphabet. On the other hand, I don’t want to do a lot of extra work for the 99% of the apps that I write that have a 99% North American audience. On the gripping hand, perhaps knowing the ASCII charset will come in as handy in the next 20 years as knowing EBCDIC has in the last 20 years.

John_Grimes · March 27, 2008, 12:00am

it’s funny because people are answering the rhetorical question (including me now!).

JesusD · March 27, 2008, 12:00am

I remember the day I spent 10 hours fixing the question mark bug that a user reported. argh

OleE · March 27, 2008, 12:00am

How about this one:

I&#65533;Unicode

Ole
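
For anyone wondering about the number: 65533 is decimal for U+FFFD, the Unicode replacement character that decoders emit when they hit bytes they cannot interpret. A minimal Python sketch, assuming a stray Latin-1 byte (0xE9) in data that is decoded as UTF-8; the sample byte string is made up for illustration.

```python
import html

# The numeric character reference 65533 is the replacement character U+FFFD.
replacement = html.unescape("&#65533;")

# A lone 0xE9 byte (Latin-1 "é") is not valid UTF-8 on its own, so a lenient
# decoder substitutes U+FFFD for it instead of raising an error.
mangled = b"I \xe9 Unicode".decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(hex(ord(replacement)))
    print(mangled)
```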

Andrew_R · March 27, 2008, 12:00am

But Jeff, why do you still avoid using goodness of Unicode?
For instance, why use ‘--’ when there is ‘—’ available?

Jeff_Davis · March 27, 2008, 12:00am

So, do I have to not read the link that you send us to before answering about the BOM?

But, we use UTF-8. In UTF-8, you don’t need the BOM.

http://unicode.org/faq/utf_bom.html#29
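
A short Python sketch of the BOM discussion, using only the standard `codecs` module. UTF-8 has no byte-order ambiguity, so the BOM is optional there and acts only as a signature; in Python the `utf-8-sig` codec writes it and strips it, while the plain `utf-8` codec leaves it in the decoded text as U+FEFF.

```python
import codecs

text = "Unicode"

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # EF BB BF prepended as a signature

# The UTF-8 signature is the three bytes EF BB BF.
bom_is_prefix = with_bom == codecs.BOM_UTF8 + plain

# "utf-8-sig" strips the BOM if present and also accepts input without one.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_plain_as_sig = plain.decode("utf-8-sig")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
decoded_raw = with_bom.decode("utf-8")

if __name__ == "__main__":
    print(with_bom[:3].hex(), bom_is_prefix)
    print(repr(decoded_raw))
    # In UTF-16 the BOM does carry byte-order information:
    print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())
```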

Rhywun · March 27, 2008, 12:00am

I guess his — key is broken.

Chris · March 27, 2008, 12:00am

Great article.

Even greater use of “on the gripping hand” in Chris Chubb’s comment!