I {entity} Unicode

MontanaR · April 9, 2008, 12:00am

@Frank 32 bits, I hope you mean.

Alex · July 3, 2008, 12:00am

I’ve been working on a corpus of blogs recently as part of a summer project. The corpus was gathered by another university and has a major case of Unicoditis.
Most documents, including those written in English, suffer from badly mangled data. Different documents are mangled in different, inconsistent ways. Two- or three-byte UTF-8 sequences in the documents may have been reinterpreted byte by byte as characters in a single-byte encoding and then re-encoded as UTF-8, sometimes more than once.
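
A minimal sketch of that kind of damage and one way to reverse it, assuming the single-byte encoding involved was Latin-1 (the actual corpus may well have used something else, such as Windows-1252). The helper names `mangle` and `unmangle` are made up for illustration:

```python
def mangle(text: str, rounds: int = 1, wrong_encoding: str = "latin-1") -> str:
    """Simulate the damage: UTF-8 bytes are misread as a single-byte
    encoding, and the result is stored as UTF-8 again."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode(wrong_encoding)
    return text


def unmangle(text: str, max_rounds: int = 3, wrong_encoding: str = "latin-1") -> str:
    """Reverse the round trip until it stops applying or stops changing anything."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode(wrong_encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Either the text holds characters outside the single-byte
            # encoding, or the bytes are not valid UTF-8: nothing to undo.
            break
        if repaired == text:
            # Pure ASCII survives the round trip unchanged.
            break
        text = repaired
    return text


original = "I ♥ Unicode"
damaged_twice = mangle(original, rounds=2)
restored = unmangle(damaged_twice)
```

This is only a heuristic: text that legitimately contains a sequence like "Ã©" would be "repaired" into "é", so it is safer to apply it per document and check the results.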

This is incredibly frustrating. I &#9829; Unicode indeed and am looking forward to days when the majority of plain text will be UTF-8 encoded.

Alex · July 3, 2008, 12:00am

Hmmm, your comments system turned my heart into a numbered entity.
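
For reference, a small sketch of how a character ends up as a numbered entity and how to get it back, using only the Python standard library. Whether the comment system here works this way is a guess; the variable names are just for illustration:

```python
import html

heart = "♥"

# Encoding to ASCII with the xmlcharrefreplace handler turns every
# non-ASCII character into a decimal numeric character reference.
as_entity = heart.encode("ascii", "xmlcharrefreplace").decode("ascii")

# html.unescape converts numeric (and named) references back to characters.
round_tripped = html.unescape("I " + as_entity + " Unicode")
```

The number in the entity is simply the decimal Unicode code point of the character, here U+2665.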

Fred_Fnord · August 27, 2008, 12:00am

Not that you’ll read it, but, Mr Peterson:
Have you ever written a line of code? Based on that comment I would think no. The best
programmers I have ever had the pleasure to meet could care less whether or not anybody
can understand their day-to-day English skills. It’s not their English skills that make them
ROCKSTAR hackers, it’s their hacker skills that make them ROCKSTAR hackers. No,
there is ZERO correlation between the two. Sorry.

First, yes, I’m a programmer. Any claims that either of us is a better programmer than the other would clearly be silly in a forum like this, so I’ll just say that, only counting since college, I’ve been working in positions involving software engineering for 10 years.

Second, I have met a lot of, as you call them, ‘ROCKSTAR hackers’. I tend to divide them up into two subsets: the ones that YOU would consider ‘rock stars’ and the ones that I would consider rock stars. And I have a much more exacting definition than you do, I’m sure. This is because in past jobs I have been the one trying to clean up after your rock stars, who in general are the ones who are incapable of or uninterested in correctly documenting their code and/or their apps (if they’re writing internal tools), totally immune to design documentation (let alone having the ability to actually write or revise it) and very resistant to decent collaboration on products.

For me, a real rockstar coder is someone who is excellent at writing code, conscientious about documenting it, and capable of interacting with other human beings in ways that make them a pleasant addition to a team. An example would be the friend of mine who works at Google as a consultant now, pulling down better than $200k a year from them. If he weren’t able to coherently document things, write project specifications, and produce excellent code, all at the same time, he literally wouldn’t be able to command half of that from Google.

So by all means, use your ‘rockstar coders’. And I, with a team half the size, but with people who actually know English and can communicate and do real design work and who are actually organized, will get a better product out, in half the time, and it will be maintainable.

-fred

AnonymousI · January 14, 2009, 12:00am

joke_(but not so much)_mode
Men, when Indians invented (or at least formalised) the notion of 0 as a number and showed the advantages of a positional numbering system over non-positional ones, almost all of us agreed within a few centuries to use that system.

Not even a minimally mathematically minded person would complain today that the Latin numbering system is not represented enough on modern calculators to reflect the cultural influence and blablabla of the Roman Empire.

But even though the Phoenicians showed comparable advantages of RISC (reduced set of characters) alphabets over hieroglyphs and pictograms a millennium before the Indian numbering system, literates, humanists and philosophers still complain that the alphabet-centrism of the character sets implemented early on computers was detrimental to the culture and literature and history and whatever and blablabla of non-alphabetic cultures.

What does that have to teach us?

positional is better
RISC is better
given an unlimited amount of time, literates, humanists, philosophers and blabla-ers will never agree on a standard base for a working system, while mathematicians and bankers will do it in no time.
consequently, the REAL problem for internationalization is the literates, humanists, philosophers and blabla-ers.
/joke_(but not so much)_mode

Aston · February 6, 2010, 12:00am

Lol, that is pretty funny. You often get these when someone tries to copy and paste content from Word into their HTML editor.

JasonM · February 6, 2010, 12:00am

These are great! Here is one where you can get a bunch of design pattern stuff too - shirts, posters, etc.

http://www.cafepress.com/codergear

Aaron_G · February 6, 2010, 12:00am

I might just have to get one of those. It’s high on the geekiness scale, but I think almost everyone who’s used a computer knows the dreaded white box.

Simon · February 6, 2010, 12:00am

Jeff,

Encoding stuff is pretty nasty sometimes; I constantly run into trouble with it too. But your problems with “special” characters like “, ” or — are probably due to not storing your files in UTF-8, because I just created a test feed and feedvalidator does not complain about these characters.

See here: http://www.slashslash.de/feed.xml

Most of the problems I had with encodings were due to the fact that you can set the XML encoding attribute or the charset attribute of the Content-Type header field to UTF-8, but as long as your files aren’t actually STORED in UTF-8 by your editor (or your scripts on your server), it doesn’t work. The reason that plain ASCII works fine is that UTF-8 was designed as a superset of ASCII, precisely to make the transition easy: every ASCII character is stored as the same single byte in both.
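
A small sketch of the mismatch, assuming an editor that silently saves in Windows-1252 (cp1252) while the feed declares UTF-8. The sample strings and the helper `reads_cleanly` are made up for illustration:

```python
def reads_cleanly(data: bytes, declared_encoding: str) -> bool:
    """Return True if the stored bytes are valid in the declared encoding."""
    try:
        data.decode(declared_encoding)
        return True
    except UnicodeDecodeError:
        return False


declared = "utf-8"
text = "“special” characters"

# What actually lands on disk depends on the editor, not on the declaration.
stored_by_cp1252_editor = text.encode("cp1252")
stored_by_utf8_editor = text.encode("utf-8")

mismatch_ok = reads_cleanly(stored_by_cp1252_editor, declared)  # False
match_ok = reads_cleanly(stored_by_utf8_editor, declared)       # True

# Pure ASCII produces identical bytes either way, so it never shows the problem.
ascii_same = "plain ASCII".encode("cp1252") == "plain ASCII".encode("utf-8")
```

The curly quotes become the single bytes 0x93 and 0x94 in cp1252, which are not valid on their own in UTF-8, so a strict consumer rejects the file even though the header says UTF-8.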

I think it’s often a real pain to get all your tools to use the encoding you want. And in some cases, it isn’t even possible to tell what your editor is actually using or to change what it uses.