There Ain't No Such Thing as Plain Text

Over the last few months, I've come to realize that I had an ugly American view of strings. I always wondered what those crazy foreigners were complaining about in their comments on my CodeProject articles, and now I know: there ain't no such thing as plain text.

As it turns out, Windows-1252 can be a better default for web strings than UTF-8.

Microsoft’s Mikhail Arkhipov describes some of the changes in VS.NET 2005 in this area:

"First, Visual Studio is a Unicode application and actually even supports Unicode Surrogates Pairs. Most of Web pages, however, are not stored in Unicode. Therefore when opening a Web page VS has to figure out how to convert document to Unicode and how to convert it back on save. Here is how Visual Studio does it: "

UTF-8 is good as a default. But here is a better rule:

  • If the string is a valid UTF-8 encoded string, interpret it as UTF-8.
  • If the string is not valid UTF-8, interpret it as Windows-1252.

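This works because many byte sequences are simply illegal in UTF-8. For example, a byte in the range 0x80 to 0xBF may only appear as a continuation byte after a suitable lead byte. Text in a legacy single-byte encoding that contains non-ASCII characters will almost never happen to form valid UTF-8, so if the string validates as UTF-8, you can be pretty sure that it is UTF-8. (Pure ASCII text is valid under both interpretations and decodes identically either way.)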
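Here is a minimal Python sketch of the rule above, not a definitive implementation. The function name `decode_web_bytes` is made up for illustration. One assumption to flag: Python's `cp1252` codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the fallback uses `errors="replace"` to avoid raising on them.

```python
from typing import Tuple


def decode_web_bytes(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used), trying UTF-8 first.

    Hypothetical helper illustrating the rule from the comment above.
    """
    try:
        # Strict decoding raises on any byte sequence that is illegal in UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fallback: treat the bytes as Windows-1252. Bytes that Python's
        # cp1252 codec leaves undefined become U+FFFD instead of raising.
        return data.decode("cp1252", errors="replace"), "windows-1252"


if __name__ == "__main__":
    print(decode_web_bytes("café".encode("utf-8")))  # valid UTF-8
    print(decode_web_bytes(b"caf\xe9"))              # invalid UTF-8, falls back
```

Note that the second element of the result tells you which branch was taken, which is handy when you need to write the text back out in its original encoding.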

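Moreover, I want to comment on an earlier comment on this page: when Windows and .NET talk about the “Unicode” encoding, they really mean the UTF-16 encoding, which uses surrogate pairs for characters outside the Basic Multilingual Plane. A UTF-16 file typically starts with a Byte Order Mark (U+FEFF), which shows up as the two bytes FF FE for little-endian or FE FF for big-endian. The BOM is optional, though, so its absence does not prove that a file is not UTF-16.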
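To make the BOM check concrete, here is a small Python sketch. The function name `utf16_byte_order` is hypothetical; the BOM constants come from the standard `codecs` module. It only looks at the first two bytes, so treat it as an illustration rather than a full encoding detector.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the UTF-16 variant implied by a leading BOM, or None.

    Hypothetical helper. Caveat: a UTF-32 little-endian BOM (FF FE 00 00)
    also starts with FF FE, so a real detector would check for that first.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None  # no BOM: the data may still be UTF-16, we just cannot tell


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    print(utf16_byte_order(sample))
```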

Actually, I think according to the spec, text/{something} === text/{something}; charset=US-ASCII

If no charset is defined on a text/{something} MIME type, then the bytes must be interpreted as US-ASCII. That is why the application/xml MIME type is preferred to text/xml: with application/xml, you can just pass the bytes to your parser and let it work out the encoding from the byte order mark or the XML declaration, whereas with text/xml and no charset, you have to assume those bytes are ASCII.
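As a rough illustration of the rule as this comment describes it, here is a Python sketch that picks the charset for a Content-Type value. The name `effective_charset` is made up; the header parsing uses the standard library's `email.message.Message`. The US-ASCII default for text/* types is taken from the comment above, not verified against any particular version of the spec.

```python
from email.message import Message
from typing import Optional


def effective_charset(content_type: str) -> Optional[str]:
    """Return the charset to assume for a Content-Type header value.

    Hypothetical helper: an explicit charset parameter wins; otherwise
    text/* falls back to US-ASCII (the rule described in the comment),
    and other types return None, meaning "let the parser decide".
    """
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()  # lowercased charset, or None
    if explicit:
        return explicit
    if msg.get_content_maintype() == "text":
        return "us-ascii"
    return None


if __name__ == "__main__":
    for value in ("text/xml", "text/xml; charset=UTF-8", "application/xml"):
        print(value, "->", effective_charset(value))
```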