The Great Newline Schism

Have you ever opened a simple little ASCII text file to see it inexplicably displayed as onegiantunbrokenline?


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2010/01/the-great-newline-schism.html

Hurrah for new lines!

Hurrah for comments!

Let’s not forget about the Tab size schism.

VS2010 now forces you to set Tab size = indent size. So much for interoperability across systems and software.

https://connect.microsoft.com/VisualStudio/feedback/details/517188/tab-and-indent-size-should-be-seperate-settings-as-in-previous-vs-versions

OK, now you’re just trying to make me feel old. Never used an electric typewriter? That’s where I learned to touch type. Although they’d probably be billed as ‘word processors’. And those were fancy compared to the old manual typewriters. With those you’d have to push the carriage back by hand. LF vs. CR comes in handy when you’re right aligning some text. If you do CR+LF you’d have to tab (or space) back to the right margin. An LF lets you stay on the same margin, so it is/was useful to have the distinction. The problem usually only comes up these days when you’re converting between systems, and gets really bad when you’re doing layout because it can be maddening converting large text files for characters you may not realize are even there.

It’s even more of an issue if you work in a Windows shop and post to a Linux webserver, or vice-versa.

If you use a not-so-smart FTP client and transfer files back and forth, sometimes you will end up with double spacing between paragraphs. Or maybe all your line endings just disappear. Or the transfer may convert the extra end-of-line characters into unknown characters, and you end up with big ugly question marks all over your web pages.
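One plausible way the double spacing arises is an LF-to-CRLF conversion being applied to a file that already uses CRLF. That cause is an assumption, not something the comment above spells out, and `naive_to_crlf` is a made-up helper name. A minimal Python sketch:

```python
def naive_to_crlf(text: str) -> str:
    """Convert LF to CRLF without checking for CRs that are already there."""
    return text.replace("\n", "\r\n")


unix_text = "first\nsecond\n"
once = naive_to_crlf(unix_text)   # proper CRLF text
twice = naive_to_crlf(once)       # every line now ends in CR CR LF

# str.splitlines() treats a lone CR as a line break, so the stray CR
# shows up as an empty line after each real one.
print(repr(twice))
print(twice.splitlines())
```

A viewer that honors a lone CR as a line break would show the second result as double-spaced text.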

It also causes problems when you use multiple OSes and transfer files between them. What a pain.

It’s too bad we don’t have some benevolent standards community that can just tell all the OS companies to use one standard for newlines. It would save the economy tens of thousands of developer-hours. They could do something useful instead of detecting and converting newlines.

There is more to it than just CR/LF. As we moved from physical typewriters to printers, a large number of printers were controlled by mainframes. The translation to PC printing is not always a one-to-one replacement. But a few of the reasons behind separating the commands go well beyond the speed of the head.

CR was used to BOLD lines by typing on them twice.

SLEW and TAB commands were used heavily to move the head around a line, as code was VERY procedural. You may want to call a common DATE function to print a date on a line, but the head may not be where the date needs to go, so you tab or slew the head to the proper position. Then LF and reset.

Take a look at the word BELL, think of a manual typewriter and then think of why it showed up in print function code for years and years.

I was reminded of something that is still being hammered out: the move of the Python programming language’s development to the Mercurial VCS. One of the problems is that Mercurial, being distributed, allows developers to check in code on their local machines using a variety of line endings. This seems to cause a lot of grief… See

http://mail.python.org/pipermail/python-dev/2009-July/090330.html

So, great post!

I wasn’t sure of the exact function of LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029), but you prompted me to go find out.

http://unicode.org/versions/Unicode5.2.0/ch05.pdf

The Unicode spec says:

“Traditionally, NLF started out as a line separator (and sometimes record separator). It is still used as a line separator in simple text editors such as program editors. As platforms and programs started to handle word processing with automatic line-wrap, these characters were reinterpreted to stand for paragraph separators. For example, even such simple programs as the Windows Notepad program and the Mac SimpleText program interpret their platform’s NLF as a paragraph separator, not a line separator.”

NLF (New Line Function) in this context is shorthand for CR, LF and CRLF. By contrast, the two Unicode characters have unambiguous uses. Not that I’ve ever seen them in the wild. I could see them being used converting HTML to UTF-8 plain/text or somesuch (maybe).
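For anyone who wants to poke at the two characters, here is a small Python sketch. Python's `str.splitlines()` does treat U+2028 and U+2029 as line boundaries; `split_paragraphs` is a hypothetical helper written only for this illustration.

```python
LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"

text = "line one" + LINE_SEPARATOR + "line two" + PARAGRAPH_SEPARATOR + "new paragraph"

# splitlines() breaks on LS and PS as well as on CR, LF and CRLF.
print(text.splitlines())

# Splitting on "\n" alone sees a single chunk, because there is no LF here.
print(text.split("\n"))


def split_paragraphs(s: str) -> list[list[str]]:
    """Hypothetical helper: PS separates paragraphs, LS separates lines within one."""
    return [p.split(LINE_SEPARATOR) for p in s.split(PARAGRAPH_SEPARATOR)]


print(split_paragraphs(text))
```

Unlike a bare NLF, the two levels stay distinct: the first paragraph keeps its two lines and the second stands alone.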

Note that Mac OS X uses LF, not CR. Mac OS 9 and older used CR.

This post is a very good explanation of line endings, but I had assumed that everyone already knew this information. I guess only old farts know it.

Two weeks ago, I was pissed off at a teammate who checked in a file used on the Windows version of our product from his Mac. When I went to edit the file in Visual Studio, I saw the line ending dialog shown here. Our source control system, like most, uses the client OS to determine the line endings. That was the first time I had seen anybody make that mistake in years. The culprit was suitably embarrassed.

As a real old timer, I always understood that MSDOS adopted CR - LF and the ‘\’ character rather than ‘/’ so that AT&T wouldn’t sue them for infringing Unix. Also for the early Mac and CR.

One thing that struck me as funny is that Microsoft for once is the company that is backwards-compatible…

Hi Jeff,

Using the CR and LF keys in all combinations was useful when using a manual typewriter.

CR and LF - The typist pushed the carriage return lever to return the carriage to the start of the line and move the paper up to the next line.

CR only - As Wikipedia stated, the only way to underline, cross out, or even “bold” text is to return the carriage to the start either without the linefeed (by using a carriage release button) or using the carriage return and linefeed and then turning the platen knob to roll the paper back one line. (The platen is the cylinder, sort of like a rolling pin, that the paper wrapped around which provided a firm backing so the keys could strike the paper without the paper tearing.)

LF only - rolling up the paper by turning the platen was useful if you needed to use whiteout to correct the text. You’d then roll the paper back to the original line, hit the backspace key, and correct the text.

There was also a button on the platen knob that allowed the paper to be rolled up part of a line which was useful for a footnote number, subscript, math formulas, an “*”, etc. Rolling the platen without using this knob moved a fixed distance, stopping at a “bump”. There were, give or take, 20 lines per platen rotation. (Just a guess, but you know what I mean.)

http://site.xavier.edu/polt/typewriters/tw-parts.html gives a picture of the typewriter and names its parts.


Very interesting how the typewriter ties into all of this. Nice post.

There’s a common and accepted way to normalize newlines (using js regex for examples):

.replace(/\r\n|\r/g, "\n");        // 1: everything becomes LF
.replace(/\r\n|\r|\n/g, "\r\n");   // 2: everything becomes CRLF
.replace(/\r\n|\n/g, "\r");        // 3: everything becomes CR

The first method is used for server-sent events (not regex, but the normalizing of \r\n and \r to \n via parsing).

The second method is used in HTML5 for textarea value getters/setters. Opera does this already. However, WebKit currently uses the first method, and Gecko and IE do different things depending on the situation.
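The same three normalizations can be sketched outside the browser. This is a rough Python equivalent, not a port of any browser's code; `normalize_newlines` is a name invented here, and it uses one pattern for all three targets instead of three separate regexes.

```python
import re

# CRLF is listed first so that a CR LF pair counts as one newline
# rather than as a CR followed by an LF.
_ANY_NEWLINE = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Rewrite every CRLF, CR or LF in `text` as `newline`."""
    # A function replacement avoids any escape processing of `newline`.
    return _ANY_NEWLINE.sub(lambda _match: newline, text)


mixed = "one\r\ntwo\rthree\nfour"
print(repr(normalize_newlines(mixed)))          # LF everywhere
print(repr(normalize_newlines(mixed, "\r\n")))  # CRLF everywhere
print(repr(normalize_newlines(mixed, "\r")))    # CR everywhere
```

Running the function twice gives the same result as running it once, which is the property the naive "replace LF with CRLF" approach lacks.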

You can use __lookupGetter/Setter__ and __defineGetter/Setter__ or Object.getOwnPropertyDescriptor and Object.defineProperty to patch a textarea’s ‘value’ getter and setter to work around the differences in newline normalization.

Newlines being right for a textarea’s value is important because it affects the value’s length, which many sites (like Twitter) use as character counters to enforce limits.
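To see how much the convention matters for a character counter, here is a small Python sketch. `length_with_newline` is a hypothetical helper, not anything a browser or Twitter exposes; it just measures the same text under each convention.

```python
def length_with_newline(text: str, newline: str) -> int:
    """Length of `text` once every line break is written as `newline`."""
    lf_only = text.replace("\r\n", "\n").replace("\r", "\n")
    return len(lf_only.replace("\n", newline))


message = "line one\nline two\nline three"
print(length_with_newline(message, "\n"))    # LF convention
print(length_with_newline(message, "\r\n"))  # CRLF convention
```

The CRLF count is larger by exactly one per line break, so a message that fits a limit under one convention can exceed it under the other.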

I absolutely hate the newline deal though. I wish we could get rid of it and just use only \n everywhere.

As for a Unicode newline, I once tried to switch to using it (just to be
Unicode-proper and all that), but couldn’t find anything that handled it
display-wise.

You asked: “The distinction between CR and LF does seem kind of pointless; why would you want to move to the beginning of a line without also advancing to the next line?”

Back in the old days (1990 or 1991) at WPI, when we had a lab with VT220s connected over serial to a shared unix system, we used CRs without LFs to produce 1-line animations, where we’d basically set up a bunch of “frames” of animation in a textfile separated by CRs and then cat that textfile.

We got pretty elaborate, and I think I had something like a 5-minute animation in my .plan for a while (this was the primary venue for these animations, showing off your leet dumb terminal skillz or something) until the sysadmin there fingered me or one of my friends from a 2400-baud modem. A day or two later we noticed that finger started converting \r’s into \n’s and stopped displaying beyond 10 (post-conversion) lines of one’s .plan…

The VT220 could do some weird bitmapped graphics things, too (“sixel graphics”), so there were even more complex animations out there, including a memorable Twilight Zone intro and the head of J.R. “Bob” Dobbs. We hacked up a fake Mac System 7 windowing system to confuse people looking over our shoulders when we were using the VT220’s…

In a more modern context, some command-line programs that display a progress bar update the bar’s status (without extraneous scrolling if you happened to be at the bottom of your terminal window, as would happen if you only had \n and you had to print that and then a cursor-up character) by printing their progress bar updates separated by \r. For example, from wget’s progress.c:

/* Print the contents of the buffer as a one-line ASCII "image" so that it can be overwritten next time.  */

static void
display_image (char *buf)
{
 bool old = log_set_save_context (false);
 logputs (LOG_VERBOSE, "\r");
 logputs (LOG_VERBOSE, buf);
 log_set_save_context (old);
}
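The same trick is easy to try at home. The following is a minimal Python sketch of a CR-based progress bar, not a translation of wget's code; `render_bar` and `show_progress` are names made up for the example.

```python
import sys
import time


def render_bar(done: int, total: int, width: int = 20) -> str:
    """Build one frame of the bar, e.g. '[#####     ] 5/10'."""
    filled = width * done // total
    return "[" + "#" * filled + " " * (width - filled) + f"] {done}/{total}"


def show_progress(total: int, delay: float = 0.05, out=sys.stdout) -> None:
    for done in range(total + 1):
        # "\r" moves the cursor back to column 0 without moving down a line,
        # so each frame overwrites the previous one.
        out.write("\r" + render_bar(done, total))
        out.flush()
        time.sleep(delay)
    # Finish with a real newline so the shell prompt does not overwrite the bar.
    out.write("\n")
```

Calling `show_progress(10)` in a terminal should animate a single line. Each frame is at least as long as the one before it, so no stale characters are left behind.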

Anyway, your points are good, and I can’t count the number of times I’ve had to help people who have edited unix files using Notepad and messed up their \n’s and couldn’t figure out why their CGI scripts or whatever had stopped working… Perl (even on unix) will work fine with either \n or \r\n at the end of a line, but the shell is dumb, and if the CGI began with #!/usr/bin/perl^M the user would get a mysterious “Command not found” or “Command interpreter not found” error. Fun times.
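A quick check for the ^M problem can be sketched in a few lines of Python. Both function names are hypothetical, and the sample script is invented for the illustration.

```python
def shebang_has_cr(script: bytes) -> bool:
    """True if the first line is a #! line that ends with a carriage return."""
    first_line = script.split(b"\n", 1)[0]
    return first_line.startswith(b"#!") and first_line.endswith(b"\r")


def crlf_to_lf(script: bytes) -> bytes:
    """Rewrite CRLF line endings as LF, leaving everything else untouched."""
    return script.replace(b"\r\n", b"\n")


# A script saved with Windows line endings (made-up sample).
broken = b'#!/usr/bin/perl\r\nprint "hello\\n";\r\n'

print(shebang_has_cr(broken))              # the interpreter path ends in CR
print(shebang_has_cr(crlf_to_lf(broken)))  # clean after conversion
```

To use it on a real file, read the file in binary mode (`open(path, "rb")`) so that Python does not translate the line endings before the check sees them.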


I’ve been burned by the newline problem many times myself; some tools
make it easier than others to deal with it (UltraEdit, my preferred
text editor, shows which mode a file is in and lets you convert
easily; but you can still get confused by a file that has multiple
ending types caused by copy-and-pasting), and if you use FTP to
transfer files you can select “ASCII mode” to convert to the
appropriate conventions of the destination.
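Without an editor that reports the mode, a file with several ending types can be spotted by counting them. This is a minimal Python sketch; the helper names and the sample bytes are invented for the illustration.

```python
import re
from collections import Counter

# CRLF must come first so a CR LF pair is not counted as a CR plus an LF.
_NEWLINE = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def count_line_endings(data: bytes) -> Counter:
    """Count how many line endings of each kind appear in `data`."""
    return Counter(_NAMES[match.group()] for match in _NEWLINE.finditer(data))


def has_mixed_endings(data: bytes) -> bool:
    """True if more than one kind of line ending is present."""
    return len(count_line_endings(data)) > 1


sample = b"pasted from Windows\r\ntyped on Linux\nold Mac text\rlast line\n"
print(count_line_endings(sample))
print(has_mixed_endings(sample))
```

The data should come from a file opened in binary mode, since text mode may translate the endings before they can be counted.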

Everywhere you look in computing, you find the debris of all the past
archaic hardware and software, platform and standards wars, and so
on, preserved out of desire for compatibility. This can be both
fascinating and maddening. Inquire into just about anything: Why
does Windows use backslashes for directory paths? Why do URLs use
forward slashes? What’s the point of the double slash near the
beginning of URLs? Why do some standards, such as for e-mail,
specify lines of no more than 80 characters? Why is “prn” not a
legal file/pathname in M$ operating systems and application
frameworks? Why do most user agent identifier strings start with
"Mozilla"? Why does Windows 7 have “6.1” as its internal version
number? Why are reverse domain lookups done with a top level domain
named “.arpa”? These things will always lead to long, tangled
stories, sometimes stretching into the dim past of computing up to or
beyond a half century ago. (I think the 80-character limits derive
from the 80-column IBM punch card of 1928, itself a descendant of
Hollerith’s 1890s cards.)

But if that’s what we have after less than a century of the
computerized world, imagine what sorts of historical baggage there’ll
be in the devices of a millennium or more in the future.

LOVE this exploration of fiddly implementation gotchas. Thanks!

I happen to have both Word '08 and Pages '09 open (backward compatibility vs. usability), so I decided to see how they responded to all the nominally invisible Unicode ‘separator’ glyphs.

Pages doesn’t respond at all to the inscrutable ‘Information Separator’ glyphs, but it does interpret ‘Line Separator’ and ‘Paragraph Separator’ as whatever iWork apps use internally as LF and New Paragraph. (Too lazy to find that out at the moment.)

Word 2008, on the other hand, responds to two of the Information Separators with box or hyphen glyphs, but ignores Line Separator, Paragraph Separator & the other two Information Separator glyphs.

BTW…

See http://unicode.org/standard/reports/tr13/tr13-5.html for converting between Unicode and non-Unicode line break schemes.

See http://bugs.python.org/msg97407 for discussion of Information Separators (before ignoring their insufficiently-specified butts forever).

I actually use those ASCII file/unit/record separators on occasion in data storage and transfer formats used internally in my programs; they’re handy precisely because they’re so rarely used by anybody else, so they don’t clash with characters within the individual data items themselves as happens often with comma-separated data, and it allows a hierarchy of several levels of structure using the different characters.
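A two-level version of that idea might look like the Python sketch below. The `encode` and `decode` names and the sample rows are made up; the separator code points themselves are the standard ASCII ones.

```python
# ASCII information separators, from the highest level to the lowest.
FS = "\x1c"  # file separator
GS = "\x1d"  # group separator
RS = "\x1e"  # record separator
US = "\x1f"  # unit separator


def encode(records: list[list[str]]) -> str:
    """Join fields with US and records with RS.

    Assumes no field contains a separator character itself.
    """
    return RS.join(US.join(fields) for fields in records)


def decode(blob: str) -> list[list[str]]:
    """Inverse of encode() for non-empty input."""
    return [record.split(US) for record in blob.split(RS)]


# Commas, quotes and even newlines inside a field need no escaping.
rows = [["Smith, John", 'says "hi"'], ["multi\nline", "a,b,c"]]
blob = encode(rows)
print(decode(blob) == rows)
```

GS and FS could wrap further levels in the same way. One caveat: Python's `str.splitlines()` treats FS, GS and RS as line boundaries, so such data should be split explicitly as above.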

Thank you for this article. Just a week ago I was tearing my hair out, raving and ranting “why the hell aren’t those Unicode standards c*&k s&#%#@s smart enough to create a universal cross platform CR/LF/CRLF equivalent!!!”

The reason: I was working on an open source .NET library called Packet.net, where I was using VS2008 on Windows and the project admin was working in MD on *nix.

git has a hackish solution (autocrlf) to automatically sort out ASCII line ending issues, but it sucks: if a file with mixed line endings accidentally gets through in a commit, you’re SOL. So, I was forced to resolve them myself.

In VS2008 line endings are handled based on the type already used in the file. For instance, if the file was created in *nix, it handles the line endings as LF. If it sees inconsistent line endings (or if it randomly feels like seeing if it can trick you into changing to windows CRLF line endings) it pops up the menu you illustrated.

The major shortcoming of VS2008 in handling line endings can be seen if one of the project XML files needs modification (e.g. settings, .vsproj, etc.). Since those are automatically generated by VS, it always writes their line endings as CRLF. Which led me to my final evaluation: VS sucks at line endings.

After getting tired of opening .csproj files in Notepad++ and converting the line endings manually, or having patches rejected because I forgot to convert a file, I finally threw in the towel and partitioned my HDD to dual-boot Windows and Linux Mint.

Long story (not-so-)short: creating the source files in Unicode and using LS as the default line ending would completely solve this issue. No more line ending woes. Now, I wonder if VS and MD both support it. This article (and discovering the LS char) finally gave me a viable reason to use Unicode.

::NapoleonDynamiteSigh:: I was much happier in the days when I was still ignorant of what the term “line ending” meant.

SideNote: It’s still useful to have CR work as a bare carriage return, mostly in interactive console apps where you’re trying to update a status during processing. Instead of doing a ClearScreen and reprinting all the lines with the status updated, or printing a new line for every update (filling the screen with updates), you can just print(CR) followed by the updated message and overwrite the same line.

I know the “Home” key on the keyboard is useless to 99.7% of average computer users, but I use it all the time.

Just to add a little detail to the beginning of the article: Mechanical typewriters like the great little Smith-Corona shown above, and early electric typewriters, used pivoting arms with molded letterforms on the paper-striking end. This was followed by IBM’s Selectric typewriter (and later the 2741 computer terminal), which used a semi-spherical typehead, usually referred to as a golfball, containing all of the characters. Somewhere around that same time, the Teletype Model ASR-33 was using a cylinder for a printhead. The actual daisy-wheel printers (made primarily by Xerox, Diablo, and Qume) used a metal or plastic disk of “petals”, each with a molded character. I’ve used typewriters and printers with all 3 mechanisms.
Thanks for the great article, and good comments too!