CR+LF Has a Long History (revk.uk)
102 points by rhabarba on Feb 8, 2022 | 65 comments



There's no need for conspiracy theories about adding delays. CR and LF were just separate functions and it's useful for them to be separate. CRT terminals (or emulations thereof) still work the same way. You have to send CR and LF to a VT100 to get the cursor to both move to the left margin and then go to the next line down.
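
A quick way to see the two motions separately on a modern terminal emulator (just a sketch; it assumes a VT100-ish emulator and a stty that lets you toggle the onlcr translation):

    stty -onlcr                # stop the kernel mapping \n to \r\n on output
    printf 'first line\n'      # LF alone: cursor drops a line but keeps its column
    printf 'second line\r\n'   # CR then LF: back to the left margin, then down
    stty onlcr                 # restore the usual translation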

The Unix stty command lets you set cr0/1/2/3 modes for inserting NUL characters after a CR for exactly this reason, to give the carriage time to return on your printing terminal.
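
For example (the cr0-cr3 delay styles are still accepted by Linux's stty, though on a modern terminal emulator they are effectively no-ops):

    stty cr3                      # longest carriage-return delay, for slow printing terminals
    stty -a | grep -o 'cr[0-3]'   # show the current CR delay style
    stty cr0                      # back to no delay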


Not just terminals: most printers until the mid-'80s or so also cared about the distinction, for the same reason. They started adding DIP switches allowing you to change the default behavior, and by the end of the '80s the need for CR+LF was pretty much a thing of the past.

While the original reason for this distinction probably did date back to teletypes, some practical use cases evolved out of the capability. IIRC, some word processors used the ability to provide an overstrike mode before bold was a thing most printers could do. ASCII art photos were also a thing, both for printers that didn't support a graphics mode and as a way of printing a larger 'image' on the paper faster while using less (ribbon) ink.


The distinction goes back to typewriters.

The author mentions that the typewriters had a return lever that could also do a line feed.

What they missed is that:

* You could still freely move the carriage without advancing the paper if you wanted: there was a carriage release button;

* You could also advance paper manually at any time, while keeping the carriage at its current position;

* You could often adjust line spacing, including setting it to 0, which means that the big lever would only return the carriage, but not advance the paper.

In short, CR and LF were separate useful functions on a typewriter.

The backspace key was useful for the same reason; obviously, it wouldn't erase text - it would move the carriage just like the space bar would, but in the opposite direction.

These were necessary for:

* Underlining text (by overtyping the underscore character over lines of text);

* Adding accents, or crossing zeros with overtyping;

* Bold text by retyping the same line;

Etc.

Tab stops in your favorite document editor come from the same era: the "Tab" key would release the carriage, and the spring would pull it forward until the next physical stop (whose positions you would manually set). The feature was made to make typing tabular data easier, hence the name.

That said, the author has a solid point about why we have CR and LF in that order.

To wit, on typewriters, the LF+CR lever would advance the paper first, because it made sense both mechanically and from the UX perspective (you would nudge it to advance paper by whatever line spacing you set, as opposed to manual advance which could be a fraction of that).

Returning the carriage, and then advancing paper indeed makes more sense on the terminal, where all operations but CR take a fixed amount of time.

That was the real value of the article: giving a motivation for CR+LF rather than LF+CR as the line ending.

The former works well with terminals, the latter with typewriters.

Computers therefore use CR+LF.


CR and LF were invented as codes, along with the teletypewriter, by Douglas Murray around 1900. Original Baudot (ITA#1) and other earlier telegraph codes were one-dimensional with no concept of lines.

Perhaps someone with institutional access to his paper, Setting type by telegraph, can see whether it discusses the intent of the pair.

Setting type by telegraph, Journal of the Institution of Electrical Engineers, Volume 34, Issue 172, May 1905, p. 555–597 https://doi.org/10.1049/jiee-1.1905.0034


You can download it from sci-hub. I just want to point out how ridiculous it is that a document that is obviously out of copyright is paywalled by these publishers.


Back in the eighties I worked in the military with teletypes. Some examples are:

https://www.radiomuseum.org/r/siemens_fernschreiber_t37.html https://www.cryptomuseum.com/telex/siemens/t100/index.htm https://www.camion-militar.com/spa/item/ART02979.html

There were also other makes, like Olivetti.

They were connected to various networks, e.g. military, the civil Telex network or the ICAO AFTN network.

https://en.wikipedia.org/wiki/Telex https://en.wikipedia.org/wiki/Aeronautical_Fixed_Telecommuni...

They all used a 5 bit character set Baudot/ITA2: https://en.wikipedia.org/wiki/Baudot_code (I can still read a paper tape message directly from the reel)

As these teletypes were quite slow (typically 50 baud) and mechanical, the procedure was for line shifts to always be CR CR LF. The reason for this was to give the carriage (print head) enough time to return to the start of the line before the next printable character arrived.
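
Back-of-the-envelope, assuming the usual 7.5-bit character frame of 50-baud ITA2 (1 start + 5 data + 1.5 stop bits):

    awk 'BEGIN { printf "%.0f ms per character\n", 7.5 / 50 * 1000 }'   # -> 150 ms
    # so CR CR LF buys the carriage roughly 300 ms before the next printable character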

This was enforced in the procedures, so you might end up with a rejected message if you did not do it properly.

When computers were introduced in the various networks, ITA2 was converted to ASCII and some of these restrictions were relaxed a bit.


What a coincidence to see CRLF discussed on the HN front page, this bit me just yesterday!

While using the aws cli tool on Linux (the Docker containerized version) with the "text" output format, it turns out each output line ends with CRLF ("\r\n") instead of simply LF ("\n"), as is common for Linux cli tools, so I had to add a post-processing step to my commands to remove the CRs (a chained call to tr -d '\r'). Otherwise, my bash variables would end up containing those pesky carriage returns, and debug logs would look unexpectedly bizarre.
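
For reference, the workaround looks roughly like this (the aws invocation is only illustrative; adapt the image, command and credentials to your own setup):

    # strip stray CRs so only LF line endings reach the shell variable
    value=$(docker run --rm -t amazon/aws-cli ec2 describe-regions --output text | tr -d '\r')
    printf '%s' "$value" | od -c | tail -n 2   # verify no \r remains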

In retrospect it's obvious what happened, but it got me confused for a good while. Why would it output CRLF at all? Also of course, not a single mention of this in any of the docs I was reading.


The Unix convention is for "\n" alone to signify a line-break; when Unix programs talk amongst themselves (such as piping one to another), that's what they do.

The terminal convention (as described in the linked blog post) is for "\r\n" to signify a line-break. When a Unix program talks to a terminal, the kernel will (by default) automatically add or strip "\r" characters as needed.

Docker messes this up, since the tool inside the container is isolated from the environment where the "docker" command is run. Since it can't directly talk to the "docker" command's standard output, you have to manually choose whether it's connected to a pseudo-terminal ("docker run -t ...") or to a pipe ("docker run ...").

If the command runs with a pipe, no output translation will be done, but the program will likely be difficult to use interactively - there are likely to be no interactive prompts, no line-editing, etc.

If the command runs with a pseudo-terminal, then it will behave as if it's attached to a terminal - you're likely to get input prompts, ANSI colour codes, and "\r\n" line endings, even if the output of "docker run -t ..." is being piped to a file.

I don't know if this is the behaviour you hit, but I can easily imagine somebody running a tool in a docker container interactively with "docker run -it ..." until they got the behaviour they wanted, then just sticking that command in a shell pipeline like they would any other Unix tool, and getting a nasty surprise.
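
You can see the difference with any small container image; od -c shows the raw bytes coming out of "docker run" with and without a pseudo-terminal (alpine is just a convenient example here):

    docker run --rm    alpine echo hello | od -c   # ... h   e   l   l   o  \n
    docker run --rm -t alpine echo hello | od -c   # ... h   e   l   l   o  \r  \n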


Thank you a lot for this explanation. I have a good understanding of the difference between using a pseudo-terminal or not (and its effects on Docker commands), but I had never learned about this difference in line-break behavior before.

I'm launching the command with -it as suggested on https://docs.aws.amazon.com/cli/latest/userguide/install-cli...

However, now that you mention it, this page does comment on the usage of -it, so that might be it... thanks again for the hint!


For the record, ONLRET is the relevant flag and you can disable it with “stty -onlret”. Then output newlines will stay newlines (which typically results in bizarre behavior if those raw newlines are sent to a terminal).


The dos2unix tool is purpose-built for this!


You are right, that's the best tool to use for the job!

However, as usually happens with scripts, adding non-coreutils dependencies becomes an issue (now the script needs to check for their existence, ask users to install them if missing, etc.). For this use case, which is so well scoped, a single-character removal seems enough, but in general I totally agree that dos2unix is the better choice.


Note that dos2unix and unix2dos do what you expect when used as filters, but overwrite the specified file in place when given a file name. This is a footgun second only to one in pdftotext, arguably, which (unsurprisingly) refuses to work as a filter and (surprisingly) will silently create a matching .txt file when passed a single .pdf as an argument.
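
For the record, the three modes look like this (dos2unix's -n "new file" mode is the safer middle ground):

    dos2unix < in.txt > out.txt   # filter: in.txt is left untouched
    dos2unix in.txt               # in place: in.txt is rewritten
    dos2unix -n in.txt out.txt    # new-file mode: write the converted copy to out.txt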


Why is that a footgun? If it just piped the output to stdout, then to convert a file, you then have to

    $ dos2unix file >x
    $ mv x file
In my mind, overwriting the file is the Right Thing.


The Right Thing would be to require an explicit -o <output filename> option, to avoid modifying files without getting the user's consent first.

Especially because the changes it makes are irreversible (e.g. when a text file has mixed Unix/DOS line endings).


Or flip (https://ccrma.stanford.edu/~craig/utility/flip/)! I've been carrying this thing around with me for years.


Vague memory, so take it with a grain of salt, but I think this might be a way of getting the break to show up in an editor that expects \n or one that expects \r. Maybe most editors that expect one simply ignore the other?


Related to that, the ansible "community.aws.aws_ssm" connection plugin[0] always returns the output of "raw:" tasks with "\r" appended to it, too, and I'd guess it's for a similar reason

0: https://github.com/ansible-collections/community.aws/blob/3....


UPDATE: The comment by user thristian [0] did indeed contain an explanation for this and hinted to what is the solution to this problem!

"docker run ..." should be scripted without "-t", in order to run the command without a pseudo-terminal, which then yields output with Linux style line breaks ("\n").

[0]: https://news.ycombinator.com/item?id=30255017


If for some reason you do want a terminal, you can run `stty -onlcr` to turn off the NL → CR LF conversion.


Weird - I also had to write this exact thing:

   Creds ()
   {
       kubectl exec -ti $(getpod cred-api) -- printenv | grep --color=auto RAWDATA | tr -d '\r'
   }


See my "UPDATE" comment; basically the issue seems to be that the output from "docker run" will contain DOS-type line breaks ("\r\n") if running with a pseudo-tty attached, so, with the "-t" argument. If that is not needed (e.g. when scripting) this argument can be dropped, and then the stdout comes clean with Linux style line breaks ("\n").

I don't know about kubectl, but if those "-ti" mean the same as in plain Docker, you could check if dropping the "-t" helps. After all, it looks like you are also scripting, so any kind of interactive input/output would probably be unexpected and undesired in your command.


My first "computer" was a thermal paper terminal, with a modem-phone coupler to dial in to Compuserve (like $6 an hour after 6pm, with minimums).

The separate actions of CR and LF were definitely important.

The PDP-11 at school also used a paper terminal, where the distinction was just as important.

It's really amazing how far we've come considering what everyone has in their pocket.


It's a dreadfully common story: some ugly hack which was necessary at the time persists for decades (or centuries, in the case of legal/social systems), because nobody can really be bothered to change it.

The extra file size is not much of an issue, but it does mean we still require stupid workarounds with Git and anything else that needs to handle cross-platform plaintext.


But it’s still worse when the ugly hack was only useful for a very short time, probably wasn’t even necessary, and ruins other things by existing.

UTF-16 is my favourite to complain about: created because a few large companies had put a couple of years of heavy investment into UCS-2 (which was a massive breaking change itself) and didn't want to let that go to waste (with all the bother they'd been putting users through), even though by the time they actually made UTF-16, UTF-8 with its ASCII compatibility had been invented. And so they ruined Unicode for everyone for ever after with that accursed abomination, surrogates, and the distinction between code points and scalar values.
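
For anyone who hasn't had to fight them: a surrogate pair is just an arithmetic split of a code point above U+FFFF into two 16-bit units. A quick sketch with U+1F600 as the example (decimal constants so any awk will run it; 65536 is 0x10000, 55296 is 0xD800, 56320 is 0xDC00):

    awk 'BEGIN {
        cp = 128512 - 65536     # U+1F600 minus 0x10000
        printf "high: %X  low: %X\n", 55296 + int(cp / 1024), 56320 + cp % 1024
    }'
    # -> high: D83D  low: DE00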


And now Unicode have declared themselves kneecapped to ~20 bits (1,111,998) because of it. I doubt we'll run out of space like we did with 16 bits, but never say never. The whole reason UCS-2 was used was companies saying "16 bits is enough" and then it not actually being enough. Almost every issue with Unicode is because developers made assumptions, despite Unicode themselves providing guidance saying otherwise.


Almost every issue with software in general is because developers made assumptions.


How is this a "hack"? If anything, not using CRLF line endings is the hack, because it means you can't just send the file straight to a terminal: the line endings have to be translated by something on the way to the terminal for it to display properly.


The "hack" is to wrongly define "move down, move left" to be the same as "line ending". It isn't. The first are commands to a machine, the second are elements of encoding text in binary storage.

Or more generally, the "hack" is to act as if ASCII were a character encoding, when it's actually the encoding for a set of commands to a machine.


It's both, and more.

It's also a document encoding (there are characters for encoding structured data too).

Given the era in which ASCII was invented, I wouldn't really call it a hack. Character encodings were one of the lowest-level abstractions. The size of a byte wasn't always fixed at 8 bits, and it was often the number of bits in your character set that defined how big your byte was (yes, I know ASCII is technically 7-bit, but one of the original considerations was that the 8th bit would be a parity bit: https://en.wikipedia.org/wiki/Parity_bit ). It's also the reason the smallest unit of addressable data is a byte / char.

Bear in mind that we are talking about an era before operating systems as we recognize them. Back then each machine would have its own operating system, and some institutes would even have different operating systems despite having the same machines. Really these operating systems were more like firmware than what we have today. So there wasn't any distinction between character encodings and machine commands, but there were differences between machines. So it made sense that ASCII was cross-purpose, because that's exactly what characters were: raw addressable data containing instructions. Those instructions were sometimes printable data, sometimes formatting data, and sometimes control data.

I doubt any designers of ASCII could have even imagined the way we use computers these days when writing that specification. Even just the fact that character encodings have become an interchangeable abstraction for rendering text would have blown their minds. Never mind everything else.


Is your text modeling a text document or is it sending instructions to a terminal?


It wasn't a hack though. Characters were low-level addressable data. It's no coincidence that on machines that had a 6-bit-wide character set, the smallest addressable unit of data (a byte) was also 6 bits (I know ASCII is 7-bit, but that was to reduce transmission overhead or to allow for a parity bit when required, e.g. on punched tape or cards that often supported 8 columns of bits).

The way machines worked back then, you'd address a byte of data and that byte contained instructions. Those instructions were sometimes printable data, sometimes formatting data, and sometimes control data.

It wasn't until operating systems became more sophisticated, hardware became more sophisticated, and the two became decoupled, that character encodings became the abstraction for rendering text that we think of them as today.


Character codes predate computers. CR and LF codes date to 1901.


Even earlier than that. CR and LF go back to Baudot code (which is where term 'baud' also originates) and that was created in the 1870s.

https://en.wikipedia.org/wiki/Baudot_code


Baudot's (ITA1) was linear, with no concept of lines. Murray's (which evolved into ITA2 with minor changes) added CR LF, since he developed a typewriter-like system.


Ahhh I see where you've gotten that 1901 date from now:

> In 1901, Baudot's code was modified by Donald Murray

> The Murray code also introduced what became known as "format effectors" or "control characters" – the CR (Carriage Return) and LF (Line Feed) codes.

(source: https://en.wikipedia.org/wiki/Baudot_code#Murray_code)

Thanks for the correction there. Learned something new today :)


> because nobody can really be bothered to change it

Python 2 to 3 all the things, I guess? You’d then need another workaround choosing between CR (now obsolete), CRLF, LF and RS (renamed to LE), the latter being a true newline.


I'm a little skeptical. Equipment from this era was designed by slide rule and sent to draftsmen to draw up and send down to the factory floor. The solution to problems was more power.

I would imagine some wily old programmer knew commands would be latched, and maybe, as maintenance fell behind, the CR would lag a little but could be trusted to complete. Better to do CR LF than LF CR, but they both have to happen.

I dunno. Maybe this model sucked. But every teletype I've ever used was kinda scary, those things will take your fingers off. They're power tools capable of violence. I'd guess, if you pulled out the maintenance guide, and applied proper lubrication, and replaced wear parts, it wouldn't miss those deadlines. I guess you might need to pull out the schematic and verify the electronics are still in spec.

I mean, yeah, the author has a point, but I'd believe it's a wear issue that this teletype doesn't reliably return in time.


The carriage return time spec for the 5-level teletypes is that CR LF should be sufficient. I've restored two Teletype Model 15 machines, and both work properly with plain CR LF. They're 45 baud, so two characters is 400ms. 8-level, 110 baud devices often needed more time.

The early Model 15 teletypes would just overtype at the right margin if you didn't send a CR, and would type on top of the previous line if you didn't send an LF. Mechanical options requiring extra parts were later offered so that overrunning the end of line would force a CR LF. One letter would be typed about halfway across the page, but it was better than losing an entire line because one character was garbled. This was useful for links over shortwave radio, where losing characters was not uncommon. With that feature, CR implied LF. I have two Model 15 machines; one has that feature, the other does not.

Too many years ago, I had to overhaul a Teletype driver to handle all the serial devices that appeared just before memory got cheap and everything got buffering. I put in a delay function of the form K1 + linelength*K2. Some serial printers worked with wide paper and needed long delays for their long lines. There was at least one device [1] that needed a negative K2 term. It had a minimum time to print a line, so short lines needed an extra delay.
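
The shape of it, sketched in shell rather than the original driver code (the K1/K2 values and the clamp to zero here are made up for illustration):

    line_delay_ms() {   # $1 = number of characters on the line just printed
        awk -v len="$1" -v k1=50 -v k2=2 \
            'BEGIN { d = k1 + k2 * len; if (d < 0) d = 0; print d }'
    }
    line_delay_ms 132   # a full-width line on a wide printer -> a longer pause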

(The early Teletype machines are the dream of the right-to-repair people. Everything is fully documented, all parts are removable, and they are very repairable. They're also high-maintenance. They need regular oiling, and there are several hundred oiling points and three different lubricants. In heavy use, they need an annual cleaning which involves partial disassembly, soaking in a cleaning bath, and lubrication. They have several hundred adjustments, and a whole "adjustments" manual. Nobody would put up with that today. That's the downside of "right to repair" - the hardware is repairable, but clunky.)

[1] http://archive.computerhistory.org/resources/text/DataInterf...


Thanks for the awesome insight!

One note:

> That's the downside of "right to repair" - the hardware is repairable, but clunky

I'd say there is no downside; cause and effect are swapped here. It's that clunky, expensive hardware had better be repairable, because the user would rather retain a technician than send the machine in for maintenance (at the cost of time and money).

Things that are easily repairable don't have to be clunky or unreliable. Between my bicycles and synthesizers (and music hardware in general), I have dozens of pieces of functional machinery that have lasted decades, and will last for decades to come.

All that with little maintenance required, and with the maintenance being outright pleasant.


Thank you for the wonderfully detailed reply. I mostly agree with your right to repair sentiments. I have a remote controlled car, and have to keep reminding myself fixing it is half the fun. It sure would be nice to have the option to evaluate repair, either myself or pay a professional, rather than throw things away and buy a new one.


HTTP also uses CRLF, but the spec is flexible, I think.

Still, most if not all implementations use it.

That is a lot of bytes when you sum them up.

Also, CRLFCRLF is used to demarcate the end of the headers, and CRLF delimits chunks.
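
You can watch it on the wire with a hand-rolled request (example.com is just a placeholder host):

    printf 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' \
        | nc example.com 80 | head -n 5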


All textual protocols with a legacy of IETF-style design use CRLF, and that includes HTTP, which drank heavily from that legacy.

Personally I suspect it's because ARPA was a bit cheap on some things and thus they used typewriters instead of protocol analyzers ;)


You couldn't do overtype without them being distinct. I think it's that simple.


Well, you could if you had a line-unfeed. So by default it would do both, but then in the uncommon case where you want overtype you could go back up.

A little late for that insight, if it's even worthwhile, but typically I would optimize the common case by default and add features for doing advanced / uncommon things.


That has always been my understanding. If you want bold, underline or strikethrough then you need to be able to send a carriage return on its own.


You could just backspace repeatedly, but it'd be a waste of bits.
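
The two byte streams would look something like this (shown with printf escapes; a real printing terminal overstrikes the characters, while a modern emulator just keeps the last character written to each cell, so you only see the underscores):

    printf 'word\r____\n'         # one CR, then overtype underscores under the whole word
    printf 'w\b_o\b_r\b_d\b_\n'   # or backspace after every character instead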


> CR and LF are used for each new line.

The author's insistence on calling CR+LF "new line" only serves to confuse.

LF is line feed, thus new line. The CR is to readjust the write head back to the starting position.


> LF is line feed, thus new line.

  LF is line feed, thus next line.
                                   The CR
                                          is to
  readjust the write head
  back to the
  starting position.
A line feed alone is not a new line because some or most of the line is already used up. The B '*n' and C '\n' introduced the notion of new line, which outputs both a next line (LF) and back to the starting position (CR).


More intuitively, CR is the horizontal dimension and LF is the vertical dimension in the printer head position on the page, which makes it more obvious which is the new line char.


> More intuitively

Not for me. I get the dimensions but hell if I can remember it.

But the mnemonic 'ReturN' (for the Windows-style \r\n) helps (well, a bit) here:

    \r: carrier Return
    \n: liNe feed


A physical printer (like a dot-matrix one) couldn't care less whether you told it to CR LF or to LF CR, you'll end up at the start of the next line either way; neither does a terminal emulator, normally (but mind the generic kernel terminal driver in between, which normally has the onlcr knob turned on). Apparently some systems even used LF CR as the line terminator. (Bare CR is more well-known, having been used by classic Mac OS, and also makes a bit of sense if you remember the parallel port had an AutoLF line that instructed the attached printer to translate CR to CR LF. I don't know if that is the actual reason for the convention.)

The caveat is that advancing the paper vertically is relatively fast, but backing up the carriage[1] horizontally can take some time, so on a dumb and slow printer LF CR abc might end up printing abc in the wrong place while the carriage is still moving backwards, while on a smart but slow printer it will just be slower than CR LF abc. I suspect the dumb part is the origin of the CR NUL convention in Net-ASCII[2]; at least both termcap (dC capability) and terminfo (padding machinery) can describe, and curses handles, the possibility that the terminal requires a delay between issuing a CR and printing new characters at the beginning of the line.

[1] It’s a carriage, as in a moving thing that contains useful stuff (cf 3D printers), not a carrier, as in a bare unmodulated signal (cf modems).

[2] https://tools.ietf.org/html/rfc20, which probably takes the prize as the lowest-numbered RFC you could still encounter as an up-to-date reference.


The linked article talks about this timing issue a bit, including samples of what happens if you print too soon on a physical printer while the CR is still physically happening (in the "flyback" section).


> a moving thing that contains useful stuff, not a carrier

Say that to USS Theodore Roosevelt!

:-)


Remarkably uninteresting.

Next: Let us discuss what the Unicode Character 'REVERSE LINE FEED' (U+008D) is good for.


Uninteresting depends on your perspective. For anyone old enough to have used a Teletype certainly it is no mystery, but how many of us are left?

Speaking of, a Teletype didn't understand anything outside of the ASCII range. So what was reverse line feed good for? Absolutely nothing!

The author remarks that his Teletype was old because it didn't have lower case. But in my experience that was almost universal - I think I only saw a single Teletype that had lower case.


Olivetti Teletype had lower case. I had an interesting episode where the University of Nitpicking Nazis discussed whether my thesis, printed entirely on Teletype, was acceptable. It was not as good-looking as with an IBM Selectric, but quite comparable to regular manual typewriters.


Was it the letterforms that bothered them, or the curled yellow paper?


I really don't know; I think it was a moral issue. When the paper was made by a computer, it was like cheating. It would have been cheap and easy to hire a professional typist to make a copy, but it would have been so lame that I refused to do it.

A few years later those same professors got their own computer terminals, and then they demanded personal computer operators. Masters of Academia do not dirty their hands on keyboards.


Someone needs to introduce them to Donald Knuth.


VERTICAL TAB is much more puzzling.


Devices that needed to produce output on printed forms would use a physical reference to ensure the text went to the right place. One of the commands a tabulator could give the typewriter it drove was to move the form to the next marked position—a literal tab, or maybe a functional equivalent like a hole in a punched tape. Thus, vertical tab.
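
It survives as plain ASCII 0x0B, and you can still poke at it; on most terminal emulators it just moves the cursor down without returning to the left margin:

    printf 'name\vaddress\vcity\n' | od -c   # shows the \v bytes between the fields
    printf 'name\vaddress\vcity\n'           # on screen: a staircase down the page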


Interesting, I hadn't heard of it before. Apparently it was primarily for printing purposes:

https://stackoverflow.com/questions/3380538/what-is-a-vertic...


Until you insert a floppy written in DOS into a Unix machine.


And hopefully a short future!





