Java gets newline handling right

The platform-dependent representation of newlines (as CR, LF, or CR+LF) is a surprisingly persistent annoyance. In theory, this problem was solved decades ago, when I/O libraries began to automatically convert from the host platform's newline convention to a portable in-memory representation. But in practice, files are routinely shared across platforms without being converted along the way, so a file may contain any of the three newline sequences, or even a mixture of them. (No, really, I've run into this repeatedly.) Knowing the host platform's newline convention doesn't help when the data doesn't follow it.

The obvious solution is to recognize all three newline sequences, and treat them all identically. Unfortunately this creates ambiguity: a character such as CR might be a newline, or it might be a carriage return that occurs in the middle of a line, and the two will read identically. This means attempting to copy a file by reading and writing lines won't always preserve the original file. But this isn't a problem for most programs, because those that read lines of text generally only want input that makes sense as text, and that doesn't include arbitrary control characters. (Compare to C's restriction on the null character: it's valid but rarely plausible, so the restriction is not usually a problem.) As long as there are other operations for reading data in its raw form without interpreting it as text, the operation for reading lines need not distinguish between newline conventions.
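To make the ambiguity concrete, here is a minimal sketch of a line-by-line copy. It uses Java's BufferedReader.readLine, which recognizes all three sequences; the class name LossyLineCopy, the helper copyByLines, and the sample text are made up for illustration, and writing every line back with LF is just one arbitrary choice of output convention.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LossyLineCopy {
    // Hypothetical helper: copies text one line at a time, ending every
    // output line with LF regardless of what terminated it in the input.
    static String copyByLines(String text) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // A bare CR mid-line (say, a progress display) next to a CR+LF ending.
        String original = "progress: 10%\rprogress: 100%\r\ndone\n";
        String copy = copyByLines(original);
        // The bare CR was read as a line ending, so the copy is not identical.
        System.out.println(copy.equals(original)); // prints "false"
    }
}
```

The copy has three lines where the original arguably had two, which is exactly the information a line-oriented reader throws away. A program that needs a faithful copy should read the data raw instead.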

It's also convenient to do newline conversion in the read-line operation, not as a feature of the underlying stream. This makes it easy to reversibly read text with control characters, and also makes it easier to mix text and other data in one stream. As a bonus, it eliminates the need to specify text or binary mode when opening a file.

My brilliant ideas have, of course, already occurred to someone else, and been implemented in a popular language: Java. From the documentation for java.io.BufferedReader:

public String readLine()
                throws IOException

Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.

Returns:
A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached

Wrapping a stream in a BufferedReader is awkward, but readLine does exactly what I want. Not even Perl solves this text processing problem so conveniently! Java is not rich in such conveniences, so I was surprised to discover that it has this one. It may be motivated by portability: Java wants programs to behave identically on all platforms, and platform-dependent newline handling violates that.
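For illustration, a small sketch of readLine on input that mixes all three conventions. The class name MixedNewlines, the helper readLines, and the sample string are assumptions for the example; a StringReader stands in for a real file so the sketch is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MixedNewlines {
    // Hypothetical helper: collects every line readLine returns until it
    // signals end of stream with null.
    static List<String> readLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // LF, bare CR, and CR+LF in the same input.
        String mixed = "unix\nclassic mac\rwindows\r\nlast";
        System.out.println(readLines(mixed));
        // prints [unix, classic mac, windows, last]
    }
}
```

Note that CR+LF yields one line break, not an empty line between two, and the terminators never appear in the returned strings.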

C# (or rather .NET) follows Java here. So, presumably, do most JVM and .NET languages. I suspect there are others that have either borrowed this feature or invented it independently. Which ones?

Update April 2014: Python calls this “universal newlines”, and has had it since 2003.

(It would be nice to detect text encodings the same way, since many streams have unknown or mislabeled encodings. Unfortunately, there are byte sequences that are not only possible but plausible in more than one encoding, so there is no solution reliable enough to be the default.)

2 comments:

  1. Delphi does something very similar (TStrings.SetTextStr). Everything up to #10 or #13 is read as part of the line, but the 10 or 13 is not itself consumed. Then, if the current character is #13, it's skipped. Then, if the current character is #10, it's skipped. And now we're on to the next line.

    That means that any of #13, #10 or #13#10 are treated as line break sequences.

    It may predate Java's implementation; at least Delphi 2 worked this way, which was released in February 1996. BufferedReader is from JDK 1.1, which wasn't released until 1997.

  2. R6RS get-line and the forthcoming R7RS read-line do the right thing. R6RS ports also have the ability to convert any newline (CR, LF, CR+LF, NEL, CR+NEL, LS) into LF on read, and to produce any of these on output.

