Pointer arithmetic can be safe

Advocates of memory-safe languages sometimes contrast them with C by saying that they don't have pointers, or (when someone points out how impractical they'd be if that were really true) that they don't have pointer arithmetic. This is supposed to make them safer. Because pointer arithmetic is unsafe, right?

Not necessarily. Pointer arithmetic in C happens to be unsafe, but this is not a problem with pointer arithmetic, only with C, or rather with its conventional implementation technique. C pointers are usually implemented as raw addresses, and pointer arithmetic as simple arithmetic on those addresses. The C standard, however, doesn't require this. It defines pointer arithmetic (and comparison) only for pointers to the elements of an array (or one past the end), and leaves arithmetic and dereferences outside those bounds undefined. It doesn't require bounds checking, but it doesn't prohibit it either. So it's possible to make a conforming C implementation that bounds-checks every pointer operation.

This has been done. Zeta-C (source here) was a C compiler for Lisp machines, which don't support unsafe array access at all. Scott Burson, the author, explains how it handled this:

All pointers were represented as pairs of an array and an index

Pointer arithmetic operated on the index, leaving the array intact. Pointer dereferences used the index with the ordinary, safe array operations, so all pointer dereferences were bounds-checked. Since Zeta-C also fixed C's other memory-safety problems (free did nothing, uninitialized variables were not garbage, and casts could not forge pointers), it was a memory-safe C compiler. This was part of its attraction — people didn't use it because they wanted to run C programs on Lisp machines, but because they wanted to debug their C programs in a safe implementation with the Lisp machine's excellent tools.
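
In C terms, the representation might look something like the following. This is only a sketch with made-up names (fatptr, fat_add, fat_deref), not Zeta-C's actual implementation, which worked on Lisp arrays:

  /* A bounds-checked "fat pointer": the pointer carries its array's
     bounds, arithmetic touches only the index, and every dereference
     is checked. */
  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct {
      char  *base;   /* start of the underlying array */
      size_t size;   /* number of elements in the array */
      size_t index;  /* which element this pointer refers to */
  } fatptr;

  fatptr fat_add(fatptr p, ptrdiff_t n) {
      p.index += n;  /* arithmetic leaves base and size intact */
      return p;
  }

  char fat_deref(fatptr p) {
      if (p.index >= p.size) {  /* catches overflow and underflow alike */
          fprintf(stderr, "bounds error\n");
          abort();
      }
      return p.base[p.index];
  }

  int main(void) {
      char buf[] = "abc";
      fatptr p = { buf, sizeof buf, 0 };
      printf("%c\n", fat_deref(fat_add(p, 2)));  /* prints c */
      fat_deref(fat_add(p, 7));  /* aborts instead of reading garbage */
      return 0;
  }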

C programmers are well aware that memory unsafety is their biggest problem, and many other tools have been written to deal with it, but few of them recreate this feature. The technique is known to implementors (often by the name “fat pointers”) and is available as a patch for GCC. But it's not considered a standard part of the C debugging toolkit, even though it's easier to implement, and has a smaller performance cost, than commonly used tools like valgrind. I don't understand why. Wouldn't it be nice if your C compiler had a --safe mode which eliminated most memory safety bugs?

Update December 2013: The big problem with fat pointers is that they're incompatible with ABIs that use raw pointers, and virtually all interesting C programs make system or library calls through such an ABI. So a practical implementation of fat pointers needs to support raw pointers too, which adds complexity and greatly reduces the benefit.

Scott also tells a cute story about portability:

If you looked real closely, there were lots of little corners of C semantics where ZETA-C was not correct. In practice, however, one very rarely tripped over any of these.

For instance, I used Lisp integers for C int and long. This meant bignums would be created automatically, as usual in Lisp. Technically this is not a correct C implementation (even though I don't think the standard specifically says that the length of int and long shall be finite, one can take this as implied) but it very rarely ran into trouble. The only such case I remember, which was rather amusing, was a program that did something like

  int i;
  for (i = 1; i; i <<= 1) ...

(shifting a 1 bit left repeatedly, expecting it to fall off the left end of the word).

Who expects their C programs to run on a machine with infinite word size?

“Irritants”

I laughed when I first saw the arglist for R6RS's error:

(error who msg irritant1 ...)

There are irritants in the Scheme oyster, and error pearls them away for later reference. Cute!

“Irritant” is not a new term; it's existed in Scheme culture for a long time. The earliest reference I can find is in a Maclisp Scheme info file last modified in 1985, and it has turned up occasionally on the rrrs-authors mailing list. I haven't found it in any non-Scheme manuals (yet). Is it a Scheme-only term? Does anyone know where it originated?

For a while, the cuteness blinded me to a possible confusion: “irritants” suggests that the arguments should be the cause of the error — that they should be in some sense wrong or invalid. But they're only informative, and are often innocent indicators, not the causes of the irritation.
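
For instance, in a made-up call like this one (using R6RS's signature), the irritants are innocent bystanders, reported only to identify the failed attempt:

  ;; The host and port aren't invalid values; they merely locate the
  ;; irritation.
  (error 'connect "connection refused" "example.com" 80)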

A format string and arguments probably work better, because they make for clearer messages; so does a plain string, in a language with string interpolation. Even though neither calls for a cute name.

Weak symbol tables in Common Lisp?

Christophe Rhodes asks, about Common Lisp:

something that has been suggested a few times is that packages should intern their symbols in a hash table which is weak: dear cleverweb, is there any way in which this can be detected? Think like pfdietz

Disappearing symbols can be detected by find-symbol...

CL-USER> 'foo
FOO
CL-USER> (setf * nil) ;get rid of the REPL's reference
NIL
CL-USER> (gc :full t)
NIL
CL-USER> (find-symbol "FOO")
NIL

...or by counting the symbols in a package (which might plausibly happen as part of a show-all-packages feature)...

CL-USER> 'foo
FOO
CL-USER> (loop for s being the symbols of "CL-USER" count t)
1265
CL-USER> (progn (setf ** nil) (gc :full t))
NIL
CL-USER> (loop for s being the symbols of "CL-USER" count t)
1264

...or by the apparent change in the home package of a symbol when a shadowing symbol disappears:

CL-USER> (defpackage "EXPORTER" (:export "FOO"))
#<PACKAGE "EXPORTER">
CL-USER> (defpackage "IMPORTER")
#<PACKAGE "IMPORTER">
CL-USER> (intern "FOO" "IMPORTER")
IMPORTER::FOO
NIL
CL-USER> (use-package "EXPORTER" "IMPORTER")
T
CL-USER> (intern "FOO" "IMPORTER")
IMPORTER::FOO
:INTERNAL
CL-USER> (progn (setf * nil) (gc :full t))
NIL
CL-USER> (intern "FOO" "IMPORTER")
EXPORTER:FOO
:INHERITED

I don't think any of these are really problems. It's already dangerous to attach any meaning to whether a symbol exists, since it might be unpredictably created by read. Unpredictable disappearances aren't any more confusing.

But instead of “thinking like pfdietz”, what about thinking like a user? There's a much more important problem here: symbols have definitions!

CL-USER> (defun foo ())
FOO
CL-USER> (progn (setf * nil) (gc :full t))
NIL
CL-USER> (foo)
Error: FOO has no fdefinition

CL symbols combine two functions: that of interned strings and that of top-level bindings. Top-level bindings are exceedingly important things, and the symbol table is often the only reference to them. So CL can't GC unreferenced symbols. :( And indeed, SBCL, Clisp and ABCL all hold their symbols strongly.

In languages with pure symbols, however, this is not a problem, so weak symbol tables are common (so much so that it never occurred to me that they might be impossible in CL!). They're present in Racket, OpenDylan, Clojure (for keywords) and even Java (for String.intern).

It might be safe to hold symbols weakly, even in CL, when they have no interesting properties: when they're not boundp or fboundp, have nothing on their plists, and aren't external. These are the symbols that, if GCed and interned again, would be recreated exactly as they were. (Keywords could be held weakly even if boundp, since intern would recreate that too.)

This would allow garbage collection of symbols which have only been read, but protect those that have been used for anything else. This could be implemented by keeping a separate weak hashtable for “plain” symbols, and moving them to the regular table when they become “interesting” (on set, (setf fdefinition), (setf get), etc.).
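
Here's a rough sketch of that two-table scheme. The names (pkg, pkg-intern, pkg-promote) are made up, and the :weakness argument is SBCL's spelling; other implementations provide weak hash tables differently:

  (defstruct (pkg (:constructor make-pkg ()))
    ;; "Interesting" symbols: held strongly, never collected.
    (strong (make-hash-table :test 'equal))
    ;; "Plain" symbols: held weakly, so the GC may drop one once
    ;; nothing else references it.
    (weak (make-hash-table :test 'equal :weakness :value)))

  (defun pkg-intern (name pkg)
    (or (gethash name (pkg-strong pkg))
        (gethash name (pkg-weak pkg))
        (setf (gethash name (pkg-weak pkg)) (make-symbol name))))

  (defun pkg-promote (symbol pkg)
    "Move SYMBOL to the strong table. Called by whatever makes a symbol
  interesting: SET, (SETF FDEFINITION), (SETF GET), EXPORT, and so on."
    (let ((name (symbol-name symbol)))
      (remhash name (pkg-weak pkg))
      (setf (gethash name (pkg-strong pkg)) symbol)
      symbol))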

Is this a good idea? Is it worth the complexity? Or is it all obvious, and the real question whether there's anything in the spec that forbids it?

When was ML invented?

Most sources say that ML, the language which introduced Hindley-Milner type inference, was invented in 1973. Its type system, however, was not described until 1977 or 1978. Milner's A Theory of Type Polymorphism in Programming says:

the polymorphic type discipline which we discuss here has been incorporated in the LCF metalanguage ML [2, 3], and has been in use for nearly 2 years. The compile-time type checker for this language has proved to be a valuable filter which traps a significant proportion of programming errors.

The paper was written in 1977 and revised in 1978, so “nearly 2 years” means 1975 or 1976. (References 2 and 3 are LCF and ML documentation, from 1977 and 1978; neither is on the net.) The 1972 LCF documentation (warning: bad partial OCR) doesn't mention the metalanguage, so I suspect the date of 1973 refers to when it was first added.

Without its distinctive type system, ML would have been an uninteresting ISWIM derivative (with dynamic typing, or monomorphic static typing?), hardly recognizable as the ancestor of all modern statically typed languages. So we should date it not from its first, rudimentary version, but from the introduction of its most important feature, circa 1975, or from its publication, in 1977.

Update August 2015: Uninteresting? What about exceptions and pattern-matching?

Ill-phrased slogans don't go right

“Well-typed programs don't go wrong.” This slogan for ML-style static typing is infamous, since it's true only under an extraordinarily narrow definition of “go wrong”. This definition is not used for any other purpose, so it's virtually impossible for an innocent listener to arrive at the “correct” interpretation. For propaganda purposes, this is a feature, as it allows the user to make an outrageous claim and then blame its falsity on the audience's ignorance.

The slogan is a little bit justifiable in its original context. Robin Milner introduced it (along with polymorphic type inference) in his 1978 paper A Theory of Type Polymorphism in Programming, as the heading “Well-Typed Expressions Do Not Go Wrong”. The original use refers to a toy language in which the only possible runtime errors are type errors, so the typechecker really does prove the absence of all errors. Well-typed expressions, in that language, can loop forever or run out of memory, but they can't have semantic errors: they can't “go wrong”.

Milner must have known it was misleading, though.

Square in R7RS

It's one of the most trivial library functions ever:

(define (square x) (* x x))

But I've written it many times, and I'm not alone. So it was a pleasant surprise to see that WG1 recently voted to add it to R7RS.

The request gave the (weak) reason that it could have an optimized method for bignums, but voter comments suggest they were more interested in symmetry with sqrt. They also added make-list for symmetry with make-vector. I don't think symmetry is a good reason in either case, let alone sufficient reason to add features to Small Scheme, but square is attractive as a convenience.

It's not the most important feature (and it probably belongs in Big Scheme, not Small Scheme), but it's a small step forward. Every language that aims to be convenient should have square in its standard library.

“However, the compiler arranges for it to work anyway.”

There are some things you don't want to hear from a language manual...

You might expect this not to work if it was compiled and res was not declared special, since non-special compiled variables are not represented as symbols. However, the compiler arranges for it to work anyway.

...especially not in the section on a low-level primitive like, say, pointers.

That's from page 110 of the 1979 Lisp Machine manual (20 MB of interesting history).

Unlike most lisps, Lisp Machine Lisp had pointers, called “locatives”: first-class places, implemented as (unboxed, IIUC!) pointers. One of their uses was to refer to variables by pointing to their names' value cells. But local variables generally don't live in their names' value cells, so locatives on them do nothing useful (and potentially something harmful). Apparently there was a workaround for this: the compiler recognized these locatives as intended to refer to local variables, and replaced them with something else.

Isn't it nice using a language with clean, well-specified semantics?