Arcane Sentiment: Taming unspecified behavior

When a language spec leaves the behavior of some operation unspecified, there are several things an implementation can do:

Signal an error in the usual way (whatever that is).
Extend the language by defining a useful meaning.
Crash, i.e. report an unrecoverable error.
Return an arbitrary value.
Break safety by e.g. corrupting memory.
Choose behavior unpredictably. Some C compilers now do this, to the horror of their users.

Traditionally, when a spec leaves some behavior unspecified, it's completely unspecified, with no constraints at all on what implementations can do. This maximizes implementor freedom, but minimizes the amount of behaviour users can rely on. This sometimes forces them into contortions to stay within the specified language, or leads them to write nonportable code without realizing it. Even worse, implementors sometimes take lack of specification as a license for arbitrarily perverse behaviour.

A spec can reduce these problems by leaving behavior only partially unspecified. Here are some options, in roughly increasing order of unspecifiedness:

Signals an error: The meaning of this operation is undefined — so undefined that implementations must detect it and report it. This provides maximum safety for users, but no freedom for implementors. (This isn't actually unspecified behaviour, but it's pragmatically similar.)
Signals an error unless extended: Implementations must detect the undefined behavior, but they have the option of giving it some useful definition instead of signaling an error. For example, in a language without complex numbers, (sqrt -2) might be specified to signal an error, but an implementation that does have complex numbers could make it return one. In Scheme, (map - (vector 1 2 3)) might be specified to signal an error (because the vector is not a list) unless map is extended to work on other sequence types. This lets implementors extend where they want to while preserving safety everywhere else, so it's a good default for languages that aim to be safe.
Unspecified value: The operation will return normally and safely, but the result is unspecified, often with constraints such as a type. For example, C's INT_MAX is an unspecified integer at least 32767. In Scheme, the result of (exact? (/ 1 2)) is unspecified but must be a boolean.
Unspecified but safe: The language's basic safety guarantees continue to apply, but behavior is otherwise unspecified. For example, the result of arithmetic overflow in many languages is unspecified — it might signal an error, it might overflow into bignums or flonums or +Inf, it might be modulo some constant, or it might return nil or nonsense — but it won't corrupt memory or crash.
Unspecified but implementationally unsurprising: The behaviour is not specified, but it should make sense in terms of some underlying model. For example, many languages do not specify what sort of pathnames their file operations accept, except that they should be those of the host system. C does not specify that the result of falling off the end of an array or dereferencing NULL is to blindly attempt to access that address, but that's what users expect.
Unspecified and unsafe: The language's usual safety guarantees no longer apply. Anything might happen, including crashes or corruption. In particular:
Unspecified but consistent: The implementation may choose whatever semantics it likes, but it must preserve those semantics when optimizing. It may not assume the operation won't happen, or choose semantics unpredictably.
Unspecified and unpredictable: Behavior is completely unspecified, and the compiler may do whatever it likes, even if it's inconsistent and doesn't make sense in terms of the underlying implementation. Avoid this! As John Regehr puts it, “A compiler that is very smart at recognizing and silently destroying [code with unspecified behavior] becomes effectively evil, from the developer’s point of view.”

These options are combinations of simpler constraints on behavior: safety; normal return vs. signaling an error; predictability; consistency with the underlying implementation. What other constraints, or combinations thereof, are useful?

Update 15 December: See also John Regehr's When is Undefined Behavior OK?

7 comments:

John15 September 2013 at 00:34
Be careful talking about C where "unspecified" means something very specific. INT_MAX is not unspecified but rather implementation defined. The order of evaluation of arguments to a function is unspecified. The behavior of a program that accesses out-of-bounds storage is undefined.
John Cowan15 September 2013 at 02:25
The meaning of this operation is undefined — so undefined that implementations must detect it and report it.

On the contrary, the meaning of the operation is defined — defined to throw an exception, that is. In Common Lisp, for example, a compiler might detect that (concatenate 'number 3 4) has something funny about it, but it cannot reject the program which contains it: it must allow the program to go through and raise an exception at run time, because that is the defined effect. An R5RS or R7RS-small Scheme compiler (but not an R6RS compiler), given (append 3 4), is free to reject it at once as patent nonsense.
John Cowan20 September 2013 at 01:44
When the compiler can prove the error happens, yes. However, I/O has nothing to do with it: the size of the catastrophe is not bounded, and may include destroying the observable universe (or at any rate the observer).

I ran the Petrofsky catastrophe against my suite of 42 Schemes with these results (at least in the REPL):

Returns 1: Racket, Gauche, MIT, Gambit, Chicken, Bigloo, Guile, Chez, Vicare, Larceny, Ypsilon, Mosh, NexJ, STklos, KSi, Shoe, TinyScheme, Scheme9, RScheme, S7, BDC, XLisp, Rep, UMB, SXM

Prints a "not a procedure" warning but returns 1 anyway: Scheme48/scsh, Kawa, SISC, Chibi

Prints a "not a procedure" error: SCM, IronScheme, JScheme, SigScheme, Elk, Llava, Sizzle, Inlab

Prints a "not a procedure" error and hangs: Owl Lisp

Segfaults: Schemik

No support for call-with-current-continuation: FemtoLisp, Dfsch

It's OK to comment on old posts.