A controlled experiment in static typing

There should be more language research like Stefan Hanenberg's experiment on how static typing affects productivity. It uses a rigorous methodology which is standard everywhere except in software engineering research: a controlled experiment. It compares a substantial number of programmers solving the same problem in two purpose-built languages which differ only in a single feature: the presence of a static type system.

Unfortunately the languages tested are not state of the art – the dynamically typed language is approximately Smalltalk, and the static types are those of early Java, with required type annotations and casts and no parametric types. Not surprisingly, it found that this sort of static typing is useless or worse. (Curiously, it also found that debugging type errors took longer when they were found statically than when they were found dynamically, but this is weak evidence, because the debugging times may include work other than fixing the type errors.)

This needs to be replicated with parametric types and optional type annotations!

The obvious languages to compare are a minimal pair differing only in the presence of the static checker — perhaps ML and ML-minus-the-typechecker, since that's easy to implement. We can do better, though: we can distinguish between different effects of static typing. There are many hypotheses about how it affects productivity; some of the most plausible are:

  • It finds bugs faster.
  • It directs the user's attention to types. (It's not clear whether (when?) this helps or hurts.)
  • It restricts programs by preventing constructs like arbitrary union types and variable arity.

ML-minus-the-typechecker suffers from the third effect: even though it doesn't have a typechecker, its library is limited to things the typechecker can handle – so no read or eval or type predicates or even printf (without exotic tricks). So if that effect is important, comparing ML to ML-minus-the-typechecker will not detect it.
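
To make the restriction concrete, here's a minimal sketch in OCaml (used here only as a stand-in for ML; the type and value names are mine, not from any experiment): a heterogeneous list that a dynamically typed language accepts directly must be wrapped in an explicit variant type before Hindley-Milner will accept it, and things like eval or a general type predicate can't be typed at all.

    (* A hedged sketch: the kind of program the Hindley-Milner discipline
       restricts. A dynamic language accepts the list [1; "two"; 3.0] as is;
       here it must be wrapped in an explicit variant type. *)
    type value = Int of int | Str of string | Float of float

    let mixed : value list = [Int 1; Str "two"; Float 3.0]

    (* read, eval, and type predicates have no H-M type at all, and printf only
       works because the compiler gives format strings a special built-in type. *)
    let () =
      List.iter
        (fun v ->
           match v with
           | Int i   -> Printf.printf "int: %d\n" i
           | Str s   -> Printf.printf "string: %s\n" s
           | Float f -> Printf.printf "float: %g\n" f)
        mixed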

We can distinguish between these three effects by devising a way to inflict each effect independently. We can start with a dynamically typed language with an appropriate standard library, and vary three features:

  • To find bugs, use a nonrestrictive (“soft”) typechecker, which detects some type errors but doesn't interfere with running programs.
  • To selectively divert the user's attention to types, vary the prominence of type declarations: whether they're used in the standard library documentation, whether they're printed in the output of the REPL, whether they're emphasized in the language documentation, and maybe even whether they're enabled at all.
  • To restrict programs, use a version of the standard library with everything that would confuse the Hindley-Milner typechecker removed or altered. (This is not the typechecker we're testing, but it represents a typical degree of restriction.) This doesn't restrict user code, but restricting the library should have a similar effect, especially on small programs. (We could also restrict user code by running the H-M typechecker on it and delaying error reporting until runtime, to make it less helpful for finding bugs.)

The eight combinations of these three features allow us to test all three hypotheses, and detect interactions between them.

(Note that we don't test the Hindley-Milner typechecker, because it both finds bugs and restricts programs, and we want to distinguish between these effects. It would be interesting to compare it to the nonrestrictive typechecker, though, to compare the two checkers' efficacy at finding bugs. Has this been done before?)

One experiment like this can do more to advance our understanding of type systems than ten type theory papers, because it addresses questions we care about, not just ones we know how to answer. I wish I could do this one myself, but as that's unlikely to happen, I encourage someone else to do it.

(Via It will never work in theory — the world's finest, and probably only, empirical software engineering research blog.)

(By the way, I was surprised to see the Mann-Whitney U-test used for all the significance tests, instead of Student's t-test. I thought the U-test was a special-purpose test for ordinal (not cardinal) data, but apparently some statisticians recommend it as the default significance test, on the grounds that it's more robust and only a little less powerful.)

Literate programming tools are mostly obsolete

Literate programming tools are utopian technology, like Plan 9 and Lisp machines: they appear often in descriptions of how the world ought to be, but rarely in actual use.

There's a reason for this. Literate programming tools traditionally have four features:

  1. Comments by default: text is interpreted as comments unless explicitly marked as code.
  2. Reordering: there is a macro system that allows writing code in a different order than that required by the language.
  3. Markup: comments can contain more than just text — they may contain markup in some language like Markdown or HTML, perhaps with additional language-specific features as in Javadoc.
  4. Typesetting: source and comments can be cross-referenced and printed neatly in HTML (or occasionally in print formats like TeX or PDF, but at the moment HTML is the one that matters).

Only one of these features is of much use.

In the literate ideal, programs have more comments than actual code, so making them the default saves typing. This level of commenting might be plausible for APL or hairy assembly code or mathematical wizardry, but none of these are common. Real code, even very good real code, usually has much more code than comments, so while making comments the default might be justified as a device to encourage comments, it doesn't make programs shorter.

The original literate programming system, Knuth's WEB, was for Pascal, which is strict about the order of code; in particular it expects all definitions to appear before they're used. It also has no macros and no lambda, so it's limited in its ability to factor out code even when order is not a problem. WEB's macros address both problems, making Pascal much more flexible. Most modern languages, however, are much less rigid and more expressive, so they don't need the help.

Markup sounds like it would help with readability of comments. Unfortunately, most of what we have to say in comments is simple prose, which markup can't improve much. When it is used, markup (unless it's very lightweight, like Markdown) adds enough clutter to make comments less readable in source form — and almost all reading of comments is in that form. The architects of Utopia may wish we saw our code in typeset form, but we mostly see it in our editors.

Typesetting, at least, has been a great success — so much so that editors have absorbed it. Today, most programmers consider an editor primitive if it doesn't have language-aware autoindent and syntax-coloring, and many also expect name completion, type hints and the ability to jump to definitions. Though originally intended for fancy printouts, typesetting and cross-referencing are now considered too important to postpone to compile time; we expect them while editing. Only documentation is still generated by a separate pass.

I don't see literate programming as a failure; it explored the important question of how tools can make programs more readable. But the best answers to that question turned out to be embodied in languages, not literate programming tools. It wasn't obvious a priori that readability matters more when editing code than when publishing it, nor that it was simpler to make languages more expressive than to add flexibility in another layer of tools. These things are obvious in retrospect, but it took experience with tools to discover them. Rest in peace, WEB: you taught us not to need you any more.

“There need be no real danger of it ever becoming a drudge”

Jonathan Edwards quotes a beautiful piece of naive optimism from Turing:

The process of constructing instruction tables should be very fascinating. There need be no real danger of it ever becoming a drudge, for any processes that are quite mechanical may be turned over to the machine itself.

- Turing, A. M., 1946, Proposed electronic calculator, report for National Physical Laboratory, Teddington

Admittedly Turing was speculating, and he's sort of right; programming is more amenable to automation than other things humans do. But he must have been disappointed to discover how difficult and slow the process of turning processes over to the machine is.

It's tempting to believe there's a sort of incompleteness here: as the ability of formal systems to express propositions always grows faster than their ability to prove them (more optimistically, their ability to ask questions always grows faster than their ability to answer them), does the ability of languages to generate repetitive tasks always grow faster than their ability to automate the repetition?

Wrt expressiveness, no: programming languages don't necessarily face an incompleteness problem because they don't need to be consistent: they need only express programs, not completely avoid expressing errors. (And even if they did, mere incompleteness does not imply drudgery; if programming generated endless novel problems, it would hardly be a drudge.) But most of the repetition in programming is about understanding and transforming code. This does need to be consistent, so increased expressive power doesn't necessarily help. Any increase in the space of programs that can be expressed is an increase in the space of programs that must be excluded to understand a program — and it's typically undecidable which features affect which code, so this can't always be done by machine. QED?

I'm not convinced by this argument. Incompleteness and undecidability are familiar reasons for things to be impossible, so they make good excuses, but I doubt they have anything to do with the existence of drudgery in programming. Most of the repetitive analyses we do are amenable to automation — if not in the general case, then at least in most common cases. But we don't try very hard. We write tools for many purposes, but program understanding has not traditionally been a popular one; we hardly use tools for it unless they're imposed on us by being built into a language. Is the real problem here that we don't mind the challenging drudgery of understanding?

Making the right thing easy

Standard output is often slightly abused. Its proper use is for programs that produce an output stream, but its most popular use is for debug prints. These really ought to go to standard error (or perhaps to a more specific stream like CL's *trace-output*). But in most languages that's more trouble, requiring something like fprintf(stderr, ...) instead of just printf(...), and it usually makes no practical difference (especially in programs that don't use standard output for anything else), so most programmers don't bother. We'd prefer to do the Right Thing, but not if it's any extra work.

Caml removes this barrier to doing the right thing: its prerr_* functions are just like print_*, but print to standard error instead of standard output. This means you can send your debug messages to the right stream as easily as to the wrong one. (In retrospect, this seems obvious — if a language provides convenience functions for one stream, shouldn't it be standard error?)
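
For example (a minimal sketch; all the functions below are from OCaml's standard library):

    (* Debug prints to standard error take no more effort than prints to
       standard output. *)
    let () =
      print_endline "result: 42";                (* program output, to stdout *)
      prerr_endline "debug: finished computing"; (* debug message, to stderr *)
      Printf.eprintf "debug: x = %d\n" 42        (* formatted debug print, also to stderr *)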

This particular feature is not important, but the general principle is. Programmers tend to take the path of least effort, even if that leads to bugs (as with C's string functions or unescaped Unix shell arguments) or awkwardness. Any case where they need to be told to do things differently may be a case where the language could prevent a problem by making the right thing easy.

Standard trajectories

History repeats itself — in programming languages too.

Andreas Rossberg pointed out a hysterical raisin in Java:

The reason Java generics turned out sub-par is mainly because they were added too late! They had to work around all kinds of language design and infrastructure legacy at that point. Compare with C#, where generics were put into the language early on and integrate much nicer. That is, generics should be designed into a language from the very beginning, and not as an afterthought.

John Nagle replied:

The other classic language design mistake is not having a Boolean type. One is usually retrofitted later, and the semantics are usually slightly off for reasons of backwards compatibility. This happened to Python, C, LISP, and FORTRAN.

These are two of the standard trajectories languages take through design space. Some things happen over and over:

  • Languages make more type distinctions over time. Booleans are a common case of this: if you have integers or nil or a defined/undefined distinction, separate Booleans are obviously unnecessary, so they're often left out of early versions of languages. Eventually the designers tire of wondering which ints are really integers and which are Booleans, and add a separate type.
  • Restrictions chafe and get relaxed. In particular, languages with restrictive static type systems usually add extensions to make them less restrictive: witness Haskell's numerous extensions for higher-order polymorphism, dependent types, deferred type errors, and so on. Java's generics are a case of this — it got along without them for a while because it was possible to escape from the static type system by downcasting.
  • General cases supplant special cases. In particular, many languages start with a small closed set of datatypes, and add more very slowly. At some point they add user-defined types, and initially treat them as another, very flexible type. Once they get used to user-defined types, they realize all the others are unnecessary special cases which could be expressed as ordinary (i.e. user-defined) types. Python went through this (painfully, because its built-in types had special semantics); Scheme will soon, since R7RS is standardising define-record-type. (I'm not sure if the Schemers realize this; several WG1ers objected to a proposed record? operator on grounds of strong abstraction, but none mentioned that it should return true for all types, including built-in ones.)
  • Interfaces are regularized. It's easier to add a feature than to understand its relation to the rest of the language, so many features have ad-hoc interfaces, which are later replaced with uniform ones. This seems to happen to collection libraries a lot, although this might be simply because they're large and popular.
  • Newly fashionable features are added in haste and repented later. How many languages have poorly-designed bolted-on “object systems”? Sometimes parts of the new features are later extended to the rest of the language; sometimes they're made obsolete by better-designed replacements; sometimes they're abandoned as not useful.
  • Support for libraries and large programs is often added later. Newborn languages have few libraries and no large programs, so there's little need for features like modules or library fetching or separate compilation. When they're later needed (or perceived to be needed), they're added, often as kludges like Common Lisp's packages and C's header files, or even as external tools like C's linker (and makefiles) and Clojure's Leiningen. This may be changing: libraries are now considered important enough that new languages usually have modules, at least.
  • And, of course, languages grow. It's much easier to add a feature than remove one.

Many of these trajectories are easy to trace, because they leave trails of obsolete features kept around for compatibility.

What other pieces of history repeat themselves?