Autoconversion is like weak typing

A while ago, I made a simple mistake using zlib:

#include <zlib.h>

char *file = "/some/path";
gzFile f = gzopen(file, "wb");
if (f) {
  gzprintf(file, "something\n");  /* the mistake: should be f, not file */
  /* ... */
}

This is a static type error, of course: gzprintf expects a gzFile, not a char*, and they have different C types. Unfortunately, C doesn't detect it, because (in older versions of zlib) gzFile happens to be a typedef for void*, and in C char*, like all pointers, can autoconvert to void*. By accident of memory layout, this didn't crash at runtime either; it simply did nothing, causing me some puzzlement when this one line was missing from the otherwise intact output.
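
To see why this typechecks, here's a minimal sketch, assuming only that gzFile is a bare typedef for void*, as it was in older zlib (fake_gzprintf is a made-up stand-in, not the real API):

#include <stdio.h>

typedef void *gzFile_old;   /* older zlib: gzFile was effectively void* */

/* stand-in with the same first-parameter type as gzprintf */
int fake_gzprintf(gzFile_old f, const char *s) {
    (void)f;
    return printf("would write: %s", s);
}

int main(void) {
    char *file = "/some/path";
    fake_gzprintf(file, "something\n");   /* accepted: char* converts to void* */
    return 0;
}

Newer versions of zlib, as far as I know, declare gzFile as a pointer to an opaque struct, so a compiler would reject this call.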

In addition to providing a puzzle, this mistake sheds light on some confusing terminology. Like most terms related to types, “weak typing” has several meanings. I usually reserve it for languages with typechecks that don't always apply, either because they can be suppressed by casts, or because some common¹ operations don't provide them. But it's also used for the unrelated concept² of autoconversion.

I thought this was just an arbitrary reuse of a term, but the autoconversion of char* to gzFile shows what the two senses have in common: autoconversion, like casts and missing typechecks, means some type errors go undetected. It usually produces conversions rather than blind misinterpretation of data (though not in this case), but from the user's point of view the effect is the same: the type checks are unreliable. If you accidentally do arithmetic on a Perl string and it autoconverts to zero, it's small comfort that the conversion is type-safe; the program still continues happily computing incorrect results after a type error. So it makes sense to use the same term for any system of type checks that overlooks some errors. It's a little confusing to call completely different mechanisms by the same name, but the name fits: both autoconversion and casts make type checks weak.

¹ If some rare, exotic operations circumvent typechecks, that's OK, because you won't use them by accident.

² And sometimes for even more unrelated concepts like dynamic typing, but this usage is usually either careless or pejorative. When people want to talk about dynamic typing, they don't call it “weak typing”.

“Types considered harmful” considered insightful

Provocative titles work. When a famous type theorist like Benjamin Pierce gives a talk titled Types Considered Harmful, it gets my attention. (Eventually. Years later. After someone links me to it.) This talk describes a contract-checking system, which wouldn't be very interesting in its own right, title or no, but it also includes a bunch of insights about typechecking. Here are the good bits:

First (and this is not new), why does type inference find bugs? It's not because there's anything special about types per se, but simply because it entails enough analysis that inconsistencies in the program are likely to produce contradictions.

  • Attempting to prove any nontrivial theorem about your program will expose lots of bugs
  • The particular choice of theorem makes little difference!
  • Typechecking is good because it proves lots and lots of little theorems about your program

(“Proving theorems” sounds scary, but if you describe it as “checking properties” it means the same thing and is much less intimidating.)

Second, it acknowledges an oft-ignored cost of static typing: it makes languages much more complicated.

Fancy types make the programmer's head hurt
  • Type systems – especially very precise ones – force programmers to think rigorously about what they are doing
  • This is good... up to a point!
    • Do we want languages where a PhD* is required to understand the library documentation?
* two PhDs for Haskell

Fancy types make my head hurt
  • Complex type systems can lead to complex language definitions
  • Easy to blow the overall complexity budget

The notion of a complexity budget ought to be more widely used. In addition to the limitations of designer and implementor effort, there's a limit to how much users will learn about a language. Once the language's complexity reaches this limit, additional features are not very useful, because the users who are supposed to benefit from them don't know about them.

Third, it has a useful interpretation of the relationship between typechecking, contract checking (in the particular form described in the talk), and tests (a rough illustration in code follows the list):

  • Types ~ General but weak theorems (usually checked statically)
  • Contracts ~ General and strong theorems, checked dynamically for particular instances that occur during regular program operation
  • Unit tests ~ Specific and strong theorems, checked quasi-statically on particular “interesting instances”

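Here's a rough sketch of the same three-way distinction, using an integer square root as the running example (my illustration, not the talk's):

#include <assert.h>

/* Type: a general but weak theorem. The signature proves the result is an
   int, and nothing about its value. */
int sqrt_int(int x) {
    int r = 0;
    while ((r + 1) * (r + 1) <= x)   /* naive integer square root */
        r++;
    return r;
}

/* Contract: a general and strong theorem, but checked dynamically, and only
   on the inputs that actually occur while the program runs. */
int sqrt_int_checked(int x) {
    assert(x >= 0);                                           /* precondition */
    int r = sqrt_int(x);
    assert(r >= 0 && r * r <= x && (r + 1) * (r + 1) > x);    /* postcondition */
    return r;
}

/* Unit tests: specific and strong theorems, checked on a few
   "interesting instances". */
void test_sqrt_int(void) {
    assert(sqrt_int(0) == 0);
    assert(sqrt_int(15) == 3);
    assert(sqrt_int(16) == 4);
}
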
These two dimensions of generality and strength are easy to reuse:

  • A dynamic typecheck is specific (because it checks only one case), weak (because all it proves is that the value has the right type), and of course checked dynamically.
  • The purity of core Haskell is another general and weak theorem — or rather assumption, as it's not checked at all.
  • Traditional assertions are specific, of variable strength, and checked dynamically.
  • Model checkers prove general strong theorems statically.
  • The “typeful” style often used in ML and Haskell, where more distinctions are moved into types, is an attempt to make the typechecker's weak theorems stronger.

The title is not just for provocation. It's also a summary of the talk's apparent thesis. Pierce doesn't say this explicitly in any of the slides, so maybe I'm reading my own conclusion into them, but he seems to be saying: proving theorems about programs is good, but having no choice about which theorems to prove is bad. What if the theorem that's forced on you is hard to prove, or easy to prove but not by the methods the checker knows? What if it's not even true, or shouldn't be true? A typechecker (or any obligatory checker) is “considered harmful” because it chooses the theorems for you, and you can't refuse it when it chooses poorly.

Other uses for Unicode

Unicode identifiers are impractical because users can't type them. But languages can safely use Unicode characters for many other purposes. For example:

  • Prompts: Doesn't every language need a distinctive prompt? Why not adopt a Unicode character, so the prompt won't look like anything users type?
  • Alternate names: If you allow one symbol to have multiple names, the pretty-printer can use Unicode while humans type in ASCII. Doing a lot of this would interfere with learning, but basics like printing Haskell \ x -> x as λ x → x shouldn't hurt much. (A small sketch follows this list.)
  • Alternate reader syntax: Similarly, print can use Unicode characters in place of the least satisfactory ASCII ones. Wouldn't it be nice if vectors printed with proper angle brackets instead of #(...)?
  • Unreadable objects: If an object's printed representation can't be read back in, it need not be typed by humans, so it can use arbitrary characters. This might improve the printed representations of functions, classes, promises, etc., which often print unhelpfully at the REPL.
  • Non-ASCII art: Languages (and libraries) often print everything as text, even things like tables that shouldn't be, because that's the only output interface they can rely on. Unicode can help make this less ugly. (Box-drawing characters!)
  • Documentation and error messages: you can use whatever you want here.

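One way to get the alternate-names behavior mentioned above is a lookup table in the printer, so output uses the pretty glyphs while input stays plain ASCII. A minimal sketch (the tokens and names are made up, not from any particular language):

#include <stdio.h>
#include <string.h>

/* Map a few ASCII spellings to Unicode glyphs, for printing only. */
static const char *pretty(const char *tok) {
    static const struct { const char *ascii, *uni; } table[] = {
        { "\\", "λ" },   /* lambda */
        { "->", "→" },   /* arrow  */
        { "<=", "≤" },
        { ">=", "≥" },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(tok, table[i].ascii) == 0)
            return table[i].uni;
    return tok;
}

int main(void) {
    const char *toks[] = { "\\", "x", "->", "x" };   /* what the user typed */
    for (size_t i = 0; i < sizeof toks / sizeof toks[0]; i++)
        printf("%s ", pretty(toks[i]));
    putchar('\n');                                   /* prints: λ x → x */
    return 0;
}
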
REPLs, by the way, should exploit terminal control before worrying about Unicode. Small improvements like coloring and a syntax-aware line editor make a big difference to usability. Not everyone has an IDE or wants to use it for everything.

Number syntax extensions

I approve of James Hague's proposal to allow numbers with k as a suffix, so 64k is 65536. However, there is that binary vs. decimal issue.

The IEC/ISO binary prefixes (Ki, Mi, Gi, etc.) have not been widely adopted, and for good reason: unlike the regular SI prefixes, they don't represent anything in spoken language. They come with proposed pronunciations, but these are too awkward (especially when followed by ‘byte’) to be accepted; ‘kibibyte’ sounds like a mockery of the SI prefixes, not an addition to them. As long as there are no binary prefixes in speech, there will be little demand for abbreviations that aren't short for anything.

There is (was?) another proposed set of binary prefixes that suffers this problem slightly less: bk, bM, bG, etc. They resemble the spoken ‘binary kilo-’, ‘binary mega-’, etc., so they might have some chance of adoption. Maybe programming languages should use them, so 1bk = 1024 and 1k = 1000. Or maybe they should unapologetically use 1k = 1024; who needs powers of 10 anyway?
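
For what it's worth, the reader change is small either way; here's a sketch (parse_k is a made-up name, and what plain k means is left as a parameter):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read an integer with an optional k/M/bk/bM suffix. k_value decides whether
   plain k means 1000 or 1024; the b-prefixed forms are always binary. */
long parse_k(const char *s, long k_value) {
    char *end;
    long n = strtol(s, &end, 10);
    if (strcmp(end, "k") == 0)  return n * k_value;
    if (strcmp(end, "M") == 0)  return n * k_value * k_value;
    if (strcmp(end, "bk") == 0) return n * 1024;
    if (strcmp(end, "bM") == 0) return n * 1024 * 1024;
    return n;                   /* no suffix */
}

int main(void) {
    printf("%ld\n", parse_k("64k", 1024));   /* 65536 */
    printf("%ld\n", parse_k("64k", 1000));   /* 64000 */
    printf("%ld\n", parse_k("1bk", 1000));   /* 1024  */
    return 0;
}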

Some other, less controversial extensions to number syntax:

  • Allow flonums to omit zeroes before or after the decimal point: 1. and .1 should be acceptable as floats. (In Common Lisp, 1. is an integer, for historical reasons.)
  • Separators, as in Ada and Ruby and rather a lot of languages nowadays: 10⁹ can be written as 1_000_000_000. (Or maybe as 1,000,000,000, but I'd rather avoid comma, to avoid confusion over whether a string like 1,234.5 is 1234.5 or 1.2345.) This is especially useful in REPL output, so you don't have to read unpunctuated ten-digit numbers. (Usability hint: don't print separators for four- or five-digit numbers; they're not needed there.) (A small reader sketch after this list shows one way to handle a few of these extensions.)
  • Exponents in integers: 1e9 should read as the integer 1000000000, not as a float. (This is a good substitute for decimal SI suffixes.) Negative exponents could read as ratios if available, or else as flonums.
  • Fractional SI suffixes: if 1k is 1024 or 1000, might 1m be 1/1000 (or 1/1024)?
  • Radix with ‘r’: 2r1101011 instead of #2r1101011, saving a dispatching character. (There is virtually no demand for radices over 16, so don't bother supporting them. If you want base 36, you'd probably like base 64 even better.)
  • Scheme-style complex numbers (-2+3i, 3.033-198.26i) may be easier to read than CL-style ones like #C(-2 3), although this seldom matters.

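Here's that reader sketch, assuming underscores are skipped, a bare e on a decimal integer means a power of ten, and a leading “<radix>r” sets the base (read_extended_int is a made-up name; error handling is omitted):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

long read_extended_int(const char *s) {
    long n = 0, radix = 10;
    const char *p = s;

    const char *r = strchr(s, 'r');             /* 2r1101011, 16rFF, ... */
    if (r && r > s) {
        radix = strtol(s, NULL, 10);
        p = r + 1;
    }

    for (; *p; p++) {
        if (*p == '_')                          /* separator: ignore */
            continue;
        if (radix == 10 && (*p == 'e' || *p == 'E')) {
            long exp = strtol(p + 1, NULL, 10); /* 1e9 reads as an integer */
            while (exp-- > 0)
                n *= 10;
            return n;
        }
        int d = isdigit((unsigned char)*p)
                    ? *p - '0'
                    : toupper((unsigned char)*p) - 'A' + 10;
        n = n * radix + d;
    }
    return n;
}

int main(void) {
    printf("%ld\n", read_extended_int("1_000_000_000"));   /* 1000000000 */
    printf("%ld\n", read_extended_int("1e9"));              /* 1000000000 */
    printf("%ld\n", read_extended_int("2r1101011"));        /* 107 */
    printf("%ld\n", read_extended_int("16rFF"));            /* 255 */
    return 0;
}
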
I suppose none of this is very important.