Good and bad syntax-coloring

David Nolen pretty-printed some Clojure with SLaTeX. The result, as usual for SLaTeX, is typographically impressive but rather painful to read. The problem is that SLaTeX has horrible defaults for syntax coloring. It shows most code, including this Clojure, in only two faces:

  • Special form names are in bold.
  • Other names are in italics.

Italics are harder to read than plain (“Roman”) text, because they're less familiar, so using them for most symbols hurts readability. (SLaTeX also uses proportional fonts; I'm not sure whether that helps or hurts.) But there's a much bigger problem: misplaced emphasis.

Bold, like bright colors, draws attention. This is very useful: it can help the reader quickly find important bits of code they might otherwise overlook. But special forms, however semantically special, are not such important bits. When reading a program, occurrences of lambda or begin are hardly ever the parts that deserve notice. They do deserve some subtle syntax-coloring, but printing them in bold makes code considerably less readable, because it directs attention to the irrelevant.

It's much more useful to emphasize binding occurrences, especially top-level ones. One of the most common operations when reading code is to look for a definition, and showing the names prominently makes it easier to find the right one. Emacs, as usual, gets this right: for most languages it shows top-level binding occurrences in bright colors, which makes them easy to find.

Clojure-mode also colors names in clojure.core. I've found this quite helpful for learning the language, because it provides an edit-time hint of whether I've guessed the right name. I think it's also a small help when reading code, because the color used is low-contrast, so it makes calls to standard-library functions less conspicuous — which is good, because they're usually less deserving of attention than calls to my own functions. It may also be helpful when reading an unfamiliar language, because it tells the reader when to guess what a name means or look it up with doc or clojuredocs rather than look for its definition elsewhere in the program.

It's easy to change SLaTeX's fonts to something less annoying:

\def\keywordfont#1{{\it#1\/}}
\def\constantfont#1{{\rm#1}}
\def\variablefont#1{{\rm#1}}
\let\datafont\constantfont

If you don't mind abusing \setkeyword, you can also make it recognize names from the standard library, by pretending they're special forms:

\setkeyword{sorted-map read-line re-pattern keyword? ClassVisitor
  asm-type VecSeq print-defrecord val IVecImpl ProcessBuilder chunked-seq?
  Enum find-protocol-impl SuppressWarnings
  %and so on for the rest of (keys (ns-map (find-ns 'clojure.core)))
}

Unfortunately SLaTeX doesn't support this directly, nor does it support recognizing top-level binding occurrences, but neither would be hard to add; it already recognizes similar things like macro-defining forms and quoted literals. If you're writing a pretty-printer (or a syntax-coloring editor), I encourage you to add these two features, because they're much more useful than highlighting special forms.

Another problem with non-ASCII operators

Unicode offers many symbols that make beautiful names for operators, but in addition to the usual problems (they're hard to identify with their ASCII equivalents, and almost impossible to type, and they become unintelligible when encodings get confused) there's a rendering problem: rare characters are often displayed poorly, because most fonts don't cover them very well.

I encountered this recently in one of Conal Elliott's posts which uses the character . The font I saw it in didn't contain that character, so it was rendered in some other font, but scaled wrong, so it was illegibly small. I had to triple the font size to figure out that it was supposed to be a slashed arrow, not a squashed gnat. Amusingly, a change note on that post says this character was a replacement for another that was even more badly rendered:

2011-01-31: Switched trie notation to "k ↛ v" to avoid missing character on iPad.

Even fonts that have a character may not render it well. The composition operator (which also appears in that post) is one of the best candidates for a non-ASCII name, because it's so well known and has no good ASCII equivalent. Unfortunately, all the fixed-pitch fonts on my machine display it too high and with too much space on the right, making it look more like the degree sign ° than composition. Making a font with good glyphs for 95 ASCII characters is not prohibitively difficult, but making one for a few thousand Unicode characters is apparently a challenge even for the professionals.

Unfamiliar operators like the trie constructor aren't very compelling candidates for symbolic names anyway, so it's not much of a loss to use ASCII for them, but it's sad that we still can't use the familiar symbols for composition and set membership. (Although composition may be moot — I tend to write it as . even on paper, because of Haskell.)

Quote should not terminate tokens

Haskell allows the single-quote character in identifiers, so you can use variable names with primes, like x'. This is so convenient (especially for naming “revised” versions of variables, which seems to happen a lot when assignment isn't available) that I've been missing it in Clojure recently.

There's no reason lisps can't have this. Common Lisp supports nonterminating macro characters — characters that have special meaning when they appear on their own, but not when they appear in the middle of a token. Like many of CL's features (generic functions, anyone?) this isn't used much in the standard library; by default there's only one nonterminating macro character, #, and that's only for historical reasons (compatibility with Maclisp, I think). But it's easy to make new ones, which solves the x' problem in one line:

CL-USER> (set-macro-character #\' (get-macro-character #\') t)
T
CL-USER> '(x x' y y')
(X |X'| Y |Y'|)

(T as the third argument of set-macro-character means “nonterminating”.) Note that quote still works. Symbols with primes print funny, because the printer doesn't realize nonterminating macro characters don't have to be escaped, but they work fine; you can name variables with primes to your heart's content.

This should be the default in any new lisp: ' should not terminate tokens. Neither should any macro character except ), ], } and ;, really — termination only matters when you leave out spaces between tokens, and who does that?

How typep got its name

Paul McJones has a large collection of old Lisp manuals, which shed a lot of light on historical questions like how unary typep got its irregular name. The earliest reference I've found is in a 1973 Maclisp manual, which describes it thus, immediately after the other type predicates:

typep  SUBR 1 arg
  typep is a general type-predicate.  It returns an atomic symbol
  describing the type of its argument, chosen from the list
    (fixnum flonum bignum list symbol string random)
  symbol means atomic symbol.  Random is for all types that don't
  fit in any other category.

A “general type-predicate”? This wording suggests typep got its name because it was seen as a generalization of the regular type predicates. This makes sense: as Maclisp acquired an increasingly unwieldy number of type predicates, its maintainers would want to merge them into one operation, and would probably think of that operation as “a general type-predicate”, even though it's not actually a predicate.

(That random type is probably just an implementation detail showing through, but it's still adorable. Rudimentary type systems are to languages what tiny clumsy paws are to kittens.)