Superficial features affect style

While translating some code from Perl to a toy Lisp, I noticed I was treating something differently in the two languages: variables with one-letter names look much worse in Lisp than in Perl. $s obviously doesn't collide with anything predefined, but s looks like it might collide with something (perhaps the s combinator?), so I felt obliged to give it a longer name in Lisp, even though this didn't really help readability. But I was perfectly comfortable with the one-letter name in the original Perl.

There are some good reasons for worrying less about name collisions in Perl. Variable sigils reduce the risk of collisions by putting functions and scalars in separate namespaces. Perl doesn't have many predefined names, and entirely avoids one-letter names, so there's less need to worry about collisions with them. But I suspect my main reason was just that $s doesn't look like a one-letter name.

Closures of sets

Some operations are common in mathematical reasoning, but are not traditionally used in programming, even when they make sense. I stumbled on one of these recently: I described something as the closure of a set under an operation, and then sought some other way to compute it, because the mathematical description didn't sound like an implementation. After a few minutes I realized there was no reason it shouldn't be:

(defn closure-unary "Closure of a set under a function." [f xs]
  (loop [done #{}
         pending xs]
    (if (empty? pending)
      done
      (let [x (first pending)]
        (if (contains? done x)
          (recur done (rest pending))
          (recur (conj done x) (cons (f x) (rest pending))))))))

user=> (closure-unary - (range 3))
#{0 1 2 -2 -1}
user=> (closure-unary #(mod (inc %) 10) [0])
#{0 1 2 3 4 5 6 7 8 9}
user=> (defn collatz [n] (if (even? n) (quot n 2) (inc (* n 3))))
#'user/collatz
user=> (closure-unary collatz [27])
#{160 1 161 577 2 1154 1186 322 866 2051 35 4 2308 484 1732 5 3077
 325 4102 70 3238 71 103 167 263 8 40 488 4616 41 137 233 425 10
 6154 106 650 107 395 1132 364 46 2158 142 206 334 526 2734 47 175
 719 911 16 9232 80 976 433 593 82 242 274 466 754 850 1619 20 244
 1300 1780 53 182 214 310 502 566 790 23 1079 1367 7288 184 1336
 121 377 122 4858 890 27 91 155 251 283 92 124 1276 412 3644 668
 700 61 2429 445 62 94 350 1438 638 1822 958 31 319 479}

That was the version I wanted, but closure under a binary operator is probably more common:

(defn closure-binary "Closure of a set under a binary operator." [f xs]
  (loop [done #{}
         pending xs]
    (if (empty? pending)
      done
      (let [x (first pending)]
        (if (contains? done x)
          (recur done (rest pending))
          (recur (conj done x)
                 (concat (map #(f x %) done)
                         (map #(f % x) done) ;in case it's not commutative
                         (list (f x x))
                         (rest pending))))))))

user=> (closure-binary #(not (or %1 %2)) [false])
#{true false}
user=> (closure-binary #(mod (- %1 %2) 11) [1])
#{0 1 2 3 4 5 6 7 8 9 10}

The two variants are identical except in how they generate new elements. The common part could be factored out...

(defn closure* "Closure of a set under a function generating new elements:
(children element old-elements) should return a seq of elements to add."
  [children xs]
  (loop [done #{}
         pending xs]
    (if (empty? pending)
      done
      (let [x (first pending)]
        (if (contains? done x)
          (recur done (rest pending))
          (recur (conj done x) (into (rest pending) (children x done))))))))

(defn closure-unary [f xs]
  (closure* (fn [x done] (list (f x))) xs))

(defn closure-binary [f xs]
  (closure* (fn [x done]
              (concat (map #(f x %) done)
                      (map #(f % x) done)
                      (list (f x x))))
            xs))

...but passing done to children is not very intuitive.
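
As a sanity check, the refactored versions reproduce the earlier results:

user=> (closure-unary #(mod (inc %) 10) [0])
#{0 1 2 3 4 5 6 7 8 9}
user=> (closure-binary #(mod (- %1 %2) 11) [1])
#{0 1 2 3 4 5 6 7 8 9 10}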

Special behaviour for interactive redefinitions

Dynamism — allowing definitions to be replaced in a running program — is a hugely convenient feature, because it can shorten the edit-test loop by so much. But like many powerful features, it makes some errors harder to detect — in this case, name collisions. It's easy to catch these automatically in a static language: simply make redefinition of an existing variable an error. But in a dynamic language, redefinition is a feature, so we can't simply reject it. We can give a warning, so collisions won't go undetected, but this will annoy users whenever they interactively redefine something. Can we do better?

In some Lisps, defvar has a related problem, and illustrates a possible solution. Its purpose is to create a (global) variable with a given initial value. There's a problem with redefining: what do you do if the variable already exists, and has a different value?

In both Common Lisp and Emacs Lisp, defvar has the curious feature that it sets the variable only if it doesn't already have a value — if it already exists, it's not changed. This is intended to make re-loading source files safe: otherwise, it would reset your global variables.

If you think this is surprising behavior, you're not alone. Alan Crowe once suggested that it would be less confusing if defvar were called something like def-perv-noclobber instead, because neither its pervasive specialness nor its clobber-avoidance are what you would expect from something called defvar. (defonce is a better name.)
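
Clojure's defonce shows the same don't-clobber behavior concisely:

user=> (defonce x 1)
#'user/x
user=> (defonce x 2)  ;x already has a root value, so the init form is ignored
nil
user=> x
1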

When a user reevaluates a single defvar interactively, however, they probably do want to clobber the value, or they wouldn't bother. Both Emacs and SLIME therefore have a special case: when a defvar is evaluated interactively (as by C-M-x), it will clobber any existing value. This is ugly, of course, but like C-t's irregular behaviour at end-of-line, it does what the user wants.

Something similar might be useful for detecting errors of redefinition. Can we allow redefinitions when a definition is evaluated interactively, but not when it's loaded as part of a file? No, because it's normal to reload a source file after editing it, which replaces every definition in the file, but shouldn't produce any warnings. Nor should we disable the warning for all interactive evaluation, including load, because that's when the warnings are most useful. Interactivity alone is not a sufficient reason to suppress warnings.

What we really want is to warn about redefinition only within a single interactive operation. It's generally OK for a load to replace old definitions; it's only suspicious if a definition is made twice without user intervention. So if a name is defined twice in a single load (directly, or in any other files it loads), that's an error or warning, but when it's defined twice in two separate batches, that's OK. This is easy to implement (by e.g. recording each definition made in the current load), and gives much more precise warnings.
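
Here's a minimal sketch of that bookkeeping in Clojure (the names are hypothetical; a real implementation would hook into def itself). A dynamic variable holds the set of names defined since the current top-level load began, so a name defined twice within one load — including any nested loads — draws a warning, while definitions made in separate batches don't:

(def ^:dynamic *defined-this-load* nil)   ;nil outside any load

(defn note-definition! [sym]
  (when-let [seen *defined-this-load*]
    (when (@seen sym)
      (println "Warning: redefinition of" sym "within a single load"))
    (swap! seen conj sym)))

(defn tracking-load [thunk]
  (if *defined-this-load*
    (thunk)                               ;nested load: share the outer set
    (binding [*defined-this-load* (atom #{})]
      (thunk))))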

If it's good enough for a real antivirus, it's good enough for a fake one

I encountered a piece of scareware recently — “Windows Vista Antispyware”, which claimed to be part of Windows and to have detected viruses and other malware, and demanded money to “remove” them.

It was slickly done. The UI looked like part of the Windows Security Center control panel, there were only a few typos and stylistic missteps, and the infections it reported were the (randomly selected) names of real malware. But it would have been more convincing if it hadn't claimed the machine was infected with the dreaded EICAR-Test-File.

State is not always to blame

Nelson Elhage describes a cute Linux vulnerability, the heart of which is that Linux sometimes temporarily turns off its pointer validity checks, and then does things that aren't safe without those checks.

My reaction (after the initial wow, Linux has stupid mistakes too, as if it should be a surprise) was interestingly wrong: I blamed the problem on the use of state, and briefly supposed it should have been done with something like dynamic scope instead. But actually state had nothing to do with it. The problem was in turning off the checks over an unnecessarily large scope, and in not being clear about what operations were safe with the checks turned off. That they were turned off statefully was irrelevant; the result would have been the same regardless of the mechanism.

Zealous functional puritans blame all kinds of problems on state, and they have an unwholesome influence on other functional programmers. We learn that state is an acceptable target, that it's the usual suspect for any problem, that no evidence is necessary to accuse it. And so we blame it even in cases like this vulnerability, where a moment's thought would reveal that state was innocent, and for many other problems whose causes are not so obvious. How many really have anything to do with state?

Why use keywords as symbols?

From the beginning, Lisp has been associated with symbolic programming, for which symbols have always been the canonical type. So it's odd that many Lisps also have a separate :keyword type, which they use instead of symbols for all symbolic purposes except representing their own programs. Why bother? Why have two separate types that mean the same thing?

The motivation, in Common Lisp and (AFAIK) Clojure, is that these languages' symbols aren't pure symbols: in addition to identity and name, they include some other features, in particular a package or namespace. That's useful for representing programs, but most other applications of symbols don't want namespaces, so the languages also have keywords to serve as pure symbols. Lisps whose symbols don't have packages don't usually bother with keywords, but Racket and elisp have them anyway for use with keyword arguments. (In Racket, they need to be distinct from symbols because the function call form handles them specially rather than just passing them as part of the arglist. In elisp, they're only cosmetic; plain symbols could have been used instead.)

Avoiding namespaces is a good reason to use keywords, but I don't think it's the only reason, because users don't see keywords as a workaround; they seldom have to be told to avoid symbols because they're more complex than they appear. Instead, they overwhelmingly prefer keywords, not just for keyword arguments but as symbolic values. Even in elisp, where there's no reason not to use ordinary symbols, keywords are sometimes used instead. Why?

For one thing, they help distinguish programs from data. Humans, even Lispers who know better, tend to see these as two different things, and get confused when they mistake one for the other. When programs are written with symbols and data is written with keywords, it's easy to tell them apart.

Keywords are also more regular in appearance, because they're self-evaluating. It's tempting (for beginners, and as an unconscious heuristic for non-beginners) to think of symbols and keywords as differing only in their prefix character: 'x is the practical equivalent of :x, right? But they're not equivalent, because symbols have a quote only when they appear as forms. When they appear in a larger piece of quoted data, or as output from the REPL, or in non-evaluated contexts, they don't have the quote. This is only a superficial inconsistency, but it's annoying:

CL-USER> (list 'a 'b 'c)
(A B C)
CL-USER> (second '(a b c))
B
CL-USER> (defclass foo () ((a type integer initarg a accessor foo-a)))

(That last is not valid CL, obviously; it's what defclass would look like without keywords. I suspect users would perennially forget that most of the slot options aren't evaluated, even more than they already do, and put in quotes anyway, because they're not accustomed to writing unquoted symbols.)

Keywords, because they're self-evaluating, don't have this variation. They look the same whether they're in quoted data, or in the output of the REPL, or standing alone as forms, or as non-forms:

CL-USER> (list :a :b :c)
(:A :B :C)
CL-USER> (second '(:a :b :c))
:B
CL-USER> (defclass foo () ((a :type integer :initarg :a :accessor foo-a)))

More comfortable, isn't it? I wonder if this explains some of the popularity of keywords. (And also of Clojure's vectors, which while not strictly self-evaluating are similarly more regular in appearance than lists, because they don't usually require quoting.) If so, this is a disappointment to me, because it means this seemingly pointless distinction, like that between symbols and strings, is actually doing something important, and can't necessarily be simplified away.

An anaphoric-macro-defining macro

Attention conservation notice: this post is about a special-purpose macro-defining macro. I doubt it will be of much interest to anyone but macro fans.

A year ago, Vladimir Sedach asked for arguments about why macros are and aren't useful. (He didn't get many; most of the people who can make such arguments are tired of doing so and not confident they're right.) In the same post, he mentioned that he uses the Anaphora macro library. Anaphora itself provides an example of how macros are useful, not only because it defines some handy ones, but because many of their definitions are repetitive:

(defmacro awhen (test &body body)
  "Like WHEN, except binds the result of the test to IT (via LET) for the scope
of the body."
  `(anaphoric when ,test ,@body))

(defmacro swhen (test &body body)
  "Like WHEN, except binds the test form to IT (via SYMBOL-MACROLET) for the
scope of the body. IT can be set with SETF."
  `(symbolic when ,test ,@body))

(defmacro acase (keyform &body cases)
  "Like CASE, except binds the result of the keyform to IT (via LET) for the
scope of the cases."
  `(anaphoric case ,keyform ,@cases))

(defmacro aetypecase (keyform &body cases)
  "Like ETYPECASE, except binds the result of the keyform to IT (via LET) for
the scope of the cases."
  `(anaphoric etypecase ,keyform ,@cases))

This is the sort of repetition that could be addressed with a macro. We can generate these definitions automatically, docstrings and all:

(defmacro defanaphoric (name (firstarg &rest moreargs) original &key settable)
  "Define NAME as a macro like ORIGINAL, but binding IT to the result of FIRSTARG.
If SETTABLE is true, IT will be a symbol-macro which can be set with SETF."
  (let ((whole (gensym "WHOLE")))
    `(defmacro ,name (&whole ,whole ,firstarg ,@moreargs)
       ,(format nil "Like ~S, except binds the result of ~S to IT (via ~S) for the scope of the ~A.~A"
          original
          firstarg
          (if settable 'symbol-macrolet 'let)
          (if (and moreargs (member (car moreargs) '(&body &rest))) (cadr moreargs) "body")
          (if settable " IT can be set with SETF." ""))
       ,@(mapcan (lambda (a) (unless (member a lambda-list-keywords)
                               `((declare (ignore ,a)))))
                 (cons firstarg moreargs))
       `(,',(if settable 'symbolic 'anaphoric) ,',original ,@(cdr ,whole)))))

Then it's easy to define individual anaphoric macros:

(defanaphoric awhen (test &body body) when)
(defanaphoric swhen (test &body body) when :settable t)
(defanaphoric acase (keyform &body cases) case)
(defanaphoric aetypecase (keyform &body cases) etypecase)

Note that almost half the code is about generating docstrings. Anaphoric macros, like many other things, would be much simpler to define if they didn't need documentation.

The argument list (firstarg &rest moreargs) also exists purely for documentation purposes. We don't need it to define the macros — a simple &rest would do — but we need to pass it to defmacro so tools like SLIME and describe can display it. But repeating it as a parameter is ugly; it would be better to get the original's argument list automatically. There's no standard way to do this, but every implementation supports it; if we have a portability wrapper like swank-backend:arglist (or some non-Swank-dependent equivalent), we can do away with the explicit argument list:

(defmacro defanaphoric (name original &key settable)
  "Define NAME as a macro like ORIGINAL, but binding IT to the result of the first argument.
If SETTABLE is true, IT will be a symbol-macro which can be set with SETF."
  (let ((whole (gensym "WHOLE"))
        (arglist (swank-backend:arglist original)))
    `(defmacro ,name (&whole ,whole ,@arglist)
       ,(format nil "Like ~S, except binds the result of ~S to IT (via ~S) for the scope of the ~A.~A"
          original
          (car arglist)
          (if settable 'symbol-macrolet 'let)
          (if (and (cdr arglist) (member (cadr arglist) '(&body &rest)))
              (caddr arglist)
              "body")
          (if settable " IT can be set with SETF." ""))
       ,@(mapcan (lambda (a) (unless (member a lambda-list-keywords)
                               `((declare (ignore ,a)))))
                 arglist)
       `(,',(if settable 'symbolic 'anaphoric) ,',original ,@(cdr ,whole)))))

This makes the definitions of anaphoric macros transparently simple:

(defanaphoric awhen when)
(defanaphoric swhen when :settable t)
(defanaphoric acase case)
(defanaphoric aetypecase etypecase)

This might be useful for simplifying Anaphora, or even as a tool to expose to its users, but it's not much good as an answer to Vladimir's request for arguments for and against macros — macro-defining macros do not look useful to anyone who doesn't already think macros are useful.

Quantitative results are rewarding

Recently I spent some time profiling and optimizing a certain program, and tripled its performance. My reaction was predictable. I am awesome, I thought. I made this three times as good as it was before. I'm three times as good a programmer as whoever wrote that slow code. (Even though “whoever wrote that slow code” was me.)

The results of optimization are easy to quantify: not just faster, but three times as fast. I suspect this accounts for some of the lure of optimization. Comparing performance numbers provides an obvious indication of improvement, and exaggerates its value, so even a modest optimization feels like a triumph. Adding a new feature usually has much less emotional reward, especially if it's not technically impressive or immediately useful. Fixing a bug may bring feelings of disgust at the bug, and guilt (if I created it) or contempt (if someone else did), but it seldom feels like much of an accomplishment. Optimization, though, feels good, because it's measurable and therefore obviously valuable (even when it's not).

I also enjoy cleaning up code, for the same reason: if I make something 50 lines shorter, that tells me how much I've accomplished, so I have something to feel good about. Predictably, I feel better about making code shorter than about merely making it more readable, because readability doesn't come with numerical confirmation.

If this effect is real, it would be valuable to have a similar quantification of bugs and features, to make improvements emotionally rewarding. An automated test framework can provide this by conspicuously reporting the number of tests passed, and perhaps the change from previous versions. (It's probably best to report the number passed, not the number failed, so writing more tests makes the number better rather than worse.) If you've used a test framework with this feature, have you noticed any effect on your motivation to fix bugs?

Irregular indentation reflects semantic structure

There are well-known rules for indenting Lisp code, but it's sometimes convenient to break the rules when the code has semantic structure not reflected in the S-expressions. This sometimes happens when there are alternating sequences, as in some versions of let and cond and setf. It's also common to leave forms unindented when they're supposed to be top-level but something has been wrapped around them anyway, such as eval-when, or the library or unit forms required in some Scheme module systems. In either case, the indentation reflects the structure a human reader understands the program by, rather than the structure of its S-expressions.

JRM recently posted a particularly extreme example of irregular indentation:

(define (polynomial-erf z)
  ;; Numerical Recipes 6.2
  ;; fractional error < 1.2e-7
  (let* ((t (/ 1.0 (+ 1.0 (* .5 (abs z)))))
         (ans (- 1.0 (* t
                        (exp (+ (- (* z z))
                                (+ -1.26551223
                                   (* t (+ 1.00002368
                                   (* t (+ .37409196
                                   (* t (+ .09678418
                                   (* t (+ -.18628806 
                                   (* t (+ .27886807 
                                   (* t (+ -1.13520398
                                   (* t (+ 1.48851587
                                   (* t (+ -.82215223 
                                   (* t .17087277))))))))))))))))))))))))
    (if (>= z 0) ans (- ans))))

Note the big polynomial written in sequential multiply-and-add style. It has two semantic structures: it's a tree of arithmetic expressions, but it's also a sequence of polynomial terms. The latter interpretation is more useful, and the indentation reflects this: it's intended to be read as a sequence, so it's indented as a sequence. If it were indented normally, according to its tree structure, its semantic sequentiality would be less obvious, not to mention it would be impractically wide:

(+ -1.26551223
   (* t (+ 1.00002368
           (* t (+ .37409196
                   (* t (+ .09678418
                           (* t (+ -.18628806 
                                   (* t (+ .27886807 
                                           (* t (+ -1.13520398
                                                   (* t (+ 1.48851587
                                                           (* t (+ -.82215223 
                                                                   (* t .17087277))))))))))))))))))

Either way, that's an imposing pile of right-parens. (You could avoid this with a simple polynomial function or macro, but it's not worth the trouble for one polynomial.)
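
If you did want the helper, it would be tiny. Here's a sketch (in Clojure, rather than the post's Scheme): Horner's rule over a flat coefficient sequence, constant term first.

(defn horner
  "Evaluate a polynomial at t, given its coefficients, constant term first."
  [t coeffs]
  (reduce (fn [acc c] (+ c (* t acc)))
          (reverse coeffs)))

;; The polynomial above becomes:
;; (horner t [-1.26551223 1.00002368 0.37409196 0.09678418 -0.18628806
;;            0.27886807 -1.13520398 1.48851587 -0.82215223 0.17087277])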

One of the reasons expression-based languages are so convenient is that their program structure reflects the semantic structure humans understand them by. Irregular indentation is a reminder that they don't always do so perfectly.

Clojure is almost as big as Common Lisp

How big is Clojure's standard library? Never mind the Java library, or even clojure.contrib; how much is built in to clojure.jar?

One way to answer this is to count definitions, in the form of public Vars:

(defn ns-size "How many public Vars does this namespace have?" [name]
  (require name)
  (count (ns-publics (find-ns name))))

(def builtins '(clojure.core clojure.core.protocols clojure.data
                clojure.inspector clojure.java.browse clojure.java.browse-ui
                clojure.java.io clojure.java.javadoc clojure.java.shell
                clojure.main clojure.pprint ;clojure.parallel ;deprecated
                clojure.reflect clojure.repl clojure.set clojure.stacktrace
                clojure.string clojure.template clojure.test clojure.test.junit
                clojure.test.tap clojure.walk clojure.xml clojure.zip))

In Clojure 1.3.0a6:

user=> (ns-size 'clojure.core)
563
user=> (reduce + (map ns-size builtins))
826

So Clojure is already approaching Common Lisp's 1064 definitions¹. This measure probably overstates Clojure's size relative to CL, because it compares all of Clojure to only the standard parts of Common Lisp, but every CL implementation includes other libraries in addition to the standard. CL also tends to pack more features into each definition, through keyword arguments and complex embedded languages like format and loop, so counting definitions understates its size. On the other hand, many CL features are of dubious value, so Clojure may already have surpassed it in useful library.

If it hasn't, it will soon, because Clojure's library is still growing. The Var count has increased by 50% in the last two years:

Version   Date       clojure.core   clojure.*
1.0.0     May 2009   458            543
1.1.0     Dec 2009   502            576
1.2.0     Aug 2010   546            782
1.3.0a6   Mar 2011   563            826

At this rate, Clojure 1.6 will have a bigger standard library than Common Lisp. (The timing depends on how quickly parts of clojure.contrib get assimilated into clojure.core.) I suppose when that happens, Common Lisp will still be perceived as huge and bloated, and Clojure as relatively small and clean. Any perceived complexity in Clojure will be blamed on Java, even though Java interoperation accounts for only a tiny part of Clojure's library. Scheme will still be considered small and elegant, no matter how big its libraries (or its core semantics) get. (R5RS Scheme has 196 definitions, and R6RS about 170, or 680 with its standard libraries. Racket is much bigger.) Traditional beliefs about language sizes are not very sensitive to data.

Update: riffraff on HN counts Python's definitions: len(__builtins__.__dict__.values()) ⇒ 143 definitions in the default namespace, plus len(set(chain( *[dir(o) for o in __builtins__.__dict__.values()]))) ⇒ 250 method names, so about 400 built-in definitions. There are also over 200 other modules in the standard distribution, so its full standard library is much bigger — the first 24 modules have sum([len(dir(__import__(n))) - 4 for n in "string re struct difflib StringIO cStringIO textwrap codecs unicodedata stringprep fpformat datetime calendar collections heapq bisect array sets sched mutex weakref UserDict UserList UserString".split()]) ⇒ 476 definitions, so the total is something like 5000. “Batteries included” indeed.

However, standard library size is less important than it used to be. When every language has an automatic library fetcher like Leiningen or Quicklisp or PLaneT, built-in libraries aren't much more readily available than other well-known libraries. The main obstacle to library use is no longer finding suitable libraries, or downloading them, but learning to use them.

¹ Common Lisp has 978 symbols, but symbols ≠ definitions: some symbols have definitions in more than one namespace, and some have no definition. Common Lisp has 752 definitions in the function namespace (636 functions, 91 macros, and 25 special forms), 116 variables, 85 classes, 66 setf functions, and 45(?) type specifiers, for a total of 1064 definitions. There are about 30 symbols without definitions: declare and its declarations and optimization qualities, a few locally bound symbols like call-next-method, and eight lambda-list keywords. There's also plenty of room for disagreement about what counts as a definition.

Good and bad syntax-coloring

David Nolen pretty-printed some Clojure with SLaTeX. The result, as usual for SLaTeX, is typographically impressive but rather painful to read. The problem is that SLaTeX has horrible defaults for syntax coloring. It shows most code, including this Clojure, in only two faces:

  • Special form names are in bold.
  • Other names are in italics.

Italics are harder to read than plain (“Roman”) text, because they're less familiar, so using them for most symbols hurts readability. (SLaTeX also uses proportional fonts; I'm not sure whether that helps or hurts.) But there's a much bigger problem: misplaced emphasis.

Bold, like bright colors, draws attention. This is very useful: it can help the reader quickly find important bits of code they might otherwise overlook. But special forms, however semantically special, are not such important bits. When reading a program, occurrences of lambda or begin are hardly ever the parts that deserve notice. They do deserve some subtle syntax-coloring, but printing them in bold makes code considerably less readable, because it directs attention to the irrelevant.

It's much more useful to emphasize binding occurrences, especially top-level ones. One of the most common operations when reading code is to look for a definition, and showing the names prominently makes it easier to find the right one. Emacs, as usual, gets this right: for most languages it shows top-level binding occurrences in bright colors, which makes them easy to find.

Clojure-mode also colors names in clojure.core. I've found this quite helpful for learning the language, because it provides an edit-time hint of whether I've guessed the right name. I think it's also a small help when reading code, because the color used is low-contrast, so it makes calls to standard-library functions less conspicuous — which is good, because they're usually less deserving of attention than calls to my own functions. It may also be helpful when reading an unfamiliar language, because it tells the reader when to guess what a name means or look it up with doc or clojuredocs rather than look for its definition elsewhere in the program.

It's easy to change SLaTeX's fonts to something less annoying:

\def\keywordfont#1{{\it#1\/}}
\def\constantfont#1{{\rm#1}}
\def\variablefont#1{{\rm#1}}
\let\datafont\constantfont

If you don't mind abusing \setkeyword, you can also make it recognize names from the standard library, by pretending they're special forms:

\setkeyword{sorted-map read-line re-pattern keyword? ClassVisitor
  asm-type VecSeq print-defrecord val IVecImpl ProcessBuilder chunked-seq?
  Enum find-protocol-impl SuppressWarnings
  %and so on for the rest of (keys (ns-map (find-ns 'clojure.core)))
}

Unfortunately SLaTeX doesn't support this directly, nor does it support recognizing top-level binding occurrences, but neither would be hard to add; it already recognizes similar things like macro-defining forms and quoted literals. If you're writing a pretty-printer (or a syntax-coloring editor), I encourage you to add these two features, because they're much more useful than highlighting special forms.

Another problem with non-ASCII operators

Unicode offers many symbols that make beautiful names for operators, but in addition to the usual problems (they're hard to identify with their ASCII equivalents, and almost impossible to type, and they become unintelligible when encodings get confused) there's a rendering problem: rare characters are often displayed poorly, because most fonts don't cover them very well.

I encountered this recently in one of Conal Elliott's posts which uses the character ↛. The font I saw it in didn't contain that character, so it was rendered in some other font, but scaled wrong, so it was illegibly small. I had to triple the font size to figure out that it was supposed to be a slashed arrow, not a squashed gnat. Amusingly, a change note on that post says this character was a replacement for another that was even more badly rendered:

2011-01-31: Switched trie notation to "k ↛ v" to avoid missing character on iPad.

Even fonts that have a character may not render it well. The composition operator ∘ (which also appears in that post) is one of the best candidates for a non-ASCII name, because it's so well known and has no good ASCII equivalent. Unfortunately, all the fixed-pitch fonts on my machine display it too high and with too much space on the right, making it look more like the degree sign ° than composition. Making a font with good glyphs for 95 ASCII characters is not prohibitively difficult, but making one for a few thousand Unicode characters is apparently a challenge even for the professionals.

Unfamiliar operators like the trie constructor aren't very compelling candidates for symbolic names anyway, so it's not much of a loss to use ASCII for them, but it's sad that we still can't use the familiar symbols for composition and set membership. (Although composition may be moot — I tend to write it as . even on paper, because of Haskell.)

Quote should not terminate tokens

Haskell allows the single-quote character in identifiers, so you can use variable names with primes, like x'. This is so convenient (especially for naming “revised” versions of variables, which seems to happen a lot when assignment isn't available) that I've been missing it in Clojure recently.

There's no reason lisps can't have this. Common Lisp supports nonterminating macro characters — characters that have special meaning when they appear on their own, but not when they appear in the middle of a token. Like many of CL's features (generic functions, anyone?) this isn't used much in the standard library; by default there's only one nonterminating macro character, #, and that's only for historical reasons (compatibility with Maclisp, I think). But it's easy to make new ones, which solves the x' problem in one line:

CL-USER> (set-macro-character #\' (get-macro-character #\') t)
T
CL-USER> '(x x' y y')
(X |X'| Y |Y'|)

(T as the third argument of set-macro-character means “nonterminating”.) Note that quote still works. Symbols with primes print funny, because the printer doesn't realize nonterminating macro characters don't have to be escaped, but they work fine; you can name variables with primes to your heart's content.

This should be the default in any new lisp: ' should not terminate tokens. Neither should any macro character except ), ], } and ;, really — termination only matters when you leave out spaces between tokens, and who does that?

How typep got its name

Paul McJones has a large collection of old Lisp manuals, which shed a lot of light on historical questions like how unary typep got its irregular name. The earliest reference I've found is in a 1973 Maclisp manual, which describes it thus, immediately after the other type predicates:

typep  SUBR 1 arg
  typep is a general type-predicate.  It returns an atomic symbol
  describing the type of its argument, chosen from the list
    (fixnum flonum bignum list symbol string random)
  symbol means atomic symbol.  Random is for all types that don't
  fit in any other category.

A “general type-predicate”? This wording suggests typep got its name because it was seen as a generalization of the regular type predicates. This makes sense: as Maclisp acquired an increasingly unwieldy number of type predicates, its maintainers would want to merge them into one operation, and would probably think of that operation as “a general type-predicate”, even though it's not actually a predicate.

(That random type is probably just an implementation detail showing through, but it's still adorable. Rudimentary type systems are to languages what tiny clumsy paws are to kittens.)

The curious attraction of one-argument typep

In Maclisp, the function typep takes one argument and returns its type in the form of a symbol. This is a fundamental operation, and it's egregiously misnamed, as it's not a predicate. (The -p convention is much older than typep, so the name was misleading even in Maclisp.) Lisp Machine Lisp preserved it for compatibility, but Common Lisp fixes the confusion by calling it type-of and using typep only for the two-argument type membership predicate.

I've never used Maclisp, nor any dialect with one-argument typep, so this misnaming should interest me only as historical trivia. But I find it oddly compelling. When I read typep, I think of the unary version first, and only remember the type membership test when context reminds me. When I refer to the “what type is this?” operation without having a specific language in mind, I tend to carelessly call it typep.

Why? I suspect it's because the most obvious name, type, is too bare — it sounds like it might collide with something else, although it doesn't in either Maclisp or Common Lisp — so I look for an affix to distinguish it. The -of in type-of works, but it's meaningless and irregular, and sounds wrong — more like function application than part of a name. -p is semantically inappropriate, but it's so common and familiar that it sounds right anyway. So I latch onto it and don't think about the meaning.

I can't be the only one with this problem. I've heard other people call it “one-argument typep”, and someone must have made the same mistake decades ago to give typep its misleading name in the first place. (Was it derived from a predicate? Did its designers have a different interpretation in mind, like “return the type predicate that's true of this value”?) If you are also drawn to this misnomer, or if you know more of its history, I'd like to hear about it.

Followup: How typep got its name

Formal comments and stylistic lag

If you've spent much time programming in industry, you've probably seen comments like this:

/***************************************
 * Function: Frobnicate
 * Parameters:
 *    frob - a pointer to a Frob
 *    options - a pointer to a FrobnicationOptions
 * Returns:
 *    a boolean: true if frobnication succeeded, false otherwise
 * Description:
 *    Frobnicate the given frob.
 * Logic:
 *    1. Check for valid frob
 *    2. Check for valid options
 *    3. Frobnicate the frob
 *    4. Clean up
 ******************************************/
bool Frobnicate(Frob *frob, FrobnicationOptions *options) {
  ...200 lines bearing no resemblance to the “Logic” section above...
}

Such formal comments could in principle contain useful information, but in practice they hardly ever do. Usually the name and parameters simply repeat information from the signature, the return value and description are easily guessable, and the Logic section is useless if not completely wrong. I don't think I've ever seen one that includes the sort of information the maintainer is most likely to want. (Does Frobnicate write to options->frobnication_output? What does it return if the options say to skip this frob? Does it lock frob->mutex, or does the caller have to do that?) Most simply waste space and prevent the reader from seeing other code.

Why are useless formal comments so common? I suppose they're often written by people who want to comment their code but don't understand what information is useful, and who, by following the form, can crank out lots of comments they don't realize are useless. But why do some people not only write them but even advocate them, and include them in style guides?

There is a context where these comments make sense: assembly code. When there are no parameter lists or type declarations or even variable names, the information they convey will have to be included somewhere else, because it's vital to maintenance. When the body of a function consists of many pages of inscrutably low-level instructions, an introductory explanation of the logic can be a big help (although I still prefer having it sprinkled among the code, where it helps with navigation and is more likely to be kept up to date.) So prescriptions to include these things in comments are not completely out of the blue. They're based on what made sense, decades ago, for languages that couldn't express this information in any other way. Naturally, they've become entrenched in some programmers' concepts of good style, and are applied even in languages where they make no sense.

I wonder how many of our widely accepted stylistic rules are similarly out of date or misapplied. The widespread misuse of Hungarian notation qualifies, but that's no longer popular. There are a lot of prescriptions for object-oriented style that make no sense in functional languages, but they're seldom applied there. There are some language-specific archaisms, like writing typedef struct Foo_ { ... } Foo; in C++, but I don't think that's actively promoted except for code that has to also be valid as C.

What stylistic prescriptions are still widely accepted today, even where they don't make sense?

Japanese Lisp, Forth, and historical contingency

Yusuke Shinyama speculates: what would Lisp look like if it had been invented by speakers of a consistently postfix language like Japanese? Might it be written postfix instead of prefix?

Maybe so. But this is superficial. The car of a Lisp form is special because it's the head, not because it's first; as long as there are head and rest operators, it makes no difference whether the head is stored first or last or even in the middle. So while a postfix Lisp looks different, this is only a superficial matter of syntax; Lisp would work the same way if it had been invented in Japan.

Forth, though, might be dramatically affected — not in nature but in timing. Despite its simplicity, Forth appeared rather late: it was developed in the 1960s and not publicized until 1970, which was too late to become part of the cultural foundation of computing. I suspect this was an anomaly; Forth is so simple that it could easily have been discovered earlier, had anyone bothered to explore postfix notation. Speakers of a postfix natural language have an obvious example to encourage them. (Postfix mathematical notation would be an even stronger encouragement. IIUC Japanese arithmetic words are infix, so a Japanese-inspired notation would also be infix; postfix arithmetic could arise more naturally in a natlang where arithmetic operators are postpositions, and postpositional phrases follow their head noun, but this is not a common combination.)

If Forth had been known by the mid-1950s, it could have outcompeted Fortran to become the canonical high-level language. This would have exerted significant pressure on hardware: machines would be designed to run Forth, much as they're designed to run C today, so there would be a lot of stack machines. Since Forth makes such a good assembly language for such machines, there would be less pressure to develop other high-level languages. Programmers accustomed to its simplicity and flexibility and convenience would see all proposed alternatives as unusable and crippled and insanely complex, so other language families could go unexplored. Forth and its descendants might rule computing unchallenged for decades, until some long-delayed development made tree languages competitive.

History could be different. Lisp, Fortran, Algol — all the great old language families — might not exist, if the pioneers of computing had spoken a head-last natural language and found Forth first.

Infixifiers

Haskell allows using any function as an infix operator, without declaring it infix or giving it a nonalphanumeric name: x `mod` 3 is the same as mod x 3, for any mod. I used to think this was a silly extravagance of syntax, but I've come to like it, and to use it frequently in pseudocode. Like the pipe operator, it lets me write operations in a more natural order, and this is important enough to be worth a little syntactic complexity. With a suitably low precedence, it can also save a few parentheses, which is convenient, especially when writing code on paper, where balancing parentheses is hard.

The backquotes still seem odd to me, probably because I confuse them with the output-of-a-command operator in the Bourne shell and Perl. I currently prefer a colon suffix: x mod: 3, like Smalltalk. (However, I also sometimes use that for keyword arguments, as I prefer sort xs key: cost to sort xs :key cost.)

Scala almost does this without requiring an explicit infixifier at all, as it uses mere juxtaposition to mean method call: x mod 3 is x.mod(3) (which happens not to work, since there is no Integer.mod). However, this doesn't work for functions in general, as Scala distinguishes them from methods. And of course it conflicts with using juxtaposition as ordinary prefix function call, which is a far more important and more general (because it allows arbitrary numbers of arguments) construct than infix.

Update 25 March: R also has an infixifier: a %f% b is f(a, b).

Alternating cond

The traditional cond used in most Lisps, including Scheme and CL, has (test . body) pairs:

(define (signum x)
  (cond ((< x 0) -1)
        ((> x 0) 1)
        (else 0)))

That has Lots of Irritating Superfluous Parentheses, of a sort that are particularly annoying because they aren't forms. Arc and Clojure avoid this by alternating tests and bodies:

(defn signum [x]
  (cond (< x 0) -1
        (> x 0) 1
        :else 0))

Which is nice. But: when test and body no longer fit on one line, it becomes hard to indent, because the list structure doesn't match the semantic structure. Indenting normally, according to list structure, makes the body look like another test:

(defn fibonacci [n]
  (cond (not (and (integer? n) (>= n 0)))
        (throw (java.lang.IllegalArgumentException.
                (str "fibonacci's argument must be a nonnegative integer: "
                     (pr-str n))))
        (> n 1) (+ (fibonacci (- n 2)) (fibonacci (dec n)))
        :else n))

I tend to read conds by vertically scanning the conditions, which doesn't work with this indentation. Indenting according to sense solves this, but makes the body look like a subform of the test:

(defn fibonacci [n]
  (cond (not (and (integer? n) (>= n 0)))
          (throw (java.lang.IllegalArgumentException.
                  (str "fibonacci's argument must be a nonnegative integer: "
                       (pr-str n))))
        (> n 1) (+ (fibonacci (- n 2)) (fibonacci (dec n)))
        :else n))

...and your editor will probably “fix” it anyway the next time you autoindent.

I'm tempted to rewrite with nested ifs just to avoid the misleading indentation. Which is madness, because cond is common, and tests and bodies often outgrow one line, and I shouldn't change code structure for formatting reasons. But I don't like being unable to parse my code. The main value of S-expressions is that structure is obvious, and the structure of alternating cond isn't.
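
For comparison, here's the nested-if rewrite of the fibonacci example — unambiguous to indent, at the cost of restructuring:

(defn fibonacci [n]
  (if (not (and (integer? n) (>= n 0)))
    (throw (java.lang.IllegalArgumentException.
            (str "fibonacci's argument must be a nonnegative integer: "
                 (pr-str n))))
    (if (> n 1)
      (+ (fibonacci (- n 2)) (fibonacci (dec n)))
      n)))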

The same problem can occur in any expression with alternating subforms, such as Clojure's let, or dictionaries, but in neither case is it as bad as in cond. Variable names are distinctive, and anyway tend to be short, so let is rarely formatted with initforms on their own lines. Dictionaries don't usually have large subforms, and often have distinctive :keywords as their keys, so it's harder to get the keys and values confused. cond is particularly vulnerable (probably along with friends like typecase), because it often has large subforms that look alike but don't work alike.

I think I prefer the old cond, despite the parentheses.

min-key and max-key should take lists

Clojure's min-key and max-key operators are variants of min and max that minimize or maximize some function of their arguments:

user=> (max-key count "Find" "the" "longest" "word")
"longest"

I used min-key a few times recently, and found it awkward every time, because the input arrived in a list, not in several separate expressions. Converting from one to the other is a simple matter of apply, but the resulting expression is not as clear as it ought to be:

user=> (apply min-key count
         (clojure.string/split "Find the shortest word" #" "))
"the"

It seems to me that while min and max are almost always used on separate arguments, min-key and max-key are much more likely to be used on lists. (In general, I think higher-order functions are much more likely to be used on lists than their first-order relatives.) Googling clojure min-key supports this: almost all uses are (apply min-key ...), not (min-key ...) — and all of the latter seem to be either artificial examples or mistakes. Even the spelling corrector, which was the original motivation for adding max-key, uses it with apply. So these functions should really be provided in list form (like Arc's most) rather than separate-argument form.

This is a general problem, though. Any operator that takes an arbitrary number of arguments will probably be used sometimes on lists and sometimes on a few explicit arguments. apply or reduce are the obvious ways to transform one into the other, but they're rather disappointing when you're expecting a clear, convenient list function.

(Functions that aren't useful with only one argument (like min-key) can be overloaded to provide both: (min-key f xs) could operate on the elements of xs, while (min-key f a b) operates on a and b. This is what Python does. This is a potentially surprising irregularity, though ((min-key f a) no longer does what you expect), and it doesn't work for functions that can usefully take one argument. So I don't think it's a good idea.)
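
A sketch of that overloading, as a hypothetical min-key* wrapping Clojure's min-key:

(defn min-key*
  "Like min-key, but (min-key* f xs) operates on the elements of xs."
  ([f xs] (apply min-key f xs))
  ([f a b & more] (apply min-key f a b more)))

Note that (min-key* f a) now treats a as a collection — exactly the surprise complained about above.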

Single-user Unix

There are a few basic security rules that every Unix newbie is taught, and one of them is: don't log in as root. So when I went to demonstrate something to a new coworker a while ago, and began by logging in as root, it was an awkward moment.

Was I being reckless? No, I was using Unix as a single-user OS. This was a shared machine, used for testing and little else, and there was no reason to distinguish multiple users; everyone needed root access anyway. Having multiple accounts would have been useless complexity; all we needed was a single-user system. Unix can be one, if you log in as root.

It's not a very good single-user OS, because most of its safety checks are user-based. When everything runs as root, a minor accident can destroy everything on the machine. But this machine had no valuable data on it — at worst, we'd be forced to reinstall everything, so we didn't even bother to use a shared non-root account. Safety just wasn't much of an issue. Security wasn't an issue at all; everyone who used that machine needed to know the root password anyway.

This isn't an unusual situation. Most personal computers have only a single user, and many others are shared among several users with no personal data. The user-based security model was developed for timeshared machines, which are now quite rare. Yet our orthodoxy still says user-based security is necessary, and systems without it are not to be taken seriously. Even after decades in which the most important computers have been personal ones, and even now that single-user smartphones have become the most fashionable platforms, we still judge operating systems by their ability to protect multiple users from each other, not their ability to protect a single user from malicious code. Security, we think, means inter-user security, whether or not that's a good model of the threats we face.

So I feel guilty every time I log in as root. I've been indoctrinated too thoroughly; even though I know it's not really a problem, I feel compelled to make excuses, and to point out that it wasn't me who set up that machine, and I wouldn't have done it that way. But I shouldn't. A single-user system is not necessarily a bad thing, even if it's Unix.

Clojure's unconventional symbols

If you asked me what a symbol is, I'd probably say something like “a canonicalized string”. I might explain why this is useful, or mention that some languages (e.g. Common Lisp) include top-level bindings and plists and packages as part of their symbols, even though those are really a different concept. But I'd take for granted that canonical identity is the central idea. You can't have symbols without intern, can you?

Clojure does. It creates a new symbol for each occurrence it reads, with no attempt to canonicalize them. Despite its name, clojure.lang.Symbol/intern always makes a new symbol. It does intern the strings that are their names, to make comparisons faster, but not the symbols themselves. (There's also clojure.core/intern, but that creates Vars — top-level definitions — not Symbols.) Observe:

user=> (identical? 'a 'a)
false
user=> (= 'a 'a)
true
user=> (identical? (clojure.lang.Symbol/intern "a") (clojure.lang.Symbol/intern "a"))
false

(identical? is Java ==, and = is Java .equals. The lengths of their names hint at their relative frequency in Clojure.)

Because symbols aren't interned, any code doing symbolic processing must compare them by name rather than identity. This looks at first glance like sloppiness, or a missing feature. Indeed, it does have one harmful effect: gensym can't quite guarantee that the symbols it creates won't collide with anything (although this risk is largely theoretical, and could be essentially eliminated by putting gensyms in their own namespace). But for the most part, it's harmless. Interning symbols sounds like a basic semantic feature, but it's not. It's only an optimization.

Digression: namespaces, and references vs. names

Symbol comparisons are actually more complicated than =, because Clojure symbols also optionally include a namespace. user/a and a are thus two different symbols — one with a namespace, one without — but they name the same variable:

user=> (= 'user/a 'a)
false
user=> (def a 1)
#'user/a
user=> user/a
1
user=> (def user/a 2)  ;redefinition!
#'user/a
user=> a
2
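
The name and namespace functions take a symbol apart, for code that wants to compare one component while ignoring the other:

user=> (name 'user/a)
"a"
user=> (namespace 'user/a)
"user"
user=> (namespace 'a)
nil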

This appears to make symbol comparison impossible, since determining what definition a name refers to requires looking at the environment, not just at the symbol. But this is true in any language with lexical scope; including a namespace in the symbol doesn't make it harder. It's only a potential problem for plain old symbol processing, and you probably wouldn't use qualified symbols for that anyway.

What's really going on here is that symbols in Clojure have two roles. In addition to being names, they're also used as references to names, relative to the current namespace. Clojure uses the same class for both concepts: references may or may not specify a namespace, but names always(?) do. References are resolved to names (by clojure.lang.Compiler/resolveSymbol), so the first def above defines a variable called user/a, not a. This means names can be compared by = without worrying further about namespaces or collisions between them (although local bindings can still collide). In a language which resolved references directly to their definitions rather than via names, it would not be necessary to include the namespace in names, and SymbolRef could be distinct from Symbol.