That's not what lexical scope is for

In one of Joe Marshall's posts on ways to accidentally break an abstraction, there are several top-level definitions which close over a lexical variable, like this:

(define lib-mapcar 
  (let ((calls-to-f 0))
    (add-counter! 'calls-to-f (lambda () calls-to-f))
    (lambda (f list)
      ...)))

I know this isn't supposed to be an example of good style, but I wince to see functions defined this way. lib-mapcar's argument list is several lines away from its name, hidden behind some unrelated profiling code and a lambda. This sort of thing is distressingly common in Scheme, and is even presented as good style sometimes. The usual justification is to keep the variable private, so you don't have to worry about name collisions or unexpected dependencies. But the cost in clarity far exceeds the tiny benefit in privacy.

I guess this is what happens when you don't have (or aren't accustomed to having) a module system. You start to think of top-level defines as "globals", to be avoided whenever possible, and look for some alternative. Some languages provide convenient alternatives, such as C's static (in all three of its senses), but in Scheme the only obvious alternative is to abuse lexical scope. And since everyone knows lexical scope is an unequivocal Good Thing, it's easy to overlook how awkward it is as an access-control tool.
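
For contrast, here is a sketch of the same thing in Common Lisp, where the package system does the access control: the counter is private simply because it isn't exported, and the function's name and argument list stay together. (add-counter! is the same assumed profiling hook as in the quoted example.)

;; Sketch: the counter is package-private without distorting the defun.
;; ADD-COUNTER! is assumed, as in the quoted example.
(defvar *calls-to-f* 0)

(add-counter! 'calls-to-f (lambda () *calls-to-f*))

(defun lib-mapcar (f list)
  (mapcar (lambda (x)
            (incf *calls-to-f*)
            (funcall f x))
          list))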

I'm not the only one with this problem

Arnold Zwicky, of the famous Language Log, describes his blogging habits:

Meanwhile, proto-postings pile up. I'd estimate that I have about 800 partially completed postings, but though I used to keep some inventories of these, there are now so many different inventories in so many different places that I can't keep track of them.

That's exactly my situation, except I only have about 350.

Overloading in math

A few months ago Joe Marshall did a series of posts about how computer science has affected his view of math, beginning with its surprising lack of rigor: unlike programming languages, math notation is often ambiguous.

Check out these two equations:

H(Q) = ∫₀¹ log Q′(p) dp
H(x) = −∫ f(x) log f(x) dx

By the first equation, Q is obviously some sort of function (the quantile function). H therefore operates on a function. By the second equation, the argument to H, which is called x, is used as the argument to f, which is the probability distribution function. f maps numbers to numbers, so x is a number. The types don't match.

This type error is precisely why this notation works. We can write Q to mean both a function and its result (or, to put it differently, both a quantity and a function relating that quantity to something else) because the types differentiate them. It's just like overloading in programming languages, but it's more pervasive, and it's not declared in advance, and one of the overloaded variables isn't even a function.

Indeed, if both Qs were functions, or even if Q returned another function instead of a number, this overloading wouldn't work. It would be too easy to confuse the two. But since their types are so different, we can safely overload arbitrary variables with the functions that compute them, and only programmers will be confused.

Programmers are paranoid about ambiguous notation, because computers can't be trusted to resolve it right. Mathematicians are much less so, because humans handle it better.

I think we can learn something from the mathematicians here. Language designers sometimes reject new overloadings for fear they'd be confusing, even if they wouldn't pose a problem for compilers. For example, overloading arithmetic operations to work on collections or functions sounds scary — wouldn't it be hard to tell what was being added and what was just being mapped or composed? But similarly wide-ranging overloadings in mathematical notation don't cause much trouble, and indeed these specific overloadings exist and work fine in APL and J. When a new construct looks confusing to humans but not to machines, that may be a sign that it's just unfamiliar.
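
To make that concrete, here is a purely hypothetical sketch (in Common Lisp, and not proposing any real library) of arithmetic overloaded the way the math notation is. The compiler can tell the cases apart by type, for the same reason the mathematician can:

;; Hypothetical: ADD dispatches on type, like the overloaded math notation.
(defgeneric add (a b))

(defmethod add ((a number) (b number))
  (+ a b))

;; Elementwise, as in APL and J: (add '(1 2) '(10 20)) => (11 22)
(defmethod add ((a sequence) (b sequence))
  (map 'list #'add a b))

;; Pointwise on functions: ((add #'sin #'cos) x) = (+ (sin x) (cos x))
(defmethod add ((a function) (b function))
  (lambda (&rest args)
    (add (apply a args) (apply b args))))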

Languages with one user

I don't usually read the reddit comments on my posts, because the signal-to-noise ratio is not that high, and because the "Arcane totally doesn't get it" comments are discouraging. But this means I miss some good ones. Xach said:

When a language has only one user, who is also the implementor, badly interacting patterns don't always become obvious. The tendency is to automatically avoid problematic things without thinking about it.

Good point. I will remember this next time I start to say "...but it doesn't seem to be a problem in practice."

Mainstream at last!

Functional programming is slowly becoming popular. Yesterday I heard someone refer to Haskell as a "mainstream language". ^_^

Eliminate unsightly warnings!

Compiler warnings getting you down? People complaining about all the “weird garbage” printed when they type make? Fear not — there's a solution!

A while ago I had occasion to build a particular (over)complicated project that had recently been ported from Windows to Unix. The Windows version spewed pages of warnings when building, but on Unix there was none of that — just an empty terminal with a helpful success message.

Wait a minute...

At the end of the makefile's default rule, I found this:

        clear
        echo Done building $(TARGET)!

I hope you wouldn't consider doing anything like that. Now, if you'll excuse me, I have so many warnings to look at...

Errors

About signaling conditions in Common Lisp, Nikodemus Siivola writes:

Use #'ERROR to signal conditions inheriting from SERIOUS-CONDITION. This includes all subclasses of ERROR. SERIOUS-CONDITION is a convention that means "this will land you in the debugger if not handled", which is what #'ERROR ensures.

  • Inherit from ERROR if unwinding from the error leaves the system in a consistent state. Subclasses of ERROR mean "oops, we have a problem, but it's OK to unwind."
  • Inherit from SERIOUS-CONDITION, not ERROR, if the system is messed up somehow, and/or requires either human intervention or special knowledge about the situation to resolve: SERIOUS-CONDITIONs that are not ERRORs mean that "something is deeply wrong and you should know what you are doing if you are going to resolve this."

There are two problems with this interpretation.

First, it gives meaning to a difference type: (and serious-condition (not error)) means something beyond what serious-condition alone means. You could interpret this as “serious-condition means we can't continue; error means we can exit safely” - but non-serious-condition signals also mean we can exit safely. So the conditions that can be handled safely are (and condition (or (not serious-condition) error)). This complex type is at least a warning sign. If you want to distinguish recoverable from unrecoverable conditions, shouldn't one or the other have its own class?
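
Spelling the type out makes the awkwardness plain (the deftype and its name are mine):

;; The conditions that can be handled safely, on this interpretation.
(deftype safely-unwindable ()
  '(and condition (or (not serious-condition) error)))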

Second, it's not what the Hyperspec says. It defines an error as

a situation in which the semantics of a program are not specified, and in which the consequences are undefined

and a serious condition as

a situation that is generally sufficiently severe that entry into the debugger should be expected if the condition is signaled but not handled.

These aren't the most lucid definitions, but the idea is that serious-condition means the signal can't be ignored, and error means this is because the semantics are undefined. (and serious-condition (not error)), therefore, is a situation where the semantics are defined, but where execution nonetheless cannot continue for some other reason - running out of memory, for instance. The distinction isn't between well-behaved states and messed-up ones, but between semantic nonsense and other failures. C++ has an analogous distinction between logic_error, which is the same as CL's error, and runtime_error, which covers the rest of CL's serious-condition. (C++ has only terminating exceptions, so it doesn't have non-serious conditions.)
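
CL's built-in storage-condition is the canonical inhabitant of (and serious-condition (not error)). A sketch of handling it (exactly when implementations signal it varies):

;; STORAGE-CONDITION is a SERIOUS-CONDITION but not an ERROR: the
;; semantics are defined, but execution can't continue.
(handler-case
    (make-array (expt 10 12))   ; absurdly large; may exhaust the heap
  (storage-condition ()
    (format t "Out of memory - but not nonsense.~%")))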

I'm tempted to make more such distinctions, giving something like this:

  • Nonsense: the program has asked for something undefined, like (+ '(hello) "world!"). (CL error, C++ logic_error) You can define a reasonable meaning for most nonsense errors; they are errors only because their meaning hasn't been defined yet.
  • Limitation: the implementation knows what you mean, but can't deliver, e.g. arithmetic overflow or running out of memory. (RNRS calls this a “violation of an implementation restriction”.) You could consider “not yet implemented” to be a limitation as well.
  • Environmental failure: the operation failed because the outside world did not cooperate, e.g. a file was not found or a connection was refused.
  • Forbidden: operations which do have a well-defined meaning, but you're not allowed to do them, usually for security or safety reasons, e.g. you don't have permission to unlink("/").

In other words, “I don't understand”, “I can't”, “It's not working”, and “I'm sorry, Dave. I'm afraid I can't do that.”.
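
As a condition hierarchy, the taxonomy might look something like this (a sketch; the names are mine, and where each class belongs relative to error and serious-condition is exactly the question at issue):

(define-condition nonsense (error) ())                           ; "I don't understand"
(define-condition limitation (serious-condition) ())             ; "I can't"
(define-condition environmental-failure (serious-condition) ())  ; "It's not working"
(define-condition forbidden (serious-condition) ())              ; "I'm sorry, Dave."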

This is a nice taxonomy, but I'm not sure it's actually useful. The problem is that one error often leads to another, and the error that's detected isn't necessarily the original problem. For instance, failure to open a nonexistent file (an environmental failure) may manifest as a nonsensical attempt to read from an invalid stream. (CL actually signals it as an error, even though it's not a semantic error, perhaps for this reason.) Or a fixed-size buffer (a limitation) causes a buffer overflow (nonsense) which is eventually detected as a segfault (a forbidden operation). Why try to distinguish between classes of errors when they're so fluid?

Fortunately, programs don't usually care why they failed, unless it was a certain specific error they were expecting. IME most exception handlers are either quite broad (e.g. catching serious-condition near the root of a large computation) and recover in generic ways such as aborting, or quite specific (e.g. catching file-not-found-error on one particular open) and attempt to deal with the problem. There aren't many handlers that try to generalize at intermediate levels, even in languages that support them.

So maybe it's not so useful to classify exceptions by their cause. Maybe we should instead classify them by the possible responses: things like whether they can be ignored (CL's serious-condition) and whether they can be handled at all. In that case Nikodemus' interpretation of (and serious-condition (not error)) as “unrecoverable error” is right — not for CL, but for some future language.

Ambivalent about anaphors

I like anaphoric macros like aif and aand, but that a- is rather opaque. If you didn't know what it stood for, would you ever guess "anaphoric", or have a clue what the macros did?
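
For reference, here is the canonical aif from On Lisp; the whole trick is that the binding of it is deliberately left visible to the branches:

(defmacro aif (test then &optional else)
  `(let ((it ,test))     ;IT is captured on purpose - that's the anaphor
     (if it ,then ,else)))

;; e.g. (aif (gethash key table) (frob it) (complain))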

Abstruse naming conventions are nothing new — how long have Lispers been using -p for predicates? — but that's no reason to inflict more of them. Is there a less exotic affix for anaphoric macros? Probably not, because it has to be short to be useful, and anything that short will be pretty opaque.

In any case, I'm not sure separate anaphoric macros are the right idea. It's nice to be able to bind a variable by adding a single character to a name, but it's not general. It won't work with a newly defined macro until someone defines an anaphoric variant, which realistically won't happen very often. Maybe that doesn't matter, since you can always fall back on let.

I wonder if it wouldn't be better to be like Perl, and make everything anaphoric. If the normal versions of if and progn bound a variable (perhaps with a more distinctive name than it?), then new macros based on them would too. This sounds nightmarishly unpredictable, since a newly defined macro might use several anaphoric constructs, and the innermost is not necessarily the one you want. It works for and but not for (the naive definition of) case, for instance.

Rampant anaphora might also pose a maintenance hazard, since simple transformations of code may change their referents. However, I haven't heard that this is a big problem in Perl. I suspect this is because people normally base their changes on a mental model of local dataflow, and in Perl, this model includes $_. Unfortunately I don't have enough Perl experience to test this on myself.

I would like to dismiss anaphora on the grounds that they don't compose well: while one aif can make an expression much simpler, a second will often conflict with the first. But this isn't a very distinctive fault. Many useful constructs are hard to compose — even function composition, ironically — and anaphoric macros are too handy to ignore.

Identity of numbers in Common Lisp and Maclisp

Nick Levine asks:

If anyone knows of a Common Lisp implementation for which (eq 3 3) need not be true, could they get back to me? I know what the Spec says about this but have difficulty believing it.

Commenters duly reply that ABCL and some versions of GCL do sometimes have multiple fixnums with the same value. Basically, this can happen in any implementation that uses boxed fixnums and doesn't intern them.

But this is not the only thing that can happen to fixnum identity. Common Lisp allows much worse:

An implementation is permitted to make "copies" of characters and numbers at any time.

This is an ugly thing to have in a language with otherwise straightforward object semantics. CLtL explains:

In general, numbers in Common Lisp are not true objects; eq cannot be counted upon to operate on them reliably. In particular, it is possible that the expression

(let ((x z) (y z)) (eq x y))

may be false rather than true if the value of z is a number.

Rationale: This odd breakdown of eq in the case of numbers allows the implementor enough design freedom to produce exceptionally efficient numerical code on conventional architectures. MacLisp requires this freedom, for example, in order to produce compiled numerical code equal in speed to Fortran. Common Lisp makes this same restriction, if not for this freedom, then at least for the sake of compatibility.

If I understand correctly, the rationale refers to how PDP-10 Maclisp handled arithmetic. It used boxed numbers, and to reduce allocation it often kept them on temporary stacks. This created a problem when a stack-allocated number had to escape and become a permanent value:

Some operations, such as storing a number in a special variable, or RPLAC'ing it into a cons, clearly require permanent allocation of the number. However, in order to allow allocation-free numeric computation, PDP-10 Lisp allows the passing of "intermediate" numeric results between functions without allocating them. Because of this, a function receiving any argument which cannot be positively verified at compile time as not being a pointer to a "temporary" number ("PDL number") must be extremely circumspect about what it does with that argument, in fact, calling a runtime routine named PDLNMK whenever a result of "unknown parentage" is stored into a permanent place. The routine checks to see if the result being stored is in fact a pointer to a PDL number, and allocates it permanently if so.

Steele's Fast Arithmetic in MacLISP provides more detail: NCOMPLR reallocated numbers whenever they escaped and had to become permanent, so it sometimes allocated a single value more than once, producing multiple non-eq copies.

The same problem arises with deboxing optimizations in other compilers. Most deboxing compilers, however, are careful to avoid multiple reboxing, for performance reasons: they don't want to embarrass themselves by accidentally allocating even more numbers than they would with naive boxed arithmetic. This has the convenient side effect of preserving identity of numbers.

So while Common Lisp allows implementations to copy numbers at will, I suspect this is purely a historical relic, “rooted in NCOMPLR” rather than in any practical need. Has there ever been an implementation that exploited it?

Function of the day: lerp

In response to clamp, a commenter unerringly pointed out another function whose common name I didn't know: lerp, for linear interpolation. You know the operation - you have two points and you want a line:

lerp :: (Num a, Num b) ⇒ (a, b) → (a, b) → a → b
lerp (x1,y1) (x2,y2) x = y1 + (x - x1) * (y2 - y1) / (x2 - x1)

? foo = lerp (1,1) (10,4)
? foo 4
2
? foo (-5)
-1

What if you have more than two points? The obvious generalization is to interpolate between successive pairs of points:

polylerp :: (Num a, Num b) ⇒ [(a, b)] → a → b
polylerp [] _                              = error "domain"
polylerp ((x1, _):ps) x          | x1 > x  = error "domain"
polylerp ((x1, y1):ps) x         | x1 == x = y1
polylerp ((x1, y1):(x2, y2):_) x | x2 > x  = lerp (x1, y1) (x2, y2) x
polylerp (_:ps) x                          = polylerp ps x

(Apologies for the explicit recursion. I wanted to write this with combinators to more closely resemble the natural-language description, but even with made-to-order combinators, I can't find anything clearer. This sort of iteration - a fold that operates on more than one element at once - is often awkward to express.)

The commenter mentioned that lerp, like clamp, is used a lot in graphics. This use is more specific than interpolating arbitrary functions between points: it's used to mix two pixels, for applications like antialiasing or transparency. This doesn't require arbitrary X-coordinates, so a simpler variant of lerp is common:

lerp :: (Num a, Num b) ⇒ b → b → a → b
lerp y1 y2 s = y1 + s * (y2 - y1)

As with clamp, I've used this form of lerp before, but didn't know its standard name - I called it mix. Can you tell I'm not a graphics person?

Now for a question. The examples above are in (untested) Haskell, but the type signatures are not the real ones, since Haskell requires in all of them that a == b. I wrote them this way to make the distinction between X and Y coordinates clear, since the types would be unreadable otherwise. Is this a good idea?

Function of the day: clamp

It's trivial and useful: clamp restricts a number to a range.

clamp val low high = min (max val low) high

Argument order varies, of course. This particular order has the convenient property that if you forget and write clamp min val max, it still works.

I wanted this function years ago, in an application where I'd written a great many calls to min and max, with a number of mistakes because of the counterintuitive usage (min to impose a maximum, max a minimum). I almost replaced them with clamp, but I couldn't think of a good name for it. Then recently I stumbled across it in some language's standard library (I have already forgotten which) and was delighted to finally be able to put a name to the concept; I must have been (so I ashamedly imagine) the only mathematically inclined programmer in the world who didn't know what it was called.

Doing without modify macros

Speaking of setf-like macros, there's something that bothers me about Common Lisp's define-modify-macro: most of its uses are very predictable. Alexandria, for example, contains at least five of them: minf, maxf, appendf, removef, and deletef. Except for argument order, all five are identical macroifications of standard functions. From their names, you can predict not only their meanings but their whole implementations.

This is not a good sign. Rather than trying to define a modify macro for each function, wouldn't it be better to just have one modify macro which takes a function as a parameter?

(defmacro modify (fn place &rest args &environment e)
  (expand-setf (v r w) place e
    `(let ((,v (,fn ,r ,@args)))
       ,w)))

So instead of (define-modify-macro foof (args ...) foo) and (foof place args ...), you'd write (modify foo place args ...). This could make most modify macros unnecessary, although it would need a shorter name, perhaps even one as short as !. However, its usefulness depends heavily on having functions available with the right argument order. If you have to fiddle with the function to fix its argument order, modify loses its brevity, and you might as well write out the setf.
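
When the argument order does cooperate, the calls are as terse as the macros they'd replace. A few sketched uses (the places are stand-ins, and expand-setf is the macro defined in the next post):

(modify max high-score points)    ; MAXF
(modify append xs (list 4 5))     ; APPENDF
(modify string-upcase name)       ; no predefined modify macro needed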

Simplifying setf-like macros

Attention conservation notice: I sometimes post things that are of general interest to language designers, but this is not one of those times. If you don't speak Common Lisp, don't like macrology, or don't care about generalized assignment, this post is not for you.

As the rplacd&pop macro illustrates, defining new setf-like macros is quite awkward in Common Lisp. I'm not talking about defining new setf-able places, but about defining macros that operate on places, like push and swapf. They're so awkward to write that the Hyperspec doesn't directly define most of them, but instead resorts to simplified versions with disclaimers:

The effect of (push item place) is equivalent to

(setf place (cons item place))

except that the subforms of place are evaluated only once, and item is evaluated before place.

There is actually a sample implementation of pop at get-setf-expansion, but it's not linked from pop, probably because it's too complex to be of much help in explaining what pop does. Handling places properly requires calling get-setf-expansion and doing nonobvious things with its five return values, which is a lot of trouble that has little to do with the desired macro. Consider the definition of rplacd&pop:

(defmacro rplacd&pop (x &optional y &environment e)
  (multiple-value-bind (temps forms svars writeform readform)
                       (get-setf-expansion x e)
    `(let* (,@(mapcar #'list temps forms)
            (,(car svars) (cdr ,readform)))
      (prog1 ,readform
        (setf (cdr ,readform) ,y)
        ,writeform))))

There is a lot of predictable code here. This macro differs from pop in only two places (the value form of the prog1, and the added setf of the popped cons's cdr), and from push in those two plus argument order. Most of the code is the same. Can we abstract it out with a macro?

(defmacro expand-setf ((valuevar readvar writevar) place env &body body)
  "Generate code for a SETF-like operation: evaluate BODY with the three
vars bound to the corresponding parts of PLACE's SETF expansion, then
wrap the result in a LET to bind the SETF temporaries.
BODY should normally evaluate to a form that binds VALUEVAR to the
desired new value of PLACE and then does WRITEVAR."
  (with-gensyms (temps forms storevars)
    `(multiple-value-bind (,temps ,forms ,storevars ,writevar ,readvar)
                          (get-setf-expansion ,place ,env)
       (let ((,valuevar (car ,storevars)))
         `(let (,@(mapcar #'list ,temps ,forms))
            ,,@body)))))

I confess that I had some trouble writing this: I got confused about nested splicing. I'd feel bad about this, but I know everyone else has trouble with it too. Is there any other part of Lisp that so reliably earns its difficult reputation?

Anyway, having spent an hour writing expand-setf, I tried it out and was disappointed: it doesn't save as much code as I hoped. I forgot one of the basic rules of making difficult abstractions: write an example call before you write the macro.

(defmacro rplacd&pop (place &optional newcdr &environment e)
  (expand-setf (val readform writeform) place e
    `(let ((,val (cdr ,readform)))
       (prog1 ,readform
         (setf (cdr ,readform) ,newcdr)
         ,writeform))))

This is, unfortunately, only two lines shorter than the get-setf-expansion version, because expand-setf still requires the hassle of binding the store variable and passing the environment around. It scales nicely to more complicated macros, though:

(defmacro swapf (p1 p2 &environment e)
  "Two-argument ROTATEF."
  (expand-setf (v1 r1 w1) p1 e
    (expand-setf (v2 r2 w2) p2 e
      `(let ((,v1 ,r2) (,v2 ,r1))
         ,w1
         ,w2))))

We could get rid of the magical storevar by packaging writeform as a function, so you do (,writefn exp) instead of (let ((,storevar exp)) ,writeform). Unfortunately this doesn't necessarily make the callers any shorter, because they often have to store the value in a variable anyway:

(defmacro rplacd&pop (place &optional y &environment e)
  (expand-setf (r w) place e
    (with-gensyms (newval)
      `(let ((,newval (cdr ,r)))
         (prog1 ,r
           (rplacd ,r ,y)
           (,w ,newval))))))

It's also annoying that expand-setf wraps a form around the result of its body. That's a surprising thing for a macro to do, and I can imagine it confusing people and causing bugs. We can get rid of that - and of the whole macro - in a clean way, by packaging the expansion as a function:

(defun place-accessor (place env)
  "Return a form evaluating to an accessor for PLACE: a function
that returns the value of PLACE when called with no arguments, and
sets PLACE when called with one argument."
  (expand-setf (v r w) place env
    (with-gensyms (set?)
      `(lambda (&optional (,v nil ,set?))
         (if ,set?    ;CASE-LAMBDA would be ideal here
           (progn ,w ,v)
           ,r)))))

This turns places into first-class values, which, not surprisingly, makes them easier to work with. (I think this has been done before to emulate pass-by-reference, but I've never seen it done to simplify writing other macros. Update January 2016: Strachey did it in 1967 in Fundamental Concepts in Programming Languages, under the name “load-update pair”.) Distinguishing reads from writes by arity this way isn't entirely clean, but it beats packaging them as two separate functions, which would be prohibitively awkward for the caller. Here's how it works in practice:

(defmacro swapf (place1 place2 &environment e)
  (with-gensyms (a1 a2 v)
    `(let ((,a1 ,(place-accessor place1 e))
           (,a2 ,(place-accessor place2 e)))
       (let ((,v (funcall ,a1)))
         (funcall ,a1 (funcall ,a2))
         (funcall ,a2 ,v)))))

This is longer than the expand-setf version, because it doesn't handle the gensyms and variables itself. (Ignore the funcalls; they wouldn't be there in a Lisp-1.) It's less magical, but I'm not sure it's any easier to read. place-accessor would be best used as the foundation of a setf system, since it's conceptually simpler than get-setf-expansion, but it still needs a with-place-accessor wrapper macro. And I've lost sight of the original problem. None of these solutions make defining rplacd&pop and push as simple as it ought to be. There must be a better way.

Maybe I'm coming at this from the wrong direction. Rather than simplify the general case, what about generalizing the simple case? define-modify-macro already covers the most common kind of one-place macros: those that simply apply some function to the value in the place, and store (and return) the result:

(setf place (some-function place args ...))

For example:

CL-USER> (define-modify-macro pop* () cdr
           "Like POP, but doesn't return the popped value.")
POP*
CL-USER> (macroexpand-1 '(pop* place))
(LET* ((#:G4467 (CDR PLACE)))
  (SETQ PLACE #:G4467))

The only reason define-modify-macro can't express pop and rplacd&pop is that the macros it defines must return the same value they store. We could fix this by adding another argument: in addition to a function to determine the stored value, we can supply one to determine the returned value:

(defmacro define-modify-macro* (name arglist transformfn resultfn
                                &optional docstring)
  "Like DEFINE-MODIFY-MACRO, but takes an additional function, to
define a macro that returns a different value than it stores.
Subforms of the place are evaluated first, then any additional arguments,
then TRANSFORMFN is called, then RESULTFN, then the value is stored, and
finally the result of RESULTFN is returned.
ARGLIST may contain &optional arguments but not &rest or &key."
  (let* ((args (arg-names arglist))
         (gensyms (mapcar (lambda (a) (gensym (symbol-name a))) args)))
         ;GENSYMS is a list of names (in the intermediate expansion) of
         ; names (in the final expansion) of the values of the ARGS. They
         ; have to be gensymmed to protect TRANSFORMFN and RESULTFN.
    (with-gensyms (place env)
      `(defmacro ,name (,place ,@arglist &environment ,env)
         ,@(when docstring (list docstring))
         (expand-setf (v r w) ,place ,env
           (with-gensyms ,gensyms
             `(let (,,@(mapcar (lambda (gen arg) ``(,,gen ,,arg))
                               gensyms args))
                (let ((,v (,',transformfn ,r ,,@gensyms)))
                  (prog1 (,',resultfn ,r ,,@gensyms)
                         ,w)))))))))

(defun arg-names (arglist)
  "Return the names of the arguments in a lambda-list."
  (loop for arg in arglist
        unless (member arg lambda-list-keywords)
        collect (if (consp arg) (car arg) arg)))

(I had trouble writing this macro too - not just the multiple levels of splicing and gensyms, but also the awkwardness of extracting the argument names. Common Lisp has complex argument-lists, but no tools to parse them, so it took me a few tries before I realized I needed to do it myself, and then a few more before I gave up trying to handle &rest and &key. And I almost forgot to prevent multiple evaluation of argument forms!)

With this macro, pop-like macros can be defined more conveniently:

(define-modify-macro* rplacd&pop (&optional newcdr)
  (lambda (x y) (declare (ignore y)) (cdr x))
  rplacd
  "I am not explaining this one again.")

Except for the hassle of lambda and ignore to give cdr the right number of arguments, this is not too painful. push requires a wrapper to fix its argument order, but pop is easy:

(define-modify-macro* %push (val)
  (lambda (list val) (cons val list))
  (lambda (list val) (declare (ignore list)) val)
  "Helper for PUSH")

(defmacro push (val place)
  `(%push ,place ,val))

(define-modify-macro* pop () cdr car)
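
And the expansion is what you'd have written by hand: (pop stack) becomes, roughly (gensym names simplified, empty temporary bindings elided):

(let ((#:new (cdr stack)))
  (prog1 (car stack)
    (setq stack #:new)))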

Correction: push returns the same value it stores, so it doesn't require define-modify-macro*.

I started this post intending to learn something about how to design a better setf-like generalized assignment system, and succeeded only in writing a few macros to make dealing with Common Lisp's setf less painful. This is, I suppose, still useful, but it's disappointing.

When the functional way is much worse

One of the strengths of functional (i.e. applicative) programming style is that it makes dataflow obvious. Usually this is a good thing, but there are many cases where parts of a program's dataflow are best kept hidden. For instance, some functions return a value (or several) not because their caller is ever interested in it, but so the caller can pass it to somewhere else that actually is interested. This often happens when you're keeping some sort of profile information. For example, suppose you're playing with the very recursive Sudan function:

(defun sudan (n x y)
  "The first known function that is recursive but not primitive recursive."
  (cond ((zerop n) (+ x y))
        ((zerop y) x)
        (t (let ((s (sudan n x (1- y))))
             (sudan (1- n) s (+ s y))))))

CL-USER> (sudan 1 2 3)
27
CL-USER> (sudan 3 2 1)
Control stack exhausted (no more space for function call frames).
   [Condition of type SB-KERNEL::CONTROL-STACK-EXHAUSTED]
CL-USER> (sudan 2 2 1)
27
CL-USER> (sudan 2 2 2)
15569256417
CL-USER> (sudan 2 2 3)
Control stack exhausted (no more space for function call frames).
   [Condition of type SB-KERNEL::CONTROL-STACK-EXHAUSTED]
CL-USER> (sudan 2 3 2)
5742397643169488579854258

It returns instantly for very small arguments, and overflows for slightly larger ones. What's going on?

The obvious way to find out is to instrument it to count how many times it was called. (Yes, I know there are tools for this, but this is an example of a general problem, of which most cases don't have tool support.) The proper, functional way is to return the number of calls as an additional value:

(defun sudan/count (n x y)
  "The first known function that is recursive but not primitive recursive.
Returns two values: S_n(x,y), and the number of calls required to compute it."
  (cond ((zerop n) (values (+ x y) 1))
        ((zerop y) (values x 1))
        (t (multiple-value-bind (s calls)
               (sudan/count n x (1- y))
             (multiple-value-bind (s2 calls2)
                 (sudan/count (1- n) s (+ s y))
               (values s2 (+ calls calls2 1)))))))

CL-USER> (sudan/count 2 2 2)
15569256417
69
CL-USER> (sudan/count 2 3 2)
5742397643169488579854258
165

Hmm. That's not so bad. The number of calls is not so bad, I mean. The code, though, is a mess. The relationship of the recursive calls is obscured with multiple-value-binds, and every return site has to be wrapped with values, and if you miss one the function won't work. This is not a good way to instrument a function. More precisely, this is not a good way to pass profiling data up the call tree. But is there a better way?

If I were in a stateful mood, I would have done it this way:

(defun sudan/count2 (n x y)
  "The first known function that is recursive but not primitive recursive.
Returns two values: S_n(x,y), and the number of calls required to compute it."
  (let ((calls 0))
    (labels ((sudan (n x y)
               (incf calls)
               (cond ((zerop n) (+ x y))
                     ((zerop y) x)
                     (t (let ((s (sudan n x (1- y))))
                          (sudan (1- n) s (+ s y)))))))
      (values (sudan n x y) calls))))

That's one line longer than the multivalue version, but shorter in characters or nodes - and much easier to maintain, because the profiling is not tangled through the rest of the function. It also scales much better, because it doesn't require such pervasive changes. The trouble with state - well, one of the troubles with state - is that it hides dataflow, but in this case hiding dataflow is exactly what we want.

By the way, I had a predictable bug in the first version of sudan/count: when I renamed the function, I forgot to change the recursive calls, so it called the old version. Recursing by name is redundant in a bad way - it depends on the function's name, so merely renaming the function changes its behaviour! This wouldn't have happened in Forth, which has a recurse operation that doesn't depend on the function's name. Anonymous recursion is inflexible, of course, but there are many simple cases like this, where it more closely expresses what you mean, and avoids a possible bug.

Speaking of possible bugs: If you noticed the evaluation order dependency in (values (sudan n x y) calls), good for you. I didn't.

How toy problems are different

A lot of simple text-processing problems involve operating on the words of the input. Norvig's spellchecker, for instance, begins by extracting the words:

def words(text): return re.findall('[a-z]+', text.lower())

So do some of the old Shootout problems (though none of the currently active ones, and they don't keep old ones around, so if you want to see an example you'll have to use the Wayback Machine). This sets my library-senses tingling. If so many programs need to split text into words, shouldn't words be in the standard library?

There are two problems here. The lesser is that the definition of "word" varies a bit - sometimes it's defined by alphanumeric characters, sometimes by whitespace. This isn't a showstopper, because some reasonable definition will probably work for a large fraction of the potential uses. The greater problem is that splitting things into words is much less common in real programs than in toy examples. Words are like Fibonacci numbers: they're familiar and easy to ask for, so they occur unusually often in made-up requirements. Neither is common in real programs - probably not enough so to warrant including them in a language's library.

This is one of the pitfalls of testing a language against toy problems. Unlike real problems, they're strongly biased toward simplicity of requirements, which affects what they demand from a language. Real programs tend to spend a lot of code on I/O and error handling, neither of which turns up much in toy problems. Is it a coincidence that these two areas are awkward to express in most languages?

Where have all the pointers gone?

Dynamically typed languages represent almost everything as tagged pointers, which sounds very inefficient: doesn't that waste a lot of space? And doesn't 64-bit addressing make this waste much, much worse?

I used to take this seriously. But one day a few years ago I was playing with Squeak, and wondering why its images (savable worlds) were so large - 10 MB, when the original Smalltalk ran in 64 KB. I stumbled across allObjectsDo:, which iterates across all objects in the heap, so I wrote a heap profiler to find out where the space was going. Once I finished it, I discovered Squeak came with a much better one, so I used that instead. Here are all the classes that take more than 1% of the space:

Class                     Purpose                # Instances    Space     %
CompiledMethod            bytecode                     33951   2.0 MB    20%
Bitmap                    graphics                      1208   1.8 MB    17%
String                    text                         28743   1.7 MB    16%
Array                     various                      30263   1.3 MB    12%
ByteArray                 various                        379   800 KB     8%
Symbol                    selectors mostly             27786   550 KB     5%
MethodDictionary          method dispatch               3092   320 KB     3%
WeakArray                 symbol table mostly             15   230 KB     2.2%
Point                     GUI                          17193   200 KB     1.9%
MorphExtension            GUI                           3344   170 KB     1.6%
Color                     GUI                           8323   163 KB     1.6%
Association               hashtables                   10761   126 KB     1.2%
Float                     boxed float                   9732   114 KB     1.1%
All others                                                     1.1 MB    10%
Classes without pointers                                      >6.6 MB   >65%
Total                                                         10.2 MB   100%

This shows why tagged pointers are not a big space problem: there just aren't that many of them. Five of the top six classes, accounting for two thirds of the space, are binary data. (The exception is Array.) Dynamically typed languages, it turns out, don't encode most data as tagged pointers.

Well, of course. What do you do with a large data structure? You find a more compact representation. Often that means replacing low-entropy pointers with high-entropy bytes. That's what happened to all of those five non-pointer classes. Dynamic typing makes no difference: as long as you have some sort of byte arrays (in Smalltalk, Class»variableByteSubclass:), you can do this in dynamically typed languages the same way you do it in statically typed ones. If tagged pointers are wasteful, the waste is in boxing small objects like floats, not in the presence of pointers.
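
The same move in Common Lisp, for instance: a specialized array stores its floats flat, while the naive representation pays for a pointer and a box per element. (Whether the array is actually specialized depends on the implementation; SBCL, for one, does it.)

;; A specialized vector: a million doubles stored flat, no pointers inside
;; (in implementations that specialize on ELEMENT-TYPE, such as SBCL).
(make-array 1000000 :element-type 'double-float :initial-element 0d0)

;; The pointerful version: a million conses pointing at boxed floats.
(make-list 1000000 :initial-element 0d0)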

Of course this is only one image, and an odd one at that - instead of application-specific data, it's dominated by library code. The reason Squeak is so much bigger than the old Smalltalks is that it has a much bigger standard library - and no autoloading, so it's all in the image all the time. Code accounts for at least 30% of the space: CompiledMethod, Symbol, MethodDictionary, and WeakArray (most of which is the symbol table). This 3 MB would be insignificant in a bigger image, but otherwise I expect larger applications to have space profiles much like Squeak's. Most classes are made of pointers, but most of the space goes on the few that aren't.

A hazard of precedence

I encountered a bug recently from a precedence mistake. Someone had written if (a = b == c), intending if ((a = b) == c) - but in C == has higher precedence than =, so that wasn't how the parser saw it. The typechecker didn't object, because the result type of == in C is int, so all was apparently well. None of the humans noticed either, perhaps because b was a large expression, so the two infix operators were far apart.

Operator precedence saves lots of parentheses, but it does occasionally lead to bugs. It's one of the class of language features that work by resolving ambiguity, and these have a common hazard: when you don't notice the ambiguity, it will still be resolved, and not necessarily as you intend.

Short names are addictive

This morning I tried to do this in Common Lisp:

(defun neighbors (x y)
  (mapcar (fn (dx dy) (vector (+ x dx) (+ y dy)))
          '(-1 0 1 0)
          '(0 1 0 -1)))

Naturally, SBCL spit out half a page of warnings, starting with this:

;     (+ X DX)
; 
; caught WARNING:
;   undefined variable: DX

;     (DX DY)
; 
; caught STYLE-WARNING:
;   undefined function: DX
; 
; caught WARNING:
;   undefined variable: DY

It took me an embarrassingly long time to figure out why. I have become so used to fn being lambda — in Arc, Clojure, and any number of other new Lisps, including my own — that I didn't notice anything out of the ordinary.
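
The fix, of course, is a one-liner, which is exactly how the addiction spreads:

(defmacro fn (args &body body)
  "LAMBDA, for the impatient."
  `(lambda ,args ,@body))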

I did at least remember that CL can't add vectors, which was how I wanted to write it. This is a hazard of learning new languages, and of designing them: improvements, even such minor ones as shorter names, are addictive. They're fun at first, but soon you become dependent on them, and the old ways seem unbearably clunky, if you can even remember them.

The value of extensible format strings

Common Lisp's version of printf is the famously overpowered format. In addition to everything printf does, it has conditionals, iteration, recursion, case-folding, tabulation and justification, pretty-printing, English plurals and number names (cardinal and ordinal), and two kinds of Roman numerals. Surprisingly, in a language where almost everything is user-accessible, most of these features are not available separately, only through format. The complexity is such that there's a compiler to make format strings execute faster. Sadly (or not, depending on your perspective) it's not Turing-complete, but that's only because it has no way to store information.

Of course there's a way around that. None of format's features is quite as overpowered as ~/, which calls arbitrary functions: (format stream "~/package:function/" x) does (package:function stream x nil nil). (The two nils are for some other features not used in this case. Yes, there are more. Are you surprised? And yes, in practice the explicit package name is required.) User extensibility is a good principle, but this seems silly. Why would you ever want to call a function through format instead of doing so directly?

Well, the other day I had to generate some XML in C, and the obvious way was with printf:

fprintf(somefile, "<element attr=\"%s\" />", somestring);

But there might be XML-meaningful characters in the string, and I didn't want to mess around with fancy XML-generation libraries. So I had to either escape the string before printing...

char buffer[BIG_ENOUGH]; /* yeah, right */
escape_xml(buffer, BIG_ENOUGH, somestring);
fprintf(somefile, "<element attr=\"%s\" />", buffer);

...or give up on printf, which was what I ended up doing:

fprintf(somefile, "<element attr=\"");
print_xml_string(somefile, somestring);
fprintf(somefile, "\" />");

Either option destroys the clarity of the printf. What I really wanted was a custom printf operation, like this:

fprintf(somefile, "<element attr=\"%/escape_xml/\" />", somestring);

That's exactly what format's ~/ command does:

(format somefile "<element attr=\"~/xml:escape/\" />" somestring)

I guess it's not so silly after all. It is verbose (imagine repeating ~/xml:escape/ for each of a dozen attributes), but that's fixable. If there were an interface for defining new format commands, then any frequently used ~/ function could be given a one-character form, and all would be well, except possibly for readability. (Although in this case all of the obvious characters x e & < s are already taken.) Lisp being Lisp, you generally can get at the implementation's way of defining format commands, e.g. sb!format::def-format-directive, but depending on implementation internals is not usually a good idea. Exposing this interface would make format more malleable, like the rest of the language, and would also make its long feature list easier to swallow.
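
For completeness, here is roughly what the function behind ~/xml:escape/ would look like. The xml package is my invention, but the calling convention (stream, argument, colon flag, at-sign flag) is fixed by format:

;; A sketch of XML:ESCAPE, following ~/'s fixed calling convention.
(defun escape (stream object &optional colon-p at-sign-p)
  (declare (ignore colon-p at-sign-p))
  (loop for char across (princ-to-string object)
        do (case char
             (#\< (write-string "&lt;" stream))
             (#\> (write-string "&gt;" stream))
             (#\& (write-string "&amp;" stream))
             (#\" (write-string "&quot;" stream))
             (t (write-char char stream)))))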

For new languages, though, I think I prefer string interpolation, which avoids the issue entirely:

(put somefile "<element attr=\"$(xml:escape somestring)\" />")

It would also be nice to have a choice of string delimiters, so the quotemarks don't require escaping. But that's a different, less interesting issue.