Losing data made easy

In the Save As... dialog box in Mac OS X, it is possible to select an existing file, in which case that file will be overwritten. Just like Windows, in other words. (Is there a single feature Mac OS X has adopted from Windows that's good?) Actually it's more dangerous than the Windows equivalent, because there's no obvious indication that you've selected a file, and the filename field is at the top of the dialog, where you won't look after clicking on the file list below it. So it's easy to misclick and not notice that your carefully typed filename has just been replaced by the name of an existing file you do not want to lose. And if you thoughtlessly press return when asked "Are you sure you want to overwrite this file?", it will.

I had noticed this could be a problem, but I wasn't careful enough, and today I did it for the first time. Fortunately I was using Aquamacs, which helpfully made a backup of the file before clobbering it, so I didn't lose anything. (The backups can be annoying, but this is not the first time they've saved me. I should really be using a versioning filesystem.) But the same problem affects virtually every OS X application, and very few of them make backups. So here, in a system designed for ease of use, is a subtle feature with a rather unlikely intended use (usually the Save command does all the overwriting you want), and a very easy accidental use that loses data.

As the old joke goes:

How do you shoot yourself in the foot with Mac OS?

It's easy - just point and shoot!

The cost of macros

I had a tool problem while reading obfuscated Lisp: I wanted automatic refactoring. In particular I wanted to be able to α-rename the obnoxious variables to something that didn't look so much like brackets. But I had to do it by hand, because there are no good refactoring tools for Lisp. This is partly because Lisp culture values tools for expression more than tools for maintenance, but partly because automating most refactorings is hard in the presence of macros.

Most Lisps have macros in their purest form: they're arbitrary functions that transform new forms to old ones. That means there's no general way to walk the arguments, because neither the function nor the expansion will tell you what parts of the call are forms, let alone what environment they belong in. There is no way to be sure the macro doesn't implement a different language, which makes analysis nearly impossible.
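
For instance, a contrived pair of macros (the names are mine) whose calls have exactly the same shape:

(defmacro evaluating (x)
  `(print ,x))   ; X is a form: (evaluating foo) references the variable FOO

(defmacro quoting (x)
  `(print ',x))  ; X is data: (quoting foo) never touches the variable

;; An alpha-renamer that renames FOO must rewrite (evaluating foo) but
;; leave (quoting foo) alone, and nothing in the expander functions
;; says which is which.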

You can almost do it by observation: if part of a macro call appears as a form in the expansion, you can treat it as a form in the call — and you even know its environment. (Note that this requires eq, because you want to detect that it's the very same object, not another identical one.) Unfortunately this fails when the same form appears more than once in the original — and this is normal for symbols, so this technique doesn't get you very far. Even if the representation of code were different, so variable references appeared as (ref x) rather than being abbreviated to x, it would still break when a form appears more than once in the same tree. And in the presence of macros, partial sharing is actually rather common, because multiple calls to the same macro often share part of their expansions. So reliably walking macro calls requires having more information about a macro than just how to expand it.
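
Here's what the observation technique might look like in Common Lisp (a heuristic sketch; both function names are mine):

(defun guess-subforms (call)
  "Guess which arguments of the macro CALL are forms, by expanding the
call and collecting the arguments that reappear (EQ) in the expansion."
  (let ((expansion (macroexpand-1 call)))
    (remove-if-not (lambda (arg) (appears-in arg expansion))
                   (rest call))))

(defun appears-in (x tree)
  "True if X is EQ to TREE or to any subtree of TREE."
  (or (eq x tree)
      (and (consp tree)
           (or (appears-in x (car tree))
               (appears-in x (cdr tree))))))

;; (guess-subforms '(evaluating foo)) => (FOO), correctly.
;; (guess-subforms '(quoting foo))    => (FOO) as well: the symbol FOO
;; is a single interned object, so EQ can't tell occurrences apart.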

This is an advantage of more restrictive macro systems: they're easier to analyze. In a strict template-filling system like syntax-rules, you can always determine the role of a macro argument. DrScheme takes advantage of this for its fancy (but not very useful IME) syntax-highlighting. It doesn't work for procedural macros (Update: yes it does; see comments) but it could, if there were a way for macro definitions to supply the analysis along with the expander. Of course it would still be necessary to support unanalyzable mystery macros, because some macros are too hard to analyze, and because many authors won't bother.
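
Supplying the analysis alongside the expander could be as simple as a registry (a hypothetical API, reusing the contrived macros above):

(defvar *macro-walkers* (make-hash-table)
  "Maps a macro name to a function from a call to its list of subforms.")

(defun register-macro-walker (name walker)
  (setf (gethash name *macro-walkers*) walker))

(register-macro-walker 'evaluating (lambda (call) (rest call))) ; all forms
(register-macro-walker 'quoting    (lambda (call) '()))         ; no forms

;; Macros with no registered walker remain mystery macros, which tools
;; would have to treat conservatively.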

I don't think procedural macros are a bad feature — on the contrary, I think they are the best and purest form of one of the four most important abstraction methods in any language (the other three are variables, functions, and user-defined datatypes). But they do have a cost. And I think the cost is mostly in what other tools they interfere with, not in any difficulty humans have with them.

A digression from obfuscation to representation of code

Faré has a nice obfuscated program in his signature:

(labels(({(] &rest [)(apply([
])[))([(>)(elt(]())>))(](<)(do-external-symbols(] :cl)(push ] <))(sort
<`string<`:key`string))(}({ + ^)({`816`1/5)({`688({`875({`398()"~{~A~^
~}"(]())){(+ { +)))({`381)^))(do*(({`5248({`584 }`36063))([`874({`395
{`6))(]`4({`584 {`6))(}`#36RH4G6HUTA1NVC1ZHC({`395 }`36063)))((} [ ]
({`977 ]))({`902)({`381))))

Whitespacelessness is nice for fitting programs into signatures (although this one still breaks the McQuary limit of four 80-character lines), but as a tool of obfuscation it has been obsolete since pprint was invented:

(LABELS (({ (] &REST [)
           (APPLY ([ ]) [))
         ([ (>)
           (ELT (] NIL) >))
         (] (<)
           (DO-EXTERNAL-SYMBOLS (] :CL) (PUSH ] <))
           (SORT < 'STRING< ':KEY 'STRING))
         (} ({ + ^)
           ({ 816 1/5)
           ({ 688
              ({ 875
                 ({ 398
                    NIL
                    "~{~A~^
~}"
                    (] NIL))
                 {
                 (+ { +)))
           ({ 381)
           ^))
  (DO* (({ 5248 ({ 584 } 36063))
        ([ 874 ({ 395 { 6))
        (] 4 ({ 584 { 6))
        (} 3785580492276528215065056 ({ 395 } 36063)))
       ((} [ ] ({ 977 ])) ({ 902) ({ 381))))

So in addition to the formatting, this program has — well, I won't spoil it. Check out that { operator, though.

Wait a minute — what happened to the backquotes and #36r? They're both purely read-time constructs, so there's nothing left of them to pretty-print. This is fine for purposes of reading obfuscated code, but annoying to anyone trying to build editing tools, because the canonical representation of code does not contain the whole program.

Many CL implementations (including SBCL, evidently) expand backquote at read time, because it's slightly simpler to implement. They try to preserve it by expanding into something distinctive (e.g. using sb-impl::backq-list instead of list) so they can unexpand it when pretty-printing, but this doesn't always work — there are lots of holes and ambiguous cases. Fortunately there's a better way: make backquote a simple abbreviation for (quasiquote ...) (as in Scheme), and do the expansion in a macro.
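
The abbreviation approach fits in a few lines (a sketch, installed in a copy of the readtable so it doesn't clobber the host's syntax; the quasiquote macro itself is not shown):

(defvar *qq-readtable* (copy-readtable nil))

(set-macro-character #\`
  (lambda (stream char)
    (declare (ignore char))
    (list 'quasiquote (read stream t nil t)))
  nil *qq-readtable*)

(set-macro-character #\,
  (lambda (stream char)
    (declare (ignore char))
    (if (eql (peek-char nil stream t nil t) #\@)
        (progn (read-char stream)
               (list 'unquote-splicing (read stream t nil t)))
        (list 'unquote (read stream t nil t))))
  nil *qq-readtable*)

;; Now `(a ,b) reads as (QUASIQUOTE (A (UNQUOTE B))), so nothing is
;; lost before macroexpansion time, and the pretty-printer can print
;; the program as written.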

Number representations are harder to preserve. One could read #16rF00 as something like (radix 16 "F00"), and define radix as a macro. But this doesn't work for unevaluated data. In a system where code is more than just lists (like syntax-case), radix could be preserved along with the other extra information (line numbers etc.). The same approach can preserve #. and comments, but the conceptual cost is high — programs are no longer as simple as they appear.
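
The macro half is trivial for evaluated code (radix is a hypothetical name):

(defmacro radix (base digits)
  (parse-integer digits :radix base))

;; (radix 16 "F00")  => 3840 when evaluated...
;; '(radix 16 "F00") => ...but quoted data stays a list, as noted above.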

I think it's worth investigating anyway. It is possible to have a richer representation of code than lists without going to the opaque lengths of syntax-case. By avoiding structural abbreviations, it may even be possible to make something easier to work with than lists. There are obviously a lot of challenges here, but the possibility of improving what Lisp does best — metaprogramming — is worth spending some effort on.

And think of the possibilities for obfuscated code!

== is not so bad

Here's a language feature I (and many others) love to hate: using == as an equality test, while = means define or even assignment. It's easy to object to this on the grounds that it abuses =, but this abuse is so common that it's not really confusing any more.

Update 13 Feb: Conveniently, equality is the only common meaning of ==. Using it for anything else would be confusing. = has so many meanings that it hardly suggests anything, misleading or not.

There's also a minor advantage to this choice of names. define (or assignment, in imperative languages) can be considerably more common than equality tests, so we can save characters by giving it the shorter name. This is a tiny savings, but it is in the most important part of a language. Top-level definitions, and especially their first lines, are read more often than other code, because you scan them while searching for the definition you want. So if it makes Haskell-style definitions a little easier to read, using = could be worth a little confusion.

The three-line = sign (≡, if you have that character) would be better, but it's not a practical option yet. So I suppose I should stop complaining about = and enjoy its readability.

Symbols and strings

Does a modern Lisp really need both symbols and strings? In particular, is there any reason to have two separate types, and reader syntax for both?

Traditionally strings have been mutable (even literals!), and symbols have had properties other than their names, so it has been easy to justify having separate types. But a modern Lisp may reasonably have immutable strings, and remove all extraneous features from symbols, so the only difference is whether they're interned. That difference alone isn't worth a separate type.

There's a practical problem: when they appear in source code, "foo" is written for its characters and foo for its identity. So "foo" needs to be self-evaluating, unlike foo. Does this mean they have to be different types?

No: " can be a reader macro, so "ice cream" reads as '|ice cream|. That interferes with its other uses, like docstrings, but not badly. (And it's about time we replaced docstrings with something a little more flexible, anyway.) Maclisp used '|explicit quoted symbols| wherever strings were needed, and it was annoying to read, so we do need the abbreviation.
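
The reader macro is only a few lines (a sketch, ignoring escape characters, and not something to install in a working image):

(defun read-as-quoted-symbol (stream char)
  (declare (ignore char))
  (list 'quote
        (intern (coerce (loop for c = (read-char stream t nil t)
                              until (char= c #\")
                              collect c)
                        'string))))

;; After (set-macro-character #\" #'read-as-quoted-symbol),
;; "ice cream" reads as (QUOTE |ice cream|), i.e. '|ice cream|.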

While we're messing with ", let's make it read as something distinctive: (doublequote foo) instead of (quote foo). This has two good properties which every reader macro should have. First, it's reversible, so read and write can remain inverses as far as possible. Second, it's a simple abbreviation, so it preserves the transparency of syntax, and avoids entangling the reader with other parts of the language. It's also more flexible, because we can change its meaning by redefining doublequote. This makes it easy to add features like string interpolation, where "Hello, $scope!" expands to something like (format nil '|Hello, ~A!| scope). And this can all be done in user code, without complicating the language kernel.
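
Assuming a reader like the one above, producing (doublequote |Hello, $scope!|) instead of a quote form, the interpolating macro might look like this (a sketch; it returns a host string for simplicity, and ignores ~ escaping):

(defmacro doublequote (sym)
  "Expand a doublequoted symbol; $name interpolates the variable NAME."
  (let ((s (symbol-name sym)) (parts '()) (args '()) (start 0))
    (loop for pos = (position #\$ s :start start)
          while pos
          do (let ((end (or (position-if-not #'alpha-char-p s
                                             :start (1+ pos))
                            (length s))))
               (push (subseq s start pos) parts)
               (push "~A" parts)
               (push (intern (string-upcase (subseq s (1+ pos) end))) args)
               (setf start end)))
    (push (subseq s start) parts)
    `(format nil ,(apply #'concatenate 'string (nreverse parts))
             ,@(nreverse args))))

;; (let ((scope "world")) (doublequote |Hello, $scope!|)) => "Hello, world!"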

I think I won't miss strings.

Chris Okasaki has a blog and booleans

Chris Okasaki (he of the heroic efforts to analyze performance of partly-lazy data structures) has a blog now. It's pretty good, and readable.

I like the post on confusion over boolean expressions. (But comparisons are distinct from booleans in most machine languages...) I know I have been guilty of excessively explicit logic, usually in the misguided service of clarity. The other day I found an example in some code I had written a few months ago: if (match && !desired || !match && desired). I didn't understand it until I verified that it was equivalent to if (match != desired), which I had avoided on the grounds that it was confusing. I guess I was wrong.

You know you've been using zero-based indices too long when...

I was reading about population genetics of unusual sex ratios and encountered this bit:

Then the fitness of the second organism is proportional to:

x/(x + x0) [(1 - x) + (1 - x0)] + (1 - x)

Where x is the sex ratio produced by organism #2 and x0 is the sex ratio of organism #1.

Confused by the variable names, I mentally corrected x to x1:

Where x1 is the sex ratio produced by organism #2 and x0 is the sex ratio of organism #1.

One for the second organism and zero for the first? This seemed perfectly intuitive and unconfusing to me, because it's so familiar. Only when I tried to put it into words did I see the problem. Numbering from zero breaks the correspondence of cardinal to ordinal numbers, so speaking of it means fighting natural language with every word. It may be mathematically natural, and sometimes more convenient (and sometimes less), but I still prefer one-based indexing.

Clones for comparison

There is one reason to clone example programs: they're handy for comparing languages. As long as they're short enough to read quickly, they're a nice way to show how a language differs from others. As long as it's not too different, anyway - if the examples aren't expressed in much the same way, they won't map onto each other easily, and the reader will learn nothing except that the language is really weird.

I've used the Shootout programs as comparative examples occasionally, since they're available in so many languages that there's usually one close enough to compare to. I also use them to test the expressiveness of my languages. They're simple and easy, yet well-defined, so I can't reinterpret the problem to match whatever's easy in my language. Unfortunately I haven't found a solution to the inverse problem: I am tempted to revise minor details of my language to fit the problem. Done generically, this is good, but done for a specific problem, it destroys the value of the test, and does nothing to improve the language.

It does produce some impressive expressiveness results, though. When you even minutely customize the language to the problem, it's not hard to make it shorter and clearer than Perl and Haskell, for a wide range of problems. Now if I could just figure out how to do that in general...

Enough APL hate!

Ron Garret, while talking about something else, brings up APL for a fashionable slander:

But the drawback to APL is so immediately evident that the mere mention of the language is usually enough to refute the extreme version of the short-is-better argument: APL programs are completely inscrutable, and hence unmaintainable.

I don't mean to pick on Ron especially - a lot of people repeat this, because it's easy and satisfying, even when they know better. Ron evidently does, as he says a few paragraphs later:

Most people find APL code inscrutable, but not APL programmers. Text from *any* language, programming or otherwise, is inscrutable until you know the language.

So why the reputation? APL's one-character names give it an opaque surface, but there's nothing difficult there. Once you learn the names (and there are only a few dozen), there's no more mystery. If this were all there were to APL, the mysterious hieroglyphs would correspond to familiar concepts, and would be easy to learn. But there is something else going on: APL is a functional language.

Despite its prominent assignment operator, APL has very few side-effecting operations. Its programs are usually in a functional style, and are therefore much shorter than their imperative equivalents. To a reader who doesn't expect functional code, this completely obfuscates them, and makes it impossible to find a correspondence between the mysterious operators and familiar concepts. In addition, there are two features that further shorten programs:

1. map is implicit. Values are arrays, and scalar operations (e.g. +) are mapped across arrays without any need to say so. Together with a large supply of predefined array operations (with one-character names!), this makes a heavily collection-based style practical.

2. There are several high-order operators. / is reduce; \ is scanl; . is a combination of Cartesian product and map. As it turns out, high-order operators are especially handy when they have short names and high precedence. (APL's successor, J, takes high-order support even further: it supports some points-free function definitions.)

Here's the real reason I'm talking about APL: both of these features are easily adaptable to other languages. You can get a considerable amount of implicit map just by adding methods to the arithmetic primitives. Libraries of collection operations are even easier. I suspect that if you add these features to a language, you can get much of APL's brevity, even without syntax or inscrutably short names.
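
For instance, a handful of methods buys the implicit map (a sketch; v+ is a made-up name standing in for an overloaded +):

(defgeneric v+ (a b)
  (:documentation "Addition that maps itself over vectors, APL-style."))

(defmethod v+ ((a number) (b number)) (+ a b))
(defmethod v+ ((a vector) (b number))
  (map 'vector (lambda (x) (v+ x b)) a))
(defmethod v+ ((a number) (b vector))
  (map 'vector (lambda (y) (v+ a y)) b))
(defmethod v+ ((a vector) (b vector))
  (map 'vector #'v+ a b))

;; (v+ 1 #(10 20 30))     => #(11 21 31)  -- scalar extension
;; (v+ #(1 2) #(10 20))   => #(11 22)     -- elementwise
;; (reduce #'v+ #(1 2 3)) => 6            -- APL's +/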

And APL may even begin to look less mysterious.

By the way, Ron also said:

a web server in APL would probably be an awful lot of work

Googling... it appears to have been done, but not in public. But I doubt it's especially long. APL can do most of the same things other languages do, so while it might look verbose to an APLer, I doubt it would be longer than in imperative languages.

A plea for original examples

Language designers: when you write sample programs to show off your language, please don't clone existing programs. Clones demonstrate that your language can do the same things other languages can do, but so what? The world doesn't need more programs whose only virtue is the language they're implemented in. We already have Tetris and Minesweeper and a zillion web servers. The world especially doesn't need more implementations of Emacs. Yes, it could have a better extension language, but Emacs' strength is the corpus of elisp that people have written, and your clone will not have that. So don't waste your time.

Instead, go directly to the good stuff. Think of how often you have cool ideas that you'd like to implement, but don't because you think it's too difficult. Make these your example programs, and judge your language on its ability to make them easy. That's what it's for, right?

As a language designer and as a programmer, your hope of notoriety rests on creating something new, something that isn't already available. It's easy to follow someone else's example, but the result will not inspire anyone else. There is no language-promotion tool like a killer application. Try to write one.

Experience with vapor

Lispjobs says:

Arc developer, seven+ years experience required.

How ordinary. :)

(Via Planet Lisp.)