Proglangers and conlangers

Yesterday I used the obscure word "proglanger", meaning someone who designs programming languages. It's so rare that for the moment Google knows no other occurrences, but its meaning is probably obvious. Like "natlang", it comes from the slang of conlangers: people who construct languages for fun, like Tolkien. (People who make up whole languages don't hesitate to make up words.) Inventing human languages is obviously related to inventing programming languages (and there is some overlap between the two communities), so I want to think there are lessons to be learned from it, but... I can't think of any. Even from linguistics it's disappointingly hard to learn anything that applies to proglangs.

Okay, one habit from conlanging helps me in proglanging: when I encounter a strange new language (natural or not), I see it as a source of fun ideas, not as mysterious foreign gibberish. This gives me a rather exploitative attitude to new languages, but it probably makes me learn them faster. But this doesn't impress the part of me that wants to see one of my hobbies inform another.

Learning by translating

A couple of years ago (see how long it takes me to blog this?) I translated two programs from languages I didn't know to one I did. The destination language in both cases was one of my pet Lisps. I was translating code into it to test its expressiveness, which was painfully instructive: there were a number of places where my language couldn't do what I wanted, and some of them were hard to fix. But most of what I learned from the exercise had nothing to do with my language.

One of the programs was in obfuscated Perl. It (ab)used enough obscure parts of the language that I didn't translate it properly. That's fine. I learned a lot of Perl in the process.

The other program was in Groovy, a sort of functional Java. It's a nice example because it involves networking, XML, and a graphical user interface, so it tests a language's libraries in ways most toy examples don't. It was probably chosen to make Groovy look good, and it does - but my Lisp version came out about half the size of the Groovy one. Unfortunately for my bragging rights, this wasn't because my language has greater expressive power. It's because Groovy uses Java's libraries, and their interfaces are limited to things that can be used from Java (archaic Java, in many cases). Groovy is a much more expressive language than Java, but the libraries take no advantage of that expressiveness, so half of the program was spent working around limitations Groovy doesn't have. If Groovy had its own libraries, its program would be about as small as the Lisp one.

Well, maybe not quite as small. The Lisp translation had the unfair advantage that I invented many of the libraries as I went, so they were particularly convenient to that program. And since I didn't implement them, I may have missed some ill-definednesses. I don't think they were unrealistic, but in reality the program would probably be a bit longer than the one I came up with. Whenever you can reimagine your language to fit the task at hand, translation will flatter it. This can be useful - there is nothing as encouraging to a proglanger as seeing your language beat Perl at text processing, Unix at file management, and Haskell at data structure crunching. I've done all of these, and let me tell you, it feels good! But it doesn't mean much.

I didn't learn as much Groovy as I did Perl, probably because the Groovy program did not depend on many obscure details of the language. I treated it as a functional language with Java syntax, and that was all I needed to know to understand it. The Perl program was harder, of course; I had to learn a good deal of syntax to see through the line noise to the underlying structure, which was not as exotic as it looked. Like APL, Perl gets much of its apparent strangeness from its abbreviated syntax; the language behind the syntax is only a little weird. But in this case there were more layers of language-dependent obfuscation...

This isn't a bad way to learn a language, as long as you enjoy being puzzled. Just make sure the program you're translating is a bad one. Code that makes its meaning clear won't help you, because you'll understand it without having to think about how it works. For once, that's a bad thing.

Loopy

An anonymous commenter says the tail-recursive loop in my previous post would be clearer with loop. I agree in principle: recursion is more general than needed, so a loop should be clearer. Should be. But Common Lisp is not especially great at expressing iteration.

(loop for c = (peek-char nil stream eof-error-p)
  while (whitespacep c)
  do (read-char stream)
  finally (return c))

I'm one of those “knee-jerk anti-loopists”, so maybe I'm reading this with unfriendly eyes, but it doesn't look any clearer to me. It saves a line of explicit recursion, but spends one to explicitly return the result (but maybe I just don't know how to use loop; is there a better way?), which makes the control flow a little odd. Plus I don't like having to parse a sublanguage that isn't made of S-expressions!
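For what it's worth, the closest shortcut I know of is loop's thereis clause, which terminates the loop with the first non-nil value of its expression, saving the explicit return - a sketch, ignoring the end-of-file case:

(loop for c = (peek-char nil stream eof-error-p)
  thereis (and (not (whitespacep c)) c) ; returns C once it's non-whitespace
  do (read-char stream))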

What about ordinary do?

(do ((c (peek-char nil stream eof-error-p) (peek-char nil stream eof-error-p)))
  ((not (whitespacep c)) c)
  (read-char stream eof-error-p))

As usual, do has more non-form parentheses than I'd like, but it would be considerably shorter if it weren't for the hideous repetition of (peek-char nil stream eof-error-p). This is an unfortunate side-effect of a feature (not stepping variables with no step-form) I rarely (never?) use - if I wanted a non-stepping variable, I'd use let. In my opinion do would be much more useful if it handled the common case instead, and made a two-element binding clause (var step) equivalent to (var step step). One could also reduce the mystery parentheses a bit (and increase generality) at the cost of only a little verbosity, by making the end-test a standalone form:

(defmacro until (test &body result-forms)
  "Terminate a loop if TEST evaluates to true."
  `(when ,test (return (progn ,@result-forms))))

(do ((c (peek-char nil stream eof-error-p)))
  (until (not (whitespacep c)) c)
  (read-char stream eof-error-p))

That's not so bad.
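For the record, that version leans on both proposed changes at once. In standard do the until macro works today, at the cost of a dummy end-test and the repeated step form:

(do ((c (peek-char nil stream eof-error-p)
        (peek-char nil stream eof-error-p)))
  (nil) ; this end-test never fires; UNTIL does the returning
  (until (not (whitespacep c)) c)
  (read-char stream eof-error-p))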

I'm not the first Common Lisper to complain about the lack of good iteration facilities. Jonathan Amsterdam's answer is iterate, which is basically loop with parentheses and more features - including one, finding, for writing this sort of loop. Here's the (untested) iterate version:

(iter (for c next (peek-char nil stream eof-error-p))
  (finding c such-that (not (whitespacep c)))
  (read-char stream eof-error-p))

It would be a two-liner, except that (finding ... such-that (complement #'whitespacep)) doesn't work - that shortcut only works if the right side is a (function ...) form.
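To spell out the shortcut: such-that special-cases a literal sharp-quoted function, applying it to the expression, so something like this works (non-whitespacep is a hypothetical predicate, just for illustration):

(iter (for c next (peek-char nil stream eof-error-p))
  (finding c such-that #'non-whitespacep) ; NON-WHITESPACEP is made up
  (read-char stream eof-error-p))

But a computed function like (complement #'whitespacep) is treated as an ordinary expression and evaluated for its truth value - and since a function object is always non-nil, the loop would return the first character unconditionally.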

That functional bit is a reminder that I'm going about this the wrong way. None of these loops approaches the clarity of the functional way. If the sequence functions worked on streams, reading characters as needed, we could just do something like:

(find-if (complement #'whitespacep) stream)

and unread the character afterward, if we got one. Or, if peeking returns a derived stream which delays removing each character until the next one is needed:

(find-if (complement #'whitespacep) (peeking stream))

These functional conveniences are less general than loop or iterate, but that's okay. Their lack of generality makes the common loops much simpler. Too bad they don't exist.

Common Lisp skips whitespace, and skips whitespace

It's well known to Lispers that Common Lisp's library is remarkably complete in some places, and remarkably incomplete in others. I ran into both cases yesterday. I needed to see what was behind some whitespace, so I did the obvious thing:

(defun skip-whitespace (stream &optional eof-error-p)
  "Skip leading whitespace on STREAM, and return the first
non-whitespace character without removing it."
  (let ((c (peek-char stream eof-error-p)))
    (cond ((whitespacep c) (read-char stream eof-error-p)
                           (skip-whitespace stream eof-error-p))
          (t c))))

There are two problems with this code. The first is that whitespacep is missing. Did I get the name wrong? No, it simply doesn't exist. Common Lisp has graphic-char-p, alpha-char-p, digit-char-p, alphanumericp, standard-char-p, upper-case-p, lower-case-p, and even both-case-p (which is not quite what it sounds like), but nothing for whitespace. And you can't even easily assemble it out of the available predicates.

Oh well, it's easy to fake.

(defun whitespacep (c)
  (member c '(#\Space #\Tab #\Return #\Newline)))

The second problem was a mysterious type error: peek-char didn't like *standard-input*; it wanted a character or a boolean instead. Huh?

If I had bothered to read the Hyperspec (or even look closely at the argument list Slime displayed), I would have known that peek-char's first argument isn't the stream, it's the whitespace setting. The function I wanted is already built into the language: (peek-char t stream) skips whitespace and returns the first non-whitespace character. Obviously I'm not the first person to want this. peek-char may not be the most obvious place to put this feature, but this is one of those places where Common Lisp is almost absurdly complete. It may not be able to identify whitespace characters, but it knows what to do with them.
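So the corrected function is a one-liner:

(defun skip-whitespace (stream &optional eof-error-p)
  "Skip leading whitespace on STREAM, and return the first
non-whitespace character without removing it."
  ;; PEEK-CHAR's idea of whitespace comes from the current readtable,
  ;; which for the standard readtable covers the usual suspects.
  (peek-char t stream eof-error-p))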

Edit: fixed an uninteresting third problem: I forgot to pass eof-error-p to peek-char.

No value

Slime (the Emacs mode for Common Lisp) is nothing if not helpful. It autoindents, it completes symbols, it shows backtraces, it fontifies output. It even helpfully points out when an expression returned no values:

CL-USER> (values)
; No value

This is missing the point. Almost all CL functions can find something useful to return — even those, like rplacd or write, that are called purely for their side effects usually return one of their arguments, just to make them easier to insert into existing code. Functions producing zero values usually do so not because they have nothing to return, but because they're intended to be used at the REPL, and they don't want to clutter up the output with an uninteresting value. But Slime's helpful comment defeats their efforts. Observe:

CL-USER> (describe 'values)
VALUES is an external symbol in #<PACKAGE "COMMON-LISP">.
Function: #<FUNCTION VALUES>
Its associated name (as in FUNCTION-LAMBDA-EXPRESSION) is VALUES.
The function's arguments are:  (&REST VALUES)
Function documentation:
  Return all arguments, in order, as values.
On Fri, Mar 24, 2006 07:42:39 AM EST it was compiled from:
SYS:SRC;CODE;EVAL.LISP
  Created: Friday, April 15, 2005 09:57:55 AM EDT

; No value

How is this an improvement on a meaningless NIL?

(Hey, look, a logical pathname! I didn't know anyone used those.)

how-much-space-do-those-hyphens-take?

In a rant about naming conventions, Yossi Kreinin praises the way Lisp does multi-word names:

I think that the best naming convention out there is the Lisp one: case-insensitive-dash-separated. It just doesn’t get any better:

  • You never have to hit Shift or Caps Lock in the middle of a name, which makes typing easier. This is especially important for names because you use auto-completion with names. Auto-completion requires you to press things like Ctrl-Space or Alt-Slash or Ctrl-P. Together with another Shift needed for the name to be completed correctly, auto-completion is much more likely to cause repetitive strain injury.
  • You never have to think about the case. Figuring out the case of a letter in a case-sensitive naming convention can be non-trivial; more on that later.
  • The dash-as-a-separator-convention is used in English, so your names look natural.

I'm not so fond of this convention. Well, it is nice to read; I can't think of anything more natural. And case-irrelevance is nice. (If your favourite lisp stopped case-folding one day and became case-sensitive in practice as well as theory, would you even notice?) There's just one problem: those hyphens take space.

How much?

(let ((syms 0) (chars 0) (hyphens 0))
  (do-symbols (s (find-package :cl))
    (incf syms)
    (incf chars (length (symbol-name s)))
    (incf hyphens (count #\- (symbol-name s))))
  (format nil "~S symbols, ~S chars (avg. ~2,2f)~%~
             ~S hyphens, ~2,2f% hyphens (avg. ~2,2f)~%"
    syms chars (/ chars syms 1.0) hyphens
    (/ hyphens chars 0.01) (/ hyphens syms 1.0)))

"978 symbols, 11271 chars (avg. 11.52)
875 hyphens, 7.76% hyphens (avg. .89)"

Hmm. Not as bad as I thought. 7.7% sounds like a lot, but most of those hyphens are in rarely-used names; the common ones are short. The big expression above has only four hyphens (plus the #\-), so it's not a huge problem. On the other hand, standard library names (even in CL) tend to be short; user code is often full of rather long names. Let's see...

(defun slurp (filename)
  "Read the contents of a file as a string."
  (with-open-file (s filename)
    (let* ((len (file-length s))
           (str (make-string len)))
      (do ((i 0 (1+ i)))
          ((>= i len) str)
        (setf (schar str i) (read-char s t)))))
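(An aside: read-sequence would make slurp shorter - something like this, where the subseq trims the slack when file-length overestimates the character count:

(defun slurp (filename)
  "Read the contents of a file as a string."
  (with-open-file (s filename)
    (let ((str (make-string (file-length s))))
      ;; READ-SEQUENCE returns the index of the first element it didn't fill.
      (subseq str 0 (read-sequence str s)))))

The longhand do version works everywhere, though, and shows the hyphens in their natural habitat.)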

(defun count-hyphens (filename)
  (let ((s (slurp filename)))
    (format t "~2,2f% of ~S chars~%"
            (/ (count #\- s) (length s) 0.01) (length s))))

(count-hyphens "asdf-install/installer.lisp")
3.13% of 12210 chars
(count-hyphens "edit-modes/python.el")
3.13% of 90348 chars   ;coincidence
(count-hyphens "keywiz.el")
2.88% of 10230 chars
(count-hyphens "tetris.el")
4.54% of 22458 chars
(count-hyphens "edit-modes/slime/swank.lisp")
2.86% of 191438 chars

There's a lot of variation (partly because some files are diluted with lots of comments, which I didn't try to strip out), but evidently hyphens are still significant. Tetris.el has a lot because it has a lot of names with short words, like tetris-top-left-x. Elisp in general suffers (in total length as well as hyphens) because it has no module system, so every name needs a module prefix. (If you want to make a language's code shorter, this is the first thing to worry about: make sure users don't have to defend against name collisions.) But even in CL, hyphens take around 3% of the code. That's about half as much as parentheses.

Maybe this isn't directly a problem. People probably don't look at the hyphens even as much as they look at parentheses, so what does it matter how many there are? But they still lengthen lines, making the eyes move farther, and increasing the chance of linebreaks. When one character accounts for 3% of your code, it's hard to argue that it doesn't matter. But as with parentheses, it's even harder to do anything about it, since all the alternatives are harder to read. So we use hyphens. But I still don't feel altogether good about them.

Update: It seems some people think I'm bashing Lisp on the grounds that hyphenated names make its code 3% longer. Relax; of course this is not a big problem, and I like Lisp, or I wouldn't be worrying about its naming conventions. And I prefer hyphens to every alternative, because they're so much more readable. But this readability isn't free, and at 3% the cost (in screen space) is surprisingly high. I don't have a solution, but it's nice to be aware of the problem. I will take CamelCase a little more seriously now.

Clearly illicit frobnication

I hack C++ at work. In addition to the usual difficulties of writing in a low-level language with a buggy compiler, this means I have to write in a style other C++ speakers will understand. Occasionally I forget, and write as I would in a functional language. The other day I wrote some code like this:

if (constraint == (frobnicated ? MUST_NOT_FROBNICATE : MUST_FROBNICATE))
  die("illicit frobnication");

Someone found it hard to read, so I changed it to make the logic very explicit:

if (frobnicated && constraint == MUST_NOT_FROBNICATE
    || !frobnicated && constraint == MUST_FROBNICATE)
  die("illicit frobnication");

On reflection I agreed it was easier to read, even though it was longer. Why?

Maybe it's because the ternary operator is so awkward in C++. It's often hard to parse, so it's not used much, so readers of C++ don't expect to see it, especially not as an argument to ==, and especially not in the control expression of another conditional. And since they don't expect to see it, it's harder to read. Is this easier in a different language?

(when (eq constraint (if frobnicated 'must-not-frobnicate 'must-frobnicate))
  (error "illicit frobnication"))

In my opinion that's a small improvement in syntax, but none in semantics. It's still confusing. So my initial impression was wrong. This isn't about C++ after all; it's the logic that's confusing.

What if the test is something other than equality? ISTM this variation is a little less confusing, despite being more general:

(when (member (if frobnicated 'must-not-frobnicate 'must-frobnicate)
              constraints)
  (error "illicit frobnication"))

I think the real problem is the pseudo-Boolean nature of the constraints. It's hard to read MUST_NOT_FROBNICATE in a Boolean expression without thinking the NOT is the negation operator and it's involved in the expression somehow. (Of course it is, but not directly enough that it helps to think of it.) That's only a minor problem, but when it's combined with a conditional expression and a comparison of Booleans, the total confusion is enough to slow the reader down, or to overwhelm the less logic-minded.

We can get rid of the whole thing easily enough. What if the constraint were a function instead of an enumeration?

(unless (funcall allow-frobnication frobnicated)
  (error "illicit frobnication"))

That's much simpler, and IMO easier to read too. Another victory for functional style! Unfortunately it moves some complexity (especially in C++) to the three allow-frobnication functions. And to many C++ speakers it's odd enough to be much more confusing than the original version. In this case expressiveness is limited by what the readers understand, not by what the language does.
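For concreteness, here's a sketch of what those three functions might look like in Lisp - the names and the protocol are my invention:

;; One predicate per former enum value; each says whether the current
;; frobnication state is permitted. (Hypothetical names.)
(defun must-frobnicate (frobnicated) frobnicated)
(defun must-not-frobnicate (frobnicated) (not frobnicated))
(defun may-frobnicate (frobnicated)
  (declare (ignore frobnicated))
  t)

Then allow-frobnication is simply a variable bound to one of the three.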

Pretty fonts

I would just like to point out that Book Antiqua is a very pretty font. The sort of font you could set books in, even. Or blogs.

So is Gill Sans.

Infix lambda

I was playing with syntax today, specifically terse lambdas. After a brief dalliance with Smalltalk-like blocks with a Haskell-like -> separator, I dumped the square brackets and was left with an infix operator for lambda:

successor = (a → a + 1)
fib = memoize (n → if (n < 2) n (fib (n - 1) + fib (n - 2)))
apropos term = filter id
                 (map (f → if (find-substring term (docstring f))
                                  (name f))
                   *module*)

I like it because it's terse without requiring any special parentheses, and it minimizes the difference between an expression and a function abstracting that expression. Like most infix operators, it gets harder to read as the expression gets bigger, but for small expressions it's great - it approaches the brevity of combinators without requiring gymnastics to convert an existing expression to a function. It's also a perfect example of the value of infix: the operator itself serves to delimit the argument list, saving a pair of parentheses.

I vaguely remember having heard of an infix lambda operator before, but casual googling does not reveal it - the results for both "infix lambda" and "lambda infix" are dominated by accidental juxtapositions. Do any of my fearsomely knowledgeable readers know whether and where this has been done before?

I haven't bothered to implement it - I know what it does; I'm trying to figure out what it's like to read. So far I've only found one significant problem: in a language which also has infix define, I'm tempted to read the first argument as the name of the function. Which is just fine, but I tend to confuse the λ and named-λ versions.

Consider (→ body). It's obviously a nice short way to write a thunk: (λ () body). But (f g → body) could be either (λ (f g) body) or (named-λ f (g) body). Neither λ nor named-λ is more right than the other, but there's no need to choose just one - they are both frequently useful, so it's fine to have both. How about → for λ and ⇒ for named-λ, since the double arrow resembles =?

There's also the define/defun question: for (named-λ f (a b) body), do you write (f a b ⇒ body) or (f (a b) ⇒ body)? In prefix syntax, I prefer a separate λ-list: I find (defun f (x) ...) easier to read than (define (f x) ...), because the function name is not mixed in with the semantically different argument names. But in infix I would rather dispose of the parentheses and write the Haskell-like f x = .... This pattern holds for non-defining forms too: I like named-λ to separate the name from the λ-list, but this just looks silly in infix.

The define style also has the advantage that it generalizes nicely to recursive bindings of non-functions:

endless x = (l ⇒ (lcons x l))

Now, does that look familiar? It's the infix version of SRFI-31's rec!
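(In prefix notation that would be something like (define (endless x) (rec l (lcons x l))), the body being exactly SRFI-31's recursive-value form.)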

Update 7 Sep: It turns out C# 3.0 has infix lambda, and calls it =>.

Arc gets I/O right

In any significant program, a great deal of code is spent on two activities that have very little intrinsic interest: I/O and error handling. This makes them good targets for language design. If a language can make these two areas easier, it can remove a significant part of the accidental complexity of programs. Conversely, a language in which either is difficult is a language in which effort is wasted on uninteresting problems.

This is where Arc is good. Many parts of it are hacky or just not there yet, but its I/O library is strikingly convenient. There's a lot there worth seeing and maybe stealing. In particular:

  • The whole-file I/O routines, starting with readfile, are a great idea - usually what you want in a file is to read the whole thing. There should be one (file?) for reading the whole file as a string too. (The point here is not that they read the whole file in one piece, but that they handle the opening and closing - all they need is the filename.) I sketch CL equivalents of readfile and tostring just after this list.
  • infile and outfile are nice and simple. Opening files for reading and writing are quite different operations, and there's no reason to do both with the same function. Getting rid of the mode argument also makes the functions easier to compose. This can go further: the remaining binary, text and append arguments should be separate functions too.
  • Using tostring is much simpler than having an optional stream argument to every output routine. For instance, (tostring (prall thingies "We have ") (when (no done?) (pr " and more coming!"))). Dynamic variables are a perfectly reasonable way to pass arguments, especially those like streams that tend to be used in dynamic scope - so why don't I feel good about this? Are there any cases it handles badly?
  • Printing several values separated by spaces is a common operation (especially in throwaway programs) but it needs a good name. I like prs better than what I was using (prv, for "print values") although it might conflict with other functions like "print to stream".
  • prall is another function I often want, although it feels unwieldy to me, possibly because the second argument is printed before the first. The prefix argument also seems out of place.
  • prf is (format t ...), plus it does string interpolation in case you prefer that. Where's the non-printing format (well, probably fmt in Arc)? This is definitely common enough to deserve a function, not just (tostring (prf ...)).
  • There are some other perlesque string operations - tokens and such. You can never have enough of these (even in Perl).
  • I'm not sure if w/bars is useful (for what, debugging?) but it's an interesting kind of output wrapper - it captures write operations, not characters. How about a similar wrapper that indents every line?
  • It's not an I/O routine, but ellipsize handles a common problem (truncating long strings and adding an ellipsis) that would often go unhandled without it.
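As promised, rough CL equivalents of readfile and tostring - my guesses at the behavior, not Arc's actual code:

(defun readfile (filename)
  "Read every expression in FILENAME and return them as a list."
  (with-open-file (s filename)
    (loop for form = (read s nil s) ; the stream itself marks end-of-file
      until (eq form s)
      collect form)))

(defmacro tostring (&body body)
  "Evaluate BODY with *STANDARD-OUTPUT* redirected to a string stream;
return everything it printed."
  `(with-output-to-string (*standard-output*)
     ,@body))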

I/O is really a subset of a more general category, glue, which probably accounts for more complexity than everything else put together. And it's all accidental complexity, so it should be a target for libraries. Unfortunately I don't think there are any general ways to reduce it, because there are so many different transformations that have to be done. But simplifying I/O addresses the most common kind of glue. Remarkably, considering its age and importance, this area is not solved. There are still considerable improvements to be found, and Arc is finding some of them.

I wish I could find similar improvements in error handling.

A peculiar deficiency

Internet Explorer 7 doesn't support HTML's q element. This is rather annoying, as I've been using it heavily for marking up quotes - on browsers that do support it, it produces nice language-sensitive quote marks. I could understand this if it were an ancient proto-IE, but why does the latest version not support an element that's been in the standard since 1999?

I'll have to go edit old posts and replace <q>...</q> with "...". Of course, Blogger's search helpfully ignores HTML tags, so I can't find them that way...

The other advantage of Wikipedia

Wikipedia is faster than Google, for the common class of Web searches where you already know the name of the thing you're looking for: "tell me about Ipomoea". When you look something up by googling it, Google gives you a list of results (often including the Wikipedia page), but then you have to click on a link to get to the actual page. When you type in the URL of a Wikipedia page, it takes you directly to the result - and most of the time it's a good one.

Google's I'm Feeling Lucky feature is just as fast, but less specific. It takes you to the most popular or important page on the search term, which is often not the most informative - it might be someone selling it, or a story about it, or something named after it. None of these are as much help as even a mediocre Wikipedia page.

Of course Wikipedia URLs are a bit long to type. Fortunately Mozilla (including Firefox) has a feature to make it easier: you can teach it commands that expand to URLs, so "w Proteobacteria" takes you to http://en.wikipedia.org/wiki/Proteobacteria, and "g carnot cycle" asks Google, which refers you to Wikipedia, of course. This feature is intended for searches, but it also works for many other purposes. Until I wrote this paragraph, I thought Firefox didn't have this feature; I've been exploiting URL autocompletion instead. That's awkward enough that I wonder if it's any faster than using Google; I may have just thought so because it's more interesting. Blogging can be educational.
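(The mechanism, in case you haven't met it: give a bookmark a keyword, and put %s in its URL where the term should go. A bookmark for http://en.wikipedia.org/wiki/%s with the keyword w is all "w Proteobacteria" needs.)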

Curiously, when you don't know the name of what you're looking for, Wikipedia's search is not much help. In my experience Google, even without the help of a site:en.wikipedia.org restriction, is more likely to find the right Wikipedia page than Wikipedia's own search is. It's that bad.