Arcane Sentiment: March 2009

A lot of simple text-processing problems involve operating on the words of the input. Norvig's spellchecker, for instance, begins by extracting the words:

def words(text): return re.findall('[a-z]+', text.lower())

So do some of the old Shootout problems (though none of the currently active ones, and they don't keep old ones around, so if you want to see an example you'll have to use the WayBackMachine). This sets my library-senses tingling. If so many programs need to split text into words, shouldn't words be in the standard library?

There are two problems here. The lesser is that the definition of "word" varies a bit - sometimes it's defined by alphanumeric characters, sometimes by whitespace. This isn't a showstopper, because some reasonable definition will probably work for a large fraction of the potential uses. The greater problem is that splitting things into words is much less common in real programs than in toy examples. Words are like Fibonacci numbers: they're familiar and easy to ask for, so they occur unusually often in made-up requirements. Neither is common in real programs - probably not enough so to warrant including them in a language's library.

This is one of the pitfalls of testing a language against toy problems. Unlike real problems, they're strongly biased toward simplicity of requirements, which affects what they demand from a language. Real programs tend to spend a lot of code on I/O and error handling, neither of which turns up much in toy problems. Is it a coincidence that these two areas are awkward to express in most languages?

Dynamically typed languages represent almost everything as tagged pointers, which sounds very inefficient: doesn't that waste a lot of space? And doesn't 64-bit addressing make this waste much, much worse?

I used to take this seriously. But one day a few years ago I was playing with Squeak, and wondering why its images (savable worlds) were so large - 10 MB, when the original Smalltalk ran in 64 KB. I stumbled across allObjectsDo:, which iterates across all objects in the heap, so I wrote a heap profiler to find out where the space was going. Once I finished it, I discovered Squeak came with a much better one, so I used that instead. Here are all the classes that take more than 1% of the space:

Class	Purpose	# Instances	Space	%
`CompiledMethod`	bytecode	33951	2.0 MB	20%
`Bitmap`	graphics	1208	1.8 MB	17%
`String`	text	28743	1.7 MB	16%
`Array`	various	30263	1.3 MB	12%
`ByteArray`	various	379	800 KB	8%
`Symbol`	selectors mostly	27786	550 KB	5%
`MethodDictionary`	method dispatch	3092	320 KB	3%
`WeakArray`	symbol table mostly	15	230 KB	2.2%
`Point`	GUI	17193	200 KB	1.9%
`MorphExtension`	GUI	3344	170 KB	1.6%
`Color`	GUI	8323	163 KB	1.6%
`Association`	hashtables	10761	126 KB	1.2%
`Float`	boxed float	9732	114 KB	1.1%
All others		1.1 MB	10%
Classes without pointers		>6.6 MB	>65%
Total		10.2 MB	100%

This shows why tagged pointers are not a big space problem: there just aren't that many of them. Five of the top six classes, accounting for two thirds of the space, are binary data. (The exception is Array.) Dynamically typed languages, it turns out, don't encode most data as tagged pointers.

Well, of course. What do you do with a large data structure? You find a more compact representation. Often that means replacing low-entropy pointers with high-entropy bytes. That's what happened to all of those five non-pointer classes. Dynamic typing makes no difference: as long as you have some sort of byte arrays (in Smalltalk, Class»variableByteSubclass:), you can do this in dynamically typed languages the same way you do it in statically typed ones. If tagged pointers are wasteful, the waste is in boxing small objects like floats, not in the presence of pointers.

Of course this is only one image, and an odd one at that - instead of application-specific data, it's dominated by library code. The reason Squeak is so much bigger than the old Smalltalks is that it has a much bigger standard library - and no autoloading, so it's all in the image all the time. Code accounts for at least 30% of the space: CompiledMethod, Symbol, MethodDictionary, and WeakArray (most of which is the symbol table). This 3 MB would be insignificant in a bigger image, but otherwise I expect larger applications to have space profiles much like Squeak's. Most classes are made of pointers, but most of the space goes on the few that aren't.

Arcane Sentiment

How toy problems are different

Where have all the pointers gone?