Where have all the pointers gone?

Dynamically typed languages represent almost everything as tagged pointers, which sounds very inefficient: doesn't that waste a lot of space? And doesn't 64-bit addressing make this waste much, much worse?

I used to take this seriously. But one day a few years ago I was playing with Squeak, and wondering why its images (savable worlds) were so large - 10 MB, when the original Smalltalk ran in 64 KB. I stumbled across allObjectsDo:, which iterates across all objects in the heap, so I wrote a heap profiler to find out where the space was going. Once I finished it, I discovered Squeak came with a much better one, so I used that instead. Here are all the classes that take more than 1% of the space:

ClassPurpose# InstancesSpace%
CompiledMethodbytecode339512.0 MB20%
Bitmapgraphics12081.8 MB17%
Stringtext287431.7 MB16%
Arrayvarious302631.3 MB12%
ByteArrayvarious379800 KB8%
Symbolselectors mostly27786550 KB5%
MethodDictionarymethod dispatch3092320 KB3%
WeakArraysymbol table mostly15230 KB2.2%
PointGUI17193200 KB1.9%
MorphExtensionGUI3344170 KB1.6%
ColorGUI8323163 KB1.6%
Associationhashtables10761126 KB1.2%
Floatboxed float9732114 KB1.1%
All others1.1 MB10%
Classes without pointers>6.6 MB>65%
Total10.2 MB100%

This shows why tagged pointers are not a big space problem: there just aren't that many of them. Five of the top six classes, accounting for two thirds of the space, are binary data. (The exception is Array.) Dynamically typed languages, it turns out, don't encode most data as tagged pointers.

Well, of course. What do you do with a large data structure? You find a more compact representation. Often that means replacing low-entropy pointers with high-entropy bytes. That's what happened to all of those five non-pointer classes. Dynamic typing makes no difference: as long as you have some sort of byte arrays (in Smalltalk, Class»variableByteSubclass:), you can do this in dynamically typed languages the same way you do it in statically typed ones. If tagged pointers are wasteful, the waste is in boxing small objects like floats, not in the presence of pointers.

Of course this is only one image, and an odd one at that - instead of application-specific data, it's dominated by library code. The reason Squeak is so much bigger than the old Smalltalks is that it has a much bigger standard library - and no autoloading, so it's all in the image all the time. Code accounts for at least 30% of the space: CompiledMethod, Symbol, MethodDictionary, and WeakArray (most of which is the symbol table). This 3 MB would be insignificant in a bigger image, but otherwise I expect larger applications to have space profiles much like Squeak's. Most classes are made of pointers, but most of the space goes on the few that aren't.

No comments:

Post a Comment

It's OK to comment on old posts.