Text statistics as summaries

Darius Bacon mechanically summarizes The Lord of the Rings by finding the most anomalously frequent words in each chapter. The word lists are surprisingly evocative. It's obvious which chapter this is, for instance:

Gorbag Shelob Lugbúrz Shagrat Ladyship Ya hoi belly

Or this:

Parth Galen Rauros glade cloven boat fate bearer

Even those word lists that don't include important names or events are still recognizable. I would not expect to recognize this, but I do:

quality Númenóreans Damrod mistress squirrel wiser arts Mablung

Word frequency is sensitive to repetitive text, such as the songs Tolkien is so fond of, and that can be a problem. This chapter, for instance, is less obvious:

fiddle inn cow local Mugwort diddle tray comer

Or this one, which is dominated by characters who aren't even in the story:

Tinúviel Beren Thingol Bob Lúthien hemlock midges Gil

But most stories don't include many songs or poems. (Whether this is a bad thing depends on how good you expect the marginal poem to be.)

This could be a great way to generate tables of contents. It's much more flexible than hand-written chapter titles, since it works on arbitrary chunks of text. Instead of summarizing only the chunks someone thought to write titles for, you could summarize at arbitrary levels of coarseness: hundred-page chunks for a high-level TOC, and smaller ones as you zoom in on a specific section. It could also be helpful for navigating large files: when you're looking for a certain part of a file but don't have a good search term, it might be faster to read through an automatic summary than to manually skim through the text.

Darius' post includes code, so you can see how it works, and try it yourself.
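
I don't know exactly how he scores the words, but here's a minimal sketch of one plausible scoring (mine, not his): compare each word's frequency within a chapter to its frequency in the whole book, with a little smoothing so one-off rare words don't swamp the list.

    from collections import Counter
    import re

    def words(text):
        return re.findall(r"[A-Za-z']+", text)

    def anomalous_words(chapters, top=8, smoothing=5):
        """For each chapter, the words most over-represented relative to the book."""
        per_chapter = [Counter(words(c)) for c in chapters]
        book = Counter()
        for c in per_chapter:
            book.update(c)
        total = sum(book.values())
        results = []
        for chapter in per_chapter:
            n = sum(chapter.values())
            def score(w):
                expected = n * book[w] / total   # count if the word were spread evenly
                return chapter[w] / (expected + smoothing)
            results.append(sorted(chapter, key=score, reverse=True)[:top])
        return results

Split a plain-text copy of the book on its chapter headings, pass the pieces in, and you get lists in the same spirit as the ones above, though the exact words will depend on the scoring.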

Strings and lines

I was recently surprised by the behavior of the . regex operator in Perl: it matches any character except newline. This seems like the sort of thing I should have already known. It's right there in the documentation. But if I ever knew it, I've forgotten, which suggests I've never tried to match a regex against a multiline string.
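
Python's re module has the same default, so here's the behavior in miniature (in Perl the /s modifier is what makes . match newlines; in Python it's re.DOTALL):

    import re

    s = "first line\nsecond line"

    # By default . matches anything *except* newline, so .* can't cross
    # the line break and the whole-string match fails:
    print(re.fullmatch(r"first.*line", s))             # None

    # With DOTALL (the analogue of Perl's /s), it succeeds:
    print(re.fullmatch(r"first.*line", s, re.DOTALL))  # <re.Match object; ...>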

Well, of course. How often do you do anything to strings containing newlines, except for printing them out intact?

For that matter, how often do you have such strings at all? Most strings, after all, are either 1) lines of files, which by definition don't contain newlines, or 2) names of some sort, which usually can't contain control characters of any kind. (Yes, filenames can theoretically contain control characters. But have you ever seen one, except for the time you created one to see if it worked?) Even strings that could in principle contain newlines, such as error messages, usually don't, because the possibility doesn't occur to programmers. Control characters aren't real characters, are they?

And, of course, many languages don't allow newlines in string literals. In those that do, they're rarely used.

It seems to me that I (and most programmers, probably) overgeneralize from this: I think of newlines, like nulls and other control characters, as not generally being valid in strings. Consciously, I know strings can contain arbitrary characters, but unconsciously, I think a string must be one line.

The inevitable interpreter

This must be the cutest code injection method ever. Hovav Shacham's paper The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) describes yet another method of injecting code into a process by clobbering its stack. The cute part isn't the title (although it did tempt me to read the paper). The cute part is that this method uses a Turing-complete language for which an interpreter already exists by accident.

It's a generalization of a well-known method called return-into-libc: if you can overwrite a function's return address, you can turn its return into a call to some existing function, such as one from a common library like libc. If you can arrange for the return address of that call to be another library function, you can call a series of them: it's threaded code, expressed through return addresses. This lets you run code of your choosing without injecting any machine code of your own, so it works even if you don't have write access to any executable memory.
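
The control flow is easier to see in a toy model than in a real stack dump. Here's one in Python (my own illustration, with stand-in routine names, not anything from an actual exploit): the overwritten stack is just a list of "return addresses", and each return pops the next one and transfers control there.

    # Toy model of return-into-libc: the attacker writes no code, only a
    # sequence of addresses; the "interpreter" is the return mechanism itself.
    library = {
        "setuid": lambda env: env.update(uid=0),
        "system": lambda env: print("pretending to run:", env["cmd"]),
        "exit":   lambda env: env.update(done=True),
    }

    def execute(stack, env):
        # Pop the next address, transfer control there, repeat.
        while stack and not env.get("done"):
            address = stack.pop(0)
            library[address](env)

    # The injected "program" is nothing but a list of return addresses:
    execute(["setuid", "system", "exit"], {"cmd": "/bin/sh"})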

But it's limited in its expressiveness, because it only calls the functions provided by the target library, and those usually don't form a complete basis for computation. The available libraries don't generally include functions for branching, looping, moving data around, and the other trivial operations that enable computation. (Languages whose standard libraries contain lots of higher-order functions may be more vulnerable in this respect, but since they're not so widely used and most of them are memory-safe, they aren't very attractive attack targets.) Return-into-libc exploits an accidental threaded-code interpreter, but not a very flexible one: it can call arbitrary functions, but it can't compute much.

This paper removes this limitation by showing that the provided functions are not the only possible targets for return-into-libc. Any sequence of instructions ending in a ret can in principle be called as a function, and such sequences are common — they occur at every return site of every function! Furthermore, many such sequences perform useful operations such as arithmetic, conditionals, moves, and looping. These are the missing primitives needed to make return-into-libc Turing-complete. The necessary sequences occur quite frequently in real code, so any substantial x86 library contains a Turing-complete threaded-code interpreter.
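
You can get a feel for how common such sequences are with a crude scan: look for the one-byte ret opcode (0xc3) in a library's code and keep the few bytes just before each occurrence as a candidate gadget. (The paper's real algorithm is far more careful, keeping only sequences that decode into valid, useful instructions; this is just a sketch, and the file name is a placeholder.)

    RET = 0xC3      # the one-byte near-return opcode
    WINDOW = 8      # how many preceding bytes to keep as the candidate gadget

    def candidate_gadgets(code, window=WINDOW):
        # Scan every byte offset, not just instruction boundaries, so this
        # also finds ret bytes buried inside longer instructions and constants.
        for i, byte in enumerate(code):
            if byte == RET:
                yield code[max(0, i - window):i]

    # "libc.bin" stands in for the executable bytes of some real library,
    # extracted with objdump, pyelftools, or the like.
    with open("libc.bin", "rb") as f:
        code = f.read()
    print(sum(1 for _ in candidate_gadgets(code)), "ret-terminated byte windows")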

The paper also argues that it's difficult for any x86 library to avoid accidentally providing such an interpreter. The architecture goes out of its way to provide convenient instructions for common operations, and to give the common instructions short representations. In particular, it provides function return, the crucial operation for this attack, as the convenient one-byte ret instruction. Many other useful instructions are also only one or two bytes, so useful sequences can be quite short. This means they occur not only at intended return sites but also, frame-shifted, inside other instructions and even in literal constants. So even if a compiler carefully arranged for all return sites to be useless for return-into-libc purposes, the necessary sequences would still appear elsewhere. On x86, it's hard to avoid making the return-into-libc language Turing-complete.
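
Here's a concrete example of that, hand-assembled from the standard opcode tables (so treat the exact bytes as illustrative, not as something pulled from a real library):

    # Decoded from offset 0, these five bytes are one harmless instruction:
    #     b8 58 5b c3 90      mov eax, 0x90c35b58
    # Decoded from offset 1, they become a useful gadget:
    #     58                  pop eax
    #     5b                  pop ebx
    #     c3                  ret
    # The c3 the compiler never meant to emit is sitting inside the mov's
    # immediate constant.
    code = bytes.fromhex("b8585bc390")
    assert code[3] == 0xC3          # an unintended "return site"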

It's not hard to make a Turing-complete language. Many language implementors have done so by accident, when they took languages not intended as programming languages and added convenient features like functions without realizing the implications. Many absurdly simple mathematical systems are also Turing-complete, including ones that don't look like languages, such as Fractran and Rule 110 and Conway's Game of Life. But this is the first case I've heard of where a Turing-complete language exists in the wild, without anyone intending to implement anything like an interpreter at all, as an inevitable consequence of a machine's architecture.

(Via Z-Bo on LtU. This is old news, but I hadn't heard of it, so maybe you haven't either.)