Text statistics as summaries

Darius Bacon mechanically summarizes The Lord of the Rings by finding the most anomalously frequent words in each chapter. The word lists are surprisingly evocative. It's obvious which chapter this is, for instance:

Gorbag Shelob Lugbúrz Shagrat Ladyship Ya hoi belly

Or this:

Parth Galen Rauros glade cloven boat fate bearer

Even those word lists that don't include important names or events are still recognizable. I would not expect to recognize this, but I do:

quality Númenóreans Damrod mistress squirrel wiser arts Mablung
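For the curious, here is a minimal sketch of what "anomalously frequent" could mean. It is not Darius's code, and the scoring rule is my assumption: each word is scored by its count in the chapter times the log of how much more common it is in the chapter than in the whole book. The names `words`, `anomalous_words`, `top`, and `min_count` are made up for illustration.

```python
import math
import re
from collections import Counter


def words(text):
    # Lowercased runs of letters; accented letters such as "ú" are kept.
    return re.findall(r"[^\W\d_]+", text.lower())


def anomalous_words(chapter, book, top=8, min_count=2):
    """Return the words most over-represented in `chapter` relative to `book`.

    Assumed scoring (not taken from the original post):
    count_in_chapter * log(chapter_frequency / book_frequency),
    with add-one smoothing on the book side to avoid dividing by zero.
    """
    chapter_counts = Counter(words(chapter))
    book_counts = Counter(words(book))
    chapter_total = sum(chapter_counts.values())
    book_total = sum(book_counts.values())

    def score(word):
        in_chapter = chapter_counts[word] / chapter_total
        in_book = (book_counts[word] + 1) / (book_total + 1)
        return chapter_counts[word] * math.log(in_chapter / in_book)

    # Ignore words seen fewer than min_count times in the chapter.
    candidates = [w for w, n in chapter_counts.items() if n >= min_count]
    return sorted(candidates, key=score, reverse=True)[:top]
```

Multiplying by the count keeps a word that appears once from tying with a name that appears fifty times, and common words like "the" score low because they are about equally frequent everywhere.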

Word frequency is sensitive to repetitive text, such as the songs Tolkien is so fond of. This can be a problem. This chapter, for instance, is less obvious:

fiddle inn cow local Mugwort diddle tray comer

Or this one, which is dominated by characters who aren't even in the story:

Tinúviel Beren Thingol Bob Lúthien hemlock midges Gil

But most stories don't include many songs or poems. (Whether this is a bad thing depends on how good you expect the marginal poem to be.)

This could be a great way to generate tables of contents. It's much more flexible than hand-written chapter titles, since it works on arbitrary chunks of text. Instead of summarizing only the chunks someone thought to write titles for, you could summarize at arbitrary levels of coarseness: hundred-page chunks for a high-level TOC, and smaller ones as you zoom in on a specific section. It could also be helpful for navigating large files: when you're looking for a certain part of a file but don't have a good search term, it might be faster to read through an automatic summary than to skim the text by hand.
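A rough sketch of the zooming idea, under some assumptions of mine: chunks are measured in words rather than pages, each chunk is compared against the whole text, and words are scored by a smoothed count-times-log-ratio rule. The function `outline` and its parameters are hypothetical, not from any existing tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased runs of letters; good enough for a sketch.
    return re.findall(r"[^\W\d_]+", text.lower())


def outline(text, chunk_words, top=5):
    """Summarize `text` in fixed-size chunks of `chunk_words` words.

    Returns a list of (starting word offset, top words) pairs.
    A large `chunk_words` gives a coarse outline, a small one a fine outline.
    """
    tokens = tokenize(text)
    whole = Counter(tokens)
    entries = []
    for start in range(0, len(tokens), chunk_words):
        chunk = Counter(tokens[start:start + chunk_words])
        size = sum(chunk.values())

        def score(word):
            # Assumed scoring: how over-represented the word is in this
            # chunk relative to the whole text, weighted by its count.
            in_chunk = chunk[word] / size
            in_whole = (whole[word] + 1) / (len(tokens) + 1)
            return chunk[word] * math.log(in_chunk / in_whole)

        entries.append((start, sorted(chunk, key=score, reverse=True)[:top]))
    return entries
```

Calling `outline(text, 30000)` and then `outline(text, 2000)` on the same text would give a high-level view and a zoomed-in one; the chunk sizes here are arbitrary.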

Darius's post includes code, so you can see how it works and try it yourself.

1 comment:

  1. http://libots.sourceforge.net/

    The open text summarizer already does that. It is an open source tool for summarizing texts.
