The value of extensible format strings

Common Lisp's version of printf is the famously overpowered format. In addition to everything printf does, it has conditionals, iteration, recursion, case-folding, tabulation and justification, pretty-printing, English plurals and number names (cardinal and ordinal), and two kinds of Roman numerals. Surprisingly, in a language where almost everything is user-accessible, most of these features are not available separately, only through format. The complexity is such that there's a compiler to make format strings execute faster. Sadly (or not, depending on your perspective) it's not Turing-complete, but that's only because it has no way to store information.

Of course there's a way around that. None of format's features is quite as overpowered as ~/, which calls arbitrary functions: (format stream "~/package:function/" x) does (package:function stream x nil nil). (The two nils are flags saying whether the : and @ modifiers were given, which they weren't here. Yes, there are more features. Are you surprised? And yes, in practice the explicit package name is required, because an unqualified name is looked up in COMMON-LISP-USER.) User extensibility is a good principle, but this seems silly. Why would you ever want to call a function through format instead of doing so directly?

Well, the other day I had to generate some XML in C, and the obvious way was with printf:

fprintf(somefile, "<element attr=\"%s\" />", somestring);

But there might be XML-meaningful characters in the string, and I didn't want to mess around with fancy XML-generation libraries. So I had to either escape the string before printing...

char buffer[BIG_ENOUGH]; /* yeah, right */
escape_xml(buffer, BIG_ENOUGH, somestring);
fprintf(somefile, "<element attr=\"%s\" />", buffer);

...or give up on printf, which was what I ended up doing:

fprintf(somefile, "<element attr=\"");
print_xml_string(somefile, somestring);
fprintf(somefile, "\" />");
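
The post never shows print_xml_string itself. For reference, here is a minimal sketch of what such a helper might look like, assuming it only needs to replace the five characters XML predefines entities for and can write straight to the stream (so no BIG_ENOUGH buffer is needed). The body is an illustration, not the original code.

```c
#include <stdio.h>

/* Hypothetical helper: write s to out, replacing the characters that are
   special in XML text and attribute values with entity references. */
void print_xml_string(FILE *out, const char *s)
{
    for (; *s != '\0'; s++) {
        switch (*s) {
        case '&':  fputs("&amp;", out);  break;
        case '<':  fputs("&lt;", out);   break;
        case '>':  fputs("&gt;", out);   break;
        case '"':  fputs("&quot;", out); break;
        case '\'': fputs("&apos;", out); break;
        default:   fputc((unsigned char)*s, out); break;
        }
    }
}
```

Because it writes character by character, the output length never has to be guessed in advance, which is the main thing the buffer version gets wrong.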

Either option destroys the clarity of the printf. What I really wanted was a custom printf operation, like this:

fprintf(somefile, "<element attr=\"%/escape_xml/\" />", somestring);

That's exactly what format's ~/ command does:

(format somefile "<element attr=\"~/xml:escape/\" />" somestring)

I guess it's not so silly after all. It is verbose (imagine repeating ~/xml:escape/ for each of a dozen attributes), but that's fixable. If there were an interface for defining new format commands, then any frequently used ~/ function could be given a one-character form, and all would be well, except possibly for readability. (Although in this case all of the obvious characters x e & < s are already taken.) Lisp being Lisp, you generally can get at the implementation's way of defining format commands, e.g. sb!format::def-format-directive, but depending on implementation internals is not usually a good idea. Exposing this interface would make format more malleable, like the rest of the language, and would also make its long feature list easier to swallow.
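
To make the idea of user-defined one-character commands concrete, here is a rough sketch in C of a tiny printf-like function with a registration interface. Everything in it is hypothetical: the names define_directive, xfprintf, emit_plain and emit_xml are invented for illustration, every directive takes exactly one string argument, and none of printf's flags or widths are supported. It is not how format or any real printf implements directives.

```c
#include <stdarg.h>
#include <stdio.h>

/* A directive handler writes one string argument to the stream. */
typedef void (*directive_fn)(FILE *out, const char *arg);

/* One slot per possible directive character; NULL means "not defined". */
static directive_fn directives[256];

/* Hypothetical registration interface: bind character c to handler fn. */
void define_directive(char c, directive_fn fn)
{
    directives[(unsigned char)c] = fn;
}

/* Handler that prints the string unchanged. */
void emit_plain(FILE *out, const char *arg)
{
    fputs(arg, out);
}

/* Handler that escapes the characters that matter inside a
   double-quoted XML attribute value. */
void emit_xml(FILE *out, const char *arg)
{
    for (; *arg != '\0'; arg++) {
        switch (*arg) {
        case '&': fputs("&amp;", out);  break;
        case '<': fputs("&lt;", out);   break;
        case '>': fputs("&gt;", out);   break;
        case '"': fputs("&quot;", out); break;
        default:  fputc((unsigned char)*arg, out); break;
        }
    }
}

/* Minimal printf-like loop. "%%" prints a percent sign; "%c" with a
   registered c calls its handler on the next const char * argument.
   Unregistered directives are copied through and consume no argument. */
void xfprintf(FILE *out, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    for (; *fmt != '\0'; fmt++) {
        if (*fmt != '%') {
            fputc((unsigned char)*fmt, out);
            continue;
        }
        fmt++;                      /* move to the directive character */
        if (*fmt == '\0') {         /* lone '%' at the end of the string */
            fputc('%', out);
            break;
        }
        if (*fmt == '%') {
            fputc('%', out);
            continue;
        }
        directive_fn fn = directives[(unsigned char)*fmt];
        if (fn != NULL) {
            fn(out, va_arg(ap, const char *));
        } else {
            fputc('%', out);
            fputc((unsigned char)*fmt, out);
        }
    }
    va_end(ap);
}
```

After define_directive('x', emit_xml), a call such as xfprintf(somefile, "<element attr=\"%x\" />", somestring) reads much like the wished-for printf. The cost is the one noted above: the reader has to know what %x means in this program.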

For new languages, though, I think I prefer string interpolation, which avoids the issue entirely:

(put somefile "<element attr=\"$(xml:escape somestring)\" />")

It would also be nice to have a choice of string delimiters, so the quotation marks don't require escaping. But that's a different, less interesting issue.

5 comments:

  1. No, no, no, you're doing it all wrong.

    What you're doing is treating structured output as text inside your program. Programming languages have much better mechanisms for working with structured output than raw text interpolation, the assembly-language of output.

    Have a look at how e.g. XmlTextWriter in .NET works. Writing out elements, attributes, text content, CDATA, entities etc. can and should be handled in a structured way for correct and, more importantly, type-safe production of output.

  2. "No, no, no, you're doing it all wrong."

    I wondered if someone would say that. :) Yeah, an XML generator would take care of this, but it's overkill for such a small problem. XML is (by design) easy to generate as text, so an XML generator would have to be exceptionally easy to use (unlikely in C, if there even were one handy) to be worthwhile. On the other hand, this was in an application that does generate XML in several other places, which would be more robust if they used a generator.

    I wouldn't try to parse XML by hand, of course. (I have seen code that did, and it broke when the schema changed.)

  3. Yeah, genx would have worked fine - I forgot that while C input interfaces tend to be ugly, it has no particular difficulty with output. In this case it wouldn't have been worth the trouble of an extra library dependency, but maybe I'll use it next time I have to generate some more complicated XML.

  4. Of course, in more reasonable languages than C you can use output combinators, like Alex Shinn's fmt library. There are two implementations under the covers: a bunch of macros that serve as a compiler, and a run-time interpreter of the combinators as S-expressions.

