Arcane Sentiment: April 2012

What's a regular expression?

The “official” definition is the literal one: an expression denoting a regular grammar. In this sense, a Parsec expression like many1 digit is a regex (equivalent to [0-9]+). But this is quite at odds with actual usage. No one calls Parsec expressions regexes when they happen to describe regular grammars! They're not regexes because the language isn't a regex language.

So is a regex an expression in a language that can only express regular grammars? No: Perl's regexes can call arbitrary Perl code, and thus can express non-regular grammars. But no one hesitates to call them regexes — they're the canonical implementation!

Like many Lisp-based regex libraries, CL-PPCRE offers a sexpr representation, which looks like (:greedy-repetition 1 nil :digit). It's rarely used because it's verbose, but if I saw one in the wild, I probably wouldn't call it a regex, because it doesn't have the distinctive syntax. CL-PPCRE's variant recursive-regex can express arbitrary recursive grammars, like Perl, but it still has “regex” in its name — not only as a joke, but because even when they're recursive, its expressions are still in regex syntax.

In common usage, regular expressions are defined by their syntax, not by the grammars they express.

Like any language, the regex language shares its name with its extensions: any very terse parsing language with the standard metacharacters is called a regex, regardless of what other features it has. Perl s/.../.../ expressions, for instance, are sometimes referred to as regexes, which is wrong both theoretically and in Perl terms, but right in terms of common usage. They contain regex-like syntax, so they're regexes, and the distinction between matching and substitution is not salient enough to earn them a different name. Even glob-expressions are sometimes called regexes — usually as a result of confusion, but it's an understandable confusion, because they share a metacharacter-rich syntax, and some of the metacharacters are even the same.

This focus on syntax sounds like a superficial trait distracting attention from the underlying ideas. It's really the other way around. Classification of grammars is seldom of much concern to anyone but parser theorists (and implementors, and sometimes users, of parsing libraries), but syntax is a constant source of problems for everyone who uses regexes. Consider JWZ's famous malediction (directed at himself or at emacs?):

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

The part of regexes that makes them a problem is the syntax: the metacharacters and consequent escaping, the sensitivity to single-character mistakes, the difficulty of extracting results (i.e. captures), the lack of a way to name and reuse parts of expressions. Not the regular grammar! So common usage makes exactly the right distinction. Syntax is what makes regular expressions; their (in)expressiveness is a superficial detail.

Xach pointed out that Nikodemus Siivola recently added support for reading package::(arbitrary form) in SBCL.

I had heard that Zetalisp had this feature, but apparently it's older than that — this was how packages originally worked. The 1979 Lisp Machine manual says:

The colon character (":") has a special meaning to the Lisp reader. When the reader sees a colon preceded by the name of a package, it will read in the next Lisp object with package bound to that package.

I don't know why CL degeneralized the package prefix to only work on symbols. The only reason I've heard is that a misplaced colon could accidentally cause the next form to be read in the wrong package, but that doesn't sound more dangerous than other typos like stray parentheses.

Update August 2015: It's more dangerous because unlike most typos, it has a read-time side effect: it pollutes the other package with lots of unwanted symbols.

Old Lisp manuals are a fascinating mix of futuristic and primitive. Lisp Machine Lisp also had hierarchical packages: package names were symbols, not strings, and could therefore live in other packages. But there are no earmuffs on package, nor on any other special variables; apparently they hadn't been invented yet.

Arcane Sentiment

What do you mean by “regex”?

Return of the Lisp Machine package system features