What do you mean by “regex”?

What's a regular expression?

The “official” definition is the literal one: an expression denoting a regular grammar. In this sense, a Parsec expression like many1 digit is a regex (equivalent to [0-9]+). But this is quite at odds with actual usage. No one calls Parsec expressions regexes when they happen to describe regular grammars! They're not regexes because the language isn't a regex language.

So is a regex an expression in a language that can only express regular grammars? No: Perl's regexes can call arbitrary Perl code, and thus can express non-regular grammars. But no one hesitates to call them regexes — they're the canonical implementation!

Like many Lisp-based regex libraries, CL-PPCRE offers a sexpr representation, which looks like (:greedy-repetition 1 nil :digit). It's rarely used because it's verbose, but if I saw one in the wild, I probably wouldn't call it a regex, because it doesn't have the distinctive syntax. CL-PPCRE's variant recursive-regex can express arbitrary recursive grammars, like Perl, but it still has “regex” in its name — not only as a joke, but because even when they're recursive, its expressions are still in regex syntax.

In common usage, regular expressions are defined by their syntax, not by the grammars they express.

Like any language, the regex language shares its name with its extensions: any very terse parsing language with the standard metacharacters is called a regex, regardless of what other features it has. Perl s/.../.../ expressions, for instance, are sometimes referred to as regexes, which is wrong both theoretically and in Perl terms, but right in terms of common usage. They contain regex-like syntax, so they're regexes, and the distinction between matching and substitution is not salient enough to earn them a different name. Even glob-expressions are sometimes called regexes — usually as a result of confusion, but it's an understandable confusion, because they share a metacharacter-rich syntax, and some of the metacharacters are even the same.

This focus on syntax sounds like a superficial trait distracting attention from the underlying ideas. It's really the other way around. Classification of grammars is seldom of much concern to anyone but parser theorists (and implementors, and sometimes users, of parsing libraries), but syntax is a constant source of problems for everyone who uses regexes. Consider JWZ's famous malediction (directed at himself or at emacs?):

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

The part of regexes that makes them a problem is the syntax: the metacharacters and consequent escaping, the sensitivity to single-character mistakes, the difficulty of extracting results (i.e. captures), the lack of a way to name and reuse parts of expressions. Not the regular grammar! So common usage makes exactly the right distinction. Syntax is what makes regular expressions; their (in)expressiveness is a superficial detail.

No comments:

Post a Comment

It's OK to comment on old posts.