XML validation is an instance of
the regular expression matching problem:
given a regular expression *e* and a string *s*,
is *s* in *L*(*e*)?
(Here *L*(*e*) denotes the language accepted by *e*).

The most commonly-used technique to solve this problem is based on finite automata. There is another algorithm, based on derivatives of regular expressions, which deserves to be more widely known.

One of these days I'm going to get around to writing up a more complete description of the algorithm; in the meantime here is a (mostly uncommented, untested) Haskell program fragment that implements it.

The basic idea is as follows:
Let *A* be an alphabet,
*L* be a language over *A* (i.e., a subset of *A ^{*}*),
and

s\L= {w:swinL}

That is, *s* \ *L* is the language containing
the suffix of all the strings in *L* beginning
with the prefix *s* (after stripping off the prefix).

Given a regular expression *e* over *A*
and a symbol *x* in *A*,
we can compute a new regular expression *delta e x*
such that

L(delta e x) =x\L(e)

This turns out to be surprisingly easy, and it can be extended to strings in the obvious way.

We also define a function *nullable : RE -> Bool*,
such that *nullable e* is true
if and only if *L*(*e*) contains the empty string [].
Then for any string *w* in *A ^{*}*,
we have

winL(e) if and only if [] inw\L(e),

which is equivalent tonullable(delta).^{*}e w

Much easier than building a finite automaton! Extending this to XML validation is relatively straightforward.

You can think of the algorithm as working on an
infinite DFA with states labelled by regular expressions
(instead of symbols as in the canonical construction).
Then the final states are those states *e* for
which *nullable e* is true, and *delta* is
the transition function.

Sorry this is so terse; feel free to e-mail me if you have any questions and I'll be happy to elaborate further.

Mark Hopkins wrote a small C program based on this algorithm with several (truly nifty!) enhancements. (His paper is where I first learned about this technique). The code and supporting documentation is available at the comp.compilers archive. His technique can be summarized as:

- exploit operator identities
- memoize RE constructors
- memoize the delta function

This leads to a very efficient, *O*(*N*) amortized-time matching
algorithm that, in effect, lazily builds a deterministic automaton
(*N* is the length of the string to be matched).
But check out his paper, it has a much better description.

Joe English / joe@flightlab.com