# How to validate XML

XML validation is an instance of the regular expression matching problem: given a regular expression e and a string s, is s in L(e)? (Here L(e) denotes the language accepted by e.)

The most commonly used technique for solving this problem is based on finite automata. There is another algorithm, based on derivatives of regular expressions, which deserves to be more widely known.

One of these days I'm going to get around to writing up a more complete description of the algorithm; in the meantime here is a (mostly uncommented, untested) Haskell program fragment that implements it.

The basic idea is as follows: Let A be an alphabet, L be a language over A (i.e., a subset of A*), and s be a string in A*. The derivative of L with respect to s, s \ L, is defined as:

s \ L = { w : sw in L }

That is, s \ L is the language consisting of the suffixes of all the strings in L that begin with the prefix s, after the prefix has been stripped off. For example, if L = {ab, ac, b}, then a \ L = {b, c}.

Given a regular expression e over A and a symbol x in A, we can compute a new regular expression delta e x such that

L(delta e x) = x \ L(e)

This turns out to be surprisingly easy, and it extends to strings in the obvious way, one symbol at a time: delta* e [] = e, and delta* e (x:w) = delta* (delta e x) w.

We also define a function nullable : RE -> Bool, such that nullable e is true if and only if L(e) contains the empty string []. Then for any string w in A*, we have

w in L(e) if and only if [] in w \ L(e),
which is equivalent to nullable(delta* e w).
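The pieces above fit together in a few lines of Haskell. This is my own minimal sketch, not the fragment mentioned earlier; the datatype and the names `matches`, `Sym`, `Seq`, `Alt`, and `Star` are assumptions for illustration:

```haskell
-- A minimal regular-expression datatype (names are mine).
data RE
  = Empty          -- the empty language: matches nothing
  | Epsilon        -- matches only the empty string
  | Sym Char       -- matches a single symbol
  | Seq RE RE      -- concatenation
  | Alt RE RE      -- alternation
  | Star RE        -- Kleene closure

-- nullable e is True iff L(e) contains the empty string.
nullable :: RE -> Bool
nullable Empty     = False
nullable Epsilon   = True
nullable (Sym _)   = False
nullable (Seq a b) = nullable a && nullable b
nullable (Alt a b) = nullable a || nullable b
nullable (Star _)  = True

-- delta e x is the derivative of e with respect to the symbol x,
-- i.e. L(delta e x) = x \ L(e).
delta :: RE -> Char -> RE
delta Empty     _ = Empty
delta Epsilon   _ = Empty
delta (Sym c)   x = if c == x then Epsilon else Empty
delta (Seq a b) x
  | nullable a = Alt (Seq (delta a x) b) (delta b x)
  | otherwise  = Seq (delta a x) b
delta (Alt a b) x = Alt (delta a x) (delta b x)
delta (Star a)  x = Seq (delta a x) (Star a)

-- Extend delta to strings (delta* in the text) via a left fold,
-- then test membership with nullable.
matches :: RE -> String -> Bool
matches e w = nullable (foldl delta e w)
```

For example, matches (Star (Alt (Sym 'a') (Sym 'b'))) "abba" is True, while matches (Sym 'a') "aa" is False.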

Much easier than building a finite automaton! Extending this to XML validation is relatively straightforward.

You can think of the algorithm as working on an infinite DFA with states labelled by regular expressions (instead of symbols as in the canonical construction). Then the final states are those states e for which nullable e is true, and delta is the transition function.

Sorry this is so terse; feel free to e-mail me if you have any questions and I'll be happy to elaborate further.

Mark Hopkins wrote a small C program based on this algorithm with several (truly nifty!) enhancements. (His paper is where I first learned about this technique.) The code and supporting documentation are available at the comp.compilers archive. His technique can be summarized as:

• exploit operator identities
• memoize RE constructors
• memoize the delta function

This leads to a very efficient matching algorithm, O(N) amortized time, that in effect lazily builds a deterministic automaton (N is the length of the string to be matched). But check out his paper; it has a much better description.
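As a hedged illustration of the first enhancement, here is one way smart constructors might exploit operator identities so that derivatives stay small. The names and the particular set of identities are my own choices, not taken from Hopkins' code:

```haskell
-- A minimal regular-expression datatype (names are mine).
data RE = Empty | Epsilon | Sym Char | Seq RE RE | Alt RE RE | Star RE
  deriving (Eq, Show)

-- Alternation:  0 | e = e,  e | e = e
alt :: RE -> RE -> RE
alt Empty e = e
alt e Empty = e
alt a b | a == b    = a
        | otherwise = Alt a b

-- Concatenation:  0 e = e 0 = 0,  1 e = e 1 = e
sq :: RE -> RE -> RE
sq Empty _   = Empty
sq _ Empty   = Empty
sq Epsilon e = e
sq e Epsilon = e
sq a b       = Seq a b

-- Closure:  0* = 1,  1* = 1,  (e*)* = e*
star :: RE -> RE
star Empty      = Epsilon
star Epsilon    = Epsilon
star s@(Star _) = s
star e          = Star e
```

Using alt, sq, and star in place of the raw Alt, Seq, and Star constructors when computing derivatives keeps the resulting expressions from growing without bound; memoizing these constructors and the delta function on top of that is what yields the lazily built automaton.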

Joe English / joe@flightlab.com
$Date: 1999/06/27 18:18:10 $