SGML documents: different views

There are several different ways of looking at an SGML document. Here are the ones that I find most useful:

The Entity Structure view

An SGML document is a nested collection of entities. Most entities are named. Some are text entities; these contain a sequence of characters that is fed to the parser, and they can reference other entities. (This is where the nesting comes in.) Others are data entities, which can contain any type of data in any format. The parser doesn't directly examine data entities; it just keeps track of them and reports them to the application.
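As a rough illustration of the nesting, here is a minimal Python sketch. The class names, the `expand` function, and the simplified `&name;` reference pattern are my own assumptions for the example, not anything defined by the standard. Text entities are expanded recursively, while data entities are only reported.

```python
import re
from dataclasses import dataclass

@dataclass
class TextEntity:
    name: str
    text: str        # characters fed to the parser; may reference other entities

@dataclass
class DataEntity:
    name: str
    notation: str    # hypothetical notation name, e.g. "gif"
    system_id: str   # where the data lives; the parser never looks inside

# Simplified general entity reference: "&name;" (an assumption for this sketch).
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand(name, entities, seen=()):
    """Yield ("chars", text) and ("data-entity", name) items in document order."""
    if name in seen:
        raise ValueError(f"entity {name!r} references itself")
    entity = entities[name]
    if isinstance(entity, DataEntity):
        # Data entities are reported to the application, not parsed.
        yield ("data-entity", entity.name)
        return
    pos = 0
    for match in REFERENCE.finditer(entity.text):
        if match.start() > pos:
            yield ("chars", entity.text[pos:match.start()])
        # A reference inside a text entity is where the nesting comes in.
        yield from expand(match.group(1), entities, seen + (name,))
        pos = match.end()
    if pos < len(entity.text):
        yield ("chars", entity.text[pos:])
```

Expanding a document entity whose text is `Hello &who; &logo;!` yields the characters of the document and of the `who` text entity in order, with the `logo` data entity reported as a single item between them.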

The Lexical view

An SGML document is a stream of characters. Some are data characters, and others are delimiter characters. The process of determining which is which is called delimiter recognition, and is controlled by a set of delimiter recognition modes [1].

The stream of characters is essentially the resolution of the entity structure, plus some out-of-band information concerning entity boundaries.
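To make the idea of recognition modes concrete, here is a drastically simplified Python sketch. It assumes the reference concrete syntax and handles only two modes (CON for content, TAG for inside a tag) and three delimiter roles (STAGO, ETAGO, TAGC); the function name `recognize` and the item format are hypothetical. The point is only that the same character can be a delimiter in one mode and data in another.

```python
def recognize(chars):
    """Split a character stream into ("delim", role) and ("chars", text) items."""
    items = []
    mode = "CON"   # start in content mode
    run = ""       # characters not recognized as delimiters so far
    i = 0

    def flush():
        nonlocal run
        if run:
            items.append(("chars", run))
            run = ""

    while i < len(chars):
        if mode == "CON" and chars.startswith("</", i) and chars[i + 2:i + 3].isalpha():
            # "</" counts as ETAGO only in CON mode and only before a name start.
            flush()
            items.append(("delim", "ETAGO"))
            mode = "TAG"
            i += 2
        elif mode == "CON" and chars[i] == "<" and chars[i + 1:i + 2].isalpha():
            # "<" counts as STAGO only in CON mode and only before a name start.
            flush()
            items.append(("delim", "STAGO"))
            mode = "TAG"
            i += 1
        elif mode == "TAG" and chars[i] == ">":
            # ">" is TAGC inside a tag; in CON mode it is ordinary data.
            flush()
            items.append(("delim", "TAGC"))
            mode = "CON"
            i += 1
        else:
            run += chars[i]
            i += 1
    flush()
    return items
```

With this sketch, the `<` in `a < b` stays a data character because no name start character follows it, and a `>` outside a tag is never a delimiter. The real rules involve more modes and more contextual constraints than this.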

The Syntactic view

An SGML document has three parts: an SGML declaration, a document type definition and other prolog, and a document instance. The third part contains data and markup, and must conform to the rules described in the second part. The first part is confusing.

There are five kinds of markup: tags, references, declarations, processing instructions, and marked sections [2]. Tags delimit the element structure (see below), and references define the entity structure (see above). Declarations are instructions to the SGML parser, and processing instructions are instructions to the application. Marked sections identify and control the disposition of parts of text entities.

The syntactic view is created by the parser from the stream of data and delimiters determined by the lexical view.
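The five kinds of markup can be told apart by their opening delimiters. The following Python sketch assumes the reference concrete syntax; the function name `kind_of_markup` is hypothetical, and marked sections are treated as their own kind, following the classification used here rather than the standard's.

```python
def kind_of_markup(markup):
    """Classify a piece of markup by its opening delimiter (reference concrete syntax)."""
    if markup.startswith("<!["):
        return "marked section"          # e.g. <![ IGNORE [ ... ]]>
    if markup.startswith("<!"):
        return "declaration"             # e.g. <!ELEMENT ...>, <!-- comment -->
    if markup.startswith("<?"):
        return "processing instruction"  # e.g. <?page-break>
    if markup.startswith("<"):
        return "tag"                     # start-tag <p> or end-tag </p>
    if markup.startswith(("&", "%")):
        return "reference"               # e.g. &chap1; or %common;
    raise ValueError(f"not markup: {markup!r}")
```

The order of the tests matters: `<![` must be checked before `<!`, and both before the bare `<`.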

The Element Structure view

An SGML document is an Ordered Hierarchy of Content-based Objects (OHCO) [3]; it is a tree of elements. Elements have attributes, content, and other auxiliary properties. Attributes are named, and may contain character data or data entity references. Elements can contain other elements and/or character data, with data entity references mixed in.

The most important property of an element is its generic identifier (GI). The GI determines what attributes an element has and what its content must look like.

The element structure is created by the parser from the stream of markup and data determined by the syntactic view.
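A minimal Python sketch of this view follows. The `Element` class, the `ALLOWED_ATTRIBUTES` table, and the element names in it are all hypothetical; in a real document the table would come from the attribute definition list declarations in the DTD. It shows a tree of elements and a check in which the GI decides which attributes are permitted.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for DTD declarations: attribute names allowed per GI.
ALLOWED_ATTRIBUTES = {
    "doc": {"id"},
    "para": {"id", "lang"},
    "em": set(),
}

@dataclass
class Element:
    gi: str                                       # generic identifier
    attributes: dict = field(default_factory=dict)
    content: list = field(default_factory=list)   # child Elements and str data, in order

def undeclared_attributes(element):
    """Return (gi, attribute name) pairs that the element's GI does not allow."""
    problems = []
    allowed = ALLOWED_ATTRIBUTES.get(element.gi, set())
    for name in sorted(element.attributes):
        if name not in allowed:
            problems.append((element.gi, name))
    # Recurse into child elements; character data has no attributes.
    for child in element.content:
        if isinstance(child, Element):
            problems.extend(undeclared_attributes(child))
    return problems
```

Checking content models against the GI would follow the same pattern, with a table of allowed content in place of the attribute table.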

The Overview

An SGML document is a complex thing, which must be processed in several phases. The storage manager keeps track of where all the pieces are kept, and supplies them to an entity manager. The entity manager is the glue between the storage manager and the parser. The parser itself works in multiple phases, including delimiter recognition, syntactic analysis, and structural analysis or validation [4]. The parser reports what it finds [5] to an application, which does all the really important work.
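The division of labor can be sketched in Python as follows. Every class and function name here is hypothetical, and the "parser" is a toy that understands only simple start-tags, end-tags, and data, so it stands in for the real multi-phase process rather than implementing it.

```python
import re

class StorageManager:
    """Knows where the pieces are kept (here: a dict keyed by system identifier)."""
    def __init__(self, objects):
        self._objects = objects

    def read(self, system_id):
        return self._objects[system_id]

class EntityManager:
    """Glue between storage and parser: maps entity names to stored objects."""
    def __init__(self, storage, catalog):
        self._storage = storage
        self._catalog = catalog   # entity name -> system identifier

    def open(self, entity_name):
        return self._storage.read(self._catalog[entity_name])

# Toy grammar: <name>, </name>, or a run of characters containing no "<".
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")

def parse(entity_manager, document_entity, report):
    """Toy parser: reports start, end, and data events to the application."""
    text = entity_manager.open(document_entity)
    for slash, gi, data in TOKEN.findall(text):
        if data:
            report(("data", data))
        elif slash:
            report(("end", gi))
        else:
            report(("start", gi))
```

The application is whatever is passed as `report`; passing `events.append` for some list `events` simply collects what the parser finds.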

The Application view

An SGML document is whatever you want it to be.

This can be anything from a stream of start-tag, end-tag, and data events to a cross-linked forest of structured data. The basic OHCO framework, augmented with cross-reference mechanisms and validation rules, can be used to encode just about any type of data.
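As a small illustration of the two ends of that range, this Python sketch turns a flat stream of events into a nested structure. The event tuples and the dict layout are assumptions made for the example, not a format defined by any SGML tool.

```python
def build_tree(events):
    """Build nested {"gi": ..., "content": [...]} dicts from a flat event stream."""
    root = {"gi": None, "content": []}   # artificial holder for top-level items
    stack = [root]
    for kind, value in events:
        if kind == "start":
            node = {"gi": value, "content": []}
            stack[-1]["content"].append(node)
            stack.append(node)
        elif kind == "end":
            if len(stack) == 1 or stack[-1]["gi"] != value:
                raise ValueError(f"unexpected end of {value!r}")
            stack.pop()
        elif kind == "data":
            stack[-1]["content"].append(value)
        else:
            raise ValueError(f"unknown event kind {kind!r}")
    if len(stack) != 1:
        raise ValueError(f"element {stack[-1]['gi']!r} was never closed")
    return root["content"]
```

An application that only needs to count words can stay with the event stream; one that needs to follow cross-references will probably want something like the tree.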

An SGML document could be the text of a book with all the structural elements identified for a publisher; it could be a government specification or legal document with rigid format requirements; it could be the text of an ancient manuscript annotated with scholarly interpretations; it could be a technical reference manual indexed and cross-referenced for easy access. It might not even be text at all: it could be a musical score, a hypermedia presentation, or a stylesheet describing how to format another SGML document [6].

If it could be said that there is such a thing as ``the SGML philosophy'', I would say that it is this:

Data belongs to whoever creates it, and you get to decide what's important about your own data.

Footnotes

[1] Delimiter recognition is IMHO the single most important thing to understand about SGML. It's also one of the most difficult. It's not that it's terribly complicated; it's just not easy to figure out from reading the standard.

[2] The standard counts marked sections as a kind of declaration, but I think of them as sui generis; there are more differences than similarities between marked sections and other kinds of declarations.

[3] The term ``OHCO'' comes from HyTime, I think. I find it very descriptive.

[4] Many people with a background in computer science -- myself included -- try to fit the lexical, syntactic, and element structure views into the conventional ``lex/yacc'' or ``tokenization/parsing'' model of text analysis.

This doesn't work very well: Delimiter recognition is not quite the same thing as tokenization, and it is difficult to parse both the markup (syntactic view) and the element hierarchy (element structure view) with a single process. The lex/yacc model also obscures the equally important entity structure view.

[5] The output of the parser is called the Element Structure Information Set or ESIS. ``ESIS'' is actually a generic term meaning, roughly, ``the properties of a document that are useful for processing it.'' A bare-bones ESIS is defined in Annex A of ISO 8879. HyTime uses a somewhat richer ESIS, and DSSSL defines yet another. The DSSSL information set is the most complete; with all optional features enabled, it contains all the information that an SGML parser has available.

[6] I've used SGML to encode data flow graphs for parallel processing; other potential applications include a definition language for graphical user interfaces, and the native file format for a figure editor.

This is an edited version of an article I originally posted to Usenet, 8 Oct 1995.

Joe English / joe@trystero.art.com $Revision: 1.3 $ / $Date: 1995/10/10 18:03:29 $