A taxonomy of SGML entities

1 Introduction

There are many different varieties of SGML entities, and the distinctions among them are often confusing.

There are general entities and parameter entities; there are internal entities and external entities; and there are text entities and data entities. Data entities can be CDATA, SDATA, or NDATA, and external data entities are further qualified by an associated data content notation.

There are also subdocument entities, which are a different beast altogether, and entities declared with bracketed text, which is just syntactic sugar for internal text entities that expand into markup instead of data.

There are parameter entity references, general entity references, and ``implicit'' references to unnamed external entities in document type declarations. Data entities may be implicitly referenced by naming them in attribute values. There are also character references, which look much like general entity references but are treated very differently.

This article attempts to explain how all this works.

NOTE -- It is wrong in several places, though, so don't trust it yet!

2 Overview

An SGML document can be viewed as a nested collection of entities. Parsing begins with a top-level document entity. Each entity resolves to a sequence of characters, which is scanned for markup and data and may contain references to other entities.

With the exception of the document entity, entities must be declared before they are used. An entity declaration looks like this:

<!ENTITY entity-name entity-text >

This defines a new entity with the given entity-name and associates it with the entity-text. There are many different forms for the entity-text; see below.

It is legal for multiple entity declarations to specify the same entity-name; in this case, the first declaration takes precedence and the later ones are ignored.
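
One common use of this rule: because the internal declaration subset is read before the external DTD subset, a document can override an entity that its DTD also declares. In the sketch below, the file name memo.dtd, the element type memo, and the entity name company are made up for illustration:

<!DOCTYPE memo SYSTEM "memo.dtd" [
<!ENTITY company "Xyzzy Corp.">
]>

If memo.dtd goes on to declare company again, that later declaration has no effect, and &company; resolves to ``Xyzzy Corp.'' throughout the document.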

3 Parameter entities vs. General entities

Parameter entities are used in markup declarations, and general entities are used in the document instance.

Parameter and general entities do not share the same namespace. A parameter entity cannot be used as a general entity, or vice versa. It is, however, possible to declare a parameter entity and a general entity with the same name.

Parameter entities have a special declaration syntax:

<!ENTITY % pename  entity-text >
<!--     ^ note -->

If the entity-name does not start with a lone PERO delimiter, then it is a general entity declaration:

<!ENTITY gename  entity-text >

Note that in a parameter entity declaration there must be a space between the PERO and the parameter entity name; otherwise the two together would be interpreted as a parameter entity reference.

General and parameter entities have different reference syntax as well: general entity references begin with an ERO delimiter (&, ampersand), and parameter entity references begin with a PERO delimiter (%, percent sign). In both cases, the reference open delimiter is followed by the entity name and, optionally, a REFC delimiter (;) or a record end.

General entity references are recognized in content, replaceable character data, and attribute value literals. Parameter entity references are recognized in markup declarations (including the status keyword part of marked sections) and in parameter literals.
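
As a small sketch of these rules, the same name can be declared once as a parameter entity and once as a general entity, and each is then referenced only in its own context. The entity names inline, draft, and note and the element types are hypothetical, and the reference concrete syntax is assumed:

<!ENTITY % inline "EM | STRONG">
<!ENTITY inline "in-line">
<!ENTITY % draft "IGNORE">
<!ELEMENT P - O (#PCDATA | %inline;)*>
<![ %draft; [
<!ENTITY note "draft only">
]]>

In the document instance, only the general entity can be referenced:

<P>An &inline; element such as <EM>this one</EM>.</P>

Here %inline; is replaced inside the content model, %draft; supplies the status keyword of the marked section, and &inline; is replaced in content. A %inline; typed in the document instance would be ordinary data, not a reference.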

That's all for parameter entities, for now.

4 Internal entities vs. External entities

An entity's replacement text can be specified directly in the <!ENTITY ... declaration:

<!ENTITY ename "replacement text...">

In this case, the entity is internal.
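
For example, assuming an entity named sgml and an element type P (both made up for illustration), the declaration

<!ENTITY sgml "Standard Generalized Markup Language">

lets the document instance say

<P>The &sgml; is defined by ISO 8879.</P>

which is parsed roughly as if the replacement text had been typed in place of the reference.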

In an external entity, the entity text is an external identifier, and the replacement text is stored ``somewhere else'' on the system.

An external identifier consists of a system identifier, a public identifier, or both. When both are given, the PUBLIC keyword and the public identifier come first, followed directly by the system identifier:

<!ENTITY bplate PUBLIC "-//Xyzzy Corp.//TEXT boilerplate header//EN">
<!ENTITY chap2 SYSTEM "chap2.sgml">
<!ENTITY % html-dtd PUBLIC "-//IETF//DTD HTML//EN" "html.dtd">

It is up to the entity manager to turn an external identifier into a sequence of characters, and this article leaves the explanation of system identifiers, public identifiers, and entity management up to some other source. @@ Cite Goldfarb's paper on Entity Mgmt; stuff from SGML Open, too @@

5 Text entities vs. Data entities

@@ Write this part @@

6 Subdocument entities

Recall that an SGML document has three parts:

The SGML declaration;
The document type declaration (and other prolog);
The document instance.

A subdocument entity is an external entity which contains an (almost-) complete document, including its own prolog and document instance. (The parser uses the main document's SGML declaration for all subdocument entities.)

Subdocument entities are declared by adding the keyword SUBDOC after the external identifier:

<!ENTITY chap1 SYSTEM "chap1.sgml" SUBDOC>

The replacement text of a subdocument entity must begin with a <!DOCTYPE ...> declaration. It may use a different DTD than the main document, depending on the application. Each subdocument has a separate namespace for element IDs, general entity names, and parameter entity names. (This means that the ID/IDREF reference mechanism cannot be used across subdocument boundaries. HyTime has facilities to do this.)
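
A sketch of how the pieces fit together, assuming that the SUBDOC feature is enabled in the SGML declaration and that the main DTD permits the reference where it appears; the file names, DTDs, and element types are all hypothetical. The main document declares and references the subdocument:

<!DOCTYPE book SYSTEM "book.dtd" [
<!ENTITY chap1 SYSTEM "chap1.sgml" SUBDOC>
]>
<book>
&chap1;
</book>

and chap1.sgml carries its own prolog and document instance:

<!DOCTYPE chapter SYSTEM "chapter.dtd">
<chapter>
<title>Getting started</title>
</chapter>

An ID assigned inside chap1.sgml is not visible to an IDREF in the main document, and the entity chap1 declared in the main document is not declared inside the subdocument.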

The SUBDOC feature is useful for managing large collections of documents that need to be used on their own or combined into larger documents.

The document entity, subdocument entities, and text entities are collectively known as SGML entities. The document entity contains an SGML declaration, a prolog, and a document instance; subdocument entities contain only a prolog and a document instance; and text entities contain only data and markup (which go inside a document instance).