CDATA Confusion

Joe English
Last updated: Tue Sep 16 10:31:20 PDT 1997



1 Introduction

The keyword CDATA has (by my count) at least five different meanings in SGML. This tends to cause a great deal of confusion. CDATA is commonly misunderstood to mean ``no markup is recognized'' or ``verbatim text'', but this is not always the case.

2 Attribute declared values

Attributes may have CDATA as their declared value. For example (from the HTML 2.0 DTD):

<!ATTLIST IMG
        ...
        ALT CDATA #IMPLIED
        ...
>

This means that the attribute value may contain arbitrary character data (as opposed to an ID, a NAME or NAMES, a NUMBER or NUMBERS, et cetera). CDATA attribute values are not folded to upper case, and they are not tokenized as values of the other attribute types are.

Note that attribute value literals are always parsed as replaceable character data, regardless of the attribute's declared value. This means that references (&xxx;, &#yyy;) are recognized and replaced in attribute specifications, even for CDATA attributes.

For example, this HTML fragment:

    <IMG SRC="eqn1.gif" ALT = "A &lt; B">

will be displayed as

    A < B

in a text-mode browser or with image loading turned off (assuming the browser is working properly, of course).
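
The replacement step can be sketched in a few lines of Python. This is only a toy model, not how any particular SGML parser works: the entity table, the function name attribute_value, and the requirement that every reference end in a semicolon are all assumptions made for the illustration.

```python
import re

# Hypothetical entity table for this sketch; a real parser takes
# these definitions from the DTD.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&"}

# Matches &name; and decimal &#nnn; references (semicolon required here).
_REF = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.-]*);")

def attribute_value(literal, entities=ENTITIES):
    """Replace entity and character references in an attribute value literal."""
    def expand(match):
        ref = match.group(1)
        if ref.startswith("#"):
            return chr(int(ref[1:]))          # numeric character reference
        # Unknown entity names are left untouched in this sketch.
        return entities.get(ref, match.group(0))
    return _REF.sub(expand, literal)

print(attribute_value("A &lt; B"))    # the ALT value from the fragment above
print(attribute_value("A &#60; B"))   # same character, written as a character reference
```

Both calls yield the same string, since the declared value CDATA plays no part in whether references are replaced.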

3 Internal entity declarations

The second most common source of confusion is in entity declarations:

<!ENTITY amp CDATA "&#38;"	-- ampersand -- >

Here, the CDATA keyword signals that the entity is a character data entity (as opposed to a text entity, or an SDATA or PI data entity).

In this case, no markup is recognized in the replacement text when the entity is referenced. Note however that character references (&#nnn;) and parameter entity references (%nnn;) are recognized in parameter literals, so some references are expanded when the entity is declared.

For example,

<!ENTITY foo       "BAR"  >
<!ENTITY e1        "&foo;">
<!ENTITY e2  CDATA "&foo;">
<!ENTITY e3        "&#38;foo;"  -- &#38; = ampersand or ERO delimiter -->
<!ENTITY e4  CDATA "&#38;foo;" >

will be replaced as follows:

foo: BAR
e1:  BAR
e2:  &foo;
e3:  BAR
e4:  &foo;

The entities e1 through e4 all have the same replacement text, namely &foo;. The difference is that when e1 and e3 are referenced, the parser treats the replacement text as if it had appeared in the document directly, so &foo; is itself parsed as an entity reference. On the other hand, since e2 and e4 are data entities, the parser inserts the replacement text literally.
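
The two stages (replacement of character references when the entity is declared, then optional reparsing when it is referenced) can be modeled with a small Python sketch. It is a simplified illustration under stated assumptions: the helper names declare and reference are invented here, only decimal character references are handled, parameter entity references are ignored, and there is no protection against recursive entities.

```python
import re

CHAR_REF = re.compile(r"&#([0-9]+);")
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def declare(table, name, literal, cdata=False):
    """Store an entity; character references are replaced at declaration time."""
    text = CHAR_REF.sub(lambda m: chr(int(m.group(1))), literal)
    table[name] = (text, cdata)

def reference(table, name):
    """Return what a reference to `name` contributes to the document."""
    text, cdata = table[name]
    if cdata:
        # Data entity: the replacement text is inserted literally.
        return text
    # Text entity: the replacement text is parsed again, so any general
    # entity references inside it are expanded in turn.
    return ENTITY_REF.sub(lambda m: reference(table, m.group(1)), text)

table = {}
declare(table, "foo", "BAR")
declare(table, "e1", "&foo;")
declare(table, "e2", "&foo;", cdata=True)
declare(table, "e3", "&#38;foo;")
declare(table, "e4", "&#38;foo;", cdata=True)

for name in ("foo", "e1", "e2", "e3", "e4"):
    print(name + ":", reference(table, name))
```

After the declarations, table holds the same stored text, &foo;, for e1 through e4; only the cdata flag decides whether that text is scanned again on reference.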

4 External entity declarations

External entities may be declared as CDATA, with an associated data content notation:

<!NOTATION some-notation SYSTEM>
<!ENTITY foo1 SYSTEM "foo.sgml">
<!ENTITY foo2 SYSTEM "foo.sgml" CDATA some-notation>

Here, CDATA means much the same thing as it does for internal entities: the entity's replacement text is to be treated as literal character data, and the parser does not scan for markup.

(In fact, ESIS-producing parsers such as SGMLS don't even examine the content of external data entities at all, and simply report the reference.)

External entities may also be declared as SDATA, NDATA, or SUBDOC.

5 Marked sections

CDATA may appear as a status keyword in a marked section declaration:

<![ CDATA [  blah, blah, blah. ]]>

The only markup that is recognized in CDATA marked sections is the marked section end (]]>), that is, an MSC delimiter (]]) followed by an MDC (>), which closes the marked section. CDATA marked sections are the preferred method for entering ``verbatim text'' in an SGML document.
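
As a rough illustration of that rule, the Python sketch below extracts the content of a CDATA marked section by looking for nothing but the closing ]]> sequence. The function name cdata_section_content is hypothetical, and the sketch assumes the keyword is written in upper case with optional white space around it, as in the example above; it does not handle other status keywords or parameter entity references in the keyword position.

```python
import re

# Start of a CDATA marked section: "<![", the keyword, then "[".
CDATA_START = re.compile(r"<!\[\s*CDATA\s*\[")

def cdata_section_content(text):
    """Return the content of the first CDATA marked section in `text`, or None."""
    start = CDATA_START.search(text)
    if start is None:
        return None
    # Nothing inside the section is treated as markup; only "]]>" ends it.
    end = text.find("]]>", start.end())
    if end == -1:
        return None  # unterminated section
    return text[start.end():end]

sample = "<![ CDATA [ <P>Fish &amp; chips</P> ]]> after"
print(repr(cdata_section_content(sample)))
```

The tags and the &amp; reference come back unchanged, which is what makes the construct suitable for verbatim text.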

Other marked section status keywords are

RCDATA
Replaceable character data -- recognize references, but not tags or other markup.
IGNORE
Skip the marked section entirely.
INCLUDE
The opposite of IGNORE; useful for making ``conditional text''.
TEMP
The same as INCLUDE, only different.

6 Element declarations

Elements may have a declared content of CDATA.

Don't use this feature if you're designing a DTD. It's evil. In fact, you're better off forgetting about CDATA and RCDATA declared content altogether; SGML is much less confusing if you ignore them.

In case you want to know the whole story, there are five choices for an element's content definition:

  1. A model group;
  2. ANY;
  3. EMPTY;
  4. CDATA;
  5. RCDATA.

A model group is the normal case, as in:

<!ELEMENT letter - - (recipient, salutation, body, closing, (attach|cc)*) >

The other keywords may be used instead of a model group:

<!ELEMENT badnews1  - - CDATA  >
<!ELEMENT badnews2  - - RCDATA >
<!ELEMENT stuff     - - ANY    >
<!ELEMENT nothing   - O EMPTY  >

CDATA declared content means that, when the start-tag for that element is seen, the parser switches to a delimiter recognition mode in which no markup is recognized except for an ETAGO ("</") delimiter-in-context. RCDATA declared content is similar, except that general entity references and character references are recognized and replaced. In both cases, as soon as the parser encounters an ETAGO followed by a name start character, the element ends and the delimiter recognition mode changes back. Note that an ETAGO delimiter-in-context always terminates elements with CDATA and RCDATA declared content, even if it does not begin a valid end-tag for that element.
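
The practical consequence is easiest to see in a small model. The Python sketch below is an illustration only: the function name cdata_content is invented, and it assumes that name start characters are the ASCII letters. It returns the portion of the input that would be treated as the content of a CDATA element, given the text that follows the start-tag.

```python
import re

# "</" counts as a delimiter only when a name start character follows it.
# Assumption for this sketch: name start characters are ASCII letters.
END_TAG_OPEN = re.compile(r"</(?=[A-Za-z])")

def cdata_content(text_after_start_tag):
    """Return the part of the input that falls inside the CDATA element."""
    match = END_TAG_OPEN.search(text_after_start_tag)
    if match is None:
        return text_after_start_tag
    return text_after_start_tag[:match.start()]

# "</" followed by a space is plain data, so the content runs on.
print(repr(cdata_content("a </ b</BADNEWS1>")))
# "</EM>" is not the element's own end-tag, but it ends the content anyway.
print(repr(cdata_content('say "</EM>" here</BADNEWS1>')))
```

In the second call the content stops just before </EM>, well short of where the author presumably meant the element to end.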

EMPTY means that the element may not have any content (or an end-tag).

ANY means that the element may contain any subelements or character data, in any order. Exclusion exceptions apply, however, and subelements must be declared in the DTD.

Another source of confusion is the distinction between CDATA and RCDATA declared content and the #PCDATA content token:

<!ELEMENT badnews1 - - CDATA >
<!ELEMENT phrase   - - (#PCDATA) >
<!-- The following means that an OOPS element must contain a
     single subelement with generic identifier "CDATA":
-->
<!ELEMENT oops     - - (CDATA) >
<!-- And the following is illegal:
-->
<!ELEMENT oops2    - - #PCDATA >

Notice that the CDATA, RCDATA, EMPTY, and ANY keywords do not (and cannot) appear inside a parenthesized model group, and they are not prefixed with an RNI (#) delimiter as #PCDATA is.

(Thanks to Arjun Ray and Marcy Thompson for feedback and clarification.)