Message-id: <200204051643.g35Gh4s10393@dragon.flightlab.com>
From: Joe English
To: xml-dev mailing list
Date: Fri, 05 Apr 2002 08:43:04 -0800
Subject: [xml-dev] A plea for Sanity

[ Also sent to xml-names-editor ]

"Namespaces in XML 1.1 Requirements" cites the ability to "undeclare"
a namespace as the principal (only?) new needed feature, because of
the case where:

| information items [...] from another document [...] may
| have fewer in-scope namespaces than their parent. There is
| no mechanism for accurately serializing this situation. If
| the infoset is naively serialized and reparsed, the children
| will end up with additional namespace information items which
| serve no useful purpose.

I believe that this requirement is ill-considered.

Under SGML and XML 1.0, applications can treat generic identifiers
as atomic strings; with XML 1.0 + Namespaces, element and attribute
names become compound objects consisting of a URI and a local name.

This complicates applications a bit, but by itself is not an onerous
burden: toolkits like SAX can provide namespace processors that keep
track of the namespace environment, map GIs to {URI+localname} pairs,
and throw away the original namespace declarations.

The real complexity starts to show up in applications which
themselves need to keep track of the namespace environment
(e.g., XSLT). This is usually required for applications that
need to reserialize an Infoset as XML and wish to retain the
original namespace prefixes on output. (It gets hairier for
markup vocabularies that include QNames in content, but that's
a different issue.)

But the new requirement implies that the *exact set of in-scope
namespaces at each node* is an essential part of the Infoset.
This is the part that I think is ill-considered. This property
should be deemed inessential, just as whitespace in tags and the
order of attribute value specifications are deemed inessential.
XML-related specifications should not expect or demand that it be
preserved; any set of namespace declarations that produces the same
{URI+localname} pairs after namespace processing should be
considered equivalent.

In particular, "additional namespace information items which serve
no useful purpose" -- and hence do not affect the interpretation of
QNames in markup or content -- should not matter. Applications
should be free to insert or discard them as they see fit without
changing the meaning of the Infoset.

                        *    *    *

Now a plea for sanity. (This is for people who design XML
vocabularies and applications; xml-names-editor, I know you're
busy, so you can stop reading here.)

There are certain practices which, if avoided, can make life
simpler for application and toolkit developers. These are all
legal according to the Namespaces REC, and I don't suggest that
they be disallowed in XML 1.1, but it may be beneficial for
individual applications to disallow them.

Some definitions:

Let's say that an XML document is _neurotic_ if it maps the same
namespace prefix to two different namespace URIs at different
points.

Neurosis makes it necessary for XML processors to work with
{URI+localname} pairs instead of GIs, and to keep track of the
namespace environment at each point in the tree if there are
QNames-in-content. If it weren't for neurosis, applications
could use a single namespace map that applied to the entire
document.

Conversely, a document is _borderline_ if it maps two different
namespace prefixes to the same namespace URI.

Borderline documents complicate reserialization: the choice of
which prefix to use for a particular {URI+localname} pair depends
on its position in the tree.

A document is _psychotic_ if it maps two different namespace
prefixes to the same URI _in the same scope_.

Psychosis presents an even bigger difficulty for reserialization:
now applications must keep track of the original prefix as well
as the {URI+localname} pair.

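The three definitions above can be checked mechanically. What follows is a minimal sketch, not part of the original post: it assumes Python's standard xml.etree.ElementTree and its "start-ns"/"end-ns" parse events, and the function name classify and the "sane" label for a document that triggers none of the three definitions are my own choices for illustration.

```python
import io
import xml.etree.ElementTree as ET


def classify(xml_text):
    """Return the labels defined above that apply to xml_text (hypothetical helper)."""
    uris_by_prefix = {}    # prefix -> every URI it is bound to anywhere
    prefixes_by_uri = {}   # URI -> every prefix bound to it anywhere
    stack = []             # declarations currently in scope, innermost last
    psychotic = False

    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ("start-ns", "start", "end-ns")
    for event, data in ET.iterparse(source, events=events):
        if event == "start-ns":
            prefix, uri = data
            stack.append((prefix, uri))
            if uri:  # xmlns="" only undeclares the default namespace
                uris_by_prefix.setdefault(prefix, set()).add(uri)
                prefixes_by_uri.setdefault(uri, set()).add(prefix)
        elif event == "end-ns":
            stack.pop()
        else:
            # At each start-tag, build the in-scope map; inner
            # declarations override outer ones for the same prefix.
            in_scope = [u for u in dict(stack).values() if u]
            if len(in_scope) != len(set(in_scope)):
                psychotic = True  # two prefixes -> one URI, same scope

    labels = set()
    if any(len(uris) > 1 for uris in uris_by_prefix.values()):
        labels.add("neurotic")
    if any(len(prefixes) > 1 for prefixes in prefixes_by_uri.values()):
        labels.add("borderline")
    if psychotic:
        labels.add("psychotic")
    return labels or {"sane"}
```

For example, classify('<r xmlns:a="urn:x"><c xmlns:a="urn:y"/></r>') reports the document as neurotic, since the prefix a is bound to two different URIs at different points.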
A document is _normal_ (or _in namespace-normal form_) if all
namespace declarations appear on the root element and it is not
psychotic. (A borderline document with all namespace declarations
in the same place is automatically psychotic; a neurotic document
with this property would be illegal according to the Namespaces
REC.)

Normal documents are the easiest to process: the application can
determine the global namespace environment at the beginning of
the parse, and can use it throughout processing.

It's not always possible to produce normal documents -- the
producer might not know all of the relevant namespaces at the
time it emits the root element start-tag -- so a weaker
definition is useful:

A document is _sane_ if it is neither neurotic nor borderline.

Document producers should be designed to emit sane documents.
This is not hard to do -- the serializer just needs to maintain
a monotonic, bijective URI/prefix map and reuse the same prefix
whenever a namespace URI leaves and comes back into scope.
("Bijective": there is precisely one URI for each prefix and one
prefix for each URI; by "monotonic" I mean that prefix+URI pairs
may be added to the map but not removed.)

A sane document can be transformed into a normal document simply
by moving all namespace declarations to the root element and
filtering out duplicates. (This can't be done in streaming mode,
but it might be an appropriate technique for XML databases.)

Now general-purpose XML consumers cannot expect to receive sane
documents. However *special-purpose* consumers, designed to work
with specific markup vocabularies, can be a lot simpler if the
markup vocabulary includes namespace sanity as a requirement.

As an application developer, I'd prefer not to have to worry
about namespace nodes or {URI+localname} pairs. I'd rather be
able to give the parser an internal namespace map describing all
the namespace URIs I'm interested in, and have the parser
translate QNames in markup to use my prefixes.
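The translation step just described might look roughly like this. It is a sketch under assumptions, not an existing parser feature: APP_PREFIXES and app_name are hypothetical names, the two URIs are only sample entries, and the code post-processes the '{uri}local' tags that Python's xml.etree.ElementTree produces rather than hooking into the parser itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical application map: namespace URI -> the prefix the
# application itself prefers, whatever prefix the document used.
APP_PREFIXES = {
    "http://www.w3.org/1999/xhtml": "h",
    "http://www.w3.org/2000/svg": "svg",
}


def app_name(tag, prefixes=APP_PREFIXES):
    """Map an ElementTree '{uri}local' tag to 'prefix:local'."""
    if not tag.startswith("{"):
        return tag  # element is in no namespace
    uri, local = tag[1:].split("}", 1)
    if uri not in prefixes:
        raise KeyError(f"unexpected namespace: {uri}")
    return f"{prefixes[uri]}:{local}"


# The document uses its own prefixes (x, g); the application sees its own.
doc = ET.fromstring(
    '<x:html xmlns:x="http://www.w3.org/1999/xhtml">'
    '<x:body><g:svg xmlns:g="http://www.w3.org/2000/svg"/></x:body>'
    '</x:html>'
)
names = [app_name(el.tag) for el in doc.iter()]
```

Here names holds the element names in document order, rewritten with the application's prefixes h and svg in place of the document's x and g.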
Then the application can work with GIs instead of
{URI+localname} pairs.

If the source document is sane, then it's possible to preserve
the original prefixes on reserialization simply by remembering
the original namespace map; it's not necessary to keep track of
namespace nodes during processing.

QNames in content are a lot easier to process in a sane document.
Sanity guarantees that a given QName means the same thing
wherever it appears. Any future markup vocabulary which uses
QNames in content should include sanity as an application
requirement.

A requirement for sanity shifts part of the burden onto document
producers, where it's easy to handle. The alternative is
maddening complexity for document consumers.

--Joe English