Say No to QNames in Content

The XML-DEV list is currently discussing an issue that causes all sorts of pain when designing XML processing APIs: “qualified names” (qnames), and particularly qnames in content. Before getting to qnames in content, let’s look at a general issue with qnames that you might not know about. Take the following XML:
“/ >
” />

Notice the first two elements, both ostensibly named “p:elem”. If we treat the element names as opaque strings, we’ll get confused and think the elements are the same. Luckily, we have this magical thing called a qname that uses the namespace instead of the prefix, so we can note that the two element names are actually “{}elem” and “{}/elem”, which are different. By the same token, if we compare the first and third elements using opaque strings, we think that they are different (“p:elem” and “x:elem”). But if we look at the qnames, we see they are both “{}elem”.
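To make the comparison concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which reports element names in the same “{namespace}local” form. The document and the `urn:example:a` / `urn:example:b` namespace names are made up for illustration; they are not the ones from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the prefix "p" is rebound on the second element,
# and "x" is bound to the same namespace as the outer "p".
DOC = """<root xmlns:p="urn:example:a" xmlns:x="urn:example:a">
  <p:elem/>
  <p:elem attr="1" xmlns:p="urn:example:b"/>
  <x:elem/>
</root>"""


def expanded_names(xml_text):
    """Return the expanded name ("{namespace}local") of each child of the root."""
    return [child.tag for child in ET.fromstring(xml_text)]


names = expanded_names(DOC)
first, second, third = names
print(first == second)  # same prefix, different namespace
print(first == third)   # different prefix, same namespace
```

Comparing the expanded names gives the opposite answer to comparing the prefixed strings, which is the whole point of a qname.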

The interesting part, though, is that you don’t know that the second p:elem is different from the first one until after you have parsed all of the attributes up to the end of the element, where you encounter the namespace redeclaration. There could be thousands of attributes between the start of the element and the namespace redeclaration. Now suppose that you are implementing a “streaming” parser like SAX or XmlReader, and you want to return the full qname (localName + namespaceName) for the element that you are parsing, then report each child attribute of the element. What the example above means is that you will need to buffer all of the attributes of an element before you report that an element has been encountered.

If you want to report element names as opaque strings, you don’t need to buffer attributes, but as described above, this could lead to incorrect behavior.
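The buffering requirement can be sketched in a few lines. This is not how SAX or XmlReader are implemented; `resolve_start_tag` is a hypothetical helper that takes the raw attributes of one start tag in document order and shows that the element's namespace cannot be decided until the last attribute has been seen.

```python
def resolve_start_tag(raw_name, raw_attrs, in_scope):
    """Resolve an element name against the prefixes in scope.

    raw_attrs is a list of (name, value) pairs in document order.
    Returns ((namespace, local_name), buffered_attrs, new_scope).
    """
    scope = dict(in_scope)   # copy, so outer bindings are not modified
    buffered = []            # ordinary attributes held back until the scan ends
    for name, value in raw_attrs:
        if name == "xmlns":
            scope[""] = value                    # default namespace declaration
        elif name.startswith("xmlns:"):
            scope[name[len("xmlns:"):]] = value  # prefix (re)declaration
        else:
            buffered.append((name, value))
    # Only now is it safe to resolve the element's own prefix.
    prefix, _, local = raw_name.rpartition(":")
    return (scope.get(prefix, ""), local), buffered, scope


outer = {"p": "urn:example:a"}  # hypothetical bindings inherited from the parent
attrs = [("a", "1"), ("b", "2"), ("xmlns:p", "urn:example:b")]
name, pending, _ = resolve_start_tag("p:elem", attrs, outer)
print(name, pending)
```

Had the function reported the element name before the loop finished, it would have used the outer binding for “p” and returned the wrong namespace.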

OK, so we have just shown that qnames for element names are a killer for perf. We can live with that, so what is the big deal with qnames in content? Look at the following XML:

here is some data: with a colon for no good reason x:address x:address

Now, do the last two “p:elem” elements contain the same text, or different text? If you compared them using XSLT or XPath, what would be the result? How about if you used the values in an XSD key/keyref? The answer is that XSLT and XPath have no way of knowing that you intend the content of those last two elements to be qnames, so they will treat the values as opaque strings. With XSD, you could type the node as xs:QName and tell the difference in key/keyref, but you’d have to get rid of the first element (its text contains a colon but is not a qname), and you would likely have some other unintended side effects to worry about. For example, suppose you loaded the document into a DOM and decided to copy out each child node into a separate document. Most APIs are smart enough to inject namespace declarations if necessary, so the first node would be written correctly as:
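The difference between the two readings can be sketched in Python. The document below is a hypothetical stand-in with made-up `urn:example:*` namespaces, and `text_and_resolved` is an illustrative helper, not part of any library. It tracks prefix bindings with the `start-ns` and `end-ns` events of `ElementTree.iterparse` and resolves each element's text as if it were a qname.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: both "x:address" values look identical as text,
# but "x" is bound to a different namespace on each element.
DOC = """<root xmlns:p="urn:example:p">
  <p:elem>here is some data: with a colon for no good reason</p:elem>
  <p:elem xmlns:x="urn:example:one">x:address</p:elem>
  <p:elem xmlns:x="urn:example:two">x:address</p:elem>
</root>"""


def text_and_resolved(xml_text):
    """Return (text, (namespace, local)) for each p:elem, reading text as a qname."""
    ns_stack = []   # (prefix, uri) pairs currently in scope, innermost last
    results = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end-ns", "end")):
        if event == "start-ns":
            ns_stack.append(payload)
        elif event == "end-ns":
            ns_stack.pop()
        elif payload.tag == "{urn:example:p}elem":
            scope = dict(ns_stack)  # later (inner) bindings override earlier ones
            text = (payload.text or "").strip()
            prefix, _, local = text.partition(":")
            results.append((text, (scope.get(prefix), local)))
    return results


results = text_and_resolved(DOC)
for text, resolved in results:
    print(repr(text), "->", resolved)
```

The last two entries compare equal as strings and unequal as resolved qnames, while the first entry shows that blindly splitting on a colon produces a “prefix” that is bound to nothing.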

here is some data: with a colon for no good reason

But, since the DOM has no idea that you stuffed a qname in the element content, it’s got no way to know that you want to preserve the namespace for x:


There is really only one way to get around this, and this is for any API which writes XML to always emit namespace declarations for all namespaces in scope, whether they are used or not (or else understand enough about the XSD and make some guesses). Some APIs do this, but it is not something that all APIs can be trusted to do, and it yields horribly cluttered XML output and other problems.
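The lost declaration is easy to reproduce. The sketch below uses `xml.etree.ElementTree` as a stand-in for a DOM (it is not a W3C DOM, and it also renames prefixes on output), with a made-up document and namespaces. Serializing one child on its own emits a declaration for the namespace of the element name, but nothing for the prefix that appears only in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "x" is declared on the root and used only inside text.
DOC = """<root xmlns:p="urn:example:p" xmlns:x="urn:example:one">
  <p:elem>x:address</p:elem>
</root>"""


def copy_out_first_child(xml_text):
    """Serialize the first child of the root as a standalone fragment."""
    root = ET.fromstring(xml_text)
    child = root[0]
    child.tail = None  # drop the whitespace that followed the element
    return ET.tostring(child, encoding="unicode")


copied = copy_out_first_child(DOC)
print(copied)
```

The text “x:address” survives the copy, but the binding for “x” does not, so the value can no longer be resolved to a namespace in the new document.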

Since qnames in content are only treated as proper qnames in rare cases, and not necessarily reliably, it is good advice to always just avoid qnames in content when designing schemas. If you have tight control over your scenarios and data usage, you might get away with it, but otherwise don’t even try.

This is a particularly interesting situation when compared to the fact that, in RDF, all important data is identified by URI, often split between local and namespace parts. In other words, the whole RDF data model is a graph of qnames (with some literals dangling about). The fact that we have a problem has been well-known for many years, but the important thing to communicate at this point is that you shouldn’t be depending on qnames in content for anything that requires reliable equality-comparison (like identifiers).
