Say No to QNames in Content

Currently XML-DEV is discussing an issue that causes all sorts of pain when designing XML processing APIs. The issue is “qualified names” (qnames), and particularly qnames in content. Before discussing qnames in content, let’s discuss a general issue with qnames that you might not have known about. Take the following XML:
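(The XML sample here showed three sibling elements written as p:elem, p:elem and x:elem; the second p:elem redeclares the p prefix in its last attribute.)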

Notice the first two elements, both ostensibly named “p:elem”, but if we treat the element names as opaque strings, we’ll get confused and think the elements are the same. Luckily, we have this magical thing called a qname that uses namespace instead of prefix, and so we can note that the two element names are actually “{}elem” and “{}/elem” — different. By the same token, if we compare the first and third element using opaque strings, we think that they are different (“p:elem” and “x:elem”). But if we look at the qnames, we see they are both “{}elem”.
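A minimal sketch of that comparison in Python, assuming a hypothetical document of the same shape. The URIs urn:example:one and urn:example:two are placeholders chosen for illustration, not the ones from the original example. ElementTree reports each element name in the same {namespace}local form used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the namespace URIs are placeholders for illustration.
DOC = """<root xmlns:p="urn:example:one" xmlns:x="urn:example:one">
  <p:elem/>
  <p:elem xmlns:p="urn:example:two"/>
  <x:elem/>
</root>"""

# The names as they are written in the markup (opaque strings).
raw_names = ["p:elem", "p:elem", "x:elem"]

# The names as the parser resolves them: "{namespace}local".
qnames = [child.tag for child in ET.fromstring(DOC)]

for raw, qname in zip(raw_names, qnames):
    print(raw, "->", qname)

# Opaque strings say first == second; qnames say first == third.
print("raw:   1st == 2nd?", raw_names[0] == raw_names[1])
print("qname: 1st == 2nd?", qnames[0] == qnames[1])
print("qname: 1st == 3rd?", qnames[0] == qnames[2])
```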

The interesting part, though, is that you don’t know that the second p:elem is different from the first one until after you have parsed all of the attributes up to the end of the element, where you encounter the namespace redeclaration. There could be thousands of attributes between the start of the element and the namespace redeclaration. Now suppose that you are implementing a “streaming” parser like SAX or XmlReader, and you want to return the full qname (localName + namespaceName) for the element that you are parsing, then report each child attribute of the element. What the example above means is that you will need to buffer all of the attributes of an element before you report that an element has been encountered.

If you want to report element names as opaque strings, you don’t need to buffer attributes, but as described above, this could lead to incorrect behavior.
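The buffering requirement can be sketched as follows. This is not how SAX or XmlReader is actually implemented; resolve_start_tag is a hypothetical helper that takes the raw tag name, the raw attributes in document order, and the prefix bindings inherited from the parent. The point is only that the element's qname cannot be computed until the last attribute has been scanned.

```python
from typing import Dict, List, Tuple

Attr = Tuple[str, str]

def resolve_start_tag(
    raw_name: str,
    raw_attrs: List[Attr],
    in_scope: Dict[str, str],
) -> Tuple[Tuple[str, str], List[Attr], Dict[str, str]]:
    """Hypothetical helper: resolve an element name after scanning all attributes."""
    scope = dict(in_scope)          # bindings inherited from the parent
    buffered: List[Attr] = []       # ordinary attributes held back until the end

    for name, value in raw_attrs:
        if name == "xmlns":
            scope[""] = value       # default namespace declaration
        elif name.startswith("xmlns:"):
            scope[name[len("xmlns:"):]] = value   # prefix (re)declaration
        else:
            buffered.append((name, value))

    # Only now is it safe to resolve the element's own prefix.
    prefix, _, local = raw_name.rpartition(":")
    qname = (scope.get(prefix, ""), local)
    return qname, buffered, scope

# The redeclaration arrives last, after the ordinary attributes.
qname, attrs, _ = resolve_start_tag(
    "p:elem",
    [("a", "1"), ("b", "2"), ("xmlns:p", "urn:example:two")],
    {"p": "urn:example:one"},
)
print(qname, attrs)
```

Prefixed attribute names would need the same treatment; the sketch leaves them unresolved to stay short.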

OK, so we have just shown that qnames for element names are a killer for perf. We can live with that, so what is the big deal for qnames in content? Look at the following XML:

here is some data: with a colon for no good reason x:address x:address

Now, do the last two “p:elem” elements contain the same text, or different text? If you compared using XSLT or XPath, what would be the result? How about if you used the values in XSD key/keyref? The answer is that XSLT and XPath have no way of knowing that you intend those last two elements to be qnames, so they will treat them as opaque strings. With XSD, you could type the node as qname and tell the difference in key/keyref, but you’d have to get rid of the first element, and you would likely have some other unintended side-effects to worry about. For example, suppose you loaded the document into a DOM, and decided to copy out each child node into a separate document. Most APIs are smart enough to inject namespace declarations if necessary, so the first node would write correctly as:

here is some data: with a colon for no good reason

But, since the DOM has no idea that you stuffed a qname in the element content, it’s got no way to know that you want to preserve the namespace for x:


There is really only one way to get around this, and this is for any API which writes XML to always emit namespace declarations for all namespaces in scope, whether they are used or not (or else understand enough about the XSD and make some guesses). Some APIs do this, but it is not something that all APIs can be trusted to do, and it yields horribly cluttered XML output and other problems.
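The copy problem can be reproduced with Python's ElementTree, used here as a stand-in for a DOM. The document and its URIs (urn:example:p, urn:example:x) are hypothetical. When the second child is serialized on its own, the serializer declares the namespace used by the element name, but it has no reason to declare the one that only appears inside the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the namespace URIs are placeholders for illustration.
DOC = """<root xmlns:p="urn:example:p" xmlns:x="urn:example:x">
  <p:elem>here is some data: with a colon for no good reason</p:elem>
  <p:elem>x:address</p:elem>
</root>"""

root = ET.fromstring(DOC)

# Serialize the second child as if copying it into its own document.
copied = ET.tostring(root[1], encoding="unicode")
print(copied)

# The element's own namespace is declared in the copy...
print("declares urn:example:p?", "urn:example:p" in copied)
# ...but the binding for the "x" prefix inside the text is gone.
print("declares urn:example:x?", "urn:example:x" in copied)
```

The text still reads x:address in the copy, but nothing in the new document says what x means.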

Since qnames in content are only treated as proper qnames in rare cases, and not necessarily reliably, it is good advice to always just avoid qnames in content when designing schemas. If you have tight control over your scenarios and data usage, you might get away with it, but otherwise don’t even try.

This is a particularly interesting situation when compared to the fact that, in RDF, all important data is identified by URI, often split between local and namespace parts. In other words, the whole RDF data model is a graph of qnames (with some literals dangling about). The fact that we have a problem has been well-known for many years, but the important thing to communicate at this point is that you shouldn’t be depending on qnames in content for anything that requires reliable equality-comparison (like identifiers).
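To make the equality problem concrete, here is a rough sketch that resolves qname-like text by hand, using the namespace events from ElementTree's iterparse. The document, the URIs and the helper name resolved_texts are all hypothetical. The helper also has to guess which text is meant to be a qname (here: anything whose part before the first colon is a prefix in scope), which is exactly the kind of guess a generic tool cannot make reliably.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: the same text "x:address" under two different bindings.
DOC = b"""<root xmlns:p="urn:example:p">
  <p:elem>here is some data: with a colon for no good reason</p:elem>
  <p:elem xmlns:x="urn:example:a">x:address</p:elem>
  <p:elem xmlns:x="urn:example:b">x:address</p:elem>
</root>"""

def resolved_texts(data: bytes):
    """Return (text, (namespace, local)) for text that looks like a qname."""
    scopes = [{}]    # stack of prefix maps, one per open element
    pending = {}     # declarations seen since the last start tag
    results = []
    events = ("start-ns", "start", "end")
    for event, payload in ET.iterparse(io.BytesIO(data), events=events):
        if event == "start-ns":
            prefix, uri = payload
            pending[prefix] = uri
        elif event == "start":
            scope = dict(scopes[-1])
            scope.update(pending)
            pending = {}
            scopes.append(scope)
        else:  # "end": the element's text is complete at this point
            scope = scopes.pop()
            text = (payload.text or "").strip()
            prefix, sep, local = text.partition(":")
            if sep and prefix in scope:   # a guess, not knowledge
                results.append((text, (scope[prefix], local)))
    return results

for text, qname in resolved_texts(DOC):
    print(text, "->", qname)
```

As opaque strings the last two values are equal; once resolved they are not.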

DoS Flaw in SOAP

This article discusses a fix that we’ve been working on internally for quite a while. Since this partially involves my product, I want to clarify a few things. First, the article says that we recommend avoiding DTS when possible; that should be DTDs, not DTS. Also, while it is correct that XSD is much less dangerous than DTDs for this sort of attack, XSD has some issues that people need to be aware of, so it shouldn’t be considered an “automatically secure” pass. Specifically, systems should not use untrusted external schemas for validation, since schema imports can be used maliciously to connect to other sites or files. Also, assessing key/keyref constraints in an XSD schema can need on the order of n^2 resources, so systems need to bound the size of the XML they validate, and generally be aware of the issues.
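As a rough illustration of the "bound the size" and "avoid DTDs" advice, here is a minimal guard placed in front of a parser. It is a sketch, not the fix the article describes: the function name parse_untrusted and the one-megabyte limit are assumptions, and the DOCTYPE check is a crude byte search that only works for ASCII-compatible encodings.

```python
import xml.etree.ElementTree as ET

MAX_XML_BYTES = 1_000_000  # hypothetical limit; tune for the actual workload

def parse_untrusted(data: bytes, max_bytes: int = MAX_XML_BYTES) -> ET.Element:
    """Reject oversized input and any DTD before handing data to the parser."""
    if len(data) > max_bytes:
        raise ValueError("XML input exceeds the configured size limit")
    # Crude, conservative check: also rejects documents that merely
    # mention "<!DOCTYPE" inside a comment or CDATA section.
    if b"<!DOCTYPE" in data:
        raise ValueError("documents with a DTD are not accepted")
    return ET.fromstring(data)

root = parse_untrusted(b"<order><item/></order>")
print(root.tag)
```

A size cap does not remove the quadratic cost of a naive key/keyref check; it only keeps n small enough for that cost to stay bounded.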

Music Business Models

Just returning from watching a really amazing Nutcracker by International Youth Ballet, and now watching an astonishingly bad hacker flick with Hugh Jackman and John Travolta on Cable; and some thoughts about music vs. TV come to mind.

This interview with Steve Jobs [via Scripting] is enlightening. Steve talks about the failure of subscription-based music services, something I’ve been surprised about. Like Aaron, I have assumed that some sort of subscription with usage-based distribution was inevitable. Not sure if I totally understand Steve’s argument against subscription yet, though. Just as interesting are his comments about the overhead in the current distribution system. And finally, he ends with an intriguing rant against television. Intriguing because he correctly asserts that TV is mostly useless bubblegum for the mind, but then goes on to say that the current generation is deprived because we listen to music fewer hours per week than his generation did. As if music is significantly less passive than television? Or was he just pandering to Rolling Stone? Music is more conducive to reflection, I suppose, but I wonder...


This AP report sheds some fascinating light on the fiasco that has been this year’s flu season. It’s been obvious for a while that the flu vaccine wasn’t working. Yet the media has whipped up the population into a frenzy with reports that “the vaccine has run out” and “the profiteering vaccine companies are leaving old people to die”, and the AARP has told members to “be persistent with your doctor”. One wonders how many people were left with increased vulnerability to the flu caused by suppressed immune function due to stress-induced elevations in cortisol levels. Standing in line with sick people and getting jabbed with a needle is certainly stressful, too. And a certain percentage of people always get sick from the flu vaccine itself. Of course, nobody can be directly blamed, but it seems that the whole vaccine system resulted in a significant net increase in deaths this year.

Interview Question

All of the recent talk about Mr. Tetris had me thinking of a new interview question. Think of the normal Tetris game: a sequence of random shapes appears on the screen, and the user responds with sequences of left, right, down arrow and spacebar until the screen is filled. Now, suppose that you are given a list of keys pressed in sequence from the beginning to the end of the game. Can you use this sequence of keypresses to figure out what sequence of shapes appeared in the game? Of course, it’s not possible to know the exact sequence, but this can serve as an interesting starting point to discuss the problem. Simplifying steps can be introduced; extra points if the interviewee asks for clarification on these points or others:

  • Given a sequence of keystrokes and a sequence of shapes, how would you determine whether the sequence of shapes is a sequence that might match the keys given, or could not possibly match the keys given?
  • Suppose that you know every third shape. How does that change the solution?
  • Suppose that you know every point at which a row cleared (and how many rows).
  • Suppose that you know every time a shape falls (as if space is always pressed).

This interview question is too complicated to be useful for determining the skill of an applicant for designing algorithms, but would be good for determining an applicant’s ability to decompose a problem and discuss possible solution approaches.
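One possible first decomposition step, under the last simplification above (every piece ends with a spacebar drop), is to cut the key log into one segment per piece. The key names ("left", "right", "down", "space") and both helper names are assumptions made for this sketch; it does not attempt to infer the shapes themselves.

```python
from typing import List

def split_by_drop(keys: List[str]) -> List[List[str]]:
    """Cut the key log into per-piece segments, each ended by a 'space' drop."""
    segments: List[List[str]] = []
    current: List[str] = []
    for key in keys:
        if key == "space":
            segments.append(current)   # the piece has landed
            current = []
        else:
            current.append(key)
    if current:
        segments.append(current)       # keys for a piece that never dropped
    return segments

def net_shift(segment: List[str]) -> int:
    """Requested horizontal movement for one piece (rights minus lefts)."""
    return segment.count("right") - segment.count("left")

log = ["left", "left", "down", "space", "right", "space", "space"]
for piece_number, segment in enumerate(split_by_drop(log), start=1):
    print(piece_number, segment, net_shift(segment))
```

The requested shift is only an upper bound on actual movement, since a piece stops at the wall; discussing how that limits what can be inferred about each shape is the sort of conversation the question is meant to start.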

WinFX Review

Today we had a checkpoint review of the APIs that our team is working on for WinFX (the Longhorn API). The review is an opportunity for people from around the company to look at our API design and comment on usability, security, performance, and generally anything else that needs to be looked at for an API that has aspirations to become the replacement for Win32 for the next decade. It’s our third such review, but probably the last one before we ship Whidbey. We had good attendance from the various API design owners around the company, and it went fairly well. Lots of good discussion, especially around usability where we still have lots of room for improvement. I couldn’t help but notice that more than half of the people who were in the room have blogs. I suppose it’s been this way since the DOTNET mailing lists were opened on Developmentor, but it’s still impressive to me to see how involved and publicly accessible the whole frameworks team is on an ongoing basis.

Tivo 2 and Comcast HDTV

Some notes from the Comcast HDTV install today. Maybe this will save someone some time. The Comcast HD Cable Box is a Motorola DCT5100, and doesn’t have the serial remote that the General Instruments had. Using the Tivo’s IR blaster, you need to configure the Tivo to connect to a General Instruments (that’s right, NOT Motorola) Cable Box, use “enter” to change channels, and use code 1006. And of course it goes without saying that the Tivo doesn’t take an HD signal yet, so you need to bypass the Tivo when watching HD channels.

Retiring the Four-Platform Framework

ZapThink recently offered a critique of Gartner’s “Four-Platform Framework” for Web Services:

“we have seen a recurring vision for Web Services that has outlived its questionable usefulness at representing how the market is implementing and producing products for real-world Web Services and SOA solutions – namely Gartner’s Four-Platform Framework of Web Services.”

For the most part, I agree with the analysis. They go into some detail, but their analysis is summed up by the quote, “the Four-Platform Framework is an example of what ZapThink calls a ‘horseless carriage’ mentality. That is, the framework applies traditional ways of thinking about existing markets to new, emerging markets.”

TPC Arms Race Continues

Oracle has just become the first company to break a million tpmC on a single box. The price per tpmC is still high, but more disturbing is the cluster result on Linux, where Oracle also breaks a million, but gets nearly fifty percent more tpmC than SQL Server and at nearly a third the price per tpmC. To our credit, the Oracle result comes two and a half years after the SQL Server result, but it’s proof that Oracle is still in the game with Oracle 10g, and we need to hurry up and ship something new.

Tips for SkyTrain at Newark Airport

Many people have horror stories about riding the SkyTrain at Newark Airport. It has a tendency to get stuck between stations, or to lock customers in and shuttle them around endlessly like cows in a claustrophobic boxcar. Here are my tips:

  • The attendants can’t really help you. They can radio ahead to have the software (Polsoft v1.0) on the trains rebooted, but that doesn’t help.
  • The driver can’t help you.
  • The emergency call buttons don’t work. Even when the train is operating smoothly, nobody is going to answer when you push that button. Try it.
  • However, you CAN help yourself. Each train car has two enclosed brake activators near the bottom of the seats. Simply break the plastic cover and pull the brake. When this happens, the train will be unable to move any further, and the doors of the train car can easily be pulled open. The doors at the station can easily be opened by pushing the lever handle and pulling. Then you can be on your way.
  • When the train is mechanically prevented from moving forward, putting an abrupt end to the driver’s joyride, he will probably run away as fast as possible, like a guilty dog who has just peed on your shoe. The attendants similarly stay as far away as possible from any human contact. It won’t matter, though, because you’ll be too busy accepting accolades from your grateful fellow passengers to pay attention to the driver and attendants.