Our columnist, Kirk Pepperdine, interviews Dan Diephouse of MuleSource, the leading provider of open source service-oriented architecture infrastructure software. Dan covers what you need to know about XML processing frameworks, including which ones are the fastest.
Published March 2008. Author: Kirk Pepperdine
Thanks for coming, Dan. Can you introduce yourself and tell us a little about what you do?
Yes, my name is Dan, and I work for MuleSource. I'm very much involved in open source and web services and lots of things XML related. In particular, I started XFire, which is a Codehaus web services project. That has now transitioned into Apache CXF, and I'm busy there. I'm involved in Mule, and I'm involved in some other projects which I've started, like Jettison, which does XML-to-JSON conversion, and other random things as well. So the focus is web services, performance, and XML. That's what I've been looking at for the past couple of years.
When you think of web services, performance isn't really one of the things that comes to mind. You don't think of high-performance architectures that are using a lot of web services. Is that your experience, or is there a fallacy in that thinking?
There's definitely a class of applications which will never use XML, especially stock market applications, where they are trying to transmit just the minimal amount necessary and every microsecond is worth billions of dollars, or something like that. But in my experience XML can actually work well with most applications. There are many, many things you can do to create performant XML applications, and if you do it right, XML performance shouldn't be the bottleneck of your application. It's much more likely to be your database or something like that.
When you say "do it right", you're saying there's a wrong way of doing it and a right way of doing it. Could you describe, from your experience, some of the wrong ways that people use web services?
I don't know if "the wrong way" is the best way to come at it, but there definitely are guiding principles you can keep in mind. A wrong way to do it would be to use something like a DOM model when it's not necessary. A lot of people will take the XML, load it all up in memory as a document object, and then traverse it from there. That's actually quite inefficient, because it takes a whole load of memory and you end up with a two-phase process: you parse the XML, load it into a document object, and then do some kind of business logic with it, typically transferring it to your domain objects. A much more efficient way is to use a streaming XML model, which most parsers support now, and there are a lot of data binding frameworks which can help you do that. You can use something like the JAXB reference implementation, or you can use JiBX, and what these will do is actually read the XML off the wire, bit by bit, and build your domain objects at the same time. You never have to build the document object in memory.
Is this streaming model like the SAX event model, or is it slightly different from that?
There are two streaming models for Java. One is SAX, which is what they call a "push" API: you register a SAX content handler, and the parser calls your content handler whenever it encounters a new XML event, like a start element, some characters, or some attributes. And there is the opposite model, the pull model, which is called StAX, the Streaming API for XML. You move the parser forward one event at a time and pull the data out: you can call reader.next() and it will tell you what type of XML event you've encountered; if it's a start element event you can read the element name, and if it's character data inside an element you can read the element text, and so on and so forth. These are the two ways you can go from a streaming model to your domain objects. I think both JiBX and JAXB support them, and there are a couple of others out there as well.
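The pull loop Dan describes maps directly onto the StAX API in the JDK (javax.xml.stream). Here is a minimal sketch; the &lt;order&gt; document and element names are made up for illustration:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StaxPullExample {
    // Pull XML events one at a time and extract data as we go.
    public static String readCustomer(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        String customer = null;
        while (reader.hasNext()) {
            int event = reader.next();                    // advance the cursor one event
            if (event == XMLStreamConstants.START_ELEMENT
                    && "customer".equals(reader.getLocalName())) {
                customer = reader.getElementText();       // read the element's character data
            }
        }
        reader.close();
        return customer;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><customer>Acme</customer><total>42</total></order>";
        System.out.println(readCustomer(xml));            // prints Acme
    }
}
```

The factory call picks up whatever StAX implementation is on the classpath, so dropping in a faster parser such as Woodstox requires no code changes.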
This does bring up another question; there are more than a couple of others out there. There's a plethora of these XML parsing libraries. Why would one choose one over the other, and are there performance implications for using Xerces over something like Woodstox, which you've worked with?
There are two levels that you need to look at here on the performance side of things. The first is the XML parsing level, and the second is the data binding level. At the XML parsing level, most people don't ever actually worry about their XML parser, which can turn out to be a bad thing for performance, because there are some really fast parsers out there and there are some pretty slow ones. The fastest open source one that I've seen is Woodstox. Woodstox is made by this XML genius called Tatu Saloranta; I hope I got his name right. He's Finnish, I believe, an awesome guy, and he's just really nailed this performance thing and has been tweaking Woodstox for many, many years. He's created the fastest thing out there that I know of, and at the same time he's also created the most XML-conformant implementation out there. There are lots of things out there like Xerces, which is pretty conformant, but still not as conformant as Woodstox and not nearly as fast. I think we did some tests and Woodstox came out a median two times faster than Xerces at XML parsing, and up to twenty times faster in some scenarios.
How would you actually benchmark something like XML performance? What are the factors you need to consider? I could imagine that the level of nesting would have some effect, and the number of elements too. It might be that one XML parser is better for small documents and another is better for large ones. Some might be better on flat documents, some might be better on deeply nested ones. How do you define a benchmark to say Woodstox parses faster?
Right, that's a very good point. When you benchmark XML parsers, you really need a range of different documents to compare: deeply nested ones, ones with lots of attributes, ones with lots of text, and so on. The tests that I think Tatu looked at, and the ones that I've seen, use these different documents. When I said "median" a little while ago, I meant there were twenty-something different documents, and across them Woodstox was that many times faster. It really does depend on the application, but overall I've seen that Woodstox is pretty much the fastest thing out there and can beat pretty much anything else in all situations.
Do you find that single-processor versus multi-processor machines, and memory types, have an effect on parsing?
Oh, they have a major effect on XML parsing results. I don't understand all of the details, but if you have the right hardware it can definitely accelerate things even more, and I seem to recall bus speed being a limiting factor.
I didn't know if some of these libraries were intrinsically threaded?
Not really. But at the XML level, they don't have a lot of synchronization to worry about, other than the initial creation of an XML parser. They do recycle buffers, but other than that it's pretty straightforward.
XML parsing was one aspect of what we were talking about, and you said the next aspect was data binding?
Yeah, data binding is a huge part as well. If you can imagine the first generation of data binding frameworks, they were basically building a DOM out of the XML and then pulling the data out, kind of reversing the DOM. But then there's this whole next generation of frameworks out there which can actually stream directly to objects. One of these is JAXB, which is my all-around favorite because it's so easy for people to use; it's getting pretty well understood, it's part of Java 6 now, for better or worse, and it's pretty fast. It's not the fastest one out there, though for most applications that doesn't seem to matter. The fastest one out there, probably all around, is this thing called JiBX, and what that does is build a compiled reader which is specific to your domain objects. It will basically take a SAX content handler and create an optimized, compiled version of it for your POJOs or your domain objects. I've also been working on something for JAXB which does the same thing and should provide the same level of performance as JiBX; that's called SXC, and it's at sxc.codehaus.org. It's an XML compiler: it looks at the JAXB objects and builds a compiled, optimized reader and writer for them, so it's as fast as it can possibly be.
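As a rough illustration of what these frameworks do under the hood, here is a hand-rolled StAX reader that builds a domain object straight from the event stream, with no DOM in between. A real binder like JAXB or JiBX generates or compiles this kind of reader for you; the Order class and element names are invented for the example:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StreamBindingExample {
    // Hypothetical domain object.
    static class Order {
        String customer;
        int total;
    }

    // Populate the domain object directly from the event stream; no document
    // object is ever built in memory.
    static Order bindOrder(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Order order = new Order();
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                switch (r.getLocalName()) {
                    case "customer": order.customer = r.getElementText(); break;
                    case "total":    order.total = Integer.parseInt(r.getElementText()); break;
                }
            }
        }
        r.close();
        return order;
    }

    public static void main(String[] args) throws Exception {
        Order o = bindOrder("<order><customer>Acme</customer><total>42</total></order>");
        System.out.println(o.customer + " " + o.total);   // prints Acme 42
    }
}
```

Compiled binders win over this naive version mostly by avoiding per-element string comparisons and reflection, but the single-pass, stream-to-object shape is the same.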
And there's an ease of use factor that would make it attractive to people?
Yeah. It's easier to use than JiBX, because JiBX requires a compile-time step: it requires you to change your build and add mapping files. That's good for some applications, because it provides a degree of flexibility, but I think JAXB is quite a bit simpler for people.
Are there other data formats that should be considered, such as binary data formats and binary XML?
Yes, there's been some work out there on this thing called Fast Infoset. Sun in particular has been working on a Fast Infoset parser; it's available at fi.dev.java.net. Basically, instead of writing the XML as text, it writes it as binary data, so it should provide a bit more performance. I'm trying to remember the exact numbers, but it's something on the order of two times the performance at the raw XML level.
Is this a parsing improvement or is this a transportation improvement?
I think it's both. I don't know a lot about the details, but it's going to compact the data which needs to be sent, and it's also going to make it easier for the parser and the writer to read and write it.
The alternative is to compress the XML document yourself before you transmit it?
Yeah, the Fast Infoset website actually has benchmarks with and without compression, and it yields performance gains in some situations. But sometimes it can slow you down too, because you've got this extra compression step.
Right. So, you're basically trading off network latency for CPU?
Right. The question is, is network latency going to outweigh CPU time?
And presumably the CPU latency is another issue too? You need spare CPU to do the compression.
So wouldn't trading CPU for bandwidth run into the same issue, or is it just seen differently?
I think there are some scenarios. Once again, I'm not an expert in the Fast Infoset stuff, but we did a little bit of benchmarking on it, and there were definitely some scenarios where Fast Infoset was slower, even without compression.
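The trade-off being discussed can be sketched with plain GZIP from the JDK: compression shrinks the bytes on the wire, but the extra step costs CPU time. The repetitive document below is made up for illustration; verbose XML tags are exactly the kind of redundancy that compresses well.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionTradeoff {
    // GZIP-compress a byte array in memory.
    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Build a verbose, repetitive XML document.
        StringBuilder sb = new StringBuilder("<orders>");
        for (int i = 0; i < 100; i++) {
            sb.append("<order><customer>Acme</customer><total>")
              .append(i).append("</total></order>");
        }
        sb.append("</orders>");
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        byte[] packed = gzip(raw);
        long micros = (System.nanoTime() - start) / 1000;

        // Fewer bytes to send, but the compression step burned CPU time.
        System.out.println(raw.length + " -> " + packed.length
                + " bytes, compressed in " + micros + " us");
    }
}
```

Whether this is a net win is exactly the question in the interview: it depends on whether the saved network transfer time outweighs the measured CPU cost.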
With XML, do you find that network latency outstrips document management, or do you find that document creation, management, and manipulation is more resource intensive than just shipping documents from one point to another across the wire?
Hmm... I think it depends on your class of application, really. In general, I think we've got XML parsing to a level where it's probably not the bottleneck in most applications anymore. Latency is a little more of an issue: whether you're using XML or not, it's still going to take just as long to send an incredibly large purchase order across the wire as it would a picture. So if you're worried about milliseconds, I'd be very wary of using XML. But if you have small documents, you can definitely parse a document in under a millisecond now. You need to know whether you're transferring small documents, how much data you're sending around, and what your requirements are. As I mentioned before, we are parsing at the sub-millisecond level for a lot of documents. Of course, you've got to be thoughtful about what you're transmitting. If you're looking for really high performance and you have really long element names and not very much data, then your element names are going to make up something like 90% of your document, and if you're trying to do something very, very fast, that might make a difference. If you have lots of different namespaces, then you've got to encode all of those, and lots of namespace resolution going on will slow things down.
So, if you had to use different namespaces, would you recommend using different documents for them and doing multiple fetches? Or would you just not mix them up? How does that affect things?
No, no... I guess I would just recommend not transmitting redundant namespace declarations: keep to one prefix per namespace, and don't have thousands of them. If you can keep it to just three or four, I think most documents won't be an issue.
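Keeping to one prefix per namespace is straightforward with a StAX writer: declare the namespace once on the root element instead of redeclaring it on every child. A minimal sketch; the prefix, element names, and namespace URI are made up:

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

public class NamespacePrefixExample {
    public static String writeOrder() throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("ord", "order", "urn:example:orders");
        w.writeNamespace("ord", "urn:example:orders");   // declared once, on the root
        // Child elements reuse the in-scope prefix; no redeclaration is written.
        w.writeStartElement("ord", "customer", "urn:example:orders");
        w.writeCharacters("Acme");
        w.writeEndElement();
        w.writeEndElement();
        w.flush();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeOrder());
    }
}
```

The output carries a single xmlns declaration, which keeps both the document size and the parser's namespace-resolution work down.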
Would you consider tokenizing tags down to single characters? Do you find that really makes a difference?
Only if your document is very, very low on actual data. Let's say you're transmitting a list of numbers, and all these numbers are single digits, while your element names are all five to twenty characters, somewhere along there...
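The point about element names dominating a data-poor document can be made concrete with a rough measurement of how much of a document is markup versus character data. This simple scan ignores attributes and mixed content and is purely illustrative; the documents are invented:

```java
public class MarkupOverhead {
    // Rough fraction of the document taken up by markup rather than
    // character data. Counts everything between '<' and '>' as markup.
    static double markupFraction(String xml) {
        int data = 0;
        boolean inTag = false;
        for (char c : xml.toCharArray()) {
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) data++;
        }
        return 1.0 - (double) data / xml.length();
    }

    public static void main(String[] args) {
        // A single digit wrapped in long element names: markup dominates.
        String verbose = "<measurementValues><measurementValue>7</measurementValue></measurementValues>";
        String terse   = "<m><v>7</v></m>";
        System.out.printf("verbose: %.0f%% markup, terse: %.0f%% markup%n",
                100 * markupFraction(verbose), 100 * markupFraction(terse));
    }
}
```

As the interview notes, this only matters at the extreme: tokenizing tags buys little unless the payload really is a few characters per element.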
So you want to try to weight the document so that there's a significant percentage of actual content relative to its size?
Yes, but once again this is probably a little too far for most people. That's only if your app is really, really sensitive to the performance aspect of it.
What are your thoughts on JSON?
... that people are preferring to use JSON over XML, then?
Do you have any parting words of wisdom, or a tagline that we can attribute to you?
XML is not evil, only particular uses of it are evil.
Thanks, Dan. Thank you very much.