XML: The Angle Bracket Tax

Everywhere I look, programmers and programming tools seem to have standardized on XML. Configuration files, build scripts, local data storage, code comments, project files, you name it -- if it's stored in a text file and needs to be retrieved and parsed, it's probably XML. I realize that we have to use something to represent reasonably human readable data stored in a text file, but XML sometimes feels an awful lot like using an enormous sledgehammer to drive common household nails.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/05/xml-the-angle-bracket-tax.html

I have to deal with data files that are basically just flat data (think of a simple “select * from table”). It bothers me every time a customer sends us an XML file… CSV is perfect for that kind of thing.

Erm… have you ever heard of the INTERNET which uses this stuff called HTML which is, well, to all intents and purposes… XML?!

Fuck no it’s not. Had the web been xml, with all it entails, it would never have taken off.

Oh, and some people tried to XMLify the web, with XHTML1.0, XHTML1.1 and a tentative XHTML2 spec.

Last time I checked, they failed epically and the bleeding edge moved to an actually feasible revision of HTML instead.

  1. YAML sucks. It’s really, really poor.
    Quite the convincing argument you have there, Robin!

Because of the way TCP/IP works, much of the XML bracket tax can be dismissed by the fact that a small message still costs a whole packet, and a packet can carry about 1400 bytes anyway.

Yes, OK, there are workarounds to optimize the smaller packets, but on the whole, I suspect you’ll find that sending 1 byte and sending 1000 bytes make very little difference over most connections.

Connections that compress data (such as VPNs) are really trying to fold 2k into 1k, not 1000 bytes into 500 bytes, so even there, it is really just a wash.

Once the data starts getting past the size of a frame, the cost of the tax starts dropping. Before that, it is almost free, except for the front and back end processing.

And that is where the real tax is - processor and memory overhead pushing data through an awkward envelope.

Still, if someone would just write an efficient parser for the lightweight stuff, 99.9% of the XML cases could be handled without it seeming like a sledgehammer.
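The packet argument in this comment can be sketched with a little arithmetic. This is only an illustration, assuming a 1460-byte maximum segment size (a 1500-byte Ethernet MTU minus 20 bytes each of IPv4 and TCP headers, with no options). The function name `segments_needed` is made up for the example, and it counts packets only; it says nothing about parsing cost.

```python
import math

# Assumption: 1500-byte Ethernet MTU minus 20-byte IPv4 and 20-byte TCP headers.
ASSUMED_MSS = 1460


def segments_needed(payload_bytes, mss=ASSUMED_MSS):
    """Return how many TCP segments a payload of this size occupies.

    Hypothetical helper for illustration; even a 1-byte payload
    still needs one full segment.
    """
    return max(1, math.ceil(payload_bytes / mss))


# A terse payload and a verbose one that both fit under the assumed MSS.
small = segments_needed(1)
verbose = segments_needed(1000)

# Once the payload passes one segment, the count grows with size.
just_over = segments_needed(1461)
larger = segments_needed(3000)
```

Under these assumptions a 1-byte payload and a 1000-byte payload both occupy a single segment, which is the point being made; the difference only shows up once the data spills past one segment.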

Quoting:
// It bothers me every time a customer sends us an XML file… CSV is perfect for that thing.

Thinking of “select * from …”: suppose one of the varchar fields contains commas, hard returns, or quotation marks. CSV all of a sudden becomes less simple. XML would handle all of that with no extra effort.

I absolutely agree that flat files are extremely useful when the situation calls for it (although I prefer pipe-delimited instead of comma), but if you’re working with more complex data or text, serialization, etc., XML is the way to go.
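To make the awkward-field case concrete, here is a small sketch using Python's standard `csv` and `xml.etree.ElementTree` modules. The row contents and the element names (`record`, `name`, `quote`, `notes`, `dept`) are invented for illustration. It shows that both formats can round-trip commas, quotes, and hard returns, provided a real library does the quoting or escaping rather than hand-rolled string splitting.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample row with a comma, double quotes, a hard return,
# and characters that XML itself must escape (& and <).
row = ["Smith, John", 'said "hi"', "line one\nline two", "R&D <west>"]

# CSV: the writer quotes fields and doubles embedded quotes.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(row)
csv_text = buf.getvalue()
csv_back = next(csv.reader(io.StringIO(csv_text, newline="")))

# XML: the serializer escapes & and < in element text.
record = ET.Element("record")
for tag, value in zip(("name", "quote", "notes", "dept"), row):
    ET.SubElement(record, tag).text = value
xml_text = ET.tostring(record, encoding="unicode")
xml_back = [child.text for child in ET.fromstring(xml_text)]
```

A naive `csv_text.split(",")` would break on the first field, which is the failure mode the comment describes. Note that XML is not entirely free either: `&` and `<` need escaping, it is just that the serializer does it for you.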

Bobby: Makefiles? Those poorly documented things*, that require tabs-not-spaces?

Makefiles? Seriously?

It’s not 1975, people. We don’t have to use stone knives and bearskins, no matter how scary that shining bronze is, okay?

Look, I’ve certainly seen XML be abused, but let’s not be ridiculous.

I think it’s great for configuration files, as long as you don’t do stupid things with it - and if “you can’t do stupid things with it” is our criterion for a proper tool, then none exist.

(* At least, last time I looked, there simply wasn’t any proper documentation on make(1)'s config file beyond the make sources, and what got passed down (“use tabs, or else!”) as received wisdom. Maybe someone’s documented it better since I last looked, but I really doubt it.)

And, Jeff, YAML? I know this will piss off the Python people, but indentation shouldn’t matter. If your parser depends on indents, that’s a problem, not a solution.

Let’s see…

The design goals for XML are:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

Based on this list I wouldn’t score XML more than 2/10.

If you’re parsing XML yourself (the “this.reply.FirstChild.NextSibling.FirstChild.FirstChild.FirstChild” situation described above), then no wonder you hate it. IMO, the beauty of XML is XPath, which lets us dig into XML config files by writing a (relatively) simple query expression.

And while other formats (e.g., YAML) may have better bindings for languages like C++, it’s not hard to write a little wrapper that will provide getInt(), getDouble() and even getList() wrappers for arbitrary XPath expressions. I’ve been using that approach for a few years with Xerces, MSXML and libxml2 parsers, and it’s a piece of cake.
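A wrapper of the kind described could look roughly like the following. This is a minimal sketch in Python rather than C++, using the limited XPath subset supported by the standard `xml.etree.ElementTree` module instead of Xerces, MSXML or libxml2. The class name `XmlConfig`, its method names, and the sample config are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


class XmlConfig:
    """Hypothetical typed accessors over path expressions."""

    def __init__(self, xml_text):
        self._root = ET.fromstring(xml_text)

    def _text(self, path):
        # findtext returns None when nothing matches the path.
        value = self._root.findtext(path)
        if value is None:
            raise KeyError(path)
        return value.strip()

    def get_int(self, path):
        return int(self._text(path))

    def get_double(self, path):
        return float(self._text(path))

    def get_list(self, path):
        # One string per matching element, in document order.
        return [(el.text or "").strip() for el in self._root.findall(path)]


# Invented sample configuration.
SAMPLE = """<config>
  <server>
    <port>8080</port>
    <timeout>2.5</timeout>
  </server>
  <hosts>
    <host>alpha</host>
    <host>beta</host>
  </hosts>
</config>"""

cfg = XmlConfig(SAMPLE)
port = cfg.get_int("server/port")
timeout = cfg.get_double("server/timeout")
hosts = cfg.get_list("hosts/host")
```

Paths are relative to the root element, so `server/port` replaces a chain of `FirstChild.NextSibling` calls with one expression. A full XPath engine would allow richer queries than this subset.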

This is disappointing.

Seriously, it’s been said an arbitrary number of times up to here, but I’ll add my voice.

For all its flaws, we’re far, far better off with a standard syntax and encoding scheme these days than without one. We used to have to code line-level parsing by hand. When I started my first job out of college, I was hand-parsing a mix of US and European EDI transactions. It was a bloody mess. (It was also in server-side JavaScript, which made it Super Happy Fun).

These days? That step isn’t even a line of code anymore – it’s invisible, I just get back the object representation now.

“But”, you say, “XML doesn’t buy you anything by itself, you still have to interpret the data!”

Yes, but you USED TO HAVE TO DO THAT AS WELL.

Fixed width, binary, or delimited formats didn’t magically interpret themselves either. You had to both parse them at the line-level, AS WELL as interpret the structure of the data you got out. The fact that most people probably mixed those two steps back then is not an argument in favour of that method.

As for the S-expression argument, it’s been said before and better: http://www.prescod.net/xml/sexprs.html . Once you start adding attributes and getting beyond trivial cases, S-expressions are no prettier than XML to either humans or machines.

YAML and JSON are both fine for what they do (when used in a “Plain Old XML Lite” scenario), but you do kind of have to ask yourself – “Is this software going to be used or maintained by people who aren’t iconoclasts about XML? Do I want to force people to learn YAML/JSON/This-other-pet-markup-language if they want to deal with my software?”

So xml is a glorified version of .txt?

I don’t know why some of the comments say use of XML isn’t about the tools. Everything we do in IT is about tools.

Tools are the things that make us productive and help us make other tools, and so on.

With XML the use of tools is critical. For this technology at least there’s a vast range of specialist tools to choose from, that work at a number of different levels - each one has its own strengths for particular tasks but also for the different ways that we all prefer to work. But the best thing is they’re all compatible (well, more or less).

We should not dismiss languages because they require specific tools to be most effective. Quite the reverse, we should continue to develop new languages, or enhance existing ones so that they can make better use of the tools and the enhanced processing power and memory we now have.

The good thing is that, 10 years on, XML tools still have a way to go - there’s so much more that can be done. I experiment with my own XML tools project (ironically it’s XPath-based - in my view the best bit of XML - though it’s not XML itself), and whilst I haven’t the resources to pull through half of my ideas into the finished product, I look forward to seeing continued innovation from the more well established players.

XML has a long way to go. When it comes, the replacement for XML will have to be pretty good, and not only that, but have the backing of a good portion of the tools creators out there.

My favourite XML quote is:
Some people, when confronted with a problem, think “I know, I’ll use XML.” Now they have two problems.

I think this is, actually, a paraphrase of a comment by Jamie Zawinski about regular expressions, but is just as apropos here.

There’s an insidious and darkly troubling reason for xml. The network providers want to completely privatize the internet. You know it and I know it. They want to charge for every little action, every email, every hot link, every mouse click, every single byte and bit. This is not a new idea. In fact, it was codified and the technology was finalized back in the mid 90s. There used to be an acronym for the umbrella organization and all the big boys signed up. Guess what is the basis for it. Yep! XML! Think about it. All they gotta do is count the tags and charge accordingly. Cha-ching! The whole thing kinda went underground and no one talks about it openly anymore and all the URLs I had are dead, but you can bet your ass it’s still there, just waiting. In the meantime, xml spreads and grows for some bizarre reason. Why? I don’t like it. You don’t like it. But, someone is pushing it, aren’t they? Now you know why. Screw xml. I will not use it.

Talk about paranoia…

Is this what you were talking about?

http://www.flickr.com/photos/chantastic/1590993819/sizes/l/

In my day to day job, it’s Microsoft that chose to use XML. All I see is some fancy user interface (‘design surface’). In those cases it’s no problem.

Have you ever tried to parse MIME headers?
It’s two orders of magnitude more complex than you imagine.
For a start, just think about the hundreds of buggy email clients, servers, proxies and forwarders, each implementing it slightly differently.
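As one small taste of that complexity, the same header text can arrive in more than one encoded form. The sketch below uses Python's standard `email.header` module; the two sample header values are invented, and it covers only well-formed encoded words, not the buggy real-world variants mentioned above.

```python
from email.header import decode_header, make_header

# Two invented encodings of the same subject text:
# quoted-printable ("q") and base64 ("b"), both declared as UTF-8.
quoted_printable = "=?utf-8?q?Caf=C3=A9?="
base64_form = "=?utf-8?b?Q2Fmw6k=?="


def decode_subject(raw_value):
    """Decode RFC 2047 encoded words into a plain string."""
    return str(make_header(decode_header(raw_value)))


from_qp = decode_subject(quoted_printable)
from_b64 = decode_subject(base64_form)
```

Both values decode to the same string, and a hand-written parser would have to handle each form, plus charsets, folding, and malformed input.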

I expected more from you, Jeff. Sad to see such a talented programmer express something so idiotic.

Then again, you did recently admit that you and the command line just don’t see eye-to-eye. So I guess I shouldn’t expect you to GET XML?