Revisiting the XML Angle Bracket Tax

@leppie the problem with s-expressions is the same as the problem with JSON: because you eval it directly, you had better hope that it contains only data. And to my eye, looking at s-expressions is no easier, nor harder, than XML.

Nobody ever said XML was anything revolutionary. It’s not a silver bullet. But the fact that it brought together a bunch of disparate technologies is hardly a counterpoint. (The Bugatti Veyron takes all the best know-how to make one kick-ass car, therefore it sucks. No wait, that can’t be right.)

@RevMike - it’s entirely possible that you shouldn’t be querying multi-gigabyte files in pure XML. Nobody ever said XML was a replacement for a database, or for other file structures. Index nodes in the file, or use the XML as the basis for a cached copy if it’s a measured performance bottleneck, or convert it to a format that works for its intended use.

Peter Palludan,

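(defparameter summary
  '((pros
     "everybody can do it"
     "global standard"
     "it can be used for almost everything")
    (cons
     "too verbose"
     "hard to parse"
     "another language")))

;; SAX? DOM? Pah!

;; Initialize both lists to NIL so that PUSH has a bound variable to work on.
(defvar good nil)
(defvar bad nil)

;; (first summary) is the (pros ...) list; REST drops the label.
(dolist (yay (rest (first summary)))
  (push yay good))

;; (second summary) is the (cons ...) list.
(dolist (nay (rest (second summary)))
  (push nay bad))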

XML is great for taking certain types of complex data and storing it in a file that’s not crazy.

But it boggles my mind, and infuriates me, when someone tries to store really simple information that way. Every once in a while an open source project takes a simple config file, not much longer than the vegetable=carrot example, and converts it to some six-layer-deep XML monstrosity. What’s WRONG with those developers? Are they just not thinking about what they’re doing? Do they not understand the purpose of XML in the first place? The purpose of config files?

Here are three types of data for which XML is a Really Bad Idea™:

  • Non-hierarchical data, since you’ll have to deal with idrefs everywhere. Just use a DB.
  • key=value data, but don’t come around complaining when your neat little format turns crazy since you hacked it to contain 2-dimensional arrays and ID references.
  • Enormous amounts of data - Use a DB or a custom binary format to optimize the handling.

I’d say the line-by-line method for your config files is fine, if you have a spec that includes strict restrictions on what can be in this file. Strangely, the software using such ‘simple’ files seems to be exactly the software without such specifications.

Once you start having strings in there, you’ll already have to guess the encoding, and probably how to escape line endings and = signs as well. Before you know it you’re trapped in this and have to actually think about what to do when writing your config file. With XML you just use a ready-made library and there you are.
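As a rough illustration of the "ready-made library" point, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element and attribute names (config, setting, name) and the sample values are made up for this example; the point is only that the library, not the author of the config format, deals with encoding and with escaping of line breaks, = signs, & and <.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings, chosen to contain awkward characters.
SETTINGS = {"vegetable": "carrot", "motto": "a = b\nsecond line & <more>"}


def dump_config(values):
    """Serialize a flat dict of strings to UTF-8 encoded XML bytes."""
    root = ET.Element("config")
    for key, value in values.items():
        item = ET.SubElement(root, "setting", name=key)
        item.text = value  # escaping of &, < and > happens on output
    return ET.tostring(root, encoding="utf-8")


def load_config(data):
    """Parse bytes produced by dump_config back into a dict."""
    root = ET.fromstring(data)
    return {item.get("name"): item.text or "" for item in root.iter("setting")}


if __name__ == "__main__":
    encoded = dump_config(SETTINGS)
    print(encoded.decode("utf-8"))
    print(load_config(encoded) == SETTINGS)
```

Nothing here says the result is pleasant to read; it only shows that the escaping questions do not have to be answered by hand.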

For easy readability you can still use a graphical XML viewer. But I’m pretty sure that some clever syntax highlighting will get you most of the way already.

For me it’s not XML itself I hate but rather the developers who think that by selecting XML one actually solves any major problem. One always has the problem of deciding on good structuring, referencing and representation. If you have a simple thing to send or store, why use XML? If you have a complex thing, does XML help? How? Outside web services I seldom see it…

Quote from the Subversion 1.4 release notes:

Working copy performance improvements (client)

The way in which the Subversion client manages your working copy has undergone radical changes. The .svn/entries file is no longer XML, and the client has become smarter about the way it manages and stores property metadata.

As a result, there are substantial performance improvements. The new working copy format allows the client to more quickly search a working copy, detect file modifications, manage property metadata, and deal with large files. The overall disk footprint is smaller as well, with fewer inodes being used. Additionally, a number of long standing bugs related to merging and copying have been fixed.

Says it all

It is odd how people feel that the alternative to using XML is to use an in-house format.
Why not have 3 or 4 standards, ranging from “simple but limited” to “comprehensive but complex”? In fact, we already have these, in the form of everything from XML to a simple list of line-separated entries. (Ever encountered the dreaded list of names, all surrounded by “name” tags, AND THAT’S ALL THAT’S IN THE FILE? XML was not needed there.)

Think of it like Newtonian and Einsteinian physics. Einsteinian physics provides a more accurate model, but Newtonian physics is used for most situations, because it’s easier to do the maths, and on the scales where it’s used, there’s no practical difference in the results.

I figure a lot of people who have declared geekhad on Jeff have encountered someone who developed an in-house format all their own, and are assuming that’s what he suggested. Surely all he’s saying is that you should consider other options before the knee-jerk XML route. There’s probably a list of questions you could ask, a few examples being

  1. How much data is being used?
  2. How deep does that data go?

Anyone got any others?

[Third time’s a charm :-)]

Thinking about it, most of the repetition (and therefore visual crud) comes from the closing tags repeating the name in the opening tag. I’m just thinking out loud here, but how hard would it be to come up with a shorter “default” closing tag? For example, an empty closing tag ("</>") could be used to mean “close the innermost open tag”. The sample XML would look like this:

<doc>
<fruit>pear</> <!-- closes “fruit” -->
<vegetable>carrot</> <!-- closes “vegetable” -->
<topping>wax</> <!-- closes “topping” -->
</> <!-- closes “doc” -->

It’s just syntactic sugar, but it’s considerably shorter than the original (if you remove my comments, of course).
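To get a feel for how hard it would be, here is a minimal Python sketch that expands the proposed short closer back into ordinary XML by keeping a stack of open tags. It assumes the short closer is written as `</>`, and it is a toy: it skips comments, declarations and processing instructions, but it does not handle CDATA sections or a `>` inside an attribute value. The function name expand_short_closers is made up for this example.

```python
import re

# One token is a comment, a declaration/processing instruction, or a tag.
# For the short closer "</>" the "name" group is empty.
TOKEN = re.compile(
    r"<!--.*?-->|<[?!][^>]*>|<(?P<close>/?)(?P<name>[^\s/>]*)(?P<rest>[^>]*)>",
    re.DOTALL,
)


def expand_short_closers(text):
    """Rewrite every closing tag, including "</>", as "</name>"."""
    stack = []  # names of the currently open elements, innermost last

    def replace(match):
        token = match.group(0)
        if token.startswith(("<!", "<?")):
            return token  # comments and declarations pass through untouched
        name = match.group("name")
        if match.group("close"):
            if not stack:
                raise ValueError("closing tag with no open element")
            opened = stack.pop()
            if name and name != opened:
                raise ValueError(f"expected </{opened}>, found </{name}>")
            return f"</{opened}>"
        if not match.group("rest").rstrip().endswith("/"):
            stack.append(name)  # ordinary opening tag; self-closing tags are skipped
        return token

    result = TOKEN.sub(replace, text)
    if stack:
        raise ValueError(f"unclosed element <{stack[-1]}>")
    return result


if __name__ == "__main__":
    print(expand_short_closers("<doc><fruit>pear</><vegetable>carrot</></>"))
```

The stack is exactly what a parser for the shorter syntax would need anyway, so the saving is in typing and reading, not in parsing effort.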

Two Points

  1. XML is unreadable - how many times do you end up reformatting the web.config file to line the attributes up just so you have a chance of reading it?

[add name=“zzzzzzz” value=“vvvvvvv” /]
[add name=“z”_____ value=“v” /]

  2. .NET needs to support other text-based serializers out of the box before we stand a chance of anything changing. I bet that list of names mentioned by a previous poster was a serialized array (zero effort on the part of the coder to persist some data to a file).

The project I’m working on uses XML for all its configuration and output data. Some of these files are pleasant to work with whilst others are truly dreadful. The main difference between them is the quality of the design and the level of understanding of XML the designers had.

The well designed files are easy to read and easy to edit (especially in an editor that can parse the schema and do auto-completion).

The poorly designed files fail on many levels. For example, the system we’re developing consists of a set of software components. The components to instantiate are defined in an XML file. Each component has a set of parameters for each instance. These parameters are stored as child elements of the component definition, as a key/value pair list. Components reference their parameters by name, and the name has to be cross-referenced to the key using another XML file. The upshot is that hand editing the file is impossible, even with auto-completion. An example (convert to XML):

component
__id
__name
__type
__parameter
____parameter_id1
____value1
__parameter
____parameter_id2
____value2
__parameter
____parameter_id3
____value3

parameter_definition
__parameter
____parameter_name1
____parameter_id1
__parameter
____parameter_name2
____parameter_id2
__parameter
____parameter_name3
____parameter_id3

So, when the component is instantiated it attempts to get parameter ‘parameter_name3’ which has to be found in the parameter_definition table to get parameter_id3 (no guarantee it’s there though) then use parameter_id3 in the component’s local parameter table. So even though you have XML that conforms to the schema, data can be invalid or even missing.
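The two-step lookup can be sketched in Python with the standard xml.etree.ElementTree. The XML below is only an assumed rendering of the outline above (the real files are not shown in this thread), and the sample names and values are invented. Both documents would pass a structural schema check, yet one lookup still fails because the id it resolves to is missing from the component.

```python
import xml.etree.ElementTree as ET

# Assumed XML form of the outline; ids, names and values are made up.
COMPONENT_XML = """
<component>
  <id>c1</id>
  <name>pump</name>
  <type>demo</type>
  <parameter><parameter_id>id1</parameter_id><value>10</value></parameter>
  <parameter><parameter_id>id2</parameter_id><value>20</value></parameter>
</component>
"""

DEFINITION_XML = """
<parameter_definition>
  <parameter><parameter_name>speed</parameter_name><parameter_id>id1</parameter_id></parameter>
  <parameter><parameter_name>limit</parameter_name><parameter_id>id3</parameter_id></parameter>
</parameter_definition>
"""


def resolve(component, definitions, name):
    """Look up name -> id in the definitions, then id -> value in the component."""
    name_to_id = {
        p.findtext("parameter_name"): p.findtext("parameter_id")
        for p in definitions.iter("parameter")
    }
    id_to_value = {
        p.findtext("parameter_id"): p.findtext("value")
        for p in component.iter("parameter")
    }
    if name not in name_to_id:
        raise LookupError(f"no definition for parameter {name!r}")
    param_id = name_to_id[name]
    if param_id not in id_to_value:
        raise LookupError(f"{name!r} maps to {param_id!r}, which the component lacks")
    return id_to_value[param_id]


if __name__ == "__main__":
    component = ET.fromstring(COMPONENT_XML.strip())
    definitions = ET.fromstring(DEFINITION_XML.strip())
    print(resolve(component, definitions, "speed"))
    try:
        resolve(component, definitions, "limit")
    except LookupError as error:
        print("lookup failed:", error)
```

A check like this has to live outside the schema, which is the complaint: the indirection moves a whole class of errors from validation time to run time.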

With XML and well designed schemas, the source control check-in process can be set up to validate XML files against their schemas:

on check in xml
test against checked in schema defined in XML file
pass schema check - check in file
fail schema check - display error, don’t check in file

As Jeff points out, it’s all about using the right tool for the job. You wouldn’t use XML to transmit data across a CAN bus for example, or as configuration data on limited performance embedded systems.

Although XML is very useful, it’s not a panacea for all software problems.

Skizz

When XML came out, and then exploded in popularity, I groaned on the inside, since even though everyone was parroting the lie that XML would never be read by humans, it was painfully obvious that it would.

I had someone try to rewrite my code to use XML; they broke it horribly and then spent a year trying to make it work, while I carried on blithely with my original code and finished a 5 year project in about 1.5 years.

Once again, the issue is that XML isn’t meant to be parsed by a human. Its intention is not to be human readable.

well, actually, the officially given reason for repeating the opening tag name in the closing tag (there’s no real reason to) is for parsing by a human. :) heh

anyway - I just read a relevant paragraph in the O’Reilly book “RESTful Web Services” which I thought I’d share (just the conclusion) -

“JSON is useful when you need to describe a data structure that doesn’t fit easily into the document paradigm”

Windows only, but I recommend the Liquid XML Studio freeware. It makes parsing those XSDs, and making sure that you’ve created documents that match them, much easier.

http://www.liquid-technologies.com/Product_XmlStudio.aspx

Haven’t tried the paid version, but the free one is OK. A tad buggy, in that sometimes the search gets lost and stuff, but for freeware really good.

Jeff, I disagree often and a lot with things you write in your blog; but hey, it’s everyone’s right to have his/her own opinion, isn’t it? Still your opinion matters to me; why would I otherwise even read your blog? I read your blog, because sometimes you come up with really interesting ideas and aspects most people never even thought of. Many programmers take certain ideas as facts; “that’s just the way it is”. They don’t even dare to question them. You question many of these and even if you may not be able to come up with a solution to all problems, your blog at least makes people aware of possible issues, kind of “See this, now see that… see the problem?” and this often causes an “Ohhh” or “Ahah” effect. People start to reconsider their facts and recognize that these are not set in stone.

Back to topic: XML is not a fact. XML is not God-given. XML is an idea. An idea that got popular. XML might be a good solution for some or even many problems; however, it may be a poor solution for other problems, and even where it works, there might still be better solutions. You seem to dislike XML and, guess what, this is one of the topics on which I seem to agree with you.

The main problem I have with XML is: For whom is this language actually designed?

A) For human beings, so you have human readable data? Really? Well, as you pointed out before, XML is very hard to read. Easy samples like shown above are still human parsable, but I can give you a 2 MB XML file that will make you cry. XML is not for human beings, it’s too verbose and too complicated once the data file grows beyond certain limits.

B) For computers, so you have a standard way to store arbitrary data? Certainly not. XML is far from being easily parsable for machines. I can think of data formats that are 100 times easier to parse if they only need to be machine parsable.

So if XML is neither for A nor for B, what is it good for anyway? I guess it is an attempt to create a format that is at least somewhat human readable and at least somewhat easy to parse for machines. Bad choice!

Instead I would have created two equivalent formats, so that you can always convert between them in a 100% lossless way: one that is very easy for machines to read and one that is very easy for human beings to read. Sounds like a much better approach to me.

Actually Apple had such an approach. Apple has the old NextStep PLIST format, which is very easy for human eyes to read. And they have a very compact binary format. Both were replaced by XML instead. The NextStep Format is legacy and the binary is not legacy, but it’s not the default format being used either.

Here’s an example for the new XML PLISTs:
http://tinyurl.com/54dv52

Compare this to the old, human readable ones (much more readable):
http://tinyurl.com/5yefeb

There is no description for binary PLISTs; but be assured, these are optimized for being machine readable.
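The "two equivalent formats" idea can be tried out with Python's standard plistlib module, which reads and writes both the XML and the binary property list formats (it does not handle the old NeXTSTEP text format). This is a minimal sketch with invented sample data, and the helper name convert is made up; it only shows that simple data survives a round trip between the two representations.

```python
import plistlib

# Invented sample data using only basic plist types.
SETTINGS = {"vegetable": "carrot", "toppings": ["wax", "salt"], "count": 3}


def convert(data, fmt):
    """Re-encode plist bytes in the requested format (FMT_XML or FMT_BINARY)."""
    return plistlib.dumps(plistlib.loads(data), fmt=fmt)


if __name__ == "__main__":
    as_xml = plistlib.dumps(SETTINGS, fmt=plistlib.FMT_XML)
    as_binary = convert(as_xml, plistlib.FMT_BINARY)
    back_to_xml = convert(as_binary, plistlib.FMT_XML)

    print(as_xml.decode("utf-8"))
    print(len(as_xml), "bytes as XML,", len(as_binary), "bytes as binary")
    print(plistlib.loads(back_to_xml) == SETTINGS)
```

The human-facing form and the machine-facing form carry the same data, so either can be regenerated from the other on demand.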

I think you’re missing a detail: XML is meant to be read and edited by humans only as a last resort. The normal situation is to edit and consume them using tools written for that purpose.

I mean, you don’t really bash the bmp format because it’s verbose and hard to read and edit, do you? That’s because you understand that you’re not supposed to edit it with a hex editor. Same goes for wav files. And so on.

Why does XML baffle you so much? Because it looks readable by a human, albeit with some difficulty, instead of looking like line noise? That doesn’t mean it’s meant to be edited in its raw form.

Moreover, programming is not an art. It’s an engineering discipline, therefore it’s not about creating the perfect program; it’s about delivering something good enough for the customer, on time and with the least possible cost.
Sure, JSON, S-lists and custom binary protocols would be better solutions, performance-wise. But are the tools to produce and consume these formats as cheap and pervasive and tested as those for XML? Not as far as I know. So, yeah, XML is almost always the best choice whenever you have to serialize a tree structure. Its performance may suck compared to the alternatives, but it’s cheaper and allows you to deliver sooner, and that makes your customers happier than a more expensive program delivered later that runs 20% faster.

XML is powerful and that’s why it is popular. You can represent almost any data model using it.

Should you use it to represent a simple properties list? I would rather just use a Java properties file or a Windows INI file. But for anything more complex than a simple key = value file, I would use it.

XML is text based which means it is easy to work with. It is white space neutral and EOL agnostic which many of its proposed replacements are not. I’ve used some of them, and when you’re looking at trying to keep the seventh level of indent straight, you’ll find that they can be even more of a pain to use.

That said: XML is very hard to read and write. But, there are hundreds of XML editors that can help. Use them.

And does Microsoft overuse it? Of course they do. Microsoft takes everything to the extreme.

Your company is based upon proprietary technology? Then reject ALL open source as pure evil. Make sure your stuff works with nothing open source without a lot of pain and suffering. Keep all protocols secret and keep updating them, so anything open source that tries to use them will fail. Spend hundreds of hours inventing your own way of doing everything even though there is already a readily available solution being used by everyone else. Take standards and extend them until they break!

Is XML good? Then make everything XML. Save files in XML format. Make all configurations, no matter how simple, in XML. Make XMLs of XML files.

Were there too many DOS INI files in those pre-Windows days, and did they tend to be all over the place? Then create a massive binary repository for all settings. Heck, even create new settings that can go in there. Link everything together with GUIDs and make it so fragile that one random change can destroy everything.

Maybe they should be drinking a bit more decaf over in Redmond.

XML is bad for many reasons already outlined, but YAML is not a good solution imho. JSON is better. YAML obscures the structure with whitespace. It’s dangerous: if you let someone who is not experienced with YAML edit a YAML file, they can completely mess up the structure of the data.

On the subject of XML and abuses thereof, how do you feel about XML comments in C#, Jeff? My feeling is that they are hideous and bloated, but that I can’t get away from them because there’s no other way to get the benefit of comments appearing in Intellisense tooltips and stuff like that. I only wish that they’d chosen a more concise format that’s human-readable. Either that or a more advanced source editor that renders and edits the XML comments differently from the rest of the code.

I was under the impression that XML was arrived at so that endless discussions like these wouldn’t happen!

Make XMLs of XML files.

We will most certainly need XML files to keep track of our XMLs of XML files. I’m envisioning a beautiful forest of angle trees…