It's a Malformed World


When I daydream, I often travel back in time. To buy Netscape stock before it set records? No. I go back to a time before CSS and before browsers were more than just a CSCI project. I go back to correct the horrible, horrible mistakes made by those who created the XHTML/CSS standards, mistakes which are to blame for the situation that we now toil under.

I want to flesh this out into a proper blog entry, but here are some points that I think are relevant to this discussion:

1: Making a language forgiving does not make it easier. In fact, it almost always has the opposite result. Example: in old VB, what is the result of 1 + “3.5”? It’s not immediately obvious (is it “13.5” or 4.5?). Is it really harder to write 1 + toInt(“3.5”)? Forgiveness necessitates ambiguity. Errors happen in front of your audience, which is using a browser that you don’t have and that came to different conclusions about your ambiguous code.
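The same ambiguity shows up in JavaScript, another forgiving language, and it happens to pick the other answer; a quick sketch (not part of the original VB example):

// Implicit coercion: the language decides, and it may not decide what you meant.
1 + "3.5";               // "13.5" (string concatenation)
// Explicit conversion: slightly more typing, no ambiguity.
1 + parseFloat("3.5");   // 4.5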

2: Strict enforcement of the language does not necessarily make it hard. The trouble with what is out there is that we lack a sufficiently complete CSS standard. Powerful tools could have been provided without confusing ambiguity. Cases in point:

CSS Conditionals:
It would be great to be able to ask, from within CSS, whether a feature is supported or which browser you are dealing with, and then implement code accordingly. Why not this kind of statement:

min-width? (min-width: 40%;) | (width: [Microsoft garbage here];)

The ‘?’ would basically have asked, “Is it supported?”. How much simpler is this compared to what we have to do now? Just this one feature alone would suck much of the horrible complexity out of what we deal with today. And asking “which browser/version” would cover bad implementations as well. For cases like this:

browser("Explorer", < 7)? (background("Crapy.gif");) | (background("Pretty.png");)

Or how about the ability to do this:

margin-top: =(SomeIDName(width) - 2px);
…or…
width: (80% - 2px);
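As a hedged aside, roughly these ideas can be written with the @supports and calc() syntax that later made it into CSS; the .content selector and the fallback value below are invented for illustration:

/* Ask whether min-width is supported and branch on the answer, instead of sniffing browsers. */
@supports (min-width: 40%) {
  .content { min-width: 40%; }
}
@supports not (min-width: 40%) {
  .content { width: 40em; } /* fallback for less capable browsers */
}

/* Arithmetic on lengths, without an extra spacer DIV. */
.content { width: calc(80% - 2px); }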

Features like the above examples would fundamentally change the way XHTML is designed. They would remove most of the empty DIV tags that we’re forced to pollute our beautiful code with.

Most, that is, except for when we have to make a rounded box. Enough has been written on that subject; I don’t need to hammer out more fantasy pseudo-code, which would just serve to depress.

The main point is that XHTML has ALMOST enough in it to be completely fine (add br/ back in and I’m happy). CSS, which is designed for graphic designers, does not have enough positioning and conditional tools to solve most of our problems without depending on browser bugs and redundant, ugly code. Demanding quotes, closed end tags and unique IDs makes it much easier for design tools and browsers to interpret the code. Being lazy with these details solves nothing.

Another reason to try to get as close to well-formed or even valid as possible (no matter what your DTD) is predictability.

These days all the browser developers follow the W3C specs to a good degree. There is predictable behavior here for a given input.

But each browser developer necessarily handles malformed or invalid code differently. The W3C specs don’t tell developers what to do with malformed stuff, so each browser pretty much does its own thing. There is some consistency here after all these years, but it’s mostly coincidence.

Therefore, it’s in people’s interests, mostly for their own sanity’s sake, to follow the W3C specs to the extent that they are able.

This is all more critical for complex designs that use a lot of CSS and JavaScript than it is for simpler sites, which can usually get by with less rigor about well-formedness. So people aspiring to do complex or intricate designs need to care; people with more modest aims don’t. This to me is a pretty good state of affairs, even though the whole story is a little ugly and overcomplicated.

What Simon Willison said.

People on teh interweb will cite you lots of reasons for writing standards compliant markup. Almost all of those reasons are bullshit. Aside from satisfying your personal sense of aesthetics, the one practical reason to validate is that creating any web page means wandering through a VAST space of cross-browser rendering errors. Validating your markup shrinks that space dramatically. That’s all.

I’m not a web developer or a designer. That probably explains why I feel concerned about valid HTML: it’s a very “young” notion in the web world. Web designers with experience have passed through many iterations of HTML. What they learned years ago still works.

The real reason why there are so many invalid pages (in the eyes of the new specifications) is that HTML started out plain badly (in the eyes of those who make up specifications) and changed a lot. Moreover, new specifications often make constructs from the previous ones obsolete. Imagine if the evolution of ANSI C/C++ had come with the same ratio of “breaking” changes. Thank God, it mostly added stuff.

But in the end, since browsers and users don’t care if your HTML is malformed, it means this debate is about maintainability. Valid HTML is a coding standard, nothing more.

Your users don’t care if your HTML is well-formed. So why should you?

For myself.

  • For my pride: because I see myself as a craftsman, I consider producing valid, well-formed markup part of my craft, part of not just doing my job but doing it well.
  • For my sanity: the rendering of malformed markup depends on how each HTML parser “patches” it, while well-formed markup lets me know exactly how my HTML will be parsed (see the sketch just after this list).
  • For my productivity: because of the previous point, well-formed markup makes scripting much easier and cleaner. It’s cleaner still if the HTML is semantic as well.
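A small made-up example of what that “patching” means in practice:

<!-- Tag soup: the <b> and the first <p> are never closed. Different error-recovery code can build different DOM trees from this, so rendering and scripting results vary by parser. -->
<p><b>Bold text
<p>Is this paragraph bold too?

<!-- Well-formed: there is only one tree this can become. -->
<p><b>Bold text</b></p>
<p>Definitely not bold.</p>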

The only validation errors I “allow” my pages to report are non-standard attributes, because I sometimes use them to enhance my documents for scripting purposes (custom, non-standard attributes are much more flexible than classes alone, and much more powerful).
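A minimal sketch of what I mean by that; the element id and the attribute name are invented for the example:

// Read a custom (non-standard) attribute that exists purely to drive a script.
var list = document.getElementById("productList");
var order = list.getAttribute("sortorder");
if (order === "descending") {
    // ...re-sort the list items on the client side...
}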

And to end this post, browsers may care about the well-formedness of your markup: just try to feed true XHTML to Firefox (served with the application/xhtml+xml content type) and see what it does when the markup is malformed.

These days all the browser developers follow the W3C specs to a good degree.

Uh… no.

HTML and forgiving browsers are part of the age of no flowcharts, no specs, no thought for memory management, and little concern for scalability (other than to verify it runs using the Northwind database in Access).

I have to agree that the biggest reason to go for validation is to be able to debug.

Nothing bugs me more than when I’m writing a Greasemonkey script and the site I’m modifying already has tons of JavaScript errors. Same thing with XHTML. If one of your readers tells you that things are going wacky in IE6 because of your latest post, it’s going to be harder to track down if you don’t already validate.

In terms of handling basic well-formed HTML according to the specs, the browser developers do indeed follow the W3C specs to a good degree, although they’re by no means perfect yet. IE 6 and earlier have some annoying CSS bugs, but many are fixed in IE 7; nonetheless IE has done HTML well for years, and it was the first browser with decent CSS support. The Gecko, KHTML and Opera rendering engines all do tremendously well against the W3C specs.

Browsers that accept less-than-perfect code have lowered the barrier to entry for people of all skill levels to participate in the Web ever since it started. Why shouldn’t browsers continue to be forgiving for those people? Why make the Web more brittle than it needs to be? Invalid and ill-formed pages have allowed people to Get Things Done and get them on the Web easily, cheaply and quickly. Doing it right is another matter, but that’s possible too. Best of both worlds.

Is Jeff being serious here? Isn’t this a little like saying “Your compiler doesn’t care if your code is well designed. Your users don’t care if your code is well designed. So why should you?” The answer is simple: readability, writability, and maintainability.

When writing code, to take the easy road is to just write code, web pages, etc. that “just work.” The compiler only cares whether the syntax and typing (if the language is typed) are correct. But in the long run this only makes the reading, writing, and maintenance of code much more difficult. Indeed, one of the major arguments in favor of (static) typing in programming languages is that it enforces certain standards in code. Why is it that we make fun of “bad” code posted to The Daily WTF (much of the code posted there does work, after all)? Why is it that most serious discussions on programming focus on good design, not merely getting software to work? Well designed code is easier to write, read, and maintain. Well designed code is less likely to break tomorrow when something is changed today.

Think about all of the major advances in programming over the past 30 years. Most of those advances were advances because they made the reading, writing, and maintenance of code much easier. As far as the user and computer are concerned, there isn’t anything we can do now that couldn’t be done 30 years ago (programming languages were just as Turing complete then as they are now). But as far as we, the programmers, are concerned, there is much more we can do now than we could do 30 years ago.

My point in making the analogy (if you want to call it such) between HTML and programs is that a lot of the reasons for being concerned with proper HTML and CSS (indeed the reason for creating the standards of HTML and CSS in the first place) is to make it possible for the web developer to create writable, readable, and maintainable web pages. Indeed, your users don’t care, but that doesn’t mean that you shouldn’t.

between HTML and programs

Is markup exactly the same as code? I don’t think so. Along those same lines, consider the role of the compiler in dynamically typed languages. It’s far less useful, because it can’t tell what you’re trying to do.

But, in the long run this only makes the reading, writing, and maintenance of code much more difficult.

And yet plenty of XHTML validation rules cause more pain:

http://codinginparadise.org/weblog/2005/08/xhtml-considered-harmful.html

Yeah, yours falls into the 94% category. It’s a shame people can’t write clean code; it’s not like it’s hard to write clean XHTML even when it’s dynamically generated.

Tag soup is a great evil. It makes writing tools to extract information from “HTML” pages a real pain in the ass. It confuses content-creating end users. It bloats browser software with incomprehensible workarounds to problems that shouldn’t exist in the first place. It makes writing browsers for small or embedded systems unpleasant, and limits their usefulness in the wild.

You have 48 errors on your page…

Do your users care if your site is indexed by Google? Do you care?

Malformed HTML works more or less consistently in major browsers, but only because the browser vendors have spent considerable effort reverse-engineering each other’s error handling. They have to, because otherwise users complain.

Don’t expect that the various parsers used by spiders and search engines implement exactly the same (extremely complicated and never specified) error handling logic as your favorite browser.

If you forget quotes around the URL, a search engine might not traverse the link. If you forget to close a quote, the rest of the document might not be indexed. You can’t really be sure. Why take the risk?
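As an illustration (the file name and text are made up), here is what a single missing quote can do:

<!-- The closing quote on href is missing. A parser may treat everything up to the next quote character as part of the attribute value, so the link target is wrong and the following content can simply disappear. -->
<a href="/articles/intro.html>Read the introduction</a>
<p>A spider that recovers differently from your browser may never index this paragraph.</p>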

If your HTML is well-formed and valid, at least you have one less thing to worry about.

The problem with well-formedness and valid markup is that some errors will actually trigger quirks mode or even a rendering error, but many of the validation errors seem to me to be nit-picks. At some point you end up putting in extra hours just to please the validator where the changes have no practical effect on the rendering of your code.

The same reason static typing is great. Fail early.

I agree with Olav. And I disagree with Jeff’s statement about the tools not needing to be better.

To me the heart of the matter is this: how accessible is my content? I do not think it matters much whether you use XHTML 1.0 or HTML 4: they’re the same freaking semantics. What matters to me is that we craft projects that follow the (albeit misquoted here) axiom “be liberal in what you accept and conservative in what you produce.”

There should be zero reason for any new undertaking to produce invalid markup. Parseability and maintainability alone should be reason enough for pretty much anyone who creates something new. In a business sense it doesn’t matter whether your markup is valid, because that doesn’t translate directly into gain or loss. But it should (and secretly does), because of unforeseen circumstances.

Take microformats, for example. Using microformats allows enterprising people (like Technorati) to come up with new services that scour the web and understand microformatted material. Valid markup is analogous to this.

Valid markup makes your content that much easier to parse, and therefore lets it be picked up by services and tools beyond the ones you thought of when you first created your product/tool/whatever your content is.

That’s potentially huge. Web browsers are the most forgiving renderers on the planet and therefore a poor choice of target. I believe that the world benefits from having any public content as publicly accessible as possible, and validating our markup against a standard is the first step in accomplishing this.

Tools are also to blame for not being friendlier to the world. Microsoft Word and FrontPage are especially sad examples of this behavior (though the last version of FrontPage got much better, and Word’s sometimes hard-to-find export-to-compact-HTML filter was good too). Any content created in Word or FrontPage is likely to be readable only in a web browser. If you write a parser that comes across Word- or FrontPage-created content, you’re in for a heckuva lot of work.

I hope one would care if Word made a document more difficult to understand by introducing unneeded sentences (or, even worse, sentence fragments), so why shouldn’t a developer care if a tool she or he uses or creates outputs difficult-to-understand markup? Laziness only goes so far.

I agree that well-formedness can be quite cumbersome and that the immediate benefits are often negligible, at best.

But XHTML Transitional wasn’t meant to be fully XML, since it’s a “transitional” state. An XHTML 1.0 Strict document will not show up if you neglect a slash in a short tag or use the wrong entity. But there is no good reason why you would want to build a site in XHTML 1.0 Strict right now, since Microsoft Internet Explorer does NOT treat any XHTML document as XML and won’t until IE8.

XHTML Transitional was meant only to ease the transition from HTML to XML.

Maxime: Actually the “Transitional” doctype was meant for the transition from presentational markup to semantic markup + CSS, not from HTML to XML.

XHTML Transitional is just as much XML as XHTML Strict, and they are equally unforgiving with regard to syntactic errors. The difference is that the Transitional doctype allows a number of presentational elements and attributes which are illegal in Strict, like font, align, bgcolor and so on.

The low-level syntax (XML or HTML) is independent from the doctype (Strict, Transitional or Frameset).
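For reference, the two doctype declarations in question; both sit on top of exactly the same XML syntax and differ only in what the DTD allows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">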

Honestly, I find that with PHP on my back-end doing a lot of the code generation for me, it’s easier to make my templates conform, since I can throw something on the order of three or four PHP commands into a single document and have it all pull from a common source.
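For what it’s worth, a minimal sketch of the kind of thing I mean; the file and variable names are invented for the example:

<?php
// Every page pulls its boilerplate from one shared, already-valid source.
include 'header.php';   // doctype, <head> and opening <body>
?>
<div id="content">
  <h1><?php echo htmlspecialchars($pageTitle); ?></h1>
  <?php echo $pageBody; // assumed to be set by the page itself ?>
</div>
<?php include 'footer.php'; // closes <body> and <html> ?>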

I’m not saying my code is airtight, but using a combination of NVU (which seems to pride itself on XHTML standardization) and simple PHP coding, I seem to get by. It’s when I use pre-made scripts that I run into problems with malformed markup.