Programming Is Hard, Let's Go Shopping!

Raisins · October 17, 2008, 12:00am

I always enjoyed the phrase: don’t reinvent the wheel, unless you plan on learning more about wheels.

Wayne · October 17, 2008, 12:00am

Honestly, I disagree entirely with Joel’s comment that’s being referenced here. Yes, for certain core functions you should write things yourself, but using a framework or library to help get it done quicker is a good thing, not a bad thing.

For example, if I was writing a storefront for an e-commerce site, I would prefer to write my own store and fulfillment system to fully encapsulate the business needs, but I would gladly use an existing storefront framework out there (for example, Satchmo if I was using Django) that takes care of the payment gateways, even if I end up redoing everything else from scratch.

I guess it depends on the context of the application. I would not trust a drop-in e-commerce package for anything except the most basic of online stores, but I would gladly borrow the payment and generic CRUD modules (e.g. adding new customers) from one to shorten development time.

BuggyF · October 17, 2008, 12:00am

Oddly, many businesses blithely Go Shopping. The proliferation of BOM/MRP/ERP software systems is the prime example. And SAP is the prime of the prime. How you make your widgets is your core competence. But many still buy such software. Maybe that’s why the USofA is going down the tubes.

BlueRaja · October 17, 2008, 12:00am

I agree that, for instance, a pharmaceutical company should write their own drug research software, but writing your own software and writing your own software from scratch are two completely different cups of tea. Especially when it comes to security - if history has taught us anything, it’s that you should never write your own custom security-related routines, if at all possible.
However, the fact that you released your HTML-sanitizer to the public and posted it on your blog is certainly a plus, as I’m sure that now it will be picked apart and scrutinized by everyone in the community, especially those trying to prove that you have no idea what you’re talking about.

KevinF · October 17, 2008, 12:00am

The comment thread on this one is freakin’ hilarious! But, yeah, I agree with Jonathan Buchanan’s sentiments… Dare’s post seemed to attack the usage of regular expressions to accomplish your goal – not just the HTML sanitisation.

– Kevin Fairchild

Absconditus · October 17, 2008, 12:00am

By Jeff’s logic writing his own web server would be acceptable as well. Serving web pages is clearly part of his core business.

How many questions/answers actually contain HTML? Would it really have been that great an inconvenience to disallow HTML markup? Jeff even alludes to this when talking about how much easier things would have been with BBCode.

Why not just encode all HTML before Markdown sees it? Why not consider a different markup language?
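
The "encode everything first" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_markdown` is a hypothetical stand-in that handles nothing but `**bold**`, not a real Markdown implementation, and `render_untrusted` is a made-up name.

```python
import re
from html import escape


def toy_markdown(text: str) -> str:
    """Hypothetical stand-in for a Markdown renderer: handles only **bold**."""
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)


def render_untrusted(text: str) -> str:
    # Encode &, < and > first, so any tags the user typed become inert text
    # before the renderer ever sees them.
    return toy_markdown(escape(text, quote=False))


if __name__ == "__main__":
    print(render_untrusted("**hi** <script>alert(1)</script>"))
```

One caveat with this ordering: encoding `>` up front also affects Markdown syntax that relies on that character, such as blockquotes, so a real renderer would probably need adjusting to match.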

Henry · October 17, 2008, 12:00am

You don’t need to understand everything to run StackOverflow. Computer science has this wonderful philosophical device: abstraction. Black boxes make the composition of systems from smaller functional units wonderfully tractable. What you are complaining about is the lack of a suitable black box, so you wrote your own. No problem there. But by the ‘Feynman metric’ from the blackboard, I doubt you understand the entire operation of StackOverflow. Did you write your own database (and could you, from scratch)? Is the network stack custom rolled?

I doubt it, and rightly so as rewriting them would be crazy. Feynman wanted to understand the entire universe stack, from top to bottom. You needed something that didn’t exist, so you made it. That’s the great luxury of software development.

Imagine, however, that a suitable sanitization engine had existed. Then you would have been crazy, from a production point of view, to roll your own if the extant engine had decent documentation, and the time of integration was small enough. You trust black boxes to give certain guarantees at every level of operation; another one here wouldn’t have been a problem.

From a ‘do I understand the universe’ point of view, you could have written your own HTML sanitizer to scratch that particular curious itch, but it’s a weird one to start out with when there are far more interesting problems to be able to solve.

DeveloperD · October 17, 2008, 12:00am

If you are able and willing to write significantly better code than what exists, or code doesn’t exist, or the code that exists can’t be easily adapted to what you want to do - then you pretty much have to write code. Otherwise don’t waste time and get on with your job.

So much support code that is written is just reinventing the wheel, and very poorly at that. Most of the time that devs reinvent the wheel they are neither willing nor able to write better code - they just want to write the code. They also usually don’t have the benefit of a lot of eyes looking at and testing their code, so rarely does it even begin to approach the quality of code that is already out there and used by other people.

ChaseS · October 17, 2008, 12:00am

These days, HTML sanitization is primarily about security (preventing XSS attacks). When it comes to security, you want to use a proven, standardized solution. Would you roll your own version of SSL, or a cryptographic hash?

You say that your solution is proven, and can now be reused. Call me in 5 years when that’s actually true; right now it has gone through precious little battle-testing.

I disagree that this is core business functionality for stackoverflow. Your core is how you facilitate collaboration, not the content format.

cthrall · October 17, 2008, 12:00am

When it’s a week of work, easy call.

When it’s six months to a year, involving a not insignificant investment, what do you do? The choice is not easy then. And no matter which way you go, you will always wonder if the other way was better.

codinghorror · October 17, 2008, 12:00am

Would you roll your own version of SSL, or a cryptographic hash?

Well, first I’d design my own CPU, RAM, and motherboard. From scratch, naturally. Then an OS to run everything. Maybe an IDE, debugger, things like that. But after that I’ll be all over SSL and hashes like fleas on a dog!

If you are a security vendor, you might want to build SSL or hashes.

If your website allows arbitrary user-generated HTML in markup for every single page, you might… just… consider… writing your own HTML sanitizer.

But what the hell do I know.

doug_t1 · October 17, 2008, 12:00am

So, by the same rationale, does that mean you should learn C? If you can’t create (given, like a thousand man-years) the .Net framework, how can you understand it? How can you defend your use of it?

I mean this only half jokingly.

I’ll await a response while building my webserver driven by telegraph latches, based on what I’ve learned in Charles Petzold’s Code ;).

Wayne · October 17, 2008, 12:00am

Coding Horror is turning into the DailyWTF with all the submissions coming from Jeff himself. Talk about over-complication! I see the problem as being this:

Markdown allows users to intermix HTML into the markup

And the solution is to change your markdown interpreter so that intermixing HTML is not allowed. Problem solved. No need to spend a week (or more) creating some HTML sanitizer that frankly isn’t needed at all. Markdown includes more than enough formatting options without having to drop into HTML.

Bill70 · October 17, 2008, 12:00am

From http://daringfireball.net/projects/markdown/license:
Markdown is free software, available under the terms of a BSD-style open source license.

If HTML is the problem, then strip it out of your 3rd party library. If you want to foster the markdown community, offer the patch to other developers. I don’t believe you absolutely have to write your core functionality yourself. I do believe however you have to modify it to suit your needs.

FelixP · October 17, 2008, 12:00am

This whole talk of core competencies gave me an idea. Microsoft is a software company. Apple is a hardware company. Windows was developed completely in-house. MacOS X is built on open source Unix foundations. Which one should have come out the better? Which one did?

Inventing your own wheel is sometimes necessary. I think it was in this case. But it should always be the exception, not the rule.

Esteban · October 17, 2008, 12:00am

I just finished reading the October issue of MSDN magazine, and would have to say that at the rate the .NET framework is growing, we’ll soon be writing nothing but business logic.

Take the Coding Tools article on page 86; even if it’s your core business to write applications that are processor intensive (I’m thinking applying filters to images, etc.), you’d be crazy not to use the new support for parallelism the new version of the framework will offer.

I’m impressed with the future of the framework; however, I realize that not having a thorough understanding of parallelism (even though I may never have to write boiler plate parallelism code) is probably dangerous.

Wayne · October 17, 2008, 12:00am

I sure hope that Jeff packages his HTML sanitizer as an open source library and posts it on SourceForge. .NET will forever be a backwater unless developers start publishing their hand-rolled libraries.

Daniel · October 17, 2008, 12:00am

Tyler: I was just about to post the same thing. HTML sanitisation is a tangential issue to the primary functionality provided by stackoverflow: a programming community that doesn’t suck.

Jeff: You totally missed the point of Joel’s original post.

You also could have saved yourself a week of hacking by not being so stubborn about allowing html markup. We’re hackers. We can quickly pick up whatever small markup is required to make a post look nice.

DamianC · October 17, 2008, 12:00am

I don’t think writing a simple sanitizer is all that hard. At least, I have done it myself, taking the conservative approach of running through the input character-by-character with a finite-state machine scrawled on a piece of paper that says whether < or > is allowed at any point and whether to recover from errors by inserting or escaping the faulty code. You then have a table that says which element names are permitted and which attributes those elements can have. The result is a well-formed XHTML fragment that will display safely.

The point of the above is that (a) you only let through known-good HTML, rather than trying to spot and fix known-bad HTML (since making a list of good things is easier than making a list of all bad things), and (b) approach the problem systematically rather than trying a quick regex-based bodge and then a few more bodges on top of that.

I don’t see how translating to BBCode and back again could possibly be simpler as it presumably involves interpreting the HTML to generate the BBCode… And forbidding HTML in Markdown is not much simpler because after sanitizing the Markdown (in order to forbid HTML), you then convert it to HTML and have to hope there is no way to fool the Markdown formatter to make it produce bad HTML. Safer to write a bullet-proof HTML sanitizer and apply it at the very end of the pipeline.
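
For readers who want something concrete, the whitelist approach described in this comment might look roughly like the Python sketch below. It is not the commenter's code: instead of a hand-drawn character-level state machine it leans on the standard library's `html.parser.HTMLParser` for tokenizing, and the `ALLOWED` table, the `SAFE_URL_PREFIXES` rule, and the names `WhitelistSanitizer` and `sanitize` are all assumptions made for illustration. It is not a vetted security tool.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> attribute names permitted on it.
ALLOWED = {
    "b": set(), "i": set(), "em": set(), "strong": set(),
    "code": set(), "p": set(),
    "a": {"href"},
}
# Assumption: only plain web links are acceptable in href.
SAFE_URL_PREFIXES = ("http://", "https://")


class WhitelistSanitizer(HTMLParser):
    """Lets known-good markup through and escapes or drops everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of sanitized output
        self.open_tags = []  # stack of allowed tags still awaiting a close

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            # Unknown element: show it as literal text instead of markup.
            self.out.append(escape(self.get_starttag_text()))
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # attribute not on the whitelist
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue  # e.g. javascript: URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in ALLOWED:
            self.out.append(escape(f"</{tag}>"))
        elif tag in self.open_tags:
            # Close anything opened inside this element so nesting stays valid.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break
        # A stray close tag for an allowed element is silently dropped.

    def handle_data(self, data):
        self.out.append(escape(data))

    def finish(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(fragment: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    return parser.finish()
```

In the pipeline ordering argued for above, a function like `sanitize` would run on the HTML that the Markdown formatter produces, so that nothing downstream has to trust the formatter's output.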

Scott · October 17, 2008, 12:00am

So, what’s the difference between rolling your own HTML sanitizer and rolling your own jQuery?