Programming Is Hard, Let's Go Shopping!

codinghorror · October 16, 2008, 12:00am

A few months ago, Dare Obasanjo noticed a brief exchange my friend Jon Galloway and I had on Twitter. Unfortunately, Twitter makes it unusually difficult to follow conversations, but Dare outlines the gist of it in Developers, Using Libraries is not a Sign of Weakness:

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/10/programming-is-hard-lets-go-shopping.html

Rob · October 17, 2008, 12:00am

The easy solution is to allow only well formed image tags, and zap everything else, no?

Bernard · October 17, 2008, 12:00am

I don’t see a license attached to your sanitizer. Doesn’t that make it unusable for anyone else?

(Also, does C# have first-class functions, or is that just for clarity? If so - neat!)

Matt_Green · October 17, 2008, 12:00am

I’m a fan of reuse, but, seriously, give it a rest people. You act like all of the open source code out there is of equal quality and documented. It isn’t. Forgive me if I’m hesitant to put NightOwl201978’s homegrown HTML sanitizer in there that he’s used on his blog, which receives 3 visitors a day.

I see this as a problem with the open source community. You go to write an application, and 99% of the time, people say, oh, don’t write that from scratch! Go work on decidedly-mediocre-project-that-prompted-you-to-develop-this-in-the-first-place instead! Oftentimes these projects have SEVERE issues (symptoms like memory leaks or unmanageable complexity) that are NOT simple fixes to make, they’re often architectural, or, worse, cultural. (Such as inappropriate use of low level languages, failure to abstract properly, etc.) The very thing that project needs the most is someone to come along and outdo it, who isn’t afraid to say that the code quality is unacceptable.

Perhaps that is the overall problem with open source: when all code is free, we wrongly assume it is good code.

Absconditus · October 17, 2008, 12:00am

Jeff still hasn’t provided a convincing case as to why he needs to allow HTML at all. He more or less admits that this isn’t necessary when discussing BBCode. Can he even show us a question/answer where user-entered HTML was necessary/desirable?

CharlesC · October 17, 2008, 12:00am

Aaron, the big difference there is that the only thing user-generated in that entire list is…

Bingo, the HTML.

Stack Overflow has a targeted audience of programmers. This isn’t some random forum on the internet about knitting, it’s a group of professionals, a percentage of which probably have the ability to break something one has written.

How many HTML sanitizers are written with this kind of audience in mind?

That’s not a rhetorical question, I’d sincerely like to know.

Bottom line is, it’s one of the most important features of Stack Overflow and requires a lot of attention to detail.

CharlesC · October 17, 2008, 12:00am

@Absconditus - Now, I’d agree with that

GlyphL · October 17, 2008, 12:00am

Jeff,

I can see your more general point about re-use. Having written my own HTML sanitizer, I can understand why you wouldn’t want to use some code that you really don’t understand very well in a core function of your product.

Also, for what it’s worth, I love Stack Overflow’s input idiom. It combines the ease and familiarity of entering plain text with the immediate feedback of a GUI editor.

But you really ought to understand how difficult the problem domain is, and have humility about your solution. You have to assume your code will be wrong. Concentrate on making sure that, when it fails, it fails well.

Your use of a whitelist rather than a blacklist is a step forward, but the use of regular expressions is two steps back. You can’t parse HTML with regular expressions. There are a few thousand screeds on this topic, but here’s a good one: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

What you need to do to sanitize HTML - or, for that matter, any untrusted network input, is:

1. Parse it: load it into a data structure. An actual structure, with actual rules. During this process, your parser may fail. That’s great. Give up. If you see input you can’t handle, assume it’s an attack. On a site like StackOverflow, where users get real-time feedback, this doesn’t even create a usability problem! You immediately say “I’m sorry, I couldn’t understand that.”
2. Emit it. During this phase, if somebody tricked your parser, it shouldn’t matter: your emitter should be smart. Let’s say your parser has a bug where it mistakenly thinks the <bar in <foo><bar baz='boz'> is actually the start of text. If someone tries to attack that, your emitter can neuter the errant < by quoting it as <foo>&lt;bar baz='boz'&gt;</foo>. It will look ugly, but it will fail in a way which is at least safe. Importantly, the emitter should be as distant and disconnected from the parser as possible, so that they do not share bugs.

If you are concerned about memory overhead (but you shouldn’t be, because you’ve loaded the whole thing as a string anyway) you can always use an event-driven parser/emitter pair (think SAX) rather than loading everything into a single structure first.

If you follow this structure, you can still even use regexps for the parsing phase, because your parser can screw up horribly and your emitter will still produce valid, if unpleasant, output.
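
To make the parse-then-emit structure concrete, here is a minimal sketch in Python, not anything Stack Overflow actually uses. It relies on the standard library's event-driven `html.parser.HTMLParser`; the names `ALLOWED_TAGS`, `EMITTABLE_TAGS`, `ParseFailure`, `EventCollector`, `parse`, `emit`, and `sanitize` are all hypothetical, and the tag lists are arbitrary. The parser gives up on anything it does not recognize, and the emitter keeps its own list and escapes whatever it is not sure about, so a parser bug alone should not produce live markup.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for the parse phase, chosen only for illustration.
ALLOWED_TAGS = {"b", "i", "em", "strong", "code", "p"}
# The emitter keeps its own list so it does not blindly trust the parser.
EMITTABLE_TAGS = {"b", "i", "em", "strong", "code", "p"}


class ParseFailure(Exception):
    """Raised when the input contains something the parser refuses to handle."""


class EventCollector(HTMLParser):
    """Parse phase: turn markup into a flat list of (kind, value) events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags and any attributes at all are treated as an attack.
        if tag not in ALLOWED_TAGS or attrs:
            raise ParseFailure(f"unsupported tag or attributes: <{tag}>")
        self.open_tags.append(tag)
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if not self.open_tags or self.open_tags[-1] != tag:
            raise ParseFailure(f"unbalanced closing tag: </{tag}>")
        self.open_tags.pop()
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_comment(self, data):
        raise ParseFailure("comments are not allowed")

    def handle_decl(self, decl):
        raise ParseFailure("declarations are not allowed")

    def handle_pi(self, data):
        raise ParseFailure("processing instructions are not allowed")


def parse(markup):
    """Return the event list, or raise ParseFailure on anything unexpected."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()  # flushes any buffered text
    if parser.open_tags:
        raise ParseFailure("unclosed tag")
    return parser.events


def emit(events):
    """Emit phase: write known tags verbatim and escape everything else."""
    out = []
    for kind, value in events:
        if kind == "start" and value in EMITTABLE_TAGS:
            out.append(f"<{value}>")
        elif kind == "end" and value in EMITTABLE_TAGS:
            out.append(f"</{value}>")
        elif kind == "start":
            out.append(escape(f"<{value}>"))
        elif kind == "end":
            out.append(escape(f"</{value}>"))
        else:
            out.append(escape(str(value)))
    return "".join(out)


def sanitize(markup):
    return emit(parse(markup))
```

Calling `sanitize("<b>hi</b> there")` passes the markup through, while `sanitize("<script>alert(1)</script>")` raises `ParseFailure`. If a buggy parser ever handed the emitter a `("start", "script")` event anyway, `emit` would write it out as escaped text rather than as a tag.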

Esteban · October 17, 2008, 12:00am

Jeff,

Even though SO is awesome, I’m glad to see you’re again writing more regularly.

I don’t know if input sanitation is core to StackOverflow. It seems to me the SO experience could still be great even if you hadn’t written some sanitation code.

Having said that, I’m glad you did - just 3 weeks ago we used some of the code you posted on refactormycode.com.

Keep up the good work! And thanks for contributing to the .NET community.

-Esteban

MitchB · October 17, 2008, 12:00am

Hey Jeff, interesting article you have there. I spent a great deal of time writing an open source project http://www.codeplex.com/gsb where I developed things that I knew already existed. It took me 1 1/2 years of spare time to write it and it is still not done.

I refactored it continuously to use a number of 3rd party components that were really better than mine, but that did not stop me from learning, which is why I did it in the first place - to learn. Now, through what I learned, I was able to pick and choose the components that more closely matched my requirements and were less likely to give me grief in the future.

Now if I spent the time writing those components to match the functionality of what my app now has, well it would have taken me 5 years instead of 1 1/2.

Perspective is one thing, but programming is damn hard :-)!

John · October 17, 2008, 12:00am

Good read! I also liked the article about how I can be loud too

Yes, it’s very hard to detect evil from good, but I don’t blame you for writing an html sanitizer yourself. I’m definitely an idealist who dreams of taking third party libraries and designing/putting the pieces together and (effortlessly) building an app. And I try to do that as much as I can. But there are some things that you just have to do yourself (yes… like “If it’s a core business function, write that code yourself” is a reasonable standard)…
And in the back of my mind I wonder, how trusting should I really be toward all the third party stuff I’m using? I mean, considering all the unknowns about it, efficiency, security, etc… even with open source, do you (or anyone) go in and review/inspect all the source code?

TijsT · October 17, 2008, 12:00am

Wow, dare I chime in with yet another comment? I guess I do.

Your core business is community, Jeff, that and the people in it. A Drupal install or whatever standard CMS with some code parsing would have been fine. I guess we do like that you care so much about the plumbing, so perhaps that’s why those people come back in the first place. Perhaps you did work on your core business after all but never even noticed.

Anonymous · October 17, 2008, 12:00am

Code re-use is over-rated, because most code (especially in-house) is not of very good quality, and cannot accurately anticipate unknown, future needs. Creating reusable code requires better developers, who are rare and expensive, and rewards mystery projects from the future at the expense of the resources the client has available now. Trying to bend old code to new uses increases the risk of revealing bugs, and ties up developers who spend a lot of time trying to gain an understanding of the existing code, which generally ends up full of hacks to work around the grand visions of the first developers.

This in turn ends up as a maintenance nightmare, as different teams or departments fork the ‘reusable code’ or, god forbid, try to keep it in sync. It would be crazy for any manager to allow some other team or department who doesn’t fully understand their project to contribute code or design direction to some underlying code they both rely on. The other team cannot possibly be familiar with all the other projects that re-use the code, and the assumptions the developers on those projects have made.

The main exception is library re-use, such as HTML sanitizers, which is an excellent idea. If the library itself is full of un-reusable code - no problem. It is far more important that it is maintainable and well-tested than reusable.

Reusability deserves to go on the scrapheap of moribund trends, like making everything object-oriented. A nice idea, so long as you can stay out of the real world.

Vassi · October 17, 2008, 12:00am

I think you’re all missing the fact that a huge point in this is that he’s also trying to contribute to the .NET web ‘world’ as it were.

I, just recently, went in search of a CMS to fill my projects - and I would have loved nothing more than to find a .NET solution, it would be a perfect excuse for me to pick up ASP.NET a bit more and leave behind (finally!) my PHP roots. And yet, precious few solutions are to be had and almost all of them are a PAIN to install compared to Drupal, WordPress, and the like.

.NET is in danger, yet again, of being considered only a big kids toy for office and intranet, and nobody has done anything yet to prove that wrong - myself included, as I just installed Drupal with a bunch of fancy modules that will make my life easier. I can’t fault Jeff for not taking that easy way out, it’s not like I gave the man money to do this project, it is his, he had no commitment or reason to produce it other than he wanted to, he can write whatever he pleases on whatever timeline he wants.

Steve · October 17, 2008, 12:00am

Wow, so many sanitizer-writer experts!

Alan · October 17, 2008, 12:00am

I’m with Jeff on this one, even if that’s the unpopular opinion. I work at a large, very visible organization. We have been burned on 3rd party libraries and tools…BADLY. One very well-supported (and necessary) tool recently closed up shop after the parent organization was purchased by another organization. There is no longer even any mention of it on their website, when it used to be a front-and-center app. The licenses just stopped working one morning, which means…wait for it…all the code that relied on that tool that we paid for immediately failed. Guess which IT department had to scramble (and is currently scrambling) to craft a home-grown solution now when we could have spent that time better at an earlier date?

The worst thing is that the performance of this tool was hiding some serious design and performance flaws in the underlying code we inherited. There are just layers and layers of WTF-ery in this…I’m all for not reinventing the wheel, but Joel’s quote is dead on. The developers should NOT have relied on this third party tool to be the keystone of a mission critical app. The performance of this app is a core business issue, and anything related to it should have been hand-rolled, even if that was the harder road to travel.

Kris · October 17, 2008, 12:00am

If it’s a core business function, write that code yourself, no matter what.

I would have to agree with that… if you have problems/bugs with some component that is critical to your app, you want to not only be able to go in and fix/debug the issue but also to fully understand the piece if you expect to fix it right. I never feel good when I have to rely on someone outside of my business when things affecting my business break.

John · October 17, 2008, 12:00am

Not invented here attitudes are a problem, so is the opposite, the attitude that everything your colleagues create is crap and that anything created by a third party must be better. These folks develop an exotic bug collection, different patterns of bugs from every source they copied and pasted from.

DouglasS · October 17, 2008, 12:00am

a solid week building a set of HTML sanitization functions

It sounds like the time may have been better spent unit-testing an existing library, and if it was found to fail a lot of the tests, writing your own against those unit tests. Best case: you spend a day writing tests and have the rest of the week to work on other things. Worst case: you have some tests and expectations to write your own library against, and you’d be doing that anyway. Wouldn’t you?
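
A rough sketch of that test-first evaluation, in Python and under heavy assumptions: the attack strings in `CASES` are a tiny hypothetical sample, the "forbidden substring" check is a crude stand-in for real assertions, and `run_suite` accepts any callable that maps a string to a string, so the same cases can be pointed at an existing library or at home-grown code.

```python
from html import escape

# Hypothetical sample of (input, markup that must not survive) pairs.
# A real suite would be far larger and would check much more than substrings.
CASES = [
    ("<script>alert(1)</script>", "<script"),
    ('<img src="x" onerror="alert(1)">', "<img"),
    ('<a href="javascript:alert(1)">click</a>', "<a href"),
]


def run_suite(sanitize):
    """Return the inputs whose sanitized output still contains forbidden markup."""
    failures = []
    for raw, forbidden in CASES:
        if forbidden in sanitize(raw).lower():
            failures.append(raw)
    return failures


def escape_everything(markup):
    """Stand-in candidate: escapes all markup, so nothing can survive as a tag."""
    return escape(markup)


def identity(markup):
    """Stand-in for a broken candidate that lets everything through."""
    return markup
```

Here `run_suite(escape_everything)` returns an empty list and `run_suite(identity)` returns all three inputs. A candidate library would be swapped in for the stand-ins, and the same cases reused if it fails and a replacement has to be written.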

I can now see how SOF slipped by several months.

Most of the rest of your post amounts to nothing but snobbery!

Tyler · October 17, 2008, 12:00am

What a busy thread! I’m inclined to think that StackOverflow’s core competency is storing/indexing developer questions/answers. The schema, SPROCS, and the Lucene.NET index sum up a lot of the value that I see in StackOverflow (which is a great site btw).

If there’s truly nothing else out there that does a decent job for .NET (and I’m not completely sold on this) then I would agree that you’re painted into a corner.

A lot of developers on this thread (myself included) may be a little oversensitive to re-developing code that already exists when something like that costs so much (coding, testing, maintenance). The best line of code is the one you don’t write, don’t own, and that still delivers value to you.