Programming Is Hard, Let's Go Shopping!

Donal, your tone seems to imply that one should care about whether they gave back to the community. This is an erroneous assumption.

I’ve posted this like a hundred times, but I’ll mention it again just for fun. HTMLEncode your string, and then replace &lt;b&gt; with <b>, etc. So what if people can see that someone put script in their post?
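
For anyone who wants to see the encode-first idea spelled out, here is a minimal sketch in Python rather than .NET. It assumes a small whitelist of attribute-free formatting tags; the names ALLOWED_TAGS and encode_then_whitelist are made up for illustration and are not from any library. Everything is escaped first, and only the escaped forms of bare whitelisted tags (for example &lt;b&gt; and &lt;/b&gt;) are turned back into real tags.

```python
import html
import re

# Hypothetical whitelist: simple formatting tags, no attributes allowed.
ALLOWED_TAGS = ("b", "i", "em", "strong", "code")

# Matches an escaped bare tag such as &lt;b&gt; or &lt;/b&gt;.
# A tag with attributes (e.g. &lt;b onclick=...&gt;) does not match
# and therefore stays escaped.
_ESCAPED_TAG = re.compile(
    r"&lt;(/?)(" + "|".join(ALLOWED_TAGS) + r")&gt;", re.IGNORECASE
)


def encode_then_whitelist(text: str) -> str:
    """Escape all markup, then restore only bare whitelisted tags."""
    escaped = html.escape(text, quote=True)
    return _ESCAPED_TAG.sub(
        lambda m: f"<{m.group(1)}{m.group(2).lower()}>", escaped
    )


if __name__ == "__main__":
    print(encode_then_whitelist("<b>hi</b> <script>alert(1)</script>"))
```

With this sketch, script tags and anything carrying attributes show up as visible text instead of being executed. It does not check that tags are balanced, so an unclosed <b> can still bold the rest of the page; treat it as an illustration of the approach, not a complete sanitizer.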

Might be a bit OT but to me this is a huge WTF:

due to the liberal HTML parsing policies of many modern Web browsers

I mean seriously, what were the browser developers thinking when they explicitly added support for lazy HTML in the first place?

Hmm, I pretty often write foeach by mistake instead of foreach…let’s make foeach do the same thing as foreach and I’ll save 2sec per day! I mean, all developers must have this problem so I’m doing the world a huge favor!

Today I understand that every browser has to add support for all the crap that the other browsers already added, but why add support for lazy HTML in the first place?!

Seriously Jeff,

Do you really think posting some code on a website qualifies some code as having been contributed back to the community?

  • Donal

I understand Google is rolling their own browser because they want to make sure that their applications will run smoothly. At least for some of their core business.

Looks like you’ve managed to get every programming expert in the country to come post a comment. They all know the best way and each of them is smarter than the rest. This is good stuff. :)

I have to say that I definitely agree with Jeff on this one. While I think claiming that HTML sanitizing is the core competency may be a stretch, I think it’s core enough for the purposes of this topic. Even if there are third party libraries available for such a thing, unless there is one that is truly complete, time tested and professional, it makes sense to write your own. You can look at the others for ideas and to learn the things that they’ve already learned, but it’s something that’s better off being written in a way that is more easily understood and maintained.

Comparing that to writing your own web server is pretty ridiculous. There is no such thing as a third party HTML sanitizer that is on the same order of reliability as Apache or IIS, in any language. Using a library that is written in Python and forcing it into a C# .NET package via IronPython would be madness, unless you happened to be a seasoned Python expert or have one in-house that is able to make changes and corrections in a timely manner.

There are a lot of square wheels out there.

3rd-party libraries are definitely a problem - the OpenSSL debacle should make everyone think twice. Security software is especially troublesome. Experts (like Schneier) will tell you if you create your own encryption algorithm, you’re almost certainly a fool. If you’re not a fool, you rely on a commercial product or something like OpenSSL. On the other hand, unless you’re an expert on the subject, you’ll have to assume the product or library you selected is secure - that assumption will only be based on the assertions of the vendor. OpenSSL was not secure for two years, which ought to tell you something about the level of testing done. Are the commercial products any better? How can you tell?

Didn’t Dare quit the internets for ever? IMO, he hasn’t had much good to say - apart from causing drama with Arrington.

http://www.25hoursaday.com/weblog/2008/03/05/IndefiniteHiatus.aspx

I also don’t agree with using regular expressions for writing an HTML sanitizer.

I’m sorry to inform you, but if the spec requires HTML as input, the spec is wrong. If your core business is to somehow display user generated content, you simply don’t allow HTML, period. And if you want to allow some HTML, just go the BBCode way.

Ooops, just clicked on the hear it spoken link…

Wow, so many gurus on here, who really know their stuff! Well done all, you’re a bunch of heroes that all have wonderfully usable sites that I visit every day. Amazing how you all picked exactly the correct technology (which you do every time - don’t you) giving you time to come on here and share your much valued wisdom. Yes, jolly well done you!

Anyway Jeff, I think SO looks great and I’m amazed you managed to make such a funky looking site in .NET. Also, you may be louder than some through blogging, but it’s also because you talk a lot of sense, making you, IMO, a talented developer too. It’s not just about writing beautiful code…

I’m sorry to inform you, but if the spec requires HTML as input, the spec is wrong.

Since more people know HTML than any other form of marking up and generating rich user content, I fail to see how you can make that assertion. Forcing users to adapt to something unfamiliar is a bad spec, not allowing them to use something that is.

If you are a security vendor, you might want to build SSL or hashes.

WTF?

deeply understanding HTML sanitization is a critical part of my business

I thought you were running a ‘people’ site. Even if you did not support HTML, SO will work fine.

Jeff, your core business is not sanitizing HTML.
There are communities whose core business is sanitizing HTML. You should have borrowed code from them.

And the solution is to change your markdown interpreter so that intermixing HTML is not allowed. Problem solved. No need to spend a week (or more) creating some HTML sanitizer that frankly isn’t needed at all. Markdown includes more than enough formatting options without having to drop into HTML.
You can’t just kill functionality until a product is safe. Denying any user the right to enter any word longer than 6 characters would solve some problems. So would disconnecting SO from the Internet.
It’s a balancing act.

I thought you were running a ‘people’ site. Even if you did not support HTML, SO will work fine.
Surely it is better to let the people, many of whom are web programmers, write in a known markup rather than forcing them to learn YASWM (Yet Another S***ing Web Markup).

I agree with you. If you don’t write your own stuff, you either have to rely on an outside programmer to fix it, or set aside a week and a half to figure out their code and change it yourself. That can take a lot longer than writing it yourself and spending an hour to debug.

Whatever the arguments over code-reuse versus NIH, posting the code to http://refactormycode.com/codes/333-sanitize-html certainly doesn’t count as [contributing] the core code back to the community.

Of course, all third-party and open source libraries are completely perfect and you shouldn’t dare question them.

Right…

SO looks great - glad you’re taking the time to build something for the community that will save us all time and effort in the future.

And a big part of the reason it doesn’t suck is that people can format their posts
almost as much as I formatted this blog post.

Sounds great. How long until we get some of that non-suckage in the commenting software here?

I think writing your own HTML sanitizer to better understand XSS attacks is a good idea. Web application security is terrible because the hackers are collaborating more than developers. As a developer, your knowledge of security exploits is often nothing more than recommended practices and the report from a security scanner. You usually don’t have any idea of how the exploit really works. For example, form bots have always plagued my web sites and I really need to create my own to understand how to defeat them and to put my web forms to the test.