This is one time I have to disagree with you. Attitudes like this are why the quality of software doesn’t advance at a better pace than it does. We spend so much time re-inventing the wheel and not adding new value to an application.
You can use an open source library like Beautiful Soup and get all the ability to fix problems on your own, retain the ability to read and understand what the code is doing, and at the same time benefit from the work of a lot of other people keeping an eye out for things you may not have thought of. You might actually learn something new. Beautiful Soup is written in Python, but you could quite easily wrap it/run it in IronPython to match your .NET requirement.
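For what it's worth, a rough sketch of what that wrapping could look like (purely illustrative assumptions: IronPython's hosting API, BeautifulSoup.py copied into a folder of your choosing such as C:\libs\python, and it only extracts plain text rather than doing real whitelisting):

using System;
using System.Collections.Generic;
using IronPython.Hosting;                 // IronPython.dll
using Microsoft.Scripting;                // Microsoft.Scripting.dll
using Microsoft.Scripting.Hosting;

static class BeautifulSoupWrapper
{
    // Illustrative sketch: run Beautiful Soup (a pure-Python library) inside IronPython
    // and pull the plain text out of an HTML fragment.
    public static string ExtractText(string html)
    {
        ScriptEngine engine = Python.CreateEngine();

        // Assumes BeautifulSoup.py lives in this (hypothetical) folder.
        var paths = new List<string>(engine.GetSearchPaths());
        paths.Add(@"C:\libs\python");
        engine.SetSearchPaths(paths);

        ScriptScope scope = engine.CreateScope();
        ScriptSource source = engine.CreateScriptSourceFromString(@"
from BeautifulSoup import BeautifulSoup

def get_text(html):
    return ''.join(BeautifulSoup(html).findAll(text=True))
", SourceCodeKind.Statements);
        source.Execute(scope);

        // IronPython converts the Python function into a .NET delegate for us.
        var getText = scope.GetVariable<Func<string, string>>("get_text");
        return getText(html);
    }
}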
Bypassing options like this to roll your own is just a combination of vanity (I can do everything better than anyone else can) and/or laziness (it is harder to read/understand someone else’s code than to roll your own).
With the variety of open source solutions out there now, you can avoid reinventing the wheel while at the same time having complete understanding of the code being run and retaining the ability to modify it as you need to.
This was an interesting article, and I’d love to hear you talk about when you WOULD want to use an existing library. My reading of this post was that you were arguing that sometimes reinventing the wheel can be good in specific situations, and that writing your own HTML sanitizer for Stack Overflow was one of those situations for a variety of reasons.
So my question is, under what circumstances would you have used a library for HTML sanitization rather than rolling your own? You mentioned that there are few sanitizers for C# and VB.NET; if there had been one or more really mature and widely used ones, would you have not written your own?
I’d love to see a blog entry on the specific criteria you use to decide when to use an existing library, with examples of when you did and when you didn’t. I’m especially interested in cases where external libraries existed and you had to think long and hard about whether to go with one of them, since those would give us the most insight into how you make these decisions.
I find reusing existing libraries is often in conflict with the principle of YAGNI. Libraries grow over time to satisfy the needs of multiple usage scenarios. If you only need 50% of the functionality of a library for your scenario, then the library brings along a heap of baggage that has security, performance and complexity implications.
the time may have been better spent unit-testing an existing library
I agree with you; Jon had a set of unit tests he worked up that I wanted to build on. However, many of the XSS exploits would require actual browser code to execute against – different browsers interpret sketchy markup differently. So a complete and accurate XSS test suite would have to fire up browser .exes and somehow detect JS execution and other conditions in the browser.
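To make the kind of test I mean concrete, here's a minimal NUnit-style sketch against a Sanitize(string) method like the one posted further down in this thread (Sanitizer is just a placeholder class name). It only covers the trivial cases, which is exactly the limitation: the interesting failures need a real browser.

using NUnit.Framework;

[TestFixture]
public class SanitizeTests
{
    // Placeholder tests; real XSS coverage would need actual browsers, since each
    // one executes a different flavor of broken markup.
    [Test]
    public void ScriptTagIsStripped()
    {
        string output = Sanitizer.Sanitize("hello <script>alert('xss')</script> world");
        StringAssert.DoesNotContain("<script", output);
    }

    [Test]
    public void WhitelistedTagSurvives()
    {
        string input = "<b>bold</b> text";
        Assert.AreEqual(input, Sanitizer.Sanitize(input));
    }
}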
You can use an open source library like beautiful soup
In .NET? How?
This is like telling me I should use rainbows and cotton candy. Well, obviously.
There are almost no options in .NET, which is one of the reasons I wanted to go this route. So others would have more solutions!
Since refactormycode is sadly defunct, I’ve added the raw code to the blog post and this comment. Consider it MIT licensed.
private static Regex _tags = new Regex("<[^>]*(>|$)",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

private static Regex _whitelist = new Regex(@"
    ^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$|
    ^<(b|h)r\s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

private static Regex _whitelist_a = new Regex(@"
    ^<a\s
    href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""
    (\stitle=""[^""<>]+"")?\s?>$|
    ^</a>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

private static Regex _whitelist_img = new Regex(@"
    ^<img\s
    src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
    (\swidth=""\d{1,3}"")?
    (\sheight=""\d{1,3}"")?
    (\salt=""[^""<>]*"")?
    (\stitle=""[^""<>]*"")?
    \s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using
/// a whitelist based approach, leaving the "safe" HTML tags
/// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937
/// </summary>
public static string Sanitize(string html)
{
    if (String.IsNullOrEmpty(html)) return html;

    string tagname;
    Match tag;

    // match every HTML tag in the input
    MatchCollection tags = _tags.Matches(html);

    // iterate in reverse order so that removing a tag doesn't shift the
    // indexes of the matches we haven't checked yet
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLowerInvariant();

        // strip any tag that doesn't match one of the whitelist patterns
        if (!(_whitelist.IsMatch(tagname) || _whitelist_a.IsMatch(tagname) || _whitelist_img.IsMatch(tagname)))
        {
            html = html.Remove(tag.Index, tag.Length);
            System.Diagnostics.Debug.WriteLine("tag sanitized: " + tagname);
        }
    }

    return html;
}
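For illustration, a minimal usage sketch (Sanitizer here is just a placeholder name for whatever static class you drop the method into):

using System;

class SanitizeDemo
{
    static void Main()
    {
        string raw = "<b>hi</b> <a href=\"http://example.com/\">link</a> <script>alert('xss')</script>";

        // Whitelisted tags pass through untouched; the <script> tags are removed,
        // leaving only their inner text behind.
        Console.WriteLine(Sanitizer.Sanitize(raw));
        // prints: <b>hi</b> <a href="http://example.com/">link</a> alert('xss')
    }
}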
I’m definitely in agreement here, controlling dependencies is a key way to control complexity. Sometimes the fast and easy solution ends up eating more resources over the long haul.
Maybe I’m misunderstanding core competency here, but do users seriously go to Stack Overflow so that they can write in Markdown? I thought they went there to find and answer programming questions. I thought that was your core competency.
I understand there weren’t any libraries doing what you wanted, so that much I’m with you on, but the core competency explanation falls a little limp.
Jeff’s talking about his own core competency - he’s the one running the site, so he has to be proficient in handling everything people would do with the site. You’re confusing his focus on SO with that of the users.
Excellent article, I couldn’t agree more. Aside from the obvious benefits to the business, it’s great for developers to have the opportunity to solve Hard Problems. Going out and buying libraries to solve Hard Problems tells your developers that you don’t trust them to get it right.
I think it should go without saying that there’s an addendum to the quote about “If it’s a core business function – do it yourself, no matter what.” It should probably be something like “if that thing is hard and you can’t trust the quality of an off-the-shelf implementation.” The classic example is John Carmack explaining why they wrote their own 3D engine - it was a big deal and the major seller. Nothing else would do.
I don’t think that applies in this case. There would almost certainly have been some sanitisation code which you could have reused as others have pointed out. Some things are just building blocks. If your core competency is web sites, you don’t necessarily have to start by writing a web server.
So… Stack Overflow’s Core Business Function is sanitizing HTML?~
If there were no decent HTML sanitizers for .NET (seriously? I come from Python, where I can think of 3 excellent sanitizers off the top of my head - BeautifulSoup, lxml, and feedparser’s built-in one - not to mention that we’re a much smaller community) then I can understand building one from scratch. But even then, I would have gone with a port of another language/platform’s sanitizer, because I guarantee there are domain problems they’ve dealt with that you couldn’t possibly have thought up on your first iteration.
That would actually be the best of both worlds: not having to make your own mistakes, but still intricately understanding how the code works.
You can talk about efficiency, delivery dates, and the right ways to do things in your project, but after all, as a developer you understand that it is your project, and you started it because you wanted to, because you like to code, and coding Stack Overflow was fun. That is the real reason you decided to DIY. Digging through other people’s messy code and trying to bend it to work with your framework is NOT FUN. Coding your own solution to an interesting problem IS FUN. And there is nothing else to add.
Keep up reinventing the wheel as long as it is fun!
So… Stack Overflow’s Core Business Function is sanitizing HTML?~
nice looking, easy to understand user-generated content is our core business. And guess how that content is generated?
If there were no decent HTML sanitizers for .NET (seriously?
You’d be surprised. In many ways .NET is kind of a backwater. Compare how many blogging engines there are in PHP, for example, to how many there are in .NET.
(from mid-2005, but the relative stats have not changed since then; if anything it may be worse now, as PHP has exploded)
There’s probably a good reason for not being able to do this, but my first thought after reading your post is: why don’t you write a sanitizer that converts valid HTML tags to something like BBCode?
You can convert valid tags to BBCode and escape the rest; that way, users don’t have to learn BBCode to format their comments and you get to throw away anything that’s invalid. You can still allow people to use whatever pseudo markup language you’re converting to, but it makes it so much easier for people who already know HTML.
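A rough sketch of that conversion (assumptions: System.Web’s HtmlEncode is available and the whitelist is intentionally tiny; a real converter would also have to deal with attributes, nesting, and links):

using System.Text.RegularExpressions;
using System.Web;   // for HttpUtility.HtmlEncode

static class HtmlToBbCode
{
    // Convert a handful of whitelisted HTML tags to BBCode and escape everything else.
    public static string Convert(string html)
    {
        // Swap whitelisted tags for placeholder markers first, so the HtmlEncode
        // pass below can't touch them.
        string s = Regex.Replace(html, @"</?(b|i|u|code)>", m =>
            (m.Value.StartsWith("</") ? "\x01/" : "\x01") + m.Groups[1].Value + "\x02",
            RegexOptions.IgnoreCase);

        // Escape whatever HTML is left over (including anything invalid or dangerous).
        s = HttpUtility.HtmlEncode(s);

        // Turn the placeholders into real BBCode brackets.
        return s.Replace('\x01', '[').Replace('\x02', ']');
    }
}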
This is like telling me I should use rainbows and cotton candy. Well, obviously.
There are almost no options in .NET, which is one of the reasons I wanted to go this route. So others would have more solutions!
Perhaps this should have been examined before choosing .NET. The group of developers around .NET is very different from the community around something like Python or PHP. They are more likely to understand the value of shared code, especially where that code is not the differentiation between them and their competition.
One thing, though, on third-party libraries: they are great, as long as you have the source. If not, you do not even have all of the source for your own application, so how can you be expected to maintain it?
Perhaps this whole experience has made you a little wiser about the value of source code. Hopefully others can learn from your pain.
The main point of Dare’s post, which you’ve totally failed to address, wasn’t that you shouldn’t ever write your own HTML sanitisation code, but that trying to do so with regular expressions is a huge source of problems.
You could just as well have rolled your own validation by requiring valid XHTML input and using a SAX parser, which makes it quite easy to whitelist tags and attributes, or even validate that the input is well-formed in other ways (e.g. that inline elements don’t contain block elements).
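In .NET terms, the streaming-parser version of that idea might look something like this sketch with XmlReader (assumptions: the input is already well-formed XHTML, attributes are simply dropped rather than whitelisted, and malformed input throws instead of being repaired):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security;
using System.Text;
using System.Xml;

static class XhtmlWhitelist
{
    static readonly HashSet<string> Allowed = new HashSet<string>
    {
        "b", "i", "em", "strong", "code", "pre", "p", "blockquote", "ul", "ol", "li"
    };

    // Walk well-formed XHTML and re-emit it, keeping only whitelisted elements;
    // a dropped element's text content survives, its tags do not.
    public static string Sanitize(string xhtml)
    {
        var sb = new StringBuilder();

        // Wrap in a dummy root so fragments with multiple top-level elements still parse.
        using (var reader = XmlReader.Create(new StringReader("<root>" + xhtml + "</root>")))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (Allowed.Contains(reader.Name))
                            sb.Append('<').Append(reader.Name).Append(reader.IsEmptyElement ? "/>" : ">");
                        break;
                    case XmlNodeType.EndElement:
                        if (Allowed.Contains(reader.Name))
                            sb.Append("</").Append(reader.Name).Append('>');
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        // Re-escape text content so nothing sneaks back in unencoded.
                        sb.Append(SecurityElement.Escape(reader.Value));
                        break;
                }
            }
        }

        return sb.ToString();
    }
}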
nice looking, easy to understand user-generated content is our core business
A big part of what makes SO nice-looking and easy to understand is the delectably responsive UI. But you use a third-party library to support that. Is it distinctly less of a core biz function than Markdown support?