@omalley
If you don’t allow unsafe characters, then just completely remove them from input. Done
Think about what this means. What is an unsafe character?
In the context of the user’s message, nothing. It’s only when you go to insert that message directly into an HTML/JS document that certain characters take on a different meaning. So that is the moment you escape them. This way the user’s message displays as they intended it AND it doesn’t break the HTML. Everyone wins.
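To make that concrete, here is a minimal Python sketch of escaping at output time. The message text and the surrounding `<p>` wrapper are made up for illustration; `html.escape` is the standard-library call.

```python
import html

# Hypothetical user message; these characters are only "special" inside HTML.
raw_message = 'I <3 tags like <b> & "quotes"'

# Escape at the moment the text is placed into an HTML document, not before.
safe_fragment = "<p>" + html.escape(raw_message) + "</p>"

print(safe_fragment)
# <p>I &lt;3 tags like &lt;b&gt; &amp; &quot;quotes&quot;</p>
```

The browser renders that fragment as exactly what the user typed, and `raw_message` itself is never changed.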
It’s the same when you’re putting it into SQL, or into a shell command, or into a URL, etc. You can’t store your data pre-escaped for every single purpose in your DB. You need to do the escaping exactly when it’s needed, and keep your original data raw and intact.
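As a rough illustration, the same raw value needs a different escaping for each destination. This sketch uses Python's standard `shlex.quote` and `urllib.parse.quote`; the value, the `echo` command, and the example.com URL are all placeholders.

```python
import shlex
from urllib.parse import quote, unquote

# Hypothetical raw value, stored exactly as the user typed it.
raw_value = "Tom & Jerry's; rm -rf /"

# Shell context: quote the value so the shell sees one literal argument.
command = "echo " + shlex.quote(raw_value)

# URL context: percent-encode the value for use as a query parameter.
# safe="" also encodes "/", which quote() leaves alone by default.
url = "https://example.com/search?q=" + quote(raw_value, safe="")

print(command)
print(url)
```

Neither escaped form would be any use in the other context, which is why only the raw value belongs in the DB.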
Your policy of stripping unsafe characters gets in the way of the user’s perfectly legitimate message. And there’s absolutely no reason for that.
You store user input verbatim, and you always remember to escape when displaying output, and you hope input cleaning works 100%
There is no hope required. You don’t have to remember every time if you have a standard, tested method of building DB queries and HTML documents/templates. And you should have one.
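One possible shape for such a standard method on the DB side, sketched with Python's built-in `sqlite3`. Note that this uses parameter binding (the `?` placeholder) instead of manual SQL escaping, and that the table and the `save_message` / `load_messages` helpers are hypothetical names for the example.

```python
import sqlite3

def save_message(conn, body):
    """Store the message verbatim; the driver binds the value for us."""
    conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))

def load_messages(conn):
    """Return the stored messages exactly as they were saved."""
    rows = conn.execute("SELECT body FROM messages ORDER BY id")
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# A message that would break a query built by string concatenation.
raw = "Robert'); DROP TABLE messages;--"
save_message(conn, raw)
stored = load_messages(conn)

print(stored == [raw])
```

If every query in the codebase goes through helpers like these, nobody has to remember anything at the call site.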
Where and when to escape (assuming a DB store):
- Untrusted data comes in
- Validate it (do NOT alter it)
- If it’s valid, store it (escape for SQL here)
Later, if you want to display it in an HTML page:
- Retrieve from DB and escape for HTML
Or, if you want to use it in a Unix command line:
- Retrieve from DB and escape for shell
Or, in a URL:
- Retrieve from DB and URL-encode
etc.
The key is not MODIFYING the user’s data. Just accept or reject. Then you escape if necessary when you use it in different contexts.
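A validation step in that spirit might look like the sketch below. The specific rules (non-empty, a 500-character limit, no control characters other than newline and tab) are assumptions picked for the example, not a recommendation. The point is only that the function answers yes or no and never returns a changed string.

```python
def validate_message(text, max_length=500):
    """Return True to accept or False to reject. Never alters the text."""
    if not text.strip():
        return False  # empty or whitespace-only
    if len(text) > max_length:
        return False  # too long (hypothetical limit)
    # Hypothetical rule: reject control characters except newline and tab.
    if any(ord(ch) < 32 and ch not in "\n\t" for ch in text):
        return False
    return True

message = "<b>hi</b> & 'bye'"
if validate_message(message):
    # Accepted as-is: angle brackets and quotes are perfectly valid content.
    print("store verbatim:", message)
```

Characters like `<`, `&`, and `'` pass through untouched, since they only matter later, in whichever context the message ends up being used.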
Now you can do anything you want with your data. You don’t have to impose confusing constraints on what your users can and can’t say.