Hacker News
Don’t try to sanitize input, escape output (2020) (benhoyt.com)
68 points by benhoyt 3 months ago | 79 comments



We were once contacted by a bug hunter claiming he had access to our database and asking for a bounty for his finding; he even provided a sample of the first 100 users from the users table.

After some investigating, I figured out how he had obtained the data.

He was one of the first 100 users: he set one of his fields to an XSS Hunter payload and slept on it.

After two years, a developer had a dump of data to test some things on. He loaded the data into an SQL development tool on his Mac and, out of VS Code muscle memory, hit Command+Shift+P to open the VS Code command bar; in the SQL editor this opened "Print Preview" instead. The software rendered the current table view into a webview to ease printing, and there the XSS payload executed and sent the page content to the researcher.

Escape input; you never know where it will be rendered.


Not storing raw HTML might be a last resort to avoid these kinds of bugs in other software, but a good amount of things need to go wrong for them to happen in the first place. The issue is that your data is rendered outside of your software and known-good environment, so all bets are off.

You could just as well have triggered a bug in some LaTeX engine that happened to be configured to allow arbitrary shell command execution.

Another strategy to defend against the issue you describe would be to not let developers access raw production data in the first place: always anonymize it first, or remove internet access from machines accessing production data. (How sensitive is the data in your users table? Could a developer's test script accidentally send emails to your live users?)


If you don't know where it will be rendered, how do you know what to sanitize for?


Sanitize it for all XSS. Or better yet, avoid something like HTML or anything that can contain executable instructions, when all you need instead is a regular markup language.

I’ve seen HTML used for rich text user input and it was an absolute mess: old data that wasn’t properly sanitized, the sanitization library itself getting outdated, someone putting potentially unsafe content in from another system, and so on. Meanwhile, people would sometimes bikeshed about breaking old style classes or the display of the data across multiple systems instead of addressing just how serious the potential risks are.

Not all of the details here might be accurate, but honestly just use Markdown or something like that for user input, disallow HTML altogether and never use the raw input.


That sounds like a pretty serious bug in the sql development software!


If you don't know where it will be rendered, you have no idea what escape syntax to use. If the field can end up in JSON, CSV, SQL, HTML, ... are you going to try to escape it for all of them at once?

This idea of escaping input is even worse than sanitizing input (which is what the article says not to do).


This is such an important lesson, but it's a difficult one to convince people of - telling people NOT to sanitize their input goes against so much existing thinking and teaching about web application security.

It's worth emphasizing that there's still plenty of scope for sensible input validation. If a field is a number, or one of a known list of items (US States for example) then obviously you should reject invalid data.

But... most web apps end up with some level of free-form text. A comment on Hacker News. A user's bio field. A feedback form.

Filtering those is where things go wrong. You don't want to accidentally create a web development discussion forum where people can't talk about HTML because it gets stripped out of their comments!


An anecdote: I haven't checked recently, but at least until 2015 there was a Belgian govt web site for the register of companies that would yield 0 companies in the city of Aalter. The reason is not hard to find. In fear of SQL injection they would filter out all SQL keywords; in this case "alter".


Classic Scunthorpe.


You beat me to it. I couldn't remember the exact term. I just knew the "pattern" – and that it wasn't something that would have made the issue as obvious as eg. the pee-ass-word.


Don't you mean Clbuttic Scunthorpe?


I have never sanitized any input, full respect to whatever text there is.

It's simple: HTML/XML/JavaScript/JSON/URL is not text. You render it with whatever tools you have to, and those tools happen not to be string concatenation. You render XML - use the DOM, XSLT, etc. HTML - same story, use whatever templating engine you wish. JSON - use your own model and render it to JSON. SQL - prepared statements.
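The idea above can be sketched with only Python's standard library; the `user_input` value and the table here are illustrative:

```python
import html
import json
import sqlite3
import xml.etree.ElementTree as ET

user_input = '<script>alert(1)</script>\' OR 1=1 --'

# HTML: escape at render time, never concatenate raw text into markup
html_out = "<p>{}</p>".format(html.escape(user_input))

# JSON: let the serializer do the quoting
json_out = json.dumps({"bio": user_input})

# XML: build a DOM node; the serializer escapes the text content
node = ET.Element("bio")
node.text = user_input
xml_out = ET.tostring(node, encoding="unicode")

# SQL: a prepared statement keeps data out of the query text
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (bio TEXT)")
db.execute("INSERT INTO users (bio) VALUES (?)", (user_input,))
stored = db.execute("SELECT bio FROM users").fetchone()[0]
```

In each case the serializer for that format does the escaping, and the raw string survives a round trip through the database untouched.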


I think this misses important aspects. You usually don't want HTML in a user name to be rendered as HTML on a website, for example.


It's buried a bit in the article, but if you have to sanitize input to allow only some kinds of inputs (e.g., specific tags), you should really be parsing it fully to an AST and then acting on that (or using a library doing the same) since otherwise you're going to be subject to all sorts of pain.


Which raises another pithy phrase in this area: parse, don't validate.

https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...


This doesn’t have to be over engineered either! After all an enum is a perfectly valid AST.
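As a small sketch of that point in Python (the state list is deliberately truncated and illustrative):

```python
from enum import Enum

class USState(Enum):
    CA = "CA"
    NY = "NY"
    TX = "TX"
    # ... remaining states

def parse_state(raw: str) -> USState:
    # Parse, don't validate: the caller gets a USState member or a
    # ValueError, never a raw string that still needs checking downstream.
    return USState(raw.strip().upper())
```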


I still wish that the Unicode folks had set up a bunch of duplicate code points which could have been used exclusively for processing marked-up text and that the folks making markup systems/languages had followed through.

Say one was updating TeX to take advantage of this --- all the normal Unicode character points would then have catcodes set to make them appropriate to process as text (or a matching special character), while "processing-marked-up" characters would then be set up so that for example:

- \ (processing-marked-up variant) would work to begin TeX commands

- # (processing-marked-up variant) would work to enumerate macro command arguments

- & (processing-marked-up variant) would work to delineate table columns

&c.

and the matching "normal" characters when encountered would simply be set.


Why not both? Escaping output should be a requirement, but it doesn't hurt to remove obvious garbage in the input (including harmless stuff like pointless spaces).


The article clearly states why not: first, you probably screw up your data (the example is names, but there are many more), and second, what counts as garbage depends on your output usage. HTML/SQL/JSON/etc. all require different sanitization.


Regarding harmless stuff like pointless spaces.

I know a person who uses two spaces between his first and last name, because his culture uses a second given name yet he has none. So: one space between the first given name and the second (nonexistent) given name, then another space between the second (nonexistent) given name and the family name.

You might think it is weird or unnecessary, but that is his identity. One could counter with far weirder or seemingly unnecessary things we accept regarding peoples' identity today.


Obvious solution is for this person to adopt the "Zero Width Non Joiner" character as their middle name --- that would ensure that this doubled-up-space would be preserved.


I hate to belittle a point, but your obvious solution is wrong.

For one thing, considering that you know Unicode even exists, it is likely that you know the difference between NULL and the "Zero Width Non Joiner" character. His second given name is NULL.

Secondly, throwing obscure technical solutions at laymen's problems for which they already have a working solution is nearly always a problem in the making.


The article tells us that input validation is okay but argues against input sanitization. I.e., if there is obvious garbage in the input, tell the user that the input is wrong and don't store it, and also do not try to correct it. That seems like good advice in most cases. Your example of pointless spaces is a good one though; I think it is generally a good idea to remove those.


How do you define what is and isn't garbage in free-form text input though? You can probably come up with some set of rules that covers all the current exceptions, but at what cost in creating and maintaining such a system?


I store the raw input in my database, but run it through bluemonday before rendering it. Simples.

https://github.com/microcosm-cc/bluemonday


That seems like it's doing a LOT more than what you'd want a sanitizer to do. For HTML you really just want to escape a couple characters with special meaning. You probably can get away with just replacing < and > with their substitution sequences.

That way people can still discuss XSS exploits without your sanitizer deleting a bunch of the text they entered on purpose.


This is another place where 80% of the time one way works but 20% of the time you need to go the other way.

Of course once the product is in production you can swim one direction but not fight the current going in the other. You can always move to escaping output, but retroactively sanitizing input is a giant pain in the ass.

But the problem comes in with your architecture, and whether you can discern data you generated from data the customers generated. Choose the wrong metaphors and you end up with partially formatted data existing halfway up your call stack instead of only at the view layer. And now you really are fucked.

Rails has a cheat for this. It sets a single boolean value on the strings which is meant to indicate the provenance of the string content. If it has already been escaped, it is not escaped again. If you are combining escaped and unescaped data, you have to write your own templating function that is responsible for escaping the unescaped data (or it can lie and create security vulnerabilities. "It's fine! This data will always be clean!" Oh foolish man.)
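A minimal Python sketch of that provenance flag (conceptually how MarkupSafe works; this is not Rails' actual implementation, and the `tag` helper is made up for illustration):

```python
import html

class SafeString(str):
    """A string that has already been escaped for HTML."""

def escape(value):
    # Escaping is idempotent: already-safe strings pass through untouched,
    # which prevents double escaping; everything else is escaped once.
    if isinstance(value, SafeString):
        return value
    return SafeString(html.escape(str(value)))

def tag(name, content):
    # A templating helper escapes any unescaped data it combines,
    # then marks the final result as safe.
    return SafeString("<{0}>{1}</{0}>".format(name, escape(content)))
```

The danger the parent describes is exactly a `SafeString(...)` cast applied to data that was never actually escaped.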

The better solution is to push the formatting down the stack. But this is a rule that Expediency is particularly fond of breaking.


I've always been a big fan of structuring data on input, escaping it on output.

I think the big problem with just escaping output is that you can accidentally change what the output will actually be in ways that your users can't predict. If I am explaining some HTML in a field and drop `<i>...</i>` in there today, your escaper may escape this properly. But next month when you decide to change your output to actually allow an `<i>` tag, then all of a sudden my comment looks like some italicized dots, which broke it.

Instead if you structure it, and store it in your datastore as a tree of nodes and tags, then next month when you want to support `<i>` you update the input reader to generate the new structure, and the output writer to handle the new tags. You preserve old values while sanitizing or escaping things properly for each platform.
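A sketch of that structured approach in Python; the node schema (`type`, `value`, `children`) is invented for illustration:

```python
import html

# Rich text stored as a tree of nodes, not as an HTML string.
doc = [
    {"type": "text", "value": "Use "},
    {"type": "i", "children": [{"type": "text", "value": "<i>...</i>"}]},
    {"type": "text", "value": " for italics."},
]

ALLOWED_TAGS = {"i", "b"}

def render_html(nodes):
    # The writer decides, per output format, how each node is serialized;
    # plain text is always escaped, tags come only from the allowlist.
    out = []
    for node in nodes:
        if node["type"] == "text":
            out.append(html.escape(node["value"]))
        elif node["type"] in ALLOWED_TAGS:
            out.append("<{0}>{1}</{0}>".format(node["type"],
                                               render_html(node["children"])))
    return "".join(out)
```

Supporting a new tag later means extending the reader and the writer; the stored trees never change meaning.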


It is a reasonable idea, but there are other things that can be done too.

However, in the stuff about SQL, you could use SQL host parameters (usually denoted by question marks) if the database system you use supports it, which can avoid SQL injection problems.
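A quick demonstration of host parameters with Python's sqlite3 module (table and values are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

evil = "' OR '1'='1"

# String concatenation: the input rewrites the query and matches every row.
injected = db.execute(
    "SELECT name FROM users WHERE name = '" + evil + "'").fetchall()

# Host parameter: the input is just data and matches nothing.
safe = db.execute("SELECT name FROM users WHERE name = ?", (evil,)).fetchall()
```

The concatenated query becomes `... WHERE name = '' OR '1'='1'` and returns both rows; the parameterized one returns none.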

If you deliberately allow the user to enter SQL queries, there are some better ways to handle this. If you use a database system that allows restricting SQL queries (like the authorizer callback and several other functions in SQLite which can be used for this purpose), then you might use that; I think it is better than trying to write a parser for the SQL code which is independent of the database, and expecting it to work. Another alternative is to allow the database (in CSV or SQLite format) to be downloaded (and if the MIME type is set correctly, then it is possible that a browser or browser extension will allow the user to do so using their own user interface if they wish to do so; otherwise, an external program can be used).

Some of the other problems mentioned, and the complexity involved, are due to problems with the messy complexity of HTML and WWW, in general.

For validation, you should of course validate on the back end, and you may do so on the front end too (especially if the data needed for validation is small and is intended to be publicly known). However, if JavaScript is disabled, the form should still be sent and the server will reply with an error message if validation fails; if JavaScript is enabled, it can check for the error before sending anything to the server. Therefore it works either way.


This has been the way for Drupal since ... 2005 at least. My memory becomes fuzzy before that. Since 2015 it's highly automated too thanks to Twig autoescape.


They're not even related. Sanitizing input is at best a formatting/style issue. Escaping output is a security issue.


Of the “six famous bad ideas in computer security”, the first and second are “default permit” and “enumerating badness”.

http://www.ranum.com/security/computer_security/editorials/d...


This isn't “default permit” or “enumerating badness”. It's kind of the opposite.

The idea is that you don't want to store text in your database in a form that is safe when rendered as HTML, JS, JSON, SQL, etc. That would be "enumerating badness". Instead, at the moment you render the text as HTML, you encode the text into an HTML-friendly form (via escape characters). If you want to embed the text into a SQL query, have your SQL library add sql-specific escape characters where needed in the text. Same for your JSON library, and so on.

It's the responsibility of an encoding library to encode and decode text in the appropriate way. A JSON or SQL library should be able to encode and then decode any arbitrary Unicode string, even one that contains quote characters. Just like how any arbitrary Unicode string should be usable on a webpage, in a text field, without being able to interact with the rest of the page in any way.

Most libraries already do this if used properly. SQL libraries (using parameters) will escape text where needed. React will embed text in an html-safe way. JSON libraries escape quotes in strings. And so on.
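That encode-then-decode guarantee is easy to spot-check; for instance, with Python's json module:

```python
import json

samples = ['"quoted"', "it's", "<script>alert(1)</script>", "line\nbreak"]
for s in samples:
    # An encoding library must be lossless for arbitrary strings,
    # including ones full of quote characters and markup.
    assert json.loads(json.dumps(s)) == s
```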


I think the challenge is, you share data with other systems. If you don't treat "sharing" as "output" you're in trouble.


Yeah, I think this is the key takeaway for me, because the sanitization needed depends on the consumer of the output. I was never a successful developer, but when I was at least trying, I would constantly run into other devs who would happily code forms that allowed SQL injection, even in relatively normal WinForms apps, not to mention a poor understanding of regular ol' concurrency/transactions.

Since I had a poor reputation (which I take full responsibility for), my concerns would always be dismissed as a combination of "elitist ivory tower thinking" and "toxic interactions", and rebuffed with comebacks like "the database server just handles this", etc.

If your comment is anything but solid black for the duration of folks reading it, it's just more evidence that the vast majority of developers are just shit at their jobs, haha.


"my concerns"

Sorry, decades ago I was one of those people.


building software is just a weird activity for humans i think lol


Of course you should sanitize input, and escape everything properly in the context-specific way.

Defining what is valid for an input field and rejecting everything else helps the user catch mistakes. It's not just for security.

Some kinds of information are tricky to sanitize. Names, addresses and such. Especially in an application or site that has global users. Do the wrong thing and you end up aggravating users, who are not able to input something legitimate.

But maybe don't allow, say, a date field to be "la la la" or even "December 47, 2023".
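In Python, for instance, parsing into a real date type rejects both of those examples for free:

```python
from datetime import date

def parse_date(raw: str) -> date:
    # Accepts ISO dates like "2023-12-04"; "la la la" and
    # "2023-12-47" both raise ValueError.
    return date.fromisoformat(raw)
```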


Still looking for a way to safely parse an HTML string into the DOM while avoiding XSS attacks. Most solutions end up sanitizing input.


Our approach at work: parse it as HTML, define a short list of known-acceptable tags & attributes, and strip everything else.

Limiting attributes to ["href", "src"] and tags to ["p", "br", "h1", "ul", "ol", "li", "span", "div", "img"] gets you remarkably close to rendering the safe bits of HTML - add to that list upon request.
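A toy version of that allowlist approach using Python's standard-library parser. This is a sketch only: it doesn't balance dropped tags or handle the evasions a maintained library (bluemonday, DOMPurify) is built for:

```python
import html
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br", "h1", "ul", "ol", "li", "span", "div", "img"}
ALLOWED_ATTRS = {"href", "src"}

class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop unknown tags entirely
        kept = []
        for name, value in attrs:
            value = (value or "").strip()
            # keep only allowlisted attributes, and refuse javascript: URLs
            if name in ALLOWED_ATTRS and not value.lower().startswith("javascript:"):
                kept.append(' {}="{}"'.format(name, html.escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(html.escape(data))  # everything else is plain text

def sanitize(raw_html):
    parser = AllowlistSanitizer()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.out)
```

Note that event handlers (`onclick` etc.) fall out automatically because only the two allowlisted attributes survive.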

If you want to take it further, use an `iframe srcdoc=""` with sandbox attributes set.


> Limiting attributes to ["href", "src"]

You need to clean that up as well to avoid e.g. javascript: links, and then there are more issues with SVG if you allow media uploads.

Then you need to be very sure you’re using a proper html5 parser and your rendering is completely canonicalized or you open yourself up to filter evasions (https://cheatsheetseries.owasp.org/cheatsheets/XSS_Filter_Ev...)

And of course I assume that’s what you meant but you should not add upon request, you should evaluate the addition.


Yes - just double checked those, thankfully the framework builtins are correct (staying up to date with a well maintained framework does wonders for your security posture).


Wasn't there a case of a security issue that came from abusing differences between parsers in different places? Server, client, or different browsers.



If you want to avoid XSS attacks, have you tried a CSP header? I know it is more of an output validation, as you restrict what can happen with external scripts.

You can only fit so many characters in your exploit, often due to max field lengths, unless you can load some external script. Disabling loading unknown external scripts with CSP significantly reduces possible attacks, including XSS attacks, because you simply don't have the space.
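A restrictive policy fits in one response header; the directives below are an illustrative starting point, not a drop-in policy:

```python
# Allow scripts only from our own origin; inline <script> blocks and
# eval() are refused, so an injected payload has nowhere to execute.
CSP = "; ".join([
    "default-src 'self'",
    "script-src 'self'",
    "object-src 'none'",
    "base-uri 'none'",
])
headers = {"Content-Security-Policy": CSP}
```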


There's another one that works 100% of the time.

Do client-side rendering. Send the HTML shell, then query the backend for content and set it with something like p.textContent = ... It's safe.

It's pretty much the same as what a prepared statement does in SQL, send data and code in different channels


Huh? This is a 100% solved problem in most languages. You just need to replace all of HTML's special characters with their escaped / encoded form. Eg, '<' becomes '&lt;', and so on for &, ", ', >, and all the rest.

There are libraries in almost every language to do this for you. A quick google search found these:

JS: https://github.com/parshap/html-escape

PHP: https://www.php.net/manual/en/function.htmlentities.php

And there are many more.
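Python ships this in the standard library too, for example:

```python
import html

user_input = '<script>alert("xss")</script>'
escaped = html.escape(user_input)  # also escapes quotes by default
# escaped == '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;'
```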


You are right in that it is solved if the goal is "I don't want any part of the string to be treated as HTML"

It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute". It is possible, with things like DOMPurify, but ideally you'd try to avoid this if at all possible.


> It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute"

Yes, because you're no longer allowing HTML, but something similar to HTML that isn't quite it (and the chosen subset differs between people, projects, etc.).

So i personally would move up the requirements chain, where the requirement to allow "html" should be scrapped, and instead changed to be something like markdown - a pre-existing formatting protocol that does not have the undesirable aspects.

Or, as an alternative, host the html (without the stripping of "undesirables") in a separate iframe, on a totally different domain, and rely on the browser's cross-origin protection to prevent undesirable scripts or data leaks.


"So i personally would move up the requirements chain, where the requirement to allow "html" should be scrapped, and instead changed to be something like markdown - a pre-existing formatting protocol that does not have the undesirable aspects."

This would be how I would choose to solve this, if the option was available.

But sometimes people do want some HTML compatibility for legit reasons.


If you want a data format which expresses some specific subset of HTML, well, do that then. Again, validate on output that the text you're showing is within the defined subset and escape everything else. Eg, "<strong> is passed verbatim to the browser but any other < character is replaced with &lt;".

This technique still works fine. You just need to also do the work of defining what your data format looks like, and how it should be parsed and displayed in a web browser.


> You just need to also do the work ...

and thus, make mistakes and allow XSS.


I don't think this is sufficient. Scripts could still do bad things, for example mining crypto.


Markdown sources can contain HTML, which most parsers will gladly spit back out unescaped unless it's wrapped in a code block.

I would much rather trust a sanitizer library written by someone who knows about security, than trust a Markdown parser that was never intended for that kind of role. I've built apps that ingest Markdown, and I always pipe the parser's output to a proper sanitizer.

Using an iframe is a clever workaround, but good luck convincing Google et al. to treat the contents of that iframe as part of the page you want indexed.


Ehhh!? I don't get this at all. You obviously do both.

1) you get your input data into the form that is meaningful in the database by validating, sanitising and transforming it. Because you know what form that data should be in, and that's the only form that belongs in your database. Data isn't just output, sometimes it is processed, queried, joined upon.

2) you correctly format/transform it for output formats. Now you know what the normalised form is in the database, you likely have a simpler job to transform it for output.

It's not just lazy to suggest there's a choice here, it's wrong.


Validating data on input is just checking for "known badness", and is pretty ineffective.

If you've got specific structure requirements for the data you store, parse it into that structure.


Right -- I would include that in "transform" in my example above.

I've seen too many forum developers spend far too much time after the fact dealing with their decision to "just use TinyMCE" ==> Oh hey, you're a server-side HTML parsing expert now anyway; wasn't that what you were trying to avoid?


Absolutely the worst advice ever!


Porque no los dos? ("Why not both?")


Disagree.

Escaping/sanitizing on output takes extra cycles/energy that could be spared if the same process were done once upon submission.

Think more sustainable.


Yes I love seeing &lt; in my database. Every time I see it I think oh boy how many cycles will I save when I display this in HTML!


It’s a trade off for sure. But if you know at the time of collection all the ways in which your data is going to be used in the future, you work at a more well-run organization than me!


Obviously one thing does not exclude the other, but the more data is processed up front, the less has to be done later for every page retrieval.

It surprises me that this seems unfamiliar these days.


That is true only for simple systems that don't change often and that have a single view of the data.

For most real world use cases your approach is break even at best, and often way worse.


You mean it's not suitable for the deploy-every-hour crowd; it has nothing to do with complexity.


No, that's what caches are for.


What do you escape for? HTML? PostScript? JSON? SQL? LaTeX? A custom report generator that hasn't been created yet but that you'll start using in two years? Things move; you can't choose the escaping format at input time.


Solve problems you have, not ones you invent


HTML, JSON, and SQL are the very basics that most web apps will run into. You don't have to invent anything to get more than one possible output format.


For SQL you already have the language-provided sanitization of prepared statements. For most backends, the output format is always JSON, which ends up on the frontend via dedicated browser APIs that don't allow HTML injection.

Maybe if you also directly render some html from the backend, that would change things.

Document these assumptions in your central code standards/architecture document to get everyone aligned, and then just stick to it and enjoy a more sane codebase.


So what you're proposing is: encode to HTML on input, encode to SQL on output, pass it through for HTML, and encode to JSON on output but only after decoding it from the stored HTML... and just document it well?


I'm not sure, I've never run into this issue. I'd probably just look for another job if you want the honest answer.

For what it's worth: if a column in your database is used in so many contexts, you could document that it's simply unsanitized. I prefer having as few entries as possible for bugs and fuckups, and knowing that the database data is always sanitized helps with that. But it's not an unbreakable, absolute sort of rule.

And I never really escape strings for SQL, as I always use prepared statements - forgetting prepared statements is an easy thing for code review to catch.

I also have a project where there is HTML stored, which is not escaped because the purpose is to directly output that HTML somewhere, and it really is supposed to be HTML.

It simply depends. But as long as it's documented, and hopefully as clear and consistent as possible.


Not sure the escaping/sanitizing proposition can hold a candle to the overwhelming performance dumpster fire that is modern web dev.


I was going to say this sounds like optimizing the stuff that takes 0.1% of runtime for performance over safety.

Of course you'd need to measure this for your application, but without a performance measurement maybe it's better to default to security.


Did you calculate that, or are you just guessing?


The reason you sanitise input is because the data can attack the host and the client.

This post has a narrow view on attackers.


It’s really not. The point is that escaping output is a better way to protect the host and the client.


Sure, keep writing exploitable code then, have fun!


yes, and the article clearly agrees with you, different destinations of the data have different escaping requirements.



