We were once contacted by a bug hunter who claimed he had access to our database and asked for a bounty for his finding. He even provided a sample of the first 100 users from the users table.
After some investigating, I figured out how he had obtained the data.
He was one of the first 100 users. He had set one of his fields to an XSS Hunter payload and then simply waited.
Two years later, a developer took a dump of the data to test some things and loaded it into a SQL development tool on his Mac. Out of VS Code muscle memory he pressed Command+Shift+P to open the command palette, but in the SQL editor that shortcut opened "Print Preview". The tool rendered the current table view into a webview to prepare it for printing, the XSS payload executed there, and the page content was sent to the researcher.
Escape input, you never know where it will be rendered.
Not storing raw HTML might be a last resort to avoid these kinds of bugs in other software, but a good number of things need to go wrong for them to happen in the first place. The issue is that your data is rendered outside of your software and known-good environment, so all bets are off.
You could just as well have triggered a bug in some LaTeX engine that happened to be configured to allow arbitrary shell command execution.
Another strategy to defend against the issues you describe would be to not let developers access raw production data in the first place, but always anonymize it first, or remove internet access from machines accessing production data. (How sensitive is the data in your users table? Could a developer's test script accidentally send emails to your live users?)
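As a rough illustration of the "anonymize it first" idea, here is a minimal Python sketch that replaces user-controlled text with opaque tokens before a dump reaches a developer machine. The column names, the salt, and the `anonymize_row` helper are all made up for the example, and a salted hash like this is only a stand-in, not a serious anonymization scheme.

```python
import hashlib

# Hypothetical column names; a real users table will differ.
FREE_TEXT_FIELDS = ("name", "email", "bio")


def anonymize_row(row: dict, salt: str = "dev-dump") -> dict:
    """Return a copy of row with user-controlled text replaced by opaque tokens."""
    out = dict(row)
    for field in FREE_TEXT_FIELDS:
        if out.get(field) is not None:
            digest = hashlib.sha256(
                (salt + str(out[field])).encode("utf-8")
            ).hexdigest()[:12]
            # Only letters, digits and "_" survive, so no markup or payload does.
            out[field] = f"{field}_{digest}"
    # Point emails at a reserved domain so a test script cannot reach real users.
    if out.get("email") is not None:
        out["email"] = out["email"] + "@example.invalid"
    return out
```

Whatever payload a user stored in `name` never leaves production this way, and the rewritten email addresses also cover the "test script mails live users" case.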
Sanitize it for all XSS. Or better yet, avoid something like HTML or anything that can contain executable instructions, when all you need instead is a regular markup language.
I’ve seen HTML used for user rich text input and it was an absolute mess: old data that wasn’t properly sanitized, the sanitization library itself getting outdated, someone putting potentially unsafe content in from another system, and so on. Meanwhile people would bikeshed and worry about breaking old style classes or the display of the data across multiple systems instead of addressing just how serious the potential risks are.
Not all of the details here might be accurate, but honestly just use Markdown or something like that for user input, disallow HTML altogether and never use the raw input.
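To show roughly what "disallow HTML altogether" can look like, here is a small Python sketch that escapes everything first and only then turns a tiny allowlist of markup into tags. It is not Markdown, just a two-rule stand-in, and `render_limited_markup` is a name invented for the example; in practice you would want a maintained Markdown parser with raw HTML turned off.

```python
import html
import re


def render_limited_markup(text: str) -> str:
    """Render **bold** and *italic* only; any HTML in the input is shown as text."""
    # Escape first, so the only tags in the output are the ones added below.
    escaped = html.escape(text, quote=True)
    escaped = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", escaped)
    escaped = re.sub(r"\*(.+?)\*", r"<em>\1</em>", escaped)
    return escaped
```

The order matters: because escaping happens before any tags are generated, user input can never contribute a tag of its own.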
If you don't know where it will be rendered, you have no idea what escape syntax to use. If the field can end up in JSON, CSV, SQL, HTML, ... are you going to try to escape it for all of them at once?
This idea of escaping input is worse than sanitizing input (which is what the article says not to do).
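A sketch of the alternative, in Python with only the standard library: store the value exactly as entered and encode it at output time, with one encoder per destination. The function names and the sample payload are invented for illustration.

```python
import csv
import html
import io
import json
import sqlite3

# Hypothetical stored value, kept exactly as the user typed it.
RAW = '<script src="https://example.invalid/x.js"></script>'


def to_html(value: str) -> str:
    # HTML context: escape &, <, > and quotes.
    return html.escape(value, quote=True)


def to_json(value: str) -> str:
    # JSON context: the serializer handles quotes and backslashes.
    return json.dumps({"name": value})


def to_csv(value: str) -> str:
    # CSV context: the writer quotes the field and doubles embedded quotes.
    buf = io.StringIO()
    csv.writer(buf).writerow([value])
    return buf.getvalue()


def to_sql(value: str) -> str:
    # SQL context: a bound parameter instead of string concatenation.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES (?)", (value,))
    row = conn.execute("SELECT name FROM users").fetchone()
    conn.close()
    return row[0]
```

Each destination needs a different encoding of the same stored string, which is why no single escaping pass at input time can be right for all of them.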