
Huh? This is a 100% solved problem in most languages. You just need to replace each of HTML's special characters with its escaped / encoded form. E.g., '<' becomes '&lt;', and likewise for &, ", ', and >.

There are libraries in almost every language to do this for you. A quick Google search found these:

JS: https://github.com/parshap/html-escape

PHP: https://www.php.net/manual/en/function.htmlentities.php

And there are many more.
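
For what it's worth, here is a minimal sketch of the same idea in Python, using the standard library's html.escape. The render_comment helper and the <p> wrapper are made up for illustration; the point is only that every special character in the untrusted string gets replaced before it reaches the page.

```python
import html


def render_comment(user_text: str) -> str:
    """Wrap untrusted text in a paragraph, escaping it so it renders as text."""
    # quote=True (the default) also escapes " and ', which matters if the
    # value ever ends up inside an HTML attribute.
    return "<p>" + html.escape(user_text, quote=True) + "</p>"


payload = '<script>alert("xss")</script> & more'
print(render_comment(payload))
# prints: <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt; &amp; more</p>
```

This only covers the "treat nothing as HTML" case, and only for text placed in element content or quoted attribute values.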




You are right that it is solved if the goal is "I don't want any part of the string to be treated as HTML".

It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute". It is possible, with things like DOMPurify, but ideally you'd try to avoid this if at all possible.


> It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute"

Yes, because you're no longer allowing HTML, but something that resembles HTML without being HTML (and which subset that is differs between people, projects, etc.).

So I personally would move up the requirements chain, where the requirement to allow "HTML" should be scrapped, and instead changed to be something like Markdown - a pre-existing formatting protocol that does not have the undesirable aspects.

Or, as an alternative, host the HTML (without stripping the "undesirables") in a separate iframe on a totally different domain, and rely on the browser's cross-origin protection to prevent undesirable scripts or data leaks.


"So i personally would move up the requirements chain, where the requirement to allow "html" should be scrapped, and instead changed to be something like markdown - a pre-existing formatting protocol that does not have the undesirable aspects."

This is how I would choose to solve it, if the option were available.

But sometimes people do want some HTML compatibility for legit reasons.


If you want a data format which expresses some specific subset of HTML, well, do that then. Again, validate on output that the text you're showing is within the defined subset and escape everything else. E.g., "<strong> is passed verbatim to the browser but any other < character is replaced with &lt;".

This technique still works fine. You just need to also do the work of defining what your data format looks like, and how it should be parsed and displayed in a web browser.
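
A rough sketch of what that could look like in Python, assuming the "data format" is plain text plus bare <strong> and <em> tags with no attributes. The names ALLOWED_TAGS and render_limited_html are hypothetical. It escapes everything first, then turns back only the exact escaped forms of the allowed tags:

```python
import html
import re

# Assumption for this sketch: only these tags, lowercase, with no attributes.
ALLOWED_TAGS = ("strong", "em")

# Matches an escaped opening or closing allowed tag,
# e.g. "&lt;strong&gt;" or "&lt;/em&gt;".
_ESCAPED_TAG = re.compile(r"&lt;(/?)(" + "|".join(ALLOWED_TAGS) + r")&gt;")


def render_limited_html(user_text: str) -> str:
    """Escape all markup, then restore only the bare allowed tags."""
    escaped = html.escape(user_text, quote=True)
    # Anything with attributes (e.g. <strong onclick=...>) does not match
    # the pattern and therefore stays escaped.
    return _ESCAPED_TAG.sub(r"<\1\2>", escaped)


print(render_limited_html("<strong>hi</strong> <script>alert(1)</script>"))
# prints: <strong>hi</strong> &lt;script&gt;alert(1)&lt;/script&gt;
```

Note that this does not check that tags are balanced, so an unclosed <strong> could still bold the rest of the page. That is part of the "defining your data format" work.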


> You just need to also do the work ...

And thus make mistakes and allow XSS.


I don't think this is sufficient. Scripts could still do bad things, for example mining crypto.


Markdown sources can contain HTML, which most parsers will gladly spit back out unescaped unless it's wrapped in a code block.

I would much rather trust a sanitizer library written by someone who knows about security, than trust a Markdown parser that was never intended for that kind of role. I've built apps that ingest Markdown, and I always pipe the parser's output to a proper sanitizer.

Using an iframe is a clever workaround, but good luck convincing Google et al. to treat the contents of that iframe as part of the page you want indexed.



