What is a parser mismatch vulnerability? (2022) (brainonfire.net)



I did research on parser differentials for my bachelor's thesis. My initial hope was that I would find a few mismatches for formats without a formal specification. I found mismatches for _every single_ pair of parsers I looked at and that included formats with formal specifications. My personal takeaway was "If you use one parser for validation and another parser for evaluation, you're fucked. No exceptions."


Unsurprising.

One of my personal pet peeves is websites that use an e-mail parser on the client side that allows a + in e-mail addresses, and then somewhere in their back-end chain use another parser that does not.

Everything fails silently, accounts get created that I don't have access to, and it generally just sucks.
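
A minimal sketch of that mismatch, with hypothetical stand-in regexes (whatever each layer actually uses will differ):

    import re

    # Hypothetical client-side check: permissive, allows "+" in the local part.
    CLIENT_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

    # Hypothetical backend check: stricter, quietly has no "+" in its character class.
    BACKEND_RE = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

    addr = "user+news@example.com"
    print(bool(CLIENT_RE.match(addr)))   # True  -> the form submits happily
    print(bool(BACKEND_RE.match(addr)))  # False -> account creation silently misbehaves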


This is funny because, as a data/backend programmer, I had to stand up some web services for some stuff. I asked all the front-end folks I know whether there was some unified solution to parsing and validation that ensures the logic I use on the front end is the same as on the backend. They all looked at me like I was a bit crazy before telling me they weren't aware of any solution that would be compatible without significant changes to the architecture.

I’m hopeful someone on here could point me to a solution.


I suspect your frontend team won't like this solution, but here goes: Back in my day, we did all the logic on the backend and we never did worry if the page was out of sync with internal state because the page was static.


That is extremely common for formats without extensive and publicly available test suites. Without an easy way to check edge cases, implementations are destined to diverge.


Totally agree! Sounds like your thesis is worth a read. Can you share a link?

thrums.shrimps.0m@icloud.com if you do not want to share publicly


As the article mentions, Postel's Law is likely to create vulnerabilities. It makes individual systems more robust, but the whole becomes fragile.


> Well, these browsers "helpfully" fix the URL to change backslashes into regular forward slashes, I suppose because people sometimes type in URLs and get their forward and back slashes confused.

More likely because Windows has historically used \ rather than the / that's standard in Unixish systems. Windows people are used to typing \, so it's indeed somewhat helpful for the browser to accept either (e.g., in file:// URLs).
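
For the curious, a small demonstration of how that "helpful" rewrite becomes a mismatch. The urllib behaviour below is what current CPython does; the browser behaviour is per the WHATWG URL spec:

    from urllib.parse import urlsplit

    url = r"https://trusted.example\@attacker.example/"

    # urllib.parse treats the backslash as ordinary data, so everything
    # before the "@" is read as userinfo and the host is attacker.example.
    print(urlsplit(url).hostname)  # attacker.example

    # A WHATWG-style parser (what browsers implement for http/https) rewrites
    # "\" to "/", so the same string becomes
    # https://trusted.example/@attacker.example/ with host trusted.example.
    # Validate with one parser and fetch with the other, and you have a
    # parser mismatch.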


Odd that the article doesn't use the more standard term "parser differential", with "differential fuzzing" as the fuzzing community's method for finding those.


This is a LANGSEC concept. A broader survey can be found at: https://www.computer.org/csdl/proceedings-article/spw/2023/1...



I guess if we added up all the problems in IT that were caused by bugs and poor designs of parsers/serializations, e.g. SQL injections, XSS, null byte vulns etc., we'd get billions of human hours in damages.

What we should have instead is an absolutely unambiguous serialization into a byte string for ANY data structure that must be processed by two different programs.

Parsers are programs; they should "parse" bytes, not strings, like we humans do.
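
For what it's worth, length-prefixed encodings (netstrings, bencode and friends) are the usual answer to "unambiguous byte-string serialization". A toy sketch of the idea:

    def encode(data: bytes) -> bytes:
        # Netstring-style: explicit length prefix, no escaping, no delimiter
        # that can be confused with payload bytes.
        return str(len(data)).encode("ascii") + b":" + data + b","

    def decode(blob: bytes) -> tuple[bytes, bytes]:
        length, sep, rest = blob.partition(b":")
        if not sep:
            raise ValueError("missing length prefix")
        n = int(length)
        if rest[n:n + 1] != b",":
            raise ValueError("bad terminator")
        return rest[:n], rest[n + 1:]  # payload, remaining bytes

    payload = b'anything, even "quotes" and \x00 bytes'
    assert decode(encode(payload))[0] == payload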


If BABLR succeeds in creating a shared instruction set for defining parsers, you'd just have portable parser grammars running on compatible parser VMs


Usually? a result of the parser not having a machine-readable specification.

For parsing proper, `bison --xml` is useful if you're allergic to code-generation. I don't have an equivalent for lexing.


Honestly, we should have a name for this class of bugs. It's not an "I didn't know" kind of mistake. Every person sufficiently intelligent to program should figure out by themselves that having 2 parser implementations can cause various undesired consequences.


> Every person sufficiently intelligent to program should figure out by themselves that having 2 parser implementations can cause various undesired consequences.

I disagree. Hindsight is 20/20, it's now obvious to me that using two different parsers for the same thing in a single process can cause bugs, but it didn't occur to me before reading about it.

Now that I'm aware of this, in particular, I'll be extra careful not to parse something manually if something already does it in whatever I'm working on and there's an API for it.

URLs and paths are the canonical (ah!) example of this: it's tempting to just take the string and split by "/". Whoever has never done this may cast the first stone.

And people who have never written any parser may not clearly see this stuff.
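
To make the temptation concrete, here is the naive split next to an actual URL parser (Python's urllib, but any real parser makes the same point):

    from urllib.parse import urlsplit

    url = "https://example.com/search?q=a/b#frag/ment"

    # Naive approach: everything between "/" must be a path segment, right?
    print(url.split("/"))
    # ['https:', '', 'example.com', 'search?q=a', 'b#frag', 'ment']
    #   -> the query and fragment leak into the "path segments"

    # A real URL parser separates the components first.
    parts = urlsplit(url)
    print(parts.path)      # /search
    print(parts.query)     # q=a/b
    print(parts.fragment)  # frag/ment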


A lot of vulnerabilities are rather obvious, but hard to avoid due to more than one person working on a project.

For a sufficiently large project it's not always obvious that another implementation of something exists. And it's not always easy to search for a concept. You might use different words for it than the other person that implemented it.

And even if you find it, it might have a weird, badly thought-out API and a similarly rough implementation, making it likely that people reimplement it anyway.


>Honestly we should have a name for such class of bugs.

They are called parser differentials


Usually, some insufficiently verified and sanitized external input text managed to get into some complex and often brain-damaged text parser (printf, SQL, etc.).


What is brain damaged about SQL parsing?

But this is not about injection. This is about parser mismatch, where different parsers produce different results for the same input. The article is about URLs; curl's author has a good article on this too, and this is the (biggest?) motivation behind providing a URL parsing API in libcurl.

https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/


For bad historical reasons, every SQL implementation has a completely bogus and wrong way to do string and identifier lexing. Nowadays most of them can be told to be more standard (no backslashes, single quotes are for strings, double quotes are for identifiers, repeat the quote to escape) but the default is usually still wrong.

(and that's ignoring extensions)

Postgres is sanest but even it casefolds in the wrong direction.
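
A toy illustration of how that default bites: two tiny lexers that only decide where a string literal ends, one standard (doubled quotes only), one MySQL-style (backslash escapes too), disagreeing on the same bytes. Sketch only, not real database code:

    def end_of_string_standard(sql: str, start: int) -> int:
        """Index just past the literal opened at sql[start] == "'".
        Standard SQL: the only escape is a doubled quote ''."""
        i = start + 1
        while i < len(sql):
            if sql[i] == "'":
                if i + 1 < len(sql) and sql[i + 1] == "'":
                    i += 2          # doubled quote, still inside the string
                    continue
                return i + 1        # closing quote
            i += 1
        raise ValueError("unterminated string")

    def end_of_string_mysqlish(sql: str, start: int) -> int:
        """Same, but backslash escapes the next character (MySQL's default)."""
        i = start + 1
        while i < len(sql):
            if sql[i] == "\\":
                i += 2              # skip escaped character
                continue
            if sql[i] == "'":
                return i + 1
            i += 1
        raise ValueError("unterminated string")

    # The same bytes, two different answers about where the literal ends:
    sql = r"SELECT '\'; DROP TABLE users; --'"
    start = sql.index("'")
    print(end_of_string_standard(sql, start))  # short literal containing just \ ; the rest is live SQL
    print(end_of_string_mysqlish(sql, start))  # the whole tail is swallowed as string data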


> but even it casefolds in the wrong direction

I'd say that it's the only one that casefolds on the right direction.

We don't need to keep using the upper-case only idioms from the 80's. We can have legible text nowadays. (And yes, it's non-standard, but there are plenty of things that are best done by ignoring the SQL standard.)



