I don't follow. Parsers for what? I'm never clear what they're talking about. Parsers for programming languages? You don't think modern languages are competently parsed? Parsers for file formats?
Yes. As a trivial example, consider a UDP packet format { u16 type,size; u8 data[size]; } fed the 4-byte packet {ECHO,0xFFFF}, which is incompetently parsed as {ECHO,"<65535 bytes of stack memory>"} because the 'shotgun parser' assumes its input is well-formed. Whereas a 'recognizer' (i.e. a non-buggy parser) would reject that packet as 65535 bytes too short.
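To make that concrete, here is a rough sketch of such a recognizer in Python. The network byte order, the ECHO type code, and the names recognize and MalformedPacket are all my own assumptions for illustration; none of them come from a real protocol.

```python
import struct
from typing import Tuple

# Assumed wire layout: u16 type, u16 size, big-endian, followed by `size` data bytes.
HEADER = struct.Struct("!HH")
ECHO = 1  # hypothetical type code, for illustration only


class MalformedPacket(ValueError):
    """Raised when the input is not a well-formed packet."""


def recognize(packet: bytes) -> Tuple[int, bytes]:
    """Accept the packet only if its size field agrees with the bytes received."""
    if len(packet) < HEADER.size:
        raise MalformedPacket("truncated header")
    ptype, size = HEADER.unpack_from(packet)
    actual = len(packet) - HEADER.size
    if actual != size:
        raise MalformedPacket(
            "size field says %d, packet carries %d data bytes" % (size, actual)
        )
    return ptype, packet[HEADER.size:]
```

Fed the 4-byte {ECHO,0xFFFF} packet, this raises MalformedPacket before any data is touched, because the claimed size and the received size disagree.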
Good programming language design can make it harder to write the buggy parser and easier to write the 'recognizer', especially if the language/standard library provides built-in parsing tools.
See, I get the idea, but to me it's just like saying "decent code is better than shitty code". Well, I mean, no shit.
I'm not being snarky. Compare "LangSec" to memory safety, which also kills this class of bug dead. Which approach is more powerful and forecloses on more bug classes? Which approach requires more developer effort? Introduces more jargon?
I know multiple very smart, capable people who work under the rubric of "LangSec". But I just don't get it. Is it a real thing?
Actually I was just answering the "parsers for what" question.
FWIW, I think LangSec is saying "Code that doesn't have remote code execution vulnerabilities[0], or limits them to a weak computational model[1], is better than code with RCE vulnerabilities." - which is also "Well, no shit." - and "Parsing a nontrivial data format is the same thing as executing a (not-necessarily-)very constrained programming language."[2] - which seems obvious to me, but could plausibly be a "superpositions don't collapse"-level epiphany for someone who doesn't think about parsing the right way.
0: such as javascript or stack execution
1: like FSMs or pushdown automata
2: with the implication that you had better make sure it actually is very constrained
There are whole classes of errors related to programs that parse then validate input when it's already too late. And often the validation happens in the source code in a cloud of checks that happen at run time. It's rather difficult to verify these programs.
It is much easier to verify a parser that only produces valid values at the edge of a program, isolated from the main program.
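A minimal sketch of that shape, assuming a made-up message with a 2-byte big-endian length prefix and a made-up 1000-byte limit (Echo, parse_echo and handle are hypothetical names). Python won't stop other code from constructing Echo directly, so the isolation here is by convention, not enforced by the language.

```python
from dataclasses import dataclass

MAX_PAYLOAD = 1000  # hypothetical limit, for illustration only


@dataclass(frozen=True)
class Echo:
    """A value that, by convention, only parse_echo() creates."""
    payload: bytes


def parse_echo(raw: bytes) -> Echo:
    """The edge: the only place where untrusted bytes are inspected."""
    if len(raw) < 2:
        raise ValueError("missing length prefix")
    size = int.from_bytes(raw[:2], "big")
    if size > MAX_PAYLOAD:
        raise ValueError("declared size exceeds limit")
    if len(raw) - 2 != size:
        raise ValueError("declared size does not match received data")
    return Echo(payload=raw[2:])


def handle(msg: Echo) -> bytes:
    """Main program: takes an Echo, never raw bytes, so it has no checks to forget."""
    return msg.payload
```

All the checks live in one small function that can be reviewed or tested on its own, instead of being scattered through the code that uses the data.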
Yes, parsers for incoming packets and file formats, for example. If they are poorly written, you end up with bugs like Heartbleed. I should maybe add that the LangSec folks try to provide formal footing for what an adversary can achieve with a given set of primitives. They refer to it as programming a weird machine. This matters because, in their view, access to computational capability amounts to privilege. If you have a packet coming in from the untrusted outside, you should not allow it to be parsed by a parser with unbounded computational complexity; rather, you should opt for a computationally limited parser.
Could you be more specific about how Heartbleed was a parsing problem?
I'm familiar with the LangSec lingo and the concept of a "weird machine", much as I hate the term itself, has value. But it's not a product of LangSec so much as a name for a concept we've had for decades.
In Heartbleed the parser parsed an incoming field specifying length, but failed to correlate it with the rest of the request. Had the parser been written to a stricter specification, that would not have happened. Typically that is what is meant by bounding computational complexity. Or at least that is my understanding of it.
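Roughly, the missing correlation looks like this. It is a simplified sketch of the heartbeat layout (1-byte type, 2-byte payload_length, payload, at least 16 bytes of padding), not OpenSSL's actual code, and parse_heartbeat is a name I made up.

```python
HEADER_LEN = 3    # 1-byte message type + 2-byte payload_length
MIN_PADDING = 16  # the heartbeat format requires at least 16 padding bytes


def parse_heartbeat(record: bytes):
    """Return (msg_type, payload), or None if the message should be discarded.

    The claimed payload_length is checked against the number of bytes
    that actually arrived in the record before any payload is read.
    """
    if len(record) < HEADER_LEN + MIN_PADDING:
        return None  # too short to be a heartbeat message at all
    msg_type = record[0]
    payload_length = int.from_bytes(record[1:3], "big")
    if HEADER_LEN + payload_length + MIN_PADDING > len(record):
        return None  # length field claims more data than was received
    payload = record[HEADER_LEN:HEADER_LEN + payload_length]
    return msg_type, payload
```

A request claiming 65535 payload bytes but carrying none is dropped at the second check, so there is nothing left for later code to over-read.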
I'm familiar with the bug, but not with the parser-theoretic response to the bug. Would LangSec somehow do away with the on-the-wire length encoding? Or would it simply say "your parser should check the length of incoming data"? Isn't that about as useful an insight as "validate user input"?
You raise a good point. The thing to realize is that the incoming packet data, like all data, is code. "Code written for which machine?" you might ask. Well, think of that data as code that is executed by the parser. If you think of the parser as the machine (yes, it is weird), then it becomes easier to see how the adversary tries to program it using data as code. Just like we advise people never to eval() untrusted data, the LangSec response is not just to validate user input, but to constrain the programmability of the machine. By lowering its ability to compute, you are effectively lowering the privilege given to the adversary. So, the LangSec answer would be to design the protocol so as to not require a parser that can become so easily confused by conflicting information. Hope that helps. FWIW, I think you raise a great point about the ease of use and practical utility of LangSec, and agree that its potential is thus capped.
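As a toy illustration of constraining the machine: suppose the input is supposed to be a comma-separated list of small decimal numbers (a format I'm inventing for the example). Instead of handing it to eval(), you can accept only strings in that regular language. parse_number_list is a hypothetical name, and Python's re module is a backtracking matcher rather than a true finite automaton, so treat this as a sketch of the idea and not a formal guarantee.

```python
import re

# The whole input language: numbers of 1 to 5 ASCII digits, separated by commas.
# No nesting, no names, no operators, so nothing for the input to "program".
NUMBER_LIST = re.compile(r"[0-9]{1,5}(,[0-9]{1,5})*")


def parse_number_list(text: str) -> list:
    """Accept only strings in the number-list language; reject everything else."""
    if NUMBER_LIST.fullmatch(text) is None:
        raise ValueError("input is not in the number-list language")
    return [int(tok) for tok in text.split(",")]
```

The same text passed to eval() would give the sender the full Python interpreter; here the most they can do is choose which numbers end up in the list.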
I don't understand how that's a problem with a parser. A parser is a tool for turning a flat buffer into structured data. The heartbleed problem described there is that a person wrote code that read outside of the buffer it was meant to read. Why aren't they independent problems?
Also, how can you design an efficient data format that accepts data of a length determined by the user (hopefully cooperatively with the server), that is immune to buffer overflow reads?
Once you say "I want between four and one thousand bytes", haven't you just stuffed yourself?