Can you explain why 'validating it with grammar' is better than 'some adhoc parser'?
IMHO a webserver is one of the things you want to be relaxed, laid back, and basically not care if the client gets things wrong. Just serve up what it looks like they wanted.
Many of the security attacks on Apache (and other web servers) were based on invalid requests.
The original Mongrel was known for its strict request parser, which didn't let many of those same security attacks through. In fact, in the Ruby world, many non-Mongrel web servers reused the Mongrel parser for that very reason.
As long as you don't get into the algorithms it's pretty simple.
A hand-written HTTP parser is kind of like writing a "black-list" of what the server rejects. Since there's no algorithm backing it, the only thing you can do is list out all the things you can think of, or have run into, that are "wrong".
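To make that concrete, here's a minimal, purely illustrative sketch (not Mongrel's code) of the black-list shape a hand-written check tends to take: you enumerate the bad inputs you've already thought of, and everything else is waved through.

```python
# Hypothetical ad-hoc, black-list style check (illustration only).
# It can only reject the bad things its author has already thought of.
def adhoc_request_line_ok(line: str) -> bool:
    if ".." in line:          # path traversal we got burned by once
        return False
    if "\x00" in line:        # embedded NUL bytes
        return False
    if len(line) > 8192:      # absurdly long request lines
        return False
    # Anything not on the list sails through: malformed methods, bare
    # control characters, encoding tricks nobody has run into yet...
    return True
```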
Using a parser (well, a lexer really) like Ragel, I can make something that's relaxed, but it's more of a white-list of what it accepts. The algorithm explicitly says that this particular set of characters in this grammar is all I'll answer to.
If you then write the grammar so that it handles 99% of the requests you run into in the wild, you get the same relaxed quality as a hand-written parser, but it explicitly drops the 1% that are invalid or are usually hacks.
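As a loose illustration of the white-list idea (a plain regular expression standing in for the real Ragel grammar, so treat it as a sketch only): the grammar names exactly what gets accepted, and everything outside it is rejected by construction.

```python
import re

# Illustrative white-list grammar for a request line: only strings
# matching this pattern are accepted; everything else is rejected.
REQUEST_LINE = re.compile(
    r"(GET|HEAD|POST|PUT|DELETE|OPTIONS)"   # the methods we answer to
    r" (/[^ \x00-\x1f]*)"                   # an origin-form target, no control chars
    r" HTTP/1\.[01]"                        # the versions we speak
)

def parse_request_line(line: str):
    m = REQUEST_LINE.fullmatch(line)
    if m is None:
        return None                          # not in the grammar -> dropped
    return m.group(1), m.group(2)            # (method, target)
```

So `parse_request_line("GET /index.html HTTP/1.1")` comes back as `("GET", "/index.html")`, while a request line with a stray NUL byte or a made-up protocol version simply returns `None`, without anyone having to anticipate that specific trick.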
This is also the same Ragel-based parser that powers a large number of web servers in multiple languages, so it's proven to work.
Yeah, that is what Microsoft did (does). The result is most of the requests "look like" a desire to serve up viruses or spam.
A grammar is theoretically provable (yes, that is a double entendre). An ad-hoc implementation is not provable and exhaustively testing its validity is unrealistic for anything but trivial grammars.
Sorry but I'm even more confused now. What are we proving?
"The result is most of the requests "look like" a desire to serve up viruses or spam."
I have no idea what you mean by that.
HTTP is a trivial grammar. The parser is the simple bit. What you do with the headers and how you respond to them is the more interesting bit.
Why would rejecting invalid requests be desirable? Why not just serve up what we think they want? (Of course there are levels of 'invalid'. Reject the crazies, but allow some.)
With a parser that implements a grammar, you can prove that (a) it accepts every string that is valid as defined by the grammar and (b) it rejects every string that is invalid. Specifying a grammar is relatively straightforward (hopefully). Proving that an ad-hoc parser does (a) and (b) is nearly impossible.
Ad-hoc parsers can be shown to accept all "OK" strings that somebody used to test the parser and to reject all "not OK" strings that somebody used to test the parser.[1] "The problem with idiots (and black-hats) is that they are so ingenious." The only way to prove that an ad-hoc parser is truly correct is to run all possible strings through it, complete with a priori knowledge of which strings are OK and which are to be rejected. This is an O(infinite) problem (essentially the halting problem: http://en.wikipedia.org/wiki/Halting_problem).
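To illustrate the asymmetry with a toy example (nothing to do with Mongrel's actual parser): with a grammar, membership is the definition, so there is nothing left to test; with an ad-hoc check you can only sample inputs and hope you sampled the one it gets wrong.

```python
import re

# The grammar: an unsigned decimal number field, nothing else.
GRAMMAR = re.compile(r"[0-9]+")

def grammar_accepts(s: str) -> bool:
    return GRAMMAR.fullmatch(s) is not None   # membership IS the definition

def adhoc_accepts(s: str) -> bool:
    # A perfectly reasonable-looking hand-rolled version...
    return s != "" and s.isdigit()

# ...which quietly disagrees on inputs nobody happened to test:
for s in ["42", "", "4 2", "٤٢", "²"]:
    print(repr(s), grammar_accepts(s), adhoc_accepts(s))
# '٤٢' (Arabic-Indic digits) and '²' (superscript two) pass the ad-hoc
# check but are not in the grammar; you only find out if you test them.
```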
Guessing intent is a wormhole: how close does the request need to be? What if you guess wrong?
The combination of ad-hoc parsers with guessing intent is a potent way to introduce security flaws in your program. In the case of a web server, the "attack surface" is the whole internet, i.e. there is a huge number of idiots and black-hats that could potentially attack your program.
[1] War story: in a previous life, the company decided they needed a custom code-standards checker program (the result of a chain of four or five decisions, all of them really stupid, but that is a different war story). They contracted out the creation of the program, complete with a requirement that the contractor write the test cases (fox in the hen house). The program was a POS (how did you know that was coming???).
When I looked at the test cases, they had one "positive" test case (i.e. it catches a "bad" construct) and NO "negative" test cases (i.e. checking that it doesn't produce false positives). As a result, when run on real code, the "standards checking" program was actively sabotaging good code!
The headers and such for even the most static requests still get used all over: dispatch, caches, logging, etc. The overhead is minuscule, especially compared to a hand-rolled parser that's literate enough to be maintainable.
And the purpose isn't to "block application-specific hacky-looking requests"; it only does that as a side effect. This isn't some inane IDS bullshit sold to PHBs: it's not looking for exploit signatures, it just validates all input as a consequence of being correct.
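As a rough, self-contained sketch of what "the headers still get used all over" means even for a static file (all the names below are invented for the example):

```python
import logging

# Invented stand-ins so the sketch runs on its own.
STATIC_FILES = {"/": b"<h1>hello</h1>"}
ETAGS = {"/": '"abc123"'}

def handle(method: str, path: str, headers: dict):
    """Even a static GET consults the parsed headers several times."""
    host = headers.get("host", "default")            # dispatch / virtual hosting

    if headers.get("if-none-match") == ETAGS.get(path):
        return 304, b""                              # cache revalidation

    logging.info("%s %s %s %s", host, method, path,
                 headers.get("user-agent", "-"))     # access logging

    body = STATIC_FILES.get(path)
    return (200, body) if body is not None else (404, b"not found")

print(handle("GET", "/", {"host": "example.com", "user-agent": "curl/8.0"}))
print(handle("GET", "/", {"if-none-match": '"abc123"'}))
```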
It's quite simple really. Do you like your compiler? Or would you rather write code and hope for the best? Compilers work because they have a formal grammar of what is and is not acceptable in the programming language. This same principle is being applied to handling web requests. We have a standard - HTTP - and any requests that don't conform to the protocol are immediately rejected by Mongrel2. Since many attacks against web servers involve sending improper web requests, this sort of approach simply rejects those requests and doesn't even begin to process them. This certainly doesn't prevent Mongrel2 from implementing proper security at other appropriate places in the code. It simply stops a whole lot of potential exploits before they start.
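A hedged sketch of that "reject before you process" shape in plain Python (an invented, simplified grammar check standing in for Mongrel2's actual Ragel-generated parser):

```python
import re

# Simplified stand-in for the protocol grammar.
REQUEST_LINE = re.compile(r"(GET|HEAD|POST) /[^ \x00-\x1f]* HTTP/1\.[01]")

def handle_raw_request(raw: bytes) -> bytes:
    try:
        line = raw.split(b"\r\n", 1)[0].decode("ascii")
    except UnicodeDecodeError:
        return b"HTTP/1.1 400 Bad Request\r\n\r\n"   # not even ASCII: reject

    if REQUEST_LINE.fullmatch(line) is None:
        # Doesn't conform to the grammar: never processed at all, so
        # whatever exploit it was carrying never gets a chance to run.
        return b"HTTP/1.1 400 Bad Request\r\n\r\n"

    return dispatch(line)                            # only valid requests get here

def dispatch(line: str) -> bytes:
    return b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
```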
HTTP requests come from millions of different browsers. Some with bugs, some with idiot creators, etc etc.
My point was that an HTTP request parser is trivial to write correctly. What you do with the headers and the request later on is where you sometimes need to be careful.
TBH, though, I think I'm just in a different world from all of this Mongrel stuff.