I agree - I find your post clearer than the original paper, Crockford's explanation, and Lundh's Python post. It helps that you start with just addition and multiplication, then add more functionality after explaining the basics.
That post is excellent! Somehow I didn't find that when I was looking to learn more about TDOP. The only change I made from yours and Pratt's is that I split out the nud and led parsers completely. It seemed weird to me to lump them together just because they share a token when they aren't semantically related.
For example, the "(" token handles grouping in prefix position and function calls in infix, and those don't have anything to do with each other, so I split out PrefixParselets and InfixParselets into entirely separate dictionaries.
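To make that concrete, here's a rough sketch of the split in Python; the parser helpers (parse_expression, expect, match) and the precedence number are made-up names for illustration, not code from either post:

    PREFIX = {}   # token type -> prefix (nud-style) parselet
    INFIX = {}    # token type -> infix (led-style) parselet plus its precedence

    def parse_group(parser, token):
        # "(" in prefix position: a parenthesized sub-expression
        expr = parser.parse_expression(0)
        parser.expect(")")
        return expr

    def parse_call(parser, left, token):
        # "(" in infix position: a call with `left` as the callee
        args = []
        if not parser.match(")"):
            args.append(parser.parse_expression(0))
            while parser.match(","):
                args.append(parser.parse_expression(0))
            parser.expect(")")
        return ("call", left, args)

    PREFIX["("] = parse_group
    INFIX["("] = (parse_call, 80)   # calls bind tightly

The entry for "(" in one table knows nothing about the entry in the other; they share a token and nothing else.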
I love Pratt parsers (or rather, Top Down Operator Precedence parsers). I recently rewrote the IronJS lexer and parser by hand (replacing one generated by ANTLR) using the techniques described in this blog post and saw some pretty hefty performance improvements: http://ironjs.wordpress.com/2011/03/19/new-lexer-and-parser-...
Edit: I also have a generalized version that can take pretty much any input and any output (F#) here: https://github.com/fholm/Vaughan
I believe this algorithm is also known as "precedence climbing", and it's the most common way to deal with operator precedence in a recursive-descent parser.
I'd say that having a single production for each precedence level is the more common technique in recursive descent, but that can be slow because every factor needs to be parsed by recursing through all the precedence levels. This technique avoids that: it can parse a factor straight away before going into the precedence-checking loop.
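For concreteness, the loop looks roughly like this (a generic precedence-climbing sketch in Python; the token handling and precedence table are invented for illustration, not taken from the article):

    PRECEDENCE = {"+": 10, "-": 10, "*": 20, "/": 20}

    def parse_expression(tokens, min_prec=0):
        left = parse_factor(tokens)          # a factor is parsed exactly once
        while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            # recurse with a higher minimum so equal-precedence operators stay left-associative
            right = parse_expression(tokens, PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    def parse_factor(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            expr = parse_expression(tokens, 0)
            tokens.pop(0)                    # consume the ")"
            return expr
        return tok                           # number or identifier

    parse_expression(["1", "+", "2", "*", "3"])   # => ('+', '1', ('*', '2', '3'))

Compare that with a grammar that has one production per precedence level, where parsing a lone "3" still has to walk through every level (expression -> term -> factor) even though no operators are present.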
Delphi uses this parsing method for expressions, and it's one of the reasons it's so fast at compiling.
I don't actually know if Delphi is fast or not, I'll take your word on it.
However, I'd be skeptical that the parser is one of the main reasons for its fast compile times. In my experience, parsing is really a tiny amount of compilation time. The VB.NET compiler, for example, uses the same parsing technique, and it is ... "not fast" at compiling. It spends most of its time binding symbols.
Delphi's parser binds symbols and types the tree as it parses. The only thing left to do after parsing is code generation, which is only a couple of passes over the tree.
The parser (or more accurately, the lexer) is the bit of the compiler which ultimately limits its performance, because it needs to see every character of the source. The compiler can never be faster than linear in the length of its input. Metrics on the Delphi compiler actually show that string hashing is the hottest piece of code, and that's already optimized about as far as it can go.
Which means it's fast because of the way it binds symbols, not because of the way it parses infix expressions. One thing to consider is that such a design hurts the usability of the language: it's a lot easier not to have to worry about forward declarations and "module" interfaces. I don't know Delphi, but I imagine what you describe relies on something like the "unit" construct from Turbo Pascal.
There are many things that make it fast - or rather, it's the lack of slow things that makes it fast. Excessive recursion during parsing is one of the things it avoids (and that was what was on-topic here). Your focus on binding is a focus on something that another compiler does particularly slowly. That a different compiler is fast is not due to its being fast at binding specifically; it's due to being fast, or at least not slow, at almost everything.
Delphi does indeed inherit unit syntax from Turbo Pascal. This is both a blessing and a curse; it makes writing well-structured programs easier by enforcing a logical consistency in ordering (programs more or less have to be written in a procedurally decomposed way), but it also makes writing programs with complex interdependencies between parts more awkward.

The unit format is also one of the things that makes the compiler fast (or not slow); relinking a binary unit based on a partial compile is very quick, and as it's not a dumb C-style object format, the compiler can be intelligent about dependencies. Every symbol has a version, a kind of hash, associated with it, computed from its type, definition, etc. When units are relinked after a recompile of a unit, only the symbols need to be looked up and their versions compared; if there is no version mismatch between imports and exports, then dependent units don't need recompilation.
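The version-comparison idea is roughly this (a toy illustration only, not actual Delphi internals):

    import hashlib

    def symbol_version(name, type_sig, definition):
        # the version is a hash of whatever the symbol's meaning depends on
        return hashlib.sha1(f"{name}|{type_sig}|{definition}".encode()).hexdigest()

    def needs_recompile(imported_versions, exported_versions):
        # imported_versions: {symbol: version recorded when the dependent unit was built}
        # exported_versions: {symbol: version in the freshly recompiled unit}
        return any(exported_versions.get(sym) != ver
                   for sym, ver in imported_versions.items())

If nothing a unit imports has changed version, the unit can be relinked without being recompiled.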
Random factoid: the guy who wrote the original source for the current Delphi compiler was also one of the developers of the .NET GC and a CLR performance architect (Peter Sollich).
I've also played around with combinator parsing using F#'s FParsec (a clone of Haskell's Parsec library) and found it very pleasant and powerful to work with.
[sorry for the plug, but I think my tutorial is more comprehensive in terms of explaining how the thing actually works]