It amazes me how far out of their way many people will go to avoid using parsing tools.
He avoided using a lexer generator because "I'd figured using a lexer generator was more trouble than it was worth". To him it was more convenient to write a lexer manually (using macros) in C++, then later completely rewrite it in a different language (D) as a one-off trie matcher code generator.
I am amazed that this could be thought of as less work. How can writing two lexers from scratch manually be less trouble than writing a bunch of regexes in a Flex or Ragel file? Especially since the result was likely slower than using Flex or Ragel would have been?
To me the interesting question is: how much of this allergy to external tools is:
1. the trouble of learning/maintaining something new/different (inherent tool overhead)
2. design of the tool isn't as user-friendly as it could be (incidental tool overhead)
3. irrational belief that the tool will be more work/trouble than it actually is (non-optimal decision making)
If part of the answer is (2), then improved tools will lead to greater adoption. And hopefully more convenient tools will lead to them being better known and more widely adopted, which should lessen (1) also.
Everyone uses regexes; using a lexer generator should (in my opinion) be as easy as using a one-off regex. I think the key is to make the lexer generator an embeddable library like regex libraries are.
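To make that concrete, here is a rough sketch (illustration only, not the API of any existing library) of what "lexer generator as an embeddable library" could feel like in C++: token rules supplied as ordinary regexes and tried in order at the current position. A real tool would compile the rules into a single DFA instead of trying each regex separately, but the point is the usage experience:

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    struct Rule  { std::string name; std::regex pattern; };
    struct Token { std::string name; std::string text; };

    // Try each rule at the current position; emit a token for the first one that matches.
    std::vector<Token> lex(const std::string& input, const std::vector<Rule>& rules) {
        std::vector<Token> out;
        for (auto it = input.cbegin(); it != input.cend(); ) {
            std::smatch m;
            bool matched = false;
            for (const Rule& r : rules) {
                if (std::regex_search(it, input.cend(), m, r.pattern,
                                      std::regex_constants::match_continuous)) {
                    if (r.name != "ws") out.push_back({r.name, m.str()});  // drop whitespace
                    it += m.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) ++it;  // no rule matched; real code would report an error here
        }
        return out;
    }

    int main() {
        std::vector<Rule> rules = {
            {"ws",     std::regex(R"(\s+)")},
            {"number", std::regex(R"(\d+)")},
            {"ident",  std::regex(R"([A-Za-z_]\w*)")},
            {"op",     std::regex(R"([-+*/=])")},
        };
        for (const Token& t : lex("x = 40 + 2", rules))
            std::cout << t.name << ": " << t.text << "\n";
    }

If specifying a lexer were really that close to writing one-off regexes, the "more trouble than it's worth" objection would mostly go away.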
Using third party lexer generators means your project is now dependent on that tool. This means:
1. the tool must exist on every platform your project will ever be ported to
2. the same version of that tool must exist on every such platform
3. if the hosted compiler changes, the tool must be updated in sync with that compiler - out of sync means your project won't build
4. if there are blocker bugs in the tool, you're not in a good position to fix it. You're stuck with developing workarounds.
5. Everyone building your project has to get the tool successfully installed.
The current build of D's runtime library has a dependency on libcurl. Supposedly, since libcurl is ubiquitous, that shouldn't be any problem. But it has been an ongoing sore spot, relentlessly sucking up multiple people's time because of exactly the issues I outlined above.
> Using third party lexer generators means your project is now dependent on that tool.
I would submit that porting D is significantly harder than porting flex and/or ragel. Not to mention that the number of platforms that support C++ and D but not flex should be smaller than the number of platforms that support C++ and flex but not D.
Your arguments seem to equally apply to using third-party libraries at all. Are you suggesting that it's better to (for example) re-implement HTTP rather than use libcurl?
I'm saying you've got to have pretty strong arguments to make a project dependent on a third party tool, if you expect to distribute it, expect it to be long lived, and expect to port it to many platforms.
Yes I agree with the point that the portability of your dependencies limits your own portability.
This is definitely something that is dependent on the tool or library though. For example, if you decided to use Lua as a tool, it is extremely unlikely that the portability of Lua will be a limiting factor. It is very likely to be more portable than your code (given that it compiles unmodified on Turbo C 1.0 from 1990: http://www.youtube.com/watch?v=-jvLY5pUwic)
So I definitely agree with "choose your dependencies wisely."
There's a fine line to walk between "Not Invented Here" and suffering through an inadequate tool. We at Facebook have a long history of trying stock tools and seeing them fail at our scale.
I'm going to have to disagree with you and Walter. While I agree that for a fixed language, a well-written hand-built recursive descent parser will beat the pants off the output of a parser generator, languages aren't necessarily fixed and writing a good hand-built parser is harder than it seems. I still recommend that people start with parser generators.
(Edit: yes, we're talking about lexers above, but the article mentions both lexers and parsers. I think the case for using a hand-built lexer is much weaker than the case for using a hand-built parser, not that I'm a fan of either.)
If you use a parser generator, you can be sure that your language has a clear, (hopefully) LALR(1) syntax that anyone can understand and port to another parsing environment. If you hand-parse, then the temptation exists not only to introduce subtle irregularities and context dependencies into the language, but also to mix semantic interpretation and parse tree generation, and both of these anti-patterns make it difficult to build generic tools that work with your new language. With a hand parser, you also run the risk of making silly coding errors that end up being baked into your language specification. (Ahem, PHP.)
> While I agree that for a fixed language, a well-written hand-built recursive descent parser will beat the pants off the output of a parser generator
I'm not sure this is true, do you have numbers that demonstrate this?
I know that Clang knows some tricks that lexer generators don't currently know, like using SSE to skip long blocks of comments. But even that, to me, represents an opportunity for better parsing tools, not an inherent performance advantage of hand-written parsers/lexers.
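The trick itself isn't deep magic, either. A rough sketch of the idea (not Clang's actual code, and it leans on the GCC/Clang builtin __builtin_ctz) is to test 16 bytes at a time for the only character that can end a block comment, falling back to byte-by-byte scanning only when one shows up:

    #include <emmintrin.h>   // SSE2 intrinsics

    // Skip the body of a /* ... */ comment, 16 bytes at a time. Returns a pointer just
    // past the closing */ (or `end` if the comment is unterminated).
    const char* skipBlockComment(const char* p, const char* end) {
        const __m128i stars = _mm_set1_epi8('*');
        while (end - p >= 16) {
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, stars));
            if (mask == 0) { p += 16; continue; }      // no '*' anywhere in these 16 bytes
            p += __builtin_ctz(mask);                  // jump to the first '*'
            if (p + 1 < end && p[1] == '/') return p + 2;
            ++p;                                       // lone '*': keep scanning
        }
        for (; p + 1 < end; ++p)                       // byte-by-byte for the tail
            if (p[0] == '*' && p[1] == '/') return p + 2;
        return end;
    }

There's no reason a generated lexer couldn't emit the same kind of skip loop.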
> I'm not sure this is true, do you have numbers that demonstrate this?
No numbers offhand, but GCC's experience of moving toward a hand-written C++ parser was very positive. I don't believe hand-written parsers have an inherent speed advantage, but it does seem that it'd always be easier for a hand-written parser to produce more informative error messages --- if anything, because the hand-written parser has more embedded syntactic information.
> I don't believe hand-written parsers have an inherent speed advantage, but it does seem that it'd always be easier for a hand-written parser to produce more informative error messages
From what I can see, this is the major shortcoming of parser tools at the moment. I haven't looked into this problem too deeply yet, but at the moment I suspect that it isn't possible to get good enough error messages automatically.
But to me this doesn't mean "parsing tools will never have good error messages", it means "parsing tools need to evolve to give users the right hooks to produce good error messages themselves."
This would let the algorithms do the part that they do well (analyzing the grammar and creating state machines) while letting the human do the part that can't be done automatically (creating good error messages).
Getting good error messages from the parser generator is a solved problem, but it seems that solution is not widely known.
As you said, it isn't possible to get good error messages fully automatically. The solution is to include error cases in the grammar. No hooking is necessary. If people often miss semicolons, include a missing-semicolon case in the grammar!
The Mono C# compiler uses this technique (including error cases in the grammar) to great effect. I wonder why people don't do that and instead resort to hand-written parsers.
Sure, things might change. I'd be happy with better error messages from automatically-built parsers. There's a lot of potential, I think, in methods that take advantage of having a formal description of the language grammar --- you could, for example, automatically generate a sentence that "fits in" at the point at which an error occurs. There's still a lot of progress to be made in the area, though, and for now, hand-generated error messages do seem to be more useful to people.
> We at Facebook have a long history of trying stock tools and see them failing at our scale.
I can understand that, having experienced the same before.
I just re-read my message and I regret the way it comes off -- my intention in writing it wasn't to be critical of you, but to explore the issue of how inconvenient parser tools are perceived as being, and try to analyze why that is.
"Insanity" was meant to refer to the overall situation, and not you personally! I might re-word this a bit.
I still think that we ought to be in a position where the tool is so easy and so valuable that it is a no-brainer. And I think this largely indicates that there is a long way for the tools to improve.
> Everyone uses regexes; using a lexer generator should (in my opinion) be as easy as using a one-off regex. I think the key is to make the lexer generator an embeddable library like regex libraries are.
There actually is a really powerful parser generator for D by Philippe Sigaud, Pegged (https://github.com/PhilippeSigaud/Pegged). It's much more pleasant to use than something like Boost Spirit.
Somewhat relatedly, D's standard library has a compile time regex engine that compiles regular expressions down at compile-time resulting in some of the fastest regular expressions in the world.
That's cool, though "some of the fastest regular expressions in the world" might be a bit of hyperbole: it would be comparable to any compile-time generator like Flex, Ragel, or a JIT generator like PCRE-sljit. And since D is a memory-safe language, I would expect the guards on buffer access to cause measurable overhead compared with C or C++ generated code.
> And since D is a memory-safe language, I would expect the guards on buffer access to cause measurable overhead compared with C or C++ generated code.
The typical C FUD that has been spread against memory-safe languages since the early '80s.
Most optimizing compilers are able to remove bounds checking if dataflow analysis can prove the code stays within bounds.
For the cases where the compiler is not that smart, and the programmer has proven with the help of profilers that bounds checking is indeed a problem, most compilers offer the possibility to disable bounds checking either at module or code-region level.
So one gets the memory safety by default, but if really really really required, it is still possible to rescue those extra ms.
The thing is, in these languages one needs to ask permission to shoot one's foot off, instead of being able to do it on every line of code.
By that I meant it is in the same league as the fastest regular expression engines in the world. I should have worded that better. It comes out at around the same performance as V8's regex engine and Boost Xpressive.
Writing parsers is fun, and having done it several times, I know roughly how long it will take. Using a highly specialized tool has risks of its own: that I'll sink a lot of time into learning it, only to discover it isn't a good fit for what I want to do, and that I'll have to go through the whole learning curve again a few years from now because I don't write parsers very often and will have forgotten how the tool worked by then. Then there's the fun of debugging Other People's Code.
Not to mention the way that open-source tools tend to drag in a godawful mess of dependencies, most of which they don't really even need. In my experience, I can write a basic recursive-descent parser from scratch in C++ before I can get a warning-free compile out of many open-source packages.
So yes, there are reasons to roll your own parser. They may not be good reasons in your view, but don't shoot the messenger.
(All of this being said, there's no way I'd try to bake a fully-compliant C++ language parser from scratch.)
Well, one problem could be that C++ does not have a context-free grammar, and in fact its grammar is notoriously difficult to parse, so perhaps it's just as easy to write a parser by hand as it is to bend an existing parser generator to fit C++'s grammar?
In fact, I am not aware of any modern C++ static analyzer or compiler that uses an automatic parser generator.
By the way: I'm glad you mentioned Ragel. It's a fantastic dependency-free state machine generator that everyone should check out. http://www.complang.org/ragel/
I just discovered Ragel also, and was wondering if you or anyone else might know of a tool that could convert a coroutine or generator to a state machine? I asked about it on stack overflow but so far no responses:
See my comment here: https://news.ycombinator.com/item?id=7293796. Making it possible for anyone to add a company-specific check quickly is a valuable asset (to enforce decisions like "We don't want to use idiom X anymore").
Moreover, fake parsers like cppcheck will lint your code even if it does not build, so they are really easy to set up.
Are there any studies available comparing development effort for the two approaches, using an existing tool vs. writing one's own? It is not obvious to me in how many percent cases the former is better.
I have many times used an existing tool and, after getting the thing to work, come away with the feeling that it may have been easier to do it myself. I used Flex in a compilers course on Coursera, and getting the lexer to match the specification took about two full days, which seemed like too much for the specification at hand in the assignment. I have written a CFG parser from scratch in less time (not to say that these two tasks are comparable from a development-effort point of view).
> Everyone uses regexes; using a lexer generator should (in my opinion) be as easy as using a one-off regex. I think the key is to make the lexer generator an embeddable library like regex libraries are.
Ouch! This seems to apply to me. I've written lexers manually in the past but never bothered to learn a lexer generator properly. I guess I should. Would Flex, Ragel or something else be the best starting point? I have unpleasant memories of really cryptic error messages from the compiler when trying to get someone's toy project in Flex to work.
I recommend you start learning a Parser Combinator Library (like Parsec). These are nicer (especially to start out with).
The downside is that not all programming languages are powerful enough to host these libraries---so you might have to fall back on a parser generator later if your language of choice is not supported.
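For anyone who hasn't seen the style, here is a toy sketch of the combinator idea expressed in C++ (Parsec itself is Haskell, and its real API is much richer): a parser is a function from input to an optional (result, remaining input) pair, and combinators build bigger parsers out of smaller ones:

    #include <functional>
    #include <optional>
    #include <string>
    #include <string_view>
    #include <utility>

    // A parser consumes a prefix of the input and either fails or returns a value
    // plus the remaining input.
    template <class T>
    using Parser =
        std::function<std::optional<std::pair<T, std::string_view>>(std::string_view)>;

    // Match one specific character.
    Parser<char> ch(char c) {
        return [c](std::string_view in)
                -> std::optional<std::pair<char, std::string_view>> {
            if (!in.empty() && in.front() == c) return std::make_pair(c, in.substr(1));
            return std::nullopt;
        };
    }

    // Apply p zero or more times, collecting the matched characters.
    Parser<std::string> many(Parser<char> p) {
        return [p](std::string_view in)
                -> std::optional<std::pair<std::string, std::string_view>> {
            std::string out;
            while (auto r = p(in)) { out.push_back(r->first); in = r->second; }
            return std::make_pair(out, in);
        };
    }

    // many(ch('a'))("aaab") succeeds with the value "aaa" and the remainder "b".

Boost Spirit is the best-known C++ take on this style, and libraries like Pegged (linked elsewhere in this thread) are in the same family.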
Once you get used to tools like `go fmt` or Perl::Tidy it feels barbaric not to be able to cleanly format all your code instantly. Read only linters are nice but something that can reformat code is much nicer.
http://astyle.sourceforge.net/
Linters are more important than pretty formatting, to my mind. C++ has had both linters and reformatters for a long time.
I suppose the distinction is that they are not built in.
I guess I just don't see that. perltidy and lint(s) are completely nonoverlapping, to my knowledge. Something that can reformat your code doesn't even do the same job as a linter.
Ah well, perhaps it's just a difference in experience. I've never used a linter (there are countless linters) that fixed indentation, arranged if statements, and inserted newlines to wrap long lines. Nor has a formatter ever helped me find the sort of issues I use a linter to find.
There is overlap. I once lost a fair amount of time because some dimwit predecessor thought putting anything other than a space after commas was acceptable. (It sounds ridiculous but it's true. I can explain if you care.)
So: a formatter will, for example, make sure your lines are indented to some standard amount of indentation, a linter will bitch if they are not, and both the formatter and the linter should DTRT when they identify certain indefensible coding practices.
I wouldn't want to reach for lint to handle formatting, that is what indent is for. That is to say, lint should not complain that the if should contain a space before the (.
Years ago I remember learning template metaprogramming in C++ from an amazing book [1] by the same author as this tool. Even if, looking back, I now think template metaprogramming can be a bit over the top when over-used, I have no doubt, judging by its author, that this new tool must have some very good qualities.
If you enjoyed that you should pick up The D Programming Language by Andrei. It's a good read. Metaprogramming with D is a dream compared to C++ so some of the techniques in Modern C++ Design aren't even necessary.
I'm a little skeptical that it doesn't use a standard parser like clang. My gut instinct, from experience with crappy c++ "parsers", would be that it works great 95% of the time, but fails the 5% of the time when it'd be most useful.
On the other hand, given the author, and the fact he explicitly mentions wanting to support C++11, I'm not sure my skepticism is warranted.
I wonder if anybody's tried it on a big C++ code base outside of Facebook?
> I wonder if anybody's tried it on a big C++ code base outside of Facebook?
I used a similar tool (cppcheck => http://cppcheck.sourceforge.net/) which also does not try to build a correct AST and does no proper semantics. The results are surprisingly useful, and most importantly it's very easy to add user-defined checks (edit: of course you also get some related false positives but it's very flexible).
It doesn't seem that flint parses at all – it only tokenizes, and then you write "rules" on "streams of tokens." It kinda sounds like a set of fancy regexes, but on tokens instead of characters.
That’s exactly right. And tokenising C++ is infinitely easier than parsing it, though there is still the odd context-sensitive edge case like “> >” versus “>>”.
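To illustrate the flavour (a made-up check, not flint's actual API and not necessarily one of its real rules): a "rule" is just a scan over the token stream looking for a pattern, for example flagging exceptions caught by value:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Token { std::string text; std::size_t line; };

    // Toy "rule over a token stream": report the line of every `catch (T e)` that
    // catches by value, i.e. no '&' or '*' (and not `...`) before the closing ')'.
    std::vector<std::size_t> checkCatchByValue(const std::vector<Token>& toks) {
        std::vector<std::size_t> hits;
        for (std::size_t i = 0; i + 1 < toks.size(); ++i) {
            if (toks[i].text != "catch" || toks[i + 1].text != "(") continue;
            bool byRefOrEllipsis = false;
            std::size_t j = i + 2;
            for (; j < toks.size() && toks[j].text != ")"; ++j)
                if (toks[j].text == "&" || toks[j].text == "*" || toks[j].text == "...")
                    byRefOrEllipsis = true;
            if (j < toks.size() && !byRefOrEllipsis) hits.push_back(toks[i].line);
        }
        return hits;
    }

Exactly the "fancy regexes over tokens" idea: cheap, robust against code that doesn't fully parse, and wrong in a few corner cases you decide to live with.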
I chuckled at this passage: "flint is written in the D language, making it the first D codebase open-sourced by Facebook. In fact, our initial version of flint was written in C++; the D rewrite started as an experiment. From the measurements and anecdotes we gathered, the translation turned out to be a win on all fronts: the D version of the linter turned out smaller, dramatically faster to build, significantly faster to run, and easier to contribute to."
Seems like the argument is stronger to rewrite the code being linted, than to use the linter itself ;-)
D is actually a pretty nice language. (The caveats are around the libraries, especially the standard libraries to choose from.) I can see D being a better C++, in that it better solves the problems that C++ is supposed to be good for.
This is likely what they are looking at. Start with a program everyone uses but that isn't a critical production system, evaluate the language, and get people familiar with it. Follow it up with some more rewrites and eventually give up on the old C++.
Here are some examples that we've run into in the past:
Some of our code used GCC nested functions, which Clang did not support.
We were using __sync_val_compare_and_swap with __int128 values, which wasn't supported by the Clang version we were using. I believe the current version of Clang supports this.
Clang is stricter about some things than GCC, e.g. Clang doesn't like it when you declare a type as a struct in one place and as a class in another.
I think there is also a difference where Clang deduces that your enum is unsigned if none of its declared values are negative, and warns you if you compare it < 0. Since we compile with -Werror, this will cause compilation to fail.
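Concretely, the enum case looks something like this (the warning wording is from memory, so treat it as approximate):

    enum Flags { A, B, C };       // no negative enumerators, so Clang deduces an
                                  // unsigned underlying type

    bool isInvalid(Flags f) {
        return f < 0 || f > C;    // Clang warns that `f < 0` is always false, and
    }                             // -Werror turns that warning into a build break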
None of these are insurmountable, but they do mean we can't just drop in Clang and expect everything to work.
Not facebook examples, but things I've seen are as follows. All these were fixable to be correct for both GCC and Clang (it's worth it for the clang static analysis).
In GCC you can call the base destructor explicitly in a destructor (an error in clang), i.e. this->~BaseClass(). This is normally a coding error anyway, since the call is added by the compiler.
You sometimes have to add this-> in templates to disambiguate which function is called.
GCC has, or had, a better (different?) two-phase lookup for templates, so you can #include a file after its template is used in another header. Clang is stricter and requires you to #include "A.h" and #include "foo.h" before using A<foo>, even if the overall compilation unit does eventually declare the A template and the foo class.
GCC supports variable length arrays, so you can do `MyClass x[someValue + 1];`, but with Clang you need to use a vector of MyClass, like vector<MyClass> x(someValue + 1).
This sort of thing compiles in GCC, but not clang:
    #include <list>
    #include <string>
    template <typename T>
    class List : public std::list<T> {};
    const List<std::string> tmpList;   // const object, no user-provided default constructor
It is missing a user-provided default constructor (which a const object requires), but GCC finds one from somewhere :)
GCC allows friends of derived classes to call protected static base-class functions, i.e. Base::protectedFunc() can be called if the current class is a friend only of BaseSubclass. In clang you need to write BaseSubclass::protectedFunc().
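In code, that looks roughly like this (Friend is an illustrative name for the befriended class):

    class Base {
    protected:
        static void protectedFunc();
    };

    class BaseSubclass : public Base {
        friend class Friend;                // Friend is a friend of BaseSubclass, not of Base
    };

    class Friend {
        void g() {
            BaseSubclass::protectedFunc();  // fine in both compilers
            Base::protectedFunc();          // accepted by GCC, rejected by Clang,
        }                                   // as described above
    };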
Clang requires more explicit template instantiation to avoid link errors. Not sure exactly when, just "sometimes" and you'll know it when you see it :-)
Then, as mentioned elsewhere, clang shouts a lot about mismatched forward declarations (struct vs class).
To me, this is more a proof-of-concept implementation in D than a tool I'd consider making part of my workflow. The dependency chain and "exotic" language choice make just having it available in your ecosystem a higher bar than utilities like linters should require.
On the contrary, a small standalone tool like a linter is a perfect place to start experimenting with a new language before introducing it to your core projects. I'd expect there'll be packages available for it soon if it's really too much effort to build.
Why? I'd rather put some crazy requirements in my build tools, than in the delivered product.
E.g., hypothetically: who cares if it only builds on, say, Linux? We have VMs for that. As long as the output program runs on the target platform, say Windows, we are golden.
This looks really good. Having checks like this integrated and automated is the big win. Like data backup, if it's not automated then it doesn't really count.
> Labeling the data as such potentially generates one instance of the static data inside each compilation unit, including that header. Fixing these issues has led to measurably smaller and faster executables.
When you include that header in two different .cpp files, which compile to two different .o files, you have that data twice, local to the .o file (compilation unit).
Now link them together and you have that data twice in the executable. You probably did not want that. (But -- as with some more of their examples -- maybe you did indeed want that?)
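A tiny illustration, with a hypothetical header and a made-up array size:

    // table.h -- `static` gives the array internal linkage, so every .cpp that
    // includes this header gets its own private 4 KB copy in its object file, and
    // the linker happily keeps all of them.
    static const int kTable[1024] = { /* ... */ };

    // One fix: declare it `extern` in the header and define it exactly once in a
    // single .cpp, so the executable carries a single copy.
    // table.h:   extern const int kTable[1024];
    // table.cpp: extern const int kTable[1024] = { /* ... */ };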