Facebook Open Sources flint, a C++ linter written in D (facebook.com)
197 points by andralex on Feb 24, 2014 | 81 comments



It amazes me how far out of their way many people will go to avoid using parsing tools.

He avoided using a lexer generator because, in his words, "I'd figured using a lexer generator was more trouble than it was worth". To him it was more convenient to write a lexer manually (using macros) in C++, then later completely rewrite it in a different language (D) as a one-off trie-matcher code generator.

I am amazed that this could be thought of as less work. How can writing two lexers from scratch by hand be less trouble than writing a bunch of regexes in a Flex or Ragel file, especially since the result was likely slower than what Flex or Ragel would have produced?

To me the interesting question is how much of this allergy to external tools comes down to:

1. the trouble of learning/maintaining something new/different (inherent tool overhead)

2. design of the tool isn't as user-friendly as it could be (incidental tool overhead)

3. irrational belief that the tool will be more work/trouble than it actually is (non-optimal decision making)

If part of the answer is (2), then improved tools will lead to greater adoption. And hopefully more convenient tools will become better known and more widely used, which should lessen (1) as well.

Everyone uses regexes; using a lexer generator should (in my opinion) be as easy as using a one-off regex. I think the key is to make the lexer generator an embeddable library like regex libraries are.
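
To make that concrete, here is a rough sketch of the ergonomics I have in mind (my own toy code, not any existing library's API), approximated with std::regex purely to show the feel; a real embeddable lexer generator would compile these rules down to a single DFA instead of trying them one at a time:

    #include <iostream>
    #include <regex>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // rule name -> pattern, tried in order at the current position
        const std::vector<std::pair<std::string, std::regex>> rules = {
            {"number", std::regex("[0-9]+")},
            {"ident",  std::regex("[A-Za-z_][A-Za-z0-9_]*")},
            {"space",  std::regex("[ \t\n]+")},
        };

        std::string input = "foo 42 bar";
        auto pos = input.cbegin();
        while (pos != input.cend()) {
            std::smatch m;
            bool matched = false;
            for (const auto& rule : rules) {
                if (std::regex_search(pos, input.cend(), m, rule.second,
                                      std::regex_constants::match_continuous)) {
                    std::cout << rule.first << ": '" << m.str() << "'\n";
                    pos = m[0].second;   // advance past the matched lexeme
                    matched = true;
                    break;
                }
            }
            if (!matched) { std::cerr << "lex error\n"; break; }
        }
        return 0;
    }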


Using third party lexer generators means your project is now dependent on that tool. This means:

1. the tool must exist on every platform your project will ever be ported to

2. the same version of that tool must exist on every such platform

3. if the host compiler changes, the tool must be updated in sync with it; out of sync means your project won't build

4. if there are blocker bugs in the tool, you're not in a good position to fix them. You're stuck with developing workarounds.

5. Everyone building your project has to get the tool successfully installed.

The current build of D's runtime library has a dependency on libcurl. Supposedly, since libcurl is ubiquitous, that shouldn't be any problem. But it has been an ongoing sore spot, relentlessly sucking up multiple people's time with exactly the issues I outlined above.


#1. What platforms do you think people would actually want to run a static analyzer on?

#2. You can enforce dependencies at a project level if you cared to. Lexers really haven't changed much from version to version.

Based on #3, you don't understand how lexers work.

Also, based on #4, do you actually know anybody who has encountered a blocking bug with a lexer?

#5. WHAT? How is that even remotely hard? This is a static analyzer. I'm not using my Arduino to analyze code or anything.


> Using third party lexer generators means your project is now dependent on that tool.

I would submit that porting D is significantly harder than porting flex and/or ragel. Not to mention that the number of platforms that support C++ and D but not flex should be smaller than the number of platforms that support C++ and flex but not D.


It's easier to just write the lexer by hand than deal with those issues.


Your arguments seem to equally apply to using third-party libraries at all. Are you suggesting that it's better to (for example) re-implement HTTP rather than use libcurl?


I'm saying you've got to have pretty strong arguments to make a project dependent on a third party tool, if you expect to distribute it, expect it to be long lived, and expect to port it to many platforms.


Yes I agree with the point that the portability of your dependencies limits your own portability.

This is definitely something that is dependent on the tool or library though. For example, if you decided to use Lua as a tool, it is extremely unlikely that the portability of Lua will be a limiting factor. It is very likely to be more portable than your code (given that it compiles unmodified on Turbo C 1.0 from 1990: http://www.youtube.com/watch?v=-jvLY5pUwic)

So I definitely agree with "choose your dependencies wisely."


...and if you decided to use Lua, you could also use LPEG, which is amazingly well suited to tasks like this...

[LPEG is also a dependency of course, but pretty small (just include its source code along with your own) and super portable in the same way Lua is.]


There's a fine line to walk between "Not Invented Here" and suffering through an inadequate tool. We at Facebook have a long history of trying stock tools and seeing them fail at our scale.

Walter Bright puts out a good argument against lexer generators here: http://www.drdobbs.com/architecture-and-design/so-you-want-t.... Besides, tokenization is hardly the interesting/bulky part of the task. The verifications are.


> Walter Bright puts out a good argument against lexer generators here: http://www.drdobbs.com/architecture-and-design/so-you-want-t....

I'm going to have to disagree with you and Walter. While I agree that for a fixed language, a well-written hand-built recursive descent parser will beat the pants off the output of a parser generator, languages aren't necessarily fixed and writing a good hand-built parser is harder than it seems. I still recommend that people start with parser generators.

(Edit: yes, we're talking about lexers above, but the article mentions both lexers and parsers. I think the case for using a hand-built lexer is much weaker than the case for using a hand-built parser, not that I'm a fan of either.)

If you use a parser generator, you can be sure that your language has a clear, (hopefully) LALR(1) syntax that anyone can understand and port to another parsing environment. If you hand-parse, then the temptation exists not only to introduce subtle irregularities and context dependencies into the language, but also to mix semantic interpretation and parse tree generation, and both of these anti-patterns make it difficult to build generic tools that work with your new language. With a hand parser, you also run the risk of making silly coding errors that end up being baked into your language specification. (Ahem, PHP.)


> While I agree that for a fixed language, a well-written hand-built recursive descent parser will beat the pants off the output of a parser generator

I'm not sure this is true, do you have numbers that demonstrate this?

I know that Clang knows some tricks that lexer generators don't currently know, like using SSE to skip long blocks of comments. But even that, to me, represents an opportunity for better parsing tools, not an inherent performance advantage of hand-written parsers/lexers.
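
To illustrate the kind of trick I mean, here is a rough sketch of the idea (mine, not Clang's actual code): scan ahead 16 bytes at a time with SSE2 looking for a '/' that might end a block comment, and fall back to a byte-at-a-time loop for the tail.

    #include <emmintrin.h>

    // Return a pointer to the next '/' in [p, end), or end if there is none.
    const char* skipToSlash(const char* p, const char* end) {
        const __m128i slashes = _mm_set1_epi8('/');
        while (p + 16 <= end) {
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, slashes));
            if (mask != 0)
                return p + __builtin_ctz(mask);  // first '/' in this 16-byte chunk
            p += 16;
        }
        while (p < end && *p != '/')             // scalar tail
            ++p;
        return p;
    }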


> I'm not sure this is true, do you have numbers that demonstrate this?

No numbers offhand, but GCC's experience of moving toward a hand-written C++ parser was very positive. I don't believe hand-written parsers have an inherent speed advantage, but it does seem that it'd always be easier for a hand-written parser to produce more informative error messages --- if anything, because the hand-written parser has more embedded syntactic information.


> GCC's experience of moving toward a hand-written C++ parser was very positive.

From what I see here, the measured improvement was 1.5% -- http://gcc.gnu.org/wiki/New_C_Parser

> I don't believe hand-written parsers have an inherent speed advantage, but it does seem that it'd always be easier for a hand-written parser to produce more informative error messages

From what I can see, this is the major shortcoming of parser tools at the moment. I haven't looked into this problem too deeply yet, but at the moment I suspect that it isn't possible to get good enough error messages automatically.

But to me this doesn't mean "parsing tools will never have good error messages", it means "parsing tools need to evolve to give users the right hooks to produce good error messages themselves."

This would let the algorithms do the part that they do well (analyzing the grammar and creating state machines) while letting the human do the part that can't be done automatically (creating good error messages).


Getting good error messages from the parser generator is a solved problem, but it seems that solution is not widely known.

As you said, it isn't possible to get good error messages automatically. The solution is to include error cases in the grammar. No hooking is necessary. If people often miss semicolons, include a missing-semicolon case in the grammar!

The Mono C# compiler uses this technique (including error cases in the grammar) to great effect. I wonder why people don't do that and resort to hand-written parsers instead.


Excellent, this -- information on other people's strategies for generating good error messages -- is exactly what I've been looking for.

Could you direct me to more resources/papers/tools that describe and apply this approach in practice? Thanks!


Sure, things might change. I'd be happy with better error messages from automatically-built parsers. There's a lot of potential, I think, in methods that take advantage of having a formal description of the language grammar --- you could, for example, automatically generate a sentence that "fits in" at the point at which an error occurs. There's still a lot of progress to be made in the area, though, and for now, hand-generated error messages do seem to be more useful to people.


I think this is starting to meander. We're talking about lexers, not parsers, and for a stable language, not a language being designed.


C++14 just got finalized... not sure how stable the language really is...


If memory serves, the original preprocessor that turned "C with Classes" into C wasn't written with a parser generator. I think your argument applies.


> We at Facebook have a long history of trying stock tools and see them failing at our scale.

I can understand that, having experienced the same before.

I just re-read my message and I regret the way it comes off -- my intention in writing it wasn't to be critical of you, but to explore the issue of how inconvenient parser tools are perceived as being, and try to analyze why that is.

"Insanity" was meant to refer to the overall situation, and not you personally! I might re-word this a bit.

I still think that we ought to be in a position where the tool is so easy and so valuable that it is a no-brainer. And I think this largely indicates that there is a long way for the tools to improve.


The Walter Bright whose comment appears next to yours? https://news.ycombinator.com/item?id=7294024


There can be Only One.


> Everyone uses regexes; using a lexer generator should (in my opinion) be as easy as using a one-off regex. I think the key is to make the lexer generator an embeddable library like regex libraries are.

There actually is a really powerful parser generator for D by Philippe Sigaud: https://github.com/PhilippeSigaud/Pegged It's much more pleasant to use than something like Boost Spirit.

Somewhat relatedly, D's standard library has a regex engine that can compile regular expressions at compile time, resulting in some of the fastest regular expressions in the world.


That's cool, though "some of the fastest regular expressions in the world" might be a bit of hyperbole: it would be comparable to any compile-time generator like Flex, Ragel, or a JIT generator like PCRE-sljit. And since D is a memory-safe language, I would expect the guards on buffer access to cause measurable overhead compared with C or C++ generated code.


> And since D is a memory-safe language, I would expect the guards on buffer access to cause measurable overhead compared with C or C++ generated code.

The typical C FUD that has been spread against memory-safe languages since the early '80s.

Most optimizing compilers are able to remove bounds checking if dataflow analysis can prove the code stays within bounds.

For the cases where the compiler is not that smart, and the programmer has proven with the help of profilers that bounds checking is indeed a problem, most compilers offer the possibility to disable bounds checking at either the module or code-region level.

So one gets memory safety by default, but if really, really, really required, it is still possible to recover those extra milliseconds.
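
As a language-neutral sketch of the first point (in C++, with at() standing in for the checked access of a memory-safe language): the guard before the loop is what lets a dataflow-aware optimizer prove the per-element check redundant and drop it.

    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    int sumPrefix(const std::vector<int>& v, std::size_t n) {
        if (n > v.size())
            throw std::out_of_range("n");
        int total = 0;
        for (std::size_t i = 0; i < n; ++i)
            total += v.at(i);  // i < n <= v.size(), so the bounds check inside
                               // at() is provably redundant on every iteration
        return total;
    }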

The thing is, in these languages one needs to ask permission to shoot one's foot off, instead of being able to do it on every line of code.


By that I meant it is in the same league as the fastest regular expression engines in the world. I should have worded that better. It comes in at around the same performance as V8's regex engine and Boost Xpressive.


Writing parsers is fun, and having done it several times, I know roughly how long it will take. Using a highly specialized tool has risks of its own: that I'll sink a lot of time into learning it, only to discover it isn't a good fit for what I want to do, and that I'll have to go through the whole learning curve again a few years from now because I don't write parsers very often and will have forgotten how the tool worked by then. Then there's the fun of debugging Other People's Code.

Not to mention the way that open-source tools tend to drag in a godawful mess of dependencies, most of which they don't really even need. In my experience, I can write a basic recursive-descent parser from scratch in C++ before I can get a warning-free compile out of many open-source packages.

So yes, there are reasons to roll your own parser. They may not be good reasons in your view, but don't shoot the messenger.

(All of this being said, there's no way I'd try to bake a fully-compliant C++ language parser from scratch.)


Well, one problem could be that C++ does not have a context-free grammar, and in fact its grammar is notoriously difficult to parse, so perhaps it's just as easy to write it by hand as it is to bend an existing parser generator to fit C++'s grammar?

In fact, I am not aware of any modern C++ static analyzer or compiler that uses an automatic parser generator.


But this isn't a parser; it only tokenizes and then matches rules against a stream of tokens.

I.e. it doesn't need to know what C(X) means as long as it knows that it matches <ident,paropen,ident,parclose>.

IIUC.
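
Roughly like this, I'd imagine (a toy sketch with made-up token names, not flint's actual representation):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical token kinds and token type -- flint's real ones differ.
    enum class Tok { Identifier, ParenOpen, ParenClose, Other };

    struct Token { Tok kind; std::string text; };

    // Does the token stream starting at position i match "ident ( ident )" ?
    bool matchesCall(const std::vector<Token>& toks, std::size_t i) {
        return i + 3 < toks.size()
            && toks[i].kind     == Tok::Identifier
            && toks[i + 1].kind == Tok::ParenOpen
            && toks[i + 2].kind == Tok::Identifier
            && toks[i + 3].kind == Tok::ParenClose;
    }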


By the way: I'm glad you mentioned Ragel. It's a fantastic dependency-free state machine generator that everyone should check out. http://www.complang.org/ragel/


I just discovered Ragel as well, and was wondering if you or anyone else might know of a tool that could convert a coroutine or generator to a state machine. I asked about it on Stack Overflow but so far have had no responses:

http://stackoverflow.com/questions/21713904/convert-coroutin...


See my comment here: https://news.ycombinator.com/item?id=7293796. Making it easy for anyone to add a company-specific check quickly is a valuable asset (to enforce decisions like "We don't want to use idiom X anymore").

Moreover, fake parsers like cppcheck will lint your code even if it does not build, so they are really easy to set up.


Are there any studies available comparing development effort for the two approaches, using an existing tool vs. writing one's own? It is not obvious to me in what percentage of cases the former is better.

I have many times used an existing tool and, after getting the thing to work, come away with the feeling that it may have been easier to do it myself. I used Flex in a compilers course on Coursera, and getting the lexer to match the specification took about two full days, which seemed like too much for the specification at hand in the assignment. I have written a CFG parser from scratch in less time (not to say that these two tasks are comparable from a development-effort point of view).


> Everyone uses regexes; using a lexer generator should (in my opinion) be as easy as using a one-off regex. I think the key is to make the lexer generator an embeddable library like regex libraries are.

Interesting read:

http://commandcenter.blogspot.ie/2011/08/regular-expressions...


Ouch! This seems to apply to me. I've written lexers manually in the past but never bothered to learn a lexer generator properly. I guess I should. Would Flex, Ragel or something else be the best starting point? I have unpleasant memories of really cryptic error messages from the compiler when trying to get someone's toy project in Flex to work.


I recommend you start learning a Parser Combinator Library (like Parsec). These are nicer (especially to start out with).

The downside is that not all programming languages are powerful enough to host these libraries---so you might have to fall back on a parser generator later, if your language of choice is not supported.


Once you get used to tools like `go fmt` or Perl::Tidy it feels barbaric not to be able to cleanly format all your code instantly. Read only linters are nice but something that can reformat code is much nicer.



While reading the article, I also wondered why they didn't at least mention it (and why they abandoned the idea).

It's powerful, fast, and will most likely be the basis for a lot of {h,l}inting in editors/IDEs moving forward.

edit: Somehow missed that they mentioned Clang not being mature enough when starting this project.


http://astyle.sourceforge.net/ Linters are more important than pretty formatting, to my mind. C++ has had both linters and reformatters for a long time.

I suppose the distinction is that they are not built in.


Huh? You know that a linter is not a formatter, right?


Pretty sure he does. His comment only makes sense in that context.


I guess I just don't see that. perltidy and lint(s) are completely nonoverlapping, to my knowledge. Something that can reformat your code doesn't even do the same job as a linter.


I'm not really following you. It's very common for linters to flag stylistic and formatting issues.


Ah well, perhaps it's just a difference in experience. I've never used a linter (there are countless linters) that fixed indentation, arranged if statements, and inserted newlines to wrap long lines. Nor has a formatter ever helped me find the sort of issues I use a linter to find.


There is overlap. I once lost a fair amount of time because some dimwit predecessor thought putting anything other than a space after commas was acceptable. (It sounds ridiculous but it's true. I can explain if you care.)

So: a formatter will, for example, make sure your lines are indented by some standard amount; a linter will bitch if they are not; and both the formatter and the linter should DTRT when they identify certain indefensible coding practices.


I wouldn't want to reach for lint to handle formatting; that is what indent is for. That is to say, lint should not complain that the if should have a space before the (.


I agree in that case, but when it comes to commas, using spaces in a certain way is objectively better.


Well, there are clang-format and clang-tidy, and they are really easy to set up in CMake (I use "make format").


Years ago I remember learning template metaprogramming in C++ from an amazing book [1] by the same author as this tool. Even if, looking back, I now think template metaprogramming can be a little over the top when over-used, I nevertheless have no doubt, judging by its author, that this new tool must have some very good qualities.

[1] http://en.wikipedia.org/wiki/Modern_C%2B%2B_Design


If you enjoyed that you should pick up The D Programming Language by Andrei. It's a good read. Metaprogramming with D is a dream compared to C++ so some of the techniques in Modern C++ Design aren't even necessary.

http://www.amazon.com/D-Programming-Language-Andrei-Alexandr...


I'm a little skeptical that it doesn't use a standard parser like clang. My gut instinct, from experience with crappy c++ "parsers", would be that it works great 95% of the time, but fails the 5% of the time when it'd be most useful.

On the other hand, given the author, and the fact he explicitly mentions wanting to support C++11, I'm not sure my skepticism is warranted.

I wonder if anybody's tried it on a big C++ code base outside of Facebook?


> I wonder if anybody's tried it on a big C++ code base outside of Facebook?

I used a similar tool (cppcheck => http://cppcheck.sourceforge.net/) which also does not try to build a correct AST and does no proper semantic analysis. The results are surprisingly useful, and most importantly it's very easy to add user-defined checks (edit: of course you also get some related false positives, but it's very flexible).


It doesn't seem that flint parses at all – it only tokenizes, and then you write "rules" on "streams of tokens." It kinda sounds like a set of fancy regexes, but on tokens instead of characters.


That's exactly right. And tokenising C++ is infinitely easier than parsing it, though there is still the odd context-sensitive edge case like "> >" versus ">>".
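
For reference, the classic case:

    #include <vector>

    std::vector<std::vector<int> > a;  // required spelling in C++03, where the
                                       // tokenizer greedily lexed ">>" as a shift
    std::vector<std::vector<int>> b;   // legal since C++11, which special-cases it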


I chuckled at this passage: "flint is written in the D language, making it the first D codebase open-sourced by Facebook. In fact, our initial version of flint was written in C++; the D rewrite started as an experiment. From the measurements and anecdotes we gathered, the translation turned out to be a win on all fronts: the D version of the linter turned out smaller, dramatically faster to build, significantly faster to run, and easier to contribute to."

Seems like the argument is stronger to rewrite the code being linted, than to use the linter itself ;-)


D is actually a pretty nice language. (The caveats are around the libraries, especially the standard libraries to choose from.) I can see D being a better C++, in that it better solves the problems that C++ is supposed to be good for.


> The caveats are around the libraries, especially the standard libraries to choose from.

The old Tango/Phobos situation?

With the advent of D2, that's no longer a thing. Tango is now all but dead.


Thanks for the update! I haven't toyed around with D in a while---but it was a mostly pleasant experience the last time I did.


This is likely what they are looking at. Start with a program everyone uses but that isn't a critical production system, evaluate the language, and get people familiar with it. Follow it up with some more rewrites and eventually give up on the old C++.



"Even now, clang cannot compile some of our C++ codebase."

Andrei, I'm a bit surprised by this statement. Could you give an example?

Otherwise great work.


Here are some examples that we've run into in the past:

Some of our code used GCC nested functions, which Clang did not support.

We were using __sync_val_compare_and_swap with __int128 values, which wasn't supported by the Clang version we were using. I believe the current version of Clang supports this.

Clang is stricter about some things than GCC, e.g. Clang doesn't like it when you declare a class as a struct in one place and as a class in another.

I think there is also a difference where Clang deduces that your enum is unsigned if none of its declared values are negative, and warns you if you compare it < 0. Since we compile with -Werror, this will cause compilation to fail.
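
To illustrate those last two points with toy code (not our actual code):

    struct Widget;      // forward-declared as a struct here...
    class Widget { };   // ...defined as a class here: GCC is silent,
                        // Clang can warn (-Wmismatched-tags)

    enum Color { Red, Green, Blue };  // no negative enumerators

    bool isNegative(Color c) {
        // Clang may pick an unsigned underlying type and warn that this
        // comparison is always false; with -Werror that becomes an error.
        return c < 0;
    }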

None of these are insurmountable, but they do mean we can't just drop in Clang and expect everything to work.


Not Facebook examples, but things I've seen are as follows. All these were fixable to be correct for both GCC and Clang (it's worth it for the clang static analysis).

GCC can call the base destructor explicitly in destructors (an error in clang). i.e. this->~BaseClass(); This is normally a coding error anyway, since the call is added by the compiler.

You sometimes have to add this-> in templates to disambiguate which function is called.
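
The usual shape of that case, as a quick sketch:

    template <typename T>
    struct Base {
        void helper() {}
    };

    template <typename T>
    struct Derived : Base<T> {
        void run() {
            this->helper();  // OK: helper() lives in a dependent base class
            // helper();     // rejected by a strict two-phase-lookup compiler
        }
    };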

GCC has, or had, a better (different?) two-phase lookup for templates, so you can #include a file after its template is used in another header. Clang is stricter and requires you #include "A.h" and #include "foo.h" before using A<foo>, even if the overall compilation unit does eventually declare the A template and the foo class.

GCC supports variable length arrays, so you can do `MyClass x[someValue + 1];`, but with Clang you need to use a vector of MyClass, like vector<MyClass> x(someValue + 1).

This sort of thing compiles in GCC, but not clang:

    template <typename T>
    class List : public list<T> {};
    const List<string> tmpList;
It is missing a default constructor, GCC finds one from somewhere :)

GCC allows friends of derived classes to call static base class protected functions. i.e. Base::protectedFunc() can be called if the current class is a friend only of BaseSubclass. In clang you need to call BaseSubclass::protectedFunc().

Clang requires more explicit template instantiation to avoid link errors. Not sure exactly when, just "sometimes" and you'll know it when you see it :-)

Then as mentioned elsewhere clang shouts a lot about mismatched forward declaration (struct vs class).


Here's a gem that clang produces when compiling folly:

"clang-3.4: error: unable to execute command: Segmentation fault"


So we're not the only ones. We had to go back to Clang 3.3 because Clang 3.4 had issues with our meta-programming craziness.

But still, I submit a clang-based tool is a better bet in the long run.


I believe clang also doesn't currently support the "ifunc" attribute, which we also use in a few places.


I am surprised at how short the tokenizer is. The other thing is that D looks surprisingly approachable coming from C++.


D is crazy good at boilerplate generation.


Exactly. The only thing I know to be more powerful are LISP macros.


Have you tried template Haskell?


To me, this is more a proof-of-concept implementation in D than a tool I'd consider making part of my workflow. The dependency chain and "exotic" language choice make just having it available in your ecosystem a higher jump than utilities like linters should require.


On the contrary, a small standalone tool like a linter is a perfect place to start experimenting with a new language before introducing it to your core projects. I'd expect there'll be packages available for it soon if it's really too much effort to build.


Why? I'd rather put some crazy requirements in my build tools than in the delivered product.

Eg hypothetical: who cares if it only builds on, say Linux? We have VMs for that. As long as the output program runs on the target platform, say Windows, we are golden.


Clearly that's something to worry about. What won our team over was the (sometimes spectacular) speed gains compared to the C++ linter.


Could you comment on why it is faster?


This looks really good. Having checks like this integrated and automated is the big win. Like data backup, if it's not automated then it doesn't really count.


> Marking namespace-level data as static inside a header is almost always a bad idea.

Can anybody explain this to me with some example?


Next sentences:

    Labeling the data as such potentially generates one instance of the
    static data inside each compilation unit, including that header. Fixing
    these issues has led to measurably smaller and faster executables.
When you include that header in two different .cpp files, which compile to two different .o files, you have that data twice, local to the .o file (compilation unit).

Now link them together and you have that data twice in the executable. You probably did not want that. (But -- as with some more of their examples -- maybe you did indeed want that?)
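
A minimal toy illustration of what goes wrong:

    // header.h
    static int g_counter = 0;  // every .cpp including this header gets its
                               // own private g_counter in its .o file

    // a.cpp and b.cpp both #include "header.h": the linked binary now holds
    // two separate g_counter objects, and the two files silently update
    // different variables. Declaring `extern int g_counter;` in the header
    // and defining it in exactly one .cpp keeps a single instance.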


I'm going to the D conference!!! :-)



