The future of Clang-based tooling (trailofbits.com)
140 points by ingve on July 29, 2023 | 26 comments



As one of the authors of https://hsyl20.fr/home/files/papers/2022-ghc-modularity.pdf (discussed at https://news.ycombinator.com/item?id=31250141) this rings really true.

If all we do is compile end to end, it is really easy for the pipeline stages to "rot together" such that we get these lies the blog post author points out. We must be able to start and resume compilation from any point, with arbitrary programs in the intermediate representation, to ensure modularity doesn't regress. Really glad IDE and now security use-cases are finally hammering these basic software architecture principles into compiler writers!

I recall an earlier thread, https://discourse.llvm.org/t/rfc-an-mlir-based-clang-ir-cir/..., where someone else was interested in the same thing. And it seems Vast (prior to open sourcing, and the reveal of that name) was mentioned by the blog post author in the thread. Very much hoping there is thus enough interest to get this upstreamed.

Best of luck to everyone involved!


I'm really impressed with the quality of the writing. It's succinct, informative, and engaging.

The "engaging" part might be subjective, because I've recently taken a renewed interest in LLVM internals. But regardless, good writing.

P.S. The article gives a shout-out to CodeBrowser [0]. It wasn't immediately clear from the homepage, but CodeBrowser is open-source: [1].

[0] https://codebrowser.dev/

[1] https://github.com/KDAB/codebrowser


Having worked with clang (and gcc) quite a bit, there are a number of good points the author makes. There are a lot of cool things llvm/clang has, but it feels like a lot of the tooling does not mesh together as well as it should and some things lack refinement.

My biggest gripe overall (since it could be fixed easily) is the compile_commands.json. It's used by a number of tools and is generally awkward, cumbersome, and has a handful of shortcomings. To fix these issues, I used the intercept-build system provided with LLVM to generate a more succinct build file in JSON format that abstracts certain options (like paths) and groups options commonly found together. The reason for this is that sometimes you might be generating llvm bitcode, building clang AST, running clang-analyze, or translating the build options to work with either compiling or linking with GCC or Clang. For many of these it helps to be able to alter options easily, which you cannot do with the compile_commands.json file alone.
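To make the idea concrete, here is a rough sketch (not the actual tool described above) of factoring a compile_commands.json into shared and per-file options. It assumes only the documented entry fields `directory`, `file`, and `command`/`arguments`; the function names and the output layout are hypothetical. It also treats every token as an independent flag, so a pair like `-o a.o` is split apart, which a real tool would have to handle.

```python
import json
import shlex
from collections import Counter

def load_entries(text):
    """Parse compile_commands.json text into (file, directory, argv) tuples."""
    entries = []
    for entry in json.loads(text):
        # An entry carries either an "arguments" list or a "command" string.
        argv = entry.get("arguments") or shlex.split(entry["command"])
        entries.append((entry["file"], entry["directory"], argv))
    return entries

def factor_common_flags(entries):
    """Split tokens into those shared by every entry and per-file leftovers."""
    counts = Counter()
    for _, _, argv in entries:
        counts.update(set(argv[1:]))  # skip argv[0], the compiler itself
    common = sorted(tok for tok, n in counts.items() if n == len(entries))
    per_file = {
        file: [tok for tok in argv[1:] if tok not in common]
        for file, _, argv in entries
    }
    return {"common": common, "per_file": per_file}

# Made-up two-file project, one entry in each of the two allowed styles.
EXAMPLE = json.dumps([
    {"directory": "/src", "file": "a.c",
     "command": "cc -O2 -Iinclude -c a.c -o a.o"},
    {"directory": "/src", "file": "b.c",
     "arguments": ["cc", "-O2", "-Iinclude", "-DDEBUG", "-c", "b.c", "-o", "b.o"]},
])
summary = factor_common_flags(load_entries(EXAMPLE))
```

With the shared options in one place, swapping the compiler or adding something like `-emit-llvm` becomes a single edit rather than one per entry.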

There are a number of areas like this where clang would benefit greatly, without demanding an enormous amount of effort.


One of the most surprising things I learned about "clang" was how relatively poor the "libClang" capabilities are.

I wanted to write a codegen tool that would auto-generate bindings for C++ code, and it turns out that "libTooling" is the only reasonable way to get access to the proper info you need from C++.

Another alternative is "libClangSharp", from Tanner Gooding who works on C# at Microsoft.

https://github.com/dotnet/ClangSharp


Have you seen https://github.com/RosettaCommons/binder ?

Python aside, having gone down this rabbithole, and still not infrequently revisiting said rabbithole, I don't believe using *clang like this is a winning strategy. Because of the number of corner cases there are in e.g. C++17, you will end up reimplementing effectively all of the "middle-end" (the parts that lower to LLVM) for your target language. At that point you're not building bindings anymore but a whole-ass transpiler. Binder fails to be complete in this way.

My current theory is to try to "synthesize" bindings from the LLVM IR (a much smaller representational surface). Problems abound here too (ABI).

Alternatively there is https://cppyy.readthedocs.io/en/latest/, which I don't completely understand yet.


This is another part of clang I've considered to be almost, but not quite, there yet. Some of the calls to the API are not very intuitive, and they left too much out of libclang for it to be of anything but limited use. I am not a C++ guy, and it would be far too difficult for me to learn on a project such as this for my purpose, so I had to use GCC instead. GCC has fairly good internals documentation (not just doxygen, thankfully) and the code is reasonably well annotated, so it wasn't too difficult to work with.


I think the bigger point isn’t mentioned but you can guess it by the medium: the author seems to want to do some sort of security analysis which requires them to hook various stages with precise semantics, and most of the API was probably designed around providing autocomplete or basic code intelligence. Not entirely sure that the only solution here is to throw out these representations rather than have them match reality a bit more closely if you ask for it, but I guess this works too.


It was not built around any of that. It was built to facilitate compiler construction and add some introspection to that process. The problem is that building what is a compiler and code generation library to cover multiple languages and multiple architectures is really hard. Abstractions start to get leaky. Next thing you know there are a bunch of assumptions and hacks that make your neat library a big ol’ mess.

I’m not faulting any of the llvm maintainers. Other people were hoping the IR and library bits would turn into more than a compiler toolkit. Unfortunately, reality sets in over time.


Yeah the way things are is very natural when one only compiles end-to-end --- there is little economic incentive to keep the internals modular when the productivity costs of entanglement only show up with a delay (and also programmers are not really compensated for productivity...).

It's really good that in this new LSP era good tooling is increasingly "mandatory", and language implementers have to deliver. (See also https://ollef.github.io/blog/posts/query-based-compilers.htm... .) The higher standards of users (and aspirations of DARPA :)) are now providing the missing economic incentive.


From what I hear, Clang development has slowed to a crawl and GCC leads in new standards compliance and features, since major past Clang supporters have stopped contributing to Clang in order to instead concentrate on their own languages.

User ‘pjmlp’ has often written about this here on HN, but I don’t see anything posted in this thread.


Good read.

> When Clang is using itself incorrectly, it makes sense to trigger an assertion and abort execution—it’s probably a sign of a bug.

This statement may be ambiguous. It sounds like libraries shouldn't ordinarily abort on bad usages, and it's true this is a nuanced subject, but you really do want to abort as a default. What is problematic is introducing an abort in a code path that previously worked. You have to take two steps: tracking, or providing a mechanism for tracking, when it happens, then aborting once you are sure it won't cause a problem.
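A minimal sketch of that two-step rollout, assuming a hypothetical `UsageCheck` helper (not anything from Clang). Raising `AssertionError` stands in for a real abort so the example stays runnable:

```python
import collections

class UsageCheck:
    """A precondition check that can be rolled out in two steps."""

    def __init__(self, enforce=False):
        self.enforce = enforce
        self.violations = collections.Counter()

    def require(self, condition, name):
        if condition:
            return
        if self.enforce:
            # Step 2: once tracking shows nobody depends on the bad usage,
            # fail hard (a stand-in for abort()).
            raise AssertionError(f"precondition violated: {name}")
        # Step 1: only record the violation so existing callers keep working.
        self.violations[name] += 1

def first_char(check, text):
    """Hypothetical library function with a newly introduced precondition."""
    check.require(len(text) > 0, "first_char: non-empty input")
    return text[:1]

tracking = UsageCheck(enforce=False)
first_char(tracking, "")   # tolerated for now, but counted

strict = UsageCheck(enforce=True)
try:
    first_char(strict, "")
    strict_failed = False
except AssertionError:
    strict_failed = True
```

The point is that the switch to `enforce=True` happens only after the counter has stayed at zero for the callers you care about.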

This of course doesn't apply to all ecosystems (JS for instance, due in part to diversity of environment), but this perspective is not limited to the internal behavior of clang, rather it applies largely to low level, important, potentially-system software.


Aborting (as in calling abort(3)) inside a library is very problematic if I’m writing an application that uses it. It takes away the ability of the larger application to detect and handle the error, simply terminating the entire process. Especially in a C++ library, something like exception throwing is better than an immediate abort, because the application can at least catch the exception and proceed. Exceptions are admittedly a controversial subject, but are easier to utilize inside potentially deeply nested call stacks where explicit error reporting would otherwise complicate the API.


The ABI complaint is sound. That really shouldn't be smeared out over the compiler front end (clang) and the architecture lowering (llc, ish). I kind of blame C for that one but maybe we could do better.

Llvm in general is pretty easy to work with. A single IR with multiple passes is a good way to build a compiler. Extending clang somewhat less so, though people seem to make that work anyway.


> A single IR with multiple passes is a good way to build a compiler

https://mlir.llvm.org/, which VAST is using, is largely claiming the opposite. Most passes are more naturally not "a -> a", but "a -> b". Algorithms and data structures work hand in hand; it is very nice to produce "evidence" for what is done in the output data structure.

This is why https://cakeml.org/, which "can't cheat" with partial functions, has so many IRs!

Using just a single IR was historically done for cost-control, the idea being that having many IRs was a disaster in repetitive boilerplate. MLIR seeks to solve that exact problem!
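A toy illustration of an "a -> b" pass, with made-up node types rather than anything from MLIR or CakeML. The source IR has subtraction and the target IR does not, so the output type itself is the evidence that the lowering ran. Python only checks these annotations with an external type checker, so treat this as a sketch of the shape, not of the enforcement.

```python
from dataclasses import dataclass
from typing import Union

# Source IR ("a"): includes subtraction as a convenience form.
@dataclass(frozen=True)
class SNum:
    value: int

@dataclass(frozen=True)
class SAdd:
    left: "Surface"
    right: "Surface"

@dataclass(frozen=True)
class SSub:
    left: "Surface"
    right: "Surface"

Surface = Union[SNum, SAdd, SSub]

# Target IR ("b"): no subtraction node exists, only negation.
@dataclass(frozen=True)
class CNum:
    value: int

@dataclass(frozen=True)
class CAdd:
    left: "Core"
    right: "Core"

@dataclass(frozen=True)
class CNeg:
    operand: "Core"

Core = Union[CNum, CAdd, CNeg]

def lower(expr: Surface) -> Core:
    """An a -> b pass: rewrites x - y as x + (-y)."""
    if isinstance(expr, SNum):
        return CNum(expr.value)
    if isinstance(expr, SAdd):
        return CAdd(lower(expr.left), lower(expr.right))
    if isinstance(expr, SSub):
        return CAdd(lower(expr.left), CNeg(lower(expr.right)))
    raise TypeError(f"not a Surface node: {expr!r}")

def evaluate(expr: Core) -> int:
    """A later stage: it never needs a subtraction case."""
    if isinstance(expr, CNum):
        return expr.value
    if isinstance(expr, CAdd):
        return evaluate(expr.left) + evaluate(expr.right)
    if isinstance(expr, CNeg):
        return -evaluate(expr.operand)
    raise TypeError(f"not a Core node: {expr!r}")

lowered = lower(SSub(SNum(7), SAdd(SNum(1), SNum(2))))
```

The duplicated `Num`/`Add` definitions are exactly the boilerplate cost mentioned above.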


The Chez nanopass was similar, different types representing different invariants on otherwise similar IR as it progresses through a pipeline. MLIR has the same common infrastructure idea below the dialects.

There are a lot of engineering compromises in compiler design. Single IR vs multiple is one of the contentious ones, where in the details the single IR is prone to having different properties at different points in the compiler and the multiple IR is prone to having strong common themes.

I think a consensus is slowly forming that SSA is the right thing, but even there whether phi instructions or block arguments are better is debated, as is whether to stay in SSA form for machine instruction representation or not.
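For anyone unfamiliar with the phi vs block argument distinction, here is a small sketch showing the same merge point in both styles. The dictionary layout is invented for illustration, and it assumes each block has at most one edge to any given successor.

```python
def phis_to_block_args(blocks):
    """Convert phi-style blocks to block-argument style.

    Input:  {name: {"phis": [(dest, {pred: value})], "succs": [name, ...]}}
    Output: {name: {"params": [dest, ...], "branches": {succ: [value, ...]}}}
    """
    out = {
        name: {"params": [dest for dest, _ in block["phis"]], "branches": {}}
        for name, block in blocks.items()
    }
    for name, block in blocks.items():
        for succ in block["succs"]:
            # The value a phi selects for predecessor `name` becomes the
            # argument `name` passes when it branches to `succ`.
            out[name]["branches"][succ] = [
                incoming[name] for _, incoming in blocks[succ]["phis"]
            ]
    return out

# x3 = phi [x1 from "then"], [x2 from "else"]
PHI_FORM = {
    "then": {"phis": [], "succs": ["join"]},
    "else": {"phis": [], "succs": ["join"]},
    "join": {"phis": [("x3", {"then": "x1", "else": "x2"})], "succs": []},
}
BLOCK_ARG_FORM = phis_to_block_args(PHI_FORM)
```

In the converted form, `join` declares `x3` as a parameter and each predecessor supplies its own value on the branch, so the information is the same and only where it lives differs.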

Were the early compilers sometimes based on a single IR? My impression is that they went through a wide variety of different representations tied to particular analysis passes, but I haven't seen early implementations of the industrial toolchains. I would be curious about any references you have for that - the optimal decisions of the past are usually worth reconsidering.


> I would be curious about any references you have for that - the optimal decisions of the past are usually worth reconsidering.

I am afraid I do not have much on this.

> multiple IR is prone to having strong common themes.

I am optimistic that with enough fancy tricks (e.g. more type parameters) we can deduplicate.

Fundamentally, I'd rather be very strict and proactive about totality / modularity / formally enforcing invariants, and more lenient/reactive about duplication. Regressions of the former sort are extremely subtle and require tons of hard thinking to resolve. Regressions of the latter sort are comparatively simple to identify and reason about.

Seeing the commonalities in a bunch of similar things that we don't yet know how to abstract over is also a great way that practice can inform theory!


Says clang isn't a toolsmith’s compiler. Doesn't mention clangd. Hmmm. Even Apple switched from libclang to clangd.

https://lists.llvm.org/pipermail/cfe-dev/2018-April/057668.h...


libclang is a library, clangd is an executable. That post is about switching away from libclang-based tooling infrastructure; i.e. stop developing their own tool.

The differences between the C wrapper and C++ are superficial and not what this blog post is about. The problems are rather with the poor division of labor between the intermediate representations and the lies that they are properly self-contained. This is about what clang does, irrespective of whether one slaps a C interface on top or not.


You wouldn't 'execute' clangd. You would send it messages adhering to its Language Server Protocol [1]. The distinction between library API and server API is small. Moreover, Apple switched away from libclang-based tooling towards clangd.

[1] https://microsoft.github.io/language-server-protocol/specifi...
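For reference, a sketch of what "sending it messages" looks like on the wire: each message is a JSON-RPC body preceded by a `Content-Length` header. The helper names are made up, and the `initialize` parameters are the bare minimum rather than what a real client would send.

```python
import json

def frame_message(payload):
    """Wrap a JSON-RPC payload in the LSP base-protocol framing."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body

def parse_message(data):
    """Parse a single framed message back into a Python object."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        key, _, value = line.partition(b":")
        if key.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))

# Minimal initialize request; a real client would describe its capabilities.
initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {"processId": None, "rootUri": None, "capabilities": {}},
}
wire = frame_message(initialize)
```

These bytes would be written to the stdin of a running server process, with replies read from its stdout in the same framing.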


I can't tell if you're trying to split hairs or just being unintentionally obtuse. How do you propose sending clangd messages without executing it, i.e. starting the language server?


> The takeaway here is that the Clang AST is missing information that is invented by the LLVM IR code generator, but LLVM IR is also missing information that is destroyed by said code generator. And if you want to bridge that gap, you need to rely on an approximation: the Clang CFG.

This is how optimizers do terrible things with no way to warn users: information loss across multiple transformed representations of the original. It's a telephone game.


Does something like essentially this exist?

    lattice = languageLattice [python, cpp] -- also includes c-->python in the lattice, implicitly, since python is written in c
    latticeDebugger = languageLatticeDebugger lattice=lattice
if I want to debug mixed python & c++? mainly I love numba but I've had some trouble with it reducing debuggability/transparency of the code


The article lists features potentially responsible for Clang's gaining popularity; among them was fast compile times. I've always read that LLVM's compile times are terrible and that this is, for instance, one of the reasons for Rust's slow compile times. Has this changed, or is he only making claims about the Clang front end?


I haven't used clang much in recent years but I remember that back when it was first introduced clang was faster than gcc while producing only slightly slower (or sometimes comparable) code.

Most likely compile times slowed down as clang and llvm became more complex, but early clang was fast enough for people to switch to it - and then they just stayed with it (this isn't a unique case, people switched to Chrome from Firefox because Chrome was much faster and they stayed with Chrome even after Chrome became slower and Firefox faster).

In any case the comparison was with gcc (and perhaps msvc), not all types of compilers.


I haven't seen any recent comparisons but the most recent benchmark I saw was that gcc/clang were close. I'm sure the speeds vary quite a bit depending on project size, options, available RAM, etc. IIRC using LTO makes linking significantly more resource intensive and I would assume this is where most of the disparities in performance are.


I got bitten many times by the fact that PATH is not taken into account, because I use Nix to manage my dotfiles, including `clangd`, but when developing libraries that target the base distro (not Nix) clangd sometimes gets confused and does not take into account the headers in /usr/include, only the Nix headers....



