
Disclaimer up-front: my day job is hacking on GCC and related GNU pieces of infrastructure, so that colors my opinions somewhat. I speak only for myself, not for my employer.

My level of experience in both: I hacked a specialized LLVM inlining pass in graduate school. I've written two middle-end passes for GCC and tweaked GCC's inliner in similar, though less invasive, ways. I found the level of difficulty to be similar between the two projects.

Other people may have different opinions. I remember reading about a graduate-school project that spent four months trying to get started on a middle-end optimization pass in GCC and got nowhere; after switching to LLVM, they made progress in a month or so. Personally, I think that means they weren't trying very hard with GCC.




Given your experience with both, what would you consider the pain points of getting a grasp of each when looking to contribute? I know when I looked at GCC, the mix of custom manual memory management (ggc_free and pals), semi-automatic management via obstacks, and automatic management via garbage collection seemed baroque and frightening.


I don't have a good sense of pain points in LLVM; the inliner hacking was a while ago, I don't remember many of the details, and LLVM has surely changed quite a bit since then.

As for GCC, I think the pain points are twofold. First, the documentation for the middle-end is somewhat scattered. I honestly think enough information to figure things out is present; it's just not always obvious where to look. There are lots of other passes to study, too, which can be extremely helpful, but the assumptions of an interface, or its side effects, are not always stated, which can be surprising at times. Second, contributing upstream: you're going to get dinged on formatting, documentation (usually just "did you do it"; the review is generally not as thorough as, say, GDB's documentation review), compilation time, and so on.

Also, GCC's hash tables (htab_t) are a pain to use correctly.
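To illustrate the htab_t awkwardness, the usual lookup-or-insert pattern with libiberty's hashtab.h looks roughly like this. This is a from-memory sketch, not a compilable example; `key` and `make_entry` are hypothetical placeholders:

```c
#include "hashtab.h"   /* libiberty */

/* Create a table keyed on pointer identity; the last argument is an
   optional per-element delete callback (none here).  */
htab_t cache = htab_create (31, htab_hash_pointer, htab_eq_pointer, NULL);

/* htab_find_slot hands back a void **slot rather than the element
   itself, and INSERT may resize the table, invalidating any slots
   you were still holding from earlier calls.  */
void **slot = htab_find_slot (cache, key, INSERT);
if (*slot == NULL)
  *slot = make_entry (key);   /* hypothetical constructor */
```

The double-indirect slots, the INSERT/NO_INSERT distinction, and keeping the hash and equality callbacks consistent are where people typically trip up.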

Of course, my experience is somewhat slanted towards the middle-end; the set of pain points is somewhat different if you are working in the front-end or the back-end. And my set of pain points from the middle-end might be different if I had worked on different optimization passes. (My passes cared very little about things like aliasing, for instance.)

It's surprising to me that you mention memory management as a pain point, though I can see how the variety can be bewildering. The only distinction that really matters is between GC'd and non-GC'd memory. Obstacks and alloc pools are just ways of providing specialized malloc interfaces. A useful rule of thumb is that if your data is only needed for one pass of the compiler, you can allocate it any way you like; if the data is longer-lived than that, it needs to go in GC'd memory. I can elaborate if you'd like, but that's the basic idea.

FWIW, I agree that the whole GC system is somewhat baroque. The GC was a decent solution to a real engineering problem: it cleaned up memory management and provided the basis for precompiled headers. But it causes problems in other ways nowadays, and trying to get rid of it would be a huge effort.





