Note that the author of that paper (a friend of mine) wrote another build system, the dead-simple-but-awesome make.py. I have a mirror/fork of it[0], since it's been unmaintained for a while (but it mostly doesn't need any maintenance).
The entire build system is a single Python script that's less than 500 lines of code. Rather than trying to fit complicated rules into Make's arcane syntax, rules are specified with a Python script, a rules.py file (see [1]). But the script should be thought of more as a declarative specification: the rules.py file is executed once at startup to create the dependency graph of build outputs, and the commands to build them.
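To give a rough flavor, a rules.py might look something like this (a hypothetical sketch only; names like ctx.add_rule() are illustrative, not the actual make.py API--see [1] for the real thing):

    # Hypothetical sketch of a rules.py-style spec, not make.py's real API.
    def rules(ctx):
        objs = []
        for src in ['main.c', 'util.c']:
            obj = src.replace('.c', '.o')
            # One rule per output: outputs, inputs, and the command to run.
            ctx.add_rule([obj], [src], ['cc', '-c', src, '-o', obj])
            objs.append(obj)
        ctx.add_rule(['prog'], objs, ['cc', '-o', 'prog'] + objs)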
Yet despite the small size, it's generally easier than with Make to specify the right dependencies, do code generation steps, and get full CPU utilization across many cores.
At some point I'd like to write more about make.py and try to get it used a bit more by the public...
If you haven’t seen Bazel, you should take a look.
It’s definitely not as minimalist, but it has a very, very similar model for specifying the build. In my experience, it’s pretty easy to get going, and it makes it pretty hard to screw up any of the important features of the build.
Yeah, I know about Bazel, but only at a high level--I haven't used it.
I generally think the hermetic build concept is a very good one, but IMO Bazel goes about it the wrong way, and is overengineered. Rather than needing custom-built infrastructure for every type of language supported, I'd prefer build systems to use lower-level OS facilities for discovering dependencies and controlling nondeterministic behavior.

That is, build rules would use something like the rules.py files of make.py, specifying any arbitrary executables to run, but without needing to specify the input dependencies of each rule. Each command run would get instrumented with strace (or the equivalent for non-Linux OSes), and filesystem accesses detected. If a file is opened by a build step, that path would be checked for other build rules. If one exists, and it's out of date, the first build step gets paused while the input file gets built, then resumed. All of this happens recursively for the whole build graph, starting from the first requested build output. Other potentially nondeterministic system calls (timestamps, multi-threading/-processing, network access, etc.) would be restricted/controlled in various ways yet to be determined.
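The core loop I have in mind would look roughly like this (a hand-wavy sketch; trace_opens() and is_out_of_date() are imaginary helpers that don't exist yet):

    # Rough sketch. trace_opens() is an imaginary ptrace wrapper that runs a
    # command and yields an event for each open() syscall, with the child
    # stopped until event.resume() is called. is_out_of_date() is equally
    # hand-waved.
    def build(target, rules, built):
        if target in built or target not in rules:
            return
        for event in trace_opens(rules[target].command):
            if event.path in rules and is_out_of_date(event.path, rules):
                # Pause the running command while its input gets built...
                build(event.path, rules, built)
            # ...then let the open() proceed.
            event.resume()
        built.add(target)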
That said, I haven't actually built anything like that (or know of anyone else who has). Maybe there are some complicated issues that this couldn't deal with but Bazel could. For example, there might be sources of nondeterminism that don't involve syscalls, like vDSO--I don't know for sure, though. Portability between OSes would definitely be an issue. But overall I feel that, barring any major unforeseen issues, something like this could be built in a fairly minimalist fashion: maybe a few thousand lines of Python, plus possibly a small C module.
There are build systems that use strace to find dependencies. For instance, tup [1][2] and Fabricate [3]. Also see this post on the Waf blog, which discusses some issues with this approach [4].
Ah, thanks for the references. Now that you mention them, I realize I definitely knew about tup and fabricate before (and possibly waf?), but had forgotten about them. I haven't really thought much about trace-based build systems in years, until this subthread.
And looking through that waf blog post, I realize that I meant ptrace instead of strace--I want full fine-grained syscall interception, not just a text report afterwards. That gets around a lot of the overhead/parsing problems mentioned, and is required for the "pause build command so its input file can be built" case.
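For contrast, the after-the-fact text-report approach looks roughly like this (a quick sketch; the exact log format varies a bit between strace versions):

    import re, subprocess, tempfile

    def deps_via_strace(cmd):
        # Run the command to completion under strace, then scrape the log for
        # successful open()/openat() calls. Works, but it's all after the
        # fact: there's no chance to pause the command and build an input
        # that doesn't exist yet.
        with tempfile.NamedTemporaryFile(mode='r') as log:
            subprocess.check_call(['strace', '-f', '-e', 'trace=open,openat',
                                   '-o', log.name] + cmd)
            return {m.group(1) for m in
                    re.finditer(r'open(?:at)?\(.*?"([^"]+)".*=\s*\d+',
                                log.read())}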
> Rather than needing custom-built infrastructure for every type of language supported
My understanding is that bazel is moving away from this, so that you can define toolchains by saying "here is a binary that serves the job of linking/compiling stuff".
The challenge with your idea is that you're basically saying "hey, we should sandbox and introspect <any number of fairly arbitrary and complex binaries> to intercept and modify their filesystem and network (at a minimum) accesses, across any number of versions and uses". Even just handling conditionally rewriting file writes/reads based on guessing whether something is an input or re-used output isn't that easy in general.
> My understanding is that bazel is moving away from this, so that you can define toolchains by saying "here is a binary that serves the job of linking/compiling stuff".
How do they ensure determinism in that case? Is it just an easy escape hatch so that new languages can be easily supported, with no actual guarantees of hermeticity?
> The challenge with your idea is that you're basically saying "hey, we should sandbox and introspect <any number of fairly arbitrary and complex binaries> to intercept and modify their filesystem and network (at a minimum) accesses, across any number of versions and uses".
I think my approach would certainly use a syscall whitelist. Any unsupported syscall would be a build failure, and presumably a bug report if it's a legitimate use. I suspect most build commands can get by with a pretty minimal set of syscalls (mainly basic filesystem access). At some point though, if you start supporting more and more syscalls, you start re-implementing VMs/containers, which sucks. This build system only stays simple if people don't try to do a bunch of wacky things with it :)
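Concretely, I'm imagining something as dumb as this (a very rough sketch; the actual set of allowed syscalls would obviously need tuning):

    # Rough sketch of the whitelist idea; the real list would need tuning.
    ALLOWED_SYSCALLS = {
        'read', 'write', 'open', 'openat', 'close', 'stat', 'fstat', 'lstat',
        'lseek', 'mmap', 'munmap', 'brk', 'access', 'getcwd', 'execve',
        'exit_group',
    }

    class BuildError(Exception):
        pass

    def check_syscall(rule_name, syscall):
        if syscall not in ALLOWED_SYSCALLS:
            raise BuildError('rule %r made unsupported syscall %r: either a '
                             'build-system bug, or the rule is doing something '
                             'too wacky to sandbox yet' % (rule_name, syscall))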
Network accesses would probably get whitelisted by the user on a rule-by-rule basis for cases like "download these packages", with the outputs treated as always dirty. The tool would be responsible for running efficiently even if no work actually needs to be done.
One weird/hard thing to support would be soft/hard linking. I'm not sure exactly what should be done there, but that might not be needed for early versions.
> Even just handling conditionally rewriting file writes/reads based on guessing whether something is an input or re-used output isn't that easy in general.
I'm not sure I understand this. One thing I should note is that in my scheme, you still have to specify output files for rules--you only get to skip specifying the inputs.
> Is it just an easy escape hatch so that new languages can be easily supported, with no actual guarantees of hermeticity?
Yes, I mean in general some level of compiler hermeticity is assumed. You can verify it by checksumming everything (which bazel does), but you can easily destroy the performance/caching of bazel by modifying clang to have intentionally unpredictable results.
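Roughly, the caching boils down to content-addressing each action: hash the command line plus the contents of its inputs, and reuse the cached output when the key matches--something like this sketch (not bazel's actual implementation):

    import hashlib

    # Sketch of content-based action keys: a change to either the command or
    # any input's bytes invalidates the cached output.
    def action_key(command, input_paths):
        h = hashlib.sha256()
        h.update('\0'.join(command).encode())
        for path in sorted(input_paths):
            with open(path, 'rb') as f:
                h.update(hashlib.sha256(f.read()).digest())
        return h.hexdigest()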
> One weird/hard thing to support would be soft/hard linking. I'm not sure exactly what should be done there, but that might not be needed for early versions.
I think bazel's solution here is to just always make fat binaries. Or at least that's how it works if you're using blaze.
> I'm not sure I understand this. One thing I should note is that in my scheme, you still have to specify output files for rules--you only get to skip specifying the inputs.
Ah, this helps somewhat, though I think there are still potential ambiguities. One thing that bazel can do is statically determine all of the input sources without actually executing the build; you, of course, can't do this. This has a few nice properties: you get missing-input errors before doing any actual building, you have a build graph you can statically analyze (deps queries are magic), all the sandboxing can be done ahead of time (with symlinks to create a shadow filesystem) instead of ad hoc, and you can parallelize everything, meaning you can saturate all your machines the entire time.
With strace, you have performance issues from the tracing itself, and you can't precompile all the dependencies beforehand; you have to discover them as you go (this likely also hurts on memory, since you'd need to keep all the partial compiler information in memory).
I also find it easier to reason about stuff with explicit deps, but that's just me.
> I think bazel's solution here is to just always make fat binaries. Or at least that's how it works if you're using blaze.
Oh, I should've specified soft/hard filesystem links--they'd make handling a virtual filesystem rather complicated. Dynamic linking is another strange case, but it looks like dylib loading happens in userspace, via open()/mmap() (at least on Linux), so it would be caught by the normal filesystem hooks.
> Ah, this helps somewhat, though I think there are still potential ambiguities. One thing that bazel can do is statically determine all of the input sources without actually executing the build; you, of course, can't do this. This has a few nice properties: you get missing-input errors before doing any actual building, you have a build graph you can statically analyze (deps queries are magic), all the sandboxing can be done ahead of time (with symlinks to create a shadow filesystem) instead of ad hoc, and you can parallelize everything, meaning you can saturate all your machines the entire time.
OK, I started out thinking that there would be very little parallelism impact from my scheme, but I now realize the problem: dependencies for a build step are only discovered serially, as each one is built and the dependent process continues.
I imagine you could work around this with a speculation engine that uses the dependencies from the last run (possibly cancelling the speculated commands mid-flight if the dependent process doesn't end up opening their outputs, and if the speculated commands haven't started writing files or anything). This would generally work fine, but would waste some work whenever you remove dependencies from a build step (e.g. delete an #include). But it does start to get messy!
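To make the hand-waving slightly more concrete, something like this (every helper here--scheduler, trace_opens(), is_out_of_date()--is imaginary):

    # Sketch of the speculation idea; last_run_deps is persisted from the
    # previous build, and the helpers are all hypothetical.
    def build_with_speculation(target, rules, last_run_deps, scheduler):
        # Optimistically start building the inputs this rule opened last time.
        guesses = {dep: scheduler.start(dep)
                   for dep in last_run_deps.get(target, ())
                   if dep in rules and is_out_of_date(dep, rules)}
        opened = set()
        for event in trace_opens(rules[target].command):
            opened.add(event.path)
            if event.path in rules:
                scheduler.wait_for(event.path)  # speculated, or built on demand
            event.resume()
        # A guess that was never opened was wasted work; cancel it if it
        # hasn't started writing outputs yet.
        for dep, job in guesses.items():
            if dep not in opened:
                scheduler.cancel_if_clean(job)
        last_run_deps[target] = opened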
Another note: as far as not specifying inputs, I mean inputs beyond the command line (which obviously needs to be specified up front). So you would only need speculation like the above for things like auto-generated headers. The usual cases like "build this executable out of these objects which are each built from these source files" can be parallelized just fine with no speculation.
Static analysis-type queries could be done after a build, but not in a clean tree. Not sure how much of a setback this would be--I'm not generally working in massive projects, and haven't yet felt the need for such queries.
strace performance is a good point (I don't know what sort of overhead it has), and so is memory overhead.
> I also find it easier to reason about stuff with explicit deps, but that's just me.
I could go either way on this one--I generally prefer explicit to implicit, but I don't like repeating the explicit instructions twice (build rules and source code), with the failure mode being unreliable incremental builds. If I'm going to the trouble of explicitly listing dependencies, though, I'll probably go with the ridiculously simple make.py instead of messing with Bazel...
Wow, that's kind of incredible. That bug has been open for years, and is only starting to see some progress. Lends some credence to my unsupported claim of Bazel being overengineered...
I also wrote a make replacement in Python (2.7) [1]. I'm proud of it, but it doesn't do parallel builds, and there are so many other build tools that are better tested. But I'll put it here in case it gives anyone any ideas.
That's pretty cool! It's really not too different from make.py. Using decorators and Python functions is likely cleaner in a lot of cases, though I'm not sure if that would fit well with make.py's pseudo-declarative model. It probably wouldn't be too hard to add support for parallelism, either.
[0] https://github.com/zwegner/make.py
[1] https://github.com/zwegner/make.py/blob/master/example/rules...