It seems odd to me that we still require .o files at all.
Given that the final program needs to be storable in RAM + virtual memory, it surprises me that we still need the intermediate step of pushing to the file system only to immediately reopen and merge those files.
Does someone have more info on this? Or is the reason just legacy? Or is the reason just some ideological “single responsibility” thing?
At the highest level of performance needs, if you want to parallelize the build across machines, you're going to need some kind of storage abstraction for intermediate files, since translation units are the easiest place to split a build.
In some places, where builds are particularly optimized, there are special distributed filesystems just for object files. In that case, the object files aren't necessarily even backed by disk.
Being backed by disk locally mainly helps incremental builds: you can change one file and recompile only the translation units that depend on it. Disk/FS caching presumably absorbs a lot of the redundant I/O, and I think most of the benefit of building on tmpfs winds up being putting the source itself in RAM.
Edit: it's also worth noting that many compilers emit object files that other linkers can consume, letting you mix output from different compilers in some circumstances.
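For example, a minimal sketch (hypothetical file names; assumes both compilers target the same platform and object format):

    /* add.c -- built with one compiler, e.g.:   gcc -c add.c -o add.o     */
    int add(int x, int y) { return x + y; }

    /* main.c -- built with another, e.g.:       clang -c main.c -o main.o
       Either driver (or a standalone linker) can then combine the two
       object files into one program:            clang add.o main.o -o app */
    int add(int x, int y);

    int main(void) { return add(40, 2); }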
Thank you, the distributed-compile use case is a fun one to think about. I can see why it would require some intermediate set of bytes to shuffle across the network.
> Being backed by disk locally mainly helps for incrementally building so that you can change one file and only recompile intermediate files for translation units that depend on this file.
For the local incremental use case I’d love to see a more stateful compiler instead: one that could patch the bytes of an existing binary. E.g. give all functions some additional “empty padding”; then any modifications could be written directly into the binary as needed, until some defragmentation pass produces the final output binary.
The first thing to point out is that reading a recently-written file isn't all that expensive: it's stored in the filesystem cache anyways, so you bypass the disk for the reads.
Moreover, the filesystem is actually a decent database for multiprocess communication. If every translation unit is compiled into an independent file, which is then combined into a single final output, there is no need to build any complex locking mechanisms or the like and you still get to take advantage of the embarrassingly parallel nature of compiling.
Incremental compilation is an incredibly important tool. If you make a small change to one file, it's frequently not necessary to rebuild most of the code. Making the output of individual file compilations work in a way that allows incremental compilation to happen requires basically building .o files--and there's very little savings to be had by not emitting them to disk.
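A rough sketch of why (hypothetical file names, with plain cc commands standing in for whatever the build system actually invokes):

    /* b.c -- the only file edited since the last build */
    int twice(int x) { return 2 * x; }

    /* The incremental rebuild then only needs:
           cc -c b.c                       (recompile the one changed TU)
           cc a.o b.o main.o -o app        (relink, reusing cached a.o and main.o)
       a.o and main.o are read straight from the previous build's output.  */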
Finally, I'll note that very frequently, debug builds of large applications cannot fit in RAM. Debugging symbols bloat builds tremendously, especially in intermediate object form (since many symbols end up needing to be duplicated in every single .o file). A debug build of a large application may take up 80GB in disk space, to build a binary that (without debug symbols) would be perhaps 100MB in size.
Keeping everything in RAM just isn't feasible at scale.
> Given the final program needs to be storable in RAM + Virtual Memory it surprises me that we still need the intermediate step of pushing to the file system only to then immediately reopen and merge those files.
You don't. I recommend you learn about unity builds, and how major build systems support toggling them at the project and subproject level.
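For reference, a unity build is essentially just this (a minimal sketch with hypothetical file names):

    /* unity.c -- a "unity" (a.k.a. jumbo) build: one translation unit that
       textually includes all the others, so the whole program is compiled
       and handed to the linker in a single invocation:
           cc unity.c -o app                                                */
    #include "parser.c"
    #include "codegen.c"
    #include "main.c"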
The main reason most people haven't heard of the concept, and why most of those who have don't bother with it, is that a) you have little to nothing to gain from them, b) you throw incremental builds out of the window, and c) you break internal linkage and can thus introduce hard-to-track errors.
Also, it makes no sense at all to argue from the fact that the released software needs to run in memory to conclusions about how the software should be built. At most you have arguments about code bloat and premature optimization, but at what cost?
You could combine the build system (e.g. make), the linker, and the compiler into one app that knows everything and does it all in memory. It could be slightly faster but much less flexible.
What about the case of combining object files generated by different compilers, from different languages / object file formats, without needing access to each compiler that produced the .o file?
We make heavy use of distributed compiles at work, which .o(bj) files are useful for, but to compile something locally it might be useful to keep things in memory, if all of the objects fit. The objects of the code base I work on won’t fit in RAM, so they'll get paged to disk at some point during linking anyway.
I don’t think writing to and reading from disk is a significant portion of this loop. You’d want to measure the cost and thus maximum benefit. SSDs are fast. There isn’t much upside here.
The problem may be disk speed, though I agree it may not be. The problem I see is the dropping of state.
Specifically, to produce a .o file the compiler has already read through and created indexes of module::function/struct names, their layouts, dependencies, etc.
My understanding is that the linker then has to re-parse the .o files to recreate these indexes, layouts, and dependencies. Especially with LTO I'd imagine there would be additional inlining work (stuff the compiler is already good at).
This is all just wasted time, including the IO bottlenecks - even if those are marginal.
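To make the dropped state concrete, here's a minimal sketch (hypothetical files) of what the linker rediscovers from the object files' symbol tables and relocations:

    /* util.c -- util.o exports the symbol "square" in its symbol table */
    int square(int x) { return x * x; }

    /* main.c -- main.o carries an *unresolved* reference to "square" plus a
       relocation entry for the call site.  The compiler knew all of this
       while compiling, but the linker has to re-read both objects' symbol
       tables to match the reference to the definition and patch the call. */
    int square(int x);

    int main(void) { return square(7); }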
If you speed it up by 1/5 ten times, is each step 1/5 of the way to perfection? If yes, I think that's an exaggerated way of measuring; if no, then what was special about the particular baseline you picked?
I think you need to use a log scale for this. It's a step or two toward perfection, but perfection is infinite steps away.
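To put numbers on the compounding (a toy sketch, assuming each "speed it up by 1/5" means the time drops to 80% of what it was):

    #include <stdio.h>

    int main(void) {
        double t = 1.0;                    /* normalized link time           */
        for (int i = 1; i <= 10; i++) {
            t *= 0.8;                      /* each win keeps 4/5 of the time */
            printf("after win %2d: %.3f of the original time\n", i, t);
        }
        /* After 10 wins you're at ~0.107x -- roughly 3.2 halvings, which is
           why counting progress on a log scale is the natural choice.       */
        return 0;
    }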
Note that there have been some license controversies with mold before, namely that they wanted to make all outputs AGPL, not simply mold itself [0]; it seems like they walked this policy back, however [1].
> Open-source license: mold stays in AGPL, but _we claim AGPL propagates to the linker's output_. That is, we claim that the output from the linker is a derivative work of the linker. That's a bold claim but not entirely nonsense since the linker copies some code from itself to the output. Therefore, there's room to claim that the linker's output is a derivative work of the linker, and since the linker is AGPL, the license propagates. I don't know if this claim will hold in court, but just buying a sold license would be much easier than using mold in an AGPL-incompatible way and challenging the claim in court.
Regardless of their current stance, this type of policy change on a whim led me to remove mold from my systems, since I don't want all of my future code to automatically become AGPL, even by accident.
In fairness, this was just a proposal (from someone who clearly has more knowledge of engineering than law). They got feedback that this isn't how the AGPL works, and decided to go with a commercial license for the macOS version instead. Which is annoying for me, as I was hoping to use mold on macOS and the monthly subscription seems a bit steep for a linker, but it seems like a perfectly reasonable license.
"I want to share another idea in this post to keep it open-source [..] Let me know what you guys think" is not a "policy change on a whim". It's an idea. It was not "walked back" on, because it was ... just an idea.
Your comment is a horrible misrepresentation of what's actually in the post.
wow. that borders on extortion... "you should really buy a license... who knows, otherwise your code might turn into AGPL, good luck with the expensive court fees..."
What the heck? You can get an AGPL linker for free, or pay for a non-AGPL one. This type of thing is standard in open source. Selling a product isn't extortion.
I don't recall seeing a single developer tool where the output was anything other than fully under the copyright of the author of the input files (or fully liberally licensed in the case of additional code objects).
Selling is not extortion, no. That's not what I claimed. But did you read the quote?
> mold stays in AGPL, but _we claim AGPL propagates to the linker's output_ [...] I don't know if this claim will hold in court, but just buying a sold license would be much easier than using mold in an AGPL-incompatible way and challenging the claim in court.
If you were a big corporation who doesn't want to buy a license but speed up dev time, couldn't you just let your devs use mold for development to have faster recompilation cycles, but link the official binaries with some other, slower linker?
It depends on whether you find it acceptable that the work you created gets assigned the AGPL license. However, this problem has since been solved, so it's a non-issue these days.
One practical problem, even for developers who are fine with licensing their code to whatever open source license is easiest, is that not all open source licenses are AGPL compatible. Take for example the Mozilla Public License, which is inherently incompatible with AGPL because of the terms imposed; this means that any project using MPL licensed libraries could no longer be linked with Mold.
License incompatibilities can be a huge pain (see: ZFS + Linux). If you develop software for yourself this isn't a problem, but if you intend to distribute your software this becomes more of an issue.
This is probably also the main reason why normal linkers/compilers don't impose licenses on the produced work.
The point of copyleft is to dictate the licence you must use, if you wish to (roughly speaking) link with the copyleft-licensed work. There are plenty of libraries that you cannot use if you wish to distribute your program without making its source-code available.
The unusual thing here is that the creators of a linker are apparently trying to have the copyleft licence propagate to code that is input to the linker. Others have pointed out that GCC has exceptions for this kind of thing, despite that it is released under a strong copyleft licence (GPLv3+).
No, the point of copyleft is that when distributing the software to others, you don't restrict the freedoms you received with it. You can use copylefted software to your heart's content, in combination with whatever other software you want; you just can't distribute it under a more restrictive licence.
This level of detail is incidental to my point, hence "roughly speaking". Even the GNU folks summarise copyleft essentially as I have. [0]
Also, your account of copyleft is still incorrect. It's true of the GPLv2 and GPLv3 licences but not true of all copyleft licences. The AGPLv3 licence, which is the one relevant here, doesn't apply only on distribution.
Edit: I think I was mistaken in writing "propagate to code that is input to the linker", though. As lokar's comment points out, it's instead about the output of the linker.
It's not, if it's contained to just the linker itself (which it is now); that's not why I stopped using it. I did so because it seems like they don't understand that much about the AGPL and licensing in general, and could change their license terms at any point to say something like "we claim AGPL propagates to the linker's output", which is itself a very legally tenuous claim.
Tenuous or not, I believe GCC explicitly has a licensing exception that states that compiler output is not considered a derivative work of the compiler, and thus need not also be licensed under the GPL. So the GNU/FSF folks at least thought it was a concerning enough legal idea to explicitly account for it.
Not sure we can say that a linker is the same as a compiler in this sense, but if so, maybe it is indeed worrisome.
That exception exists because compilers have a tendency to leave little bits of themselves in the code they compile. For example, if you're compiling for a target that doesn't have a division instruction, you're going to be using a compiler-provided division routine that gets combined in with the source code. And that routine is clearly part of the compiler's source code.
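A concrete illustration (the exact routine name depends on the target and toolchain; __aeabi_idiv is the ARM EABI helper shipped in libgcc/compiler-rt):

    /* On a target without a hardware divide instruction (e.g. some ARM
       cores), this innocent-looking function compiles into a call to a
       routine from the compiler's own runtime library, such as
       __aeabi_idiv -- compiler-provided code copied into your binary.   */
    int ratio(int a, int b) { return a / b; }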
The standard compiler license exception (this applies to LLVM too, e.g.) says that any such code that gets combined in with your application code doesn't count. Note that it's still a potential license violation to use that code elsewhere (say, using those routines in another compiler).
This isn't a concern for linkers, because linkers don't really provide anything in the way of code; everything is provided by the compiler, as a compiler or language support library. The largest thing a linker might add to your program is probably the PLT stub code, at most a couple of instructions long.
https://news.ycombinator.com/item?id=26233244
https://news.ycombinator.com/item?id=29568454
https://news.ycombinator.com/item?id=33584651
https://news.ycombinator.com/item?id=34141912