Elfshaker: Version control system fine-tuned for binaries (github.com/elfshaker)
599 points by jim90 on Nov 19, 2021 | 113 comments



I experimented with something similar with a Linux distribution's package binary cache.

Using `bup` (deduplicating backup tool using git packfile format) I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.

Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.

Details: https://github.com/NixOS/nixpkgs/issues/89380


Author here. I've used bup, and elfshaker was partially inspired by it! It's great. However, during initial experiments on this project I found bup to be slow, taking quite a long time to snapshot and extract. This could in principle be fixed in bup one day.


I have also used bup for a long time, but found that for very large server backups I'm hitting performance problems (both in time and memory usage).

I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but is younger and thus lacks some features.

Here is somebody's benchmark of bupstash (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/

The `bupstash` author is super responsive on Gitter/Matrix, it may make sense to join there to discuss approaches/findings together.

I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that work with huge files and for which diff-based incremental saving is more difficult/fragile to implement than deduplicating, snapshot-based saving.


I used `bupstash` and evaluated it for a while. I am looking to do 5+ offsite backups of a small personal directory to services that offer 5GB of cloud space for free.

`bupstash` lacked good compression. I settled with `borg` because I could use `zstd` compression with it. Currently at 60 snapshots of the directory and the `borg` repo directory is at ~1.52GB out of 5GB quota. The source directory is ~12.19GB uncompressed. Very happy with `borg` + `zstd` and how they handle my scenario.

I liked `bupstash` a lot, and the author is responsive and friendly. But I won't be giving it another try until it implements much more aggressive compression compared to what it can do now. It's a shame, I really wanted to use it.

I do recognize that for many other scenarios `bupstash` is very solid though.


Borg has been working great for me with zstd.


Thank you for having such a good description on the project! Sometimes links from HN lead to a page that takes a few minutes of puzzling to figure out what is going on, but not yours.


Is elfshaker any good for backing up non-text data?



Somewhat related (and definitely born out of a very similar use case): https://github.com/mhx/dwarfs

I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.


Nice, I bet dwarfs would do well at our use case too. Thanks for sharing.


That's super impressive, I will definitely give it a go. Thanks for sharing!


This seems very much like the Git repository format, with loose objects being collected into compressed pack files - except I think Git has smarter heuristics about which files are likely to compress well together. It would be interesting to see a comparison between this tool and Git used to store the same collection of similar files.


An author here, I agree! The packfile format is heavily inspired by git, and git may also do quite well at this.

We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster, and with smaller output, than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.

On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.


As far as I was able to learn (don't remember the details, sorry), git does not do well with large binary files. I believe it ends up with a lot of duplication. It is the major thing I am missing from git; currently we store assets (like big PSDs that change often) outside of version control, and it is suboptimal.


Performing poorly with non-textual data happens for a number of reasons. Binary data, when changed, often has a lot of 'non-local' changes in it. For example, a PSD file might well have a compression algorithm already applied to it. An insertion/deletion is going to result in a very different compressed representation for which there is no good way to have an efficient delta. elfshaker will suffer the same problem here.


One could, in theory, write a git-clean filter (like the one used for git-lfs), that teaches git various heuristic approaches to "take apart" well-known binary container formats into trees of binary object leaf-nodes.

Then, when you committed a large binary that git could understand, what git would really be committing in its place would be a directory tree — sort of like the "resource tree" you see if you edit an MKV file, PNG file, etc., but realized as files in directories. Git would generate it, then commit it.

On checkout, this process would happen in reverse: a matching git-smudge filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the toplevel, and end up with the original large binary again.

Since most of the generated leaf-nodes from this process wouldn't change on each commit, this would eliminate most of the storage overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash.])

I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
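
As a purely illustrative sketch of the "take apart" / "put back together" steps for a flat IFF-style container (assuming 4-byte chunk IDs, 4-byte big-endian lengths, even-byte padding, and ignoring nested FORMs), in Python rather than as a real clean/smudge filter:

```python
import os
import struct


def split_chunks(blob: bytes, out_dir: str) -> None:
    """Write each top-level IFF-style chunk as its own small file."""
    os.makedirs(out_dir, exist_ok=True)
    pos, index = 0, 0
    while pos + 8 <= len(blob):
        chunk_id = blob[pos:pos + 4]
        (size,) = struct.unpack(">I", blob[pos + 4:pos + 8])
        data = blob[pos + 8:pos + 8 + size]
        # One leaf file per chunk; chunks that don't change between versions
        # become identical blobs that git deduplicates for free.
        name = f"{index:04d}_{chunk_id.hex()}.bin"
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(data)
        pos += 8 + size + (size & 1)  # chunk data is padded to an even length
        index += 1


def join_chunks(in_dir: str) -> bytes:
    """Reassemble the original byte stream from the per-chunk files."""
    out = bytearray()
    for name in sorted(os.listdir(in_dir)):
        chunk_id = bytes.fromhex(name.split("_", 1)[1].split(".", 1)[0])
        with open(os.path.join(in_dir, name), "rb") as f:
            data = f.read()
        out += chunk_id + struct.pack(">I", len(data)) + data
        if len(data) & 1:
            out += b"\x00"  # restore the padding byte (assumed to be zero)
    return bytes(out)
```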


As I understand it, this is essentially what the Google Chrome updater does. It disassembles the binary and recalculates jump labels. Then it generates a diff based on the assembly code. When it applies that diff on people’s computers, your computer again pulls the Chrome binary apart and reconstructs it. The code for this is complex, but it’s all open source.

I remember reading about this technique years ago, thinking “cool, when this catches on, software updates in all my software will be tiny”. But no, for some reason macOS updates are still gigabytes in size. I have no idea why.


I think you're referring to Courgette, which does the disassemble + patch + reassemble thing. FWIW, Chrome now uses a simpler but competitive format called Zucchini, see my recent comment: https://news.ycombinator.com/item?id=29028534


It does!

The major distinction (besides this being generalized to pluggable binary container formats) is that what Courgette does happens in the context of delta comparison between two known binaries, where the goal is to create a minimal delta patch that can move you from one known binary to another known binary (given a sufficiently-smart updater, which actually "contains" a lot of the information in a Kolmogorov-complexity sense.)

There are efficiencies you can gain in encoding, that only work in a one-to-one, known-to-known comparison context. (Think "interstitial frames" in video; or RLE'd 1-bit delta-sigma audio encoding.) In these contexts, you need known1 on-hand already, plus a patch specific to the pair of {known1, known2}, to be able to recover known2. (This is why, in streaming video where non-delta-compressed "keyframes" are rare, you can't seek to arbitrary locations in the stream without the decoded video turning into a garbled mess for a while: you're receiving patches, but you don't have the [correct, clean] known1s to apply them to.)

But these efficiencies don't apply in the context of a system like git, where you might be checking out any arbitrary version of the data at any time, with any [or no!] other arbitrary version(s) already cached or checked out. To enable people to efficiently grab the delta they need to do the particular, arbitrary version-to-version jumps they need to do, you'd either need to generate the power-set of all updates; or at least enough entries to populate a skip-list (as Microsoft apparently does for Windows updates! [1]).

The powerset of updates is impractical to generate+store, but gets clients the ability to perform arbitrary jumps in O(1) time. With a skip-list of updates, on the other hand, either the server or the client needs to compute its way through all the skips, to combine them into one final update — so each jump would take O(log N) "patch application" computation steps to resolve. (This is why it takes the Microsoft servers a while to figure out which Windows updates your computer needs!) Neither approach really makes sense for a system like Git, where Git repos are entirely mirrored to every client (so the clients would need to store all those update patches), and where git checkouts are something you're supposed to be able to do often and quickly, in the middle of your daily workflow.
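
One way to sketch the skip-list idea (purely illustrative; it assumes a patch has been precomputed for every power-of-two jump between versions):

```python
# Sketch: with patches precomputed for every (v, v + 2^k) pair, any forward
# jump can be composed from O(log N) of them.
def plan_jumps(src: int, dst: int) -> list[tuple[int, int]]:
    """Return the (from_version, to_version) patches needed to go src -> dst."""
    assert dst >= src
    hops, cur = [], src
    while cur < dst:
        step = 1
        while cur + step * 2 <= dst:
            step *= 2  # take the largest power-of-two hop that still fits
        hops.append((cur, cur + step))
        cur += step
    return hops


print(plan_jumps(3, 20))  # [(3, 19), (19, 20)]
```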

Meanwhile, the sort of approach I outlined would be optimized instead for storing currently known binaries, in a way most amenable to inter-compression against future, unknown versions of the binary, without the need to store any explicit known-to-known patches, or to compute the composition of such patches.

Rather than seeking to compress together existing available things as well as possible, a known-to-unknown compression approach seeks to segment and normalize the data in such a way that future revisions of the same data, put through the same process, would generate lots of content-identical duplicate component objects, which would end up being dedup'ed away when stored in a Content-Addressable Store.

It's the difference between "meeting in the middle" and canonicalization. Levenshtein distance vs. document fingerprinting. RAID5 vs. a fountain-codec-encoded dataset. Or, to be very literal, it's streaming-window Huffman coding, vs. merkle-trie encoding.

This "normalize, to later deduplicate" known-to-unknown compression approach, is the approach that Git itself is built on — but it only works well when the files Git works with are decomposed into mostly-independent leaf-node chunks. My proposed thought here, is just that it wouldn't be impossible to translate other types of files, into "mostly-independent leaf-node chunks" at commit time, such that Git's approach would then be applicable to them.

[1] https://devblogs.microsoft.com/oldnewthing/20200212-00/?p=10...


Can you talk a bit more about what ELF-specific heuristics elfshaker uses? What kind of preprocessing do you do before zstd? Do you handle offsets changing in instructions, like the BCJ/BCJ2 filter? Do you do anything to detect insertions/deletions?


We've just added an applicability section, which explains a bit more about what we do. We don't have any ELF-specific heuristics [0].

https://github.com/elfshaker/elfshaker#applicability

In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit identical object files, even though the build system often wants to rebuild them because some input has changed.

elfshaker packs use a heuristic of sorting all unique objects by size, before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40kiB after compression with zstandard.

[0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)
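
As a rough illustration of that packing heuristic (not elfshaker's actual code, which also records an index so snapshots can be materialized again), the idea amounts to something like this in Python, using the `zstandard` bindings:

```python
# Sketch only: dedupe identical objects, sort the unique ones by size so
# similar objects end up near each other, concatenate, compress as one stream.
import hashlib

import zstandard  # pip install zstandard


def pack_objects(paths, out_path, level=19):
    unique = {}  # content hash -> bytes
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        unique[hashlib.sha256(data).digest()] = data  # identical objects stored once

    blobs = sorted(unique.values(), key=len)  # size-sort heuristic

    cctx = zstandard.ZstdCompressor(level=level, threads=-1)  # parallel compression
    with open(out_path, "wb") as out, cctx.stream_writer(out) as writer:
        for blob in blobs:
            writer.write(blob)
```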


Ah, I see! Makes sense that you can do much better if you get to compile the programs with your choice of options.


> we store assets (like big PSDs that change often) outside of version control and it is suboptimal.

Perforce is still used by game developers and other creatives because it handles large binaries quite well.

In fact, I'm not sure if they still do it, but one of the game engines (I think, maybe, Unreal) used to have a free tier that also included a free Perforce install.


It was my recollection, and I confirmed it, that they've almost always had a "the first hit is free" model for small teams, and they also explicitly call out indie game studios as getting free stuff too: https://www.perforce.com/how-buy


Do you think it would be feasible to do a git-lfs replacement based on elfshaker?

Down the line maybe it would even be possible to have binaries as “first-class” (save for diff I guess)


An author here, we've opened a Q&A discussion on GitHub: https://github.com/elfshaker/elfshaker/discussions/58.


Related, and impressive: https://github.com/elfshaker/manyclangs

> manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.

> It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.


The clever idea that makes manyclangs compress well is to store object files before they are linked, with each function and each variable in its own elf section so that changes are mostly local; addresses will indirect through sections and a change to one item won't cascade into moving every address.

I'm not sure the linking step they provide is deterministic/hermetic; if it is, that would prove a decent way to compress the final binaries while shaving most of the compilation time. Maybe the manyclangs repo could store hashes of the linked binaries if so?

I'm not seeing any particular tricks done in elfshaker itself to enable this, the packfile system orders objects by size as a heuristic for grouping similar objects together and compresses everything (using zstd and parallel streams for, well, parallelism). Sorting by size seems to be part of the Git heuristic for delta packing: https://git-scm.com/docs/pack-heuristics

I'd like to see a comparison with Git and others listed here (same unlinked clang artifacts, compare packing and access): https://github.com/elfshaker/elfshaker/discussions/58#discus...


Author here, I'd like to see such a comparison too actually, but I'm not in the position to do the work at the moment. We did some preliminary experiments at the beginning, but a lot changed over the course of the project and I don't know how well elfshaker fares ultimately against all the options out there. Some basic tests against git found that git is quite a bit slower (10s vs 100ms) during 'git add' and git checkout. Maybe that can be fixed with some tuning or finding appropriate options.


It would be interesting to compare to gitoxide tweaked to use zstd compression for packs.


Reminds me of how Microsoft packages the Windows installer, actually. If you’ve ever unpacked Microsoft’s install.esd, it’s interestingly insane how heavily it’s compressed. I assume it’s full of a lot of stuff that provides semi-redundant binaries for compatibility with a lot of different systems, because the unpacked esd container goes from a few GiB to around 40-50 GiB, IIRC.


The emulation community also has "ROMsets" — collections of game ROM images, where the ROM images for a given game title are all grouped together into an archive. So you'd have one archive for e.g. "every release, dump, and ROMhack of Super Mario Bros 1."

These ROM-set archives — especially when using more modern compression algorithms, like LZMA/7zip — end up about 1.1x the size of a single one of the contained game ROM images, despite sometimes containing literally hundreds of variant images.


How does this work? Do all the game series use the same engine code and assets?


I think you're slightly misinterpreting what the parent said. Take the game Super Mario World for the console Super Nintendo. It was released in Japan. It was released in the US. It was released in Europe. It was released in Korea. It was released in Australia. It was probably released in various minor regions and given unique translations. There are almost certainly re-releases of the game on Super Nintendo that issued new ROM files to correct minor bugs. Maybe there's a Greatest Hits version which might be the same game, but with an updated copyright date to reflect the re-release. This might amount to 10-12 versions of the same game, but 99.99% of what's in the ROM file is the same across all of them, so they can be compressed together very efficiently.

A copy of Super Mario Advance 2 for Game Boy Advance, which is also a re-release of Super Mario World, almost surely uses its own engine and would not be part of the same rom set. Likewise, other Mario games (like Mario 64, Super Mario Bros, etc.) would not be part of the same rom set. So it's nothing about the series using the same engine code or assets.

We're talking bugfixes and different regions for the same game on the same console. But this still has the effect of dropping the size for complete console collections by 50% or more, because most consoles have 2-3 regions per game for most games.


You're generally correct. But there are interesting exceptions!

Sometimes, ROM-image-based game titles were based on the same "engine" (i.e. the same core set of assembler source-files with fixed address-space target locations, and so fixed locations in a generated ROM image), but with a few engine modifications, and entirely different assets.

In a sense, this makes these different games effectively into mutual "full conversion ROMhacks" of one-another.

You'll usually find these different game titles compressed together into the same ROMset (with one game title — usually the one with the oldest official release — being considered the prototype for the others, and so naming the ROMset), because they do compress together very well — not near-totally, the way bugfix patches do, but adding only the total amount to the archive size that you'd expect for the additional new assets.

Well-known examples of this are Doki Doki Panic vs. Super Mario Bros 2; Panel de Pon vs. Tetris Attack; Gradius III vs. Parodius; and any game with editions, e.g. Pokemon or Megaman Battle Network.

But there are more "complete" examples as well, where you'd never even suspect the two titles are related, with the games perhaps existing in entirely-different genres. (I don't have a ROMset library on-hand to dig out examples, but if you dig through one, you'll find some amazing examples of engine reuse.)


Sort of. ROMHacks are modified ROM images of a certain game.

If you knew where in the ROM image the level data was contained, you could modify it. As long as you didn't violate any constraints, the game would run fine.

You could also potentially influence game behavior as well.

The Game Genie and GameShark were kind of based on this concept. Except, being further along the chain, they could rewrite values coming into and out of memory, so other effects were possible.

So, in the case of Super Mario Bros. ROMHacks, they all use Super Mario Bros. as a base ROM. Then from there, all you need to do is store the diff from the base.
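
For reference, these diffs are most commonly distributed as IPS patches; a minimal applier is only a short sketch like this (ignoring the optional truncation extension some tools add after EOF):

```python
def apply_ips(base: bytes, patch: bytes) -> bytes:
    """Apply an IPS patch (the classic ROMhack diff format) to a base ROM."""
    assert patch[:5] == b"PATCH", "not an IPS patch"
    rom = bytearray(base)
    pos = 5
    while pos + 3 <= len(patch) and patch[pos:pos + 3] != b"EOF":
        offset = int.from_bytes(patch[pos:pos + 3], "big")
        size = int.from_bytes(patch[pos + 3:pos + 5], "big")
        pos += 5
        if size:
            data = patch[pos:pos + size]
            pos += size
        else:  # RLE record: a run length followed by a single byte value
            run = int.from_bytes(patch[pos:pos + 2], "big")
            data = patch[pos + 2:pos + 3] * run
            pos += 3
        if offset + len(data) > len(rom):
            rom.extend(b"\x00" * (offset + len(data) - len(rom)))
        rom[offset:offset + len(data)] = data
    return bytes(rom)
```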


Ooh, neat. I was wondering why anybody would make a binary-specific VCS. And why "elf" was in the name. This answers both questions. Thanks!


This project reminded me of something I've been looking for for a while - although it's not exactly what I'm looking for...

I use SolidWorks PDM at work to control drawings, BOMs, test procedures, etc. In all honesty, PDM does an alright job when it works, but when I have problems with our local server, all hell breaks loose and worst case, the engineers can't move forward.

In that light, I'd love to switch to another option. Preferably something decentralized just to ensure we have more backups. Git almost gets us there but doesn't include things like "where used."

All that being said, am I overlooking some features of Elfshaker that would fit well into my hopes of finding an alternative to PDM?

I also see there's another HN thread that asks the question I'm asking - just not through the lens of Elfshaker: https://news.ycombinator.com/item?id=20644770


Maybe not precisely what you want, but I built a CLI tool[1] that's like a simplified and decoupled Git-LFS. It tracks large files in a content-addressed directory, and then you track the references to that store in source control. Data compression isn't a top priority for my tool; it uses immutable symlinks, not archives.

[1]: https://github.com/kevin-hanselman/dud
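
The core mechanism is just content addressing plus links; roughly something like this (an illustrative sketch of the general idea, not dud's actual implementation):

```python
import hashlib
import os
import shutil


def track(path: str, store_dir: str = ".store") -> str:
    """Move a file into a content-addressed store and leave a symlink behind."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    digest = h.hexdigest()

    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, digest)
    if os.path.exists(stored):
        os.remove(path)  # identical content is already in the store
    else:
        shutil.move(path, stored)
        os.chmod(stored, 0o444)  # treat store entries as immutable
    os.symlink(os.path.abspath(stored), path)
    return digest  # commit this digest to source control, not the data itself
```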


Interesting. I wonder if this can also be [ab]used to, say, deliver deltas of programs, so that you can have faster updates, but maybe it doesn't make sense.

https://en.wikipedia.org/wiki/Binary_delta_compression


Author here, I don't think it would apply well to that scenario. elfshaker is good for manyclangs where we ship 2,000 revisions in one file (pack), so the cost of an individual revision is amortized. If one build of llvm+clang costs you some ~400 MiB, a single elfshaker pack containing 2,000 builds has an amortized cost of around 40 KiB/build. But this amazing win is only happening because you are shipping 2,000 builds at once. If you want to ship a single delta, you can't compress against all the other builds.


How fast would it be to get a delta between any two of the 2,000 builds in a single elfshaker pack?

If that's reasonably fast, perhaps an approach like that could work: server stores the entire pack, but upon user request extracts a delta between user's version and target binary.

Still, the devil is in the details of building all revisions of all software a single distribution has.


Yes, you could do that. On the other hand, all revisions for a month is 100MiB, and all revisions we've built spanning 2019-now are a total of 2.8GiB, so we opted to forego implementing any object negotiation and just say 'you have to download the 100MiB for the month to access it'. I think a push/pull protocol could be implemented, but at that point git might do a reasonable job anyway :)


Thank you for the insight!


This is how OpenBSD binary patches for the kernel work. The changed object files are shipped; the end system relinks a new kernel.


I find the description a bit confusing; is there an example where we can see the usage?


My top-level understanding is that it's a VCS (like Git) specialized for binaries, with commands baked in to prevent the slowdown that often comes with large git repositories.


Specifically, it's for ELF binaries built in such a way that adding a new function or new data does not break how the existing functions/data are cached.

I wonder if this concept could be extended to other binary types that git has problems with, were you able to know/control more about the underlying binary format.


One of the authors here, thanks for the feedback. We've tried to improve it here: https://github.com/elfshaker/elfshaker/pull/59


Same here. There is a usage guide, which helped a tiny bit: https://github.com/elfshaker/elfshaker/blob/main/docs/users/...

Honestly, I sort of looked at it for a conventional backup strategy...as in, I wonder if it could work as a replacement for tar-zipping up a directory, etc. But I'm not sure if the use case is appropriate.


Author here. We'd love this to be a thing, but this is young software, so we don't recommend relying on this as a single way of doing a backup for now. Bear in mind that our main use case is for things that you can reproduce in principle (builds of a commit history, see manyclangs).


> our main use case is for things that you can reproduce in principle (builds of a commit history, see manyclangs)

I appreciate your response, and thanks very much for the clarification of use case; very helpful! Thanks also of course for building this!


For backup you probably want something like Borg to handle deduplication of identical content between backups.


Author here, I agree with xdfgh1112, please take care before using brand new software to store your backups!


Yes, any time that I use something new or different (or both) for something as essential as backups, I take great and deliberate care...and test, test, test...well before standardizing on it. ;-)


There is an associated presentation on manyclangs at the LLVM dev meeting. I think they presented yesterday?

Unfortunately it won't be uploaded until later but it will show up on the llvm YouTube channel:

https://www.youtube.com/c/LLVMPROJ


I would guess it’s a way to quickly bisect on compiler versions.


Huh, interesting, could you maybe use this as an in-repo alternative to something like git-lfs?


Author here, I don't currently know how this compares to git-lfs. It is possible git-lfs would perform quite well on the same inputs as elfshaker works on. If git-lfs does already work well for your use case, I'd recommend using that rather than elfshaker, as it is more established.


Thanks for the response! I was more just curious about future possibilities vs immediate practical use.

git-lfs just offloads the storage of the large binaries to a remote site, and then downloads on demand.

If you have a lot of binary assets like artwork or huge excel spreadsheets, it's very useful, because in those cases, without git-lfs, the git repo will get very large, git will get extremely slow, and github will get angry at you for having too large a repo.

But it's not all roses with git-lfs, since now you're reliant on the external network to do checkouts, vs having fetched everything at once w/ the initial clone, and also of course just switching between revisions can get slower since you're network-limited to fetch those large files. (And though I'm not sure, it doesn't seem like git-lfs is doing any local caching.)

So you could imagine where something like having elfshaker embedded in the repo and integrated as a checkout filter could potentially be a useful alternative. Basically an efficient way to store binaries directly in the repo.

(Maybe it would be too small a band of use cases to be practical though? Obviously if you have lots of distinct art assets, that's just going to be big, no matter what...)


Does it do any architecture-specific processing, i.e. BCJ filter? Or is there a generic version of this? The performance seems quite good.


Author here. No architecture specific processing currently. Most of the magic happens in zstandard (hat tip to this amazing project).

Please see our new applicability section which explains the result in a bit more detail:

https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...

In manyclangs (which uses elfshaker for storage) we arrange that the object code has stable addresses when you do insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!


Thanks, great project!

In this comment, you say "20% compression is pretty good". AFAIK, usually "X% compression" means the measure of the reduction in size, not the measure of the remaining. Thus, 0.01% compression sounds almost useless, very different from the 10,000x written next to it.


I'm guessing this does not yield that high compression for release builds, where code can be optimized across translation units? Likewise when a commit changes a header that is included in many cpps?


Author here. The executables shipped in manyclangs are release builds! The catch is that manyclangs stores object files pre-link. Executables are materialized by relinking after they are extracted with elfshaker.

The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).

As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.


Thank you for the explanation! So the pre-link storage is one of the magical ingredients; maybe mention this in the README as well?

Is this the reason why manyclangs (using LLVM's CMake-based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency automatically deduced?


> maybe mention this as well in the README?

We've tweaked the readme, I hope it's clearer.

It would be great to provide this for gcc too. The project is new and we've just started out. I know less about gcc's build system and how hard it will be to apply these techniques there. It seems as though it should be possible though and I'd love to see it happen.

To infer the object->executable dependencies we currently read the compilation database and produce a stand-alone link.sh shell script, which gets packaged into each manyclangs snapshot.
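
Roughly speaking, it amounts to something like this (an illustrative sketch only, not our actual script; it assumes compilation database entries carry either an "output" field or a clang-style "-o" argument):

```python
import json
import shlex


def objects_from_compdb(compdb_path: str) -> list[str]:
    """List the object files named in a compile_commands.json database."""
    with open(compdb_path) as f:
        entries = json.load(f)
    objects = []
    for entry in entries:
        if "output" in entry:
            objects.append(entry["output"])
            continue
        args = entry.get("arguments") or shlex.split(entry["command"])
        for i, arg in enumerate(args):  # fall back to the argument after -o
            if arg == "-o" and i + 1 < len(args):
                objects.append(args[i + 1])
                break
    return objects


if __name__ == "__main__":
    objs = objects_from_compdb("compile_commands.json")
    # Stand-in for the real link command a generated link.sh would run.
    print("clang++ " + " ".join(objs) + " -o clang")
```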


Ah, the compilation database is where more magic originates from :)


Yes, this is less great than I would like! :( :)


Thanks. I had a use case in mind where LTO is enabled. Unfortunately the LTO step is quite expensive so relinking does not seem like a viable option. If I find some time I'll give it a try though.


ThinLTO can be pretty quick if you have enough cores, it might work. Not sure how well the LTO objects compress against each other when you have small changes to them. It might work reasonably.

manyclangs is optimized to provide you with a binary quickly. The binary is not necessarily itself optimized to be fast, because it's expected that a developer might want to access any version of it for the purposes of testing whether some input manifests a bug or has a particular codegen output. In that scenario, it's likely that the developer is able to reduce the size of the input such that the speed of the compiler itself is not terribly significant in the overall runtime. Therefore, I don't see LTO for manyclangs as such a significant win. But it is still hoped that the overall end-to-end runtime is good, and the binaries are optimized, just not with LTO.


> There are many files,

> Most of them don't change very often so there are a lot of duplicate files,

> When they do change, the deltas of the [binaries] are not huge.

We need this but for node_modules


The novel trick here is splitting up huge binary files and treating them as if they were many small files.

node_modules is already tons and tons of files, and when they are large, they are usually minified and hard to split on any "natural" boundary (like ELF sections/symbols etc).


Check out pnpm; it stores each version of a package only once, and sets up your project's node_modules with symbolic links to the exact cached version.


for what reason?


This should be integrated with Cargo to reduce the size of the target directories which are becoming ridiculously large.


Author here. I'm unsure whether this would apply very well to cargo or not. If it has lots of pre-link object files, then maybe.


I'd like to see a version of this built into things like IPFS.

It seems obvious that whenever something is saved into IPFS, there might be a similar object already stored. If there is, go make a diff, and only store the diff.


It should be possible to do this in IPFS already if you use the go-ipfs --chunker option with a content-sensitive chunking algorithm like rabin or buzhash [1]. With this there's a good chance that a file with small changes from something already on IPFS will have some chunks that hash identically, so they'll be shared.

[1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
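
A toy content-defined chunker along these lines looks something like this (illustrative only; go-ipfs's buzhash chunker differs in its table, window and size limits):

```python
import random

WINDOW = 48            # bytes in the rolling-hash window
MASK = (1 << 13) - 1   # cut when the low 13 bits are zero (~8 KiB average chunks)

random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]  # fixed per-byte hash table


def _rotl(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def chunks(data: bytes):
    """Yield chunks whose boundaries depend on content, not absolute offsets."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = _rotl(h, 1) ^ TABLE[byte]                         # byte enters the window
        if i >= start + WINDOW:
            h ^= _rotl(TABLE[data[i - WINDOW]], WINDOW % 32)  # oldest byte leaves
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]
```

Because the cut points depend only on nearby bytes, inserting or deleting data early in a file shifts the following bytes but leaves most later chunk boundaries (and hence their hashes) unchanged.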


But that isn't quite as good as something like this that can 'understand' diffs in files, rather than simply relying on the fact a bunch of bytes in a row might be the same.


I don't think elfshaker actually does do any binary diffing (e.g. xdelta or bsdiff). It works well because it uses pre-link objects which are built to change as little as possible between versions. Then when it compresses similar files together in a pack, Zstandard can recognize the trivial repeats.


Author here. This is correct, we set out to do binary diffing but we soon discovered that if you put similar enough object files together in a stream, and then compress the stream, zstandard does a fantastic job at compressing and decompressing quickly with a high compression ratio. The existing binary diffing tools can produce small patches, but they are relatively expensive both to compute the delta and to apply the patches.


Does it make sense to turn it into a FUSE fs, with transparent deduplication?


Author here. Maybe, it's a fun idea. I have toyed with providing a fuse filesystem for access to a pack but my time for completing this is limited at the moment.


Many packfile-deduplicating backup tools (bup, kopia, borg, restic) can mount the deduplicated storage as FUSE.

It might make sense to check how they do it.

I'd also be interested in how elfshaker compares to those (and `bupstash`, which is written in Rust but doesn't have a FUSE mount yet) in terms of compression and speed.

Did you know of their existence when making elfshaker?

Edit: Question also posted in your Q&A: https://github.com/elfshaker/elfshaker/discussions/58#discus...


(Copying from Q&A) Before starting out some time ago, I did some experiments with bup. I had a good experience with bup and high expectations for it. However, I found that quite a lot of performance was left on the table, so I was motivated to start elfshaker. Unfortunately that time has passed, so I don't have scientific numbers measured against other software for you at this time.

As an idea of how elfshaker performs, we see ~300ms time to create a snapshot for clang, and ~seconds-to-minute to create a binary pack containing thousands of revisions. Extraction takes less than a second. One difference of elfshaker compared with some other software I tested is that we do the compression and decompression in parallel, which can make a very big difference on today's many-core machines.


If I already have, let's say, a 100MB pack file containing (say) 200 builds of clang, and then I import the 201st build into that pack file - is it possible to send across a small delta of this new, updated pack file to someone else who already had the older pack file (with 200 builds), such that they can apply the delta to the old pack and get the new pack containing 201 builds?


In general, how do you handle repos that have a combination of source files and a lot of binary data side by side?


Does this work well with image files? (PNG, JPEG, etc)


Author here, it works particularly well for our presented use case because it has these properties:

* There are many files,

* Most of them don't change very often,

* When they do change, the deltas of the binaries are not huge.

So, if the image files aren't changing very much, then it might work well for you. If the images are changing, their binary deltas would be quite large, so you'd get a compression ratio somewhat equivalent to if you'd concatenated the two revisions of the file and compressed them using ZStandard.
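
If you want to estimate this for your own files, a quick test with the Python `zstandard` bindings looks roughly like this (the file names are just placeholders):

```python
import zstandard  # pip install zstandard

cctx = zstandard.ZstdCompressor(level=19)

# Two revisions of the same asset (placeholder paths).
rev_a = open("image_v1.png", "rb").read()
rev_b = open("image_v2.png", "rb").read()

separate = len(cctx.compress(rev_a)) + len(cctx.compress(rev_b))
together = len(cctx.compress(rev_a + rev_b))

print(f"compressed separately:   {separate} bytes")
print(f"compressed concatenated: {together} bytes")
# If "concatenated" is much smaller than "separately", the revisions share a
# lot of bytes and pack-style storage should do well; for already-compressed
# formats like PNG/JPEG there is usually little to gain.
```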


Please add these points under a use-case heading in your README.


Done, hopefully this is clearer. Please let us know if you see a way to improve it further: https://github.com/elfshaker/elfshaker/pull/60


Ahhh that’s the key insight I have been missing, and that should be higher somewhere.

Thanks


Thanks, seems like that could be a good solution for storing daily backups of a DB. I didn't know I needed it, but it seems like I do.


Author here, this software is young, please don't use it for backups!

But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.


In this age of rampant puffery, it's so... soothing to see somebody be positive and frank about the limits of their creation. Thanks for this and all your comments here!


<3


Borg, bup or restic are relatively popular incremental backup tools that deduplicate with chunking.


Have a look at Borg, it handles incremental backups very well


Why does it depend on the CPU architecture?


(Disclosure: I work for Arm, opinions are my own)

Author here. elfshaker itself does not have a dependency on any architecture to our knowledge. We support the architectures we have use of. Contributions to add missing support are welcome.

manyclangs provides binary pack files for aarch64 because that's what we have immediate use of. If elfshaker and manyclangs prove useful to people, I would love to see resources invested to make them more widely useful.

You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.

[0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...


Would love to see this work with Unity projects.

Right now git lfs takes up so much space when storing files locally.


Seems like the Nix people would be interested in enabling this kind of thing for Nix packages…


Could this be useful for packing Xcode's DerivedData folder for caching in CI builds?


Never shake a baby elf!


Cool! I wonder how this would compare to ZFS deduplication.


An author here. elfshaker uses per-file deduplication. When building manyclangs packs, we observed that the deduplicated content is about 10 GiB in size. After compression with `elfshaker pack`, that comes down to ~100 MiB.

There is also a usability difference: elfshaker stores data in pack files, which are more easily shareable. Each of the pack files released as part of manyclangs is ~100 MiB and contains enough data to materialize ~2,000 builds of clang and LLVM.


I'm surprised nobody mentioned git-annex. It does the same using git for metadata. It's extremely efficient.


AFAIK, git-annex doesn't address sub-file deduplication/compression at all; it just stores a new copy for each new hash it sees? I suppose that content-addressed storage, combined with the pre-link strategy discussed elsewhere for the related manyclangs project, would produce similar, if less spectacular, results?


so how do I do diff and merge (resolve conflicts)?


Does it work on Intel Macs?


Will some of these work for (compressed) variants of audio? They're never the same...


Author here. Compressed data is unlikely to work well in general, unless it never changes.



