Best way to do Linux clones for your CI (kernel.org)
96 points by ScottWRobinson on July 26, 2018 | 11 comments



Obviously not applicable to Linux since it is hosted on Git, but Mercurial has this "clone bundles" feature built in.

If you run, e.g., `hg clone https://hg.mozilla.org/mozilla-central` to clone the main Firefox repo, your client connects to a CDN to download the pre-generated bundle, then goes back to the server to pull recent changes. Bitbucket also has the feature deployed.

After I deployed this feature on hg.mozilla.org, server-side CPU load dropped by roughly 90%. IP filtering detects clients connecting from AWS, which lets us serve them URLs pointing directly at S3 (as opposed to going through a CDN), so the total bill is about $20/mo for CI usage (S3 intra-region data transfer is free). Literally thousands of hours of CPU-core time and >500 TB/mo are offloaded from the Mercurial servers.
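A minimal sketch of that IP-filtering idea, not the actual hg.mozilla.org code: the CIDR blocks are documentation-only placeholder ranges, and both URLs and the function name `bundle_url_for` are made up for illustration.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges) standing in for
# a cloud provider's published address ranges. These are assumptions.
CLOUD_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("198.51.100.0/24", "203.0.113.0/24")
]

# Hypothetical URLs; neither is a real endpoint.
BUCKET_URL = "https://bundles.s3.example.com/repo/latest.bundle"
CDN_URL = "https://cdn.example.com/repo/latest.bundle"


def bundle_url_for(client_ip: str) -> str:
    """Pick the bundle URL to advertise to a client based on its address."""
    addr = ipaddress.ip_address(client_ip)
    # Clients inside the cloud ranges get the direct bucket URL;
    # everyone else is pointed at the CDN.
    if any(addr in net for net in CLOUD_RANGES):
        return BUCKET_URL
    return CDN_URL
```

A real deployment would presumably load the provider's published ranges and group them by region, so that only same-region clients get the bucket URL.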


(Assuming you’re familiar with Git internals as well:) do you think there’s anything in the architecture of Git preventing it from having this feature? And, if not, should we Git developers push for it / work on a PR?

It’d make a lot of sense to integrate it if it’s possible, I think. The design direction of Git is somewhat subservient to the needs of Linux (i.e. Git is a thing Linus made to replicate BitKeeper, but it also was made to scratch—and continues to scratch—many of the LKML itches re: their unique patch-management workflows.)

So if Linux is having to do something re: “SCM ops” tooling that could be better solved in Git, then why not solve it in Git?

(If anyone who was responsible for this Kernel.org Git bundle setup wants to chime in, that’d be interesting; I assume they likely considered making this a Git thing first, so there might be a good reason why it’s not.)


Yes, builtin CDN offload is actively under discussion as part of v2 protocol changes, but there is nothing currently available in released versions. The "repo" command that comes with Android dev tools uses the clone.bundle approach (which is basically what we're implementing on git.kernel.org for a handful of core repositories). I would guess "repo" is partly why nobody added anything like this directly to git -- for Google, that particular itch has been scratched.

Anyway, here's hoping that the coming protocol changes will have a native solution to this problem.


The implementation side of git already has this feature. The "bundle" file in the linked article is fundamentally the same thing. What git lacks is the automatic integration; you have to pull those bundles manually instead of having the client know how to get them automatically.
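The manual steps can be sketched roughly as below, assuming the bundle has already been downloaded to a local file. The helper names, paths, and URL are illustrative; the git subcommands themselves (`clone` from a bundle file, `remote set-url`, `fetch`) are standard.

```python
import subprocess


def bundle_clone_commands(bundle_path, repo_url, dest):
    """Return the git invocations for a bundle-seeded clone, in order."""
    return [
        # 1. Clone from the local bundle file as if it were a remote.
        ["git", "clone", bundle_path, dest],
        # 2. Repoint "origin" from the bundle file to the real server.
        ["git", "-C", dest, "remote", "set-url", "origin", repo_url],
        # 3. Fetch only the commits made since the bundle was generated.
        ["git", "-C", dest, "fetch", "origin"],
    ]


def run(commands):
    """Execute each command, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run(bundle_clone_commands("clone.bundle", "https://example.com/repo.git", "repo"))` would perform the three steps; the first one works only if the bundle contains the ref that HEAD should point at.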


Great idea. The recent work to open up the git protocol makes it easier to add in this kind of smart behaviour: advertise a 'bundle' command that returns a URI to fetch the bundle from.
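To make that concrete, here is a rough sketch of how such a request could be framed on the wire. The pkt-line framing (four hex digits of length, counting themselves, then the payload, with `0000` as the flush packet) is how the git protocol works; the `bundle` command name is purely hypothetical and not something released git versions understand.

```python
FLUSH_PKT = b"0000"
MAX_PKT_LEN = 65520  # maximum total pkt-line length, including the 4-byte prefix


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as one pkt-line: 4 hex length digits, then the payload."""
    length = len(payload) + 4  # the length prefix counts itself
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def command_request(command: str) -> bytes:
    """Build a minimal protocol-v2-style request with no arguments."""
    # "command=<name>" followed by a flush packet. The command name passed
    # in below ("bundle") is a hypothetical example, not a real git command.
    return pkt_line(f"command={command}\n".encode()) + FLUSH_PKT


request = command_request("bundle")
```

The server's reply would then carry the bundle URI in pkt-lines of its own; what that reply looks like is left open here, since nothing in the thread specifies it.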


I think Google has something like that for the Android repositories. They are always synced through Git and 70 GiB+ in size.


yeah, the "repo" command they use has support for getting bundles, and then pulling the newer commits from the gerrit server.


Git bundles are pretty nice: they are just a packfile [1] with a list of refs prepended, and you can clone from one like a remote.
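As a rough illustration of that layout, the sketch below reads the header of a v2 bundle: a signature line, optional prerequisite lines starting with `-`, one `<object id> <refname>` line per ref, a blank line, and then the pack data. The function name and return shape are my own choices, and it handles only the v2 format.

```python
def parse_bundle_header(data: bytes):
    """Return (prerequisites, refs, pack_offset) for a v2 git bundle."""
    signature = b"# v2 git bundle\n"
    if not data.startswith(signature):
        raise ValueError("not a v2 git bundle")

    pos = len(signature)
    prerequisites = []  # object ids the receiver must already have
    refs = {}           # refname -> object id

    while True:
        end = data.index(b"\n", pos)  # raises ValueError if truncated
        line = data[pos:end].decode()
        pos = end + 1
        if not line:
            break  # the blank line ends the header; pack data follows
        if line.startswith("-"):
            # Prerequisite line: "-<object id> <optional comment>"
            prerequisites.append(line[1:].split(" ", 1)[0])
        else:
            oid, name = line.split(" ", 1)
            refs[name] = oid

    return prerequisites, refs, pos
```

Everything from `pack_offset` onward is an ordinary packfile, which begins with the four bytes `PACK`.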

But this also means they have pretty inefficient compression, since every object (or delta) is packed using zlib on its own. For packfiles this is good because it allows random access, but for bundles it would be better to disable this compression (set it to level 0) and compress the whole bundle instead:

    435M default.bundle
    412M default.bundle.bz2
    410M default.bundle.gz
    401M default.bundle.xz

    978M uncompressed.bundle
    290M uncompressed.bundle.bz2
    330M uncompressed.bundle.gz
    245M uncompressed.bundle.xz

That's 245 MB vs. 401 MB, roughly 39% space savings! The .xz file can't be cloned directly, though; it has to be decompressed first.

Produced using `git bundle create fname HEAD` on linux v3.0 after:

    git repack -a -d -F --depth=100 --window=100                          (default compression level)
    git -c pack.compression=0 repack -a -d -F --depth=100 --window=100    (no zlib compression)
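The effect is easy to reproduce on synthetic data. This is a toy sketch, not a model of real packfiles (it ignores deltas, and the "objects" are made-up strings): it compares zlib-compressing many small, similar objects one by one against compressing them as a single stream, and recomputes the savings figure from the sizes above.

```python
import zlib


def per_object_size(objects, level=6):
    """Total size when every object is zlib-compressed on its own."""
    return sum(len(zlib.compress(obj, level)) for obj in objects)


def whole_stream_size(objects, level=6):
    """Size when all objects are concatenated and compressed once."""
    return len(zlib.compress(b"".join(objects), level))


def savings(before, after):
    """Fraction of space saved going from `before` to `after`."""
    return (before - after) / before


# Synthetic stand-ins for small, similar objects (an assumption, not real data).
objects = [
    b"int handler_%d(void) { return dispatch(%d); }\n" % (i, i)
    for i in range(1000)
]

separate = per_object_size(objects)
together = whole_stream_size(objects)
xz_savings = savings(401, 245)  # the two .xz sizes from the table, in MB
```

Compressing each object separately pays the stream overhead every time and never sees redundancy across objects, which is why `together` comes out much smaller than `separate` here.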

[1] https://github.com/git/git/blob/master/Documentation/technic...


I think CI means "cluster infrastructure" but the article refers to a "CI infrastructure" so I'm not quite sure.


Continuous Integration


Ah thank you.



