I have repos in GitHub Enterprise Server which are WELL OVER 100x what you say the limit is.
There is no technical limit of 4GB per repo on github.com. Maybe there's a private repo size limit if you are not a paying customer, but it is not a technical limit of the platform, I assure you.
There is a 4GB limit on a single push, but you can work around that by pushing a very large repository incrementally. I have done that in the past for a repo with a linear history: push the first N commits, then the next N commits, etc. It would be harder for a repo with many refs or a branchy history; maybe limit each push by commit date?
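A rough sketch of what that looks like, assuming bash, a linear history on a branch called main, and a remote called origin (all of those names are placeholders):

```
# push in batches of ~10000 commits, oldest first, so each push
# stays under the per-push size limit; adjust the batch size as needed
total=$(git rev-list --count main)
for (( pushed = 10000; pushed < total; pushed += 10000 )); do
    # push everything up to the commit $((total - pushed)) steps behind the tip
    git push origin "main~$((total - pushed)):refs/heads/main"
done
git push origin main   # final push for whatever is left
```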
WDYM by 4GB size limit? When I run du -h on my Linux checkout, I get 6.4G (no build artifacts, just git-tracked source code plus the .git directory). Even if I only look at the .git/objects directory, it's 5.2G.
If I were implementing it, I would say: total repo size limit = 4GB, excluding any objects that are part of commits in whitelisted projects.
That makes sense for GitHub, because what they really care about is the hard drive space you use up, and a repo containing a commit they are already storing for some large project takes up no additional disk space.
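A rough sketch of how that could be checked with stock git, assuming the whitelisted project is available locally as a ref (the ref name here is made up): sum the on-disk size of only those objects that are not already reachable from it.

```
# total packed size of objects reachable from our refs but NOT from
# the whitelisted upstream ref, i.e. objects GitHub would have to store just for us
git rev-list --objects --all --not refs/remotes/upstream/main \
  | awk '{ print $1 }' \
  | git cat-file --batch-check='%(objectsize:disk)' \
  | awk '{ total += $1 } END { printf "%.1f MiB of non-shared objects\n", total / 1048576 }'
```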
It would be amazing if GitHub/GitLab provided a backing store for www.dvc.org. I've been using it to great effect, but I have to rely on a separate AWS integration for storing the large objects in S3.
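For reference, that separate S3 setup looks roughly like this (the bucket and file names are placeholders):

```
pip install "dvc[s3]"

dvc init                                      # inside an existing git repo
dvc remote add -d storage s3://my-bucket/dvc-cache
dvc add data/model.bin                        # writes data/model.bin.dvc, a small pointer file
git add data/model.bin.dvc .dvc/config data/.gitignore
git commit -m "Track model.bin with DVC"
dvc push                                      # uploads the actual blob to S3
```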
They do exactly what you're looking for! They provide both a DVC remote to push and pull DVC-tracked objects, as well as a UI showing them integrated alongside git-tracked files in the repository view.
Unlike my sibling, who says it's "pulled out of thin air", I think this is an opportunity to actually do the opposite: ask questions back!
I have no idea how to guess that. I don't know what kind of information Google stores in its "index". What exactly do they mean by "index", even? I need more input. I can ask the interviewer questions and start a discussion!
Of course it's possible that the interviewer really just wanted to be a smart-ass and expected me to "just have an answer" or else fail me. But then I don't want to work there anyway. Having an actual discussion, as if we were already working together and had to solve whatever problem was at hand, is great.
Because these kinds of questions do not tell you anything about the person’s ability as a developer or a leader, nor do they give you any insight into their ability to be innovative. As others have noted, they simply measure the ability to make basic assumptions and then do math, plus bullshit your way through a discussion. Might be perfect for hiring product managers tho. /s ;)
Like my sibling reply says, I think you didn't read past my first word in the reply. The "why". That was rhetorical.
My entire point was basically that it actually isn't about assumptions at all. If you just assume when I ask that question, that tells me you're out as a developer I'd want to hire. Nobody can build proper software by just assuming the first thing that comes to mind and bullshitting.
You have to show that you can take a totally ambiguous, open question and systematically find out as much information as you can, in a reasonable amount of time, to verify your assumptions or at least make them less of a guess. Of course (and this can happen when solving a real problem too) it might be that you cannot accurately measure some input you need and you will have to make assumptions. But you will want to try, and you will want to have divided your problem into enough smaller parts that the part where you still have to assume is only one small piece, and you want to be very explicit that this part is still just an assumption which you will have to verify or improve upon as you actually implement something. You do have to start at some point, otherwise it's analysis paralysis.
And no, this has nothing to do with a PM. They are notoriously bad at the above approach and will instead just assume the first thing that comes to mind and bullshit their way through with it for as long as they don't get caught (no /s there, just my experience with 80+% of PMs).
Fair enough, but I get that signal from asking them an actual system design question, or even a coding problem. It's really a waste, which is better than "awful", I guess. :)
But you don't have to make assumptions! I think this is the important bit: you are supposed to ask questions back until you are confident enough about your answer. You should highlight that you are able to work with customers that want something without knowing what they want.
Agreed that it isn’t ideal, but about “awful” specifically - I’m not too sure. I would never ask such a question but I would assume the intent is just to find out how you think and not to get you to spit out a number. Would it be fun if the interviewer worked together with you to approximate it?
Yup that was exactly the case. A fun discussion that involved estimating sizes of web pages, number of web pages, the effect of compression, the effect of deduplication, etc.
> I wish they had not gone with uint32_t for storing mtimes, since they now have to deal with the 2038 problem, sometime in the future.
Since uint32_t is unsigned, wouldn't it be the Y2106 problem instead?
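Quick sanity check with GNU date (BSD date would use -r instead of -d): the largest uint32_t value, interpreted as a Unix timestamp, lands in early 2106.

```
$ date -u -d @4294967295
Sun Feb  7 06:28:15 UTC 2106
```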
> I am surprised they didn't directly use time_t, so that they wouldn't have to deal with this (since some platforms have already gone to 64 bit time_t)
You mentioned the problem yourself without noticing: some platforms have gone to 64-bit time_t, but others haven't. This is a file format, which can be shared by multiple platforms, so it cannot use types which change size depending on the platform.
That would double the disk space use of that particular file, for something which won't be necessary for more than 80 years (the mtime of a git object or pack file will always be in the past, unless the computer clock is wrong); and when that time arrives, a new file format can be defined (and the old format kept as read-only, with new data being written to the new file).
For on-disk formats, time_t would probably not be a good choice, but indeed, they have a time_t to uint32_t conversion going on that is not even saturating; it just cuts bits off.
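Not git's actual code, just a quick illustration of the difference using shell arithmetic, with a made-up post-2106 timestamp:

```
t=4295000000                                  # a time_t value just past the uint32_t range

# truncating (what a plain cast does): keep only the low 32 bits,
# so the value wraps around to something tiny
echo $(( t & 0xFFFFFFFF ))                    # prints 32704

# saturating: clamp to the largest representable value instead
echo $(( t > 4294967295 ? 4294967295 : t ))   # prints 4294967295
```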
Well, if they use an unsigned 32-bit type they at least extended it to Y2106 :-)
But for this use case it's not really an issue. FTA it sounded like they always write the mtime as now, and it's unlikely they wouldn't GC the repo within 68 years, so wraparound won't become a problem.
Most people should `git clone --depth 1` most of the time. Large businesses should do trunk-based development with something other than git that scales using distributed file systems with phantom checkouts, negating the need for LFS blob stores.
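For example (the URL is a placeholder), a history-free clone that can still be deepened later, though see the replies below about what repeated shallow fetches cost the server:

```
# fetch only the tip commit, no history
git clone --depth 1 https://example.com/some/repo.git
cd repo

# if you later need the full history after all
git fetch --unshallow
```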
I haven't been able to find a link recently, but I recall from a few years ago that that's much harder on GitHub's servers. One of the newer package managers that uses GitHub for hosting (brew maybe?) tried to be helpful and defaulted all their users to shallow clones. GitHub people wanted them to push an update ASAP to go back to full clones, and there was an issue thread laying it all out.
If I recall correctly it has to do with the way they cache git. If you request shallow clones they have to run a git process each time to make a custom download with only the data you need. Full clones get served from a cache.
> If I recall correctly it has to do with the way they cache git.
It was not actually the shallow clone, the shallow clone is fine.
The problem was shallow fetches afterwards, as they made computing the minimum set of changes to fetch much harder for git (during the fetch operation the client and server actually try to negotiate a minimum set of changes, and the server creates an ad-hoc pack for this).
Not only that, but they'd hit a git edge case which ended up causing disproportionate CPU usage and converting the shallow clones to near-full clones but very, very inefficiently.
This issue was compounded by the very git-unfriendly layout of the repository: one of the directories had >16000 subdirectories, something Git's tree-processing code apparently was not much tested for, leading to significant inefficiencies.
Not that you should do anything you're not comfortable with, but I will note that the source control systems for most major companies are essentially known, especially around here. "At Facebook we use a Mercurial-inspired thing with some local extensions" is not really earth-shattering.
This is just the Nest, not Pijul. I'm responsible for that, and sorry about it. I said I'd fix it soon; at the moment I'm working on an app to fix (a small part of) France's electricity crisis. It's mostly done, but it took me a while to write something that would be able to scale to loads of users before the winter.
I'm just following Rustacean traditions of fixing things from the root.
When you have to write a web browser, you do have to fix C++ before you can actually start. I'm digging a bit deeper by first helping avoid blackouts so that my French servers keep running (the Nest has two other locations); then I'll fix the server code.
I'm interested in Pijul, and the theory I've read about the patch model makes sense, but I'm still looking for an answer to my standard question on Pijul:
"What real-world situation does git barf at that Pijul would handle better?" I'm sure there is one, but I have yet to see it.
All of the following is described in the Pijul manual:
- Cherry-picking in Pijul actually works.
- Conflicts: you don't need rerere, and conflicts don't come back once you've solved them. Conflicts are the most confusing situations, and that is where you need a good tool the most.
- So-called "bad merges", where a merge or a rebase goes completely wrong and shuffles your lines around. Git users tend to just call the lack of associativity a "bad merge" and don't look further, but 3-way merge is the wrong way to merge things, because there isn't enough information to merge correctly in all cases.
- Depending on your pace of work, you might find yourself working on different things at the same time. Pijul allows you to do that without worrying too much about how you'll eventually organise your work. If I worked with Git, for example, I'd spend a lot of time organising my branches, they'd never be right, and I'd spend a lot of time rebasing afterwards. Pijul frees me from that work.
- Large files, although Git handles them poorly for historical reasons, not for reasons inherent to its design (unlike the other points).
Pijul handles all merges predictably, by design, because it doesn't actually merge anything; it just applies changes to a CRDT, and CRDTs work.
CRDTs aren't common in real-world distributed applications because they are hard to design. But when the problem is important enough (like in this case) I think the approach is worth it.
One way for Git to do what you said would be to import the repos into Pijul, commit by commit since the last common ancestor, instead of doing a 3-way merge. Technically feasible, but the performance would be terrible.
So, what finally made me quit: I was working on two machines and had a lot of conflicts after they diverged. At some point magit crashed (my fault). Trying to merge them was very frustrating.
Pijul doesn't scale to big repositories (I tried to convert the oldest commits from hg.mozilla.org/mozilla-central, it didn't like it, and that was a tiny fraction of the repo...)
That actually seems kinda small.
Git’s lack of good support for large files means there’s probably an exabyte of data that, imho, should be in source control but isn't.