I have repos in GitHub Enterprise Server which are WELL OVER 100x what you say the limit is.
There is no technical limit of 4GB per repo on github.com. Maybe there's a private repo size limit if you are not a paying customer, but it is not a technical limit of the platform, I assure you.
There is a 4GB limit on a single push, but you can work around that by pushing a very large repository incrementally. I have done that in the past for a repo with a linear history: push the first N commits, then the next N commits, etc. It would be harder for a repo with many refs or a branchy history; maybe limit each push by commit date?
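A rough sketch of what that looks like, assuming bash, a linear history on a branch called main, and a remote called origin (all of those names are placeholders):

```
# push in batches of ~10000 commits, oldest first, so each push
# stays under the per-push size limit; adjust the batch size as needed
total=$(git rev-list --count main)
for (( pushed = 10000; pushed < total; pushed += 10000 )); do
    # push everything up to the commit $((total - pushed)) steps behind the tip
    git push origin "main~$((total - pushed)):refs/heads/main"
done
git push origin main   # final push for whatever is left
```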
WDYM by 4GB size limit? When I run du -h on my Linux checkout, I get 6.4G (no build artifacts, just git-tracked source code plus the .git directory). Even if I only look at the .git/objects directory, it's 5.2G.
If I were implementing it, I would say: total repo size limit = 4GB, excluding any objects that are part of commits in whitelisted projects.
That makes sense for GitHub, because what they really care about is the hard drive space you use up, and a repo containing a commit they are already storing for some large project takes up no additional disk space.
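A rough sketch of how that could be checked with stock git, assuming the whitelisted project is available locally as a ref (the ref name here is made up): sum the on-disk size of only those objects that are not already reachable from it.

```
# total packed size of objects reachable from our refs but NOT from
# the whitelisted upstream ref, i.e. objects GitHub would have to store just for us
git rev-list --objects --all --not refs/remotes/upstream/main \
  | awk '{ print $1 }' \
  | git cat-file --batch-check='%(objectsize:disk)' \
  | awk '{ total += $1 } END { printf "%.1f MiB of non-shared objects\n", total / 1048576 }'
```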
It would be amazing if GitHub/GitLab provided a backing store for www.dvc.org. I've been using it to great effect, but I have to rely on a separate AWS integration for storing the large objects in S3.
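For reference, that separate S3 setup looks roughly like this (the bucket and file names are placeholders):

```
pip install "dvc[s3]"

dvc init                                      # inside an existing git repo
dvc remote add -d storage s3://my-bucket/dvc-cache
dvc add data/model.bin                        # writes data/model.bin.dvc, a small pointer file
git add data/model.bin.dvc .dvc/config data/.gitignore
git commit -m "Track model.bin with DVC"
dvc push                                      # uploads the actual blob to S3
```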
They do exactly what you're looking for! They provide both a DVC remote to push and pull DVC-tracked objects, as well as a UI showing them integrated alongside git-tracked files in the repository view.
Unlike my sibling, who says it's "pulled out of thin air", I think this is an opportunity to actually do the opposite: ask questions back!
I have no idea how to guess that. I don't know what kind of information Google stores in its "index". What exactly do they mean by "index", even? I need more input. I can ask the interviewer questions and start a discussion!
Of course it's possible that the interviewer really just wanted to be a smart-ass and expected me to "just have an answer" or else fail me. But then I don't want to work there anyway. Having an actual discussion, as if we were already working together and had to solve whatever problem was at hand, is great.
Because these kinds of questions do not tell you anything about the person’s ability as a developer or a leader, nor do they give you any insight into their ability to be innovative. As others have noted, they simply measure the ability to make basic assumptions and then do math, plus bullshit your way through a discussion. Might be perfect for hiring product managers tho. /s ;)
Like my sibling reply says, I think you didn't read past my first word in the reply. The "why". That was rhetorical.
My entire point was basically that it actually isn't about assumptions at all. If you just assume when I ask that question, that tells me you're out as a developer I'd want to hire. Nobody can build proper software by just assuming the first thing that comes to mind and bullshitting.
You have to show that you can take a totally ambiguous, open question and systematically find out as much information as you can, in a reasonable amount of time, to verify your assumptions or at least make them less of a guess. Of course (and this can happen when solving a real problem too) it might be that you cannot accurately measure some input you need and you will have to make assumptions. But you will want to try, and you will want to have divided your problem into enough smaller parts that the part where you still have to assume is only one small piece, and you want to be very explicit that this part is still just an assumption which you will have to verify or improve upon as you actually implement something. You do have to start at some point, otherwise it's analysis paralysis.
And no, this has nothing to do with a PM. They are notoriously bad at the above approach and will instead just assume the first thing that comes to mind and bullshit their way through with it for as long as they don't get caught (no /s there, just my experience with 80+% of PMs).
Fair enough, but I get that signal from asking them an actual system design question, or even a coding problem. It's really a waste, which is better than "awful", I guess. :)
But you don't have to make assumptions! I think this is the important bit: you are supposed to ask questions back until you are confident enough about your answer. You should highlight that you are able to work with customers that want something without knowing what they want.
Agreed that it isn’t ideal, but about “awful” specifically - I’m not too sure. I would never ask such a question but I would assume the intent is just to find out how you think and not to get you to spit out a number. Would it be fun if the interviewer worked together with you to approximate it?
Yup that was exactly the case. A fun discussion that involved estimating sizes of web pages, number of web pages, the effect of compression, the effect of deduplication, etc.
> I wish they had not gone with uint32_t for storing mtimes, since they now have to deal with the 2038 problem, sometime in the future.
Since uint32_t is unsigned, wouldn't it be the Y2106 problem instead?
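Quick sanity check with GNU date (BSD date would use -r instead of -d): the largest uint32_t value, interpreted as a Unix timestamp, lands in early 2106.

```
$ date -u -d @4294967295
Sun Feb  7 06:28:15 UTC 2106
```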
> I am surprised they didn't directly use time_t, so that they wouldn't have to deal with this (since some platforms have already gone to 64 bit time_t)
You mentioned the problem yourself without noticing: some platforms have gone to 64-bit time_t, but others haven't. This is a file format, which can be shared by multiple platforms, so it cannot use types which change size depending on the platform.
That would double the disk space use of that particular file, for something which won't be necessary for more than 80 years (the mtime of a git object or pack file will always be in the past, unless the computer clock is wrong); and when that time arrives, a new file format can be defined (and the old format kept as read-only, with new data being written to the new file).
For on-disk formats, time_t would probably not be a good choice, but indeed, they have a time_t to uint32_t conversion going on that is not even saturating; it just cuts bits off.
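Not git's actual code, just a quick illustration of the difference using shell arithmetic, with a made-up post-2106 timestamp:

```
t=4295000000                                  # a time_t value just past the uint32_t range

# truncating (what a plain cast does): keep only the low 32 bits,
# so the value wraps around to something tiny
echo $(( t & 0xFFFFFFFF ))                    # prints 32704

# saturating: clamp to the largest representable value instead
echo $(( t > 4294967295 ? 4294967295 : t ))   # prints 4294967295
```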
Well, if they use an unsigned 32-bit type they at least extended it to Y2106 :-)
But for this use case it's not really an issue. FTA it sounded like they always write the mtime as now, and it's unlikely they wouldn't GC the repo within 68 years, so wraparound won't become a problem.
Most people should `git clone --depth 1` most of the time. Large businesses should do trunk-based development with something other than git that scales using distributed file systems with phantom checkouts, negating the need for LFS blob stores.
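For example (the URL is a placeholder), a history-free clone that can still be deepened later, though see the replies below about what repeated shallow fetches cost the server:

```
# fetch only the tip commit, no history
git clone --depth 1 https://example.com/some/repo.git
cd repo

# if you later need the full history after all
git fetch --unshallow
```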
I haven't been able to find a link recently, but I recall from a few years ago that that's much harder on GitHub's servers. One of the newer package managers that uses GitHub for hosting (brew maybe?) tried to be helpful and defaulted all their users to shallow clones. GitHub people wanted them to push an update ASAP to go back to full clones, and there was an issue thread laying it all out.
If I recall correctly it has to do with the way they cache git. If you request shallow clones they have to run a git process each time to make a custom download with only the data you need. Full clones get served from a cache.
> If I recall correctly it has to do with the way they cache git.
It was not actually the shallow clone, the shallow clone is fine.
The problem was shallow fetches afterwards, as they made computing the minimum set of changes to fetch much harder for git (during the fetch operation the client and server actually try to negotiate a minimum set of changes, and the server creates an ad-hoc pack for this).
Not only that, but they'd hit a git edge case which ended up causing disproportionate CPU usage and converting the shallow clones to near-full clones but very, very inefficiently.
This issue was compounded by the very git-unfriendly layout of the repository: one of the directories had >16000 subdirectories, something Git's tree-processing code apparently was not much tested for, leading to significant inefficiencies.
Not that you should do anything you're not comfortable with, but I will note that the source control systems for most major companies are essentially known, especially around here. "At Facebook we use a Mercurial-inspired thing with some local extensions" is not really earth-shattering.
This is just the Nest, not Pijul. I'm responsible for that, and sorry about it. I said I'd fix it soon; at the moment I'm working on an app to fix (a small part of) France's electricity crisis. It's mostly done, but it took me a while to write something that would be able to scale to loads of users before the winter.
I'm just following Rustacean traditions of fixing things from the root.
When you have to write a web browser, you do have to fix C++ before you can actually start. I'm digging a bit deeper by first helping avoid blackouts so that my French servers keep running (the Nest has two other locations); then I'll fix the server code.
I'm interested in Pijul, and the theory I've read about the patch model makes sense, but I'm still looking for an answer to my standard question on Pijul:
"What real-world situation does git barf at that Pijul would handle better?" I'm sure there is one, but I have yet to see it.
All of the following is described in the Pijul manual:
- Cherry-picking in Pijul actually works.
- Conflicts: you don't need rerere, and conflicts don't come back once you've solved them. Conflicts are the most confusing situations, and that is where you need a good tool the most.
- So-called "bad merges", where a merge or a rebase goes completely wrong and shuffles your lines around. Git users tend to just call the lack of associativity a "bad merge" and don't look further, but 3-way merge is the wrong way to merge things, because there isn't enough information to merge correctly in all cases.
- Depending on your pace of work, you might find yourself working on different things at the same time. Pijul allows you to do that without worrying too much about how you'll eventually organise your work. If I worked with Git, for example, I'd spend a lot of time organising my branches, they'd never be right, and I'd spend a lot of time rebasing afterwards. Pijul frees me from that work.
- Large files, although Git handles them poorly for historical reasons, not for reasons inherent to its design (unlike the other points).
Pijul handles all merges predictably, by design, because it doesn't actually merge anything; it just applies changes to a CRDT, and CRDTs work.
CRDTs aren't common in real-world distributed applications because they are hard to design. But when the problem is important enough (like in this case) I think the approach is worth it.
One way for Git to do what you said would be to import the repos into Pijul, commit by commit since the last common ancestor, instead of doing a 3-way merge. Technically feasible, but the performance would be terrible.
So, what finally made me quit: I was working on two machines and had a lot of conflicts after they diverged. At some point magit crashed (my fault). Trying to merge them was very frustrating.
Pijul doesn't scale to big repositories (I tried to convert the oldest commits from hg.mozilla.org/mozilla-central, it didn't like it, and that was a tiny fraction of the repo...)
That actually seems kinda small.
Git’s lack of good support for large files means there’s probably an exabyte of data that, imho, should be in source control but isn't.