Bring your monorepo down to size with sparse-checkout (github.blog)
137 points by Amorymeltzer on Jan 17, 2020 | 64 comments



This reminds me of VFS for Git, Microsoft’s solution for scaling Git for the Windows code base. [1] [2] [3]

[1]: https://github.com/microsoft/VFSForGit

[2]: https://news.ycombinator.com/item?id=14411126

[3]: https://devblogs.microsoft.com/bharry/the-largest-git-repo-o...


This sounds like a parallel effort (along with the commit-graph work) to keep pushing that scaling work further. The article's author (who also wrote most of the blog posts on the commit-graph work) mentions a "three million file repository" used for testing, and that would of course sound like the Windows repo.

It's also, I'd imagine, not a mutually exclusive effort. It seems like exactly the sort of thing you would want in combination with something like VFS at scale, as it reduces the number of materialized versus virtual objects in both the git working copy and the git object database. If you've got millions or billions of objects and files, even reducing the number of virtual placeholders is probably a big win.


The author of the blog post is on the git team at Microsoft.


I'm surprised that VFS for Git isn't yet available on GitHub. Surely they are working on adding it? Anyone have an inside scoop?


Two reasons: VFS is a fork of normal Git, and there is no Linux client.

Also remember that there are many Git clients that work with normal Git repos, like libgit2 and others. I doubt you'll see widespread support for it unless MS can upstream it into the main Git implementation, and maybe into some of the primary libraries.

This is one nice argument for Mercurial, where there is only a single implementation, so adding big new changes can be easier.


The VFS for Git tool only works on Windows and is written in C#; it also relies on a fork of Git.


I have always done --depth=1 for projects I am not a core developer of, but ran into an issue: it seems impossible to do the same with submodules.

Golang should have figured this out from day 1 before shipping with a release system built around cloning a repo in its entirety, history and recursive submodules, and all.


--depth=1 is an old feature, but it only limits the clone along the history dimension: you still have to store the entire state of the tree as of the last commit. This feature is about cloning only parts of the tree.
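Roughly, the contrast looks like this (the URL and directory names are placeholders):

    # shallow clone: whole tree, truncated history
    git clone --depth=1 https://example.com/big-repo.git

    # sparse checkout (git 2.25+): full history, only part of the tree materialized
    git clone --sparse https://example.com/big-repo.git
    cd big-repo
    git sparse-checkout init --cone
    git sparse-checkout set services/web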


My point is that even that basic functionality (a) did not extend to all of git's core functionality, and (b) went unused by major players in the industry who proclaim themselves responsible for "optimizing" the internet.


Have you tried the --shallow-submodules flag? I haven't used it, but it seems like it should do what you want.
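Something like this, untested (the URL is a placeholder):

    # shallow clone plus shallow submodules (untested sketch)
    git clone --depth=1 --recurse-submodules --shallow-submodules \
        https://example.com/project.git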


FWIW Go modules address this pretty well.


I wish somebody'd write a book on monorepos. I've run into only a handful of their problems when trying to manage production pipelines using just a dozen services, so I'm sure there are tons more (like the purpose behind this command). Nobody mentions the massive investment in time, technical expertise, compute resources, and money required to run large monorepos in production.

Also, would emulating this command with a repo of submodules not work?


What problems did you encounter with just a few services?

Monorepos should be straightforward unless you are managing the code of >1k engineers.


We’ve run into some nontrivial but totally solvable issues at about 100-200 engineers.

IME, most consternation comes from people adopting a mono repo without adopting a build/dependency graph tool (like Bazel, Buck, or Pants).

An additional source of strain is from people abusing the repo (checking in large binaries, third party dependencies, etc).

A third is when people try to do branch-based feature development, instead of the “correct” practice of only deploying master (or weekly cuts of master).

I think even a simple list of these sort of “gotchas” would be valuable for the aspirational mono repo company.

My impression is that a lot of teams hit these early and painful roadblocks, and imagine that they’ll never go away (they do!!).


Checking in third-party dependencies is not always abuse. It can be a useful habit for certain kinds of reproducible builds. The Buck documentation even endorses keeping your dependencies in your monorepo along with your own sources.


I understand the reasoning, and agree that it's not always abuse. At first blush it's a good idea, but I'd maintain that it's one of the things that balloons your repo size quite quickly. Plus, one has to draw a line somewhere on what to include (a Python interpreter? A Go version? awk and grep?), and third party vs in-house is a fairly robust one imo.

We host a private mirror for third party dependencies, so that “pip install”/“go get” fail on our CI system if the dependency isn’t hosted by us. This gives us reproducible builds, while allowing us to hold 3rd party libraries to a higher standard of entry than source code. For certain libraries we pin version numbers in our build system, but in general it allows us to update dependencies transparently. It also keeps our source repo size small, for developers, and allows for conflicting versions (example Kafka X.Y and X.Z) without cluttering the repo with duplicates.
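Not our exact commands, but roughly (the hostnames and file names are made up):

    # CI sketch: force pip through our mirror; nothing else is reachable
    pip install --index-url https://pypi.internal.example.com/simple -r requirements.txt

    # Go equivalent: only allow modules that our internal proxy serves
    GOPROXY=https://goproxy.internal.example.com go build ./...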

It’s definitely a smaller gotcha than the others I listed, maybe to the point where it’s not a gotcha, but I stand by it :)


If you can do that with 3rd party dependencies, can't you do that with all the code?

This is what confuses me about monorepos. Their design requires an array of confusing processes and complex software to make the process of merging, testing, and releasing code manageable at scale (and "scale" can even be 6 developers working on 2 separate features each across 10 services, in one repo).

But it turns out that you can also develop individual components, version their releases, link their dependencies, and still have a usable system. That's literally how all Linux distros have worked for decades, and how most other language-specific packaging systems work. None of which requires a monorepo.

So what I'd like to know is, of the 3 actual reasons I've heard companies claim are why they need a monorepo, is it impossible to do these things with multirepo? If it is indeed "hard" to do, is it "so hard" that it justifies all the complexity inherent to the monorepo? Or is it really just a meme? And are these things even necessary at all, if other systems seem to get away without it?


These are great questions!! :)

> Can you treat all code like 3rd party dependencies?

Yes, but there are trade-offs. Discoverability, enforcing hard deadlines on global changes, style consistency, etc.

> Is it impossible to do these things with multi-repo?

No, but there are trade-offs to consider.

> If it's hard, is it "so hard" that it justifies the complexity?

Hitting the nail on the head; there are trade-offs :)

> Are these things necessary, if other systems get away without it?

There are many stable equilibria; the open source ecosystem evolved one solution and large companies evolved another, because they have been subject to very different constraints. The organization of open source projects is extremely different from the organization of 100+ engineer companies, even if the contributor headcounts are similar.

For me, the semantic distinction between monorepos and multirepos is the same as the distinction between internal and 3rd party dependencies. Does your team want to treat other teams as a 3rd party dependency? The correct answer depends on company culture, etc. It's a set of tradeoffs, including transparency over privacy, consistency over freedom, collaboration over compartmentalization.

With monorepos, you can gain a little privacy, freedom, and compartmentalization by being clever, but get the rest for cheap; vice versa for multirepos. It's trading one set of problems for another. I'd challenge the base assumption that multirepos are "simpler"; they're just more tolerant of chaos, in a way that's very valuable for the open source community.

I hope we've not been talking past each other; I really like the ideas you're raising! :)


I don't think we're talking past each other, and thank you for your responses.

> Does your team want to treat other teams as a 3rd party dependency?

From what I recall, 'true' microservices are supposed to operate totally independently of each other, so one team's microservice really is a 3rd party dependency of another team's (if one depends on the other). OTOH, monolithic services would require much tighter integration between teams. But there's also architecture like SOA that sort of sits in the middle.

To my mind, if the repo structure mimics the communication and workflow of the people writing the code, it feels like the tradeoffs might fit better. But I'd need to make a matrix of all the things (repos, architectures, SDLCs, tradeoffs, etc) and see some white papers to actually know. If someone feels like writing that book, I'd read it!


> This is what confuses me about monorepos. Their design requires an array of confusing processes and complex software to make the process of merging, testing, and releasing code manageable at scale (and "scale" can even be 6 developers working on 2 separate features each across 10 services, in one repo).

False. It is having multiple repos that creates those problems and a huge graph of versions and dependencies.

What "processes" are you talking about?


> It is having multiple repos that creates those problems and a huge graph of versions and dependencies.

Bazel, the open source version of Google's internal build tool, is built specifically to handle "build dependencies in complex build graphs". With monorepos. If it didn't do that, you'd never know what to test, what to deploy, what service depends on what other thing, etc. Versions and dependencies are inherent to any collection of independently changing "things".

Even if you build every service you have every time you commit a single line of code to any service, and run every test for every service any time you change a single line of code, the end result of all those newly-built services is still a new version. A change in that line of code still reflects the service it belongs to, and so thinking about "this change to this service" involves things like "other changes to other services", and so you need to be able to refer to one change when you talk about a different change. But they are different changes, with different implications. You may need to go back to a previous "version" of a line of code for one service, so it doesn't negatively impact another "version" of a different line of code in a different service. Every line of code, compared to every other line of code, is a unique version, and you have to track them somehow. You can use commit hashes or you can use semantic versions, it doesn't matter.

So because versions and dependencies are inherent to any collection of code, regardless of whether it's monorepo or multirepo, I don't buy this "it's easier to handle versions/dependencies" claim. In practice it doesn't seem to matter at all.

> What "processes" are you talking about?

Developer A and developer B are working on changes A1 and B1. Both are in review. Change A1 is merged. Now B1 needs to merge A1: it becomes B1.1. Fixing conflicts, running tests, and fixing anything changed finally results in B1.2, which goes into review. Now A develops and merges A2, so B1.2 goes through it all over again to become B1.4.

You can do all of that manually, but it's time-consuming, and the more people and services involved, the more time it takes to manage it all. So you add automated processes to try to speed up as much of it as you can: automatically merging the mainline into any open PRs and running tests, and doing this potentially with a dozen different merged items at once. Hence tools like Bazel, Zuul, etc. So, those processes.


You are conflating language/build issues with VCS issues.

Everything you discuss also applies to multirepo, but worse, because there no one enforces consistency across the whole project and you will end up with broken interdependencies.


> Plus, one has to draw a line somewhere on what to include (a Python interpreter? A Go version? awk and grep?), and third party vs in-house is a fairly robust one imo.

If your code/project/company uses the dependency in any way in production and it is not a part of the base system (which should be reproducibly installed), you include it; either in source or binary form.

Why is the size a problem? Developers should only be checking out once. If your repo hits the many-GiB mark, then you can apply more complex solutions (LFS, sparse, etc.) if it is a burden.


It's a problem if the first step of your build system is a fresh `git pull` :)

Not unsolvable of course, just necessitates an extra layer of complexity.


> IME, most consternation comes from people adopting a mono repo without adopting a build/dependency graph tool (like Bazel, buck or pants).

That seems like a build problem, not a Git problem.

> An additional source of strain is from people abusing the repo (checking in large binaries, third party dependencies, etc).

That is not necessarily abuse. In fact, it is a good practice in many cases!

> A third is when people try to do branch-based feature development, instead of the “correct” practice of only deploying master (or weekly cuts of master).

I am not sure what you mean by branch-based development, but I don't see why that would be a specific problem of monorepos.


How are they straightforward? Like rebuilding a car's engine is straightforward? If you know how they're built, it's easy...


What? I don't understand what that means.

A monorepo is just 1 repo. There is nothing more straightforward than that.



> the core requirement of Continuous Integration that all team members commit to trunk at least once every 24 hours

It sounds good except for this part.


Monorepo + sparse-checkout looks a bit like a distributed subversion!


Did you ever try out SVK? It was based on svn's libraries and provided for a more disconnected workflow.

Most of my code (dating from before git was around) for various projects is in a single subversion tree and I check it out using git or mercurial to provide for local version control.

Features like sparse checkout are definitely welcome in git since the industry seems to have standardized on it.


Not really, because commits don't go across the entire SVN, which is what makes monorepos so powerful.


What do you mean? When you commit to svn the whole repository goes up in version number.


You are right, I was thinking of CVS.

In any case, with SVN you usually do not want to give write perms to everyone in all the tree, so you end up with effectively partitioned spaces, or you make several repos instead, or you put another layer on top. With Git, anyone can easily develop global commits.


sparse-checkout, partial-clone, and shallow seem like decent building blocks to make working with very large repos tractable in git. At the same time, the features and their interaction are pretty complicated, so I believe we'll need good "porcelain" abstractions over these building blocks to make the workflow reasonable for average users.
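For illustration, a rough sketch of what such a porcelain wrapper could look like ("sparse-clone" here is hypothetical, not an existing git command):

    #!/bin/sh
    # sparse-clone.sh <repo-url> <dir> <path>...   (hypothetical wrapper)
    # Combines partial clone (no blobs up front), shallow history, and
    # cone-mode sparse checkout so only the requested paths materialize.
    set -e
    url=$1; dir=$2; shift 2
    git clone --filter=blob:none --depth=1 --sparse "$url" "$dir"
    cd "$dir"
    git sparse-checkout init --cone
    git sparse-checkout set "$@"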


What is your bar for 'average users'?

If we're talking project-wide repos rather than entire-org repos, I'd wager the vast majority of projects can use monorepos without special git tooling, and will retain huge productivity benefits vs app/package-per-repo organisation.


Honestly most orgs (with "most" weighted by org, not by headcount) could handle entire-org repos without using any of these features. It's still worth simplifying the workflow and training experience for projects and orgs that grow beyond that, though.


Partial checkout efficiency improvements make mono repos more compelling for large projects and organizations.

As an individual, I have switched to a mono repo for all of my Common Lisp code and with some adjustments to my Quicklisp configuration I am very happy with my setup.

I am a programming language junkie, and I have it on my low priority todo list to switch to a mono repo for Haskell, Racket, and the Hy language (a Lisp with Clojure-like syntax that sits on top of Python).

I worked as a contractor at Google in 2013 and I absolutely loved their mono repo and web based development environment. I really miss that.


How does a sparse checkout not defeat the purpose of a monorepo? I thought monorepos existed so it was easy to make changes that affect the whole codebase and to test those changes. If you only checkout a portion of the files, how are you going to test against the whole repo?

EDIT: my overall concern is that it looks like people are reinventing clearcase. Please speak to an older developer who worked at an HP/IBM type company in the late '90s/early 2000s before you do that. Please!


Continuous integration tools still check out and test the whole repo. Google has used this approach for over a decade.


This would be impractical for really large monorepos like the ones Google and Microsoft have. They have virtual file system layers on top (MS open sourced theirs) to prevent checking out the whole repo.

In fact, it’s not just useful for the CI/CD pipeline - any developers making significant changes to base libraries or core infrastructure should be able to use the VFS in combination with a system like Bazel to run all (or a significant sample of) affected tests across the company.
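As a rough sketch of that kind of query (the target labels are made up):

    # find everything that depends on a changed core library...
    bazel query 'rdeps(//..., //base/net:http)'
    # ...and run only the tests in that reverse-dependency set
    bazel test $(bazel query 'kind(test, rdeps(//..., //base/net:http))')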


They are hard to find. Do you know some?

All I have is this thread: https://lobste.rs/s/fosip5/should_version_control_build_syst...


I used clearcase many years ago, and this thread on lobste.rs is pretty accurate and interesting. They point out that the biggest problems were exclusive checkouts, file versioning instead of changesets, and the baked-in, out-of-date assumptions about networking. Getting your configspec wrong was a common problem too.

At HP we had some in-house perl-script wrappers around the raw clearcase tools that fixed many of these problems. The developers of those scripts had all left to go work for Rational (makers of clearcase), and I don't think anyone really knew how they worked. We also had a full-time clearcase engineer that kept the servers running. Fortunately our smallish projects didn't need the full power of clearcase and those perl scripts kept working fine for us. I did always wonder what would happen if the one guy who understood the servers left the company.

In short, it's a complex and powerful tool that very few people really understood. Very few projects need all that power and complexity. I'm sure Microsoft and Google benefit from complex version control tools and have engineers to spare for managing and understanding them, but I don't think any open source projects or smaller companies are really going to benefit from "clearcase for the modern age" type tools.


I can't wait until there's tooling that takes advantage of this. Tying sparse checkout into Gradle or Bazel would make this a lot easier.


Interesting. I've been wanting something like that for submodules. Can the two features be combined?

For instance, if you need a single file/directory from another project in your repository.


I thought one of the big benefits of monorepos was that you didn't mess with submodules anymore?


It still might be needed for external dependencies. The code an organization writes might be in one repo, but if you want to bring in some other library, like libssl (assuming there is no better package manager for your language), submodules are often used.


At my work we use sparse checkout and LFS on our binary dependencies submodule to pull in only the binaries that we need for the current platform (i.e. Linux or Windows).

Basically sparse checkout only populates the tree for the dependencies we want, and then git-lfs will only download the binaries that are present in the current worktree.

Works out pretty well.
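Not our literal setup, but a minimal sketch of the arrangement (the directory names are made up):

    # inside the binary-deps submodule: only materialize the Linux payload
    git sparse-checkout init --cone
    git sparse-checkout set deps/linux
    # LFS then only downloads objects whose pointer files are in the worktree
    git lfs pull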

Keep in mind though that sparse checkout still has the entire repository loaded in the `.git` object store; it just doesn't expose it in the worktree.


Are you sure about that?

> This combination speeds up the data transfer process since you don’t need every reachable Git object, and instead, can download only those you need to populate your cone of the working directory

If you're only downloading what you need to populate the working directory, how is it that `.git` will have the entire repository?


Using sparse-checkout by itself will still download the entire repository and its complete history into .git. If you additionally use the "partial clone" feature, then you can restrict what gets downloaded and stored in .git as well - it will download only the objects that are needed for your selected directories (along with their complete history). On big repositories with long history this might still be too much data, so you might also want to use the "shallow clone" feature (via the --depth flag) to restrict how much history you download.
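A rough way to see the difference on disk (the URL is a placeholder):

    # plain clone + sparse-checkout: .git still contains every object
    git clone https://example.com/big.git plain && cd plain
    git sparse-checkout init --cone && git sparse-checkout set docs
    git count-objects -vH    # full object count despite the sparse worktree

    # blobless partial clone: most blobs are only fetched on demand
    cd .. && git clone --filter=blob:none https://example.com/big.git partial
    cd partial && git count-objects -vH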


I guess it's possible you don't get it all, but I've definitely run `git grep` before on that repo and had results come back that weren't in my worktree.

Edit:

   wrowclif@wrowclif-desktop:~/Taccs2/p5_deps$ git grep "def returnValue"
   twisted/install_linux_gcc54/lib/python2.7/site-packages/twisted/internet/defer.py:1350:def returnValue(val):
   twisted/install_linux_gcc54/lib/python2.7/site-packages/twisted/internet/test/test_win32events.py:66:    def returnValueOccurred(self):
   twisted/install_win64_vc141/lib/python2.7/site-packages/twisted/internet/defer.py:1350:def returnValue(val):
   twisted/install_win64_vc141/lib/python2.7/site-packages/twisted/internet/test/test_win32events.py:66:    def returnValueOccurred(self):
   twisted/vendor_base/src/twisted/internet/defer.py:1350:def returnValue(val):
   twisted/vendor_base/src/twisted/internet/test/test_win32events.py:66:    def returnValueOccurred(self):


   wrowclif@wrowclif-desktop:~/Taccs2/p5_deps$ ls ./twisted/
   install_linux_gcc54


I would speculate that the partial-clone implementation pulls down all the commits that touch any files that are required. Some of these commits would presumably include changes to other parts of the source tree. Perhaps `git grep` still matches on such commits?


Partial clone is different from sparse checkout.

We are using sparse checkout. Partial clone is the one that only pulls down the objects that are needed into the object store.


LFS only downloads the files required by the checkout. From Git’s perspective, those files are very tiny, and only include the information required so LFS can download the files on-demand.

Git’s partial clone is a more natural way of achieving the same outcome.
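For reference, the pointer file that LFS stores in Git looks roughly like this (the oid and size are placeholder values):

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a2146...   <- placeholder hash
    size 12345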


> For instance, if you need a single file/directory from another project in your repository.

The last time this happened to me, I took it as a hint that I had split the repositories along the wrong lines. The repos should probably be either merged or divided further to prevent this.


Sometimes you don't own the other repo.


Doesn't that seem like a build tool situation? At that point the other piece of code isn't part of your source; it's a source dependency, no different from a binary dependency at some version. So you don't really want the tree, you want the file at some revision; if it's GitHub-based you have the natural HTTP endpoint, and otherwise it's trivial to proxy it as an artifact.


Well, you are right, but there would still be some advantages to submodules:

1. Check the file hashes themselves: while you can definitely put the commit ID in the URL, nothing prevents the remote server (unlikely if it's GitHub, though) from answering with another version of the file (and it could even do so selectively for your build server).

2. Simple upgrade path: with submodules, you can just `cd` into them and run `git pull` or `git checkout v11.5.2`, and git itself could inform you that a newer version is available if tracking a branch.

I also agree with the contribution aspect, though it is less important in some cases.

Take the latest example I have in mind where this could have been useful: for integration into F-Droid, RiotX needed to include not the binary artifacts of a library, but the source itself. The source repository is quite big (multiple languages), but the thing of interest is a single Java file [1]. They ended up simply copy-pasting the file [2] into their repo, which makes its origin less obvious and more subject to bit-rot and vulnerabilities.

[1]: https://github.com/google/diff-match-patch/blob/master/java/...

[2]: https://github.com/vector-im/riotX-android/pull/760


Not really. You often need to make extensive changes in those kinds of external dependencies, so you really do want them in your source tree.


Oh, interesting. That explains why you want to retain the history, so you can easily merge and stuff too.


I have to tell it which directories I want? That seems like work the tool could do. Also, the granularity should be at the file level, not directory.


The sparse-checkout patterns match at the file level, so you can always use that (without “cone mode”) if you want. It becomes difficult to match an exact file list as people add files to projects: you require every other user to update their patterns to match the newly-added file.


Just keep in mind, as the article points out, that going without "cone mode" is potentially a lot slower; that's why cone mode exists.


It's based on glob paths. I think you can specify it at whatever level you want. Also you can use wildcards.
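Roughly, in non-cone mode the patterns are .gitignore-style (these paths are just examples):

    # non-cone mode: .gitignore-style patterns, matched per file
    git sparse-checkout init
    git sparse-checkout set '/docs/*.md' '/tools/'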



