Speeding up a Git monorepo (dropbox.tech)
147 points by illuminated on June 10, 2020 | 74 comments



Monorepo/multirepo (and monolith/microservice which often tags along) seems like a false dichotomy.

There are costs to having big repositories (e.g. TFA and needing to do scaling work), and there are costs to having lots of repositories involved in producing a unified working result (dependency resolution is NP-complete in many of its useful formulations). Big players have the muscle to optimize Mercurial and Git, so they get to do super slick trunk-only monorepo development at engineer-commit scale, but still often have auxiliary repositories that take e.g. machine-generated commits. Smaller players probably aren’t hitting scaling limits on these tools. But every situation is different, and if you approach it as an engineering problem you can usually do something very workable.

Likewise with monolith/microservice: there is a happy medium where you introduce a network boundary for an engineering reason (maybe one part of my computation needs a lot of CPU but a different part needs a lot of RAM, so they run on different SKUs/instance types). My big giant web app that dates to the founding of the company? Probably don’t want to rewrite that so let’s spin services out of it incrementally when I need to write something in C++ or use a shitload of RAM or whatever. That’s bread and butter systems engineering.

But this “pick a side” mentality where it’s like one giant ball of PHP in one giant Subversion repository or every team has their own little service in their own little repo and I burn 40% of my cycles parsing JSON isn’t a set cover: you’re allowed to choose a happy medium.

Just do things for valid technical reasons and don’t have Conway’s law go apeshit on your architecture by shattering it into a zillion pieces. The human factor stuff can be addressed with engineering rigor and consensus. “This is too slow to be in Python now” is a good reason to make a service. “The iOS team shares no code with the web team and bisects will be faster/easier” is a good reason to make a repo.

“I want to have my own coding style and/or use some language no one else knows and/or learn k8s and/or not deal with that team I don’t like” are not engineering reasons to type ‘git init’ or make a network call.


I think that's right, but organizationally challenging to implement.

In my experience engineers don't really think through this question carefully, and have to be prodded to do the right thing, or at least the thing that better serves others or their future selves.

There's also a hump to get over before network effects take over and engineers have strong incentives to participate in the monorepo.

In your iOS vs web team example, we might imagine that they eventually want to run an integration test against the same API, whose team helpfully provides a hermetic version of the service for testing against. (Dropbox wrote some bazel rules for this sort of thing: https://www.youtube.com/watch?v=muvU1DYrY0w)


> The human factor stuff can be addressed with engineering rigor and consensus.

How do you add engineering rigor and consensus? Let's say as an Individual Contributor.


By working at a place where people listen to you and are not (too) crazy.


Just you? If it's a team effort, places like Dropbox have hundreds, if not thousands of engineers, all with an Opinion.


Sometimes I get the feeling that they should indeed just listen to me. When one needs an argument to convince a colleague that one really should not be using == to compare two floats because of rounding errors, or when working in a place currently suffering from blatantly ignoring https://www.joelonsoftware.com/2000/04/06/things-you-should-... and experiencing the predictable consequences afterwards, I quite often get the feeling that in many of the places where I could work I am actually the only adult in the room.

To be a bit more practical, though, yes one should listen to others as well because in some cases they might actually have ideas that are good. Also, morally, it kind of is the golden rule that if one expects to be listened to one needs to listen oneself as well. In cases of large places with lots of people I would say that there should be some form of code ownership and hence people and/or smallish teams can decide what to do with the code that they own. One of the most important things that contributes to code quality is not too many changes of code stewardship.


Sorry, let me clarify: Given a particular company, how do you add engineering rigor and consensus? Because that was the proposal in contrast to splitting off a separate service.


I’m still hoping some standard toolset emerges for dealing with large monorepos sometime in the near future. At the moment it’s clear that a number of companies are rolling their own solutions, which follow one of two patterns:

- Persistent process which watches workspace changes, or

- Workspace in virtual filesystem.

The other common factor seems to be trunk-based development with all commits rebased into a linear history. I’m not super hopeful that we’ll see an open-source solution in this space for a while, though—any company with a code base large enough to really need these solutions is also large enough to throw a few engineers at VCS, especially given that they’d already have engineers supporting VCS from the operations side of things.


Author of the post here.

I think Git upstream is trying to simplify configuration. They have a config option called `feature.manyFiles` which enables most of the features we enabled for our developers (https://git-scm.com/docs/git-config#Documentation/git-config...).

We wanted to use this instead of deploying a wrapper, but it turns out that some of Git's features like fsmonitor do not interact well with repositories with submodules (there were Git crashes). And we have some developers that work on repositories with submodules. So we needed something more flexible, like enabling these features only on particular repositories.
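
For anyone who just wants to try the upstream option, the per-repository opt-in is roughly this (a sketch; only in repos without submodules, for the reasons above):

  cd path/to/big-repo                  # placeholder path
  git config feature.manyFiles true    # local config, so it only affects this repo
  git update-index --index-version 4   # convert the existing index right away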


I’m quite confused as to what you actually shipped to your developers to increase performance. This config option is a great place to start, but it would be great if that were clearer, so that others could follow suit.


We shipped a wrapper that tweaks git configs and logs metrics, and a custom fsmonitor hook that was slightly faster than the stock one. We also ensured watchman was installed on developers' laptops.

And we made a few changes to Git to fix bugs (for example, `git stash` wasn't using fsmonitor data, so it was slow).


It would be great to have a listing of those config tweaks, even if with caveats attached, such as “causes issues with submodule”.

I don’t want to seem demanding, but it’s such a tantalizing article without this info :)


Sure.

core.fsmonitor is set to our custom fsmonitor (this causes issues with submodules, at least on 2.24)

core.untrackedCache true

We use index version 4

And a slight hack: our wrapper sets GIT_FORCE_UNTRACKED_CACHE = 1. This forces `git status` to write the untracked cache if it notices a difference. I was too lazy to add a patch to configure that.
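
In config-command form, that's roughly the following (our fsmonitor hook is internal, so the path below is just a placeholder):

  git config core.fsmonitor /path/to/custom-fsmonitor-hook   # placeholder path; point this at your own hook
  git config core.untrackedCache true
  git update-index --index-version 4
  export GIT_FORCE_UNTRACKED_CACHE=1   # set by the wrapper before it invokes git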


Wonderful, thank you!


Most of Microsoft's Scalar tool is also a thin wrapper around existing git configuration, plus task scheduling of upstream git operations, with currently no or minimal config UI. Microsoft's blog posts say that whatever isn't a thin wrapper around auto-configuration is stuff they are working to upstream directly to git eventually (assuming it makes sense to be in git upstream; GVFS, for instance, is considered out of scope, and git likely won't ever have a background task scheduler).

Since Scalar is open source, it can be a useful reference for git settings to try in large repositories. I know that if I need to remember how to configure git's sparse "cone" checkouts, Scalar is where I'd look rather than try to manually configure it directly in git's config files.
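
For reference, the manual version has gotten pretty short in recent git (2.25+); the directory names here are just placeholders:

  git sparse-checkout init --cone
  git sparse-checkout set services/foo tools/bar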


Do they have any recommendations on at what size to enable that?


I reckon the discussion of the history that led Dropbox to switch to a monorepo is more interesting than the git speedups.

The last couple of places I've worked at are large non-tech companies; both orgs were internally using on-prem GitLab/GitHub/Bitbucket. These tools make it much easier for teams (or individuals) to create as many new repos as they want without coordinating with anyone else -- for better or worse.

I suspect what happens quite often these days is that people create many repos without consciously thinking about whether that's a good idea -- because it is familiar and because there are relatively high-quality tools/products to let you make more repos.

The small part of the org I currently work in probably has O(200) employees and O(200) git repos.

The last system I worked on at my previous company had a single git repo containing all parts of a line-of-business application (db, API server, frontend, backend servers for batch jobs), but then there were about 40 other git repos containing deployment scripts etc. used to deploy just this one system. It made it bloody hard to figure out exactly what version of what script or library was actually used to deploy. (To be fair, a lot of this was a consequence of using Ansible modules, which expect each module to be in its own git repo, and of having a couple of people hack together a lot of Ansible modules in a short amount of time without review.)


Great post!

I wrote a (fast?) fsmonitor hook in Rust... benchmarked against the reference Perl implementation, it's quite a bit faster. On a repo of 130k files, my monitor is able to `git status` in 18 milliseconds.

https://github.com/jgavris/rs-git-fsmonitor


Which operating system are you using? That's impressive. More importantly, our hook doesn't support the new query version so we might want to switch.


I use macOS, but some folks have contributed a linux package / installer. And yeah, I added the v2 of the hook recently which is even faster!


For anyone interested in why monorepos work, I'd recommend the book Software Engineering at Google: Lessons Learned from Programming Over Time. It details the reasons for the One Version Rule and for Version Control over Dependency Management.


Doesn't Google use Perforce though, which (last time I used it) forces a monorepo approach? git doesn't have equivalents to branchspecs and clientspecs.


It is because Google wants a monorepo that Google chose to use Perforce (and later Piper). It is not that Google uses Perforce and is thus limited to a monorepo.

The core value behind monorepo (and monorepo-like approaches) explained in the book is that dependency management is harder than version control.


It is interesting that the speed of lstat on macOS is the driver behind this problem. According to the article it is 10x slower than Linux.

I wonder if anyone has tried attacking that end of the problem? Faster lstat on macOS would benefit all applications not just git.


https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks... is an interesting write up about the problem.


That is also what I immediately noticed. The graphs showed that the slow runs were around ~4s (p50) or ~7s (p90), and the article also mentions that Linux was 5-10x faster. This would mean the same operations on Linux are somewhere around 1s in the worst case, which is perhaps even faster than their sped-up system.

But instead they had to build all this extra junk like an additional caching layer and daemon (wasting memory and CPU) atop the kernel's existing cache just to work around the kernel being slow.


One thing for Linux is that it happily caches large numbers of inodes, as long as there's enough memory for it. macOS has a (configurable) maximum, and the default value is low enough that it makes things really bad for large repositories. It used to be a few thousand; it's now around 263K, apparently. Check it out on your machines:

  sysctl kern.maxvnodes
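
Raising it is a one-liner if you want to experiment (the value here is arbitrary, and it resets on reboot unless you persist it):

  sudo sysctl kern.maxvnodes=500000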


> I wonder if anyone has tried attacking that end of the problem? Faster lstat on macOS would benefit all applications not just git.

One major blocker is that the code for this is largely closed source and not accepting patches.


I believe the same thing is a problem on Windows, too. Linux has fast lstat.


The fact that so many are using monorepos points to a weakness in revision control and dependency management tools. This is a big gap where someone will invent Git’s replacement. What features will it need to replace Git?


> The fact that so many are using monorepos points to a weakness in revision control and dependency management tools.

People seem to like monorepos, what problem do you have in mind?


Access control is a big one.


Easier and more performant to work on a subset of the repo. That is, if you're in folder /repo/foo/bar, it should have a mode so that it only checks anything in the bar folder. That can probably be added to Git though.


Check out pijul


Is there a reason people prefer a monorepo to submodules?

I worked at one place that kept everything in a Mercurial monorepo and it was a real pain keeping branches in sync.


I introduced a submodule at my last job to hold common build code used across multiple repositories. I knew going in that some of our developers didn't really understand git, and were not interested in rectifying that. I won't say the submodule was a disaster, but I definitely paid for it with time spent sitting with people and helping them fix messes. I would do it again but only if I knew I could count on the team to make more than a token effort to understand what a submodule was and how it worked. I would not even consider adding multiple submodules unless I felt like everyone knew exactly how to use them.

Another potential source of issues: the submodule remote URI is checked in as part of the .gitmodules file. If your CI system uses a different URI than your developers, you have to work around that. If you change where your source is hosted and you want to check out an old version, you have to figure that out too.

That said, I think the most common reason is the same reason some people prefer monorepos in the first place: they perceive the monorepo as simpler, and adding submodules is not simpler. They want to know about and manage 1 repository clone. They want to commit, push, review, and build out of 1 repository. They want to search for stuff in 1 directory tree. They'd also like to do that with 1 tool, ideally 'git'. Nothing really offers that except monorepos.


You can use relative paths for submodule urls.
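
e.g. in .gitmodules, where a relative URL is resolved against the superproject's own remote (the module name here is just an example):

  [submodule "common-build"]
      path = common-build
      url = ../common-build.git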


Submodules break a lot of other common git features such as switching branches. Instead you have to do something brutal like:

  git submodule deinit --all --force
  git clean -dfx
  git checkout newbranch 
  git submodule update --init --recursive
There might be cases where you can do a more lightweight switch, but they probably require iterating over the submodules.
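
That said, newer Git has a recursive checkout that handles the simple cases in one step; I wouldn't trust it with every messy submodule state, but it's worth trying first:

  git checkout --recurse-submodules newbranch
  git config submodule.recurse true   # or make recursion the default for checkout, fetch, etc.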


The point is so you can do single commit cross project changes. You also have the added benefit of being able to track exact code/dependency versions across all projects.

In a many-repo, dependency-tracking setup, it's an anti-pattern to pin to latest because that can change underneath you, and it's hard to know what version latest even meant at the time someone committed the dependency.

In a submodule setup, you can track exact dependencies, but you have to pin each submodule reference. You can't easily update all downstream references. You can't use latest, for the reasons above. You also can't make single commits across many projects.

In a monorepo setup, you can point to latest (i.e. whatever is in any given HEAD) and trust that everyone will know exactly what that means, now and into the future. A monorepo also opens up some ability for build systems to work across projects, and possibly more insight into company-wide dependency tracking.


Title should be "Speeding up a Git monorepo at Dropbox with <200 lines of code" - looks like some sanitisation regex got over-zealous here.


Yes, I posted that exact title but didn't notice the cut until now. I hope the mods will be able to fix this.


Honestly the HN obsession with modifying titles is excessive. I understand not wanting to have incendiary titles which encourage flamewars, but the rest of the policies around title editing / sanitizing don't seem that useful and actually hinder coherence in many cases (like this one).


It's a shame that Dropbox abandoned Mercurial for Git. With both Facebook and Google contributing to better support for monorepos in Mercurial, Mercurial seems like a better choice for big monorepos.


Huh? I don’t know about Facebook, but Google for sure does not use Mercurial as its monorepo backend (nor does it use Git). There are Git- and Mercurial-like clients to interface with the in-house backend, which is a Perforce-like thing. Neither Git nor Mercurial would be fun to use at Google scale. Dropbox has a much smaller monorepo, hence they can clone the whole Git thing onto developer machines. I assume doing the same with Mercurial is impractical, as no one has the patience for something that slow.


I should clarify my comment by saying monorepo-related Mercurial improvements benefit not just those monorepos backed by Mercurial, but also where Mercurial is "just" a front end to a different system. I mean just think about it, what does a front end in this case really mean? When you run a command like `hg log` how much of the original Mercurial code are you running? Do you design an entirely different system that happens to share the same command-line syntax as the original Mercurial, or do you emulate the .hg folder format and run the original Mercurial code, or somewhere in between? Thinking about this problem would shed more light on why Google's work on Mercurial benefits everyone else with a big monorepo even though Google is "just" using Mercurial as a front end.

And both Facebook and Google have contributed to Mercurial on this front, though admittedly Facebook did more work than Google did. I know the tree manifest feature (https://www.mercurial-scm.org/wiki/TreeManifestPlan) is done by Google and upstreamed, and it benefits every repo with millions of files. (Just clone the hg repository and search for commits with a google.com author email and see what kind of commits they are.)


Sure, but still, in the Google/Facebook case there’s a centralized backend that does the day-to-day operations you invoke from your laptop. In Dropbox’s case, it is just a local git repo that you push to a server only when you want to land a change; just like whatever one does with GitHub, but quite a big one, so the use case is quite different from F/G, which is what I was getting at. In principle maybe you could do it, but that’d require quite a big investment in effectively rebuilding Mercurial.


Google has a Mercurial front-end to their "Perforce-ish" server.


Yes, that’s what I said in the above comment. They have a git frontend as well. I’m sure someone at Google has made one-off front ends for other random stuff as well. Point is, the monorepo isn’t Mercurial nor git, whereas for Dropbox it is git (and likely Mercurial doesn’t scale for that use case), which was the point (“shame it’s not mercurial”) in the original comment I was responding to.


At least two years ago, Google seemed to be driving an effort to move away from Piper and to Mercurial - not just a frontend (https://groups.google.com/forum/#!topic/mozilla.dev.version-...). They would need to build a bunch of custom extensions for that to work, and they've open sourced some.

But I have no idea what came of that effort.


Facebook has a Mercurial clone named EdenSCM[1], which includes a scalable backend named Mononoke[2].

[1] https://github.com/facebookexperimental/eden

[2] https://github.com/facebookexperimental/eden#mononoke


Did Microsoft's enhancements to Git, particularly the Bloom filter optimizations to git log/blame make it into the mainline?


It sounds like a lot of them have, but most still need to be configured. The commit-graph [1] is the biggest internal part of those optimizations and I believe core.commitGraph is still defaulted to off (and probably is more overhead than necessary for small to medium repositories).

[1] https://git-scm.com/docs/commit-graph
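
If you want to make sure it's enabled for a given repo, it's just config plus an initial write (a minimal sketch):

  git config core.commitGraph true
  git config gc.writeCommitGraph true
  git commit-graph write --reachable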


Thanks for the link to the documentation. That is updated with every major Git version, and can be used to track what features are present. Also, the release notes can be helpful.

In particular, the recent Git v2.27.0 release does include an implementation of Bloom filters with speedups for `git log` and `git blame`. You need to manually run the command to make it work. The version I prefer is this:

> git commit-graph write --changed-paths --reachable

After that first write (which writes filters for every reachable commit) you can do a smaller write by adding `--split` to write incrementally [2].

[2] https://devblogs.microsoft.com/devops/updates-to-the-git-com...

By writing these filters, you will speed up most `git log` and `git blame` calls. There is an improvement coming in the next version that includes speedups for `git log -L`.

Caveat: The biggest reason these improvements have not been widely advertised is that the user experience has not been completely smoothed out. In particular, you can only write the changed-path Bloom filters using the command(s) above. If a commit-graph is written during GC (due to `gc.writeCommitGraph` config setting) then the filters will disappear. Similar for `fetch.writeCommitGraph`. We plan to have these resolved in time for v2.28.0, along with more performance improvements.

(Full disclosure: I am a contributor to Git, Scalar, and VFS for Git, which are referenced by the article.)


Thanks for your work and your team's work on Git!


For huge repos, building on build servers also becomes a problem. A build machine can usually do a shallow clone, but it would be better to filter down to smaller pieces.

Even more importantly, when you have multiple builds using the same repo, CI systems usually work by setting up a local copy of the repository per build, so if you have a compile+unit-test build and another for slower integration tests running the same code, the machine might end up having two git /objects directories somewhere, containing the same data. If you have a 100GB repository and maybe 100 different builds against it, this quickly becomes unmanageable.

What I'd want is the ability to use a common objects directory for a machine, where common objects would be deduplicated. I don't know if this is achievable with GVFS or even by linking the directories?
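
The closest thing I know of in plain git is the alternates mechanism via --reference, where each clone borrows objects from a shared local store instead of keeping its own copy (sketch below with made-up URLs/paths; the big caveat is that you must never prune or delete the shared store while clones still reference it):

  # one shared bare mirror per build machine
  git clone --mirror https://example.com/big-repo.git /srv/git-cache/big-repo.git
  # per-build clones borrow objects from it instead of duplicating them
  git clone --reference /srv/git-cache/big-repo.git https://example.com/big-repo.git build-42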


Can someone explain how git can be slow for these folks while the Linux kernel continues to use git? Do these performance issues plague Linus as well? Or is the Linux kernel smaller than the code in TFA?


The Linux kernel is not a big project. Sure, it's bigger than your average one-file-one-function-npm-package, but not as big as people think. Dropbox, Google, Facebook, … their repositories are far larger, perhaps by several orders of magnitude.


Kernel is tiny. Whole repo is like 1.5GB. That's nothing. We are just 20 devs, not a massive company, and I'm on a 100k-commit, 50GB-history repo.


That’s a massive repository…I’ve worked in code with many hundreds of thousands of commits from many, many people and they’re a couple gigabytes at most. Are you storing assets in your tree?


Yes, a lot of it is non-text content (by file count not so much, but by size probably 80% or more). Nothing unnecessary, but resources required to build and test each revision (not e.g. documentation). It's a document-based and graphical app (CAD), so it is not practical to store non-text assets separately from the source tree. Each branch has different versions of various test inputs/outputs, drawings, image resources etc. We use git LFS obviously, otherwise this whole setup would be impossible. In all, it's actually a decent experience. We couldn't migrate off Subversion until LFS was stable, but now that it is, Git is actually a quite good VCS for the "Subversion use case" (large, central, heavy in binaries).


> Interestingly, git status was 5-10x faster on Linux compared to macOS.

From the article. That's a very big difference.


Precisely.


At what point does just investing in a Linux distro for your employees make more financial sense? I have similar issues with docker-on-mac, where it keeps using 100% of my CPU (https://github.com/docker/for-mac/issues/3499), because of which my 2-year-old MacBook's screen has started dying due to overheating so much (https://www.ifixit.com/Answers/View/567125/Horizontal+line+o...

It would be better if the corporate developers just moved onto Linux Distros. The software and hardware quality is 99% there already.


I came across a neat infographic the other day comparing source code repositories; I had to google for a while, but here it is: https://medium.com/free-code-camp/the-biggest-codebases-in-h...

TL;DR, the Linux codebase there (version 3.1) is about 15M LOC, versus e.g. Windows Vista at 50M, MacOS at 85M, and Google at 2B. Not sure where Dropbox fits in that, but it's probably a similar order of magnitude to Linux and co.


I wonder where they got those numbers.


Those times look too long for the number of files. Here's my time:

  $ time git status
  On branch master
  Your branch and 'origin/master' have diverged,
  and have 1 and 115 different commits each, respectively.
    (use "git pull" to merge the remote branch into yours)
  
  real    0m0.517s
  user    0m0.240s
  sys     0m0.472s
  $ git ls-files | wc -l
  130685
This is on Ubuntu Linux, no special additions done to git. I wonder why the author's experience is so different.


I started using this a few months ago to solve the very important problem of my git-status-decorated bash prompt taking too long to display on macOS. I'm very happy with the result, but there are a couple of situations where it seems to get stuck and I have to go kill processes: after I've created a lot of untracked files and then deleted them; and if I've moved back and forth between revisions with thousands of combined file changes.

Still highly recommended for large (or just old) repos.


I learned very quickly to run my bash prompt in a timeout wrapper after cloning WebKit once and having an absolutely miserable time doing anything while inside the repository.
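
The wrapper is roughly this in my prompt, in case it's useful (a sketch; the 0.2s cutoff is arbitrary, and on macOS `timeout` is `gtimeout` from GNU coreutils):

  __git_branch() {
    timeout 0.2 git symbolic-ref --short HEAD 2>/dev/null   # give up rather than hang in huge repos
  }
  PS1='\w $(__git_branch)\$ '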


They mention stat syscalls as the limiting factor in large repos. I wonder whether the ceiling of "too large" could be raised by batching/streaming stat calls via io_uring. It wouldn't help Mac users, but at least on Linux it could improve the out-of-the-box experience for large repos.


There is currently no efficient way in git to just clone a single subdirectory. This is very inconvenient when dealing with very large monorepos: even with depth 1, you still have to grab the entire tree.
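
The closest I've found is partial clone plus sparse checkout, which avoids downloading blobs outside the directory but still fetches the full tree structure (a rough sketch, git 2.25+, names are placeholders):

  git clone --filter=blob:none --sparse https://example.com/monorepo.git
  cd monorepo
  git sparse-checkout set services/foo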



I couldn't use a monorepo if I have a service to deploy to Heroku though. Or do you know if I can deploy a monorepo to Heroku?


Easy fix: disallow use of macOS. I really don't get why developers use an OS where no upstream package management of developer tools exists and that offers no way of running performant VMs.


I do like macOS’s UX and the fact that OS updates have been literally painless for the 16 years I’ve been using macOS as my daily driver. At the same time, it being a Unix under the hood, it gives me POSIX compatibility all over the place. Regarding developer tools, Homebrew has done a good enough job so far. YMMV.

I see why people dislike macOS but banning it at the workplace would be over the top.



