Facebook engineer here, working on this problem with Joshua.
What this comes down to is that git uses a lot of essentially O(n) data structures, and when n gets big, that can be painful.
A few examples:
* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.
* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O. (There's a sketch of that scan at the end of this comment.)
An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash. Also, inotify is an incredibly tricky interface to use efficiently and reliably. (I wrote the inotify support in Mercurial, FWIW.)
* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), and the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).
None of these problems is insurmountable, but neither is any of them amenable to an easy solution. (And no, "split up the tree" is not an easy solution.)
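Since the lstat scan is the part people keep asking about, here's a minimal sketch (Python, not git's actual C code) of what a status-style scan has to do. git's real index tracks more fields (ctime, inode, mode, and so on), but the shape of the cost is the same: one lstat per tracked file, so over a million syscalls on a repo this size, most of them hitting disk when the dentry/inode caches are cold.

    import os

    def changed_paths(tracked_paths, cached):
        # cached: {path: (mtime_ns, size)} captured on the previous run,
        # standing in for the stat data git keeps in its index
        changed = []
        for p in tracked_paths:
            st = os.lstat(p)          # one syscall per tracked file, O(n) no matter what
            if cached.get(p) != (st.st_mtime_ns, st.st_size):
                changed.append(p)
        return changed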
> An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash
So does, presumably, the cache when you use lstat. (Let's scratch "presumably". It does. Bonus points if you can't use Linux and have to use an OS that seems to drop its caches as soon as possible.)
I hope I'm wrong, but the proper solution to this seems to be a custom file system - not only would it allow you to more easily obtain a "modified since" list of files, it would also allow you to fetch local files only "on demand". (E.g. http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...)
That still doesn't solve the data structure issues in git, but at least it takes some of the insane amount of I/O off the table.
I'm looking forward to seeing what you guys cook up :)
So for your first item, it seems like it should be possible to add a (mostly immutable) cache file doing the job of Mercurial's files field in changesets, right? I.e. for each commit, list the files changed. Should be more efficient than searching through trees/manifests for changed files, at least.
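A rough sketch of what deriving that cache could look like (Python, shelling out to git; the '@' prefix is only there to tell commit hashes apart from file names):

    import subprocess
    from collections import defaultdict

    def build_files_cache():
        out = subprocess.run(
            ["git", "log", "--name-only", "--pretty=format:@%H"],
            capture_output=True, text=True, check=True).stdout
        cache = defaultdict(list)        # commit sha -> [paths it touched]
        commit = None
        for line in out.splitlines():
            if line.startswith("@"):
                commit = line[1:]
            elif line and commit:
                cache[commit].append(line)
        return cache

blame/log on a single file could then skip any commit whose list doesn't mention that file, instead of diffing trees for every commit.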
For trees that are large in file count, it seems like there's no easy solution, short of developing some kind of subtree support. However, that's similar to just splitting up the repository (along the lines of hg subrepo support), in the sense that now you have no real verification that non-checked-out parts of the tree will work with the changes in the part you do have checked out.
Still, the inotify daemon seems like it could alleviate things a bunch; particularly if the repository is on a server anyway, i.e. it's not rebooted that often.
Out of curiosity, why are these benchmarks using regular disk and flash disk? At only 15 GB, what happens with a RAM disk? Sure, SSD is fast, but for these things it's still really slow.
Sorry for asking the obvious, but do you really need a huge amount of data to keep development productive? How often do you use history that is several years old? Could you not archive it?
Or is the sheer number of files the problem, even ignoring history?
This is not a "git is perfect, fix your workflow" post, but I'm genuinely interested in what you have to say. Also, it seems like making git faster is an increasingly difficult task, given the amount of effort that has already been put into it.
Do you think adapting git to use, say, LevelDB and letting that do its job with incremental updates and maintaining a secondary index by path could help?
And that might dovetail nicely with an inotify daemon?
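As a toy illustration of the shape of what I mean, using Python's stdlib dbm purely as a stand-in for LevelDB (the function names are made up): the index is keyed by path and only ever updated incrementally, never rewritten wholesale.

    import dbm, json

    def index_commit(db_path, commit_sha, changed_paths):
        # Append one commit to the list stored under each touched path.
        with dbm.open(db_path, "c") as db:
            for path in changed_paths:
                key = path.encode()
                commits = json.loads(db[key]) if key in db else []
                commits.append(commit_sha)
                db[key] = json.dumps(commits)

    def commits_touching(db_path, path):
        with dbm.open(db_path, "c") as db:
            key = path.encode()
            return json.loads(db[key]) if key in db else []

An inotify-style daemon could feed the same kind of incremental updates for working-tree state.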
You seem to know enough about this problem to solve it given enough time.
Why doesn't Facebook solve this git-on-huge-repos problem and put out a patch for others to see? Oh, right, you want somebody else to solve the problem for you, for free!
Wow. I was expecting an interesting discussion. I was disappointed. Apparently the consensus on hacker news is that there exists a repository size N above which the benefits of splitting the repo _always_ outweigh the negatives. And, if that wasn't absurd enough, we've decided that git can already handle N and the repository in question is clearly above N. And I guess all along we'll ignore the many massive organizations who cannot and will not use git for precisely the same issue.
So instead of a (potentially very enlightening) conversation identifying and talking about limitations and possible solutions in git, we've decided that anyone who can't use git because of its perf issues is "doing it wrong".
Your comment was at the top so I continued to read expecting to find a bunch of ignorant group think about how git is awesome and Facebook is dumb, but that's not really what's going on down below.
I don't know what facebook's use case is, so I have no idea if their repositories are optimally structured. However, I've used git on a very large repository and ran into some of the same performance issues that they did (30+ seconds to run git status), so I don't think it's terribly hard to imagine they're in a similar situation.
What we did to solve it is exactly what you're excoriating the people below for suggesting: we split the repos and used other tools to manage multiple git repos, 'Repo' in some situations, git submodules in others.
However, we moved to that workflow mainly because it had a number of other advantages, not just because it made day-to-day git operations faster.
I hope git gets faster, some of the performance problems described are things we saw too, but things are always more complicated and I see nothing below that looks like the knee-jerk ignorant consensus you're describing.
Sometimes the answer to "it hurts when I do this" is "don't do that... because there's other ways to solve the same issue that work better for a number of other reasons and we haven't bothered fixing that particular one because most of the time the other way works better anyway."
On a similar note, I've heard of people who would hit the size limit on Fortran files, so they put every variable into a function call to the next file, which itself contained one function and a function call to the next file after that (if necessary).
It is intuitively obvious that it is better to be rich and healthy than poor and ill. Sadly, in reality the choices are rarely that clean. And you cannot just split a repo. Hearing some ideas for that case would have been interesting.
Solving a scaling problem by splitting it is, well, obvious.
And, yes, I also ran GitHub on a couple of projects at $work and the issues are real; I've seen them.
So, if it hurts when I try to use git, the answer will be "don't use git"... But the conveniences are so tempting...
Stat'ing a million files is going to take a long time. Perforce doesn't have this problem because you explicitly check out files (p4 edit). (Perforce marks the whole tree read-only, as a reminder to edit the file before you save.)
It seems like large-repo git could implement the same feature. You would just disable (or warn) for operations which require stat'ing the whole tree.
Then the question is how to make the rest of the operations perform well -- git add taking 5-10 seconds seems indicative of an interesting problem, doesn't it?
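To make the contrast concrete, here's a toy sketch of the explicit-checkout model (not a real client, and of course not how p4 is actually implemented): with an "opened files" list, status-like operations never have to touch the other million-plus files.

    import os, stat

    opened = set()                      # in Perforce this state lives on the server

    def protect(path):                  # applied to every file at sync time
        os.chmod(path, stat.S_IREAD)

    def edit(path):                     # the "p4 edit"-like step before changing a file
        os.chmod(path, stat.S_IREAD | stat.S_IWRITE)
        opened.add(path)

    def status():
        return sorted(opened)           # O(opened files), no tree-wide stat needed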
It seems eminently obvious to me that having basically a "change log" for a (part of a) filesystem is something that's valuable independent of your build system, revision control system, whatnot.
At least that's what I'd like to see - it's functionality that's orthogonal to those tools.
Mac OS X's FSEvents API has something similar to that. When you create an FSEvents listener you can pass in an old event ID so the system can give you all the stuff that happened while you weren't listening [1]. Apple uses this for Time Machine (and I suspect Spotlight, too).
To better understand this technology, you should first understand what it is
not. It is not a mechanism for registering for fine-grained notification of
filesystem changes. It was not intended for virus checkers or other technologies
that need to immediately learn about changes to a file and preempt those changes
if needed. [...]
The file system events API is also not designed for finding out when a
particular file changes. For such purposes, the kqueues mechanism is more
appropriate.
The file system events API is designed for passively monitoring a large tree of
files for changes. The most obvious use for this technology is for backup
software. Indeed, the file system events API provides the foundation for Apple’s
backup technology.
IMO, only telling users what directories changed is a smart move. It means that the amount of data that must be kept around is much smaller. That allows the OS to keep this list around 'forever' (I do not know how long 'forever' actually is).
NTFS has this optionally in the "USN Change Journal"; see http://msdn.microsoft.com/en-us/library/aa363798.aspx. It's used by a few Microsoft features like indexing and file replication, but it's available to third party programs too.
The git add problem is because .git/index is rewritten from scratch each time a new change is staged. With a 100 MB index file, that takes as long as it takes to write that much data to disk (cache). Much room for improvement here.
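A quick way to feel that cost on your own repo (just a sketch, and it only measures the read half of the round trip):

    import os, time

    index = ".git/index"
    size = os.path.getsize(index)
    t0 = time.time()
    with open(index, "rb") as f:
        f.read()                       # and git then rewrites roughly this much data
    print("%.1f MB index, read in %.2fs" % (size / 1e6, time.time() - t0))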
I found the original email equally disappointing, though. It boils down to "We pushed the envelope on size, it's too slow, we'd like to speed it up." Well, duh.
He uses the word 'scalability' early in the email, but shows no indication that he knows what it means. I'd love to hear if different operations slow down at different rates as the repo accumulates commits. Do they scale linearly, sublinearly, or superlinearly as the repo grows? Are there step functions at which there's a sudden dramatic slowdown (ran out of RAM, etc.)?
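The width dimension, at least, is cheap to measure yourself. A rough harness (synthetic one-line files, so it probes file count only, not history depth; expect the largest run to take a while):

    import os, subprocess, tempfile, time

    def status_time(n_files):
        d = tempfile.mkdtemp()
        subprocess.run(["git", "init", "-q", d], check=True)
        for i in range(n_files):
            sub = os.path.join(d, "dir%03d" % (i % 100))
            os.makedirs(sub, exist_ok=True)
            with open(os.path.join(sub, "f%d.txt" % i), "w") as f:
                f.write("x\n")
        subprocess.run(["git", "add", "."], cwd=d, check=True)
        subprocess.run(["git", "-c", "user.name=t", "-c", "user.email=t@example.com",
                        "commit", "-qm", "init"], cwd=d, check=True)
        t0 = time.time()
        subprocess.run(["git", "status"], cwd=d, stdout=subprocess.DEVNULL, check=True)
        return time.time() - t0

    for n in (1000, 10000, 100000):
        print(n, status_time(n))

Plotting a few points like that would at least show whether "status" is linear in file count or worse.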
It's intentionally vague, but with enough details that if you're actually in a position to help, you'll recognize what's going on and can contact them directly to get more information.
You don't spill internal processes and configurations without some kind of disclosure agreements and certainly not in a public forum.
There's no need to spill internal processes and configurations. The fellow said he had a synthetic repo that he used to benchmark various operations. Surely whatever generated that test repo can scale it up or down to whatever size they like, so you can benchmark at various points and collect the data that would tell us if there is some horrible non-linear scaling going on under the covers.
Right now it sounds like he's just trying to see what the possible solutions for his issues are. If he can provide additional benchmarks, etc., great. But he's under no obligation to provide any more than he has. Once there's a solution, then maybe.
Of course he's not under any obligation to provide any more info than he has. But given that he already has the test harness setup, and that only he has access to the hardware on which his benchmarks ran, it seems that he could easily enable more people to help him by providing additional data points.
I'm not asking for secrets here. I'm asking for some sign that he has a well-defined problem to solve.
How git performs as repo size grows to 15GB isn't hidden in a vault at facebook somewhere; I suspect they just haven't done anything more detailed than a superficial time measurement.
And as much as I like the ideal of being truly open, it falls apart when you're dealing with competition (not cooperation) and money. At best you try to keep things open enough.
I don't see how Facebook's build needs to be kept secret. It's a purely internal process, and while they might lose something by giving details, they can also gain if someone suggests improvements. That said, there are plenty of things they need to keep secret. E.g., letting anyone export FB's full social graph would be really stupid.
Git and HG:
1. Require you to be sync'ed to tip before pushing.
2. Cannot selectively check out files.
The former means that in any reasonably sized team, you will be forced to sync 30 times a day, even if you are the only one editing your section of the source tree. The latter means that Joe, who is checking in the libraries for (huge open source project) for some testing, increases everyone's repo by that much, forever, even if it's deleted later.
Needless to say, the universal response is that I'm doing it wrong.
Perforce 4 life!
But seriously, it says that Google adopted Git for their repo -- does anyone know how they use it? I would expect them to want a linear history, but their teams are way too big to be able to have everyone sync'ed to tip to push...
That's not the case. In fact, in the context of Linux kernel development, there are many emails on LKML where Linus is telling someone that they shouldn't be merging random-kernel-of-the-day into their development branch.
Git is not used for their main repo. Git is used as a local cache for perforce where a branch roughly corresponds to a CL. Only subtrees of interest are checked out.
> Git is used as a local cache for perforce where a branch roughly corresponds to a CL. Only subtrees of interest are checked out.
That's a common use for git at Google, but not the only one (I'm a SWE). When I do use perforce I've got enough rhythm that it doesn't get in my way, but I really like git at Google for local branches on rapidly-changing subtrees. A lot of time I'll work on a branch to submit as a CL, but then realize I should do something else that depends on it. Perforce is a mess at this situation if the tree is changing much, and git is perfect if you just make a new branch.
That's the problem. It's NOT an easy problem to solve.
A lot of posts on hn describing some problem elicit "Why, that's no problem at all!" responses or "That's the wrong problem to think about" responses.
Honestly, that mindset is often really useful in programming, but when we get a problem that doesn't have a shortcut and is relevant, conversation goes to shit. Because I guess that's when programmers normally go into a hole and brute-force brain it out.
How to use mass comms to talk about a difficult open problem is, I suppose, itself an open problem.
It comes out of the Linux kernel, where you need a secure hash of a segment to prevent compromise. For big projects you have submodules; you can only go a level higher later.
In a company, you trust the sources. With Perforce you check out files and work with the part you want.
It is a design decision, and they could have known this beforehand.
It is my guess (though I have no proof) that most places with particularly large repositories have lots of binary files in them. It's hard to get a 15GB repository if you just have text.
This sort of thing suggests a centralized check-in/check-out model, because binary files are difficult to merge sensibly, and nobody wants to spend terabytes of hard drive space storing the repository locally. And your centralized check-in/check-out needs, whatever scale they might be, are probably tolerably well served by one of the existing solutions.
Yes, but why is that a showstopper? It's a small market filled only with people who typically have large fistfuls of cash and are dependent on version control. It's a small market, but companies in it have the resources for a good solution.
Because those companies generally already pay the $$$ for Perforce (which has any number of deeply terrifying, shiny red candy-like self-destruct buttons and makes git's user interface look kind), which for all its other faults handles this specific use case extremely well.
And also paying Perforce fistfuls of cash in licensing fees. I hear that Perforce is a quite a small company, and the founder wrote the lion's share of the code a couple decades ago.
I think they are probably on par with craigslist in profits per employee (i.e. much higher than Google or Facebook. Interestingly, I think Facebook has about 1/10 the employees of Google with 1/10 the profits -- off the top of my head, feel free to correct me -- so I don't think they blew it out of the park with their IPO filing).
Perforce is quite expensive, yes. I don't understand the rest of your comments though. I'm not sure why company size, code author, or profit margins are relevant. Perforce is used by every major gaming studio, Pixar, Nvidia, and many more.
If I were to make a snarky comment it would be that Git is for poor people and Perforce is what you use when you grow up. That's not an even remotely reasonable statement, but it does have a teeny, tiny hint of truth to it. :)
I'm just pointing out that Perforce is making crazy profit, and somewhat ironically it's doing so more efficiently (I conjecture) than Facebook, which you are hearing a lot more about.
Perforce is a great system, but it's showing its age by now. I think there is probably room for someone to make another product in the high-end space and make boatloads of cash from big companies, but it's not easy.
Yes partly. Doing lots of commits locally before pushing to others is definitely something I like.
Another part of it is working disconnected -- with so many people coding on their laptops that's actually a pretty common use case.
Also the lack of need to do sysadmin work on git/hg is really nice. I used to run the free Perforce server a long time ago for myself, but it was annoying to do the backups. With git or hg you get whole-repository backups for free.
The "big repository with all dependencies model" has its drawbacks but it's interesting that facebook finds a lot of use for it, and that git is unsuitable for it. Perforce is probably still their best choice in that case.
The most recent version of Perforce added streams which is their primary answer to git and hg. Easy creation, management, and switching between branches. I've only used this at home and not in a large scale environment yet, but it's promising.
Later this year they are adding p4 Sandbox which allows for disconnected work. When that is complete and working I'm honestly not sure what advantage git will have left other than being free.
Perforce just raised their limit on the free version to 20 users and 20 workspaces, it used to be 2 users. We use it at Tinkercad and have been very happy, I used it at Google previously. The price (free) is acceptable for almost any small development organization.
If you have used ClearCase, you will know that while it's a great solution to a "what shall we do with our buckets of cash" problem, it's not something that anyone encountering performance problems would reach for.
Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code, but also wasteful.
So often, the tool used to manage the central repository, which needs to cleanly handle a large codebase, is different from the tool developers use for day-to-day work, which only needs to handle a small subset. At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis. This model seems to scale fairly well; Google has a big codebase with a lot of reuse, but all my git operations execute instantaneously.
Many projects can "shard" their code across repositories, but this is usually an unhappy compromise.
People always use the Linux kernel as an example of a big project, but even as open source projects go, it's pretty tiny. Compare the entire CPAN to Linux, for example. It's nice that I can update CPAN modules one at a time, but it would be nicer if I could fix a bug in my module and all modules that depend on it in one commit. But I can't, because CPAN is sharded across many different developers and repositories. This makes working on one module fast but working on a subset of modules impossible.
So really, Facebook is not being ridiculous here. Many companies have the same problem and decide not to handle it at all. Facebook realizes they want great developer tools and continuous integration across all their projects. And Git just doesn't work for that.
> At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis.
At MS we also use Perforce (aka Source Depot), and I've toyed with the idea of doing something similar. Have you found any guides for "gotchas" or care to share what you've learned going this route?
I used git-p4 at my last job, and the only thing that ever got weird was p4 branches. At Google we have an internal tool that's similar to git-p4, and it always works perfectly for me. Enough developers are using it such that most of the internal tools understand that a working copy could be a git repository instead of a p4 client.
So if you're planning on doing this at your own company, my advice is to write your own scripts that make whatever conventions you have automatic, and to move everyone over at the same time. That way, you won't be the weird one whose stuff is always broken.
I think most people got burned by cvs2svn and git-svn and think that using two version control systems at once is intrinsically broken. It's not. svn was just too weird to translate to or from. (People that skipped svn and went right from cvs to git had almost no problems, I'm told.)
Eric Raymond talks about the problems of converting svn repos to git and is promising a new release of reposurgeon soon that handles svn well. http://esr.ibiblio.org/?p=4071
Facebook uses Subversion for its trunk, actually, and just gets developers to use git-svn. This issue is primarily a problem because git-svn is a lot more serious about replicating the true git experience (keep everything local) than Google's p4-git wrapper is. They really just need to be a little less religious about keeping everything local.
> Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code,
Can you expand on this? I would love to talk more about the "well known" part, I've never run across it before. I am a maintainer (tools guy actually) of a hg repo with about 120 subrepos, and the whole approach with subrepos is something that we're not thrilled about. Oh, and if you want to communicate via email, I'd be up for that too.
Repo is a repository management tool that we built on top
of Git. Repo unifies the many Git repositories when
necessary, does the uploads to our revision control
system, and automates parts of the Android development
workflow. Repo is not meant to replace Git, only to make
it easier to work with Git in the context of Android. The
repo command is an executable Python script that you can
put anywhere in your path. In working with the Android
source files, you will use Repo for across-network
operations. For example, with a single Repo command you
can download files from multiple repositories into your
local working directory.
With approximately 8.5 million lines of code (not
including things like the Linux Kernel!), keeping this all
in one git tree would've been problematic for a few reasons:
* We want to delineate access control based on location in the tree.
* We want to be able to make some components replaceable at a later date.
* We needed trivial overlays for OEMs and other projects who either aren't ready or aren't able to embrace open source.
* We don't want our most technical people to spend their time as patch monkeys.
The repo tool uses an XML-based manifest file describing
where the upstream repositories are, and how to merge them
into a single working checkout. repo will recurse across
all the git subtrees and handle uploads, pulls, and other
needed items. repo has built-in knowledge of topic
branches and makes working with them an essential part of
the workflow.
Looks like it's worth taking a serious look at this repo script, as it's been used in production for Android. Might allow splitting into multiple git repositories for performance while still retaining some of the benefits of a single repository.
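For the curious, a manifest is just a small XML file roughly along these lines (the remote URL, project names, and paths below are made up), and repo sync walks that list, managing one git checkout per <project>:

    import xml.etree.ElementTree as ET

    MANIFEST = """
    <manifest>
      <remote  name="origin" fetch="https://git.example.com/" />
      <default revision="master" remote="origin" />
      <project name="platform/frontend" path="www" />
      <project name="platform/backend"  path="services" />
    </manifest>
    """

    root = ET.fromstring(MANIFEST)
    fetch = root.find("remote").attrib["fetch"]
    for p in root.findall("project"):
        # each project becomes a separate git clone at the given path
        print(p.attrib["path"], "<-", fetch + p.attrib["name"] + ".git")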
> Looks like it's worth taking a serious look at this repo script, as it's been used in production for Android. Might allow splitting into multiple git repositories for performance while still retaining some of the benefits of a single repository.
Stay away from Repo and Gerrit. I use them at work, and they make my life miserable.
Repo was written years ago, when Git did not have submodules, a feature where you can put repositories inside repositories. Git submodules are far superior to Repo, and allow you to e.g. bisect the history of many repositories.
I'm hoping that Google comes to its senses and starts phasing out Repo in favor of Git submodules in Android development.
Compared to the code review facilities in GitHub, Gerrit is pretty crappy. Gets the job done, but the UI and the work flow it forces is a bit annoying.
The worst part of repo + gerrit is that their default work flow is based on cherry-picking, and they introduce a new concept called Change-Id. The Change-Id is basically yet another unique identifier for changes that is stored in the commit message. The intent is that you make a change (a single-commit patch), a post-commit hook adds the Change-Id to the commit message, and then you upload it for review. When you make additions to your change, the previous change gets overwritten. Gerrit tries to maintain some kind of branching (called dependencies), but things get messed up when there's more than one person working on a few changes at the same time.
In comparison with GitHub-style work flow where you make a branch with multiple commits, submit a pull request, get review, add commits, squash and merge, the repo + gerrit model is awfully constraining.
We might be using an old version of repo and/or gerrit and some of the issues I've encountered may be improved. However, I think that repo+gerrit is a mess beyond repair and trying to "fix" it only makes things worse.
Unless you work on Android and are forced to use repo+gerrit because Google does so, stay out of it.
I'm using it, the workflow isn't so bad. In fact it is similar to the way kernel patches are iterated and reviewed, except centralised instead of e-mail based.
Basically, if you want to manage a large collection of git source repositories, you'll probably end up using Repo and Gerrit and piggybacking on the work of the android ecosystem (and beyond, gerrit is used all over the place now)
There really isn't another solution out there right now (at least not anything open source) for very large single repositories.
> There really isn't another solution out there right now
What about Git submodules? They do fundamentally the same thing as Repo, but it's a built-in Git feature and not a bunch of scripts.
Repo can make your life very hard and you have to be a black belt Git ninja to understand what's going on when things don't go as you intended. Git submodules don't depend on having arbitrary GUID strings in your commit messages either (like Repo's Change-Id).
GitHub's reviews can handle Git submodules (but it's not free or open source). If someone knows any open source code review tools that can handle them, please tell us.
Sorry for beating a dead horse, but I really want to save someone from fucking up (or at least re-centralizing) their workflow with repo scripts, when native git is better.
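For anyone who hasn't used them: the superproject records an exact commit for each submodule, so every superproject commit is a consistent snapshot of the whole collection. That's also why bisecting the combined history works - check out a superproject commit, run git submodule update, test, repeat. A bare-bones sketch of the setup, shelled out from Python (the URLs and paths are made up):

    import subprocess

    def git(*args, cwd="."):
        subprocess.run(["git", *args], cwd=cwd, check=True)

    # One-time setup in the superproject:
    git("submodule", "add", "https://git.example.com/platform.git", "platform")
    git("submodule", "add", "https://git.example.com/app.git", "app")
    git("commit", "-m", "Pin platform and app at known-good commits")

    # Anyone checking out any superproject commit gets exactly those pins back:
    git("submodule", "update", "--init", "--recursive")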
Having worked with repo professionally, I'm not a fan. You lose the simple ability to track dependencies across repositories, or even to revert to a previous consistent point in time without diligent tagging. Even with good tags, restructuring your project setup and changing your repo manifest can still break your ability to go back in time.
Huh, fascinating. git was initially created for Linux kernel development, and I haven't heard of any issues there. Offhand I would have said, as a codebase, the Linux kernel would be larger and more complex than Facebook's, but I don't have a great sense of everything involved in both cases.
So what's the story here: kernel developers put up with longer git times, the kernel is better organized, the scope of facebook is more massive even than the linux kernel, or there's some inherent design in git that works better for kernel work than web work?
The linux kernel has an order of magnitude fewer files than Facebook's test repository (25,000 in version 2.6.27, according to http://www.schoenitzer.de/lks/lks_en.html) and only 9 million lines of text.
This is on the largish side for a single project, but if Facebook likes to keep all their work in single repo then it isn't too difficult to go way beyond those stats. Think of keeping all GNU projects in a single repo.
It isn't surprising if Facebook has a large, highly coupled code base. Given their reputation for tight timelines and maverick-advocacy, I'm continually surprised the thing works at all.
I wouldn't say that a large repository implies that the code is highly-coupled. There are advantages for keeping certain code together in a single repo. Being able to easily discover users of functions of a library, being able to "transactionally" commit an update to a library (or set of libraries) and the code that uses it, being able to do code review over changes of code in various places, being able to discover if someone else has solved this problem before, and so forth. If you only have your project and its libraries checked out, you don't serendipitously discover things in other projects.
As mentioned in this talk on how Facebook worked on visualizing interdependence between modules to drive down coupling, https://www.facebook.com/note.php?note_id=10150187460703920 , there are at least 10k modules with clear dependency information in a front-end repo, and the situation probably is a lot better now that they have that information-dense representation to work from (I don't work on the PHP/www side of things, I spend most of my time in back-end and operations repos).
None of what you mention here precludes breaking up the code into many smaller repositories, and then having them all linked together in one super-repository.
Then tags at the super-repository level can record the exact state of all submodules.
It's not about not checking the other modules out; you can make this the standard behavior, sure. Instead it's about having git manage reasonable sized blocks of the code base.
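That works because a superproject commit already records each submodule's exact commit as a "gitlink" tree entry, so tagging the superproject pins everything. A small sketch of reading those pins back out for any tag or commit (the tag name in the example is hypothetical):

    import subprocess

    def pinned_submodules(rev="HEAD"):
        # Gitlink entries have mode 160000; each one records the exact submodule commit.
        out = subprocess.run(["git", "ls-tree", "-r", rev],
                             capture_output=True, text=True, check=True).stdout
        pins = {}
        for line in out.splitlines():
            meta, path = line.split("\t", 1)
            mode, objtype, sha = meta.split()
            if mode == "160000":
                pins[path] = sha
        return pins

    # e.g. pinned_submodules("release-2012-03") -> {"platform": "...", "app": "..."}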
2) If you have inter-dependencies on modules you have to grapple with the "diamond dependency" problem. Say module A depends on module B and C, and suppose that module B also depends on C. However, module B depends on C v2.0 but A depends on C v1.0. If they're all split across repositories it's not possible to update a core API in an atomic commit.
3) Now you rely on changes being merged "up" to the root and then you have to merge it "down" to your project. This is one of the reasons Vista was such a slow motion train wreck: http://moishelettvin.blogspot.com/2006/11/windows-shutdown-c... -- kernel changes had to be merged up to the root, then down to the UI, requiring months of very slow iterations to get it right.
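To spell out the diamond in point 2 with a toy example (the module names and version pins here are hypothetical):

    deps = {
        "A": {"B": "1.0", "C": "1.0"},   # A pins C v1.0
        "B": {"C": "2.0"},               # but B, which A also uses, pins C v2.0
        "C": {},
    }

    def diamond_conflicts(deps):
        wanted = {}                      # dep -> {version: [who wants it]}
        for module, pins in deps.items():
            for dep, version in pins.items():
                wanted.setdefault(dep, {}).setdefault(version, []).append(module)
        return {d: v for d, v in wanted.items() if len(v) > 1}

    print(diamond_conflicts(deps))
    # -> {'C': {'1.0': ['A'], '2.0': ['B']}}; with C in its own repo there is no
    #    single atomic commit that moves A, B and C forward together.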
Keep in mind that the talk in question is talking about the web site (and some other stuff) going into production, and as is mentioned in the talk, it is done more than once a week, and the whole shebang can be pushed in some number of minutes (I forget the amount mentioned).
Back-end services have their own release schedules and times, and obviously are made to be highly backward compatible so that they don't need to be done in lock-step with the front-end.
I think you're right about the "diamond dependency" model, but I think the merge-up and merge-down in Vista had more to do with having multiple independent branches in flight at the same time.
There are ALREADY interdependency issues if you're using git. Anyone could have changed anything before you did your last commit. If you pull and then push your changes without running tests, you're already risking breaking the build.
If a diamond dependency conflict came up, it shouldn't ever be committed to the TRUNK. Whoever made the change that causes a conflict should ideally discover that conflict before they submit it to TRUNK. But it's still not relevant to the decision to have one big repo or a hundred smaller ones, since the exact same problem can come up in either case if you fail to test before you submit.
With the proper workflow, having a lot of smaller trees can be functionally equivalent to having one massive tree. Checking in your branch to the TRUNK would require that you have the latest version of the TRUNK, and if you don't, then you'd need to pull the latest, just like how git works now. And updates to the TRUNK ARE atomic: Before you update TRUNK, it's pointing to the previously working version, and after you update, it's pointing to the one you just tested to make sure it still works.
And, just like now, a developer should therefore test to make sure the merge works after they've grabbed the latest.
It sounds like you're assuming that, if I update one of the submodules, it could break TRUNK? That can't happen without someone trying to commit it to TRUNK. Git remembers a specific commit for each submodule, and doesn't move forward to a new commit without being told to.
I'm not sure this would work - if an operation needs to stat() every file in the repo (for example), whether it is 10k files in 1 repo or 1k files each in 10 repos will probably be just as bad?
An operation in git makes you stat() each file in the current repo -- so things like check-ins and local operations would be done 100% in the current repo.
Any time you were pulling the entire repo tree, it could be slow, yes. But assuming people are only working in one or a small number of repos at once, you can imagine a workflow that didn't involve nearly so many operations on the entire tree.
Ah, I see. Yes, that might make some of those operations better. Other operations that are common in our workflow might still need to look through the whole super-repo - for example getting a list of all changes (staged or unstaged) in the repo and generating a code review based on that.
(I almost habitually run "git status" whenever I've task switched away from code for even a few seconds to make sure I know exactly what I've done, which would have to look over the whole super-repo as well.)
Thankfully we're a while away from the times based on the synthetic test - it's not something I notice at all, but I probably write less code than most engineers here.
In the open-source world, when you want to change an API, you have to either add the change as a new API (leaving the existing API intact) or break backward compatibility and maintain parallel versions, gradually migrating users off of the old version.
Both of these options are a huge pain, and have a direct cost (larger API surface or parallel maintenance/migration efforts). When your entire repo and all callers are in the same code-base, you have a much more attractive option: change the API and all callers in a single changelist. You've now cleaned up your API without incurring any of the costs of the two open-source options.
This is why it can be nice, even if you have a bunch of nicely structured components, to have all code in a single repository.
While it may not be, I do find it an everyday battle to keep my PHP well 'styled'. Sure, PHP is the first language I learnt, and I do use a framework, but sometimes a long method is easier in the short run than writing good model functions. And I have models, but mostly for the ORM, and they're all completely interconnected.
PHP makes me lazy, fast.
That's pretty much the same in any language - if you don't have the discipline to keep your code in a good state in one, why would you suddenly gain that discipline in another?
The Linux kernel is well over an order of magnitude smaller. They are talking about 1.3 million files totaling nearly 10GB for the working tree. My kernel checkout has 39 thousand files totaling 489MB.
While I'd be interested in seeing this issue further unfold, just the prospect of a 1.3M-file repo gives me the creeps.
I'm not sure what the exact situation at Facebook is with this repository, but I'm positive that if they had to start with a clean slate, this repo would easily find itself broken up into at least a dozen different repos.
Not to mention the fact that if _git_ has issues dealing with 1.3M files, I wonder what other (D)VCS they're thinking of as an alternative that would be more performant.
A lot of big companies have repos 10 or 100 times that size. With tens of millions of files, sometimes up to 100 gigs or more of data under source control.
Most people like that use Perforce (e.g., Google).
And no, they don't split into multiple repos, they might well have the entire company's source code in a single repository (code sharing is way easier this way).
That's a pretty terrible way to share code. Simple example: I work on a project, write some code. Turns out that code is useful for someone else, so they reach in and include it, which is easy since it's all the same repo and all delivered to the build system. Now I make a change and break some customer I didn't even know exists. Oops.
If you read further down the thread, they say that the repo has already had the non-interlinked files split out. What they've got left isn't easily broken up.
That C++ compiler is a single product (okay, you might have built a linker, and an assembler as well - say 3-5 products). In even medium enterprises (say, 500 employees, about 250 developers) you might have upwards of 35 different products, each of which with a 5-6 year active history.
Enterprise source control can be ugly - particularly if you have non-text resources (Art, Firmware Binaries, tools) that need to be checked in and version managed as well.
With all that said - I don't really understand why all the code is in a single repository. Surely a company of Facebook's size would experience some fairly great benefits from compartmentalization and published service interfaces. I guess I agree with the parent - sounds like a lot of intertwined spaghetti code. :-)
There are costs and benefits both ways. AFAIK, Microsoft and Amazon both use the separate-repositories model, and Google and Facebook use a single large repository. Most people I know who have worked in both of these styles prefer the Google/Facebook style.
The biggest advantage of a single repository is pretty intangible - it's cultural. When anyone can change anything or can use any code, people feel like the whole company belongs to them, and they're responsible for Google/Facebook's success as a whole. People will spontaneously come together to accomplish some user need, and they can refactor to simplify things across component boundaries, and you don't get the sort of political infighting that tends to plague large organizations where people in subprojects never interact with each other.
I think if it were my company, I'd want the single repository model, but there need to be tools and practices to manage API complexity. I dunno what those tools & practices would look like; there are some very smart people in Google that are grappling with this problem though.
Why is a single repo required for everybody to see all the code? Tools like gerrit and github can handle multiple repos and provide commit access for multiple repos among a large group of people. If it were my company, I would keep separate repos but allow read and merge requests for all employees. That keeps everybody involved in projects across the entire company, but also allows them to notice when individual projects get spaghettified and thereby deserving of some cleanup/breakup into components. A GB-scale codebase does not help smart, new employees grok what the hell they can contribute.
It's not a matter of being able to see all the code, it's a matter of being able to see and modify all the code. It allows you to have a "just fix it" culture when people see something's broken, and it lets you write changes that span multiple projects without worrying about how your change will behave when it can't be committed atomically.
Well, of course at some scale you're going to start to have trouble with any DVCS maintaining a complete local copy of such a huge repository.
It's even worse than just disk space and performance issues.
I can totally imagine a huge, busy repository where by the time you've pulled and rebased/merged your stuff, the repo has already been committed to again, invalidating your fast-forward commit and forcing you to pull again and again before you have any chance of pushing back your changes.
This is an inherent problem with DVCS that just can't be solved (trivially) when working on huge repositories that span millions of files and involve thousands of developers.
> They keep every project in a single repo, mystery solved.
This kind of "Duh, look what you're doing" response isn't really justified.
Sure, splitting up your repository would make things faster, but having to maintain multiple repositories is a major headache for the end-users of git. If it's possible, why not fix its scalability so that you don't have to worry about it?
I agree. Users are always right, even for open source software. Requiring users to change their ways is an easy way to lose them. A major player like Facebook willing to use Git is a good thing, as it may help spread the software in companies like Facebook. Git should try its best to please its users, rather than the other way round.
> They keep every project in a single repo, mystery solved.
That's not true:
> It is based on a growth model of two of our current repositories (I.e., it's not a perforce import). We already have some of the easily separable projects in separate repositories, like HPHP. If we could split our largest repos into multiple ones, that would help the scaling issue. However, the code in those repos is rather interdependent and we believe it'd hurt more than help to split it up, at least for the medium-term future.
They already have multiple repositories, the stats they're doing there is based on "two of [their] current repositories" implying more than two.
Others have tried, and keep throwing more and more smart people at a problem they just shouldn't have.
MSFT, with the Windows codebase that runs out of several labs. Crazy branching and merging infrastructure. They use Source Depot, originally a clone of Perforce.
Google, with all their source code in one Perforce repo.
Facebook will be on Perforce before we know it.
The solution is an internal Github, not one giant project.
Pointing at a big company and saying "they're doing it wrong" is easy enough to do, but you have to remember that every decision comes with tradeoffs. Take Google's codebase, since it's the one I know the best. A couple of the key decisions:
* Single rooted tree. Separated repositories would make it harder to share code, leading to more duplication.
* Build from head. We build everything from source, statically linked. No need to worry about multiple versions of dependencies, no lag between a bug fix and it being available to any and all binaries that need it, whenever they're next updated.
I don't think that an "internal Github" is going be a magic bullet here. It's more likely it would be a matter of trading one set of hard problems for another, as we all of a sudden need to figure out how to do cross-dependencies sanely, deal with multiple versions of libraries, etc, at scale. You are correct that one monolithic Perforce repo is a bit of a pain point, but that doesn't necessarily mean that the right decision is to shatter our codebase into different pieces - we'd rather make our repo scale better. For reference, we've already got hundreds of millions of lines of code, 20+ changes/minute 6 months ago (so what, 30+ now?), and plans for scaling the next 10x are in motion.
If you're interested, I recommend http://google-engtools.blogspot.com/. It details a number of the problems we've run into, and our solutions for dealing with them at scale.
> Single rooted tree. Separated repositories would make it harder to share code, leading to more duplication.
I'm not convinced that the difference between a singly rooted tree and a multiple-rooted tree is going to make that much difference. I mean, think about it... if you have 100k's or even millions of files, is anybody going to parse through all of that, looking for a reusable function, even if it is on their workstation?
And sure a compiled language would catch naming collisions on functions or whatever, but nothing stops somebody from creating a method
doQuickSort( ... )
and somebody else creating
quickSortFoo(...)
where they are semantically equivalent (or very nearly so).
It seems to me that the problem of duplicating code, because you don't know that a method already exists to do what you're trying to do, is the same problem regardless of how your tree is laid out; and is ultimately more of a documentation / process / discipline issue. But I'd be curious to hear the counter-argument to that...
Yes, in fact. We have some great tools that give us full search over our entire codebase (think Google Code Search), and you can add a dependency on a piece of code without needing to have it on your workstation already. The magic filesystem our build tools use knows where to get it and can do so on demand. Combined with good code location conventions, an overall attitude that promotes reuse over rewrites* and mandatory code reviews where someone can suggest a better approach, we do a pretty good job. Not everything is eliminated, of course, but I'm pretty happy with the state of things.
To your example, we'd use the STL for most of our sorting needs, but if you were to want, say, case-insensitive string sorting, I can tell you where to find it (ASCII, UTF8, or other). If you want a random number, any RNG you could want is available. Most data structures you could name have been written and tested already. Libraries for controlling how your binaries dump core, command line flags are parsed, callbacks are invoked, etc etc are readily available. We really do reuse code as much as possible, and it's wonderful to have ready access to all of this whenever you could ask.
*At a method level, anyways...we're famous for writing ever more file systems ;).
I've seen that before myself at other companies, and it's a shame. A healthy codebase is an investment in the future - if you're not taking the time to cultivate it you're sacrificing long term usability for short term gains. The larger the codebase the more difficult the task, of course, but for us that's just an excuse to solve the next hard problem :).
Awesome, glad to hear you guys take reuse so seriously. I'm a little surprised, only because - in my experience - so few organizations put in the effort that you guys do.
Knowing that somebody is taking the effort to get this sort of thing right is really, unspeakably awesome. Though I guess not really, given that I'm posting it. This sort of thing is one of the main reasons I read HN.
Hopefully the methodology will filter out into the wider world one day... Anyway, thank you for posting it!
Google has everything in one Perforce repo? You mean the search engine, right?
I agree btw, the GitHub mindset is the best one. Create a new repo for every project and connect them with build tools. But why not hire 100 SOA consultants? They have enough money now.
No, literally the entire codebase for all of their products is in one Perforce repo. Ashish Kumar, manager of the Engineering Tools team, mentions it in this presentation: http://www.infoq.com/presentations/Development-at-Google
Kernel/Android/Chrome/basically anything open-source is different. If the code is going to be open-sourced, it can't have dependencies on proprietary code anyway.
The open-source stuff is a rounding error. Think about all the Google products; Search, Google+, Gmail, Groups, Translate, Maps, Docs, Calendar, Checkout, Wallet, Voice, ... those are all in one repository. (Not to mention all the libraries and internal tools; those are all in there too.)
To be fair, Android and Chrome are pretty huge projects. I know the numbers (though I don't think I can share them outside of Google), and while they're nowhere close to being a big part of the total, they're also big enough to not be considered a rounding error.
Ah, rachelbythebay, you caught him being "technically incorrect". Which, depending on how you look at it, is either the best or worst kind of correctness.
Large repos bring their own problems, and result in some design decisions accordingly. For example, Visual Studio itself is 5M+ files, and this affected some of the initial design decisions (server-side workspaces, for this example) when developing TFS 2005 (the first version) [1]. That decision suits MS well, but not small to medium clients. So they're now supplementing that design with client-side workspaces.
It's not wise to tell Facebook to just split the repository. Looks like it's time to improve the tool.
I can believe this, as I work with a former Facebook employee. They do not believe in separating or distilling anything into separate repos. Why the fuck would you want to have a 15GB repo?
Ideally they should have many small, manageable repositories that are well tested and owned by a specific group/person/whatever. At least something small enough a single dev or team can get their head around.
And then each of those dev teams can spend 1/2 their time writing code other people in the company have already written or every team can spend 1/2 their time publishing and reading documentation about what has been written.
There is no simple answer. There is only optimization for a particular problem-set you are trying to minimize.
> And then each of those dev teams can spend 1/2 their time writing code other people in the company have already written or every team can spend 1/2 their time publishing and reading documentation about what has been written.
I don't see what this has to do with a discussion of one repo vs multiple repos.
You think that in a multi repo world, the engineers aren't as aware of what code exists and where as they are in a single repo world? You think that code duplication and needing to read docs magically doesn't exist in a single repo world?
The number of repositories is just an organizational construct. Communication still must take place no matter what.
the obvious answer, repeatedly mentioned in comments:
> factor into modules, one project per repo
Where I work we have a project with clear module boundaries, but all in the same repo. We have an "app" and some dependencies, including our platform/web framework. None of these are stable; they're all growing together. Commits on the app require changes in the platform, and in code review it is helpful to see things all together. Porting commits across different branches requires porting both the application change and the dependent platform changes. Often a client-specific branch will require severe one-off changes, so the platform may diverge -- it is not practical for us (right now) to continually rebase client branches onto the latest platform.
This is just our experience, not Facebook's, but let's face it: real-life software isn't black and white, and discussion that doesn't acknowledge this isn't particularly helpful.
We've got a superproject with our server configs, and sub projects for our background processing, API, and web-frontend respectively.
Often, each project can evolve and be versioned 100% independently. However, often you need to modify multiple projects and (especially with server config changes) coordinate changes via the super project.
It's a little hairy sometimes and often feels like unnecessary overhead, but the mental boundary is extremely valuable on its own. Being able to add a field to the API and check that commit into the superproject for deployment before the front-end features are done is nice. The social impact on implementation modularity is valuable. We write better-factored code by letting Git establish a more concrete boundary for us.
This was actually pretty fascinating to me. On one hand, I am astonished at how long it takes to perform seemingly trivial git operations on repositories at this scale. On the other hand, I'm utterly mystified that a company like Facebook has such monolithic repositories. Even back when I was using SVN a lot, I relied on externals and such to break up large projects into their smaller service-level components.
I'd be very interested to see some benchmarks on their current VCS solution for repositories of this scale.
From a followup post: “We already have some of the easily separable projects in separate repositories, like HPHP. If we could split our largest repos into multiple ones, that would help the scaling issue. However, the code in those repos is rather interdependent and we believe it’d hurt more than help to split it up, at least for the medium-term future. We derive a fair amount of benefit from the code sharing and keeping things together in a single repo, so it's not clear when it’d make sense to get more aggressive splitting things up.”
He notes that these repositories are somewhat broken up already, and wants to keep them together.
There are good reasons to keep code in one repository; particularly, git's submodule support has a number of nasty interface tradeoffs; I wouldn't say it breaks git, but you have to keep a clear understanding of all your submodules in your head when you have a lot of them.
OK, it pretty much breaks git to have submodules that are interdependent. I know this because I am currently moving one of my organizations off this exact plan -- it's the opposite of useful and speedy to have to worry about versions across a large number of backend / frontend repositories.
It is MUCH easier and therefore better for developers to put them together, and release together.
"We can build a binary that is more than 1GB (after stripping debug information) in about 15 min, with the help of distcc. Although faster compilation does not directly contribute to run-time efficiency, it helps make the deployment process better."
Not sure if it's still the case, but Google hosts all their internal source code on a modified version of perforce, so they essentially have everything in one repo.
What do Facebook and the National Institutes of Health have in common? I'm pretty sure this will end with Facebook building their own versioning system from scratch and giving it some kitschy name like "Retro".
$100B company, maybe they can afford to put some people onto solving this for the open software community (and put the solution into the open), especially since nobody else in the community seems to have this problem.
If you proposed a good solution I'm sure they'd be happy to provide time and money and open source the result. But most of the responses aren't even that there is a solution - they say to split the repository into smaller pieces and spend time and money internally having their internal developers deal with that.
A good solution will benefit everyone who uses git. Codebases get larger over time. There is more forking and experimentation. More spoken languages can be supported. More computer languages can be interfaced. The O(n) operations becoming less than that will benefit you in the future as your code grows.
This is Joshua (who posted the original email). I'm glad to see so much interest in source control scalability. If there are others who have ever contemplated investing a bit of time to improving git, it'd be great to coordinate and see what makes sense to do - even if it turns out that the right answer is just to make the tools that manage multiple repos so good that it feels as easy as a single repo.
There are two issues: the width of the repository (the number of files) and the depth (the number of commits).
Since "status" and "commit" perform fairly well after the OS file cache has been warmed up, that probably can be resolved by having background processes that keep it warm. (Also, how long would it take to just simply stat that number of files? )
The issue of "blame" still taking over 10 minutes: we need to know how far back in the history they're searching. What happens if there's one line that hasn't been changed since the initial commit? Are you being forced to go back through the whole commit history?
How old is the repository? Years? Months? My guess is at least years, based on the number of commits (unless the developers are extremely commit-happy).
At a certain point, you're going to be better off taking the tip revisions off a branch and starting a fresh repository. It doesn't matter what SCM/VCS tool you're using (I've been the architect and admin on the implementation of a number of commercial tools). Keep the old repository live for a while and then archive it.
You'll find that while everyone wants to say they absolutely need the full revision history of every project, you rarely go back very far (usually just the last major release or two). And if you do need that history, you can pull it from the archives.
This is an interesting social AND technical problem. The problem for FB is that it is all too easy for them to just fork git, create the necessary interfaces, and then hope the git maintainers will accept it (they mightn't) or release it into the wild (and incur bad karma and the wrath of open-source developers who'd see this as schism or even heresy).
They've reached out to the developers on git, and I guess that's a first step.
Keep in mind that your average repository doesn't only contain code that is compiled and executed (or interpreted); there is also documentation, static assets such as images (which may be processed), configuration, computed files (which may make sense to pre-compute once rather than compute in a hundred people's environments on every build), and so forth.
Also, it doesn't only include the current file set - it includes files that have been deleted, split into modular files, merged, wholesale rewritten, or moved into a new hierarchy (some VCS systems handle this better than others).
(I work at Facebook, but not on the team looking into this stuff. I'm a happy user of their systems though. Keep in mind that the 1.3 million file repo is a synthetic test, not reality.)
The follow-up email still mentions a working directory of 9.5GB. I cannot fathom working on a code repository consisting of 9.5GB of text. There must be something else going on here, even considering any peripheral projects like the iOS and Android apps, etc.
(edit: if there are huge generated files intermingled with code, shouldn't those be hosted on a "pre-generated cache" web server instead of git, for example?)
Our codebase at my employer currently hovers around 5GB in SVN. Binaries and other generated code are intermixed for historical reasons. Removing them is a non-starter due to the amount of time it'd take; the best solution I've come up with so far is to break things out into multiple SVN repos (one for images, one for generated language files, etc.) and then, hopefully, get the code into GitHub while still using the SVN repos externally for stuff that shouldn't be versioned in a distributed manner (versioning that stuff is useful as a convenience - avoiding conflicts, etc.).
I'm surprised Facebook and all its peripheral development has that much source. I would expect something like 5-10 million lines of code, not ~100 million lines implied by the example.
The example is synthetic, so don't worry too much about the implications.
It is useful to keep in mind that Facebook isn't just the front-end (and isn't just code, also images, configuration, and so forth).
Just talking about open source stuff, Facebook has also produced projects like Cassandra, Hive (a data warehousing application), Phabricator (a code review and lifecycle tool), HipHop for PHP (the translator/compiler, the interpreter, and the virtual machine), FlashCache (a kernel driver), Thrift, Scribe, and so forth.
We also have had to build applications to support our operations, so think about what sort of effort goes into building scalable monitoring, configuration management, automatic remediation, logging infrastructure, and so forth.
I don't know the actual lines of code across it all, and wouldn't mention it if I did, but people often underestimate the scale here.
It doesn't live in a single repository. The commenter I was replying to mentioned "Facebook and all its peripheral development" and a number of lines of code. I wanted to give him a little insight into what sort of things all the peripheral development might include, since it isn't obvious.
I don't think Git was designed to perform well with such a large repo. In this case, the best practice is probably to compartmentalize the code and use Git submodules. The Git submodule interface is a little unfriendly, but I think it does work well for such large repos. I've been using submodules successfully for development that tracks source files as well as binary assets.
I think it's a bad practice to keep a giant code base in one repo. Split the code base into purpose-specific modules, just as you would split any project into purpose-specific modules. In fact, those two things might well line up 1:1.
If a project depends on other projects, have it reference the other projects. Where appropriate, include exact version numbers and/or commit hashes. Gemfiles are a good example of this practice at work.
Yes, git has submodules for this sort of thing, but after investigating that route I decided against them. Use something independent of the VCS instead. Then git won't do weird or unexpected things when you switch branches. Also, you might want to mix in projects that use other version control systems. And really, why unnecessarily couple a project to its version control system?
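For what it's worth, here's a minimal sketch of what I mean by "something independent of the VCS", under made-up conventions (the deps.manifest format, file names, and directory layout are all hypothetical): a plain-text manifest pins each dependency to a repository URL and an exact commit hash, and a small script checks out exactly those revisions.

    # Hypothetical sketch of a VCS-independent dependency pin. Each line of
    # deps.manifest reads:  <name> <repository-url> <commit-sha>
    # The script clones each dependency and checks out exactly that commit.
    import os
    import subprocess

    MANIFEST = "deps.manifest"
    DEPS_DIR = "deps"

    def sync_deps():
        os.makedirs(DEPS_DIR, exist_ok=True)
        with open(MANIFEST) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                name, url, sha = line.split()
                dest = os.path.join(DEPS_DIR, name)
                if not os.path.isdir(dest):
                    subprocess.check_call(["git", "clone", url, dest])
                # Fetch and pin to the exact commit named in the manifest.
                subprocess.check_call(["git", "fetch", "origin"], cwd=dest)
                subprocess.check_call(["git", "checkout", sha], cwd=dest)

    if __name__ == "__main__":
        sync_deps()

Nothing git-specific leaks into the projects themselves, and bumping a dependency is just editing one line of the manifest.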
If (when?), even after splitting a megaproject into manageable subprojects, these performance issues creep in, I'd certainly be interested in whatever improvements people are coming up with...
I'm curious what their performance numbers look like if they host the .git repo on tmpfs -- 15GB isn't unreasonable on a beefy (24-32GB of ram) machine.
Probably the same as the warm cache results, since that's basically what tmpfs is. I wonder if git does all that stat()ing serially or in parallel, though...
I don't have the link handy, but IIRC we did some experiments with that for Mercurial and found that stat() in parallel didn't really help much until you were using NFS or similarly slow network-type latency filesystems.
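If anyone wants to reproduce that kind of experiment, a rough sketch (hypothetical script, not the one used for Mercurial) is to collect the paths once and then time a serial pass against a thread-pool pass:

    # Rough sketch of a serial-vs-parallel stat() comparison: collect all
    # paths once, then time a serial pass against a 16-thread pool pass.
    import os
    import time
    from concurrent.futures import ThreadPoolExecutor

    def collect_paths(root):
        paths = []
        for dirpath, dirnames, filenames in os.walk(root):
            if ".git" in dirnames:
                dirnames.remove(".git")
            paths.extend(os.path.join(dirpath, f) for f in filenames)
        return paths

    def safe_lstat(path):
        try:
            os.lstat(path)
        except OSError:
            pass  # ignore files that disappear mid-run

    def timed(label, fn):
        start = time.time()
        fn()
        print("%s: %.2fs" % (label, time.time() - start))

    if __name__ == "__main__":
        paths = collect_paths(".")
        timed("serial", lambda: [safe_lstat(p) for p in paths])
        pool = ThreadPoolExecutor(max_workers=16)
        timed("16 threads", lambda: list(pool.map(safe_lstat, paths)))
        pool.shutdown()

On a local filesystem with a warm cache the difference tends to be small; the parallel version mainly pays off on high-latency (e.g. NFS-style) mounts, which matches what we saw.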
Hey, do you know of many sites with such needs? Facebook is the first, and I could probably count all the sites with such needs on the fingers of one hand. I don't think it's a git issue - everyone uses this system and is happy with it. This is more like a feature request than a bug.
If your git repo were this crazy size, why wouldn't you make a tool that takes your git repo and versions it? Keep it in repos that can all stay performant, since most of the time you are working with "time local" information.
My first thought, as suggested by some on the list, was modularization. Redstone's response (that the 1.3 million files are essentially all interdependent) terrifies me.
Multiple people in this thread have asserted that code sharing is much easier when all the code is in a single repo, but from my understanding of submodules, it would be a fairly simple matter to set up pre/post-commit hooks that update submodules to a branch automatically and get useful company-wide change atomicity (after all, changes should only propagate between teams/projects once they have some stability).
Putting aside the question of whether or not an enormous singular repo can be broken up intelligently into modular projects, is there something about the submodule approach that makes it uniquely unsuitable for sharing changes amongst projects?
If you limit change propagation, your changes won't propagate as fast. That goes for bugs and bug fixes.
I can certainly see why you would have the latter propagated instantaneously, or close to it.
There's also the point that if you don't propagate change to everybody at the same time, you'll have dozens of slightly different versions of those projects across your company. The question of submodules vs. large repo is not as easily decided as you think - there are large upsides (and downsides) to both approaches.
No, you're misunderstanding what I'm saying with regards to publishing stable changes.
Say you have two branches: master and next. Stable work goes in master, unstable work goes in next. When the code is ready for consumption, you merge it into master.
Anyone who is using a project has it set up as a submodule. They add post/pre-commit hooks to update all project submodules. These submodules pull from master.
This way, everyone will get all stable changes on all submodule projects at the time of the next change to their own project.
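Concretely, a post-commit hook along these lines might look like the sketch below (hypothetical; it assumes each submodule is configured in .gitmodules with branch = master, so "git submodule update --remote" moves every submodule to the latest commit on its master branch after each local commit):

    #!/usr/bin/env python3
    # Hypothetical .git/hooks/post-commit sketch of the workflow above.
    # Assumes each submodule has branch = master in .gitmodules, so
    # "git submodule update --remote" pulls each submodule's latest master.
    import subprocess
    import sys

    def update_submodules():
        try:
            subprocess.check_call(
                ["git", "submodule", "update", "--init", "--remote", "--merge"])
        except subprocess.CalledProcessError as err:
            # The commit has already happened; just report the failure.
            print("post-commit: submodule update failed: %s" % err,
                  file=sys.stderr)

    if __name__ == "__main__":
        update_submodules()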
I do understand what you mean just fine. Except it doesn't work that way. If I have a critical bug fix in libA, I need it rolled out now. What's more, I need it rolled out across all other projects that use libA, immediately. And no, I don't want to wait until every project has committed a change of its own.
Even more, I (or my team) are not the only ones working on libA. Others are too. So keeping changes in 'next' and pushing to master only on occasion doesn't help much. Yes, it keeps non-working patches out - but that's what local branches on your machine are for.
(I'm not even going to mention the issue of merge conflicts. If you work at massive scale, the longer you stay on a branch, the more likely you are to get a merge conflict. You can easily end up in several weeks of merge hell: pull from master, resolve conflicts, run local tests - oh wait, master has already been updated by somebody else.)
Git is not memory-efficient by design. I once tried to push a commit of about 1GB to the server and it hung forever; I had to abort it and push it in smaller chunks instead.
What I mean is, even if they manage to solve this problem with some tweaks, they will hit the bottleneck again in a year or two, so they should rethink their source code management.