I bet they check in (or have checked in at some point) 3rd party libraries, jar files, generated files, etc. I battle this every day at $dayjob, and we have a multi-GB Subversion repository with a separate one for the 3rd party binaries. Svn handles this a bit better than the DVCSes: just checking out the HEAD is smaller than the full history you get in git/hg, and you can clean up some crud a bit. The history just lives on the central server, not in everyone's working copy.
You can clone just a single revision in git with a shallow clone (git clone --depth 1). I believe it still ends up being around 2x the size, since you have the objects stored for that single revision as well as the working tree.
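For reference, a shallow clone is the git feature that fetches only the most recent revision(s). A minimal sketch in Python, assuming git is installed and on the PATH; the helper names and the example URL are hypothetical:

```python
import subprocess


def shallow_clone_cmd(url, dest, depth=1):
    """Build the git command for a clone truncated to `depth` commits."""
    return ["git", "clone", "--depth", str(depth), url, dest]


def shallow_clone(url, dest, depth=1):
    """Run the shallow clone; raises CalledProcessError if git fails."""
    subprocess.run(shallow_clone_cmd(url, dest, depth), check=True)


if __name__ == "__main__":
    # Hypothetical repository URL, for illustration only.
    print(" ".join(shallow_clone_cmd("https://example.com/repo.git", "repo")))
```

The resulting directory still holds both the packed objects for that revision and the checked-out working tree.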
Facebook now has 7000 employees and is 10 years old. Each employee would have had to write about 336 bytes of code (roughly 8 LOC at 40 characters per LOC, a random guess of mine) every day during those 10 years to produce 8 GiB of code. (Obviously assuming the repository contains only non-compressed code and only one version of everything and no metadata and...)
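Redoing this back-of-the-envelope estimate with the same inputs (7000 employees, 10 years, 8 GiB, 40 characters per LOC) as a quick Python sketch. It additionally assumes a constant headcount and 365 working days a year, neither of which is realistic:

```python
GIB = 2**30

employees = 7000
days = 10 * 365        # ignoring leap days
chars_per_loc = 40     # the comment's own guess

total_bytes = 8 * GIB
bytes_per_employee_day = total_bytes / (employees * days)
loc_per_employee_day = bytes_per_employee_day / chars_per_loc

print(f"{bytes_per_employee_day:.0f} bytes/day, "
      f"{loc_per_employee_day:.1f} LOC/day per employee")
```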
If they've got memcached with their own patches, Linux with their own patches, Hadoop with their own patches, etc., and tons of translations, I can see 8 gigs of text.
Why would they put that all in the same repository? I'm pretty sure this 8 GB repo is just their website code. A frontend dev working on a Timeline feature shouldn't have to check out the Linux kernel.
8 GB would be at least 100 million lines of code (assuming an upper bound of 80 characters per line). For comparison, Linux has 15+ million lines of code and PHP 2+ million.
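The same estimate as a short Python sketch, assuming 8 GB means 8 x 10^9 bytes and taking the line counts for Linux and PHP from the comment as round figures:

```python
repo_bytes = 8 * 10**9       # 8 GB, decimal
max_chars_per_line = 80      # assumed upper bound on line length

# Longer lines mean fewer lines, so 80 chars/line gives a lower bound.
min_lines = repo_bytes // max_chars_per_line

linux_lines = 15_000_000     # "15+ million" from the comment
php_lines = 2_000_000        # "2+ million" from the comment

print(f"at least {min_lines:,} lines")
print(f"{min_lines / linux_lines:.1f}x Linux, {min_lines / php_lines:.0f}x PHP")
```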
PHP's repo is around 500 MB. But I'd say it is probably tens to a hundred times smaller than Facebook's should be, especially if you count the non-public stuff they must have there. So that comes out about right.
Unlikely. They probably have a very dirty repo with tons of binaries, images, blah blah blah. It's highly unlikely they actually wrote 8 GB of code, and the 46 GB .git directory will be littered with binary blob changes, etc. This is really just to "impress" two kinds of people: 1) people who love Facebook, and 2) people who don't know anything about version control and/or how to do proper version control (no binaries in the SCM).
When I can't avoid referring to binary blobs in git, I put them in a separate repo and link them with submodules. It keeps the main repo trim and the whole thing fast while still giving me end-to-end integrity guarantees.
I wrote https://github.com/polydawn/mdm/ to help with this. It goes a step further and puts each binary version in a separate root of history, which means you can pull down only what you need.
You're not going to merge binary files, so git isn't the right tool. The standard way is to use Maven: git handles your sources, and anything binary (libs, resources, etc.) goes on Nexus (where it is versioned centrally) and is referenced in your POMs. Simple and powerful.
git is "the stupid content tracker", not "the stupid merge tool". Even for things you have no intention of branching or merging, it still gives you control over versioning... and there's a huge gap between the level of control a hash tree gives you versus trusting some remote server to consistently give you the same file when you give it the same request.
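To make the hash-tree point concrete: in a SHA-1 repository, git names a file's content (a blob) by hashing a short header plus the bytes themselves, so the same ID always means the same bytes. A minimal Python sketch; the function name is mine, and the scheme shown covers only blobs in SHA-1 repositories:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID git (SHA-1 object format) assigns to a blob."""
    # Header is the object type, a space, the size in bytes, and a NUL.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match `git hash-object` run on a file with the same bytes.
    print(git_blob_id(b"test content\n"))
```

A commit ID in turn covers the tree IDs, which cover the blob IDs, which is why a submodule pinned to a commit pins every byte of the binaries underneath it.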
I'm a bit confused. Whenever I've used git on my projects, I've made sure the binaries were excluded using .gitignore.
Don't other people do that, too? What's the benefit of having binaries stored? I've never needed that; I've never worked on any huge projects, so I might be missing something crucial.
If there is a small number of rarely changing binaries (like icons, tool configs, etc.) then it may not be worth it to move them. The same goes if space is much cheaper than the added tool complexity and build time.
Well, it depends. Images, for instance, are binaries where a text diff makes little sense, so you have a copy of each version of the image ever used. And many projects use programs whose files are binaries. For instance, I've been on a project where Flash was used and the files were checked in. Or Photoshop PSD files, .ai files, etc.
If you don't have the source that produced those binaries your only choice is to have them downloadable from somewhere else (which is a real hassle for the developer) or just check them into the repo.
Re-cloning a fresh repo should keep it small. There's also the git gc command, which cleans up the repo. I guess another way would be to just archive all the history up to a certain point somewhere.
> Re-cloning a fresh repo should keep it small. There's also the git gc command, which cleans up the repo.
There's only so much git gc can do. We've got a 500 MB repo (.git, excluding working copy) at work, for 100k revisions. That's with a fresh clone and having tried most combinations of gc and repack we could think of. Considering the size of Facebook, I can only expect that their repo is deeper (longer history in revision count), broader (more files), more complex, and probably full of binary stuff.
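One way to check how much gc or repack actually buys you is to compare the packed size before and after. A rough Python sketch, assuming git is on the PATH; it reads the output of `git count-objects -v`, whose `size-pack` field is reported in KiB, and the helper names are hypothetical:

```python
import subprocess


def parse_count_objects(output: str) -> dict:
    """Parse the `key: value` lines printed by `git count-objects -v`."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        # Most fields are integers; keep anything else (e.g. paths) as text.
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats


def pack_size_kib(repo_dir="."):
    """Return the total size of the repo's packfiles in KiB."""
    result = subprocess.run(
        ["git", "count-objects", "-v"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return parse_count_objects(result.stdout)["size-pack"]
```

Calling `pack_size_kib()` before and after running `git gc --aggressive` in the same repository shows the difference directly.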
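FWIW: Every time you modify a file, git first stores a completely new (zlib-compressed) copy of it as a new object. 100 MB text file, 100 MB binary file, it makes no difference at that point: modify one line and it's a new full-size entry in your git repo. Packing (git gc or git repack) delta-compresses similar objects afterwards, which helps a lot for text but much less for already-compressed binaries.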