
The worrying point here is the 8GB checkout rather than the 46GB history itself. And if git is fast enough on an SSD, even that is hardly anything to worry about.

I actually prefer monolithic repos (I realize that the slide posted might be in jest). I have seen projects struggle with submodules and with splitting modules into separate repos. People change something in their module. They don't test any upstream modules because it's not their problem anymore. Software in fast-moving companies doesn't work like that. There are always subtle behavioral dependencies (e.g. one module depends on a bug in another module, either by mistake or intentionally). I just prefer having all code and tests of all modules in one place.




As much as I hate ClearCase, I have to say it has some "interesting" features to deal with this. When you look at a repo you don't look at one unique version of the whole repo; you can create quite advanced "queries" such that for folder /src you look at your branch, for folder /3rdpartylibs you look at version 5, and for folder /tests you look at head (latest). And since your working directory can be network attached (a dynamic view), you don't have to "update" manually; head is like a living animal! It's like having subrepos built in.
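For the curious, those "queries" live in what ClearCase calls a config spec. Roughly (syntax from memory, and the path, branch, and label names here are invented), the setup described above would look something like:

  element * CHECKEDOUT
  # your own work: /src follows your dev branch
  element /vobs/proj/src/... .../my_dev_branch/LATEST
  # third-party libs pinned to a label standing in for "version 5"
  element /vobs/proj/3rdpartylibs/... LIBS_V5
  # tests always track head
  element /vobs/proj/tests/... /main/LATEST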

While this is interesting, it also requires a lot of discipline and almost one person dedicated full time as a "DBA" to not end up with spaghetti. Since there is no unique version number for the repo, you have to store these queries and manually keep adding labels to be able to go back to an exact point in time.

It does have some uses, like being able to run a new test on old code to find the first version where something broke, or being able to change versions of external libraries or blob assets quickly, but it's hard to say if it's worth it since it comes with so many other problems.


Most big tech companies use a service-oriented architecture. The website is composed of many small services which communicate with each other over HTTP or RPC protocols. Each service has its own version control repo and is maintained by a different team. That's generally the best way of scaling up.


Facebook's architecture, for at least the core web app, was a blob of PHP for ages.

They might have since modularized and cleaned it up but it seems unlikely they'd fully SOA-ize the Facebook web app.


That only applies to deployment. You're not building these services from the ground up: they're all going to have common libraries that need to stay up to date.


These are all solved problems. You create a package system that allows you to specify versioned dependencies on other packages. Your build and deployment systems can then build your package even though it depends on code that lives in other repositories owned by different teams. Hell, this even works across different version control systems; one team can be lagging along on SVN, another can have packages in P4, and yet another can have theirs in git, but they can all build against each other's code.
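To sketch what that looks like (the manifest format and every name below are made up for illustration; the real format depends on whatever build system you roll):

  # hypothetical manifest for one team's package
  package: checkout-service
  version: 2.3.1
  depends:
    auth-lib:    ">=1.4, <2.0"   # another team, source lives in SVN
    metrics-lib: "~3.2"          # a third team, source lives in git
    imagemagick: "6.8.*"         # external code mirrored into P4

The build system resolves each constraint to a published, versioned artifact, so nobody builds against anyone else's trunk directly.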

It works absolutely brilliantly. Division of labor and responsibility becomes clear, repos stay manageable, large-scale rewrites can happen safely, piecemeal, over time... it really is the best way to do it.


As others have noted elsewhere, this "solution" has its own problems if you are moving rapidly. Which I don't think anyone can claim Facebook hasn't been doing.

So, yes, if you are able to control growth enough that you can make this happen, it is attractive. If you can't, then this leads to a version of the diamond problem in project dependencies. And that is not fun.


Growth is the reason that companies should avoid what Facebook has done. The company I currently work for anticipated the scaling problems that Google later encountered with Perforce (http://www.perforce.com/sites/default/files/still-all-one-se...) and recognized that while Perforce could be scaled further by purchasing increasingly beefy hardware, ultimately you could only reliably scale so far with it.

If you're not growing, then there is no problem. If you have linear growth maybe you can keep pushing it, but who plans on linear growth?

Google is already on multiple Perforce servers because of scaling, and that is not a situation that is going to improve. If you start using multiple centralized version control servers, you are going to want a build/deployment system that has a concept of packages (and package versions) anyway.

> If you can't, then this leads to a version of the diamond problem in project dependencies. And is not fun.

These sorts of dependency resolution conflicts can and do happen, but far less often than you would think. Enforcing semantic versioning goes a long way (and, along with it, specifying dependencies by version range). In practice, the benefits of versioned dependencies (such as avoiding ridiculous workarounds like the one described in this HN comment: https://news.ycombinator.com/item?id=7649374) far outweigh any downsides.
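To make the diamond case concrete (package names and version numbers invented), a conflict only appears when the constraints stop overlapping:

  app -> B 1.4, which requires D ">=2.1, <3.0"
  app -> C 2.0, which requires D ">=2.3, <3.0"
  # resolver picks D 2.5 and everyone is happy
  # trouble starts only if, say, C 3.0 demands D ">=4.0"

With ranges enforced by semver, most upgrades land in the overlapping region and never surface as a conflict at all.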

You can even create a system that uses versioned packages as dependencies while using a centralized version control system. In fact, this is probably the easiest migration strategy. Build the infrastructure that you will eventually need anyway while you are still managing one massive repository. Then you can 'lazy eval' the migration (pulling more and more packages off the centralized system as the company grows faster and faster, avoiding version control brownouts).


I'm assuming you aren't referring to "succeed" in your first sentence. :)

The amount of hubris our industry has is amusing. Seriously, you are talking about outsmarting two of the most successful companies out there.

I mean, could they do better? Probably. But it is hard to grok the amount of second-guessing any of their decisions gets.


But are they successful because of this, or despite it?


Really good question. One that I am not pretending to know the answer to.

I do feel that the main reason they are successful is sheer manpower. That is, competent (but not necessarily stellar) leadership can accomplish a ton with dedicated hard workers. This shouldn't be used as evidence that what they are doing is absolutely correct. But it should be considered when what they do is called into question.

If you have/know of any studies into that, I know I would love to read them. I doubt I am alone.


If you are a large company, you can move faster if devs aren't all working on the same repository. If all your code is in one repo and one team makes a breaking change to their part of it, everyone's code fails to build. If the source code is broken up into separate packages, each team can just use versioned releases of the packages they depend on and not have to worry about fixing compatibility issues.


While there is a strong appeal to your argument, Facebook stands as a resounding counter example. As does Google.

The counter-argument appears to be this: if one team checks in a change that breaks another team's code, then the focus should be on getting that fixed as soon as possible.

Now, if you are in multiple repositories, it can be easy to shift that responsibility onto the team whose repository is now broken. Things then get triaged into tasks, and getting a fix for the breaking change in may take time.

Contrast that with the simple rule of "you can't break the build" in a single repository, where the onus is on whoever is checking in the code to make sure all use sites still work.

Granted, an "easy" solution to this is to greatly restrict any changes that could break use-site code. The Linux kernel is a good example of this policy. Overall, I think that is a very good policy to follow.


We use separate repos and it works out well. It's nice having separate Git histories that pertain to different areas of the codebase.

Our workflow covers all the potential problems you named (e.g. scripts to keep everything up to date, tests that get run at build or push time after everything is already checked out from the individual repos, etc.).

We've been running this way for over a year with literally zero issues.


To get a log for a specific subdirectory, you just:

  git log -- my-teams-subdirectory


>People change something in their module. They don't test any upstream modules because it's not their problem anymore.

If you use any sort of versioning this shouldn't ever cause a problem.


I had a problem so I decided to use versioning, now I have a combinatorially exploding number of problems.


Yeah, doing it this way they can never make API-incompatible changes without also fixing everything downstream... which effectively means that once a library is popular enough, it is locked at its current API forever.


I've seen that happen at Google. At some point it's easier to write a new (differently named) library. Monolithic codebase + renames = poor man's versioning.

BUT this allows you to pay the price of versioning (downstream is burdened with rewriting => they never do => old code lives indefinitely) only in the worst cases. If done right (lots of tests, great CI infrastructure), fixing everything downstream is practical in many cases, and can be a win.

A subtler question is how this interplays with branching. You can't be expected to update downstream on every random/abandoned branch out there, only the head. Which deters people from branching because then it's their responsibility to keep up...


Or you bump an API version. And the fixes are gradual everywhere.


The parent was advocating not versioning.


How do monolithic repos solve that? Surely people who fix bugs in a library aren't testing the entirety of Facebook every time (how long would that even take? Assuming they've even set such a thing up).


I used to work at Facebook. They have servers that automatically run a lot of their test cases on every commit.


It is at least easier to correlate the changes. When you have N modules in N repos, you potentially have N histories to cross-reference to figure out when one change landed relative to another.



