The largest Git repo (microsoft.com)
1053 points by ethomson on May 24, 2017 | 402 comments



Windows, because of the size of the team and the nature of the work, often has VERY large merges across branches (10,000’s of changes with 1,000’s of conflicts).

At a former startup, our product was built on Chromium. As the build/release engineer, one of my daily responsibilities was merging Chromium's changes with ours.

Just performing the merge and conflict resolution was anywhere from 5 minutes to an hour of my time. Ensuring the code compiled was another 5 minutes to an hour. If someone on the Chromium team had significantly refactored a component, which typically occurred every couple weeks, I knew half my day was going to be spent dealing with the refactor.

The Chromium team at the time was many dozens of engineers, landing on the order of a hundred commits per day. Our team was a dozen engineers landing maybe a couple dozen commits daily. A large merge might have on the order of 100 conflicts, but typically it was just a dozen or so conflicts.

Which is to say: I don't understand how it's possible to deal with a merge that has 1k conflicts across 10k changes. How often does this occur? How many people are responsible for handling the merge? Do you have a way to distribute the conflict resolution across multiple engineers, and if so, how? And why don't you aim for more frequent merges so that the conflicts aren't so large?

(And also, your merge tool must be incredible. I assume it displays a three-way diff and provides an easy way to look at the history of both the left and right sides from the merge base up to the merge, along with showing which engineer(s) performed the change(s) on both sides. I found this essential many times for dealing with conflicts, and used a mix of the git CLI and Xcode's opendiff, which was one of the few at the time that would display a proper three-way diff.)


When you have that many conflicts, it's often due to massive renames, or just code moves.

If you use git-mediate[1], you can re-apply those massive changes on the conflicted state, run git-mediate - and the conflicts get resolved.

For example: if you have 300 conflicts due to some massive rename, you can type in:

  git-search-replace.py[2] -f oldGlobalName///newGlobalName
  git-mediate -d
  Successfully resolved 377 conflicts and failed resolving 1 conflict.
  <1 remaining conflict shown as 2 diffs here representing the 2 changes>
[1] https://medium.com/@yairchu/how-git-mediate-made-me-stop-fea...

[2] https://github.com/da-x/git-search-replace


Also git rerere
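
For anyone who hasn't used it, rerere has to be switched on; roughly:

    git config rerere.enabled true
    # After you resolve a conflict once, git records the resolution and
    # replays it automatically the next time the same conflict shows up
    # (e.g. in a repeated merge or a rebase).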

When maintaining multiple release lines and moving fixes between them:

Don't use a bad branching model. Things like "merging upwards" (=committing fixes to the oldest branch requiring the fix, then merging the oldest branch into the next-newer branch, etc.), which seems to be somewhat popular, just don't scale, don't work very well, and produce near-unreadable histories. They also incentivise developing on maintenance branches (ick).

Instead, don't do merges between branches. Everything goes into master/dev, except stuff that really doesn't need to go there (e.g. a fix that only affects specific branches). Then cherry-pick the fixes into the maintenance branches.
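
A minimal sketch of that flow (branch name and commit id hypothetical):

    git checkout release-2.1     # the maintenance branch that needs the fix
    git cherry-pick -x 1a2b3c4   # -x records "(cherry picked from commit ...)" in the message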


Cherry picking hotfixes into maint branches is cool until you have stuff like diverging APIs or refactored modules between branches. I don't know of a better solution; it kind of requires understanding in detail what the fix does and how it does it, then knowing if that's directly applicable to every release which needs to be patched.


Use namespaces to separate API versions?

POST /v4/whatever

POST /v3/whatever


We version each of our individual resources, so a /v1/user might have many /v3/post resources. Seems to work for us as a smaller engineering team.


A better approach would be to alias /v3/user to /v1/user until there is a breaking change needed in the v3 code tier.


On a rapidly developing API, that would be way too much churn on our front end. For an externally facing API, I completely agree.


Yes, although this applies to all forms of porting changesets or patches between branches or releases.


I don't understand the hate for merging up. I've worked with the 'cherry-pick into release branches' model, and also with an automated merge-upwards model, and I found the automerge to be WAY easier to deal with. If you make sure your automerge is integrated into your build system, so a failing automerge is a red build that sends emails to the responsible engineers, I found that doing it this way removed a ton of the work that was necessary for cherry-picking. I can understand not liking the slightly-messier history you get, but IMO it was vastly better. Do you have other problems with it, or just 'unreadable' histories and work happening on release branches? Seems like a good trade to me.


As the branches diverge, merges take more and more time to do (up to a couple hours, at which point we abandoned the model)... they won't be done automatically. Since merges are basically context-free it's hard to determine the "logic" of changed lines. Since merges always contain a bunch of changes, all have to be resolved before they can be tested, and tracing failures back to specific conflict resolutions takes extra time. Reviewing a merge is seriously difficult. Mismerges are also far more likely to go unnoticed in a large merge compared to a smaller cherry pick. With cherry picking you are only considering one change, and you know which one. You only have to resolve that one change, and can then test, if you feel that's necessary, or move on to the next change.

Also; https://news.ycombinator.com/item?id=14413681

I also observed that getting rid of merging upwards moved the focus of development back to the actual development version (=master), where it belongs.


I just looked at git-mediate and I'm very confused. It appears that all it does is remove the conflict markers from the file after you've already manually fixed the conflict. Except you need to do more work than normal, because you need to apply one of the changes not only to the other branch's version but also to the base. What am I missing here, why would I actually want to use git-mediate when I'm already doing all the work of resolving the conflicts anyway?


It looks like git-mediate does one more important thing; it checks that the conflict is actually solved. In my experience it's very easy to miss something when manually resolving a conflict and often the choices the merge tools give you are not the ones you want.


IntelliJ IDEA default merger knows when all conflicts in a file are handled and shows you a handy notification on top "Save changes and finish merging".


TFS's conflict resolver also does this


Well, all it checks is that you modified the base case to look like one of the two other cases. That doesn't actually tell you if you resolved the conflict though, just that you copied one of the cases over the base case.


True, but if you follow a simple mechanical guideline: Apply the change to the 2 other versions (preferably the base one last) - then your conflict resolutions are going to be correct.

From experience with many developers using this method, conflict resolution errors went down to virtually zero, and conflict resolution time has improved by 5x-10x.


There's a much simpler mechanical guideline that works without git-mediate: Apply the change to the other version. git-mediate requires you to apply the change twice, but normal conflict resolution only requires you to apply it once.


Except you don't really know if you actually applied the full change to the other version. That's what applying it to the base is all about.

You often take the apparent diff, apply it to the other version, and then git-mediate tells you "oops, you forgot to also apply this change". And this is one of the big sources of bugs that stem from conflict resolutions.

Another nice thing about git-mediate is that it lets you safely do the conflict resolution automatically, e.g: via applying a big rename as I showed in the example, and seeing how many conflicts disappear. This is much safer than manually resolving.


Applying the change to the base doesn't prove that you applied the change to the other version. It only proves that you did the trivial thing of copying one version over the base. That's kinda the whole point of my complaint here, git-mediate is literally just having you do busy-work as a way of saying "I think I've applied this change", and that busy-work has literally no redeeming value because it's simply thrown away by git-mediate. Since git-mediate can't actually tell if you applied the change to the other version correctly, you're getting no real benefit compared to just deleting the conflict markers yourself.

The only scenario in which i can see git-mediate working is if you don't actually resolve conflicts at all but instead just do a project-wide search&replace, but that's only going to handle really trivial conflicts, and even then if you're not actually looking at the conflict you run the risk of having the search & replace not actually do what it's supposed to do (e.g. catching something it shouldn't).


> Since git-mediate can't actually tell if you applied the change to the other version correctly, you're getting no real benefit compared to just deleting the conflict markers yourself

This is patently and empirically false:

1) See the automated rename example. How do you gain the same safety and ease of resolution without git-mediate? This is, unlike you say, an incredibly common scenario. After all, conflicts are usually due to very wide, mechanical changes - the exact kinds of changes that are easy to re-apply project-wide. This is true not only for renames, but also for whitespace fixes, which are infamous for causing conflicts (and thus inserting bugs), and for that reason are not allowed in many collaborative projects!

2) Instead of having to tediously compare the 3 versions to make sure you haven't missed any change when resolving (a very common error!) you now have to follow a simple guideline: Apply the same change to 2 versions.

This guideline is simple enough to virtually never fuck it up, unlike traditional conflict resolution which is incredibly error-prone. Your complaint that git mediate does not validate you followed this one guideline is moot, since this guideline is so easy to not fuck up - compared to the rest of the resolution process.

If you follow this guideline - git-mediate has tremendous value. It makes sure you did not forget any part of the change done by either side - and streamlines every other part of the conflict resolution process:

A) Takes my favorite editor directly to the conflict line

B) Shows me 2 diffs instead of 3 walls of text

C) Lets me choose the simpler diff to apply to the other sides, making a very minimal text change in my editor.

D) Validates that the change I applied was the last one (or decreases the size of the diff otherwise)

E) Does the "git add" for me, and takes me directly to the next conflict

This has been used by dozens of people, who can all testify that it:

A) Made conflict resolution easy and convenient

B) Reduced the error rate to zero (I don't remember the last bug inserted in a merge conflict)

C) Sped the process up by an order of magnitude


> See the automated rename example. How do you gain the same safety and ease of resolution without git mediate? This is, unlike you say, an incredibly common scenario.

Is it? I'm not sure if I've ever had a conflict that would be resolved by a global find & replace. Globally renaming symbols isn't really all that common. In my experience conflicts are not "usually due to very wide, mechanical changes", they're due to two people modifying the same file at the same time.

> It is true for not only renames, but also whitespace fixes which are infamous for causing conflicts …

Most projects don't go doing whitespace fixes over and over again. In projects that do any sort of project-wide whitespace fixes, that sort of thing is usually done once, at which point the whitespace rules are enforced on new commits. So yes, global whitespace changes can cause conflicts, but they're rather rare.

> Instead of having to tediously compare the 3 versions to make sure you haven't missed any change when resolving (a very common error!) you now have to follow a simple guideline: Apply the same change to 2 versions.

> This guideline is simple enough to virtually never fuck it up

You know what's even simpler? "Apply the same change to 1 version". Saying "Fix the conflict, and then do extra busy-work on top of it" is not even remotely "simpler". It's literally twice the amount of work.

The only thing git-mediate appears to do is tell you if a project-wide find&replace was sufficient to resolve your conflict (and that's assuming the project-wide find&replace is even safe to do, as I mentioned in my previous comment). If you're not doing project-wide find&replaces, then git-mediate just makes your job harder.

From reading your "process" it appears that all you really need is a good merge tool, because most of what you describe as advantageous is what you'd get anyway when using any sort of reasonable tool (e.g. jumping between conflicts, showing correct diffs, making it easy to copy lines from one diff into another, making it easy to mark the hunk as resolved).


> In my experience conflicts are not "usually due to very wide, mechanical changes", they're due to two people modifying the same file at the same time.

Then it is likely a reversal of cause & effect. People are more reluctant to make wide sweeping changes such as renames because they're worried about the ensuing conflicts.

> Most projects don't go doing whitespace fixes over and over again

Again, for similar reasons. Projects limp around with broken indentation (tabs/spaces), trailing whitespaces, dos newlines, etc - because fixing whitespace is against policy. Why? Conflicts.

> You know what's even simpler? "Apply the same change to 1 version".

It sounds simpler, but not fucking it up is not simple at all, as evidenced by conflicts being a constant source of bugs.

When you look at the 3 versions and "apply the diff to 1 version", you only think you're applying the diff. You're applying the perceived diff to 1 version, which often differs from the actual diff, as the actual diff may include subtle differences that aren't easily visible. This is what git-mediate detects via the double-accounting of applying the perceived diff to both versions - validating that the perceived diff equals the actual diff.

Without git-mediate? At best you end up with build errors. At worst, you revive old bugs that were subtly fixed in the diff you think you applied.

> then git-mediate just makes your job harder.

You're talking out of your ass here. Me and 20 other people have been using git-mediate and it's been a huge game changer for conflict resolution. Every single user I've talked to claims huge productivity/reliability benefits from using it.

> all you really need is a good merge tool

I've used "good" merge tools a lot. They're incredibly inferior to my favorite text editor:

* Text editing within them is tedious and terrible

* Their supported actions of copying whole lines from one version to the other are useless 90% of the time.

Let me ask you this:

What percentage of the big conflicts you resolve with a "good merge tool" - build & run correctly on the first run after resolution?

For git-mediate, that is easily >99%.


> People are more reluctant to make wide sweeping changes such as renames because they're worried about the ensuing conflicts.

I disagree. People generally don't do project-wide find&replaces because it's just not all that common to want to rename a global symbol.

> Projects limp around with broken indentation (tabs/spaces), trailing whitespaces, dos newlines, etc - because fixing whitespace is against policy. Why? Conflicts.

You seem to have completely missed the point of my comment. I'm not saying projects limp along with bad whitespace. I'm saying projects that decide upon whitespace rules typically do a single global fixup and then simply enforce whitespace rules on all new commits. That means you only ever have one conflict set due to applying whitespace policy, rather than doing it over and over again as you suggested.

> You're applying the perceived diff to 1 version, which often differs from the actual diff as it may include subtle differences that aren't easily visible.

Then you're using really shitty merge software. Any halfway-decent merge tool will highlight all the differences for you.

> Without git-mediate? At best you bring up build errors. At worst, revive old bugs that were subtly fixed in the diff you think you applied.

And this is just FUD.

It's simple. Just stop using Notepad.exe to handle your merge conflicts and use an actual merge tool. It's pretty hard to miss a change when the tool highlights all the changes for you.

> You're talking out of your ass here.

And you're being uncivil. I'm done with this conversation.


That's why I showed an example of a rename. You write "manually fixed the conflict", where do you see that in the rename example?

You just re-apply either side of the changes in the conflict (Base->A, or Base->B) and the conflict is then detected as resolved. Reapplying (e.g: via automated rename) is much easier than what people typically mean by "manually resolving the conflict".

Also, as a pretty big productivity boost, it prints the conflicts in a way that lets many editors (sublime, emacs, etc) directly jump to conflicts, do the "git add" for you, etc. This converts your everyday editor into a powerful conflict resolution tool. Using the editing capabilities in most merge tools is tedious.


or worse, formatting changes


Windows developed an extension that lets them do conflict resolution on the web. We have a server-side API that it calls into, but the extension isn't fundamentally different from using BeyondCompare or $YOUR_FAVORITE_MERGETOOL.


An extension to what? Could you open source it?


Extension to VSTS [1], sorry. We're working with them on making the extension available to the public. It's possible we could open source it as well; I'll poke around.

[1] https://www.visualstudio.com/en-us/docs/integrate/extensions...


Is this CodeFlow you're talking about?


For 3 way merging, I've had good luck with beyondcompare


Generally, it's the responsibility of whoever made the changes to resolve conflicts (basically, git blame conflicting lines and the people who changed them get notified to resolve conflicts in that file). Distributing the work like this makes the merges more reasonable.
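
Something like this per conflicted region, to find who to notify (file and line range hypothetical):

    git blame -L 120,160 MERGE_HEAD -- src/foo.c   # who last touched those lines on their side
    git blame -L 120,160 HEAD -- src/foo.c         # who last touched them on ours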


Yep, this is the way I work. Always rebase against master, and fix conflicts there, so branches merging into master should always be up-to-date and have zero conflicts.


How does that work?


You first rebase against master locally and push the rebased feature branch after resolving all the conflicts yourself. Afterwards, you check out master and merge the updated feature branch into it. The 2nd merge should not result in any conflicts.
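
Roughly, assuming a branch named feature:

    git checkout feature
    git rebase master                            # resolve any conflicts here, on the feature branch
    git push --force-with-lease origin feature
    git checkout master
    git merge feature                            # now a clean fast-forward with no conflicts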


This model can be annoying on running feature branches. Once you rebase, you have to force-push to the remote feature branch. It's not so bad if you use --force-with-lease to prevent blowing away work on the remote, but it still means a lot of rewriting history on anything other than one-off branches.


No no, you never force push to "published" branches, e.g., upstream master. What you're doing when you rebase onto the latest upstream is this: you're making your local history the _same_ as the upstream, plus your commits as the latest commits, which means if you push that, then you're NOT rewriting the upstream's history.

(In the Sun model one does rewrite project branch history, but one also leaves behind tags, and downstream developers use the equivalent of git rebase --onto. But the true upstream never rewrites its history.)


That's what I thought as well. But what happens when you need to rebase the feature branch against master? Won't you have to force push that rebase?


There's nothing stopping you from doing merges instead of rebases.

       * latest feature commit #3 (feature)
       * merge
      /|
     * | more master commits you wanted to include (master)
     | * feature commit #2
     | * merge
     |/|
     * | master commits you wanted to include
     | * feature commit #1
     |/
     *   original master tip
     *   master history...
Then, when you're done with feature, if you really care about clean history, just rebase the entire history of the feature branch into one or more commits based on the latest from master. I think git checkout -b newbranch; git rebase -i master (squashing the picks) does the trick here:

     *   feature commits #1, #2 and #3 (newbranch)
     | * latest feature commit (feature)
     | * merge
     |/|
     * | more master commits you wanted to include (master)
     | * feature commit #2
     | * merge
     |/|
     * | master commits you wanted to include
     | * feature commit #1
     |/
     *   original master tip
     *   master history...
Then checkout master, rebase newbranch, test it out and if you're all good, delete or ignore the original.

     * feature commits #1, #2 and #3 (master, newbranch)
     * more master commits you wanted to include
     * master commits you wanted to include
     * original master tip
     * master history...


  *latest feature c
  *merge
 /|
* |more master comm. | feature commit #: | merge |/| * |master commits y | feature commit # |/ original master * master history ..


maybe try that again keeping 4 spaces before each line?


I've described this. Downstreams of the feature branch rebase from their previous feature branch merge base (a tag for which is left behind to make it easy to find it) --onto the new feature branch head.

E.g., here's what the feature branch goes through:

feature$ git tag feature_05

<time passes; some downstreams push to feature branch/remote>

feature$ git fetch origin
feature$ git rebase origin/master
feature$ git tag feature_06

And here's what a downstream of the feature branch goes through:

downstream$ git fetch feature_remote

<time passes; this downstream does not push in time for the feature branch's rebase>

downstream$ git rebase --onto feature_remote/feature_06 feature_remote/feature_05

Easy peasy. The key is to make it easy to find the previous merge base and then use git rebase --onto to rebase from the old merge base to the new merge base.

Everybody rebases all the time. Everybody except the true master -- that one [almost] never rebases (at Sun it would happen once in a blue moon).


For 3-way merging the best tool I've found is Steve Losh's splice.vim (https://github.com/sjl/splice.vim/).



Yes, back when I was using Linux, I used Meld a lot. I can recommend it - the directory comparison is good.

Also KDiff3. Struggling to remember the other ones I used to try unfortunately.


I spent the better part of a year in a team that was merging and bug fixing our company's engine releases into the customer's (and owning company's) code base. We also had to deal with different code repositories and version control systems.

We ended up with a script that created a new git repository and checked out the base version of the code there. It then created a branch for our updated release and another for their codebase, and attempted to do a merge between the two. For any files which couldn't be automatically merged, it created a set of 4 files: the original, plus a triplet for doing a 3-way merge.

This is also when I bought myself a copy of Beyond Compare 4, which fits perfectly for the price and feature set for what we needed.


It's usually not so bad. Generally a big project is actually split up into lots of different smaller projects, each of which have an owner. And typically a dev won't touch code that isn't theirs except in unusual cases. Teams that have more closely related or dependent code would typically try to work closer to each other and share code more often than teams that are more separated.


Most changes don't touch hundreds/thousands of files. If you were to split the repo into many (as Windows used to be) then you'd still have the problem of huge projects having MANY conflicts, but worse: now you need to do your merge/rebase for each such repo.

In any case, rebasing is better than merging. (Rebasing is a series of merges, naturally, but still.)


Are you aware of KDiff3[1]? If you are why do you prefer Xcode's opendiff?

[1] http://kdiff3.sourceforge.net/


For me, I found I did not need to do 3 way merges that often and opendiff's native UI fits in better than kdiff3 (for me). I think kdiff3 was Qt? Despite Trolltech's best efforts, Qt does not feel native on a Mac.

This isn't to say that KDiff3 isn't great - it is.


I don't think KDiff 3 looks native anywhere, and I don't think it's because of Qt. It uses weird fonts and icons, and for some reason their toolbar buttons just look wrong.

Still an extremely useful program.


It's not nearly as bad when you have all the people who work on it with you.

You look at who caused conflicts and send out emails. Don't land the merge until all commits are resolved. People who don't resolve their changes get in trouble.

Not perfect, but there it is.

I worked on a chromium based product as well and had the exact same problem. Eventually we came up with a reasonable system for landing commits and just tried our best to build using chromium, but not having actual patches. Worked okay, not great. Better than the old system of having people do it manually/just porting our changes to each release.


God, that sounds hellish.


Archive Team is making a distributed backup of the Internet Archive: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK

Currently the method getting the most attention is to put the data into git-annex repos, and then have clients just download as many files as they have storage space for. But because of limitations with git, each repo can only handle about 100,000 files even if they are not "hydrated": http://git-annex.branchable.com/design/iabackup/

If git performance were improved for files that have not been modified, this restriction could be lifted and the manual work of dividing collections up into repos could be a lot lower.

Edit: If you're interested in helping out, e.g. porting the client to Windows, stop by the IRC channel #internetarchive.bak on efnet.


internet archive sounds like the best ever use case for IPFS


It was considered but it just didn't get enough attention from anyone to get it done. http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i...


Could bup be useful in addition to git-annex? https://github.com/bup/bup


Wouldn't IPFS be much, much more suitable for this purpose?


IIRC, there are permanence and equitable sharing guarantee concerns with IPFS. The former at least can be helped by pinning I think.


Ah, I see, yes. It would be probabilistic, unless there was some way to coordinate sharing (eg downloading the least shared file first).


At Sun Microsystems, Inc. (RIP) we had many "gates" (repos) that made up Solaris. Cross-gate development was somewhat more involved, but still not bad. Basically: you installed the latest build of all of Solaris, then updated the bits from your clones of the gates in question. Still, a single repo is great if it can scale, and GVFS sounds great!

But that's not what I came in to say.

I came in to describe the rebase (not merge!) workflow we used at Sun, which I recommend to anyone running a project the size of Solaris (or larger, in the case of Windows), or, really, even to much smaller projects.

For single-developer projects, you just rebased onto the latest upstream periodically (and finally just before pushing).

For larger projects, the project would run their own upstream that developers would use. The project would periodically rebase onto the latest upstream. Developers would periodically rebase onto their upstream: the project's repo.

The result was clean, linear history in the master repository. By and large one never cared about intra-project history, though project repos were archived anyways so that where one needed to dig through project-internal history ("did they try a different alternative and found it didn't work well?"), one could.

I strongly recommend rebase workflows over merge workflows. In particular, I recommend it to Microsoft.


A problem with rebase workflows that I don't see addressed (here or in the replies) is: if I have, say, 20 local commits and am rebasing them on top of some upstream, I have to fix conflicts up to 20 times; in general I will have to stop to fix conflicts at least as many times as I would have to while merging (namely 0 or 1 times).

Moreover, resolution work during a rebase creates a fake history that does not reflect how the work was actually done, which is antithetical to the spirit of version control, in a sense.

A result of this is the loss of any ability to distinguish between bugs introduced in the original code (pre-rebase) vs. bugs introduced while resolving conflicts (which are arguably more likely in the rebase case since the total amount of conflict-resolving can be greater).

It comes down to Resolution Work is Real Work: your code is different before and after resolution (possibly in ways you didn't intend!), and rebasing to keep the illusion of a total ordering of commits is a bit of an outdated habit - a misuse of abstractions we now have available that can understand projects' evolution in a more sophisticated way.

I was a dedicated rebaser for many years but have since decided that merging is superior, though we're still at the early stages of having sufficient tooling and awareness to properly leverage the more powerful "merge" abstraction, imho.


Well, git rerere helps here, though, honestly, this never happens to me even when I have 20 commits. Also, this is what you want, as it makes your commits easier to understand by others. Otherwise, with thousands of developers your merge graph is going to be a pile of incomprehensible spaghetti, and good luck cherry-picking commits into old release patch branches!

Ah, right, that's another reason to rebase: because your history is clean, linear, and merge-free, it makes it easier to pick commits from the mainline into release maintenance branches.

The "fake history" argument is no good. Who wants to see your "fix typo" commits if you never pushed code that needed them in the first place? I truly don't care how you worked your commits. I only care about the end result. Besides, if you have thousands of developers, each on a branch, each merging, then the upstream history will have an incomprehensible (i.e., _useless_) merge graph. History needs to be useful to those who will need it. Keep it clean to make it easier on them.

Rebase _is_ the "more powerful merge abstraction", IMO.


rebase : centralized repo :: merge : decentralized repo

rebase : linked-list :: merge : DAG

If the work/repo is truly distributed and there isn't a single permanently-authoritative repo, a "clean, linear" history is nonsensical to even try to reason about.

In all cases it is a crutch: useful (and nice, and sufficient!) in simple settings, but restricting/misleading in more complex ones (to the point of causing many developers to not see the negative space).

You can get very far thinking of a project as a linked list, but there is a lot to be gained from being able to work effectively with DAGs when a more complex model would better fit the reality being modeled.

It's harder to grok the DAG world because the tooling is less mature, the abstractions are more complex (and powerful!), and almost all the time and money up to now has explored the hub-and-spoke model.

In many areas of technology, however, better tooling and socialization around moving from linked-lists (and even trees) to DAGs is going to unlock more advanced capabilities.

Final point: rebasing is just glorified cherry-picking. Cherry-picking definitely also has a role in a merge-focused/less-centralized world, but merges add something totally new on top of cherry-picking, which rebase does not.


As @zeckalpha says, rebase != centralized repo.

You can have a hierarchical repo system (as we did at Sun).

Or you can have multiple hierarchies, contributing different series of rebased patches up the chain in each hierarchy.

Another possibility is that you are not contributing patches upstream but still have multiple upstreams. Even in this case your best bet is as follows: drop your local patches (save them in a branch), merge one of the upstreams, merge the other, re-apply (cherry-pick, rebase) your commits on top of the new merged head. This is nice because it lets you merge just the upstreams first, then your commits, and you're always left in a situation where your commits are easy to ID: they're the ones on top.
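
A rough sketch of that last case, assuming your local commits sit linearly on top of upstream1/master (remote and branch names hypothetical):

    git branch my-patches                            # save your local commits
    git fetch upstream1 && git fetch upstream2
    git reset --hard upstream1/master                # drop the local patches from the working branch
    git merge upstream2/master                       # merge the other upstream
    git cherry-pick upstream1/master..my-patches     # re-apply your commits on top of the merged head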


I'm the guy who started this DAG model (also at Sun with NSElite and then later with BitKeeper).

I agree that rebase == centralized. It's a math thing. If you rebase and someone has a clone of your work prior to the rebase, chaos happens when they come together. So you have to enforce a centralized flow to make it work in all cases. It's pretty much provable, as in a math proof.


Not true! At Sun we did this with project gates regularly. The way it works (as I've described several times in this thread now) is that you rebase --onto. That is, you use a tag for the pre-rebase project upstream to find the merge base for your branch, then cherry-pick your commits (i.e., all local commits after the merge base) onto the post-rebase project upstream.

Now, you don't want to do this with the ultimate upstream, though occasionally it happened at Sun with the OS/Net gate, usually due to some toxic commit that was best eliminated from the history rather than reverted, or through some accident.

But you'd be right to say that the Sun model was centralized in that there was just one ultimate upstream. (There was one per-"consolidation", since Solaris was broken up into multiple parts like that, but whatever, the point stands.)

Whereas with Linux, say, one might have multiple kernel gates kept by different gatekeepers. Still, if you're contributing to more than one of them, it's easier to cherry-pick (rebase!) your commits onto each upstream than to just merge your way around -- IMO. I.e., you can have a Linux kernel like decentralized dev model and still rebase.

However, as you can see from my comment in the previous paragraph, _rebase_ itself does not imply a centralized model.


I get that you can work around the problems; you don't seem to get that, from a math point of view, rebase forces either

a) a centralized model

or

b) you have to throw away any work based on the dag before the rebase

or

c) you have the history in the graph twice (which causes no end of problems).

(a) is the math way, (b) and (c) are ad-hoc hacks. You are well into the ad-hoc hacks, you've found a way to make it work but it includes "don't do that" warnings to users. My experience is that you don't want to have work flows that include "don't do that". Users will do that.


Also, it's harder to grok merge history because we humans have a hard time with complexity, and merge history in a system with thousands of developers and multiple upstreams can get insanely complex. The only way to cut through that complexity is to make sure that each upstream ends up with linear history -- that is: to rebase downstreams.


Nope, you want what I called the event stack. It lets you have your cake and eat it too.

The event stack is a record of every tip that was ever present in this repo other than unpushed commits.

You were at cset 1234, you pull in 25 csets, the event stack has two events, 1 which points to 1234 and 2 which points at the tip after the pull.

You commit "wacked the crap out of it", then commit "fixed typo", then commit "added test", then commit $whatever. The event stack is

1, 2, and "." - which points at your current tip but is floating.

Now you push. Your event stack is 1, 2, 3 and 3 points at the tip as of your push.

What about clone? You get your parent's event stack but other than that they are per repo.

The event stack is the linear history you want, it is the view that everyone wants. It's "what are the list of tips I care about in this repo?". Have a push that broke your tree but you don't know what the previous tip was because the push pushed 2500 commits? No problem. The event stack is a stack and there is a "pop" command that pops off the last change to the event stack. So you would just do "git pop" and see if that fixes your tree, repeat until it does.

We never built this in BitKeeper but I should try. If for no other reason than to show people you can have the messy (but historically accurate) history under the covers but have a linear view that is pleasant for humans.


Yes, I've been asking for branch history (the reflog provides some, but it's insufficient because it's not shared in any way).

Even with this, I'd want to rebase away "fixed typo" prior to pushing, and more, I'd want to:

- organize commits into logical chunks so that they might be cherry-picked (in the literal sense, not just the VCS sense) into maintenance release branches

- organize commits as the upstream prefers (some prefer to see test updates in separate commits)

IIUC BitKeeper does have a sort of branch push history, unlike git. Is this wrong?


So the current BK doesn't really have branches, it has the model that if you want to branch you clone, each clone is a branch.

Which begs the question "how do you do dev vs stable branches?" And the answer is that we have a central clone called "dev" and a central clone called "stable". In our case we have work:/home/bk/stable and work:/home/bk/dev. User repos are in work:/home/bk/$USER/dev-feature1 and work:/home/bk/$USER/stable-bugfix123.

We run a bkd in work:/home so our urls are

    bk://work/dev
    bk://work/$USER/dev-feature1
BK has a concept of a level - you can't push from a higher level to a lower level. So stable would be level 1, dev would be level 2. Levels propagate on clone so when you do

    bk clone bk://work/dev dev-feature2
and then try and do

    bk push bk://work/stable
it will tell you that you can't push to a lower level. This prevents backflow of all the new feature work into your stable tree.

The model works well until you have huge (like 10GB and bigger) repos. At that point you really want branches because you don't want to clone 10GB to do a bugfix.

Though we addressed that problem, to some extent, by having nested collections (think submodules that actually support all workflows; unlike git's, they are submodules that work). So you can clone the subset you need to do your bugfix.

But yeah, there are cases where "a branch is a clone" just doesn't scale, no question. But where it does work it's a super simple and pleasant model.


Decentralization at scale can result in a linear chain, too.


IMO, VC comes down not to tracking what was actually done, but to creating snapshots of logical steps that are reasonable to roll back to and git bisect with.


And cherry-pick onto release maintenance branches.


A pain I have with rebase workflow is that it creates untested commits (because diffs were blindly applied to a new version of the code). If I rebase 100 commits, some of the commits will be subtly broken.

How do you deal with that?


With git rebase you can in fact build and test each commit. That's what the 'exec' directive is for (among other things) in rebase scripts!

Basically, if you pick a commit, and in the next line exec make && make check (or whatever) then that build & test command will run with the workspace HEAD at that commit. Add such an exec after every pick/squash/fixup and you'll build and test every commit.
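
The todo list then looks roughly like this (commit ids and messages hypothetical):

    pick 1a2b3c4 add frobnicator
    exec make && make check
    pick 5d6e7f8 fix frobnicator edge case
    exec make && make check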


Or you could use the "-x" parameter to execute something after each commit during the rebase.
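
i.e., something like:

    git rebase -x "make && make check" master   # runs the command after each commit is applied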


This is why, in git workflows with rebases, it's a good idea to create merge commits anyway, even if the master branch could just be fast-forwarded.

That way, looking at the history, you know what commits are stable/tested by looking at merge commits. Others that were brought in since the last merge commit can be considered intermediary commits that don't need to be individually tested.
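
For example, when integrating a feature branch (branch name hypothetical):

    git checkout master
    git merge --no-ff feature   # force a merge commit even though a fast-forward would be possible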

(Of course, there's also the rebase-and-squash workflow which I've personally never used, but it accomplishes the same thing by erasing any intermediary history altogether.)


Also, every commit upstream is stable by definition! Human failures aside, nothing should go upstream that isn't "stable/tested".

"Squashing" is just merging neighboring commits. I do that all the time!

Usually when I work on something, commit incomplete work, work some more, commit, rinse, repeat, then when the whole thing is done I rewrite the history so that I have changes segregated into meaningful commits. E.g., I might be adding a feature and find and fix a few bugs in the process, add tests, fix docs, add a second, minor feature, debug my code, add commits to fix my own bugs, then rewrite the whole thing into N bug fix commits and 2 feature commits, plus as many test commits as needed if they have to be separate from related bug fix commits. I find it difficult to ignore some bug I noticed while coding a feature just so that I can produce clean history in one go without re-writing it! People who propose one never rewrite local history propose to see a single merge commit from me for all that work. Or else the original commits that make no logical sense.

Too, I use "WIP" commits as a way to make it easy to backup my work: commit extant changes, git log -p or git format-patch to save it on a different filesystem. Sure, I could use git diff and thus never commit anything until I'm certain my work is done so I can then write clean history once without having to rewrite. But that's silly -- the end result is what matters, not how many cups of coffee I needed to produce it.
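
For the backup step that's roughly (destination path hypothetical):

    git format-patch -o /mnt/backup/myproject origin/master   # one patch file per unpushed commit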


I've toyed with the idea of using merge commits to record sets of commits as being... atoms.

Suppose you want to push regression tests first, then bug fixes, but both together: this is useful for showing that the test catches the bug and the bug fix fixes it. But now you need to document that they go together, in case they need to be reverted, or cherry-picked onto release maintenance branches.

I think branch push history is really something that should be a first-class feature. I could live with using merge commits (or otherwise empty-commits) to achieve this, but I'll be filtering them from history most of the time!


we use a rebase workflow in git at my current employer, and it is amazing.

previous employer used a merge workflow (primarily because we didn't understand git very well at the time), and there were merge conflicts all the time when pulling new changes down or merging new changes in.

It was a headache to say the least. As the integration manager for one project, I usually spent the better part of an hour just going through the pull requests and merge conflicts from the previous day. I managed a team that was on the other side of the world, so there were always new changes when I started working in the morning.


Yes! One of the most important advantages of a rebase workflow is that you can see immediately which upstream commits your changes conflict with, as opposed to some massive merge where you have to go chasing branch history to figure out the semantics of the change in question.

"Amazing" is right. Sun was doing rebases in the 90s, and it never looked back.


My exact experience (in the context of "merging upwards"). Large merges are a huge pain to do, and are basically impossible to review, too.


Yes! Reviewing huge merges is infeasible. Besides, most CR tools are awful at capturing history, especially in multi-repo systems. So rebasing and keeping history clean and linear is a huge win there.

Though, of course, rebasing is a win in general, even if you happen to have an awesome CR tool (a unicorn I've yet to run into).


Another thing is that keeping your unpushed commits "on top" is a great aid in general (e.g., it makes it trivial to answer "what haven't I pushed here yet?"), but it is also the source of rebasing's conflict resolution power.

Because your unpushed commits are on top, it's easy to isolate each set of merge conflicts (since you're going commit by commit) and to find the source of the conflicts upstream (with log/blame tools, without having to chase branch and merge histories).


We use a rebase workflow when working with third party source code. We keep all third party code on a main git branch and we create new branches off the main branch as we rebase our changes from third party code version to version.


Why wouldn't you use it for your own source?


Why is a clean linear history desirable? It's not reflective of how the product was built? Is it just for some naive desire of purity?


It's easier to see commits of a branch grouped together in most history viewers. Even though sorting commits topologically can help, most history viewers don't support that option.

When there is an undesired behavior that is hard to reason about, git-bisect can be used to determine the commit that first introduced it. With a normal merge, it will point to the merge commit, because it was the first time the 2 branches interacted. With a rebase, git bisect will point to one of the rebased commits, each of which already interacted with the branch coming before.

Resolving conflicts in a big merge commit vs in small rebased commits is like resolving conflicts in a distributed system by comparing only the final states, vs inspecting the actual sequences of changes.
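
For reference, the bisect loop is roughly (tag name hypothetical):

    git bisect start
    git bisect bad HEAD      # the current tip shows the bad behavior
    git bisect good v1.2     # a known-good point
    # git checks out midpoints; test each one and mark it:
    git bisect good          # or: git bisect bad
    git bisect reset         # when done, git has reported the first bad commit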


Who cares about how a product's sub-projects were put together? What one should care about is how those sub-projects were put together into a final product. To be sure, the sub-projects' internal history can be archived, but it needn't pollute the upstream's history.

Analogously: who cares how you think? Aside from psychologists and such, that is. We care about what you say, write, do.


A trip through "git rebase" search results at HN sure is cringe-inducing. So many people fail to get it.


Can you describe it in a little more detail? Do you still use branches? If so, for what? For different versions?


Great question.

We basically had a single branch per repo, and every repo other than the master one was a fork (ala github). But that was dictated by the limitations of the VCS we used (Teamware) before the advent of Hg and git.

So "branches" were just a developer's or project's private playgrounds. When done you pushed to the master (or abandoned the "branch"). Project branches got archived though.

In a git world what this means is that you can have all the branches you want, and you can even push them to the master repo if that's ok with its maintainers, or else keep them in your forks (ala github) or in a repo meant for archival.

But! There is only one true repo/branch, and that's the master branch in the master repo, and there are no merge commits in there.

For developers working on large projects the workflow went like this:

- clone the project repo/branch

- work and commit, pulling --rebase periodically

- push to the project repo/branch

- when the project repo/branch rebases onto a newer upstream the developer has to rebase their downstream onto the rebased project repo/branch

Project techleads or gatekeepers (larger projects could have someone be gatekeeper but not techlead) would be responsible for rebasing the project onto the latest upstream.

To simplify things the upstream did a bi-weekly "release" (for internal purposes) that projects would rebase onto on either a bi-weekly or monthly schedule. This minimizes the number of rebases to do periodically.

When the project nears the dev complete time, the project will start rebasing more frequently.

For very large projects the upstream repo would close to all other developers so that the project could rebase, build, test, and push without having to rinse and repeat.

(Elsewhere I've seen uni-repo systems where there is no closing of the upstream for large projects. There a push might have to restart many times because of other pushes finishing before it. This is a terrible problem. But manually having to "close" a repo is a pain too. I think that one could automate the process of prioritizing pushes so as to minimize the restarts.)


Are you saying that you use Git instead of Mercurial these days?

Not necessarily implied by you; just checking.


Me personally? Yes, I use git whenever I can. I still have to use Mercurial for some things.

I don't know what Oracle does nowadays with the gates that make up Solaris. My guess is that they still have a hodge podge, with some gates using git, some Mercurial, and some Teamware still. But that's just a guess. For all I know they may have done what Microsoft did and gone with a single repo for the whole thing.


This is old but I meant Sun/Oracle/Java.

I remember watching the process to choose a new VCS. Mercurial is a fine choice but not the best choice. Even over the duration of the decision process the world was clearly moving overwhelmingly to Git.


I have tremendous respect for Microsoft pulling itself together over the past few years.


Such a relevant point, and I don't think they get enough props for it.

I don't believe for one second this was a quick turnaround for them either. I've spoken to MS dev evangelists at work stuff over the past few years and they've continually said "it's going to get better", usually with a wry smile.

It bloody did too. They're nowhere near perfect, and the different product branches remain as disjointed as ever, but I'm genuinely impressed at the sheer scale of the organisational change they've implemented.


Moving the entire code base from Source Depot (invented at Microsoft) to Git (not MS) was a huge undertaking. I know many MS devs who hated git.

But this is seriously brave and well executed on their part.


Technically - Source Depot is a fork of Perforce. Not entirely invented at MSFT :).


I also know many MS devs who hated Source Depot :)


To give you a little idea of scale - they've been at this for at least 4 years. It started while I still worked in Windows.


This may be the thing that gets Google to switch. They like having every piece of code in a single repository which Git cannot handle.

Now that it is somewhat proven, maybe Google will leverage GVFS on Windows and create a FUSE solution for Linux.


I'd rather see google open up their monorepo as a platform, and compete with github. git is fine, but there's something compelling about a monorepo. Whether they do it one-monorepo-per-account, or one-global-monorepo, or some mix of the two, would be interesting to see how it shapes up.


Though as things are going, I wouldn't be surprised if Amazon goes from zero to production-quality public monorepo faster than Google gets from here to public beta. It's not in Google's blood.

And of course Google will shut it down in five years once they're bored of it.


Amazon doesn't do mono-repos. They have on the order of a million repos. Instead of going mono-repo, they invested in excellent cross-repo meta-version and meta-build capabilities.


"one-global-monorepo" caused me to envision a beautiful/horrifying Borg-like future where all code in the universe was in a single place and worked together.


This is how Linux (and BSD, and so on) distributions work. Of course there are proprietary and niche outliers, but you can't forbid those in the first place.


I felt like that when I first saw golang and how you can effortlessly use any repo from anywhere.


As the joke goes, Go assumes all your code lives in one place controlled by a harmonious organization. Rust assumes your dependencies are trying to kill you. This says a lot about the people who came up with each one.


> Rust assumes your dependencies are trying to kill you.

Would you mind unpacking this? I'm intrigued.


Cargo.lock for applications freezes the entire dependency graph incl. checksums of everything, for example.


This is the main thing I miss about subversion. You could check out any arbitrary subdirectory of a repository. On two projects the leads and full stack people had the whole thing checked out, everybody else just had the one submodule they were responsible for. Worked fairly well.
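
e.g. (URL hypothetical):

    svn checkout https://svn.example.com/repo/trunk/frontend   # checks out just that subtree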


Mercurial has narrowspecs these days too! Facebook's monorepo lets you check out parts of the overall tree too. It's not like every Android engineer's laptop has all of fbios in it.


Git has submodules too and teams usually have access control on the main server used for sharing commits.


git submodules aren't seamless. None of the alternatives appear to be any better. A half-assed solution is no solution at all.


Would you mind explaining this lack of seamlessness?

I can imagine that part of it could be the need of a git clone --recursive, and everybody omits the --recursive if they don't know there are submodules inside the repository. There is another command to pull the submodules later but I admit it's far from ideal.
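
For reference (URL hypothetical):

    git clone --recursive https://example.com/app.git   # clones the repo and its submodules
    # or, after a plain clone:
    git submodule update --init --recursive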


What's wrong with git submodules?


Google already has a FUSE layer for source control: http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...


Google used to have a Perforce frankenstein, but now they have their own VCS.


Piper is their perforce-like server. You check out certain files, as you would with p4, and work on the tree using git, with reviews, tests, etc. You periodically do a sync, which is like pull --rebase. Then you push your changes back into the perforce-like monorepo.


Also, they do all the development in one branch / in the trunk: https://arxiv.org/abs/1702.01715 (I never understood the explanations as to why they do that.)

Now the article says that with Windows they do branches.


They use Mercurial (or did), which is as good as git. In fact, I bet a lot of people at Google are happy to use Mercurial instead of git, given git's bad reputation for its command line interface.


They don't use mercurial.


You're thinking of Facebook if I'm not mistaken.


I had seen several sources that affirmed that Google used Mercurial, but I'm not sure to what extent, so I will retract it :-)


I'm sure there are a few teams that use Mercurial incidentally somewhere, but our primary megarepo is all on a VCS called Piper. Piper has a Perforce-y interface and there are experimental efforts to use Mercurial with Piper. Also mentioned in the article below, there's limited interop with a Git client.

If you're curious what it all ends up looking like, read this article. It's a fairly good overview and reasonably up to date.

https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...


IIRC, the git client is deprecated; the mercurial one is meant to replace it for the use cases where you would want a DVCS client interfacing with Piper in the first place.


It's not deprecated. I use git-multi every day, much to the chagrin of my reviewers. Google thinks that long DIFFBASE chains are weird and exotic, and Google doesn't like weird and exotic as a rule.


I wonder why Windows is a single repository - why not split it into separate modules? I can imagine tools like Explorer, Internet Explorer/Edge, Notepad, Wordpad, Paint, etc. could all stay in their own repositories. I can imagine you could even further split things up, like a kernel, a group of standard drivers, etc. If that is not already the case (separate repos, that is), are there plans to separate it in the future?


Really good question. Actually, splitting Windows up was the first approach we investigated. Full details here: https://www.visualstudio.com/learn/gvfs-design-history/

Summary:

- Complicates daily life for every engineer

- Becomes hard to make cross-cutting changes

- Complicates releasing the product

- There's still a core of "stuff" that's not easy to tease apart, so at least one of the smaller Windows repos would still have been a similar order of magnitude in most dimensions


> - Becomes hard to make cross-cutting changes

This does seem like a negative, doesn't it?

But it's not. Making it hard to make cross-cutting changes is exactly the point of splitting up a repo.

It forces you to slow down, and—knowing that you can only rarely make cross-cutting changes—you have a strong incentive to move module boundaries to where they should be.

It puts pressure on you to really, actually separate concerns. Not just put "concerns" into a source file that can reach into any of a million other source files and twiddle the bits.

"Easy to make sweeping changes" really means "easy to limp along with a bad architecture."

I think that's one of the reasons why so much code rots: developers thinking it should be easy to make arbitrary changes.

No, it should be hard to make arbitrary changes. It should be easy to make changes with very few side effects, and hard to make changes that affect lots of other code. That's how you get modules that get smaller and smaller, and change less and less often, while still executing often. That's the opposite of code rot: code nirvana.


No, it should be hard to make arbitrary changes.

If you change the word "arbitrary" to "necessary" (implying a different bias than the one you went with) then all of a sudden this attitude sounds less helpful.

Similarly "easy to limp along with a bad architecture" could be re-written as "easy to work with the existing architecture".

At the end of the day, it's about getting work done, not making decisions that are the most "pure".


You have to balance getting work done vs. purity, and Microsoft has spent years trying to fix a bad balance.

Windows ME/Vista/8 were terrible and widely hated pieces of software because of "getting things done" instead of making good decisions. They made billions of dollars doing it, don't get me wrong, but they've also lost a lot of market share and have been piling up bad sentiment for years. They've been pivoting, and it has nothing to do with "getting work done" but with going back and making better decisions.


Those releases (well, Vista and 8 anyway, I don't know about ME) came out of a long and slow planning process - if they made bad decisions I don't think it was about not taking long enough to make them.


I assumed that Windows 8 was hated because it broke the Start Menu and tried to force users onto Metro.


It also broke a lot of working user interfaces, e.g. wireless connection management.


> At the end of the day, it's about getting work done, not making decisions that are the most "pure".

This attitude will lead to a total breakdown of the development process over the long term. You are privileging Work Done At The End Of The Day over everything else.

You need to consider work done at every relevant time scale.

How much can you get done today?

How much can you get done this month?

How much can you get done in 5 years?

Ignore any of these questions at your peril. I fundamentally agree with you about purity though. I'm not sure what in my piece made you think I think Purity Uber Alles is the right way to go.


> This attitude will lead to a total breakdown of the development process over the long term.

As evidenced by Microsoft following the one repo rule and not being able to release any new software.

Wait, what?


The text I quoted had nothing to do with monolithic repos.


The linux codebase exists in stark contrast to your claim. Assuming your claim is that broken up repos is the better way.


No it doesn't. I think you are thinking of the kernel, which is separate from all the distros.

Linux by itself is just a kernel and won't do anything for you without the rest of the bits that make up an operating system.


Then I'll point to the wide success of monolithic utilities such as systemd as evidence that consolidating typically helps long term.

Which is to say, not shockingly, it is typically a tradeoff debate where there is no toggle between good and bad. Just a curve that constantly jumps back and forth between good and bad based on many many variables.


systemd is also completely useless on its own. It still needs a bootloader, a kernel, and user-space programs to run.

When it comes to process managers, there is obviously disagreement about how complex they should be, but systemd is still a system to manage and collect info about processes.


The hierarchical merging workflow used by the Linux kernel does mean that there's more friction for wide-ranging, across-the-whole-tree changes than changes isolated to one subsystem.


Isolated changes will always be easier than cross cutting ones. The question really comes down to whether or not you have successfully removed cross cutting changes. If you have, then more isolation almost certainly helps. If you were wrong, and you have a cross cutting change you want to push, excessive isolation (with repos, build systems, languages, whatever), adds to the work. Which typically increases the odds of failure.


Arguing about purity is only pointless and sanctimonious if the water isn't contaminated. Being unable to break a several hundred megabyte codebase into modules isn't a "tap water vs bottled" purity argument, it's a "let's not all die of cholera" purity argument.


As the linked article says, modularizing and living in separate repos was the plan of record for a while. But after evaluating the tradeoffs, we decided that Windows needs to optimize for big, rapid refactors at this stage in its development. "Easy to make sweeping changes" also means "easy to clean up architecture and refactor for cleaner boundaries".

The Windows build system is where component boundaries get enforced. Having version control introduce additional arbitrary boundaries makes the problem of good modularity harder to solve.


You say that, but it is very telling that every large company out there (Google and Facebook come to mind) goes for the single-repository approach.

I'm sure that, when dealing with stakeholder structures where different organizations can depend on different bits and pieces, having multiple repositories, where breaking and cross-cutting changes are deliberately difficult, becomes a good thing.

From the view of a single organization where the only users of a component are other components in the same organization, it seems like there is consensus around single-repository.


It is very telling. Google has a cloud supercomputer doing nothing but building code to support their devs. I don't know about Facebook. (I really don't -- I'm constantly amazed that they have as many engineers as they do; what do they all work on?) Where I work (https://medium.com/salesforce-engineering/monolith-to-micros...) there's a big monolith but with more push towards breaking things up, at the architecture level and also on the code organization level. We also use and commit to open source projects (that use git), so integrating those with the core requires a bit more effort than if they were already there, but it's not a big burden, and the benefits of keeping their tendrils self-contained are big.

Which brings me to the point that in the open source world, you can't get away with a single-repository approach for your large system. And that also is telling, along with open source's successes. So which approach is better in the long run? I'd bet on the open source methods.


You forget Conway's law:

> organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations

Open source has a very different communication structure than a company. While the big three (MS, Google, FB) try to work towards good inter-department relations, it is usually either

- a single person

- a small group

that are the gatekeepers for a small amount of code, typically encapsulated in a "project". They do commit a lot to their project, yet rarely touch other projects in comparison.

Also, collaboration is infinitely harder, as in the office you can simply walk up to someone, call them, or chat with them - in OSS a lot of communication works via Issues and PRs, which are a fundamentally different way to communicate.

This all is reflected by the structure of how a functionality is managed: Each set of gatekeepers gets their own repository, for their function.

Interestingly this even happens with bigger repositories: DefinitelyTyped is a repository for TypeScript "typings" for different JS libraries, which has hundreds of collaborators.

Yet, if you open a pull request for an existing folder, the people who have previously made big-ish changes there can approve/decline the PR, so each folder is effectively its own little repo.

https://github.com/DefinitelyTyped/DefinitelyTyped/pull/1672...

So: maybe the solution is big repos for closed companies, small repos for open-source?


Amazon doesn't.


From my experience with Java-style hard module dependencies, this makes it extremely difficult to refactor anything touching external interfaces.

You say this forces you to think ahead, but predicting the future is quite difficult. The result is that you limp along with known-broken code because it would take so much effort to make the breaking changes to clean it up.

For example, let's say you discover that people are frequently misusing a blocking function because they don't realize that it blocks.

Let's say that we have a function `bool doThing()`. We discover that the bool return type is underspecified: there's a number of not-exactly-failure not-exactly-success cases. In a monorepo, it's pretty easy to modify this so that `doThing()` can return a `Result` instead. With multiple repos and artifacts, you either bring up the transitive closure of projects, or you leave it for someone to do later. For a widely used function, this can be prohibitive. That makes people frequently choose the "rename and deprecate" model, which means you get an increasing pile of known-bad functions.
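
As an illustration only (a real refactor needs more than sed, and `doThing` is just the example name from above), the monorepo version of that change can land as a single atomic commit:

  # find every call site and update it together with the new signature
  git grep -l 'doThing(' | xargs sed -i 's/bool \(.*\) = doThing(/Result \1 = doThing(/g'
  git commit -am "doThing() now returns Result instead of bool"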


Have you actually worked on a repository at Windows scale..? If not, how can you know that your guesses about the workflow are accurate?


Making something difficult even more difficult is not helpful to anyone.

And what happens if the inherent difficulty disappears or diminishes? You're still left with the imposed external difficulty.


Remember when they said IE was an integral part of the OS? Yeah...


Facebook cited similar reasons for having a single large repository: https://code.facebook.com/posts/218678814984400/scaling-merc...


I remember that, in the 90's, you'd often get new UI elements in Office releases that then would eventually move into Windows. There was a technical reason - the cross-cutting - but there also seemed to be a marketing reason - the moment those UI elements became part of core Windows, all developers (yours truly included) would be able to use those elements, effectively negating the fresh look Office had over the competition.


That's a really interesting article. I wish I had found it before going down a similar path with my team recently.

These types of use cases seem so commonly encountered that there should be a list of best practices in the Git docs.


It's harder to share code between repos, though.

EDIT: like if something was to be shared between Windows and Office, for example.


This was a very interesting point. It sounds like there are some serious architectural limitations on Windows, and this makes me believe the same might be true for the NT kernel, and that MS might not be interested in doing heavy refactoring of it.

I'm not a frequent Windows user, or a Windows dev at all. Does anyone know of any consequences that MS's decision might mean, if this hypothesis is true?


The NT kernel is surprisingly small and well-factored to begin with - it is a lot closer to a 'pure' philosophy (e.g. a microkernel) than something like Linux.

If you have a problem with Windows being overcomplicated or in need of refactor it is almost certainly something to do with not-the-kernel.

If you look at something like the Linux kernel, it's actually much larger than the Windows kernel. It needs to have every device driver known to man (except that one WiFi/GPU/Ethernet/Bluetooth driver you need) because internally the architecture is not cleanly defined and kernel changes also involve fixing all the broken drivers.


Small and well-factored the core kernel may be, but if you're parsing fonts in kernel mode, you ain't a microkernel.

( https://googleprojectzero.blogspot.com.au/2015/07/one-font-v... )


For sure, and Windows is not a microkernel, but it does have separated kernel-in-kernel and executive layers; it would approach being a microkernel architecture if the executive was moved into userland. This is similar to how macOS would be a microkernel, if everything wasn't run in kernel mode (mach, on which it is partially based, is a microkernel).

Of course the issue here is that after NT 4, GDI has been in kernel mode; this is necessary for performance reasons. Prior to that it was a part of the user mode Windows subsystem.

I'd be curious to see if GDI moved back to userland would be acceptable with modern hardware, but I suspect MS is not interested in that level of churn for minimal gain.


Could you please share a link to the NT kernel sources so that I can take a look?

It is not true that the internal architecture of Linux drivers is not clearly defined. It is just a practical approach to maintaining drivers (as the author of one Linux hardware driver, I'm pretty sure it's the best possible one). The reasoning is outlined in the famous "Stable API nonsense" document: http://elixir.free-electrons.com/linux/latest/source/Documen...

I don't think the Windows approach is worth praising here. It results in a lot of drivers no longer working after a few major Windows release upgrades. In Linux, available drivers can be maintained forever.


If you are sufficiently motivated, NT 4 leaked many years ago and you could find it; it even has interesting things like DEC Alpha support & various subsystems still included IIRC. Perhaps you could find a newer version like 5.2 on GitHub or another site, but beware, as a Linux dev/contributor, you probably don't want to have access to that.

FWIW, I've stumbled upon both of those things in my personal research while I had legitimate access to the 5.2 sources as a student. It turns out Bing will link you directly to the Windows source code if you search for <arcane MASM directive here>.

Yes, I'm in the process of reporting this to Microsoft and cleansing myself of that poison apple.


Yes! The Windows kernel is a much more "modern" microkernel architecture than any of the circa-1969 Unix-like architectures popular today. We use Windows 10 / Windows Server for everything at our company, and we have millions of simultaneously connected users on single boxes. No problems and easy to manage.


It seems presumptive to say that "If you can't use multiple repos, your architecture must be bad". I could just as easily counter with, "If you're using multiple repos, it must mean you have an unnecessarily complex and fragile microservice architecture"


I'm sorry that was implied. I simply want better insight into the kernel from people who have experience developing large kernels, and the decisions that are made as a consequence of architectural choices.

I feel like that those questions are valid, and are important in this field, not just kernel development. As someone who desires to continue learning, I will not yield to your counter.


Google also uses a single giant repo...


Why is that a "good reason" to do it?


Not really :) and I am not the OP. This article [1] provides a very good overview of the repository organization Google has and the reasons behind it.

I think the reason this works so well for Google is the amount of test automation in place, which seems to be very good at flagging errors due to dependency changes before anything gets deployed. I'm not sure how many organizations have invested in and built an automated test infrastructure like Google's.

1. https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...


It radically simplifies everything. Every commit is reproducible across the entire tool chain and ecosystem.

It makes the entire system kind of pure-functional and stateless/predictable. Everything from computing which tests you need to run, to who to blame when something breaks, to caching build artifacts, or even sharing workspaces with co-workers.

While this could be implemented with multiple repos underneath, it would add much complexity.


I think this is like the "your data isn't big enough for HDFS" argument earlier this week. The point I take away is that at some stage of your growth, this will be a logical decision. I don't think it implies that the same model works for your organization.


And Facebook.


Why split it into separate modules? Seeing that big companies are very successful with monorepos (Google, Facebook, Microsoft) has made me reconsider whether repository modularization is actually worth it. There are a host of advantages to not modularizing repos, and I'm beginning to believe they outweigh those of modular repos.


A monorepo only works if you have tooling. Google's and Facebook's tools are not open source. Also, MS's tooling is Windows only.

So for most of us the only reasonable path is to split into multiple repositories.

It is also easier to create tools that deal with many repositories than it is to create a tool that virtualizes a single large repo.


Microsoft's tooling is Windows only _today_. GVFS is open source _and_ we are actively hiring filesystem hackers for Linux/macOS.


But Google's and Facebook's repos are also orders of magnitude larger than what most people deal with; normal tools might work just fine in most cases.


> Why split it into separate modules?

Well, because of the unbelievable amount of engineering work involved in trying to get Git to operate at such insane scale? To say nothing of the risk involved in the alternative. This project in particular could easily have been a catastrophe.


Microsoft has about 50k developers. When you're dealing with an engineering organization of this size, you're looking at a run rate of $5B a year, or about $20M a day. It's a no-brainer to spend tens or even hundreds of millions of dollars on projects like this if you're going to get even a few percentage points of productivity.

It can be hard to understand this as an individual engineer, but large tech companies are incredible engineering machines with vast, nearly infinite, capacity at their disposal. At this scale, all that matters is that the organization is moving towards the right goals. It doesn't matter what's in the way; all the implementation details that engineers worry about day-to-day are meaningless at this scale. The organization just paves over any obstacles in pursuit of its goal.

In this case Microsoft decided that Git was the way to go for source control, that it's fundamentally a good fit for their needs. There were just implementation details in the way. So they just... did it. At their scale, this was not an incredible amount of work. It's just the cost of doing business.


You haven't said anything at all about the risk.


If there's one thing large organizations are good at, it's managing risk. And if you read their post, they've done this.

They're running both source control systems in parallel, switching developers in blocks, and monitoring commit activity and feedback to watch for major issues. In the worst case, if GVFS failed or developers hated it, they could roll back to their old system.

Again, to my point above: there's a cost to doing this but it's negligible for very large organizations like Microsoft.


Wait so like, at google, Inbox and Android are in the same repo as ChromeOS and oh I dunno, Google Search? That doesn't make any sense at all...


Android and Chrome are different, but most of Google's code lives in a single repository.

https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...


It makes total sense when the expectation is that any engineer in the company can build any part of the stack at any time with a minimum of drama.


What's dramatic about copy and pasting a clone uri into a command?


A url? Not much.

100 urls? That's getting a bit annoying.


This just blew my mind. I'm gonna go home and see about combining all my projects. That seems very useful!


It makes sense to have separate repos for things that don't interact. But when your modules or services do, having them together cuts out some overhead.


You don't have the same requirements as Google.


What are my requirements such that this wouldn't work for me?


It makes things tricky if you want to opensource just one of your projects.


Well, there's always "git filter-branch".

Not that I'd want to run it on such a mega-repository; it takes long enough running it on an average one with a decade of history.
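
A minimal sketch of that extraction, assuming the project lives under a single directory (the paths are placeholders); note that filter-branch rewrites history, so it's best run on a fresh clone:

  git clone /path/to/monorepo project-export && cd project-export
  git filter-branch --subdirectory-filter path/to/project -- --all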



Yes, Inbox, Maps, Search, etc. are all in one repo in a specialized version control system.

Android, Chromium, and Linux (among others, I'm sure) are different in that they use git for version control so they are in their own separate repos.


Unsure if all their clients are in the same repo as well, but even if…

Why doesn't this make sense?

I personally think of a repo as an index, not a filesystem. You check out what you need, but there is one globally consistent state, which can e.g. be used for continuous integration tests.


Android is in a separate repo.


Their earlier blog post goes into it a bit: https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g...

The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component? A big spectrum. Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.

After much hand wringing we decided our strategy needed to be “the right number of repos based on the character of the code”. Some code is separable (like microservices) and is ideal for isolated repos. Some code is not (like Windows core) and needs to be treated like a single repo.


As Brian Harry alluded to, this is in fact very close to the previous system. As I recall it from my time at Microsoft:

The Windows source control system used to be organized as a large number (about 20 for product code alone, plus twice that number for test code and large test data files) of independent master source repositories.

The source checkouts from these repos would be arranged in a hierarchy on disk. For example, if root\ were the root of your working copy, files directly under root\ would come from the root repo, files under root\base\ from the base repo, root\testsrc\basetest from basetest, and so on.

To do cross repo operations you used a tool called "qx" (where "q" was the name of the basic source control client). Qx was a bunch of Perl scripts, some of which wrapped q functions and others that implemented higher-level functions such as merging branches. However, qx did not try to atomically do operations across all affected repos.

(The closest analog to this in git land would be submodules.)

While source control was organized this way, build and test ran on all of Windows as a single unit. There was investigation into trying to more thoroughly modularize Windows in the past, but I think the cost was always judged too great.

Mark Lucovsky did a talk several years ago on this source control system, among other aspects of Windows development:

https://www.usenix.org/legacy/events/usenix-win2000/tech.htm...

I believe it is still valid for folks not using GVFS to access Windows sources.


It'd probably be hell on earth for the developers/engineers if it was more than one repo.

I'm sure the build process is non-trivial and with 4,000 people working on it, the amount of updates it gets daily is probably insane. Any one person across teams trying to keep this all straight would surely fail.

Having done a lot of small git repos, I'm a big fan of one huge repo. It makes life easier for everyone, especially your QA team as it's less for them to worry about. In the future, anywhere I'm the technical lead I'm gonna push for one big repo. No submodules either. They're a big pain in the ass too.


Most of the problematic work would be solved by meta-repos and tools to keep components up to date and integrated upward.


So, this is actually pretty common. I know that both Google and Facebook use a huge mono-repo for literally everything (except I think Facebook split out their Android code into a separate repo?). So, all of Facebook's and Google's code for front-end, back-end, tools, infrastructure, literally everything, lives in one repo.

It's news to me that Windows decided to go that route too. Personally, I think submodules and git sub-trees suck, so I'm all for putting things in a monorepo.


How does a mono-repo company manage open sourcing a single part of their infrastructure if things are in one large repo? For example, if everything lived in one repo, how does Facebook manage open sourcing React? Or if I personally wanted to switch to one private mono-repo, how would I share individual projects easily?


So open sourcing can mean three different things:

The bad one is just dumping snapshots of the code into a public repo every so often. You need to make sure your dependencies are open source, have tools that rewrite your code and build files accordingly, and put them in a staging directory for publishing.

The good one is developing that part publicly, and importing it periodically into your internal monorepo with the same (or similar) process to the one you use for importing any other third-party library.

There's also a hybrid approach which is to try and let internal developers use the internal tooling against the internal version of the code, and also external developers, with external tooling, against the public version. That one's harder, and you need a tool that does bidirectional syncing of change requests, commits, and issues.


We have an internal tool that allows us to mirror subdirectories of our monorepo into individual github repositories, and another tool that helps us sync our internal source code review tool with PRs etc.
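
That internal tool isn't public, but a rough stock-git approximation of mirroring one subdirectory is `git subtree split` (the prefix and mirror URL here are hypothetical):

  # carve the subdirectory's history onto its own branch, then push it to the public mirror
  git subtree split --prefix=libs/somelib -b somelib-export
  git push git@github.com:example/somelib-mirror.git somelib-export:master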


An internal tool which manages commits between individual repos, etc. - doesn't it seem that this would be a logical extension to git itself? A little like submodules, but with the ability to publish only parts of the source tree. Maybe it would be impossible to keep consistency and avoid leaking information from the rest of the tree.


With difficulty.

No, seriously, that's the answer.


They have an internal mono-repo and public repos on GitHub that are mirrors of their mono-repo.


Pros of a big repo:

- don't have to spend time thinking about defining interfaces

Cons:

- history is full of crap you don't care about

- tests take forever to run

- tooling breaks down completely, though thanks to MS the limit has been raised considerably


Are the big monorepo companies actually waiting for global test suite completion for every change? I doubt that; I'm sure they're using intelligent tools to figure out which tests to actually run. Compute for testing is massively expensive at that scale, so it's an obvious place to optimize.


Google's build and testing system is smart in which tests to run, as you suspect, but it still has a very, very large footprint.


Right. My point is that the monorepo almost certainly isn't a problem in this regard.


You still have to do something about internal interfaces. The problem is that the moment you want to make a backwards-incompatible change to an internal interface now you have to go find users of it, and there go the benefits of GVFS... Or you can let the build and test system tell you what breaks (take a long coffee break, repeat as many times as it takes; could be many times). Or use something like OpenGrok to find all those uses in last night's index of the source.

Defining what portions of the OS you'll have to look in for such changes helps a great deal.

As to building and testing... the system has to get much better about detecting which tests will need to be re-run for any particular change. That's difficult, but you can get 95% of the way there easily enough.


- don't have to spend time thinking about defining interfaces

That seems like a design and policy choice, orthogonal to repos.


Not really. It's easier to make a single atomic breaking change to how different components talk to each other if they are in the same repository.

If they are in different repos, the change is not atomic and you need to version interfaces or keep backwards compatibility in some other way.


It's very much really. The fact that it's easier doesn't really matter - a repo is about access to the source code and its history with some degree of convenience. The process and policy of how you control actual change is quite orthogonal. You can have a single repo and enforce inter-module interfaces very strongly. You can have 20 repos and not enforce them at all. Same goes for builds, tests, history, etc. The underlying technology can influence the process but it doesn't make it.


I have always wondered how they deal with acquisitions and sales. I guess a single system makes sense there too.


I have worked at a company using one repo per team and at Google which uses a big monorepo. I much prefer the latter. As long as you have the infrastructure to support it, I see no other downsides (obviously, Google does not use git).


> Before the move to Git, in Source Depot, it was spread across 40+ depots and we had a tool to manage operations that spanned them.


Coming from the days of CVS and SVN, git was a freaking miracle in terms of performance, so I have to just put things into perspective here when the topmost issue of git is performance. It's just a testament how huge are the codebases we're dealing with (Windows over there, but also Android, and surely countless others), the staggering amount of code we're wrangling around these days and the level of collaboration is incredible and I'm quite sure we would not have been able to do that (or at least not that nimbly and with such confidence) were it not for tools like git (and hg). There's a sense of scale regarding that growth across multiple dimensions that just puts me in awe.


Broadly speaking this is true, but note that in some ways CVS and SVN are better at scaling than Git.

- They support checking out a subdirectory without downloading the rest of the repo, as well as omitting directories in a checkout. Indeed, in SVN, branches are just subdirectories, so almost all checkouts are of subdirectories. You can't really do this in Git; you can do sparse checkouts (i.e. omitting things when copying a working tree out of .git), but .git itself has to contain the entire repo, making them mostly useless.

- They don't require downloading the entire history of a repo, so the download size doesn't increase over time. Indeed, they don't support downloading history: svn log and co. are always requests to the server. Unfortunately, Git is the opposite, and only supports accessing previously downloaded history, with no option to offload to a server. Git does have the option to make shallow clones with a limited amount of (or no) history, and unlike sparse checkouts, shallow clones truly avoid downloading the stuff you don't want. But if you have a shallow clone, git log, git blame, etc. just stop at the earliest commit you have history for, making it hard to perform common development tasks.

I don't miss SVN, but there's a reason big companies still use gnarly old systems like Perforce, and not just because legacy: they're genuinely much better at scaling to huge repos (as well as large files). Maybe GVFS fixes this; I haven't looked at its architecture. But as a separate codebase bolted on to near-stock Git, I bet it's a hack; in particular, I bet it doesn't work well if you're offline. I suspect the notion of "maybe present locally, maybe on a server" needs to be baked into the data model and all the tools, rather than using a virtual file system to just pretend remote data is local.
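
For concreteness, the sparse-checkout form available in stock Git at the time looks roughly like this (the path is a placeholder); note that, as said above, it only trims the working tree while .git still holds the full object database:

  git config core.sparseCheckout true
  echo "some/subdir/" >> .git/info/sparse-checkout
  git read-tree -mu HEAD    # re-populate the working tree using the sparse patterns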


CVS and SVN are probably a bit better at scaling than (stock) Git. Perforce and TFVC _are certainly_ better at scaling than (again, stock, out-of-the-box) Git. That was their entire goal: handle very large source trees (Windows-sized source trees) effectively. That's why they have checkout/edit/checkin semantics, which is also one of the reasons that everybody hates using them.

GVFS intends to add the ability to scale to Git, through patches to Git itself and a custom driver. I don't think this is a hack - by no means is it the first version control system to introduce a filesystem level component. Git with GVFS works wonderfully while offline for any file that you already have fetched from the server.

If this sounds like a limitation, then remember that these systems like Perforce and TFVC _also_ have limitations when you're offline: you can continue to edit any file that you've checked out but you can't check out new files.

You can of course _force_ the issue with a checkout/edit/checkin but then you'll need to run some command to reconcile your changes once you return online. This seems increasingly less important as internet becomes ever more prevalent. I had wifi on my most recent trans-Atlantic flight.

I'm not sure what determines when something is "a hack" or not, but I'd certainly rather use Git with GVFS than a heavyweight centralized version control system if I could. Your mileage, as always, may vary.


GVFS won't fix this because you still need to lock opaque binary files, which is something Perforce supports.


Nothing about git prevents an "ask the remote" feature. It's just not there. I suspect that as git repos grow huge and shallow and partial cloning becomes more common, git will grow such a feature. Granted, it doesn't have it today. And the GVFS thing is... a bit of a hack around git not having that feature -- but it proves the point.


At the risk of sounding like a downer, this was a migration of an existing codebase.


I agree. I really think Linux needs a Nobel Prize.


For what? Peace?


A handful of us from the product team are around for a few hours to discuss if you're interested.


Does the virtualization work equally well for lots of history as it does for large working copies?

I have a 100k commit svn repo I have been trying to migrate but the result is just too large. Partly this is due to tons of revisions of binary files that must be in the repo.

Does the virtualization also help provide a shallow set of recent commits locally but keep all history at the server (which is hundreds of gigs that is rarely used)?


GVFS helps in both dimensions, working copy and lots of history. For the lots of history case, the win is simply not downloading all the old content.

A GVFS clone will contain all of the commits and all of the trees but none of the blobs. This lets you operate on history as normal, so long as you don't need the file content. As soon as you touch file content, GVFS will download those blobs on demand.


Thanks - that sounds perfect for lots of binary history as you never view history on the binaries, only the source files.


This is amazing, congrats. I worked on Windows briefly in 2005 (the same year git was released!) and was surprised at how well Source Depot worked, especially given the sheer size of the codebase and the other SCM tools at the time.

Is there anything people particularly miss about Source Depot? Something SD was good at, but git is not?


Another interesting complaint is one that we hear from a lot of people who move from CVCS to DVCS: there are too many steps to perform each action. For example, "why do I have to do so many steps to update my topic branch". While we find that people get better with these things over time, I do think it would be interesting to build a suite of wrapper commands that roll a bunch of these actions up.
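
A trivial sketch of such a wrapper, done as a git alias (the alias name and base branch are made up for the example):

  # "git sync": fetch and rebase the current topic branch in one step
  git config --global alias.sync '!git fetch origin && git rebase origin/master'
  git sync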


I just got a request today for an API equivalent to `sd files`, which is not something Git is natively great at without a local copy of the repo.


How do you prevent data exfiltration? I mean, in theory you could restrict the visibility of repos to the user based on team membership/roles and so prevent a single person from unauditably exfiltrating the whole Windows source code tree. In contrast, with a monorepo there likely won't be any alerts triggered if someone does do a full git clone, except for them saturating their switch port...


What would someone do with the source code for Windows? No one in open source would want to touch it. No large company would want to touch it. Grey/black hats are probably happier with their decompilers. Surely it would be easier to pirate than build (assuming their build system scales like most build systems I've observed in the wild). No small company would want to touch it.

Anyway, MS shares source with various third parties (governments at least, and I believe large customers in general), so any of these is a potential leak source.


This is all correct. Also, we'd notice someone grabbing the whole 300GB whether it's in 40 SD depots or a single Git repo.


Mind if I ask how you'd notice?


The article mentions relying on a windows filesystem driver. Two questions about that:

1) Why include it in default Windows? It seems that 99.99% of users would never even know it existed, let alone use it.

2) Does that mean GVFS isn't usable on *nix systems? Any plans to make it usable, if so?


1) The file system driver is called GvFlt. If it does get included in Windows by default, it'll be to make it easier for products like GVFS, but GvFlt on its own is not usable by end users directly.

2) GVFS is currently only available on Windows, but we are very interested in porting to other platforms.


What are your thoughts on implementing something more general like linux's FUSE[1] instead? A general Virtual Filesystem driver in Windows could be used for a wide range of things and means you don't just have a single-purpose driver sitting around.

[1] https://en.m.wikipedia.org/wiki/Filesystem_in_Userspace


A general FUSE-like API would be very useful to have, but unfortunately it can't meet our performance requirements.

The first internal version of GVFS was actually based on a 3rd party driver that looks a lot like FUSE. But because it is general purpose, and requires all IO to context switch from the kernel to user mode, we just couldn't make it fast enough.

Remember that our file system isn't going to be used just for git operations, but also to run builds. Once a file has been downloaded, we need to make sure that it can be read as fast as any local file would be, or dev productivity would tank.

With GvFlt, we're able to virtualize directory enumeration and first-time file reads, but after that get completely out of the way because the file becomes a normal NTFS file from that point on.


I'm curious what the cross-over may be between GvFlt and the return of virtual files for OneDrive in the Fall Creators Update. Is the work being coordinated between the two efforts?

From your description it sounds like there could be usefulness in coordinating such efforts.


For use cases that fit that limited virtualization, is GvFlt something to consider? Or is that a bad idea?


It's not included in Windows, which is why they have a signed drop of the driver for you to install. Even internally we have to install the driver.

Edit: oops, missed that line. Cool, wonder if it will show up in more than Git.


Yes, I was referring to the plan to include it in future Windows builds


Probably as an optional feature?


Why do you name it "GVFS" instead of something more descriptive like "GitVFS"?


This was discussed a bit in the comments in Brian Harry's last post on GVFS: https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g...

We're building a VFS (Virtual File System) for Git (G) so GVFS was a very natural name and it just kind of stuck once we came up with it.


Are you aware that the name was already taken[0] for something which also has to do with file systems?

[0] https://wiki.gnome.org/Projects/gvfs


Sure, a couple of questions:

1. How do you measure "largeness" of a git repo?

2. How are you confident that you have the largest?

3. How much technical debt does that translate to?


1. Saeed is writing a really nice series of articles starting here: https://www.visualstudio.com/learn/git-at-scale/ In the first one, he lays out how we think about small/medium/large repos. Summary: size at tip, size of history, file count at tip, number of refs, and number of developers.

2. Fairly confident, at least as far as usable repos go. Given how unusable the Windows repo is without GVFS and the other things we've built, it seems pretty unlikely anyone's out there using a bigger one. If you know of something bigger, we'd love to hear about it and learn how they solved the same problems!

3. Windows is a 30 year old codebase. There's a lot of stuff in there supporting a lot of scenarios.


Is it possible to checkout (if not build) something like Windows 3.11 or NT 4?


As far as I can recall, this is not possible using Windows source control, as its history only goes back to the lifecycle of Windows XP (when the source control tool prior to GVFS was adopted).

Microsoft does have an internal Source Code Archive, which does the moral equivalent of storing source code and binary artifacts for released software in an underground bunker. I used to have a bit of fun searching the NT 3.5 sources as taken from the Source Code Archive...


I recently heard a story that someone tried to push a 1TB repo to our university GitLab, which then ran out of disk space. Sure, that might not have been a usable repo but only an experiment. Still, I would bet against the claim that 300GB is the largest one.


300 GB is not the size of _the repository_. It's the size of the code base - the checked out tree of source, tests, build tools, etc - without history.

It's certainly possible that somebody created a 1 TB source tree in Git, but what we've never heard of is somebody actually _using_ such a source tree, with 4000 or more developers, for their daily work in producing a product.

I say this with some certainty because if somebody had succeeded, they would have needed to make similar changes to Git to be successful, though of course they could have kept such changes secret.


1 TB of code?


I'd sure like to run that as my operating system, browser, virtual assistant, car automation system, and overall do-everything-for-me system...


I'm currently investigating using GitLFS for a large repo that has many binary and other large artifacts.

I'm curious, did you experiment with LFS prior to building GitVFS?

Also, I know that there is an (somewhat) active effort to port GitVFS to Linux, do you know if any of the Git vendors (GitLab and/or GitHub) are planning to support GitVFS in their enterprise products?


Yes we did evaluate LFS. The thing about LFS is that while it does help reduce the clone size, it doesn't reduce the number of files in the repo at all. The biggest bottleneck when working with a repo of this size is that so many of your local git operations are linear on the number of files. One of the main values of GVFS is that it allows Git to only consider the files you're actually working with, not all 3M+ files in the repo.


> One of the main values of GVFS is that it allows Git to only consider the files you're actually working with, not all 3M+ files in the repo.

That is an excellent point. Thanks!


We at GitLab are looking at GitVFS but have not made a decision yet https://gitlab.com/gitlab-org/gitlab-ce/issues/27895


Very cool blog! As I understand it, you dynamically fetch a file from the remote git server the first time it is opened. Do you do any sort of pre-fetching of files? For example, if a file has an import and uses a few symbols from that file, do you also fetch the imported file beforehand, or just fetch it when it's first accessed?


For now, we're not that smart and simply fetch what's opened by the filesystem. With the cache servers in place, it's plenty fast. We do also have an optional prefetch to grab all the contents (at tip) for a folder or set of folders.


We don't currently do that sort of predictive prefetching, but it's a feature we've thought a lot about. For now, users can explicitly call "gvfs prefetch" if they want to, or just allow files to be downloaded on demand.


What's the PR review UI built in?


Custom JQuery-based framework, transitioning to React.


Actually, I think we finished the conversion to React :). So, React.


Taylor is the dev manager for that area so I'm inclined to believe his correction :)


What was the impetus for switching to git?


More or less:

- Availability of tools

- Familiarity of developers (both current and potential)


Any plans to port GVFS to Linux or macOS?


Why not TFS?



"A handful of us from the product team are around for a few hours to discuss if you're interested."

Thanks!

This is a little off-topic, but why can't Windows 10 users conclusively disable all telemetry?

(I consider the question only a little off-topic, because I have the impression that this story is part of an ongoing Microsoft charm-offensive.)


Haha, no answer, as expected. HN got butthurt as well lol.


I knew there was a risk of getting downvoted, but I was surprised that it went to "-4".


This is so awesome. Brilliant move MS! In addition to enabling Windows engineers to be significantly more productive (eventually), it will go a long way to enabling engineers in other departments to contribute to Windows. For example, I used to work in the Azure org and once noticed a relatively simple missing feature in Windows. I filed a bug and was in contact with a PM who suggested if I wanted I could work on adding it myself. I dipped my toe in, but the onboarding costs were just too high and I quickly decided against it. With Windows on git, much more likely to have dived in.


I'm not so sure moving to Git alone would have helped your case. Getting an enlistment is only a small part of contributing to Windows.


True, but the move to Git is part of our larger "1ES" (One Engineering System) effort across the company. The idea is, if you know how to enlist/build/edit/submit in any team, you know how to do the same in any team.


Agreed, but probably one of the top 5 road blocks.


This is pretty crazy. It's very hard to imagine working on a single codebase with 4,000 other engineers.

> Another key performance area that I didn’t talk about in my last post is distributed teams. Windows has engineers scattered all over the globe – the US, Europe, the Middle East, India, China, etc. Pulling large amounts of data across very long distances, often over less than ideal bandwidth is a big problem. To tackle this problem, we invested in building a Git proxy solution for GVFS that allows us to cache Git data “at the edge”. We have also used proxies to offload very high volume traffic (like build servers) from the main Visual Studio Team Services service to avoid compromising end user’s experiences during peak loads. Overall, we have 20 Git proxies (which, BTW, we’ve just incorporated into the existing Team Foundation Server Proxy) scattered around the world.

If I was a hacker, this paragraph would probably encourage me to study the GVFS source code and see if I can find some of these Git proxies. I have no idea how you would find them, but there might be some public DNS records. This sounds like some very new technology and some huge infrastructure changes, which are pretty good conditions for security vulnerabilities. What kind of bounty would Microsoft pay if you could get access to the complete source code for Windows? $100,000? [1]

[1] https://technet.microsoft.com/en-us/library/dn425036.aspx


Looks like all of the charts were made in Excel… that's some dedication to staying on-brand!


Hah, they made sure there was just enough styling left for you to notice.


As opposed to what?


HTML and co. It is a website..


Probably also about familiarity. While they probably could have mashed together something using HTML and CSS, Excel is all about nice-looking tables and charts. If only they offered a way to embed them instead of needing to take low-res screenshots...


19 seconds for a commit (add + commit) might be long but the new improvements look promising (down to ~10s).

(Please correct me if the COMMIT column in the perf table includes the staging operations.)

This looks awesome. I just wish Facebook would also share some perf and time statistics on their own extensions for Mercurial, last time I checked their graphs were unitless.


Indeed, while 19 seconds for commit is far better than 30 minutes we would have seen without GVFS, it's way too slow to actually feel responsive while you're coding. And in fact, it was sometimes worse than 19 seconds because commands like status and add would generally get slower as you access and hydrate more files in the repo. With the big O(modified) update that we just made to GVFS, git commands no longer slow down as you access more files, so now our devs see a consistent commit time of around 10 seconds, and consistent and faster times for most other commands too.


You have to put this into perspective with what they are replacing. You'd never get a submit done in less than 19 seconds using the old Source Depot tools anyway. When you work on projects this big, which take hours to compile and minutes to compile incrementally, responsiveness just isn't something you get to have at scale.


This is mainly why I'm asking for data from Facebook. I've seen claims that their vcs operations (at least the most common ones) are near instant, but nothing official. It would appear that FB have solved the responsiveness problem with Mercurial but, again, no official data to back it up.


Those of us working on smaller codebases may wonder what the big deal is. Facebook had a similar problem leading them to switch out to mercurial. https://code.facebook.com/posts/218678814984400/scaling-merc... It is awesome that the problems could be solved in git itself.

Also, kudos to the writer of the blog. It is a really high quality blog post. The percentile measures of performance, survey responses from users etc are very typical of solid incremental approaches to challenges faced by startups except these are internal customers.


Isn't this perhaps the greatest validation of Linus's design genius that what was initially a weekend project[0] has successfully scaled to this?

They could no longer use their revision control system BitKeeper and no other Source Control Management (SCMs) met their needs for a distributed system. Linus Torvalds, the creator of Linux, took the challenge into his own hands and disappeared over the weekend to emerge the following week with Git.

[0] https://www.linux.com/blog/10-years-git-interview-git-creato...


> Isn't this perhaps the greatest validation of Linus's design genius that what was initially a weekend project has successfully scaled to this?

I thought the entire point of the article was to show how git didn't scale, and how they're basically rewriting the project and changing it as much as necessary to make it scale. It's not like Linus designed git to scale as O(modified).


On the other hand, becoming something bigger out of his hands is validation in its own right...


And Linux was a hobby project as well... now it runs most of the internet AND most of the personal computing devices on the planet.


No, not really. When it scaled enough for the Linux kernel he lost interest in scaling any further, and that's nowhere near what's needed for a monorepo (or a Linux distribution).

And I recall that there was a fairly infamous tech talk at Google where he basically dismissed Google's concerns in an arrogant way.

So I'd say it's mostly a validation of git and Github's popularity with developers, that other people were willing to put in so much work to improve it.


Well by "validating git" aren't you by proxy validating its design by its creator?

Or am I missing something key here? Yes, of course it's been extended by very smart people, has new features, etc., but a weekend project that was robust enough to handle the Linux kernel codebase for him? That's no small feat.

Oh yes, perhaps it's the new meme among developers that there are no real genius architects and coders, and that anyone could have done it really, so why give props to any one creator?

Yeah given my experience I have a hard time buying into that mindset.


Sure, that weekend project was extremely successful, and git has a pretty clean design.

My point is that it didn't scale to monorepo size, and was never intended to do anything like that. (How many years did it take before Microsoft came along and decided to do it?) Making a claim like that is basically hero worship.

When software is popular enough, people will jump through all sorts of hoops to extend it and maintain compatibility with it. (For example, POSIX, JavaScript, Java, or even PHP.) These technologies all have their good parts and bad parts. The quality of the design is less important than whether it fulfilled a need successfully and appeared at the right time.


If you have a large repo and don't need the full history on clone, a shallow clone actually helps you save storage.

hg is still behind on this, as far as I can tell from searching. FB has this as an extension: https://bitbucket.org/facebook/hg-experimental/

Is FB fully using hg internally, or both Git and hg? Obviously FB has public repos on GitHub.
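
For anyone unfamiliar, a shallow clone looks like this (the URL is a placeholder):

  # download only the most recent commit instead of the full history
  git clone --depth 1 https://example.com/big-repo.git
  # later, fetch the rest of the history if it turns out to be needed
  git fetch --unshallow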


My understanding is that fb is actively promoting hg for internal repositories. Not sure how they sync between public git and internal hg.


300GB of code, WOW! Just for comparison, the entire English Wikipedia dump including all media is about 50-60GB. What are you guys doing there, and how large do you see this growing?


Most of that 300GB isn't text. There are test assets, images, videos, built binaries, vhd's, etc. Also, I should be clear that that 300GB is just at tip (no history). We can debate about whether or not those things should be checked into the repo but they are there now.


How did you go about creating the central repo, and how long did it take? A 2GB-at-tip SVN repo with 100k commits is taking me many days, and each odd failure typically has me restarting the process after filtering out some obscure part of the tree.

Edit: read in another comment that you dropped the history. Understandable, but I can appreciate how that would add friction (devs having to look through two different histories).


The Windows team developed a tool called "GitTrain" that knew how to:

- migrate the tip of a branch to Git (yes, the 300GB number is the tips of all the interesting branches, not the history)

- keep a Git branch and a SD branch in sync for a while

- be re-run over each of the 400+ branches they care about

But they went through some of the same trial-and-error process that you're describing.


Whoa. 300GB with a shallow clone?! What size does the whole repo use on the server side?


The pack file size for a full clone is 187GB. The 300GB is the working directory. We did not import the history of the code base, so the current repo only has about 5 months of history. As others have called out, there are a lot of assets in the repo that don't compress.


Why only 5 months? Will more of the history be added to the git repository eventually?


No, we'll keep the SD servers around for a while for servicing older products. We also have a "breadcrumbing" system that lets an engineer follow a file's history back from Git to the old system.


Was importing the complete history tried during the development? This is very interesting. The git history will grow at break-neck speed and will reach similar size soon enough. Is this to delay the inevitable tech wrangling for dealing with terabyte histories or were there issues with the import/sync?

Or maybe it was just the initial repo setup used for alpha testing that got promoted to production :)


A lot of that is probably static art assets. A lot of big corporate projects are guilty of just throwing everything in one tree like this where for some bizarre reason you are checking out jpgs and video files. It is part of the reason why the Chrome or Unreal Engine repos are preposterously large (hundreds of megs to gigabytes each).


> A lot of big corporate projects are guilty of just throwing everything in one tree like this where for some bizarre reason you are checking out jpgs and video files.

Would you want to build Chrome without its icons for bookmark folders, the launcher, etc.? No?

What about a game without its intro videos? Even if you need to debug a crash during the initial intro, related to playing said video?

Some people prefer this kind of content to be managed with a different, media focused version control system, or a separate tree, but now you have 2+ disparate distribution / version control systems to try and keep in sync. You do have options like git submodules now for some VCSes, but I haven't been impressed with their workflow, and they weren't always an option. I will gladly burn several gigs just to simplify my workflow - I've got the spare SSD and bandwidth. I've had fully built checkouts in the hundreds of gigabytes range.
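
For what it's worth, the submodule route looks roughly like this (URLs and paths are made up); it works, but it is exactly the extra ceremony I mean:

  # pin a separate media repo at a known commit inside the main tree
  git submodule add https://example.com/game-assets.git assets
  git commit -m "Track game assets as a submodule"

  # everyone else then needs an extra step when cloning
  git clone --recurse-submodules https://example.com/game.git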

And don't get me wrong, "hundreds of gigabytes" is getting annoyingly large if the IT department skimped and bought me a 250GB SSD. I'll abuse directory junctions to offload some of that onto mechanical drives. This is relatively fire and forget though - multi-repository is not.
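
(The junction trick, for reference, is a one-liner from a Windows command prompt; the paths here are made up:)

  rem redirect a bulky folder on the SSD to a mechanical drive
  mklink /J C:\src\bigrepo\TestAssets D:\offload\TestAssets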


And there was an article a few weeks back about how certain content creation tools (Adobe) stuff in lots of metadata into these assets by default.

Some of the MS teams were good about stripping those out for release; others didn't seem to notice.


Well, to be fair, the metadata in those files amounted to 5 MiB over the whole of Windows, so while there is a little overhead, it won't make up a significant chunk.


Oh, I didn't see that part. That's not nearly as bad as I thought based on my reading of the original article.


Hmm. I believe it's likely there's also lots of binary assets - the WAV sound files, BMP images, the "hello world" videos, for example - and possibly also the raw versions of the assets. And if it's really the whole history of Windows in there, that's a LOT of binary assets in LOTS of versions.


Sorry to see SourceDepot (slowly) decommissioned. I loved it and since it was a Perforce fork, what I've learned was directly applicable when I started using P4 in my subsequent job. Perhaps I'm old fashioned but I really see little appeal in DVCSes. I liked Hg but in the long run it's going to be completely run over by Git so I'd rather not invest in it. I'm rambling, sorry.


What was performance like for the Source Depot system? It would be interesting to note the comparison between the old SDX system and GVFS.


Quoting Brian Harry from a comment response at https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

"It depends a great deal on the operation. SourceDepot was much faster at some things – like “sd opened”, the equivalent of “git status”. sd opened was < .5s. git status is at 2.6s now. But SD was much slower at some other things – like branching. Creating a branch in SD would take hours. In Git, it's less than a minute. I saw a mail from one of our engineers at one point saying they'd been putting off doing a big refactoring for 9 months because the branch mechanics in SD would have been so cumbersome and after the switch to Git they were able to get the whole refactoring done in a topic branch in no time.

On an operation, by operation basis, SD is still much faster than our Git/GVFS solution. We're still working on it to close the gap but I'm not sure it will ever get as fast at everything. The broader question, though is about overall developer productivity and we think we are on a path to winning that."


Game development also has very large files and codebases; Git LFS is sometimes not enough. This is great for everyone, really, but especially nice for game development and larger codebases that carry lots of assets alongside the code.

Microsoft is doing great work here, and I hope it makes it to Bitbucket, GitHub, etc.


I don't know much about Windows development, but I'm sure the system is modularized in some way. Why wouldn't you want to break up the project into multiple repos for different parts of the system? That would let you work on and test each part independent of the rest. Each part should be able to function on its own, right? Of course some engineers would need to build and test the entire OS as a whole, but I'd wager that (for example) the team working on visual design of the settings app doesn't need to have the source code of how the login screen verifies passwords.

Clearly Microsoft's process works well enough for them, so I wonder what benefits there are to using the monolithic repo choice over many smaller repos.


Google has a single repo. The advantages are that you don't need to version anything because you always build against head. It's awesome but requires some discipline and good infrastructure.


If you have many small repos for a large interconnected project, you simply move the complexity of managing a change that spans repos into another tool that can handle cross-repo changes and dependencies. With a single repo you can change something, build it, fix any breaks, and then commit it with just source control and the build system. The many-small-repos approach has, in my experience, been driven by either poor processes or tooling limitations.


Windows was developed a long time ago, and I'm guessing components were never fully separated as the codebase grew larger everyday.


Linus must be very proud - his favourite software, Windows, now depends on Git.


Linus ought to be proud - it wasn't too long ago that Microsoft was calling his other work "a cancer", and now Windows depends on Git. Younger me would not believe this.

Linus definitely is a rare genius on design and execution - he made his mark not only on Kernels/OSes, but on version control systems as well. I salute you, Linus!


If I recall correctly, wasn't the "cancer" remark in reference to the GPL?


Your recollection is inaccurate.

> Microsoft CEO and incontinent over-stater of facts Steve Ballmer said that "Linux is a cancer that attaches itself in an intellectual property sense to everything it touches," during a commercial spot masquerading as a interview with the Chicago Sun-Times on June 1, 2001.[0]

https://www.theregister.co.uk/2001/06/02/ballmer_linux_is_a_...


it is cancer with good connotations


I am very certain that Linus would read this article and curse the jaw dropping stupidity of the whole endeavor. They've basically taken a tool he wrote to do real distributed source control that can scale and turned it into a central server.

Git was never meant to be used this way and I know he'd be horrified+amused in the extreme.


centralization has nothing to do with this feature. It's about big repos, centralized or not.


Git is scalable in terms of developer count, not repo size.


Well how the tables have turned! Only about 3 yrs back I was having a conversation with a Microsoft engineer about them evaluating a closed source Hadoop clone because Microsoft policy prohibited them from using open source.


Different divisions have had different stances on open source code for a long time. Somewhere I still have the t-shirt from our first "Open Source Day" event back in 2008 (and it's not like that was the first time any MS employee had ever considered using open source). Things are a lot more standardized now, with a big push from both the top and the bottom to use open source wherever it makes sense. Why reinvent the wheel?


> Why reinvent the wheel?

Look who's talking :)!

Wasn't WiX released and developed as an Open Source project by MS?


IIRC, it was the first open source project started by Microsoft / by a Microsoft employee.

As a German, I find the name super cringeworthy, but it makes me giggle every now and then.


Using open source in their product is different from using it for development. I wonder which one you were talking about with that employee.


I'm pretty sure it was for a product (server side), not an internal dev tool. Back in 2000 one of my roommates was an intern with the VS team and he said a lot of the devs were using emacs.


I hope this largest repository has enough space for Clippy as Linus loves it.



> You also see the 80th percentile result for the past 7 days […]

What'd be even more interesting to see is something like the 95th or 99th percentile, as showing that 80% of all operations finish in acceptable time is nice, but probably not what's necessary to have satisfied customers.


The day has come that Microsoft employees are celebrating how good they're getting at running Linus Torvalds' source code management tool.

Cats and dogs, flying pigs.

Might be good to start work on a compatible client and server for FUSE-based systems (Linux, OpenBSD, macOS [with a FUSE kernel module]).


I have Git repositories much larger than 300GB, for binary data. The title should be "the largest Git repo for source code", in my opinion. BTW, it is a nice thing Windows development being moved to Git SCM. What's Linus opinion on that? What a victory :-)


Non-dev here, but does this replace/overlap with TFS? What was the driver to adopt Git?


So, TFS/VSTS is a suite of developer services. They fully support and integrate with git. In other words, git is a first-class citizen in TFS/VSTS. The centralized version control system in TFS/VSTS is called "Team Foundation Version Control" or TFVC.

There were a bunch of drivers to move to Git:

1. DVCS has some great workflows: local branching, transitive merging, offline commit, etc.

2. Git is becoming the industry standard, and using it for our version control is both a recruiting and a productivity advantage for us.

3. Git (and its workflow) helps foster a better sense of sharing, which is something we want to promote within the company.

There are more, but those are the major ones.


Good questions. TFS is a whole suite of services: 2 version control systems (TFVC and Git), work item tracking, build orchestration, package management, and more. VSTS is the roughly-analogous cloud-hosted version.

I'd have to dig up the link: a few years ago our VP had a good blog post on why we chose to add a Git server to our offering. TFVC is a classic centralized version control system. When we wanted to add a distributed version control, we looked at rolling our own but ultimately concluded that it was better to adopt the de facto standard.


  ultimately concluded that it was better to adopt the de facto standard
Thank you for that : )


This is both very cool and eerily reminiscent of MVFS and ClearCase. It's a huge change in Git, going from de-centralized to hyper-centralized. If I read it right, git status, git commit and even running a build or cat'ing a file may not work if your network or the central server is down.

I hope they have thought hard about how to get the "Git proxy" for remote sites working well. If they end up with remote sites working on a 15 min - 1 hour old tree that will be very annoying.


So, if windows engineers are using git now, who is using TFVC? That's Team Foundation Version Control- the original TFS version control engine.


Lots and lots of external customers, and a handful of internal folks. FWIW Windows was never on TFVC (at least not the main development group).


What is the opposite of dogfooding?


That wouldn't be dogfooding - the windows team is not responsible for TFVC, innit.


The pain increases with the square of the number of files and to the fourth power of the dependencies between specific versions of them.

I'm not sure a big repo is a wise thing, even though I understand it may make sense for multiple reasons for a company, understanding it may damage brains far more sophisticated than mammalian ones.


Maybe I missed it, but I wish they would have compared times to what they were using before (Source Depot).

I guess I would more specifically like to know what pain points drove Microsoft to even try such a massive change.


Any word on open sourcing parts of the windows OS now that MS is seeing the light? The head guys have to see the benefits by now.

It says something that MS chose Git over anything proprietary that they developed.


I'm guessing that would be a licensing nightmare. They must pay many companies for licensed technologies inside Windows, and many of those licenses likely wouldn't be compatible with open source licensing.

All of their source code would have to go through legal review, some with each check in. I don't see that happening for legacy code.


If you look at how long it took for Sun to make Solaris free software (and even then it wasn't truly free in some cases) I doubt Microsoft would ever consider spending that much time doing it.


Rather than worry about getting `status` under 10 seconds, just focus on `diff --stat` and `diff --cached --stat`. Those two replace most uses for `status`.
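
Concretely, something along these lines (plus ls-files if you also want the untracked list that status would have shown):

  # unstaged changes, one summary line per file
  git diff --stat

  # staged changes about to be committed
  git diff --cached --stat

  # untracked files, without the full status machinery
  git ls-files --others --exclude-standard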


I see that Windows engineering uses a merge workflow. I wonder why. See other comments in this thread about rebasing.
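
(For anyone unfamiliar, the rebase-style flow being contrasted with it looks roughly like this, assuming origin/master is the integration branch:)

  # replay local topic-branch commits on top of the latest upstream
  git checkout topic
  git fetch origin
  git rebase origin/master

  # then fast-forward instead of creating a merge commit
  git checkout master
  git merge --ff-only topic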


At one point, I believe Microsoft was using a modified Perforce server for source code. Is that completely gone now?


Most of the large Source Depot users have moved to Git or, like Windows, are in the process of moving to Git. Legacy stuff will probably live on in SD for a long time, possibly forever, for maintenance work.


You are thinking of Google.


I meant that Google was the company who used Perforce in the past, and not Microsoft. Google isn't using it anymore either; they switched to their own thing named Piper.

https://www.wired.com/2015/09/google-2-billion-lines-codeand...


SD was based on a very old Perforce as well.


I am not sure, but doesn't this look like an open source version of what ClearCase provides?


Can you go into any more detail of the breakdown of your repo structure? Thanks!


edit: forgot, no Markdown here

Do you mean across all of Microsoft? Different teams have different structures. Speaking only for TFS and VSTS, we have a single repo containing the code for both, a handful of "adjunct" repos containing tools like GVFS, a repo for the documentation [1], and a bunch of open source repos for the build and release agent [2], agent tasks [3], API samples [4], and probably more I don't know about.

[1] https://www.visualstudio.com/docs

[2] https://github.com/microsoft/vsts-agent

[3] https://github.com/Microsoft/vsts-tasks

[4] https://github.com/Microsoft/vsts-dotnet-samples


So THIS is why they developed GVFS.


is GVFS portable to other OSes?


By design, yes. There are not (yet) implementations on other OSes.


Heh, the place I work at might have the single largest monolithic SVN repo. It works surprisingly well.


I am kinda surprised that Microsoft doesn't use TFS - after all, it's their own version control system. But then again, we use TFS at work, and not a day goes by on which I do not long for Git.


Here is my cynical view: From what I know they have a history of not using their own tools. They didn't use SourceSafe, but Perforce. Then they made an effort to switch to TFS, realized that it sucks and moved on to git.

You can see the same pattern in Windows desktop apps. They didn't use MFC for themselves, didn't use Winforms, used WPF only a little.


To be somewhat less cynical, VSS and TFVC were not intended to scale to the size of Windows's codebase, thus they weren't used. And instead of inventing our own thing this time around, we went with (and make contributions back to) the de facto standard.

When I was in Xbox, we had a lot of things in TFVC: all the services code, most of the console and Windows apps, and many of the tools. Only the Windows-related bits were in SD.


If they didn't use MFC, WinForms and WPF what exactly were they using? Flash?


Win32



WTL is way old.


Yes, but it's far newer than Win32 and MFC. There's also ATL, Active Template Library.


UWP everywhere now though.


Is that true? Which larger app is written in UWP? Something like Office, Skype or Visual Studio.


Windows itself is gradually rewriting its shell components in UWP XAML.


The preferred Windows client for Skype has been UWP for a while now (though up until recently it was still often referred to as "Skype Preview").


There is a watered-down version of Office in UWP. The built-in OneNote is actually this UWP. It uses a different stylus filtering algorithm from the desktop version, which is why I don't use it myself, but rest assured it works just fine.


And, just to be clear, git is a first-class citizen in VSTS/TFS. We've fully embraced git as THE DVCS solution within VSTS/TFS. It is seen as a companion to TFVC.


Why don't you install the 2013/2015 version of TFS and use git? It's included and fully supported.


All the modern development on VSTS is focused on git as well.


Wow the majority of posts here are from Microsoft employees.


So? Why is that surprising?


Sounds like a lot of good work.

But, in "Git repo", what the heck is a "repo"? A repossession as in repossessing a car?

In the OP with "Everything you want to know about Visual Studio ALM and Farming", what is ALM -- air launched missile? What do air launched missiles and "farming" have to do with Visual Studio?

To Bill Gates and Microsoft: For my startup, I downloaded, read, indexed, and abstracted 5000+ Web pages from the Microsoft Web site MSDN. That took many months. Then I typed in the software for my startup, 24,000 programming language statements in Visual Basic .NET 4 and ADO.NET (Active Data Objects, for getting to the relational data base management system SQL Server) and ASP.NET (Active Server Pages, for building Web pages) in 100,000 lines of typing. For that work, all of it that was unique to me and my startup was fast, fun, and easy.

Far and away the worst problem in my startup, that delayed my work for YEARS, was the poor quality of the technical writing in the Microsoft documentation.

Some of the worst of the documentation was for SQL Server: Gee, I read the J. Ullman book on data base quickly and easily while eating dinner at the Mount Kisco Diner. But the Microsoft documentation was clear as mud. Just installing SQL Server ruined my boot partition: SQL Server would not run, repair, reinstall, or uninstall, and I had to reinstall all of Windows and all my applications and try again, more than once.

Quickly I discovered that the documentation of logins, users, etc. was a mess: basically the ideas seemed to be old capabilities, attributes, authentication, and access control lists, but nothing from Microsoft was any help at all. Eventually via Google searches I discovered some simple SQL statements I could type into a simple file and run with the SQL Server utility SQLCMD.EXE; that way I got some commands that worked for much of what I needed. Now those little files are well documented and are what I use. For getting a connection string that worked, again the documentation was useless, and I tried over and over with every variation I could think of until, for no good reason, I got a connection string to work. Once I tried to get a new installation of SQL Server to recognize, connect to, and use a SQL Server database from the previous installation of that version of SQL Server, but the result just killed the installation of SQL Server.

Again, once again, over again, yet again, one more time, far and away the worst problem in my startup is making sense out of Microsoft's documentation. I found W. Rudin, Real and Complex Analysis fast, fun, and easy reading; Microsoft's documentation was an unanesthetized root canal procedure -- OUCH!

So, again, once again, over again, yet again, one more time, please, Please, PLEASE, for the sake of my work, Microsoft, and computing, PLEASE get rid of undefined terms and acronyms in your technical writing. Get them out. Drive them out. Out. Out of your writing. Out of your company. Out of computing. No more undefined terms and acronyms, none, no more. I can't do it. You have to do it. Then, DO IT.


Repository.

Application Lifecycle Management.

It's Brian Harry's blog, he has a farm, sometimes he posts about the (mis)adventures of his animals.


The definitions of "git repo" and "ALM" are the top search result in google for both terms.


I can believe that.

My point is, shouldn't articles define or at least give lines to definitions for terms?

Apparently Google has discovered that their usual keyword/phrase search of Web pages should be set aside when a search is really for some jargon or an acronym, and that a special search, just for definitions, should be done for such terms. So, if Google understands the crucial importance of unwinding jargon and acronyms, the rest of us in computing can also.


This would create a lot of noise for the regular readers of his blog, who already know the definitions of these terms. The terms are not even obscure. This is also a blog, not a piece of technical documentation.


This just seems like the exact opposite of "Do one thing; and do it well".


I would pay money to see a video camera of Linus' face reading this article. I think we'd probably get impossible new shades of the color red heretofore unknown to humanity.


Linus Torvalds rocks. Windows sucks. Subversion was crap so he just made git instead. git beat out svn, TFS and all that other crap legions of overpaid engineers came up with (or what they didn't get source control???) because unix design philosophy and therein lies the lesson still unlearned for they hath loaded all their bloat into one repo.

Windows. It sucks and it will forever suck because it sucks by design. Bill say 'Thank you Linus - I owe you sooo much because git is way better than the best I could do'

I mean has there ever been worse software ever written than the stuff being loaded into git right now? Awful, awful garbage, creaking and reeking of dirty hacks, different for the sake of it designs, misshapen, bolted together, bloated, willfully annoying, antisocial, phone home, locking-in, full of resolutely, defiant ancient unfixed bugs, butt ugly, horrible UI, full of errors and meaningless error messages, incessant nagging and weird quirks, wtf folders, command line from hell and urgh... note pad ... and oh dear god I almost forgot mmc consoles and visual studio and inconsistent flows, viral load by the galactic shit tonne, complete and utter drivel makes me want to vomit every time I hear that sickening jingle and after all those gazillions of engineering hours an absolute world wonder of fail?

Two guys working out of a garage could do better.

:P (Windows sucks btw)


Sadly we will have to quit blaming Bill Gates. I doubt he makes very many design decisions any more. :)


If he had only listened to me and re-released Xenix open source with a decent WM we could have avoided all this unpleasantness but no, he had to listen to Monkeyboy. :/


And then there's Windows Subsystem for Linux as well.



