Commits are snapshots not diffs (2020) (github.blog)
323 points by warpech on April 8, 2021 | 307 comments



From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.

Whereas if I had used git, and created 2 additional commits, each making a change to a small text file, my total storage size would be barely larger than 100MB. Describing the commits as a diff, as opposed to a snapshot, leads to a better intuitive understanding of why this would be the case.
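A quick way to see this for yourself (a sketch; the sizes in the comments are what I'd expect, not measured):

    git init demo && cd demo
    dd if=/dev/urandom of=big.bin bs=1M count=100    # ~100MB of incompressible data
    git add . && git commit -m "initial import"
    du -sh .git                                      # roughly 100MB: one full snapshot

    echo "tweak 1" >> notes.txt && git add . && git commit -m "tweak 1"
    echo "tweak 2" >> notes.txt && git add . && git commit -m "tweak 2"
    du -sh .git                                      # still roughly 100MB, nowhere near 300MB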

Not to mention other features the article discussed, such as cherry-picking. What does it even mean to "cherry-pick a snapshot"? In comparison, cherry-picking a diff and applying it to your current state is far more intuitive.

And let's not forget commit messages. If a commit is a snapshot, I would expect the commit-message to be descriptive of the entire snapshot. Whereas if a commit is a diff, I would expect the commit message to be descriptive of the diff. Which is exactly how most people use commit messages.

Obviously both "diffs" and "snapshots" are leaky abstractions. If you insist on using the "snapshot" abstraction, you will need to resolve all of the above points of confusion by adding more complexity to your abstraction. And if you prefer to use the "diff" abstraction, you will eventually need to explain that a commit is actually a combination of diffs, along with some other metadata like a pointer to a parent commit. As a teaching tool, you can make either abstraction work. But I find it far more intuitive and useful to think of commits as "diffs + some metadata".


Commits are snapshots.

How to represent those snapshots, and fix the storage bloat a naive implementation would cause, is a completely different problem.

One of the things that makes Git smart is that it doesn't try to optimize things prematurely. SVN and co. would store actual diff data, but this made some operations really hard to implement (and, in many cases, slow).

Git has commits conceptually as snapshots. It's up to the storage code to figure out how to deal with this.

> But I find it far more intuitive and useful to think of commits as "diffs + some metadata".

Except that this is not what's happening. I wouldn't even call it an abstraction; it's how things actually work. What you call abstractions are actually operations. If we run a diff we are interested in the changes, but if you ask git to show you the commit it will show you just that.

If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.


> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.

I find that thinking of commits as snapshots is not so useful. I prefer to think of them as a pair of parent commit and diff.

With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.
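In command terms that view corresponds roughly to this sketch (branch names are illustrative, and real rebase uses three-way merges rather than blind patch application, as discussed further down):

    git format-patch main..feature -o /tmp/patches   # export each commit as a diff
    git checkout -B feature main                     # point the branch at the new parent
    git am /tmp/patches/*.patch                      # re-apply the diffs on top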

It's not clear to me how thinking of commits as snapshots helps me to explain operations such as rebase.

I do concede, however, that "git cat" (I think that's the command) seems more closely related to a snapshot: you identify a commit and a file, and it will give you the content of that file at that commit. Clearly in this case the concept of a snapshot works well. But I need this very rarely.


> With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.

You can think of it that way if you want. But it's not what Git actually does.

Personally I much prefer to have my mental model match the actual reality of things.

You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.

What it does in actuality is find the snapshot of that SHA and change the working tree to match it.


If git did rebuild the graph right from the very first commit, the end result of the operation would look the same to the user as it does now.

It seems to me the two mental models are interchangeable when it comes to the use of git from the user's point of view. What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?

Now that I think about it, it's probably that users have a bad understanding of the commit-as-diff model; they could similarly have a bad understanding of the commit-as-snapshot model, I expect. I don't know that thinking in snapshots helps a user understand git better than thinking (properly) in diffs.

The article, for example, explains that any two commits can be differenced because the underlying snapshot trees can be compared. But the commit-as-diff model can as easily explain why comparing two commits works: trace each commit back to the common base commit. So the commit-as-diff mental model just needs to remember that commits are fundamentally tied to the path they have back to the root commit.

It seems to me if you take the diagrams from the article and remove the under-the-covers stuff leaving just the circles, the commits-as-diffs and commits-as-snapshots models look exactly the same.


Merge commits are a bit hard to understand from the perspective of "a commit is basically just a parent commit plus diff".

On the flip side, cherry-picking is hard to understand from the perspective of "a commit is basically just a snapshot, nothing more" (it's _also_ weird from the parent-commit-plus-diff perspective -- cherry-pick is kind of a weird operation, but useful enough that we keep it anyway despite it not fitting quite as cleanly into the git model as other operations).

Outside those edge cases, though, people with "snapshot" and "parent + diff" mental models will make basically identical predictions about what the results of various operations with git will be.


The solution is to think of it as "it's both, snapshot and parent + diff".

When you cherry-pick, git is using the parent+diff model to move the commit. When you do a merge commit, it's using the snapshot model.


> What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?

With the wrong mental model it's harder to predict what operations are expensive. If "git checkout <SHA>" truly did have to replay all diffs from the beginning of time, it would be a very expensive operation that is best avoided unless you absolutely need it. But in practice it is a very fast operation (one of the fastest) that there is no need to shy away from.


A fair point possibly, but given checkout/switching branches is probably just about the most common action when working with git repos, I'd hope people would notice that it's fast pretty quickly.


> You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.

Yes, this is true. I don't know why it never bothers me. Maybe it's because you could also store the diffs in the opposite direction (i.e. store the tip of each branch in the clear, then store diffs from each commit to its parent). Computing the inverse of a diff should be a quick operation. Usually, when you check out something, it's the tip of a branch or near the tip of a branch.

Anyway.

Of course I know that storing trees makes it easy to compute diffs. Computing diffs will become slower with larger trees. On the other hand, storing diffs makes it slow to compute trees, and the more commits we've got, the slower the tree computation goes.


> Computing diffs will become slower with larger trees

Not usually. Computing a diff is roughly O(n) with the size of the diff. This is because unchanged leaves of the tree can be seen as identical (because they are content-addressed) and are skipped. So to compute the diff you only need to recurse into changed directories.

So having a million files in the root directory where one has changed is very fast to diff, as you just diff that one file. The worst case is the diff happening in a very deeply nested directory with lots of files in each of the subdirectories, but even that is quite cheap as diffing a sorted directory listing is O(n) with the size of the listing.

(The actual worst case is diffing large files as most text diff algorithms are worse than O(n))
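You can watch the skipping happen with the tree-diff plumbing (the SHAs and path in the sample output are illustrative):

    git diff-tree -r HEAD~1 HEAD
    # only entries whose hashes differ are printed, e.g.:
    # :100644 100644 3ac0cfc... 88a80ff... M    src/main.c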


> If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff

Well, it would usually be more efficient to figure out where the currently checked-out branch differs from the branch being checked out, and then unapply and apply diffs as needed.


what about "git cherry-pick <commit>"?

with this command you don't import a snapshot, but only the diff between <commit>~..<commit>, so the parent+diff model makes sense to me
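A rough hand-rolled equivalent, as a sketch only (the real cherry-pick does a three-way merge, so it copes better when the surrounding context has moved):

    git diff <commit>~ <commit> | git apply --index   # apply and stage the same change
    git commit -C <commit>                            # reuse the original message and author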


Some commits cannot be cherry-picked. This is because there is no coherent diff for them, as with merge commits.


Rebase doesn't work that way, though [0]. It first extracts the 3 versions (2 leaves and their common ancestor) and then does a diff & patch.

This allows git to store the deltas between versions in the most efficient way on disk, while also letting it use contextual diffs to minimize the chance of spurious merge conflicts. Patching algorithms have various heuristics that make sense for programming languages, like special treatment for lines with only changes in whitespace.

(Edited to add:) also, minimal diff algorithms have to do a lot of work to detect large blocks of text being moved around. This is part of what made Subversion, which used the same diff algorithm for storage compression and merging, painfully slow.

[0] https://git-scm.com/book/en/v2/Git-Branching-Rebasing


Here is the paragraph that describes what rebase does:

> This operation works by going to the common ancestor of the two branches (the one you’re on and the one you’re rebasing onto), getting the diff introduced by each commit of the branch you’re on, saving those diffs to temporary files, resetting the current branch to the same commit as the branch you are rebasing onto, and finally applying each change in turn.

Is "applying the diff to a different parent" not a good way to describe this?


You're using the word 'diff' for 2 different things:

- an efficient way to store 2 very similar files

- the minimal set of changes made by a programmer to a file.

Subversion uses the same diff algorithm for these 2 functions, which is why people conflate them. But git uses different algorithms. The first (which it calls deltas) is optimized for speed and compression ratio. The second set of algorithms (you can choose from a few, some of which are better at identifying rearrangements of large blocks of text) is optimized for merging 2 programmers' changes without conflicts.


The way you try to apply a diff to a different parent is by doing a three-way merge... the vast majority of tools do this by taking three files as arguments and producing a fourth as output. The three-way merge is the underlying process which makes merge, rebase, cherry-pick, and revert work. They are all just "three-way merge, shuffle the arguments around, and adjust metadata".

The parent + diff storage is not isomorphic to snapshot storage. Snapshot storage reflects the actual usage of VCS tools... people make changes, and record the final state. Parent + diff does not do this, it records the changes, which requires creating a diff, and there are multiple ways to create a diff between two snapshots.

Git postpones the "which diff is correct" question until you actually care about the answer.
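Git exposes the primitive directly, if you want to play with it (file names here are illustrative):

    # three-way merge at the file level: folds the base->theirs changes into ours.c
    git merge-file ours.c base.c theirs.c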


> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.

I don't think those concepts are as distinct as you're painting them. At a user-visible level, commits are almost always visualized as diffs, which puts us in a place where, at the highest and lowest levels, they're defined as pretty close to diffs, while at an intermediate level they're defined closer to snapshots.

I honestly think they're neither; each expression method (diff vs. snapshot) can be translated pretty easily, and both are trying to represent the same end goal. It can be helpful to know that commits represent the full state of the codebase at a point in time, but that view can be at odds with merging and rebasing, which use actual change sets to calculate. When a commit is being manipulated, it's helpful to view it as a diff (and git does this), whereas when a commit is being read, we're using it as a snapshot.


Structure purist, ingredient rebel: A snapshot between two levels of diffs is a sandwich.


One way I like to think about this is that when you rebase a branch, the diffs are the same (barring any conflicts) but the commits are different. Just another reason commits aren't the same as diffs.


The diffs are often different, even without conflicts. Try comparing them some time, and look closely at the diff... look at the lines starting with @. People usually ignore those lines but "patch" does NOT.

This is not an irrelevant detail; it's the result of a three-way merge. The three-way merge can update those @ lines if it has a complete set of inputs (all three inputs). If you make a patch from one branch and then apply it to a different branch without using the three-way merge algorithm (stripping the diff of all its context), the patch may fail to apply even if the three-way merge would have succeeded without conflicts.


> If we run a diff we are interested in the changes, but if you ask git to show you the commit it will show you just that.

git show <commit SHA-1> will output a diff.
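The two views are one command apart (the SHA is a placeholder):

    git show abc1234         # porcelain: computes and prints a diff on the fly
    git cat-file -p abc1234  # plumbing: prints the stored object (tree pointer, parents, message)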


I think this is more a sign that git (porcelain) is not aligned with the underlying model.

It is actually a pity that so little effort went into the git UI. I find the OP's explanation of the git model awesome and the presented concepts beautiful, but the cli utility has countless naming and consistency problems, which makes me sad that hg didn't win over git. Life would be much simpler for many developers if it had, imho.


If commits are snapshots:

- say I have a repo with one file, a big 100MB CSV with millions of lines.

- I change one line in the CSV for one commit.

- repeat multiple times over many, many commits.

- how big will the repo be?


I convinced myself that commits are snapshots by doing the following:

    # generate a 100M text file (-b 76 wraps lines on BSD/macOS; GNU base64 uses -w 76)
    base64 -b 76 /dev/urandom | head -c 100000000 > file.txt
    git add . && git commit -m "1"
    
    # remove first line and add a new line to bottom
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "2"
    
    # repeat
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "3"
    ...

    du -sh . # a very big folder
Each of the commits is almost 80M big in the git folder. If you run `du -h .` you can see how git stores each object individually (each ~80M big).
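If you then let git pack those loose objects, the near-identical blobs delta against each other and the folder shrinks dramatically (sizes here are what I'd expect, not measured):

    git gc          # repack loose objects into a packfile with delta compression
    du -sh .        # now roughly the size of one snapshot plus tiny deltas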


> From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.

That's not what one would expect. Suppose I have a directory that is 100MB in size. If I take a snapshot of it ("btrfs subvolume snapshot"), my snapshot would be 100MB in size, but the storage needed for the original and the snapshot together would still be 100MB (plus a few kilobytes of overhead). If I take a second snapshot of it tomorrow ("btrfs subvolume snapshot" again), my second snapshot would also be 100MB in size, and my total storage needs would still be 100MB (plus a few kilobytes of overhead).

If I made a change to a small text file before each snapshot, my total storage size would still be barely larger than 100MB.

That is, when creating a snapshot, one would expect it to be copy-on-write. While not exactly what git does (it's content-addressable storage rather than copy-on-write storage), the end effect is similar enough for most purposes (the main difference being that undoing a change in git would not need extra storage, while copy-on-write storage would store a new copy of the contents).
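For the curious, here is the btrfs version of the experiment from the top of the thread (paths are hypothetical, and the directory must be a subvolume on a btrfs filesystem):

    btrfs subvolume snapshot /data/project /data/project-snap1   # instant; shares all blocks
    # ...edit a small text file in /data/project...
    btrfs subvolume snapshot /data/project /data/project-snap2   # only changed blocks cost space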


Clearly people are using two diametrically opposed definitions of snapshot.

If a snapshot is defined as opposed to a diff, then it's clear snapshot means "full copy". If I snapshot the state of my cloud server, it creates a full copy of its disk in block storage somewhere, and takes several minutes to complete.

You are describing snapshots that exist as part of a diff system or copy-on-write system, where they use virtually no storage at all, because further changes are assumed to be applied as diffs rather than overwriting previous data. Where the snapshot is a "marked" diff that can specifically be rewound to, as opposed to a general ongoing stream of diffs.

But that's a more advanced and system-specific definition of snapshot.

As a general mental model, when you say "think of it as a snapshot not a diff", I think it's clear that the former definition is being used, and that the expectation is a full copy that takes up disk space. Because otherwise, in the second case, all the snapshots are just the most recent diff (on top of the entire prior history), so the sentence "think of it as a snapshot not a diff" doesn't really mean anything. The snapshot and the diff are the same.


> If I snapshot the state of my cloud server, it creates a full copy of its disk in block storage somewhere, and takes several minutes to complete.

Which cloud provider are you using? Neither Amazon nor Google take snapshots this way. Amazon EBS and Google Persistent Disk both use copy-on-write semantics for snapshots. If you take a hundred snapshots of a 100 GB disk, your total usage is 100 GB plus metadata. When you run a VM instance from that disk, the storage usage will increase as blocks change, to a maximum of 200 GB total storage (for live disk + out of date snapshot).

When I use QEMU or VirtualBox at home, I also get copy-on-write snapshots of disks, although it's certainly possible to get a full copy if you want. I think the feature is pretty standard.


Digital Ocean. It absolutely takes snapshots by making a full copy:

https://docs.digitalocean.com/products/images/snapshots/

So this is a perfect example of what I mean by the word "snapshot" being used in two different ways by different people.

Snapshot meaning "full copy" is one usage (Digital Ocean), snapshot meaning "diff checkpoint" is another usage (Google, AWS).


Those aren't different definitions of "snapshot", though.


Of course they're different. They have different meanings, so they're different definitions.

It's not like it's the same concept with different hidden implementation details.

On Digital Ocean, I can delete the server but I still have the snapshot. On the others, you can't. One copies, the other bookmarks.

They're entirely different concepts, therefore different definitions.


That’s an incorrect notion of “definition”. The concept of a snapshot is that you make a copy of something at a moment in time. That’s one concept, one definition, one meaning. You may fight over the details of the definition or the implications, but at most it means that you need to revise the definition a little bit, not that you need to add a new sense to the word.


> That’s an incorrect notion of “definition”.

Nope, pretty sure different concepts means different definitions. Well -- or different "senses" if you want to be technical, but of course nearly everyone outside of dictionary editors uses "definition" to mean "sense".

> The concept of a snapshot is that you make a copy of something at a moment in time. That’s one concept, one definition, one meaning.

Except one of the two definitions isn't making a copy of anything. It's creating a new pointer to something that already exists, that's all. Zero copying. That's the entire point here.

Which is why it's two concepts, two definitions, two meanings.


Copy-on-write is an implementation detail that allows for lower storage. The snapshot is still the full copy. One could try to argue that the same is true for git in that diffs (or content addressable storage) are just an internal implementation detail, but as the parent pointed out that's not quite true--our commits document the diff, not the materialized snapshot.


> our commits document the diff, not the materialized snapshot

That's not actually true, though. This is what a raw commit object looks like:

  $ git cat-file commit bfc766d38e1fae5767d43845c15c79ac8fa6d6af
  tree 99768f8965d5382d1c1695c371a854d061f2548b
  parent 860a3b34854d8abe9af9f1eb584691de926ce897
  author Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100
  committer Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100
  
  Update version for v2.6.0 release
  
  Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
(This commit is from the QEMU project repository.) Note that there is no reference to a diff. There is a reference to a tree, which is a binary object representing a directory structure—a full snapshot of the state of the working tree as of that commit:

  $ git ls-tree 99768f8965d5382d1c1695c371a854d061f2548b             
  100644 blob 3ac0cfc6f0d60a95a5c3557497835c80e52a1696    .dir-locals.el
  100644 blob 37755ede83a0d5b4fe22114f624a140d2bcaefff    .exrc
  100644 blob 88a80ff4a5c552bad3bc2bf40b1fc4a45c57a177    .gitignore
  100644 blob 9da9ede26161bc5c6f12552b55bc54f55bb1e839    .gitmodules
  …
Each of those blobs is a reference to the full content of the corresponding file:

  $ git cat-file blob 3ac0cfc6f0d60a95a5c3557497835c80e52a1696  # .dir-locals.el
  ((c-mode . ((c-file-style . "stroustrup")
              (indent-tabs-mode . nil))))
For storage purposes there is deduplication, delta encoding, and compression going on behind the scenes, so committing a 100M working tree with a few small changes doesn't take up 100M of additional storage space, but these are invisible to the upper layer. When Git needs a diff, for example to perform a three-way merge, or in response to `git show` command, it generates one on the fly from the snapshots.


Apologies for the typo, I meant to say “...commit messages document...”


Copy on write filesystems describe changes as a structural diff, effectively.


That's not really true. The copy-on-write filesystems just allow multiple files to reference the same blocks, and only allow modifications to blocks if the refcount is 1. At least, at its simplest, that's how copy-on-write works. To copy a file, you copy the block references and increment the reference counts. You won't end up with a diff or deltas stored anywhere.


Sure it is. You just need to look at it a little differently.

Even in classical COW of memory pages in a Unix forked process, the set of pages mapped into the process with refcount 1 are a diff to all those with refcount > 1.

Virtual machine snapshots are more explicitly diff-oriented. Deltas to the base disk or snapshot are stored separately (that's your diff), and "deleting a snapshot" actually means remapping all the separately stored blocks and collecting the newly released blocks. There are two strategies snapshotting can follow: copy-then-write-in-place, or redirect-on-write. Either way, the set of copied or redirected blocks is a diff to the in-place blocks; just the polarity of the difference is switched.

See e.g. https://www.dell.com/community/Student-Discussions/Copy-on-w...

Things get more interesting with e.g. ZFS snapshots, where the whole filesystem, including metadata, is copy-on-write, and tree-structured to maximize sharing and permit atomic writes (how ZFS solves the RAID5 write hole). There, snapshots hold on to one of the old roots in the tree. The diff is implicit in the difference in tree structure; shared blocks are common, different blocks are different. It's super-easy to do a recursive comparison between such trees; extracting a diff is a sublinear-time operation because it can trivially skip over identical subtrees. It's a matter of perspective, when you're in the middle of a recursive tree compare, whether you think you're actually diffing data, or whether the data in one leg of the compare is simply telling you which subtrees are shared and which subtrees are different, and thus the data is a delta, or diff. You certainly don't need a complete traversal, which tells me that the data is doing most of the work.


> The diff is implicit in the difference in tree structure;

I can only interpret this as, "the data is not described as diffs." There's a meaningful difference here and I'm not being picky about it. To some extent, you can convert between a diff structure and shared structure, but that doesn't mean that the differences aren't meaningful.

Two structures may be isomorphic but they represent data in different ways and the operations have different algorithmic complexity.


I've learnt something new today, thanks for sharing. Looks like I had a naive understanding of how snapshotting actually works.

I still think that it's more intuitive to describe commits as diffs, in the context of things like cherry-picking a commit or rebasing/reordering a series of commits.

But given that you can also "check out" a commit, in order to get a specific snapshot of the repo, I can see the parallels between commits and snapshots. Maybe both analogies are equally useful in describing the different features that git provides.


The point of the article is not an analogy. Git is based on snapshots, and diffs are computed from snapshots as needed.

The snapshots are also de-duplicated and compressed, but that is not important.

The article is a good one. And if you spend the time to understand git it gets easier to use.


Internal implementations and external interfaces are not necessarily the same thing. When reading a single-threaded application's code, it is helpful to read it as a series of instructions, executing serially. In reality, both the compiler and your CPU are constantly reordering instructions, and executing them in parallel/out-of-order. However, this is all done while still preserving the illusion of serial-execution. Taking a beginner programmer down this rabbit-hole of implementation details, is going to be more harmful than helpful.

Thank you for the suggestion, but I already find git easy to use. And thinking of commits as diffs that can be cherry-picked, rebased and reordered, is something that helps me greatly in understanding it.


The correct way to think about snapshots and diffs when it comes to cherry-picking and rebasing is to realise that diffs are always derived from snapshots. I.e. the fundamental data-structure is the snapshot and from those we can build diffs. Those diffs are necessary to implement cherry-picking and rebasing, but it's also possible to imagine an implementation of git that lacks those features. It would still fundamentally work in the same way - it would just be slightly less useful.

Edit: If you think this is just splitting hairs, I encourage you to look at the differences between git and pijul which is a VCS where the fundamental building block is diffs: https://pijul.com/


> The correct way to think about snapshots and diffs when it comes to cherry-picking and rebasing is to realise that diffs are always derived from snapshots.

Ironically, git snapshots are themselves derived from diffs. Creating a snapshot without diffs would require making a full copy, which git most definitely does not do.

So would you rather think of cherry-picking as diffs derived from snapshots which are derived from diffs? Or as simply diffs? I find the latter better as a mental model.


It's helpful to understand git in terms of the "porcelain" and the "plumbing".

The git commands you know and love are largely the porcelain: nice fixtures over other things. When you "git cherry-pick", under the hood git queries that commit's parent(s), finds the diff the commit introduced relative to its parent(s), and then applies those same changes to the index and your working tree.

Cherry-pick is porcelain on top of the plumbing.

There are a few "write git yourself" tutorials out there, of which "Write yourself a Git!" is I think the most popular. In it, you'll learn how git really stores data, and you'll write a (fairly basic) git client that can do several things to locally manage a repository.

Write yourself a Git!: https://wyag.thb.lt/


>I still think that it's more intuitive to describe commits as diffs, in the context of things like cherry-picking a commit or rebasing/reordering a series of commits.

If I understood the article correctly, those things actually are implemented via diffs. It's just that the diffs are calculated on-the-fly, used to create a new snapshot, and then discarded.


You can still think of them as snapshots. Git just does compression on the entire folder of snapshots, including de-duplication of data that doesn't change between snapshots.

In fact, when I teach git to students, I don't even bother with the trees/blobs, which in my view are just an implementation detail. I just tell them to think of git zipping up their working directory together with some metadata (commit message, reference to parents), and putting that zip file into its own "compressed" storage inside the .git directory. That seems to be sufficient for a good mental model of how to work with git (independently of the git's somewhat baroque command line interface, which just takes getting used to)


This is the thing though. You're talking about snapshots which actually have duplication removed... in my mind this really fits more with the 'diff' model. I've already done the exploratory diving-into-git-internals thing years ago, so I could develop a better understanding of how things actually work.

But for newcomers who want to understand how git is working, it really makes more sense to tell them it's 'like a diff. Not exactly under the hood, but think of it like a diff for now'. This is what I've been telling people as I've mentored a number of people in getting acquainted with git over the years, and if they're curious enough to look under the hood, they'll get a better understanding of the internals.

As a programmer, what you're working with is essentially the diff. This is the easiest way to think about things initially. The fact that git is storing blobs under the hood, shallowly deduplicating blobs but still storing large chunks separately that may contain duplicate data, until it generates packfiles which do a deeper deduplication/compression, is really not that helpful. Telling people it's more like zipping is a bit disingenuous because it doesn't really explain how things are compressed more efficiently over the course of many changes.

If I have a 1MB code file and make 1000 commits of one-line changes then sure, git is initially storing large blobs representing those, but then will compress over the change set when it generates the packfile.

Compared to making a zip of the file for every change (say these are 100KB compressed) and now you have people thinking the 1000 one-line changes generate 100MB in the .git directory.

You may think that a 1MB file with many smaller changes is a fabricated example, but consider that dependency lockfiles (package-lock.json I'm looking at you) can easily grow to this size, and contain this many changes.


It may depend on the background of who you're talking to. Programmers may be very comfortable with diffs, but non-programmers (in my case, physics graduate students) usually aren't. On the other hand, everybody is familiar with snapshots: even high school students end up with "report_v1.docx", "report_v2.docx", etc., which are snapshots at the file level (and work reasonably well as long as you have a consistent scheme and don't need branches). I've also routinely seen less-technical people organize their research / paper writing by making a weekly snapshot of their work folder ("project-2020-04-1"). Telling these people that git basically does the same thing for them automatically with a tree-like "labeling scheme" that allows for branches tends to go over quite well, in my experience. For actual programmers, I'd be inclined to give them a more technical introduction to git's internals. I'd still point out that git stores compressed snapshots, not diffs (especially if they're older and may have previous SVN experience)


Those non-programmers are likely going to have a worse understanding of what is happening when you zip/compress something anyway, but I concede this is probably the most straightforward path if they have some understanding of what a zip is, and can't understand what a diff is. But even then I question if they should be using git, since `git diff`, `git show`, basically everything git exposes, is going to show them diffs.


A pure-diff storage would be impossible to recover if you get an error in any commit. It would also be much slower to examine the data, and newer version control systems do not use pure diffs.

The version control system Mercurial had a description of these problems on its homepage, "behind the scenes", which was good reading.

I am not sure if Git is the best solution, but at least a "pure snapshot" is okay, whereas a diff storage must in practice include some snapshot logic as well.


Diff based, but with snapshot "control frame" every N commits, like video?

Obviously joking though.


The "snapshots which are stored as deltas, if that works" part is unrelated to the diffs the git porcelain generates for you when you do a git-diff or git-show. The former is purely an implementation detail of the storage (albeit an important one), while the latter is entirely virtual, calculated from the snapshots every single time you view the data. That's why operations like git-diff and git-blame can take some time on large trees or histories (and why e.g. git-blame has various options to tweak how it tracks files across revisions, because that is not something git does), while git-log is fast.


Also (for less-technical audiences), I don't exactly dwell on the de-duplication. It's just "Git makes snapshots and puts them into .git in some efficient way. Don't worry about it. Or, if you want the details, read the Git SCM book."


Not really: if you do a checkout of a snapshot into an empty directory, you expect the entire state at the time of the snapshot, not just the diffs.


As a programmer I care about diffs only when I am comparing two versions. A commit creates a new version. "Snapshot" is a distraction.


The diff mental model doesn't work for things like `git checkout <commit>`.


I actually haven't had a problem with this, though perhaps it's because I understand what's happening at a deeper level. You're generally referencing commits which exist somewhere in this family of commits you can view with `git log --graph`. You can easily think of checkout as the path of diffs to get there. Files at commits are still whole objects, mentally, but the thing we care about as programmers working with multiple versions are the diffs.

I have had it break down a bit more when working with stash though, because now the object you're referencing can exist outside of that graph-like commit family.


So it only stores the difference between the two snapshots? ;)


No. If it chooses to compress the commits (which, if I remember correctly, it does not do automatically for each commit, but rather occasionally as a larger step), it uses the difference to whatever it deems to be a good candidate, if it finds one. E.g. if you have a file in commit A, change it massively in later commit B, and then on a different branch create commit C that also changes the file to one very similar to the one in B, git might very well compress C by storing the difference from B to C, despite those having no direct relationship in the commit graph. It can also choose not to use a delta to a different version at all, and this is 100% an internal implementation detail of the storage system in git (afaik one of those implementation details is that it prefers candidates that are in the same commit chain, but it doesn't have to - and it can easily jump multiple commits if that works better). If you ask git to show you a diff to the previous commit, it does not pull a diff from storage, but pulls two file versions from its storage backend (which will resolve any deltas used in storage) and diffs them.
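You can inspect those storage decisions yourself (the pack file name is a placeholder):

    # for delta'd entries, the last columns show the chain depth and the base object
    git verify-pack -v .git/objects/pack/pack-<hash>.idx | head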


No, it stores an entirely new set of references to objects, as well as some of those objects themselves (any that are not identical to previously stored objects).

You cannot look at a commit on its own and know exactly how it's different from the previous commit, but you do have the complete new state. You have to look at the parent commit's references and do an object-by-object comparison to identify exact changes. On the other hand, when you look at a diff, you can see exactly what has changed, but you cannot produce the version that came before without also having a complete copy of the current version.


The implementation-specific compression doesn't store deltas or diffs; it stores unique blocks of text.

Git allows for shallow clones, which would be impossible if the protocol or implementation were based solely around diffs.


I don't know that you need to teach them any of that. Version control is an abstraction. I have no clue what happens under the hood and I don't care.


To some extent, this is true. I don't feel the need to totally understand git's packing logic or the specific mechanics of the various diff/merge algorithms.

But some knowledge of how/why your tools work the way they do can be very helpful.

Some knowledge of a tool's internal workings can be fundamental to efficient use of that tool. At the very least it can allow you to understand or derive your useful interactions with that tool rather than simply memorize how it is used.


> Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size.

Not with `btrfs subvolume snapshot`, it won't. If that's not a snapshot, I don't know what is.

From a storage perspective, no dammit, Git commits are snapshots, look at the bits on disk if you don't believe it. This isn't something that people who like to write blog posts about Git made up for pedagogical purposes, it's how Git actually works.

As you point out, it's wonky for pedagogical purposes; what does it mean to "cherry-pick" a snapshot? When thinking about cherry-picking, yeah, a diff makes more sense than a snapshot. But saying a diff is better pedagogically doesn't change the fact that a commit actually is a snapshot (and when cherry-picking, git diffs two snapshots to create a patch, then applies that patch).


> From a storage perspective, no dammit, Git commits are snapshots, look at the bits on disk if you don't believe it

Except they're not. They're (often) packfiles, which are a delta encoding i.e. a diff. It's not necessarily the same as a specific commit, but appealing to "the bits on disk" is wrong.

It is certainly true that in the git object model each commit object refers to a tree that represents the complete state of the repository at that commit.

It is also true that many git commands implicitly treat a commit as being the diff between the state of the tree in that commit and the state in the parent. For example git show, git rebase and git cherry-pick.

It is simultaneously true that the on-disk storage system is optimised for performance and so doesn't map onto the object model in a trivial way.


> They're (often) packfiles, which are a delta encoding i.e. a diff. …appealing to "the bits on disk" is wrong.

That's fair. The diffs in a packfile have no relation to the "diff" that a commit would be if the commit were a diff; so it's wrong to use "but packfiles" when arguing that commits are diffs and not snapshots; but you're right, packfiles make my "bits on disk" argument not quite right.

The way I look at it is that packfiles are a compression mechanism; and they don't alter the fact that fundamentally it's snapshots that are being compressed. But that's not the only way of looking at it.


> It is also true that many git commands implicitly treat a commit as being the diff between the state of the tree in that commit and the state in the parent. For example git show, git rebase and git cherry-pick.

A commit is a snapshot, and you can compute the diff between a commit and any of its parents. If a commit has multiple parents, git cherry-pick bails out unless you pick a parent (usually -m 1), and git rebase, I think implicitly assumes the first parent.

(EDIT: a commit's tree, its parents' trees)


> If a commit has multiple parents, … git rebase, I think implicitly assumes the first parent.

`git rebase`'s behavior regarding merge commits is shockingly complicated, but much of the time: Because by default it linearizes the history, it actually just skips merge commits because it assumes that the merge has already happened implicitly by applying one of the merge's parents on top of the other parent.


> Obviously both "diffs" and "snapshots" are leaky abstractions.....

Joel Spolsky wrote many great things, but "all abstractions leak" was not one of them (edit: it is his, but it is not good). I am very tired of programmers excusing their poor imagination with appeals to this nonsense.

------

Commits store snapshots. Full stop.

The "bad mental model" is not commits being snapshots, but things behind stored individually, i.e.

> Sum |things| = |Product things|

This comes up in many other contexts, especially when storage quotas are involved and it's unclear what to do when storage is deduped across quotas.

-----

git packfiles do use a delta encoding, but it's important to understand that there isn't necessarily any correspondence between the history and the delta encoding. In fact, commands like `git repack` exist precisely to avoid path-dependency issues from the packs matching the history too much.

Saying commits are diffs to explain the delta-encoding storage characteristics is wrong and confuses, not clarifies.

------

> And let's not forget commit messages. If a commit is a snapshot, I would expect the commit-message to be descriptive of the entire snapshot. Whereas if a commit is a diff, I would expect the commit message to be descriptive of the diff. Which is exactly how most people use commit messages.

It's git tree objects that are snapshots; commit objects have a tree child and a previous-commit child, so it is natural for them to describe the relationship between two states without appealing to hypothetical alternatives.

> Not to mention other features the article discussed, such as cherry-picking. What does it even mean to "cherry-pick a snapshot"? In comparison, cherry-picking a diff and applying it to your current state is far more intuitive.

I might `git checkout somethingelse .` mid-rebase. What does that mean if commits are diffs? Nothing very clear. The better thing to teach people is about darcs and patch theory and those other models. I think the git model and the patch theory model both have uses, and the fact that git makes people always work in the git model is a fundamental issue that cannot be fixed with analogies.

- Patch theory is good for the things you are still working on

- A Merkle DAG of states is good for the things you've already done / agreed upon.


> All non-trivial abstractions, to some degree, are leaky.

You look a bit silly making grandiose comments that take one web search to disprove

https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...

> All non-trivial abstractions, to some degree, are leaky.


I'm fairly certain he was disagreeing with the content of the statement, not that Joel Spolsky wrote it.

ie, yes Spolsky said that, but he was wrong.


Yes, thanks


I think the emphasis was on "great". I.e. your parent wants to say that this thing Joel wrote was not great.


If your filesystem were copy-on-write and implemented snapshot semantics internally (like WAFL, for example, over 20 years old now), then the second snapshot would not take 100MB; it would just cost the metadata.

A commit is a snapshot of a tree with a reference to its prior ancestors. It's important to know that because it becomes extremely relevant when trying to do things like merges properly.


If you commit a 100MB file, change a few bytes in it, and commit it again, your .git/objects will almost certainly contain two 100MB objects. The fact that it is somewhat likely that running "git gc" or something similar will convert one of them into a reference to the other one plus some compact representation of the difference is an implementation detail.
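You can watch that implementation detail happen:

    git count-objects -v   # "count" = loose (full) objects, "size-pack" = packed storage
    git gc                 # repack: loose snapshots become deltas inside a packfile
    git count-objects -v   # the loose count drops and total size shrinks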

While the commit object does represent the snapshot, it also references the previous state; thus the commit message usually describes what changed between the referenced snapshot and the parent(s) that are also referenced from the commit object.

As for the overall model and the leakage between implementation details and how people use it, an interesting approach is used by SCCS/BitKeeper with its internal "weave" format, which essentially is both snapshot and diff at the same time.


Look up copy-on-write. ZFS and Btrfs do it.


After going through the "Git Internals"[0] docs, I found that the snapshot mental model has been much more helpful in understanding what my Git commands are doing, how someone's history got into a confusing state, etc. The primary model is that of the Merkle tree, and subsequently hashing, which are very simple and powerful concepts.

[0]: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Po...


I prefer to think of a repo as a whole as a tree, where the nodes are snapshots and the edges between nodes are diffs. This sort of lands us in both places


> From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.

That's not the way storage snapshots work under most (all?) storage-targeted file systems, filers, etc. What you're talking about there is a backup.

Snapshots are not backups. Snapshots work on "copy on write" basis.

Roughly speaking, when you take a snapshot you draw a line in the sand. "These were the files at this time". Snapshot operations as a result are super cheap and super fast. Future changes to those files results in the filer/file system writing the modified blocks to new locations, not overwriting the original data.

So take a 100MB directory. I create a snapshot. That results in almost no new storage usage, just a small amount of metadata. I write/modify 10MB of data; now the total storage cost is 110MB. If I take another snapshot after writing that 10MB, it's still only 110MB of storage usage.


If "diffs" and "snapshots" are leaky abstractions, that often enough lead you badly astray, then why insist on these abstractions in the first place?

Why not just teach people the mental model behind Git up-front? Objects form an immutable directed acyclic graph, human-readable names point at objects, there are some rules by which the graph is being extended and pruned, and by which names (references) are being updated to point at different objects.

This isn't a hard mental model, not for programmers (for whom the tool is intended in the first place). If you know how the most basic pointer-based data structures - a linked list, a tree, a directed graph - work, then learning the actual model isn't hard, immediately clarifies why Git does what it does. It should be taught to people up front.

A commit isn't a diff, and it isn't a snapshot. It's a bunch of objects Git creates for you, where the "commit" object points at previous commits and at a tree, built of "tree" and "blob" objects. When Git wants to know how to recreate your file structure, it starts at the "commit" object and walks the graph to discover what files and folders should exist. When you make a change and perform the "commit" action, Git creates a new "commit" object and a new "tree" object for it, and adds more objects to the graph to encode what changed, while reusing previously existing objects for things that did not change. The end state is, if you start at your new "commit" object and walk the graph, the resulting description of your file structure should be equal to what's on your hard drive when you made your commit.
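That walk is directly scriptable with the plumbing (the blob SHA is a placeholder):

    git cat-file -p HEAD               # commit object: tree hash, parent(s), author, message
    git cat-file -p 'HEAD^{tree}'      # tree object: mode, type, hash, name per entry
    git cat-file -p <blob-sha>         # blob object: the raw file contents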

Trying to paper over that with "friendly abstractions" is what makes Git difficult to understand.


> What does it even mean to "cherry-pick a snapshot"?

It means to do something like a three-way diff among three snapshots: the cherry-picked baseline, the target, and a common ancestor.

You can do something similar with the diff3 tool, which takes three files (snapshots) as input, not diffs.
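For example (file names are illustrative):

    # merge the ancestor->mine and ancestor->theirs changes into one output
    diff3 -m mine.c ancestor.c theirs.c > merged.c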


Depends on the diff. If the diff is not aligned by bits, a single bit offset might cause double the size, i.e. the full file deleted and a full file added.

>If you insist on using the "snapshot" abstraction

But it's not insisted on. Both abstractions are used as needed.


>… you will eventually need to explain that a commit is actually a combination of diffs, along with some other metadata like a pointer to a parent commit.

Only that this is completely wrong.

A commit is a snapshot of the tree. There are no diffs.

There is also no "metadata attached" — the commit is the actual data (!) describing the tree snapshot.

Git is a kind of simple content-addressable object store storing a kind of Merkle tree of objects. That would be a proper (abstract) description.


Why have so many people written long, thoughtful explanations about how the author is wrong to suggest snapshots are a better mental model, saying that all abstractions are leaky but that they find diffs a better mental model?

The entire article is literally about how commits are literally snapshots. I would say people didn't read TFA, but a lot of people are quoting lines from TFA and then going on to argue with/expand on them in a way that is directly contradicted by the next few lines.

I think it's because most of the people here have spent years working with git, and are so deeply attached to their understanding that they didn't hear most of what the article said.

(Some commentators have pointed out specific oversimplifications the author makes like glossing over pack files, I'm referring to the people who say a git blob is a diff when the entire point of TFA is that it isn't)


People are disagreeing with the author not necessarily because they didn't read the article, but because they don't agree about how things should be defined.

At the root, this is a disagreement about semantics and philosophy, not about git itself. I'm going to refer to Aristotle here: we think we have knowledge of a thing only when we have grasped its cause, and there are four general 'causes' [1]:

- The material cause: 'What is it made of?'

- The formal cause: 'What is the ideal of this thing?' , e.g. what's its abstract nature?

- The efficient cause: 'How did this thing come to be?'

- The final cause: 'What is its purpose?' How is it actually used? What role does it play in the world?

Here we can see that commits are used (at least in the git internals) as 'snapshots' — they refer to bytes, not changes in bytes. That's pretty close to the formal and efficient causes — the abstraction inside of git is closest to a snapshot, and that comes from the history of what Linus wanted when he wrote it.

But! The underlying storage uses deltas (which are diffs) to save space. That's the material cause.

But also, when we actually use commits, git often creates diffs for us as a convenience (cherry-picking, rebasing), and hides the fact that they're snapshots under the hood (final cause).

So there's an inherent tension between the different ways to answer 'what is a thing?'. For commits, this is especially bad, since there's an even split between 'causes'.

This tension never goes away because the most useful definition really depends on the context.

[1] https://plato.stanford.edu/entries/aristotle-causality/#FouC...


> But! The underlying storage uses deltas (which are diffs) to save space. That's the material cause.

This does not make the "commits are stored as diffs" story much more true:

1. This is only true of pack files, but pack files are only created once the repository exceeds a certain size.

2. Nothing about the pack file format requires that deltas follow the chronology of commits at all. The deltas could be stored in reverse order or even random order compared to the chain of commits.

3. The deltas in a pack file do not correspond to a change in a given commit, they are just the data to create a particular snapshot. If you find that a commit's file blob is stored in a pack file as a delta, that does not tell you anything about whether the file changed in that particular commit. You have to look at two commits and diff them to determine which files actually changed.

If a person wants to think about version control in an abstract way, then yes the two views (commits vs diffs) are somewhat interchangeable. If a person wants to understand what actually happens when you run a Git command, the answer to that question is less open to interpretation.


> The underlying storage uses deltas (which are diffs) to save space.

Not necessarily! The base git storage stores each object individually, not as deltas ("disk space is cheap"); it's only after a "git gc" that they are stored as deltas to other (potentially unrelated) objects. The original implementation of git didn't even have the delta storage (pack files), it was added later as an optional optimization.

So answering to "what it's made of?" with "deltas" comes with a huge caveat, that it's often partially or completely untrue.


This is exactly what I'm talking about. A person posts "this is literally how this works", and someone replies "philosophically I would prefer to think it works differently, therefore you're wrong".


Your distilled summary of this form of objection relying on wishful thinking made my day, thanks a lot!


The true zen of source control is that they are both.


> Why have so many people written long, thoughtful explanations about how the author is wrong to suggest snapshots are a better mental model, saying that all abstractions are leaky but that they find diffs a better mental model?

Probably because, to take their words at face value, they find diffs a better mental model? I think impugning "people [...] are so deeply attached to their understanding that they didn't hear most of what the article said" is a real bad faith reading, especially when you even acknowledge that central to people's arguments is "all mental models are leaky". This article may be technically correct about the way git internals are structured, but it makes cherry-pick and rebase more mentally complex for users to understand (you first have to go from commit => patch), not less.

Saying "Commits are collections of files + a parent commit, but you can diff it to generate a patch" and saying "Commits are a patch + a parent commit, and you can apply it to generate a collection of files" are isomorphic mental models—the fact that #1 is "correct" (for some value of correct that doesn't include the actual files stored on disk) is really besides the point.


My point is that people criticizing TFA's proposed mental model are missing the fact that TFA doesn't propose a mental model, it explains how things work. Both have value, but they're distinct.


I disagree. TFA is explaining the mental model Git uses to structure its codebase. If you're writing code for Git, this is obviously very useful to understand, but if you're just using it, this is only one of several mental models available to you. In this case, I think it's right to say that the distinction the author is attempting to draw is immaterial to those not working on the Git codebase.


If your code is written in a certain way that's a model, not a mental model.


Yes! It just seems so strange not to care about how things actually are in software. Is it a way of coping with the fact that so much software is so deeply layered and complex now?

Maybe I’m misremembering, but I feel like I didn’t see this usage of “mental model” much until fairly recently. The first I recall being surprised at was a discussion of a “mental model of Javascript” -- why would you need a mental model of something with a very detailed spec and multiple compatible implementations to study? If you want to understand how some aspect works, just look up how it actually does work.


Well, sometimes the API of a piece of software presents one model, while the implementation actually uses a different model underneath for various reasons.

In particular in Git, some commands expose the commits-as-diffs model (cherry-pick, rebase) while others present the commits-as-snapshots model (checkout). However, if you were to look at various layers of git code, the model is either commits-as-snapshots, or neither (compressed storage).

You could also theoretically change the entire implementation of git to store commits as diffs, and offer the exact same API as it does today (probably with differences in the way conflicts are resolved, and definitely with differences in performance).
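
For instance (a sketch, `<commit>` being any commit id):

    git show <commit>            # renders the commit as a diff against its parent
    git cherry-pick <commit>     # diff view: re-applies that change elsewhere
    git checkout <commit> -- .   # snapshot view: restores the full tree as of <commit>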


Unless you open up the spec every single time you make a prediction, you're using a mental model.

And the spec is probably not arranged for easy use.


It's necessary to approximate, but if someone tells you your approximation is wrong it makes no sense to say it's right because you prefer it that way.

If your mental model is that floats are real numbers and someone tells you they aren't, you don't go "I philosophically prefer to think they're reals, so you're wrong". You either update your mental model or decide you'd rather be a bit wrong than learn something (you perceive as) tedious.


Sure, you usually don't want your mental model to be wrong on purpose.

But that's different from not having one.

And sometimes a slightly wrong model has other benefits that will cause you to make fewer mistakes, so it's still a good tool.


Agreed, commits are snapshots, whether we like it or not. For obvious storage efficiency reasons, the implementation then diffs/packs/etc, but this is a different issue altogether.

I have found that I can't work with git with a different mental model (diffs). Every time things get messy, the diff model is not enough, whereas snapshots + commit graph + names/pointers make things natural.

Interestingly enough, when migrating people from svn to git, explaining the actual model makes the transition much smoother, so it would seem I'm not the only one.


I convinced myself that commits are snapshots by doing the following (in a fresh repository, i.e. after `git init`):

    # generate a ~100MB text file (-b is macOS base64; use -w 76 on GNU coreutils)
    base64 -b 76 /dev/urandom | head -c 100000000 > file.txt
    git add . && git commit -m "1"
    
    # remove the first line and append a new line at the bottom
    tail -n +2 file.txt > tmp && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "2"
    
    # repeat
    tail -n +2 file.txt > tmp && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "3"
    ...

    du -sh . # a very big folder
Each of the commits is almost 80MB in the .git folder; if you run `du -h .git/objects` you can see how git stores each object individually (~80MB apiece).


I’ve read the article before and it’s entirely unclear how it is supposed to be helpful. As the author acknowledges, things like cherry-pick show that one can think of commits as diffs whereas in the git implementation the object known as a commit is a snapshot of the state of a directory tree. Fine, but so what? Both times I’ve read this article my impression has been that the author is relatively new to git and processing some new information they’ve learned.


> Why have so many people written long thoughtful explanations about how the author is wrong to suggest snapshots are a better mental model, and that you think all abstractions are leaky, but you find diffs a better mental model?

Once you remember (learn?) that a commit can have N parents, it becomes apparent that it cannot be a single diff.


What does TFA mean in this context?


TFA = The F**ing Article


Thanks. I had a hunch. I'm familiar with "RTFM", but would probably get equally confused if "TFM" was used as a noun.


I suspect the chronology is something like RTFM -> TFM -> RTFA -> TFA, but the second and third might be switched. Dropping the R does introduce obscurity, but being able to convey the underlying sentiment (that while the content could/should have been consulted, it seems as though it was not) without a verb allows for a nonconfrontational syntax similar to passive voice, but even more so, and often without an obvious "weasel" effect, to boot!


Makes a lot of sense, thanks! Maybe it's also useful since the R is read explicitly as "read". Hence "they should instead be RTFM" sounds grammatically wrong. Breaking off the verb allows for a more natural read. It's funny how an abbreviation can carry more information than whatever it's short for.


And I believe this slang came from Slashdot, which was the Hacker News of the decade before Hacker News


Interesting, I didn't know it came from Slashdot! I spent quite a bit of time reading it in the early 2000s, and sometimes miss the subculture and not always subtle jokes (beowulf clusters... ). The moderation system encouraged jokes (there was a specific 'funny' tag), unlike HN which does not 'orient' things, and happens to be very serious.


It's an abbreviation referring to the article being discussed on a site like this.


The more polite phrase is The Featured Article. :)


The same context as RTFM


Can't satisfy everybody.


> I believe that Git becomes understandable if we peel back the curtain and look at how Git stores your repository data.

I agree, and like many, I have been saying that for years (nay, for more than a decade): and that's exactly the problem!

You don't need to understand how an internal combustion engine works to drive a car... You don't need to understand how your graphics card renders stuff to develop a web page... You don't need to know how a brushless motor works to use a drill...

There is a pattern there, and it's the one that makes sense.

I've read up on the internals of git a dozen times by now. But I only occasionally need to do something weird that makes me go back to it, so I usually forget the relevant bits.

The trouble is that I've used a distributed VCS that did not ask me to understand its internals, had a sane UI, and a good model (e.g. a tree-like commit history, so the top-level commit log would only have merges, but you could dive deeper into individual commits if you so pleased). It wasn't perfect, but it's hard for me to accept that we have gone with a subpar solution where every "tutorial" starts with how you need to understand the internals! But you also need to memorise them, dammit!

Just like I keep forgetting the Emacs rectangle editing shortcuts since I seldom use them, I'll keep forgetting the specifics of git internals that I might need once every 12 months.

And it's not me, it's _you_, git!


Sadly, the bad part is git's user interface. It hides the pretty parts underneath.

There is a concept of "the next commit" or, equivalently, "the pending commit". In the documentation, this gets called "indexed" or "cached" or "staged" --- three different names! And if you want to diff against it, you can't refer to it by name; you need to use an option, so it's `git diff --cached <other commit>`.
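
All three names even surface in different commands (`file` standing in for a path):

    git diff --cached            # "cached": diff the index against HEAD
    git ls-files --stage         # "stage": list what the index currently holds
    git update-index --add file  # "index": the plumbing name for the same thing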

I know git's internals, mostly because it lets me navigate its bad user interface.


There's a difference between implementation details and core abstractions you need to know to understand a device. That git stores data as an immutable DAG with a layer of pointers to attach human-readable names to things, that's not an implementation detail. That's a core abstraction.

It's not like knowing the internals of an ICE, it's like knowing a car moves using wheels, that these wheels must touch the ground for the car to be controllable, and some of them must rotate to change the direction of movement. Knowing such "car internals" isn't necessary for you to be able to turn the key and get it to roll - but it is necessary for safe driving. People who didn't master these "car internals" are the ones who speed on wet ground, don't understand safe braking distance, or why their car skids.


Commits being snapshots is not git internal. It is the high level abstraction.


But you have to know how an ICE works to resolve issues. Oil pressure? Engine temperature? Leaking radiator? Fuel pump? Dead battery? Wet spark plugs (on older models)? But there is more — air filter, oil filter, oil changes, mechanical parts — these are usually resolved by handing the car to professionals, and they have to know how the system works.

You have to understand the DOM to resolve web page issues. Understanding how the graphics card works would help in resolving WebGL issues.


We are not talking about people implementing git the tool. We are talking about people using git as a tool (even though they share the common job title of a software engineer/developer).

The OP leads in with how "git cherry-pick" and "git rebase" are hard to use and promises to clear it up with a deep dive. You know, how do I turn on the wipers in my car? Or the turn signals?

> You have to understand DOM to resolve web page issue. Understanding how graphic card works would help in resolving webgl issues.

As a developer working to build things with DOM, you have to understand the DOM APIs and model. You do not have to understand how DOM is _implemented_ in browsers today and how they achieve things you need when you call the APIs. Sure, there are gotchas that are useful to know ("this CSS selector takes O(n^2) time to match"), but they are the exception, not the rule.

Similar holds for WebGL: you need to know the APIs and how to use them effectively. Sure, it's good to know where some of the gotchas are ("this might re-render the whole thing on-screen introducing flicker, here's the off-screen version"), but it's not a blocker.

But, none of these require you to understand internal implementation details to effectively use the public APIs (which with git are CLI commands).

This is not to say that understanding the nitty-gritty details of anything is a bad thing: it is a GREAT thing, and will probably empower you to do ever more intricate things (and it is usually a very rewarding exercise to learn more about a tool you use)! But that's different from having to know the internals to do the most basic of things (which I'd argue "cherry-pick" and "rebase" are).


With the DOM, you have to know that it is a tree, or you'd be surprised that `append` changes the parent. You have to understand reflow, you have to understand stacking, lots and lots of things.

A lot of people use the DOM without understanding it, and a lot of people use git without understanding it. In both cases one requires some knowledge to resolve issues.

And git is trivial: it stores snapshots, nothing hard there. It is interesting, but does not help with `cherry-pick` and `rebase`. The only hard thing about git is recovery — git reflog — easily avoided with backup branches (`git branch foo`). Some kind of undo could be useful for beginners, to avoid the fear of screwing things up.
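
For example (a minimal sketch):

    git branch backup            # cheap insurance: a second name for the current tip
    git rebase -i main           # ...experiment freely...
    git reset --hard backup      # didn't like it? point the branch back
    git reflog                   # or dig the old tip out of the reflog instead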

Git has a different mindset from SVN — commits are cheap, branches are cheap, experiments produce new branches; cherry-pick them, rebase them, etc, etc, etc.


This blog post is the most compelling argument I've yet seen for pijul.

Git should work the way we think it does! It's confusing that snapshots are being converted into a few different forms of change object, which can be reconciled with merges or rebases or applying patches.

Pijul (and darcs before it) actually works on the basis of patches, pijul with a robust theory of patches. A cherry-pick just moves a patch from one history-of-patches (branch) to another history-of-patches. One can share just a patch, and applying it is guaranteed to be the same action everywhere if that's possible, which it often is.

I'm patiently waiting for pijul to be mature enough that I can move everything over to using it, it's one of the more exciting projects in the last ten years.


I have visited the pijul site two or three times; every time I would start reading about a "sound mathematical theory", get bored, and close the tab. To this date I still don't know what pijul is trying to do and why I should be interested in it.

They really should improve their documentation (hint, in case someone reads this: nobody except a few geeks gives a shit about sound mathematical models. Show me how pijul makes my life easier compared to git, that's all I need)


Have you ever rebased a long chain of git commits onto a new branch, where one of the first of those commits has a conflict with the new base, and after resolving this conflict for that commit you get the same conflict over and over again for all the subsequent commits, even if they did not modify that place in the code, and you need to manually resolve it again and again?

Pijul will, as I understand it, save us from those unnecessary repeated "conflicts".

See also the answer by @chriswarbo about removing undesired changes from history



No, I did not know that; I repeated everything manually. Will try that next time, thank you.

BTW, pijul docs mention rerere as helping "in some cases":

> This is why in these systems, conflicts are often painful, as there is no real way to solve a conflict once and for all (for example, Git has the rerere command to try and simulate that in some cases).

https://pijul.org/manual/why_pijul.html
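
For reference, rerere is opt-in; enabling it globally is one line:

    # record conflict resolutions and replay them when the same conflict reappears
    git config --global rerere.enabled true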


I feel their example (with the ABGX) just makes me think "merging can silently result in weirdness, and git and pijul just do it differently" - it doesn't really argue that one is better than the other.

(Most people probably use git as an effectively infinite string of zip files anyway. https://xkcd.com/1597/ )


To the contrary, that example actually gives a solid argument for three points where Pijul is better than Git:

1. The order between lines is preserved by Pijul. This is important: let's say Alice works at the beginning of the file (lines 1-10 of 1000), whereas Bob works at the end (lines 990-1000 of 1000). Pijul preserves the order in all cases, whereas Git might randomly decide, based on the contents of the lines, to merge Bob's new lines in the middle of Alice's new lines.

2. Git solves an optimisation problem (3-way merge) that may have multiple solutions. Unfortunately, there is no way to count the number of solutions, or even to tell whether there are multiple solutions, in a reasonable amount of time. Git therefore picks one solution silently, based on the contents of the lines. In contrast to that, Pijul is deterministic.

3. Pijul is associative, meaning that merging A with (B then C) is the same as first merging A with B, and then merging C; in other words, you can merge a branch commit by commit. Git doesn't have that property: if you merge a branch, you MUST (1) stop working on it, or else the future merges become totally unpredictable, and artificial conflicts might come back (yes, I know about dirty-hacks-to-try-and-fix-that-when-they-work such as rerere), and (2) check the result of the merge extremely carefully, because in addition to the logical errors that merges can reasonably introduce, Git might also introduce extra unpredictable errors by randomly shuffling lines around.


That makes it clearer - the A B C explanation is too simplistic.


I feel like this is missing something about the drawbacks? Or are there truly no drawbacks beyond disk usage for the cache, and folks should just enable it once they're aware it exists?


I guess it involves a bit of assumption and guesswork to automatically replay your previous resolutions onto files that may in turn have changed. It probably slightly increases the chances of Git doing something you didn't expect and not telling you about it. Hence why it isn't the default. Maybe?


Could one not add a new rebasing strategy to git by generating patches from the git history? Are the concepts non-translatable?


I think rebase already works by generating patches, but for some reason the repeated conflicts happen...


There's zero theory on the following page:

https://pijul.org/manual/why_pijul.html

There are many answers:

1. Commutation makes your life easier because you can be much less careful about how you manage your branches. Rebase, merge and commit are the same operation (apply a patch), without any loss of power: you can simulate all of Git within Pijul, except for bad (i.e. silent non-associative) merges, which Pijul doesn't have.

2. Everything is easily reversible. I know all actions in Git are reversible in some way, but not in the same way: for example, you can't undo an old patch without changing the identity of all the patches after it. I know you're thinking this is important for strong version identifiers, but Pijul also has strong version identifiers, just without the compromise on usability. This is achieved using cool cryptography tricks.

3. Solving a conflict in Pijul actually solves it. Conflicts happen between two (or more) patches, and are solved by a patch: if the same two patches are on another branch, you are guaranteed to get the exact same conflict in 100% of cases, and that conflict is solved by the very same patch that solved it in the first branch.

4. When merging, Git solves an optimisation problem that may have multiple solutions in some cases. Git chooses one arbitrary solution based on the content of the lines, and doesn't warn you if there are others (because that would be a very hard computational problem). Pijul doesn't do that, and gives you strong guarantees on merges. You still have to test, but when reviewing, you can predict in your head, with 100% accuracy, how Pijul will merge. This isn't the case in Git: lines inserted at the end of a file might be merged into unrelated lines at the beginning of a file sometimes, if Git feels like it.


> Git should work the way we think it does!

Hold on, who is "we"? Personally speaking, git works the way I think it does. Granted, I've written my own (simple) libgit2 frontend, so I understand the git internals fairly well, on a high level at least

I haven't looked into pijul, but why is teaching people a new tool more helpful than teaching people how the tool they already use works? (Like the OP blog post does.)

Am I blinded by the knowledge I gained from writing my little tool and learning about git internals? I get that a tool you need to learn the internals of to use is probably a bad tool, but is asking git users to understand the contents of the OP blog post really too much? Maybe I'm just a git fanboy...


>> Git should work the way we think it does!

> Hold on, who is "we"?

I'm not the GP, but I agree that git should work the way "we" think it does, and I think a reasonable definition of "we" in the context of Git Users is probably SaaS/Startup/SMB software engineers.

Git is popular enough to have many thousands of different use cases, but I would speculate that the distribution of use cases probably follows the distribution of public Github/Gitlab repos pretty closely.

> Personally speaking, git works the way I think it does. Granted, I've written my own (simple) libgit2 frontend, ...snip...

> Am I blinded by the knowledge I gained from writing my little tool and learning about git internals?

Yes.

> I get that a tool you need to learn the internals of to use is probably a bad tool, but is asking git users to understand the contents of the OP blog post really too much?

Yes. Or rather, knowing git's internals is incredibly helpful if you've already decided to use git and now you're deciding what workflow to use to develop software, because you can match your mental model of how to use git to the way git naturally wants to represent your stored work.

However, if you come to git with an existing mental model of software development, and that existing mental model includes the idea of "branches" or "diffs" or "immutable history", then you're going to quickly and repeatedly run into stumbling blocks as your mental model doesn't match git's internal model. Git can do branches and diffs and immutable history, of course, but they're a leaky abstraction on top of the concepts git really cares about.

> Maybe I'm just a git fanboy...

Sure, nothing wrong with that!


> Git should work the way we think it does!

I think it works using snapshots... or are you saying that Git should work the way that you think it does, and not how I think it does?

It's clear that Git is not the final evolution of version control systems, that we are just currently in the "Git era" and at some point we're going to be in the "post-Git era" of VCS. It's unclear what that looks like, but I am skeptical when I hear these claims about Pijul.

> One can share just a patch, and applying it is guaranteed to be the same action everywhere if that's possible, which it often is.

My understanding is that you need to define a very weak version of "same version everywhere" which is useless. With Git, you can merge and get no conflicts, but that is no guarantee that the patch applied successfully... it just means that the merge operation didn't run into any obstacles. It's not just the patch that needs to be vetted by humans, it's the state which must also be vetted, and that's one of the problems that Git solves well.


I don't view git as a series of diffs. I view it as a logical extension of my file system to include a time dimension (or in fewer words, as snapshots).

It replaces file-v1, file-v2, file-v2-with-changes-from-Alex, etc, that you commonly find on the hard drives of people not familiar with version control. That it can generate meaningful diffs is a product of the type of data we're storing.


> I'm patiently waiting for pijul to be mature enough that I can move everything over to using it

Pijul is super slow. I've tried it a couple of times, and it is too slow to be usable.


It has made huge progress lately, which was always the plan: the new data representation introduced in November 2020 made it scale to huge repositories (this was impossible before, because of disk space and speed), and then we also changed the backend recently (https://pijul.org/posts/2021-02-06-rethinking-sanakirja/).


I'll check this again!


Can I make a shallow clone in Pijul?


I'm one of the authors. The concepts are different; we do have a concept called "tags" (not ready yet) which is more efficient than shallow clones, in the sense that you can make patches (commits) on top of a shallow clone without any downside (unlike in Git, where shallow repositories can become slower).

You can also do partial clones in Pijul: since patches commute, the patches you produce on top of a partial clone can be pushed to the full repository in the exact same way.


I strongly disagree.

Snapshots are a useful concept for programming. Each snapshot represents a compilable program with a certain set of features. So snapshot A has a certain set of features and B has another.

Diffs are not a useful concept. Does the diff between A and B represent the new features in B? No. Because if it did, it would mean I could take any other compilable snapshot C and apply the diff of A and B to it; then I should end up with a snapshot D that is compilable and has all the features of C plus the new features in B. And that doesn't work with any programming language I know.

It doesn't even work with the most trivial features.

Diffs may be a useful concept when working with some data formats. But for programming languages, snapshots are the right concept.


I suggest you read more about Pijul and Darcs, because you seem to be confusing Pijul patches with the output of `diff`.

Patches are much easier to work with, more reliable and fully deterministic. For example, merge and rebase are the same operation in Pijul, you can remove an old patch without changing the identity of all the patches after it, and yet have strong version identifiers, with the exact benefits you describe for snapshots.


No idea what Pijul is, but how does this not describe git?

Unless your complaint is that a commit is really a set of diffs/patches?


Pijul (and Darcs) operate on sets of patches. As a simple example, git commits (other than the root) have at least one 'parent', which imposes an order. E.g. let's say I edit file X in commit x and file Y in commit y; if I want both of those changes, git forces me to apply them in a particular order, e.g. [x, y]. If someone else applied those same two commits in a different order, they'll get a different commit ID, which may cause problems e.g. when trying to merge their changes with ours.

If we treat x and y as (sets of) patches instead, then the set {x, y} is the same as the set {y, x}; the order doesn't matter (we say those patches commute).

The idea of commuting patches is really useful, since we can rearrange patches to a more convenient form. For example, if we commit something we shouldn't (like a password, or a huge binary), then later remove it, a system like git makes it hard to remove that file from the history. If we're dealing with sets of patches, we can simply swap them around until the 'add file' and 'remove file' patches are next to each other, then merge those two patches. Voila, the file no longer appears, the rest of the history remains intact, the branch's content is guaranteed to remain unchanged (since we only swapped commuting patches, which doesn't change anything; and merged two patches, which doesn't change anything).
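
The git side of this is easy to demonstrate (a sketch; x and y standing in for real commit ids):

    # two independent changes x and y, applied in different orders on two clones
    git cherry-pick x y   # clone 1: new commits x', y'
    git cherry-pick y x   # clone 2: new commits y'', x''
    # the final trees match, but all four commit ids differ, because each
    # id hashes in its parent: the order is baked into the history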


Patches are not commutative in general though, so surely Pijul must have some history/ordering mechanism?


Pijul can detect when a patch commutes and when it does not. From there it can construct a dependency graph of commits and use that information in various interesting and useful ways. Back when I used Darcs I would somewhat frequently do things like pull these three commits and their descendants as a sort of pseudo-branch that would include everything necessary for a specific line of work but leave everything else behind.

To the point of the article: when commits are diffs, you sort of intuitively think such things should be possible. But because in git commits are snapshots, it's not as easy as you would expect it to be.


Yes, patches aren't commutative in general. That is precisely the reason Pijul can help us in these situations, and guarantee the result will be the same: it will either work without issue, or it will abort due to non-commutative patches (it's similar to a type checker, which either guarantees that functions won't be called with the wrong type of argument, or aborts because the types don't match up)

From my understanding, the 'history/ordering mechanism' in Pijul is composition of patches. In general, the patch 'patch1 ∘ patch2' can be different from the patch 'patch2 ∘ patch1'; when they just-so-happen to be the same, we say that patch1 and patch2 commute.


> In general, the patch 'patch1 ∘ patch2' can be different from the patch 'patch2 ∘ patch1'.

That isn't true. In Pijul, either patch1 explicitly depends on patch2, or patch2 explicitly depends on patch1, or else these two things you said are equal.


People using git think that commits are patches. But that isn't how git works. Git sometimes tries to let you treat a commit like the diff between it and its parent, and lets you try to rewrite history, but these operations really make new commits with new ids, and this confuses people.

In pijul, the objects you interact with actually are diffs (aka patches), and then snapshots are well-formed sets of patches. Here, well-formed means that if a patch is in the set then so are its dependencies (these dependencies aren't like parent commits in git; they're more like "you need to add line 3 before you can delete it"). So removing or modifying a patch in a branch isn't a horrific interactive rebase operation anymore.

When you move a patch in pijul it doesn’t affect any of the patches written before or after it (unless they depend on it). When you “move a patch” in git you rewrite the history and create new commits, so if I was talking about a commit (id) before the move, I would be talking about some dangling commit after the move and would need to update my id to the corresponding new post-move commit.


I think merge commits are key to why "snapshots" are a better model than "diffs", and a stronger argument would emphasize this more.

Like people have said, the two models:

- a commit is a snapshot plus a pointer to a parent commit

- a commit is a pointer to a parent commit plus a diff

are sort of isomorphic. And some commands in the git porcelain (like git cherry-pick, or git rebase) indeed make more sense if you think of commits as diffs.

But this isomorphism becomes really strained when you have commits with more than one parent (or even zero parents). (And I think it's telling that those commands don't play very nicely with merge commits or the root commit.)

If you really want to incorporate merge commits and the root commit, the alternatives become:

- a commit is a snapshot, together with a list of zero or more pointers to parent commits

- a commit is a list of M >= 0 pointers to parent commits, together with N > 0 diffs, subject to the invariant that:

a) M = N, except that for exactly one commit, which we will call the "root", we are allowed to have M = 0 but N = 1

b) starting from any commit, if you traverse a path back to the root commit by following parent pointers, and then sequentially (in reverse order) apply, for each commit in the path, the diff that corresponds to the parent pointer chosen, then the result of composing all those diffs is independent of the path chosen.

And when you put it like that, it's pretty clear that the "diffs" model is really impractical, and that's why it's a lot better to think of commits as snapshots.
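
The multi-parent case is easy to see directly: a merge commit stores one tree and several parent pointers, and nothing that singles out any particular diff (a sketch; output abridged, hashes illustrative):

    git cat-file -p HEAD
    # tree 4b825dc...
    # parent a94a8fe...
    # parent de9f2c7...   <- a second parent: there is no single "the diff" here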


It's nice to understand this, but I fail to see it helping much in practice. Sure, you'll know why the thing you want to do is hard for git to do, but that won't make it much easier.

And without knowing even further implementation details, it's a bad idea to rely on this knowledge. For example, the article states that committing a rename separately from edits to the renamed files helps git track the renames. But that's not obviously true from the discussion above, because it's not obvious whether, when computing a diff between two commits, git will follow the entire history or just apply the diff algorithm to the two commits.

If it were the latter, then it wouldn't really matter which order you commit things in; git would simply see commit1: fileA, fileB with contents cA and cB; commit2: fileD, fileE with contents cD and cE, and would do the quadratic work anyway, even if commit1.5 had fileE, fileD with contents cA, cB.


It strikes me as bizarre that something as old and as important as git is to the general version control problem, doesn't have a beautiful, complete and helpful user interface.

With the status quo how it is, I definitely love articles like this because every time I use git I get a kind of anxiety that fades only in proportion to the depth with which I understand actual git mechanics.

The thing I find strange is that when I interact with databases that have beautiful, helpful user interfaces, I have almost none of this anxiety, and just kind of accept "black box that handles things", and move on with my life.

I figure I must not be alone in this psychological niche. Which again, makes it bizarre that the problem of giving git a beautiful, complete, helpful front end has not been solved.


I guess I'll be the one to make the obligatory "magit is awesome and if you use Emacs you should definitely check it out" comment.

Other than being horribly slow on Windows, I can't think of any downsides. Aside from the very rare black-magic incantations, it does everything I've needed from a Git frontend.

If something like it existed for SVN ($JOB VCS of choice, sadly) I would abandon Tortoise in a heartbeat. IntelliJ is nice, but the overhead of the VCS add-ons kills my startup time.


> It strikes me as bizarre that something as old and as important as git is to the general version control problem, doesn't have a beautiful, complete and helpful user interface.

It has several.

Tower is a wonderful interface on macOS, Sublime Merge too.

Github is another; Gitlab is also very good. Gogs is a free-as-in-beer option too.

There are several. None has dominated the market, though.


Appreciate this and the other similar comments! I only really knew about Github Desktop, and didn't really like it, but I'll give the others a whirl:)


I use and recommend Fork.


I can't agree with you more. git commands are definitely not designed for the current mainstream usage (i.e. with services like GitHub/GitLab). Simple things like forking a repo from another user and editing it locally require >10 non-straightforward steps, which is far from ideal.


There are so many tools to help with this, though. If you want to work with Github, there is an official Github CLI tool that makes forking easy peasy. Gitlab doesn't have an official one AFAIK, but there are unofficial ones. And if you want a GUI, there's a myriad of those as well. I don't understand this complaint at all.


I like SourceTree from Atlassian, I dip into the command-line from time to time but it meets many needs.

Only problem is, no Linux version, only macOS and Windows. But that's now solved with WSL2 ... code in Linux/Docker/PyCharm etc on Windows WSL2, SourceTree on Windows.


Would you like to talk about our lord and saviour IntelliJ?


I think it's a tragedy that just about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

A lot of people are scared of rebase and cherrypick and shut down or get defensive when you mention them or try to encourage their use.

The result is, because developers only have a hammer, they brute-force merge everything, which results in grotesque conflict resolutions and commit histories and makes it hard to untangle problems.

At a previous job, another developer was kind enough to walk through rebasing on the command line with vim. I was receptive and in about 10 minutes, I realized there was a significant set of standard features and day to day Git use I was previously just oblivious to.

These days, the UI for rebasing and cherry-picking in Gitkraken is state of the art and effortless, and I use them every day without hesitation and without the fear that comes from not understanding or knowing what I'm doing. Still, I constantly struggle with coworkers merging feature branches from 100 commits ago into new feature branches and brute-force resolving conflicts across half a dozen files in one commit without any context.

I see it all because I have visibility into the history and branch relationships, but I still get shrugs and eye rolls when I bring it up. I don't necessarily want to dictate nitpicky git usage, but I have a hard time accepting it when people just refuse to learn how rebasing and cherry-picking work when they're both core basic features of a tool we all use every day. Proper Git use is one of those hills I'll die on, though, so I don't intend to shut up about it any time soon :)

Edit: My practical advice: If you use git every day and you don't know how to rebase, reset, cherry-pick, and stash from the command line, make it a goal. Then, once you're comfortable, learn how to do it in a visual tool like Gitkraken and make an effort to incorporate them into your daily workflow. My guess is things will become a lot less tedious and confusing when things get messy.
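
For anyone taking that up, a minimal starter set (a sketch; branch and commit names are placeholders):

    git rebase -i origin/main    # replay and reorder your local commits onto main
    git reset --soft HEAD~1      # undo the last commit, keeping its changes staged
    git cherry-pick <sha>        # copy a single commit onto the current branch
    git stash                    # shelve uncommitted work...
    git stash pop                # ...and bring it back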


I don't have a solution and maybe the problem is just not solvable but ...

The tragedy is that git is so hard to learn. Start a github project (I know github is not git). Take a PR, have the PR have a conflict; now try to explain to the new user how they can fix their PR via git to not conflict. You'll be stuck giving them a giant lesson, probably an hour to write the instructions, then several back-and-forths.

Mostly, either they already know git and fix it themselves OR I give up and merge it by hand myself since it's easier than becoming a git teacher for them.


> Take a PR, have the PR have a conflict, now, try to explain to the new user how they can fix their PR via git to not conflict.

Is this a Git problem? I recall entire workdays being wasted on SVN and CVS back in the day with multiple people trying to make sense of a merge.

In Git this is actually easier to do (and easier to do repeatedly, with git rerere and similar).


It's a problem, and the place to fix it is in git. That makes it a git problem. Just because things were worse back in the day doesn't mean we can't have nice things.


What puzzles me is how resistant many developers are to using or even considering a Git GUI. I prefer SmartGit, but GitKraken is nice too.

People tell me, "I'm so much more productive on the command line" and then it turns out all they know is pull/commit/push and using a local branch. Anything outside that brings terror: "I never use rebase. What if something goes wrong in the rebase? Now I've lost all my work and I have to pull a fresh copy of the repo from scratch."

Yes, I have heard exactly that.

One thing I love about SmartGit is how it unifies features that the Git command line presents as separate and unrelated concepts. The reflog? Click the Recyclable Commits checkbox and now all of your reflog commits show up as ordinary commits just like any other.

Stashes? Same thing. Turn on the checkbox to make a stash or all stashes visible and now they show up as ordinary commits, which is all they are under the hood.

Want a diff between two commits, whether they be normal commits or stash or reflog commits? Click one commit, ctrl+click the other, and you instantly see the differences between the two. No need to check out a reflog commit temporarily just to have a look at it.

Yet I have only had a 5-10% success rate in getting anyone to take a look at any Git GUI, much less use one. I would be really interested in understanding why so many developers are reluctant to do anything other than the Git command line.


> SmartGit […] GitKraken

> I would be really interested in understanding why so many developers are reluctant to doing anything other than the Git command line.

In the spirit of curiosity, I downloaded the two packages. SmartGit shows me a document and outright threatens me to "deactivate". https://www.syntevo.com/blog/?p=4148 This is a euphemism for other people coming to my computer and deciding what I am able to do or not. I declined here. GitKraken shows a log-in screen before letting me use the software properly. It does not refer to the account I already have on my operating system and which is entirely under my control, but some other account which is under the control of other people. That account could be revoked at any time without my say in the matter and then I would not be able to use the software, on the face of it that's the purpose of the log-in screen. I declined here. As such, I did not even run the central part of the two software packages and cannot tell my opinion how well it would work, but I already learned what I needed to know.

I object to other people desiring to restrict how I use a software. This is fundamentally not compatible with my view on how I want to run my life, and that software/the people responsible for it have no business telling me. I will never consider these packages again. The software I have been using for years, namely qgit (a GUI) and the command-line tools, impose no such restriction.


I agree with your assessment. I think GUIs are great for things that you don't do often enough to memorize (or for things that are inherently visual, though that's not relevant here), but they are often looked down on.

There are many people who do enough git to know how it works well, and who are familiar enough with the commands that they don't need a GUI and are likely faster on the command line. But for every one of those people there are at least two who would work faster and more accurately with a good GUI.


> Edit: My practical advice: If you use git every day and you don't know how to rebase, reset, cherry-pick, and stash from the command line, make it a goal. Then, once you're comfortable, learn how to do it in a visual tool like Gitkraken and make an effort to incorporate them into your daily workflow. My guess is things will become a lot less tedious and confusing when things get messy.

I would add git bisect to the list. It's incredibly useful (if your codebase is sane).


I read some description of what that is, and it looks like checking out different commits (via bisection) until you figure out where in the history some change happened. Is there some other benefit I am missing?


It's not about finding a particular change, it's about which change broke something.

In the best cases, it's totally automatic. You know that it worked at commit A and is broken by commit Z. So it checks out commit M and runs the tests. If they succeed, then it broke somewhere between M and Z. If they fail, then it broke somewhere between A and M. So it checks out either H or S, depending, and repeat.

It's not always that easy, especially when your tests and environment are complicated. There's often manual intervention, which is tedious. Still, log2 N steps is often manageable, especially if the computer is taking care of the tracking for you.
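
In command form, the fully automatic case looks roughly like this (./test.sh standing in for any script that exits non-zero on failure):

    git bisect start
    git bisect bad               # the current commit is broken
    git bisect good v1.0         # a known-good commit (tag illustrative)
    git bisect run ./test.sh     # git drives the binary search itself
    git bisect reset             # return to where you started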


Thank you (and the sibling) for the explanation, sounds useful!


It does the binary search for you, and it can run completely unattended if you can write a script that determines whether the bug is present.

The other day I used it to write a good bug report. I first used it to find the earliest commit I could compile on my machine, then I used it to find the commit where a certain command would fail.


I know how to rebase, reset, cherry-pick, stash, reflog, assume-unchanged, and many other advanced techniques.

I still prefer to add/commit/branch/merge. I often copy-paste changes into a new branch, just because I don't enjoy recalling arcane commands from memory or googling them for the umpteenth time.

I suspect that git is a leaky abstraction that doesn't fit the corporate software development workflow. I think that git is a hammer and non-distributed development is the screw we're hitting with it.


True, but you could also argue the opposite view: that it's a sign of git's usability that beginners can get by with just those commands. The problem doesn't crop up until those lazy users start doing things that make the repo messy.


That's honestly the opposite of good design. It's hiding the complexity to make it seem easy for beginners, then slamming them with inscrutable error messages when they "don't use it right." It leads to a system with a deceptively gentle learning curve that requires you to suddenly learn everything all at once when you hit an issue.


Good point. I'm not sure which side I'm on; to me git feels like a good core with an atrocious UI. Even after years of use I have to look up whether to use this or that option, whether it's uppercase or lowercase, whether I use "--" or ":", and on and on.

Mind you, I'm not complaining, most utilities I've written for my internal users are worse! It's when something gets out and used by the masses that you wish you had had the time to put together a coherent user interface.


It's really just using the old generic version control commands everyone's used to.


Rebasing and cherry-picking are awesome tools once you know what they're actually doing. I think people avoid rebase for a few reasons: the term "rebase" doesn't mean anything outside of Git, so it's not obvious what it is doing under the hood, and inexperienced Git users might use it to change the history on the main branch, which I see as an antipattern.

There's nothing inherently wrong with merging, but I personally don't like it because I find merge commits harder to understand than regular commits. Better to use things like rebasing and cherry-picking to move commits arbitrarily and then squash some commits into units of work that make sense.

Stash is crappy though, IMO, because it's not branch-specific. Instead of stash, I like to fork the branch I'm working on and create a "WIP" commit. That way I don't lose track of work I had in progress that only belongs in a certain branch.
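
Sketched out (branch name assumed):

    # instead of stash: park the work on its own branch
    git checkout -b feature-wip
    git commit -am "WIP"
    # ...later, to resume:
    git checkout feature-wip
    git reset HEAD~1             # unwrap the WIP commit; changes return to the working tree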


> the term "rebase" doesn't mean anything outside of Git so it's not obvious what it is doing under the hood

Read "base" as "baseline"; that should help significantly for the simplest use of rebasing a series of commits without changing them.


I think it's a tragedy that just about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

This implies that they think in the wrong way, not that they have the wrong tool. A real tragedy is that git took over the world (in the minds of lovers of shiny new things, and in SaaS) without most of the world realizing that they don't even need it, because they wouldn't even like to think in its way. The world wanted a quick Subversion and instead got this in-all-regards UX monster.


In practice, rebasing increases conflicts, requires teams to time their merges, and obfuscates the history of the project.

I never understood why people think this is a good pattern.


What you need to understand is people use rebase on their unshared branches. It's part of crafting your commit history to be a coherent set of atomic changes instead of the path you took while developing it all.

You rebase BEFORE you merge into the mainline branch.


Do you run your test suite against each of the commits you create when rebasing? If not, isn’t this “coherent set of atomic changes” misleading? It seems like a lot of effort to make a fake clean-looking history.


When I've done this, the private/dev branch may be a series of broken commits. Then you rebase onto main, squash to one commit, and test (if necessary). So what shows up on the main branch is a single, squashed, tested commit that contains one logical unit of code (usually a feature or fix).

In this model the main branch history is "real" in that it records the sequence of changes to the production code. It's "fake" in that it doesn't record the exact sequence of fumbling steps and backtracks you took to get there. But IME the latter is usually not very useful anyway.


I like the squashed commit approach. I get there by merging upstream into my dev branch when developing, then squashing before I merge my changes into the upstream. As far as I can tell, that has the same outcome as rebase with squash. Both approaches create a simple commit graph, and both avoid fake intermediate commits.
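
For reference, the squash-merge variant is just a few commands (a sketch; branch names assumed):

    git checkout main
    git merge --squash feature    # stage the branch's net change, no merge commit
    git commit -m "Add feature X" # a single, testable commit lands on main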


In some cases I agree, but squashes can end up so large that doing a `git bisect` (which is quite useful in finding the comparatively small commit which introduced a bug) becomes unfeasible.


There shouldn't be an issue in doing so. During a rebase you'll either have no conflicts — in which case there isn't an issue — or you'll have to stop to resolve conflicts, and you might as well run tests before continuing the rebase. In both cases I'd argue that the statement "coherent set of atomic changes" applies.


Correct me if I’m missing something here - but a lack of conflicts during rebase only means that the few lines surrounding your changes weren’t changed in the upstream. The rest of the repo changed, and this will often cause some kind of inconsistent state. I’ve encountered this situation frequently when using git bisect.


When you rebase, you basically replay the history of your branch since it diverged from the branch you're rebasing onto. Thus, the branch is always in a consistent state (or as consistent as it was when you originally authored the commit you're replaying). And of course this assumes the target branch is already in a consistent state.


If the upstream is like this:

A -> B

And you branch off B and start making changes, then the upstream continues on its own:

A -> B -> C -> D

Now you rebase your dev branch off D. Your changes get replayed on top of D and create new commits. Some of those commits might not be valid, because they take code that worked in the context of B and put it in the context of D. The history seems clean if all you do is look at the diffs, but if you bisect and try to use the repo in one of the rewritten commits, you may find it doesn’t even compile (even if that commit was fully functional before rebasing).
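
One mitigation, if you want the replayed commits to stay honest: rebase can run your tests after each replayed commit (a sketch, assuming a `make test` target; D is the upstream tip from the example above):

    # run the test suite after each replayed commit; the rebase stops at the first failure
    git rebase --exec "make test" D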


Hm, you're right. The simplest example I could think of right now is the upstream having renamed/deleted something that the dev branch depends on, but didn't directly touch. That would definitely cause a "broken" history during the rebased commits, and is technically unavoidable.


A surprisingly common occurrence is two developers independently notice and fix the same problem, but they implement the fix in two different ways. The diffs might not conflict at all during a rebase. Or they might only conflict in some places, and the “behavioral conflict” remains after the diff conflict is resolved. This issue would eventually be noticed and fixed when tests fail before merging to master, but the intermediate rewritten commits are unlikely to be fixed.


I can see this happening, but with a reasonable bug-tracking solution in place and enforcing `fix/...` branches for fixes, these situations could mostly be avoided.


For sure, I understand that.

It tends not to be an issue when a developer is working on an isolated feature that only he or she cares about, that is reviewed in a timely manner, and that gets directly committed to main.

Often this is not the case.


In a large repo with many people merging, it helps keep things organized.

In my experience you can make an argument for a merge-based workflow up to around 6 people. By 12 it's painful and hard to track what's going on, doubly so when you have a dev branch and multiple sustaining branches or something more complicated.

By the time you get to 100 people or more committing to the same repo, it just becomes absolute chaos, and at least you can maintain a semblance of sanity in your official branches by forcing a rebase-based workflow on them.


I feel like it’s one of these moments when people speak of git and everyone has their own version of the manual. git-rebase:

git rebase master; Now, the snapshot pointed to by C4' is exactly the same as the one that was pointed to by C5 in the merge example. There is no difference in the end product of the integration, but rebasing makes for a cleaner history.

What does it even mean to have a rebase-based workflow? In svn-like terms, does it mean that you have to sync-merge before reintegrate-merging? If yes, why is rebase presented as if it were something completely non-existent before, freshly reinvented? You do sync-merge before reintegrating in svn, otherwise you'll apply ancient-based patches to the young trunk, which is obviously not what you want.

And if you do not use rebase, but use a merge-based workflow, does it mean that you apply ancient-based patches to master? If yes, of course it will be conflict hell, because master could undergo a few refactorings in the meantime.

It is so confusing when people talk in a different slang, and you can’t tell if they invented something new or just missed something so damn obvious in the old tech. Can you please comment on which of these thoughts are [in]correct?


It's not exactly congruent to what you're describing with SVN. It's all about the DAG of commits and how we're using it to describe the history of the codebase. The end code is identical between the two workflows.

A merge-based workflow maintains the work-in-progress history of commits running parallel to main before merging the two together. So your commit tree splits and reforms, with the number of branches at any one time equal to the number of people working on distinct features at one time.

It shows you how a feature evolved, which there is some benefit of, but at the cost of an explosion in branches that are now part of the permanent record of your codebase and the main branch you're working in. It rapidly turns into spaghetti with even just a few people working in the repo.

A rebase-based workflow will typically compress all the work in progress to a single commit, which then gets applied to the tip of the main branch. This maintains a linear flow of commits where each commit is a single PR.

Maintaining that linear flow of commits is increasingly important as the number of people committing to the repo rises and the branches rise with them.

Visually, a merge-based workflow might look like this:

   4
   |\
   | \
  /3  \
 | 2\  |
  \| |/
   |//
   |/
   1
This would represent 3 features worked on by different people, all branched off the same source (1), and then merging back in.

The same thing in a rebase-based workflow would look like this:

  4
  |
  3
  |
  2
  |
  1
All of the work in progress is collapsed into a single commit when completing the PR to maintain the linear history. Of course while it was in progress, it resembled the merge-based workflow above. The difference is that instead of merging at the end, they rebased and squashed the commits.

Again, the end result in terms of the code is the same. The difference is what you see when you're navigating the history of the repo.


Thanks for taking the time for such a detailed explanation! So, the benefit of rebase is in a graph view of the repo, not in a conflict-resolving workflow (which is consistent with the manual). But why

merge-based ... rapidly turns into spaghetti with even just a few people working in the repo

Isn’t it just a detail of how graph/report tools work? Can’t they track these merge points and “rebase in their ram”? I don’t get how a graphical representation of merge points may change the workflow.

One more thing that is unclear is why some people think that rebase is somehow superior in terms of conflicts and reintegration. Like they "had issues with svn and now that rebase is a thing, issues gone". Maybe they didn't understand that you have to sync-merge your branches (effectively rebasing) periodically to not diverge from trunk (or the parent branch) too much?

Added: I know rebase is not congruent with what I'm asking, but my questions are more about how git folks think, not about how git works. Because I often see comparisons to other VCSs, and claims about git's competitors that are vague or simply untrue. As if before git there was some stone age.


Thinking about this further, SVN forces history to be a linear history of commits, which is easy to reason about.

DVCS have a DAG of commits, which can get arbitrarily complicated and difficult to reason about.

Rebase-based workflow results in a linear history of commits, which is easy to reason about.

Merge-based workflow results in an arbitrarily complicated DAG of commits which is difficult to reason about.

That's all there is to it.


It is a detail of how the graph is stored and viewed, but you'll be looking at that graph when bisecting in search of when a regression occurred, or how a particular feature came to be, or when a particular feature landed.

The repo rapidly turns into spaghetti with a merge-based workflow because all the work in progress is now part of the historical record for any particular commit to main. You're not looking at a single commit, you're looking at a chain of commits and a merge node. Now imagine there are 20 people committing to the same repo. Your branch factor is exploding! Imagine the picture of my merge-based workflow, but multiplied by 7. It rapidly becomes very difficult to navigate.

The DAG is important as a historical record because you have to go back to it on a very regular basis.

It doesn't change the workflow from a "do your work and commit to the repo, periodically sync with main" perspective. It changes the workflow from a pull request perspective.

I've found rebase is (slightly) inferior for conflict resolution and reintegration--mainly because unless you squash your commits down to a single commit, you may need to resolve the same conflict as many times as you have commits after the conflict in the worst case, which is irritating. Merge is just one and done. But that's a minor thing.
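
Worth noting: git has a built-in mitigation for that repeated-conflict annoyance. The optional `rerere` facility records how you resolved a conflict and replays the resolution when the identical conflict shows up again mid-rebase:

    # enable "reuse recorded resolution" globally
    git config --global rerere.enabled true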

Having used SVN (briefly) and Mercurial (a lot) and Git (a lot), SVN pushes you into dealing with a single linear history of commits, with cross-branch merges being extremely painful and error-prone if anyone else has worked in the same area. DVCS like Mercurial and Git allow you to do whatever you like for the history, and cross-branch merges are generally easy and pain-free. I can't say anything about the underlying implementations and why it is that way, but that is my lived experience.

Most of the time with Mercurial and Git I can let the merge tool resolve differences if there are conflicts with the odd line needing manual intervention. Most of the time there aren't conflicts.

And having used SVN a bit...yeah, it's the stone age in comparison. Sometimes you have to make that tradeoff because you can't store everything locally, but I haven't enjoyed the times I've had to use SVN after having used Mercurial and Git.

And Git's user interface sucks.

I need a Mercurial skin on top of Git so the commands make sense.


This. I don't know if it's because of a lack of understanding or bad workflow, but almost all of the times I've seen bugs caused by git operations slip through the cracks, it's been because someone decided that a pretty history was of utmost priority. Rebasing is probably necessary in some cases, but it can be a real foot gun as well.


Honestly, I don’t really see your point. Yes, We keep our commit messages as clean and descriptive as possible. Yes, if we have the time, we split our commits into logical groups of changes. Yes, we work on feature branches for mature projects. We do all this with the git integration of IntelliJ, and I don’t see the slightest reason to waste any time with the syntax of our version control tool! I’d gladly force everyone on the team to use „stuff“ as the single, exclusive commit message, if that improved velocity (which it obviously doesn’t). Because all this discussion about proper git usage is nothing but bike-shedding.


It's also one of the few tools that is likely to be a constant factor in your job for a long time. Yes, it's not super easy, and you can sort of get by with minimal effort, but it's not that much time to invest compared to how much benefit you'll reap.

And I don't mean memorising commands and their arguments, but rather understanding Git from first principles.

(I wrote this visual tutorial for that purpose, takes about fifteen minutes to go through: https://agripongit.vincenttunru.com/)


This looks nice but scrolling on Safari on Big Sur is causing weird artifacts. Maybe if I scroll really slowly...


Ah sorry about that; unfortunately Safari doesn't run on any of the systems that I own, so I can't test on it, and it seems to work fine on the WebKit browser I do have...


Be safe out there everyone. Squash with caution, don't force push master. And remember, reflog is always there to help if you get into trouble.
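
A minimal recovery sketch (the reflog position is illustrative):

    # every place HEAD has pointed, including commits "lost" to a bad rebase
    git reflog
    # move the current branch back to where it was two operations ago
    git reset --hard HEAD@{2}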


Agreed. Git should be treated as a deep skill, as important to practice and train with as unit testing and regular expressions.

Think of your git history as a product and art form in itself. If you don’t enjoy writing your commit history, readers will not enjoy reading it.

On a tactical level, I highly recommend buying Sublime Merge 2.


Alternative interpretation: git is a terrible, terrible tool. It solves Linus's problem, but he also wrote the damn thing. Had GitHub not entered the scene we'd likely all be using something else, maybe even SVN still.

Maybe distributed source control really is this complicated and treating git as a deep skill is justified, but having also used Darcs and Mercurial I have a hard time believing that git's usability issues are inherent complexity; they are in fact an artifact of git itself.


I found Sublime Merge -- although it may have been version 1 when I tried it! -- very unintuitive and fiddly, a lot like using Vim for doing three-way merges. Definitely one of those "YMMV" kinds of tools. (I mean, I'm sure Vim is terrific at it once you get the hang of it.)

Personally, I've settled on

* Getting pretty familiar with the git command line

* Using a decent GUI diff and three-way merge tool (I use Kaleidoscope)

* Using GitUp, an open source Mac git client, on occasions where I want to get kind of arcane: committing individual lines of files in separate commits, re-ordering commits (on an unmerged feature branch because I'm not a complete monster), etc.

I suspect having already discovered GitUp is a good chunk of why I didn't get into Sublime Merge; it can do a lot of advanced stuff in ways I personally find easier to grok.


Regular expressions? I really enjoy using them and futzing around but I encourage the people I work with to stay away and to avoid using in production when possible.


Regular expressions are useful as tools for searching through code, filtering logs, searching through the filesystem..., even if you never commit them into your codebase.


Yes, good clarification. I probably write 10x (maybe 100x? more?) more throwaway regexes than regexes I commit.


Whoa, why? That sounds like awful advice to me. By all means, use regexes, but make sure you understand the theory behind them (state machines) so you will know not to parse HTML with them. They are really pretty easy to get right once you grasp the concepts.


You can write correct ad hoc regex parsers for many subsets of HTML depending upon your needs.


> My practical advice: If you use git every day and you don't know how to rebase, reset, cherrypick, and stash from the command line, make it a goal. Then, once you're comfortable, learn how to do it in a visual tool like Gitkraken and make an effort to incorporate them in to your daily workflow.

I do agree you should learn rebase, reset, cherrypick and stash, but I don't agree that you need to learn on the CLI. I mean, use the CLI if you prefer that, but the git GUIs are perfectly adequate for performing any of these operations.

I used to use git CLI heavily, but in the past few years I have simply not needed to, to the point that aside from a small handful of the most common operations I don't even remember a lot of it anymore. Partly this is due to maturity of the GUIs, and partly because old practices like SSHing to a dev server to edit+commit something there are just totally obsolete and unnecessary these days.

Even for a simple commit, there's a massive convenience of seeing the timeline and being able to interactively stage and look at diffs that is just miles ahead of CLI, and lets me break down commits into better units of work and write better messages.

There's a stupid gatekeeping thing some developers still do about git CLI, I don't get it. It's as valid as dictating what text editors, color schemes, input devices, or OSes "real developers" use. Judge people on their output, not their tools.


I don't force my choice on others, but I'm firmly in "git cli" camp. The reason is simple - cli is available everywhere. I'm sure GitKraken & co. are great once you get used to them, but apart from GitLab graph view (which GitHub sadly lacks, and cli tool afaik also) I don't miss anything. But again, this is just my personal preference and I agree that developers should be free to use gui if they prefer it.


> cli is available everywhere

I guess that's part of what I was getting at though. Where are you doing your development that isn't your workstation?

I work on a whole bunch of different things -- from personal stuff running on my laptop or VMs in my house to cloud services deployed across dozens of AWS services -- and things have just got to a point where I have no need to do a commit anywhere but my workstation. (Well, technically, I have two: one personal, one work).

I definitely used to do it years ago, but now I don't remember the last time I had to use git on a remote system.


I use two at the moment but they change over time, along with the OS. I am not talking about developing remotely, though it does happen that I use git push to deploy for some smaller projects. But I could have used GUI for that too. I guess I just don't want to get used to a tool I won't be able to keep using.


I avoid rebase because it rewrites history, and I'd rather have an accurate log of what happened. Cherry-pick can certainly be useful for grabbing particular commits; although I find myself more often using `format-patch`+`am` to grab particular files (which also works across repos).
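
Roughly, that cross-repo flow looks like this (the paths and the commit placeholder are illustrative):

    # in the source repo: write a commit out as a mail-formatted patch file
    git format-patch -1 <commit> -o /tmp/patches
    # in the destination repo: apply it, preserving author and message
    git am /tmp/patches/0001-*.patch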


I use rebase often because it rewrites history - it lets me squash commits into a conceptual single commit, or re-order commits together that chronologically make more sense next to each other.
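
For example (the commit count is arbitrary):

    # open the last four commits in an editor; reorder the "pick" lines, or
    # change "pick" to "squash"/"fixup" to fold a commit into the one above it
    git rebase -i HEAD~4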


I never understood the "rewrites history" argument. The original commits don't necessarily faithfully represent my thought process, I might have a larger commit because I got distracted in the middle of the day or a shorter commit because I wanted to make sure my code was backed up at the end of the day.


I think the issue here is that using git is not a goal unto itself. Git is a tool/system that should get out of your way as much as possible. Instead it has arcane commands and options making anything but the most basic operations Shakespeare novels on the command line.

My goal is to have my code in the repo. So if git starts being a pain, it’s much easier to store my edits locally, pull a fresh copy of the repo, copy over my edits again and commit + push.

If you have a good cook, let him/her cook dishes and let someone else care about sharpening the knives and cleaning the dishes.

If you have a brilliant programmer, let him/her write good code. Don’t bother them with understanding binary trees and hashes of snapshots of diffs of local repo’s of pointers of objects in a blob graph lalalalala.


My usual workflow is to frequently merge master into my feature branch during development, then I squash before merging back to master. As far as I can tell, this gives a clean history so shouldn’t bother the rebase fans (who prioritize a simple commit graph), and shouldn’t bother bisect fans (who get confused by fake historical commits). Is there something bad about this approach?
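
For reference, a sketch of that workflow (branch names hypothetical):

    # during development: keep the feature branch current
    git checkout feature
    git merge master
    # when finished: squash the whole branch into one commit on master
    git checkout master
    git merge --squash feature
    git commit    # single commit; master's history stays linear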


> I think it's a tragedy that just about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

This is because Git is too hard to use.

How do I know that Git is too hard to use? Because there are literally thousands of blog post tutorials explaining how easy Git is to learn. Things that are easy do not need thousands of different guides telling you how easy it is.


> This is because Git is too hard to use.

Git is hard to LEARN. It is objectively very easy to USE for those who have learned it, so much so that the population of "I used to use git until I found ..." evangelists is effectively zero. Tools like mercurial exist in the marketplace of ideas mostly by peeling off users who haven't yet started using git productively by promising 80% of the features for 10% of the effort.

In fact, I don't know that there has been a new tool since vim or emacs that so well illustrated this dichotomy between ease of learning and ease of use.

But to be honest: it really is needlessly hard to learn. The content of the linked article is that git is built on an extremely simple foundation of data structures and operations that anyone can understand. But the takeaway from the article is that no one does understand it, because that layer is hidden behind a facade of tools that completely obscure it. Where are the "blobs" in git reset? What is the "index"? Is it a "tree" (it's not, IIRC)? I definitely agree with people who complain about the porcelain layer's design. But I still use git every day and love it.
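
For anyone curious, that hidden layer is easy to poke at with git's plumbing commands:

    git cat-file -t HEAD             # -> "commit"; commits are ordinary objects
    git cat-file -p 'HEAD^{tree}'    # the root tree: mode, type, hash, name
    git ls-files --stage             # the index: mode, blob hash, stage, path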


I respectfully disagree on the basis that, yes revision control is hard, but git provides a relatively beautifully simple api and vocabulary on top of an inherently complex and absolutely necessary set of concepts.

When you're working on a codebase with multiple people, there are going to be changes and the changes have to be consolidated and the conflicts have to be resolved. I believe, with a reasonable amount of time and effort, developers can learn that API and vocabulary and I have yet to encounter anything comparable in terms of ease of use and "grok-ability" - especially with modern GUI tools.

git is one of the most ubiquitous and unavoidable technologies in software development and it's 100% worth the time and effort to understand and be good at it.


The counter argument would be that maybe git is so basic and easy to learn that everyone feels comfortable enough with it to write a tutorial?

I think a large amount of content is more a factor of Git's ubiquity than its difficulty.


> git is so basic and easy to learn that everyone feels comfortable enough with it to write a tutorial?

Nobody writes tutorials on how easy Lyft or Uber apps are to use. Easy interfaces don't need lots and lots of tutorials. That's exclusively the result of poorly designed interfaces AND complicated systems.


Lyft is an app. Git is a tool for developers. I have no clue how they're related.

That being said, I googled "How to use Lyft" and there's a ton of results.


> Lyft is an app. Git is a tool for developers. I have no clue how they're related.

Google: What is a system

Good luck.


You have evidence that git is hard to use, but no evidence establishing that it's too hard. Some things just aren't easy.


> This is because Git is too hard to use.

> How do I know that Git is too hard to use? Because there are literally thousands of blog post tutorials explaining how easy Git is to learn. Things that are easy do not need thousands of different guides telling you how easy it is.

I'm not sure that's convincing. I think that a lot of guides about how easy it is indicate that it's slightly difficult to learn. That results in a lot of people struggling for a little bit, overcoming the struggle, and feeling a sense of accomplishment and enlightenment, which they then want to share.

(There's also a difference between how hard something is to use and how hard it is to learn. I'd argue that there's often a trade-off to be made, where some sacrifice on difficulty learning results in a reward in ease of use—in the sense that, for example, vim is far easier to use than any other editor for a seasoned vimmer.)


My SCM journey: RCS-PVCS-cvs-VSS-MKSSI-svn-Perforce-git

I hate all of them but learned to use them because what's the alternative?


Git has over-complicated source control for the majority of developers. Things were much simpler with svn.


SVN is a versioned storage engine pretending to be source control. Branching and tagging? We'll just expect everyone to obey an implied policy in our filesystem tree. Oh you think you want to merge these things that have a clear ancestor in the DAG? Not so fast buddy.


IIRC, merging in SVN is... basically something you never wanted to do. :)


Svn's complete inability to handle merges at a file level may have been simpler, but was by no means better. Needing to coordinate who is allowed to edit each file in order to avoid a painful merge is common with svn, and nearly unheard of with git. Svn looks at the inherent complexity of multiple people working on the same code base, shrugs, and figures that it is somebody else's problem.


> Things were much simpler with svn.

Not from my experience as a newbie with SVN at the start of uni and in the beginning at $job.

In both cases, it was temperamental, prone to network issues (this was in both student accom. -> uni server, and LAN at $job) and did not like users working on the same files.

Git took some learning, and it took reading Git Magic <http://www-cs-students.stanford.edu/~blynn/gitmagic/> to go from <https://xkcd.com/1597/> to the friend mentioned in the alt text.

SVN still feels like I'm pulling teeth all these years later.


> My practical advice: If you use git every day and you don't know how to rebase, reset, cherrypick, and stash from the command line, make it a goal.

I couldn't agree more. It not only enables more sensible interactions with the people that collaborate on repositories with you, it's also a comfort that allows you to experiment and take advantage of the tool without fear of getting stuck.

I would add checking the reflog and how to use it to complete the list, even if it's clearly less important.

A source that I recommend wholeheartedly for those that want to go further down the git rabbit hole: https://learngitbranching.js.org/


How do you track which cherry picks have been done?


We've never had an issue with git merge'ing all over. I've never looked at our git graph because I've never had to.

Maybe rebase is a tool to help poor software development practices? (and your colleagues letting branches go stale is one of those)


I agree: if you merge early and often everything goes smoothly, and if you have a stale branch you're doing something else wrong. At least for the work I do, web dev. Some of our pipeline folks are working under a different paradigm and they run into these problems often, but also understand how to rebase and cherry-pick. As always, it just depends.


> help poor software development practices

Says someone who's "never looked at our git graph".


Entertain me then, why do you need to look at it?


> about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

What are branch and merge? /j


> A lot of people are scared of rebase and cherrypick and shut down or get defensive when you mention them or try to encourage their use.

Because a lot of people have been burned and way too many hours been lost due to a rebase gone wrong. Cherry-pick and stash are trivial operations, reset (outside of "undo a git add") and especially rebase are not.

The learning curve for both is steep, the potential for failure extremely high, so I understand why organizations go as far as entirely banning rebase.


People are scared of rebase because of the constant scare-mongering around it. "Rebase is Evil" and "Never use Rebase", etc. Then we end up with junior devs that are too scared to even use their git IDE's built-in "rebase branch onto remote", so they end up littering the entire repo with "Merged branch-A from origin into branch-A" commits.

It's so bad that even seasoned developers that haven't delved deeply into git have no idea that this sort of rebase is practically harmless. Instead they parrot "Rebase is Evil" without thinking twice.
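
That practically-harmless variant is just a rebase onto the updated remote branch, and it can even be made the default:

    # replay local commits on top of the fetched branch instead of creating
    # a "Merged branch-A from origin into branch-A" merge commit
    git pull --rebase
    # or set it once:
    git config --global pull.rebase true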


This is a good overview of Git internals. If this stuff interests you, Chapter 10 of Pro Git offers similar descriptions of Git objects [1] and Git references [2], and then continues onto Git packfiles [3] which are not covered by OP.

[1]: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

[2]: https://git-scm.com/book/en/v2/Git-Internals-Git-References

[3]: https://git-scm.com/book/en/v2/Git-Internals-Packfiles


Whereas in Pijul and Darcs, commits (called patches) are diffs, not snapshots. They are based on a sound theory of patches, which allows for operations not supported by Git like commuting, as long as the commits aren't interdependent. Plus, language-specific tools can extend the notion of dependency from line-based to semantic.


> They are based on a sound theory of patches, which allows for operations not supported by Git like commuting, as long as the commits aren't interdependent.

This is definitely supported by git. Even though commits may technically be snapshots, you can build a diff from snapshots (and vice versa). `git diff` will get you the diff for any given commit, and `git rebase` will happily reorder commits for you by reapplying the diffs.
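
For instance (the commit name is a placeholder):

    # the change a commit introduces, computed on demand from two snapshots
    git diff <commit>^ <commit>    # equivalently: git show <commit>
    # reorder recent commits by re-applying their diffs in a new order
    git rebase -i HEAD~3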


When reading a bit about Pijul, a few months back, I had assumed every two patches would commute, and I couldn't imagine how that could possibly work.

Does it really have this limitation? If so, it doesn't look like much of an improvement compared to git: I can shuffle "patches" all right using `git rebase -i`. I concede it can be quite slow, though.


So not every patch can commute with every other patch: “delete foo” doesn’t make sense until after “add foo” has happened. So patches have dependencies that they must come after, but in lots of VC situations patches are independent. Sets of patches make rebasing a branch trivial, for example, because adding the patches from master after your patches is equivalent to adding them before. If you would get a merge conflict, you get the same merge conflict whether they are added before or after.

But nailing down the logic behind commuting patches can be important too as it can catch subtle problems that might happen with normal snapshot-based merging. Consider some people independently editing branches

  Bob adds a file with line “foo”
  Alice pulls Bob’s patch
  Bob changes “foo” to “bar”
  Alice changes “foo” to “bar”
  Bob changes “bar” back to “foo”
In Pijul or Darcs you should get a consistent result pulling changes from Bob and Alice no matter what order you do it in. But if you use something like git, the order you pull and merge, and whether you do it at any intermediate times, might change the resulting snapshot (as well as just the history). The start and end states of Bob’s repo look the same as snapshots but they are different, because Bob changed his mind about the line “bar”—maybe the change didn’t work.


I really liked this video: the guy first walks you through how to build your own git-like utility with a handful of shell commands, then goes and walks through an actual git repo:

https://youtu.be/qq_s2Hh--aQ

Even the first 20 minutes was enough for me to have a substantially better understanding of how git works.


I just finished watching this and have to say it is a great talk to understand git.

Once you go through the initial 20 mins building a git like utility with bash the rest of the talk about git becomes easier to follow. I recommend building the git like utility in bash yourself first, playing with it for some time and then watching the rest of the talk.


This article goes into a little too much detail imho. I have had great success explaining Git to coworkers using post-its, permanent marker and a flip board (no computer!) and going through the steps Git would take (abstractly, not exactly) when performing certain commands. All commits (and their relations) are written down on the board with the marker because they don't change (eg: rebasing just creates a new line of commits). The branches are written down on post-its and can move around (like this article explains, they are just pointers). You can use a whiteboard with non-permanent marking for the working directory and index if you want to go that deep.


Neat overview of some of the core concepts in Git that often go unnoticed. Although I'll say that the fact that commits are technically not diffs doesn't seem to matter much in day to day use. Git does a decent job of abstracting that detail away to the point that you could just as well believe commits are diffs. Also, I want to say that technically I believe Git does use deltas to compress an object's history in the blob store. But the different blobs that comprise an object's history can be thought of somewhat as being separate. Git could just as easily not perform this internal, space-saving optimization and things would all work the same. The SHA hashes would be the same and based on the same input.


Cherry-pick is what messes up the commit-as-snapshot idea for me. If I see a small commit that I feel I can merge into my branch then that commit feels like a diff and I don't want to care about the rest of the stuff that commit snapshots. I guess that's a good thing.


I tend to agree.

I am not someone who has a deep understanding of the inner-workings of git by any means, yet I am perfectly comfortable with rebasing and cherry-picking.

For me, git is so much easier to intuit if I only think of it as diffs. When I rebase, I'm just rearranging diffs, or squashing them together, or whatever. If I try and think of everything as snapshots it actually gets more confusing for me.


So, I think a useful simple way to think of it is "git creates diffs when it needs to on demand."

When you're doing a cherry pick of, say, commit ce123, what you're asking git to do is: (1) diff ce123 against its parent, and (2) apply that diff to some other branch.
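
Spelled out as plain commands (ce123 and the branch name are illustrative):

    git diff ce123^ ce123 > change.patch    # 1. diff ce123 against its parent
    git checkout other-branch
    git apply --index change.patch          # 2. apply that diff elsewhere
    git commit -m "apply ce123's change"    # producing a brand-new snapshot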

Likewise rebasing is the same, but with an extra step to apply the inverse of the diff to the original commit first, then rewrite the history.

One of the big advantages of this on demand diffing approach is it's much more robust vs conflicts. Back in the subversion days I wrote some shell scripts that did the equivalent of git cherry pick and rebase. I'd keep a couple extra copies of a checkout, would use the switch command to quickly put them into a specific state, then would just generate a diff manually to apply to my main working copy. It worked, and was often faster than manually copying text around between editor windows, but it was extremely conflict prone.

So this distinction, of whether you store snapshots and diff on demand, or store diffs and snapshot on demand, is somewhat subtle but has important consequences.


Since you can go from diffs to snapshots, and snapshots to diffs, aren't they basically equivalent? I'm struggling to see the important consequences at the user level.


There's an asymmetry that shows up once you start doing things that rewrite the history/order.

A snapshot style commit, like git uses, always denotes a complete state. Git creates the diff it needs on demand in relation to this, and any new commit created by cherry picking, rebasing, etc, is given a new identity specific to its content.

On the other hand, in a diff based system, the meaning of a diff changes based on its neighbors. This is because a line based diff is a flawed way of representing the logical operations we're doing. This is why git style systems tend to see less conflicts than diff based systems.

I was a darcs user for personal stuff before git appeared, and ultimately in comparing the two I've come to see git as the right model due to this asymmetry. If we had editors that exposed logical/semantic operations, then the Theory of Patches approach would be extremely impressive. But that would also require our version control system to be language-aware, so it's probably a non-starter as a generic tool for developers.


You can't go from diffs to snapshots. Two identical diffs can be applied on different branches - looking just at the diffs, you don't know which branch it is.


Yeah, good summary.


Commits are conceptually snapshots, and everything else Git does is just an optimization over the naive “keep all versions of all files ever” (imagine implementing a version of Git that is just zipping the entire folder). Diffs are isomorphic to commits and are generated as needed.
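
In shell terms, the naive version is something like this (directory names are hypothetical):

    # every "commit" is a complete copy of the working directory
    mkdir -p .naive
    cp -r worktree ".naive/commit-$(date +%s)"
    # blobs, trees, packfiles and deltas are all optimizations over this idea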

I wrote about it (albeit imprecisely) here: https://siawyoung.com/git-intuition


Yes, exactly, this is a very good post on the nature of Git.

> Branches are pointers

Yes. I would say they are named pointers. Commit hashes are weak, unnamed pointers.


I think we're running into a naming issue here. It's useful to think of a single commit in itself as a diff. The DAG is a useful model for an accumulation of changes. The question is, what changes and operations make up a node in the DAG (i.e. what code is in this branch, compared to that? What code do they have in common)?

To answer this: take the node and follow its predecessors until you reach one (or more) roots. All commits along the way are contained in the commit at hand. That's the history.

Adding changes is, I think, the most useful mental model, even if it is not the implementation.

Now what the author is saying is: a commit is not only the diff, but also the whole tree/history that the diff is based on. And that is also true; the commit (the change plus the past) is then a snapshot.

Do we have a good naming convention for the single node in the tree with its changes, compared to the single node in the tree with its changes AND the references to the parents with all their changes etc.?


> It's usefult to think of a single commit in itself as a diff.

Except if it has multiple parents like a merge commit.

Actually I don’t agree even in general. It took me an unreasonably long time to become unafraid of git because I clung to the common VCS mental model where commits were actually diffs.


But the snapshot-model also doesn't really make a lot of sense for merges. It's a snapshot of what then? A merge of all parent trees? What's a merge of two files then? Defining this merge-operation on trees is at least as mentally taxing as the alternative.

Accumulating all the diffs from two (or more) ends (until they are common again) is at least as useful.


For me, a merge commit in git is just a snapshot like any other except that its metadata contains links to more than one parent.

The parent child relationship acts as nothing more than remark that the child was derived from both parents in some way.

Of course, commonly the child is derived by finding the most recent common parent, using heuristics to guess file identities after any renaming and then performing a 3-way line-based diff between what it thinks are corresponding files.

But actually git doesn't really care - it's just another snapshot you've created and added to the DAG.

I haven't found it helpful to think of what's going on in git in terms of an "accumulated file diffs" abstraction because git has no notion of file identity (across commits).


> But the snapshot-model also doesn't really make a lot of sense for merges. It's a snapshot of what then?

It's a snapshot of the final result.

That's the beauty of the "commit as snapshot" model: each commit always contains the final result of the commit. It doesn't matter if the commit is a normal commit with a single parent, a merge commit with multiple parents, or even an initial commit with zero parents. It doesn't matter if the parent commits are unavailable (shallow repositories). It doesn't matter if the parent commits have been changed (grafts).


You can have a diff against multiple parents - you just get multiple status columns then, similar to what you see in the diff for merges.


This comes up from time to time and each time the comments debate the correctness/effectiveness of the title.

The content of the post does shed much light on how git operates and introduces a view that can help in navigating how to use git.

Whether or not you want to think of a commit as a snapshot or a diff isn't material. It's best to think of it as a dual, since a diff on any base can create a snapshot, and two snapshots can create a diff.

This very much mirrors the idea of a transaction log (of diffs) and a 'current' state. The current state is convenient and can benefit performance, but is not absolutely necessary. It doesn't even have to be the most recent, e.g. key frames in video compression. These are all just ideas; getting used to them and being able to move viewpoints between them is better than clinging to any one of them.


Most developers think of commits as diffs and they can for all intents and purposes be thought of as such. It’s actually best for the understanding of how to practically get things done to think of them in this way.

Odd semantic argument to make.


If anyone does want to get more into the internals of Git without playing with a production repo, I built a "playground" a while ago which creates a simple Git repo of synthetic commits which you can then play around with:

https://github.com/dmuth/git-rebase-i-playground

I know it says "rebase -i", which originally what I built it for (and what the exercises in the README are for), but you can really do whatever you want in it, and blow away/rebuild the repo with the included script.

Enjoy!


>Commits are snapshots....commits are diffs....

Neither model really encompasses commits for me.

I prefer...

Commits are a point in history I can return to after I inevitably fuck up or look back on so I can convince myself, yes I am indeed making progress.


Am I the only person that doesn’t want to understand the inner workings of my VCS in lurid detail? I don’t have to know as much about any other developer tool in order to use it effectively.


> I don’t have to know as much about any other developer tool in order to use it effectively.

You don't? How do you debug problems?


I just use the debugger and it mostly works how I expect. I don’t have to go and study the data structures and other intricacies of the debugger itself to puzzle out why it works the way it does. Git is terrible in that way, as evidenced by the thousands of blog posts of people trying to describe the inner workings of it and how it will “make more sense” once you understand it as well.


Think about ZFS - ZFS works perfectly fine for me and I don't have the foggiest idea how it works beyond "copy on write" magic.


Do you feel like you need to know this to use git? What did the blog post change about your use of git?


I used to think commits as snapshots, but it was confusing. Then I read "Git Internals".

A commit contains the "whole" content of each file that we've committed. And since a commit has a pointer to a root tree, it also represents a working directory. Even though a commit contains "whole" files, git internally may store only deltas of the files as an optimization.

When we diff two commits, we see the difference of the file contents in the corresponding working directories that the commits represent.
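
You can see this directly; a commit object records a root tree plus parents and metadata, not a diff (output paraphrased):

    git cat-file -p HEAD
    #   tree   <hash of the root tree>      <- the "whole" working directory
    #   parent <hash of the parent commit>
    #   author/committer/message follow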


Great article but:

"one of my favorite analogies is to think of commits as having a wave/partical duality.."

is a hilariously misguided object to build an analogy from. Theoretical physicist checking in, and my community has been searching for about 100 years for an analogy to explain that shit, so it's hilarious to see someone try to use it as a concrete object people can use as a touchstone to better understand a purely classical database.


Read it as "think of commits as <unintelligible bullshit you have to take on faith because nobody really understands it>"


I convinced myself that commits are snapshots by doing the following:

    # generate a ~100MB text file (-b is BSD/macOS base64; use -w 76 on GNU coreutils)
    base64 -b 76 /dev/urandom | head -c 100000000 > file.txt
    git add . && git commit -m "1"
    
    # remove first line and add a new line to bottom
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "2"
    
    # repeat
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "3"
    ...

    du -sh . # a very big folder
Each of the commits is almost 80MB big in the .git folder. If you run `du -h .` you can see how git stores each object individually (~80MB each).
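
A follow-up worth trying: those loose objects are full snapshots, but git's packer delta-compresses them, which is the storage optimization people usually have in mind:

    git gc        # repack loose objects into a delta-compressed packfile
    du -sh .git   # typically far smaller than the sum of the loose objects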


The fact that in the implementation “commit” means one thing does not mean that people need to / should use “commit” to mean the same thing, nor that it is necessarily helpful to do so. In any case a commit is more than a snapshot because it has a parent, thus a diff is a sensible mental model for the pair.


I think this is incorrect, no?

Can’t all commits be turned into patches? Thus, aren’t commits isomorphic to diffs?


It's technically correct. The key thing here is that it's isomorphic. You can either have a system of commits in which diffs are computed between commits, or a system of diffs in which commits are computed by applying diffs. The trade-offs are in the performance of various operations, not in the user-exposed semantics. Git chooses to have its first-class object be commits, and diffs are computed on the fly. So again it's technically correct, but in practice commands like cherry-pick, which treat a commit like a diff between that commit and its parent, really blur the line. I think in reality you can be a really advanced git user and not even realize that there's a difference between a commit-based and a diff-based version control system, because in practice there really isn't much of a difference.


Yes-ish. However, there's also the question of what operations are efficient. (Diffing feels very performant in git, but) maybe having the diffs as the first-class objects enables doing something efficiently that git doesn't do (perhaps identifying when identical patches have occurred in different portions of the history?).

I've used several patch-based VCs (RCS and CVS) but I think they pre-date this "sound theory of patches" and instead the use of patch-style representation was for optimizing storage. (just as git uses packs and deltas to optimize storage and performance, f'rinstance) So I don't really know what I'm missing.

(If the sound theory of patches would let me better understand what occurred at a merge commit than git's tooling, that'd be just about enough to sell me on switching. except for the network effects of git & github.)


Consider a small commit with a spelling error. If I turn this into a patch and apply it to another branch, it will be a different commit even though it is the same patch.

As such, the concept of a "commit" in Git refers to a complete state of everything; a snapshot.


If you inspect the files in the .git directory, you’ll find commits stored as trees of directory and file objects. But it is true that they can be converted to diffs on the fly, which is exactly what the git show command does.


> aren’t commits isomorphic to diffs?

Nearly, though renames are only approximately extracted from the snapshots.


If your VC isn't plumbed all the way into your editor, you can't tell changing a typo from deleting and re-typing the whole file when it comes time to create a delta.


There is a `git mv` command that means "rename". Git could (even with its current data model) explicitly annotate commits with this intent, but doesn't. I don't know how useful that would be compared to the current heuristics, but it does mean that git commit "snapshots" are not (quite) isomorphic to a diff format (like posix diff's) that can explicitly encode renames.


That's true, but git doesn't natively have a way to refer to a single diff. You can use a hash to refer to a commit, but that depends on the entire history up to that point. If you rebase a commit then the hash changes, even if the new commit is semantically the same.


It kinda does: `git show <commit>` does exactly what one would expect if commits were actually diffs. That is, it shows the diff between <commit> and its parent.


My first tutorial was the Pro Git book, and this fact was stressed well there so it stuck. Thinking of commits as snapshots also has the small advantage of making the first commit less special.


Darcs users disagree.


That's a cool explanation.

I'm a bit slow on the uptake, so I had to re-read a couple of sections, but it was helpful.


this... seems so very flawed and disprovable to me. Ignoring the obvious storage issues that have been discussed: if commits were snapshots, you could rebase and reorder them without ever worrying about conflicts. In reality you very much DO have to worry about conflicts, because they are change instructions that transform a file from A->B->C; if you try to reorder it as A->C->B you're going to have serious issues (assuming these all touch the same code), because C is a transformation from the B state to the C state. It blows up attempting to convert A->C because the instructions in that transformation describe going from B->C.

> A commit is a snapshot in time. Each commit contains a pointer to its root tree,

it so... _so_ very much isn't. It's not even a snapshot in time of a section of a file.

It's a change instruction. No, it's not a "diff" but it also isn't a snapshot.


commits are snapshots. cherry-picking/rebasing diffs a commit and its (first?) parent, and applies the diff on a base commit to create a new commit.

if you `git replace` a single commit and change its contents, its children do not change their contents, so `git show`ing any direct child will show a new diff, not previously present, reverting the actions you've performed in `git replace`.


Simondon chuckles in allagmatic.



