Commits are snapshots not diffs (2020) (github.blog)
323 points by warpech on April 8, 2021 | 307 comments



From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.

Whereas if I had used git, and created 2 additional commits, each making a change to a small text file, my total storage size would be barely larger than 100MB. Describing the commits as a diff, as opposed to a snapshot, leads to a better intuitive understanding of why this would be the case.
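A quick way to see this for yourself (a sketch; the sizes in the comments are what I'd expect, not measured):

    git init demo && cd demo
    dd if=/dev/urandom of=big.bin bs=1M count=100    # ~100MB of incompressible data
    git add . && git commit -m "initial import"
    du -sh .git                                      # roughly 100MB: one full snapshot

    echo "tweak 1" >> notes.txt && git add . && git commit -m "tweak 1"
    echo "tweak 2" >> notes.txt && git add . && git commit -m "tweak 2"
    du -sh .git                                      # still roughly 100MB, nowhere near 300MB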

Not to mention other features the article discussed, such as cherry-picking. What does it even mean to "cherry-pick a snapshot"? In comparison, cherry-picking a diff and applying it to your current state is far more intuitive.

And let's not forget commit messages. If a commit is a snapshot, I would expect the commit-message to be descriptive of the entire snapshot. Whereas if a commit is a diff, I would expect the commit message to be descriptive of the diff. Which is exactly how most people use commit messages.

Obviously both "diffs" and "snapshots" are leaky abstractions. If you insist on using the "snapshot" abstraction, you will need to resolve all of the above points of confusion by adding more complexity to your abstraction. And if you prefer to use the "diff" abstraction, you will eventually need to explain that a commit is actually a combination of diffs, along with some other metadata like a pointer to a parent commit. As a teaching tool, you can make either abstraction work. But I find it far more intuitive and useful to think of commits as "diffs + some metadata".


Commits are snapshots.

How to represent those snapshots, and fix the storage bloat a naive implementation would cause, is a completely different problem.

One of the things that makes Git smart is that it doesn't try to optimize things prematurely. SVN and co. would store actual diff data, but this made some operations really hard to implement (and, in many cases, slow).

Git has commits conceptually as snapshots. It's up to the storage code to figure out how to deal with this.

> But I find it far more intuitive and useful to think of commits as "diffs + some metadata".

Except that this is not what's happening. I wouldn't even call it an abstraction; it's how things actually work. What you call abstractions are actually operations. If we run a diff we are interested in the changes, but if you ask git to show you the commit it will show you just that.

If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.


> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.

I find that thinking of commits as snapshots is not so useful. I prefer to think of them as a pair of parent commit and diff.

With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.
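In command terms that view corresponds roughly to this sketch (branch names are illustrative, and real rebase uses three-way merges rather than blind patch application, as discussed further down):

    git format-patch main..feature -o /tmp/patches   # export each commit as a diff
    git checkout -B feature main                     # point the branch at the new parent
    git am /tmp/patches/*.patch                      # re-apply the diffs on top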

It's not clear to me how thinking of commits as snapshots helps me to explain operations such as rebase.

I do concede, however, that "git cat" (I think that's the command) seems more closely related to a snapshot: you identify a commit and a file, and it will give you the content of that file at that commit. Clearly in this case the concept of a snapshot works well. But I need this very rarely.


> With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.

You can think of it that way if you want. But it's not what Git actually does.

Personally I much prefer to have my mental model match the actual reality of things.

You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.

What it does in actuality is find the snapshot of that SHA and change the working tree to match it.


If git did rebuild the graph right from the very first commit, the end result of the operation would look the same to the user as it does now.

It seems to me the two mental models are interchangeable when it comes to the use of git from the user's point of view. What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?

Now that I think about it, it's probably that users have a bad understanding of the commit-as-diff model; they could similarly have a bad understanding of the commit-as-snapshot model, I expect. I don't know that thinking in snapshots helps a user understand git better than thinking (properly) in diffs.

The article, for example, explains that any two commits can be differenced because the underlying snapshot trees can be compared. But the commit-as-diff model can as easily explain why comparing two commits works: trace each commit back to the common base commit. So the commit-as-diff mental model just needs to remember that commits are fundamentally tied to the path they have back to the root commit.

It seems to me if you take the diagrams from the article and remove the under-the-covers stuff leaving just the circles, the commits-as-diffs and commits-as-snapshots models look exactly the same.


Merge commits are a bit hard to understand from the perspective of "a commit is basically just a parent commit plus diff".

On the flip side, cherry-picking is hard to understand from the perspective of "a commit is basically just a snapshot, nothing more" (it's _also_ weird from the parent-commit-plus-diff perspective -- cherry-pick is kind of a weird operation, but useful enough that we keep it anyway despite it not fitting quite as cleanly into the git model as other operations).

Outside those edge cases, though, people with "snapshot" and "parent + diff" mental models will make basically identical predictions about what the results of various operations with git will be.


The solution is to think of it as "it's both, snapshot and parent + diff".

When you cherry-pick, git is using the parent+diff model to move the commit. When you do a merge commit, it's using the snapshot model.


> What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?

With the wrong mental model it's harder to predict what operations are expensive. If "git checkout <SHA>" truly did have to replay all diffs from the beginning of time, it would be a very expensive operation that is best avoided unless you absolutely need it. But in practice it is a very fast operation (one of the fastest) that there is no need to shy away from.


A fair point possibly, but given checkout/switching branches is probably just about the most common action when working with git repos, I'd hope people would notice that it's fast pretty quickly.


> You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.

Yes, this is true. I don't know why it never bothers me. Maybe it's because you could also store the diffs in the opposite direction (i.e. store the tip of each branch in the clear, then store diffs from each commit to its parent). Computing the inverse of a diff should be a quick operation. Usually, when you check out something, it's the tip of a branch or near the tip of a branch.

Anyway.

Of course I know that storing trees makes it easy to compute diffs. Computing diffs will become slower with larger trees. On the other hand, storing diffs makes it slow to compute trees, and the more commits we've got, the slower the tree computation goes.


> Computing diffs will become slower with larger trees

Not usually. Computing a diff is roughly O(n) with the size of the diff. This is because unchanged leaves of the tree can be seen as identical (because they are content-addressed) and are skipped. So to compute the diff you only need to recurse into changed directories.

So having a million files in the root directory where one has changed is very fast to diff, as you just diff that one file. The worst case is the diff happening in a very deeply nested directory with lots of files in each of the subdirectories, but even that is quite cheap as diffing a sorted directory listing is O(n) with the size of the listing.

(The actual worst case is diffing large files as most text diff algorithms are worse than O(n))
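You can watch the skipping happen with the tree-diff plumbing (the SHAs and path in the sample output are illustrative):

    git diff-tree -r HEAD~1 HEAD
    # only entries whose hashes differ are printed, e.g.:
    # :100644 100644 3ac0cfc... 88a80ff... M    src/main.c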


> If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff

Well, it would usually be more efficient to figure out where the currently checked-out branch differs from the branch being checked out, and then unapply and apply diffs as needed.


what about "git cherry-pick <commit>"?

with this command you don't import a snapshot, but only the diff between <commit>~..<commit>, so the parent+diff model makes sense to me
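A rough hand-rolled equivalent, as a sketch only (the real cherry-pick does a three-way merge, so it copes better when the surrounding context has moved):

    git diff <commit>~ <commit> | git apply --index   # apply and stage the same change
    git commit -C <commit>                            # reuse the original message and author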


Some commits cannot be cherry-picked. This is because there is no coherent diff for them, as with merge commits.


Rebase doesn't work that way, though [0]. It first extracts the 3 versions (2 leaves and their common ancestor) and then does a diff & patch.

This allows git to store the deltas between versions in the most efficient way on disk, while also letting it use contextual diffs to minimize the chance of spurious merge conflicts. Patching algorithms have various heuristics that make sense for programming languages, like special treatment for lines with only changes in whitespace.

(Edited to add:) also, minimal diff algorithms have to do a lot of work to detect large blocks of text being moved around. This is part of what made Subversion, which used the same diff algorithm for storage compression and merging, painfully slow.

[0] https://git-scm.com/book/en/v2/Git-Branching-Rebasing


Here is the paragraph that describes what rebase does:

> This operation works by going to the common ancestor of the two branches (the one you’re on and the one you’re rebasing onto), getting the diff introduced by each commit of the branch you’re on, saving those diffs to temporary files, resetting the current branch to the same commit as the branch you are rebasing onto, and finally applying each change in turn.

Is "applying the diff to a different parent" not a good way to describe this?


You're using the word 'diff' for 2 different things:

- an efficient way to store 2 very similar files

- the minimal set of changes made by a programmer to a file.

Subversion uses the same diff algorithm for these 2 functions, which is why people conflate them. But git uses different algorithms. The first (which it calls deltas) is optimized for speed and compression ratio. The second set of algorithms (you can choose from a few, some of which are better at identifying rearrangements of large blocks of text) is optimized for merging 2 programmers' changes without conflicts.


The way you try to apply a diff to a different parent is by doing a three-way merge... the vast majority of tools do this by taking three files as arguments and producing a fourth as output. The three-way merge is the underlying process which makes merge, rebase, cherry-pick, and revert work. They are all just "three-way merge, shuffle the arguments around, and adjust metadata".

The parent + diff storage is not isomorphic to snapshot storage. Snapshot storage reflects the actual usage of VCS tools... people make changes, and record the final state. Parent + diff does not do this, it records the changes, which requires creating a diff, and there are multiple ways to create a diff between two snapshots.

Git postpones the "which diff is correct" question until you actually care about the answer.
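Git exposes the primitive directly, if you want to play with it (file names here are illustrative):

    # three-way merge at the file level: folds the base->theirs changes into ours.c
    git merge-file ours.c base.c theirs.c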


> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.

I don't think those concepts are as distinct as you're painting them. At a user-visible level, commits are almost always visualized as diffs, which puts us in a place where, at the highest and lowest levels, they're defined as pretty close to diffs, while at an intermediate level they're defined closer to snapshots.

I honestly think they're neither; each expression method (diff vs. snapshot) can be translated pretty easily, and both are trying to represent the same end goal. It can be helpful to know that commits represent the full state of the codebase at a point in time, but that view can be at odds with merging and rebasing, which use actual change sets to calculate. When a commit is being manipulated, it's helpful to view it as a diff (and git does this), whereas when a commit is being read, we're using it as a snapshot.


Structure purist, ingredient rebel: A snapshot between two levels of diffs is a sandwich.


One way I like to think about this is that when you rebase a branch, the diffs are the same (barring any conflicts) but the commits are different. Just another reason commits aren't the same as diffs.


The diffs are often different, even without conflicts. Try comparing them some time, and look closely at the diff... look at the lines starting with @. People usually ignore those lines but "patch" does NOT.

This is not an irrelevant detail; it's the result of a three-way merge. The three-way merge can update those @ lines if it has a complete set of inputs (all three inputs). If you make a patch from one branch and then apply it to a different branch without using the three-way merge algorithm (stripping the diff of all its context), the patch may fail to apply even if the three-way merge would have succeeded without conflicts.


> If we run a diff we are interested in the changes, but if you ask git to show you the commit it will show you just that.

git show <commit SHA-1> will output a diff.
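The two views are one command apart (the SHA is a placeholder):

    git show abc1234         # porcelain: computes and prints a diff on the fly
    git cat-file -p abc1234  # plumbing: prints the stored object (tree pointer, parents, message)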


I think this is more a sign that git (porcelain) is not aligned with the underlying model.

It is actually a pity that so little effort went into the git UI. I find the OP's explanation of the git model awesome and the presented concepts beautiful, but the cli utility has countless naming and consistency problems, which makes me sad that hg didn't win over git. Life would be much simpler for many developers if it had, imho.


If commits are snapshots:

- say I have a repo with one file, a big 100MB CSV with millions of lines.

- I change one line in the CSV for one commit.

- repeat multiple times over many, many commits.

- how big will the repo be?


I convinced myself that commits are snapshots by doing the following:

    # generate a 100M text file (-b 76 wraps lines on BSD/macOS; GNU base64 uses -w 76)
    base64 -b 76 /dev/urandom | head -c 100000000 > file.txt
    git add . && git commit -m "1"
    
    # remove first line and add a new line to bottom
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "2"
    
    # repeat
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "3"
    ...

    du -sh . # a very big folder
Each of the commits is almost 80M big in the git folder. If you run `du -h .` you can see how git stores each object individually (each ~80M big).
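If you then let git pack those loose objects, the near-identical blobs delta against each other and the folder shrinks dramatically (sizes here are what I'd expect, not measured):

    git gc          # repack loose objects into a packfile with delta compression
    du -sh .        # now roughly the size of one snapshot plus tiny deltas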


> From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.

That's not what one would expect. Suppose I have a directory that is 100MB in size. If I take a snapshot of it ("btrfs subvolume snapshot"), my snapshot would be 100MB in size, but the storage needed for the original and the snapshot together would still be 100MB (plus a few kilobytes of overhead). If I take a second snapshot of it tomorrow ("btrfs subvolume snapshot" again), my second snapshot would also be 100MB in size, and my total storage needs would still be 100MB (plus a few kilobytes of overhead).

If I made a change to a small text file before each snapshot, my total storage size would still be barely larger than 100MB.

That is, when creating a snapshot, one would expect it to be copy-on-write. While not exactly what git does (it's content-addressable storage rather than copy-on-write storage), the end effect is similar enough for most purposes (the main difference being that undoing a change in git would not need extra storage, while copy-on-write storage would store a new copy of the contents).
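For the curious, here is the btrfs version of the experiment from the top of the thread (paths are hypothetical, and the directory must be a subvolume on a btrfs filesystem):

    btrfs subvolume snapshot /data/project /data/project-snap1   # instant; shares all blocks
    # ...edit a small text file in /data/project...
    btrfs subvolume snapshot /data/project /data/project-snap2   # only changed blocks cost space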


Clearly people are using two diametrically opposed definitions of snapshot.

If a snapshot is defined as opposed to a diff, then it's clear snapshot means "full copy". If I snapshot the state of my cloud server, it creates a full copy of its disk in block storage somewhere, and takes several minutes to complete.

You are describing snapshots that exist as part of a diff system or copy-on-write system, where they use virtually no storage at all, because further changes are assumed to be applied as diffs rather than overwriting previous data. Where the snapshot is a "marked" diff that can specifically be rewound to, as opposed to a general ongoing stream of diffs.

But that's a more advanced and system-specific definition of snapshot.

As a general mental model, when you say "think of it as a snapshot not a diff", I think it's clear that the former definition is being used, and that the expectation is a full copy that takes up disk space. Because otherwise, in the second case, all the snapshots are just the most recent diff (on top of the entire prior history), so the sentence "think of it as a snapshot not a diff" doesn't really mean anything. The snapshot and the diff are the same.


> If I snapshot the state of my cloud server, it creates a full copy of its disk in block storage somewhere, and takes several minutes to complete.

Which cloud provider are you using? Neither Amazon nor Google take snapshots this way. Amazon EBS and Google Persistent Disk both use copy-on-write semantics for snapshots. If you take a hundred snapshots of a 100 GB disk, your total usage is 100 GB plus metadata. When you run a VM instance from that disk, the storage usage will increase as blocks change, to a maximum of 200 GB total storage (for live disk + out of date snapshot).

When I use QEMU or VirtualBox at home, I also get copy-on-write snapshots of disks, although it's certainly possible to get a full copy if you want. I think the feature is pretty standard.


Digital Ocean. It absolutely takes snapshots by making a full copy:

https://docs.digitalocean.com/products/images/snapshots/

So this is a perfect example of what I mean by the word "snapshot" being used in two different ways by different people.

Snapshot meaning "full copy" is one usage (Digital Ocean), snapshot meaning "diff checkpoint" is another usage (Google, AWS).


Those aren't different definitions of "snapshot", though.


Of course they're different. They have different meanings, so they're different definitions.

It's not like it's the same concept with different hidden implementation details.

On Digital Ocean, I can delete the server but I still have the snapshot. On the others, you can't. One copies, the other bookmarks.

They're entirely different concepts, therefore different definitions.


That’s an incorrect notion of “definition”. The concept of a snapshot is that you make a copy of something at a moment in time. That’s one concept, one definition, one meaning. You may fight over the details of the definition or the implications, but at most it means that you need to revise the definition a little bit, not that you need to add a new sense to the word.


> That’s an incorrect notion of “definition”.

Nope, pretty sure different concepts means different definitions. Well -- or different "senses" if you want to be technical, but of course nearly everyone outside of dictionary editors uses "definition" to mean "sense".

> The concept of a snapshot is that you make a copy of something at a moment in time. That’s one concept, one definition, one meaning.

Except one of the two definitions isn't making a copy of anything. It's creating a new pointer to something that already exists, that's all. Zero copying. That's the entire point here.

Which is why it's two concepts, two definitions, two meanings.


Copy-on-write is an implementation detail that allows for lower storage. The snapshot is still the full copy. One could try to argue that the same is true for git in that diffs (or content addressable storage) are just an internal implementation detail, but as the parent pointed out that's not quite true--our commits document the diff, not the materialized snapshot.


> our commits document the diff, not the materialized snapshot

That's not actually true, though. This is what a raw commit object looks like:

  $ git cat-file commit bfc766d38e1fae5767d43845c15c79ac8fa6d6af
  tree 99768f8965d5382d1c1695c371a854d061f2548b
  parent 860a3b34854d8abe9af9f1eb584691de926ce897
  author Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100
  committer Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100
  
  Update version for v2.6.0 release
  
  Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
(This commit is from the QEMU project repository.) Note that there is no reference to a diff. There is a reference to a tree, which is a binary object representing a directory structure—a full snapshot of the state of the working tree as of that commit:

  $ git ls-tree 99768f8965d5382d1c1695c371a854d061f2548b             
  100644 blob 3ac0cfc6f0d60a95a5c3557497835c80e52a1696    .dir-locals.el
  100644 blob 37755ede83a0d5b4fe22114f624a140d2bcaefff    .exrc
  100644 blob 88a80ff4a5c552bad3bc2bf40b1fc4a45c57a177    .gitignore
  100644 blob 9da9ede26161bc5c6f12552b55bc54f55bb1e839    .gitmodules
  …
Each of those blobs is a reference to the full content of the corresponding file:

  $ git cat-file blob 3ac0cfc6f0d60a95a5c3557497835c80e52a1696  # .dir-locals.el
  ((c-mode . ((c-file-style . "stroustrup")
              (indent-tabs-mode . nil))))
For storage purposes there is deduplication, delta encoding, and compression going on behind the scenes, so committing a 100M working tree with a few small changes doesn't take up 100M of additional storage space, but these are invisible to the upper layer. When Git needs a diff, for example to perform a three-way merge, or in response to `git show` command, it generates one on the fly from the snapshots.


Apologies for the typo, I meant to say “...commit messages document...”


Copy on write filesystems describe changes as a structural diff, effectively.


That's not really true. The copy-on-write filesystems just allow multiple files to reference the same blocks, and only allow modifications to blocks if the refcount is 1. At least, at its simplest, that's how copy-on-write works. To copy a file, you copy the block references and increment the reference counts. You won't end up with a diff or deltas stored anywhere.


Sure it is. You just need to look at it a little differently.

Even in classical COW of memory pages in a Unix forked process, the set of pages mapped into the process with refcount 1 are a diff to all those with refcount > 1.

Virtual machine snapshots are more explicitly diff-oriented. Deltas to the base disk or snapshot are stored separately (that's your diff), and "deleting a snapshot" actually means remapping all the separately stored blocks and collecting the newly released blocks. There are two strategies snapshotting can follow: copy-then-write-in-place, or redirect-on-write. Either way, the set of copied or redirected blocks is a diff to the in-place blocks; just the polarity of the difference is switched.

See e.g. https://www.dell.com/community/Student-Discussions/Copy-on-w...

Things get more interesting with e.g. ZFS snapshots, where the whole filesystem, including metadata, is copy-on-write, and tree-structured to maximize sharing and permit atomic writes (how ZFS solves the RAID5 write hole). There, snapshots hold on to one of the old roots in the tree. The diff is implicit in the difference in tree structure; shared blocks are common, different blocks are different. It's super-easy to do a recursive comparison between such trees; extracting a diff is a sublinear-time operation because it can trivially skip over identical subtrees. It's a matter of perspective, when you're in the middle of a recursive tree compare, whether you think you're actually diffing data, or whether the data in one leg of the compare is simply telling you which subtrees are shared and which subtrees are different, and thus the data is a delta, or diff. You certainly don't need a complete traversal, which tells me that the data is doing most of the work.


> The diff is implicit in the difference in tree structure;

I can only interpret this as, "the data is not described as diffs." There's a meaningful difference here and I'm not being picky about it. To some extent, you can convert between a diff structure and shared structure, but that doesn't mean that the differences aren't meaningful.

Two structures may be isomorphic but they represent data in different ways and the operations have different algorithmic complexity.


I've learnt something new today, thanks for sharing. Looks like I had a naive understanding of how snapshotting actually works.

I still think that it's more intuitive to describe commits as diffs, in the context of things like cherry-picking a commit or rebasing/reordering a series of commits.

But given that you can also "check out" a commit, in order to get a specific snapshot of the repo, I can see the parallels between commits and snapshots. Maybe both analogies are equally useful in describing the different features that git provides.


The point of the article is not an analogy. Git is based on snapshots, and diffs are computed from snapshots as needed.

The snapshots are also de-duplicated and compressed, but that is not important.

The article is a good one. And if you spend the time to understand git it gets easier to use.


Internal implementations and external interfaces are not necessarily the same thing. When reading a single-threaded application's code, it is helpful to read it as a series of instructions, executing serially. In reality, both the compiler and your CPU are constantly reordering instructions, and executing them in parallel/out-of-order. However, this is all done while still preserving the illusion of serial-execution. Taking a beginner programmer down this rabbit-hole of implementation details, is going to be more harmful than helpful.

Thank you for the suggestion, but I already find git easy to use. And thinking of commits as diffs that can be cherry-picked, rebased and reordered, is something that helps me greatly in understanding it.


The correct way to think about snapshots and diffs when it comes to cherry-picking and rebasing is to realise that diffs are always derived from snapshots. I.e. the fundamental data-structure is the snapshot and from those we can build diffs. Those diffs are necessary to implement cherry-picking and rebasing, but it's also possible to imagine an implementation of git that lacks those features. It would still fundamentally work in the same way - it would just be slightly less useful.

Edit: If you think this is just splitting hairs, I encourage you to look at the differences between git and pijul which is a VCS where the fundamental building block is diffs: https://pijul.com/


> The correct way to think about snapshots and diffs when it comes to cherry-picking and rebasing is to realise that diffs are always derived from snapshots.

Ironically, git snapshots are themselves derived from diffs. Creating a snapshot without diffs would require making a full copy, which git most definitely does not do.

So would you rather think of cherry-picking as diffs derived from snapshots which are derived from diffs? Or as simply diffs? I find the latter better as a mental model.


It's helpful to understand git in terms of the "porcelain" and the "plumbing".

The git commands you know and love are largely the porcelain: nice fixtures over other things. When you "git cherry-pick", under the hood git queries that commit's parent(s), finds the diff the commit introduced relative to its parent(s), and then applies those same changes to the index and your working tree.

Cherry-pick is porcelain on top of the plumbing.

There are a few "write git yourself" tutorials out there, of which "Write yourself a Git!" is I think the most popular. In it, you'll learn how git really stores data, and you'll write a (fairly basic) git client that can do several things to locally manage a repository.

Write yourself a Git!: https://wyag.thb.lt/


>I still think that it's more intuitive to describe commits as diffs, in the context of things like cherry-picking a commit or rebasing/reordering a series of commits.

If I understood the article correctly, those things actually are implemented via diffs. It's just that the diffs are calculated on-the-fly, used to create a new snapshot, and then discarded.


You can still think of them as snapshots. Git just does compression on the entire folder of snapshots, including de-duplication of data that doesn't change between snapshots.

In fact, when I teach git to students, I don't even bother with the trees/blobs, which in my view are just an implementation detail. I just tell them to think of git zipping up their working directory together with some metadata (commit message, reference to parents), and putting that zip file into its own "compressed" storage inside the .git directory. That seems to be sufficient for a good mental model of how to work with git (independently of the git's somewhat baroque command line interface, which just takes getting used to)


This is the thing though. You're talking about snapshots which actually have duplication removed... in my mind this really fits more with the 'diff' model. I've already done the exploratory diving-into-git-internals thing years ago, so I could develop a better understanding of how things actually work.

But for newcomers who want to understand how git is working, it really makes more sense to tell them it's 'like a diff. Not exactly under the hood, but think of it like a diff for now'. This is what I've been telling people as I've mentored a number of people in getting acquainted with git over the years, and if they're curious enough to look under the hood, they'll get a better understanding of the internals.

As a programmer, what you're working with is essentially the diff. This is the easiest way to think about things initially. The fact that git is storing blobs under the hood, shallowly deduplicating blobs but still storing large chunks separately that may contain duplicate data, until it generates packfiles which do a deeper deduplication/compression, is really not that helpful. Telling people it's more like zipping is a bit disingenuous because it doesn't really explain how things are compressed more efficiently over the course of many changes.

If I have a 1MB code file and make 1000 commits of one-line changes then sure, git is initially storing large blobs representing those, but then will compress over the change set when it generates the packfile.

Compared to making a zip of the file for every change (say these are 100KB compressed) and now you have people thinking the 1000 one-line changes generate 100MB in the .git directory.

You may think that a 1MB file with many smaller changes is a fabricated example, but consider that dependency lockfiles (package-lock.json I'm looking at you) can easily grow to this size, and contain this many changes.


It may depend on the background of who you're talking to. Programmers may be very comfortable with diffs, but non-programmers (in my case, physics graduate students) usually aren't. On the other hand, everybody is familiar with snapshots: even high school students end up with "report_v1.docx", "report_v2.docx", etc., which are snapshots at the file level (and work reasonably well as long as you have a consistent scheme and don't need branches). I've also routinely seen less-technical people organize their research / paper writing by making a weekly snapshot of their work folder ("project-2020-04-1"). Telling these people that git basically does the same thing for them automatically with a tree-like "labeling scheme" that allows for branches tends to go over quite well, in my experience. For actual programmers, I'd be inclined to give them a more technical introduction to git's internals. I'd still point out that git stores compressed snapshots, not diffs (especially if they're older and may have previous SVN experience)


Those non-programmers are likely going to have a worse understanding of what is happening when you zip/compress something anyway, but I concede this is probably the most straightforward path if they have some understanding of what a zip is, and can't understand what a diff is. But even then I question if they should be using git, since `git diff`, `git show`, basically everything git exposes, is going to show them diffs.


A pure-diff storage would be impossible to recover if you get an error in any commit. It would also be much slower to examine the data, and newer version control systems do not use pure diffs.

The version control system Mercurial had a description of these problems on its homepage, "behind the scenes", which was good reading.

I am not sure if Git is the best solution, but at least a "pure snapshot" is okay, whereas a diff storage must in practice include some snapshot logic as well.


Diff based, but with snapshot "control frame" every N commits, like video?

Obviously joking though.


The "snapshots which are stored as deltas, if that works" part is unrelated to the diffs the git porcelain generates for you when you do a git-diff or git-show. The former is purely an implementation detail of the storage (albeit an important one), while the latter is entirely virtual, calculated from the snapshots every single time you view the data. That's why operations like git-diff and git-blame can take some time on large trees or histories (and why e.g. git-blame has various options to tweak how it tracks files across revisions, because that is not something git does), while git-log is fast.


Also (for less-technical audiences), I don't exactly dwell on the de-duplication. It's just "Git makes snapshots and puts them into .git in some efficient way. Don't worry about it. Or, if you want the details, read the Git SCM book."


Not really: if you do a checkout of a snapshot into an empty directory, you expect the entire state at the time of the snapshot, not just the diffs.


As a programmer I care about diffs only when I am comparing two versions. A commit creates a new version. "Snapshot" is a distraction.


The diff mental model doesn't work for things like `git checkout <commit>`.


I actually haven't had a problem with this, though perhaps it's because I understand what's happening at a deeper level. You're generally referencing commits which exist somewhere in this family of commits you can view with `git log --graph`. You can easily think of checkout as the path of diffs to get there. Files at commits are still whole objects, mentally, but the thing we care about as programmers working with multiple versions are the diffs.

I have had it break down a bit more when working with stash though, because now the object you're referencing can exist outside of that graph-like commit family.


So it only stores the difference between the two snapshots? ;)


No. If it chooses to compress the commits (which, if I remember correctly, it does not do automatically for each commit, but rather occasionally as a larger step), it uses the difference to whatever it deems to be a good candidate, if it finds one. E.g. if you have a file in commit A, change it massively in later commit B, and then on a different branch create commit C that also changes the file to one very similar to the one in B, git might very well compress C by storing the difference from B to C, despite those having no direct relationship in the commit graph. It can also choose not to use a delta to a different version at all, and this is 100% an internal implementation detail of the storage system in git (afaik one of those implementation details is that it prefers candidates that are in the same commit chain, but it doesn't have to - and it can easily jump multiple commits if that works better). If you ask git to show you a diff to the previous commit, it does not pull a diff from storage, but pulls two file versions from its storage backend (which will resolve any deltas used in storage) and diffs them.
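You can inspect those storage decisions yourself (the pack file name is a placeholder):

    # for delta'd entries, the last columns show the chain depth and the base object
    git verify-pack -v .git/objects/pack/pack-<hash>.idx | head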


No, it stores an entirely new set of references to objects, as well as some of those objects themselves (any that are not identical to previously stored objects).

You cannot look at a commit on its own and know exactly how it's different from the previous commit, but you do have the complete new state. You have to look at the parent commit's references and do an object-by-object comparison to identify exact changes. On the other hand, when you look at a diff, you can see exactly what has changed, but you cannot produce the version that came before without also having a complete copy of the current version.


The implementation-specific compression doesn't store deltas or diffs; it stores unique blocks of text.

Git allows for shallow clones, which would be impossible if the protocol or implementation were based solely around diffs.


I don't know that you need to teach them any of that. Version control is an abstraction. I have no clue what happens under the hood and I don't care.


To some extent, this is true. I don't feel the need to totally understand git's packing logic or the specific mechanics of the various diff/merge algorithms.

But some knowledge of how/why your tools work the way they do can be very helpful.

Some knowledge of a tool's internal workings can be fundamental to efficient use of that tool. At the very least it can allow you to understand or derive your useful interactions with that tool rather than simply memorize how it is used.


> Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size.

Not with `btrfs subvolume snapshot`, it won't. If that's not a snapshot, I don't know what is.

From a storage perspective, no dammit, Git commits are snapshots, look at the bits on disk if you don't believe it. This isn't something that people who like to write blog posts about Git made up for pedagogical purposes, it's how Git actually works.

As you point out, it's wonky for pedagogical purposes; what does it mean to "cherry-pick" a snapshot? When thinking about cherry-picking, yeah, a diff makes more sense than a snapshot. But saying a diff is better pedagogically doesn't change the fact that a commit actually is a snapshot (and when cherry-picking, git diffs two snapshots to create a patch, then applies that patch).


> From a storage perspective, no dammit, Git commits are snapshots, look at the bits on disk if you don't believe it

Except they're not. They're (often) packfiles, which are a delta encoding i.e. a diff. It's not necessarily the same as a specific commit, but appealing to "the bits on disk" is wrong.

It is certainly true that in the git object model each commit object refers to a tree that represents the complete state of the repository at that commit.

It is also true that many git commands implicitly treat a commit as being the diff between the state of the tree in that commit and the state in the parent. For example git show, git rebase and git cherry-pick.

It is simultaneously true that the on-disk storage system is optimised for performance and so doesn't map onto the object model in a trivial way.


> They're (often) packfiles, which are a delta encoding i.e. a diff. …appealing to "the bits on disk" is wrong.

That's fair. The diffs in a packfile have no relation to the "diff" that a commit would be if the commit were a diff; so it's wrong to use "but packfiles" when arguing that commits are diffs and not snapshots; but you're right, packfiles make my "bits on disk" argument not quite right.

The way I look at it is that packfiles are a compression mechanism; and they don't alter the fact that fundamentally it's snapshots that are being compressed. But that's not the only way of looking at it.


> It is also true that many git commands implicitly treat a commit as being the diff between the state of the tree in that commit and the state in the parent. For example git show, git rebase and git cherry-pick.

A commit is a snapshot, and you can compute the diff between a commit and any of its parents. If a commit has multiple parents, git cherry-pick bails out unless you pick a parent (usually -m 1), and git rebase, I think implicitly assumes the first parent.

(EDIT: a commit's tree, its parents' trees)


> If a commit has multiple parents, … git rebase, I think implicitly assumes the first parent.

`git rebase`'s behavior regarding merge commits is shockingly complicated, but much of the time: Because by default it linearizes the history, it actually just skips merge commits because it assumes that the merge has already happened implicitly by applying one of the merge's parents on top of the other parent.


> Obviously both "diffs" and "snapshots" are leaky abstractions.....

Joel Spolsky wrote many great things, but "all abstractions leak" was not one of them (edit: it is his, but it is not good). I am very tired of programmers excusing their poor imagination with appeals to this nonsense.

------

Commits store snapshots. Full stop.

The "bad mental model" is not commits being snapshots, but things behind stored individually, i.e.

> Sum |things| = |Product things|

This comes up in many other contexts, especially when storage quotas are involved and it's unclear what to do when storage is deduped across quotas.

-----

git packfiles do use a delta encoding, but it's important to understand that there isn't necessarily any correspondence between the history and the delta encoding. In fact, commands like `git repack` exist precisely to avoid path-dependency issues from the packs matching the history too much.

Saying commits are diffs to explain the delta-encoding storage characteristics is wrong and confuses, not clarifies.

------

> And let's not forget commit messages. If a commit is a snapshot, I would expect the commit-message to be descriptive of the entire snapshot. Whereas if a commit is a diff, I would expect the commit message to be descriptive of the diff. Which is exactly how most people use commit messages.

It's git tree objects that are snapshots; commit objects have a tree child and a previous-commit child, so it is natural for them to describe the relationship between two states without appealing to hypothetical alternatives.

> Not to mention other features the article discussed, such as cherry-picking. What does it even mean to "cherry-pick a snapshot"? In comparison, cherry-picking a diff and applying it to your current state is far more intuitive.

I might `git checkout somethingelse .` mid-rebase. What does that mean if commits are diffs? Nothing very clear. The better thing to teach people is about darcs and patch theory and those other models. I think the git model and the patch theory model both have uses, and the fact that git makes people always work in the git model is a fundamental issue that cannot be fixed with analogies.

- Patch theory is good for the things you are still working on

- A Merkle DAG of states is good for the things you've already done / agreed upon.


> All non-trivial abstractions, to some degree, are leaky.

You look a bit silly making grandiose comments that take one web search to disprove

https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...

> All non-trivial abstractions, to some degree, are leaky.


I'm fairly certain he was disagreeing with the content of the statement, not that Joel Spolsky wrote it.

ie, yes Spolsky said that, but he was wrong.


Yes, thanks


I think the emphasis was on "great". I.e. your parent wants to say that this thing Joel wrote was not great.


If your filesystem were copy-on-write and implemented snapshot semantics internally (like WAFL, for example, over 20 years old now), then the second snapshot would not take 100MB; it would just cost the metadata.

A commit is a snapshot of a tree with a reference to its prior ancestors. It's important to know that because it becomes extremely relevant when trying to do things like merges properly.


If you commit a 100MB file, change a few bytes in it, and commit it again, your .git/objects will almost certainly contain two 100MB objects. The fact that it is somewhat likely that running "git gc" or something similar will convert one of them into a reference to the other one plus some compact representation of the difference is an implementation detail.
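You can watch that implementation detail happen:

    git count-objects -v   # "count" = loose (full) objects, "size-pack" = packed storage
    git gc                 # repack: loose snapshots become deltas inside a packfile
    git count-objects -v   # the loose count drops and total size shrinks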

While the commit object does represent the snapshot, it also references the previous state; thus the commit message usually describes what changed between the referenced snapshot and the parent(s) that are also referenced from the commit object.

As for the overall model and the leakage between implementation details and how people use it, an interesting approach is used by SCCS/BitKeeper with its internal "weave" format, which essentially is both snapshot and diff at the same time.


Look up copy-on-write. ZFS and Btrfs do it.


After going through the "Git Internals"[0] docs, I found that the snapshot mental model has been much more helpful in understanding what my Git commands are doing, how someone's history got into a confusing state, etc. The primary model is that of the Merkle tree, and subsequently hashing, which are very simple and powerful concepts.

[0]: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Po...


I prefer to think of a repo as a whole as a tree, where the nodes are snapshots and the edges between nodes are diffs. This sort of lands us in both places


> From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.

That's not the way storage snapshots work under most (all?) storage-targeted file systems, filers, etc. What you're talking about there is a backup.

Snapshots are not backups. Snapshots work on "copy on write" basis.

Roughly speaking, when you take a snapshot you draw a line in the sand. "These were the files at this time". Snapshot operations as a result are super cheap and super fast. Future changes to those files results in the filer/file system writing the modified blocks to new locations, not overwriting the original data.

So take a 100MB directory. I create a snapshot. That results in almost no new storage usage, just a small amount of metadata. I write/modify 10MB of data; now the total storage cost is 110MB. If I take another snapshot after writing that 10MB, it's still only 110MB of storage usage.


If "diffs" and "snapshots" are leaky abstractions, that often enough lead you badly astray, then why insist on these abstractions in the first place?

Why not just teach people the mental model behind Git up-front? Objects form an immutable directed acyclic graph, human-readable names point at objects, there are some rules by which the graph is being extended and pruned, and by which names (references) are being updated to point at different objects.

This isn't a hard mental model, not for programmers (for whom the tool is intended in the first place). If you know how the most basic pointer-based data structures - a linked list, a tree, a directed graph - work, then learning the actual model isn't hard, immediately clarifies why Git does what it does. It should be taught to people up front.

A commit isn't a diff, and it isn't a snapshot. It's a bunch of objects Git creates for you, where the "commit" object points at previous commits and at a tree, built of "tree" and "blob" objects. When Git wants to know how to recreate your file structure, it starts at the "commit" object and walks the graph to discover what files and folders should exist. When you make a change and perform the "commit" action, Git creates a new "commit" object and a new "tree" object for it, and adds more objects to the graph to encode what changed, while reusing previously existing objects for things that did not change. The end state is, if you start at your new "commit" object and walk the graph, the resulting description of your file structure should be equal to what's on your hard drive when you made your commit.
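That walk is directly scriptable with the plumbing (the blob SHA is a placeholder):

    git cat-file -p HEAD               # commit object: tree hash, parent(s), author, message
    git cat-file -p 'HEAD^{tree}'      # tree object: mode, type, hash, name per entry
    git cat-file -p <blob-sha>         # blob object: the raw file contents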

Trying to paper over that with "friendly abstractions" is what makes Git difficult to understand.


> What does it even mean to "cherry-pick a snapshot"?

It means to do something like a three-way diff among three snapshots: the cherry-picked baseline, the target, and a common ancestor.

You can do something similar with the diff3 tool, which takes three files (snapshots) as input, not diffs.
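For example (file names are illustrative):

    # merge the ancestor->mine and ancestor->theirs changes into one output
    diff3 -m mine.c ancestor.c theirs.c > merged.c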


Depends on the diff. If the diff is not aligned by bits, a single bit offset might cause double the size, i.e. the full file deleted and a full file added.

>If you insist on using the "snapshot" abstraction

But it's not insisted on. Both abstractions are used as needed.


>… you will eventually need to explain that a commit is actually a combination of diffs, along with some other metadata like a pointer to a parent commit.

Only that this is completely wrong.

A commit is a snapshot of the tree. There are no diffs.

There is also no "metadata attached" — the commit is the actual data (!) describing the tree snapshot.

Git is a kind of simple content-addressable object store storing a kind of Merkle tree of objects. That would be a proper (abstract) description.


Why have so many people written long, thoughtful explanations about how the author is wrong to suggest snapshots are a better mental model, saying that all abstractions are leaky but that they find diffs a better mental model?

The entire article is literally about how commits are literally snapshots. I would say people didn't read TFA, but a lot of people are quoting lines from TFA and then going on to argue with/expand on them in a way that is directly contradicted by the next few lines.

I think it's because most of the people here have spent years working with git, and are so deeply attached to their understanding that they didn't hear most of what the article said.

(Some commentators have pointed out specific oversimplifications the author makes like glossing over pack files, I'm referring to the people who say a git blob is a diff when the entire point of TFA is that it isn't)


People are disagreeing with the author not necessarily because they didn't read the article, but because they don't agree about how things should be defined.

At the root, this is a disagreement about semantics and philosophy, not about git itself. I'm going to refer to Aristotle here: we think we have knowledge of a thing only when we have grasped its cause, and there are four general 'causes' [1]:

- The material cause: 'What is it made of?'

- The formal cause: 'What is the ideal of this thing?' , e.g. what's its abstract nature?

- The efficient cause: 'How did this thing come to be?'

- The final cause: 'What is its purpose?' How is it actually used? What role does it play in the world?

Here we can see that commits are used (at least in the git internals) as 'snapshots' — they refer to bytes, not changes in bytes. That's pretty close to the formal and efficient causes — the abstraction inside of git is closest to a snapshot, and that comes from the history of what Linus wanted when he wrote it.

But! The underlying storage uses deltas (which are diffs) to save space. That's the material cause.

But also, when we actually use commits, git often creates diffs for us as a convenience (cherry-picking, rebasing), and hides the fact that they're snapshots under the hood (final cause).

So there's an inherent tension between the different ways to answer 'what is a thing?'. For commits, this is especially bad, since there's an even split between 'causes'.

This tension never goes away because the most useful definition really depends on the context.

[1] https://plato.stanford.edu/entries/aristotle-causality/#FouC...


> But! The underlying storage uses deltas (which are diffs) to save space. That's the material cause.

This does not make the "commits are stored as diffs" story much more true:

1. This is only true of pack files, but pack files are only created once the repository exceeds a certain size.

2. Nothing about the pack file format requires that deltas follow the chronology of commits at all. The deltas could be stored in reverse order or even random order compared to the chain of commits.

3. The deltas in a pack file do not correspond to a change in a given commit, they are just the data to create a particular snapshot. If you find that a commit's file blob is stored in a pack file as a delta, that does not tell you anything about whether the file changed in that particular commit. You have to look at two commits and diff them to determine which files actually changed.

If a person wants to think about version control in an abstract way, then yes the two views (commits vs diffs) are somewhat interchangeable. If a person wants to understand what actually happens when you run a Git command, the answer to that question is less open to interpretation.


> The underlying storage uses deltas (which are diffs) to save space.

Not necessarily! The base git storage stores each object individually, not as deltas ("disk space is cheap"); it's only after a "git gc" that they are stored as deltas to other (potentially unrelated) objects. The original implementation of git didn't even have the delta storage (pack files), it was added later as an optional optimization.

So answering to "what it's made of?" with "deltas" comes with a huge caveat, that it's often partially or completely untrue.


This is exactly what I'm talking about. A person posts "this is literally how this works", and someone replies "philosophically I would prefer to think it works differently, therefore you're wrong".


Your distilled summary of this form of objection relying on wishful thinking made my day, thanks a lot!


The true zen of source control is that they are both.


> Why have so many people written long, thoughtful explanations about how the author is wrong to suggest snapshots are a better mental model, saying that all abstractions are leaky but that they find diffs a better mental model?

Probably because, to take their words at face value, they find diffs a better mental model? I think impugning "people [...] are so deeply attached to their understanding that they didn't hear most of what the article said" is a real bad faith reading, especially when you even acknowledge that central to people's arguments is "all mental models are leaky". This article may be technically correct about the way git internals are structured, but it makes cherry-pick and rebase more mentally complex for users to understand (you first have to go from commit => patch), not less.

Saying "Commits are collections of files + a parent commit, but you can diff it to generate a patch" and saying "Commits are a patch + a parent commit, and you can apply it to generate a collection of files" are isomorphic mental models—the fact that #1 is "correct" (for some value of correct that doesn't include the actual files stored on disk) is really besides the point.


My point is that people criticizing TFA's proposed mental model are missing the fact that TFA doesn't propose a mental model, it explains how things work. Both have value, but they're distinct.


I disagree. TFA is explaining the mental model Git uses to structure its codebase. If you're writing code for Git, this is obviously very useful to understand, but if you're just using it, this is only one of several mental models available to you. In this case, I think it's right to say that the distinction the author is attempting to draw is immaterial to those not working on the Git codebase.


If your code is written in a certain way that's a model, not a mental model.


Yes! It just seems so strange not to care about how things actually are in software. Is it a way of coping with the fact that so much software is so deeply layered and complex now?

Maybe I’m misremembering, but I feel like I didn’t see this usage of “mental model” much until fairly recently. The first I recall being surprised at was a discussion of a “mental model of Javascript” -- why would you need a mental model of something with a very detailed spec and multiple compatible implementations to study? If you want to understand how some aspect works, just look up how it actually does work.


Well, sometimes the API of a piece of software presents one model, while the implementation actually uses a different model underneath for various reasons.

In particular in Git, some commands expose the commits-as-diffs model (cherry-pick, rebase) while others present the commits-as-snapshots model (checkout). However, if you were to look at various layers of git code, the model is either commits-as-snapshots, or neither (compressed storage).

You could also theoretically change the entire implementation of git to store commits as diffs, and offer the exact same API as it does today (probably with differences in the way conflicts are resolved, and definitely with differences in performance).
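
For instance (a sketch, `<commit>` being any commit id):

    git show <commit>            # renders the commit as a diff against its parent
    git cherry-pick <commit>     # diff view: re-applies that change elsewhere
    git checkout <commit> -- .   # snapshot view: restores the full tree as of <commit>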


Unless you open up the spec every single time you make a prediction, you're using a mental model.

And the spec is probably not arranged for easy use.


It's necessary to approximate, but if someone tells you your approximation is wrong it makes no sense to say it's right because you prefer it that way.

If your mental model is that floats are real numbers and someone tells you they aren't, you don't go "I philosophically prefer to think they're reals, so you're wrong". You either update your mental model or decide you'd rather be a bit wrong than learn something (you perceive as) tedious.


Sure, you usually don't want your mental model to be wrong on purpose.

But that's different from not having one.

And sometimes a slightly wrong model has other benefits that will cause you to make fewer mistakes, so it's still a good tool.


Agreed, commits are snapshots, whether we like it or not. For obvious storage efficiency reasons, the implementation then diffs/packs/etc, but this is a different issue altogether.

I have found that I can't work with git with a different mental model (diffs). Every time things get messy, the diff model is not enough, whereas snapshots + commit graph + names/pointers make things natural.

Interestingly enough, when migrating people from svn to git, explaining the actual model makes the transition much smoother, so it would seem I'm not the only one.


I convinced myself that commits are snapshots by doing the following (in a fresh repository, i.e. after `git init`):

    # generate a ~100MB text file (-b is macOS base64; use -w 76 on GNU coreutils)
    base64 -b 76 /dev/urandom | head -c 100000000 > file.txt
    git add . && git commit -m "1"
    
    # remove the first line and append a new line at the bottom
    tail -n +2 file.txt > tmp && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "2"
    
    # repeat
    tail -n +2 file.txt > tmp && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "3"
    ...

    du -sh . # a very big folder
Each of the commits is almost 80MB in the .git folder; if you run `du -h .git/objects` you can see how git stores each object individually (~80MB apiece).


I’ve read the article before and it’s entirely unclear how it is supposed to be helpful. As the author acknowledges, things like cherry-pick show that one can think of commits as diffs whereas in the git implementation the object known as a commit is a snapshot of the state of a directory tree. Fine, but so what? Both times I’ve read this article my impression has been that the author is relatively new to git and processing some new information they’ve learned.


> Why have so many people written long thoughtful explanations about how the author is wrong to suggest snapshots are a better mental model, and that you think all abstractions are leaky, but you find diffs a better mental model?

Once you remember (learn?) that a commit can have N parents, it becomes apparent that it cannot be a single diff.


What does TFA mean in this context?


TFA = The F**ing Article


Thanks. I had a hunch. I'm familiar with "RTFM", but would probably get equally confused if "TFM" was used as a noun.


I suspect the chronology is something like RTFM -> TFM -> RTFA -> TFA, but the second and third might be switched. Dropping the R does introduce obscurity, but being able to convey the underlying sentiment (that while the content could/should have been consulted, it seems as though it was not) without a verb allows for a nonconfrontational syntax similar to passive voice, but even more so, and often without an obvious "weasel" effect, to boot!


Makes a lot of sense, thanks! Maybe it's also useful since the R is read explicitly as "read". Hence "they should instead be RTFM" sounds grammatically wrong. Breaking off the verb allows for a more natural read. It's funny how an abbreviation can carry more information than whatever it's short for.


And I believe this slang came from Slashdot, which was the Hacker News of the decade before Hacker News


Interesting, I didn't know it came from Slashdot! I spent quite a bit of time reading it in the early 2000s, and sometimes miss the subculture and not always subtle jokes (beowulf clusters... ). The moderation system encouraged jokes (there was a specific 'funny' tag), unlike HN which does not 'orient' things, and happens to be very serious.


It's an abbreviation referring to the article being discussed on a site like this.


The more polite phrase is The Featured Article. :)


The same context as RTFM


Can't satisfy everybody.


> I believe that Git becomes understandable if we peel back the curtain and look at how Git stores your repository data.

I agree, and like many, I have been saying that for years (nay, for more than a decade): and that's exactly the problem!

You don't need to understand how an internal combustion engine works to drive a car... You don't need to understand how your graphics card renders stuff to develop a web page... You don't need to know how a brushless motor works to use a drill...

There is a pattern there, and it's the one that makes sense.

I've read up on the internals of git a dozen times by now. But I only occasionally need to do something weird that makes me go back to it, so I usually forget the relevant bits.

The trouble is that I've used a distributed VCS that did not ask me to understand its internals, had a sane UI, and a good model (e.g. a tree-like commit history, so the top-level commit log would only have merges, but you could dive deeper into individual commits if you so pleased). It wasn't perfect, but it's hard for me to accept that we have gone with a subpar solution where every "tutorial" starts with how you need to understand the internals! But you also need to memorise them, dammit!

Just like I keep forgetting the Emacs rectangle editing shortcuts since I seldom use them, I'll keep forgetting the specifics of git internals that I might need once every 12 months.

And it's not me, it's _you_, git!


Sadly, the bad part is git's user interface. It hides the pretty parts underneath.

There is a concept of "the next commit" or, equivalently, "the pending commit". In the documentation, this gets called "indexed" or "cached" or "staged" --- three different names! And if you want to diff against it, you can't refer to it by name; you need to use an option, so it's `git diff --cached <other commit>`.
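
All three names even surface in different commands (`file` standing in for a path):

    git diff --cached            # "cached": diff the index against HEAD
    git ls-files --stage         # "stage": list what the index currently holds
    git update-index --add file  # "index": the plumbing name for the same thing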

I know git's internals, mostly because it lets me navigate its bad user interface.


There's a difference between implementation details and core abstractions you need to know to understand a device. That git stores data as an immutable DAG with a layer of pointers to attach human-readable names to things, that's not an implementation detail. That's a core abstraction.

It's not like knowing the internals of an ICE, it's like knowing a car moves using wheels, that these wheels must touch the ground for the car to be controllable, and some of them must rotate to change the direction of movement. Knowing such "car internals" isn't necessary for you to be able to turn the key and get it to roll - but it is necessary for safe driving. People who didn't master these "car internals" are the ones who speed on wet ground, don't understand safe braking distance, or why their car skids.


Commits being snapshots is not git internal. It is the high level abstraction.


But you have to know how an ICE works to resolve issues. Oil pressure? Engine temperature? Leaking radiator? Fuel pump? Dead battery? Wet spark plugs (on older models)? But there is more — air filter, oil filter, oil changes, mechanical parts — these are usually resolved by handing the car to professionals, and they have to know how the system works.

You have to understand the DOM to resolve web page issues. Understanding how the graphics card works would help in resolving WebGL issues.


We are not talking about people implementing git the tool. We are talking about people using git as a tool (even though they share the common job title of a software engineer/developer).

The OP leads in with how "git cherry-pick" and "git rebase" are hard to use and promises to clear it up with a deep dive. You know, how do I turn on the wipers in my car? Or the turn signals?

> You have to understand DOM to resolve web page issue. Understanding how graphic card works would help in resolving webgl issues.

As a developer working to build things with DOM, you have to understand the DOM APIs and model. You do not have to understand how DOM is _implemented_ in browsers today and how they achieve things you need when you call the APIs. Sure, there are gotchas that are useful to know ("this CSS selector takes O(n^2) time to match"), but they are the exception, not the rule.

Similar holds for WebGL: you need to know the APIs and how to use them effectively. Sure, it's good to know where some of the gotchas are ("this might re-render the whole thing on-screen introducing flicker, here's the off-screen version"), but it's not a blocker.

But, none of these require you to understand internal implementation details to effectively use the public APIs (which with git are CLI commands).

This is not to say that understanding the nitty-gritty details of anything is a bad thing: it is a GREAT thing, and will probably empower you to do ever more intricate things (and it is usually a very rewarding exercise to learn more about a tool you use)! But that's different from having to know the internals to do the most basic of things (which I'd argue "cherry-pick" and "rebase" are).


With the DOM, you have to know that it is a tree, or you'd be surprised that `append` changes the parent. You have to understand reflow, you have to understand stacking, lots and lots of things.

A lot of people use the DOM without understanding it, and a lot of people use git without understanding it. In both cases one requires some knowledge to resolve issues.

And git is trivial: it stores snapshots, nothing hard there. It is interesting, but does not help with `cherry-pick` and `rebase`. The only hard thing about git is recovery — git reflog — easily avoided with backup branches (`git branch foo`). Some kind of undo could be useful for beginners, to avoid the fear of screwing things up.
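
For example (a minimal sketch):

    git branch backup            # cheap insurance: a second name for the current tip
    git rebase -i main           # ...experiment freely...
    git reset --hard backup      # didn't like it? point the branch back
    git reflog                   # or dig the old tip out of the reflog instead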

Git has a different mindset from SVN — commits are cheap, branches are cheap, experiments produce new branches; cherry-pick them, rebase them, etc, etc, etc.


This blog post is the most compelling argument I've yet seen for pijul.

Git should work the way we think it does! It's confusing that snapshots are being converted into a few different forms of change object, which can be reconciled with merges or rebases or applying patches.

Pijul (and darcs before it) actually works on the basis of patches, pijul with a robust theory of patches. A cherry-pick just moves a patch from one history-of-patches (branch) to another history-of-patches. One can share just a patch, and applying it is guaranteed to be the same action everywhere if that's possible, which it often is.

I'm patiently waiting for pijul to be mature enough that I can move everything over to using it, it's one of the more exciting projects in the last ten years.


I have visited the pijul site two or three times; every time I would start reading about a "sound mathematical theory", get bored, and close the tab. To this date I still don't know what pijul is trying to do and why I should be interested in it.

They really should improve their documentation (hint, in case someone reads this: nobody except a few geeks gives a shit about sound mathematical models. Show me how pijul makes my life easier compared to git, that's all I need)


Have you ever rebased a long chain of git commits onto a new branch, where one of the first of those commits has a conflict with the new base, and after resolving this conflict for that commit you get the same conflict over and over again for all the subsequent commits, even if they did not modify that place in the code, and you need to manually resolve it again and again?

Pijul will, as I understand it, save us from those unnecessary repeated "conflicts".

See also the answer by @chriswarbo about removing undesired changes from history



No, I did not know that; I repeated everything manually. Will try that next time, thank you.

BTW, pijul docs mention rerere as helping "in some cases":

> This is why in these systems, conflicts are often painful, as there is no real way to solve a conflict once and for all (for example, Git has the rerere command to try and simulate that in some cases).

https://pijul.org/manual/why_pijul.html
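
For reference, rerere is opt-in; enabling it globally is one line:

    # record conflict resolutions and replay them when the same conflict reappears
    git config --global rerere.enabled true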


I feel their example (with the ABGX) just makes me think "merging can silently result in weirdness, and git and pijul just do it differently" - it doesn't really argue that one is better than the other.

(Most people probably use git as an effectively infinite string of zip files anyway. https://xkcd.com/1597/ )


To the contrary, that example actually gives a solid argument for three points where Pijul is better than Git:

1. The order between lines is preserved by Pijul. This is important: let's say Alice works at the beginning of the file (lines 1-10 of 1000), whereas Bob works at the end (lines 990-1000 of 1000). Pijul preserves the order in all cases, whereas Git might randomly decide, based on the contents of the lines, to merge Bob's new lines in the middle of Alice's new lines.

2. Git solves an optimisation problem (3-way merge) that may have multiple solutions. Unfortunately, there is no way to count the number of solutions, or even to tell whether there are multiple solutions, in a reasonable amount of time. Git therefore picks one solution silently, based on the contents of the lines. In contrast to that, Pijul is deterministic.

3. Pijul is associative, meaning that merging A with (B then C) is the same as first merging A with B, and then merging C; in other words, you can merge a branch commit by commit. Git doesn't have that property: if you merge a branch, you MUST (1) stop working on it, or else the future merges become totally unpredictable, and artificial conflicts might come back (yes, I know about dirty-hacks-to-try-and-fix-that-when-they-work such as rerere), and (2) check the result of the merge extremely carefully, because in addition to the logical errors that merges can reasonably introduce, Git might also introduce extra unpredictable errors by randomly shuffling lines around.


That makes it clearer - the A B C explanation is too simplistic.


I feel like this is missing something about the drawbacks? Or are there truly no drawbacks beyond disk usage for the cache, and folks should just enable it once they're aware it exists?


I guess it involves a bit of assumption and guesswork to automatically replay your previous resolutions onto files that may in turn have changed. It probably slightly increases the chances of Git doing something you didn't expect and not telling you about it. Hence why it isn't the default. Maybe?


Could one not add a new rebasing strategy to git by generating patches from the git history? Are the concepts non-translatable?


I think rebase already works by generating patches, but for some reason the repeated conflicts happen...


There's zero theory on the following page:

https://pijul.org/manual/why_pijul.html

There are many answers:

1. Commutation makes your life easier because you can be much less careful about how you manage your branches. Rebase, merge and commit are the same operation (apply a patch), without any loss of power: you can simulate all of Git within Pijul, except for bad (i.e. silent non-associative) merges, which Pijul doesn't have.

2. Everything is easily reversible. I know all actions in Git are reversible in some way, but not in the same way: for example, you can't undo an old patch without changing the identity of all the patches after it. I know you're thinking this is important for strong version identifiers, but Pijul also has strong version identifiers, just without the compromise on usability. This is achieved using cool cryptography tricks.

3. Solving a conflict in Pijul actually solves it. Conflicts happen between two (or more) patches, and are solved by a patch: if the same two patches are on another branch, you are guaranteed to get the exact same conflict in 100% of cases, and that conflict is solved by the very same patch that solved it in the first branch.

4. When merging, Git solves an optimisation problem that may have multiple solutions in some cases. Git chooses one arbitrary solution based on the content of the lines, and doesn't warn you if there are others (because that would be a very hard computational problem). Pijul doesn't do that, and gives you strong guarantees on merges. You still have to test, but when reviewing, you can predict in your head, with 100% accuracy, how Pijul will merge. This isn't the case in Git: lines inserted at the end of a file might be merged into unrelated lines at the beginning of a file sometimes, if Git feels like it.


> Git should work the way we think it does!

Hold on, who is "we"? Personally speaking, git works the way I think it does. Granted, I've written my own (simple) libgit2 frontend, so I understand the git internals fairly well, on a high level at least

I haven't looked into pijul, but why is teaching people a new tool more helpful than teaching people how the tool they already use works? (Like the OP blog post does.)

Am I blinded by the knowledge I gained from writing my little tool and learning about git internals? I get that a tool you need to learn the internals of to use is probably a bad tool, but is asking git users to understand the contents of the OP blog post really too much? Maybe I'm just a git fanboy...


>> Git should work the way we think it does!

> Hold on, who is "we"?

I'm not the GP, but I agree that git should work the way "we" think it does, and I think a reasonable definition of "we" in the context of Git Users is probably SaaS/Startup/SMB software engineers.

Git is popular enough to have many thousands of different use cases, but I would speculate that the distribution of use cases probably follows the distribution of public Github/Gitlab repos pretty closely.

> Personally speaking, git works the way I think it does. Granted, I've written my own (simple) libgit2 frontend, ...snip...

> Am I blinded by the knowledge I gained from writing my little tool and learning about git internals?

Yes.

> I get that a tool you need to learn the internals of to use is probably a bad tool, but is asking git users to understand the contents of the OP blog post really too much?

Yes. Or rather, knowing git's internals is incredibly helpful if you've already decided to use git and now you're deciding what workflow to use to develop software, because you can match your mental model of how to use git to the way git naturally wants to represent your stored work.

However, if you come to git with an existing mental model of software development, and that existing mental model includes the idea of "branches" or "diffs" or "immutable history", then you're going to quickly and repeatedly run into stumbling blocks as your mental model doesn't match git's internal model. Git can do branches and diffs and immutable history, of course, but they're a leaky abstraction on top of the concepts git really cares about.

> Maybe I'm just a git fanboy...

Sure, nothing wrong with that!


> Git should work the way we think it does!

I think it works using snapshots... or are you saying that Git should work the way that you think it does, and not how I think it does?

It's clear that Git is not the final evolution of version control systems, that we are just currently in the "Git era" and at some point we're going to be in the "post-Git era" of VCS. It's unclear what that looks like, but I am skeptical when I hear these claims about Pijul.

> One can share just a patch, and applying it is guaranteed to be the same action everywhere if that's possible, which it often is.

My understanding is that you need to define a very weak version of "same version everywhere" which is useless. With Git, you can merge and get no conflicts, but that is no guarantee that the patch applied successfully... it just means that the merge operation didn't run into any obstacles. It's not just the patch that needs to be vetted by humans, it's the state which must also be vetted, and that's one of the problems that Git solves well.


I don't view git as a series of diffs. I view it as a logical extension of my file system to include a time dimension (or in fewer words, as snapshots).

It replaces file-v1, file-v2, file-v2-with-changes-from-Alex, etc, that you commonly find on the hard drives of people not familiar with version control. That it can generate meaningful diffs is a product of the type of data we're storing.


> I'm patiently waiting for pijul to be mature enough that I can move everything over to using it

Pijul is super slow. I've tried it a couple of times, and it is too slow to be usable.


It has made huge progress lately, which was always the plan: the new data representation introduced in November 2020 made it scale to huge repositories (this was impossible before, because of disk space and speed), and then we also changed the backend recently (https://pijul.org/posts/2021-02-06-rethinking-sanakirja/).


I'll check this again!


Can I make a shallow clone in Pijul?


I'm one of the authors. The concepts are different; we do have a concept called "tags" (not ready yet) which is more efficient than shallow clones, in the sense that you can make patches (commits) on top of a shallow clone without any downside (unlike in Git, where shallow repositories can become slower).

You can also do partial clones in Pijul: since patches commute, the patches you produce on top of a partial clone can be pushed to the full repository in the exact same way.


I strongly disagree.

Snapshots are a useful concept for programming. Each snapshot represents a compilable program with a certain set of features. So snapshot A has a certain set of features and B has another.

Diffs are not a useful concept. Does the diff between A and B represent the new features in B? No. Because if it did, it would mean I could take any other compilable snapshot C and apply the diff of A and B to it; then I should end up with a snapshot D that is compilable and has all the features of C plus the new features in B. And that doesn't work with any programming language I know.

It doesn't even work with the most trivial features.

Diffs may be a useful concept when working with some data formats. But for programming languages, snapshots are the right concept.


I suggest you read more about Pijul and Darcs, because you seem to be confusing Pijul patches with the output of `diff`.

Patches are much easier to work with, more reliable and fully deterministic. For example, merge and rebase are the same operation in Pijul, you can remove an old patch without changing the identity of all the patches after it, and yet have strong version identifiers, with the exact benefits you describe for snapshots.


No idea what Pijul is, but how does this not describe git?

Unless your complaint is that a commit is really a set of diffs/patches?


Pijul (and Darcs) operate on sets of patches. As a simple example, git commits (other than the root) have at least one 'parent', which imposes an order. E.g. let's say I edit file X in commit x and file Y in commit y; if I want both of those changes, git forces me to apply them in a particular order, e.g. [x, y]. If someone else applied those same two commits in a different order, they'll get a different commit ID, which may cause problems e.g. when trying to merge their changes with ours.

If we treat x and y as (sets of) patches instead, then the set {x, y} is the same as the set {y, x}; the order doesn't matter (we say those patches commute).

The idea of commuting patches is really useful, since we can rearrange patches to a more convenient form. For example, if we commit something we shouldn't (like a password, or a huge binary), then later remove it, a system like git makes it hard to remove that file from the history. If we're dealing with sets of patches, we can simply swap them around until the 'add file' and 'remove file' patches are next to each other, then merge those two patches. Voila, the file no longer appears, the rest of the history remains intact, the branch's content is guaranteed to remain unchanged (since we only swapped commuting patches, which doesn't change anything; and merged two patches, which doesn't change anything).
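
The git side of this is easy to demonstrate (a sketch; x and y standing in for real commit ids):

    # two independent changes x and y, applied in different orders on two clones
    git cherry-pick x y   # clone 1: new commits x', y'
    git cherry-pick y x   # clone 2: new commits y'', x''
    # the final trees match, but all four commit ids differ, because each
    # id hashes in its parent: the order is baked into the history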


Patches are not commutative in general though, so surely Pijul must have some history/ordering mechanism?


Pijul can detect when a patch commutes and when it does not. From there it can construct a dependency graph of commits and use that information in various interesting and useful ways. Back when I used Darcs I would somewhat frequently do things like pull these three commits and their descendants as a sort of pseudo-branch that would include everything necessary for a specific line of work but leave everything else behind.

To the point of the article: when commits are diffs, you sort of intuitively think such things should be possible. But because in git commits are snapshots, it's not as easy as you would expect it to be.


Yes, patches aren't commutative in general. That is precisely the reason Pijul can help us in these situations, and guarantee the result will be the same: it will either work without issue, or it will abort due to non-commutative patches (it's similar to a type checker, which either guarantees that functions won't be called with the wrong type of argument, or aborts because the types don't match up)

From my understanding, the 'history/ordering mechanism' in Pijul is composition of patches. In general, the patch 'patch1 ∘ patch2' can be different from the patch 'patch2 ∘ patch1'; when they just-so-happen to be the same, we say that patch1 and patch2 commute.


> In general, the patch 'patch1 ∘ patch2' can be different from the patch 'patch2 ∘ patch1'.

That isn't true. In Pijul, either patch1 explicitly depends on patch2, or patch2 explicitly depends on patch1, or else these two things you said are equal.


People using git think that commits are patches. But that isn't how git works. Git sometimes tries to let you treat a commit like the diff between it and its parent, and lets you try to rewrite history, but these operations really make new commits with new ids, and this confuses people.

In pijul, the objects you interact with actually are diffs (aka patches), and then snapshots are well-formed sets of patches. Here, well-formed means that if a patch is in the set then so are its dependencies (these dependencies aren't like parent commits in git; they're more like "you need to add line 3 before you can delete it"). So removing or modifying a patch in a branch isn't a horrific interactive rebase operation anymore.

When you move a patch in pijul it doesn’t affect any of the patches written before or after it (unless they depend on it). When you “move a patch” in git you rewrite the history and create new commits, so if I was talking about a commit (id) before the move, I would be talking about some dangling commit after the move and would need to update my id to the corresponding new post-move commit.


I think merge commits are key to why "snapshots" are a better model than "diffs", and a stronger argument would emphasize this more.

Like people have said, the two models:

- a commit is a snapshot plus a pointer to a parent commit

- a commit is a pointer to a parent commit plus a diff

are sort of isomorphic. And some commands in the git porcelain (like git cherry-pick, or git rebase) indeed make more sense if you think of commits as diffs.

But this isomorphism becomes really strained when you have commits with more than one parent (or even zero parents). (And I think it's telling that those commands don't play very nicely with merge commits or the root commit.)

If you really want to incorporate merge commits and the root commit, the alternatives become:

- a commit is a snapshot, together with a list of zero or more pointers to parent commits

- a commit is a list of M >= 0 pointers to parent commits, together with N > 0 diffs, subject to the invariant that:

a) M = N, except that for exactly one commit, which we will call the "root", we are allowed to have M = 0 but N = 1

b) starting from any commit, if you traverse a path back to the root commit by following parent pointers, and then sequentially (in reverse order) apply, for each commit in the path, the diff that corresponds to the parent pointer chosen, then the result of composing all those diffs is independent of the path chosen.

And when you put it like that, it's pretty clear that the "diffs" model is really impractical, and that's why it's a lot better to think of commits as snapshots.
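
The multi-parent case is easy to see directly: a merge commit stores one tree and several parent pointers, and nothing that singles out any particular diff (a sketch; output abridged, hashes illustrative):

    git cat-file -p HEAD
    # tree 4b825dc...
    # parent a94a8fe...
    # parent de9f2c7...   <- a second parent: there is no single "the diff" here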


It's nice to understand this, but I fail to see it helping much in practice. Sure, you'll know why the thing you want to do is hard for git to do, but that won't make it much easier.

And without knowing even further implementation details, it's a bad idea to rely on this knowledge. For example, the article states that committing a rename separately from edits to the renamed files helps git track the renames. But that's not obviously true from the discussion above, because it's not obvious whether, when computing a diff between two commits, git will follow the entire history or just apply the diff algorithm to the two commits.

If it were the latter, then it wouldn't really matter which order you commit things in; git would simply see commit1: fileA, fileB with contents cA and cB; commit2: fileD, fileE with contents cD and cE, and would do the quadratic work anyway, even if commit1.5 had fileE, fileD with contents cA, cB.


It strikes me as bizarre that something as old and as important as git is to the general version control problem, doesn't have a beautiful, complete and helpful user interface.

With the status quo how it is, I definitely love articles like this because every time I use git I get a kind of anxiety that fades only in proportion to the depth with which I understand actual git mechanics.

The thing I find strange is that when I interact with databases that have beautiful, helpful user interfaces, I have almost none of this anxiety, and just kind of accept "black box that handles things", and move on with my life.

I figure I must not be alone in this psychological niche. Which again, makes it bizarre that the problem of giving git a beautiful, complete, helpful front end has not been solved.


I guess I'll be the one to make the obligatory "magit is awesome and if you use Emacs you should definitely check it out" comment.

Other than being horribly slow on Windows, I can't think of any downsides. Aside from the very rare black-magic incantations, it does everything I've needed from a Git frontend.

If something like it existed for SVN ($JOB VCS of choice, sadly) I would abandon Tortoise in a heartbeat. IntelliJ is nice, but the overhead of the VCS add-ons kills my startup time.


> It strikes me as bizarre that something as old and as important as git is to the general version control problem, doesn't have a beautiful, complete and helpful user interface.

It has several.

Tower is a wonderful interface on macOS, Sublime Merge too.

Github is another; Gitlab is also very good. Gogs is a free-as-in-beer option too.

There are several. None has dominated the market, though.


Appreciate this and the other similar comments! I only really knew about Github Desktop, and didn't really like it, but I'll give the others a whirl:)


I use and recommend Fork.


I can't agree with you more. git commands are definitely not designed for the current mainstream usage (i.e. with services like GitHub/GitLab). Simple things like forking a repo from another user and editing it locally require >10 non-straightforward steps, which is far from ideal.


There are so many tools to help with this, though. If you want to work with Github, there is an official Github CLI tool that makes forking easy peasy. Gitlab doesn't have an official one AFAIK, but there are unofficial ones. And if you want a GUI, there's a myriad of those as well. I don't understand this complaint at all.


I like SourceTree from Atlassian, I dip into the command-line from time to time but it meets many needs.

Only problem is, no Linux version, only macOS and Windows. But that's now solved with WSL2 ... code in Linux/Docker/PyCharm etc on Windows WSL2, SourceTree on Windows.


Would you like to talk about our lord and saviour IntelliJ?


I think it's a tragedy that just about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

A lot of people are scared of rebase and cherrypick and shut down or get defensive when you mention them or try to encourage their use.

The result is, because developers only have a hammer, they brute-force merge everything, which results in grotesque conflict resolutions and commit histories and makes it hard to untangle problems.

At a previous job, another developer was kind enough to walk through rebasing on the command line with vim. I was receptive and in about 10 minutes, I realized there was a significant set of standard features and day to day Git use I was previously just oblivious to.

These days, the UI for rebasing and cherry-picking in Gitkraken is state of the art and effortless, and I use them every day without hesitation and without the fear that comes from not understanding or knowing what I'm doing. Still, I constantly struggle with coworkers merging feature branches from 100 commits ago into new feature branches and brute-force resolving conflicts across half a dozen files in one commit without any context.

I see it all because I have visibility into the history and branch relationships, but I still get shrugs and eye rolls when I bring it up. I don't necessarily want to dictate nitpicky git usage, but I have a hard time accepting it when people just refuse to learn how rebasing and cherry-picking work when they're both core basic features of a tool we all use every day. Proper Git use is one of those hills I'll die on, though, so I don't intend to shut up about it any time soon :)

Edit: My practical advice: If you use git every day and you don't know how to rebase, reset, cherry-pick, and stash from the command line, make it a goal. Then, once you're comfortable, learn how to do it in a visual tool like Gitkraken and make an effort to incorporate them into your daily workflow. My guess is things will become a lot less tedious and confusing when things get messy.
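
For anyone taking that up, a minimal starter set (a sketch; branch and commit names are placeholders):

    git rebase -i origin/main    # replay and reorder your local commits onto main
    git reset --soft HEAD~1      # undo the last commit, keeping its changes staged
    git cherry-pick <sha>        # copy a single commit onto the current branch
    git stash                    # shelve uncommitted work...
    git stash pop                # ...and bring it back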


I don't have a solution and maybe the problem is just not solvable but ...

The tragedy is that git is so hard to learn. Start a github project (I know github is not git). Take a PR, have the PR have a conflict; now try to explain to the new user how they can fix their PR via git to not conflict. You'll be stuck giving them a giant lesson, probably an hour to write the instructions, then several back-and-forths.

Mostly, either they already know git and fix it themselves OR I give up and merge it by hand myself since it's easier than becoming a git teacher for them.


> Take a PR, have the PR have a conflict, now, try to explain to the new user how they can fix their PR via git to not conflict.

Is this a Git problem? I recall entire workdays being wasted on SVN and CVS back in the day with multiple people trying to make sense of a merge.

In Git this is actually easier to do (and easier to do repeatedly, with git rerere and similar).


It's a problem, and the place to fix it is in git. That makes it a git problem. Just because things were worse back in the day doesn't mean we can't have nice things.


What puzzles me is how resistant many developers are to using or even considering a Git GUI. I prefer SmartGit, but GitKraken is nice too.

People tell me, "I'm so much more productive on the command line" and then it turns out all they know is pull/commit/push and using a local branch. Anything outside that brings terror: "I never use rebase. What if something goes wrong in the rebase? Now I've lost all my work and I have to pull a fresh copy of the repo from scratch."

Yes, I have heard exactly that.

One thing I love about SmartGit is how it unifies features that the Git command line presents as separate and unrelated concepts. The reflog? Click the Recyclable Commits checkbox and now all of your reflog commits show up as ordinary commits just like any other.

Stashes? Same thing. Turn on the checkbox to make a stash or all stashes visible and now they show up as ordinary commits, which is all they are under the hood.

Want a diff between two commits, whether they be normal commits or stash or reflog commits? Click one commit, ctrl+click the other, and you instantly see the differences between the two. No need to check out a reflog commit temporarily just to have a look at it.

Yet I have only had a 5-10% success rate in getting anyone to take a look at any Git GUI, much less use one. I would be really interested in understanding why so many developers are reluctant to do anything other than the Git command line.


> SmartGit […] GitKraken

> I would be really interested in understanding why so many developers are reluctant to doing anything other than the Git command line.

In the spirit of curiosity, I downloaded the two packages. SmartGit shows me a document and outright threatens me to "deactivate". https://www.syntevo.com/blog/?p=4148 This is a euphemism for other people coming to my computer and deciding what I am able to do or not. I declined here. GitKraken shows a log-in screen before letting me use the software properly. It does not refer to the account I already have on my operating system and which is entirely under my control, but some other account which is under the control of other people. That account could be revoked at any time without my say in the matter and then I would not be able to use the software, on the face of it that's the purpose of the log-in screen. I declined here. As such, I did not even run the central part of the two software packages and cannot tell my opinion how well it would work, but I already learned what I needed to know.

I object to other people desiring to restrict how I use a software. This is fundamentally not compatible with my view on how I want to run my life, and that software/the people responsible for it have no business telling me. I will never consider these packages again. The software I have been using for years, namely qgit (a GUI) and the command-line tools, impose no such restriction.


I agree with your assessment. I think GUIs are great for things that you don't do often enough to memorize (or for things that are inherently visual, though that's not relevant here), but they are often looked down on.

There are many people who do enough git to know how it works well, and who are familiar enough with the commands that they don't need a GUI and are likely faster on the command line. But for every one of those people there are at least two who would work faster and more accurately with a good GUI.


> Edit: My practical advice: If you use git every day and you don't know how to rebase, reset, cherry-pick, and stash from the command line, make it a goal. Then, once you're comfortable, learn how to do it in a visual tool like Gitkraken and make an effort to incorporate them into your daily workflow. My guess is things will become a lot less tedious and confusing when things get messy.

I would add git bisect to the list. It's incredibly useful (if your codebase is sane).


I read some description of what that is, and it looks like checking out different commits (via bisection) until you figure out where in the history some change happened. Is there some other benefit I am missing?


It's not about finding a particular change, it's about which change broke something.

In the best cases, it's totally automatic. You know that it worked at commit A and is broken by commit Z. So it checks out commit M and runs the tests. If they succeed, then it broke somewhere between M and Z. If they fail, then it broke somewhere between A and M. So it checks out either H or S, depending, and repeat.

It's not always that easy, especially when your tests and environment are complicated. There's often manual intervention, which is tedious. Still, log2 N steps is often manageable, especially if the computer is taking care of the tracking for you.
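
In command form, the fully automatic case looks roughly like this (./test.sh standing in for any script that exits non-zero on failure):

    git bisect start
    git bisect bad               # the current commit is broken
    git bisect good v1.0         # a known-good commit (tag illustrative)
    git bisect run ./test.sh     # git drives the binary search itself
    git bisect reset             # return to where you started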


Thank you (and the sibling) for the explanation, sounds useful!


It does the binary search for you, and it can run completely unattended if you can write a script that determines whether the bug is present.

The other day I used it to write a good bug report. I first used it to find the earliest commit I could compile on my machine, then I used it to find the commit where a certain command would fail.


I know how to rebase, reset, cherry-pick, stash, reflog, assume-unchanged, and many other advanced techniques.

I still prefer to add/commit/branch/merge. I often copy-paste changes into a new branch, just because I don't enjoy recalling arcane commands from memory or googling them for the umpteenth time.

I suspect that git is a leaky abstraction that doesn't fit the corporate software development workflow. I think that git is a hammer and non-distributed development is the screw we're hitting with it.


True, but you could also argue the opposite view: that it's a sign of git's usability that beginners can get by with just those commands. The problem doesn't crop up until those lazy users start doing things that make the repo messy.


That's honestly the opposite of good design. It's hiding the complexity to make it seem easy for beginners, then slamming them with inscrutable error messages when they "don't use it right." It leads to a system with a deceptively gentle learning curve that requires you to suddenly learn everything all at once when you hit an issue.


Good point. I'm not sure which side I'm on; to me git feels like a good core with an atrocious UI. Even after years of use I have to look up whether to use this or that option, whether it's uppercase or lowercase, whether I use "--" or ":", and on and on.

Mind you, I'm not complaining, most utilities I've written for my internal users are worse! It's when something gets out and used by the masses that you wish you had had the time to put together a coherent user interface.


It's really just using the old generic version control commands everyone's used to.


Rebasing and cherry-picking are awesome tools once you know what they're actually doing. I think people avoid rebase for a few reasons: the term "rebase" doesn't mean anything outside of Git, so it's not obvious what it is doing under the hood, and inexperienced Git users might use it to change the history on the main branch, which I see as an antipattern.

There's nothing inherently wrong with merging, but I personally don't like it because I find merge commits harder to understand than regular commits. Better to use things like rebasing and cherry-picking to move commits arbitrarily and then squash some commits into units of work that make sense.

Stash is crappy though, IMO, because it's not branch-specific. Instead of stash, I like to fork the branch I'm working on and create a "WIP" commit. That way I don't lose track of work I had in progress that only belongs in a certain branch.
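
Sketched out (branch name assumed):

    # instead of stash: park the work on its own branch
    git checkout -b feature-wip
    git commit -am "WIP"
    # ...later, to resume:
    git checkout feature-wip
    git reset HEAD~1             # unwrap the WIP commit; changes return to the working tree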


> the term "rebase" doesn't mean anything outside of Git so it's not obvious what it is doing under the hood

Read "base" as "baseline"; that should help significantly for the simplest use of rebasing a series of commits without changing them.


I think it's a tragedy that just about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

This implies that they think in the wrong way, not that they have the wrong tool. A real tragedy is that git took over the world (in the minds of lovers of shiny new things, and in SaaS) without most of the world realizing that they don't even need it, because they wouldn't even like to think in its way. The world wanted a quick Subversion and instead got this in-all-regards UX monster.


In practice, rebasing increases conflicts, requires teams to time their merges, and obfuscates the history of the project.

I never understood why people think this is a good pattern.


What you need to understand is people use rebase on their unshared branches. It's part of crafting your commit history to be a coherent set of atomic changes instead of the path you took while developing it all.

You rebase BEFORE you merge into the mainline branch.


Do you run your test suite against each of the commits you create when rebasing? If not, isn’t this “coherent set of atomic changes” misleading? It seems like a lot of effort to make a fake clean-looking history.


When I've done this, the private/dev branch may be a series of broken commits. Then you rebase onto main, squash to one commit, and test (if necessary). So what shows up on the main branch is a single, squashed, tested commit that contains one logical unit of code (usually a feature or fix).

In this model the main branch history is "real" in that it records the sequence of changes to the production code. It's "fake" in that it doesn't record the exact sequence of fumbling steps and backtracks you took to get there. But IME the latter is usually not very useful anyway.


I like the squashed commit approach. I get there by merging upstream into my dev branch when developing, then squashing before I merge my changes into the upstream. As far as I can tell, that has the same outcome as rebase with squash. Both approaches create a simple commit graph, and both avoid fake intermediate commits.
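
For reference, the squash-merge variant is just a few commands (a sketch; branch names assumed):

    git checkout main
    git merge --squash feature    # stage the branch's net change, no merge commit
    git commit -m "Add feature X" # a single, testable commit lands on main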


In some cases I agree, but squashes can end up so large that doing a `git bisect` (which is quite useful in finding the comparatively small commit which introduced a bug) becomes unfeasible.


There shouldn't be an issue in doing so. During a rebase you'll either have no conflicts — in which case there isn't an issue — or you'll have to stop to resolve conflicts, and you might as well run tests before continuing the rebase. In both cases I'd argue that the statement "coherent set of atomic changes" applies.


Correct me if I’m missing something here - but a lack of conflicts during rebase only means that the few lines surrounding your changes weren’t changed in the upstream. The rest of the repo changed, and this will often cause some kind of inconsistent state. I’ve encountered this situation frequently when using git bisect.


When you rebase, you basically replay the history of your branch since it diverged from the branch you're rebasing onto. Thus, the branch is always in a consistent state (or as consistent as it was when you originally authored the commit you're replaying). And of course this assumes the target branch is already in a consistent state.


If the upstream is like this:

A -> B

And you branch off B and start making changes, then the upstream continues on its own:

A -> B -> C -> D

Now you rebase your dev branch off D. Your changes get replayed on top of D and create new commits. Some of those commits might not be valid, because they take code that worked in the context of B and put it in the context of D. The history seems clean if all you do is look at the diffs, but if you bisect and try to use the repo in one of the rewritten commits, you may find it doesn’t even compile (even if that commit was fully functional before rebasing).
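
One mitigation, if you want the replayed commits to stay honest: rebase can run your tests after each replayed commit (a sketch, assuming a `make test` target; D is the upstream tip from the example above):

    # run the test suite after each replayed commit; the rebase stops at the first failure
    git rebase --exec "make test" D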


Hm, you're right. The simplest example I could think of right now is the upstream having renamed/deleted something that the dev branch depends on, but didn't directly touch. That would definitely cause a "broken" history during the rebased commits, and is technically unavoidable.


A surprisingly common occurrence is two developers independently notice and fix the same problem, but they implement the fix in two different ways. The diffs might not conflict at all during a rebase. Or they might only conflict in some places, and the “behavioral conflict” remains after the diff conflict is resolved. This issue would eventually be noticed and fixed when tests fail before merging to master, but the intermediate rewritten commits are unlikely to be fixed.


I can see this happening, but with a reasonable bug-tracking solution in place and enforcing `fix/...` branches for fixes, these situations could mostly be avoided.


For sure, I understand that.

It tends not to be an issue when a developer is working on an isolated feature that only he or she cares about, that is reviewed in a timely manner, and that gets directly committed to main.

Often this is not the case.


In a large repo with many people merging, it helps keep things organized.

In my experience you can make an argument for a merge-based workflow up to around 6 people. By 12 it's painful and hard to track what's going on, doubly so when you have a dev branch and multiple sustaining branches or something more complicated.

By the time you get to 100 people or more committing to the same repo, it just becomes absolute chaos, and at least you can maintain a semblance of sanity in your official branches by forcing a rebase-based workflow on them.


I feel like it’s one of these moments when people speak of git and everyone has their own version of the manual. git-rebase:

git rebase master; Now, the snapshot pointed to by C4' is exactly the same as the one that was pointed to by C5 in the merge example. There is no difference in the end product of the integration, but rebasing makes for a cleaner history.

What does it even mean to have a rebase-based workflow? In svn-like terms, does it mean that you have to sync-merge before reintegrate-merging? If yes, why is rebase presented as if it were something completely non-existent before, freshly reinvented? You do sync-merge before reintegrating in svn, otherwise you'll apply ancient-based patches to the young trunk, which is obviously not what you want.

And if you do not use rebase, but use a merge-based workflow, does it mean that you apply ancient-based patches to master? If yes, of course it will be conflict hell, because master could undergo a few refactorings in the meantime.

It is so confusing when people talk in a different slang, and you can’t tell if they invented something new or just missed something so damn obvious in the old tech. Can you please comment on which of these thoughts are [in]correct?


It's not exactly congruent to what you're describing with SVN. It's all about the DAG of commits and how we're using it to describe the history of the codebase. The end code is identical between the two workflows.

A merge-based workflow maintains the work-in-progress history of commits running parallel to main before merging the two together. So your commit tree splits and reforms, with the number of branches at any one time equal to the number of people working on distinct features at one time.

It shows you how a feature evolved, which there is some benefit of, but at the cost of an explosion in branches that are now part of the permanent record of your codebase and the main branch you're working in. It rapidly turns into spaghetti with even just a few people working in the repo.

A rebase-based workflow will typically compress all the work in progress to a single commit, which then gets applied to the tip of the main branch. This maintains a linear flow of commits where each commit is a single PR.

Maintaining that linear flow of commits is increasingly important as the number of people committing to the repo rises and the branches rise with them.

Visually, a merge-based workflow might look like this:

   4
   |\
   | \
  /3  \
 | 2\  |
  \| |/
   |//
   |/
   1
This would represent 3 features worked on by different people, all branched off the same source (1), and then merging back in.

The same thing in a rebase-based workflow would look like this:

  4
  |
  3
  |
  2
  |
  1
All of the work in progress is collapsed into a single commit when completing the PR to maintain the linear history. Of course while it was in progress, it resembled the merge-based workflow above. The difference is that instead of merging at the end, they rebased and squashed the commits.

Again, the end result in terms of the code is the same. The difference is what you see when you're navigating the history of the repo.


Thanks for taking the time for such a detailed explanation! So, the benefit of rebase is in a graph view of the repo, not in a conflict-resolving workflow (which is consistent with the manual). But why

merge-based ... rapidly turns into spaghetti with even just a few people working in the repo

Isn’t it just a detail of how graph/report tools work? Can’t they track these merge points and “rebase in their ram”? I don’t get how a graphical representation of merge points may change the workflow.

One more thing that is unclear is why some people think that rebase is somehow superior in terms of conflicts and reintegration. Like they "had issues with svn and now that rebase is a thing, issues gone". Maybe they didn't understand that you have to sync-merge your branches (effectively rebasing) periodically to not diverge from trunk (or the parent branch) too much?

Added: I know rebase is not congruent with what I'm asking, but my questions are more about how git folks think, not about how git works. Because I often see comparisons to other VCSs, and claims about git's competitors that are vague or simply untrue. As if before git there was some stone age.


Thinking about this further, SVN forces history to be a linear history of commits, which is easy to reason about.

DVCS have a DAG of commits, which can get arbitrarily complicated and difficult to reason about.

Rebase-based workflow results in a linear history of commits, which is easy to reason about.

Merge-based workflow results in an arbitrarily complicated DAG of commits which is difficult to reason about.

That's all there is to it.


It is a detail of how the graph is stored and viewed, but you'll be looking at that graph when bisecting in search of when a regression occurred, or how a particular feature came to be, or when a particular feature landed.

The repo rapidly turns into spaghetti with a merge-based workflow because all the work in progress is now part of the historical record for any particular commit to main. You're not looking at a single commit, you're looking at a chain of commits and a merge node. Now imagine there are 20 people committing to the same repo. Your branch factor is exploding! Imagine the picture of my merge-based workflow, but multiplied by 7. It rapidly becomes very difficult to navigate.

The DAG is important as a historical record because you have to go back to it on a very regular basis.

It doesn't change the workflow from a "do your work and commit to the repo, periodically sync with main" perspective. It changes the workflow from a pull request perspective.

I've found rebase is (slightly) inferior for conflict resolution and reintegration--mainly because unless you squash your commits down to a single commit, you may need to resolve the same conflict as many times as you have commits after the conflict in the worst case, which is irritating. Merge is just one and done. But that's a minor thing.
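
Worth noting: git has a built-in mitigation for that repeated-conflict annoyance. The optional `rerere` facility records how you resolved a conflict and replays the resolution when the identical conflict shows up again mid-rebase:

    # enable "reuse recorded resolution" globally
    git config --global rerere.enabled true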

Having used SVN (briefly) and Mercurial (a lot) and Git (a lot), SVN pushes you into dealing with a single linear history of commits, with cross-branch merges being extremely painful and error-prone if anyone else has worked in the same area. DVCS like Mercurial and Git allow you to do whatever you like for the history, and cross-branch merges are generally easy and pain-free. I can't say anything about the underlying implementations and why it is that way, but that is my lived experience.

Most of the time with Mercurial and Git I can let the merge tool resolve differences if there are conflicts with the odd line needing manual intervention. Most of the time there aren't conflicts.

And having used SVN a bit...yeah, it's the stone age in comparison. Sometimes you have to make that tradeoff because you can't store everything locally, but I haven't enjoyed the times I've had to use SVN after having used Mercurial and Git.

And Git's user interface sucks.

I need a Mercurial skin on top of Git so the commands make sense.


This. I don't know if it's because of a lack of understanding or bad workflow, but almost all of the times I've seen bugs caused by git operations slip through the cracks, it's been because someone decided that a pretty history was of utmost priority. Rebasing is probably necessary in some cases, but it can be a real foot gun as well.


Honestly, I don’t really see your point. Yes, We keep our commit messages as clean and descriptive as possible. Yes, if we have the time, we split our commits into logical groups of changes. Yes, we work on feature branches for mature projects. We do all this with the git integration of IntelliJ, and I don’t see the slightest reason to waste any time with the syntax of our version control tool! I’d gladly force everyone on the team to use „stuff“ as the single, exclusive commit message, if that improved velocity (which it obviously doesn’t). Because all this discussion about proper git usage is nothing but bike-shedding.


It's also one of the few tools that is likely to be a constant factor in your job for a long time. Yes, it's not super easy, and you can sort of get by with minimal effort, but it's not that much time to invest compared to how much benefit you'll reap.

And I don't mean memorising commands and their arguments, but rather understanding Git from first principles.

(I wrote this visual tutorial for that purpose, takes about fifteen minutes to go through: https://agripongit.vincenttunru.com/)


This looks nice but scrolling on Safari on Big Sur is causing weird artifacts. Maybe if I scroll really slowly...


Ah sorry about that; unfortunately Safari doesn't run on any of the systems that I own, so I can't test on it, and it seems to work fine on the WebKit browser I do have...


Be safe out there everyone. Squash with caution, don't force push master. And remember, reflog is always there to help if you get into trouble.
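
A minimal recovery sketch (the reflog position is illustrative):

    # every place HEAD has pointed, including commits "lost" to a bad rebase
    git reflog
    # move the current branch back to where it was two operations ago
    git reset --hard HEAD@{2}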


Agreed. Git should be treated as a deep skill, as important to practice and train with as unit testing and regular expressions.

Think of your git history as a product and art form in itself. If you don’t enjoy writing your commit history, readers will not enjoy reading it.

On a tactical level, I highly recommend buying Sublime Merge 2.


Alternative interpretation: git is a terrible, terrible tool. It solves Linus's problem, but he also wrote the damn thing. Had GitHub not entered the scene we'd likely all be using something else, maybe even SVN still.

Maybe distributed source control really is this complicated and treating git as a deep skill is justified, but having also used Darcs and Mercurial I have a hard time believing that git's usability issues are inherent complexity; they are in fact an artifact of git itself.


I found Sublime Merge -- although it may have been version 1 when I tried it! -- very unintuitive and fiddly, a lot like using Vim for doing three-way merges. Definitely one of those "YMMV" kinds of tools. (I mean, I'm sure Vim is terrific at it once you get the hang of it.)

Personally, I've settled on

* Getting pretty familiar with the git command line

* Using a decent GUI diff and three-way merge tool (I use Kaleidoscope)

* Using GitUp, an open source Mac git client, on occasions where I want to get kind of arcane: committing individual lines of files in separate commits, re-ordering commits (on an unmerged feature branch because I'm not a complete monster), etc.

I suspect having already discovered GitUp is a good chunk of why I didn't get into Sublime Merge; it can do a lot of advanced stuff in ways I personally find easier to grok.


Regular expressions? I really enjoy using them and futzing around but I encourage the people I work with to stay away and to avoid using in production when possible.


Regular expressions are useful as tools for searching through code, filtering logs, searching through the filesystem..., even if you never commit them into your codebase.


Yes, good clarification. I probably write 10x (maybe 100x? more?) more throwaway regexes than regexes I commit.


Whoa, why? That sounds like awful advice to me. By all means, use regexes, but make sure you understand the theory behind them (state machines) so you will know not to parse HTML with them. They are really pretty easy to get right once you grasp the concepts.


You can write correct ad hoc regex parsers for many subsets of HTML depending upon your needs.


> My practical advice: If you use git every day and you don't know how to rebase, reset, cherrypick, and stash from the command line, make it a goal. Then, once you're comfortable, learn how to do it in a visual tool like Gitkraken and make an effort to incorporate them in to your daily workflow.

I do agree you should learn rebase, reset, cherrypick and stash, but I don't agree that you need to learn on the CLI. I mean, use the CLI if you prefer that, but the git GUIs are perfectly adequate for performing any of these operations.

I used to use git CLI heavily, but in the past few years I have simply not needed to, to the point that aside from a small handful of the most common operations I don't even remember a lot of it anymore. Partly this is due to maturity of the GUIs, and partly because old practices like SSHing to a dev server to edit+commit something there are just totally obsolete and unnecessary these days.

Even for a simple commit, there's a massive convenience of seeing the timeline and being able to interactively stage and look at diffs that is just miles ahead of CLI, and lets me break down commits into better units of work and write better messages.

There's a stupid gatekeeping thing some developers still do about git CLI, I don't get it. It's as valid as dictating what text editors, color schemes, input devices, or OSes "real developers" use. Judge people on their output, not their tools.


I don't force my choice on others, but I'm firmly in "git cli" camp. The reason is simple - cli is available everywhere. I'm sure GitKraken & co. are great once you get used to them, but apart from GitLab graph view (which GitHub sadly lacks, and cli tool afaik also) I don't miss anything. But again, this is just my personal preference and I agree that developers should be free to use gui if they prefer it.


> cli is available everywhere

I guess that's part of what I was getting at though. Where are you doing your development that isn't your workstation?

I work on a whole bunch of different things -- from personal stuff running on my laptop or VMs in my house to cloud services deployed across dozens of AWS services -- and things have just got to a point where I have no need to do a commit anywhere but my workstation. (Well, technically, I have two: one personal, one work).

I definitely used to do it years ago, but now I don't remember the last time I had to use git on a remote system.


I use two at the moment but they change over time, along with the OS. I am not talking about developing remotely, though it does happen that I use git push to deploy for some smaller projects. But I could have used GUI for that too. I guess I just don't want to get used to a tool I won't be able to keep using.


I avoid rebase because it rewrites history, and I'd rather have an accurate log of what happened. Cherry-pick can certainly be useful for grabbing particular commits; although I find myself more often using `format-patch`+`am` to grab particular files (which also works across repos).
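
Roughly, that cross-repo flow looks like this (the paths and the commit placeholder are illustrative):

    # in the source repo: write a commit out as a mail-formatted patch file
    git format-patch -1 <commit> -o /tmp/patches
    # in the destination repo: apply it, preserving author and message
    git am /tmp/patches/0001-*.patch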


I use rebase often because it rewrites history - it lets me squash commits into a conceptual single commit, or re-order commits together that chronologically make more sense next to each other.
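
For example (the commit count is arbitrary):

    # open the last four commits in an editor; reorder the "pick" lines, or
    # change "pick" to "squash"/"fixup" to fold a commit into the one above it
    git rebase -i HEAD~4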


I never understood the "rewrites history" argument. The original commits don't necessarily faithfully represent my thought process, I might have a larger commit because I got distracted in the middle of the day or a shorter commit because I wanted to make sure my code was backed up at the end of the day.


I think the issue here is that using git is not a goal unto itself. Git is a tool/system that should get out of your way as much as possible. Instead it has arcane commands and options making anything but the most basic operations Shakespeare novels on the command line.

My goal is to have my code in the repo. So if git starts being a pain, it’s much easier to store my edits locally, pull a fresh copy of the repo, copy over my edits again and commit + push.

If you have a good cook, let him/her cook dishes and let someone else care about sharpening the knives and cleaning the dishes.

If you have a brilliant programmer, let him/her write good code. Don’t bother them with understanding binary trees and hashes of snapshots of diffs of local repo’s of pointers of objects in a blob graph lalalalala.


My usual workflow is to frequently merge master into my feature branch during development, then I squash before merging back to master. As far as I can tell, this gives a clean history so shouldn’t bother the rebase fans (who prioritize a simple commit graph), and shouldn’t bother bisect fans (who get confused by fake historical commits). Is there something bad about this approach?
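
For reference, a sketch of that workflow (branch names hypothetical):

    # during development: keep the feature branch current
    git checkout feature
    git merge master
    # when finished: squash the whole branch into one commit on master
    git checkout master
    git merge --squash feature
    git commit    # single commit; master's history stays linear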


> I think it's a tragedy that just about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

This is because Git is too hard to use.

How do I know that Git is too hard to use? Because there are literally thousands of blog post tutorials explaining how easy Git is to learn. Things that are easy do not need thousands of different guides telling you how easy it is.


> This is because Git is too hard to use.

Git is hard to LEARN. It is objectively very easy to USE for those who have learned it, so much so that the population of "I used to use git until I found ..." evangelists is effectively zero. Tools like mercurial exist in the marketplace of ideas mostly by peeling off users who haven't yet started using git productively by promising 80% of the features for 10% of the effort.

In fact, I don't know that there has been a new tool since vim or emacs that so well illustrated this dichotomy between ease of learning and ease of use.

But to be honest: it really is needlessly hard to learn. The content of the linked article is that git is built on an extremely simple foundation of data structures and operations that anyone can understand. But the takeaway from the article is that no one does understand it, because that layer is hidden behind a facade of tools that completely obscure it. Where are the "blobs" in git reset? What is the "index"? Is it a "tree" (it's not, IIRC)? I definitely agree with people who complain about the porcelain layer's design. But I still use git every day and love it.
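
For anyone curious, that hidden layer is easy to poke at with git's plumbing commands:

    git cat-file -t HEAD             # -> "commit"; commits are ordinary objects
    git cat-file -p 'HEAD^{tree}'    # the root tree: mode, type, hash, name
    git ls-files --stage             # the index: mode, blob hash, stage, path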


I respectfully disagree on the basis that, yes revision control is hard, but git provides a relatively beautifully simple api and vocabulary on top of an inherently complex and absolutely necessary set of concepts.

When you're working on a codebase with multiple people, there are going to be changes and the changes have to be consolidated and the conflicts have to be resolved. I believe, with a reasonable amount of time and effort, developers can learn that API and vocabulary and I have yet to encounter anything comparable in terms of ease of use and "grok-ability" - especially with modern GUI tools.

git is one of the most ubiquitous and unavoidable technologies in software development and it's 100% worth the time and effort to understand and be good at it.


The counter argument would be that maybe git is so basic and easy to learn that everyone feels comfortable enough with it to write a tutorial?

I think a large amount of content is more a factor of Git's ubiquity than its difficulty.


> git is so basic and easy to learn that everyone feels comfortable enough with it to write a tutorial?

Nobody writes tutorials on how easy Lyft or Uber apps are to use. Easy interfaces don't need lots and lots of tutorials. That's exclusively the result of poorly designed interfaces AND complicated systems.


Lyft is an app. Git is a tool for developers. I have no clue how they're related.

That being said, I googled "How to use Lyft" and there's a ton of results.


> Lyft is an app. Git is a tool for developers. I have no clue how they're related.

Google: What is a system

Good luck.


You have evidence that git is hard to use, but no evidence establishing that it's too hard. Some things just aren't easy.


> This is because Git is too hard to use.

> How do I know that Git is too hard to use? Because there are literally thousands of blog post tutorials explaining how easy Git is to learn. Things that are easy do not need thousands of different guides telling you how easy it is.

I'm not sure that's convincing. I think that a lot of guides about how easy it is indicate that it's slightly difficult to learn. That results in a lot of people struggling for a little bit, overcoming the struggle, and feeling a sense of accomplishment and enlightenment, which they then want to share.

(There's also a difference between how hard something is to use and how hard it is to learn. I'd argue that there's often a trade-off to be made, where some sacrifice on difficulty learning results in a reward in ease of use—in the sense that, for example, vim is far easier to use than any other editor for a seasoned vimmer.)


My SCM journey: RCS-PVCS-cvs-VSS-MKSSI-svn-Perforce-git

I hate all of them but learned to use them because what's the alternative?


Git has over-complicated source control for the majority of developers. Things were much simpler with svn.


SVN is a versioned storage engine pretending to be source control. Branching and tagging? We'll just expect everyone to obey an implied policy in our filesystem tree. Oh you think you want to merge these things that have a clear ancestor in the DAG? Not so fast buddy.


IIRC, merging in SVN is... basically something you never wanted to do. :)


Svn's complete inability to handle merges at a file level may have been simpler, but was by no means better. Needing to coordinate who is allowed to edit each file in order to avoid a painful merge is common with svn, and nearly unheard of with git. Svn looks at the inherent complexity of multiple people working on the same code base, shrugs, and figures that it is somebody else's problem.


> Things were much simpler with svn.

Not from my experience as a newbie with SVN at the start of uni and in the beginning at $job.

In both cases, it was temperamental, prone to network issues (this was in both student accom. -> uni server, and LAN at $job) and did not like users working on the same files.

Git took some learning, and it took reading Git Magic <http://www-cs-students.stanford.edu/~blynn/gitmagic/> to go from <https://xkcd.com/1597/> to the friend mentioned in the alt text.

SVN still feels like I'm pulling teeth all these years later.


> My practical advice: If you use git every day and you don't know how to rebase, reset, cherrypick, and stash from the command line, make it a goal.

I couldn't agree more. It not only enables more sensible interactions with the people that collaborate on repositories with you, it's also a comfort that allows you to experiment and take advantage of the tool without fear of getting stuck.

I would add checking the reflog and how to use it to complete the list, even if it's clearly less important.

A source that I recommend wholeheartedly for those that want to go further down the git rabbit hole: https://learngitbranching.js.org/


How do you track which cherry picks have been done?


We've never had an issue with git merge'ing all over. I've never looked at our git graph because I've never had to.

Maybe rebase is a tool to help poor software development practices? (and your colleagues letting branches go stale is one of those)


I agree: if you merge early and often everything goes smoothly, and if you have a stale branch you're doing something else wrong. At least for the work I do, web dev. Some of our pipeline folks are working under a different paradigm and they run into these problems often, but also understand how to rebase and cherry-pick. As always, it just depends.


> help poor software development practices

Says someone who's "never looked at our git graph".


Entertain me then, why do you need to look at it?


> about every developer uses git but most learn add, commit, branch, and merge and then just stop learning.

What are branch and merge? /j


> A lot of people are scared of rebase and cherrypick and shut down or get defensive when you mention them or try to encourage their use.

Because a lot of people have been burned and way too many hours been lost due to a rebase gone wrong. Cherry-pick and stash are trivial operations, reset (outside of "undo a git add") and especially rebase are not.

The learning curve for both is steep, the potential for failure extremely high, so I understand why organizations go as far as entirely banning rebase.


People are scared of rebase because of the constant scare-mongering around it. "Rebase is Evil" and "Never use Rebase", etc. Then we end up with junior devs that are too scared to even use their git IDE's built-in "rebase branch onto remote", so they end up littering the entire repo with "Merged branch-A from origin into branch-A" commits.

It's so bad that even seasoned developers that haven't delved deeply into git have no idea that this sort of rebase is practically harmless. Instead they parrot "Rebase is Evil" without thinking twice.
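
That practically-harmless variant is just a rebase onto the updated remote branch, and it can even be made the default:

    # replay local commits on top of the fetched branch instead of creating
    # a "Merged branch-A from origin into branch-A" merge commit
    git pull --rebase
    # or set it once:
    git config --global pull.rebase true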


This is a good overview of Git internals. If this stuff interests you, Chapter 10 of Pro Git offers similar descriptions of Git objects [1] and Git references [2], and then continues onto Git packfiles [3] which are not covered by OP.

[1]: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

[2]: https://git-scm.com/book/en/v2/Git-Internals-Git-References

[3]: https://git-scm.com/book/en/v2/Git-Internals-Packfiles


Whereas in Pijul and Darcs, commits (called patches) are diffs, not snapshots. They are based on a sound theory of patches, which allows for operations not supported by Git like commuting, as long as the commits aren't interdependent. Plus, language-specific tools can extend the notion of dependency from line-based to semantic.


> They are based on a sound theory of patches, which allows for operations not supported by Git like commuting, as long as the commits aren't interdependent.

This is definitely supported by git. Even though commits may technically be snapshots, you can build a diff from snapshots (and vice versa). `git diff` will get you the diff for any given commit, and `git rebase` will happily reorder commits for you by reapplying the diffs.
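
For instance (the commit name is a placeholder):

    # the change a commit introduces, computed on demand from two snapshots
    git diff <commit>^ <commit>    # equivalently: git show <commit>
    # reorder recent commits by re-applying their diffs in a new order
    git rebase -i HEAD~3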


When reading a bit about Pijul, a few months back, I had assumed every two patches would commute, and I couldn't imagine how that could possibly work.

Does it really have this limitation? If so, it doesn't look like much of an improvement compared to git: I can shuffle "patches" all right using `git rebase -i`. I concede it can be quite slow, though.


So not every patch can commute with every other patch: “delete foo” doesn’t make sense until after “add foo” has happened. So patches have dependencies that they must come after, but in lots of VC situations patches are independent. Sets of patches make rebasing a branch trivial, for example, because adding the patches from master after your patches is equivalent to adding them before. If you would get a merge conflict, you get the same merge conflict whether they are added before or after.

But nailing down the logic behind commuting patches can be important too as it can catch subtle problems that might happen with normal snapshot-based merging. Consider some people independently editing branches

  Bob adds a file with line “foo”
  Alice pulls Bob’s patch
  Bob changes “foo” to “bar”
  Alice changes “foo” to “bar”
  Bob changes “bar” back to “foo”
In Pijul or Darcs you should get a consistent result pulling changes from Bob and Alice no matter what order you do it in. But if you use something like git, the order you pull and merge, and whether you do it at any intermediate times, might change the resulting snapshot (as well as just the history). The start and end states of Bob’s repo look the same as snapshots but they are different, because Bob changed his mind about the line “bar”—maybe the change didn’t work.


I really liked this video: the guy first walks you through how to build your own git-like utility with a handful of shell commands, then goes and walks through an actual git repo:

https://youtu.be/qq_s2Hh--aQ

Even the first 20 minutes was enough for me to have a substantially better understanding of how git works.


I just finished watching this and have to say it is a great talk to understand git.

Once you go through the initial 20 mins building a git like utility with bash the rest of the talk about git becomes easier to follow. I recommend building the git like utility in bash yourself first, playing with it for some time and then watching the rest of the talk.


This article goes into a little too much detail imho. I have had great success explaining Git to coworkers using post-its, permanent marker and a flip board (no computer!) and going through the steps Git would take (abstractly, not exactly) when performing certain commands. All commits (and their relations) are written down on the board with the marker because they don't change (eg: rebasing just creates a new line of commits). The branches are written down on post-its and can move around (like this article explains, they are just pointers). You can use a whiteboard with non-permanent marking for the working directory and index if you want to go that deep.


Neat overview of some of the core concepts in Git that often go unnoticed. Although I'll say that the fact that commits are technically not diffs doesn't seem to matter much in day to day use. Git does a decent job of abstracting that detail away to the point that you could just as well believe commits are diffs. Also, I want to say that technically I believe Git does use deltas to compress an object's history in the blob store. But the different blobs that comprise an object's history can be thought of somewhat as being separate. Git could just as easily not perform this internal, space-saving optimization and things would all work the same. The SHA hashes would be the same and based on the same input.


Cherry-pick is what messes up the commit-as-snapshot idea for me. If I see a small commit that I feel I can merge into my branch then that commit feels like a diff and I don't want to care about the rest of the stuff that commit snapshots. I guess that's a good thing.


I tend to agree.

I am not someone who has a deep understanding of the inner-workings of git by any means, yet I am perfectly comfortable with rebasing and cherry-picking.

For me, git is so much easier to intuit if I only think of it as diffs. When I rebase, I'm just rearranging diffs, or squashing them together, or whatever. If I try and think of everything as snapshots it actually gets more confusing for me.


So, I think a useful simple way to think of it is "git creates diffs when it needs to on demand."

When you're doing a cherry pick of, say, commit ce123, what you're asking git to do is: (1) diff ce123 against its parent, and (2) apply that diff to some other branch.
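
Spelled out as plain commands (ce123 and the branch name are illustrative):

    git diff ce123^ ce123 > change.patch    # 1. diff ce123 against its parent
    git checkout other-branch
    git apply --index change.patch          # 2. apply that diff elsewhere
    git commit -m "apply ce123's change"    # producing a brand-new snapshot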

Likewise rebasing is the same, but with an extra step to apply the inverse of the diff to the original commit first, then rewrite the history.

One of the big advantages of this on demand diffing approach is it's much more robust vs conflicts. Back in the subversion days I wrote some shell scripts that did the equivalent of git cherry pick and rebase. I'd keep a couple extra copies of a checkout, would use the switch command to quickly put them into a specific state, then would just generate a diff manually to apply to my main working copy. It worked, and was often faster than manually copying text around between editor windows, but it was extremely conflict prone.

So this distinction, of whether you store snapshots and diff on demand, or store diffs and snapshot on demand, is somewhat subtle but has important consequences.


Since you can go from diffs to snapshots, and snapshots to diffs, aren't they basically equivalent? I'm struggling to see the important consequences at the user level.


There's an asymmetry that shows up once you start doing things that rewrite the history/order.

A snapshot style commit, like git uses, always denotes a complete state. Git creates the diff it needs on demand in relation to this, and any new commit created by cherry picking, rebasing, etc, is given a new identity specific to its content.

On the other hand, in a diff based system, the meaning of a diff changes based on its neighbors. This is because a line based diff is a flawed way of representing the logical operations we're doing. This is why git style systems tend to see less conflicts than diff based systems.

I was a darcs user for personal stuff before git appeared, and ultimately in comparing the two I've come to see git as the right model due to this asymmetry. If we had editors that exposed logical/semantic operations, then the Theory of Patches approach would be extremely impressive. But that would also require our version control system to be language-aware, so it's probably a non-starter as a generic tool for developers.


You can't go from diffs to snapshots. Two identical diffs can be applied on different branches - looking just at the diffs, you don't know which branch it is.


Yeah, good summary.


Commits are conceptually snapshots, and everything else Git does is just an optimization over the naive “keep all versions of all files ever” (imagine implementing a version of Git that is just zipping the entire folder). Diffs are isomorphic to commits and are generated as needed.
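
In shell terms, the naive version is something like this (directory names are hypothetical):

    # every "commit" is a complete copy of the working directory
    mkdir -p .naive
    cp -r worktree ".naive/commit-$(date +%s)"
    # blobs, trees, packfiles and deltas are all optimizations over this idea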

I wrote about it (albeit imprecisely) here: https://siawyoung.com/git-intuition


Yes, exactly, this is a very good post on the nature of Git.

> Branches are pointers

Yes. I would say they are named pointers. Commit hashes are weak, unnamed pointers.


I think we're running into a naming issue here. It's useful to think of a single commit in itself as a diff. The DAG is a useful model for an accumulation of changes. The question is, what changes and operations make up a node in the DAG (i.e. what code is in this branch, compared to that? What code do they have in common)?

To answer this: take the node and follow its predecessors until you reach one (or more) roots. All commits along the way are contained in the commit at hand. That's the history.

Adding changes is, I think, the most useful mental model, even if it is not the implementation.

Now what the author is saying is: a commit is not only the diff, but also the whole tree/history that the diff is based on. And that is also true; the commit (the change plus the past) is then a snapshot.

Do we have a good naming convention for the single node in the tree with its changes, compared to the single node in the tree with its changes AND the references to the parents with all their changes etc.?


> It's usefult to think of a single commit in itself as a diff.

Except if it has multiple parents like a merge commit.

Actually I don’t agree even in general. It took me an unreasonably long time to become unafraid of git because I clung to the common VCS mental model where commits were actually diffs.


But the snapshot-model also doesn't really make a lot of sense for merges. It's a snapshot of what then? A merge of all parent trees? What's a merge of two files then? Defining this merge-operation on trees is at least as mentally taxing as the alternative.

Accumulating all the diffs from two (or more) ends (until they are common again) is at least as useful.


For me, a merge commit in git is just a snapshot like any other except that its metadata contains links to more than one parent.

The parent child relationship acts as nothing more than remark that the child was derived from both parents in some way.

Of course, commonly the child is derived by finding the most recent common parent, using heuristics to guess file identities after any renaming and then performing a 3-way line-based diff between what it thinks are corresponding files.

But actually git doesn't really care - it's just another snapshot you've created and added to the DAG.

I haven't found it helpful to think of what's going on in git in terms of an "accumulated file diffs" abstraction because git has no notion of file identity (across commits).


> But the snapshot-model also doesn't really make a lot of sense for merges. It's a snapshot of what then?

It's a snapshot of the final result.

That's the beauty of the "commit as snapshot" model: each commit always contains the final result of the commit. It doesn't matter if the commit is a normal commit with a single parent, a merge commit with multiple parents, or even an initial commit with zero parents. It doesn't matter if the parent commits are unavailable (shallow repositories). It doesn't matter if the parent commits have been changed (grafts).


You can have a diff against multiple parents - you just get multiple status columns then, similar to what you see in the diff for merges.


This comes up from time to time and each time the comments debate the correctness/effectiveness of the title.

The content of the post does shed much light on how git operates and introduces a view that can help in navigating how to use git.

Whether or not you want to think of a commit as a snapshot or a diff isn't material. It's best to think of it as a dual, since a diff on any base can create a snapshot, and two snapshots can create a diff.

This very much mirrors the idea of a transaction log (of diffs) and a 'current' state. The current state is convenient and can benefit performance, but is not absolutely necessary. It doesn't even have to be the most recent, e.g. key frames in video compression. These are all just ideas; getting used to them and being able to move viewpoints between them is better than clinging to any one of them.


Most developers think of commits as diffs and they can for all intents and purposes be thought of as such. It’s actually best for the understanding of how to practically get things done to think of them in this way.

Odd semantic argument to make.


If anyone does want to get more into the internals of Git without playing with a production repo, I built a "playground" a while ago which creates a simple Git repo of synthetic commits which you can then play around with:

https://github.com/dmuth/git-rebase-i-playground

I know it says "rebase -i", which originally what I built it for (and what the exercises in the README are for), but you can really do whatever you want in it, and blow away/rebuild the repo with the included script.

Enjoy!


>Commits are snapshots....commits are diffs....

Neither model really encompasses commits for me.

I prefer...

Commits are a point in history I can return to after I inevitably fuck up or look back on so I can convince myself, yes I am indeed making progress.


Am I the only person that doesn’t want to understand the inner workings of my VCS in lurid detail? I don’t have to know as much about any other developer tool in order to use it effectively.


> I don’t have to know as much about any other developer tool in order to use it effectively.

You don't? How do you debug problems?


I just use the debugger and it mostly works how I expect. I don’t have to go and study the data structures and other intricacies of the debugger itself to puzzle out why it works the way it does. Git is terrible in that way, as evidenced by the thousands of blog posts of people trying to describe the inner workings of it and how it will “make more sense” once you understand it as well.


Think about ZFS - ZFS works perfectly fine for me and I don't have the foggiest idea how it works beyond "copy on write" magic.


Do you feel like you need to know this to use git? What did the blog post change about your use of git?


I used to think commits as snapshots, but it was confusing. Then I read "Git Internals".

A commit contains the "whole" content of each file that we've committed. And since a commit has a pointer to a root tree, it also represents a working directory. Even though a commit contains "whole" files, git internally may store only deltas of the files as an optimization.

When we diff two commits, we see the difference of the file contents in the corresponding working directories that the commits represent.
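
You can see this directly; a commit object records a root tree plus parents and metadata, not a diff (output paraphrased):

    git cat-file -p HEAD
    #   tree   <hash of the root tree>      <- the "whole" working directory
    #   parent <hash of the parent commit>
    #   author/committer/message follow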


Great article but:

"one of my favorite analogies is to think of commits as having a wave/partical duality.."

is a hilariously misguided object to build an analogy from. Theoretical physicist checking in, and my community has been searching for about 100 years for an analogy to explain that shit, so it's hilarious to see someone try to use it as a concrete object people can use as a touchstone to better understand a purely classical database.


Read it as "think of commits as <unintelligible bullshit you have to take on faith because nobody really understands it>"


I convinced myself that commits are snapshots by doing the following:

    # generate a ~100MB text file (-b is BSD/macOS base64; use -w 76 on GNU coreutils)
    base64 -b 76 /dev/urandom | head -c 100000000 > file.txt
    git add . && git commit -m "1"
    
    # remove first line and add a new line to bottom
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "2"
    
    # repeat
    tail -n +2 file.txt > tmp && rm file.txt && mv tmp file.txt && base64 -b 76 /dev/urandom | head -n 1 >> file.txt
    git add . && git commit -m "3"
    ...

    du -sh . # a very big folder
Each of the commits is almost 80MB big in the .git folder. If you run `du -h .` you can see how git stores each object individually (~80MB each).
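
A follow-up worth trying: those loose objects are full snapshots, but git's packer delta-compresses them, which is the storage optimization people usually have in mind:

    git gc        # repack loose objects into a delta-compressed packfile
    du -sh .git   # typically far smaller than the sum of the loose objects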


The fact that in the implementation “commit” means one thing does not mean that people need to / should use “commit” to mean the same thing, nor that it is necessarily helpful to do so. In any case a commit is more than a snapshot because it has a parent, thus a diff is a sensible mental model for the pair.


I think this is incorrect, no?

Can’t all commits be turned into patches? Thus, aren’t commits isomorphic to diffs?


It's technically correct. The key thing here is that it's isomorphic. You can either have a system of commits in which diffs are computed between commits, or a system of diffs in which commits are computed by applying diffs. The trade-offs are in the performance of various operations, not in the user-exposed semantics. Git chooses to have its first-class object be commits, and diffs are computed on the fly. So again it's technically correct, but in practice commands like cherry-pick, which treat a commit like a diff between that commit and its parent, really blur the line. I think in reality you can be a really advanced git user and not even realize that there's a difference between a commit-based and a diff-based version control system, because in practice there really isn't much of a difference.


Yes-ish. However, there's also the question of what operations are efficient. (Diffing feels very performant in git, but) maybe having the diffs as the first-class objects enables doing something efficiently that git doesn't do (perhaps identifying when identical patches have occurred in different portions of the history?).

I've used several patch-based VCs (RCS and CVS) but I think they pre-date this "sound theory of patches" and instead the use of patch-style representation was for optimizing storage. (just as git uses packs and deltas to optimize storage and performance, f'rinstance) So I don't really know what I'm missing.

(If the sound theory of patches would let me better understand what occurred at a merge commit than git's tooling, that'd be just about enough to sell me on switching. except for the network effects of git & github.)


Consider a small commit with a spelling error. If I turn this into a patch and apply it to another branch, it will be a different commit even though it is the same patch.

As such, the concept of a "commit" in Git refers to a complete state of everything; a snapshot.


If you inspect the files in the .git directory, you’ll find commits stored as trees of directory and file objects. But it is true that they can be converted to diffs on the fly, which is exactly what the git show command does.


> aren’t commits isomorphic to diffs?

Nearly, though renames are only approximately extracted from the snapshots.


If your VC isn't plumbed all the way into your editor, you can't tell changing a typo from deleting and re-typing the whole file when it comes time to create a delta.


There is a `git mv` command that means "rename". Git could (even with its current data model) explicitly annotate commits with this intent, but doesn't. I don't know how useful that would be compared to the current heuristics, but it does mean that git commit "snapshots" are not (quite) isomorphic to a diff format (like posix diff's) that can explicitly encode renames.


That's true, but git doesn't natively have a way to refer to a single diff. You can use a hash to refer to a commit, but that depends on the entire history up to that point. If you rebase a commit then the hash changes, even if the new commit is semantically the same.


It kinda does: `git show <commit>` does exactly what one would expect if commits were actually diffs. That is, it shows the diff between <commit> and its parent.


My first tutorial was the Pro Git book, and this fact was stressed well there so it stuck. Thinking of commits as snapshots also has the small advantage of making the first commit less special.


Darcs users disagree.


That's a cool explanation.

I'm a bit slow on the uptake, so I had to re-read a couple of sections, but it was helpful.


this... seems so very flawed and disprovable to me. Ignoring the obvious storage issues that have been discussed: if commits were snapshots, you could rebase and reorder them without ever worrying about conflicts. In reality you very much DO have to worry about conflicts, because they are change instructions that transform a file from A->B->C; if you try to reorder it as A->C->B you're going to have serious issues (assuming these all touch the same code), because C is a transformation from the B state to the C state. It blows up attempting to convert A->C because the instructions in that transformation describe going from B->C.

> A commit is a snapshot in time. Each commit contains a pointer to its root tree,

it so... _so_ very much isn't. It's not even a snapshot in time of a section of a file.

It's a change instruction. No, it's not a "diff" but it also isn't a snapshot.


commits are snapshots. cherry-picking/rebasing diffs a commit and its (first?) parent, and applies the diff on a base commit to create a new commit.

if you `git replace` a single commit and change its contents, its children do not change their contents, so `git show`ing any direct child will show a new diff, not previously present, reverting the actions you've performed in `git replace`.


Simondon chuckles in allagmatic.



