- [Major] I feel (but could be potentially convinced otherwise?) that there is one very deep fundamental flaw in the semantic model, and that is the fact that the identity of a commit depends on its history. I simply do not understand why this has to be the case. If I later discover a ZIP backup of the tree that I forgot to commit, and I want to insert it into the history, it shouldn't suddenly completely break the entire repo. Of course it seems fine to have a hash that depends on the history, and it's very likely useful for many purposes, but that shouldn't be the primary mechanism for identifying commits. By default, I think the identity of a commit should be defined by a hash of its contents only, but independent of its history. This would (among other things) let you re-write the history structure without rewriting the commits themselves and causing other people to have to reset their repos, which seems insanely useful to me.
I think the reason for why the commit hash has to change is that a commit represents the entire state of a repository, not just the change made in the commit. Being able to take a sequence of commits and insert them into a repository just is not a thing that makes sense in git's model.
If you just hashed diffs, you would not get whole-repo integrity guarantees.
It is possible to go the other way with patch theory (see Darcs) but it's far from trivial to implement performantly.
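To make the "a commit names the whole state plus its ancestry" point concrete, here is a minimal Python sketch of how Git derives object ids. The object layout (a `<type> <size>\0` header, tree entries as mode, name, raw hash) is Git's documented format; the helper names, the fixed author line, and the flat-files-only tree are simplifications of mine.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>\\0' + body, which is how Git names objects."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files: dict[str, bytes]) -> str:
    """Id of a flat tree of regular files (no subdirectories in this sketch)."""
    body = b""
    for name in sorted(files):  # plain ASCII names assumed, so str order is fine
        blob = git_object_id("blob", files[name])
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob)
    return git_object_id("tree", body)


def commit_id(tree: str, parents: list[str], message: str) -> str:
    """Id of a commit object; the author/committer line is a placeholder."""
    who = "A U Thor <author@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return git_object_id("commit", "\n".join(lines).encode() + b"\n")


snapshot = {"README": b"hello world\n"}
tree = tree_id(snapshot)

# Two unrelated root commits stand in for two different histories.
base_a = commit_id(tree_id({}), [], "history A")
base_b = commit_id(tree_id({}), [], "history B")

# The same snapshot and message committed on top of each history.
on_a = commit_id(tree, [base_a], "add README")
on_b = commit_id(tree, [base_b], "add README")

print(on_a != on_b)  # True: the parent line is part of the hashed bytes
```

The tree id is the same in both cases; only the `parent` line differs, and that alone is enough to change the commit id.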
> I think the reason for why the commit hash has to change is that a commit represents the entire state of a repository, not just the change made in the commit.
Yes, of course I realize that's the reason. My entire point was that a commit shouldn't represent the entire state of a repository.
> Being able to take a sequence of commits and insert them into a repository just is not a thing that makes sense in git's model.
Yes, and this is exactly why I declared this to be a fundamental flaw in git's model.
> If you just hashed diffs
Diffs are an implementation concern, which I don't care about. I'm only talking about the logical semantics.
> you would not get whole-repo integrity guarantees.
As I explained, I wasn't suggesting you must get rid of that hash entirely: "Of course it seems fine to have a hash that depends on the history, and it's very likely useful for many purposes, but that shouldn't be the primary mechanism for identifying commits."
> It is possible to go the other way with patch theory (see Darcs) but it's far from trivial to implement performantly.
Again, I didn't say you have to get rid of the current hashes. I was just saying we need something else to use for identifying commits.
------
If an example helps: consider what happens when you (say) sign off on a commit. Are you genuinely signing off on the history? Can you even claim with a straight face that you even know everything in the history behind every commit you sign off on? The reality is, you don't, and you don't need to, because you're only concerned about the commit itself. There's no reason a change in history should invalidate your signature. (Of course, the point here is not just signatures. They're just one example to illustrate what I'm saying. You can think of other scenarios.)
No, you are signing off on the current state of the repository. Otherwise it would be possible (not trivial, but possible) to take a signed commit and apply it to a different history, which could create a security loophole.
Your view of a commit is a logical set of changes. Git's view is the state of the repository. The set of changes between revisions, which is more useful for a developer to see than the whole state, is computed on the fly.
> I was just saying we need something else to use for identifying commits.
> Your view of a commit is a logical set of changes. Git's view is the state of the repository.
No, my view of a commit is not a logical set of changes. It's everything that would be in my worktree if I checked out the commit. Which is neither merely the changes from the previous commit(s), nor the entire history leading to the current commit.
But git already has this object, it’s called a tree and each commit has a unique tree associated with it. The commits are the object that carries history and metadata on top of the trees. Is your objection that the commit metadata is associated to the commit and not the tree?
I used to think that, but life gets complicated. How do you transmit a commit with its history? It used to be that you just sent a single hash; now you would have to send all the commits of the whole history. Also, how would you merge repositories with different histories?
I spent some time thinking about it and I couldn't come up with anything sensible that wouldn't lead to history being effectively broken.
I'm not sure I understand what the problem is. I'm only talking about the logical objects, not the physical representation. You can still store diffs and you can still have history hashes if that helps you with storage, processing, etc. -- that's perfectly fine. The storage optimizations should be independent of the logical structure. I'm just saying the logical identity of a commit shouldn't depend on its history. For example, if someone removes a commit from the history, that shouldn't have to trash anyone's repo and be such a massively destructive operation. It should only cause a client (in the worst case) to resync its history hashes from that point onward the next time it pulls -- which is quite a cheap, fast, and non-intrusive operation. (Say, 100k commits with 20B SHA-1 hashes would just be ~2MB.)
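The back-of-the-envelope figure in the parenthesis checks out. A quick sanity check in Python, using the 100k commit count from the comment (raw digests only, no framing or protocol overhead):

```python
import hashlib

commits = 100_000                           # figure used in the comment above
digest_bytes = hashlib.sha1().digest_size   # 20 bytes for a raw SHA-1 digest
total_bytes = commits * digest_bytes

print(total_bytes)              # 2000000
print(total_bytes / 1_000_000)  # 2.0, i.e. about 2 MB
```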
I'm not sure I understand. How could removing a commit from the history not be a destructive operation? It would necessarily affect every commit after it, hashes or no hashes, because for each commit following, the state of the tree would change, and thus so would the commit.
To my mind it would be akin to walking across the room and then somehow changing things such that you took one step fewer than required.
I'm not sure what kind of implementation you're envisioning that could work the way you seem to describe. Or do you mean that git should save the entire state of the repository as independent blobs every time you commit something? I don't think you could do that with any hope of reasonable performance.
If you instead just allowed "removing" commits logically without actually physically altering the datastructure on disk, there's no point in providing the functionality in the first place.
> If you instead just allowed "removing" commits logically without actually physically altering the datastructure on disk
Yes
> there's no point in providing the functionality in the first place.
Why so dismissive? Wouldn't it make sense to give me the benefit of the doubt here and ask me what the point of something like this might be, instead of just shutting it down as pointless? Unless you think I'm just dumb, or otherwise trying to troll here by asking for something pointless?
I am not being dismissive. If you provide functionality that allows the user to delete something without actually deleting it, what's the point of pretending that you can delete things? Usually when people want to delete commits, it's because they committed something like a secret, and really do want to delete it.
Git doesn't try to hide the fact that the committed data is immutable, and to accomplish "deletion" the only option is to rewrite the entire affected part of the data structure and garbage-collect anything that's unreferenced. You cannot modify a commit. You can only create new commits and manipulate references to them.
This is fundamentally what enables git to function in a distributed manner, since the only state between repositories that needs special logic are the references; the actual data could be blindly synced with rsync or something, because it practically speaking can't ever conflict.
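A toy Python sketch of why that works, assuming a bare content-addressed store (a dict keyed by the SHA-1 of the data, without Git's object headers) and a separate dict of refs. All names here are made up for illustration:

```python
import hashlib


def put(store: dict[str, bytes], data: bytes) -> str:
    """Add data to a content-addressed store and return its key."""
    key = hashlib.sha1(data).hexdigest()
    store[key] = data  # rewriting an existing key stores identical bytes
    return key


def sync_objects(local: dict[str, bytes], remote: dict[str, bytes]) -> None:
    """Copy every remote object into local; equal keys imply equal content."""
    for key, data in remote.items():
        assert local.get(key, data) == data  # could only fail on a hash collision
        local[key] = data


alice_objects: dict[str, bytes] = {}
bob_objects: dict[str, bytes] = {}

shared = put(alice_objects, b"shared snapshot")
put(bob_objects, b"shared snapshot")
alice_tip = put(alice_objects, b"alice's work")
bob_tip = put(bob_objects, b"bob's work")

# The object data merges blindly, like an rsync of the object directory.
sync_objects(alice_objects, bob_objects)

# The refs are the only place where a decision is needed.
alice_refs = {"main": alice_tip}
bob_refs = {"main": bob_tip}
diverged = {
    name for name in alice_refs
    if bob_refs.get(name) not in (None, alice_refs[name])
}

print(len(alice_objects))  # 3
print(diverged)            # {'main'}
```

Merging the object stores is a plain union. The mutable part, the mapping from branch names to tips, is what needs merge or fast-forward logic.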
In order to have useful global non-hash commit identifiers, you would need a separate data structure of references that somehow decides which commits are identical, and is capable of reconciling conflicts globally across all clones of a git repository. I'm pretty sure that this isn't even in theory possible for the general case.
As for signoffs, a change in history might make a change you signed off broken or completely irrelevant, so yes, I do think that a change in history can invalidate a signoff on a commit.
What you ask for already exists: it's called the tree hash (which can be obtained by doing `git log -n1 --pretty="%T"`). The tree hash is unaffected by history, so if you for instance revert a commit, the tree hash will also revert. IIRC Julia uses tree hashes rather than commit hashes to track its packages.
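A small Python sketch of that behaviour, assuming Git's standard object encoding and a flat tree of regular files. The helper names and the fixed author line are mine, not anything Git exposes:

```python
import hashlib
from typing import Optional


def object_id(kind: str, body: bytes) -> str:
    """Git-style id: SHA-1 over '<kind> <size>\\0' followed by the body."""
    return hashlib.sha1(f"{kind} {len(body)}\0".encode() + body).hexdigest()


def tree_id(files: dict[str, bytes]) -> str:
    """Tree id for a flat directory of regular files."""
    body = b""
    for name in sorted(files):
        blob = object_id("blob", files[name])
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob)
    return object_id("tree", body)


def commit_id(tree: str, parent: Optional[str], message: str) -> str:
    """Commit id for a linear history; the author line is a placeholder."""
    who = "A U Thor <author@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += [f"author {who}", f"committer {who}", "", message]
    return object_id("commit", "\n".join(lines).encode() + b"\n")


v1 = {"app.py": b"print('v1')\n"}
v2 = {"app.py": b"print('v2')\n"}

t1, t2, t3 = tree_id(v1), tree_id(v2), tree_id(v1)  # third commit reverts to v1
c1 = commit_id(t1, None, "v1")
c2 = commit_id(t2, c1, "v2")
c3 = commit_id(t3, c2, 'Revert "v2"')

print(t1 == t3)  # True: the tree hash comes back after the revert
print(c1 == c3)  # False: the commit hash still records the ancestry
```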
I'm most definitely not suggesting we should be using patches instead of commits though. I don't want anything to be logically composed of patches at all. (Physical-storage-wise, they can go wild; I don't care.)
Ah, I misunderstood you. I thought you were asking for the identity of a commit to be the identity of the contents of that commit, i.e. the changes - but it seems you're talking about the contents of the working tree at the moment of the commit, with no dependency on prior history.
The thing is, the contents of a commit aren't patches. They are snapshots of the worktree. Your mental model is wrong (sorry), that's why you misunderstood. :-)
This is a common misconception that is corrected in many blog posts and tutorials; it's also explained clearly in the documentation. See the section (aptly) titled "Snapshots, Not Differences", where it says [1]:
"The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time. [...] Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot."
Now of course, as an implementation detail, it only stores diffs based on existing blobs, but except for the obvious speed difference, this fact is completely irrelevant to you as a user. You neither know nor care how it is actually storing its commits. And the thing is, even if you looked underneath, you would have absolutely no guarantee that the blobs are physically stored as diffs against the parent commits. They might be stored as diffs against other random blobs in the repo for efficiency, and the user would be none the wiser.
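To illustrate that the logical object and its on-disk form are separate things, here is a minimal Python sketch. It assumes Git's object encoding (`<type> <size>\0` plus content, hashed with SHA-1) and uses two zlib compression levels as stand-ins for "different physical representations"; it does not model pack files or deltas.

```python
import hashlib
import zlib


def encode_object(kind: str, body: bytes) -> bytes:
    """Logical bytes of an object: '<kind> <size>\\0' followed by the content."""
    return f"{kind} {len(body)}\0".encode() + body


def object_id(raw: bytes) -> str:
    """The id is computed over the logical bytes, never the stored bytes."""
    return hashlib.sha1(raw).hexdigest()


raw = encode_object("blob", b"hello world\n")
oid = object_id(raw)

# Two different physical encodings of the same logical object.
fast = zlib.compress(raw, 1)
small = zlib.compress(raw, 9)

# Whichever one is on disk, decoding it gives back the same object and id.
print(object_id(zlib.decompress(fast)) == oid)   # True
print(object_id(zlib.decompress(small)) == oid)  # True
```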
What's the difference, though? How are patches different from commuting commits? By commuting I mean commits that do not depend on their position in history.
> - [Major] I feel (but could be potentially convinced otherwise?) that there is one very deep fundamental flaw in the semantic model, and that is the fact that the identity of a commit depends on its history.
Commits whose identity does not depend on their position in history are commits that are commutative (with respect to their position in history). So you very much said so, but we obviously appear to be talking about different things. I'm at a loss as to where these things differ.
What? This is like saying you and your younger brother are commutative. It makes no sense. Commits are snapshots, not diffs. i.e. they're variables, not operations. i.e. they're nouns, not verbs. They're as commutative as you and I are.
Oh, I see what you're saying now, I think. You're arguing for commits to completely lose any relationship with one another by default while remaining simple snapshots. I didn't realize this at first since I fail to see the immediate utility of this.
I agree the concept of a standalone snapshot is useful, but I don't think snapshots are the right abstraction when thinking about the evolution of a codebase from a human perspective and consider changes the more important concept.
I mean, the idea that commits are diffs is a (common) misconception about git, likely carried over from another VCS. The snapshot model is the current abstraction; I haven't added any idea of my own here. It's right there in the documentation: https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3...
But I never said changes aren't important and should be neglected. And I'm also not saying there shouldn't be any record of commits' relationships to each other. You certainly can and should record and utilize that information as well. It just shouldn't be part of that commit. (Except maybe if it's a merge commit, in which case the contents of the previous file system snapshot are relevant. But even there, that shouldn't include the hash, which represents all the ancestors.) Information about commits' relationships should be external information, whose manipulation won't suddenly alter the commits or their identities themselves.
This isn't a radical proposal or something. For starters, git's own documentation (which I already linked here) literally says "Git thinks of its data more like a series of snapshots of a miniature filesystem". Well, the snapshot of the file system doesn't include the history of how it came into being, so doesn't that mean that shouldn't be part of your commit? That's already a contradiction with its own principles right there. And going beyond that, most things we do with commits already revolve around the file system snapshots, not the history. e.g. when you sign a commit, you sign the snapshot, not the history. Or when you say this guy is the "committer", you're just talking about the snapshot, not the history. And when a commit gets inserted into the middle of the history, that logically doesn't affect you, and in practice, you don't want it to trash the commit you're on. The identity of your commit is still the same after all -- it's the same snapshot.
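For anyone trying to picture the model being proposed, here is a rough Python sketch. To be clear, this is not how Git works: `snapshot_id` and `history_id` are hypothetical names, the hashing scheme is invented for illustration, and only linear history is handled.

```python
import hashlib


def snapshot_id(files: dict[str, bytes], author: str, message: str) -> str:
    """Hypothetical identity: hash of snapshot plus metadata, no parent field."""
    h = hashlib.sha1()
    for name in sorted(files):
        h.update(name.encode() + b"\0" + hashlib.sha1(files[name]).digest())
    h.update(f"\n{author}\n{message}".encode())
    return h.hexdigest()


def history_id(order: list[str]) -> str:
    """Separate hash chain over the ordering, kept outside the commits."""
    tip = b""
    for commit in order:
        tip = hashlib.sha1(tip + bytes.fromhex(commit)).digest()
    return tip.hex()


a = snapshot_id({"f": b"1\n"}, "me", "first")
c = snapshot_id({"f": b"3\n"}, "me", "third")
before = [a, c]

# A forgotten backup turns up and is inserted between the two.
b = snapshot_id({"f": b"2\n"}, "me", "second (from ZIP backup)")
after = [a, b, c]

print(after[0] == before[0] and after[2] == before[1])  # True: ids unchanged
print(history_id(before) == history_id(after))          # False: only the history hash moves
```

In this sketch the integrity guarantee over the whole history still exists, it just lives in `history_id` rather than in each commit's name.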
- [Minor] I wish metadata was also preserved.