Linus Torvalds proposes a change to the Git commit object format

breckinloggins · on July 15, 2011

OK, finally found a decent explanation of what a Git generation number actually IS:

http://www.spinics.net/lists/git/msg161165.html

pak · on July 15, 2011

Besides analyzing properties of the commit graph, which doesn't seem to be a common task, what about them is so useful that they need to be stored in commit headers now? For instance, what day-to-day operations do they actually make faster? At first, it sounded like it might be for easier navigability (e.g. expanding git-rev-parse) but I don't see that in this description.

davvid · on July 15, 2011

"git tag --contains" needs to know the generation number of old tags, which is expensive to calculate. "git describe" would be faster for the same reason.

tonfa · on July 15, 2011

Interesting. It's also nice to see that Mercurial got this right from the start (it is not generation numbers, but the rev numbers can be used in the same way to stop the exploration).

Some mistakes were made while designing Mercurial; this wasn't one of them.

andrewflnr · on July 15, 2011

So it's almost, but not quite, like the revision numbers everyone else has always had?

iclelland · on July 15, 2011

Yes -- almost, but not quite. If you and I each create a branch off of a commit with gen #123, then, as I understand it, the subsequent commits in my branch would be #124, #125, etc., and your commits in your branch would also be #124, #125, etc.

Contrast this with CVS, where I would have 1.124.1.1, 1.124.1.2, etc., and you would have 1.124.2.1, 1.124.2.2, or with Subversion, where I might get revisions 125, 128, and 129, while the server gave your commits #124, 127 and 130, and someone else, on a totally different part of the project got #126.

As long as development proceeds linearly, on a single branch, then yeah, it's about the same as revision numbers in a centralized RCS -- once you start branching and merging, though, it represents a different concept entirely.
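
A minimal Python sketch of the numbering described above, assuming a generation number is simply one more than the largest parent generation. The `parents` mapping and the commit names are made up for illustration; they are not from the proposal.

```python
# Hypothetical history: "base" stands for the shared commit with gen #123.
# Each entry maps a commit to the list of its parents.
parents = {
    "mine-1": ["base"],
    "mine-2": ["mine-1"],
    "yours-1": ["base"],
    "yours-2": ["yours-1"],
}

# Seed the shared commit with its known generation number.
generation = {"base": 123}

# Dict insertion order lists parents before children in this small example.
for commit, commit_parents in parents.items():
    generation[commit] = 1 + max(generation[p] for p in commit_parents)

print(generation["mine-1"], generation["yours-1"])  # 124 124
print(generation["mine-2"], generation["yours-2"])  # 125 125
```

Both branches end up with the same numbers, which is why a generation number alone does not identify a commit.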

jbert · on July 15, 2011

For a single repository, it does have a very similar interpretation to, say, svn revnos.

You can speak of "revision #125 of a branch" in a specific repository. Which is generally exactly what you need for human-to-human communication about development.

"Can you see if that bug is in r125 of unstable?" "I've got all changes up to r245 of prod"

I guess the confusing aspect would be if "r245 of prod" in the central server was "r100 of prod" in my local repo because I haven't cloned the full history?

noste · on July 15, 2011

It would appear to me that multiple commits in a branch can have the same generation number (see the diagram in http://www.spinics.net/lists/git/msg161165.html ). So unless your history is linear, using generation numbers in human-to-human communication may get confusing really fast.

jbert · on July 15, 2011

In that diagram, I see 2 sequences of commits (two branches).

The 'original branch' (e.g. "unstable") goes: 0,1,2,5,6

Then there is a topic branch (e.g. "add-frob"), which goes: (0,1),2,3,4,(5). Note that I consider the 'add-frob' branch ended at the merge commit, so there is no "revno 6 of 'add-frob'".

I don't consider that merging 'add-frob' back into unstable means that "revno 2 in unstable" could mean commit D - I would call commit D "revno 2 in add-frob".

Does that system work?

stonemetal · on July 15, 2011

No, there isn't a similarity to svn revnos. Rev 125 of a branch in a repo isn't a specific commit; it is a specific depth in the tree. In your example you couldn't talk about a singular Rev 125 unless there really was only one Rev 125. These numbers wouldn't be like Hg's revnos; they would be stable across clones.

jbert · on July 15, 2011

My intention was that a single branch could have an unambiguous revno.

As pointed out elsewhere in the thread, that requires that branch identity is maintained across a fork+merge, but that seems reasonable to me?

fanf2 · on July 16, 2011

In git, a branch is just a ref that points to the most recent commit in a development series. The history of the branch has no explicit link to the branch name. Individual commits do not belong to a branch. So the commits in the history of a single branch will not have unique generation numbers.

praptak · on July 15, 2011

They are somewhat between revision numbers and vector clocks.

rlpb · on July 15, 2011

What I like about git is that it stores only the minimum amount of information, and this makes it easy to explain. A commit hash is a hash of canonical information, not of derived information.

It seems really ugly to store derived information in a commit (specifically, that the hash would be altered by it).

It seems that Jeff has said the same thing, but Linus disagrees. Vocally.

http://www.spinics.net/lists/git/msg161336.html

ryannielsen · on July 15, 2011

From my understanding, they're essentially adding this as an additional bit of information that's minimally required. The currently used timestamps are error prone and thus will be replaced by generation numbers which are more robust. They're still adhering to the principle of only storing the minimum amount of information, they're just adding generation numbers to that set.

In fact, you could make the argument that timestamps are the derived information that git has been storing all along while generation numbers are the canonical information which should have been stored from the beginning. Generation numbers are a result of the state of the tree, while timestamps are derived from the ambient (and potentially incorrect!) environment from which the commit was made.

rlpb · on July 15, 2011

Well, generation numbers can be determined by counting up through parent commits. So they are derived information, it's just that that takes ages and lots of disk seeks to count through.

Timestamps aren't really needed. They are information that is useful to us that we want to store, just like the date in an email. Thus they are as required as the names of the author and committer.

The reason for the discussion about the commit timestamps is (AIUI) a heuristic optimisation that works because they happen to be there and happen to (most of the time) be in order.

cube13 · on July 15, 2011

>Well, generation numbers can be determined by counting up through parent commits. So they are derived information, it's just that that takes ages and lots of disk seeks to count through.

>Timestamps aren't really needed.

First off, timestamps are needed. They're used to order commits in the history. Generation numbers do the same thing, but a bit more elegantly, because they avoid most of the potential clock issues in a distributed environment.

You're making the assumption that the set of derived data and the set containing the absolute minimum amount of data git needs to work are mutually exclusive sets. They're not, especially if the derived data is computationally expensive to get, and is still used for normal operation.

One of git's primary goals is fast, scalable performance. Commit generation numbers help reduce potential errors with the current timestamp approach. However, they're expensive to calculate, and don't scale well at all. Linus' argument is that instead of calculating them every time, it's far simpler to just add them in and be done with it.

ryannielsen · on July 15, 2011

By that definition of "derived information", the hash is "derived information" since it's based on the changes made to source data (whatever that data may be).

That said, point taken about the necessity of both generation numbers and timestamps. But that invalidates the OP's comment about git storing "only the minimum amount of information". It sounds like that's never been a hard principle.

davvid · on July 15, 2011

git does store "only the minimum amount of information".

Here's what Linus had to say about it:

> Generation numbers are _completely_ redundant with the actual structure

> of history represented by the parent pointers.

Not true. That's only true if you add ".. if you parse the whole history" to that statement.

And we've never parsed the whole history, because it's just too expensive and doesn't scale. So right now we depend on commit dates with a few hacks.

So no, generation numbers are not at all redundant. They are fundamental. It's why we had this discussion six years ago.

From: http://www.spinics.net/lists/git/msg161348.html

ryannielsen · on July 15, 2011

Thanks for the background! (Seriously, not trying to be snarky.)

That info does support my original point that generational numbers probably should have been stored from the start and timestamps are the more "derivative" bit of information since it comes from the environment and not the data itself.

Thus, rlpb's concern that storing generational numbers pollutes its design of storing "only the minimum amount of information" isn't necessarily well founded, since the generational number might be more minimal and correct than the current timestamp. That was the aim of my original post: generational info is fundamental, not extraneous derived info, and probably should have been stored with commits in the first place.

derrickpetzold · on July 15, 2011

>It seems really ugly to store derived information in a commit

I don't understand how generation numbers are derived information. They are used to find the position of the commit in relation to another. That makes them information that is essential to the commit. The problem was that, to get around them not being there, timestamps were compared, and that is not reliable for obvious reasons. So I really don't understand why anyone would complain about this.

iclelland · on July 15, 2011

They're derived. You can tell that they're derived information by the fact that you can compute them for old commits, long after commit time, which is exactly what part of this proposal is to do. You can derive them simply by counting the maximum number of steps between a commit and any of its roots. The essential information isn't the generation numbers; it's the structure of the commit history -- the actual chains of commits, with all of the branches and merges. Generation numbers are just an artifact of counting.

On the other hand, this information is very handy, once you have it, for certain algorithms, and it could be expensive to re-compute all the time, which is why the proposal is to generate and store them explicitly. (This is also the reason that timestamps have been used before, even if they were a bit of a hack -- they're readily available, and way faster than recomputing generation numbers all the time)
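
The "counting" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the history is a plain dict mapping each commit to its parents, roots are numbered 0, and the function name `generation_numbers` and the commit names are invented here, not taken from git.

```python
def generation_numbers(parents):
    """Return {commit: generation} for a history given as {commit: [parent, ...]}.

    A root (no parents) gets 0; any other commit gets 1 + the largest
    generation among its parents, i.e. the longest path back to a root.
    """
    generation = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in generation:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in generation]
            if missing:
                stack.extend(missing)  # number the parents first
                continue
            generation[commit] = 1 + max(
                (generation[p] for p in parents[commit]), default=-1
            )
            stack.pop()
    return generation


# Hypothetical history: B forks into C and D, D continues to E, M merges C and E.
history = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "M": ["C", "E"],
}
gens = generation_numbers(history)
print(gens)  # {'A': 0, 'B': 1, 'C': 2, 'D': 2, 'E': 3, 'M': 4}
```

Note that C and D share generation 2, and that the whole history has to be visited to get any single answer, which is the expense the proposal avoids by storing the number.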

mscarborough · on July 15, 2011

I don't generally come across Linus' dev threads, and when I do, it's usually in the context of some linkbaity 'watch Linus smack this dude down' or something of that nature.

This reads like a really productive thread from my limited understanding of git internals. It's pretty cool how much good engineering thought is going into this proposal.

Maybe that's why git rocks so hard.

jacknagel · on July 15, 2011

Minor smackdown from this thread here: http://article.gmane.org/gmane.comp.version-control.git/1771...

gregschlom · on July 15, 2011

Ah! I knew I was going to stumble upon Linus' signature "that's total and utter bullshit" somewhere: http://www.spinics.net/lists/git/msg161348.html

pyre · on July 15, 2011

I like the suggestion of storing the generation numbers in the pack index. When you generate a pack you're already parsing the entire tree. That makes more sense than requiring all future git objects to have 'generation numbers' jammed into them. Especially because it introduces an incompatibility with current git objects, which it would probably be best to avoid.

nplusone · on July 15, 2011

Change last name to 'Torvalds' (edit: name in title changed)

cypherpunks01 · on July 15, 2011

What operations would be sped up by having generation numbers?

I see Jeff King's message that they would make certain bounding traversals faster, but when do bounding traversals need to be computed when I'm using git day-to-day?

Rauchg · on July 15, 2011

It's also about making git less error-prone, since the current timestamp approach seems to be error-prone.

derrickpetzold · on July 15, 2011

I was wondering how they got along without generation numbers for so long. It was by comparing timestamps and those are unreliable because systems can be misconfigured. How they are going to handle legacy repos with that problem I still don't get. I am guessing that history is f'd.

mdwrigh2 · on July 15, 2011

New versions of git will actually go back and generate this information for old commits. This will lead to git being slightly slower when in old repositories until all the commits contain the generation information, but that should happen fairly quickly.

derrickpetzold · on July 15, 2011

I was talking about the case where the timestamps are off. I don't think there is any way to fix that.

ajuc · on July 15, 2011

You just go the whole way up to the root node counting parents (taking max length when there are many routes), so there's no problem with ambiguity. The problem is - it's slow.

iclelland · on July 15, 2011

Not necessarily "all the commits", as I understand it (there's some debate in the thread, so I could be wrong). As long as at least one commit in each merged branch has a generation stored, it's simple to compute without going all the way back to all the roots.
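
A rough Python sketch of that idea, assuming some commits already carry a stored generation (the `stored` dict) and the rest have to be computed by walking parents. The function name, the dict layout and the commit names are all hypothetical; this is not how git implements it.

```python
def generation_with_cache(commit, parents, stored):
    """Return (generation of `commit`, number of commits that had to be computed).

    The walk stops at any ancestor whose generation is already in `stored`.
    """
    generation = dict(stored)
    computed = 0
    stack = [commit]
    while stack:
        current = stack[-1]
        if current in generation:
            stack.pop()
            continue
        missing = [p for p in parents[current] if p not in generation]
        if missing:
            stack.extend(missing)  # resolve unknown parents first
            continue
        generation[current] = 1 + max(
            (generation[p] for p in parents[current]), default=-1
        )
        computed += 1
        stack.pop()
    return generation[commit], computed


# Hypothetical repo: a 1000-commit mainline c0..c999, a one-commit topic
# branch forked at c500, and a merge of the two.
parents = {f"c{i}": ([f"c{i - 1}"] if i else []) for i in range(1000)}
parents["topic"] = ["c500"]
parents["merge"] = ["c999", "topic"]

# Only one commit on each side of the merge has a stored generation.
stored = {"c999": 999, "c500": 500}

result = generation_with_cache("merge", parents, stored)
print(result)  # (1000, 2)
```

Only `topic` and `merge` are computed; the thousand mainline commits are never visited.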

breckinloggins · on July 15, 2011

Can someone explain what generation numbers are? Googling "git generation numbers" pulls up mostly this discussion thread.

I'm assuming they're easy-to-remember incremental numbers tied to commit? Like 1, 2, 3, or tied to commit and branch, like master/1, etc.?

Kliment · on July 15, 2011

Here is how I understand the problem.

At the moment, each commit stores a reference to the parent tree. By parsing that tree and reading the entire history you can obtain a hierarchy of commits. Because you need to order commits in many situations, reading the entire history is extremely inefficient, so git uses timestamps to determine the ordering of commits. This of course fails if the system clock on a given machine is off. With a generation number, you can get an ordering locally from the latest commits, without having to rely on timestamps or read the entire tree.

When you have a commit with generation n, any later commits that include it would have generation >n, so to tell the relation between commits, you only need to look as far back as n, and you can immediately get the order of any intermediate commits. It has nothing to do with "easy to remember". It's about making git more efficient and robust.
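
The "only look as far back as n" part can be sketched as an ancestry check with a cutoff. This is an illustrative Python sketch, assuming generation numbers are already known for every commit; `is_ancestor`, the `history` dict and the commit names are invented for the example. It is roughly the kind of question a "does this tag contain that commit" query has to answer.

```python
def is_ancestor(ancestor, descendant, parents, generation):
    """True if `ancestor` is reachable from `descendant` via parent links.

    Any commit other than `ancestor` whose generation is <= that of
    `ancestor` cannot have it in its history, so the walk stops there.
    """
    cutoff = generation[ancestor]
    seen = set()
    stack = [descendant]
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen or generation[commit] <= cutoff:
            continue  # already visited, or too old to contain `ancestor`
        seen.add(commit)
        stack.extend(parents[commit])
    return False


# Hypothetical history: B forks into C and D, D continues to E, M merges C and E.
history = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "M": ["C", "E"],
}
generation = {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3, "M": 4}

print(is_ancestor("D", "M", history, generation))  # True
print(is_ancestor("D", "C", history, generation))  # False
```

In the second call the walk ends at C immediately, because C has the same generation as D and so cannot contain it; B and A are never read. No timestamps are involved, so a skewed clock cannot change the answer.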

Peaker · on July 15, 2011

Why call it a "generation number" and not "depth"?

caf · on July 15, 2011

Perhaps because "Generation" continues the Parent/Child allegorical language.

rs · on July 15, 2011

Think they're using "generation" here in the context of number of parents (yes, "depth" would work fine as well, but is a more general term)

pyre · on July 15, 2011

  > number of parents

s/parents/ancestors/

But that statement only holds up on branches without any merge commits, because the actual algorithm does not just total up the number of ancestors.

beza1e1 · on July 15, 2011

Since gitk and others put new commits on top, I'd propose "height" instead of "depth" ;)

macrael · on July 15, 2011

I'd love an explanation of what "generation numbers" are.

pufuwozu · on July 15, 2011

Looking at the diff, it seems like a generation number is just the number of parents that a commit has. For example:

Commit a553af has no parents, it has generation number of 0.

Commit c464e0 is the next commit, it has generation number 1.

And so on. Branches count independently of one another. When commits have multiple parents (e.g. merges), the generation number starts counting from the previous maximum.

pyre · on July 15, 2011

  > just the number of parents that a commit has

It's the max length from a commit to reach a root node in the graph. So any time that you hit a fork in the path you take the longest route.
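
Written as a recurrence, and assuming roots are numbered 0 as in pufuwozu's example (the notation here is only for illustration):

```latex
\operatorname{gen}(c) =
\begin{cases}
0 & \text{if } c \text{ has no parents},\\
1 + \max\limits_{p \,\in\, \operatorname{parents}(c)} \operatorname{gen}(p) & \text{otherwise.}
\end{cases}
```

For example, a merge whose parents have generations 2 and 4 gets 1 + max(2, 4) = 5, which matches the "take the longest route" rule.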

malkia · on July 15, 2011

http://news.ycombinator.com/item?id=2765942

>> OK, finally found a decent explanation of what a Git generation number actually IS: http://www.spinics.net/lists/git/msg161165.html