The Architecture of Git (2012) (aosabook.org)
355 points by wheresvic1 on Oct 26, 2018 | hide | past | favorite | 79 comments



The best informal intro to the architecture of Git that I know of is The Git Parable: http://tom.preston-werner.com/2009/05/19/the-git-parable.htm...


Also greatly worth reading: “Git from the Bottom Up”:

* HTML version: https://jwiegley.github.io/git-from-the-bottom-up/

* PDF version: http://ftp.newartisans.com/pub/git.from.bottom.up.pdf (via http://newartisans.com/2008/04/git-from-the-bottom-up/)

The great thing is that after reading and understanding these, one's mental model matches the reality of the Git program, so one can both try bolder things, and get unstuck from any mess.


I recommend this constantly, it is excellent.

Not to outright beginners, but anyone past that point should read this. It does a great job of clearly presenting an accurate mental model that helps you use git. If you are already a git expert, it is still a good thing to learn from in general, and for how to explain and teach git usage in particular.

The only documentation I know of that can turn people from cargo-cult git users into people who just do the version control things they need done, with the parts of git they need. That is damn useful.


I've always liked the low-level explanation at https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

It's a great exercise for the reader to recreate the core functionality of making git commits with just a few command line utilities (echo, wc, xxd, shasum).
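As a sketch of that exercise (in Python rather than the shell utilities mentioned, assuming the SHA-1 content addressing described in that chapter): git hashes a `blob <size>\0` header followed by the file contents, which is exactly what you would reproduce with echo/wc/shasum.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object id git assigns to raw content stored as a blob:
    SHA-1 over a 'blob <size>\\0' header followed by the bytes."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# These match `git hash-object --stdin` for the same bytes:
print(git_blob_id(b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

The shell version of the same header trick is along the lines of `{ printf 'blob %s\0' "$(wc -c < f)"; cat f; } | shasum`.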


IIRC, that's how they were implemented in the first place.


[flagged]


> Ah, ha ha ha ha ha ha ha ha ha ha ha

Would you please not post like that here? It degrades discussion.

The rest of your comment is ok, though the reader would learn more if you explained what the problem actually is.


I had come across this gem about understanding the vi editing model a while ago:

"Your problem with Vim is that you don't grok vi."

It's the top answer on this StackOverflow question:

What is your most productive shortcut with Vim?

https://stackoverflow.com/questions/1218390/what-is-your-mos...

Wonder if there is any such post about Git that cuts to the chase, even for a part [1] of the Git model, and explains it clearly [2].

[1] I had also come across a StackOverflow post that explains a part of git very clearly, like the vi example I quoted. I think it was about how to roll back accidental changes using "git reset --hard" and variants. Saved it, but don't have it handy right now.

[2] Note: I said "clearly", not necessarily "simply". I like to quote the (probably out-of-print) book by Abbe Dimnet [3] called The Art of Thinking, in which he said something like this (while deploring the trend of books that try to make things artificially simple, a.k.a. dumbed down):

"French grammar cannot be made simple. It can be made clear."

[3] https://en.wikipedia.org/wiki/Ernest_Dimnet

He also wrote a book on that same topic (French grammar made clear). I googled the former book recently and learned about the second one. I've read the first, long ago, after finding it in a second-hand bookshop. Good book. Apparently it was a best-seller when it came out, according to Wikipedia.

Quotes by him:

https://en.wikiquote.org/wiki/Ernest_Dimnet


I'm surprised that it's not mentioned in the article that one of the most interesting architectural aspects of git is that it's a blockchain system.


That's arguably not the most interesting architectural aspect of Git.

Or does any back-linked tree data-structure become interesting if the nodes keep a hash of their parent instead of a raw reference? I don't think that's the case.

It might be a bit heretical, but I don't think Git has a super interesting internal architecture. I'm not downplaying the fact that Git was very innovative, especially considering the landscape of SCMs at the time. The tool as a whole is great and has desirable properties, but its internals don't strike me as particularly innovative. It's a clever composition of solutions to well-established problem domains. And in that aspect it is a beautiful engineering solution, although there is room for a lot of improvement in terms of UX.

And in addition to that, I would argue that it would be a very weak definition of "blockchain". The innovation in Bitcoin is the incorporation of proof-of-work and resulting alignment of incentives such that it can achieve probabilistic consensus in an adverse setting and with some degree of asynchronicity.

The underlying structure of the data is an obvious choice because it is simple and "captures" the idea of aggregate global state, but it's also hardly an important innovation. UTXOs are more significant.

And also, recall that the textbook example for state-machine replication is always an append-only log. So that's not the crux of it, or of blockchain, in my humble opinion.


I said one of the most. I didn't declare it the most...


It's a merkle tree, not every hash tree or chain is a blockchain thing.

IMO blockchain describes the combination of a hash chain and the no-trust agreement protocol on top.
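The hash-chain half of that combination is easy to sketch (illustrative code, not git's actual object format): each node's id covers its parent's id, so rewriting any ancestor changes every descendant id. That gives tamper-evidence, with no agreement protocol involved.

```python
import hashlib

def node_id(content: bytes, parent_id: str) -> str:
    """Hash a node together with its parent's id (git/Merkle style)."""
    return hashlib.sha1(parent_id.encode() + content).hexdigest()

root = node_id(b"initial commit", "")
child = node_id(b"second commit", root)

# Rewriting the ancestor invalidates every descendant id...
tampered = node_id(b"rewritten history", "")
assert node_id(b"second commit", tampered) != child
# ...but nothing here chooses between two competing histories;
# that no-trust agreement protocol is the "blockchain" part.
```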


I don't agree with that. The oldest blockchain known has its hash codes published each week in the New York Times. I would agree that having the ability to prove the history hasn't been modified by externally retaining the hashes could be a component, but git qualifies for that as well.


I would recommend reading the Wikipedia article for a blockchain [0]. It's agreed that a blockchain is decentralized and resistant to change from bad actors. If a system doesn't meet those requirements, it's just a fancy database. I don't see how Git or the New York Times example you mentioned are inherently decentralized.

[0] https://en.wikipedia.org/wiki/Blockchain


The New York Times example is by the inventors of blockchain.

https://www.anf.es/pdf/Haber_Stornetta.pdf


Have a look at https://en.wikipedia.org/wiki/Linked_timestamping

This is technology from the 90s. The innovation of blockchain is the no-trust agreement protocol.


But when was the term 'blockchain' used for the first time?

This is just a matter of definition. You can call git blockchain or not, the only fact is that git shares the merkle tree property with Bitcoin.


This is from 2012, so it makes more sense to be surprised that it doesn't mention git being a crowdfunded, gamified, phablet-compatible cybertool as-a-service on the cloud.


It was written before “blockchain” was a well known term.


Git predates the creation of Bitcoin, and came 13 years after the invention of blockchain.


Very informative. This explains well why Linus once said signing every single Git commit is unnecessary and only means you have automated the signing process. [1]

[1]: https://news.ycombinator.com/item?id=12290873


Not sure if it’s of any relevance but I recently discovered https://gitup.co and must say, it’s pretty cool.


Frickin' amazing UI design, and some great pieces of functionality (edit the commit graph directly; ability to undo most operations).

Unfortunately development seems mostly paused, and it still has some big gaps...

Still, though, it's by far my favorite git client. (#2 would be Sublime Merge.)


Fork (git-fork.com) is pretty neat as well


I still recommend that people watch this video from the man himself, Linus Torvalds, if they really want to understand the architecture and data structures behind git - https://www.youtube.com/watch?v=4XpnKHJAok8

If I remember correctly, he mentioned that he wrote the MVP in 4 weeks, and the data structures he used are quite simple. I never got a chance to look at the source code, but I guess it's on GitHub (at least a mirror copy).


It only took 4 weeks plus 30 years of programming experience and a genius mind.


The whole website is a treasure trove for learning about system design.


"The Architecture of Open Source Applications" is great. I highly recommend their other articles, as they are all interesting and informative. If you can, throw a few bucks their way by buying the PDF or paperback versions.


I still prefer my book's history of Git, in which I compare Linus's rejection of Monotone to the movie "Back to the Future":

https://github.com/xrd/BuildingToolsWithGithubBook/blob/ca7f...


Interesting, what am I reading?


This (a book about building tools with Git and the GitHub API) was published by O'Reilly in 2016, and we just released it under creative commons. You can read it completely free at https://buildingtoolswithgithub.teddyhyde.io or get the repo of the book contents above.


I've found this online course by Paolo Perrotta to be the best introduction to Git's architecture, by far: https://www.pluralsight.com/courses/how-git-works

He takes you pretty much from first principles all the way up to how remote repositories are tracked. It's been far more useful to me than simply trying to learn the CLI.

I realize it's behind a paywall, but I'd recommend signing up for the free trial just to watch this course (if you're curious about the inner workings of Git).


Git is the VCS most suitable for bottom-up learners. Which is the VCS most suitable for top-down learners?


C:/Users/Billy/Desktop/Backup/New Folder (39)/IMPORTANT/My_Project (OLD)/My_Project.sln.bak


Darcs, probably? Darcs has one of the easiest UIs in the DVCS game and was always very easy to teach people how to use.

But then, of course, if you want to learn the details of how Darcs works under the hood, you wind up in Haskell, feeling a need to read a lot of academic papers.


I'm a bit disappointed by the flippant replies. Mercurial is a very solid alternative VCS (Facebook uses it internally) and is friendlier (or at least has fewer gotchas) than git.


Mercurial is the obvious first alternative. It supports many of the same concepts as git but with a lower barrier to entry. A lot of this has probably been said many times before but I feel it's worth reiterating.

1. No staging area. This makes commit+push a 2-phase operation rather than 3-phase, and removes a lot of confusing concepts, state and commands. Selectively committing files and chunks of files is still possible, but it is all done in the commit operation. (It doesn't track the history of an edit as git does, where you can skip the last edit in one file, but who really uses this feature? If you are such an advanced user you might as well use stash or amend your last commit.)

2. Branches are not just pointers to the tip of a graph. This makes it easier to understand what was master and what was dev after a merge and master can't suddenly point elsewhere.

3. All command line commands are sane by default as opposed to git where you almost always need at least one flag to get correct behaviour. And there is usually only one way to do things right.

4. Incremental version numbers (locally only) makes it look more friendly than commit hashes.

5. Much better GUI tools available. Don't know why, since the market is smaller, but probably the less complicated internal model makes it easier to build a GUI representing it.

Even with all those, I still don't know if it is easy enough for non-developers, or if any SCM with a commit/push/pull model ever can be. The whole concept of merge conflicts is very complicated, and even seasoned developers need to go get an extra cup of coffee when a conflict pops up in the middle of what was expected to be a trivial rebase.

The only way to avoid merge conflicts fully would be instant edits, such as in Google Docs, but I don't think that would work very well for code, where different people would break the build constantly.


As a flippant replier, it's partly motivated by not really thinking other VCSes have a chance.

As confusing to use as Git's UI may be, I think the only way we're getting away from it is if someone massive pushes an alternative.

Alexa.com has GitHub trending down six places to the 66th most visited page on the internet. Worldwide. That's not just programmers, that's everyone.


> not really thinking other VCSes have a chance

There's no reason the world has to universally adopt one true VCS.


I can observe that other VCSes today are unlikely to thrive without making the proclamation that there can be only one. The two ideas are not mutually exclusive.


    > Which is the VCS most suitable for ....
Git.

It doesn't matter what your question is. :-)

I mean, come on, it's not like you have a choice. You gotta use whatever your teammates/coworkers/organization is using.

That said, yeah, there are a lot of pedagogic problems with git stemming from the INSANE inconsistency of its command line. The only redeeming quality? It works and it's popular.


I really hope Git isn't the end of the road for VCS. I work in visual effects and video games. Most video game studios still begrudgingly use Perforce for project data. Many also have a separate Git server for code. Perforce or SVN is still the go-to solution for binary assets. Trying to explain Git to programmers is difficult enough; for artists who have never used the command line it is unreasonable. Every time I've seen an "asset library" it's usually written from scratch instead of built on another VCS. I've made some crude attempts at building on Git and the data structure just isn't appropriate.

I know it's a bit unreasonable to expect the same tool for everything (but that's what you were implying). The binary problem is more manageable for code with a few UI assets (like icons), but isn't great.


I work in web development and Git has usually been good enough for our projects. But recently we have had a few projects that used large files and started using git-lfs in those. Is git-lfs not good enough for storing large assets in a git repository?

Speaking of non-coders using a VCS, our designers use Sketch (a macOS app) and until recently have used Dropbox for sharing and put the date into the file name as a form of version control. But they have now started using the Abstract app (https://www.goabstract.com/) which is a fancy UI around git (I think, not sure though), but none of that VCS complexity is leaking through. And they seem to like it. So maybe all it takes is a custom GUI that's tailored for a narrow and specific use case.


On those web projects, is Git used just for versioning the delivered assets (psd, jpg, and gifs) or is it used for the working files (Illustrator, Photoshop, etc)? I've used it for the former, which is why I said it was more manageable, but not a great solution. If you were to treat it like we do code, you'd only commit the working files and have a build script to generate the deliverables.

Those are tooling problems, but I also think there are architectural problems. I don't have a lot of experience, but I have looked at git-lfs. You need a separate repo and also a separate path for data, right? It's also an add-on. It's all working around Git itself. For artists, what's the value add over Perforce or SVN? I can see that maybe you could use the same tools as the coders, but you have a bunch of new problems. I'm not saying it doesn't have its place. I can see myself using it in the future; it just doesn't look like an out-of-the-box solution.

A few years ago I was toying with something that would work more like a lot of backup systems do (and I think macOS attempted something like this a few years ago and abandoned it). Each time you save, it would make an auto-commit; if your app has integration, it takes a screenshot and stores any metadata from the scene (this was targeting a 3D application, but with the goal of also working with the filesystem directly). You could also make an explicit commit and give a commit message. Auto-commits would get flattened into hourly, daily, weekly. Explicit commits would stay around as long as you'd want. Git was a poor backend for this because I don't think you can merge commits in the background; that also rewrites history. This was focusing on version control for an individual, ignoring collaboration and merging.

The whole reason Git was written is that its data structures facilitate the needs of the Linux kernel's programmers more cleanly (“Bad programmers worry about the code. Good programmers worry about data structures and their relationships.”). I feel like a lot of these Git tools for artists are coercing Git's data structures into an awkward workflow. That's the biggest reason why I hope it's not the end of the road for VCS.

Thanks for introducing me to Abstract. I know I've seen similar attempts in the past. I hope it's useful to artists and I hope Abstract is a successful business, but I'm a bit bummed that Open Source is relegated to a sub-group of programming.


Based on my brief stint in the game development industry, it seems that there are two major obstacles that keep artists from adopting Git. One is the tooling: most of the tools artists use have integrated support for Perforce. This problem is not trivial, but it's something that could ultimately be overcome.

A much bigger obstacle is the workflow. The reason why coders can work with a DVCS is because merging in other people's changes is something that can be clearly defined and executed. When it comes to assets, there is no clear definition of whether and how you can merge changes. It's not even clear for which kinds of assets diff and merge are concepts that make sense.

There's a lot of work that could be done there, but it doesn't seem like it will be done any time soon, mostly because game development is one of the most single-minded fields in software development industry; the only thing that matters is momentum and meeting the deadlines at all costs (witness the latest Rockstar Games controversy about 100-hour weeks) and everything else is secondary.


In practice, most people don't expect binary files to diff or merge. It's more about how mergeable groups of files or commits are. I've seen a few tools that diff certain image formats, which can be helpful, but I don't think this is all that different than what Git does with text.

I agree that the bigger issue is workflow. Often after setting them up I get asked for a checkout/locking model, because that's easier to police and gives you a clear straight-line history (anyone who has used it is also aware of the downsides). Git's permissive system, where you can make changes but then have to resolve them before committing, shifts the problem around. This seems to be more of a problem for them? Maybe because they're more medium-sized and have more of a flat hierarchy?

I was going to bring up Unreal Engine. It has had Perforce integration for a while, and Git integration has been in beta (I wouldn't be surprised if that has changed in the last few years). It abstracts everything into the engine UI (so no things like branching). It performed way worse. Their docs have a lot more caveats.

https://wiki.unrealengine.com/Unreal_Project_Git_Workflow_(T... https://wiki.unrealengine.com/Git_source_control_(Tutorial)

To be fair, I don't see indie studios bothering with Perforce, either.


> To be fair, I don't see indie studios bothering with Perforce, either.

What then? SVN?


Nothing. If you're lucky they'll have their own obtuse naming structure of "carrot_model_FINAL_FINAL_12.psd". With something like Unreal Engine, Maya, Photoshop, etc. you can get a few knowledgeable artists and build a game with minimal technical skill. If it's not as turnkey as using Dropbox or Windows Backup, they probably won't use it.

I started at a place in 2011 that had been open since 1993 and had spun off some software that is industry standard. They were proud they had recently got everyone using SVN. For the entire life of the company teams used whatever they wanted and most used nothing. They're still using SVN.


I don't understand how it is possible to work like this. As soon as I started learning programming I immediately wished for and imagined some kind of version control. I didn't have to imagine much, as I very quickly learned about cvs (the standard back then).

My partner, a scientist who deals with all kinds of data and code and LaTeX, has expressed the need for version control too (sadly I have not been able to explain git well enough for it to make sense to her, so she could use it).


I've gone back and forth (and I probably spend more time thinking about tools at the expense of solving problems). I have a lot of git repos with either 1 commit or where I never even bothered committing. If you're programming you can just tar up the source folder and call it a day. 95% of the use case for VCS is committing to a single tree. Depending on what you're doing, it's rare you'll even look backward. If so, timestamps and usernames are often good enough (you won't even look at the code). It takes a lot of upfront work to be able to write a new unit test and then run it on the source history to see when the regression was introduced (even though I think that's really cool).

The huge benefit of VCS is project management, which most people brush off as much as possible. Git and SVN provide models where responsibility is clear. For one person, it's a bit ridiculous to stage, commit, push, merge, build, test, release, but it's necessary when you get a few people involved or want to share the source code.

Artists are a different beast. First off, artists are often delivering art and never have to open the project again. It's like if programmers were hired for a couple hours to a couple weeks and only ever handed over binaries with no expectation to ever support or revisit them. For programming, a directory along with a series of files makes up a project. For a lot of artist tools there's a monolithic binary project file. Generating "deliverables" is often done manually or configured inside the monolithic project file, where programmers would write a separate build script. Both programmers and artists mix project files, intermediate data, and deliverables all together. Programmers and scientists are the only ones with the motivation and technical skill to tease that apart and push for changes.


    > unreasonable to expect the same tool for everything
Git is not the end of the road for vcs.

While I am sure that there are scenarios where other tools are better suited, the question I was responding to wasn't signaling any of those relatively edge-case scenarios (like binary assets in games).

The sad truth is that Git now utterly dominates version control, in such a way that you pretty much have to master it to be able to work on a team of developers because, most likely, they are using git.

Whatever the case, few of us get to choose what vcs we get to use. So if it is not git, it is perforce, subversion, or something else or even nothing.

"Just adapt" is what I should have said first.


It works and it's popular and the underlying model is easy to understand.

I think step one to replacing git will be to launch a tool that completely replaces the git CLI but uses the same model below. I've not looked into EasyGit much, but I don't think it goes far enough.


People once said that about Subversion.

I believe Mercurial has a chance if they get support for large binary files right before git does.


Subversion wasn't a distributed VCS. Anything looking to displace git would have to make that kind of generational leap for the existing userbase to be remotely interested.


Most git users do not use the distributedness. So maybe Subversion will make a comeback, similar to how people now realize that monoliths (instead of microservices) are often better.

Distributed has disadvantages. For example, Subversion has nicely incrementing revisions like r42, where git has 8ab52fd9ee.

Mercurial is distributed as well.


Being distributed is really more of a 00s concept anyway. It makes total sense that distributed systems would be replaced with more centralized ones as the internet itself goes through a similar transition.


Dropbox, perhaps.


What are top-down and bottom-up learners?


Plastic SCM.


Copy/paste.


I loved the AOSA books, but it's been some time since they published a new one.

Does anybody know if there is something new in the works?

Or know of any other similar books?


Have you already read all of them? I was calculating how long it would take me to read and understand every chapter of every book. I usually like to write small snippets of code or look for references and stuff for books like these.

Anyway, if I read, say, one chapter every Monday, it would take me about two years to complete the books!

Probably not a problem that there's "nothing new" since the books seem to be more about timeless design principles and less about novelty.


Ok, time to contradict myself. Two other books on my back burner in the same ballpark:

* Beautiful Code

* Programming Pearls


The fourth paragraph in section 6.2 refers to "BitMover" with no previous context. Copy-paste error?


There is context in the intro (see BitKeeper).


I found nothing about BitMover here:

http://aosabook.org/en/intro2.html


Sorry, I meant in the Git background section 6.2; it briefly mentions BitMover as the developer of BitKeeper, but that's all you need to know about it.


> To understand Git's design philosophy better it is helpful to understand the circumstances in which the Git project was started in the Linux Kernel Community.

I don't understand this sentiment. It's not helpful to know the history at all. At best, it romanticizes the choices made. Stating the goals would be an intro that shows some level of analysis.


No, the author is right. Git is, at its core, a database for managing patches. Understanding the needs of Linus Torvalds in his role as kernel maintainer is about the only good way to understand why git is so strangely designed.


Darcs is a database for managing patches. Nothing in Git inherently cares about patches. To a first approximation it's a database for managing full snapshots of trees of files.


Git only needs lists of files because it needs entry points into its lists of patch fragments that make up the file and to assign file names to them. Other than that, a changeset is just another name for a patch that can be added, altered, rewritten or removed. That makes git a patch database in my book.

Darcs feels more like a research project to me. The developers try to find a theoretical foundation on which they can base a VCS, but they have not managed to make their theory work with the level of perfection that they want. But if they eventually get it right, it will probably have the provably best text-based merge tool possible.


This is not how Git's data model works. You may be thinking of delta compression, which "git gc" applies across content in the repository, purely as an optimization step.

But that's purely an optimization that has nothing to do with the intrinsic data model. There's no point at which the patch output you see with "git diff/show" is actually stored as-is in Git. It's computed on the fly.

This separates Git from many other SCMs, where patches or other deltas are permanently stored at the time of commit in a way that can't be modified afterwards.

The distinction matters because those systems generally have storage that doesn't compress as well, since they need to compute and store a diff at commit time, whereas a system like Git can keep finding better delta candidates as history progresses.

This goes all the way back to the likes of RCS. The Subversion FSFS backend also works like this, and I believe Mercurial to some extent, and certainly Darcs since storing a history of patches is what it's for.
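A toy content-addressed store (hypothetical names, not git's real object format) illustrates the snapshot model this comment describes: each commit records a full snapshot of the tree, unchanged content is deduplicated by its hash, and a diff is derived after the fact by comparing two snapshots.

```python
import hashlib

store = {}  # toy content-addressed object store: id -> bytes

def put(content: bytes) -> str:
    """Store content under its hash; identical content shares one entry."""
    oid = hashlib.sha1(content).hexdigest()
    store[oid] = content
    return oid

# Two "commits", each a full snapshot of the tree (path -> blob id).
commit1 = {"README": put(b"v1\n"), "main.c": put(b"int main(){}\n")}
commit2 = {"README": put(b"v2\n"), "main.c": put(b"int main(){}\n")}

# The unchanged file is stored once, not copied per commit.
assert commit1["main.c"] == commit2["main.c"]
assert len(store) == 3

# A "diff" is computed on the fly by comparing snapshots.
changed = [path for path in commit1 if commit1[path] != commit2[path]]
assert changed == ["README"]
```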


Curious why you say it's strangely designed. Seems pretty rational to me.


There are three things in git that I consider design errors:

- The staging area/index/cache should not even exist. That the same construct has three interchangeable names is already a sign that something is wrong. The fact that that construct is used to confusingly stage snapshots of files for committing as well as merge operations makes it an unwieldy thing that has probably teleported too many lines of code into the digital nirvana already.

- Branches should be immutable properties of changesets instead of flimsy, easily deleted tags with special flags. Deleting a branch after a merge makes it impossible to tell which branch in the history was the master and which the feature branch.

- Git's graph of changesets is also too lightweight and is missing forward references. This is the reason why deleting branches irreversibly deletes their entire history. The reflog is only a crude hack around that, and it exists only because the crude data structures require taking stock of the entire set of internal references to figure out that a certain part of it (an "object", but essentially a file in the repository) is actually no longer referenced and can be removed.

I can probably come up with more reasons why git is very flawed. But this is enough fuel for the fire for one post.


Maybe -- but the index is ridiculously useful. Being able to commit some of your changes is part of what makes git so much more useful than something like Mercurial.

To be honest, you may have a complaint for 2 and 3, but I'm not sure what it is, as I've never had any of the issues you bring up.


Mercurial and git both have commands that allow you to select part of a change to commit, and both allow you to amend an existing commit. No index necessary.


If it had been started in e.g the game development scene there would have been some other design decisions made. The idea that the source tree is rather small and the history is short and there are few binary assets managed are definitely showing in the design.


The goals are stated, just a few lines down in the same section you quoted. What makes you think design goals and philosophy aren't formed by history? How does knowing the history of a project romanticize it? Personally, I'd assume exactly the opposite: that knowing history is the only way to understand the goals and philosophy, and that romanticizing only happens when a history is not understood and/or the story is changed or told through a tinted lens.


I am fine with having extra sentences to read.


By understanding the real world motivation you can reason about both the stated and unstated goals of the project.



