Git for Computer Scientists (2010) (eagain.net)
259 points by thanato0s on June 13, 2021 | 101 comments



I just think of git as a graph and branches/tags as text pointers to nodes in the graph. Doesn't seem that complex to me...

Maybe I "got gud" though and can no longer empathize with git beginners
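
For anyone who wants to poke at that mental model, here is a minimal sketch in Python. It is a toy, not how Git stores anything: the names `commits`, `refs`, `commit`, and `history` are made up for illustration, and commit ids are plain letters instead of hashes.

```python
# Toy model (not Git's real storage): commits are nodes, refs are names for nodes.
commits = {}   # commit id -> tuple of parent ids
refs = {}      # branch or tag name -> commit id


def commit(new_id, branch):
    """Add a commit on top of `branch` and move the branch pointer to it."""
    parent = refs.get(branch)
    commits[new_id] = (parent,) if parent else ()
    refs[branch] = new_id


def history(ref):
    """Walk first-parent links from a ref back to the root commit."""
    out, node = [], refs[ref]
    while node is not None:
        out.append(node)
        parents = commits[node]
        node = parents[0] if parents else None
    return out


commit("a", "master")
commit("b", "master")
refs["feature"] = refs["master"]   # creating a branch only copies a pointer
commit("c", "feature")
refs["v1.0"] = "b"                 # a tag is a pointer that is never moved

print(history("master"))   # ['b', 'a']
print(history("feature"))  # ['c', 'b', 'a']
```

Nothing in the graph changes when a branch or tag is created; only the `refs` table gets a new entry.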


The git plumbing and plumbing commands are straightforward and easy enough to understand once you read about them a bit (I recommend the free Pro Git book online).

The original git porcelain commands - git branch, git reset, git pull - are execrable. They’re filled with implementation details (index/cache vs staging), weird and suggestive syntax that seems like it should be extensible and widely applicable but isn’t (localbranch:remotebranch), and nuclear-powered self-destruct functionality hidden amongst playthings (git reset vs git reset --hard).


It sounds like git in general isn't necessarily the problem (at least after getting the basic model down), it's specifically the interface and associated foot-guns it sticks in there for beginners (and tired experts) to trip over.

Most people most of the time will get by if they grab a decent git GUI, figure out the minimal set of operations they need, and just Google the rest when necessary.

My stupidest git mistake was when I was cleaning out a directory of bin and obj folders and included git's obj folder as well. And of course, after crying in the corner for a bit, I took a little time to look into git commands and found I could have just run 'git clean'.


Two additional bad defaults: crlf handling on Windows, and pull not defaulting to rebase.

The message "up to date with origin/master" is also misleading, because it doesn't check the remote itself.


Pull defaulting to rebase could be a dangerous and chaotic default. If you want to argue that pulling should be fast-forward-only, then I'd say maybe you have a case.


I have aliases ff for fast-forward-only merges and puff for fast-forward-only pulls. I never type `git pull` anymore, and it's much harder to shoot your foot off with the aliases.
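
The aliases themselves are not shown, but `git merge --ff-only` and `git pull --ff-only` are the real flags they presumably wrap. The rule those flags enforce can be sketched on a toy graph: a fast-forward is only possible when the local tip is already an ancestor of the incoming tip. The `parents` data and function names below are hypothetical.

```python
# Toy history: commit id -> list of parent ids (hypothetical data).
parents = {"a": [], "b": ["a"], "c": ["b"], "x": ["a"]}


def ancestors(commit):
    """Return the commit plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def can_fast_forward(local_tip, remote_tip):
    """True if the local tip is already part of the remote tip's history."""
    return local_tip in ancestors(remote_tip)


print(can_fast_forward("b", "c"))  # True: the pointer just moves from b to c
print(can_fast_forward("x", "c"))  # False: x and c have diverged
```

In the diverged case a fast-forward-only command refuses to do anything, which is why it is harder to make a mess with it.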


`git checkout`: great for switching to a different commit and for throwing away local unstaged changes!


Using `git checkout` with a filename really really really should have a yes/no prompt by default, or at least an `-i` option like `rm`.


I think `git switch` is an attempt to resolve this


Yeah, something along these lines is how I've explained it to co-workers as well. Tossing in "git log --all --oneline --decorate --graph" so they can actually see the graph also helps a lot.


    gitk --all
is even better. Most people don't work in front of VT100s today.

Actually I work most of the day remotely in `screen` on my office machine. But in case of complicated/confusing/buggy history/branching I fetch the repo so I can run gitk locally.


Because that's exactly what it is, and it's not. When I explain git to newbies in this way, it's like something clicks in their brain and they just start to "get it" as well.


But then once you get the mental model you spend the other 90% of the time figuring out what magic incantations are needed to transform the graph in the way you want.


The sad thing is you can't just "figure it out." Most folks do "good-enough" after some coaching and memorizing a limited set of commands needed for their workflow-- until something unexpected happens.

It could be a typo, or trying something new, or forgetting/misunderstanding the intent of some counterintuitive command, or maybe cleaning up an existing problem. All those things can easily put someone in a deep rabbit hole inside the git book or, worse, a Google search.


Well it's understandable. Moving nodes in a graph (a tree, really) has a lot of side effects. Couple that with multiple people trying to keep things in sync(ish) and it gets super complex.
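
One of those side effects can be sketched in a few lines. In Git a commit's id covers its parent ids (along with the tree, author, and message), so moving a commit onto a new base gives it and every descendant a new id. The hash below is a simplified stand-in that only covers the parent id and the message; `commit_id` and `build_chain` are made-up names.

```python
import hashlib


def commit_id(parent_id, message):
    """Simplified stand-in for a commit hash: it covers the parent id and the message."""
    return hashlib.sha1(f"{parent_id}\n{message}".encode()).hexdigest()[:8]


def build_chain(base_id, messages):
    """Stack one commit per message on top of base_id and return the new ids in order."""
    ids, parent = [], base_id
    for message in messages:
        parent = commit_id(parent, message)
        ids.append(parent)
    return ids


work = ["add parser", "fix typo"]
before = build_chain("old-base", work)
after = build_chain("new-base", work)   # same changes replayed on a different base

print(before)
print(after)
print(any(x == y for x, y in zip(before, after)))
```

The messages are identical in both runs, yet the ids differ, which is why anyone who already has the old commits sees the moved ones as brand new.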


90% of the time you can achieve what you want with `git rebase -i`


Especially if they took a maths course on graph theory.


It could be hard if you never took Discrete or another course that introduces graph theory. Or if you cheated your way through or barely scraped by. I can see how a CS freshman or someone from another field might struggle, but even then it's more comprehensible than any of the alternatives.


The hard part of git has never been understanding its graph model.

The hard part HAS ALWAYS BEEN memorizing all those badly named and counterintuitive commands.


Even that is only true if you're coming from a system where branch, tag, and checkout mean something else.


I think the hard part of Git isn't the bit you've just described. It's the terrible CLI and terminology.

Give any Git newbie a decent GUI and a translation for Git terminology into sane terminology and they will have no problems.


I think that's a fine perspective for a computer scientist or graph theorist, myself. Fortunately, since the article title is "Git for Computer Scientists," that's essentially the approach the article takes. :-)


The 'complex' part usually relates to how to manage the graph in terms of what you want to do, and all the odd states that might exist otherwise, especially when syncing with 'other graphs'.


That's pretty much the gist of the article, no?


That's what makes it good :)

I had no idea why it took my brain so long to wrap my head around it. Maybe it was blissful ignorance, never sitting down and making a mental model of it. Always looking for a tldr version of doing things. I don't know exactly at which point things clicked but it went from bewildering to just makes sense


I think all that is needed is an aha moment


i honestly feel people are allergic to rtfm


A friend of mine just posted this today, and I totally agree:

https://weisser-zwerg.dev/posts/software-engineering-vcs/


> in my opinion, the majority of projects developed in-house in an organization by a dedicated in-house software engineering team, would be better off following the guiding principles in Why Google Stores Billions of Lines of Code in a Single Repository and rather using something like SVN rather than git.

Mmm no thanks! In any case there's no reason you can't use Git itself as a monorepo! You don't have to inflict SVN on people for that.

Very weird opinion.


Well, the one reason would be to avoid having to deal with Git apostles who insist on using git "the right way" instead of how you tell them to. If they cannot learn the 3 or 4 ways of calling svn, for sure they cannot use git the way you want them to.


I worked on computer science projects for 20 years. At first we had no source management, everyone did whatever they wanted. Then we used CMS, and we occasionally stepped on each other's toes but things were better. Then we switched to SVN and nothing much changed except we established a way to hold locks on files while they were being changed. Then we switched to git because the students wanted to learn the cool thing. We started having meetings to teach people git. Meetings about the best way to use git. Meetings to deal with common problems in our use of git. Productivity dropped because everyone now had to deal with git problems. It made my life hell because usually I just wanted to check in my code so it would not get lost, but instead I would continually get forced into reconciling git issues. I would have to resolve issues to get my code checked in because other people had changed something entirely unrelated to my work. I stopped using git and kept my own backups and only checked in code very occasionally so I could get work done. I noticed other people doing the same thing.

The main problem I had with using git is it did not match the way we worked. Git assumes there is one person who is the gatekeeper, who decides what gets into the source, and who does some integration and testing. In research there usually is no one in charge of that, instead everyone is responsible for their own code, testing it, and integrating it. The git model was wrong for us, we never used pull requests at all, because there was no one person who understood everything well enough to approve them. Students don't have the experience or time to be the integrator, the profs don't even write code, and I had multiple other things that I had to do. So using git made a mess of what had been a simple process previously. Git was designed by Linus to make his life easier in managing changes to a kernel. It does not work well in other scenarios and should not be used in many circumstances. Yes, you can make it work, but at a cost.


Will read it. Until now, for me, git feels like it tries to get in my way (probably because I think differently). I heard about fossil (https://fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki). Does anybody have experience with it? Does it suck less?


I think there are 2 core principles governing fossil:

1) It wants to be the only tool you need to bring with you if you and your fellow developers are going to be stuck on a desert island. It’s not only version control, but also an issue tracker, and more recently a chat app.

2) It preserves everything you commit to it. Whereas git lets you polish your commit history before pushing, fossil keeps everything. You can’t alter your local history. All your messy scratch work can’t be cleaned up. It’s visible forever.

That second point is what turns me off to it, so I’ve stuck with git for personal projects. When I push up my local code, I like to have a very clean history.


Thanks. They really seem to have a bias against deleting (and not so good rationalizations): https://fossil-scm.org/home/doc/trunk/www/shunning.wiki

Apart from that, did you try it, and was it smooth?


I tried it just for an afternoon, but I did find it easy to work with.

I find git easy to work with too though, as long as everyone sticks to a well-defined workflow and doesn’t do anything weird.


Maybe that is the problem here, we do not have a defined workflow. Do you know of any good workflow sheet for git we can lift ideas from?


So if I understand correctly there’s no equivalent to git squash merging branches?


My experience with git is it's organically grown, and regularly a mess. It works reasonably well and in fact is better than a lot of alternatives, but can become a monster quickly and unexpectedly. My experience with mercurial was better than with git.

But none of this matters, as git/github/gitlab is today the industry standard, or even the category killer for version control. You will have to deal with it in one capacity or another. So my advice is to deal with git, learn at least up to medium level. As industry standards go, things could be a lot worse than git.


The main thing to know for newbies is that as long as you don’t force push to a remote branch, it is safe. You are creating new state only, not destroying. All errors are recoverable.


Even force pushing is not really a problem. If you don't want to keep every typo and braindead approach in history, force push is a required tool.

Naturally things go wrong occasionally, but garbage collection is not run often. You only need to know the SHA-1 and you can fetch "lost" commits again.

Old SHA-1s can be found in reflog. We also have all pushes automatically announced all in chat, so you can look up previously pushed SHA-1s in chat history (we use gitlab and zulip and those support it out of the box).

Of course you still need a mental model of how git history (including history rewriting) works, otherwise you cannot understand what exactly went wrong. And without knowing what went wrong, fixing it becomes awkward trial and error, which unfortunately many less experienced git users seem to do.
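
A rough sketch of why the "lost" commits are still there, using a toy object store and a toy reflog. The ids, the `update_ref` helper, and the `rescue` branch name are all hypothetical; real Git keeps a reflog per ref in the local repository and removes unreachable objects only when garbage collection runs.

```python
objects = {"base", "typo", "clean"}   # commits present in the object store
refs = {"feature": "typo"}
reflog = []                           # entries of (ref name, old id, new id)


def update_ref(name, new_id):
    """Move a ref and record where it pointed before."""
    reflog.append((name, refs.get(name), new_id))
    refs[name] = new_id


update_ref("feature", "clean")        # history rewritten, e.g. before a force push

# No branch names the old commit any more...
assert "typo" not in refs.values()

# ...but the reflog still remembers its id, and the object has not been collected.
lost = [old for name, old, new in reflog if name == "feature"][-1]
if lost in objects:
    refs["rescue"] = lost             # point a new branch at the "lost" commit

print(refs)  # {'feature': 'clean', 'rescue': 'typo'}
```

The recovery step is just creating another pointer, which is the same operation as making a branch.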


And don’t expect the command details to make sense. What you want to do is some simple thing in terms of a graph of states, and just google the command if you aren’t sure if it is origin/branch or origin space branch.


You can still easily lose uncommitted local state, which is unrecoverable, and also put the checkout into a state from which a newbie finds it hard to escape.


Definitely helped a bit. I just graduated from university and am working full time as a developer now. I thought I knew how to use git because I knew how to do feature branching and merging. Boy was I wrong. Within a few weeks at my new job, I've realized that I'm missing so much useful git knowledge. When I learned about cherry-pick, my mind was blown.

My goal right now is to develop a better mental model of git than what I have right now. If anyone has recommendations for resources, please let me know!


https://learngitbranching.js.org/

Go through every lesson, understand it, and find yourself more knowledgeable about git usage than 95% of developers.


So true and so worth the extra knowledge to understand your tools. You should also read about the various knobs on your compiler or interpreter from time to time. I used to reread the gcc manual every five years, and now I search on the env variables that affect the Python runtime. I'm getting ready to do that for the Go build chain now that I have 3 or 4 production Go things. Similar for my editor and libc and language stdlib and kernel APIs, though they are more diffuse than the gcc manual.


This was a good way to pass some time :)

I look forward to sharing this with others on my team.



That is a good, clear exposition. Thanks!


Jessica Kerr (jessitron) gives a good git internals talk that you can find on YouTube if that’s a helpful learning style.


Some past threads:

Git for Computer Scientists - https://news.ycombinator.com/item?id=3146466 - Oct 2011 (15 comments)

Git for Computer Scientists - https://news.ycombinator.com/item?id=1485612 - July 2010 (17 comments)



This is never not funny.


I'm so sick of git.

Yes I know what a directed acyclic graph is. No I don't know what 'checkout' will actually do this time when I run the command.

It's been 10 years. It's still confusing people. There's still article after article, book after book written on a tool that should be getting out of a programmer's way.

Let's try something else.


For the more knowledgeable on Git: what is the current status of the transition from SHA-1?


This [1] is useful to read. SHA-256 has been supported (experimentally at least) in Git since 2.29 [2].

[1] https://lwn.net/Articles/811068

[2] https://lore.kernel.org/lkml/xmqqy2k2t77l.fsf@gitster.c.goog...



I ran across this little gem recently:

> Git gets easier once you get the basic idea that branches are homeomorphic endofunctors mapping submanifolds of a Hilbert space.

* Isaac Wolkerstorfer, https://twitter.com/agnoster/status/44636629423497217


Richard Feynman had a joke theory that any purely theoretical mathematical concept when expressed in layman's terms devolves into something completely obvious. So does this actually mean something like "git uses branches."?


I have never got all these jokes. When my job switched from Subversion to git it took me about one week plus reading a couple of articles to become more productive in git than I ever was in Subversion. Yes, version control is a bit tricky but git is not that hard to understand and was much easier than contemporary Subversion versions.


Things have gotten a little better. But, try to do something off the beaten path in Git, and you may ultimately get the joke.

For example: “two weeks ago an intern accidentally committed a file containing IP we’re not allowed to use, we need to erase it from the repository and all developer machines.”

Have fun with that one!

EDIT: I mean, try to figure this out from the official Git documentation (https://git-scm.com/docs). No, Stack Overflow and Github are not the official Git documentation. Believe it or not, the idea that "Git is hard to use" predates Stack Overflow.


I have had to do it once during the 12 years I have used git, so I seriously doubt that this is why people think git is hard. And I think that googling it would be fine in that case. That said: since I have done it once I could easily figure out how to do it again and it wasn't hard, just a bit cumbersome due to git's distributed nature.


Is Googling not allowed? This situation is pretty common, so there are plenty of SO answers and articles on how to accomplish rewriting history to erase it from the repo.

Removing from developer machines is a separate issue and requires you to be able to coordinate your Devs.

If you meant that it's not simple to work out from scratch what you should do without lots of reading and trial and error...that kinda goes for a lot of tools, no?


Yes, but git seems to be one of those tools where laypeople genuinely seem unable to derive how to do complex tasks from first principles. Lord knows I can’t. If your Googling doesn’t turn up someone who’s had your exact problem, you will have to burn a long time figuring out how to do what you want.



> For example: “two weeks ago an intern accidentally committed a file containing IP we’re not allowed to use, we need to erase it from the repository and all developer machines.”

Technically, the issue was actually pushing that commit to the remote repository.

I think the best advice one can give people when using it is to run:

  git log -p origin/master..HEAD
and look at the commit messages and associated diffs to see if there's anything there that shouldn't be there before they actually run git push.


> git log -p origin/master..HEAD

See THIS is the problem. Ugly, inconsistent, clumsy use of the english language, and confusing.

This will go on my git sheet, with a comment as to what it actually does because I don't have the time to actually unpack that from first principles. I've got better things to do than become an expert on needlessly complicated software.


> See THIS is the problem. Ugly, inconsistent, clumsy use of the english language, and confusing.

It's a command line interface, not plain English. What's ugly and inconsistent about the git log command as was quoted in your reply?

> I've got better things to do than become an expert on needlessly complicated software.

As a software developer, I have to read through a lot of documentation to be able to use programming languages, SQL, data stores, unix utils, etc. I don't see why it would be any different for a VCS.

I think the actual issue is that people aren't willing to read through the documentation to understand what a command does and what options are available.

As for the command itself, the -p switch shows the diff associated with each commit shown with the git-log command. origin/master represents the upstream tracking branch of the master branch (most likely the base branch that the person is working on). .. represents a range operator and HEAD represents the commit that's the latest commit on the branch on the local machine.
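
In graph terms, `origin/master..HEAD` selects the commits reachable from HEAD that are not reachable from origin/master, i.e. the ones that have not been pushed yet. A small sketch of that set difference in Python, with a hypothetical four-commit history and made-up helper names:

```python
# Hypothetical history: origin/master is at "b", local HEAD has two extra commits.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
refs = {"origin/master": "b", "HEAD": "d"}


def reachable(ref):
    """All commits reachable from a ref by following parent links."""
    seen, stack = set(), [refs[ref]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def commit_range(exclude_ref, include_ref):
    """Model of `exclude..include`: reachable from include but not from exclude."""
    return reachable(include_ref) - reachable(exclude_ref)


print(sorted(commit_range("origin/master", "HEAD")))  # ['c', 'd']
```

Swapping the two refs gives the empty set here, which matches the case where the remote has nothing the local branch lacks.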


I use plenty of command line interfaces everyday. Most of them are pleasant, predictable, and easy to remember. None of them consistently confuse me like git does. (How many different things does 'checkout' do?)

SQL is not only much more intuitive than git, it gives me amazing leverage to deliver value to clients. By comparison, git wastes my time. There's zero or minimal competitive advantage to using it over any other VCS.


Erase from the repo, a little non-standard, but fine. Being asked to remove it from all developer machines sounds like someone misunderstood how version control works. Was that a real life example you hit?


They might have a model of version control in their head that predates distributed version control systems - I never used one myself, but the code base I work on still has scars here and there from the era when only one developer could have any single file checked out.


Not a misunderstanding, a requirement. If the developers cannot have that data (legal reasons? Secrets?) it must be deleted.

Probably has to be done outside git, though. Maybe one of the corporate virus scanners will let you define a local signature.


It's rather simple: remove it from the origin repo using BFG Cleaner or whatever, then ask devs to delete and re-clone the repo. Not everything needs a complex technical solution.


Git clones the entire remote repository to each developer's machine. So, if you accidentally committed something you shouldn't have two weeks ago, everyone will now have a copy in their local repo. And you can't always just tell people to delete their local repos and start again, since they might have local branches they're working on, etc.


I don't think this is even possible with SVN or CVS, is it ?


At least with SVN, there is one option that is pretty similar to git’s filter-branch: svndumpfilter. You dump the entire history of the SVN repo to a file, edit it, and then load it into a new SVN repo. I used this technique to pre-process a repo to remove large files before migrating to Git. The file format is simple enough that you can easily write a program to edit the stream.


It’s very easy in CVS, which is why some people prefer CVS to any distributed solution.


Curious why the IP address has to be obliterated from history instead of just correcting it in a new commit? An IP does not seem sensitive like a secret or private key.

EDIT: Sorry, my bad. I misread.


Presumably they did not mean an Internet Protocol address, but instead data containing disallowed intellectual property.


I believe in this case, IP == Intellectual Property.


>I have never got all these jokes.

If you mean that literally, this joke is referencing another joke meme in the functional programming community, explained here: https://blog.merovius.de/2018/01/08/monads-are-just-monoids....

If you mean, you don't get why people joke about Git being hard, it probably isn't for most professional programmers already accustomed to some kind of source control, just perhaps to anyone new to it.


I often wonder about this as well, but I think there must be a difference of opinion of what 'hard' is.

I think that people who think they are confident about Git in 'a few weeks' basically have a lack of self-awareness. They 'don't know what they don't know'.

I've seen countless times, developers huddled around someone else's computer, trying to figure out what went wrong.

I saw a git animation/visualisation tool once that animated operations, and I saw things happening I had never seen before, basically a lot of 'loose end' things that I didn't even know existed.

I also wonder if that has something to do with the fact that such things are maybe not suited to 'command line' and are inherently structured.


It is not hard to be more productive in git than in subversion... (ok, I hate subversion). I think a better point of comparison would be mercurial, which is based on the same principles as git but with different opinions.

But I think the trouble with git is that it is very flexible, there are many ways to do things (ex: merge vs rebase) and the command line interface is not particularly intuitive. I messed up with git much more than I messed up with mercurial, but git also makes it easier to fix the mess.



Reads like the man page for "git rebase".


A sparse Hilbert space. Beginners make that mistake a lot.


Is this a joke or serious? I don’t understand enough of the words after “branches” to know. I’m serious btw.


It's a parody/derivation from the monad joke, where one explanation for a monad is that they're "just a monoid in the category of endofunctors".


HN makes me realize I don't know enough words.


Is that maths soup or a real construct? I can't join the dots but I'm also studying physics so category theory is still slightly Greek to me (I can feel the mathematicians' noses rising already...)



tl;dr it's a random math soup (from first link)


It's random math words put together and doesn't make sense even in isolation.


Or else it's research that won a Fields Medal. There is no middle ground.


geezus this is old, has there not been more recent versions/attempts at this kind of post?


Also, it is submitted almost annually with hardly any interest at all. Either everyone has already seen it around for a decade, and/or it's actually not much of an article.


For me almost every Git teaching resource has gone like this:

Step 1. It is explained that Git is a simple program, and that the underlying idea can be understood easily, it is only that other materials have done a bad job about it.

Step 2. Tell the reader a blob is just the byte object containing the information you are source-controlling. "See how easy it is?"

Step 3. Invent their own nomenclature/diagrams/metaphors for all the other concepts, totally muddling the waters.

Step 4. Become one of the resources criticized on Step 1.
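
For what it's worth, the Step 2 claim can be checked directly. With the default SHA-1 object format, Git names a blob by hashing a short header (`blob`, the content length, a NUL byte) followed by the raw bytes. A minimal sketch; the function name `blob_id` is made up:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 over a 'blob <size>' header, a NUL byte, and then the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known id in SHA-1 repositories.
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The file name and permissions are not part of the blob; they live in the tree object that points at it.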


One day I'd like to break this circle. I've been doing 2-day Git workshops with a colleague for a few years now, and "to internals or not to internals" is our constant disagreement. I don't like talking about blobs, trees and anything below the "commit hash" level because I almost never need it myself.

My other personal issue is the complete opposite of the "going way too deep into details" teaching resources: showing clone/commit/push/pull and calling it a day. This leads to resources like ohshitgit.com as things will eventually break when people use commands without understanding what is happening.

When doing our workshops, we go through the basics: what is a commit, what is a branch, what is HEAD, what do commands like checkout/reset/rebase do on graph level. This approach demystifies Git without going into internals. It also takes away the fear of "advanced" topics (like "rewriting" history).


Do you have slides or other resources from your workshops you’d be willing to share?


Unfortunately, those resources are company-internal. But I'm planning to create a public resource based on my experience of doing those workshops, without falling into the trap mentioned by GP.


I also take this middle ground, seems to work well for most students


Our IT department refuses to give an introduction to git. Smart guys.



