Cleaning Up Git History

infogulch · on Aug 8, 2021

Cherry-picking (pun intended) a comment elsewhere itt:

> Crafting commits is hard if one wants perfect commits and most of us don't have time to do that.

This is a common sentiment, but I really think that this is a product of the tooling instead of a fundamental difficulty of the task. If it wasn't such a pita to split, join, and shift commits around then people would consider trying to do it more often.

Another comment:

> TIL: Splitting A Commit

> git rebase -i <ref> Mark commit as 'edit' Continue with rebase, it will stop on 'edit' marked commit git reset HEAD git add --patch git commit git rebase --continue

What is the mountain of context that someone needs to know before even that "one simple trick"-style explanation would make any sense at all? How would you describe or recommend that process to someone in a technical non-dev role that has never used source control? You wouldn't. So they'll never split a commit. So now you have an avalanche of users that don't even approach the problem of crafting a PR with the mindset to produce clean commits, so it can never become the norm.

At the least we need a more direct way to accomplish three main things: split a commit, merge adjacent commits into one, reorder commits. If these "basic operations" on commit chains are easy to accomplish we might have a chance at having a clean git clean history.

rusk · on Aug 8, 2021

> How would you describe or recommend that process to someone in a technical non-dev role that has never used source control?

That’s the audience though. I know technical non-devs in various roles that would kill for the kind of features that git provides. They’re stuck manually versioning files. There is definitely a market out there for something like this but even simple version control is very tricky for non-devs. It’s just something we’re more intitivively good at probably to do with how we approach breaking down problems. Animators, graphic designers, architects etc tend to think more holistically. If they use these tools at all they’re usually a service provided by some in house experts, such as the engineering team or IT.

omegalulw · on Aug 9, 2021

This seems like a horrible way to split a commit. After the interactive rebase stops just do an interactive add and let the rebase continue.

delecti · on Aug 8, 2021

I feel like I must be unaware of how other people use git, if this is remotely relevant.

Everywhere I've worked has used a similar flow. Commits in master correspond 1:1 to PRs. Your personal flow of local commits or branches is irrelevant; your PR is reviewed as a unit, and merged as a squashed commit on top of master. The description of the PR is either exactly identical to the commit message, or only slightly modified.

Clearly lots of people don't work that way, because that doesn't leave any room for a messy git history. I also don't know why I should care about the in-progress commits of my coworkers when I'm reviewing a PR as a unit.

omegalulw · on Aug 9, 2021

This doesn't seem like a good idea. What about features that take 10000+ lines of code? I'm guessing you ask your devs to split them across PRs. Is it really a far cry then to pull commits via rebase. If your PR commits are good it shouldn't be a big deal.

Not to mention how do you revert commits on master? A big PR that broke prod will lead to nastyy merge conflicts when reverting.

delecti · on Aug 9, 2021

We don't have features that take 10,000 lines of code. Our biggest features get split among a couple smaller PRs to review them along the way.

I know there's a case for larger feature branches that accumulate an entire large feature before being merged into master, but in a world of microservices, it's just not a use case we've had to consider.

And reverting PRs happens occasionally, but because changes are much smaller than the scale of what I assume you deal with, we never have many (if any) merge conflicts to deal with. We primarily deploy new features with feature flags, so it's vastly easier to just turn them off until we can fix the issue.

cellularmitosis · on Aug 9, 2021

> I know there's a case for larger feature branches that accumulate an entire large feature before being merged into master,

I think you are tempering your stance in the name of arguing charitably, but I’ll push against that.

No, there is no case for this. Branches are pain accumulators. The longer they live, the more pain they will bring to everyone involved.

If you need to work on something big, that’s what feature flags are for.

croo · on Aug 8, 2021

Yeah but... What happens when you work with a new college who put two features in a branch because he forgot to switch and someone did a shitty review and it's now in master but you can't revert because its release day and three other colleges already did a rebase and you cannot take the time to revert this mess anymore because other bugs and...

I mean commits in need of repair happens, however good your workflow is.

delecti · on Aug 8, 2021

That's why new members of the team get extra attention for their first few changes, PRs require two reviews, and we don't rebase master.

I've really just never needed to do major repairs to the master branch of a project I was working on. We're probably missing out on git features that could be nifty, but it's also easier to avoid shooting ourselves in the foot.

eplanit · on Aug 8, 2021

That's the model I'm accustomed to, as well. Most shops I have worked with in the past 10 years use Gerrit, which facilitates that (transparently).

somehnacct3757 · on Aug 8, 2021

I'll do this stuff for myself in a long-lived branch to keep rebasing and cherry-picking on the table.

I don't see the value for small changes, just squash. I also don't see the value in a PR-based workflow, as your team's PR practices should have sufficient context without asking everyone to rewrite history.

I also don't really believe that devs can reliably retcon atomic commits that compile using just git commands. You make a nice commit for the new file but forget that it has a dependency in another new file. Or you remember the dependency and painstakingly add -p the functions necessary for the commit, but forget one variable definition that was in a chunk at the top of the file.

omegalulw · on Aug 9, 2021

Run tests on every commit? At least make sure every commit is buildable.

tomasreimers · on Aug 8, 2021

Agreed with a lot, but for many companies I think cleaning up git history is more effectively accomplished by cleaning up PRs than actually trying to clean up the commits themselves that developers write.

"Good commits" are one of those things that benefit the whole, but no individual developer is incentivized to do. Because of that, it's easier to enforce at code review (rather than hope developers just do it). Many companies do this by squashing changes, so a developers' commits and the commits on trunk are decoupled.

I think the best way to make changes (and therefore commits on trunk) understandable is to split them up into series of dependent changes; something supported by the code review tooling used at places like FB / Uber / Google, but not super well in GitHub... Which leads to large PRs with a lot of unrelated fixes.

Disclaimer: I work at a company trying to bring that dependent change / stacked diff workflow to git and GitHub (https://graphite.dev).

xyzzy_plugh · on Aug 8, 2021

> Disclaimer: I work at a company trying to bring that dependent change / stacked diff workflow to git and GitHub

GitHub I understand, the tooling here is terrible. I have my own scripts to solve this, but there are fundamental issues with how GitHub treats branches. But git? Huh?

The "stacked diff" workflow is more commonly known as a patch set or patch series, and is as old as git itself. See the LKML for examples. Tools like Gerrit or Sourcehut promote this workflow as a first-class citizen. It's how you use git over a mailing list anyways.

fosterfriends · on Aug 8, 2021

++ stacked changes workflow. @tomasreimers made me a convert after debating traditional github workflows for a year

usr1106 · on Aug 8, 2021

Every commit should build is a reasonable principle.

However, it's in contradiction to TDD. If you have a bug (in master) the first thing you are supposed to do is to create a test that fails because of the bug. This is a self-contained commit, but it will not build (assuming your test are run during the build as they should). The next commit is fixing the bug and now it builds again.

So the commit with the new test should be marked as unsuitable for bisecting. How to do that easily in away that bisecting remains easy I have yet to work out.

omegalulw · on Aug 9, 2021

Isn't this taking TDD to literally? Verifying that a new test is indeed broken doesn't deserve a separate commit. The presence of that test in the codebase with all tests passing is enough. If your devs are writing new tests that are already passing your team needs a better lead, not more build controls.

usr1106 · on Aug 10, 2021

In normal development flow sure. But if the bug made it already to master I'd prefer the separate commit so it can be easily verified that the problem is gone.

trevinhofmann · on Aug 8, 2021

I've been enforcing every commit to both build and pass tests, but now I might only require each commit to build and each merge request to pass tests.

In that case, you can have a merge request with a commit to add a test and then another commit to fix it. As long as both commits build, bisect will still work.

I agree with having CI enforce both passing tests and working builds, but why would you want the build step to require passing tests?

WorldMaker · on Aug 9, 2021

This is basically the approach that I take. I expect git log --first-parent (and git bisect --first-parent) on the main branch to pass full CI builds, but I'm increasingly more lax about individual commits the further you get out into the DAG. I think it way more useful to record the steps-in-progress than to toss them (especially when you are trying to debug the why or the how of a change months or years later from say a junior developer; it's a lot easier sometimes to see things like "oh, this was a test value meant to be replaced" when you see exactly which intermediate commit it was from deep in a PR).

(ETA: It's easy to forget that the I in CI stands for "integration" and was always focused on merges of changes more than specific changes.)

usr1106 · on Aug 8, 2021

That's just terminology. When I said "does not build" I meant CI fails because a test fails. Of course the code would compile or what other steps you might have.

trevinhofmann · on Aug 8, 2021

Change the CI to test that every commit builds and every merge request passes tests. If we don't enforce the tests to pass for every commit, that solves the problem. Right?

simonw · on Aug 8, 2021

One of the many reasons I'm not a fan of the "pure" model of TDD.

If you want to work like that though one way to do it is to work in a branch/PR - building out your commit that fails, then your iterative commits - and then squash the branch into a single commit onto main layer on.

If anyone want to see your TDD tiny commits process later they can go and see it in the PR.

lloydatkinson · on Aug 9, 2021

I have never seen TDD state that a test should not be in the same commit as a feature.

simonw · on Aug 8, 2021

I went through a period of writing essay-length commit messages, on the basis that they were a form of documentation that I could completely trust because they were entirely accurate to the state of the software at the time of the commit.

Then I realized that if I bundle the documentation change itself into the same commit I can have the same benefit while also creating documentation that people can read without digging through the commit log to find it!

These days my commit messages are much shorter, but the commits usually bundle implementation, tests and documentation updates into the same unit.

I also make sure my commits link out to a relevant issue. This gives me somewhere I can record additional context and drop in things like screenshots that help illustrate my thought process while working on the code.

d0m3 · on Aug 8, 2021

Maintaining git history is a waste of time. Work with PRs and let the PR own the context of the changes. Simple and easy for everyone. You can also link much more information than in a commit.

b3n · on Aug 8, 2021

To be fair, git has no concept of pull requests, this is entirely a feature of some hosting platforms which this article makes no mention of.

WorldMaker · on Aug 9, 2021

git does have a concept of a merge commit, and most pull request tools have the option to force merge commits (as opposed to and in contrast to fast-forward merges or squash/rebase merges). If merge commits are forced you have an object representing completed PRs in the DAG. You can use tools like git log --first-parent to browse the "PR list" without needing to get into weeds of the individual commits "inside" the PRs at first.

Thev00d00 · on Aug 9, 2021

This is great until your company decides to move to bitbucket or whatever and suddenly you lose those repo metadata without a big ugly migration

WorldMaker · on Aug 9, 2021

If you force merge commits you at least have a record of completed PR activity in git history, even if you don't have details such as review comments (or in progress/in review code review efforts).

vanous · on Aug 8, 2021

TIL: Splitting A Commit

git rebase -i <ref> Mark commit as 'edit' Continue with rebase, it will stop on 'edit' marked commit git reset HEAD git add --patch git commit git rebase --continue

Buttons840 · on Aug 8, 2021

Before learning to clean up the history, spend some time learning to make sense of a messy history. Have you read the git log man page?

WorldMaker · on Aug 9, 2021

I do think that a lot of folk's concerns around "clean history" would go away if --first-parent was more often a default they reached for (especially in GUI tools; it's fun work to draw "subway diagrams" of the git DAG, but it would be a better user experience in many cases to instead do a boring old master/detail drilldown starting with --first-parent and then exploring the DAG out from there).

someguy101010 · on Aug 9, 2021

git commit --amend is a lifesaver as well

Josh5 · on Aug 8, 2021

"Every commit should be runnable, that is we should be able to git checkout any commit and get a functional code base. This means no “WIP” commits..." I strongly disagree with this.

xyzzy_plugh · on Aug 8, 2021

You can still have WIP commits, you just clean them up before you share them, i.e. squash. It doesn't mean you have to reduce your work to a single commit -- that's silly -- but it does mean building a narrative of changes.

Take a look at patches in the LKML and their associated revisions. This is the most sane approach to history. Changes are broken up logically, nearly all commits in a patch series are individually functional/correct, and in their presented order they tell a story for the given feature.

Your "update" or "whoops" or "wip" or "address feedback" commits are, almost universally, entirely useless. There's no point in checking out one of those commits. They don't help tell a narrative to uniformed readers. They're just noise.

ozim · on Aug 8, 2021

In the end a lot of features end up being single commits. Crafting commits is hard if one wants perfect commits and most of us don't have time to do that.

If I implement text-box with a save button there is no way I can functionally commit "working change" where I only have a text-box or have a save button separately.

There is nothing wrong with that I believe. I think there is much more wrong with insisting on keeping "history" so people think that their "whoops" or "wip" are important.

usr1106 · on Aug 8, 2021

> Crafting commits is hard if one wants perfect commits and most of us don't have time to do that.

I can agree with the first part, because in my current job we do it. Some more than others, but it's recommended. Throwing some naturally grown git history to review is making the life of the reviewer hard. Doesn't matter whether you belong to those who make a lot of WIP commits because you are paranoid of losing some idea or whether you only commit once when everything is ready, because you are shy of showing your trial and error mess.

Probably you are also correct that most people don't invest the time to do it. That's why there is more bad code on the planet than good. It's hacked together hastily and not reviewed properly because the history is not reviewer-friendly.

I'm sure in some organizations quality work is not tolerated because "we don't have the time".

xyzzy_plugh · on Aug 8, 2021

You take me too literally. Not every feature needs to be broken out into separate commits. Usually the narrative looks something like this:

- Fix some bugs impeding the work being done. Zero or more commits.

- Refactor or lay the groundwork for the change. This could be something like plumbing dependencies, writing some feature flags, a new API, database migrations. Zero or more commits.

- Implement the feature. If you previously plumbed an API to save some text, now you can wire up the frontend. One or more commits.

- Lastly, you may have some cruft leftover that you want to remove, like an old method that you now mark as deprecated or issue a warning, or change some links or remove some feature flags. Zero or more commits.

In the end you have a series of changes that describe a cohesive item of work, or feature. It's rarely as simple as just slapping some extra markup somewhere. This makes it easy for reviewers to see the important bits and makes it easy to bisect or blame. Squashing all these changes would be a mistake.

I highly recommend folks become familiar with interactive git rebase and fixup commits. Instead of thinking linearly, and just slapping work on a branch, it should be more akin to writing a novel. Some background, set the scene, introduce the characters, some exposition, advance the plot, resolution, etc.

blowski · on Aug 8, 2021

You’re exactly right.

All comments regarding “how you should use Git” (and many comments in general) could be prefaced with nature of the project. A predominantly markup based project worked on by a whole team of juniors will have different needs to a C codebase worked on by a single senior.

There’s not one correct way, just alternative ways that are optimal in some situations.

Eikon · on Aug 8, 2021

I disagree too, I prefer a lot of small commits. What some people call "keeping a clean history" is actually destroying history and discarding a lot of important information.

I saw a lot of people merging branches a single commit "to keep the history clean" while in fact it just made the history less legible and less reversible.

alFReD-NSH · on Aug 8, 2021

The article is suggesting that while you are working you can commit as you go, quick and dirty, irrelevant of the changes. Once you get something working and it's ready to be merged, then you can go back rewrite the branch, removing back and forth changes and writing a good commit message with the reason for it for every one change.

This way, the history would be meaningful instead of it be littered by fix up commits and it makes reverts easy.

Eikon · on Aug 8, 2021

> The article is suggesting that while you are working you can commit as you go, quick and dirty

Why would someone do that? I’d argue that if there is cleaning to be done, it should be done way before something is ready to be merged, while context is fresh in your head.

Also, cleaning before something is ready to be merged will lead to sloppy history removal as, often, people are eager to merge for a variety of reasons, and rewriting history may not be seen as a priority.

alFReD-NSH · on Aug 8, 2021

I personally commit and push to a branch when I think there's enough changes for someone else to review my work (of course with WIP flag), in case if someone is curious and once any of the changes are ready to be merged (this can be before the review changes), I'll clean up and write nice commits.

I guess it's personal preference, I don't recall losing context, as I don't let too much of time passed. And I actually ensure nice commits are written before code review otherwise it might be harder for the reviewer to get the context of the changes.

Vinnl · on Aug 8, 2021

According to logic that you should also always commit before you hit Ctrl+Z.

Of course you have to consider what history is relevant and create relevant commits for those. But a WIP commit at the end of the day is rarely more relevant when split up from the finishing up of that work the next morning.

ktross · on Aug 8, 2021

I agree with the idea, but disagree with the approach. This seems like a lot of extra work for not much, or maybe even less, benefit.

The process I've set up for my team allows us to get the best of both worlds, in my opinion, with almost no extra effort:

1. Create a branch and open a pull request. Nothing gets pushed directly to master.

2. After the pull request has been tested and reviewed, we Squash and Merge into master on GitHub with a single button click.

This way we end up with a single commit in master for each feature, bugfix, etc. There's also a link to the PR in the commit message where you can view the individual commits that were squashed as well as links to any related Jira issues.

MereInterest · on Aug 8, 2021

I think I'd go the other way around. Instead of squash/rebase, you should instead use `merge --no-ff`. The allows your main branch to have a clean history with the `--first-parent` option (e.g. `git log --first-parent`), while still maintaining the history on the branches. If you squash the commits, the original is no longer part of the commit history. Yes, there is a link in the commit message, but being able to manually click through to the commits on an external web page is a huge usability drop as compared to getting the information straight from `git bisect`.

Squash/rebase also plays horrendously with my local branches. If I make a PR, I should be able to use `git branch -d branch_name` to safely verify that the branch has been merged into main and can be deleted. If you squash/rebase instead, then I need to use the unconditional `git branch -D branch_name`. git can no longer verify that I'm performing a safe operation, because the history needed to determine whether it is safe has been rewritten.

ktross · on Aug 8, 2021

> Squash/rebase also plays horrendously with my local branches.

This is a very good point, and it has been slightly annoying at times with this approach. It hasn't been a major pain point for me, so I've just dealt with it. I've seen some scripts/aliases that claim to solve this, but I haven't spent much time looking into it.

MereInterest · on Aug 8, 2021

My current workflow is to watch for a PR to be accepted on github, then use the delete branch button in the GUI. On my local repo, I'll then use `git remote prune origin`, and only call `git branch -D` on branches that were pruned. It's a workable solution, and it could be scripted around, but I don't want to. Reproducing the functionality that already exists in git feels like a waste of time, when the entire purpose would be to work around the existence of a rebase workflow. Better would be to just use a merge/no-ff workflow in the first place.

bonzini · on Aug 8, 2021

The individual commits disappear if the branch or repo used in the pull request are deleted.

ktross · on Aug 8, 2021

GitHub actually keeps these in the pull request indefinitely if you delete the branch. You can also restore the branch at any time if needed. We have the branches set to automatically delete when a PR is merged.

bengale · on Aug 8, 2021

When I’ve done this before the commit messages get listed in the squash commit.

bonzini · on Aug 8, 2021

The list is useless, it's the contents that are helpful when debugging (and the wip commits just get in the way).

MereInterest · on Aug 8, 2021

Exactly! Suppose feature A is implemented (commit 1). Then it is discovered that the original implementation breaks feature B, and that is resolved (commit 2). In merge workflows, you have a merge commit on main, while commit 1 and 2 are maintained. In rebase/squash workflows, you have a single commit on main, with both changes. If 6 months later I find that the commit broke some feature C, I really want to know if it was the main change or the compatibility fix that did it.

bonzini · on Aug 8, 2021

Right, though usually commit 2 would be placed before commit 1.

MereInterest · on Aug 8, 2021

True, and that's a better way around it, since then both commits can run and pass the tests.

hnjst · on Aug 8, 2021

I'll take an intermediary position. Commits on a WiP branch don't need to let the repo in a working state, however I like when said branch has its history rewritten so that each commit is mostly a working unit change before submitting the branch for merging.

You definitely shouldn't restrain from committing often during development, however having a cherrypickable branch history when merging instead of "let's squash this mess and forget about it" has value to me.

c_s_guy · on Aug 8, 2021

I'm not experienced enough to form a strong opinion on this. Could you provide some reasons why you strongly disagree?

At first glance, it seems like a beneficial goal. If I ever need to checkout different points of history to find a breaking change, it'd be nice if I can run the code after every `git checkout`.

Forge36 · on Aug 8, 2021

Every commit to main should be runable. There's no worse way to start a project than debugging why it won't run or worse won't compile.

Keeping every branch would preserve the detailed history, unfortunately these extra branches can create their own noise.

omegalulw · on Aug 9, 2021

Why? It's an excellent principle, especially for debugging. The biggest benefits of your commits and commit messages are not you but future code reviewers. Imagine someone reads the blame on a file and sees WiP commits (that absolutely will eventually happen), that's nightmare fuel.

WorldMaker · on Aug 9, 2021

As someone who has had to do some "deep archeology debugging" of legacy code, I'd much rather see that git blame full of WIP commits from junior developers than squash/rebase commits from the same. Questions like "was this an intentional regression or bad test behavior from an in progress commit or an accidental regression from a bad merge?" are much easier to answer when you've got "more parts of how the sausage was made" to work with. When I'm not doing "deep archeology" I have tools like git praise --first-parent to focus on the high level (completed PR merge commits).

jonjonsonjr · on Aug 8, 2021

Can you explain why? I've found that it is a lot simpler to use git bisect if every commit builds.

cortesoft · on Aug 8, 2021

There are two competing interests, and the author suggests cleaning up after so that you can satisfy both.

The first interest is what you said, for every commit to build.

The second interest is to help prevent loss during development. If I am working on something and need to finish for the day, but haven't yet got the code to a working state, I want to be able to commit and push my code in case something happens to my laptop.

Also, sometimes I am making a series of big changes, but I need to make them all for the code to work. I want to have checkpoint along the way, in case I need to undo something (or even just see what I have changed since the checkpoint!) committing WIP code that doesn't build lets you do that.

We usually solve this problem with squash merges, so every commit on the main branch is the full working feature, but there are downsides to that technique.

IshKebab · on Aug 8, 2021

I'm pretty sure they mean every commit on `master` or `develop`. Not every commit in your `add_new_feature` branch.

Forge36 · on Aug 8, 2021

It's in the ambiguity we can discuss where the rules make sense :)

olejorgenb · on Aug 8, 2021

    git bisect --first-parent

Every merge commit should build is a more reasonable policy