Every time I end up using repos with submodules, I'm stuck in the submodule unwedging dance at some point. It's just not worth it. Either it's stuck with changes that accidentally snuck into the subtree (rm -rf and submodule init/update), the commit is bad and git can't update to it somehow (maybe it's not a tag and can't be fetched? usually gets fixed with rm -rf and submodule init/update), or maybe it's just when I switched branches and it's left in a dirty state because... reasons.
Git is elegant in so many ways, but submodules are the broken, ugly stepchild of the beauty that git is.
I suspect it's not the idea of submodules, but the terrible, terrible command interface to them and how badly they work with the rest of the system.
Having used a few “lesser” VCSes, I am not so sure I’d call Git elegant. It just became famous and the de facto VCS of the Internet; a lot of that credit must go to GitHub (which also was and is used a lot by companies and teams). Something like Markdown: one day everyone and their kittens were just using it, which is not necessarily a compliment.
I had created 2-3 character aliases for git submodule commands. Git submodule exists because a better alternative (one that gives us the same behaviour or close to it) isn’t available.
As a side note: git won because at the time all other VCSes were either functionally worse (like RCS or Subversion), or were good but required a paid license (like Perforce or BitKeeper), or were too slow for larger projects like Linux (Mercurial).
More advanced things were created since then, like Fossil or Pijul. But the network effects make git predominant now.
> But the network effects make git predominant now.
That and the fact that it's... well, it's a good system - if perhaps not with the best UX.
It works well for pretty much everyone who cares to learn the basics, and then you can evolve from there with more practice. Which is probably true of any system.
Indeed, git is pretty good internally, despite the clunkiness of the CLI.
But, say, Mercurial is also pretty good in many aspects. It used to be rather popular, but its popularity is waning, and not because of some kind of technical inferiority.
> or were too slow for larger projects like Linux (Mercurial).
So it seems like there were technical reasons for Git vs. Mercurial. I don't really know and having never used Mercurial, I couldn't comment on how good it is or how it compares with Git.
From what I read around, it's mostly the UX that's marginally better on the Mercurial side. This is the point where the network effect certainly has weight. If one offering is not better enough than the one people are used to, there is no compelling reason to learn something anew and move all existing projects over.
When there is great technical reason, people will move though and the network effect will start moving across. See the previous systems that were popular before Git: Subversion, CVS, ClearCase, etc. Those have mostly been phased out completely, except maybe for older projects that have them ingrained into their processes and technology.
I've been on a number of projects which used git without GitHub. But of course GitHub, as "the default" repository of open-source projects, has done a lot to make git the default VCS.
There's no wonder GitHub drew people in. Its interface and the ease of building a community around a repo made it so much more accessible than anything else around at the time.
I don't get the opposition to submodule. imho it's fine, there are limited options with it, so limited chance it may go wrong.
some suggested package manager, which imho is another layer of dependency, e.g. in Python, it always takes me some extra seconds to figure out which package manager I'm supposed to use, and it's constantly evolving.
end of day it's up to which one you are most familiar with.
Submodules are the feature you discover someone pulled in when things don't work, and after a bit of digging, you realize it's because the submodule wasn't initialized, and it didn't say so anywhere.
I suspect the opposition to submodules comes from the poor (manual) integration.
As someone suggested in another thread, they ought to auto-recurse by default.
But I don't think that's going to happen, either.
Maybe something like direnv where it tells you about the unloaded submodule once you enter the repository. Except that won't work for IDEs.
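For what it's worth, a rough Python sketch of that kind of "you have unloaded submodules" warning. It assumes the text of `git submodule status` has already been captured somehow; the sample lines below (hashes, paths) are invented. The prefix characters are the ones the git docs describe: `-` for a submodule that is not initialized, `+` when the checked-out commit differs from the recorded one, `U` for merge conflicts. Paths containing spaces are not handled.

```python
# Sample `git submodule status` output. Hashes and paths are invented.
SAMPLE_STATUS = "\n".join([
    "-1234567890abcdef1234567890abcdef12345678 vendor/libfoo",
    " 89abcdef0123456789abcdef0123456789abcdef vendor/libbar (v1.2.0)",
    "+fedcba9876543210fedcba9876543210fedcba98 vendor/libbaz (heads/main)",
])

# Meaning of the first character of each status line.
PREFIX_MEANING = {
    "-": "not initialized",
    "+": "checked-out commit differs from the recorded one",
    "U": "has merge conflicts",
}


def submodule_warnings(status_text):
    """Return (path, problem) pairs for submodules that need attention."""
    warnings = []
    for line in status_text.splitlines():
        if not line.strip():
            continue
        prefix, rest = line[0], line[1:]
        if prefix in PREFIX_MEANING:
            # After the prefix the line is "<hash> <path> [(<describe>)]".
            path = rest.split()[1]
            warnings.append((path, PREFIX_MEANING[prefix]))
    return warnings


for path, problem in submodule_warnings(SAMPLE_STATUS):
    print(f"warning: submodule {path} {problem}")
```

Something like this could be wired into a shell hook, though as said that still would not help inside an IDE.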
My impression of this opposition (and hate) is that people don't read the docs and expect submodules to bend to their vision of how submodules should work. Because of that, we have multiple similar but not the same "submodule" implementations: git-submodule, git-subtree, git-subrepo...
I think one problem is that submodules are kind of overkill in situations where what you really want is to have separate repos for various reasons but one way of pulling them all in at once. You could make a script with the origins etc. but then you put that in a repo and then ...
I use them in my personal projects and have a sprawling mass of crazy. It "works" but it's a pain and I have scripts to recursively crawl through and basically do 'git pull' everywhere or 'git push'.
So if you end up using submodules when you don't really care about versioning the sub-repos, I think you end up feeling a bit of an idiot like me. But it's a pain to reverse.
Same here. I used it and set it up once and had no issues personally, but only after documenting how to deal with it exactly.
It's just a rare feature so people don't know how to deal with it, so they just hate it. Once you figure out the 5 commands you will need you're golden.
Not only are they “a pain” but I see git submodules as the technical equivalent of a “Big Tent” political party trying to fit multiple divergent factions into a single entity.
At some point someone in a submodule repo is going to go rogue and make something incompatible with the bigger picture. Psychologically, they’ve merged to their trunk so as far as they are concerned their change is complete. Task done, “mission accomplished”. Yet until it’s also in the parent repo’s trunk, it might as well still be on a branch.
In a monorepo, the rogue coder would instead still be on a branch. They’re done when their branch works and their change is merged (er, ff’d!) onto the one true main, not before.
None of this applies to FOSS projects. Those are Big Tents where political negotiations are constantly required to keep various democratically equal projects aligned.
In the corporate world though I am entirely happy with a one party state running a top down, planned economy of N year plans following CEO-Thought, and with a single repository.
In trunk based development libraries do break product code, yes, which is (and should be recognized this way) 100% intentional. The other alternative is leaf based development where you have dependency hell because you can't force product code to update libraries.
Pick your poison - change management is not an easy topic.
I love submodules. It's similar to using a package manager but allows you to modify the submodules' code much easier. And as already mentioned you need to use `git config --global submodule.recurse true`. This really should be the default.
Submodules are okay in theory. But in practice the actual implementation is very buggy and incomplete. It's relatively easy to get into a state where your .git directory is completely broken. Plenty of operations are unreliable to the point that they break CI. They don't work with worktrees.
On top of that they are needlessly confusing. Why is there a .gitmodules file and hidden state inside the .git directory? Why aren't they cloned by default? Many of the UX issues have only been fixed if you turn non-default options on (e.g. the display of diffs can be changed from useless "submodules changed from commit 123 to 456" to "these commits have been added/removed").
Just all-round they are a mediocre idea, implemented badly.
Are there any downsides to completely skip dependency managers of specific languages and just use submodules to handle dependencies?
I don't mean code by 3rd parties. I mean the projects and libraries I write myself. I am tempted to try and handle my own code-reuse purely through git submodules. Would I encounter any problems?
It doesn’t even need to be a diamond to have dependency problems. Suppose I have two dependencies, A and B. Dependency A also depends on B. Now I have two copies of B, one from my submodule and one from A’s submodule, and there’s no guarantee that these are compatible versions.
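A small Python sketch of spotting that situation. Everything here is hypothetical: the paths, URLs and short commit IDs are made up, and the `pins` mapping is assumed to have been collected already (for example by walking the nested submodules). It only groups pins by repository URL and reports any repository that is pinned at more than one commit.

```python
from collections import defaultdict

# Hypothetical pins: checkout path -> (repository URL, pinned commit).
pins = {
    "deps/A": ("https://example.com/A.git", "aaaa111"),
    "deps/B": ("https://example.com/B.git", "bbbb222"),
    "deps/A/deps/B": ("https://example.com/B.git", "bbbb999"),
}


def find_conflicts(pins):
    """Return {url: {path: commit}} for repos pinned at more than one commit."""
    commits_by_url = defaultdict(dict)
    for path, (url, commit) in pins.items():
        commits_by_url[url][path] = commit
    return {
        url: by_path
        for url, by_path in commits_by_url.items()
        if len(set(by_path.values())) > 1
    }


for url, by_path in find_conflicts(pins).items():
    print(f"{url} is pinned at different commits:")
    for path, commit in sorted(by_path.items()):
        print(f"  {path} -> {commit}")
```

Detecting the mismatch is the easy part; git itself does nothing to reconcile the two copies.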
We used them for years for dependencies, but have instead moved to a monorepo because of how we want to do releases. We never had any issues while using submodules.
If your projects and libraries are public, your package manager will likely let you use Git references (usually commits/branches/tags) to define dependencies. I find that cleaner than submodules.
1. A blob, which is analogous to a file, and is referenced by the hash of its contents
2. A tree, which is analogous to a directory, which contains blobs and other trees, and maps them to names. It is referenced by the hash of its contents.
3. A commit, which contains One (1) tree (the top-level of your repo), a reference to one (or more, for a merge) parent commit, and miscellaneous metadata like the author and the commit message. It is referenced by, you guessed it, the hash of its contents! (Annotated tags are a fourth kind of object, a tag object, which points at a commit.)
3.5. References, which are analogous to symlinks to commits. Branches and lightweight (non-annotated) tags are References.
Now, remember how a tree can contain blobs or other trees? What if (gasp) you put a commit object in them!? That's essentially what a submodule is.
That's why a submodule is always included _at a particular commit of it_. That's why there's all sorts of complicated support machinery to make "a commit object inside a tree object" make sense.
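A minimal Python sketch of the above, using only `hashlib`. The hashing scheme (SHA-1 over a `<type> <size>\0` header plus the body) and the `160000` mode for a commit inside a tree are how git does it for SHA-1 repositories; the file name, submodule path and pinned commit hash are made up for illustration. Real git also sorts tree entries by name, which these two already are.

```python
import hashlib


def git_object_id(kind, body):
    # Object ID = SHA-1 of "<kind> <size>" + NUL byte + body.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_entry(mode, name, object_id):
    # One tree entry = "<mode> <name>" + NUL byte + raw 20-byte object ID.
    return f"{mode} {name}".encode() + b"\0" + bytes.fromhex(object_id)


# A blob is referenced by the hash of its contents.
readme_id = git_object_id("blob", b"hello world\n")

# Hypothetical commit in the submodule's own repository.
pinned_commit = "1234567890abcdef1234567890abcdef12345678"

entries = [
    tree_entry("100644", "README", readme_id),          # regular file (blob)
    tree_entry("160000", "vendor-lib", pinned_commit),  # submodule: a commit ID
]
tree_id = git_object_id("tree", b"".join(entries))

print("blob:", readme_id)
print("tree:", tree_id)
```

Changing `pinned_commit` changes `tree_id`, which is why bumping a submodule shows up as an ordinary change in the parent repo.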
> a submodule is always included at a particular commit
Actually you can make a submodule track a branch instead of a specific commit. I've never seen anyone actually do that though and it seems like a bad idea. Though I did work for a while for a company that had written a custom tool that worked like that and we never ran into any problems due to it.
I don't think it is possible to put a branch name into a tree object, not without deep modifications to git, so I suspect your previous company developed a significantly different tool around it.
My previous company wasn't using native Git submodules.
I think you're right, actually, about the submodules. You can associate a submodule with a specific branch, but it still records the hash like normal and you still have to manually update it.
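To make that concrete, a small Python sketch that reads a `.gitmodules` file. The file contents (submodule name, path, URL) are invented; the `branch` key is the real `submodule.<name>.branch` setting, which `git submodule update --remote` consults. `configparser` is not a real git-config parser, it just happens to cope with simple files like this one, so treat that as an assumption.

```python
import configparser

# Invented .gitmodules contents for illustration.
GITMODULES_TEXT = "\n".join([
    '[submodule "libfoo"]',
    "\tpath = vendor/libfoo",
    "\turl = https://example.com/libfoo.git",
    "\tbranch = main",
    "",
])


def read_gitmodules(text):
    """Return {section name: {key: value}} for a simple .gitmodules file."""
    # interpolation=None so a '%' in a URL is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


modules = read_gitmodules(GITMODULES_TEXT)
for section, settings in modules.items():
    print(section, settings)
```

Note there is no commit hash anywhere in the file. The pin lives in the parent repo's tree, and the branch is only a hint about which tip to fetch when you ask for an update.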
I’ve used submodules a lot, and choose not to use them, anymore. They are just too much work. I now use package managers to accomplish pretty much the same thing.
One project that I wrote, used nested submodules. There was a specific reason for the nesting, as it was a layered system, and each layer had a very specific context and functional domain, and submodules helped enforce that.
The problem was, it made changes a huge pain. If I made changes in the deeper layers, I’d need to propagate the changes throughout the entire chain, above. I wrote a few bash scripts to handle that, but it was fairly kludgy, and quite brittle.
I ended up just folding it all into a monorepo.
The one feature that I’d love to see in git, is something that Microsoft SourceSafe could do. I call it “Virtual Repos.”
You could make a “repo” that was actually an amalgam of files that were references to files in other repos. Their state in the virtual repo reflected their state in their “home” repo, and changes made in the virtual repo would go out to their home repo.
It must be a nightmare to get right, though. I can see why it would not be implemented, but these could be used for a lot of the same things that submodules are used for.
What is bad about them on the command line? If you understand git the only extra things you need are `git submodule update --init` and `git submodule add ...`
The UX that sucks is around what happens if they are unclean/how to update them. A checkout doesn't recursively checkout the relevant submodules. This is the biggest pain point for most orgs I've worked at. It's an easy setting to set (`git config --global submodule.recurse true`) but the fact that it's not default hurts.
Most engineers have a poor understanding of Git. My university had a great history of version control course right at the dawn of the git era (In 2006! RIT really speedran it, standardizing on Git by the end of the year, but also including tutorials on RCS, CVS, and SVN and a brief foray into Perforce). Still, a ton of my classmates just didn't get it.
The other major blocker is what to do with an unclean submodule repo; I honestly don't remember what git does by default, because it's bad. And most projects get unclean real quick. Makefile hygiene is not common, and for most of the time most projects became unclean from a simple `make`. It's better now, but not great.
> Makefile hygiene is not common, and for most of the time most projects became unclean from a simple `make`.
If you're checking generated files into git, submodules aren't the problem imo.
I can see the frustration of modifying submodules files and trying to commit the main repo, but if you have to do that then it wasn't supposed to be a submodule. That's like complaining that modifying node_modules files doesn't apply upstream to your dependencies.
I have gotten rid of those by the rule: If a commit updates a submodule, it must not update anything else. (Yes, this can violate the general rule that nothing needs to be added to a commit to be complete. But updating submodules has been worth the exception in my experience.)
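A rule like that is easy to check mechanically. Here is a hedged Python sketch: it assumes you already have the list of (mode, path) pairs a commit touches (the sample data is invented) and relies only on the fact that git records submodule entries with mode `160000`. As a simplification, it would also flag the commit that first adds a submodule, since that one touches `.gitmodules` too.

```python
GITLINK_MODE = "160000"  # mode git uses for a submodule entry in a tree


def violates_submodule_rule(changed_entries):
    """True if a commit mixes submodule updates with any other change.

    changed_entries: iterable of (mode, path) for everything the commit touches.
    """
    modes = [mode for mode, _path in changed_entries]
    touches_submodule = any(mode == GITLINK_MODE for mode in modes)
    touches_other = any(mode != GITLINK_MODE for mode in modes)
    return touches_submodule and touches_other


# Invented examples.
bump_only = [("160000", "vendor/libfoo")]
mixed = [("160000", "vendor/libfoo"), ("100644", "src/main.c")]

print(violates_submodule_rule(bump_only))
print(violates_submodule_rule(mixed))
```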
I'm working on a project where pushing a commit to a submodule runs a CI job which updates the reference on the parent repository. This seems to lead to very few issues.
Check out git subrepo [0] if you also find working with submodules cumbersome.
It has a different set of trade-offs and, if they fit, works without any problems or changes to your workflow. (The only thing it has problems with is rebasing, under specific circumstances.)
Just checked subtree, and while they aim to provide the same thing, they go about it in different ways.
- You can mix subrepo commits and main repo commits freely in a single commit, it’ll take care of submitting only the relevant changes when pushing upstream.
- Publishing changes from a subrepo is just a single command.
- Subrepo adds a .gitrepo file to the subdirectory for metadata
The README in the repo does a good job of explaining things.
I haven't ever actually used git subtree to push changes, but I'm pretty sure that all of those are true for it too, and git subtree doesn't need a `.gitrepo` file so that seems like a point in its favour.
I'm sure there are advantages to git subrepo, but I am still not sure what they are.
After running the platform org at a company that used submodules: never again.
There was not a week that went by without me having to unstick some team that had horribly managed to screw things up because of them or watch an engineer burn an entire day fighting with them or watch a new hire completely confused.
"Well if everyone would just ..."
Everyone is not going "to just". If your system relies on everyone inherently having the same understanding of the world and behaving in the same way, then it's a terrible system.
This is the article I should have read when first trying to use git submodules. The two main facts, "a submodule is another git repository" and "a submodule is always pinned to a specific commit of the other git repository", are the most important things for understanding git submodules. Somehow all the tutorials/examples that I saw before show lots of git commands and their outputs but do not highlight these two basic facts, so they do not actually help.
Hey, I wrote this article. Just wanted to really thank you for saying that. I also felt like "why didn't anyone tell me this" so I wanted to share it :-)
Git subtrees also let you move back and forth between multiple repos vs mono-repo without losing history. So you don't need to solve that particular debate in your team.
Can anyone speak to use cases for submodules that aren't better served by your language’s package manager? Multi-language codebases, languages without appropriate package management perhaps?
Submodules are great for projects where your code depends on upstream Git repos that you don't control and don't want to vendor yourself.
I recently did an embedded Linux design that depended on 5 external repositories: one from Yocto, three from OpenEmbedded, and one from a CPU vendor. My own code just sat on the top of this set of repos.
Submodules made that design very simple. One repo with all of my code in it, and submodules for all external repos. All dependent repos were pinned because of how submodules work. Pins were easy to update when desired, and never moved on their own.
Isn't that because you didn't have a good package manager to handle these? If you have a package manager that allows you to add packages from Git repos, why would you use submodules?
We use a submodule in https://github.com/uber/h3-py to wrap the core H3 library, which is written in C. Submodules seemed like a reasonable way to handle the dependency, and, at least for this use case, the approach hasn't given me any problems.
The latter had been an issue for me in the past with some projects that just weren't packaged for, e.g., Python and had to be imported directly. It can also be helpful for non-packaged assets that are held in a separate git repository.
This is where I use them. I have some Rust bindings to C++ code, and that C++ code lives in my repo as a submodule. Everyone seems to hate submodules I guess because of the surprising behavior described in the post, but for my use case they've been completely fine.
Two of the largest tech companies in the world, Google and Meta, had to roll custom VCS for their day to day engineering operations because git and git submodule were so unsuitable. The default pack file behavior of git is completely unsuitable for a rapidly releasing company with a monorepo. You don’t want or need the entire history — you just want a few recent commits. You do want some visibility into what your coworkers are up to so you can prevent merge conflicts before they happen (centralization is good!). You probably only need part of the tree, not the entire thing.
If you go back and watch Linus’ talk at Google regarding git, he’s basically describing (unknowingly) why Google needs to not use git for its day to day. Even on a smaller scale, Android (AOSP) had to create a meta tool for git called git-repo to handle its source tree. Git submodule failed there.
> Two of the largest tech companies in the world, Google and Meta, had to roll custom VCS for their day to day engineering operations because git and git submodule were so unsuitable.
Where did you get that from? Sources?
Google rolled their own VCS, because Google is older than Git, and they needed something that works. Their custom VCS is a hacked up version of Perforce.
By the time Git came around, Google was already pretty much committed to their in-house custom tool, too many things relied on it.
Git submodules aren't really intended for that use-case in the first place. They're not really intended to model a mono-repo at all, more a relationship between repositories that have their own histories.
The main thing that has been developed in git to allow very large repos is shallow clones (both in terms of history and slices of the repo). This model works well enough within git's logic, but it's just historically not been focused on until fairly recently (and I don't really know what the state of play is there - I think there's still a limit at a certain scale where simply finding the state of play of a large checkout becomes a bottleneck, and you start to want a persistent daemon to use FS notifications to keep track of what's changed instead of stat()ing every file in the tree)
(I've often pondered if it would be possible to make a DVCS where there's no firm repo boundary at all, i.e. you could construct a checkout from any combination of trees and commits stored in different locations, and have it work seamlessly. There's probably more than a few thorny issues in there, but it would be an interesting concept)
Every "language package manager" with a lock file format and requirements file is an inferior, ad hoc, informally-specified, error-prone, incompatible reimplementation of half of Git.
Almost every use case for a package manager is better served by Git, whether you choose to use submodules or not. If you want to do version control, use the version control system, and stop trying to do an end-run around the way it works.
Previously:
> I'm happy to criticize NPM the tool. The whole thing is designed as a second, crummier version control system that lives in disharmony with and on top of your base-level version control system (so it can subvert it). It's a terrible design.
Git only replaces the lock file aspect of package managers, not the version requirement resolution part. (Or the part that deals with e.g. Rust's feature selection, or different build instructions for different operating systems or versions of the language etc.)
> Git only replaces the lock file aspect of package managers
Nope, Git is pretty good about downloading stuff over the network, too. In fact, it's so good at it that many people using a language package manager insist you use Git at some point even when (before) using the package managers. Indeed, there's been a lot of trepidation and gnashing of teeth about whether the places where language package managers download packages from are as reliable/trustworthy as the server where the Git repo for the software project is hosted.
> nothing about eg semantic versioning, or how to resolve different requirements from different libraries
"[…] incompatible reimplementation of _half_ of Git."
You can use git subtree to convert between mono-repo and separate repos, without losing your history. You can even keep both styles up to date concurrently.
Actually it's backwards. Git gives you the ability to manage and navigate the commit graph at quite a low level using a pair of commands, checkout and reset, and so fulfill any wild desire, while in other VCSs it's a separate command per case.
Note: I think Git's UX is horrible and requires multiple years of practice to be comfortable with.
Most of the time, you only need a handful of commands, but there's a long tail of niche situations, especially if you are using git for maintaining a large project like the Linux kernel.
Remember, git was designed and written for the kernel first and foremost.
I am running simulations with a rapidly evolving codebase. I have a separate repo with all the simulation code in it. I want to tie each simulation to the git commit (of the main repo) at which it was run. Are git submodules the correct solution to this in any way?
understanding submodules has not caused me to stop wishing that something in the vein of nix (in the sense of being able to provide a "lockfile" that transcends language-level package managers) becomes sufficiently commonplace that people would feel silly doing anything other than using whatever that turns out to be, or just directly vendoring if all else fails