Disappointed to see so many knee-jerk reactions to this. Vendoring dependencies is a simple way to ensure consistent build inputs, and has the bonus effect of decreasing build times.
To respond to the two major criticisms:
1) “It takes a lot of space”
Don’t be so sure. Text diffs and compresses well. I have a 9-year-old Node repo that I’ve been vendoring from the beginning and it’s only grown 200MB over that time. (Granted, I’m fairly restrained in my use of dependencies. But I do update them regularly.)
But even if it does take a lot of space… so what? If your dependencies are genuinely so huge that this is a problem, then vendoring may not be right for you. But you could also use one of the many techniques for managing the size of your repo. Or just acknowledge that practices are contextual, and there’s no such thing as “best practice”—just a bunch of trade-offs.
2) “It doesn’t work well with platform-specific code”
This can cause some pain if you’re in a multi-platform environment. The way I deal with it (in Node) is by installing modules with --ignore-scripts, committing the files, running “npm rebuild”, and then adding whatever shows up to .gitignore. I have a little shell script that makes this easier.
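A sketch of what such a script might look like (simplified and hypothetical; the exact files that show up after the rebuild depend on which native modules you use):

    #!/bin/sh
    # Vendor package sources only, then ignore whatever the native builds produce.
    set -e
    npm install --ignore-scripts          # fetch sources, skip platform-specific builds
    git add -A node_modules
    git commit -m "Vendor dependencies (no build artifacts)"
    npm rebuild                           # now run the native builds locally
    git status --porcelain node_modules | awk '{print $2}' >> .gitignore
    git add .gitignore
    git commit -m "Ignore platform-specific build output"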
This is only an issue for modules that have a platform-specific build, which I try to avoid anyway. But when it comes up, it can be a pain in the butt. I find that pain less frequent and more predictable than the pain that comes from not vendoring modules, though, so I put up with it.
Bonus) “It’s not best practice”
Sez who? Dogma is for juniors. “Best practices” are all situational, and the only way to know if a practice is a good idea is to examine its tradeoffs in the context of your situation.
npm has long had a problem respecting lock files. The concept is easy: have a fixed lock file, get a reproducible build. But no: npm will change your lock file (I believe it's framed as "optimizing") without notice.
(Perhaps they've solved this in the last couple of years. I've been staying away from that ecosystem... too much growing in it...)
Yes, you are right. But there are people who trust npm (now and esp. in the future) and other free infrastructure, and there are those who prefer to be a bit more self-reliant after getting burned once.
I use vendoring in Go because my team's builds happen within a huge, complicated corporate network that has been known to break arbitrarily in new and interesting ways (or rather when something gets changed unexpectedly and then it takes days/weeks to navigate outsourced IT and change it back). Vendoring deps doesn't save me all the time, but I've generally found it helps. Plus builds are a bit quicker because I can download everything in one go (via git clone) rather than pulling everything in at build time. It also helps when the linter decides it wants all the dependencies downloaded before doing anything and then we find it has a relatively short timeout when the network gains a lot of latency without notice.
On reflection, it seems more like I'm papering over network issues. Perks of working in an enterprise company I guess.
In a discussion about Skub, no one need to explain why the Skub-powered approach isn't good enough. It is the duty of anyone pushing Skub to explain exactly what makes Skub so special to the point that we need to have Skub in our lives.
There are only 2 problems I see with the existing solution in npm.
- "npm add package" puts in a "^ver", which is bad practice
- there is no good infrastructure to pull hash based blobs out of the ether in case npmjs is offline
npm-shrinkwrap has solved repeatability forever, people just didn't always use it. Auto-upgrading dependencies is the big problem, which should never have existed because it is not principled. I'd go further and say that dependencies and devDependencies should only support exact versions, and peerDependencies are the only thing that should support non-exact versions.
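On the "^ver" point, npm can at least be told to stop saving ranges by default; a small mitigation, not a fix (a sketch):

    # Make `npm install <pkg>` record exact versions instead of "^ver" ranges
    npm config set save-exact true
    # or per project, via .npmrc:
    echo "save-exact=true" >> .npmrc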
I don't check in my dependencies in my current project because I don't need to; but in earlier projects, I or we did, for various good reasons; and it worked perfectly well, and was extremely convenient for new developers.
Dumb question: what does it mean to “vendor” your dependencies?
Best guess is something like “ship required source or binaries along with your end product.” Like static linking but extended to dynamic languages and source control.
It's to have them come from some place you control. That can be your source control, but a very common way to vendor dependencies in other languages is to just save them on a server somewhere and pull them from there when you install.
Upsides from storing node_modules in repo are outweighed by the downsides. Unless of course you're Google-scale and can afford to contribute filesize fixes upstream, write fancy tooling to enforce commit-time workarounds, etc. Nobody working finger-to-feature has time for this.
For your average npm shop which doesn't have infinite internet oil money, here is why the article recommendations won't work for you.
Your CI will pay the time penalty during git clone instead of npm ci. In fact, the node_modules folder will be bigger than your source folder almost immediately. And over time you won't just be cloning the files at head; you'll also be cloning every npm package binary ever committed. You can't undo this without investing in smarter git tooling, which is time spent not writing features.
NPM packages which install arch-specific binaries will constantly flip-flop between commits by devs on different OSes.
Nobody is safe from left-pad, not even Google, and committing your node_modules folder doesn't change that. Eventually someone is going to have to run npm i.
Running npm ci on everyone's machine is reproducible, I don't know what OP is warning about. The package lock pins all the versions.
If you have a large enough team to invest in dev experience, there are way better ways to get the advantages of the article without the downsides. You can cache the npm ci result in a container layer for your CI/CD or use middleware like Artifactory.
Maybe, but everyone's CI situation is so variable that it may not be that easy. For instance if you are using a monorepo then even a shallow clone can be overkill. And if you rely on git history for conditional CI then a shallow clone will ruin the output of many git commands. So you could end up in an either/or optimization situation depending on the order your CI/CD organically grew and other architectural decisions you made.
Counterpoint, and ignoring download size as your typical CI probably doesn't download the entire history: when are we supposed to pretend we've reviewed our dependencies?
I'll admit I don't believe everyone always needs to check every dep, but we're skating close to nobody ever checking them.
My team guards up front when introducing a new dependency. You fill out a little template with security assessment as well as some other stuff, just to do a dirt simple build vs 'buy' analysis. left-pad for example would fail because the build time cost savings are not worth the ongoing maintenance cost. (In fact doing this assessment at all rarely makes sense for microlibs, by design.)
Once something's in package.json I don't believe anyone who says they can vouch for the security of that over time. We're all doing security theater with npm audit, dependabot, etc. Don't use npm at all if anyone's life depends on your code.
I think formalising an assessment like that makes some sense, but the question was more around what the assurances are. So it probably works like this:
#1 You look at the dependency and do an assessment on whether it's worth including. Check.
#2 You probably require some automated checks. SAST, Dependency Scanning / SCA, maybe some DAST, etc. Check.
The outstanding question though...
#1 Did anyone actually read the code of the dependency?
#2 Did anyone actually look at what the dependency itself pulls in?
#3 Are these checks re-done when you update the lock files?
#4 If nobody is doing it, who's updating the lists and rules we use to scan from?
#5 Where possible do you have the monitoring to check when an app is doing something weird? i.e. network ACLs that when they fail, cause an event, that alerts a person to investigate?
I think we're mostly agreeing here, but the wider question is why is it that folks writing the app and including the dependency don't feel responsible for these things?
I think you mean they're automatically doing SCA and maybe SAST. I don't think there's a human working at Microsoft reading the code for you though, is there?
> Your CI will pay the time penalty during git clone instead of npm ci.
Things like GitLab's CI runners will do a single clone then do `fetch`, `checkout`, and `clean` to checkout your repo. Git repo size isn't a huge bottleneck in CI performance.
Only if you have long-living runners: if you use a dynamic fleet to save money, then you clone the repo almost every time. However, this is why you can do sparse checkouts and limit the git depth.
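For example (a sketch; the repo URL and paths are placeholders):

    # Shallow, blob-less, sparse clone: recent history only, no old file contents
    git clone --depth 1 --filter=blob:none --sparse https://example.com/our-monorepo.git
    cd our-monorepo
    git sparse-checkout set apps/frontend packages/shared   # materialize only what this job needs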
Yarn offers "Plug'n'play" mode since v2, which basically promotes what the author says. It takes the idea further: dependencies are stored as zip archives instead of thousands of small files, which reduces the "git noise" and actually makes this viable as a performant workflow.
Yarn’s offline support is not directly tied to PNP. Yarn v1’s support for “offline” installs was a day one requirement, and as I understand it one of the primary drivers for Facebook engineers (at the time) to drive the creation of Yarn.
If offline installs is what you want, I don’t see any advantage of node_modules compared to this feature - only disadvantages (size, noise, and cross-platform incompatibilities).
I might be misremembering but at runtime, the Node process loads .pnp.js, which is a kind of monolithic “compiled” modules file containing all the modules installed. No reading of the .zip files occurs at runtime. (Again, I might be misremembering or have misunderstood. Please confirm/deny this if you know.)
I’d be curious how “read file from a zip that I know the exact path to” performs compared to “recursively walk the node_modules directory and dynamically lookup the location of the file”.
Exactly. Reading the article, for a second I felt like the author was either not fully aware of the tooling, or doesn't care about the noise (potentially complex conflicts) that checking in node_modules will cause.
Also if you use the yarn deployment plugin (forgot the name), it will create a deployment folder for you that does not contain dev dependencies. This way we found a package that was installed as a dev dependency but in reality was a regular runtime dependency.
Yarn pnp was in a poor state tooling-wise last time I checked it a year or so ago. Too many tools depend on ./node_modules/ (or actually an entire module.paths thing) to exist on a filesystem. Was it resolved somehow?
By and large it has been solved with editor extensions and PnP plugins for various tools; I've used Yarn 2/3 at enterprise scale for a few years now and am happy with it.
Wanted to suggest the same thing. Used Yarn 2 PnP with zero-installs at a past company and it worked great. Whenever I switched to projects not using it, installs seemed to take ages. Used that only on the backend, YMMV.
But there is nuance (there always is...): the README file in node_modules is here: https://github.com/ChromeDevTools/devtools-frontend/blob/mai... - and it makes it clear that only NPM dependencies used by the build system or infrastructure are meant to be checked in. Other NPM packages should not be.
----------
In conclusion: the linked blog-article is clickbait that misrepresents how the Chrome team manages their dependencies.
In our industry (games) we often do that - check in prebuilt code in the depots (typically "p4"). I'm not saying it's wrong/right, it's just what we do (not 100% fully, but almost, though people in IT/infra tend to do otherwise).
Does your team/company store art and asset blobs (even rendered FMVs?) in Perforce too, or do you have a separate asset-management system for that? If you do have two separate systems, how do you keep the asset-store and source-control in-sync?
It's been a long time (easily 20 years now) since I dabbled in any high-end creative software (like 3ds, etc), but I remember they generally all had large binary singleton project files (like Flash .fla and Photoshop .psd files) that couldn't be deconstructed and effectively diffed by any source-control system (though Flash eventually supported external ActionScript source-files), I'm curious how that affects your org's asset storage needs.
We ... commit them. Probably some really exceptional cases stay behind some "ftp"-like service (not really ftp, something like this).
It may be wasteful, but it's the established practice (it seems). People coming from other game companies pretty much use it too (with some rare exceptions). Also the automotive/chip design industries use it - and yes, mainly for the big blobby non-diffable/mergeable assets.
Locks are terrible, and yet you gotta do it sometimes, as how else would you prevent people from working on the same asset?
I'm (still) terrible with git, often screw up commits, and have to google search/stack overflow to get it right (I use it mainly for simple home projects). I can only imagine the pain and suffering a non-tech person would have with git. Also the metadata is quite a lot for WFH conditions. Working remotely does not always mean working from a dumb terminal (I wish).
> Locks are terrible, and yet you gotta do it sometimes, as how else would you prevent people from working on the same asset?
I'm unfamiliar with P4's locking semantics; when files are "locked" does that prohibit other users from even getting a copy of the centralized file, or merely prevent users from overwriting the centralized file on push/upload? How does branching work?
If I were designing a centralized asset management system then I'd definitely add support for git-style (i.e. "many-worlds") branches instead of SVN/TFS-style "spatial" branches - but I'd also add support for some kind of "mini-branch" or deferred-conflict-resolution, whereby a file, or entire directory, can still be pushed to central storage but have multiple different representations that can be resolved/merged later, rather than immediately. So if two artists are working on the same "texture123.psd" file without realizing it then the system would let them both push (so the first artist to push would get to overwrite the file, and the second artist's push would see their file saved as "texture123.psd.v2").
There are good business reasons for having the ability to disallow changes to files in central storage, but that doesn't mean locks need to be used: it could be done by instead directing all updates/pushes to separate mini-branches, thus allowing users to push-and-forget and allowing them to defer conflict resolution while still protecting files from unwanted changes.
I've only ever used Perforce through a command-line interface. What's the UX like when using Perforce for blobs/assets? Does everyone have to use the command-line or is there a GUI experience (with fast-rendering thumbnails?)?
Forgive the questions, I'm just curious about the minutia and peculiarities of the gaming-biz because I've never worked in the field.
I've seen this trip people up in the past. In one case a CI/CD system running Linux was used to produce a project deployed to a mostly Windows environment; this didn't cause any issues until the day a developer added a binary module. Honestly, I'm surprised it worked as long as it did, it took a little over a year before anyone hit that issue.
No, and that detail is dangerously missing from the advice. lol
Anyone that does this and has teams on Mac, Windows and Linux will find out how crappy this is very quickly.
It’s even worse than that. If you support all LTS/current Node versions—as is, and should be, very common for libraries—it'll break even on a single platform. I’m sure it’s solvable, but the solution would be so complex it’s tantamount to building a new package manager.
> Would this still be an issue if the whole team uses docker to run the code?
It could be if it's a CPU architecture difference. For example an M1 Mac (ARM64) vs just about every other system (x86-64).
I know we had to switch out MySQL with MariaDB locally because the official MySQL Docker image doesn't support ARM64 devices but MariaDB does. That's just another example where even if you're using Docker there could be differences.
We've also had issues where developers aren't used to case sensitivity at the file system level and things work on their Mac but fail on Linux in CI because Docker's bind mounts (often used in dev) will use file system properties from the host OS which means even if your app runs in Linux within a container it may run differently on a macOS host vs Linux.
The moral of the story here is Docker is good but it isn't a 100% foolproof abstraction that spans across Linux, Windows and macOS on every combination of hardware.
> Once you check your node_modules in, there's no need to run an install step before you can get up and running on the codebase. This isn't just useful for developers locally, but a big boost for any bots you might have running on a Continuous Integration platform (e.g. CircleCI, GitHub Actions, and so on). That's now a step that the bots can miss out entirely. I've seen projects easily need at least 1-2 minutes to run a complete npm install from scratch [...]
Couldn't this issue be solved by using caching? If I remember correctly, Travis CI has the option to cache certain folders between builds, meaning an npm install doesn't have to start from scratch, and can just incrementally update the cached node_modules folder (any changes are then copied to the next CI build).
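For example, a generic sketch of the idea, assuming the runner exposes a shared cache directory keyed on the lockfile:

    # Restore node_modules if the lockfile hasn't changed, otherwise rebuild and save it
    KEY=$(sha256sum package-lock.json | cut -c1-16)
    CACHE="/ci-cache/node_modules-$KEY"
    if [ -d "$CACHE" ]; then
      cp -a "$CACHE" node_modules     # cache hit: skip the install entirely
    else
      npm ci                          # clean install, exactly what the lockfile records
      cp -a node_modules "$CACHE"     # save for subsequent builds
    fi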
I understand what he's getting at, but if you are trying to have an exact copy of code in order to guarantee an exact behaviour, then you would want to extend that to the operating system used to run the software.
One example was having code where the test suite ran fine on my local MacBook, but would fail on CI (Linux). It turned out that on Linux finding files by name is case sensitive, whereas on Mac OS it isn't, and a require statement in Node.js was referencing a file path with casing that was different to the file name's spelling.
This entire comment thread is just torture for anyone who uses real dep management tools like Nix and Guix. Sorry, but it's just exhausting watching the same asinine conversations play out over and over. I guarantee there are multiple people who have spent more time pontificating over pointless ways to microscopically improve the node tooling situation who could've picked up Nix in 1/10th the amount of time.
But then again, I know there's a bunch of Nixers that pick their head up slightly, shake it, and then just go back to work on actually interesting problems instead of a millionth discussion about dealing with npm. Christ, the stuff people put up with.
EDIT: Ironic, this being here along with the CISA/log4j post where everyone is yammering about SBOM (software bill of materials). Again, I just glance over at Nix and go, "sure, what do you want to know, I can tell you instantly if log4j is anywhere and if it's a vulnerable version (excluding non-source-built packages in nixpkgs)".
This really only works if you avoid native modules. I'm not sure why the author didn't mention native modules as it's a pretty big deal in this context.
> There are times where this doesn't work; updating TypeScript may require us to update some code to fix errors that the new version of TypeScript is now detecting. In that case we have the ability to override the rule. As with anything in software engineering, most "rules" are guidelines, and we're able to side-step them when required.
Ah, I glossed over that. But that just makes me more confused, not less. On a long lived project most changes to node_modules will require code changes... just not sure what the point is.
If your libraries have breaking changes on every update, sure. Most of the libs I use haven’t had one in decades, but I don’t write js so maybe my experiences aren’t that relevant.
Can’t you like, cache the node_modules folder on CI builds? I dunno, seems gross, unless you’re on a project with minimal deps or very meticulous about which deps you leverage. I am just one of those people who are constantly trying to upgrade dependencies anyway (cautiously of course) so as to avoid vulnerabilities. That said, I see the point, it’s interesting.
it’s possible, kind of a moot point as doing nothing induces a similar level of risk in my experience. Don’t touch your code for two weeks, I guarantee you that ‘npm audit’ will complain about some new issue.
Also worth noting the longer you wait to upgrade, the harder it can be to do so when you finally need to. If someone discovers a critical fault in the version you're running but you're several years out of date, upgrading can be a huge pain.
This very week I was dealing with my artifacts exploding in size because AWS got 429s from GitHub. Then Composer pulled from source and there were so many extra files and SCM folders we exceeded the max artifact size.
Another idea is to host your own package cache. That would be my preference where SCM size prohibits checking in dependencies themselves.
In the past for a large Python project I've handled this using a separate repository for all of the dependencies - that way you can still get work done even if PyPI is unavailable for some reason, but you don't bloat your main repository with an extra few hundred MBs of stuff.
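Concretely, that dependency repository can be little more than a directory of downloaded wheels (a sketch; directory names are up to you):

    # Populate the dependency repo once, or whenever requirements change
    pip download -r requirements.txt -d ./wheelhouse
    # Later, install entirely from the local copy without touching PyPI
    pip install --no-index --find-links ./wheelhouse -r requirements.txt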
In my experience, it's about the inconsistency between Node.js versions (each version could produce different package-lock.json).
So, for example, you install the dependencies with Node 12, but have to run locally the system with Node 14.
So, I think just checking the node_modules folder into git is not a complete solution. You're only avoiding the need to commit an inconsistent package-lock.json, which is not that hard to solve anyway.
Nobody should be running different versions in a team. Unless you’re selling a product where this is a likely scenario you should be running the same environment either through some kind of virtualization or at worst nvm.
When you install Node.js, you get npm installed along with it. So in this case there's not much you can do. It's not just an npm issue, it's a Node.js issue, too.
Not necessarily. On Arch, npm isn't bundled with the node package.
Also, according to the official npm documentation, npm should be installed through nvm, thus having the ability to specify its version separately from node.
Note the other child comments of this; NPM has failed to make the lockfile reliable across systems and versions. Here's an example from March that isn't fixed: https://github.com/npm/cli/issues/2846
The fact that you have so many people believing "nuke node_modules and delete package-lock.json" is a reasonable step in diagnosing an error is damning to NPM.
We don't check in our node_modules, but "use the lockfile" is not a valid counter to this article's points.
I honestly don't understand how people get this impression of lockfiles as being perfectly reliable. How are they not occasionally bitten by these bugs? Maybe I'm just unlucky, but I'm a little jealous of these developers who apparently are good enough managing/updating their dependencies and keeping their count low enough that they've just never run into problems like this before.
Lockfile v1 literally ignores pinned versions of dependencies if the package.json specifies a fuzzy version number[0], and the advice of the npm team was, "it's fine, everyone will just bump a major version number of npm." And to this day, I still don't know what the expected behavior is, there really isn't a list anywhere about when the lockfile is and isn't supposed to be respected. So it's not really surprising to me that people distrust version pinning, and I always feel like I'm kind of living in a different world when people say that lockfiles just solve everything.
npm went through a rough few years (lockfiles, leftpad) and obviously the hivemind of the JS ecosystem is not the most careful one (hence all the advice of nuke it and npm i).
but those who care use yarn, those who even want to be correct use yarn2, and so on.
Indeed. What's probably needed here is a way to review a diff of the contents of the updated packages. Checking them in is just a brute-force way to do that.
not to mention that unless someone is very familiar with the code of dependencies it's very hard to review hundreds of small near meaningless changes unrelated to your actual functional/business requirements.
something like cargo-crev for npm might be a long term solution
I’ve heard lots of people claim that yarn gives no advantage over npm anymore, as of a year or two ago. But in 5 or so years of using yarn every day, on numerous projects, I’ve probably nuked node modules a couple of times, and never even considered deleting yarn.lock. Maybe yarn is still superior in this regard?
Done. Then I delete it, and `npm install`. Then commit. The majority of people I have worked with do that. On Friday some dude was saying "shrinkwrap v2 is not shrinkwrap v1" and the advice was "delete it and npm install" then commit. (Payment company software manager).
In my experience this is probably not viable. A single, valid update to a direct dependency can bring dozens or even hundreds of updated sub-dependencies. And an audit fix will often update just the lockfile. Maybe I lack imagination, but I can’t think of a workable heuristic for determining how the lockfile was changed.
I think, instead, it would be good to move in a different direction for sub-dependencies generally. A rough sketch:
1. Packages state their dependencies as they do currently.
2. When released, they’re built and bundled by the package manager host.
3. Included in the bundle is a manifest of the dependencies used, specific imports used, and a hash for each (recursively until exhausted).
4. On install, identical code (same hash/same bundled result) is deduplicated.
5. A human readable record of dependencies and imports used is produced. It’s important that it’s human readable, because:
6. This should not be filtered out in diffs. It should be subject to review just like any other change.
All of this is pretty complex, and there are probably ways to reduce that complexity. But it has some obvious advantages:
- Only your direct dependencies are installed. Tons of bloat can be stripped out.
- Even deduplication can be performed on the package manager’s servers. And hashes aren’t a particularly expensive lookup.
- It would go a long way towards addressing audit fatigue: if your dependencies’ bundles don’t include affected code, the audit doesn’t apply; if they do, you can be reasonably confident the audit is valid.
- A (wild guess) huge amount of the time, sub-dependency changes will require little to no review. Their stable parts will seldom change, and the parts shared among several dependencies could be reviewed as one unit.
- Lock files themselves just need to track direct dependencies (and even then, only to support semver ranges).
What you're describing is more in spirit with the intent of a lockfile and worth exploring by package managers. But I do think a heuristic could be conceived for the lockfile-was-rebuilt situation: If a single top-level dependency declared with ^ or ~ was present in the lockfile when the existing resolution was still valid.
Ideally the onus is on the package manager to provide metadata in the lockfile for the strategies it took when generating.
Problem with yarn is that it doesn't actually use dependencies' lock files... So once you publish your library to npm your lock file doesn't do anything whatsoever.
That makes a lot of sense for file sizes—otherwise, common dependencies a patch version apart would be duplicated many times, and it would block you from upgrading a library's dependency for a security fix. You still get the important part of reproducible builds for your program. Rust's Cargo behaves the same way.
I like the benefits but our node_modules is 1.9GB, def not checking that in.
A Git bot that parses changes to yarn.lock and comments with size/LOC/files of all the deps, etc., kinda like how coverage bots work. Get all the observability benefits without the check-in cost; also wouldn’t have to split npm changes and code changes.
Tbh I haven’t gone deep into it; now you ask, maybe I should. It compresses OK, zstd to 240MB. React + React Native, webpack, etc... tons of stuff. My guess/hope is it's mostly dev dependencies.
Mostly dev dependencies, I guess. You really pay the price for not having a compressed intermediate format like a JAR:
- storybook is 420MB (! will look into this one!)
- babel is 110MB
- sentry sdk is 92MB (?)
- aws-sdk is 63MB
- typescript 60MB
- react-native 51MB
...on and on, 40MB, 30MB, 20MB... hundreds of them, and it just adds up! 147 dependencies over 1MB. Pretty incredible/wasteful really when you look at it.
Maven, Gradle and the like always download dependencies as part of the build, and if a server is down (or flaky like jitpack) good lord is it annoying.
What I don't understand is why dependency download isn't a separate task you do before compilation. FIRST you download all the dependencies to stabilize that, then you build the code. The reproducibility is the big one.
It drives me nuts that the dependencies are hidden away in Maven and Gradle. I have to look up obscure "download dependencies to a lib" task configuration. Then there's the obsession with massive jars/wars/whatever when all you should have to update in a deploy is the difference in the libs and the main code jar.
The reason for this obscurement is pretty much "uh, it saves disk space?" which is a laughable consideration given the bloat in war files, docker images, and the like.
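For the record, the maven-dependency-plugin does expose this, it's just not the default workflow (a sketch using its standard goals):

    # Resolve everything up front, then materialize the jars into a local lib/ folder
    mvn dependency:go-offline
    mvn dependency:copy-dependencies -DoutputDirectory=lib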
> What I don't understand is why dependency download isn't a separate task you do before compilation. FIRST you download all the dependencies to stabilize that, then you build the code.
This is what bazel does. It also offers a `bazel fetch` to pre-download things before going offline (e.g. for a flight).
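Roughly, a minimal sketch: fetch once, then refuse to touch the network during the build:

    bazel fetch //...              # pre-download every external dependency
    bazel build --nofetch //...    # fail instead of fetching during the build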
I don't agree that it should be checked into source control. I do believe it should be cached somewhere. How much bandwidth of popular sites is used by redundant actions? A single request for a 2GB archive is much better than 1,000,000 small requests that are all a few KBs or MBs in size.
Nodejs has a node-gyp problem. Every node_module that somewhere down the dependency tree requires a "native" module will require recompilation on the target machine (or in worst case: the user machine).
I really would have hoped that the NaN module related problems will be fixed over time, but here we are in 2021 and nothing's been fixed.
As long as npm doesn't use binaries and headers, those things will stay broken. The argument they make for "always source" is kinda ridiculous considering that probably most npm packages are run through webpack or another bundler before being pushed to npm - because npm itself has become impossible to use as a package manager alone.
I mean, a couple MB of libraries with the wrong dependencies can lead to multiple phantomjs installations - a project that has been inactive, deprecated, and insecure for years already... just because of some unit tests that have no place in a production npm package.
My hopes are that more sane developers come together, switch to ESM and implement better policies for evaluating their dependencies (e.g. blocking sources from people that have more than 1000 npm packages and brag about it).
Pikapkg was a great idea in my opinion, and I was using it before they moved the project to building astro as a platform :-/
This is basically a way of saying: to hell with those "package managers".
It's a sentiment that I'm actually in agreement with. I've been coding mostly in Java for the past 22 years. Somewhere around 2010 Maven became the prevalent build tool quickly displacing the venerable Ant. With Ant we had builds that used checked in jar file dependencies. It was obvious what your builds consisted of and they were very fast once you cloned the repo.
Then came Maven and the conflict resolution hell quickly followed (esp with unpinned dependencies). Now every time I type mvn install it feels like a small adventure in its own right. I'm absolutely flabbergasted. We sacrificed simplicity and reliability for a bit of instant gratification. Bad tradeoff.
You can save yourself some headaches by not using transitive dependencies in your Maven builds. This can be enforced with Maven Enforcer. The trade-off is that you will have to clean unused dependencies. This can be tool-assisted with something like: https://github.com/castor-software/depclean/ . You can also enforce dependency version convergence with Maven Enforcer. None of this saves you from "jar hell" directly, but it helps prevent you from unknowingly creating a disaster-in-waiting.
It’s been a long while since I saw a POM file with unpinned dependencies… ca 12 years of Java experience. Had to deal with an Ant project recently that didn’t check in the JARs into Git, I thought my hair would become gray by the time I figure out how to pull the right dependency versions transitively to get the Ant build to work.
But I share the overall sentiment. What if Maven Central goes down one day? It’s just a web server like any other.
Most Java shops run their own artefact repository acting as a pull-through repository to Maven Central and other 3rd party repositories. In addition to acting as insurance against losing access to critical dependencies, it also can provide a performance boost when downloading dependencies, and helps to offload your organisation's traffic from Maven Central.
I don’t check in node_modules but do a backup once in a while.
With frontend nowadays it’s sad but after six months it’s highly unlikely that my project will compile, not to mention the tooling like Vue, Vite, etc that has breaking changes.
I mean it’s scary. You write a program and it WON'T run if you just give it enough time. Locking versions is not really a solution since oftentimes the tooling itself and IDE extensions require newer versions of packages.
Maybe you wanted to fix a typo a year down the line, but oh no, now you need to figure out why Vite won’t start, why eslint dropped support for xyz, spend hours figuring out what you need to change in your configs, etc.
I don't see how this would work in practice. You could use module-alias[1] to actually switch to the backup copy during dev or a local build, but that will only work when the backup is on the same machine you're building on, and then everyone on the team will need to have the same backup (or use a network drive for it I guess). If you don't check in your backup then nothing will get through CI or make it to production. Why not just check in node_modules and let git handle the 'backup' process?
This is all horrible advice and other commenters have rightfully pointed this out already so I won't repeat it more… (okay, once more: This is all horrible advice, don't do that)
But, there is one thing I like from this, which is git diffs showing the actual final code diff when you upgrade dependencies.
Of course, this being horrible advice, it ignores how many JS packages ship minified which would make the diff as useful as binary noise. But I like it in theory. This could be a good opportunity to write a tool that replicates this specifically (and what's more, for other languages as well)
Most node modules (and I’m pretty sure this is best practice) do not minify published node module code. The user should decide if they want to/how to minify. I often go in and read node module code and although it might have been transpiled for compatibility, it is not minified.
This is fairly standard "conservative configuration management" stuff.
The company I worked for used to do that with everything. In fact, one of the ways that they archived versions was to create a bootable external hard disk clone of the entire development machine, and store that.
If you want to be absolutely sure that you have the complete building blocks, then you don't trust your package manager. Make local dupes of the packages, and integrate them into your own version control.
Does the Chrome DevTools team use Google's big monorepo and all the tooling around it? If so, that puts the author in a different situation than the vast majority of devs.
Totally agreed with every step, that's what we do too -- except it's composer and composer managed packages here not npm but the reasoning is similar. On top, we too often need to actually patch composer managed packages and rolling patches without the code being version controlled is a PITA. It's git diff --relative if it's already in git otherwise it's .... I dunno, check out the package somewhere else, hope you get a close enough version (because what's released can differ a bit from version control), copy over the files , roll a patch, clean up the patch... what an unnecessary nightmare.
And composer patches make life quite easy compared to maintaining a fork. If I were to fork something I would need to handle merging every time they have a new release, run the build, etc. With a composer patch, a newly released version is installed and the patch applied on top. Sure, if there's a conflict it needs to be manually resolved, but that's usually minimal effort since most patches are absolutely tiny, a few kilobytes at most.
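For anyone unfamiliar with the flow, it's roughly this (a sketch assuming the cweagans/composer-patches plugin; the package and patch names are made up):

    # Generate a patch against the installed (and, here, git-tracked) package
    cd vendor/acme/widget
    git diff --relative > ../../../patches/acme-widget-fix-null-check.patch
    cd ../../..
    # Reference the patch in composer.json under "extra" -> "patches", then reinstall:
    # the upstream release is installed and the patch applied on top.
    composer install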
I never even understood the arguments for keeping the packages out of git. Trying to save disk space these days is pointless. Maybe npm is different but composer handles about 160MB of code here. Maybe I missed the memo but these days that's nothing. My laptop shipped with a 500 000MB SSD so it's like, what, half a percent? The speed advantage, on the other hand, is absolutely undeniable: git won on speed in the first place, and these script-language tools can't possibly compete with a git pull on speed. git diff, as the author notes, is not at all a problem, just separate the vendor commits from your commits. And as I noted: the diffs are useful for vendored packages.
So instead of downloading the current set of dependencies after cloning, you download every dependency ever used while cloning? And you do this to "save bandwidth"? Doesn't make much sense. (Yes I know about shallow clones, but it's often nice to have the full history around)
Any reasonable CI tool will have a way to cache generated assets based on file contents, that's the way to go here IMO.
Consider if you had written, "So instead of downloading the current project source tree, you download every version..?"
That is what happens in a DVCS, after all—in fact, it's sort of the whole point. If you're so uncomfortable with this, it might be worth asking yourself whether it was ever really the case that you agreed that DVCSes were the right approach. (Even then, still no reason to embrace package managers like some kind of paramilitary force that's subject to its own rules—better to just improve your version control system to handle things the right way, right?)
There are cost-benefit tradeoffs for sure. It'd be nice to be able to do a big bisect without needing to flash the dependencies every iteration, but if it comes at the cost of making everything else slower (more bandwidth, more disk I/O, larger diffs to parse, etc.), it just isn't worth it.
Sure you can say "well why not just make git better then it will handle any operation in any size repo imperceptibly quickly", but I think you and I both know that isn't anywhere near as easy to implement as it is to type.
> you can say "well why not just make git better then it will handle any operation in any size repo imperceptibly quickly", but I think you and I both know that isn't anywhere near as easy to implement as it is to type
Good thing I didn't type that, then. That's not what the shape of my argument looks like at all.
There's a massive leap between, "we don't want every clone of our repo per se to carry all the baggage of our dependencies' histories, so it would be nice to have some scheme for handling lightweight, shallow copies" and "... so we decided the right way to do that, rather than making that a first-class feature of the version control system we're using, is to create a hack in the form of a new set of unrelated tools meant to circumvent our VCS's fundamentals completely—so from its point of view, these controlled objects and the scheme we use for managing them are invisible and might as well not even exist."
The listed reasons are insufficient, and we could achieve many of these goals by just pinning our dependency versions. If we checked node_modules in instead, our git repo would tend towards a gazillion GBs after a few commits. Costs outweigh the benefits, if there are any at all.
Horrible advice. Don't break the industry practice and check-in your node_modules
Whether he's right or wrong can be debated, but reducing the argument to "follow industry practice" is a perfect example of cargo culting. Industry practice needs to be based on something today, not something from ten years ago that might or might not be valid anymore. He offers a long list of arguments and even though I don't check in node_modules, some of the arguments are compelling - e.g. we already want reproducible builds and use package-lock, but why not skip this step altogether? Why not skip setting up cache on your CI if you don't need to? What if knowing the details of your package manager and CI is useless because there are simpler ways of doing things. I'm not that convinced that a few commits would create a gazillion GB repo, so his arguments seem stronger.
> He offers a long list of arguments and even though I don't check in node_modules, some of the arguments are compelling (...)
Are they, really?
The less debatable point is the argument that it makes CI/CD pipelines slightly faster, but this feels like an appeal to micro-optimization. Any free-tier CI/CD system out there lets you do a single npm install and move these dependencies as far as you'd like down the pipeline as artifacts. Is a git checkout really faster than an npm install?
For example, is adding a npm dependency really invisible if you already track package.json and even package-lock.json? Those files show up in diffs, and it's hard to miss them.
Also, if the goal is to get replicated builds, isn't this handled by pinning versions and tracking package-lock?
The left_pad example is particularly ridiculous as I highly doubt that a company like Google, like any company that cares about auditing and vendoring dependencies, does not run its own npm proxy with cherry-picked packages.
Here's a better idea, can't you have an npm cache/clone that keeps all the artifacts you use in your code? So you pull from it, it pulls from npm and caches?
> Here's a better idea, can't you have an npm cache/clone that keeps all the artifacts you use in your code? So you pull from it, it pulls from npm and caches?
Not only is that possible, that's also expected to be mandatory in any company that is required to monitor and control dependencies. I know for a fact that some FANGs do manage and enforce the use of internal npm repositories, mainly because of infosec audits, and I doubt Google is not one of them.
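Setting one up doesn't take FANG-scale infrastructure either; a rough sketch using Verdaccio (an open-source npm proxy/cache, which defaults to port 4873):

    npx verdaccio &                                  # start a local caching registry
    npm config set registry http://localhost:4873/   # point npm (or a project-level .npmrc) at it
    npm install                                      # first install pulls through and caches; later installs hit the cache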
The fact that there’s no mention of the `npm ci` command (or yarn or pnpm) makes me wonder how deeply this problem was investigated before using it as a justification for this hacky workaround.
One wonders why they aren’t checking in the binaries for their database and language runtimes. Surely this would save crucial seconds in project setup.
There's a tremendous irony here, which is that these package managers are little more than a hack to let people manage parts of their project tree (and the accompanying shame) outside the harsh and knowing gaze of their version control system.
So I imagine that if you ship an app on Linux you check in the source of every system library and utility it depends on, right? Anything less is little more than a hack to escape the harsh and knowing gaze of your version control system.
> I imagine that if you ship an app on Linux you check in the source of every system library and utility it depends on, right?
Dumb false equivalence, since that ("every system library and utility it depends on") is not the argument of the side you're trying to appear to offer a response to. Please refer to the HN guidelines.
A less dishonest retort would be to ask if one should check in the dependencies that are analogous to what ends up in node_modules, and the response would be, "welp, that's exactly how many app developers have been known to approach things, so 'yes'."
Despite the sanctimony, I'm sure there's at least a slim chance of you understanding my point: the distinction between "what ends up in node_modules" and any other application dependency is arbitrary and purely conventional. There are legitimate technical reasons to check dependencies into source control, but neither the reasons cited in the article nor any pompous ascriptions of moral judgment to software tools are among them.
As a point of fact, the sanctimony of referring to something as a "hacky workaround" began here, where _you_ were the one to (unironically) introduce the phrase: <https://news.ycombinator.com/item?id=29528285> Pointing out the logical inconsistency of a strong claim is not provocation, no matter how much you feel like you are the one who is being attacked.
I don't recognize your claim that the distinction is arbitrary. Is the NPM world's distinction between package.json's "dependencies" vs "devDependencies" arbitrary? (Answer: no.)
This is a fun shift in culture because for a long time checking in node_modules was the official advice of the early Node documentation unless you were building a library for npm.
It's already been long enough since that time that people seem to have forgotten that it even existed. Vendoring dependencies is one of those things where every once and a while a language will reintroduce the concept, and it always seems to catch people off guard. Go is a good example, although it seems to have varying advice about whether vendored dependencies should be checked in to version control. That might not be surprising considering that Go is also coming out of Google, just like this article.
It feels a little bit weird to say that this is just "industry practice" when you have the Chrome DevTools team telling you they don't do it, but :shrug:. Google does tend to be a bit of a rarity in how it treats monorepos. I'm just always interested to see how opinions on this have evolved; it's rare for me to see analysis that says, "we used to do this, and here's why we found out that it didn't work." Usually the opinions end up seeming more universalist, like the very idea of vendoring dependencies is somehow weird and unexpected, and not something that the industry was largely on board with for a decent amount of time.
It copies all untracked stuff (including node_modules) into a leaf tag. It is fairly easy to manage them, or find the latest one. And because they are leaves, they can be pruned and completely garbage collected when they aren't useful anymore.
I have been burnt many times by npm, and I use this script to guarantee that I have a stash of my node_modules, while also keeping my project small.
And I have diffed different snapshot tags to see which module changed that broke something.
And by leaving everything in unaltered text, it exposes it to git, which does a great job at compressing stuff, especially the nearly identical revisions of my node_modules.
A 500M node_modules from one of my projects only weighed about 100M extra, even with several snapshots. And I can just delete them anyway.
I need to work on it a lot more, it was just a quick and dirty solution when I had to work with React Native a few years ago.
It doesn't handle submodules at all, and there's plenty more I'd like to do with it.
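For anyone who wants to replicate the idea with plain git plumbing, the core trick is roughly this (a simplified sketch, not the script itself):

    #!/bin/sh
    # Snapshot the working tree (tracked, untracked and ignored files alike)
    # into a commit reachable only from a tag, without touching the real index.
    set -e
    export GIT_INDEX_FILE=.git/snapshot-index
    git add -A -f .                                   # -f also picks up ignored dirs like node_modules
    TREE=$(git write-tree)
    COMMIT=$(git commit-tree "$TREE" -p HEAD -m "snapshot incl. node_modules")
    git tag "snapshot/$(date +%Y%m%d-%H%M%S)" "$COMMIT"
    rm -f "$GIT_INDEX_FILE"                           # throw the temporary index away

Deleting such a tag later leaves the snapshot unreachable, so git gc can reclaim it, which matches the pruning behaviour described above.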
Funny but I've also done this.
Thanks for sharing!
Depending on the context, if you don't want this in git history, and want to handle git submodules, there's also git-archive-all https://github.com/roehling/git-archive-all (if you like shell scripts, it is using bats for testing - it was the first time I heard of it)
I remember an article a few years ago about Google's 86TB google3 source code repo. Someone on HN asked "I know Google is big but how on earth do they have 86TB?". Someone then quipped "someone accidentally checked in node_modules"
Basically came here to say this. The whole article felt very silly until I was like “oh wait OP stated early on he works at google… yeah seems about right ~closes tab~”
Insomnia led me to go dig into the repo… the average age of 98% of the files in node_modules is 10 months, attached to the commit when most of these files were added to the repo in the first place… so the entire argument is predicated on the changes to 2% of the dependencies
You're drastically overestimating the storage requirements. Time is far, far more valuable than storage savings. The point about CI builds running faster is enough to sell the idea all by itself.
Oddly not mentioned in the article is that it allows you to build when npmjs.org is down or unreachable, which happens often enough to be frustrating, and if it happens when you're trying to deal with an emergency, it's downright infuriating.
Git is good at storing text diffs, but any binary file in your node_modules (images, natives, etc.) is permanently stored in your repo, including old versions or deleted files. I've seen times where this had a meaningful impact on both disk use and the speed of Git itself.
Reread what you just wrote for a moment and reflect on that.
Also, you clearly have not written software using Node.js on a long enough time horizon. Pinned versions don't mean anything when sub-dependencies can have transitive version resolution occur.
The reality is that unless you can fully byte-for-byte assure what you have deployed today is what you can retrieve from an old tag, let's say weeks, months, or years from now, you don't have a replicable build.
Most people will never need to do this, sure, but serious operations who will choose Node.js to build some software and then plan to walk away from it later should not only commit their node_modules directory, but also keep a copy of the designated Node.js engine version as well. It's not likely you'll need a backup of an LTS version when you can just go retrieve it, but that's not the point:
You will encounter a scenario in your professional career where retrieval is not an option for some piece of software.
Edit: There are industries where committing prebuilts is normal and has absolute strengths, and having experienced it myself, it certainly is desirable sometimes.
Until very recently, the official LTS release of Node shipped with an npm version that would ignore lockfiles during certain situations when running `npm install`.
If the package.json listed a fuzzy dependency and the lockfile was pinned to an outdated version, it would just be updated anyway. This was fixed in later versions with the release of the lockfile v2 format, but the fix was never backported to older versions of npm, even though those versions of npm were the recommended, default versions that shipped with LTS Node installs if you went to the main website or installed from a software repo.
I think that for a non-trivial number of people, they may not have a lot of trust for lockfiles because they tried using them and they just straight-up didn't work.
I guarantee they are not aware of the ramifications of what they wrote. They live in the now, following the contemporary paradigm.
I live in the five years future where the propeller head rock star programmer has moved to greener pastures (the ones where he doesn’t have to write project planning documents of any kind).
Hey it's me, 10-15yrs in the future guy. Pass me a stack of punch cards and let me babble a little bit about legacy code. I found some dusty bourbon in rockstar guy's old desk.
Don't let it get to you, I'll happily cash checks to work on whatever legacy spaghetti tech is in play. Hours are hours, dollars are dollars, and as long as I'm maintaining a happy ratio of those two, I don't mind what code I'm working on. I'll sharpen pencils and sweep floors for 30 hours a week, I'll even listen to your life's struggles if that's what you want me to do for that money.
Don't let me give the impression that this is because all I think about is money or that I don't care, it's quite the contrary.
I think of myself as a developer, and I like to think I do a good job; when I get the opportunity to straighten the edges of a sagging beam or create a structure from scratch I take pride in that, as it is my purpose. There's no point in getting worked up about the practices of my peers because that does not serve me. This is my craft, and the person whose creations I am now steward of was also a craftsman, who had different experiences, motives, and contexts that led them to expressing their intent through the code now entrusted to me.
Imagine a television show, or movie franchise; it may have different writers over time commanding the dialogue of the main character, developing their mannerisms, polishing their pearl. These businesses and legacy products we work on are just like those characters who get passed on to new writers. Think of yourself as one of that team, carrying on a legacy, adding your flair and support. Never stop working on your pearl and use every project as an opportunity to fulfill your own desires and express yourself, while honing your skill.
I wish I had the time to sit down for a bourbon and a chat with you but in order to earn my $26k salary I have to repeatedly unearth these ancient systems, divine how they were supposed to work, repair them to a state where they don't break as much anymore, and then move to the next emergency.
I'd love to join the smoking-jacket crowd but I have student debts to pay, and my other job as a janitor actually sweeping floors to get to.
The world isn't the same place it was when punched cards, smoking jackets and bourbon in the library were a thing.
I've just checked a largish repo, and the node_modules folder is just under 500mb - I could check that in, and be done with it. Updates aren't constant, so it would be every now and then.
That's really not a lot for knowing the code that is being used in your codebase hasn't changed, and the bonus of having everything available should the registry go down, or something.
So I don't think it's 'horrible advice' - it's do what suits your needs best. Some people want to have everything they need to build their application in their control, on the off chance everything hits the fan.
(Also, this: "Don't break the industry practice and check-in your node_modules" - does not necessarily mean it is the best way, it just happens to be the advice from the start.)
I’d say that the weak link in that reasoning is the claim that “Git can’t handle this without making the repo a gazillion GBs”, which can, of course, be solved by not using Git in the first place. Certain other SCMs, like Perforce, allow you to trim history and don’t require you to clone the whole history in the first place.
With git you can specify the clone depth and only get the latest X versions. And there are ways to trim history with external plugins (git filter-repo).
You have to remake every commit of the repository which basically means you have a new repository, and new commits. In order to do the latter thing you said, you cannot have done the former thing you said.
The author thinks that Git is so ubiquitous that they just use “Git” as a stand-in for “VCS” throughout after the first paragraph. So the author definitely thought that Git would be suitable for this.
At my previous company we committed a zipped node modules to git LFS so we could have easy reproducible offline builds without hosting an internal npm instance. Seemed to work well enough.
If you are going to be tracking binaries, you should take a look at git-annex. It is so much more flexible and powerful. The thing that I don't like about git-lfs is how limiting it is with backend serving, and how you essentially can't remove something from your repo history after it has been checked in.
Building an internal repository of any kind is not hard at all for you today.
For me, five years after you left the company it’s a pain in the arse because all your code refers to this repository that doesn’t exist, the one with the custom packages with no source control, the dependencies which are no longer even in LTS versions of any extant OS distribution, and the Vagrantfile won’t work because it used undocumented parameters for both Vagrant and that homebuilt hypervisor that you and your team built as a lark (that doesn’t exist outside your personal laptop).
So for me, it’s all the dependencies get checked into the repository, all the tests run before we merge to master, and we do not use any custom in-house infrastructure of any kind.
Just another service that needs to be maintained but isn’t on the books as something that needs to be maintained leading to a wonderful day a couple of years in the future when a license cull of abandoned VMs means all your code stops building successfully on the same day.
My complaint isn't about build times, it's about dark repositories which aren't just mirrors of official repositories but also contain home-grown packages such as "Company X custom VirtualBox Ubuntu box for VMWare", which contains an Ubuntu machine with up-to-date guest tools for the version of VMWare we use, along with the versions of Puppet, NTPd, Samba, and so forth that we use for all our Vagrant-ified infrastructure. Thus we save time over building the guest VM from scratch (about 20 minutes for each box we spin up), but someone has to maintain that repository.
Adding my voice to say this is not a bad idea at all in practice. Before anyone tries to apply the same approach to other ecosystems, however, be aware that (frontend) JavaScript is kind of unique in this. In most languages you’d likely have at least one package somewhere in the dependency chain that ships in binary form, and those need a bit more vendoring machinery to handle correctly, unless everyone on your team has exactly the same setup.
I don't get why we are concerned about space. Space at this scale (never reached GB level) is so cheap and easy, to me at least, yet reproducibility is not.
With npm shrinkwrap you lock down the versions of the packages you installed and their deps. That way you can use the same packages in all envs.
Helps with testing and debugging, as you're removing a variable (bad deps, outdated deps, newer deps, etc.)
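For anyone who hasn't used it, a minimal sketch of that flow (nothing project-specific assumed):

    # Record the exact resolved versions of the whole dependency tree
    npm install
    npm shrinkwrap        # writes npm-shrinkwrap.json with pinned versions

    # Commit the shrinkwrap file; later installs reproduce the same tree
    git add npm-shrinkwrap.json
    npm ci                # installs strictly from the committed lockfile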
Can't speak for elsewhere, but at $WORK, the development and CI environments sit behind a package resolver (don't know if that's the right word) which transparently caches and proxies package downloads. So dependencies are implicitly persisted permanently by virtue of usage, even if only once.
People here complaining about the size of the node_modules folder - this is one of the reasons people use other solutions like Perforce. Checking a 2GB folder into p4 is an absolute no-brainer, and for all the flexibility people talk about with git, its inability to handle this is pretty damning after so many years.
As the other commenter replied, p4 doesn't download the entire history locally, so storage is only a concern on the server. Depending on what you're storing in p4, you can use p4 archive and/or p4 obliterate to manage the actual storage used on the server. E.g. you might use p4 obliterate to only keep the latest version of your node_modules folder in p4 if you're trying to optimize for CI performance, or if you're using p4 for build artifacts, you might use p4 archive and store the older revisions on slower bulk storage.
P4 does not store the entire history locally, so your only limit is the amount of space on the server.
Git will pack objects after a while (initially they’re stored as-is, just compressed), but large files or enormous amounts of changes can make the repository grow to an unwieldy size, and while git is able to perform “shallow” clones (clones which only store part of the history), not everything handles them well.
Git also has "partial" clones, which can avoid some of the downsides of shallow clones. You can, for example, filter out all historical objects (--filter=blob:none). Historical objects are fetched lazily as needed, such as checking out an old revision. The same can be done with unreachable trees.
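A quick sketch of what that looks like (the repository URL is a placeholder):

    # Partial clone: full history, but file contents are fetched lazily
    git clone --filter=blob:none https://example.com/big-repo.git

    # Filtering trees as well shrinks the initial download even further
    git clone --filter=tree:0 https://example.com/big-repo.git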
This article has excellent diagrams that depict the different types of clones:
The article contains some false assumptions and false statements:
> Having your node_modules checked in guarantees that two developers running the code are running the exact same code with the exact same set of dependencies.
No, it doesn't; your system environment is important. Any code executed can check for environment parameters and branch accordingly. For example, a simple if (macOSVersion === "10.10") { ... } else { ... } would run different code branches and possibly produce different results, even when executing the same code.
The reason you don't check in node_modules is that differences in the system environment at build time produce different build results - checking in node_modules fixes that, but it does not handle system differences at runtime.
Subtrees, instead of submodules, are meant to be super-cool - and I'd love to try them, but I'm still far too wedded to existing git tooling (namely GitKraken) where there's still no support for subtrees.
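For reference, the basic subtree workflow from the plain git CLI looks roughly like this (the repository URL, prefix, and tags are placeholders):

    # Vendor a dependency into the repo as a squashed subtree
    git subtree add --prefix vendor/left-pad https://github.com/example/left-pad.git v1.3.0 --squash

    # Later, pull in a newer upstream version
    git subtree pull --prefix vendor/left-pad https://github.com/example/left-pad.git v1.4.0 --squash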
A suggestion for how to manage node_modules: when testing different modules use npm/yarn, but once you have decided which module to use, fork the module on GitHub (most node modules are on GitHub), then link directly to your fork in package.json like this: https://github.com/{user}/{module}/tarball/master
Now you can use git/GitHub instead of npm to manage package updates.
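A sketch of what that looks like in practice (the org and module names are placeholders):

    # Point the dependency at your fork's tarball instead of the npm registry
    npm install --save https://github.com/your-org/left-pad/tarball/master

    # package.json then records the URL, roughly:
    #   "left-pad": "https://github.com/your-org/left-pad/tarball/master"

    # To update: merge upstream into your fork, then re-run the install above

The trade-off is that picking up upstream fixes now means updating your fork first.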
Only if those packages’ post-install scripts only mutate their own node_modules directory contents - and I can see this quickly falling apart as soon as the team’s dev boxes become a heterogeneous environment (e.g. someone using Ubuntu-on-WSL onboarding onto an all-M1 Mac team).
Anyway, it’s a given that I disagree with the article’s specific point (i.e. to commit node_modules to source-control), however I am sympathetic to arguments about avoiding another left-pad incident, but there are better solutions to that than simplistically committing node_modules:
1. package-lock (though this is an incomplete solution: it helps protect you from vague dependency version numbers (as it uses cryptographic hashes), but it doesn’t store a copy of the npm package, and you need to make sure everyone is using the exact same npm/node tooling versions, otherwise your package-lock file will be clobbered by different users checking in wildly different `lockfileVersion` versions).
2. git LFS: every so often (once a month or so?) in a separate directory off-to-the-side, add a heavily compressed 7z LZMA archive of a snapshot of your node_modules directory (ideally in a known-good-state). This allows you to keep a repo-local copy of your important dependencies without it cluttering up your commits. While these would be monthly updates - and your actual package.json/package-lock.json dependencies may change daily or weekly - in the event of catastrophe it won't be too much work to track-down any missing dependencies or to revert the deps back to the last known-good LFS file.
3. Use tools like `offline-npm` and Verdaccio, which are NPM caching proxies. If this was 2019 and everyone was working in a central office then you’d run Verdaccio on a single box in your LAN and have everyone configure their NPM clients to route through that box, which then stores every package ever requested - you could presumably run a cron job to ensure that package cache is backed up somewhere safe, maybe even with git-LFS as discussed above. (A minimal setup is sketched just below.)
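If it helps, a minimal Verdaccio setup looks roughly like this (default port; the storage location varies by version and config):

    # One-time setup on a shared box (or even your own machine)
    npm install -g verdaccio
    verdaccio &                        # listens on http://localhost:4873 by default

    # Point clients at the proxy; it caches every package it serves
    npm config set registry http://localhost:4873/

    # Back up Verdaccio's storage directory to keep the cached tarballs safe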
I've only ever had issues with differing major versions. That is, sharing lock-files between any node v14.x should work, but expect things to break if you go to v16.
tbf Google is so far ahead that it’s kinda not a real company. Aka you don’t have to adopt every best practice google does since you aren’t that big, mature, secure or rich.
Only anecdotes from several people working there at different divisions. I think I've seen it mentioned here at times as well.
One of the lead devs - years of tenure - on one of the Android backend teams had barely heard of the name and wasn't sure what it was when I brought it up 2 years ago.
Linux containers including Docker are absolutely widespread as I understand it, though.
Whilst I fully sympathise with the points, none of them are convincing nor helpful to me. I'm all for a balanced helping of "just do whatever works", but that's my only takeaway from this article, namely: compromising on best practices can have its place.
Or is there actually a fundamental design problem with lock files in general? My sinking feeling is that it's rather just the Node ecosystem's implementation of them :/
Why are people still writing server-side code in JavaScript? It was a cute idea a decade ago, but now you have to twist yourself into knots to prevent a ton of problems. Nobody will use C today because "memory safety", but everybody will use Node.js despite the dependencies being both a security and usability nightmare. Imagine if I suggested using Bash for a web application backend.
> One of our dependencies that we check in is TypeScript, and every time we update that, the git diff is huge and frankly not worth looking at (beyond the CHANGELOG)
Which I assume is the official attitude towards any dependency of the same magnitude. How are you more aware of the code you are shipping? Okay you managed to give yourself a visual on how much LOC your dependencies are but is that a relevant awareness? Do I not get the same thing with a `du -h node_modules`, with a matching pretty GUI on top?
The one thing I haven’t seen addressed so far is: doesn’t this make them susceptible to poisoned dependencies? Say they have a dependency on a large well-known library - what’s stopping a malicious contributor from adding an HTTP call, thinly disguised to evade grep, to some server in MiddleOfFucking, Nowhere? Even if they manage to flag this from blackbox testing, they now have a problem that only they have.
I can try to answer my own question: they're Google, they can afford a team scanning for vulnerabilities like this, a team dedicated to analyzing codebases that are found to be compromised, a legal and PR team to handle the fallout if this kind of vulnerability makes it to the public.
In short: horrible advice to follow if you are not Google.
This is idiotic and could only be written by someone who has no idea about Node package management.
1. Most CI systems have better options for caching node modules[1][2]. When you check in node_modules, you add a cost that grows with every commit. When you use CI caching, you add a fixed cost to only a small number of builds.
2. Once you start upgrading packages, the repo size will keep growing. At some point you will be forced to filter node_modules out of the git history, and then you lose the ability to run older commits locally.
3. You will need to pin the version of npm/yarn because the structure of node_modules depends on the hoisting algorithm. Every upgrade of node will be extra painful because you also potentially need to upgrade all your packages.
4. Platform-dependent modules like fsevents and node-sass can break if you use a different OS. You will be forced to support only a single platform (Linux).
5. It is impossible to resolve node_modules conflicts by hand. Modern package managers have git conflict resolution for lockfiles built in, but if two people update the same module to the same version they can still create a merge conflict when node_modules is checked in.
6. Currently, you have plenty of good options that achieve the same thing with less effort. You can use yarn 2 with the node_modules linker and a local cache; this creates a .yarn folder in the repo that holds all modules as zip files, and during install it uses these to hydrate node_modules (see the sketch after this list). Alternatively, you can use PnP and have zero-installs with proper support[3].
7. You lose automatic auditing and dependency management. The current best practice is to use something like Dependabot or Renovate bot. Once you commit your node_modules, you will no longer be able to use these effectively.
8. Most people commenting on left-pad are maybe not aware, but today npm is immutable, and you simply cannot unpublish a public package[4]. Because npm.com is such vital infrastructure, it is unlikely that it would ever stop working.
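Regarding point 6, a minimal sketch of that setup with yarn 2+ (exact settings can vary between Yarn versions):

    # Yarn 2+ ("berry") with the node_modules linker and an in-repo cache
    yarn set version berry
    yarn config set nodeLinker node-modules
    yarn config set enableGlobalCache false   # keep the cache in .yarn/cache inside the repo
    yarn install

    # Commit the zipped packages instead of node_modules itself
    git add .yarnrc.yml yarn.lock .yarn/cache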
> This is idiotic and could only be written by someone who has no idea about Node package management.
Some of your points are valid, but the first sentence seems really hand-wavy and rude.
> 1. Most CI systems have better options for caching node modules[1][2]. When you check in node_modules, you add a cost that grows with every commit. When you use CI caching, you add a fixed cost to only a small number of builds.
I was wondering about this too[0], mostly because it's the setup I tended to gravitate towards for my projects. Gitlab CI and Travis can do this AFAIK, although I'm not sure how long the cache folders would be kept (especially on the Free tier).
> Some of your points are valid, but the first sentence seems really hand-wavy and rude.
Because people here are discussing this and entertaining this idea. This creates a level of legitimacy. After this article, there might now be countless teams transitioning to this crazy idea. Then, two years later, people will continue to complain about the node ecosystem because they were burned badly by projects maintained using this approach.
The problem with today's world is that everyone tries to be politically correct. Everyone wants to discuss things in a civilized way based on merit and logic. In many cases, we could avoid wasting time and energy by declaring things as they are. For example, if the mainstream would call anti-vaxxers stupid, we would have more people vaccinated now.
There are plenty of things that people spend large amounts of time discussing when there is nothing to discuss. This thread should not have 195 comments.
"Because npm.com is such vital infrastructure, it unlikely that it would ever stop working."
Except that is has? It's a service, like everything else on the internet, it can and will go down. The choice here is whether you can carry on, or wait until it comes back up. THAT is the main benefit I see in this. (Not everybody wants to manage a local copy/proxy/internal NPM registry etc)
We have a limited amount of time, and adopting this idea would impact the team's productivity long term. Making your application multi-region with backups and security is more important than protecting against a black swan event that would have limited impact. Even if npm goes down, you might have at most a few hours of disruption for deploying in CI. That is relatively minor compared to the recent us-east outage.
see also: yarn offline caching[0]. Also great for dealing with CI environments where you can only cache stuff in certain folders (you can control where the cache is stored)
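For the classic (v1) yarn, the offline mirror boils down to a couple of config keys (the mirror folder name is just a convention):

    # Keep a copy of every downloaded tarball in a folder you control
    yarn config set yarn-offline-mirror ./npm-packages-offline-cache
    yarn config set yarn-offline-mirror-pruning true

    # The first install populates the mirror; commit or cache that folder
    yarn install

    # Later installs (e.g. in CI) can then run without touching the network
    yarn install --offline

Note that yarn config set writes to ~/.yarnrc; copy the settings into the project's .yarnrc if you want the whole team to share them.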
I know pulling in lots of dependencies from various anonymous authors is a security risk. Can you be sure that all of the code has been vetted? This seems to exacerbate that: you’re allowing developers to check in anything without oversight, and it will be ignored just because it’s in this particular folder.
That’s already exactly the same risk almost all web developers take currently. Yes it is a real threat, but it’s too hard to deal with and not often exploited.
We all hope and rely on the fact that popular OSS projects have enough eyeballs on them to make sure nothing malicious slips through. What is proposed here allows a load of changes to be made that completely bypass the normal review process
This doesn't bypass the review process, because no one's review process includes auditing the code of all the things in the package lock files. This is no more or less secure than the current way of doing things.
I currently have a more concrete problem than speed with leaving node_modules out of version control: buggy libraries that, as downloaded, need small corrections. Doing the right thing and fixing libraries for everyone on npm requires significantly more effort and it could be impossible.
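One common workaround for exactly this - not something the parent mentioned, just an option - is patch-package, which keeps your local fixes as patch files in the repo (the module name below is hypothetical):

    # Keep local fixes to dependencies as patch files in the repo
    npm install --save-dev patch-package

    # 1. Edit the buggy file directly under node_modules/some-broken-lib/
    # 2. Record the change as patches/some-broken-lib+<version>.patch
    npx patch-package some-broken-lib

    # 3. Re-apply the patches after every install by adding to package.json:
    #      "scripts": { "postinstall": "patch-package" }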
Possibly a dumb question, but does this prevent developers from working on different platforms? My understanding was that the specific files installed could depend on the architecture of the system they're being installed on.
I mean, come on now. You have internet? Good. Then you will be able to run both npm and git commands. Plus npm has a cache. Why burn bandwidth for no reason?
Source code is source code, it’s the input into a build process, it doesn’t determine the output: the build process itself must be designed for reproducible builds. Committing all of your dependencies doesn’t give you reproducible builds.
I don’t think committing ‘node_modules’ is egregious, and it has benefits as the post describes, but unfortunately it’s far from _the_ solution to the problems described. Something like Google’s own Bazel is a better option.
This is insane. The only pro I'll even entertain is the potential to cut a few minutes off your CI/CD build time, but for the vast majority, even that brings no actual benefits.
I agree, and to go a step further, I don’t even understand why everyone is so obsessed with package management systems when submodules do the same thing, only better.