Improve Git monorepo performance with a file system monitor (github.blog)
203 points by chmaynard on June 29, 2022 | 65 comments



This is awesome. Especially the fact that it's built-in and easy to turn on.

Seems like quite a complex solution though. I guess some big company (Microsoft?) implemented it internally for their own use and later tried to move it to upstream git. I wonder if there was some pushback from git maintainers against having this functionality built-in.

Also, why do they use named pipes on Windows when in theory Windows also supports Unix domain sockets (AF_UNIX)? (https://devblogs.microsoft.com/commandline/af_unix-comes-to-...)

BTW, to the author of this article: it is very good and was an interesting read. There are some small issues:

- "markdown" link didn't get converted to html: "[core.untrackedcache](https://git-scm.com/docs/git-config#Documentation/git-config...)"

- the link to "philosophy" of Scalar doesn't work: https://github.com/microsoft/git/blob/HEAD/contrib/scalar/do...


Git for Windows currently supports Vista and above - the AF_UNIX support was only added in Windows 10 1803.

Named pipes are fine, the semantics are basically identical, and you can guarantee there is a separate namespace versus the filesystem (for AF_UNIX, \x00 prefixes work on Linux but not on macOS).


> [FSMonitor] is currently available on macOS and Windows.

Are there any other git features with this limitation? Wild to me that we're here.

Thankfully the article covers the semi-longstanding "hooks" that existing (& very high performance) tools like Watchman (which are cross platform) can use.

Great in depth read. Good stuff! From the 2.37 release[1].

[1] https://github.blog/2022-06-27-highlights-from-git-2-37/ https://news.ycombinator.com/item?id=31898261 (34 points, 2 days ago, 7 comments)


I've been wondering about why there was no linux support, and found an e-mail from the author of the subcommand (as well as the github.blog post) explaining the situation.

Apparently an older implementation using inotify was dropped because inotify does not work recursively, so you would have to add an inotify watch for every directory in the hierarchy, which is obviously very inefficient. There are system-wide limits on the number of directories you can listen to, and even if you increase the limit you would probably cause a lot of overhead.
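
To put rough numbers on the per-directory cost described above, here is a minimal Python sketch (not taken from the linked e-mail). It counts how many directories a tree contains, since a non-recursive watcher needs one watch per directory, and compares that with the per-user limit Linux exposes in `/proc/sys/fs/inotify/max_user_watches`. The function names are made up for illustration, and the limit is reported as unknown on systems where that file does not exist.

```python
import os
from pathlib import Path

# Linux exposes the per-user inotify watch limit here; the file is absent elsewhere.
LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def count_directories(root):
    """Count root plus every directory beneath it (one watch would be needed for each)."""
    # os.walk yields one tuple per directory it visits, including root itself.
    # Symlinked directories are not followed by default.
    return sum(1 for _ in os.walk(root))


def read_watch_limit(limit_file=LIMIT_FILE):
    """Return the per-user watch limit, or None if it cannot be read."""
    try:
        return int(Path(limit_file).read_text().strip())
    except (OSError, ValueError):
        return None


def watches_fit(root, limit=None):
    """Return (needed, limit, fits); fits is None when the limit is unknown."""
    needed = count_directories(root)
    if limit is None:
        limit = read_watch_limit()
    fits = None if limit is None else needed <= limit
    return needed, limit, fits
```

Calling `watches_fit(".")` at the top of a large working tree gives a quick feel for how close a per-directory watcher would come to the limit, before counting any other program on the machine that also holds watches.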

Newer linuxes support the fanotify system call, which does allow recursive listening. They haven't implemented something using fanotify yet however.

https://lore.kernel.org/git/e1442a04-7c68-0a7a-6e95-304854ad...


Thanks, I didn't know about fanotify. Now I'm wondering why everything I use day-to-day (file syncing tools, IDEs, etc.) still seems to be stuck with inotify if there is a better option on modern operating systems. Maybe some of them are actually using fanotify under the hood, despite components still having "inotify" in their name.


TIL: fanotify was enabled in Linux 2.6.37 released 4 January, 2011

https://kernelnewbies.org/Linux_2_6_37#Enable_Fanotify_API


It needs 5.1 to be actually useful though:

https://man7.org/linux/man-pages/man7/fanotify.7.html

       In the original fanotify API,
       only a limited set of events was supported.  In particular, there
       was no support for create, delete, and move events.  The support
       for those events was added in Linux 5.1.  (See inotify(7) for
       details of an API that did notify those events pre Linux 5.1.)
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5....

And here is the fanotify discussion on the git mailing list (along with some flaming about the feature not being supported on Linux): https://lore.kernel.org/git/87czs1d6uy.fsf@evledraar.gmail.c...


I believe eBPF should also be usable for this on newer Linux kernels.


My guess is that inotify is so slow with large directories that it wasn't worth it. Plus inotify has cumbersome user limits.

inotify has a number of other relevant limitations, like not being able to create recursive notifications or handle "move" operations. Implementation effort is going to be way higher for an inotify-based system, and of course that's made far worse by the numerous file systems in linux - I imagine any implementation would probably start first with ext4.

I suspect an ideal solution would be via ebpf, but I'm not sure.


My assumption was that on Linux it's just been using inotify or something for a while and so hasn't needed a bespoke monitor. I have no idea if that's true or makes sense though.


More likely the Linux fs is fast enough not to need the optimization. Unsurprisingly, git was designed to run well on Linux.


Not sure why the response got downvoted. I personally found Git performance to be, well, okay on macOS (but depends) and absolutely horrible on Windows due to very slow stat() calls on NTFS.

Of course, in a large enough monorepo Linux performance would also suffer, but to a much lesser degree.

Also, conveniently, both Windows and macOS have an API for recursive directory watching, whereas Linux doesn't (in the vanilla kernel). Inotify can only watch the immediate directory you're observing, and on top of that there's a pretty low default limit on the number of inotify descriptors you're allowed to have.


We've found this to be basically true. Git operations that stat() a lot are dramatically, catastrophically slower on OS X, and that's part of why my employer started doing fs watching there and mostly left Linux as is. More than once I've had to update a cross-platform tool to avoid stat()ing because, though cheap on Linux, it took tens of seconds on OS X.


You're most probably right.

  $ time git status
  real 0m0.324s
  user 0m0.197s
  sys 0m0.425s
That's on a working tree with 314,708 files and no watchman.
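
For anyone who wants a comparable number on another OS or file system, here is a rough Python sketch. It only approximates the file system side of `git status` by calling lstat() on every file under a directory and timing it. It is not what Git runs internally, and it also walks `.git` and ignored files, so treat the result as a ballpark figure. `stat_all` is a made-up helper name.

```python
import os
import time


def stat_all(root):
    """lstat() every file under root; return (file_count, elapsed_seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
            count += 1
    return count, time.perf_counter() - start
```

Running `stat_all(".")` twice from the top of a working tree is the fairer comparison, since the first run also pays for a cold cache.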


yeah it uses famous linux-exclusive data structures, like hashes and strings.


It's not the data structures, it's the file system operations.


Yeah not supporting Linux in a Git feature? WTF what a free software sellout moment.


Having a cross-platform file watcher built into a ubiquitous tool like git is pretty awesome. I could see build tools integrating with this and making more aspects of development faster without having to run a bunch of file watcher services. They all seem to have issues.

I have tried Watchman, but setting it up is a pain. There are so many ways to use it. I also welcome running less Facebook code on my systems.


Except it's not cross-platform:

> It is currently available on macOS and Windows.


TBF, Git operations on repos with many small files are extraordinarily slow on Windows (probably not Git's fault, because all file operations involving many small files are slow on Windows, even copying stuff around on the desktop), so that feature is much more critical to have on Windows than Linux.


Today I learned that “cross-platform” means “all-platform.”


Considering there are really only three platforms, I think it's a pretty fair assumption. And we're talking about git here, so you can assume cross-platform includes Linux.


It looks cross-platform enough to me.

After all these years of Windows-only or MacOS-only software, you guys deserve this.


[deleted]


I've stopped using anything language specific like Guard or nodemon. Instead I use the inotify commands on Linux and entr/ack on macOS so no matter what I'm doing, I can watch for changes in a directory.


I’ve got qualms with just about all of the big tech companies in one way or another.

My 2c is that one of the unambiguously positive externalities of the tech mega-corp trend is all the great OSS we get as a by-product of their operations.

I mean, I don’t exactly love how iPhones get made, but I’m pretty stoked that clang kicks ass now.


Related question: most of the companies I know that have large monorepos have a sizable dedicated team to support their dev tools, and have invested a lot to make monorepos feasible.

Are there any recommendations or standardized tools for structuring monorepos for companies that don't have a dedicated dev tools team? Last I checked, lerna seemed to be the most common tool to support JS monorepos - is that still true? Is there a better tool for a primarily Typescript codebase (primarily React on the front end, Node on the backend, but also native mobile apps)?


We recently used Turborepo -- https://turborepo.org/ -- for a project that started as two Electron-based builds and quickly escalated to three. Once we had it set up for the two, it was quite easy to add one more. Our shared components were in a central package while custom ones existed in their respective app directories.

The nice thing about this type of separation for us is being able to target CI/CD scripts to specific apps. Previously we were using targeted dev script logic to initiate the different app builds, which just wasn't maintainable. The new approach made Electron deployments super simple, consistent, and repeatable.

All this was done by two team members on dev side.


In the JS/TS world, until recently yarn was the de facto tool for monorepos. Lately pnpm seems to be gaining traction. A few other new tools are getting popular as well; it's a hot topic.


npm itself has also upstreamed more "workspace" support for monorepos. It's not necessarily the best tool for the job, but it's a possible tool.

Also incremental build support in Typescript itself has seen a lot of improvements in recent versions. It is useful to check if your monorepo can benefit from Typescript incremental builds.


A bit tangential but you might also be interested in reading about the USN Journal on Windows which has been around since Windows 2000 https://en.m.wikipedia.org/wiki/USN_Journal


What's the current state of git tooling for large files and partial clones?


My holy grail implementation would be a "partial clone" that downloads desired files like normal, but creates stubs for selected files that are not stored on the device but downloaded on-demand upon opening them, like the OneDrive Files On-Demand [1] or Google Drive File Stream.

[1]: https://support.microsoft.com/en-us/office/save-disk-space-w...


Have you seen https://github.com/microsoft/VFSForGit? It's used by the Windows team to manage a monorepo containing most Windows source code.

Unfortunately that approach was put in maintenance mode since it didn't seem like it would be supportable on macOS.


This has been superseded by Scalar (https://github.com/microsoft/scalar), which has in turn been merged into Microsoft's fork of git (https://github.com/microsoft/git)

It supports neat stuff like partial clone which seems like a pretty big deal.


Interesting! It seems some of Scalar from late 2021 has already made it into the official git project's contrib dir [0]. It looks like Scalar is mostly an opinionated way to configure git [1] using git partial-clone among other features.

Git partial-clone looks almost perfect, except it only downloads and displays files explicitly added to the git sparse-checkout list. I want some "magic" vfs shenanigans that let me view and browse the full repo exactly as if the full repo were checked out, but when I open a directory or file the contents are downloaded on-demand.

[0]: https://github.com/git/git/tree/master/contrib/scalar

[1]: https://github.com/microsoft/git/blob/vfs-2.37.0/Documentati...


It seems win32 has specific support for this kind of tech at a couple different layers, one that's low level filesystem virtualization that's like FUSE I guess [1], and another that's higher level and exposes sync status of files via explorer and other win32 apis [2] which is what I assume OneDrive, Dropbox, Google Drive, etc use.

[1]: https://docs.microsoft.com/en-us/windows/win32/projfs/projec...

[2]: https://docs.microsoft.com/en-us/windows/win32/cfapi/build-a...


Have you looked into git-annex?


I have a healthy suspicion of the performance of file-watchers. I hope this feature doesn’t make Git faster at the expense of “all filesystem operations crawl”.


This has been the way to get performance on a large Git repo for over a decade now, just not built into Git. It provides very good improvements even in environments that aren't the fastest, like laptops.


See my point isn't whether this makes Git faster (it probably does) but it may have a performance impact on the rest of the system.


We deployed a system like this across hundreds of engineers in a reasonably sized monorepo and had zero complaints about system performance. While I don’t know the underlying architecture of inotify/etc, it seems to be very efficiently implemented.


It is of course not turned on by default. I don't know how bad the performance hit is, but it's an option so you get to choose the tradeoff. Either your git operations are slow or you take the small hit on all operations. If you spend all of your time working in a big repo it's probably going to be worth it.


On the bright side, filesystem operations on Windows are already slow so you don't need to worry about turning on a file-watcher there.


Hooking writes at a kernel level shouldn’t have much impact, provided it’s actually hooking there.


I assume it is using one of the native platform APIs to detect file changes, which generally have some sort of overhead associated with them and then may or may not block on userspace code that can be badly behaved.


In my experience on Windows watching for file events, I’ve seen that it’s not very reliable. As the article notes, it’s possible that the operating system may drop events. Nevertheless, this solution should help improve performance and reduce disk scans. If you have any other applications dependent on watching for file system events, enabling this may hinder those (again, this is based on my experience with Windows).


I'm using ReadDirectoryChangesW() to read a filtered stream of events from the USN journal. I've not noticed any reliability problems. Technically, the kernel API can always drop events, but whether that is because the kernel can't keep up or because the daemon application isn't servicing the event stream fast enough doesn't really matter. The API does know if/when events were dropped. The FSMonitor daemon guards against that and forces a "resync", so the "git status" client is advised to do a normal scan and the output is correct.
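
The resync behaviour described above can be sketched as a toy model in Python. This is not Git's code or its wire protocol. The `ToyMonitor` class, the `(epoch, sequence)` token layout, and `paths_to_check` are all assumptions made up for illustration. The idea is that the client presents the token from its previous query, and the daemon either answers with the paths changed since then or, if events were dropped in between, says "resync" so the client falls back to a full scan.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    token: tuple          # opaque to the client: (epoch, sequence number)
    paths: set            # paths changed since the client's token
    resync: bool = False  # True means "history was lost, do a full scan"


@dataclass
class ToyMonitor:
    """Hypothetical stand-in for a watcher daemon; not Git's actual implementation."""
    epoch: int = 0
    events: list = field(default_factory=list)

    def record(self, path):
        # Called for each file system event delivered by the OS.
        self.events.append(path)

    def events_dropped(self):
        # Called when the OS reports lost events. The recorded history can no
        # longer be trusted, so start a new epoch, which invalidates every
        # token handed out so far.
        self.epoch += 1
        self.events.clear()

    def query(self, token=None):
        current = (self.epoch, len(self.events))
        if token is None or token[0] != self.epoch:
            return Reply(current, set(), resync=True)
        return Reply(current, set(self.events[token[1]:]))


def paths_to_check(monitor, token, all_paths):
    """Client side: trust the reply, or fall back to scanning everything."""
    reply = monitor.query(token)
    return reply.token, (set(all_paths) if reply.resync else reply.paths)
```

The point of the design is that a dropped event can only make the client do more work, never give a wrong answer: any doubt on the daemon side turns into a normal full scan.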


My main complaint with the Windows implementation is that it does not play well with the lock-by-default policy of the NTFS filesystem. I deactivated the filesystem watcher on Windows after the agent repeatedly locked files so that checkouts would fail.


You might give the new fsmonitor a try. It does not lock any files on the disk. It does have a single handle to the worktree root directory to listen for events. But even that is not exclusive. And it CWD's out of the tree during initialization, so it does not prevent you from deleting the worktree while it is running.


I wonder if this will cause issues in repos where changes can come from containerized apps syncing their runtime config to disk. Depending on the platform and the container framework, a lot of different things could potentially break here, from NFS-related to number of open files.


Only MacOS and windows? Where’s the Linux support?


It's simple: Microsoft, who implemented this, only needed support for Windows and macOS. There is an integration with Watchman, which runs on Linux.


But but “Microsoft loves Linux” ;)


Now I’m interested to see how this will improve my magit experience!


Enabling this messed up something related to projectile/helm-projectile which I use to navigate to files, and is an integral part of my git/magit setup in Emacs.

the buffer projectile-files-errors says "warning: Empty last update token."


Is it exposed as an API? It is unclear to me how I could build a tool to leverage this.


This serves as an example to me that git is - maybe - not the right tool for the job.


As with a lot of developer tools, the most adopted solutions are rarely the best tool for the job. But because everyone knows them, that's what continues to be used.


Moving to a continuous, asynchronous strategy versus a point-in-time synchronous strategy, seems like a perfectly reasonable way to improve performance.


All file operations involving many small files are slow on Windows, that's hardly git's fault. It can just do its best to work around the problem (for instance with this file watcher thingy).


Yes. Or they could just use a different kind of SCM that doesn't have these performance issues.


I don't think that's so easy. For SVN we also saw a >10x performance difference between checking out the same repository on Linux vs Windows (however, after initial checkout, performance scales mostly with the number of changes, not the repository size).


so what is?


I don't know their use case very well.

Maybe Perforce. Or didn't MS have their own in-house SCM as well?



