GitMounter: A FUSE filesystem for Git repositories

dalf · 2023-11-29T05:37:45.000000Z

Seafile (a file sync and storage service) borrows git's model to store files (internally there are repositories, branches and commits). However, the files are not stored directly:

> A file is further divided into blocks with variable lengths. We use Content Defined Chunking algorithm to divide file into blocks.

> This mechanism makes it possible to deduplicate data between different versions of frequently updated files, improving storage efficiency. It also enables transferring data to/from multiple servers in parallel.

I use it on an old PC without issue. Drawback: since the files are not stored in the clear, if the Seafile repositories ever get corrupted I need a backup to recover them (this has never happened to me).

* https://manual.seafile.com/develop/data_model/

* https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf
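
For readers unfamiliar with content-defined chunking, here is a minimal sketch of the general idea. It is not Seafile's actual algorithm; the window size, mask, and polynomial hash are assumptions chosen for illustration. A chunk boundary is declared wherever a rolling hash of the last few bytes matches a bit pattern, so boundaries depend on content rather than on byte offsets.

```python
# Illustrative parameters (assumptions, not Seafile's real values).
WINDOW = 16            # bytes covered by the rolling hash
MASK = (1 << 8) - 1    # cut when the low 8 bits are zero: ~256-byte chunks on average
BASE = 257
MOD = (1 << 61) - 1


def cdc_chunks(data: bytes) -> list:
    """Split data wherever a rolling hash of the last WINDOW bytes hits the mask."""
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that leaves the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # shift in the newest byte
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks


def shared_chunks(a: bytes, b: bytes) -> int:
    """Count distinct chunks that two byte strings have in common."""
    return len(set(cdc_chunks(a)) & set(cdc_chunks(b)))
```

Because the hash only looks at the last `WINDOW` bytes, inserting a few bytes near the start of a file shifts the later boundaries but leaves the later chunks byte-identical, which is what makes deduplication between versions work. A real implementation would also enforce minimum and maximum chunk sizes.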

Calzifer · 2023-11-29T10:15:05.000000Z

And in keeping with the topic, Seafile has a FUSE extension to access this storage system directly.

https://manual.seafile.com/extension/fuse/

djkoolaide · 2023-11-29T09:30:11.000000Z

Seafile is fantastic and I'm surprised I don't see more discussion about it around here. I've been running it on a VPS with MinIO as my object storage for about two years now, ~4TB of data just shy of 100,000 files. It syncs fast, stable af, and I "own" all my data. Can't recommend it enough.

account42 · 2023-11-29T11:45:47.000000Z

How does Seafile's chunking compare to git's packfiles, which can store binary deltas? [0] While git conceptually stores full files, that doesn't mean that the actual implementation isn't more efficient.

[0] https://git-scm.com/docs/pack-format

turminal · 2023-11-29T01:24:04.000000Z

A similar thing exists for Plan 9: https://orib.dev/git9.html

skadamat · 2023-11-29T13:57:00.000000Z

We built a Rust implementation of NFS to accomplish the same thing!

https://news.ycombinator.com/item?id=37573679

This lets you:

- Mount a Git repo + branch state to a folder: `git xet mount https://xethub.com/XetHub/LAION-400M.git --prefetch 0`

- Analyze parquet files using DuckDb:

`import duckdb`

`duckdb.query("select COUNT(*) from 'data/*.parquet'")`

computerfriend · 2023-11-29T05:44:47.000000Z

For all the use cases I can think of, git worktree could be used to simulate this without any fuse dependency.

withinboredom · 2023-11-29T11:23:37.000000Z

I did exactly this and use it as a "file system adapter" on my WordPress installation (to handle uploads/media). I tried to submit it as a plugin but they said "no" because I can't use "git" in the name -- and fuck that. So, I'm the only person on earth that does this (AFAIK).

corobo · 2023-11-29T17:48:26.000000Z

Do you have this plugin available anywhere at all? I can't say it'd fit my workflow or anything without knowing more (so don't go out of your way!) but I'm very intrigued as to how it works

112233 · 2023-11-29T07:27:05.000000Z

I'd like to use a specific git tree object as a lowerdir for an overlayfs mount, preferably without requiring root. Do you have anything in mind for that?

lloydatkinson · 2023-11-29T07:53:02.000000Z

Does worktree allow checking out the same branch but at different commits?

matrss · 2023-11-29T08:13:38.000000Z

No, because a branch is a (re-assignable) name for some commit and cannot point at different commits while in the same repository at the same time. A "branch but at different commits" simply makes no sense. You can however create a worktree with a detached HEAD pointing at any commit. By default it seems to create a branch for each worktree.
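
A small sketch of the detached-HEAD variant, wrapped in Python so the command can be built without being run. The helper names, the worktree path, and the commit are hypothetical; the only git feature relied on is `git worktree add --detach <path> <commit-ish>`.

```python
import subprocess


def detached_worktree_cmd(path: str, commit: str) -> list:
    """Build the git command that checks out `commit` in a new worktree at
    `path` with a detached HEAD, so no branch is created or moved."""
    return ["git", "worktree", "add", "--detach", path, commit]


def add_detached_worktree(repo: str, path: str, commit: str) -> None:
    """Run the command inside the repository at `repo` (needs git on PATH)."""
    subprocess.run(detached_worktree_cmd(path, commit), cwd=repo, check=True)


# Hypothetical usage: two worktrees at two different commits.
# add_detached_worktree(".", "../wt-old", "HEAD~10")
# add_detached_worktree(".", "../wt-new", "HEAD")
```

Since neither worktree owns a branch, nothing stops them from sitting at any two commits you like, including the same one.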

SpaceNugget · 2023-11-29T08:18:10.000000Z

Somewhat pedantic but relevant. A branch is a single commit. There aren't different commits _in_ a branch to checkout.

Regardless, if you are asking whether you can check out a commit that doesn't have a branch pointing at it, yes you can.

You can have a work tree for every commit in your repository if you like.

glandium · 2023-11-29T09:43:03.000000Z

You can have multiple work trees for the same commit, for that matter, as long as they are not on the same branch (so with different branches pointing at the same commit, or detached heads)

gpderetta · 2023-11-29T13:17:07.000000Z

to be even more pedantic, a branch is a (named) pointer to a commit. You can have multiple names pointing to that commit.

rwbhn · 2023-11-29T07:44:18.000000Z

TIL. Thx!

Dylan16807 · 2023-11-29T04:04:59.000000Z

I wonder why anyone would think it's "impractical".

You might have to wait a moment at startup for it to make a list of commits, but after that git is very well designed for this sort of browsing.

cryptonector · 2023-11-29T07:35:35.000000Z

It's impractical to list a directory with millions of entries (e.g., commits) that require a lot of work to find. One might want to organize such a pile of things in a way that naturally limits the number of items to list -- i.e., one might want to add paging, so you'd have page 1, page 2, etc., with each page having some small number of items, say, 1000 (e.g., commits).
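
The paging idea could look roughly like the sketch below. The `page-000001` naming scheme and the helper names are assumptions for illustration, not something GitMounter does. Commits are consumed lazily, so the first page can be listed before the whole history has been walked.

```python
from itertools import islice

PAGE_SIZE = 1000  # entries per page directory, as suggested above


def page_name(index: int) -> str:
    """Hypothetical directory name for the page with the given 1-based index."""
    return f"page-{index:06d}"


def paginate(commits, page_size=PAGE_SIZE):
    """Lazily yield (directory name, list of commit ids) pairs."""
    it = iter(commits)
    index = 1
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page_name(index), page
        index += 1
```

A filesystem built this way would expose at most `page_size` entries per directory listing, at the cost of one extra path component when navigating to a commit.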

Dylan16807 · 2023-11-29T13:55:28.000000Z

Very few repositories have enough commits to be a problem, and ls is not very important either. The prompt was just that cd works. It doesn't have to show the big pile of commits.

Also you just have to use ls -f. Edit: Oh, you even mention this yourself in another comment. Serious non-problem then. If you're worried about the filesystem side, it could also load the list of commits gradually.

TeMPOraL · 2023-11-29T07:44:10.000000Z

> to list a directory with millions of entries

Fair, it'll likely break most tools written in the last 10-15 years.

> (e.g., commits) that require a lot of work to find.

That's solvable with a cache. I'm surprised git doesn't seem to have one for those, at least going by how long it takes to generate a full "shorthand log" on a large repo.

cryptonector · 2023-11-29T08:06:11.000000Z

Caching is nice, but if it takes an hour when the cache is cold then it's not usable even if a directory with millions of entries posed no problems for any tools. Now for most projects that's not going to happen, but for something like the Linux kernel it just might. And there might be many caches to warm, and much to cache. For a large enough project and a computer that one might not think of as small that might just fail to work at all. This is why as Git grows up it's using Bloom filters to optimize things like git blame and git log on one file rather than adopting the fully relational model of Fossil. (Please read into this all the chagrin I'm feeling about that, because I'm quite the fan of Fossil's relational model.)

PS, the single biggest problem with million-entry directories is the propensity of tools like ls(1) to want to sort the darned things, so one has to remember to use `ls -f`, or to at least use the C locale to get memcmp()-style collation instead of much, much slower Unicode collation. Another problem is that the POSIX stat(2) family of functions combines reading metadata that could come from just the directory (e.g., a file's inode number), whose contents have already been read, with reading metadata that requires [possibly much] extra I/O to get, so if you're doing `ls -l` on a million-entry directory you might as well go on vacation (but make sure to send the output to some file, 'cause your scrollback buffer just won't do).
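
To make the sorting and stat(2) point concrete, here is a minimal sketch using Python's `os.scandir`, which returns entries in directory order and, on typical Unix systems, takes the name and inode number from the directory read itself rather than from a per-entry stat call. The function name is hypothetical. Unlike `ls -f`, `os.scandir` does not report `.` and `..`.

```python
import os


def list_unsorted(path: str):
    """Yield (name, inode) for each entry in directory order, without sorting.

    On Unix the inode comes from the directory entry, so no extra stat()
    call per file is needed (on Windows the first inode() call does cost one).
    """
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name, entry.inode()
```

Asking for sizes or timestamps (the `ls -l` case) would mean calling `entry.stat()` on every entry, which is where the extra I/O comes from.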

iforgotpassword · 2023-11-29T07:03:53.000000Z

Depends on your personal preferences. I've been using cgit for this for over a decade now, it's blazing fast and I just know it inside out. It's running on a local server hosting all the repos, so obviously requires a little more setup; don't know if I'd install a web server on my machine just for that.

edude03 · 2023-11-29T02:55:03.000000Z

See also: https://github.com/presslabs/gitfs

dividuum · 2023-11-29T11:09:44.000000Z

It should automatically map all the commits of each file to .old, .older, .old2.old and so on. For the true „version management in a file system“ feeling :-)

compressedgas · 2023-11-29T01:29:47.000000Z

Perhaps the first back in 2005 was http://web.archive.org/web/20100218153353/http://www.sfgoth....

anonacct37 · 2023-11-29T01:12:47.000000Z

I think this is a great idea. In the old days before fuse was so widely used I saw this same idea used in the jvm since it supports custom protocol handlers that you can use for vfs operations.

armchairhacker · 2023-11-29T02:35:19.000000Z

You should get this added to the Wikipedia list: https://en.wikipedia.org/wiki/Filesystem_in_Userspace

aflam · 2023-11-29T09:39:03.000000Z

Neat! We started working on something similar, using LD_PRELOAD. After setting some env variables we'd see a commit's files layered at some path, on top of artifacts saved for this commit. The goal is to run jobs that expect to access both git files and build artifacts, while avoiding duplication of storage. FUSE would be better but users don't have enough permissions.

vesinisa · 2023-11-29T10:23:53.000000Z

> FUSE would be better but users don't have enough permissions.

This is also going to be a great prank on the next engineer who's never gonna figure why he can't see any of the files the build jobs are seeing in his shell.

aunwick · 2023-11-29T03:37:49.000000Z

I think ClearCase has a patent on this from 1997.

nonameiguess · 2023-11-29T22:55:42.000000Z

It will have expired by now, but ClearCase is exactly what I think of whenever I see these kinds of ideas come up. It's really a handy tool but too bad almost nobody uses it or has even heard of it in the open source world since it isn't free (in any sense of the word). They were still using it at the NRO's ground processing stations as recently as six years ago. Just rsync VOBs between dev, staging, and prod environments, and checkout a particular view to install upgrades and you're guaranteed to have a totally identical environment complete with all dependencies, no containers necessary. It's really better than Git for this, too, because it can work as a distributed filesystem across many hosts at once, handles binary files perfectly well without needing extensions, uses a real database. You can version control an entire cluster of servers the same way Git version controls a single software project.

But it's 90s IBM enterprise business model to the core and the rest of the Rational product suite sucks.

still_grokking · 2023-11-29T16:14:09.000000Z

Do you mean this here: https://www.ibm.com/docs/en/rational-clearcase/9.0.2?topic=t... ?

What did they "patent"? Object databases? Versioning files? Mounting file-systems?

Anyway, those are at most some super weird software patents; so you don't have to care outside of the US, I guess.

ipaddr · 2023-11-29T04:21:57.000000Z

Wouldn't the patent be expired?

jtwaleson · 2023-11-29T05:18:39.000000Z

Yes, there’s a 20 year maximum duration for patents. Depending on some technicalities it could be a year or two longer but definitely under 26. Unless someone created additional patents with improvements.

virtue3 · 2023-11-29T05:23:31.000000Z

potentially? but you can also extend a patent by adding to it by patenting an improvement on the patent. :/

account42 · 2023-11-29T11:50:43.000000Z

That doesn't extend the original patent duration, it's just a new patent for a related invention.

bezier-curve · 2023-11-29T03:53:01.000000Z

That's before git was created. Do you have details about this?

shiroiuma · 2023-11-29T04:11:38.000000Z

Yes, it is before git was created; that doesn't matter for a patent. In ClearCase, the repository is mounted on the filesystem very much like this; in Linux, it required special kernel drivers to work back when I used it (2000s).

still_grokking · 2023-11-29T16:28:43.000000Z

The presented idea (one folder per commit, as FUSE fs) seems indeed largely impractical.

But there are good uses for mountable "git like" repos. For example for backup systems.

https://github.com/bup/bup