
It's interesting that this uses smudge/clean filters. When I considered using those for git-annex, I noticed that the smudge and clean filters both have to consume the entire content of the file from stdin, which means that, e.g., git status will need to feed all the large files in your work tree through git-lfs's smudge filter.
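For anyone who hasn't wired one up: a smudge/clean filter pair is just two commands that receive the whole file on stdin and must write the whole (transformed) file to stdout. A toy sketch of the pointer-file trick these tools use (the filter name and repo layout are made up for illustration):

```shell
# Toy clean filter that stores a SHA-256 "pointer" instead of the real
# content, in the style of git-fat/git-lfs. The filter name is hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Each filter is handed the ENTIRE file on stdin, which is why running them
# over a tree of large files gets expensive.
git config filter.demo.clean  'sha256sum | cut -d" " -f1'
git config filter.demo.smudge 'cat'    # a real tool would fetch the object here
echo '*.bin filter=demo' > .gitattributes

head -c 1048576 /dev/zero > big.bin    # a 1 MiB file
git add big.bin

# What git actually stored for big.bin is the 64-character digest, not the megabyte:
git show :big.bin
```

Every add/checkout pays the cost of streaming the full content through these commands, which is exactly the scalability concern here.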

I'm interested to see how this scales. My feeling when I looked at it was that it was not sufficiently scalable without improving the smudge/clean filter interface. I mentioned this to the git devs at the time and even tried to develop a patch, but AFAICS, nothing yet.

Details: <https://git-annex.branchable.com/todo/smudge>




As the author of git-fat, I have to say the smudge/clean filter approach is a hack for large files and the performance is not good for a lot of use cases. The reality is that it's common to need fine-grained control over what files are really present in the repository, when they are cached locally, and when they are fetched over the network. Git-annex does better than the smudge/clean tools (git-fat, git-media, git-lfs) but at somewhat increased complexity. I think our tools have stepped over the line of "as simple as possible but no simpler" and cut ourselves off from a lot of use cases. Unfortunately, it's hard for people to evaluate whether these tools are a good fit now and in a couple years.

As for git-lfs relative to git-fat: (1) the Go implementation is probably sensible because Python startup time is very slow, (2) git-lfs needs server-side support, so administration and security are more complicated, (3) git-lfs appears to be quite opinionated about when files are transferred and inflated in the working tree. The last point may severely limit the ability to work offline or on slow networks, and may make interactive response time unacceptable. Some details of the implementation are different, and I'd be curious to see performance comparisons among all of our tools.


Thanks for verifying my somewhat out of date guesses about smudge performance!

Re the Python startup time: this is particularly important for smudge/clean filters because git execs the command once per file being checked out (for example). I suppose even Go/Haskell would be a little too slow to start when checking out something like the 100k-file repos some git-annex users have. ;)
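The per-file exec is easy to see: give a tree a no-op smudge filter that also logs each invocation, delete the files, and check them out again (the filter name and log path are just for the demo):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# A smudge filter that passes content through but logs every invocation.
git config filter.count.smudge 'cat; echo run >> .git/smudge-runs'
echo '*.txt filter=count' > .gitattributes

for i in 1 2 3; do echo "$i" > "f$i.txt"; done
git add .
git -c user.email=demo@example.com -c user.name=demo commit -qm init

rm -f f1.txt f2.txt f3.txt .git/smudge-runs
git checkout -- .          # one filter exec per restored file
wc -l < .git/smudge-runs   # one log line per invocation: 3
```

(Newer git versions added a long-running `filter.<driver>.process` protocol to amortize this cost, if I recall correctly.)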


Yep. It's really a problem that needs to be fixed in git proper. I'm surprised that GitHub, of all people, didn't realize this and/or invest the time to do it right.

The one major drawback of fixing it in git proper is that it wouldn't be backwards compatible with old clients. Doing it in Go is probably a good improvement over the existing solutions, git-media and git-fat, but I don't think it's the final one.

Funny enough, although I had thought this since I started working with git-fat, I only recently admitted it[1]. Perhaps if I had admitted it when I first started work on it then there's a chance they would have seen it! :-P

[1] https://github.com/cyaninc/git-fat/issues/41#issuecomment-88...


Ok, that means I don't have to check out git-lfs, git-fat or git-bigstore. My annex is 250k symlinks pointing to 250 GiB of data. It's slow enough as it is.


At 250k files in one branch, you are starting to run into other scalability limits in git too, like the inefficient method it uses to update .git/index (rewriting the whole thing).


There is the new "split-index" mode to avoid this (see the "git update-index" man page). The base index will contain the 250k files, but .git/index will only contain the entries you update, which should be far fewer than 250k.
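For the curious, enabling it looks like this; after the switch, the bulk of the entries live in a shared index file that is rewritten much less often:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hi > a.txt
git add a.txt

git update-index --split-index
# .git/index now holds only the delta; the bulk sits in .git/sharedindex.<hash>
ls .git/sharedindex.*
```

Later git releases (around 2.12, if I recall correctly) also added a `core.splitIndex` config knob so you don't have to run `update-index` by hand.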


It seems that `--split-index` is only available via `update-index`. Can it be enabled with `add` or via `git config`?


I looked at git-fat as an option for me, but what killed it was rsync as the only backend; I really wanted to send files to S3.

I also looked at git-annex, and I could see using it if it were just me on the project (or as a way of keeping fewer files on my laptop drive), but I was reluctant to add any more complexity to the source control process, since explaining how to use git-annex to the entire team was too big of a barrier.


Thanks for the feedback. There is a PR for S3 support, but it's dormant because it was mixed with other changes that broke compatibility. I haven't personally wanted S3, so haven't made time to rework the PR.


Ah, that is my fault. We're using the fork quite actively, but need to revive that PR and improve the config settings.


So if I have large (1GB+) files in my repo, you recommend against git-fat?

I have been enjoying the simplicity of git-fat, but running git diff and especially git grep makes me think I should switch to something else.


As someone who worked on git-fat a lot, I'd say it's worth switching to git-lfs. It's exactly the same design with a different pointer format, but written in Go.


Seems that git status nowadays does manage to avoid running the smudge filter, unless the file's stat has changed. This overhead does still exist for other operations, like git checkout.


Also, is there any reason Git LFS can't be used as a special remote for git-annex?

It would provide an easy way for people to host their git-annex repos entirely on GitHub.


Yeah, git-annex is very interested in having a special remote for everything and anything. And if someone creates 4 shell commands, I could have a demo working in half an hour. The commands would be:

  lfs-get SHA256 > file
  lfs-store SHA256 < file
  lfs-remove SHA256 (optional)
  lfs-check SHA256 # exit 0 or 1, or some special code if github is not available

Presumably the right way would be to use their HTTP API, but these 4 commands seem generally useful to have anyway.
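Those four could be prototyped against any content-addressed store. A toy version backed by a plain directory, as a stand-in for the real LFS HTTP API (function names and store layout are hypothetical):

```shell
set -e
# Stand-in for the remote object store; real lfs-* commands would talk to
# the Git LFS server instead of a local directory.
LFS_STORE=$(mktemp -d)

lfs_store()  { cat > "$LFS_STORE/$1"; }       # lfs-store SHA256 < file
lfs_get()    { cat "$LFS_STORE/$1"; }         # lfs-get SHA256 > file
lfs_remove() { rm -f "$LFS_STORE/$1"; }       # lfs-remove SHA256 (optional)
lfs_check()  { test -e "$LFS_STORE/$1"; }     # lfs-check SHA256: exit 0 or 1

# Round trip:
echo "hello" > file
sha=$(sha256sum file | cut -d' ' -f1)
lfs_store "$sha" < file
lfs_check "$sha"
lfs_get "$sha"        # prints: hello
```

A real backend would map these onto the LFS upload/download endpoints, but the interface git-annex needs is no bigger than this.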



