
It's interesting that this uses smudge/clean filters. When I considered using those for git-annex, I noticed that the smudge and clean filters both have to consume the entire content of the file from stdin, which means that, e.g., git status will need to feed all the large files in your work tree through git-lfs's smudge filter.
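For anyone who hasn't wired one up: a smudge/clean filter pair is just two commands that receive the whole file on stdin and must write the whole (transformed) file to stdout. A toy sketch of the pointer-file trick these tools use (the filter name and repo layout are made up for illustration):

```shell
# Toy clean filter that stores a SHA-256 "pointer" instead of the real
# content, in the style of git-fat/git-lfs. The filter name is hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Each filter is handed the ENTIRE file on stdin, which is why running them
# over a tree of large files gets expensive.
git config filter.demo.clean  'sha256sum | cut -d" " -f1'
git config filter.demo.smudge 'cat'    # a real tool would fetch the object here
echo '*.bin filter=demo' > .gitattributes

head -c 1048576 /dev/zero > big.bin    # a 1 MiB file
git add big.bin

# What git actually stored for big.bin is the 64-character digest, not the megabyte:
git show :big.bin
```

Every add/checkout pays the cost of streaming the full content through these commands, which is exactly the scalability concern here.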

I'm interested to see how this scales. My feeling when I looked at it was that it was not sufficiently scalable without improving the smudge/clean filter interface. I mentioned this to the git devs at the time and even tried to develop a patch, but AFAICS, nothing yet.

Details: <https://git-annex.branchable.com/todo/smudge>




As the author of git-fat, I have to say the smudge/clean filter approach is a hack for large files and the performance is not good for a lot of use cases. The reality is that it's common to need fine-grained control over what files are really present in the repository, when they are cached locally, and when they are fetched over the network. Git-annex does better than the smudge/clean tools (git-fat, git-media, git-lfs) but at somewhat increased complexity. I think our tools have stepped over the line of "as simple as possible but no simpler" and cut ourselves off from a lot of use cases. Unfortunately, it's hard for people to evaluate whether these tools are a good fit now and in a couple years.

As for git-lfs relative to git-fat: (1) the Go implementation is probably sensible because Python startup time is very slow, (2) git-lfs needs server-side support, so administration and security are more complicated, (3) git-lfs appears to be quite opinionated about when files are transferred and inflated in the working tree. The last point may severely limit the ability to work offline or on slow networks, and may make interactive response time unacceptable. Some details of the implementation are different, and I'd be curious to see performance comparisons among all of our tools.


Thanks for verifying my somewhat out of date guesses about smudge performance!

Re the Python startup time: this is particularly important for smudge/clean filters because git execs the command once per file being checked out (for example). I suppose even Go/Haskell would be a little too slow to start when checking out something like the 100k-file repos some git-annex users have. ;)
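The per-file exec is easy to see: give a tree a no-op smudge filter that also logs each invocation, delete the files, and check them out again (the filter name and log path are just for the demo):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# A smudge filter that passes content through but logs every invocation.
git config filter.count.smudge 'cat; echo run >> .git/smudge-runs'
echo '*.txt filter=count' > .gitattributes

for i in 1 2 3; do echo "$i" > "f$i.txt"; done
git add .
git -c user.email=demo@example.com -c user.name=demo commit -qm init

rm -f f1.txt f2.txt f3.txt .git/smudge-runs
git checkout -- .          # one filter exec per restored file
wc -l < .git/smudge-runs   # one log line per invocation: 3
```

(Newer git versions added a long-running `filter.<driver>.process` protocol to amortize this cost, if I recall correctly.)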


Yep. It's really a problem that needs to be fixed in git proper. I'm surprised that GitHub, of all people, didn't realize this and/or invest the time to do it right.

The one major drawback of fixing it in git proper is that it wouldn't be backwards compatible with old clients. Doing it in Go is probably a good improvement over the existing solutions, git-media and git-fat, but I don't think it's the final one.

Funny enough, although I had thought this since I started working with git-fat, I only recently admitted it[1]. Perhaps if I had admitted it when I first started work on it then there's a chance they would have seen it! :-P

[1] https://github.com/cyaninc/git-fat/issues/41#issuecomment-88...


Ok, that means I don't have to check out git-lfs, git-fat or git-bigstore. My annex is 250k symlinks pointing to 250 GiB of data. It's slow enough as it is.


At 250k files in one branch, you are starting to run into other scalability limits in git too, like the inefficient method it uses to update .git/index (rewriting the whole thing).


There is the new "split-index" mode to avoid this (see the "git update-index" man page). The base index will contain the 250k files, but .git/index will only contain the entries you update, which should be far fewer than 250k.
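For the curious, enabling it looks like this; after the switch, the bulk of the entries live in a shared index file that is rewritten much less often:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hi > a.txt
git add a.txt

git update-index --split-index
# .git/index now holds only the delta; the bulk sits in .git/sharedindex.<hash>
ls .git/sharedindex.*
```

Later git releases (around 2.12, if I recall correctly) also added a `core.splitIndex` config knob so you don't have to run `update-index` by hand.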


It seems that `--split-index` is only available via `update-index`. Can it be enabled with `add` or via `git config`?


I looked at git-fat as an option for me, but what killed it was rsync as the only backend; I really wanted to send files to S3.

I also looked at git-annex, and I could see using it if it were just me on the project (or as a way of keeping fewer files on my laptop drive), but I was reluctant to add any more complexity to the source control process, since explaining how to use git-annex to the entire team was too big of a barrier.


Thanks for the feedback. There is a PR for S3 support, but it's dormant because it was mixed with other changes that broke compatibility. I haven't personally wanted S3, so haven't made time to rework the PR.


Ah, that is my fault. We're using the fork quite actively, but need to revive that PR and improve the config settings.


So if I have large (1GB+) files in my repo, you recommend against git-fat?

I have been enjoying the simplicity of git-fat, but running git diff and especially git grep makes me think I should switch to something else.


As someone who worked on git-fat a lot, I'd say it's worth switching to git-lfs. It's exactly the same design with a different pointer format, but written in Go.


Seems that git status nowadays does manage to avoid running the smudge filter, unless the file's stat has changed. This overhead does still exist for other operations, like git checkout.


Also, is there any reason Git LFS can't be used as a special remote for git-annex?

It would provide an easy way for people to host their git-annex repos entirely on GitHub.


Yeah, git-annex is very interested in having a special remote for everything and anything. And if someone creates 4 shell commands, I could have a demo working in half an hour. The commands would be:

  lfs-get SHA256 > file
  lfs-store SHA256 < file
  lfs-remove SHA256 (optional)
  lfs-check SHA256 # exit 0 or 1, or some special code if github is not available

Presumably the right way would be to use their HTTP API, but these 4 commands seem generally useful to have anyway.
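Those four could be prototyped against any content-addressed store. A toy version backed by a plain directory, as a stand-in for the real LFS HTTP API (function names and store layout are hypothetical):

```shell
set -e
# Stand-in for the remote object store; real lfs-* commands would talk to
# the Git LFS server instead of a local directory.
LFS_STORE=$(mktemp -d)

lfs_store()  { cat > "$LFS_STORE/$1"; }       # lfs-store SHA256 < file
lfs_get()    { cat "$LFS_STORE/$1"; }         # lfs-get SHA256 > file
lfs_remove() { rm -f "$LFS_STORE/$1"; }       # lfs-remove SHA256 (optional)
lfs_check()  { test -e "$LFS_STORE/$1"; }     # lfs-check SHA256: exit 0 or 1

# Round trip:
echo "hello" > file
sha=$(sha256sum file | cut -d' ' -f1)
lfs_store "$sha" < file
lfs_check "$sha"
lfs_get "$sha"        # prints: hello
```

A real backend would map these onto the LFS upload/download endpoints, but the interface git-annex needs is no bigger than this.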



