Git Large File Storage 1.0 (github.com/blog)
276 points by kccqzy on Oct 1, 2015 | 75 comments



While I'm sure this will help some people use git to address a use case that was previously impossible with git, I can't help but feel that it's a bad step overall for the git ecosystem.

It appears to centralize a distributed version control system, with no option to continue using it in a distributed fashion. What would be wrong with fixing/enhancing the existing git protocols to enable shallow fetching of a commit (I want commit A, but without objects B and C, which are huge)? Git already fully supports working from a shallow clone (not the full history), so it wouldn't be too much of a stretch to make it work with shallow trees (I didn't fetch all of the objects).
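
For reference, shallow history is already a single flag today; the missing piece would be an analogous option for skipping selected blobs (the URL below is just a placeholder):

    git clone --depth 1 https://example.com/big-repo.git   # history-shallow: only the latest commit
    git fetch --depth 50                                   # deepen later if you need more history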

I'm sure git LFS was the quickest way for github to support a use case, but I'm not sure it is the best thing for git.


You could extend the git-lfs "pointer" file to support secure distributed storage using Convergent Encryption [1]. Right now, it's 3 lines:

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345
By adding an extra line containing the CHK-SHA256 (Content Hash Key), you could use a distributed p2p network like Freenet to store the large files, while keeping the data secure from other users (who don't have the OID).

    version https://git-lfs.github.com/spec/v2proposed
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    chk sha256:8f32b12a5943f9e0ff658daa9d22eae2ca24d17e23934d7a214614ab2935cdbb
    size 12345
That's how Freenet / Tahoe-LAFS / GnuNET work, basically.

[1] https://en.wikipedia.org/wiki/Convergent_encryption
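
A rough sketch of the convergent-encryption step with stock tools (the cipher choice and zero IV here are just illustrative assumptions, not part of any spec):

    # Key is derived from the plaintext hash, so identical files encrypt to
    # identical ciphertext and can be deduplicated / hosted publicly.
    KEY=$(sha256sum big-asset.bin | cut -d' ' -f1)          # 256-bit key from the content (the oid)
    openssl enc -aes-256-ctr -K "$KEY" -iv 00000000000000000000000000000000 \
        -in big-asset.bin -out big-asset.enc                # deterministic encryption
    CHK=$(sha256sum big-asset.enc | cut -d' ' -f1)          # what would go on the proposed "chk" line

Only someone who already knows the oid (the plaintext hash) can reconstruct the key, which is exactly the property that would keep the data opaque to other users of the p2p store.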


Mercurial marks their Largefiles[0] support as a "feature of last resort", i.e. enabling it breaks the core concept of what a DVCS is, as you now have a central authority you need to talk to. But at the same time, many people who use Git and Hg use them with a central authoritative repo.

[0] https://www.mercurial-scm.org/wiki/LargefilesExtension


++! When I was in the games industry, it was extremely important to have this feature (and yes, it was a last resort!). This is why, back then, we chose Mercurial over Git.

Unfortunately there was a lot of wackiness, and far too often assets got out of sync. We ended up regressing and put large assets (artwork) into a Subversion repo instead.

I wish there were a better option, such as truncating the history of large files, but that seems to break the concept of Git/Mercurial even more than the current "fix".


How long ago were you having problems with it? I've heard it's gotten a lot better in recent years.


Indeed, that was about 5 years ago. The problems were generally around assets getting out of sync, and occasionally corruption when uploading to the large-file storage server.

Problems generally occurred when a client timed out in the middle of an upload or download.

They were troublesome issues (and silent failures) which made it unusable for production use.

Hope they got it fixed; it was a great concept, and well ahead of Git in attempting to solve this!


git-lfs (and similar systems) split up the storage of objects into the regular git object store (for small files) and the large file storage. This allows you to configure how you get and push large files independently of how you get and push regular objects.

A shallow clone gives you some `n` commits of history (and the objects that they point to). Using LFS allows you to have some `m` commits worth of large files.

If you want a completely distributed workflow, and have infinite local storage and infinite bandwidth, you can fetch all the large files when you do a normal `git fetch`. However, most people don't, so you can tweak this to get only the parts of the large-file history that you're interested in.
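
For example (option names as I recall them from the git-lfs docs, so double-check before relying on them):

    git config lfs.fetchrecentcommitsdays 7            # only objects referenced recently
    git lfs fetch --recent                             # fetch just those
    git lfs fetch origin master --include="assets/*"   # or restrict by path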

Indeed this is a trade off that requires some centralization, but so does your proposed solution of a shallow clone. This adds some subtlety and configurability around that.


Yes, I just don't see why it couldn't have been done directly using/extending existing git protocols.

The additional command and configuration (and perhaps object storage) would be needed either way.


Perhaps by adding a .git_sections file which keeps track of different sets of files you might want to check out, but don't need to. You could define different targets (and a default) so that, say, if you are working on a large video game you could have one repository for everything, but define "artists", "programmers", and "full" targets, where artists can keep their huge assets together with the rest of the repo and programmers can do shallow pulls, not constantly fetching asset files which may or may not be necessary for what they're working on. Something like the hypothetical sketch below.
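
(Purely hypothetical - this is not an existing git feature, just what such a file might look like:)

    [target "full"]
        paths = *
    [target "artists"]
        paths = art/ audio/ levels/
    [target "programmers"]
        paths = src/ tools/ docs/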


It's not about checking out, though. git-lfs aims to avoid the unnecessary transfer and storage of large files.

If you didn't want large files in the work tree, you would use sparse checkout. These are orthogonal problems.
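
For the work-tree side, sparse checkout already exists; a minimal example of the existing mechanism:

    # keep only some paths in the work tree; note that all objects are still fetched
    git config core.sparseCheckout true
    printf 'src/\ndocs/\n' > .git/info/sparse-checkout
    git read-tree -mu HEAD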


I don't see why the centralization is different; can't you just download all the large files and upload them somewhere else?


How? Does git-lfs have a way of downloading and serving all of the files?

If it was fully integrated with git, you could do 'git fetch --all-objects-ever' and your repository could then be cloned and fetched from.


`git lfs fetch --all` should download everything: https://github.com/github/git-lfs/blob/master/docs/man/git-l...


Neat! Is there a way for me to serve those files so people can then use my repository as the authoritative source? I see that they still have the concepts of remotes, so maybe things are getting there?


git lfs push should work to push files that are new to the repo, but I'm not sure it works to push files that exist in the repo but are new to the server (because it is a new remote)

To us, things like git lfs encourage us to ensure there is an open source package that supports it, to keep the D in DVCS, so we'll add support to our Community Edition too.


GitLab giving me an alternative is awesome and commendable. If I'm able to clone a repo from GitHub, including all of the LFS objects, and then push it all to GitLab, that would be better than nothing! Is GitHub contributing to your effort to have an LFS server that is open and free?

Let me give an example of one way it hurts the existing git ecosystem. Someone decides to include their external source dependencies for their project as tarballs using LFS (which is probably dumb and not the use case that LFS is trying to support, but people will do it nonetheless). Now I want to mirror that repository inside of my company's firewall, which hosts its git repositories using just git over ssh. Without LFS, I would just do 'git clone --mirror && git push --mirror' and I internally have a mirror that is easy to keep up-to-date, is dependable, supports extending the source in a way that is easy to contribute back, etc.

Now what options do I have with LFS (outside of GitLab)? Create a tarball with all the source + LFS objects in it? Create a new repository that doesn't use LFS and commit the files to that? Each of these is less than ideal and makes contributing back to the project harder.
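
To make the contrast concrete (the LFS half is my best guess at an equivalent; hostnames are placeholders, and I haven't verified how it behaves from a bare mirror):

    # today, without LFS:
    git clone --mirror git@github.com:example/project.git
    cd project.git && git push --mirror git@git.internal.example:mirrors/project.git
    # with LFS you would additionally need something like:
    git lfs fetch --all
    git lfs push --all git@git.internal.example:mirrors/project.git   # assumes the mirror also speaks the LFS API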

Imagine instead a world where this happened: Github.com announces that they are adding large file support to git. These large files will be using an alternative object storage, but the existing git, http, and ssh protocols will be extended to support fetching these new objects. When support for it lands in the git mainline repository, suddenly everyone will be able to take advantage of it, regardless of how they choose to host their repositories!

I admire GitLab for creating an open source server implementation. I just wish that GitHub had done it a different way that would have been better for the overall git community (not just GitHub users).


I would love to see a Google Piper-like system for git that loads files through a virtual filesystem in FUSE as you access them.


Please see https://github.com/presslabs/gitfs (no LFS support yet I think)


Without locking this is largely useless.

Usually large files are binary blobs (PSD, .ma, etc.) and it becomes incredibly easy to blow away someone's work by not pulling before every file you edit (or when two people edit at the same time).

As much as some people hate Perforce, that's exactly what it is set up to do. Plus their binary syncing algorithms are top-notch. We used to regularly pull a ~300GB art repo (for gamedev) in ~20 minutes.

Git is great for code but this seems like square peg, round hole to me.


I've read the replies to vvanders and he's correct. With binaries you really want some sort of global locking (easy with a centralized system, hard with a distributed system).

I believe his (her?) point is that for a very large class of binaries there is just no upside in parallel development; one guy is going to squash the other guy's work. You want to serialize those efforts.

We don't have global locks yet but we know how to do them, just waiting for the right sales prospect to "force" us to do them. I'm 90% sure we could add them to BK in about a week.


Git annex solves this without locking or losing any of the versions - the actual files get different names (based on hashes of the contents), which are referenced by a symlink tracked by git. If two people edit the same file - pointing the symlink at different filenames - you get a regular git merge conflict.
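
Roughly what that looks like on disk (key format from memory, the hashed directory names vary, and the hash is abbreviated here):

    $ ls -l hero.psd
    lrwxrwxrwx ... hero.psd -> .git/annex/objects/Kj/3v/SHA256E-s52428800--4d7a...2393.psd/SHA256E-s52428800--4d7a...2393.psd

Two different edits produce two different keys, so the "conflict" is just two symlinks pointing at different targets, and both versions of the content survive.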


That's the thing with binary assets: there is no merge path, due to them being binary by nature.


No /automatic/ merge resolution. Obviously you have a tool that can open them (if you edited it in the first place), and you can use that to view the differences, and replay one set of changes. The fact that the SCM detected the conflict, and alerted you, allowing you to resolve it, is a solid improvement over not using an SCM. Further, automatic merge resolution isn't always possible with text-based assets either (and even when it is, it isn't always the correct option!).


I don't think you follow. Programs like Photoshop, Maya and 3DSM don't merge.

Period.

Your only option in this case is to throw away someone's work and force them to redo the work on the file that you decided to keep.


Programs like gedit don't merge. Period.

Yet I can still use it to resolve a merge conflict.

In Photoshop you would do this by opening both images, and visually comparing them to see what's different, then copying the appropriate parts from one to another. Instead of just visually comparing them, you might combine them into one file as separate layers, and use a difference blend to see what changed. If the tool doesn't support that, use ImageMagick to generate a graphical diff (either the `compare` or `composite` commands), and then copy the relevant parts from one to another.
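
Something like this, roughly (from memory; ImageMagick reads `file.psd[0]` as the flattened composite of a PSD):

    compare "mine.psd[0]" "theirs.psd[0]" diff.png                        # highlight per-pixel differences
    composite -compose difference "mine.psd[0]" "theirs.psd[0]" diff.png  # or a difference blend to inspect by eye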

We have fancy tools to help us, but fundamentally, merging is a /human/ operation, that requires human judgment to see how multiple sets of changes can be made to coexist. And that doesn't require a tool (though it can certainly help).


> In Photoshop you would do this by opening both images, and visually comparing them to see what's different, then copying the appropriate parts from one to another.

Good luck.

The point the parent was trying to make is that the lock operation of SVN was quite convenient for preventing a dual-edit scenario on assets that aren't easy to merge, like 3D meshes, scenes, PSDs, etc. It's easy to sit in the ivory tower of text merge resolution given how easy it is in comparison. The atom of change in other tools is quite a bit less obvious. Sure, you can diff a mesh, but merging usually just means redoing it or picking one or the other.


Cool, how do you merge After Effects, AutoCAD, Cinema 4D, Unreal 3 packages, Illustrator, Sketch, Blender, XSI, Lightwave, or any of the other production packages that I've seen used in shipping actual products? What happens if no one used layers in your Photoshop file and the history was collapsed to save performance?

There's a reason Pixar and most mid-to-large game dev studios use Perforce or similar tech: it's because fundamentally you need locking if you're working with binary assets.


Your intelligent merge tool has access to the file history. If one user modifies a layer, and the other one squashes them down, in the merge you probably want to apply the changes in that order, even if it's out-of-order chronologically. If the file format has a full edit history baked in, great; even more info for the intelligent merge. Maybe the in-file history can even be kept in sync with the repository level history.

In the current ecosystem you probably need locking for your sanity, but some day software will suck less.


I can't imagine this is intended to compete head-to-head with something like Perforce. As you've pointed out, it simply can't. But for a repo that's mostly code, with binary assets that get updated occasionally, it's probably a godsend.


> In Photoshop you would do this by opening both images, and visually comparing them to see what's different, (...)

Yeah, good luck with that. With software like Photoshop, not every change is obvious or easily visible. Maybe the other guy tweaked blending parameters of a layer, or reconfigured some effects layers. Or modified the document metadata. Or did a thousand other things that are not immediately visible, or for which it is hard to determine the sequence in which changes should be reapplied.

Maybe you can manually merge the two files to some reasonably good approximation of the intended result, but you can never be sure you didn't miss something important.

Merging tools for text files show enough information for you to know when you've seen every change made. You can't have that with complex binary formats used by graphics programs, mostly because those formats were never explicitly designed to support merging.


> Yeah, good luck with that. With software like Photoshop, not every change is obvious or easily visible.

Well, there's the time-honored technique of "rapidly switch between windows that are zoomed to the same place." But, more rigorously, I mentioned a way to do this; there are tools that can do a diff of raster images--which is what you are making at the end of the day with Photoshop. Sure it can't tell you what blurring parameters someone changed, but you can see that the blur changed, then you can go look at the parameters.

> Or did thousands other things that are not immediately visible

The trickiness of that situation isn't unique to binary formats. It comes up with code too.

> Maybe you can manually merge the two files to some reasonably good approximation of the intended result, but you can never be sure you didn't miss something important.

That's just as true with code as it is with other formats!

> because those formats were never explicitly designed to support merging.

Neither was text. We just ended up making some tools that were reasonably decent at it.

I've been there, I've done that. I've done the 3-way merge with Photoshop files, and resolved the conflicts with 2 different people working on an InDesign file, and broken down to running `diff` on hexdumps of PDF files. Resolving merges with things that don't have nice tools for it isn't fun.

But it's a /lie/ to claim that a conflict for binary formats is "game over, you're just going to steamroll someone's work, there is no path to merge resolution". It's not a fun path, but it's not game over. Which is all I was really trying to refute.

(aside: It's interesting to me that this chain of comments went from being upvoted last night to downvoted this morning.)


> Well, there's the time-honored technique of "rapidly switch between windows that are zoomed to the same place." But, more rigorously, I mentioned a way to do this; there are tools that can do a diff of raster images--which is what you are making at the end of the day with Photoshop. Sure it can't tell you what blurring parameters someone changed, but you can see that the blur changed, then you can go look at the parameters.

I guess this could work in simple cases and if you accept a less-than-pixel-perfect standard; I can see how this would fail when several people are working on a single file for long (because not everything that is important is visible to a visual diff; at the very least you'd end up overwriting whatever scaffolding the other guy set up for his own work), but at this point I'd be questioning the workflow that requires two or more people to work simultaneously on a single asset.

> The trickiness of that situation isn't unique to binary formats. It comes up with code too.

> That's just as true with code as it is with other formats!

Not really - text files don't contain any more data than you can see when you open them in your editor. With text, you see everything. When you open a 3D model or a PSD file, or even a Word document, what you see is just the tip of the iceberg.

> But it's a /lie/ to claim that a conflict for binary formats is "game over, you're just going to steamroll someone's work, there is no path to merge resolution". It's not a fun path, but it's not game over. Which is all I was really trying to refute.

I can agree with that. It's not impossible to do such merges; worst case scenario, one will end up praying to a hex editor like you say you did. It can even be fun sometimes. I guess what 'vvanders was arguing about is practicality - you can do it if you're willing to invest the time, but it's much better to not have to do it at all.

> (aside: It's interesting to me that this chain of comments went from being upvoted last night to downvoted this morning.)

    HN moves in a mysterious way
    its voting to perform;
    A reader questions his comment's downvotes,
    And thus ensues shitstorm.
That is to say, sometimes it's so random and fluctuating that personally, I stopped caring. If I get downvotes I usually a) already know I deserve them for being an ass and/or factually incorrect, b) have someone tell me why I deserve them, or c) assume they're random fluctuations and not worth getting upset about. I think we're dealing with type c) now.


So your suggestion for merging two 3D models that were made in Blender is to use ImageMagick compare to see what's different and then copy the differences from one Blender file to the other?


Is that really inherent to the data, or a case of useful merge plug-ins just not having been written yet for the proprietary file formats of big closed source apps?

Source code doesn't ~really merge all that well either; there's just been a big community of software developers collaboratively fixing up their tools for collaboration.

Of course we have it easy because the tools we're using are ~made of the same stuff we work with daily. No amount of Photoshop filter experience will enable you to write a program that intelligently merges two Photoshop files.


Except Perforce has been moving towards streams to compete with git's more desirable workflow, and you can't lock across streams anyway.


After we add support for git lfs, we plan to add web UI locking for files. This will allow you to lock files when browsing them and prevent others from uploading them.


Won't central storage for the large files also make it straightforward to add locking functionality in a future version of git-lfs, or as an add-on? I agree it sure looks like an omission to have a VCS that is centralized and aimed at binary data without having any locking functionality.


Yeah, but that's assuming a whole lot of infrastructure around marking files read-only.

I also haven't seen what the pull performance is like; P4 is a pretty known quantity (with caching as well).


I don't even necessarily need my client to mark files as read-only on my local machine. A system that just lets me query who has the lock on a certain file, and lets me take the lock if no one has it, is miles ahead of shouting/email.

A remote-only locking system should be pretty easy to implement, e.g. by just throwing a "filename.userid.lock" file into the filesystem next to the file in question.
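
A minimal sketch of that idea, using the repo itself to transport the lock marker (file names and layout are just an illustration):

    # take the lock, failing loudly if anyone else already holds one
    if ls assets/boss.psd.*.lock >/dev/null 2>&1; then
        echo "locked by: $(ls assets/boss.psd.*.lock)"; exit 1
    fi
    touch "assets/boss.psd.$(whoami).lock"
    git add "assets/boss.psd.$(whoami).lock"
    git commit -m "lock assets/boss.psd" && git push   # a racing lock attempt gets rejected as a non-fast-forward

It's only advisory, of course, but advisory plus visible is already most of what people use shouting/email for.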


We plan to add locking to the GitLab web UI at some point. At least this will prevent two people from locking the same file.


Translation: because it is not useful for you it's not useful for anybody else.

That's nonsense of course.

There are a lot of use cases where this would be very helpful without locking (e.g. JARs/DLLs).

This is useful now, locking can come later. We don't have to solve every conceivable problem all at once. More progress is made in small incremental steps than big bang leaps.


This is great. We plan to ship alpha LFS support in GitLab CE, EE, and .com in 8.1 or 8.2. That is in addition to the git annex support that EE and .com have already had for a while: https://about.gitlab.com/2015/02/17/gitlab-annex-solves-the-...


Is it safe to assume that GitLab's implementation of Git LFS will allow hosting the file storage server on premises, and potentially on another machine than the one running GitLab?


It for sure will allow you to host it on premises. We'll open source our LFS server so people can use it for other purposes.


Will you be supporting both indefinitely or is there a plan to transition to a single well-supported solution for large files over the coming N releases?


It seems that LFS is winning over Annex here; we might drop Annex in 9.x, but nothing is certain.


If LFS is winning but Annex is the better solution (as the comments I've read point to), it would be a shame to drop Annex.


There is a cost to everything and we try to be pragmatic. Git Annex is causing a lot of work for us in the Omnibus packages. Video 2000 was also technically better than VHS, but people still stopped producing equipment for it much sooner.


I haven't been following the various Git large file solutions - can someone comment on how this implementation compares to git-annex or whatever else is out there?


There are a lot of comparisons in the original announcement of LFS on HN https://news.ycombinator.com/item?id=9343021


Also interesting is the comparison in http://git-annex.branchable.com/not/


The complaints about git-lfs make me think we need to tell people about BAM for BitKeeper.

http://www.bitkeeper.com/features_binary_asset_management_ba...

BAM works on a similar idea: instead of saving large files in the local repository, users are allowed to save them on a centralized server. This saves disk space and network transfer time.

However, unlike other solutions, BAM preserves the semantics of distributed development.

Instead of requiring a single or standardized set of servers, every user can have a different BAM server. Data is moved between servers automatically and on demand.

One group in an office might use a single BAM server for storing all their data close and locally. When another development group is started in India, they can use a server local to them. The binary assets will automatically transfer to the India server as commits are pulled between sites.

This allows centralized storage of your data and yet still supports having a team work while completely disconnected from the internet.


I've been using BAM for quite a while (I'm one of the developers of it). I use it to store my photos. I've got 55GB of photos in there, and backing them up is:

    cd photos
    bk push
Works pretty well. When my mom was still alive we pushed them to her iMac and the screen saver was pointed at that directory. So she got to see the kids and I got another backup.


If for whatever reason lfs doesn't work for you, check out our solution to large file storage on git: https://github.com/dailymuse/git-fit

We wrote it because of various issues with the tools available at the time, basically boiling down to wanting a dead simple solution.

I haven't tried lfs, but if it's anything like github's other software, then I'm sure it's substantially better than our tool.


I worked at Unity for a couple years and they are one of the biggest users of (and maintainers of) the Mercurial LargeFiles extension, so I was using that on a daily basis.

I agree that it should be a measure of last resort, but if you can't avoid working with big binary files, it makes the difference between a workflow that is a bit more cumbersome, and one that just grinds to a halt. Getting this functionality in git is great. And it'll mean a huge step forward in collaboration tools for game developers. You pretty much can't avoid big binary files when making games - and so far they've been stuck with SVN or Perforce (or the more adventurous ones are trying out PlasticSCM, which apparently is pretty nice too, but is proprietary and doesn't have a big ecosystem around it like git does). I hope this can lead to a boom of game developers using git.


Yup. I'm using git for game source code, and I often hold off on any commits to graphics/music until the project is done. Any workaround outside git means you have two systems to manage, and it can get really painful.


We had a pretty good discussion here when this was initially announced six months ago: https://news.ycombinator.com/item?id=9343021 Looks like they've had some success!


Not sure if they were working together with GitHub on this, but Microsoft also announced today that Visual Studio Online Git repos now support Git-LFS with unlimited free storage:

http://blogs.msdn.com/b/visualstudioalm/archive/2015/10/01/a...



Where are the files actually stored? I hear "git lfs server" in the demo video; can this be changed? Can I init my repo and tell it to push all my objects to my own private S3 bucket, or can I only rely on some outside LFS server I don't control?


Not with GitLFS, since it's designed under the assumption that the hostname serving your repo over ssh is also the GitLFS server over HTTP.

However git-fat[1] is an alternative system that works in much the same way, but lets you configure where the files are stored.

[1]: https://github.com/cyaninc/git-fat
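
For reference, git-fat's configuration is just a small .gitfat file in the repo root pointing at an rsync target, something like this (from memory, so check the README; the host/path is a placeholder):

    [rsync]
    remote = storage.example.com:/share/fat-store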


This could be a corollary to P. Graham's "do things that don't scale": don't do anything that involves plonking stupidly large files into version control.


Good thing that Atlassian/Bitbucket will also be supporting it: https://blog.bitbucket.org/2015/10/01/contributing-to-git-lf...

And very glad to read that they decided to contribute to this instead of working on their own solution for the same problem. Kudos!


The fact that both Atlassian and GitHub intended to unveil their own almost identical competing solutions, both built in Go, in consecutive sessions at the Git Merge conference (without either being aware of the other) is pretty hilarious.


Git-lfs has been helpful for managing my repo of scientific research data. Hundreds of large-ish Excel files, PNGs, and HDF5 files add up quickly if you're doing lots of small edits.

There are still some warts (don't forget git lfs init after cloning!), but it's mostly fast and transparent. I also ponied up $5 a month to get 50 gigs or so of LFS storage. Decent deal IMHO.
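
For anyone curious, the setup is basically just tracking patterns (commands as I've used them; the file types are just my own examples):

    git lfs install                  # "git lfs init" in older releases
    git lfs track "*.xlsx"
    git lfs track "*.hdf5"
    git add .gitattributes
    git commit -m "Track large binaries with LFS"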


As someone new to this idea, the README helped clarify the workflow: https://github.com/github/git-lfs/blob/master/docs/api/READM...


That video is hilarious! I wish we had more awesome videos like this for new technologies!


Is there a solution that doesn't depend on external storage?

I have data that belongs to my source, but is rather big and I want it inside of my repo.


They do have a reference implementation of the server side here: https://github.com/github/lfs-test-server - though they themselves don't consider it production-ready. But I'm sure it'll either get there in time, or another open source implementation will rise to the challenge (cf. sytse's comment about GitLab planning support for this: https://news.ycombinator.com/item?id=10313495 )
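
If I'm remembering the config key right, pointing a clone at a self-hosted endpoint is a one-liner (the URL is a placeholder, and treat the exact key name as my recollection rather than gospel):

    git config lfs.url "https://lfs.internal.example/my/repo/info/lfs"   # use your own LFS server instead of the host's default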


What happens if someone who hasn't downloaded the command line tools tries to clone your repo? Will they get the big files too?


I believe they just get the references to the big files, not the files themselves.


No, they will just get the small, metadata-containing pointer files.


Was I the only one who expected that bear to move on its own?


Any information on GitHub Enterprise support?


GitHub Enterprise has been supporting LFS since 2.2 (current latest is 2.3.3) in a technical preview mode. See here: https://enterprise.github.com/releases/2.2.0/notes



