Git as a NoSQL datastore

trjordan · on Nov 22, 2010

I want ACID in my databases:

  - Atomic -

Yep, but done with a global lock.

  - Consistency -

Yep. Since it's NoSQL, there are no referential checks, so this is easy. Git itself probably requires that the index and commit tree be readable, and all attributes.json files must be valid. I don't see any reason they wouldn't be.

  - Isolated -

Sure, works fine, due to that global lock.

  - Durable -

Check. Once it's written, it's written.

So, git doesn't get you any of this for free, and getting Atomic / Isolated is done by introducing a big ugly global lock. I guess I don't see the appeal of using git (or even a model like git) because it doesn't get you any interesting necessary features for free. I see the backup and history thing as a secondary feature -- neat, but not worth sacrificing ACID for.

That said, cool idea -- you don't need a lot of code to get there, so the codebase stays reasonably small.

pauldowman · on Nov 22, 2010

I'm pretty sure I can fix that global lock, but I'm focusing on getting the basics solid before focusing on performance.

My idea was that if there are concurrent transactions, then you just need a merge when the second one commits.

This is "optimistic locking": since they're probably updating different subdirectories it would be a trivial merge. If they update the same attribute of the same entity (i.e. they both change the same line of the same attributes.json file) then there's a conflict and the second commit fails.

This still requires a lock but only a tiny one, just at the moment of actually updating refs/heads/<branchname> (whereas right now it's around the whole transaction which obviously sucks).
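A minimal Python sketch of the scheme described above, using an in-memory dictionary in place of a Git repository. The names `Store`, `Transaction`, and `ConflictError` are hypothetical and are not part of GitModel or Grit. The merge is computed outside the lock; the lock is held only for the final compare-and-swap of the head, which stands in for updating refs/heads/<branchname>.

```python
import threading


class ConflictError(Exception):
    """Raised when two transactions changed the same attribute (hypothetical name)."""


class Store:
    def __init__(self):
        self.head = {}                    # committed snapshot: (entity, attribute) -> value
        self.ref_lock = threading.Lock()  # the "tiny" lock around the head update

    def begin(self):
        return Transaction(self, self.head)


class Transaction:
    def __init__(self, store, base):
        self.store = store
        self.base = base   # snapshot the transaction started from
        self.writes = {}   # pending changes, invisible to other transactions

    def set(self, entity, attribute, value):
        self.writes[(entity, attribute)] = value

    def commit(self):
        while True:
            head = self.store.head
            # Conflict: a key we wrote was also changed by someone else since our base.
            conflicts = sorted(k for k in self.writes
                               if head.get(k) != self.base.get(k))
            if conflicts:
                raise ConflictError(conflicts)
            merged = {**head, **self.writes}  # trivial merge: disjoint changes
            with self.store.ref_lock:
                if self.store.head is head:   # head unchanged since we merged
                    self.store.head = merged
                    return
            # Head moved while we were merging; merge again against the new head.
```

Two transactions that touch different attributes both commit; two that touch the same attribute leave the first one in place and raise `ConflictError` for the second.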

Are there any flaws in that?

The reason I haven't done it yet even though it sounds pretty trivial is that with Grit (the Ruby Git library) the merge will need to use the working directory and that's a problem for concurrency (and also because I want the working directory to be useable by a human). I was thinking that with a bit of extra work I could make it happen in memory somehow.

Paul

po · on Nov 22, 2010

Maybe you should just think of it as the NoSQL version of sqlite3 but with added version control.

jemfinch · on Nov 22, 2010

What does that even mean?

JoachimSchipper · on Nov 22, 2010

I think "easy to set up" combined with "less crappy than expected". Although SQLite isn't very crappy at all...

po · on Nov 23, 2010

SQLite is often used as a quick and dirty database by projects that don't want to burden their developers with the setup overhead of MySQL or Postgres.

It's a tiny, rock-solid executable that stores data in a single binary file and supports SQL access. Access to the file is globally locked. It's good enough for testing out a project on a single developer's machine but the global lock makes it hard to use in a multi-user production scenario.

In a similar vein I could see git as a widely available, rock-solid, "quick and dirty" way for a project to store and retrieve documents.

chopsueyar · on Nov 22, 2010

Because SQLite stores everything in a flat file, you are just doing version control on flat-file changes.

vog · on Nov 22, 2010

> - Consistency -

> Since it's NoSQL, there's no referential checks, so this is easy.

Consistency is quite easily achieved by putting the consistency checks into appropriate Git hooks. That way, Git would not allow any commit/pull of inconsistent data. A "rollback" is then trivial, as this only has to be done in the local working directory. So regarding consistency, Git provides everything that SQL databases provide, not just the limited NoSQL interpretation of "consistency".
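As a rough illustration of such a hook, the Python sketch below checks that every attributes.json file under a directory parses as JSON. The function names and the choice to validate only JSON syntax are assumptions made for the example; a real hook would carry whatever application-level checks the data needs, and would probably inspect the staged content rather than the working directory.

```python
import json
import os
import sys


def find_invalid_json(root, filename="attributes.json"):
    """Return the sorted paths of files named `filename` that are not valid JSON."""
    invalid = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8") as handle:
                    json.load(handle)
            except (OSError, ValueError):  # unreadable, badly encoded, or malformed
                invalid.append(path)
    return sorted(invalid)


def main(root="."):
    """Return 1 (reject) if any file is invalid, else 0 (accept)."""
    bad = find_invalid_json(root)
    for path in bad:
        print(f"invalid JSON: {path}", file=sys.stderr)
    return 1 if bad else 0
```

Saved as an executable .git/hooks/pre-commit script that ends with `sys.exit(main())`, a non-zero exit status makes Git abort the commit.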

Also note that this already happens in the traditional use of Git: There, consistency checks are called "building the application and running the test suite". It is a vital part of continuous integration systems. So consistency checks are not only theoretically possible, but already actively practiced with Git.

chopsueyar · on Nov 22, 2010

I feel it is similar to the motivation when one attempts to install Linux on some random device.

I expect to see a lot more 'interesting' projects that create NoSQL datastores in creative ways.

bergie · on Nov 22, 2010

Wouldn't atomicity come from doing things on per-commit basis?

Also, you're not taking into account the things that Git does give you that typical databases don't, like branching and merges.

technoweenie · on Nov 22, 2010

Git doesn't give you merges, it gives you merge conflicts. I find it similar to how Riak uses vector clocks (http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/). It knows enough to tell you there's a conflict, but requires you (or the app) to tell it which version is the right one.
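For readers unfamiliar with vector clocks, here is a small Python sketch of the comparison being described. It is a generic illustration, not Riak's implementation; the function name and the string return values are made up for the example. Each clock maps a node name to a counter, and two versions conflict when each has seen an update the other has not.

```python
def compare(a, b):
    """Compare two vector clocks given as {node: counter} dictionaries."""
    nodes = set(a) | set(b)
    # A missing entry counts as zero: that node's updates were never seen.
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"  # concurrent updates: the application must pick a winner
    if a_ahead:
        return "a_newer"   # a has seen everything b has, and more
    if b_ahead:
        return "b_newer"
    return "equal"
```

As with a Git merge conflict, the comparison only detects that the histories diverged; choosing the surviving version is left to the caller.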

pauldowman · on Nov 22, 2010

Right, it's either a trivial merge when you commit the transaction, or it's a conflict (i.e. two transactions updated the same attribute of the same entity).

There's no automatically fixing a conflict so the second transaction fails. Since there's full isolation that's pretty standard optimistic locking: http://en.wikipedia.org/wiki/Optimistic_locking

In most cases the transaction would succeed if it was automatically retried.
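A sketch of what such an automatic retry could look like in Python. `ConflictError` and `with_retries` are hypothetical names, not GitModel API, and the attempt count and backoff values are arbitrary. The callable passed in is expected to re-read current state and redo the whole transaction each time it is called.

```python
import random
import time


class ConflictError(Exception):
    """Hypothetical error raised when a commit hits a merge conflict."""


def with_retries(transaction, attempts=3, base_delay=0.01):
    """Run `transaction` until it commits, retrying on conflicts."""
    for attempt in range(attempts):
        try:
            return transaction()
        except ConflictError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the conflict to the caller
            # Randomized exponential backoff so competing writers spread out.
            time.sleep(base_delay * (2 ** attempt) * random.random())
```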

technoweenie · on Nov 22, 2010

What global lock? Are you talking about the index? There are options for writing the index to separate directories. You can write each blob/tree separately, build an index in a unique temporary directory, and then the commit.
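One way this could look, sketched in Python by shelling out to Git's plumbing commands with `GIT_INDEX_FILE` pointed at a private temporary index. The helper names, the ref name used in the usage note, and the placeholder identity are assumptions for the example; this is not how GitModel or Grit do it. The final `update-ref` passes the old value, so the ref only moves if nobody else has moved it in the meantime.

```python
import os
import subprocess
import tempfile

# Placeholder identity (an assumption) so commit-tree works without git config.
IDENTITY = {
    "GIT_AUTHOR_NAME": "example", "GIT_AUTHOR_EMAIL": "example@example.invalid",
    "GIT_COMMITTER_NAME": "example", "GIT_COMMITTER_EMAIL": "example@example.invalid",
}


def git(repo, *args, index_file=None, stdin=None):
    """Run a git command in `repo` and return its stripped stdout."""
    env = dict(os.environ, **IDENTITY)
    if index_file:
        env["GIT_INDEX_FILE"] = index_file  # use a private index, not .git/index
    done = subprocess.run(["git", *args], cwd=repo, env=env, input=stdin,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
    return done.stdout.decode().strip()


def current_commit(repo, ref):
    """Return the commit id `ref` points to, or None if the ref does not exist."""
    try:
        return git(repo, "rev-parse", "--verify", "--quiet", ref)
    except subprocess.CalledProcessError:
        return None


def commit_file(repo, ref, path, content, message):
    """Commit `content` (bytes) at `path` on `ref` without touching the shared index."""
    blob = git(repo, "hash-object", "-w", "--stdin", stdin=content)
    parent = current_commit(repo, ref)
    with tempfile.TemporaryDirectory() as scratch:
        index = os.path.join(scratch, "index")
        if parent:
            git(repo, "read-tree", parent, index_file=index)  # start from the parent's tree
        git(repo, "update-index", "--add", "--cacheinfo", "100644", blob, path,
            index_file=index)
        tree = git(repo, "write-tree", index_file=index)
    parents = ["-p", parent] if parent else []
    commit = git(repo, "commit-tree", tree, *parents, "-m", message)
    # Passing the old value makes update-ref fail if the ref moved meanwhile.
    git(repo, "update-ref", ref, commit, *([parent] if parent else []))
    return commit
```

For example, `commit_file(repo, "refs/heads/data", "users/1/attributes.json", b'{"name": "Ann"}', "add user 1")` writes one commit without reading or writing the working directory.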

Is there something else I'm missing?

pauldowman · on Nov 22, 2010

GitModel itself has a global lock (for now), here: https://github.com/pauldowman/gitmodel/blob/master/lib/gitmo...

I think it's easily fixed with a merge, as I described here: http://news.ycombinator.com/item?id=1929691

fhars · on Nov 22, 2010

Concurrent updates to the same object.

You could get around some of the limitations by creating a branch for every session (though probably not on a high volume site), but then you will have lots of interesting problems once a session ends and you have to merge all changes back into master.

technoweenie · on Nov 22, 2010

Ahh, yes this would be a problem on GitModel. I was thinking of Git at a lower level, since you can't update objects without also changing the SHA.
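To make the point about SHAs concrete: a Git blob's id is the SHA-1 of a short header plus the file content, so any change to the content yields a different object. The Python sketch below reproduces that calculation for blobs; the function name is made up for the example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object id Git assigns to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

Because the id is derived from the content, an existing object is never modified in place; an "update" writes a new object and moves a reference to it.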

JeffJenkins · on Nov 22, 2010

I wrote code to do this in Python for a project. The biggest problem I ran into was that if you wanted the full history of X/Y/Z.json to contain the full history of Z.json, even if it had been moved, it ended up requiring two parallel structures: one with directories for the tree data, and one for the raw data of each node.

My intent was to use it in a document system using Operational Transforms, which side-stepped the issue of concurrent access; only the canonical representation of client data would need to be written, so it was serial writing.

derefr · on Nov 22, 2010

Huh, I was half-way to creating something in the same vein, while trying to use Git as the "world-state synchronization protocol" for a distributed MOO. Back to the (much-simplified) drawing-board :)

Vitaly · on Nov 22, 2010

I don't think it's intended to be used for frequently changed models like 'user'. It is much better suited to document-like models that change infrequently, for example the 'content' model in a blog or CMS.

Having history might not be a big deal, but having branches IS!

It is great for stuff like testing some new set of pages on a staging server and then 'adding' them to the currently running production server which kept changing while you were working on the stage. Try that with SQL!

pauldowman · on Nov 22, 2010

Yes, it would be a lot more appropriate for things that don't change that frequently. I actually started writing a blogging app for coders, and then succumbed to a bout of extraction distraction by generalizing the part that read the pages/posts from a Git repo: https://github.com/pauldowman/balisong

yatsyk · on Nov 22, 2010

I also thought about a similar project. A Git-based active model has some drawbacks and is certainly not for every project. It's not so good if there are many users making a lot of changes to the data (like Facebook). But it's very interesting for applications where a few users make changes and the site looks static to the rest of the visitors (any CMS, for example). And we get a lot of cool git features for free.

po · on Nov 22, 2010

Also of interest and in a similar sort of spirit:

Bup: Git based backups - https://github.com/apenwarr/bup

pygy_ · on Nov 22, 2010

In the same vein: Gibak.

http://eigenclass.org/hiki/gibak-backup-system-introduction

http://eigenclass.org/hiki/gibak-0.3.0

https://github.com/pangloss/gibak for an improved fork (last commit in 2008, but still...)

pauldowman · on Nov 22, 2010

Bup looks awesome. All the benefits of rsync plus the ability to do incremental backups.

"Even when a backup is incremental, you don't have to worry about restoring the full backup, then each of the incrementals in turn; an incremental backup acts as if it's a full backup, it just takes less disk space"

.... nice!

silentbicycle · on Nov 22, 2010

Well, sure, git provides an append/log-based distributed hash store. It's probably not what you're looking for, though: It doesn't have a library for efficient in-process access, so you need to spawn a git shell command per operation, with somewhat opinionated semantics. It's also GPL'd.

It's good for prototyping, though. (And I hear it's good at managing changes for source code!)

WALoeIII · on Nov 22, 2010

"It doesn't have a library for efficient in-process access"

https://github.com/mojombo/grit

timf · on Nov 22, 2010

There is a Python port, too:

http://packages.python.org/GitPython/0.3.1/

( http://gitorious.org/git-python )

pauldowman · on Nov 22, 2010

Yep, GitModel uses Grit.

silentbicycle · on Nov 22, 2010

Ah, via Ruby. RUBY?!??! Ah. Delightful.

YECCH

So, uh, can't we just make an efficient, well-designed, C library for accessing those data structures, write BSD-licensed wrappers as consenting adults, and get on with our lives? I'm speechless.

Is writing an append-only log data-structure system really that hard? Ok, then.

holman · on Nov 22, 2010

We use Grit extensively all over GitHub. It works quite well. I've yet to have a seizure from it.

At the same time, we're also funding work on libgit2, which is an efficient, well-designed C library for accessing those data structures. There are also wrappers for consenting adults. http://libgit2.github.com

silentbicycle · on Nov 22, 2010

Thanks for your informative response. Are you interested in a Lua wrapper?

pygy_ · on Nov 22, 2010

I would love to see this happening. I could then port gitmodel on top of your lib and get a NoSQL DB, easy to sync with a single user desktop app for the iOS platform (in conjunction with Wax).

silentbicycle · on Nov 22, 2010

I may get to it this weekend. I'll keep you posted.

technoweenie · on Nov 22, 2010

Sure: https://github.com/libgit2/libgit2.

aditya · on Nov 22, 2010

What's stopping you?

silentbicycle · on Nov 22, 2010

Git itself works fine, for my purposes. If I wrote a C library for managing git's data structures, it would basically be out of spite. I may make a Lua wrapper for libgit2, though - it's more what I had in mind (though it's too bad it's GPL'd).

I have a library for managing distributed graphs on top of an append-only log file, but it's got different design trade-offs than git does - It's for a specific C/Lua project, and the semantics of the data don't match git's.

asb · on Nov 24, 2010

libgit2 is GPLed, but importantly also has a libgcc-style exception.

Sirupsen · on Nov 22, 2010

This could be interesting for projects where keeping revisions would be handy, for instance a note-taking or document platform.

rlpb · on Nov 22, 2010

Yes - how about Tomboy/gnote integration?

epynonymous · on Nov 22, 2010

Have you done any performance tests? How well does git scale in general when you have billions of files and versions?

This gives me some ideas, good stuff.

zoomzoom · on Nov 22, 2010

I have seen huge deployments of git run into problems, which is why git is less than perfect as a system-wide backup tool.

uriel · on Nov 22, 2010

On a somewhat related note, see also Venti: http://doc.cat-v.org/plan_9/4th_edition/papers/venti/