Git as a NoSQL datastore (github.com/pauldowman)
84 points by joshfraser on Nov 22, 2010 | 42 comments



I want ACID in my databases:

  - Atomic - 
Yep, but done with a global lock.

  - Consistency - 
Yep. Since it's NoSQL, there's no referential checks, so this is easy. Git itself presumably needs the index and commit tree to be readable, and all attributes.json files to be valid. I don't see any reason they wouldn't be.

  - Isolated - 
Sure, works fine, due to that global lock.

  - Durable - 
Check. Once it's written, it's written.

So, git doesn't get you any of this for free, and getting Atomic / Isolated is done by introducing a big ugly global lock. I guess I don't see the appeal of using git (or even a model like git) because it doesn't get you any interesting necessary features for free. I see the backup and history thing as a secondary feature -- neat, but not worth sacrificing ACID for.

That said, cool idea -- it doesn't take a lot of code, so you end up with a reasonably small codebase.


I'm pretty sure I can fix that global lock, but I'm focusing on getting the basics solid before focusing on performance.

My idea was that if there are concurrent transactions, then you just need a merge when the second one commits.

This is "optimistic locking": since they're probably updating different subdirectories it would be a trivial merge. If they update the same attribute of the same entity (i.e. they both change the same line of the same attributes.json file) then there's a conflict and the second commit fails.

This still requires a lock but only a tiny one, just at the moment of actually updating refs/heads/<branchname> (whereas right now it's around the whole transaction which obviously sucks).

Are there any flaws in that?

The reason I haven't done it yet even though it sounds pretty trivial is that with Grit (the Ruby Git library) the merge will need to use the working directory and that's a problem for concurrency (and also because I want the working directory to be useable by a human). I was thinking that with a bit of extra work I could make it happen in memory somehow.

Paul
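A minimal sketch of the scheme described above, assuming a toy in-memory stand-in for the repository rather than Grit or GitModel. The names `Repo`, `Transaction`, `merge` and `Conflict` are hypothetical. The merge runs with no lock held, and the lock covers only the head swap, which plays the role of updating refs/heads/<branchname>.

```python
import threading


class Conflict(Exception):
    """Two transactions changed the same attribute of the same entity."""


class Transaction:
    def __init__(self, base):
        self.base = base      # snapshot of the head when the transaction began
        self.changes = {}     # {(entity_path, attribute): new_value}

    def set(self, entity_path, attribute, value):
        self.changes[(entity_path, attribute)] = value


class Repo:
    """Toy stand-in for one branch (hypothetical, not the GitModel API)."""

    def __init__(self):
        self.head = {}                      # never mutated, only replaced
        self._ref_lock = threading.Lock()   # guards only the head swap

    def begin(self):
        return Transaction(self.head)

    def commit(self, tx):
        while True:
            seen = self.head
            merged = merge(tx.base, seen, tx.changes)   # no lock held here
            with self._ref_lock:
                if self.head is seen:       # head unchanged since the merge
                    self.head = merged
                    return
            # Another commit landed in the meantime, so merge again.


def merge(base, head, changes):
    """Trivial merge: fail only if a key we changed also changed in head."""
    for key in changes:
        if head.get(key) != base.get(key):
            raise Conflict(key)
    merged = dict(head)
    merged.update(changes)
    return merged
```

Two transactions that touch different entities both commit. Two that set the same attribute of the same entity leave the second one with a `Conflict`. Deletions and real tree objects are left out of this sketch.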


Maybe you should just think of it as the NoSQL version of sqlite3 but with added version control.


What does that even mean?


I think "easy to set up" combined with "less crappy than expected". Although SQLite isn't very crappy at all...


SQLite is often used as a quick and dirty database by projects that don't want to burden their developers with the setup overhead of MySQL or Postgres.

It's a tiny, rock-solid executable that stores data in a single binary file and supports SQL access. Access to the file is globally locked. It's good enough for testing out a project on a single developer's machine, but the global lock makes it hard to use in a multi-user production scenario.

In a similar vein I could see git as a widely available, rock-solid, "quick and dirty" way for a project to store and retrieve documents.


Because SQLite stores everything in a flat file, you are just doing version control on flat-file changes.


> - Consistency -

> Since it's NoSQL, there's no referential checks, so this is easy.

Consistency is quite easily achieved by putting the consistency checks into appropriate Git hooks. That way, Git would not allow any commit/pull of inconsistent data. A "rollback" is then trivial, as this only has to be done in the local working directory. So regarding consistency, Git provides everything that SQL databases provide, not just the limited NoSQL interpretation of "consistency".

Also note that this already happens in the traditional use of Git: There, consistency checks are called "building the application and running the test suite". It is a vital part of continuous integration systems. So consistency checks are not only theoretically possible, but already actively practiced with Git.
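As a rough illustration of the hook-based checks described above, here is a sketch of a pre-commit hook written in Python. It assumes that "consistent" means every staged attributes.json file parses as a JSON object; that rule, and the function names, are assumptions made for the example. The git commands used (`git diff --cached --name-only -z` and `git show :<path>`) are standard.

```python
#!/usr/bin/env python3
"""Hypothetical .git/hooks/pre-commit: reject invalid attributes.json files."""
import json
import subprocess
import sys


def json_error(text):
    """Return an error message if text is not a JSON object, else None."""
    try:
        value = json.loads(text)
    except ValueError as exc:
        return str(exc)
    if not isinstance(value, dict):
        return "top-level value must be an object"  # assumed rule
    return None


def staged_paths():
    """Paths of added, copied or modified attributes.json files in the index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "-z", "--diff-filter=ACM"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [p for p in out.split("\0") if p.endswith("attributes.json")]


def staged_content(path):
    """Contents of the staged version of path (not the working-tree copy)."""
    return subprocess.run(
        ["git", "show", ":" + path],
        check=True, capture_output=True, text=True,
    ).stdout


def main():
    failed = False
    for path in staged_paths():
        error = json_error(staged_content(path))
        if error:
            print(f"{path}: {error}", file=sys.stderr)
            failed = True
    return 1 if failed else 0  # a non-zero exit status aborts the commit

# In the real hook file, finish with: sys.exit(main())
```

A pre-commit hook only guards local commits. Checking data that arrives by push would need a server-side hook such as pre-receive, which this sketch does not cover.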


I feel it is similar to the motivation when one attempts to install Linux on some random device.

I expect to see a lot more 'interesting' projects that create NoSQL datastores in creative ways.


Wouldn't atomicity come from doing things on per-commit basis?

Also, you're not taking into account the things that Git does give you that typical databases don't, like branching and merges.


Git doesn't give you merges, it gives you merge conflicts. I find it similar to how Riak uses vector clocks (http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/). It knows enough to tell you there's a conflict, but requires you (or the app) to tell it which version is the right one.
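For readers unfamiliar with the comparison, here is a minimal sketch of how a vector clock tells "one version supersedes the other" apart from "conflict". It is a generic illustration, not Riak's implementation; clocks are modelled as plain dicts from a node name to a counter, and the function names are made up.

```python
def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two versions by their vector clocks."""
    a_ge_b = descends(a, b)
    b_ge_a = descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a supersedes b"
    if b_ge_a:
        return "b supersedes a"
    return "conflict"  # concurrent updates: the application must choose
```

The "conflict" case is the analogue of a Git merge conflict: the system knows the histories diverged but not which side should win.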


Right, it's either a trivial merge when you commit the transaction, or it's a conflict (i.e. two transactions updated the same attribute of the same entity).

There's no automatically fixing a conflict so the second transaction fails. Since there's full isolation that's pretty standard optimistic locking: http://en.wikipedia.org/wiki/Optimistic_locking

In most cases the transaction would succeed if it was automatically retried.
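A sketch of what such an automatic retry could look like, assuming the transaction is wrapped in a callable that re-reads current state each time it runs. `Conflict` and `with_retries` are hypothetical names, not part of GitModel.

```python
import random
import time


class Conflict(Exception):
    """Raised when a commit loses an optimistic-locking race."""


def with_retries(transaction, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run transaction(), retrying on Conflict with jittered backoff.

    transaction must redo its reads on every call, otherwise a retry
    would hit the same conflict again.
    """
    for attempt in range(attempts):
        try:
            return transaction()
        except Conflict:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the conflict to the caller
            sleep(base_delay * (2 ** attempt) * random.random())
```

The random jitter is there so that two transactions that collided once are less likely to collide again on the retry.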


What global lock? Are you talking about the index? There are options for writing the index to separate directories. You can write each blob/tree separately, build an index in a unique temporary directory, and then the commit.

Is there something else I'm missing?
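The lower-level approach described above could look roughly like the following sketch, which drives Git's plumbing commands through `subprocess` with a private index file selected by the `GIT_INDEX_FILE` environment variable. It assumes the branch already has at least one commit and that a committer identity is configured. The helper names are hypothetical; the git commands themselves are standard plumbing.

```python
import os
import subprocess
import tempfile


def private_index_env(index_path, base_env=None):
    """Copy of the environment that points git at a private index file."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_INDEX_FILE"] = index_path
    return env


def git(args, env, repo, stdin=None):
    """Run one git command in repo and return its trimmed stdout."""
    return subprocess.run(
        ["git"] + args, cwd=repo, env=env, input=stdin,
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def commit_one_file(repo, ref, path, content, message):
    """Commit content at path on ref without touching the working directory."""
    with tempfile.TemporaryDirectory() as tmp:
        env = private_index_env(os.path.join(tmp, "index"))
        parent = git(["rev-parse", "--verify", ref], env, repo)
        git(["read-tree", parent], env, repo)              # seed private index
        blob = git(["hash-object", "-w", "--stdin"], env, repo, stdin=content)
        git(["update-index", "--add", "--cacheinfo", "100644", blob, path],
            env, repo)
        tree = git(["write-tree"], env, repo)
        commit = git(["commit-tree", tree, "-p", parent, "-m", message],
                     env, repo)
        # Passing the old value makes update-ref fail if the ref has moved.
        git(["update-ref", ref, commit, parent], env, repo)
        return commit
```

A call might look like `commit_one_file(".", "refs/heads/master", "posts/1/attributes.json", "{}\n", "update post")`. The old-value argument to `git update-ref` gives a compare-and-swap on the branch, so a failed call signals that another writer committed first.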


GitModel itself has a global lock (for now), here: https://github.com/pauldowman/gitmodel/blob/master/lib/gitmo...

I think it's easily fixed with a merge, as I described here: http://news.ycombinator.com/item?id=1929691


Concurrent updates to the same object.

You could get around some of the limitations by creating a branch for every session (though probably not on a high volume site), but then you will have lots of interesting problems once a session ends and you have to merge all changes back into master.


Ahh, yes this would be a problem on GitModel. I was thinking of Git at a lower level, since you can't update objects without also changing the SHA.
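To make the point about SHAs concrete: a blob's ID is the SHA-1 of a short header plus the file contents, so any change to the contents yields a different object. A small sketch of that calculation (the function name is made up; the format is the one `git hash-object` uses for blobs in SHA-1 repositories):

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 object ID git assigns to a blob with these contents."""
    header = b"blob %d\0" % len(content)   # "blob <size>" followed by NUL
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Since the ID is derived from the contents, "updating" an object really means writing a new object and pointing a tree at it, which is why concurrent writers never overwrite each other at this level.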


I wrote code to do this in python for a project. The biggest problem I ran into was that if you wanted the full history of X/Y/Z.json to contain the full history of Z.json, even if it had been moved, it ended up requiring two parallel structures. One structure with directories for the tree data, one for the raw data of that node.

My intent was to use it in a document system using Operational Transforms, which side-stepped the issue of concurrent access; only the canonical representation of client data would need to be written, so writes were serial.


Huh, I was half-way to creating something in the same vein, while trying to use Git as the "world-state synchronization protocol" for a distributed MOO. Back to the (much-simplified) drawing-board :)


I don't think it's intended to be used for frequently changed models like 'user'. It is much better suited to document-like models that change infrequently. For example, the 'content' model in a blog or CMS.

Having history might not be a big deal, but having branches IS!

It is great for stuff like testing some new set of pages on a staging server and then 'adding' them to the currently running production server which kept changing while you were working on the stage. Try that with SQL!


Yes, it would be a lot more appropriate for things that don't change that frequently. I actually started writing a blogging app for coders, and then succumbed to a bout of extraction distraction by generalizing the part that read the pages/posts from a Git repo: https://github.com/pauldowman/balisong


I also thought about a similar project. A Git-based active model has some drawbacks and is certainly not for every project. It's not so good if there are many users making a lot of changes to data (like Facebook). But it's very interesting for applications where a few users make changes and the site looks static to the rest of the visitors (any CMS, for example). And we get a lot of cool Git features for free.


Also of interest and in a similar sort of spirit:

Bup: Git based backups - https://github.com/apenwarr/bup



Bup looks awesome. All the benefits of rsync plus the ability to do incremental backups.

"Even when a backup is incremental, you don't have to worry about restoring the full backup, then each of the incrementals in turn; an incremental backup acts as if it's a full backup, it just takes less disk space"

.... nice!


Well, sure, git provides an append/log-based distributed hash store. It's probably not what you're looking for, though: It doesn't have a library for efficient in-process access, so you need to spawn a git shell command per operation, with somewhat opinionated semantics. It's also GPL'd.

It's good for prototyping, though. (And I hear it's good at managing changes for source code!)


"It doesn't have a library for efficient in-process access"

https://github.com/mojombo/grit



Yep, GitModel uses Grit.


Ah, via Ruby. RUBY?!??! Ah. Delightful.

YECCH

So, uh, can't we just make an efficient, well-designed, C library for accessing those data structures, write BSD-licensed wrappers as consenting adults, and get on with our lives? I'm speechless.

Is writing an append-only log data-structure system really that hard? Ok, then.


We use Grit extensively all over GitHub. It works quite well. I've yet to have a seizure from it.

At the same time, we're also funding work on libgit2, which is an efficient, well-designed C library for accessing those data structures. There's also wrappers for consenting adults. http://libgit2.github.com


Thanks for your informative response. Are you interested in a Lua wrapper?


I would love to see this happening. I could then port gitmodel on top of your lib and get a NoSQL DB, easy to sync with a single user desktop app for the iOS platform (in conjunction with Wax).


I may get to it this weekend. I'll keep you posted.



What's stopping you?


Git itself works fine, for my purposes. If I wrote a C library for managing git's data structures, it would basically be out of spite. I may make a Lua wrapper for libgit2, though - it's more what I had in mind (though it's too bad it's GPL'd).

I have a library for managing distributed graphs on top of an append-only log file, but it's got different design trade-offs than git does - It's for a specific C/Lua project, and the semantics of the data don't match git's.


libgit2 is GPLed, but importantly also has a libgcc-style exception.


This could be interesting for projects where keeping revisions would be handy, for instance a note-taking or document platform.


Yes - how about Tomboy/gnote integration?


Have you done any performance tests? How well does Git scale in general when you have billions of files and versions?

This gives me some ideas, good stuff.


I have seen huge Git deployments run into problems, which is why Git is less than perfect as a system-wide backup tool.


On a somewhat related note, see also Venti: http://doc.cat-v.org/plan_9/4th_edition/papers/venti/



