Show HN: Dotmesh – A git-like CLI for application states

ntumlin · on Feb 7, 2018

It might be better to just call it version control for application states, rather than saying a "git-like CLI" for application states. When I hear "git-like CLI" I interpret that as "hard to use" and "confusing"

lewq · on Feb 7, 2018

Interesting feedback, thanks. I agree that the git CLI is confusing, unfortunately it's the best thing we've got that a huge number of developers are familiar with for exploring a state tree with branches and commits.

If you're interested, we've described some important ways that dotmesh is _different_ to git here: https://docs.dotmesh.com/faq/#how-does-dotmesh-differ-from-g...

ahnick · on Feb 7, 2018

This is really exciting to see and something I feel has been missing for some time. I think if this is grown correctly it would be a great acquisition for Docker to make.

It always struck me that I should be able to "docker push" my data and share that with my team just as I do my apps. In fact, I had built a quick hack to do something similar called Dockershare (https://github.com/ahnick/dockershare). I realized through that effort that a custom docker volume plugin would be needed and that it was a much larger problem than what I had time to tackle.

I imagine that dotmesh must have grown out of what was being done with dvol? (https://github.com/clusterhq/dvol) In any case, kudos for getting this built. I'm excited to try it out.

binocarlos · on Feb 7, 2018

Thanks for the feedback - "docker push <mydata>" is nice :-)

For the same reasons as you mention in your post, we've been hard at work with Kubernetes support - Persistent Volumes but with extra features!

toomim · on Feb 8, 2018

If this is like git, does that mean that you can merge two different branches of data?

For instance, if I have two runs of a test, that produce different outputs, can I merge the data back together at the end?

If not, then this is only capturing one aspect of git -- the archiving of snapshots of the state of the data.

deitcher · on Feb 8, 2018

The primary focus is, indeed, archiving (and replaying and sharing, with ease) states of data. Merging has been explored, but nothing concrete on it yet. Even something relatively simple, like merging code, often requires a human to resolve (sometimes with real effort). Imagine trying to do that with filesystem snapshots of database files.

We are exploring it. We have some thoughts on higher level understanding of data that might make it possible.

But definitely starting with the basics, as you said.

lewq · on Feb 8, 2018

How would you want merge to work?

hiccuphippo · on Feb 8, 2018

Show the different rows or documents one besides the other and let the user choose which to put in the merged table. It would need to change foreign keys to match the id in case it needs to change. But dothub seems to work at the filesystem level so this seems impossible for it.

infinitone · on Feb 7, 2018

It seems like a good idea in theory but i'm not so sure it'd work in many environments in practice. If i understand it correctly, you're storing all the state such as files but there is state that is tied to that specific machine (ie. machine fqdn, machine-specifc filepaths) and you wouldn't want to apply that state on another machine. I guess you could do some data wrangling and .stateignore that stuff but it would require quite the effort on a large application that spans many components and many teams.

On a very small app, i can see the utility of dotmesh.

lewq · on Feb 7, 2018

Hey! Yes, it's hard to capture the state of VMs.

This is where the Docker and Kubernetes integration comes into play -- if your app is captured entirely in Kubernetes manifests, the only thing left to capture (apart from the declarative Kube manifests, which should already be in version control) is the state that exists in Kubernetes Persistent Volumes. Dotmesh provides a Kubernetes Persistent Volume driver which provides Dotmesh StorageClass PVs and a Dynamic Provisioner, meaning that you really can capture the entire state of your app with Dotmesh... as long as you're deploying it with Kubernetes.

Code and infrastructure are already under control thanks to version control and terraform, ansible etc -- this completes the picture.

Give it a go: https://dotmesh.com/try-dotmesh/ and please leave more feedback here or in our Slack! (linked to in the footer at the bottom of dotmesh.com)

ioquatix · on Feb 7, 2018

I store all my application state directly in git: https://github.com/ioquatix/relaxo

Makes rolling back mistakes easy.

lewq · on Feb 7, 2018

Nice! (Non snarky question) does that scale?

ioquatix · on Feb 7, 2018

That's a good question. Relaxo is a database designed around immutable, transactional structures where convenience is more important than scale. Think of things like comments on a blog, items for sale in a small shop - https://github.com/ioquatix/financier is an example of an actual project which is in production.

Some things which I personally find useful about Relaxo:

- Easy to move data around, merge and fork data (it's just a git repository).

- Easy to roll back or inspect changes. If you make a mistake, just reset HEAD.

- Easy to backup (guaranteed consistency on disk).

- Better grouping of changes by transactions, which have a description, date, and information about who committed it (can even tie to currently logged in user for a web app, for example).

In theory Relaxo could scale up. Using libgit2 as the backend, it wouldn't be hard to use redis as an object store for git. The git data structure on disk is really just a key-value store with some specific data structures.

The main issue with Relaxo is query performance and indexes. Simple queries like fetching a document is fast. Complex queries including subsets, aggregations, and joins require supporting indexes to work efficiently, and this is something that is hard to build into a pure document storage system. The naive solution is to load all the documents and filter them, which is actually fine until you get a large number of documents (e.g. 1,000+).

However, git does provide one useful guarantee - it will sort directory entries. With this in mind, it's possible to make radix-sorted indexes (e.g. /invoices/by_date/2017/07/). You can use this to do basic indexes, but it's still not as good as a traditional SQL database in this regard.

deitcher · on Feb 8, 2018

I have seen a growth of such "vcs-like" databases, but I think the preponderance remains SQL stores like MySQL/MSSQL/Postegres or NoSQL like Mongo/Cassandra/Redis/Couch/etc. For those - or anything that has its own model of storage or processing and, in the end, is backed by filesystem-type storage, dotmesh provides a really nice solution.

I haven't used Relaxo itself, but personally, I like the fact that independent groups are thinking of version control semantics for data. Tells me it is heading in a positive direction.

ioquatix · on Feb 8, 2018

Relaxo actually grew out of Couch DB.

Relaxo used to be a couch query server (https://github.com/ioquatix/relaxo-query-server - not so useful any more) and ruby front end (https://github.com/ioquatix/relaxo-model - still useful). But I got frustrated with the direction of couchdb 2.x so I rewrote it to do everything in-process and use git as the document store. It organically grew from that.

Unless you are operating at scale, doing things in-process is vastly more convenient. Sending ruby code to the query server to perform map-reduce was a cumbersome process at best. It's easier just to write model code and have it work as expected.

Systems like Postgres a great when you have a single database and multiple front-end consumers though. You'd need to put a front-end on top of relaxo in order to gain the same benefits, but it would be pretty trivial to do so - just that its never been something that I've needed to implement. The API you'd actually want is one that interfaces directly with your Ruby model instances, rather than database tables and rows. I think there is room for improvement here - probably implementing a websocket API that exposes the raw git object model and then allowing consumers to work on top of that.

deitcher · on Feb 8, 2018

Pretty cool. Is there a write up on architecture and usage models? I’d like to see it.

I was a happy couch 1.x user, but moved away with 2.0. Nothing specific about it, just needs and timing.

ioquatix · on Feb 9, 2018

Thanks for being so interested.

The architecture is super simple, I'd suggest that the first place to look is the source code.

There are really only two ways of accessing the underlying data store - a read-only dataset and a read/write changeset which can be committed.

It's purely a key-value storage at the core - a key being a path and a value being whatever you want.

On top of that you can build more complex things, e.g. https://github.com/ioquatix/relaxo-model which provides relational object storage and basic indexes (e.g. has one, has many, etc)

thedirt0115 · on Feb 7, 2018

From the readme: "Relaxo is designed to scale to the hundreds of thousands of documents. It's designed around the git persistent data store, and therefore has some performance and concurrency limitations due to the underlying implementation... Relaxo can do anywhere from 1000-10,000 inserts per second depending on how you structure the workload."

dantiberian · on Feb 8, 2018

This looks really cool! A few thoughts:

1. The first thing I thought when I saw this is "How is this secure?". You're wanting to store the most sensitive information a business has - credentials + production DB. I took a look around the site + Google and couldn't find anything about security. Client side encryption of data seems like it would be good to make people comfortable with storing their data at dothub. I'm not sure if there is any use case for dothub having unencrypted data (at least not yet)?

2. "Application states" is quite a vague term, when I saw that I thought it was referring to capturing the state of a running process. "A git-like CLI for application states" is not a very compelling pitch. As others have noted, for all but the most masochistic of users, "a git-like CLI" is a negative point.

The benefit you're offering is "Snapshot production data to be able to replay in development" and "Snapshot failed CI builds to debug later". I'd recommend putting those up-front and in bold. A more compelling tagline (to me) is "Dotmesh - version control and snapshots for your production data".

lewq · on Feb 8, 2018

Thank you for the great feedback!

1. We mention encryption in the docs FAQ https://docs.dotmesh.com/faq/#what-do-you-encrypt -- where did you look for it? Maybe we can make it easier to find. Noted about this being a priority.

2. Thanks for proposing the updated tagline! I'll run it past the team ;-) we'll certainly develop more messaging and use cases around production data as we develop the project beyond 0.1 :-)

dantiberian · on Feb 8, 2018

> where did you look for it? Maybe we can make it easier to find. Noted about this being a priority.

I searched on https://dotmesh.com for "security" and "encryption", searched Google for "site:dotmesh.com security", and tried going to http://dotmesh.com/security, but got nothing for all three.

ferrantim · on Feb 7, 2018

Congrats Luke and team. I'm curious, what did you learn at ClusterHQ with Flocker that made you want to start dotmesh?

lewq · on Feb 7, 2018

ClusterHQ was a fantastic learning experience. I'm proud of what we achieved and the many strong relationships that were built in the team.

Ultimately the reason that ClusterHQ failed, I think, was that we believed we had product-market fit before we really did, and we started scaling too soon.

When we started, it wasn't possible to connect storage to containers at all, and so we had to put a lot of work into making that possible. And by the time we'd got Flocker working reliably across AWS, GCE, OpenStack & a dozen or so storage vendors, we'd been commoditized by Kubernetes.

Our premature scaling then made it harder to adapt as fast as we needed to. Many lessons learned!

We're focusing on a rigorous approach to finding product-market fit, my colleague Alice has written more about this here: https://dotmesh.com/blog/dotmesh-hypotheses/

lewq · on Feb 8, 2018

Try dotmesh here: https://dotmesh.com/try-dotmesh/

Learn about architecture, use cases (tutorials) & lots more: https://docs.dotmesh.com

grkvlt · on Feb 8, 2018

Looks useful for QA testing of distributed systems. I can also see a use case where I snapshot the state of one container from a node in a cluster then pull it onto the next node as it starts up before joining. It could maybe make things converge quicker in blockchain applications as well, where each new node needs to get a copy of the entire chain before it can do useful work?

zdkaster · on Feb 8, 2018

This sounds really cool way to manage the lifecycle of software. Will try it out. Though my first experience after trying the live hosted tutorial at https://dotmesh.com/try-dotmesh/

"$ dm cluster init dm: command not found"

lewq · on Feb 8, 2018

Did you run the curl command that's the first item in the tutorial?

zdkaster · on Feb 10, 2018

No, I didn't, sorry. After dm installed and set up, it seems working good. Great work, thanks. I wasn't aware that it's required installation process in Katacoda, just followed the Deploying Dotmesh to Docker Step 1.

simplify · on Feb 7, 2018

Would this concept allow users to share subsets of data between each other? assuming they had their own nodes.

binocarlos · on Feb 7, 2018

Thats the idea yes! you can run dotmesh on your own servers and install locally on each users machine to then push and pull just like git remotes. It's using copy-on-write so you are only pushing the difference. Another main use case is for CI to consume volumes, run tests then snapshot the results.

We have a hosted service if you don't want to run your own nodes (https://dothub.com) but the server and client are both open source. disclaimer: I work on the project

ivan_ah · on Feb 8, 2018

This is very interesting. Are there any tools provided for "DB diffs"? e.g. show exactly which rows are different between two snapshots?

It seems like dot* would have to know about the application logic to show useful diffs, but maybe it can be done generically at the DB level.

lewq · on Feb 8, 2018

I love your thinking, Ivan!

We have an issue for this here: https://github.com/dotmesh-io/dotmesh/issues/85

sjellis · on Feb 8, 2018

Awesome. I literally started writing a small tool for managing state yesterday, because we really do need smarter ways to move application datasets around.

lewq · on Feb 8, 2018

Interesting! We should compare notes! Join our slack (in website footer) and we can chat? Or I'm @lmarsden on Twitter :-)

nojvek · on Feb 8, 2018

I read through. Still have no idea why I would use it.

raoulj · on Feb 7, 2018

Maybe it's just me, but while I find the graphics informative on the landing page, I wonder if they could be made to be more easily understandable.

lewq · on Feb 7, 2018

Hey, thanks for the feedback!

Does this help? https://docs.dotmesh.com/concepts/architecture/

deitcher · on Feb 8, 2018

Would you mind sharing more details about how they are confusing? Always happy to take feedback. Feel free to comment here or on the community Slack, although a GitHub issue may be the best place. Whatever works... and much appreciated.