ToroDB – Document-oriented JSON database on top of PostgreSQL (github.com/torodb)
166 points by davidgl on Oct 29, 2014 | 49 comments



Another project worth mentioning is mongolike. It uses PLV8 to implement the MongoDB interface on top of Postgres.

https://github.com/JerrySievert/mongolike


And don't forget Mongres: https://github.com/umitanuki/mongres


This is really cool, thanks for pointing it out... I'd personally prefer that some of these implementations allowed for both MongoDB wire protocol compatibility and a similar interface within PLV8... that would be awesome.


These projects are so cool! I wish they would all join hands to build a single Mongo emulation layer on Postgres, with drop-in compatibility with MongoMapper.


There is now more detailed information on the project's wiki (https://github.com/torodb/torodb/wiki/How-ToroDB-stores-json...) about how ToroDB stores JSON documents in tables. Hope this information helps :)


As someone who has been coding web apps for a very long time, it feels odd to say this, but the PostgreSQL community is doing some very interesting things. I'm excited to see where things go.


Genuinely curious...why does it feel odd to say that?


I second that. Postgres has been doing interesting things for years now.


When I first started web development (mid/late 90's), Postgres was the very conservative open source choice. MySQL was fast, good enough, and broke a few big rules.

That hasn't been the case for a while. With the speed improvements of the past few years, prompted by NoSQL, and with MySQL's troubles under Oracle, Postgres has become the open source SQL frontrunner, and everybody's taking it in really interesting directions.


I'm hoping the JSON and PLV8 support becomes much more of an out-of-the-box feature, and that replication setup becomes more baked in as well.

Right now, I'm migrating from MS-SQL to a MongoDB replica set for most of our core data, mainly because the failover works better than most and the licensing costs for MS-SQL with replication and failover are budget-blowing. Most of the PostgreSQL options for replication are really bolted on, with some serious drawbacks, and automatic failover to a new master is another issue.

I've always liked PostgreSQL (and Firebird, for what that's worth)... I do hope that some of these features become more of a checkbox item during installation, and less of a bolted-on affair where you have to compile and dive into the deep end to get them, and even then end up with something half-baked.


I also started development in the 90s. Postgres became an interesting choice in the mid-2000s once (someone can correct me) the organization responsible for stewarding it reorganized and opened up more to outsiders.

If I remember correctly, PostgreSQL had stored procedures way before MySQL, AND you could write them in Python and several other languages.
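For what it's worth, PL/Python is still around. A trivial sketch (assuming the plpythonu extension is available; the exact language name has varied across versions):

    CREATE EXTENSION plpythonu;

    -- A stored function written in Python instead of plpgsql
    CREATE FUNCTION pymax(a integer, b integer) RETURNS integer AS $$
        return max(a, b)
    $$ LANGUAGE plpythonu;

    SELECT pymax(2, 3);  -- returns 3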


"JSON documents are stored relationally, not as a blob/jsonb."

How is the transformation designed to go from a structured document to a flat set of tables, akin to what an object-relational mapper would do?


The transformation is performed at the ToroDB layer. Each document is analyzed and several steps are performed:

- The document is received in BSON format (as per the MongoDB wire protocol) and transformed into a KVDocument. KVDocument is an internal implementation, an abstraction over the concrete representation of a JSON document (i.e., hierarchical, nested sets of key-value pairs).

- KVDocuments are then split by level into so-called subdocuments.

- Each subdocument is further split into a subdocument type and the data. The subdocument type is basically an ordered set of the data types of that subdocument.

- Subdocuments are matched 1:1 to tables. If there is an existing table for the given subdocument type, the document is stored directly there. If there isn't, a new table with that type is created on the fly. This means that there is also a 1:1 mapping between attribute names (columns) and key names, which makes it very readable from a SQL user's perspective.

- There is a table called structure that is basically a representation of the JSON object but without the (scalar) data. Think of the JSON object with only the braces and square brackets, plus all the keys (or entries in arrays) that contain objects. There is, per level, a key in this structure that contains the name of the table where the data for that object is stored. This table uses a jsonb field to store the structure, but note that there's no actual data in this jsonb field.

- Finally, there's a root table that matches each document with its structure. This is needed because structures are frequently reused across many documents, and it is one of the biggest factors in significantly reducing the storage required compared to, for example, MongoDB, as the "common" information of each "type of document" is stored only once. (See the sketch below.)
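To make that concrete, here is a rough sketch of the kind of layout this produces. All table and column names below are illustrative guesses, not ToroDB's actual identifiers; run ToroDB to see the real schema. Suppose we insert two documents that share the same shape:

    -- {"name": "alice", "address": {"city": "Madrid", "zip": "28001"}}
    -- {"name": "bob",   "address": {"city": "Paris",  "zip": "75001"}}

    -- One table per subdocument type (hypothetical names):
    CREATE TABLE t_1 (did integer, name varchar);              -- root level
    CREATE TABLE t_2 (did integer, city varchar, zip varchar); -- "address" level

    -- The structure table stores each document shape once, without scalar
    -- data. The jsonb holds only keys and table references, never values:
    CREATE TABLE structures (
        sid        integer PRIMARY KEY,
        _structure jsonb   -- e.g. {"t": 1, "address": {"t": 2}}
    );

    -- The root table matches each document to its (shared) structure:
    CREATE TABLE root (
        did integer PRIMARY KEY,
        sid integer REFERENCES structures (sid)
    );

    -- Both documents point at the same structure row, so the shape and key
    -- metadata are stored once; only the scalar values repeat per document:
    --   t_1: (1, 'alice'), (2, 'bob')
    --   t_2: (1, 'Madrid', '28001'), (2, 'Paris', '75001')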

This information and more will be added to the project's wiki shortly. In the meantime, it's very easy to see for yourself if you run ToroDB and look at the created tables :)

Note: I'm one of the authors of ToroDB


Can you briefly explain the advantage of tearing apart the JSON document, rather than just storing it as JSONB and using the various JSONB functions/operators/indexes?


He did explain that - it can yield significant space savings where there is a common "type of document" because the field names do not need to be stored in each document instance, the way they would have to be in JSONB. Also, it means you can query these tables with normal Postgres query tools and they will actually make sense.


This is a pretty interesting approach, and it doesn't seem to be tied to any particular database implementation.

Would it be possible to factor this up into a JSON -> SQL translation function, which could then be used by various backends (effectively consume the JSON and spit out the CREATE TABLE and INSERT statements)?


Indeed.

It's more than a JSON to SQL translation, but it could definitely use various backends. It has some plpgsql Postgres code and some data types to speed up operations (saving some round trips to the database), but it wouldn't be hard to port it to other backends :)
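To give a feel for the idea, here is a minimal one-level shredding sketch in plpgsql. It is not ToroDB's actual code; every name in it is made up, it handles only top-level scalar fields, and it maps everything to text for simplicity:

    CREATE OR REPLACE FUNCTION shred_one_level(doc jsonb) RETURNS void AS $$
    DECLARE
        sig text;
    BEGIN
        -- Type signature: ordered "key:type" pairs of the scalar fields,
        -- e.g. 'age:number,name:string'. Each distinct signature gets a table.
        SELECT string_agg(key || ':' || jsonb_typeof(value), ',' ORDER BY key)
          INTO sig
          FROM jsonb_each(doc)
         WHERE jsonb_typeof(value) NOT IN ('object', 'array');

        -- Create the table for this signature on first sight.
        EXECUTE format('CREATE TABLE IF NOT EXISTS doc_%s (%s)', md5(sig),
            (SELECT string_agg(format('%I text', key), ', ' ORDER BY key)
               FROM jsonb_each(doc)
              WHERE jsonb_typeof(value) NOT IN ('object', 'array')));

        -- Insert the scalar values, in the same key order as the columns.
        EXECUTE format('INSERT INTO doc_%s SELECT %s', md5(sig),
            (SELECT string_agg(quote_literal(value #>> '{}'), ', ' ORDER BY key)
               FROM jsonb_each(doc)
              WHERE jsonb_typeof(value) NOT IN ('object', 'array')));
    END;
    $$ LANGUAGE plpgsql;

    -- SELECT shred_one_level('{"name": "alice", "age": 30}');
    -- creates (if needed) and fills a doc_* table with columns age, name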


Is it possible to quantify the savings of storage and IO?


There are some benchmarks in the presentation where ToroDB was introduced at the PostgreSQL Europe Conference (Madrid, 2014, 2014.pgconf.eu): http://www.slideshare.net/8kdata/toro-db-pgconfeu2014


I'd be very interested to know more about this too. There is considerable overhead for each row in PostgreSQL (24 byte header, which must be aligned). Obviously you'll save a bit if there are repeated keys higher up, but you'll pay for the foreign key references to those rows in the children.
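Back-of-envelope, with made-up numbers: a document with 20 keys averaging 10 characters each carries roughly 200 bytes of repeated key text in every jsonb copy, while splitting it across, say, three subdocument tables costs about 3 x 24 = 72 bytes of row headers plus a few integer ids per document. Whether you come out ahead seems to depend heavily on key-name length, nesting depth, and how widely structures are shared.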


It is true that PostgreSQL has a high per-row overhead, but the storage savings are significant anyway. Please see the presentation pointed out earlier.


Can this be made to use HStore?


This is my question. How is the data actually broken down into tables/columns? I can't find a schema anywhere in the source, but maybe I just don't know where to look.

Also the code talks an awful lot about DB cursors, which indicates that this is not really taking advantage of either SQL or the relational model at all.


You might find Revenj (https://github.com/ngs-doo/revenj) interesting too. It maps a DSL schema to various Postgres objects and offers a 1:1 mapping between views in the DB and incoming/outgoing JSON.


zrail, it's difficult to find the schema and/or tables in the source code because all of them are 100% dynamic. See my comments above about the internal workings of ToroDB, or give it a try: run it and check how the data is laid out!

Regarding the use of cursors, they are absolutely necessary: MongoDB has the concept of a session, and queries may ask in subsequent packets for the next batch of results. However, I don't see how this prevents taking advantage of the relational model. ToroDB definitely does that, as you can see if you look at the schema of the created tables.
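Roughly speaking, each MongoDB client cursor can be backed by a server-side PostgreSQL cursor (illustrative SQL only, reusing the hypothetical table names from the storage sketch above):

    BEGIN;
    DECLARE mongo_cursor_42 CURSOR FOR
        SELECT did FROM root WHERE sid = 1;
    FETCH 100 FROM mongo_cursor_42;  -- first batch returned to the client
    FETCH 100 FROM mongo_cursor_42;  -- a subsequent getMore request
    CLOSE mongo_cursor_42;
    COMMIT;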


I would also be very interested in how this kind of breakdown occurs. It is not uncommon for me to have to store arbitrarily structured data, and if there is a better way than a blob, one that can be abstracted away by an app, that would be quite useful on its own, even without the MongoDB protocol.


As far as I understand it, shredding hierarchical documents into a relational store is an optional feature of Oracle XMLDB:

http://www.oracle.com/technetwork/database-features/xmldb/xm...

It is called XMLDB structured storage (vs. binary storage, which actually stores the hierarchical XML natively).

XMLDB structured storage has, I think, been available since 2003 (but I might be wrong there).

Oracle's XMLDB structured storage requires an XSD declaration. Within that declaration, XMLDB can use 'hints' to indicate whether a given set of attributes across documents is related. If so, it will shred the docs in such a way that the related attributes are co-located.

The advantages, as the author of ToroDB noted below, are (a) space savings and (b) the ability to use relational joins with disk-optimized access strategies.

The disadvantages (at least with Oracle's XMLDB structured option) are the need to declare the model ahead of time, and the joins required for deeply/complexly structured documents.

It looks like ToroDB 'senses' the model of each document on the way in. I think there are definitely use cases for this approach that tolerate the trade-off between ingestion speed and storage control. Plus, using a relational engine underneath allows for ACID properties (e.g. multi-object rollbacks/commits), which native Mongo does not provide.


platform, very interesting XMLDB info. As you point out, ToroDB does not require any model declaration ahead of time. It's perfectly "schema-less", as it 'senses' the model of each document.

Regarding ingestion speed, it is very high, even compared to MongoDB's. There are some benchmarks in this presentation: http://www.slideshare.net/8kdata/toro-db-pgconfeu2014


I don't really see why ToroDB can't be implemented on top of Oracle :) There could be some demand for it there.

I needed this for a side project at work a while back, but didn't get the time to implement it.

Oracle's cumbersome enough already; I for one have no desire to drag XSDs into it to use XMLDB.


I'm not saying it's not gonna happen...


Cool concept. I'm not a huge fan of normalizing out the underlying data into pseudorelational tables.

Reconstructing trees in Postgres is a little gnarly on larger tables but it's definitely neat to see more people trying to combine the user experience of Mongo with the technology advantages of Postgres.

What inspired the creation of ToroDB? Are you using it in production yourselves in a limited way?


Thanks!

Documents are split into chunks before hitting the database, so there is no need to reconstruct trees in PostgreSQL. Indeed, many queries don't need to reconstruct the (whole) tree, just a part of it or even a single level. In doing so, ToroDB is able to answer queries by scanning only a small subset of the database, rather than the whole thing.
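For example, with the hypothetical table names from the storage sketch above, a query on a nested field touches only the table for that level:

    -- Find documents whose address.city is 'Madrid': this scans just t_2,
    -- not every document in the collection.
    SELECT did FROM t_2 WHERE city = 'Madrid';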

ToroDB was inspired by the DRY principle: relational databases like PostgreSQL are already good enough that, with some tweaking, they can work perfectly well as a document store.


Very interesting approach. One question I have is around the licensing. I see that it's licensed using AGPLv3. I would think that would require any code that connects to it to be open source. Am I reading that right? If so, then I am guessing that there will be a commercial offering w/o the restrictions?

Thanks


Yes and yes :)


Wait. Does AGPL really require my Python code to be released as open source if I want to connect to your server? As I understand AGPL, if you modify the code, you must make it available to people connecting over the network; that's all.


Errr yes, you're right, my "yes and yes" was too quick. Sorry for that!


I'd really like to write a similar, mongo-compatible layer for FoundationDB.

This is great work.


Super interesting, though I would be interested in how it works. Is it using JSON/hstore underneath?

Right now, reading the code seems to be the only way to find out.


No, it's not using json/jsonb/hstore to store the actual JSON data. It uses a bit of jsonb, but only for a side part of the information (the structure).

And this is the true power of it: data is stored in normal, relational tables. Please see the comment above explaining this in more detail :)


Does all data have to be stored relationally or can you pick and choose?

The reason I ask: take, for example, a game session blob from a multiplayer game. I might not want that merged with the relational data, and I'd just throw it away after a certain amount of time (days/weeks), so blobs are nice for that, while other data would be better off relational.

Yes, there are app/cache-level ways around this, but it might be cool if there were a way to choose between automatic relational storage and blobs. I guess ultimately you could just have multiple stores for live and archival data, but it's something to think about.


Data is laid out transparently to the user; indeed, there should be no (visible) difference as to how it is stored. Data within a single "level" is stored in a single table. If you want to keep blobs, you can definitely do that; they will be stored as such (bytea in Postgres).


So can I use it with meteor?


I would be surprised if ToroDB implemented the oplog which Meteor uses for efficient live-updating, but I guess it _should_ work with the polling mode.

SQL support (i.e. direct Postgres support) is on the Meteor roadmap for post-1.0.


ToroDB is definitely going to have MongoDB's oplog, so it should work as-is for live-updating.

SQL support for ToroDB... there's surely room for it in the future ;)


It hasn't been tested, but as soon as the full Mongo protocol is supported, it should work exactly the same as Mongo does. Give it a try and let us know! :)


To the author: you should probably look into the VODKA index.


VODKA indexes are still experimental and exist only as a branch in Alexander's own repo (https://github.com/akorotkov/postgres/tree/vodka). I do not think they are anywhere near production-ready, and even if they were, they are not an extension.

Alexander is working on allowing new index access methods (the type of index: e.g. hash, btree, gin, gist) to be added in extensions.


Yep. And that's why I believe it's a great way to test/improve VODKA.

PS: now I'm not sure whether it applies, considering that the data is actually normalized inside Postgres.


A really interesting PDF of slides from the PGCon talk about this is here:

http://www.pgcon.org/2014/schedule/events/696.en.html




