ToroDB – Document-oriented JSON database on top of PostgreSQL (github.com/torodb)
166 points by davidgl on Oct 29, 2014 | 49 comments



Another project worth mentioning is mongolike. It uses PLV8 to implement the MongoDB interface on top of Postgres.

https://github.com/JerrySievert/mongolike


And don't forget Mongres: https://github.com/umitanuki/mongres


This is really cool, thanks for pointing it out... I'd personally prefer that some of these implementations allowed for both MongoDB wire protocol compatibility and a similar interface within PLV8... that would be awesome.


These projects are so cool! I wish they would all join hands to build a single Mongo emulation layer on Postgres, with drop-in compatibility with MongoMapper.


There is now more detailed information on the project's wiki (https://github.com/torodb/torodb/wiki/How-ToroDB-stores-json...) about how ToroDB stores JSON documents in tables. Hope this information helps :)


As someone who has been coding web apps for a very long time, it feels odd to say this, but the PostgreSQL community is doing some very interesting things. I'm excited to see where things go.


Genuinely curious...why does it feel odd to say that?


I second that. Postgres has been doing interesting things for years now.


When I first started web development (mid/late 90's), Postgres was the very conservative open source choice. MySQL was fast, good enough, and broke a few big rules.

That hasn't been the case for a while. With the speed improvements of the past few years, prompted by NoSQL, and with MySQL's troubles under Oracle, Postgres has become the open source SQL frontrunner, and everybody's taking it in really interesting directions.


I'm hoping the JSON and PLV8 support becomes much more of an out-of-the-box feature, and that replication setup becomes more baked in as well.

Right now, I'm migrating from MS-SQL to a MongoDB replica set for most of our core data, mainly because the failover works better than most and the licensing costs for MS-SQL with replication and failover are budget-blowing. Most of the PostgreSQL options for replication are really bolted on, with some serious drawbacks, and automatic failover to a new master is another issue.

I've always liked PostgreSQL (and Firebird, for what that's worth)... I do hope that some of these features become more of a checkbox item during installation, and less of a bolted-on affair where you have to compile and dive into the deep end to get them, and even then end up with something half-baked.


I also started development in the 90s. Postgres became an interesting choice in the mid-2000s once (someone can correct me) the organization responsible for stewarding it reorganized and opened up more to outsiders.

If I remember correctly, PostgreSQL had stored procedures way before MySQL, AND you could write them in Python and several other languages.
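For what it's worth, PL/Python is still around. A trivial sketch (assuming the plpythonu extension is available; the exact language name has varied across versions):

    CREATE EXTENSION plpythonu;

    -- A stored function written in Python instead of plpgsql
    CREATE FUNCTION pymax(a integer, b integer) RETURNS integer AS $$
        return max(a, b)
    $$ LANGUAGE plpythonu;

    SELECT pymax(2, 3);  -- returns 3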


"JSON documents are stored relationally, not as a blob/jsonb."

How is the transformation designed to go from a structured document to a flat set of tables, akin to what an object-relational mapper would do?


The transformation is performed at the ToroDB layer. Each document is analyzed and several steps are performed:

- The document is received in BSON format (as per the MongoDB wire protocol) and transformed into a KVDocument. KVDocument is an internal implementation, an abstraction over the concrete representation of a JSON document (i.e., hierarchical, nested sets of key-value pairs).

- KVDocuments are then split by level into so-called subdocuments.

- Each subdocument is further split into a subdocument type and the data. The subdocument type is basically an ordered set of the data types of that subdocument.

- Subdocuments are matched 1:1 to tables. If there is an existing table for the given subdocument type, the document is stored directly there. If there isn't, a new table with that type is created on the fly. This means that there is also a 1:1 mapping between attribute names (columns) and key names, which makes it very readable from a SQL user's perspective.

- There is a table called structure that is basically a representation of the JSON object but without the (scalar) data. Think of the JSON object with only the braces and square brackets, plus all the keys (or entries in arrays) that contain objects. There is, per level, a key in this structure that contains the name of the table where the data for that object is stored. This table uses a jsonb field to store the structure, but note that there's no actual data in this jsonb field.

- Finally, there's a root table that matches each document with its structure. This is needed because structures are frequently reused across many documents, and it is one of the biggest factors in significantly reducing the storage required compared to, for example, MongoDB, as the "common" information of each "type of document" is stored only once. (See the sketch below.)
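To make that concrete, here is a rough sketch of the kind of layout this produces. All table and column names below are illustrative guesses, not ToroDB's actual identifiers; run ToroDB to see the real schema. Suppose we insert two documents that share the same shape:

    -- {"name": "alice", "address": {"city": "Madrid", "zip": "28001"}}
    -- {"name": "bob",   "address": {"city": "Paris",  "zip": "75001"}}

    -- One table per subdocument type (hypothetical names):
    CREATE TABLE t_1 (did integer, name varchar);              -- root level
    CREATE TABLE t_2 (did integer, city varchar, zip varchar); -- "address" level

    -- The structure table stores each document shape once, without scalar
    -- data. The jsonb holds only keys and table references, never values:
    CREATE TABLE structures (
        sid        integer PRIMARY KEY,
        _structure jsonb   -- e.g. {"t": 1, "address": {"t": 2}}
    );

    -- The root table matches each document to its (shared) structure:
    CREATE TABLE root (
        did integer PRIMARY KEY,
        sid integer REFERENCES structures (sid)
    );

    -- Both documents point at the same structure row, so the shape and key
    -- metadata are stored once; only the scalar values repeat per document:
    --   t_1: (1, 'alice'), (2, 'bob')
    --   t_2: (1, 'Madrid', '28001'), (2, 'Paris', '75001')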

This information and more will be added to the project's wiki shortly. In the meantime, it's very easy to see for yourself if you run ToroDB and look at the created tables :)

Note: I'm one of the authors of ToroDB


Can you briefly explain the advantage of tearing apart the JSON document, rather than just storing it as JSONB and using the various JSONB functions/operators/indexes?


He did explain that - it can yield significant space savings where there is a common "type of document" because the field names do not need to be stored in each document instance, the way they would have to be in JSONB. Also, it means you can query these tables with normal Postgres query tools and they will actually make sense.


This is a pretty interesting approach, and it doesn't seem to be tied to any particular database implementation.

Would it be possible to factor this up into a JSON -> SQL translation function, which could then be used by various backends (effectively consume the JSON and spit out the CREATE TABLE and INSERT statements)?


Indeed.

It's more than a JSON to SQL translation, but it could definitely use various backends. It has some plpgsql Postgres code and some data types to speed up operations (saving some round trips to the database), but it wouldn't be hard to port it to other backends :)
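To give a feel for the idea, here is a minimal one-level shredding sketch in plpgsql. It is not ToroDB's actual code; every name in it is made up, it handles only top-level scalar fields, and it maps everything to text for simplicity:

    CREATE OR REPLACE FUNCTION shred_one_level(doc jsonb) RETURNS void AS $$
    DECLARE
        sig text;
    BEGIN
        -- Type signature: ordered "key:type" pairs of the scalar fields,
        -- e.g. 'age:number,name:string'. Each distinct signature gets a table.
        SELECT string_agg(key || ':' || jsonb_typeof(value), ',' ORDER BY key)
          INTO sig
          FROM jsonb_each(doc)
         WHERE jsonb_typeof(value) NOT IN ('object', 'array');

        -- Create the table for this signature on first sight.
        EXECUTE format('CREATE TABLE IF NOT EXISTS doc_%s (%s)', md5(sig),
            (SELECT string_agg(format('%I text', key), ', ' ORDER BY key)
               FROM jsonb_each(doc)
              WHERE jsonb_typeof(value) NOT IN ('object', 'array')));

        -- Insert the scalar values, in the same key order as the columns.
        EXECUTE format('INSERT INTO doc_%s SELECT %s', md5(sig),
            (SELECT string_agg(quote_literal(value #>> '{}'), ', ' ORDER BY key)
               FROM jsonb_each(doc)
              WHERE jsonb_typeof(value) NOT IN ('object', 'array')));
    END;
    $$ LANGUAGE plpgsql;

    -- SELECT shred_one_level('{"name": "alice", "age": 30}');
    -- creates (if needed) and fills a doc_* table with columns age, name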


Is it possible to quantify the savings of storage and IO?


There are some benchmarks in the presentation where ToroDB was introduced at the PostgreSQL Europe Conference (Madrid, 2014, 2014.pgconf.eu): http://www.slideshare.net/8kdata/toro-db-pgconfeu2014


I'd be very interested to know more about this too. There is considerable overhead for each row in PostgreSQL (24 byte header, which must be aligned). Obviously you'll save a bit if there are repeated keys higher up, but you'll pay for the foreign key references to those rows in the children.
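Back-of-envelope, with made-up numbers: a document with 20 keys averaging 10 characters each carries roughly 200 bytes of repeated key text in every jsonb copy, while splitting it across, say, three subdocument tables costs about 3 x 24 = 72 bytes of row headers plus a few integer ids per document. Whether you come out ahead seems to depend heavily on key-name length, nesting depth, and how widely structures are shared.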


It is true that PostgreSQL has a high per-row overhead, but the storage savings are significant anyway. Please see the presentation pointed out earlier.


Can this be made to use HStore?


This is my question. How is the data actually broken down into tables/columns? I can't find a schema anywhere in the source, but maybe I just don't know where to look.

Also the code talks an awful lot about DB cursors, which indicates that this is not really taking advantage of either SQL or the relational model at all.


You might find Revenj (https://github.com/ngs-doo/revenj) interesting too. It maps a DSL schema to various Postgres objects and offers a 1:1 mapping between views in the DB and incoming/outgoing JSON.


zrail, it's difficult to find the schema and/or tables in the source code because all of them are 100% dynamic. See my comments above about the internal workings of ToroDB, or give it a try: run it and check how the data is laid out!

Regarding the use of cursors, they are absolutely necessary: MongoDB has the concept of a session, and queries may ask in subsequent packets for the next batch of results. However, I don't see how this prevents taking advantage of the relational model. ToroDB definitely does that, as you can see if you look at the schema of the created tables.
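Roughly speaking, each MongoDB client cursor can be backed by a server-side PostgreSQL cursor (illustrative SQL only, reusing the hypothetical table names from the storage sketch above):

    BEGIN;
    DECLARE mongo_cursor_42 CURSOR FOR
        SELECT did FROM root WHERE sid = 1;
    FETCH 100 FROM mongo_cursor_42;  -- first batch returned to the client
    FETCH 100 FROM mongo_cursor_42;  -- a subsequent getMore request
    CLOSE mongo_cursor_42;
    COMMIT;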


I would also be very interested in how this kind of breakdown occurs. It is not uncommon for me to have to store arbitrarily structured data, and if there is a better way than a blob, one that can be abstracted away by an app, that would be quite useful on its own, even without the MongoDB protocol.


As far as I understand it, shredding hierarchical documents into a relational store is an optional feature of Oracle XMLDB:

http://www.oracle.com/technetwork/database-features/xmldb/xm...

It is called XMLDB structured storage (vs. binary storage, which actually stores the hierarchical XML natively).

XMLDB structured storage has, I think, been available since 2003 (but I might be wrong there).

Oracle's XMLDB structured storage requires an XSD declaration. Within that declaration, XMLDB can use 'hints' to indicate whether a given set of attributes across documents is related. If so, it will shred the docs in such a way that the related attributes are co-located.

The advantages, as the author of ToroDB noted below, are (a) space savings and (b) the ability to use relational joins with disk-optimized access strategies.

The disadvantages (at least with Oracle's XMLDB structured option) are the need to declare the model ahead of time, and the joins required for deeply/complexly structured documents.

It looks like ToroDB 'senses' the model of each document on the way in. I think there are definitely use cases for this approach that tolerate the trade-off between ingestion speed and storage control. Plus, using a relational engine underneath allows for ACID properties (e.g. multi-object rollbacks/commits), which native Mongo does not provide.


platform, very interesting XMLDB info. As you point out, ToroDB does not require any model declaration ahead of time. It's perfectly "schema-less", as it 'senses' the model of each document.

Regarding ingestion speed, it is very high, even compared to MongoDB's. There are some benchmarks in this presentation: http://www.slideshare.net/8kdata/toro-db-pgconfeu2014


I don't really see why ToroDB can't be implemented on top of Oracle :) There could be some demand for it there.

I needed this for a side project at work a while back, but didn't get the time to implement it.

Oracle's cumbersome enough already; I for one have no desire to drag XSDs into it to use XMLDB.


I'm not saying it's not gonna happen...


Cool concept. I'm not a huge fan of normalizing out the underlying data into pseudorelational tables.

Reconstructing trees in Postgres is a little gnarly on larger tables but it's definitely neat to see more people trying to combine the user experience of Mongo with the technology advantages of Postgres.

What inspired the creation of ToroDB? Are you using it in production yourselves in a limited way?


Thanks!

Documents are split into chunks before hitting the database, so there is no need to reconstruct trees in PostgreSQL. Indeed, many queries don't need to reconstruct the (whole) tree, just a part of it or even a single level. In doing so, ToroDB is able to answer queries by scanning only a small subset of the database, rather than the whole thing.
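For example, with the hypothetical table names from the storage sketch above, a query on a nested field touches only the table for that level:

    -- Find documents whose address.city is 'Madrid': this scans just t_2,
    -- not every document in the collection.
    SELECT did FROM t_2 WHERE city = 'Madrid';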

ToroDB was inspired by the DRY principle: relational databases like PostgreSQL are already good enough that, with some tweaking, they can work perfectly well as a document store.


Very interesting approach. One question I have is around the licensing. I see that it's licensed using AGPLv3. I would think that would require any code that connects to it to be open source. Am I reading that right? If so, then I am guessing that there will be a commercial offering w/o the restrictions?

Thanks


Yes and yes :)


Wait. Does AGPL really require my Python code to be released as open source if I want to connect to your server? As I understand AGPL, if you modify the code, you must make it available to people connecting over the network; that's all.


Errr yes, you're right, my "yes and yes" was too quick. Sorry for that!


I'd really like to write a similar, mongo-compatible layer for FoundationDB.

This is great work.


Super interesting, though I would be interested in how it works. Is it using JSON/hstore underneath?

Right now, reading the code seems to be the only way to find out.


No, it's not using json/jsonb/hstore to store the actual JSON data. It uses a bit of jsonb, but only for a side part of the information (the structure).

And this is the true power of it: data is stored in normal, relational tables. Please see the comment above explaining this in more detail :)


Does all data have to be stored relationally or can you pick and choose?

The reason I ask: take, for example, a game session blob from a multiplayer game. I might not want that merged with the relational data, and I'd just throw it away after a certain amount of time (days/weeks), so blobs are nice for that, while other data would be better off relational.

Yes, there are app/cache-level ways around this, but it might be cool if there were a way to choose between automatic relational storage and blobs. I guess ultimately you could just have multiple stores for live and archival data, but it's something to think about.


Data is laid out transparently to the user; indeed, there should be no (visible) difference as to how it is stored. Data within a single "level" is stored in a single table. If you want to keep blobs, you can definitely do that; they will be stored as such (bytea in Postgres).


So can I use it with meteor?


I would be surprised if ToroDB implemented the oplog which Meteor uses for efficient live-updating, but I guess it _should_ work with the polling mode.

SQL support (i.e. direct Postgres support) is on the Meteor roadmap for post-1.0.


ToroDB is definitely going to have MongoDB's oplog, so it should work as-is for live-updating.

SQL support for ToroDB... there's surely room for it in the future ;)


It hasn't been tested, but as soon as the full Mongo protocol is supported, it should work exactly the same as Mongo does. Give it a try and let us know! :)


To the author: you should probably look into the VODKA index.


VODKA indexes are still experimental and exist only as a branch in Alexander's own repo (https://github.com/akorotkov/postgres/tree/vodka). I do not think they are anywhere near production-ready, and even if they were, they are not an extension.

Alexander is working on allowing new index access methods (the type of index: e.g. hash, btree, gin, gist) to be added in extensions.


Yep. And that's why I believe it's a great way to test/improve VODKA.

PS: now I'm not sure whether it applies, considering that the data is actually normalized inside Postgres.


A really interesting PDF of slides from the PGCon talk about this is here:

http://www.pgcon.org/2014/schedule/events/696.en.html




