RethinkDB 1.10: multi-indexing

Cyranix · on Oct 1, 2013

I kicked off a new project just a few days ago with RethinkDB -- absolutely loving it so far. This new release addresses a minor pain point and gives me confidence in my choice of data store.

In case `coffeemug is still checking in here, I'd like to put in a request to update the documentation page on drivers. The listed Elixir driver has been broken for a few months; a working one can be found at azukiapp/elixir-rethinkdb. Overall the page could be organized a bit better: putting the languages in alphabetical order and not duplicating the list (which pushes contributor names below the fold) seem like straightforward improvements. Thanks for all the hard work -- hope to be sharing my project and showing off some RethinkDB swag before too long!

jdoliner · on Oct 1, 2013

These sound like very good ideas. I'll see if I can get them implemented. Thanks for the suggestions.

coffeemug · on Oct 1, 2013

A message has been dispatched to @mglukhovsky.

elithrar · on Sept 30, 2013

Constantly impressed by the release speed and feature list. Thinking back to 1.4/1.5 (when I started using RethinkDB) and I wouldn't have believed myself had I seen the current features.

Justsignedup · on Oct 1, 2013

http://rethinkdb.com/faq/

WHY DOES EVERYONE RECOMMEND MYSQL? Seriously, its 2013, and postgres is right there in the land of amazing databases. Easy to set up. Easy to get started. And works so beautifully. Why do people mention MySQL. There is zero benefit to using it. Less than zero. It's harder to use than pg.

xd · on Oct 1, 2013

I've been using MySQL for over 12 years. It's easy to set up, easy to get started with, and works very beautifully.

Zero benefit, really, do you have to be so insulting?

ltbarcly3 · on Oct 1, 2013

I can't tell you how many MySQL users I've worked with where 50% of their dev work was doing complex caching and denormalization to avoid joins.

Now, as a MySQL user, you're saying to yourself "joins are slow". Yes. In MySQL, joins are very, very slow.

I mean, you could write a book on all the things MySQL does in a fundamentally broken way. It's a very long list. Generally, MySQL fans have only ever used a single database system, MySQL, and they extrapolate that it's completely brain damaged behavior is somehow a characteristic of relational databases. This is not so.

To be blunt, if you think MySQL works beautifully, I mean, will you at least admit that at least 95% of your experience is with MySQL? Anyone who had ever used any of Oracle, Postgres or even SQLServer could ever say that with a straight face.

xd · on Oct 1, 2013

I develop data management systems and one area of my job is to work with existing solutions and develop import/export routines, so I can assure you have I have a lot of experience across the entire spectrum of RDBMS's.

The MOST interesting thing I've found with the systems I've worked with is the insane use of SQL. Quite frankly, most developers have little to no real idea what they are doing when it comes to SQL, which is evident when you look at the rise of "NoSQL".

If you are a good software developer you will develop good software no matter the language. If you are a good DBA you will write good SQL no matter the underlying database system.

ltbarcly3 · on Oct 1, 2013

MySQL prevents you from writing good SQL..

If you have a nice normalized schema, you will need to do joins, sometimes involving multiple tables and joining to sub selects. With MySQL, this is just a disaster, so you can't have a nice normalized schema, or if you do, you have to stash data in a denormalized way somewhere.

xd · on Oct 2, 2013

I've written just as complicated joins with MySQL as I have Oracle, PostgreSQL, SQL Server etc etc all with very similar performance.

Can you provide/email me with some examples of what you are finding a "disaster" and I'll help find the flaws in the SQL or Schema, whichever it may be.

agildehaus · on Oct 1, 2013

RethinkDB started life as a MySQL storage engine.

continuations · on Oct 1, 2013

How is multi-indexing implemented? What kind of query performance can it attain?

This seems somewhat similar to hyperdex's search function. It is implemented through what is called hyperspace hashing. How would you compare yourself to hyperdex?

coffeemug · on Oct 1, 2013

An index in RethinkDB is an additional BTree that stores a record keyed by the value of the index, not by the value of a primary key. When you query by an index, RethinkDB looks the records up by looking at the index BTree instead of the main BTree. This is similar to how most RDBMSes implement indexes.

Multi-indexing is implemented by inserting multiple rows into the index B-Tree. So if your index for a document returns [1, 2], Rethink stores two records into the index BTree -- one keyed by 1, the other keyed by 2.

This is different from hyperdex -- we don't do any algorithmic tricks with spaces to make this more efficient. The model we used fit extremely well into our architecture, so @jdoliner just went ahead and implemented it.

Hope this makes sense -- great questions! (Also, love your username)

continuations · on Oct 1, 2013

How do you handle distribution of documents among nodes? Say document X has 3 tags "A", "B", "C". Do you place all documents with tag "A" in the same node, all docs with tag "B" in the same (but potentially different) node, and all docs with tag "C"...? In that case you'd need to put document X in potentially 3 different nodes. Or do you not do that kind of "pre-sorting" and just send queries to every node asking for match for tag "A"? (I like my username too. I was surprised it was still available)

coffeemug · on Oct 1, 2013

Documents are distributed across nodes based on a primary key. Currently we use range-based sharding, but will be moving towards hash-based sharding soon. So, for any given document, we look at the primary key and determine where a document should be.

We store a secondary index for a shard on the same node where the master for the shard resides. So if you're storing users whose last names are between A-M on node A, all secondary indexes for users with last names between A and M will also be on node A. That means that for any secondary index query we have to contact all nodes that have shards for a table, but we do a number of systems tricks to make this really efficient.

continuations · on Oct 1, 2013

> but we do a number of systems tricks to make this really efficient.

Can you talk about these tricks? I'd love to learn more about it.

Xeoncross · on Oct 1, 2013

Brand new, written in C++, looks beautiful, nice client libraries, trying to steal all the best parts of other NoSQL databases? I'm going to give it a spin.

ltbarcly3 · on Oct 1, 2013

Multi-indexing: index a single document by multiple values

So.. This is an inverted index. Not exactly a new concept, it certainly doesn't justify a new name. Unless, maybe, the people making RethinkDB didn't know.. no, that's impossible right?

Could a company commercially selling a database not realize they just reinvented a basic data structure?

Then again, the google query "site:rethinkdb.com inverted index" has no relevant results.

Maybe they think the name inverted index is more confusing than just inventing a new name?

susi22 · on Oct 1, 2013

They settled for the name since MongoDB calles it "Multi-key" index.

Most people would probably not be familiar with the name "inverted index".

jonny_eh · on Oct 1, 2013

How do you guys release such polished and worthwhile releases on such a regular schedule? It's rather impressive.

coffeemug · on Oct 1, 2013

Thanks. I'm one of the founders @ Rethink, and project/engineering management is my job. The dev team here is simply amazing. It's not false modesty -- I've been busy with some administrative aspects of the company for the past month and haven't been able to put any time at all into project management. The guys just took the issue tracker and ran with it -- I woke up this morning to discover that we've got a new release.

So, my hat is off to all the DB Rethinkers. It's an honor to lead such an amazing team.

EDIT: I do hire people and occasionally set the tone, so I deserve some credit :)

ltbarcly3 · on Oct 1, 2013

I strongly suspect that nobody thinks that when someone brags about the company they own it's any form of modesty, false or otherwise.

jdoliner · on Oct 1, 2013

Hi,

I wrote the multi-index code and got this release shipped so I guess my opinion might be worth something on it. The answers I read to questions like this normally strike me as very tautological. Things like "you need your team to be well oiled and to have very specific goals". It's not that that's false, it's just that we normally judge a team's well oiledness by how quickly it releases nice things. So I'm going to jump over that, our team is plenty well oiled and getting better all the time but let me tell you how I think we got there.

For us the secret was getting in repetitions. We started by building and releasing one thing we were really proud of, our process for doing it was adhoc and broke down in a million places but we slogged through. Then it came time to build the next version, this time we did some things better and a lot of things we didn't do better. This was our 10th major release by now it feels like we're doing a lot of things better. The most satisfying thing about it is that it's not totally clear to any one of us where all the inefficiencies went the knowledge of how to release quickly isn't stored in any single brain.

One thing I think is an absolutely crucial aspect of this process is making it discreet. During a release cycle I take our release process as absolute. I try as hard as I can not to even think about ways to fix it. Between release cycles I step back and look for ways to tweak the process and ignore 1st order feature development. I'm not sure what it is that makes a discreet process work better, it might just be that it frees up mental bandwidth, but if you find yourself getting bogged down in the process of releasing rather than actually releasing while you're trying to develop the product to release try shutting off half the equation. It's helped me a lot.

Also another way to solve this is to have separate people focus on each aspect. That can be tough at smaller companies because so much communication needs to happen between the 2 but it certainly can work.

dkhenry · on Oct 1, 2013

Can secondary indexes be compound ?

coffeemug · on Oct 1, 2013

Yes!

  r.table('users').index_create('full_name',
                                [r.row['first_name'], r.row['last_name']])

This indexes users both on first name and last name.

alecco · on Oct 1, 2013

Is this similar to multi-column indexes in SQL?

coffeemug · on Oct 1, 2013

No -- multicolumn indexes are equivalent to compound indexes in RethinkDB.

RDBMSes don't need multi-indexing because they don't allow storing arrays of values in any meaningful way. Imagine having a row in a database that contains an array of values. For example, suppose that for every user you store an array of cities they've ever lived in. RDBMSes typically can't do that, but document stores can. If you'd like to efficiently find every user that has ever lived in Mountain View, multi-indexing lets you do that.

corresation · on Oct 1, 2013

>For example, suppose that for every user you store an array of cities they've ever lived in. RDBMSes typically can't do that, but document stores can.

?? What? Are you talking about a junction table?

sendob · on Oct 1, 2013

That would be one way, or

On Postgres, you could use arrays: http://www.postgresql.org/docs/9.3/static/arrays.html also has some great contrib modules as well like: http://www.postgresql.org/docs/9.3/static/intarray.html

Mongodb documents their multikey indexes for a similar approach: http://docs.mongodb.org/manual/core/index-multikey/

not to take away from rethinkdb, awesome to see a project move rapidly!

petergeoghegan · on Oct 1, 2013

Importantly, in Postgres the arrays are indexable with a GIN (inverted) index. So you can ask arbitrary questions along the lines of: show me all rows in this table that have any of the following integers: [5, 7, 9] somewhere in their array column. This will be performed very efficiently.

In recent versions, Postgres will even keep track of statistics for arrays (and other, similar non-scalar types), so that you can get a good selectivity estimate, which is often critical to getting a good plan from the query planner.

Nate75Sanders · on Oct 1, 2013

That's not storing a collection in a table. That's approximating the storage of a collection in a table by using standard RDBMS features.

If you think these are the same somehow, I encourage you to think about the number of disk seeks required for recovering the full collection in both cases.

vidarh · on Oct 1, 2013

If all you need is the indexed values, then most mature RDBMs's supports using covering indexes in queries and satisfying the query entirely from the index.

Since the point of B-tree's is to minimize disk seeks and packing values with the same key close together, the number of disk seeks can often be made just as small as if the collection had been stored in the "owning" record.

Secondly, whether or not it is is an implementation detail. There's nothing inherently preventing RDBMSs from supporting "inlining" of collections from related tables. But in practice most RDBMSs choose to support that exactly through the much more generic optimization of supporting queries using covering indexes...

In other words: It only takes a lot of disk seeks if you use a RDBMS that is lacking in features and/or haven't added the proper indexes.

corresation · on Oct 1, 2013

If you think these are the same somehow

I didn't say they were the same, but instead was replying to the statement "RDBMSes typically can't do that, but document stores can.", but of course an RDBMS can store an "array of cities you've ever lived in".

I encourage you to think about the number of disk seeks required for recovering the full collection in both cases.

Ignoring XML, array types, or materialized views, for the RDBMS it is generally zero on virtually any modern platform, as the common case now is a database that is 100% in memory. A $79/m OVH server has 32GB. A low-end Dell server can have 256GB for $3000 (there are extremely few databases where the hot data surpasses even a GB). And the best thing about normalization is that it, like LZ, is a form of compression, yielding a more likely scenario that your database fits in memory.

That's all a side topic, however, and is neither here nor there on the long running NoSQL/SQL debate.

jeltz · on Oct 1, 2013

No, they are the same thing as GIN indexes in PostgreSQL.[1] GIN is used to index fulltext searches, arrays and hstores (a kind of key-value map).

https://devcenter.heroku.com/articles/postgresql-indexes

1. Implementation might differ a bit in the details but both seem to be based on the same ideas.