> Early in the development of RethinkDB, we made the design decision to be extremely conservative about durability and safety of users' data.
I'm glad that the Rethink engineers are taking this approach to development instead of "let's win all the benchmarks!" Overall, it's apparent that they are really thinking through the consequences of their decisions and the tradeoffs in database systems, and that they want to build something that will last instead of something that will just become popular in the short term.
Seems a bit unconventional to manually specify which index you'd like to use when writing queries. This would normally be something handled by a query optimizer.
Of course, many DBMS provide some facility for specifying indices manually through query hinting, but for non-trivial queries, the optimizer often can do a better job than a programmer at picking the most efficient query plan.
This was an intentional choice we made for a couple of reasons. The most pragmatic one was that we don't have an optimizer and we thought that secondary indexes could still be useful to a lot of people without one. And we've found it's always better to ship early and get feedback sooner rather than later.
Another reason, though, is that we're not totally sold on the state of the art in query optimizers. From what we've seen, they can introduce a number of non-intuitive behaviors, and it can be hard to tell which operations will use which index. Query optimizers are nice when they work correctly, but they don't always, so for the first release of secondary indexes we opted to build something with very clear, precise semantics and then build on it. We will eventually have a query optimizer, but it's a big problem, and we don't want to just implement what others have because we think we can do better.
The thing that makes me crazy about database engines is the "60% of the time, it works every time" phenomenon, wherein things are magically fast until they are magically slow.
Then the discussion goes seamlessly from "don't worry your pretty head, just build your app" to "of course you have to really understand <proprietary, opaque, uncontrollable internal technology hidden by abstractions>." With no recognition that these are basically contradictory messages.
If you can really hide the complexity so it almost never matters, great. But here, where you couldn't, I think you made the right choice by exposing it.
Thanks, we've gotten a lot of mileage out of the following ethos:
Let users ignore danger, but never let users ignore the fact that they're ignoring danger. Always give users a way to learn exactly what the dangers are.
In the case of the query optimizer we haven't gotten to giving people a way to ignore these dangers yet but we will eventually.
A decent cost based optimizer will be right most of the time, and you have hints for the cases where it falters.
Your approach is simply hiding the query optimizer choices inside of the app, making it more fragile.
Let's say I have two fields: a and b, both of which have an index. And then I have a query with predicates across both a and b. For example, a = 10 and b = 20.
One of the core facilities within an optimizer is selectivity estimation. By looking at the statistics, the optimizer will see that a = 10 might look at 10,000 rows while b = 20 might only look at 10. So the optimal and desired choice would be to use the index on b.
However, the exact same query construct with different parameters (e.g. a = 50 and b = 3) might flip the index selection.
Now let's imagine I had to implement this inside of my app. Every time I have such a constraint, the app has to become aware of selectivity to know which indexes to use based on input parameters.
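To make that concrete, here's a rough sketch of what the app ends up having to do, written against the Python driver (the `stats` helper and index names are made up for illustration; `get_all` and `filter` are real ReQL commands):

```python
# Hypothetical sketch of app-side index selection. The `stats` helper and
# index names are invented; get_all/filter are real ReQL commands.
def pick_index(a_value, b_value, stats):
    # The app duplicates the optimizer's job: estimate how many rows each
    # index would touch for these particular parameter values.
    rows_via_a = stats.estimated_rows("a_idx", a_value)
    rows_via_b = stats.estimated_rows("b_idx", b_value)
    return "a_idx" if rows_via_a < rows_via_b else "b_idx"

def lookup(table, a_value, b_value, stats):
    if pick_index(a_value, b_value, stats) == "a_idx":
        # Walk the index on a, then post-filter on b.
        return table.get_all(a_value, index="a_idx").filter({"b": b_value})
    else:
        # Walk the index on b, then post-filter on a.
        return table.get_all(b_value, index="b_idx").filter({"a": a_value})
```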
You are underestimating the necessary design tension. If you expose hints, then you've constrained how your optimizer works because it has to work with those hints. (Oracle, for example, has been fighting this battle for ages.) Once you've built an optimizer, people come to depend on it so it is critical that you get it right.
They will need an optimizer eventually, and they know it. But I'm glad that they are not implementing it before they are ready.
Secondly, your selectivity estimation point can go either way, and on the whole I don't like it. I've personally experienced the situation where a database recomputes statistics, the CBO decides that a new query plan is needed for a common query, it makes a poor choice, and the first any human hears about it is when the site goes down.
The problem here is that the risk profile for the application of trying to be smart here is completely backwards. In general, as long as a query is not a bottleneck, I don't care about making it faster. Oh, you made it 2x faster? I didn't mind before and I'm unlikely to even know that you did so. But if you JUST ONCE switch to a bad plan on production without warning, your users WILL notice, they WILL care, and they WILL NOT be happy.
As a developer, I don't care that you make the right choice 95% of the time. I want you to make a predictable choice. Because if you're making a bad choice in development, I've got a window of opportunity to notice and do something about it where nobody cares. But if you randomly make changes on production, every single mistake counts.
Oh, but you say that this just means you need stored query plans? I agree, and this is an example of why the behavior of the optimizer has to be thought through very carefully. Otherwise you throw something out there, people come to depend on it, and then you realize that you have put up barriers to thinking about it the way you want to.
This accords with our experience with large Oracle databases as well.
We've had a few panics caused by a DBA updating table statistics while trying to optimize some query. This would occasionally cause sudden, massive changes in the way other, unrelated queries were executed; those queries would not finish and would sometimes bring down the database.
These experiences forced us to change our procedures and keep the test database data closely in sync with the production data, so we'd know how newly gathered statistics would affect queries. The database is large enough that keeping a full and recent copy of production for testing is pretty painful. Oracle has since introduced features in 11g that allow pinning query plans; we've yet to try them, though.
In its simplest form, you could say: "I have two reasonable plans; let's try A, and if it takes longer than time X, start B in parallel and go with whichever finishes first."
You could ramp up the idea to handle changing query plans based on updated statistics by sending some fraction of queries to one plan, and some to another. Then keep stats on how that worked out for you.
Basically, never simply flip the switch in production to try a new, unproven query plan.
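Here's a rough sketch of that race in Python (purely illustrative; `run_plan_a` and `run_plan_b` stand in for two fixed query plans, and the timeout is arbitrary):

```python
# Rough sketch of "try plan A, and if it's slow, race plan B against it".
# run_plan_a / run_plan_b are placeholders for two fixed query plans.
import concurrent.futures

def race_plans(run_plan_a, run_plan_b, patience_seconds):
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(run_plan_a)
        try:
            # Give the proven plan a head start.
            return first.result(timeout=patience_seconds)
        except concurrent.futures.TimeoutError:
            # Plan A is taking too long: start plan B and take whichever
            # of the two finishes first.
            second = pool.submit(run_plan_b)
            done, _ = concurrent.futures.wait(
                [first, second],
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            return done.pop().result()
```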
Incidentally, 12c advertises that it will actually collect statistics during query execution and validate the query plan based on them. If that works, this problem should get better. But of course that comes with overhead, and it is likely to be a complex piece of code, so you tell me whether you trust them to get it right right away.
A cost-based optimizer also needs a way to handle the combinatorial explosion of possible plans. For MongoDB, maybe they can be exhaustive or use a simple strategy. But for a SQL DBMS, the search space is far too large, and you need a way to navigate it intelligently.
This "try the plan out" idea has come up many times, and it's a reasonable idea, but it's nothing new and not a complete solution. You need a way to evaluate whether a plan has a reasonable chance of being the best one before the whole plan is even constructed, much less executed.
Combinatorial explosion is also one of the reasons that humans generally aren't great at coming up with non-trivial plans, either.
A good optimizer is a combination of many good ideas; so I don't mean to be discouraging, just that one idea is not the full story.
If you are worried about the risk of a changing plan, you are between a rock and a hard place. If you lock the plan down (or hard-code it in the application), and the data characteristics change, that could be disastrous. If you don't lock it down, that's a leap of faith that the optimizer will adapt gracefully.
I understand the issues you've brought up in your post as well as the traditional methods of pruning the search space of query plans through using left-deep join trees and histograms for calculating selectivity.
My top-level point was that there is no way humans will come up with the optimal query plan by hand, and like you said even if they do - the data will change.
As an aside, the reason Mongo can use such a simple but effective strategy is because it doesn't support joins, which makes calculating the optimal plan much, much easier. RethinkDB does, however, and as such, a CBO is that much more important.
I am not disputing the value of a cost based optimizer.
I'm just pointing out that if you have an existing plan that seems to be working, the default should be not to switch it without testing that the switch didn't hurt anything. Instead you need to use care.
Yup. They have been forced into this kind of complexity because the simple, general solution that they think should work, and which does work the vast majority of the time, also routinely causes pants-on-fire emergencies.
I frequently see apps with thousands of distinct query signatures. Having a developer manually choose indexes, join ordering, and aggregation method (just to name a few) for every single query, and then select multiple plans because input parameters absolutely do result in scenarios where plans can be 10,000x apart in performance -- well, that's just untenable.
Not trying to claim that CBOs are the panacea here, but let's be realistic. Having developers manually plan every single query is not the right choice.
There is a logical fallacy in what you've said. It absolutely is true that input parameters can sometimes make a 10,000x difference in performance. It is also absolutely true that applications can have thousands of distinct query signatures.
What DOES NOT follow is that most of those distinct query signatures are very important to the application. A few are. But most are not. However, volunteering to re-plan the wrong one can take the application down.
> Having developers manually plan every single query is not the right choice.
I agree. However the many applications successfully built on MySQL demonstrate that for a lot of people, crappy but consistent planning is good enough. At least then people can know where the problems are, and fix them.
slava @ rethink here. I don't think we disagree -- optimizers are unquestionably a good idea and are immensely useful. For us it was a matter of a) giving people the option to specify indexes directly, and b) shipping quickly. We'll implement a proper statistical optimizer in due time, but for the time being we found that the explicit approach gives a lot of people 80% of what they need with 20% of the work on our part.
An optimizer can be right most of the time, but it's basically impossible to have it be right all of the time. Which approach you prefer really depends on what you're doing. If you absolutely need a query to have a certain performance characteristic, then you can't count on the optimizer to get it right; a human needs to think through exactly how they want the query to run to ensure it behaves correctly.
There definitely are several use cases where people are willing to accept a risk of misoptimization in exchange for shorter code, and we fully intend to support that (it's a complicated project, though, so we don't yet). However, I disagree that having this logic in the app makes it more fragile. Having the precise semantics expressed in the application means you always know exactly how a query is going to be performed. We know the execution plan isn't going to change based on seemingly unrelated properties of the data (as it does with optimizers). If fragility is the propensity to break, I'd argue that having the optimizer in control rather than the developer gives you more fragility, not less.
In practice, you end up with apps which dynamically build up predicates in different sections of the code. And when you combine that with many predicates, many tables, and other constraints such as ordering or aggregates, things get complex pretty quick.
Even if you have the best developers who understand all the ins and outs of the dataset, re-implementing an optimizer in the app is rarely the right choice.
No, it's categorically not good to make an application more fragile. A weakness restated is not a strength. Every database should have an optimizer, period.
If a database system does not support joins, aggregations, or subqueries (as is the case for most realtime NoSQL solutions), an optimizer becomes pretty trivial. Optimizers are needed for analytical stuff. That's why most optimizers are evaluated on analytical workloads (e.g. TPC-H, TPC-W), not transactional/realtime ones (TPC-C).
I did not state that an optimizer should not exist for a database (I think that's actually key), but rather that the tradeoff they made this time around was fundamentally good in that, at least for now, it forces the developer to think about application performance.
If that happens to make an application more fragile, I think that is more of a code organization/tooling issue than anything else.
+1, a good approach could be to still let users manually specify which index to use even once the optimizer is available. Sometimes humans just know better.
That's something we'll make sure to keep around indefinitely. I always hate things that are too magical, which optimizers frequently seem to me to be, but I also get why they're useful to people. The best compromise, I think, is to give people the magic and a way to opt out of the magic.
Thanks for the reply. Do you have any resources you could share regarding the pitfalls of query optimizers?
My experience with them comes primarily from studying the System R optimizer where the literature presented query optimization as a boon to performance without mentioning such drawbacks.
The primary problem is that picking indexes and execution algorithms essentially involves traversing an exponential space (and taking guesses about the costs of things). Modern DB optimizers have hundreds (or thousands) of heuristics to do this well, but every once in a while they pick the wrong path and run a suboptimal query. This might be OK for offline analytics, but it can be disastrous for production OLTP environments, because a small change in statistical information can trip up such an edge case and the live system will grind to a halt. For example, the Postgres optimizer has been around for a long time and is very mature, but they still fix these bugs quite often AFAIK.
In many production environments admins end up having to hack the queries to "trick" the optimizer into doing what they want, which of course defeats the whole purpose. So, for real-time systems, being able to specify indexes manually is actually a productivity boost, because you often know exactly how you want the query executed.
I'll try to dig up some info on this, I don't have any links off the top of my head.
Even if you manually write the query plan, you are still subject to similar disasters. That's because your data is changing, which can cause the query performance to fall off a cliff without changing the plan.
I just wanted to pipe in and say that's excellent reasoning for this design, especially since the ability to create indexes on arbitrary ReQL expressions would make an intelligent query planner into a lot of work. I'm looking forward to giving these a try.
You can use multiple indexes in some queries. The case you described will work. However you can't use multiple indexes from the same table. For example:
`table.between(...).between(...)`
won't work. You'd have to use a filter (and thus a linear scan) to do the second between. This seems like it would be a good thing for us to add to the FAQ because the rules can be a bit complicated. However, the best way to find out might actually just be to try: because we're not using an optimizer in the background, you have to explicitly say which indexes you're using, and thus the system will simply refuse to run a query if it can't use indexes the way you're asking it to.
You can't currently do index intersection (so you can't chain `between` commands), but you can absolutely use multiple indexes with joins. E.g. `table.between(...).eq_join(...).eq_join(...)` will use three indexes in this example.
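In the Python driver, that kind of chain might look roughly like this (table, field, and index names are made up for illustration):

```python
# Sketch of a query that uses several indexes: a secondary index for the
# range scan, plus each joined table's primary-key index for the eq_joins.
import rethinkdb as r

conn = r.connect('localhost', 28015)

results = (r.table('posts')
    .between('2013-01-01', '2013-04-01', index='published_at')  # secondary index on posts
    .eq_join('author_id', r.table('users')).zip()               # users' primary-key index
    .eq_join('site_id', r.table('sites')).zip()                 # sites' primary-key index
    .run(conn))
```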
I love the RethinkDB API. It makes joins possible without going through mapreduce. Though I am using MongoDB currently, I plan on switching to RethinkDB as soon as it is more stable and provides better performance. Excited to see the team making progress on all fronts.
I'm not sure what you think is pure gold here - the statement is correct. When dealing with big data, joins are the first thing to go as you need to shard your data. Whether RethinkDB will really solve joins across sharded data is up in the air though - ultimately you still need to do filtering per shard and then reduce the result somewhere - and doing filtering across two attributes means you are going to be doing a lot of data transfer, but RethinkDB is about the only database I've seen really making progress in this area.
I like it a lot, but even simple queries are hideously complex. It doesn't need to be this hard. The RethinkDB developers are exceptionally helpful, but goodness. Try to do a search returning all items where an array contains a value. In SimpleDB it's just select * where field='value', even if field is an array. In RethinkDB I had to ask for developer help and ended up writing a deep, sophisticated query. Querying has to be dramatically simpler.
We're actually working on adding commands to ReQL to make queries a lot simpler. (It turns out that just a few commands that are very easy for us to implement can make an enormous difference.)
If you can, would you mind giving examples of things you'd like to do that are hard? We'd love to make it easy, and we think we've got most common things covered in the next release, but if there's something specific you have in mind we'd really appreciate it.
The basic stuff needed to let a beginner developer write a web app should be super, super simple. REALLY simple. No map, no reduce, nothing hard.
Here are some of the basic examples of typical things people might need to do when writing a web app:
- How do I select records based on a value in an array?
- How would I find all records whose name begins with "star", i.e. 'Star Trek', 'Star Base', 'Star Wars', 'Star Anise'?
- I have records which contain an array of email addresses. How would I get all documents which contain one or more of a list of email addresses?
- How would I add a where clause to my queries to ensure, for example, that the "status" field in my document equals "open"?
- How would I implement paging through a result set?
- How would I implement count to control the number of records returned during paging?
- What is the right way to store dates in RethinkDB so that they can be queried?
- How would I select all documents in a date range?
- How do I do queries on fields that are within other fields in my JSON structure?
- How do I do wildcard searches?
- How do I define which fields are returned from my query, or does it always return the entire document?
- How do I make sure that I do not add documents to the database that are missing required fields, thereby breaking all queries that do not contain those required fields?
It should be incredibly easy (and well documented) to do all the above.
I wanted to work with RethinkDB but chose a different database because I was so intimidated by the query writing.
Thank you, this is phenomenal feedback!! Most of these are already very simple, and the ones that aren't will become very simple soon. Michel (@neumino on github) is working on improving the docs right now, and I will pass this list on to him. We'll get this documented ASAP.
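To give a flavor, here are rough sketches of a few of the questions above in the Python driver (table and field names are made up; the commands shown are standard ReQL):

```python
# Rough sketches for a few of the questions above; table and field
# names are illustrative.
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Select records where an array field contains a value:
r.table('users').filter(lambda u: u['emails'].contains('me@example.com')).run(conn)

# A simple "where status = 'open'" clause:
r.table('tickets').filter({'status': 'open'}).run(conn)

# Query a field nested inside the JSON structure:
r.table('tickets').filter(lambda t: t['meta']['priority'] == 'high').run(conn)

# Paging: skip the first 20 results, return the next 10, plus a count:
r.table('tickets').order_by('created_at').skip(20).limit(10).run(conn)
r.table('tickets').count().run(conn)

# Return only selected fields instead of whole documents:
r.table('tickets').pluck('id', 'subject', 'status').run(conn)
```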
Am I the only one who hardly ever needs a case-sensitive query/index on string data? This is the one feature I miss from the MySQL glory days. With Postgres, I use the citext type (case-insensitive text), but I guess Mongo and RethinkDB expect you to either store the data as all lower- or uppercase, or duplicate the field (for indexing).
The reason we (and probably Mongo folks) do it this way is because JSON is case-sensitive, so making indexes case-insensitive wouldn't work very well.
EDIT: it looks like I misread your comment. I was talking about index/field names, while I think you were talking about field values. In this case Sam's comment below is correct -- once we add string.toLower() (which is very easy), you'll be able to do case-insensitive indexing if you like, because the indexing system supports arbitrary ReQL expressions. If you're using Rethink and this is important to you, let me (slava@rethinkdb.com) know, and we'll prioritize this.
Yes, indices are updated incrementally. You can currently evaluate an index for a given row based only on that row, not on other rows. The reasons for this are pretty deep -- if you start evaluating indexes on a row based on other rows, the result becomes non-deterministic and can't easily be replicated across the cluster. Once we introduce value-based replication (as opposed to the evaluation-based replication we have now) this will become possible.
Installed and played with (1.3 I think) and it looks very promising but the ability to query and order by case-insensitive text is a requirement for us. Our app deals with user provided names of people, places and things that are commonly searched and sorted.
So not using RethinkDB today, but I am 100% in love with what you guys are trying to accomplish.
If or once a string.toLower() function is added to ReQL, you'll be able to have case-insensitive indexes in RethinkDB, because you can index on arbitrary functions of your rows.
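A sketch of what that would look like, assuming a lower-casing string command (ReQL did eventually ship one, spelled `downcase()`):

```python
# Sketch of a case-insensitive secondary index built on an arbitrary
# expression, assuming a lower-casing command (ReQL later shipped downcase()).
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Index people by the lower-cased form of their name:
r.table('people').index_create(
    'name_lower', lambda person: person['name'].downcase()).run(conn)

# Case-insensitive lookup and ordering against that index:
r.table('people').get_all('grace hopper', index='name_lower').run(conn)
r.table('people').order_by(index='name_lower').run(conn)
```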
I'm really excited to see this project take off - I played with it a few months ago and it seemed like it was getting close to prime time. The two issues I see now are:
* More packages for more distros - It's great that I can install it on my OS X machine for testing, but I can only roll it out on an Ubuntu machine, and last time I checked the packages were a bit iffy when I tested them.
* More languages - It looks like you guys are working hard on this one, which is great.
Hopefully it'll be production ready soon! Great work guys.
Mike @ RethinkDB here. Great feedback-- while we've been a bit behind in rolling out releases to other distros and languages, we've been delighted to see the community pick up our slack:
- Our new build system has people successfully compiling from source and building community packages for CentOS, Slackware, Gentoo, Arch, and lots of other distros.
- We've seen an enormous number of client drivers for different languages pop up in a short time, with some impressive collaboration:
* rethink-net (C#/.NET): https://github.com/mfenniak/rethinkdb-net
* php-rql (PHP): https://github.com/danielmewes/php-rql
* rethinkgo (Go): https://github.com/christopherhesse/rethinkgo
* cl-rethinkdb (Common Lisp): https://github.com/orthecreedence/cl-rethinkdb
* haskell-rethinkdb (Haskell): https://github.com/atnnn/haskell-rethinkdb
* lethink (Erlang): https://github.com/taybin/lethink
* librethinkdb (C): https://github.com/unbit/librethinkdb
* Lots of emerging efforts for Java, Perl, and other languages.
It's great to see people passionate about bringing our project to other platforms / distros, and we're doing whatever we can to support them. If anyone's interested in a similar project, shoot me an email at mike [at] rethinkdb.com
Looked into it and discovered it was a simple lack of the pyyaml Python module. You should probably put that in your build instructions and have the configure script throw an error.
I like Homebrew as well and I don't want to detract from how awesome it is, but dpkg + aptitude is much better in my opinion and I find I actually miss aptitude on OSX.
Congrats on your release! RethinkDB seems very interesting, and I really appreciate how coffeemug and all the other team members genuinely and clearly reply to questions here.
Is there already a list of sites/services that use RethinkDB in production?
There isn't a list yet we can share, but we're starting to work very closely with a small group of early customers -- shoot me an email to slava@rethinkdb.com if you'd like to see if we can work together!
So is this production ready? I'm excited about this, but I don't have time at the moment to dabble with things I can't use, and the last comment regarding this is a post on Quora from the founder saying that it is not production ready yet. A 1.0+ version number makes you think this isn't beta software anymore, but that doesn't seem to be the case.
We'll bump the version to 2.0 when the product is production ready (the 1.x versioning scheme is an accident that unfortunately we can't do much about).
RethinkDB is definitely late to the party, so this is a great question. Here's how we think about it.
1. Technically, Rethink already does very useful things leading NoSQL contenders don't do. An extremely expressive query language, massive query parallelization, distributed joins, etc. You can't get that anywhere else, and we have lots of features in the pipeline to keep raising the bar.
2. I believe we have a unique take on usability and design that is very valuable to users but isn't present in other products. Rethink is pleasant to set up and (hopefully) pleasant to use. To us, design matters, and we put an inordinate amount of effort into making the product beautiful. It turns out users care about this a lot.
3. I feel like leading NoSQL contenders are somewhat stagnant. There is an enormous amount of innovation that could be done, needs to be done, and isn't being done. We'll continue releasing really useful features that nobody has, and (I think) nobody expects. I think we bring a unique philosophy to product design that's very valuable.
4. Details matter (which is why Rethink is late to the party). Once you outgrow the ten-minute blog stage, a lot of underlying architectural decisions start to really matter, and we took the time to do them right.
TL;DR: Rethink is already really good for building apps on top of it and offers things nobody else does. It will continue getting better. We'd be honored if you took it for a spin!
Also http://www.rethinkdb.com/docs/advanced-faq/. RethinkDB started out as a lock-free MySQL storage engine that did schema updates non-stupidly, but it sounds like they've abandoned ACID and schema enforcement entirely, like almost all NoSQL systems (as if what I wanted was faster wrong answers).
I see this 'tude among DB guys quite often -- unless the database guarantees unconditionally that it will always give the right answer and that no data is ever lost, then it is worse than useless and must be nuked from orbit.
Quite frankly, that is pretty arrogant. ACID may be required for a bank's transactions, but in the vast majority of cases it isn't required (yes, even in your business) -- nothing of any consequence is going to happen if your A/B testing DB loses a couple hundred entries because they weren't flushed to disk yet.
As for non-schema enforcement, well, you have to add a few if statements to your database model class, and that's it.
To be clear, RethinkDB doesn't give wrong answers and doesn't lose data. ACID isn't so much about right vs. wrong answers as it is about defining what "right answer" means and what constraints you're willing to give up in exchange for performance.
ACID is great, but unfortunately it isn't free (especially in distributed systems).
And each of those individual operations must be preserved. The account credit/debit example is not used because it reflects reality 100%, but because it's a simple-to-understand example that illustrates the power of transactions. Most developers should be able to look beyond that simplification and see how they can put transactions to work (or see that they are unnecessary).
Wow, I thought this was a joke press release making fun of "web scale" fad craziness. They seriously gave a 1.0 release number to a "database" that you couldn't search?
Data has always been indexed; it was just indexed only by a single primary key. Before this release the indexing scheme wasn't more sophisticated than a key-value store; now it is. I feel like there are several other products on the market that do just key-value indexing.