Most distributed NewSQL databases are an SQL layer on top of a distributed KV store. They try to hide the distributed reality of the database so that it acts like a regular database from the client side. Of course there are always caveats that might not be completely obvious but can cause terrible performance.
We take the opposite approach: we make the user aware of the distributed nature and force them to use the distributed database the way a distributed database should be used. You must split your data into chunks (actors), and you have a full Raft-replicated SQL engine (SQLite) within each chunk.
ActorDB has a very different design. I've never used it (I've been meaning to, but I don't have a use case yet), so my information might be a little bit off, but:
Basically, ActorDB doesn't hide the fact that it's partitioned. Rather, it forces the client to deal with partitions (or "actors") at the application level.
In particular, this means that while you can have transactions spanning multiple actors, queries can't. So you can't do joins across actors, and if you want to select from multiple actors in one network roundtrip, you have to do a special kind of looping statement that first finds the actors to operate on, then executes a statement on each.
You can work with multiple actors at once, but the database doesn't pretend that it's a single database; rather, each actor is sort of like a separate database, and it's up to you to design the data model and the SQL statements to distribute the data in an optimal way.
In those cases where you do need joins or aggregations that span multiple actors, you'll have to jump through some hoops. Joins, in particular, are probably not going to be very efficient. You might precompute some data that other databases would figure out on the fly. Again, since I've not used it, I don't know all the ways you would work around such limitations.
Unlike ActorDB, TiDB and CockroachDB present the illusion of a single database, on top of a distributed key/value store. There's a magical execution engine that takes selects, even with joins, and automatically splits the query plan into multiple parallel requests to the shards that hold the data, and then merges the results back together. You can do "select * from sometable" and it will return your table in one piece, no matter how it's distributed physically.
There are certainly benefits and drawbacks to both approaches.
Many distributed databases supporting linearizability fail to provide consistent backups.
MongoDB's docs: "To capture a point-in-time backup from a sharded cluster you must stop all writes to the cluster"
Cassandra's docs: "To take a global snapshot, run the nodetool snapshot command using a parallel ssh utility ... This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken"
Riak's docs: "backups can become slightly inconsistent from node to node"
CockroachDB's docs: "The table data is dumped as it appears at the time that the command is started ... there is no guarantee that NOW() is monotonic in transaction order"
Does TiDB support consistent backups? Are there any docs covering how it is implemented?
Yes, it does. Thanks to MVCC, TiDB supports the repeatable read isolation level, which guarantees that any data read cannot change: if the transaction reads the same data again, it will find the previously read data in place, unchanged, and available to read.
So we can use any MySQL tool, such as mysqldump or mydumper, to back up the database consistently.
There are lots of other differences. The default distributed storage engine of TiDB is TiKV, and TiKV is written in Rust. The transaction model is different, and so on. They are similar in that they are both NewSQL and both use Raft to replicate data.
Any idea why the same people appear to have used Rust for their KV store but Go for their SQL frontend to it? They are both great languages, I'm just surprised to see one team with major projects spread across the two.
Go is good at concurrency and has great development efficiency, so we could easily develop the complex SQL logic and parallelize many operators. But its GC and cgo overhead make it unsuitable for developing a storage engine, so we chose Rust to develop TiKV.
Isn't the overhead of calling Rust from Go quite high?
Also, I'd assume that Rust would be better at expressing and pattern-matching against the sort of complex execution plan trees and query expressions you need for this sort of thing.
I was looking at Apache Spark the other day, which is written in Scala, and which has a planner that goes SQL -> AST -> logical plan -> physical plan. The planner/optimizer relies extensively on pattern matching to apply rules to the logical plan (pushdown and so on), and it manages a lot of this with "match" statements and not a lot of recursion.
But that stuff is murder in Go, which doesn't have pattern matching, or generics for that matter. I know it's bad because I'm in the middle of something similar right now, in Go. The Spark code is exactly how I wanted to organize it (minus the awful class inheritance they do), but that's not possible in Go.
The Go code communicates with the Rust code over the network, so there is no cgo overhead.
Go has interfaces, which can be used to express the AST and the plan tree. We define interfaces for AST nodes and plan tree nodes, which makes it convenient to apply rules (predicate pushdown, column pruning, cost computation) on the tree.
You could refer to the code in https://github.com/pingcap/tidb/blob/master/plan/plan.go
We use RocksDB as the single-machine storage engine and build TiKV on top of RocksDB as a distributed storage engine. At first we considered using Go to build TiKV, but the cgo overhead between Go and RocksDB is considerable. So we turned to Rust.
The network communication overhead between TiDB and TiKV has nothing to do with cgo.
Actually we are using RPC to call Rust from Go, and it has nothing to do with Go or Rust; either way we have to do an RPC call (by encoding and decoding data). Currently we depend on Protocol Buffers and a customized RPC, and we are trying to migrate to gRPC. We know Go doesn't have pattern matching or generics, and it's kind of a pain sometimes. But we still like Go very much for its simplicity and concurrency.
Yes, Go has higher development efficiency than Rust, especially for newcomers. But for a storage engine, the most important thing is stability, so we focus on writing correct and stable code. Rust's compiler is quite strict, which is good for writing correct code. We have many programmers who are very good at writing Rust, so we do not suffer from low development efficiency.
As for libraries, we can get most of what we need, though some were missing when we began to develop TiKV (gRPC and HTTP/2, for example). But thanks to the Rust team and community, who have given us great support, gRPC for Rust is available now. We also contribute back to the community: we developed a Prometheus client for Rust. The Rust community is active and helpful, so we do not worry about a lack of libraries.
TiDB's transaction model is different from CockroachDB's, and TiDB is highly layered: you can use TiKV (the underlying storage layer of TiDB) as a key-value database without SQL for better performance.
Likely because of its nature as a network-first language.
Go is largely speaking a language designed to push bytes over sockets to lots of concurrent clients.
There are two more things: 1. Go has high development efficiency, so we could build the complex SQL logic easily. 2. Go makes it easy to write concurrent code, so we could enjoy the benefits of multi-core CPUs. For example, we can use multiple goroutines to scan data, do joins, and do aggregations.
It is missing NewSQL, which is happening right now, but it shows how we are going in circles. There are benefits to NoSQL, particularly for types of data where it is OK for individual values to occasionally be wrong or missing (user tracking, shopping carts, etc.). CRDTs are also useful. I'm wondering what NewSQL will bring, but I'm thinking that we will go back to traditional relational databases once again. We'll probably end up with some kind of hybrid approach, deciding for a given piece of data whether it should be distributed (at the cost of consistency) or vice versa.