Hi! I'm a Product Manager for the MySQL Server. I don't specifically work on InnoDB Cluster, but am happy to answer questions.
Let me start off by describing how the clustering works:
- A cluster is 3+ nodes in a group, each with a complete set of data
- When you say 'COMMIT', the server you are connected to certifies your write-set with its peers (i.e. has anyone else tried to modify the same data?). Once the majority agrees, the transaction is considered successful.
- The distribution and certification of the transaction are synchronous, but applying it on the other nodes is asynchronous. Thus you actually retain pretty good performance if the network latency is reasonable (a rough sketch of the certification check follows below).
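Very roughly, that certification check behaves like the toy Python model below. This is not the real Group Replication implementation (the class and field names are invented; the real thing works on hashed write-sets and GTID-based snapshot versions), but the first-committer-wins idea is the same:

```python
# Toy model of optimistic certification (first committer wins).
# Invented names; NOT the real Group Replication code.

class Certifier:
    def __init__(self):
        self.last_certified = {}   # row key -> sequence number of last certified write
        self.sequence = 0

    def certify(self, write_set, snapshot_seq):
        """write_set: keys of the rows the transaction modified.
        snapshot_seq: the logical point the transaction read its data from.
        Fails if any peer already certified a conflicting write after that point."""
        if any(self.last_certified.get(key, 0) > snapshot_seq for key in write_set):
            return False                      # conflict -> rolled back on the originator
        self.sequence += 1
        for key in write_set:
            self.last_certified[key] = self.sequence
        return True                           # certified; peers apply it asynchronously


c = Certifier()
print(c.certify({"accounts:42"}, snapshot_seq=0))  # True  -> commit succeeds
print(c.certify({"accounts:42"}, snapshot_seq=0))  # False -> concurrent update is rejected
```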
The performance question comes up a bit, so here's another post for more context: http://mysqlhighavailability.com/an-overview-of-the-group-re...
Thanks for the quick summary! You're saying that each node in the cluster is a complete copy of the data. So this solution will give you high availability on both writes and reads with one node in a three-node cluster down. Can you read as long as one node is up, but not write?
For the performance piece, I'm looking at the attached link and it seems like neither read nor write throughput goes up as you add nodes. So will this allow you to scale your read/write volume horizontally, or is it just for high availability? It doesn't seem like you can just add more nodes to get more write throughput. I'm guessing you'd still need to take advantage of read-only replicas to scale read volume in addition to a cluster.
Also, since each copy needs a complete set of the data it does not seem like this clustering solution addresses growing data size. Correct?
The three-node minimum is to avoid split brain: with only two nodes, losing one leaves no majority to certify writes. You can still access the data with fewer nodes (for recovery, etc.), but the cluster is not HA.
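For intuition, the majority math (just arithmetic, nothing MySQL-specific):

```python
# A group can certify writes only while a strict majority of members is reachable.
for nodes in range(2, 10):                 # InnoDB Cluster supports up to 9 members
    majority = nodes // 2 + 1
    print(f"{nodes} nodes: majority={majority}, tolerated failures={nodes - majority}")
# 2 nodes tolerate 0 failures, 3 tolerate 1, 5 tolerate 2, ... hence the 3-node minimum for HA.
```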
Read throughput will go up, since you can distribute reads amongst the cluster. Write throughput shouldn't really get better, since all nodes keep a full copy of the data and apply every write. Thus the maximum node count is 9.
On the last question: there is data size, and there is working-set size (what needs to be in memory). You can actually stretch the working set by sending certain queries (e.g. reports) to one node and keeping the others for more transactional queries, but for storage on disk, yes, it is a multiple of however many nodes you have.
But also consider: much of the pain I've had with large DBs is not being able to keep enough backups on fast media for quick restore (e.g. I'd like a backup for every day of the last 2+ weeks). From that pain point, though, it's not a multiple; you just need to pick one node to back up.
The TL;DR I give most people is that the Group Replication technology (InnoDB Cluster) uses Paxos for group communication (with majority agreement), while Galera uses Totem single-ring ordering (all nodes).
Interesting. I've been using Galera clustering for a couple of years, and the first months of setting up a new cluster in a new environment have always been plagued by bugs and issues.
Eventually it stabilizes, but I still have binary logs disabled because of an open issue that's over a year old.
I'm afraid MySQL will have the same issues while this is still new.
We do not have current plans for an AWS CloudFormation template.
QA has definitely been one of the harder parts of creating a distributed MySQL. I don't work closely enough on that part to answer this question. Sorry!
In single-primary mode the other nodes are put into read-only mode, so they cannot accept writes. So stale reads cannot happen.
To expand on your question a little: MySQL Router currently works better for single-primary (it supports multi-primary but does not prevent stale reads). If you go third party, ProxySQL also supports Group Replication natively with multi-primary.
I was chatting with the author about this at FOSDEM, since it is possible to keep track of each node's state by following the GTIDs executed in the binary log stream. I believe this is how ProxySQL is doing it, but it's possible the final implementation differs :-)
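Roughly, the GTID trick looks like the sketch below: capture the primary's executed GTID set after a write, then make the replica wait for it before reading. The connection settings and the helper function are made up for illustration, but WAIT_FOR_EXECUTED_GTID_SET() is a real MySQL 5.7+ function, and something along these lines is what a router/proxy can do to avoid stale reads:

```python
# Sketch: "read your own writes" by waiting for the primary's GTIDs on a replica.
# Hostnames/credentials and this helper are invented for illustration.
import mysql.connector

def read_after_write(primary_cfg, replica_cfg, query):
    primary = mysql.connector.connect(**primary_cfg)
    cur = primary.cursor()
    cur.execute("SELECT @@GLOBAL.gtid_executed")
    gtid_set = cur.fetchone()[0]            # everything committed on the primary so far
    primary.close()

    replica = mysql.connector.connect(**replica_cfg)
    rcur = replica.cursor()
    # Block (up to 5 seconds) until the replica has applied that GTID set.
    rcur.execute("SELECT WAIT_FOR_EXECUTED_GTID_SET(%s, 5)", (gtid_set,))
    if rcur.fetchone()[0] != 0:             # 0 = applied, 1 = timed out
        replica.close()
        raise RuntimeError("replica is still behind; a read now could be stale")
    rcur.execute(query)
    rows = rcur.fetchall()
    replica.close()
    return rows
```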
I have been introduced to this by my local MySQL team. Just want to confirm: how is conflict resolution handled? E.g., two different updates on two different servers at about the same time.
CitusDB (not Citus Cloud) is not fully automated: you can have only one master node, and if this node fails you have to manually promote a new master. So it really does not gain anything over stock PostgreSQL for HA.
There's also CockroachDB, which is wire-compatible with Postgres. Of course, there are also some performance trade-offs to consider, and it is currently missing many SQL and data type features available in Postgres and others.
CockroachDB employee here - Yes, CockroachDB is a completely different database from PostgreSQL. CockroachDB supports HA from the bottom up with data broken up into ranges that are consistently replicated using Raft. As mentioned earlier in the thread, it uses the PostgreSQL wire protocol and supports a good amount of SQL (but not all since the dialect is quite large if you include extensions). If you are looking for HA with SQL compatibility, it would be worth taking a look at CockroachDB.
Node and Scala, for example, were open source projects first; then a company came in, hired engineers, and provided support. That model tends to work fine because you still have a rich, diverse community.
It's different when it's a company that open sources a project. It can sometimes be difficult then for that project to have that same diversity of views.
You're right. I think in general the quorum commit in Postgres is a bit lower level, in the sense that it is less opinionated and allows you to configure the quorum (essentially enabling both semisync and synchronous).
Also, as sibling points out, semisync in MySQL seems to commit on the master first no matter what, while in PG quorum commit, no one commits until the required acks are received.
That's not what semi-sync does. Semi-sync commits the write on the master and then blocks on returning a result to the client until N replicas have received the write OR a timeout is reached. There is no ability to roll back the write on the master once it is committed.
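To make the ordering difference being argued about here explicit, here's a toy model in Python. It is not server code; the classes are invented, and it only illustrates the sequencing described in the comments above (master commits first and then waits under MySQL semi-sync, versus the client's COMMIT not returning until enough acks arrive under PG-style quorum commit):

```python
# Toy model of the two commit orderings discussed above. Not real server code.
import random

class Replica:
    def acks_in_time(self):
        return random.random() < 0.9      # pretend most acks arrive before the timeout

def mysql_semisync_commit(replicas):
    # 1) The master commits locally -- this can no longer be rolled back.
    # 2) The client's COMMIT then blocks until one replica acks, or the timeout
    #    elapses and replication silently degrades to async.
    if any(r.acks_in_time() for r in replicas):
        return "committed, replica acked (semi-sync)"
    return "committed anyway, degraded to async"

def pg_quorum_commit(replicas, required_acks=2):
    # The client's COMMIT does not return success until the required number of
    # standbys have acknowledged the commit record.
    acks = sum(1 for r in replicas if r.acks_in_time())
    if acks >= required_acks:
        return "committed (quorum reached)"
    return "COMMIT still blocked, waiting for acks"

if __name__ == "__main__":
    replicas = [Replica() for _ in range(3)]
    print(mysql_semisync_commit(replicas))
    print(pg_quorum_commit(replicas))
```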
I hate this. Kubernetes is designed specifically for stateless applications; everything about running stateful applications on it now is experimental.
Running k8s/Docker in prod is hard enough for actually stateless apps; running a database in it is an utterly horrifying idea.
Please limit the extent of your lemming-ness to at least not attempting to containerize PostgreSQL or other old-school database systems. They could not be less suited to this kind of environment.
> Kubernetes is designed specifically for stateless applications
Is that actually true? The Borg paper[1] mentions that Google uses it to run stateful things like Bigtable and GFS nodes, and as far as I can tell, Kubernetes has the exact same goals as Borg.[2]
I don't think there's a direct quote of k8s devs saying "this is designed for stateless applications" (although there very well may be), but it's pretty evident if you're following.
Blog post on the official site, about 9 months old, whose title expresses surprise that k8s may be reasonable for stateful applications. The title is "Stateful Applications in Containers!? Kubernetes 1.3 Says “Yes!”" [0]
k8s's "StatefulSets", which emerged in 1.5 (4 months ago, Dec 2016) and were transitioned from the previous feature "PetSets", are still marked a "beta" feature in the newest release (1.6, made a couple of weeks ago). [1]
In Nov. 2016, CoreOS issued a blog post with the subheading "Stateless is Easy, Stateful is Hard" while introducing a new pattern called "operators". Operators are intended to help Kubernetes better handle stateful applications (since the habit of k8s admins is to terminate pods casually). These are complicated to implement and haven't been widely adopted; basically, you register a third-party resource type, tell your operator to listen in for pod deletions, and hijacks these commands to ensure that everything is shut down in the correct order and with ample opportunity to save. [2]
-----
This is all just the k8s level -- let's not forget that the container engine underlying k8s will, by default, throw away any changes made on top of the image when the container is removed, and has similar weird issues with stateful applications. Accidentally remove the wrong container on the docker CLI? You have quite the problem now.
And on top of ALL THAT, let's also not forget that these databases are designed to sit on whole boxes and consume almost all of the memory, have kernel-level parameters tweaked to their liking, and so forth.
Though databases can cohabit with other workloads if forced, production deployments almost always give them dedicated hardware, for various reasons related to performance, reliability, and data safety.
Oh, just use Rook, yet another distributed filesystem, you say? They have a nice k8s operator? Please no.
What's the reason to risk what is presumably important data this way? Just so you can say "Hello LITERALLY EVERYTHING runs in k8s now, can I have an award?"
You wanna do it for dev data or something that doesn't matter, go nuts. But please do not do this for something important. The thought that there are people out there doing this to themselves right now, all to gratify their own vanity by being part of the fad despite their obligations to their colleagues and customers, is so depressing.
It's the container-shaped VC bonfire at work. Hype new things to drive adoption.
K8s and Docker Swarm/Compose seem to be trying to replace VMs completely as a general-purpose compute abstraction. Google certainly uses containers that way, though arguably they have a unique situation. Mesos is doing it too, but in a somewhat different way.
Given that, I'd say one of the main features k8s and container platform vendors promote is stateful workloads, even though they're not fully baked. Most stateless cloud platforms like Heroku or Cloud Foundry have tended to shy away from that, for good reason. But I can't count how many unrealistic and unreliable "run this distributed database... on a single Docker host! Slowly!" examples there are out there. "12-factor apps" is now used as a disparaging term for "you can't even persist, bro" rather than something to aspire to, which to me seems seriously misguided.
Disclaimer: I work for a software company that competes in this crazy cloud world.
Hey man, I get that you need something NOW and I am sorry about that, but I have to say this is a teeeeensy bit over the top.
Yeah, StatefulSet is still beta. We're getting miles on it before we tell people that we 100% back it. But you know what? People ARE using it. In production. With real data. And they mostly are just fine.
I run a (tiny) database against k8s. I trust it with my own data.
What you say "the container engine underlying k8s will delete all changes made to the image" is true - don't write files straight to your image FS! This is containers 101. We have PersistentVolumes for this very reason. Data that has a lifetime of its own.
Does this absolve you from backups? No. Does it mean you don't need to think about upgrades? Hell no, you still have to know what your apps are up to.
Nobody is making people use containers for databases, but the power of systems like Kubernetes is pretty addictive, and a lot of people are pouring a lot of energy into this problem.
I understand that k8s is experimental and young -- and IMO that makes it exactly the wrong kind of thing to be running production workloads on, Google association notwithstanding. None of this is meant to assault or attack Kubernetes for what it is; it's meant to highlight the absurdity of how it's being used and promoted.
>What you say ("the container engine underlying k8s will delete all changes made to the image") is true: don't write files straight to your image FS! This is containers 101. We have PersistentVolumes for this very reason: data that has a lifetime of its own.
Yeah, I'm not disputing this. It's just the most immediate and shocking example of how Docker can be tricky for stateful apps. On most systems, if your program writes something to the filesystem, it's expected to persist. Any potential for lost files/data is normally treated as opt-in (writing to a temp folder). Are you sure you got every nook and cranny where your program expected to read from or write to disk, and set all of your symlinks up right, etc.? Why deal with this?
>Does this absolve you from backups? No. Does it mean you don't need to think about upgrades? Hell no, you still have to know what your apps are up to.
The problem with every buzzword or hyped-up piece of software is that everyone just assumes it has magical powers. They have to get some mileage on it to realize that while it may offer some improvements, we still live in the real world.
I know the Kubernetes authors are aware of this, but I wish more Kubernetes users were.
>Nobody is making people use containers for databases, but the power of systems like Kubernetes is pretty addictive, and a lot of people are pouring a lot of energy into this problem.
But why? Kubernetes is cool but it doesn't seem any more "addictive" than what I had before, which effectively did the same thing: a script that spun up an instance from an image, connected to it with Ansible, automatically provisioned everything, and turned it on, and similar scripts that allowed me to view the state of my other instances and apply transformations on them. You can argue that they're different scripts and k8s is one program, but it's really just a difference in invocation.
Obviously I know that Kubernetes operates on containers and not VM instances, but in terms of how it affects our daily lives, k8s is, more or less, just another interface into management/automation technology that we've had since virtualization went mainstream.
What are the truly innovative or unique things it brings to the table? It mostly seems to consolidate these management things into one binary, which, don't get me wrong, is a totally fine thing to do. But it doesn't sound like it would give one "addictive powers".
I don't understand why someone wants to take a square peg and pound it through a round hole. Database servers are designed for dedicated machines. They want all the RAM, they want all the CPU, they want kernel parameters tuned to their liking, sometimes even specific kernel versions (which you can't change from inside a container). In production, if you have any significant amount of traffic, you want to give the database what it wants.
So what practical value do I get by putting PgSQL in a container and/or in Kubernetes? It just sounds like a massive headache for no real benefit.
If they're expressing surprise at being able to run stateful applications, developing complex new patterns, and shipping multiple iterations of experimental features to provide for this feature set, doesn't that show that it wasn't designed for it?
>But your previous comment implied that Kubernetes was designed inherently for stateless tasks only.
I would argue that there are things that make conventional databases an inherently bad fit for containerization, but I didn't say k8s was designed specifically to preclude the possibility of ever running any type of stateful app.
"Omega, an offspring of Borg stored the state of the cluster in a centralized Paxos-based transaction-oriented store that was accessed by the different parts of the cluster control plane (such as schedulers), using optimistic concurrency control to handle the occasional conflicts"
Seems like Google never stored state within the containers.
Kubernetes is designed specifically to abstract the clustered nature of multiple machines into something that's singly addressable. The upcoming addition of federation is intended to add another abstraction on top of that, allowing the addressing of multiple clusters as a single cluster.
The primitives currently included in Kubernetes are mostly geared towards scheduling containerized services onto multiple machines in order to achieve a desired state, but it's only that containerization that has the stateless/ephemeral connotation. Kubernetes is perfectly capable of handling stateful storage and it's as simple as mounting a storage volume: https://kubernetes.io/docs/concepts/storage/persistent-volum...
Although not immediately destroying data on exit is important, stateful applications have other important requirements. StatefulSets are the Kubernetes feature intended to address some of them, but they are not yet marked stable.
These are marked beta, and they don't address all of the needs of stateful applications. This blog post [0] from the official k8s blog covers many of the caveats well, and basically describes databases as the anti-use case with statements like:
>What is the benefit you hope to gain by running your application in a StatefulSet? [...] If you have a few instances of your storage application, and they are successfully meeting the demands of your organization, and those demands are not rapidly increasing, you’re already at a local optimum.
They've been out of beta a few months now and are well supported in GCE. You can't have an application without state, and the k8s team recognizes this. Anyway, OP is about MySQL I guess :)
Please remember that Kubernetes itself is literally less than 2 years GA. Stateful is hard. It wasn't the first thing we tackled. It's in progress now, and feedback so far is pretty good.
That may not be enough for you yet, and that's OK. Check back in 6 months and see how it is progressing.