Regional Persistent Disks on Google Kubernetes Engine (googleblog.com)
106 points by deesix on May 23, 2018 | 12 comments



This is a cool feature. However, what would help me is knowing how often GCE tends to have a zone fail but not the whole region. Personally, I've been using GCP to run cocalc.com since 2014. In the last year I remember two significant outages to our site, which were 100% the fault of Google:

(1) Last week, the GCE network went down completely for over an hour -- this killed the entire region (not just zone!) where cocalc is deployed -- see https://status.cloud.google.com/incident/cloud-networking/18010

(2) Last year, the GCE network went down completely for the entire world (!), and again this made cocalc not work.

In both cases, when the outage happened, having cocalc hosted in multiple zones (but a single region, or in case (2), a single cloud) would not have been enough. I can't recall having to deal with any other significant GCE outages that weren't at least partly my fault. For what it's worth, I used to host cocalc both on premise and on GCE, but can no longer afford to do that.


Last week's outage does look bad, and it clearly would have severely impacted certain dynamic scaling patterns, but "went down completely" does not match what's documented at that link: it prevented the creation of new GCE instances in us-east4 that required allocating/attaching new external IP addresses.

Of particular relevance to this thread, if you had a GKE cluster spun up in that region, that cluster would have continued unaffected based on that description.

GCP does have outages just like AWS does, but in recent years the impact is usually something constrained to certain products and use cases.

(Disclosure: I worked for GCP 2013-2015 but haven't worked for Google since then.)


Somebody on HN once mentioned maintaining duplicate/redundant deployment scripts for two different cloud providers. If one goes down, you 'just' redeploy onto the other cloud provider. So you would be down only for as long as it takes to run a deployment and update DNS.

Minimal additional recurring costs:

* duplicate storage (e.g. GCS and S3)

* engineering cost of maintaining/testing multiple deployments (non-trivial)

This feels very much like a five 9's level optimization that I would not find easy to justify.

It might also be worth investigating how Netflix handles this issue.


What does a failover look like in Kubernetes? From the GCP Docs: [1]

> In the unlikely event of a zonal outage, you can failover your workload running on regional persistent disks to another zone using the force-attach command. The force-attach command allows you to attach the regional persistent disk to a standby VM instance even if the disk cannot be detached from the original VM due to its unavailability.
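
For reference, here's roughly what that force-attach step looks like from the command line. This is just a sketch -- the disk, VM, and zone names are placeholders, and at the time this may have needed the gcloud beta track:

    # Attach the regional PD to a standby VM in the surviving zone, even
    # though the original VM (in the failed zone) never released it.
    gcloud compute instances attach-disk standby-vm \
        --disk my-regional-disk \
        --disk-scope regional \
        --force-attach \
        --zone us-central1-b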

Does the kubernetes.io/gce-pd provisioner have the logic to detect a zone failure in GCP and call the "force-attach" command if a failover is needed? Or does it always try a "force-attach" if a normal attach call fails? How does it handle a split-brain scenario, where the disk is requested by two separate nodes, one in each zone?

[1] https://cloud.google.com/compute/docs/disks/#repds
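
For context, provisioning the regional flavor from Kubernetes seems to go through the same kubernetes.io/gce-pd provisioner, just with an extra StorageClass parameter. A sketch of what I mean (the class name and zones are only examples):

    # Create a StorageClass for regional PDs (zone names are examples)
    kubectl apply -f - <<'EOF'
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: regional-pd-ssd
    provisioner: kubernetes.io/gce-pd
    parameters:
      type: pd-ssd
      replication-type: regional-pd
      zones: us-central1-a, us-central1-b
    EOF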


Note that in https://cloud.google.com/solutions/using-kubernetes-engine-t... they only simulate a zone failure by deleting one zone's instance group.
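
If I'm reading that tutorial right, the "failure" is injected with something along these lines (the group name is a placeholder -- GKE generates one per node pool and zone):

    # List the per-zone managed instance groups backing the cluster's nodes
    gcloud compute instance-groups managed list

    # Delete the group in the zone whose failure you want to simulate
    gcloud compute instance-groups managed delete \
        gke-mycluster-default-pool-xxxxxxxx-grp --zone us-central1-a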


This is going to enable some very simple multi-zone failover for k8s, and I'm excited to try it out!


I wonder what the performance hit from this is.


https://cloud.google.com/compute/docs/disks/#introduction

The only difference in the docs is half the write throughput per GB on SSDs. Latency is probably the real issue, though. Kind of annoying that neither pricing nor performance is even mentioned. Pricing is at a "promo" price of $0.24 per GB for SSD and $0.08 for standard.


> Latency is probably the real issue though

Reads should be just as fast as with zonal disks.

From the docs:

A write is acknowledged back to a VM only when it is durably persisted in both replicas. If one of the replicas is unavailable, Compute Engine only writes to the healthy replica. When the unhealthy replica is back up (as detected by Compute Engine), then it is transparently brought in sync with the healthy replica and the fully synchronous mode of operation resumes. This operation is transparent to a VM.

Regional persistent disks are designed for workloads that require a lower Recovery Point Objective (RPO) and Recovery Time Objective (RTO) compared to using persistent disk snapshots.

Regional persistent disks are an option when write performance is less critical than data redundancy across multiple zones.

(I work for GCP)


Do you use vector clocks or partially ordered sets etc. to detect or prevent split-brain?

E.g., the first replica fails and the second keeps writing; then the second fails and the first recovers, while the second never recovers.

Without vector clocks, the user would never be made aware that data was lost on the second replica. Or, if the second replica does eventually recover, data might be lost when the replicas are merged.


In your example, I think the disk would be entirely read-only.

I would guess the state of each disk replica (HEALTHY, UNHEALTHY) is stored in a master-elected data store. Any time a write to one replica fails, the data store must be updated to mark that replica as failed before the write is considered complete.

Then to change the state back to HEALTHY, everything must be online and fully re-replicated.

If the master-elected state store can't be written to because, due to a network split, it doesn't have sufficient votes to gain mastership, then no writes can occur.


Am I right to assume that a regional persistent disk can only be mounted in write mode on one node at a time, the same as with a zonal persistent disk?



