
It doesn’t have to be this way, but that’s partly a matter of culture. By aspiring to present/think/act as a monoplatform, Google risks substantially increasing the blast radius of individual component failure. A global quota system mediating every other service sounds both totally on brand, and also the antithesis of everything I learned about public cloud scaling at AWS. There we made jokes, which weren’t really jokes, about service teams essentially DoS’ing each other, and this being the natural order of things that every service must simply be resilient to and scale for.

Having been impressed upon by that mindset, my design reflex is instead to aim for elimination of global dependencies entirely, rather than globally rate-limiting the impact of a global rate-limiter.

I’m not saying either is a right answer, but that there are consequences to being true to your philosophy. There are upsides, too, with Google’s integrated approach, particularly notable when you build end-to-end systems from public cloud service portfolios and benefit from consistency in product design, something AWS eschews in favour of sometimes radical diversity. I see these emergent properties of each as an inevitability, a kind of generalised Conway’s Law.




I often hear 'aim for elimination of global dependencies', but the reality is that there is no way around global dependencies. AWS STS or IAM is just as global as Google's. The difference is that Google more often builds with some form of guaranteed read-after-write consistency, while AWS more often 'fails open'. For example, if you remove a permission from a user in GCP, you are guaranteed consistency within 7 minutes [1], while with AWS IAM your permissions may be arbitrarily stale. This means that when the GCP IAM database leader fails, all operations will globally fail after 7 minutes. With AWS IAM, everything continues to work when the leader fails, but as an AWS customer you can never be sure that some policy change has actually become effective.
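
To make the difference concrete, here's a minimal Python sketch of what a bounded-staleness guarantee buys you as a client; `check_access` is a hypothetical stand-in for a policy-evaluation call, not a real GCP or AWS API:

    import time

    STALENESS_BOUND_S = 7 * 60  # GCP's documented revocation bound [1]

    def check_access(principal: str, resource: str) -> bool:
        """Hypothetical stand-in for a real IAM policy-evaluation call."""
        raise NotImplementedError

    def confirm_revocation(principal: str, resource: str) -> None:
        """Poll until a revoked permission is actually denied.

        Under bounded staleness (the GCP model), this loop may safely
        give up after the documented bound: if access is still granted
        by then, something is wrong. Under arbitrary staleness (the
        AWS IAM model), there is no deadline you can assert against;
        the change may simply not have propagated yet.
        """
        deadline = time.monotonic() + STALENESS_BOUND_S
        while check_access(principal, resource):
            if time.monotonic() > deadline:
                raise TimeoutError("revocation not visible within bound")
            time.sleep(5)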

In general, AWS more often shifts the harder parts of global distributed systems onto their customers, rather than solving them for their customers, like GCP does. For example, GCP Cloud Storage (the S3 equivalent) and Datastore (a NoSQL database) provide strongly consistent operations in multi-region configurations, while DynamoDB and S3 have only eventually consistent replication across regions; and Google's VPCs, message queues, console VM listings, and load balancers are global, while AWS's are regional.

[1] https://cloud.google.com/iam/docs/faq#access_revoke


> In general, AWS more often shifts the harder parts of global distributed systems onto their customers, rather than solving them for their customers, like GCP does.

Choice of language in representing this is rather telling, because AWS can (and does) pitch this as a strength, viz. that regionalisation helps customers (especially, significantly, bigco enterprise customers) reason about the possible failure modes, and thereby contain the blast radius of component failure.

They'd never comment on competitors in public, but the clear implication is that apparently-global services merely gloss over the risks rather than resolving them, and eventually it'll blow up in your face, or someone's face at least.

> there is no way around global dependencies

This sounds more like a challenge than an assertion. In my very long experience of tech, anyone who ever said, "you can't do that", eventually ate their hat.


Slow rollouts are a security hole.


Side note: AWS STS has had regional endpoints for years. The global endpoint is vestigial at this point. I didn't glean anything special about Google's endpoint that requires it to be globalized like this, but I can't really criticize it without knowing the details.


S3 is strongly consistent. https://aws.amazon.com/s3/consistency/

Which of Google's NoSQL DBs provides strong consistency: Bigtable? Just confirming.


S3 has been strongly consistent within a single region only since last re:Invent or so (Google Cloud Storage has been strongly consistent for much longer). However, cross-region replication for S3 is based on 'copying' [1], so it is presumably async and not strongly consistent.

GCP Datastore and Firestore are strongly consistent NoSQL databases that are available in multi-region configurations [2].

[1] https://docs.aws.amazon.com/AmazonS3/latest/dev/replication....
[2] https://cloud.google.com/datastore/docs/locations


S3 became strongly consistent only recently (https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...), while I think GCS and Azure Blob Storage have had strong read-after-write consistency for a while now.

In any case, Cloud Spanner provides strong consistency in multi-region deployments.


And GCP storage buckets have been built on top of Spanner since 2018, giving the same guarantees.

If anything, AWS is playing catch-up.


AWS’s regionality and stronger region separation boundaries are a huge selling point for regulated (data regulation) industries and enterprises.

A bank, for instance, may be required to prove it cannot replicate customer data across regions, and that no third party provider will replicate its data using their own BCM or DR systems.

Regardless of CSP, startups should think about rules on movement of data among data jurisdictions (such as GDPR) and architect accordingly.


S3 has strong consistency of list operations now.


Well, GCP followed that principle insofar as their service account auth mechanisms did not fall over. So if you were to compare with AWS, it looks like something similar was happening there.

The infrastructure running the auth service, like any service, is going to have a quota system, whether it's global or not. The lesson learned might be different if it weren't global ("prevent fast changes to the quota system for the auth service") but the conclusion would be substantially similar: there is usually no good reason, and plenty of danger, for routine adjustments to large infrastructure to take place in a brusque manner.

That doesn't mean that non-infrastructure services need to abide by the same rules...


> The infrastructure running the auth service, like any service, is going to have a quota system, whether it's global or not.

Don’t assume that is the case. That’s exactly the kind of cultural assumption I’m speaking of.

Case in point, I routinely run services without quotas or caps and what have you, scale out for load, and alarm on runaway usage, not service unavailable or quota exceeded. I’d rather take the hit than inconvenience my customers with an outage. In this frame of mind, quotas are a moral hazard, a safety barrier with a perverse disincentive.
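
As a sketch of that posture (the handler and the 50k rpm threshold are hypothetical): serve the request regardless, and alert on the spike rather than returning 'quota exceeded':

    import logging

    RUNAWAY_RPM = 50_000  # hypothetical alarm threshold, tuned per service

    def handle(request: str, current_rpm: int) -> str:
        """Serve unconditionally; a usage spike pages a human instead of
        turning the customer away."""
        if current_rpm > RUNAWAY_RPM:
            logging.warning("runaway usage: %d rpm, scale out and investigate",
                            current_rpm)
        return f"served: {request}"  # never rejected for crossing a limit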

That principle of “just run it deep” works even at global scale, right up until you run out of hardware to allocate, which is why growth logistics are a rarely-discussed but critical aspect of running a public cloud.

The core learning becomes how to factor out such situations altogether. That might be through some kind of asynchronous processing (event-driven services, queues, tuplespaces), or co-operative backpressure (a la TCP/IP), and so on. Synchronous request/response state machines are absolute murder for scalability, so HTTP, especially when misappropriated as an RPC substrate, has a lot to answer for.
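
A minimal sketch of the co-operative backpressure idea, using nothing beyond a bounded standard-library queue: the bound throttles a fast producer to the consumer's pace, the same flow-control idea TCP applies at the transport layer.

    import asyncio

    async def producer(q: asyncio.Queue) -> None:
        for i in range(1_000):
            # put() suspends while the queue is full, so a fast producer
            # is slowed to the consumer's pace instead of overwhelming it.
            await q.put(i)

    async def consumer(q: asyncio.Queue) -> None:
        while True:
            item = await q.get()
            await asyncio.sleep(0.001)  # simulate slow downstream work
            q.task_done()

    async def main() -> None:
        q = asyncio.Queue(maxsize=100)  # the bound is what creates backpressure
        worker = asyncio.create_task(consumer(q))
        await producer(q)
        await q.join()  # let in-flight items drain
        worker.cancel()

    asyncio.run(main())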


> Don’t assume that is the case.

What I mean is, it's going to have limits of some sort, right? The world is finite...


Yes, everything has limits. Where Google says "quota system", for normal people that means "buy another computer"; you have hit your quota when you're out of memory / cpu cycles / disk. At Google, they have some extra computers sitting around, but it's still not infinite. Quota is a way of hitting some sort of limit before every atom in the Universe becomes a computer on which to run your program.

I don't think there is any way to avoid it. It sounds bad when it's software that's telling your service it can't write to disk, rather than the disk not having any more free sectors on which to write, but it's exactly the same thing. Everyone has a quota, and left unchecked, your software will run into it.

(In the case of this postmortem, there was a bug in the software, which makes it all feel self-inflicted. But if it wasn't self-inflicted, the same problem would have manifested in some other way.)

There is a comment in this thread where the author says they take fewer risks when the safety systems are turned off. That is fine and nice, but is not really a good argument against safety systems. I have definitely had outages where something hit a quota I set, but I've had more confusing outages from something exhausting all physical resources, and an unrelated system failing because it happened to be nearby. I think you should wear a helmet AND ride safely.


> I think you should wear a helmet AND ride safely.

There's a difference here; helmets are personal safety equipment, which is the proper approach: monitor and manage yourself, don't rely on external barriers. But did you know that a statistically significant proportion of drivers change their behaviour around riders wearing helmets? [1] (That's not a reason to not wear helmets, everyone should ATGATT; it's a reason to change driver behaviour through other incentives.)

We cannot deny the existence of moral hazards. If you want to nullify a well-understood, thoroughly documented, and strongly correlated statistical behaviour, something has to replace it. Google would, apparently, prefer to cover a hard barrier with soft padding. That might help ... until the padding catches fire.

To your example, writing to disk until the OS reports "there are no more sectors to allocate" just means no one was monitoring disk consumption, which would be embarrassing, since that is systems administration 101. Or projecting demand rate for more storage, which is covered in 201, plus an elective in haggling with vendors, and teaching developers about log rotation, sharding, and tiered archive storage.
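
The 101 version is only a few lines. A minimal sketch, assuming a cron-style scheduler and a hypothetical 80% threshold:

    import shutil

    ALERT_AT = 0.80  # hypothetical threshold: page a human long before 100%

    def check_disk(path: str = "/") -> None:
        usage = shutil.disk_usage(path)
        fraction = usage.used / usage.total
        if fraction >= ALERT_AT:
            # In production this would page/alert rather than print.
            print(f"ALERT: {path} is {fraction:.0%} full; act before the OS does")

    if __name__ == "__main__":
        check_disk()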

Actionable monitoring and active management of infrastructure beats automatic limits, every time, and I've always seen it as a sign of organisational maturity. It's the corporate equivalent of taking personal responsibility for your own safety.

[1] http://www.drianwalker.com/overtaking/overtakingprobrief.pdf


I love riding my bike on winding mountain roads. I’m a lot more careful when there’s no safety barrier. Funny thing is, the consequences of slamming into the barrier by mistakenly taking a tight bend at 90 rather than, say, 50, are just as bad as skidding out off a precipice. And I’ve got the scars to prove it.


Do you blog? I really enjoyed reading that.


Thanks. I'm more a forum-dweller when it comes to self-expression. There's an obvious .org but you'll be sorely disappointed, unless you're looking for arcane and infrequent Ruby/Rails tips.


If Google is flaky, you use Yahoo.

If Facebook/Twitter/Instagram is flaky, you wait until it isn't and then post that update.



