
It doesn’t have to be this way, but that’s partly a matter of culture. By aspiring to present/think/act as a monoplatform, Google risks substantially increasing the blast radius of individual component failure. A global quota system mediating every other service sounds both totally on brand, and also the antithesis of everything I learned about public cloud scaling at AWS. There we made jokes, which weren’t really jokes, about service teams essentially DoS’ing each other, and this being the natural order of things that every service must simply be resilient to and scale for.

Having been impressed upon by that mindset, my design reflex is instead to aim for elimination of global dependencies entirely, rather than globally rate-limiting the impact of a global rate-limiter.

I’m not saying either is a right answer, but that there are consequences to being true to your philosophy. There are upsides, too, with Google’s integrated approach, particularly notable when you build end-to-end systems from public cloud service portfolios and benefit from consistency in product design, something AWS eschews in favour of sometimes radical diversity. I see these emergent properties of each as an inevitability, a kind of generalised Conway’s Law.




I often hear 'aim for elimination of global dependencies', but the reality is that there is no way around global dependencies. AWS STS or IAM is just as global as Google's. The difference is that Google more often builds with some form of guaranteed read-after-write consistency, while AWS more often 'fails open'. For example, if you remove a permission from a user in GCP, you are guaranteed consistency within 7 minutes [1], while with AWS IAM your permissions may be arbitrarily stale. This means that when the GCP IAM database leader fails, all operations will globally fail after 7 minutes. With AWS IAM, everything continues to work when the leader fails, but as an AWS customer you can never be sure that some policy change has actually become effective.
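
To make the difference concrete, here's a minimal Python sketch of what a bounded-staleness guarantee buys you as a client; `check_access` is a hypothetical stand-in for a policy-evaluation call, not a real GCP or AWS API:

    import time

    STALENESS_BOUND_S = 7 * 60  # GCP's documented revocation bound [1]

    def check_access(principal: str, resource: str) -> bool:
        """Hypothetical stand-in for a real IAM policy-evaluation call."""
        raise NotImplementedError

    def confirm_revocation(principal: str, resource: str) -> None:
        """Poll until a revoked permission is actually denied.

        Under bounded staleness (the GCP model), this loop may safely
        give up after the documented bound: if access is still granted
        by then, something is wrong. Under arbitrary staleness (the
        AWS IAM model), there is no deadline you can assert against;
        the change may simply not have propagated yet.
        """
        deadline = time.monotonic() + STALENESS_BOUND_S
        while check_access(principal, resource):
            if time.monotonic() > deadline:
                raise TimeoutError("revocation not visible within bound")
            time.sleep(5)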

In general, AWS more often shifts the harder parts of global distributed systems onto their customers, rather than solving them for their customers, like GCP does. For example, GCP Cloud Storage (the S3 equivalent) and Datastore (a NoSQL database) provide strongly consistent operations in multi-region configurations, while DynamoDB and S3 have only eventually consistent replication across regions; and Google's VPCs, message queues, console VM listings, and load balancers are global, while AWS's are regional.

[1] https://cloud.google.com/iam/docs/faq#access_revoke


> In general, AWS more often shifts the harder parts of global distributed systems onto their customers, rather than solving them for their customers, like GCP does.

Choice of language in representing this is rather telling, because AWS can (and does) pitch this as a strength, viz. that regionalisation helps customers (especially, significantly, bigco enterprise customers) reason about the possible failure modes, and thereby contain the blast radius of component failure.

They'd never comment on competitors in public, but the clear implication is that apparently-global services merely gloss over the risks rather than resolving them, and eventually it'll blow up in your face, or someone's face at least.

> there is no way around global dependencies

This sounds more like a challenge than an assertion. In my very long experience of tech, anyone who ever said, "you can't do that", eventually ate their hat.


Slow rollouts are a security hole.


Side note: AWS STS has had regional endpoints for years. The global endpoint is vestigial at this point. I didn't glean anything special about Google's endpoint that requires it to be globalized like this, but I can't really criticize it without knowing the details.


S3 is strongly consistent. https://aws.amazon.com/s3/consistency/

Which of Google's NoSQL DBs provides strong consistency: Bigtable? Just confirming.


S3 has been strongly consistent within a single region only since last re:Invent or so (Google Cloud Storage has been strongly consistent for much longer). However, cross-region replication for S3 is based on 'copying' [1], so it is presumably async and not strongly consistent.

GCP Datastore and Firestore are strongly consistent NoSQL databases that are available in multi-region configurations [2].

[1] https://docs.aws.amazon.com/AmazonS3/latest/dev/replication....
[2] https://cloud.google.com/datastore/docs/locations


S3 became strongly consistent only recently (https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...), while I think GCS and Azure Blob Storage have had strong read-after-write consistency for a while now.

In any case, Cloud Spanner provides strong consistency in multi-region deployments.


And GCP storage buckets have been built on top of Spanner since 2018, giving the same guarantees.

If anything, AWS is playing catch-up.


AWS’s regionality and stronger region separation boundaries are a huge selling point for regulated (data regulation) industries and enterprises.

A bank, for instance, may be required to prove it cannot replicate customer data across regions, and that no third party provider will replicate its data using their own BCM or DR systems.

Regardless of CSP, startups should think about rules on movement of data among data jurisdictions (such as GDPR) and architect accordingly.


S3 has strong consistency of list operations now.


Well, GCP followed that principle insofar as their service account auth mechanisms did not fall over. So if you were to compare with AWS, it looks like something similar was happening there.

The infrastructure running the auth service, like any service, is going to have a quota system, whether it's global or not. The lesson learned might be different if it weren't global ("prevent fast changes to the quota system for the auth service") but the conclusion would be substantially similar: there is usually no good reason, and plenty of danger, for routine adjustments to large infrastructure to take place in a brusque manner.

That doesn't mean that non-infrastructure services need to abide by the same rules...


> The infrastructure running the auth service, like any service, is going to have a quota system, whether it's global or not.

Don’t assume that is the case. That’s exactly the kind of cultural assumption I’m speaking of.

Case in point, I routinely run services without quotas or caps and what have you, scale out for load, and alarm on runaway usage, not service unavailable or quota exceeded. I’d rather take the hit than inconvenience my customers with an outage. In this frame of mind, quotas are a moral hazard, a safety barrier with a perverse disincentive.
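
As a sketch of that posture (the handler and the 50k rpm threshold are hypothetical): serve the request regardless, and alert on the spike rather than returning 'quota exceeded':

    import logging

    RUNAWAY_RPM = 50_000  # hypothetical alarm threshold, tuned per service

    def handle(request: str, current_rpm: int) -> str:
        """Serve unconditionally; a usage spike pages a human instead of
        turning the customer away."""
        if current_rpm > RUNAWAY_RPM:
            logging.warning("runaway usage: %d rpm, scale out and investigate",
                            current_rpm)
        return f"served: {request}"  # never rejected for crossing a limit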

That principle of “just run it deep” works even at global scale, right up until you run out of hardware to allocate, which is why growth logistics are a rarely-discussed but critical aspect of running a public cloud.

The core learning becomes how to factor out such situations altogether. That might be through some kind of asynchronous processing (event-driven services, queues, tuplespaces), or co-operative backpressure (a la TCP/IP), and so on. Synchronous request/response state machines are absolute murder for scalability, so HTTP, especially when misappropriated as an RPC substrate, has a lot to answer for.
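
A minimal sketch of the co-operative backpressure idea, using nothing beyond a bounded standard-library queue: the bound throttles a fast producer to the consumer's pace, the same flow-control idea TCP applies at the transport layer.

    import asyncio

    async def producer(q: asyncio.Queue) -> None:
        for i in range(1_000):
            # put() suspends while the queue is full, so a fast producer
            # is slowed to the consumer's pace instead of overwhelming it.
            await q.put(i)

    async def consumer(q: asyncio.Queue) -> None:
        while True:
            item = await q.get()
            await asyncio.sleep(0.001)  # simulate slow downstream work
            q.task_done()

    async def main() -> None:
        q = asyncio.Queue(maxsize=100)  # the bound is what creates backpressure
        worker = asyncio.create_task(consumer(q))
        await producer(q)
        await q.join()  # let in-flight items drain
        worker.cancel()

    asyncio.run(main())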


> Don’t assume that is the case.

What I mean is, it's going to have limits of some sort, right? The world is finite...


Yes, everything has limits. Where Google says "quota system", for normal people that means "buy another computer"; you have hit your quota when you're out of memory / cpu cycles / disk. At Google, they have some extra computers sitting around, but it's still not infinite. Quota is a way of hitting some sort of limit before every atom in the Universe becomes a computer on which to run your program.

I don't think there is any way to avoid it. It sounds bad when it's software that's telling your service it can't write to disk, rather than the disk not having any more free sectors on which to write, but it's exactly the same thing. Everyone has a quota, and left unchecked, your software will run into it.

(In the case of this postmortem, there was a bug in the software, which makes it all feel self-inflicted. But if it wasn't self-inflicted, the same problem would have manifested in some other way.)

There is a comment in this thread where the author says they take fewer risks when the safety systems are turned off. That is fine and nice, but is not really a good argument against safety systems. I have definitely had outages where something hit a quota I set, but I've had more confusing outages from something exhausting all physical resources, and an unrelated system failing because it happened to be nearby. I think you should wear a helmet AND ride safely.


> I think you should wear a helmet AND ride safely.

There's a difference here; helmets are personal safety equipment, which is the proper approach: monitor and manage yourself, don't rely on external barriers. But did you know that a statistically significant proportion of drivers change their behaviour around riders wearing helmets? [1] (That's not a reason to not wear helmets, everyone should ATGATT; it's a reason to change driver behaviour through other incentives.)

We cannot deny the existence of moral hazards. If you want to nullify a well-understood, thoroughly documented, and strongly correlated statistical behaviour, something has to replace it. Google would, apparently, prefer to cover a hard barrier with soft padding. That might help ... until the padding catches fire.

To your example, writing to disk until the OS reports "there are no more sectors to allocate" just means no one was monitoring disk consumption, which would be embarrassing, since that is systems administration 101. Or projecting demand rate for more storage, which is covered in 201, plus an elective in haggling with vendors, and teaching developers about log rotation, sharding, and tiered archive storage.
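
The 101 version is only a few lines. A minimal sketch, assuming a cron-style scheduler and a hypothetical 80% threshold:

    import shutil

    ALERT_AT = 0.80  # hypothetical threshold: page a human long before 100%

    def check_disk(path: str = "/") -> None:
        usage = shutil.disk_usage(path)
        fraction = usage.used / usage.total
        if fraction >= ALERT_AT:
            # In production this would page/alert rather than print.
            print(f"ALERT: {path} is {fraction:.0%} full; act before the OS does")

    if __name__ == "__main__":
        check_disk()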

Actionable monitoring and active management of infrastructure beats automatic limits, every time, and I've always seen it as a sign of organisational maturity. It's the corporate equivalent of taking personal responsibility for your own safety.

[1] http://www.drianwalker.com/overtaking/overtakingprobrief.pdf


I love riding my bike on winding mountain roads. I’m a lot more careful when there’s no safety barrier. Funny thing is, the consequences of slamming into the barrier by mistakenly taking a tight bend at 90 rather than, say, 50, are just as bad as skidding out off a precipice. And I’ve got the scars to prove it.


Do you blog? I really enjoyed reading that.


Thanks. I'm more a forum-dweller when it comes to self-expression. There's an obvious .org but you'll be sorely disappointed, unless you're looking for arcane and infrequent Ruby/Rails tips.


If Google is flaky, you use Yahoo.

If Facebook/Twitter/Instagram is flaky, you wait until it isn't and then post that update.



