SRE is not a byproduct of a bubble economy. I believe Google has had SREs since ...

dekhn · 2023-10-27T21:40:04 1698442804

Pedantically, Google didn't have SREs at the beginning. I asked a very early SRE, Lucas, (https://www.nytimes.com/2002/11/28/technology/postcards-from... and https://hackernoon.com/this-is-going-to-be-huge-google-found...), and he said that in the early days, outages would be really distracting to "the devs like Jeff and Sanjay" and he and a few others ended up forming SRE to handle site reliability more formally during the early days of growth, when Google got a reputation for being fast and scalable and nearly always up.

Lucas helped make one of my favorite Google Historical Artefacts, a crayon chart of search volume. They had to continuously rescale the graph in powers of ten due to exponential growth.

I miss pre-IPO Google and the Internet of that time.

davedx · 2023-10-27T20:10:35 1698437435

> “These days with devops the skill set needed for devs have indeed expanded to have significant overlap with SREs”

Respectfully disagree on this. SRE is a huge complex realm unto itself. Just understanding how all the cloud components and environments and role systems work together is multiple training courses, let alone how to reliably deploy and run in them.

derefr · 2023-10-27T21:06:40 1698440800

But modern approaches to dev require the SWEs to understand and model the operation of their software, and in fact program in terms of it — “writing infrastructure” rather than just code.

Lambda functions, for example: you have to understand their performance and scalability characteristics — in turn requiring knowledge of things like the latency added by crossing the boundary between a managed shared service cluster and a VPC — in order to understand how and where to factor things into individual deployable functions.

zdragnar · 2023-10-27T23:09:33 1698448173

That is barely tip-toeing across the very edges of SRE land.

derefr · 2023-10-28T17:16:37 1698513397

Alright, how about expecting devs to repackage their entire until-that-point-SaaS stack into an "appliance" (Kubernetes Helm chart), containing SWE-written resource manifests that define the application's scaling characteristics across arbitrarily-shaped k8s clusters they won't get to see in advance, using only node taints; memory limits for layers of their stack they've never even seen run full-bore before; health checks that multiplex back up to a central monitoring platform; safely-revertible multiphase upgrade rollout behavior that never decreases availability; and so forth;

...and then those same devs being expected to directly debug the behavior of this "appliance" in a client environment (think: someone consuming the "appliance" through the Amazon Marketplace, where this launches the workload into an EKS cluster in the customer's own VPC, with the customer in control of defining that cluster's node pools);

...where this can involve, for example, figuring out that a seemingly-innocent bounded-size Redis cache deployment needs 10x its steady-state memory when booting from a persisted AOF file... for some godforsaken reason.

bravetraveler · 2023-10-28T03:44:10 1698464650

Yeah, this is buying and using toys. Need to go down a few layers of abstraction.

jeffrallen · 2023-10-27T22:09:17 1698444557

The idea of ops people who wrote code for deployment and monitoring and had responsibility for incident management and change control existed before Google gave it a name.

Source: I was one at WebTV in 1996, and I worked with people who did it at Xerox PARC and General Magic long before then.