All infra teams eventually become platforms. All product teams eventually become experiences. When viewed negatively this is called scope creep. I don't know what it's called when viewed positively but I expect the word "holistic" to be used unironically.
Org charts that ship a platform are default stable because everybody in a team or group is doing approximately the same things. Growth is less uncomfortable, advancement feels more objective, and individual developers are relatively interchangeable.
But what if a company needs to change? Now the stable org chart resists that change by rejecting requests from client teams that are responsible for a new set of objectives. This recurses. One layer of platform can simultaneously be moving too slowly for the layer above and too quickly for the one below. Shear forces tear it apart and the organization finds itself with n (3 < n < 6) fewer platform engineers.
In my experience, having a formal platform structure helps tremendously when rolling out new solutions. When the platform officially provides something, everybody in the company can make use of it right away, unlike the pet project that's being pushed by a random manager.
A breakthrough can be pushed by rolling out a second library/framework/platform, like AWS ELB and ALB. Then developers can adopt the latter if it's so much better, but they won't, because it's 90% the same and who wants to work on migrations?
Large organizations are fundamentally split apart. The first part of the org wants A and B. The second part wants B and C. The developer team on the next floor is rolling their own thing to do C and D. All while features A and C are incompatible, so it's impossible to satisfy everyone. There is no solution to resolve internal conflicts (except maybe reducing a large company to 20% of its current workforce).
The big thing I remember about their approach that surprised everyone: there was no mandate to use the platform/central team's tools. It made me chuckle how many times the presenter was grilled by everyone about that. It was like some of the audience straight up thought he was lying about that.
But basically, if you have a platform team, and a mandate to use that team's tools, well, the other teams aren't really "customers", in the sense you can leverage choice as a signal. So you have to make up for that. In my experience, you need very good management, and constant, multifaceted communication. Which... might work, might not.
It's probably best to delay any kind of centralized/platform work until you have a _very_ clear pain that defines a very clear set of roles and requirements. Unless everyone says "Oh shit this is amazing" ... just say no.
Mandate makes more sense the lower in the stack you go.
“This is the cluster scheduler you must use to run code in the datacenters” Ok.
“Since part of your feature communicates with end users, it must be implemented in this visual programming language we created for workflows that interact with end users.” Not fine. This is how you end up with abominations of workarounds on workarounds. Let me use my regular tools and give me a damn API. Your decision to invent a shitty half baked programming environment is not my problem. When you try to make it my problem you are only creating more damage.
It doesn't make sense at the datacenter level either.
How are ops/developers supposed to use ansible or salt or docker or kubernetes when the only available way to access/deploy to the servers is the one centrally approved tool?
IMO one of the keys to a platform engineering group's success (alluded to in this post) is having a mindset in which other engineering teams are the customer. Once you flip into this mode it becomes a lot more clear how your platform is really a product, and the platform engineering team fits much more seamlessly into the overall organization.
You just wrote _exactly_ what I and others implored the new VP (for who knows what friends-hiring-friends reason) to make reality in the little company I just escaped. The idea of platform engineering NOT requiring other highly technical experts to stop their productive jobs to track weekly vicissitudes in helm/k8s configurations and blog posts was foreign to them. The company is in a downward spiral, but can't grasp why decimating productivity throughout engineering might be related. Cartoons about dumpster fires come to mind (not joking) when I think of engineering there.
I hear you. Not passing judgment one way or the other as I don't know the details, but it sounds like your former team fell into the trap of allowing engineering teams to kick subpar outputs over the proverbial fence to one another.
The way you ultimately solve this is by aligning incentives, eg "platform engineers get fat bonuses/promotions when the products built on top of their platform kick ass."
I get a bit frustrated by this view solely because internally, folks aren't _just_ your customers. They are your co-workers, and the same goes for them: their co-workers are providing the platform.
The extra bit of empathy makes all the difference, because without fantastic personal communication, 'platform' could be a waste of time for everyone involved.
Exactly. Once you view other teams as your customers, it helps you focus on things like quality, satisfaction, and adding value in a very clarifying way.
And honestly other eng teams kind of are your customers. They may not pay you directly, but they also help to build your product and the company that cuts your (and everyone's) paycheck.
Regarding treating the other eng. teams as customers, I fully agree. I know it's not something everyone does (Been there, argued that), but when you treat them that way and when you deal with their problems (Service xyz requests are too slow, feature xyz would really solve our problems ...) you will, in the end, help the customers.
If my coworkers need my services, then that's because they are developing something that a customer needs. I think of it like a dependency-tree. As long as you trust your company to not have multiple teams develop something that will never see the light of day, or something the user doesn't want or need, then this mindset is absolutely a good one (Sadly been there, done that too).
Yup, it isn't perfect but provided that everyone is largely seeking the same goal (company success) it tends to get you to better outcomes.
Btw this can be true of other departments as well – for example, SaaS product marketing is often an entity that exists to serve internal "customers." Product management can be envisioned this way as well.
Good way to put it, but I feel it's incomplete without adding
>And being able to effectively and quickly incorporate feedback from consuming teams into the product.
Without that, you can have a quality platform team pushing out good products, but if it doesn't align with what other teams need or you don't expose good override hooks, then the end effect is the teams will fragment into doing what works best for them.
Yes they can. There's a name for that: shadow IT. All of a sudden you realize that the annoying engineers have stopped annoying you not because they became better people, but because they got the credit card of some VP and got their own platform on AWS.
Sure is, the VP is VP of Marketing and he puts the expenses under media research or something, and they put the whole company CRM data into Mongo or Elastic on AWS with no auth, open to the whole internet to see. And sadly that's more daily news than fiction.
100%. This is why it's leadership's job to align incentives such that teams are motivated / rewarded for providing other teams with a high-quality service.
Yup, for sure. Having PMs who understand infrastructure and distributed systems can be a secret weapon for high-scale businesses, especially in enterprise SaaS where competition is more directly head-to-head.
The author describes Terraform and CloudFormation as "imperative". This doesn't seem correct to me, although you can force a sort of imperative flow by manually defining your dependencies in a specific order. I have only a little experience with Ansible, but I would say that it is the only major imperative-ish IaC tool (at least the way I used it) aside from bash scripts or working with SDKs directly.
Terraform in and of itself is declarative, but it behaves in an imperative sort of way with the various backends that it supports.
These shortcomings all manifest themselves in how state is managed. Terraform state is declaratively described, and it may or may not match the state of the backend. Once this state drift exists, it becomes difficult to correct.
This is my primary criticism of Terraform and one of the reasons I prefer Kubernetes. I know it's an apple to orange comparison, but in Kubernetes there is both declarative configuration and active reconciliation. You have both current state and desired state and a set of controllers seeking to make them match. I'd love to see this implemented with Terraform.
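To make the "desired state plus active reconciliation" idea concrete, here's a minimal toy sketch in Python. It is not how Kubernetes controllers are actually implemented; the `reconcile` function and the dict-based "resources" are made up purely for illustration.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Do one reconciliation pass, mutating `actual` toward `desired`.

    Returns the actions taken; an empty list means the states already match.
    All names here are hypothetical, not a real controller API.
    """
    actions = []
    # Create or update anything that is missing or differs from the spec.
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(("create", name))
        elif actual[name] != spec:
            actual[name] = dict(spec)
            actions.append(("update", name))
    # Delete anything that exists but is no longer desired.
    for name in list(actual):
        if name not in desired:
            del actual[name]
            actions.append(("delete", name))
    return actions


desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
actual = {"web": {"replicas": 2}, "cache": {"replicas": 1}}

first_pass = reconcile(desired, actual)   # corrects the drift
second_pass = reconcile(desired, actual)  # nothing left to do
```

A real controller would run a pass like this continuously (or on change events), which is the part a one-shot plan/apply workflow doesn't give you.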
Terraform attempts to refresh its state from the source of truth (eg aws apis) before planning. It’s not always possible, but often it should work just fine even if you’ve modified a resource outside of terraform.
Terraform can mostly make the current state match the desired state, but the challenge is the real-world side effects, such as the data in a database that is about to get destroyed, or downtime of services that depend on the resources being managed. So you can't blindly allow it to do what it wants.
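As a rough sketch of why you can't apply blindly: compute the diff first, then hold back anything destructive for a human to approve. This is a toy model in Python, not Terraform's implementation; `plan`, `needs_review`, and the resource names are hypothetical. The real-world equivalent is reading the `terraform plan` output before running `terraform apply`.

```python
def plan(desired: dict, actual: dict) -> list:
    """Diff the desired config against refreshed real state; changes nothing."""
    steps = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            steps.append(("create", name))
        elif name not in desired:
            steps.append(("destroy", name))
        elif desired[name] != actual[name]:
            steps.append(("update", name))
    return steps


def needs_review(steps: list) -> list:
    """Return the steps a human should approve before anything is applied."""
    return [step for step in steps if step[0] == "destroy"]


# Hypothetical scenario: the database was dropped from the config by mistake.
desired = {"web": {"size": "large"}}
actual = {"web": {"size": "small"}, "orders_db": {"size": "large"}}

steps = plan(desired, actual)
risky = needs_review(steps)
```

Here the plan is technically "correct" (it would make reality match the config), but the `destroy` step on `orders_db` is exactly the kind of side effect the tool can't judge for you.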
Yeah, that's the wrong distinction between Terraform/CloudFormation and Kubernetes. Terraform/CloudFormation try to be declarative. The distinction is more that Terraform/CloudFormation are about provisioning infrastructure.
I had a really bad experience in an organisation possessed by a platform team. A small number of individuals were rolling change after change that impacted hundreds of engineers, halving their productivity and halting all development to a grind once a month.
This is a typical challenge for platform teams. Because they are responsible for the underlying foundation on which applications are built, a mistake can easily impact all those applications.
An important realization in my opinion is that a platform team is just another development team. They should consider their platform as a product and the developers as their customers. To minimize any downtime they should use the typical mechanisms that developers are also using: automated tests, pair programming, rolling out changes to test environments first, etc.
I'm sorry to hear that. As someone who has done this role in the past, I can tell you that it's the opposite of what a platform team should be doing.
When done right a platform team should basically not be noticed except that the tooling, developer experience and overall reliability of a system goes up.
> When done right a platform team should basically not be noticed except that the tooling, developer experience and overall reliability of a system goes up.
Did the platform team use their own platform? I have a theory that that's the right way to set up incentives, but I'm curious whether it works in practice.
I sort of work on platform engineering in the form of a test automation framework used by more than a hundred testers. In general it works fine, but you have to be ready to scale up your platform team as fast as, or even faster than, the number of platform users grows.
You also have to be very careful to let users extend the platform to fit their needs, or the platform team will be a permanent bottleneck. I have seen this in companies where the SAP people had a backlog of several years.
+1 on this. Started getting contacted on LinkedIn and seeing job postings for platform engineering last year. I have a strong feeling this could be the next buzzword of the 2020s.
People view SRE in many different ways. If you were to go through the Google SRE book, there's nothing explicit in there about building "platform" or doing any infrastructure engineering.
It's a given that SREs build tooling, but most of the SRE-focused work, as described in that book, is around improving the resilience of a given application or product: addressing production readiness, defining SLOs, and handling incident response.
There are platform-specific SRE teams at Google, but there's not much published about how they go about creating a platform.
The book "Seeking SRE" makes it clear that in most places, the notion of "platform engineering" varies tremendously.
I don't know of any authors that have addressed this explicitly other than Susan Fowler in "Production Ready Microservices" who writes:
Another important part of microservice adoption is the
creation of a microservice ecosystem. Typically (or, at
least, hopefully), a company running a large monolithic
application will have a dedicated infrastructure
organization that is responsible for designing, building,
and maintaining the infrastructure that the application runs
on. When a monolith is split into microservices, the
responsibilities of the infrastructure organization for
providing a stable platform for microservices to be
developed and run on grows drastically in importance. The
infrastructure teams must provide microservice teams with
stable infrastructure that abstracts away the majority of
the complexity of the interactions between microservices.
The book Team Topologies discusses Platform Engineering. They consider four core team types, and the platform team is one of those. Some more information about those types can be found here: https://teamtopologies.com/key-concepts-content/what-are-the...
I can definitely recommend this book for a (new) perspective on how platform teams fit into organizations.
> People view SRE in many different ways. If you were to go through the Google SRE book, there's nothing explicit in there about building "platform" or doing any infrastructure engineering.
In a past life I worked on an SRE team that backed up data to cassette tapes using a fleet of robots with lasers. https://youtu.be/kQ2taAttvwo That was easy. Hard SRE work is more like Traffic Team.
My personal view is that microservices, libraries, frameworks, apis, abis, etc. have a great deal in common today with how we've always thought about internet routers. Ask yourself this: if JavaScript web apps could run traceroute, then how many stakeholders do you think would show up before your query even reaches the wire?
So I think what the author Nick is saying makes perfect sense. Yes, these platform layers can cause difficulties. Can we fix them? Yes, just read the SRE book. That same wisdom will carry over just fine to this kind of problem.