> You seem to be thinking of it in terms of sheer number of services or engineers working on them.
I'm not sure where I said that, but yes, that's part of the switching cost.
> The fact is that the highly demanding services have the huge majority of the resources, and are the most sensitive to performance issues. If your service uses 10% of Google's datacenter space, you won't accept a 5% or even 1% regression just so you can port to gRPC,
The thrust of my statement was that for many services, RPC overhead is minimal, so even a 2x or 3x increase in RPC overhead is still minimal. I agree that a 5% increase in resource utilization for a large service is something that would be weighed. But let's explore that idea for a moment.
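First, to make the "minimal for many services" point concrete, here's a back-of-the-envelope sketch in Python. The numbers are entirely made up for illustration; nothing here is a measurement from any real service:

    # Toy model: scale only the RPC slice of a service's CPU and see
    # how much total usage moves. All inputs are hypothetical.

    def total_usage(cores, rpc_fraction, rpc_multiplier):
        """Total cores after multiplying just the RPC overhead."""
        rpc = cores * rpc_fraction * rpc_multiplier
        rest = cores * (1 - rpc_fraction)
        return rpc + rest

    # A modest service spending 2% of its CPU on RPC plumbing:
    print(total_usage(100.0, 0.02, 3.0))      # 104.0 cores: 3x the RPC cost is +4%

    # The same +4% on a 100,000-core fleet is 4,000 extra cores:
    print(total_usage(100_000.0, 0.02, 3.0))  # 104000.0 cores

The relative hit is identical; it's the absolute core count that changes the calculus for the very largest services. Now, on to that large-service case: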
> because at that scale your team can just staff someone or even several people to maintain the pre-gRPC system forever and still come out ahead on the budget.
Not necessarily. Engineers are expensive and becoming ever more so, while computing resources are becoming ever cheaper. Not only that, but engineers tend to be specialized, so you can't just task anyone with maintaining the previous system; it tends to require people who already have deep expertise in it. Those people also have career aims beyond long-term support of a deprecated system, so there's retention to consider.
Pretend for a moment that all your services except a small handful have moved from some system A to some system B. If the maintenance burden of keeping system A alive starts to eclipse the resource cost of moving to system B (a comparison that tilts further toward B all the time, thanks to ongoing improvements in system B, the rising cost of maintaining system A, and the monotonic decline in the price of computing resources), then you might well just swallow the 5%-10% increase in resources, either permanently or temporarily, and come out ahead in the end.
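A sketch of that break-even logic, with every figure a placeholder (salaries, fleet costs, and trend rates are invented to show the shape of the comparison, not to model any real system):

    # Toy break-even model: keep paying specialists to maintain the
    # deprecated system A, or eat a ~7% resource regression on system B
    # running on hardware whose price declines every year.

    def cost_staying_on_a(years, maintainers=1, salary=250_000,
                          salary_growth=0.05):
        """Cumulative staffing cost of keeping system A alive."""
        return sum(maintainers * salary * (1 + salary_growth) ** y
                   for y in range(years))

    def cost_moving_to_b(years, annual_fleet_cost=10_000_000,
                         regression=0.07, price_decline=0.15):
        """Cumulative cost of the resource regression on system B."""
        return sum(annual_fleet_cost * regression * (1 - price_decline) ** y
                   for y in range(years))

    for y in (1, 5, 10, 15):
        print(y, cost_staying_on_a(y) > cost_moving_to_b(y))
    # 1 False, 5 False, 10 False, 15 True

With these invented inputs the regression becomes the cheaper option somewhere past the ten-year mark; the point is that a crossover exists and keeps moving toward B, not where it lands.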
Additionally, as system B moves on, staying on system A becomes increasingly risky: security improvements, new features, and layers that no longer know about system A all threaten the stability of your service. If you've read the SRE book, you'll know that our SLOs are more important than any one resource. If nobody trusts your service to operate, they won't use it, and then you won't have to worry about resources at all, since the users will have moved on.
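As an aside, for anyone who hasn't read the SRE book: an SLO translates directly into an error budget, which is why it can outrank any single resource metric. The arithmetic is generic (these are not any particular service's targets):

    # Error-budget arithmetic: an availability SLO caps how much
    # unreliability a service may "spend" per period.

    def error_budget_minutes(slo, days=30):
        """Allowed downtime per period for a given availability SLO."""
        return (1 - slo) * days * 24 * 60

    print(error_budget_minutes(0.999))   # 43.2 minutes per 30 days
    print(error_budget_minutes(0.9999))  # ~4.3 minutes per 30 days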
> because at that scale your team can just staff someone or even several people to maintain the pre-gRPC system forever and still come out ahead on the budget.
To reiterate the point above, these roles tend to be fairly specialized and hard to staff. Arguably, these same engineers are better tasked with making system B good enough to switch to, so you can thank system A for its service and show it the door.
Bringing this back to Stubby vs. gRPC, it's a pretty academic argument so far. They're both here to stay. And honestly, when we say "Stubby", there are already different versions of Stubby that interoperate with each other, and gRPC will be no different. Likewise, we still use proto1 in addition to proto2 and proto3 (the public versions), since moving off of it just takes time and energy.
We make these kinds of decisions every day, and they're not always in favor of reduced resource usage. If we cared about nothing other than resource utilization, we'd be completely C++: no Java, no Python. Realistically, when two systems fill equivalent roles, the cost of maintaining both often leads to one winning out, usually the more maintainable one, so long as their feature sets are roughly equivalent. We're fortunate to be in a position where we can choose code health and uniformity of vision over absolute minimum resource utilization. And even if we choose system B (higher resource usage) over system A, differences in architecture or design choices may mean that system B's performance ceiling ends up higher than system A's, despite starting lower. Sometimes it takes a critical mass of adopters to really shake out all those issues.
I know that quotes from Knuth are often trotted out during these kinds of discussions, but it's true: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
That 3% is where we choose to spend our effort, and that critical 3% includes the ability of our engineering force to make forward progress without being hindered by too much debt. It also includes real data; see our Google-Wide Profiling paper [1].
> Totally agree that world-facing APIs will all be gRPC and that makes perfect sense to me.
Probably not all. We still fully support HTTP/JSON APIs, but at least in our little corner of the world we've chosen to take full advantage of gRPC.
Anyways, thanks for letting me stand on my soapbox for a bit.

[1] https://storage.googleapis.com/pub-tools-public-publication-...

Interesting that you allude to the coexistence of C++, Java, Python, and Go, because I think it bolsters my point. The overwhelming majority of services at Google are in C++. There are individual C++ services that consume more resources than all Java products combined. I think this speaks to the appetite for performance and efficiency within the company, since C++ is demonstrably the most difficult of these languages.