Our main hindrance with gRPC was that several disparate teams ran into strange issues with the fairly opaque runtime. The “batteries included” approach made debugging the root causes quite difficult.
As a result of the above, we have been exploring Twirp. You get the benefits of using protobufs to define the RPC interface, but without as much of the runtime baggage that complicates debugging when issues arise.
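Part of the appeal is that a Twirp call is just an HTTP POST to /twirp/<package>.<Service>/<Method>, so you can reproduce a request by hand when something misbehaves. A rough sketch in Go (the service, method, and URL are made up), using the JSON content type so the traffic stays readable:

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Twirp routes every RPC to POST /twirp/<package>.<Service>/<Method>.
        // Service name, method, and URL below are made up for illustration.
        // The JSON content type keeps the traffic readable in a proxy or
        // packet capture; application/protobuf works the same way.
        req := bytes.NewBufferString(`{"inches": 12}`)
        resp, err := http.Post(
            "http://localhost:8080/twirp/example.Haberdasher/MakeHat",
            "application/json",
            req,
        )
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Println(resp.Status, string(body))
    }

Either content type leaves you with a plain HTTP exchange to stare at, instead of an opaque runtime.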
That's always the problem with "batteries included". If they don't work, it's often not worth the effort to fix them; you gotta toss 'em.
I'm curious what languages you were using gRPC with. The batteries-includedness across tons of languages is a big part of gRPC's appeal. I'd assume Java and C++ get enough use to be solid, but maybe that's wishful thinking?
One issue we encountered across several services was gRPC Ruby clients that semi-regularly blocked on responses for an indeterminate amount of time. We added lots of tracing data on the client and server to instrument where the “slowness” was occurring. We would see every trace span look just like you would hope, until the message went into the runtime and failed to get passed up to the caller for some long, seemingly random period of time. Debugging what was happening between the network response (fast) and the parsed response being handed to the caller (slow) was quite frustrating, since it required digging into the C bindings/runtime from the Ruby client.
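For a concrete picture of the client-side half of that instrumentation (ours was on the Ruby side; this is only a sketch in Go, with a hypothetical target address), a unary client interceptor can record the gap between handing a request to the runtime and getting the decoded reply back:

    package main

    import (
        "context"
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    // Logs the wall-clock time between handing a request to the gRPC runtime
    // and the decoded reply coming back to the caller. Comparing this against
    // the matching server-side span shows how much time vanished inside the
    // client runtime or on the network.
    func timingInterceptor(
        ctx context.Context,
        method string,
        req, reply interface{},
        cc *grpc.ClientConn,
        invoker grpc.UnaryInvoker,
        opts ...grpc.CallOption,
    ) error {
        start := time.Now()
        err := invoker(ctx, method, req, reply, cc, opts...)
        log.Printf("%s took %s (err=%v)", method, time.Since(start), err)
        return err
    }

    func main() {
        // Target address is hypothetical; any generated stub built on this
        // connection gets per-call latency logging from the caller's side.
        conn, err := grpc.Dial(
            "localhost:50051",
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithUnaryInterceptor(timingInterceptor),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
    }

The interesting calls are the ones where this caller-side number is large while the server-side span stays short.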
It was a couple of years ago, but the Go gRPC library had pretty broken flow control. gRPC depends on both ends having an accurate picture of in-flight data volumes, both per-stream and per-transport (muxed connection). It's a rather complex protocol, and it isn't rigorously specified for error cases. The main problem we encountered was that errors, especially timed-out transactions, would cause the gRPC library to lose track of buffer ownership (in the host-to-host sense), resulting in a permanent decrease of a transport's available in-flight capacity. Eventually it would hit zero and the two hosts would stop talking. Our solution was to patch out the flow control (we already had app-level mechanisms).
[edit: The flow control is actually done at the HTTP/2 level. However, the Go gRPC library has its own implementation of HTTP/2.]
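A toy model of that failure mode (assuming nothing about the real gRPC internals): the sender spends transport credit on every write, and if an errored or timed-out stream's bytes are never credited back via WINDOW_UPDATE, the usable window ratchets down until the connection stalls.

    package main

    import "fmt"

    // Toy model of credit-based flow control on a single transport; only
    // meant to illustrate the failure mode above, not the real gRPC code.
    type transport struct {
        window int // bytes the peer currently allows us to have in flight
    }

    // send spends credit for bytes written to the wire.
    func (t *transport) send(n int) bool {
        if n > t.window {
            return false // no credit left; the transport stalls
        }
        t.window -= n
        return true
    }

    // ack returns credit when the peer sends a WINDOW_UPDATE. In the broken
    // case, errored or timed-out streams never make it here.
    func (t *transport) ack(n int) { t.window += n }

    func main() {
        t := &transport{window: 64 * 1024}
        for i := 0; ; i++ {
            if !t.send(16 * 1024) {
                fmt.Printf("request %d stalls: transport window exhausted\n", i)
                break
            }
            // Simulate a timed-out request whose credit is never acked back:
            // each failure permanently shrinks the usable window.
            fmt.Printf("request %d times out; window now %d bytes\n", i, t.window)
        }
    }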
This isn’t to say it happened on every response; it was a relatively small fraction. But it was enough to tell that _something_ was going on. Who knows, it could be something quirky on the network and not even a gRPC issue. But because the runtime is so opaque, it made debugging quite difficult.
Even Java codegen has issues - like a single class file so big it crashes any IDE not explicitly set up for it, or a whole bunch of useless methods that make autocomplete terrible.