Bidi streaming breaks a whole bunch of assumptions you normally rely on in fault-tolerant software:
- there are multiple ways to retry - you can retry establishing the connection (e.g. say DNS resolution fails for a 30s window) _or_ you can retry establishing the stream
- your load-balancer needs to persist the stream to the backend; it can't just re-route per single HTTP request/response
- how long are your timeouts? if you don't receive a message for 1s, OK, the client can probably keep the stream open, but what if you don't receive a message for 30s? this percolates through the entire request path, generally in the form of "how do I detect when a service in the request path has failed"
> - there are multiple ways to retry - you can retry establishing the connection (e.g. say DNS resolution fails for a 30s window) _or_ you can retry establishing the stream
This isn't a difficult problem to solve. We apply both of those strategies depending on circumstances. We can even re-connect clients to the same backend after long disconnection periods to support upload resuming etc.
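To make the "both strategies" point concrete, here is a rough sketch of two nested retry layers: retry the stream on the same connection first, and only reconnect when that stops working. The `connect`, `open_stream` and `handle` callables, the two exception types and the offset-based resume are all made up for illustration; this is not any particular library's API.

```python
import time

class ConnectError(Exception):
    """Transport could not be established (DNS failure, refused, ...)."""

class StreamError(Exception):
    """The stream broke; the underlying connection may still be usable."""

def run_with_retries(connect, open_stream, handle, *, max_connects=5,
                     max_streams=3, backoff=0.5, sleep=time.sleep):
    """Consume a stream, retrying at the stream level first and at the
    connection level second. Returns the number of messages handled."""
    offset = 0  # resume point, so a new stream continues where we stopped
    for attempt in range(max_connects):
        try:
            conn = connect()
        except ConnectError:
            # Connection-level retry with exponential backoff.
            sleep(backoff * 2 ** attempt)
            continue
        for _ in range(max_streams):
            try:
                for msg in open_stream(conn, offset):
                    handle(msg)
                    offset += 1
                return offset  # stream ended cleanly
            except StreamError:
                continue  # same connection, new stream, resume from offset
        # Stream retries exhausted on this connection: loop around and reconnect.
    raise ConnectError("gave up after %d connection attempts" % max_connects)
```

Passing `offset` into `open_stream` is what makes resuming possible: the server (or whatever sits behind the hypothetical `open_stream`) has to be able to pick up from a position rather than from the start.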
> - your load-balancer needs to persist the stream to the backend; it can't just re-route per single HTTP request/response
This applies whether the stream is uni- or bi-directional. We already have uni-directional streams working well at scale, so this is not a concern.
> - how long are your timeouts? if you don't receive a message for 1s, OK, the client can probably keep the stream open, but what if you don't receive a message for 30s? this percolates through the entire request path, generally in the form of "how do I detect when a service in the request path has failed"
We maintain streams for very long periods: hours or days. Clients can detect dropped streams (we propagate errors in both directions, although AWS ALBs are causing problems here), and the client knows how to re-establish a connection. And again, this applies whether streams are uni- or bi-directional.
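For the "1s vs 30s" question, one common approach is an idle watchdog: every message or keepalive resets a timer, and the stream is treated as dead once the silence exceeds some limit. A minimal sketch, assuming a 30s limit and a hypothetical `IdleWatchdog` class (the name, the limit and the injectable clock are illustration only, not how the system described above necessarily does it):

```python
import time

class IdleWatchdog:
    """Tracks how long a stream has been silent."""

    def __init__(self, idle_limit, clock=time.monotonic):
        self.idle_limit = idle_limit  # seconds of silence we tolerate
        self._clock = clock           # injectable so tests need no real sleep
        self._last_seen = clock()

    def touch(self):
        """Call on every received message or keepalive."""
        self._last_seen = self._clock()

    def idle_for(self):
        """Seconds since the last message or keepalive."""
        return self._clock() - self._last_seen

    def is_dead(self):
        """True once the silence has exceeded the limit."""
        return self.idle_for() > self.idle_limit
```

The receive loop would call `touch()` for each message, and a periodic check of `is_dead()` decides when to tear the stream down and reconnect. The limit only means something if the peer is known to send keepalives more often than that.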
> - there are multiple ways to retry - you can retry establishing the connection (e.g. say DNS resolution fails for a 30s window) _or_ you can retry establishing the stream
That's not how protobuf works? If a connection fails, you simply get an IO error instead of the next message. There is no machinery in gRPC that re-establishes connections.
You do need to handle timeouts and blocked connections, but that's a generic issue for any protocol.
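As a protocol-agnostic illustration of "an IO error instead of the next message", here is a sketch using plain length-prefixed framing over a socket. This is not gRPC or protobuf; the 4-byte big-endian length prefix and the function names are assumptions made for the example.

```python
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer goes away first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise ConnectionError("peer closed the connection mid-stream")
        buf.extend(chunk)
    return bytes(buf)

def read_message(sock, timeout):
    """Return the next framed message.

    Raises socket.timeout if the connection is silent for `timeout`
    seconds, and ConnectionError if the peer closed it.
    """
    sock.settimeout(timeout)
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The caller sees one of three outcomes per read: a message, a timeout (blocked or silent peer), or a connection error. What to do next (reconnect, resume, give up) is left to the caller, which is the generic part.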