Why do you say that? I'm involved in the planning for bidi streaming for a product that supports over 200M monthly active users. I am genuinely curious what landmines we're about to step on.
bidi streaming screws with a whole bunch of assumptions you rely on in usual fault-tolerant software:
- there are multiple ways to retry - you can retry establishing the connection (e.g. say DNS resolution fails for a 30s window) _or_ you can retry establishing the stream (see the sketch after this list)
- your load-balancer needs to pin the stream to a single backend; it can't just re-route each HTTP request/response independently
- how long are your timeouts? if you don't receive a message for 1s, OK, the client can probably keep the stream open, but what if you don't receive a message for 30s? this percolates through the entire request path, generally in the form of "how do I detect when a service in the request path has failed"
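To make the two retry levels concrete, here's a rough Go sketch. Everything proto-specific is hypothetical (`openStream` stands in for whatever your generated client's bidi method looks like, and the target address is a placeholder); the point is that the channel retries the *connection* for you, while retrying the *stream* is an application-level loop you have to write:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// openStream stands in for opening a bidi stream on your generated client,
// e.g. pb.NewEventsClient(conn).Stream(ctx). It returns a pump function
// that sends/receives until the stream dies. Both are hypothetical here.
func openStream(ctx context.Context, conn *grpc.ClientConn) (func() error, error) {
	return func() error { return errors.New("stream closed") }, nil
}

func main() {
	ctx := context.Background()

	// Retry level 1: the connection. The gRPC channel already re-dials the
	// transport with exponential backoff (e.g. while DNS is broken for 30s),
	// so you create it once and keep it. (grpc.Dial on older grpc-go.)
	conn, err := grpc.NewClient("backend.internal:443",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Retry level 2: the stream. A broken stream is not re-established for
	// you; the next Send/Recv errors and you open a new stream yourself,
	// deciding what state to replay.
	backoff := time.Second
	for {
		pump, err := openStream(ctx, conn)
		if err == nil {
			err = pump()
		}
		if err == nil {
			return // clean shutdown
		}
		log.Printf("stream broke: %v; retrying in %s", err, backoff)
		time.Sleep(backoff)
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
```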
> - there are multiple ways to retry - you can retry establishing the connection (e.g. say DNS resolution fails for a 30s window) _or_ you can retry establishing the stream
This isn't a difficult problem to solve. We apply both of those strategies depending on the circumstances. We can even reconnect clients to the same backend after long disconnection periods to support upload resuming, etc.
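Roughly along these lines (a simplified sketch, not our actual code; the `session-id` metadata key and the token scheme are invented for illustration):

```go
package session

import (
	"context"

	"google.golang.org/grpc/metadata"
)

// withSession attaches a resume token as stream metadata so the routing
// tier (or the backend itself) can map a reconnecting client back to its
// prior state, e.g. a half-finished upload.
func withSession(ctx context.Context, sessionID string) context.Context {
	return metadata.AppendToOutgoingContext(ctx, "session-id", sessionID)
}

// sessionFrom reads the token back on the server when the stream
// handler starts.
func sessionFrom(ctx context.Context) (string, bool) {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if v := md.Get("session-id"); len(v) > 0 {
			return v[0], true
		}
	}
	return "", false
}
```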
> - your load-balancer needs to pin the stream to a single backend; it can't just re-route each HTTP request/response independently
This applies whether the stream is uni- or bi-directional. We already have uni-directional streams working well at scale, so this is not a concern.
> - how long are your timeouts? if you don't receive a message for 1s, OK, the client can probably keep the stream open, but what if you don't receive a message for 30s? this percolates through the entire request path, generally in the form of "how do I detect when a service in the request path has failed"
We maintain streams for very long periods, hours or days at a time. Clients can detect dropped streams (we propagate errors in both directions, although AWS ALBs are causing problems here) and know how to re-establish a connection. And again, this applies whether streams are uni- or bi-directional.
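The detection side mostly comes down to HTTP/2 keepalive tuning. A simplified sketch of the client side (the durations are placeholders, not our production values):

```go
package dialer

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dial sets HTTP/2 keepalive so a silently dropped connection (an LB idle
// timeout, a dead peer) is detected without waiting for the next app
// message. Keep Time below your LB's idle timeout, and make sure the
// server's keepalive.EnforcementPolicy permits this ping rate or it will
// close the connection with an ENHANCE_YOUR_CALM GOAWAY.
func dial(target string) (*grpc.ClientConn, error) {
	return grpc.NewClient(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after this much inactivity
			Timeout:             10 * time.Second, // dead if no ack by then
			PermitWithoutStream: true,             // keep probing between streams too
		}))
}
```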
> - there are multiple ways to retry - you can retry establishing the connection (e.g. say DNS resolution fails for a 30s window) _or_ you can retry establishing the stream
That's not how gRPC works? If the connection fails, you simply get an IO error instead of the next message. The channel will re-dial the transport on its own, but there is no machinery in gRPC that re-establishes a broken stream; that's on the application.
You do need to handle timeouts and blocked connections, but that's a generic issue for any protocol.
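Concretely, the failure just shows up as a status error from the next Recv/Send, and the retry policy is yours. Something like this (assuming `err` came from any generated bidi stream; which codes count as retryable is a policy decision):

```go
package retryable

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldRetry classifies the error a bidi stream's Recv/Send hands you
// once the underlying connection has failed. Unavailable is the usual
// "transport died" case.
func shouldRetry(err error) bool {
	switch status.Code(err) {
	case codes.Unavailable, codes.Aborted:
		return true // connection/stream died: open a new stream and resume
	default:
		return false // app error, deadline, cancellation: don't blindly retry
	}
}
```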
Not going to give you any proper advice, just a question you need to have an answer for. It's not unsolvable or even difficult, but it needs an answer at scale.
How do you scale horizontally?
User A connects to server A. User A's connection drops. User A reconnects to your endpoint. Did you have anything stateful you had to remember? Did the load balancer need to remember to reconnect user A to server A? What happens if it was the server that dropped: how do you reconnect the user?
Now if your streaming is server-to-server over gRPC on your own internal backend, then sure, build actors with message passing. You will probably need an orchestration layer (not k8s, that's for infra; you need an orchestrator for your services, probably written by you) for the same reason as above: what happens if Server A goes down, but instead of User A on the other end it's Server B? The orchestrator acts as your load balancer would have, except it remembers who exists and who they need to speak to.
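To make that concrete, whatever you build (smart LB, lookup service, or your own orchestrator) ends up holding state shaped roughly like this. An illustrative sketch, not a real library:

```go
package registry

import "sync"

// Registry is the stateful piece the questions above are really about:
// who is connected where, so a reconnect (or a dead server) is handled
// deliberately instead of by luck.
type Registry struct {
	mu     sync.Mutex
	byUser map[string]string          // user -> backend currently holding its stream
	byHost map[string]map[string]bool // backend -> users on it
}

func NewRegistry() *Registry {
	return &Registry{
		byUser: map[string]string{},
		byHost: map[string]map[string]bool{},
	}
}

// Connect records where a user's (or peer server's) stream lives.
func (r *Registry) Connect(user, host string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byUser[user] = host
	if r.byHost[host] == nil {
		r.byHost[host] = map[string]bool{}
	}
	r.byHost[host][user] = true
}

// Lookup answers "did the load balancer need to remember?": route the
// reconnect back here.
func (r *Registry) Lookup(user string) (host string, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	host, ok = r.byUser[user]
	return
}

// HostDown answers "what if the server dropped": everyone on it must be
// re-placed, which is exactly the orchestrator's job described above.
func (r *Registry) HostDown(host string) (displaced []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for user := range r.byHost[host] {
		delete(r.byUser, user)
		displaced = append(displaced, user)
	}
	delete(r.byHost, host)
	return
}
```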