Amazed to see connection-oriented protocols still being used here. As someone who's been in the real-time media space for decades: we switched everything to UDP about 15 years ago and achieved much higher throughput and lower latency. Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-threaded handler exclusively bound to a separate physical core. The outbound side mirrors this. This lets us scale vertically to maximize a single machine's resources while essentially capping the upper bound on latency in high-load scenarios. Our architecture enables us to have hundreds of thousands of users PER MACHINE, and we've been running it successfully in production for over 10 years.

While we do have logic to scale horizontally, we minimize it and use it only when it's required for scaling further or for traversing specific network clouds, depending on where a user is able to connect. Worth mentioning that we use general Linux distros on physical hardware, no virtual machines, and we also use kernel bypass techniques.
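To illustrate the shape of the fan-out (heavily simplified; all names, the ring size, the worker count, and the hash used to pick a worker are illustrative, not our actual code). The key property is that each queue has exactly one producer (the reader) and one consumer (its pinned worker), so no locks are needed:

```cpp
// Sketch: one reader thread fans UDP packets out to per-core SPSC rings.
#include <array>
#include <atomic>
#include <cstdint>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>
#include <thread>
#include <vector>

constexpr int    kWorkers  = 4;
constexpr size_t kRingSize = 4096;                 // must be a power of two

struct Packet { sockaddr_in src; size_t len; char data[1500]; };

// Single-producer/single-consumer ring: lockless only because exactly one
// thread pushes (the reader) and exactly one thread pops (the worker).
struct SpscRing {
    std::array<Packet, kRingSize> slots;
    std::atomic<size_t> head{0}, tail{0};
    bool push(const Packet& p) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kRingSize) return false;
        slots[t & (kRingSize - 1)] = p;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(Packet& p) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;
        p = slots[h & (kRingSize - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);                   // illustrative port
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    std::vector<SpscRing> rings(kWorkers);
    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i)
        workers.emplace_back([&rings, i] {
            pin_to_core(i + 1);                    // reader owns core 0
            Packet p;
            for (;;) if (rings[i].pop(p)) { /* process this stream */ }
        });

    pin_to_core(0);                                // dedicated reader core
    Packet p;
    for (;;) {
        socklen_t slen = sizeof(p.src);
        ssize_t n = recvfrom(fd, p.data, sizeof(p.data), 0,
                             reinterpret_cast<sockaddr*>(&p.src), &slen);
        if (n <= 0) continue;
        p.len = static_cast<size_t>(n);
        // Hash the source so one client always lands on the same worker.
        size_t w = (p.src.sin_addr.s_addr ^ p.src.sin_port) % kWorkers;
        while (!rings[w].push(p)) { /* queue full: spin (or drop) */ }
    }
}
```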
Any time you add more servers to spread load, you're increasing latency: each extra hop traverses the entire software network stack twice, plus the hops through hardware switches.
Nobody uses a dedicated thread per client anymore (and if they do, it's a poor design).
> Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-threaded handler exclusively bound to a separate physical core.
As an internet backseat network performance person...
Have you considered one thread per core/NIC queue receiving packets, with RSS (receive-side scaling)? If your bottleneck is network I/O, that should avoid some cross-CPU communication. On the other hand, if you can't align client processing to the core its NIC queue is handled on, then my suggestion just adds contention on the processing queues, although maybe receive packet steering would help in that case. But I'd also imagine game-state processing is a bigger bottleneck than network I/O?
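Roughly what I mean, as a sketch: the software cousin of per-queue receive threads is SO_REUSEPORT, where each worker owns a UDP socket bound to the same port and the kernel hashes flows across them, so a given client sticks to one thread. Port, thread count, and the affinity/`ethtool` queue setup are assumptions left out here:

```cpp
// Sketch: per-thread UDP sockets sharing one port via SO_REUSEPORT, so the
// kernel spreads flows across workers instead of a user-space distributor.
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <thread>
#include <vector>

int make_worker_socket(uint16_t port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    return fd;
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] {
            int fd = make_worker_socket(9000);     // one socket per thread
            char buf[1500];
            for (;;) {
                ssize_t n = recv(fd, buf, sizeof(buf), 0);
                if (n > 0) { /* process on this core; pin thread here */ }
            }
        });
    for (auto& t : workers) t.join();
}
```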
We used to use RSS but switched to kernel bypass instead, which easily increased throughput 10x. I imagine we also have a much higher bandwidth requirement than MMOs do (we can do 40 Gbps). Every stream requires encryption and decryption (AES-GCM ciphers), so there is a huge amount of user-space CPU processing involved too (OpenSSL). That's where kernel bypass helped a lot: it offloaded all of the network I/O onto two dedicated cores (one input, one output) and left all the other cores available for user-space processing of streams.
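For a sense of that per-packet crypto cost, the decrypt path with OpenSSL's EVP API looks roughly like this (simplified illustration; the AES-256 key size, 12-byte nonce, and trailing 16-byte tag are assumptions, not our actual wire format):

```cpp
// Sketch: per-packet AES-256-GCM decryption via OpenSSL's EVP interface.
#include <openssl/evp.h>

// Returns the plaintext length, or -1 if the GCM tag check fails.
int decrypt_packet(const unsigned char* key,   // 32-byte AES-256 key
                   const unsigned char* iv,    // 12-byte nonce (GCM default)
                   const unsigned char* ct, int ct_len,
                   const unsigned char* tag,   // 16-byte GCM tag
                   unsigned char* out) {
    EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
    int len = 0, out_len = 0;
    EVP_DecryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, key, iv);
    EVP_DecryptUpdate(ctx, out, &len, ct, ct_len);
    out_len = len;
    // Set the expected tag before finalizing; Final fails on a mismatch.
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_TAG, 16,
                        const_cast<unsigned char*>(tag));
    int ok = EVP_DecryptFinal_ex(ctx, out + out_len, &len);
    out_len += len;
    EVP_CIPHER_CTX_free(ctx);
    return ok > 0 ? out_len : -1;
}
```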
Also worth mentioning: we wrote everything in C++. Anything else is too slow.
OCB also won the CAESAR competition for "High-performance applications" portfolio. It is much older than the competition and is no longer patent-encumbered.
> Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-threaded handler exclusively bound to a separate physical core.
Could you explain more about your lockless queues?
I recently wrote a lock-free ring buffer inspired by the LMAX Disruptor, but it is only thread-safe one-to-one: SPSC (single producer, single consumer). Latency between threads on a 1.1 GHz Intel NUC is 80-200 nanoseconds.
I have also ported Alexander Krizhanovsky's ring buffer to C, but I haven't benchmarked it yet.
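For concreteness, the shape of that kind of ring is roughly this (an illustrative sketch, not my exact code). The Disruptor-style details that get handoff into the ~100 ns range are keeping the two sequence counters on separate cache lines and having each side cache the other's position, so the hot path rarely touches shared state:

```cpp
// Sketch: Disruptor-style SPSC ring with padded and cached counters.
#include <atomic>
#include <cstddef>
#include <cstdint>

template <typename T, size_t N>                   // N: power of two
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    alignas(64) std::atomic<uint64_t> head_{0};   // consumer position
    alignas(64) std::atomic<uint64_t> tail_{0};   // producer position
    alignas(64) uint64_t cached_head_ = 0;        // producer-local copy
    alignas(64) uint64_t cached_tail_ = 0;        // consumer-local copy
    T slots_[N];
public:
    bool push(const T& v) {                       // producer thread only
        uint64_t t = tail_.load(std::memory_order_relaxed);
        if (t - cached_head_ == N) {              // looks full: refresh cache
            cached_head_ = head_.load(std::memory_order_acquire);
            if (t - cached_head_ == N) return false;
        }
        slots_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {                              // consumer thread only
        uint64_t h = head_.load(std::memory_order_relaxed);
        if (h == cached_tail_) {                  // looks empty: refresh cache
            cached_tail_ = tail_.load(std::memory_order_acquire);
            if (h == cached_tail_) return false;
        }
        v = slots_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```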
Sounds amazing! Did you implement congestion control per connection, and if so, which algorithms did you use? I can imagine that CC could really affect throughput at this scale.