Amazed to see connection-oriented protocols still being used here. As someone who's been in the real-time media space for decades: we switched everything to UDP about 15 years ago and achieved much higher throughput and lower latency. Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-threaded handler exclusively bound to a separate physical core. The outbound side mirrors this. This lets us scale vertically to maximize a single machine's resources while essentially capping the upper bound on latency in high-load scenarios. Our architecture enables us to have hundreds of thousands of users PER MACHINE, and we've been running it successfully in production for over 10 years.

While we do have logic to scale horizontally, we minimize it and use it only when it's required for scaling further or for traversing specific network clouds, depending on where a user is able to connect. Worth mentioning that we use general Linux distros on physical hardware, no virtual machines, and we also use kernel bypass techniques.
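To illustrate the shape of the fan-out (heavily simplified; all names, the ring size, the worker count, and the hash used to pick a worker are illustrative, not our actual code). The key property is that each queue has exactly one producer (the reader) and one consumer (its pinned worker), so no locks are needed:

```cpp
// Sketch: one reader thread fans UDP packets out to per-core SPSC rings.
#include <array>
#include <atomic>
#include <cstdint>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>
#include <thread>
#include <vector>

constexpr int    kWorkers  = 4;
constexpr size_t kRingSize = 4096;                 // must be a power of two

struct Packet { sockaddr_in src; size_t len; char data[1500]; };

// Single-producer/single-consumer ring: lockless only because exactly one
// thread pushes (the reader) and exactly one thread pops (the worker).
struct SpscRing {
    std::array<Packet, kRingSize> slots;
    std::atomic<size_t> head{0}, tail{0};
    bool push(const Packet& p) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kRingSize) return false;
        slots[t & (kRingSize - 1)] = p;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(Packet& p) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;
        p = slots[h & (kRingSize - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);                   // illustrative port
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    std::vector<SpscRing> rings(kWorkers);
    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i)
        workers.emplace_back([&rings, i] {
            pin_to_core(i + 1);                    // reader owns core 0
            Packet p;
            for (;;) if (rings[i].pop(p)) { /* process this stream */ }
        });

    pin_to_core(0);                                // dedicated reader core
    Packet p;
    for (;;) {
        socklen_t slen = sizeof(p.src);
        ssize_t n = recvfrom(fd, p.data, sizeof(p.data), 0,
                             reinterpret_cast<sockaddr*>(&p.src), &slen);
        if (n <= 0) continue;
        p.len = static_cast<size_t>(n);
        // Hash the source so one client always lands on the same worker.
        size_t w = (p.src.sin_addr.s_addr ^ p.src.sin_port) % kWorkers;
        while (!rings[w].push(p)) { /* queue full: spin (or drop) */ }
    }
}
```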
Any time you add more servers to spread load, you're increasing latency: each extra hop traverses the entire software network stack twice, plus the hops through hardware switches.
Nobody uses a dedicated thread per client anymore (and if they do, it's a poor design).
> Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-threaded handler exclusively bound to a separate physical core.
As an internet backseat network performance person...
Have you considered one thread per core/NIC queue receiving packets, with RSS (receive-side scaling)? If your bottleneck is network I/O, that should avoid some cross-CPU communication. On the other hand, if you can't align client processing to the core its NIC queue is handled on, then my suggestion just adds contention on the processing queues, although maybe receive packet steering would help in that case. But I'd also imagine game-state processing is a bigger bottleneck than network I/O?
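Roughly what I mean, as a sketch: the software cousin of per-queue receive threads is SO_REUSEPORT, where each worker owns a UDP socket bound to the same port and the kernel hashes flows across them, so a given client sticks to one thread. Port, thread count, and the affinity/`ethtool` queue setup are assumptions left out here:

```cpp
// Sketch: per-thread UDP sockets sharing one port via SO_REUSEPORT, so the
// kernel spreads flows across workers instead of a user-space distributor.
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <thread>
#include <vector>

int make_worker_socket(uint16_t port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    return fd;
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] {
            int fd = make_worker_socket(9000);     // one socket per thread
            char buf[1500];
            for (;;) {
                ssize_t n = recv(fd, buf, sizeof(buf), 0);
                if (n > 0) { /* process on this core; pin thread here */ }
            }
        });
    for (auto& t : workers) t.join();
}
```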
We used to use RSS but switched to kernel bypass instead, which easily increased throughput 10x. I imagine we also have a much higher bandwidth requirement than MMOs do (we can do 40 Gbps). Every stream requires encryption and decryption (AES-GCM ciphers), so there is a huge amount of user-space CPU processing involved too (OpenSSL). That's where kernel bypass helped a lot: it offloaded all of the network I/O onto two dedicated cores (one input, one output) and left all the other cores available for user-space processing of streams.
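For a sense of that per-packet crypto cost, the decrypt path with OpenSSL's EVP API looks roughly like this (simplified illustration; the AES-256 key size, 12-byte nonce, and trailing 16-byte tag are assumptions, not our actual wire format):

```cpp
// Sketch: per-packet AES-256-GCM decryption via OpenSSL's EVP interface.
#include <openssl/evp.h>

// Returns the plaintext length, or -1 if the GCM tag check fails.
int decrypt_packet(const unsigned char* key,   // 32-byte AES-256 key
                   const unsigned char* iv,    // 12-byte nonce (GCM default)
                   const unsigned char* ct, int ct_len,
                   const unsigned char* tag,   // 16-byte GCM tag
                   unsigned char* out) {
    EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
    int len = 0, out_len = 0;
    EVP_DecryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, key, iv);
    EVP_DecryptUpdate(ctx, out, &len, ct, ct_len);
    out_len = len;
    // Set the expected tag before finalizing; Final fails on a mismatch.
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_TAG, 16,
                        const_cast<unsigned char*>(tag));
    int ok = EVP_DecryptFinal_ex(ctx, out + out_len, &len);
    out_len += len;
    EVP_CIPHER_CTX_free(ctx);
    return ok > 0 ? out_len : -1;
}
```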
Also worth mentioning: we wrote everything in C++. Anything else is too slow.
OCB also won the CAESAR competition for "High-performance applications" portfolio. It is much older than the competition and is no longer patent-encumbered.
> Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-threaded handler exclusively bound to a separate physical core.
Could you explain more about your lockless queues?
I recently wrote a lock-free ring buffer inspired by the LMAX Disruptor, but it is only thread-safe one-to-one: SPSC (single producer, single consumer). Latency between threads on a 1.1 GHz Intel NUC is 80-200 nanoseconds.
I have also ported Alexander Krizhanovsky's ring buffer to C, but I haven't benchmarked it yet.
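For concreteness, the shape of that kind of ring is roughly this (an illustrative sketch, not my exact code). The Disruptor-style details that get handoff into the ~100 ns range are keeping the two sequence counters on separate cache lines and having each side cache the other's position, so the hot path rarely touches shared state:

```cpp
// Sketch: Disruptor-style SPSC ring with padded and cached counters.
#include <atomic>
#include <cstddef>
#include <cstdint>

template <typename T, size_t N>                   // N: power of two
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    alignas(64) std::atomic<uint64_t> head_{0};   // consumer position
    alignas(64) std::atomic<uint64_t> tail_{0};   // producer position
    alignas(64) uint64_t cached_head_ = 0;        // producer-local copy
    alignas(64) uint64_t cached_tail_ = 0;        // consumer-local copy
    T slots_[N];
public:
    bool push(const T& v) {                       // producer thread only
        uint64_t t = tail_.load(std::memory_order_relaxed);
        if (t - cached_head_ == N) {              // looks full: refresh cache
            cached_head_ = head_.load(std::memory_order_acquire);
            if (t - cached_head_ == N) return false;
        }
        slots_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {                              // consumer thread only
        uint64_t h = head_.load(std::memory_order_relaxed);
        if (h == cached_tail_) {                  // looks empty: refresh cache
            cached_tail_ = tail_.load(std::memory_order_acquire);
            if (h == cached_tail_) return false;
        }
        v = slots_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```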
Sounds amazing! Did you implement congestion control per connection, and if so, which algorithms did you use? I can imagine that CC could really affect throughput at this scale.