
Amazed to see connection-oriented protocols still being used here. As someone who's been in the real-time media space for decades: we switched everything to UDP about 15 years ago and achieved much higher throughput and lower latency.

Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-thread handler bound exclusively to its own physical core. Similarly for the outbound side. This allows us to scale vertically and maximize a single machine's resources, while essentially capping the upper bound on latency under high load. Our architecture enables us to have hundreds of thousands of users PER MACHINE, and we've been running it successfully in production for over 10 years.

While we do have logic to scale horizontally, we minimize it and use it only when it's required for scaling further or for traversing specific network clouds, depending on where a user is able to connect. Worth mentioning: we run general Linux distros on physical hardware, no virtual machines, and we also use kernel bypass techniques.
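
To make the inbound fan-out concrete, here is a heavily simplified sketch (not our production code: boost::lockfree::spsc_queue stands in for our custom lockless queues, the port and core counts are made up, and the kernel bypass I/O path isn't shown):

    // Heavily simplified sketch, not production code: one reader thread on the
    // UDP socket hashes each packet's source endpoint to a per-core SPSC queue;
    // boost::lockfree::spsc_queue stands in for the custom lockless queues and
    // the kernel bypass I/O path is not shown.
    #include <boost/lockfree/spsc_queue.hpp>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <pthread.h>
    #include <sched.h>
    #include <array>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Packet { sockaddr_in src; size_t len; std::array<char, 1500> data; };

    constexpr int kWorkers = 4;   // one handler per physical core; count is illustrative
    using Queue = boost::lockfree::spsc_queue<Packet, boost::lockfree::capacity<1024>>;
    static std::array<Queue, kWorkers> queues;   // exactly one producer and one consumer each

    void pin_to_core(int core) {
        cpu_set_t set; CPU_ZERO(&set); CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    void worker(int i) {
        pin_to_core(i + 1);                      // cores 1..kWorkers run the handlers
        Packet p;
        for (;;) {
            while (queues[i].pop(p)) {
                // decrypt / process this client's stream here
            }
        }
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < kWorkers; ++i) workers.emplace_back(worker, i);

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(5000);             // placeholder port
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

        pin_to_core(0);                          // the reader thread owns core 0
        Packet p;
        for (;;) {
            socklen_t slen = sizeof(p.src);
            ssize_t n = recvfrom(fd, p.data.data(), p.data.size(), 0,
                                 reinterpret_cast<sockaddr*>(&p.src), &slen);
            if (n <= 0) continue;
            p.len = static_cast<size_t>(n);
            // A stable hash of the source endpoint keeps each client on one worker.
            size_t idx = (p.src.sin_addr.s_addr ^ p.src.sin_port) % kWorkers;
            while (!queues[idx].push(p)) { /* queue full: spin, or drop the packet */ }
        }
    }

The important property is the ownership rule: each queue has exactly one producer (the reader thread) and exactly one consumer (its pinned worker), which is what lets the queues stay lockless.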

Any time you add more servers to spread load, you increase latency, because each extra hop traverses the entire software network stack twice, plus the hardware switches in between.

Nobody uses a dedicated thread per client anymore (and if they do, it's a poor design).




> Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-thread handler bound exclusively to its own physical core.

As an internet backseat network performance person...

Have you considered one thread per core/NIC queue receiving packets, with RSS (receive side scaling)? If your bottleneck is network I/O, that should avoid some cross-CPU communication. On the other hand, if you can't align client processing to the core where its NIC queue is handled, then my suggestion just adds contention on the processing queues, although maybe receive packet steering would help in that case. But I'd also imagine game state processing is a bigger bottleneck than network I/O?
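
For concreteness, the usual way to approximate that pattern on the stock kernel stack is one pinned receive thread per core, each with its own socket bound to the same UDP port via SO_REUSEPORT; rough sketch below (the core count and port are placeholders, and the RSS/queue-to-core mapping itself is NIC/driver configuration that isn't shown):

    // Rough sketch: N receive threads, each pinned to a core and reading from its
    // own SO_REUSEPORT socket on the same UDP port. The RSS/queue-to-core mapping
    // is configured separately at the NIC/driver level and is not shown here.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <pthread.h>
    #include <sched.h>
    #include <array>
    #include <thread>
    #include <vector>

    void rx_loop(int core, int port) {
        cpu_set_t set; CPU_ZERO(&set); CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(port);
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

        std::array<char, 1500> buf;
        for (;;) {
            ssize_t n = recv(fd, buf.data(), buf.size(), 0);
            if (n > 0) {
                // process on this core; ideally the flow's NIC queue is serviced here too
            }
        }
    }

    int main() {
        const int kCores = 4;   // illustrative
        std::vector<std::thread> threads;
        for (int i = 0; i < kCores; ++i) threads.emplace_back(rx_loop, i, 5000);
        for (auto& t : threads) t.join();
    }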


We used to use RSS but switched to kernel bypass instead, which easily increased throughput 10x. I imagine we also have a much higher bandwidth requirement than what MMOs use (we can do 40 Gbps). Every stream requires encryption and decryption (AES-GCM ciphers), so there is a lot of user-space CPU processing involved too (OpenSSL). That's where kernel bypass helped a lot: it confined all of the network I/O to two dedicated cores (one for input, one for output) and left all the other cores available for user-space processing of streams.
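
The per-packet crypto is the standard OpenSSL EVP AES-GCM path. A stripped-down sketch for anyone unfamiliar with it (illustrative only: the key and IV here are zero placeholders, and in a real hot path you would keep one context per stream rather than allocating one per packet):

    // Illustrative AES-256-GCM encryption of one packet via the OpenSSL EVP API.
    // Key/IV handling is placeholder only; the IV must be 12 bytes and unique per
    // packet under a given key (e.g. a per-stream counter).
    #include <openssl/evp.h>
    #include <cstdint>
    #include <cstdio>

    // Encrypts pt_len bytes into ct and writes a 16-byte tag into tag.
    // Returns the ciphertext length, or -1 on error.
    int gcm_encrypt(const uint8_t* key, const uint8_t* iv,
                    const uint8_t* pt, int pt_len,
                    uint8_t* ct, uint8_t* tag) {
        EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
        if (!ctx) return -1;
        int len = 0, out = 0, ok = 1;
        ok &= EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, nullptr, nullptr);
        ok &= EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, 12, nullptr);
        ok &= EVP_EncryptInit_ex(ctx, nullptr, nullptr, key, iv);
        ok &= EVP_EncryptUpdate(ctx, ct, &len, pt, pt_len);
        out = len;
        ok &= EVP_EncryptFinal_ex(ctx, ct + out, &len);
        out += len;
        ok &= EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
        EVP_CIPHER_CTX_free(ctx);
        return ok ? out : -1;
    }

    int main() {
        uint8_t key[32] = {0}, iv[12] = {0};     // placeholders only
        const uint8_t msg[] = "hello over UDP";
        uint8_t ct[sizeof(msg)], tag[16];
        int n = gcm_encrypt(key, iv, msg, sizeof(msg), ct, tag);
        std::printf("ciphertext bytes: %d\n", n);
    }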

Also worth mentioning: we wrote everything in C++. Anything else is too slow.


If latency is so important, why use AES-GCM instead of AES-OCB?

openssl speed -evp CIPHER, for example (ECB included only for reference; don't use it):

    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
    AES-256-GCM    655193.56k  1747223.44k  3399072.36k  4490100.61k  5033129.30k  5108596.96k
    AES-256-OCB    631357.15k  2268916.74k  4794610.30k  6492985.36k  7174274.14k  7301540.60k
    AES-256-ECB    997960.96k  3972424.35k  8096120.70k  8105542.89k  8179659.94k  8188882.64k
    AES-128-GCM    747463.90k  1856932.41k  3762591.34k  4700335.25k  5224533.22k  5157661.06k
    AES-128-OCB    702729.38k  2479151.86k  5919529.91k  8719316.46k 10545305.23k 10536095.61k
    AES-128-ECB   1291715.10k  5180010.63k 11093258.14k 11815558.81k 11913592.61k 11947322.76k

OCB also won the CAESAR competition's "high-performance applications" portfolio. It is much older than the competition itself and is no longer patent-encumbered.


We have external dependencies requiring this. We can't change the specific cipher suites ourselves.


> Our server software architectures use a single dedicated thread on a UDP port, which reads packets and distributes them to lockless queues, each owned by a dedicated single-thread handler bound exclusively to its own physical core.

LMAX Disruptor?


This is really interesting.

Could you explain more about your lockless queues?

I recently wrote a lock-free ring buffer inspired by the LMAX Disruptor, but it is only thread-safe one thread to one thread: SPSC (single producer, single consumer). It has inter-thread latency of 80-200 nanoseconds on a 1.1 GHz Intel NUC.
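
For reference, the general shape of such a buffer (a simplified sketch, not my actual code) is a power-of-two ring with the producer and consumer indices on separate cache lines and acquire/release atomics:

    // Simplified SPSC ring buffer sketch: single producer, single consumer,
    // power-of-two capacity, acquire/release atomics, and head/tail counters
    // padded onto separate cache lines.
    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <thread>

    template <typename T, size_t N>   // N must be a power of two
    class SpscRing {
        static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
        alignas(64) std::atomic<size_t> head_{0};   // written by the consumer
        alignas(64) std::atomic<size_t> tail_{0};   // written by the producer
        T buf_[N];
    public:
        bool push(const T& v) {                     // producer thread only
            size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
            buf_[t & (N - 1)] = v;
            tail_.store(t + 1, std::memory_order_release);
            return true;
        }
        bool pop(T& v) {                            // consumer thread only
            size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return false;      // empty
            v = buf_[h & (N - 1)];
            head_.store(h + 1, std::memory_order_release);
            return true;
        }
    };

    int main() {
        SpscRing<int, 1024> q;
        std::thread consumer([&] {
            int v, got = 0;
            while (got < 100000) if (q.pop(v)) ++got;
            std::printf("consumed %d items\n", got);
        });
        for (int i = 0; i < 100000; ) if (q.push(i)) ++i;
        consumer.join();
    }

The alignas(64) padding keeps the head and tail counters off the same cache line, which avoids false sharing between the two threads.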

I have ported Alexander Krizhanovsky's ringbuffer to C but I haven't benchmarked it.

https://www.linuxjournal.com/content/lock-free-multi-produce...


Games use UDP quite a lot as well.


Sounds amazing! Did you implement congestion control per connection, and if so, which algorithms did you use? I can imagine that CC could really affect throughput at this scale.


"switched everything to UDP 15 years ago" ME: <Nods head approvingly in 25-years-ago-LAN-party>


I feel like the design you explained is the same, just swapping connection-oriented for connectionless.



