Achieving reliable UDP transmission at 10 Gb/s using BSD socket

ignoramous · on April 19, 2020

> (abstract) Optimizations for throughput are: MTU, packet sizes, tuning Linux kernel parameters, thread affinity, core locality and efficient timers.

Cloudflare's u/majke shared a series of articles on a similar topic [0][1][2] (with focus on achieving line-rate with higher packets-per-second and lower latency instead of throughput) that I found super helpful especially since they are so very thorough [3].

Speaking of throughput, u/drewg123 wrote an article on how Netflix does 100gbps with FreeBSD's network stack [4] and here's BBC on how they do so by bypassing Linux's network stack [5].

---

[0] https://news.ycombinator.com/item?id=10763323

[1] https://news.ycombinator.com/item?id=12404137

[2] https://news.ycombinator.com/item?id=17063816

[3] https://news.ycombinator.com/item?id=12408672

[4] https://news.ycombinator.com/item?id=15367421

[5] https://news.ycombinator.com/item?id=16986100

snisarenko · on April 19, 2020

Optimizing UDP transmission over internet is an interesting topic.

I remember reading a paper a while ago that showed that if you send two consecutive UDP packets with the exact same data over the internet, at least 1 of them will arrive at the destination with a pretty high success rate (something like 99.99%).

I wonder if this still works with current internet infrastructure, and if this trick is still used in real-time streaming protocols.

zamadatix · on April 19, 2020

99.99% for two tries would be a 1% drop chance which I'd say is pretty lenient - we average better than that on our sites running off 4G (jitter is horrible though and that will kill any real-time protocols without huge delays added).
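
To spell out that arithmetic: if the two copies are lost independently with the same per-packet probability p (an assumption, and not necessarily how real paths behave), both are lost with probability p^2, so 99.99% delivery implies p = 1%. A rough sketch, with function names made up for illustration:

```python
def delivery_rate(p_loss: float, copies: int = 2) -> float:
    """Chance that at least one of `copies` identical packets arrives,
    assuming each copy is lost independently with probability p_loss."""
    return 1.0 - p_loss ** copies


def implied_loss(rate: float, copies: int = 2) -> float:
    """Per-packet loss probability implied by a given delivery rate,
    under the same independence assumption."""
    return (1.0 - rate) ** (1.0 / copies)


print(f"implied per-packet loss: {implied_loss(0.9999):.2%}")
print(f"delivery with 1% loss and 2 copies: {delivery_rate(0.01):.2%}")
```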

Generally you'd just implement a more generic FEC algorithm though unless you had 2 separate paths you wanted to try (e.g. race a cable modem and 4G with every packet and if one side drops it hope the other side still finishes the race) as there are FEC options that allow non integer redundancy levels and can reduce header overhead compared to sending multiple copies of small packets.

syrrim · on April 19, 2020

>99.99% for two tries would be a 1% drop chance

Not per se. The drop chance for consecutive packets is likely correlated, such that if you know the first one was dropped you should increase your prior that the second one will also be dropped.

zamadatix · on April 19, 2020

Depends on the cause and root question. For instance, in the most common scenario of congestion, routers do intelligent random drops with increasing probability as the buffer gets more full https://en.wikipedia.org/wiki/Random_early_detection. The internet actually relies on this random low drop chance to make things work smoothly rather than waiting til things are falling apart to signal to streams to slow down all at once while it catches up. The same randomness applies to transmission bit errors, which also cause drops, though there the randomness comes from noise rather than from design.

On the other hand if the root question is if there is an outage style issue then yeah if the path to the destination is having a hard down style issue no number of packets are going to help because they are all going to drop. Likewise if the question is "on a short enough time scale is reliability of delivering a single packet somewhere on the internet ever less than 99%" then yeah somewhere there is a failure scenario and if you look at a short enough time scale any failure scenario can be made to say there is 0% reliability.

ignoramous · on April 19, 2020

u/noselasd:

> Also keep in mind this note: http://technet.microsoft.com/en-us/library/cc940021.aspx

> Basically, if you send() 2 or more UDP datagrams in quick succession, and the OS has to resolve the destination with ARP, all but the 1 packet is dropped until you get an ARP reply (this behavior isn't entirely unique to windows, btw).

https://news.ycombinator.com/item?id=8468313

zamadatix · on April 20, 2020

Note this only applies to things on the same subnet as you. Once you're off your local subnet routing takes over so you only need to know the ARP of the local gateway (which the OS handles as part of bringing network up).

mcguire · on April 19, 2020

Odds are, at least one of the links between the source and destination will be shared. If so, sending two packets is an expensive attempt at reliability; it will cut the bandwidth in half. Further, one data packet will arrive with a highish success rate.

snisarenko · on April 20, 2020

Depends on the use-case.

If you are streaming a live sports event you don't want the video frames to fall behind from the real clock too far, but you also don't want the video to stutter when there is packet loss. I believe you can use the UDP trick I mentioned above to amortize loss of high quality frame data, with "double sent" low quality frame data.

mcguire · on April 20, 2020

As others have mentioned, you're looking for forward error correction, and there are better ways than sending duplicate packets, although they all do add redundancy in various ways.

tomohawk · on April 19, 2020

It depends on what the characteristics of the transmission line are. If it is purely random, that is one thing, but often if one packet is dropped or smashed, there is a higher probability of following ones to meet the same fate. For example, if the transmission is over a microwave link, it is easy to see how something could cause a few thousand packets in a row to go missing.

snisarenko · on April 20, 2020

> if one packet is dropped or smashed, there is a higher probability of following ones to meet the same fate.

Not necessarily. I am not an expert on this topic, but I believe physical device failure is one of the least common reasons a packet is dropped (at least on physical cable transmission). I think link saturation, priority routing (for certain types of packets, or contractual reasons) might be more common. There are thousands of routers out there with a variety of different configurations that decide which packets go through under which conditions.

amelius · on April 20, 2020

> Optimizing UDP transmission over internet is an interesting topic.

Can someone summarize the problem in abstract terms? What are the parameters that the algorithm can play with? Packet size and rate of transmission, anything else?

snisarenko · on April 20, 2020

Those two are pretty much it. I believe the ordering of the data transmission can matter as well (i.e. you are about to send 3 packets of 3 different sizes. the ordering might change deliverability behavior.)

I guess I meant it was an interesting topic to me when I read the paper :) It might not be that exciting overall.

readmodifywrite · on April 20, 2020

Do you have a link to the paper? I'd love to read it.

wmf · on April 19, 2020

So basically rate 1/2 FEC.

exdsq · on April 19, 2020

Can you do something similar with TCP and increase the packet size such that the "TCP Overhead" is reduced compared to 64 byte payloads but with the increased reliability over UDP?

toast0 · on April 19, 2020

In the system proposed, not really.

To use TCP instead of UDP there are two big problems:

1) the sensor device would need to keep unacknowledged data in memory, but it may not have enough memory for that

2) if they're running at line rate (max bandwidth in this case) in UDP, there's no bandwidth left to retransmit data

All of the buffer manipulation is going to be more CPU intensive on both sides as well, and you'd run into congestion control limiting the data rate in the early part of the capture as well.

For a system like this, while UDP doesn't guarantee reliability, careful network setup (either sensor direct to recorder, or on a dedicated network with sufficient capacity and no outside traffic) in combination with careful software setup allows for a very low probability of lost packets despite no ability to retransmit.

zamadatix · on April 19, 2020

MTU is maximum transmission unit, so increasing that does nothing about making 64 byte packets more efficient. You should try to send as much data as you can in one go and the socket will automatically figure out how to split that up the best it can. Most systems default to a 1500 byte MTU so the OS will chunk it up to fit in multiple 1500 byte packets. The OS will usually try to combine a bunch of small payloads into one larger packet as well via e.g. https://en.wikipedia.org/wiki/Nagle%27s_algorithm but that's not guaranteed and much less CPU efficient even when it does work.

99% of the time you are transferring data you don't need to think this deep into networking though. E.g. I have the exact same DL360 Gen9 servers with the same 10G NICs in my lab and 10G TCP streams run just fine on them without manual tweaking. Setting MTU to 9000 does make it more efficient but that's about as far as I'd go without a particularly strong driver to optimize (e.g. "We've got 2,000 of these servers and if we could get by with 5% fewer it'd save your yearly salary" kind of things).
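
For a feel of why payload size and MTU matter, here is a back-of-the-envelope sketch of how much of the wire carries payload for a UDP datagram. It assumes plain Ethernet with IPv4 and no VLAN tag or IP options; TCP would add a 20 byte header instead of UDP's 8. The function name is made up for illustration:

```python
# Fixed per-packet cost in bytes, assuming Ethernet + IPv4 + UDP.
ETH_WIRE_OVERHEAD = 8 + 14 + 4 + 12  # preamble/SFD, header, FCS, inter-frame gap
IP_UDP_HEADERS = 20 + 8              # IPv4 header + UDP header


def udp_wire_efficiency(payload_bytes: int) -> float:
    """Fraction of on-the-wire bytes that are application payload."""
    return payload_bytes / (payload_bytes + IP_UDP_HEADERS + ETH_WIRE_OVERHEAD)


# 64 byte payloads, a full 1500 byte MTU, and a full 9000 byte MTU.
for payload in (64, 1472, 8972):
    print(f"{payload:5d} byte payload: {udp_wire_efficiency(payload):.1%}")
```

Under those assumptions the three cases come out at roughly 49%, 96% and 99%, so the big win is getting away from tiny packets; jumbo frames add a few more percent and fewer packets per second for the CPU to handle.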

zamadatix · on April 19, 2020

"In a readout system such as ours the network only consists of a data sender and a data receiver with an optional switch connecting them. Thus the only places where congestion occurs are at the sender or receiver. The readout system will typically produce data at near constant rates during measurements so congestion at the receiver will result in reduced data rates by the transmitter when using TCP."

At that point a better paper title would have been "Increasing buffers or optimizing application syscalls to receive 10 Gb/s of data" as it has nothing to do with achieving reliable UDP transmission, which it doesn't even seem they needed:

"For some detector readout it is not even evident that guaranteed delivery is necessary. In one detector prototype we discarded around 24% of the data due to threshold suppression, so spending extra time making an occasional retransmission may not be worth the added complexity"

As far as actual reliable UDP testing at high speeds goes, one might also want to consider the test scenario, as not all Ethernet connections are equal. The 2 meter passive DACs used in this probably achieve a ~10^-18 bit error rate (BER), or 1 bit error in every ~100 petabytes transferred. On the other hand, once you go optical, even with forward error correction (FEC) it's not uncommon to expect transmission loss in the real world. E.g. looking at something a little more current, https://blogs.cisco.com/sp/transforming-enterprise-applicati... is happy to call 10^-12 with FEC "traditionally considered to be 'error free'", which would likely have resulted in lost packets even in this 400 GB transfer test (though again they were fine with up to 24% loss in some cases, so I don't think they were as worried about reliability as the paper title would suggest).
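
The BER figures above are easy to sanity-check. A small sketch, assuming bit errors are independent and that "400 GB" means 400e9 bytes (both are my assumptions, as are the function names):

```python
def expected_bit_errors(bytes_transferred: float, ber: float) -> float:
    """Expected number of flipped bits for a transfer at a given BER."""
    return bytes_transferred * 8 * ber


def bytes_per_bit_error(ber: float) -> float:
    """Average number of bytes transferred per single bit error."""
    return 1.0 / ber / 8


transfer = 400e9  # the 400 GB transfer, taken as decimal gigabytes
for ber in (1e-12, 1e-18):
    print(f"BER {ber:.0e}: "
          f"{expected_bit_errors(transfer, ber):.2e} expected bit errors, "
          f"one error per {bytes_per_bit_error(ber) / 1e15:.3g} PB")
```

At 10^-12 that is about 3 expected bit errors in the transfer, each of which would fail a checksum and drop a packet; at 10^-18 it is one error per roughly 125 PB, the same ballpark as the ~100 PB quoted above.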

Generally if you have any of these: 1) unknown congestion 2) unknown speed 3) unknown tolerance for error

You'll have to do something that eats CPU time and massive amounts of buffers for reliability. If you need the best reliability you can get but you don't have the luxury of retransmitting for whatever reason then as much error correction in the upper level protocol as you can afford from a CPU perspective is your best bet.

If you want to see a modern take on achieving reliable transmission over UDP check out HTTP/3.

ignoramous · on April 19, 2020

> Generally if you have any of these: 1) unknown congestion 2) unknown speed 3) unknown tolerance for error

> ... If you want to see a modern take on achieving reliable transmission over UDP check out HTTP/3.

Not an expert but I have seen folks here complain that QUIC / HTTP3 doesn't have a proper congestion control like uTP (BitTorrent over UDP) does with LEDBAT: https://news.ycombinator.com/item?id=10546651

wmf · on April 19, 2020

LEDBAT-style congestion control is not proper for "foreground" Web traffic and it will result in lower performance than TCP-based HTTP. Fixing bufferbloat is an ongoing project and it isn't fair to blame QUIC for being no worse than TCP.

aDfbrtVt · on April 19, 2020

Traditional error free transmission in optical comms is 1E-15 BER. I can't access the EPON standard right now, but my experience with other IEEE standards would tell me they're probably guaranteeing 1E-15 for the worst-case optical link. This link is pretty close to optimal, so 400G of data is nowhere near the amount needed to say anything with certainty about the BER of the channel.

zamadatix · on April 19, 2020

IEEE only guarantees 10^-12, which is almost certainly why 1st gen 25G products released exactly when they were able to hit that. My estimate that a 2m 10G DAC from 2017 would have a BER of ~10^-18 is from personal experience (as unlikely as it sounds, I actually have done extensive testing on 7 of the exact model server and NIC in our lab, purchased about the same time, different switch though), not derived from the 400 GB transfers in the paper.

rubatuga · on April 19, 2020

TLDR:

   sysctl -w net.core.rmem_max=12582912
   sysctl -w net.core.wmem_max=12582912
   sysctl -w net.core.netdev_max_backlog=5000
   ifconfig eno49 mtu 9000 txqueuelen 10000 up
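
Raising `net.core.rmem_max` only lifts the ceiling; the receiving application still has to ask for the larger buffer on its socket. A minimal sketch of that step in Python, assuming a Linux receiver (the function name is made up, and the default size is just the value from the sysctl lines above):

```python
import socket


def open_udp_receiver(requested_rcvbuf: int = 12582912):
    """Create a UDP socket, request a receive buffer of the given size,
    and return the socket plus the size the kernel reports back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_rcvbuf)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


sock, granted = open_udp_receiver()
print(f"kernel reports a receive buffer of {granted} bytes")
sock.close()
```

On Linux the reported value is normally double the request because the kernel reserves room for its own bookkeeping, and a request above `net.core.rmem_max` is silently capped, so it is worth printing what you actually got.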

mynegation · on April 19, 2020

Relevant discussion on HN from 4 months ago of IBM’s proprietary large data transfer tool: https://news.ycombinator.com/item?id=21898072

Matthias247 · on April 19, 2020

Reading through the paper I can't see what the authors mean with "reliable transmission" there, and how they achieve it.

I only see them referencing having increased socket buffers, which then lead - in combination with the available (and non-congested) network bandwidth and their app sending behavior - to no transmission errors. As soon as you change any of those parameters it seems like the system would break down, and they have absolutely no measures in place to "make it reliable".

The right answer still seems: Implement a congestion controller, retransmits, etc. - which essentially ends up in implementing TCP/SCTP/QUIC/etc

bcoates · on April 19, 2020

Having end-to-end control of their topology in production is the measure they're using to make it reliable. Since they're saturating the link the receiver parameters are reasonably robust, the sender physically cannot burst any faster and overrun the receiver.

Retransmit-based systems are probably unusable in this application; even over the short hop, the bandwidth-delay product is probably much bigger than the buffer on the sensor. The only case where a retransmit would happen is receiver buffer overflow, which is catastrophic: the retransmit would cause even more overflow.

If you had to fix random packet loss in a system like this you wouldn't want to use retransmission, you'd need to do FEC.
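
As a toy illustration of the FEC idea (not what the paper or EPON does), here is single XOR parity over a group of equal-length packets: one extra packet per group lets the receiver rebuild any one lost packet without a retransmit. All names are hypothetical, and real schemes such as Reed-Solomon can repair more than one loss per group:

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """Parity packet: XOR of every packet in the group."""
    return reduce(xor_bytes, packets)


def recover_missing(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild a group in which at most one packet (marked None) was lost."""
    missing = [i for i, p in enumerate(received) if p is None]
    if not missing:
        return list(received)
    if len(missing) > 1:
        raise ValueError("single parity can repair only one loss per group")
    present = [p for p in received if p is not None]
    repaired = list(received)
    # XOR of the parity with all surviving packets equals the lost packet.
    repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


group = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = make_parity(group)
damaged = [group[0], None, group[2], group[3]]
print(recover_missing(damaged, parity) == group)
```

With four data packets per parity packet the overhead is 25% rather than the 100% of sending everything twice, at the cost of tolerating only one loss per group.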

aDfbrtVt · on April 19, 2020

EPON already includes a RS(255,223) ECC scheme as part of the standard.

tomohawk · on April 19, 2020

If you have a very low error rate line, the main point at which packet loss will occur for UDP is on the receiving system. If the receive buffer size is not large enough, it is possible that it can get filled up while the receiving app is doing other things, and then packets will be dropped.
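
To put a number on "doing other things": a rough estimate of how long the receiving app can stall before a full-rate stream fills the socket buffer. This assumes the sender runs at exactly line rate and ignores the per-packet bookkeeping the kernel also charges against the buffer, so real headroom is smaller. The function name is made up:

```python
def seconds_until_overflow(buffer_bytes: int, line_rate_bps: float) -> float:
    """Time for a stream at line_rate_bps to fill an empty receive buffer
    if the application reads nothing in the meantime."""
    return buffer_bytes * 8 / line_rate_bps


# The 12582912 byte rmem_max from the TLDR upthread, at 10 Gb/s.
stall = seconds_until_overflow(12582912, 10e9)
print(f"{stall * 1e3:.2f} ms of slack")
```

That works out to about 10 ms, which is not much if the receiver gets descheduled or blocks on disk.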

rubatuga · on April 19, 2020

They want reliable UDP, not TCP. They state that very clearly.

zamadatix · on April 19, 2020

Yes, but they didn't do anything to make UDP reliable. They just said, in effect, "in our test scenario we didn't notice any loss at the application layer after increasing the socket receive buffer" and called it a day, because elsewhere in the paper they noted "For some detector readout it is not even evident that guaranteed delivery is necessary. In one detector prototype we discarded around 24% of the data due to threshold suppression, so spending extra time making an occasional retransmission may not be worth the added complexity."

I think the paper meant "reliable" in a different way than most would take "reliable" to mean on a paper about networking similar to if someone created a paper about "Achieving an asynchronous database for timekeeping" and spent a lot of time talking about databases in the paper but it turns out by "asynchronous" they meant you could enter your hours at the end of the week rather than the moment you walked in/out of the door.

touisteur · on April 20, 2020

I just think they meant reliable in a 'how to dimension to greatly reduce the possible loss'. No protocol is 'fully' reliable in all dimensions (latency, message loss, throughput). Sometimes you benchmark your exact physical conf and you add large margins, add some packet loss detection mechanisms, eventually retries (but if your latency requirements are hard no dice) or duplicate the physical layer (oh god, de-duplication at 10GbE...) or just accept some losses.

I just meant 'reliable is a spectrum'...

p1necone · on April 20, 2020

Reliability in the context of networking protocols means a specific thing to me - guaranteeing packet delivery (to the extent that it is physically possible of course).

This does seem to be a technical term with a defined meaning that matches my assumption too: https://en.wikipedia.org/wiki/Reliability_(computer_networki...

wumpus · on April 20, 2020

Try considering what the authors of the paper mean.

p1necone · on April 20, 2020

If the authors of the paper are using a term that already has a specific meaning in the area they are working in, but meaning something different from that then they are making a mistake.

zelphirkalt · on April 20, 2020

Sounds like they got the wrong protocol for it then. UDP is not meant for "reliable". It's send and forget. Not sure why anyone would implement TCP on top of UDP.

bogomipz · on April 20, 2020

From the abstract:

>"In addition UDP also supported on a variety of small hardware platforms such as Digital Signal Processors (DSP) Field Programmable Gate Arrays (FPGA)"

I am curious what would be the use case for implementing a network stack and using UDP directly in a DSP chip? Perhaps I have a very narrow understanding of DSPs.

touisteur · on April 20, 2020

Well you might have to use a DSP to get the signal from your ADC to your PC for signal processing. You might find 8-core DSPs with built-in 10GbE capabilities easier to program than a 10GbE IP on a FPGA...

a_t48 · on April 19, 2020

I wish I had seen this at my last job. This is something I had to set up and it was painful - lots of trial and error.

otterley · on April 19, 2020

(2017)

fulafel · on April 19, 2020

This would be interesting to try on today's faster ethernet speeds, wonder how it goes at 100G.