Write a C program that opens a single TCP connection and runs an infinite loop that just calls write(2) or send(2), then do the same for UDP. I'd be very surprised if you can exceed 60k calls/sec. Given that, how can you send or receive 2M messages a second if your bottleneck can only handle 60k? I could be missing an important piece here, but the link doesn't provide much information either.
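For anyone who wants to try the experiment described above, here is a rough sketch rather than a rigorous benchmark. It assumes loopback traffic, a 16-byte message, and a forked child that drains the socket; the function name `send_calls_per_sec` and all of those choices are made up for illustration, and the numbers will vary a lot by OS and hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Seconds on the monotonic clock. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Send 16-byte messages over loopback for roughly `seconds`, one send(2)
 * call per message. sock_type is SOCK_STREAM (TCP) or SOCK_DGRAM (UDP).
 * Returns send(2) calls per second, or -1.0 if setup fails.
 * Assumption: a forked child drains the receiving side. */
double send_calls_per_sec(int sock_type, double seconds) {
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;

    signal(SIGPIPE, SIG_IGN);          /* a dead peer should fail send(), not kill us */
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                 /* let the kernel pick a free port */

    int srv = socket(AF_INET, sock_type, 0);
    if (srv < 0) return -1.0;
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        getsockname(srv, (struct sockaddr *)&addr, &len) < 0 ||
        (sock_type == SOCK_STREAM && listen(srv, 1) < 0)) {
        close(srv);
        return -1.0;
    }

    pid_t pid = fork();
    if (pid < 0) { close(srv); return -1.0; }
    if (pid == 0) {
        /* Child: accept (TCP only) and discard everything that arrives. */
        char buf[65536];
        int in = (sock_type == SOCK_STREAM) ? accept(srv, NULL, NULL) : srv;
        if (in >= 0)
            while (read(in, buf, sizeof buf) > 0) { /* discard */ }
        _exit(0);
    }
    close(srv);                        /* the child keeps its own copy */

    double rate = -1.0;
    int fd = socket(AF_INET, sock_type, 0);
    if (fd >= 0 && connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0) {
        const char msg[16] = "0123456789abcde";   /* 15 chars + NUL = 16 bytes */
        long calls = 0;
        double start = now_sec(), elapsed = 0.0;
        do {
            if (send(fd, msg, sizeof msg, 0) < 0) break;
            calls++;
            /* Read the clock only every 1024 calls to keep timing overhead low. */
            if ((calls & 1023) == 0) elapsed = now_sec() - start;
        } while (elapsed < seconds);
        elapsed = now_sec() - start;
        rate = elapsed > 0.0 ? (double)calls / elapsed : 0.0;
    }
    if (fd >= 0) close(fd);
    kill(pid, SIGKILL);                /* the UDP reader never sees EOF */
    waitpid(pid, NULL, 0);
    return rate;
}
```

Call it from `main` as `send_calls_per_sec(SOCK_STREAM, 1.0)` and `send_calls_per_sec(SOCK_DGRAM, 1.0)` and print the two results. Loopback skips the NIC and driver, so this only measures the per-call cost in the kernel, not what a real wire would give you.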
Check out netmap for an API without these issues. I'm not sure what API the original author is using to exchange data with the wire, but bulk/scatter/gather approaches are typical in high-performance messaging systems.
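To make the scatter/gather idea concrete, here is a minimal sketch using plain POSIX writev(2), which is not netmap and not necessarily what the original author used. The helper `send_batch`, the 16-byte message size, and the batch size of 16 (the minimum IOV_MAX that POSIX guarantees) are all assumptions for illustration.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MSG_SIZE 16   /* assumed fixed message size */
#define BATCH    16   /* assumed batch size; 16 is the minimum IOV_MAX in POSIX */

/* Hand `count` messages (1..BATCH) to the kernel with a single writev(2)
 * call instead of `count` separate write(2) calls.
 * Returns the number of bytes written, or -1 on error or bad count.
 * Note: on a stream socket the write may be partial, so a real sender
 * would have to resume from the returned byte offset. */
ssize_t send_batch(int fd, char msgs[][MSG_SIZE], int count) {
    struct iovec iov[BATCH];

    if (count < 1 || count > BATCH) return -1;
    for (int i = 0; i < count; i++) {
        iov[i].iov_base = msgs[i];   /* each entry points at one message */
        iov[i].iov_len  = MSG_SIZE;
    }
    return writev(fd, iov, count);
}
```

With this shape, 16 messages cost one system call, so the per-call ceiling discussed above applies to batches rather than to individual messages.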
If you have 100Mb/s of bandwidth, your theoretical limit (regardless of language) is 25600 packets/sec [1], and at roughly 290 (16b) req per chunked-pipelined packet, you are looking at 7.4M+ req/s.
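The arithmetic above can be checked with a couple of lines of C. The helper names are made up. The comment does not state the packet size behind the 25600 figure: 512-byte packets with 100 × 2^20 bits/s is one combination that reproduces it, and that combination is purely an assumption here.

```c
/* Packets per second a link can carry; both arguments are in bytes. */
long packets_per_sec(long link_bytes_per_sec, long packet_bytes) {
    return link_bytes_per_sec / packet_bytes;
}

/* Requests per second when every packet carries a pipelined batch. */
long requests_per_sec(long pkts_per_sec, long requests_per_packet) {
    return pkts_per_sec * requests_per_packet;
}
```

`requests_per_sec(25600, 290)` gives 7,424,000, which matches the 7.4M+ figure. Under the assumed packet size, `packets_per_sec(100L * 1024 * 1024 / 8, 512)` gives 25600.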
Of course, latency will suffer. You can't have both max throughput and min latency. It's a choice to be made.
It's clear that the MBA is throttling the calls at the kernel level, since nothing at the protocol or hardware level would necessitate something as slow as 60k calls/sec. That speed is certainly not bad, especially for a consumer laptop, and it may beat out-of-the-box Linux distributions, but that doesn't mean the bottleneck isn't hardware dependent.
Ultimately, if you want real speed doing something really simple, you're going to want to use FPGAs anyway. Implementing that would cost a trivial consulting fee compared to a development effort that goes deep into your kernel, hardware drivers, and probably protocol tuning as well.
Note that the experiment was done on an MBA.