If it's something like 1 Gig, then it's the OS that should be commended for the miracle of throughput, not Go. Even VB would be able to pull numbers like these with aggressive buffering.
A more sensible metric would be to measure both throughput and the longest time in transit. If you can get Go to deliver 2 million msgs/sec with sub-ms delivery time, then we'll have something to talk about.
The simple benchmark is testing the throughput of a messaging system, specifically a new server written in Go. Both the client and server are on the same machine, but communicate over a TCP/IP socket. The Pub benchmarks send messages and then make sure the connection is flushed, meaning all messages have been completely processed by the server. Processing in this case means framing, protocol parsing, and routing logic. The PubSub versions test sending and receiving all the messages back in the client, with a few variations on using multiple connections and distributed queuing.
I'm looking forward to the full suite being released so we can repeat the tests and see how it fares under real-world conditions with different message-sizes etc.
Sorry for the snark in my initial comment but I found this gist really lacking (akin to those press releases where $vendor brags about some arbitrary figure without providing any details).
I do like the simplicity of NATS and look forward to a fast server implementing it. However, for something like an MQ, where performance is the key metric and highly dependent on the chosen workload, one really shouldn't throw around numbers without backing them up thoroughly - that only hurts credibility.
Please take the write-ups by the RabbitMQ guys as a guide; they publish the source code for all their benchmarks and go to great lengths explaining them:
The code sample suggests it's a network interface at localhost, but it's not 2M syscalls or connections. It's just one big buffered write and it doesn't look like they're waiting for responses to those messages - so essentially this might just be a (pretty good?) stream processing app.
It is not one big write, but optimizations around msgs/write using buffering are used in clients, with obvious care to balance latency and throughput. In the benchmark, the write buffer is 16k, so it is flushed automatically via Go's bufio when it hits that mark, and then I flush it again when the loop is complete, flushing the remainder of the outbound buffer. I then use a PING/PONG, which is part of the NATS protocol, to only stop timing when the PONG returns, and I know all messages have been processed. NATS does have a verbose protocol flag that has all protocol frames ack'd with either +OK or -ERR.
Explain? ~600k tps on 1 Gb NICs is very achievable. 10-15M tps on 10 Gb interfaces is also reasonable. In both cases I'm thinking of UDP packets in/out measured on the wire. With modern hardware I'd buy similar numbers for TCP.
Write a C program that creates a single TCP connection and an infinite loop that just calls write(2) or send(2), then do the same for UDP. I'd be very surprised if you can cross 60k calls/sec. Given that, how can you send 2M or receive 2M messages a sec if your bottleneck can only handle 60k? I could be missing an important piece here, but there isn't much information provided in the link either.
Check out netmap for an API without these issues. I'm not sure what API the original author is using to exchange data with the wire but bulk/scatter/gather approaches are typical in high performance messaging systems.
If you have 100 Mb/s bandwidth, your theoretical limit (regardless of language) is 25600 packets/sec [1], and at roughly 290 (16B) requests per chunked-pipelined packet, you are looking at 7.4+M req/s.
Of course, latency will suffer. You can't have both max throughput and min latency. It's a choice to be made.
It's clear that the MBA is throttling the calls at the kernel level, since there is nothing at the protocol or hardware level that would necessitate something as slow as 60k calls/sec. While that speed is certainly not bad, especially for a consumer laptop, and may beat out-of-the-box Linux distributions, it doesn't mean the bottleneck is inherently hardware dependent.
Ultimately, if you want real speed doing something real simple, you're going to want to use FPGAs anyway which would be a trivial consulting fee to implement compared to a development effort that will go deep into your kernel, hardware drivers and probably protocol tuning as well.
I wrote a system that tried to push as many messages through a socket as possible, whether it was tcp, udp, unix domain, or posix message queue. The goal was to determine which IPC mechanism was best suited for highest concurrency rather than highest throughput, but I think the results are interesting for both.
I wanted to measure the amount of time it took to do the same amount of work (i.e. enqueue and dequeue one million items) for each IPC mechanism and how concurrency affected the performance. The system uses a single producer with a varying number of consumer processes.
The stream socket implementations (TCP, Unix domain stream socket) actually perform a minimum of two write()s per queue item - once for the length and once for the actual content. Both are wrapped in loops to ensure the full content is written, so occasionally more than two write()s might occur.
On one of the Linux hosts, the TCP implementation can push 1 million messages through in roughly 0.66 seconds using a single producer thread, which corresponds pretty closely to the 2 million messages/second claim for NATS. The POSIX message queue can do it in 0.48 seconds, which corresponds to more than 2 million messages/second, but POSIX message queues are a datagram implementation that only requires one mq_send() per message.
I think this shows that 2M syscalls/second is indeed possible. I made no special effort to optimize the C++ code. If you'd like to review the code or run the tests yourself, feel free to check out the code at https://github.com/adamonduty/queueable . Anyone can run the tests and submit results to view on the web interface.
Oh, and interestingly, the Macbook Pro I used to generate OS X results was the most recent hardware but among the slowest in wall-clock performance. OS X also shows a zig-zag effect as message size increases, similar to SunOS. Not sure why.
Not bad at all, that's approximately the message-passing overhead I measured in C++ on a similar CPU a while back.
I think the main utility for such a benchmark though is to establish a lower limit on theoretical per-message overhead. Any practical system is likely to want to do something interesting with the content of the messages.
But this lets us say "expend an average of at least 5 us of useful computation on each message in order to keep the overall cost of message passing below 10%".
That is correct, I am only attempting to measure the efficiency of the messaging processing engine within the server. I did want to include the network stack and the buffering portion, as well as the framing, protocol parser, and the subject based routing.
Slowly, Go is getting into more and more places. This is yet another nice replacement. It will probably be way faster than the Ruby implementation in the long run.
Mh, if I understand correctly the messaging system being tested is https://github.com/derekcollison/nats and it is not written in Go but in Ruby (+EventMachine). Or is there a Go version of NATS?
Also, this is only testing the time it takes the go client to write the messages to the socket, not the time the server takes to process the messages. So the benchmark would be the same with a noop server that reads and discards all incoming traffic. Am I wrong?
I am not the author of the gist but quoting from it: "our work on a high performance NATS server in Go." Furthermore, the tests he shows are run using Go's testing tool (which tests go code). Finally, he discusses "no use of defer" a Go language feature not in Ruby. So yes I believe it is an implementation in Go.
The test is specific to a new NATS server written in Go (gnatsd). I described the tests and what they do in more detail above, but suffice it to say that it does test the message-processing overhead of the server.
I'm curious to see if this is running with GOMAXPROCS above 1. I've seen the scheduler start to drag down reqs/sec with more than one thread in lightweight networking services like this.
It is not, but it's on my TODO list, and I have also observed similar behavior at times. I have taken care to make sure the synchronization is efficient, but running the test is needed to get the real results.
NATS[1] is a relatively simple pub-sub protocol and broker created by Derek Collison. It's originally written in Ruby. It was created as the messaging layer of Cloud Foundry.
From the technical side: routing is done using regular expressions, and the speed of internal routing is proportional to the number of different subscriptions.
Additionally, NATS has some interesting specific features - for example, if I remember correctly, messages for slow consumers are just dropped.
NATS was originally a monolithic server, but I see that some work on clustering has been done [2].