If it's something like 1 Gig, then it's the OS that should be commended for the miracle of throughput, not Go. Even VB would be able to pull numbers like these with aggressive buffering.
A more sensible metric would be to measure both throughput and the longest time in transit. If you can get Go to deliver 2 million msgs/sec with sub-ms delivery time, then we'll have something to talk about.
The simple benchmark is testing the throughput of a messaging system, specifically a new server written in Go. Both the client and server are on the same machine, but communicate over a TCP/IP socket. The Pub benchmarks send messages and then make sure the connection is flushed, meaning all messages have been completely processed by the server. Processing in this case means framing, protocol parsing, and routing logic. The PubSub versions test sending and receiving all the messages back in the client, with a few variations on using multiple connections and distributed queuing.
I'm looking forward to the full suite being released so we can repeat the tests and see how it fares under real-world conditions with different message-sizes etc.
Sorry for the snark in my initial comment but I found this gist really lacking (akin to those press releases where $vendor brags about some arbitrary figure without providing any details).
I do like the simplicity of NATS and look forward to a fast server implementing it. However, for something like an MQ, where performance is the key metric and highly dependent on the chosen workload, one really shouldn't throw around numbers without backing them up thoroughly - that only hurts credibility.
Please take the write-ups by the RabbitMQ guys as a guide; they publish the source code for all their benchmarks and go to great lengths explaining them:
The code sample suggests it's a network interface at localhost, but it's not 2M syscalls or connections. It's just one big buffered write and it doesn't look like they're waiting for responses to those messages - so essentially this might just be a (pretty good?) stream processing app.
It is not one big write, but optimizations around msgs/write using buffering are used in clients, with obvious care to balance latency and throughput. In the benchmark, the write buffer is 16k, so it is flushed automatically via Go's bufio when it hits that mark, and then I flush it again when the loop is complete, flushing the remainder of the outbound buffer. I then use a PING/PONG, which is part of the NATS protocol, to only stop timing when the PONG returns, and I know all messages have been processed. NATS does have a verbose protocol flag that has all protocol frames ack'd with either +OK or -ERR.
Explain? ~600k tps on 1 Gb NICs is very achievable. 10-15M tps on 10 Gb interfaces is also reasonable. In both cases I'm thinking of UDP packets in/out measured on the wire. With modern hardware I'd buy similar numbers for TCP.
Write a C program that creates a single TCP connection and an infinite loop that just calls write(2) or send(2), then do the same for UDP. I'd be very surprised if you can cross 60k calls/sec. Given that, how can you send 2M or receive 2M messages a sec if your bottleneck can only handle 60k? I could be missing an important piece here, but there isn't much information provided in the link either.
Check out netmap for an API without these issues. I'm not sure what API the original author is using to exchange data with the wire but bulk/scatter/gather approaches are typical in high performance messaging systems.
If you have 100 Mb/s bandwidth, your theoretical limit (regardless of language) is 25600 packets/sec [1], and at roughly 290 (16B) requests per chunked-pipelined packet, you are looking at 7.4+M req/s.
Of course, latency will suffer. You can't have both max throughput and min latency. It's a choice to be made.
It's clear that the MBA is throttling the calls at the kernel level, since there is nothing at the protocol or hardware level that would necessitate something as slow as 60k calls/sec. While that speed is certainly not bad, especially for a consumer laptop, and may beat out-of-the-box Linux distributions, it doesn't mean the bottleneck is inherently hardware dependent.
Ultimately, if you want real speed doing something real simple, you're going to want to use FPGAs anyway which would be a trivial consulting fee to implement compared to a development effort that will go deep into your kernel, hardware drivers and probably protocol tuning as well.
I wrote a system that tried to push as many messages through a socket as possible, whether it was tcp, udp, unix domain, or posix message queue. The goal was to determine which IPC mechanism was best suited for highest concurrency rather than highest throughput, but I think the results are interesting for both.
I wanted to measure the amount of time it took to do the same amount of work (i.e. enqueue and dequeue one million items) for each IPC mechanism and how concurrency affected the performance. The system uses a single producer with a varying number of consumer processes.
The stream socket implementations (TCP, Unix domain stream socket) actually perform a minimum of two write()s per queue item - once for the length and once for the actual content. Both are wrapped in loops to ensure the full content is written, so occasionally more than two write()s might occur.
On one of the Linux hosts, the TCP implementation can push 1 million messages through in roughly 0.66 seconds using a single producer thread, which corresponds pretty closely to the 2 million messages/second claim for NATS. The POSIX message queue can do it in 0.48 seconds, which corresponds to more than 2 million messages/second, but POSIX message queues are a datagram implementation that only requires one mq_send() per message.
I think this shows that 2M syscalls/second is indeed possible. I made no special effort to optimize the C++ code. If you'd like to review the code or run the tests yourself, feel free to check out the code at https://github.com/adamonduty/queueable . Anyone can run the tests and submit results to view on the web interface.
Oh, and interestingly, the Macbook Pro I used to generate OS X results was the most recent hardware but among the slowest in wall-clock performance. OS X also shows a zig-zag effect as message size increases, similar to SunOS. Not sure why.
Not bad at all, that's approximately the message-passing overhead I measured in C++ on a similar CPU a while back.
I think the main utility for such a benchmark though is to establish a lower limit on theoretical per-message overhead. Any practical system is likely to want to do something interesting with the content of the messages.
But this lets us say "expend an average of at least 5 us of useful computation on each message in order to keep the overall cost of message passing below 10%".
That is correct, I am only attempting to measure the efficiency of the messaging processing engine within the server. I did want to include the network stack and the buffering portion, as well as the framing, protocol parser, and the subject based routing.
Slowly, Go is getting into more and more places. This is yet another nice replacement. It will probably be way faster than the Ruby implementation in the long run.
Mh, if I understand correctly the messaging system being tested is https://github.com/derekcollison/nats and it is not written in Go but in Ruby (+EventMachine). Or is there a Go version of NATS?
Also, this is only testing the time it takes the go client to write the messages to the socket, not the time the server takes to process the messages. So the benchmark would be the same with a noop server that reads and discards all incoming traffic. Am I wrong?
I am not the author of the gist but quoting from it: "our work on a high performance NATS server in Go." Furthermore, the tests he shows are run using Go's testing tool (which tests go code). Finally, he discusses "no use of defer" a Go language feature not in Ruby. So yes I believe it is an implementation in Go.
The test is specific to a new NATS server written in Go (gnatsd). I described the tests and what they do in more detail above, but suffice it to say that it does test the message-processing overhead of the server.
I'm curious to see if this is running with GOMAXPROCS above 1. I've seen the scheduler start to drag down reqs/sec with more than one thread in lightweight networking services like this.
It is not, but it's on my TODO list, and I have also observed similar behavior at times. I have taken care to make sure the synchronization is efficient, but running the test is needed to get the real results.
NATS[1] is a relatively simple pub-sub protocol and broker created by Derek Collison. It's originally written in Ruby. It was created as the messaging layer of Cloud Foundry.
From the technical side: routing is done using regular expressions, and the speed of internal routing is proportional to the number of different subscriptions.
Additionally, NATS has some interesting specific features - for example, if I remember correctly, messages for slow consumers are just dropped.
NATS was originally a monolithic server, but I see that some work on clustering has been done [2].