Having played with NSQ off and on for the past few months, and having gone deeper with it over the past week or so in preparation for rolling out a production service, here are a few things that have really impressed me, in no particular order:
It's super easy to run. A few command line params, if that, and I've got a local nsqd running that I can develop and test against. Great for those offline coding sessions on BART.
The tools it ships with are very handy, specifically to_nsq and nsq_tail. Again, both make developing stuff really easy because I can get stuff in and out with fast, native command line tools or HTTP. In fact, when I was first playing around I built an entire prototype based on a few simple shell scripts and the provided tools, just to see what would happen.
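For anyone who hasn't tried it, the whole local loop is a handful of commands. A rough sketch of the kind of session I mean, assuming the default ports (nsqd listens on TCP 4150):

    # start a standalone nsqd (no nsqlookupd needed for local development)
    nsqd

    # publish a message from the shell
    echo 'hello world' | to_nsq -topic=test -nsqd-tcp-address=127.0.0.1:4150

    # tail the topic to watch messages flow through
    nsq_tail -topic=test -nsqd-tcp-address=127.0.0.1:4150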
The nsqadmin and nsq_stat tools give you a lot of visibility into what your producers and consumers are doing, how well things are performing, who is connected to what, etc. And again, part of the distribution and very easy to run locally.
The Go producer and consumer API is clean and easy to build around, pretty well documented with a lot of examples in the NSQ apps themselves.
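To give a flavor of that API, here's a minimal producer/consumer sketch using the go-nsq client (topic, channel, and address are just for illustration):

    package main

    import (
        "log"

        nsq "github.com/nsqio/go-nsq"
    )

    func main() {
        cfg := nsq.NewConfig()

        // Producer: publish one message to a topic on a local nsqd.
        p, err := nsq.NewProducer("127.0.0.1:4150", cfg)
        if err != nil {
            log.Fatal(err)
        }
        if err := p.Publish("events", []byte("hello")); err != nil {
            log.Fatal(err)
        }

        // Consumer: subscribe to the same topic on an "archive" channel.
        c, err := nsq.NewConsumer("events", "archive", cfg)
        if err != nil {
            log.Fatal(err)
        }
        c.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
            log.Printf("got: %s", m.Body)
            return nil // nil means success; the message is FIN'd
        }))
        if err := c.ConnectToNSQD("127.0.0.1:4150"); err != nil {
            log.Fatal(err)
        }
        <-c.StopChan
    }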
So far very stable. I haven't seen any crashes on the server or producer/consumer side.
Helpful and friendly people on a low noise freenode channel.
The Features page says it is "horizontally scalable (no brokers, seamlessly add more nodes to the cluster)", but the Design page says you need to run an 'nsqd' daemon - which is a message broker - i.e. you cannot embed the daemon into your application?
That said, for a larger deployment you'd still probably want to run the lookupd as well, at which point, at least for a typical architecture, running nsqd standalone is probably fine. Embedding is darn handy for testing, though.
That sentence is missing an important word. No centralized brokers, meaning the typical/recommended deployment topology is an nsqd node on all hosts producing messages.
I appreciate all the work that Tyler has put into his series of blog posts on messaging systems (and NSQ bug reports!) but I think this particular article is one that misses the mark [1].
While raw performance is important, I think comparing the guarantees and related operational and development semantics are far more interesting and useful.
These systems vary a great deal in the interfaces and functionality they expose to users building on top of them and operators dealing with them when they're deployed in production (and inevitably break!). This is ultimately what matters most.
P.S. single node (localhost) performance is essentially useless. Most of these tools are designed to be deployed as the backbone of large distributed systems (NSQ in particular), so scalability and performance in that context is a better indication of real-world performance.
We are using NSQ at Hailo for sending and receiving billions of messages every day. It scales incredibly well and is extremely reliable. We'd be happy to speak more about this with anyone who's interested.
How do you handle reliability? I'm reading http://nsq.io/overview/features_and_guarantees.html and I'm curious how you've set up your architecture to handle node failures without losing messages, or if losing messages is acceptable in your use case.
I'm also curious if you've had issues with duplicated delivery, where a message is delivered more than once, and how you've handled that case.
Hi. Good question. We actually have both use cases. Where we require reliability and guaranteed delivery, we publish to multiple nodes and use standard channels, which persist messages if subscribers disconnect or if the volume overflows what can be buffered in memory. In the other case we just publish once and can use ephemeral channels. We scale clusters depending on usage but start with a minimum of 3 nodes, and use nsqlookupd to find topics in the cluster. For consumers that can't handle duplicate deliveries, we offload deduping to simple things like memcached.
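The memcached dedup trick works because Add only succeeds when the key doesn't exist yet, so the first worker to record a message ID wins. A minimal sketch in Go using the bradfitz/gomemcache client (the key prefix and TTL are illustrative, not anything Hailo has described):

    import "github.com/bradfitz/gomemcache/memcache"

    // isDuplicate records a message ID and reports whether it was seen before.
    // memcache.Add fails with ErrNotStored if the key already exists, so only
    // the first worker to add a given ID proceeds with processing.
    func isDuplicate(mc *memcache.Client, msgID string) bool {
        err := mc.Add(&memcache.Item{
            Key:        "dedup:" + msgID,
            Value:      []byte{1},
            Expiration: 3600, // seconds; should outlive any redelivery window
        })
        return err == memcache.ErrNotStored
    }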
The standard JMS spec supports durable queues and topics. Generally a nightmare to manage though - trying to track down an unconsumed message on a clustered durable topic with lots of subscribers is like trying to find a needle in a stack of needles.
I'm glad these new projects are coming along, and this one seems to be very forthright about its limitations, but I'm just putting this in the bucket with all of the other messaging systems that provide a minimum feature set.
I haven't seen much on the market recently that offers things like (first-class) persistence, guaranteed ordering, guaranteed delivery, or any of the other more complex distribution patterns. That's why I'm still using ActiveMQ. I'm pretty happy with it, so I suppose I'm not really looking for a replacement, but I still wonder what benefit people are getting from these newer systems.
NSQ is as much about what it doesn't do as it is what it does. To a certain extent this mirrors, and was inspired by, the language's philosophy (Go) [1].
Also, NSQ was designed to replace an existing home-grown system deployed at scale. This dictated a lot of the initial requirements (and in certain cases excluded off-the-shelf tools).
When we left the experimental phase we realized we had built something that was useful to others, and it turns out that despite not having the features you've identified it can be incredibly effective in lots of use cases that don't need stronger guarantees.
[1] If I'm being honest, NSQ was a vehicle for adoption of Go at bitly as well as the project we used to learn the language. This was a huge risk at the time (almost 3 years ago) but one that has certainly paid off.
Yeah, don't get me wrong, it looks good for what it is, and certainly some of the best products come from skunkworks and hobby.
Is there anywhere I can read about the specific problem at bitly you built it to solve? I mean, I've seen a lot of these systems where there are integration and timing concerns, so some kind of message bus is useful. But if they do get to a point where things need to be distributed, sending messages without order and delivery guarantees just seems to be passing the buck to downstream systems to not mess up.
I would also like to see a queue system with better guarantees. We've been running RabbitMQ for a couple of years and are fed up with its lack of safety in the face of minor cluster failures.
I don't know about NSQ — it looks nice, but the lack of true persistence and ordering means it's not something I can use. For example, lack of ordering means you can't use it as an event bus for document-oriented changes.
I know it's not a popular sentiment these days, but I prefer to err on the side of correctness. I have had enough of systems that are supposedly SPOF-free but still fail for mysterious reasons and are nigh-impossible to debug once they do (ElasticSearch, ugh).
RabbitMQ's lack of transparency means it's harder to work with the contents of a queue. For example, if the queue processing has stalled for some mysterious reason, what's clogging it? With RabbitMQ, the only way to peek into a queue is to actually pop messages from it (using a tool such as rabbitmqadmin or amqp-dequeue).
Another problem with RabbitMQ is that messages go away after being acknowledged. I, for one, would like to see a bit of history. Sure, you can build this into individual apps, but it turns out this kind of history is beneficial for many kinds of apps. There's DLX, but you can't (afaik) set up rules to automatically copy acked messages into a history queue.
I can't say I like AMQP. Binary protocol, requires several layers of client tooling to work with. If anything goes weird, strace or tcpdump are of no use. It's also a very complex protocol whose interpretation and matrix of support features seems to change a lot.
I'm currently writing a small job manager in Go that's backed by PostgreSQL (at least initially) because I want something safer, more stable and more transparent. It's based on my realization that jobs are different from events, and should be treated differently.
For example, RabbitMQ has no conception of scheduling or prioritization, and messages are ephemeral objects that go away once they've been acknowledged, as opposed to jobs, which have a life before, during and after processing.
So far, it looks pretty good. Postgres can't process a bazillion messages per second, and it's not going to work for "big data", but it seems to scale decently enough.
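Not the parent's actual code, but the heart of a Postgres-backed job manager is usually a single atomic claim query. A sketch in Go, assuming Postgres 9.5+ for SKIP LOCKED and a hypothetical jobs table:

    import "database/sql"

    // claimJob atomically moves one due, queued job to 'running' and returns
    // its id. SKIP LOCKED lets concurrent workers pick different rows instead
    // of blocking on (or double-claiming) the same one.
    func claimJob(db *sql.DB) (int64, error) {
        var id int64
        err := db.QueryRow(`
            UPDATE jobs SET state = 'running', started_at = now()
            WHERE id = (
                SELECT id FROM jobs
                WHERE state = 'queued' AND run_at <= now()
                ORDER BY priority DESC, run_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            )
            RETURNING id`).Scan(&id)
        return id, err // sql.ErrNoRows means the queue is empty right now
    }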
Hello, for the last couple of months I've been working on a message queue of the kind you describe - that is, one biased toward providing a number of features built into the broker itself instead of delegating them to the client. I understand the case for the other approach taken by NSQ and other systems; it's just a matter of what you want to do.
However, while my queue project supports persistence, synchronous replication, delayed jobs, at-least-once and at-most-once delivery semantics, and automatic federation, there is one thing from the list of features you mentioned that I don't support: ordering. And I believe there is a strong argument for not supporting ordering in certain kinds of message queues.
By implementing only a weak form of ordering (approximated wall-clock ordering, so that jobs are usually served in roughly insertion order) we gain: availability (as in CAP, the system can continue with a single node), latency (even with synchronous replication, you only need to care that a message is replicated to N nodes, regardless of which N nodes they are), and functionality (the queue can automatically reissue messages after a specified retry time, so at-least-once delivery is trivial for consumers and producers to accomplish). Depending on how the system is designed, giving up strict ordering can also win you scalability.
So I agree with your reasoning, but my feeling is that message ordering is a big point that really changes how a message queue is shaped. My bet is that there are many problems where ordering is not needed but all the other features are.
Is your queue project published anywhere? Why did you choose to support persistence? I'm dying for a messaging library that supports only brokerless publish/subscribe with QOS support, request/reply and discovery. On top of that I want to build a persistence layer supporting 'retry' and 'replay' operations; this would allow me to build highly decoupled components, with the ability to do maintenance operations (retry) and rebuild/debug/analyse/spike environments easily (replay), a model which I find supports the type of back-end business applications I build very well.
DDS supports all of this including the QOS layer, but no request/reply unfortunately. The existing open source DDS implementations are massive code bases that I wouldn't dare go near to attempt extending, and DDS is heavily based on IDL, which I'd prefer avoiding. If I had the messaging library I could do the QOS + persistence layer, but so far I haven't seen anything that ticks all the boxes (ZMQ has the raw capabilities but I'd need to build out the publish/subscribe and request/reply protocol, which I honestly don't have the time/skill to do.)
Hello, my project is not public yet (but it will be, BSD licensed, in a matter of a few months, hopefully). However, it is not brokerless; it's a cluster of instances that you talk to via a network protocol. While messages are made durable via synchronous replication (or async, if you want), I'm adding persistence so that single-datacenter setups are viable. However, it is possible to turn off persistence.
I would assume it's the ease of making the systems highly available and/or scalable. Scaling ActiveMQ can be a pain once you exceed the capacity of a single broker, and the standard HA setup requires shared file systems with non-broken file locks, which are fairly complicated to get in cloud environments.
+1, agree with that (although you can use the DB-based persistent store).
The other obvious problem with AMQ is that it supports XA transactions, which means people are tempted to use them, which in turn leads to much pain and suffering for all concerned.
I don't think I had heard about NSQ till now; how is it better than, say, a pub-sub queue on a beefy Redis server? (Or a cluster of servers, if you wish). Nothing beats Redis' simplicity, AFAICT.
- NSQ is about as simple as Redis. Like Redis, it's one binary (and an optional second, if you want a dynamically resizable cluster)
- simplicity of clustering, just boot more servers
- messages aren't lost if the worker process takes one and then dies; they will be retried.
- optional overflow to disk, queues do not have to fit in RAM
- optionally always write to disk, for reliability
- a message on one "topic" can be queued automatically in several "channels", and the sender doesn't need to know. So messages can be diverted to monitoring, for example (see the sketch after this list).
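A minimal sketch of that fan-out with go-nsq (names illustrative, imports as in the earlier snippet): every channel attached to a topic receives its own full copy of the stream, so a monitoring consumer rides alongside the real workers without either knowing about the other.

    // Both channels receive a copy of every message published to "orders".
    for _, channel := range []string{"fulfillment", "monitoring"} {
        c, err := nsq.NewConsumer("orders", channel, nsq.NewConfig())
        if err != nil {
            log.Fatal(err)
        }
        name := channel // capture for the closure
        c.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
            log.Printf("[%s] %s", name, m.Body)
            return nil
        }))
        if err := c.ConnectToNSQD("127.0.0.1:4150"); err != nil {
            log.Fatal(err)
        }
    }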
I was under the impression NATS had no persistence and no "message transaction state" (no message acknowledgements), being more of a super fast pub-sub message bus than a distributed queue.
It looks like there isn't any kind of user-definable shard/partition key available within the topics - a message within a topic could go to any client subscribing to a channel for that topic? Is that correct?
That's obviously fine for AWS Lambda-style single event processing (maps, filters, sinks), but isn't that going to make multiple event processing (reduces, sorts) really difficult? It rules out using in-process memory or local KV storage for maintaining the necessary state across the aggregation window - you're only left with pounding some remote database, not a great strategy as per http://radar.oreilly.com/2014/07/why-local-state-is-a-fundam...
For the applications I've dealt with, it's an upside that workers distributed across a number of servers can process messages from producers distributed across a number of servers, with no user-defined partitioning. Kafka, I think, requires you to define partitions to have multiple workers consume a stream. In the NSQ case, if you need more throughput, you just spin up more workers.
Somewhat on a tangent, one of the goals of the design was to avoid any servers in the middle between the "workers" (consumers) and producers, making it such that any consuming or producing server dying or experiencing a network partition doesn't slow down any producers or consumers who can still reach each other.
You're right, this isn't really focused on replacing map-reduce-style systems like some other message streaming systems are. For one, I can't imagine a "sorting" step using nsq. It's still used for somewhat hefty data-processing tasks, in multiple stages. Each stage has as many servers spun up as needed, consuming messages directly from the previous stage and publishing to nsqd on localhost. They might consume and acknowledge multiple messages to produce one new message.
For fewer inter-server hops before hitting the database, you can run workers on the same hosts as producers. So you might have X hosts producing messages (and running their own nsqd locally, of course), and Y consumers on each host, configured to consume messages from localhost instead of using nsqlookupd to find all sources; those consumers would make direct requests to the appropriate database host. With this setup you can still use nsqadmin to monitor the queue levels on all producer hosts, and you can still archive messages to disk on some other remote hosts.
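In go-nsq terms the two wiring schemes differ only in which connect call you use; a sketch (the lookupd address is hypothetical):

    // connect wires a consumer either to the nsqd on its own host (the
    // colocated setup described above) or to nsqlookupd for discovery.
    func connect(c *nsq.Consumer, colocated bool) error {
        if colocated {
            return c.ConnectToNSQD("127.0.0.1:4150") // local nsqd, default TCP port
        }
        return c.ConnectToNSQLookupd("lookupd.internal:4161") // lookupd HTTP port
    }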
Thanks, that's a very helpful explanation. (A partition key is optional in Kafka - the DefaultPartitioner just computes a random partition regardless of key.)
I am having trouble visualizing this in NSQ:
> [A server] might consume and acknowledge multiple messages to produce one new message
I can imagine a server consuming multiple messages and performing a single write to e.g. ElasticSearch or Cassandra - in this case it doesn't matter that the batched write is a random subset of messages. But I can't imagine consuming multiple messages and emitting another message - at least not without being able to reason about what partition of data was present on that server. For example, I can't detect an abandoned shopping cart because a single shopper's events will be scattered across all servers.
Maybe I'm thinking of this wrong - can you give a live use case for consuming multiple events and emitting a single message in NSQ?
You're right, it can't do some calculation across all data for a single user or session or anything like that.
I suppose I was thinking that it might reduce "just a little", just adding stuff together as appropriate, but that wouldn't really be useful.
In practice, the way multiple events turn into a single event is that some are just thrown out, according to some static filter rule, or after checking a datastore.
NSQ is typically used in places where new data never stops flowing. So sometimes for this sort of thing, the worker updates a value in some datastore and creates a new message for the next stage with the latest value, perhaps only if the latest value reaches some threshold, in a scheme where the value decays.
I've noticed that a lot of message queuing systems have an "at least once" delivery property. What exactly does the possibility that a message could be delivered more than once buy you?
You can have "at most once", "exactly once", or "at least once" [0]. Each has a different tradeoff in terms of implementation complexity, speed, and other availability/consistency concerns.
Different applications can tolerate different delivery QoSes. For instance, if your messages are idempotent (either by design or by nature) then "at least once" is equivalent to "exactly once" and may be much faster/easier.
Generally, "exactly once" demands the least of the application and the most of the broker.
[0] I suppose you could have "any number of times" as well, but that's no guarantee at all.
The thing is, "exactly once" isn't really possible with a generic message queuing system. Let me briefly explain one way you need to think about distributed systems:
The network could cut at any moment during a transaction.
If message consumers don't send acknowledgements, and the queue considers the message sent as soon as it has been successfully written to the network connection, then it might never really be delivered and processed. You can't be sure if it really got to the destination before the network cut out, or something crashed. This is "at most once" - it may have been processed once, or not at all.
If message consumers acknowledge messages after they've finished processing them, and the queue waits for that, then the acknowledgement might be the thing that is lost. The queue sends the message to a different consumer if it doesn't get an acknowledgement before a timeout, but the message might have been processed already, and the consumer crashed or the network partitioned just before the acknowledgement. This is "at least once".
You could consider a final scheme - consumers acknowledge receiving the message, before they process it. The process could crash just after the acknowledgement, or the ack could be lost without it knowing. So this isn't "at least once" or "at most once", it's "no idea".
Or maybe they even wait for an ack that the server got the ack, before processing. The process could still crash after sending its ack or after receiving the other ack, this is still just "at most once".
"At most once" is usually not useful. Usually you want "at least once", and then logic specific to your data to resolve "more than once" in an acceptable way.
It's not that anyone benefits from having the message delivered more than once at the application level; actually, it's the opposite. The note that messages may be delivered more than once is a warning to developers who might expect exactly-once semantics.
However, it is a benefit in that allowing for more-than-once-delivery affords better performance. Think of it from an algorithmic perspective: if you want to send a message exactly once, you have to wait around for acknowledgement that your message was accepted.
If you instead don't mind sending duplicates, you can fire off the first copy, wait for an ack for a while, and then fire off again to another node (in case, perhaps, there was a network partition). If it turns out that both copies arrive at their final destination intact, your application now has to deal with duplicates. But, that's faster than the alternative.
So it puts more burden on the developer to deduplicate where necessary (or to program in a model where duplicates don't matter anyway, such as with CRDTs), in exchange for greater throughput.
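As a sketch of that trade in Go (using go-nsq producers; imports for "errors" and the client assumed): failing over between nodes is simple, but if an earlier publish actually landed and only its ack was lost, the message now exists on both nodes and deduplication becomes the consumer's problem.

    // publishWithFailover tries each nsqd in turn until one accepts the
    // message. A lost ack from an earlier node means a possible duplicate.
    func publishWithFailover(producers []*nsq.Producer, topic string, body []byte) error {
        err := errors.New("no nsqd nodes available")
        for _, p := range producers {
            if err = p.Publish(topic, body); err == nil {
                return nil
            }
        }
        return err
    }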
NSQ is a full-featured messaging platform out of the box, whereas zeromq and nanomsg are lower-level libraries that you could use to build (the same) functionality.
Looks very interesting, though their protocol seems very MQTT-like. I'd be interested to know whether they considered (and rejected) other protocols before writing their own.
MQTT is a wire protocol for connecting a message sender and receiver, but leaves the actual messaging architecture (brokers, P2P, etc.) unspecified. NSQ appears to include its own wire format, as well as a full in-memory messaging architecture.
Just want to add that the NSQ wire format looks pretty efficient as well. It seems to be defined at the binary level, which was one of the biggest wins for MQTT in terms of memory footprint. Although it would be nice if a standard protocol were used instead of creating their own.
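For concreteness, the framing in NSQ's TCP protocol spec is just a big-endian size, a frame type, and the payload; a sketch of a reader in Go:

    import (
        "encoding/binary"
        "io"
    )

    // readFrame reads one NSQ frame: [4-byte size][4-byte type][size-4 data].
    // Frame types: 0 = response, 1 = error, 2 = message.
    func readFrame(r io.Reader) (frameType int32, data []byte, err error) {
        var size int32
        if err = binary.Read(r, binary.BigEndian, &size); err != nil {
            return
        }
        if err = binary.Read(r, binary.BigEndian, &frameType); err != nil {
            return
        }
        data = make([]byte, size-4) // size includes the 4-byte frame type
        _, err = io.ReadFull(r, data)
        return
    }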
Binary formats have their drawbacks - the designer of the AMQP protocol called the wire format he created for it an "expert mistake". Admittedly he was designing a spec rather than a product, but it's worth understanding.