NSQ – A realtime distributed messaging platform designed to operate at scale

rjeli · on May 31, 2017

Segment is probably the biggest NSQ user right now, and they're moving to Kafka - any employees want to weigh in? :)

TheHydroImpulse · on May 31, 2017

Engineer @ Segment

NSQ has served us pretty well but long term persistence has been a massive concern to us. If any of our NSQ nodes go down it's a big problem.

Kafka has been far more complicated to operate in production and developing against it requires more thought than NSQ (where you can just consume from a topic/channel, ack the message and be done). More to that, if you want more capacity you can just scale up your services and be done. With Kafka we had to plan how many partitions we needed and autoscaling has become a bit trickier.

We now have critical services running against Kafka and started moving our whole pipeline to it as well. It's a slow process but we're getting there.

We've had to build some tooling to operate Kafka and ramp up everyone else on how to use it. To be fair, we've also had to build tooling for NSQ, specifically nsq-lookup to allow us to scale up.

We have an nsq-go library that we use in production along with some tooling: https://github.com/segmentio/nsq-go

doh · on May 31, 2017

Have you ever looked at any proprietary solutions like Google's PubSub? We're running on PubSub for over year now and outside of some unplanned downtimes it's scaling very well. But as we're looking to branch out out of GCP we are looking at Kafka as an alternative.

Could you comment on particular problems and challenges that you ran into?

For the context, we're currently sending around 60k messages/sec and around 1k of them contains data larger than 10kb.

TheHydroImpulse · on May 31, 2017

The biggest issue with PubSub and Amazon's alternative is the cost. Being capped at a per-message cost would be a no go.

If you can get away with using PubSub or the like it would be far easier than to manage your own Kafka deployment (correctly).

If data loss is unacceptable then Kafka is basically the only open-source solution that is known for not losing data (if done correctly of course). NSQ was great but lacked durability and replication. We can guarantee that two or more Kafka brokers persisted the message before moving on. With NSQ, if one of our instances died it was a big problem.

Managing Kafka in a cloud environment hasn't been easy and required a lot of investment and we have yet to move everything over to it.

ryanworl · on June 1, 2017

The per-message cost of AWS Kinesis is extremely tiny.

If your company's recent article, Scaling NSQ to 750 Billion Messages, is an accurate count of messages you'd put through Kinesis, that would cost around $11,000 over the entire lifetime of the system in per-message fees by my calculations.

That seems like a rather trivial cost.

If you expand this analysis to include the per-shard costs, assuming perfect and constant utilization over a four year period, delivering 750 billion messages would require (assuming 1kb messages) an average of 6 active shards at $11.25 per shard-month. You can scale these up and down dynamically, so real-world efficiency doesn't have to be wildly different.

If I were to complain about Kinesis, cost would not be my complaint. The limit of 5 reads per second per shard creates a hard floor on latency. Kafka can definitely beat that!

From an outsider's perspective, I would not dismiss Kinesis so quickly on cost alone. Lock-in and the product's actual limits seem like bigger problems.

EDIT: As an aside, don't forget to add the inter-AZ bandwidth cost into your Kafka equation if you want a true apples-to-apples comparison because Kinesis writes the messages to three availability zones.

user454545 · on June 1, 2017

It's not marketed as a messaging platform, but it sounds like Apache NiFi [1] may fulfill some of your needs, with many of the specialized tooling you described, already built in. NiFi is very tunable to support differing needs ("loss tolerance vs. guaranteed deliveries", "low latency vs. high throughput", ...). It is built to scale, it includes enterprise security features, and you can track every piece of data ever sent through it (if you want to.) It includes an outstanding web-based GUI, where you can immediately change the settings on all of your distributed nodes, through a simple and complete interface. It features an extension interface, but it contains many battle-tested commonly-used plugins (Kafka, HTTP(s), SSH, HDFS, ...) so that you can gradually integrate it into your environment.

NiFi has come up a few times on HN, but I really don't think it gets the attention it deserves --- I don't know how it would perform against Kafka, NATS, NSQ, *MQ, or other messaging platforms, and unfortunately, I don't have any metrics to share. But when I see that many users are taking these messaging platforms, and building additional tooling/processes to meet needs that are already built into NiFi, I think it shines as a very competitive open-source option.

@TheHydroImpulse: Thank you for sharing your insights in this post, and explaining why your organization made these selections. Have you considered or evaluated NiFi?

[1] https://nifi.apache.org/

(Disclaimer: All posts are my own, and not sponsored by the Department of Defense.)

ngrilly · on June 1, 2017

> If data loss is unacceptable then Kafka is basically the only open-source solution that is known for not losing data

What about RabbitMQ?

doh · on June 1, 2017

We used RabbitMQ extensively for almost two years but the problems we were encountering along the way weren't worth it. We ended up talking to the dev team too often to solve catastrophic issues that took down our whole production for hours.

We reconsidered using it again for a synchronous RPC communication as we were replacing gRPC, but ended up going with nats.io instead. It does have less fearures but we are able to squeze much more juice on a smaller stack.

derekperkins · on June 1, 2017

Why were you replacing gRPC?

doh · on June 1, 2017

gRPC is a great, but has a ton of small problems including a catastrophic documentation (in some cases we had to read byte code to figure out what to do).

The biggest issue for us was however was that there is no middle server that could handle an route connections to available workers. We were using haproxy which worked ok but far from great. It was very hard to figure out how many servers need to run at any given point and thus a ton of our requests were ending up with UNAVAILABLE response.

Essentially what we needed is a synchronous RPC over PubSub which gRPC doesn't offer.

mafro · on June 1, 2017

RabbitMQ does not persist messages to disk. A common mistake with RMQ is treating it like something of a database.

You would need to run RMQ in HA mode with multiple brokers to have any chance of not losing data.

ngrilly · on June 1, 2017

What about this: https://www.rabbitmq.com/persistence-conf.html ?

mafro · on June 1, 2017

I've never seen that option before - I stand corrected. I'd be interested to see how having it switched on affects performance. Also I expect there are still few guarantees about data loss in the event of failure.

ngrilly · on June 1, 2017

Here is some details:

https://www.rabbitmq.com/confirms.html#publisher-confirms-la...

To fully guarantee that a message won't be lost, it looks like you need to declare the queue as durable + to mark your message as persistent + use publisher confirm. And it looks like this costs you several hundreds milliseconds of latency.

bstahlhood · on May 31, 2017

NATS Streaming seems to be similar to Kafka feature set, but built using Go and looks to be easier to setup.

https://nats.io/documentation/streaming/nats-streaming-intro...

kasey_junk · on June 1, 2017

NATS does not support replication (or really any high availability settings) currently. Which is a major missing feature when comparing it to Kafka.

bstahlhood · on June 1, 2017

Thank you for the info!

molszanski · on May 31, 2017

Do you have anythings to say about nats.io?

doh · on May 31, 2017

We're using nats for synchronous communication, sending around 10k messages each second through it.

Must say that the stability is great, even with larger payloads (over 10MB in size). We're running it in production for couple of weeks now and haven't had any issues. The main limitation is that there is no federation and massive clustering. You can have a pretty robust cluster, but each node can only forward once, which is limiting.

rafaeljesus · on June 1, 2017

You mean synchronous communication using request-reply/rpc like?

doh · on June 1, 2017

Correct

ah- · on May 31, 2017

Out of interest, what kind of tooling did you build for Kafka?

TheHydroImpulse · on May 31, 2017

We started deploying our Kafka cluster as a set of N EC2 instances but we started running into a bunch of issues (rolling the cluster, rolling an instance without moving partitions around, moving partitions around, etc...)

Now we run Kafka through ECS and wrote some tooling to manage rolling the cluster and replacing brokers. krollout(1) (currently private) basically prevents partitions from becoming unavailable while rolling.

Now that multiple teams are using Kakfa we started exploring how to scale up. Each team may have different requirements and isolation can become an issue. Likely more tooling will need to be built around this.

caffeineninja · on May 31, 2017

We run 1.5 million messages per minute through our NSQ framework and we're starting to run into architectural limitations i.r.t to the number of workers in each producer/consumer pool, and are now testing/benchmarking Kafka in our staging environment.

Roritharr · on May 31, 2017

I think the biggest question is, do you think its feasible to directly start with Kafka instead of NSQ or does Kafka just require a much stronger/larger team to operate than NSQ?

be_erik · on June 1, 2017

We debated the same question and went with NSQ for now. We might need some of the guarantees that Kafka makes in the longterm or for some specific use cases, but for a no thrills distributed messaging platform that is incredibly simple to operate at scale, NSQ is pretty fantastic.

Building client libraries is also a joy. I blogged about it a bit here: https://product.reverb.com/how-to-write-an-nsq-consumer-in-g...

rafaeljesus · on June 1, 2017

Nice one! Just as a info, I wrapped producer/consumer in a pkg https://github.com/rafaeljesus/nsq-event-bus also it exposes request-reply/rpc like

kasey_junk · on May 31, 2017

Kafka and NSQ have widely variant promises around things like durability, order, etc.

In most use cases you can get NSQ like behavior out of Kafka and the inverse isn't true. Kafka's performance and added gaurantees come at the expense of being harder to operate.

stavros · on May 31, 2017

Can someone summarize the promises? Specifically, would NSQ work well as an easier-to-operate, Kafka alternative, or are there low-throughput use cases it's just not suitable for?

kasey_junk · on June 1, 2017

People tend to talk about Kafka in the same conversations as messaging systems because it can support messaging use cases, but that leads to a mental model of what Kafka is (and conversely what messaging systems tend to be) that is incorrect.

Its better to think of Kafka as a distributed log service than a messaging broker. When described this way, don't think log as in "the things humans look at to debug applications coming out of stdout" but "the things machines look at as a storage data structure". Think the write ahead log in a database and not printf statements.

What this means is that under the covers Kafka is a bunch of ordered files being written to by producers. Consumers can specify where in the log they want to start consuming from and then "tail" the log once they are caught up. The architecture is also such that consumers are very light weight (a tcp/ip connection and an offset in the log). This also makes trivial things like the "late joiner" problem in messaging systems and durability. Kafka then layers on high availability and consistency configurations that mean that you can be sure that your published log entries are 1) stored on multiple machines and 2) ordered the same for everyone. The combination of those 2 things is very powerful and makes reasoning about distributed systems application data much simpler. There are also certain classes of problems that need to be solved with those promises, namely things that are not idempotent.

NSQ is a much more traditional buffered messaging system. It has file durability but only as a) an optimization to prevent message loss once memory runs out and b) as a consumer archive. But a hard loss of a node means that those messages that have not been delivered can be lost as there is no promise they are published somewhere else. Further, there is no promise that the order of messages published to a topic and channel is the order of messages received by the consumer. Dealing with late joiners is an application level concern as is archive and replication.

That said, Kafka is complicated. I think its complicated because its solving a complicated problem not because its poorly factored (though I'd love it if they built the consensus stuff in directly and removed the zookeeper dependency).

NSQ isn't complicated and is easy to operate. If your problem set falls into a more traditional messaging domain that fits the NSQ model, you are almost certainly better off with it, but you can likely use Kafka also. If your problem set falls into the write ahead log model, you can't use NSQ (without massive application level logic) but you can use Kafka.

stavros · on June 1, 2017

That's a great comparison, thank you. Basically, the "log structure vs in-memory queue" distinction at the start crystallizes the differences, thanks for the reply.

TheHydroImpulse · on May 31, 2017

NSQ has been (IMO) far far easier to operate than Kafka has been. With Kafka you need a Zookeeper cluster in addition to your Kafka brokers. Not to mention developing against NSQ is pretty simple whereas Kafka you need to think about partitions and offsets.

If you're worried about data loss, Kakfa can be what you're looking for (but takes a lot to learn how to operate it correctly).

stavros · on May 31, 2017

I see, thank you. Honestly, NSQ sounds like a really good solution for 99% of use cases.

rthille · on May 31, 2017

I think it's less about throughput and more about durability guarantees. Kafka producers can specify the number of 'acks' (brokers which have written the message to disk) when they produce a message and the request will only return successfully if that can be done.

stavros · on May 31, 2017

I see, thank you. Can't NSQ do that, or at least get probabilistically close to it with multiple nodes (i.e. doesn't having more nodes reduce the probability of data loss)?

be_erik · on June 1, 2017

Absolutely, but you can't rely on it to be persistent like Kafka. We use it extensively and are incredibly happy with it, but we follow best practices around not putting state in messages, making changes idempotent, and ensuring that we can always replay a message if needed.

We've yet to lose any messages in production, but it could happen and we're okay with the tradeoffs between that and the operational complexity of kafka.

stavros · on June 1, 2017

Yeah, at-least-once is definitely the way to go, sounds like you've got a good architecture around that.

Another nice thing I've heard with Kafka is that it can store all messages since the beginning of time (if you want) and you can replay them to retrieve all your MQ-related state. Does NSQ do that, do you know?

be_erik · on June 1, 2017

NSQ does nothing like this out of the box. You could create a channel for each topic that accumulates messages, but those messages would not be distributed across the cluster and after consuming them once they would be effectively gone. If that feature is a hard requirement NSQ wouldn't fit your use case.

stavros · on June 1, 2017

I see, thank you. I'm a bit confused by "after consuming them once they would be effectively gone". Surely NSQ supports more than one subscribers, and won't remove messages after they've been delivered to a single client?

kasey_junk · on June 1, 2017

It supports more than one subscriber as long as you have a separate channel for each. But if the channel gets created after some messages have been delivered, they will not be delivered to the new channel, only the messages after that.

To do replay with nsq (one way is) you'd attach their file consumer to its own channel as the first operation on a topic. Then in application code you'd need to read from that file whenever you need to do a replay (and you'd need to know when/where to read from/to).

stavros · on June 1, 2017

That makes sense, thank you.

jehiah · on June 1, 2017

With NSQ there is a built in utility nsq_to_file which just becomes one additional consumer you'd use to archive each message topic to disk. It provides dead simple archiving of messages, but doesn't provide any native replay ability.

wand3r · on May 31, 2017

If you are starting off just use what you/the team is comfortable with. I mean, unless you think you will need capacity for over 25,000 messages per second.

azinman2 · on May 31, 2017

25k messages total, or per second?

wand3r · on May 31, 2017

Thanks. Fixed.

titobrown · on June 1, 2017

Kafka is much harder to operate in a production environment, I would only start there if you have a specific reason to.

matticakes · on May 31, 2017

I'm one of the original authors, happy to answer any questions.

agentultra · on May 31, 2017

There is quite a bit of documentation on the design but I haven't seen anything more specific along the lines of a TLA+, Lean, etc specification.

There are plenty of projects like this and I'm curious how they go about creating specifications, checking their designs, etc.

Would the project benefit from a formal model or proofs? A colleague and I started a side project to provide specifications for core Openstack components but we're keeping our minds open to other projects as well.

est · on June 1, 2017

The nsqadmin was written in backbone, it also requires statsd and graphite which are kinda obsolete these days.

je42 · on May 31, 2017

what are the typical use cases for NSQ ?

ryan_lane · on June 1, 2017

A use case that I really like is as a sidecar on every EC2 instance for local async store-and-forward, where you need to deliver data somewhere, but want to be able to handle large bursts of traffic without a massive spike in latency.

This use case assumes you generally don't trust the network and really, really don't want to block if the network is temporarily flaky.

jud_white · on May 31, 2017

* Disclosure: I sometimes contribute to NSQ.

We use NSQ at Dell for the commercial side of dell.com. We've been in Production with it for about 2 years.

> what are the typical use cases for NSQ ?

In the abstract, anything which can tolerate near real-time, at-least-once delivery, and does not need order guarantees. It also features retries and manual requeuing. It's typical to think order and exactly-once semantics are important because that's how we tend to think when we write code and work with (most) databases, and having order allows you to make more assumptions and simplify your approach. It typically comes at the cost of coordination or a bounded window of guarantees. Depending on your workload or how you frame the problem you may find order and exactly-once semantics are not that important, or it can be made unimportant (for example, making messages idempotent). In other cases order is important and it's worth the tradeoff; our Data Science team uses Kafka for these cases, but I'm not familiar with the details.

Here are some concrete examples of things we built using NSQ, roughly in the order they were deployed to PROD:

- Batch jobs which query services and databases to transform and store denormalized data. We process tens of millions of messages in a relatively short amount of time overnight. The queue is never the bottleneck; it's either our own code, services, or reading/writing to the database. Retries are surprisingly useful in this scenario.

- Eventing from other applications to notify a pinpoint refresh is needed for some data into the denormalized store (for example, a user updated a setting in their store, which causes a JSON model to update).

- Purchase order message queue, both for the purpose of retry and simulating what would happen if a customer on a legacy version of the backend was migrated to the new backend; also verifying a set of known 'good' orders continue to be good as business logic evolves (regression testing).

- Async invoice/email generation. This is a case where you have to be careful of at-least-once delivery and need to use a correlation ID and persistence layer to define a 'point of no return' (can't process the message again beyond this point even if it fails). We don't want to email (or bill) customers twice.

- Build system for distributing requests to our build farm.

- Pre-fetching data and hydrating a cache when a user logs in or browses certain pages, anticipating the likely next page to avoid having the user wait on these pages for an expensive service call. The client in this case is another decoupled web application; the application emitting the event is completely separate and likely on a different deployment schedule from the emitting application. The event emitted tells us what the user did, and it's the consumer's responsibility to determine what to do. This is an interesting case where we use #ephemeral channels, which disappear when the last client disconnects. We append the application's version to the channel name so multiple running versions in the same environment will each get their own copy of the message, and process it according to that binary's logic. This is useful for blue/green/canary testing and also when we're mid-deployment and have different versions running in PROD, one customer facing and one internal still being tested. I think I refer to this image more than any other when explaining NSQ's topics and channels: https://f.cloud.github.com/assets/187441/1700696/f1434dc8-60... (from http://nsq.io/overview/design.html).

Operationally, NSQ has been not just a pleasure to work with but inspirational to how we develop our own systems. Being operator friendly cannot be overrated.

Last thing, if you do monitoring with Prometheus I recommend https://github.com/lovoo/nsq_exporter.

justinsaccount · on May 31, 2017

probably better off looking at

http://word.bitly.com/post/33232969144/nsq

http://nsq.io/

TheHydroImpulse · on May 31, 2017

A bit of an older article but still useful: https://segment.com/blog/scaling-nsq/

be_erik · on June 1, 2017

We use it to sync data stores for denormalization purposes (think syncing postgres tables into ElasticSearch) and event based communication between small services.