
Engineer @ Segment

NSQ has served us pretty well, but long-term persistence has been a massive concern for us. If any of our NSQ nodes goes down, it's a big problem.

Kafka has been far more complicated to operate in production, and developing against it requires more thought than NSQ (where you can just consume from a topic/channel, ack the message, and be done). On top of that, with NSQ, if you want more capacity you can just scale up your services and be done. With Kafka we had to plan how many partitions we needed, and autoscaling has become a bit trickier.
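
For context on how simple the NSQ consumer side is, here's roughly what that consume/ack loop looks like with the official go-nsq client (topic, channel, and lookupd address are made up; a sketch, not our production code):

    package main

    import (
        "log"

        nsq "github.com/nsqio/go-nsq"
    )

    func main() {
        // Consume from a topic/channel; returning nil from the handler
        // acks (FINishes) the message automatically.
        consumer, err := nsq.NewConsumer("events", "worker", nsq.NewConfig())
        if err != nil {
            log.Fatal(err)
        }
        consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
            log.Printf("got message: %s", m.Body)
            return nil // nil = ack, non-nil error = requeue
        }))
        if err := consumer.ConnectToNSQLookupd("localhost:4161"); err != nil {
            log.Fatal(err)
        }
        select {} // block forever
    }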

We now have critical services running against Kafka and started moving our whole pipeline to it as well. It's a slow process but we're getting there.

We've had to build some tooling to operate Kafka and ramp up everyone else on how to use it. To be fair, we've also had to build tooling for NSQ, specifically nsq-lookup to allow us to scale up.

We have an nsq-go library that we use in production along with some tooling: https://github.com/segmentio/nsq-go




Have you ever looked at any proprietary solutions like Google's PubSub? We've been running on PubSub for over a year now, and outside of some unplanned downtime it's scaling very well. But as we're looking to branch out of GCP, we are looking at Kafka as an alternative.

Could you comment on particular problems and challenges that you ran into?

For context, we're currently sending around 60k messages/sec, and around 1k of them contain payloads larger than 10kb.


The biggest issue with PubSub and Amazon's alternative is the cost. Being tied to a per-message cost would be a no-go.

If you can get away with using PubSub or the like, it's far easier than managing your own Kafka deployment (correctly).

If data loss is unacceptable then Kafka is basically the only open-source solution that is known for not losing data (if done correctly of course). NSQ was great but lacked durability and replication. We can guarantee that two or more Kafka brokers persisted the message before moving on. With NSQ, if one of our instances died it was a big problem.
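
To make that concrete, the "two or more brokers" guarantee comes from combining a replicated topic (e.g. replication.factor=3, min.insync.replicas=2) with acks=all on the producer. A minimal sketch with segmentio/kafka-go (broker addresses and topic are made up; assumes a recent version of the library):

    package main

    import (
        "context"
        "log"

        kafka "github.com/segmentio/kafka-go"
    )

    func main() {
        // RequireAll makes the leader wait for all in-sync replicas to
        // persist the message; combined with min.insync.replicas=2 on the
        // topic, at least two brokers have the data before the write returns.
        w := &kafka.Writer{
            Addr:         kafka.TCP("broker-1:9092", "broker-2:9092"),
            Topic:        "events",
            RequiredAcks: kafka.RequireAll,
        }
        defer w.Close()

        if err := w.WriteMessages(context.Background(),
            kafka.Message{Key: []byte("user-123"), Value: []byte(`{"event":"signup"}`)},
        ); err != nil {
            log.Fatal(err)
        }
    }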

Managing Kafka in a cloud environment hasn't been easy; it has required a lot of investment, and we have yet to move everything over to it.


The per-message cost of AWS Kinesis is extremely tiny.

If your company's recent article, Scaling NSQ to 750 Billion Messages, is an accurate count of the messages you'd put through Kinesis, then by my calculations that would cost around $11,000 in per-message fees over the entire lifetime of the system.
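
Roughly, assuming Kinesis's standard ~$0.014 per million PUT payload units and one unit per message:

    750,000,000,000 msgs x ($0.014 / 1,000,000 msgs) ≈ $10,500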

That seems like a rather trivial cost.

If you expand this analysis to include the per-shard costs, assuming perfect and constant utilization over a four year period, delivering 750 billion messages would require (assuming 1kb messages) an average of 6 active shards at $11.25 per shard-month. You can scale shards up and down dynamically, so real-world costs don't have to be wildly different from that ideal.
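
The shard math is roughly (with 1kb messages, the 1,000 records/sec per-shard write limit binds before the 1 MB/sec limit):

    750e9 msgs / (4 yr x 365 d x 86,400 s) ≈ 5,950 msgs/sec
    5,950 msgs/sec / 1,000 records/sec per shard ≈ 6 shards
    6 shards x $11.25/shard-month x 48 months ≈ $3,240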

If I were to complain about Kinesis, cost would not be my complaint. The limit of 5 reads per second per shard creates a hard floor on latency. Kafka can definitely beat that!

From an outsider's perspective, I would not dismiss Kinesis so quickly on cost alone. Lock-in and the product's actual limits seem like bigger problems.

EDIT: As an aside, don't forget to add the inter-AZ bandwidth cost into your Kafka equation if you want a true apples-to-apples comparison because Kinesis writes the messages to three availability zones.


It's not marketed as a messaging platform, but it sounds like Apache NiFi [1] may fulfill some of your needs, with much of the specialized tooling you described already built in. NiFi is very tunable to support differing needs ("loss tolerance vs. guaranteed delivery", "low latency vs. high throughput", ...). It is built to scale, it includes enterprise security features, and you can track every piece of data ever sent through it (if you want to). It includes an outstanding web-based GUI where you can immediately change the settings on all of your distributed nodes through a simple and complete interface. It features an extension interface, but it also ships with many battle-tested, commonly used plugins (Kafka, HTTP(S), SSH, HDFS, ...) so that you can gradually integrate it into your environment.

NiFi has come up a few times on HN, but I really don't think it gets the attention it deserves --- I don't know how it would perform against Kafka, NATS, NSQ, *MQ, or other messaging platforms, and unfortunately, I don't have any metrics to share. But when I see that many users are taking these messaging platforms, and building additional tooling/processes to meet needs that are already built into NiFi, I think it shines as a very competitive open-source option.

@TheHydroImpulse: Thank you for sharing your insights in this post, and explaining why your organization made these selections. Have you considered or evaluated NiFi?

[1] https://nifi.apache.org/

(Disclaimer: All posts are my own, and not sponsored by the Department of Defense.)


> If data loss is unacceptable then Kafka is basically the only open-source solution that is known for not losing data

What about RabbitMQ?


We used RabbitMQ extensively for almost two years but the problems we were encountering along the way weren't worth it. We ended up talking to the dev team too often to solve catastrophic issues that took down our whole production for hours.

We reconsidered using it again for synchronous RPC communication as we were replacing gRPC, but ended up going with nats.io instead. It has fewer features, but we're able to squeeze much more juice out of a smaller stack.


Why were you replacing gRPC?


gRPC is great, but it has a ton of small problems, including catastrophically bad documentation (in some cases we had to read the bytecode to figure out what to do).

The biggest issue for us, however, was that there is no middleman server that can route connections to available workers. We were using HAProxy, which worked OK but far from great. It was very hard to figure out how many servers needed to be running at any given point, so a ton of our requests ended up with an UNAVAILABLE response.

Essentially what we needed is a synchronous RPC over PubSub which gRPC doesn't offer.


RabbitMQ does not persist messages to disk by default. A common mistake with RMQ is treating it like something of a database.

You would need to run RMQ in HA mode with multiple brokers to have any chance of not losing data.



I've never seen that option before - I stand corrected. I'd be interested to see how having it switched on affects performance. Also, I expect there are still few guarantees against data loss in the event of failure.


Here are some details:

https://www.rabbitmq.com/confirms.html#publisher-confirms-la...

To fully guarantee that a message won't be lost, it looks like you need to declare the queue as durable, mark your message as persistent, and use publisher confirms. And it looks like this costs you several hundred milliseconds of latency.
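
For illustration, here's what that combination looks like in Go with the streadway/amqp client (connection URL and queue name are made up; a sketch, not a benchmark):

    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }

        // 1. Durable queue: survives a broker restart.
        if _, err := ch.QueueDeclare("events", true, false, false, false, nil); err != nil {
            log.Fatal(err)
        }

        // 2. Publisher confirms: the broker acks once it has handled the message.
        if err := ch.Confirm(false); err != nil {
            log.Fatal(err)
        }
        confirms := ch.NotifyPublish(make(chan amqp.Confirmation, 1))

        // 3. Persistent message: written to disk, not just kept in memory.
        err = ch.Publish("", "events", false, false, amqp.Publishing{
            DeliveryMode: amqp.Persistent,
            ContentType:  "text/plain",
            Body:         []byte("hello"),
        })
        if err != nil {
            log.Fatal(err)
        }

        if c := <-confirms; !c.Ack {
            log.Fatal("message was nacked by the broker")
        }
    }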


NATS Streaming seems to have a feature set similar to Kafka's, but it's built in Go and looks easier to set up.

https://nats.io/documentation/streaming/nats-streaming-intro...


NATS does not currently support replication (or really any high-availability settings), which is a major missing feature when comparing it to Kafka.


Thank you for the info!


Do you have anything to say about nats.io?


We're using NATS for synchronous communication, sending around 10k messages per second through it.

Must say that the stability is great, even with larger payloads (over 10MB in size). We've been running it in production for a couple of weeks now and haven't had any issues. The main limitation is that there is no federation or massive clustering. You can have a pretty robust cluster, but each node can only forward a message once, which is limiting.


You mean synchronous communication using request-reply/rpc like?


Correct
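
For reference, request-reply over NATS looks roughly like this with the nats.go client (the subject name is made up):

    package main

    import (
        "log"
        "time"

        nats "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Worker side: publish the response on the message's reply subject.
        nc.Subscribe("rpc.echo", func(m *nats.Msg) {
            nc.Publish(m.Reply, append([]byte("echo: "), m.Data...))
        })

        // Caller side: Request blocks until a reply arrives or the timeout
        // fires, which gives you synchronous RPC semantics over pub/sub.
        reply, err := nc.Request("rpc.echo", []byte("hello"), time.Second)
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("got reply: %s", reply.Data)
    }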


Out of interest, what kind of tooling did you build for Kafka?


We started out deploying our Kafka cluster as a set of N EC2 instances, but we ran into a bunch of issues (rolling the cluster, rolling an instance without moving partitions around, moving partitions around, etc.).

Now we run Kafka through ECS and wrote some tooling to manage rolling the cluster and replacing brokers. krollout(1) (currently private) basically prevents partitions from becoming unavailable while rolling.
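
krollout is private, but the core safety check behind that kind of tooling is easy to sketch: before restarting the next broker, make sure no partition is under-replicated. A rough version with segmentio/kafka-go (broker address made up; this is not krollout itself):

    package main

    import (
        "fmt"
        "log"

        kafka "github.com/segmentio/kafka-go"
    )

    func main() {
        conn, err := kafka.Dial("tcp", "broker-1:9092")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // With no topic arguments, ReadPartitions returns every partition
        // in the cluster.
        partitions, err := conn.ReadPartitions()
        if err != nil {
            log.Fatal(err)
        }

        // A partition whose in-sync replica set is smaller than its replica
        // set is under-replicated; restarting another broker now could make
        // it unavailable.
        ok := true
        for _, p := range partitions {
            if len(p.Isr) < len(p.Replicas) {
                fmt.Printf("under-replicated: %s[%d]\n", p.Topic, p.Partition)
                ok = false
            }
        }
        if ok {
            fmt.Println("no under-replicated partitions, safe to roll the next broker")
        }
    }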

Now that multiple teams are using Kafka, we've started exploring how to scale up. Each team may have different requirements, and isolation can become an issue. Likely more tooling will need to be built around this.



