
Have you ever looked at proprietary solutions like Google's PubSub? We've been running on PubSub for over a year now, and outside of some unplanned downtime it's scaling very well. But as we're looking to branch out of GCP, we're evaluating Kafka as an alternative.

Could you comment on particular problems and challenges that you ran into?

For context: we're currently sending around 60k messages/sec, and around 1k of those carry payloads larger than 10 KB.




The biggest issue with PubSub and Amazon's alternative is cost. Paying a per-message fee would be a no-go at our volume.

If you can get away with using PubSub or the like, it would be far easier than managing your own Kafka deployment (correctly).

If data loss is unacceptable, then Kafka is basically the only open-source solution that is known for not losing data (if configured correctly, of course). NSQ was great but lacked durability and replication. With Kafka we can guarantee that two or more brokers have persisted a message before moving on. With NSQ, if one of our instances died it was a big problem.
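That "two or more brokers persisted it" guarantee comes from a small set of standard Kafka settings. A minimal sketch of them (the keys below are the standard Kafka config names; the exact client library and topic name are assumed):

```python
# Producer side: don't treat a send as successful until every
# in-sync replica has acknowledged the message.
producer_config = {
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # avoid duplicates on retry
}

# Broker/topic side: with replication.factor=3 and
# min.insync.replicas=2, an acknowledged write is on at least
# two brokers, and the topic stays writable if one broker dies.
topic_config = {
    "replication.factor": 3,
    "min.insync.replicas": 2,
}

def survives_one_broker_loss(cfg):
    """An acknowledged write survives the loss of
    (replication.factor - min.insync.replicas) brokers."""
    return cfg["replication.factor"] - cfg["min.insync.replicas"] >= 1
```

Getting any one of these wrong (e.g. `acks=1` on the producer) quietly reopens the data-loss window, which is part of what "done correctly" means above.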

Managing Kafka in a cloud environment hasn't been easy; it has required a lot of investment, and we have yet to move everything over to it.


The per-message cost of AWS Kinesis is extremely tiny.

If your company's recent article, Scaling NSQ to 750 Billion Messages, is an accurate count of messages you'd put through Kinesis, that would cost around $11,000 over the entire lifetime of the system in per-message fees by my calculations.

That seems like a rather trivial cost.

If you expand this analysis to include the per-shard costs, then delivering 750 billion messages over a four-year period at perfect, constant utilization would require (assuming 1 KB messages) an average of 6 active shards at $11.25 per shard-month. You can scale shards up and down dynamically, so real-world utilization doesn't have to be wildly worse than that ideal.
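Checking the arithmetic (using the prices quoted in this thread; actual AWS pricing varies by region and changes over time):

```python
import math

messages = 750e9                       # 750 billion messages
put_price_per_million = 0.014          # $ per million PUT payload units

# Per-message fees over the lifetime of the system.
per_message_cost = messages / 1e6 * put_price_per_million
# ≈ $10,500, i.e. "around $11,000"

# Average throughput over four years of constant delivery.
seconds_in_4_years = 4 * 365 * 24 * 3600
avg_msgs_per_sec = messages / seconds_in_4_years   # ≈ 5,950 msg/s

# A shard ingests up to 1,000 records/s (or 1 MB/s), so:
shards = math.ceil(avg_msgs_per_sec / 1000)        # 6 shards

# Shard-hours billed over 48 months at $11.25 per shard-month.
shard_cost = shards * 11.25 * 48                   # $3,240
```

So the per-shard cost is of the same order as the per-message fees, and both are small next to typical engineering cost of self-hosting.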

If I were to complain about Kinesis, cost would not be my complaint. The limit of 5 reads per second per shard creates a hard floor on latency: a consumer polling a single shard waits at least 200 ms between reads. Kafka can definitely beat that!

From an outsider's perspective, I would not dismiss Kinesis so quickly on cost alone. Lock-in and the product's actual limits seem like bigger problems.

EDIT: As an aside, don't forget to add the inter-AZ bandwidth cost into your Kafka equation if you want a true apples-to-apples comparison because Kinesis writes the messages to three availability zones.


It's not marketed as a messaging platform, but it sounds like Apache NiFi [1] may fulfill some of your needs, with much of the specialized tooling you described already built in. NiFi is very tunable to support differing needs ("loss tolerance vs. guaranteed delivery", "low latency vs. high throughput", ...). It is built to scale, it includes enterprise security features, and you can track every piece of data ever sent through it (if you want to). It includes an outstanding web-based GUI, where you can immediately change the settings on all of your distributed nodes through a simple and complete interface. It features an extension interface, and it ships with many battle-tested, commonly used plugins (Kafka, HTTP(S), SSH, HDFS, ...) so that you can gradually integrate it into your environment.

NiFi has come up a few times on HN, but I really don't think it gets the attention it deserves --- I don't know how it would perform against Kafka, NATS, NSQ, *MQ, or other messaging platforms, and unfortunately I don't have any metrics to share. But when I see so many users taking these messaging platforms and building additional tooling/processes to meet needs that are already built into NiFi, I think it shines as a very competitive open-source option.

@TheHydroImpulse: Thank you for sharing your insights in this post, and explaining why your organization made these selections. Have you considered or evaluated NiFi?

[1] https://nifi.apache.org/

(Disclaimer: All posts are my own, and not sponsored by the Department of Defense.)


> If data loss is unacceptable then Kafka is basically the only open-source solution that is known for not losing data

What about RabbitMQ?


We used RabbitMQ extensively for almost two years but the problems we were encountering along the way weren't worth it. We ended up talking to the dev team too often to solve catastrophic issues that took down our whole production for hours.

We reconsidered it for synchronous RPC communication when we were replacing gRPC, but ended up going with nats.io instead. It has fewer features, but we're able to squeeze much more juice out of a smaller stack.


Why were you replacing gRPC?


gRPC is great, but it has a ton of small problems, including catastrophically bad documentation (in some cases we had to read the bytecode to figure out what to do).

The biggest issue for us, however, was that there is no intermediary server that could handle and route connections to available workers. We were using HAProxy, which worked OK but far from great. It was very hard to figure out how many servers needed to run at any given point, and so a ton of our requests ended up with an UNAVAILABLE response.

Essentially, what we needed was synchronous RPC over PubSub, which gRPC doesn't offer.
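"Synchronous RPC over PubSub" is the request/reply pattern NATS provides natively: a request goes to a subject, exactly one worker in a queue group handles it, and the response comes back on a unique inbox subject, so you never route to a dead worker by hand. A toy in-memory sketch of the idea (this `Bus` class is hypothetical, for illustration only; it is not the NATS API):

```python
import uuid

class Bus:
    """In-memory stand-in for a pub/sub broker with queue groups."""

    def __init__(self):
        self.queue_groups = {}   # subject -> list of worker handlers
        self._rr = {}            # round-robin counter per subject

    def subscribe_queue(self, subject, handler):
        # Workers in the same queue group share the load on a subject.
        self.queue_groups.setdefault(subject, []).append(handler)

    def request(self, subject, payload):
        workers = self.queue_groups.get(subject)
        if not workers:
            # Mirrors the UNAVAILABLE failure mode described above.
            raise RuntimeError("UNAVAILABLE: no workers on " + subject)
        # Queue-group semantics: exactly one subscriber gets each
        # message, chosen here by simple round-robin.
        i = self._rr.get(subject, 0)
        self._rr[subject] = (i + 1) % len(workers)
        # Replies come back on a unique, per-request inbox subject.
        reply_subject = "_INBOX." + uuid.uuid4().hex
        return workers[i](payload)

bus = Bus()
bus.subscribe_queue("add", lambda req: req[0] + req[1])
result = bus.request("add", (2, 3))   # → 5
```

The broker, not the caller, decides which worker serves each request, which is exactly the routing job HAProxy was being asked to do for gRPC.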


RabbitMQ does not persist messages to disk by default. A common mistake with RMQ is treating it like a database.

You would need to run RMQ in HA mode with multiple brokers to have any chance of not losing data.



I've never seen that option before - I stand corrected. I'd be interested to see how having it switched on affects performance. Also I expect there are still few guarantees about data loss in the event of failure.


Here are some details:

https://www.rabbitmq.com/confirms.html#publisher-confirms-la...

To fully guarantee that a message won't be lost, it looks like you need to declare the queue as durable, mark your message as persistent, and use publisher confirms. And it looks like this costs you several hundred milliseconds of latency.
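Those three settings written out, roughly as you'd pass them to a RabbitMQ client (the names follow the AMQP model and pika-style conventions; this is a summary of the flags, not code runnable against a broker, and the queue name is made up):

```python
# 1. Durable queue: the queue definition survives a broker restart.
queue_args = {"queue": "orders", "durable": True}

# 2. Persistent message: delivery_mode=2 tells the broker to write
#    the message body to disk, not just hold it in memory.
message_props = {"delivery_mode": 2}

# 3. Publisher confirms: the broker acknowledges each publish, so
#    the producer knows the message actually landed.
use_publisher_confirms = True

def publish_is_loss_safe(queue_args, message_props, confirms):
    """All three are required: a persistent message in a non-durable
    queue, or a publish without confirms, can still be lost."""
    return (queue_args.get("durable", False)
            and message_props.get("delivery_mode") == 2
            and confirms)
```

The latency cost comes almost entirely from the third setting: waiting for the broker's confirm (and its fsync) before considering the publish done.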



