From Kafka to ZeroMQ for real-time log aggregation (janczuk.org)
97 points by azaras on Sept 5, 2015 | 68 comments



I don't really understand the heavy emphasis on "real-time" here.

I mean, it's log/event aggregation for ops insight. Unless the whole system is some tightly coupled feedback loop into an unsupervised machine learning model where the whole thing has actual hard-real-time requirements (something which might well be impossible to build), there's no way that a second or two of delay between a message being created and when you can actually see it can possibly matter.

I mean you don't have someone with instantaneous reflexes and resolution ability sitting there 24/7 with their eyes peeled as a stream of thousands of log messages flies by.

The whole premise seems spurious. Is it really necessary for every startup and their uncle to delude themselves into thinking that their use case is "mission critical" or "carrier grade" or whatever?

Auth0 provides basically "login as a service". It's not like they're managing the access control to nuclear launch codes or something.

Unless some medical device manufacturer was stupid enough to make a critical surgical assistance device require an internet connection, a WAN round-trip on unreliable networks, and reliance on a 3rd party service in order to start operating it... how can this service being down possibly be anything more than an annoyance? What's the worst possible scenario? A session has to be rebuilt? A user has to make an extra login attempt?

By their own admission the service has gone down already due to the old system architecture. How many babies died?


The article is talking about webtask.io, the underlying engine for sandboxed code execution used by Auth0 for allowing customers to extend the platform with arbitrary nodejs code. In such a system there is a need to access real-time logs (think tail -f) while you are debugging your stuff.

Also, the emphasis is not just on the real-time aspect; the article mentions the issues with Kafka for HA.


Real time has a technical meaning in software and that's what he is referring to: https://en.m.wikipedia.org/wiki/Real-time_computing

Live updated logs meant for human readability shouldn't need anything hard or soft real-time in the technical sense.


Immediate (in the human sense of "immediate") visibility of a change via what is basically an ETL pipeline absolutely meets the technical meaning of "near-real-time data processing" as described in that very link. The article's use of "real-time ETL" holds up in a very technical sense.


Very interesting article! I've worked with both Kafka and ZeroMQ, and both are really useful, but they are not interchangeable when it comes to designing your solution. You'd better know which one to pick for the job.

Zookeeper is a troublesome piece of software, especially when running on highly loaded clusters. I've experienced some pretty weird stuff with Kafka + Zookeeper, where reconnection and rebalancing of topics could break availability quite dramatically.

Another thing that was not mentioned in the article is that Kafka guarantees the order of messages, while ZeroMQ doesn't. That is certainly a showstopper for event-sourcing applications.

On the other hand, ZeroMQ will happily drop messages when the pipeline is not balanced or when there is a spike in the data load. Also, better set a sane high-water mark, or you will end up consuming all your memory and probably getting killed by the OOM killer.
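For the record, that's a one-liner with pyzmq; a rough sketch (endpoint and limit are made up for illustration):

    import zmq

    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    # Bound the per-subscriber send queue; once it fills up, a PUB socket
    # drops further messages for that subscriber instead of queueing
    # without limit and eating all your memory.
    # (The mirror option on the receiving side is zmq.RCVHWM.)
    pub.setsockopt(zmq.SNDHWM, 10000)
    pub.bind("tcp://*:5556")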

Anyway, while both are good solutions, be careful not to be misled by the advertising tone of both projects. Neither of them is a silver bullet...


Note that Kafka only guarantees message ordering per partition.
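In practice you keep the order you care about by partitioning on a key, e.g. (a rough kafka-python sketch; topic and key names are made up):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    # Messages that share a key hash to the same partition, so Kafka's
    # per-partition ordering guarantee holds for that key; across
    # partitions there is no global order.
    producer.send("logs", key=b"container-42", value=b"first line")
    producer.send("logs", key=b"container-42", value=b"second line")
    producer.flush()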


Perhaps obvious/naive, but Amazon Kinesis is a great "no sysadmin required" Kafka clone that we've had a lot of success with.

As for Zookeeper, I'm sure it's better than each system trying to rebuild consensus on its own, but I am loath to ever be responsible for running a cluster.

I'm surprised Amazon doesn't have Zookeeper-as-a-service, or perhaps even better, a Zookeeper-as-a-service facade that actually uses Dynamo behind the scenes.


What's been your experience with regard to producer latency with Kinesis? My limited experimentation with it so far seems to point to that as a potential pain point for high-throughput, low-latency systems.


In my experience, Kinesis' performance makes it a nonstarter for low-latency systems and use cases with many consumers. Shards are limited to 5 reads/sec, so consumers are heavily throttled. Producers also have significant put latency. Kafka's latency and throughput are orders of magnitude better, but of course they come with the operational overhead. It depends on what your needs are, but Kafka is a better choice for "real-time" systems.


Interesting; I didn't know that Kafka had order-of-magnitude-better latency. Thanks for the data point.

I wonder if this is something inherent to Kinesis's design, or if Amazon will magically make it faster at some point in the future. Do you have a suspicion/indication either way?


Well, for one, there is no streaming socket API, just HTTP. I suspect the 5 reads/sec limit on shards is mainly due to multitenancy, but I could see this increasing in the future. Amazon has already improved the latency of Kinesis a fair amount, but it still has a ways to go before it matches Kafka.

We also have cases where we need many consumers reading a stream, so having that effectively limited to 5 concurrent consumers is a pretty tough limitation. Their solution to this is to just have consumers sleep on a failed poll, but that doesn't really help if you want to scale out your ingestion. You don't have that problem with Kafka.


Kinesis has gotten better about its latencies, but it is inherently an HTTP protocol, which already gets you into pretty big trouble for serious real-time applications.


I've only run some perf testing and haven't used Kinesis in production yet, but the end-to-end latency I measured was < 1 sec for a couple of thousand records per second using 3 shards. The reads per second are a lot lower than the writes per second because readers are expected to batch reads - you get 1000 puts/sec/shard vs. 5 reads/sec/shard, but 1 MB/sec write throughput and 2 MB/sec read throughput per shard, so you can read faster than you can write (and catch up if you fall behind). You won't get low-millisecond or microsecond responses because Kinesis is replicating messages to multiple stores for resiliency.
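As a back-of-the-envelope illustration of what those per-shard limits imply (the workload numbers below are made up):

    import math

    # Per-shard limits quoted above: 1,000 puts/sec and 1 MB/s in,
    # 2 MB/s and 5 GetRecords calls/sec out.
    records_per_sec = 2000      # hypothetical ingest target
    avg_record_bytes = 400      # hypothetical average record size

    shards_for_count = math.ceil(records_per_sec / 1000.0)
    shards_for_bytes = math.ceil(records_per_sec * avg_record_bytes / 1e6)
    print(max(shards_for_count, shards_for_bytes))   # -> 2 shards for this load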


Generally 10s of seconds to low minutes. Two caveats: 1) we batch/queue writes on the producer side, and 2) our systems are more near-line than true low-latency, e.g. we scale the cluster up/down instead of keeping machines running to keep latency as low as possible. So YMMV.

I'd be interested in your results, if you were shooting for a specific number and couldn't hit it, just for future reference.


It looks like they kept the zookeepers on the same machines as the kafka brokers. Why not split them up and have a 5-node zookeeper cluster and decouple the number of kafka nodes from the number of zookeeper nodes (especially if you're virtualizing these machines already)?


That's a great idea, because the number of Kafka nodes scales with the size of your data. Your Zookeeper nodes don't need to scale along with your data.


Given that they consider durable logging to be a key non-functional requirement, I'm not sure how ditching durability is a win.

Single ordered delivery is hard. Really hard. It's easier to allow multiple delivery and/or unordered delivery -- if your log entries are given various correlating UUIDs and signed, this makes the whole problem about fifty hojillion times easier.

Loggregator[1] and Doppler[2] use UUIDs to identify request, app and agent, which makes it easier to allow logs to arrive out of order from multiple sources and then be reassembled.
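The gist of the pattern, as a minimal sketch (this is not Loggregator's actual wire format, just the idea of correlating IDs plus a per-source sequence number):

    import json
    import time
    import uuid

    AGENT_ID = str(uuid.uuid4())   # fixed identity of the emitting agent
    _seq = 0

    def log_event(app_id, request_id, message):
        # Each entry carries enough identity to be correlated, de-duplicated
        # and re-ordered downstream, so delivery itself can be unordered
        # and at-least-once.
        global _seq
        _seq += 1
        return json.dumps({
            "agent_id": AGENT_ID,
            "app_id": app_id,
            "request_id": request_id,
            "seq": _seq,
            "ts": time.time(),
            "msg": message,
        })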

One thing that helps a lot is to separate metrics from logs. Logging frameworks get overloaded into metrics systems. Logs are useful if you are recording essentially unique events ("I started at ...", "there was a request for /foo at..."). They are wasteful of high QOS resources if you're looking at statistical data in which fine-grained event identity isn't relevant ("57Mb is used by this process", "this request took 100ms").

For metrics it is better to use lossy sampling and derive a statistical view of goings on, rather than trying to capture every single data point.
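For example, something as simple as uniform sampling usually does the trick for metrics (a sketch; the rate and the emit callback are stand-ins):

    import random

    SAMPLE_RATE = 0.01   # keep roughly 1% of observations

    def record_latency(ms, emit):
        # A uniform sample still gives a usable statistical view of the
        # distribution at a hundredth of the pipeline cost; tag the
        # sample rate so consumers can scale counts back up.
        if random.random() < SAMPLE_RATE:
            emit({"metric": "request_latency_ms",
                  "value": ms,
                  "sample_rate": SAMPLE_RATE})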

A log entry should be like a page in your personal diary on your birthday. A metric is a phone survey asking your age. If you don't answer the survey, the data is still useful even with standard error.

Edit: I forgot my usual disclaimer that I work for Pivotal Labs, a division of Pivotal, which is the leading contributor of engineering effort on Cloud Foundry.

[1] https://github.com/cloudfoundry/loggregator

[2] https://github.com/cloudfoundry/loggregator/tree/develop/src...


Read more closely. Durability was NOT a requirement.

    Since we've decided to scope out access to historical logs 
    from the problem we were trying to solve and focus only on real-time log 
    consolidation, that feature of Kafka became an unnecessary penalty without
    providing any benefits.


I read it differently, based on:

    At Auth0, high availability is the high order bit. 
    We don't stress as much about throughput or performance
    as we do about high availability. We don't have a 
    concept of a maintenance period. We need to be available
    24/7, year round. Given the nature of some of our   
    customers, if we are down, "babies will die".

Though, as you point out, they simply dropped durability from their requirements.


And that puzzles me: how useful is a logging solution which can lose log messages by design? In an environment where "babies will die"? Maybe there is durable logging somewhere else? What am I missing here?


Consider the situations where it's likely to lose log messages:

When the system is overloaded.

Exactly the type of situations where you don't want to slow the system down further by spending resources on trying to let non-essential services survive.

Presumably their tradeoff is that it's more important for the system to remain available than for every log message to be delivered. Then secondly you try to deliver log messages with as high reliability as possible.

Often it is better to design for non-essential components to fail early, or at least to prevent their resource usage from escalating and dragging down other parts of the system (in this case, the fixed buffers in 0MQ let them isolate load in one part of the system by simply capping the rate they drain the buffers at below a threshold that's normally fast enough).
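Roughly the shape of that, as a pyzmq sketch (numbers and endpoint made up; the point is just a fixed buffer plus a capped drain rate):

    import time
    import zmq

    def handle(msg):
        pass  # placeholder: forward the log line onward

    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.RCVHWM, 1000)      # fixed buffer; overflow is dropped at the publisher
    sub.setsockopt(zmq.SUBSCRIBE, b"")
    sub.connect("tcp://localhost:5556")

    MAX_PER_SEC = 500                     # cap how fast we drain the buffer

    while True:
        start = time.time()
        for _ in range(MAX_PER_SEC):
            try:
                handle(sub.recv(zmq.NOBLOCK))
            except zmq.Again:
                break                     # buffer drained for now
        time.sleep(max(0.0, 1.0 - (time.time() - start)))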


This is why you should drop metrics first, then logs.

Also, as you've alluded to, guaranteeing log availability provides an enticing cross-tenant denial of service vector.

Vomit enough log messages into a shared fabric and you can begin to affect your neighbours. ZeroMQ, if I read right, diminishes some of this risk.


The problem is that overloaded 0MQ pub sockets drop exactly the data you don't want to drop — the newest data. Maybe they handle that in the application layer, but it's not clear.


It's a problem in theory, not in practice. The notion of throwing away old data (or coalescing new with old) is relevant to slow queues. It turns out to be irrelevant with ZeroMQ.

First, message rates with ZeroMQ are often hundreds of thousands per second. The architecture must be designed so that no buffers, anywhere, overflow. If they do, you have a problem, usually a slow subscriber. Throwing out older data doesn't cure the problem. What ZeroMQ does is punish the slow subscriber by dropping so that the publisher doesn't crash. It's not recovery for the subscriber, it's protection for the publisher (and thus for other subscribers).

Second, trying to delete old messages is complex and sometimes impossible (if they're already in system buffers). The design of ZeroMQ's internal pipes has one writer and one reader, without locks. For the writer to mess with the reader would slow down everything and introduce risk of bugs. Dropping new incoming data is the only way anyone has ever found to keep things running at full speed.

These design choices were often delicate and counter-intuitive, yet they have turned out to be mostly accurate.


>trying to delete old messages is complex and sometimes impossible

I'm not familiar with ZeroMQ's data structures, so forgive my ignorance. At the high water mark, why can't the consumer throw away old messages instead of the producer throwing away new messages? There are no locks or bugs — that's what the consumer does anyway.

I'm not saying that should be the default behavior, but perhaps an option. It's cleaner than silently dropping new messages, then sending an entire buffer of old messages if the subscriber recovers.


I love the thinking behind ZeroMQ... I've been a big fan of the work you guys have been doing all the way back to Libero (I used it to clean up a blazing fast multiplexed NNTP server back in '96 or '97; the first version did all the state transitions manually).


Strange, we've had exactly the opposite experience with Kafka. It's never gone down for us unless we've done something dumb (e.g. running it out of disk). We punish Kafka and regularly publish and consume massive amounts of data to/from it. It's the one thing in our infrastructure I haven't been able to kill. Even when we ran Kafka out of disk it still kept running for quite a while.

Then again, I've never tried running stateful services on Docker. Seems like a bad time.

> Kafka/Zookeeper combo struggled to rejoin the cluster

We've had issues with the broker properly re-registering with Zookeeper, but nothing that wasn't solved by stopping the broker and starting it again after the Zookeeper session timeout elapsed.

> Large companies like Netflix may be able to spend some serious engineering resources to address the problem or at least get it under control.

We're a company of 70, of which maybe 15 qualify as backend engineers/ops. Never had an issue managing Kafka. We've accidentally deployed Kafka with insufficient heap space and it kept on ticking.

> As a result of this difference, despite Kafka being known to be super fast compared to other message brokers (e.g. RabbitMQ), it is necessarily slower than ZeroMQ given the need to go to disk and back.

Writes do, certainly. But you don't have to "go back" from disk with Kafka during steady-state operation. Kafka writes segments to disk, this ends up in the pagecache. Reads to this segment are served directly from RAM which is excellent for the fanout consumer case. A healthy Kafka cluster usually has little disk read IO (unless a consumer is catching up or batch jobs are reading older data), but lots of network IO out.

Not saying that Kafka was the right solution here, but it seems to me that guaranteed delivery and archival of logs is vastly more important than "real-time" logs; then again, what do I know? Their design of the Kafka-based system seems sketchy: colocating brokers with workers is awfully strange.

If logging went down, do babies die? If so, ZMQ seems like the wrong solution. If they just wanted to publish logs in realtime only, cool. I'm sure ZMQ will serve them well as it's a better fit for realtime non-durable publishing of data.

That said, I'd be interested in them publishing more details about their Kafka outages. The fact that we've had such contrasting experiences points to an interesting X factor in their setup that should be avoided by others.


"Why not both?" is what I was thinking. I agree with you, personally, about attempting to durably transport and store logs. If Kafka were not real-time enough for me, I would probably consider a system where I routed logs to Kafka and then into a more real-time system.


Would be interesting to see if the situation improved with Consul or Etcd. Here is a project that tries to make those look like ZooKeeper: https://github.com/glerchundi/parkeeper


Good luck with ZMQ. I've had a pretty bad experience using ZMQ in production.


It seems like people are either very enthusiastic or very disappointed by zeromq. What were your pain points, if I may ask?


I hit the following issues: memory leaks, hard-to-debug behavior, and silent disconnects.


How could you get memory leaks? It seems to me that only a very immature library should leak memory if the interface is being used correctly.


ZeroMQ is a C++ library written by a guy who hates C++:

http://250bpm.com/blog:4

He later rewrote the whole thing in C:

http://nanomsg.org/


We've also had memory leaks - not sure of the exact issue, but it's related to large scale reconnects. https://github.com/zeromq/libzmq/issues/1256


Yep. Every few days the process was oom killed because of that. When we moved away from it and used OS sockets directly, everything worked fine.


Not necessarily. It's not a good idea to assume that the library you are using is perfect and free of bugs.

OSes and popular OSS libraries have ridiculous bugs sometimes.


At least it isn't Zookeeper. Our Zookeeper cluster was the biggest PITA I've ever dealt with.


Could you explain what specifically goes wrong with ZK clusters? I keep hearing how people have issues with it, but we've been running one for ~4 months now with absolutely 0 issues.


While I've found ZK to be a massive pain, at the same time I've also found it to be the closest to actually delivering what they promise.

It's less a silver bullet, more really bad tasting werewolf repellent you have to take every 2 hours.


Could this be a good use case for NSQ? It seems simpler to operate than Kafka, and somewhat more durable than ZeroMQ.

http://nsq.io/


ZMQ presents its own set of challenges, because it's typically implemented as a linked-in native library, and this presents some interesting safety problems. In a previous life, we tried out ZMQ as a replacement for RabbitMQ for a use case that we didn't actually need long-term persistence for (real-time notifications), but we found the Java and Python ZMQ bindings to provide exciting, hard-to-debug memory leaks, as well as plain segfaults.

But I've been thinking about this for some time, because having a smart client library does allow you to do really neat things efficiently (see Fast Paxos: http://msr-waypoint.com/pubs/64624/tr-2005-112.pdf, Chubby: http://static.googleusercontent.com/media/research.google.co...). I think the ideal model is to split up responsibilities in a distributed system between nodes that are smart clients and the rest of the system; linked-in drivers can continue to use simple protocols like RPC + Protocol Buffers, and the actual system can use more complex, higher-level semantics, like dual dispatch to smooth over latency.

For whatever reason, people are more comfortable running a totally foreign binary library in their own process / memory space than running a separate binary and depending on it. I realize that operationally it's somewhat more complex to do the latter, but that's a question about maturity of methods. We've seen the Go community adopt this approach somewhat with things like the Packer plugin architecture (https://www.packer.io/docs/extend/plugins.html).


I wonder what the advantage over just plain syslog is.


IIRC, syslog is slower than file I/O, so I imagine one of the main benefits is speed (ZMQ is completely in memory)


Rsyslog has all sorts of input and output drivers, and can be set up without a spool directory, so that it's all in-memory. Don't know about syslog-ng, I'm sure it has similar support.


I stand corrected then! I have to say I don't have a huge amount of experience with syslog, but after a little bit of research it looks like it does support mostly in-memory (with some sync to disk) logging.


Syslog is just a network protocol. There are a whole bunch of implementations; there's nothing in the protocol that says it has to end up on disk.
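Right. For instance, from Python's stdlib you can point a handler at any syslog receiver over the network, and nothing ever has to touch a local disk (the hostname is a stand-in):

    import logging
    import logging.handlers

    logger = logging.getLogger("app")
    # Ships syslog datagrams to whatever is listening on loghost:514;
    # whether that receiver writes to disk, keeps things in memory, or
    # forwards them elsewhere is entirely its own configuration.
    logger.addHandler(logging.handlers.SysLogHandler(address=("loghost", 514)))
    logger.warning("hello over the wire, no local file involved")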


> Syslog is just a [family] of network protocols[, most informally specified.]

syslog-ng (my syslog of choice) has 3 'syslog' network protocols: `tcp`/`udp`, `network`, and `syslog`. Now let's play 'match the syslog-ng name to the RFC' (or lack thereof)!


Syslog didn't start out as a standard, and the RFCs basically just try to formalize what was already implemented in OSes. It's a big mess. The protocol itself is not... great.


Hi, regarding syslog-ng and RFCs:

- The udp driver is RFC3164, the original syslog protocol (also called BSD-syslog)

- The tcp driver is the same, just uses TCP instead of UDP

- The network driver is just a wrapper for the tcp+udp drivers (it also supports TLS encryption)

- The syslog driver uses the newer syslog protocol, RFC5424

HTH,

Regards, Robert syslog-ng documentation maintainer


Which syslog?


All syslogs that write to files/sync non-optimally? There is of course the risk of dropped messages on machine failure, but I think it's pretty safe to assume that at that point you're doing what ZeroMQ was going to do anyway.

I should clarify: syslog, with a naive configuration, is slower than file I/O because it does stuff and then writes to disk. Feel free to correct me if I'm wrong.


Eh, supporting newlines might be a good reason.


Great write-up, thanks for explaining the reasoning behind the switch, and including an honest comparison of Kafka and ZeroMQ (obviously they can't be compared directly as they have such different guarantees and features/use cases).


Odd that nobody mentioned RabbitMQ as an alternative. Why is that?


It's orders of magnitude slower. Might as well just use AWS SQS.

If you want throughput of 10k msgs/s per machine, do yourself a favor and use ZeroMQ. Kafka is a nightmare when your topology needs to change. ZMQ is just connecting pipes and splitting throughput.


10k/s is no problem for RabbitMQ, even on aws t2 machines. Where did you get that from? 50k msgs/s is no problem on an aws c3.xlarge instance (single queue, auto-ack and transient messages).


> 10k/s is no problem for RabbitMQ, even on aws t2 machines

Sending messages from a machine to itself, you can do even better than that, but loopbacks aren't useful metrics for message passing. http://bravenewgeek.com/tag/rabbitmq/ is the most recent attempt I have found to benchmark the various solutions, and my own testing comes in slightly below the results shown there... which I can only attribute to my unfamiliarity with RMQ, Kafka, etc. In the end, ease of maintenance only speaks to the weaknesses of solutions that need tweaking for specific topologies.


We use https://github.com/gliderlabs/logspout streaming to Elasticsearch, but this does have a few seconds of latency.

Somewhat off topic, but aren't you worried about using Docker to sandbox user scripting? Especially being a security company.


To the author -- you should give credit to the artist whose artwork you used :).

http://mikeangel1.deviantart.com/art/Metamorphosis-Franz-Kaf...


Nice architecture. The proxy behavior sounds very similar to Pushpin [1], which supports listening from a ZeroMQ SUB socket and pushing out via HTTP streaming.

[1] http://pushpin.org


It's strange to me that dropping log messages is acceptable, particularly when failure occurs in the cluster. These log messages would be important for auditing the system or customer actions.


It would be helpful if Kafka came up with a pluggable system for storing configuration instead of Zookeeper. In that case, problems such as the one you faced could be solved.


I was wondering, how does your offering differ from AWS Lambda? Have you compared whether Lambda provides any logging facility?


Does the realtime requirement rule out something like logstash for this situation?


I don't think they are using "real-time" in the computer-science sense. I don't see how a second of delay from Logstash would matter.


It's a nice development feature to see console.log() output very close to when it happens. But I'd probably be able to live with a 1.0 second delay. I've worked on servers with a bigger delay. It's not as nice as a 0.1 second delay though.

They're probably trying to optimise for making developers happy.


So the application has to be aware of the log-publishing strategy, and the application itself writes its log output directly into a ZMQ socket?



