Kafka is a Big Deal, and it's great to see it get first-class support.
Much as Hadoop is starting to feel outdated while HDFS doesn't seem to be going anywhere, I think we'll see a lot of innovation in stream processing frameworks in the next few years, but Kafka will just keep on going.
There's a missing piece in the realtime puzzle, at least one that I haven't been able to find, for which Kafka is overkill; perhaps someone here knows of a solution:
I have tens of system endpoints connected through unreliable links (line-of-sometimes-occluded-sight, 2G and 3G WWAN; some are in vehicles, so connections are intermittent).
I just want to consistently tail their logs in a bandwidth-efficient, connection-drop-resistant way, and I can't find any standard tool that does this.
Kafka would fit the bill in general, but would require a lot of work (reading textual logs into Kafka, querying Kafka for new entries across the connection, reading from Kafka and writing back to text files) - and I'm not sure how well it deals with dropped connections.
My existing solution is to rsync the log directories (--append, --inplace) as infrequently as I can from an operational view, which is 1 minute. It is relatively bandwidth efficient (although could be much better), robust with respect to connection issues, and generally works.
However, it is less efficient than it could be: if directories have a lot of files, as /var/log often does, there's a lot of sync overhead. The delay is 1 minute instead of a couple of seconds (which is what you'd get with a simple "tail -f" over a TCP connection), and it doesn't play well with common log rotation schemes (though that's relatively easy to work around).
Does anyone have a better solution, Kafkaesque or otherwise?
How about a tiny forwarder on each device? It fopens the file and reads to the end; when the file is appended to, it can continue reading. Packet up each line and forward it to your server using ZeroMQ PUSH/PULL. If you're disconnected, messages queue up and are sent when ZeroMQ automagically reconnects.
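A minimal sketch of such a forwarder in Python with pyzmq (untested against real devices; the endpoint, file path, and polling interval are placeholder assumptions):

```python
# Hypothetical minimal log forwarder: tail a file and PUSH each
# complete line over ZeroMQ. Endpoint and path are placeholders.
import time

def tail(path):
    """Yield complete lines appended to `path`, like `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)               # start at the current end of file
        while True:
            pos = f.tell()
            line = f.readline()
            if line.endswith("\n"):
                yield line.rstrip("\n")
            else:
                f.seek(pos)        # incomplete line; re-read it later
                time.sleep(0.5)

def forward(path, endpoint="tcp://collector.example:5555"):
    import zmq                     # pyzmq; imported here so tail() has no deps
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(endpoint)         # PUSH reconnects automatically after drops
    for line in tail(path):
        sock.send_string(line)     # buffered up to the HWM while disconnected
```

One caveat: a PUSH socket only buffers up to its high-water mark in memory, so very long outages still need on-disk tracking of what has already been sent.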
Since logstash was already mentioned, it's worth putting https://github.com/elasticsearch/logstash-forwarder here, as the logstash team built it as a solution for logstash-type needs on systems that may not be able to support logstash itself.
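For reference, logstash-forwarder is configured with a small JSON file roughly along these lines (a sketch from memory; the server address, certificate path, and log paths are all placeholders):

```json
{
  "network": {
    "servers": [ "collector.example:5043" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 15
  },
  "files": [
    { "paths": [ "/var/log/*.log" ], "fields": { "type": "syslog" } }
  ]
}
```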
Probably. Do you know of a syslog/forwarder that deals properly with intermittent connections? (i.e., available for an hour, then gone for an hour, then available again, etc.)
Not sure about logstash, but a lot of its users use NXLog (http://nxlog.org) as a shipper, since it has a much lower resource footprint: there's no Java or Ruby in it. (I'm affiliated with NXLog.)
Syslog-NG PE works well for this. It can be configured to use a per-destination disk buffer, so that if the destination goes offline, messages will queue until they can be sent again.
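A sketch of what that configuration can look like (the host, port, and buffer sizes are illustrative, and exact option names vary between syslog-ng versions):

```
destination d_central {
    network("collector.example" port(6514)
        disk-buffer(
            mem-buf-size(10485760)     # bytes kept in memory
            disk-buf-size(1073741824)  # ~1 GiB queued on disk while offline
            reliable(yes)
        )
    );
};
log { source(s_local); destination(d_central); };
```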
Both this solution and easytiger's above work if disconnects happen rarely and for short periods.
But in my case, I have "30 minutes on, 5 minutes off, 90 minutes on, 90 minutes off, 2 minutes on" kinds of situations, in which anything that doesn't track what has and hasn't already been transferred will lose data. (ZeroMQ's buffers also have limited capacity and/or are tied to a process on the other side; if it restarts, the buffers are gone.)
It's surprising how long it has taken for distributed databases to become common practice. I'm glad that people are starting to move in that direction and open-source their work. This is a win for our industry overall.
We're working on a database/cache/messaging system too, http://github.com/amark/gun - it's dedicated to removing the pain that I and other JavaScript/Node.js developers have felt when managing and debugging databases (the devops and sysadmin work is frustrating).
Good luck to these guys. It's certainly an up-and-coming area with some overlap with an existing, mature market. Tibco (The Information Bus Company) is a $4B company with relatively dated technology, and Informatica is another legacy player in this space with tons of revenue. Either way, the market is certainly poised for some disruption.
In case anyone else is wondering why the article links to LinkedIn: Jay Kreps, Confluent's co-founder/CEO, was previously at LinkedIn. LinkedIn is also an investor in Confluent.
I'm a bit confused. The more I read the docs, the more it looks like Just Another Queue system. I don't really see the difference between Kafka and, say, RabbitMQ, Celery, or even statsd...
There are differences, the biggest of which is throughput. Kafka can handle incredible load. The messaging semantics are also a bit different. Here's a pretty good comparison:
Weird, that shouldn't be there. I double checked in an incognito and things seemed fine. LinkedIn blog posts are supposed to be visible with or without an account, so this must be a bug of some kind.