Kafka is a Big Deal, and it's great to see it get first-class support.
Much as Hadoop is starting to feel outdated while HDFS doesn't seem to be going anywhere, I think we'll see a lot of innovation in stream processing frameworks in the next few years, but Kafka will just keep on going.
There's a missing piece in the realtime puzzle, at least one that I haven't been able to find, for which Kafka is overkill; perhaps someone here knows of a solution:
I have tens of system endpoints connected through unreliable links (line-of-sometimes-occluded-sight, 2G and 3G WWAN; some are in vehicles, so connections are intermittent).
I just want to consistently tail their logs in a bandwidth-efficient, connection-drop-resistant way, and I can't find any standard tool that does this.
Kafka would fit the bill in general, but would require a lot of work (reading textual logs into Kafka, querying Kafka for new entries across the connection, reading from Kafka and writing back to text files) - and I'm not sure how well it deals with dropped connections.
My existing solution is to rsync the log directories (--append, --inplace) as infrequently as I can from an operational view, which is 1 minute. It is relatively bandwidth efficient (although could be much better), robust with respect to connection issues, and generally works.
However, it is less efficient than it could be: if directories have a lot of files, as /var/log often does, there's a lot of sync overhead. The delay is 1 minute instead of a couple of seconds (which is what you'd get with a simple "tail -f" over a TCP connection), and it doesn't play well with common log rotation schemes (though that's relatively easy to work around).
Does anyone have a better solution, Kafkaesque or otherwise?
How about a tiny forwarder on each device? It fopens the file and reads to the end; when the file is appended to, it can continue reading. Packet up each line and forward it to your server using ZeroMQ PUSH/PULL. If you're disconnected, messages queue up and are sent when ZeroMQ automagically reconnects.
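A minimal sketch of such a forwarder in Python with pyzmq (untested against real devices; the endpoint, file path, and polling interval are placeholder assumptions):

```python
# Hypothetical minimal log forwarder: tail a file and PUSH each
# complete line over ZeroMQ. Endpoint and path are placeholders.
import time

def tail(path):
    """Yield complete lines appended to `path`, like `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)               # start at the current end of file
        while True:
            pos = f.tell()
            line = f.readline()
            if line.endswith("\n"):
                yield line.rstrip("\n")
            else:
                f.seek(pos)        # incomplete line; re-read it later
                time.sleep(0.5)

def forward(path, endpoint="tcp://collector.example:5555"):
    import zmq                     # pyzmq; imported here so tail() has no deps
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(endpoint)         # PUSH reconnects automatically after drops
    for line in tail(path):
        sock.send_string(line)     # buffered up to the HWM while disconnected
```

One caveat: a PUSH socket only buffers up to its high-water mark in memory, so very long outages still need on-disk tracking of what has already been sent.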
Since logstash was already mentioned, it's worth putting https://github.com/elasticsearch/logstash-forwarder here, as the logstash team built it as a solution for logstash-type needs on systems that may not be able to support logstash itself.
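For reference, logstash-forwarder is configured with a small JSON file roughly along these lines (a sketch from memory; the server address, certificate path, and log paths are all placeholders):

```json
{
  "network": {
    "servers": [ "collector.example:5043" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 15
  },
  "files": [
    { "paths": [ "/var/log/*.log" ], "fields": { "type": "syslog" } }
  ]
}
```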
Probably. Do you know of a syslog/forwarder that deals properly with intermittent connections? (i.e., available for an hour, then gone for an hour, then available again, etc.)
Not sure about logstash, but a lot of its users use NXLog (http://nxlog.org) as a shipper, since it has a much lower resource footprint: there's no Java or Ruby in it. (I'm affiliated with NXLog.)
Syslog-NG PE works well for this. It can be configured to use a per-destination disk buffer, so that if the destination goes offline, messages will queue until they can be sent again.
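A sketch of what that configuration can look like (the host, port, and buffer sizes are illustrative, and exact option names vary between syslog-ng versions):

```
destination d_central {
    network("collector.example" port(6514)
        disk-buffer(
            mem-buf-size(10485760)     # bytes kept in memory
            disk-buf-size(1073741824)  # ~1 GiB queued on disk while offline
            reliable(yes)
        )
    );
};
log { source(s_local); destination(d_central); };
```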
Both this solution and easytiger's above work if disconnects happen rarely and for short periods.
But in my case, I have "30 minutes on, 5 minutes off, 90 minutes on, 90 minutes off, 2 minutes on" kinds of situations, in which anything that doesn't track what has and hasn't already been transferred will lose data. (ZeroMQ's buffers also have limited capacity and/or are tied to a process on the other side; if it restarts, the buffers are gone.)
It's surprising how long it has taken for distributed databases to become common practice. I'm glad that people are starting to move in that direction and open-source their work. This is a win for our industry overall.
We're working on a database/cache/messaging system too, http://github.com/amark/gun - it's dedicated to removing the pain that I and other JavaScript/Node.js developers have felt when managing and debugging databases (the devops and sysadmin work is frustrating).
Good luck to these guys. It's certainly an up-and-coming area with some overlap with an existing, mature market. Tibco (The Information Bus Company) is a $4B company with relatively dated technology, and Informatica is another legacy player in this space with tons of revenue. Either way, the market is certainly poised for some disruption.
In case anyone else is wondering why the article links to LinkedIn: Jay Kreps, Confluent's co-founder/CEO, was previously at LinkedIn. LinkedIn is also an investor in Confluent.
I'm a bit confused. The more I read the docs, the more it looks like Just Another Queue system. I don't really see the difference between Kafka and, say, RabbitMQ, Celery, or even statsd...
There are differences, the biggest of which is throughput. Kafka can handle incredible load. The messaging semantics are also a bit different. Here's a pretty good comparison:
Weird, that shouldn't be there. I double checked in an incognito and things seemed fine. LinkedIn blog posts are supposed to be visible with or without an account, so this must be a bug of some kind.