LogZoom: A fast, lightweight substitute for Logstash/Fluentd in Go (packetzoom.com)
149 points by whost49 on April 8, 2016 | 39 comments



Operationally speaking, the single most important thing you should be doing is collecting application and system logs and keeping them easily accessible and usable (and check your backups every now and again). I say this because of the value you gain relative to the small cost. You're being your own worst enemy if you aren't staying on top of error logs.

The OSS solutions are mature and simple to set up. And it isn't something you need to get absolutely correct with 100% uptime. If you're an "average" company, a single server running Logstash+ES+Kibana is probably good enough. There are only two ways you can do this wrong: not doing it at all, or forwarding non-actionable items (which creates a low signal-to-noise ratio, and people will just ignore it).

After that comes metrics (system and application), which is important, but not as trivial to set up.

Quickly looking at LogZoom, I think the more forwarding options we have, the better. They make it very clear that, unlike Logstash, this doesn't help structure data. On one hand, I don't think that's a big deal. Again, if you're only writing out actionable items, and if you're staying on top of it, almost anything that moves data from your app/servers onto ES+Kibana (or whatever) is going to be good enough.

On the flip side, adding structure to the logs can help as you grow. Grouping/filtering by servers, locations, types (app vs system), versions...is pretty important. I like Logstash; I actually think it's fun to configure (the grok patterns), and it helps you reason about what you're logging and what you're hoping to get from those logs.


PacketZoom founder here. Glad you liked the project. Could not agree more with the importance of tracking logs (and metrics... but that's a topic for another post).

To respond to your point about the absence of a Grok-like facility, avoiding the need to unmarshal and remarshal the data while passing through LogZoom was an explicit design requirement. The blog post refers to our pain with Logstash/Fluentd etc. We were in a situation where our production code was fighting for resources against a log-collecting facility.

In general, it's best to process the data (to the extent possible) closest to its point of origin. It's orders of magnitude cheaper to create a well-structured log line straight from your production code (where it's just some in-memory manipulation of freshly created strings) than in a post-processing step inside a separate process (or on a separate machine).
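For illustration only, here's a minimal Go sketch of that idea (the field names are just examples): the application emits an already-structured JSON line, and nothing downstream needs to parse it again.

    package main

    import (
        "encoding/json"
        "os"
        "time"
    )

    // logEvent is structured at the source, so downstream shippers can
    // forward the line as-is without an unmarshal/remarshal step.
    type logEvent struct {
        Time    time.Time `json:"ts"`
        Level   string    `json:"level"`
        Service string    `json:"service"`
        Message string    `json:"msg"`
    }

    func main() {
        enc := json.NewEncoder(os.Stdout) // one JSON object per line
        enc.Encode(logEvent{
            Time:    time.Now().UTC(),
            Level:   "error",
            Service: "api",
            Message: "upstream timeout",
        })
    }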

I've spent years dealing with performance problems in global-scale production stacks, and a surprisingly high number of resource bottlenecks (memory/CPU/disk IO) are caused by ignoring this simple principle.

I've lost count of the cases where a simple restructuring of the architecture to avoid a marshal/unmarshal step drastically cuts down resource requirements and operational headaches. Unfortunately, a whole lot of industry "best practices" (exemplified by the Grok step in Logstash) encourage the opposite behavior.


>To respond to your point about the absence of a Grok-like facility, avoiding the need to unmarshal and remarshal the data while passing through LogZoom was an explicit design requirement. The blog post refers to our pain with Logstash/Fluentd etc.

I think there are two different (CPU) performance problems conflated into one:

(1) The cost of parsing logs with something like Grok and Regexp

(2) The cost of marshaling and unmarshaling data

While both do cost CPU time, based on my experience having talked to literally hundreds of Fluentd users (I'm a maintainer and was a core support member for a while), the cost of (1) dwarfs the cost of (2). (2) is pretty cheap if you use efficient serializers like MessagePack. As for (1), both Logstash and Fluentd support an option to perform zero parsing (in Fluentd, it's "format none"). By using these options, you can bring down CPU time significantly.
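For reference, the "no parsing" setup in Fluentd is just a tail source with format none, along these lines (the path, pos_file, and tag are placeholders):

    <source>
      @type tail
      path /var/log/app/current.log
      pos_file /var/log/td-agent/app.log.pos
      tag app.raw
      format none
    </source>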

All of this being said, it looks like LogZoom isn't a true competitor to Fluentd or Logstash or Heka. It made different performance/functionality trade-offs, and by doing less, it saves more CPU time: if you forgo the option of parsing logs at the source (and in Logstash and Fluentd's defense, they do a whole lot more), you obviously can save resources. On the flip side, you need to post-process your logs to make them useful, and some other servers downstream will pay the CPU cost (you might not care about this because your logs have been thrown over the fence and are now the data engineers' job =p)


I think you make a good point that logs should be transformed closer to the source. I work primarily with applications provided by a vendor, with very unstructured log data. Transforming (Grok) these logs is an absolute must; we couldn't look at something that didn't allow transformation. That said, maybe we should be looking at something closer to the source before handing it off to a central location. Are you aware of agent-like daemons that do transformation before handoff?


Structured logs are awesome and a great idea. For the next few decades, while standards come and go and everyone gets them implemented across the board, yes, it sucks to write grok patterns for the flavor of the week. But once you do it a few times, it takes maybe a few hours of work to get some app cluster with moderately verbose logging flowing into ES with all the right types and all the edge cases accounted for. From there, ELK is such a Swiss Army knife that it's worth the trouble: it's then trivial to fire PagerDuty alerts off exception-level log lines, post metrics about your logs, or put them on a queue to feed some big data pipeline.
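For instance, a grok filter for a typical app log line can be as small as this (the field names are whatever you pick):

    filter {
      grok {
        match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:detail}" }
      }
    }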


You might want to consider NXLog if you need to do transformation at the source. For us this was an explicit design goal. Moreover, it is lightweight, and a lot of people use it in place of other fat and bulky solutions; it's quite popular with ELK users.


We use Heka and love it.


I wonder if they considered Heka (https://hekad.readthedocs.org/en/v0.10.0/), made by Mozilla? It's written in Go and, as far as I can tell, solves many of the same problems and more.


We use Heka in production, and it has turned what would have been an ELK stack into an EHK stack, with zero regrets. The fact that the collector nodes can ship directly to ES without having to go through an intermediary node is extremely valuable; it reduces system complexity a lot.


Heka looks good and does a lot more. It doesn't appear to support Redis and S3 out of the box, though, so had we known about Heka beforehand, we would probably have had to evaluate, learn, and modify third-party plugins.


I looked into Heka a while back and was turned off by its seeming reliance on Lua scripts for almost every feature. As a result, it seemed overcomplicated to deploy and maintain for a Go app, and I didn't understand why all that functionality didn't come built in.

Is this unfair? Perhaps I misunderstood the docs.


If you want to write your own plugins, you can do them in Lua or Go. Lua is the recommended way for most types - I think matching is supposed to be faster, and you also don't need to recompile the Heka binary.

If you don't want to write your own plugins, you'll never interact with Lua.


Don't you need to ship all those built-in Lua plugins around with your deployment of Heka, though? That was what turned me off - most of the other options can be used with a binary + config file.


Yeah, I like Heka a lot in my limited experience with it. It's easy to set up, the TOML config is pretty nice, and forwarding filesystem logs is fast. Interestingly, parsing the logs can be done with custom-defined Lua in addition to static methods like regex.


What about plain rsyslog? I keep stumbling on this kind of program (others have mentioned Heka, Fluentd, Logstash), but the general speed, simplicity, versatility (the feature range is actually quite big, from ES output to Unix pipes to simple filters), and ubiquity of rsyslog make it suited for many of these tasks. Am I missing something?


Don't forget that rsyslog can also parse and generate structured data (JSON with mmjsonparse for input, plus templates with JSON escaping for output).

It can also queue up messages in memory and/or to disk if your remote data sink is having a hiccup.

And for those wondering how to send multi-line data: well, you don't. If you need to write out a big blurb, you write it out on a single line from the application. If using structured data, you can output the lines as separate array items. The current built-in limit for a syslog line is 8096 bytes, but that's tunable. Just make sure the thing that writes to syslog doesn't have a low hardcoded limit like older versions of logger from util-linux (1024 bytes).
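As a sketch of the "separate array items" approach (the field names and stack frames below are made up), a Go app can keep a whole traceback on one syslog-friendly line:

    package main

    import (
        "encoding/json"
        "log/syslog"
    )

    func main() {
        // A multi-line stack trace becomes one JSON line, with each frame as a
        // separate array item instead of embedded newlines.
        entry := map[string]interface{}{
            "msg":   "unhandled exception",
            "trace": []string{"main.handler()", "net/http.(*conn).serve()", "runtime.goexit()"},
        }
        b, _ := json.Marshal(entry)

        if w, err := syslog.New(syslog.LOG_ERR|syslog.LOG_LOCAL0, "myapp"); err == nil {
            defer w.Close()
            w.Err(string(b)) // single line, well under the default 8096-byte limit
        }
    }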

edit: the version of rsyslog shipped with the distros might be a bit dated. They provide packages for their latest stable version; we're using that and it works pretty well.


Have fun getting multi-line stack traces through rsyslog unmangled.


Well, for starters, I'd rather have a structured log entry that carries context than a single line I have to grep for information. Context in the logs allows easier indexing and searching, and I can add more data to the log entry knowing I'm making it easier to search, not harder.


It's easy to get a stream of messages out of Fluentd in raw(ish?) form, or to write a message destination plugin for it. This makes Fluentd an excellent message forwarder for generic data. On the rsyslog side, you can't get a line-wise stream of JSON messages passed to a TCP or Unix socket or through a pipe to a command, and writing a plugin for it takes some C code.

I wouldn't build a monitoring or inventory system on rsyslog, but I don't hesitate to use Fluentd. rsyslog was intended for logs only, and using it in any other way seems like an abuse, even if smart and somewhat fitting.

I haven't used logstash, but I bet it operates in a similar way on its data sink border.


We would switch away from Rsyslog in a heartbeat if someone could come up with a better syslog-compatible forwarder.

We have it set up to write logs locally (with a limited rotation) as well as forward them via TLS to a central Rsyslog server that collects the log in a single tree with a much longer retention time. (We don't use any of the non-file outputs, but we do sync to S3 for archival.)

It has major issues. For one, its spooling implementation is flaky. /dev/log is a limited, synchronous-blocking FIFO buffer, which means that everything that logs (including OpenSSH!) will choke if the buffer is full. For some reason, even a tiny bit of packet loss will trip up Rsyslog.

It is also frequently unable to recover from a network blip, and a restart is the only solution. But its spooling is badly implemented, so on restart it will typically ignore the old spool files and start anew — meaning you lose data. Someone wrote a Perl script to fix a broken spool directory, but I never got it to work.

Ironically, Rsyslog is also terrible at logging what it's unhappy about at any given time, so whenever something bad happens, you probably won't get anything in the system log.

Rsyslog's configuration is a curious beast, and by curious I mean infuriating. Rsyslog originally had an antiquated, ad-hoc, and messy line-oriented configuration file format (with directives like "$RepeatedMsgReduction off"), and the author decided to transition to a more modern, block/brace-based syntax. Unfortunately, he decided to do this gradually, and both syntaxes can co-exist in the same file. For a while, many of the options were only available in the old syntax, so you had to mix the two.
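To illustrate the mix (the forwarding target below is made up), the same config file can end up containing both styles:

    # legacy directive style
    $RepeatedMsgReduction off
    $ActionQueueType LinkedList

    # newer RainerScript style
    action(type="omfwd" target="logs.example.com" port="6514" protocol="tcp")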

Which leads me to the next problem: The documentation is absolutely atrocious. The Rsyslog site is a fragmented mess of mostly outdated information. It's gotten better with v8, but it's still the worst OSS project documentation I've encountered. There's no reference section that lists the possible config options. Frequently there is no documentation for a particular setting. Rsyslog is quite finicky about some combinations of options (like TLS driver configs) and you have to proceed by trial and error. Frustratingly, it will silently ignore some config errors (such as trying to set up multiple TCP listeners, which is still not supported).

The new config format is better, but it still has the feel of something that has been implemented before it was fully designed.

Again, we don't use any of the fancy output modules. Maybe they are solid, but based on my experiences with the simple file-based stuff, I wouldn't bet on it.

As an aside, it's worth pointing out that Rsyslog is still using the Syslog protocol, which has all sorts of issues (not consistently implemented by clients or servers; does not support multi-line messages). Rsyslog has another protocol, RELP, that I believe you can use for forwarding, but I don't think it's been implemented outside of Rsyslog.

As far as I can tell, there aren't any good alternatives. syslog-ng's forwarding support is commercial and quite expensive. Logstash might work, but I don't want to run a memory-hungry Java app on each box.


Not sure what you mean by forwarding support - my entire environment is configured with syslog-ng forwarding to various places based on various rules.

Heck of a lot faster than rsyslog. The one thing I've not been able to do is get rsyslog forwarding to syslog-ng. Something happens to the message format between systems that leads to hilariously incorrect filenames on the collector systems.


By forwarding I mean reliable, disk-buffered forwarding. This only exists in the commercial "Premium Edition" of syslog-ng.


There is disk based buffering in NXLog CE. You might want to check it out with respect to the other woes you have with rsyslog.


Never heard of that one, thanks.


... which I highly recommend until you outgrow it and graduate to Kafka.


On the other hand, it's pricey, and last I checked, licensing is based on the number of machines (ridiculous in a cloud environment).


Logstash and rsyslog do different things; there is some overlap.


It isn't new and shiny and it's available from distro packages, so it's not worth the attention of the cool kids.

If you don't need to curl|sh from a .io (or .sh) domain to install it, it's not worth using apparently.


Just for information: I generally agree with what you said here, except that in this particular case rsyslog is not a generic data bus, and Fluentd and Logstash are, so they're useful on their own merits. They're just often used as mere log transports, which overlaps with rsyslog.


We at trivago had a similar problem. For this we created Gollum:

http://tech.trivago.com/2015/06/22/gollum/

https://github.com/trivago/gollum

We use it heavily to stream any kind of data into Kafka: access and error logs, application logs, etc. Did you consider it as well?


No, we did not consider Gollum, but it definitely looks like a possible solution and one we might have chosen. I think the name of the project makes it hard to find, unfortunately.


Hi all, Logstash developer here. It's always exciting to see new stuff in this space; however, this post has me confused. Maybe the OP can clue me in.

I'm a bit confused because the assertion "This worked for a while, but when we wanted to make our pipeline more fault-tolerant, Logstash required us to run multiple processes." is no more true for Logstash than it is for any other piece of software. Single processes can fail, so it can be nice to run multiples. It would be great if the author of the piece could clarify that further. If you're around I'd love to hear specifically what you mean by this. Internally Logstash is very thread friendly, we only recommend multiple processes when you want either greater isolation or greater fault tolerance.

I don't personally see what the difference is between:

Filebeat -> LogZoom -> Redis -> Logstash -> (Backends)

and

Filebeat -> LogStash -> Redis -> Logstash -> (Backends)

or even better

Filebeat -> Redis -> Logstash -> (Backends)

You can read more about the filebeat Redis output here: https://www.elastic.co/guide/en/beats/filebeat/current/redis...


> If you're around I'd love to hear specifically what you mean by this. Internally Logstash is very thread friendly, we only recommend multiple processes when you want either greater isolation or greater fault tolerance.

Right, we considered using multiple Logstash processes, but we really didn't want to run three instances of Logstash requiring three relatively heavyweight Java VMs. The total memory consumption of a single VM running Logstash is higher than that of running three different instances of LogZoom.

We looked at the Filebeat Redis output as well. First, it didn't seem to support encryption or client authentication out of the box. But what we really wanted was a way to have Logstash duplicate the data into two independent queues so that the Elasticsearch and S3 outputs could work independently.


Thanks for the thoughtfully considered response :).

Regarding security with Redis: did you read the docs here? https://www.elastic.co/guide/en/logstash/current/plugins-out... Logstash does support Redis password auth (as does Filebeat). Regarding the encryption point, seeing as Redis doesn't support SSL itself, are you using spiped as the official Redis docs recommend?
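(For reference, the usual spiped pattern is a pair of tunnels: an encrypting one next to the client and a decrypting one in front of Redis. The hosts, ports, and key path below are placeholders.)

    # client side: local plaintext 6379 -> encrypted link to the Redis box
    spiped -e -s '[127.0.0.1]:6379' -t 'redis-host:6380' -k /etc/spiped/redis.key

    # Redis side: decrypt and hand off to the local Redis
    spiped -d -s '[0.0.0.0]:6380' -t '[127.0.0.1]:6379' -k /etc/spiped/redis.key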

Regarding the two queues, I would like to clarify that you can do this with the:

Filebeat -> Logstash -> Redis -> Logstash -> (outputs) technique.

If you declare two Redis outputs in the first 'shipper' Logstash, you can write to two separate queues and have the second 'indexer' read from both.
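Roughly (the host and key names here are placeholders), the shipper's output section would look like:

    output {
      redis { host => "queue-host" data_type => "list" key => "logs_for_es" }
      redis { host => "queue-host" data_type => "list" key => "logs_for_s3" }
    }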

It is true that if one output is down we will pause processing, but you can use multiple processes for that. It is possible that in the near future we will support multiple pipelines in a single process (which we already do internally in our master branch for metrics, just not in a publicly exposed way yet).

Regarding JVM overhead: that's a fair point about memory. The JVM does have a cost. That said, memory/VMs are cheap these days, and that cost is fixed. One thing to be careful of is that we often see people surprised to find a stray 100MB event going through their pipeline due to an application bug. Having that extra memory is a good idea regardless. We have many users increasing their heap size far beyond what the JVM requires simply to handle weird bursts of jumbo logs.


Thanks for that information. There's no doubt Logstash can do a lot, and it sounds like with the multiple pipeline feature Logstash will make it easier to do what we wanted to do in a single process.

In the past, we've also been burned by many Big Data solutions running out of heap space, so adding more processes that relied on tuning JVM parameters yet again did not appeal to us.


So as someone who is just about to implement Fluentd, what is the status of using LogZoom with docker?

Currently, with Fluentd all I have to do is set the log-driver and tags in DOCKER_OPTS, point fluentd to ES, and I have all my container logs.
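Concretely, that's along the lines of this in DOCKER_OPTS (the address and tag pattern here are just examples):

    DOCKER_OPTS="--log-driver=fluentd \
      --log-opt fluentd-address=localhost:24224 \
      --log-opt tag=docker.{{.Name}}"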

Does LogZoom work this seamlessly with docker? I know that at the very least I will need https://github.com/docker/docker/issues/20363 in order to implement any LogZoom plugin, so is this really much of a benefit if I don't have hundreds of containers running on a host? The only concern I had after reading this was whether Fluentd will use as many resources as they mention. For my use case, I think not.


For your use case, I think Fluentd may work fine. LogZoom currently deals with structured JSON log data received from hundreds of hosts around the world. It could be modified to handle arbitrary logs (and wrap a structure around them) and integrate with Docker, but that was not the goal here.


I'm a bit concerned that you're relying on Redis as a message queue for buffering. Redis is an in-memory store with optional persistence, but having persistence doesn't make it a log-structured system like Kafka. You still have to make sure you don't run out of memory, which greatly limits its ability to buffer messages.

It would have been much better IMHO to utilize an on-disk buffer instead, like syslog-ng PE does.


I am the author of Logslam. Thanks to LogZoom for the credit.



