Logging can be tricky (squareup.com)
138 points by hariharan_uno on Sept 23, 2014 | 74 comments



Reminds me of the time I sped up the main business app of a large company by 85% by removing "debug" logging. 2 TB/hr of "made it here" isn't really useful at the end of the day.

Not the first time I've seen that by a long shot.

//shakes zimmerframe, shuffles off


This is why your logging library should have a configurable log-level, and the first thing the user-facing log() function should do is check the arguments against the log-level and early-return if the message isn't important enough.
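For illustration, here is a minimal sketch of that early-return shape in Go (the names Level, threshold and Logf are made up for this example, not from any particular library):

    package logx

    import (
        "fmt"
        "os"
    )

    type Level int

    const (
        Debug Level = iota
        Info
        Warn
        Error
    )

    // threshold is the configured minimum level; anything below it is dropped.
    var threshold = Warn

    // Logf checks the level before doing any other work, so a suppressed
    // message costs one comparison and nothing else.
    func Logf(lvl Level, format string, args ...interface{}) {
        if lvl < threshold {
            return // early return: no Sprintf, no I/O
        }
        fmt.Fprintf(os.Stderr, format+"\n", args...)
    }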

(In languages that support macros, you could also just make debug_log() a macro that gets compiled out on release builds, like assert(). That means a whole bunch of object files are going to be functionally different between debug and release builds, though, which is annoying for a lot of operational reasons.)

My favourite solution by far, though, is the one recently implemented by Elixir: log() is a regular function, but takes a closure instead of a string. That way, you can actually put all the work required to generate the log message (inspecting data structures, doing intermediate lookups, etc.) inside the closure, and it all gets elided when you decrease the log-level.
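For languages without macros or lazy arguments, the same effect falls out of passing a closure. A rough Go analogue of the idea, continuing the hypothetical logx sketch above (this is not Elixir's actual Logger API):

    // LogLazy takes a closure instead of a string, so the potentially
    // expensive message construction only runs when the level is enabled.
    func LogLazy(lvl Level, build func() string) {
        if lvl < threshold {
            return // closure never invoked: nothing is inspected or stringified
        }
        fmt.Fprintln(os.Stderr, build())
    }

    // Usage: with threshold at Warn, the Sprintf below never executes.
    //   LogLazy(Debug, func() string { return fmt.Sprintf("state: %+v", bigStruct) })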


Macros let you do option three, too, by just guarding the body with an if in the generated code. That way you avoid having to use an anonymous function each time. So in Lisp, (log level (construct-a-string)) instead of (log level (fn [] (construct-a-string))), but with the same elision logic.


At least in Java, log levels don't buy you much when you wind up rendering object graphs to strings (via your oh-so-helpful custom, recursive toString implementation), then duplicating that to concatenate "made it here" onto it, then immediately abandoning the mess on the heap for the GC to handle, because the production log level is warn, so your carefully built, highly detailed log message is never used. Repeat every other line or so... In tight loops, calling iter.next() in the log message, and on and on...


That's not a Java problem, that's an "I don't understand Java logging" problem. The right way:

    log.debug("I created this gigantic object {}",
              myBigUglyObject);


"possibly rendering object graphs to strings"

But that's a problem with how you choose to use logging in your application - not with logging in general or specific logging features in Java.

Logging used without any care can cause lots of problems - but then so can any part of an application. Used with some care it can be a very useful tool for helping to diagnose all kinds of problems.


Some Java servers I have had the misfortune of diagnosing were found to be spending the majority of their CPU time (aside from garbage collection, of course) formatting exception stack traces.


which is why slf4j has format strings


... or your recursive, reflective toString implementation...


> My favourite solution by far, though, is the one recently implemented by Elixir: log() is a regular function, but takes a closure instead of a string.

This is a pretty fantastic route as far as syntactical simplicity goes.

I wonder if it has implications for the generated assembly nonetheless? Could there be extra assembly generated outside the log-level-check branch, needed to bind the variables for the closure?

Similarly, I might suspect that an optimizing compiler that is considering a function for inlining might change its mind for any function where it sees a nontrivial closure like that, without knowing that "trace" level logging nearly never happens.

"Edge cases", to be sure, and the answers are surely different between languages, but interesting to consider nonetheless.


You may know this, but maybe it's worth pointing out that the big savings here is actually just that the function itself can be pretty expensive, since it may stringify some data structure. In eagerly evaluated languages, if you say this:

    log("debug", "I found: " + foo.toString());
you end up doing a lot of work to create that string, only to have it be ignored by the log call. Whereas if you say:

    log("debug", => "I found: " + foo.toString());
any extra assembly you execute (or generate) to make that work is peanuts compared to avoiding the call to toString(). None of that (besides the simple thing of log levels) helps with the OP's problem with IO, but it can make a real difference to your CPU.


> My favourite solution by far, though, is the one recently implemented by Elixir: log() is a regular function, but takes a closure instead of a string.

A pretty common halfway solution (used by Python's logging for instance) is that the logging library takes a format string and its parameters, and does the formatting itself.

Doesn't save intermediate lookups or complex computations if they exist, but for the common case of a complex stringification it can save your bacon.

The alternative is an explicit conditional around the logging call, which is definitely worse than the ability to log a closure.


Monolog does that: you call the level that you want, so it can early-return based on what environment/config you've got set up. Quite nice; it means I can keep my debug logs around and deploy the same code to production without worrying about it. I think the PSR standard for logging actually specifies this, too.


Enclosed log evaluation is really nice and I'd have some trouble going back. It's older than Elixir, too; Commons.Logging in C# has supported it for a long time and most logging frameworks in Scala take a call-by-name parameter rather than an evaluated arg.


Example in Scala from Play:

https://github.com/playframework/playframework/blob/2.3.x/fr...

The '=>' means call-by-name in Scala, the argument is lazily evaluated.


Clever. I like that (using the closure).

I've never used Elixir but toyed with Erlang many years ago - is it called asynchronously by the logging process in Elixir? If so, I guess you're relying on immutable data structures to make sure you get the correct logging output?


Everything goes through an async message send to a gen_event (error_logger). The gen_event itself can have annoyingly synchronous+serial behavior with respect to its backends, but gen_event handlers aren't supposed to do much heavy-lifting; they mostly just decide whether they want a message, and then forward it to some backend process to do the formatting+writing+etc.

This gives you another neat benefit, nearly for free: if you have, say, application Foo on node A, wanting to log to SyslogSink on node B, then Foo first sends its logs to a gen_event on A, and the SyslogSink handler in that gen_event gets the opportunity to drop messages, using whatever logic SyslogSink likes, before forwarding them over to B (which might be half-way across the Internet.)


Not the only one. Someone set up NHibernate on a project I was working on to run logging in DEBUG mode and throw it in a log file capped at 100k. Pages were taking 4-5 seconds to render. It wasn't obvious that the logs were being written, as the file was capped at 100k.

Took ages to work that out.

Ugh.


Grey text on white background? Please help!


You just dont understand sleek, minimal design. How pedestrian.

/s



I had a similar problem with Java -- except that the entire application would freeze for double-digit seconds. Another application would sometimes write a huge amount of data out very quickly to the fs cache. 30 seconds later (or whatever the expiration is), all those dirty bytes would get sync'd to disk more or less at once.

Turns out it was the JVM-provided GC logging hanging on flush (not even fsync) calls. The flush call happened during GC, while the GC implementation held a stop-the-world lock. Digging through JVM source code is 'fun'.


JVMs can also block on class loading in the presence of a heavy writer. This can go on for minutes, or indefinitely. Where I work we deploy jars to tmpfs to avoid this irritating cause of high latency.


I think I have exactly this problem. How do I confirm this? "strace" does not show me any obvious problems. Where would I look to see if the JVM "flush" was giving me problems?


Well, the easiest way to confirm is to turn off GC logging and see if your huge "GC pauses" go away. Alternatively, you could turn up the verbosity a bit - there's an extra flag for details about reference processing. In my case I was able to track down the exact line in the JVM from the log lines attributing the huge pauses to whatever trivial component occurred right after the first flush call.

If you record system metrics (eg. to ganglia) then you can also attempt to correlate large pauses to a large and rapidly declining number of dirty bytes in the fs cache.
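If those metrics aren't already being collected, the same counters are visible in /proc/meminfo. A quick, Linux-only sketch (the one-second interval is arbitrary) that prints the Dirty and Writeback lines so they can be lined up against pause timestamps:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
        "time"
    )

    // Poll /proc/meminfo and print the dirty-page counters once a second.
    func main() {
        for {
            f, err := os.Open("/proc/meminfo")
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                return
            }
            scanner := bufio.NewScanner(f)
            for scanner.Scan() {
                line := scanner.Text()
                if strings.HasPrefix(line, "Dirty:") || strings.HasPrefix(line, "Writeback:") {
                    fmt.Println(time.Now().Format(time.RFC3339), line)
                }
            }
            f.Close()
            time.Sleep(time.Second)
        }
    }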


If all three machines had the same logging code, and one machine was fsync'ing slowly, isn't turning off fsync just a bandaid that hides the true problem?

When they discover the actual problem is some issue with the raid controller, I promise not to say "I told you so".


Hi, post author here. I didn't make it as clear as I could have, but the difference is that the problematic system had an unrelated process creating slightly more writes. I sort-of glossed over this with the 5%-20% difference in i/o util.

Unrelated write activity on a filesystem can cause fsync() calls in any other process to vary wildly in latency. This can be replicated; here's an experiment for you. First, run this:

    strace -T -efsync ruby -e'loop { STDOUT.fsync; puts "a" * 120; sleep 0.1 } ' > ~/somefile

Then, in another terminal do a little bit of writing -- make sure it is on the same filesystem. For example:

    dd if=/dev/zero of=~/someotherfile bs=4M count=1

On my poor little aws VM, here is what I see:

    fsync(1) = 0 <0.025072>
    fsync(1) = 0 <3.930661>
    fsync(1) = 0 <0.024810>

That is, writing 4 megabytes in an unrelated process caused fsync() to jump two orders of magnitude.

Removing fsync() is an appropriate fix because we don't really ever want to flush this data to durable storage.


I probably read "an otherwise uniform cluster" too literally.

20% utilization on a resource that can only do 100 random operations per second is a major problem, and way different from 5%.

A piece of hardware that operates at 100 Hz, with 20% utilization, will block for about 135ms+ for the 95th percentile request. With 5% utilization, it will block for about 10ms for the 95th percentile request.

My quick calculations are somewhat below the 200ms discrepancy you show in the chart, but not far off.

Of course, turning off fsync is a perfectly good solution. Longer term, I would move to SSDs and just make this entire class of problem go away. I don't even have any spinning platters to test your strace on.


Computers are chaotic systems. They're deterministic, but changes in starting conditions below the threshold of observation can lead to wildly different outcomes. That's why they're usually unpredictable.


While I really like the article and its analysis, I think there is a more general point here - scale is tricky.

Most programming abstractions and tools (such as logging, UI design, persistence) are all "easy" when it comes to one machine (or a few machines) in a closed, deterministic environment.

All of these become extremely tricky when you have to deal with many instances across many different, possibly nondeterministic environments. For UI, it quickly becomes: what's the viewport? What's the real estate? How does interaction happen (touch? mouse? keyboard? numpad?)? For persistence, one has to worry about the number of simultaneous connections, character encoding and normalization, reads/writes per second, synchronisation, etc.

So, also for logging - are we logging client side or over the network? What's the memory on the client? What else is running on the client? What are the permissions through which we log? Do we care about a system call on this client given any user interaction? Etc.

Logging, UI, persistence, network protocols, heck even just choosing an intermediate data structure are all tricky... at scale, across devices and nondeterministic environments.


Compressing the text logs prior to writing them to the disk also helps with these kinds of issues. You can also offload your logging to a dedicated thread and then use a lock free queue to increase your performance even more.
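A rough Go sketch of the offloading idea, with a buffered channel standing in for the queue (a channel is not strictly lock-free, but it keeps formatting and disk I/O off the request path) and gzip handling the compression; names and sizes are illustrative:

    package asynclog

    import (
        "compress/gzip"
        "os"
    )

    // entries is the queue between application goroutines and the writer.
    var entries = make(chan string, 8192)

    // Log enqueues a line without touching the disk; if the queue is full the
    // line is dropped rather than blocking the caller.
    func Log(line string) {
        select {
        case entries <- line + "\n":
        default: // queue full: drop instead of stalling the request path
        }
    }

    // Run drains the queue onto a gzip-compressed log file until the channel
    // is closed. Intended to run in its own goroutine.
    func Run(path string) error {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
        if err != nil {
            return err
        }
        defer f.Close()

        zw := gzip.NewWriter(f)
        defer zw.Close()

        for line := range entries {
            if _, err := zw.Write([]byte(line)); err != nil {
                return err
            }
        }
        return zw.Flush()
    }

Dropping on a full queue is only one possible policy; blocking or sampling are others. The point is just that the caller never waits on the disk.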


> You can also offload your logging to a dedicated thread and then use a lock free queue to increase your performance even more.

Or use syslog() like a sane person, and let that "extra thread" live inside the OS IPC mechanism.

(Or stdout/stderr like a modern sane person, and let upstart/systemd/docker/etc. push your logs to syslog if that's where it feels like pushing them.)
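For reference, in Go the syslog route is a couple of lines with the standard library's log/syslog package (the facility and tag below are placeholders):

    package main

    import (
        "log"
        "log/syslog"
    )

    func main() {
        // Hand log lines to the local syslog daemon over its socket; batching,
        // rotation and disk policy are then syslogd's problem, not the app's.
        w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "myapp")
        if err != nil {
            log.Fatal(err)
        }
        log.SetOutput(w)
        log.Println("hello from the app")
    }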


The OS IPC mechanism quite likely does not use a lock free queue -- or at least, not in quite the same way as I think the grandparent post refers to.

Using a well implemented ring buffer [+] can get enqueue operations down to a few instructions and something like two memory fences.

The overhead of IPC, which wakes up the kernel scheduler, switches the processor back and forth between privilege modes a few times on the way, knocks apart all the CPU cache and register state to swap in another process, while the MMU is flipping all your pages around because these two processes don't trust each other to write directly into their respective memory... is not going to have quite the same performance characteristics.

A notable moment in the history of logging is Java's log4j framework, which, within a single process, used exclusive synchronization. When this was replaced by a (relatively) lock-free queue implementation, throughput increased by orders of magnitude. (Their notes and graphs on this can be found at https://logging.apache.org/log4j/2.x/manual/async.html .) This isn't an exact metaphor for the difference between a good lock-free ring buffer and IPC either, but it certainly has some similarities, and indeed ends with a specific shout-out to the power of avoiding "locks requiring kernel arbitration".

--

[+] The "mechanical sympathy" / Disruptor folks have some great and accessible writeups on how they addressed the finer points of high performance shared memory message passing. http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-... is one of my favorite reads.


I think you're thinking of a synchronous RPC mechanism. I was talking about IPC mechanisms like unix domain sockets, where sending the message doesn't interact at all with the receiving process, but literally just sticks it into a buffer, where the other process can come and get it later.


> Or use syslog() like a sane person, and let that "extra thread" live inside the OS IPC mechanism.

As it turns out, we had nearly the same symptoms as the OP. After running strace on syslogd (CentOS 5), I realized that it also does an fsync() after every call. A quick trip to the man page and a configuration change later, our issues disappeared.


Evan, I would be happy if you could explain more about what you see in the strace outputs. E.g.:

> Time spent in fsync accounts for almost exactly the delay we were seeing.

What delay? I see the whole thing taking 1.5 seconds and 1.3 seconds spent in futex (0.4 more than on the normal host). Not sure why we are suddenly talking about fsync. I also don't know what either method (futex, fsync) could be doing.

All these are not questions I want answers to (some stuff I could google, of course). I just want to show that it's a rather confusing read for some readers if you expect them to understand the strace outputs as well as you do, when you seem to be using that tool on a daily basis and the readers might not have used it at all, ever. It would be great to be able to follow your insights better. Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".


> Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".

Exactly. I try to always point to the actual data when I'm telling my readers my conclusions.

"Calls to futex, epoll_wait and select are largely the same on both systems. The problem host does 1727 calls to fsync(), while the normal host does only 849 calls. The problem host spends 1.24% of its time in fsync(). The normal host only 0.48%. This makes fsync() our number 1 suspect."

This at least lets the reader refer back to the output and see corresponding entries.


Ah, thanks for your feedback. I'm never sure how much detail I should dive into. I will add more next time. I'll answer them here just for practice :)

Recall, this is a problem in the 99th percentile. That is, on average our requests are still taking around 20ms to complete -- but one in every hundred or so takes about 200ms.

The first strace output shows a summary table of time, including both our fast 20ms queries and the 1% of slow 200ms queries. In the summary view I am looking not simply at the largest number in the summary but rather indications that my problematic system is spending time doing something that my healthy system is not. Time spent is a zero-sum equation -- the app is either doing what it's supposed to, or not. futex() is where it should be spending time while idle. fsync() time doubles proportionally, which means it's taking time away from the otherwise productive (or, idle) operations. Which means it's potentially the problem.

When I trace the fsync call individually I see a time delta roughly on par with my per-request delay for the slow 1% of queries. Which indicates to me that this call is happening during the slow queries, accounting for the delay in its entirety.


Why would you fsync logs for a high-level service? Are you afraid a power outage is going to cause you to lose service logs?


Making sure your logging data gets written is relatively important.

Say, if you hit a bug somewhere that causes the server itself to crash.


How about installing a SIGSEGV, etc. handler to do the fsync (and perhaps even print a stacktrace)?


Disabling buffering in the file library should provide a similar mitigation outlook. If the OS accepts the writes, they WILL make it to disk unless some part of the storage subsystem (or the OS itself) fails. So if the process dies, recent writes should make it to disk regardless. But if the process triggers a complete system crash then you're kinda stuck needing fsyncs :|
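In Go terms, that distinction looks roughly like this (a sketch; most error handling elided):

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        f, err := os.Create("app.log")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Unbuffered: each call is a write(2) straight into the kernel page
        // cache, so the line survives a crash of this process (only an OS or
        // hardware failure loses it).
        f.WriteString("request handled\n")

        // Buffered: this line sits in user-space memory until Flush, so a
        // crash before Flush loses it even though the OS is fine.
        w := bufio.NewWriter(f)
        w.WriteString("request handled (buffered)\n")
        w.Flush()

        // f.Sync() would additionally push the page cache to durable storage --
        // that is the expensive fsync(2) this thread is about.
    }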


> Disabling buffering in the file library should provide for a similar mitigation outlook

Did you mean logging library? I.e., application-level buffering? In that case, good point.


The fsync() is implicit behavior of the glog framework for Go. We didn't actually want it.


Did anyone understand why logging calls on only the problem machine had different fsync behavior than on the normal machines?


Hi, post author here. The slight difference in i/o (specifically: writes) was the trigger. I talked a little more about that here: http://corner.squareup.com/2014/09/logging-can-be-tricky.htm...

And here: https://news.ycombinator.com/item?id=8359556

Even on the problematic host we only saw this latency issue in the 99th percentile. That is: even on the problem host 99 out of 100 queries were served as expected and only 1 out of 100 saw this additional latency.


Well, yeah, I noticed you guys' response to one of the comments on the blog post indicated that the problem machine had a different workload (additional tasks or something). That caused the additional writes, which then caused the latency for the main app on the box.

I think your point still stands about logging, being cautious about blocking I/O calls, etc. But it seems the bigger point is one of how your overall system is architected: which processes run where, dedicating like nodes to their tasks vs. potential quality/consistency issues arising from having some pull double duty, etc.

Those seemed to be the source of the real issue here.


Sort of. The catch is that even a very small write, say just a few megabytes, can drastically change the cost of an fsync(). On my test aws VM even writing just 4 megabytes one time is enough to trigger the problem. Even on an otherwise fully isolated system a few megs may be written from time to time, for example by a management agent like chef or puppet. Or by an application deploy copying out new binaries.

For example, here I reproduce the problem on a completely isolated machine: https://news.ycombinator.com/item?id=8359556


IMO the real issue is that a competent logging framework doesn't block app code to sync the log to disk. The buffer should be swapped out under lock, and then synced in a separate thread. Yuck.
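A hedged sketch of that shape in Go: appends take a short lock on an in-memory buffer, and a background goroutine swaps the buffer out and pays for the write and fsync off the hot path (names and the one-second interval are illustrative):

    package dblbuf

    import (
        "os"
        "sync"
        "time"
    )

    type Logger struct {
        mu  sync.Mutex
        buf []byte   // active buffer; appends only hold mu briefly
        f   *os.File // destination file, written and synced off the hot path
    }

    func New(path string) (*Logger, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
        if err != nil {
            return nil, err
        }
        l := &Logger{f: f}
        go l.flushLoop()
        return l, nil
    }

    // Log appends to the in-memory buffer; no disk I/O on the caller's goroutine.
    func (l *Logger) Log(line string) {
        l.mu.Lock()
        l.buf = append(l.buf, line...)
        l.buf = append(l.buf, '\n')
        l.mu.Unlock()
    }

    // flushLoop periodically swaps the buffer out under the lock, then does the
    // slow write + fsync with the lock released, so loggers never wait on disk.
    func (l *Logger) flushLoop() {
        for range time.Tick(time.Second) {
            l.mu.Lock()
            pending := l.buf
            l.buf = nil // swap in a fresh buffer
            l.mu.Unlock()

            if len(pending) == 0 {
                continue
            }
            l.f.Write(pending)
            l.f.Sync() // the expensive part happens here, off the request path
        }
    }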


The downside is of course that if you crash hard, the most valuable log entries are the ones least likely to be on-disk afterwards.


Which is why logging to disk on the server is BAD: have your log framework write to stdout and have upstart/systemd/whatever handle writing to a remote syslog server, or whatever your fancy is.


Good points. I got something out of it on both fronts.


If you want tricky, try building a logging service and then enabling logging to itself.


I once enabled logging on an AWS S3 bucket... which was our 'logs' bucket... which was the stated destination for the bucket logging. It wasn't a particularly costly error, but there was a lot of logspam...


Does anyone think about doing logging to shared memory / memcached and then committing snapshots to disk at regular intervals via another process/machine?

If you're not all that concerned about consistency, each web server can keep their logs in a segregated memory space and then another process can combine/commit and send a flush command, leaving the primary machines relatively unencumbered.


With fsync() taken out of the equation this is essentially how a naive logger operates. The linux page cache plays the role of shared memory and only under rather heavy contention will a write() incur latency.

The trouble outlined in my blog post is that the logging framework was calling fsync() -- that is, specifically asking to flush the page cache all the way to disk.


Have you experimented with a log-structured filesystem? I once saw similar high-variation behaviour in a logging problem on extfs. Switching to NILFS pretty much got rid of variable latency entirely.


Not recently, no. I do know that a journaled filesystem can exacerbate this sort of problem as it can make extra work. For example: http://lwn.net/Articles/328363/

In a few cases in the past when dealing with unimportant data I have downgraded to ext2 for a nice performance bump.


Just to make sure we're talking about the same thing, a journalling FS isn't the same thing as a log-structured FS.

The first has a write-ahead log, the latter is basically just a log. So immediately writing to disk is relatively simple.


Yes. I haven't done anything with a log-structured FS. I have only played around with conventional filesystems.


Isn't this essentially syslog-ng?


Does syslog-ng use shared memory? I thought it communicated via open socket network channels.


Apps can log using the syslog primitives. I do not believe they require network sockets. What syslog-ng decides to do with it (flush to disk, or push over a network socket) is no different from the proposal of shared memory to another process (which then has to make the same decision).


Does this increase the chance that you lose a bit of critical log if a fault causes the system to go down?


Yes, but if the logs are just for debugging an application, it usually doesn't matter. Application logs are useful for figuring out why an application crashed, but shouldn't be relevant for figuring out why a system went down. However, if the app crashes while some log information is still in the system buffers, the system will still write them out, even if fsync() is never called.

It's been a few years since I looked at this, but it seems like there's only a small window that data will stay in the buffer before it will get automatically flushed. The nice thing about this automatic flush is that it won't cause the application to block.


Also if your logging plugin / framework / module allows logging to remote syslog, that's a pretty good idea - no disk blocking. Although you are then risking data loss when the network goes down...


Wow...this post made me think I shouldn't even attempt to run my own server infrastructure for my startup. This kind of analysis is way deeper than I'm currently capable of.


Hell, we're now infrastructure-less. No VM's (except for our QA to test on, doesn't count), no static servers we can't just destroy and spin up another of in <5 mins, and no database servers that we have to manage.

It truly is a joy.


God bless you -- you are doing the Cloud's work.


Curious: How do you persist data with no database servers? Using an outside service to store it?


We used Azure SQL for our RDBMS requirements, and blob/table storage for our NoSQL stuff. Sure, there's a server behind it, but it's transparent to us and it's geo-redundant and replicated. When I say "no servers" what I mean is, we don't need any system admins and very few ops folks. In fact at the moment, devs are pretty much ops since we need very little of it.

I also forgot to mention we have a build server, but that's just a virtual machine that we back up regularly. It's our only piece of static infrastructure. Everything else, except storage which is, again, transparent to us, is destroyed on a weekly, if not daily, basis for new releases.


Unless you're really unlucky, using young tech, or bad at programming, you won't need that kind of expertise unless you have the kind of traffic that means you can afford to hire it.


Logging to files creates a lot of headaches. With Fluentd (the open source log collector that I am a maintainer of: www.fluentd.org), I would say 1/3 to 1/2 of the questions are related to the tail plugin, i.e., tailing log files.

I think, fundamentally, files are the wrong abstraction for log data (as pointed out by Adam Wiggins of Heroku fame: http://12factor.net/logs), but it is something that we are stuck with.


I think this problem is mostly due to using Go, whose libraries are not mature.



