Well, yeah, I noticed your response to one of the comments on the blog post indicated that the problem machine had a different workload (additional tasks or something). That extra work caused the additional writes, which in turn caused the latency for the main app on the box.
I think your point still stands about logging, being cautious about blocking I/O calls, etc. But it seems the bigger point is how your overall system is architected: which processes run where, dedicating nodes to specific tasks versus accepting the potential quality/consistency issues that come from having some pull double duty, and so on.
Those architectural decisions seemed to be the source of the real issue here.
Sort of. The catch is that even a very small write, say just a few megabytes, can drastically change the cost of an fsync(). On my test AWS VM, writing just 4 megabytes once is enough to trigger the problem. Even on an otherwise fully isolated system, a few megs may be written from time to time, for example by a management agent like Chef or Puppet, or by an application deploy copying out new binaries.
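For what it's worth, here's a rough sketch of how one might measure that (assuming a filesystem such as ext3 in data=ordered mode, where an fsync() on one file can force unrelated dirty pages out too; the filenames and the 4 MB figure are just placeholders):

    import os, time

    def dirty_write(path, size_mb):
        # Leave dirty pages sitting in the page cache; note: no fsync here.
        with open(path, "wb") as f:
            f.write(os.urandom(size_mb * 1024 * 1024))

    def timed_small_fsync(path):
        # Append one tiny log line and time only the flush + fsync.
        with open(path, "ab") as f:
            f.write(b"one small log line\n")
            start = time.monotonic()
            f.flush()
            os.fsync(f.fileno())
            return time.monotonic() - start

    print("baseline fsync: %.1f ms" % (timed_small_fsync("app.log") * 1000))
    dirty_write("unrelated.bin", 4)
    print("fsync after 4 MB of unrelated dirty data: %.1f ms" % (timed_small_fsync("app.log") * 1000))

On a more modern default setup the gap may be much smaller; the point is just that the cost of the app's fsync() can depend on writes the app never made.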
IMO the real issue is that a competent logging framework doesn't block app code to sync the log to disk. The buffer should be swapped out under lock, and then synced in a separate thread. Yuck.
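Something like this double-buffer pattern, as a minimal sketch (not any particular framework's code; the names and flush interval are made up):

    import os
    import threading

    class AsyncFileLogger:
        """Sketch: append() never blocks on disk I/O; a background thread
        swaps the buffer out under the lock and does the write + fsync."""

        def __init__(self, path, flush_interval=1.0):
            self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
            self._lock = threading.Lock()
            self._buf = []
            self._interval = flush_interval
            self._stop = threading.Event()
            self._thread = threading.Thread(target=self._flush_loop, daemon=True)
            self._thread.start()

        def append(self, line):
            # Called from application threads: only touches the in-memory buffer.
            with self._lock:
                self._buf.append(line if line.endswith("\n") else line + "\n")

        def _flush_loop(self):
            while not self._stop.wait(self._interval):
                self._flush_once()
            self._flush_once()  # final drain on shutdown

        def _flush_once(self):
            # Swap the buffer out under the lock, then do the slow part
            # (write + fsync) without holding it.
            with self._lock:
                pending, self._buf = self._buf, []
            if pending:
                os.write(self._fd, "".join(pending).encode())
                os.fsync(self._fd)

        def close(self):
            self._stop.set()
            self._thread.join()
            os.close(self._fd)

Application threads only ever touch the in-memory list under the lock; the write() and fsync() happen on the background thread, so the worst case is losing up to one flush interval's worth of lines on a crash rather than stalling the request path.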
Which is why logging to disk on the server is BAD: have your log framework write to stdout and let upstart/systemd/whatever handle shipping it to a remote syslog server or whatever your fancy is.
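Roughly like this, as a sketch rather than a drop-in config (the unit name, remote host, and port are placeholders):

    # myapp.service -- the app just writes plain lines to stdout/stderr
    [Service]
    ExecStart=/usr/local/bin/myapp
    StandardOutput=journal
    StandardError=journal

    # /etc/rsyslog.d/50-forward.conf -- ship everything to a central collector
    # ("@@" forwards over TCP; a single "@" would be UDP)
    *.* @@logs.example.com:514

That way the only process doing blocking disk or network I/O for logs is the log shipper, not the app.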