
>> Only applicable when you produce a small, predictable volume of logging.

No, it's primarily in larger log-volume systems that we run into this problem. Take a FIX session adapter (very common software in the financial services world): baseline is around 5Gb/hour, but it's susceptible to fairly extreme fluctuation.

When you roll the logs at, say, 500Mb, there's a chance our monitoring system misses an error in the tail of the previous log as the logs roll (which in the worst cases could mean losing a couple of million or even more).

Multiple logfiles for a day are harder to search through when handling the most common use case for log files, which is not debugging but ops staff answering questions on 'what happened with X?'

What benefits does a split logfile provide? The one use case is transferring logs across the network. However, most log management systems already split large logs up before transferring them to the archive silo.

You're losing a lot but gaining nothing with split logs.

>> When logging high volumes

It makes no difference: when the market swings you're going to need an extra 200Gb anyway, whether the logs are in chunks or not. In large-volume apps, deleting to make space is often not an option - you're logging in high volumes for a reason (often regulatory).

>> fetching an arbitrary header from the right file for a given context can be rather painful.

That's been a solved problem for a very long time, just as recombining a separate trace log with a main app log based on time stamp is a solved problem. I believe this highlights the bigger problem: you can go to university or a.n. other school and learn to be a developer, but where can you go to learn operations? The net result is that people come up with their own solutions each time instead of building on work already done. That work is available, but it's often passed on by word of mouth. It's surprisingly hard, even in today's world, to find decent blogs on ops work (thankfully there are more around now).

>> When a serious error occurs then as much context as feasible must be provided

Yes, this is the extraordinarily rare case where stack traces can be useful. They still don't belong in the main app log, though.

Push to a separate trace log and recombine on the fly when you need to view. The benefits are to the ops team and the monitoring system.

Another approach not often mentioned is black box logging. Log to a circular buffer in memory: the first bytes of the buffer are set to a pattern that can be searched for in a core dump, and the remainder is used as the circular buffer. It's fast to log to, takes no disk space and provides crucial context in the event of a crash.
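Roughly the idea, sketched in Python purely for illustration (the names are made up; in practice this usually lives in native code so the marker is easy to find in the process's core dump):

    MAGIC = b"BLACKBOX-LOG-v1:"       # fixed pattern to grep for in a core dump
    BUF_SIZE = 64 * 1024              # fixed memory budget, zero disk usage

    class BlackBoxLog:
        """Ring buffer preceded by a searchable marker; recovered only post-mortem."""
        def __init__(self):
            self.buf = bytearray(MAGIC) + bytearray(BUF_SIZE)
            self.pos = len(MAGIC)     # write cursor, never touches the marker

        def log(self, line: str) -> None:
            for b in line.encode() + b"\n":
                self.buf[self.pos] = b
                self.pos += 1
                if self.pos == len(self.buf):   # wrap: oldest entries get overwritten
                    self.pos = len(MAGIC)

    box = BlackBoxLog()
    box.log("order 42 acked in 3ms")  # cheap enough to call on the hot path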

I'm not sure I'll be able to convince you based on your tone, but speaking from experience, you are making the most common mistakes.




When you roll the logs at, say, 500Mb, there's a chance our monitoring system misses an error

Sorry but then your monitoring system is broken? Why is its operation tied to log chunking?

Multiple logfiles for a day are harder to search

Not really. Often enough, interesting events span multiple days (or cross midnight) anyway, so regardless of chunking you have to be prepared for that case.

What benefits does a split logfile provide?

I'll give you that "one file per day" seems easier at first, but I've been bitten by that too often. A script to identify the file-range for a given time-range via binary search is really easy to write (see the sketch below this list), only has to be written once, and the benefits of fixed chunks are:

Easier handling. If you need to offload "50GB" from a given log-host then that's easier to do when you know that will be 50 files.

You don't risk a scratch disk running full (i.e. those small and expensive spindles that take the initial load) when the application decides to have a bad-hair day and put out 10x the normal volume for a while (e.g. interleaved with stack traces).
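Purely as an illustration of that script (it assumes each chunk starts with an ISO-8601 timestamp and that chunk names sort in time order; bisect's key= argument needs Python 3.10+):

    import bisect, glob

    def first_ts(path):
        """Timestamp of the first record in a chunk (assumes lines start with an ISO timestamp)."""
        with open(path) as f:
            return f.readline()[:19]              # e.g. "2015-06-16T09:30:00"

    def chunks_for_range(pattern, start, end):
        """Binary-search the sorted chunk list for the files that can hold [start, end]."""
        files = sorted(glob.glob(pattern))        # app.log.0001, app.log.0002, ...
        lo = max(bisect.bisect_right(files, start, key=first_ts) - 1, 0)
        hi = bisect.bisect_right(files, end, key=first_ts)
        return files[lo:hi]                       # only O(log n) chunks ever get opened

    print(chunks_for_range("app.log.*", "2015-06-16T09:30:00", "2015-06-16T10:00:00"))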

>> fetching an arbitrary header from the right file for a given context can be rather painful.
>> That's been a solved problem for a very long time. Just as recombining a separate trace log with a main app log based on time stamp is a solved problem.

I said painful, not impossible. You say you're dealing with 5GB/hour logs; then you must be well aware how long even the simplest merge-job takes on the slice of a single day?

>> When a serious error occurs then as much context as feasible must be provided
>> Yes, this is the extraordinarily rare case where stack traces can be useful.

We must be living in different worlds then.

When errors pop up in the main logfile, which happens frequently, the last thing I want is to stitch together context from other files.

A mere grep on a day's worth of logs takes about 10 minutes for us. Any kind of text-processing or "for each line perform lookup in $otherplace" quickly pushes that into the hours.


>> Sorry but then your monitoring system is broken? Why is its operation tied to log chunking?

Yours clearly bends some unbendable truths of the known universe ;-)

A monitoring system ordinarily samples the logfile at a given point in time; that is, an open followed by a close operation on the logfile (lest you run into phantom disk-usage issues caused by holding an open file handle to a compressed/rotated logfile).

If between sample 1 and sample 2 the log file is rolled, your monitoring system never sees the data between sample 1 and the end of the previous log.
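In rough Python, purely to illustrate the polling pattern (the path and details are made up):

    import os

    offset = 0   # byte position reached at the last sample, kept between polls

    def sample(path="app.log"):
        """One monitoring poll: open, read whatever is new, close again."""
        global offset
        with open(path, "rb") as f:
            size = f.seek(0, os.SEEK_END)
            if size < offset:    # the log rolled between polls: a fresh, smaller file
                offset = 0       # ...whatever the old file gained since the last poll is never read
            f.seek(offset)
            new_data = f.read()
            offset = f.tell()
        return new_data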

Possible workarounds include logfile streaming (say via a pipe on the filesystem) but that introduces much larger problems and would in no way be compatible with the app rolling its log files.

A much simpler fix is to roll only on events (bouncing the app, etc.).

>> A script to identify the file-range for a given time-range via binary search is really easy to write

Contrast with no script required. No bugs, no maintenance time, no different versions in different regions. Just no script. The benefits don't stop there - when you need to query, there's no waiting for a bunch of disk seeks while the script runs against a busy disk controller - you can just get straight on with parsing the results.

>> Easier handling. If you need to offload "50GB" from a given log-host then that's easier to do when you know that will be 50 files.

Moving 50 items is easier than moving 1? I'm unconvinced but still open to any interesting ideas here.

>> You don't risk a scratch disk running full

There is no change to the risk profile; neither approach is better in this regard, unfortunately.

>> e.g. interleaved with stack traces

Just one of the many excellent reasons not to log them in production.

>> you must be well aware how long even the simplest merge-job takes on the slice of a single day?

Yes - it runs at full disk I/O speed with almost no CPU overhead. Remember we are merging sorted data (already in temporal order).
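To illustrate (a minimal sketch; in practice this is just a standard merge of already-sorted streams, e.g. sort -m, and the timestamp-prefix assumption below is mine):

    import heapq

    def merged(*paths):
        """Stream-merge logs that are already in temporal order; O(1) memory per file."""
        files = [open(p) for p in paths]
        try:
            # assumes every line begins with a sortable timestamp, e.g. "2015-06-16T09:30:00.123 ..."
            yield from heapq.merge(*files, key=lambda line: line[:23])
        finally:
            for f in files:
                f.close()

    # view the main app log and its trace log as a single stream, only when needed
    for line in merged("app.log", "trace.log"):
        print(line, end="")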

>> which happens frequently

It's clearly a high-profile app you're working on (in my experience the people in charge of the pennies refuse to sign off on the cost of disk space for this amount of logging without solid reasoning), so I'm surprised it behaves as badly as you say. Concerning, but then it also appears you don't have an ops team either.

>> stitch together context from other files.

I get the impression you're thinking that you should manually stitch these together? On the very rare occasions you need this data, remember both are already in time order; we just merge them on the fly.

>> A mere grep on a day's worth of logs takes about 10 minutes for us.

It sounds like you've got it all in one file, i.e. not splitting your audit.log out from your error.log and your trace.log? Those are techniques I couldn't do without, to be honest: otherwise the monitoring system would have to consume every byte logged, when it really only needs to see a subset of the total log output - splitting saves CPU time and disk seeks.
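As an illustration only (the file names, and the use of Python's stock logging module, are just for the example):

    import logging

    def add_stream(name, path, level=logging.DEBUG):
        """Route one named logger to its own file."""
        handler = logging.FileHandler(path)
        handler.setLevel(level)
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        logger.addHandler(handler)
        logger.propagate = False                 # keep each stream separate
        return logger

    audit = add_stream("audit", "audit.log")     # "what happened with X?" questions
    errors = add_stream("errors", "error.log")   # the only file the monitor has to read
    trace = add_stream("trace", "trace.log")     # stack traces / context, merged in on demand

    audit.info("order 42 filled")
    errors.error("FIX session dropped")
    trace.debug("gateway state at time of drop: ...")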

>> Any kind of text-processing or "for each line perform lookup in $otherplace" quickly pushes that into the hours.

shakes head

Would you humour me with some context here? What is your background? From your replies I would say you are non-ops staff [1], self-taught [2], and with limited experience [3].

[1] You don't demonstrate any knowledge of basic tools, e.g. you mentioned writing a script for a problem solved by a 35+ year old command found on any unix host

[2] You haven't volunteered anything more advanced than obvious, first-order approaches - approaches that I'd expect any CompSci university grad could suggest off the cuff with no prior experience.

[3] You mention grepping through a single logfile of at least 35Gb, likely much larger.


Possible workarounds include logfile streaming (say via a pipe on the filesystem) but that introduces much larger problems and would in no way be compatible with the app rolling its log files.

What you call a workaround (streaming to a central location via syslog or scribe) happens to be the standard approach in my corner of the world. Analytics and monitoring operate naturally on the stream because, as you point out, sampling rolling logfiles is not exactly reasonable, and neither is scattering logfiles across application servers.
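To illustrate what I mean (the hostname is made up; Python's stock SysLogHandler here, but scribe or any other forwarder plays the same role):

    import logging, logging.handlers

    # ship every record to a central collector instead of writing files on the app server
    handler = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
    handler.setFormatter(logging.Formatter("myapp: %(message)s"))

    log = logging.getLogger("myapp")
    log.setLevel(logging.INFO)
    log.addHandler(handler)

    log.info("order 42 filled")   # analytics and monitoring consume the stream on the collector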

I'm leaving this discussion at this point because I'm not interested in your condescending tone and insults. It appears you haven't even centralized your logging, yet feel entitled to give ops-advice that is at odds with how the rest of the world operates.


>> streaming to a central location via syslog or scribe

Ahh, "man mkpipe" - you're mixing up two separate concepts: streaming to a remote host vs streaming via a pipe on the filesystem.

>> standard approach in my corner of the world

Syslog streaming has been a common approach since the ~70s I'd guess? Likely before.

>> yet feel entitled to give ops-advice that is at odds with how the rest of the world operates

Hopefully, you see the irony in this.


Cut it out already...

you're mixing up two separate concepts - streaming to a remote host vs streaming via a pipe on the filesystem

You said above, quote: "but that introduces much larger problems and would in no way be compatible with the app rolling its log files."

Why would you have log-pipes, or log-files on your app-servers to begin with?

Syslog streaming has been a common approach since the ~70s I'd guess?

Then how come you're not doing it?

Hopefully, you see the irony in this.

All I'm seeing is a constant stream of arrogance that doesn't seem to be backed up. Also, the command that you wanted to helpfully point out is called "mkfifo" or "mknod".


>> Hopefully, you see the irony in this.

>> All I'm seeing is a constant stream of arrogance

Perhaps not...



