
>> Only applicable when you produce a small, predictable volume of logging.

No, it's primarily in larger log-volume systems that we run into this problem. Take a FIX session adapter (very common software in the financial services world): baseline is around 5Gb/hour, but it's susceptible to fairly extreme fluctuation.

When you roll the logs at, say, 500Mb, there's a chance our monitoring system misses an error in the tail of the previous log as the logs roll (which in the worst cases could mean losing a couple of million or even more).

Multiple logfiles for a day are harder to search through when handling the most common use case for log files, which is not debugging but ops staff answering questions on 'what happened with X?'

What benefits does a split logfile provide? The one use case is transferring logs across the network. However, most log management systems already split large logs up before transferring them to the archive silo.

You're losing a lot but gaining nothing with split logs.

>> When logging high volumes

It makes no difference: when the market swings you're going to need an extra 200Gb anyway, whether the logs are in chunks or not. In large-volume apps, deleting to make space is often not an option - you're logging in high volumes for a reason (often regulatory).

>> fetching an arbitrary header from the right file for a given context can be rather painful.

That's been a solved problem for a very long time, just as recombining a separate trace log with a main app log based on time stamp is a solved problem. I believe this highlights the bigger problem: you can go to university or a.n. other school and learn to be a developer, but where can you go to learn operations? The net result is that people come up with their own solutions each time instead of building on work already done. That work is available, but it's often passed on by word of mouth. It's surprisingly hard, even in today's world, to find decent blogs on ops work (thankfully there are more around now).

>> When a serious error occurs then as much context as feasible must be provided

Yes, this is the extraordinarily rare case where stack traces can be useful. They still don't belong in the main app log, though.

Push to a separate trace log and recombine on the fly when you need to view. The benefits are to the ops team and the monitoring system.

Another approach not often mentioned is black box logging. Log to a circular buffer in memory: the first bytes of the buffer are set to a pattern that can be searched for in a core dump, and the remainder is used as the circular buffer. It's fast to log to, takes no disk space and provides crucial context in the event of a crash.
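Roughly the idea, sketched in Python purely for illustration (the names are made up; in practice this usually lives in native code so the marker is easy to find in the process's core dump):

    MAGIC = b"BLACKBOX-LOG-v1:"       # fixed pattern to grep for in a core dump
    BUF_SIZE = 64 * 1024              # fixed memory budget, zero disk usage

    class BlackBoxLog:
        """Ring buffer preceded by a searchable marker; recovered only post-mortem."""
        def __init__(self):
            self.buf = bytearray(MAGIC) + bytearray(BUF_SIZE)
            self.pos = len(MAGIC)     # write cursor, never touches the marker

        def log(self, line: str) -> None:
            for b in line.encode() + b"\n":
                self.buf[self.pos] = b
                self.pos += 1
                if self.pos == len(self.buf):   # wrap: oldest entries get overwritten
                    self.pos = len(MAGIC)

    box = BlackBoxLog()
    box.log("order 42 acked in 3ms")  # cheap enough to call on the hot path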

I'm not sure I'll be able to convince you based on your tone, but speaking from experience, you are making the most common mistakes.




When you roll the logs at, say, 500Mb, there's a chance our monitoring system misses an error

Sorry but then your monitoring system is broken? Why is its operation tied to log chunking?

Multiple logfiles for a day are harder to search

Not really. Often enough, interesting events span multiple days (or cross midnight) anyway, so regardless of chunking you have to be prepared for that case.

What benefits does a split logfile provide?

I'll give you that "one file per day" seems easier at first, but I've been bitten by that too often. A script to identify the file-range for a given time-range via binary search is really easy to write (see the sketch below this list), only has to be written once, and the benefits of fixed chunks are:

Easier handling. If you need to offload "50GB" from a given log-host then that's easier to do when you know that will be 50 files.

You don't risk a scratch disk running full (i.e. those small and expensive spindles that take the initial load) when the application decides to have a bad-hair day and put out 10x the normal volume for a while (e.g. interleaved with stack traces).
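Purely as an illustration of that script (it assumes each chunk starts with an ISO-8601 timestamp and that chunk names sort in time order; bisect's key= argument needs Python 3.10+):

    import bisect, glob

    def first_ts(path):
        """Timestamp of the first record in a chunk (assumes lines start with an ISO timestamp)."""
        with open(path) as f:
            return f.readline()[:19]              # e.g. "2015-06-16T09:30:00"

    def chunks_for_range(pattern, start, end):
        """Binary-search the sorted chunk list for the files that can hold [start, end]."""
        files = sorted(glob.glob(pattern))        # app.log.0001, app.log.0002, ...
        lo = max(bisect.bisect_right(files, start, key=first_ts) - 1, 0)
        hi = bisect.bisect_right(files, end, key=first_ts)
        return files[lo:hi]                       # only O(log n) chunks ever get opened

    print(chunks_for_range("app.log.*", "2015-06-16T09:30:00", "2015-06-16T10:00:00"))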

>> fetching an arbitrary header from the right file for a given context can be rather painful.
>> That's been a solved problem for a very long time. Just as recombining a separate trace log with a main app log based on time stamp is a solved problem.

I said painful, not impossible. You say you're dealing with 5GB/hour logs; then you must be well aware how long even the simplest merge-job takes on the slice of a single day?

>> When a serious error occurs then as much context as feasible must be provided
>> Yes, this is the extraordinarily rare case where stack traces can be useful.

We must be living in different worlds then.

When errors pop up in the main logfile, which happens frequently, the last thing I want is to stitch together context from other files.

A mere grep on a day's worth of logs takes about 10 minutes for us. Any kind of text-processing or "for each line perform lookup in $otherplace" quickly pushes that into the hours.


>> Sorry but then your monitoring system is broken? Why is its operation tied to log chunking?

Yours clearly bends some unbendable truths of the known universe ;-)

A monitoring system ordinarily samples the logfile at a given point in time; that is, an open followed by a close operation on the logfile (lest you run into phantom disk-usage issues caused by holding an open file handle to a compressed/rotated logfile).

If between sample 1 and sample 2 the log file is rolled, your monitoring system never sees the data between sample 1 and the end of the previous log.
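In rough Python, purely to illustrate the polling pattern (the path and details are made up):

    import os

    offset = 0   # byte position reached at the last sample, kept between polls

    def sample(path="app.log"):
        """One monitoring poll: open, read whatever is new, close again."""
        global offset
        with open(path, "rb") as f:
            size = f.seek(0, os.SEEK_END)
            if size < offset:    # the log rolled between polls: a fresh, smaller file
                offset = 0       # ...whatever the old file gained since the last poll is never read
            f.seek(offset)
            new_data = f.read()
            offset = f.tell()
        return new_data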

Possible workarounds include logfile streaming (say via a pipe on the filesystem) but that introduces much larger problems and would in no way be compatible with the app rolling its log files.

A much simpler fix is to roll only on events (bouncing the app, etc.).

>> A script to identify the file-range for a given time-range via binary search is really easy to write

Contrast with no script required. No bugs, no maintenance time, no different versions in different regions. Just no script. The benefits don't stop there - when you need to query, there's no waiting for a bunch of disk seeks while the script runs against a busy disk controller - you can just get straight on with parsing the results.

>> Easier handling. If you need to offload "50GB" from a given log-host then that's easier to do when you know that will be 50 files.

Moving 50 items is easier than moving 1? I'm unconvinced but still open to any interesting ideas here.

>> You don't risk a scratch disk running full

There is no change to the risk profile; neither approach is better in this regard, unfortunately.

>> e.g. interleaved with stack traces

Just one of the many excellent reasons not to log them in production.

>> you must be well aware how long even the simplest merge-job takes on the slice of a single day?

Yes - it runs at full disk I/O speed with almost no CPU overhead. Remember we are merging sorted data (already in temporal order).
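To illustrate (a minimal sketch; in practice this is just a standard merge of already-sorted streams, e.g. sort -m, and the timestamp-prefix assumption below is mine):

    import heapq

    def merged(*paths):
        """Stream-merge logs that are already in temporal order; O(1) memory per file."""
        files = [open(p) for p in paths]
        try:
            # assumes every line begins with a sortable timestamp, e.g. "2015-06-16T09:30:00.123 ..."
            yield from heapq.merge(*files, key=lambda line: line[:23])
        finally:
            for f in files:
                f.close()

    # view the main app log and its trace log as a single stream, only when needed
    for line in merged("app.log", "trace.log"):
        print(line, end="")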

>> which happens frequently

It's clearly a high-profile app you're working on (in my experience the people in charge of the pennies refuse to sign off on the cost of disk space for this amount of logging without solid reasoning), so I'm surprised it behaves as badly as you say. Concerning, but then it also appears you don't have an ops team either.

>> stitch together context from other files.

I get the impression you're thinking that you should manually stitch these together? On the very rare occasions you need this data, remember both are already in time order; we just merge them on the fly.

>> A mere grep on a day's worth of logs takes about 10 minutes for us.

It sounds like you've got it all in one file, i.e. not splitting your audit.log out from your error.log and your trace.log? Those are techniques I couldn't do without, to be honest: otherwise the monitoring system would have to consume every byte logged, when it really only needs to see a subset of the total log output - splitting saves CPU time and disk seeks.
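As an illustration only (the file names, and the use of Python's stock logging module, are just for the example):

    import logging

    def add_stream(name, path, level=logging.DEBUG):
        """Route one named logger to its own file."""
        handler = logging.FileHandler(path)
        handler.setLevel(level)
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        logger.addHandler(handler)
        logger.propagate = False                 # keep each stream separate
        return logger

    audit = add_stream("audit", "audit.log")     # "what happened with X?" questions
    errors = add_stream("errors", "error.log")   # the only file the monitor has to read
    trace = add_stream("trace", "trace.log")     # stack traces / context, merged in on demand

    audit.info("order 42 filled")
    errors.error("FIX session dropped")
    trace.debug("gateway state at time of drop: ...")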

>> Any kind of text-processing or "for each line perform lookup in $otherplace" quickly pushes that into the hours.

shakes head

Would you humour me with some context here? What is your background? From your replies I would say you are non-ops staff [1], self-taught [2], and with limited experience [3].

[1] You don't demonstrate any knowledge of basic tools, e.g. you mentioned writing a script for a problem solved by a 35+ year old command found on any unix host

[2] You haven't volunteered anything more advanced than obvious, first-order approaches - approaches that I'd expect any CompSci university grad could suggest off the cuff with no prior experience.

[3] You mention grepping through a single logfile of at least 35Gb, likely much larger.


Possible workarounds include logfile streaming (say via a pipe on the filesystem) but that introduces much larger problems and would in no way be compatible with the app rolling its log files.

What you call a workaround (streaming to a central location via syslog or scribe) happens to be the standard approach in my corner of the world. Analytics and monitoring operate naturally on the stream because, as you point out, sampling rolling logfiles is not exactly reasonable, and neither is scattering logfiles across application servers.
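To illustrate what I mean (the hostname is made up; Python's stock SysLogHandler here, but scribe or any other forwarder plays the same role):

    import logging, logging.handlers

    # ship every record to a central collector instead of writing files on the app server
    handler = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
    handler.setFormatter(logging.Formatter("myapp: %(message)s"))

    log = logging.getLogger("myapp")
    log.setLevel(logging.INFO)
    log.addHandler(handler)

    log.info("order 42 filled")   # analytics and monitoring consume the stream on the collector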

I'm leaving this discussion at this point because I'm not interested in your condescending tone and insults. It appears you haven't even centralized your logging, yet feel entitled to give ops-advice that is at odds with how the rest of the world operates.


>> streaming to a central location via syslog or scribe

Ahh, "man mkpipe" - you're mixing up two separate concepts: streaming to a remote host vs streaming via a pipe on the filesystem.

>> standard approach in my corner of the world

Syslog streaming has been a common approach since the ~70s I'd guess? Likely before.

>> yet feel entitled to give ops-advice that is at odds with how the rest of the world operates

Hopefully, you see the irony in this.


Cut it out already...

you're mixing up two separate concepts - streaming to a remote host vs streaming via a pipe on the filesystem

You said above, quote: "but that introduces much larger problems and would in no way be compatible with the app rolling its log files."

Why would you have log-pipes, or log-files on your app-servers to begin with?

Syslog streaming has been a common approach since the ~70s I'd guess?

Then how come you're not doing it?

Hopefully, you see the irony in this.

All I'm seeing is a constant stream of arrogance that doesn't seem to be backed up. Also, the command that you wanted to helpfully point out is called "mkfifo" or "mknod".


>> Hopefully, you see the irony in this.

>> All I'm seeing is a constant stream of arrogance

Perhaps not...



