
So... stop using a debugged and stable tool whose limitations and problems are well-known and understood and replace it with six bits of software duct-taped together, two of which aren't working yet (if they even exist), without any idea of how they interact when they hit edge cases.

I mean, "don't use X, use ShinyX instead" is one thing (and most of the time it's a bad thing but it does occasionally turn up good ideas), but this is just So Much Worse...




You're presenting the MySQL argument. "Why should we switch since we know it fails in exactly these 1,000 different ways and we can fix these problems? Using something better has unknown failure scenarios!"

Have you ever been woken up by a nagios page that automatically cleared after five minutes because the incoming queue was delayed past the alert interval?

Have you ever had your browser crash because you clicked on the wrong thing in the designed-in-1996-and-never-updated nagios interface and it dumped 500MB of logs to your screen?

Have you ever had services wake you up with alert then clear then alert then clear again because some new intern configured a new monitor but didn't set up alerting correctly (because lol, they don't get paged, so who gives a flip if they copied and pasted the wrong template config, as is standard practice)?

Have you had to hire "nagios consultants" to figure out how to scale out your busted monitoring infrastructure because nagios was designed to run on a single core Pentium 90?

Being pro-nagios is like being pro-Russia, pro-North Korea, and pro-Rap Genius while arguing "but at least we know how bad they are and can keep them in line."


Did you seriously just equate North Korea, Russia, Rap Genius and Nagios?

Everyone knows that Rap Genius is an oppressive regime in which you're executed for dissent, North Korea are SEO spammers, and in Russia... there is no need for monitoring software because Russian computer does not fail!


I think his point is that there are tradeoffs, and I agree. On top of that, meaningful debate over which tool to use should be about what context you're in. This applies to the OP's slide deck.

To give context about my comment about context:

* was nagios set up before you started the job?

* did you set up nagios yourself?

* is your internal process for managing nagios broken?

* culturally do you work at a place where ops is an afterthought?

* if Nagios is your technical debt do you have a way out? are you crushed by other commitments? Maybe it's more of a management/culture issue.

... hmm actually I should stop. From re-reading your comment, I can't tell how much of it is trolling (in an entertaining Skip Bayless, right wing radio, Jim Cramer kind of way).


> was nagios set up before you started the job?

Yup.

> did you set up nagios yourself?

god, no.

> is your internal process for managing nagios broken?

It was the best they knew after a dozen years of experience at other companies.

> culturally do you work at a place where ops is an afterthought?

Nope. We had three people for 500 machines and two data centers.

> if Nagios is your technical debt do you have a way out? are you crushed by other commitments? Maybe it's more of a management/culture issue.

It was "good enough" and nobody wanted to build out an alternative (or knew how to—the others were pure "sysadmin" people without programming backgrounds).

The problem: apathy. The solution: leave...after three and a half years.

> I can't tell how much of it is trolling

Never trolling from this account, just unfocused anger with no other outlet. :)


So, you're sublimating your frustration with a broken company into dislike for a tool. Logical.


>Have you ever been woken up by a nagios page that automatically cleared after five minutes because the incoming queue was delayed past the alert interval?

No, because I know how to configure escalations properly.
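Roughly, a serviceescalation like this (host, service, and contact group names here are hypothetical) keeps a blip from ever paging the on-call group, since escalation to them only starts at the third notification:

    define serviceescalation {
        host_name               db1             ; hypothetical host
        service_description     MySQL           ; hypothetical service
        first_notification      3               ; escalate starting at the 3rd notification
        last_notification       0               ; 0 = keep escalating until recovery
        notification_interval   30              ; re-notify every 30 minutes
        contact_groups          oncall-admins   ; hypothetical contact group
    }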

>Have you ever had your browser crash because you clicked on the wrong thing in the designed-in-1996-and-never-updated nagios interface and it dumped 500MB of logs to your screen?

Actually, no. I've had my browser crash due to AJAX crap all the time, though. Nagios' (and Icinga classic's) interface is clear, simple and logical; it's just not 2MB of worthless javascript that wastes half my CPU time, so I can see why it's unpopular with some user types.

>Have you ever had services wake you up with alert then clear then alert then clear again because some new intern configured a new monitor but didn't set up alerting correctly (because lol, they don't get paged, so who gives a flip if they copied and pasted the wrong template config, as is standard practice)?

No, because I know how to use time periods, and escalations again.
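A minimal sketch of the timeperiod side, with hypothetical names; routine services get a notification_period so they can only page during working hours:

    define timeperiod {
        timeperiod_name  business-hours
        alias            Business Hours
        monday           09:00-17:00
        tuesday          09:00-17:00
        wednesday        09:00-17:00
        thursday         09:00-17:00
        friday           09:00-17:00
    }

    define service {
        use                  generic-service   ; assumes a standard service template exists
        host_name            app1              ; hypothetical host
        service_description  Disk Space
        check_command        check_disk        ; assumes this command is defined elsewhere
        notification_period  business-hours    ; no pages outside these hours
    }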

>Have you had to hire "nagios consultants" to figure out how to scale out your busted monitoring infrastructure because nagios was designed to run on a single core Pentium 90?

No, because it isn't, because I know the basics of Linux performance tuning, and because I've heard of Icinga and/or distributed Nagios/Icinga systems for very large scale.

Your post reads like "Have you ever crashed your car head-on into a concrete wall at 70mph because it didn't brake for you?". No amount of handholding a program can do will protect users who have no clue how to use it.

I do not by any means consider myself an expert in Nagios either - if there were such a market for consultants as you claim, I'd likely be doing it and therefore be rich; in actual fact, it's a skill just about any mid-level or better admin has.

I've inherited a Nagios config before that was a mess, that I rebuilt from scratch in a maintainable way, as well as extended. If Nagios (or MySQL pre-Oracle, for that matter) has a problem, it's amateurs attempting it, making a mess, and others judging the quality of the tool on their sloppy work. Not unique to Nagios, by any means. If there's a criticism you can level at Nagios for that, it's the lack of documentation and examples in the config files.

I'm also not denying the existence of alternatives - OpenNMS is ok, as is Zabbix, but both are far more limited in available plugins and by nature harder to extend. Munin is good for out-of-the-box graphing, but relatively poor for actual monitoring/alerting, hard to write new plugins for, and limited in the availability of additional plugins. Each one is a standalone tool that's good for a purpose, not some vaguely defined set of programs, partly nonexistent, that everyone has to hack together for themselves.


The best approach to getting accustomed to Nagios is definitely setting it up yourself. I used to support a mess of a Nagios server once, and at my new job, when they needed a good monitoring system, I requested Nagios. Now we have our 100+ servers, 500+ switches and many other services monitored through it.

We had a college student write up a quick nagios add/del/modify app over his summer. It took him a few hours to get it running, and now it's easy to replace the whole configuration through it.

Same here, it has never crashed on us, on the server or on the client. I don't know what this guy is talking about; maybe he's confusing it with the old OpenNMS or Zenoss?


not trolling - but how do you configure escalations properly in a circumstance where a queue might delay longer than any arbitrary period? In short - tell me your secrets


Well, why are you triggering on something that appears to have a completely random amount of delay? Either you choose your line in the sand, or you monitor the dependency that is causing the variability.
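In Nagios terms, "monitor the dependency" can be a servicedependency that mutes the downstream alert while the upstream cause is already known to be broken (all names hypothetical):

    define servicedependency {
        host_name                      queue1        ; hypothetical host running the queue
        service_description            Queue Depth   ; the check on the actual cause
        dependent_host_name            app1
        dependent_service_description  Moo
        notification_failure_criteria  w,c,u         ; mute Moo while Queue Depth is WARNING/CRITICAL/UNKNOWN
    }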


A typical nagios alert will fire if a service hasn't been updated in X seconds. Sometimes the queue of incoming events gets backed up and nagios doesn't receive the results of service probes until X+5 seconds or 2X seconds later (due to internal nagios design problems, not the services actually being delayed).

So, nagios thinks "Service Moo hasn't contacted us in 60 seconds, ALERT!" when the update is actually in the event log, but nagios hasn't processed it yet.
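For reference, that behaviour comes from freshness checking on passive services. A sketch (names hypothetical) that pads the threshold well past the expected update interval, so queue lag alone doesn't page anyone:

    define service {
        use                     generic-service   ; assumes a standard template
        host_name               app1              ; hypothetical host
        service_description     Moo
        active_checks_enabled   0                 ; results arrive passively
        passive_checks_enabled  1
        check_freshness         1                 ; run check_command if results go stale
        freshness_threshold     300               ; seconds; well past the expected 60s updates
        check_command           service-is-stale
    }

    define command {
        command_name  service-is-stale
        command_line  $USER1$/check_dummy 2 "No result received in time"
    }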


I haven't seen this in ~1k services, but I guess it probably depends on the spec of the monitoring system to some degree, and I realise that 1k+ hosts is likely a different story. If you're using passive checks in any high-rate capacity, you should be using NSCA or increasing the frequency at which they're read in anyway. This is also another problem Icinga handles better - while I say Nagios for convenience's sake, my comments here refer to Icinga (and to Nagios XI, which is comparable but stupidly expensive).
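For anyone following along, passive results reach the server via the send_nsca client, one tab-delimited result per line (the hostname and paths here are hypothetical):

    # host, service, return code (0=OK, 2=CRITICAL), plugin output
    printf "app1\tMoo\t0\tOK - heartbeat received\n" | \
        send_nsca -H nagios.example.com -c /etc/nagios/send_nsca.cfg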


No matter how long the delay until alerting is, there is always the possibility of a service that stays critical for $time + 1; it's just that delaying the alert by a minute or two makes no appreciable difference in most circumstances (if it does, you should have 24/7 staff anyway) and filters out services briefly dropping out and immediately coming back, e.g. a service or host restart that happened to be caught at the wrong time.

That, and setting up proper retry intervals for checks that take a long time to execute.
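Concretely, that's max_check_attempts plus retry_interval in the service definition (Nagios 3 directive names; host and service are hypothetical), so three consecutive failures a minute apart are needed before a hard state and a page:

    define service {
        use                  generic-service  ; assumes a standard template
        host_name            app1             ; hypothetical host
        service_description  Moo
        check_interval       5   ; minutes between checks when healthy
        retry_interval       1   ; recheck every minute once a problem appears
        max_check_attempts   3   ; 3 straight failures before a hard state / notification
    }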


The link makes the point that "Nagios has all these things wrong with it, like a horrible UI. Fortunately, there's Sensu, which requires you to build your own UI to do the things it can do. And by 'it' I mean 'other software instead'."

I like the idea of Sensu, and I would love to have a system built on it, but I don't have the time or energy to build that system, create the parts of it that need creating, tie them all together, and then (ha!) make sure they're all monitored so that none of my monitoring system fails.

Being pro-Nagios is like being pro-USA: it does a lot of things that annoy people, but in the end it means well, gets the job done, and you know what the caveats are. It's not perfect, but I have other problems to solve and systems to build.


>>You're presenting the MySQL argument. "Why should we switch since we know it fails in exactly these 1,000 different ways and we can fix these problems? Using something better has unknown failure scenarios!"

The point is that you cannot know that something is better if it has unknown failure scenarios.


Example: you have a car you know breaks down exactly every 200 miles. You have a newer car you haven't seen fail yet, but like all mechanical things, you know it'll fail one day. Do you go with the old one because you know how it fails?


"pro-Russia" -- nice.


I was hoping for more than more of the same -- building an integrated package of tools is more work than the presenter of this post realizes.



