Stop using Nagios (so it can die peacefully) (slideshare.net)
112 points by erbdex on March 4, 2014 | 122 comments



So... stop using a debugged and stable tool whose limitations and problems are well-known and understood and replace it with six bits of software duct-taped together, two of which aren't working yet (if they even exist), without any idea of how they interact when they hit edge cases.

I mean, "don't use X, use ShinyX instead" is one thing (and most of the time it's a bad thing but it does occasionally turn up good ideas), but this is just So Much Worse...


You're presenting the MySQL argument. "Why should we switch since we know it fails in exactly these 1,000 different ways and we can fix these problems? Using something better has unknown failure scenarios!"

Have you ever been woken up by a nagios page that automatically cleared after five minutes because the incoming queue was delayed past the alert interval?

Have you ever had your browser crash because you clicked on the wrong thing in the designed-in-1996-and-never-updated nagios interface and it dumped 500MB of logs to your screen?

Have you ever had services wake you up with alert then clear then alert then clear again because some new intern configured a new monitor but didn't set up alerting correctly (because lol, they don't get paged, so who gives a flip if they copied and pasted the wrong template config, as is standard practice)?

Have you had to hire "nagios consultants" to figure out how to scale out your busted monitoring infrastructure because nagios was designed to run on a single core Pentium 90?

Being pro-nagios is like being pro-Russia, pro-North Korea, and pro-Rap Genius while arguing "but at least we know how bad they are and can keep them in line."


Did you seriously just equate North Korea, Russia, Rap Genius and Nagios?

Everyone knows that Rap Genius is an oppressive regime in which you're executed for dissent, North Korea are SEO spammers, and in Russia... there is no need for monitoring software because Russian computer does not fail!


I think his point is there are tradeoffs, and I agree. On top of that, meaningful debate over which tool to use should be about what context you're in. This applies to the OP's slidedeck.

To give context about my comment about context:

* was nagios setup before you started the job?

* did you setup nagios yourself?

* is your internal process for managing nagios broken?

* culturally do you work at a place where ops is an afterthought?

* if Nagios is your technical debt do you have a way out? are you crushed by other commitments? Maybe it's more of a management/culture issue.

... hmm actually I should stop. From re-reading your comment, I can't tell how much of it is trolling (in an entertaining Skip Bayless, right-wing radio, Jim Cramer kind of way).


was nagios setup before you started the job?

Yup.

did you setup nagios yourself?

god, no.

is your internal process for managing nagios broken?

It was the best they knew after a dozen years of experience at other companies.

culturally do you work at a place where ops is an afterthought?

Nope. We had three people for 500 machines and two data centers.

if Nagios is your technical debt do you have a way out? are you crushed by other commitments? Maybe it's more of a management/culture issue.

It was "good enough" and nobody wanted to build out an alternative (or knew how to—the others were pure "sysadmin" people without programming backgrounds).

The problem: apathy. The solution: leave...after three and a half years.

I can't tell how much of it is trolling

Never trolling from this account, just unfocused anger with no other outlet. :)


So, you're sublimating your frustration with a broken company into dislike for a tool. Logical.


>Have you ever been woken up by a nagios page that automatically cleared after five minutes because the incoming queue was delayed past the alert interval?

No, because I know how to configure escalations properly.
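(For the curious, a service escalation is what makes that possible: it controls who gets notified at which notification number. A rough sketch, with made-up host, service and contact-group names:)

    define serviceescalation {
        host_name               db01            ; example host
        service_description     Replication Lag ; example service
        first_notification      2               ; don't page the escalation contacts on the first notification
        last_notification       0               ; 0 = keep escalating until the problem recovers
        notification_interval   30              ; re-notify every 30 minutes while escalated
        escalation_period       24x7
        contact_groups          oncall-admins
    }

Combined with sane max_check_attempts and notification_interval values on the service itself, a problem has to persist before a human is actually paged.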

>Have you ever had your browser crash because you click on the wrong thing in the designed-in-1996-and-never-updated nagios interface and had your browser crash because it dumps 500MB of logs to your screen?

Actually, no. I've had my browser crash due to AJAX crap all the time though. Nagios' (and Icinga classic's) interface is clear, simple and logical; it's just not 2MB of worthless javascript that wastes half my CPU time, so I can see why it's unpopular with some user types.

>Have you ever had services wake you up with alert then clear then alert then clear again because some new intern configured a new monitor but didn't set up alerting correctly (because lol, they don't get paged, so who gives a flip if they copied and pasted the wrong template config, as is standard practice)?

No, because I know how to use time periods, and escalations again.

>Have you had to hire "nagios consultants" to figure out how to scale out your busted monitoring infrastructure because nagios was designed to run on a single core Pentium 90?

No, because it isn't, because I know the basics of Linux performance tuning, and because I've heard of Icinga and/or distributed Nagios/Icinga systems for very large scale.

Your post reads like "Have you ever crashed your car head on into a concrete wall at 70mph because it didn't brake for you?". No amount of handholding a program can do will protect users who have no clue how to use it.

I do not by any means consider myself an expert in Nagios either - if there was such a market for consultants as you claim, I'd likely be doing it and therefore be rich, but in actual fact, it's a skill just about any mid-level or better admin has.

I've inherited a Nagios config before that was a mess, that I rebuilt from scratch in a maintainable way, as well as extended. If Nagios (or MySQL pre-Oracle, for that matter) has a problem, it's amateurs attempting it, making a mess, and others judging the quality of the tool on their sloppy work. Not unique to Nagios, by any means. If there's a criticism you can level at Nagios for that, it's the lack of documentation and examples in the config files.

I'm also not denying the existence of alternatives - OpenNMS is ok, as is Zabbix, but both are far more limited in terms of available plugins and extensibility, and by nature harder to extend. Munin is good for out of the box graphing, but relatively poor for actual monitoring/alerting and hard to write new plugins for with limited availability of additional plugins. Each one is a standalone tool that's good for a purpose, and not some vaguely defined set of programs, partly nonexistent, that everyone has to hack together for themselves.


The best approach to getting accustomed to Nagios is definitely setting it up yourself. I used to support a mess of a Nagios server once, and at my new job, when they needed a good monitoring system, I requested Nagios. Now we have our 100+ servers, 500+ switches and many other services monitored through it.

We had a college student write up a quick nagios add/del/modify app over his summer. It took him a few hours to bring up, and now it's easy to replace the whole configuration through it.

Same here, has never crashed on us.. on the server or on the client. I don't know what this guy is talking about; maybe he's confusing it with the old OpenNMS or ZenOS?


not trolling - but how do you configure escalations properly in a circumstance where a queue might delay longer than any arbitrary period? In short - tell me your secrets


Well, why are you triggering on something that appears to have a completely random amount of delay? Either you choose your line in the sand, or you monitor the dependency that is causing the variability.


A typical nagios alert will fire if it hasn't been updated in X seconds. Sometimes the queue of incoming events gets backed up and nagios doesn't receive the results of service probes until X+5 seconds or 2X seconds later (due to internal nagios design problems, not the services actually being delayed).

So, nagios thinks "Service Moo hasn't contacted us in 60 seconds, ALERT!" when the update is actually in the event log, but nagios hasn't processed it yet.
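(What's being described is Nagios' freshness checking for passively submitted results; roughly something like this, with a made-up service and arbitrary numbers:)

    define service {
        use                     generic-service  ; assumes a generic-service template exists
        host_name               app01
        service_description     Moo Heartbeat
        active_checks_enabled   0                ; results arrive passively, e.g. via NSCA
        passive_checks_enabled  1
        check_freshness         1                ; run check_command if no result arrives in time
        freshness_threshold     180              ; results are expected every 60s; pad for queue delays
        check_command           check_dummy!2!"No heartbeat received"
    }

If freshness_threshold is set to exactly the submission interval, any backlog in the incoming event queue looks like staleness and fires the alert; padding it (or monitoring the queue latency itself) is what avoids the false page.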


I haven't seen this in ~1k services, but I guess it probably depends on the spec of the monitoring system to some degree, and I realise that 1k+ hosts is likely a different story. If you're using passive checks in any high-rate capacity, you should be using NSCA or increasing the frequency they are read in anyway. This is also another problem Icinga handles better - while I say Nagios for convenience's sake, my comments here refer to Icinga (and to Nagios XI, which is comparable but stupidly expensive).


No matter how long the delay until alerting is, there is always the possibility of a service that stays critical for $time + 1; it's just that delaying the alert by a minute or two makes no appreciable difference in most circumstances (if it does, you should have 24/7 staff anyway) and filters out services briefly dropping out and immediately coming back, e.g. a service or host restart that happened to be caught at the wrong time.

That, and setting up proper retry intervals for checks that take a long time to execute.
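(Concretely, that's the standard soft/hard state machinery - a sketch with arbitrary values:)

    define service {
        use                  generic-service
        host_name            web01
        service_description  HTTP
        check_command        check_http
        check_interval       2   ; minutes between checks while the service is OK
        retry_interval       1   ; re-check every minute while in a soft (unconfirmed) state
        max_check_attempts   3   ; only go hard, and notify, after three consecutive failures
    }

A blip that clears within a check or two never reaches a hard state, so nobody gets woken up for it.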


The link makes the point that 'Nagios has all these things wrong with it, like a horrible UI. Fortunately, there's Sensu, which requires you to build your own UI to do these things it can do. And by 'it' I mean 'other software instead'.'

I like the idea of Sensu, and I would love to have a system built on it, but I don't have the time or energy to build that system, create the parts of it that need creating, tie them all together, and then (ha!) make sure they're all monitored so that none of my monitoring system fails.

Being pro-Nagios is like being pro-USA: it does a lot of things that annoy people, but in the end it means well, gets the job done, and you know what the caveats are. It's not perfect, but I have other problems to solve and systems to build.


>>You're presenting the MySQL argument. "Why should we switch since we know it fails in exactly these 1,000 different ways and we can fix these problems? Using something better has unknown failure scenarios!"

The point is that you cannot know that something is better if it has unknown failure scenarios.


Example: you have a car you know breaks down exactly every 200 miles. You have a newer car you haven't seen fail yet, but like all mechanical things, you know it'll fail one day. Do you go with the old one because you know how it fails?


"pro-Russia" -- nice.


I was hoping for more than the same -- building an integrated package of tools is more work than the presenter of this post realizes.


Stop using nagios, all you have to do is string together 6 random pieces of software, 2 of which don't exist yet!


2 of them are not available in nagios at all (graphing and anomaly detection) and 1 already sucks completely (UI), so I'm not sure this is a good way to look at the presentation.


Most nagios users configure the pnp4nagios plugin for graphing


...and some use Centreon, which bolts graphing and a better UI onto Nagios 'out of the box'

http://www.centreon.com/ ..or install via FAN: http://www.fullyautomatednagios.org/wordpress/


It isn't very clear to me which Nagios plug-ins are considered standard, and perhaps that sort of confusion is what creates all the FUD surrounding Nagios.


Use http://mathias-kettner.com/check_mk.html

Makes life so much better.


When two of the six points don't have answers, and one of those has "I will have to write something", it's pretty clear that the talk is not 'stop using Nagios right now', but 'what do we need to replace Nagios and do it right'. It's not a talk about a tool that is ready now.


Exactly. "Hey use Zabbix" is something only people who have never used Zabbix would recommend.

Nagios, and a host of other popular software that's difficult to use, exist because the alternatives are so poor.


Would you mind elaborating on your problems with Zabbix?

We've been using it for a small ISP for nearly two years and are quite happy with it. All limitations that came up during integration could be fixed by writing some small scripts. Even upgrades worked.. :)

So if you have any Zabbix pain that might be waiting for us, please share.


Can it be completely driven by some configuration management tool and run without persistent storage? These are two parts of nagios that make it work for me. Granted, I didn't look too far, but some basic research says you need a database and have to manage everything via GUI.


The data is stored inside a relational database, which might become a problem in the future.

They do offer an API that is quite usable. I've built a simplified interface for our less trained staff members with it.

You won't find any RRD files which might be a problem for you. Until now that was only a theoretical problem for us.
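(For anyone curious, the Zabbix API is JSON-RPC over HTTP. A rough sketch of logging in and listing hosts - the URL and credentials are placeholders, and parameter names vary slightly between Zabbix versions:)

    import requests

    API = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder URL

    def call(method, params, auth=None):
        """Issue a single JSON-RPC request against the Zabbix API."""
        payload = {"jsonrpc": "2.0", "method": method,
                   "params": params, "auth": auth, "id": 1}
        resp = requests.post(API, json=payload, timeout=10)
        resp.raise_for_status()
        return resp.json()["result"]

    # user.login returns an auth token used by every subsequent call
    token = call("user.login", {"user": "api-user", "password": "secret"})

    # host.get lists configured hosts
    for host in call("host.get", {"output": ["hostid", "host"]}, auth=token):
        print(host["hostid"], host["host"])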


Yea, you are right. On the other hand, you usually define the templates once and then it kinda works automatically via discovery.


Or just one. http://en.wikipedia.org/wiki/Shinken_(software)

Shinken was written by Jean Gabès as a proof of concept for a new Nagios architecture. Believing the new implementation was faster and more flexible than the old C code, he proposed it as the new development branch of Nagios 4. This proposal was turned down by the Nagios authors, so Shinken became an independent network monitoring software application compatible with Nagios.


The "main problem" with nagios is that it's configuration is godawful. The "main problem" with most nagios users is that they edit their configuration manually.

Automate that shit. We use Nagios to monitor our infra (5000+ checks, hundreds of hosts), and chef maintains the config. Works without a hitch - and best of all - it's been running for years, and not once has anyone had to poke it with a stick.
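(For illustration, the pattern is roughly this - not their actual cookbook: a recipe that searches the Chef server for nodes and renders the Nagios host definitions from a template, so the config follows the infrastructure:)

    # recipes/nagios_hosts.rb -- illustrative sketch only
    monitored = search(:node, "chef_environment:production")

    template "/etc/nagios/conf.d/hosts.cfg" do
      source "hosts.cfg.erb"           # loops over nodes and emits define host { ... } blocks
      owner "nagios"
      group "nagios"
      mode "0644"
      variables(nodes: monitored)
      notifies :reload, "service[nagios]"
    end

    service "nagios" do
      supports reload: true
      action :nothing
    end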

Yes, NetSaint is old, yes, the UI is worse to look at than Putin's crotch, yes, the plugin architecture is whimsical as all shit - but... IT WORKS AND YOU CAN RELY ON IT.


This is one of the weird things about puppet. The core of puppet has general types, and software-specific stuff is in modules... except for Nagios rules, which are core types. Struck me as odd that only Nagios gets this 'star' treatment in the core.

http://docs.puppetlabs.com/references/latest/type.html
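For illustration, those core types are usually combined with exported resources, so each node declares its own Nagios objects and the monitoring server collects them. A rough sketch (host, command and file names are placeholders):

    # On every monitored node: export a host and a service object
    @@nagios_host { $::fqdn:
      ensure  => present,
      address => $::ipaddress,
      use     => 'generic-host',
      target  => '/etc/nagios/conf.d/puppet_hosts.cfg',
    }

    @@nagios_service { "check_ssh_${::fqdn}":
      ensure              => present,
      host_name           => $::fqdn,
      service_description => 'SSH',
      check_command       => 'check_ssh',
      use                 => 'generic-service',
      target              => '/etc/nagios/conf.d/puppet_services.cfg',
    }

    # On the Nagios server: collect everything that was exported
    Nagios_host <<| |>>
    Nagios_service <<| |>>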


Puppet is almost entirely composed from weird things, so much so that I'm surprised you're surprised.

rspec-puppet in particular; so much pain.


Another interesting example is that yumrepo is a core type, but apt::source has to be managed via a module


Once upon a time, back when I first installed NetSaint, the configuration was so bad that I learned OCaml to write a preprocessor to generate its configuration.[1] Nagios is the improvement!

And yeah, it's horrible.

[1] http://www.crsr.net/Software/nscc.html


^ This. I really don't understand how people claim that nagios can't scale.


It can't scale when we are talking tens of thousands of checks, triggers and hosts.

This is why we moved to Zabbix:

Number of items monitored: 77342
Number of triggers enabled: 25405

Try and scale Nagios to those values and let me know how it works. We tried and we let it go a few years ago.


5000 checks and a few hundred hosts is a fairly tiny operation. At an order of magnitude above that you will start feeling real pain with Nagios.


Yup, never claimed otherwise - but it's big enough that automation is essential, lest you descend into some variety of lovecraftian horror-scape.


Is the configuration dynamic? Like EC2 autoscaling or something? As far as I know Nagios needs a restart for each configuration change - imagine doing tens of Nagios restarts daily...


That in one line: "But I don't like it because it doesn't do things my preferred way!".

People use Nagios because it works, and it gets everything right, including config if you have any clue whatsoever how to set up a good object hierarchy. The only real problem with it is maintenance, an issue which Icinga resolved long ago.


I'm only a Nagios newbie, and have inherited an amateur setup, but I can't see how Nagios does everything right. Comment out a host? Better hope there was no test linked to only that host, because Nagios will throw a fit until you hunt down anything orphaned and get rid of it. And it doesn't give you helpful log messages to figure out what went wrong in these cases.

The UI is also horrible, though I see they've remade that in the latest versions. In particular the UI does not update to show changes in state in a timely manner. If you tail the logs, you see that the state has changed as expected when expected, but the UI is still reporting old data for minutes to come.

I haven't tried a lot of systems and am really no expert in monitoring, but saying Nagios gets everything right... just makes me feel oily. It can be best-of-class and still be shitty.


In my old job, they had a custom nagios templating system built on C preprocessor macros. You'd edit their custom config syntax, run a "build config" script, and everything would get converted into nagios-readable formats.

The thing is: nagios doesn't actually work for monitoring more than about 100 machines. Everything after that is a hack. It's built for a world where you know your systems by hand, configure everything by hand, and set up every alert with love and care and a kiss as you edit each file manually.

[This is your brain.] [This is your brain after losing sleep every night for a week because of nagios spurious alerting.]


We add/remove about 150 nodes every day to/from our monitoring system automatically via APIs. That use case has always sucked for me with Nagios. How would you do that?


Are you making the Nagios host pull (via NRPE) or are you asking the individual hosts to push (via NSCA)? I am trying to solve a similar problem. Given a dynamic population of hosts, each of which has a variable life span, I think that asking individual hosts to query their own state and then push that to a "monitoring receiver" is the most scalable, sustainable approach. At least, that's the theory I'll be testing this week.


We're not actually using Nagios. We use sensu because it was designed with this sort of dynamic environment in mind. (I'm trying to stay away from the "C" word. :)


In my current job, it's done automatically with Puppet. My previous job was a lot smaller scale and hosts were manually added to a config file of hosts, but I had set up groups properly so they only needed a single hostgroup to inherit all their services, dependencies and contacts from there.


I'd be interested to hear about that too - often Nagios can spend more of its time being reconfigured than actually monitoring…


I'll only address distributed monitoring. Use NSCA instead of NRPE. That bypasses the limitations of the Nagios active check scheduling. I have a wrapper that I use for that:

http://rbcarleton.com/send_nsca_service_check.shtml

Use some kind of automation system like CFEngine for the distributed scheduling. Some assembly required ;)
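(For the unfamiliar, submitting a passive result with the stock send_nsca client looks roughly like this - the host, service and paths are placeholders; a wrapper just automates the formatting:)

    # fields are tab-separated: host, service, return code (0=OK 1=WARN 2=CRIT), plugin output
    printf "web01\tDisk Usage\t0\tDISK OK - 42%% used\n" \
        | send_nsca -H nagios.example.com -c /etc/nagios/send_nsca.cfg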


My main experience from working with nagios for somewhere around 10 years now is that when people complain about it, they are either too lazy to read, or too inept to understand, the documentation or the architecture. (1)

That being said, there _are_ limitations (but scaling is not one of them) to nagios, and the configuration is definitely not something you do in a cute widdle config.yml.

Combined with the recent negative developments with the corporation behind the Nagios trademark and the enterprise version - which the author fails to mention, and should be even more alarming - one should at least consider using and contributing to bareos, the (hopefully) true OSS fork of nagios (I will for future deployments).

Oh hey, look at that, it would even offer the possibility to _improve_ the software. (I really don't see why an almost-complete rewrite of nagios should be necessary. Even after reading these slides(2)).

(1) That includes the author.

(2) Or rather especially after reading them.


> one should at least consider using and contributing to bareos, the (hopefully) true OSS fork of nagios

Bareos [1] is more a fork of Bacula [2], isn't it? Did you mean Icinga [3]?

[1] http://www.bareos.org/en/

[2] http://www.bacula.org/en/

[3] https://www.icinga.org/


Oops, of course! (Unfortunately, the allowed edit window on my comment has expired.)

The irony here is that with bacula a similar thing happened.


Great idea. You can always use ugly Nagios for monitoring your great monitoring system built on top of RabbitMQ, Ruby, Elasticsearch, Redis and a few other famous components.


I have wanted to try out collectd (https://collectd.org/) for some time now, does anyone have experience with it?


I just set it up last week to push data to Graphite. It took a little bit of time to understand how to configure it and the docs have conflicting information in some places. Also, you will need to build it yourself or get a PPA if you're on Ubuntu 12.04 and want to use it with Graphite since the version that Ubuntu ships with doesn't support graphite.

I haven't tried to write plugins for it but it comes with a lot out of the box and it's working well.
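(For reference, the Graphite side is the write_graphite plugin that ships with collectd 5.1+; a minimal collectd.conf fragment, with a placeholder hostname:)

    LoadPlugin cpu
    LoadPlugin memory
    LoadPlugin write_graphite

    <Plugin write_graphite>
      <Node "graphite">
        # placeholder carbon host; 2003 is the plain-text line receiver
        Host "graphite.example.com"
        Port "2003"
        Prefix "collectd."
        StoreRates true
      </Node>
    </Plugin>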


collectd is great because it takes very few resources (cpu/memory) to collect statistics very rapidly, and you can log your results however you wish (back to graph collectors, straight to CSV for later processing, out to custom processes or network protocols for other services to consume).

If you want to be lazy and not set up a complete graphing infrastructure, just run collectd, have it automatically log all your statistics, and use it with the bundled https://collectd.org/wiki/index.php/Collectd-web package to view your history when needed.


I use collectd and have found it to work rather well. The documentation could certainly use some work, but once you figure things out it's pretty easy to configure. CPU usage is almost non-existent for the amount of data monitored. It's also pretty easy to write a custom plugin to collect whatever custom metrics you want. I use a script to expose the current values as http/json for integration with Circonus for monitoring/alerts of key values. You can use whatever graphing tools you'd like; things are stored in rrd format.

Recently I've been implementing salt stack, and collectd is really easy to automate config of. I've got salt fully configuring collectd, and then using the circonus api to setup monitoring rules automatically per applied states. It's a beautiful thing.


You should try Scalyr (https://www.scalyr.com/). It's easier than juggling between six different tools. It was built by ex-Google DevOps engineers for the same reason you made this Slideshare: the available tools suck.

(Full disclosure: I'm working with Scalyr, but you should still try it.)


I don't think something like this should be hosted/SAAS. I don't want to send my gigabytes of logfiles (sometimes containing confidential information ...) over the internet to some unknown entity with probably questionable security.

My small startup company already has more than 10 (virtual) servers, which would cost us 500 dollars a month to monitor - more than the servers themselves cost.


[Scalyr founder here]

We take security very seriously, but let me turn this around into a question: what would it take for you to trust an external service to manage your logs? Some of the things we're doing:

1. SSL everywhere (including internal traffic between our backend servers).

2. We add a tag to the raw representation of every string value, so that we can verify that data never leaks across accounts. (This has never detected a problem -- except in tests, because yes, we do test it.)

3. Implementing in a "safe" language (Java), to rule out low-level buffer management bugs.

4. As Greg noted, we make it trivial for you to redact sensitive data before it leaves your server.

We are sometimes asked for an on-premises installable version of our service. We don't provide that because we're using economies of scale on the backend to completely change the log management experience: when you give us a query, every CPU and spindle in our entire cluster is briefly devoted to that query. This means that you aren't limited to graphing predefined metrics; you can do ad-hoc exploration of your entire log corpus on the fly. E.g. display a histogram of response latencies for all requests for url XXX on server group YYY in the last 48 hours, and expect a near-instantaneous response.


>what would it take for you to trust an external service to manage your logs?

I think that's a really weird question that completely fails to address the concerns that some people might have. We do logging of sales, profit margins and stuff like that. You can't have access to that because: "You're not us". If you can read our data, then we're not going to use your service, and to do anything useful with the logs you really do need read access.

Of course you might have no reason to spy on our data, but the only safety is that you promise not to. We could separate logs for different things, so webserver logs go to you, but email logs go to an internal system, but then we would need two systems.


Do you ever email about this data or put it into spreadsheets on Google's servers?


What's your point there, that because they are already exposed in some areas they shouldn't care about being exposed in more? Or are you looking for an "aha" moment; "you're already compromising that data"?

Because either way, that doesn't change at all the concerns he voiced regarding this particular service.


I suppose both? I think it's pretty reasonable to expect a business dealing with storing sensitive data not to look at that data, regardless of whether it is email or logs.


> what would it take for you to trust an external service to manage your logs?

An act of God. This implies, of course, certain further prerequisites that are themselves probably quite challenging to meet.

> economies of scale

Under your current plans, the money spent putting my entire infrastructure on Scalyr would pay 2-3+ developers (or some mix of developers and sysadmins) in Taiwan, where my employer is based. We are not a tiny company, but that would be a substantial and welcome manpower increase for the server team, and only a fraction of that manpower would need to be spent meeting even our "would be nices" for logging and monitoring.

This would only get worse as we grew. Realistically, I would expect us to be able to open an office in Silicon Valley and start hiring there at market rates for the amount of money we'd be giving you.


>what would it take for you to trust an external service to manage your logs?

For a publicly traded company, a relaxation of relevant law to allow arbitrary data to leave the company.


> what would it take for you to trust an external service to manage your logs?

Well the no-brainers are:

1 - Encrypt everything before it leaves my infrastructure (with my key, not SSL), and only decrypt it in my infrastructure when generating reports.

2 - Anything that runs on my infrastructure is open source, and widely distributed. Bonus points for a simple protocol that I can write plugins for.

3 - Make it possible for me to back everything up, and restart everything in case you go out of business.

Those are the must-have "I won't even let you get through security otherwise" features. Now that I think about it, #3 alone makes anything you can offer worse than doing it in-house.

But none of those are features that'll make your system look any good in my eyes, they are just enough for your system to not look like an enemy.


You are using economy of scale, but you're not serving customers that are afraid to give you important data, which is not economical at all.


Completely valid points. Here's how we're dealing with each:

1. With our custom parser you can replace or delete confidential information from your logs before they're stored on our servers.

2. We're exploring a different pricing structure right now that would address this scenario. If this is your only hesitation, I hope you check us out anyway and contact us about pricing.


Custom parser won't guarantee that all of the data is dumped...and some of it you may even want in the system.

Also -- lots of companies are seriously concerned about pushing their data externally. Making it a hard sell.

However, as a long tail service, this looks great.


> With our custom parser you can replace or delete confidential information from your logs before they're stored on our servers.

This is both error prone and utterly defeats the purpose. Why would I pay a bunch of money for somebody else to manage my logs when I'd just have to keep them all anyway so I can get at the unredacted versions when there's a problem?


Does the parser run on the client? If not, it defeats its own purpose.


Yes, on the client. See "Redaction" at scalyr.com/agent.


Agreed. Contracting out your monitoring seems to me to defeat the point of running your own infrastructure.


So what happens when my email server goes down? One of the huge advantages nagios has is that there are plugins that'll send SMSes, plugins that'll send me phone calls. Hell, people have written scripts that let them call to get nagios alerts [1]. One of the huge advantages of having something hosted in-house, and with nagios, is that it can be configured to a level of precision I can't see your tool coming close to achieving.

Yes, Nagios' configuration is ugly and occasionally requires sacrifices to elder gods. At the same time, I've never found any sort of monitoring/alerting I've needed done that it can't handle. As much as your service looks cool for a specific subset of monitoring, it is still missing half the hooks that explain why nagios is stubbornly sticking around.

[1] http://www.googlux.com/callnagios.html
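(For anyone wondering how those hooks attach: a notification command is just a shell command with Nagios macros, wired to a contact. The send-sms script below is a placeholder for whatever SMS gateway you actually use:)

    define command {
        command_name  notify-service-by-sms
        command_line  /usr/local/bin/send-sms "$CONTACTPAGER$" "$NOTIFICATIONTYPE$: $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$"
    }

    define contact {
        contact_name                   oncall-sms
        pager                          +15555550100
        service_notification_period    24x7
        host_notification_period       24x7
        service_notification_options   w,c,r
        host_notification_options      d,r
        service_notification_commands  notify-service-by-sms
        host_notification_commands     notify-service-by-sms
    }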


There's a distinct lack of images on that website. I for one would like to see how I would diagnose problems without opening 5 tabs. A "demo" mode would be even more helpful.


I was about to say the same thing... not only are there no images, but there is such an overwhelming amount of text and bullet points on the "About" and "Feature" pages.

This looks more like documentation and less like a product landing page.

Here are some nice examples that might help: http://land-book.com/

https://www.gosquared.com/ (pulled from the first page of land-book)


We're releasing a new design in a few days, precisely because of past feedback like yours. It gets more to the point and has more UI screenshots. Thanks for the comment and examples!


Cool, you're welcome, and good luck!


Thanks for the feedback! Demoing a log management product is a bit tricky, because it's hard to come up with a representative data set that isn't sensitive. FWIW, we have put together a demo based on data scraped from the Github API -- it's not a typical server log data set, but it can be fun to explore:

https://www.scalyr.com/login?prefillEmail=demo-account%40sca...


https://www.scalyr.com/dash?page=Github-Statistics is broken. "Operation not permitted (Read Configuration permission required)"


Oop -- thanks! We'll sort this out, might take a little while though.


Yeah, no, not what I want to hear from the people hosting my logfiles.

There's your answer to "Why won't you host your server logs (which are usually key for troubleshooting flaming boxen)?"


As I probably should have clarified, this issue is specific to a single page in the demo, which we do not even link to at the moment (aside from a couple of older posts on our blog). Yes, it is embarrassing, and I apologize. However, if this issue had been standing between a customer and their data, we would have scrambled instantly.

Everything in life is a tradeoff. If you entrust your logs to us, you run the risk that we have an outage or failure of some sort. On the other hand, internal systems can fail as well. We hope to serve people who prefer not to carry the responsibility of maintaining their own monitoring infrastructure, and/or are interested in the features and performance we provide.


Fair enough. :)

Out of curiosity, what is your target customer? People running their own dedis are probably alright with grudgingly setting up a monitoring solution, and people who are just using the ~=cloud=~ probably don't care.

What's your ideal customer?


Typically, someone who is already using some form of cloud-based infrastructure. From there, it's a fairly easy step to send your monitoring data to a specialized service. And cloud infrastructure can pose monitoring challenges (servers coming and going; multiple systems logging different types of data; unpredictable I/O performance causing problems for monitoring backends) that we help with.


If you entrust your logs to us, you run the risk...[of] failure of some sort.

Well, yeah.


Maybe I'm missing something but this just looks like log analysis (akin to Splunk) and not actually server monitoring? (active health checks, notifications, snmp, etc) Pricing seems wonky too...is it cheating the licensing model to aggregate logs on a single syslog server and submit from a single agent?


Log analysis is the heart of the product, but we also gather system metrics, provide notifications (scalyr.com/helpalerts), and we recently rolled out a basic active-checks feature (scalyr.com/helpMonitors). There's lots more to be done; for instance, we don't have any SNMP support today. But the vision is to be a full-spectrum tool, and we're actively working toward that.

As for the licensing model: we're going to move to per-GB pricing anyway, so no worries there. If you'd like something more concrete today, e-mail us at contact@scalyr.com and I'm sure we can sort out any pricing concerns.


Regarding the pricing, we're in the process of revising the pricing structure so that it's based more on volume instead of how many servers you have. In that case, you wouldn't need to cheat. :)


This is interesting for us as well... We currently have "only" 50 servers (mostly virtual) that need to be monitored, a per-server pricing would push the cost far too high regardless of the quality of the solution.


May I email you about this? The pricing structure will be changed soon to accommodate cases like yours--there are many of them--so if that's the only thing keeping you from trying Scalyr then I'd love to chat.


We've already started using a competitor's product (Logentries) and we're happy with it, so we're not looking to Scalyr or other log management solutions at the moment. Thanks for listening to feedback though!


The price is absolutely insane and a non-starter.


We've been hearing that loud and clear. :)

FWIW, the price actually works quite well for a lot of people. On a GB-for-GB basis, we're actually much cheaper than other hosted log management solutions -- we work hard on backend efficiency and we pass that along. But yes, if you're using small virtual servers then the pricing model breaks down. We originally went with this model to provide more predictability; log volume is often more volatile than server count. We've heard enough complaints that we've decided to move to a more volume-based pricing model; we're just working out the details.


Thanks for the reply... as an ops guy, there are a ton of layered problems with running a highly elastic infra on something like AWS:

1. Dynamic registration of ephemeral systems with a monitoring platform.

2. Security monitoring of same

3. Meaningful graphing

When we are optimizing the purchase of hundreds upon hundreds of spot instances daily, grabbing hosts for just a couple of cents an hour, per-host fees for things like StackDriver, CloudPassage and your service are completely a no-go.

I don't have a good idea how these should be priced, but I think it's important for people to understand all the other costs associated with having a solid management platform for your environment that covers all the bases and doesn't require another round of funding! :)



I have consulted on operational monitoring for many years, including with a customer that claims to have the largest deployment of Nagios anywhere (50K+ nodes). The author hits on many good points, right up until they suggest a solution. My advice to customers has long been that you can make any tool successful, but the tools are not what really matter. Too often I've seen customers invest $MM in tooling and fail to understand that the people and process around that tooling are the real challenge.

Too often both the entrenched enterprise vendors AND the startups in this space miss this too. When it comes to tooling, the problem that too many startups miss is that they repeat the patterns that the entrenched players formed decades ago, and fail to understand that the kind of monitoring that tools like Nagios and its clones offer is but one piece of a comprehensive solution for all but the smallest of operations.


Hah, you sure are a consultant: your last four sentences basically repeat themselves. ;)


I'm a big fan of monit (tried and true) + m/monit (web interface + more complex logging and analytics)

https://mmonit.com/


So use Shinken http://www.shinken-monitoring.org/, the Nagios rewrite in Python.


Shinken was specifically mentioned in the slides as just a "Nagios", and not solving the problem.

Not sure whether that's true or not, but they did address it...


Shinken is built to scale.


has anyone tried icinga or opennms, and can comment on that?

https://www.icinga.org/nagios/feature-comparison/


I use icinga extensively. It has a better UI but still suffers from the same downfalls the presentation presents. With that said, the solution proposed seems to be incomplete and a step back from what icinga/nagios provide.


opennms is a bit of a culture shock if you're used to nagios. It works kind of inversely, in that it wants to auto-discover the servers and services to monitor itself. Frankly, it feels very much designed around snmp imo (which I'm not saying is a problem, but it's different to how we use nagios).

It's also the opposite of nagios in that rather than lots of smaller moving parts, it is one big mega (java) process that does everything. Again, not necessarily a problem (though I happen to think so :), but different.

I also found opennms to be VERY complicated. I suppose nagios is though, first time around.

For some reason though, I really want to use opennms and keep going back to try it out, but eventually give up.


OpenNMS is great to run in addition to a traditional monitoring system.

Your traditional monitoring systems have hand-selected features to monitor and alert for. OpenNMS will just go out and discover everything you have (and graph everything without any intervention too).

You probably aren't monitoring all the statistics on every interface of your switches (what? people have switches?), but just throw OpenNMS at your networking management subnet and it'll pick up everything for later review.

You can use OpenNMS for alerting and inventory tracking, but I prefer more extensible tools for those. Just use OpenNMS as a largely hands-off sanity check of your existing monitoring and graphing systems.


Sensu is alright but it also has a few downsides:

I'm not a huge fan of having yet another debian package with its own version of ruby packaged. It does make the plugins easier to write though.

Checks need to be installed on the client (like nagios). It means that some coordination is necessary when you want to add a new check on the server side. This is largely resolved by using a configuration management system, but it doesn't seem clean to me. The sensu-community repo has a lot of checks, which is great to get started; some of them need some ruby gem dependencies to work though.

I had issues with malformed json config or rabbitmq disconnections which would crash the server. Because the debian package uses the old sysvinit, it wasn't restarting. Moved the init scripts to upstart and added json validation when generating the config, and now it's fine.


Nagios is pretty much a joke compared to most enterprise production monitoring tools, e.g. Wiley and Foglight. I always find it funny to read what people consider "monitoring": they're talking about a few disparate metrics, and then complain that the alerting/paging sucks.

If the tool can't trace a transaction end-to-end - i.e. if a user visits your page or uses your application, you need the ability to trace it from HTTP to EJB, across any webservices, queues and ESBs, right down to which queries were used in the database - then you're using a shit monitoring suite.

Knowing infrastructure metrics is useless without knowing if it's actually affecting end-users and in which use-cases.


I hear a lot of people crapping all over Nagios, but none of the alternatives are any better.

I recently had an opportunity to do a clean-sheet build out for monitoring, so I evaluated Zabbix, Munin, combos of statsd/Graphite, etc., and none of them were better.

That said, I have a stock Nagios base config that I can install and have monitoring in five minutes. The key to Nagios configs is to define hostgroups in one file, and then create config files for each host, assigning it to a group. Then you put the service definitions in a service file. Easy peasy.
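(A condensed sketch of that layout - all names are examples:)

    # hostgroups.cfg
    define hostgroup {
        hostgroup_name  web-servers
    }

    # hosts/web01.cfg -- one small file per host, assigned to a group
    define host {
        use         generic-host
        host_name   web01
        address     10.0.0.11
        hostgroups  web-servers
    }

    # services.cfg -- services attach to the hostgroup, so new hosts inherit them
    define service {
        use                  generic-service
        hostgroup_name       web-servers
        service_description  HTTP
        check_command        check_http
    }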


One thing that is a very strong constraint is the need for a "host" in every Nagios service definition. This doesn't map so well onto environments like Amazon EC2 auto-scaling groups: you don't necessarily know the host names in advance. You wind up building Nagios plugins that can monitor a pool of hosts (using CloudWatch or whatever) and give you some kind of aggregate status. It sounds like a kluge to push it into the plugin, but it does allow you to use the Nagios alerting, which is pretty well understood.
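(A sketch of that aggregate-status idea: a plugin that asks CloudWatch for a pool-wide metric and maps it onto the standard Nagios exit codes - 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN. The group name, metric and thresholds are made up, and the CloudWatch call here uses boto3:)

    #!/usr/bin/env python
    """check_asg_cpu: aggregate CPU check for an EC2 auto-scaling group (illustrative)."""
    import sys
    from datetime import datetime, timedelta

    import boto3

    ASG, WARN, CRIT = "web-asg", 70.0, 90.0  # hypothetical group and thresholds

    try:
        cw = boto3.client("cloudwatch")
        stats = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG}],
            StartTime=datetime.utcnow() - timedelta(minutes=10),
            EndTime=datetime.utcnow(),
            Period=300,
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        if not points:
            print("UNKNOWN - no datapoints for %s" % ASG)
            sys.exit(3)
        avg = max(p["Average"] for p in points)
        if avg >= CRIT:
            print("CRITICAL - %s CPU %.1f%%" % (ASG, avg))
            sys.exit(2)
        if avg >= WARN:
            print("WARNING - %s CPU %.1f%%" % (ASG, avg))
            sys.exit(1)
        print("OK - %s CPU %.1f%%" % (ASG, avg))
        sys.exit(0)
    except Exception as exc:
        print("UNKNOWN - %s" % exc)
        sys.exit(3)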



came here to rebut, read your response, went away satisfied with it.


Presentations like this always reminds me why "DevOps" is very different from system administration.


I choose nagios because there is currently nothing else on the market that is actually better.

I think nagios is a piece of shit, but it is a working piece of shit.


Do we use nagios to monitor the other 6 utilities? Or what about when the alerting gateway goes down?


while we're at it, let's let graphite die too. in a fire.


Why? And replace it with what?



Datadog.




