
<rant> The "standard" ways are all very outdated, ugly, unscalable, and brain dead in implementation. nagios, cacti, munin, ganglia, ... -- all crap. </rant>

People end up writing their own [1], but they rarely open source their custom monitoring infrastructure. When a private monitoring system does get open sourced, you find it has complex dependencies, and that complexity blocks wide-scale deployment. So people stick with 15-year-old, simple, dumb solutions.

I'm working on a new distributed monitoring/alerting/trending/stats framework/service, but it's slow going. One weekend of free time per month doesn't exactly put you in the mindset for hardcore systems hacking flow [2].

[1]: http://www.slideshare.net/kevinweil/rainbird-realtime-analyt...

[2]: Will develop next-gen monitoring for food.




I'm getting the feeling that with all the unique server setups in use, monitoring and metrics systems are going to be just as unique and specific.

There are some interesting process monitoring projects out there like god, monit, and bluepill, as well as EC2/cloud-specific stuff from Ylastic, RightScale, and Librato Silverline. Have you ever used any of those tools?

Fitting all these together for my setup is trial and error, but it really does force me to think hard about my tools and assumptions even before I get hard data.


I hack on the aforementioned Silverline at http://librato.com, and we provide system-level monitoring at the process/application granularity as a service. (We also have a bunch of features around active workload management controls, but that's out of scope here.) It works on any server running one of the supported versions of Linux, not just EC2. The benefits of going with a service-based offering are the same as in any other vertical: you don't need to install and manage your own software/hardware for monitoring.

Here's an example of the visualizations we provide into what's going on in your server instances: http://support.silverline.librato.com/kb/monitoring-tags/app...


Sounds like Zabbix, Pandora FMS, Osmius, NetXMS, and AccelOps are the ones that match your requirements.

Within each, if you search for templates or cookbooks or config scripts, you'll find ways of configuring it easily enough.

https://secure.wikimedia.org/wikipedia/en/wiki/Comparison_of...


Almost.

They all suffer from inflexible data models (how many in that matrix are built on SQL and rrdtool?), death at scale (what happens when you go from 10 to 500 to 3,000 to 10,000 servers? across three data centers? with transient Xen instances?), lack of UI design, and fragmented community involvement (hence that massive comparison grid).
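To make "flexible data model" concrete, here's a minimal sketch (Python, names made up, not taken from any existing tool) of the kind of free-form, tag-based metric event that a fixed rrdtool/SQL schema makes painful:

    import json
    import time

    def metric(name, value, **tags):
        """Build a self-describing metric event; tags carry host, role, dc, whatever."""
        return {
            "name": name,              # e.g. "cpu.user"
            "value": value,
            "timestamp": time.time(),
            "tags": tags,              # e.g. host="web042", role="frontend", dc="us-east"
        }

    # Any number of hosts, roles, or data centers can emit these without a
    # schema migration; aggregation is just a group-by on tags downstream.
    print(json.dumps(metric("cpu.user", 42.5, host="web042", role="frontend", dc="us-east")))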

That's not even considering broken models for alerting (a server dies at 3am -- should it page you? No, because you have 200 of the same servers in the same role; the load balancer will compensate), historical logging, trending, and event aggregation/dedup.
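A rough sketch of what role-aware alerting could look like (the threshold and the is_up() check are placeholders, not anything from an existing tool):

    def should_page(servers, is_up, min_healthy_fraction=0.8):
        """servers: hostnames in one role; is_up: hostname -> bool."""
        healthy = sum(1 for s in servers if is_up(s))
        return healthy / float(len(servers)) < min_healthy_fraction

    frontends = ["web%03d" % i for i in range(200)]
    up = lambda host: host != "web042"        # one box dies at 3am

    if should_page(frontends, up):
        print("page the on-call")
    else:
        print("log it, dedup it, deal with it in the morning")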

It's a big problem, but making flexible tools from the ground up with sensible defaults can go a long way towards helping everyone.

We can fix this. We can make the redis of monitoring.


I have to laugh at you pointing to 'redis': Redis can't scale at this time. Clustering is planned for sometime mid-year, but it'll be a while before it gains more features. Maybe you meant MongoDB?

Alerting looks quite flexible from what I've read, to the point of being fully customizable. I agree that a server dying at 3am is less urgent, but it should still be a valid alert that makes an API call to the host to start a new server (not sure if that's possible; alerts seem to be shell based).
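Something like this could be what a shell-based alert action invokes; the provisioning endpoint below is purely hypothetical, you'd swap in your cloud's real API (e.g. EC2 RunInstances):

    #!/usr/bin/env python
    # Called by the alert action with the dead host and its role as arguments.
    import json
    import sys
    import urllib.request

    def request_replacement(dead_host, role):
        payload = json.dumps({"replace": dead_host, "role": role}).encode()
        req = urllib.request.Request(
            "https://provisioner.example.com/api/replace",   # hypothetical endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200

    if __name__ == "__main__":
        dead_host, role = sys.argv[1], sys.argv[2]
        sys.exit(0 if request_replacement(dead_host, role) else 1)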

Here's what your offering needs to top, going by what I've been considering lately: http://www.zabbix.com/features.php

I'd love more competition, but as even you point out, community involvement won't be as strong when there's this much competition, soon including your offering.

Disclaimer: I started researching server monitoring a few weeks ago and have been considering Zabbix since last week.

Edit: The one issue I find is the lack of web transaction monitoring like New Relic has: http://newrelic.com/features/performance-analytics

You can see it in action with average response time: http://blog.tstmedia.com/news_article/show/86942

As far as I know, no open network monitoring service offers it.
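For what it's worth, the core of "web transactions" is just per-request timing; a minimal sketch (plain WSGI, nothing New Relic-specific, stats would normally get flushed to your monitoring system) looks like:

    import time
    from collections import defaultdict

    stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0})

    def timing_middleware(app):
        """Wrap any WSGI app and record per-path response times."""
        def wrapped(environ, start_response):
            start = time.time()
            try:
                return app(environ, start_response)
            finally:
                s = stats[environ.get("PATH_INFO", "?")]
                s["count"] += 1
                s["total_ms"] += (time.time() - start) * 1000
        return wrapped

    def average_ms(path):
        s = stats[path]
        return s["total_ms"] / s["count"] if s["count"] else 0.0

The hard part the commercial products add is tracing into database and external calls per transaction, but the average response time charts in that blog post boil down to data like this.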


That's the kind of thinking Josh calls out in the article. Redis is great in most of the use cases for replicated Mongo if you're gearing the rest of your architecture to use it properly.



