I'm one of the founders of Cloudkick, and our goal is to eliminate the hassle of setting up and maintaining your own monitoring install.
Nagios is difficult to set up and get right. We abstract away all the details and let you simply click to add more monitoring. It's simple, and we worry about the scaling.
Now I'm interested. If you could give me really good graphs of different aspects of performance across an entire Hadoop compute cluster at once... perhaps by tying into Hadoop... I would love you.
We have a bunch of interesting distributed computing problems we're solving in the pipeline, so stay tuned. Nothing out of the box yet...
Currently, we have a pretty unique graphing framework. Basically, you can plot any bit of data and send it back to the server, but the limitation is that you have to launch the server through us. The feature will soon be available to anyone with a cloud server.
If I have to launch the instance with you, that pretty much makes using it for compute clusters impossible. One launches those from a script at the shell.
It's definitely the first iteration of the service, with many more to come. This is one of the most basic needs for any production system: the diagnostics side (alerts) and the preventative maintenance side (load graphs). So stay tuned.
I like Nagios, but it's not designed to dynamically add and subtract nodes. It doesn't auto-spawn nodes when you're running over X% CPU on Y% nodes.
If this can do that, it's a winner.
-CPD
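To make the rule above concrete, here is a minimal sketch of the "over X% CPU on Y% of nodes" check. Nothing here is a Nagios or Cloudkick API: the function name, the dict of per-node CPU readings, and the choice to treat Y as "at least Y%" are all assumptions for illustration.

```python
def should_spawn(cpu_by_node, cpu_threshold, node_fraction):
    """Hypothetical check: True when at least `node_fraction` of nodes
    are running above `cpu_threshold` percent CPU.

    cpu_by_node   -- dict of node name -> CPU utilization in percent (assumed input)
    cpu_threshold -- the "X%" in the rule, e.g. 90
    node_fraction -- the "Y%" in the rule as a fraction, e.g. 0.5
    """
    if not cpu_by_node:
        # No readings means no basis for scaling up.
        return False
    # Count nodes strictly above the CPU threshold.
    hot = sum(1 for cpu in cpu_by_node.values() if cpu > cpu_threshold)
    return hot / len(cpu_by_node) >= node_fraction


if __name__ == "__main__":
    readings = {"node-a": 95, "node-b": 92, "node-c": 40}
    print(should_spawn(readings, cpu_threshold=90, node_fraction=0.5))
```

A real setup would also need some smoothing over time (so one spike doesn't launch a node) and a matching rule for removing nodes, neither of which is shown here.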
Scalr will do that, but not very reliably so far. As far as I know, RightScale is the only reliable service that will let you do this, albeit with some work on your part with their images and scripts. But they are out of the price range of most bootstrapped startups at $500 a month plus a bigger up-front fee.