Grafana 4.0 with alerting is released (grafana.org)
299 points by yobo on Nov 29, 2016 | 86 comments



This release has been long in the making. We started on Alerting way back in March this year and it's finally released! Read more about all the highlights in the release here: http://grafana.org/blog/2016/11/09/grafana-4.0-beta-release/

Oh, and if you're in New York tomorrow, sign up for GrafanaCon: http://grafanacon.org


Great work! Including a way to set grace periods will be really useful for preventing flapping alerts on noisy metrics, e.g. 'Alert when CPU > 95% for 10m'.


One way to solve that problem is to reduce the series with min(), e.g. http://play.grafana.org/dashboard/db/alerting-flappy?panelId...

This means that the lowest value over the last 5min of the series has to be above 80% before the alert triggers.
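
In Grafana 4's rule builder, that kind of condition reads roughly like this (the query letter, window, and threshold here are just whatever your panel uses):

    WHEN min() OF query(A, 5m, now) IS ABOVE 80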


Congrats on the release! We've been using Grafana for a few years now, and personally, built-in alerting is the only thing I would have added. Some may argue that this "violates" good separation-of-concerns practices, but honestly you are going to be alerting on the same data you feed to Grafana, so at the end of the day it makes a lot of sense. Call it a "two in one" if you will. Either way, this will make monitoring with Grafana much more streamlined.


Note that keeping them separate has a benefit that when your 'Visualization' portal is down, your 'Alerting' systems are unaffected (and vice versa).

Collectd, Telegraf, etc. can be configured to send the same metrics to your favorite TSDB and alerting system (like Riemann) in parallel.


Agreed with the sentiment re: streamlined experience.

We struggled a bit with whether or not it really "belonged" in Grafana, but we believe in alerting while "in the flow".

It makes a lot of sense (from an experience standpoint) to 'manage' alerting while you're 'managing' your dashboards, visualizations, and queries; you already have a sense of the data _right there_.


As someone who is not too familiar with Grafana but is tasked with deploying it shortly, I'm curious about the alerting and what I can do with it. I will read through the release page, but it's fun to get it from the horse's mouth, if you're still around to comment?


Influx + Telegraf + Grafana is such a simple, sweet stack. No work to maintain, trivial to set up, I can ship just about anything I want into it, and reporting is fast.

With alerting in place now, I'm happier than ever. A huge thank you to the Grafana team for solving a major pain point!


What kind of volume are you sending into Influx? It crashed on me probably 5 times a day with only 100 requests per second.


Right now it looks like it's around 50/sec. A lot of data points get rolled up by Telegraf on individual machines, and then it's shipped in via the UDP line protocol. I've written much larger volumes, though, and never had an issue with stability.
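
For anyone who hasn't tried it, shipping a point over the UDP line protocol is only a few lines. A minimal sketch in Go (the address and measurement are placeholders; the listener and its port come from whatever you enable in influxdb.conf's [[udp]] section):

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Address of InfluxDB's UDP listener (enabled in influxdb.conf).
        conn, err := net.Dial("udp", "influx.internal:8089")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // One point in line protocol: measurement,tag=... field=... timestamp(ns)
        line := fmt.Sprintf("cpu,host=web01 usage=42.5 %d\n", time.Now().UnixNano())
        if _, err := conn.Write([]byte(line)); err != nil {
            fmt.Println("write failed:", err) // UDP is fire-and-forget; errors here are local only
        }
    }

Since there are no acks, it's also worth raising the kernel receive buffer (net.core.rmem_max) on the Influx host if you push real volume.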


If I may ask, how is UDP doing for you?

I checked my graphite setup once. We had 27% of metrics lost over UDP. That was bad.

pro-tip: "netstat -anus" and look at the error counters.


About 4% err-to-received ratio. That's probably due to untuned UDP buffer sizes though; despite the dropped packets, we're getting enough data to provide the information we need.


Was this an older build? We had serious issues at first, but our setup is pretty stable these days.


Last I tried was 1.0.

I love everything about using Influx, but it would die and never restart, and every time it would crash somewhere in semacquire. I'll have to try it again since I need to check out this Grafana update anyway.


There's some setup involved if you're sending a decent amount of traffic to it.

The two game changers are using the UDP line protocol instead of HTTP, and making sure you are batch-processing inputs. Fixing these settings is the difference between an instance that crashes all the time and a purring one.
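
To make the batching point concrete, here's a rough sketch of the pattern in Go: buffer points and flush either when the batch fills up or on a timer, instead of doing one write per point (the batch size, address, and flush interval are all illustrative):

    package main

    import (
        "net"
        "strings"
        "sync"
        "time"
    )

    type batcher struct {
        mu    sync.Mutex
        conn  net.Conn
        buf   []string
        limit int
    }

    // add queues one line-protocol point and flushes when the batch is full.
    func (b *batcher) add(line string) {
        b.mu.Lock()
        defer b.mu.Unlock()
        b.buf = append(b.buf, line)
        if len(b.buf) >= b.limit {
            b.flushLocked()
        }
    }

    func (b *batcher) flush() {
        b.mu.Lock()
        defer b.mu.Unlock()
        b.flushLocked()
    }

    // flushLocked sends all buffered points in a single datagram.
    // Keep limit small enough that a full batch fits in one datagram.
    func (b *batcher) flushLocked() {
        if len(b.buf) == 0 {
            return
        }
        b.conn.Write([]byte(strings.Join(b.buf, "\n") + "\n"))
        b.buf = b.buf[:0]
    }

    func main() {
        conn, err := net.Dial("udp", "influx.internal:8089")
        if err != nil {
            panic(err)
        }
        b := &batcher{conn: conn, limit: 100}

        // Flush at least once a second even when traffic is light.
        go func() {
            for range time.Tick(time.Second) {
                b.flush()
            }
        }()

        b.add("cpu,host=web01 usage=42.5") // producers call add(...)
        time.Sleep(2 * time.Second)        // give the ticker a chance to flush
    }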


Sending data in batches gives serious performance improvements. Don't send metrics directly to Influx from your app; send them to an intermediary like StatsD, which will aggregate them and forward them in batches.

Shameless plug: I recently published a log router in Golang. It sends data to Influx too! (github.com/agnivade/funnel)


Thank you. I'll check this out.


I use Riemann in front of Influx, which collects data and forwards it once a second. Works nicely, especially given that I aggregate some of the higher-volume metrics before sending them to Influx.


What transport are you using to secure telegraf into influxdb?

(Haven't tried Telegraf yet, setting up a Prometheus instance at the moment.)


Not sure what you mean by "secure telegraf into influxdb", but we've had great success with this stack for monitoring by just embedding an HTTP server into each application that needs to be monitored. We keep the HTTP server separate from any others used by the application (i.e. it runs on a separate thread) so performance isn't impacted.


My use case is one where I have servers in different datacenters and would want to have a simple, but secure, way to fetch metrics for graphing and alerts.

So, I meant encryption in transport, authentication, etc. as many solutions work well if you're monitoring "in the clear" from the backend, but not so much over the internet.


We're deployed on AWS in multiple regions with VPNs set up between VPCs. No particular attention paid to securing the transport between Telegraf and Influx at the moment since a) it's either in an internal VPC or secured via ipsec, and b) our monitoring data is low-value enough that it doesn't warrant its own secure transport.


IIRC, Influx supports HTTPS too, so you just have to set up some certs and switch to https in the client.
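
In Go, for instance, that's just a matter of giving the client your CA; a minimal sketch, assuming Influx's /write endpoint, with placeholder paths and hostnames:

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "net/http"
        "os"
        "strings"
    )

    func main() {
        // Trust the internal CA that signed the InfluxDB server cert.
        ca, err := os.ReadFile("/etc/ssl/internal-ca.pem")
        if err != nil {
            panic(err)
        }
        pool := x509.NewCertPool()
        pool.AppendCertsFromPEM(ca)

        client := &http.Client{
            Transport: &http.Transport{
                TLSClientConfig: &tls.Config{RootCAs: pool},
            },
        }

        // Same line-protocol write as before, just over https.
        body := strings.NewReader("cpu,host=web01 usage=42.5")
        resp, err := client.Post("https://influx.internal:8086/write?db=metrics", "text/plain", body)
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
    }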


Quick note for those who are tired of the giant clusterfuck of open-source tools for monitoring + alerting + storage + more, which includes no less than:

- statsd

- collectd

- graphite

- whisper

- carbon

- prometheus

- grafana

- seyren

- riemann

- nagios

- icinga

- zabbix

There are multiple modern SaaS products that will do all of that in a single tool, with better integrations, more polish, less work, and no maintenance.

1) See https://www.datadoghq.com and the latest news: https://techcrunch.com/2016/01/12/investors-feed-datadog-a-h...

2) https://signalfx.com/ and the latest news: https://techcrunch.com/2015/03/12/signalfx-emerges-from-stea...

3) http://www.bmcsoftware.uk/it-solutions/truesight.html if you're not anti-enterprisey (that was the "Boundary" startup, bought by BMC a few years ago and integrated into their offerings).

And don't think that they are "new" fancy tools. They've been around for many years.


Agreed that the SaaS offerings are a lot more turnkey; it's the integrations and polish that make all the difference.

What you call a 'clusterfuck' is really a wider ecosystem. It would be pretty crazy for a single organization to use all or even most of the tools that you list.

Right now, people accept high degrees of cost (especially for at-scale users) and lock-in in exchange for the convenience of SaaS. Or they go open source (which, to your point, certainly is an investment in time).

Watch out for what team Grafana will be doing in 2017. Our plan is to provide a fully turnkey, hosted offering based around Grafana (and a handful of other open source tools). OpenSaaS.

We hope that for many users, this can be a third choice, and in some ways the best of both worlds.


No offense, but Grafana is only as good as the weakest piece in the monitoring chain.

Having nice graphs is nice... until they fall apart because the source is unavailable.

And that doesn't help with alerts either. (I tested the alerts in the v4 beta; they're just not comparable to the better alerting tools out there.)


No offense taken ;) You're spot on about needing a solid and scalable backend; it’s more than 'nice graphs'. We think Grafana is a great piece in the chain to start with. We're trying to put as much momentum behind it as our burgeoning company will support.

The alerting in v4.0 is just the beginning. Torkel and the team have tried to optimize for the “relatively simple" 80% of alert use cases.

We are fans of other, more sophisticated open-source alerting tools like Bosun, and you can be sure that we'll keep improving our alerting capabilities throughout 4.x.


What are you missing compared to other alerting tools?


As long as enterprises understand that they can get support options for Grafana (on-prem, SaaS, etc.), it just comes down to choosing the most economical option. I see a benefit in symmetry for enterprises that are hybrid or still mostly in their own datacenter.


For installations of a few hundred instances or more, some of the SaaS offerings cost more than the engineering salaries it would take to maintain the OSS tools.


Shame that many of the OSS tools do not have any sort of corporate sponsorship, or if they do, that it doesn't cover all the work that goes into releasing OSS in this space.

Note: I am one of the maintainers of Diamond, a metrics collection tool written in python. https://github.com/python-diamond/Diamond



Unfortunately, the post doesn't share things like: how much infra is needed and how much does it cost, how much time it took to set up, how much maintenance it needs, how long upgrades of the setup take, how much time future hacking of missing features will take, and so on. After that sort of stuff is truthfully taken into account I suspect most if not all savings would be lost.


Having been on the maintainer side of these OSS tools, I can tell you your statement is untrue.

The OSS tools cost a fortune in human time to maintain, and another fortune in hardware to run.


Datadog will cost you $165,600 a year for 600 hosts. That is roughly the cost of a very well-paid engineer. So no, the statement is not untrue.

(I picked 600 because that was the approximate number of machines we had at my last job, where we used Graphite maintained by one guy, part time).

You included a LOT of redundancy in your OSS list. Multiple time-series databases. Multiple collection daemons. Multiple dashboards. Multiple alerting systems (who in their right mind would use Nagios AND Icinga?). You're effectively arguing about maintaining multiple monitoring stacks, some of which are quite aged.


Yikes. I'm sure there's discount pricing available but some of us have tens of thousands of hosts to monitor. The pricing you quoted doesn't scale. For me it might be cheaper to collect with OSS and graph with SaaS.


Indeed, I gave a list of all the tools; you only need a stack of about 4 to 8 of them to get the job done.

Let's say statsd + collectd (metrics collection) + graphite (aggregation) + carbon/whisper (graphite storage) + icinga (alerting) + grafana (graphing). That doesn't exactly come easy.

No offense, but a single Graphite is not a monitoring solution. It's just the tip of the iceberg. Monitoring does take a lot of engineering work and a lot of maintenance. You won't get away with monitoring 600 hosts on the cheap; just think about how much the hosts themselves cost.


Let's talk about how much Amazon will charge you for 600 instances a year...


For microservices based architectures, things like OpenTracing can go a long way in de-cluttering the clusterfuck. Of course, it requires developers being up to speed on distributed tracing, which isn't the case across the board. http://opentracing.io


s/clusterfuck/ecosystem/

A typical system doesn't use all of the tools above. You use what fits you and many of the tools play pretty well together. I've had luck with Icinga2 and Grafana lately, for example, which integrated quite smoothly out of the box.


And you have to include in the price the problems of a paid app:

- customization will be very expensive, if not impossible

- you must have people for the procurement process (x10 more costly if you are in a gov agency),

- weird failures due to not finding the license,

- your cheap personnel that install software won't be able to do it,

- you'll have problems creating testing environments because you don't have licenses

- you won't be able to do some things immediately because there aren't enough licenses.

And these are just the problems that come to mind right now. All of them are real problems that I've encountered with commercial software.


> - customization will be very expensive, if not impossible

You've got a full API and integrations with a hundred different tools and services out of the box.

Seriously, my coworker was skeptical at first too (so was I). Then we configured the full integrations with AWS/the-agent/statsd/postgres/mysql/cassandra/elasticsearch/riak/nginx/haproxy/redis/memcache/pagerduty/slack and some more.

My co-worker concluded in front of my CEO, "it was 2 orders of magnitude faster [than anything else we've ever tried for monitoring]". And that's not even talking about the additional features and customization we couldn't even dream of.

> - you must have people for the procurement process (x10 more costly if you are in a gov agency),

True. That's the only major problem I can see: People who can't buy the software they need. That's a social problem, not a software problem.

> - weird failures due to not finding the license

It's only one API key to put in the agent config file.

> - your cheap personnel that install software won't be able to do it

I don't know who you're talking about. Monitoring has our best people working on it. At other places I've seen, it's done by devops consultants raking in £600 a day.

There are no cheap personnel involved. (Maybe you're thinking of cheap interns who add alerts? That's an anti-pattern.)

> - you'll have problems creating testing environments because you don't have licenses

Same license. Put a tag environment=<environment> in the config and you're done; all metrics, all servers, and all alerts will be tagged.

> - you won't be able to do some things immediately because there aren't enough licenses.

Not applicable. It's not a limited license by seats.

You pay the bill at the end of the month depending on the number of hosts in your package. There is an hourly price for ephemeral hosts and overrun.


I think the reason why you're getting negative reactions is that you're talking very broadly as if your personal experience is representative for everyone in the field. Rather than asserting that the real problems which neves mentioned don't exist, try describing how the specific products you've used were designed to avoid them.


Also https://www.hostedgraphite.com - we host Graphite and StatsD with Grafana dashboards, as well as alerting and integrations with several other dev tools. It's a self-funded business that has been running for 5 years, profitable, with 14+ staff.


You guys are awesome. Can't wait to see Grafana 4.0 on hosted graphite.


As someone who's configured and worked with almost all of the tools in this list, I can only disagree with you. The old saying "you get what you pay for" is somewhat relevant, but the integration of the newer OSS monitoring tools is becoming increasingly awesome. Take the Graphite/Prometheus/Elastic Stack integration with Grafana, for instance.

I think having one pane of glass to do all passive monitoring tasks is an incredible step forward.

I have yet to see if Grafana's active monitoring (alerting) is any good, but it does look very promising.


Is there any hosted monitoring solution that integrates with service discovery, so that it's actually useful for serious alerting in today's dynamic environments? Otherwise you can't even tell whether things that should be there are reporting in or missing.




When you team Grafana up with a general-purpose database like Crate.io, some pretty amazing things can happen. Not only can Crate "roll with the punches" of auto-sharding whilst dynamically scaling performance over N database nodes, it also possesses powerful aggregation capabilities. If that weren't enough, Crate also dynamically gzips data by default, which is impressive given its zippy performance.

You get all of this for free with Crate.io without giving up the flexibility of a general purpose SQL database...

Wanna start storing log data in Crate as well? No problem! Just design your table schema and API ingest layer (my favorite is NodeJS, but you can use any language you like).

Or, if security isn't an issue (i.e. you're on a subnet safe from the public internet), you can just use the built-in REST API which Crate exposes.

With Crate, I've been able to store hundreds of GB of systems log data without worrying about silly things like table-bloat (the autosharding of partitioned tables handles the spectre of bloated table shards for me for free).

Thanks to the amazing developers over at Crate.io for taking the best of Elasticsearch and making it sane, fast, and chock-full of SQL goodness!

Also a big thank you to the Grafana team for recognizing the potential synergies that Crate.io & Grafana could catalyse for unifying time-series & log data streams.


Grafana really looks interesting, and it's neat that you can add all the different backends to it; for example, I didn't know you could use Elasticsearch as a time-series backend.

Is it correct that Grafana works best with Graphite? At least that seems to be my impression, and it is a bit sad, since I think Graphite is cool, but it really has a lot of moving parts.


I wouldn't let that stop you from evaluating it. At my last few jobs I've used Grafana with Elasticsearch and Prometheus without a problem. I've never actually used it with Graphite. The only downside I've seen is that the queries are unique to the data source, so when looking at many of the examples online you have to figure out the equivalent for yours. There might be a performance difference, but I wouldn't know, since I haven't used Graphite.


We use it with Tgres [1] (which only pretends to be Graphite, but actually is Golang + PostgreSQL) - works great.

[1] https://github.com/tgres/tgres


Using it with InfluxDB, Prometheus and Zabbix. Works like a charm. One of my favorite tools!


Graphite was the original, but as others have mentioned, Influx, Prom, Cloudwatch & Elasticsearch are all first-class data sources.


We use it with influx, works great, though I find influx's syntax to be funky.

We also use it against MySQL. Annoyingly, there is no driver for that, so we had to build a nasty layer that translates from Influx to MySQL.


Today, after fighting with Kibana all morning, I decided to tweak our scripts to build Grafana dashboards against our existing ElasticSearch data store. Very fast to get up and running, and the features and speed are amazing.

Very glad I saw this announcement on HN.


I've used it with Influxdb, Elasticsearch, and Prometheus.

They all worked great. I can't think of any reason to use Graphite.


Indeed. We reduced our write iops by 95% by moving from Graphite/Carbon to Influx. Try one of the newer databases!


I'm currently using Prometheus, Grafana, and Alertmanager. I'm a big fan of the Linux terminal, versioned config files, and separation of concerns, but the rest of my team prefers web interfaces, so I'm basically the only one maintaining Alertmanager. Grafana Alerting looks appealing.

What have other people had success with?


Zabbix and Icinga2 were the most appealing alternatives that didn't require versioned config files for alerts last time I checked.

I think Grafana will fill the basic GUI alerting needs, though. When you need more than a simple flat threshold, you usually want to get out of the GUI and ask the ops team for help anyway.


I've had success with killing all the s* free open source tools (Grafana, graphite, prometheus, whisper, icinga, nagios, carbon, ganglia, influxdb, zabbix...)

And using a single paid tool that does the job better AND doesn't kill me in maintenance work.

See https://www.datadoghq.com/ as leader or https://signalfx.com/ as the second comer, or http://www.bmcsoftware.uk/it-solutions/truesight.html if you're enterprisey.



I don't see how anyone can afford SaaS metrics/alert services at any sort of real scale.

$15/month/host gets expensive fast. Datadog doesn't start providing discounts till you are at 1000+ hosts.


All vendors provide discounts if you negotiate. ;)

$15 * 500 hosts = $7500 per month.

If you think it's expensive, I can only advise you to check how much the hardware will cost on EC2 to run the free tools, plus how much work it will take to get the 8 different and independent OSS tools to work not only on their own but also together, plus how much additional work and maintenance it takes to keep everything running without hiccups (war story: there is nothing worse than a monitoring tool that is less reliable than the thing it monitors).


Oh, I agree. That's why we ditched EC2 for our own bare-metal cloud based on Joyent and saved over $200k/year.


> $15 * 500 hosts = $7500 per month.

Oh, that's ridiculous pricing. Any team running a server installation at any sort of scale would scoff at it, and that's even before you take into account the implications of sending so much of your metrics data to a third party, to be held hostage there if you decide to leave the service.


At my current gig the network is air-gapped. At previous gigs, even though we had internet access, security was very tight. They would also rather pay 40+ hours of dev salary installing and supporting open-source or home-grown tools than have an ongoing subscription for a few bucks a month.


True story: Our monitoring stack now has three distinct components with alerting functionality.


We'll probably be in the same position. Grafana will make simple thresholds easy to visualize, Kapacitor can do more advanced anomaly detection, and we still need something like Sensu to do alerts that aren't really bound to metrics - and it provides a dashboard of alerts. Kinda annoying, but it works, I guess.


That's so that you can create a DevOps version of the final scene from Reservoir Dogs


Anybody got tips on how to start with implementing an alert system? Or what to read to get started?


I can't be the only one who laughed out loud while reading the ad for GrafanaCon. It contains the word "democratization" and takes place on an aircraft carrier...


Is anyone using a log management tool in conjunction with Grafana? I.e. if you see something anomalous or see an alert triggered, how do you investigate what's going on?


We've used Grafana with Sematext Logsene (which exposes Elasticsearch API, so it's like having Grafana talk to ES).

Here's a short howto + video: https://sematext.com/blog/2015/12/14/using-grafana-with-elas...


You can use ElasticSearch as an annotation provider over the top of your time series metrics. We publish events from our continuous deployment pipeline into ES and then surface those in a generic application dashboard. There hasn't been a deployment that we didn't already know about, but in theory when more users are going through CD it will provide more of a heads up.
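
Publishing such an event is just indexing a timestamped document that Grafana's annotation query can match. A rough sketch in Go (the index, type, and field names are made up; they just need to line up with what you configure in Grafana's annotation settings):

    package main

    import (
        "bytes"
        "encoding/json"
        "net/http"
        "time"
    )

    // event is the shape of a deployment annotation document.
    type event struct {
        Timestamp time.Time `json:"@timestamp"`
        Tags      string    `json:"tags"`
        Text      string    `json:"text"`
    }

    func main() {
        doc, err := json.Marshal(event{
            Timestamp: time.Now(),
            Tags:      "deployment",
            Text:      "myapp v1.2.3 deployed to production",
        })
        if err != nil {
            panic(err)
        }

        // Index the document; Grafana's annotation query then surfaces it
        // on any dashboard whose time range covers the timestamp.
        resp, err := http.Post("http://es.internal:9200/events/event",
            "application/json", bytes.NewReader(doc))
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
    }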


You can use Graylog for log management; that's the free open-source solution (Graylog + Elasticsearch + MongoDB).

You can use Splunk if you have money. That's the de facto standard. Beware that it has one of the most expensive software licenses on the planet :D


It'd be nice if this meant being able to use Grafana as a frontend to alertmanager.

(Writing those "ALERT ..." rules involves a steep learning curve.)
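
For context, a rule in Prometheus's current (1.x) alert syntax looks something like this (names and threshold illustrative), all hand-written in a rules file rather than built in a UI:

    ALERT InstanceDown
      IF up == 0
      FOR 5m
      LABELS { severity = "page" }
      ANNOTATIONS {
        summary = "Instance {{ $labels.instance }} is down"
      }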


This is exactly what I was thinking! I'd love to know what the devs of Prometheus think about alerting in Grafana...


https://twitter.com/fabxc/status/803870900097523712

> I repeat: Your alerts and dashboards belong into your SCM, not a random SQL database!

(And I 100% agree, particularly for alerts)


Hmm, didn't know this was written in Go. Seems like Go is doing quite well in this space too, with Bosun and Scollector.


Go code compiles to compact, statically linked binaries with relatively modest memory usage, reasonably good concurrency support, and an excellent standard library for networking - it's a natural fit for monitoring stacks. Even some monitoring systems that aren't Go on the back-end have Go-based collectors.


Do users still have full access to data sources, regardless of which dashboards they have access to? This is what keeps me from using Grafana to expose some data to clients.


Congrats guys!


I've had alerting via grafana built and deployed for the last 16 months. Not sure what took so long... but cool to see it native now. Keep up the good work.


Switch to Datadog and don't look back. Most valuable SaaS for my teams.



