Grafana 4.0 with alerting is released (grafana.org)
299 points by yobo on Nov 29, 2016 | 86 comments



This release has been long in the making. We started on Alerting way back in March this year and it's finally released! Read more about all the highlights in the release here: http://grafana.org/blog/2016/11/09/grafana-4.0-beta-release/

Oh, and if you're in New York tomorrow, sign up for GrafanaCon: http://grafanacon.org


Great work! Including a way to set grace periods will be really useful for preventing flapping alerts on noisy metrics, e.g. 'Alert when CPU > 95% for 10m'.


One way to solve that problem is to reduce the series with min(), e.g. http://play.grafana.org/dashboard/db/alerting-flappy?panelId...

This means that the lowest value over the last 5min of the series has to be above 80% before the alert triggers.
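
In Grafana 4's rule builder, that kind of condition reads roughly like this (the query letter, window, and threshold here are just whatever your panel uses):

    WHEN min() OF query(A, 5m, now) IS ABOVE 80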


Congrats on the release! We've been using Grafana for a few years now, and personally, built-in alerting is the only thing I would have added. Some may argue that this "violates" good separation-of-concerns practices, but honestly you are going to be alerting on the same data you feed to Grafana, so at the end of the day it makes a lot of sense. Call it a "two in one" if you will. Either way, this will make monitoring with Grafana much more streamlined.


Note that keeping them separate has a benefit that when your 'Visualization' portal is down, your 'Alerting' systems are unaffected (and vice versa).

Collectd, Telegraf, etc. can be configured to send the same metrics to your favorite TSDB and alerting system (like Riemann) in parallel.


Agreed with the sentiment re: streamlined experience.

We struggled a bit with whether or not it really "belonged" in Grafana, but we believe in alerting while "in the flow".

It makes a lot of sense (from an experience standpoint) to 'manage' alerting while you're 'managing' your dashboards, visualizations, and queries; you already have a sense of the data _right there_.


As someone who is not too familiar with Grafana but is tasked with deploying it shortly, I'm curious about the alerting and what I can do with it. I will read through the release page, but it's fun to get it from the horse's mouth, if you're still around to comment?


Influx + Telegraf + Grafana is such a simple, sweet stack. No work to maintain, trivial to set up, I can ship just about anything I want into it, and reporting is fast.

With alerting in place now, I'm happier than ever. A huge thank you to the Grafana team for solving a major pain point!


What kind of volume are you sending into Influx? It crashed on me probably 5 times a day with only 100 requests per second.


Right now it looks like it's around 50/sec. A lot of data points get rolled up by Telegraf on individual machines, and then it's shipped in via the UDP line protocol. I've written much larger volumes, though, and never had an issue with stability.
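
For anyone who hasn't tried it, shipping a point over the UDP line protocol is only a few lines. A minimal sketch in Go (the address and measurement are placeholders; the listener and its port come from whatever you enable in influxdb.conf's [[udp]] section):

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Address of InfluxDB's UDP listener (enabled in influxdb.conf).
        conn, err := net.Dial("udp", "influx.internal:8089")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // One point in line protocol: measurement,tag=... field=... timestamp(ns)
        line := fmt.Sprintf("cpu,host=web01 usage=42.5 %d\n", time.Now().UnixNano())
        if _, err := conn.Write([]byte(line)); err != nil {
            fmt.Println("write failed:", err) // UDP is fire-and-forget; errors here are local only
        }
    }

Since there are no acks, it's also worth raising the kernel receive buffer (net.core.rmem_max) on the Influx host if you push real volume.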


If I may ask, how is UDP doing for you?

I checked my graphite setup once. We had 27% of metrics lost over UDP. That was bad.

pro-tip: "netstat -anus" and look at the error counters.


About 4% err-to-received ratio. That's probably due to untuned UDP buffer sizes though; despite the dropped packets, we're getting enough data to provide the information we need.


Was this an older build? We had serious issues at first, but our setup is pretty stable these days.


Last I tried was 1.0.

I love everything about using Influx, but it would die and never restart, and every time it would crash somewhere in semacquire. I'll have to try it again since I need to check out this Grafana update anyway.


There's some setup involved if you're sending a decent amount of traffic to it.

The two game changers are using the UDP line protocol instead of HTTP, and making sure you are batch-processing inputs. Fixing these settings is the difference between an instance that crashes all the time and a purring one.
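
To make the batching point concrete, here's a rough sketch of the pattern in Go: buffer points and flush either when the batch fills up or on a timer, instead of doing one write per point (the batch size, address, and flush interval are all illustrative):

    package main

    import (
        "net"
        "strings"
        "sync"
        "time"
    )

    type batcher struct {
        mu    sync.Mutex
        conn  net.Conn
        buf   []string
        limit int
    }

    // add queues one line-protocol point and flushes when the batch is full.
    func (b *batcher) add(line string) {
        b.mu.Lock()
        defer b.mu.Unlock()
        b.buf = append(b.buf, line)
        if len(b.buf) >= b.limit {
            b.flushLocked()
        }
    }

    func (b *batcher) flush() {
        b.mu.Lock()
        defer b.mu.Unlock()
        b.flushLocked()
    }

    // flushLocked sends all buffered points in a single datagram.
    // Keep limit small enough that a full batch fits in one datagram.
    func (b *batcher) flushLocked() {
        if len(b.buf) == 0 {
            return
        }
        b.conn.Write([]byte(strings.Join(b.buf, "\n") + "\n"))
        b.buf = b.buf[:0]
    }

    func main() {
        conn, err := net.Dial("udp", "influx.internal:8089")
        if err != nil {
            panic(err)
        }
        b := &batcher{conn: conn, limit: 100}

        // Flush at least once a second even when traffic is light.
        go func() {
            for range time.Tick(time.Second) {
                b.flush()
            }
        }()

        b.add("cpu,host=web01 usage=42.5") // producers call add(...)
        time.Sleep(2 * time.Second)        // give the ticker a chance to flush
    }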


Sending data in batches gives serious performance improvements. Don't send metrics directly to Influx from your app; send them to an intermediary like StatsD, which will aggregate them and forward them in batches.

Shameless plug: I recently published a log router in Golang. It sends data to Influx too! (github.com/agnivade/funnel)


Thank you. I'll check this out.


I use Riemann in front of Influx, which collects data and forwards it once a second. Works nicely, especially given that I aggregate some of the higher-volume metrics before sending them to Influx.


What transport are you using to secure telegraf into influxdb?

(Haven't tried Telegraf yet, setting up a Prometheus instance at the moment.)


Not sure what you mean by "secure telegraf into influxdb", but we've had great success with this stack for monitoring by just embedding an HTTP server into each application that needs to be monitored. We keep the HTTP server separate from any others used by the application (i.e. it runs on a separate thread) so performance isn't impacted.


My use case is one where I have servers in different datacenters and would want to have a simple, but secure, way to fetch metrics for graphing and alerts.

So, I meant encryption in transport, authentication, etc. as many solutions work well if you're monitoring "in the clear" from the backend, but not so much over the internet.


We're deployed on AWS in multiple regions with VPNs set up between VPCs. No particular attention paid to securing the transport between Telegraf and Influx at the moment since a) it's either in an internal VPC or secured via ipsec, and b) our monitoring data is low-value enough that it doesn't warrant its own secure transport.


IIRC, Influx supports HTTPS too, so you just have to set up some certs and switch to https in the client.
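
In Go, for instance, that's just a matter of giving the client your CA; a minimal sketch, assuming Influx's /write endpoint, with placeholder paths and hostnames:

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "net/http"
        "os"
        "strings"
    )

    func main() {
        // Trust the internal CA that signed the InfluxDB server cert.
        ca, err := os.ReadFile("/etc/ssl/internal-ca.pem")
        if err != nil {
            panic(err)
        }
        pool := x509.NewCertPool()
        pool.AppendCertsFromPEM(ca)

        client := &http.Client{
            Transport: &http.Transport{
                TLSClientConfig: &tls.Config{RootCAs: pool},
            },
        }

        // Same line-protocol write as before, just over https.
        body := strings.NewReader("cpu,host=web01 usage=42.5")
        resp, err := client.Post("https://influx.internal:8086/write?db=metrics", "text/plain", body)
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
    }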


Quick note for those who are tired of the giant clusterfuck of open-source tools for monitoring + alerting + storage + more, which includes no less than:

- statsd

- collectd

- graphite

- whisper

- carbon

- prometheus

- grafana

- seyren

- riemann

- nagios

- icinga

- zabbix

There are multiple modern SaaS products that will do all of that in a single tool, with better integrations, more polish, less work, and no maintenance.

1) See https://www.datadoghq.com and the latest news: https://techcrunch.com/2016/01/12/investors-feed-datadog-a-h...

2) https://signalfx.com/ and the latest news: https://techcrunch.com/2015/03/12/signalfx-emerges-from-stea...

3) http://www.bmcsoftware.uk/it-solutions/truesight.html if you're not anti-enterprisey (that was the "Boundary" startup, bought by BMC a few years ago and integrated into their offerings).

And don't think that they are "new" fancy tools. They've been around for many years.


Agreed that the SaaS offerings are a lot more turnkey; it's the integrations and polish that make all the difference.

What you call a 'clusterfuck' is really a wider ecosystem. It would be pretty crazy for a single organization to use all or even most of the tools that you list.

Right now, people accept high degrees of cost (especially for at-scale users) and lock-in in exchange for the convenience of SaaS. Or they go open source (which, to your point, certainly is an investment in time).

Watch out for what team Grafana will be doing in 2017. Our plan is to provide a fully turnkey, hosted offering based around Grafana (and a handful of other open source tools). OpenSaaS.

We hope that for many users, this can be a third choice, and in some ways the best of both worlds.


No offense, but Grafana is only as good as the weakest piece in the monitoring chain.

Having nice graphs is nice... until they fall apart because the source is unavailable.

And that doesn't help with alerts either. (I tested the alerts in the v4 beta; they're just not comparable to the better alerting tools out there.)


No offense taken ;) You're spot on about needing a solid and scalable backend; it’s more than 'nice graphs'. We think Grafana is a great piece in the chain to start with. We're trying to put as much momentum behind it as our burgeoning company will support.

The alerting in v4.0 is just the beginning. Torkel and the team have tried to optimize for the “relatively simple" 80% of alert use cases.

We are fans of other, more sophisticated open-source alerting tools like Bosun, and you can be sure that we'll keep improving our alerting capabilities throughout 4.x.


What are you missing compared to other alerting tools?


As long as enterprises understand that they can get support options for Grafana (on-prem, SaaS, etc.), it just comes down to choosing the most economical option. I see a benefit in symmetry for enterprises that are hybrid or still mostly in their own datacenter.


For installations of a few hundred instances or more, some of the SaaS offerings cost more than the engineering salaries it would take to maintain the OSS tools.


Shame that many of the OSS tools do not have any sort of corporate sponsorship, or if they do, that it doesn't cover all the work that goes into releasing OSS in this space.

Note: I am one of the maintainers of Diamond, a metrics collection tool written in python. https://github.com/python-diamond/Diamond



Unfortunately, the post doesn't share things like: how much infra is needed and how much does it cost, how much time it took to set up, how much maintenance it needs, how long upgrades of the setup take, how much time future hacking of missing features will take, and so on. After that sort of stuff is truthfully taken into account I suspect most if not all savings would be lost.


Having been on the maintainer side of these OSS tools, I can tell you your statement is untrue.

The OSS tools cost a fortune in human time to maintain, and another fortune in hardware to run.


Datadog will cost you $165,600 a year for 600 hosts. That is roughly the cost of a very well-paid engineer. So no, the statement is not untrue.

(I picked 600 because that was the approximate number of machines we had at my last job, where we used Graphite maintained by one guy, part time).

You included a LOT of redundancy in your OSS list. Multiple time-series databases. Multiple collection daemons. Multiple dashboards. Multiple alerting systems (who in their right mind would use Nagios AND Icinga?). You're effectively arguing about maintaining multiple monitoring stacks, some of which are quite aged.


Yikes. I'm sure there's discount pricing available but some of us have tens of thousands of hosts to monitor. The pricing you quoted doesn't scale. For me it might be cheaper to collect with OSS and graph with SaaS.


Indeed, I gave a list of all the tools; you only need a stack of about 4 to 8 of them to get the job done.

Let's say statsd + collectd (metrics collection) + graphite (aggregation) + carbon/whisper (graphite storage) + icinga (alerting) + grafana (graphing). That doesn't exactly come easy.

No offense, but a single Graphite is not a monitoring solution. It's just the tip of the iceberg. Monitoring does take a lot of engineering work and a lot of maintenance. You won't get away with monitoring 600 hosts on the cheap; just think about how much the hosts themselves cost.


Let's talk about how much Amazon will charge you for 600 instances a year...


For microservices based architectures, things like OpenTracing can go a long way in de-cluttering the clusterfuck. Of course, it requires developers being up to speed on distributed tracing, which isn't the case across the board. http://opentracing.io


s/clusterfuck/ecosystem/

A typical system doesn't use all of the tools above. You use what fits you and many of the tools play pretty well together. I've had luck with Icinga2 and Grafana lately, for example, which integrated quite smoothly out of the box.


And you have to include in the price the problems of a paid app:

- customization will be very expensive, if not impossible

- you must have people for the procurement process (x10 more costly if you are in a gov agency),

- weird failures due to not finding the license,

- your cheap personnel that install software won't be able to do it,

- you'll have problems creating testing environments because you don't have licenses

- you won't be able to do some things immediately because there aren't enough licenses.

And these are just the problems that come to mind right now. All of them are real problems that I've encountered with commercial software.


> - customization will be very expensive, if not impossible

You've got a full API and integrations with a hundred different tools and services out of the box.

Seriously, my coworker was skeptical at first too (so was I). Then we configured the full integrations with AWS/the-agent/statsd/postgres/mysql/cassandra/elasticsearch/riak/nginx/haproxy/redis/memcache/pagerduty/slack and some more.

My co-worker concluded in front of my CEO, "it was 2 orders of magnitude faster [than anything else we've ever tried for monitoring]". And that's not even talking about the additional features and customization we couldn't even dream of.

> - you must have people for the procurement process (x10 more costly if you are in a gov agency),

True. That's the only major problem I can see: People who can't buy the software they need. That's a social problem, not a software problem.

> - weird failures due to not finding the license

It's only one API key to put in the agent config file.

> - your cheap personnel that install software won't be able to do it

I don't know who you're talking about. Monitoring has our best people working on it. At other places I've seen, it's done by devops consultants raking in £600 a day.

There are no cheap personnel involved. (Maybe you're thinking of cheap interns who add alerts? That's an anti-pattern.)

> - you'll have problems creating testing environments because you don't have licenses

Same license. Put a tag environment=<environment> in the config and you're done; all metrics, all servers, and all alerts will be tagged.

> - you won't be able to do some things immediately because there aren't enough licenses.

Not applicable. It's not a limited license by seats.

You pay the bill at the end of the month depending on the number of hosts in your package. There is an hourly price for ephemeral hosts and overrun.


I think the reason why you're getting negative reactions is that you're talking very broadly as if your personal experience is representative for everyone in the field. Rather than asserting that the real problems which neves mentioned don't exist, try describing how the specific products you've used were designed to avoid them.


Also https://www.hostedgraphite.com - we host Graphite and StatsD with Grafana dashboards, as well as alerting and integrations with several other dev tools. It's a self-funded business that has been running for 5 years, profitable, with 14+ staff.


You guys are awesome. Can't wait to see Grafana 4.0 on hosted graphite.


As someone who's configured and worked with almost all of the tools in this list, I can only disagree with you. The old saying "you get what you pay for" is somewhat relevant, but the integration of the newer OSS monitoring tools is becoming increasingly awesome. Take the Graphite/Prometheus/Elastic Stack integration with Grafana, for instance.

I think having one pane of glass to do all passive monitoring tasks is an incredible step forward.

I have yet to see if Grafana's active monitoring (alerting) is any good, but it does look very promising.


Is there any hosted monitoring solution that integrates with service discovery, so that it's actually useful for serious alerting in today's dynamic environments? Otherwise you can't even tell whether things that should be there are reporting in or missing.




When you team Grafana up with a general-purpose database like Crate.io, some pretty amazing things can happen. Not only can Crate "roll with the punches" of auto-sharding whilst dynamically scaling performance over N database nodes, it also possesses powerful aggregation capabilities. If that weren't enough, Crate also dynamically gzips data by default, which is impressive given its zippy performance.

You get all of this for free with Crate.io without giving up the flexibility of a general purpose SQL database...

Wanna start storing log data in Crate as well? No problem! Just design your table schema and API ingest layer (my favorite is NodeJS, but you can use any language you like).

Or, if security isn't an issue (i.e. you're on a subnet safe from the public internet), you can just use the built-in REST API which Crate exposes.

With Crate, I've been able to store hundreds of GB of systems log data without worrying about silly things like table-bloat (the autosharding of partitioned tables handles the spectre of bloated table shards for me for free).

Thanks to the amazing developers over at Crate.io for taking the best of Elasticsearch and making it sane, fast, and chock-full of SQL goodness!

Also a big thank you to the Grafana team for recognizing the potential synergies that Crate.io & Grafana could catalyse for unifying time-series & log data streams.


Grafana really looks interesting, and it's neat that you can add all the different backends to it; for example, I didn't know you could use Elasticsearch as a time-series backend.

Is it correct that Grafana works best with Graphite? At least that seems to be my impression, and it is a bit sad, since I think Graphite is cool, but it really has a lot of moving parts.


I wouldn't let that stop you from evaluating it. At my last few jobs I've used Grafana with Elasticsearch and Prometheus without a problem. I've never actually used it with Graphite. The only downside I've seen is that the queries are unique to the data source, so when looking at many of the examples online you have to figure out the equivalent for yours. There might be a performance difference, but I wouldn't know, since I haven't used Graphite.


We use it with Tgres [1] (which only pretends to be Graphite, but actually is Golang + PostgreSQL) - works great.

[1] https://github.com/tgres/tgres


Using it with InfluxDB, Prometheus and Zabbix. Works like a charm. One of my favorite tools!


Graphite was the original, but as others have mentioned, Influx, Prom, Cloudwatch & Elasticsearch are all first-class data sources.


We use it with influx, works great, though I find influx's syntax to be funky.

We also use it against MySQL. Annoyingly, there is no driver for that, so we had to build a nasty layer that translates from Influx to MySQL.


Today, after fighting with Kibana all morning, I decided to tweak our scripts to build Grafana dashboards against our existing ElasticSearch data store. Very fast to get up and running, and the features and speed are amazing.

Very glad I saw this announcement on HN.


I've used it with Influxdb, Elasticsearch, and Prometheus.

They all worked great. I can't think of any reason to use Graphite.


Indeed. We reduced our write iops by 95% by moving from Graphite/Carbon to Influx. Try one of the newer databases!


I'm currently using Prometheus, Grafana, and Alertmanager. I'm a big fan of the Linux terminal, versioned config files, and separation of concerns, but the rest of my team prefers web interfaces, so I'm basically the only one maintaining Alertmanager. Grafana Alerting looks appealing.

What have other people had success with?


Zabbix and Icinga2 were the most appealing alternatives that didn't require versioned config files for alerts last time I checked.

I think Grafana will fill the basic GUI alerting needs, though. When you need more than a simple flat threshold, you usually want to get out of the GUI and ask the ops team for help anyway.


I've had success with killing all the s* free open source tools (Grafana, graphite, prometheus, whisper, icinga, nagios, carbon, ganglia, influxdb, zabbix...)

And using a single paid tool that does the job better AND doesn't kill me in maintenance work.

See https://www.datadoghq.com/ as leader or https://signalfx.com/ as the second comer, or http://www.bmcsoftware.uk/it-solutions/truesight.html if you're enterprisey.



I don't see how anyone can afford SaaS metrics/alert services at any sort of real scale.

$15/month/host gets expensive fast. Datadog doesn't start providing discounts till you are at 1000+ hosts.


All vendors provide discounts if you negotiate. ;)

$15 * 500 hosts = $7500 per month.

If you think it's expensive, I can only advise you to check how much the hardware will cost on EC2 to run the free tools, plus how much work it will take to get the 8 different and independent OSS tools to work not only on their own but also together, plus how much additional work and maintenance it takes to keep everything running without hiccups (war story: there is nothing worse than a monitoring tool that is less reliable than the thing it monitors).


Oh, I agree. That's why we ditched EC2 for our own bare-metal cloud based on Joyent and saved over $200k/year.


> $15 * 500 hosts = $7500 per month.

Oh, that's ridiculous pricing. Any team running a server installation at any sort of scale would scoff at it, and that's even before you take into account the implications of sending so much of your metrics data to a third party, to be held hostage there if you decide to leave the service.


At my current gig the network is air-gapped. At previous gigs, even though we had internet access, security was very tight. They would also rather pay 40+ hours of dev salary installing and supporting open-source or home-grown tools than have an ongoing subscription for a few bucks a month.


True story: Our monitoring stack now has three distinct components with alerting functionality.


We'll probably be in the same position. Grafana will make simple thresholds easy to visualize, Kapacitor can do more advanced anomaly detection, and we still need something like Sensu to do alerts that aren't really bound to metrics - and it provides a dashboard of alerts. Kinda annoying, but it works, I guess.


That's so that you can create a DevOps version of the final scene from Reservoir Dogs


Anybody got tips on how to start with implementing an alert system? Or what to read to get started?


I can't be the only one who laughed out loud while reading the ad for GrafanaCon. It contains the word "democratization" and takes place on an aircraft carrier...


Is anyone using a log management tool in conjunction with Grafana? I.e. if you see something anomalous or see an alert triggered, how do you investigate what's going on?


We've used Grafana with Sematext Logsene (which exposes Elasticsearch API, so it's like having Grafana talk to ES).

Here's a short howto + video: https://sematext.com/blog/2015/12/14/using-grafana-with-elas...


You can use ElasticSearch as an annotation provider over the top of your time series metrics. We publish events from our continuous deployment pipeline into ES and then surface those in a generic application dashboard. There hasn't been a deployment that we didn't already know about, but in theory when more users are going through CD it will provide more of a heads up.
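
Publishing such an event is just indexing a timestamped document that Grafana's annotation query can match. A rough sketch in Go (the index, type, and field names are made up; they just need to line up with what you configure in Grafana's annotation settings):

    package main

    import (
        "bytes"
        "encoding/json"
        "net/http"
        "time"
    )

    // event is the shape of a deployment annotation document.
    type event struct {
        Timestamp time.Time `json:"@timestamp"`
        Tags      string    `json:"tags"`
        Text      string    `json:"text"`
    }

    func main() {
        doc, err := json.Marshal(event{
            Timestamp: time.Now(),
            Tags:      "deployment",
            Text:      "myapp v1.2.3 deployed to production",
        })
        if err != nil {
            panic(err)
        }

        // Index the document; Grafana's annotation query then surfaces it
        // on any dashboard whose time range covers the timestamp.
        resp, err := http.Post("http://es.internal:9200/events/event",
            "application/json", bytes.NewReader(doc))
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
    }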


You can use Graylog for log management; that's the free open-source solution (Graylog + Elasticsearch + MongoDB).

You can use Splunk if you have money. That's the de facto standard. Beware that it has one of the most expensive software licenses on the planet :D


It'd be nice if this meant being able to use Grafana as a frontend to alertmanager.

(Writing those "ALERT ..." rules involves a steep learning curve.)
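
For context, a rule in Prometheus's current (1.x) alert syntax looks something like this (names and threshold illustrative), all hand-written in a rules file rather than built in a UI:

    ALERT InstanceDown
      IF up == 0
      FOR 5m
      LABELS { severity = "page" }
      ANNOTATIONS {
        summary = "Instance {{ $labels.instance }} is down"
      }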


This is exactly what I was thinking! I'd love to know what the devs of Prometheus think about alerting in Grafana...


https://twitter.com/fabxc/status/803870900097523712

> I repeat: Your alerts and dashboards belong into your SCM, not a random SQL database!

(And I 100% agree, particularly for alerts)


Hmm, didn't know this was written in Go. Seems like Go is doing quite well in this space too, with Bosun and Scollector.


Go code compiles to compact, statically linked binaries with relatively modest memory usage, reasonably good concurrency support, and an excellent standard library for networking - it's a natural fit for monitoring stacks. Even some monitoring systems that aren't Go on the back-end have Go-based collectors.


Do users still have full access to data sources, regardless of which dashboards they have access to? This is what keeps me from using Grafana to expose some data to clients.


Congrats guys!


I've had alerting via grafana built and deployed for the last 16 months. Not sure what took so long... but cool to see it native now. Keep up the good work.


Switch to Datadog and don't look back. Most valuable SaaS for my teams.



