Grafana and Graphite are what make managing large infrastructure manageable.
I was introduced to grafana in early 2014. I was a bit sceptical as I was using graphitus to make dashboards. However I soon converted.
I maintained a very large graphite cluster at the Financial Times (I think it was about 1 million active metrics, but it might have been 0.5 million, I forget). The only sane way to manage the front end was using grafana. Simple oauth2 integration meant that I could avoid the nightmare of trying to get AD access, and it also meant one-click SSO.
Grafana was one of those tools that was self-evidently the best in class, so it was widely adopted. Within two years, virtually every team screen had grafana on it. Non-programmers used it, and even set alerts. How many other "devops" tools can boast that level of universality?
Either way, keep up the good work, and best of luck.
I recently consulted for an F500 client and had to make recommendations regarding their Grafana instance (among other things). We noted they were curating too many metrics (somewhere in the thousands). Our belief is that if you're providing stakeholders with so many metrics, you're forcing them to make their own decisions about what's valuable to track and what's not, rather than allowing leadership to provide direction as to how they're measuring performance, etc.
I can't imagine what it'd be like (as a stakeholder) using a Grafana instance that, in total, has >500k metrics. I would assume many of those are deprecated, do not provide any value, or do not spur any action by stakeholders.
I've worked with similar scale, and the situation is basically that when you have a thousand people working on a service, they have different needs. Ops needs a variety of host and container level metrics. Not just for action by stakeholders, but for autoscaling, autoremediation, etc. If you have a few thousand servers, you're probably talking 100k metrics right off the bat. More if you want statsd aggregate metrics instead of just one summary stat.
And if you have microservices, you want to track how well each client-server pair is doing, on both sides of the equation, which means tracking error codes, success/fail rates, etc.
Finance wants its own metrics to measure capacity versus utilization to prove to the CFO the spending is appropriately constrained.
Devs want to prove their system works and works quickly, so you'll have a variety of metrics revolving around subcomponent usage, and performance timing. Maybe even cache rates.
Not all of these metrics will spark action by stakeholders. Some will be retained 'just in case' since you can't retroactively collect data. When perf drops in a canary because GC pauses are increasing, you definitely want to be able to see both performance metrics over time and GC metrics.
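As a rough back-of-the-envelope check of the "100k metrics right off the bat" figure above, here is a small sketch. The per-host counts (35 host metrics, 10 timers, 8 aggregates per timer) are assumptions picked for illustration, not numbers from the comment.

```python
def estimate_series(hosts, metrics_per_host, timers_per_host=0, aggregates_per_timer=1):
    """Estimate the number of distinct metric series for a fleet.

    All inputs are illustrative assumptions; real counts depend on the agent config.
    """
    base = hosts * metrics_per_host
    # Each statsd-style timer can fan out into several aggregate series
    # (mean, upper, lower, count, ...), which multiplies the total.
    timers = hosts * timers_per_host * aggregates_per_timer
    return base + timers


# 3,000 hosts with 35 host/container metrics each.
print(estimate_series(3000, 35))
# Same fleet, plus 10 timers per host expanded into 8 aggregates each.
print(estimate_series(3000, 35, timers_per_host=10, aggregates_per_timer=8))
```

Even with modest per-host numbers, the aggregate fan-out dominates quickly, which is the point being made about summary stats versus full aggregates.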
the graphite instance had the metrics, not grafana. I don't know how many were actually graphed. One thing that I can be sure of was that they'd all been pushed within the last week, otherwise they'd get deleted.
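The comment doesn't say how the week-old metrics were deleted. One common approach, assuming Whisper storage where each metric lives in its own `.wsp` file and the file's modification time reflects the last write, is a periodic sweep like the sketch below. The function names and the 7-day default are illustrative assumptions.

```python
import time
from pathlib import Path


def find_stale_metrics(root, max_age_days=7, now=None):
    """Return .wsp files under root not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(p for p in Path(root).rglob("*.wsp") if p.stat().st_mtime < cutoff)


def prune(root, max_age_days=7, dry_run=True):
    """List stale metric files, deleting them only when dry_run is False."""
    stale = find_stale_metrics(root, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Running it with `dry_run=True` first gives a list to review before anything is removed.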
There were at the time about 200 dashboards. They were controlled and curated by their own teams. It was pretty much the only shared tool that worked well. The only thing that I encouraged was tagging, but even then, they mostly did it themselves to make finding things easier.
There were about 80 active products, most had _a_ dashboard.
The crucial thing was that it doesn't cost much to record those metrics. This means that post-incident we can easily put an alert in, or prove x affects y because z.
Limiting the number of metrics recorded is frankly silly. Enforcing rules about quality and location, certainly; it's something I spend a reasonable amount of time on.
For example, the front end was a microservice. Each http call of each microservice was graphed, which allowed quick and simple diagnostics for general performance. Most of the time it's not needed, but when you _do_ need it, it's critical to have context.
I hope the money won't be used to hire a bunch of graphic designers who want to totally change how everything looks. I really love the sexy color schemes and user interface as they are. If they want to make changes, I really hope they only make incremental changes to what's already there.
Director of UX here. We're trying to focus on improving existing workflows and less on visual design changes. That being said, form styles will be overhauled soon.
It often seems like UX teams can’t help themselves, and that the urge to destroy what’s useful in favor of some idealized UI that doesn’t solve the task at hand comes with them. The UX industry is terrified of expert users or having to understand the product.
It’s literally my job to keep Grafana real. Our budding UX team is getting a lot of training on SRE as a discipline and we run weekly internal UX feedback sessions where all of engineering can join and course-correct. Same with weekly feedback sessions with our existing user base. Get in touch with david at grafana if you’d like to participate.
It's a somewhat insulting stereotype to think of designers as merely being interested in "sexy color schemes".
There are many people in the design/UI/UX space specialising in (often numeric) information design. They tend to be as fluent as any programmer in statistics, if not more.
To this day, programmers tend to conspiratorially suggest to designers to read Edward Tufte, even though his are the first books they make you read in any information design class, and have been since the early 80s.
I have had some good success using Grafana to visualize live data from a fleet of robots - I started out with graphite as the data store but quickly moved to InfluxDB which had better performance. Overall it was a very impressive tool which required little set up and configuration!
Thanks Roland! For those out there with Graphite performance issues, Grafana Labs also develops an open source, Graphite-compatible, performant and horizontally scalable TSDB called Metrictank.
https://github.com/grafana/metrictank
I came to Grafana via InfluxDB, been running it for system metrics for around 4 years now and it's been a real workhorse. Collecting metrics from around 100 machines using telegraf, presenting in Grafana.
Load significantly dropped when I shut off Carbon/Graphite after moving to InfluxDB.
We started with graphite, then moved to go-carbon + carbonapi. Replacing the python stuff with those two services made everything much faster - I think we had something like 1200 hosts submitting metrics every 30 seconds. So not a huge installation.
We've had the opposite experience - it's been slow, clunky, and hard to learn/teach. Of course, seeing some of the other things the former admin/advocate for it did after she left, I'm not sure it's Grafana's fault.
Yeah, we’re trying to make datasource query latency more obvious soon. It’s too easy to say Grafana is slow right now. We want users to make better decisions on what to optimize in their monitoring stack. That being said, dashboard search can be slow on bigger instances. But we’re working on it.
Does Grafana update fast enough for visualizing data from sensors on robots? (in particular for tracking positional information or response information if tuning/monitoring PIDs etc.)?
Director of UX here. This year we rewrote a lot of the data flow plumbing in grafana to allow streaming, with the goal of enabling sensor and other high-frequency use cases. More to come soon.
How many of those improvements are in current versions of Grafana? I've tried building some dashboards in Chronograf for some data I work with, but there are a lot of sensors that update very quickly, so the graphs basically killed the browser tab. If Grafana is better at handling that sort of use case I'll definitely check it out.
Grafana was pretty fast, but the total system latency can add up to about 1 second or so (wifi at least, cellular is longer).
We did not use grafana for debugging instantaneous, fast events, but it was fantastic for things like monitoring temperature, current modes, and running statistics.
Variables to abstract out some, a bit of "repeat" to loop over something, and you get pretty drop-downs that you can combine to show nice graphs.
Then you think "I'll add it to a playlist", and you do so.
Then you think "my kiosk can't scroll this much for all, let's have one screen each for the apps" and you do.
And then you realize you cannot use variables from playlists, and you cannot template screens.
So you make eight copies of your screen, one for each variable configuration.
And you edit each copy of your screen to set the variables, and save it.
And then you realize that there was a typo in one panel.
So you go in and edit in eight different screens to fix that typo.
Then you realize that it doesn't look good on the TN panel, so you need to change a few colours to get better contrast.
So you do that on eight different copies, by the means of clicking in every pane, navigating through the point-n-click and then pressing.
But you realized that you learned this, so you're fast, and use the keyboard. Except then the change doesn't take.
Because grafana requires you to click in another field after you've edited, or your change doesn't hold if you press "Escape" or other key to navigate back.
And that's how I learned how Grafana is best of breed in GUI dashboard tools. Sort of how a pug is best of breed in a dog competition.
I use Ansible + Jinja2 templates to create and update my dashboards. Minor tweaks and changes can be pushed to hundreds of dashboards using the grafana API
I don't have an easy way of sharing this but I'm free to answer any questions about it.
My process is
1. identify services that benefit from a generated dashboard (a service that I am running hundreds of instances of, for instance)
2. create the first dashboard manually
3. export the dashboard to JSON and turn it into a jinja2 template
4. use ansible to access the cloud provider api to get whatever metadata I need to populate the now templated dashboard
5. store the updated dashboard as code and also push it to Grafana via API with Ansible
This is all automated and you can skip all the way to step 4/5 if you plumb this to your service build/delivery automation.
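Steps 3 to 5 above can be sketched roughly as follows. This is not the commenter's Ansible setup: it uses Python's standard-library `string.Template` as a stand-in for Jinja2, and the dashboard fields, metric path, service name and `svc-` uid scheme are hypothetical. The only Grafana-specific part is the `POST /api/dashboards/db` endpoint with a `{"dashboard": ..., "overwrite": true}` body; the request is built but not sent.

```python
import json
import urllib.request
from string import Template

# Hypothetical exported dashboard, trimmed to a few fields and turned into a template.
DASHBOARD_TEMPLATE = Template("""
{
  "uid": "svc-$service",
  "title": "$service ($region)",
  "panels": [
    {"type": "graph", "title": "Requests",
     "targets": [{"target": "stats.$service.$region.requests"}]}
  ]
}
""")


def render_dashboard(service, region):
    """Fill the template with per-service metadata and parse it as JSON."""
    return json.loads(DASHBOARD_TEMPLATE.substitute(service=service, region=region))


def build_request(base_url, api_token, dashboard):
    """Build (but do not send) the request that creates or overwrites a dashboard."""
    body = json.dumps({"dashboard": dashboard, "overwrite": True}).encode()
    return urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
```

In a real pipeline the metadata would come from the cloud provider API (step 4), the rendered JSON would be committed alongside the service code, and the request would be sent with `urllib.request.urlopen` or an equivalent Ansible task.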
Two main problems I faced with grafana:
- Do you plan on allowing alerts on metrics with variables? This is very important for a lot of people. For example, you usually have an environment in the metric name, and thus we can't share dashboards between environments if we want to keep metrics.
- When do you plan to fully support CSP? I guess you have to remove the angular code for that, is there a timeline?
Yes! Loads of ideas floating around and I think Jon has been working on some improvements for 6.6. It's going to be a big project in 2020 for sure.
(It's something I'm really keen on myself - We use jsonnet internally to version control our dashboards and load them into config maps in our kubernetes clusters)
That's really interesting, would you be keen on open sourcing that bit if you've not already? I know Raj and the rest of the crew are pretty all in on open source.
+1 for using jsonnet. An adequate language for this kind of task! (Here "adequate" is meant as high praise... too often the choice of config language isn't.)
Longtime Grafana + InfluxDB user here. Recently I've been using Prometheus as an additional Grafana data source and I was surprised to lose the graphical query editor that I had with InfluxDB. Is graphical query builder support for Prometheus data sources something that's possible or on the roadmap? I searched the docs and Github issues but couldn't find any mention of it.
Director of UX here. We’re trying to find a way to have a graphical builder for simple queries. But anything with binary operators is a UI challenge that will be text based for a while. We’re testing the waters with a graphical LogQL builder for Loki first. Watch this space.
Will you keep making it easy to self-host the services?
We currently use Grafana Loki + Grafana and it's working amazingly well, we've tested load over 600 logs/second (1 Core VPS 2GB RAM) without any issues.
I got it wrong with Cortex (scalable Prometheus) and have been paying the price - poor adoption, smaller community and less mindshare. We're 100% committed to making sure Loki and Grafana are super easy to self-host, and are even putting time and effort into making Cortex and Metrictank easier too.
That's a very good question! The open source Loki project will always be fully featured, but we are looking at building an enterprise version in a very similar way to Grafana Enterprise.
Having the right differentiation is key here, and something we're super sensitive to - for instance, we put LDAP (a typically enterprise feature) into OSS Grafana. Watch this space and let us know what you think!
I've implemented Grafana at many different jobs and in many cases have had an executive who would have been willing to pay for the featureset we got for free.
My two cents:
- Deliver a version of Grafana that is available packaged as an appliance, with very little need to edit the INI files. Click to deploy from the Amazon marketplace with the enterprise license included. Easy guide to getting it running with the IAM roles needed to be a first class Cloudwatch adjunct.
- Build on your current SSO offerings and get support for SAML into the mix, as it's what most companies have settled on for access control.
- Keep working on the dashboard provisioning tech and maybe even provide a templating mechanism accessible from the UI that spits out new dashboards on demand, beyond the (excellent) variable functionality that's currently there.
- More work on alerting! You're so close to a first-class alerting system, but we need template variables in alert text and subject lines, support for alerts on dashboards with template variables, severity levels, and a better UX for configuring alarm conditions. Another feature I'd push hard for any company to pay for an Enterprise license to get would be any kind of adaptive, trend-based thresholds rather than the fixed min/max you currently have. It's about the only reason I'd still push some folks towards Datadog and they rake the money in hand-over-fist.
Thanks so much for your product, it really is an absolute gem. Hope I can get an employer to start paying for it one of these years.
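To make the "adaptive, trend-based thresholds" request above concrete, here is one minimal way such a check could work: bounds derived from a rolling mean and standard deviation of recent samples, instead of a fixed min/max. This is only an illustrative sketch, not how Grafana or Datadog implement it; the window size and the multiplier `k` are assumptions.

```python
from statistics import mean, stdev


def adaptive_bounds(history, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations of history."""
    if len(history) < 2:
        raise ValueError("need at least two samples")
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def breaches(series, window=5, k=3.0):
    """Return (index, value) pairs that fall outside bounds from the preceding window."""
    out = []
    for i in range(window, len(series)):
        low, high = adaptive_bounds(series[i - window:i], k)
        if not (low <= series[i] <= high):
            out.append((i, series[i]))
    return out


samples = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]
print(breaches(samples))  # [(7, 50)]
```

Because the bounds follow recent history, a slow drift does not alert but a sudden spike does. A production version would also need to handle seasonality and avoid letting the spike itself inflate the next window's bounds.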
We know that there are some gaps in the alerting feature (we dogfood it ourselves). The Grafana team will be focusing a lot more on alerting in 2020. For Grafana 7.0 in May, we are aiming to build better alerting that retains the simplicity of the current alerting but that will fill some of those gaps. The new engine will decouple alerting from the graph panel and hopefully sidestep the problem with template variables. Once we have got further in the design stage, we will share more with the community about the proposed solutions.
Seconded! In the current world where dashboards are coupled to alerts, it's very frustrating not to be able to define alerts on a templated dashboard.
It's also frustrating that it's effectively impossible with the Cloudwatch data source to select resources to alert on using their tags (e.g. select the elb with app=foo and env=prod). The dimension_values() function lets you filter based on tags, but instead of being usable directly in a panel's query it requires using a variable as an intermediary, which then disables alerting.
So would I! It is something we're going to start work on pretty soon I think - I mainly work on Prometheus, Loki, Cortex etc and not so much on Grafana. Will let one of the Grafana team comment.
Director of UX here. What's your mobile vs desktop use? Which devices are you using?
That will help us get more mobile-friendly in upcoming releases.
Are you planning to continue focusing on the core product (Grafana) or also branching out into other areas of the observability stack? (tsdb, log store etc)
>50% of our engineering effort goes into Grafana - but we've already branched out into other spaces, with a team working on Metrictank (scalable Graphite), Cortex (scalable Prometheus), Loki (log aggregation) and even the Graphite and Prometheus projects.
That being said, Grafana will always treat other datasources as first class citizens, like we already do with Influx, MySQL, CloudWatch, Stackdriver, Elastic etc. This is our "un-vendor" approach.
Thank you! I can't claim any credit for the Grafana project though - Torkel & team have done and continue to do an excellent job there. I mostly work on Prometheus, Cortex & Loki.
We want to use Jaeger as a datasource and show traces in Grafana - it's all a bit early right now, but come see us at KubeCon and we'll have something to show!
Our partners are very happy Grafana users. But I've observed some pain when they construct dashboards and alerts for each new device added. They are going to add thousands so this is very labour intensive! Some kind of "smart clone" would be very useful. (Thousands of dashboards feels like an anti-pattern, but alerts attached to thousands of sources is not)
Definitely! I'd recommend using variables so you don't have to replicate the dashboards, but they don't work with alerts (yet). You could also use something like grafonnet or grafana-lib to generate dashboards with code.
Have you thought about adding dashboard generation to your build automation? This is something we've implemented with Ansible. When a new service is built, it generates a dashboard and pushes it to Grafana via the API.
Similarly, if the service is updated, it also generates a dashboard and updates the previous one (this is easy because the update is an UPSERT). You can do interesting things like modularize pieces of the dashboard and update those modules independently. They can then get pulled into the Jinja template during the update.
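The "modularize pieces of the dashboard" idea can be sketched as small panel-building functions composed into one dashboard with a stable uid, so that re-pushing with overwrite enabled updates the existing dashboard rather than creating a new one. The commenter does this with Ansible and Jinja; the plain-Python version below, including the panel fields, metric paths and `svc-` uid scheme, is a hypothetical illustration.

```python
def latency_panel(service):
    """Hypothetical module: one panel showing request latency for a service."""
    return {
        "type": "graph",
        "title": f"{service} latency",
        "targets": [{"target": f"stats.timers.{service}.request.mean"}],
    }


def error_panel(service):
    """Hypothetical module: one panel showing error counts for a service."""
    return {
        "type": "graph",
        "title": f"{service} errors",
        "targets": [{"target": f"stats.{service}.errors"}],
    }


def build_dashboard(service, panel_builders):
    """Compose independent panel modules into a single dashboard dict."""
    panels = []
    for idx, build in enumerate(panel_builders, start=1):
        panel = build(service)
        panel["id"] = idx  # panel ids must be unique within a dashboard
        panels.append(panel)
    # A stable uid derived from the service name lets a later push replace this dashboard.
    return {"uid": f"svc-{service}", "title": f"{service} overview", "panels": panels}


dashboard = build_dashboard("checkout", [latency_panel, error_panel])
```

Each module can then be changed on its own, and every dashboard that includes it picks up the change on the next build.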
Director of UX here, we do want to integrate a trace viewer soon. We are actively talking to Uber to find ways of directly reusing their Jaeger components. Still considering other options too. Will check out your project!
Sounds promising. Have found Grafana really good, flexible and usable with many available data sources and display options.
Hope it continues to move forward and make it easier to get it adopted within my organisation.
I love Grafana and had no idea it was working on logs and tracing, only metrics!
"That goes hand-in-hand with pushing forward with our vision of building an open, composable observability platform that brings together the three pillars of observability – logs, metrics, and traces – in a single experience, with Grafana at the center."
Interesting. I thought Grafana was a metrics-targeted clone of Kibana, which was for ElasticSearch/Logs/Traces visualization. Sounds like it went full circle.
I've recently managed to make Prometheus+Grafana stateless using https://github.com/cortexproject/cortex if anyone is having issues with scaling / backing up.
What I'd really like to see is a more generic approach to data visualization done with the same care and expertise as is displayed with the time series visualization currently. Perhaps this would be a different product under the same brand, but I believe that the data viz space has a lot of room for competition - Tableau, Power BI, etc. are leaving a lot of room for competition. I'm currently looking closely at redash because of this, would love it if I could solve the same problems with grafana.
*note: I know that to some degree this is possible with current grafana, but if you read through the issues folks have with doing data viz outside of time series, you'll catch my meaning.
Ryan (mentioned in the article) is building a team around this (non-timeseries data, sensors data, manufacturing). Head to our hiring page if you want to join his R&D team (US-remote).
This is what Open Source Companies need to look like. Honestly the more they do stuff like this, the more this makes graphite's hosted options the only game in town I trust.
There is a big market IMO for companies that want to offer a managed grafana to their users. Just take a look at the latest offering from Logz or hosted-graphite. We also do it where I work. It is really not easy, but I hope it will get better. I would happily pay to get support from them and features that facilitate my life.
we do this too at Taloflow.ai
we basically deliver a managed version of Grafana (rather than a proprietary UI) for our AWS cloud spend management tool. we work primarily with devs, and they tend to love and work with Grafana already, so this was a no brainer at our startup.
We recently converted an internal dashboard for the Helium blockchain to a public tool and the reception/usefulness has been awesome. (For anyone interested -> http://dashboard.helium.com)
splunk is great at search, it's also rather good at plotting graphs for a single search.
However doing it quickly or efficiently is not splunk's strong suit.
Grafana allows you to quickly graph data from source x, compare to source y and then build a dashboard.
Splunk can also do all those things, but much slower.
We had a huge splunk cluster (100 GB a day), and it's a great complement to grafana.
typically you use grafana to alert you to when things are going wrong. It would point out which system was going wrong and at what time, then you'd use splunk to get the logs to figure out the cause.