Hacker News
Grafana Incident: Smart incident management for your teams (grafana.com)
250 points by matryer on Feb 2, 2022 | 56 comments



>Automatically create the online meeting spaces for collaboration

>Manage TODO items so nothing falls through the cracks

I work in incident response, and I feel that incident response products hugely misunderstand one thing: companies already have established tools for collaboration and meetings and for capturing planned work.

I find that adding these things is seen as nice and inclusive, and it is easier to sell a product that does a lot, but it turns into complete bloat, makes adoption harder, and makes the larger product harder to support.


This was a big learning for us when we were first building out Kintaba[1].

Re: task management specifically-- having previously been at FAANG companies that built all their own tools I had not realized just how prevalent Jira is. It. is. EVERYWHERE. and IT orgs at companies from 3 to 300,000 people are absolutely married to their carefully customized version of it as a system of record for everything that happens or will happen.

We see many on-premise implementations as well despite the announced sunsetting of that product.

I'm sure there's a #2 and #3 out there but honestly I almost never see it (we do see clubhouse/shortcut from time to time... but even those folks tend to move to Jira within 6 months).

OT but it really makes me doubly impressed that Slack was able to move into organizations so successfully from all corners, such that it was able to dodge what would traditionally be a pretty big Atlassian-owned barrier.

[1] shameless plug for our incident management tool @ https://kintaba.com


Having both used and administrated JIRA...

The on-prem version helps you ensure that it's running fast and secure, or you can end up fucking up the performance part. But on-prem also means systems that are firewalled from the internet might have access to it, which helps with integration.


I think the problem is trying to present an abstraction layer to management. We already have those same features, todo lists and recording information, in Jira and ServiceNow and like a dozen other pieces whose purpose is to coordinate and track work. They are often unpopular with developers because they end up trying to provide an abstraction layer for the execs to replace their management by spreadsheets, but unfortunately, as anyone who has worked in software for long enough can tell you, abstractions are leaky.

Hence the dissatisfaction with a lot of these tools.


Interesting take..

What do you think is the solution - when an enterprise already has Jira, Github and Confluence, how do you think a product like Grafana Incident should integrate with these somewhat overlapping products?


This feels like a central question of post-cloud / post-SaaS outsourcing.

In the end, it boils down to two options: offer deep APIs into your product, or don't.

IMHO, what needs to happen to support the former is for every SaaS purchase to include full technical due diligence on external integration capabilities.

Integration needs to start being a headline feature in purchasing. And less an afterthought when a horrified engineer looks at some new enterprise product that's already being adopted.


I've used products in this space that would integrate with your existing video, chat and ticketing tools.


> companies already have established tools for collaborations and meetings and for capturing planned work.

Until the incident takes down those tools and the doors to get into the building.


I wish Grafana would stop trying to make offerings that already exist and focus on making their dashboards and alerts as code usable.

I would even pay money for an actual offering that worked.


The new Grafana alerts do absolutely nothing to help with this.

I'm at the point where I would pay 5 figures a year for something purely to do better alerting inside or alongside Grafana. Clicking alerts together is a nightmare when I have a ton of identical systems I need to configure. Same for dashboards - the limitations of the current mechanism are too severe.

I'd build my own templating mechanism for it, but I still want the alerts visible in Grafana itself. Zabbix has the power to do all this but with a UX that is not ideal....


Well, been down the Prometheus/Alertmanager path and it is pretty workable there.

You would get all the templating and grouping by labels.

Dashboards can show Prometheus alerts through annotations on the graphs, so you get visual feedback on what was broken when.

Firing alerts are also a metric on Prometheus, so you can list those (or do other stuff with it).

It's not a UI thing though. More the lower layers to get stuff for Grafana.
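To make the label-based approach concrete, here is a minimal sketch (not anything from the Grafana or Prometheus docs verbatim) of one alert rule that covers every cluster via a `cluster` label, instead of one copy per cluster. The metric name `http_errors_total`, the alert name, the threshold, and the group name are all made up for illustration. The output is printed as JSON only to avoid a third-party YAML dependency; turning it into an actual rule file is left out.

```python
import json


def build_rule_group(group_name, metric, threshold, for_duration="5m"):
    """Build one Prometheus-style rule group as a plain dict.

    The rule aggregates by the `cluster` label, so a single definition
    yields a separate alert instance for each cluster that breaches it.
    Metric and alert names are hypothetical placeholders.
    """
    rule = {
        "alert": "HighErrorRate",  # hypothetical alert name
        "expr": f"sum by (cluster) (rate({metric}[5m])) > {threshold}",
        "for": for_duration,
        "labels": {"severity": "page"},
        "annotations": {
            # Prometheus expands this template per firing alert instance.
            "summary": "High error rate on cluster {{ $labels.cluster }}",
        },
    }
    return {"groups": [{"name": group_name, "rules": [rule]}]}


# Pending and firing alerts are exposed by Prometheus as the ALERTS series,
# so this query can be used to list whatever is currently firing.
FIRING_ALERTS_QUERY = 'ALERTS{alertstate="firing"}'

if __name__ == "__main__":
    group = build_rule_group("cluster-alerts", "http_errors_total", 5)
    print(json.dumps(group, indent=2))
```

With 30 clusters this stays a single rule; routing and grouping by `cluster` would then happen in Alertmanager.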


Hey there! I work with alerting in general at Grafana - what are the pain points of dashboards and alerts as code you're currently experiencing? Would love to deliver / capitalise on the feedback.


Alert templating. Grafana is fussy about configuring alerts on dashboards that have variables. This means that if you have 30 clusters and want to use a single dashboard with a drop-down variable selecting your cluster, you cannot define alerts on it. It will refuse to do it.

Alerts are also tightly integrated into dashboards, which forces alerts to be saved, backed up, and imported as a single JSON blob. We want separate management of alerts so they can be defined as code and not in the dashboard's blob of JSON!

What chagrins me is that, because of the above issues, we have to use Prometheus Alertmanager instead, while our colleagues absolutely LOVE Grafana itself! We can't duplicate alerts tens and tens of times. We don't want that management overhead, nor do we want to teach our colleagues jsonnet/ksonnet to generate it. We also don't want permission problems.


I can't edit my above comment anymore but I see that at least alerting is now a separate system in grafana 8! Great, we will take a look again!


Thanks for sharing! We built Grafana 8 alerting to address all of the problems you've mentioned.

Appreciate the second look and please let me know if you have any additional feedback.


For one, I'm not convinced that the Grafana 8 Alerting API Swagger docs are up to date or ready for the public [0].

I've literally copied an alert's json format, and then tried to post it back and never got it to work.

Here's an example from my bash history:

> curl -X POST -H "Authorization: Bearer $GRAFANA_API_KEY" -H "accept: application/json" -d @rule.json some_endpoint/api/ruler/grafana/api/v1/rules/test1

I spent a solid day playing around with this to get it to work. Because of this, the alerts are impossible to code review or store in git. Which stinks, because Grafana's datasource APIs would be amazing to use for alerting. But they're either unusable because anybody can change them or the administrator could bork them at any given point (which has happened before), or just undocumented to the point where they are useless.

That's not even to begin on dealing with the "big blob of json" problem [1] that was clearly important enough to be given an entire spot at GrafanaCon, but even Grafonnet is not supported with Grafana 8. There is apparently some CUE way of doing this, but I can't seem to find any official documentation on that.

Anyways, I've moved back to alertmanager for the time being.

edit: is all of grafana labs downvoting the GP? this is very honest and candid feedback here.

[0]: https://editor.swagger.io/?url=https://raw.githubusercontent...

[1]: https://grafana.com/go/grafanaconline/2021/dashboards-as-cod...
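For anyone retrying the curl call above, here is a rough Python sketch of the same request. It only builds the request and does not send it. The path is copied from the command above; the host, the `namespace` argument (the last path segment, `test1` above), and the rule body are placeholders, not a documented schema. One thing it does differently: curl's `-d` sends `Content-Type: application/x-www-form-urlencoded` unless told otherwise, so this sets `application/json` explicitly. Whether that explains the failure is only a guess.

```python
import json
import urllib.request


def build_rule_request(base_url, namespace, api_key, rule_group):
    """Build (but do not send) a POST mirroring the curl command above.

    `rule_group` is whatever JSON-serialisable object was exported from
    Grafana; its schema is not assumed or validated here.
    """
    body = json.dumps(rule_group).encode("utf-8")
    url = f"{base_url}/api/ruler/grafana/api/v1/rules/{namespace}"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            # Explicit, because curl -d would default to form encoding.
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder host, key, and body, purely for illustration.
    req = build_rule_request(
        "https://grafana.example.invalid", "test1", "dummy-key", {"name": "example"}
    )
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

Sending it would be `urllib.request.urlopen(req)`; that step is left out on purpose since the endpoint's expected payload is exactly what was unclear.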


It's currently impossible to write alert rules for Prometheus vectors. https://github.com/grafana/grafana/issues/35663

Missing basic functionality like that is a dealbreaker.


Hoping to see cleaner ways to integrate across data sources, but developing that contract is going to take some time I think. In the meantime, should be able to get this supported with prometheus data source in a Grafana managed alert: https://github.com/grafana/grafana/pull/44865


Grafana would do well to look into the thinkorswim desktop platform and the ability to write code around metrics. They are entirely different use cases but I feel the desired goals are the same, which is making the most of an ocean of metrics. Financial world crushes at this for obvious reasons. Tech world? not so much.


Why the desktop platform specifically, given they also appear to have a web version?


Grafana Cloud is the best ROI money my startup spends every month.


What's your spend? We're way into five digits and we're not getting that in ROI. We're heavy on metrics, which come at a huge cost.


This looks really sharp! Love the opinionated approach to how to handle incidents with assigned roles!



Jacob doesn't work here at Grafana Labs anymore unfortunately, but it's nice to know he's still keeping tabs on us and likes what we're up to :)


You should probably ask him to remove the Grafana employment reference/email from his Github profile then.


Ha ha you found it lol


It seems like this is a special case of project management software. If the existing products can't handle incidents then that software should be improved, not new software written. It's the best way to ensure that everybody on the team knows how to use the software when it's most urgently needed.

E.g. would you change your favorite editor to a different one, in case of an incident? Probably not. So why change project management systems?


While you certainly could cobble together incident response workflows in something like Jira, I think it makes more sense to extend the monitoring and paging tooling (in large part due to the reason you mention— familiarity with the tools that you're using as part of that response).


Jira now has OpsGenie so you don’t have to cobble anything together, in theory.


Did we watch a different presentation? ChatOps isn't new. What you're describing is what I would consider an antiquated practice. Nobody wants to go sniffing around a PM tool at 3 AM.


Zero here!


You must have solid tech. :)


This is timely... I just started building out an internal "chatops" solution that leans heavily on OnCall. Looks like I may be able to set that aside.

If this is implemented as cleanly as OnCall, I have high hopes. It isn't without bugs, but it's already miles ahead of solutions like PagerDuty (in my opinion).


PagerDuty is a product that has not evolved much at all in the last 10 years, unfortunately.


Please reach out to me, it would be awesome to learn about your experience of using our API and make sure we're aware of all the bugs you noticed (and fix them!), matvey.kukuy @grafana.com


I guess "bugs" is a strong word, more like "suboptimal UX", but I'll definitely reach out with details.


I'd check out FireHydrant, but I'm biased ;)


Yeah, there are definitely already products in this space, but we're already invested in Grafana, so it makes sense to lean in that direction, even if it meant a little custom work on our end (though it looks like that may not be necessary now)


In most places I have been involved with, ServiceNow has been the core of incident management, from alert playbooks to follow-up on system/component uptime and daily/monthly/yearly SLA breaches.

Any system that is offered for enterprises should somehow integrate into that solution.

Generally speaking, I can say that ServiceNow is horrible to work with, use, manage; but it looks like it is the solution that is dominant in enterprises.


Will it always be a Grafana Cloud only offering?


For now, yes. Long term we're trying to offer everything we do both on premise and in the cloud. It's a bit tricky, so we can't say when....


Would it be possible to have a split offering, with both on prem and cloud? In my mind I would prefer to have things like Prometheus, logs, and metrics stored on prem, mainly due to the volume of logs and metrics we create. Then use Grafana Cloud for Grafana dashboards, Loki logs, and incident management that pull directly from my on prem data stores. I bring this up as it may be cost prohibitive for us to store our metrics in the cloud (we make so many metrics and logs!) but I would love to offload hosting the front end. Grafana Cloud takes care of managing and maintaining the Grafana dashboards and backend database, authentication, updates, etc. I'm fine hosting Prometheus and Loki locally, have been for a long time! I just get annoyed having to host Grafana and set it up, set the database up, configure auth, etc.


I’m pretty sure that is doable today: Hosted Grafana with data sources pointing at your on-prem Prometheus and Loki.

https://grafana.com/docs/grafana-cloud/fundamentals/gs-visua...

(I work for Grafana Labs, but not on this part)


> It's a bit tricky, so we can't say when....

I'm curious about this part, and I can absolutely understand if you don't want to answer but I do have the following question:

Why is it tricky to ensure an application can run on a cloud deployed system or a local Kubernetes/Docker Swarm/newfangled containerization mechanism of choice/etc. system?

Specifically I'm wondering what barriers you're running into that are pushing the focus to go cloud only.


Is there any hope of a Grafana Cloud data access proxy that runs on prem and enables us to give the Cloud access to databases we cannot expose?


Yes! It’s something we’ve been mulling for a while, and I was just talking to one of the PMs about it this morning. This year for sure I hope.


Yeah, building for Grafana Cloud has big dev benefits too. We can iterate quickly, run live experiments, and build a more complicated stack (e.g. for ML tasks). We're going to be integrating more and more with the rest of Grafana too. All of this is much easier to do in one place.


It also has drawbacks, like being locked into SaaS products that you don't have a lot of insight into.


Seems like the industry is headed in that direction.


It’s funny what process can do.

13 years ago I was working on a SaaS eCommerce platform and it feels like this tool is a relatively minor improvement over what we had built on top of IRC.

That said, it’s pretty cool and I’m definitely going to evaluate it, as our current PagerDuty integration is not nearly as clean as this.


So does Grafana actually believe in open source or not?


[flagged]


You'd do your job as a CEO better if you didn't spam competitors' HN threads with your own product, unless you have something relevant to bring to the table. This comment just looks like a shameless plug because you're in the same sector.

One way you could approach this is to highlight what you think is good about Grafana's implementation and what could be better, and then contrast that with your own offering, without sounding like a salesman.


Seems post was deleted, not a great look for the CEO of incident.io


This is just incredibly rude. Please don't do it again.



