I'm currently leading a observability team at a 150+ engineer, 150+ services company. We spend the last 6 months building infrastructure around Prometheus, including instrumentation, alerting rules, high availability and scalability/sharding. We have a multi-kubernetes-cluster setup and multi-customer-shards with a high growth rate.
When we started, our observability stack was composed of Riemann, Splunk and thousands of unloved cloudwatch alerts. We felt Riemann was very difficult to work with for our setup and our multiple shards, and we evaluated what options we had, finally choosing Prometheus.
We spend a lot of time pairing with squads so they can learn Prometheus and work on their own creating instrumentations and alerts.
We're in the middle of the rollout for Prometheus. Roughly 50% of the services and squads are using metrics and alerts, and this is on-track to becoming 90% until the end of the next quarter.
Having a metrics and alerting platform ready to use is making engineers much more confident on their services, but we're still facing some incidents that are reported by customer support instead of engineering tools, which could mean either we're missing something, or we just have to double down on the rollout to make sure everything is covered and alerting correctly.
Could you share your thoughts and experience around observability?
By looking at graphs on a daily basis, one can become much more rapid in being able to diagnose issues quickly, and also understand what the root cause is of the underlying issue (assuming you're measuring XYZ that broke). So many times I was able to look at a graph and go, "Something isn't right", long before an alert would go off. I wasn't magical, I just had spent a little bit of time every day reviewing graphs, so that I could pick out subtle changes across a handful of graphs very rapidly, which would have also been hard to pre-emptively build checks around.
One additional thought is to not over-alert. Alert fatigue is your greatest enemy, and needs to be avoided at all costs. If you're not reacting to an alert, you should question whether or not you need it.