I've worked with similar scale, and the situation is basically that when you have a thousand people working on a service, they have different needs. Ops needs a variety of host and container level metrics. Not just for action by stakeholders, but for autoscaling, autoremediation, etc. If you have a few thousand servers, you're probably talking 100k metrics right off the bat. More if you want statsd aggregate metrics instead of just one summary stat.
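To make the "100k metrics right off the bat" claim concrete, here's a back-of-the-envelope sketch. All the numbers (host count, metrics per host, aggregates per timer) are made-up but plausible assumptions, not figures from any particular deployment:

```python
# Back-of-the-envelope metric cardinality; every number here is illustrative.
hosts = 2000               # "a few thousand servers"
host_metrics = 50          # CPU, memory, disk, network, container stats per host

base_series = hosts * host_metrics  # host/container series alone
print(base_series)  # 100000

# Each statsd timer typically fans out into several aggregate series rather
# than one summary stat: count, mean, upper, lower, and a few percentiles.
aggregates_per_timer = 6
app_timers_per_host = 20

timer_series = hosts * app_timers_per_host * aggregates_per_timer
print(timer_series)  # 240000
```

The point is that the statsd fan-out multiplies, not adds: every timer you define becomes half a dozen stored series per host.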
And if you have microservices, you want to track how well each client-server pair is doing, on both sides of the equation, which means tracking error codes, success/fail rates, etc.
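A minimal sketch of what per-pair tracking looks like, using a plain dict-backed counter; a real setup would use a metrics library with labels (e.g. Prometheus client counters), and the service names here are invented:

```python
from collections import Counter

# Counter keyed by (client, server, outcome); a stand-in for a labeled
# metrics counter. Service names are hypothetical.
requests: Counter = Counter()

def record(client: str, server: str, status: int) -> None:
    """Record one request outcome for a client-server pair."""
    outcome = "success" if status < 500 else "failure"
    requests[(client, server, outcome)] += 1

# Each side of a call emits its own observation, so the same request
# shows up in both the client's and the server's series.
record("checkout", "inventory", 200)
record("checkout", "inventory", 503)
record("search", "inventory", 200)

print(requests[("checkout", "inventory", "failure")])  # 1
```

Note the cardinality implication: the series count grows with the number of client-server pairs times status outcomes, which is exactly why microservices blow up metric counts.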
Finance wants its own metrics to measure capacity versus utilization to prove to the CFO the spending is appropriately constrained.
Devs want to prove their system works and works quickly, so you'll have a variety of metrics revolving around subcomponent usage and performance timing. Maybe even cache hit rates.
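A sketch of the kind of instrumentation devs add for subcomponent timing; the `timed` decorator and `timings` store are made-up names for illustration, not a real library's API:

```python
import time
from functools import wraps

# In-memory store of per-subcomponent call durations (hypothetical).
timings: dict[str, list[float]] = {}

def timed(name: str):
    """Record the wall-clock duration of each call under `name`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings.setdefault(name, []).append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed("render_page")
def render_page(n: int) -> int:
    return n * 2  # stand-in for real subcomponent work

render_page(21)
print(len(timings["render_page"]))  # 1
```

In practice the recorded durations would be flushed to statsd or a similar aggregator rather than held in memory, but the shape is the same: wrap each subcomponent, emit a timer per call.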
Not all of these metrics will spark action by stakeholders. Some will be retained 'just in case', since you can't retroactively collect data. When perf drops in a canary because GC pauses are increasing, you definitely want to be able to see both performance metrics over time and GC metrics.