
I'd be interested in hearing how it performs when rendering a huge page of graphs each with dozens to hundreds of graph lines.

Unfortunately, though, Prometheus lacks easy horizontal scaling, just like Graphite. Prometheus actually sounds worse, since it relies on manual sharding rather than the consistent hashing that Graphite uses. This rules out Prometheus as an alternative to Graphite for me, even if it does render complex graphs better. I'm definitely keeping my eye on this one, though.




> huge page of graphs each with dozens to hundreds of graph lines

From experience, that much data on a page makes it quite difficult to comprehend, even for experts. I've seen hundreds of graphs on a single console, which was completely unusable. Another had ~15 graphs, but it was badly presented and took the (few) experts many minutes to interpret. A more aggregated form with fewer graphs tends to be easier to grok. See http://prometheus.io/docs/practices/consoles/ for suggestions on consoles that are easier to use.

> Prometheus actually sounds worse, since it relies on manual sharding rather than the consistent hashing that Graphite uses.

The manual sharding is vertical. That means a single server monitors the entirety of a subsystem (for some possibly very broad definition of subsystem). This has the benefit that all the time series for that subsystem are in the same Prometheus server, so you can use the query language to efficiently do arbitrary aggregation and other math to make the data easier to understand.
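For example (the metric name and labels here are made up, purely to illustrate), a single query can compute an error ratio across every instance of a subsystem:

    # Error ratio per job, aggregated across all instances in one
    # query, since all of their series live in the same server.
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      / sum(rate(http_requests_total[5m])) by (job)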


It depends on the use case. For a high-level overview of systems you absolutely want fewer graphs. Agreed there. For deep dives into "Why is this server/cluster running like crap!?", having much more (all?) of the information right there to browse makes a big difference. I originally went for fewer graphs separated across multiple pages you had to navigate to, and no one liked it. In the end we adopted both philosophies, one for each use case.

Lots of lines on a single graph help you notice imbalances you may not have noticed before. For example, if a small subset of your cluster has lower CPU usage, you likely have a load-balancing problem or something else weird going on.
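That sort of imbalance is also something you can query for directly. A rough PromQL sketch (the node_cpu-style metric name is a placeholder):

    # The 3 instances with the lowest non-idle CPU usage in a job;
    # a consistently underloaded subset hints at a load-balancing bug.
    bottomk(3, sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance))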

RE: sharding. What happens when a single server can no longer hold the load of the subsystem? You have to shard that subsystem further by something random and arbitrary. It requires manual work to decide how to shard, and once you have too much data and too many servers that need monitoring, manual sharding becomes cumbersome. It's already cumbersome in Graphite, since expanding a carbon-cache cluster means moving data around when the hashing changes.


> having much more (all?) of the information right there to browse makes a big difference.

I think it's important to have structure in your consoles so you can follow a logical debugging path: starting at the entry point of your queries, checking each backend, finding the problematic one, going to that backend's console, and repeating until you find the culprit.

One approach, console-wise, is to put the less important bits in a table rather than a graph, and potentially to have further consoles if there are subsystems complex and interesting enough to justify them.

I'm used to services that expose thousands of metrics (and there are many more time series once labels are taken into account). Having everything on consoles with such rich instrumentation simply isn't workable; you have to focus on what's most useful. At some point you're going to end up in the code, and from there see what metrics (and logs) that code exposes, graph them ad hoc, and debug from there.

> Lots of lines on a single graph help you notice imbalances you may not have noticed before. For example, if a small subset of your cluster has lower CPU usage, you likely have a load-balancing problem or something else weird going on.

Agreed, scatterplots are also pretty useful when analysing that sort of issue. Is it that the servers are efficient, or are they getting less load? A qps vs cpu scatterplot will tell you. To find such imbalances in the first place, taking a normalized standard deviation across all of your servers is handy - which is the sort of thing Prometheus is good at.
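For instance, the normalized standard deviation (stddev over mean, i.e. the coefficient of variation) is a one-liner in the query language; the metric name is again just illustrative:

    # stddev/mean of per-instance QPS within a job; the further above
    # zero this sits, the more unevenly loaded the instances are.
    stddev(rate(http_requests_total[5m])) by (job)
      / avg(rate(http_requests_total[5m])) by (job)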

> You have to shard that subsystem further by something random and arbitrary.

One approach would be to have multiple Prometheus servers with the same list of targets, configured to do a consistent partition of the targets between them. You'd then need an extra aggregation step to get the data from the "slave" Prometheus servers up to a "master" Prometheus via federation. This is only likely to be a problem when you hit thousands of a single type of server, so the extra complexity tends to be manageable, all things considered.
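Roughly, the master would scrape each slave's /federate endpoint for pre-aggregated series. A sketch of what that scrape config might look like (the job name, targets, and match[] selector are all placeholders):

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{__name__=~"job:.*"}'   # only pull pre-aggregated series
        static_configs:
          - targets: ['slave-prom-1:9090', 'slave-prom-2:9090']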



