Prometheus: An open-source service monitoring system and time series database (prometheus.io)
287 points by jjwiseman on Feb 4, 2015 | 114 comments



Graphite maintainer here; see my other post about problems with graphite:

https://news.ycombinator.com/item?id=8908423

I'm super excited about Prometheus, and can't wait to get some time to see if I can make it work on my Raspberry Pi. That said, I'm also likely to eventually work on a graphite-web / graphite-api pluggable backend that uses Prometheus as the backend storage platform.

The more OSS metrics solutions, the better!


A huge problem I'm having with graphite right now (which is making me look at influxdb, etc.) is its inability to render graphs with lots and lots of lines. For example, CPU usage across a cluster of hundreds of machines almost always times out now. I'm essentially graphing this: system.frontend-*.cpu-0.cpu-used, where "frontend-*" expands to 200 or so machines. I'm not entirely sure where the bottleneck is here. Would love to know if you have ideas. Is this a limitation of graphite-web itself?

I have a large graphite install of 20+ carbon nodes running on SSDs and three additional graphite-web instances in front generating graphs. It's ingesting something like 1 million metrics/min.

Also, I didn't realize there were still graphite maintainers (seriously, not trolling). There hasn't been a release of graphite in well over a year. I assumed it was dead by now. Any idea when we'll get a fresh release?


We are in the final stages of the last 0.9.x release, 0.9.13. From then on, we're going to be making some more noticeable changes that break some backwards compat to make the project a lot more pleasant.

Note that 0.9.13 is almost ready to be cut: https://github.com/graphite-project/graphite-web/commit/7862...

https://github.com/graphite-project/carbon/commit/e69e1eb59a...

https://github.com/graphite-project/whisper/commit/19ab78ad6...

Anything in the master branch is what will be in 0.10.0 when we're ready to cut that. I think we'll spend some more cycles in 0.10.x focusing on non-carbon / non-whisper / non-ceres backends that should allow much better scalability. Some of these include cassandra, riak, etc.

As for the timeouts, it's a matter of general sysadmin spelunking to figure out what is wrong. It could be IO on your carbon caches, or CPU on your render servers (where it uses cairo). I'm a HUGE fan of Grafana for doing 100% of the dashboards and only using graphite-web to spit out JSON, or alternatively using graphite-api.

Take a look at the maxDataPoints argument, though, to see if that helps your graphs not time out.


My brief experience with browser-based rendering was not good. Our dashboard pages often have 40-50+ graphs for a single cluster. I found it brought all browsers to a crawl and turned our laptops into blazing infernos when viewing longer timelines. Granted, I didn't try out Grafana, so it could have been related to badly optimized JavaScript in the ones I tried.

CPU on the render servers is low. IO on the carbon caches is acceptable (10k IOPS on SSDs that support up to 30k or so). If the CPU Usage Type graph would render, it would show very little IO wait (~5%). Graphs if you're interested: http://i.imgur.com/dCrDynY.png

Anyway thanks for the response. I'll keep digging. Looking forward to that 0.9.13 release!


maxDataPoints was a feature added by the guy who wrote giraffe[1], which is for realtime dashboards from graphite. It was too slow until he added the maxDataPoints feature, and now it is actually really awesome when set up properly.

Also look at graphite-api[2], written by a very active graphite committer. It is API only (JSON only), but absolutely awesome stuff. Hook it up to Grafana for a real winner.

[1] http://giraffe.kenhub.com/#dashboard=Demo&timeFrame=1d

[2] https://github.com/brutasse/graphite-api


For comparison, I tried out >1k CPU plots in Prometheus on an m1.large with 2x HDDs. It took 20s with a cold cache.


I'd be interested in hearing how it performs when rendering a huge page of graphs each with dozens to hundreds of graph lines.

Unfortunately, though, Prometheus lacks easy horizontal scaling, just like Graphite. It actually sounds worse, since it mentions manual sharding rather than the consistent hashing that Graphite does. This rules out Prometheus as an alternative to Graphite for me, even if it does render complex graphs better. I'm definitely keeping my eye on this one though.


> huge page of graphs each with dozens to hundreds of graph lines

From experience, that much data on a page makes it quite difficult to comprehend, even for experts. I've seen hundreds of graphs on a single console, which was completely unusable. Another had ~15 graphs, but it took the (few) experts many minutes to interpret them because it was badly presented. A more aggregated form with fewer graphs tends to be easier to grok. See http://prometheus.io/docs/practices/consoles/ for suggestions on consoles that are easier to use.

> It sounds like Prometheus is worse actually since it mentions manual sharding rather than consistent hashing that Graphite does.

The manual sharding is vertical. That means that a single server would monitor the entirety of a subsystem (for some possibly very broad definition of subsystem). This has the benefit that all the time series are in the same Prometheus server, so you can use the query language to efficiently do arbitrary aggregation and other math for you to make the data easier to understand.
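For instance, since all of a subsystem's series sit in one server, a single aggregation expression can collapse hundreds of per-instance series into one per job. A rough sketch in the Prometheus expression language (http_requests_total is a hypothetical counter exported by the instances):

  # Per-second request rate, summed over all instances of each job:
  sum(rate(http_requests_total[5m])) by (job)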


It depends on the use case. For high level overview of systems you absolutely want fewer graphs. Agreed there. For deep dives into "Why is this server/cluster running like crap!?" having much more (all?) of the information right there to browse makes a big difference. I originally went for fewer graphs separated into multiple pages you had to navigate to and no one liked it. In the end we adopted both philosophies for each use case.

Lots of lines on a single graph help you notice imbalances you may not have noticed before. For example, if a small subset of your cluster has lower CPU usage, then you likely have a load balancing problem or something else weird going on.

RE: sharding: What happens when a single server can no longer hold the load of the subsystem? You have to shard that subsystem further by something random and arbitrary. It requires manual work to decide how to shard. Once you have too much data and too many servers that need monitoring, manual sharding becomes cumbersome. It's already cumbersome in Graphite, since expanding a carbon-cache cluster requires moving data around because the hashing changes.


> having much more (all?) of the information right there to browse makes a big difference.

I think it's important to have structure in your consoles so you can follow a logical debugging path, such as starting at the entry point of your queries, checking each backend, finding the problematic one, going to that backend's console, and repeating until you find the culprit.

One approach, console-wise, is to put the less important bits in a table rather than a graph, and potentially have further consoles if there are subsystems complex and interesting enough to justify them.

I'm used to services that expose thousands of metrics (and there are many more time series when labels are taken into account). Having everything on consoles with such rich instrumentation simply isn't workable; you have to focus on what's most useful. At some point you're going to end up in the code, and from there see what metrics (and logs) that code exposes, ad-hoc graph them, and debug from there.

> Lots of lines on a single graph help you notice imbalances you may not have noticed before. For example, if a small subset of your cluster has lower CPU usage, then you likely have a load balancing problem or something else weird going on.

Agreed, scatterplots are also pretty useful when analysing that sort of issue. Is it that the servers are efficient, or are they getting less load? A QPS vs. CPU scatterplot will tell you. To find such imbalances in the first place, taking a normalized standard deviation across all of your servers is handy - which is the sort of thing Prometheus is good at.
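Something along these lines gives you that normalized spread per job, using the stddev and avg aggregators (the node_cpu metric name is just an example):

  # Standard deviation of per-instance CPU rate, normalized by the mean:
  stddev(rate(node_cpu[5m])) by (job)
    / avg(rate(node_cpu[5m])) by (job)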

> You have to shard that subsystem further by something random and arbitrary.

One approach would be to have multiple Prometheus servers with the same list of targets, configured to do a consistent partition between them. You'd then need to do an extra aggregation step and get the data from the "slave" Prometheus servers up to a "master" Prometheus via federation. This is only likely to be a problem when you hit thousands of a single type of server, so the extra complexity tends to be manageable all things considered.
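The aggregation step on each "slave" could be a recording rule roughly like the one below (rule and metric names are illustrative), so the "master" only needs to collect one pre-aggregated series per job rather than every per-instance series:

  # Evaluated on each slave; the master then pulls just this series:
  job:http_requests:rate5m = sum(rate(http_requests_total[5m])) by (job)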


Before digging into converting the backend to Prometheus, you may want to have a look at their actual persistence method. What worries me is that they are still storing one metric per file, and are copying data from one file to the next to perform checkpoints or purges (which is admittedly an improvement over whisper's method of thrashing back and forth across a single file).

They have the code to do an in-place method of appending chunks, but in my admittedly brief foray through the code I could not actually find any use of it.


The graphite backend is pluggable. It isn't converting so much as adding support for it. One of my favorite backends is Cyanite, which ultimately stores the data in Cassandra.


Which storage/purge strategy would you recommend? (see also this thread for more discussion about the storage: https://news.ycombinator.com/item?id=8996832)


Great! We're over in #prometheus on Freenode if there's anything we can help you with.



The SoundCloud announcement is a great overview; there's also a series of blog posts I did at http://www.boxever.com/tag/monitoring going into more depth with end-to-end examples.


"Those who cannot remember the Borgmon are doomed to repeat it" ;)

Just kidding, this is looking really good, I hope to get some hands-on experience with it soon.


After working at a job that had a horrible patchwork of monitoring techniques (including at least two in-house systems), I was desperately pining for borgmon, actually. Never thought those words would come out of my mouth.

This does seem to have addressed at least a couple of the issues with that system, in that its config language is sane, and its scrape format is well-defined and typed.


[deleted]


You mean Borgmon? The Google monitoring system that stopped being a secret years and years ago?

(One example from 2 years ago: https://www.reddit.com/r/IAmA/comments/177267/we_are_the_goo...)


[deleted]


What on earth? Google imposes non-disclosure on ex-employees even for mentioning in-house monitoring systems by name? (Presumably Borg-monitor?)

That strikes me as a bit paranoid, not letting the name of a monitoring system be revealed. Am I missing something?


Once I heard an urban legend about a programmer being sued because he was using mickeyMouse as a variable name.

I wouldn't be surprised if many such non-disclosure cases are meant to keep hungry lawyers from trying to suck some blood out of rich companies. I guess the Star Trek franchise is owned by Paramount or something like that.


Thank you. I did, in fact, make sure the name and the fact that it is a monitoring system were already public (and from an official source) before mentioning them in this post.


What does that mean?


After seeing the Prometheus query language, I opened the comments looking for this quote :).


We've been looking for something like this; unfortunately, the "pull" model won't work for us. We really need a push model so that our statistics server doesn't need access to every single producer. I see the pushgateway, but it seems deliberately not intended as centralized storage.

I wonder what InfluxDB means by "distributed", that is, if I could use it to implement a push (where distributed agents push to a centralized metric server) model.


(initial Prometheus author here)

I wouldn't totally rule out the pushgateway for this use case. If we decided to implement a metrics timeout in the pushgateway (https://github.com/prometheus/pushgateway/issues/19), this would also take care of stale metrics from targets that are down or decommissioned. The pushing clients would probably also want to set a client-side timestamp in that case, as they are expected to be pushing regularly enough for the timestamp to not become stale (currently Prometheus considers time series stale that are >5 minutes old by default; see also the "Improved staleness handling" item in http://prometheus.io/docs/introduction/roadmap/).


Same here; we need push for integration. It appears that Prometheus favours pull to a fault. To me it makes sense to have a push/message infrastructure that you can then write scrapers for to your heart's content. InfluxDB has push, but I read that it uses 12x the storage due to storing metadata with each metric. Yikes!


(Prometheus author here)

Yeah, I did that benchmark with 11x overhead for storing typical Prometheus metrics in InfluxDB in March of 2014. Not sure if anything has changed conceptually since then, but if anyone can point out any flaws in my reasoning, that'd be interesting:

https://docs.google.com/document/d/1OgnI7YBCT_Ub9Em39dEfx9Bu...


Hi, this definitely looks very cool, but how about cases where we have a bunch of instances running behind a load balancer, and each serves its own metrics?

We can't pull them, because hitting the load balancer would randomly choose only one instance.

Instances are scaled up based on load, so we can't specify the target instances in Prometheus because the set keeps changing.

We'd like to try this out, but any ideas what to do for the above?


What you want to do is separately scrape each instance.

We're working on service discovery support[1] so that you can dynamically change which hosts/ports Prometheus scrapes. Currently you can use DNS for service discovery, or change the config file and restart Prometheus.

[1]http://prometheus.io/docs/introduction/roadmap/


InfluxDB CEO here. The next version (0.9.0) has support for tags and efficiently encodes the measurement name + tagset as a single 4 byte uint.

Previously, you would have had to encode metadata in the series name. Otherwise, if you used string columns, you'd see a massive waste of disk space, since they were repeated on every measurement.


What do you use for push-oriented time-series metrics? We are thinking of using LogStash (it has a go client). Unfortunately, as I understand, it's not time-series oriented.


I push all my logs into logstash->elasticsearch.

For metrics I go Riemann->Graphite. Riemann comes with a Graphite-compatible server, so I push straight to that for processing and alerting. I also send from Riemann to logstash and logstash to Riemann where it makes sense.

For my metrics dashboard I use Grafana, which is really awesome. I make use of its templating pretty heavily, as I run a lot of individual IIS clusters. I can create cluster overview and drill-down dashboards and template them so I can just change the cluster number to see a cluster's stats. You can also link directly into a dashboard, passing variables as query string parameters. Pretty excellent.


https://github.com/rcrowley/go-metrics is one I came across while researching potential Prometheus integrations; it seems to support a number of backends, including Graphite and InfluxDB.

One challenge with that model of metrics is that it assumes that the monitoring system doesn't have much ability to work with samples, so it makes up for this by calculating rates and quantiles in the client. This isn't as accurate as doing it in the server, and some instrumentation systems don't allow you to extract the raw data so you can do it in the server.

For example, say you have a 1-minute rate exported/scraped every minute. If you miss a push/scrape, you lose all information about that minute. Similarly, if you're delayed by a second, you'll miss a spike just after the previous push/scrape. If instead you expose a monotonically increasing counter and calculate a rate over that in the server, you don't lose that data.
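Concretely, if the client only exports the raw counter, the server can compute the rate itself over whatever window it likes (the metric name here is illustrative):

  # Per-second rate over the last 5 minutes, computed server-side from the
  # monotonically increasing counter; a missed scrape just widens the window
  # rather than losing the data:
  rate(http_requests_total[5m])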


We favour pull, but we don't force it. The collectd exporter, for example, is an integration with a push-based system.

My full thoughts are up at http://www.boxever.com/push-vs-pull-for-monitoring. The short version is that I consider pull slightly better, but not majorly so. Push is more difficult to work with as you scale up, but it is doable.


InfluxDB CEO here. With the current version of InfluxDB (0.8.8), metadata should be encoded in the name of the series, which means you don't repeat it.

In the next version of InfluxDB (0.9.0), you can encode metadata as tags and it gets converted to a single id.

With either of those schemes you should see much better numbers on storage.


Hey Paul, this is exciting news! So perhaps InfluxDB could work well as a Prometheus long-term storage backend after all. It would be so much nicer than having to run OpenTSDB based on Hadoop/HBase. It might be a while until I find the time, but looking forward to giving it another try!


(One of the Prometheus authors here)

My understanding is that InfluxDB will work for your use case, as it's all push.

> I see the pushgateway, but it seems deliberately not a centralized storage.

Yeah, the primary use case for the pushgateway is service-level metrics for batch jobs.

May I ask why accessing your producers is a problem? I know for Storm I was considering writing a plugin that'd hook into its metrics system and push to the pushgateway, which Prometheus would then scrape. Not my preferred way of doing things, but some distributed processing systems need that approach when work assignment is dynamic and opaque across slaves.


Each node in our network has its own access control permissions and organizational owner. It's easy enough to provide a centralized service with an endpoint for pushing statistics from each node. In this case, organizational participation is "distributed".

A pull model, while technically distributed, is organizationally centralized: I have to get each node's owner to grant me direct access. Politically, and for security reasons, that's not going to happen.


That'd be a challenge alright. One thing to consider is having each owner run their own Prometheus so that they can take advantage of all its features, though it's a harder sell than "just push over here".

http://www.boxever.com/push-vs-pull-for-monitoring looks at other bits of the push vs. pull question.


This is a common problem that you describe.


Is the use case mostly about permissions and networking setup? I could see why a pull model wouldn't work in those cases.


what about druid.io?


Why do you need push replication?

From my experience with distributed systems, push replication will get you in trouble very soon.

Edit: I was too quick to post this question. An obvious scenario is where the client is behind a firewall. Never mind me. I am an idiot.


From the storage system docs:

> which organizes sample data in chunks of constant size (1024 bytes payload). These chunks are then stored on disk in one file per time series.

That is concerning. Is this going to have the same problem with disk IO that Graphite does, i.e. every metric update requiring a disk IO due to the one-file-per-metric structure?


(Disclaimer: I'm a Prometheus author.)

Chunks are only written to that one file per time series once they are complete. Depending on their compression behavior, they will contain at least 64 samples, but usually a couple of hundred. Even then, chunks are just queued for disk persistence. The storage layer operates completely from RAM and only has to swap in a chunk if it was previously evicted from memory. Obviously, if you consistently create more sample data than your disk can write, the persist queue will back up at some point and Prometheus will throttle ingestion.


This was my first thought as well. Having one inode per metric measured can easily overwhelm some file systems, and the IO overhead on those disks gets silly (especially if it's not an SSD capable of constant-time writes against sparse data).

Combine that with the Cacti pull model, and I think a wait-and-see attitude is the best approach for now.


We were tempted to implement our own data management in a couple of bigger files, but extensive testing of various modern file systems (XFS, ext4) led to the perhaps unsurprising finding that those file systems are way better at managing the data than our home-grown solutions.


So the open-source time series DBs (RRD Tool, InfluxDB, and KairosDB), and others like SQLite or even InnoDB, didn't make the cut? That surprises me.

> file systems are way better in managing the data

Except they're not managing data, they're just separating tables, to extend the DB metaphor. And you still run the risk of running out of inodes on a "modern" file system like ext4.

After having briefly dug into the code, I'm particularly worried about the fact that instead of minimizing iops by only writing the relevant changes to the same file, Prometheus is constantly copying data from one file to the next, both to perform your checkpoints and to invalidate old data. That's a lot of iops for such basic (and frequently repeated) tasks.

Still in wait-and-see mode.


So we're admittedly not hardcore storage experts, but we did a lot of experiments and iterations until we arrived at the current storage, and it seems to be performing quite well for our needs, and much better than previous iterations. We're happy to learn better ways of data storage/access for time series data though.

RRD Tool: expects samples to come in at regular intervals and expects old samples to be overwritten by new ones at predictable periods. It's great because you can just derive the file position of a sample based on its timestamp, but in Prometheus, samples can have arbitrary timestamps and gaps between them, and time series can also grow arbitrarily large (depending on the currently configured retention period), and our data format needs to support that.

InnoDB: not sure how this works internally, but given it's usually used in MySQL, does it work well for time series data? I.e. millions of time series that each get frequent appends?

KairosDB: depends on Cassandra, AFAICS. One main design goal was to not depend on complex distributed storage for immediate fault detection, etc.

InfluxDB: looks great, but has an incompatible data model. See http://prometheus.io/docs/introduction/comparison/#prometheu...

I guess a central question that touches on your iops one is: you always end up having a two-dimensional layout on disk: timeseries X time. I don't really see a way to both store and retrieve data in such a way that you can arbitrarily select a time range and time series without incurring a lot of iops either on read or write.


> does it work well for time series data

It's a key/value store at its heart, with all the ACID magic and memory buffering built in.

Almost any KV store would perform relatively well at time series data simply by using updates to overwrite old data instead of constantly deleting old data (assuming the KV store is efficient in its updates).

Issuing updates instead of deletes is possible because you know the storage duration and interval, and can thus easily identify an index at which to store the data.


An earlier iteration of our storage was actually based on LevelDB (key-value store), with this kind of key->value layout:

[time series fingerprint : time range] -> [chunk of ts/value samples]

At least this scheme performed way worse than what we currently have. You could say that file systems also come with pretty good memory buffering and can act as key-value stores (with the file name being the key and the contents the value), except that they also allow efficient appends to values.

> Issuing updates instead of deletes is possible because you know the storage duration and interval, and can thus easily identify an index at which to store the data.

Do you mean you would actually append/update an existing value in the KV-store (which most KV stores don't allow without reading/writing the whole key)?


Did your previous leveldb approach perform way worse on reads, writes or both?


Both. When I switched our production Prometheis over, the consoles rendered more than twice as fast for a simple test case.


InfluxDB CEO here. The data model for 0.9.0 should map up nicely with what you're trying to do. It supports measurement names, and tags. It would be great to work together to have InfluxDB as another option (other than OpenTSDB) for long term storage of metrics data.


Hey Paul, it'd be great to work together! Although long-term storage isn't a SoundCloud priority right now (meaning it's going to be more of a free-time project with less time allocated), it's definitely important for many others and it would be great to have. I think we should start with Prometheus->InfluxDB write support first, and then approach the more complicated problem of reading back from InfluxDB through Prometheus. What's the best place to get informed about the new 0.9.0 data model?


After reading http://influxdb.com/blog/2014/12/08/clustering_tags_and_enha..., I'm not completely clear on what "tags" mean. Are they single-word names, or are they conceptually key=value pairs like labels in Prometheus or tags in OpenTSDB?

EDIT: found https://github.com/influxdb/influxdb/pull/1059. Ok, does seem like tags mean key=value pairs.


Ok, I found http://influxdb.com/blog/2014/12/08/clustering_tags_and_enha.... That all sounds very promising!


The checkpointing is just a way to minimize sample losses from unfinished chunks upon a crash. If you don't care or if you don't crash, you can disable checkpointing. Normal operation will be the same.

Invalidation of old data is super easy with the chunked files as you can simply "truncate from the beginning", which is supported by various file systems (e.g. XFS). However, benchmarks so far showed that the relative amount of IO for purging old chunks is so small compared to overall IO that we didn't bother to implement it yet. Could be done if it turns out to be critical.


This looks very interesting.

From http://prometheus.io/docs/introduction/getting_started/

> Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets.

I wonder if we'll see some plugins that allow data collection via SNMP or Nagios monitoring scripts or the like. That would make it much easier to switch large existing monitoring systems over to Prometheus.


(One of the authors here)

Just last night I wrote https://github.com/prometheus/collectd_exporter, which you could do SNMP with. I do plan on writing a purpose-designed SNMP exporter in the next few months to monitor my home network, if someone else doesn't get there first.


Awesome. I wondered if I should mention collectd as a possible source of data, and now you've already made it available.


See Exporters for third-party systems[0] and Pushing metrics[1].

[0] http://prometheus.io/docs/instrumenting/exporters [1] http://prometheus.io/docs/instrumenting/pushing


I wrote an exporter for Docker, so if you have a Docker-based infrastructure you can get tons of metrics by using it: http://5pi.de/2015/01/26/monitor-docker-containers-with-prom...


It's great to see new entrants into the monitoring and graphing space. These are problems that every company has, and yet there's no solution as widely accepted for monitoring, notifications or graphing as nginx is for a web server.

Not that I'd do a better job, but every time I further configure our monitoring system, I get that feeling that we're missing something as an industry. It's a space with lots of tools that feel too big or too small; only graphite feels like it's doing one job fairly well.

Alerting is the worst of it. Nagios and all the other alerting solutions I've played with feel just a bit off. They either do too much or carve out a role at boundaries that aren't quite right. This results in other systems wanting to do alerting, making it tough to compare tools.

As an example, Prometheus has an alert manager under development: https://github.com/prometheus/alertmanager. Why isn't doing a great job at graphing enough of a goal? Is it a problem with the alerting tools, or is it a problem with boundaries between alerting, graphing, and notifications?


> This results in other systems wanting to do alerting, making it tough to compare tools.

I think you've hit the nail on the head: every monitoring system has to do a bit of everything, such as alerting, machine monitoring, and graphing, rather than focusing on just doing one thing and doing it well.

Prometheus comes with a powerful data model and query language; few existing systems support the same (e.g. Atlas and Riemann have the notion of a query language and labels), so we have to do a bit of everything to produce a coherent system.

I think a separate, common alert manager would be good if we could agree on one; currently that role isn't really being filled in practice, as you end up relying on de-duplication in a tool such as PagerDuty, and support for silencing alerts isn't unified.


Shameless plug, but Bosun's focus is largely on alerting (http://bosun.org). We have an expression language built in that allows for complex rules. It leverages OpenTSDB's multi-dimensional facets to create alert instantiations, but you can also change the scope by transposing your results.

It now supports Logstash and Graphite as backends as well. The Graphite support is thanks to work at Vimeo.

Another nice thing about Bosun is you can test your alerts against time series history to see when they would have triggered so you can largely tune them before you commit them to production.


Great. Yeah, unit tests for alerts are a capability we still need to add. But at least you can already manually graph your alert expressions and see how they would have evaluated over time.

Interesting work on Bosun, by the way! It seems like there is quite some overlap with Prometheus, but I have yet to study it in depth. Is my impression correct that OpenTSDB is a requirement, or is there any local storage component? I guess you could run OpenTSDB colocated on a single node...


OpenTSDB is our main backend, but we can also query Graphite and Logstash. However, the graphing page doesn't work with Graphite.


Ah, ok. By the way, in case you make it to GopherCon this year, it would be interesting to exchange ideas! One of the things I'm happy about is that finally systems with multi-dimensional time series (based on OpenTSDB or not) are becoming more common...


I was there last year, but I'm not sure if I'm going this year yet. Matt is going, though (http://mattjibson.com/). I'd love to chat at some point!

I need to take a closer look at stealing ideas from your tool :-) We are both leveraging Go templates (Bosun uses them for alert notifications, but I've thought about using them to create dashboards as well).


Brian already hinted at this, but the main reason why we do this is to have coherency of the data model across the whole chain:

client app -> Prometheus server -> alerting rules -> alert handling

This allows e.g. initial dimensional labels from the clients to cleanly propagate all the way into alerts, and allows you to silence, aggregate, or route alerts by any label combination.

If OpenTSDB-style dimensional metrics had already been more of a standard, maybe that would be different.


I haven't used Prometheus yet, but I'm not sure I agree that it makes sense to send dimensional time series all the way down the alert chain. It allows full-featured alerts to be used, yes. But you're not going to be able to put all of your alert logic into the Prometheus chain, so you're going to end up having alert rules in at least two places.

For example, in response to the latest security hullabaloo, we recently added a Nagios check for libc being too old, and for running processes that are linked to a libc older than the one on disk. This isn't a time series check; it's just an alert condition. Polling for that alert condition isn't what Prometheus is going to be good at, so cramming it into that alert pipeline is awkward. If we used something like Prometheus, we'd have rules in two places already.

Polling prometheus for saved queries and simply alerting on a threshold might result in a simpler system. The rules that could be expressed in an alert pipeline could be expressed as different queries, and all rules about thresholds, alerting or notifications could be done further down the pipeline. There's still some logic about what you're measuring in Prometheus, but not about what thresholds are alerts, or when, or for how long, or who cares. Threshold rules could all live in one place.

This isn't a knock on Prometheus, which of course supports a workflow along those lines, and seems interesting. Nor is it a knock on the prometheus alert system. I just wonder why software in the monitoring/notification/alerting space always ends up muddy. I believe the overlapping boundaries between projects are a symptom, not a cause. I keep asking myself if there's some fundamental complexity that I'm not yet appreciating.


> This isn't a time series check; it's just an alert condition.

I think you're conflating instrumentation and alerting.

Many systems only offer alert conditions on very low-level data that's tightly bound to the instrumentation (e.g. a single check on a single host); Prometheus is more powerful, as the instrumentation and alerting are separate, with dimensions adding further power.

Prometheus has many instrumentation options. For your example I'd suggest a cronjob that outputs a file to the textfile node_exporter module (echo old_libc_process $COUNT > /configured/path/oldlibc.prom).

You can then set up an alert for when the value is greater than 0 on any host (ALERT OldLibcAlert IF old_libc_process > 0 WITH ...). The advantage of having all the dimensions is that you can analyse that over time, and graph how many hosts had a problem to trend how well you're dealing with the libc issue.
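Spelled out, that alert could look roughly like this in the alerting rule syntax (labels and wording are just placeholders):

  ALERT OldLibcAlert
    IF old_libc_process > 0
    FOR 5m
    WITH {severity="warning"}
    SUMMARY "Processes using an outdated libc on {{$labels.instance}}"
    DESCRIPTION "{{$value}} running processes on {{$labels.instance}} still link against the old libc"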

A big win for having dimensions for alerting all the way up the stack is in silencing, notification and throttling of related alerts. You can slice and dice and have things as granular as makes sense for your use case.

> Polling prometheus for saved queries and simply alerting on a threshold might result in a simpler system. The rules that could be expressed in an alert pipeline could be expressed as different queries, and all rules about thresholds, alerting or notifications could be done further down the pipeline.

There is an alert bridge to nagios if you want to send your alerts out that way. You'll lose some of the rich data that the dimensions provide though.


> I think you're conflating instrumentation and alerting.

A bit, because I'm hesitant to cram into a time series database data that I consider boolean. I'd move past this for consistency if the checks were easy to create, though.

> Prometheus has many instrumentation options. For your example I'd suggest a cronjob that outputs a file to the textfile node_exporter module (echo old_libc_process $COUNT > /configured/path/oldlibc.prom).

Interesting. The architecture and configuration of this exporter aren't documented yet, but if it just magically sent numbers upstream from text files that I keep up to date, that might be worth a migration by itself. Nagios SSH checks are ridiculous to manage.

Thanks for indulging my conversation in any case. I'll put this on my radar and watch where it goes.


> I'm hesitant to cram into a time series database data that I consider boolean.

A boolean can be treated as a 1 or a 0. Sometimes you can even go richer than that, such as with a count of affected processes from your example - which you could convert back to a 1/0 if you wanted in the expression language.
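For example, a count of currently affected hosts falls straight out of the comparison operators (again assuming the old_libc_process metric from above):

  # Number of hosts reporting at least one affected process; the comparison
  # filters out hosts where the value is 0:
  count(old_libc_process > 0)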

> That architecture or configuration of this exporter isn't documented yet

The node_exporter as a whole still needs docs. That module is pretty new, and it's not exactly obvious what it does or why it exists.

Labels/dimensions are supported too; it's the same text format that Prometheus uses elsewhere (http://prometheus.io/docs/instrumenting/exposition_formats/).

> if just magically sent numbers upstream from text files that I keep up to date

Pretty much. You'll also get the rest of the node exporter's modules (CPU, RAM, disk, network) and the associated consoles, which http://www.boxever.com/monitoring-your-machines-with-prometh... describes how to set up.


So... how does this compare to http://riemann.io/ ? I just re-discovered Riemann and was thinking of pairing it with Logstash and having a go. It would seem Prometheus does something... similar?


From my look at Riemann, it seems aimed more at event monitoring than time-series monitoring. You can (and many do) use Riemann as a time-series monitoring system; my understanding is that Prometheus is a bit better for multi-dimensional labels.

I could see Riemann being used as an alert manager on top of Prometheus, handling all the logic around de-duping of alerts and notification. Prometheus's own alert manager is considered experimental.


Riemann has no concept of persistence, unless you forward events to some other system.


Looks really promising for smaller clusters. However, the pull/scraping model for stats could be problematic for larger scale.

I've been experimenting with metrics collection using heka (node) -> amqp -> heka (aggregator) -> influxdb -> grafana. It works extremely well and scales nicely but requires writing lua code for anomaly detection and alerts – good or bad depending on your preference.

I highly recommend considering Heka[1] for shipping logs to both ElasticSearch and InfluxDB if you need more scale and flexibility than Prometheus currently provides.

[1] https://github.com/mozilla-services/heka


> However, the pull/scraping model for stats could be problematic for larger scale.

From experience with similar systems at massive scale, I expect no scaling problems with pulling in and of itself. Indeed, there are some tactical operational options you get with pull that you don't have with push. See http://www.boxever.com/push-vs-pull-for-monitoring for my general thoughts on the issue.

> InfluxDB

InfluxDB seems best suited for event logging rather than systems monitoring. See also http://prometheus.io/docs/introduction/comparison/#prometheu...


Good point on push vs. pull. I'm biased towards push because of microservices that behave like batch jobs. In effect, I'm using AMQP in a similar way to the Prometheus pushgateway.

Agreed that InfluxDB is suited for event logging out of the box, but the March 2014 comparison of Influx is outdated IMO.

I'm using Heka to send numeric time series data to Influx and full logs to ElasticSearch. It's possible to send full logs to non-clustered Influx in 0.8, but it's useful to split out concerns to different backends.

I also like that Influx 0.9 dropped LevelDB support for BoltDB. There will be more opportunity for performance enhancements.


Yeah, I would be really interested in hearing any arguments that would invalidate my research (because hey, if InfluxDB were actually a good fit for long-term storage of Prometheus metrics, that'd be awesome, since it's Go and easy to operate).

However, if the data model didn't change fundamentally (the fundamental InfluxDB record being a row containing full key/value metadata vs. Prometheus only appending a single timestamp/value sample pair for an existing time series whose metadata is only stored and indexed once), I wouldn't expect the outcome to be qualitatively different except that the exact storage blowup factor will vary.

Interesting to hear that InfluxDB is using BoltDB now. I benchmarked BoltDB against LevelDB and other local key-value stores around a year ago, and for a use case of inserting millions of small keys, it took 10 minutes as opposed to LevelDB taking a couple of seconds (probably thanks to its write-ahead log, etc.). So BoltDB was a definite "no" for storing the Prometheus indexes. Also, it seems that the single file in which BoltDB stores its database never shrinks when you remove data from it (even if you delete all the keys). That would also be bad for the Prometheus time series indexing case.


InfluxDB CEO here. It's true that Bolt's performance is horrible if you're writing individual small data points. It gets orders of magnitude better if you batch up writes. The new architecture of InfluxDB allows us to safely batch writes without the threat of data loss if the server goes down before a flush (we have something like a write-ahead log).

Basically, when the new version comes out, all new comparisons will need to be done because it's changing drastically.


> the March 2014 comparison of Influx is outdated IMO.

I think we expected that, feel free to add comments on the doc for things that are different now.


Cool. Will do once InfluxDB 0.9 is released in March. Not worth comparing to 0.8 since so much is changing under the hood.


While monitoring is obviously useful, I'm not understanding the obvious importance of a time series database. Can you collect enough measurements for the time series database to be useful? I worry that I would have lots of metrics to back up my wrong conclusions. I also worry that so much irrelevant data would drown out the relevant stuff and cause the humans to ignore the system over time. I work with computers and servers, not airplanes or trains.


> Can you collect enough measurements for the time series database to be useful?

Yes, instrument everything. See http://prometheus.io/docs/practices/instrumentation/#how-to-...

> I worry that I would have lots of metrics to backup my wrong conclusions.

This is not so much a problem with time series as a question of epistemology. Well-chosen consoles will help your initial analysis, and after that it's down to correct application of the scientific method.

> I also worry that so much irrelevant data would drown out the relevant stuff

I've seen many attempts by smart people to try and do automatic correlation of time series to aid debugging. It's never gotten out of the toy stage, as there is too much noise. You need to understand your metrics in order to use them.


After reading this thread and comparing InfluxDB and Prometheus, I've concluded that both look promising. I was going to go with Prometheus (as it's easier to get started with), but I was really put off by the 'promdash' dashboard - it uses iframes and depends on MySQL. So I'm going with InfluxDB + Grafana, and I'll keep an eye out for developments.


The only time PromDash uses iframes is when you specifically add a "frame" (iframe) widget to your dashboard to embed arbitrary web content. Native Prometheus graphs and pie charts don't use iframes at all.

Some kind of SQL backend is a dependency for now, however.


Ah! Good to know, thanks. Somehow I missed that.


You may want to look at the console templates, as they don't have any dependencies:

http://prometheus.io/docs/visualization/consoles/


It would be interesting to see how this compares to InfluxDB.



I'm a little wary of monolithic solutions to monitoring/graphing/time series data storage - it gives me flashbacks of Nagios/Zabbix ;)

I currently use a combination of Sensu/Graphite/Grafana, which allows a lot of flexibility (albeit with some initial wrangling with the setup).


I'm not sure what's wrong with Nagios or Zabbix... I use them both in different capacities, and they are good at what they do.

Of course a piecemeal solution is more flexible, but as you said, configuration can be a beast, so many people prefer monolithic systems.


In your architecture I see a single monolithic database server called 'Prometheus'. Does it shard? I can't find it in the documentation. You mention it's compatible with TSDB; why did you choose to implement your own backend, or is this a fork of TSDB?

The tech does look awesome though!


> Does it shard?

Currently you can manually shard vertically, and in the future we may have support for some horizontal sharding for when the targets of a given job are too many to be handled by a single server. You should only hit this when you get into thousands of targets.

Our roadmap[1] includes hierarchical federation to support this use case.

> You mention it's compatible with TSDB, why did you choose to implement your own backend, or is this a fork of TSDB?

Prometheus isn't based on OpenTSDB, though it has the same data model. We've got a comparison[2] in the docs. The core difference is that OpenTSDB is only a database; it doesn't offer a query language, graphing, client libraries, or integration with other systems.

We plan to offer OpenTSDB as a long-term storage backend for Prometheus.

[1] http://prometheus.io/docs/introduction/roadmap/ [2] http://prometheus.io/docs/introduction/comparison/#prometheu...


At the bottom of http://prometheus.io/docs/introduction/comparison/ there is: "... Prometheus will be simpler to run initially, but will require explicit sharding once the capacity of a single node is exceeded."


I previously used InfluxDB plus a custom program to scrape HTTP endpoints and insert the results into InfluxDB.

After playing around with Prometheus for a day or so, I’m convinced I need to switch to Prometheus :). The query language is so much better than what InfluxDB and others provide.


(Prometheus author here)

Thanks, that's awesome to hear! Feel free to also join us on #prometheus on freenode or our mailing list: https://groups.google.com/forum/#!forum/prometheus-developer...


Shameless plug: this looks quite similar to FnordMetric, which also supports labels/multi-dimensional time series, is StatsD wire-compatible, and supports SQL as a query language (so you won't have to learn yet another DSL).


Guys, I've seen the libs for collecting service info, but how do I get OS-level info, like load average, disk utilization, RAM, etc.?

I suppose that there's a simple service that we need to deploy on each server?

Any tips on this use case?


We support that use case, here's a guide: http://www.boxever.com/monitoring-your-machines-with-prometh...


That's great!

> For machine monitoring Prometheus offers the Node exporter

Is it possible for the frontend to utilize data from the cron-invoked sar/sadc that already covers much of this data?

http://sebastien.godard.pagesperso-orange.fr/


The default instrumentation that comes with node_exporter covers far more than what sysstat provides. Retrieving data from /proc at every scrape also gives you a more accurate timestamp, which helps reduce graph artifacts.

As an aside, if you have machine-level cronjobs you want to expose metrics from, you can use the textfile[1] module of the node_exporter, which reads in data from *.prom files in the same format as accepted by the Pushgateway[2].

[1]https://github.com/prometheus/node_exporter/blob/master/coll... [2]https://github.com/prometheus/pushgateway


Thanks from a fellow Brazilian!


PromDash, the dashboard builder for Prometheus, is written in Ruby on Rails: https://github.com/prometheus/promdash


This looks great! Is an official Python client library on the roadmap?


Yes, I've got something working at the moment. It needs cleanup, docs, unit tests, etc.

If you want to help out, it's up at https://github.com/brian-brazil/client_python


Huh, how fortuitous. I've been looking for this exact type of thing and HN gives me a great starting place to evaluate.


Is it just me, or is it impossible to navigate the documentation on an iPad?


I don't own any Apple devices, so I can't test, but it works well on my Android phone (the site is Bootstrap-based and uses responsive design). The top menu collapses into a button with three horizontal bars which shows the menu when tapped. The documentation-specific navigation is always displayed expanded when in the docs section of the site, but the contents are displayed underneath it. What does it look like for you?


On my iPhone it looks good.




