Unfortunately, yes. The OTel Collector has plans to implement a WAL for the OTLP exporter, and once it does, you should be resilient to the upstream temporarily having issues.
What are the merits of the prometheus approach versus one where events/metrics and their original timestamps can be preserved, stored (during temporary outages), then forwarded to backends when connectivity is reestablished?
Nice, do I understand it correctly that this would mean there is a straightforward way to let Prometheus ingest the new histogram type without needing any new daemons (like otel-collector)?
Currently, the text format for pull metrics doesn't appear to support it.
We've been using Influx with much success. I just don't think Prometheus's pull model is the right one for metrics, especially in isolated sites like a DC. Has anyone successfully migrated from Influx to Prometheus? If so, why did you do that? What's better now?
I've been using telegraf + influxdb + grafana for my last projects, and never really had to tweak anything after uncommenting the right sections in telegraf.conf. Is Prometheus and its associated tools an alternative to that stack?
VictoriaMetrics has many users who have successfully migrated from InfluxDB. It supports data ingestion via the Influx line protocol, so you can continue using Telegraf and send the collected metrics to VictoriaMetrics instead of InfluxDB. You get the following benefits after the migration from InfluxDB to VictoriaMetrics:
- Reduced memory usage by up to 10x [1].
- Reduced disk space usage.
- Higher query performance.
- Better query language than InfluxQL and Flux for typical queries over collected metrics [2].
- Compatibility with Prometheus ecosystem.
See also InfluxDB -> VictoriaMetrics migration guide [3].
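In practice the switch described above is mostly a one-line change in telegraf.conf, since single-node VictoriaMetrics accepts the Influx line protocol on its default port (8428). A minimal sketch, with a placeholder hostname:

```toml
# Sketch: point Telegraf's existing InfluxDB output at VictoriaMetrics.
# "victoriametrics.example" is a placeholder; 8428 is the default
# single-node VictoriaMetrics port, which accepts Influx line protocol.
[[outputs.influxdb]]
  urls = ["http://victoriametrics.example:8428"]
  # VictoriaMetrics has no database concept, so skip the create call:
  skip_database_creation = true
```

The rest of your Telegraf inputs can stay exactly as they are.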
Prometheus will always pull metrics from a metrics-exposing endpoint. However, Prometheus can then push metrics to anything that has the proper integrations for remote writing [0]. So you could run Prometheus in Agent Mode [1] in your DC to ingest metrics and push them to some central location.
Prometheus can also receive remote write requests; however, we recommend only writing metrics scraped by another Prometheus or the agent. The data model still has a few things that expect the metrics to have been scraped.
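The Agent Mode setup mentioned above boils down to a normal scrape config plus a remote_write section. A minimal sketch, with placeholder targets and URL (Agent Mode itself is enabled with the `--enable-feature=agent` flag, not in the config file):

```yaml
# Sketch of a prometheus.yml for an in-DC agent forwarding to a
# central receiver. Target and URL are placeholders.
scrape_configs:
  - job_name: local_node
    static_configs:
      - targets: ["localhost:9100"]

remote_write:
  - url: "https://metrics-central.example/api/v1/write"
```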
Anyone with experience scaling Prometheus horizontally? We are reaching the limits of our instance, memory- and CPU-wise, and I have yet to choose between scaling it myself with sharding or using Thanos/Victoria/Cortex.
If you want to query across the whole data set, use one of the other things.
Prometheus has a "federation" option but there's not been any active work on it for years.
It's basically the definition of Thanos - take a bunch of Prometheus instances and query across them. Plus long-term storage in S3.
VictoriaMetrics, Cortex, and Mimir are centralised data stores that accept data from multiple Prometheus instances, but you could also run headless agents scraping and sending the data.
Note if you are on a version before 2.44, try upgrading. Prometheus slimmed down a bit.
I've been through this song and dance. Did months-long PoCs (with live data, running next to the then-production Prometheus deployment) of Thanos, Cortex and Victoria Metrics.
VM won hands down on pretty much all counts. It's easy and simple to operate and monitor, it scales really well, and you can plan around how you want to partition and scale each component. It's also incredibly cheap to run: its performance was superior to the others even when backed by spinning HDDs while the other solutions ran on SSDs.
It's especially easy to operate on Kubernetes using their CRDs and operators.
I am not associated with Victoria Metrics in any way, just a happy user and sysadmin who ran it for a few years.
VictoriaMetrics was recommended to me by a contractor and I've been very happy with it as well. It does have an option to push in metrics, which I intend to use with transient environments like CI jobs and the like, though I haven't gotten there yet.
Yep, we used to use that in a few places. CI jobs, batch processes, etc. Prometheus has PushGateway which we also used before migrating to VM, but it had certain drawbacks (can't recall exactly what, sorry) that the new solution didn't.
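For short-lived jobs like the CI runs mentioned above, the push typically amounts to a single HTTP PUT of text-format metrics to a Pushgateway-style endpoint. A minimal sketch of what gets sent (the host and job name are placeholders):

```python
# Sketch: build the URL and body for a one-shot metric push from a CI
# job to a Pushgateway. All metrics here are rendered as gauges.
def pushgateway_request(host, job, metrics):
    """Return (url, body) for a Pushgateway PUT.

    `metrics` maps metric name -> (help text, value).
    """
    url = f"http://{host}/metrics/job/{job}"
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The text exposition format requires a trailing newline.
    body = "\n".join(lines) + "\n"
    return url, body

url, body = pushgateway_request(
    "pushgateway.example:9091",  # placeholder host
    "nightly_build",             # placeholder job name
    {"ci_job_duration_seconds": ("Wall-clock duration of the CI job.", 421.7)},
)
# The actual push would then be e.g.: requests.put(url, data=body)
```

Note that a Pushgateway holds the last pushed value until it is deleted or overwritten, which is one reason people move such workloads to a store with native push ingestion instead.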
Operational nightmare, expensive to run, various parts of the entirely-too-many moving pieces it contains broke all the time and the performance was…unimpressive.
I've heard that some people manage to run this thing successfully, and power to them, but I want nothing more to do with it.
Just save yourself the pain and use Victoria Metrics. Added benefit: you get an implementation of a rate function that’s actually correct.
I have been running Mimir reasonably well. When it comes to performance, what exactly did you find unimpressive? Interested to know any pitfalls or pain points you have encountered so far?
I've had an interesting time transitioning our project from OpenCensus to OpenTelemetry now that the former is EOL'd. We use the otel stackdriver output. Anyone have a reference comparison between GCP cloud metrics vs. a prometheus monitoring stack?
I did use Stackdriver for quite a while before I moved to Mimir. TBH it's great that you are still sticking to OpenTelemetry. Stackdriver as metric storage is not even a wise option in today's world, given that there are some really good TSDB providers, SaaS or otherwise, that would do a much better job.
I moved away for two primary reasons:
1. The cost of stackdriver can add up with large-scale deployments or high-frequency metrics. It's essential to monitor and control usage to avoid unexpected billing.
2. I have experienced delays in metric updates, specifically with high-frequency data. While the delays are usually minimal, they may not be ideal for some real-time monitoring use cases. FYI, GCP makes metrics for its own resources available only after 210s, so you are always behind.
Going the TSDB route to reliably run storage has worked for me.
Prometheus is known as a monitoring system that promotes the pull model over the push model for metrics collection, i.e. it is configured to discover scrape targets and then scrape metrics from them at regular intervals - https://www.robustperception.io/its-easy-to-convert-pull-to-...
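The pull model means each process just exposes its current metric values over HTTP and Prometheus does the collecting on its own schedule. A minimal sketch of such a scrape target (port and metric name are illustrative):

```python
# Sketch: a process exposing metrics on /metrics for Prometheus to pull.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values):
    """Render a dict of metric name -> value in the text exposition format,
    treating every metric as a gauge for simplicity."""
    out = []
    for name, value in sorted(values.items()):
        out.append(f"# TYPE {name} gauge")
        out.append(f"{name} {value}")
    return "\n".join(out) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics({"app_requests_in_flight": 3}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To actually serve (blocks forever):
# HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Prometheus would then list this endpoint as a scrape target; the process itself never needs to know where Prometheus lives.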
It is interesting to see whether Prometheus will be transformed into a multi-model (pull+push) monitoring system after the addition of OpenTelemetry protocol support.
P.S. VictoriaMetrics, the Prometheus-like monitoring system I work on, also gained support for OpenTelemetry data ingestion in the release v1.92.0. https://docs.victoriametrics.com/CHANGELOG.html
I’d be happy to answer any questions you have.