Ingest OpenTelemetry metrics with Prometheus natively (last9.io)
90 points by donutshop on July 29, 2023 | 32 comments



Hi! Author of the PR here. As a project, Prometheus would like to become more OTel-native, and this is only the first of the changes that are coming.

I’d be happy to answer any questions you have.
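Roughly, the idea is that an application (or a collector) can push OTLP straight to Prometheus over HTTP. A minimal sketch with the OTel Go SDK, assuming the receiver is enabled with --enable-feature=otlp-write-receiver and listens on /api/v1/otlp/v1/metrics (the flag and path may still change while this is experimental, so double-check the docs):

    package main

    import (
        "context"
        "time"

        "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    )

    func main() {
        ctx := context.Background()

        // Point the OTLP/HTTP exporter straight at Prometheus instead of a collector.
        // Endpoint and URL path are assumptions; adjust to wherever the receiver is exposed.
        exp, err := otlpmetrichttp.New(ctx,
            otlpmetrichttp.WithEndpoint("localhost:9090"),
            otlpmetrichttp.WithURLPath("/api/v1/otlp/v1/metrics"),
            otlpmetrichttp.WithInsecure(),
        )
        if err != nil {
            panic(err)
        }

        provider := sdkmetric.NewMeterProvider(
            sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp,
                sdkmetric.WithInterval(15*time.Second))),
        )
        defer provider.Shutdown(ctx)

        counter, _ := provider.Meter("example").Int64Counter("demo_requests_total")
        counter.Add(ctx, 1)
    }

Once ingested, the data lands in the regular TSDB and can be queried with PromQL like any scraped metric.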


Do you have a roadmap of the changes you are looking to bring? Are there any changes to OTel you're looking to add as well?


Does it still apply that if my Prometheus goes down or the network glitches, then metrics for that period are lost forever?


Unfortunately, yes. The OTel Collector has plans to implement a WAL for the OTLP exporter, and when it does, you should be resilient to the upstream temporarily having issues.


WAL = Write-Ahead Logging?


Yes, sorry.


How will that affect timestamps? Won't all the metrics end up with the timestamp of when Prometheus finally receives them?


It does today. There is a retry queue, and you can use persistent storage for it.


What are the merits of the prometheus approach versus one where events/metrics and their original timestamps can be preserved, stored (during temporary outages), then forwarded to backends when connectivity is reestablished?


Nice, do I understand it correctly that this would mean there is a straightforward way to let Prometheus ingest the new histogram type without needing any new daemons (like otel-collector)?

Currently, the text format for pull metrics doesn't appear to support it.


The protobuf format supports native histograms. That’s the most straightforward way, if you have a client library for it. Go does, for instance.

OpenTelemetry is push, so if you need pull and no new daemons this PR doesn’t help you.
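For the client-library route, here is a rough, untested sketch of what a native histogram looks like with client_golang (recent versions; the option name below is from memory, so double-check it):

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        // Setting NativeHistogramBucketFactor is what turns this into a native
        // histogram; it is exposed when the scraper negotiates the protobuf format.
        reqDur := prometheus.NewHistogram(prometheus.HistogramOpts{
            Name:                        "http_request_duration_seconds",
            Help:                        "Request duration.",
            NativeHistogramBucketFactor: 1.1,
        })
        prometheus.MustRegister(reqDur)

        reqDur.Observe(0.042)

        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }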


Yeah, I have been reading up a bit. It's just not clear to me whether the protobuf format and the text format can be mixed.

Can service A use the text format and service B use protobuf, both scraped by the same Prometheus sidecar?


Yes. Scraping is done over HTTP, and the format is negotiated via the 'Accept' header.
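As a rough illustration, a scraper asking for protobuf just sets the Accept header; a target that only speaks the text format answers with text/plain instead (content-type string from the exposition format docs, worth double-checking):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        req, _ := http.NewRequest("GET", "http://localhost:8080/metrics", nil)
        // Ask for the protobuf exposition format.
        req.Header.Set("Accept",
            "application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        fmt.Println("negotiated Content-Type:", resp.Header.Get("Content-Type"))
        body, _ := io.ReadAll(resp.Body)
        fmt.Println("payload bytes:", len(body))
    }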


We've been using Influx with much success. I just don't think Prometheus's pull model is the right one for metrics, especially in isolated sites like a DC. Has anyone successfully migrated from Influx to Prometheus? If so, why did you do that? What's better now?


I've been using Telegraf + InfluxDB + Grafana for my last projects and never really had to tweak anything after uncommenting the right sections in telegraf.conf. Are Prometheus and its associated tools an alternative to that stack?



VictoriaMetrics has many users who successfully migrated from InfluxDB. It supports data ingestion via the Influx line protocol, so you can continue using Telegraf and send the collected metrics to VictoriaMetrics instead of InfluxDB. You get the following benefits after migrating from InfluxDB to VictoriaMetrics:

- Reduced memory usage by up to 10x [1].

- Reduced disk space usage.

- Higher query performance.

- Better query language than InfluxQL and Flux for typical queries over collected metrics [2].

- Compatibility with Prometheus ecosystem.

See also the InfluxDB -> VictoriaMetrics migration guide [3].

[1] https://valyala.medium.com/insert-benchmarks-with-inch-influ...

[2] https://docs.victoriametrics.com/MetricsQL.html

[3] https://docs.victoriametrics.com/guides/migrate-from-influx....
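If you want to try the line-protocol ingestion without wiring up Telegraf first, a quick sketch (assuming a single-node VictoriaMetrics on its default port 8428; the /write path is from memory, see the docs):

    package main

    import (
        "fmt"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        // One measurement in Influx line protocol; VictoriaMetrics exposes it on
        // the query side as cpu_usage_idle{host="web1"}.
        line := fmt.Sprintf("cpu,host=web1 usage_idle=87.5 %d\n", time.Now().UnixNano())

        resp, err := http.Post("http://localhost:8428/write",
            "text/plain", strings.NewReader(line))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }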


Awesome, does VictoriaMetrics work with Grafana?


I thought Prometheus was a pull or push model? Although granted, I've spent very little time with it.


Prometheus will always pull metrics from a metrics-exposing endpoint. However, Prometheus can then push metrics to anything that has the proper integrations for remote writing [0]. So you could run Prometheus in Agent Mode [1] in your DC to ingest metrics and push them to some central location.

[0]: https://prometheus.io/docs/operating/integrations/#remote-en...

[1]: https://prometheus.io/blog/2021/11/16/agent/


Prometheus can also receive remote-write requests; however, we recommend only writing metrics scraped by another Prometheus or the agent. The data model still has a few things that expect the metrics to have been scraped.

If you try to use Prometheus as a native push system and push directly, it'll work, but you might not have the best experience. https://prometheus.io/docs/prometheus/latest/querying/api/#r...
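For completeness, pushing directly means sending a snappy-compressed protobuf WriteRequest to the remote-write endpoint. A rough sketch, assuming the receiver is enabled (e.g. --web.enable-remote-write-receiver) and reachable at /api/v1/write:

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"

        "github.com/golang/snappy"
        "github.com/prometheus/prometheus/prompb"
    )

    func main() {
        req := &prompb.WriteRequest{
            Timeseries: []prompb.TimeSeries{{
                Labels: []prompb.Label{
                    {Name: "__name__", Value: "pushed_demo_metric"},
                    {Name: "job", Value: "manual"},
                },
                Samples: []prompb.Sample{{Value: 42, Timestamp: time.Now().UnixMilli()}},
            }},
        }

        raw, err := req.Marshal() // prompb types carry their own protobuf Marshal
        if err != nil {
            panic(err)
        }

        httpReq, _ := http.NewRequest("POST",
            "http://localhost:9090/api/v1/write", bytes.NewReader(snappy.Encode(nil, raw)))
        httpReq.Header.Set("Content-Type", "application/x-protobuf")
        httpReq.Header.Set("Content-Encoding", "snappy")
        httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

        resp, err := http.DefaultClient.Do(httpReq)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }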


Anyone with experience scaling Prometheus horizontally? We are reaching the limits of our instance, memory- and CPU-wise, and I'm yet to choose between scaling it myself with sharding or using Thanos/Victoria/Cortex.


If you want to query across the whole data set, use one of the other options. Prometheus has a "federation" option, but there hasn't been any active work on it for years. That's basically the definition of Thanos: take a bunch of Prometheus servers and query across them, plus long-term storage in S3.

VictoriaMetrics, Cortex, and Mimir are centralised data stores that accept data from multiple Prometheus servers, but you could also run headless agents doing the scraping and sending the data.

Note: if you are on a version before 2.44, try upgrading. Prometheus slimmed down a bit.

[I am a Prometheus and Mimir maintainer]


I've been through this song and dance. I did months-long PoCs (with live data, running next to the then-production Prometheus deployment) of Thanos, Cortex, and VictoriaMetrics.

VM won hands down on pretty much all counts. It's easy and simple to operate and monitor, it scales really well, and you can plan around how you want to partition and scale each component. It's incredibly cheap to run, as performance is superior to the others, even when backed by spinning HDDs versus the other solutions on SSDs.

It's especially easy to operate on Kubernetes using their CRDs and operators.

I am not associated with Victoria Metrics in any way, just a happy user and sysadmin who ran it for a few years.


VictoriaMetrics was recommended to me by a contractor and I've been very happy with it as well. It does have an option to push in metrics, which I intend to use with transient environments like CI jobs and the like, though I haven't gotten there yet.


Yep, we used to use that in a few places: CI jobs, batch processes, etc. Prometheus has the Pushgateway, which we also used before migrating to VM, but it had certain drawbacks (can't recall exactly what, sorry) that the new solution didn't.
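For reference, the Pushgateway flavour of this from a batch/CI job looks roughly like the sketch below with client_golang (the pushgateway:9091 address is just a placeholder; VM's own push endpoints work differently):

    package main

    import (
        "log"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        lastRun := prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "ci_job_last_success_timestamp_seconds",
            Help: "Unix time of the last successful run.",
        })
        lastRun.SetToCurrentTime()

        // Push once at the end of the job; Prometheus then scrapes the gateway.
        if err := push.New("http://pushgateway:9091", "ci_job").
            Collector(lastRun).
            Grouping("repo", "example").
            Push(); err != nil {
            log.Fatal(err)
        }
    }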


Yeah, whatever you do, don’t use Mimir.

Operational nightmare, expensive to run, various parts of the entirely-too-many moving pieces it contains broke all the time and the performance was…unimpressive.

I've heard that some people manage to run this thing successfully, and power to them, but I want nothing more to do with it.

Just save yourself the pain and use Victoria Metrics. Added benefit: you get an implementation of a rate function that’s actually correct.


I have been running Mimir reasonably well. When it comes to performance, what exactly did you find unimpressive? Interested to know about any pitfalls or pain points you have encountered so far.


I've had an interesting time transitioning our project from OpenCensus to OpenTelemetry now that the former is EOL'd. We use the OTel Stackdriver output. Anyone have a reference comparison between GCP cloud metrics vs. a Prometheus monitoring stack?


I did use Stackdriver for quite a while before I moved to Mimir. TBH it's great that you are still sticking with OpenTelemetry. Stackdriver as metric storage is not even a wise option in today's world, given there are some really good TSDB providers, SaaS or otherwise, that would do a much better job.

I moved away for 2 primary reasons:

1. The cost of Stackdriver can add up with large-scale deployments or high-frequency metrics. It's essential to monitor and control usage to avoid unexpected billing.

2. I have experienced delays in metric updates, specifically with high-frequency data. While the delays are usually minimal, they may not be ideal for some real-time monitoring use cases. FYI, GCP makes metrics for its own resources available only after 210s, so you are always behind.

Going the TSDB route to reliably run storage has worked for me.

Also, if this helps: https://last9.io/blog/time-series-database-comparison/


Prometheus has been known as a monitoring system which promotes the pull model over the push model for metrics collection, e.g. it is configured to discover scrape targets and then scrape metrics from them at regular intervals - https://www.robustperception.io/its-easy-to-convert-pull-to-...

It will be interesting to see whether Prometheus turns into a multi-model (pull+push) monitoring system after the addition of OpenTelemetry protocol support.

P.S. VictoriaMetrics, the Prometheus-like monitoring system I work on, also gained support for OpenTelemetry data ingestion in release v1.92.0. https://docs.victoriametrics.com/CHANGELOG.html


So no need for Jaegermeisters at work anymore!



