Hacker News
InfluxDB now supports Prometheus remote reads and writes natively (influxdata.com)
101 points by pauldix on Sept 14, 2017 | 33 comments



I often encounter a lot of confusion about push vs. pull and whether you should pick Prometheus or Influx.

Prometheus comes along every so often and scrapes metrics your program exposes via a very simple API. This has the advantage that your code doesn't need to know about some endpoint of some cluster; it just needs to buffer up and expose some info it knows about itself from the recent past.
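
For the unfamiliar, here's a minimal sketch of what "expose some info about itself" looks like with the official Go client (the counter name and port are made up for illustration):

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A hypothetical counter; the app just increments it in memory.
    var requestsHandled = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "myapp_requests_handled_total",
        Help: "Requests handled since process start.",
    })

    func main() {
        prometheus.MustRegister(requestsHandled)

        // Prometheus scrapes this endpoint on its own schedule; the app
        // never needs to know where the Prometheus server(s) live.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":2112", nil)
    }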

You don't need to maintain a cluster of Prometheus servers either; you can just run more than one for redundancy (kind of like active-active). It is meant for relatively ephemeral information, it is efficient, and one big Prometheus node will probably do you fine.

Where and how to store compacted historical metrics (coarser resolution, but still interesting data that you might not want to throw away) has been an open question. Sinking it into Influx could be a good answer, so this is welcome news.
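
To make that concrete, the hookup is just a couple of lines of Prometheus config pointing at InfluxDB's new endpoints (paths as described in the announcement, database name made up; check the current docs before copying):

    # prometheus.yml (sketch)
    remote_write:
      - url: "http://localhost:8086/api/v1/prom/write?db=prometheus"
    remote_read:
      - url: "http://localhost:8086/api/v1/prom/read?db=prometheus"

Prometheus keeps its own local storage for recent data either way; the remote endpoints only come into play for long-term retention and historical queries.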


That's exactly what we're going for. Ideally, we'll also expose PromQL query functionality in Influx so Prometheus users can hit their historical data in Influx directly just like they would with Prometheus. Still early days on all this work, but I was very inspired coming away from PromCon last month.


Conversely, one has to tell the collector about a new device to poll. Being able to fire off a UDP packet with the metric and move on seems like it would be the most lightweight approach.
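
For contrast, here's roughly what that fire-and-forget push looks like in Go (statsd-style line format; the hostname and metric name are made up):

    package main

    import "net"

    func main() {
        // Fire and forget: no connection state, no retries, and no
        // acknowledgement that anyone ever received the metric.
        conn, err := net.Dial("udp", "metrics.example.com:8125")
        if err != nil {
            return
        }
        defer conn.Close()

        // statsd-style counter increment
        conn.Write([]byte("myapp.requests:1|c"))
    }

The complete lack of acknowledgement is exactly what the replies below are getting at.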


It sure is, until you consider that you now have to maintain that endpoint and keep it constantly up, or it will simply miss metrics. Not to mention that the occasional network partition is outside your control.

Additionally, I would argue some sort of discovery/registry mechanism is in order anyway.

For example, Prometheus has very solid integration with Kubernetes. In this universe you have one central control-plane thingy (the scheduler) responsible for bringing resources up and down; it updates the collector (Prometheus) about all the devices (pods, containers, damn we have too many terms for things) accordingly, which in turn scrapes on a best-effort basis.

Once everything is hooked up like this you're basically guaranteed that applications that provide information about themselves will get scraped eventually. If they are reachable you'll know what is going on inside them, and if they are not, you'll know that too.

The advantages of this sort of decoupling are subtle and it's difficult to get right, but you'll be thankful down the line for having done it correctly from the get-go.


Two things to consider:

1. UDP packets are lossy. We had the UDP buffers of our Influx server fill up, and it took us a long time to detect that we were dropping packets.

2. Many people want to detect when they are not getting data from an endpoint. Polling is a great way to quickly detect that an endpoint is down.


For argument's sake (abuse is down the hall):

1. My understanding is that UDP being lossy typically refers to loss in transit, but your example is an endpoint failure.

2. Since the whole point of metrics is to keep track of operations, shouldn't the monitoring of the metrics themselves alert on anomalies?


There are two issues with that. First, the statsd event approach has scaling problems, since relative to an instrumentation event a UDP packet is actually quite heavyweight (https://www.robustperception.io/which-kind-of-push-events-or...). Second, how do you know that all the devices that are meant to be sending are actually sending (https://www.robustperception.io/push-needs-service-discovery...)?


When you need to consider scalability, pull also degrades more nicely.


Congrats! Well done, seriously. I love it when things like this happen; settling on Prometheus as a standard will open access to so much more tooling.

(Which is also why I'm super excited to use timescale [http://www.timescale.com] once grafana gets support for postgres!)


Thanks! Shared tooling is exactly why we're doing it. If we can make it easier for our users and potentially easy for some Prometheus users, it's a win. That's one takeaway I have from the Graphite project and community. So much tooling was built up on that standard over time and much of it was very useful. Iterating and narrowing in on a standard will get us to that same spot so I think it's great.


There's already a two-way integration between Prometheus and TimescaleDB: https://github.com/timescale/pg_prometheus


Still paid-only for anything other than a single node? We switched to Cassandra for that reason.


Such remarks make me wonder whether the so-called "Open Core" business model will ever be successful enough to be investible. Influx (clustering) and CockroachDB (backup) are good experiments to keep an eye on. MongoDB had to "port" its paid feature to open source after a community backlash (which feature was that? I forget).


See the link in my comment further down this thread. We'll do what we can to open source as much as possible while still maintaining a viable business so we can continue to invest in OSS development. It's a problem that we continually revisit.


Is whatever product you are working on paid or free?


We offer free real-time forex streams (polygon.io). We are required to charge for equities because the actual exchanges charge us fees per use. Otherwise I would love it all to be free.


Products are only worth what they offer compared to the next best alternative. InfluxDB has free alternatives, even if they're possibly not as good, while qrpike's product may not.


Influx guys are always doing great things. Nice that they see that Prom has become the standard way of doing things and embraced it.


In my experience they're also very responsive and friendly on community channels such as slack.


Thanks, we really appreciate the support. This is only the beginning of us adding support for Prometheus standards. Can't wait to show off some PromQL support :)


Great. If you guys bring clustering back to the open source version, I would be thrilled.


We have to balance that against funding the business so we can continue the OSS work we do. It's something we frequently consider. If we can figure out a way to do it while still maintaining a healthy business, we'll do it. Basically, it's an open problem. I talked about it earlier this year:

https://www.influxdata.com/blog/the-open-source-database-bus...


The HA relay seemed like a nice compromise, but the repo appears to be no longer maintained.


Thanks for adding this; it should help quite a bit with our Kubernetes deployment.


If I may ask, can you expand on this?


$EMPLOYER has a very sizable Influx deployment and pushes the limits of the software (bursts of ~500k metrics per second in some cases). We are also in the process of moving virtually everything (including Kapacitor, Influx, etc.) onto Kubernetes. That said, everything in the entire Kubernetes ecosystem already includes Prometheus exporter support built in. Also, Tectonic (from CoreOS) has the most wonderful Prometheus operator (https://coreos.com/blog/the-prometheus-operator.html), which lets each team trivially spin up Prometheus to monitor all of their apps. It also lets each team spin up their own Alertmanager to send alerts, with Kubernetes guaranteeing the high availability.

Why is this useful? Prometheus is wonderful for ephemeral application state and monitoring, but isn't really meant for storing metrics longer term. Sometimes you want to look at the same metrics over a year or so. That is what Influx is built for. So you have Prometheus for collecting metrics and monitoring your cluster, and then you have it reading and writing to Influx. This is literally the best of both worlds. This will be a big deal going forward for our team.

Does this make sense?


Yes, thank you!


What advantage do either of these data stores have over Cassandra?


I'm curious what you think InfluxDB and Prometheus are; neither has much in common with Cassandra. To my ears, you might as well be asking what advantages nginx has over the JVM.


Both are open source, distributed databases (though you must pay for the distributed version of InfluxDB). If you set up Cassandra's schemas correctly, it can be good for time series. Influx is more suited to time series; however, it's not 100% free and open source.


Cassandra is an excellent distributed database with scaling and high availability built in. It is also a bit operationally difficult to manage and a bit slow. It can be used for time series, but so can PostgreSQL or MySQL, and they aren't very good for it either.

Influx is a time series database where the bits on disk are laid out in time series order, so you can do certain types of operations literally orders of magnitude faster than you can on a more generic datastore (such as Cassandra, Mongo, MySQL, PostgreSQL, etc.). The clustering bits in Influx are enterprise-only, but Influx (non-clustered) is entirely open source.


I'm not sure I would call Cassandra slow. If the schemas are done well, it can be quite good for time series. Obviously this depends on the type of time series you're writing/querying.

Our biggest goals were writes and uptime. We sometimes do over 150k writes/sec. We also needed it to be up and accepting writes even if one node goes down.

We regularly take nodes offline for updates, etc., and Cassandra never misses a beat.

We ~really~ wanted to use InfluxDB, but as a startup we couldn't justify the cost/benefit over Cassandra, since we have 8 nodes for the DB. I just went to the Influx site to try to find the pricing again and it seems to be hidden now :/

EDIT: As a PS, just remember that every one of the InfluxDB benchmarks (that I've come across) is single-node. Cassandra is meant to be horizontally scalable. Testing a single-node Cassandra is like testing a racecar on your driveway...


And our Influx setup does bursts of 500k writes per second with a lot less operational overhead than Cassandra. For time series data, a general-purpose database is always going to be slower for both reads and writes. The data on disk and in memory is simply laid out differently.

For an excellent academic example, see this paper on Facebook's Gorilla in-memory TSDB:

http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
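
The core trick in that paper is easy to sketch: timestamps in a series arrive at nearly fixed intervals, so storing the difference between successive deltas yields mostly zeros, which a bit-level encoder can pack into a bit or two each (values get a similar XOR treatment). A rough illustration, ignoring Gorilla's variable-bit encoding:

    package main

    import "fmt"

    // deltaOfDeltas returns the first delta followed by the difference of
    // each successive delta. Regularly spaced timestamps produce mostly zeros.
    func deltaOfDeltas(ts []int64) []int64 {
        if len(ts) < 2 {
            return nil
        }
        prev := ts[1] - ts[0]
        out := []int64{prev}
        for i := 2; i < len(ts); i++ {
            d := ts[i] - ts[i-1]
            out = append(out, d-prev)
            prev = d
        }
        return out
    }

    func main() {
        // Scrapes every 60s, with one sample arriving a second late.
        fmt.Println(deltaOfDeltas([]int64{1000, 1060, 1120, 1181, 1241}))
        // Output: [60 0 1 -1]
    }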



