InfluxDB – Open-source distributed time-series, events, and metrics database (influxdb.org)
162 points by mnutt on Nov 5, 2013 | 76 comments



I was actually looking at a bunch of open source time series databases and settled on kairosdb but this looks pretty nice.

I think there is a Hacker News rule somewhere that a more interesting tech alternative shows up right after you've decided to go with something else.

For reference here is the list I created when researching these:

http://opentsdb.net/overview.html Built on HBase

http://www.gocircuit.org/vena.html Built on Go + Go Circuit; uses Google's LevelDB

https://code.google.com/p/kairosdb/ A rewrite of opentsdb which can use Cassandra

http://blueflood.io/ Built by Rackspace, decent but still seems a bit immature

http://graphite.wikidot.com/ Obligatory Graphite reference (uses whisper, new backend called 'ceres' is being developed)

https://github.com/agoragames/kairos (yet another 'kairos', alternative backends for graphite - SQL, redis, or mongo)

Riak seems to be popular with SaaS metric providers (Hosted Graphite, Boundary). There isn't any code, but there are a couple of talks that explain how and why they went with Riak:

http://basho.com/hosted-graphite-uses-riak-to-store-all-cust... http://boundary.com/blog/tag/tech-talks/

http://vimeo.com/42902962


You didn't look at istatd? It does 150,000 counters at three different retention intervals every ten seconds, with average, total, standard deviation, min and max for all metrics.

We looked at other options (including OpenTSDB and Graphite) before building this.

https://github.com/imvu-open/istatd


https://github.com/imvu-open/istatd/wiki is my favourite. C++: compile it to a binary, and then it can send updates up a tree of agents to a master node, maintain a replica of the master, and it does graphs too. Everything you need in one.



I should probably look at the code, but from reading the post it isn't entirely clear to me: are you updating all three (second, minute, hour) keys on every insert? And are you just using the "last value" as the value for the key (so the value for the minute becomes the last second value that falls in that interval)?

I need to read up on rrdtool as well, but I wonder if it would make much difference (good or bad) to store the mean or some other average as the "higher up" value (i.e. the average of the past 60 seconds as the minute value)?
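(Purely to illustrate the rrdtool-style alternative being asked about: a minimal Go sketch of a rollup where each minute's value is the mean of its second-level samples rather than the last one. The types and the fixed 60-second bucketing are hypothetical, not anything from istatd.)

    package main

    import "fmt"

    // Sample is one second-resolution data point.
    type Sample struct {
        TS    int64 // unix seconds
        Value float64
    }

    // minuteRollup consolidates second-level samples into one value per minute,
    // using the mean of the seconds in each minute rather than the last value.
    func minuteRollup(samples []Sample) map[int64]float64 {
        sums := map[int64]float64{}
        counts := map[int64]float64{}
        for _, s := range samples {
            bucket := s.TS - s.TS%60 // start of the minute this sample falls in
            sums[bucket] += s.Value
            counts[bucket]++
        }
        means := map[int64]float64{}
        for b, sum := range sums {
            means[b] = sum / counts[b]
        }
        return means
    }

    func main() {
        // Two samples in the first minute and one in the second.
        fmt.Println(minuteRollup([]Sample{{0, 1}, {30, 3}, {61, 10}}))
    }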


Thanks for sharing all these links! I'm always trying to find new and interesting DB technologies, especially time-series.

Can you share anything of your experiences with kairosdb so far - what's the use case and how has it performed?


Thanks! We also went through a similar assessment and came away with KairosDB as the favorable option. We have not started any integration, so early thoughts would be helpful. We definitely favored Cassandra over HBase, and the little work we did with OpenTSDB went well.


Actually, as a person who has used most of these and built similar things, this project seems very interesting and useful, and it tackles a lot of the annoyances of those above. It's kind of a combination of OpenTSDB and Esper, but without the big requirement of setting up other technologies (I don't have to become, or hire, an HBase expert to maintain it). Having a bunch of online algorithms executed against the data is exciting too from the perspective of future improvements; also, I would love to see anomaly detection built in.


There's a lot to be learned from the non-open source players in this space. Specifically, kdb+ has always provided everything I needed. It's built for HFT, and does millions of data points per second with analytics, with history going back years.

It is, however, rather expensive.


The stuff built for finance is definitely interesting. John and I (two of the people working on this) previously worked at a fintech startup, where we worked with a closed-source time series db called OneTick. It was super fast, but its API didn't work well for analytics/events use cases. Great for fast-moving market data, though.


I was trying to recall the name of OneTick ... It does work well for market data (especially with corrections and stuff), but is not a convenient general database.

kdb+, on the other hand, is a time series database that works perfectly well as a general database, with a query language that is at the same time infinitely simpler than SQL and yet much faster, more expressive, and more useful. There's a learning curve, and it is steep, but it is well worth it.


My recollection of OneTick is that it was actually not very high-throughput at all.


I'm one of the committers. The project is still early stage. At this point we're looking for feedback on the API, which we're planning on finalizing this month. Would love to hear about anything you'd like changed or added to the API.


I wouldn't use this bastardized SQL dialect. SQL comes from relational databases, which come from relational algebra, which is an exceptionally poor model for time series data. It's going to end up a mess and confuse people. SQL is already a mess by itself.

I would simply use functions and operators over time series or data frame types. Perhaps take a look at the R zoo library for examples of more advanced things people do with time series.
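(To make the functions-and-operators suggestion concrete, here's a rough Go sketch of that style over a toy series type; every name here is hypothetical, not an existing API.)

    package main

    import "fmt"

    // Series is a toy time series: parallel slices of timestamps and values.
    type Series struct {
        Times  []int64
        Values []float64
    }

    // Where keeps only the points for which the predicate holds.
    func (s Series) Where(keep func(t int64, v float64) bool) Series {
        var out Series
        for i, t := range s.Times {
            if keep(t, s.Values[i]) {
                out.Times = append(out.Times, t)
                out.Values = append(out.Values, s.Values[i])
            }
        }
        return out
    }

    // Mean reduces the series to a single value (0 for an empty series).
    func (s Series) Mean() float64 {
        if len(s.Values) == 0 {
            return 0
        }
        sum := 0.0
        for _, v := range s.Values {
            sum += v
        }
        return sum / float64(len(s.Values))
    }

    func main() {
        s := Series{Times: []int64{10, 20, 30}, Values: []float64{1, 2, 9}}
        // Composing operators instead of writing SQL text.
        fmt.Println(s.Where(func(t int64, v float64) bool { return v < 5 }).Mean()) // 1.5
    }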


We opted for the bastardized SQL because we thought it would be easier to understand. Of course, that may not be the case which is why we'd like to hear what people think of it. We based our decision on people using our API in Errplane and not understanding some parts of it without a lot of additional explanation. One of our users said about one part of it "oh, that's just like group by".

I'm curious, did you find the SQL dialect readable and understandable?


I personally like the simplicity of the SQL dialect. It's a universal way to query data, and every programmer is familiar with it. Alternatively, using functions and code to query data easily gets very complicated.


I really like SQL. As I was learning it, I thought it was terrible, but now, after years of using it, it really makes getting to data easy. I'm very excited to see you choose SQL instead of JSON or something less query-like to query with...


... but that's sort of a Stockholm syndrome; it's because you know SQL and it makes you feel comfortable, and you can think of worse options (like JSON).

But SQL is really a bad option in the modern world, especially since (I estimate) about 90% of SQL statements are built programmatically; thus, a query language that is easier to construct from code makes a lot more sense. (And no, that's not JSON - some form of algebraic or LISPish notation makes much more sense.)

Also, SQL semantics are horrible if you have order involved (as you always do in time series).


I disagree - it's not Stockholm syndrome. For me it was realizing that SQL has actually solved a lot of common and useful ways of expressing the process of getting to the data you want, in the order you want... simply because we write code to generate SQL does not make it bad... I suspect you also think HTML is bad? Being able to say select x,y,z from table where x=1 is very simple and very clear IMO... I'm just giving you my opinion though... you clearly see things differently :D


HTML is indeed bad as a machine-generated format -- which is what it is; e.g. <p>, list items and a few other things don't have a close tag, but most things do.

These things (like SQL) make sense if you assume that the input is (a) written manually, and (b) by people who are not expected to do this "professionally". Neither is the case for HTML or SQL anymore.

(Seriously, SQL was originally marketed to managers with the idea that "it's just plain english so you can do it yourself, and don't need programmers!". You know how well that worked out)


> (Seriously, SQL was originally marketed to managers with the idea that "it's just plain english so you can do it yourself, and don't need programmers!". You know how well that worked out)

Pretty well, actually -- lots of nonprogrammer analysts use SQL for queries, and IME the ones that do consistently are better able to answer questions based on data than the ones that use "friendly" query tools, which inevitably end up being much more limited in practice and requiring a lot more support from both programmers and DBAs to make the data that is already available accessible through them.

Unfortunately, lots of environments prevent direct SQL access to DBs for "security" reasons (as if multiuser DBMSs didn't have role-based access controls as a core feature).


I still think SQL is a needlessly verbose mistake. The same people who can successfully write SQL for queries would have just as easily (or more easily) been able to use some algebraic notation. I am not advocating GUI query builders - I'm advocating a non-natural-language-looking (and hopefully better) language, along the lines of kdb+/q.

If you can properly do inner/outer/cross/asof joins to get to the data you want, the English-like syntax is just a burden - two queries that seem similar in their English more often than not produce completely different results because of SQL's 3-valued logic, the way NULLs are joined, and various other things like that.


> I still think SQL is a needlessly verbose mistake. The same people who can successfully write SQL for queries would have just as easily (or more easily) been able to use some algebraic notation.

I don't think that's true -- there are lots of adults who have anxiety around "maths-like" notations, largely as a result of issues with maths education, cultural factors, etc., despite being able to intellectually handle the relevant manipulations -- and lots of those people end up in non-technical business positions that end up having to deal with data. Lots of the non-IT people I've seen using SQL definitely fall into that group, and I don't think they'd be as proficient with a more algebraic syntax like the comprehension syntaxes used in many modern programming languages.

Conversely, the people that can be proficient in those syntaxes almost certainly can be proficient in SQL, though they may complain about its verbosity.

Sure, in a perfect world where the cultural context was different, this wouldn't be necessary. But we don't live in that world.


This is a really strange opinion. SQL is an awesome DSL for interacting with structured, relational data. Got no idea where you get your 90% stat from; seems to me that's a product of your personal ecosystem rather than an objective statement about the universe.


Actually, it's nice to have both. You may already have something that is doing a lot of queries, and this way you can swap this database in and not have to change the queries much.



I'll second what Chubot said: I'm not liking the SQL much.

The interface I'd like would be something close to numpy (or matlab/R, if those are more familiar). Let me do vectorized operations and write functions in code. Let me load a few different time series into a dataframe. Most likely the easiest thing to do would be to either embed numpy into your engine or create simple wrappers to load data into it.


The numpy-style interface might be interesting. The problem is that it would probably kill performance to marshal data into a Python process. Unless there's some way to embed numpy?

Supporting custom functions is definitely something we want to do.


Why would it kill performance? All you need to do is this:

    double* values = (double*)malloc(sizeof(double)*num_of_values);
(If you are unfamiliar with numpy, it's just a python wrapper around raw blocks of memory.)


You still have to get the bytes out of the server and into the Python process.


Assuming Python is not running in the same process as InfluxDB, amend my code snippet:

    int shm_id = shmget(key, sizeof(double)*num_data_pts, whatever_flag);
    double* data = (double*)shmat(shm_id, NULL, 0);
But directly embedding a Python interpreter is pretty easy in C, so I imagine it should be fairly easy in Go as well.

(Of course, given that my code snippets are C, you can probably deduce I've never written any Go, so take my comments on how easy it is with a grain of salt.)


Go interop with C is very doable (although I've never tried it). Not sure about shared memory WRT garbage collection though.
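(For anyone curious, a minimal cgo sketch of the Go-to-C interop in question; nothing InfluxDB-specific, just a self-contained example.)

    package main

    /*
    double sum(double *xs, int n) {
        double s = 0;
        for (int i = 0; i < n; i++) s += xs[i];
        return s;
    }
    */
    import "C"

    import (
        "fmt"
        "unsafe"
    )

    func main() {
        xs := []float64{1.5, 2.5, 3.0}
        // Hand the slice's backing array to the C function; no copy is made,
        // but the C side must not keep the pointer after the call returns.
        total := C.sum((*C.double)(unsafe.Pointer(&xs[0])), C.int(len(xs)))
        fmt.Println(float64(total)) // 7
    }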


A route to this would be providing an FFI for loadable modules that provide custom procedures or functions.

Folks that wanted to could then write Python/Numpy code for these and use Numba+LLVM to compile them.

It could be quite performant and avoid having to marshal data or do IPC, and could possibly avoid even copying data in some cases (Numpy/Numba have pretty robust support for structures coming over an FFI; not sure about FFI in Go).


This looks awesome. Interestingly, I started writing something similar to this myself not too long ago. I was using DynamoDB as the backing store, though.

I think some more aggregate functions would be useful:

- Count (not listed on the Functions page, but used in the query examples?)

- Sum

- Standard Deviation

- First

- Last


This project looks really, really promising, thanks for working on it!

How robust/scalable, in your opinion, is the backend at this stage? I'm just trying to set my expectations properly when checking it out.

Thanks, Sasha


We're writing the clustered portion of it right now. That won't be available until December, but we'll have performance benchmarks on a variety of configurations.

The single-node write performance at this point is tens of thousands of points per second if batched; reads we haven't optimized yet. Queries that only have to go through a few hundred thousand points should return in < 1s. Of course, those numbers will be highly variable depending on whether you're writing a series with 1 column or 30 columns. We'll be adding things to make that better.

For now we're focused on creating a developer friendly API and building out clustering.
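(On the batched-writes point above: a rough Go sketch of what a batched write over the HTTP API might look like. The /db/:name/series path, the JSON series shape, the port, and the u/p parameters are my reading of the project's HTTP API docs and may not match the final API exactly.)

    package main

    import (
        "bytes"
        "encoding/json"
        "log"
        "net/http"
    )

    // series mirrors the assumed JSON write format: one series name, a set of
    // columns, and a batch of points (rows) sent in a single request.
    type series struct {
        Name    string          `json:"name"`
        Columns []string        `json:"columns"`
        Points  [][]interface{} `json:"points"`
    }

    func main() {
        batch := []series{{
            Name:    "response_times",
            Columns: []string{"time", "value"},
            Points: [][]interface{}{
                {1383667200, 123.4},
                {1383667201, 98.7},
            },
        }}
        body, err := json.Marshal(batch)
        if err != nil {
            log.Fatal(err)
        }
        // Endpoint shape is an assumption; adjust host, database, and credentials.
        url := "http://localhost:8086/db/mydb/series?u=root&p=root"
        resp, err := http.Post(url, "application/json", bytes.NewReader(body))
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        log.Println(resp.Status)
    }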


That makes total sense. Thanks for the info!


It appears to be built with Go. https://github.com/influxdb/influxdb


http://influxdb.org/overview/

It's definitely built with Go.


Built one myself sometime ago:

https://github.com/lsh123/stats-rrdb

An important part for me was the desire to completely separate data and UX (so no Graphite). An added bonus is the ability to control resources (e.g. memory/disk usage).

We've been running it in production for quite some time, processing hundreds of data updates per second and tens of queries per minute.


WOW... the sandbox shows passwords in the URL!!!

http://sandbox.influxdb.org:9062/#/?username=ankit&password=...


Can't wait to use it. Loving all these new DBs written in Go.


I don't seem to be able to log in to their playground - anyone else able to register a new account? I just get "invalid username/password" no matter what I enter.

Other than that, I look forward to evaluating this... maybe it's a solution for a problem I've had recently where I'm collecting massive log files from operational systems and need to navigate/parse/analyze them... so I guess I'd import the logs into InfluxDB and put a d3.js frontend on it...


I get that. But what's more worrying is that my password came through in the URL string (a GET request).


We're no longer putting the password in the URL, but play and sandbox aren't over HTTPS, and the password still gets sent. As their names imply, they're for playing around, not for real data. On a real installation you'll want to use SSL. We'll have that built into the prod releases, or you can always have your load balancer/proxy handle it for you.


That's awesome then. Great product :)


It's back up now, a deploy had reset the password on it.


Every now and then I see a new open-source, distributed, whatnot database pop up. Now, I'm totally naive in terms of databases and distributed systems. Do we really need all these databases? What's special about this one? Can someone give me a summary of the main ones (Mongo, Redis, Rethink, Riak, etc.)?

Now, I'm not discouraging InfluxDB or anything; as a systems programming fan, and as a Gopher too, it's great to see more things like this coming.


> Do we really need all this Databases? What's special about this one?

Good point, Comrade. I will propose to GOSPLAN that we rationalise the development of all new technologies, to avoid such accidental evolutionary convergence in future.


As I said, I'm not discouraging InfluxDB or anything. I have nothing against freedom of choice (that's why I use Linux, for example), but I do agree that fragmentation might be bad (e.g. 200 Linux distros).


This talk by Martin Fowler on NoSQL databases and their use cases is a pretty good introduction: http://www.youtube.com/watch?v=qI_g07C_Q5I


Oh thanks, I'll watch it tonight.


This looks like the exact feature-set we need at my company; we're in the middle of moving to Redshift but I'll be keeping an eye on Influx.

I know it's early days, but I didn't see any information about cluster management - how does one set up an Influx cluster, can it be resized, and what kind of hardware does it prefer?


We're building out that portion right now. There will be a web interface for managing the cluster. We'll benchmark it on cloud configs on different sizes with regular spinning disks, EBS, and SSDs.

The goal with the cluster stuff is that it should be possible to add nodes to the cluster, but the storage part of it isn't highly elastic. Meaning, you won't be adding and removing instances from it frequently. So adding nodes will require you to go into the admin interface, activate them, then wait up to half a day for rebalancing to be complete (but the cluster will be available for reads and writes during this time). However, we will be optimizing for the case of replacing a failed or soon to be shut down node.

If you're serious about giving it a try when we have the clustered version available, shoot me an email: paul@pauldix.net. Would definitely like to hear more about your use case.


I don't know your exact needs, but cluster management is always a pain. Have you thought about more of a "DBaaS"-type platform like TempoDB (https://tempo-db.com/)?


Very cool. I actually made a basic version of this (only implemented increment) in Go for the same reasons: just drop it on a server and run it.

My implementation would output a chart given parameters:

/chart.png?metric=whatever&time=12h&interval=10m

Are there any plans for easy output of graphs?


We'd definitely like to do that soon. That's one of the really nice things about Graphite and it makes it easy to share in emails, chat rooms, etc. For the moment we're focused on the other parts of the API and building out the clustering part of it.


Whenever I see these new DBs I promise myself to try them in a side project, but I almost never get around to thinking of one. Can someone be kind enough to suggest a few ideas for side projects where this DB would shine?


I wouldn't say InfluxDB (or Graphite, or whatever) is something you'd develop a side project around (though you could probably implement some novel data visualization), but rather, they provide a backend for your side projects to collectively aggregate metrics.


Re: the API: for JavaScript anyway, I think a chainable/fluent interface with the method names modeled after Underscore.js would be grand.


Do you have a link to some specific chainable/fluent code that looks like what you're thinking of?


The sandbox appears to be down, but I'm a little concerned about the security. Can database users be created that only have read access?


You can limit read and write access on users. It's documented on this page: http://influxdb.org/docs/api/http.html. We haven't implemented the specific column limits part of that API yet so feedback would be great. Does that take care of the use case you were thinking of?

However, security is probably not something to bother with in sandbox since it's not HTTPS. We're looking at it now, and should have it back up in a bit.


Yes, that would handle it. I look forward to playing with the sandbox.



Have you looked into StatsD support? At the very least, a backend for StatsD (to write into InfluxDB) would make adoption a lot easier.


It's definitely high up on the todo list. First finalize the API and release production-worthy builds, then all those additional little add-ons!
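(In the meantime, a bridge isn't much code. A bare-bones Go sketch that listens for StatsD counter lines over UDP and aggregates them per flush interval; the flush step just logs, where a real backend would write the totals to InfluxDB. The port and flush interval are arbitrary.)

    package main

    import (
        "log"
        "net"
        "strconv"
        "strings"
        "sync"
        "time"
    )

    func main() {
        conn, err := net.ListenPacket("udp", ":8125")
        if err != nil {
            log.Fatal(err)
        }
        var mu sync.Mutex
        counts := map[string]float64{}

        // Flush the accumulated counters every 10 seconds. A real bridge would
        // POST these totals to InfluxDB instead of logging them.
        go func() {
            for range time.Tick(10 * time.Second) {
                mu.Lock()
                for name, total := range counts {
                    log.Printf("flush %s=%v", name, total)
                }
                counts = map[string]float64{}
                mu.Unlock()
            }
        }()

        buf := make([]byte, 1500)
        for {
            n, _, err := conn.ReadFrom(buf)
            if err != nil {
                continue
            }
            // StatsD counter lines look like "page.views:1|c".
            parts := strings.SplitN(strings.TrimSpace(string(buf[:n])), ":", 2)
            if len(parts) != 2 {
                continue
            }
            fields := strings.Split(parts[1], "|")
            if len(fields) < 2 || fields[1] != "c" {
                continue
            }
            v, err := strconv.ParseFloat(fields[0], 64)
            if err != nil {
                continue
            }
            mu.Lock()
            counts[parts[0]] += v
            mu.Unlock()
        }
    }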


I like the fact that data is dimensional.

What's the scalability model? It's not clear from the documentation.


We're working on the clustered version now. The short answer is that data points are sharded across the cluster and replicated based on a per-database replication factor. Any given query hits (# nodes / RF) of the nodes to get its answer. So writes scale horizontally and queries balance across the cluster.
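(For illustration only, not InfluxDB's actual placement code: the general shape of hashing a series onto a primary node and replicating to RF nodes looks something like the sketch below.)

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // ownersFor picks which nodes hold a series: hash the series name onto one
    // of numNodes, then replicate to the next rf-1 nodes around the ring.
    func ownersFor(seriesName string, numNodes, rf int) []int {
        h := fnv.New32a()
        h.Write([]byte(seriesName))
        primary := int(h.Sum32() % uint32(numNodes))
        owners := make([]int, 0, rf)
        for i := 0; i < rf; i++ {
            owners = append(owners, (primary+i)%numNodes)
        }
        return owners
    }

    func main() {
        // With 6 nodes and RF=2, any given query only needs to reach
        // 6/2 = 3 nodes to cover every shard, matching the "# nodes / RF" above.
        fmt.Println(ownersFor("cpu.load.host42", 6, 2))
    }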


Is there someone out there maintaining a list of the various databases and the use cases for each?


Sounds interesting; I am currently using HBase for similar purposes. Do "tables" have to be created explicitly, or can I just store a value into a time series, and if the series doesn't exist yet it will be created?


You can just write data in on the fly. Time series get created when you write the first point. You also can create new columns on the fly. And there's no enforcement of a data type across all values for a given column. That's on the user.


Great. I am looking at integrating it into my fork of EtherCalc so I can store not just the metrics but also some of the indicators and statistics I compute on them. For my use cases (R and spreadsheets) it would be handy if I could get results in CSV format straight from the API when I make a query.


We'll add the CSV response to the todo list. You're not the first person to request it and it makes total sense.


This is built as a round-robin DB, yes?


Looking at the code[1], LevelDB is used for the datastore. This is using the LevelDB Go bindings (glue in C).

[1]: https://github.com/influxdb/influxdb/blob/master/src/datasto...



