DalmatinerDB: A fast, distributed metric store (dalmatiner.io)
96 points by sciurus on March 22, 2016 | 48 comments



When would you want to use something like this over just ordinary postgresql?

Say you're trying to measure/analyze the use of a feature like clicking a button. You want to record every time a user clicks the button, their user ID, and some attributes of the user (maybe in bucket A or bucket B of a test). Later you'd want to be able to answer questions like "how many times was this button pressed?" or "how many unique users pressed a button?" or "how many times did users in bucket A press the button?".

Is that an "appropriate" use for postgres? Or is that the kind of "metrics" that a DB like Dalmatiner would be more optimized for? Is there a nice overview of the various kinds of DBs out there and what use cases they're optimized for?
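
For concreteness, here's a rough sketch of the kind of schema and queries I mean (the table and column names are made up, and I've used sqlite3 only so it runs standalone; the SQL would be essentially the same in Postgres):

  import sqlite3

  # Using an in-memory database just for the example.
  db = sqlite3.connect(":memory:")
  db.execute("""
      CREATE TABLE button_clicks (
          user_id    INTEGER NOT NULL,
          bucket     TEXT    NOT NULL,   -- e.g. 'A' or 'B' of the test
          clicked_at TEXT    NOT NULL    -- when the event happened
      )
  """)
  db.executemany(
      "INSERT INTO button_clicks VALUES (?, ?, ?)",
      [(1, "A", "2016-03-22T10:00:00"),
       (1, "A", "2016-03-22T10:05:00"),
       (2, "B", "2016-03-22T11:00:00")])

  # "How many times was this button pressed?"
  total = db.execute("SELECT COUNT(*) FROM button_clicks").fetchone()[0]

  # "How many unique users pressed it?"
  uniques = db.execute(
      "SELECT COUNT(DISTINCT user_id) FROM button_clicks").fetchone()[0]

  # "How many times did users in bucket A press it?"
  bucket_a = db.execute(
      "SELECT COUNT(*) FROM button_clicks WHERE bucket = 'A'").fetchone()[0]

  print(total, uniques, bucket_a)  # 3 2 2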


The situation you describe is more an event than it is a metric, and indeed Postgres is much better suited for it; if asked, that is what I'd recommend for this kind of problem.

What DalmatinerDB was built for is handling high numbers of metrics like CPU usage, memory usage, etc., everything where for every point in time you have exactly one value. In that space it can handle millions of metrics (read: millions of inserts) at the same time, where a system like Postgres would start to have problems in my experience.
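
To illustrate the difference, a metric stream is basically one value per name per point in time, something like this (a made-up sketch of the shape of the data, not the actual wire format or naming scheme):

  import random
  import time

  # Made-up sketch: for every (metric name, timestamp) pair there is
  # exactly one value. A fleet of hosts each reporting hundreds of
  # metrics every second quickly adds up to millions of inserts/second.
  METRIC_NAMES = ["cpu.usage", "memory.used", "net.rx_bytes"]

  def sample(host):
      now = int(time.time())
      for name in METRIC_NAMES:
          # one point: (metric, timestamp, value)
          yield ("%s.%s" % (host, name), now, random.random() * 100)

  for point in sample("web01"):
      print(point)  # e.g. ('web01.cpu.usage', 1458648000, 73.2)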


I see, thanks for your response!


I wouldn't use postgres for that at all! A NoSQL store is much better suited for that task.


There is no such thing as a "NoSQL store". NoSQL is just a buzzword that is used to describe dozens (hundreds?) of very different solutions.


You are being obtuse. It is obvious what I'm talking about.


No, I don't think it's obvious nor are you right about this.


Look at every mobile gaming company and you will see document stores. Dynamo, BigTable, DocumentDB. Google's very own analytics is built on BigTable. The examples are countless.


While a document store is a subset of NoSQL databases, NoSQL is not synonymous with document store. It also includes graphs, key-value stores, etc. That's what the original person was getting at - you should've started with "document store".

While I may use JSON for storing click data, I wouldn't do it using a NoSQL database.


Neither Dynamo nor BigTable are document stores.

And there is no reason why Postgres couldn't handle this kind of stuff. Sure it doesn't come with the needed scaling tools built in but it's doable. I'm running billions of rows through a Postgres analytics setup.

A purpose built system can be more efficient but it also doesn't need to mean NoSQL. You can see some of the serious-scale datastores adopting SQL-like query interfaces. For example Hive for Hadoop or PlyQL for Druid.

NoSQL is a loose term for a mish-mash of technologies.


Dynamo started as a document store. Bigtable is not but I included it because it is another very good option for analytics and is still lumped under the nosql umbrella.

Using a relational database for event analytics means that you have a trivial use case. You might have a lot of rows, but your data is dead simple. Otherwise you would be refactoring every week to change the schema and sharding. Not to mention how much money you would spend on hardware.


> Dynamo started as a document store

No it didn't. The original Dynamo paper doesn't even include the word "document" once! The title of the paper makes it obvious what kind of system it is: "Dynamo: Amazon’s Highly Available Key-value Store"

You should really do some fact checking on your statements.

Bigtable alone does not make an analytics system. You'll need a lot more around that.

I don't have a trivial use case. We use the jsonb data type in Postgres and have no upfront defined schema for metrics.

Postgres' jsonb can out-perform many of the open source NoSQL databases out there. Our system consists of a single-digit number of nodes that can handle hundreds of thousands of writes per second and, as I mentioned already, billions of rows. A single CPU core can scan through more than a million rows per second. Is it the fastest there is? No. Is it good enough for most people? Absolutely.
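
A minimal sketch of the kind of setup I mean, assuming a reachable Postgres and psycopg2; the DSN and table name are placeholders:

  import json
  import psycopg2

  # Placeholders: adjust the DSN and table name for your environment.
  conn = psycopg2.connect("dbname=analytics")
  cur = conn.cursor()

  # No upfront schema for the metrics themselves: everything goes into a
  # jsonb column, with a GIN index to make containment queries fast.
  cur.execute("""
      CREATE TABLE IF NOT EXISTS events (
          id      bigserial   PRIMARY KEY,
          ts      timestamptz NOT NULL DEFAULT now(),
          payload jsonb       NOT NULL
      )
  """)
  cur.execute("CREATE INDEX IF NOT EXISTS events_payload_idx "
              "ON events USING gin (payload jsonb_path_ops)")

  cur.execute("INSERT INTO events (payload) VALUES (%s::jsonb)",
              [json.dumps({"event": "button_click",
                           "user_id": 42, "bucket": "A"})])

  # Count bucket-A clicks with the jsonb containment operator @>.
  cur.execute("SELECT count(*) FROM events WHERE payload @> %s::jsonb",
              [json.dumps({"event": "button_click", "bucket": "A"})])
  print(cur.fetchone()[0])

  conn.commit()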


While Bigtable can be classified as a database that can be scaled to extremely large size, I would not recommend it for the backend of a game at all.

It would be extremely expensive to run a workload like this, and would not be very performant.

This is if we are assuming a very high transaction rate, lots of concurrency, and likely highly volatile data. Bigtable is just not really for that. Neither is Dynamo. Both of these are great as giant-scale, multi-purpose databases, however.


I hope my sarcasm-meter was accurate when gauging your post as "sarcastic".


Postgres has a "nosql" store, called hstore.
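
Rough sketch of what that looks like (assuming the hstore extension can be installed; the DSN and table are placeholders):

  import psycopg2

  # Placeholders: adjust the DSN and table name; assumes the hstore
  # extension is available on the server.
  conn = psycopg2.connect("dbname=analytics")
  cur = conn.cursor()
  cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
  cur.execute("""
      CREATE TABLE IF NOT EXISTS clicks (
          id    bigserial PRIMARY KEY,
          attrs hstore    NOT NULL
      )
  """)
  cur.execute("INSERT INTO clicks (attrs) VALUES (%s::hstore)",
              ['user_id => "42", bucket => "A"'])

  # Key lookup with the -> operator.
  cur.execute("SELECT count(*) FROM clicks WHERE attrs -> 'bucket' = 'A'")
  print(cur.fetchone()[0])

  conn.commit()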


Nomenclature nitpick: This is more clearly called a time series database.

A "metric" doesn't imply multiple samples over time (although researching for this comment, there doesn't seem to be a strict definition)


I feel that time series implies different tradeoffs. This is very specifically built for metrics in the sense of application or hardware metrics, which usually have a time dimension, but yes, there are no generally valid definitions.


Yeah, what is a "metric"? Does that mean a "number"?



Wow, saw the benchmarks:

https://docs.dalmatiner.io/en/latest/benchmarks/jpc.html

"8,500,000 and 9,000,000 metrics per second" on a 5 node cluster.

This is impressive.

The 1-node result is impressive as well (and it shows how it scales):

1,500,000 metrics per second


They are impressive, but let's put them in context:

  No guarantee of storage

  DalmatinerDB offers a 'best effort' on storing the metrics, the
  ingress transport is UDP and there is no log for writes (there is
  the ZIL if enabled in ZFS) or forced sync after each write. This
  means that if your network fails packets can get lost, if your
  server crashes unwritten data can be lost.

[1] https://github.com/dalmatinerdb/dalmatinerdb#no-guarantee-of...


Actually, let me put that paragraph fully in context with reality.

UDP is no longer used (this is outdated, sorry for that); the connection is TCP now, as it turned out overall performance was better.

Data loss can still occur since DalmatinerDB keeps a cache (which other metric stores might also do). A lot of that can be mitigated by using N=2 (or 3) to store data on multiple nodes, which reduces the chance of data loss significantly. Keeping caches isn't uncommon, however; ensuring full consistency requires a kind of transaction that goes from client to server and back to the client, which is 'really' costly and, I am convinced, not worth it for metrics. A few seconds of lost metrics doesn't warrant the cost of that.


The tests were run with TCP; they do not include non-stored metrics.



Another exciting and related project is Tachyon, which allows for rapid collection and sending of metrics to Dalmatiner.

https://docs.project-fifo.net/docs/tachyon


If you are interested in storing time series, check out http://chronix.io/.

A colleague of mine is working on his PhD about time series storage and created it.


I see there's a Python client -- which looks fairly straightforward. I'd like to offer a ruby client, but would prefer not to duplicate an effort already underway.

What's the status of an official Ruby client?


Go write a ruby client.


Could somebody explain this?

"No guarantee of storage. DalmatinerDB offers a 'best effort' on storing the metrics"

https://github.com/dalmatinerdb/dalmatinerdb#no-guarantee-of...


it means the transport is UDP and there's no ack on write


I think it's acceptable data loss because "It Probably Works" - https://www.youtube.com/watch?v=FSlPU5Nrvds


This explains a bit of the take on losing data: https://vimeo.com/148514080


There's a lot of overhead from disk seeks that plagues a lot of traditional databases, especially if you have a physical arm moving around. Any database like this that can write data sequentially and forgo update and delete operations can really speed things up. Also, at scale you start worrying about things like storage costs. Sequential reads/writes make it reasonable to use spinning disks for the cost savings, as SSDs cost more and lose their order-of-magnitude advantage when comparing sequential writes.
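
A toy illustration of the append-only idea; this is not DalmatinerDB's actual on-disk format, just the general shape:

  import struct

  # Fixed-width (timestamp, value) records are only ever appended, never
  # updated or deleted, so writes stay sequential on disk.
  RECORD = struct.Struct("<Qd")  # 8-byte timestamp, 8-byte float

  def append_points(path, points):
      with open(path, "ab") as f:          # append-only writes
          for ts, value in points:
              f.write(RECORD.pack(ts, value))

  def scan_points(path):
      with open(path, "rb") as f:          # one sequential scan
          while True:
              chunk = f.read(RECORD.size)
              if len(chunk) < RECORD.size:
                  break
              yield RECORD.unpack(chunk)

  append_points("cpu.usage.dat", [(1458648000, 73.2), (1458648001, 71.9)])
  print(list(scan_points("cpu.usage.dat")))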


Tying itself to a technology that faces some legal challenges (ZFS) doesn't look like a good thing [1]. I understand the benefits of avoiding duplicating features that are already present in the ZFS layer, but it's a roadblock today and in the foreseeable future.

1 - https://news.ycombinator.com/item?id=11176107


That's only the case if you're going to deploy this on Linux. The packages provided are for SmartOS.


The docs strongly recommend running on ZFS. What is the current state of ZFS for linux? Is it common in production today?


I know quite a few companies running ZFSOL in production. In terms of our own experiences I blogged about them here: https://dataloopio.wordpress.com/2016/03/07/zfs-on-linux/


Yes! Oh sweet wonderful gift from the Gods! I can't wait to dig around in the KV data model for this.


Ah, this is based on Riak Core, not Riak KV as is suggested by the Wordpress blog post announcing DalmatinerDB.

That actually makes me even happier. :-)


The storage engine is a separate library; it might be of interest to you: https://github.com/dalmatinerdb/mstore


So much so that I'd already been poking around there. :-)

I like to see new riak_core applications being made. For the most part it's a pretty clean framework to bootstrap something with.

I have a particular interest right now in riak_core on top of a pile of disk_log, because I want something Kafka-shaped, but where topics are cheap and sloppy-leader-esque behavior is the default without destroying data. Optimizing for availability over everything else, with guaranteed delivery of all ACK'd messages put into the system once the end of all logs has been reached (via continuation).


Sorry, blog corrected.


Here is a great video talking about the decisions made when designing DalmatinerDB - https://vimeo.com/148514080


Very nice. I like how it delegates some of the stuff to the filesystem (compression, checksumming).


Looks great! I like the fact it's built on top of proven technology like ZFS and Riak Core.


Is this an alternate spelling of "Dalmatian" I'm not aware of?


It's German for Dalmatian.


Good dog. 101 Dalmatians.



