When would you want to use something like this over just ordinary postgresql?
Say you're trying to measure/analyze the use of a feature like clicking a button. You want to record every time a user clicks the button, their user ID, and some attributes of the user (maybe in bucket A or bucket B of a test). Later you'd want to be able to answer questions like "how many times was this button pressed?" or "how many unique users pressed a button?" or "how many times did users in bucket A press the button?".
Is that an "appropriate" use for postgres? Or is that the kind of "metrics" that a DB like Dalmatiner would be more optimized for? Is there a nice overview of the various kinds of DBs out there and what use cases they're optimized for?
The situation you describe is more an event than it is a metric, and indeed Postgres is much better suited for it; if asked, that is what I'd recommend for this kind of use case.
What DalmatinerDB was built for is handling high numbers of metrics like CPU usage, memory usage, etc.: anything where for every point in time you have exactly one value. In that space it can handle millions of metrics (read: millions of inserts) at the same time, where a system like Postgres would start to have problems, in my experience.
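To illustrate the event case, here's a minimal sketch of the button-click example from the question (the table, columns, and connection string are invented for illustration; adjust for a real setup):

```python
import psycopg2

# Hypothetical connection string and schema, purely to illustrate
# the one-row-per-event approach; not a production setup.
conn = psycopg2.connect("dbname=analytics")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS button_clicks (
        id       bigserial   PRIMARY KEY,
        user_id  bigint      NOT NULL,
        bucket   text        NOT NULL,  -- e.g. 'A' or 'B'
        clicked  timestamptz NOT NULL DEFAULT now()
    )
""")

# One row per click event.
cur.execute(
    "INSERT INTO button_clicks (user_id, bucket) VALUES (%s, %s)",
    (42, "A"),
)
conn.commit()

# "How many times was this button pressed?"
cur.execute("SELECT count(*) FROM button_clicks")
print(cur.fetchone()[0])

# "How many unique users pressed the button?"
cur.execute("SELECT count(DISTINCT user_id) FROM button_clicks")
print(cur.fetchone()[0])

# "How many times did users in bucket A press the button?"
cur.execute("SELECT count(*) FROM button_clicks WHERE bucket = 'A'")
print(cur.fetchone()[0])
```

All three questions from the parent fall out as simple aggregates over the event rows.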
Look at every mobile gaming company and you will see document stores. Dynamo, BigTable, DocumentDB. Google's very own analytics is built on BigTable. The examples are countless.
While a document store is a subset of NoSQL databases, NoSQL is not synonymous with document store. It also includes graph databases, key-value stores, etc. That's what the original person was getting at: you should've started with "document store".
While I may use JSON for storing click data, I wouldn't do it using a NoSQL database.
And there is no reason why Postgres couldn't handle this kind of stuff. Sure, it doesn't come with the needed scaling tools built in, but it's doable. I'm running billions of rows through a Postgres analytics setup.
A purpose built system can be more efficient but it also doesn't need to mean NoSQL. You can see some of the serious-scale datastores adopting SQL-like query interfaces. For example Hive for Hadoop or PlyQL for Druid.
NoSQL is a loose term for a mish-mash of technologies.
Dynamo started as a document store. Bigtable is not, but I included it because it is another very good option for analytics and is still lumped under the NoSQL umbrella.
Using a relational database for event analytics means that you have a trivial use case. You might have a lot of rows, but your data is dead simple. Otherwise you would be refactoring every week to change the schema and sharding. Not to mention how much money you would spend on hardware.
No it didn't. The original Dynamo paper doesn't even include the word "document" once! The title of the paper makes it obvious what kind of system it is: "Dynamo: Amazon’s Highly Available Key-value Store"
You should really do some fact checking on your statements.
Bigtable alone does not make an analytics system. You'll need a lot more around that.
I don't have a trivial use case. We use the jsonb data type in Postgres and have no upfront defined schema for metrics.
Postgres' jsonb can outperform many of the open source NoSQL databases out there. Our system consists of a single-digit number of nodes that can handle hundreds of thousands of writes per second and, as I mentioned already, billions of rows. A single CPU core can scan through more than a million rows per second. Is it the fastest there is? No. Is it good enough for most people? Absolutely.
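For context, a rough sketch of the jsonb pattern being described (table name, columns, and connection string are invented for illustration, not the actual production schema):

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=analytics")
cur = conn.cursor()

# One jsonb column holds whatever attributes an event happens to
# have; no upfront schema migration when a new attribute appears.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id         bigserial   PRIMARY KEY,
        recorded   timestamptz NOT NULL DEFAULT now(),
        properties jsonb       NOT NULL
    )
""")

# A GIN index makes containment queries on the jsonb column fast.
cur.execute(
    "CREATE INDEX IF NOT EXISTS events_props_idx "
    "ON events USING gin (properties)"
)

cur.execute(
    "INSERT INTO events (properties) VALUES (%s)",
    (json.dumps({"event": "click", "user_id": 42, "bucket": "A"}),),
)
conn.commit()

# @> is jsonb containment: count click events from bucket A without
# those keys ever having been declared in a schema.
cur.execute(
    "SELECT count(*) FROM events WHERE properties @> %s",
    (json.dumps({"event": "click", "bucket": "A"}),),
)
print(cur.fetchone()[0])
```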
While Bigtable can be classified as a database that can scale to extremely large sizes, I would not recommend it for the backend of a game at all.
It would be extremely expensive to run a workload like this, and would not be very performant.
This is assuming a very high transaction rate, lots of concurrency, and likely highly volatile data. Bigtable is just not really for that; neither is Dynamo. Both of these are great as giant-scale, multi-purpose databases, however.
I feel that "time series" implies different tradeoffs. This was built very specifically for metrics in the sense of application or hardware metrics, which usually have a time dimension, but yes, there are no generally valid definitions.
No guarantee of storage

DalmatinerDB offers a 'best effort' on storing the metrics: the ingress transport is UDP and there is no log for writes (there is the ZIL, if enabled in ZFS) or forced sync after each write. This means that if your network fails, packets can get lost; if your server crashes, unwritten data can be lost.
Actually let me put that paragraph fully in context with the reality.
UDP is no longer used (this is outdated, sorry for that); the connection is TCP now, as it turned out the overall performance was better.
Data loss can still occur since DalmatinerDB keeps a cache (which other metric stores might also do). A lot of that can be mitigated by using N=2 (or 3) to store data on multiple nodes, which will reduce the chance of data loss significantly. Keeping caches isn't uncommon, however; ensuring full consistency requires a kind of transaction that goes from client to server to client, which is 'really' costly, and I am convinced it is not worth it for metrics: a few seconds of lost metrics doesn't warrant the cost of that.
I see there's a Python client -- which looks fairly straightforward. I'd like to offer a ruby client, but would prefer not to duplicate an effort already underway.
There's a lot of overhead from disk seeks that plagues a lot of traditional databases, especially if you have a physical arm moving around. Any database like this that can write data sequentially and forgo update and delete operations can really speed things up. Also, at scale you start worrying about things like storage costs: sequential reads/writes make it reasonable to use spinning disks for the cost savings, as SSDs cost more and lose their order-of-magnitude advantage when comparing sequential writes.
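As a rough illustration of the append-only idea (the file name and record format here are invented; this is not DalmatinerDB's actual on-disk layout):

```python
import struct
import time

# A minimal append-only metric log: fixed-width records written
# sequentially, never updated in place. Illustrates the general
# technique only, not any particular database's storage format.
RECORD = struct.Struct("<dQ")  # 8-byte float value, 8-byte ns timestamp

class AppendOnlyLog:
    def __init__(self, path):
        # "ab" opens for appending: every write lands at the end of
        # the file, so the disk never seeks during ingestion.
        self._fh = open(path, "ab")

    def write(self, value):
        self._fh.write(RECORD.pack(value, time.time_ns()))

    def close(self):
        self._fh.close()

def read_all(path):
    # Reads are sequential too: one pass over the file, no seeking.
    with open(path, "rb") as fh:
        data = fh.read()
    usable = len(data) - len(data) % RECORD.size
    for off in range(0, usable, RECORD.size):
        yield RECORD.unpack_from(data, off)

log = AppendOnlyLog("cpu_usage.log")
for v in (0.42, 0.57, 0.61):
    log.write(v)
log.close()

for value, ts_ns in read_all("cpu_usage.log"):
    print(value, ts_ns)
```

"Deleting" old data in this scheme is just dropping whole segment files, which keeps both the write and the expiry paths sequential.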
Tying itself to a technology that faces some legal challenges (ZFS) doesn't look like a good thing [1]. I understand the benefits of avoiding duplicating features that are already present in the ZFS layer, but it's a roadblock today and in the foreseeable future.
So much so that I'd already been poking around there. :-)
I like to see new riak_core applications being made. For the most part it's a pretty clean framework to bootstrap something with.
I currently have a particular interest in riak_core on top of a pile of disk_log, because I want something Kafka-shaped, but where topics are cheap and sloppy-leader-esque behavior is the default without destroying data: optimizing for availability over everything else, with guaranteed delivery of all ACK'd messages put into the system once the end of all logs has been reached (via continuation).