We're in the AWS ecosystem, and the database offerings are really subpar. DynamoDB, which I originally expected to be somewhat comparable to MongoDB, is an incredibly frustrating (and expensive) product to use. AWS Data Pipeline is extremely confusing and very expensive as well.
AWS's offerings really lag behind Google's offerings (like BigQuery) in this space. Hopefully AWS can catch up because I'd rather not have requests bouncing between data centers.
If you are in AWS there are also the four RDS options (Oracle, MySQL, PostgreSQL, SQL Server) as well as Redshift. Also, the best thing about AWS is that there are so many third-party choices, e.g. MongoLab, MongoHQ, Instaclustr, Cloudant.
Databases is not the area I would be choosing Google for.
I think App Engine's Datastore is generally an under-appreciated gem. Possibly because you have to use App Engine to use it without sacrificing performance, or maybe because it's not easy enough to use if all you have is some JavaScript + JSON and you don't want to (or don't know how to) write Python/Java/Go.
But it's actually the only generally available product I know of that solves all the hard problems (availability, partition tolerance, some - but well-defined - consistency, with cross-entity transactions) with zero hassle for you.
If you read through http://aphyr.com/tags/Jepsen, you get some appreciation for how hard this is to pull off without running into operational nightmares (massive data loss, split brains, etc).
Disclaimer: I work for Google, though not on Datastore.
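For anyone curious what those cross-entity transactions actually look like, here's a minimal sketch using the Python NDB client on App Engine. The Account/AuditLog models and the debit scenario are made-up examples, not anything from a real product:

    from google.appengine.ext import ndb

    class Account(ndb.Model):
        balance = ndb.IntegerProperty(default=0)

    class AuditLog(ndb.Model):
        message = ndb.StringProperty()

    # xg=True makes this a cross-entity-group transaction (limited to a
    # small number of entity groups), so both writes commit or neither does.
    @ndb.transactional(xg=True)
    def debit_and_log(account_key, amount):
        account = account_key.get()
        account.balance -= amount
        account.put()
        AuditLog(message='debited %d' % amount).put()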
We've had good luck with DynamoDB, but it could be that it just fits our use case very well. What sort of frustration were you running into? (Honestly interested to avoid trouble down the line)
Most recently: a hot hash key. DynamoDB uses the hash key of the object being persisted to route it to the right partition.
We're a SaaS company with lots of tiny customers and a few very large customers. We need to keep an index so we can show a specific customer only their data, which means the index for our largest customers gets hit a lot. The problem with this structure is that we have to pay as if all of our customers were as popular as our biggest customers, or we get throttled. And even though the DynamoDB console shows that you have provisioned 10x above your current usage, you still get throttled, because the throttling happens on a single partition, not the table as a whole.
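The common workaround is write sharding: spread a hot customer across N synthetic hash keys and fan the reads back in. Just to illustrate (not what we shipped, and the table/attribute names here are made up), a rough sketch with boto3:

    import random
    import boto3
    from boto3.dynamodb.conditions import Key

    NUM_SHARDS = 10  # spread one hot customer across 10 hash keys
    table = boto3.resource('dynamodb').Table('customer_events')

    def put_event(customer_id, event):
        # Write under "bigcustomer#7" instead of "bigcustomer", so one
        # customer's writes land on several partitions instead of one.
        shard = random.randint(0, NUM_SHARDS - 1)
        table.put_item(Item=dict(event, customer_shard='%s#%d' % (customer_id, shard)))

    def query_events(customer_id):
        # Reads now have to fan out over every shard and merge the results.
        items = []
        for shard in range(NUM_SHARDS):
            resp = table.query(
                KeyConditionExpression=Key('customer_shard').eq('%s#%d' % (customer_id, shard)))
            items.extend(resp['Items'])
        return items

Of course, that pushes a bunch of complexity into the application, which is exactly the kind of thing you were hoping the managed service would handle.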
So, let's say you solve that problem, but now you need to drop the troublesome index on a billion+ row table. With DynamoDB you can't change a table's indexes, so you have to migrate your table to a new table. Doing that without downtime is an incredible challenge.
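The usual recipe, roughly (a sketch of the general dual-write-and-backfill pattern, not exactly what we did; table names are placeholders): write every new item to both tables, backfill history from the old table, then flip reads over.

    import boto3

    dynamodb = boto3.resource('dynamodb')
    old_table = dynamodb.Table('events_v1')   # table with the index we want gone
    new_table = dynamodb.Table('events_v2')   # recreated with the desired indexes

    def write_event(item):
        # Phase 1: the application dual-writes every new item to both tables.
        old_table.put_item(Item=item)
        new_table.put_item(Item=item)

    def backfill():
        # Phase 2: copy historical rows with a paginated scan. On a
        # billion-row table this is slow, so in practice it has to be
        # rate-limited and resumable.
        kwargs = {}
        while True:
            page = old_table.scan(**kwargs)
            with new_table.batch_writer() as batch:
                for item in page['Items']:
                    batch.put_item(Item=item)
            if 'LastEvaluatedKey' not in page:
                break
            kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

    # Phase 3: switch reads to events_v2, then drop events_v1.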
Which reminds me of when they announced indexes. We were so excited, only to find out we couldn't add indexes to our existing tables and instead had to recreate them all.
The whole point of SaaS is to make our lives easier, but with DynamoDB our lives were much more difficult than just using Mongo.
Anyway, I need to do a blog post on this -- it's a bit too complicated for a HN comment. :-)
Yeah, I think all NoSQL DBs will have that issue if you have extremely unbalanced sharding. This is an application-level fault and should be solved there.
But the thing that is extra bad about DynamoDB is that they are paying for 10x higher provisioning as a stopgap and still getting throttled. That sucks.
DataFlow is not Kinesis. It's more like Kinesis plus Esper plus BigQuery and you still wouldn't have one set of queries to run against streaming and batch data like you do with DataFlow.
I've been using Simple Workflow (in particular, the Flow framework: http://aws.amazon.com/swf/details/flow/) a lot recently to manage complex asynchronous distributed workflows, and it's a revelation. Some people laugh about the "Simple" part in the name, but it's kind of true - once you get over the initial learning curve. It's a bit like Git that way (first a lot of banging your head, then a lot of bang for your buck).
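For people who haven't looked at SWF: the Flow framework itself is Java and annotation-based, but even the raw API gives a feel for the model - you start an execution, deciders and activity workers poll for tasks, and SWF keeps the full event history. A minimal sketch of just kicking off an execution with boto3 in Python (domain, task list and type names are made up, and this is the plain API, not Flow):

    import boto3

    swf = boto3.client('swf')

    swf.start_workflow_execution(
        domain='my-domain',
        workflowId='order-12345',
        workflowType={'name': 'ProcessOrder', 'version': '1.0'},
        taskList={'name': 'order-deciders'},
        input='{"orderId": 12345}',
        executionStartToCloseTimeout='3600',
        taskStartToCloseTimeout='300',
    )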
Google Dataflow seems to be the big one here, especially if it works well for stream processing. Fault-tolerant stream processing with huge scalability? Perfect for the IoT!
Judging from the code samples they showed during the keynote, I'd guess that Google Cloud Dataflow is based on (or an extension of, or a public version of...) FlumeJava, described in this PLDI 2010 paper: http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...
The streaming data stuff looks extremely interesting. My main concern is around cost; unfortunately, many of these things are great if you've got a massive data problem but not particularly worth it if you've got much smaller data.
I'm in a rather awkward phase of having small enough data that I don't need "scale to 1000 machines!"; I just want one or a few machines occasionally, but managed for me (turn on, run code, shut off). Tutum works very well for this, but I'd like to use more of the ecosystem available at Google or AWS (pay-per-usage data storage, for example). GCE is pretty decent, if a bit awkward, although the new Docker support helps (though I've had problems even getting it working).
It looks like an attempt to respond to AWS Kinesis, which was released last year. The monitoring stuff seems to be built on the software they got when they bought Stackdriver.
DataDog is an important part of monitoring at SmarterAgent, where we use their APIs and integrations heavily, especially the CloudWatch integrations for Amazon Web Services (AWS). Using these, we have been able to put up effective dashboards for new environments in a matter of minutes.
We leverage DD's API primarily for eventing. For example, deployment notifications are posted to Datadog, where they overlay our metrics. This has proven very useful in tracking changes due to deploys and/or configuration changes.
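If you're using the Python client (an assumption on my part; any of their clients or plain HTTP works the same way), posting a deploy event is only a few lines. The keys, title and tags here are placeholders:

    from datadog import initialize, api

    initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

    # Shows up as an overlay on any dashboard/graph that matches these tags.
    api.Event.create(
        title='Deployed web to production',
        text='git sha abc1234 rolled out to the web tier',
        tags=['env:production', 'role:web', 'deploy'],
    )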
While we do leverage the DataDog agent for standard and custom metrics, DataDog’s ability to put together dashboards (and alerting) for AWS without any modifications to the host is what really closed the deal for us.
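For the agent-side custom metrics, the usual pattern is to emit them through the local dogstatsd listener; again just a sketch, with a made-up metric name:

    from datadog import statsd

    # The agent aggregates these locally and forwards them to Datadog,
    # so the application only ever talks to localhost over UDP.
    statsd.increment('myapp.signups', tags=['env:production'])
    statsd.gauge('myapp.queue_depth', 42, tags=['env:production', 'queue:emails'])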
I probably will end up building the bare minimum to meet my needs and moving on tbh.
It was basically a monitoring/metrics system to merge how I handle the monitoring of crons, the work queue, system metrics, analytics, etc. into a single service. Right now, I'm stuck using three separate services.
Sure, I could just build something to merge it together ... but at that point, I'm halfway to building my own.
I was about to do the same thing. App Engine sorely lacks those features currently, so I'm very excited for this (assuming it has good support for App Engine in addition to Compute Engine which I saw in the keynote).
Hi, I am a developer and hacker and wanted to see if I can offer help here. The reason: I wanted to see what the typical needs and use cases would be and learn from the experience.
I can be contacted on sid4it@gmail.com
We use Datadog at pelotoncycle.com. My past experience was with Nagios, but Datadog is so much nicer as well as easier to scale. I looked into other services (cheaper and more expensive) before deciding on Datadog.