Any experience with using Aurora in place of DynamoDB?
A couple years ago there was an interesting tidbit at re:Invent about customers moving from DynamoDB to Aurora to save significant costs.[1] The Aurora team made the point that DynamoDB suffers from hotspots despite your best efforts to evenly distribute keys, so you end up overprovisioning. Whereas with Aurora you just pay for I/O. And the scalability is great. Plus you get other nice stuff with Aurora like, you know, traditional SQL multi-operation transactions.
It was kind of buried in a preso from the Aurora team, and the high-level messaging from Amazon was still that NoSQL is the most scalable thing. Aurora was and is still seemingly positioned against other solutions within the SQL realm. I sort of get it: NoSQL is theoretically infinitely scalable whereas Aurora is bounded by 15 read replicas and one write master, but in practice these days those limits are huge. I think one write master can handle like 100K transactions a second or something.
So, I'm really curious where this has gone in the past couple years if anywhere. Is NoSQL still the best approach?
Oh cool. For those reading along this is titled "How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated)". Is this a new feature?
Yeah I should have elaborated a bit. I believe adaptive capacity was announced at re:invent in 2017 and may have released shortly after / maybe early 2018. The feature is getting a lot more press & push from AWS lately though for sure.
I remember having a conversation with our AWS rep about 2 years ago during our quarterly feature request meeting. I remember asking for DynamoDB autoscaling and burst capacity; pretty happy they finally delivered.
Since then we've pretty much cut our DynamoDB bill in half and had a drastic reduction in throttled responses.
I personally recommend using a SQL database until you're absolutely positively sure you don't need one, for many reasons.
But, as far as the "you end up overprovisioning" because of hotspots thing, DynamoDB does offer autoscaling these days, which should alleviate a lot of provisioning-related headaches and save you money compared to the static provisioning you would have done otherwise, from what I understand.
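To make that concrete, here's a rough sketch of wiring up DynamoDB autoscaling with boto3 (the table name and capacity numbers are just placeholders):

```python
import boto3

# Application Auto Scaling manages DynamoDB capacity targets.
autoscaling = boto3.client("application-autoscaling")

# Hypothetical table name and capacity bounds, purely for illustration.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/my-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

# Target-tracking policy: keep consumed/provisioned capacity around 70%.
autoscaling.put_scaling_policy(
    PolicyName="my-table-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/my-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```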
We use a hybrid. We process a lot of incoming data and dump most of it into dynamo (it's ephemeral so the TTL feature is nice) and if we get capacity errors (Dynamo takes a while to scale up sometimes) we just dump our objects in the DB. The end result is we keep a huge amount of writes off our DB for processing incoming largish objects. The amount of data it stores would cost an arm and a leg to put into redis.
Granted, I don't think I'd want to use Dynamo for anything other than temporary data. Lock-in makes me nervous, and the way it scales up/down really makes it difficult to use it for hourly workloads...by the time it scales up we're close to done needing more capacity, then it doesn't scale down for like 40m after. We set up caps and the DB overflow mechanism keeps things from grinding to a halt.
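Roughly what our fallback path looks like, sketched in boto3 (the table name and the fallback_db helper are made up for illustration):

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("incoming-objects")  # hypothetical table name

def store_object(obj_id, payload, fallback_db):
    """Try DynamoDB first; overflow to the relational DB when throttled."""
    item = {
        "id": obj_id,
        "payload": payload,
        # TTL attribute: epoch seconds after which DynamoDB may expire the item.
        "expires_at": int(time.time()) + 3600,
    }
    try:
        table.put_item(Item=item)
    except ClientError as e:
        if e.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
            # Capacity error: dump the object into the SQL DB instead.
            fallback_db.insert_object(obj_id, payload)  # hypothetical helper
        else:
            raise
```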
GP used the wrong term, think they meant adaptive capacity, which is a newer feature where shards will automatically lend capacity to each other in the case of hotspots.
Autoscaling doesn't always help with hot shards (which I think gp was referring to) because you can have a single shard go over its share of the throughput[0] while still having a low total throughput.
This has largely been resolved: a single shard can now consume more of the throughput than your equation would give you. AWS refers to it as Adaptive Capacity.
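To put rough numbers on the old behavior (these are made up, just to show the math):

```python
# Made-up numbers to illustrate the pre-adaptive-capacity behaviour.
provisioned_wcu = 1000   # write capacity provisioned on the table
partitions = 10          # number of physical partitions

per_partition_wcu = provisioned_wcu / partitions   # 100 WCU per partition
hot_key_traffic = 300                              # WCU hitting one partition

# The hot partition throttles (300 > 100) even though the table as a whole
# is only using 300/1000 = 30% of what you're paying for.
print(per_partition_wcu, hot_key_traffic > per_partition_wcu)
```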
Yes. Relational databases are very fast and using them as key/value stores is a great use-case. Using a scale-out system like Aurora makes it even better. It's slower because of SQL parsing and generally the SQL clients are not as fast, but you can get close to single-digit millisecond latency these days.
We use Aurora or Postgres for key/value unless we need something specific, like multi-regional capacity or really high-end performance. For that we run ScyllaDB.
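For what it's worth, a minimal Postgres key/value setup looks something like this (sketched with psycopg2; the connection string and key names are placeholders):

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app")  # placeholder connection string

with conn, conn.cursor() as cur:
    # A single two-column table is enough for a key/value workload.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kv (
            key   text PRIMARY KEY,
            value jsonb NOT NULL
        )
    """)

    # Upsert behaves like a key/value PUT.
    cur.execute(
        "INSERT INTO kv (key, value) VALUES (%s, %s) "
        "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
        ("user#123", Json({"name": "example"})),
    )

    # Point lookup behaves like a GET.
    cur.execute("SELECT value FROM kv WHERE key = %s", ("user#123",))
    print(cur.fetchone())
```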
> It's slower because of SQL parsing and generally the SQL clients are not as fast
I'd be really surprised if the client library introduces a latency significant enough to be compared to the network latency between the app server and the database server.
Many libraries handle db connections poorly, or have heavy-handed pooling systems, or aren't fully async, all of which limits total throughput. The key/value clients usually have much simpler APIs, like HTTP, which scale much better.
I don't understand. What makes you think it's easier for NoSQL clients (versus SQL clients) to correctly implement connection pooling and async networking? For example, MongoDB and Cassandra wire protocols are not based on HTTP. And even if they were based on HTTP, connection pooling and async networking still requires a specific effort. Which libraries are you thinking of (as examples of good and bad behavior)?
Relational databases tend to have bigger and more complicated protocols, with more complex session management, data types and parsing requirements, and connections that may only support a single in-flight query.
Libraries just have to do more work compared to simpler protocols, or HTTP, which is incredibly easy to scale and pretty much handled automatically by the standard libraries at this point.
Right, but that has nothing to do with connection pooling and async. And there is no structural reason that makes it easier to implement prepared statements for PostgreSQL than for Cassandra. It's anecdotal evidence.
Whether NoSQL is the best approach and whether DynamoDB is the best approach are two separate issues. I find DynamoDB too limiting with the way that it handles indexing, read and write capacity, etc. compared to traditional NoSQL databases like ElasticSearch and Mongo.
That being said, one advantage of DynamoDB is that it is API-based and you can make a true serverless web app where all of the logic is on the client, you use web identity federation for authentication to DynamoDB, and you host your JavaScript, HTML, and CSS files on S3.
Another advantage, until two days ago, was that with most of the data stores on AWS you kept your databases behind a VPC, and if you used Lambda, your Lambda also had to be in a VPC, which increased warm-up time for the Lambda.
Now, there is the Read Only Data API for serverless Aurora. You don’t have to worry about the traditional connection pooling or being in a VPC.
Aurora did not work well for us (it was using local ephemeral disk to do sorts, so our query results were truncated / limited to the largest local storage), so the best option for us was to run MySQL or Postgres on an i3 instance with local SSDs.
Ok but I'm not sure this is relevant. We're talking about using Aurora in place of DynamoDB, not how it compares to other SQL DBs. With DynamoDB the kind of internal sort you're talking about isn't even possible, right?
Also, better insight into partition sizes / what's causing hot spotting. The DB abstracts a lot from the user, which isn't necessarily great, because it's still subject to the normal pitfalls of a NoSQL database.
Not a particularly easy solution, but you can use dynamo streams to achieve this by loading fast into a temporary table, trickle-feeding via a stream into another table. When it’s caught up, stop writes on the import table then swap over to the permanent table.
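Something like this Lambda handler attached to the import table's stream, assuming the stream is configured with NEW_IMAGE (the table name is a placeholder):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical Lambda handler fed by the import table's DynamoDB stream.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is already in DynamoDB's attribute-value format,
            # so it can be written directly with the low-level client.
            dynamodb.put_item(
                TableName="permanent-table",   # placeholder name
                Item=record["dynamodb"]["NewImage"],
            )
        elif record["eventName"] == "REMOVE":
            dynamodb.delete_item(
                TableName="permanent-table",
                Key=record["dynamodb"]["Keys"],
            )
```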
A way of doing this without expending all that effort is on my wish list too.
Congrats to the DynamoDB team for going beyond the traditional limits of NoSQL.
There is a new breed of databases that use consensus algorithms to enable global multi-region consistency. Google Spanner and FaunaDB (where I work) are part of this group. I didn't catch anything about the implementation details of DynamoDB transactions in the article. If they are using a consensus approach, expect them to add multi-region consistency soon. If they are using a traditional active/active replication approach, they'll be limited to regional replication.
They warn about other regions seeing incomplete transactions (if you opt into transactions on global tables), which fits with the current "copy each new item from the stream" async replication.
However, the more recent Google storage offerings based on Cloud Spanner do seem to offer this. I don't see how Amazon can make this statement - that doesn't stop it being an excellent enhancement to DynamoDB though.
DynamoDB is limited to 10 items, whereas Cloud Datastore's limit is 25 different 'tables', and the new version via Cloud Firestore doesn't even have that restriction. AWS is several years behind and several NoSQL systems behind in this area. Still, a cool addition.
The “and tables” clause is the differentiator, I think. DynamoDB tables are roughly equivalent to Datastore namespaces; I don’t believe Google Cloud Datastore supports cross-namespace transactions.
FoundationDB https://www.foundationdb.org/ not only supports transactions, they are mandatory. They also go one step further and support atomic operations, which are especially killer.
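A rough sketch of what that looks like with the FoundationDB Python bindings (key names here are just placeholders):

```python
import struct
import fdb

fdb.api_version(620)   # pin the client API version before opening
db = fdb.open()        # uses the default cluster file

# Everything inside a @fdb.transactional function runs as one serializable
# transaction, no matter which machines hold the keys.
@fdb.transactional
def set_pair(tr, key_a, key_b, value):
    tr[key_a] = value
    tr[key_b] = value

# Atomic operations mutate a key without a prior read, so concurrent
# increments never conflict with each other.
@fdb.transactional
def bump_counter(tr, key):
    tr.add(key, struct.pack("<q", 1))

set_pair(db, b"config/primary", b"config/replica", b"enabled")
bump_counter(db, b"stats/writes")
```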
What do you mean by this? There is only one “database” in FoundationDB terms. You can write transactions over the entire keyspace regardless of which machine the data is stored on.
I'm still not sure what you mean in terms of contrasting this with DynamoDB's new features. You could implement the entire DynamoDB API, with even stronger semantics than the new features listed in the article, on top of FoundationDB. Additionally, the latency would be theoretically lower as they describe needing to do a read, write, and another read per key to verify isolation, whereas FoundationDB uses an optimistic concurrency control scheme to verify at commit time that transactions do not conflict. In the common case (where transactions don't conflict) this is faster.
All I’m trying to do here is trying to see whether the claim made in the blog post is true or not. Some commenters were claiming it was false, but I don’t think they considered all the components of the claim.
There's not really a concept of "Database" in FDB. There is, however, a concept of key spaces and "directories", which are basically the same thing, and these all support transactions.
As for Firestore, it’s not clear whether it supports cross-collection transactions. Cloud Datastore does not support cross-namespace transactions AFAICT.
b) Given that the primary use case for namespaces was/is multitenancy, it's not clear to me why you'd want to transact across them. Nevertheless, you can. What's leading you to draw this conclusion?
The documentation is what led me to that conclusion, since it's not explicit as to what the transaction boundaries are, but I could be mistaken. Does this mean that the poster's claim is erroneous?
However: “Multi-document transactions are available for replica sets only. Transactions for sharded clusters are scheduled for MongoDB 4.2.” DynamoDB is sharded by design.
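For reference, a multi-document transaction in pymongo looks roughly like this, assuming a replica set (the collection names are made up):

```python
from pymongo import MongoClient

# Assumes a replica set; multi-document transactions don't work on a
# standalone mongod (and, per the quote above, not yet on sharded clusters).
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.shop

with client.start_session() as session:
    with session.start_transaction():
        # Both writes commit or abort together.
        db.orders.insert_one({"_id": 1, "item": "widget"}, session=session)
        db.inventory.update_one(
            {"item": "widget"},
            {"$inc": {"qty": -1}},
            session=session,
        )
```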
I see, it appears to come down to how each db interprets "partitions".
If we're referring specifically to shards then "DynamoDB is the only non-relational database that supports transactions across multiple partitions and tables." no longer sounds like hyperbole.
HyperDex Warp (I'm not sure if it's still available) purports to provide serializability over multi-key transactions. In the HyperDex model that means over all defined spaces in the cluster. That's a stronger guarantee than DynamoDB provides, which is still susceptible to phantom reads. The DynamoDB team ought to be aware of it, because it's one of the first hits for "multi-key transactions" and the paper is an important one for designing transactions on KVS.
FaunaDB (mentioned in a previous comment) is multi-model NoSQL, so you can do relational queries, and it supports transactions across multiple partitions, documents, and replicas. JSON docs, not tables.
This is cool, it lifts the burden of having to bake "atomicity" into your app if you're using a key/value store like DynamoDB. I can see a nice balance of combining this with some built in error checking in the app itself.
I'd be interested to see comparisons/benchmarks against FoundationDB. DynamoDB transactions make dynamo a serious alternative to FDB now. I can see the two main advantages of FDB being: 1) you can deploy it on premise (which is potentially important for some B2B companies), 2) it shuffles data around so that hot-spotting of a cluster is eliminated (which dynamo appears to still suffer from).
Max 10 items per transaction, that's quite a restriction! I guess you have to plan all the transactions you would perform and make sure they meet the bounds.
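For anyone curious what the API looks like, here's a rough boto3 sketch (table and attribute names are made up); TransactItems is where the 10-item cap bites:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Both operations succeed or fail together; TransactItems is capped at 10 entries.
dynamodb.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "orders",
                "Item": {"pk": {"S": "order#42"}, "status": {"S": "placed"}},
                # Fail the whole transaction if the order already exists.
                "ConditionExpression": "attribute_not_exists(pk)",
            }
        },
        {
            "Update": {
                "TableName": "inventory",
                "Key": {"pk": {"S": "widget"}},
                "UpdateExpression": "SET qty = qty - :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
)
```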
[1] https://youtu.be/60QumD2QsF0?t=1021