Any experience with using Aurora in place of DynamoDB?
A couple years ago there was an interesting tidbit at re:Invent about customers moving from DynamoDB to Aurora to save significant costs.[1] The Aurora team made the point that DynamoDB suffers from hotspots despite your best efforts to evenly distribute keys, so you end up overprovisioning. Whereas with Aurora you just pay for I/O. And the scalability is great. Plus you get other nice stuff with Aurora like, you know, traditional SQL multi-operation transactions.
It was kind of buried in a preso from the Aurora team, and the high-level messaging from Amazon was still that NoSQL is the most scalable thing. Aurora was and is still seemingly positioned against other solutions within the SQL realm. I sort of get it: NoSQL is theoretically infinitely scalable whereas Aurora is bounded by 15 read replicas and one write master, but in practice these days those limits are huge. I think one write master can handle like 100K transactions a second or something.
So, I'm really curious where this has gone in the past couple years if anywhere. Is NoSQL still the best approach?
Oh cool. For those reading along this is titled "How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated)". Is this a new feature?
Yeah I should have elaborated a bit. I believe adaptive capacity was announced at re:invent in 2017 and may have released shortly after / maybe early 2018. The feature is getting a lot more press & push from AWS lately though for sure.
I remember having a conversation with our AWS rep about 2 years ago during our quarterly feature request meeting. I remember asking for DynamoDB autoscaling and burst capacity; pretty happy they finally delivered.
Since then we've pretty much cut our DynamoDB bill in half and had a drastic reduction in throttled responses.
I personally recommend using a SQL database until you're absolutely positively sure you don't need one, for many reasons.
But, as far as the "you end up overprovisioning" because of hotspots thing, DynamoDB does offer autoscaling these days, which should alleviate a lot of provisioning-related headaches and save you money compared to the static provisioning you would have done otherwise, from what I understand.
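To make that concrete, here's a rough sketch of wiring up DynamoDB autoscaling with boto3 (the table name and capacity numbers are just placeholders):

```python
import boto3

# Application Auto Scaling manages DynamoDB capacity targets.
autoscaling = boto3.client("application-autoscaling")

# Hypothetical table name and capacity bounds, purely for illustration.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/my-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

# Target-tracking policy: keep consumed/provisioned capacity around 70%.
autoscaling.put_scaling_policy(
    PolicyName="my-table-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/my-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```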
We use a hybrid. We process a lot of incoming data and dump most of it into dynamo (it's ephemeral so the TTL feature is nice) and if we get capacity errors (Dynamo takes a while to scale up sometimes) we just dump our objects in the DB. The end result is we keep a huge amount of writes off our DB for processing incoming largish objects. The amount of data it stores would cost an arm and a leg to put into redis.
Granted, I don't think I'd want to use Dynamo for anything other than temporary data. Lock-in makes me nervous, and the way it scales up/down really makes it difficult to use it for hourly workloads...by the time it scales up we're close to done needing more capacity, then it doesn't scale down for like 40m after. We set up caps and the DB overflow mechanism keeps things from grinding to a halt.
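Roughly what our fallback path looks like, sketched in boto3 (the table name and the fallback_db helper are made up for illustration):

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("incoming-objects")  # hypothetical table name

def store_object(obj_id, payload, fallback_db):
    """Try DynamoDB first; overflow to the relational DB when throttled."""
    item = {
        "id": obj_id,
        "payload": payload,
        # TTL attribute: epoch seconds after which DynamoDB may expire the item.
        "expires_at": int(time.time()) + 3600,
    }
    try:
        table.put_item(Item=item)
    except ClientError as e:
        if e.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
            # Capacity error: dump the object into the SQL DB instead.
            fallback_db.insert_object(obj_id, payload)  # hypothetical helper
        else:
            raise
```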
GP used the wrong term, think they meant adaptive capacity, which is a newer feature where shards will automatically lend capacity to each other in the case of hotspots.
Autoscaling doesn't always help with hot shards (which I think gp was referring to) because you can have a single shard go over its share of the throughput[0] while still having a low total throughput.
This has largely been resolved: a single shard can now consume more of the throughput than your equation would give you. AWS refers to it as Adaptive Capacity.
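To put rough numbers on the old behavior (these are made up, just to show the math):

```python
# Made-up numbers to illustrate the pre-adaptive-capacity behaviour.
provisioned_wcu = 1000   # write capacity provisioned on the table
partitions = 10          # number of physical partitions

per_partition_wcu = provisioned_wcu / partitions   # 100 WCU per partition
hot_key_traffic = 300                              # WCU hitting one partition

# The hot partition throttles (300 > 100) even though the table as a whole
# is only using 300/1000 = 30% of what you're paying for.
print(per_partition_wcu, hot_key_traffic > per_partition_wcu)
```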
Yes. Relational databases are very fast and using them as key/value stores is a great use-case. Using a scale-out system like Aurora makes it even better. It's slower because of SQL parsing and generally the SQL clients are not as fast, but you can get close to single-digit millisecond latency these days.
We use Aurora or Postgres for key/value unless we need something specific, like multi-regional capacity or really high-end performance. For that we run ScyllaDB.
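For what it's worth, a minimal Postgres key/value setup looks something like this (sketched with psycopg2; the connection string and key names are placeholders):

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app")  # placeholder connection string

with conn, conn.cursor() as cur:
    # A single two-column table is enough for a key/value workload.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kv (
            key   text PRIMARY KEY,
            value jsonb NOT NULL
        )
    """)

    # Upsert behaves like a key/value PUT.
    cur.execute(
        "INSERT INTO kv (key, value) VALUES (%s, %s) "
        "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
        ("user#123", Json({"name": "example"})),
    )

    # Point lookup behaves like a GET.
    cur.execute("SELECT value FROM kv WHERE key = %s", ("user#123",))
    print(cur.fetchone())
```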
> It's slower because of SQL parsing and generally the SQL clients are not as fast
I'd be really surprised if the client library introduces a latency significant enough to be compared to the network latency between the app server and the database server.
Many libraries handle db connections poorly, or have heavy-handed pooling systems, or aren't fully async, all of which limits total throughput. The key/value clients usually have much simpler APIs, like HTTP, which scale much better.
I don't understand. What makes you think it's easier for NoSQL clients (versus SQL clients) to correctly implement connection pooling and async networking? For example, MongoDB and Cassandra wire protocols are not based on HTTP. And even if they were based on HTTP, connection pooling and async networking still requires a specific effort. Which libraries are you thinking of (as examples of good and bad behavior)?
Relational databases tend to have bigger and more complicated protocols, with more complex session management, data types and parsing requirements, and connections that may only support a single in-flight query.
Libraries just have to do more work compared to simpler protocols, or HTTP, which is incredibly easy to scale and pretty much handled automatically by the standard libraries at this point.
Right, but that has nothing to do with connection pooling and async. And there is no structural reason that makes it easier to implement prepared statements for PostgreSQL than for Cassandra. It's anecdotal evidence.
Whether NoSQL is the best approach and whether DynamoDB is the best approach are two separate issues. I find DynamoDB too limiting with the way that it handles indexing, read and write capacity, etc. compared to traditional NoSQL databases like ElasticSearch and Mongo.
That being said, one advantage of DynamoDB is that it is API-based and you can make a true serverless web app where all of the logic is on the client, you use web identity federation for authentication to DynamoDB, and you host your JavaScript, HTML, and CSS files on S3.
Another advantage, until two days ago, was that with most of the data stores on AWS you kept your databases behind a VPC, and if you used Lambda, your Lambda also had to be in a VPC, which increased warm-up time for the Lambda.
Now, there is the Read Only Data API for serverless Aurora. You don’t have to worry about the traditional connection pooling or being in a VPC.
Aurora did not work well for us (it was using local ephemeral disk to do sorts, so our query results were truncated / limited to the largest local storage), so the best option for us was to run MySQL or Postgres on an i3 instance with local SSDs.
Ok but I'm not sure this is relevant. We're talking about using Aurora in place of DynamoDB, not how it compares to other SQL DBs. With DynamoDB the kind of internal sort you're talking about isn't even possible, right?
Also, better insight into partition sizes / what's causing hot spotting. The DB abstracts a lot from the user, which isn't necessarily great, because it's still subject to the normal pitfalls of a NoSQL database.
Not a particularly easy solution, but you can use dynamo streams to achieve this by loading fast into a temporary table, trickle-feeding via a stream into another table. When it’s caught up, stop writes on the import table then swap over to the permanent table.
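Something like this Lambda handler attached to the import table's stream, assuming the stream is configured with NEW_IMAGE (the table name is a placeholder):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical Lambda handler fed by the import table's DynamoDB stream.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is already in DynamoDB's attribute-value format,
            # so it can be written directly with the low-level client.
            dynamodb.put_item(
                TableName="permanent-table",   # placeholder name
                Item=record["dynamodb"]["NewImage"],
            )
        elif record["eventName"] == "REMOVE":
            dynamodb.delete_item(
                TableName="permanent-table",
                Key=record["dynamodb"]["Keys"],
            )
```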
A way of doing this without expending all that effort is on my wish list too.
Congrats to the DynamoDB team for going beyond the traditional limits of NoSQL.
There is a new breed of databases that use consensus algorithms to enable global multi-region consistency. Google Spanner and FaunaDB (where I work) are part of this group. I didn't catch anything about the implementation details of DynamoDB transactions in the article. If they are using a consensus approach, expect them to add multi-region consistency soon. If they are using a traditional active/active replication approach, they'll be limited to regional replication.
They warn about other regions seeing incomplete transactions (if you opt into transactions on global tables), which fits with the current "copy each new item from the stream" async replication.
However, the more recent Google storage offerings based on Cloud Spanner do seem to offer this. I don't see how Amazon can make this statement - that doesn't stop it being an excellent enhancement to DynamoDB though.
DynamoDB is limited to 10 items, whereas Cloud Datastore's limit is 25 different 'tables', and the new version via Cloud Firestore doesn't even have that restriction. AWS is several years behind and several NoSQL systems behind in this area. Still, a cool addition.
The “and tables” clause is the differentiator, I think. DynamoDB tables are roughly equivalent to Datastore namespaces; I don’t believe Google Cloud Datastore supports cross-namespace transactions.
FoundationDB https://www.foundationdb.org/ not only supports transactions, they are mandatory. They also go one step further and support atomic operations, which are especially killer.
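A rough sketch of what that looks like with the FoundationDB Python bindings (key names here are just placeholders):

```python
import struct
import fdb

fdb.api_version(620)   # pin the client API version before opening
db = fdb.open()        # uses the default cluster file

# Everything inside a @fdb.transactional function runs as one serializable
# transaction, no matter which machines hold the keys.
@fdb.transactional
def set_pair(tr, key_a, key_b, value):
    tr[key_a] = value
    tr[key_b] = value

# Atomic operations mutate a key without a prior read, so concurrent
# increments never conflict with each other.
@fdb.transactional
def bump_counter(tr, key):
    tr.add(key, struct.pack("<q", 1))

set_pair(db, b"config/primary", b"config/replica", b"enabled")
bump_counter(db, b"stats/writes")
```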
What do you mean by this? There is only one “database” in FoundationDB terms. You can write transactions over the entire keyspace regardless of which machine the data is stored on.
I'm still not sure what you mean in terms of contrasting this with DynamoDB's new features. You could implement the entire DynamoDB API, with even stronger semantics than the new features listed in the article, on top of FoundationDB. Additionally, the latency would be theoretically lower as they describe needing to do a read, write, and another read per key to verify isolation, whereas FoundationDB uses an optimistic concurrency control scheme to verify at commit time that transactions do not conflict. In the common case (where transactions don't conflict) this is faster.
All I’m trying to do here is trying to see whether the claim made in the blog post is true or not. Some commenters were claiming it was false, but I don’t think they considered all the components of the claim.
There's not really a concept of "Database" in FDB. There is, however, a concept of key spaces and "directories", which are basically the same thing, and these all support transactions.
As for Firestore, it’s not clear whether it supports cross-collection transactions. Cloud Datastore does not support cross-namespace transactions AFAICT.
b) Given that the primary use case for namespaces was/is multitenancy, it's not clear to me why you'd want to transact across them. Nevertheless, you can. What's leading you to draw this conclusion?
The documentation is what led me to that conclusion, since it's not explicit as to what the transaction boundaries are, but I could be mistaken. Does this mean that the poster's claim is erroneous?
However: “Multi-document transactions are available for replica sets only. Transactions for sharded clusters are scheduled for MongoDB 4.2.” DynamoDB is sharded by design.
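For reference, a multi-document transaction in pymongo looks roughly like this, assuming a replica set (the collection names are made up):

```python
from pymongo import MongoClient

# Assumes a replica set; multi-document transactions don't work on a
# standalone mongod (and, per the quote above, not yet on sharded clusters).
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.shop

with client.start_session() as session:
    with session.start_transaction():
        # Both writes commit or abort together.
        db.orders.insert_one({"_id": 1, "item": "widget"}, session=session)
        db.inventory.update_one(
            {"item": "widget"},
            {"$inc": {"qty": -1}},
            session=session,
        )
```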
I see, it appears to come down to how each db interprets "partitions".
If we're referring specifically to shards then "DynamoDB is the only non-relational database that supports transactions across multiple partitions and tables." no longer sounds like hyperbole.
HyperDex Warp (I'm not sure if it's still available) purports to provide serializability over multi-key transactions. In the HyperDex model that means over all defined spaces in the cluster. That's a stronger guarantee than DynamoDB provides, which is still susceptible to phantom reads. The DynamoDB team ought to be aware of it, because it's one of the first hits for "multi-key transactions" and the paper is an important one for designing transactions on KVS.
FaunaDB (mentioned in a previous comment) is multi-model NoSQL, so you can do relational queries, and it supports transactions across multiple partitions, documents, and replicas. JSON docs, not tables.
This is cool, it lifts the burden of having to bake "atomicity" into your app if you're using a key/value store like DynamoDB. I can see a nice balance of combining this with some built in error checking in the app itself.
I'd be interested to see comparisons/benchmarks against FoundationDB. DynamoDB transactions make dynamo a serious alternative to FDB now. I can see the two main advantages of FDB being: 1) you can deploy it on premise (which is potentially important for some B2B companies), 2) it shuffles data around so that hot-spotting of a cluster is eliminated (which dynamo appears to still suffer from).
Max 10 items per transaction, that's quite a restriction! I guess you have to plan all the transactions you would perform and make sure they meet the bounds.
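For anyone curious what the API looks like, here's a rough boto3 sketch (table and attribute names are made up); TransactItems is where the 10-item cap bites:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Both operations succeed or fail together; TransactItems is capped at 10 entries.
dynamodb.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "orders",
                "Item": {"pk": {"S": "order#42"}, "status": {"S": "placed"}},
                # Fail the whole transaction if the order already exists.
                "ConditionExpression": "attribute_not_exists(pk)",
            }
        },
        {
            "Update": {
                "TableName": "inventory",
                "Key": {"pk": {"S": "widget"}},
                "UpdateExpression": "SET qty = qty - :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
)
```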
[1] https://youtu.be/60QumD2QsF0?t=1021