A Tour of Amazon's DynamoDB (paperplanes.de)
143 points by timf on Jan 30, 2012 | 30 comments



I've been waiting for Mathias to write something like this, and I'm glad he didn't waste any time diving in. For those who don't know him, he has more breadth in the area of emerging datastores than anyone I know; he literally wrote the book on Riak[1].

I've also been acquainting myself with the DynamoDB API over the past week, and am building a node.js binding[2] that I hope will abstract away most of the esoteric aspects of interacting with it. It currently has full API coverage and is tested on Travis[3], so now I'm writing the high-level interface. So far that interface covers about half of the operations DynamoDB offers, but I'd love to hear any ideas/feedback.

    [1] http://riakhandbook.com/
    [2] https://github.com/jed/dynamo
    [3] http://travis-ci.org/jed/dynamo
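
For a sense of what gets abstracted away, here's roughly what a single call looks like on the wire. This is just a sketch of the DynamoDB_20111205 JSON protocol: the table and key names are made up, and the date, session-token, and signature headers a real client has to compute are omitted.

  // Raw GetItem against the 2011-12-05 API -- illustrative only.
  var https = require('https');

  var body = JSON.stringify({
    TableName: 'comments',                      // hypothetical table
    Key: { HashKeyElement: { S: 'comment-1' } } // string hash key
  });

  var req = https.request({
    host: 'dynamodb.us-east-1.amazonaws.com',
    method: 'POST',
    path: '/',
    headers: {
      'content-type': 'application/x-amz-json-1.0',
      'x-amz-target': 'DynamoDB_20111205.GetItem',
      'content-length': Buffer.byteLength(body)
      // ...plus the date, session-token and signature headers (omitted)
    }
  }, function(res) {
    var data = '';
    res.on('data', function(chunk) { data += chunk; });
    res.on('end', function() { console.log(JSON.parse(data)); });
  });

  req.end(body);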


Looks like we are building a very similar node.js client for dynamo. https://github.com/Wantworthy/dynode


I'm still struck by what a strong endorsement DynamoDB is of the design decisions we've been making in Cassandra over the last couple years. Composite keys, distributed counters, ...

More details: http://www.datastax.com/dev/blog/amazon-dynamodb


Dynamo as described in the original paper was not made into a web service.

Facebook took up the Dynamo torch and created Cassandra... then switched to HBase after trying to use Cassandra in production.

Have you ever stopped to wonder about what Amazon might have taken OUT of Dynamo before launching DynamoDB? Things that Cassandra still has and just might be holding it back?

That to me seems much more important than just saying "DynamoDB has feature X that Cassandra has had for years! We're so smart!"


There's less to write about there... One of the first Cassandra decisions was dropping vector clocks, reasoning that vector clocks weren't worth the additional complexity for 99% of uses (once you move from key/value to rows + columns). DynamoDB also switched to items/fields and dropped vector clocks (or at least does not expose them).

The original Dynamo was plain key/value with O(1) routing and vector clocks; there's not much else to strip out. :)
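
To make the dropped complexity concrete, here's a toy vector clock; purely illustrative, not Dynamo's (or anyone's) actual implementation:

  // Each replica increments its own entry on write.
  function increment(clock, node) {
    var next = {};
    for (var k in clock) next[k] = clock[k];
    next[node] = (next[node] || 0) + 1;
    return next;
  }

  // 'concurrent' is the case clients have to resolve themselves --
  // the burden Cassandra and DynamoDB chose not to expose.
  function compare(a, b) {
    var aLess = false, bLess = false;
    var keys = Object.keys(a).concat(Object.keys(b));
    for (var i = 0; i < keys.length; i++) {
      var av = a[keys[i]] || 0, bv = b[keys[i]] || 0;
      if (av < bv) aLess = true;
      if (av > bv) bLess = true;
    }
    if (aLess && bLess) return 'concurrent';
    if (aLess) return 'before';
    if (bLess) return 'after';
    return 'equal';
  }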

P.S. Facebook built Messages on HBase instead of Cassandra for political reasons rather than technical, and shards HBase to mitigate the availability problems it has otherwise. Facebook never ran an Apache Cassandra release in production.


I'm struck by how you somehow manage to pimp both Cassandra and DataStax on every post which has something to do with databases and/or datastores. :)


Didn't you folks design Cassandra after the Dynamo paper though?


Composite keys (multi-dimensional) and distributed counters were not in the Dynamo paper. They were added to Cassandra, which is indeed based partially on Dynamo.
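
For example, DynamoDB exposes its counters as an atomic ADD inside UpdateItem; the request body looks roughly like this (a sketch of the 20111205 API, with made-up table and attribute names):

  {
    "TableName": "pageviews",
    "Key": { "HashKeyElement": { "S": "home" } },
    "AttributeUpdates": {
      "views": { "Value": { "N": "1" }, "Action": "ADD" }
    }
  }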


Cassandra's initial team was led by Avinash Lakshman, who was one of the authors of Amazon's Dynamo. So it's not strange that Cassandra is hugely inspired by Dynamo (and vice versa).


Yes, but the Dynamo paper offers two operations, Get(key) and Put(key, value). Anything Cassandra adds on top of that they did themselves, e.g. column families and secondary indexes.


How does Cassandra stack up against HBase these days?

Both projects seem to be moving so quickly that it's really hard to find an up-to-date comparison.


DynamoDB is not open source, at least according to a blog comment by Jeff Barr of AWS. (http://aws.typepad.com/aws/2012/01/amazon-dynamodb-internet-...)

I'm certainly not going to commit to platform lock-in like that. I know a lot of folks were hit hard by Google's App Engine price changes, and it could happen again here with DynamoDB.


Then you don't use S3?


I don't actually, but I feel there is less lock-in. It wouldn't be hard to move off.

Also, the risk of S3's pricing changing dramatically is constrained by its wide adoption. If DynamoDB fails to become pervasive, it would be relatively easy for Amazon to raise the price to make an underutilized service profitable. OTOH, if DynamoDB does become as popular as S3, your exposure to that risk will be much lower.


It would be pretty easy to implement the DynamoDB API on top of HBase so I wouldn't worry too much about that.


There's an open implementation of S3.

http://open.eucalyptus.com/wiki/EucalyptusStorage_v1.4


I would second Mathias' thoughts in that this does feel more Bigtable/column-oriented than traditional Dynamo. Dynamo and its derivative Riak (not from Amazon) make no attempt to determine data types or schema in any way. The fact that DynamoDB can do range scans and has various counting features leads me to believe it is more hybridized, along the lines of Cassandra. Either way, it is a welcome addition to the NoSQL toolset developers have to choose from today. I will certainly look at it more closely.
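
For instance, the range scans surface through Query against a hash + range key; roughly like this (a sketch of the 20111205 API with made-up names, fetching one user's events between two timestamps):

  {
    "TableName": "events",
    "HashKeyValue": { "S": "user-123" },
    "RangeKeyCondition": {
      "AttributeValueList": [{ "N": "1327881600" }, { "N": "1327968000" }],
      "ComparisonOperator": "BETWEEN"
    },
    "ScanIndexForward": true
  }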


Actually, Riak can be used in a way that makes it content-aware of JSON and XML documents.

http://wiki.basho.com/Riak-Search---Indexing-and-Querying-Ri...


Search and indexing were add-ons to the initial Riak product, born from user demand. Only recently (1.0, I believe) was Search integrated into the main Riak distribution. Riak has no capability of updating values in place (i.e., a partial update) and no means of ordering or pagination. Some of these features are available via MapReduce queries or Search/indexing, as you noted.

I'm by no means harping on Riak. I actually use it on a number of projects. But reading about DynamoDB's capabilities does not conjure Riak, it conjures Cassandra and HBase.


I agree with you on other counts, but Riak is a far cry from the spartan implementation of Dynamo outlined in the paper.


Does anyone know which language DynamoDB is implemented in? I've read somewhere that SimpleDB is written in Erlang. Is that the case with DynamoDB as well? I've been reading about ets and dets in Erlang, and it makes me wonder whether they have anything to do with either of these data stores.


I haven't seen it written anywhere, but judging by the __type in errors like this, I would say Java:

  {"__type":"com.amazon.coral.validate#ValidationException",
  "message":"One or more parameter values were invalid:
  The provided key size does not match with that of the schema"}


Regarding pricing, it's interesting that caching apparently isn't an effective option for significantly reducing the bill, since data size is a big component of the price.
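
A back-of-the-envelope sketch, assuming the rates from the launch announcement ($1.00/GB-month for storage, $0.01/hour per 10 write units or per 50 read units; treat these numbers as assumptions and check current pricing):

  // Rough monthly cost model; rates are assumptions, not authoritative.
  function monthlyCost(gb, readUnits, writeUnits) {
    var hours   = 24 * 30;
    var storage = gb * 1.00;
    var reads   = (readUnits  / 50) * 0.01 * hours;
    var writes  = (writeUnits / 10) * 0.01 * hours;
    return { storage: storage, reads: reads, writes: writes,
             total: storage + reads + writes };
  }

  // 100 GB with 100 read units and 10 write units provisioned:
  // { storage: 100, reads: 14.4, writes: 7.2, total: 121.6 }
  // A cache can only attack the 14.4; the 100 stays.
  console.log(monthlyCost(100, 100, 10));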


That's because it is SSD-based, so size starts to matter again.


Yes, but real hardware with an SSD inside can deliver excellent performance with many reasonable on-disk databases, at a fraction of the cost per megabyte, even with many millions of keys.

Other than the fact that you don't have to manage the DB, which is a fair point, the best reason for many small-to-mid-sized businesses to use such a service is that they're already on EC2. That may work in the short term, but I see something odd about this model.


The API doesn't bother me as much as the temporary authentication via STS. Temporary credentials for database access? Seriously? Am I the only one who sees how ridiculous this is?


What's wrong with using temporary credentials?


They add useless latency and another point of failure. And sometimes, the AWS APIs DO fail.


Temporary credentials are used specifically to reduce latency: you get a token that is valid for a period of time and doesn't need to be checked against the auth service on every call. Since the credentials last for 12 hours, the work to retrieve them should be negligible. And because they don't need to be checked against the auth service on every call, they seem like they would be more resilient to API failure than standard AWS credentials.
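
The intended pattern is something like this sketch, where getSessionToken stands in for a real STS GetSessionToken call (a hypothetical helper, not an actual SDK function):

  var cached = null;

  // Fetch session credentials once and reuse them until shortly before
  // expiry; every DynamoDB request is then signed locally with the
  // cached secret, with no per-call round trip to the auth service.
  function withCredentials(getSessionToken, callback) {
    if (cached && Date.now() < cached.expiresAt - 60000) {
      return callback(null, cached);
    }
    getSessionToken({ DurationSeconds: 43200 }, function(err, creds) {
      if (err) return callback(err);
      cached = creds; // { accessKeyId, secretAccessKey, sessionToken, expiresAt }
      callback(null, cached);
    });
  }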


You still need to provide an access key id and secret access key along with the session token returned by the Security Token Service, and properly sign every request with those credentials. With a valid session token but invalid credentials, the request fails with HTTP 400: "__type: com.amazon.coral.service#InvalidSignatureException, message: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details."

So Amazon still checks the signature, with the same logic that could be used for IAM credentials that don't expire, and without needing a valid session token to do so. If static credentials somehow take more time to evaluate than temporary ones, that's an Amazon screw-up, not our fault; the signing logic runs on every API call either way. So (apparently?) we're back to square one: extra requests and wrapper logic for zero benefit, and a worse overall experience. The session credentials aren't stateful; I specifically checked for this behavior. Therefore what you describe doesn't seem to happen in reality.

And for the love of God, don't blindly trust the documentation or what the AWS folks say. As an AWS library author myself, I had a lot of fun debugging failed requests because the smarty who wrote the signing-procedure docs forgot to mention some HTTP headers that are mandatory to sign. I had to reverse engineer an official SDK in order to patch my own code: even though it was implemented exactly as the docs describe, the signing failed on every request, despite my having a valid session token. Trial and error forced by broken docs is always a broken way of developing things. If you're an AWS employee, please send my regards to the documentation folks.



