“ According to the Amazon Prime Day blog post, DynamoDB processes 126 million queries per second at peak. Spanner on the other hand processes 3 billion queries per second at peak, which is more than 20x higher, and has more than 12 exabytes of data under management.”

This comparison doesn’t seem entirely fair? My read is that Amazon’s 126 million queries per second came purely from Amazon’s own services serving Prime Day on DynamoDB, not from all of AWS.

What would perhaps have been a fairer comparison is the peak load that Google’s own services generate on Cloud Spanner, rather than the sum of all Spanner usage across all of GCP and all of Google (Spanner on non-GCP infra).

I will say that it would show a massive vote of confidence to state that Photos, Gmail and Ads heavily rely on GCP infra: that would be brand new information for me! It would add to that confidence to learn more about how they use it, and whether Cloud Spanner is on the critical path for those services.

What is confusing, however, is how in this article "Cloud Spanner" is consistently used... except when talking about Gmail, Ads and Photos, where it's stated that "Spanner" is used by these products, not "Cloud Spanner"! As if they were not using the Cloud Spanner infra, but their own. It would help to know which is the case, and what the load on Cloud Spanner is: not Spanner running on internal Google infra that is not GCP.

At Amazon, practically every service is built on top of AWS - a proper vote of confidence! - and my impression was that GCP had historically been far less utilised by Google for their own services. Even in this post, I'm still confused and unable to tell if those Google products listed use Cloud Spanner or their own infra running Spanner.




From the AWS blog post they referenced:

> DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the Amazon.com sites, and all Amazon fulfillment centers. Over the course of Prime Day, these sources made trillions of calls to the DynamoDB API. DynamoDB maintained high availability while delivering single-digit millisecond responses and peaking at 126 million requests per second.

Amazon was very, very clear on this. For Google to use that number without the caveat is just completely underhanded and dishonest. Whoever wrote this is absolutely lacking in integrity.


I used DynamoDB as part of a job a few years ago and never got single-digit-millisecond responses - it was 20ms minimum and 70ms+ on a cold start, but I can accept that optimising Dynamo's various indexes is a largely opaque process. We had to add on hacks like setting the request timeout to 5ms and keeping the cluster warm by submitting a no-op query every 500ms to keep it even remotely stable. We couldn't even use DAX because the Ruby client didn't support it. At the start we only had a couple of thousand rows in the table, so it would have legit been faster to scan the entire table and do the rest in memory. Postgres did it in 5ms.
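
For the curious, the keep-warm hack looked roughly like the following. This is a minimal sketch in Python/boto3 rather than our actual Ruby stack, with made-up table/key names, and boto3 only approximates the 5ms timeout, so treat the numbers as illustrative:

    import threading

    import boto3
    from botocore.config import Config

    # Aggressively low timeouts plus retries, in the spirit of the 5ms hack.
    cfg = Config(
        connect_timeout=0.1,           # seconds
        read_timeout=0.05,             # fail fast and lean on retries
        retries={"max_attempts": 3},
        tcp_keepalive=True,            # keep connections open between calls
    )
    ddb = boto3.client("dynamodb", config=cfg)

    def keep_warm():
        # No-op read every 500ms so connections never go cold.
        # "warm-table" and "pk" are hypothetical names.
        try:
            ddb.get_item(TableName="warm-table", Key={"pk": {"S": "ping"}})
        finally:
            threading.Timer(0.5, keep_warm).start()

    keep_warm()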

If Amazon said they didn't use DAX that day I would say they were lying.

The average consumer or startup is not going to squeeze out of Dynamo the performance that AWS is claiming to have achieved.

In fact, it might have been fairer in Ruby if they didn't hard-code the net client (Net/HTTP). I imagine performance could have been boosted by injecting an alternative.


No need to guess when you can measure.

I am running https://cloud-canary.com, a service where I monitor primary AWS services for latency and availability.

It comes with a lot of data.

For instance, this is the latency I see doing operations against Dynamo:

https://cloudcanary.grafana.net/public-dashboards/c53e2092d6...
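
For context, a canary like this boils down to timing SDK calls in a loop and recording the result. A minimal sketch of the idea in Python/boto3 (simplified, and the table name is made up):

    import time

    import boto3

    ddb = boto3.client("dynamodb")

    def probe_ms():
        # Time a single read against a dedicated canary table.
        start = time.perf_counter()
        ddb.get_item(TableName="canary", Key={"pk": {"S": "probe"}})
        return (time.perf_counter() - start) * 1000

    while True:
        print(f"dynamodb get_item latency: {probe_ms():.1f} ms")
        time.sleep(60)  # one sample per minute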


What a cool lil side project/company! Going to circulate this among friends...

Little bit of well-meaning advice: this needs copy editing -- inconsistent use of periods, typos, grammar. Little crap that doesn't matter in the big picture, but will block some from opening their wallets. :) ("OpenTeletry", "performances", etc.)

All in all this is quite cool, and I hope you get some customers and gather more data! (a 4k object size in S3 doesn't make sense to measure, but 1MB might be interesting. Also, check out HDRHistogram, it might be relevant to your interests)
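
If you do pick up HdrHistogram, the Python package makes percentile summaries nearly free. A quick sketch with the hdrhistogram package, assuming latencies recorded in milliseconds and made-up sample values:

    from hdrh.histogram import HdrHistogram

    # Track values from 1ms to 1 hour with 3 significant digits.
    hist = HdrHistogram(1, 3_600_000, 3)

    for latency_ms in (12, 14, 13, 95, 12, 13, 870):  # sample data
        hist.record_value(latency_ms)

    for p in (50, 99, 99.9):
        print(f"p{p}: {hist.get_value_at_percentile(p)} ms")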


Thanks!

Any feedback is appreciated!

I picked 4k as a near no-op against S3: something that takes very little time but still does some work.

I will definitely consider increasing it!


Nice dash - if you don't mind a drive-by recommendation: I use Grafana for work a lot and it's nice to see a table legend with min, max, mean, and last metrics for these kinds of dashboards. Really makes it easy to grok without hovering over data points and guessing.


What is more important to me when using Grafana (though a summary legend is too) is actually units: knowing whether a value is seconds, milliseconds, or microseconds, and whether 0.5 is a quantile or something else.

Numbers without units are dangerous in my opinion.


Thanks a lot!

I'll definitely update it!


What a cool service. Congratulations!


> We had to add on hacks like setting the request timeout to 5ms and keeping the cluster warm by submitting a no-op query every 500ms to keep it even remotely stable.

This sounds like you're blaming dynamo for you/your stack's inability to handle connections / connection pooling.


Yeah that TLS handshake is an absolute killer if you run it for every request.
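
The usual culprit being a fresh client (and thus a fresh connection pool) per request instead of one long-lived client. A hedged sketch of the difference in Python/boto3, with made-up table/key names:

    import boto3
    from botocore.config import Config

    # Anti-pattern: a new client per request gets a new connection pool,
    # so the call pays the full TCP + TLS handshake.
    def slow_read(key):
        client = boto3.client("dynamodb")
        return client.get_item(TableName="items", Key={"pk": {"S": key}})

    # Better: one shared client reuses pooled, kept-alive connections.
    shared = boto3.client("dynamodb", config=Config(tcp_keepalive=True))

    def fast_read(key):
        return shared.get_item(TableName="items", Key={"pk": {"S": key}})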


Been using DynamoDB for years and haven’t had to do any of the hacks you talk about. Not using Ruby, though. TCP keep-alive does help with perf (which I think you might be suggesting).

I don’t have p99 times in front of me right this second but it’s definitely lower than 20ms for reads and likely lower for writes. (EC2 in VPC).


They know very well that people don't read sh* anymore. Just throw numbers out there, PowerPoint them, and offer an "unbiased" comparison where Google shines - buy Google.

Worst case scenario, it's Google you're buying, not a random startup etc.


Google doesn't have a great track record of not killing products. No support and randomly killing stuff is not a good basis for a business relationship.


Just as a hand in the air... be careful about what you're comparing here. The number of API calls over a period of time is largely irrelevant next to QPS. I could happily write a DDoS script that bombards a service with a massive total number of calls, but if the rate stays low it doesn't matter. So sure, trillions of API calls were made (still impressive in the scope of the overall network of services, I'm not downplaying that), but ultimately, for DynamoDB and Spanner, it's the QPS that matters when comparing DB scaling and performance.


Google calls API calls “queries”… because of their history as a search engine. QPS == API calls per second == requests per second.

That said, I can’t imagine these numbers mean much to anyone after a certain point. It’s not like either company is running a single service handling them. The scale is limited by their budget and access to servers because my traffic shouldn’t impact yours. I feel like the better number is RPS/QPS per table or per logical database or whatever.


Yes, but QPS vs. "queries to the API": the difference is the time slice. I should have been more explicit. The key here really is the time denominator behind the numbers. That the AWS blog calls out trillions of API calls isn't relevant because there is no specific time denominator attached. The 126M QPS is the important stat.
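
(Rough arithmetic, assuming an event window of about 48 hours: 126 million requests/second sustained for 172,800 seconds is about 2.2 x 10^13 calls, i.e. tens of trillions. The two figures are consistent; the per-second rate is just the far more informative one.)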


We shared some details about Gmail's migration to Spanner in this year's developer keynote at Google Cloud Next [0] - to my knowledge, the first time that story has been publicly talked about.

[0] https://www.youtube.com/watch?v=268jdNwH6AM


I tried to find it in this video, but failed. Could you please share a timestamp for where to look?

It’s a pretty big deal if Gmail migrated to GCP-provided Spanner (not to an internal Spanner instance), and it sounds like the kind of vote of confidence GCP and Cloud Spanner could benefit from: might I suggest writing about it? A post is easier to digest and harder to miss than an hour-long keynote video with no timestamps.

And so just to confirm: Gmail is on Cloud Spanner for the backend?


It's almost certainly not the case that Gmail uses Cloud Spanner rather than internal Spanner. I don't think Cloud Spanner (or most of Google's cloud products) has the feature set required to support loads like Gmail's (both in terms of technical capability and security/privacy features).

When I worked at Google, I tried to get more services to migrate to the cloud, but the internal environment built up over 25 years is much better at supporting billion+ users with private data.


And yet, if they do, that's probably one of the best sales pitches they could have - dogfooding. After all, isn't that also how AWS started, just reselling the services and servers they already use themselves?

It doesn't make much sense to have a 'better' version of a product you sell but keep it internal.


Yet Amazon Retail still doesn't use DynamoDB for its critical workloads. They still rely on an internal version of DynamoDB (Sable) which is optimized for Retail workloads.


It makes sense because the public will not use the internal APIs which have non-standard wire protocols, weird authentication schemes, etc.


Looks like it starts at 50:45. YouTube recently made it so you can click "show transcript" in the description, and then Ctrl-F takes you to all the mentions. Very helpful for long videos like this.


It looks like the Spanner beta dropped to the public in 2017, so < 8 years ago: https://cloud.google.com/spanner/docs/release-notes#February...

I don't think they would've migrated again to GCP Spanner (even if it would've been a show of faith).


Here's the link with timestamp (note that the speaker says it was a 2 year transition):

https://www.youtube.com/live/268jdNwH6AM?si=WkgnvqaIwFidt-hc...


Gmail is on Spanner, and Cloud Spanner is on Spanner.


In the timestamped video link shared downthread, the speaker does seem to strongly imply that gWorkspace doesn’t manage the infra. When he finishes explaining the migration he declares (around 55:18): “[…] we can focus on the business of Gmail and Spanner can choose to improve and deliver performance gains automagically [sic]”, which would imply, to me at least, that it’s on GCP.


That's not what it implied to me. To me, it meant that they adopted an internal managed Spanner with its own SRE team, instead of running their own Spanner. In the past, Gmail ran their own [[redacted]]s and [[redacted]] even though there were company-wide managed services for those things.


Agree, but with the caveat that [[redacted]] and [[redacted]] were old and originally designed to be run that way. All newer storage systems I can recall were designed to be run by a central team after many years of experience doing it the other way. And many tears shed over migrating to those centralized versions.

Source: I was on the last team running our own [[redacted]].


Thanks! Looks really interesting.

Link with timestamp:

https://www.youtube.com/watch?v=268jdNwH6AM&t=3020


Wow, almost content-free presentation! How obnoxious!

This wasn't the first time Gmail has replaced the storage backend in flight. The last time, around 2011, they didn't hype it up; they called it "a storage software update" in public comms. That migration is also the origin of the term "spannacle": the accounts that resisted moving from [[redacted]] to [[redacted]] were called barnacles.


Somehow I thought you were at Amazon/AWS because of how much you push it in your book. Cool to see you’re at GCP.


> I will say that it does show a vote of confidence to say that Photos, Gmail and Ads use GCP infra,

I'm not sure? I guess I'm mostly not sure what "gcp infra" means there. The blog post says

"Spanner is used ubiquitously inside of Google, supporting services such as; Ads, Gmail and Photos."

But there's Google-internal Spanner, and GCP Spanner. A service using Spanner at Google isn't necessarily using GCP. (No clue about Photos, Gmail, etc.)

Granted, from what I gather, there's a lot more similarity between Spanner & GCP Spanner than between, e.g., Borg and Kubernetes.


borg and k8s are completely unrelated bits of software with roughly similar goals.

gcp spanner and normal spanner are different deployments of the same code.


>different deployments

Which can be the difference between 99.99% availability and 99% availability with data corruption issues. Not saying that's the case here, but one should not downplay the difference deployments can make.


Surely in a post about Google Cloud Spanner, all examples mentioned use Google Cloud Spanner? It would be moot listing them as examples if they did not: so my assumption is that they are all already using GCP infra for Spanner.

I really want to give Google the benefit of the doubt, but it doesn't help that they did not write that, e.g., Gmail is using "Cloud Spanner". They wrote that it uses Spanner.


This is putting a lot of faith in GCP advertising. I strongly doubt the idea that the Google workloads discussed are deployed on GCP instead of internal Borg infrastructure.


Years ago they did a reorg and moved all infrastructure services under Cloud even though they are not Cloud products. That would enable this kind of obfuscation because Cloud is literally responsible for both Cloud Spanner and non-Cloud Spanner and they can conflate these two in their marketing copy. They probably feel justified in doing so because they share so much code.


Considering that most of Google does not run on GCP, I would not give them the benefit of the doubt.


Photos, Gmail, and Ads use Spanner, not Cloud Spanner.

Apparently Cloud Spanner doesn't support protobuf columns? It would be hard for any internal Google product to use it under that restriction.


Infra and Cloud Spanner are the same stack. Having those services run on infra is more about the legacy tooling needed to shift them than about performance or the ability to handle the load.


> This comparison doesn’t seem entirely fair? My read is that Amazon’s 126 million queries per second came purely from Amazon’s own services serving Prime Day on DynamoDB, not from all of AWS.

There's no indication that Google is talking about ALL of Spanner either? The examples they list are all internal Google services, and they specifically say "inside Google".

I'm also dubious that, even with all of the AWS usage accounted for, DynamoDB tops Spanner if Amazon themselves are only at 126 million queries per second on Prime Day.


> At Amazon, practically every service is built on top of AWS - a proper vote of confidence!

Not only this, but most, if not all, AWS services use DynamoDB, including for use cases that databases aren't usually used for, such as multi-tenant job queues (just search "Database as a Queue" to get the sentiment). In fact, it is really, really hard to use any relational DB in AWS. I mean, a team would have to go through CEO approval to get an exception, which says a lot about the robustness of DDB.


Eh, this isn't accurate. Both Redshift and Aurora/RDS are used heavily by a lot of teams internally. If you're talking specifically about the primary data store for live applications, NoSQL was definitely recommended/pushed much harder than SQL, but it by no means required CEO approval to not use DDB.

Edit: It's possible you're limiting your statement specifically to AWS teams, which would make it more accurate, but I read the use of "Amazon" in the quote you were replying to as including things like retail as well, etc.


Yeah, within AWS. I'm not sure about other parts of Amazon.


When I was at AWS, towards the later part of my tenure, DynamoDB was mandated for the control plane. To be fair, it worked, and worked well, but there were times when I wished I could use something else instead.


> What would perhaps have been a fairer comparison is to share the peak load that Google services running on GCP generated on Spanner, and not the sum across their whole cloud platform.

Not necessarily about volume of transactions, but this is similar to one of my pet peeves with statements that use aggregated numbers for compute power.

"Our system has great performance, dealing 5 billion requests per second" means nothing if you don't break down how many RPS per instance of compute unit (e.g. CPU).

Scales of performance are relative, and in a distributed architecture, most systems can scale just by throwing more compute at the problem.
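
(Made-up illustration: 5 billion RPS spread across 40,000 hosts with 64 cores each is 5e9 / 2.56e6, or roughly 1,950 RPS per core - a far less impressive, and far more comparable, number.)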


Yeah, I've seen some pretty sneaky candidates try that on their resumes. They aggregate the RPS for all the instances of their services even though the instances don't share any dependencies or infrastructure; they're just independent instances/clusters running the same code. When I dug into those impressive numbers and asked how they managed coordination/consensus, the truth came out.


True, but one would hope that both sides in this case are putting their best foot forward, and getting peak performance out of right-sizing your DB is part of that discussion. I can't imagine AWS would put down "126 million QPS" if they COULD have provided a larger instance that delivered "200 million QPS", right? We have to assume at some point that both sides are showing the best their service can do.


The 126M QPS number was certainly just the parts of Amazon.com retail that power Prime Day, not all of DDB's traffic. If we were to add up all of DDB's volume, it would be way higher. At least an order of magnitude, if not more.

Large parts of AWS itself use DDB - both control plane and data plane. For instance, every message sent to AWS IoT internally translates into multiple DDB calls (reads and writes) as it flows through the different parts of the system. IoT alone is millions of RPS, and that is just one small-ish AWS service.

Source: Worked at AWS for 12 years.


Put yourself in the shoes of who they're targeting with that.

They're probably dealing with thousands of requests per second, but want to say they're building something that can scale to billions of requests per second to justify their choices, so there they go.


Frankly it's a bit weird to see this kind of dick measuring in a product blog post from the "Director of Engineering" :/


s/the "Director of Engineering"/a "Director of Engineering"/

There are many engineering directors at Google.


Director is what, L8? There's a ton of those.


And only one attributed to the blog post.

swish


True, and even worse, it's inaccurate dick measuring.


> At Amazon, practically every service is built on top of AWS

Is that finally true? It sure wasn't in the 2020-2021 timeframe.


It does depend on what you mean. By 2020/2021, effectively everything was on top of AWS VMs/VPCs and perhaps LBs at that point. Most if not all new services were being built in NAWS.


SPS was heavily MAWS and I got sick of being the NAWS person from years prior pushing for NAWS in our dysfunctional team, and quit. The good coworkers also quit.

Yet I still see the very deep stack of technically incapable middle manager sorts dutifully posting "come join us" nonsense on LinkedIn.

(I had the luxury of having worked in one of the inner sanctums of Apple hardware for years prior, so was immune to nonsense, and didn't need the job.)



