“ According to the Amazon Prime Day blog post, DynamoDB processes 126 million queries per second at peak. Spanner on the other hand processes 3 billion queries per second at peak, which is more than 20x higher, and has more than 12 exabytes of data under management.”

This comparison doesn’t seem entirely fair? My read is that Amazon’s 126 million queries per second came purely from Amazon’s own services serving Prime Day on DynamoDB, not from all of AWS.

What would perhaps have been a fairer comparison is the peak load that Google’s own services generate on Cloud Spanner, rather than the sum of all Spanner usage across all of GCP and all of Google (Spanner on non-GCP infra).

I will say that it would show a massive vote of confidence to state that Photos, Gmail and Ads heavily rely on GCP infra: that would be brand new information for me! It would add to that confidence to learn more about how they use it, and whether Cloud Spanner is on the critical path for those services.

What is confusing, however, is how in this article "Cloud Spanner" is consistently used... except when talking about Gmail, Ads and Photos, where it's stated that "Spanner" is used by these products, not "Cloud Spanner"! As if they were not using the Cloud Spanner infra, but their own. It would help to know which is the case, and what the load on Cloud Spanner is: not Spanner running on internal Google infra that is not GCP.

At Amazon, practically every service is built on top of AWS - a proper vote of confidence! - and my impression was that GCP had historically been far less utilised by Google for their own services. Even in this post, I'm still confused and unable to tell if those Google products listed use Cloud Spanner or their own infra running Spanner.




From the AWS blog post they referenced:

> DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the Amazon.com sites, and all Amazon fulfillment centers. Over the course of Prime Day, these sources made trillions of calls to the DynamoDB API. DynamoDB maintained high availability while delivering single-digit millisecond responses and peaking at 126 million requests per second.

Amazon was very, very clear on this. For Google to use that number without the caveat is just completely underhanded and dishonest. Whoever wrote this is absolutely lacking in integrity.


I used DynamoDB as part of a job a few years ago and never got single-digit-millisecond responses - it was 20ms minimum and 70ms+ on a cold start, but I can accept that optimising Dynamo's various indexes is a largely opaque process. We had to add on hacks like setting the request timeout to 5ms and keeping the cluster warm by submitting a no-op query every 500ms to keep it even remotely stable. We couldn't even use DAX because the Ruby client didn't support it. At the start we only had a couple of thousand rows in the table, so it would have legit been faster to scan the entire table and do the rest in memory. Postgres did it in 5ms.
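
For the curious, the keep-warm hack looked roughly like the following. This is a minimal sketch in Python/boto3 rather than our actual Ruby stack, with made-up table/key names, and boto3 only approximates the 5ms timeout, so treat the numbers as illustrative:

    import threading

    import boto3
    from botocore.config import Config

    # Aggressively low timeouts plus retries, in the spirit of the 5ms hack.
    cfg = Config(
        connect_timeout=0.1,           # seconds
        read_timeout=0.05,             # fail fast and lean on retries
        retries={"max_attempts": 3},
        tcp_keepalive=True,            # keep connections open between calls
    )
    ddb = boto3.client("dynamodb", config=cfg)

    def keep_warm():
        # No-op read every 500ms so connections never go cold.
        # "warm-table" and "pk" are hypothetical names.
        try:
            ddb.get_item(TableName="warm-table", Key={"pk": {"S": "ping"}})
        finally:
            threading.Timer(0.5, keep_warm).start()

    keep_warm()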

If Amazon said they didn't use DAX that day I would say they were lying.

The average consumer or startup is not going to squeeze out of Dynamo the performance that AWS is claiming to have achieved.

In fact, it might have been fairer in Ruby if they didn't hard-code the net client (Net/HTTP). I imagine performance could have been boosted by injecting an alternative.


No need to guess when you can measure.

I am running https://cloud-canary.com, a service where I monitor primary AWS services for latency and availability.

It comes with a lot of data.

For instance, this is the latency I see doing operations against Dynamo:

https://cloudcanary.grafana.net/public-dashboards/c53e2092d6...
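
For context, a canary like this boils down to timing SDK calls in a loop and recording the result. A minimal sketch of the idea in Python/boto3 (simplified, and the table name is made up):

    import time

    import boto3

    ddb = boto3.client("dynamodb")

    def probe_ms():
        # Time a single read against a dedicated canary table.
        start = time.perf_counter()
        ddb.get_item(TableName="canary", Key={"pk": {"S": "probe"}})
        return (time.perf_counter() - start) * 1000

    while True:
        print(f"dynamodb get_item latency: {probe_ms():.1f} ms")
        time.sleep(60)  # one sample per minute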


What a cool lil side project/company! Going to circulate this among friends...

Little bit of well-meaning advice: this needs copy editing -- inconsistent use of periods, typos, grammar. Little crap that doesn't matter in the big picture, but will block some from opening their wallets. :) ("OpenTeletry", "performances", etc.)

All in all this is quite cool, and I hope you get some customers and gather more data! (a 4k object size in S3 doesn't make sense to measure, but 1MB might be interesting. Also, check out HDRHistogram, it might be relevant to your interests)
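
If you do pick up HdrHistogram, the Python package makes percentile summaries nearly free. A quick sketch with the hdrhistogram package, assuming latencies recorded in milliseconds and made-up sample values:

    from hdrh.histogram import HdrHistogram

    # Track values from 1ms to 1 hour with 3 significant digits.
    hist = HdrHistogram(1, 3_600_000, 3)

    for latency_ms in (12, 14, 13, 95, 12, 13, 870):  # sample data
        hist.record_value(latency_ms)

    for p in (50, 99, 99.9):
        print(f"p{p}: {hist.get_value_at_percentile(p)} ms")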


Thanks!

Any feedback is appreciated!

I picked 4k as a near no-op against S3: something that takes very little time but still does some work.

I will definitely consider increasing it!


Nice dash - if you don't mind a drive-by recommendation: I use Grafana for work a lot and it's nice to see a table legend with min, max, mean, and last metrics for these kinds of dashboards. Really makes it easy to grok without hovering over data points and guessing.


What is more important to me when using Grafana (though a summary legend is too) is actually units: knowing whether a value is seconds, milliseconds, or microseconds, and whether 0.5 is a quantile or something else.

Numbers without units are dangerous in my opinion.


Thanks a lot!

I'll definitely update it!


What a cool service. Congratulations!


> We had to add on hacks like setting the request timeout to 5ms and keeping the cluster warm by submitting a no-op query every 500ms to keep it even remotely stable.

This sounds like you're blaming dynamo for you/your stack's inability to handle connections / connection pooling.


Yeah that TLS handshake is an absolute killer if you run it for every request.
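
The usual culprit being a fresh client (and thus a fresh connection pool) per request instead of one long-lived client. A hedged sketch of the difference in Python/boto3, with made-up table/key names:

    import boto3
    from botocore.config import Config

    # Anti-pattern: a new client per request gets a new connection pool,
    # so the call pays the full TCP + TLS handshake.
    def slow_read(key):
        client = boto3.client("dynamodb")
        return client.get_item(TableName="items", Key={"pk": {"S": key}})

    # Better: one shared client reuses pooled, kept-alive connections.
    shared = boto3.client("dynamodb", config=Config(tcp_keepalive=True))

    def fast_read(key):
        return shared.get_item(TableName="items", Key={"pk": {"S": key}})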


Been using DynamoDB for years and haven’t had to do any of the hacks you talk about. Not using Ruby, though. TCP keep-alive does help with perf (which I think you might be suggesting).

I don’t have p99 times in front of me right this second but it’s definitely lower than 20ms for reads and likely lower for writes. (EC2 in VPC).


They know very well that people don't read sh* anymore. Just throw numbers out there, PowerPoint them, and offer an "unbiased" comparison where Google shines - buy Google.

Worst case scenario, it's Google you're buying, not a random startup etc.


Google doesn't have a great track record of not killing products. No support and randomly killing stuff is not a good basis for a business relationship.


Just as a hand in the air... be careful about what you're comparing here. The number of API calls over a period of time is largely irrelevant next to QPS. I could happily write a DDoS script that bombards a service with a massive total number of calls, but if the rate stays low it doesn't matter. So sure, trillions of API calls were made (still impressive in the scope of the overall network of services, I'm not downplaying that), but ultimately, for DynamoDB and Spanner, it's the QPS that matters when comparing DB scaling and performance.


Google calls API calls “queries”… because of their history as a search engine. QPS == API calls per second == requests per second.

That said, I can’t imagine these numbers mean much to anyone after a certain point. It’s not like either company is running a single service handling them. The scale is limited by their budget and access to servers because my traffic shouldn’t impact yours. I feel like the better number is RPS/QPS per table or per logical database or whatever.


Yes, but QPS vs. "queries to the API": the difference is the time slice. I should have been more explicit. The key here really is the time denominator behind the numbers. That the AWS blog calls out trillions of API calls isn't relevant because there is no specific time denominator attached. The 126M QPS is the important stat.
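
(Rough arithmetic, assuming an event window of about 48 hours: 126 million requests/second sustained for 172,800 seconds is about 2.2 x 10^13 calls, i.e. tens of trillions. The two figures are consistent; the per-second rate is just the far more informative one.)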


We shared some details about Gmail's migration to Spanner in this year's developer keynote at Google Cloud Next [0] - to my knowledge, the first time that story has been publicly talked about.

[0] https://www.youtube.com/watch?v=268jdNwH6AM


I tried to find it in this video, but failed. Could you please share a timestamp for where to look?

It’s a pretty big deal if Gmail migrated to GCP-provided Spanner (not to an internal Spanner instance), and it sounds like the kind of vote of confidence GCP and Cloud Spanner could benefit from: might I suggest writing about it? A post is easier to digest and harder to miss than an hour-long keynote video with no timestamps.

And so just to confirm: Gmail is on Cloud Spanner for the backend?


It's almost certainly not the case that Gmail uses Cloud Spanner rather than internal Spanner. I don't think Cloud Spanner (or most of Google's cloud products) has the feature set required to support loads like Gmail's (both in terms of technical capability and security/privacy features).

When I worked at Google, I tried to get more services to migrate to the cloud, but the internal environment built up over 25 years is much better at supporting billion+ users with private data.


And yet, if they do, that's probably one of the best sales pitches they could have - dogfooding. After all, isn't that also how AWS started, just reselling the services and servers they already use themselves?

It doesn't make much sense to have a 'better' version of a product you sell but keep it internal.


Yet Amazon Retail still doesn't use DynamoDB for its critical workloads. They still rely on an internal version of DynamoDB (Sable) which is optimized for Retail workloads.


It makes sense because the public will not use the internal APIs which have non-standard wire protocols, weird authentication schemes, etc.


Looks like it starts at 50:45. YouTube recently made it so you can click "show transcript" in the description, and then Ctrl-F takes you to all the mentions. Very helpful for long videos like this.


It looks like the Spanner beta dropped to the public in 2017, so < 8 years ago: https://cloud.google.com/spanner/docs/release-notes#February...

I don't think they would've migrated again to GCP Spanner (even if it would've been a show of faith).


Here's the link with timestamp (note that the speaker says it was a 2 year transition):

https://www.youtube.com/live/268jdNwH6AM?si=WkgnvqaIwFidt-hc...


Gmail is on Spanner, and Cloud Spanner is on Spanner.


In the timestamped video link shared downthread, the speaker does seem to strongly imply that gWorkspace doesn’t manage the infra. When he finishes explaining the migration he declares (around 55:18): “[…] we can focus on the business of Gmail and Spanner can choose to improve and deliver performance gains automagically [sic]”, which would imply, to me at least, that it’s on GCP.


That's not what it implied to me. To me, it meant that they adopted an internal managed Spanner with its own SRE team, instead of running their own Spanner. In the past, Gmail ran their own [[redacted]]s and [[redacted]] even though there were company-wide managed services for those things.


Agree, but with the caveat that [[redacted]] and [[redacted]] were old and originally designed to be run that way. All newer storage systems I can recall were designed to be run by a central team after many years of experience doing it the other way. And many tears shed over migrating to those centralized versions.

Source: I was on the last team running our own [[redacted]].


Thanks! Looks really interesting.

Link with timestamp:

https://www.youtube.com/watch?v=268jdNwH6AM&t=3020


Wow, almost content-free presentation! How obnoxious!

This wasn't the first time Gmail has replaced the storage backend in flight. The last time, around 2011, they didn't hype it up; they called it "a storage software update" in public comms. That migration is also the origin of the term "spannacle": the accounts that resisted moving from [[redacted]] to [[redacted]] were called barnacles.


Somehow I thought you were at Amazon/AWS because of how much you push it in your book. Cool to see you’re at GCP.


> I will say that it does show a vote of confidence to say that Photos, Gmail and Ads use GCP infra,

I'm not sure? I guess I'm mostly not sure what "gcp infra" means there. The blog post says

"Spanner is used ubiquitously inside of Google, supporting services such as; Ads, Gmail and Photos."

But there's Google-internal Spanner, and GCP Spanner. A service using Spanner at Google isn't necessarily using GCP. (No clue about Photos, Gmail, etc.)

Granted, from what I gather, there's a lot more similarity between Spanner & GCP Spanner than between, e.g., Borg and Kubernetes.


borg and k8s are completely unrelated bits of software with roughly similar goals.

gcp spanner and normal spanner are different deployments of the same code.


>different deployments

Which can be the difference between 99.99% availability and 99% availability with data corruption issues. Not saying that's the case here, but one should not downplay the difference deployments can make.


Surely in a post about Google Cloud Spanner, all examples mentioned use Google Cloud Spanner? It would be moot listing them as examples if they did not: so my assumption is that they are all already using GCP infra for Spanner.

I really want to give Google the benefit of the doubt, but it doesn't help that they did not write that, e.g., Gmail is using "Cloud Spanner". They wrote that it uses Spanner.


This is putting a lot of faith in GCP advertising. I strongly doubt the idea that the Google workloads discussed are deployed on GCP instead of internal Borg infrastructure.


Years ago they did a reorg and moved all infrastructure services under Cloud even though they are not Cloud products. That would enable this kind of obfuscation because Cloud is literally responsible for both Cloud Spanner and non-Cloud Spanner and they can conflate these two in their marketing copy. They probably feel justified in doing so because they share so much code.


Considering that most of Google does not run on GCP, I would not give them the benefit of the doubt.


Photos, Gmail, and Ads use Spanner, not Cloud Spanner.

Apparently Cloud Spanner doesn't support protobuf columns? It would be hard for any internal Google product to use it under that restriction.


Infra and Cloud Spanner are the same stack. Having those services run on infra is more about the legacy tooling needed to shift them than about performance or the ability to handle the load.


> This comparison doesn’t seem entirely fair? My read is that Amazon’s 126 million queries per second came purely from Amazon’s own services serving Prime Day on DynamoDB, not from all of AWS.

There's no indication that Google is talking about ALL of Spanner either? The examples they list are all internal Google services, and they specifically say "inside Google".

I'm also dubious that, even with all of the AWS usage accounted for, DynamoDB tops Spanner if Amazon themselves are only at 126 million queries per second on Prime Day.


> At Amazon, practically every service is built on top of AWS - a proper vote of confidence!

Not only this, but most, if not all, AWS services use DynamoDB, including for use cases that databases aren't usually used for, such as multi-tenant job queues (just search "Database as a Queue" to get the sentiment). In fact, it is really, really hard to use any relational DB in AWS. I mean, a team would have to go through CEO approval to get an exception, which says a lot about the robustness of DDB.


Eh, this isn't accurate. Both Redshift and Aurora/RDS are used heavily by a lot of teams internally. If you're talking specifically about the primary data store for live applications, NoSQL was definitely recommended/pushed much harder than SQL, but it by no means required CEO approval to not use DDB.

Edit: It's possible you're limiting your statement specifically to AWS teams, which would make it more accurate, but I read the use of "Amazon" in the quote you were replying to as including things like retail as well, etc.


Yeah, within AWS. I'm not sure about other parts of Amazon.


When I was at AWS, towards the later part of my tenure, DynamoDB was mandated for the control plane. To be fair, it worked, and worked well, but there were times when I wished I could use something else instead.


> What would perhaps have been a fairer comparison is to share the peak load that Google services running on GCP generated on Spanner, and not the sum across their whole cloud platform.

Not necessarily about volume of transactions, but this is similar to one of my pet peeves with statements that use aggregated numbers for compute power.

"Our system has great performance, dealing 5 billion requests per second" means nothing if you don't break down how many RPS per instance of compute unit (e.g. CPU).

Scales of performance are relative, and in a distributed architecture, most systems can scale just by throwing more compute at the problem.
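
(Made-up illustration: 5 billion RPS spread across 40,000 hosts with 64 cores each is 5e9 / 2.56e6, or roughly 1,950 RPS per core - a far less impressive, and far more comparable, number.)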


Yeah, I've seen some pretty sneaky candidates try that on their resumes. They aggregate the RPS for all the instances of their services even though the instances don't share any dependencies or infrastructure; they're just independent instances/clusters running the same code. When I dug into those impressive numbers and asked how they managed coordination/consensus, the truth came out.


True, but one would hope that both sides in this case are putting their best foot forward, and getting peak performance out of right-sizing your DB is part of that discussion. I can't imagine AWS would put down "126 million QPS" if they COULD have provided a larger instance that delivered "200 million QPS", right? We have to assume at some point that both sides are showing the best their service can do.


The 126M QPS number was certainly just the parts of Amazon.com retail that power Prime Day, not all of DDB's traffic. If we were to add up all of DDB's volume, it would be way higher. At least an order of magnitude, if not more.

Large parts of AWS itself use DDB - both control plane and data plane. For instance, every message sent to AWS IoT internally translates into multiple DDB calls (reads and writes) as it flows through the different parts of the system. IoT alone is millions of RPS, and that is just one small-ish AWS service.

Source: Worked at AWS for 12 years.


Put yourself in the shoes of who they're targeting with that.

They're probably dealing with thousands of requests per second, but want to say they're building something that can scale to billions of requests per second to justify their choices, so there they go.


Frankly it's a bit weird to see this kind of dick measuring in a product blog post from the "Director of Engineering" :/


s/the "Director of Engineering"/a "Director of Engineering"/

There are many engineering directors at Google.


Director is what, L8? There's a ton of those.


And only one attributed to the blog post.

swish


True, and even worse, it's inaccurate dick measuring.


> At Amazon, practically every service is built on top of AWS

Is that finally true? It sure wasn't in the 2020-2021 timeframe.


It does depend on what you mean. By 2020/2021, effectively everything was on top of AWS VMs/VPCs and perhaps LBs at that point. Most if not all new services were being built in NAWS.


SPS was heavily MAWS and I got sick of being the NAWS person from years prior pushing for NAWS in our dysfunctional team, and quit. The good coworkers also quit.

Yet I still see the very deep stack of technically incapable middle manager sorts dutifully posting "come join us" nonsense on LinkedIn.

(I had the luxury of having worked in one of the inner sanctums of Apple hardware for years prior, so was immune to nonsense, and didn't need the job.)



