Just moved our infra from GCP to AWS. Kubernetes clusters, LB, storage, lambdas, KMS and all of it.
Google runs their tech stack as if it's a startup building the engineers' CVs. Everything is immature: tons of hacks, undocumented features. If you're on their k8s, there's a constant stream of upcoming versions and features that force you to revisit the key hacks you put in your infra to work around their shortcomings. Our infra team keeps tinkering with our infra and it never ends. It's 50:50 - 50% of the time making sure we're prepared for their shit, 50% on our own ambitious infra plans. Good luck with that.
With AWS our bill is 60% of what GCP used to be running 3 k8s clusters.
AWS support is so nice, you can't believe it.
Nah, I don't trust Google with anything. It's a scam. Google's support is horrendous. They refer you to idiots that drag you through calls until your will to live dies. And then you're back at the mercy of some lost engineer who may comment on a GitHub issue you opened 20 days ago. We have a bug reported back in 2020 that got closed recently without any action because it went stale and the API changed so much it doesn't really matter anymore. It's that bad.
The billing day is a monthly reminder you're paying entitled devs to do subpar work other companies do a lot better.
Interesting - if you swap GCP and AWS in your post, that's exactly my experience.
I wonder what makes us different. I work in Europe on video games; AWS's handling of me when I was at Ubisoft left a really sour taste. When I moved to Tencent/Sharkmob I tried really hard to love AWS, as it was the de facto industry standard, and instead I was left with the feeling that most of it is inconsistent garbage papered over with Lambda functions. I referred to these weird gotchas as "3am topics": things I don't have the mental capacity to deal with at 3am. I convinced the studio to switch to GCP, which, incidentally, they are still extremely grateful to me for doing.
Small examples included (I’m on my phone so these are from memory and you’ll have to forgive the lack of great detail):
1) Having the project/account you're in visible at the top at all times.
We used SSO for “accounts”, which is AWS's way of completely separating resources; the long string that is returned is not unique at the start and the remainder is cut off, so all accounts/projects looked the same. It was impossible to tell at a glance whether you were in dev, staging or prod.
2) Autoscaling groups that had human-readable, incrementing “names”. In AWS, instances get hex slugs as instance IDs, and you can give an instance a special “Name” tag, but any new machines created by an ASG will just reuse the same Name tag, making them hard or impossible to tell apart.
The AWS official solution for this is to have a Lambda function hook into the scaling event and give your new node an incremented Name tag. Given that AWS is pricey precisely to save me time, I do not personally consider this an elegant solution.
3) having all regions on one page.
We spent ~€6,000 on a database we didn't know about until we started digging into the bill. Not knowing what resources are running at a glance feels pretty basic to me, tbh.
4) The network implementation overall; in Google you can just create a network and it will work, without having to mess with the zone routing and configuration that AWS puts on the user.
If it's on the user, it's a variable that has to be checked during an outage; it's Terraform code that has to be grokked, and so on.
“2) Autoscaling groups that had human-readable, incrementing “names”. In AWS, instances get hex slugs as instance IDs, and you can give an instance a special “Name” tag, but any new machines created by an ASG will just reuse the same Name tag, making them hard or impossible to tell apart.
The AWS official solution for this is to have a Lambda function hook into the scaling event and give your new node an incremented Name tag. Given that AWS is pricey precisely to save me time, I do not personally consider this an elegant solution”
Why were you even messing with the instance name? This is a ridiculously simple problem to solve with tags on your ASG. And AWS even did the courtesy of propagating those tags across the ASG and all its instances.
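For reference, this is roughly what the tag approach looks like with boto3 (a rough sketch; the ASG name and tag values here are made up). One caveat: a propagated Name tag is still the same value on every instance, which is exactly what the Lambda-hook trick above adds the increment for.

# Sketch only: attach tags to an ASG so every instance it launches inherits them.
# The ASG name and tag values are illustrative, not from this thread.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_or_update_tags(
    Tags=[
        {
            "ResourceId": "web-prod-asg",          # hypothetical ASG name
            "ResourceType": "auto-scaling-group",
            "Key": "Environment",
            "Value": "prod",
            "PropagateAtLaunch": True,             # copied onto every launched instance
        },
        {
            "ResourceId": "web-prod-asg",
            "ResourceType": "auto-scaling-group",
            "Key": "Name",
            "Value": "web-prod",                   # same value on every instance
            "PropagateAtLaunch": True,
        },
    ]
)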
The company I'm working at currently uses a Token Vending Machine (TVM).
Pros: you cannot get accounts mixed up.
Cons: all sessions are actually 12-hour sessions (ASIA keys, not AKIA) and there's no access to permanent keys for the CLI; security, I suppose. It's not too bad though, as the TVM hands out creds for various use cases.
We fix that internally by giving accounts names and stages in a meta tool. A Tampermonkey script pulls that in, shows it on screen, and adds a red banner if it's prod. It could just as well be a JSON file in a GitHub repo. And yes, it could be a console feature, but everyone's got a different concept of prod. I think a ton of companies only use about two accounts in total anyway.
It's amazing how people complain about GCP. We run a massive deployment across 100+ regions cross-cloud on GCP, Azure and AWS, and oh boy. GCP has good support if you are big enough. Azure, though, which has a much bigger share than GCP, is horrendous. Absolutely garbage all around. Good luck ever getting anyone in engineering, even if you are paying for support. AWS, on the other hand: amazing. We have Enterprise Support, so those guys are in our Slack channel. The TAMs are amazing. Need to get hold of someone from Route 53? No problem, they're on a call this week. Feature request for EKS? OK, talk to the Product Manager this afternoon.
Can you give Azure specifics? As you know, Azure has a massive offering.
My experience has been the opposite, though not without issues. Azure has some of the best corporate and security features of any cloud, and it's only getting better. The zero-trust model fits in so nicely with their identity platform that it's a sight to behold compared to other cloud providers, which likely rely on some form of AAD or AD DS anyway.
Their support is responsive and they seem to know what they're talking about. (AKS)
Azure has had repeated, significant security failures that impact numerous customers. I don't understand how anyone can defend their security except through willful ignorance.
I have friends forced to use Azure and they routinely report issues with provisioning resources, things taking a very long time to spin up or simply being rejected because Azure doesn't have any capacity.
A memorable example is when we ran a heavy Azure Functions workload on our App Service Plan, the hosts would devour themselves.
Functions use containers under the hood. Each invocation created a new container, and when enough of them ran long enough, the host disk would fill up. (Pretty sure our workload wrote almost nothing to disk.)
An internal Azure disk clean-up routine kicked in, which deleted image layers for running Functions. This deleted the filesystems for containers that were still running, yanking them out from underneath the running processes. It also meant the host couldn't launch new instances of our Functions.
At this point the host was poisoned and couldn't launch any new work, even after the workload was reduced. It had to be terminated and replaced, after we detected the problem manually.
Azure support never seemed to take the problem seriously, and after we migrated our workload off of Functions they decided the problem must be resolved, since we weren't complaining anymore.
This reminds me of the fond days of weekly customer calls. We develop AWS services, and we answer our customer-support calls directly. No middleman, just techies talking to techies. We made promises to customers on the fly, and customers sometimes project-managed us.
We have an old "quiet part out loud" corporate story. It's about one arm of Google using our service and wondering why it had so much downtime, only for us to point at their GAE arm and say "when they're down, we're down". They went and talked to GAE and, funny enough, were able to correlate the downtime they observed with GAE downtime.
GAE uptime improved, for a little while. Yeah, we're on AWS now too.
From my understanding they don't dogfood a lot of GCP products internally. That's how you end up with janky integrations between their products. It's really frustrating at times to see their cloud architects pitch some grouping of technologies you should use, only to find out the integrations aren't well tested at scale. For example, pushing for Pub/Sub to be used with Dataflow for near-real-time processing, just to figure out that at scale global Pub/Sub has high latency on about 1% of messages: above 1 minute, sometimes 5.
Yes in the sense that they use all the services and infrastructure that GCP is built in, but no in the sense of using the vanilla GCP interface.
Instead many aspects of GCP's management console are handled by different internal tools, often command line driven. IME they are often far more unwieldy than GCP.
Sometimes this makes sense (far tighter access controls and configuration change controls than a typical company), and some times it's just because of legacy ways of doing things.
I worked on a team at Google that used the internal GCP to serve some code/content for a specific feature, and in some ways it was more frustrating than using either the normal internal systems or vanilla GCP.
Parts, yes. In reference to the specifics mentioned here, though, those services run on Infra Spanner, not Cloud Spanner, but they're the same stack. The main reason things like Gmail, Ads, etc. haven't swapped onto GCP is the internal tooling built up around Infra Spanner for those Google-specific services, which doesn't make sense in Cloud Spanner.
It's way WAY more than just Infra Spanner vs Cloud Spanner. Cloud spanner doesn't support protobuf, which is annoying, but that's not a dealbreaker; it's still just a DB. The issue is really all the various internal frameworks (such as Apps Framework for Java), deployment systems (Server Platform, AKA Boq/Pod/Urfin), and so forth.
It's not just that migrations are hard, either; Google Cloud has put (almost?) zero effort into making it easy to use Cloud from systems running on Borg.
My old team was building a system that was half-GCP and half-Borg, and we had to write our own (extremely bad) Cloud Spanner fake for use in tests. In contrast, Infra Spanner is extremely well supported for tests. Same with BigQuery vs Dremel and many other systems.
At one point, Google reached out to me to try and tempt us over from AWS. I had bad experiences with Google support in the past, but liked their AI stuff and was keen to give them another go.
We booked a follow up call in the calendar, I spent good time preparing my notes and requirements for the meeting... and then nobody on their side showed up or contacted me again.
Much of the time GCP feels like a science project, not a real business. AWS (and Azure) seem to be driven by customer requests, whereas Google feels very engineering-centric.
Which is on brand with Google. They have no problem launching stuff, and no problem killing stuff. But man, then just get out of the cloud business and focus on what you're good at.
It's actually sort of ridiculous. AWS has the best support I have ever interacted with. I mean, our org certainly pays enough for it but it's so completely unusual in tech, or really any sector to get great support even when you're paying for it.
I worked in a digital team 4 years back building voice-channel apps for our customers on both Amazon Alexa and Google Dialogflow. Alexa's NLP engine was less sophisticated; we had to give it hundreds of prompts and intents, while Dialogflow's NLP engine required only a handful of prompts for the same thing. But when it came to integration with backend APIs, and to support, Alexa was far ahead. Despite us having Dialogflow Enterprise, Google support would suggest asking on Stack Overflow. Amazon support, on the other hand, was excellent. We needed support for mTLS with the backend APIs; Amazon supported it because they understood enterprise. Google just shooed us away; their support wouldn't even escalate this.
I don't know. I like GCP. I have been in an Azure centric corporation for close to two years now and I dearly miss GCP almost every day.
My team has a sort of a sandbox where we can use almost any Azure product we want (our IT is supportive and permissive as far as that sandbox goes, which is a blessing), but even then it's just painful in comparison.
There is no way this is true. Only explanation is you work for AWS :-).
GCP's strength is its cost. Yes, maybe the support could be better. But care to explain what "hacks" you are talking about? And the claim that k8s (from Google) is better on AWS than on GCP is absolutely false.
Our reason for going all in with GCP was the k8s. We've been using GCP for 2+ years.
The trouble we have is with stability and so many of the features being constantly rolled out.
Our experience was that K8s cost more on GCP than AWS.
Just on load balancers alone, there are tons of tricks specific to the GCP implementation. And we needed a few extra because we couldn't run all the features we wanted on 1-2 per cluster.
For example, we have a third party that required all our requests to always originate from, and respond back through, a single fixed IP address. We could only pick one: not a range, not a list. This was a hard requirement. The service was important, so we had to do it.
It took our team several days to work out how to do it using online documentation and support. Tech support was useless. One guy on our team spent 2 days on the phone with a paid, local GCP implementation partner trying to get this problem sorted. Nothing came out of it other than being pitched, on our dime, a lot of services and architecture we didn't need. Eventually we figured it out on our own. I don't even remember this coming up when we transitioned to AWS.
Matches my experience. GCP has many services that are better than AWS's, but I am not going to run production workloads with them after 2 years of experience at a previous company. There are so many undocumented quirks that you can often find a better solution from some random person on Stack Overflow than from the highest tier of paid support.
That was my experience, too: a couple of things which were better than AWS, but this constant stream of paper cuts from all of the problems which weren't cool enough to get someone promoted.
I generally like GCP, however their sales and customer support just aren't any good. And some services like Vertex AI are extremely buggy while it's hard to actually report these bugs.
I think Google Cloud needs someone like Jeff Bezos as their head: Look what your customers actually want and need and understand their requirements. And they usually want good customer support and want a competent key account manager as well.
When we were looking to migrate our analytics database from on-premise to a cloud alternative we were looking at BigQuery and Snowflake. BigQuery is a great product and we were already deeply invested in GCP as well. However the GCP sales team just couldn't sell BigQuery - they just don't know what old corporations want to hear in a sales pitch. So we went with Snowflake in the end. Not because it's the better product but because their sales team is better.
I'm not sure if the cloud business is actually a priority at Google. If it is then I think they don't understand the mistrust Google is facing when it comes to stable long term support of their products.
The horror stories about Google support, across all of their products, are enough for me to never trust GCP. Even if someone told me today "GCP is the exception, they have great support" I probably wouldn't care: they are so organizationally incapable of providing good support that, even if they did so today, I wouldn't believe it could last.
Support-wise, GCP is a joke run by entitled people. I had an issue some time ago with a VPN, and after doing a lot of troubleshooting and having them agree the problem was on their end (packets would go into their VPN gateway from the VPC, nothing would come out), the solution was for me to update my configuration to work around whatever they did, because "that's how it's going to be"...
“ According to the Amazon Prime Day blog post, DynamoDB processes 126 million queries per second at peak. Spanner on the other hand processes 3 billion queries per second at peak, which is more than 20x higher, and has more than 12 exabytes of data under management.”
This comparison seems not exactly fair? Amazon's 126 million queries per second was purely what Amazon's own services generated on DynamoDB while serving Prime Day, and not all of AWS, is my read.
What would perhaps have been a fairer comparison is the peak load that Google's own services running on Cloud Spanner generate, and not the sum of all Spanner usage across all of GCP and all of Google (Spanner on non-GCP infra).
I will say that it would show a massive vote of confidence to say that Photos, Gmail and Ads heavily rely on GCP infra: which would be brand-new information for me! It would add to that confidence to learn more about how they use it, and whether Cloud Spanner is on the critical path for those services.
What is confusing, however, is how this article consistently says "Cloud Spanner"... except when talking about Gmail, Ads and Photos, where it states that "Spanner" is used by these products, not "Cloud Spanner"! As if they were not using the Cloud Spanner infra, but their own. It would help to know which is the case, and what the load on Cloud Spanner is, as opposed to Spanner running on internal Google infra that is not GCP.
At Amazon, practically every service is built on top of AWS - a proper vote of confidence! - and my impression was that GCP had historically been far less utilised by Google for their own services. Even in this post, I'm still confused and unable to tell if those Google products listed use Cloud Spanner or their own infra running Spanner.
> DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the Amazon.com sites, and all Amazon fulfillment centers. Over the course of Prime Day, these sources made trillions of calls to the DynamoDB API. DynamoDB maintained high availability while delivering single-digit millisecond responses and peaking at 126 million requests per second.
Amazon was very, very clear on this. For Google to use that number without the caveat is just completely underhanded and dishonest. Whoever wrote this is absolutely lacking in integrity.
I used DynamoDB as part of the job a few years ago and never got single-digit-millisecond responses: it was 20ms minimum and 70+ on a cold start, though I accept that optimising Dynamo's various indexes is a largely opaque process. We had to add hacks like setting the request timeout to 5ms and keeping the cluster warm by submitting a no-op query every 500ms just to keep it even remotely stable. We couldn't even use DAX because the Ruby client didn't support it. At the start we only had a couple of thousand rows in the table, so it would have legit been faster to scan the entire table and do the rest in memory. Postgres did it in 5ms.
If Amazon said they didn't use DAX that day I would say they were lying.
The average consumer or startup is not going to squeeze out the performance of Dynamo that AWS is claiming that they have achieved.
In fact, it might have been a fairer fight in Ruby if they hadn't hard-coded the net client (Net::HTTP). I imagine performance could have been boosted by injecting an alternative.
What a cool lil side project/company! Going to circulate this among friends...
Little bit of well meaning advice: This needs copy editing -- inconsistent use of periods, typos, grammar. Little crap that doesn't matter in the big picture, but will block some from opening their wallets. :) ("OpenTeletry", "performances", etc.)
All in all this is quite cool, and I hope you get some customers and gather more data! (a 4k object size in S3 doesn't make sense to measure, but 1MB might be interesting. Also, check out HDRHistogram, it might be relevant to your interests)
Nice dash - if you don't mind a drive-by recommendation: I use Grafana for work a lot and it's nice to see a table legend with min, max, mean, and last metrics for these kinds of dashboards. Really makes it easy to grok without hovering over data points and guessing.
What's more important for me when using Grafana (though a summary helps too) is units: knowing whether it's seconds, milliseconds or microseconds, and whether 0.5 is a quantile or something else.
Numbers without units are dangerous in my opinion.
> We had to add on hacks like setting the request timeout to 5ms and keeping the cluster warm by submitting a no-op query every 500ms to keep it even remotely stable.
This sounds like you're blaming dynamo for you/your stack's inability to handle connections / connection pooling.
Been using DynamoDB for years and haven't had to do any of the hacks you talk about. Not using Ruby, though. TCP keep-alive does help with perf, though (which I think you might be suggesting).
I don’t have p99 times in front of me right this second but it’s definitely lower than 20ms for reads and likely lower for writes. (EC2 in VPC).
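If it helps anyone, the keep-alive and timeout knobs being discussed live in botocore's client Config; a rough sketch (the numbers are illustrative, not a recommendation, and the table name is made up):

# Sketch: client-side timeouts, retries and TCP keep-alive for a DynamoDB client.
import boto3
from botocore.config import Config

dynamo_config = Config(
    connect_timeout=0.5,                 # seconds
    read_timeout=0.5,
    retries={"max_attempts": 2, "mode": "standard"},
    tcp_keepalive=True,                  # reuse warm connections between calls
)

table = boto3.resource("dynamodb", config=dynamo_config).Table("my-table")
item = table.get_item(Key={"pk": "user#123"}).get("Item")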
They very well know that people don't read sh* anymore. Just throw numbers there, PowerPoint them and offer an "unbiased" comparison where Google shines - buy Google.
Worst case scenario, it's Google you're buying, not a random startup etc.
Just as a hand in the air... be careful about what you're comparing here. The number of API calls over a period of time is largely irrelevant next to QPS. I can happily write a DDoS script that massively bombards a service, but if that halts my QPS then it doesn't matter. So sure, trillions of API calls were made (still impressive in the scope of the overall network of services, I'm not downplaying that), but ultimately, for DynamoDB and Spanner, it's the QPS that matters in terms of comparing DB scaling and performance.
Google calls API calls “queries” because of their history as a search engine. QPS == API calls per second == requests per second.
That said, I can’t imagine these numbers mean much to anyone after a certain point. It’s not like either company is running a single service handling them. The scale is limited by their budget and access to servers because my traffic shouldn’t impact yours. I feel like the better number is RPS/QPS per table or per logical database or whatever.
Yes, but QPS vs. "queries to the API". The difference is the time slice. I should have been more explicit. The key here really is the time function between the numbers. That the AWS blog calls out trillions of API calls isn't relevant because there wasn't a specific time denominator. The 126M QPS is the important stat.
We shared some details about Gmail's migration to Spanner in this year's developer keynote at Google Cloud Next [0] - to my knowledge, the first time that story has been publicly talked about.
I tried to find it in this video, but failed. Could you please share a time stamp on where to look?
It's a pretty big deal if Gmail migrated to GCP-provided Spanner (not to an internal Spanner instance), and it sounds like the kind of vote of confidence GCP and Cloud Spanner could benefit from: might I suggest writing about it? It's easier to digest and harder to miss than an hour-long keynote video with no timestamps.
And so just to confirm: Gmail is on Cloud Spanner for the backend?
It's almost certainly not the case that Gmail uses Cloud Spanner rather than internal Spanner. I don't think Cloud Spanner (or most of Google's cloud products) has the feature set required to support loads like Gmail, both in terms of technical capability and security/privacy features.
When I worked at Google I tried to get more services to migrate to the cloud but the internal environment that was built up over 25 years is much better at supporting billion+ users with private data.
And yet, if they do, that's probably one of the best sales pitches they could have - dogfooding. After all, isn't that also how AWS started, just reselling the services and servers they already use themselves?
It doesn't make much sense to have a 'better' version of a product you sell but keep it internal.
Yet Amazon Retail still doesn't use DynamoDB for its critical workloads. They still rely on an internal version of DynamoDB (Sable) which is optimized for the retail workload.
Looks like it starts at 50:45. YouTube recently made it so you can click "Show transcript" in the description, and then Ctrl-F takes you to all the mentions. Very helpful for long videos like this.
In the timestamped video link shared downthread, the speaker does seem to strongly imply that gWorkspace doesn't manage the infra. When he finishes explaining the migration he declares (around 55:18) "[…] we can focus on the business of Gmail and Spanner can choose to improve and deliver performance gains automagically [sic]", which would imply, to me at least, that it's on GCP.
That's not what it implied to me. To me, it meant that they adopted an internal managed Spanner with its own SRE team, instead of running their own Spanner. In the past, Gmail ran their own [[redacted]]s and [[redacted]] even though there were company-wide managed services for those things.
Agree, but with the caveat that [[redacted]] and [[redacted]] were old and originally designed to be run that way. All newer storage systems I can recall were designed to be run by a central team after many years of experience doing it the other way. And many tears shed over migrating to those centralized versions.
Source: I was on the last team running our own [[redacted]].
Wow, almost content-free presentation! How obnoxious!
This wasn't the first time Gmail has replaced its storage backend in flight. The last time, around 2011, they didn't hype it up; they called it "a storage software update" in public comms. And that other migration is the origin of the term "spannacle", because during that migration the accounts that resisted moving from [[redacted]] to [[redacted]] were called barnacles.
> I will say that it does show a vote of confidence to say that Photos, Gmail and Ads use GCP infra,
I'm not sure? I guess I'm mostly not sure what "gcp infra" means there. The blog post says
"Spanner is used ubiquitously inside of Google, supporting services such as; Ads, Gmail and Photos."
But there's google-internal spanner, and gcp spanner. A service using spanner at Google isn't necessarily using gcp. (No clue about photos, Gmail, etc)
Granted, from what I gather, there's a lot more similarity between spanner & gcp spanner than e.g. borg and kubernetes.
Which can be the difference between 99.99% availability and 99% availability with data corruption issues. Not saying that's the case here but one should not downplay the difference deployments can make.
Surely in a post about Google Cloud Spanner, all examples mentioned use Google Cloud Spanner? It would be moot listing them as examples if they would not: so my assumption is they are all using GCP infra already for Spanner.
I really want to give Google the benefit of the doubt, but it doesn't help that they did not write that e.g. Gmail is using "Cloud Spanner". They wrote that it uses Spanner.
This is putting a lot of faith in GCP advertising. I strongly doubt the idea that the Google workloads discussed are deployed on GCP instead of internal Borg infrastructure.
Years ago they did a reorg and moved all infrastructure services under Cloud even though they are not Cloud products. That would enable this kind of obfuscation because Cloud is literally responsible for both Cloud Spanner and non-Cloud Spanner and they can conflate these two in their marketing copy. They probably feel justified in doing so because they share so much code.
Infra and Cloud Spanner are the same stack. Having those services run on infra is more about the legacy of tooling to shift it rather than anything around performance or ability to handle it.
>This comparison seems to be not exactly fair? Amazon’s 126 million queries per second was purely for Amazon-related services serving Prime Day generating this on DynamoDB, and not all of AWS is my read.
There's no indication that Google is talking about ALL of Spanner either? The examples they list are all internal Google services, and they specifically say "inside Google".
I'm also dubious that, even with all of the AWS usage accounted for, DynamoDB tops Spanner, if Amazon themselves are only at 126 million queries per second on Prime Day.
> At Amazon, practically every service is built on top of AWS - a proper vote of confidence!
Not only this, but practically most, if not all, of the AWS services use DynamoDB, including for use cases that aren't usually database-shaped, such as multi-tenant job queues (just search "Database as a Queue" to get the sentiment). In fact, it is really, really hard to use any relational DB at AWS. I mean, a team would have to go through CEO approval to get an exception, which says a lot about the robustness of DDB.
Eh, this isn't accurate. Both Redshift and Aurora/RDS are used heavily by a lot of teams internally. If you're talking specifically about the primary data store for live applications, NoSQL was definitely recommended/pushed much harder than SQL, but it by no means required CEO approval to not use DDB
Edit: It's possible you're limiting your statement specifically to AWS teams, which would make it more accurate, but I read the use of "Amazon" in the quote you were replying to as including things like retail as well, etc.
When I was at AWS, towards later part of my tenure, DynamoDB was mandated for control plane. To be fair, it worked, and worked well, but there were times when I wished I could use something else instead.
> What would have perhaps been a more fair comparison is to share the peak load that Google services running on GCP generated on Spanner, and not the sum of their cloud platform.
Not necessarily about volume of transactions, but this is similar to one of my pet-peeves with statements that use aggregated numbers of compute power.
"Our system has great performance, dealing 5 billion requests per second" means nothing if you don't break down how many RPS per instance of compute unit (e.g. CPU).
Scales of performance are relative, and on a distributed architecture, most systems can scale just by throwing more compute power.
Yeah, I've seen some pretty sneaky candidates try that on their resumes. They aggregate the RPS for all the instances of their services even though they don't share any dependencies or infrastructure; they're just independent instances/clusters running the same code. When I dug into those impressive numbers and asked how they managed coordination/consensus, the truth came out.
True, but one would hope that both sides in this case would be putting their best foot forward. Getting peak performance out of right sizing your DB is part of that discussion. I can't imagine AWS would put down "126 million QPS" if they COULD have provided a larger instance that could deliver "200 million QPS", right? We have to assume at some point that both sides are putting their best foot forward given the service.
The 126M QPS number was certainly just the parts of Amazon.com retail that power Prime Day, not all of DDB's traffic. If we were to add up all of DDB's volume, it would be way higher. At least an order of magnitude, if not more.
Large parts of AWS itself uses DDB - both control plane and data plane. For instance, every message sent to AWS IoT will internally translate to multiple calls to DDB (reads and writes) as the message flows through the different parts of the system. IoT itself is millions of RPS and that is just one small-ish AWS service.
Put yourself in the shoes of who they're targeting with that.
Probably someone dealing with thousands of requests per second who wants to say they're building something that can scale to billions of requests per second to justify their choices, so there they go.
it does depend on what you mean. By 2020/2021, effectively everything was on top of AWS VMs/VPC and perhaps LBs at that point? Most if not all new services were being built in NAWS.
SPS was heavily MAWS and I got sick of being the NAWS person from years prior pushing for NAWS in our dysfunctional team, and quit. The good coworkers also quit.
Yet I still see the very deep stack of technically incapable middle manager sorts dutifully posting "come join us" nonsense on LinkedIn.
(I had the luxury of having worked in one of the inner sanctums of Apple hardware for years prior, so was immune to nonsense, and didn't need the job.)
And for many projects, Postgres is still cheaper than both. Having used both, I would much, much rather do the work to fit my project in Postgres/CockroachDB than use either Spanner or DynamoDB, which have WAY more footguns. Not to mention sudden cost spikes, vendor lock in, and god knows what else.
AWS and GCP (and Azure, and Oracle cloud, and bare Kubernetes via an operator, and...) support Postgres really well. Just...use Postgres.
> And for many projects, Postgres is still cheaper than both.
OK? And sqlite3 in memory is even cheaper than Postgres!
If you can use (and correctly support) Postgres then you should use it; obviously there's no point using a globally scalable P-level database if you can just fit all your data on one machine with Postgres.
Except for projects for which NoSQL is a better fit than a RDBMS, no?
If I'm writing a chat app with millions of messages and very little in the way of "relationships", should I use Postgres or some flavor of NoSQL? Honest question.
Postgres. NoSQL databases are specialized databases. They are best-in-class at some things, but generally that specialization came at great cost to their other capabilities. DynamoDB is an amazing key-value store but severely limited at everything else. Elasticsearch is amazing for search and analytics but severely limited at everything else. Other specialized databases that do speak SQL are also great at what they do; Spark, for example, is a columnar engine with amazing capabilities for massive datasets where you need lots of cross-joins, but that severely limits its ability to act in a lot of roles, because it traded latency for throughput and horizontal scalability, and you're restricted in what you can do with it.
The super-power of Postgres is that it supports everything. It's a best-in-class relational database, but it's also a decent key-value store, it's a decent full-text search engine, it's a decent vector database, it's a decent analytics engine. So if there's a chance you want to do something else, Postgres can act as a one-stop-shop and doesn't suck at anything but horizontal scaling. With partitioning improving, you can deal with that pretty well.
If you're writing fresh, there is basically no reason not to use Postgres to start with. It's only when you already know your scale won't work with Postgres that you should reach for a specialized database. And if you think you know because of published wisdom, I'd recommend you set up your own little benchmark, generate the volume of data you want to support, and then query it with Postgres and see if that is fast enough for you. It probably will be.
Golden Rule of data: Use PostgreSQL unless you have an extremely good reason not to.
PostgreSQL is extremely good at append-mostly data, i.e. like a chat log, and has powerful partitioning features that allow you to keep said chat logs for quite some time (with some caveats) while keeping queries fast.
Generally speaking though PostgreSQL has powerful features for pretty much every workload, hence the Golden Rule.
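For the curious, a minimal sketch of what that partitioning looks like for a chat log (shown via psycopg2; the table, columns and DSN are all invented):

# Sketch: a time-partitioned chat-log table in PostgreSQL.
import psycopg2

conn = psycopg2.connect("dbname=chat")   # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            room_id   bigint      NOT NULL,
            sender_id bigint      NOT NULL,
            sent_at   timestamptz NOT NULL,
            body      text        NOT NULL
        ) PARTITION BY RANGE (sent_at);

        -- One partition per month; old months can be detached or dropped cheaply.
        CREATE TABLE IF NOT EXISTS messages_2024_01
            PARTITION OF messages
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

        -- Typical access path: latest messages for a room.
        CREATE INDEX IF NOT EXISTS messages_room_time
            ON messages (room_id, sent_at DESC);
    """)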
100% this, and even though I work for Google I absolutely agree. BUT, for the folks that need it, PostgreSQL just DOESN'T cut it, and that's why we have databases like DynamoDB, Spanner, etc. Arguing that we should "just use PG" is kind of a moot point.
I think I said this in another comment, but I'm not shitting on Spanner or DDB's right to exist here. Obviously, there are _some_ problems for which a globally distributed ACID compliant SQL-compatible database are useful. However, those problems are few and far between, and many/most of them exist at companies like Google. The fact is your average small to medium size enterprise doesn't need and doesn't benefit from DDB/Spanner, but "enterprise architects" love to push them for some ungodly reason.
Don't forget PostgreSQL extensions. For something like a chat log, TimescaleDB (https://www.timescale.com/) can be surprisingly efficient. It will handle partitioning for you, with additional features like data reordering, compression, and retention policies.
This is what I've done: sqlite3 for my personal stuff, Postgres for everything else. I'm far from the "126 million requests per second" level though, so my experience is limited to small to mid-size ops for businesses.
Millions is tiny. Toy even. (I work on what could be called a NoSQL database, unfortunately "NoSQL" is a term without specificity. There's many different ways to be a non-relational database!)
My advice to you is to use PostgreSQL or, heck, don't overthink it, SQLite if it helps you get an MVP done sooner. Do NOT prematurely optimize your architecture. Whichever choice results in you spending less time thinking about this now is the right choice.
In the unlikely event you someday have to deal with billions of messages and scaling problems, a great problem to have, there are people like me who are eager to help in exchange for money.
Lots of people like to throw around the term "big data" just like lots of people incorrectly think that just because google or amazon need XYZ solution that they too need XYZ solution. Lots of people are wrong.
If there exists a motherboard that money can buy, where your entire dataset fits in RAM, it's not "big data".
I've found it's pretty easy to massage data either way, depending on your preference. The one I'm working on now ultimately went from postgres, to mysql, to dynamo, the latter mainly for cost reasons.
You do have to think about how to model the data in each system, but there are very few cases IMO where one is strictly 'better.'
You can also create arbitrary indices on derived functions of your JSONB data, which I think is something that a lot of people don't realize. Postgres is a really, really good NoSQL database.
Sure. Suppose that we have a trivial key-value table mapping integer keys to arbitrary jsonb values:
example=> CREATE TABLE tab(k int PRIMARY KEY, data jsonb NOT NULL);
CREATE TABLE
We can fill this with heterogeneous values:
example=> INSERT INTO tab(k, data) SELECT i, format('{"mod":%s, "v%s":true}', i % 1000, i)::jsonb FROM generate_series(1,10000) q(i);
INSERT 0 10000
example=> INSERT INTO tab(k, data) SELECT i, '{"different":"abc"}'::jsonb FROM generate_series(10001,20000) q(i);
INSERT 0 10000
Now, keys in the range 1–10000 correspond to values with a JSON key "mod". We can create an index on that property of the JSON object:
example=> CREATE INDEX idx ON tab((data->'mod'));
CREATE INDEX
And we can check that the query is indexed, and only ever reads 10 rows:
example=> EXPLAIN ANALYZE SELECT k, data FROM tab WHERE data->'mod' = '7';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tab (cost=5.06..157.71 rows=100 width=40) (actual time=0.035..0.052 rows=10 loops=1)
Recheck Cond: ((data -> 'mod'::text) = '7'::jsonb)
Heap Blocks: exact=10
-> Bitmap Index Scan on idx (cost=0.00..5.04 rows=100 width=0) (actual time=0.026..0.027 rows=10 loops=1)
Index Cond: ((data -> 'mod'::text) = '7'::jsonb)
Planning Time: 0.086 ms
Execution Time: 0.078 ms
If we did not have an index, the query would be slower:
example=> DROP INDEX idx;
DROP INDEX
example=> EXPLAIN ANALYZE SELECT k, data FROM tab WHERE data->'mod' = '7';
QUERY PLAN
---------------------------------------------------------------------------------------------------
Seq Scan on tab (cost=0.00..467.00 rows=100 width=34) (actual time=0.019..9.968 rows=10 loops=1)
Filter: ((data -> 'mod'::text) = '7'::jsonb)
Rows Removed by Filter: 19990
Planning Time: 0.157 ms
Execution Time: 9.989 ms
Hence, "arbitrary indices on derived functions of your JSONB data". So the query is fast, and there's no problem with the JSON shapes of `data` being different for different rows.
Either way can work. Getting to millions of messages is going to be the hard part, not storing them.
As with all data storage, the question is usually how do you want to access that data. I don't have experience with Postgres, but a lot of (older) experience with MySQL, and MySQL makes a pretty reasonable key-value storage engine, so I'd expect Postgres to do ok at that too.
I'm a big fan of pushing the messages to the clients, so the server is only holding messages in transit. Each client won't typically have millions of messages or even close, so you have freedom to store things how you want there, and the servers have more of a queue per user than a database --- but you can use a RDBMS as a queue if you want, especially if you have more important things to work on.
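If you do go the RDBMS-as-a-queue route, the usual Postgres trick is FOR UPDATE SKIP LOCKED so concurrent workers don't grab the same row; a minimal sketch (psycopg2; the jobs table and columns are invented):

# Sketch: claim one job at a time from a Postgres-backed queue.
import psycopg2

conn = psycopg2.connect("dbname=app")    # hypothetical DSN

def claim_next_job():
    with conn, conn.cursor() as cur:
        cur.execute("""
            DELETE FROM jobs
            WHERE id = (
                SELECT id FROM jobs
                ORDER BY enqueued_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED   -- other workers skip rows already claimed
            )
            RETURNING id, payload;
        """)
        return cur.fetchone()            # None when the queue is empty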
This is going to feel like a non-answer: but if you need to ask this question in this format, save yourself some great pain and use Postgres or MongoDB, doesn't really matter which, just something known and simple.
Normally you'd make a decision like this by figuring out what your peak demand is going to look like, what your latency requirements are, how distributed are the parties, how are you handling attachments, what social graph features will you offer, what's acceptable for message dropping, what is historical retention going to look like...[continues for 10 pages]
But if you don't have anything like that, just use something simple and ergonomic, and focus on getting your first few users. There's a long gap between when the simple choice will stop scaling and those first few users.
Just migrated off PG to DDB as the main DB for my application (still copying data to SQL for analytics). With distributed functions and code hosted on Lambdas, connection management to SQL became a nightmare, with dropped requests all over the place.
Yeah I have been using Supabase recently and I really like it. You still get the “serverless” benefits but at the end of the day it is just a Postgres database with some plugins. It is super easy to figure out where the data is coming from/going to.
Meanwhile at work I have a coworker who loves to create AWS soup, where they use an assortment of Lambdas/API Gateways/SQS queues/SNS topics to accomplish tasks such as taking files from one S3 bucket and putting them in another S3 bucket owned by a different team. Their justification was that it's generic so other teams could use it, but it's a pain to maintain and make changes to.
Not to be that guy, but why Lambdas? I'm genuinely curious. I've never found the "cost savings" (big air quotes) worth it compared to the increased configuration/permissions complexity. Especially when Fargate exists, where you can just throw a Docker container at AWS, what do Lambdas add? The scaling to zero?
With CDK, I can get an ECS service up and running in the same amount of time it'd take to create a lambda function behind API gateway or triggered by SQS/cron. Deploys are easier, cost savings are real, permissions/configuration are the same level of complexity unless you're cutting corners. I'd only use ECS for stuff I know would be high sustained throughput, long duration(>15m) tasks, or things that absolutely need more persistence between executions.
Serverless is great if you recognize that it's just somebody else's container runtime. I wish there was better tooling for Docker-based Lambdas though. I hate the whole S3 deployment dance for zip-file-based Lambdas (yes, SAM does it for you now, but it's still there).
EC2-backed ECS has a great use case for things that you can run ephemerally in a container but require a persistent data store.
Why not? The setup I’m experimenting with for an API right now is basically a single Lambda that’s accessible through a function URL (so no ELB/ALB) + an RDS instance. Spinning up additional environments is a single Cloudformation call and deployment artifacts should work with both Docker containers or S3 (depending on the Lambda execution environment).
Seems like a leaner setup than using ECS/Fargate + LBs to me. Have I overlooked something?
One of lambda's ideal use cases is personal projects. Personal projects usually serve very few requests so lambda's ability to scale to zero results in cost savings.
I totally believe you, I just can't see how it becomes easier than chucking a container on Fargate or something. Maybe I've just been scarred by lambda rat's nests in the past.
Yeah, the "proper" way to do Lambdas, shown in so many fancy architecture diagrams, is a rat's nest. I don't like APIs on Lambda unless you can shove them into one container with a catchall proxy on API Gateway. They really shine if you're processing SQS messages or EventBridge events. If you aren't using other AWS services and aren't cost engineering, then Lambdas probably aren't worth the headache.
Lambda is the most expensive thing you can do if you have more than 25% utilization. Fargate is extremely close to modern on-demand EC2 pricing (m7a family).
Right, running ECS on EC2, not Fargate on EC2. When ECS launched it only had the EC2 launch type (where as you said you must manage your machines). Fargate then came along for both ECS and EKS where Amazon managed the machines for you.
But that's kind of a moot point. I mean, if you're even looking at the likes of DynamoDB or Spanner, it's because you need the scale of those engines. PostgreSQL is fantastic, and even working for Google, I 100% agree with you. Just use PG...until you can't. Once you're in the realm of Spanner and DynamoDB, that's where this discussion becomes more of a thing.
Not necessarily true. DynamoDB on demand pricing is actually way cheaper than RDS or EC2 based anything for small workloads, especially when you want it replicated.
Postgres and Spanner do different things, in different ways, with different costs, risks, and implications. You could "just use" anything that is completely different and slightly cheaper. You could use a GitHub repository and just store your records as commits, for free, that's plenty cheap and works for small projects. But not really the same thing, is it?
My point is that I've seen very, very few situations (I can think of two in my entire career so far) where a "hyperscale NoSQL database" was actually the right choice to solve the problem. I find that a lot of folks turn to these databases for imagined scale needs, not actual hard problems that need solving.
DynamoDB is fantastic for not doing things at scale. It costs a few pennies, there is nothing to set up, it's all managed for you, it is insanely reliable, and it just works. I use it for all kinds of crap that an entire RDBMS is way overkill for.
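Rough sketch of that kind of usage with boto3 against an on-demand (pay-per-request) table; the table name and keys are made up:

# Sketch: tiny state store on a DynamoDB on-demand table.
import boto3

table = boto3.resource("dynamodb").Table("side-project-state")

table.put_item(Item={"pk": "job#2024-05-01", "status": "done", "attempts": 3})

resp = table.get_item(Key={"pk": "job#2024-05-01"})
print(resp.get("Item"))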
I don't think the API you interface with fundamentally changes the point that Spanner is hard to recommend from an engineering perspective at anything except the absolute most massive of scales, and even then it will create nearly as many problems as it solves. I'm not saying spanner is _wrong_ or shouldn't exist, but it's very difficult to be in the position where Spanner is the critical key to your application's success and not replaceable by <insert other, cheaper database here>.
Sure. Spanner is expensive, and your primary job as an engineer (if you work for an enterprise like most of us do) is to generate business value, so if nothing else you will run into Spanner's cost problems. There are other problems too; IIRC both DynamoDB and Spanner shard their key spaces, each shard gets the same quota, and the key-space shards all have to be the same size. This means that even though you might have paid for 1,000 RPS, for example, that RPS volume is divided across all your shards, so if one part of the key space gets way more volume than another, you eat up that shard's fractional capacity way faster than you intend, and you have to either overprovision or queue requests, neither of which is ideal.
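To make the arithmetic concrete (numbers entirely made up for illustration):

# Illustrative only: provisioned throughput is divided evenly across shards,
# so one hot part of the key space throttles long before the table is "full".
provisioned_rps = 1000
shards = 10
per_shard_rps = provisioned_rps / shards        # 100 RPS per shard

total_rps = 300                                 # well under the provisioned 1000
hot_shard_share = 0.4                           # 40% of traffic hits one shard's key range
hot_shard_load = total_rps * hot_shard_share    # 120 RPS -> throttled at 100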
At a previous job, we ended up creating a very complicated write through cache system in front of spanner that dynamically added memory/CPU capacity as needed to prevent hot shards; our application was extremely read heavy, and writes were relatively low RPS, so this ended up working OK, but we were paying tens of thousands of dollars a month for Spanner plus tens of thousands of dollars a month for all the compute sitting in front of it. I don't think we ended up doing much better than if we had bitten the bullet and run clustered Postgres because our write volume ended up being just a few hundred RPS, even though the read volume was 1000x that. Postgres behind this cache system would have handled the load just as well and cost less than half as much.
The other thing that frustrates me personally about Spanner is that Google's docs are incomplete (as usual); there are lots of performance gotchas like this throughout the entire service, and they aren't clearly documented (unlike, to their credit, AWS with Dynamo, who explain this entire problem very clearly and have an [expensive] prebuilt solution for it in the form of DynamoDB Accelerator).
It's not purely a matter of cost, right? Say you want or need a highly available, high performance distributed database with externally consistent semantics. Are you going to handle the sharding of your Postgres data yourself? What replication system will you use for each shard? How will you ensure strong consistency? Will you be able to do transactions across shards? These are problems that systems like Spanner, CockroachDB, etc solve for you.
Just curious, why would distributed be design requirement? Is individual machine failure likely in AWS/GCP? The only failure I have seen in region level issues which spanner or dynamo don't help with AFAIK.
Individual machine failure is not likely, but we're hypothesizing the need for multiple shards for high performance. So now we have more machines and so the probability of failure increases. So we need to add replication, but then we need to deal with data getting out of sync, etc.... As others have mentioned though, these issues only really become important at a certain scale.
It's hard to find a use case that a plain old filesystem can't handle at small to medium scale. But there are perhaps more important considerations than just "can it handle it"
Uh let's not get carried away. It's fine with enough work maybe. But Postgres has a lot of awkwardness too. HA is a pain, major version upgrades are a pain, JS or JVM stored procs are a pain, configuring auth is a pain. There is a reason so many people are desperate to pay someone else to run Postgres for them instead of just renting a few VMs and doing it themselves.
It's the best, but it's far from perfect. The default isolation level isn't serializable, and going to `serializable` mode makes it very slow. Spanner is always ACID... but always slow.
I know the spanner marketing blurb says you can scale down etc. But I think in practice spanner is primarily aimed at use cases where you'd struggle to fit everything in a single postgres instance.
Having said that I guess I broadly agree with your comment. It seems like a lot of people like to plan for massive scale while they have a handful of actual users.
I said this in another comment, but I have seen _two_ applications in my career that actually had a request load that might warrant something like one of these databases. One was an application with double digit million MAU and thousands of RPS on a very shardable data set, which fit Spanner's ideal access pattern and performance profile pretty well, but we paid an absolute arm and a leg for the privilege and ended up implementing a distributed cache in front of Spanner to reduce costs. The other just kept the data set in memory and flushed to disk/S3 backup periodically because in that case liveness was more important than completeness.
In the first case, the database created as many problems as it solved (which is true of any large application running at scale; your data store will _always_ be suboptimal). A fancy, expensive NoSQL database won't save you from solving hard engineering problems. At smaller scales (on the order of tens-hundreds of RPS), it's hard to go wrong with any established SQL (or open source NoSQL if that floats your boat) database, and IMO Postgres is the most stable and best bang for your engineering buck feature wise.
Postgres/mysql shouldn't have much trouble doing thousands of RPS on a laptop for basic CRUD queries (i.e. as long as you don't need to do table scans or large index range scans). It's possible to squeeze a lot more than that out of them.
My team bought the "scale down" thing and got bit.
Using Spanner means giving up a lot for the scalability, and if you ever reach the scale where a single-node DB doesn't make sense anymore, I don't know if Spanner is still the answer, let alone Spanner with your old design still intact. For one, Postgres has scaling options like Citus. Or maybe you don't need a scalable DB even at scale, because you shard at a higher layer instead.
I've only kicked the tires, but https://neon.tech is a pure hosted-Postgres play. I'd be curious to hear if anyone has used them for a real project, and how that went.
I mean sure, NoSQL gives you more opportunities to screw stuff up because it's doing less for you. But it can be a reasonable tradeoff in some scenarios anyway
Sure, You can compare Cloud SQL vs Cloud Spanner and RDS vs Dynamo, but it makes more sense to just say "Postgres" and assume that the reader can figure out that it means "Whatever managed postgres service you want to use".
The entire point is that every cloud provider has a managed Postgres offering, and there's no vendor lock-in. Though, technically, Dynamo does have a Docker image you could run in other cloud providers if it came down to that; you'd get no support for it.
I don't think it's really relevant to compare plain Postgres to Spanner, most folks have no need for something like this. It is made for folks who need to do millions of ACID type transactions a second/minute from all over the globe, and have a globally consistent database at all times.
There's a reason why Google built and installed their own atomic clocks in their datacenters: to facilitate the global timekeeping this type of service needs. Most likely 99.9% of the time this type of database is overkill, and also likely way more expensive than you need.
> It is made for folks who need to do millions of ACID type transactions a second/minute from all over the globe, and have a globally consistent database at all times.
I think just doing some (not necessarily millions of) ACID transactions across the globe and having a consistent DB is a strong value proposition, even for small users.
The DynamoDB Docker image you're referring to will get you shot if you try to use it in prod. It's an API wrapper on top of SQLite and has a ton of missing functionality.
There are a couple of databases out there with DDB-compatible interfaces, like ScyllaDB.
The free tier is completely irrelevant here, though. The very reason someone might use Spanner is its excellent scalability. I don't believe there is any reason to use it for smaller projects other than education. The customers who will use Spanner are those for whom, for example, CockroachDB is not enough. For everybody with databases that are not that huge, PostgreSQL will do just fine.
> If you squint, any database engine is “in memory” if there is more buffer than data.
That is sadly not true. I remember one lonely night debugging an MSSQL 2012 instance that was _very_ slow, and it turned out that for a simple query (one join, 100 rows in one table and 10 in the other, 100 results in total, one WHERE clause) it forced writing the result to disk before evaluating the WHERE condition. Unable to fight the scheduler, I ended up making a ramdisk for this data.
Very true, but most people do not yet no about scale-to-zero pay-for-what-you-use sql server clouds with a free tier like CockroachDB and neon. They think that you must pay $5 a month to run a sql server, which has been the case until very recently, so they go with no sql options to get the free tier.
Edit: actually Spanner looks like another CockroachDB. You use SQL to interact with it. In which case I can see many people who would want to use this with a free tier for hobby projects, i.e. in between education and production development.
> Edit: actually Spanner looks like another CockroachDB. You use SQL to interact with it. In which case I can see many people who would want to use this with a free tier for hobby projects, i.e. in between education and production development.
Pedantically, CockroachDB is another Spanner. It was made by Google devs who left Google having previously used Spanner, and who intentionally made something similar to Spanner (ish; lots of handwaving happening here).
Oh c'mon, this is just 50 QPS. I mean, yeah, obviously someone with as little as 50 QPS is not going to bother with the massive scale and availability of Cloud Spanner.
There are a TON of applications today reaching 100M+ users in just a month; you are not dealing with 50 QPS there. Oh, and you forgot the crazy byte boundaries in DynamoDB.
If an item goes even a single byte over 1 KB, you are charged 2 write units!
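For anyone who hasn't been bitten by this, the rounding looks roughly like the sketch below. It's my own restatement based on DynamoDB's published boundaries (1 KB per write unit, 4 KB per strongly consistent read unit); the helper names are made up:

    import math

    def write_units(item_bytes: int) -> int:
        # Writes round up per 1 KB of item size.
        return math.ceil(item_bytes / 1024)

    def read_units(item_bytes: int, strongly_consistent: bool = True) -> float:
        # Strongly consistent reads round up per 4 KB; eventually
        # consistent reads cost half as much.
        units = math.ceil(item_bytes / 4096)
        return units if strongly_consistent else units / 2

    print(write_units(1024))  # 1
    print(write_units(1025))  # 2 -- one byte over the boundary doubles the write cost
    print(read_units(1025))   # 1 -- reads have more headroom before rounding up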
That seems nitpicky. The free tier is a marketing program, not a product.
"Google should offer intro discounts" is IMHO a very valid point (absolutely no idea why this doesn't exist), but it doesn't really speak to whether or not the real product is more or less expensive.
It's a bit different, because the free tier for DynamoDB is not like the other 12-month limited offers; it's marketed as free forever. So it's not just an intro product, it's something you can run a small business off for free.
I wish I could play around with Spanner for personal/side projects, but a production ready instance starts at $65/mo. DynamoDB can run for ~$0.00/month with per-request pricing.
You can! Spanner has a free trial: https://cloud.google.com/spanner/docs/free-trial-instance. Keep in mind that per-request pricing isn't free unless you stay under the free tier, so take a look at what those limits are, because going above them means you're not free anymore.
If you are interested in Spanner, you might take a look at CockroachDB, esp. the production-ready serverless offering, which is pay-for-consumption only. CRDB's architecture is essentially Spanner under the bonnet: GIFEE (Google's infrastructure for everyone else).
I'm torn, because I have really liked Google offerings in the past (I'm pretty locked in on Gmail, I have different things running on GCP already, etc.). But I've also been feeling burned a bit by Google suddenly ending services. I had all my domains happily in Google Domains until they recently and suddenly sold it to Squarespace, who I'm not interested in dealing with. My phone is a Google Pixel and I was using the Google Podcasts app, but I just heard that too is being discontinued and moved to YouTube Music, which is a service I tried and really disliked, so now I need to find a replacement for that too. I didn't personally use some other services, but I know there have been many others that were ended (such as Stadia for gaming, which made a lot of press at the time).
Those are more minor services in the long run, but it makes me a little nervous to go in again on Google for a critical service. Before I invest my time and effort into using it I have to ask myself "Will Google someday sell off or end the cloud spanner service? Will I be in trouble if they do so?".
As someone in the midst of transitioning an organization to GKE, Google Domains was the first shutdown that truly frightened me - AFAIK the first true B2B IT offering that was unceremoniously shuttered. Domain registration may be a regulatory/reputational minefield - but then so are many of their other cloud offerings, up to and including content distribution. I don't think it's indicative of a larger pattern of shutting down Google Cloud services yet, but it's certainly a yellow flag at least.
> But I've also been feeling burned a bit by Google suddenly ending services.
The products/services that "Google" the search company launches are different from those of "Google Cloud". While the discontinuation of Google products is annoying, it has nothing to do with Google Cloud products/services. I don't think Google Cloud abruptly announces discontinued products/services, since they have paid customers.
Regarding Google Domains that is a Google product. The equivalent product from Google is “Google Cloud Domains” which is available to Google Cloud customers.
> Regarding Google Domains that is a Google product. The equivalent product from Google is “Google Cloud Domains” which is available to Google Cloud customers.
Cloud services rely a ton on marketing, B2B relations, and customer support. Google has never exactly been about that stuff, and GCP was suffering. So they pulled in Oracle, MSFT, etc. execs, which, lame as that sounds, was probably the right move for GCP in particular.
And I guess this is the kind of marketing that attracts the customers they want.
Exactly: I have to provision for peak throughput on Spanner. Average throughput is much lower than peak throughput, so I'm doubtful of seeing savings on Spanner.
(But I bet that Spanner is much easier than DynamoDB to develop with...)
You can scale Spanner up/down based on demand although there is a lag time with it.
I built a system that relies on a high-performance database and tested with both AWS DynamoDB and Google Cloud Spanner (see disclaimer) and was able to scale Google Cloud Spanner much higher than AWS DynamoDB.
DynamoDB is limited to 1,000 WRUs per node, and there isn't an obvious way to get more than 100 nodes per table, so you're limited to 100,000 WRUs per table (= 102,400,000 bytes/sec ≈ 97.7 MiB/sec ≈ 781 Mib/sec) -- even if you reserve more than 100,000 WRUs in capacity for the table. The obvious workaround would be to shard the data across multiple tables, but that would have made the software more difficult to use.
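The arithmetic behind that ceiling, spelled out (my own restatement, using the 1 KB-per-write-unit boundary):

    WRITE_UNIT_BYTES = 1024
    max_write_units = 100 * 1000              # ~100 nodes * 1,000 WRUs each

    bytes_per_sec = max_write_units * WRITE_UNIT_BYTES
    print(bytes_per_sec)                      # 102,400,000 bytes/sec
    print(bytes_per_sec / 2**20)              # ~97.7 MiB/sec
    print(bytes_per_sec * 8 / 2**20)          # ~781 Mib/sec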
Google Cloud Spanner was able to do much more than 97 MiB/sec in traffic (though the exact amount isn't yet public), and was also capable of much larger transactions (100 MiB, versus DynamoDB's 25 items -- now 100 -- of up to 400 KiB each, roughly 10 MiB), which was a bonus.
Disclaimer: The work was funded by a former Google CEO and I worked with the Google Spanner team on setting it up, while I am a former AWS employee I didn't work with AWS on the DynamoDB part of it, though I did normal quota adjustments.
This definitely happened with Google Maps, where Google was clearly the dominant player. Has it happened to other services? It seems far less likely to happen for business cloud services where they are a distance 2nd (3rd?) to AWS. I know of some examples (admittedly ancient) where they have reduced costs: https://cloudplatform.googleblog.com/2015/05/Pay-Less-Comput...
Before they offered Google Cloud, they offered Google App Engine. After they introduced Google Cloud, they got bored of App Engine and 10x'd the price. This was particularly painful because App Engine was a batteries-included solution with heavy vendor lock-in.
> After they introduced Google Cloud they got bored of App Engine and 10x’d the price.
I don't know about "got bored of"; I'd say more "effectively deprecated, using increasing costs† as an implicit push toward rewriting your service for more modern parts of their platform."
> Cloud Run is the latest evolution of Google Cloud Serverless, building on the experience of running App Engine for more than a decade. Cloud Run runs on much of the same infrastructure as App Engine standard environment, so there are many similarities between these two platforms.
> Cloud Run is designed to improve upon the App Engine experience, incorporating many of the best features of both App Engine standard environment and App Engine flexible environment. Cloud Run services can handle the same workloads as App Engine services, but Cloud Run offers customers much more flexibility in implementing these services. This flexibility, along with improved integrations with both Google Cloud and third-party services, also enables Cloud Run to handle workloads that cannot run on App Engine.
Anyone who's still on GAE (rather than having moved over to Cloud Run) at this point is a "legacy enterprise customer"; and so Google have at this point moved GAE pricing beyond just a monetary disincentive to use, to being "fired-customer pricing" — i.e. the price you charge when you don't really want to work with a customer any more, a price that says "go away", but if they still want to pay you even at that price-point, then sure, why not?
But Google Cloud was a rename/expansion (specifically, an expansion) of Google App Engine, GAE being a pre-existing Google product that was then made part of Google Cloud when the latter brand was launched.
BigQuery initially was a lot more powerful; then they started adding a bunch of resource limits on queries that you could only overcome by paying up. Not a direct fee increase; rather, you paid the same fee for a worse product.
In my world, most production services need to last longer than twelve years, so they need relative pricing stability and no complete pricing restructures over that period. Perhaps your world is different.
We are introducing a new charge for public IPv4 addresses. Effective February 1, 2024 there will be a charge of $0.005 per IP per hour for all public IPv4 addresses, whether attached to a service or not
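Back-of-the-envelope on what that announcement means per address (my own arithmetic, nothing official):

    hourly_rate = 0.005                # USD per public IPv4 per hour
    print(hourly_rate * 24 * 30)       # ~$3.60/month per address, attached or not
    print(hourly_rate * 24 * 365)      # ~$43.80/year per address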
Yeah. I think the whole point of Spanner is to handle larger, larger, larger workloads and databases, and if you can fit everything on a single server, you definitely should.
Comparing Postgres to Spanner is kind of like comparing a delivery van to a train. The train is always going to have higher overhead costs.
Linux admin is a useful skill, but I know my Linux admin skills can’t compete with the reliability, availability, and scalability of cloud systems… like Dynamo, S3, Spanner, etc.
> decided I might as well give up and learn the basics of Linux admin once and apply it for life.
Yeah I think this raises an underappreciated drawback of working on heavily AWS/GCP native projects. So much of the time ends up being spent on service level config and troubleshooting that has little relevance elsewhere.
Or you could use DynamoDB for basically free. One month of 1 GB of storage, 1 KB item size, 100,000 writes, and 100,000 reads would be $0.39 on DynamoDB on-demand. A million writes and reads respectively would be $1.63. Strongly consistent reads would make that $1.75, and transactional writes would make it $3.00.
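Those figures check out if you plug in the on-demand list prices from around that time ($1.25 per million write request units, $0.25 per million read request units, $0.25 per GB-month of storage in us-east-1; verify against current pricing before relying on it). A sketch of the arithmetic:

    WRITE_PRICE = 1.25 / 1_000_000   # per write request unit (item up to 1 KB)
    READ_PRICE = 0.25 / 1_000_000    # per read request unit (strongly consistent, up to 4 KB)
    STORAGE_PRICE = 0.25             # per GB-month

    def monthly_cost(writes, reads, gb, strongly_consistent=False, transactional=False):
        wru = writes * (2 if transactional else 1)          # transactional writes cost double
        rru = reads * (1 if strongly_consistent else 0.5)   # eventually consistent reads cost half
        return wru * WRITE_PRICE + rru * READ_PRICE + gb * STORAGE_PRICE

    print(monthly_cost(100_000, 100_000, 1))      # ~0.39
    print(monthly_cost(1_000_000, 1_000_000, 1))  # ~1.63
    print(monthly_cost(1_000_000, 1_000_000, 1, strongly_consistent=True))  # 1.75
    print(monthly_cost(1_000_000, 1_000_000, 1, strongly_consistent=True, transactional=True))  # 3.00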
I get it but I feel like it isn't really mine, and learning this new db/console/product/vendor that won't be around in 10-20-30 years is a waste of very limited time.
Linux admin + hosting a server = my data, on my terms, until I keel over, and possibly long after that.
By the time DynamoDB goes away, that Linux server you have will have been EOL for two decades and chock full of security holes, not to mention disks filling up or getting wiped. The amount of time you'll spend tinkering with your server will dwarf the tiny amount of time it'd take to replace DynamoDB with an alternative service.
- in the next few decades, my Linux servers will have been updated completely multiple times
- software updates happen on my schedule and at my behest
- I can move to newer hardware whenever the mood strikes me
- I maintain full de jure and de facto ownership of my data (AKA I control it completely)
- Since I own the data, I can always upload it to some vendor in future. Due to vendor lock-in, non-standard data formats, and my least favourite: data egress fees, it's not straightforward to go from a vendor to another vendor, or from a vendor to DIY. I maintain maximum optionality
- Since I committed to the private server path, I can take full advantage of the server being a general computing device. I can combine web-hosting, databases, and other things on the same device / a stable of devices. I end up having ridiculous performance, full control of my entire stack, and at a huge discount, and it's a very simple system.
Security concerns are addressed in a couple of ways:
- By having everything on one server, or by architecting things just so, I can stand up a database that does everything I need, including serving my web-apps, without ever facing the public internet directly.
- Maintaining a secure server is admittedly more of an ongoing chore, but it's not a significant timesink at all
- Every online service by AWS et al ultimately runs on a server much like mine, so if there's some serious widespread Linux vulnerability, it'll affect managed services just as much as my server.
- The managed services themselves are not only juicy targets but are also vulnerable to both hacking and phishing. I'm convinced that Postgres on a Linux box reachable only over SSH is a safer option than a more complicated setup (see the sketch after this list).
All of the above assumes my apps will never be planet-scale, which even in the most bullish case, they never need to be.
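To make the "never facing the public internet" point concrete, here's a minimal sketch of reaching a Postgres that only listens on localhost, tunnelled over SSH. It assumes the third-party sshtunnel and psycopg2 packages; the hostname, user, key path, and database names are placeholders:

    import psycopg2
    from sshtunnel import SSHTunnelForwarder

    # Forward a local port to the server's loopback-only Postgres over SSH,
    # so the database never has to accept connections from the internet.
    with SSHTunnelForwarder(
        ("my-server.example.com", 22),
        ssh_username="me",
        ssh_pkey="~/.ssh/id_ed25519",
        remote_bind_address=("127.0.0.1", 5432),
    ) as tunnel:
        conn = psycopg2.connect(
            host="127.0.0.1",
            port=tunnel.local_bind_port,   # ephemeral local end of the tunnel
            dbname="app",
            user="app",
        )
        with conn, conn.cursor() as cur:
            cur.execute("SELECT version()")
            print(cur.fetchone()[0])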
The more I deal with scalable relational DBMSes (Spanner in particular), the more I doubt their usefulness even at large scale. Relational + fully-consistent are always at odds with scalability. It seems like either you have a classic NoSQL use case and use that, or you shard at the application level rather than in the database.
Could someone share a use case where you truly benefited from migrating from Postgres/MySQL to Spanner/Citus/Cockroach, and there was no better solution? I'd like my hunch to be wrong.
That really should be the first heuristic for almost any systems-design problem: if you can afford to buy a big enough pair of machines (primary plus hot standby) to fit your data in Postgres, then just do that.
don't bother with mongo or mysql or dynamo or cassandra or bigtable or spanner or ... until your lack of profitability or size means you can't afford to just use postgres.
> What do you mean "can't afford to just use postgres"?
if I have 10TB of hot data, can I afford two machines with 10T of RAM each? how about 100T?
> I thought postgres cost per query in many cases is cheaper than competitors.
That's not really a useful metric without size/latency/etc. attached to it; being cheap at 0.1 QPS might be fine for a YC company, but that's no good for my successful company, etc.
One of the best parts of DynamoDB is all of the patterns others have figured out (e.g. https://www.dynamodbbook.com/); I'm not aware of the equivalent for GCP Spanner (besides an O'Reilly book I've never read).
> 7. Benchmarking. Customer may conduct benchmark tests of the Services (each a "Test"). Customer may only publicly disclose the results of such Tests if (a) the public disclosure includes all necessary information to replicate the Tests, and (b) Customer allows Google to conduct benchmark tests of Customer's publicly available products or services and publicly disclose the results of such tests. Notwithstanding the foregoing, Customer may not do either of the following on behalf of a hyperscale public cloud provider without Google's prior written consent: (i) conduct (directly or through a third party) any Test or (ii) disclose the results of any such Test.
It looks like that is less restrictive than it used to be. I found a blog post from last year mentioning an additional requirement to obtain Google's prior written consent prior to publishing (for all customers, not just fellow cloud providers) which no longer is included. https://cube.dev/blog/dewitt-clause-or-can-you-benchmark-a-d...
Benchmarks vary quite a bit based on specific workloads. That is why open benchmarking helps, where anyone can validate the benchmarks. Google enforces a DeWitt clause, which makes it impossible for a third party to publish such benchmarks. Quite a few cloud providers don't restrict that now.
The DeWitt clause is only there to prevent badly written performance data from getting attention it shouldn't. If anyone writes a good benchmark (good process, not good results, necessarily) to bring to us, our product teams absolutely will consider it.
I have a personal GCP account with an average spend of $12/month over the last year or so. The past week, I've been bombarded with messages and phone calls from a GCP saleswoman trying to get me to spend more. These business practices reek of desperation.
That headline is a half-truth. The price of a cloud database depends on so many things: what the I/O is, what the storage utilization is, cross-region traffic, etc. Picking one scenario and claiming the cost to be half is too simplistic and a marketing gimmick.
Also, which Dynamo price are they comparing to? On-demand? Provisioned? Reserved provisioned? We had a huge Spanner DB with low throughput, so we had to add idle nodes just for storage, which also ballooned our Spanner costs (the increased storage size per node announced here helps).
An internal version has been used within Google since ~2012 and the GCP version has been available since 2017. It's used extensively within Google. It's not going to be deprecated any time soon.
At the end of the day this comes down to more than cost. I trust AWS to be a stable, long term foundation to build a product on, I don't trust GCP to be the same.
I'm really excited about this as a customer. We're probably going to be able to save a lot of money because of it. So now, with the new Postgres compatibility layer and the "lower cost", it will be easier for me to choose Spanner when we start new projects.
Really glad to hear! Please find me on social media or LinkedIn and let me know how it goes for you using the PG layer. I'd love to hear more feedback.
Well, the thing with Google is that whenever a service does not get enough users to reach their target revenue, they can suddenly shut it down the following year. This is the reason why I would always go with AWS or Azure, with GCP as the last choice.
DynamoDB and ScyllaDB are very different use cases than Spanner.
• Spanner is more in the domain of CockroachDB — distributed strong consistency and ACID compliance. Both are ANSI SQL.
• DynamoDB and ScyllaDB are in the NoSQL key-value store / wide column domain. ScyllaDB is API compatible with DynamoDB, but is also API compatible with Cassandra Query Language (CQL). These databases are for eventual consistency use cases.
As to the "99%" comment, there are over 415 databases currently tracked on DB-engines.com (https://db-engines.com/en/ranking). It is a very competitive and specialized industry. Very few databases have more than 1% marketshare these days.
...only the top 9 options show >1% of marketshare.
MySQL being so far in the lead (42%) is far more likely due to it being baked into so many OEM deals (over 2,000 as per https://www.mysql.com/oem/) and into, like, every WordPress site in the world than to anything else.
Yet that doesn't really make MySQL great for every use case. It has obvious limitations.
We also have to note that different studies have far different results.
It shows a different set of top databases, and has 11 of them showing >2% of marketshare. It has MySQL at only 13.91% of marketshare, with Oracle being on top at 31.19%.
The source of this poll is TOPDB Top Database index:
It's not based on revenue market share, nor on poll results; it simply counts how often a database is searched on Google. Note that anything that could garner 1% of the market would suddenly catapult that database into 16th place on the list.
At the end of the day, database "market share", however it is calculated, is not representative of exclusive percentages. Many companies run more than one type of database. There might even be multiple SQL and NoSQL databases, each designed for purpose, at large enterprises. OLTP, OLAP. I would not be surprised in the least to learn that large corporations had well over 100 different databases running at any given time.
So, back to your basic comment: if a database is useful for even 1% of the use cases, that would still place it in the top 20 databases in the industry.
Moreover, a lowest-common denominator database will not be able to support many critical use cases where you really do need a specialized data model, or index type, or query language, or workload type, or latency, or distribution, or scale of QPS/TPS/OPS, or total data size or query payload size, etc.
I have likened the database industry to the state of Christianity after the Reformation. Even if there are still plenty of "Catholics" (classic SQL adherent, like Oracle), there are also Reformed faiths ("NewSQL" / "Distributed SQL", e.g., PostgreSQL), plus any number of Protestant reformation alternatives (NoSQL). I'm not sure what the "Orthodox" in this analogy refers to. Maybe SQL data warehouses? The schism between OLTP and OLAP? Dunno.
Anyway, it's an analogy. Bound to break down at some point.
At the end of the day, my advice: use the database most closely aligned to the dataset and workload and use case you have at hand. Don't just throw something at a problem because it's a popular choice, or because it's the database you happen to be most familiar with.
AWS is really no better. Both consoles are a mess in their own ways. CLIs are really what matters, though. In general, I have found the GCP command line (“gcloud”) easier to work with.
Tell me more (bonus points if you find me on LinkedIn or other social because tracking comment responses on HN is really rough). I'd love feedback you have so I can bring it back to the product team!
There is no problem PostgreSQL can't solve unless the problem involves massively distributed global databases that need to be incredibly fast and in sync with each other, and even then I think Slony can help.
These are not the same kind of data store. Document stores like DynamoDB serve a specific operational write-model purpose, whereas an RDBMS is better suited to analytics and reporting.
Spanner is probably the least likely thing to ever get cancelled at Google, by a long shot. It powers basically every single product at Google and has been around for many years. I can't think of another Google product that has been around for so long and used internally and externally so heavily besides Gmail (which uses Spanner).
A problem for me building on this is that Google Cloud has no coherent strategy. So the price could easily be double or quadruple Amazon in 2 years.
Google Maps is the key lesson here. 10 to 20 times price increase, just because someone had a meeting. No justification or coherent strategy, just "sorry here's a shiv to the gut".
At least with Amazon, as chaotic as it is, you know that they just mark stuff up to the margin they want and let the market sort out what gets used. Google is into secretive grand strategy that changes completely anytime 4 product managers get together in a room.
Oof, thanks for reminding me about the Google Maps pricing increase.
It was very difficult to understand their thought process, and we were pretty much forced to find alternative services that had a more reasonable pricing structure. The amount they wanted to charge us for a few million geocode requests per month was bananas.
Was there really no value in keeping a customer like us around at a more reasonable rate?
It's surprising how Google has shot itself in the foot with short-term thinking. Nobody trusts them for anything that involves long-term support on any new services/products (and several already-existing products).
Yegge had an article saying that Google had really good tools to clean up uses of deprecated calls and the like, and that this had made them way too cavalier about breaking changes. Can't vouch for it, but it's an interesting dynamic.
I mean, a common joke among ICs at Google is that the API or service you're calling is deprecated but the replacement system isn't ready yet. They have good tools, but the cavalierness about breaking changes has more to do with teams being able to just say "not supported anymore" for internal APIs and services, plus a more thorough test suite and really principled rollout strategies that catch issues before they get too big.
This thread is already a damning indictment of Google Cloud. Price cuts alone won't attract customers; Google has sustained significant reputational damage with its crappy customer service and support. That's hard to fix.
Yes, but also 3 out of every 4 people here have applied to work for Google, and most got rejected, so that can contribute to the negative sentiment.
Spanner is a pretty solid database, and it checks all the boxes: consistency, geographic replication, and transactional updates are not present at the same time in most large-scale DBs out there.
I've been doing Android development since 1.0, many Android developers are unhappy with the abusive "relationship" they have with Google, because of their control over the Play Store, your developer account, and resolving things if they come up. I'd pick any non-Google tech if there was an alternative.
I get the skepticism, but these core SaaS offerings tend to have a much longer support life than consumer products. Google’s Bigtable is still around as a cloud offering, and it’s almost 20 years old at this point.
Spanner is used for a lot of internal Google services, too. Spanner has basically been eating all the other databases Google uses internally. Nom nom nom, every database that used to be something else is now Spanner.
GCP is profitable and growing 27% YoY. It made more money than YouTube ads did in Q2 2023. In what world would it make sense to shut down GCP at this point?
I think the "google shuts things down" trope is justified and something that I wince at every time I see something like this, but GCP isn't a random pet project that loses money and has no users. It's a very successful enterprise business.
(disclaimer: I work at google in cloud, so obviously I have personal bias, but this is just my opinion based on the pure numbers.)
I really doubt that. They're finally starting to turn a profit ($200 million) and it's an $8-billion-a-quarter line of business. Google shuts down businesses that aren't as revenue-generating, but $8 billion a quarter ($32B annually) is a massive business, and the "losses" were likely mostly R&D reinvestments to keep growing that line of business.
Does Google Cloud Spanner use the Spanner database per the 2013 paper? Or does Google simply use the brand and implement a cheaper and more performant db under the hood? I suspect it is the latter because most companies do not need global consistency anyway, so it may make sense to relax parts of what Spanner was originally built for.
More specifically, infra and cloud Spanner are the same stack. So they've progressed together hugely since 2013. :) The real differences between the two are more about the internal tooling we (Google) have around infra that's built up with our other services that consume it over the years that aren't relevant to anyone other than Google.
No, we don't miss them at all.