Scaling to 100k Users (alexpareto.com)
650 points by sckwishy on Feb 5, 2020 | 186 comments



This is relevant only for multimedia apps.

I have fintech systems in production with 100k+ users, including a complex government app for an entire country, running on commodity hardware (the majority of the work is done by 1 backend server and 1 database server, with all reporting handled by 1 reporting backend server using the same db). Based on our Grafana metrics it could survive 10x the number of users without an upgrade of any kind. It runs on Linux, .NET Core, and SQL Server.

Most software is not multimedia in nature, and those numbers are off the charts for such systems.


Thank you for this post. I read "10 Users: Split out the Database Layer" and about had an aneurysm.

I also work with fintech systems built upon .NET Core and have similar experiences regarding scaling of these solutions. You can get an incredible amount of throughput from a single box if you are careful with the technologies you use.

A single .NET Core web API process using Kestrel and whatever RDBMS (even SQLite) can absolutely devour requests in the <1 megabyte range. I would feel confident putting 10x the largest customer we can imagine on a single 16-32 core server for the solution we provide today. Obviously, if you are pushing any form of multimedia this starts to break down rapidly.


As someone that uses Python and Django most of the time (a "slow" language), I haven't seen much in the way of performance problems at the application level. Usually some database optimization will do the trick.

Actually, thinking about it: I have inherited plenty of performance problems at the application layer, but that's because the joins were effectively being done at the application level (making huge numbers of database calls) when they should have been done at the database layer.
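To make that concrete, here is a minimal Django sketch of the anti-pattern (the Customer/Order models are made up for illustration): the "join in the application" version issues one extra query per row, while select_related lets the database do the join in a single query.

    from django.db import models

    class Customer(models.Model):
        name = models.CharField(max_length=100)

    class Order(models.Model):
        customer = models.ForeignKey(Customer, on_delete=models.CASCADE)
        total = models.DecimalField(max_digits=10, decimal_places=2)

    # N+1 anti-pattern: 1 query for the orders, then 1 more query per order to
    # fetch its customer -- the join effectively happens in Python.
    slow = [(o.id, o.customer.name) for o in Order.objects.all()]

    # One SQL JOIN instead: the database does the work it is good at.
    fast = [(o.id, o.customer.name)
            for o in Order.objects.select_related("customer")]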


It seems like they recommend splitting out the database at the start because using a managed service is much easier than properly managing your own production database.


I can't speak for high performance/near or realtime system people but I would not trust a managed database service for those needs. My experience is that the managed offerings lag behind upstream versions substantially and usually are economical because they are multitenant. So you have a bit less predictability in io wait / cpu queue, lose host kernel level tunings (page sizes or hugepages, share memory allocation, etc), and - not naming names - some managed db services are so behind they lack critical query planner profiling features. That's not even going into application workload specific tuning for various nosql stores. This is a nice article but its audience is people that haven't scaled up a system but are trying to cope with success. It's not great generalized scaling advice.


Those all sound like concerns that probably aren't your top priority when you have 10 users unless your product is performance critical, and then none of this guidance applies anyway. When you have 10 users, your database concerns are more likely to be: Is my database up? Is my database secured? Do I have automated backups? Am I otherwise at risk of losing data? Is the database fast enough for now to not be a blocker to my business? All of which are concerns that tend to be well covered by managed database services.


All I will say is that our latency getting a business entity in or out of a SQLite database (running on top of NVMe flash) is on the order of tens to hundreds of microseconds. There will never be a hosted/cloud offering that can even remotely approach this without installing some "bring the cloud to you" appliance in your datacenter.
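For what it's worth, a quick way to sanity-check that kind of number yourself is a point read against a local SQLite file; on a warm cache this mostly measures SQLite itself rather than the NVMe, and the table below is made up purely for illustration.

    import sqlite3, time

    conn = sqlite3.connect("entities.db")
    conn.execute("CREATE TABLE IF NOT EXISTS entity (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT OR REPLACE INTO entity VALUES (1, '{\"name\": \"example\"}')")
    conn.commit()

    start = time.perf_counter()
    row = conn.execute("SELECT body FROM entity WHERE id = ?", (1,)).fetchone()
    elapsed_us = (time.perf_counter() - start) * 1_000_000
    print(f"point read: {elapsed_us:.1f} microseconds")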


It almost kills me, just 10 years ago it was nearly six figures to get a RamSan that had 2TB storage and did 100k/25k r/w iops (https://www.networkworld.com/article/2268291/less-expensive-...)

Now a WD Blue NVMe does 95k/84k r/w iops at $215 and that's just off their website (https://shop.westerndigital.com/products/internal-drives/wd-..., may be more depending on shipping method...)

That said, it's not a fair comparison and I wouldn't want to run a big service on a single sqlite/nvme setup for more reasons than are worth mentioning, but not prematurely optimizing can take you really far - scale and money - with good design.


I think we should also be clearer about what "10 Users" means. What does the author mean? What do we mean?

It could be 10 users who each view a few web pages once a month, or it could be 10 heavy users of an internal system who use it most of the working day and generate thousands of calls to the system per user each day.

As a lot of people have mentioned, it is also very different between information that can be easily cached and things like monetary transactions, where you don't want reads to return stale data and therefore often can't cache or use a read replica db.

For systems with heavy users I have seen the need to use a few servers even with just 1000 - 2000 users. For a web system the same can happen if you have 1000 users active on the site at the same time (within a minute or so).


Yeah, for a normal web app, we can easily have 100x the users at each step mentioned in the article.

For our company, we had more than 500k users with 1 small nginx + 1 medium app server (with autoscaling, though we never needed it) + 1 small cache server and RDS till now. We just added an AWS managed load balancer into the setup and think it might be overkill.

For a client with NewsFeed needs, I used a dedicated server (64GB, 2TB space) to run nginx, the app, the cache, a huge Elasticsearch and Postgres. It was a great (and cheap) option for an MVP and let them validate the product for a few months with >10k users.

It was awesome to learn, a few years ago, how much compute power we don't use.


Yes exactly. I have an app with about 1m concurrent users at peak hours. For years it has been running perfectly fine on a single 8 core machine. Linux -> Nginx -> memcache -> nodejs -> filesystem. Yes, filesystem. All content is fetched from files. Another process populates all the files. I pay 40 dollars/m for this self-managed machine.


What do you do when that one machine goes offline? Say there's a power outage, or a networking issue, or you have to reboot with a patched kernel. Now your 1M concurrent users can't use your application. This is why we go to the trouble to use multiple machines, ideally in multiple availability zones if we're using one of the big three cloud providers.


What is wrong in this case w/ multiple VPS's? For instance, I run a shipping system that our entire operation depends on. I can't afford downtime either, as that would shutdown our entire shipping department. However, 1 server running on-prem, 1 remote in shared hosting and 2 failover servers on DigitalOcean costs ~$120/month.

What would a managed cloud like AWS or Azure provide besides higher expenses? Last I checked, AWS would be ~$550/month based on their calculator.

IT departments managed redundancy long before cloud was a thing, and it's certainly possible to achieve acceptable uptime without the managed services. I realize that I still rely on DO for their uptime, and I'm pricing renting rackspace for that reason. However, uptime is currently within acceptable levels (ie I haven't had to explain to the CEO why all his employees are standing around). We had tried 2 SaaS vendors before our current on-prem solution, and both had had downtime that cost us money. Both were hosted in AWS, so clearly that isn't the panacea for ultimate reliability. If our on-prem server goes down, I get a call. However, I got calls when our SaaS vendor went down as well; sitting on line with their tech support wasn't any great comfort.

The same graceful failover processes are accessible on machines you control, there's no reason that you have to run on one machine if you avoid the cloud. The biggest hurdle for me to implement this was database concurrency with multiple servers, but a cloud solution wouldn't have done anything to solve that problem.

[edit:] typos and mistyped AWS pricing due to mobile keyboard.


I tell the users to try later. Not every app needs an SLA.


That could be solved by mirroring the system to another $40/mo machine in another data center.

Redundancy was around long before cloud compute was a thing.


Even though the absolute numbers may well not apply to other types of apps, the general concepts of how to scale do. We have 100k+ users on effectively a single box too (actually two for redundancy, ease of upgrades, etc., but one can handle it), but this is a great overview of how to think about scaling beyond that, however many users that's done at. Honestly when I was reading the article I read that 1/10/100 as more of a unitless degree of scale than actual numbers of humans.


Yeah, it really depends on the kind of app. My app with a couple of hundred thousand users runs on a single $5 Digital Ocean droplet with standard PHP and MySQL.


> My app with a couple of hundred thousand users

per year? per month? per day? simultaneously? doing what?

it matters.

i ask this as someone who runs a $40/mo Linode with a debian/nginx/node/mysql stack that's definitely 20x over-provisioned for an e-commerce site with 10k daily visitors, 15 simultaneous backend users (reporting, order-entry, CRM, analytics) and 0 caching tricks. i could easily run the site on any 5 year old laptop with an SSD and 8GB RAM.

normalize/de-normalize when needed, understand and hand-write efficient SQL queries (ditch ORMs), choose small/fast libs carefully (or write your own), and you can easily serve 100k users per day on a single cheap VPS with no orchestration/replication/hz-scaling bullshit. definitely can't say the same about 200k simultaneous users - that would need proper hardware, but can still be a single server.

Monoliths Are the Future: https://news.ycombinator.com/item?id=22193383
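For what it's worth, the "hand-write efficient SQL" suggestion above usually amounts to something like the sketch below: one explicit, parameterized query instead of whatever an ORM would generate (sqlite3 is used here only so the sketch is self-contained, and the orders/customers schema is made up).

    import sqlite3

    conn = sqlite3.connect("shop.db")

    def daily_revenue_by_customer(conn, day):
        sql = """
            SELECT c.name, SUM(o.total) AS revenue
            FROM orders o
            JOIN customers c ON c.id = o.customer_id
            WHERE o.created_on = ?
            GROUP BY c.name
            ORDER BY revenue DESC
        """
        return conn.execute(sql, (day,)).fetchall()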


>(ditch ORMs)

Depends on which ORM you use, I think. Basic ORMs like Dapper in .NET just map your SQL query result to an object model.


Dapper's a "micro-ORM" and quite lightweight. Usually folks intend ORM to refer to Hibernate or Entity Framework, maybe ActiveRecord, etc.


> hand-write efficient SQL queries (ditch ORMs)

Raw SQL strings or a query builder?


doesn't matter. query builders are negligible overhead.


What is the e-commerce site?


i'm working on a blog post about the tech choices & stack. maybe it'll make it to the front page in a month or two - that'll be a good stress test!


Are you offering a free stress test? :)


Same thing here, TravelMap.net has 100k+ users (total), handles photo uploads and runs on a simple droplet with PHP and JS.

I guess it depends on the daily traffic.


Do you mind sharing more details on how you store and manage these photo uploads? Are you leveraging NFS or cloudfront + S3 at all?


Wow, what's your app?


Hey, really specific question regarding your deployment. A teammate of mine reported difficulties with the .NET SQL Server database driver establishing a connection from a Linux client (a container instance based on the public .NET Core image) to a SQL Server instance (on Windows). Are you familiar with this problem? I think moving our systems over to .NET Core on Linux is the future, but this one experience has somewhat soured the idea for some decision makers and the team managing our db.


I bumped into that at the office - connecting from a linux server to a microsoft sql server. We were able to connect after installing unixodbc and microsoft's odbc driver for sql server on linux. Note that a local mssql account is required, you can't connect with microsoft domain authentication from a linux box. We haven't finished testing and tweaking, but so far it looks like it's working.
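For anyone hitting the same wall, this is roughly what the unixODBC route described above looks like; pyodbc is used here purely for illustration (the server, database, and credentials are placeholders), and it assumes Microsoft's "ODBC Driver 17 for SQL Server" package plus unixODBC are installed on the Linux box, with a SQL Server login rather than domain auth.

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlhost.example.internal,1433;"
        "DATABASE=AppDb;"
        "UID=app_user;"      # local SQL Server login, not a domain account
        "PWD=example-password;"
        "Encrypt=yes;TrustServerCertificate=yes;"
    )
    print(conn.execute("SELECT @@VERSION").fetchone()[0])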


Right, the auth from non-Windows was a separate organizational hurdle we had to overcome; there was real hesitancy about not using a GMSA Windows user to auth with SQL Server.

When you say a "local mssql account is required", do you mean a user or some other entity on the Linux client? Or do you mean an mssql user that the client will authenticate as?


an mssql user account


Two more questions (if you see this, I know it's been a few days), are you running SQL Server on Linux or Windows? And are you using encrypted connections?


Any progress on the tests? I'm going with Postgres because of its binary JSON (JSONB) support: NoSQL-style documents with duplicated SQL fields for search.

Using .net core also.

In general: I agree, there's rarely a case for really using the cloud. Page loads of my e-commerce project are 8ms for the basket; I'm wondering what will give out first under big load, even without caching. Probably the database, not sure yet.


It also depends on the nature of the fintech app. Imagine running 100K+ users on a stock trading app for day traders, where latency matters and a lot of streaming happens.


SQL Server is a different beast. You get a lot of performance enhancements out of the box. Yes, it costs money, but you save a tremendous amount of tweaking time and headaches.


Or you can just use Azure SQL and essentially just pay what it costs for the box, because it is platform as a service. It's far cheaper (and easier to maintain) than what it'd cost to buy a SQL Server license and run it on a VM.


A 32 core SQL Server license is $700.


This page[1] seems to indicate licensing SQL Server (Standard Ed) for 32 cores would cost $59,472 ($3,717 per 2 cores). Where do you get $700 from?

[1] https://www.microsoft.com/en-us/sql-server/sql-server-2017-p...



That is a 10 CAL license, so you can only have 10 users using it, and those are 10 assigned users. Don't quote me on that, but you have to use core licenses for web apps if you have internet users.


Those prices are too good to be true, reminds me of the sites that sell windows or office licenses to personal users for like 10 bucks.


This site isn't Microsoft?


I’ll bite. What’s a “fintech system” with “complex Gov app”?


Looking at his CV [1], it looks like he's done a lot of work inside the Serbian Treasury.

1: https://gist.github.com/majkinetor/877d5174ba322fbb808cc47a8...


An hour of work put in by any of you reading this is worth several months of hosting for a starter project in an expensive provider like Heroku.

Do not invest time making sure your service runs for $6 a month if it can run for $50 with 0 hours invested. Invest that time talking to customers and measuring what they do with your service.

Most times a few customers pay for the servers.

This is just a friendly reminder. I see a lot of comments talking about running backends for cheap.


A friend of mine recently launched a side-project that does heavy processing of audio. He decided to invest ~2-5 hours properly setting up auto-scaling, a job queue, etc, before releasing v1.

Fast-forward two days later, his service and a competitor were both featured on Product Hunt. He's now making a profit on the service, as he managed to scale it up very fast, while the competitor buckled and completely lost momentum.

If you're talking about spending _a long time_ preparing a perfect infra, then your argument makes sense. Spending a few hours? It's both a great learning exercise and can literally save your project, so why not?


Auto-scaling doesn’t take 3-5 hours to set up. Or come work for us!


He used GCP, and set up a Pub/Sub topic and a Cloud Function. It took less than an hour to set up; the rest of the time was spent rewriting a portion of the code to write to the queue, etc.

There are other ways on other platforms as well (e.g. if you're using Kubernetes).
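For reference, the Pub/Sub + Cloud Function setup described above is roughly the sketch below (the project, topic, and process_audio step are hypothetical): the web app publishes a job message, and GCP scales the function instances with the backlog.

    import base64, json
    from google.cloud import pubsub_v1

    # Producer side: the web app just drops a job onto a Pub/Sub topic.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "audio-jobs")
    publisher.publish(topic_path, b'{"file": "gs://uploads/clip.wav"}').result()

    # Worker side: a background Cloud Function subscribed to that topic.
    def handle_job(event, context):
        job = json.loads(base64.b64decode(event["data"]))
        process_audio(job["file"])   # hypothetical heavy-processing step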


With a Kubernetes autoscaler, it takes you ~20 minutes to set up pod autoscaling. If you run your k8s on, say, GKE, setting up node autoscaling is another 5 minutes.


I totally agree with you, once you've factored in the dozens of hours gaining knowledge of k8s and hundreds+ of hours of experience dealing with it in production. You can get by with far less, but it's going to be pretty stressful when things go sideways in prod without knowing exactly why.


This is sort of a strange argument. If your argument is that we need to factor in prior knowledge, naturally I will ask: how MUCH prior knowledge?

* do i need to factor in experience with VMs?

* do i need to factor in OS and networking theory learnt in college?

* do i need to factor in high school algebra?

* do i need to factor in learning the english language?

Of course you need to factor in learning k8s. Before, it was learning VMs + {ansible, chef, salt}; before that it was probably something else.


In theory maybe.

In practice, the application running in the pod has to be aware of this, and the most intensive parts are rarely the bottleneck. Most of the time this is an architecture issue, not a resource issue. That takes time and experience, and overhauling a platform to remove a bottleneck is usually very painful if you have a slightly more complex setup.


We put everything in Kubernetes to scale. Now large uploads and downloads fail more often than before.


Are you saying k8s is to blame for large file upload/download failures? Maybe your infra needs a tweak?


> completely lost momentum

I am sorry, but you are really talking about a 2-day time interval and projecting predictions based on that? Unbelievable...

So what's to stop the competitors from doing the same thing your friend did, investing 2-5 hours and catching up on the third day???

I wish I was good at ascii art, then I would draw a nice facepalm here.


My take: depending on the nature of the business, and how the publicity was done, they may only have had one shot at gaining the customers. In 2-3 days' time you might have fixed things, but by then the prospects have moved on to the site that worked.

I’m not convinced that you need to superscale your infrastructure first. I think it’s normally a waste of time and money. But for the example listed this is a likely benefit.


> So what's to stop the competitors to do the same thing that your friend did to invest 2-5 hours and catch up on the third day???

Pride, a refusal to accept vendor lock-in, misaligned incentives, sunk cost fallacy, etc, etc.


The competitor didn’t know what he was doing, pretty much. He hacked an MVP together with Rails in Heroku, then when people flooded in, he couldn’t scale up and the site kept crashing. By the end of day two, there were articles and publicity about my friend’s site, and then it became a flywheel. He eventually made it work, of course, but the botched launch gave my friend a HUGE advantage (and paying users). I bet he’s paying a ton of money and still trying to scale on Heroku (I made a considerable amount of money as a consultant fixing cases like that too)

Welcome to the web?


> Fast-forward two days later, his service and a competitor were both featured on Product Hunt. He's now making a profit on the service, as he managed to scale it up very fast, while the competitor buckled and completely lost momentum.

That's an incredible story! Could you please link to the two products on PH?


Not knowing much about your friend's service, it sounds like the value of his product is "heavy processing". Therefore, I also would have included scalability as part of a v1 deliverable and wouldn't consider it as an optimization task. Great that your friend identified that.


Because you don’t have to.

The few hours you used on infrastructure can be better used fixing a bug, polishing/adding a feature or even giving yourself a break so you can be better focused the next day.

A Heroku-like platform will literally do the scaling for you. The non-financial cost is that you need to develop your application in line with their framework/platform. If you make this decision at the start, this cost is practically nil.


This is how I see it as well. Start making cookies with the kitchen oven before planning to start a bakery. Heroku lets you do both.


> $6 a month if it can run for $50 with 0 hours invested

At the risk of sounding like a broken record: yes, in such a case it is better, but that is often not the case at hand; this exact argument is used for spending $500/mo after optimizing the software vs $50k/mo autoscaling with '0 hours' invested (in quotes because of course it takes a lot of time to even get that working, but, for many programmers, it is apparently easier work?).

A few weeks ago I commented here while optimizing a Laravel cloud install; now I am working on a Clojure one. The client is spending $28k/mo on AWS, mostly on Dynamo, with the rest on ELB.

Rewriting to standard PostgreSQL, moving the Dynamo part to a Postgres columnstore, and adding proper indexes has the latest stress tests down to a few hundred $/mo that they will spend when they launch this.

The $28k is spent using exactly your argument, and like most projects like this that I do, it cost quite a lot less than $28k (1 month of hosting) to optimize (which had us rewrite a lot of spaghetti from Dynamo to psql).

So yes, in some cases you are right. I would say that when cloud hosting pops over $3k/mo (especially if sudden), I would hire someone like me to have a bit of a check to see if you are not burning money for nothing.


Saving a business $X/month is likely worth more than $X.


It is, but that is not my business model :)


Dynamo is costing 28k? I'm a bit confused. What's the cost from Dynamo to Postgres?


Not only Dynamo; the total bill. And Dynamo was completely abused, while Postgres is used as it should be and tuned for what they are doing.

I am not saying Dynamo could not be used or tuned for this work, but Postgres was a better fix and far more portable, of course.


Ah, I'm just curious about the $$ cost. I have 2 RDS PG instances running on t3.medium; one is 30GB and never goes over 10% CPU, the other is ~150GB and never goes over 50% CPU. I also have some Dynamo stuff for session management and wonder if it would be better to just shove it in Postgres.

So I read your comment as $28k dynamo to $100 postgresql.


If you're bootstrapping a side project that you aren't confident will make a profit and you have more time than money, I think it's perfectly reasonable to prioritize low operating costs. In all other instances, though, I think you are correct.


Make profitable decisions and be aware of your bankroll.


This is the conventional wisdom, but the more I think about it the less I agree. I think we have a duty to minimize resource usage and waste in general. The share of energy used by IT infrastructure is constantly increasing and has a significant impact on the environment.

Let's put your statement in another perspective: it is worth investing a few days/weeks of work from a low-powered machine (a human) to significantly reduce the long-term power usage of a high-powered machine (a computer or cluster thereof).


> a low power machine (a human) to reduce significantly the long-term power usage of a high-powered machine (a computer or cluster thereof).

That seems a sad view on the value of a person.

Shouldn't the whole point of technology and infrastructure be to allow a person to use more resources so as to better[0] allocate their limited time?


A lot of people tend to not consider time invested as a measurable resource, and that is a sad thing indeed, as that is one resource that can only be spent.


You're entirely missing the point. If I'm spending a month to optimize some code so that it is 25% faster, these gains don't go away again once I'm done optimizing. They stay in the product for months or years. And this is where the effort pays off.


...

1 kwh on the power grid produces 0.9884 lbs of greenhouse gas: https://carbonfund.org/calculation-methods/

0.1163 kWh of human energy produces 0.64 lbs of greenhouse gas: https://www.washingtonpost.com/national/health-science/runni...

If we even humor this sentiment (I don't buy that we should; $5 vs $50 of compute is not why we're struggling with climate change. Compute, period, is 10% of all electricity production, and we're not going to reduce that with a few hours of optimization here and there. It's way too late to be taking such half-measures seriously), the math doesn't work out.

-

Thinking about climate change on a personal level is positive, but I also feel our efforts should be grounded in reality, not just things that feel good.

The hours you spend optimizing your bootstrapped service to reduce its CO2 footprint... could be spent on plenty of other activities that actually reduce your carbon footprint.


The level of misinformation, misplaced priorities and uninformed conclusions in climate change has reached staggering levels.

We have people who don't believe it's real (which is silly) and, at the other end of the scale, people who believe we can actually fix it in 50 years (which is just as silly, even 100 years is silly). People are convinced they "know" the "truth" without even bothering to throw a few numbers at a spreadsheet to see if what they think they know aligns with any imaginable version of a non-science-fiction reality.

I am very concerned that politics and ignorance are driving this far more than real science.


What are you trying to get at with these numbers? By themselves, they mean absolutely nothing! It all depends on the impact you're having. If you're optimizing code running on a server that's mostly idle, then you'll not see a big reduction in energy use. If the result of your optimization is that you can shut down 10 servers because of overall load reductions, the time invested suddenly has a very real benefit.


They mean plenty, they just require you apply some critical thinking...

$5 to $50 is not 10 servers on Heroku. In fact, it's not even one server; you'll be sharing resources at that price point.

Let's say you generate approx. 500 lbs of CO2 a year (based on figures for half a desktop PC running 24/7, because you're only getting 2 cores and 1GB of RAM)

At this point you're thinking, 500lbs?! That's insane!

But 40 hrs (1 week) is a lot of time, and 500lbs of CO2 is less than it seems.

If you spent 40hrs spread out over an entire year air drying your clothes a total of 20 times, you'd save over 2 Tons of CO2 a year. (You can do the math for a dishwasher if air drying doesn't work where you are)

If you live in a cold climate, the EPA estimates you can save 15%, or almost 1000 lbs of CO2, by weather proofing your home, easily accomplished in 40hrs

-

You might say "I already do all these things!", but the point is there are so many ways to convert time to CO2 savings.

We spend a lot of CO2 trying to save time you could say.

Optimizing your bootstrapped service is not one of the places I would use CO2 expenditure as reasoning in the slightest.


Uhm... isn't that exactly what I wrote? Besides, the CO2 that we humans exhale doesn't count towards global warming because the carbon in it is part of the natural carbon cycle. That's roughly the same amount that gets reabsorbed into plants that are grown for next year's food. What really matters is the carbon that is added to that cycle from sources outside this biological cycle. In other words, the 1kWh of electrical power is strictly worse than the 1kWh of human power as long as some fraction of it comes from non-renewable sources. At best, if it were sourced purely regeneratively, it would be exactly even.


If you think my comment is saying what yours said, kindly read both again.

This comment reads as if you didn't read my reply: humans could emit 0 CO2 and spending a week optimizing 10 servers would still not make a dent compared to simple lifestyle changes.

I used 1 server because that's the scale the thread was about (saving <$50 of spend on Heroku).

I know you tried to make it about 10 servers to force a point, doesn't end up changing much though...


> Compute period is 10% of all electricity production

You found stats on that? I’m surprised


Actually the figure I found was for "all ICT"; if we take compute to mean data centers (I did), it's even lower:

2% projected to reach 8% in a decade (https://fortune.com/2019/09/18/internet-cloud-server-data-ce...)


The best way to reduce usage is to not build and run what people do not want.


Sure but now you are wasting significant human capital reinventing the wheel with substantially worse results.

So using your math, you’ve got humans, which have high environmental costs in order to... well... live... wasting their life and consuming tons of resources doing something that was done better for cheaper.

Humans cost tons of money to operate. More than data centers or anything else. Don’t waste them on stupid projects like writing shitty versions of AWS.

Your own logic should lead you to conclude all the people wasting expensive human lives reinventing javascript frameworks, deployment systems, cloud orchestration systems, database systems and AWS... these are the folks doing true harm to the environment. It is much better for the planet to lock yourself into AWS, Azure, or google cloud and exploit the shit out of everything they do than it is to piss away incredibly expensive resources building your own.

And I am more than happy to boldly assert if you are working on a project that aims to re-invent AWS for your company... you are a waste of human capital.


As someone who sort of held the view that the incredible inefficiencies in modern computing and 'cloud stuff' was a net negative on the environment, thanks for this comment. I still don't think mindlessly throwing extremely inefficient stuff into a datacenter for no reason is a great thing to do but this gives me a lot better perspective as to why it isn't so black and white of a thing. I forget sometimes that humans have a very high cost to operate too.


The fact is, getting the backend running on the cheap is far more fun than chasing and talking to people.


Depending on who you are: an engineer or a businessman.

A startup is a business; a side project may be pure hobby.


The fact is, that is not a fact, but an opinion.


Very true, I guess I woke up in this world where facts are treated as opinions.


Hours aren't really fungible if you're paid salary though. I can't decide to work another hour to earn another $100.


So skip dinner out this weekend and use that $100 for the project ;)

Most software engineers make plenty to budget for infrastructure at a side project.


As the original creator of the presentation referenced by the blog author (later re-delivered by Joel in the linked post), I am super excited to see this still have an impact on people, but I'd say today in 2019 you'd probably do things very differently (as others call out).

Tech has progressed really far and there are tools like Netlify for hosting that would replace 90% of the non-DB parts of this. Cloud providers have also grown drastically, so again a lot of this would/could look a lot different.

Fwiw, the original deck is from Spring of 2013, delivered at a VC event; it then went on to be the most viewed/shared deck on Slideshare for a bit: https://www.slideshare.net/AmazonWebServices/scaling-on-aws-...

thanks, - munns@AWS


Does it look that much different if you exclude solutions that increase vendor lock-in?


You always pay the vendor. Whether that’s in sweat and tears or in dollars is up to you.

Fwiw, you are almost certainly shooting yourself in the foot by avoiding vendor lockin at stages before 8 revenue figures per year. Your engineering takes longer, is more brittle, and because you’re only using 1 vendor actively, your solution is still vendor locked-in.

Love, ~ Guy who learned his lesson many times


I think it depends on the type of vendor lock-in -- sure, the trade off of having a managed Postgres instance is obvious, but it becomes less obvious to me when you're using things like a proprietary queueing or deployment service.

Writing integration code against a service's API, instead of code that interfaces directly with the underlying technology the service wraps, makes code quite brittle. If/when the vendor deprecates the service, introduces backwards-incompatible changes, or abandons development of the product, you're left on the hook to engineer your way out of that problem. Oftentimes that effort is equal to or greater than the effort of an in-house solution in the first place.

I had the same mentality as you until this happened to the SaaS product I work on for a few different services. Now at very least I try to make sure solutions are cloud agnostic.


My use case is that for the past 5 years I’ve built several API integrations in a vendor agnostic way. We never changed vendors.

Actually, we did once and we found that our abstraction was so tightly coupled to the underlying API that we had to remake it anyway. The core concepts between those APIs were just too different.

And I’ve had at least 2 cases where our attempt at being vendor agnostic made the integration completely fail and never work right. To the point the vendor told us “You’re holding it wrong, please stop”


Fwiw: Slack, Lyft, Airbnb, Snapchat, Stripe are all 100% public cloud based (insofar as I know). So up through 8+ figures they are still doing it too.

Removed Uber as it's not 100% cloud (or at least wasn't in the past)


Uber has always been self hosted. Some workloads are on Cloud and migrating more there. I last worked there 2 years ago.


Thank you for clarifying! I know quite a lot has supposedly shifted. Will update my original comment.


On your own metal it looks like it does in this post.

With managed services it looks a world different.


Heh, this is one of the questions I liked to use for interviews.

"Let's work together and design a system that scales appropriate but isn't overbuilt. Let's start with 10 users".

Then we talk about what we need and go from there. The end result looks a lot like this blog post, for those who are qualified.


/offtopic

Heh, you're being modest. I'm sure you've dealt with far more complex distributed systems than the hypothetical one in the blog post.


Sure, but most of the people I was interviewing hadn't, so it was a good way to test their knowledge. :)

If you can scale to 100K users, you can probably learn the rest to scale to 100M users.


It's probably a bad idea to switch to read-only replicas for reads pre-emptively, vs vertically scaling up the database. Doing so adds a lot of incidental complexity, since you have to avoid reads-after-writes or ensure those reads come from the master.

The reason punting on this is a good idea is that you can get pretty far with vertical scaling, database optimization, and caching. And when push comes to shove, you are going to need to shard the data anyway to scale writes, reduce index depths, etc. A re-architecture of your data layer will need to happen eventually, so it may turn out that you can avoid the intermediate "read from replica" overhaul by just punting until sharding becomes necessary.
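As a concrete illustration of that incidental complexity, here is roughly what replica reads look like in Django (assuming a second database alias named "replica" in settings.DATABASES): the router itself is trivial, but every read-after-write path now has to be pinned back to the primary by hand.

    # Sketch only: settings.DATABASES is assumed to define "default" (primary)
    # and "replica" (read-only follower); register the router via
    # settings.DATABASE_ROUTERS.
    class ReplicaRouter:
        def db_for_read(self, model, **hints):
            return "replica"   # may lag behind the primary

        def db_for_write(self, model, **hints):
            return "default"   # all writes go to the primary

    # Read-after-write: force the primary explicitly, or you may not see the
    # row you just wrote.
    # order = Order.objects.using("default").get(pk=order_id)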


The problem with going to the "top of the vertical" scaling so to speak is that one day, if you're lucky, you'll have enough traffic that you'll reach the limit. And it will be like hitting a wall.

And then you have to rearchitect your data layer under extreme duress as your databases are constantly on fire.

So you really need to find the balance point and start doing it before your databases are on fire all the time.


Assuming you have a relatively stable growth curve, you should have some ability to predict how long your hardware upgrades will last.

With that, you can start planning your rearchitecture if you're running out of upgrades, and start implementing when your servers aren't yet on fire, but are likely to be.

Today's server hardware ecosystem isn't advancing as reliably as it was 8 years ago, but we're still seeing significant capacity upgrades every couple years. If you're CPU bound, the new Zen2 Epyc processors are pretty exciting, I think they also increased the amount of accessible ram, which is also a potential scaling bottleneck.


> Assuming you have a relatively stable growth curve, you should have some ability to predict how long your hardware upgrades will last.

But that's not how the real world works. The databases don't just slowly get bad. They hit a wall, and when they do it is pretty unpredictable. Unless you have your scaling story set ahead of time, you're gonna have a bad day (or week).


If you're lucky, the wall is at 95-100% cpu. Oftentimes, we're not that lucky, and when you approach 60%, everything gets clogged up, I've even worked on systems where it was closer to 30%.

Usually, databases are pretty good at running up to 100%, though. And if you started with small hardware, and have upgraded a few times already, you should have a pretty good idea of where your wall is going to hit. Some systems won't work much better on a two socket system than a one socket system, because the work isn't open to concurrency, but again, we're talking about scaling databases, and database authors spend a lot of time working on scaling, and do a pretty good job. Going vertically up to a two socket system makes a lot of sense on a database; four and eight socket systems could work too, but get a lot more expensive pretty fast.

Sometimes, the wall on a databases is from bad queries or bad tuning; sharding can help with that, because maybe you isolate the bad queries and they don't affect everyone at once, but fixing those queries would help you stay on a single database design.


The minute your RDBMS' hot dataset doesn't fit into memory, it's going to shit itself. I've seen it happen anywhere from 90% CPU down to around 10%. Queries that were instant can start to take 50ms.

It can be an easy fix (buy more memory), but the first time it happens it can be pretty mysterious.


There's no CPU wall. You have assets: CPU, memory, disk bandwidth/latency, and then structural decisions in your schema. The key is knowing your performance characteristics and where you will start queuing. That's hard to figure out. Practically speaking, you're right, and you're 100% right about unreplicated/unsharded data stores eventually hitting a wall and needing to have a strategy how/when to scale. I just noticed your username and feel silly for telling you stuff you already know far better than me, but posting it anyway in case it benefits others.


That's exactly how the real world works. Databases will get slow, then slower. Resources get used. Unpredictable? Not really. Maybe you've run out of space or RAM, or processes are hanging. The database will never just start rendering html or formatting your disk or emailing someone. It is pretty predictable.


The failure I've seen multiple times is that the database is returning data within normal latencies, and then there is a traffic tipping point and the latencies go up 1000x for all requests.


Capacity planning can't be solely linked to the growth curve. This assumes that the number and complexity of your SQL queries never evolve, which isn't true in most cases.

You will implement new features, add new tables and columns, indexes, etc., which will affect your data layer.


I actually implemented domain-driven design WITH an API layer (so core, application, infrastructure + API). The domains are also split into basket, catalog, checkout, shipping and pricing, each with separate DBs.

So just splitting the heaviest part (e.g. catalog) out into "a microservice" would be easy while I add nginx as a load balancer. I have already separated domain vs integration events.

Both now use in-memory events in the application; I would only need a message broker like NATS for the integration events.

It would be an easy wall ;). I have multiple options, like heavier hardware, splitting the db from the application server, or splitting a domain-bound API onto a separate server.

As long as I don't need multimedia streaming, Kubernetes, or Kafka, the future is clear.

Ps. Load balancing based on tenant and cookie would be an easy fix in extreme circumstances.

The thing I'm afraid for the most is hitting the identity server for authentication/token verification. Not sure if it's justified though.

Side note: one application has an insane amount of complex joins and will not scale :)


DID is an extremely important concept that is alien to a lot of developers: Deploy for 1.5X, Implement for 3X, Design for 10X (your numbers may vary slightly)


There are some cases where adding a read replica can be helpful at almost no extra overhead - for instance, if your product has something like a stats dashboard, you'll have some heavy queries that are never going to result in a write-after-read, and it doesn't matter if they are sometimes a few ms or even a few seconds or tens of seconds out of date. Similarly, if you have analysts poking around running exploratory queries, a read replica can be the first step towards an analytics workflow/data warehouse.


For those who have reached vertical database write scaling limits and had to start sharding, I'm curious what kind of load that entails? Looking at RDS instances, the biggest one is db.r5.24xlarge with 48 cores and 768 gb ram. I imagine that can take you quite a long way--perhaps even into millions of users territory for a well-designed crud app that's read-heavy and doesn't do anything too fancy?


> the biggest one is db.r5.24xlarge with 48 cores and 768 gb ram. I imagine that can take you quite a long way--perhaps even into millions of users territory

That will run Stackoverflow's db by itself for reference, along with sensible caching (they're very read-heavy and cache like crazy). Here's their hardware for their SQL server for 2016:

2 Dell R720xd Servers featuring: Dual E5-2697v2 Processors (12 cores @2.7–3.5GHz each), 384 GB of RAM (24x 16 GB DIMMs), 1x Intel P3608 4 TB NVMe PCIe SSD (RAID 0, 2 controllers per card), 24x Intel 710 200 GB SATA SSDs (RAID 10), Dual 10 Gbps network (Intel X540/I350 NDC).

https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...


Very far I would guess. 10 years ago we took a single bare metal database server running mysql with 8 cores and 64gb of memory to 8 million daily users. 15k requests per second of per user dynamic pages at peak load.

We did use memcached where we could.


Yeah 10 years ago you could support millions of users on a high traffic site on a single box. (This was on postgres in my case.) Today, I'd guess at least a 10x increase due to both software optimizations and increased hardware capabilities, if not significantly more.

Truthfully, unless you're working on some kind of non-transactional problem like analytics, even assuming you will need to shard the data or scale out reads ever due to user activity is borderline irrational unless you have extremely robust projections. The database will be the last domino to fall after you've added sufficient caching and software optimization. It's so far down field for most projects (and the incidental complexity cost so high) that my personal bias is that even having the conversation about such things on most projects isn't even worth the opportunity cost vs talking about something else.

Even then, the first thing to fall over will probably be write-heavy, analytics-like tables that are usually append-only, due to index write load. Out of the box, you can often 'solve' this by partitioning the table (instead of sharding). In modern DBs, this is a simple schema change.
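For anyone who hasn't seen it, declarative partitioning in PostgreSQL 10+ looks roughly like the sketch below (the events table and connection details are made up); each monthly partition keeps its own, smaller indexes, and old months can be detached or dropped instead of deleted row by row.

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id         bigserial,
                created_at timestamptz NOT NULL,
                payload    jsonb
            ) PARTITION BY RANGE (created_at);
        """)
        # One child partition per month.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events_2020_02
            PARTITION OF events
            FOR VALUES FROM ('2020-02-01') TO ('2020-03-01');
        """)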


These posts are great and there is always great information in them. But to nitpick, they would be a lot easier to digest at face value if they led with concurrency rather than raw total users, as that's the true gauge of what your server infrastructure looks like.


Yeah, I still don't understand the need to split servers at 10 users. Even if those are 10 concurrent users, it must still mean there is well-beyond-average resource consumption per user.


Probably so when your 10 users grow to 1000, your efforts at 10x are good for 1000x, and you’re working on 100000x.


Conversely, rent or buy 1 bare metal server. That's how we went until we hit around 300k users. Back in 2008.


I think it’s kind of crazy that we have 64 core processors available, but still need so many servers to handle only a hundred thousand users. That’s what, a few thousand requests per second max?

Having many servers gives you redundancy and horizontal scalability, but also comes at a high complexity and maintenance cost. Also, with many machines communicating over the network, latency and reliability can become much harder to manage.

Most smaller companies can probably get away with having a single powerful server with one extra server for failover, and probably two more for the database with failover as well. I think this would result in better performance and reliability as well. I'm curious to know whether the author tried vertical scaling first or went straight to horizontal scaling.


The bottleneck in a single big server setup is the available network bandwidth to serve all those 100k users. If you run a simple site you can probably slap a CDN in front to serve all your static assets so they won't clog your network, but if your app uses more bandwidth per user than a typical website and that can't be offloaded to a CDN, then your single server might not have enough available bandwidth to serve all those 100k users and you'll be forced to scale horizontally even though your server still has plenty of CPU and I/O capacity. You might be able to increase your bandwidth, but your mileage may vary, as dedicated server vendors usually cap their offerings at 1-3gbps per server.


That would have to be 100k concurrent users all streaming data at 100kbps to saturate a 10gbps connection. Which is not hard to find these days. At least I came across several offerings when browsing bare metal server options recently. And they were not that expensive either.

And as a side note, anyone who is using up that kind of data is not going to be able to afford cloud egress prices unless they are making a mint on those users. Saturating a 10gbps connection would cost you around $450 an hour at AWS rates.


European providers seem to be more generous with bandwidth. In other locations, if you want 10gbps per server you probably need to talk to someone first, and there is a nonzero chance that they can't fulfill that if their datacenter is not that big.


You can have multiple NICs on a single server


Can one add additional network interfaces?


This blog post is almost entirely a re-hash of http://highscalability.com/blog/2016/1/11/a-beginners-guide-...

The primary difference is that this post tries to be more generic, whereas the original is specific to AWS.

The original, for what it is worth, is far more detailed than this one.


That’s on purpose:

>> This post was inspired by one of my favorite posts on High Scalability. I wanted to flesh the article out a bit more for the early stages and make it a bit more cloud agnostic. Definitely check it out if you’re interested in these kind of things.


A little bit of overkill in these recommendations.

With 10 users you don't "need" to separate out the database layer. Heck, you don't need to do that with 100 users. A website I ran back in 2007-2010 had tens of thousands of users on a single machine running the app, database, and caching fine.

Users are actually a really poor metric to use for scalability planning. What's more relevant is queries/data transmission per interval, and also the distribution of the types of data transfers.

I'd say replace the "Users" in this post with "queries per second" and then I think it's a better general guide.


Even in that case it's overkill. My site load tested at ~1k requests per second two years ago, when it was entirely on a $5/month DO droplet.


That seems pretty aggressive for just 100k users unless they mean concurrent users (in which case they should say so).

Let's say that maybe 10% of your users are on at any given time and they each may make 1 request a minute. That's under 200 QPS which a single server running a half-decent stack should be able to handle fine.


We now use a DigitalOcean managed database with 0 standby nodes, coupled with another instance running Django. It is working well.

We are, however, actually thinking about switching to a new dedicated server at another provider (Hetzner), where we are looking at having the web server and the DB on the same server; the new server will have hugely improved performance (which is sometimes needed), still at a reduced cost compared to the DigitalOcean setup.

The thing we are doubting is whether having a managed db is worth it. The selling point is that everything is of course managed. But what does this mean in reality? Updating packages is easy, backups as well (to some extent), and we still do not use any standby nodes and doubt we will need any replication. So far we have never had the need to recover anything (about 5 years). Before we got the managed db we had it on the same machine (as we are now looking at going back to) and never had any issues.

Any input?


Do note that with dedicated servers you are subjecting yourself to things such as hardware upgrades and failures which you will have to manage yourself if you want to prevent downtime.

And while Hetzner customer support is generally excellent, in my experience, their handling of DDoS incidents will generally leave your server blackholed and sometimes requires manual intervention to get back online.

This is something you need to account for in terms of redundancy if you are planning to expose your application directly to the net without any CDN/load balancer/DDoS filter in place.

From my experience it makes sense to work with a data centre that is less focused on the mass market and allows for individual client relations to mitigate risks like that. I love Hetzner for what they are and do host some services with them, but I wouldn't build a business around services hosted there.

And this not only goes for Hetzner but pretty much any provider whose business model is based on low margin/high throughput.


Oh, their DDoS protection was one of the reasons we were thinking about moving away from DO.

It is a public facing SaaS API which does not have much traffic in terms of requests, but it would be catastrophic for it to be blackholed. So it's that bad?

Regarding hardware failures: I have never experienced any so far, but I guess it's just a question of when, then.


> It is a public facing SaaS API which does not have much traffic in terms of requests, but would be catastrophic to be blackholed. So its that bad?

Well, depends on whether you have people that don't like you. For them it can be rather easy to stage a DDoS against your server and take that server offline for some time.

> Regarding hardware failures- have never experienced any so far, but guess it's just a question of when then.

It has been happening to me much less often since they switched most of their portfolio over to "Enterprise Grade" disks. These days I tend to go with NVMe anyway so it has become less of an issue.


I thought part of their managed service was that they optimized / tuned your postgres db based on how you were using it. If that is true, then moving off of the managed service means you are tuning postgres yourself now.

Also want to throw in there that it is important to not only compare specs, but to also compare hardware. If DO has newer chips and faster RAM, then you will take a performance hit moving to the new provider even if the machine is beefier.


Pretty certain that DO tunes for broad-usage performance optimization (all the easy, obvious performance wins), not dynamically per client based on each client's usage.

Here's their pitch: easy setup & maintenance, scaling, daily backups, optional standby nodes & automated failover, fast reliable performance including SSDs, can run on the private network at DO and encrypts data at rest & in transit.


huh, when I used them in the past we had someone specifically look at our RDS and data usage and tune it.


They quote "We’ll handle setting up, backing up, and updating". I interpret that as literally the database itself, not the application specific nature of it—how it's used.

For example, I would be surprised if they noticed that your IOPS was high and you needed to upgrade the storage/disk components. (That would be cool if it's the type of thing they offer).


I find this deeply unconvincing as I've scaled multiple apps to the 10k range while on low-tier hardware using the setup Alex suggests is only appropriate for one user.

The old Stack Overflow podcast was also very instructive. They went a very long way on a single server and had the Reddit founders on the show to talk about their scaling during their process of adding a second box. This was on servers of the mid-aughts, running ASP.NET.


So, I just finished reading the article, and the last paragraph quickly mentions logging.

So, in the ‘spirit’ of this article, would that not be one of the first things to implement with the system?

Wait until you have added a caching layer and sharded the DB to begin implementing logging?

I may not be reading this correctly.

I could see the case being made for distributed tracing, but having a logging strategy that can also scale and be flexible seems really important, to me at least.


I am glad to see so many HN'ers here who run very successful projects on bare metal or simple VMs. We should do more to talk about those business verticals, use cases, and how we solve them practically from a tech point of view.


I'm running ~1k DAU on a $6/mo Vultr VPS. Just a Phoenix Web App, no special things done to optimize. If I cache a little more aggressively on the frontend I should be able to handle even 10k. As always, advice in article depends on what you're doing.


We can now achieve pretty high scalability from day 1 with a tiny bit of "engineering cost" up front. Serverless on AWS is pretty cheap and can scale quickly.

App load: |User| <-> |Cloudfront| <-> |S3 hosted React/Vue app|

App operations: |App| <-> |Api Gateway| <-> |Lambda| <-> |Dynamo DB|

Add in Route53 for DNS, ACM to manage certs, Secrets Manager to store secrets, SES for Email and Cognito for users.

All this will not cost a whole lot until you grow. At that point, you can make additional engineering decisions to manage costs.
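As a minimal sketch of the Api Gateway -> Lambda -> Dynamo DB leg above (the table name and payload shape are made up), the handler is just a function plus a boto3 call; there is no server to provision or scale.

    import json
    import boto3

    table = boto3.resource("dynamodb").Table("app-items")   # hypothetical table

    def lambda_handler(event, context):
        body = json.loads(event.get("body") or "{}")   # API Gateway proxy event
        table.put_item(Item={
            "pk": body["userId"],              # partition key
            "sk": "item#" + body["itemId"],    # sort key
            "payload": body,
        })
        return {"statusCode": 200, "body": json.dumps({"ok": True})}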


Great, but this reads like a particularly blunt Amazon ad. Is there a way to achieve "high scalability" without selling my soul to Amazon?


Yes, it does read like that.

In the context of a start-up, cost is a big factor and then perhaps (hopefully) handling growth. You could start small and refactor apps/infrastructure as you grow but I am unsure how one could afford to do that efficiently while also managing a growing startup.

On the selling soul to cloud provider, I don't see it like that. I have a start-up to bootstrap and I want to see it grow before making altruistic decisions that would sustain the business model.

Once you are past the initial growth stage, there are many options for serverless, gateway, caches, proxies that can be orchestrated in K8 on commodity VMs in the datacenter. Though this is where you would need some decent financial backing.

(I am not associated with Amazon, Google or Azure. I do run my start-up on Azure.)


I've gone down a similar route, but I must point out that beyond a certain number of users / scale, serverless becomes cost-prohibitive. For instance, per a back-of-the-napkin calculation, the serverless load I run right now, though very cost-effective for the smaller userbase I've got, would quickly spiral out of control once I cross a threshold (which is at 40k users). At 5M users, I'd be paying an astonishing 100x the cost compared to hosting the services on a VPS. That said, serverless does reduce DevOps to an extent, but introduces different, though fewer, complications.

As patio11 would like to remind us all, we've got a revenue problem, not a cost problem. [0]

[0] https://news.ycombinator.com/item?id=22202301


The big clouds have similar enough products, just the names are changed, so at a high level, GP's list of AWS products can be swapped with eg, Azure's product names. https://www.wintellect.com/the-rosetta-stone-of-cloud-servic...

Sadly, for anything more in-depth than that, you'll need to sign an NDA with AWS to learn anything about the performance limits of their services (eg Redshift), and you won't get that unless you're already a big customer there. Azure's not going to be falling over themselves to let you know where they fall short, either. This is vendor lock-in, and is why there are so many free cloud credits to be had by startups.

This is also a reason I believe SaaS companies will find it is harder than they realized to arbitrage between clouds, and business models based on that may not be able to get that right.


I think if you use something like the Serverless Framework, you can abstract away the cloud layer. I've never used it for anything more than a toy project, though.

https://serverless.com/


High scalability always revolves around the storage layer. There are plenty of options in the free software world; MongoDB, Redis, variants of MySQL/Postgres replication, Cassandra, FoundationDB, and many more.

Using one of those is where you'll spend most of your operational time if you really need that level of scalability. Most people don't, but the options are there if you really need them.

If you are happy with your storage layer, which most people are, the rest scales horizontally pretty easily. And there are plenty of free things you can use to get what a cloud provider gives you.

> App load: |User| <-> |Cloudfront| <-> |S3 hosted React/Vue app|

The CDN is always going to be tough to replicate on your own. In the end, latency is bounded by the speed of light, so you can only bring your files closer to your users. I wouldn't expect you to build one of these yourself; just buy one until you're the size of Google.

> App operations: |App| <-> |Api Gateway| <-> |Lambda| <-> |Dynamo DB|

Your app should be designed to scale horizontally; don't keep any state in your app, delegate it to your storage layer so you can scale a CPU-intensive app up across multiple servers.

There are quite a few API gateways around; Ambassador comes to mind but there are a million. I personally use raw Envoy for everything. I was load-testing my website the other day and pushed 5000qps through it from my cable connection before I decided "it's probably fine". (I started dropping frames on the Twitch stream I was watching, though ;)

There are plenty of "serverless" frameworks that emulate what Lambda does. knative comes to mind. I have not experimented with them in depth, but am intrigued by the idea. (I am more intrigued by turning config files into webassembly-compiled programs, to make existing apps more configurable at runtime. This is like serverless, but less general.)

> Add in Route53 for DNS, ACM to manage certs, Secrets Manager to store secrets, SES for Email and Cognito for users.

CoreDNS scales nicely and has an API. cert-manager is an open source way of obtaining certificates (though it's tightly coupled to Kubernetes); either ACME (letsencrypt) or your own root CA. There are a bunch of free software secret managers; Vault, bitnami-labs/sealed-secrets, etc. I personally use git-crypt ;)

Email deliverability is always going to be an issue. Like the CDNs, you might want to delegate it while you're small. Use anything except Mandrill.


Yes, sell your soul to Google.


I bet that DNC Iowa primaries app was serverless. Problem solved! [dusts off hands].


I hated DynamoDB. What good is there about it other than convenience?


I've found that KV stores like DynamoDB make for a good control-plane configuration repository. For instance, say you need to know if a client, X, is allowed to access a resource, Y. And say you've got clients on the order of millions and resources on the order of hundreds, and you've got very specific queries to execute on such denormalized data and need consistently low latency and high throughput across key combinations.

Another good use-case is to store checkpointing information. Say you've processed some task and would like to check in the result. Either the information fits the 400KB DynamoDB limit or you use DynamoDB as an index to an S3 file.

You could do those things with a managed or self-hosted RDBMS, but DynamoDB takes away the need to manage the hardware, the backups, the scale-ups, and the scale-outs, and reduces ceremony while dealing with locks, schemas, misbehaving clients, and myriad other configuration knobs, while also fitting your query patterns to a tee.

KV stores typically give you consistent performance on reads and writes, if you avoid cascading relationships between two or more keys, and make just the right amount of trade-offs in terms of both cross-cluster data-consistency and cross-table data-consistency.

Besides, in terms of features, one can add a write-through cache in front of a DynamoDB table, restore data point-in-time down to minute granularity, create on-demand tables that scale with load (no need to worry about provisioned capacity anymore), auto-stream updates to Elasticsearch for materialised views or consume the updates in real time, replicate tables world-wide with lax consistency guarantees, and so on... with very little fuss, if any.

Running databases is hard. I pretty much exclusively favour a managed solution over a self-hosted one at this point. And for denormalized data, a managed KV store makes for a viable solution, imo.
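
As a concrete illustration of the checkpointing pattern, here's a minimal sketch (boto3; the table name, bucket name, and overflow-to-S3 split are all assumptions):

    # Checkpoint results in DynamoDB, falling back to S3 for large payloads.
    import json
    import boto3

    ddb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")
    table = ddb.Table("checkpoints")          # hypothetical table

    DDB_ITEM_LIMIT = 400 * 1024               # DynamoDB's per-item size limit

    def save_checkpoint(task_id, result, bucket="my-results-bucket"):
        payload = json.dumps(result)
        if len(payload.encode()) < DDB_ITEM_LIMIT - 1024:  # headroom for keys
            table.put_item(Item={"task_id": task_id, "result": payload})
        else:
            # Too big for one item: park the blob in S3, keep only a pointer.
            key = f"checkpoints/{task_id}.json"
            s3.put_object(Bucket=bucket, Key=key, Body=payload)
            table.put_item(Item={"task_id": task_id, "s3_key": key})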


All good points, but one thing people should look at very closely before choosing DynamoDB as a primary db is the transaction limits. Most apps are going to have some operations that should be atomic and involve more than 25 items. With DynamoDB, your only option currently is to break these up into multiple transactions and hope none of them fail. But as you scale, eventually some will fail, while others in the same request succeed, leaving your data in an inconsistent state.

While this could be ok for some apps, I think for most use cases it's really bad and ends up being more trouble than what you save on ops in the long run, especially considering options like Aurora that, while not as hands-off as Dynamo, are still pretty low-maintenance and don't limit transactions at all.
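
A sketch of the failure mode being described (boto3; the chunking helper is hypothetical, and the 25-item figure is the limit the parent comment refers to):

    # A logical operation touching more than 25 items must be split into
    # several TransactWriteItems calls. Each call is atomic on its own,
    # but the calls are not atomic with each other.
    import boto3

    client = boto3.client("dynamodb")
    TRANSACT_LIMIT = 25

    def write_all_or_hope(items):
        # items: a list of TransactWriteItems entries, possibly more than 25
        chunks = [items[i:i + TRANSACT_LIMIT]
                  for i in range(0, len(items), TRANSACT_LIMIT)]
        for chunk in chunks:
            # If a later chunk raises TransactionCanceledException, the
            # earlier chunks have already committed -- the data is now
            # inconsistent and you need your own compensation logic.
            client.transact_write_items(TransactItems=chunk)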


If you don't mind watching a video: https://www.youtube.com/watch?v=6yqfmXiZTlM


It will cost an arm and a leg when you eventually grow though.


AWS seems like it would be expensive long term.

Between my current issues with AWS and how Amazon looks from the outside, I'm skeptical that AWS is a good solution.


Like most providers, it does depend. Some products are priced very competitively while others seem over-the-top. For smaller companies, the cloud is a cheaper starting point for many systems but even for larger organisations, there are savings to be made by outsourcing your servers. Do you know how much it costs to install and maintain a decent air-con system for your server room?

One of the other major advantages of cloud is that you can save a lot on support staff. Compare the wages of even one decent sysadmin looking after your own hardware with several thousand dollars of AWS and it's still loads cheaper. Hardware upgrades, OS updates, etc. are often automatic or hidden.


"Users" has to be the worst unit you can possibly use; there's just too much variance between projects. If it's 100k users for a timesheet app people use once a week, you've got very different scalability requirements than a 100-user app that people constantly interact with day in, day out. Even then there are big differences depending on the domain: 100k users inserting data into a single table looks very different from the same number entering information across 30 tables. You can't just pretend there are magic numbers that apply to everyone.

Many of those steps reduce scalability if they're applied prematurely: splitting out the API and database layer at 1,000 "users" is going to use more resources serializing things across the network than keeping it in process would. Same for separating out the database: it's great if you need it, but there's a cost if you don't. I worked on one system where we pulled out the API layer after realizing that this was where ~50% of our CPU time was being spent.

It also seems to focus on vertical layering more than horizontal splitting; for a photo-sharing website, I would have thought there's a lot of CPU-intensive photo manipulation or similar that could be split off into side services that don't need to run in real time.


Of course the article doesn't mean to give a solution to every possible scenario. It has to be taken as a general guideline for the steps you might take in scaling an application, but obviously every scenario is unique and needs to be analyzed before taking action.


IMHO, discussion about scale should begin at 1 million users these days; 100k has been old news for more than a decade.


As stated above, the number of users is not a good measure of a system's load; what matters is the concurrency multiplied by the typical cost of each action.

Clearly a million users on Facebook is a much heavier load than a million users registered with an online bank who only log in once a month.
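
Back-of-the-envelope, the translation from "users" to load looks something like this (every input here is an assumption you'd have to supply for your own app):

    # "Users" -> requests/second, very roughly.
    def peak_rps(users, actions_per_user_per_day, requests_per_action=3, peak_factor=10):
        avg_rps = users * actions_per_user_per_day * requests_per_action / 86_400
        return avg_rps * peak_factor   # traffic is never spread evenly over the day

    # 1M once-in-a-while banking users vs 1M heavy social users:
    print(peak_rps(1_000_000, actions_per_user_per_day=0.2))  # ~70 rps at peak
    print(peak_rps(1_000_000, actions_per_user_per_day=50))   # ~17,000 rps at peak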


I have a few questions:

1) I don't really get the "100 Users: Split Out the Clients".

It definitely helps in terms of understanding your customer profiles, whether they prefer the mobile app or the web interface for instance, and it might help from a usability point of view, but how does this help scalability per se if the API layer stays the same?

Splitting the client happens, obviously, at the client level.

2) Also, I don't understand "This is why I like to think of the client as separate from the API."

Who considers the client and the API as the same thing?

You can consider the API as the client of the DB, sure, but why would you mix the user client and API together?

3) Caching: here I lack some knowledge. "We’ll cache the result from the database in Redis under the key user:id with an expiration time of 30 seconds". I assume that reads from Redis do not refresh the cache entry (i.e. reset the expiration timer); otherwise you could potentially never get updated data, right?
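
For reference, the cache-aside pattern I have in mind looks roughly like this sketch (redis-py; the key layout and the DB call are assumptions), where a plain GET does not reset the TTL:

    import json
    import redis

    r = redis.Redis()

    def get_user(user_id):
        key = f"user:{user_id}"
        cached = r.get(key)
        if cached is not None:
            # A plain GET does not touch the expiration, so a stale entry
            # lives at most 30 seconds before it drops out of the cache.
            return json.loads(cached)
        user = load_user_from_db(user_id)      # hypothetical DB call
        r.setex(key, 30, json.dumps(user))     # write with a 30-second TTL
        return user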


Couple of questions for the author as well as the HN crowd:

1. Should I interpret each X number of users as stated in the article to be "simultaneous" or "generally active over the past Y amount of time" or even "total user count ever?" "Unique per day?"

2. What do you think in general about skipping the step "100 Users: Split Out the Clients" if one is reasonably certain to not want or need multiple clients? It would seem as though this could keep the deployments and testing simplified until later in the growth stage, as more code can be deployed/tested as a single bundle. But also, I want to be sure I'm not missing something by just trying to justify my own interests.


Early stage social media/news feedy startup here, wondering if a strategy of just starting off with Firebase and praying will work.

I figure if we did get to a thousand or ten thousand users, we could recruit the DevOps talent to pull this off before the bills kill us!


Start with PostgreSQL as the database and PHP/NodeJS/Java/<popular language> for the web server. It doesn’t sound sexy, but it leaves enough room to grow your app if the need arises, and it’s much easier to find people who know the stack. Firebase is a mistake in my eyes because when the high bills come in, you are locked in. Either the bills or the migration to another stack will kill you.


I think you're right... We blew through the free tier's 50k database reads just on testing, and that's with just 3 users!


Speaking as a software engineer working on the devops side of things, it's best to have someone with experience in scalable architectures from the get-go; otherwise things can get really complicated really fast when you try to transition an existing project to a more scalable, observable hosting platform.


> I figure if we did get to a thousand or ten thousand users, we could recruit the DevOps talent to pull this off before the bills kill us!

How do you plan on recruiting DevOps talent? Do you have funding to pay them?


Nice, but very simplistic (on purpose, it seems) write-up on the topic. For much more comprehensive and excellent coverage of designing and implementing large-scale systems, see https://github.com/donnemartin/system-design-primer. Also, I want to mention that an important - and scaling-relevant - aspect, multi-tenancy, is very often (as in Alex's post) not addressed. Most large-scale software-intensive systems are SaaS, hence the importance and relevance of multi-tenancy.


Good post, two thoughts:

* I'd imagine the website layer is frequently static (html/js) and could just be hosted on s3/cdn. One part of scaling avoided.

> This is when we are going to want to start looking into partitioning and sharding the database.

You have to be at pretty huge scale before you really need to consider this. A giant RDS instance, some read replicas, and occasional fixing of bottlenecks will go a lonnng way. And scaling RDS is a few clicks. By the time you need to start sharding, you can probably afford a dedicated database engineer, or at least I'd hope.
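
If it helps, routing reads to a replica can also be handled purely at the application level before any sharding enters the picture; in Django, for instance, a minimal database router along these lines does it (the "default"/"replica" alias names and module path are assumptions):

    # myapp/routers.py -- register via DATABASE_ROUTERS in settings.py,
    # with DATABASES containing a "default" (primary) and a "replica" entry.
    class PrimaryReplicaRouter:
        def db_for_read(self, model, **hints):
            return "replica"     # send reads to the read replica

        def db_for_write(self, model, **hints):
            return "default"     # all writes go to the primary

        def allow_relation(self, obj1, obj2, **hints):
            return True          # both aliases point at the same data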


Questions for y'all. (Rather, I'm soliciting broad technical and business advice.)

I have built a very fast and efficient CPU-only neural TTS engine in Rust/Torch JIT that is the synthesis of three different models. I've got a bunch of celebrity and cartoon voices I've trained. The selling point is that this runs on cheap, commodity hardware and doesn't require GPUs. I can easily horizontally scale it as a service.

I've currently got it running in a Kubernetes autoscaling group on DigitalOcean, but I'm worried about the bandwidth costs of serving up potentially thousands of hours of generated audio. I haven't thrown any real traffic at it beyond load testing, but I think it can survive heavy traffic. The thing that worries me is the bandwidth bill.

Does anyone have experience with other hosts that are cheap for bandwidth intensive apps? Are there hosts that provide egress bandwidth on the cheap for dynamically generated (non-CDN) content?

Subsequent to this, I would really like to sell or monetize this app so I can fund the R&D / CapEx intensive startup I really want to undertake.

Who might be the market to buy a TTS system like this?

I was thinking Cartoon Network might want "Rick and Morty" TTS, but despite my engineering to scale this and make it sound really good, I doubt they'd pay me much for the product. I suppose $2M would give me runway to hire a few engineers and buy a lot of the equipment I need, but I have no idea who would pay for this.

Glass for optics is surprisingly expensive, and beyond that I have other extremely high R&D costs.

Alternatively, I also have a "real time" (~800ms delay) neural voice conversion system. I thought about running a Kickstarter campaign and selling it to gamers / the discord demographic. It's relatively high fidelity with no spectral distortion, and I have a bunch of hypothetical mechanisms to make it an even better fit.

I've also thought about slapping a cute animation system on top of my TTS service to let people animate characters interacting. (Value add?) An earlier non-neural TTS system I built before the last presidential election cycle had something like this, but more primitive: http://trumped.com (The audio quality of this concatenative system is absolute garbage. The new thing I've built is unrelated.)


If you're worried about bandwidth, you could just let people download an offline app/sdk under a proprietary license.


Odd, that was my Google query yesterday.

I'm curious what kind of hardware can sustain 100k concurrent connections these days.


We ran a speed test of Node vs .NET Core, and even on a small Linux box (4GB, 1 core) we could reach nearly 10K concurrent requests for a basic HTTP response, but the exact nature of the system will affect that massively.

Add large request/response sizes or CPU/RAM bound operations and your servers can very quickly reach their limits with far fewer concurrent requests.

Architecture is a big-picture task, since you have to consider the whole system before implementing part of it; otherwise you end up having to start again.


Thanks, that's already a lower-bound point of reference.


Use Firebase or any other serverless architecture and forget about scaling and devops. Not only will you save development time but also money, because you need fewer developers. Yes, I understand that at some point it will get expensive, but you can still optimize later and move the expensive parts to your own infrastructure if needed.


I'm running a $20 Linode instance and I won't be upgrading it any time soon :)


Eh, this article was disappointing.

It teaches the good business practice of tackling one thing at a time and not over-engineering.

But simply using read replicas, caches, CDNs, etc. does not mean things will scale.

Actually, this will often break your app unless you write concurrency-safe code. Learning to write concurrent code is how you scale; the "adding more boxes" is just the after-effect.
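
A trivial illustration of what "concurrency-safe" means once requests land on more than one box (redis-py here; the counter key is hypothetical):

    import redis

    r = redis.Redis()

    # Racy once two app servers run it concurrently: both read 41, both write 42.
    def increment_views_racy(post_id):
        views = int(r.get(f"views:{post_id}") or 0)
        r.set(f"views:{post_id}", views + 1)

    # Safe: the increment happens atomically inside Redis.
    def increment_views_safe(post_id):
        r.incr(f"views:{post_id}")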

For instance, we have about 10M+ monthly users running on GUN (https://github.com/amark/gun); this is because it makes it easy/fun to work/play with concurrent data structures, so you get high scalability for free without having to re-architect.

But learning something new is never an excuse for not shipping today. Ship stuff today; you can always learn when you need it.


Jesus. If you're having to split your application across multiple machines at just 1,000 users you are poked by the time you have to properly scale.

