Glory is only 11MB/sec away (2023) (thmsmlr.com)
241 points by todsacerdoti 5 months ago | 157 comments



My little old hosting business lived and died on this particular hill, I just didn't realise what was happening at the time.

When we were growing in the early 2000s, our bigger sales usually featured complicated stacks. They had redundant load-balancing, redundant firewalls, far more than many customers ever needed. (But they did ask for it.) The failover often cost more in management complexity than it saved when a box died in the "right" way to trigger the planned failover event.

We sold ourselves on cleverness when people asked for that. It worked: we grew the business to 30 staff and a data centre in our home city, and life was good! We responded to AWS with an API-based cloud hosting platform. But sales still peaked in 2012.

Customers wanted even more complex solutions than the ones we were selling - partially or wholly based on AWS. But - we figured - the hardware we were buying was hugely powerful compared to 10 years previously, and sites weren't that much more complicated. The bigger customers would (surely!) want fewer, less complicated boxes as a result. Unfortunately that is not selling on cleverness, that is selling on price. We never understood the financial ambition needed for that pivot. Nobody trusted a single cheap server, and even if they bought two, where was the scalability? It worked well enough to keep revenue flat, but we obviously couldn't compete on building managed service stacks and software ecosystems quicker than Amazon.

With the new technical challenges long dried up, we sold in 2018.

My thinking (and so most of the company's product design) came from being bootstrapped where the possibility of an uncapped hosting bill seemed like an insane risk to take. Who would take it? (wait - what - why was everyone taking it??!)

AWS is embedded not just because VC money makes their high-priced products feasible, but because their particular brand of cleverness is ingrained in a generation of software developers. It obviously works! But the knowledge of when you might not need their cloud (or what the alternatives could ever be) feels quite a niche thing now.


Interesting story. Thanks for posting.

Unfortunately AWS's complexity and cleverness are catnip for developers, and its support for résumé-driven development is second to none.


There are a couple of things wrong with the numbers.

1. Traffic is not evenly spread. The figures from the article (400M page loads per month) are subject to a recursive 80/20 rule. 80% of the requests (320M) are served within 20% of the time (6 days). Within those high-volume days, 80% of the requests (256M) are served in 20% of the time (~29h). And if you're serving particularly spiky traffic patterns, then 80% of those requests (~205M) come in during just 20% of the time window (5.8h) -- 5.8h is a little shy of 21000 seconds. 205M / 21k is about 9.7k requests per second.

That's still doable on a single system, but it's no longer trivial. Especially if you want to run off a single DB with no read replicas. And while the total amount of traffic served remains the same, the necessary bandwidth cap for peak loads gets far above the optimistically averaged 11MB/sec.
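
For the curious, a quick sanity check of that recursive 80/20 arithmetic in Python; the only assumption beyond the parent's numbers is a 30-day month:

    # Compound the 80/20 split three times over a month of 400M page loads.
    requests, window_s = 400_000_000, 30 * 24 * 3600
    for _ in range(3):
        requests, window_s = requests * 0.8, window_s * 0.2
    print(f"{requests / 1e6:.0f}M requests in {window_s / 3600:.1f}h "
          f"= {requests / window_s:,.0f} req/s")
    # ~205M requests in ~5.8h, i.e. just under 10k requests per second.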

2. Unidirectional end-to-end latency only applies to streaming data. A cold start in the real world (no HTTP/3) first requires establishing the underlying TCP connection (a three-way handshake), then the TLS handshake (at least one more round trip, two for TLS 1.2), and only then do you get to send the actual HTTP request... whose response still has to make its way back.
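
As a rough illustration, here is a cold-start budget for an assumed 80 ms round-trip time; the round-trip counts are approximations and ignore TCP Fast Open, session resumption, and HTTP/3:

    # Approximate handshake costs: ~1 RTT before TCP can carry data,
    # 2 RTT for a full TLS 1.2 handshake (1 for TLS 1.3), 1 RTT for request/response.
    rtt_ms = 80                      # illustrative long-haul round-trip time
    for label, tls_rtts in [("TLS 1.2", 2), ("TLS 1.3", 1)]:
        total = rtt_ms * (1 + tls_rtts + 1)
        print(f"{label}: ~{total} ms before the first response byte arrives")
    # TLS 1.2: ~320 ms, TLS 1.3: ~240 ms -- before any server-side work at all.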

If you want to serve real humans, everything observable has to happen in less than one second[0]. After that there's a steep drop-off as users assume your system is broken and just close the tab.

Disclosure: in a previous life I helped run a betting exchange. The traffic patterns are extremely spiky, latency requirements are demanding, and trading volume is highly concentrated in just a tiny fraction of the overall event window. For any activity involving live trades, we had to get the results on their screen within 100ms from the moment they initiated the action. That means their network round-trip latency ate into our event processing budget.

0: https://www.nngroup.com/articles/response-times-3-important-...


Point 1 is the kicker, I think, for something like a business insider clone. If a big twitter account links to one of your articles, you may get the whole damn lot of average monthly page loads over the next few minutes.

And you absolutely can deal with that sort of load in single large server configurations, but now we're not just building a webserver, we're building a pretty hardcore frontend load balancer (that happens to have an embedded webserver).

They may well charge through the nose for it, but a hell of a lot of engineering has gone into AWS' load balancers and network infrastructure, so that the rest of us don't have to become experts in that whole segment of the stack.


> a hell of a lot of engineering has gone into AWS' load balancers and network infrastructure

Most of it hard earned, after having to deal with thousands of outlier events.

An old friend used to run the Finnish election results services' public-facing backend, and was an early AWS customer. He broke their load-balancers - twice.

Both times he got in touch with AWS support well in advance, asking to pre-scale and pre-warm the load balancers because he was expecting a major traffic surge once the results started to come in. First time around, he was confidently told that he shouldn't worry, since AWS can take on any amount of traffic.

On the election night, AWS load balancers failed to scale in time. The traffic estimates my friend had provided were indeed accurate within an order of magnitude, but AWS hadn't believed his numbers. NOBODY could have a service that legitimately required scaling from a few hundred requests per second to 2M requests per second in about one minute. Apparently their on-call engineers got burned badly that night.

A couple of years later, he approached AWS support again with the same request. That time he was taken more seriously, and was assured that the autoscaling algorithms were now much smarter and the relevant engineering teams had prepared the ground for more rapid autoscaling needs. They were confident they didn't need to pre-scale the fleet.

The scaling still wasn't fast enough, and their on-call team _again_ had to manually force the balancers to stay just ahead of demand as the massive surges came in waves. So much for the fine-tuned scaling algorithms.

Fast-forward to today, and AWS have a specialist "hot launch" service offering where they work with the customers up front to make sure their launches have sufficient capacity available and their perimeter load balancers properly pre-warmed to absorb these kinds of supposedly one-off situations.


The majority of websites that see viral spikes like Business Insider are content sites. Caching HTML (and everything else) on a CDN solves that. AFAIK news/political media have to do that anyway because of DDoS. That's why they can run WordPress (even though that's a terribly slow CMS/framework). You end up with an essentially static website.


Even content sites mostly have user-generated comment feeds (which require a DB connection), and ads (which require processing metrics to justify ad spend).

I don't know if the business folks would rather sacrifice those in a load spike, or just load-shed at a request level - for a business defined by engagement metrics and ad revenue, it's not clear the former is better than the latter.


Most of these are handled with outside systems nowadays anyway? It's embedded/loaded using a separate API. Separate what can be static from what has to be dynamic. I don't think anyone serious would use the comment system from the CMS directly. If your comments section dies, that's a lot better than the whole site going down.


The point of the article is to stick to a single system. Comments can be cached too, though.


The article talks about a single server, not a single system. You will not crash your nginx + static website if your Node.js comments API crashes.


> 2. Unidirectional end-to-end latency only applies to streaming data.

Agreed that his "across the world" example is a bit silly, because he doesn't take connection setup into account.

His primary point is still reasonable. How many services need worldwide reach? Did you build it for multiple languages also?

If you're in the US, or you're in the EU, a nice centralized server will have <=30 ms of latency to the entire region you are serving.

Edge is overvalued unless you have true global needs, and then you also have to manage global database(s).


> 500 Internal Server Error

Looks like the author is getting more than 11MB/s traffic. Here's an archived version: https://archive.is/UVpg0


> An error occurred during a connection to thmsmlr.com. PR_END_OF_FILE_ERROR

I've never seen that error before. Various websites suggest it's caused by using a proxy, a VPN, or DNS-over-HTTPS. None of these apply to me.


I got the error as well on the first load. Reload fixed it.


I feel like this is the wrong way to look at it.

A better way IMO is: don't scale prematurely.

Build things as you need them. In the vast majority of cases, even CDNs are an unnecessary cost (presuming you're not paying the exorbitant cloud provider bandwidth tax). If you start to see performance issues, then deal with it as needed.

And if your workhorse suddenly grows that coveted single horn?

That's a problem you want to have!


Seconded. Start with a single box. You're not Facebook. The success of your startup won't depend on whether you have an unplanned downtime of 15 minutes once a month during your first year. If your product is any good people will retry after an hour and not switch to your competitor. If you hit scalability issues go for a hybrid model where you make the resource intensive part scalable.

But hey, if you think you need to start out with webscale, have two dozen microservices and an exploding number of failure states between them, and blow through your VC money with an AWS bill before the first customer signs up, more power to you.


Or maybe 2004 Facebook which ran on a single box.

>Mark hosted Facebook (it was theFacebook.com back then) from his own computer in his Harvard dorm room. You can buy a domain name and point it to any static IP address. That's how we did things back then because we have fewer options for hosting services that didn't cost an absurd amount for a college student.

>Eventually, Facebook was hosted in a shared datacenter in California... (from Quora)


The “problem” I have is Digital Ocean gives me a bunch of conveniences like CDN, CI, HA, Git deploys for $5 a month. That's where I start. Running my own metal is a later problem, not a now one. By the time a $1M AWS bill vs. a big fat box is the choice, I'm not reading blog posts to decide. I am running tests.


$40 a month on a dedicated box will cover you for your first million users.

No, you aren't running tests, because you will be locked into AWS and your cloud computing bills will eventually bankrupt you.


$40 a month and $400 of my time a month, that is the issue. At some point the lines meet and it is worth the time to save a $10,000 DO bill for high traffic, I am sure. But that is a later problem.


It's a later problem you can't solve! You cheaped out in the short term and will face ludicrous long term costs.

And not just you, it's a core part of cloud computing's monetization model.

Additionally, if anything goes wrong with your Digital Ocean account, Digital Ocean will not help you. So you are running a high risk.


DO runs your app pretty much as is with few modifications. Think npx next init; git commit; git push; so it ain’t hard to migrate. You end up doing the same stuff you would have done on day 1.


So in essence, you want to use cloud providers UNTIL you need to scale? Isn’t that the opposite way round? Doesn’t it make it sound ridiculous, seeing as though the cloud providers were meant to help you scale?


Hard to answer in absolutes. Depends on use case.


That does not in any way negate my point. Where you start depends on cost/benefit. Personally, I'm liable to start in AWS, actually. That doesn't mean I'll use every high-dollar service they offer, though.


At second glance, yes, it probably aligns with your point, or is at least anecdotal support for it.


With AWS you also buy a scapegoat because it's much easier to explain to superiors or investors when a large cloud service has downtime than it is to explain the same cumulative downtime caused by human error in your team.


Human error is the number 1 cause of outages, before and after the cloud zeitgeist.

How does that jibe with your interpretation of the cloud as a scapegoat?


But why SQLite, if you're going for vertical scaling? Nobody's stopping you from self hosting Postgres (or Supabase!) on the same server as your app, and I can't think of any disadvantage other than more effort to set up.

Now if you're willing to go DB-less, keep your whole global state in literal memory on one big server (no round-tripping to Redis or whatever, actual in-process objects), just occasionally snapshotting that memory to disk (this part's tricky), and use a compiled, multi-threaded language -- then you can saturate a Gbit or bigger NIC and literally serve the world from one box. I kind of wish I had a real use case for that architecture.


Why not SQLite? :) Of course the answer is always "it depends," but lately, I've seen the general "SQLite isn't a real database" ethos challenged more and more. Outside of standard relational persistence patterns, there can be significant feature differences that could very well mean Postgres is the better option. However, for some architectural patterns, SQLite could come out ahead! For a content-heavy application like BusinessInsider, the Baked Data pattern with SQLite might very well offer better cost and latency performance!

simonw (datasette) has built troves and troves of tools and writings about SQLite's production use for content-heavy and/or data rich websites: https://simonwillison.net/2021/Jul/28/baked-data/
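
For reference, a minimal sketch of the shape of that pattern (this is not Datasette itself; the `articles` table and file names are made up): bake a content database at build time, ship it alongside the app, and open it read-only at request time.

    import sqlite3

    def bake(path="content.db"):
        # Build step: write the content once, at deploy time.
        con = sqlite3.connect(path)
        con.execute("CREATE TABLE IF NOT EXISTS articles (slug TEXT PRIMARY KEY, title TEXT, body TEXT)")
        con.execute("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
                    ("hello", "Hello", "Baked at build time, shipped with the app."))
        con.commit()
        con.close()

    def render(slug, path="content.db"):
        # Serve step: the live process only ever reads, so losing the box loses no writes.
        con = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        row = con.execute("SELECT title, body FROM articles WHERE slug = ?", (slug,)).fetchone()
        con.close()
        return row

    bake()
    print(render("hello"))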


> But why SQLite, if you're going for vertical scaling?

From the benchmarks I’ve seen, because it’s significantly faster, specifically round trip times. This would make sense since SQLite is in-process and doesn’t need serialization[1]. Which in turn offers a second, optional advantage within reach - serial processing of operations, which is significantly easier to test, reason about, and build supporting cache layers around.

[1]: If you don’t use a Unix socket you also have networking overhead - but you said same server so I’ll leave this as a side-note since it’s extremely common to put postgres on a different machine for isolation. In fact, it’s one of the main advantages with networked dbs.


If I'm running on a single machine then SQLite comes out ahead of Postgres/MySQL. SQLite has everything I need, plus its simplicity and speed are superior. SQLite can do terabytes of data, can handle multiple readers, has live streaming backup, and is a pretty well-rounded SQL implementation in general.

I would only consider postgres/mysql if I outgrew vertically scaling a single box.


The author probably means that we can use SQLite on the edge.

https://blog.cloudflare.com/introducing-d1


No. The author is clearly and explicitly saying the edge does not make sense for most use cases.

The author means sqlite makes sense if you're running one machine, which is what makes most sense for most use cases.


I'm thinking that a lot of commenters are—understandably—feeling the need to defend the status quo in the name of availability and reliability since the author zeroed-in on latency, bandwidth, and spend.

My takeaway is not a debate against the merits of cloud in the face of trade offs, but against the necessity of—the now ubiquitous—cloud architecture pattern (and related lock-in). The "this-vs-that" is a rhetorical device to introduce an alternative. Of course which solutions are right for different use cases will depend on so many different things, and it's those things that keep engineers employed! :)

That said, we can put our engineering hats on to solve for SRE concerns within the pattern proposed by the author; I'm thinking the "what about availability when your one server goes down?!" is a straw-man in the sense that we have different ways of solving the availability story than our Ubiquitous System, and of course the solution must depend on what's actually relevant.


How is it a strawman? It's what anyone sane operating on premises or in the cloud would ask about this in a design review.


I agree with you. I mean to say that after reading the author's ideas my takeaway was not that they were suggesting that I irresponsibly construct a single-point-of-failure system for a situation that demands a strong availability story. My takeaway was to mull over and challenge what the availability requirements are before building the ubiquitous cloud system that I'm familiar with. What tradeoffs are made and what risks are budgeted are not ubiquitous, therefore, our architecture should not be.

I, for one, appreciated the thought experiment and I took it for that: a theoretical alternative that would otherwise seem untenable being held up against a known pattern. I take it as theoretical because the author didn't actually build the system they're proposing for BusinessInsider, and without putting it into practice, I can only see it as that.

In practice, articulating and defining the cost of making choices is still the responsibility of the engineer and I don't believe the author was ignoring availability to suggest that we should too. In fact, I'm left with: "I find this proposal interesting, of course there are availability concerns, how might I solve those concerns from here?" The difference between that thought and the "straw-man" I pointed out is that the straw-man argument reads: "throw this all out because that system is only going to fail."


His experiment would have looked a lot better without his server going offline from too much traffic, though.


It's a straw man because both the cloud and dedicated suffer the same problem.


Here's something you can do, as a small extra step, that offers a lot more on a frugal budget: put your API and SQLite db together. Ideally, the API uses a low-overhead binary serialization format and persistent conns. Then you can use edge stuff from whatever VC-funded firm is currently burning money and enjoy the free tier (currently, though, Cloudflare Workers is really generous - and free egress).

The trick is that SQLite can push a massive number of qps single-threaded. So you can pump large amounts of ops through in serial, which is trivial to reason about and allows for in-memory caching on the API side with trivial invalidation. And by still keeping your web serving separate, you can still enjoy edge performance for handshakes, and skip the db altogether for static pages. (The article underestimates the RTT issue - it's very real - real-world apps need more round trips than you think.)

Most of the CPU outside of the db goes to parsing, deserializing, copying data, and TLS (both the number of conns and encrypting data). By offloading the big chunks of these you can easily get tens of thousands of writes/s on an entry-level machine. Reads are faster still.

That said, I think it’s always worth benchmarking the typical bottlenecks especially io. Providers are lying and misleading a lot, so just run your own on their free tier. Just make sure to have some integration test/bench ready in case you need to switch up.
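
A minimal sketch of that shape, not the commenter's actual setup: all writes funnel through one thread next to a local SQLite file, and reads hit an in-process cache with trivial invalidation (the table and key names are made up).

    import queue, sqlite3, threading

    DB = "app.db"
    init = sqlite3.connect(DB)
    init.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    init.commit()
    init.close()

    writes = queue.Queue()
    cache, cache_lock = {}, threading.Lock()

    def writer_loop():
        con = sqlite3.connect(DB)                    # the only connection that ever writes
        while True:
            k, v = writes.get()                      # every write funnels through here, in order
            con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))
            con.commit()
            with cache_lock:
                cache.pop(k, None)                   # trivial invalidation: drop the stale entry

    reader = sqlite3.connect(DB)                     # reads stay in-process, no network hop

    def read(k):
        with cache_lock:
            if k in cache:
                return cache[k]
        row = reader.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
        value = row[0] if row else None
        with cache_lock:
            cache[k] = value
        return value

    threading.Thread(target=writer_loop, daemon=True).start()
    writes.put(("greeting", "hello"))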


I've been experimenting with concepts adjacent to this. PostgreSQL can output its query results as JSON, so I can make its output be exactly what the API should spit out to the client without needing to parse it first, which is pretty great if the result is huge. I just hope that PostgreSQL doesn't run into its own bottleneck JSONifying large datasets, though.
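
For anyone curious, a minimal sketch of that idea using json_agg and a cast to text, so the driver hands back a ready-to-serve JSON string; the `articles` table, its columns, the connection string, and the choice of psycopg2 as the driver are all assumptions:

    import psycopg2

    conn = psycopg2.connect("dbname=app")            # connection details are placeholders
    with conn.cursor() as cur:
        cur.execute("""
            SELECT coalesce(json_agg(t), '[]'::json)::text
            FROM (SELECT id, title, published_at
                  FROM articles ORDER BY id DESC LIMIT 20) t
        """)
        payload = cur.fetchone()[0]                  # already a JSON string, serve it as-is

    print(payload[:80])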


    You need to be on the edge, they say.
    Be close to your users. Minimize latency.
How much of an issue is latency in practice?

Here is an example. My book recommendation project Gnooks, which I run on a server in Germany:

https://www.gnooks.com

Does it feel too slow for anybody?

Over the last few years, I have gotten many thousands of suggestions from users on this project. Yet, as far as I can remember, nobody ever touched the topic of latency, even though the largest group of users is from the USA.


Your site keeps most of the logic on the backend, and only makes a single roundtrip to the server per user interaction - that makes you significantly less subject to latency problems than the trendy PWAs that implement most of their logic in the frontend (and then have to issue multiple queries to the backend to load the necessary data)


As an Aussie, it's noticeable that it's not a fast local site (it could either be a slow local site or a fast overseas one), but that's to be expected (ping time is ~300ms FYI). IMHO your site is fine! It's only a problem if your site is already slow.


Interesting. Thanks for the info.

Would like to hear how it feels from Australia compared to other websites. Like Hacker News, Goodreads and Wikipedia for example.


This HN page feels faster (I did a clean refresh before doing the ping, to avoid bias): ping is ~185ms (so the speed of light is likely the difference). Goodreads is similar to HN, ping ~236ms. Google(.com) feels fast, but not as fast as it could be (10 ms ping; not sure if being logged in makes a difference), and ABC, our local public broadcaster's site abc.net.au, does feel a bit faster with the same ping time, and is probably the fastest site you could reasonably have here. The Wikipedia home page is the same as Goodreads (and the same ping time as well). So I'd say you're not doing badly at all (the IPs that return pings are US-based apart from Google and ABC). This is all on a desktop on a decent (for AU residential) fully-wired connection though, so I suspect introducing some packet loss (e.g. on trains going through tunnels) would give more interesting results.

Fun aside related to latency: I've sshed from France to Australia; modal editors are so much easier over those latencies, as you can effectively queue up commands, and the round-trip time leaves you with enough time to plan your next keystrokes.


Great. Thanks for the infos!


Try using a VPN from Australia to browse the internet.

From Europe that would give you 2x the latency an Australian user would see which should help illustrate their worst experience.


From Hanoi, it’s better than I expected. Still 1s+ for initial DNS+TLS until page load. Subsequent are expectedly better, <1s at least.


Hello Hanoi! That's probably as far as it gets from Germany :)

Yes, DNS is a different issue. To bring that down, I would have to use a distributed DNS I guess.


Japan. First TLS+DNS felt like 3 seconds. Further loads are almost instant. I think it's totally fine.


Like a lot of old questions, it depends.

The cool part is, you can test this reliably, and see if it matters. That'll get you a concrete answer.


The article does not even mention availability. A service running on a single box will have downtime, both planned and unplanned. And you should think about RPO/RTO too: when (not if!) the box blows up, how long does it take to recover, and how much data will you lose?

> No matter how you design your site, SPA, SSR, some hybrid in between you can’t get around that if there is at least one database query involved in rendering your page, you have to go back to your database in us-east-1.

https://aws.amazon.com/rds/aurora/global-database/

https://aws.amazon.com/dynamodb/global-tables/

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Conce...

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/...

And so on. Have fun kludging together the equivalents yourself.


Availability is overblown these days IMO.

Ten or twenty years ago I would have agreed with you: the internet was new, and when the site went down, people blamed the site, and things got nasty fast. Nowadays, though?

People are much more likely to blame their internet provider first, or just plain try again later. They're used to the unreliable nature of the internet now.

Unless you're Google, but how many businesses are Google?

As someone else on this thread said, simpler systems are less likely to fail anyway. Keep backups, replicate your DB to a cold DR site if you're truly worried. That will handle the majority of non-FAANG businesses in the majority of non-FAANG cases.

And if that failure occurs? So long as you don't lose data, it will be a short-lived blip for most businesses. Even being down for a few days can be weathered if you have good PR, provided that it's a once in a blue moon sort of event. For that matter, data losses can be weathered in many cases.

What was that saying? Each nine of reliability doubles the cost or some such?

That needs to be accounted for when determining the ROI. How reliable does your business actually need to be?

That depends on your target market. The default answer from IT, however, is an automatic "100%!". This is wrong, IMO.


> What was that saying? Each nine of reliability doubles the cost or some such?

I don't know, but if there's a domain where the five nines are a pipe dream it's the cloud: we've seen a shitload of major outages where this or that cloud provider had services down for hours. This blows the five nines for years, if not decades (roughly five minutes of max downtime per year: good luck with your five nines when you've got 300 minutes of cloud downtime).

The cloud is so unreliable people are now simply used to the Web not working: "Oh shit, it's down again, I'll try again later today or tomorrow".

That's how bad the "x nines" expectation became because of the cloud.


My experience is: Downtime does matter.

Customers ignore downtime when you're a service with big lock-in or no real alternative (GitHub, Facebook, etc). Because then they can't do anything about it.

When you're a small/medium service, and you're down for 3 hours during a time where 4 employees of a company need you for productive work, they ask what's up, and if they must expect that this will happen again. Generally, shorter, one-off downtimes are forgiven, but longer or recurring ones are not.

We run on Hetzner dedicated, and I think the best way to run most sensible software is to run on 3 or 5 beefy machines. If you only have 1 machine, the following will take you down, all of which have happened in my production systems (except the last, which I only read about in the news for 3 hosters):

* You need to reboot for kernel security updates. A reboot causes 3 minutes downtime because that's how long a reboot takes on server hardware, including Hetzner's (multiple POST style screens, netboot timeouts and so on, most of which make sense). Kernel updates come in approximately every 3 weeks. You may skip some of them that are not security-relevant for you, but this requires careful analysis, which is more effort than just installing.

* The mainboard dies. It needs to be replaced. Immediately causes 3 hours of downtime.

* Some disk or RAM dies in a way that prevents the mainboard from booting. This should not happen, but it does. Also 3 hours of downtime.

* One of these happens at 1 am on Saturday. You are in Europe or Asia. If you are not set up to be paged out of bed, your US customer will see 8 hours of downtime.

* The top-of-rack or datacenter building core router has some trouble, and it takes the hoster 3 hours to fix it. 3 hours of downtime.

* There's a fire or sprinkler flooding in your data center. It takes the hoster 2 weeks to clean up the physical mess. 2 weeks of downtime, unless you manage to re-build the infra somewhere else and restore from backups, then maybe 48 hours of downtime.

Note most of this also applies if you run a single VM in AWS. If you're lucky, the restore times will be faster by being a VM. But that is not guaranteed; I've had instances not do their work for some hours and afterwards got the email "We noticed a problem with the physical host of your VM, we've automatically migrated it somewhere else now".

Most of these problems go away if you have 3 servers (of the type the original post describes) instead of one, which is only slightly more effort.


I largely agree with this. I didn't take "one server" literally; more as hyperbole.

Concern about availability is overblown, not irrelevant!

There are shades of gray, though. Most sites don't need to engineer their system to hide catastrophic database failures from the user. Not being able to tolerate a single web server failure, though... that's just careless, and will end up giving you a bad reputation.

You probably don't actually need five nines -- but three probably won't go amiss...


> So long as you don't lose data

The article advocates storing data in local SQLite on a single server. That means that when the server goes poof you will lose data. How much data depends on how frequently you run backups. But it will be a non-zero amount.

And backups bring up the other point. It will never be just one server like the author proclaims. You need another server for the backups. Now you have two servers, and a dependency between them. The dreaded complexity is creeping in again, almost as if it were inevitable.


The article suggests using Litestream to continuously back up the data to remote storage.

Sure, you'll still lose some data, but that's always the case whatever solution you choose.


A quorum across isolated facilities doesn't lose data unless a disaster hits all of them at once. Unfortunately, "disaster" includes a bad commit that is deployed too quickly.


SQLite has exactly the same backup solutions as other SQL dbs, and the author specifically recommends using Litestream for backups.


Wat? If the site is down I probably won't be back, and if I am paying for a service I am angry and might unsubscribe. Let alone the lost revenue for the company from new sign-ups or readers.


Github was down for the nth time a few days ago. Yet I'm back. Who comes up with these stupid claims? I'm not going to ditch <messenger everyone in my friends circle is using> if they're down for a few minutes once a month. The supermarket at the corner had problems with their payment system for three days(!) last year and only took cash. I'm still a regular there. My favorite sushi place was closed for two months after a fire; I'm still going there.

If people stop using your site after it's been down once it can't be that good.


There are counterexamples, yes. The near-monopoly on open source hosting, for one. If GitHub goes down every day for an hour, companies will move to GitLab. Obviously in business there are power dynamics. I queued for an hour to get an ice cream once. That seller was in a kids' water park. Yet McDonald's did a hockey stick because they solved fast food. There are always exceptions where a business has you. Nice to own such a business!


> If Github goes down every day for an hour

Now that's what I call moving the goalposts :) suddenly we're at an hour a day.

Github wasn't the first OSS hosting platform, yet they managed to steal the show, and let me tell you a secret, it wasn't because they had less downtime than others. But if you want to keep telling yourself five nines is of utmost importance to your startup, more power to you.


Yes, the goalposts move if you run a near monopoly with huge mindshare. Let's say you want an electric toothbrush. Amazon is down, you order from Kmart or something. If GitHub is down you don't simply migrate your entire CI/CD/code/issues/actions to something else. You are locked in! So they can afford crappier goalposts, but only so crappy.


You again moved the goalposts. First you suddenly argued I would move away from Github if they were down for an hour every day, when before, the question was whether a short downtime once a month would be acceptable or drive away customers. And now you suddenly say for Github those moved goalposts (ie one hour downtime per day) are acceptable because I'm already locked in, so contrary to your previous comment, I would not move to gitlab. But it seems you already decided you want to die on that hill, so you'll keep changing your story willy nilly.


I think you are arguing against a strawman of the worst interpretation of what I said. Which means we’d have to have a cross-examination type debate that I ain’t got the energy for.

You are assuming that there is a single thing called downtime (not a scale of slightly bad to outrageous) and a single set of considerations as to whether to move.


Why are you on HN? It had quite a lot of downtimes/service degradations over the last year, and according to your claim you don't use services that aren't 100% available.


You snookered me there! Nice. But I will say that I have only seen it degraded, i.e. slow, not completely down.

Also, it may matter less to HN, which has a cult following. For a FAANGy type of company, of course it matters.


So you sold all your apple products last week? They had a full downtime after all.


Well maybe on your splunk dashboard but I have not experienced downtime of any Apple services.


It's almost like short downtimes don't matter, isn't it? And depending on the service, sometimes even multi-hour downtimes that make almost all of your services (store, music, arcade, fit,...) unusable like in apples case. Most of the users won't even notice, and if they do... They still come back.


Yes, if the site is down you lose readers/buyers (in the end, money).

But keeping the site up also costs money. So context is really important here. How much money do you spend on the last % of availability vs. how much money do you lose on downtime? In my experience, people have a more complicated setup than needed because of availability, but at the same time have not tested a real DR scenario.


Your sentence ANDs together an overlap of a bunch of improbable scenarios:

- if the site is down for more than a few seconds

- if you attempt to access the site during its downtime

- if you pay an ongoing subscription (this could very well be a non-profit site, or one that doesn't primarily drive its revenue from subscriptions)

- if you feel like unsubscribing

How much does this *really* affect a non-moonshot business?


The article talks about a high traffic site, maybe a top 10 one or something like that. It is already a moon business, not a moonshot. For a small blog or local business, sure, a bit of downtime is OK, but why are they even on bare metal when Squarespace etc. exist?


> The article talks about a high traffic site, maybe a top 10 one

The article explicitly restricts the problem space to top 1000 websites.

I've taken 30 seconds to browse through top 1000 websites and outside of the usual suspects up top, most of them look like https://www.pro-football-reference.com or https://www.tempo.co which to me are not far off blogs and local businesses.


Well for a top 10 site, it's $1000 a month dedicated or $1,000,000 a month cloud.


Then I have to ask if you're a customer I want to have. Customers with unreasonable demands (especially demands for absolute perfection) are a cost, not a revenue stream IMO.


Yeah agree - if a website doesn’t respond I immediately open google in another tab; if google doesn’t respond I blame my internet and swap to my phone.

We’re all on redundant internet connections now.


Running on bare metal nets you serious latency reductions and I/O bandwidth increases compared to the products you linked to, and you have significantly less complexity in a single-server or dual-server model. That is why operations like Let's Encrypt choose this dual physical server architecture to power their global services at web scale: https://letsencrypt.org/2021/01/21/next-gen-database-servers...

Very low internal latency from an on-server database or a directly adjacent database allows your software to run database queries orders of magnitude faster and spend fewer resources per client served, resulting in a speedier, snappier experience for your end users compared to what any of these managed database services can provide in practice.


Right, and to do what any of the services above do, you're doing this in an HA setup on a per-region basis and writing your own cross-region consistency code.

Users are not noticing the difference between 5ms lookups and 20ms lookups. They are noticing outages, data loss, and inconsistency.


Users are noticing the difference between 5 non-parallel database queries that each take tens of milliseconds against a distributed database, compared to querying an on-server or directly adjacent database in under 0.2ms.

I am not proposing writing any additional cross-region consistency code; many CRUD applications have added zero cross-region consistency code and just left this entirely to Postgres to handle, since bi-directional replicas are now a core feature.

The more complicated you make your stack, the more likely you will end up with outages, data loss and inconsistency. Using plain-Jane, battle-hardened, well-developed tools that can serve the internet at scale from a single modern server gives you a very simple architecture, where most of that architecture is getting significant improvements every quarter as the libre software community improves Linux, Nginx, Postgres, etc., versus whatever random person Amazon is letting work on RDS or DynamoDB for the next year until they PIP them (as they almost always do).


> The article does not even mention availability. A service running on a single box will have downtime.

So will your service running on AWS or equivalent. Complexity brings its own pitfalls and I don't think there's a web service in existence that never screwed this up.

That is if it's not AWS itself having a screw-up.


You will have major availability problems on AWS.

VPS is not worse off here.


A lot of good arguments, which I think are often overlooked. "We need to go into the cloud" often feels more like a peer pressure thing. It might be difficult to get a Hetzner server through the ordering processes of your company, compared to self-service renting something from AWS or the like (it feels like this may be an equal reason why they're so successful?).

But the article completely misses out on the aspect of availability, which would be quite risky with a single server.

Why not mix & match smartly? A CDN is already "the cloud". Have your static content hosted (and fully pre-generated) in the cloud, and only fall back to dynamically generated / rendered responses if you really have to. Then you have good latency, high availability and still low costs.


I think it's more an architecture question. If one box works, stop building on services that force horizontal thinking, and pricing, from the get go.

You can solve one box availability with box 2 (hot backup) - all within the same architecture and price structure.


While I somewhat agree with the premise, "Plop one in Virginia and you can get the English speaking world in under 100ms of latency." is just false (unless you only count the US+UK as "English speaking", ignoring the rest of the Commonwealth...).


Well, there's Canada, Ireland, and the English-speaking islands of the Caribbean, also.

The equatorial African nations aren't that much farther away than Great Britain.

Of course that does still leave out Southern Africa, India and Australia.


There's great-circle distance and then there's internet-cable distance though (I'm not sure what the path to equatorial Africa is, for example, but I can imagine it being bad and not direct at all). But yes, I was primarily thinking of Asia/Oceania (you forgot New Zealand! https://www.theguardian.com/world/2018/may/02/jacinda-ardern...), given that people do forget how big the world is, and how noticeable the latency is.


> I'm not sure what the path to equatorial Africa is for example, but I can imagine it being bad and not direct at all

I realized that I didn't actually know, either.

https://mybroadband.co.za/news/wp-content/uploads/2021/08/Te...

I don't know how credible this map is, but it appears to match all the others I was able to find in a quick image search. I picked this one because it's larger and shows more detail than some of the others.

It looks like equatorial Africa is reasonably well connected to the Americas and Europe.

I'm sure connectivity is crap once you get out of the large coastal cities, but on the other hand it's pretty crap in Alaska and the Canadian interior, too (at least before Starlink came along).


It's implied from context that he means "US-English" :)


I like the power that we see with cheap commodity hardware, but we mustn't forget that it's crap. He talks about Litestream as a continuous backup but doesn't get into how long it takes to bring up a hot spare and whether he's automated that. I'm glad to see Litestream has synchronous replication on their roadmap, but in the meantime writes can be irretrievably lost and I don't know whether transaction boundaries are preserved.


> I like the power that we see with cheap commodity hardware, but we mustn't forget that it's crap.

Google grew up (early days) using cheap white box consumer PCs while "best practice" was expensive server boxes.

It's a tried and true method of budget hosting.

The mini PC world is exploding and they make for a solid low cost, low power server platform.

It also makes having hot and cold spares cheap and easy.


This is all good advice... for your hobby project. But the moment that server goes down and you have stakeholders draining your phone's battery, you'll realize that scaling also brings availability as a very nice bonus. Sure, 99% of the time it looks like you're wasting resources. Until you need it and you do the math and realize you're way ahead in revenue not lost.


My experience over the decades is an insider's dirty secret: downtime doesn't really affect revenue.


I haven’t ever worked for a business where this was true, including AWS. I’m sure they exist, but I think your experience is an outlier


AWS has plenty of failed data retrievals, outages not mentioned or documented, and other shenanigans of the sort. It's a turn-and-burn operation though, where skilled folk only spend a few quarters or years before moving on to other (usually more stable) tech companies.

You can hear all about this if you turn up to Harry's at 7pm on Thursdays, or at many of the other tech events in Seattle. Lots of current and ex-Amazonians out there with stories!


I’ve worked at AWS, working with high hundreds of customers during my time there. I haven’t ever seen someone using a service with irrecoverable data loss or an inability to use a backup.

Regardless, that wasn’t my point — the outages you mention do drive customer churn and cause sales problems. Ie AWS outages do impact revenue.


I was referencing startups by and large. If you have 100 engineers or less.

I expect AWS outages affect revenue. But on the score of the number of companies… very few are in a similar position to AWS.


I've seen an AWS database segfault. Most likely due to custom AWS patches.


Ah, seems you missed out on the funnest shenanigans inside AWS :D

These are called Fight Club issues, and are restricted access issues that don't show up for most team members. Gotta cordon off the real hairy bugs!


I used to think this way, until I noticed that every clustering technology I had ever used had caused more outages than it had prevented.


Yeah, it often seems like in trying to eliminate the main single point of failure we end up with many more single points of failure.


Start by building a business that isn't differentiated by how many 9s you have. Something customers want so badly that a few hours of inconvenient downtime doesn't move the needle at all.

In this situation, blowing up your system complexity to maybe get another 9 makes no sense. Then the revenue change is pretty irrelevant for modest downtime.

People underestimate single-server uptime. If availability is really that important, buy a hot backup. Put it in another region. Done.


AWS has major availability issues. VPS is not worse off here. If anything it is easier to debug.


I'm scared of running stuff on servers.

Jk. But kinda. I have a little hobby website. Mostly for fun, I keep rewriting it in different languages, stacks, and deployment methods. At one point it was on a $5 AWS server. It continuously ran out of memory and crashed. Then it crashed for some other reason, I think because it ran out of disc space from log files. Then the postgres database got encrypted for ransom because I stupidly left it open without a password.

So now I use things like Fly.io and Firebase. And they work great for hobby stuff. I'd like for my projects to grow, and this article makes a good case why I should be competent enough to run them on a server myself.

At work I help run a much larger website that we run with K8s and a managed db. The idea of directly running that on servers seems equally daunting.

But I know it shouldn't be that way. Thanks for the reminder.


There are people with a specific huge vested interest in making sure you always feel that way. So that alone is a good reason to try to resist that feeling.

Not just for the obvious reason of growing your abilities, because you can say exactly the same about everything and you can't become an expert in everything.

But simply because someone makes a lot of money off of you if you feel like you need them, and they are big enough to do many indirect things and change the entire environment to make you feel you need them and never even question it, and to make you suffer what is essentially ostracisation if you ever do question it.

Making those baby sysadmin mistakes is perfectly fine. Everyone must make them. Are you ever going to forget to secure access to any db after that? That is not only one specific config but an entire class of problem that you are alert to now.

Not only is it ok because it's low stakes and then you know better for high stakes at work after that, but really even at work it should be normal to suffer a breakage once in a while, because at work it's even more important that you know how to recover when it inevitably happens anyway even without making mistakes. I don't mean break things on purpose, I just mean if you never suffer a problem, you never become prepared to deal with a problem. That is not good.

Besides, no matter what you still suffer equivalent problems, cloud or no cloud. What's the difference between your db going down from a hardware fault or misconfig, or a cloud account getting killed because of a billing or tos error?

Also, k8s actually makes an otherwise manageable system into a daunting one.

All in all, more and better reasons to be brave than to be afraid.


What doesn't kill you makes you stronger. You apply what you learn to make your next server setup better. Your db got encrypted by ransomware? Then for your next setup you'll make sure to configure the database to only listen to local network connections and have daily offsite backups. Out of disk space due to huge log files? Then your next setup will have automatically rotated log files. Your apps died without you noticing? Then your next setup will feature health checks and alerting. Repeat this enough times and you'll eventually have a bulletproof setup.

Of course, not wanting to go through all of this is a valid choice too, especially if you have no interest in running your own services.


> Plop one in Virginia and you can get the English speaking world in under 100ms of latency.

Never start a fist fight with an Australian. Let alone 26 million of us at the same time.


Well he gets a 150ms reaction time advantage to the incoming punches so he might be ok.


Meh.

The classic Australian peel assures a continuous pipelined fighting response.


I think the standard response is that Aussies certainly speak something, but that it’s arguable if it’s English or not. ;)


Or the kiwis!


(The subtext of my prior message: I'm starting a fist fight with New Zealand.)


Meanwhile, I was just served a 503


Yeah, might be an idea to move this site to S3 so it works.


(Just to clarify, that was intentionally tongue-in-cheek... From the discussion of databases here, I assume the post talks about a database-driven site that wouldn't work on S3. It just seemed amusingly ironic that I couldn't see what points it was trying to make about usability and robustness because the site was down.)


Didn't we already do all of this 30-ish years ago?

And I'm not trying to suggest that we haven't learned a lot since then, but:

There was a time when the corpo web server/Internet computer existed as a pizza-box Sun Microsystems machine on someone's desk.

And sure, bandwidth was a lot less than 11MB/sec back then, but that's not a stretch at all for today's modern equivalent to that expensive pizza box.

But I'm old, and man do I sure as fuck remember the Web being a generally-unreliable turd back then. It was common for a website to not work today, or for an ISP's solitary email server to be down for a week or more.

It sure felt more real (and I even built a couple of those email servers), but it was also obviously very broken some of the time in ways that people don't generally accept today -- especially with 400 million visits per month, which was largely unfathomable at the time.


Reliability steps were made when we updated the “sparcstation 1s in a university lab” hosting architecture to what marketing called a “hot-swappable blade server with integral UPS” i.e. a bunch of second-hand thinkpads on a rack shelf above the Livingston Portmaster


Ah. The Livingston Portmaster, and its finger server (or the same on many other dial-up terminal servers): An easy way to not only show the RADIUS username of a person dialed into a port, but also an easy (and scriptable) way to discern their IP address and ping them to death.

Or, working oppositely: Knowing the IP of the user (from IRC, say), it was often possible to discern the account's username. And since usernames were often factually descriptive back then -- sometimes for billing purposes (firstnamelastinitial, say): It was easy to give them details about themselves or their family that they would not have guessed were possible.

The greater Internet was a fun time back then for me as a teenager, and I never did anything particularly damaging with it even though I absolutely fucked around with things from time to time.

It is certainly fun to remember, but there are aspects of how things worked back then that I'm not itching to bring back to life here in 2024.

(And that includes the solitary pizza-box server on someone's desk, as well as the wooden racks of modems and thinkpads.)


in summer we’d nick desk fans off the telco folks to cool down our USR Sportsters, distracting them first with snarky remarks about ATM overhead

absolutely do not miss those days


You were closer to the fire than I was, and I do not envy this position that you had.

I'm over here remembering that /etc/passwd was world-readable (and shadowless) on a given system's [singular] shell box, but you were more like the man behind the curtain.


Like many others, folks around here are often blind to nostalgia's impact on their perception of 'good ol' days' tech. Just because something was simpler to write or administer doesn't mean it was better for users, and there are a lot more users now.


A single server in a data centre now is a lot more reliable than a machine sitting on someone's desk back then. The location probably makes the biggest difference - reliable network connections.


There are other advantages directly related to the machines themselves, too. Power consumption and heat management alone can make a big difference when things go wrong. Enclosures that are engineered for a finite life in a data center, rather than for lighting equipment on tour with a metal band, are useful. (Granted, if we're talking about Sun specifically, I really can't fault their hardware.)


That was a pretty good read. The edging reference made me chuckle.


A lot of thoughts —

Most businesses are not running a SPA which only makes a database call. Most infrastructure I’ve seen is not web-facing. Managing this from a consistent place reduces operations overhead.

The idea that you’re SSRing everything instead of making a few API calls is strange to me. Most businesses in the top 1000 will have optimized for caching when at all possible. They’re also not paying sticker price to CDNs.

Offloading work to CDNs comes with inherent operational benefit.

This all seems great until you have your infra fail. Stories of unrecoverable outages abound. Even if you run in this architecture, you should always always always be using an external party for backup and log retention.

More of the above… this doesn’t interact well with the real world, but everything old is new again and many new CTOs /founding engineers haven’t seen how infra breaks and how operations impact throughput.


> The latest AMD server processors have 64 cores and 128 threads. Their newest Zen 5, Turin server processors are rumored to have 192 cores. On a dual socket server, you'll be near the 400 core count with 768 threads. You can serve the world from a single box.

Even if the math is a bit off on the 11 MB/sec due to spikes he's got a point...

The world's population grows at a much slower pace than server processors. Heck, we're even talking about population degrowth now in many places. Meanwhile AMD shall keep coming up with beefier and beefier servers.

For many non-FAANGs what couldn't be done 15 years ago on a single box is now totally doable on one. And this trend shall only continue.


> You can serve the world from a single box.

If you're comfortable with a single point of failure.

> Why do we need Docker, serverless, horizontal scalability again?

The above and I don't ever have to run security updates or patch an OS.


Docker brings in the majority of the OS. Please patch your app / containers.


> The above and I don't ever have to run security updates or patch an OS.

Have you been involved in the xz project in the last 2 years? Curious.


> I don't ever have to run security updates or patch an OS.

You would have to update the base image to benefit from those updates though.


> The above and I don't ever have to run security updates or patch an OS.

This has to be satire?


I literally have dreams about SQLite supporting write concurrency somehow. That's the only thing holding me back from going all-in on the monolith nirvana described in this article.


Most web server use cases have at least an order of magnitude more reads than writes. So sqlite is a good default choice for most such use cases.

I don't know your specific use case though.
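
For the read-heavy case, a tiny illustration of the usual mitigation: WAL mode lets readers proceed while the single writer commits (writes are still serialized, so this does not add the write concurrency the parent dreams of).

    import sqlite3

    con = sqlite3.connect("app.db")
    con.execute("PRAGMA journal_mode=WAL")     # readers no longer block on the writer
    con.execute("PRAGMA busy_timeout=5000")    # concurrent writers wait instead of failing fast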


Single point of failure in the cloud can be due only to software errors. At some level.

When your "stack" and deployment process involves multiple tools and services, then yes you need high availability. Simple things with careful design and implementation tend to break less often.

Use tested software bits, test your own bits and the odds of a failure drop quickly.


I don't disagree fully with this post, but I find the 150/s and 11M/s disingenuous. Sure, if load were constant and spread out over 24 hours. But it's not; it comes in waves and spikes. Sometimes it's cost-effective to have a big enough box to handle the spikes sitting around mostly idle the rest of the time, sometimes it's not. It comes down to utilization.
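
For scale, the article's averages next to a crude spike allowance; the ~70 KB per page load and the 10x peak factor are assumptions, not numbers from the article:

    page_loads_per_month = 400_000_000
    avg_rps = page_loads_per_month / (30 * 24 * 3600)      # ~154 req/s on average
    avg_mb_s = avg_rps * 70 / 1024                         # ~10.5 MB/s at ~70 KB/page
    peak_factor = 10                                       # headroom for waves and spikes
    print(f"avg: {avg_rps:.0f} req/s, {avg_mb_s:.1f} MB/s; "
          f"peak: {avg_rps * peak_factor:.0f} req/s, {avg_mb_s * peak_factor:.0f} MB/s")
    # Roughly 1,500 req/s and ~105 MB/s at peak -- around a saturated gigabit port.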


I agree. Depending on the type of hypothetical website, you could expect that most traffic comes e.g. during the week at business hours, or maybe in the evening after work. You need to design for that spike in workload (maybe 10x?)

There was a story recently about running the workload of Twitter from a single machine: https://news.ycombinator.com/item?id=34291191

Yes - possible but no redundancy, difficult maintenance


Twitter is a much wilder size category.

If you want to handle 10x spikes over 11MB/s, then you need a single gigabit port. That's not hard to get, nor hard to feed.


If you had read-only database replicas on your edge servers then you would only be paying the latency cost to us-east-1 on writes.

I haven't done that, though. It doesn't seem widely available yet, or at least not for hobbyist programmers like me.


I think it's doable with LiteFS[1], but I also haven't done that yet. Wish I could hear about people using it in prod.

[1]: https://news.ycombinator.com/item?id=32925734


Bare metal > Cloud.


11MB/s away, and I just can't reach the goddam thing. Internal Server Error. When the fox can't reach the grapes she says they're sour anyway.


I think this post is kind of missing business sense. It doesn't matter if TempleOS on i386 could handle LLMs 10x faster than competition if you couldn't assemble a 24/7 Ops team to support it.

Businesses need something predictable and repeatable (great if scalable too) that can be offered to customers at a higher value than cost, and to make it repeatable businesses demand infinitely distributable architectures. Otherwise you end up with just various forms of losses, missed opportunities, trust issues, foxhole problems, technical debt and such.

The idea of a single mainframe in a shed maintained by a UNIX wizard generating boatloads of cash would be awesome. The last part has proven difficult.


Stackoverflow's architecture, while more than just one server, is very much in the spirit of this and I'd say they're a highly successful high traffic website.

https://stackexchange.com/performance

9 IIS web servers, 1 SQL Server (plus one hot-standby SQL Server) with 1.5 TB of RAM, and 2 Redis servers, serving 1.3 billion views a month. And if anything they've been scaling down their architecture over the years as hardware gets better and their software improves: in 2016 they ran 11 web servers, for example. Notice how little CPU usage they've got, even at peak; it's really about I/O, i.e. lots of RAM and fast storage.

So it depends on the use cases, but I do not think "I need the cloud" is the automatic answer to "I am making something for the web". In fact the vast majority of people will never even get close to the level of traffic SO has.

Of course, even with low traffic, some type of software would highly benefit from a distributed cloud architecture. Can't really make a search engine that indexes the whole internet on so little. But are you making the next google?

Their stack is certainly a lot, lot, lot cheaper to run in the long term than going with Amazon's blood sucking costs.


But the article is pretty clear with a common use case example?

And maintaining a linux server is common knowledge. People have done it for decades. No wizardry needed.


I live in a country with

- very unreliable power

- the internet was shut down completely in the entire country during the last election cycle

Just those two problems make cloud hosting very attractive. No amount of cheap prices can make me switch to a physical server.


Usually you would go with a dedicated server provider like Hetzner or OVH, and they host your instance in their DC locations in various countries like Germany and France.


And the site is up.

I 100% agree with the article.

Hetzner is the way to go.


> AWS, and yet so many people talk about how the cloud is so cheap

Is AWS not widely known as one of the most expensive cloud providers all round?

And the cloud is not cheap in general, it's just convenient. If you do the math, most cloud providers only need a few months to make back the cost of the hardware, after that it's all money printing.

If you need something running 24/7 for a year+ it's always cheaper to buy... unless you're a large firm who doesn't trust their high turnover employees with their own infrastructure, which is where AWS makes their money.


AWS is one of the three similarly priced vendors that people who think cloud hosting is a must would tend to push you into.


It's a cloud problem, you need good customer support to trust your business to the cloud. Only AWS delivers good customer support.


Sure; what I rather meant is that the choice isn't really that great. People who need catchy buzzwords on their resume are not going to promote some great but niche cloud vendor, even if one were a thing.


Yeah, then security comes in and asks for certification, encryption at rest, etc.

Big companies don't use cloud just for fun.

And yes, it's expensive, but a ton of the people working on that are good but are not security experts or ops experts.

Managed shit takes the complexity out of this.

For everything non-corporate, yeah, go with SQLite if you like, but you should have enough money anyway that those 'optimizations' don't matter.

One expert is not cheap, and just coming up with that basic blog post requires a little bit of expertise too.


Feels like a junior smart-ass engineer wrote this post.


accurate



