Stack Overflow is a cacheless, 9-server on-prem monolith (twitter.com/sahnlam)
160 points by mike_h on Feb 26, 2023 | 118 comments



Even though I love their simplicity as an example of how to be pragmatic and not over-engineer, do remember that they’ve tuned their code to the point that they built an ORM that is one of the fastest in the .NET world. I used it and it was awesomely lightweight.

It’s as much an example of how far world class talent can go, as it is about doing more with less.


Right - Marc Gravell and Tim Craver, who worked on the core architecture of Stack Overflow, were both so obsessive about extracting performance from .net web applications that when they couldn’t do any more from the outside, they both quit and went to work for Microsoft on performance improvements in the framework itself.

I feel like it’s similar to how people point to Craigslist as evidence that you can still build sites in Perl - ignoring the fact that Craigslist has Larry Wall on a retainer.

Running highly scalable monoliths is easy! As long as you’re willing to hire some of the five to ten people in the world who are capable of advancing the state of the art of development on that technology stack…


Except that servers are literally 50-100x more powerful than they were when these sites were built. You just don't need legendary talent anymore to accomplish pretty reasonable scaling with a simple low server count architecture.


It's true! Hardware is powerful enough nowadays to run all those needless microservices and containers. ;-)


You don't, really. For many applications, you can use Django or Perl today and just enable nginx caching for unauthenticated users.

Stack Overflow didn't need these optimizations. They could have just deployed 20 servers instead and still been profitable. The people there optimized because they liked to.


Yes, Microsoft SQL Server is famous for its ability to get faster just by adding more servers.


The discussion isn't really around the DB/SQL Server. As far as I could tell, we were discussing .NET and optimizations in its ORM.


Minor correction but that’s Nick Craver https://nickcraver.com/


> Running highly scalable monoliths is easy! As long as you’re willing to hire some of the five to ten people in the world who are capable of advancing the state of the art of development on that technology stack…

I truly believe that being able to design and run a modular monolith application effectively (not talking about the 'hyperscale' scenario here) should be a prerequisite for designing and running a set of interconnected microservices. The challenge is similar, but dealing with modular monoliths has the advantage of not having to deal with the uncertainty of network programming (e.g. remote calls, network error handling, distributed transactions).


I think the other point is that very few applications need this kind of scaling.


Dapper! I used it a while back and it was a single class that bundled query results straight into a list of objects by emitting low-level CLR bytecode (IL).

Looks like it's expanded a little since then:

https://github.com/DapperLib/Dapper
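
For context, a minimal sketch of what using it looks like - the Question class and connection string here are invented, but Query<T> is Dapper's real extension method:

    using System;
    using System.Linq;
    using Dapper;
    using Microsoft.Data.SqlClient;

    // Hypothetical POCO; Dapper maps result columns onto properties by name.
    public class Question
    {
        public int Id { get; set; }
        public string Title { get; set; } = "";
        public int Score { get; set; }
    }

    public static class DapperSketch
    {
        public static void Main()
        {
            using var conn = new SqlConnection("Server=.;Database=QA;Integrated Security=true");
            // Dapper emits IL at runtime to map each row onto Question,
            // so this runs close to hand-written ADO.NET speed.
            var hot = conn.Query<Question>(
                "SELECT Id, Title, Score FROM Questions WHERE Score > @min",
                new { min = 100 }).ToList();
            foreach (var q in hot) Console.WriteLine($"{q.Score,5}  {q.Title}");
        }
    }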


You can also see this the other way around — it's a testament to how slow some other stuff is.

Which, to be clear, is not intended to be a negative statement about that "other stuff". It really depends. Some is. But I've also seen things just done poorly by applying tools wrong, e.g. ORM misuse leading to thousands of queries that should have been one OUTER JOIN.
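
To make the N+1 shape of that mistake concrete, a hedged sketch in EF Core - the entities, context, and connection string are all invented for illustration:

    using System;
    using System.Linq;
    using Microsoft.EntityFrameworkCore;

    public class User { public int Id { get; set; } public string Name { get; set; } = ""; }
    public class Post
    {
        public int Id { get; set; }
        public string Title { get; set; } = "";
        public int AuthorId { get; set; }
        public User Author { get; set; } = null!;
    }

    public class BlogContext : DbContext
    {
        public DbSet<Post> Posts => Set<Post>();
        public DbSet<User> Users => Set<User>();
        protected override void OnConfiguring(DbContextOptionsBuilder b)
            => b.UseSqlServer("Server=.;Database=Blog;Integrated Security=true");
    }

    public static class NPlusOne
    {
        public static void Demo(BlogContext db)
        {
            // The misuse: one query for the posts, then one more query per post.
            foreach (var post in db.Posts.ToList())
                Console.WriteLine(db.Users.Find(post.AuthorId)!.Name);

            // The fix: a single round trip; EF translates this to one JOIN.
            foreach (var row in db.Posts.Select(p => new { p.Title, Author = p.Author.Name }))
                Console.WriteLine($"{row.Author}: {row.Title}");
        }
    }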

But I don't think you need engineers of their unique calibre to get most of what they got. It's probably an exponential thing, if you have some merely good engineers you could maybe achieve 80% of their performance. The last 20% are just much more costly.


Yep. Following some of the SO folks on Twitter a while back, I remember watching them do all sorts of things with .NET that didn’t feel remotely “necessary” for a Q&A website. It’s not like you can pull people off the street and have them get away with infrastructure this simple.


> It’s not like you can pull people off the street and have them get away with infrastructure this simple

I know that in many cases simple != easy but I can't help feeling sad while reading this.

When I started my career, cloud wasn't yet mainstream, but as a beginner I was able to deploy and configure an nginx proxy and load-balance between 2-3 backend servers without too much effort. It wasn't some kind of rocket science.

I guess the current issue is that cloud has been marketed so much that nobody who's just starting out in the industry even has a second thought about using it by default. What can I say, great job from the cloud providers in capturing their customers as soon as they get in front of the store.


Great, now you have an nginx reverse proxy as a load balancer in front of a few servers. Now sort out log storage, certificate expiry, access controls, patch management, health monitoring, and remote administration, update it whenever you add or remove backend servers for maintenance, and make sure to sync it up with DNS, and you’ve almost got the same capability as an AWS ELB. Except yours doesn’t have high availability or horizontal autoscaling.

Getting all of that stuff right actually kind of gets close to rocket science. Which can be worth doing… but just be aware that Amazon will happily sell you a rocket kit.


I'm not an "on-premises bare-metal server absolutist". Of course there are trade-offs in terms of convenience, but there are also trade-offs in terms of cost, performance, and vendor lock-in. It all depends on what you need and what your specific constraints are.

Is time to market critical? Will you have daily traffic fluctuations between 10 and 10k users? Will you lose a ton of money/customers for any service interruption? By all means use the latest version of managed Kubernetes combined with whatever other cloud service tickles those itches. But don't forget to always keep an eye on your bills and think about how you can reduce them by simplifying your architecture.

But if you're just building a corporate intranet for a few dozen users who log in once a week I'm pretty sure a simple VM (even if managed in AWS) would make much more sense.

And if you really want to roll your own, there are plenty of options to make your life much easier compared to sending a rocket into outer space. Yes, it's more work upfront, but after you do the setup the first time there's little left to do:

infra automation & templates: ansible, docker, etc.

log storage: mount shared storage, ELK, or use a paid LaaS or monitoring SaaS

certificate management (on the LB machine only): certbot

access controls: Linux user and group management

patch management: enable unattended upgrades for security patches

health monitoring: nginx has that built in for the LB; for more advanced use cases use a paid service (New Relic) or a free one (Nagios)

remote administration: ansible, etc.

Don't get me wrong, I use the cloud on a daily basis for work. I'm just sad because most teams don't know how to use it effectively without jumping the gun.


> log storage, certificate expiry, access controls, patch management, health monitoring, and remote administration, etc

This is how you can satisfy those needs with stock Linux. Install Ubuntu, then:

    # TLS certificates, automatic security patches, remote log shipping
    apt-get install certbot unattended-upgrades systemd-journal-remote
    # health monitoring via the netdata agent
    wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh
Remote admin and access controls are already handled via SSH and ordinary UNIX permissions. DNS editing is easy; just use your registrar's UI for it.

Oddly, the most painful part is uploading your server apps and making them properly start up, be backed up, etc. You can use Docker, but I've written a tool that does it without that, just using systemd and Debian packages. You can run it on Mac/Windows too, and it'll build a package for your server app, upload it, install it, start it up, etc., on a list of servers defined in the config. You can sandbox the server with an additional line of code, define cron jobs with a few others, and so on. It's a bit more direct than Docker, and gives you the traditional stuff like OS-managed security updates (for the libraries the OS provides).

> Except yours doesn’t have high availability or horizontal autoscaling

HA: Some people have extremely distorted ideas of how reliable server-class hardware and datacenters can be. There was someone on Reddit commenting on the 37signals cloud exit who believed that normal datacenters have 99% availability! The actual figure for most well-run commercial DCs is closer to five nines. Some datacenter providers like Deft (as used by 37signals) promise 100% availability and give SLA credits for literally any downtime at all, which they can do because they have so little.

Auto-scaling: this is often a requirement that comes from the high cost of cloud services. If you only need 9 servers you don't need to auto-scale, you can just buy the servers and leave them running 24/7. Yeah, there are definitely places for that like companies that need to occasionally run huge batch jobs where the cloud model of multi-tenant sharing makes total sense, but for a website like Stack Overflow it's just not needed. Remember that their hardware runs at low utilization despite not having any caching layer; they can absorb huge spikes in traffic without issue assuming they're provisioned with sufficient bandwidth.

> Getting all of that stuff right actually kind of gets close to rocket science ... Amazon will happily sell you a rocket kit

This makes me feel kinda old, but I can't grow a beard let alone a gray one :( It's a type of sysadmin skill that was once considered entry level and which could be readily found in any university IT department. Probably still can be. Yes, if you grew up with AWS writing nodejs apps on a MacBook, if you never installed Linux into a VM and played with it, then it may seem scary. But it's not really so bad. You should try it some time, it's a generic skill that can come in handy.


To add on to the HA comment: A lot of people have distorted ideas of how much availability they actually need. A lot, if not most, applications could probably get away with the absolutely abysmal 99% uptime, depending on how that downtime was distributed. 99% uptime could mean anything from ~3 days of downtime a year, 7 hours a month, 14 minutes a day, to half a second of unavailability a minute.
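
Those figures check out; a throwaway sketch of the arithmetic, with the window sizes taken from the comment above:

    using System;

    public static class UptimeBudget
    {
        public static void Main()
        {
            // A 99% uptime target leaves a 1% downtime budget, however the window is sliced.
            TimeSpan[] windows = { TimeSpan.FromDays(365), TimeSpan.FromDays(30),
                                   TimeSpan.FromDays(1), TimeSpan.FromMinutes(1) };
            foreach (var w in windows)
                Console.WriteLine($"per {w}: {w * 0.01} of allowed downtime");
            // Prints ~3.65 days/year, ~7.2 h/month, ~14.4 min/day, 0.6 s/minute.
        }
    }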

Like sure, it's not ideal, but real businesses almost never are. And, as you pointed out, most datacenters get dramatically better uptime than that.


Not to take anything away from Dapper (it's an excellent library), but it isn't really that much faster than EntityFramework anymore.

> EF Core 6.0 performance is now 70% faster on the industry-standard TechEmpower Fortunes benchmark, compared to 5.0.

> This is the full-stack perf improvement, including improvements in the benchmark code, the .NET runtime, etc. EF Core 6.0 itself is 31% faster executing queries.

> Heap allocations have been reduced by 43%.

> At the end of this iteration, the gap between Dapper and EF Core in the TechEmpower Fortunes benchmark narrowed from 55% to around a little under 5%.

https://devblogs.microsoft.com/dotnet/announcing-entity-fram...

Again, this isn't to take anything away from Dapper. It's a wonderful query library that lets you just write SQL and map your objects in such a simple manner. It's going to be something that a lot of people want. Historically, Entity Framework performance wasn't great and that may have motivated StackOverflow in the past. At this point, I don't think EF's performance is really an issue.

If you look at the TechEmpower Framework Benchmarks, you can see that Dapper and EF performance is basically identical now: https://www.techempower.com/benchmarks/#section=data-r21&l=z.... One Fortunes test is 0.8% faster for Dapper and the other is 6.6% faster. For multiple queries, one is 5.6% faster and the other is 3.8% faster. For single queries, one is 12.2% faster and the other 12.9% faster. So yes, Dapper is faster, but there isn't a huge advantage anymore - not to the point where one would say StackOverflow has tuned its code so well that it needs substantially less hardware. If they swapped EF in, they probably wouldn't notice much of a difference in performance. In fact, in real-world apps, the gap between them is probably going to end up being even smaller.

If we look at some other benchmarks in the community, they tell a similar story: https://github.com/FransBouma/RawDataAccessBencher/blob/mast...

In some tests, EF actually edges past Dapper since it can compile queries in advance (which just means calling `EF.CompileQuery(myQuery)` and assigning that to a static variable that will get reused).
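
A hedged sketch of what that looks like - the Product entity and context are invented, but EF.CompileQuery is the real EF Core API:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.EntityFrameworkCore;

    public class Product
    {
        public int Id { get; set; }
        public string Name { get; set; } = "";
        public decimal Price { get; set; }
    }

    public class ShopContext : DbContext
    {
        public DbSet<Product> Products => Set<Product>();
        protected override void OnConfiguring(DbContextOptionsBuilder b)
            => b.UseSqlServer("Server=.;Database=Shop;Integrated Security=true");
    }

    public static class CompiledQueries
    {
        // Compiled once into a delegate; skips the expression-tree-to-SQL
        // translation step on every subsequent call.
        private static readonly Func<ShopContext, decimal, IEnumerable<Product>> CheaperThan =
            EF.CompileQuery((ShopContext db, decimal max) =>
                db.Products.Where(p => p.Price <= max));

        public static void Demo()
        {
            using var db = new ShopContext();
            foreach (var p in CheaperThan(db, 9.99m))
                Console.WriteLine($"{p.Name}: {p.Price}");
        }
    }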

Again, none of this is to take away from Dapper. Dapper is a wonderful, simple library. In a world where there's so many painful database libraries, Dapper is great. It shows wonderful care in its design. Entity Framework is great too and performance isn't really an interesting distinction. I love being able to use both EF and Dapper and having such amazing database access options.


Totally agree. To clarify, when I picked Dapper, it was 2014, when there was a huge difference.

No doubt EF has probably gotten to that level, since MS has done a stellar job with .NET Core of relentlessly slimming things down and improving performance.


The best cache is the one built into the database. People seem to forget that the major RDBMSes have sophisticated cache strategies of their own, and that handing them more RAM (and ensuring they are configured to use it for query or other caches) is usually a good first strategy before trying to second-guess and reinvent the cache outside the DB.

Thread says SO allocates 1.5TB RAM to SQL Server. Sounds wise.


Makes sense. Traditional RDBMSs are basically a buffer cache and a query optimization engine.

If the data is sitting in memory, and you've tuned extracting the data from memory as fast as possible, job done.


Not just RDBMSes. Any modern DB, document store, or KV store will use a buffer cache.


It's all about the load though. SO is probably 95% read-only, which makes sense for removing the cache layer. If you had more writes, then you would need an external cache to offset the read load.


I don’t follow. Holding the total server load constant, why wouldn’t a read-heavy workload benefit more from caching than a more balanced read/write workload?


Microservices remain mostly an organisational pattern to scale development teams, not necessarily system performance. Microservices add a lot of complexity and overhead.


"Normal" sized services should be adequate enough for that purpose.


Microservices became a synonym for Service-Oriented Architecture years ago.

It's almost always relatively normal sized services split by functional area e.g. Auth, Cache etc.


Hehehe. This is absolutely true, and trying to explain it to anyone around me makes me feel like a Cassandra quoting Battlestar Galactica: "All of this has happened before. All of this will happen again". I'm not even old; I just happened to witness the tail end of SOA when I was still a student, followed by the rise of microservices afterwards. Your service mesh is just someone else's message bus. Sure, it's different. Except not really. Also, bonus points for younger developers reinventing features for JSON APIs that SOAP already had.


> service mesh is just someone else's message bus

I don't think you mean service mesh here, or at least not Istio/Linkerd style service mesh


I agree that it does among some. But the semantics of "MICRO-services" is that smaller is better, and you can find those who take it to mean smallest is bestest. You can't just assume the word "micro" is a hanger-on and isn't doing work convincing people of the virtue of super-small. Lots and lots of orgs and people fall into this trap. I've worked at them. I've argued with advocates in the professional sphere. All the time.

If it were literally just SOA, I wouldn't have issues with it, as you can reasonably have conversations about where divisions should be placed. Maybe you are surrounded by more reasonable advocates, but that is not the norm in my experience.

When you have orgs assigning a team to build and maintain 20+ services... It's gone into full self destruct mode.


When concept X becomes hot as an alternative to Y, everyone absolutely has to do X or be square. But for large enough companies it's cheaper to influence the industry so that X = Y and we relegate Y to Y', where Y' only contains our bad memories of Y.

This is the "Enterprise Technology Adaptation Strategy".


SOA was all XML at the API level.

Serverless is RESTful, lower-level, and coupled to the cloud-provider's menu of containers, key/value store, authentication, &c.

"Synonymous" seems a stretch.


The comparison was between SOA and microservices. No one mentioned serverless.


Fair point, but my argument doesn't vary terribly much.


Besides, microservices don't guarantee horizontal scaling, just as a monolith does not preclude horizontal scaling.


The main takeaway is that the questions searched for are so widely distributed that there is no need for a cache layer - they are nothing but long tail.

At that point there is no 'cloud' design that can help. It's either one database, or maybe just shard everything onto thousands of distributed nodes.

But the point I am trying to make is that Kubernetes, microservices, etc. are based on the idea of winners - power laws. One tweet everyone wants to read. One search term, one viral video.

Then again, this is just a question of taste - the taste of the dev lead, what (s)he feels is the best approach. Take another company doing the same thing and a different approach might emerge.


I mean, kubernetes or microservices don’t care how the data reads are distributed, right? That problem is a database-level thing whereas k8s is infrastructure, you can run any kind of database with any kind of sharding you want on it. I feel like it might be more accurate to say something like “the value of caching is based on the idea of winners” for example


Yes. Basically, I think I would not have done it like that, but they did and it's wildly successful, so fair play - it's taste that makes the difference.


It is ironic that many questions on Stack Overflow are about various cloud services, hyped-up technologies, and problems caused by over-engineering.


> various cloud services

This question does not appear to be about programming. Closed.

> hyped-up technologies

Subjective. Closed.

> problems caused by over-engineering

Opinion-based. Closed.


I know you were joking but I am so glad stuff like that is not on SO. It would look like Quora which is the scourge of the internet.


This! Quora is a great example of what SO would be if it wasn’t for their moderation.

That being said, I think most would agree it might have gone too far and there may be value in trying to tweak it a bit to make it easier to contribute.


I’m always puzzled when I’m using SO to help diagnose some obscure problem in my tech stack and I see a bunch of “hot questions” in the sidebar about whether dwarf armor can deflect magic bullets, or what the energy capacity of a Stormtrooper’s laser rifle is, etc.


Not knocking any interest, just really curious that there is such a wide range of topics on there.


Those are not Stack Overflow questions. Stack Overflow is strictly for programming questions. They are from other sites that are also part of Stack Exchange.


"The medium is the message" wins again.


Imagine trying to present this kind of architecture to a room full of executives already sold on the "benefits" of kubernetes, big data, serverless, etc.


Hah. I get your point but it would be an easy sell for them. The impossible sell would be to engineers. Executives would just compare operating costs estimates.


Good point, for normal executives (whatever that means). In my little bubble most executives I have to deal with believe themselves to be on par with solution/enterprise architects and they like to show this by saying stuff like: "Let's use microservices and kubernetes for better scalability, everybody's doing it..."


What would prevent you from running 9 "web server pods" with 64GB ram each? Just implement the whole thing on top of Kubernetes, why not?


A 64GB RAM instance on a cloud (which is what you're most likely using if you have K8S) will set you back a decent amount of money, even more so if you want one matching the specs that Stack Exchange actually uses.

If you need that level of performance you need to go bare-metal, and this is where you'll hit a lot of roadblocks (yet they will be happy to spend 10-100x more money trying to make do with the cloud).


did that before: running a single monolithic app inside a kubernetes cluster on a single pod. I still feel dirty after doing it.

My current hobby is to try and run monolithic apps like these on serverless services like cloud run. There's still some pain related to attaching persistent storage to a container but otherwise it feels like a great option.


The use case is simple, i.e. web front end, thin app layer, database.

So if you were to implement this same architecture using Kubernetes or serverless, it would be just as simple as a bunch of Ansible or Puppet scripts.


Keep in mind that the only reason this works is because it's all running on beefy bare-metal servers.

If you want to run it on Kubernetes I hope you know how to install/maintain K8S on-prem, because there's no way you're going to get this level of performance from any cloud provider (not at a sane price anyway).


Agreed, but would most engineers understand that they can keep the simplicity of the solution if the underlying infrastructure is based in the cloud/serverless/etc?

From my limited experience, many engineers fall into the trap of adding accidental complexity to an otherwise simple architecture just by trying to use the latest/coolest cloud architecture trend.

Monolith in the cloud on kubernetes? Speak no such abomination. Of course we have to do microservices, the more the better. How can we scale otherwise?

SQL DB? What is this, 2010? Of course we're going to use Cosmos DB, how else could we get "single-digit millisecond response times, automatic and instant scalability, along with guaranteed speed at any scale".

Of course I'm exaggerating for dramatic effect but I rarely see teams disciplined enough to keep cloud architectures simple and clean.


The folks over at SO picked a stack (C#, SQL Server, IIS) and optimized the heck out of it to keep this "simplicity". Much of SO is custom-built from the ground up to push performance and stay within the purity of the canonical .NET stack.

It isn't clear to me this is a model that would work elsewhere, or should be held up as something to be replicated.

Did they save time? Did they save money? Did this help make SO a wildly successful company? Did it allow them to deliver features to customers faster?


It's worth reminding people what is actually possible with a relatively simple architecture. There's a vast number of websites and services with a very small fraction of the traffic of Stack Overflow with a much more complicated architecture simply because everyone thinks you need Kubernetes etc to scale out.


That's the point though. If you want to focus your engineering time on optimization and code quality, then of course you can scale to SO's size with 9 servers and a simple architecture.

If you're still growing and more interested in delivering tons of features quickly, and/or don't have the ability to attract world leading talent, then a more complicated architecture with clear boundaries is often a better call than delivering relatively few features with obsessive rigor in a monolithic codebase.


You'll only need the expertise to perform in-depth optimization if you're scaling to the level of Stack Overflow though. For the vast majority of sites it won't be a concern. The simpler architecture should be the default because local method calls are easily 2 orders of magnitude faster than a network roundtrip (yes I know that's not the whole story). I'm not sure how or why creating clear boundaries without the RPC crutch suddenly became insurmountable.


It's not cacheless. There are countless caches throughout (including what appears to be ~1TB of memory in the database server), just not a dedicated cache machine.


I think OP is only referring to the server architecture. And as you say, there is no cache server. So: cacheless server architecture.


By this definition almost all non-toy applications under non-toy OSes have caches, because of CPU caches and registers.


I don't think it's that much more complicated than Wikimedia, which does 5x the traffic: https://meta.wikimedia.org/wiki/Wikimedia_servers


Not that long ago (2016) they had:

  Servers:

  SQL Servers (Stack Overflow Cluster)
   2 Dell R720xd Servers
  SQL Servers (Stack Exchange “…and everything else” Cluster)
   2 Dell R730xd Servers, each with:
  Web Servers
   11 Dell R630 Servers
  Service Servers (Workers)
   2 Dell R630 Servers
   1 Dell R620 Server
  Elasticsearch Servers (Search)
   3 Dell R620 Servers
  HAProxy Servers (Load Balancers)
   2 Dell R620 Servers
  Redis Servers (Cache)
   2 Dell R630 Servers
  VM Servers (VMWare, Currently)
   2 Dell FX2s Blade Chassis, each with 2 of 4 blades populated
   4 Dell FC630 Blade Servers (2 per chassis)
   2 Equalogic SAN PS6000-series
  Machine Learning Servers (Providence)
   2 Dell R620 Servers
  Machine Learning Redis Servers (Still Providence)
   3 Dell R720xd Servers
  LogStash Servers
   6 Dell R720xd Servers
  HTTP Logging SQL Server
   1 Dell R730xd 
  Development SQL Server
   1 Dell R620 

  Network:

  2x Cisco Nexus 5596UP core switches (96 SFP+ ports each)
  10x Cisco Nexus 2232TM Fabric Extenders (2 per rack)
  2x Fortinet 800C Firewalls
  2x Cisco ASR-1001 Routers
  2x Cisco ASR-1001-x Routers
  6x Cisco 2960S-48TS-L Management network switches (1 Per Rack)

https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...


Isn't stackoverflow, incidentally, one of the websites that would benefit the most from caching, given that its content is supposedly static the majority of the time?


This is addressed in one of the linked tweets.



That defies the laws of physics. How can they be web scale without cloud and microservices?


I want to upvote you, but you forgot MongoDB, which is the most fundamental law of web scale.


We all know that /dev/null is an adequate substitute, as long as it gets those kickass benchmark numbers.


In the diagram [1], I can see why you might design it that way if starting from scratch, but it works as-is, so why change it?

Is there a particular reason to suggest a change to the architecture?

[1] https://twitter.com/sahnlam/status/1629713954225405952/photo...


Diagram 1 has the comment "What I think it should be".

It's easy to interpret that as "stackoverflow should change to be like this", but I think it was meant to be more like "If I had to guess how stackoverflow works, this is what I think it would look like".

It's amazing how much performance and scalability you can get out of computers, if you don't burden them with 100x overhead caused by shoveling data between microservices all the time :-)


> It's easy to interpret that as "stackoverflow should change to be like this", but I think it was meant to be more like "If I had to guess how stackoverflow works, this is what I think it would look like".

That's not a better interpretation. It says something (something not good) about the mindset of modern software engineers that the first thing they think of when they look at a website like StackOverflow is an n-layer microservice architecture, with more moving components than a Swiss chronometer.


This comment also says something.

You've taken the opinion of one engineer and used it to denigrate the mindset of all modern (young ?) software engineers.


From Britannica:

modernity, the self-definition of a generation about its own technological innovation, governance, and socioeconomics. To participate in modernity was to conceive of one's society as engaging in organizational and knowledge advances that make one's immediate predecessors appear antiquated or, at least, surpassed

In engineering, etc, these days, it usually refers to the idea that a single solution works for all people / use cases. Kubernetes proponents are a great example of current day modernists.


You have a very odd interpretation of "modern".

Kubernetes was initially released 9 years ago.

And it's not worlds apart from VM managers like vSphere which is 14 years old.


It is absolutely a better interpretation. The former signals arrogance while the latter shows that OP accepts the inferiority of their guess in comparison to the actual architecture.


The word "should" might be confusing here. I didn't read it as the author recommending a change; rather the author first proposes "Given what I know about Stack Overflow, they must be doing something like this, right?" Then boom comes the surprising revelation.


Is there a website that tracks outages of other websites like Stack Overflow over years? I know some that tell you if it's down right now, but not over years.

I have a subjective feeling that Stack Overflow is down a lot more than other websites. I don't see that ever mentioned in the discussion of cloud vs on-prem which makes the discussion seem lacking.



Seems to be testing from just one location, as far as I can tell?

Packets randomly time out on the internet. I would take this random dashboard with a grain of salt; we cannot be sure SO had an outage just because one request happened to fail.


On the other hand, if they had a 'down for maintenance' page up, pings would still work


With the caveat that Pingdom will mark a failed connection from a Pingdom server on the other side of the world as downtime, even if the target, your ISP, and your ISP's ISP had no problems.


That’s an engineering choice not cloud vs. cloud. How many services are down when AWS us-east has a problem?


True. But cloud makes it a lot easier. In some cases it's built-in, like S3. In others it's a checkbox like RDS Multi-AZ. And if you need to roll your own, multi-AZ or even multi-region is much more straightforward than renting another rack somewhere.

I have personally seen Stack Overflow be "under maintenance" or straight up down a lot more than I have seen entire us-east-1 down.


Keep in mind that the "cloud" relies on an opaque control plane with undocumented failure modes (that sometimes even the provider does not know).

Just because you tick a checkbox doesn't mean it'll actually work as planned, and unlike infrastructure within your control that you can actually test (pull the network or power cable from a live server if you need to), you can't simulate a cloud provider outage.

> multi-AZ or even multi-region is much more straightforward than renting another rack somewhere.

Assuming that enough of the AWS control plane is alive to actually allow you to login and administer the services in your backup region.

Furthermore, cloud providers are their own businesses and are constantly in motion (introducing new features, etc). That's good for their business but bad for yours, as it means they might be doing risky changes that could affect you should it go wrong.


Exactly. I run a large enterprise service in a single datacenter with 5 years of 100% uptime. Our design goal is 99.97% measured monthly.

We have that because we have complete control end to end. We made an engineering decision not to have geo-redundancy because many of the dependent services aren’t available that way either.

Because of the compute requirements, running that service in AWS or GCP would cost about 80% more, inclusive of all costs (equipment, labor, utilities, etc)


Not caching the questions and answers makes sense to me, as I imagine the hit rate wouldn't be terribly good. I would guess, though, that they somehow cache things like the sidebar list of blog articles, featured items, "Hot Network Questions", etc.


They do in fact cache some things like that, they've had caching issues in the past (and again recently, I think) with the wrong cache being used in some situations:

https://meta.stackexchange.com/a/235277


The linked URL [0] is also a great visualization, with a bit more data than the Twitter image.

[0] https://stackexchange.com/performance


Only 450 peak reqs/s? Doesn't that seem low?


It says that's per server, so x9.


> Removed Redis 4 years ago; average latency remained unchanged at 20ms.

A hidden takeaway is that databases on NVMe storage are so fast these days, they are comparable to in-memory (Redis) databases.


Throwing 1.5TB of RAM in the SQL Server (server) has to help too!


> [1.5TB of RAM] that is a third of the entire Q&A dataset.

Yes, but maybe not as much as you’d think.

https://twitter.com/sahnlam/status/1629713961951330304


It's probably debatable in this case but from what I know of Postgres (as one example) the general thinking seems to be "throw as much (relatively cheap) RAM at it as you can", tune some of the default (conservative) memory consumption params, and let Postgres eat the RAM.

See the various parameters here[0] - it's complicated but from my understanding you can pretty quickly run into performance issues depending on some not-exactly obvious variables in dataset size, specific queries, etc.

Of course Postgres != SQL Server, but the concepts are likely similar. That said, you won't catch me ever researching this because I've never used SQL Server and never will :).

[0] - https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...


Please ignore my lack of understanding a bit here. I'm genuinely trying to learn.

I've always heard (and it made sense to me) that to reduce latency of requests from across the globe, you might want to have read replicas or caches spread on global infrastructure. Then how is it that stack overflow is fast here when the db is on-prem, 7 seas across from me? Any amount of RAM should not account for the distance, right?


You can put a big dent in the impact of the speed of light if you keep round-trips to a minimum.

This is one advantage of server-rendered HTML (though that's not the only option you have).

It also helps that StackOverflow is light on interactivity. You load a page, read for a minute, then maybe click a vote button or open a textarea to discuss. As long as the text and styles load quickly, you won't notice if progressive enhancement scripts take a little more time to load.


When I look up www.stackoverflow.com, I get Fastly IPs. I feel like using a CDN has to count as some cache?


It’s also one of the few sites I use that regularly goes down for maintenance.


Steam would be the biggest for me.


Source material is from 2022, so title should include that disclaimer.


And somehow Wikipedia requires thousands of servers.


Wikipedia serves much heavier multimedia content around 20x more often (in page views), with a vastly higher write load.


And runs on .NET

One of the only well-known sites to do so, I think?


I think most things Microsoft run on .NET, incl. parts of Bing and Office Online.


ah yes, of course


Joel Spolsky used to work for Microsoft and all his products were developed using the MS ecosystem, I believe.


It's a useful reality check. Dedicated machines are fast and you can do a lot without much software complexity. People mention the StackOverflow guys optimizing their software, but their CPU utilization is 5% so they have a lot of headroom to be less optimized. Probably they just enjoyed it and could spend time on that, so why not?

At KotlinConf in April I'll be giving a talk on two-tier architecture, which is the StackOverflow simplicity concept pushed even further. Although not quite there yet for social "web scale" apps like StackOverflow, it can be useful for many other kinds of database backed services where the users are a bit more committed and you're less dependent on virality. For example apps where users sign a contract, internal apps, etc.

The gist is that you scrap the web stack entirely and have only two tiers: an app that acts as your frontend (desktop, mobile) and an RDBMS. The frontend connects directly to the DB using its native protocols and drivers, the user authentication system is that of the database. There is no REST, no JSON, no GraphQL, no OAuth, no CORS, none of that. If you want to do a query, you do it and connect the resulting result stream directly to your GUI toolkit's widgets or table view controls. If what you want can't be expressed as SQL you use a stored procedure to invoke a DB plugin e.g. implemented with PL/Java or PL/v8. This approach was once common - the thread on Delphi the other day had a few people commenting who still maintain this type of app - but it fell out of favor because Microsoft completely failed to provide good distribution systems, so people went to the web to get that. These days distributing apps outside the browser is a lot easier so it makes sense to start looking at this design again.

The disadvantages are that it requires a couple more clicks up front for end users, and if they have very restrictive IT departments it may be harder for them to get access to your app. In some contexts that doesn't matter much, in others it's fatal. The tech for blocking DoS attacks isn't as good, and you may require a better RDBMS (Postgres is great but just not as scalable as SQL Server/Oracle). There are some others I'll cover in my talk along with proposed solutions.

The big advantage is simplicity with consequent productivity. A lot of stuff devs spend time designing, arguing about, fighting holy wars over etc just disappears. E.g. one of the benefits of GraphQL over plain REST is that it supports batching, but SQL naturally supports even better forms of batching. Results streaming happens for free, there's no need to introduce new data formats and ad-hoc APIs between frontend and DB, stored procedures provide a typed RPC protocol that can integrate properly with the transaction manager. It can also be more secure as SQL injection is impossible by design, and if you don't use HTML as your UI then XSS and XSRF bugs also become impossible. Also because your UI is fully installed locally, it can provide very low latency and other productivity features for end users. In some cases it may even make sense to expose the ability to do direct SQL queries to the end user, e.g. if you have a UI for browsing records then you can allow business analysts to supply their own SQL query rather than flooding the dev's backlog with requests for different ways to slice the data.
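
To make the shape of this concrete, a minimal sketch in C# - the connection string, credentials, and table are invented; the point is only that the client app talks to the RDBMS directly over its native protocol (Npgsql here, but SqlClient or Oracle drivers work the same way):

    using System;
    using Npgsql;

    public static class TwoTierSketch
    {
        public static void Main()
        {
            // The desktop/mobile frontend connects straight to the database;
            // authentication is the database's own user system.
            using var conn = new NpgsqlConnection(
                "Host=db.example.com;Username=alice;Password=secret;Database=crm;SslMode=Require");
            conn.Open();

            using var cmd = new NpgsqlCommand(
                "SELECT id, name FROM customers WHERE region = @r", conn);
            cmd.Parameters.AddWithValue("r", "EMEA");

            // SQL "injection" gains an attacker nothing here: the database's
            // own grants and row-level security are the boundary, not the
            // query text. Stream the result set straight into whatever table
            // view the GUI toolkit provides; here we just print it.
            using var reader = cmd.ExecuteReader();
            while (reader.Read())
                Console.WriteLine($"{reader.GetInt32(0)}  {reader.GetString(1)}");
        }
    }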


When my startup was acquired a few years ago, our infra was hosted at AWS, but most of our "cloud features" were used more for monitoring, alerting, and dashboarding. The real work was done by Windows/SQL and .NET app code. Ours was a messaging application that we tested to support about 350 messages/second, and we had to integrate with the "big co" backend after we were acquired. The bigco back-end could handle about 3-5 messages/second.

Our main production "infra" was a load-balanced pair of medium CPU front-end servers and a high-memory back-end for the SQL server. Theirs was approximately 20x the size, and a more "traditional" cloud microservices, etc. infrastructure. Optimization makes all the difference. So many of the "extras" just add unnecessary complexity, just like avoiding those "extras" probably does when they actually are required.


On the topic of Postgres versus MS SQL Server or Oracle, I wonder if any of the newer Postgres-compatible databases, like Cockroach or Materialize, solve the scalability issue you raise with Postgres, while not having quite the stigma of MS SQL Server or (especially) Oracle.


I'm not sure. Postgres itself has good performance but the issue for two-tier architecture is number of simultaneous connections. Postgres uses a process per connection. Something like pgbouncer in front can help with that but then the complexity starts going up again, as pgbouncer limits to some extent what you can do. Obviously if you have enough RAM to service all simultaneously connected clients it's not a problem, and you can scale RAM by just adding RO replicas. You can also set connections to aggressively time out if clients are idle, and the clients can re-establish them on demand, so there's lots that can be done.

But ultimately, a db like SQL Server or Oracle will just let you use lots of connections without breaking a sweat. They're both threaded and fully async, it's a much more efficient model.


Is it hosted on the cloud?



"What I think it should be"

That's a little bit arrogant, no?


Quite the opposite. It’s what mere mortals think it’d be, vs what the extraordinary talent has gotten away with.


They mean preconception



