There is a big difference between developers who know their hardware stack and those who don't. Performance efficiency is often an order of magnitude out of whack for companies that prefer a horizontal-first scaling approach (i.e. lots of containers in the cloud, k8s, etc.). As an extreme example, consider Plaid, which at one point had 4000 Node containers [1] running in parallel to handle just as many requests per second. My guess is they could've easily made do with a few beefy bare-metal machines if they knew their hardware well and had architected it sensibly from the beginning.
Money wasn't a problem for them because they were able to just throw $350k a year at AWS to make it work (and millions more at their dev team). But money isn't the only thing. This sort of approach sacrifices real-time latency (which is felt by the user), reliability (Plaid is notorious for having a ~20% fail rate for requests), flexibility (Plaid is notoriously slow at fixing bugs), developer costs (a dedicated DevOps team), and agility (very slow deployment times), not to mention the priceless value of simplicity: a system that can fit in the mind of a single experienced developer, so they can write and test performant code on their own machine.
But hey, Plaid is now a unicorn, so who cares about efficiency.
Yes I absolutely agree. I posted a blog post near the start of the year looking at how much performance automerge (written in javascript using immutablejs) leaves on the table. At the time, automerge took 5 minutes to run a benchmark, and my own code could do the same thing in 0.06 seconds. Since then I’ve sped up my code by another order of magnitude - it can now run the same benchmark in 0.005 seconds (4.9ms).
I don’t want to badmouth automerge - they’re working on it and have made big improvements this year. But that ratio - 5min:5ms - gives you a sense of the mountain of inefficiency we live with on modern computers with a lot of modern software.
Modern CPUs and RAM are obscenely, incomprehensibly fast. With the exception of a few very specific tasks, you should never be waiting for a modern computer to do anything. We sent men to the moon in fewer cycles than it takes for modern Windows to boot, by several orders of magnitude.
Complaining about the global chip shortage is a bit like a hoarder complaining about a room shortage in their overflowing house.
(With carve outs in my scorn for video editing, AAA game engines, AI, etc)
> Modern CPUs and RAM are obscenely, incomprehensibly fast. With the exception of a few very specific tasks, you should never be waiting for a modern computer to do anything.
I really wish we could drive this point home for everyone. The amount of work you can get out of 1 x86 core is insane if you know how to feed it properly. The upper bound is somewhere around 500 million transactions per second per core. I have personally built systems that can achieve somewhere around 14 million per second.
If you can reliably handle the entire serialized workload on a single thread, you start to retool for the idea that you can fit the entire business in 1 box.
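To make the single-writer idea concrete, here's a minimal sketch in Python (the Engine class, the command format, and everything else here are made up for illustration; a real system doing millions of commands per second would be C/C++/Rust with batching and pre-allocated buffers, but the shape is the same):

    # Minimal sketch of the "single-writer" pattern: all business state lives
    # in one process, and one thread applies every command in serial order.
    # Illustrative only -- real systems at these rates avoid Python entirely.
    from collections import deque

    class Engine:
        def __init__(self):
            self.balances = {}   # the entire business state, in memory
            self.log = []        # append-only command log (replay/replication)

        def apply(self, cmd):
            # No locks needed: this is only ever called from one thread,
            # so every command sees a consistent view of the state.
            self.log.append(cmd)
            kind, account, amount = cmd
            if kind == "credit":
                self.balances[account] = self.balances.get(account, 0) + amount
            elif kind == "debit":
                self.balances[account] = self.balances.get(account, 0) - amount

    def run(engine, inbox):
        # The single hot loop: drain commands as fast as the core can go.
        while inbox:
            engine.apply(inbox.popleft())

    engine = Engine()
    inbox = deque([("credit", "alice", 100), ("debit", "alice", 30)])
    run(engine, inbox)
    print(engine.balances)   # {'alice': 70}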
Being able to run everything out of a single computer is an incredible virtue, not some curse that must be cast to the clouds.
> Being able to run everything out of a single computer is an incredible virtue, not some curse that must be cast to the clouds.
Isn't the problem then: what happens when that computer goes down? I'm also on the side of "know your hardware", but the topic of resilience is a different one. Also, perhaps you didn't mean "single computer" in the literal way: it's not incompatible with avoiding the cloud, though.
A good thing with "single computer" is that you have less complexity going into it, so failures will be less frequent. For a lot of cases, that will be enough. It's also easier to build a resilient system starting from a single machine than from a cluster: fewer components to replicate to ensure they're redundant.
Something I've been telling customers for years is: I would rather have a standalone SQL Server Standard Edition instance on VMware than an Enterprise Edition cluster on anything.
Simple means that there are fewer things to break. This isn't just academic. My experience with clusters is that the clustering itself causes more problems than it solves.
(Don't think this is somehow isolated to Windows. ZooKeeper will refuse to start after a power outage for no discernible reason. Etc...)
> My experience with clusters is that the clustering itself causes more problems than it solves.
There's also a whole slew of failure modes that are capable of bringing down entire clusters if a single machine misbehaves. Usually to do with situations where the nodes are reachable, but some resource (I/O, latency, bandwidth) is uncharacteristically slow on one of the machines. I've yet to encounter a distributed system that handles that situation well.
Some people see small twin engine airplanes as safer than single engine. Single point of failure and all that.
The problem is that the two engines of an airplane aren’t independent of each other in the factors that threaten an engine. Same pilot, same fuel. Separate mechanical parts, but that’s a minority of issues.
Most engine-outs in small airplanes are caused by running out of fuel. Your two engines don’t help you there. If anything it’s more likely because of mismanaging the complex fuel system.
The most common phase for loss of an engine is on take off (misfueled, mechanical issue). This is what your comment reminded me of. With two engines, you’re more likely to have a mechanical issue on take off. While it’s generally theoretically survivable, handling one-sided power loss on takeoff requires a quick and counterintuitive response. That requires consistent training, more than most pilots do.
In a small, recreational twin, the most common outcome for an engine misbehaving on take off is it rolls the airplane, taking down the whole cluster.
I can’t echo this enough. At work our DevOps folks started out of the gate with a Postgres cluster (hosted in our K8s infrastructure) and it’s been a nightmare. Luckily we’re still in development mode and haven’t moved any customers to our new architecture. Complexity kills and most of the time YAGNI.
We do need to account for what is required and what is achievable for each use case.
We run an ERP system with >250k self-service users on a single beefy database. There's redundancy in web servers and app servers and proxy servers and whatnot. But in 5+ years and a LOOOOOOT of all kinds of issues -- it's never been that single big LPAR with single big RDBMS that was the issue. And we're not even on a mainframe. And it's not because we got lucky - some single-computer/LPAR/RDBMS stacks simply give us as many 9's as we need for our use-case.
(webservers on other hand - they feel like child's toys. I don't understand how these leaky errory cachy things survive... YOU HAVE ONE JOB! And you're not very good at it :P )
I did mean single computer in terms of 1 active primary machine at any given time.
Negotiate the advantages of this architecture with the business. For the systems I've built, synchronous log replication is a natural extension of the performance abstraction, and provides fairly robust protection when some manual intervention is allowed.
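For what it's worth, the synchronous-replication extension is easy to sketch: don't acknowledge a command to the caller until the replicas have durably logged it. A toy illustration, with Replica.send() standing in for a real network call plus fsync (nothing here is any particular product's API):

    # Toy sketch of synchronous log replication for a single-primary system:
    # the primary only acknowledges a command once every replica has stored
    # the corresponding log entry. Replica/send() are placeholders.
    class Replica:
        def __init__(self):
            self.log = []

        def send(self, entry):
            # In reality: a network call plus an fsync on the replica.
            self.log.append(entry)
            return True   # ack

    class Primary:
        def __init__(self, replicas):
            self.log = []
            self.replicas = replicas

        def execute(self, command):
            entry = (len(self.log), command)
            self.log.append(entry)
            # Synchronous step: wait for the acks before replying to the
            # caller. Losing the primary then costs at most the in-flight
            # command, at the price of some manual failover.
            acks = [r.send(entry) for r in self.replicas]
            return all(acks)

    primary = Primary([Replica(), Replica()])
    assert primary.execute("credit alice 100")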
I built a production grade cluster for fun. It’s only three nodes… without anything installed on my 40-core cluster except k8s and Longhorn, it’s idling at 30% CPU usage. When I started up a managed cluster on Azure, I was throwing hundreds of euros at their logging storage just idling, so I noped to bare metal.
This mirrors my experience with K8s, though one can suggest that the software running on the cluster also played a part in it.
Have you looked into K3s or other lightweight distros of Kubernetes, though? https://k3s.io/
That's what i turned to when i needed to learn the Kubernetes API and explore running clusters, but didn't like my small VPSes (~4 GB of RAM) failing to launch anything because of the frankly insane resource usage of full K8s clusters.
Transaction in this context being "Place order", "Cancel order", "Update user settings", etc.
We need to remember there isn't a 1:1 relationship between instruction/clock rate and the number of actual instructions that are retired per unit time. Pipelining and ILP allow for speedups that range anywhere from 1-100x depending on the specific flavor of application.
Most applications are not intrinsically I/O bound, it is only an artifact of the scale of wastefulness and inefficiency in how they use I/O. Even databases are no longer I/O bound except at the extremes -- memory bandwidth is likely a bigger bottleneck if the I/O is handled competently.
Software being "I/O bound" stopped being a legitimate excuse for the vast majority of applications many years ago. Hardware and software has moved on.
> Even databases are no longer I/O bound except at the extremes -- memory bandwidth is likely a bigger bottleneck if the I/O is handled competently.
> Software being "I/O bound" stopped being a legitimate excuse for the vast majority of applications many years ago. Hardware and software has moved on.
You know, i'm not sure about that.
You can write really performant code, but oftentimes in our world that's increasingly infected by SaaS solutions, you'll be bottlenecked by network calls, be it to an external service or a DB instance.
Sure, your DB driver might be really performant and the actual progress made in the last decades on the DB engines has been amazing, but you can't beat the speed of light.
And, since many still choose external DBs when something like a local SQLite DB would suffice (sometimes a form of cargo culting, sometimes a deliberate choice of risk management profile/failure mode with multiple nodes), that network hop can't exactly be ignored anyway.
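To put rough numbers on that: a local SQLite lookup is basically a function call plus some page reads, measured in microseconds, while any remote DB pays a network round trip (fractions of a millisecond in the same AZ, tens of milliseconds across regions) before it does any work. A quick local-only timing sketch using nothing beyond the standard library (absolute numbers will vary by machine):

    # Rough timing of a local SQLite point query -- no network involved.
    # The point is that it's microseconds, while a remote DB call pays a
    # network round trip before doing any work at all.
    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(i, f"user{i}") for i in range(100_000)])

    start = time.perf_counter()
    for _ in range(10_000):
        conn.execute("SELECT name FROM users WHERE id = ?", (4242,)).fetchone()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / 10_000 * 1e6:.1f} microseconds per local lookup")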
It's not a big deal if you're only running one thing on the computer. The wasted cycles could have been used by a different process to do something else. Multiply this mindset by many services and applications and suddenly the computer is slow when it doesn't have to be.
Most applications do far more IO than what is necessary for the task.
Anyway with a modern SSD I don't think you are right. My disk can do something like 3 GB/s and 100k random accesses per second. At that rate most desktop programs should have the binary and all dependencies loaded to memory in approximately 0.1s. That seemingly doesn't stop a lot of them from taking several seconds to boot. Likewise games, despite usually consuming less than 10 GB of video and main memory combined, can take much longer than 3.3s to load.
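The back-of-envelope math behind those figures, with the sizes being assumptions rather than measurements of any particular program:

    # Back-of-envelope load times at modern NVMe speeds (sizes are assumptions).
    SSD_SEQ_BYTES_PER_SEC = 3e9        # ~3 GB/s sequential read
    app_bytes  = 300e6                 # a desktop app binary + deps, ~300 MB
    game_bytes = 10e9                  # a game's combined assets, ~10 GB

    print(f"app:  {app_bytes  / SSD_SEQ_BYTES_PER_SEC:.2f} s")   # ~0.10 s
    print(f"game: {game_bytes / SSD_SEQ_BYTES_PER_SEC:.2f} s")   # ~3.33 s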
Decompression is fast if you do it right (using anything from the Oodle pack is a good first step towards doing it right).
Hopefully a game doesn't need to parse a whole lot of things; however the data needs to appear in memory, it can be stored that way on disk, possibly compressed with the aforementioned fast compression libraries.
There is nothing special about loading data onto the graphics card, if you know what you are doing it is just copying data, and no point on the path is slower than disk.
I know there are games that somehow rely on thousands of shaders. Like most games you can just not do that, or you can store the compiled shaders so that the compilation only needs to be done on first run.
An instruction is not a transaction. Commodity x86/x64 systems do not implement transactional memory or any model of computation that could be reasonably described as transactional as far as I'm aware.
I can’t tell you how many times I’ve worked with people who look at a flat perf chart and decide that since they can’t see a tall tent pole there’s nothing more to be done. If the business insists that customers are complaining, they just show the charts as if that explains anything. Charts don’t absolve you. They just mean you’ve given up without learning much at all.
Flame charts might have helped a little, but it’s still largely the same problem.
In my Google interview, after I knew they weren’t interested, they asked me what the hardest thing in software is. I told them one of my more controversial beliefs because 1) fuck it, and 2) sometimes these seeds grow in the most unlikely of places. And that is that managing the C constant is the hardest thing we do. Paraphrasing an old, old manager: anyone can do the first order of magnitude, the second is hard, and the third is beyond many people.
It’s not an intellectual process. It’s discipline. Pure, concentrated discipline to keep doing the hard work to figure out all of the tiny XY problems in your code that have a performance penalty even if it’s just a factor of two or three. And once you’ve found them, care and organization to not create huge externalities for others. Refactoring doesn’t keep you from making regressions, it just saves you 9 times out of ten, 99 out of 100 at the most. You’re gonna break an egg or two per year.
I have a sense that it is very rare for people who start out writing very inefficient code to have much of a plan to make it faster. Some don’t even have a plan to make it right, let alone make it fast. If things are as you say, then more than likely automerge will need new leadership from new contributors, and/or it will always lag behind by an order of magnitude.
> If the business insists that customers are complaining, they just show the charts as if that explains anything. Charts don’t absolve you. They just mean you’ve given up without learning much at all.
This mirrors my own experience in the industry.
When clients complain that loading some data in a CRUD app takes them upwards of 6 seconds in a bespoke ERP system, they're met by the blank stares of some devs. A cursory review of the code reveals that what could have been a single SQL query is instead a bunch of service calls which ends up being a textbook example of the N+1 problem. And when asked about this, the devs respond with: "But it's easier to use service calls to get the data because of code reuse, working with complex SQL queries is harder."
Well, you picking the easier way out caused the performance to be bad in this case, especially because your needless queries lead to degraded performance for all of the users of the app.
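For anyone who hasn't seen it spelled out, the N+1 shape looks roughly like this (a hypothetical orders/customers schema and an sqlite3-style connection, purely to illustrate the pattern):

    # Hypothetical schema: orders(id, customer_id), customers(id, name).
    # "conn" is an sqlite3-style connection; the shape is what matters.

    # N+1: one query for the list, then one more query per row.
    def load_orders_n_plus_1(conn):
        orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()
        result = []
        for order_id, customer_id in orders:      # N extra round trips
            name = conn.execute("SELECT name FROM customers WHERE id = ?",
                                (customer_id,)).fetchone()[0]
            result.append((order_id, name))
        return result

    # Single query: one round trip, and the DB does the join it was built for.
    def load_orders_joined(conn):
        return conn.execute("""
            SELECT o.id, c.name
            FROM orders o
            JOIN customers c ON c.id = o.customer_id
        """).fetchall()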
But how could you even get to that point? Because the other devs who did the code review were struggling to even review the overcomplicated business logic in the new code. Because they didn't care enough to ask about the architecture and the N+1 problem, or because they simply didn't see it in the overcomplicated codebase. Because the merge/pull request description is literally blank, since no one cares about keeping track of ADRs (architecture design records) or even the historical context for certain changes. Because there has been very little care put into load testing in any capacity so far, because the system doesn't lend itself nicely to testing or even setting up new environments, any attempt at which is met with resistance from ops. New features mostly get prioritized over maintenance, addressing technical debt or solving these longstanding issues, from both the client company and the management side. And when things inevitably do break, the devs are blamed for seemingly not doing all this on their own time.
That disconnect between what should be done if anyone actually cared about software engineering and what is actually done in the industry irks me greatly. I hate the short term thinking that makes systems perform badly in the long term and hard to maintain. That is not sustainable in any way whatsoever.
Luckily, i've basically just said "no" to the above and for the past N months have been working on modernizing the app, adding proper APM instrumentation and monitoring, improving environment setup whenever possible (enterprise DB is still an issue, everything else currently being managed with Ansible) and introducing containers for easier runtime management and health checks with restarts/load balancing, while also writing unit tests (some of my new code has 100% coverage, actually) and addressing technical debt.
At this point, i don't care if it earns me scorn, pressure or if i get fired because of it down the road - i'm an engineer (with a degree that says exactly that) and that's what i'll do, asking neither permission, nor forgiveness (while clearly communicating what i'm doing as a matter of fact). Sure, there are times when you have to cut corners and ship things or help others with sub-optimal solutions, but i wish that more people in the industry didn't waver under pressure from management and didn't choose to lie about estimates just to please them.
I have asked, romanced, and forced better process on different teams over the years and the fact is that some people just have never had anything better than chaos or bureaucracy and so they don’t know the level of accidental pain they’re carrying around until they get a taste of something better.
In a couple instances a former coworker told me after I left that the boss tried to walk back some of my changes and was surprised that the majority said no. Our theory is they thought I was being mean, rather than tough love.
I keep joking with my beer group that I am going to hire a personal trainer just to have them talk me through how they get people to do things that are good for them but painful in the short term. There are some fundamentally different thought processes between people who have never been “in shape” and those who aren’t now. I know my “reasons” are excuses and I know what the carrot tastes like. For some that carrot is too abstract and sounds maybe a little culty.
> There are some fundamentally different thought processes between people who have never been “in shape” and those who aren’t now. I know my “reasons” are excuses and I know what the carrot tastes like. For some that carrot is too abstract and sounds maybe a little culty.
Sure, part of it could be due to the experiences that shape people's opinions. For example, i've stood in a governmental building and have seen queues forming because people could not receive the healthcare services that they needed in a timely fashion, all due to a system not working. I was called in to help and admittedly there was a lot of satisfaction/relief after i pulled out the non-functioning guts of the system and replaced them with something a bit more stable.
I think that developers could benefit from seeing how their code runs out in the wild, or even dogfooding their own projects. Furthermore, actually getting to be a part of a good team that's not constantly fighting some fires is definitely an eye opening experience!
Similarly, that one situation made me reconsider how much impact my decisions could have down the road for both others and myself. Do i really want to skip out on writing unit tests now and then spend a week fixing some rather bad bug whilst also trying to repair the data that has been damaged by it? Should i really approve these code changes, if it seems like the implementation could cause resource usage issues, without even doing a few benchmarks first? Is saving 15 minutes by not writing docs worth wasting 2 hours rediscovering all of the functionality when something breaks 4 months later and i'll have no recollection of this code?
Of course, i'm not infallible myself, far from it. Thus the balance between "strong opinions that are loosely held" vs "loose opinions that are strongly held" is hard to find and i have to constantly keep exposing myself to new technologies and approaches, even if i don't always agree with them. Ideally, i also have to make a large number of mistakes, but in environments where their consequences are minimized, such as side projects and prototypes, so that i may learn from them.
Here's hoping that i won't lose the strength and discipline to continue doing that for the following decades, and leave all of the software that i work on in a better state than it was when i started.
There's an underlying element that isn't explored enough here: constraint is often useful (a rephrasing of "necessity is the mother of invention").
The prevailing sentiment these days seems to be that developers don't need to reason about performance "because we can just scale it." It might be useful, or at least informative, to start with "make your app run well on a Raspberry Pi" or an equivalent low-power system, and then consider your performance from there.
Understanding the performance journey that these financial exchanges went through has saved me a lot of time and frustration when attempting to build similar high-performance software.
When I see these stories about horizontal scaling on thousands of nodes, I wonder how many people don't know how much performance can be squeezed out of beefy machines. At my job we build probes for network monitoring, and we're able to ingest quite a lot of traffic (up to 14Mpps per link, for a fully saturated 10G link in our hardest tests), ingest a good volume of logs or metrics from remote systems, do some extra processing and serve a web application in just a single instance. We use C/C++ for the traffic capture and processing, Postgres/Elastic for storage (depending on the balance of performance/flexibility), some AWK for quick file processing (you'd be surprised how fast it can be) and then Python for plumbing code.
On the other hand, I've seen technical blogposts and whitepapers on how to optimize a "big data ingestion system" that requires clustering and tweaking and whatnot for a fraction of what we're able to do with a single machine. People should try harder to scale up before starting a cluster and scaling out.
I was in a discussion on HN just yesterday where the assertion was made that startups should default to the cloud because the upfront cost of allocating real hardware is too high. I’ve got a $20 Netcup server that can probably handle the traffic of 90% of any projects I’ve ever worked on (including my stint at big tech cos). It’s amazing how cheap and fast modern computing is when compared to the standard cloud offerings.
... but netcup is just another cloud offering? What are you getting at here? I mean, it always makes sense to shop around and look at pricings before choosing one, but my unresearched guess is that there are so many providers nowadays that the pricings across all of them are within cents / dollars for the same service (based on RAM, storage, etc)
> When I see these stories about horizontal scaling on thousands of nodes, I wonder how many people don't know how much performance can be squeezed out of beefy machines.
Beefy machines don't give you redundancy though. Beefy machines arguably make redundancy more expensive.
If you run on a single beefy VM, you need at least two of those running for redundancy, meaning you're 100% overprovisioned (m6a.32xlarge x2 = $11/hr).
If instead you run on 4 smaller instances split across multiple regions (or AZs or whatever), you only need to overprovision by 25% if you want to handle a single failure (m6a.8xlarge x5 = $7/hr).
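The general rule behind those numbers: if N equally sized instances carry the load, surviving the loss of one means running N+1, so the overhead is 1/N. A quick sanity check of the arithmetic (the instance prices above are the parent's figures, not mine):

    # Overprovisioning needed to survive losing one instance, as a fraction
    # of the capacity actually required (N instances carry the load, +1 spare).
    def overhead(n_needed):
        return 1 / n_needed

    for n in (1, 2, 4, 8):
        print(f"{n} instance(s) needed -> +1 spare = {overhead(n):.0%} extra capacity")
    # 1 -> 100% (two big boxes), 4 -> 25% (five smaller ones), and so on.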
Obviously if you're doing offline or non-latency sensitive processing, that doesn't matter.
Conversely, since no one is actually doing true resilience anyway (just look at how many services failed when one AZ in us-east-1 went down), why not just start a new machine with the same code and data (either through active replication or shared hard disks), take the 1h downtime until that's up and running, and be golden?
You can even deduct it from your customer's monthly bill.
But if you have 20 of them then the frequency of problems goes up.
In the context of this conversation, where people are bringing up situations with 100’s of servers, 4 servers still counts as vertical.
For AWS, where you can lose an AZ (and lately that feels like a certainty), there are arguments to be made that 3 or 6 servers are the least you should ever run, even if they sell a machine that could handle the whole workload. You should go down 2 steps and spread out. But don’t allocate 40+ 2xlarge machines to do the work. That is also dumb.
What is also screwing up the math here is that AWS is doing something untoward with pricing. With dedicated hardware, for the last six decades, you have a bimodal distribution of cost per instruction. The fastest hardware costs more per cycle, but if your problem doesn’t fit on one (or three) machines you have to pay the price to stay competitive. At the other end the opportunity costs of continuing to produce the crappiest machines available, or the overhead of slicing up a shared box, start to dominate. If you just need a little bit of work done, you’re going to pay almost as much as your neighbor who does three times as much. Half as much at best.
I don't really think that these ideas are mutually exclusive. The example given here ( https://news.ycombinator.com/item?id=29660117 ) discusses 4000 nodes running extremely inefficiently. Depending on how inefficient it could be, you could replace it with multiple beefier servers placed redundantly in different zones. (And if it's _really_ inefficient you might be able to replace it with an order of magnitude fewer servers that aren't beefier at all.)
I think really the point being made is to efficiently use your resources. Of course some of the most expensive resources you have are programmers so inefficient hardware solutions may end up cheaper overall, but I've personally seen solutions that are architecturally so complex that replacing them with something simpler would be a big win across the board.
Depends on what kind of redundancy you want. You can achieve quite a decent uptime for a single instance with local redundancy, process health watchdogs to restart things, crash isolation… Yes, some people will need a cluster with no downtime on failure events, but the clustering itself adds complexity and can even increase downtime if it causes problems. For example, a single machine can’t have load balancer issues, or replication errors.
Also, you’re assuming an active-active redundancy scheme, which might not be always necessary, and not counting the cost of the elements required to support the cluster.
I get the feeling that for non-interactive workloads you want to go as big as you can get. There are many problems you can just crunch through using a computer engineer’s cleverness instead of overdoing it with your own. These are usually throughput problems and the solution domain is distinct.
When humans are involved you don’t want a single failure to be felt by everyone at once. And they get grumpy when we don’t pretend their button click is the most important thing we will handle today.
> they were able to just throw $350k a year at AWS to make it work (and millions more at their dev team)
Throwing $350K/yr at AWS sounds like peanuts to me, to be honest. How many extra engineers would you hire if you could completely avoid $350K/yr in expenses? One?
There are other, totally valid, reasons to understand your computing environment, but a mid-six figure bill isn’t one of them for anyone with millions in engineer payroll.
In the expensive tech hubs you can expect to pay at least $150k/year in salary for a competent engineer. Add all the admin overhead and other per-employee expenses (1.5x multiplier?) and 350k/year doesn't buy you anywhere near two full-time engineers.
But sure - given a tech-heavy environment, $30k/month for a competent, senior engineer with an in-demand skillset sounds like it's in the ballpark.
My primary estimation is that one engineer for a startup can create more enterprise value than $350K/yr by working on the product rather than optimizing hosting cost.
I was taught to estimate burdened (i.e., with company's cut of taxes, healthcare, cost of tooling and cubicle space, administration, etc.) engineering time at $100/hr on average. More recently I've learned the average burdening correction is 130% of salary.
By these estimations, you get one engineer with $150k left over (not quite enough for a second), or two $130k/yr engineers (respectively).
Edit: This is not to imply that 130% is the "correct" number for your industry or area, only that it's what we use in my neck of the woods.
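Spelled out, the two estimates above look like this, assuming ~2,000 billable hours a year for the hourly figure and reading "130% of salary" as a 1.3x burdened cost:

    # Two rough ways to burden engineering cost, per the estimates above.
    aws_bill = 350_000

    # (a) $100/hr fully burdened, ~2,000 hours/year
    hourly_burdened = 100 * 2_000        # = $200k/yr per engineer
    print(aws_bill - hourly_burdened)    # $150k left over: one engineer, not two

    # (b) burdened cost ~130% of salary
    salary = 130_000
    burdened = salary * 1.3              # = $169k/yr per engineer
    print(2 * burdened <= aws_bill)      # True: two $130k engineers just fit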
I was taught $100 an hour back around ‘01. Inflation calculator is telling me that’s $155 today, and that probably doesn’t represent real estate and Dev salaries accurately, both of which have outpaced inflation.
And how many more engineers were needed just to manage that monstrosity? It's a virtuous loop: more money justified to pay engineers, to build more overly complex and inefficient systems that need even more engineers to maintain.
The money isn't the issue, as I pointed out. The issue is that such ridiculously complex systems are less agile and harder to evolve and maintain, and it's hard to put a price on that.
> This sort of approach sacrifices real-time latency (which is felt by the user)...
By far the biggest contributors to latency in user-present operations were either latency in the banks' responses or the "stuff" required to integrate with the long tail of banks. In fact we found that latency below ~5 or so seconds was _too_ fast for users to trust; they didn't feel confident that it was working if it returned too quickly. As time wore on, latency grew, but it wasn't due to technical issues - it was bank caused.
> ...not to mention the priceless value of simplicity, a system that can fit in the mind of a single experienced developer...
I've got to say that the bank integrations were _amazing_ to work on. The tooling was incredible and it was super easy to work with. If there was a problem with "a bank", you could spin up the entire bank integration stack very simply, took maybe 10 seconds, less if you kept it up to date, and plunk away.
Your point stands for much of the rest of the stack, pretty complicated.
> ...agility (very slow deployment times)
I think it was on the order of single minutes to reach each environment, I don't recall deployment times being onerous.
> ...reliability (Plaid is notorious for having a ~20% fail rate for requests)
Business decision and the difficulty of building on ever shifting sands (banks change things), the problems are not technical.
> I've got to say that the bank integrations were _amazing_ to work on. The tooling was incredible and it was super easy to work with. If there was a problem with "a bank", you could spin up the entire bank integration stack very simply, took maybe 10 seconds, less if you kept it up to date, and plunk away.
Care to elaborate? From my experience as a Plaid customer, bank integrations seem extremely finicky, unpredictable and buggy. I have support cases going back months which should've been simple fixes (the bug was acknowledged, yet not fixed).
> I think it was on the order of single minutes to reach each environment, I don't recall deployment times being onerous.
I'm referring to the Plaid blog post linked above, which claimed that deployment times were slow.
> Care to elaborate? From my experience as a Plaid customer, bank integrations seem extremely finicky, unpredictable and buggy. I have support cases going back months which should've been simple fixes (the bug was acknowledged, yet not fixed).
For some of my time I worked on the support team, so trust me, I feel your pain! I may have even been the one to say, "Sorry, no ETA" :( I share in your frustration with slow bug fixes as well, I was always banging the drum to get more engineers on the problem, but it is definitely worth appreciating the scale and complexity. 10,000 banks, each bank has tens to hundreds of account types, banks get bought, merged, change their backend. Brokerages also carry a gigantic chunk of complexity - you'd think stock tickers are standardized across brokerages, but absolutely not! And pulling consistent security prices, even just end of day, is very complicated.
You start to do the scaling math and you quickly realize that you can't fund _every_ bug fix, it just isn't feasible. With that said, do they fund it enough? My opinion is no, more resources should be assigned to fixing bank integrations. But I don't cut the checks ;)
re: deploys - The deploy times were slow for a while, I forget for how long; they grew to maybe 20-30 minutes? But for most of my time they were _fast_. And you got a nice Slack message when they hit different envs, so you could be doing something else and come back to watch the outcome/test when you got the ping.
I appreciate the scale of the problem, but that is Plaid's entire business. Bank integration is literally their product, and so if they don't fix bank integration problems then why does Plaid exist?
I wanted to know how the bank integration process works, technically speaking. Was it a Selenium/Puppeteer setup?
> Performance efficiency is often an order of magnitude out of whack for companies that prefer a horizontal-first scaling approach (i.e. lots of containers in the cloud, k8s, etc.).
This is a curious point, considering that removing the constraint of having to make a single instance work well could lead to relaxed (read: lazy) development practices and therefore suboptimal performance that needs horizontal scaling to remain viable.
That said, i also enjoy sleeping during the nights and not getting paged, so running a single instance monolith makes little sense if that's a goal of mine. Horizontal scaling gives people the ability to deal with short term problems that would otherwise be escalated by just bumping up the resources that are available to the deployment and the failure of individual instances is no longer a showstopper.
In my eyes, the best approach would be to do both - expect everything to work with just a single instance (with proper integration and load testing, possibly chaos engineering just to be sure), but run multiple instances in practice (with the exact same tests as before, plus additional ones for Byzantine failures and unreliable networks).
Why can't we take the best of both worlds - a focus on the quality of engineering, but also as much scalability as needed? Sure, you might disagree, saying that personnel expenses dwarf those needed for just running sub-optimal code on multiple nodes, but my salary is such that you could hire a team of people like me for the cost of 1 developer in the US.
And yet, ignorance about the performance and how optimized code is in many "business centric" domains (e.g. web development vs OS development) seems universal. No one wants to pay for nearly perfect code in 5 years, but rather good enough code in 2 years.
You have to be careful with Node because, being single threaded but highly async, you could be running anywhere from ten requests per process down to three processes per core and one request per process, depending on workload.
But I’m in a similar boat, and when I think about our ratio of cores to requests per second, it’s scandalous.
When a problem is big and the patterns are set, there’s not much you can do, and the people I flag as potential recruits have a habit of quitting instead of diving onto this grenade. I have learned all new ways that some known (to me) anti patterns can screw you over.
And it grates that Tony Hoare said this 40 years ago:
I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.
If something stays slow it’s because someone worked their butt off making sure that none of the deficiencies were obvious.
Plaid is web scraping banks as a service, so I think it was good of them to use Node and to prioritize dev speed & hiring rather than to make a well tuned native monolith built by more expensive developers in a stack with slower iteration speed. All that buys you is a relatively smaller hosting bill in exchange for slower dev speed, for something that isn't really compute, IO or storage bound at all. You make the well tuned pieces of software when you're Dropbox writing core I/O-bound infra in Rust, like Magic Pocket.
What is the cause of the failure rate for Plaid? Is it their tech stack, or the fact that they are constantly fighting with banks' anti-fraud systems to make their fundamentally fragile web scraping architecture work?
> 4000 Node containers [1] running in parallel to handle just as many requests per second.
This looks crazy to me at first glance. 4k requests per second, for any reasonable sized request, should be easily handled by a single C thread. With plenty of room to grow.
1.) Rush to build. Backpressure and deadlines lead to "let's just spin it up the easier way".
2.) We said "premature optimisation is the root of all evil".
This ended up with devs not thinking about performance at all because "hurr durr" premature optimisation. I've witnessed this mentality causing huge performance problems: the simplest decisions, ones that would be normal in other circumstances, snowball into a huge perf issue. Exactly as the title says, best practices can slow your application down, especially when used out of context. Some include "I'll serialise this into json and deserialise it again 5 lines later", "I'll use a regex to parse this simple string", "I'll nest 3 map calls here on a list", "I'll find it later in the list again".
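A couple of those in miniature, contrived on purpose to show how innocuous each decision looks on its own:

    import json, re

    records = [{"value": i} for i in range(3)]

    # "I'll serialise this into json and deserialise it again 5 lines later"
    payload = json.dumps(records)    # pointless round trip through text...
    records = json.loads(payload)    # ...when the objects never left the process

    # "I'll use a regex to parse this simple string"
    key, value = re.match(r"(\w+)=(\w+)", "mode=fast").groups()
    # versus a plain split doing the same job with far less machinery:
    key, value = "mode=fast".split("=", 1)

    # "I'll nest 3 map calls here on a list":
    # three passes over the data and two throwaway intermediate lists
    xs = list(range(1000))
    out = list(map(str, list(map(lambda x: x + 1, list(map(lambda x: x * 2, xs))))))
    # versus one pass doing the same work:
    out = [str(x * 2 + 1) for x in xs]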
3.) People forgot/never learned what the stack is.
Honestly, when people say software was harder to write back in the day, I'm never sure if I agree with it or not. On one hand, yes, we had fewer tools and materials at our disposal. But at least the thinking through it was easier: with the small set of tools you had, you could go up and down the stack in your head and know that if you do X, Y and Z, that's what would happen. Nowadays, we have so many tools and layers that it takes a huge amount of knowledge, time and mental energy to reason about your whole stack. We have people specialised in tiny subsets of the stack, so getting performance squeezed out at the overall stack level requires people who understand more of the stack, which is getting rarer and rarer by the day.
4.) We have lost touch with the power of hardware.
Consider that most senior devs are from an age where hardware was way less powerful, and in the last 20 years the power and availability of hardware has increased so much that it is hard for us humans to grasp - especially since the machines we use on a day-to-day basis still work nearly the same. Yet we hear all these stories about hundreds of lambdas serving thousands of requests, so most devs don't even have an idea of how much a single machine can serve, but picked up an "oh it's cloud scale" mentality. Combine this with a "fear of low resources" and the ease of spinning up a new machine when resources are low, and the answer is obvious - spin it up.
5.) We still think linearly.
People don't think exponentially most of the time. We don't assume that a line of code will net us 30x speed improvements. We don't assume this loop will go quadratic. We tend to ignore tiny accumulations until it's too late. And when met with the performance wall, we discard optimisation because it's hard to believe that we can easily 10x or 20x the speed, since in our heads "it's just unnatural". You can't change tires on a car and have it go 2000mph suddenly. But with code you can - that's the problem.
I always like to add the whole quote: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
I see it way too often that someone wastes way too much time optimizing code that is run only once with like 3 items. So I still think "measure, then optimize later" is still true. With today's abstractions no one can really tell what will be a bottleneck - even just the CPU has an insane amount of complexity going on with reordering and branch prediction. Even the seemingly faster code can surprisingly be an order of magnitude slower.
> So I still think "measure, then optimize later" is still true.
This is a huge issue because measurement is difficult. There are also a lot of developers who are bad at measuring and interpreting the results when they do measure.
So you end up with people writing unoptimized or just plain slow code with // TODO: Add measurements. Then complicate the whole thing by simply accepting bad performance and scaling horizontally and paying through the nose.
> 4.) We have lost touch with the power of hardware.
I think this is actually a key to solving all of these problems. If you target (at least during testing) low power bare-metal hardware, a raspberry pi 2 or whatever, then performance problems become immediately obvious.
If you put a web server on a pi, it should be able to serve dozens, maybe even hundreds of requests per second. If it can't, then odds are you are doing something wrong.
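Even the Python standard library's single-threaded http.server, which nobody would call optimized, is a decent baseline for that experiment; if a real app lands far below what this does on a Pi, the hardware isn't the problem. A minimal sketch (the port and response body are arbitrary):

    # Minimal baseline web server, stdlib only. Even on Pi-class hardware this
    # single-threaded, unoptimized server should manage the "dozens to
    # hundreds of requests per second" mentioned above.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"hello from a tiny machine\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass   # keep per-request logging off the hot path

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()

Point any load generator you like at it to get a baseline for your hardware before blaming the Pi.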
But since there aren't unlimited people, the increase in exponential growth doesn't matter much. The same number of people are going to be exposed and it will happen only slightly quicker.
> often an order of magnitude out of whack for companies that prefer a horizontal-first scaling approach
This is practically worth its own thread, so I will.
I think it’s the principle of the excluded middle biting us on the ass again. Vertical first risks having people assume shared state across events. Everything collects in memory, because look at all of the “unused” memory that nobody will complain about me using for my dumb feature because nobody is looking yet (when the moments are still teachable).
To your hardware point, many people don’t understand how that “free” memory is used by the OS to reduce mean response time. So they overload the machine and don’t bother trying to do anything until they hit 80%, when the penalties have started racking up at 60%.
So we shall force the issue by giving them tons of tiny machines, as if going to the other extreme ever fixed a goddamned thing.
You don’t want tall. You don’t want wide. You want square.
I saved us almost 10% on hardware by swapping in larger VMs and fewer of them. We don’t have a homogenous workload, but a good fraction is, so larger machines can share a bit of OS cache and statistical clusters of slow events can be load balanced better, improving p95. But at some point the performance improvements we’ve made will take us in the other direction, because then we’ll have too high a fraction of traffic on a single VM. That’s not a problem by itself, but could be half of a production incident.
You bring up a good point: solving the business problem is more important than the technical implementation.
Software and SaaS is a very high margin business compared to storing widgets in a warehouse and selling them in a retail store. So, you're right: who cares about efficiency?
(Well, it is important, but it isn't always the top priority.)
Thing is, k8s is not about scaling out. A core part of it is about efficient usage of the hw you have (which also maps well to avoiding an expensive cloud bill).
The more I optimized, the more I could leverage k8s to optimize hw usage.
I agree with you. It will efficiently allocate resources for an inefficient application. But when your system can't handle the load anymore, you'll have to allocate more resources and k8s will handle it for you like magic.
It's so easy that people can use thousands of machines to handle the load. This, combined with "hardware is cheap, dev time is expensive", makes it common to see those kinds of very large systems.
Well, yes. Except I get to live with "hardware is more expensive than dev time", so I tend to cheat like a proper dirty cheating wizard with k8s' binpacking ;)
(my motto is "what do you call a company using Heroku? Bankrupt" :P)
It can't fix what's inside your containers, but it can fix how efficient you are at packing your workloads onto available hw. I.e. given a set of workloads and a set of available hw, k8s will help you find the maximal amount of workloads that you can fit.
vs. for example manually trying to fit single-purpose VM instances in EC2, with inevitable overheads, etc.
Before anyone wants to criticize Stack Overflow, consider their server footprint as of 2016 [1]:
4 Microsoft SQL Servers (new hardware for 2 of them)
11 IIS Web Servers (new hardware)
2 Redis Servers (new hardware)
3 Tag Engine servers (new hardware for 2 of the 3)
3 Elasticsearch servers (same)
4 HAProxy Load Balancers (added 2 to support CloudFlare)
2 Networks (each a Nexus 5596 Core + 2232TM Fabric Extenders, upgraded to 10Gbps everywhere)
2 Fortinet 800C Firewalls (replaced Cisco 5525-X ASAs)
2 Cisco ASR-1001 Routers (replaced Cisco 3945 Routers)
2 Cisco ASR-1001-x Routers (new!)
Assuming they aren't lying, this is the entire list of machines needed for a site used by the world's developers. That's no small feat. Many companies whose developers blog about their "best practices" would struggle to achieve 1/10th as much efficiency.
I hope this is obvious, but best practices can often intentionally sacrifice performance for the sake of other aspects. The fact that breaking them can give performance gains shouldn't come as a surprise and can be the right thing to do.
My reasoning is that they should only be broken if you understand the reason for them, how it applies to your situation and the implications (or if nothing's at stake, which can be a great learning experience as to why those recommendations are made).
The part that's not obvious to those applying best practices is doing so without such a similar understanding. It is a shallow appeal to authority of the 'best'.
Even before breaking them, I think one of the bigger problems with best practices in practice is that many people have forgotten, or never tried to understand, why they exist in the first place, what problems they are trying to solve and how, their scope, and what trade-offs you are making by implementing them.
Instead they get treated as magic wands swung around promising to solve all the worlds problems if you just believe in them strongly enough.
Performance has almost always been about violating best practices, because a simple happy path isn't usually suitable for hot traffic (e.g. bulk insert vs insert into).
Back in the early 2000s, when the traffic just started to hit, people were de-normalising databases just to make sure stuff could even work.
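The bulk-insert point in miniature (SQLite only because it's in the standard library; with any networked RDBMS the gap widens further, since each individual statement also pays a round trip):

    # Row-by-row insert vs. a single batched insert. Both shapes are run here
    # purely for illustration; the second is the one you want on a hot path.
    import sqlite3

    rows = [(i, f"name{i}") for i in range(10_000)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")

    # The "simple happy path": one INSERT per row.
    for r in rows:
        conn.execute("INSERT INTO t VALUES (?, ?)", r)

    # The bulk path: one statement, one batch.
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)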
On the other hand, I worked for a company that seemed to follow Stack Overflow's article (sparse tests, unmockable static fields) that made everything a nightmare and slowed velocity almost to a halt, and provided a brittle customer experience.
I can quite believe that caching SQL results saved time (that's hardly contrary to "best practice"). The waters become murkier when you say you can't afford basic abstractions to make things testable.
They provide no statistics on how much time their "optimisations" actually saved. Likely, their optimisations wouldn't have made a difference, as they would have been JIT'd away anyway.
I am very much of the opinion that if you're writing code without some sort of test, you're writing legacy code.
While they do not provide any benchmarks in the summarizing article, I can guarantee that they have done some (read: a lot of) profiling. I follow both Nick Craver and Marc Gravell on Twitter, and they regularly post benchmarks for optimizations done at SO; in the .net world they are some of the most knowledgeable people on performance optimizations.
> (sparse tests, unmockable static fields) that made everything a nightmare
I had the same thought. I've noticed over the past few decades that most developers migrate toward static fields (i.e. making everything global) not for performance reasons, but just because it requires less up-front thought. And when the only thing that shows up on your year-end performance review is "did you hit the date?", this is almost understandable. Maybe the folks at SO really are that good - they can code everything right the first time and there's no particular need for unit tests - but I'm used to finding bugs that have been in the code for YEARS that would have been trivially caught if the code had just been run even one time under observation. I'm also used to code that's so monolithic (statics that import other statics that import other statics) that it takes months for new hires to get any sort of proficiency in - and that's if they're really experienced to begin with.
It wasn't really an option when they started. Nginx couldn't and still can't run ASP.NET applications. You could maybe put the application in a container, or run it directly with the new versions of .NET, but that requires reworking the entire application. There's not really a lot to gain, not if you already know IIS.
I wonder to what degree they have automated the configuration of IIS…? From my POV it seems really tricky to automate. But I am by no means an IIS expert, I just have some clients that use it.
You can automate IIS configuration with Ansible, that's what we do.
Mostly the issue is documentation; you sort of have to know what things are called, but you can get around that by using win_iis_webapppool to get the configuration from a running IIS server and then work out which properties you can set.
There might be things you can not automate, we don't use that many advanced features.
Edit: I wasn't being fair. Ansible is actually pretty good at managing Windows servers. It's just different from managing Linux servers.
It's no small feat, but the nature of SO is very cache-friendly (content focused, reads >> writes). If you're building a modern SaaS, dealing with the same amount of traffic as SO, you might find a hard time with the same setup.
Am a firm believer that a billion dollar company (SaaS, etc.) can be run with a very sane and simple stack:
- 1x web-server [$FOSS_or_COSS, iis] # COSS => commercial open source, e.g. caddy. nginx is both
- 1x application-server [.net, java, python, php, and all other langs. heck PERL too!]
- 1x database-server [$FOSS_or_COSS, sql-server, oracle, etc]
Virtualized or baremetal. Containers not necessary!
Increase it by 2 or 3 for HA if you so desire.
Engineering teams should aspire to advocate "we run this entire thing on these few resources" vs "our application runs on 26k nodes". The latter just means we've lost the plot and are throwing money at the problem. Forgive my sarcasm here, but well done k8s && containers!
Don't see any reason why FB can't run on comparable hardware to SO. I.e. replace the CCSS (commercial closed source software) bits with the FOSS/COSS ones. And when we take it in as a whole, this is much greener/more sustainable to work with.
Also, there is a legend out there that I heard of many moons ago :) ... THERE is a billion dollar company running on a single sql-server and single iis web-server
If you read further in the article there appear to be 6 SQL Server machines (across 2 clusters). Those are some highly spec'd servers with huge amounts of RAM and CPUs. With decent caching the SQL Server machines wouldn't be hit too often either, hence the Redis and Elasticsearch servers.
Fewer, but larger DB servers would typically mean much less maintenance and dev ops headaches.
As others have said, skipping best practices is typically reasoned out really well before proceeding. Not everyone needs every best practice.
If they'd used open-source software rather than the Microsoft stack, they could have scaled horizontally without having to worry about buying more software licenses.
Personally I'd rather manage a small number of large servers, compared to managing a large number of smaller ones.
The licensing costs associated with Stack Overflow also can't have been that large. Sure, Microsoft software isn't cheap, but it's also not that expensive. The licensing cost is easily offset by staying on a technology stack their employees know.
> The licensing costs associated with Stack Overflow also can't have been that large
Because we went with the reliability of a full Microsoft stack—.NET, C#, and MSSQL—our costs grew with the number of instances. Each server required a new license
They literally mention license costs as an architecture driving constraint.
Sorry, I wasn't making that clear. It's not that it isn't a factor - they clearly think it is. The question is how big a factor is it really? My experience is that the license is less of an issue, compared to the time and resources you'd otherwise need to redesign everything.
When Jeff Atwood and Joel Spolsky started Stack Overflow they had a podcast. In that podcast Joel repeatedly argued that hardware wasn't that expensive, compared to Jeff's time. His argument was that buying a larger server and scaling that way was much more cost-effective. I think that's still true, to a point, even if you didn't have the licensing cost.
> [...] buying a larger server and scaling that way was much more cost-effective. I think that's still true, to a point, even if you didn't have the licensing cost.
I don't know if this still holds true in the cloud world, considering that (at least in AWS) if you scale vertically 2x you usually get a 2x price increase. Clearly I'm not factoring in the cost of making an app horizontally scalable or DB-shard aware (which I think was always the main selling point for SO's vertical scaling: they usually referred to the DB, and they were against sharding).
> Because we went with the reliability of a full Microsoft stack
As opposed to what? What reliability are they worried about trading off?
Don't get me wrong - I love C# and .NET, and MSSQL is a fantastic RDBMS. And if your people already are familiar with this tech stack (which I think is the case here), it makes sense to play to strengths.
But let's not pretend that this stack is the more reliable option compared to any other alternative...
That was just a few years after Postgres became competitive.
Postgres was already a better choice than Oracle or SQL Server in almost every way (the clustering setup UX sucked). But very few people were aware of that, and they would have had to learn a completely new stack to use it.
I don't know if SO is using everything that comes with MSSQL, but it's a lot more than just a RDBMS. SSIS for example is a full workflow/job execution environment fully integrated with MSSQL. When I did a similar analysis to move off of MSSQL in the past, there is a lot more to it than just swapping out databases.
SSIS, MSAS and SSRS are minimally functional products that exist to stop MS customers from looking elsewhere. If they are adding value to you (instead of just being demanded from on high and never used), it's better to look at the alternatives, sometimes even if you are already paying for them (that one is a difficult decision, as the others are kinda expensive).
Anyway, the Microsoft implementation of data workflows is so stupid that you are better off with manual dblinks or creating the entire thing in a general purpose language. Their largest competitors share a lot of the same problems, so I'd say that workflow software is just an excuse to hire low-paid professionals and a real drag on any working development team.
I worked at the MySpace parent company during the whole growth explosion. Microsoft would send us custom MSSQL patches to fix issues the team was running into. Sure, someone could have patched an open source database but not everyone wants or needs to staff engineers of every category.
> The licensing cost is easily offset by staying on a technology stack their employees know.
Stack Overflow wasn't developed by an established software company. It was a completely new startup, founded by two high-profile (at the time) software commentators.
They could have hired people to use any stack they wanted. While the state of web development wasn't what it is today, OSS stacks such as the LAMP stack were fairly well-understood and popular.
Stack Overflow existed for a couple of years before taking any funding, and it was developed by three people who were friends (Jeff Atwood, Jarrod Dixon and Geoff Dalgas), none of whom were even getting paid regularly at the start and all of whom were primarily familiar with the .NET stack from previous jobs.
It makes total sense why the tech is what it is. One could argue they could have shifted away from that stack since then, but one could also easily argue (very convincingly to anyone who has ever been involved with any significant project rewrite) "but... why?".
That's a reasonable point, but still: is the cost of the license higher than the cost of one or two additional developers? Jeff Atwood was/is, if I recall, a pretty talented C# developer, so why onboard extra people? It's more than just their salary; there's a cost beyond that associated with hiring people.
I'm not arguing that the licens isn't a concern, I just question how much of a concern it really is, compared to everything else.
The MS stack is affordable to developers working in developing countries and free for small products (with a lot of strings attached).
Of course, the OSS stack costs nothing to acquire, is cheaper to administer, and will give you more functionality and ease of use, but the MS one isn't some unaffordable monster like other proprietary stacks.
Depends on where you live. Certainly not peanuts here.
Edit:
I know that if you spent your whole life in California it can be hard to believe that in some places junior developers make $250 a month (seniors make more than that, but typically less than $1k a month), so a single Windows Server license buys you four months of a junior dev's work (however unproductive it may be), or 1-1.5 months of a senior dev's work.
In my current place of work a few years back I started a strong push towards Linux and as much FOSS as possible, and over some time (as new projects were written and some legacy stuff re-written) it resulted in a massive reduction of total operation costs.
I didn't expect any other reaction, though. You understand our realities better than us.
They also commented that the licenses paid for themselves in reduced hardware spend, as AOT-compiled code plus a very performant SQL Server allowed them to run on a fraction of the hardware that comparable sites operated.
They were losing money in the first few years and struggling to find a way to monetize it; at one point they were selling programming books through an Amazon affiliate program.
These types of cost cuts make the difference between life and death in early-stage startups.
It's simply impossible to have reliable "best practices" in programming. It's still a young field and most of the ideas floating around are just fads. Yesterday's best practice will quickly become an anti-pattern tomorrow when another person comes along and creates a pretty looking website to promote his best practice.
None of it is based on actual collective experience. Just the opinions of people who are good at marketing their ideas.
Somehow if something is nicely written up and put up on a good looking website, everyone starts treating it like a well accepted best practice.
> None of it is based on actual collective experience. Just the opinions of people who are good at marketing their ideas.
I think often best practices are based on experience, but they stem from a lost programming zeitgeist that the new practices have replaced.
Before Don't Repeat Yourself became popular, there was a serious problem with code duplication. Now there is a serious phobia of code duplication instead (to the point where it's a problem). DRY was good advice in a world where DRY was an obscure idea, is bad advice in a world where DRY is a popular idea.
If you look at something like Clean Code today, it seems like a bunch of pretty bad advice, but in the context of the late 2000s Enterprise Java, it seemed like a really fresh breeze. We've already adopted the good ideas, and discovered that pushing farther in that direction was a bad idea.
These books and ideas were lauded for a reason, but often that reason isn't there anymore.
> > None of it is based on actual collective experience
> I think often best practices are based on experience
But not collective experience. Just some people's experience.
If you survey experienced people in different corners of the industry, you will not find agreement with whatever is currently being floated as a best practice.
Rather amusingly, best practices are often adopted much more strongly by beginners (and expert[0] beginners) than by the experienced. When you reach an intermediate level of experience you start questioning much of what you learned from the internet as a best practice and start looking for alternatives, or just experimenting yourself with different ways of doing things.
But beginners will always outnumber the experienced, and they form an echo chamber.
This is because every year, many beginners join, but reaching an adequate level of experience to start questioning the crowd wisdom takes several years. By the time you get out of the rut of "best practices", you are in the minority.
Imagine a pyramid with you in the middle of it. There's a lot more volume below you than above you.
> If you look at something like Clean Code today, it seems like a bunch of pretty bad advice, but in the context of the late 2000s Enterprise Java, it seemed like a really fresh breeze. We've already adopted the good ideas, and discovered that pushing farther in that direction was a bad idea.
I kindly disagree. I think people get better with experience. Maybe along the way you encounter many different opinions and try out different things. But I can't say that particularly bad advice was useful in your journey to improve your skills. Quite the opposite: it might have been a hindrance, and maybe you would have grown a lot faster if you had been lucky enough not to encounter it.
I have no idea what Enterprise Java was like a decade ago, but I can't imagine it was much worse or better than it is now. I can't say I have ever seen a Java code base that I thought positively of. They might be terrible in different ways, but all terrible none the less.
> I have no idea what Enterprise Java was like a decade ago, but I can't imagine it was much worse or better than it is now
I have seen a bunch of it, well, from pre-2005, and it tended to be horrific. You can still write Java horribly, of course, but the codebases I've seen written in Java in the past decade are far, far better. I still don't like the approach Java developers take (they tend to architect in "elegant", ornate, massive OOP paradigms), but it is generally waaaay better than old enterprise Java.
> But not collective experience. Just some people's experience.
This is what collective experience is, though. Several peoples' experience.
> Rather amusingly, best practices are often adopted much more strongly by beginners (and expert[0] beginners) than by the experienced. When you reach an intermediate level of experience you start questioning much of what you learned from the internet as a best practice and start looking for alternatives, or just experimenting yourself with different ways of doing things.
This is very much true. Best practices typically are aimed at beginners. Experts usually rely on tacit understanding rather than rules-based thinking.
> I have no idea what Enterprise Java was like a decade ago, but I can't imagine it was much worse or better than it is now. I can't say I have ever seen a Java code base that I thought positively of. They might be terrible in different ways, but all terrible none the less.
Here are some highlights of what early Java code could look like:
* OOP was unbelievably hot, but nobody quite knew how to do it, so they did all Gang of Four-patterns on all classes. If your class didn't at least end in AbstractBuilderFactoryFactoryDelegateFacadeVisitorImplDaoAdapterInterface, you just weren't with it.
* Logic and data typically weren't separated at all. A class would frequently have both public static logic methods and mutable data.
* All functions had long boilerplate javadoc comments, and none of them were useful.
* Extremely long classes with zero separation of concerns. Static access was very common, and the threading model was rarely well understood, so a lot of methods were "public synchronized static".
* Extremely long functions, longer than I've seen in almost any language. They could be hundreds, sometimes thousands of lines long. Often they were surrounded in one large try-catch as well, with half a dozen catch statements at the end that were all like e.printStackTrace()
* As icing on top, this was before runtime annotations, so you had megabytes upon megabytes of XML gluing the monstrosity together. This horrifying XML-Java chimera was a trademark of Java EE. In a way, Java EE was ahead of its time, because it's very similar to the YAML-golem that ties together Kubernetes these days.
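For anyone who never saw this style, here's a tongue-in-cheek miniature in the spirit of the bullets above (all names invented, not from any real codebase):

    // Hypothetical early-2000s enterprise style: pattern soup, mutable statics,
    // and "public synchronized static" because the threading model was unclear.
    public class CustomerRecordAbstractBuilderFactoryDelegateImpl {

        // Logic and data in one place: a mutable static cache next to business logic.
        public static java.util.Map<String, String> CACHE = new java.util.HashMap<>();

        /**
         * Processes the customer.
         *
         * @param id the id
         * @return the result
         */
        public static synchronized String processCustomer(String id) {
            try {
                // ...in the real thing, hundreds of lines would go here...
                return CACHE.getOrDefault(id, "unknown");
            } catch (Exception e) {
                e.printStackTrace(); // swallow everything, as described above
                return null;
            }
        }

        public static void main(String[] args) {
            System.out.println(processCustomer("42"));
        }
    }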
> This is what collective experience is, though. Several peoples' experience.
No, this is not what I mean by collective experience. I mean something like what the vast majority of experts agree on and follow in their day-to-day programming activities.
> This is very much true. Best practices typically are aimed at beginners. Experts usually rely on tacit understanding rather than rules-based thinking.
That means they are exactly not best practices. They are rails and safeguards to help beginners ramp up. Training wheels. They should never be used as an argument for an engineering decision.
What I remember mostly from the bad old Java days -- well, the "Enterprise Java Beans" days at least -- was all the bureaucracy and XML files for servers and containers and beans and stub classes and whatnot that you had to touch in order to get anything done.
I believe the RoI on best practices is generally correlated to the scale of the application and engineering team you are working on.
A large number of best practices come from Big Corp engineers or organizations. Big corps have thousands of engineers that need to be able to read and contribute to code from many other teams. They have systems that need to be able to handle millions or billions of users.
In situations of massive scale, strictly adhering to best practices is worth the investment, since it will make code easier to contribute to (readability/architecture standards), make sure it does what it's supposed to (unit tests), or keep things performant. But the RoI of following best practices decreases as you move down the spectrum of scale.
Regarding the basic idea that "best practices are not always best" I've been involved in a lot of these conversations at small startups, and mostly around technologies like Kubernetes.
But every time I say something critical about Kubernetes, someone responds, "So you think Google is stupid? You think the engineers at Google don't know what they're doing?"
To which I always respond: "Are you Google? Are you running 10 million boxes at 50 data centers in 30 countries? If not, you might consider a simpler approach."
I do get that big companies find something like Docker/Kubernetes useful for security, scaling, and especially, dealing with the homogenous nature of the varied software stacks that typically build up in a large corporation.
However, my message to the small startups I consult with is, "Keep it simple. If possible, just stick with something like Heroku for a while, or a few big, beefy machines. You don't need High Availability, you can survive with 99.99% uptime for now. You've got an engineering team of 3 people, so keep everything simple."
And yet I still get some pushback, people saying, "No, we want to play in the big leagues, we want everything to be perfect right from the start."
That kind of perfectionism can kill a startup. That kind of perfectionism is entirely at odds with the spirit of a "minimal viable product."
Yet I keep having clients who want to prematurely imitate the best practices of Google, long before they have revenue to justify it.
I wrote a satire of how some of these conversations go, which I posted here:
I work for a startup and we use k8s for our web service deployments for two reasons - there is ample material available, and there are hosted services available. As we've grown it's been helpful to be able to utilise all of the ecosystem around k8s. It's been a huge kick-start for our team.
> Too many best practices in software engineering are followed as if they were immutable laws.
The wise programmer is told about Tao and follows it. The average programmer is told about Tao and searches for it. The foolish programmer is told about Tao and laughs at it.
If it were not for laughter, there would be no Tao.
The highest sounds are hardest to hear. Going forward is a way to retreat. Great talent shows itself late in life. Even a perfect program still has bugs.
> Too many best practices in software engineering are followed to improve developers experience rather than the quality of the end result.
Thus spake the Master Programmer:
"Though a program be but three lines long, someday it will have to be maintained."
> Too many best practices in software engineering have no empirical basis.
Thus spake the Master Programmer:
"When you have learned to snatch the error code from the trap frame, it will be time for you to leave."
And don’t get me started on all the pseudo-profound advice dressed up in mystic, Asian clothes. Sometimes I feel what programmers really value is their ego.
So long as they’re contingent, corrigible, evidence-based and lead to a better product I’m all for them. Sadly most aren’t and really should be called “my personal, preferred way of doing things”.
Best practices, when blindly followed, especially without understanding, lead to cargo cult programming. Ill-understood best practice documents are an organisational code smell.
Complaining about best practice, however, is itself a best practice.
I feel like having automated testing, though, is objectively beneficial. I imagine SO makes up for that by having a QA team and spending a heap of time manually testing stuff at the expense of “performance”.
I agree, automated tests of all kinds do have an empirical basis every time I’ve seen them used. You can literally see their value by measuring their output. There’s a lot more to the “best practices” around automated testing than just doing it, though. And therein lies the rub.
> I imagine SO makes up for that by having a QA team
Most places I've ever worked have made automated testing impossible but then gone ahead and made up for it by just never testing anything and then blaming the programmers when bugs show up in production.
The comments here are all discussing the ideas presented abstractly.
I'm more interested in the specifics:
> We use a lot of static methods and fields as to minimize allocations whenever we have to. By minimizing allocations and making the memory footprint as slim as possible, we decrease the application stalls due to garbage collection.
I would really love to see some profiling comparisons on this.
I simply cannot believe that just avoiding object initialization or state would add up to a meaningful difference. Any reasonable inversion of control framework applied dogmatically would pretty much ensure you almost never initialize objects at runtime anyway - only at service startup. BUT you maintain the clean composition and encapsulation of objects and gain the testability that they've given up.
I also can't believe that using fields directly over (i assume) Properties would make a meaningful difference either...
Again, not doubting the premise that if job one is performance, you have to make readability and maintainability compromises. But I am unsure whether the examples presented here are actually relevant, or whether they were just easy to digest in a blog post.
My guess? They operated in crunch-y startup mode for many years, optimized feature delivery over all else (appropriate), and now have a messy, kludgy codebase that - at least - remains performant. Now they're scared to refactor and improve it lest the performance gods smile unkindly, and have come up with after-the-fact justifications for why they don't want to bother with writing unit tests (which, incidentally, work JUST FINE with static methods).
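Stack Overflow's code is C#, but for what it's worth, the contrast being debated looks roughly like this in a Java analogue (purely illustrative; whether the per-call allocation ever shows up in a profile is exactly the open question):

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class Render {
        // "Static field" style: one shared, immutable formatter allocated once at startup.
        private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

        static String renderShared(LocalDate d) {
            return FMT.format(d);  // no per-request allocation of the formatter
        }

        static String renderPerRequest(LocalDate d) {
            // Allocating the formatter on every call creates short-lived garbage.
            return DateTimeFormatter.ofPattern("yyyy-MM-dd").format(d);
        }

        public static void main(String[] args) {
            LocalDate today = LocalDate.now();
            System.out.println(renderShared(today) + " " + renderPerRequest(today));
        }
    }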
It makes a difference. Back in my C# days, a team rewrote an internal service in a C-like fashion: structs, no properties, almost no runtime heap allocations. The application absolutely screamed. If I recall, it was 2 or more orders of magnitude faster than the app it replaced. Because it was highly used, this made a big difference, and enabled new use cases that weren’t possible previously.
You see the same thing in Go. Libraries that are zero-allocating are significantly faster than the competition.
Depends. Of course it is generally preferable to do no heap allocation, but heap allocation is also very cheap with good GCs, and the reclamation process is amortized and can largely be done in parallel. E.g. Java's allocations are something like 3 CPU instructions, if I'm not mistaken?
Allocations are pretty cheap, but GC can be quite expensive under high load/big heaps. It is probably not the first optimization you would need to make, but allocations have been a huge focus for the .NET Core team, and are part of the reason why it is so much faster than the full .NET Framework. See for example the post on improvements in .NET 5: https://devblogs.microsoft.com/dotnet/performance-improvemen...
But that depends on the number of objects. So unless you are creating objects like there is no tomorrow in a hot loop, the GC should be able to keep up with your workload quite well.
GC will cost 50-100 instructions per object reclaimed AND 5-10 per object not reclaimed.
Even if we ignore the latter, GC can add 2x for objects that you don't do much with.
And then there's the cache effects. Close to top-of-stack is pretty much guaranteed to be in L1 and might even be in registers. Heap allocated stuff is wherever. Yes, usually L2/L3 but that's at >2x the latency for the first access.
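A crude way to get a feel for these numbers on your own machine (not a proper benchmark harness like JMH, and note that escape analysis may scalar-replace the allocation entirely, which is itself part of why allocation is often cheap):

    public class AllocDemo {
        static final class Point { final int x, y; Point(int x, int y) { this.x = x; this.y = y; } }

        public static void main(String[] args) {
            long t0 = System.nanoTime();
            long sumAlloc = 0;
            for (int i = 0; i < 50_000_000; i++) {
                Point p = new Point(i, i + 1);  // a fresh short-lived object each iteration
                sumAlloc += p.x + p.y;
            }
            long t1 = System.nanoTime();

            long sumNoAlloc = 0;
            for (int i = 0; i < 50_000_000; i++) {
                sumNoAlloc += i + (i + 1);      // same arithmetic, no objects at all
            }
            long t2 = System.nanoTime();

            System.out.printf("alloc: %d ms (%d), no-alloc: %d ms (%d)%n",
                    (t1 - t0) / 1_000_000, sumAlloc, (t2 - t1) / 1_000_000, sumNoAlloc);
        }
    }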
Are your instruction numbers from a generational GC? Not doubting them, but perhaps that can further amortize the cost (new objects that are likely to die are looked at more often).
Cache misses are indeed true, but I think we should not pose the problem as if the two alternatives are GCs with pointer chasing and some ultra-efficient SoA or array-based language. Generally, most programs will require allocations and those will cause indirections, or they will simply not loop over some region of memory at all. Then GC really is cheap and may be the better tradeoff (faster allocation, parallel reclaim). But yeah, runtimes absolutely need a way for value classes.
Generational affects the number of free objects that can be found, not the cost of freeing or looking at an object, or even the number of objects looked at.
I thought that the comparison was between "heap-allocate every object" vs "zero allocation." (C/C++ and similar languages which make it easy to stack-allocate objects, which is not far from zero-allocation.)
If the application is such that zero-allocation isn't easy, then that comparison doesn't make sense.
However, we're discussing situations when zero (or stack) allocation is possible.
The code is very simple and mostly consists of taking JSON, parsing it, checking a few fields and modifying them, and calling another service with another JSON format.
Most of what we do is these null checks disguised as optionals.
And then people complain performance got worse compared to mainframe assembler.
What I find hilarious is that languages like Java & C# are the essence of Object Oriented Programming, yet if you want to keep things performant you should avoid creating objects.
Well, they might have tried an unreasonable inversion of control framework (like Spring) and observed that Spring uses reflection on top of reflection and does slow down your code significantly. The solution isn't to abandon IOC (or, god help us, just give up and declare all your variables static/global), it's to abandon IOC frameworks. They're useless and they don't provide any advantage. IOC is a coding style that you can follow without the help of some idiotic "framework", just like ORMs.
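For the record, a minimal sketch of what "IoC as a coding style, no container" means in practice (all names hypothetical):

    // The dependency is expressed as an interface and handed in through the constructor.
    interface Clock { long now(); }

    class InvoiceService {
        private final Clock clock;
        InvoiceService(Clock clock) { this.clock = clock; }  // constructor injection, no container
        boolean isOverdue(long dueMillis) { return clock.now() > dueMillis; }
    }

    class Main {
        public static void main(String[] args) {
            // The "composition root" is just ordinary code that runs once at startup.
            InvoiceService service = new InvoiceService(System::currentTimeMillis);
            System.out.println(service.isOverdue(0L));
        }
    }

In a test you would hand the service a fake clock such as () -> 0L instead of the real one; no framework, no reflection.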
> But it’s no silver bullet: your software is not going to crash and burn if you don’t write your tests first, and the presence of tests alone does not mean you won’t have maintainability issues.
> It was not our priority early on. Now that we have had a product up and running successfully for many years, it’s time to pay more attention to it.
This is a hard earned lesson but one that is firmly rooted in the reality of the industry IMO.
I’ve had the good fortune of working on a few greenfield products and bringing them to market. Time and time again I see technical and executive leadership steer teams in this direction. The fact is that writing good tests is hard and expensive. And most new software products survive less than a handful of years.
Accepting this reality will help ease the pain of working on that spaghetti codebase at your day job. At least it did for me.
Good on stackoverflow for paying back some tech debt in their successful product. I certainly appreciate the tradeoffs that come with building a business, but they can be difficult to swallow for engineers who take pride in producing high quality and well tested code.
I for one am quite happy that the whole hype from pair programming, extreme programming, whatever X programming with mocking everything for 100% coverage on QA dashboards is now behind us.
Even if you have written your application in a specific manner to achieve some technical goal, which for Stack Overflow is performance, you can still have automated testing with end-to-end or system tests.
I prefer that over unit tests because a test that is larger in scale does not care how you have organized your code, in contrast to a unit test.
Thus you can refactor all you want and the tests stay the same. Unit tests, on the other hand, become useless when doing large refactorings, because you have to throw them out or refactor them too when classes and/or functions get removed or changed.
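Concretely, a system-level test only talks to the running service over HTTP, so it survives any internal reshuffling. A sketch (the endpoint and assertion are made up, and it's a plain main() instead of a test framework just to keep it self-contained):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SmokeTest {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Hypothetical endpoint of the service under test, assumed to be running locally.
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/questions/1")).build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // Only externally observable behaviour is asserted, so internals can be
            // refactored freely without touching this test.
            if (response.statusCode() != 200 || !response.body().contains("question")) {
                throw new AssertionError("unexpected response: " + response.statusCode());
            }
            System.out.println("ok");
        }
    }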
To add to this, the whole TDD approach as described in the original book by Kent Beck advocated throwing away your smaller unit tests after a certain point while keeping the higher-level ones. This way you can still be confident in your code at the local level, but you won't get bogged down in the future when you need to make foundational changes.
Sometimes I get the feeling that the TDD cargo cult members haven't read the original material.
An obvious question is whether the focus on performance first is correlated with the success of Stack Overflow. If the same level of success could have been achieved without squeezing as much performance out of each server, then the sacrifice of testability is a dubious one.
The article doesn't explain why horizontal scalability wasn't pursued instead of their approach of performance efficiency (i.e. greater performance from a given amount of resource). It's certainly true that at the time that SO was starting, scaling horizontally required far more handbuilt tooling than it does today but it certainly was already a widely used pattern. The likely reason is cost, but that then comes back to the first question - would SO have been equally successful if it had higher infrastructure costs in its early days?
The pursuit of efficient performance at the cost of aspects of quality (of which testability is only one) can only be justified if efficient performance is truly necessary for success. A much smarter person than me warned against premature optimisation. That seems relevant here.
I don't really get why people are so focused on the lack of horizontal scaling. They already have 11 IIS servers, which seems like horizontal scaling to me. As for the databases: people underestimate modern database servers and a well-designed database layout. What would you even replace it with that doesn't complicate the setup?
I like SO approach, maximum utilization of the hardware and don't try to be fancy, if you don't have to. Less hardware to manage, it's easier to find people who can quickly be on-boarded to the team, easier to debug.
For those who only read the quote "premature optimization is the root of all evil", here's the context:
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
From some readings of Knuth, I get the feeling that he always thinks about efficient code.
He talks about measuring costs in his ASM, and he mentions that he measures both CPU cycles and mems (memory accesses), which is something I'd never seen before in algorithmic analysis (but which of course makes perfect sense given the large discrepancies in speed between them).
If you understand Microsoft licensing well you'll know it also costs to scale vertically.
Depending on your license procurement model (SPLA, Retail, OEM, OVA) you are usually paying for server software on a per-core basis. Some products (including most of those mentioned) have a tipping point in scale where licensing becomes more flexible (DC licensing).
Yes, but they will do it in an elegant manner and be sanctioned by lots of self appointed experts.
Where would we be today without the wisdom of Uncle Bob and the Gang of Four? We would still be in the prehistory of programming, writing efficient, inelegant code.
It depends on the application. You have to balance the costs of writing efficient code with the costs of running the machines.
If you have lots of users hitting your servers, it's worth investing more in efficiency, but if you barely have any, it's much more worth it to write in a convention focusing on maintainability.
There are entirely different problem domains inside software engineering. Pushing that last bit of performance at the price of some maintainability can be the sane choice for some numerical optimization/simulation algorithm for example, but would be a terrible choice for a bank application dealing with transactions and having often changing requirements. At the latter, even serious inefficiencies are absolutely worth it. Hardware is relatively cheap and really fast.
Also, don't forget that we are trying to balance at the top of often irreducible complexity mountains, never seen before. Compared to some of the programs out there, going to the moon is just some trivial computations. The former may very well not be remotely possible without some structure for our limited human brains.
> Similarly, we don’t write unit tests for every new feature. The thing that hinders our ability to unit test is precisely the focus on static structures. Static methods and properties are global, harder to replace at runtime, and therefore, harder to “stub” or “mock.” Those capabilities are very important for proper isolated unit testing. If we cannot mock a database connection, for instance, we cannot write tests that don’t have access to the database. With our code base, you won’t be able to easily do test driven development or similar practices that the industry seems to love.
Why bother with stubbing, mocking and changing code to be testable if you can use the more useful integration and/or UI tests? They will bring in some great value without having to rearchitect code.
> Why bother with stubbing, mocking and changing code to be testable if you can use the more useful integration
End-to-end tests are useful, but they take a long time to set up, maintain and run. They have to have a live database to connect to and that database itself has to conform to particular preconditions that can easily be violated even if the code itself is just fine. What I've seen, over and over, is that the end-to-end test fail because the test database instance is under load, or somebody wrote a test that inserted some data but forgot to remove it, or it doesn't work right if two particular tests run at the same time, to the point where they report false negatives so often, everybody just starts ignoring them.
Because fine-grained tests allow for more complete coverage of complex but tightly-bounded code sections. For example, if you've implemented a concurrent data structure, it's much easier to verify its correctness with unit tests; with integration tests only, it would be a lot more effort and more difficult, maybe even impossible, to thoroughly test all code paths.
Of course, if your entire software only consists of trivial glue code, then don't bother with unit tests, I guess.
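To make the concurrent-data-structure point concrete: hammering a small structure from many threads and checking an invariant is easy at the unit level, and nearly impossible to provoke reliably through an integration or UI test. A sketch (plain main() with an assertion rather than a test framework, to keep it self-contained):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class CounterStressTest {
        // The "unit under test": a small, tightly-bounded piece of concurrent code.
        static final class Counter {
            private final AtomicLong value = new AtomicLong();
            void increment() { value.incrementAndGet(); }
            long get() { return value.get(); }
        }

        public static void main(String[] args) throws InterruptedException {
            Counter counter = new Counter();
            int threads = 8, perThread = 100_000;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> { for (int i = 0; i < perThread; i++) counter.increment(); });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            // The invariant under test: no increments lost under contention.
            if (counter.get() != (long) threads * perThread) {
                throw new AssertionError("lost updates: " + counter.get());
            }
            System.out.println("ok: " + counter.get());
        }
    }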
Odds are that the concurrent data structure will naturally be very testable. If it's not, it's a matter of abstracting it a bit more, not of using the testability design guidebook.
But well, mocks and stubs are how you test the thing, and have no relation to coding best practices. I think the GP made an unreasonable grouping here.
Back in 2005, eBay was served by a single executable binary of around 100MB. At some point they hit the maximum number of functions allowed by the compiler.
Back then, the best "practice" for large-scale services was enterprise Java on VERY expensive Sun servers.
I read that and I feel a surge of sympathy for the new guy they hired to work on this thing. How many MONTHS did it take him to even compile it? How many hours of frustration did he have to spend thinking, "I can't even get this thing to build, they're gonna fire me, I need this job or my kids are gonna starve..."
The way I've dealt with this kind of stuff while keeping the day-to-day development experience from deteriorating is to spend the upfront time designing good, self-contained, non-leaky abstractions and then hiding all your rule-breaking code behind them.
I’ve used the *-able pattern with pretty good success. Let your devs mark functions as retryable, cacheable, etc. and then centralize the code that manages the caches and retry logic.
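One way to read the "*-able" idea is something like the following sketch: call sites only declare intent, and the messy retry policy lives in exactly one place (names are made up):

    import java.util.concurrent.Callable;

    public final class Retryable {
        // All retry policy lives here; callers never hand-roll loops or sleeps.
        public static <T> T retryable(int attempts, long backoffMillis, Callable<T> task) throws Exception {
            Exception last = null;
            for (int i = 1; i <= attempts; i++) {
                try {
                    return task.call();
                } catch (Exception e) {
                    last = e;
                    if (i < attempts) Thread.sleep(backoffMillis * i);  // simple linear backoff
                }
            }
            throw last;
        }

        public static void main(String[] args) throws Exception {
            int[] calls = {0};
            // A deliberately flaky task that fails twice and then succeeds.
            String result = retryable(3, 100, () -> {
                if (++calls[0] < 3) throw new RuntimeException("flaky");
                return "ok after " + calls[0] + " attempts";
            });
            System.out.println(result);
        }
    }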
If you have a QA team, great, but I don't recommend going down this road.
A lot of code is slow because it has a couple of wrong implementations, such as re-rendering or doing heavy work on every call when it only needs to happen once.
There is no problem with building a decoupled system using DI; the resulting performance will be tolerable.
Also, you can create custom builds, with tests and good practices in the main branch.
DI is the biggest snake oil out there. The vast majority of time it's just used as a hack job to enable test isolation in weak languages and gives you no real-world decoupling.
Sure, it's sometimes necessary, but I wish people would stop with the ex post facto justification. It doesn't make your code better if you add it just to test your code.
> DI is the biggest snake oil out there. The vast majority of time it's just used as a hack job to enable test isolation in weak languages and gives you no real-world decoupling.
Have you got any data or writings to back this up? I ask because what you write is my gut feeling but I can't put it into words or convince anyone.
I've ended up programming in an area where the answer to testing is "spin that into its own class so it's testable", even when the code is used in a single place, would be fine as a private or protected method, and just needs tests to confirm the output is right. That drives me crazy. For libraries, sure, but for checking that the date parsing is right, or that you didn't mess up a calculation... just let me test the damn protected method and be done with it.
Hundreds of tiny classes benefits no one, only the testing framework.
I agree with this. FWIW, Java (for instance) could treat testing as a privileged mode of execution, allowing for all properties (even private ones) to be get/set/interacted with as if they were public, for the duration of the test. This would simplify so many things and get rid of so much boilerplate code.
I agree, actually, and was thinking that visibility modifiers should not really deny access to the internals of the class under test. Maybe C++'s friend classes would be a good solution?
Alternatively, I don't really find Groovy to be the best thing, but it could probably be hacked to translate private accesses into reflection calls under the hood, making testedClass.privateField actually accessible?
Allow all properties and methods to be get/set/called freely by the test harness (such as JUnit in Java's case), as if they were public all along. The need for DI in testing would be greatly reduced, while keeping the public/private distinction intact while the code is running as normal.
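For what it's worth, you can get most of the way to that "privileged test mode" today with plain reflection, as long as the module system doesn't forbid it. A sketch (class and field names invented):

    import java.lang.reflect.Field;

    public class ReflectionPeek {
        static class Parser {
            private int parsedCount = 0;           // private state we want to assert on
            void parse(String s) { parsedCount++; }
        }

        public static void main(String[] args) throws Exception {
            Parser p = new Parser();
            p.parse("2021-01-01");

            // Reach into the private field the way a "privileged test mode" would.
            Field f = Parser.class.getDeclaredField("parsedCount");
            f.setAccessible(true);                 // permitted here: same (unnamed) module
            System.out.println("parsedCount = " + f.getInt(p));
        }
    }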
I had the same thought - if SO says they're fast because they don't use DI, then ok, maybe SO is fast because they don't use DI AND they have an elite group of amazing full-stack programmers who write OS kernels and prove that P != NP over their lunch break for fun. Most programmers fuck things up on a regular basis and don't even realize it until it's broken production for a couple of years for a handful of users who go down a non-standard path every once in a while, and they do it in a way that would have been caught if the code had been run even one time, but they can't run the code, because somebody else insisted on using a hundred static (i.e. global) variables.
It's like database schema design - denormalised schema are often faster for certain queries.
But you should still run a 3NF schema until such time as you actually need that performance, because running a denormalised schema comes with additional complexity cost.
What I've done in these cases is keep the tx database ~3NF and denormalized at some interval to a reporting database [1]. That seems to be a fairly standard way to do things up to a point.
[1] How data gets to the reporting database is implementation specific. Simplest case is run a nightly job to copy the data over. More complex cases use things like data lakes and/or event buses where all interested parties subscribe to data they want. I'm sure there are many others.
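The "simplest case" in [1] can be as small as this kind of sketch (table names, columns and JDBC connection strings are all made up, a driver needs to be on the classpath, and a real job would add batching and upserts):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class NightlyReportCopy {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection strings; in practice these come from config.
            try (Connection tx = DriverManager.getConnection("jdbc:postgresql://txdb/app", "app", "secret");
                 Connection rpt = DriverManager.getConnection("jdbc:postgresql://reportdb/reporting", "rpt", "secret")) {

                // Denormalize on the way out: one flat row per order, joined and aggregated.
                String query = "SELECT o.id, c.name, SUM(li.amount) AS total "
                             + "FROM orders o JOIN customers c ON c.id = o.customer_id "
                             + "JOIN line_items li ON li.order_id = o.id "
                             + "WHERE o.created_at >= CURRENT_DATE - 1 "
                             + "GROUP BY o.id, c.name";

                try (Statement read = tx.createStatement();
                     ResultSet rs = read.executeQuery(query);
                     PreparedStatement write = rpt.prepareStatement(
                             "INSERT INTO order_report (order_id, customer_name, total) VALUES (?, ?, ?)")) {
                    while (rs.next()) {
                        write.setLong(1, rs.getLong("id"));
                        write.setString(2, rs.getString("name"));
                        write.setBigDecimal(3, rs.getBigDecimal("total"));
                        write.executeUpdate();
                    }
                }
            }
        }
    }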
“ Similarly, we don’t write unit tests for every new feature. The thing that hinders our ability to unit test is precisely the focus on static structures. Static methods and properties are global, harder to replace at runtime, and therefore, harder to “stub” or “mock.” Those capabilities are very important for proper isolated unit testing.”
Is it in C#? That’s surprising. It’s relatively easy to stub and mock things out in C++, so you don’t need all these “testability” patterns, although I do find passing dependencies around as constructor parameters or properties makes code easier to understand overall.
I think their point is that parameter passing is a performance hit they wanted to avoid.
But it didn't make sense to me either. I don't know anything about C# but I assume there's some way to use compile time flags to assign static values. Seems like they should be able to mock that way.
> Perhaps we were not super clear in that piece – but testing is one of the things we “punted” back then and are definitely fixing now. One of our major engineering goals in the past few years was writing comprehensive test suites – and all new features are getting their share of automated tests as well. We still have way more integration tests than unit tests, but our test coverage is increasing progressively.
Best practices that get published/advocated are like gym memberships: they work for some, but not everyone. What does work, though, is knowing a team's culture (strengths/weaknesses) and tweaking that. It goes without saying that a best practice in one team at, e.g., Google will, with a high degree of certainty, fail at Amazon. Why, you ask? Simple: tribalism.
It might be true in one domain and false in another. Yet it sounds like a general rule.
For example, in my development environment, this is not true. Best practices - collected and distilled over the years - are stuffed into code generators which generate skeleton code across projects and teams.
They don't slow the app down, and they increase app development speed by orders of magnitude.
I think it's fair to say that best practices are specific to goals. Low latency, overall load time, ease of scale... These will all have different (though perhaps overlapping) best practices. The trick, as with most projects, is balancing goals against methods.
It's one thing to have a poor-quality code base. It's another thing to be proud of it, to the extent that you boast about your highly questionable decision-making in a blog post. I don't envy anyone who joins this company and has to deal with it.
I'm with you. I read this and I think, "ok, well, it's StackOverflow, and they're obviously a bunch of really smart people... but they're still doing it wrong". I've been doing Java for 20 years now, almost since there was Java. I've observed that most Java developers migrate toward "everything is static (i.e. global)". In fact, the abomination that is the Spring "framework" more or less requires it. I don't believe they do it for performance reasons, I believe they do it because you don't have to think much (up front) that way, and you can just kick the can down the road.
SO has been wildly successful, the site is consistently fast and reliable, and it doesn't seem like their decisions have significantly held them back during development. None of this sounds "poor-quality" to me.
Has anyone actually really noticed the performance of Stack Overflow? It feels like your average website before the JS bloat took over most “forum” style sites.
I wonder if they’ve traded off performance (and by the sounds of it, maintainability) with cost by squeezing all they can out of a minimal amount of hardware.
That's the difference between web development, "web2.0-style", and engineering. A web developer, when experiencing lag, will just add another AWS instance. An engineer will figure out where things are inefficient, and work against that inefficiency. You would be surprised how much of a performance boost you can get just by having good server-side pre-caching.
Also, consider that a lot of the lag you experience in the web today is actually caused client-side by excessive use of JS.
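Pre-caching in that sense can be as unglamorous as recomputing an expensive result on a timer and serving every request from memory. A minimal sketch (names hypothetical):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicReference;

    public class PrecomputedPage {
        private final AtomicReference<String> cached = new AtomicReference<>("warming up");
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        void start() {
            // Recompute the expensive page every 30 seconds, off the request path.
            scheduler.scheduleAtFixedRate(() -> cached.set(renderExpensivePage()), 0, 30, TimeUnit.SECONDS);
        }

        String handleRequest() {
            return cached.get();  // serving a request is just a volatile read
        }

        private String renderExpensivePage() {
            // Placeholder for the slow work (DB queries, templating, ...).
            return "<html>rendered at " + System.currentTimeMillis() + "</html>";
        }

        public static void main(String[] args) throws InterruptedException {
            PrecomputedPage page = new PrecomputedPage();
            page.start();
            Thread.sleep(1000);  // give the first refresh a moment to run
            System.out.println(page.handleRequest());
            // The scheduler thread keeps running, as it would in a real server process.
        }
    }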
Engineering is a pretty slow-and-steady profession. Best practices, tests, and robustness are part of the point. There's an argument that a good engineer would know when those criteria are satisfied and so be more able to run up to the cusp, but I don't think it is fair to say engineers necessarily (or typically) work against inefficiency. Adding more AWS nodes might be the boring, prudent choice.
Redundancy is a form of inefficiency. Failure paths often introduce inefficiency. Well defined tolerances introduce inefficiency... even though you know those motors could handle a couple extra volts...
[1] https://plaid.com/blog/how-we-parallelized-our-node-service-...