How capacity planners credibly estimate application performance (acm.org)
131 points by yarapavan on July 24, 2023 | 51 comments



This doesn't capture the basics of capacity planning, which really started in the 1980s with the roll-out of bank ATM networks.

The first step is not to use technical measures like throughput or response time, but to use business units (like ATM account balance checks or withdrawals), so you can forecast capacity according to business plans.

In modern web apps, that's served by issuing a transaction id to user actions, and tracing that through all the systems and sub-systems required to service that transaction. That's your data plane.
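A rough sketch of that data plane in Python (the subsystem names and timings are invented; a real system would propagate the id in a request header and ship the timings to a tracing backend):

    import time
    import uuid
    from contextlib import contextmanager

    @contextmanager
    def traced(txn_id, subsystem, log):
        # Record how long one subsystem spends servicing this transaction.
        start = time.perf_counter()
        try:
            yield
        finally:
            log.append((txn_id, subsystem, time.perf_counter() - start))

    def handle_balance_check(account_id, log):
        # One business unit of work ("ATM balance check"), tagged with a
        # transaction id that follows it through every subsystem it touches.
        txn_id = uuid.uuid4().hex
        with traced(txn_id, "auth", log):
            time.sleep(0.002)   # stand-in for the real auth call
        with traced(txn_id, "ledger", log):
            time.sleep(0.005)   # stand-in for the core-banking lookup
        with traced(txn_id, "render", log):
            time.sleep(0.001)   # stand-in for formatting the response
        return txn_id

    log = []
    handle_balance_check("acct-42", log)
    for txn_id, subsystem, seconds in log:
        print(f"{txn_id[:8]} {subsystem:>8} {seconds * 1000:6.2f} ms")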

Unfortunately, the next step of applying queuing theory to your software model is almost moot in a multi-core, multi-machine world with GPUs on a modern memory bus, because the model cannot be both abstract and accurate.
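For reference, the textbook single-queue (M/M/1) result is easy to sketch, even if it stops being accurate on that kind of hardware; the service time below is an invented number:

    # Textbook M/M/1 queue: with mean service time S and arrival rate lam,
    # utilization rho = lam * S and mean response time R = S / (1 - rho).
    # The point is the hockey stick as rho approaches 1.
    S = 0.010  # mean service time per request, in seconds (assumed)

    for lam in (10, 50, 80, 90, 95, 99):  # arrival rates, requests per second
        rho = lam * S
        R = S / (1 - rho)
        print(f"arrivals {lam:3d}/s  utilization {rho:4.0%}  mean response {R * 1000:7.1f} ms")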

But the real problem is cultural: IT is willing to set targets for measures it can own and manipulate, but it's much, much less willing to commit to business unit measures that senior management can see clearly. Indeed, one of the best ways to understand how IT fits into the organization is to ask what's being measured.


I was attracted to my current employer when the ops director said almost exactly what you did in your last sentence. He wanted our dev/ops arm to be numerate and to work in dollars and TPS.

We actually do capacity planning in dollars to this day, and did do queuing models of both batch and multi-core, multi-machine transaction processing.


I understood the prior post differently, in that they tied capacity planning to biz metrics rather than TPS.

For example, businesses in HR might have metrics like average case correspondence count, open rate per day, idle status, etc. The dev team would commit to certain of these metrics, and optimize both costs and those metrics. This, to me, would signal an IT organization with high status (autonomy delegated to tech leaders).

Whereas a different company might have IT committed to X TPS without the ownership of business metrics


Incredible. Imagine for my startup how much it would take to translate potential features into dollars. “We can take this market of $55bn, it would take 70 features and 12 marketings, that’s 100 million per feature, but you only get it if the market develops accordingly.”

The simple fact that a large company is able to liquefy a non-linear, large-blocks market into dollar amounts that you can measure is the ultimate dream of any 1800s economist.


Yes, every resource exhaustion causes a bottleneck, and a queue builds up. My effort here is to try and tell people to look in the right place to spot the bottleneck. That's a lot easier than a measurement exercise.

I sort of lump the "do a benchmark" advice in with "look under the lamppost, it's much brighter there" when you've lost your car keys in a dark garage (;-))


It's my hunch that benchmarks are often, in practice, very difficult to get to actually measure the right thing, because of the built-in conflict between isolating the thing we want to benchmark (so we're sure we're measuring the right thing) and having the benchmark still resemble anything like a real-world scenario.

They also tend to sort of make you want to optimize the way the code already works on some level (a function or a service), as opposed to making larger scale optimizations that often yield much bigger performance improvements. Often the really big real world improvements are architectural rather than the result of tweaking the access order to get better CPU cache access patterns. Like the latter happens too, but you run out of those types of optimizations relatively quickly.

Like if you have some small piece of code you want to optimize (for some reason), a benchmark is invaluable; but if you have a system you want to optimize, it's less useful.


I entirely agree: I used to run the performance team at Sun Canada, and 99% of the time, we were on a bottleneck-hunt. Annoyingly, the bottleneck was rarely in the first place we looked (;-))

Benchmarks are well-known, well-understood and popular. They aren't what capacity planners and performance engineers use, though.


The gist of the article is to find bottlenecks in your service pipeline. I.e., when a request gets queued up because an earlier request is still processing.

By determining when and where the bottlenecks will occur, teams can plan for resource allocation ahead of time.
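A back-of-the-envelope sketch of that kind of analysis (the per-request demands are made-up numbers): the resource with the largest demand per request saturates first and caps the throughput of the whole pipeline.

    # Per-request service demand on each resource, in seconds of that
    # resource's time (invented numbers for illustration).
    demands = {
        "app CPU":      0.004,
        "database CPU": 0.007,
        "disk I/O":     0.002,
        "network":      0.001,
    }

    # Utilization law: the resource with the largest demand saturates first,
    # and max throughput is roughly 1 / max demand.
    bottleneck = max(demands, key=demands.get)
    print(f"bottleneck: {bottleneck}, saturates near {1 / demands[bottleneck]:.0f} requests/s")

    planned_rate = 100  # requests per second, from the business forecast (assumed)
    for resource, d in demands.items():
        print(f"{resource:13s} utilization at {planned_rate}/s: {planned_rate * d:4.0%}")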


To scale up fast with surges of traffic, I really like Cloud Run. The auto-scaler is very integrated with the load balancer, so that you can define the max number of requests each instance of your app can handle, and the platform takes it from there. This way, you don't have to worry about request queueing. It even scales down to zero when there is no traffic.

Disclaimer: I'm not involved with GCP in any way, besides being a satisfied customer.


FWIW Cloud Run is extremely expensive if your app doesn't scale to zero, in my experience. You're better off having permanent instances with a CUD.


CUD?


Committed Use Discount - agree to minimum usage on your cloud provider in exchange for discounted rates (sometimes significant discounts).


This article seems like it was applicable circa 2006 but not in 2023. With asynchronous request handling (non-blocking request processing), you cannot reliably calculate the "processing time" of a single request since it is only actively being serviced (and therefore preventing another request from being serviced) for a tiny fraction of the total time.

(The exception being for a request that requires no asynchronous handling from the point the GET is received on the server to the point the first byte is written to the response, but that's rarely the kind of request that's ever a bottleneck in the first place since it's basically limited to just static html.)


All systems respond the same way eventually. Keep multiplying your load by ten and you'll see them.


Yes but the premise of the article is that you don’t need to test under load (i.e. benchmark) - only accurately measure the time it takes to fulfill a single request.


You measure the things that block, to predict when they will queue. Your async dispatcher still takes time per request. The resources that each request needs will eventually be exhausted. Your network interface isn't infinitely fast. Identify these, measure them, and you'll have a decent chance at predicting some of the problems you'll run into as things scale.
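One way to make that concrete even for a fully async server is Little's Law (L = lambda * W): every in-flight request holds a connection, a file descriptor, some memory for its whole residence time, and those are finite. A toy calculation with assumed numbers:

    # Little's Law: L = lambda * W.  Even if the CPU is idle while a request
    # awaits a downstream call, each in-flight request holds a connection,
    # a file descriptor, some memory for its whole residence time W.
    arrival_rate = 500       # requests per second (assumed)
    residence_time = 0.250   # seconds each request spends in the system (assumed)

    in_flight = arrival_rate * residence_time
    print(f"~{in_flight:.0f} requests in flight on average")

    connection_pool_size = 100  # assumed downstream connection limit
    print("pool exhausted" if in_flight > connection_pool_size else "pool has headroom")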


How does this compare to simply targeting a certain CPU utilization (say, 70%)?

In a sufficiently complex system you will have many different queues and other bottlenecks that affect throughput at scale. Some requests will also require more compute than others.

I like to plot the correlation between utilization and latency. Since utilization will typically vary during the day, you get lots of datapoints without running any benchmark. That lets you make more informed decisions regarding target utilization and when it becomes necessary to scale up.
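Something like this, assuming you already have paired utilization/latency samples from monitoring (the values here are fabricated):

    import matplotlib.pyplot as plt

    # Paired samples collected over a normal day, no benchmark needed.
    # These values are fabricated for illustration.
    cpu_utilization = [0.20, 0.35, 0.45, 0.55, 0.65, 0.72, 0.80, 0.88]
    p99_latency_ms  = [40, 42, 45, 50, 60, 75, 120, 300]

    plt.scatter(cpu_utilization, p99_latency_ms)
    plt.xlabel("CPU utilization")
    plt.ylabel("p99 latency (ms)")
    plt.title("Where does the curve turn into a hockey stick?")
    plt.show()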


When this was my main focus at work, we had thresholds for CPU utilization, memory, and I/O, but you couldn't simply throw more traffic at a collection of servers until they hit one of those thresholds, partly because most of those thresholds were engineering limits rather than steady-state targets, and partly because if you did, you generally had no elasticity in the system for unplanned business events. We also broke things out based on what was critical path for revenue generation vs running the business or value-add.

You also have to adjust your thresholds based on your DR/BCP plan, your fault tolerance design and operational requirements. If you were hot/cold between 2 sites or AZs with a 95% SLA, you could run things hotter than if you were hot/hot between 2 sites with a 99.9% SLA. It wasn't unusual for our CPU usage threshold to be 30%.

Ultimately the critical piece of knowledge is when your utilization vs latency curve turns into a hockey stick, and knowing what bottlenecks drive that. We had to learn the hard way exactly what kind of throughput we could expect out of each layer of our infrastructure, because in many cases latency would spike under stress conditions without utilization correlating with it, even though the two did correlate under normal conditions. Learning how to pick out those early warning signs was an art.

We did some analysis to determine the mix of transactions for different types of business activity (BAU, an annual big event that changed a profile permanently, a periodic big event that represented an impulse/one-time change) and could project expected CPU utilization from a model for each transaction mix, driven by business metrics. We based all of our capacity recommendations on the business volume projections by event type, so we weren't asking our BAs for things like transactions per second by API, which in most cases would have just gotten us blank stares; we'd ask them for a sales projection and apply our model to it.
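A stripped-down sketch of that kind of model (all figures invented): measure CPU seconds per business transaction once, then apply the business volume projection to it.

    # CPU seconds consumed per business transaction, measured from production
    # traces (all numbers invented for illustration).
    cpu_seconds_per_txn = {"quote": 0.015, "purchase": 0.040, "profile_update": 0.025}

    # Business projection for the event's peak hour, in business units, not API calls.
    projected_volume = {"quote": 500_000, "purchase": 80_000, "profile_update": 200_000}

    fleet_cpu_seconds = 4 * 2 * 3600  # 4 hosts x 2 cores for one hour (assumed)

    needed = sum(cpu_seconds_per_txn[t] * v for t, v in projected_volume.items())
    print(f"projected CPU utilization: {needed / fleet_cpu_seconds:.0%}")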


I do exactly that plot, usually for CPU, occasionally for rotating-rust disks or for memory in non-GC languages. If a day's normal variation doesn't make a machine busy enough, I usually fiddle with the load-balancer to give it more work (;-))


I really wish I understood how I estimate the effects of performance improvements to code. (I also wish I could estimate stories the way I estimate changes).

Where businesses often seem to fall down is that, even if they do round-trip lessons from capacity planning back into business strategy, it sure doesn't look like it: they don't talk about it, and they make decisions that don't align with the capacity planning numbers.

If you know the cost of 1000 new customers is $M of new hardware, and revenue from the new customers is $N, then if you spend more than $(N-M) on acquiring the customers you're digging a hole, and the more successful you are the faster you dig. And then there's the grey area where you actually make money, but so little that any other initiative would have been a better use of your time.
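With made-up numbers:

    # Made-up numbers to spell out the arithmetic above.
    hardware_cost_M = 50_000   # capacity cost of serving 1000 new customers
    revenue_N = 120_000        # revenue those customers bring in

    max_acquisition_spend = revenue_N - hardware_cost_M  # spend more and you're digging a hole
    print(f"break-even acquisition budget for 1000 customers: ${max_acquisition_spend:,}")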


Anyone else remember this game? It was hilarious.

https://en.wikipedia.org/wiki/You_Don%27t_Know_Jack_(franchi...


Taking this tangent even further, the successor of those games is the Jackbox series, which make great party games. You put the game on a screen everyone can see and then each player uses their own web browser (on whatever kind of device they have) to play the game. Since you can share the main computer's screen over a video call, it works for distanced game nights or for a remote team activity at work. They regularly have decent sales on Steam, too.


We played MIT's Beer Distribution game[0] in grad school. Great example of the bullwhip effect[1].

[0]: https://mitsloan.mit.edu/teaching-resources-library/mit-sloa...

[1]: https://en.wikipedia.org/wiki/Bullwhip_effect


The site references the PDQ book and library. Dr. Gunther also wrote a book called Guerrilla Capacity Planning that goes into detail on the curve fitting method of analysis which is worth a read.


[flagged]


Clicked, and happy to have done so. Getting offended enough by titles to miss out on such content is a pity. I wholeheartedly encourage you to reconsider.


Being annoyed by obnoxious nondescript clickbait titles is not the same as being offended by them. Respect my time and use a proper label for your content if you want my click.


If this were Gawker (RIP), Buzzfeed, Upworthy, or another content farm, then yes, you'd have a point. This is ACM Queue, though. Here are some other works you might never have read, based on their titles.

- Go To Statement Considered Harmful <https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.p...>

- Why Johnny Can't Encrypt: A Usability Evaluation of PGP 5.0 <https://web.archive.org/web/20140712002547/http://www.gaudio...>

- Remembrance of Data Passed <http://simson.net/clips/academic/2003.IEEE.DiskDriveForensic...>

- Reflections on Trusting Trust <http://dl.acm.org/citation.cfm?id=358198.358210>

- No Silver Bullet: Essence and Accidents of Software Engineering <https://doi.org/10.1109%2FMC.1987.1663532>

- The Emperor's Old Clothes <https://dl.acm.org/doi/pdf/10.1145/358549.358561>

- The next 700 programming languages. <https://www.cs.cmu.edu/~crary/819-f09/Landin66.pdf>

- Attention Is All You Need <https://arxiv.org/abs/1706.03762>

As a matter of opinion, there are far more worthy reads in programming and computer science with "clickbait" titles than with more staid academic titles. There are untold numbers of forgettable works with forgettable titles.


None of that means it's good, and it certainly doesn't mean I have to like it. In fact, a clickbait title on good content is worse than bad content, because it makes the content look cheap before anyone reads the first word.


ACM noticed a trend and created the "You Don't Know Jack" series to capture more of them.


It takes 60 seconds to read every title on the front page of HN. Nobody is disrespecting your time.


The issue was about bait titles. Following links, reading the content (usually partially), and squaring the title with what you find is an effort.

"You Don't Know Jack about Application Performance" has very little information. Is this the title of a talk or blogpost? What runtime is it targeting? Is it because of containerization or hardware issues? Is this about statistics? This is a low quality title. If HN were full of low quality titles, the negative impact of having to weigh through the chaff would be more obvious. Better to just be vigilant and call it out when you see it.


Did the tag line "How capacity planners credibly estimate application performance (acm.org)" show up?

The ACM series is called "you don't know jack", but the tag line is supposed to distinguish it from all the other "jack" articles.

And yes, the series title is not what you want for something like HN.


No— it was "you don't know jack about application performance," which is a nice fun title for the series. The current headline is much more useful totally out of context on HN.


60 seconds, if that... Because most of the links have genuinely descriptive headlines. If I had to click on each link and skim to see what was on the other end because the headline was about as descriptive as a top-level topic tag, it would take 10 minutes.


It probably takes less since most of the titles are a vague half sentence with no context


You don’t know how many times a minute I read HN.


Consider you may be the person disrespecting your time


And they might like to consider that a thread about it is disrespecting other readers' time as well as their own :-)


Here’s more disrespect for your, my, and everybody else’s time. Though I was mostly joking. :)


Application Performance


Why don't we strip it down another layer and say it's about "computers."


I agree with the sentiment but must admit that titles like "everything programmers should know about X" absolutely annoy me as well.


Those are a brilliant catalogue of articles covering in-depth edge cases of frequently used but poorly understood data formats, dating back to "What Every Computer Scientist Should Know About Floating-Point Arithmetic", published in 1991.


I am not saying it makes the articles bad, I'm just saying that the titles are clickbaity and annoy me.


I feel like writers want to have their cake and eat it too:

- Make weirdly evocative titles

- Get more clicks because people are positively or negatively intrigued by such a title

- If someone protests it, though, they are being shallow and nitpicky


In that case I wouldn't recommend Spolsky's talk "You Suck at Excel".


I think it’s a great title. It advertises exactly the sort of persona that he presents.


You don't know what they know and what they don't know, so you don't absolutely know that they don't know what you know and what you don't know.


I’m on the Queue advisory board and I’d appreciate hearing your suggestions for a better title.


I think the title is fine. You will never find a title that won't rub someone on HN the wrong way.


Sometimes the tag line is what needs to be used: I see HN quoted it, so we're doing something right...



