> How doesn't paging worsen speed performance though?
It does worsen the performance of the attention kernel, compared to kernels that take keys and values in a contiguous memory layout.
> Wouldn't the speed improvements be coming from that instead? Don't put an expected short input and output in the same batch as a big input and big output?
Actually, it puts everything in the same batch. The reason for its high throughput is that sequences are removed from the batch as soon as they finish, and new sequences can be added to the batch on the fly if there is enough space in the KV cache. This is called continuous batching (https://www.anyscale.com/blog/continuous-batching-llm-infere...).
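To make the scheduling idea concrete, here is a toy simulation of it. This is only a sketch under my own assumptions, not how any real serving engine is written: `Request`, `simulate` and `kv_capacity` are hypothetical names, the cache is counted in tokens, every running sequence produces one token per step, and preemption is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_len: int   # tokens in the prompt
    output_len: int   # tokens to generate (a real scheduler cannot know this)
    generated: int = 0


def simulate(requests, kv_capacity):
    """Toy continuous batching; returns {name: decode step at which it finished}."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Tokens currently held in the (hypothetical) KV cache.
        used = sum(r.prompt_len + r.generated for r in running)
        # Admit waiting requests while there is room for the prompt plus one
        # new token. Later growth of running sequences is not reserved here.
        while waiting and used + waiting[0].prompt_len + 1 <= kv_capacity:
            req = waiting.popleft()
            running.append(req)
            used += req.prompt_len
        if not running:
            raise ValueError("next request does not fit in the KV cache")
        # One decode step: every running sequence generates one token.
        step += 1
        for req in running:
            req.generated += 1
            if req.generated >= req.output_len:
                finished_at[req.name] = step
        # Finished sequences leave the batch immediately, freeing their space.
        running = [r for r in running if r.generated < r.output_len]
    return finished_at


requests = [Request("A", 5, 2), Request("B", 5, 4), Request("C", 5, 2)]
print(simulate(requests, kv_capacity=14))
```

In this example C cannot be admitted at first, but it joins the batch as soon as A finishes at step 2, and it finishes at step 4. If it had to wait for the whole first batch (A and B) to complete, it would only start after step 4 and finish at step 6.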
Paged attention and the "virtualized" KV cache play an important role in an efficient implementation of continuous batching. Text generation in an LLM is a dynamic process, and it's not possible to predict how long the output will be when scheduling incoming requests. Therefore, a dynamic approach to KV cache allocation is needed, even though it hurts the performance of attention.
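A rough sketch of what "virtualized" means here, assuming a fixed pool of equally sized blocks and a per-sequence block table. The class and method names (`PagedKVCache`, `append_token`, `free`) are made up for illustration and are not taken from any library; only the bookkeeping is shown, not the attention computation itself.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # The last block is full (or the sequence has no block yet),
            # so a new block is allocated on demand.
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return self.block_tables[seq_id][-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")
cache.append_token("b")
print(cache.block_tables)  # blocks are allocated only as the sequences grow
cache.free("a")            # the blocks of "a" become reusable right away
```

Because blocks are handed out one at a time, no space has to be reserved up front for an output length nobody can predict. The price is that a sequence's keys and values end up scattered across non-adjacent blocks, so the attention kernel has to go through the block table instead of reading one contiguous buffer.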
Before remote work recently became a trend, location-based pay was just a result of price being determined by supply and demand, plus the fact that location was a major constraint for both job seeking and recruiting. For anyone who believes their work has intrinsic value: if you try to turn this value into a number (a salary), ultimately you need to use some kind of market reference (like "I am able to get an offer of $xxxx from another company"). This market reference is heavily based on location if remote work isn't a viable option for you.
Now, why do companies still stick to location-based pay when many other companies are embracing remote work? I think that's just cultural inertia, and eventually software engineers will be paid without taking their location into account. But that's not a good thing for everyone, because the salary at that point will probably be much lower than what people get paid in the SF area today.
> consider that those savings probably won't even pay for more than a quarter of a one of their developers
Although I have never run a business, I do believe this kind of optimization is quite meaningful, even though it will never be the top priority of a business.
Those optimizations lower operational cost while being mostly maintenance free (except the one that switches away from AWS Certificate Manager, which may add some effort when renewing), risk free (unlike refactoring a large legacy system), and requiring little engineering effort (maybe 10 engineering days from investigation to writing the blog post?).
In addition, the blog post itself brings intangible benefits for their branding, website ranking and hiring.
I think you're exactly right. It has become a HN trope that every cost optimization story gets a response like this: your infrastructure costs are trumped by the cost of your developers, so why spend the expensive resource (developers) on optimizing the comparatively cheap bit (infrastructure). I'm tired of the trope because it's such an oversimplification.
What matters is the return on investment, and as you state, one of the great things about cost optimization is that its returns come largely risk free. By my math the optimizations described here return $100k a year. On a risk-adjusted basis, what task could this developer have performed that would have returned more?
On the small-business angle in this thread, another critical point is that the $24,000 (and certainly the $100k in that premise) might also come straight out of what is left for the owner as compensation or profit. Sure, it pales next to the cost of five engineers, and yet it could easily be anywhere from 1/10 to 1/3 of the annual profit for a small business. If you're the owner, that's a big deal over time. You never know how tightly a small business has to operate, but typically it's tighter than not.