When costs are nonlinear, keep it small (jessitron.com)
179 points by zdw on Jan 19, 2021 | 65 comments



I thought this would be about pricing. In pricing, there’s a real advantage to being small: prices are often artificially low at low volume to promote lock-in. Then they gouge you in the middle. Then at the high end, they have to stop gouging you because you know that they know that you know you could go elsewhere for your volume.

For example, AWS free tier is a great bargain. AWS is a terrible value once you’re paying the price of a bare metal box each year (or sooner!), but then at Netflix scale it balances out again because Amazon knows that Netflix could easily talk Microsoft into doing the port themselves if it would actually make financial sense.

Most pricing is S-shaped, so it typically is best to be in the position of exploiting a loss leader or running at scale.


> AWS is a terrible value once you’re paying the price of a bare metal box each year (or sooner!)

You're probably off by about 2 orders of magnitude.

If your AWS spend is $10K per year or below, it simply isn't worth thinking about in terms of engineering time if you have real revenue.

Once your AWS spend crosses about $100K/year, it's time to pay attention. However, you probably want to be stingy about this, since a single engineer's cost is in that range. Your engineers are supposed to be providing business value, and looking at $10K/year costs probably isn't worth their time.

At the $1M/year mark, you already have an engineer on this full-time anyway so it makes sense to have some hardware you actually own in a colocation facility somewhere.


I agree - but there are a few gotchas here. One is that there are often simple configuration changes where a few hours of engineering effort could save $30k+ each year, and those changes can still be worth looking into. I set up AWS a few years ago at a startup I worked at. I didn't really understand the difference between spot instances and reserved instances on EC2, and I didn't bother to learn (we were all very busy!). We didn't notice for 18 months or something, and the amount of money we wasted with that single mistake would have paid my salary for several months.
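
For a sense of the arithmetic behind that kind of mistake, here is a minimal sketch in TypeScript. The fleet size, hourly rate, and discount are hypothetical placeholders rather than actual AWS prices, and reserved/spot savings vary by instance type and commitment; treat this as an illustration of the order of magnitude only.

```typescript
// Rough annual cost of a small always-on fleet: on-demand vs. reserved.
// All figures are made-up placeholders for illustration, not real AWS prices.
const fleetSize = 20;            // hypothetical number of always-on instances
const hoursPerYear = 24 * 365;
const onDemandHourly = 0.40;     // hypothetical on-demand $/hour per instance
const reservedDiscount = 0.4;    // hypothetical ~40% discount for a 1-year commitment

const onDemandAnnual = fleetSize * hoursPerYear * onDemandHourly;
const reservedAnnual = onDemandAnnual * (1 - reservedDiscount);

console.log(`on-demand:  $${onDemandAnnual.toFixed(0)}/year`);                    // ~$70,080
console.log(`reserved:   $${reservedAnnual.toFixed(0)}/year`);                    // ~$42,048
console.log(`difference: $${(onDemandAnnual - reservedAnnual).toFixed(0)}/year`); // ~$28,032
```

With numbers in that ballpark, a configuration detail left on the default setting for 18 months is easily a five-figure mistake, which is the commenter's point.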

It's also easy to accidentally lock yourself into bad infrastructure decisions early that become extremely expensive to fix later. MongoDB might seem like a good place to start your project, but trust me - moving away from it down the road will be a nightmare.

Too much up front design will kill your velocity. And not enough up front design will cost you too. If there's a good rule of thumb here, I haven't found it yet.


> The amount of money we wasted with that single mistake would have paid my salary for several months.

Sure. Presumably, though, your engineering salary for a couple months had more business value than you would have saved, so it's still a net plus.

> Too much up front design will kill your velocity. And not enough up front design will cost you too. If there's a good rule of thumb here, I haven't found it yet.

In my opinion, err on the side of velocity. Most startups are worrying about staying alive.

If the startup survives to have the problem of ripping out bad decisions, consider that your engineering decisions did their job and chalk it up to the scars of battle.

"Good judgement comes from experience. Experience comes from bad judgement."


Thanks for the honesty - most new engineers make such mistakes. AWS and other massive cloud offerings are like a forest where one can easily get lost!


Not sure how large the niche is where it makes sense to use “a” colocation facility - you're quickly at a point where you should be thinking about multi-region.


> At the $1M/year mark, you already have an engineer on this full-time anyway

do you mean that for 100K/year you DON'T HAVE an engineer managing your account infra? Who does this in this case? The CEO? This volume of spending implies quite a lot of infra in AWS.


No they're saying at $1M/year it's one person's full-time job to manage your AWS (probably, unless it's dead simple and you're paying strictly for volume). It's unlikely you're paying an engineer $100-150k/yr to manage your $100k/yr AWS infrastructure full-time. More than likely at $100k/yr spend it's a piece of one person's job, or it's one of those things everyone on the team just sort of looks after.


I still don’t understand how this business model works at all. Take cloudinary, for example. You can have your images transformed and hosted by cloudinary. So like you have a photo and let cloudinary examine the photo and serve different sizes, cropping to a human face as a circle for profile pictures for example. It is expensive. All you need is some idiot hitting refresh in a few tabs on the browser for a few hours on your website for you to run over your free tier.

I’d imagine that anyone spending real money is always thinking of ways to get off of something like this?


People don't pay you to run xyz, they pay you to solve their problems. If you can solve a problem that real businesses have, then as long as you're not charging enough to be a bigger problem than the original problem, they'll be happy to pay it.


I did some contracting work a few years ago, and one of our clients was a small entertainment news site. They used cloudinary for image hosting and were paying some insane amount for it. I think their cloudinary bill was $8000/month or something. The client didn't even know that was expensive. They just saw it as a cost of doing business, alongside their cloudflare bill and so on. (And what they were paying us.)

It took work over several weeks to convince them that that was a ripoff. I think their business people eventually called up cloudinary and negotiated a better rate, but by that point I was annoyed by the whole situation. The site used cloudinary's HTTP based API from the browser. I configured cloudflare so that any request hitting example.com/images/... was served through a caching proxy in front of cloudinary's actual servers. Unsurprisingly, that one configuration change made their cloudinary bill drop to about 1/20th of what it was. When the client was billed we got a panicked phone call asking why it was so low, and if something was broken on their site.
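
The comment doesn't say which Cloudflare feature was used; a cache rule or page rule can do this with no code at all. As one possible way to implement that kind of caching proxy, here is a minimal Cloudflare Worker sketch. The /images/ prefix and CLOUDINARY_BASE are hypothetical placeholders, and the types assume @cloudflare/workers-types.

```typescript
// Sketch: serve /images/* from Cloudflare's edge cache, falling back to Cloudinary,
// so the upstream (billed) service only sees each unique image once.
const CLOUDINARY_BASE = "https://res.cloudinary.com/your-account"; // hypothetical

export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const url = new URL(request.url);
    if (!url.pathname.startsWith("/images/")) {
      return fetch(request); // pass everything else through untouched
    }

    const cache = caches.default;
    const cached = await cache.match(request);
    if (cached) return cached; // edge cache hit: Cloudinary never sees the request

    // Cache miss: fetch from Cloudinary, then store the response at the edge.
    const upstream = CLOUDINARY_BASE + url.pathname.replace(/^\/images/, "");
    const response = await fetch(upstream);
    const cacheable = new Response(response.body, response);
    cacheable.headers.set("Cache-Control", "public, max-age=31536000");
    ctx.waitUntil(cache.put(request, cacheable.clone()));
    return cacheable;
  },
};
```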

Anyway, tldr; lots of folks out there have no idea what a service like cloudinary should cost. Apparently more than enough to make cloudinary a profitable business.


I think it's also a problem that engineers are rarely incentivized to make the right decision initially. If the initial price is low they will be praised if they are able to cut down the development time. After that, if the solution auto-scales, then it will be the beancounter's problem.


I deliberately make things expensive during development so there’s room to ‘optimize’ later.


I actually looked at Cloudinary for my job. I thought the pricing didn’t make sense so I ended up just running imgproxy on Heroku (it was free tier for a long time but I upped to hobbyist to get better latency).


Awesome story.


We pay like $200/mo for Cloudinary and we barely use any of the features. Primarily it's to serve up an image scaled to the right (single) size, and occasionally I'll tweak the URL to add an effect, a tint, or something. This convenience is worth the money so far, considering $200/mo is a rounding error in our monthly expenses, but I could probably write a script in a few hours to achieve similar ends and upload to S3 instead.
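
For what it's worth, that "script to achieve similar ends" could be a rough sketch like the following, using the sharp image library and the AWS SDK v3. The bucket name, widths, and paths are hypothetical, and this covers only basic resizing - none of Cloudinary's face detection, effects, or CDN delivery.

```typescript
// Sketch: pre-generate a few sizes of an image and upload them to S3.
// Bucket name, key prefix, and widths are hypothetical placeholders.
import { readFile } from "node:fs/promises";
import sharp from "sharp";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });
const BUCKET = "my-image-bucket";   // hypothetical
const WIDTHS = [200, 800, 1600];    // hypothetical target widths

async function uploadResized(localPath: string, keyPrefix: string): Promise<void> {
  const original = await readFile(localPath);
  for (const width of WIDTHS) {
    // Resize, re-encode as JPEG, and push each variant to S3.
    const resized = await sharp(original).resize({ width }).jpeg({ quality: 80 }).toBuffer();
    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: `${keyPrefix}/w${width}.jpg`,
      Body: resized,
      ContentType: "image/jpeg",
      CacheControl: "public, max-age=31536000", // let a CDN or the browser cache it
    }));
  }
}

uploadResized("./photo.jpg", "images/photo-123").catch(console.error);
```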


Even if it's a rounding error, it sounds like you'd break even from having one of your developers do it within a month or two. Unless everyone truly has a months-long backlog of high-impact stuff to do it sounds like a good argument to do it some Friday afternoon.


It doesn't work like that. Cloudinary caches the transform, so you only pay the first time for the transform. Of course, you do have to pay for the bandwidth, but by then the image would be cached somewhere on the network or in the browser anyway.


Incidentally, that building maintenance graph showing why an ounce of prevention is worth a pound of cure is also useful for illustrating why poverty is such a drag on the economy - if you barely have the money to survive short term, then you don't have the money to invest into even slightly long term things like maintenance, health or education, which ends up costing you even more in the long run. It's an incredible waste of resources (and that's not even counting the human suffering) and it benefits nobody (except maybe the company granting you the emergency loan you use to repair your car when it finally breaks down - a loan you might never finish paying off).


Everyone repeats this trope, yet when it comes to the lowest of the low-margin businesses, the only ones that seem to have any staying power are the ones that operate by putting things off until they can't, and then going for the minimum viable fix.


I've worked in facility maintenance since 1985. If maintenance isn't completed in a timely fashion the cost goes up, and so does the risk of unexpected failure. In my experience, maintenance isn't authorized for two main reasons: ignoring the problem, or management being afraid to spend the money for the jobs. The money is in the budget; they just don't want to be seen as a "spender" by upper management.


> It’s like DevOps and CI/CD: more frequent deploys are safer. This happens because each deploy contains fewer changes.

Is this generally true? Deploy as in going to production?

I joined a web app project that didn't have any tests and the code was brittle, where changing one part of the app could break a completely different part without you realising. It was much more efficient (and less stressful) to batch up lots of small changes and do one thorough manual testing session of the whole thing while we gradually got the codebase under control.

It likely depends on how much automated testing you have, how bad it is if bugs go live and how quickly you need features live.


That's the whole idea of CD: if you set up your project from the start so that deploys go to production, it changes the team culture and the way of working. People don't commit code without any tests (because it will break production!) and conversely don't stress over manually testing dozens of scenarios, some of which were "one-offs" that may become forgotten - it's "normal" that your code gets released, so the release is no longer a large, time-consuming and stressful event.

If you start with a project where the code is brittle and there are no tests, it is indeed extremely hard and time-consuming to move to CI/CD. That doesn't mean it's not a good idea, just that it won't be easy and it'll take time (you have to build the whole infrastructure of automated testing, canary deployments, etc. - but it's a good idea to do that anyway!). It'll gradually improve your velocity. The alternative is that you end up in a world where everybody is afraid of making changes and the simplest requirement ends up getting estimates like "3 months of work".


> People don't commit code without any tests (because it will break production!)

It sounds like a really bad idea. It's like saying: when juggling, it's best to start with knives, because every mistake will mean you're going to cut yourself, and you will quickly learn not to make mistakes.


Except the "production" when you start is really a pre-production environment, meaning you won't _really_ cut yourself. To modify your analogy - which strategy is better:

- start with blunt knives, realize the opportunities to hurt yourself, put safeguards in place (e.g. use protective gloves) OR

- practice juggling balls until you get really good at it, then switch to knives for the real show.


Each deploy containing fewer changes makes sense, but it's not the only reason. Another significant reason is that more frequent deploys mean more opportunities to identify and resolve any issues with the process. It becomes like a well-oiled machine.

On the other hand when a process is run infrequently you tend to get surprised each time because something has changed or broken since the last time. Or you just simply haven't done the thing enough to observe the possible issues so every six months you discover and re-discover the problems.


It's generally true, yes.

While manual testing changes the equation some (the testing effort is the same for each release; the more you can automate it, the greater the gains), fixing the issue becomes more complicated the more changes there are. 1 PR, you know what introduced the breakage. 30 far-reaching ones, with interesting overlap between them? Good luck.


> 1 PR, you know what introduced the breakage. 30? Good luck.

To be clearer, I mean there would be say 30 pull requests merged into "develop" and they would only be merged into "master" and pushed live after detailed manual testing. This is versus merging each pull request directly into "master" and going live immediately, where you wouldn't have time to do detailed manual testing every merge.

You would still have the git history in both approaches to debug.


Right. I'm saying there are two different things here. Regardless of your merge strategy.

You either test after each PR merge, or you test only when you 'craft a release'. Doesn't matter whether this merge is into a develop branch or straight into master; it's solely a question of when you test it.

If you test manually, testing only a batch of changes together is obviously easier from a testing perspective. However, the effort to figure out and fix the issue can be very large. Plus, after fixing that one issue, you have to retest everything. So every bug requires retesting in any case (so while it's still a lower testing burden, unless you introduced 30+ unrelated issues, it's not as low as you might think on the surface).

Compared with testing each PR (and at that point you might as well release, if your pipeline and org let you), any manual testing effort is obviously higher because you test so often (whereas automated testing is no extra effort), but figuring out what caused the breakage is an order of magnitude easier, and fixing it is much less likely to introduce new issues.

Which is more to the original point: catching a bug earlier allows the fix to be more targeted, which makes it more likely to be right, which reduces the likelihood of it making it to prod.

There is also, to the original point, a statistical thing to consider.

Let's say in 30 PRs, there is one that introduces a bug. Well, if QA is testing the one PR that broke something, they're more likely to spot it, since they're focusing on that PR. They're less likely if it's one of thirty sets of changes. But beyond that, let's say, in both situations, QA misses it. If you have one deploy, you've broken prod 100% of your releases (1 of 1 release). If you have 30 deploys, you've broken prod only ~3% of your releases. So immediately the original statement is validated; if your testing effort is the same (big if!), more frequent releases, with smaller changes, means the same or better percentage of deploying working code. Also, if you have to rollback with one mega release, you lose all 29 other changes; if you have to rollback with separate changes, you lose no other changes.
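
A small sketch of the arithmetic in that last paragraph, under the same assumption stated above: exactly one of the 30 changes is buggy and QA misses it in both scenarios.

```typescript
// Fraction of releases that break prod: one batched release vs. 30 per-PR releases,
// assuming a single missed bug among the 30 changes in both cases.
const changes = 30;
const buggyChanges = 1;

// One big release containing all 30 changes: that single release ships the bug.
const batchedBrokenFraction = 1 / 1;                  // 100% of releases broken

// 30 separate releases: only the release containing the bug breaks prod.
const perPrBrokenFraction = buggyChanges / changes;   // ~3.3% of releases broken

console.log(`batched: ${(batchedBrokenFraction * 100).toFixed(0)}% of releases broken`);
console.log(`per-PR:  ${(perPrBrokenFraction * 100).toFixed(1)}% of releases broken`);
// And rolling back the batched release takes the other 29 changes down with it,
// while rolling back the one bad per-PR release loses nothing else.
```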


If you have decent automated testing, canarying, and monitoring, then yeah, releasing as frequently as possible means that usually a breakage is immediately narrowed down to 2-3 CLs. We release once every 2 hours, because it takes about 2 hours to cut, test, and slowly deploy one release. When we have a bug, we usually notice during the canarying process, and then roll back and fix it, and figure out how to test for that case.


It’s true that deploying frequently reduces the surface of potential new problems per release. But to make it work flawlessly (not perfectly, but approaching it) you do need some engineering rigor in the product.


It's true in the sense that # of bugs or downtime per release is smaller. But that's not what clients care about at all. They care about total number of bugs and downtime.

In addition, this completely ignores that if you spend the same total amount of time testing, your testing per release drops when you increase the number of deployments.


> In addition, this completely ignores that if you spend the same total amount of time testing, your testing per release drops when you increase the number of deployments.

No, you just automate your testing.


I've never worked anywhere where 100% of testing was automated.


In practice, automating around 90% of it is enough.

This does explain why it didn't work for you though.


I would hate to work somewhere that my changes are deployed within days of writing my code.

CI/CD lets you iterate faster and it encourages best practices like proper testing. That said, if you do CI/CD without proper testing then you're going to break production.


Wait until you have to work with clients that don't even have staging environments let alone automated testing. Source control, CI/CD, automated testing, pull requests, code reviews, feature flags, linters, good logging etc. might be the norm on HN but not everywhere, especially companies where tech isn't the focus.

It can also be a tough sell to make process improvements like this when the old way got them this far and they want new features now. I've worked with clients that don't even use source control, do all edits on production, and have no way set up to develop locally.


Although my current day job is very professional about this, I’ve seen the other side, too. «We have some code from 5-10 y back, can you compile it?» and only half the files are there.

It would be great if HN had some content on «moving from random files on a network drive to git and basic automated testing» - it’s not an easy job and others would benefit from such writeups.


Joel on Software: go through the "Rock Star Developer" section of his archive.


> I would hate to work somewhere that my changes are deployed within days of writing my code.

Did you miss a "not" here? For me an environment where my changes are deployed and used quickly is a lot more fulfilling than one where I work on something that may only see the light of day weeks or months down the line.


> I would hate to work somewhere that my changes are deployed within days of writing my code.

You think waiting days would be hateful. Having to wait months is more frustrating. And more risk for the business.


It only works if you control the deployment (like in a webapp). Try pushing firmware updates to non-internet connected devices every day or week, and you will see a lot of customer backlash.


If your code is that brittle, then sure.


If you invest money in developing (and testing) a new feature, then you don't get the benefit of that investment until it is actually deployed and made available for use. Until then, it is just burnt money with no return.

Batching up many changes and features over longer and longer release periods brings so many issues, and definitely greater risks when actually deployed as the scale of change is larger.

What the article perhaps doesn't pick up on so well is that poor or difficult deployment processes lead to this behaviour of batching and putting off deployments. Do them less often, have fewer periods of service impact - that's the logic.

Which of course is one of the key points about DevOps, feedback loops, and CI/CD processes to simplify things.

If you can make deployments easier, you can do them more often. Making the changes smaller can help ease deployments, so it can be a virtuous circle - though not if deployments are unreliable or unrepeatable.


I worked on a project that used Angular. Every major version of Angular holds back the version of its dependencies, such as TypeScript.

When I came onboard this project, the client simply had not upgraded their libraries & nodejs version for a few years. When it came time to catch up with the latest version of Angular, the experience was painful. Three major version updates later (version 9.x to 10.x), the Angular upgrade became unbearable & I moved the project over to Svelte & a pnpm monorepo. Now all dependencies are up to date, the architecture is improved, & page size is reduced. It took a few months to hammer out all of the edge cases due to the complexity of the app, but release is imminent.


These "success stories" are so funny.

"The old way was all over the place so instead of improving the existing thing, I completely rewrote everything from scratch and now everything is cool. Yay me/us!".

Why couldn't you just upgrade the dependencies once, then set up the same CI/CD you're presumably using for Svelte so that you can then upgrade versions easily?


I had a water pipe crack in the cement foundation of my house. Just an old pipe. The plumber drained and capped the pipe, then ran a new pipe around my house in the ground. Next time it breaks it can be fixed. Rewrites aren't bad and are usually quicker if you keep a lid on features. All you're really doing is writing down the business rules again.


Is there an app, except for "Hello world", that's as simple as a pipe?


They always forget about all of the bugs they have not found in their new app.


The problem isn't that people make the wrong decisions about batch sizes or failure tolerance, but that they don't make a decision at all. The default state is a blind push towards both more changes and less pain. Not acknowledging that a tradeoff exists.


Could you expand on this a bit? How would I blindly push towards less pain in this case?


There needs to be another dimension in this model. In building maintenance, there needs to be an inspection to ensure the maintenance was performed properly. For example, you may replace a cracked roof shingle as a do-it-yourself preventive maintenance activity, but not do it correctly and actually cause a leak that was not there when it was only cracked.

In software, you need to have code reviews, unit and regression testing, or you can fix one minor bug only to introduce a catastrophic fail.


cron to the rescue


The author uses "nonlinear" but means "superlinear", because I guess sublinear cost behaviors don't cause the same kind of dramatic problems, and so we can get away with not thinking about them most of the time. I think it says something about the ways that software systems give rise to a particular flavor of problems, that allows us to (relatively effectively) ignore sublinear cost behavior, whereas in other areas those would be really important dynamics.


When would sublinear costs exist?

Sublinear costs seem like something that you'd be best off delaying to fix, maybe forever, since their impact lessens the longer you wait.

Maybe this would apply when a major change will happen that would obviate the need for a fix: a planned building teardown causes a needed roof repair on the old building to have sublinear costs, or introducing a new subsystem that eliminates the old subsystem that had the outstanding repair.

Or aesthetics: a minor marring of a building facade matters when it's pristine, but if you wait longer, the more other minor marrings appear, the less that first mar individually matters to the value of the building.


You don't always have the choice to delay. Sometimes the sublinear cost is associated with something you really need to do.

Examples of sublinear costs:

- When you pay a lower per-unit price by ordering large lots. But you can't just wait a long time and be able to afford a giant order; your first small orders may help you generate the revenue that lets you buy the later big orders.

- When your team is young, perhaps you need 1 new laptop per new hire. A few years later, you need < 1 new laptop per new hire b/c some receive recycled laptops from departed employees. But you can't wait 3 years and then hire 100 people and buy only 80 laptops.

- You're growing some infrastructure which serves/covers some territory. At first, all new customers are in new territory; the ratio of new infrastructure to new customers is high. Later, some portion of new customers are covered by existing infrastructure, and that portion grows over time. But even if you could afford to build everything at once, you may not know where to build most of it until you have a bunch of customers.

- I don't really know if this one is true, but a whole industry seems to believe that a burst of advertising effort all at once is more effective than a marginally greater volume of advertising spread over a longer period. But you can't do zero advertising for 5 years and then take over Times Square.


Batching is an example of a sublinear cost. The more loads of laundry that I do at one time, the lower the total cost to do laundry. Unfortunately in this case, the smell in my closet goes superlinear, so it is rarely worth it ;)


Buying more groceries at a time is another obvious batching that one can do. So is cooking in bigger batches. Fixed cost in household chores is mostly about time, so you should try to batch them to save time. Buying groceries monthly vs. daily saves you roughly 29 hours a month, if one trip takes an hour.


24/7, 365-day support is a less obvious sublinear cost. At 4am one person can be vast overkill, but you can’t get away with zero people. The same is true as you keep scaling support: having level 1 vs level 2 is really about filtering problems ever more efficiently, and having enough work at 4AM on Christmas for that to actually be needed takes scale.

Which is what economies of scale mean: just a huge range of sublinear costs all lumped together.


I think that is just discontinuous, not sublinear.


I think you’re misunderstanding the idea. Let’s use a simple model: serving N customers takes π people, rounded up to a whole person, so 10N customers take 10π people rounded up, and so on. As you increase by 10x, the average cost per N customers keeps decreasing. So, f(N) = 4, f(10N) = 3.2, f(100N) = 3.15, f(1,000N) = 3.142 ...

Except in the real world this is being optimized based on a forecast at every point in the day and every day of the year. You might not think of demand in those terms, but it’s a huge area of optimization at both large companies like Walmart all the way down to individual restaurants.
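
A tiny sketch of that model: for k·N customers you need ceil(π·k) people, and the rounding up to whole people is the fixed-cost-like step that makes the per-customer cost fall as you scale.

```typescript
// Support staff needed per N customers, when k*N customers require ceil(pi * k) people.
function staffPerN(k: number): number {
  return Math.ceil(Math.PI * k) / k;
}

for (const k of [1, 10, 100, 1000]) {
  console.log(`f(${k}N) = ${staffPerN(k)}`);
}
// f(1N) = 4, f(10N) = 3.2, f(100N) = 3.15, f(1000N) = 3.142
```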


You could also create combined call centers which service two industries that are similar enough that the same support rep can serve them both, but which have opposite demand curves throughout the year.

To see these opportunities you need data, but it is the most closely guarded secret of companies.

In the short term it gives them an advantage, but in the long run society as a whole could benefit tremendously from open data sharing.

Am I wrong about this?


If you have a group of N potential customers that seek your services randomly and independently, then the standard deviation of the number of active customers is proportional to sqrt(N). This means you need a smaller and smaller percentage of extra capacity for handling randomly busy days the larger N is.
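
A rough sketch of that effect, assuming each of the N customers independently needs service with probability p and capacity is provisioned two standard deviations above mean demand (p and the two-sigma margin are arbitrary choices for illustration):

```typescript
// Extra capacity, as a fraction of mean demand, needed to cover ~2 standard deviations
// of random fluctuation when N independent customers each need service with probability p.
function extraCapacityFraction(n: number, p = 0.01, sigmas = 2): number {
  const mean = n * p;
  const stdDev = Math.sqrt(n * p * (1 - p)); // binomial standard deviation, ~ sqrt(N)
  return (sigmas * stdDev) / mean;
}

for (const n of [1_000, 10_000, 100_000, 1_000_000]) {
  console.log(`N=${n}: ~${(extraCapacityFraction(n) * 100).toFixed(1)}% headroom over mean demand`);
}
// The required headroom shrinks roughly as 1/sqrt(N) as the customer base grows.
```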


fixed costs are sublinear.


Exactly.

If you need to make changes to the design of an injection molded part, you'd better make them all at once, because the cost of a new mold is $100k whether you make one change or seven.



