writing as somebody who runs a big collection of bare-metal hypervisors for ISP ...

dang · on Dec 10, 2020

Please don't be a jerk on HN, especially in response to someone else's misfortune, even if they brought it on themselves. Maybe you don't need to treat these people better (though why not?) but you owe the community better if you're posting here. If you wouldn't mind reviewing the site guidelines and taking the intended spirit to heart, we'd be grateful. Note these ones: "Be kind" and "Please don't sneer"

https://news.ycombinator.com/newsguidelines.html

p.s. I skimmed through your recent commenting history and it looks great—just the kind of thing we want here. Sharing some of what you know is exactly what we want users to do. But please don't be supercilious about it, as in this comment and https://news.ycombinator.com/item?id=25372847. Ignorance doesn't deserve humiliation, and that ingredient poisons the ecosystem (and eventually starts a degrading spiral, e.g. https://news.ycombinator.com/item?id=25373520). The rest is good.

walrus01 · on Dec 10, 2020

Thanks for the feedback. I almost certainly shouldn't have included the part about the smirk, and I can definitely see how that could appear to be making fun of somebody else's misfortune. And the rest of it could have been phrased in a more diplomatic way.

For what it's worth it wasn't intended personally at the person who almost incurred the $72k bill, but more at the general concept of test/beta software gone rampant and out of control in an environment where billing has no limits. I think we've all tested some sort of software in development environments that caused havoc - but up until very recently it's been hard for that to immediately begin causing real world financial consequences...

Alex3917 · on Dec 10, 2020

> Maybe you don't need to treat these people better (though why not?)

IMHO the best argument for 'why not' would be that it's generally unethical to deploy software without first taking the time to read the manual and understand how your dependencies work. In this case the system wasn't live and the costs of this fuckup were solely externalized onto Google, which is fine because it was in large part their fault anyway. But when dealing with production deployments, this same behavior often results in users having all their private information leaked or deleted.

walrus01 · on Dec 10, 2020

I think cautionary tale are important - but it's also possible, as I likely did above, to come down on people too harshly. Not everything has consequences as severe as a therac-25.

donmcronald · on Dec 10, 2020

The crazy part to me is using the cloud for testing. It’s crazy. I have a 5 year old dual CPU Xeon with 128GB of RAM and a couple NVME disks that I’ve spent about $1000 CAD total to build ($700 USD). Something in that range on Azure is about $1 / hour if you reserve a year. ~$9000 per year.

All the people running workloads that don’t require the redundancy given, like CI, blow my mind. The costs are astronomical vs buying a cheap or used server. Sure, use the cloud for you production builds, but why not augment it with something that doesn’t cost as much?

walrus01 · on Dec 10, 2020

as a totally randomly chosen example, that I spent not more than 20 seconds searching for, here's a system with 128GB of RAM for way under $500.

https://www.ebay.com/itm/DELL-R910-16SFF-model-4x-Intel-X755...

need to add your own storage (good quality SSDs, of course).

and it assumes you have somewhere to put noisy things...

I would estimate it's about a 500W electrical load, so figure $40-50 additional electrical bill, if you're trying to precisely account for all costs.

you can totally set up a desktop workstation dual xeon for a similar price as well.

skrebbel · on Dec 10, 2020

> The costs would be absolutely fixed and known.

Until the server breaks and you have to drive over in the middle of the night and try to replace it but the only available server right now is a shitty one and oh shit only half the backups work cause the onsite backups are fried too etc etc etc.

There's many good arguments against high-level BaaS such as Firebase but I'm not sure that "colo is cheaper" is one.

senko · on Dec 10, 2020

> I'm not sure that "colo is cheaper" is one.

It absolutely is (cheaper, and a good argument). As an example: we're in the process of switching from Digital Ocean to Hetzner for a project, that will increase infrastructure performance (roughly memory/cpu/storage) by 4x and decrease costs by 4x. And no driving to the colo center is neccessary, as it's their dedicated server, so their on-site engineers do the hardware replacement.

Also, if you are not okay with your site being down for a few hours, you can always buy two, like you would with a sensible cloud setup. It'd still come up way cheaper (+ get more perf if you can do load balancing for your usage).

Also, I don't look at it from "colo is cheaper" point of view. To me, it's "I can have several times more performance and hire a full time sysadmin to worry about it, for the same price".

Axsuul · on Dec 10, 2020

Can anyone recommend a similar provider but for North America?

1337shadow · on Dec 11, 2020

Sure thing https://oneprovider.com/dedicated-servers-in-north-america

daneel_w · on Dec 10, 2020

It's anecdotal, but I'm convinced I'm not alone here: we've had more Amazon-related failures/outages in 3 years with AWS than we had in 4 years of colo before heading to the cloud because of the exact fear you described.

Even a cloud setup needs good management and contingency planning, and in absence of such it can fail just as hard as a colo setup.

walrus01 · on Dec 10, 2020

https://www.bbc.com/news/technology-55087054

https://www.seattletimes.com/business/amazon/amazon-web-serv...

walrus01 · on Dec 10, 2020

fine, buy two identical ones. and set up proper backups. or even make the backup a hot-spare.

if it's a test/development system that's meant to possibly break, you shouldn't be driving anywhere at 0300 in the morning anyways.

wrkronmiller · on Dec 10, 2020

That implies you are going to be running prod in the cloud. Unless you're developing against purely synthetic data, the data transfer costs are potentially astronomical.

z3t4 · on Dec 10, 2020

Just because you use "the cloud" doesn't mean you don't need backups. "the cloud" also have downtime and other failure. When deploying to the cloud you have to factor in the cost for moving to another provider if/when it will be needed.

onion2k · on Dec 10, 2020

Dell 1U servers on eBay right now for under $500 with 128GB of RAM

In part 2 the author says "Had we chosen max-instances to be “2”, our costs would’ve been 500 times less. $72,000 bill would’ve been: $144". In other words, that $500 server is several times more expensive than it would have been if Firebase and GCP had saner defaults.

jmull · on Dec 10, 2020

That $144 would have been for a single two-day test.

Anyway, getting caught up in specific remediations that could have prevented this is beside the point. For development you want a safe testing environment because mistakes, gaps, misunderstandings, bugs are a fundamental part of it. The entire point of tests and testing environments is to discover the problems you know exist but need to test to find.

walrus01 · on Dec 10, 2020

Yes I would agree that having automatic-scaling set to effectively infinite by default is not the best choice for the end user who is paying.

But for the cloud operator, when somebody's runaway application results in a $15,000 bill that has to be paid, sure...

As to whether letting people's runaway things scale up infinitely is an intentional choice, I couldn't say.

reddit_clone · on Dec 10, 2020

For me the needle swings towards Malice, away from Mistake on this one. At the very least callousness

Add to the long list of disappointments at humanity:

- Late fees were a big part of BlockBuster's business.

- Police departments factor traffic fines into their budget.

- Thousand other dark patterns that are unethical but not illegal.

JimDabell · on Dec 10, 2020

> this post quite honestly just makes me smirk.

It seems pretty callous to laugh at somebody else’s $72K misfortune, especially as they took reasonable steps to set a budget on the platform.

walrus01 · on Dec 10, 2020

From my point of view after doing this for 20 years, it's like seeing the past 12 years of the "put everything in the cloud" era, of new different people repeating exactly the same mistake over and over again.

It's like if you lived near a public park with particularly aggressive geese that return every year, and watched new ignorant groups of people get chased by the geese every spring.

It's not callous - it's the perspective of the people who are responsible for the hypervisors that run underneath the VMs and services that cause some of these massive billing outrages.

scarygliders · on Dec 10, 2020

I quite agree with everything you've said in this and your other post.

My development environment? : My own dual-booting Windows/Linux PC with 32G RAM and a few TB of SSD. Not to mention the Nvidia RTX graphics card for gaming...

I either spin up a VM to test stuff, or spin up a Python virtualenv. Postgresql also running on this machine. Whatever's needed. Need to emulate Stuff Happening From Different Servers? Why just spin up another few VM's - assign them the minimum resources required to get them doing what they need to do, set up your VM network etc. Any decently specced desktop machine can do that, never mind a noisy rack system - considering they're way better and vastly more powerful then the PC's we had 2-5 years before that, which themselves were vastly more powerful than the ones before them, and so on...

Result? Can develop at home to my heart's content, then when it comes to deployment spin up a remote VM on e.g. DigitalOcean and take it from there.

At the end of the day, "sErVeRlEsS" (I just don't like that term, for some reason it rubs me up the wrong way, perhaps because of...) just means "running stuff on someone else's kit" - the same as "tHe ClOuD", so if I'm going to be developing some system & software, I'd rather be doing it locally, setting up whatever's needed to get it running, and once satisfied, deploying it.

Like you, I see either the same people, or new people, simply Not Learning From The Past. There are many good reasons why things were done like they were - developing on a system you own, for example, rather than spinning up all sorts of Cloudy Things or "serverlessy things" right from the start.

Hardware is cheap - you don't need a supercomputer to run the beginnings of your latest Supah Scalable System[tm], you just develop and run it on a reasonably up to date box, and, sure, when you get to the stage where you need more space/bandwidth/whatever, that's the point where you deploy to some Cloudy Thing or SeRvErLeSs Thing.

walrus01 · on Dec 10, 2020

My personal home office development environment at the moment, done on an ultra low budget, is a dell precision t5600 mid tower workstation PC (dual xeon, e5-2630) that I got for $350 with 64GB of RAM in it, upgraded it to 128GB, and put a $150 Samsung SATA3 SSD in. It's small and relatively quiet and sits under my desk tucked in a back corner with just a power cable and a few ethernet cables plugged into it.

Maybe some time in the near future I'll add a 2TB HDD that I have sitting around into it so that I can create VMs that have a 'fast' boot/root disk, and also give them some lvm partitions on a big slow disk.

It's running debian stable amd64 and is set up as a xen dom0 hypervisor, with 768MB of RAM assigned to the dom0 and the rest available for VMs.

The amount of capacity that's available there to create random PV or HVM VMs with as much RAM as I could want, is more than sufficient for my personal needs. If I need anything bigger I'll make it a more formal process and put it on a machine at work.

1337shadow · on Dec 11, 2020

FaaS is what they call Serverless I guess, anyway, that seems a step backward to me, like going from FastCGI back to CGI, and somehow they market it as "progress".

At least, they could be using OpenFaaS or something, or even free software like firebase such as kuzzle.io or Mozilla Kinto.

jedimastert · on Dec 10, 2020

> It's like if you lived near a public park with particularly aggressive geese that return every year, and watched new ignorant groups of people get chased by the geese every spring.

You're not really helping your argument here. Particularly if people have been attacked for over a decade and no one has put up a "aggressive geese" sign

walrus01 · on Dec 10, 2020

Quite literally in my specific area, the former is true, and also it's a fact the city government has put up a number of signs in the nesting area. Still happens.

sdf131 · on Dec 15, 2020

What misfortune? They made a 100b requests and used 16k CPU hours because because they were careless

This cost money to Google, they're the offended party here

sofixa · on Dec 10, 2020

Oh please, you can't be serious?

First, that $1200 server costs fixed money upfront, and then you pay per month for colo, and for internet, which usually includes a fixed bandwith cap or limits, with bursts which you pay for. So no, it's not fixed.

Second, a server you have to maintain, harwdare and software-wise, is much more complex, and takes much more time, than a managed service. You want a database? Install it yourself, maintain it yourself, backup it yourself, monitor it yourself. And same with everything else.

Third, there's zero redundancy in your "setup". If you want it with the most basic redundancy, you triple the costs (second server, extra networking equipment, etc.).

Fourth, geo redundancy/distributedness? Please. Good luck if you have someone far away who wants to visit your site.

Fifth, let's say you need to scale. Like, you get 10 more users today than you did yesterday, or you get featured on HN or Reddit or local news or whatever. F. You're looking at months and a lot of cash, upfront.

"A big collection of bare-metal hypervisors" makes sense in some cases, but don't pretend it doesn't come with a non-negligible time spent maintaning it and requires significant upfront capital and man hours to do the same you get easily on a public cloud platform (databases, message brokers, object storage, etc. etc . etc. etc. etc. etc.).

walrus01 · on Dec 10, 2020

yes, I am serious, because as described in the original post this was somebody's test/prototype environment. Which is the ideal use case for a DIY scenario, until you're ready to send things into production.

I have seen people spend thousands of dollars on a cloud hosting platform to develop and test something when it could have been done equally well on a 4-year-old desktop PC sitting on somebody's desk. If they had only thought to bother installing the same (debian, centos, whatever) environment + packages + custom configuration on it.

mmcconnell1618 · on Dec 11, 2020

But one big benefit of cloud providers is that you can spin up those servers for testing and when it doesn't work out as expected, you can just magically make them go away. If you're putting out the capital to buy servers in a rack, you have to use them all the time for the cost benefit to work out. A regular test environment that is used all the time? Yes, that could possibly be cheaper if you purchase the hardware but you also need to amortize the cost of purchasing new servers every 3 to 5 years to get the equivalent of the equipment provided by the cloud provider.

MattyMc · on Dec 10, 2020

> writing as somebody who runs a big collection of bare-metal hypervisors for ISP infrastructure purposes.

I run a cloud SaaS company (3 employees). If I had the skillset that it sounds like you do, I might be inclined to hose on bare metal. But I don't. I don't know what a 1U dual socket server is.

It would take me some time to build these skills, and to match the agility that the cloud offers. I don't think it's worth my time, and probably not the author's time, either.

walrus01 · on Dec 10, 2020

absolutely an understandable concern. One way of abstracting away the need to own or maintain physical servers while still achieving a definitive fixed monthly cost is to do as this other commenter has done, renting dedicated servers from a company that specializes in such:

https://news.ycombinator.com/item?id=25372912

salmonlogs · on Dec 10, 2020

And this comment makes you look incredibly naive and narrow minded.

Running some code on a CPU != running a startup. Great you can buy a Dell server on eBay, or you can build a powerful desktop, or rent a VM or get a droplet or scrape on lowendbox. These are not a secret and there is a great reason no one does this other than hobbyists and neckbeards.

You do your testing and it works, then what? You have to deliver scalable reliable systems in production that require identity management, security, backups, resiliency, reliability, various networking services and a million other supporting services and all the systems that come with it. Never-mind actually scaling the application, monitoring it and all the tools, systems and processes needed to run reliable systems in production.

The eBay servers provide you exactly zero of that and you've just wasted time setting up an environment that is a snowflake and doesn't represent reality. Testing on the cloud on exactly the same platform you would use for production has a lot of benefits when you look at value as limited developer time delivering value to customers and the business.

Whilst the $1200 server on eBay might be cheap today, you are entirely missing the hidden cost of lost time when your team of developers costing $M/year are wasting on testing in an environment that doesn't help them find and solve production issues. You don't need many hours of wasted time or downtime to lose all of your so called cost gains.

Optimising for absolute minimum cost is a fools errand that only slows down actually delivering production systems that deliver value to your customers.

Please spend some time thinking bigger about the opportunity cost and value delivery of technology beyond the immediate dollars and cents - it might surprise you.

1337shadow · on Dec 11, 2020

Really, "a million other supporting services and all the systems that come with it" ?

You get a server with SSH, then you need something expose your container stacks over HTTPS like Traefik (which auto configures), and something for alerting such as Netdata (which auto configures too!), both of which are just a single binary to configure and setup, and probably it won't take long before you have scripts to automate that like we do[0]

Not only do you get amazing prices[1] but ...

You get to be part of an amazing community running the world on Free Software

But yeah, maybe I'm "missing the bigger picture" by "not locking myself in proprietary frameworks"

[0] https://blog.yourlabs.org/posts/2020-02-08-bigsudo-extreme-d... [1] https://oneprovider.com/

salmonlogs · on Dec 11, 2020

I'm a huge fan of open source and open standards, in fact I always push for and expect portability and avoid proprietary systems where possible. Abstractions like Kubernetes are a fantastic middle ground to provide portability across platforms whilst taking advantage of cloud provider services where they exist. The same for apps and frameworks built on open standards like Kubeflow and Apache Beam.

The supporting services and systems come when you run services that require strong guarantees for reliability and resiliency, and meeting the needs of different lines of business.

If I think of a mid-size company that wants to run these kind of workloads and demands minimal downtime, resilience against local disaster and minimal data loss: - Customer facing applications in a reasonable scalable manner to meet peaks and troughs of demand without needing to size of peak demand

- CRM/ERP systems to manage customer data, payments, sales processes and inventory that CANNOT have corrupted or lost data

- Data platforms for running reasonable scale analytics and analysis on reasonable size volumes of data (say a few Petabytes online accessible, analyse 10s of Terabytes per quer)

- Capability for mid-level machine learning and access to modern acceleration hardware, up to date GPUs and maybe some NVIDIA Ampere type equipment

- Tools and platforms for operations and security that can capture, store and analyse all the logs produced by all those systems, plus some half decent cyber services - network level netflow analysis, maybe IDS if you are feeling fancy, endpoint scanning and analysis, threat intelligence capabilities to correlate against all of that data

- Tooling and platforms for developers - source control, artifact repositories, container registries, CI platforms like Jenkins ideally with automated security scanning integrated, CD for deployments like Spinnaker to canary and deploy your releases safely

- Networking for all of that equipment, ideally private backbones, leased Ethernet or MPLS - and that all needs to be resilience, redundant and duplicated

- Storage for all of the above that meets performance and cost needs, replicated, and backed up offline

Yes, you can do all of that yourself! But let's be clear, buying a server on eBay is not even a fraction of 1% of the reality of running real infrastructure for real systems for real businesses. There ARE reasons why you might do this but that is increasingly the exception due to either extreme scale, regulatory and privacy requirements typically from data sovereignty or very unique hardware requirements.

1337shadow · on Dec 12, 2020

I'm not talking about /buying/ a server, but renting one as a service.

99.9% uptime is plenty enough for 99.9% of projects and that's easy to achieve with one server, k8s is not necessary here. You're not concerned with MPLS or whatnot when you rent a server.

I can tell because I'm running governmental websites on this kind of servers actually, with over a thousand admins managing thousands of user requests. I've been deploying my code on servers like that for the last 15 years and it was great really, also got fintech/legaltech project in production and much more.

I guess the project you're describing falls more in the 0.1% of projects than 99.9%.

salmonlogs · on Dec 12, 2020

The issue for me is that 99.9% uptime isn't a useful or meaningful metric. End users only care about the experience and if the application isn't performant, reliable and durable it doesn't matter if the lights are flashing - they will tell you that it's not working as intended. And when you rely on SLAs from third party providers the liability is not equally shared; they might credit you some % of your bill if its offline, but your reputational impact and opportunity cost is likely orders of magnitude greater. You also can't control how that 99.9% will happen, and more often than not its going to happen at the worst possible time (payroll dates, reports due, board need statistics etc.)

Mitigating these failures will always lead you down the path of replication, load balancing, high availability or at the very least frequent backups and restore strategies. And all of that is going to need to be done across multiple physical locations because I am never going to stake my reputation on a single physical site not losing power, connectivity or cooling. Now you are in the realms of worrying about network reliability, bandwidth availability for those replication and backup services in a way that doesn't impact user applications. And monitoring all of that, managing failures etc. etc.

As someone who helps organisations with their IT strategy and overall budget allocation process the focus is always on delivering reliable applications to customers and business users. Using a cloud provider helps us to ignore all the complexity behind the scenes that require significant investment in people and resources to manage once you hit a non-trivial scale. Paying a premium to do that is absolutely worthwhile compared to the downside of it going wrong, and the opportunity cost of wasting time on minute details that do not add value.

[Edit] And for context I DID used to buy servers one eBay for testing and development, and then migrated to bare metal colo, and all the while thought I was winning and it was cheaper. But over the years I've experienced enough issues and worked with enough companies to understand this was a false economy and now see the errors of my decisions and seek to help others avoid them.

1337shadow · on Dec 14, 2020

Have you seen link[0] in my comment? automated backups are of course a big part of the plan but replication is not an alternative to backups in my book anyway.

I'm not talking about buying servers and colo, but about talking about renting servers as a service[1], where you get the benefits of dedicated hardware but not the inconvenience.

The added security that we have by "not sharing our hardware"[2] also deserve to be mentioned here.

[0] https://blog.yourlabs.org/posts/2020-02-08-bigsudo-extreme-d... [1] https://oneprovider.com/dedicated-servers-in-north-america [2] https://media.ccc.de/v/33c3-8044-what_could_possibly_go_wron...

salmonlogs · on Dec 14, 2020

Yes I read your blog and it seems like you've written plenty of shell scripts and utils to try and abstract and automate away infrastructure, but it feels a lot like you are trying to re-invent the wheel when all of this and much more is available in major cloud providers as a service.

One item that stands out for me is that your backup is a couple of shell scripts and you mention that you would dump your database to a different RAID array. That means you are now on the hook to procure/rent, manage, update and monitor that RAID array. And you even call out that you don't include offsite backups so you are at risk of total loss because you are using a single physical site for your prod data AND backup data.

You mentioned above that you are "not locking myself in proprietary frameworks" - but in the process you have built custom one off scripted systems that are bespoke. If you leave or your consulting engagement ends it will be very hard for someone to take over and manage and maintain your systems - because your design, configuration and implementation is effectively lock in to YOU as a person and your consulting company.

Personally I would rather trust a cloud provider to offer something like backup as a service where they can handle geographic replication, snapshots, restores for me as a service and deal with all the hardware, disk replacements, hardware monitoring and network fun that comes with it. The human cost of moving to another cloud provider is not that large and I can easily hire a person or consultancy that has knowledge of Cloud Provider A and Cloud Provider B to make that transition because their services and systems are well documented, conform to a contract and there are training and certifications for how they work.

I still hold my opinion that taking advantage of services offered by cloud providers is value add in the context of running a business.

Also I would much rather trust a cloud provider with a big team of security experts to run my infrastructure than a random company renting me some servers. If you are getting them as a service then there is still a shared admin control plane, likely management type networks and infrastructure around it that are managed for you by a third party. Trusting their team, processes and security capabilities is a very high bar to meet.

craftinator · on Dec 10, 2020

Prick.

dang · on Dec 10, 2020

Would you please stop posting unsubstantive comments to HN and stop breaking the site guidelines? You've been doing it a lot and we ban that sort of account. I don't want to ban you because your good comments are good, but the bad comments are like mercury: they build up in the system and poison things.

The rules apply regardless of how bad or wrong another comment is, or you feel it is.

https://news.ycombinator.com/newsguidelines.html