You're forgetting the cost of fighting IT in a bureaucratic corporation to get them to let you buy/run non-standard hardware.
Much easier to spend huge amounts of money on Azure/AWS and politely tell them it's their own fucking fault when they complain about the costs. (What, me? No, I'm not bitter, why do you ask?)
Unfortunately, the fight goes even further than that when you go against the cloud.
Last week I was at an event with the CTOs of many of the hottest startups in America. It was shocking how much money is wasted on the cloud because of inefficiencies, and because they simply don't care how much it costs.
I guess since they are not wasting their own money, they can always come up with the same excuse: developers are more expensive than infrastructure. Well... that argument starts to fall apart very quickly when a company spends six figures every month on AWS.
I'm on the other extreme. I run my company's stuff on ten $300 servers I bought on eBay in 2012 and put inside a soundproof rack in my office in NJ, with a 300 Mbps FiOS connection using Cloudflare as a proxy / CDN. The servers run Proxmox for private cloud and Ceph for storage. They all have SSDs and some have Optane storage. In 6 years, there were only 3 outages that weren't my fault. All at the cost of office rent ($1000) + FiOS ($359) + Cloudflare costs and S3 for images and backups.
With my infrastructure, I can run 6k requests per minute on the main Rails app (+ Scala backend) with a 40ms response time and plenty of resources to spare.
Only a Rails developer would think 6k requests per minute with 40ms latency is reasonable with all that hardware. If you rewrote it you'd probably only need 1 server, but you'll probably make an argument about how developer time is more valuable :)
I'm talking about a real application here. With 100s of database and API calls on each web page load. I could make the whole thing in Golang or Scala and that would be at least one order of magnitude faster. But then I would have to throw away all the business knowledge that was added to the Rails app.
For instance, the slowest API call within the 40 ms is one that hits an Elasticsearch cluster with over 1 billion documents and is made from a Scala backend using Apache Thrift. There's a lot of caching, but still, the long tail and customization will kill caching at the top level.
It sounds very similar to the Rails frontend I helped replace at Twitter: no business logic was thrown away, and it was done without any loss in application fidelity. We did get approximately 10x fewer servers with 10x less latency after porting it to the JVM as a result. However, without the latency improvement, I don't think we should have done it. Fewer servers just isn't as important as the developer time necessary to do the change, as you just pointed out. Likewise, using the cloud to simplify things and reduce the amount of effort spent on infrastructure is the main driver of adoption. There is clearly a cross-over point though where you decide it is worth it. The CTOs you are speaking of are making that choice, and it probably isn't a silly excuse.
It's a matter of dosage. You might be talking about how another tech stack would perform better at this metric, but the price of that is the company would have been simply unable to ship anything in the time they could afford.
Swinging the conversation beyond the dosages of either side doesn't produce interesting insight
I'm not sure how many outages I could avoid using Heroku, but I guess at least a few.
One time I was using Docker for a 2 TB MongoDB and it messed up the iptables rules. I noticed everything was slow for a few days, until the database disappeared; when I logged in to check, there was a ransom note.
I flew from Boca Raton to NJ to recover the backup and audit whether that was the only breach. That was the longest outage.
Like Rome, this infrastructure was not created in one day. Adding Optane storage is something more recent, for example. Or adding a remote KVM to make things easier to manage than dealing with multiple DRACs, which I did after I moved to Florida.
But I'm not against using the cloud. I'm actually very much in favor. What I'm against is waste.
In my case, being very conservative with my costs while still having a lot of resources available allowed me to try, and keep trying, many different ideas in the search for product/market fit.
Are you using the Optane SSDs for Ceph, as you mentioned in your original comment? Curious what benefit you're seeing; if they're not for Ceph but for something else, would you mind commenting? We're looking to share best practices with the community we're building over at acceleratewithoptane.com on how to take advantage of Optane SSDs.
Anyone who thinks spending $10 instead of $1 is worth it just so you can book the expense (presumably for a tax write-off which might save you 30%... MAYBE?) needs to stay away from finances.
You can typically take the full depreciation in the year you purchased the equipment under IRS Section 179, up to a limit which varies depending on which way the wind is blowing in Congress. For 2018 the limit is $1MM. Whether or not it's more beneficial to you to take the depreciation over time is a question for your accountant; technically, if you later sell the equipment, you're supposed to recapture the revenue from the sale for tax purposes.
That's a very passive-aggressive way to deal with it. If that's your only option you are really a cubicle slave in corporate hell.
In my opinion it's better to escalate upwards with proposals and not back down easily. You just have to frame it correctly and use the right names and terms.
* Usually big companies understand the concept of "a lab" that has infrastructure managed outside the corporate IT. Once you fight the hard fight, you get your own corner and are left alone to do your job and can gradually grow it for other things.
* Asking forgiveness works even for large companies. Sometimes someone is not happy and you 'are in trouble' but not really. You just have to have a personality that does not care if somebody gets mad at you sometimes.
It has to be. Even as irrational as the market can be, I don't see people sustaining this kind of spend on public clouds -- the cultural barriers and perverse incentives that prevent effective use of private / hybrid cloud have to erode eventually.
That's not borne out by history, especially in the face of "enterprise" hardware, software, and support, which, in many large companies, is being replaced by cloud (at lower cost!).
For smaller companies, it may be a different story, especially at the next economic downturn, especially if VC money becomes scarce enough for long enough.
There are armies of salespeople brainwashing executives about "the cloud". At my last job it only took a few years before every single C-level was spouting off recycled cloud propaganda
I wanted to put an Ubuntu partition on my work PC for Python deep learning work, as I'm significantly faster and happier on it. When I mentioned it to the sysadmin, he said "I'm not allowing that. Linux is like Wikipedia, any idiot can contribute to it. Windows is made by professionals so it has to be better."
Wow. Sometimes I wonder how these people even get hired. I guess a decent workaround would be if you can just get Docker approved, then you can do what you want.
This is absolutely the right answer as to how AWS got so big in the first place. Capex and IT are huge pain points. Starting something up on the free tier isn't. Once something's running and providing value, spending money on it becomes a "necessity" and the obstructionism goes away.
In most companies, AWS just becomes a new front-end to the same old IT bureaucracy, and dev teams are still disallowed from creating their own instances or EMR clusters or setting up new products like their own Redshift instance or ECR deployment solution.
Not every company is a single-page webapp with a simple service portfolio... I work at a place with 3,000+ developers and over 700 applications - there is no way in hell our cloud portfolio would have any standards if we didn't have a robust operations/engineering team making it work. Sometimes, even when you adopt the cloud, you realize that your operations model is even more important, and there is nothing wrong with that.
The real problem is believing that there is a pure model that works for everyone.
The parent's point was that if the ops/deployment engineering team is unresponsive to the needs of developers, it may end up being better to run with no standards in cowboy mode. If the ops team is extremely fast and highly skilled, they will be a boon, full stop. If they are unskilled and politically obstructive, they will be a curse, full stop.
In no way am I responsible for our migration going well. But I do feel if a company this size can do it, then others failing to do so says more about their architecture teams than the endeavor's futility.
This has less to do with cloud vs own hardware and more about how the company is structured. I've worked in companies before with an ops department: a few people responsible for managing all the cloud servers. All the devs (like me at the time) work locally and have access to an isolated dev/uat environment provided by that team. That team had most of the show automated, they weren't really provisioning any machines by hand.
If something bad slipped into prod there was a process in place of how a dev would work together with someone from that team to fix it.
> pre-devops workflows
I guess this is what you are calling a pre-devops workflow? In a lot of fields not all devs are allowed to see/touch the complete production environment. Not everyone can go the netflix way of "everyone pushes to production and we'll just fix it when it breaks".
I dunno, that sounds like a good way to do it.
I kinda shake my head if people call for devs to handle all the infrastructure and everything in prod. Why should a developer concern himself with the details of scaling and tuning Postgres, Elasticsearch or load balancers? If you don't outsource that, it's the job of the ops team. And if that's the responsibility of the ops team, there's no reason for devs to have access to those systems beyond the application layer. That just seems like a good way to split the work of running and scaling an application.
Now if we're talking about applications, that's something different. In fact, it's the contrary. I am a big proponent of having developers manage and configure their own applications, including prod. It's fast and makes them develop better applications. However, doing this for critical systems in a responsible way requires automation to enforce best practices. At our place, devs don't have root access to production application servers, but they do have permissions to configure and use the automation controlling those servers. And it's safe and rather risk-free because the automation does the right thing.
And it's also a different thing if we're talking about test. By all means, deploy a 3-container database cluster to poke around with it. I like experimentation and volatility in test setups and PoCs. Sometimes it's just faster to solve the problem with 3 candidate technologies and go from there. Just don't expect things to go into production just like that. We'll have to think about availability, scaling and automating that properly first.
I’m a machine learning engineer. I’d love it if I only needed to worry about machine learning and application interfaces.
Instead, because ops / infra arbitrarily block me from what I need, I have to be an expert on database internals, network bottlenecks, app security topics, deployment, containers, CI tools, etc. etc., both so that I can "do it myself" when ops e.g. refuses to acknowledge some assumption-breaking GPU architecture we need, and so that I can deal with every single arch / ops debate or bureaucratic hurdle that comes up, endlessly justifying everything I need to do far beyond any reasonable standard.
For me, managing devops shit myself is a necessary evil. Far better than the case of unresponsive / bureaucratic ops teams, but worse than the unicorn case of an actual customer service oriented ops team that actually cares rather than engages in convenience-for-themselves optimization at every turn.
This is so frustrating to read as a service provider.
Especially because well-done infrastructure scales so much harder than the manual style. We're currently dealing with a bit of fallout from a bureaucratic system we had at one point. People are so confused, because to get a custom container build, all they do is create a repository with a specific name and a templated Jenkinsfile, and 10-15 minutes later they get a mail with deployment instructions. It's so easy, and no one has to do anything else.
> I guess this is what you are calling a pre-devops workflow? In a lot of fields not all devs are allowed to see/touch the complete production environment. Not everyone can go the netflix way of "everyone pushes to production and we'll just fix it when it breaks"
Unwillingness shouldn't be confused with inability. Most companies can do this if they're not handling PII/PHI. It takes investment in smart people and time, but most companies besides pure software companies see tech as a cost center and avoid investing in better infrastructure and platform systems.
Very much so. Back when I was arguably doing devops for BT using PR1ME superminis, a mate who was in operations at an IBM shop was horrified that our team were allowed to write our own JCL.
A lot of those companies don't have cloud architects, just Amazon/Microsoft/Google sales reps talking into the ears of MBAs about turning capex into opex.
In my current company (a large bank), our team (~50 people) is trying to leave the on-premise infrastructure and move to the cloud because the on-prem stuff is managed in such a way that it's hard for us to accomplish anything. It will probably cost a lot more to use the cloud, but we're gladly willing to pay for it if we can shed the bureaucracy this way.
I've read that in medieval times, kings sometimes abandoned their castles for new ones after too much feces had accumulated in them (as people were shitting wherever back then). I feel a strong parallel between that story and our situation.
Yep. No one telling me no anymore, and I can write Lambdas to replace cron jobs, use RDS to replace DBs, use S3 and Glacier to replace storage, etc. Fargate is awesome too. No gatekeepers or bureaucracy, just code and git repos. That's why AWS is so awesome.
And, I can show exactly how much everything costs. As well as work to reduce those costs.
AWS added to a small, skilled dev team is a huge multiplier.
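For the cron-to-Lambda piece, here is a minimal sketch of what that can look like (the bucket/key names are made up for illustration, and the nightly schedule itself would live in an EventBridge rule rather than in the code):

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Stand-in for a nightly cron task: copy the latest export into an archive bucket.
        # Bucket and key names here are hypothetical.
        s3.copy_object(
            Bucket="my-archive-bucket",
            Key="exports/latest.csv",
            CopySource={"Bucket": "my-app-bucket", "Key": "exports/latest.csv"},
        )
        return {"status": "ok"}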
>>politely tell them it's their own ... fault when they complain about the costs.
They don't care. Neither do their bosses. Not even the CFO or the CEO.
The only people who will care are the investors. And sometimes not even them, they will likely just sell and walk.
Only people who are likely to care are activist investors with large block of shares, those types have vested interests in these things.
Part of the reason why there is so much waste everywhere is that organizations are not humans. And most people in authority have no real stake or long-term consequences for decisions. This is everywhere: religious organizations, companies, governments, etc. Everywhere.
Could be very interesting when we have another economic downturn to see if attitudes on this change. It certainly seems more cost-effective to run one's own technical operations rather than offloading onto AWS / Google / Microsoft.
It will only matter if the infrastructure is a large part of their cost of technical operations (as opposed to labor) or, more importantly, overall cost of operations.
In another thread on here (unfortunately I can't recall which), an executive shared a sentiment along the lines of "I don't care if it's 1% or 0.1% of my overall budget".
Perhaps 0.9% would become more significant during a downturn, especially for startups if VC money dries up.
When I worked at Google, I really missed the dual-monitor setup I had had at my previous job. I asked my manager how to get a 2nd monitor. Apparently, since I had the larger monitor, I was not allowed to get a 2nd one without all kinds of hassle. I asked if I was allowed to just buy one from Amazon and plug it in, and I was told no. I finally just grabbed an older one that had been sitting in the hallway of the cube-farm next to us for a few days, waiting to be picked up and re-used. I'm sure somebody's inventory sheet finally made sense 2 years later when I quit, and they collected that monitor along with the rest of my stuff.
I wanted a second 30" monitor, so I filed a ticket. They sent me a long email listing reasons why I shouldn't get a second monitor, including (numbers are approximate, employee count from 2013 or so) "If every googler gets an extra monitor, in a year it would be equivalent to driving a Toyota Camry for 18,000 miles."
I'm thinking "this can't possibly be right", so I spend some time calculating, and it turned out to be approximately correct. So I'm thinking, "we should hire one less person and give everyone an extra monitor!". I replied "yes, I understand, go ahead and give me an extra monitor anyway". They replied "we'll require triple management approval!". Me: "please proceed". The first two managers approved the request, then the director emails me "Why am I being bothered with this?". Me: "they want your approval to give me a second monitor", him: "whatever".
And finally, a few days later, I got a second monitor...
> They sent me long email listing reasons why I shouldn't get a second monitor
I had no idea Google was so cheap. Gourmet breakfast, lunch, and dinner every day? No problem. A couple hundred bucks for a second monitor? Uh... it's not about the money, we're, uh, concerned about the environment.
When I started at Amazon they gave us dual 22" monitors. I bought dual 27" Dells and a gfx card to back them and plugged it in. I also explained how much I was swapping with 8 GB and a virt (45 minutes of lost productivity a day), and RAM was $86. My manager happily expensed RAM for me and the rest of the team. 2 years later that was the standard setup. Now I have 7 monitors and 3 PCs from multiple projects and OSes and hardware re-ups and interns. None funded by me. The Dells are happily on my quad tree at home.
Amazon also allows you to BYOC and image it. Images are easily available. Bias for action gets you far.
A colleague at my office recently brought in a 43" 4K TV to use as a monitor. He was reimbursed immediately.
After a few days I bought one, too, took it home, and took the 2x 23" monitors it replaced to the office.
In a different department another friend systematically acquired new large monitors for her dev team. Her management chain complained about the expense.
Bias for action not only gets you far, it attracts folks to your team inside a company. ;-)
At an old job I managed to snag a 30" monitor from a colleague who was moving to a different job. He managed to get it in the brief window that it was offered instead of 26" ones. When I announced that I was moving on, the vultures started circling over who would get the monitor.
The inventory sheet will never be reconciled over that monitor, unless it breaks.
My company has the same policy - one monitor per employee. When I wanted a second one, I bought my own, set it up over the weekend, and when someone from IT finally noticed it a month or so later, I just said "Oh, I found it in the hardware e-waste pile over by the server room"
Not sure if I can take it with me when I leave, but hey, at least now I have my second monitor and it cost less than $300 - I'll probably bequeath it to a fellow employee.
Reminds me of the time when LCD monitors were just taking over. The software team had budgeted and gotten approval for a nice new development system along with a large LCD monitor. As we tried to place the order, the IT manager got hold of it and made a big fuss. Given that the CEO had even approved it, it caught us off guard. Best we could tell, she was butthurt about us getting LCD monitors before her team did.
There was a choice between a "big" single monitor, or a "small" dual monitor setup. I had just not realized how much I used the framing of the 2 monitors to stay organized, and I chose the "big" single monitor setup. That turned out to be a mistake. The point is that it was easier to "dumpster dive" for a second small monitor than it was to get one in a legit way...
At my employer, I happened to get one of the first machines they built with SSDs (in 2014) and one of the last machines they issued as desktop instead of laptop.
A year later they stopped issuing SSDs and I have to do the whole "managerial approval" thing to get any variation from the "standard" machine, so I am more than a year overdue for a replacement.
When decent monitors cost like $100-$150, it's nuts. I've got a quad 27" setup. I'd like to have more, but things get a little dodgy when you run out of real video outputs and have to start using DisplayLink USB adapters.
I always waited until my interns left and then "collected" their monitors. Eventually I had too many monitors (Just managing the asset tags and reconciling things was a huge pain.), and replaced everything with a small 24" 4K.
You're neglecting management costs. IT teams don't buy hardware with corporate credit cards, they have to work through pre-existing requisition processes that properly budget for the hardware, make sure support contracts are in place, etc. You have to migrate whatever workload off the server where you installed the GPU (politically problematic since Murphy promises you that your users will be connecting to the server by its IP address, or the workload is "mission critical" and can't be stopped, or whatever) and reserve the server for GPU work. If you're running something like vCenter to virtualize your datacenter resources, you need to make sure that VMware picks up on the GPU in the new machine, doesn't schedule new VMs without GPU requirements on the server with the new GPU, etc.
Comments like these (why can't I just do X?), it's like the difference between being single and being in a serious relationship. When you're single, you can do whatever you want. When you're in a serious relationship, you can't just make whatever decisions you want without talking to your partner. Well, big corporate enterprise is like that on steroids, because instead of having one partner, you have several or dozens, and everyone needs to buy in.
However, this cannot justify just any disparity in prices. Data science FTW!
Capex => Opex has a name. It's called "a loan". So let's model cloud usage as a loan. Let's assume you want, for an employee, a machine learning rig, and you set depreciation at 2 years. Let's also assume that the article is correct about the ballpark figures, somewhere around $1400/month for renting, and $3000 for the machine.
And let's ignore the difference in power usage, extra space, bandwidth, ... (which is going to be 2 digit dollars at the very most anyway, since you need all that for the employee in the first place)
So how much interest does the cloud charge for this Capex => Opex change? Well, $33,600 of rent against a $3,000 machine works out to 613.5% per year. Pretty much every bank on the planet will offer very-low-credit-score companies 30% loans; even 10% is very realistic.
There are no words ... Just give the man his bloody machine. Hell, give him 5, 4 just in case 3 fail, and 1 for Crysis just to be a "nice" guy (you're not really being so nice: you're saving money) and you still come out ahead of the cloud.
We both know why this doesn't happen, the real reason: you don't trust your employees. Letting this employee have that machine would immediately cause a jealousy fight within the company, and cause a major problem. That's of course why the GP comment is right: leave this bloody nightmare of a company, today.
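Spelling out that arithmetic as a quick sketch (the exact headline percentage depends on how you amortize; the point is the order of magnitude):

    machine_price = 3_000       # buy it outright (capex)
    cloud_month = 1_400         # or rent the equivalent (opex)
    months = 24                 # the 2-year depreciation window

    total_rent = cloud_month * months
    premium = total_rent - machine_price
    simple_annual_rate = premium / machine_price / (months / 12)

    print(f"total rent over {months} months: ${total_rent:,}")
    print(f"premium over buying outright:    ${premium:,}")
    print(f"simple annual rate:              {simple_annual_rate:.0%}")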
In most (good) companies, teams/groups have budgets where large portions are at the discretion of the group manager. I would expect group managers to approve such hardware requests. It's much harder to justify $4K bills on cloud services. This is not an IT issue, and I think this thread has gotten derailed a little here.
They seriously can't buy a graphics card and slap it in the PCIe slot?
Well I mean they obviously can...
First of all you're assuming IT will just let you have a decent machine with PCIe slots. It's all about laptops, don't you know. Workstations are so 2012.
Secondly, while they might, after much begging, let me have one graphics card to put into this one old workstation I've scavenged, they certainly won't let me have a machine with top-of-the-line specs with several cards or, heaven forbid, several machines (but they'll happily let me have a laptop that costs just as much for less than half the performance).
We have had luck begging for "just one" to try it out. Then run some benchmarks. We were able to show that the one developer with the fast workstation compiled faster, and therefore was more productive. A few numbers with dollar signs and management told IT to stop stalling and get everybody a good workstation - things happened fast.
You need to repeat that every couple years though.
At our enterprise, we have HP ultrabooks or whatever they are called. Windows 7, 32-bit. You know, it has a decent i5 in it, but there is some specific bottleneck making it terrible. Every employee that sits in an office has this piece of crap.
Takes 3 minutes to start IntelliJ. For some years now, ~50 developers have been able to get MacBooks; luckily I am one of them, else I'd just be pissed off every single work day.
Not to mention you'll be on some awful Dell tower with no 6/8-pin power connectors and the case won't fit anything reasonably powerful, and the IT gatekeepers have no idea what you mean when you explain this problem.
Last time I worked in big corporate I had to use their laptop, even if I didn't plug mine into anything, because of security. At some point said laptop got stolen and I sent a buying request for a new one. It took a year without feedback, and a heated discussion in which my boss accused me of having been too passive about it. I was using my own laptop in the meantime. At that point you can't even call it shadow IT anymore.
Any buying requests go through the direct supervisor, the financial department, the director of the company, then back to the supervisor for the final signature. And because it's Germany, it was on paper. When I say any buying request, I mean any. From any programming book, to any accounting book for finance or any HR book.
You'd think people at the top have better things to do. You think that, until you hear the department heads discuss which excel format files should be saved in.
Typically if you want to buy some hardware in a big corp, you go to some preapproved internal catalog and pick from one or two models of desktop or laptop from vendors like Dell, HP, or Lenovo - whatever was preapproved. Sometimes you can fine-tune the specs, but not in a very wide range.
Buying anything outside of that will require senior level approval and - heaven forbid - “vendor approval study” (or similar verbiage).
I worked at a startup that started doing that. The company was only around 250 people at the time, but the VP of IT was hired from a Fortune 100 company, and he instituted the catalog approach of only 2 supported laptop configs (13" or 15") and one desktop.
When I left the company, the engineering director was working out how to let his team go rogue - basically handling all of their own IT support, the only caveat being that they had to install IT's intrusion detection software, but aside from that, an engineer would be able to buy any machine within a certain budget.
No, not in most corporate IT shops... the card has to be "procured" through the corporate vendor, support has to be in place, drivers for some shitty obsolete OS flavor have to be installed... and that's assuming you can even find a server in the rack that is recent enough to run it all; if not, then the same shit has to happen with the server.
It's all IT's own fault, frankly - or more like the fault of corporations approaching IT as mostly a desktop helpdesk support function.
Hey there, I'm the writer for the article. I have another article planned that will talk about part picking and actual building. For now though, here is the parts list: https://pcpartpicker.com/b/B6LJ7P w/ 1600W psu.
We just go around them and argue after the fact. It is a pain no matter what. But we actually do both, internal custom builds and cloud computing, so no matter what, we are set.
Innovation being sidelined in favor of status quo and 'cover-my-ass' risk management highlights exactly the reason I won't work for a bureaucratic corporation...
Money^, benefits^^, ample PTO^^^, and an 8-5 workday allowing plenty of time for the important things in life like my family are exactly the reasons I do.
^: You know, like real money, not 90k plus 2.5% of something that will probably not exist next year
^^: You know, like real benefits such as low-cost, low-deduct. health insurance, matching 401k, etc.
^^^: As in, real, actual PTO, not this "unlimited" bullshit some startups and now sadly companies are replicating. If you decide to quit and can't "cash out" your PTO, you don't have real PTO.
America, surprisingly, in a southern state no less, where heart disease and diabetes rates are higher. Our work has a grandfathered non-ACA plan (we got to keep our old insurance but means we miss out on "free" stuff like contraceptives and a breast pump, and other banal things, but in exchange, we get much better coverage, and a lower deductible)
You always have at least two options, and this case is not extreme -
1. You can blow any amount of money, if you like to.
2. Or, you can figure out what you are trying to do. Then, learn how to do it better. There is a cheaper way to run in the cloud too - https://twitter.com/troyhunt/status/968407559102058496
Great post. From someone who's done a few of these, I'll make a few observations:
1. Even if your DL model is running on GPUs, you'll run into things that are CPU bound. You'll even run into things that are not multithreaded and are CPU bound. It's valuable to get a CPU that has good single-core performance.
2. For DL applications, NVMe is overkill. Your models are not going to be able to saturate a SATA SSD, and with the money you save, you can get a bigger one, and/or a spinning drive to go with it. You'll quickly find yourself running out of space with a 1TB drive.
3. 64GB of RAM is overkill for a single GPU server. RAM has gone up a lot in price, and you can get by with 32 without issue, especially if you have less than 4 GPUs.
4. The case, power supply, motherboard, and RAM are all a lot more expensive for a properly configured 4-GPU system. It makes no sense to buy all of this supporting hardware and then only buy one GPU. Buy a smaller PSU, less RAM, a smaller case, and buy two GPUs from the outset.
5. Get a fast internet connection. You'll be downloading big datasets, and it is frustrating to wait half a day for something to download before you can get started.
6. Don't underestimate the time it will take to get all of this working. Between the physical assembly, getting Linux installed, and the numerous inevitable problems you'll run into, budget several days to a week to be training a model.
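As a small aid for that last step, here's the kind of sanity check I'd run once the OS and drivers are in (a sketch that assumes PyTorch; swap in your framework of choice):

    import torch

    print("CUDA available:", torch.cuda.is_available())
    print("GPUs visible:  ", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(f"  [{i}] {torch.cuda.get_device_name(i)}")

    # If this prints False / 0, the usual suspects are the NVIDIA driver, the
    # CUDA toolkit version, or a kernel update that silently broke the driver module.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1024, 1024, device=device)
    print("matmul ok:", (x @ x).shape)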
I ran into this exact issue about 2 years ago about build vs rent. Ultimately I chose build.
Here's my thoughts/background:
Background: Doing small scale training/fine tuning on datasets. Small time commercial applications.
I find renting top-shelf VM/GPU combos in the cloud to be psychologically draining. Did I forget to shut off my $5-an-hour VM during my weekend camping trip? I hate it when I ask myself questions like that.
I would rather spend the $2k upfront and ride the depreciation curve, than have the "constant" VM stress. Keep in mind, this is for a single instance, personal/commercial use rig.
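For what it's worth, the "did I leave it running?" anxiety can be blunted with a small scheduled job; here's a boto3 sketch (the region and the dl-training tag are assumptions for illustration):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")   # region is an assumption

    def stop_forgotten_boxes():
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag:purpose", "Values": ["dl-training"]},   # hypothetical tag
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        ids = [i["InstanceId"]
               for r in resp["Reservations"]
               for i in r["Instances"]]
        if ids:
            ec2.stop_instances(InstanceIds=ids)
        return ids

    if __name__ == "__main__":
        print("stopped:", stop_forgotten_boxes())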
I feel that DL compute decisions aren't black/white and should be approached in stages.
Stage 0:
If you do full-time computer work at a constant location, you should try to own a fast computing rig. DL or not. Having a brutally quick computer makes doing work much less fatiguing. Plus it opens up the window to experimenting with CAD/CAE/VFX/Photogrammetry/video editing. (4.5 GHz i7-8700K + 32 GB RAM + SSD)
Stage 1:
Get a single 11/12 GB GPU. 1080 Ti or Titan X (some models straight up won't fit on smaller cards). Now you can go on GitHub and play with random models and not feel guilty about spending money on a VM for it.
Stage 2:
Get a 2nd GPU. Makes writing/debugging multi-GPU code much easier/smoother.
Stage 3:
If you need more than 2 GPUs for compute, write/debug the code locally on your 2-GPU rig. Then beam it up to the cloud for 2+ GPU training. Use preemptible instances if possible for cost reasons.
Stage 4:
You notice your cloud bill is getting pretty high ($1k+/month) and you never need more than 8x GPUs for anything that you're doing. Start the build for your DL runbox #2. SSH/container workloads only. No GUI, no local dev. Basically server-grade hardware with 8x GPUs.
Only in non-North American datacenters. In NA, Nvidia can enforce their driver license, which prohibits use of consumer GPUs in datacenters.
A nice advantage of non-consumer GPUs is their bigger RAM size. Consumer GPUs, even the newest 2080 Ti, have only 11 GB. Datacenter GPUs have 16 GB or 32 GB (V100). This is important for very big models. Even if the model itself fits, a small memory size forces you to reduce the batch size. A small batch size forces you to use a smaller learning rate and acts as a regularizer.
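When the card's memory does force a small per-step batch, gradient accumulation is the usual partial workaround. A PyTorch sketch with a stand-in model (note it isn't fully equivalent to a true large batch, e.g. batch-norm statistics still see only the micro-batch):

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(512, 10).to(device)               # stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    accum_steps = 8                                     # effective batch = 16 * 8 = 128
    optimizer.zero_grad()
    for step in range(64):
        x = torch.randn(16, 512, device=device)         # micro-batch that fits in VRAM
        y = torch.randint(0, 10, (16,), device=device)
        loss = loss_fn(model(x), y) / accum_steps       # scale so accumulated grads average
        loss.backward()                                 # grads pile up in the .grad buffers
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()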
> Nvidia contractually prohibits the use of GeForce and Titan cards in datacenters. So Amazon and other providers have to use the $8,500 datacenter version of the GPUs, and they have to charge a lot for renting it.
I wondered how it is possible that they can restrict the use of hardware after it is bought. After reading the article I learned that the key is the license to the drivers.
So I guess in theory it would be possible for AWS to develop their own (or enhance open source) drivers. On the other hand they would spoil the business relationship with Nvidia and have to do without any discounts.
The problem is that there's already an open-source Nvidia driver -- https://nouveau.freedesktop.org -- but it's so feature-incomplete that it's good only for hobby use. You can't run CUDA on it, you can't run any recent games on it, you can't run TensorFlow on it, you can't run PyTorch on it, etc etc -- it is essentially useless and has been so for many years.
Curious about the MTBF (mean time between failure) rate of a GeForce/Titan series GPU under continuous utilization in datacenter conditions vs a desktop computer with intermittent usage. I don't want to believe Nvidia is just out to stiff cloud providers. Maybe it's to protect themselves from warranty abuse?
The failure rate of GeForce is higher than Tesla (though it's not that much worse in my experience these days). OEMs worked around this by burning the GPUs in for a couple days running HPC or DL code before sending them out to customers and then they binned the failed GPUs into gaming machines where bitwise reproducibility didn't matter as much. Also, it's usually the memory controller that's defective on 1080TIs so it's pretty easy to spot early.
People who do crypto mining use consumer GPUs, and run them 24/7. You would imagine if it was cheaper overall to buy the datacenter edition they would. But they don't.
Yes. Companies are not allowed to restrict the (legal) usage of something I bought. The only thing they can do is refuse support if I used their product in any way not covered by the license terms, and even this is really hard for them in reality.
It's the same with no-resell license terms. Within the EU, companies can't forbid me from reselling something I bought.
We're going full circle here. The EULAs, licenses etc. of US companies that try to restrict how you can use the Hardware/Software you bought are void in Germany (probably most of EU too) and can be ignored. In many cases they void other (legal) parts of those licenses too since the license is partly illegal.
I don't think that's universally true concerning software. I tried to do some research on the legal situation in the EU but reached no definitive conclusion.
However, in the case of reselling software, the corresponding restriction is obviously seen as illegal (or non-binding) in the EU. On the other hand, there are restrictions, such as on copying or on running software X on Y computers, that are valid and binding.
My blanket statement indeed does not cover some fundamental aspects. IP/copyright laws are in place and will restrict consumer rights. So pirating software is still illegal.
BUT AFAIK you can make legal copies of your legally acquired software for backup purposes.
Also, AFAIK a restriction to run software only on a certain type of hardware is not valid in Germany. For example, I'm pretty sure that Apple can't do anything against Hackintoshes here.
Since the EULA comes with the driver, Hetzner (and other hardware rentals) shouldn't be covered directly. They don't download and install the driver for you, so they aren't a party to that contract in any case.
how would Amazon sell access to the GPUs without the customers being able to identify the models? Also, procuring data center amounts of GPUs is bound to leave quite the paper trail.
NVIDIA is attempting to separate enterprise/datacenter and consumer chips to justify the cost disparity. Specifically, they're introducing memory, precision, etc. limits which have major performance implications for GeForce, and there's also the EULA which has been mentioned here. That said, everything at AWS comes at a premium, as they're making the case that on-demand scale outweighs the pain of management/CapEx. This premium is especially noticeable with more expensive gear like GPUs. At Paperspace (https://paperspace.com), we're doing everything we can to bring down the cost of cloud and in particular, the cost of delivering a GPU. Not all cloud providers are the same :)
Paperspace is great for the price (especially for storage) and easy for deployment. However, the customer support is horrible. Turnaround time is around a week, and sometimes they seem to have billing issues. One time I got charged around $100 for trying out a GPU machine for 2 hours (which is the number of hours that showed up on the invoice), and it took them three weeks to eventually issue a refund. That made me decide to switch. Hope you guys can hire more people for customer support or have a better solution for issues like this.
I contemplated building my own machine for deep learning. I'm a geek so I do like building hardware. But I couldn't even justify spending $2k on a deep learning machine.
Paperspace has eliminated my desire to build a deep learning computer thanks to their insanely low prices.
Current prices are around $0.78 per hour for an Nvidia Quadro P5000, which is pretty comparable to a 1080 Ti.
On top of that you can even run Gradient notebooks (on demand) without even setting up a server. This is the future when bandwidth costs are minimal: thin clients, powerful servers.
At the end of the day, I wanted to spend more time tuning the ML pipeline rather than fussing with drivers, OS dependencies, etc
Sure there's lots of things that Paperspace could do better, but their existing product is already leaps and bounds better than GCloud or AWS. AWS and GCloud win through big contracts with large businesses, and I'm just a little guy.
Disclosure: I do not work for Paperspace and am not paid to endorse them in any way. I love their product.
Memory? Because if you can spread your model across multiple GPUs, and you've implemented Krizhevsky's One Weird Trick to switch between reducing the smallest of either parameters or deltas, you're golden.
I thought tensor cores and NVLINK would end up Tesla differentiators, and really great ones at that, but now they're both in the Turing consumer GPUs so I am really scratching my head here.
That said, the EULA is just stupid. I cannot use CUDA 9.2 or later at work because of it. No one is going to audit our computers for any reason ever, period, full stop.
For me, one of the main reasons for building a personal deep learning box was to have fixed upfront cost instead of variable cost for each experiment. Not an entirely rational argument, but I find having fixed cost instead of variable cost promotes experimentation.
This is how I feel as well, especially in a work context. It's liberating to have hardware freely available for use. I'm dreading the day when our computing grid is made more efficient (smaller and tolled).
It's been this way since day 1. NVLINK remains the only real Tesla differentiator (although mini NVLINK is available on the new Turing consumer GPUs, so WTFever). But because none of the DL frameworks support intra-layer model parallelism, all of the networks we see tend to run efficiently in data-parallel mode, because doing anything else would make them communication-limited, which they aren't, because data scientists end up building networks that aren't, chicken-and-egg style.
I continue to be boggled that Alex Krizhevsky's One Weird Trick never made it to TensorFlow or anywhere else:
I also suspect that's why so many thought leaders consider ImageNet to be solved, when what's really solved is ImageNet-1K. That leaves ~21K more outputs on the softmax of the output layer for ImageNet-22K, which to my knowledge, is still not solved. A 22,000-wide output sourced by a 4096-wide embedding is 90M+ parameters (which is almost 4x as many parameters as in the entire ResNet-50 network).
All that said, while it will always be cheaper to buy your ML peeps $10K quad-GPU workstations and upgrade their consumer GPUs whenever a brand new shiny becomes available, be aware NVIDIA is very passive aggressive about this following some strange magical thinking that this is OK for academics, but not OK for business. My own biased take is it's the right solution for anyone doing research, and the cloud is the right solution for scaling it up for production. Silly me.
I think that the reason no one implements Krizhevsky's OWT (at least in normal training scripts, there's nothing stopping you from doing this in TensorFlow) is that the model parallelism in OWT is only useful where you have more weights than inputs/outputs to a layer. This was true for the FC layers in AlexNet, but hardly anyone uses large FC layers anymore.
Model parallelism is also useful in situations where your model (and/or your inputs) is so large that even with batch_size=1 it does not fit in GPU memory (especially if you're still using a 1080 Ti). However, other techniques might help here (e.g. gradient checkpointing, or dropping parts of your graph to INT8).
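For reference, a minimal PyTorch sketch of the gradient checkpointing mentioned above, using a stand-in model: activations are kept only at segment boundaries and recomputed during the backward pass, trading compute for memory.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint_sequential

    device = "cuda" if torch.cuda.is_available() else "cpu"
    blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(32)]
    model = nn.Sequential(*blocks).to(device)

    x = torch.randn(4, 1024, device=device, requires_grad=True)
    out = checkpoint_sequential(model, 4, x)   # keep activations at 4 segment boundaries only
    out.sum().backward()                       # dropped activations are recomputed here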
Own hardware is always cheaper to buy than using a cloud service, but keeping it running 24/7 involves substantial costs.
Sure, if you run a solo operation, you can just get up during the night to nurse your server, but at some point that no longer makes sense to do.
Somewhere along the way we forgot about this and it's now perfectly normal to run a blog on a GKE 3 VM kubernetes cluster, costing 140 EUR/month.
I used to manage hardware in several datacentres, and I'd usually visit the data centres a couple of times a year. Other than that we used a couple of hours of "remote hands" services from the datacentre operator. Overall our hosting costs were about 30% of what the same capacity would have cost on AWS. Once a year I'd get a "why aren't we using AWS" e-mail from my boss, update our costing spreadsheets and tell him I'd happily move if he was ok with the costs, as it would have been more convenient for me personally, and every year the gap came out just too ridiculously huge to justify.
In the end we started migrating to Hetzner, as they finally got servers that got close enough to be worth offloading that work to someone else. Notably Hetzner reached cost parity for us; AWS was still just as ridiculously overpriced.
There are certainly scenarios where using AWS is worth it for the convenience or functionality. I use AWS for my current work for the convenience, for example. And AWS can often be cheaper than buying hardware. But I've never seen a case where AWS was the cheapest option, or even one of the cheapest, even when factoring in everything, unless you can use the free tier.
We had too low variations in load over the day to make elastic usage cost effective for the most part, so it made very little difference. Indeed, one of the most cost effective ways of using AWS is to not use it, but be ready to use it to handle traffic spikes. Do that and you can load your dedicated servers much closer to capacity, while almost never having to spin any instances up.
And if considering elastic capacity, do you include the cost of the engineering effort required to take advantage of it?
A similar question applies to any other dynamic cost-reduction measure, such as spot instances.
I recall reading an announcement that GCP was starting to charge only for actually-used vCPUs, rather than all that were provisioned, a form of automatic elastic cost savings, although it was still more expensive than a DIY method. AFAIK, AWS doesn't do anything like that.
We've seen an even wider cost disparity on colo and dedicated servers vs AWS. More than 5x. It's easy for us to estimate because databases and web servers need to be on all the time, and those dominate costs.
> keeping it running 24/7 involves substantial costs
So does using a cloud service. It's not actually obvious, conceptually, but very little of the admin overhead has to do with the "own hardware" aspect of running it, especially if one excludes anything that has a direct analog at a cloud service.
There certainly exist services that abstract away more of this, but that's in exchange for higher cost and lower top performance, and that doesn't scale (in terms of cost).
> Sure, if you run a solo operation, you can just get up during the night to nurse your server, but at some point that no longer makes sense to do.
I'd actually argue the reverse. My experience is that the own-hardware portion took at most a quarter of my time, and that remained constant up to several hundred servers. It's much cheaper per unit of infrastructure the more units you have.
The tools and procedures that allow that kind of efficiency were the prerequisite for cloud services to exist.
It's not expensive at all compared to cloud hosting to keep hardware running. Hardware is really stable and usually runs for years without any issues. You can even use colocation and remote hands, it isn't expensive.
I think the real answer to this question is unhelpful, which is: it depends.
I ran a detailed cost analysis of tier 3 on-prem vs AWS about 7 years ago. I included the cost of maintaining servers, support staff salaries, rent, insurance, employee dwell time, etc., and on-prem was still cheaper. Maybe it's different now.
We put significant thought into being cheap. I think constraint can breed innovation.
Missing 1 important point: ML workflows are super chunky. Some days we want to train 10 models in parallel, each on a server with 8 or 16 GPUs. Most days we're building data sets or evaluating work, and need zero.
When it comes to inference, sometimes you want to ramp up thousands of boxes for a backfill, sometimes you need a few to keep up with streaming load.
Trying to do either of these on in-house hardware would require buying way too much hardware which would sit idle most of the time, or seriously hamper our workflow/productivity.
on the other hand, this comparison accounts for the full cost of the rig, while a realistic comparison should consider the marginal costs. Most of us need a pc anyways, and if you're a gamer the marginal cost is pretty close to zero.
>"Nvidia contractually prohibits the use of GeForce and Titan cards in datacenters. So Amazon and other providers have to use the $8,500 datacenter version of the GPUs, and they have to charge a lot for renting it."
I wonder if someone might provide some clarification on this. Is this to say only if a reseller buys directly from Nvidia they are compelled by some agreement they signed with Nvidia? How else would this legal for Nvidia to dictate how and where someone is allowed to use their product? Thanks.
Another comment in this thread said that it's due to the license on Nvidia's drivers. So technically you can use the hardware in a datacenter, just not with the official drivers. Unfortunately it seems that the open-source drivers aren't usable for most datacenter purposes, so this effectively limits how you can use the hardware (at least in North America, where they can enforce it).
While, in sheer dollar amount this post is probably correct, it doesn't really scale.
At scale, you need more than just hardware. It's maintenance, racks, cooling, security, fire suppression etc. Oh, and the cost of replacing the GPUs when they die.
At full price, yes, cloud GPUs on AWS aren't cheap, but at potentially a 90% saving in some regions/AZs, the price of spot instances (bidding on unused capacity) for ML tasks that can be split over multiple machines makes using cloud servers a much more attractive prospect.
I think this post is conflating one physical machine to a fleet of virtualised ones, and that's not really a fair comparison.
Also, the post refers to cloud storage at $0.10/GB/month which is incorrect. AWS HDD storage is $0.025/GB/month and S3 storage is $0.023 which is arguably more suited to storing large data sets.
And the same can in fact pretty much be said of any AWS service.
The equivalent of an i3.metal is probably around $30,000 to $40,000 with Dell or HP, and probably half that if self-assembled (like a Supermicro server). AWS i3.metal will cost $43,000 annually, so even more than the acquisition cost of the server, a server which will probably last around 5 years.
But if you start taking into account all the logistics, additional skills, people and processes needed to maintain a rack in a DC, plus the additional equipment (network gear, KVMs, etc.), the cost win is far less evident, and it also generally adds delays when product requirements change.
Fronting the capital can be an issue for many companies, especially the smaller ones, and for the bigger ones, repurposing the hardware bought for a failed project/experiment is not always straightforward.
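A back-of-envelope version of that comparison (all numbers are the rough estimates from above, plus a crude overhead multiplier that is purely an assumption):

    purchase_price = 35_000     # Dell/HP i3.metal-class box, midpoint of the estimate above
    overhead_factor = 2.0       # crude multiplier for rack, power, network gear, remote hands
    lifetime_years = 5
    aws_annual = 43_000         # on-demand i3.metal for a year

    own_total = purchase_price * overhead_factor
    aws_total = aws_annual * lifetime_years

    print(f"own hardware over {lifetime_years} years: ${own_total:,.0f}")
    print(f"AWS i3.metal over {lifetime_years} years: ${aws_total:,.0f}")
    print(f"ratio: {aws_total / own_total:.1f}x")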
> But if you start taking into account all the logistic, additional skills, people and processes needed to maintain a rack in a DC,
You've mostly described what one pays a datacenter provider, plus hiring someone who has experience working with one (and other own-hardware vendors, such as ISPs and VARs), which doesn't cost any more (and maybe less) than hiring someone with equivalent cloud vendor expertise.
> plus the additional equipment (network gears, KVMs, etc).
Although these are non-zero, they're a few hundred dollars (if that) per server, at scale, negligible compared to $20k.
> The cost win is far less evident
It still is, since the extra costs usually brought up are rarely quantified, and, when they are, turn out to be minor (nowhere near even doubling the cost of hardware plus electricity). AWS could multiply it by 10 (as in the very rough pricing example you provided).
> generally adds delays when product requirements changes.
This is cloud's biggest advantage, but it's not directly related to cost. This advantage can easily be mitigated by merely having spare hardware sitting idle, which is, essentially part of what one is paying for at a cloud provider.
I really want to believe this. Of course the numbers given depend on very frequent use of your machine, but still. One would imagine that building a datacenter at scale and only renting when you actually are training a model to be much cheaper, but the reality appears to be not so.
So where does the money go?
Three places:
- AWS/Google/Whoever-you're-renting-from obviously get a cut
- Inefficiencies in the process (there are lots of engineers and DB administrators and technicians and other people who have to get paid in the middle.)
- Thirdly, and this is what most surprised me, NVIDIA takes a big cut. Apparently the 1080 Ti and similar cards are consumer-only, whilst datacenters & cloud providers have to buy their Tesla line of cards, with corresponding B-to-B support and price tag ($3k-$8k per card). [1]
So, given these three money-gobbling middlemen, it does seem to kinda make sense to shell out $3,000 for your own machine, if you are serious about ML.
Some small additional upsides are that you get a blazing fast personal PC and can probably run Crysis 3 on it.
By coincidence I just posted https://news.ycombinator.com/item?id=18066472 about the expense of AWS et al for NASA's HPC work. (Deep learning, "big data" et al are, or should be, basically using HPC and general research computing techniques, although the NASA load seems mostly engineering simulation.)
> Even when you shut your machine down, you still have to pay storage for the machine at $0.10 per GB per month, so I got charged a couple hundred dollars / month just to keep my data around.
Curious how that relates to sticking only a single 1 TB SSD in the machine, as a couple hundred dollars per month would correspond to a couple of terabytes.
Yeah, that didn't add up for me either. Storage for a single machine shouldn't add up to hundreds of dollars a month unless you're not really trying to manage your data (by culling datasets, using compression, etc.)
Thanks for the catch! Actually that wasn't intended as a direct comparison to this machine's 1TB storage (I think I had ~2TB in the cloud). I can see how this is confusing - just edited the article.
I have seen sysadmins stand up a bunch of EC2s in AWS and install Postgres and Docker on them (because the devs said they need a DB and a Docker server). They don't get the services model (use RDS and ECS). Sysadmins have to change. Orgs can't afford this cost, nor can they be slowed down by this 1990s mindset.
Standing up a bunch of EC2s in AWS is just a horrible idea and an expensive one as well. It also moves all of the on-prem problems (patching, backups, access, sysadmins as gatekeepers, etc.) to the cloud. It's absolutely the wrong way to use AWS.
So stop sys admins from doing that as soon as you notice. Teach them about the services and how, when used properly, the services are a real multiplier that frees everyone up to do other, more important things rather than baby sitting hundreds of servers.
Yeah that's nice and all, but then you have total and complete vendor lock-in.
I decided to do the EC2 thing when I built one of my products, knowing that I couldn't have vendor lock in - and that decision was critical to the survival of my company when:
1) A customer wanted to run on azure. We would have lost a 2.5 million pound contract if we couldn't do that
2) Another customer wanted on-prem solution, we would have lost a 55 million USD contract if we were vendor locked to AWS
The DB schema in RDS is identical to the DB schema on a Postgres server in your data center and the RDS data dumps can be loaded into your very own personal DB wherever you like.
The dockerfile used to build the containers in ECS can be pushed to any registry and the resulting containers can run on any docker service.
What vendor lock in? I just don't buy that argument.
Use the services Luke ;) it will cost much less and you can focus on other things!
> There’s one 1080 Ti GPU to start (you can just as easily use the new 2080 Ti for Machine Learning at $500 more — just be careful to get one with a blower fan design)
I don't believe there are any blower-style 20-series cards. The reference cards use a dual-fan design.
So this is definitely an interesting article with some good ideas. But the main thrust isn't particularly interesting. It's always going to be true that building and operating your own is cheaper than using "the cloud"... so long as you are making use of it much of the time; and if you have the resources and facilities to build, operate, troubleshoot, and replace the hardware; and if you are sure about your long-term needs. But the decision isn't always that straightforward for most potential users.
Also, look into using S3 for long-term storage, instead of leaving your stuff cold in EBS volumes. It's quite a bit cheaper.
Purely on hardware, yes, it's no secret that AWS is more, quite a bit more, than just buying/building the machine and plugging it in. That's true for any server, not just "deep learning" servers.
Of course you’re also paying for everything else AWS brings and the ability to spin up/down on demand with nearly unlimited scalability which is hardly “free.” AWS is also a very profitable business for Amazon so they’re making good margin too on most of their pricing.
> AWS is more, quite a bit more, than just buying/building the machine and plugging it in.
Last time I looked, for mid-range AWS instances, purchase price was about 6-12 months of the rent. That’s assuming you buy comparable servers, i.e. Xeon, ECC RAM, etc…
For GPGPU servers, however, the purchase price is only 1-2 months of Amazon rent. Huge difference, even though the 1080 Ti is very comparable to the P100: the 1080 Ti is slightly faster (10.6 TFLOPS versus 9.3), while the P100 has slightly more VRAM (12/16 GB versus 11 GB).
You know, all the negatives of building your own machine assume you run it 24/7, and we sure as hell don't. We run these models maybe a couple of times a week, but we expect fast results when we do, and that's unchanged.
I will bring this up with the rest of my data engineering team; this might be a good idea.
> Assuming your 1 GPU machine depreciates to $0 in 3 years (very conservative), the chart below shows that if you use it for up to 1 year, it’ll be 10x cheaper, including costs for electricity
You can invert that to 3 years of use at 33% utilization which would come out cheaper (or is my maths broken?). Still doesn't sound like it'd be a good match for your usage though.
That 3 years is extremely conservative though, in reality it would probably be much longer and there are some potential upgrade paths to factor in as well. Not to mention the potential to use it for other purposes.
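As a quick sanity check on that inversion, here's a minimal sketch assuming the roughly $3k build cost and $3/hr on-demand rate quoted elsewhere in this thread, with electricity and storage ignored:

    # Rough owning-vs-renting comparison at a given utilization level.
    MACHINE_COST = 3000.0
    CLOUD_RATE = 3.0  # dollars per GPU-hour (assumed on-demand price)

    def cloud_cost(years: float, utilization: float) -> float:
        hours = years * 365 * 24 * utilization
        return hours * CLOUD_RATE

    print(cloud_cost(3, 0.33))   # ~26,000 -> owning wins easily at 33% over 3 years
    print(cloud_cost(1, 0.11))   # ~2,900  -> roughly break-even around 11% for 1 year

So by these numbers the maths isn't broken: even over a single year you only need roughly 10% utilization to break even, and over three years the threshold drops to a few percent, before counting resale value.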
Any hardware can be cheaper than AWS if you build it yourself; that's not the point. The ongoing cost of maintaining and improving it, and the knowledge needed to do so, might not be. Also, instead of building your own hardware, you could spend that time building the app instead. I thought these points were obvious to anyone.
Exactly. I recently had a discussion on the costs of an RDS setup on AWS. Yes, renting a dedicated server or buying a server and setting up replicated MySQL is much cheaper than the monthly cost of two RDS instances. Until you add the labour cost of setting up, maintaining and monitoring it.
This works if your machine learning is just for data analytics purposes.
The article somehow completely ignores integration with other services to query or update the model, which requires some API to be hosted on a static IP or domain, as well as the DevOps process around it.
Yeah the main use case here is training, which is much more of an offline process. For inference, you will probably want a cloud provider. Though the cost difference still holds, so maybe a startup idea.
Can anybody fill me in on why people these days are spending almost the same money for a last-gen, medium-high end, 12 core, 24 thread CPU and a motherboard? What features would an almost $400 motherboard have that could justify that price?
Threadripper is actually not even that great here: it's NUMA (half of the lanes are attached to each die), and if you're going to accept NUMA anyway you can do a dual-socket E5-2650 system for like half the price. The same logic applies to TR here:
You're not using the CPU for much of anything; there is no reason to have a beefy 12-core CPU there. If you really want Threadripper, the 8-core version is fine.
Again, personally I'd look at X79 boards, since they are pretty cheap and you can do up to 4 GPUs off a single root, depending on the board. There are new-production boards available from China on eBay, see ex: "Runing X79Z". Figure about $150 for the mobo, about $100 for the processor, and then you can stack in up to 128 GB of RAM, including ECC RDIMMs, which runs about $50 per 16 GB.
There are some Z170/Z270 boards like the "SuperCarrier" from ASRock that include PLX chips to multiplex your x16 lanes up to four x16 slots (this does not increase total bandwidth, it just allows more GPUs to share it at full link speed). They also will not support ECC (you need a C-series chipset for that, which runs >$200). So far most OEMs have been avoiding putting out high-end boards for Z370 because they know Z390 is right around the corner, so there is currently no SuperCarrier available for the 8000 series.
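If you want to verify how the GPUs actually hang off the PCIe root complex(es) on whatever board you end up with, the driver can tell you. A minimal sketch; it just shells out to nvidia-smi, which needs to be installed:

    import subprocess

    # Print the GPU interconnect matrix: PIX/PXB means the GPUs talk over PCIe
    # bridges without touching the host bridge, SYS means traffic crosses the
    # CPU/socket interconnect.
    print(subprocess.run(["nvidia-smi", "topo", "-m"],
                         capture_output=True, text=True).stdout)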
Did the Stanford grad include that his and other grad students' time is worth at least $100 an hour on the open market, i.e. what a Stanford CS grad would make in Silicon Valley (c'est moi)?
This is only applicable if they can actually sell very small increments of their time for that amount of money, which I strongly suspect they can't. There have been numerous threads here on HN about the difficulties of finding full-price work that isn't full time.
Even then, the $2k difference between the cheapest pre-built the article references and the DIY version would be 20 hours at $100/hr. Half a workweek to build one PC seems excessive.
Seems logical. I would argue the inflection point in the ubiquity of pooled compute services came as bandwidth needs increased, due to more businesses interacting directly, remotely, and in real time with customers, be they consumers or other businesses.
If compute needs were all internal, as they are with desktop apps (namely, compilation) and with the majority of computationally expensive machine learning demands (namely, training), then I'd argue the pooled compute model would have remained niche.
Hi! Thank you for the comment; I'm the writer of the article. When we did our training we actually needed the computer for months at a time. It takes 1-2 months to tune a model and we were running experiments almost 24/7. So one project alone put us at the break-even point for building.
Was it possible to parallelize the process? It seems like training on a few dozen cloud GPUs and finishing much faster would save a lot of money in terms of engineer time.
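For what it's worth, the single-machine version of that idea is just data parallelism. A minimal sketch assuming PyTorch (spreading across many cloud instances would use DistributedDataParallel instead, which is more involved):

    import torch
    import torch.nn as nn

    # Toy model; the point is only the multi-GPU plumbing.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    if torch.cuda.device_count() > 1:
        # Each batch is split across the visible GPUs and gradients are
        # combined automatically, so big experiments finish sooner.
        model = nn.DataParallel(model)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)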
This analysis assumes 100% utilization: keeping the machine busy 24/7/365. It also ignores AWS price drops and the value of being able to switch to new AWS GPU instance types as they become available.
TL;DR: The proposed machine costs $3k plus perhaps $100-200/month for electricity; a comparable AWS instance is currently $3/hr.
My conclusion: If you're going to do more than 1,000-2,000 hours of GPU computing in the next few years, start thinking about building a machine.
If you just run something for 3 days every 2 weeks, using your numbers that comes out to $432 - for just one month! And that's still not including storage costs, etc. So after roughly six or seven months you're even. And you can still sell those GPUs in a year or two for a decent chunk of money when you want to buy the next generation.
The cloud only really makes sense for the on-demand burst capability IMO.
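Running those numbers slightly differently - months until the hardware pays for itself at a given level of intermittent use, assuming the ~$3/hr on-demand rate and a ~$3k build, with electricity and resale value ignored:

    # Two 3-day runs per month at ~$3/hr, as in the comment above.
    hours_per_month = 2 * 3 * 24          # 144 GPU-hours
    monthly_rental = hours_per_month * 3  # $432
    months_to_break_even = 3000 / monthly_rental
    print(monthly_rental, months_to_break_even)  # 432, ~6.9 months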
TL;DR: Two months' rental cost of a GPU machine in AWS (and elsewhere) is about the same as the purchase price of similarly powered hardware. Why? Even in a fiercely competitive market, no cloud provider seems to want to capture share here. One caveat is that the author is comparing a high-end GPU machine with V100s, which cost way too much for the privilege of putting them in a data center, to consumer-grade 1080 Tis.
I don't see what's surprising here. This is the ski rental problem [0].
Would it surprise anyone to learn that renting a car is more expensive in the long run compared to buying one? This is the same thing, only the time scales are different.
But the ski rental problem as described there misses the same factors as the AWS example: it does not account for the additional costs of ownership vs renting.
The analogy doesn't quite map for all aspects so I'll stick with the ski example to show it's not specific to cloud computing: Owning skis has a load of associated costs ranging from simple to more complex: storage, maintenance, carrying them to and from the airport, having to spend a lot of energy on the decision of which pair to buy as it will be expensive to change my decision later, having the general background burden of owning another thing that I have to think about on some level even when I am not skiing.
All these things are ultimately paid for with money directly, or time, or mental energy that I could be spending on other things. What's the cost of all those combined? It's really hard to say and may even extend beyond a simple economic function and get into the philosophical arena. But you can't deny there is a cost and you can't ignore that cost when evaluating if renting is more or less expensive than owning, over an arbitrary time period.
Concrete example: I own skis. But if I'm going on a ski vacation instead of a day trip, I rent them. That decision has nothing to do with the cost to rent skis vs. what they sell for, and everything to do with the fact that I can rent skis for about the same as what it would cost me to transport my own skis and boots to Breckenridge and back.
I suspect that there is something like that (only not at all) going on with machine learning kit: If you've got exactly one person training models, and they're doing it all the time, then yeah, it's cheaper to own your own hardware. If you've got several people all trying to share the training rig, then you start having to worry about all the costs associated with scheduling time, delays if someone else got there first (especially if they're doing deep learning and will be monopolizing the server for upwards of a week), etc. Once you factor all those full-lifecycle costs in, the cloud might start to look relatively cheaper.
It depends on your needs. If you are absolutely certain that you are going to rent skis so often that you will spend more than buying them (plus the costs of storage, transport, and maintenance), then buying is the right choice.
TIL "ski rental problem". Makes sense that this would be an entire class of problem.
To me it's been the "should I buy an airplane" problem. (Basically, you have to rent a lot, > 50 hrs/year, before buying makes sense financially, and even then it might not work out very well unless you fly a ton)
Because the cloud has economies of scale and benefits from specialization (cheaper power, bulk buying prices, remote land prices, etc)
Hence, some of these savings should be passed on to consumers and it should be cheaper.
The counter-argument is that the cloud is over-engineered (redundancy, backups, better software) for enterprise customers. You are paying for all these benefits whether you want them or not.
One problem is that AWS (and Azure and Google) are taking advantage of their brand recognition and are pricing their services extremely high compared to their competition. For some that is worth it, as they need services they don't get elsewhere, but for a lot of people there are far cheaper cloud services out there. Sometimes even hybrid solutions make sense: a particular problem with AWS is the absolutely insane bandwidth prices. If you "need" to store your data on S3 - because they're the only ones you trust for durability, or because it fits in with other parts of your workflow - but you access most objects regularly enough from outside AWS that bandwidth is a big part of your cost, it can often pay to rent servers somewhere like Hetzner just to act as caches.
The other problem is that while it looks straightforward, as much as we might like to think so, servers are not as much of a commodity as we'd like to think. Components are.
The really big savings in running your own come when you measure how much RAM you need and what CPU you need, and whether or not disk I/O is an issue, and configure servers that fit your use better. If the cloud providers have instances that fit your needs, awesome, you can get a decent price (if you shop around). If they don't, you can easily end up paying a lot for components in your instances that you don't need.
When you don't know what you need, it might be OK to pay that premium, but as soon as you know what you need, there are many cases where buying your own allows a degree of specialisation that cloud providers don't offer, because it pays for them to provide a smaller range of configurations where they can get economies of scale. I've seen cases where, e.g., being able to shove enough RAM in a single server cut costs by 90% or more vs. splitting the load over multiple lower-RAM cloud instances.
I'm not saying it will stay this way forever - there are steadily more combinations that you can find decently priced cloud instances for - but there are still tons of cases where building your own (or having someone configure one for you) is so much cheaper that it's nearly crazy not to.
I think your pricing information might be slightly out of date. AWS Lightsail's $5 plan is equivalent to or better than Linode's $5 plan. Plus, AWS Lightsail has an even more barebones plan at $3.50 per month.
Two years ago you would be perfectly correct. The cheapest AWS option back then would be a micro EC2 instance at around $20 per month IIRC.
Lightsail is basically a toy to compete with Vultr/DO/OVH/etc. You can't even rename a server or change its metadata. You can't resize a server (except for taking a snapshot and building a new server). Essentially, you can't make any changes to a server and there are zero tools for organizing, tagging, or even adding a description field. Of course, you cannot manage Lightsail instances using the AWS console (or even see them). All of the powerful AWS tools are invisible to you as well, and your entire Lightsail infrastructure is invisible from within AWS itself (except for the bill).
Lightsail is really great for personal sandbox servers, or for moving to AWS on the cheap, but it's not even as capable as its worst competitor (and certainly cuts out a lot of features compared to real AWS.)
The worst part is probably the surprising and seemingly minor matter of not being able to simply rename or resize a server. This makes using it in a team or production environment surprisingly difficult.
With that all said, I still find Lightsail very useful for small prototyping and other fast jobs (or a Lambda replacement -- it's way cheaper than running any reasonably busy Lambda function), but I can't see using it for real production usage. Your best bet for real production usage, or something that will someday move into production, would be probably vultr/DO/linode. (OVH really wants to be great, and it should be because the hardware is really great, but the dashboard and billing are so bizarrely bad..)
AWS is good when you have a virtually unlimited budget or don't mind running up the bill while you're getting rolling, but getting locked into AWS will mean that you are stuck with it forever. It's very hard to migrate away if you start using a vendor-specific thing like DynamoDB (to be fair, Google Firebase has exactly the same issue; Instances are basically fungible, but proprietary data stores are not)
> the cloud has economies of scale and benefits from specialization
For GPUs it's the opposite. With some legal BS, Nvidia forbids Amazon from renting out GeForces. That doesn't look OK to me; I don't think Nvidia has the right to decide how people use the hardware they have bought. Anyway, Amazon has to buy Teslas, and Teslas are 10 times more expensive.
If you need 10k cookies every day for multiple years, you're better off building your own bakery. If you need multiple years of continuous DL work, you're better off building your own DL system.
If you need 5 cookies today (and maybe the next batch next week), go to Walmart. Same for a one-off DL job.
I think the analogy holds. The numbers in the article assume constant usage (there is a nod to the cost of static storage, but all the math is done on $3/hr GPU usage, not $0.10/GB/mo storage).
So if your usage is 4 hours per week, 50 weeks a year, you hit $600/year of gpu. If your usage is 24 hours a day, you hit $27k.
So it's not a clear-cut "buying metal is cheaper" - you have to decide how your usage amortizes. As pgeorgi says, if your demand is a pack of cookies a week, buying a bakery isn't the most cost-efficient way to satisfy that demand.
Depends on your use case. The more accurate models these days require lots of data with huge models. Machine Translation, Image Generation, Segmentation, etc. Each experiment takes a couple days/weeks to train, so we were encountering long stretches of constant usage.
Only for niche problems run into by huge companies and research groups. Most 'normal' problems encountered by 'normal' companies don't need anywhere near that much training.
It's fine for exactly what the article describes: a student doing projects at home.
For anything bigger than that, you'll quickly run into issues. Try talking to your ops team and telling them you want to set up a mid-range desktop PC with extra RAM and a high end graphics card as a DL workstation. I think you'll very quickly find some friction, especially once you want to go into production with it.
Google went to production with Pentium desktops on shelves.
IIRC their original hardware plan was based around commodity hardware and getting the most performance per dollar since they built the crawler to be distributed pretty early on.
Point being, consumer grade hardware has come a long way since then, and if you're doing something cutting edge like DL it's not outlandish to expect that rolling your own might be worthwhile and totally justified.
Google's startup phase, between 1996 and 1998, was also over twenty years ago. There were fewer than 200 million Internet users and fewer than 2.5 million web sites [0] in 1998. Google was also started by two grad students, meaning - gasp - it was initially just a research project. I'd also point out that while, yes, the initial production servers were cheap and used commodity hardware, this [1] is what they looked like, which is hardly the type of setup the article is suggesting.
And that's just unnecessary friction, which is a problem of politics, not technology.
I do recall a time as recently as 4 years ago where it was difficult to find, for example, rack-mountable servers that could hold a large number of these GPU cards, but substituting for a mid-range PC with 2 or 4 of them would not have been an issue, even then.
Cost would be higher, but not necessarily outrageously so, since large server chassis often have startlingly large PSUs. It might be an extra 50% on top of that $3k, but that's still competitive, even with the pre-builds the article mentions.
This is usually the case for this kind of thing, yes. Cheap biscuits are a really cheap bulk product while the margins on the fresh ingredients are higher, and people making their own are doing so as a hobby luxury.
SNAP benefits, at anywhere from $134 to $192 a month, are more than sufficient for this at either end of the range. So why are ~42 million people starving in America? Answer (IMO): Cloud providers reduce the friction of GPU adoption the same way grocery and convenience stores reduce the friction of food acquisition, both in exchange for profiting off of it.
Normal food can be even cheaper. You could probably indefinitely avoid starvation with dried rice and beans, oil, and a vitamin supplement for $2/day (in NA), without a big upfront cost. Poverty is complicated.
Quite likely, the Walmart ones are better, at least when it comes to taste and looks - unless you're an expert baker. And even then your cookies probably still aren't better; you will just claim they are, and it's polite to agree with that assessment. In fact, it's polite in our society to agree that any half-decent DIY attempt is better quality than the commercial equivalent, even though that doesn't stand up to scrutiny.
If you had picked another example of pre-made food, I might agree, but cookies are really easy. Just follow the recipe on the back of the chocolate chip bag, you'll get something that is far better than anything out of a bag that's engineered to sit around in a store for months.
Where in the world are you? It's my normal human experience in both the UK and Sweden. Of course the stuff you make from scratch will often be higher quality (or at least be made from higher quality ingredients), and if you want to buy similar or better quality then it will cost more. But the cheapest chocolate chip cookies you can find in the food store will be cheaper than buying the ingredients and baking.
So, whatever the bakery paid for ingredients, wages, and the owner's profit is still less than what you'd pay for the ingredients alone? What are they putting into these cookies, sawdust?
Cheap palm oil and other vegetable fats instead of butter, corn syrup instead of cane sugar, much lower quality chocolate than is generally available in grocery stores, various chemical stabilizers instead of eggs, artificial vanilla flavoring instead of real vanilla and so on.
Plus they obviously get everything at a massive discount since they're buying everything by the ton instead of by the kg.
You might want to ask why those cheap ingredients are cheap and you will find a bunch of farmers getting huge subsidies from the taxpayers to grow otherwise unprofitable crops.
Kind of like how if you remove all the subsidies given to fossil fuel producers, some forms of renewable energy suddenly become much more competitive.
So when you add a whole cost center for IT, the inefficiencies of the way they operate suddenly make the cloud efficient.
Wages and profit are amortized over thousands of cookies per day; you pay less than 1 cent per cookie for them. If you don't bake all the dough in your mixing bowl, you've already spent more on wasted ingredients than the factory spent on labor.
The comments on cheap low quality ingredients and bulk pricing are true as well.
A GPU accelerates training time considerably. For development it's very convenient to use a Linux laptop with NVIDIA Optimus. I was able to train a quite complex model on the laptop's NVIDIA GPU (as an optirun process) while continuing to work using the Intel SoC's integrated GPU. GPU memory is crucial, however; it's better to choose a laptop with 4 GB or more of video RAM.
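A minimal sketch of that setup, assuming PyTorch: pick the discrete NVIDIA GPU when it's visible (e.g. launched as "optirun python train.py" under Bumblebee/Optimus), fall back to the CPU otherwise, and check the video RAM before committing to a big model.

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if device.type == "cuda":
        props = torch.cuda.get_device_properties(0)
        print(f"Training on {props.name} with {props.total_memory / 1024**3:.1f} GB VRAM")

    # Toy model and batch, just to show that model and data both move to the GPU.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(64, 784, device=device)
    y = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()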
I have a hard time taking seriously any analysis that states x is n-times smaller than y, with neither being a negative number. That's not how math works.