Andreessen-Horowitz craps on “AI” startups from a great height

m0zg · on Feb 24, 2020

"Huge compute bills" usually come from training, or to be more precise, hyperparameter search that's required before you find a model that works well. You could also fail to find such a model, but that's another discussion.

So yeah, you could spend one or two FTE salaries' (or one deep learning PhD's) worth of cash on finding such models for your startup if you insist on helping Jeff Bezos to wipe his tears with crisp hundred dollar bills. That's if you know what you're doing of course. Literally unlimited amounts could be spent if you don't. Or you could do the same for a fraction of the cost by stuffing a rack in your office with consumer grade 2080ti's. Just don't call it a "datacenter" or NVIDIA will have a stroke. Is that too much money? Not in most typical cases, I'd think. If the competitive advantage of what you're doing with DL does not offset the cost of 2 meatspace FTEs, you're doing it wrong.

That, once again, assumes that you know what you're doing, and aren't doing deep learning for the sake of deep learning.

Also, if your startup is venture funded, AWS will give you $100K in credit, hoping that you waste it by misconfiguring your instances and not paying attention to their extremely opaque billing (which is what most of their startup customers proceed to doing pretty much straight away). If you do not make these mistakes, that $100K will last for some time, after which you could build out the aforementioned rack full of 2080ti's on prem.

bob1029 · on Feb 24, 2020

I find it fun how the cost of the cloud is forcing people to consider what absolutely must run in the cloud (presumably for stability and compliance reasons) and what can be brought back on-prem.

We don't train ML models, but we are in a similar boat regarding cloud compute costs. Building our solutions for our clients is a compute-heavy task which is getting expensive in the cloud. We are considering options such as building commodity threadripper rigs, throwing them in various developers' (home) offices, installing a VPN client on each and then attaching as build agents to our AWS-hosted jenkins instance. In this configuration we could drop down to a t3a.micro for Jenkins and still see much faster builds. The reduction in iteration time over a month would easily pay for the new hardware. An obvious next step up from this is to do proper colocation, but I am of a mindset that if I have to start racking servers I am bringing 100% of our infrastructure out of the cloud.

blt · on Feb 24, 2020

If I worked from home and my employer asked me to install a server in my home, I would tell them to go fuck themselves.

It's noisy, it takes up space, and presumably I'm on call to fix it if it breaks.

You should pay them an extra 24x(PSU wattage)x(peak $/Wh in area) per day for the electricity too.

I'm alarmed that someone in your company felt this idea was appropriate enough to propose.

bob1029 · on Feb 24, 2020

We would certainly compensate employees. I didn't feel it appropriate to disclose every last detail regarding the arrangement in a thread which is only tangentially-related to the OP.

This was my idea. I am a developer in my company. We are a flat structure. We have a lot of respect for each other. I am on a standup with the CEO every day. We all believe in our product and would happily participate in whatever activity brings it to market more quickly. We do not hire or retain the kind of talent that would flatly refuse to participate in experimental projects like this. At least not without some sort of initial conversation about why it's not a good fit for a particular individual.

I certainly see how someone might share your perspective. I used to work for a souless megacorp and I could have easily found myself telling my former employer to "go fuck themselves" if a proposal similar to this was imposed upon me.

starfallg · on Feb 25, 2020

Try the value data center providers like he.net.

A 42u rack and 1 Gbps connection is $400 per month.

Put cheap supermicro EPYC servers in rather than threadrippers (or build your own). High capacity RDIMMs are cheaper than UDIMMs.

This will give you a much more maintainable solution than workstations at employee's home via overlay VPN.

There is a maintenance cost for infrastructure that people tend to forget these days.

xemdetia · on Feb 25, 2020

I think it was funny to presume that the employer wouldn't pay for it. At one of my old roles I had a second laptop I used solely for VMs just because I ran out of space on primary workstation. A simple fix orders of magnitude cheaper for a developer compared to more ESXi or cloud hardware.

Aeolun · on Feb 25, 2020

I guess the ‘go fuck yourselves’ is heavily dependent on how able the company is to do this sort of stuff without involving you.

blt · on Feb 25, 2020

Do you have an ownership stake in the company?

breischl · on Feb 24, 2020

Tangentially, I once joined a startup where a week after I started the CEO demanded that I run their crawler software on my own personal hardware, on my personal internet connection, 24x7. I (politely) told him to go fuck himself.

I later learned that the CTO had to spend a couple hours talking him out of firing me. I probably should've just quit on the spot anyway.

jiofih · on Feb 24, 2020

You can stop at “go f* themselves”, plenty of jobs to choose from.

m0zg · on Feb 24, 2020

This is not a new phenomenon. As early as in 2009 I worked for a company (ads, but not Google) which outgrew the typical "cloud" cost structure at the time, and moved everything to a more traditional datacenter, and saved substantial money even considering 3 more SREs they had to hire to absorb the increased support needs. AWS charges what the market will bear, and as such it was never designed to make sense for everyone. One needs to re-evaluate on the back of the napkin from time to time.

buckminster · on Feb 24, 2020

I once had a borrowed Sun blade server in my home office. The fan in it sounded like an industrial vacuum cleaner. It got moved to a different room and was powered on as little as possible.

Your plan makes sense but be mindful of the acoustics or your devs may grow to hate you.

bob1029 · on Feb 24, 2020

Excellent point. If we are building these rigs by hand (which is a likely option considering the initial usage context), the cooling solution would probably be a Noctua NH-U14S or similar. I already have one of these in my office attached to a 2950X and it is dead silent. You can definitely hear it when every core is pegged, but it's hardly noticeable over any other arbitrary workstation. The sound is nowhere near as intrusive as something like a blower on a GPU (or god forbid a sun microsystems blade).

maremp · on Feb 27, 2020

What you’re saying sounds like a PC custom build enthusiast. I respect that, I like it too.

Please be mindful of the fact that consumer products are not designed for the workstation/server type of load. It’s related to why the hardware is cheaper compared to the server hardware. Also, the consumer ISP connection is most likely not as reliable as that of a data center. I’m working remotely from home and I experienced this many times, bad performance in peak times or a half of a day of downtime can happen, without any warning. And account for maintenance, everyone on the team must be able to figure out a problem or deal with getting someone else in to fix it.

I know I sound like a buzzkill. I am writing this with good intentions.

Even if this works right now, it’s not a reliable long-term solution. Maybe instead of dumping a couple of grand on consumer PCs to handle server’s work, look into building a proper server. Or you could find a datacenter provider to rent their hardware, something that is not as shiny and full of features like AWS.

m0zg · on Feb 24, 2020

BTW, the only reason why consumer vacuum cleaners are so loud is because consumers associate loudness with suction power.

"Backpack" style commercial vacuum cleaners have more suction, and are barely audible in comparison.

buckminster · on Feb 24, 2020

I stand corrected. It was much louder than a consumer vacuum but my analogy skills are weak.

ggm · on Feb 24, 2020

Your analogy skills were strong, because analogy is rooted in myth, not fact. Achilles did not actually have an Achilles tendon because Achilles did not exist

There does not have to be an increadibly loud functional industrial vacuum cleaner, for figuratively everyone to get your analogy, because the Herculean reality of vacuum cleaners is that you cannot clean an augean stable of lego on the floor, without a lot of noise. If you get my analogy.

ColanR · on Feb 25, 2020

To continue the pedantry, I don't think we know for sure that Achilles didn't exist. Troy certainly did, and we have the Iliad to thank for knowing to look for it.

ggm · on Feb 25, 2020

No, we have Schliemann to thank, for digging through it in his lust for anachronistically Indy-Jonesing the archeology. Troy only exists because Schliemann stole the gold. Otherwise its just some hill in Turkey.

ColanR · on Feb 25, 2020

I know the story...Schliemann went looking for Troy because the Illiad told him it existed.

ggm · on Feb 26, 2020

Didn't work on Ararat. False in unum, false in omnibus. Anyway, he went to the wrong hill first. But overall yes, myth informs reality sometimes. Shame he didn't find the bones of the giant wooden platypus the Greek ice cream salesmen hid in.

roel_v · on Feb 25, 2020

I don't know about backpack style vacuums (the few I've seen seem plenty loud), but the idea that manufacturers deliberately create loud vacuums to make them seem more powerful is ridiculous. There is an entire market segment of vacuums that are marketed as 'quiet', so there is plenty of demand for quiet vacuums. There is simply a trade off between suction / loudness / price, and various models make different trade offs (corresponding to various points in the market).

wpietri · on Feb 25, 2020

That some people want quiet vacuums is not proof that some people don't want loud vacuums. Cars and motorcycles are good examples of markets where you see both deliberately quiet and deliberately loud examples.

steev · on Feb 25, 2020

This doesn't seem any more ridiculous than companies setting a higher price and then always having the item on sale. I think JCPenny famously tried simply lowering their prices and sales went down. I guarantee if you ask the average person "would you rather companies cut the BS and just lower their prices rather than have items on sale in perpetuity" they would of course say lower the prices. But as the anecdote would demonstrate, that isn't how things actually play out.

The point is, when it comes to consumer behavior, I don't think anyone has a clue what to expect. It would not surprise one bit if vacuum companies make louder vacuums because the consumer thinks it works better.

m0zg · on Feb 25, 2020

Is BMW piping in fake "engine noise" through the speakers also "ridiculous" in your opinion?

linksnapzz · on Feb 25, 2020

Once upon a time, a company I worked for convinced Sun Microsystems that it would be cool to provide us with hardware at a discount. The discount hardware was six UltraSPARC E4000 servers, each with 8 cpus and 8GB of memory, and a creator3d card in the io mezzanine upa slot.

This company was using them as desktop workstations, in an open office.

One was used as a build host. Often, the shopvac wail of E4000 fans would be cut short by some poor dev going berserk and unplugging the thing when nobody was looking...

echelon · on Feb 25, 2020

> I find it fun how the cost of the cloud is forcing people to consider what absolutely must run in the cloud

Honestly why ever go to the cloud? It seems like a Larry Ellison boondoggle with the absurdly high costs and lock-in. (Ever look at moving your data?)

Running your own metal is cheaper if you actually fund it.

CoolGuySteve · on Feb 25, 2020

In my experience, teams will rack up thousands in monthly expenses just being parked in a shell on very large On-Demand or Reserved EC2 instances. Basically using them as development boxes without realizing how much they cost.

I've saved a ton of money just giving them dedicated workstations to develop on and then having everyone use a shared EC2 instance to push jobs to a fleet of spot instances for large scale training.

fxtentacle · on Feb 24, 2020

No, also inference is quite expensive. You'll have 100% usage on a $10,000 GPU for 3s per customer image for a decently sized optical flow network. That's 3 hours of compute time for 1 minute of 60fps video.

Now let's say your customer wants to analyze 2 hours = 120 minutes of video and doesn't want to wait more than those 3 hours, then suddenly you need 120 servers with one $10k GPU each to service this one customer within 3 hours of waiting.

Good luck reaching that $1,200,000 customer lifetime value to get a positive ROI on your hardware investment.

When I talk about AI, I usually call it "beating the problem to death with cheap computing power". And looking at the average cleverness of AI algorithm training formulas, that seems to be exactly what everyone else is doing, too.

And since I'm being snarky anyway, there's two subdivisions to AI:

supervised learning => remember this

unsupervised learning => approximate this

Both approaches don't put much emphasis on intelligence ;) And both approaches can usually be implemented more efficiently without AI, if you know what you are doing.

m0zg · on Feb 24, 2020

Some kinds of inference are expensive, yes, not going to dispute that. But 99.95% of it is actually surprisingly inexpensive. Hell, a lot of useful workloads can be deployed on a cell phone nowadays, and that fraction will increase over time, further reducing inference costs or eliminating them outright (or rather moving them to the consumer).

For the vast majority of people the main expense is creating the combination of a dataset and model that works for their practical problem, with the dataset being the harder (and sometimes more expensive) problem of the two.

The dataset is also their "moat", even though most of them don't realize it, and don't put enough care into that part of the pipeline.

fxtentacle · on Feb 24, 2020

The algorithms that run on cell phones tend to be specially optimized and quality-reduced neural networks. For example, https://arxiv.org/abs/1704.04861

jiofih · on Feb 24, 2020

Apple does a pretty good job at that, with no compromise in quality.

fxtentacle · on Feb 24, 2020

I believe that just due to memory constraints, running any high-quality neural network on phones is currently impossible.

State of the art optical flow tracking needs about 10 GB of GPU memory to execute on full HD frames. I wouldn't know of any mainstream phone with that much RAM.

That, BTW, is also the reason why autonomous drones usually downsample the images before AI tracking, which has the nasty side effect of making thin branches, fences, telephone wires, etc. invisible.

nl · on Feb 24, 2020

And since I'm being snarky anyway, there's two subdivisions to AI:
supervised learning => remember this

unsupervised learning => approximate this

This doesn't make any sense at all.

Both are "remembering" something under some constraint, which forces generalisation.

Supervised learning just "knows" what it is "remembering". Unsupervised learning is just trying to group data into patterns.

Both approaches don't put much emphasis on intelligence

Seems like most "intelligence" relies a lot on pattern recognition.

And both approaches can usually be implemented more efficiently without AI, if you know what you are doing.

The evidence is that you are wrong on this for a number of pretty important problems. I don't know much about optical flow, but in the image and text spaces you can't approach the accuracy of neural network approaches with hand crafted features.

streetcat1 · on Feb 24, 2020

I am not sure what you are doing, but can you just compute the similarity between two frames, and analyze only the novel frames?

I.e. I think that in one minute video, 95% of your images do not have new information in them

mrpidgeon · on Feb 25, 2020

"supervised learning => remember this

unsupervised learning => approximate this"

Lol this can't be more wrong lmao. Both areas "remember" and "approximate" things trough training. The difference is that unsupervised learning does not have labeled data, thus it has to search for some pattern. Honestly not even computer science graduates would say something like this.

fizixer · on Feb 24, 2020

- Or AMD could change their policy of 'never miss an opportunity to miss an opportunity' and offer high-performance OpenCL GPGPU offerings. Then nVidia could have all the stroke they wanted.

- Or Tensorflow/Pytorch could've crapped on OpenCL a little less by releasing a fully functional OpenCL version everytime they released a fully functional Cuda version, instead of worshipping Cuda year in and year out.

- Or Google could start selling their TPUv2, if not TPUv3, while they're on the verge of releasing TPUv4.

- Or one of the other big-tech's Facebook/Microsoft/Intel could make and start selling a TPU-equivalent device.

- Or I could finish school and get funded to do all/most of the above ;)

edit: On a more serious note, a cloud/on-prem hybrid is absolutely the right way to go. You should have a 4x 2080 ti rig available 24x7 for every ML engineer. It costs about $6k-8k a piece [0]. Prototype the hell out of your models on on-prem hardware. Then when your setup is in working condition and starts producing good results on small problems, you're ready to do a big computation for final model training. Then you send it to the cloud, for final production run. (Guess what, on a majority of your projects, you might realize, the final production run could be carried out on on-prem itself; you just have to keep it running 24 hours-a-day for a few days or up to a couple weeks.)

[0]: https://l7.curtisnorthcutt.com/the-best-4-gpu-deep-learning-...

m0zg · on Feb 24, 2020

As someone who has actually worked on this stuff soup to nuts, it's not as easy as people imagine, because you can't just support some subset of available ops and call it a day. If you want to make OpenCL pie from scratch, you must first make the universe, and support every single stupid thing (among thousands) and even mimic some of the bugs so that models work "the same".

This is hard and time consuming, and this field is hard enough as it is. What makes it even harder is that only NVIDIA has decent, mature tooling. There is some work on ROCM though, so AMD is not _totally_ dead in the water. I'd say they're about 90% dead in the water.

derefr · on Feb 24, 2020

> support every single stupid thing (among thousands) and even mimic some of the bugs so that models work "the same".

Do you need to do the stupid things performantly, though? Because that sounds like a case for skipping microcode shims, and going straight to instructions that trap into a software implementation. Or just running the whole compute-kernel in a driver-side software emulator that then compiles real sub-kernels for each stretch of non-stupid GPGPU instructions, uploads those, and then calls into them from the software emulation at the right moments. Like a profile-guided JIT, but one that can't actually JIT everything, only some things.

fizixer · on Feb 24, 2020

I know Tensorflow decided to be cuda-exclusive for the silly reason that the matrix library they were using (eigen) only supported cuda.

I have never recovered from that.

simonebrunozzi · on Feb 24, 2020

Are you in the Bay area? Would love to chat. Thinking of an idea where your expertise could be very handy. $my-hn-username at gmail.

dnautics · on Feb 25, 2020

That article is mostly right, but there's one part that got skimped on that will mess you up big time with about an 20% chance if you run for long enough.

liuliu · on Feb 25, 2020

I've been playing with custom-built 2080 Ti workstation for a while: https://www.youtube.com/watch?v=OF3JYEIsjH8

Several issues: 1. electricity bill is still an issue, I've been paying anywhere between $500 to $1000 per month for this workstation (always have something to train). 2. something with a decent memory size (Titan RTX and RTX 8000) cost way too much; 3. once you reached a point of 4-2080Ti-is-not-fast-enough, power management and connectivity setup would be a nightmare.

Would love to know other people's opinions on the on-prem setup, especially whether a consumer-grade 10Ghe is enough for connectivity-wise.

eyegor · on Feb 25, 2020

10gbe will depend on the workload. In general, I'd assume it's fine because it takes a parallel raid setup to saturate. Upgrading to 100gbe is pretty unreasonable cost wise unless you buy network gear from a back alley van dealer.

Although once you reach 4 2080ti, you ought to consider switching to a titanium grade psu and rewiring if you're in a 100-120v country. If you're feeling cheap, just steal the phases from two different circuits. Last I looked, most psu operate around 5% lower efficiency on 115 vs 230.

liuliu · on Feb 25, 2020

I've encountered some latency issues with allreduce on transformer models due to vocabulary sizes when communication have to cross PCIe lanes. Increasing batch size helps a lot, but low-latency & high-throughput is universally more helpful to lift these minor concerns (I don't really want to care about my batch size to improve allreduce performance). Hence worried not only throughput, but also latency on consumer-grade 10ghe equipments.

xyzzyz · on Feb 25, 2020

If you're feeling cheap, just steal the phases from two different circuits.

I chukled, but more seriously, if you can't rewire your house to get a normal 240V circuit, you should not be fucking around with hacks like above.

m0zg · on Feb 25, 2020

>> $500 to $1000 per month

How much is your electricity? I currently run 12 GPUs in my garage pretty much non-stop. 4 GPUs per machine, 3 machines. Each machine is about 1.2KW on average (I can tell because each machine is connected through its own rack UPS), or 13.2 cents per hour, or $95/mo. Which, IMO, is not bad at all. That's less than $300 per month for 12 GPUs.

liuliu · on Feb 25, 2020

Sorry, it is 2-month billing cycle. We have around 30 cents per kwh I think.

ignoramous · on Feb 24, 2020

As someone hoping to build a world-wide footprint, say 25 to 50 DCs, of servers to deploy to with unmetered bandwidth, what are some alternatives to the usual suspects?

I have come across fly.io, Vultr, Scaleway, Stackpath, Hetzner, and OVH but either they are expensive (in that they charge for bandwidth and uptime) or do not have a wide enough foot-print.

I guess colos are the way to go, but how does one work with colos, allocate servers, deploy to them, ensure security and uptime and so on from a single place, 'cause dealing with them individually might slow down the process? Is there a tooling that deals with multi-colos like the ones for multi-cloud like min.io, k8s, Triton etc;

mrkurt · on Feb 24, 2020

(Hi, I'm from fly.io)

It depends what you need in your datacenters! If you just want servers, and don't care about doing something like anycast, you can find a bunch of local dedicated server providers in a bunch of cities and go to town. But you can't get them all from one provider, really, not with any kind of reasonable budget.

You _could_ buy colo from a place like Equinix in a bunch of cities, and then either use their transit or buy from other transit providers.

But also, unmetered bandwidth isn't a very sustainable service, so I'm curious what you're after? You're usually either going to have to pay for usage, or pay large monthly fixed prices to get reasonable transit connections in each datacenter.

In our case, we're constrained by Anycast. To expand past the 17 usual cities you end up needing to do your own network engineering which we'd rather not do yet.

ignoramous · on Feb 24, 2020

(thanks mrkurt)

It is anycast that I'm going after. Requirement for unmetered bandwidth (or cheaper than AWS et al) is because of the kind of workloads (TURN relays, proxy, tunnels etc) we'd deal with gets expensive, otherwise. For another related workload, per-request pricing gets expensive, again, due to the nature of the workload (to the tune 100k requests per user per month).

So far, for the former (TURN relays etc), I've found using AWS Global Accelerator and/or GCP's GLB to be the easiest way to do anycast but the bandwidth is slightly on the expensive side. Fly.io matches the pricing in terms of network bandwidth (as promised on the website), so that's a positive but GCP/AWS have a wider footprint. Cloudflare's Magic Transit is another potential solution, but requires an enterprise plan and one needs to bring-your-own-anycast-IP and origin-servers.

For the latter (latency-sensitive workload with ~100k+ reqs / month), Cloudflare Workers (200+ locations minus China) are a great fit though would get expensive once we hit a certain scale. Plus, they're limited to L7 HTTP reqs, only. Whilst, I believe, fly.io can do L4.

mrkurt · on Feb 24, 2020

Ah interesting! If you're cool talking more about what you're doing with me will you send me an email (kurt@fly.io)?

KaiserPro · on Feb 24, 2020

> As someone hoping to build a world-wide footprint

Does adding an extra 100ms to the response time cost you that much business wise?

As for colos, it depends on scale. If you have 30k servers world wide, it pays to have someone manage the contracts for you. If not it pays to go for the painful arseholes like vodaphone, or whoever bought Cable & wireless's stuff.

as for security, it gets very difficult. You need to make sure that each machine is actually running _what_ you told it, and know if someone has inserted a hypervisor shim between you and your bare metal.

none of that is off the shelf.

Which is why people pay the big boys, so that they can prove chain of custody and have very big locks on the cages.

K8s gives you scheduling and a datastore. For a large globally distributed system its going to scale like treacle.

avip · on Feb 24, 2020

For balance, all big cloud providers - aws, gcp, azure, oracle [0] have pretty similar startup plans. Y$$MV

(I'm in full agreement with everything you've written + it's well-phrased and funny. gj!)

[0] that's not a typo - there is such thing as "Oracle cloud"

pridkett · on Feb 24, 2020

There’s also the issue that data scientists often want to go running to hyperparameter optimization and neural architecture search. In most cases improving your data pipelines and ensuring the data are clean and efficient will pay off much more quickly.

fxtentacle · on Feb 24, 2020

But manually improving the data pipeline requires an understanding of the problem, whereas doing a hyperparameter optimized architecture search just needs $$$ hardware and no clue on the side of the operator.

derefr · on Feb 24, 2020

Or, to put that another way: if you knew what algorithm the AI would be using to discriminate the signal from the noise in your data, why would you need the AI? Just write that algorithm.

nl · on Feb 25, 2020

That's not the same thing at all.

You can't build hand-build a feature detector as accurate as (say) a ResNet50 by hand. Before 2014 people tried to do this with feature detectors like SIFT and HOG. These were patented and made the inventors significant money. If it was still possible to do it then someone would be and making profit from it.

Hyperparameter search is just optimising the training parameters (things like batch size, or optimiser parameters). This might give another 1% lift on accuracy, but isn't generally a significant factor.

derefr · on Feb 25, 2020

> You can't build hand-build a feature detector as accurate as (say) a ResNet50 by hand.

Yes, you can. If, that is, you can actually understand what the produced model is doing. And, of course, no human can do that, because no human understands the algorithm being employed by the produced model, because it's a really freaking complex algorithm whose optimal formulation really is just a graph of matrix transformations, rather than an imperative procedure that can be described by words like "variable."

This is an important idea to absorb, for the specific case where the AI converges on an optimal algorithm that's actually very simple—because the data has a regular, simple shape—rather than on one that's too complicated for our mortal minds. If you already knew that simple algorithm, then the work you did training an AI just to end up back at that same simple algorithm is wasted effort. An AI can't do better at e.g. being an AND gate, than an actual AND gate can. An AI can't do what wc(1) does better than wc(1) can.

If the data is regular—that is, if a model of its structure can be held fully in a human brain—then jumping immediately to Machine Learning, before trying to just solve the problem with an algorithm, is silly. The only time you should start with ML, is when it's clear that your problem can't be cleanly mapped into the domain of human procedural thinking.

The AI programmers of the 1960s were not wrong to start with Expert Systems (i.e. attempting to write general algorithms) for deduction, and only begrudgingly turn to fuzzy logic later on. Many deduction tasks are algorithmic. If you don't require the context of "common sense", but only need operate on data types you understand, you can get very far indeed with purely-algorithmic deduction, as e.g. modern RDBMS query planners do. There would be no gain from using ML in RDBMS query planning. It's regular data; the AI's trained model would just be a recapitulation of the query-planning algorithm we already have.

nl · on Feb 25, 2020

>> But manually improving the data pipeline requires an understanding of the problem

> Or, to put that another way: if you knew what algorithm the AI would be using to discriminate the signal from the noise in your data, why would you need the AI? Just write that algorithm.

My point is that this isn't the same thing at all.

Say your problem is plant detection from mobile phone photos. I can understand everything about plants, and I can manually build a highly optimised data processing pipeline.

But I can't build a feature extractor that outperforms ResNet50. That's the key algorithm.

> If the data is regular—that is, if a model of its structure can be held fully in a human brain—then jumping immediately to Machine Learning, before trying to just solve the problem with an algorithm, is silly.

True, but no one has made that argument. This is specifically about using hyperparameter optimisation vs improving your data pipeline.

derefr · on Feb 25, 2020

> True, but no one has made that argument.

Er, yes, I did, in my original post. The form you quoted was me attempting to be more precise in rephrasing it.

My point—my original point, this whole time—was that applying an advanced “feature extraction” algorithm to a data source whose features are explicitly encoded in a lossless, linearly-recoverable way in the data—what we usually call structured data—is silly.

For example, there’s no point in using ResNet50 to extract the “features” of a formal grammar, like JSON. It’d just be badly simulating a JSON parser.

In fact, there’s pretty much no data structure software engineers use, where ResNet50 would give you more information out than you’d get from just using the ADT interface of the data structure. What features are in a queue? Items and an ordering. What’s in a tree? Items, an ordering, and parent-child relationships. Etc.

The only place where it might make sense to use ML when dealing with structured data, is with statistical data structures like Bloom filters. ResNet50 might be able to recover some of the original data out of a bloom filter, in est using it as a compressed-sensing tool, or (in the algorithmic CS domain) as a decompressor for a lossy, underdetermined compression codec.

——

My second point was that, often, it turns out that your data is structured data, even when you didn’t ask for structured data.

Some natural-world datasets are structured!

Example: the standard model of quantum chromodynamics describes a clean digraph of possible spin configurations. You don’t need feature detection when looking at LHC data. The dataset is pre-bucketed, the items pre-tagged, by nature itself.

But more often, what happens is that your data turns out to not be “raw” / primary-source data, but rather a secondary source that was already structured, enriched, and feature-extracted by someone else before you got there.

Scraping social network data? It’s already a graph, and it often already has annotation fields in the JSON graph endpoints describing the relationships between the members. If you don’t just stop and look at the dataset, you might think your feature-extractor is doing something very clever, when actually it’s just finding the explicit pre-chewed “relationship” field and spitting it back out at you.

——

You might not see the relation to a kD-sample-matrix feature-extractor like ResNet50, so here’s some more tightly-analogous examples:

• What if the images in your training dataset turn out to be in Fireworks PNG format, where the raster data contains an embedding of the original vector image it was rendered from? Specializing your feature-extractor to this data is just going to make it learn to find those vectors (and extract features from those), rather than depending on the features in the raster data; and then it’ll fail on images without embedded vector descriptions. And if that’s all you want, why not just use a PNG parser to pull out the vectors?

• What if your audio files turn out to all have been MIDI files rendered out from a certain synthesizer using its default set of instrument patches? Will feature-extraction on this rendered data beat just writing a program to exact-match and decode the instruments back to a MIDI description? Certainly there might be MIDI-level features you want to extract, but will ResNet50 be better at extracting those MIDI-level features for having seen the rendering, as opposed to having been fed the decoded MIDI-level data directly?

nl · on Feb 25, 2020

> Er, yes, I did, in my original post. The form you quoted was me attempting to be more precise in rephrasing it.

Ok. No one other than yourself has made that argument.

> Scraping social network data? It’s already a graph, and it often already has annotation fields in the JSON graph endpoints describing the relationships between the members. If you don’t just stop and look at the dataset, you might think your feature-extractor is doing something very clever, when actually it’s just finding the explicit pre-chewed “relationship” field and spitting it back out at you.

I've worked in this exact field, and I've never ever heard of anyone who doesn't do this. I guess someone might.. so if your point is that there are dumb people around then ok.

But that isn't what a feature extractor does! In the graph context a simple feature extractor is something hand coded like the degree of a node, or a more complex learned one is something that maybe does embedding.

If your point is that people should understand their data, then yes that is in data science 101 for good reasons.

perl4ever · on Feb 25, 2020

"There would be no gain from using ML in RDBMS query planning. It's regular data; the AI's trained model would just be a recapitulation of the query-planning algorithm we already have."

Most of what you wrote seems fine, until I got to this. A query optimizer seems like something that tends to be very opaque, very complex, and in my experience blows up without a good explanation frequently in typical situations. It's also based on a lack of complete data about the problem domain, to the point where an optimal algorithm seems hopeless. I'm not saying an AI approach automatically can be better, but at least you (I) can envision it being better, perhaps less brittle and taking into account dimensions and possibilities a human doesn't. And the non-AI solutions aren't trustworthy in terms of bounded quality so you have not got a lot to lose.

nl · on Feb 25, 2020

Most (?) RDBMS query planners rely on updating statistics on the data in the table to decide what type of joins to perform.

I can imagine cases (distributed databases, different speed storage) where it would make sense to test the queries and learn which optimisations make sense. It'd be self tuning and able to adapt to changing hardware.

perl4ever · on Feb 25, 2020

Oracle certainly has statistics and incredibly sophisticated optimization in theory, it's just that in practice, I think it sucks.

I spent way too many years writing ad hoc Oracle SQL that had to complete within a few hours and trying to guess if the optimizer had decided to finish in 15 minutes or a week.

And I would read Tom Kyte where he says obviously your database is set up wrong if the optimizer isn't working for you. And how you should do everything in one big beautiful query that uses all the latest features of Oracle.

fxtentacle · on Feb 24, 2020

Exactly :)

In most cases, unsupervised learning is nothing more than having the AI try to approximate the solution of your highly non-linear loss function. So if there's any way of solving that loss function directly, it will perform like a well-trained AI.

mxcrossb · on Feb 25, 2020

> In most cases

In what cases is this not true?

streetcat1 · on Feb 24, 2020

Of course, you can also automatically search the best data pipeline for your data...

zitterbewegung · on Feb 24, 2020

I was training ML models on AWS / Google Colab. After racking up a few hundred dollars on AWS I bought a Titan RTX (I also play video games so it does that very well also.

alephnan · on Feb 24, 2020

> Just don't call it a "datacenter" or NVIDIA will have a stroke.

Context please :) ?

OkGoDoIt · on Feb 24, 2020

NVIDIA forces you to buy significantly more expensive cards that perform marginally better if you are using them for datacenter use. They try to enforce not letting businesses use consumer grade gaming cards. I assume this is so cloud providers don't buy up all the supply of graphics cards and make it hard for gamers to get decent cards, like what happened during the bitcoin craze.

nkassis · on Feb 24, 2020

No it's just pure price discrimination. They don't care about gamers they just know businesses will pay more if forced to while gamers can't.

gbba · on Feb 25, 2020

I wouldn't say they don't care about gamers, considering that gaming makes up about half of their revenue: https://www.anandtech.com/show/15513/nvidia-releases-q4-fy20...

nkassis · on Feb 25, 2020

Sure sorry I realize now the point I was trying to make doesn't match my wording. They do care about selling to gamers but availability to gamers is not in anyway why they are forcing more expensive models of essentially the same cards on the HPC and server market. It's all because they can and the business market is able to bear the cost. Now if they found a true competitor in that market, I think this pricing model would fall apart fast.

aj7 · on Feb 24, 2020

Exactly.

mereel · on Feb 24, 2020

Just a guess but maybe it's some licensing issue? https://www.nvidia.com/en-us/drivers/geforce-license/

No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

derefr · on Feb 24, 2020

> except that blockchain processing in a datacenter is permitted

Well, Nvidia, y'see, my new blockchain does AI training as its Proof-of-Work step...

mam2 · on Feb 24, 2020

well they are the one writing the rules, so i'd side with OP

ThePadawan · on Feb 24, 2020

Datacenter GPUs are mostly identical to the much cheaper consumer versions. The only thing preventing you from running a datacenter with consumer hardware is the licensing agreement you accept.

endorphone · on Feb 24, 2020

"The only thing preventing you from running a datacenter with consumer hardware is the licensing agreement you accept."

The consumer cards don't use ECC and memory errors are a common issue (GDDR6 running at the edge of its capabilities). In a gaming situation that means a polygon might be wrong, a visual glitch occurs, a texture isn't rendered right -- things that just don't matter. For scientific purposes that same glitch could be catastrophic.

The "datacenter" cards offer significantly higher performance for some case (tensor cores, double precision), are designed for data center use, are much more scalable, etc. They also come with over double the memory (which is one of the primary limitations forcing scale outs).

Going with the consumer cards is one of those things that might be Pyrrhic. If funds are super low and you want to just get started, sure, but any implication that the only difference is a license is simply incorrect.

dnautics · on Feb 25, 2020

> For scientific purposes that same glitch could be catastrophic.

But for machine learning, some say that stochasticity improves convergence times!

artaak · on Feb 27, 2020

Could you cite an example where ECC requirement on GPU was real and demonstrated to be needed? In practice, I don't know anyone who'd willfully take 10-15% perf hit on GPUs, because of a cosmic ray.

The thermal design for "datacenter" card can be better for sure. And on-board memory size and design. That's about it. For how many x over geforce price tag is that?

endorphone · on Feb 28, 2020

"In practice, I don't know anyone who'd willfully take 10-15% perf hit on GPUs, because of a cosmic ray."

Virtually every server in data centers runs on ECC: the notion of not using it is simply absurd. And given that the Tesla V100 gets 900GB/s of memory bandwidth with ECC, versus 616GB/s of memory bandwidth on the 2080Ti without ECC, it's a strawman to begin with.

nvidia further states that there is zero performance penalty for ECC.

As to whether the requirement is "real", Google did an analysis where they found their ECC memory corrected a bit error every 14 to 40 hours per gigabit.

"That's about it."

Also ECC memory. Also dramatically higher double precision performance. Dramatically higher tensor performance. Aside from all of that...that's it.

luisfmh · on Feb 24, 2020

Learned a new word today. Pyrrhic.

KaiserPro · on Feb 24, 2020

And the cooling, amount of ram and the doubles performance.

the chip might be the same, but the rest of it isn't

Granted, its not worth the $3k price bump, but thats a different issue.

zwaps · on Feb 24, 2020

Nah that's not really it. The reason NVIDIA doesn't allow this is precisely because the additional RAM - functionally the only difference - is not cost efficient. People would like (and did) use a bunch of consumer 1080s, which is why NVIDIA disallowed precisely that. You had to buy the equivalent pro grade card, which costs easily two or three times that and offers a couple more GB of RAM.

walshemj · on Feb 25, 2020

Yep if you getting huge bills you should be doing on prem HPC eg where a 15k budget means 15kw per container and your into exotic network designs where 10g wont cut it any more.

eg from 2011 6400 Hadoop nodes like http://bradhedlund.com/2011/11/05/hadoop-network-design-chal...

God only knows what fun you could get up to with modern tech - I miss bleeding edge rnd

paulddraper · on Feb 24, 2020

> Also, if your startup is venture funded, AWS will give you $100K in credit

AFAIK that is limited to <$20k and it expires.

webel0 · on Feb 25, 2020

We got $100k but boy oh boy once you’re on it it’s hard to get off. Now we’re close to 50-50 gcloud/aws

wpietri · on Feb 25, 2020

Hah! "First hit is free, man!" Makes sense, though. Anybody offering a "free" $100k of anything can afford a lot of MBA time to make sure that the CLTV is well over costs.

calebkaiser · on Feb 25, 2020

Inference is also becoming a bigger contributor to compute bills, especially as models get bigger. With big models like GPT-2, its not unheard of for teams to scale up to hundreds of GPU instances to handle a surprisingly small number of concurrent users. Things can get expensive pretty quick.

artsyca · on Feb 24, 2020

Slow clap

shoo · on Feb 25, 2020

> most people haven’t figured out that ML oriented processes almost never scale like a simpler application would. You will be confronted with the same problem as using SAP; there is a ton of work done up front; all of it custom. I’ll go out on a limb and assert that most of the up front data pipelining and organizational changes which allow for [ML to be used operationally by an org] are probably more valuable than the actual machine learning piece.

Strong agreement from me: I've never worked on deploying ML models, but have worked on deploying operations-research type automated decision systems that have somewhat similar data requirements. Most of the work is client org specific in terms of setting up the human & machine processes to define a data pipeline to provide input and consume output of the clever little black box. A lot of this is super idiosyncratic & non repeatable between different client deployments.

izendejas · on Feb 25, 2020

That's because, ML and operations-research problems can be simplified to set of optimization problems and the underlying math and statistics are all very similar if not identical in some cases.

And the input matters, a lot. So the differentiating factor isn't the models, it's the data and companies like Google figured it out a long time ago.

In short, find interesting problems, then the solutions -- not the other way around.

killjoywashere · on Feb 25, 2020

"The data" means more than pure computer science people want to admit. In any "advanced" application, that means annotators. Radiologists drawing circles around cancer, attorneys labeling contract clauses as unacceptable, drivers labeling stop signs, etc.

ML is a mining problem. Digitizers are the miners. Annotators are the refiners.

joe_the_user · on Feb 25, 2020

Basically, the system is massively ad-hoc and driven by this large scale annotation, training and testing.

The big question here is, what happens when the world changes next year? You rebuild the application. I know there are companies that advertise doing continuous updating of deep learning models but it seems like calculating total costs and total benefits is going to be hard here.

killjoywashere · on Feb 25, 2020

Sometimes the mine makes money, sometimes it doesn't make sense to run the mine.

moandcompany · on Feb 25, 2020

To extend the mining metaphor, and relate back to the original articles:

People and organizations are chasing what they believe, or are told to believe, is pay dirt.

Many unfamiliar investors have rushed in, possibly fearing missing out, and fund many of the prospectors, yet many of the prospectors and investors aren't really aware of the costs of running a mine, nor the practices required to run them efficiently.

It turns out that there's more aspects to the value creation process than dig/refine/polish (data/train/predict), especially when usefulness in application matters and there are finite resources available for digging.

Companies selling shovels are some of the primary beneficiaries of this, by selling shovels (i.e. renting compute) funded by the malinvestment.

Additional beneficiaries are the refiners (training experts) that are able to charge steep labor premiums, however organizations are starting to figure out that their refiners are expensive to keep idle and often operate the mines poorly in terms of throughput/cost-effectiveness/repeatability/application (see the various threads on "Data Engineers")

streetcat1 · on Feb 25, 2020

This is correct, however, the distinction between labeling and training is artificial, and probably arises from the fact that ML came from academia, where it was not part of the business process.

I.e. a modern ML system should just plug into the business process from day 0, where the ML task should be performed by human and recorded by the machine.

After a while, the machine would train on this recorded data, and start replacing the humans.

Rinse and repeat.

killjoywashere · on Feb 25, 2020

> a modern ML system should just plug into the business process from day 0, where the ML task should be performed by human and recorded by the machine.

Ah, this is a typical thing I hear people in the Valley say: just push it all ... somewhere. No.

If we digitized all microscopy slides, it would require YouTube-scale storage several times over. People think genomics is big. People think reconnaissance imaging is big. They're big, but there's only so much of them.

IF it were digitized, there would be far more pathology whole slide imaging being generated every day. I did some estimates at one point and had to throw a couple orders of magnitude into the genomics data to even make it competitive at enterprise scale.

And keep in mind, we're talking clinical medicine. We want the data now. We're looking at the slides while the glue is still wet. You don't have the bandwidth, no one has the bandwidth, to do some of this stuff they way you propose and maintain the current "business process" of clinical medicine.

Building models and iterating, the old fashioned way, is the only way it makes sense.

6510 · on Feb 25, 2020

Funny, we all thought computers were fast. Turns out its nowhere what we need.

Piskvorrr · on Feb 26, 2020

They're fast, sure. But not very efficient in certain problem domains, specificially where humans are efficient (for reasons that are IMHO historical, not innate).

divbzero · on Feb 25, 2020

This is spot on. Hence the open sourcing of ML code while keeping an iron grip on data.

Erlich_Bachman · on Feb 25, 2020

> And the input matters, a lot. So the differentiating factor isn't the models, it's the data and companies like Google figured it out a long time ago.

The models are likely also a differentiating factor in a sense that there are models that perform much better than others, to a point of completely new functionality. But also all of these models are basically open source currently... So they can't by definition be differentiating between different companies, because all of the companies generally have access to all of the algorithms. At leat to all of the types of algorithms.

joshuaellinger · on Feb 24, 2020

I just spent $50K on coloc hardware. I'm taking a $10K/mo Azure spend down to a $1K/mo hosting cost.

But the real kicker is that I get x5 the cores, x20 RAM, x10 storage, and a couple of GPUs. I'm running last-generation Infiniband (56gb/sec) and modern U.2 SSDs (say 500MB/sec per device).

I figure it is going to take me about $10K in labor to move and then $1K/mo to maintain and pay for services that are bundled in the cloud. And because I have all this dedicated hardware, I don't have to mess around with docker/k8s/etc.

It's not really a big data problem but it shows the ROI on owning your own hardware. If you need 100 servers for one day per month, the cloud is amazing. But I do a bunch of resampling, simple models, and interactive BI type stuff, so co-loc wins easily.

eyegor · on Feb 25, 2020

Yes, it's quite obvious when you actually have compute needs. At my current employer, we spent about 100k to build a small single purpose hpc. One year later, I calculated the azure costs (help bargain for more servers) would have been around 1.5m. This is almost 24/7 use though, and add another ~150k in electricity.

foobiekr · on Feb 25, 2020

For my own company we built out at two regionally distinct colo facilities. That worked really well and operations was efficient and costs were moderate, clearly tied to CAPEX increments which were predictable.

Recent projects have been on AWS. For a project that is roughly on the scale of our colo in terms of instances, though with aggregate lower performance, we are buying one of our colos every year. It’s insane. Network costs are particularly egregious in AWS.

But there is absolutely no way we’d be permitted to build colo facilities for many reasons and there are many reasons why even if we could get permission to do so we would choose not to due the resulting death by a thousand cuts orchestrated by the team who happens to have inserted themselves as the owner for DC/colo like things.

angry_octet · on Feb 25, 2020

Yes, cloud costs are the cost of having poor internal management, such that inefficiency and incompetence reigns unchallenged. The enormous cost differential is unfortunately borne by the unit doing the work, rather than the one preventing it from being done efficiently.

foobiekr · on Feb 25, 2020

That is a very accurate summary.

marcus_holmes · on Feb 25, 2020

the point of Cloud is that it solves the problem of variable demand.

I used to run on-prem back in the 2000's, and we were constantly dealing with demand fluctuation crises. Spinning up new physical servers to deal with new demand, or being massively over-specced when demand dropped, was a real pain.

I'm starting a new thing this week, and using the Cloud for it because I have no idea what our demand will be. I can start small, scale up with our customer growth, and never have to worry about ordering new servers a month in advance so I have enough capacity when (or if) I need it.

At some point in the future, when our needs are clear and relatively stable, it might make sense to migrate to on-prem and save those costs.

throwaway9d0291 · on Feb 25, 2020

I half-agree. The Cloud specifically solves the problem of _highly_ variable demand.

If your peak demand is 100x your baseline and only happens for ~1h each day, cloud is almost certainly a good choice. If it happens for ~12h a day or it's only 5x your baseline, the cost of the cloud is such that you're likely to save with dedicated hardware, even though much of your hardware sits around doing nothing part of the time.

> never have to worry about ordering new servers a month in advance so I have enough capacity when (or if) I need it.

There is a middle-ground that's very much worth considering: renting dedicated servers. It's not quite as cost-effective as colocation and owning your hardware when you have at least a cabinet worth of stuff but it does offload the management of the hardware and provisioning to somebody else. They can also usually be provisioned in a matter of minutes.

In some cases (e.g. Packet.net) these machines can even be treated essentially like cloud instances, with hourly pricing.

There's also yet another middle ground: using dedicated to handle the known and predictable baseline traffic and using the cloud to handle the unexpected bursts.

supermatt · on Feb 25, 2020

Thanks for the pointer to packet.net. This looks like it will scratch a current itch.

burnte · on Feb 25, 2020

I did similar at my current and last job. Rather than spend $24k/month, I spent $50k, bought a shitton of hardware, built a virtualization cluster at Corp, and upgraded our connections. Accounting thought i was a wizard.

walshemj · on Feb 25, 2020

Especially as they can amortize that cost in the annual accounts - there might even be RnD tax credits they can use

sabalaba · on Feb 25, 2020

Yea we’re seeing this all over the place at Lambda (https://lambdalabs.com). Most people running consistent GPU training or inference jobs are building on-prem clusters or even groups of workstations.

It just doesn’t make financial sense to use the big the cloud service providers for those with consistent workloads. I always hear stories where folks have saved hundreds of thousands in infrastructure costs with owning + co-lo.

chintler · on Feb 25, 2020

I agree to this, and I think lambdalabs is quite precisely positioned for on-prem training.

As an aside, thank you for your one-line installer script for tf/keras. Earlier, my team used to spend days figuring out the CUDA/tf/keras/CUDNN etc dependency charts, and you've brought that down to ~0.

htrp · on Feb 25, 2020

+1 for the one line install

dboreham · on Feb 25, 2020

We never went cloud, except for ancillary things like build machines, nagios etc that run on tiny VMs. Whenever I looked at the economics I could buy a server of the class we needed for roughly 2x the monthly rent for the equivalent from Amazon.

Merrill · on Feb 25, 2020

This whole topic recapitulates all the arguments for business units acquiring and operating their own servers versus continuing to suffer the internal bill-backs from the corporate data center.

Some of the same caveats apply with respect to software updates, configuration control, security, availability, business continuity, disaster recovery, and what happens if the local admin is hit by a bus.

StreamBright · on Feb 25, 2020

Exactly. These examples are mostly apples to oranges comparisons. I have worked over 20 years in OPS and it is really hard to do cheaper than AWS ____in the long run___. If you are unlucky and bought a batch of SSDs that are faulty exactly 1 month after warranty expires or you have downtime because of other low-level reasons that AWS shields you from, your co-loc cost can quickly go up. I don't even want to go into networking hoops, that is a whole different problem to deal with global network vendors. If you can be sure you never run into these, or your business is resiliant to these sort of problems, or you have a dedicated highly skilled team (like dropbox) than co-loc might be a good idea. Otherwise it is pretty damn hard.

wpietri · on Feb 25, 2020

I'm sure your right for your case. But I'd add one caveat for those less experienced: if you own the hardware, you need to be prepared to go to the colo when something breaks. The various clouds are a much nicer experience when hardware fails. At the very least people should have enough spare capacity that a hardware failure means going sometime in the next couple of weeks, rather than getting up at 3 am and fixing things under pressure.

latch · on Feb 25, 2020

Or take the middle road and just get rent the hardware (aka, dedicated hosting). You pay more than colo but still way less than cloud, get the same level of hardware support as a cloud provider but the same performance as colo.

kavalg · on Feb 25, 2020

Yep, for example Hetzner offers bare metal servers as well as cloud instances at a very reasonable price. (Not affiliated in any way. Just a happy customer.)

joshuaellinger · on Feb 25, 2020

Prior to the cloud, I ran at a coloc facility for 15 years. I break stuff much more often than having it actually fail. So... make yourself robust against human error first and you'll probably cover the hardware side as a side-effect. I am more likely to hose a machine during an OS upgrade and not have time to recover than I am to have an SSD fail.

But spare capacity is a good idea, especially if you have real-time traffic.

foobiekr · on Feb 25, 2020

Operations teams deal with both. You design your system with enough spare capacity that you can live somewhat degraded for a time - you must if only due to the lead time. Software failures are far far far more common than hardware failures so once you combine these, the occasional midnight trip to the colo is both rare and oddly satisfying for hero types.

TheSpiceIsLife · on Feb 25, 2020

I would have assumed the colo provider would offer Remote Hands, so you’d only need to send replacement hardware.

That’s how the DC I used to work in operated.

wpietri · on Feb 25, 2020

If you have enough spare capacity and the problem is pretty mundane, sure, that can work. But if not, then it's off to the colo while the rest of the company freaks out.

StreamBright · on Feb 25, 2020

Network redundancy, electricity redundancy, bandwidth included? Otherwise, it is a bit of apples to oranges. What about firewalls? I mean you could ignore all that and say you only need raw computing power. On the k8s note, nobody is forcing you to use k8s on the top of Azure.

Now do the calculation for ongoing operations for 5 years, taking into consideration normal hardware failure and maintenance cost. You need to swap out old hardware to get a new CPU, etc. I have tried to use co-loc vs cloud for ~100 nodes and cloud won, by 30%.

andrew311 · on Feb 25, 2020

What colo company did you use?

joshuaellinger · on Feb 25, 2020

In Austin, DataFoundry is by far the best. It was overkill for me and went with something off the beaten path but they have an amazing facility.

I wound up at a facility run by a fiber vendor because they'd sell me a fixed 250mbps pipe for the same price that a data center would sell me 20mbps pipe that bursts to 1gbps. It only works for me because of the nature of my business -- most people would be better off somewhere else.

Choosing a co-loc facility is complicated. My recommendation is to tour and get quotes from 3-5 vendors in your area before choosing anyone. Ideally, take someone who has done it before.

dmak · on Feb 25, 2020

How did you estimate your hardware needs?

adtac · on Feb 25, 2020

I plan to do this in the near future once my GCP credits are used up (18 months of credits left).

My plan is to temporarily shift to dedicated hardware through a service like Hetzner to evaluate what kind of hardware I need. I can simply redirect a fraction of the traffic and extrapolate. Since this is elastic there will be no upfront costs, but I can play around with different sizes. Once I'm happy with my estimate, buy real hardware and move the rest over.

At least that's the plan. I don't think you can do much more than an educated guess and I think this will be as close as I can get.

Not AI related btw.

joshuaellinger · on Feb 25, 2020

Gee... if only there was a service where you could spin up machines on demand. (joke)

I kinda worked backwards from the cost. I ran the business for a year on Azure but each 'sample' of the resample took about 2 mins so it precluded any near real-time analysis. I ported the kernel to a GPU locally using python/numba and it ran in about 10 seconds and that was enough to seal-the-deal.

From there, I spec-ed out a GPU server and then machines that matched each role in my environment. I decided I was willing to spend $50K and just started loading up the machines.

bsenftner · on Feb 25, 2020

[flagged]

angry_octet · on Feb 25, 2020

(I don't think people like the gratuitous imaging generated by that last sentance. Much too real.)

mrosett · on Feb 25, 2020

Agreed. Totally unnecessary

raiyu · on Feb 24, 2020

The number of places where machine learning can be used effectively from both a cost perspective and a return perspective are small. They are usually tremendously large datasets at gigantic companies, and they probably have to build in house expertise because it's hard to package this up into a product and resell it for various industries, datasets, etc.

Certainly something like autonomous driving needs machine learning to function, but again, these are going to be owned by large corporations, and even when a startup is successful, it's really about the layered technology on-top of machine learning that makes it interesting.

It's kind of like what Kelsey Hightower said about Kubernetes. It's interesting and great, but what will really matter is what service you put on top of it, so much so that whether you use Kubernetes becomes irrelevant.

So I think companies that are focusing on a specific problem, providing that value added service, building it through machine learning, can be successful. While just broadly deploying machine learning as a platform in and of itself can be very challenging.

And I think the autonomous driving space is a great example of that. They are building a value added service in a particular vertical, with tremendous investment, progress, and potentially life changing tech down the road. But as a consumer it's really the autonomous driving that is interesting, not whether they are using AI/machine learning to get there.

andreilys · on Feb 24, 2020

“The number of places where machine learning can be used effectively from both a cost perspective and a return perspective are small.”

Thankfully transfer learning and super convergence invalidates this claim.

Using pre-trained models + specific training techniques significantly reduces the amount of data you need, your training time and the cost to create near state of the art models.

Both Kaggle and google colab offer free GPU.

ska · on Feb 24, 2020

>Thankfully transfer learning and super convergence invalidates this claim.

IME it is nowhere near as universally successful as this suggests.

craftinator · on Feb 25, 2020

> Both Kaggle and google colab offer free GPU.

I think this sentence invalidates your argument against:

“The number of places where machine learning can be used effectively from both a cost perspective and a return perspective are small.”

In a hobbyist world, free GPU time is an amazing thing, and you can do a lot of fun and rewarding projects using transfer learning and other techniques that avoid heavy engineering and data processing. In a business world, where your product must consistently and accurately perform well, problems that may be solved by ML need to be heavily scrutinized and researched, because for most problems there are cheaper, faster, more robust solutions. Free GPU time doesn't weigh in at this scale.

andreilys · on Feb 25, 2020

Sure if ML/DL is your core business than yeah it doesn't make sense.

If ML/DL is an add-on to help augment your business (separate from the core value) then yes transfer learning and free GPU's will get you good returns.

Q6T46nT668w6i3m · on Feb 24, 2020

How would you explain the rise (and success) of machine learning in science? A lab that uses some learning-based method will likely be limited to just one or two people (responsible for data acquisition, feature engineering, evaluation, etc.) and extremely finite data.

ska · on Feb 24, 2020

It's not clear there has been any deep impact actually, but there has been a lot of discussion (and grant proposals)

I've seen a lot of cross pollination of ML and AI techniques into various disciplines. A large percentage just didn't work at all, most of the rest were more "kind of interesting, but". Nothing earthshaking happened although pop sci press likes to talk about it a lot.

If you have more digital data than you used to, using modern free frameworks and toolkits to do basic (i.e. older, boring, but understood) ML stuff to understand it seems to have a reasonable return. Mostly I think this is because it becomes accessible to someone without much background in the area, and you can do reasonable things without having to put 6 months of reading and implementing together before starting.

semi-extrinsic · on Feb 24, 2020

How do you define success? Adoption? Because right now, writing "we will use machine learning to solve X" in a grant proposal is an easy way to increase chances of getting funding.

Barrin92 · on Feb 24, 2020

I'm not sure there is a rise. 'Science' is a huge domain. Machine learning if I had to guess maybe plays a role in < 1% of them, and that may be overstating it.

Also it's doubtful to even categorize machine-learning as science. The goal of science is to generate insight and knowledge, ML solves particular engineering problems or searches problem spaces, it doesn't build fundamental scientific models.

PeterisP · on Feb 24, 2020

Can you elaborate on what you mean by "A lab that uses some learning-based method will likely be limited to just one or two people (responsible for data acquisition, feature engineering, evaluation, etc.)" ? I know a bunch of labs that apply machine learning to specific tasks, and the parts you list each can easily take up multiple people for years for a single task - not counting data acquisition, because data is definitely not "extremely finite", you need lots of quality data, and improving data is something that always gets improvements and can easily eat up more manpower than you can have budget, no matter what that budget is.

artemisyna · on Feb 25, 2020

...because previously, the academics would use an army of undergrads to do the same data labeling that ML accomplishes.

(The dis-economy of scale hurts less if you're already starting from a point with the manual labor.)

C1sc0cat · on Feb 25, 2020

Its now a lot cheaper in the 1980's than when I worked at the worlds leading hydrodynamics orgs.

I briefly looked at using neural nets to analyse data from an experiment - analysing the efficacy of toilet bowl designs.

The entry level hardware was £250k in 1981 - it was much cheaper to take photo's and have a research assistant count squares.

Now you could use fairly cheap commodity hardware to do it.

It would have been an amazing cutting edge project if we could have got some government funding we did have an in-house knowledge engineer.

jorblumesea · on Feb 25, 2020

It's interesting that the industry constantly has to relearn the idea that tech needs follow business needs, not the other way around. As you said, so many teams rushing to containerize, but if the services you run are piles of junk, do your users care about whether kubernetes can scale based on memory instead of cpu? Similarly, many effective "recommendation engines" are just inverted indexes and not fancy ML models, and are a hell of a lot cheaper.

inthewoods · on Feb 25, 2020

Having briefly worked for an AI company, I agree with the conclusion that AI companies are more like services businesses than software companies. I would add only one other thing: to me going forward there likely won't be "AI companies" - AI exists to power applications. And in my experience, unless the output is truly differentiated, customers aren't willing to spend more for something "powered by AI" - they just expect that software has evolved to provide the kind of insights that AI sometimes deliver.

shoo · on Feb 25, 2020

For an example of a genuine software company vaguely in this ecosystem, consider companies that build the tools that some AI/ML/optimisation systems use as building blocks. Eg optimisation algorithms.

If you need to solve gnarly industrial scale mixed integer combinatorial optimisation problems in the guts of your ML / optimisation engine, the commercial MIP solvers (gurobi , CPLEX ) or non-MIP based alternative combinatorial optimisation systems (localsolver ) can often give more optimal results in exponentially less running time than free open source alternatives.

1% more optimal solutions might translate into 1% more net profit for the entire org if you've gone whole hog and are trying to systematically profit optimise the entire business, so depending on the scale of the org it might be an easy business case to invest a few million dollars to set this system in place.

Annual server licenses for this commerical MIP solver software was 0(100k) / yr per server & the companies that build these products bake a lot of clever tricks from academia into these products that you can exploit by paying the license fee. ( my knowledge of pricing is out of date by about 7 years ) .

MrK93 · on Feb 25, 2020

I'm all for linear optimization and other optimization techniques. It's refreshing to see other people talk about Gurobi, CPLEX, etc... Having done research in the field of scheduling and now getting contacted by companies, it's demoralizing to see that everybody usually speaks about machine learning while many problems can be solved in a more precise way with other techniques.

mapgrep · on Feb 25, 2020

Aren’t software businesses increasingly like service businesses though?

They deliver now often with backend cloud storage, update near continuously, integrate frequently with outside services, sometimes open source major components iteratively, typically have an evolving API and developer ecosystem to educate, and are sold as subscriptions. It’s not as “human in the loop” as some of the AI described in this article but it’s clearly moving toward services in terms of margins.

Nothing is like the old shrink wrapped software business, basically.

inthewoods · on Feb 25, 2020

Not from what I see - what I see is software companies using services as a way to shorten time-to-value for the customer. They do this either themselves or via professional services firms.

To me, the services you describe are software-as-a-service - they scale well without adding more humans to the mix. Services businesses, in contrast, generally need more humans to do more work.

I do think you are right that we are entering an age where the margin pressures will continue to increase. As the Amazon quote goes "your margin is my opportunity." In that world, strength accrues to the largest players - which is why AWS is so strong.

I like to joke that AWS should refund money to the startup that buy booths at re:Invent only to find out AWS is rolling out a competing service (with the acknowledgement that AWS entering a space doesn't necessarily mean the end of the competing company.)

rossdavidh · on Feb 24, 2020

So, way back in the last millenium, I did my Master's thesis (way smaller deal than a Ph.D. thesis) on neural networks. Since then, I have looked in on it every few years. I think they're cool, I like using them, and writing multi-level backpropagation neural networks used to be one of the first things I'd do in a new language, just to get a feel for how it worked (until pytorch came along and I decided for the first time that using their library was easier than writing my own).

So, it's not like I dislike ML. But, saying an investment is an "AI" startup, ought to be like saying it's a python startup, or saying it's a postgres startup. That ought not to be something you tell people as a defining characteristic of what you do, not because it's a secret but rather because it's not that important to your odds of success. If you used a different language and database, you would probably have about the same odds of success, because it depends more on how well you understand the problem space, and how well you architect your software.

Linear models or other more traditional statistical models can often perform just as well as DL or any other neural network, for the same reason that when you look at a kaggle leaderboard, the difference between the leaders is usually not that big after a while. The limiting factor is in the data, and how well you have transformed/categorized that data, and all the different methods of ML that get thrown at it all end up with similar looking levels of accuracy.

There used to be a saying: "If you don't know how to do it, you don't know how to do it with a computer." AI boosters sometimes sound as if they are suggesting that this is no longer true. They're incorrect. ML is, absolutely, a technique that a good programmer should know about, and may sometimes wish to use, kind of like knowing how a state machine works. It makes no great deal of difference to how likely a business is to succeed.

jedberg · on Feb 24, 2020

Saying that you're going to "use AI" is more akin to saying "we're going to have a web application" back in 1998.

Back then a lot of startups didn't have websites, because they were making other products (hardware, boxed software, etc). If they had a website it was just a marketing page.

So saying that you were going to make a "web application" did in fact differentiate you, in that it showed your approach was very different from the boxed software folks, but it didn't tell you much beyond that.

all_blue_chucks · on Feb 25, 2020

"Web application" came later. In the nineties it was called a "cgi web page" by your webmaster.

perl4ever · on Feb 25, 2020

In the nineties, there was a huge difference between 1995 and 1998. It wasn't all that apparent to some of us until later, but things moved really fast in that timeframe. The years leading up to 2000 were almost like the imagining of approaching an event horizon or asymptote.

edw · on Feb 25, 2020

What you’re describing is so hard to convey to people. In 1994 the we were building raised-floor data centers with halon for suppressors and marveling at our 2GB behemoth UNIX boxes. And writing our own web application framework using CGI. In ‘99 we were renting a suite at a colo and putting our own hardware there, running ColdFusion web apps. In ‘04 we were renting half a rack at the same colo and trying not to write three tier Java servlet based apps with 1,000 line web.xml files. And then AWS happened.

rbinv · on Feb 25, 2020

CGI to ColdFusion to Java servlets. Sounds enterprise-y.

edw · on Feb 25, 2020

It was all very start-uppy. What were you using to build your commercial web applications in 1996 if not CGI? Mod_perl did not even exist until 1995, and FastCGI didn't exist IIRC until after Netscape released their enterprise server.

rbinv · on Feb 26, 2020

Huh. I agree with CGI, but CF certainly had alternatives.

edw · on March 2, 2020

The boring low-risk unsexy thing that works is often underrated. I didn't choose CF — at the time I argued that it was a tool for scrubs, but the VPE said, "Hey, I know it, and I know it can do the job." We launched that CF-based web site on time and sold the company for $350MM fewer than six months later. Only then did we incrementally port it to Java.

aSplash0fDerp · on Feb 25, 2020

I can recall chugging along with a Pentium 133mhz and 56k dialup between 95-98.

Its fantastic to think that we didn't see 1ghz cpus and 1mb cable/dsl Internet until 2000.

The resourcefulness from that pre-2k era was amazing! It was leaps and bounds!

jedberg · on Feb 25, 2020

I know, but I'm writing to a modern audience. :)

randomdude402 · on Feb 25, 2020

Or from time to time, a webmistress.

_revy · on Feb 25, 2020

> If you don't know how to do it, you don't know how to do it with a computer.

This is so true. We spent decades educating non-technical people that understanding a problem well is a prerequisite to programming it. Take something easy to understand like driving a car, doing it in a computer is now harder.

AI is undoing all that. People reach a vague problem they can't describe and assume computers will magically fix it.

Tostino · on Feb 24, 2020

Well the term Postgres or Python startup may not make sense, but a Pytorch or TensorFlow startup may not either. A database startup though, tells me the company is likely going to be in the database field, and most likely is going to try and sell me something I don't need. An AI startup, similarly, is going to either be utilizing existing techniques on industry problems to sell me something I don't need, or making some novel improvement to the training or inference to sell me something else I don't need.

So...yeah.

7532yahoogmail · on Feb 24, 2020

Thank you for the perspective. Now when we talk machine learning are we talking:

L. Pachter and B. Sturmfels. Algebraic Statistics for Computational Biology. Cambridge University Press 2005.

G. Pistone, E. Riccomango, H. P. Wynn. Algebraic Statistics. CRC Press, 2001. Drton, Mathias, Sturmfels, Bernd, Sullivant, Seth. Lectures on Algebraic Statistics, Springer 2009.

Or more like:

Watanabe, Sumio. Algebraic Geometry and Statistical Learning Theory, Cambridge University Press 2009.

My understanding (I do not do AI or machine learning) that AI is distinct from these more mathematical analytic perspectives.

Finally, might we argue that generally AI/ML is more easily suited to data that's already high quality eg. CERN data, trade data, drug trial data as opposed to unconstrained data eg. Find the buses in these 1MM jpegs?

kk58 · on Feb 25, 2020

Pure CS based AI approaches are primarily for Image, Text, and maybe graphs and control. The domains are called computer vision, natural language processing, graph learning and reinforcement learning

Structured Data like tables, time series etc the techniques are still from statistics. Regression for example is the workhorse for numerical prediction problems

I think a lot of people are missing the point about leaps AI has made because they aren't aware of NLP or CV or reinforcement learning.

So "AI" mentioned above is stunningly good for buses in 1MM image and reasonably good drug trial, cern data.

The business models required for making AI business successful haven't been invented yet.

Good AI model will be Deep stack : example would be something like precision agriculture where you'd use AI for designing rice then use iot and earth observation to locate right acreages and monitor growth and adjust nutrient at crop level and get dramatically great output with least wastage and highest nutritional content.

Most AI companies are still started by ex CS folks who in general arent aware of deep technical opportunities in other disciplines. I think this will change soon very fast due to ubiquity of deep learning training material, libraries and research papers.

phreeza · on Feb 25, 2020

> There used to be a saying: "If you don't know how to do it, you don't know how to do it with a computer."

This is a tautology in the narrow sense, but in the broader sense I think there surely exist things that humans don't "know" how to do without a computer, but know how to do with a computer. And the space of solveable problems is expanding, though AI is only a narrow slice of that.

_laiq · on Feb 25, 2020

I don't know. I think what we all do, we know how to do it without a computer. Computer just automate stuff for us. It's a very practical saying because it forces you to ask the right question about the problem you're trying to solve. (We all know how to do AirBNB by hand, or Uber by hand, but the mobile app is hyper efficient w/ GPS & 4G, that's all).

ativzzz · on Feb 24, 2020

I agree with the author's opinion about

> I’ll go out on a limb and assert that most of the up front data pipelining and organizational changes which allow for it are probably more valuable than the actual machine learning piece.

Especially at non-tech companies with outdated internal technology. I've consulted at one of these and the biggest wins from the project (I left before the whole thing finished unfortunately) were overall improvements to the internal data pipeline, such as standardization and consolidation of similar or identical data from different business units.

noelsusman · on Feb 24, 2020

I do data science at a non-tech company with outdated internal technology and I've seen this over and over again. Honestly though, it's worth every penny because often the only way to get the resources to truly solve data pipeline issues is to get an executive to buy some crap from a vendor and force everyone to make it work.

jotakami · on Feb 25, 2020

I was a consultant at one of the giant outsourcers and nod my head vigorously at this comment. The least sexy projects were MDM (master data management) but they were absolutely essential to the success of any other fancy analytics/BI/ML project.

2sk21 · on Feb 25, 2020

Interestingly I too worked on MDM systems about ten years ago, when I was at IBM Research. Ironically, one of my first ideas for applying machine learning was in de-duplication of data in an MDM server. However the technology was a bit too primitive back in 2010 and the project was a hard sell so it was abandoned.

correlator · on Feb 24, 2020

No need to look at AZ for this. If you're building "AI" I wish you a speedy road to being acquired by a company that can put it to use. You've become a high priced recruiting firm.

If you're solving a real problem and use ML in service of solving that problem, then you've got a great moat....happy trusting customers.

It's not complicated

motohagiography · on Feb 24, 2020

Sssh! Valuations are a function of projected market size and opacity of the problem. Clarity like this collapses the uncertainty and destroys value. If you pour enough capital into rooms full of PhD's something's gotta hit.

My way of saying, you're very, very right.

seibelj · on Feb 24, 2020

I wrote an article I published a week ago about how AI is the biggest misnomer in tech history https://medium.com/@seibelj/the-artificial-intelligence-scam...

I wrote it to be tongue-in-cheek in a ranting style, but essentially "AI" businesses and the technology underpinning it are not the silver bullet the media and marketing hype has made it out to be. The linked article about a16z shows how AI is the same story everywhere - enormous capital to get the data and engineers to automate, but even the "good" AI still gets it wrong much of the time, necessitating endless edge-cases, human intervention, and eventually it's a giant ball of poorly-understand and impossible to maintain pipelines that don't even provide a better result than a few humans with a spreadsheet.

scottlocklin · on Feb 25, 2020

Coming from a fellow masshole: that's a great rant.

There was this meme in the 70s about "self driving cars" following magnetic strips in the road in restricted highways. I remember at the time, being, like 8 and thinking "sure seems like an overly complicated train."

seibelj · on Feb 25, 2020

Thanks man! Lifelong masshole here.

Your post was much better than mine, but I appreciate the comment.

harias · on Feb 24, 2020

>That’s right; that’s why a lone wolf like me, or a small team can do as good or better a job than some firm with 100x the head count and 100m in VC backing.

goes on to say

>I agree, but the hockey stick required for VC backing, and the army of Ph.D.s required to make it work doesn’t really mix well with those limited domains, which have a limited market.

Choose one?

Also assumes running your own data center to be easy. Some people don't want to be up 24x7 monitoring their data center or to buy hardware to accommodate the rare 10 minute peaks in usage.

jjeaff · on Feb 24, 2020

>rare 10 minute peaks

But is that really the use case here? I haven't worked in ML. But I'm not seeing where you are going to need to handle a 10 minute spike that requires a whole datacenter.

A month's worth of a quad gpu instance on AWS could pay for a server with similar capacity in a few months of usage.

And hardware is pretty resilient these days. Especially if you co-locate it in a datacenter that handles all the internet and power up time for you. And when something does go wrong, they offer "magic hands" service to go swap out hardware for you. Colocation is surprisingly cheap. As is leasing 'managed' equipment.

sp527 · on Feb 25, 2020

Training ML models usually doesn’t have the same uptime requirements as production systems. If your training goes down for a bit, it probably won’t make much difference to the underlying business, in most cases.

That’s why the author found it glaringly obvious that it should be brought in-house. It’s often both the most costly and most “in-housable” compute work involved in these companies.

icheishvili · on Feb 24, 2020

I don't think these are necessarily contradictory. With pytorch-transformers, you can use a full-blown BERT model like the best in the world. And yet, to make this novel and defensible, you would need to build on top of it and innovate significantly, which would require significant capital to achieve.

bsenftner · on Feb 25, 2020

I ran a small data cluster for years, the horsepower behind my startup. Other than the Chinese DDoS attacks, running the cluster was absolutely elementary. The idea that running a server or a band of servers is difficult is a bold faced lie. People have got to stop repeating the cloud propaganda.

detaro · on Feb 24, 2020

> Some people don't want to be up 24x7 monitoring their data center or to buy hardware to accommodate the rare 10 minute peaks in usage.

Do you need that for training workloads, and what percentage of a startups workload is training?

moab · on Feb 24, 2020

I found it fun to read this after reading this other post that made the rounds today about AI automating most programming work and making program optimization irrelevant: https://bartoszmilewski.com/2020/02/24/math-is-your-insuranc...

dang · on Feb 24, 2020

A thread about the original article, from a few days ago: https://news.ycombinator.com/item?id=22352750

fxtentacle · on Feb 24, 2020

I predict a great future for startups that sell pickaxes, err, tools for AI.

AI is like the new gold rush. And just like back then, it's not the gold diggers that will get rich.

"Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it’s the data collection and labeling."

https://medium.com/startup-grind/fueling-the-ai-gold-rush-7a...

(from 2017)

moksly · on Feb 24, 2020

Is it the new gold rush though. I work in a large organisation that has a lot of data and inefficient processes, and we haven’t bought anything.

It hasn’t been for a lack of trying. We’ve had everyone from IBM and Microsoft to small local AI startup try to sell us their magic, but no one has come up with anything meaningful to do with our data that our analysis department isn’t already doing without ML/AI. I guess we could replace some of our analysis department with ML/AI, but working with data is only part of what they do, explaining the data and helping our leadership make sound decisions is their primary function, and it’s kind of hard for ML/AI to do that (trust me).

What we have learned though, is that even though we have a truck load of data, we can’t actually use it unless we have someone on deck who actually understands it. IBM had a run at it, and they couldn’t get their algorithms to understand anything, not even when we tried to help them. I mean, they did come up with some basic models that their machine spotted/learned by itself by trawling through our data, but nothing we didn’t already have. Because even though we have a lot of data, the quality of it is absolute shite. Which is anecdotal, but it’s terrible because it was generated by thousand of human employees over 40 years, and even though I’m guessing, I doubt we’re unique in that aspect.

We’ll continue to do various proof of concepts and listen to what suppliers have to say, but I fully expect most of it to go the way Blockchain did which is where we never actually find a use for it.

With a gold rush, you kind of need the nuggets of gold to sell, and I’m just not seeing that with ML/AI. At least no yet.

hooande · on Feb 24, 2020

AI != gold. The market for selling tools to people who are essentially chasing buzz words is much smaller than that of selling tools to people extracting scarce metals from the ground.

Ultimately the value of selling tools is dependent on the riches being mined actually existing. The value of AI/big data to the average business has yet to be determined

b0b10101 · on Feb 24, 2020

>"Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it’s the data collection and labeling."

A lot of those companies are styled as "AI" companies themselves, aiming to automate the process of labeling.

The main winner here really is Amazon. They get a chunk by serving up infrastructure and in labeling through mechanical turk.

whoisjuan · on Feb 24, 2020

An many times all these AI computations go into solving mundane problems like "What's the likelihood of this Ad to perform well".

AI is so shiny that makes people want to jump as fast as they can into that boat but a reasonable objective analysis shows that a huge and not insignificant amount of software problems can still be solved without relying on the "AI black box".

DrNuke · on Feb 24, 2020

You all know a GTX 1070 with 8GB on a gaming laptop with 32GB is still doing wonders and covering 90%+ business cases when coupled with smart & batch techniques the likes of you learn from fast.ai or under direct pytorch implementation, right??

_bxg1 · on Feb 24, 2020

> Training a single AI model can cost hundreds of thousands of dollars (or more) in compute resources

Why don't they buy their own hardware for this part? The training process doesn't need to be auto-scalable or failure-resistant or distributed across the world. The value proposition of cloud hosting doesn't seem to make sense here. Surely at this price the answer isn't just "it's more convenient"?

KaiserPro · on Feb 24, 2020

because you are trading speed for cash.

Say you have $8M in funding, and you need to train a model to do x

You can either:

a) gain access to a system that scale ondemand and allows instant, actionable results.

b) hire a infrastructure person, someone to write a K8s deployment system. Another person to come in a throw that all away. Another person to negotiate and buy the hardware, and another to install it.

Option b is can be the cheapest in the long term, but it carries the most risk of failing before you've even trained a single model. It also costs time, and if speed to market is your thing, then you're shit out of luck.

_bxg1 · on Feb 24, 2020

Why in the world do you need a Kubernetes deployment system to run a single, manual, one-time (or a handful of times), high-compute job?

dsl · on Feb 24, 2020

Because when all you have is a hammer, everything looks like a nail.

We have become so DevOps and cloud dependent that everyone has forgotten how to just run big systems cheaply and efficiently.

PeterisP · on Feb 24, 2020

Because that high-compute job needs to be distributed on many, many machines, and if you're using cheap preemptible instances you have to handle machines dropping off and joining in while you're running that single job.

It's definitely not something that you can launch manually - perhaps Kubernetes is not the best solution, but you definitely need some automation.