A Full Hardware Guide to Deep Learning (timdettmers.wordpress.com)
118 points by etiam on March 9, 2015 | 17 comments



The article recommends getting a 580 as the cheapest, most cost-effective option. One thing the 580 has going against it is that the cuDNN library does not support it. Only Kepler and Maxwell cards (600, 700 and 900 series) are supported. Since many of the popular libraries for deep learning (Theano, Caffe, Torch7) support using cuDNN as a backend now, I think this is worth mentioning. For many configurations cuDNN provides some of the fastest convolution implementations available right now. Even if the 580 is a great card for CUDA, a more recent model may actually be a better choice in light of this.
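
If you are not sure whether a particular card is recent enough, one quick sanity check is to look at its compute capability: cuDNN needs Kepler or newer (3.0+), while Fermi cards like the 580 are 2.x. Something like this (a sketch, assuming pycuda is installed) will tell you:

  # Rough sketch, assuming pycuda: print each GPU's compute capability.
  # cuDNN requires Kepler (3.0) or newer; the GTX 580 (Fermi) is 2.0.
  import pycuda.driver as cuda

  cuda.init()
  for i in range(cuda.Device.count()):
      dev = cuda.Device(i)
      major, minor = dev.compute_capability()
      print("%s: compute capability %d.%d, cuDNN-capable: %s"
            % (dev.name(), major, minor, major >= 3))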

I agree 100% with the 980 recommendation; it's a great card in terms of performance, power usage and price point.


Thanks for your feedback; this is an important point and I will update my blog posts accordingly.


Overall a good article with some insightful points. One thing strikes me as a bit off, though. The recommendation for one or two cores per GPU seems not quite right. Examining only the CPU<->GPU performance, this might be reasonable. Like the author mentions, you can use the other core to prep the next mini-batch and all those sorts of tasks. However, training the model is only one part of the system, and I tend to value the overall performance of the system more than any one facet.

For example, despite training on GPUs being very computationally intensive, I find one of the most onerous tasks to be the custom data prep/transformation/augmentation pipeline. Because these types of things are usually pretty application specific, there often isn't a ready-made toolkit that does all the heavy lifting for you (unlike the GPU training, which has Torch, Caffe, pylearn, cuda-convnet, lasagne, cxxnet...), so you end up having to roll it yourself. You have to run this code often, and with large data that isn't trivial. Usually you won't invest--at least, I don't--in doing custom CUDA code for this type of thing, if it's even possible, so having lots of fast CPU cores is a win. I usually write multi-threaded routines for my processing steps and run them on 8-32 cores for huge gains. So my point is that "one or two cores per GPU" is a bit of a narrow recommendation.
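
To give a rough idea, even without custom CUDA code you can get a long way with a plain process pool. A minimal sketch of the kind of pipeline I mean (load_image and augment are hypothetical stand-ins for your own application-specific steps):

  # Minimal sketch of a CPU-side augmentation pipeline parallelized across
  # cores with multiprocessing; the load/augment steps are placeholders.
  from multiprocessing import Pool

  import numpy as np

  def load_image(path):
      # Placeholder: pretend to read and decode one sample from disk.
      return np.random.rand(3, 256, 256).astype(np.float32)

  def augment(path):
      img = load_image(path)
      # Example transformations: random 224x224 crop plus horizontal flip.
      y, x = np.random.randint(0, 32, size=2)
      crop = img[:, y:y + 224, x:x + 224]
      if np.random.rand() < 0.5:
          crop = crop[:, :, ::-1]
      return crop

  if __name__ == "__main__":
      paths = ["img_%d.jpg" % i for i in range(1024)]  # hypothetical file list
      pool = Pool(processes=16)                        # scale to however many cores you have
      batch = pool.map(augment, paths, chunksize=32)
      pool.close()
      pool.join()
      print(len(batch), batch[0].shape)

In real use you would also reseed the RNG in each worker so the augmentations aren't correlated across processes, but the structure stays the same, and it scales almost linearly with cores for this kind of embarrassingly parallel work.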

The same applies if you want to do 'real time' data augmentation (this is hinted at later in the article) and/or if you want to deploy with CPU only. Sure you need the GPU to do the training in a reasonable amount of time, but once you've fit your model, it might not be worth it to deploy to GPU-enabled computers if all you're doing is forward passes.

PS: This is also a place where running on EC2 can be a win. Maybe it's more economical to build a workstation, but once you're in the cloud you can spin up a few 32 core boxes to run your preprocessing really quickly, shut them down and spin up some GPU instances for training, then shut those down and spin up some mid-tier boxes to run the models through a bunch of data without breaking the bank. All in 'one place'.


Thanks for sharing your experience – this is a fair point. Often it is possible to pre-process your data and save it to disk so that you can skip this decompression/conversion/transformation step once you start training your net, but I can imagine applications where this is impractical or just does not work. I will add a small note to my blog about this.


I was using a few different cards a few years back to work on deep learning, including 2x GTX 580 and a few others. A checklist:

  1. motherboard form factor.
  2. cooling.
  3. power supply.
  4. memory.
It is important to choose a motherboard with the right form factor, one that will actually fit your cards physically. The fact that a motherboard has 3x PCIe x16 slots doesn't necessarily mean that it will fit your two(!) cards. Nothing is more frustrating than not being able to fit the cards.

Cooling cannot be overstated. Also note that the box makes a lot of noise during operation; ideally you'd want to put it far away from your workplace.

Power supply: note that the spec you read on the power supply is usually overrated. If you have 4x 200W cards, my advice is a 2kW PSU.

And GPU memory: if you can fit your dataset in memory, rather than loading/unloading it in batches, that will save you a lot of time and effort. Well worth the money.
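
To put a rough number on the power supply point (the wattages here are illustrative assumptions, not measurements):

  # Back-of-the-envelope PSU sizing with illustrative numbers.
  gpus = 4
  gpu_watts = 200       # per-card draw under load
  rest_watts = 250      # CPU, drives, fans, motherboard
  peak = gpus * gpu_watts + rest_watts
  print("Peak draw is roughly %d W" % peak)  # -> 1050 W
  # Since the rating printed on a PSU tends to be optimistic, I round this
  # up aggressively, which is how I arrive at the 2kW suggestion.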


These are some good points. I heard from another person that he had problems with the form factor and I will add that to the post tomorrow. I think a 2kW PSU is overkill, but you are right that more is better for PSUs.

If you want memory, a good option will be to wait for the GTX Titan X, which will be released in the coming weeks: 12 GB of RAM, and it will be the fastest card by far. Overall, however, I think the GTX 980 will still be better in many cases – it is just very cost effective.


This is cool, but I'm wondering about how the prices compare to EC2 (including spot prices and depreciation of personal hardware). How much training do you need to do before EC2 becomes too expensive?


If you have no desktop PC or no money for a GPU, it might be a better choice to use an EC2 instance instead of buying the hardware. You pay about $11 a week for an EC2 instance, which is quite good once you compare it against the electricity costs that come on top of running a personal computer.

The downside is that you get a slow EC2 GPU with 4 GB of RAM. Conv nets that take 3 weeks on EC2 will take less than 2 weeks on a GTX 980. If you run large conv nets, the 4 GB can be limiting (for example on ImageNet or similarly sized data sets).

Another point is that it is more convenient to work on your own desktop, and you can run multi-GPU nets, which is not possible on EC2 because the virtualization kills the memory bandwidth between GPUs.

If you think about it, over the long term a personal system will just be more cost efficient (you can keep a good system for years). So for deep learning researchers and those who apply deep learning, this is just the most cost effective option.

An example calculation: you can buy a faster system than an EC2 instance for roughly $400 (GTX 580 + other parts from eBay). Together with electricity costs, that's about 1 year's worth of EC2, or 2 years' worth if you use deep learning sporadically. A high-end deep learning system will be about $1000-1400, which is about 3 years' worth of EC2. So EC2 makes good sense if you use deep learning only sporadically and work with small data sets. If you use deep learning heavily, want a faster system, or want to use multiple GPUs, a personal system will be better.
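
To spell out the arithmetic (the electricity figure is a rough assumption; the other numbers are the ones above):

  # Break-even sketch using the figures above; electricity is an assumption.
  ec2_per_week = 11.0          # g2 instance under fairly heavy use
  electricity_per_week = 3.0   # assumed: ~20 kWh/week at ~$0.15/kWh
  savings_per_week = ec2_per_week - electricity_per_week

  for name, cost in [("used GTX 580 build", 400), ("high-end build", 1200)]:
      weeks = cost / savings_per_week
      print("%s pays for itself after ~%.0f weeks (~%.1f years)"
            % (name, weeks, weeks / 52.0))
  # -> used GTX 580 build after ~50 weeks (~1.0 years)
  # -> high-end build after ~150 weeks (~2.9 years)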


Putting aside the question of hardware+electricity vs. g2.2xlarge service charges, I think it's worth mentioning that there's a lot more to putting these models together than just getting the hardware and paying to operate it. I tend to spend quite a while mucking with configurations, writing data preprocessing/formatting code, and doing component-wise checking of each piece of the giant ball of software it inevitably becomes. For these tasks, it can be a LOT more convenient to be running locally.

As soon as you're dealing with EC2, you have to take on the mental overhead of making sure that all your configuration persists between restarts (especially if you're using spot instances!), running start-up tasks, mounting EBS and paying for volumes, etc., and in my experience this all really adds up. That said, I still do use EC2 for some things. If I have five similar models I want to run in parallel, it's as simple as spinning up five identical instances. Also, once I start training new models I can continue to use my workstation without any slowdowns.
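
For example, with boto it's only a few lines to launch those five instances (the AMI ID and key name here are hypothetical placeholders; in practice you'd bake your whole software stack into the AMI):

  # Hedged sketch using boto: launch five identical GPU instances.
  import boto.ec2

  conn = boto.ec2.connect_to_region("us-east-1")
  reservation = conn.run_instances(
      "ami-xxxxxxxx",              # hypothetical AMI with your stack pre-installed
      min_count=5, max_count=5,
      instance_type="g2.2xlarge",
      key_name="my-key")           # hypothetical key pair name
  print([inst.id for inst in reservation.instances])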


A very valuable comment – this is an important perspective, thanks! I think I will add an EC2 section to my blog post.


Awesome article! Totally agree with the author.


The GTX 980 outperforms the K80?


The K80 is weird; it behaves like 2 GPUs that need to be fed over the same interface. Unless you really, really need the RAM, you would get better performance from 2 separate cards.


No, but a K80 isn't really meant for workstations (from what I understand). Plus, it's $7k while a 980 is sub $1k typically.


Well, first of all, just to clarify, I was asking a question, not making a statement. I recently ordered an HPC server for my company; we're using Caffe to train/detect on very large data sets.

I went with the K80; the company we ordered it from charged us $4400 for the card, so there must be a good amount of markup that can be negotiated out of it.

I have since read some material comparing the K40 to the 980, giving a slight edge to the 980, which is surprising considering the price points, but I have not yet found any good benchmarks/posts about the K80 vs the 980. The K80 is not just two K40s glued together, as it uses the GK210 Tesla chips rather than the GK110. The GK210 is a more advanced chip with more cache and better energy efficiency, but I'm really not too sure how that translates into real-world performance.

If anybody has any data or perspective on this, I would appreciate it.


Just my 2 cents. I've been testing my scientific computing code (QM/MM, not ML) on a cluster using various configs (6x K40, 4x K80, 6x K20, etc.), and the performance I've seen from the K80 is quite strange. I've been using CUDA_DEVICE 0,1,2,3 in that config, and if I try to use more than one logical GPU, the scaling is not 1:1 but more like 1:0.6.

The only conclusion I've been able to reach is that the K80 presents itself as 2 different devices (0,1 or 2,3 in that config), but the performance is not 2x at all. There is quite a lot of PCI bus contention, which badly hurts the performance of my code (as it is just running many <10ms kernels at a time). So far, having 2x K40 seems to be a better value and performance proposition than 1x K80 on the same bus, but the flops/watt side of that equation greatly favors the K80.
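
You can see the "two devices" behavior just by enumerating GPUs; a quick sketch (assuming pycuda is installed):

  # Sketch, assuming pycuda: a single K80 board shows up as two separate
  # CUDA devices, each with its own memory, even though they share one
  # PCIe slot on the host.
  import pycuda.driver as cuda

  cuda.init()
  for i in range(cuda.Device.count()):
      dev = cuda.Device(i)
      mem_gb = dev.total_memory() / (1024.0 ** 3)
      print("device %d: %s, %.1f GB" % (i, dev.name(), mem_gb))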


Awesome article!



