Benchmarking Modern GPUs for Maximum Cloud Cost Efficiency in Deep Learning (minimaxir.com)
83 points by minimaxir on Dec 16, 2017 | 26 comments



> As with the original benchmark, I set up a Docker container containing the deep learning frameworks (based on cuDNN 6, the latest version of cuDNN natively supported by the frameworks) that can be used to train each model independently. The Keras benchmark scripts run on the containers are based off of real world use cases of deep learning.

Can someone who is knowledgeable about docker (and container performance in general) comment on how using a docker instance impacts benchmark performance?

I understand containers are not virtual machines, but my preference would be for a benchmark run on bare metal without containerization involved. In particular, it’s not clear to me whether a container can take advantage of the same CPU and GPU instruction optimizations that a binary or compiled-from-source version can. For example, I’m curious if containers package redundant versions of software for convenience that are underoptimized when compared to versions packaged with the operating system. Is that a realistic concern?

In other words, is there generally a meaningful downside to benchmarking with containers, and if so what is it? And equally importantly, are containers the way most people use deep learning frameworks? I have never installed Tensorflow or Keras using a container.


> it’s not clear to me whether a container can take advantage of the same CPU and GPU instruction optimizations that a binary or compiled-from-source version can.

Since the container runs on the bare metal, with the same kernel and everything, just isolated, it absolutely can take advantage of these things. Compiling with any special instructions for performance takes about the same amount of work inside a container as it does on bare metal.

> For example, I’m curious if containers package redundant versions of software for convenience that are underoptimized when compared to versions packaged with the operating system. Is that a realistic concern?

Maybe. There will be extra data on disk, and possibly in RAM from loading duplicate shared libraries, which adds some overhead. But unless there's a specific set of instructions or optimizations that would make a difference for the workload, it should be basically the same.


> unless there's some specific set of instructions or optimizations that would make a difference for the workload

I think this was the parent's worry. Most HPC deployments have versions of libc, libm, etc. compiled with tuning flags specific to the microarchitecture of the system, to allow workloads deployed on them to squeeze as much performance as possible out of the system. I would worry that the versions of libraries in the container are just the generic packages that ship with the OS, compiled with -march=i686 or whatever the modern equivalent is, where they don't try to take advantage of e.g. AVX512 instructions if they're available.


Yes, you're 100% correct, but if you're using a well-tuned GPU deep learning pipeline, the bottleneck is GPU compute and the CPU is always waiting on the GPU.

It can be a bit tricky to optimise this, though. Some deep learning users do things like resizing images on the fly, or vectorising/scaling input data on the fly; in those cases the CPU can be the bottleneck and the optimisations above can help. But this is largely a symptom of a poor pipeline configuration in newbie shops - most serious places have optimised this away and the GPU is the bottleneck.
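
Just to illustrate the kind of fix being described, here is a minimal Keras sketch that pushes on-the-fly image preprocessing into parallel worker processes so the GPU stays fed. The directory layout, model, and worker count are illustrative assumptions, not the article's setup.

    from keras.models import Sequential
    from keras.layers import Conv2D, GlobalAveragePooling2D, Dense
    from keras.preprocessing.image import ImageDataGenerator

    # Tiny placeholder model; the point here is the data pipeline below it.
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
        GlobalAveragePooling2D(),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')

    # Resizing/augmentation happens on the CPU for every batch. Spreading it
    # across worker processes keeps the GPU from idling between batches.
    datagen = ImageDataGenerator(rescale=1.0 / 255)
    train_gen = datagen.flow_from_directory(
        'data/train',                # assumed: 10 class subfolders of images
        target_size=(224, 224),
        batch_size=64)

    model.fit_generator(
        train_gen,
        steps_per_epoch=train_gen.samples // 64,
        epochs=10,
        workers=4,                   # parallel CPU-side preprocessing
        use_multiprocessing=True)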


> I would worry that the versions of libraries in the container are just the generic packages that ship with the OS, compiled with -march=i686 or whatever the modern equivalent is, where they don't try to take advantage of e.g. AVX512 instructions if they're available.

That was essentially my worry, yes. Thanks for clarifying.


Most binary distributions don’t have different sets of packages for different feature sets of 64-bit Intel processors. Most containers use the same exact binaries the given distribution would on bare metal.

The processor won't (and AFAIK can't) report different feature sets for processes running in a Docker container, and most heavily optimized numerical libraries decide whether to use things like AVX at runtime, not at compile time.
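
A quick way to check this yourself (a minimal sketch, Linux-only): read the CPU flags from inside the container and compare them with the host. They come from the same kernel and the same physical CPU, so the sets match.

    # Run this both on the host and inside the container: the flag set is the
    # same, because a container shares the host kernel and sees the real CPU.
    def cpu_flags():
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    flags = cpu_flags()
    print("avx:", "avx" in flags)
    print("avx2:", "avx2" in flags)
    print("avx512f:", "avx512f" in flags)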


Tensorflow does have compile-time optimisations, but it will warn you if you're running a binary that's missing out on, for example, AVX when the hardware supports it.


Great write up! I would be remiss to not mention Paperspace https://www.paperspace.com where we also offer cloud GPU infrastructure that is less expensive and more powerful than most of the larger clouds.

We also offer a suite of tools that makes cloud AI pipelines a bit easier to set up and manage.

If this is of interest to anyone, here's a $5 promo code to try us out: HNGPU5

full disclosure: I'm one of the co-founders :)


I would be interested to see how the new preemptible GPU instances fare in this comparison, perhaps on the next revision? https://cloud.google.com/compute/docs/instances/preemptible#...


Dammit. When did Google announce these?

It looks like preemptible GPUs are exactly half the price of normal GPUs (for both K80s and P100s: $0.22/hr and $0.73/hr respectively), so they're roughly double the cost efficiency once you factor in the cost of the base preemptible instance, which would put them squarely ahead of CPUs in all cases. (And since the CPU instances used here were also preemptible, it's an apples-to-apples comparison.)


Quite recently, I think. You can't be blamed for missing it; the documentation is inconsistent about whether it's supported. (https://mobile.twitter.com/danielwithmusic/status/9421780263...).


Shameless plug: we also have a benchmark post for CPU/K80/V100 focusing on image tasks: https://blog.floydhub.com/benchmarking-floydhub-instances. In this case, the V100 can be a lot more cost-effective compared to the CPU and K80. Our Tensorflow environments are built from source with optimizations targeting our CPU instances.


Would be better if Volta-based P3 instances were included in the comparison.

Spoiler Alert: It is a game changer.


I talked about Volta at the end of the post, but it'll still be a bit before Volta/cuDNN 7 support is baked into the native Tensorflow/CNTK distributions.

Even if Volta has the speed advantages touted, I doubt P3s will be as cost-effective as a K80.


It is actually more interesting than that. In my experience a P3 instance can significantly accelerate training, by as much as 2x-3x, so to train the same model you won't need to keep the instance up nearly as long as you would a K80 instance.


1. I wouldn't take much away from the LSTM benchmark. It's more a benchmark of Keras, since Keras only supports CuDNN's LSTM via Tensorflow right now. AFAIK CNTK does support the CuDNN LSTM, but not through Keras. Keras actually implements its own LSTM in terms of the base math operations (it doesn't call the Tensorflow or CNTK LSTM operations, which are in some cases optimized in C++ etc.), so on the CPU you could probably get better performance by using the Tensorflow or CNTK functions directly. (A minimal example of the cuDNN-backed Keras layer is sketched after this list.)

2. Compiling Tensorflow from source on CPUs is a bit of a hassle but I have seen nice performance gains (10-20%) for LSTM tasks. I bet you would get even higher gains for CNNs since they're more parallelizable. (Note: I've never gotten the latest TF to work with Intel MKL).

3. I haven't fully tested this myself, but with the P100s you also have full support for half precision floats, which supposedly offer a huge speedup.

4. Also would have liked to see benchmarks of other frameworks like PyTorch, etc. I haven't used them myself but everything I've heard indicates that Tensorflow is often slower.
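
Here is the sketch mentioned in point 1: swapping Keras's generic LSTM for the cuDNN-backed CuDNNLSTM layer (TensorFlow backend only, in recent Keras releases). The model shape and hyperparameters are placeholders, not the benchmark's.

    from keras.models import Sequential
    from keras.layers import Embedding, Dense, LSTM, CuDNNLSTM

    use_cudnn = True  # assumes a CUDA-capable GPU; CuDNNLSTM won't run on CPU

    model = Sequential()
    model.add(Embedding(20000, 128, input_length=100))
    # LSTM is built out of generic backend ops; CuDNNLSTM calls the fused
    # cuDNN kernel directly and is typically several times faster on a GPU.
    model.add(CuDNNLSTM(128) if use_cudnn else LSTM(128))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

Note that CuDNNLSTM doesn't support every option of the plain LSTM layer (recurrent dropout, custom activations), so it isn't always a drop-in swap.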


Great write-up! Even though I'm very far from understanding deep learning and the various frameworks you're testing here, benchmarking GPUs has always fascinated me.

I'm interested to see the utilization of the underlying GPU devices when you run the MLP or CNN benchmarks (monitored with `nvidia-smi`): the speed-up factor between the different benchmarks doesn't seem to be in line with the K80-to-P100 speed-up shown on the cuDNN page [1]. I'm wondering if the P100 is under-utilized when used with TensorFlow or CNTK.

[1] https://developer.nvidia.com/cudnn
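
For what it's worth, here is a small sketch of how one might sample utilization during a run, assuming the NVIDIA driver (and therefore nvidia-smi) is installed; the query fields are standard nvidia-smi options.

    import subprocess
    import time

    def gpu_stats():
        # First GPU only; nvidia-smi prints one CSV line per device.
        out = subprocess.check_output([
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ]).decode()
        util, mem = out.strip().splitlines()[0].split(", ")
        return int(util), int(mem)

    for _ in range(10):  # sample once a second for ~10 seconds
        util, mem = gpu_stats()
        print("GPU util: %3d%%  memory used: %d MiB" % (util, mem))
        time.sleep(1)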


At least for personal experiments, how much worse is it to just build a system with a 1080 Ti, leave it in a corner where you live, and remote in as needed from a laptop, school, wherever?

Separately, it seems that with ML, price/performance is sometimes less important than how much money you have to spend.

For example, with other development work I'd probably never buy an 18-core CPU, because for most projects it wouldn't speed up my iterative work much.

However, ML is more like VFX work, where it's common that nothing is ever fast enough. In other words, I would more often be willing to ignore price/perf and, for example, pay 100% more for only a 50% perf gain, if it were within my means to do so.


This is a great deal, 4x 1080 for $1700 CAD/month if you need it the whole month or for longer:

https://www.ovh.com/ca/en/dedicated-servers/gpu/1801gpu06.xm...

But the best deal with regards to GPUs is to buy your own and put it in your office.


Hetzner has a single 1080 for 99 EUR a month: https://www.hetzner.de/dedicated-rootserver/ex51-ssd-gpu


I've used Hetzner; it works great and costs a tiny fraction as much. But you have to manage the nodes yourself, and the bandwidth isn't great.


With Kubernetes and Docker, it has never been easier to run it yourself.


>But the best deal with regards to GPUs is to buy your own and put it in your office.

Of course, if you need to use it _all_ the time (or need to heat the space). The whole point of cloud computing is to share those resources with others when you don't need them.


Right, but after 3-4 months of full time use (and it's not hard to use these full-time, since models can take days to train), you've spent more than if you just bought one outright.
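
As a rough sanity check on that, a back-of-the-envelope calculation; all numbers here are assumptions for illustration, not quotes.

    # Break-even point for buying a card versus renting full-time.
    card_price = 700.0    # assumed: roughly a 1080 Ti at retail, USD
    hourly_rate = 0.45    # assumed: an on-demand cloud K80, USD/hr
    hours_per_month = 24 * 30

    monthly_rent = hourly_rate * hours_per_month
    print("Renting full-time costs about $%.0f/month" % monthly_rent)   # ~$324
    print("Card pays for itself in %.1f months" % (card_price / monthly_rent))

Depending on which card and which hourly rate you plug in, the break-even lands within a few months of full-time use.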


> Indeed, the P100 is twice as fast as the K80, but due to the huge cost premium, it’s not cost effective for this simple task.

Am I reading the graph wrong, or is this statement not true?

Looks like the P100 is running the code in around half the time of the K80, and its hourly cost is only 150% that of the K80.

Half the time at 1.5x the hourly price works out to 0.5 × 1.5 = 0.75x the total cost per run, so the P100 should actually be the more cost-effective option here.


I'd be interested in hearing how this compares to TPU.



