Quite the opposite is true. If you are using the PCIe bus a lot you probably aren't making good use of the GPU.
A GPU has its own processor and RAM. If you're transferring to system RAM and back often enough to max out x16 PCIe lanes, you should fix that.
It would be interesting to see exactly how much data is transferred across the bus during training. Of course it would be great if you could fit your whole dataset on the other side but typically a GPU during training will max out at batches of anywhere from 8 to 64 depending on image size and number of channels. So you'll be moving quite a bit of data.
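A rough back-of-envelope calculation supports this. Assuming float32 images and the theoretical ~15.75 GB/s of a PCIe 3.0 x16 link (real throughput is lower), even large batches are small relative to the bus:

```python
# Back-of-envelope estimate of host->GPU transfer per training batch.
# Assumptions: float32 images (4 bytes/value), PCIe 3.0 x16 at a
# theoretical ~15.75 GB/s; actual throughput will be noticeably lower.
def batch_bytes(batch, height, width, channels, dtype_bytes=4):
    """Bytes needed to ship one batch of images across the bus."""
    return batch * height * width * channels * dtype_bytes

PCIE3_X16_BPS = 15.75e9  # theoretical PCIe 3.0 x16 bandwidth in bytes/s

per_batch = batch_bytes(64, 224, 224, 3)          # 64 images, 224x224 RGB
batches_per_sec = PCIE3_X16_BPS / per_batch

print(f"{per_batch / 1e6:.1f} MB per batch, "
      f"~{batches_per_sec:.0f} batches/s at the bus limit")
```

At ~38.5 MB per batch of 64 RGB 224x224 images, the bus could in principle move hundreds of batches per second, so for typical image models the compute, not the transfer, is usually the limit.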
I found a tool to monitor this; unfortunately it only works on Xeons and not on what's in my desktop.
Most research is done on tiny images (224*224) and therefore uses large batch sizes (128), so there's a lot of compute to do. Even more so if you're using RNNs with large backprop windows.
There are quite a few examples of folks keeping 8 GPUs busy, e.g. Baidu's speech recognition training (which uses quasi-RNNs, IIRC).
AMD's Naples should have 128 PCIe 3.0 lanes, so that should happily feed 8 x16 GPUs or 16 x8 GPUs (I don't see most workloads being so bandwidth-starved that it's actually an issue).
(That last one should be able to hold 4 GPUs but I'm not quite sure about whether or not it will be able to power all of them, Dell isn't helping with their documentation either.)
I have a similar setup to grandparent (Z10PE-D16 WS), running four last-gen TitanX GPUs with an EVGA SuperNOVA P2 1600W PSU.
I was a little worried because this is the most power-hungry machine I've ever built, but so far I haven't had any issues.
But I'm not going out of my way to start an electrical fire, either.
The motherboard in question is pretty deluxe; the only thing I'd complain about is the boot time.
That's correct: two Xeon 2600s do 40 lanes each, and that's the maximum you can get right now. That's why dual-CPU boards usually require both CPUs to be installed to have all PCIe slots available: the PCIe interface is now embedded in the CPU rather than in the chipset.
It also depends on what you plan to do with the GPU. For example, models that do most of the work on the GPU and rarely ingest data from the host, such as large and slow models, will run just fine. On the other hand, attempting to parallelize training across GPUs and nodes is a chore...
AMD's Naples should hit 128 lanes for a dual socket (as I understand it, each socket has 128 lanes, but 64 are reserved for cross-CPU traffic, leaving 128 for general usage).
My bottleneck is still speed of reading data from json. I wonder whether I should wait for features to be built out here or go down the path of writing a custom data reader in C++
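Before writing a custom C++ reader, it may be worth measuring how much of the time really goes to JSON parsing versus I/O. A quick, hypothetical sanity check (synthetic records, Python standard library only; absolute numbers will vary by machine and schema) is to compare JSON deserialization against a binary format like pickle on the same data:

```python
# Rough check of whether JSON parsing dominates data loading.
# The records here are synthetic stand-ins, not any particular dataset.
import json
import pickle
import time

records = [{"id": i, "features": [float(j) for j in range(50)]}
           for i in range(10_000)]

json_lines = [json.dumps(r) for r in records]   # JSON-lines style storage
pickle_blob = pickle.dumps(records)             # binary alternative

t0 = time.perf_counter()
from_json = [json.loads(line) for line in json_lines]
t_json = time.perf_counter() - t0

t0 = time.perf_counter()
from_pickle = pickle.loads(pickle_blob)
t_pickle = time.perf_counter() - t0

print(f"json: {t_json:.3f}s  pickle: {t_pickle:.3f}s")
```

If the gap is large, a one-time offline conversion to a binary format may buy most of the speedup without any C++.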
It appears that Hacker News users upvote anything with machine learning or TensorFlow in it. This is merely a FIFO queue implementation, which is not particularly significant in any way. Why it was submitted, much less upvoted, makes no sense to me.
Actually this provides a solution for a very serious problem which is usually the main bottleneck in machine learning experimentation (in terms of speed). For me at least, this was a great read and I'm pretty sure I'll be trying this out soon.
One (not so charitable, I have to admit) interpretation is that there is a sizable group of people interested in ML who don't have the more traditional computer science background, and thus find this kind of text far more appealing; xkcd's ten thousand and everything [0].
A more extreme version of this situation is that of the medical researcher who "reinvented" the trapezoidal rule for numerical integration on his own [1].
No. Yesterday the TensorFlow team released benchmarks [1], and as part of those benchmarks they simultaneously published the high-performance tips [2] they used in them. They published the second link because, due to copying memory between Python and C++ (when using feed_dict), TensorFlow can appear slower than competing frameworks, so they felt a need to point out the methods used to ensure high performance by reducing I/O latency.
Looking at [2] without having context of [1] can be confusing. But no one is trying to pass this off as "innovation".
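The core idea behind those input queues can be illustrated without TensorFlow at all: a background thread prefetches batches into a bounded FIFO so loading and compute overlap instead of alternating. This is a minimal sketch of that pattern; the names (load_batch, train_step) and the sleep-based timings are illustrative stand-ins, not TensorFlow API:

```python
# Minimal sketch of the prefetch-queue idea behind TensorFlow's input
# pipelines: a producer thread fills a bounded FIFO while the consumer
# "trains", so I/O latency is hidden behind compute.
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.01)   # stand-in for disk read / JSON parse / feed_dict copy
    return i

def train_step(batch):
    time.sleep(0.01)   # stand-in for GPU compute
    return batch * 2

def run(prefetch, n_batches=20):
    results = []
    if prefetch:
        q = queue.Queue(maxsize=8)  # bounded, like a FIFO queue with capacity

        def producer():
            for i in range(n_batches):
                q.put(load_batch(i))
            q.put(None)             # sentinel: no more batches

        threading.Thread(target=producer, daemon=True).start()
        while (batch := q.get()) is not None:
            results.append(train_step(batch))
    else:
        # naive loop: load and compute strictly alternate
        results = [train_step(load_batch(i)) for i in range(n_batches)]
    return results

t0 = time.perf_counter(); run(prefetch=False); serial = time.perf_counter() - t0
t0 = time.perf_counter(); run(prefetch=True); overlapped = time.perf_counter() - t0
print(f"serial: {serial:.2f}s  prefetched: {overlapped:.2f}s")
```

With both phases taking ~10 ms each, the serial loop costs roughly their sum per batch while the prefetched version approaches the cost of the slower phase alone, which is the whole point of the queue.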
Oh, sorry if I was unclear, but I didn't intend to say that the TensorFlow team is trying to pass this off as innovation. I was thinking about why some readers might find this interesting and upvote even without the context.
Is there a motherboard out there that would allow the usage of 8 GPUs with each of them allocated the full 16 lanes?
The best I've found so far would be 4 GPUs on one board, either with 16/8/8/8 or in a rare case 16/16/16/16 (but that requires 2 CPUs).
That's besides the physical space constraint, which again seems to limit you to 4 double-wide GPUs on one motherboard.