High performance models in TensorFlow (tensorflow.org)
164 points by mrry on May 4, 2017 | 32 comments



How would you ever saturate 8 GPUs if your average large motherboard with 2 CPUs has only 40 PCIe lanes at its disposal?

Is there a motherboard out there that would allow the usage of 8 GPUs with each of them allocated the full 16 lanes?

The best I've found so far would be 4 GPUs on one board, either with 16/8/8/8 or in a rare case 16/16/16/16 (but that requires 2 CPUs).

And that's besides the physical space, which again seems to limit you to 4 double-wide GPUs on one motherboard.


Quite the opposite is true. If you are using the PCIe bus a lot you probably aren't making good use of the GPU.

A GPU has its own processor and RAM. If you're transferring to your system RAM and back again often enough to max out x16 PCIe lanes, you should fix that.


It would be interesting to see exactly how much data is transferred across the bus during training. Of course it would be great if you could fit your whole dataset on the other side, but typically during training a GPU will max out at batches of anywhere from 8 to 64, depending on image size and number of channels. So you'll be moving quite a bit of data.

I found a tool to monitor this, but unfortunately it only works on Xeons and not on what's in my desktop.

https://github.com/opcm/pcm


Most research is done on tiny images (224x224) and therefore uses large batch sizes (128). So there's a lot of compute to do. Even more so if you're using RNNs with large backprop windows.
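
To put a rough number on it (a back-of-envelope sketch; the uint8 inputs and step rate are assumptions, not measurements):

    # Rough host-to-GPU input traffic for ImageNet-style training.
    # Assumes uint8 pixels and an optimistic step rate (both assumptions).
    bytes_per_batch = 128 * 224 * 224 * 3      # ~19.3 MB per batch of 128 RGB images
    steps_per_second = 10                      # generous for a large convnet
    print(bytes_per_batch * steps_per_second / 1e9)  # ~0.19 GB/s vs ~15.8 GB/s for PCIe 3.0 x16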

There are quite a few examples of folks keeping 8 GPUs busy. E.g. Baidu's speech recognition training (which uses quasi-RNNs, IIRC).


Interesting. So that means it might actually be worth it to run a GPU off an extender cable or riser board.


Yes.

Using external GPUs over USB-C gives useful speed for training NNs.


40 lanes is a common limit on consumer chipsets and Core i7s (the Z170, X99 and Z97 are commonly used for DL builds), but as you said, pairs of older Xeons make more available: https://www.microway.com/hpc-tech-tips/common-pci-express-my...

Above that, you need to look at C612: https://www.supermicro.com/products/motherboard/Xeon/C600/X1...


AMD's Naples should have 128 PCIe 3.0 lanes, so that should happily feed 8 GPUs at x16 or 16 GPUs at x8 (I don't see most workloads being so bandwidth-starved that it's actually an issue).


What boards specifically are you talking about re: 4 GPUs (either 16/8/8/8 or 16/16/16/16)?


Asus X99-E WS is a good starting point. You can put four double-wide GPUs on that.

Some other really nice ones:

https://www.supermicro.nl/Aplus/system/Tower/4021/AS-4021GA-... (AMD based)

https://www.supermicro.nl/products/system/4U/7047/SYS-7047GR... (Intel based)

http://www.dell.com/us/business/p/poweredge-t630/pd

(That last one should be able to hold 4 GPUs, but I'm not quite sure whether it will be able to power all of them; Dell's documentation isn't much help either.)

Edit: just found this:

http://www.supermicro.nl/products/system/4U/4027/SYS-4027GR-...

About $4K, plus the 8 GPUs; that's a pretty penny. Drool.


I'm using Asus' Z10PE-D8 WS with two Xeons. It's got 80 lanes and works all right with 4 GPUs.


Neat! What kind of PSU are you running that off and what kind of GPUs?


GPUs are GTX Titan (the older 6 GiB cards, not the Titan X), and currently I only use 3 of those with an EVGA SuperNOVA 1300 G2. Full config:

Z10PE-D8 WS mobo, 2 x Xeon E5-2620 v4 (32 threads), 128 GiB DDR4 ECC RAM (8 x 16 GiB), 256 GiB SM961 NVMe, 12 TiB SATA (4 x 3 TiB HGST server drives), 3 x GTX Titan 6 GiB, Soldam Black Knight XR1 case, 2 x DeepCool Gammax 400 CPU coolers, EVGA SuperNOVA 1300 G2 PSU, 2 x Noctua NF-A9 PWM, 2 x Noctua NF-A14 3000 PWM


I have a similar setup to grandparent (Z10PE-D16 WS), running four last-gen Titan X GPUs with an EVGA SuperNOVA P2 1600W PSU.

I was a little worried because this is the most power-hungry machine I've ever built, but so far I haven't had any issues. But I'm not going out of my way to start an electrical fire, either.

The motherboard in question is pretty deluxe; the only thing I'd complain about is the boot time.


Two CPUs max out at 80 lanes, IIRC.


That's correct: two Xeon E5-2600s do 40 lanes each, and that's the maximum you can get right now. That's also why dual-CPU boards usually require both CPUs to be installed to have all PCIe slots available: the PCIe interface is now embedded in the CPU rather than in the chipset.


You can get 4 CPUs easily, and even 8.


40 lanes per CPU is the maximum, AFAIK.


Yup, 2x40.

It also depends on what you plan to do with the GPU. For example, models that do most of the work on the GPU and rarely ingest data from the host, such as large and slow models, will run just fine. On the other hand, attempting to parallelize training across GPUs and nodes is a chore...


Yeah, but he was wrong on the CPU counts.


AMD's Naples should hit 128 lanes for a dual socket (as I understand it each socket has 128 lanes, but 64 are reserved for cross-CPU traffic, leaving 128 for general usage).


Whenever there is any AI/ML stuff on HN I immediately CTRL+F for your handle. Good reads + insights every time. Cheers.


My bottleneck is still the speed of reading data from JSON. I wonder whether I should wait for features to be built out here or go down the path of writing a custom data reader in C++.


If your data has a little extra structure that isn't shared by JSON in general, you could probably get serious performance gains by rolling your own.
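
For example (a minimal sketch; the one-record-per-line layout, the key names and the fixed ordering are assumptions about your data, not properties of JSON in general):

    import numpy as np

    def parse_line(line):
        # Hand-rolled parser for one record per line shaped like
        #   {"label": 3, "features": [0.1, 0.2, ...]}
        # Assumes a flat structure, fixed key order and no string escaping,
        # so we can skip a general-purpose JSON parser entirely.
        label = int(line[line.index(':') + 1:line.index(',')])
        start, end = line.index('[') + 1, line.rindex(']')
        features = np.array(line[start:end].split(','), dtype=np.float32)
        return label, features

Whether that beats the stock json module by a useful margin depends entirely on the data, so benchmark before committing to it.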


Utilizing multiprocessing for reading and processing JSON (or any type of data) and then feeding the output into a shuffle_batch* op works great for me.
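
Roughly like this (a minimal TF 1.x sketch; the file layout, the "features" key, FEATURE_DIM and the pool/queue sizes are placeholder assumptions):

    import glob
    import json
    import threading
    from multiprocessing import Pool

    import numpy as np
    import tensorflow as tf

    FEATURE_DIM = 128  # assumed fixed-size feature vector per example

    def parse_file(path):
        # Parse one JSON file into a float32 vector (hypothetical schema).
        with open(path) as f:
            return np.asarray(json.load(f)["features"], dtype=np.float32)

    # A shuffling queue decouples Python-side parsing from the training graph.
    example_ph = tf.placeholder(tf.float32, shape=[FEATURE_DIM])
    queue = tf.RandomShuffleQueue(capacity=10000, min_after_dequeue=1000,
                                  dtypes=[tf.float32], shapes=[[FEATURE_DIM]])
    enqueue_op = queue.enqueue([example_ph])
    batch = queue.dequeue_many(64)  # shuffled batches of 64 examples

    def feed(sess, paths):
        # Parse JSON in worker processes; enqueue results from this thread.
        pool = Pool(processes=4)
        for example in pool.imap_unordered(parse_file, paths):
            sess.run(enqueue_op, feed_dict={example_ph: example})
        pool.close()

    with tf.Session() as sess:
        paths = glob.glob("data/*.json")
        feeder = threading.Thread(target=feed, args=(sess, paths))
        feeder.daemon = True
        feeder.start()
        first_batch = sess.run(batch)  # blocks until enough examples are queued
        print(first_batch.shape)       # (64, FEATURE_DIM)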


You could use TensorFlow-on-Spark to read your JSON into an RDD in Spark. Then the TF RDD reader will be in-memory and can feed your training.


It appears that Hacker News users upvote anything with machine learning or TensorFlow in it. This is merely a FIFO queue implementation, which is not particularly significant anyway. Why it was submitted, much less upvoted, makes no sense to me.


Maybe it was upvoted to be saved for later reading. This seems like new work released by the TF team on best practices, along with benchmarks for comparison.

But I agree with your assessment; I have noticed several barely interesting blog posts/arXiv papers get upvoted.


Actually, this provides a solution to a very serious problem that is often the main speed bottleneck in machine learning experimentation. For me at least, this was a great read and I'm pretty sure I'll be trying it out soon.


One (not so charitable, I have to admit) interpretation is that there is a sizable group of people interested in ML who don't have the more traditional computer science background, and thus find this kind of text far more appealing; see xkcd's "Ten Thousand" [0].

A more extreme version of this situation is that of the medical researcher who "reinvented" the trapezoidal rule for numerical integration on his own [1].

[0]: https://xkcd.com/1053/ [1]: https://fliptomato.wordpress.com/2007/03/19/medical-research...


No. Yesterday the TensorFlow team released benchmarks [1], and as part of those benchmarks they also published the tips for ensuring high performance [2] that they used in the benchmarks. The reason for publishing the second link is that, due to copying memory between Python and C++ when using feed_dict, TensorFlow can appear to be slower than competing frameworks, so they felt a need to point out the methods used to ensure high performance by reducing I/O latency.
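
To illustrate the copy in question (a toy TF 1.x sketch, not the benchmark code; the variable stands in for a real reader/queue pipeline):

    import numpy as np
    import tensorflow as tf

    data = np.random.rand(64, 1024).astype(np.float32)

    # (a) feed_dict: the batch is copied from Python into the runtime on every step.
    x = tf.placeholder(tf.float32, shape=[64, 1024])
    y_feed = tf.reduce_sum(tf.square(x))

    # (b) graph-side input: the data already lives inside the runtime, so there is
    # no per-step Python copy (real pipelines use readers/queues, not a variable).
    x_var = tf.Variable(data, trainable=False)
    y_graph = tf.reduce_sum(tf.square(x_var))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(y_feed, feed_dict={x: data})  # copies `data` on each call
        sess.run(y_graph)                      # no feed, no copy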

Looking at [2] without the context of [1] can be confusing. But no one is trying to pass this off as "innovation".

[1] https://www.tensorflow.org/performance/benchmarks [2] https://www.tensorflow.org/performance/performance_models


Oh, sorry if I was unclear; I didn't intend to say that the TensorFlow team is trying to pass this off as innovation. I was thinking about why some readers might find this interesting and upvote it even without the context.



