Quite the opposite is true. If you are using the PCIe bus a lot you probably aren't making good use of the GPU.
A GPU has its own processor and RAM. If you're transferring to system RAM and back often enough to max out x16 PCIe lanes, you should fix that.
It would be interesting to see exactly how much data is transferred across the bus during training. Of course it would be great if you could fit your whole dataset on the other side but typically a GPU during training will max out at batches of anywhere from 8 to 64 depending on image size and number of channels. So you'll be moving quite a bit of data.
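A rough back-of-envelope calculation supports this. Assuming float32 images and the theoretical ~15.75 GB/s of a PCIe 3.0 x16 link (real throughput is lower), even large batches are small relative to the bus:

```python
# Back-of-envelope estimate of host->GPU transfer per training batch.
# Assumptions: float32 images (4 bytes/value), PCIe 3.0 x16 at a
# theoretical ~15.75 GB/s; actual throughput will be noticeably lower.
def batch_bytes(batch, height, width, channels, dtype_bytes=4):
    """Bytes needed to ship one batch of images across the bus."""
    return batch * height * width * channels * dtype_bytes

PCIE3_X16_BPS = 15.75e9  # theoretical PCIe 3.0 x16 bandwidth in bytes/s

per_batch = batch_bytes(64, 224, 224, 3)          # 64 images, 224x224 RGB
batches_per_sec = PCIE3_X16_BPS / per_batch

print(f"{per_batch / 1e6:.1f} MB per batch, "
      f"~{batches_per_sec:.0f} batches/s at the bus limit")
```

At ~38.5 MB per batch of 64 RGB 224x224 images, the bus could in principle move hundreds of batches per second, so for typical image models the compute, not the transfer, is usually the limit.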
I found a tool to monitor this; unfortunately it only works on Xeons and not on what's in my desktop.
Most research is done on tiny images (224*224) and therefore uses large batch sizes (128), so there's a lot of compute to do. Even more so if you're using RNNs with large backprop windows.
There are quite a few examples of folks keeping 8 GPUs busy, e.g. Baidu's speech recognition training (which uses quasi-RNNs, IIRC).
AMD's Naples should have 128 PCIe 3.0 lanes, so that should happily feed 8 x16 GPUs or 16 x8 GPUs (I don't see most workloads being so bandwidth-starved that it's actually an issue).
(That last one should be able to hold 4 GPUs but I'm not quite sure about whether or not it will be able to power all of them, Dell isn't helping with their documentation either.)
I have a similar setup to grandparent (Z10PE-D16 WS), running four last-gen TitanX GPUs with an EVGA SuperNOVA P2 1600W PSU.
I was a little worried because this is the most power-hungry machine I've ever built, but so far I haven't had any issues.
But I'm not going out of my way to start an electrical fire, either.
The motherboard in question is pretty deluxe; the only thing I'd complain about is the boot time.
That's correct: two Xeon 2600s do 40 lanes each, and that's the maximum you can get right now. That's why dual-CPU boards usually require both CPUs to be installed to have all PCIe slots available: the PCIe interface is now embedded in the CPU rather than in the chipset.
It also depends on what you plan to do with the GPU. For example, models that do most of the work on the GPU and rarely ingest data from the host, such as large and slow models, will run just fine. On the other hand, attempting to parallelize training across GPUs and nodes is a chore...
AMD's Naples should hit 128 lanes for a dual socket (as I understand it, each socket has 128 lanes, but 64 are reserved for cross-CPU traffic, leaving 128 for general usage).
My bottleneck is still speed of reading data from json. I wonder whether I should wait for features to be built out here or go down the path of writing a custom data reader in C++
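Before writing a custom C++ reader, it may be worth measuring how much of the time really goes to JSON parsing versus I/O. A quick, hypothetical sanity check (synthetic records, Python standard library only; absolute numbers will vary by machine and schema) is to compare JSON deserialization against a binary format like pickle on the same data:

```python
# Rough check of whether JSON parsing dominates data loading.
# The records here are synthetic stand-ins, not any particular dataset.
import json
import pickle
import time

records = [{"id": i, "features": [float(j) for j in range(50)]}
           for i in range(10_000)]

json_lines = [json.dumps(r) for r in records]   # JSON-lines style storage
pickle_blob = pickle.dumps(records)             # binary alternative

t0 = time.perf_counter()
from_json = [json.loads(line) for line in json_lines]
t_json = time.perf_counter() - t0

t0 = time.perf_counter()
from_pickle = pickle.loads(pickle_blob)
t_pickle = time.perf_counter() - t0

print(f"json: {t_json:.3f}s  pickle: {t_pickle:.3f}s")
```

If the gap is large, a one-time offline conversion to a binary format may buy most of the speedup without any C++.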
It appears that Hacker News users upvote anything with machine learning or TensorFlow in it. This is merely a FIFO queue implementation, which is not particularly significant in any way. Why it was submitted, much less upvoted, makes no sense to me.
Actually this provides a solution for a very serious problem which is usually the main bottleneck in machine learning experimentation (in terms of speed). For me at least, this was a great read and I'm pretty sure I'll be trying this out soon.
One (not so charitable, I have to admit) interpretation is that there is a sizable group of people interested in ML who don't have the more traditional computer science background, and thus find this kind of text far more appealing; xkcd's ten thousand and everything [0].
A more extreme version of this situation is that of the medical researcher who "reinvented" the trapezoidal rule for numerical integration on his own [1].
No. Yesterday the TensorFlow team released benchmarks [1], and as part of those benchmarks they simultaneously published the high-performance tips [2] they used in them. They published the second link because, due to copying memory between Python and C++ (when using feed_dict), TensorFlow can appear slower than competing frameworks, so they felt a need to point out the methods used to ensure high performance by reducing I/O latency.
Looking at [2] without having context of [1] can be confusing. But no one is trying to pass this off as "innovation".
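The core idea behind those input queues can be illustrated without TensorFlow at all: a background thread prefetches batches into a bounded FIFO so loading and compute overlap instead of alternating. This is a minimal sketch of that pattern; the names (load_batch, train_step) and the sleep-based timings are illustrative stand-ins, not TensorFlow API:

```python
# Minimal sketch of the prefetch-queue idea behind TensorFlow's input
# pipelines: a producer thread fills a bounded FIFO while the consumer
# "trains", so I/O latency is hidden behind compute.
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.01)   # stand-in for disk read / JSON parse / feed_dict copy
    return i

def train_step(batch):
    time.sleep(0.01)   # stand-in for GPU compute
    return batch * 2

def run(prefetch, n_batches=20):
    results = []
    if prefetch:
        q = queue.Queue(maxsize=8)  # bounded, like a FIFO queue with capacity

        def producer():
            for i in range(n_batches):
                q.put(load_batch(i))
            q.put(None)             # sentinel: no more batches

        threading.Thread(target=producer, daemon=True).start()
        while (batch := q.get()) is not None:
            results.append(train_step(batch))
    else:
        # naive loop: load and compute strictly alternate
        results = [train_step(load_batch(i)) for i in range(n_batches)]
    return results

t0 = time.perf_counter(); run(prefetch=False); serial = time.perf_counter() - t0
t0 = time.perf_counter(); run(prefetch=True); overlapped = time.perf_counter() - t0
print(f"serial: {serial:.2f}s  prefetched: {overlapped:.2f}s")
```

With both phases taking ~10 ms each, the serial loop costs roughly their sum per batch while the prefetched version approaches the cost of the slower phase alone, which is the whole point of the queue.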
Oh, sorry if I was unclear, but I didn't intend to say that the TensorFlow team is trying to pass this off as innovation. I was thinking about why some readers might find this interesting and upvote even without the context.
Is there a motherboard out there that would allow the usage of 8 GPUs with each of them allocated the full 16 lanes?
The best I've found so far would be 4 GPUs on one board, either with 16/8/8/8 or in a rare case 16/16/16/16 (but that requires 2 CPUs).
That's besides the physical space constraint, which again seems to limit you to 4 double-wide GPUs on one motherboard.