GPU folks: we need to talk about control flow

exDM69 · on Dec 21, 2016

Buggy shader compilers are no surprise to me but it's a serious issue that needs to be addressed by all GPU vendors. Vulkan and SPIR-V may help a little as the GLSL compiler frontend is moved out of the "driver" component and the resulting SPIR-V blob can be more easily manipulated.

There are some major issues at play:

1) Shipping an OpenGL-based product on multiple vendors' devices (especially on mobile) requires a huge amount of QA effort and per-device bug workarounds. Pretty much everyone makes their games on top of Unity/Unreal/etc because the engine vendors do most of the QA work.

2) Issue #1 is made worse by the fact that many consumer devices do not get GPU driver updates because the phone/tablet OEM end-of-lifes their products too soon and/or the mobile operators and other middlemen don't ship updates as they should. This means that if you ship a GPU-using product, you need to support devices with old, buggy drivers and this is expensive.

3) There's a huge untapped potential in GPUs, there are very few non-graphics apps taking advantage of the processing power but I can't see it changing for the better before GPUs become easier to target and verify correctness of operation.

It would be too easy to blame GPU vendors' software engineering practices, but I (as a GPU driver programmer) see that the bigger problem is issue #2 (not that the GPU drivers are faultless). Even if the drivers do get fixed and updated, getting the updates to the hands of the customers is still going to be an issue. The middlemen need to be cut out of the equation, we can't be dependent on the business requirements of OEMs and operators when shipping mission critical software infrastructure.

This is a bit of a chicken and egg problem, games are not important enough to OEMs and operators fixing their update delivery mechanisms, but no-one dares to use GPUs for anything more important before this issue gets sorted out.

DannyBee · on Dec 21, 2016

You missed one of the most major issues:

Most of the vendors know they have serious issues here, but hide them. Literally. You never notice because they don't use open source compilers (even for things like CUDA, let alone shader compilers), so it's not obviously more than "just a bug" until it happens to you continuously. Instead, they pretty much never have to fix the bug until someone notices, and then they hack it some more and move on, instead of fixing underlying issues in their structurization/etc passes.

Most vendors i've talked to can't even tell me what control flow breaks their compiler (again, doesn't matter if we are talking shaders, cuda, you name it. it's all broke), they know plenty does, but are fairly ¯\_(ツ)_/¯ about doing more than working around whatever bug they get given.

Meanwhile, over in open source clang/llvm world, we can basically fuzz test CFG's, etc for CUDA.

The death of some of these compilers can't come fast enough.

mrb · on Dec 21, 2016

"You never notice because they don't use open source compilers (even for things like CUDA, let alone shader compilers)"

Actually AMD recently completely open sourced their entire GPGPU compute stack. See https://github.com/RadeonOpenCompute

Compiler, assembler (LLVM based), linker, driver, etc. Everything is open to my knowledge.

DannyBee · on Dec 30, 2016

Actually, we wrote the CUDA support in clang :)

and yes, AMD has done this because they have nothing to lose anymore :)

akiselev · on Dec 21, 2016

Even worse some of the mobile GPU vendors, coming from the hardware/silicon industry, sometimes provide fixes to their bugs only when a customer asks for it, leaving the fix out of the master repository for fear of providing evidence to potential litigators and IP to competitors.

Athas · on Dec 21, 2016

In my experience, the situation is a little better for AMD and NVIDIA GPUs - likely because they have already seen (relatively) heavy use for OpenCL and CUDA. The drivers are still pretty bad altogether, with some of intermittent faults and weird quirks, but compiler bugs are infrequent.

It is however my general impression that the graphics parts of GPU drivers are generally more buggy than the compute parts. Likely because the graphics stack is older and more complicated. Is this a correct assessment?

wolfgke · on Dec 21, 2016

> It is however my general impression that the graphics parts of GPU drivers are generally more buggy than the compute parts.

I'm not a graphics driver developer, but I know that game engines for modern games often are actually quite buggy and often the graphics driver stack tries to work around engine bugs. It's also well-known that for AAA titles the graphics driver internally even often replaces the game's shaders completely by hand-optimized ones made by the GPU vendor. I have read from an insider that in the last years NVIdia's graphics drivers often did not validate arguments (for a little bit more performance), while the AMD's did. This was as I heard the reason why one of the last Tomb Raider game had initially problems on AMD GPUs.

Also for OpenGL drivers you have to realize that in the past Intel drivers did not really support modern OpenGL versions. So at least for Indie developers the safest way was to target OpenGL 2.1 + extensions. Though OpenGL >= 3.1 should better be initialized using a Core context many developers still use(d) a Compatibility context - this adds a lot of legacy stuff that one has to stay compatible with and thus makes the whole OpenGL implementation of Compatibility contexts very error-prone (in the driver). Luckily we now have Vulkan (though writing your engine against Vulkan is at least at the beginning more complicated (time to first triangle) than targeting OpenGL or DirectX 11.x (not DirectX 12 - DX 12 is a lot more like Vulkan)).

ploxiln · on Dec 21, 2016

In the open-source-everything world, it's a lot more common to push for and eventually make a fix where the bug is, and then tell people to update or patch. (Though it still doesn't always happen.)

In the closed-source and hardware worlds, it's a lot more common to implement a work-around wherever you are. (Alternatively, the entity paying money pressures the entity being paid to put in a workaround on their side.) Soon there are a bunch of workarounds in common use which in turn need to be worked-around. This is the world of commercial games and graphics drivers.

I used to work on a WiFi access point product, though I only touched the actual driver a few times (other team members focused there), and there was a lot of working around bugs, and working around work-arounds in other WiFi devices for other bugs in WiFi APs, in that world too. Some crazy stuff ...

azernik · on Dec 21, 2016

Also because the graphics parts have to implement more logic logic; generally computing users of GPUs want access to arithmetic primitives, not complicated graphics-relevant functions.

daveguy · on Dec 21, 2016

With the #2 issue -- it would be nice if the drivers were distributed as a drivers only option rather than a mega-advertising option. It looks like AMD does this now with a 55MB version, but Nvidia still only has the ridiculous "experience" application in place of a simple driver package (300+MB).

exDM69 · on Dec 21, 2016

I understand your opinion as a consumer but it's not a real problem (minor annoyance at most) compared to millions of consumer devices with outdated drivers (with possible security problems).

The "experience" thing is optional afaik.

matt4077 · on Dec 21, 2016

Let's meet in the middle and do it like Apple?

Also, as a consumer the "experience" seems to matter a lot more to me than ... whatever the other side of this argument is ... the hardship for programmers?

And since you're bringing up security: if there are security holes in GPU driver installations, I'd bet eliminating all the crapware would do a lot more for security than any number of updates could ever do.

pcwalton · on Dec 21, 2016

> Let's meet in the middle and do it like Apple?

Apple's OpenGL drivers (on desktop at least) are the worst in the industry, and that is not an exaggeration. I have found so many cases in which you can make shaders read random memory ([1] for example) that I'm no longer surprised.

[1]: https://github.com/servo/webrender/commit/96ac2f13b9b73dcfa0...

exDM69 · on Dec 21, 2016

Yes, some kind of meet in the middle solution would be nice. For example, Microsoft's official updater keeping the drivers up to date. And by up to date I mean the latest from the GPU vendor and not something a bit older but certified by MS.

Getting rid of 3rd party installers and updaters should be a priority of Microsoft's anyway.

> if there are security holes in GPU driver installations, I'd bet eliminating all the crapware would do a lot more for security than any number of updates could ever do.

The "crapware" is just some unprivileged user space processes. The driver has kernel mode components, and security issues there could cause crashes, snooping memory of other processes or even kernel space remote code execution.

Not that I'm defending bundled (or optional) crapware, but I don't think that's a security problem.

gsnedders · on Dec 21, 2016

The worst place for security bugs in drivers is mobile, for the normal mobile reason (try getting security updates for any OS component!), and there's no crapware there.

benjaminPKP · on Dec 21, 2016

Just took an applied parallel programming class at a top 5 engineering school. The untapped potential for GPUs really is crazy. If GPUs could be used for common tasks able to be parallelized, then less work for the CPU and way quicker computations. I mean we made a convolutional neural network for handwriting recognition which took ~30 minutes to run on a CPU, ran it parallel in 400 milliseconds on GPUs. Once CPUs start leveling out due to the transistor-heat issue, I think GPUs might start taking on a bigger role, they simply need more support. So many people are all about the buzzwords in the tech world that no one wants to turn the attention something like parallelization, but hopefully someday soon

p1esk · on Dec 21, 2016

Going from 30 min on CPU to 0.4 sec on GPU can mean only one thing - you had a really crappy CPU code.

A well written C code should run about 20-50 times slower than a well written CUDA code on the latest GPUs, and that's for a single CPU core. This was true with my hand-written implementations, as well as when using libraries (Numpy vs CuDNN).

brigade · on Dec 21, 2016

Going further, power for power on the current generation, for SGEMM, i'd expect GPUs to beat CPUs by maybe 5x (6700k has a theoretical 512 gflops at 91W, while xeons can come a bit shy of 1.5 tflops at 145w)

Which is still pretty major, but nowhere near the wild claims that used to be common in academic papers. It's outdated, but [1] is still pretty good reading.

[1] http://sbel.wisc.edu/Courses/ME964/Literature/LeeDebunkGPU20...

benjaminPKP · on Dec 21, 2016

We implemented the LeNet-5 convolutional neural network, and yes when run sequentially it really does take that long. The sequential code that took that long was also written by professors who worked at NVIDIA, it was written as well as it could be sequentially. The project was a competition to see how fast we could have it go in parallel so we used a ton of parallelization techniques and tricks like streaming to make it go at .4 sec. Think again before you call someones code "crappy"

p1esk · on Dec 21, 2016

I just did a quick test for Lenet-5, using Theano + CuDNN 5, Xeon E5-1620v2 3.7GHz, Maxwell Titan X GPU:

Two conv layers (6 and 12 feature maps, 5x5 filters), one fully connected layer (120 neurons), activation=Tanh, pooltype: average (excluding padding), cost=Negative Log Likelihood.

Learning rate=0.20, minibatch size=100, dropout = 0.0, L2 lambda=0.0, momentum=0.0, initialization: normal

Training for 10 epochs:

CPU: 172 sec, GPU: 14 sec.

When decreasing batch size to 20 images, the numbers are: CPU: 318 sec, GPU: 38 sec.

So yeah, the CPU code your professors wrote was really crappy.

nkurz · on Dec 21, 2016

I'll add my vote to the side saying "poorly optimized CPU code". This doesn't mean that the code is "crappy", but if you are more than 1000x faster on the GPU than CPU, there is almost certainly the potential for improved performance on the CPU. Optimization is hard, and depending on their area of focus, professors who worked at NVIDIA may not be in the best position to get top performance out of modern Intel CPU's.

I'd be betting that someone with low level optimization expertise (comparable to what you appear to have done with the GPU) could get at least another 10x out of the CPU version. You are completely right that GPU's have the potential for great increases over the current normal, but there's also (typically) room for large improvement on modern CPU's as well.

benjaminPKP · on Dec 21, 2016

It's actually about 75x faster, not 1000x. I mean there's a difference between parallelizing basic code vs something like the LeNet-5. The sequential code was pretty standard. I mean I'm sure the CPU code could've been optimized more, I'm not trying to argue, I just thought it was interesting to see how fast you can make some CUDA code go

p1esk · on Dec 21, 2016

Wait, did you make a typo, and it was 30 seconds for the CPU run time, not 30 minutes? If so, then yes, that's more realistic.

benjaminPKP · on Dec 21, 2016

Oh wow. Now that's embarrassing. My bad for stirring things up with that

Asooka · on Dec 21, 2016

You can actually get some gain in CPU performance by writing in OpenCL. That's because OpenCL code is meant to be easily parallelisable and consumed by wide SIMD units, so Intel's compiler can do a lot more autovectorising than for C code.

CyberDildonics · on Dec 21, 2016

> Think again before you call someones code "crappy"

Lets see it then. Most cpu programs have enormous amounts of fat left in them from cache misses and poor or no use of SIMD.

brigade · on Dec 21, 2016

I'll second p1esk calling the CPU version barely optimized based solely on the numbers you're providing. Unless you're comparing a D410 against a Titan X?

benjaminPKP · on Dec 21, 2016

It was run on a tesla k80 cluster, so pretty high end. But the reason we were able to get such acceleration is because we wrote the the low-end CUDA code, using as much register memory and shared memory as possible, while using good streaming techniques (which really accelerates it), as well as matrix multiplication techniques faster than SGEMM. The batch size was also huge that the code was run on. Since we wrote the low-end CUDA code we were also able to prevent control-divergence as much as possible by using knowledge on warps and how dram bursts occur. We weren't using any API like openACC or anything for help, we wrote CUDA code with some pretty good optimizations. The numbers are real, and a lot of the optimizations come from understanding how the nested loops in the LeNet-5 CNN work together. All I'm saying is this shows how well GPUs can speed things up when writing efficient GPU code.

viewer5 · on Dec 21, 2016

What causes #1? Why's OpenGL so troublesome and weird for each device?

pjmlp · on Dec 21, 2016

Easy just think how browsers render HTML/CSS/JavaScript, C compilers specific interpretation of ANSI C, UNIX variations from what POSIX leaves to each implementation, and many other implementation vs standards variations.

Just like them OpenGL is a text standard describing how a 3D API is supposed to behave.

It happens that what those papers state and what each team of developers at every card manufacturer understand is not always the same.

Then there is the set of card specific extensions, for every nice feature they want to sell on their graphics card, but yet to be adopted by OpenGL paper standard.

There are of course certification tests available, but they are costly and don't cover 100% of the API usage anyway.

So programming OpenGL, happens to be like trying to write web applications, while trying to make the code portable and bug free (with workarounds) across all graphic cards out there.

p0nce · on Dec 21, 2016

> So programming OpenGL, happens to be like trying to write web applications, while trying to make the code portable and bug free (with workarounds) across all graphic cards out there.

And it's much harder to test because you have to find the outdated graphic card, with the outdated driver, and have a machine where you can plug the offending card.

glaze · on Dec 21, 2016

One reason is that shaders are stored as text and require each driver vendor to write parsers for them and not everyone implements them in the same way.

exDM69 · on Dec 21, 2016

Further, the drivers have to be more permissive than necessary. It's not an option for a GPU vendor to ship a driver update that would break Doom/Quake/Dota/etc.

Ideally games would ship only shaders that are strictly valid GLSL but that's not quite the case.

If you're developing GL, do everyone a favor and start using the official GLSL validator in your build scripts.

discreditable · on Dec 21, 2016

Compounding issues #1 and #2 you mention is that when the HPs, Lenovos, et al do ship updates they're usually a few months behind current at best. I've seen this for Intel, AMD, and NVIDIA. What's worse is if you run the "generic" Intel GPU driver installer, if it detects an OEM's driver on the system it will refuse to install.

bhouston · on Dec 21, 2016

Found a bug on Nexus 5 that was never fixed - struct constructors in WebGL glsl were completely broken:

https://github.com/mrdoob/three.js/pull/7556

Also some other code that seemed completely valid failed on Nexus 5 devices - so we had to revert it:

https://github.com/mrdoob/three.js/pull/9948

Basically mobile GPUs are a mixed bag of bugs all over the place. Three.JS is an attempt to find common path through those bugs in order to achieve reproducible results.

If you want to try your hand at this, there is still this outstanding issue that we haven't yet tracked down on the Nexus 5 devices:

https://github.com/mrdoob/three.js/issues/9515

mattst88 · on Dec 21, 2016

I see "Next stop: Intel."

I sincerely hope that you test our Linux drivers (I work for Intel on them). We'd be excited to learn about anything you find. Please feel free to file bugs here (https://bugs.freedesktop.org/enter_bug.cgi?product=Mesa&comp...) and ping us on #intel-gfx on Freenode.

wolfgke · on Dec 21, 2016

> I sincerely hope that you test our Linux drivers

I sincerely hope that they test out both Linux and Windows drivers. It is well-known that these drivers at least in the past were developed by two completely different teams, so the results might be quite different and thus interesting.

bhouston · on Dec 21, 2016

There was a ThreeJS bug that differed based on whether it was a Linux or a Windows Intel driver -- specifically this one:

https://github.com/mrdoob/three.js/issues/10331

Linux Intel didn't reproduce, but Windows Intel did. Windows Intel acted like Adreno GPUs but Linux Intel behaved like NVIDIA GPUs. Fun times.

Sarkie · on Dec 21, 2016

This is from 2013 but still a fun read.

https://dolphin-emu.org/blog/2013/09/26/dolphin-emulator-and...

jsheard · on Dec 21, 2016

Also this, from the guy who ported Valves Source engine to OpenGL: http://richg42.blogspot.co.uk/2014/05/the-truth-on-opengl-dr...

The vendors are Nvidia, AMD, Intel/Linux and Intel/Windows respectively. Yes, Intel maintains two completely separate OpenGL implementations.

_pdp_ · on Dec 21, 2016

Depending on the rendered garbage this could have security implications!

It is pretty easy to capture the image from a canvas. The rendered garbage could reveal buffered adjacent/random memory (requires research). The memory can be extracted from the image with the right interpretation.

greggman · on Dec 21, 2016

You can be proactive in getting these kinds of issues fixed. Make a small repo and submit it to the webgl conformance tests.

https://github.com/KhronosGroup/WebGL

There are already something like 2500 tests

raverbashing · on Dec 21, 2016

Mobile GPU vendors seems to be firmly in the "make some games and common apps work and to heck with the specs and all the rest"

AMD and nVidia seem to have this better worked out

willvarfar · on Dec 21, 2016

This is a series of articles (previous entries have been on HN in recent weeks) that finds bugs in other vendors too.

AMD, for example: https://medium.com/@afd_icl/first-stop-amd-bluescreen-via-we...

They are doing the suppliers alphabetically, so they'll get to NVIDIA in good time :)

What mades you think that driver stability was a mobile problem? :D

raverbashing · on Dec 21, 2016

> What mades you think that driver stability was a mobile problem? :D

Not an exclusive problem, but it seems to be worse on those platforms :)

m_mueller · on Dec 21, 2016

They'll get there as well when these GPUs start being used as GPGPU coprocessors. Apple's privacy stance will probably push this usecase soon.

vvanders · on Dec 21, 2016

This shouldn't be a surprise to anyone who's worked on mobile GPUs, no one uses conditionals in shader code since the performance hit is gnarly(and used to be even worse before branching support). It's a fairly uncommon code path.

As far as these go they aren't that bad, I've seen a highly regarded vendor's shader compiler die and bring down the whole Android stack which resulted in insta-reboot.

If it isn't in the rendering path for Android(HWUI) or Chrome then it's best to tread carefully on mobile GPUs.

rosstex · on Dec 21, 2016

Could someone briefly explain why seemingly innocuous changes to the code can have such large effects on the resulting images?

jlebar · on Dec 21, 2016

Note that GPGPU compilers have the same sorts of problems; see e.g. https://llvm.org/bugs/show_bug.cgi?id=27738.

ingenter · on Dec 21, 2016

Would it be better if GPU drivers could compile open-spec bytecode and upload the result to the GPU to do all of the computation? This way OpenGL may be used as a library, shipped with the application.

wolfgke · on Dec 21, 2016

Vulkan does - you can upload shaders in SPIR-V (open-spec bytecode) to the GPU.

Athas · on Dec 21, 2016

Sadly, the translation from SPIR-V to system-dependent machine code can still be buggy. Although hopefully most of the optimisation will take place at the SPIR-V level, which, as I understand it, is pretty similar to LLVM. That should enable reuse of thoroughly debugged code, instead of each vendor maintaining their own full compiler.

wolfgke · on Dec 21, 2016

> Sadly, the translation from SPIR-V to system-dependent machine code can still be buggy.

This is true, but it eleminates at least one one the points where things can go wrong. Additionally this approach has the advantage that developers (with some practice) can read the SPIR-V "assembly" code to make sure it is correct. With existing solutions it was already hard to get and interprete the intermediate code to find out whether the problem is in the frontend or backend.

willvarfar · on Dec 21, 2016

Excellent article, but not the article I was expecting from the title ;)

From the title, I thought this would be a post about warps/wavefront divergence :)

bhouston · on Dec 21, 2016

Me too. I was hoping for better coordination of warps.

jwatte · on Dec 21, 2016

To be fair, the reason gpus can be fast on gpu workloads is that the hardware makes totally different assumptions about typical control flow across processed pixel tiles. You can't run Call of Duty or Tensorflow at the speeds we see today on x86 style control flow. (Knights Landing proved that!)

Bugs are annoying, but so are mismatched programmer expectations.

14113 · on Dec 21, 2016

There's a difference between speed and correctness. If you promise something in the spec (e.g. unreachable control flow not mattering), in practice, that promise should be upheld. If it's upheld with a 32x slowdown (e.g. warp divergence), then so be it, but the code should still run correctly according to the spec.

nix0n · on Dec 21, 2016

GPUs do not actually have control flow.

They are SIMD vector processors (Same Instructions Multiple Data).

CUDA and OpenCL make this a bit more explicit.

wolfgke · on Dec 21, 2016

> GPUs do not actually have control flow.

Wrong. Let's look at the instriction set of the AMD GCN3 ISA:

> http://gpuopen.com/compute-product/amd-gcn3-isa-architecture...

which links to

> http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-c...

which even has a whole chapter about control flow: "Chapter 4 Program Control Flow".

andars · on Dec 21, 2016

> Wrong.

I'd say it's more like "not quite". ~10 years ago, the statement was pretty much correct. GPUs were entirely SIMD, so they couldn't truly branch, but could fake it with predication. Longer branches would involve executing both branches on every thread and only committing the results of the active branch.

Modern GPUs can do much better, since individual warps/wavefronts can truly diverge. Within warps, however, it's still a bit of a mess.

wolfgke · on Dec 21, 2016

> ~10 years ago, the statement was pretty much correct.

If one looks at the capabilities of CPUs from 10 years ago, one can easily justify similar completely outdated statements for CPUs.

uiri · on Dec 21, 2016

While this is correct, it isn't the whole story.

GPUs can implement control flow by calculating both results and then throwing away one of them based on a predicate. Also, GPUs typically do have a scalar unit which can perform branching but this is heavily discourage by CUDA and OpenCL due to the performance implications.

emcq · on Dec 21, 2016

Current GPUs are not SIMD. They are classic examples of MIMD (multiple instruction multiple data), with multiple concurrent processing units. SIMD is common in vector processing units, but GPUs do more than that at this point.

https://en.m.wikipedia.org/wiki/MIMD