These tensor cores sound exotic:
"Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 multiply and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply,"
Curious to see how the ML groups and others take to this. Certainly ML and other GPGPU usage has helped Nvidia climb in value. I wonder if Nvidia saw the writing on the wall, so to speak, when Google released their specialty "Tensor" hardware, and decided to use the term in their branding as well.
"Tensor hardware" is a very vague term that's more marketing than an actual hardware type, I guarantee you that these are really SIMD or matrix units like the Google tpu that they just devised to call "Tensor", because, you know, it sells.
They're matrix units just like in the Google TPU but the TPU stands for "Tensor Processing Unit" so that's consistent. There's no reason to add special SIMD units when the entire core is already running in SIMT mode and by establishing a dataflow for NxNxN matrix multiplies you can reduce your register read bandwidth by a factor of N. Which isn't as huge for NVidia's N=4 as for Google's N=256 but is still a big deal, and diminishing returns might mean that NVidia is getting most of the possible benefit when stopping at 4 and preserving more flexibility for other workloads.
To me, a layman, that's what the matrix multiply stuff sounded like as well, given my understanding of SIMD and such -- especially when they made mention of BLAS. But I am no expert.
More great hardware stuck behind proprietary CUDA when OpenCL is the thing they should be helping with. Once again, proprietary lock-in that will result in inflexibility and digital blow-back in the long run. Yes, I understand OpenCL has some issues and CUDA tends to be a bit easier and less buggy, but that doesn't detract from the principle of my statement.
I am the author of DCompute, a compiler/library/runtime framework for abstracting OpenCL/CUDA for D. You can write kernels already, although the API automation is still a work in progress.
I'm hoping this will level the field a bit, because let's face it, people use CUDA for two reasons: the OpenCL driver API sucks, and the utility libraries (cuDNN et al.) for CUDA. Possibly driver quality as well.
By having an API that's not horrible to use, the first advantage is gone. The utility libraries will be more of a challenge to undermine, but since DCompute targets CUDA natively there is no disadvantage to users of Nvidia's hardware -- there's just no advantage to others yet (see GLAS[1] for what is possible with relative ease). Using D as the kernel language will also bring significant advantages over C/C++: static reflection, sane templates and compile-time code generation, to name a few.
Also, NVIDIA's CUDA compilers are built on clang, which does have an OpenCL frontend, so all they would need to do is put some resources into making that frontend work with their current nvcc toolchain.
Many have requested and want this, but instead they are trying hard to hold back OpenCL, just because providing OpenCL 2.0 support (and extensions for their GPUs' features) might help adoption of OpenCL, which in turn might end up helping other folks and companies too.
At the point where you suggested that x86 PCs are "open systems" (listing them next to "Linux" of all things!) I realized that you don't get it. We are where we are, with Intel ripping off consumers and companies alike, exactly because nobody realized that x86 is anything but open.
A similar mistake is about to happen, but luckily on the software side, where losses can be cut quickly and mistakes can be reversed more easily -- though many will suffer when they have to reimplement their precious library from the ground up because they did (or could) not take into account the fact that CUDA is as proprietary as it gets.
AMD has no CUDA compiler BTW. And CUDA is not a programming language FYI. ;)
Aside: I have no position on whether CUDA's Fortran and C++ dialects constitute their own languages, nor did I refer to CUDA as a programming language.
Sadly, that's a very problematic, borderline BS definition.
"A system that allows third parties to make products that plug into or interoperate with it. For example, the PC is an open system."
Intel allows some third parties to interoperate with their system (ref. Intel vs. NVIDIA etc.), and they pick and choose to their liking, killing some and promoting others, exactly because they control the openness of their systems.
> nor did I refer to CUDA as a programming language.
You did refer to a "CUDA compiler". My comment was admittedly a nitpick, but it was a serious point too. CUDA can be seen as C++ plus extensions -- something you can compile -- but it's also more than that (stuff that you can't compile): APIs, programming tools, etc., all strongly adapted to NVIDIA hardware.
That would be one way, but what's commendable is that they went further: HIP is actually also a common thin API on top of both CUDA and their own platform. They could've just stopped at converting code, but they did not -- and that's something that might save them and give people enough incentive to support their products. You can keep your NVIDIA path that'll be compiled with the nvcc backend and target both platforms with nearly the same code, especially on the host side (and often also on the device side).
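To make the "thin API" point concrete, here's a rough sketch (mine, simplified, error handling omitted). The kernel source is the same on both platforms, and the host calls map nearly 1:1 (cudaMalloc <-> hipMalloc, cudaMemcpy <-> hipMemcpy, the <<<...>>> launch <-> hipLaunchKernelGGL):

    #include <cuda_runtime.h>

    // CUDA version of a trivial saxpy; the HIP version differs only in the
    // runtime prefix (hipMalloc/hipMemcpy/hipFree) and the launch macro.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void run_saxpy(int n, float a, const float *x_host, float *y_host) {
        float *x_dev, *y_dev;
        cudaMalloc(&x_dev, n * sizeof(float));                 // hipMalloc
        cudaMalloc(&y_dev, n * sizeof(float));
        cudaMemcpy(x_dev, x_host, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(y_dev, y_host, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, a, x_dev, y_dev);   // hipLaunchKernelGGL(saxpy, ...)
        cudaMemcpy(y_host, y_dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(x_dev);                                       // hipFree
        cudaFree(y_dev);
    }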
I don't wish you the suffering vendor lock-in can cause after 10 years (hell, even less) of faithfully following the NVIDIA path, but... actually I do, because that's probably the best way to realize what's wrong with proprietary systems that pitch themselves as "de facto" standards.
Build your systems around GEMM/blas. Every vendor will give you a fast GEMM, and you'll be set for basically all the architectures that are coming out.
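To spell that out (my sketch, error checks omitted): you phrase the hot loop as C = alpha*A*B + beta*C and hand it to the vendor's BLAS, e.g. cuBLAS on NVIDIA. The call shape is the standard BLAS GEMM, so other vendors' libraries are close analogues:

    #include <cublas_v2.h>

    // Sketch: single-precision GEMM, C = alpha*A*B + beta*C.
    // Column-major layout; A is m x k, B is k x n, C is m x n.
    void gemm(cublasHandle_t handle, int m, int n, int k,
              const float *A, const float *B, float *C) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha,
                    A, m,      // lda
                    B, k,      // ldb
                    &beta,
                    C, m);     // ldc
    }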
Except that not all problems in computation are GEMMs. CNNs in machine learning certainly are, but many 'real' systems cannot be posed in such a manner.
In supercomputing this is the problem with using High Performance Linpack for benchmarks: it typically exceeds actual scientific codes by an order of magnitude in terms of floating point operations per second.
Yes but to the extent you can, it's an easy win. I switched to a GEMMable method for a preprocessing step today based on the Volta and recent TPU news.
Hopefully Tensorflow XLA or other optimization frameworks could solve this problem in a more general way in the medium term:
It's always been 10-20% slower than CUDA and frankly NVIDIA doesn't have an incentive to make it faster than that.
On the other hand, I believe Google is working on a CUDA compiler [1] so we may actually see meaningful improvement in the sense that it may become possible to run CUDA on other GPUs. (Edit: And Google actually has an incentive to achieve performance parity, so it might really happen.)
> On the other hand, I believe Google is working on a CUDA compiler [1]
Hi, I'm one of the developers of the open-source CUDA compiler.
It's not actually a separate compiler, despite what that paper says. It's just plain, vanilla, open-source clang. Download or build the latest version of clang, give it a CUDA file, and away you go. That's all there is to it.
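Roughly like this (per the LLVM "Compiling CUDA with clang" docs; the exact CUDA install path and --cuda-gpu-arch will vary with your setup):

    clang++ axpy.cu -o axpy \
        --cuda-gpu-arch=sm_35 \
        -L/usr/local/cuda/lib64 \
        -lcudart_static -ldl -lrt -pthread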
In terms of compiling CUDA for other GPUs, that's not something I've worked on, but judging from the commits going into clang and LLVM, other people are quite interested in making this work.
This is an untrue, yet often repeated, statement. For example, Hashcat migrated their CUDA code to OpenCL some time ago with zero performance hit. What is true is that Nvidia's OpenCL stack is less mature than CUDA. But you can write OpenCL code that performs just as well as CUDA.
A password cracking utility, and because it was put forth as at least one example of a real-world application purported to perform just as well under OpenCL as under CUDA. If true, it provides evidence against the claim "[OpenCL]'s always been 10-20% slower than CUDA".
10-20% slower seems an honest delta; I can't blame a company for working more on their own desires/ideas as long as they provide a standardized, non-crippled solution.
Nvidia will have to support the SPIR-V Vulkan environment, which is different from the OpenCL SPIR-V environment.
But Vulkan is a graphics API, not a compute API. Yes, in theory you can write compute shaders, but in my experience, if you have a compute workload, use a compute API; they're much better suited for the job.
I find it so cool that technology created to make games like Quake look pretty has ended up becoming a core foundation of high performance computing and AI.
Max Tegmark keeps saying interesting things lately. Controversial for sure, but interesting. His book 'Our Mathematical Universe' (related to the article you've linked) is thought-provoking, and I would label it a must read if you're interested in what's going on at the outer edges of fundamental science. The chapters are clearly separated into: fact, hypothesis, and far-out speculation, so there's no need to criticize the whole thing indiscriminately.
There was a series of attempts by Lee Smolin and others to come up with a theory of quantum gravity by assuming that the universe, at the bottom, is essentially simple and discrete (not in the fixed-grid sense, but in the sense of a discrete web of relations). That model also exhibits a remarkable similarity between the structure of the universe, and the structure of the neural networks that understand it.
The future of fundamental science is sure to be fascinating.
I think that's a dreadful article. We do know how neural networks work; they're a bunch of hierarchical probabilistic calculations that are pipelined. I don't really see how that couldn't work well; it's just hard to find the right probabilities. The difficulty is far more in the training than the working, and that's where the deep learning advances come in - inferring more parameters in a deeper hierarchy.
There's no relationship between a hierarchy of probabilistic estimations and a hierarchical decomposition of the cosmos. The cosmos forms an apparent hierarchy because of the rules that govern matter and the initial expansion of the universe. That a small number of parameters might be listed in describing both is neither here nor there. A small number of parameters describe the vectors in a font file. It doesn't follow that a typeface then has any relationship with my brain or the universe.
The article reads, to me, like this: neural networks are this cool hierarchy thing, the cosmos is this cool hierarchy thing, and both of these things have low Kolmogorov complexity, isn't it amazing that our brains are like this and can understand the universe, wow.
> a bunch of hierarchical probabilistic calculations that are pipelined
That's one way of describing quantum theory; generally "contextual" or "non-commuting" are used instead of "hierarchical".
If the universality of such a common framework doesn't seem profound to you, at least realise it isn't something generally appreciated and barely even hinted at just a few decades ago.
That reminded me that the word "robot" comes from "robota"/"rabota", which means "work"/"worker" in a lot of Slavic languages (my native language is Bulgarian).
From the dictionary:
robot, origin: from Czech, from robota ‘forced labor.’ The term was coined in K. Čapek's play R.U.R. ‘Rossum's Universal Robots’ (1920).
Yet another step in the progression in which mass-market GPU silicon kills traditional vector and memory-bandwidth-rich HPC/supercomputing hardware. Cray-on-a-chip.
Edit: traditional vector machines like the NEC SX still hold the programmability crown because you get a usable single system image, right?
Yep, hard to imagine though that the original creators of the Nvidia TNT or Voodoo had any idea that GPUs would become fully programmable computing hardware used for non-graphical applications.
Creators of Voodoo (3dfx = Gary Tarolli, Scott Sellers) came from the world of fully programmable GPUs. Silicon Graphics workstations had full T&L since ~1988 (http://www.sgistuff.net/hardware/systems/iris3000.html).
The whole point of Voodoo 1 was making it as simple and cheap as possible by removing all the advanced features and calculating geometry/lighting on the CPU.
Iris Graphics Geometry Engines weren't programmable in the modern sense. There was a fixed pipeline of matrix units, clippers and so on that fed the fixed-function Raster Engines. You could change various state that went into the various stages, but the pipeline's operations were fixed.
Later SGI Geometry Engines used custom, very specialized DSP-like processors, but the microcode for those was written by SGI and not end-user programmable.
There were probably research systems before it, but AFAIK the Geforce 3 was the first (highly limited) programmable geometry processor that was generally commercially available.
I don't think they'd have been super surprised. Just pleasantly surprised.
AI accelerators have been a thing for decades -- DSPs were used as neural network accelerators in the early 90s -- and the Cell processor was in the works by 2001.
GPUs just became vastly more accessible to general purpose programming in the last decade. People were doing it back in the 90s, but it was seriously hard.
We finally hit a tipping point where it's just kinda hard.
There were also the various custom "systolic array" processor designs in the 1980s (the ALVINN vehicle, and earlier projects which led to it, used these for early neural-network based self-driving experiments).
I remember back in 2004, when I heard a fellow grad student was working on using GPUs as a co-processor for scientific computing, I thought, "Wow, that's esoteric and niche."
This reminds me of a comment I read here ages ago about a scientist using the "processors" of the university's PostScript printers because they did the work much faster than their scientific workstations.
Reminds me of some Commodore 64 programs running code on the 1541 disk drive to offload computation from the main CPU (both the C64 and the 1541 had 6502s running at 1 MHz -- well, the C64 had a 6510, which had an I/O port). The original Apple LaserWriter had a 68k running at 12 MHz, while the Mac Plus, which came out almost a year later, had its 68k running at 8 MHz.
Wow, this is just Nvidia running laps around themselves at this point. Xeon Phi still not competitive, AMD focused on the consumer space, looks like the future of training hardware (and maybe even inferencing) belongs to Nvidia. (Disclosure: I am and have been long Nvidia since I found out cuDNN existed and how far ahead it was.)
>Xeon Phi still not competitive, AMD focused on the consumer space, looks like the future of training hardware (and maybe even inferencing) belongs to Nvidia.
Assuming there's a big future to training hardware and inferencing. Many of those "new paradigms" / "silver bullet technologies" have come and gone in the last decades.
That's true, but there is reason to believe this time is different™, with killer applications in medical image understanding, natural language understanding, and self driving cars, all of which could drive demand of these chips by themselves. It is possible we will discover new dominant architectures that don't use this hardware well but I am putting my money on us coming up with even more applications that do use this hardware well.
I agree... there's not much more they can do to scale since off die is still slow. Unless they stitch across the exposure boundary!
However, they have been at the reticle limit since they were in 28nm. GM200 (980 Ti and Titan X) was 601 mm^2 at TSMC... the maximum possible at the time.
Part of the chipmaking process is burning layers into wafers covered in photosensitive material. Photomasks/reticles used to cover entire wafers, making many units at once, but now the features are so small that the image has to be reduced (4-10x is typical), so they burn a couple of units, step over, and repeat across the same wafer. This GPU is so large they can only fit one of them in a single exposure step.
It's something along the lines of the film size for the super fancy camera they use in one of the steps. (The silicon wafer would be the equivalent of the entire roll of film.)
As I recall, Volta (with 3D memory) has been delayed multiple times due to supply, and this is only a very limited release of their highest-end hardware for deep learning, all pegged for a Q3/Q4 release -- a field where they don't really have any competition.
Can't imagine we will be seeing any Volta GeForce cards released till next year.
It's part of a step down voltage regulator called a buck converter. The buck converter works by putting a pulse of energy into the inductor and stretching it out to lower the voltage. This creates the core voltage.
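For a rough (textbook, not from the article) sense of the numbers: in steady state a buck converter's output is about the input scaled by the switching duty cycle, Vout ≈ D * Vin, so chopping a 12 V rail at roughly a 10% duty cycle and smoothing the pulses through the inductor and output caps gets you down into the ~1 V ballpark a GPU core runs at.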
I have a feeling eventually Nvidia will, like Intel, de-prioritize the consumer market in favor of the much more profitable server/machine learning market.
I've thought that, but the per-unit volume is huge. Every game console, phone, tablet, and PC needs a GPU. Even low-end devices are expected to run games. That's billions of units, albeit at lower margins.
Ironically, most of them actually use AMD's IP (the "Adreno" GPU, which is an anagram of "Radeon") that they sold off to Qualcomm in 2009. Which was yet another terrible call made by AMD management in that timeframe.
(although who knows if Adreno would have blown up in the same way if it had AMD mismanaging it)
Even more ironically, Adreno also used tile-based rendering that NVIDIA ended up adopting in the Maxwell architecture and AMD is adopting in the Vega architecture. It's a nice way to boost your power efficiency, which is critical to battery life in mobile devices.
Turns out since we're past Dennard scaling, packing more transistors on a chip now makes it hotter. So if you want it to go faster, you need to cut the power down in other ways. And thus, desktop GPUs are starting to look an awful lot like mobile GPUs...
(which is yet another reason why AMD's general-purpose compute-oriented GPU architectures are losing so badly in the desktop graphics market. RX 580 pulls twice the power of a GTX 1060 for the same performance...)
Many other aftermarket 580s are similar. For a sense of perspective here, that's roughly the same amount of power as some aftermarket 290Xs used. Or roughly 60 watts more than a GTX 1080. And that's GPU-only, not a total system load.
Polaris 10 is a reasonably efficient chip when you don't push it too hard. AMD - and their AIB partners - are pushing it way, way too hard in a desperate attempt to eke out a 2% win over the 1060. It isn't worth a 50% increase in TDP to get an extra 8% performance.
(and unlike the RX 480 - there is no reference RX 580 design, it's a whole bunch of these crazy juiced-up cards)
Reference card (1060) vs overclocked card (580). Looking at multiple review sites, including international ones like ComputerBase, PCGH.de etc., and comparing an overclocked 1060 vs an overclocked 480/580, the difference is ~50-60 watts.
Not good, but also not twice the power consumption...
I don't understand why AMD didn't use faster memory in the 580 like Nvidia did with the 1060 refresh. The 580 needs faster memory more than higher core clocks.
Let's hope AMD's return to tile-based rendering (used in Adreno), plus the other improvements, helps them get better at power consumption, just like Nvidia did with Maxwell. But I don't expect much from Vega after AMD's GPUs of the last 3 years. Navi looks more promising, as it is probably the first GPU to be fully designed under Raja Koduri.
And practically every game console, phone, and tablet, and the vast majority of PCs, are running integrated GPUs -- integrated GPUs that are not Nvidia's. Unless NV gets into the licensing market, the growth potential for them seems somewhat limited.
I mean, at what point will we come full circle and go back to a "mainframe" model where consumers don't really own/possess the computing power; rather, it's down in datacenters? Like, you play your game through a VM, basically, and your personal computer is just an AWS instance...
Not necessarily. So many of the improvements would anyway have a dual use, and it's not like their margins in the gaming/end user GPU business are razor thin. Moreover, the volume is probably immensely higher, so despite lower per-unit profit, they probably make it up in quantity. Won't be like this forever given the different speeds at which the 2 sectors grow, but it's gonna be some time before the roles are reversed.
Isn't that what this post was all about? Releasing brand new architecture on compute first seems to me pretty much like prioritizing compute market over consumers.
I was wondering if this will be used in supercomputers. Apparently yes:
> Summit is a supercomputer being developed by IBM for use at Oak Ridge National Laboratory.[1][2][3] The system will be powered by IBM's POWER9 CPUs and Nvidia Volta GPUs.
Supercomputers have very long planning and development cycles. So do GPUs and CPUs. The contract specified chips (Volta and POWER9) that didn't yet exist as much more than codenames on a roadmap.
The Titan Xp (with lowercase p, as opposed to the Titan XP) came out after the 1080 Ti so I'm sure GP took the latter into consideration before making a decision...
Interesting to note that Nvidia's stock rose about 18% (!) in a single day after this announcement (102.94 USD on May 9 to 121.29 USD on May 10). I expected the market to react, but this seems disproportionate.
My favorite outcome of Volta is that it's the first GPU they've produced that can actually claim this SIMT thing, thanks to its separate per-thread program counters (we had a spirited debate about whether just doing masking while presenting the SIMT programming model meant the chip was SIMT, or only that CUDA was while the GPUs weren't).
Does this architecture improve on 64-bit integer performance? Have any of the GPU manufacturers said anything about that? At some point it becomes a necessity for address calculations on large arrays.
"With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations"
How long until Tesla sues for trademark infringement? "from detecting lanes on the road to teaching autonomous cars to drive" makes it sound like there is an awful lot of overlap in product function.
I doubt anything like that would happen. While Tesla Motors was founded prior to the creation of the Tesla GPU architecture, there's not really any overlap -- in fact, I wouldn't be surprised if Tesla Motors were using something like this from NVidia:
As far as any overlap software-wise is concerned, while it isn't super clear what Tesla Motors is doing for their self-driving systems, based on what I've seen it seems like they are using only "basic" lane-detection and identification along with some other algorithmic vision-based systems. I'm not saying that's everything they are doing, just what I have seen released publicly on their vehicle platform.
NVidia, on the other hand, has been experimenting with using neural networks (deep learning CNNs specifically) to drive vehicles using only camera information:
This is actually a fun CNN to implement - I (and many others) implemented variations of it in the first term of Udacity's Self-Driving Car Engineer Nanodegree. We weren't told to do it this way, but I chose to after reviewing the various literature, plus it seemed like a challenge (and it was, for me). Udacity supplied a simulator:
...and we wrote code in Python (Tensorflow and Keras) to train and drive the virtual car. For my part, I had set up my home workstation with CUDA so that Tensorflow would utilize my GPU (a lowly GTX 750 Ti SC - though it seems like it might have similar GPU capability to NVidia's Drive-PX system, based on what I've researched; a Mini-ITX mobo, a PCI-E slot riser, and a GTX 750 would make a decent low-end deep-learning platform for self-driving vehicle experiments, and cost a fraction of what the Drive-PX sells for).