These tensor cores sound exotic:
"Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 multiply and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply,"
Curious to see how the ML groups and others take to this. Certainly ML and other GPGPU usage has helped Nvidia climb in value. I wonder if Nvidia saw the writing on the wall, so to speak, when Google released their specialty "Tensor" hardware, and decided to use the term in their branding as well.
"Tensor hardware" is a very vague term that's more marketing than an actual hardware type, I guarantee you that these are really SIMD or matrix units like the Google tpu that they just devised to call "Tensor", because, you know, it sells.
They're matrix units just like in the Google TPU but the TPU stands for "Tensor Processing Unit" so that's consistent. There's no reason to add special SIMD units when the entire core is already running in SIMT mode and by establishing a dataflow for NxNxN matrix multiplies you can reduce your register read bandwidth by a factor of N. Which isn't as huge for NVidia's N=4 as for Google's N=256 but is still a big deal, and diminishing returns might mean that NVidia is getting most of the possible benefit when stopping at 4 and preserving more flexibility for other workloads.
To me, a layman, that's what the matrix multiply stuff sounded like as well, given my understanding of SIMD and such -- especially when they made mention of BLAS. But I am no expert.
More great hardware stuck behind proprietary CUDA when OpenCL is the thing they should be helping with. Once again, proprietary lock-in that will result in inflexibility and digital blow-back in the long run. Yes, I understand OpenCL has some issues and CUDA tends to be a bit easier and less buggy, but that doesn't detract from the principle of my statement.
I am the author of DCompute, a compiler/library/runtime framework for abstracting OpenCL/CUDA for D. You can write kernels already, although the API automation is still a work in progress.
I'm hoping this will level the field a bit, because let's face it, people use CUDA for two reasons: the OpenCL driver API sucks, and the utility libraries (cuDNN et al.) for CUDA. Possibly driver quality as well.
By having an API that's not horrible to use, the first advantage is gone. The utility libraries will be more of a challenge to undermine, but since DCompute targets CUDA natively there is no disadvantage to users of Nvidia's hardware -- there's just no advantage to others yet (see GLAS[1] for what is possible with relative ease). Using D as the kernel language will also bring significant advantages over C/C++: static reflection, sane templates and compile-time code generation, to name a few.
Also, NVIDIA's CUDA compilers are built on clang, which does have an OpenCL frontend, so all they would need to do is put some resources into making that frontend work with their current nvcc toolchain.
Many have requested and want this, but instead they are trying hard to hold back OpenCL, just because providing OpenCL 2.0 support (and extensions for their GPUs' features) might help adoption of OpenCL, which in turn might end up helping other folks and companies too.
At the point where you suggested that x86 PCs are "open systems" (listing them next to "Linux" of all things!) I realized that you don't get it. We are where we are, with Intel ripping off consumers and companies alike, exactly because nobody realized that x86 is anything but open.
A similar mistake is about to happen, but luckily on the software side, where losses can be cut quickly and mistakes can be reversed more easily -- though many will suffer when they have to reimplement their precious library from the ground up because they did (or could) not take into account the fact that CUDA is as proprietary as it gets.
AMD has no CUDA compiler BTW. And CUDA is not a programming language FYI. ;)
Aside: I have no position on whether CUDA's Fortran and C++ dialects constitute their own languages, nor did I refer to CUDA as a programming language.
Sadly, that's a very problematic, borderline BS definition.
"A system that allows third parties to make products that plug into or interoperate with it. For example, the PC is an open system."
Intel allows some third parties to interoperate with their system (ref. Intel vs. NVIDIA etc.), and they pick and choose to their liking, killing some and promoting others, exactly because they control the openness of their systems.
> nor did I refer to CUDA as a programming language.
You did refer to a "CUDA compiler". My comment was admittedly a nitpick, but it was a serious point too. CUDA can be seen as C++ plus extensions -- something you can compile -- but it's also more than that (stuff that you can't compile): APIs, programming tools, etc., all strongly adapted to NVIDIA hardware.
That would be one way, but what's commendable is that they went further: HIP is actually also a common thin API on top of both CUDA and their own platform. They could've just stopped at converting code, but they did not -- and that's something that might save them and give people enough incentive to support their products. You can keep your NVIDIA path that'll be compiled with the nvcc backend and target both platforms with nearly the same code, especially on the host side (and often also on the device side).
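To make the "thin API" point concrete, here's a rough sketch (mine, simplified, error handling omitted). The kernel source is the same on both platforms, and the host calls map nearly 1:1 (cudaMalloc <-> hipMalloc, cudaMemcpy <-> hipMemcpy, the <<<...>>> launch <-> hipLaunchKernelGGL):

    #include <cuda_runtime.h>

    // CUDA version of a trivial saxpy; the HIP version differs only in the
    // runtime prefix (hipMalloc/hipMemcpy/hipFree) and the launch macro.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void run_saxpy(int n, float a, const float *x_host, float *y_host) {
        float *x_dev, *y_dev;
        cudaMalloc(&x_dev, n * sizeof(float));                 // hipMalloc
        cudaMalloc(&y_dev, n * sizeof(float));
        cudaMemcpy(x_dev, x_host, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(y_dev, y_host, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, a, x_dev, y_dev);   // hipLaunchKernelGGL(saxpy, ...)
        cudaMemcpy(y_host, y_dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(x_dev);                                       // hipFree
        cudaFree(y_dev);
    }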
I don't wish you the suffering vendor lock-in can cause after 10 years (hell, even less) of faithfully following the NVIDIA path, but... actually I do, because that's probably the best way to realize what's wrong with proprietary systems that pitch themselves as "de facto" standards.
Build your systems around GEMM/blas. Every vendor will give you a fast GEMM, and you'll be set for basically all the architectures that are coming out.
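To spell that out (my sketch, error checks omitted): you phrase the hot loop as C = alpha*A*B + beta*C and hand it to the vendor's BLAS, e.g. cuBLAS on NVIDIA. The call shape is the standard BLAS GEMM, so other vendors' libraries are close analogues:

    #include <cublas_v2.h>

    // Sketch: single-precision GEMM, C = alpha*A*B + beta*C.
    // Column-major layout; A is m x k, B is k x n, C is m x n.
    void gemm(cublasHandle_t handle, int m, int n, int k,
              const float *A, const float *B, float *C) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha,
                    A, m,      // lda
                    B, k,      // ldb
                    &beta,
                    C, m);     // ldc
    }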
Except that not all problems in computation are GEMMs. CNNs in machine learning certainly are, but many 'real' systems cannot be posed in such a manner.
In supercomputing this is the problem with using High Performance Linpack for benchmarks: it typically exceeds actual scientific codes by an order of magnitude in terms of floating point operations per second.
Yes but to the extent you can, it's an easy win. I switched to a GEMMable method for a preprocessing step today based on the Volta and recent TPU news.
Hopefully Tensorflow XLA or other optimization frameworks could solve this problem in a more general way in the medium term:
It's always been 10-20% slower than CUDA and frankly NVIDIA doesn't have an incentive to make it faster than that.
On the other hand, I believe Google is working on a CUDA compiler [1] so we may actually see meaningful improvement in the sense that it may become possible to run CUDA on other GPUs. (Edit: And Google actually has an incentive to achieve performance parity, so it might really happen.)
> On the other hand, I believe Google is working on a CUDA compiler [1]
Hi, I'm one of the developers of the open-source CUDA compiler.
It's not actually a separate compiler, despite what that paper says. It's just plain, vanilla, open-source clang. Download or build the latest version of clang, give it a CUDA file, and away you go. That's all there is to it.
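Roughly like this (per the LLVM "Compiling CUDA with clang" docs; the exact CUDA install path and --cuda-gpu-arch will vary with your setup):

    clang++ axpy.cu -o axpy \
        --cuda-gpu-arch=sm_35 \
        -L/usr/local/cuda/lib64 \
        -lcudart_static -ldl -lrt -pthread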
In terms of compiling CUDA for other GPUs, that's not something I've worked on, but judging from the commits going into clang and LLVM, other people are quite interested in making this work.
This is an untrue, yet often repeated, statement. For example, Hashcat migrated their CUDA code to OpenCL some time ago with zero performance hit. What is true is that Nvidia's OpenCL stack is less mature than CUDA. But you can write OpenCL code that performs just as well as CUDA.
A password cracking utility, and because it was put forth as at least one example of a real-world application purported to perform just as well under OpenCL as under CUDA. If true, it provides evidence against the claim "[OpenCL]'s always been 10-20% slower than CUDA".
10-20% slower seems an honest delta; I can't blame a company for working more on their own desires/ideas as long as they provide a standardized, non-crippled solution.
Nvidia will have to support the SPIR-V Vulkan environment, which is different from the OpenCL SPIR-V environment.
But Vulkan is a graphics API, not a compute API. Yes, in theory you can write compute shaders, but in my experience, if you have a compute workload, use a compute API; they're much better suited for the job.
I find it so cool that technology created to make games like Quake look pretty has ended up becoming a core foundation of high performance computing and AI.
Max Tegmark keeps saying interesting things lately. Controversial for sure, but interesting. His book 'Our Mathematical Universe' (related to the article you've linked) is thought-provoking, and I would label it a must read if you're interested in what's going on at the outer edges of fundamental science. The chapters are clearly separated into: fact, hypothesis, and far-out speculation, so there's no need to criticize the whole thing indiscriminately.
There was a series of attempts by Lee Smolin and others to come up with a theory of quantum gravity by assuming that the universe, at the bottom, is essentially simple and discrete (not in the fixed-grid sense, but in the sense of a discrete web of relations). That model also exhibits a remarkable similarity between the structure of the universe, and the structure of the neural networks that understand it.
The future of fundamental science is sure to be fascinating.
I think that's a dreadful article. We do know how neural networks work; they're a bunch of hierarchical probabilistic calculations that are pipelined. I don't really see how that couldn't work well; it's just hard to find the right probabilities. The difficulty is far more in the training than the working, and that's where the deep learning advances come in - inferring more parameters in a deeper hierarchy.
There's no relationship between a hierarchy of probabilistic estimations and a hierarchical decomposition of the cosmos. The cosmos forms an apparent hierarchy because of the rules that govern matter and the initial expansion of the universe. That a small number of parameters might be listed in describing both is neither here nor there. A small number of parameters describe the vectors in a font file. It doesn't follow that a typeface then has any relationship with my brain or the universe.
The article reads, to me, like this: neural networks are this cool hierarchy thing, the cosmos is this cool hierarchy thing, and both of these things have low Kolmogorov complexity, isn't it amazing that our brains are like this and can understand the universe, wow.
> a bunch of hierarchical probabilistic calculations that are pipelined
That's one way of describing quantum theory; generally "contextual" or "non-commuting" are used instead of "hierarchical".
If the universality of such a common framework doesn't seem profound to you, at least realise it isn't something generally appreciated and barely even hinted at just a few decades ago.
That reminded me that the word "robot" comes from "robota"/"rabota", which means "work"/"worker" in a lot of Slavic languages (my native language is Bulgarian).
From the dictionary:
robot, origin: from Czech, from robota ‘forced labor.’ The term was coined in K. Čapek's play R.U.R. ‘Rossum's Universal Robots’ (1920).
Yet another step in the progression in which mass-market GPU silicon kills traditional vector and memory-bandwidth-rich HPC/supercomputing hardware. Cray-on-a-chip.
Edit: traditional vector machines like the NEC SX still hold the programmability crown because you get a usable single system image, right?
Yep, hard to imagine though that the original creators of the Nvidia TNT or Voodoo had any idea that GPUs would become fully programmable computing hardware used for non-graphical applications.
Creators of Voodoo (3dfx = Gary Tarolli, Scott Sellers) came from the world of fully programmable GPUs. Silicon Graphics workstations had full T&L since ~1988 (http://www.sgistuff.net/hardware/systems/iris3000.html).
The whole point of Voodoo 1 was making it as simple and cheap as possible by removing all the advanced features and calculating geometry/lighting on the CPU.
Iris Graphics Geometry Engines weren't programmable in the modern sense. There was a fixed pipeline of matrix units, clippers and so on that fed the fixed-function Raster Engines. You could change various state that went into the various stages, but the pipeline's operations were fixed.
Later SGI Geometry Engines used custom, very specialized DSP-like processors, but the microcode for those was written by SGI and not end-user programmable.
There were probably research systems before it, but AFAIK the Geforce 3 was the first (highly limited) programmable geometry processor that was generally commercially available.
I don't think they'd have been super surprised. Just pleasantly surprised.
AI accelerators have been a thing for decades -- DSPs were used as neural network accelerators in the early 90s -- and the Cell processor was in the works by 2001.
GPUs just became vastly more accessible to general purpose programming in the last decade. People were doing it back in the 90s, but it was seriously hard.
We finally hit a tipping point where it's just kinda hard.
There were also the various custom "systolic array" processor designs in the 1980s (the ALVINN vehicle, and earlier projects which led to it, used these for early neural-network based self-driving experiments).
I remember back in 2004, when I heard a fellow grad student was working on using GPUs as a co-processor for scientific computing, I thought, "Wow, that's esoteric and niche."
This reminds me of a comment I read here ages ago about a scientist using the "processors" of the university's PostScript printers because they did the work much faster than their scientific workstations.
Reminds me of some Commodore 64 programs running code on the 1541 disk drive to offload computation from the main CPU (both the C64 and the 1541 had 6502s running at 1 MHz -- well, the C64 had a 6510, which had an I/O port). The original Apple LaserWriter had a 68k running at 12 MHz, while the Mac Plus, which came out almost a year later, had its 68k running at 8 MHz.
Wow, this is just Nvidia running laps around themselves at this point. Xeon Phi still not competitive, AMD focused on the consumer space, looks like the future of training hardware (and maybe even inferencing) belongs to Nvidia. (Disclosure: I am and have been long Nvidia since I found out cuDNN existed and how far ahead it was.)
>Xeon Phi still not competitive, AMD focused on the consumer space, looks like the future of training hardware (and maybe even inferencing) belongs to Nvidia.
Assuming there's a big future to training hardware and inferencing. Many of those "new paradigms" / "silver bullet technologies" have come and gone in the last decades.
That's true, but there is reason to believe this time is different™, with killer applications in medical image understanding, natural language understanding, and self driving cars, all of which could drive demand of these chips by themselves. It is possible we will discover new dominant architectures that don't use this hardware well but I am putting my money on us coming up with even more applications that do use this hardware well.
I agree... there's not much more they can do to scale since off die is still slow. Unless they stitch across the exposure boundary!
However, they have been at the reticle limit since they were in 28nm. GM200 (980 Ti and Titan X) was 601 mm^2 at TSMC... the maximum possible at the time.
Part of the chipmaking process is burning layers into wafers covered in photosensitive material. Photomasks/reticles used to cover entire wafers, making many units at once, but now the features are so small that the image has to be reduced (4-10x is typical), so they burn a couple of units, step over, and repeat across the same wafer. This GPU is so large they can only fit one of them in a single exposure step.
It's something along the lines of the film size for the super fancy camera they use in one of the steps. (The silicon wafer would be the equivalent of the entire roll of film.)
As I recall, Volta (with 3D memory) has been delayed multiple times due to supply, and this is only a very limited release of their highest-end hardware for deep learning, all pegged for a Q3/Q4 release -- a field where they don't really have any competition.
Can't imagine we will be seeing any Volta GeForce cards released till next year.
It's part of a step down voltage regulator called a buck converter. The buck converter works by putting a pulse of energy into the inductor and stretching it out to lower the voltage. This creates the core voltage.
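For a rough (textbook, not from the article) sense of the numbers: in steady state a buck converter's output is about the input scaled by the switching duty cycle, Vout ≈ D * Vin, so chopping a 12 V rail at roughly a 10% duty cycle and smoothing the pulses through the inductor and output caps gets you down into the ~1 V ballpark a GPU core runs at.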
I have a feeling eventually Nvidia will, like Intel, de-prioritize the consumer market in favor of the much more profitable server/machine learning market.
I've thought that, but the per-unit volume is huge. Every game console, phone, tablet, and PC needs a GPU. Even low-end devices are expected to run games. That's billions of units, albeit at lower margins.
Ironically, most of them actually use AMD's IP (the "Adreno" GPU, which is an anagram of "Radeon") that they sold off to Qualcomm in 2009. Which was yet another terrible call made by AMD management in that timeframe.
(although who knows if Adreno would have blown up in the same way if it had AMD mismanaging it)
Even more ironically, Adreno also used tile-based rendering that NVIDIA ended up adopting in the Maxwell architecture and AMD is adopting in the Vega architecture. It's a nice way to boost your power efficiency, which is critical to battery life in mobile devices.
Turns out since we're past Dennard scaling, packing more transistors on a chip now makes it hotter. So if you want it to go faster, you need to cut the power down in other ways. And thus, desktop GPUs are starting to look an awful lot like mobile GPUs...
(which is yet another reason why AMD's general-purpose compute-oriented GPU architectures are losing so badly in the desktop graphics market. RX 580 pulls twice the power of a GTX 1060 for the same performance...)
Many other aftermarket 580s are similar. For a sense of perspective here, that's roughly the same amount of power as some aftermarket 290Xs used. Or roughly 60 watts more than a GTX 1080. And that's GPU-only, not a total system load.
Polaris 10 is a reasonably efficient chip when you don't push it too hard. AMD - and their AIB partners - are pushing it way, way too hard in a desperate attempt to eke out a 2% win over the 1060. It isn't worth a 50% increase in TDP to get an extra 8% performance.
(and unlike the RX 480 - there is no reference RX 580 design, it's a whole bunch of these crazy juiced-up cards)
Reference card (1060) vs overclocked card (580). Looking at multiple review sites, including international ones like ComputerBase, PCGH.de etc., and comparing an overclocked 1060 vs an overclocked 480/580, the difference is ~50-60 watts.
Not good, but also not twice the power consumption...
I don't understand why AMD didn't use faster memory in the 580 like Nvidia did with the 1060 refresh. The 580 needs faster memory more than higher core clocks.
Let's hope AMD's return to tile-based rendering (used in Adreno), plus the other improvements, helps them get better at power consumption, just like Nvidia did with Maxwell. But I don't expect much from Vega after AMD's GPUs of the last 3 years. Navi looks more promising, as it is probably the first GPU to be fully designed under Raja Koduri.
And practically every game console, phone, and tablet, and the vast majority of PCs, are running integrated GPUs -- integrated GPUs that are not Nvidia's. Unless NV gets into the licensing market, the growth potential for them seems somewhat limited.
I mean, at what point will we come full circle and go back to a "mainframe" model where consumers don't really own/possess the computing power; rather, it's down in datacenters? Like, you play your game through a VM, basically, and your personal computer is just an AWS instance...
Not necessarily. So many of the improvements would anyway have a dual use, and it's not like their margins in the gaming/end user GPU business are razor thin. Moreover, the volume is probably immensely higher, so despite lower per-unit profit, they probably make it up in quantity. Won't be like this forever given the different speeds at which the 2 sectors grow, but it's gonna be some time before the roles are reversed.
Isn't that what this post was all about? Releasing brand new architecture on compute first seems to me pretty much like prioritizing compute market over consumers.
I was wondering if this will be used in supercomputers. Apparently yes:
> Summit is a supercomputer being developed by IBM for use at Oak Ridge National Laboratory.[1][2][3] The system will be powered by IBM's POWER9 CPUs and Nvidia Volta GPUs.
Supercomputers have very long planning and development cycles. So do GPUs and CPUs. The contract specified chips (Volta and POWER9) that didn't yet exist as much more than codenames on a roadmap.
The Titan Xp (with lowercase p, as opposed to the Titan XP) came out after the 1080 Ti so I'm sure GP took the latter into consideration before making a decision...
Interesting to note that Nvidia's stock rose about 18% (!) in a single day after this announcement (102.94 USD on May 9 to 121.29 USD on May 10). I expected the market to react, but this seems disproportionate.
My favorite outcome of Volta is that it's the first GPU they've produced that can actually claim this SIMT thing, thanks to its separate per-thread program counters (we had a spirited debate about whether just doing masking while presenting the SIMT programming model meant the chip was SIMT, or only that CUDA was while the GPUs weren't).
Does this architecture improve on 64-bit integer performance? Have any of the GPU manufacturers said anything about that? At some point it becomes a necessity for address calculations on large arrays.
"With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations"
How long until Tesla sues for trademark infringement? "from detecting lanes on the road to teaching autonomous cars to drive" makes it sound like there is an awful lot of overlap in product function.
I doubt anything like that would happen. While Tesla Motors was founded prior to the creation of the Tesla GPU architecture, there's not really any overlap -- in fact, I wouldn't be surprised if Tesla Motors were using something like this from NVidia:
As far as any overlap software-wise is concerned, while it isn't super clear what Tesla Motors is doing for their self-driving systems, based on what I've seen it seems like they are using only "basic" lane-detection and identification along with some other algorithmic vision-based systems. I'm not saying that's everything they are doing, just what I have seen released publicly on their vehicle platform.
NVidia, on the other hand, has been experimenting with using neural networks (deep learning CNNs specifically) to drive vehicles using only camera information:
This is actually a fun CNN to implement - I (and many others) implemented variations of it in the first term of Udacity's Self-Driving Car Engineer Nanodegree. We weren't told to do it this way, but I chose to after reviewing the various literature, plus it seemed like a challenge (and it was, for me). Udacity supplied a simulator:
...and we wrote code in Python (Tensorflow and Keras) to train and drive the virtual car. For my part, I had set up my home workstation with CUDA so that Tensorflow would utilize my GPU (a lowly GTX 750 Ti SC - though it seems like it might have similar GPU capability to NVidia's Drive-PX system, based on what I've researched; a Mini-ITX mobo, a PCI-E slot riser, and a GTX 750 would make a decent low-end deep-learning platform for self-driving vehicle experiments, and cost a fraction of what the Drive-PX sells for).