> The most advanced versions of Nvidia’s just-announced GeForce RTX 30 Series, for example, have an incredible 10,496 cores.
I'm so tired of Nvidia getting away with this blatant falsehood.
They weren't even actual cores when Nvidia started counting SIMT lanes as "cores" (how many hands do you have, 2 or 10? And somebody who writes twice as fast as you must obviously have 20 hands, yes?), and now that the cores can, under some conditions, do dual-issue, they are counting each one double.
What's next, calling each bit in a SIMT lane a "core"?
> What's next, calling each bit in a SIMT lane a "core"?
No, because that would break the convention they've used since the introduction of the term CUDA core: it's the number of FP32 multiply-accumulate instances in the shader cores. Nothing more, nothing less. If you read more into it, then that's on you.
You may not like that they define core this way (all other GPU vendors do it the same way, of course), but they've never used a different definition.
BTW, I checked the 8800 GTX review and there's no trace of 'core' being used in that way. It was AMD who started calling FP32 ALUs 'shader cores' or 'shaders' with the introduction of the ill-fated 2900 XT. Since comparing the number of same-function resources is one of the favorite hobbies of GPU buyers, Nvidia subsequently started using the same counting method and came up with the term "CUDA core."
The term "core" had a concrete meaning before Nvidia defined it to mean number of FMA units, and it's obviously no accident: they get to claim they have 32x more cores than would be the case by common definition. They are the odd one out; should Intel and AMD start multiplying their number of cores by the SIMD width? You would accept that as a "core"?
Now, as I wrote above, they have changed it AGAIN: they are now double-counting each "core" because it has some (limited) superscalar capability.
> The term "core" had a concrete meaning before Nvidia defined it to mean number of FMA units
It was AMD/ATI who started doing it. Nvidia followed, and they didn't really have a choice if they wanted to avoid an "AMD has 320 shader cores yet Nvidia only has 16" marketing nightmare.
> Now, as I wrote above, they have changed it AGAIN: they are now double-counting each "core" because it has some (limited) superscalar capability.
They did not. In Turing, there's one FP32 pipeline and one INT32 pipeline with dual issue; in Ampere, there's one FP32 pipeline and one (INT32+FP32) pipeline, allowing dual issue of two FP32 ops when INT32 is not being used.
That can only be done if there are 2 physical FP32 instances. There is no double counting.
If your point is that this second FP32 unit can't always be used at 100%, e.g. because the INT32 is used, then see my initial comment: it's the number of physical instances, nothing more, nothing less. It doesn't say anything about their occupancy. The same was obviously always the case for AMD as well, since they had a VLIW5 ISA when they introduced the term, and I'm pretty sure that those were not always fully occupied either.
Is that it? According to that definition, the first FP32 unit is not a core either...
But, again, if you want to blame someone for this terrible, terrible marketing travesty, start with AMD. They started it all...
My point was simply that, contrary to your assertion, Nvidia and AMD have never changed the definition of what they consider to be a core, even if that definition doesn't adhere to computer science dogma.
One "CUDA core" is indeed one GPU thread. The lane of a GPU SIMD is nothing like CPU SIMD, and can independently branch (even if that branching can be much more expensive than on a CPU).
This is not true, just as a shader core on AMD was not a GPU thread.
For example, the 2900 XT had 320 shader cores, but since it used a VLIW5 ISA, that corresponds to 64 GPU threads.
Similarly, an RTX 3080 has 8704 CUDA cores, but there are 2 FP32 ALUs per thread, resulting in 4352 threads, and 68 SMs since, just like Turing, there are 64 threads per SM.
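Spelling the arithmetic out (using the figures above):

    8704 CUDA cores / 2 FP32 ALUs per thread = 4352 threads
    4352 threads / 64 threads per SM         = 68 SMs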
> It was AMD who started calling FP32 ALUs 'shader cores' or 'shaders' with the introduction of the ill fated 2900 XT
IIRC, ATI started to use proper FP32 ALUs for both pixel shading and video processing/acceleration [0] around the same time. I guess doing this stuff needs more than simple MUL/ACC instructions.
If you're a computer science fundamentalist like pixelpoet who wants to stick to the definition of a core as what AMD and Nvidia call a "CU" and "SM", using "threads and warps" instead of "strands and threads", then "CUDA core" is obviously jarring.
It's simply a case where marketing won. At least Nvidia and AMD, and now Intel, are using the same term. You can go on with your life, or you can whine about it, but it's not going to change.
If we're going to stick to literal definitions of the things in computer science, we should get the forks out and march to the storage manufacturers' HQs to force them to accept 1kB is actually 1024 bytes, not 1000.
From my PoV, any functionally complete computational unit (in the context of the device) can be called a core. Should we say that GPUs have no cores because they don't support 3DNow! or SSE or AES instructions?
Consider an FPGA. I can program it to have many small cores which can do a small set of operations or I can program it to be single but more capable core. Which one is a real core then?
A CU is not a computationally complete unit: it has no independent thread of execution, and all cores in a warp follow the same execution path (indeed all execution paths, even if not applicable to that CU).
That is in contrast to your FPGA example. Either one big core or many small cores can all execute their own threads.
It's not that simple either; you've got stuff like SMT and CMT where you have multiple threads executing on a single set of execution resources - but CUs are clearly on the "not a self-contained core" side of the line.
Thanks for the clarification. I've read about Nvidia's architecture back in the day, but it seems I'm pretty rusty on the GPU front.
Is it possible for you to point me in the right direction so I can read up on how these things work and bring myself up to speed?
SMT is more like cramming two threads into a core and hoping they don't compete for the same ports/resources in the core. CMT is well... we've seen how that went.
The short of it is that GP is right: SIMT is more or less "syntactic sugar" that provides a convenient programming model on top of SIMD. You have a "processor thread" that runs one instruction on an AVX unit with 32 lanes. What they are calling a "CUDA core" or a "thread" is analogous to an AVX lane; the software thread is called a "warp" and is executed using SMT on the actual processor core (the "SM" or "Streaming Multiprocessor"). The SM is designed so that a lot of SMT threads (warps) can be live on the processor at once (potentially dozens per core); a warp is put to sleep when it needs to do a long-latency data access, and the SM swaps in some other warp to process while it waits. This covers for the very long latency of GDDR memory accesses.
The distinction between SIMT and SIMD is that basically instead of writing instructions for the high-level AVX unit itself, you write instructions for what you want the AVX lane to be doing and the warp will map that into a control flow for the processor. It's more or less like a pixel shader type language - since that's what it was originally designed for.
In other words, under AVX you would load some data into the registers, then run an AVX mul. Maybe a gather, AVX mul, and then a store.
In SIMT, you would write: outputArr[threadIdx] = a[threadIdx] * b[threadIdx]; or perhaps otherLocalVar = a[threadIdx] * threadLocalVar; The compiler then maps that into loads and stores, allocates registers, and schedules ALU operations for you. And of course, like any "auto-generator" type thing, this is a leaky abstraction: it behooves the programmer to understand the behavior of the underlying processor, since it will faithfully generate code with suboptimal performance.
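To make that concrete, here is a minimal CUDA sketch of that first example (illustrative only: the kernel name, array names and launch configuration are mine, not from any particular codebase):

    // Each lane ("CUDA core") executes one index of this kernel; a warp of 32
    // consecutive indices is what actually gets scheduled on the SM.
    __global__ void multiply(const float* a, const float* b, float* outputArr, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this lane's element
        if (i < n)
            outputArr[i] = a[i] * b[i];                  // compiler emits the load/mul/store
    }

    // Host side: launch enough 32-wide warps (grouped into blocks of 256) to cover n.
    // multiply<<<(n + 255) / 256, 256>>>(d_a, d_b, d_outputArr, n);

You write scalar per-lane code and the hardware runs it 32 lanes at a time - exactly the AVX analogy above.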
In particular, in order to handle control flow, basically any time you have a code branch ("if/else" statement, etc.), the warp will poll all its lanes. If they all go one way it's good, but if they go both ways then it has to run both sides, so it takes twice as long. The warp will turn off the lanes that took branch B (so they just run NOPs) and run branch A for the first set of cores; then it turns off the first set of cores and runs branch B. This is an artifact of the way the processor is built: it is one thread with an AVX unit, and each "CUDA core" has no independent control; it is just an AVX lane. So if you have, say, 8 different ways through a block of code, and all 8 conditions exist in a given warp, then you have to run it 8 times, reducing your performance to 1/8th. Or potentially exponentially more if there is further branching in subfunctions/etc.
(obviously in some cases you can structure your code so that branching is avoided - for example replacing "if" statements with multiplication by a value, and you just multiply-by-1 the elements where "if" is false, or whatever. But in others you can't avoid branching, and regardless you have to manually provide such optimizations yourself in most cases.)
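A hedged CUDA sketch of both versions (kernel and array names are made up, and for a branch this tiny the compiler will often predicate it anyway; this is just to show the shape of the two approaches):

    // Divergent version: within a warp, some lanes may take the 'if' and others
    // not, so the warp runs the body with the non-taking lanes masked off.
    __global__ void divergent(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] > 0.0f)
            x[i] = x[i] * 2.0f;
    }

    // Branch-free version: every lane does the same multiply, and elements where
    // the 'if' would be false are simply multiplied by 1 (the trick described above).
    __global__ void branchless(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float factor = (x[i] > 0.0f) ? 2.0f : 1.0f;   // compiles to a select, no divergence
        x[i] = x[i] * factor;
    }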
AMD broadly works the same way but they have their own marketing names, the AVX lane is a "Stream Processor", the "warp" is a "wavefront", and so on.
AVX-512 actually introduces this programming model to the CPU side, where it is called "Opmask Registers". Same idea: there is a flag bit for each lane that you can use to set which lanes an operation will apply to, and then you run some control flow on it.
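For what it's worth, here's a rough host-side C sketch of the same masked-multiply idea with AVX-512 opmasks (assumes AVX-512F and the usual immintrin.h intrinsics; the function name and the "double only the positive elements" task are made up for illustration):

    #include <immintrin.h>

    // Double x[i] only where x[i] > 0, for one 16-lane vector.
    void scale_positives(float* x)
    {
        __m512    v    = _mm512_loadu_ps(x);
        __m512    zero = _mm512_set1_ps(0.0f);
        // One mask bit per lane, set where x[i] > 0.
        __mmask16 m    = _mm512_cmp_ps_mask(v, zero, _CMP_GT_OQ);
        // Masked multiply: lanes whose mask bit is 0 keep the original value from v.
        __m512    r    = _mm512_mask_mul_ps(v, m, v, _mm512_set1_ps(2.0f));
        _mm512_storeu_ps(x, r);
    }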
1 KiB (kibibyte) was standardized as 1024 bytes in 1998. Before that, "kilobyte" was the term commonly used for 1024 bytes [0].
> Prior to the definition of the binary prefixes, the kilobyte generally represented 1024 bytes in most fields of computer science, but was sometimes used to mean exactly one thousand bytes. When describing random access memory, it typically meant 1024 bytes, but when describing disk drive storage, it meant 1000 bytes. The errors associated with this ambiguity are relatively small (2.4%).
As a GPU buyer, I don't understand Nvidia's marketing at all. They list all their specs as like "X ray tracing cores, Y tensor cores, Z CUDA cores" and I have no idea how that translates to real-world performance. Which of those does my game use? Which of those does Blender use when raytracing? (Those are the two applications that I use a GPU for, so the ones I personally care about.) And the cores are listed as separate things, but I have the feeling they're not; if you're using all Y tensor cores, then there aren't Z unused CUDA cores sitting around, right?
I think it all ends up being useless at best and completely misleading at worst. The reality is that I don't even know if I want to buy their product or not. I guess that's what reviews are for? But why do reviewers have to do the job of Nvidia's marketing department? Seems strange to me.
This is a criticism that has been made of consumer computing since its inception: Even if you're versed in the technical details, you can still get misled.
In all cases it helps to know your must-haves and prioritize accordingly, given that it's rarely the case that you can "just" get the new one and be happy: even if it benchmarks well, if it isn't reliable or the software you want to run isn't compatible, it'll be a detriment. So you might as well wait for reviews to flesh out those details unless you are deadset on being an early adopter. The specs say very little about the whole of the experience.
I actually hate the idea of having a high-end GPU for personal use these days. It imposes a larger power and cooling requirement, which just adds more problems. I am looking to APUs for my next buy - the AMD 4000G series looks to be bringing in graphics performance somewhere between GT 1030 and GTX 1050 equivalent, which is fine for me, since I mostly bottleneck on CPU load in the games I play now (hello, Planetside 2's 96 vs 96 battles, still too intense for my 2017 gaming laptop), and these APUs now come in 6 and 8 core versions with the competitive single-thread performance of more recent Zen chips. I already found recordings of the 2000G chips running this game at playable framerates, so two generations forward I can count on being a straight up improvement. The only problem is availability - OEMs are getting these chips first.
> But why do reviewers have to do the job of Nvidia's marketing department?
Are you suggesting you would rather read reviews of Nvidia products that are written by Nvidia, and you would trust them more than 3rd party reviews?
> I don't understand Nvidia's marketing at all.
Do read @TomVDB's comments; this isn't Nvidia, this is industry-wide marketing terminology.
Cores are important to developers, so what you're talking about is that some of the marketing is (unsurprisingly) not targeted for you. If you care most about Blender and games, you should definitely seek out the benchmarks for Blender and the games you play. Even if you understood exactly what cores are, that wouldn't change anything here, you would still want to focus on the apps you use and not on the specs, right?
> I have the feeling they're not; if you're using all Y tensor cores, then there aren't Z unused CUDA cores sitting around, right?
FWIW, that's a complicated question. There's more going on than just whether these cores are separate things. The short answer is that they are, but there are multiple subsystems that both types of cores have to share, memory being one of the more critical examples. The better answer here is to compare the perf of the applications you care about, using Nvidia cards to using AMD cards, picking the same price point for each. That's how to decide which to buy, not worrying about the internal engineering.
I wouldn't trust them more, but it would be a good rule of thumb for whether or not I need to be awake during this product cycle. For example, if they're like "3% more performance on the 3090 vs. the RTX Titan" then I can just ignore it and not even bother reading the reviews. Instead, they're just like "well it has GDDR6X THE X IS FOR XTREME" which is totally meaningless.
> Instead, they're just like "well it has GDDR6X THE X IS FOR XTREME" which is totally meaningless.
That's referring to memory and not cores; is that a realistic example? I'm not very aware of Nvidia marketing that does what you said specifically - the example feels maybe a little exaggerated? I will totally grant that there is marketing speak, and understanding the marketing speak for all tech hardware can be pretty frustrating at times.
> if they're like "3% more performance on the 3090 vs. the RTX Titan" then I can just ignore it and not even bother reading the reviews.
Nvidia does publish some perf ratios, benchmarks, and peak perf numbers with each GPU, including for specific applications like Blender. Your comment makes it sound like you haven't seen any of those?
Anyway, I think that would be a bad idea to ignore the reviews and benchmarks of Blender and your favorite games, even if you saw the headline you want. There is no single perf improvement number. There never has been, but it's even more true now with the distinction between ray tracing cores and CUDA cores. It's very likely that your Blender perf ratio will be different than your Battlefield perf ratio.
I haven't seen any of those. All I've seen is a green-on-black graph where the Y axis has no 0, they don't say what application they're testing, and they say that the 3070 is 2x faster than the 2080 Ti. Can you link me to their performance numbers? As you can tell, I'm somewhat interested. (And I know that real reviews arrive tomorrow, so... I guess I can wait :)
The official press release of the 3000 series[1] has a graph that seems to be what you're looking for. Look for the section named "GeForce RTX 30 Series Performance".
It has a list of applications, each GPU and their relative performance. Y=0 is even on the graph!
Nvidia does publicize benchmarks in its marketing, but many people (correctly) are skeptical of benchmarks published by the same company that makes the product. The number of CUDA cores and other hardware resources is a number that feels a bit more objective, even though it is hard to understand the direct implications on performance.
Apple is a good example of a company that just doesn't really talk too much about the low-level details of their chips. People buy their products anyway.