Professional GIS user here. Most of my high resolution terrain models are well within the capacity of even a modest GPU to load into WebGL and run on a browser without breaking a sweat. Terrain models are static by nature and don't require a lot of horsepower once they are loaded.
The struggle comes in reading the data from storage, where several minutes can be spent loading in a single high resolution raster for analysis/display. When I built my own PC this year, I splurged on an M.2 SSD for the OS and my main data store. Best decision I ever made for my workflow - huge 3D scenes that formerly took minutes to load on a spinning platter now pop up in seconds.
This thing would probably be the bee's knees for what I do. Shame it starts at $10k (and that it's AMD, so no CUDA, so no way to justify it at work for "data science" :-/).
Any idea why Nvidia dominates in "serious" GPGPU applications? I remember people mocking them for refusing to adopt OpenCL, and when they finally caved, their implementation performed far worse than AMD's. How did they win people over? Did they give out a bunch of free GPUs to universities or something?
CUDA, mainly. It's fast (faster than OpenCL) and NVIDIA is really good with their software. cuDNN for deep neural networks is almost an industry standard. NVIDIA understands software and markets better, while AMD sits on their butts for too long. Granted, AMD always comes out with a good open source solution that is always just a bit worse and very late. NVIDIA tries to create markets while AMD messes up and ends up becoming a follower. Shame really.
Edit: this is a good step though. AMD should be pushing the envelope and hopefully with Zen, they can actually realize some of the gains of HSA (which they tried to pioneer but it wasn't so useful since Bulldozer isn't that good)
CUDA is not [unqualified] "faster than OpenCL". NVIDIA does design software better than AMD, without a doubt. I think it's likely that NVIDIA decided to push CUDA and lag OCL support if their customers balked at having to port between the two. It's not that hard to port IMO, they're extremely similar.
OpenCL IMO is an ugly API born of the equivalently ugly CUDA driver API because Steve Jobs got butthurt at Jensen Huang for announcing a deal with Apple prematurely. Downvote all you like, but as John Oliver would say "That's just a fact." I witnessed it secondhand from within NVIDIA.
In contrast, OpenCL could have been a wonderful vendor-independent solution for mobile, but both Apple and Google conspired independently to make that impossible (ironic in Apple's case because of OpenCL's origin story and idiotic in the case of Google and its dreadful Renderscript, a glorified reinvention of Ian Buck's Ph.D. thesis work, Brook).
Fortunately, AMD appears to have figured out OpenCL has no desktop traction, and they have embarked on building a CUDA compilation path for AMD GPUs under ROCm (Radeon Open Compute). They have also shown dramatically improved performance on targeted deep learning benchmarks. It's early, but so is the deep learning boom.
The wildcard for me is what Intel will decide to do next.
The big win IMO is vendor-unlocking all the OSS CUDA code out there.
It would have been fantastic if Intel had stopped beating the linpack horse a lot sooner and built a viable competitor to GPUs by now. Not this timeline though alas... Maybe 2020?
They did give out a bunch of free GPUs to universities, but more than that, they have invested heavily and deliberately in HPC: community engagement, better SDKs (it's been almost a decade since I was first able to build a Linux executable against CUDA, and that code should still work), server SKUs sold in the server channel (AMD didn't bother to design SKUs for servers until it was too late). Other things that solidified NVIDIA's lead were the AWS win (in 2010, they managed to partner with AWS to bring GPU instances to market - and those are thriving) and the 2011 ORNL Titan win (important for HPC community mindshare).
They also partner with universities to design courses around Nvidia products, so they're better at funneling talent. Or at least it seems so. I don't know about AMD programs.
I don't know anyone who thinks Metal is better than Vulkan. In fact I've heard only the opposite.
And the reason why it's C is so that you can bind from multiple languages. DirectX doesn't have this problem because it's not plain C++, it's COM--which is designed to support bindings from multiple languages. But COM is effectively Windows only, and Vulkan needs to be platform independent. So Khronos made the correct choice here.
As you acknowledged, they also created an idiomatic C++ wrapper, so I don't see what your complaint is at all. Khronos did the correct thing every step of the way.
With the rise of middleware engines in the industry, the actual APIs are even less relevant nowadays than they were a few years ago.
Besides, the big studios already abstract the graphics APIs in their in-house engines anyway, or they outsource to porting studios.
Usually the HN community, which is more focused on web development and FOSS, seems to miss the point that the culture in the game industry is more focused on proprietary tooling and how to take advantage of their IP.
The whole discussion around which API to use isn't that relevant when discussing game development proposals.
It is more akin to the demoscene culture, where cool programming tricks were shown without sharing how it was done, than the sharing culture of FOSS.
If NVidia hadn't made their C++ wrapper available, I very much doubt Khronos would have bothered to create one of their own.
> With the rise of middleware engines in the industry, the actual APIs are even less relevant nowadays than they were a few years ago.
> The whole discussion around which API to use isn't that relevant when discussing game development proposals.
OK, so if the graphics API doesn't matter, why did your parent comment participate in the graphics API war?
(BTW, I agree with you that the graphics API doesn't matter too much anymore. But I think if you're going to attack Vulkan, you should do so based on specific technical reasons.)
> If NVidia hadn't made their C++ wrapper available, I very much doubt Khronos would have bothered to create one of their own.
Khronos isn't a company—it's a standards body. As NVIDIA is a member of Khronos, "Khronos" did bother to create a C++ API.
Well, for me, being based on pure C instead of the OO interfaces of other APIs isn't something that makes me wish to use it.
I'd rather use APIs that embrace OO and offer math, font handling, texture and mesh APIs as part of their SDKs, instead of forcing developers to play Lego with libraries found in the wild.
> Khronos isn't a company—it's a standards body. As NVIDIA is a member of Khronos, "Khronos" did bother to create a C++ API.
> Well, for me, being based on pure C instead of the OO interfaces of other APIs isn't something that makes me wish to use it.
But it's not "pure C". C is just the glue. You can use the C++ API if that's what you want, and you have a completely object-oriented API.
Can Linux never have an "OO interface" because all syscalls trap into a kernel written purely in C? Of course not, that would be silly. The same is true here. If you program against an object-oriented C++ interface, then you have a fully object-oriented API.
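To make that concrete, here's a rough sketch of the same instance creation through the raw C entry points and through the vulkan.hpp wrapper (error handling and extensions omitted; not production code). If you only ever touch the second form, the C underneath is effectively invisible:

    // instance_sketch.cpp - Vulkan C API vs. the vulkan.hpp C++ wrapper (illustrative)
    #include <vulkan/vulkan.hpp>   // also pulls in the plain C header

    int main() {
        // Raw C API: fill tagged structs, pass pointers, check a VkResult.
        VkApplicationInfo appInfoC = {};
        appInfoC.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
        appInfoC.apiVersion = VK_API_VERSION_1_0;

        VkInstanceCreateInfo createInfoC = {};
        createInfoC.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
        createInfoC.pApplicationInfo = &appInfoC;

        VkInstance rawInstance = VK_NULL_HANDLE;
        if (vkCreateInstance(&createInfoC, nullptr, &rawInstance) == VK_SUCCESS)
            vkDestroyInstance(rawInstance, nullptr);

        // Same thing through the C++ wrapper: constructors fill in sType for you,
        // and failures surface as exceptions rather than return codes.
        vk::ApplicationInfo appInfo("demo", 1, nullptr, 0, VK_API_VERSION_1_0);
        vk::InstanceCreateInfo createInfo({}, &appInfo);
        vk::Instance instance = vk::createInstance(createInfo);
        instance.destroy();
        return 0;
    }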
Anything that increases our dependency on C++ is much, much worse.
You can easily call into something with a C ABI (er, OS ABI designed around C, or whatever is the correct technical name) from any language. Try that with C++ :D
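As a trivial sketch of why that matters: expose one extern "C" entry point over your C++ internals and you get a stable, unmangled symbol that practically every language's FFI can load, whereas the C++ class behind it has no portable ABI (names below are made up for illustration):

    // libsum.cpp - a C-ABI surface over C++ internals (illustrative)
    #include <vector>
    #include <numeric>

    namespace impl {
        // ordinary C++ inside the library; its mangled symbols are not FFI-friendly
        double sum(const std::vector<double>& v) {
            return std::accumulate(v.begin(), v.end(), 0.0);
        }
    }

    extern "C" double lib_sum(const double* data, unsigned long n) {
        // unmangled "lib_sum": loadable from Python ctypes, Rust, Java, Lua, ...
        return impl::sum(std::vector<double>(data, data + n));
    }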
C++ provides better tools to write safer code than C ever will.
I would rather be using Ada, Rust, System C#, D, <whatever safe systems programming language>, but until OS vendors start providing something else, C++17 will have to do.
Since 1994 I only write C code when forced to do so.
I think it was a combination of marketing and great developer tools. I'm not in that business so I don't know first-hand, but former colleagues have said that Nvidia provided tons of tools, examples, and resources, while AMD basically completely neglected developers. This is changing now, but at this point it's too little, too late.
As someone that recently switched from nVidia to AMD, I can also say that nVidia products are just plain better.
AMD is trying to fix it, but it's also "too late" now: their drivers, frankly, are crap on all platforms. AMD drivers are very, VERY buggy.
Also, AMD GPUs need excessive amounts of power. At first I didn't consider this a problem, until I bought my AMD GPU and noticed it constantly throttling and causing stutter even in professional software, due to power limits. It also bit AMD in the ass during the RX 480 launch (where the excessive power draw went beyond the motherboard slot's limits, and their "fix" was to make the driver request power beyond the specification of the PSU cables instead, or to allow users on Windows to enable a harder power limit, making it throttle even more).
I had hoped AMD would "scare" nVidia into improving, into stopping their shady business practices and just running a better business, but after actually buying an AMD product and interacting with their crappy support, crappy community, and crappy distribution network (it was very hard to get the card!), I concluded that AMD has a loooong way to go before they make nVidia react. AMD is too far behind in all aspects, and the only reason they are still competitive is that they sell very power-hungry, beefy GPUs at cheap prices, achieving reasonable performance per dollar. If you compare the products ignoring that, they are just junk (in both the hardware and software sense).
NVIDIA supports developers better than AMD does (on the whole). So they got CUDA working first, and helped developers get their computational frameworks running on CUDA easier and faster. That gave them several hardware generations of head start. If OpenCL 2.x becomes easier to support and more performant as Khronos claims then maybe there'll be a shift.
ATi has always been bad at OpenGL performance in 3D content creation apps. That's the reason. Back in 2001 Nvidia reigned supreme and ATi was buggy. It seems these days both can be buggy, but Nvidia is still #1. The 6800 was an industry changer.
Well, it's an almost dead-set certainty that it will run OpenCL; hopefully it will have bindings/extensions to handle the SSD directly from your OpenCL code. Then all you need to do is port the code.
I'm having trouble seeing the value add here. From the Anandtech article [1] they are using 2x512GB Samsung 950 Pro SSDs which use PCIE v3 x4 with M.2 connectors and a PCIE switch. The drives presumably are using NVME.
The demo claims that without the SSDs they were rendering raw 8k video @17 fps and using the SSDs improved rendering to > 90fps. How can this be such a significant improvement over accessing the same SSDs connected directly to a motherboard? The graphics card would have a PCIE 3.0 x16 connection...plenty of bandwidth and very low latency.
The writer had some of the same thoughts, and found this:
"The performance differential was actually more than I expected; reading a file from the SSG SSD array was over 4GB/sec, while reading that same file from the system SSD was only averaging under 900MB/sec, which is lower than what we know 950 Pro can do in sequential reads. After putting some thought into it, I think AMD has hit upon the fact that most M.2 slots on motherboards are routed through the system chipset rather than being directly attached to the CPU. This not only adds another hop of latency, but it means crossing the relatively narrow DMI 3.0 (~PCIe 3.0 x4) link that is shared with everything else attached to the chipset."
Thanks, I missed this. I noticed that PCIe 3.0 x1 bandwidth is just above what AMD reported in their throughput test - 985 MB/s... I wonder if AMD used a system with the SSD connected to an M.2 slot on the motherboard with just a PCIe 3.0 x1 link.
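For reference, the raw link rates work out roughly like this (my own back-of-envelope numbers, not anything from the article):

    // pcie_math.cpp - rough PCIe 3.0 bandwidth per lane count (illustrative only)
    #include <cstdio>

    int main() {
        const double transfers  = 8e9;           // PCIe 3.0: 8 GT/s per lane
        const double encoding   = 128.0 / 130.0; // 128b/130b line coding overhead
        const double lane_bytes = transfers * encoding / 8.0;  // usable bytes/s per lane

        std::printf("x1:  %.3f GB/s\n", lane_bytes * 1 / 1e9);  // ~0.985 GB/s
        std::printf("x4:  %.2f GB/s\n",  lane_bytes * 4 / 1e9); // ~3.94 GB/s (DMI 3.0 class)
        std::printf("x16: %.2f GB/s\n", lane_bytes * 16 / 1e9); // ~15.75 GB/s
        return 0;
    }

985 MB/s is exactly the usable rate of a single PCIe 3.0 lane, which is why the x1 theory seems plausible.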
I should clarify that I do think this onboard SSD concept could be really compelling for certain use cases, such as needing to store several hundred gigs of data which needs to be randomly accessed.
I wonder if this would be useful to companies like Pixar and Weta Digital - I imagine a big speedup in frame rendering time or a reduction in number of build machines required would be worth lots of cash.
Weren't old HDD DMA drives exactly like this? And we went away from it because copying anything between drives was causing like 50% CPU usage? Having a separate chipset managing your drives seems like a good idea, but the bus should definitely be faster in this case.
The current pipeline for such a request would be the GPU sending a request for more data, the CPU receiving it, the kernel figuring out who should handle the request, the handler then making a request to the kernel for the data on the SSDs, the SSDs sending the data back to main memory, the kernel telling the handler where the memory is, then the handler telling the kernel to send the memory to the video card.
In contrast, the route with the on-board SSDs is: the GPU makes the request, the ASIC that handles NVMe requests the data from the flash chips, and the flash chips send it back through the ASIC straight into GPU memory. There are a lot fewer steps there, and a lot fewer places where delayed interrupts, etc. can introduce lag.
With DMA and the fact that the GPU is a PCIE bus master I don't expect much userland/kernel/interrupt activity...here is my naive implementation of "traditional" 8k video streaming:
(CPU)
WHILE video not done
1 - Initiate DMA from SSD -> main memory with some large (say 128MB) chunk of 8k video
2 - Signal GPU to begin DMA transfer
3 - Wait for interrupt from GPU
(GPU)
WHILE video not done
1 - Wait for CPU to indicate data is ready
2 - Initiate DMA from main memory -> local GPU memory over PCIE 3.0 x16 link
3 - Issue interrupt to CPU
4 - Render chunk of video
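In host-code terms, that ping-pong usually ends up as a double-buffered async copy loop. A rough sketch with the CUDA runtime API (just as a familiar example, not AMD's stack; it assumes the video has already been staged into pinned host memory, and renderChunk is a hypothetical stand-in for the real decode/render launch):

    // stream_sketch.cu - double-buffered host->GPU streaming (illustrative only)
    #include <cuda_runtime.h>
    #include <cstddef>

    static const size_t CHUNK = 128u << 20;   // 128 MB chunks, as in the pseudocode above

    // hypothetical stand-in for the real per-chunk kernel launch
    static void renderChunk(const void* /*devPtr*/, size_t /*bytes*/, cudaStream_t /*s*/) {}

    void streamVideo(const char* pinnedHostData, size_t totalBytes) {
        void* devBuf[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) {
            cudaMalloc(&devBuf[i], CHUNK);
            cudaStreamCreate(&stream[i]);
        }

        int b = 0;
        for (size_t off = 0; off < totalBytes; off += CHUNK, b ^= 1) {
            size_t n = (totalBytes - off < CHUNK) ? (totalBytes - off) : CHUNK;
            // copy the next chunk on one stream while the other stream is still rendering
            cudaMemcpyAsync(devBuf[b], pinnedHostData + off, n,
                            cudaMemcpyHostToDevice, stream[b]);
            renderChunk(devBuf[b], n, stream[b]);  // queued behind the copy in the same stream
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < 2; ++i) {
            cudaStreamDestroy(stream[i]);
            cudaFree(devBuf[i]);
        }
    }

Even with perfect overlap, the loop above never goes faster than the slowest link in the SSD -> system RAM -> PCIe x16 chain, which is the gap the on-card SSDs are meant to close.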
What you're missing is that PCIe 3.0 x16 is emphatically NOT plenty of bandwidth. It's more than what you might otherwise get, of course, but having a texture loaded locally on the card is orders of magnitude different from reading it from main memory. It's also not "low latency", at least not relatively speaking. I would have killed for hardware like this when I worked on terrain rendering in the past.
The software portion also adds tremendous value. It's not trivial to supply the GPU with data in an efficient manner. In my work I have a 30 GB volume I need to Fourier filter and remap.
Surprisingly, this makes a lot of sense, and I'm glad that it uses an M.2 port. Thus, technically, you should be able to swap out the SSD for any screaming fast model you want.
I can't find anything that supports your claim that it uses an M.2 port. But even if it does, sadly the SSD will most likely be soldered directly on the PCB like it is in most tablets and phones.
It would have been really cool to have been able to swap it out as needed, but I don't think that will be doable :(
Ars seems to imply that there will be two M.2 ports that users can populate.
"Instead of adding more expensive graphics memory, why not let users add their own...The Radeon Pro SSG features two PCIe 3.0 M.2 slots for adding up to 1TB of NAND flash, "
I used to do research on gene sequencing on a GPU. For small sets it was quite fast (it's arguably an O(n^4) algorithm, though really O(n^2 m^2)), but once you couldn't fit the data set on the GPU it was dead slow.
Well, maybe. It would certainly improve. But you probably still won't get anywhere near GDDR5 (etc.) speeds from those memory accesses. Depending on the memory access pattern it might end up dropping significantly. But you might still end up considering it "dead slow."
Sure, that part was clear. But the difference between "quite fast" for this application and "dead slow" might be that hundreds of GB/s with GDDRx/HBM is so much faster than anything else. Even the M.2 SSD onboard is probably only ~5GB/s. So if the walk pattern is something that is predictable enough by the GPU (usually pretty linear, but some memory features have 2d/3d locality), then you could crunch those numbers as fast as the SSD could deliver them. Now it becomes a question of how much computation/kernel time can you spend on a chunk of data?
On a CPU you'd have the same bottlenecks unless you fit the entire dataset in memory (which, admittedly is a possibility). Assuming you can't get the full dataset in memory, apples-to-apples you should be able to see a significant speedup on the GPU over the CPU. Previously this was not the case for large datasets.
Except that Pascal class GPUs support true unified virtual memory so some minor code changes might provide mostly the same performance using memory-mapped files. Caveat: I don't own a Pascal class GPU yet.
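For what it's worth, the managed-memory route looks roughly like this. This is a sketch under the assumption of a Pascal-class card, where cudaMallocManaged allocations can exceed physical VRAM and pages get migrated on demand; mapping an actual file would still need an explicit read into the managed buffer, which this toy example skips:

    // managed_sketch.cu - oversubscribing VRAM with unified memory (illustrative only)
    #include <cuda_runtime.h>
    #include <cstddef>

    __global__ void scale(float* data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;   // pages fault into VRAM as the kernel touches them
    }

    int main() {
        const size_t n = 1ull << 31;                    // 8 GB of floats - more than most cards' VRAM
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));    // one pointer, visible to both CPU and GPU

        for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // host-side fill (pages live in system RAM here)

        scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n);  // driver migrates pages on demand
        cudaDeviceSynchronize();

        cudaFree(data);
        return 0;
    }

Whether "mostly the same performance" holds depends entirely on the access pattern; random faults over a working set far bigger than VRAM will still crawl.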
1. GPU driven command buffer workloads will become more significant (now that literally all your textures can exist in VRAM, it's worth going the extra mile to elide a GPU-CPU round trip).
2. Voxel based techniques (which were generally memory intensive) open up again. That's modeling destructible environments, atmospheric effects, transparent materials, etc in a more accurate and performant way.
3. Entire scene graphs can live on the GPU which opens up a lot of design space for new volume hierarchies and data structures.
It comes with a 10-year warranty, so it might not even matter. But back in 2014, 240GB drives were approaching a petabyte of continuous writes before failing. https://techreport.com/review/27909/the-ssd-endurance-experi... With a couple more years of manufacturing and wear-leveling improvements, and more space for the algorithms to play with, it could be a lot more now.
On the contrary, SSD lifespans have actually decreased. Smaller transistors considerably reduce lifespan, and the advent of storing 2 (MLC) and 3 (TLC) bits per cell has hastened the trend.
3D-NAND uses different physics, but as of now hasn't yet matched legacy NAND's endurance (lifespan)
By "legacy" do you mean particularly old flash, or planar floating-gate flash in general? Because samsung apparently claims that they can do 3D TLC with better reliability and density than 10nm planar MLC.
Even if the SSDs crap out after a couple of years the cost of replacing them is tiny compared to the $10k sticker price of the board itself. If you can afford this card the SSDs shouldn't be an issue.
Depending on your perspective, your computer has contained several "computers" for many generations. Storage, network, display, and other adapters often have something like a CPU onboard capable of executing software instructions.
Let's assume a relatively modest 8 bits per component RGBA (4 bytes per pixel): a 7680x4320 frame is about 132MB, which at 24fps is about 3.2GB/s.
So in 1TB you can store roughly 312 seconds, i.e. about 5 minutes, of video.
You're going to have to top this up to play any more than 5 minutes, which means writing to the SSD at 3.2GB/s while you read from it. I don't believe any NVMe SSDs are full duplex, so that's 6.4GB/s, which the SSDs cannot provide.
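Rough arithmetic behind those numbers (assuming 4 bytes per pixel, as above):

    // bandwidth_check.cpp - sanity check on the uncompressed 8K figures (illustrative)
    #include <cstdio>

    int main() {
        const double frame_bytes = 7680.0 * 4320.0 * 4.0;  // ~132.7 MB per frame
        const double stream_rate = frame_bytes * 24.0;      // ~3.19 GB/s at 24 fps
        const double seconds_1tb = 1e12 / stream_rate;      // ~314 s of footage per 1 TB

        std::printf("frame %.1f MB, stream %.2f GB/s, 1TB lasts %.0f s\n",
                    frame_bytes / 1e6, stream_rate / 1e9, seconds_1tb);
        return 0;
    }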
You can DMA to GPU memory at 10GB/sec, so the SSDs are of no benefit for video.
No-one stores 8K data uncompressed anyway, so the CPU will need to read and decompress the data.
Perhaps it's useful for faulting in megatextures for 3D scenes though.
This might be useful for consumer cards as well: a tiny (0.25-2x VRAM) compiled shader or texture cache could possibly help with the framerate dips (microstutters).
I know that PCIe bandwidth is not really an issue and SSD latency is significantly higher than RAM (but with NVMe pushing below 3 µs it makes you wonder how low a custom interface could go..)
> The graphics card has already been demonstrated rendering a raw 8K video – the initial demo showed this running at 17 frames per second, and switching to the SSG, that was boosted to over 90 frames per second.
I doubt that the same GPU was used. If it was, we're in trouble (or have unoptimized software). I thought someone would have already solved the "let's stream data in and out and not interrupt whatever the GPU is doing" problem, but meanwhile many people are fretting over which PCIe version (and thus which attainable speeds) they can use.
One frame of uncompressed 48-bit color 8K video is about 190MB. So just sending it over an ideal PCIe 3.0 x16 link, you're limited to about 83fps. Add overhead, latencies, actual transfers slower than theoretical, and especially actual processing requiring several frames at once, and 17fps seems good, considering.
Indeed modern GPUs do copies over the bus while executing shaders/kernels. But once you saturate the bus, you are limited in how much shader execution you can accomplish as a result. If this scene required more textures than fit in memory, it would benefit from having a local DDR-accessible cache instead of having to grab it over PCIe.
A GPU's 16 lanes of PCIe 3.0 provide almost 16 GB/s of bandwidth. An M.2 SSD (used in this card) is at best 4 GB/s (PCIe 3.0 x4 lanes). This card has two M.2 slots.
Even at those speeds, accessing data from system RAM is faster (it depends on specifics, but system RAM is easily 40+ GB/s). How does increasing data bandwidth by 50% increase performance more than fourfold? Are inefficiencies (and latencies) really eating up all that?
> "let's stream data in and out and not interrupt whatever the GPU is doing"
You're quoting this but I'm not sure I understand what problem you're referencing. Streaming data to the GPU isn't a problem of interrupting it. GPUs use memory fences, atomics, and other things you might find in CPU land. Why are we in trouble if the demo was true? For every type of hardware, I can come up with a rendering algorithm and data set that makes it look better than similar hardware in a different configuration. If you have more memory bandwidth, you should be using it accordingly.
Instead of fixing a bus that's too slow for the SSD (is it really too slow?), they've made this hack with duct tape. And it surely has a proprietary API and direct addressing (no filesystem), like in the '50s.
Then they'll integrate Ethernet into the video card for high-speed traders (it can also be marketed as lower latency for gaming). But why not integrate a whole PC into it, with a gaming-console-style OS?
Does it bug only me that, with the addition of cold storage on the card, GPUs scream of poor design?
If I were rewriting the same-ish REST helper on every microservice of a project, I would immediately think of refactoring, and yet GPUs now have their own CPU/RAM/disk.
Is the future GPU-only or are we in dire need of a better way to use existing PC components from GPUs?
I don't understand. What does hardware design have to do with refactoring, usually a software term? The addition of the SSD is motivated by basic physics, which creates the need for computational units and storage units to be close together. (The alternative would be a non-von Neumann model without this compute-storage distinction, but that's not so easy.)
Lots of high-end components have onboard RAM and flash. Big SAS cards have flash to store state in case of a power outage, there are realtime Ethernet cards that control the entire IP stack themselves, etc. When you're working with huge datasets, it just makes sense to have data locality.
With UHD and higher-resolution monitors coming down the pipeline, something like this could be a game changer even for gaming graphics... right now, the GTX 1080 can't consistently hit 60fps in newer games at 4K. Something like this could improve that a bit. Even just adding, say, 64-128GB of secondary storage soldered onto the graphics board could be significant.