Zluda: CUDA on Intel GPUs (github.com/vosen)
261 points by fho on Feb 27, 2021 | 77 comments



Trivia: the Polish word "cuda" means "miracles", "zluda" ("złuda") means "a delusion". Nice pun.


I'm Czech, and in the dialect of the region where I grew up, "Zluda" could be translated as "evil person/monster" or "mischievous person/monster".

Growing up, people sometimes called their kids "zludy" (plural of "zluda" in my language), i.e. trouble-makers.


Interesting. Possibly a variant of the standard word zrůda?


Yeah, could be! I grew up close to the Polish/Slovak/Czech border. I bet some of those words got mixed together.


Off topic...I heard that a common Polish swear-word is "cholera", because the disease is so bad. Is that true?


Same in Dutch: they use disease names such as cancer or typhus as swear words.


It is.


Excellent effort. Nvidia has become the de facto GPGPU hardware vendor due to CUDA, but I wish it were OpenCL or another general API instead. Even Raspberry Pi's VideoCore has OpenCL support[1].

But a look at the HW acceleration support table at FFmpeg[2] shows why the GPGPU platform/API landscape is such a mess. The performance benefits are incredible, though: using VAAPI with FFmpeg to encode a 2560x1080 (ultrawide 1080p) screen capture at 60fps reduces CPU usage from 90% to 10% on an old Core i5 with Intel HD 3000; an old laptop could perfectly well be used as an encoding machine for streaming just by using HW acceleration.

What's funny is that the laptop also has a Radeon HD 6490M with 1 GB of dedicated GDDR5 memory, and it's not supported by VAAPI for encoding! GPGPU API/platform support is astonishingly messy.

[1]https://github.com/doe300/VC4CL

[2]https://trac.ffmpeg.org/wiki/HWAccelIntro


GPU accelerated video encoding/decoding is done largely via special fixed function hardware not GPGPU so there really isn't a relation between the links and the statements. The reason the Radeon HD 6000 is not supported for hardware accelerated encoding is simply because it did not have a video encoding ASIC to use. The HD 7000 series introduced it to the family. https://en.wikipedia.org/wiki/Radeon_HD_6000_series#Radeon_F...


> The reason the Radeon HD 6000 is not supported for hardware accelerated encoding is simply because it did not have a video encoding ASIC to use.

Doesn’t this prove the broader point that cross-platform GPGPU would benefit everyone? A new codec is written... and everyone gets to use it, not just those with fixed function support.


No, it proves fixed function hardware outperforms general purpose hardware for the task it's made for. Without the fixed function hardware the GPU is worse suited for the task than a general purpose CPU.


>GPU accelerated video encoding/decoding is done largely via special fixed function hardware not GPGPU so there really isn't a relation between the links and the statements

CUDA (NVENC/NVDEC), AMF, and OpenCL are listed on the FFmpeg chart I mentioned and linked to. HW accel might use a different compute unit for its function, but GPGPU APIs are nevertheless being used for it.

Perhaps HW accel is not the best example for citing the gap for a universal GPGPU API or its implementation, since the OP is about porting a GPGPU API to an unsupported platform; what would be the right example?

P.S. I stand corrected on implying the Radeon HD 6000 couldn't do HW-accelerated encoding due to the API mess, whereas it was actually missing the ASIC. Thanks.


NVENC/NVDEC are not CUDA, they are APIs for accessing the fixed function hardware on Nvidia GPUs. Same with AMF, just for the AMD fixed function hardware. VAAPI/VDPAU/DX*/others in the chart, with the exception of OpenCL, are OS or cross-platform APIs used for abstracting over access to fixed function video decode from many vendors without needing to target each vendor individually. Again, nobody is using any of these APIs for GPGPU video encode/decode; it performs horribly.

The only GPGPU library in that chart is OpenCL, used for accelerating custom video filter effects like blur. The table says OpenCL is supported on Linux, Windows, Mac, and Android devices (with GPUs capable of GPGPU) for Intel, AMD, and Nvidia. The Pi isn't supported because its GPU isn't capable enough. That leaves iOS as the only capable platform not covered (walled garden reasons). I.e., I'm not seeing the gap you keep referring to for GPGPU APIs, just a task (video coding) that doesn't work well on general purpose hardware.

As for ZLUDA, why is it needed then? CUDA was and is extremely popular, as Nvidia has been at the forefront of GPGPU (hardware and software) for over a decade, pushing CUDA along the way. OpenCL didn't even get comparable until 5 years after CUDA kicked off. As a result there is a large amount of CUDA code out there that people don't want to rewrite just to make it portable. Hence projects like ZLUDA for Intel, which does it transparently, and ROCm from AMD, which helps automate the porting of code.


>NVENC/NVDEC are not CUDA, they are APIs for accessing the fixed function hardware on Nvidia GPUs.

Not necessarily; NVENC can also use CUDA cores for hardware acceleration for certain features[1]:

>HEVC encoding also lacks Sample Adaptive Offset (SAO). Adaptive quantization, look-ahead rate control, adaptive B-frames (H.264 only) and adaptive GOP features were added with the release of Nvidia Video Codec SDK 7. These features rely on CUDA cores for hardware acceleration.

>Nvidia Video Codec SDK 8 added Pascal exclusive Weighted Prediction feature (CUDA based). Weighted prediction is not supported if the encode session is configured with B frames (H.264).

[1]https://en.wikipedia.org/wiki/Nvidia_NVENC


Point of note - using "CUDA Cores" is not the same as using the "CUDA API", particularly from the app's view as it's all behind the NVENC API anyways. But yes, these conditionals are what I was referring to when I said "GPU accelerated video encoding/decoding is done __largely__ via special fixed function hardware" i.e. the fixed function hardware does >95% of the compute work which is what makes it viable.


This looks like something Intel should be paying good money for, but I feel like they are just going to "embrace" open source and snatch it without giving a penny to the author.


Disclosure: I work at Intel.

At one of the internal Q&As, there was a question as to why we didn't just implement CUDA. One of the reasons given was that the lawyers looked at the license of CUDA and decided that Intel could not legally implement CUDA for Intel's GPU devices. I don't know the details, but quite frankly, it wouldn't surprise me if Nvidia somehow put a poison pill in there to prevent Intel or AMD from implementing it for their own GPUs (note that AMD also doesn't provide an implementation of CUDA for its own GPUs).

Instead, the strategy Intel pursued was to develop a migration tool from CUDA to SYCL: https://software.intel.com/content/www/us/en/develop/tools/o...
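
For a sense of what such a migration produces, here's a minimal hand-written sketch (not actual output of Intel's tool; the function names are mine): a trivial CUDA kernel and a rough SYCL/DPC++ equivalent operating on a USM device or shared allocation.

    // CUDA: scale an array that lives on the device.
    __global__ void scale(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }
    // launched as: scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);

    // Rough SYCL/DPC++ equivalent (x is a USM device or shared pointer).
    #include <sycl/sycl.hpp>
    void scale_sycl(sycl::queue& q, float* x, float a, int n) {
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            x[i] *= a;
        }).wait();
    }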


AMD implemented HIP, which is nearly CUDA (if not identical). There is an implementation for Intel too, though it is third-party:

https://github.com/cpc/hipcl
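
To show how close "nearly CUDA" is, a hedged sketch (host code only, error handling omitted, written from memory rather than taken from the HIP docs): the kernel and launch syntax are unchanged, and the runtime calls are essentially a s/cuda/hip/ rename.

    #include <hip/hip_runtime.h>  // the CUDA version includes <cuda_runtime.h> instead

    __global__ void add_one(float* x, int n) {   // identical in CUDA and HIP
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    void run(float* host, int n) {
        float* dev;
        size_t bytes = n * sizeof(float);
        hipMalloc((void**)&dev, bytes);                       // cudaMalloc
        hipMemcpy(dev, host, bytes, hipMemcpyHostToDevice);   // cudaMemcpy
        add_one<<<(n + 255) / 256, 256>>>(dev, n);            // same launch syntax
        hipMemcpy(host, dev, bytes, hipMemcpyDeviceToHost);
        hipFree(dev);                                         // cudaFree
    }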


The problem here seems to be that everyone kinda wants Nvidia's market capture. Intel didn't contribute to AMD's project - they started their own.


And they are all approaching it the wrong way, the typical hubris of a hardware company that thinks they can just have some interns make a software solution for their problem.

Here is how you beat the CUDA lock-in: consistently make better-performing GPUs, so that not using them is a liability. If you buy AMD instead, you not only get a worse GPU but also the intern software solution, and that is just not compelling.


It doesn't help if they don't offer the same GUI debugging capabilities and polyglot support as CUDA does in its current form.


Or AMD could simply knock off the binary blobs. The DRM excuse has always been weak, because it is a tiny fraction of the total firmware blob - and it'd be easy to make it so that the legally hobbled hardware decoder simply errors out in cases where the end user chooses not to load the DRM blob. Boom, no more dependence on AMD interns. I've been following their commit logs pretty closely for a year now, and I frequently see some amazing accidental admissions about the left hand (software team) not knowing what the right hand (hardware team) is doing. During one very frustrating series of patches it was difficult resisting the impulse to say "Go get your dad."


Not only that, everyone that has jumped onto SYCL is doing SYCL + their own sugar on top, so what is the "portability" selling point, actually?

Also, they keep forgetting that CUDA is not only C++ but rather a polyglot ecosystem.


I am guessing that Nvidia only copied what Intel is doing with x86. Why doesn't Intel free x86? A little bit of competition would be good for everyone; otherwise ARM will take over the market, so Intel has nothing to lose from doing that.


Why not collaborate with AMD for once and improve HIP? If AMD and Intel stay divided, all hope is lost and Nvidia will stay the main target.


Because HIP/ROCm is awful.

Instead of targeting an IR, it directly targets a given GPU's ISA, so that your existing binary will not run on future hardware. That's a total no-go for basically every non-HPC use case.

Intel is much better off building a sound technical foundation from scratch.
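
For context on why the IR matters, a hedged illustration (my own sketch, not from the CUDA or ROCm docs; kernel_ptx is a placeholder): CUDA applications typically embed PTX, and the driver JIT-compiles it at load time for whatever GPU is actually installed, so an old binary keeps running on new architectures. ROCm code objects are compiled for specific ISAs instead, leaving nothing for the driver to recompile.

    #include <cuda.h>

    // PTX text embedded in the binary at build time (contents elided; placeholder name).
    extern const char kernel_ptx[];

    // The driver JIT-compiles the PTX for the architecture of whatever GPU is
    // present, including GPUs released after the app shipped.
    // Assumes a current CUDA context; error handling omitted.
    CUmodule load_module() {
        CUmodule mod;
        cuModuleLoadData(&mod, kernel_ptx);
        return mod;
    }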


And it doesn't even work on most consumer AMD graphics cards^^


You'd better make ZLUDA work well on AMD GPUs then, otherwise it will be irrelevant.


You make it sound like that's an issue with Intel, while the license is to blame.


So, basically this project is illegal and waiting to be removed from GitHub then?


(IANAL.)

Without looking into the CUDA licenses (who knows, they might even expressly allow this kind of thing, but seems pretty unlikely to me), I'd expect this to be a case of whether "APIs are copyrightable" or not, same as the famous and sort of still ongoing https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

The US courts said "yes" in this case (note: 100% stupid IMO), but I'm not confident that nVidia'd have an easy win if they decided to sue the developer, and I'm also relatively sure they wouldn't send a DMCA request at this stage (and that if they did, their request would probably be reviewed harder than normal).


That's not the most accurate summary of Google v Oracle. The case has been tried twice, Google has been found in the clear twice at the district court by the jury, the Court of Appeals for the Federal Circuit has twice overturned the jury result, and the case is now at the Supreme Court awaiting a decision as to whether or not CAFC's decision is off its rocker.

It is not usual for CAFC to hear copyright disputes; that it was appealed to CAFC instead of the 9th Circuit is because there was a patent claim at one point, and CAFC should have followed 9th Circuit precedent on the matter. Google contends that 9th Circuit precedent holds that the API is not copyrightable, which means that CAFC erred in ignoring precedent. Most software companies ultimately agree with Google here, not Oracle: it's telling that most of the amici who side with Oracle are not software companies but media publishers (e.g., MPAA, RIAA).


That link is missing a trailing period by the way

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

Edit: huh? I pasted it in with the period and it gets removed. Do trailing periods get removed because it thinks it's the end of a sentence?

Anyway, this redirects to the right URL: https://en.wikipedia.org/wiki/Google_v_Oracle


Yeah, HN strips trailing periods for the reason you stated. You can normally get around that by adding a hash/pound sign.

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...


> I'm not confident that nVidia'd have an easy win if they decided to sue the developer

So long as Nvidia's case doesn't get outright laughed out of court, they can throw money at their legal team until the developer goes bankrupt.


I think you could argue from the interop side, which would make such a project legal in Europe. So Intel or AMD could funnel money into this via their European branches or some subsidiary?


Yes. I am not a lawyer, but Nvidia's lawyers would be interested, I bet.


This makes me think: why doesn't Intel actually care? There's a lot of money poured into marketing, and today it's always about some "AI-optimized multi-core nano-whateverer GPU", so isn't it obvious that making your shitty GPUs compatible with software written for the competition's less shitty GPUs is probably even more profitable over a 5-year run than trying to make your GPUs a bit less trash? Yes, Nvidia wouldn't want that, so it won't be easy to make your product CUDA-compatible, but it's not impossible. Is it illegal for them to do, or what? We are all kind of accustomed to the idea that Intel doesn't care, but why don't they care? It is money lying around waiting to be taken, and they just ignore it.


Impressive! The PTX to SPIR-V compiler must have been quite a bit of work; what's the coverage of the ISA like?

With oneAPI I had hoped to get the inverse, a oneAPI implementation for NVIDIA hardware, but I don't think the CUDA driver API is low-level enough to do so (e.g. explicit vs global contexts). And yes, I know of Codeplay's implementation of DPC++ for NVIDIA GPUs, but that doesn't implement oneAPI Level0 APIs so is not usable for other languages.


I'm from Codeplay and I am not sure about what you mean from your comment about Level0. Intel have developed a back-end for DPC++ that supports Intel processors through OpenCL and SPIR-V. Codeplay has implemented a back-end that supports Nvidia processors by using PTX instructions in the same way that native CUDA code does. PTX is the equivalent to SPIR for Nvidia processors. Maybe I am misunderstanding so apologies if that is the case.


> Is ZLUDA a drop-in replacement for CUDA?
>
> Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA

I fail to see how that is a drop-in replacement.

Plus apparently it doesn't support the polyglot CUDA ecosystem.


On top of HN... I get excited to use it with Blender, then scroll through a plethora of "awesomeness" text to find this:

> Warning: this is a very incomplete proof of concept. It's probably not going to work with your application. ZLUDA currently works only with applications which use CUDA Driver API or statically-linked CUDA Runtime API - dynamically-linked CUDA Runtime API is not supported at all.

Come on, man. You can't call it a drop in replacement if it's an incomplete proof of concept.


I don’t think it’s unfair to call it a drop in replacement, though the current limitations should perhaps be caveated a little more clearly.

I think calling it a drop in replacement highlights that the aim of Zluda is to not require recompilation of the software being run. Maybe “Proof of concept drop in replacement” would be a better description?
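
For what it's worth, here is a minimal sketch (my own illustration, not taken from the ZLUDA docs) of what "Driver API, no recompilation" means in practice: a program like this resolves libcuda.so / nvcuda.dll at run time, so pointing it at ZLUDA's replacement library is enough.

    #include <cuda.h>    // CUDA Driver API, provided by libcuda.so / nvcuda.dll
    #include <cstdio>

    int main() {
        cuInit(0);                              // error handling omitted for brevity
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        char name[256];
        cuDeviceGetName(name, sizeof(name), dev);
        std::printf("Device 0: %s\n", name);    // with ZLUDA's library in place, an Intel GPU is reported
        return 0;
    }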


> Come on, man. You can't call it a drop in replacement if it's an incomplete proof of concept.

I was nearly put off by that, but maybe it should be given a chance with some funding or whatnot. All great and interesting things come from proofs of concept that actually work.

At least it isn't vapourware. Unlike some 'projects' I see on GitHub.


Downvoters: So this 'proof of concept' on GitHub is in fact somehow 'vapourware', and it shows an empty repository with zero functioning code? I'm looking at the same repository as everyone else in this HN post.

Literally the top comment is even suggesting that Intel could be interested in funding this, so surely it deserves a chance with some backing, even if it is 'incomplete'.

Care to explain the downvotes?


That's the goal, even if not there yet


What does this mean? Is the statement in the article true or not?


Love this concept and hope this does well or something like it becomes a de facto standard. There's an explosion of software using GPUs. In the DL space alone, there are PyTorch and TensorFlow and many frameworks that compete with them, plug into them, etc. There's also an explosion of DL hardware coming. We have NVIDIA's whole product line, competing GPUs, TPUs, Inferentia, a bunch of startups... That compatibility matrix is going to be insane. You need a good integration layer between them. If everyone can support CUDA, great. But there's also OpenCL and other competing standards. They're all going to have some performance trade-offs, but some of that must be worth it to keep some semblance of a common framework, otherwise small and new players don't stand much of a chance.


I wish PyTorch would utilize CPUs more effectively out of the box. The last time I tried to run some DNN training on CPU (last summer), I was disappointed to find that only a single core was used on my Ryzen machine. Yes, CPUs don't have as much memory bandwidth or compute as GPUs, but this is still leaving a lot of performance on the table.


PyTorch can use more than one core; CPU ops use intra-op parallelism (OpenMP/MKL) by default, and the thread count can be tuned with torch.set_num_threads.


> Tying to the previous point, currently ZLUDA does not support asynchronous execution. This gives us an unfair advantage in a benchmark like GeekBench. GeekBench exclusively uses CUDA synchronous APIs

Any "professional" application solely use async APIs so while these numbers may look impressive something like tensorflow or pytorch would either not run or be incredibly slow.


Upvoting so someone from Intel sees it and hires these devs to make this actually happen. CUDA absolutely dominates, it would be a game changer to have it work with another GPU platform.


I think the correct way to solve it is to make something competitive for Vulkan. Intel, AMD, and maybe ARM should really cooperate on that front.


I was working at Intel on integrated graphics less than a year ago.

Geekbench could, at best, be described as a fine benchmark. But it's pretty terrible in a bunch of respects.

I would love to see more benchmarks run through with this.


I'm stoked for this to be usable on things like PyTorch.


The author has a comment here describing what that would take: https://github.com/vosen/ZLUDA/issues/17#issuecomment-735403....

tl;dr: someone would need to re-implement cuDNN


> Authors of CUDA benchmarks used CUDA functions atomicInc and atomicDec which have direct hardware support on NVIDIA cards, but no hardware support on Intel cards.

Then how does the InterlockedAdd HLSL intrinsic work on Intel? https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/...
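
Presumably because atomicInc is not a plain atomic add: it wraps, computing ((old >= val) ? 0 : old + 1), which InterlockedAdd doesn't express. A hedged sketch of how it can be emulated with a compare-and-swap loop (I don't know whether this matches what ZLUDA actually generates):

    // atomicInc(addr, limit) semantics: read old, store (old >= limit ? 0 : old + 1),
    // return old. Emulated on top of atomicCAS:
    __device__ unsigned int atomicInc_emulated(unsigned int* addr, unsigned int limit) {
        unsigned int old = *addr, assumed;
        do {
            assumed = old;
            unsigned int next = (assumed >= limit) ? 0u : assumed + 1u;
            old = atomicCAS(addr, assumed, next);
        } while (old != assumed);
        return old;
    }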


Very nice! I look forward to trying this out.

There are a few issues which come to mind though:

1. How do these benchmark results compare to writing x86_64 implementations directly? That is, is it enough to reach OpenCL-level performance?

2. What about if I want to use both a CPU and a GPU? It seems that would not be supported, as this library produces a `libcuda.so`.

3. What about AMD chips?

4. The README says:

> Authors of CUDA benchmarks used CUDA functions atomicInc and atomicDec which have direct hardware support on NVIDIA cards, but no hardware support on Intel cards. They have to be emulated in software, which limits performance

is there really no support, not even in the near future, for atomic increment on Intel chips?

5. This is all written in Rust. It's a bit fishy to me for a C library to be implemented in Rust, though maybe I'm just a little prejudiced.

6. Where is the documentation of the semantic differences between proper CUDA and ZLUDA? e.g. - what do streams do? How do events work? etc.


To be fair, I have the same feeling about Rust. C++ is the language positioned for high performance computing; almost everything that screams HPC is in C++. We already have problems in our company modifying some old Fortran libraries that are called from C++, and in the future God knows what Rust or Zig training we will require, for basically no upside for HPC. But if safety were my concern, I would totally look towards Rust. Or hopefully, Julia will save the day.


So how come this still can’t be done with AMD GPUs?



How does this work with CUDA libraries? Does it have an implementation of the standard libraries at least?


Does the same exist for AMD? Sorry for asking maybe the obvious, but could it be made?



My opinion on the subject: AMD actually does not care about GPU compute on consumer hardware.

If they did, they would not have:

- had support for Linux only, with no ROCm on Windows

- made ROCm a GPU-specific targeted process; there's no IR like PTX to make your current *binary* run on future GPU generations

- dropped support for GCN2/3 (https://github.com/RadeonOpenCompute/ROCm/issues/1353#issuec...), making the _only_ supported consumer GPU generation Vega, with no support for RDNA/RDNA2.

They obviously don't care about this market as much as they should, despite anything they might or might not say. Nothing to see here... It's solely their own fault that NVIDIA is the only option.

Intel so far has a much more competent strategy around GPU computing, and might prove to be an actual competitor. I've written off AMD as a possible competitor to NVIDIA for GPU computing a long time ago.


They dropped GCN2/3 support before ROCm was even anywhere near production-ready, but the biggest issue is the lack of cross-platform support, as you mentioned: you can’t even use it in Windows containers, and they never even attempted to support APUs.

Intel not only has better OpenCL support but will come out of the gate with oneAPI, which will support all Intel GPUs; this means productivity applications could use it whether it’s for a laptop or for a future productivity workstation with an Intel discrete GPU.

It’s pretty much impossible to buy a laptop with an AMD GPU and run ROCm on it; even the discrete cards are not officially supported, and since the ROCm binaries are hardware-specific, without official support things tend to be even more broken than they are now.

I really can’t understand how AMD could cock it up so badly.


Intel should back the crap outta this project/dev


Can I use ffmpeg with h264_nvenc to encode videos with it?


Video encoding uses different hardware and different APIs than GPGPU. FFmpeg can use Intel, AMD, and Nvidia hardware encoding; see https://trac.ffmpeg.org/wiki/HWAccelIntro


Yeah, I know, but NVENC is based on CUDA; that is, it doesn't interact with the GPU (Nvidia in this case) directly, it works with the CUDA library. So if ZLUDA is a CUDA compatibility layer for Intel GPUs, why not?


What's wrong with my question? Yes, I know it makes little sense because every motherboard with an Intel GPU probably has a CPU with Quick Sync Video support, and that will most certainly work better for encoding (h264_qsv), but in theory?


Eth?


You wouldn't care about GPU API when mining crypto. You just need more GPUs. There's nothing complicated about the software you need for that.


Yes please take business away from Nvidia and AMD. I just want to buy a GPU.


Fun fact: now even the latest lower-end Intel CPUs (Atom, Pentium, Celeron) will feature the Gen 11 iGPU, which is much improved over the last generation of Intel's built-in GPUs [1].

The latest iGPU delivers more than 1 TFLOP of GPU performance. This is apparently more than the performance of two years old Nvidia GeForce GT 1030.

[1]https://www.notebookcheck.net/Intel-s-Elkhart-Lake-SoC-will-...


>This is apparently more than the performance of two years old Nvidia GeForce GT 1030.

Almost 4 years old...

https://www.techpowerup.com/gpu-specs/geforce-gt-1030.c2954

"The GeForce GT 1030 is an entry-level graphics card by NVIDIA, launched in May 2017."


My bad; when the iGPU was first announced, the GT 1030 was barely two years old, as mentioned in the article.

Anyway, having an iGPU inside an entry-level Intel CPU with a potential 32 GB of RAM is amazing. The performance is probably not that far off from my old high-end Asus ROG gaming laptop that I bought in 2014 with an Nvidia GTX 700 series GPU.

A laptop with the entry-level iGPU will probably cost less than 20% of the original ROG laptop but can now support up to three 4K monitors!

If Asus can resurrect their infamous netbook series or cheap laptop lines with these new entry-level CPUs with the iGPU, I bet they can sell them like hotcakes.

[1]https://rog.asus.com/articles/g-series-gaming-laptops/asus-i...



