NVIDIA Transitions Fully Towards Open-Source Linux GPU Kernel Modules (nvidia.com)
881 points by shaicoleman 3 months ago | 254 comments



There is little meaning in NVIDIA open-sourcing only the kernel driver portion of their cards, since they heavily rely on proprietary firmware and a proprietary userspace library (most important!) to do the real job. Firmware is a relatively small issue - this is mostly the same for AMD and Intel, since encapsulation reduces the work done on the driver side, and open-sourcing the firmware could allow people to make some really unanticipated modifications which might seriously threaten even commercial card sales. Nonetheless, AMD at least still keeps a fair share of the work in the driver compared to Nvidia. The userspace library is the worst problem, since it handles a lot of GPU control functionality and the graphics APIs, and it is still kept closed-source.

The best we can hope for is that improvements to NVK and Red Hat's Nova driver will put pressure on NVIDIA to release their user space components.


It is meaningful because, as you note, it enables a fully open-source userspace driver. Of course the firmware is still proprietary, and it contains more and more logic.


Which in a way is good, because the hardware will increasingly perform identically on Linux and on Windows.


Doesn't seem like a bad tradeoff so long as the proprietary stuff is kept completely isolated with no access to any other parts of my system.


Personally, I somewhat wonder about that. The (proprietary) firmware which runs on the gpu seems like it'll have access to do things over the gpu PCIe bus, including reading system memory and accessing other devices (including network gear). Reading memory of remote hosts (i.e. RDMA) is also a thing which Nvidia gpus can do.


Is that not solvable using an IOMMU (assuming hardware that has one)?


No idea personally. :)


An IOMMU does solve it, at the cost of some performance. The GPU can only access memory that the IOMMU allows, and the part that programs the IOMMU is open source.

RDMA requires a special network card and is opt-in - an RDMA NIC cannot access any random memory, only specially registered regions. One could argue that a NIC FW bug could cause arbitrary memory accesses, but that's another place where an IOMMU would help.
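For what it's worth, here is a minimal libibverbs sketch of what "specially registered regions" means in practice (assuming an RDMA-capable NIC with the verbs stack installed; error handling omitted). Only the buffer handed to ibv_reg_mr, with exactly the access flags given, becomes reachable by the NIC:

    /* Minimal sketch: RDMA memory access is opt-in per region. */
    #include <infiniband/verbs.h>
    #include <stdlib.h>

    int main(void) {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        size_t len = 1 << 20;
        void *buf = malloc(len);

        /* Register exactly this 1 MiB region for local writes and remote reads.
         * Nothing outside [buf, buf+len) is visible to the NIC. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ);

        /* ... exchange buf's address and mr->rkey with the peer, post work requests ... */

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }

Anything that was never registered simply isn't addressable from the remote side, and the IOMMU adds a second, kernel-controlled layer on top of that.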


Awesome, thanks. :)


The GLX libraries are the elephant(s) in the room. Open source kernel modules mean nothing without these libraries. On the other hand, AMD and Intel use the "platform GLX" natively, and with great success.


Mesa already provides good open source GLX and Vulkan libraries. An open source NVIDIA kernel driver enables interoperability with Mesa exactly like Intel and AMD.


Half of the trade secrets NVIDIA has live in their own GLX libraries. Even if you install the open source kernel module, these proprietary GLX libraries still get installed (just did it on a new cluster).

I'm not holding my breath for these libraries to be phased out and NVIDIA to integrate with the platform GLX any time soon.

I think NVIDIA will resist moving to a firmware-only model (a la AMD & Intel) as long as they can, preferably forever.


The firmware is also signed, so you can't even do reverse engineering to replace it.


the open kernel driver also fundamentally breaks the limitation about geforce gpus not being licensed for use in the datacenter. that provision is a driver provision, and CUDA does not follow the same license as the driver... really the only significant limitation is that you aren't allowed to use the CUDA toolkit to develop for non-NVIDIA hardware, plus some license notice requirements if you redistribute the sample projects or other sample source code. and yeah, they paid to develop it, it's proprietary source code, that's reasonable overall.

https://docs.nvidia.com/cuda/eula/index.html

ctrl-f "datacenter": none

so yeah, I'm not sure where the assertions of "no progress" and "nothing meaningful" and "this changes nothing" come from, other than pure fanboyism/anti-fans. before, you couldn't write a libre CUDA userland even if you wanted to - the kernel side wasn't there. And now you can, and this allows reclocking and clocking-up of supported gpus even with nouveau-style libre userlands. Which of course don't grow on trees, but it's still progress.

honestly it's kinda embarrassing that grown-ass adults are still getting their positions from what is functionally just some sick burn in a 2004 viral video or whatever, to the extent they actively oppose the company moving in the direction of libre software at all. but I think with the "linus torvalds" citers, you just can't reason those people out of a position that they didn't reason themselves into. Not only is it an emotionally-driven (and fanboy-driven) mindset, but it's literally not even their own position to begin with, it's just something they're absorbing from youtube via osmosis.

Apple debates and NVIDIA debates always come down to the anti-fans bringing down the discourse. It's honestly sad. https://paulgraham.com/fh.html

it also generally speaks to the long-term success and intellectual victory of the GPL/FSF that people see proprietary software as somehow inherently bad and illegitimate... even when source is available, in some cases. Like CUDA's toolchain and libraries/ecosystem are pretty much the ideal example of a company paying to develop a solution that would not otherwise have been developed, in a market that was (at the time) not really interested until NVIDIA went ahead and proved the value. You don't get to ret-con every single successful software project as being retroactively open-source just because you really really want to run it on a competitor's hardware. But people now have this mindset that if it's not libre then it's somehow illegitimate.

Again, most CUDA stuff is distributed as source, if you want to modify and extend it you can do so, subject to the terms of the CUDA license... and that's not good enough either.


Can you link the source code for CUDA please? Thanks.

Edit since I'm being downvoted: I did search for it and could not find it.



I really don't know where this crap about "moving everything to the firmware" is coming from. The kernel part of the nvidia driver has always been small, and this is the only thing they are open-sourcing (they have been announcing it for months now...). The vast majority of the user-space driver is still closed, and no one has seen any indication that this may change.

I see no indication either that nvidia or any of the other manufacturers has moved any respectable amount of functionality to the firmware. If you look at the open-source drivers you can even confirm by yourself that the firmware does practically nothing -- the binary blobs of AMD cards are minuscule, for example, and the days of ATOMBIOS are long gone. The drivers are literally generating bytecode-level binaries for the shader units in the GPU; what do you expect the firmware could even do at this point? Re-optimize the compiler output?

There was an example of a GPU that did move everything to the firmware -- the VideoCore on the Raspberry Pi, and it was clearly a completely distinct paradigm, as the "driver" would almost literally pass OpenGL calls through a mailbox, read by the VideoCore VPU (more powerful than the main ARM core!) that was basically running the actual driver as "firmware". Nothing I see on nvidia indicates a similar trend, otherwise RE-ing it would be trivial, as happened with the VC.


https://lwn.net/Articles/953144/

> Recently, though, the company has rearchitected its products, adding a large RISC-V processor (the GPU system processor, or GSP) and moving much of the functionality once handled by drivers into the GSP firmware. The company allows that firmware to be used by Linux and shipped by distributors. This arrangement brings a number of advantages; for example, it is now possible for the kernel to do reclocking of NVIDIA GPUs, running them at full speed just like the proprietary drivers can. It is, he said, a big improvement over the Nouveau-only firmware that was provided previously.

> There are a number of disadvantages too, though. The firmware provides no stable ABI, and a lot of the calls it provides are not documented. The firmware files themselves are large, in the range of 20-30MB, and two of them are required for any given device. That significantly bloats a system's /boot directory and initramfs image (which must provide every version of the firmware that the kernel might need), and forces the Nouveau developers to be strict and careful about picking up firmware updates.


>> I see no indication either that nvidia or any of the other manufacturers has moved any respectable amount of functionality to the firmware.

Someone who believes this could easily prove that they are correct by "simply" taking their 4090 and documenting all its functionality, as was done with the [7900 xtx](https://github.com/geohot/7900xtx).

You can't say "I see no indications/evidence" unless you have proven that there is no evidence, no?


so basically “if you really think there’s no proof of a positive claim, then you won’t mind conclusively proving the negation”?

no, that’s not how either logical propositions or burden of proof works


He has already told you how to prove it: enumerate the functionality of the driver - the GPU and the code are finite, bounded environments. You can absolutely prove that there is no tea in a cup, that there are no coins in a purse, that there is no cat in a box, etc.


> no, that’s not how either logical propositions or burden of proof works

I think you're missing the point, perhaps intentionally to make a smart-sounding point?

We're programmers, working on _specific physical things_. If I claim that my CPU's branch predictor is not doing something, it is only prudent to find out what it is doing, and enumerate the finite set of what it contains.

Does that make sense? The goal is to figure out _how things actually work_ rather than making claims and arguing past each other until the end of time.

Perhaps you don't care about what the firmware blobs contain, and so you'd rather have an academic debate about logical propositions, but I care about the damn blobs, because it matters for my present and future work.


These aren't necessarily conflicting assessments. The addition of the GSP to Turing and later GPUs does mean that some behavior can be moved on-device from the drivers. Device initialization and management is an important piece of behavior, certainly, but in the context of all the work done by the Nvidia driver (both kernel and user-space), it is a relatively tiny portion (e.g. compiling/optimizing shaders and kernels, video encode/decode, etc.).


There IS meaning because this makes it easier to install Nvidia drivers. At least, it reduces the number of failure modes. Now the open-source component can be managed by the kernel team, while the closed-source portion can be changed as needed, not dictated by kernel API changes.


Why is the user space component required? Won't they provide sysfs interfaces to control the hardware?


It's something common to all modern GPUs, not just NVIDIA: most of the logic is in a user space library loaded by the OpenGL or Vulkan loader into each program. That library writes a stream of commands into a buffer (plus all the necessary data) directly into memory accessible to the GPU, and there's a single system call at the end to ask the operating system kernel to tell the GPU to start reading from that command buffer. That is, other than memory allocation and a few other privileged operations, the user space programs talk directly to the GPU.
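To make that concrete, here is an illustrative-only sketch of that submission pattern. The struct and ioctl below are hypothetical, not any real driver's UAPI; a real stack (Mesa or NVIDIA's proprietary userspace) has much more machinery around this, but the shape is the same:

    /* Hypothetical user-space GPU command submission (made-up UAPI). */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    struct fake_gpu_submit {
        uint64_t cmdbuf_gpu_addr;   /* GPU-visible address of the command stream */
        uint32_t cmdbuf_len;        /* length of the command stream in bytes */
    };
    #define FAKE_GPU_SUBMIT _IOW('G', 0x01, struct fake_gpu_submit)

    /* Called by a user-space OpenGL/Vulkan library; gpu_fd is the render node,
     * cmdbuf points at memory previously mapped so the GPU can read it. */
    static int submit_commands(int gpu_fd, uint32_t *cmdbuf,
                               uint64_t gpu_addr, uint32_t len)
    {
        /* 1. User space encodes draw/dispatch packets straight into the buffer.
         *    No system calls happen here. */
        cmdbuf[0] = 0xdeadbeef;     /* stand-in for real command packets */

        /* 2. One privileged call: ask the kernel to point the GPU at it. */
        struct fake_gpu_submit req = {
            .cmdbuf_gpu_addr = gpu_addr,
            .cmdbuf_len      = len,
        };
        return ioctl(gpu_fd, FAKE_GPU_SUBMIT, &req);
    }

Everything up to that final ioctl lives in the user-space library, which is why an open kernel module by itself says relatively little about the rest of the stack.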


I remember Nvidia getting hacked pretty bad a few years ago. IIRC, the hackers threatened to release everything they had unless they open sourced their drivers. Maybe they got what they wanted.

[0] https://portswigger.net/daily-swig/nvidia-hackers-allegedly-...


For Nvidia, the most likely reason they've strongly avoided Open Sourcing their drivers isn't anything like that.

It's simply a function of their history. They used to have high priced professional level graphics cards ("Nvidia Quadro") using exactly the same chips as their consumer graphics cards.

The BIOS of the cards was different, enabling different features. So people wanting those features cheaply would buy the consumer graphics cards and flash the matching Quadro BIOS to them. Worked perfectly fine.

Nvidia naturally wasn't happy about those "lost sales", so began a game of whack-a-mole to stop BIOS flashing from working. They did stuff like adding resistors to the boards to tell the card whether it was a Geforce or Quadro card, and when that was promptly reverse engineered they started getting creative in other ways.

Meanwhile, they couldn't really Open Source their drivers because then people could see what the "Geforce vs Quadro" software checks were. That would open up software countermeasures being developed.

---

In the most recent few years the professional cards and gaming cards now use different chips. So the BIOS tricks are no longer relevant.

Which means Nvidia can "safely" Open Source their drivers now, and they've begun doing so.

--

Note that this is a copy of my comment from several months ago, as it's just as relevant now as it was then: https://news.ycombinator.com/item?id=38418278


Very interesting, thanks for the perspective. I suspect the recent loss of face they experienced with the transition to Wayland, which happened around the time this motivation evaporated, probably plays a part too though.

I swore off ever again buying Nvidia, or any laptops that come with Nvidia, after all this. Maybe in 10 years they'll have managed to right the brand perceptions of people like myself.


interesting timing to recall that story. now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version is called.

but those companies are really averse to open sourcing because they can't be sure they own all the code. it's decades of copy-pasting reference implementations, after all


> now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version

No. H20 is a different chip designed to be less compute-dense (by having different combinations of SM/L2$/HBM controller). It is not a throttled chip.

A800 and H800 are A100/H100 with some area of the chip physically blown up and reconfigured. They are also not simply throttled.


that's what nvidia told everyone in mar 23... but there's a reason why h800 were included last minute on the embargo in oct 23.


That's not what NVIDIA claimed, that's what I have personally verified.

> there's a reason why h800 were included last minute

No. The Oct 22 restrictions are by themselves significantly easier than the Oct 23 ones. NVIDIA just needed to kill 4 NVLink lanes off the A100 and you get the A800. For the H100 you kill some more NVLink until, on paper, NVLink bandwidth is roughly at A800 level again, and voila.

BIS was certainly pissed off by NVIDIA's attempt at being creative to sell the best possible product to China. So they actually lowered the allowed compute number AGAIN in Oct 23. That's what killed the H800.


I see. thanks for the details.


The explanation could also be as simple as fear of patent trolls.


I doubt it. It's probably a matter of constantly being prodded by their industry partners (i.e. Red Hat), constantly being shamed by the community, and reducing the amount of maintenance they need to do to keep their driver stack updated and working on new kernels.

The meat of the drivers is still proprietary, this just allows them to be loaded without a proprietary kernel module.


Nvidia has historically given zero fucks about the opinions of their partners.

So my guess is it's to do with LLMs. They are all in on AI, and having more of their code be part of training sets could make tools like ChatGPT/Claude/Copilot better at generating code for Nvidia GPUs.


Yup. nVidia wants those fat compute center checks to keep coming in. It's an unsaturated market, unlike gaming consoles, home gaming PCs, and design/production workstations. They got a taste of that blockchain dollar, and now AI looks to double down on the demand.

The best solution is to have the industry eat their dogfood.


I also see this as the main reason. GPU drivers for Linux, as far as I know, were just a niche use case; maybe CUDA planted a small seed, and the AI hype is the flower. Now the industry, not the users, demands drivers, so this became a demanded feature instead of a niche user wish.

A bit sad, but hey, welcome anyways.


I suspect it's mainly the reduced maintenance and the reduction in the workload needed for support, especially with more platforms needing to be supported (not so long ago there was no ARM64 nvidia support, now they are shipping their own ARM64 servers!)

What really changed the situation is that Turing-architecture GPUs bring a new, more powerful management CPU, which has enough capacity to essentially run the OS-agnostic parts of the driver that used to be provided as a blob on Linux.


Am I correct in reading that as Turing architecture cards include a small CPU on the GPU board, running parts of the driver/other code?


In the Turing microarchitecture, nVidia replaced their old "Falcon" CPU with an NV-RISCV RV64 core, running various internal tasks.

"Open Drivers" from nVidia include different firmware that utilizes the new-found performance.


How well isolated is this secondary computer? Do we have reason to fear the proprietary software running on it?


As well isolated as anything else on the bus.

So you'd better actually use the IOMMU.


Ah, yes, the magical IOMMU controller, that everybody just assumes to be implemented perfectly across the board. I'm expecting this to be like Hyperthreading, where we find out 20 years later, that the feature was faulty/maybe_bugdoored since inception in many/most/all implementations.

Same thing with USB3/TB-controllers, NPUs, etc that everybody just expects to be perfectly implemented to spec, with flawless firmwares.


It's not perfect or anything, but it's usually a step up[1], and the funniest thing is that GPUs generally had fewer ... "interesting" compute facilities to jump over from; they were just usually easier to access. My first 64-bit laptop, my first Android smartphone, and the first few iPhones had more MIPS32le cores with possible DMA access to memory than main CPU cores, and that was just counting one component of many (the wifi chip).

Also, Hyperthreading wasn't itself faulty or "bugdoored". The tricks necessary to get high performance out of CPUs were, and then there was Intel deciding to drop various good precautions in the name of still higher single-core performance.

Fortunately, after several years, IOMMU availability has become more common (the laptop I'm writing this on seems to have proper separate groups for every device).

[1] There's always the OpenBSD approach of navel-gazing about writing "secure" C code, becoming slowly obsolescent thanks to being behind in performance and features, and ultimately getting pwned because your focus on C and on not implementing "complex" features that help mitigate access results in a pwnable SMTPd running as root.
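If you want to see what the kernel actually did on a given box, the grouping is exposed under /sys/kernel/iommu_groups. A quick Linux-only sketch (assuming that sysfs layout) that prints each group and its devices:

    /* List IOMMU groups and the PCI devices in each.  An empty or missing
     * directory usually means the IOMMU is disabled or not exposed. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void) {
        const char *root = "/sys/kernel/iommu_groups";
        DIR *groups = opendir(root);
        if (!groups) {
            puts("no IOMMU groups - IOMMU disabled or unsupported?");
            return 1;
        }
        struct dirent *g;
        while ((g = readdir(groups)) != NULL) {
            if (g->d_name[0] == '.')
                continue;
            char path[512];
            snprintf(path, sizeof path, "%s/%s/devices", root, g->d_name);
            DIR *devs = opendir(path);
            if (!devs)
                continue;
            printf("group %s:", g->d_name);
            struct dirent *d;
            while ((d = readdir(devs)) != NULL)
                if (d->d_name[0] != '.')
                    printf(" %s", d->d_name);   /* e.g. 0000:01:00.0 */
            printf("\n");
            closedir(devs);
        }
        closedir(groups);
        return 0;
    }

Devices that share a group can't be isolated from each other, so ideally the GPU sits in a group of its own.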


All fine and well, but I always come back to: "If I were a manufacturer/creator of some work/device/software that does something in the plausible realm of 'telecommunication', how do I make sure that my product can always comply with https://en.wikipedia.org/wiki/Lawful_interception requests? Allow for ingress/egress of data/commands at as low a level as possible!"

So as a director of a chipset maker it would seem like a no-brainer to me to have to tell my engineers, unfortunately, not to fix some exploitable bug in the IOMMU/chipset. Unless I want to never sell devices that could potentially be used to move citizens' internet packets around in a large-scale deployment.

And implement/not_fix something similar in other layers as well, e.g. ME.


If your product is supposed to comply with Lawful Interception, you're going to implement proper LI interfaces, not leave bullshit DMA bugs in.

The very point of Lawful Interception involves explicit, described interfaces, so that all parties involved can do the work.

The systems with LI interfaces also often end up in jurisdictions that simultaneously put high penalties on giving access to them without specific authorizations - I know, I had to sign some really interesting legalese once due to working in an environment where we had to balance Lawful Interception, post-facto access to data, and telecommunications privacy laws.

Leaving backdoors like that is for Unlawful Interception, and the danger of such approaches was greatly exposed in the form of Chinese intelligence services exploiting the NSA backdoor in Juniper routers (the infamous Dual_EC_DRBG RNG).


> you better actually use IOMMU

Is this feature commonly present on PC hardware? I've only ever read about it in the context of smartphone security. I've also read that nvidia doesn't like this sort of thing because it allows virtualizing their cards which is supposed to be an "enterprise" feature.


Relatively common nowadays. It used to be delineated as a feature in Intel chips as part of their vPro line, but I think it’s baked in. Generally an IOMMU is needed for performant PCI passthrough to VMs, and Windows uses it for DeviceGuard which tries to prevent DMA attacks.


Mainstream consumer x86 processors have had IOMMU capability for over a decade, but for the first few years it was commonly disabled on certain parts for product segmentation (e.g. the i5-3570K had overclocking but no IOMMU, the i5-3570 had an IOMMU but limited overclocking). That practice died off approximately when Thunderbolt started to catch on, because not having an IOMMU when using Thunderbolt would have been very bad.


Seems to me that Zen 4 has no issues at all, but bridges/switches require additional interfaces to further fan-out access controls.


It's hard to believe one of the highest valued companies in the world cares about being shamed for not having open source drivers.


They care when it affects their bottom line, and customers leaving for the competition does that.

I don't know if that's what's happening here, honestly, and you're right that they don't care about being shamed. But building a reputation of being hard to work with and to target, especially in a growing market like Linux (still tiny, but growing nonetheless, and becoming significantly more important where non-gaming GPU use is concerned), can start to erode sales and B2B relationships, the latter particularly if you make programmers and PMs hate using your products.


> customers leaving for the competition does that

What competition?

I do agree that companies don't really care about public sentiment as long as business is going as usual. Nvidia is printing money with their data center hardware [1], which accounts for half of their yearly revenue.

https://nvidianews.nvidia.com/news/nvidia-announces-financia...


> in a growing market like Linux

Isn't Linux 80% of their market? ML et al is 80% of their sales, and ~99% of that is Linux.


True, although note that the Linux market itself is increasing in size due to ML. Maybe "increasingly dominant market" is a better phrase here.


Hah, good point. The OP was pedantically correct. The implication in "growing market share" is that "market share" is small, but that's definitely reading between the lines!


Right, and that's where most of their growth is.


Having products that require a bunch of extra work due to proprietary drivers, especially when their competitors don't require that work, is not good.


The biggest chunk of that "extra work" would be installing Linux in the first place, given that almost everything comes with Windows out of the box. An additional "sudo apt install nvidia-drivers" isn't going to stop anyone who already got that far.


Does the "everything comes with Windows out of the box" still apply for the servers and workstations where I imagine the vast majority of these high-end GPUs are going these days?


Tainted kernel. Having to sort out secure boot problems caused by use of an out of tree module. DKMS. Annoying weird issues with different kernel versions and problems running the bleeding edge.


Most cloud instances come with Linux out of the box.


I mean I've personally given our Nvidia rep some light hearted shit for it. Told him I'd appreciate if he passed the feedback up the chain. Can't hurt to provide feedback!


Kernel modules are not the user-space drivers, which are still proprietary.


Ooops. Missed that part.

Re-reading that story is kind of wild. I don't know how valuable what they allegedly got would be (silicon, graphics and chipset files) but the hackers accused Nvidia of 'hacking back' and encrypting their data.

Reminds me of a story I heard about Nvidia hiring a private military to guard their cards after entire shipments started getting 'lost' somewhere in asia.


Wait what? That PMC story got me. Where can I find more info on that lmao?


I'd heard the story first hand from a guy in San Jose. Never looked it up until now. This is the closest thing I could find to it, in which case it sounds like it's been debunked.

[0] https://www.pcgamer.com/no-half-a-million-geforce-rtx-30-ser...

[1] https://www.geeknetic.es/Noticia/20794/Encuentran-en-Corea-5...


Much of the black magic has been moved from the drivers to the firmware anyway.


they did release it. a magic drive i have seen, but totally do not own, has it


Huh. Sway and Wayland was such a nightmare on Nvidia that it convinced me to switch to AMD. I wonder if it's better now.

(IIRC the main issue was https://gitlab.freedesktop.org/xorg/xserver/-/issues/1317 , which is now complete.)


Better as of extremely recently. Explicit sync fixes most of the issues with flickering that I’ve had on Wayland. I’ve been using the latest (beta?) driver for a while because of it.

I’m using Hyprland though so explicit sync support isn’t entirely there for me yet. It’s actively being worked on. But in the last few months it’s gotten a lot better


> Better as of extremely recently.

Yup. Anecdotally, I see a lot of folks trying to run wine/games on Wayland reporting flickering issues that are gone as of version 555, which is the most recent release save for 560 coming out this week. It's a good time to be on the bleeding edge.


On latest NixOS unstable and KDE + Wayland is still a bit of a dumpster fire for me (3070 + latest NV drivers). In particular there’s a buffer wait bug in EGL that needs fixing on the Nvidia side that causes the Plasma UI to become unresponsive. Panels are also broken for me, with icons not showing.

Having said that, the latest is a pain on X11 right now as well, with frequent crashing of Plasma, which at least restarts itself.

There’s a lot of bleeding on the bleeding edge right at this moment :)


That's interesting, maybe it's hardware-dependent? I'm doing nixos + KDE + Wayland and I've had almost no issues in day-to-day usage and productivity.

I agree with you that there's a lot of bleeding. Linux is nicer than it used to be and there's less fiddling required to get to a usable base, but still plenty of fiddling as you get into more niche usage, especially when it involves any GPU hardware/software. Yet somehow one can run Elden Ring on Steam via Proton with a few mouse clicks and no issues, which would've been inconceivable to me only a few years ago.


Yeah it’s pretty awesome overall. I think the issues are from a few things on my end:

- I’ve upgraded through a few iterations starting with Plasma 6, so my dotfiles might be a bit wonky. I’m not using Home Manager so my dotfiles are stateful.

- Could be very particular to my dock setup as I have two docks + one of the clock widgets.

- Could be the particular wallpaper I’m using (it’s one of the dynamic ones that comes with KDE).

- It wouldn’t surprise me if it’s related to audio somehow as I have Bluetooth set-up for when I need it.

I’m sure it’ll settle soon enough :)


I've been having a similar flakiness with plasma on Nixos (proprietary + 3070 as well). Sadly can't say whether it did{n't} happen on another distro as I last used Arch around the v535 driver.

I found it funny how silently it would fail at times. After coming out of a game or focusing on something, I'd scratch my head as to where the docks/background went. I'd say you're lucky in that it recovered itself; generally I needed to run `plasmashell` in the alt+f2 run prompt.


I think it's X11 stuff that is using Vulkan for rendering that is still flickering in 555. This probably affects pretty much all of Proton / Wine gaming.


Any specific examples that you know should be broken? I am on X11 with 555 drivers and an nvidia gpu. I don't have any flickering when I'm gaming, it's actually why I stay on X11 instead of transitioning to wayland.


They are probably talking about running the game in a wayland session via xwayland, since wine's wayland driver is not part of proton yet.


You can always use X11. /s


I know that was a joke, but - as someone who is still on X, what am I missing? Any practical advantages to using Wayland when using a single monitor on desktop computer?


Even that single monitor can be hidpi, vrr or hdr (this one is still wip).


I have a 165 DPI monitor. This honestly just works with far less hassle on X. I don't have to listen to anyone try to explain to me how fractional scaling doesn't make sense (real explanation for why it wasn't supported). I don't have to deal with some silly explanation for why XWayland applications just can't be non-blurry with a fractional or non-1 scaling factor. I can just set the DPI to the value I calculated and things work in 99% of cases. In 0.9% of the remaining cases I need to set an environment variable or pass a flag to fix a buggy application and in the 0.1% of cases I need to make a change to the code.

VRR has always worked for me on single monitor X. I use it on my gaming computer (so about twice a year).


Same, can't understand people evangelizing Wayland

I have a 10.1" 2560x1600 laptop with a 32" monitor, and another 27", and never had any problem

Wayland has practically no advantages, you have to spend hours configuring, and still have apps working badly... they are always just a month away from having "everything" fixed

Maybe Wayland is the future but I'll keep using Xorg distros for the foreseeable future


You guys must be using some different X11 than the rest of us.

Basically, with X11 and hidpi, all you can do is set up the system to announce a certain dpi value and hope that the clients will cope. Some can (I know of exactly two of them: Chrome and Firefox); others will just bump up the font size and, hopefully, are using a layout so the window sizes adjust to accommodate the textboxes, but all the non-text assets will stay low-res as they were, because they do not have any others. Apps for remote desktop access or VM consoles won't be able to display the remote/VM correctly. And the rest will just ignore it and you get tiny stuff on the display.

And this is just the hidpi issue with a single display. I won't go into the problems when running with multiple displays with different dpi.

I also do not have a faintest idea of what "setting up Wayland" might mean. What did you set up? How? The only thing that needs to "set up" is to pick a wayland session in the display manager. There's no xorg.conf for wayland, setting up drivers, etc. What did you configure "for hours"?

I've been using 4K 27" for over a decade, and Wayland, since Fedora made it default. Since I have no 20-year old xdotool scripts, or others that inject events or try to grab pixmaps, I've had no problem.


It's possible there might be a misunderstanding as to what "working" means. For me, if there's vaseline anywhere on my screen, that's strictly worse than tiny fonts I need a magnifying glass for. I'd rather have no scaling than nearest neighbour interpolation.

> You guys must be using some different X11 than the rest of us.

Speak for yourself, I know plenty of people who are able to get non-96-DPI working on X with just Xft.dpi and some environment variables.

> Some can (I know of exactly two of them: Chrome and Firefox), others will up bump up the font size and hopefully are using a layout, so the window sizes will adjust to accommodate the textboxes, but all the non-text assets will stay low-res how they were, because they do not have any other.

This is an application bug (non-text assets not getting scaled up) and will hardly be fixed by anything other than vaselining the text and icons, as happens for an equivalently non-DPI-aware application on wayland.

The vast majority of modern software works just fine.

> Apps for remote desktop access or vm console won't be able to display remote/vm correctly.

Does Wayland solve this in any other way other than to vaseline it all up? xfreerdp has /scale. When it comes to VMs I use through spice you just set their DPI settings individually to match your host, then you get nice scaling without vaseline. AFAIK in wayland this all gets vaselined.

> And this is just the hidpi issue with single display. Won't go into the problems when running with multiple displays, with different dpi.

Don't run multiple displays with different DPI. It's an unsolvable problem in the X11/Wayland ecosystem. You need to keep everything as postscript or something equivalent all the way up until the point you know which monitor it's rendered on.

Of all the things Wayland could have actually gone out and fixed, this is one they eschewed in favour of "ah screw it, just give all the applications some graphics buffers and let them figure it out".

> I also do not have a faintest idea of what "setting up Wayland" might mean. What did you set up? How? The only thing that needs to "set up" is to pick a wayland session in the display manager. There's no xorg.conf for wayland, setting up drivers, etc. What did you configure "for hours"?

I know exactly what guilhas means.

Some people are not content with Ubuntu Gnome at a integer scaling factor, they're running highly bespoke setups where everything from the display manager to the screen-grab stuff is customized or custom written. So you spend a lot of time and effort switching to sway, switching to wayland, switching to wayland native versions of a terminal, fixing firefox so it starts in wayland mode, fiddling with the nonsensical scaling settings to actually get firefox to render at the right size, figuring out how to get your screenshot binding to work again, figuring out how to get all your applications to start in the right version, being dismayed when something which still uses X11 runs in XWayland and looks like vaseline because of weird design decisions which are incomprehensible (meanwhile that same application with Xft.dpi set to the right value renders flawlessly).

Eventually you get it all back up and running and you play with it for a week and you spot 20 things which subtly work differently or outright break, you spend hours looking for a solution to only get half of it working.

Right now wayland works mostly fine for the Ubuntu Gnome user or the Kubuntu user (except issues getting non-integer scaling factors working or issues with things needing XWayland) but it's nowhere near as easy to get up and running for someone running a non-standard setup.


It's still buggy with sway on nvidia. I really thought the 555 driver would iron out the last of the issues, but it still has further to go. I switched to KDE Plasma 6 on wayland since then and it's been great, not buggy at all.


Easy Linux use is what keeps me firmly on AMD. This move may earn them a customer.


why switch to amd and not just switch to X? :D


once you go Wayland you usually don’t go back :)


I tried wayland (on amd) and found it annoying to work with compared to x11, without any apparent benefits. Wayland is definitely the future, but I don't think the future is now.


I tested wayland for a while to see what the hype is about. No upside, lots of small workflows broken. Back to Xorg it was.


Why not both?


From the github repo[0]:

Most of NVIDIA's kernel modules are split into two components:

    An "OS-agnostic" component: this is the component of each kernel module that is independent of operating system.

    A "kernel interface layer": this is the component of each kernel module that is specific to the Linux kernel version and configuration.
When packaged in the NVIDIA .run installation package, the OS-agnostic component is provided as a binary:

[0] https://github.com/NVIDIA/open-gpu-kernel-modules


That was the case for the "classic" drivers.

The new open source ones effectively move the majority of the OS-agnostic component to run as a blob on the GPU.


Not quite - it moves some logic to the GSP firmware, but the user-space driver is still a significant portion of code.

The exciting bit there is the work on NVK.


Yes, I was not including userspace driver in this, as a bit "out of scope" for the conversation :D


How is the NVIDIA driver situation on Linux these days? I built a new desktop with an AMD GPU since I didn't want to deal with all the weirdness of closed source or lacking/obsolete open source drivers.


I built my new-ish computer with an AMD GPU because I trusted in-kernel drivers better than out-of-kernel DKMS drivers.

That said, my previous experience with the DKMS driver stuff hasn't been bad. If you use Nvidia's proprietary driver stack, then things should generally be fine. The worst issues are that Nvidia has (historically, at least; it might be different for newer cards) refused to implement some graphics features that everybody else uses, which means that you basically need entirely separate codepaths for Nvidia in window managers, and some of them have basically said "fuck no" to doing that.


The current stable proprietary driver is a nightmare on Wayland with my 3070, constant flickering and stuttering everywhere. Apparently the upcoming version 555 is much better, I'm sticking with X11 until it comes out. I never tried the open-source one yet, not sure if it supports my GPU at all.


The 555 version is the current version. It was officially released on June 27.

https://www.phoronix.com/news/NVIDIA-555.58-Linux-Driver


In defense of the parent, upcoming can still be a relative term, albeit a bit misleading. For example: I'm running the 550 drivers still because my upstream nixos-unstable doesn't have 555 for me yet.


I love NixOS, and the nvidia-x11 package is truly wonderful and captures so many options. But having such a complex package makes updating and regression testing take time. For ML stuff I ended up using it as the basis for an overlay and ripping out literally everything I don't need, which usually makes it a matter of minutes to make the changes required to upgrade when a new driver is released. I'm running completely headless because these are H100 nodes, and I just need persistenced and fabricmanager, and GDRMA (which wasn't working at all, causing me to go down this rabbit hole of stripping everything away until I could figure out why).


I was going to say specialisations might be useful for you to keep a previous driver version around for testing but you might be past that point!

Having the ability to keep alternate configurations for $previous_kernel and $nvidia_stable has been super helpful in diagnosing instead of rolling back.


> nixos-unstable doesn't have 555

Version 555.58.02 is under “latest” in nixos-unstable as of about three weeks ago[1]. (Somebody should check with qyliss if she knows the PR tracker is dead... But the last nixos-unstable bump was two days ago, so it’s there.)

[1] https://github.com/NixOS/nixpkgs/commit/4e15c4a8ad30c02d6c26...


`nvidia-smi` shows that my driver version is 550.78. I ran `nixos-rebuild switch --upgrade` yesterday. My nixos channel is `nixos-unstable`.

Do you know something I don't? I'd love to be on the latest version.

I should have written my post better, it implies that 555 does not exist in nixpkgs, which I never meant. There's certainly a phrasing that captures what I'm seeing more accurately.


Are you using flakes? If you don't do `nix flake update` there won't be all that much to update.


I am! I forgot about this. Mental model check happening.

(Still on 550.)


I did not mean to chastise you or anything, just to suggest you could be able to have a newer driver if you had missed the possibility.

The thing is, AFAIU, NVIDIA has several release channels for their Linux driver[1] and 555 is not (yet?) the "production" one, which is what NixOS defaults to (550 is). If you want a different degree of freshness for your NVIDIA driver, you need to say so explicitly[2]. The necessary incantation should be

  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.latest;
This is somewhat similar to how you get a newer kernel by setting boot.kernelPackages to linuxPackages_latest, for example, in case you've ever done that.

[1] https://www.nvidia.com/en-us/drivers/unix/

[2] https://nixos.wiki/wiki/Nvidia


I had this configuration but was lacking a flake update to move my nixpkgs forward despite the channel, which I can understand much better looking back.

Thanks for the additional info, this HN thread has helped me quite a bit.


The versions that nixos provides are based on the files in this repo

https://github.com/aaronp24/nvidia-versions

See: https://github.com/NixOS/nixpkgs/blob/9355fa86e6f27422963132...

You could also opt to use the latest driver instead of stable: https://nixos.wiki/wiki/Nvidia


Yep, I'm on openSUSE Tumbleweed, and it's not rolled out there yet. I would rather wait than update my drivers out-of-band.


I switched to Wayland 10 years ago when it became an option on Fedora. The first thing I had to do was drop NVIDIA and switch to an Intel GPU, and for the past 5 years an AMD GPU. It makes a big difference if the upstream kernel is supported.

Maybe NVIDIA drivers have kind of worked on the 12-month-old kernels that Ubuntu uses on average.


this is resolved in 555 (currently running 555.58.02). my asus zephyrus g15 w/ 3060 is looking real good on Fedora 40. there's still optimizations needed around clocking, power, and thermals. but the graphics presentation layer has no issues on wayland. that's with hybrid/optimus/prime switching, which has NEVER worked seamlessly for me on any laptop on linux going back to 2010. gnome window animations remain snappy and not glitchy while running a game. i'm getting 60fps+ running baldurs gate 3 @ 1440p on the low preset.


Had similar experience with my Legion 5i 3070 with Wayland and Nvidia 555, but my HDMI out is all screwed up now of course. Working on 550. One step forward and one step back.


is there a mux switch?


I have a 3070 on X and it has been great.


Same setup here. Multiple displays don't work well for me. One of the displays often doesn't get detected after resuming from the screen saver.


I have two monitors connected to the 3070 and it works well. The only issue I had was suspending: the GPU would "fall off the bus" and not get its power back when the PC woke up. I had to add the kernel parameter "pcie_aspm=off" to prevent the GPU from falling asleep.

So... not perfect, but it works.


Huh. I’m using 2 monitors connected to a 4090 on Linux mint - which is still using X11. It works flawlessly, including DPI scaling. Wake from sleep is fine too.

I haven’t tried wayland yet. Sounds like it might be time soon given other comments in this thread.


I've literally never had an issue in decades of using NVIDIA and linux. They're closed source, but the drivers work very consistently for me. NVIDIA's just the only option if you want something actually good and to run ML workloads as well.


> but the drivers work very consistently for me

The problem with comments like this is that you never know if you will be me or you on your graphics card or laptop.

I have tried nvidia a few times and kept getting burnt. AMD just works. I don't get the fastest ML machine, but I am just a tinkerer there and OpenCL works fine for my little toy apps and my 7900XTX blazes through every wine game.

If you need it professionally then you need it, warts and all. For any casual user, that 10% extra gaming performance needs to be weighed against reliability.


It also depends heavily on the user.

A mechanic might say "This car has never given me a problem" because the mechanic doesn't consider cleaning an idle bypass circuit or adjusting valve clearances to be a "problem". To 99% of the population, though, those are expensive and annoying problems, because they have no idea what those words even mean, much less have the ability to troubleshoot, diagnose, and repair.


A lot of it probably has to do with not really understanding their distribution's package manager, and LKMs specifically. I also always suspected that most Linux users don't know whether they are using Wayland or X11, and that the issues they had were actually Wayland-specific ones they wouldn't have had with Nvidia/X11. And come to think of it, how would they even know it's a GPU driver issue in the first place? Guess I'm the mechanic in your analogy.


If there's an issue with Nvidia/Wayland and there isn't with AMD/Wayland or Intel/Wayland, then it is an Nvidia issue, not a Wayland one.


When I run Gentoo or Arch, I know. But when I run Ubuntu or Fedora, should I have needed to know?

On plenty of distros, "I want to install it and forget about it" is reasonable, and on both Gentoo and Ubuntu I have rebooted from a working system into a system where the display stopped working. At least on Gentoo I was ready, because I had broken it somehow.


Absolutely. I once had an issue with a kernel/user-space driver version mismatch in Ubuntu; trivial to fix, and the kernel logs tell you what's wrong. But yeah, I get that most users don't read their kernel logs, and it shouldn't be an expectation for normal users of Linux to do so. The experiences are just very different; it's why the car mechanic analogy fits so well.

I think it also got so much better over time. I've been using Linux since Debian woody (22 years ago); the stuff you had to deal with back then heavily skews my perspective on what users today see as unacceptable brokenness in the Nvidia driver.


I've run NixOS for almost a decade now and I honestly would not recommend anything else. I've had many issues with booting on almost every distro. They're about as reliable as Windows in that regard. NixOS has been absolutely rock solid; beyond anything I could possibly have hoped for. In the extremely rare case my system would not boot, I've either found a hardware problem that would affect anyone, or I could just revert to a previous system revision and boot up. Never had any problem. No longer use anything else because it's just too risky


If you use a search engine for "Torvalds Nvidia" you will discern a certain attitude towards Nvidia as a corporation and its products.

This might provide you a suggestion that alternate manufacturers should be considered.

I have confirmed this to be the case on Google and Bing, so DuckDuckGo and Startpage will also exhibit this phenomenon.


An opinion on support from over ten years ago is not a very strong suggestion.


Your problem there is that both search engines place this image and backstory at the top of the results, so neither Google nor Bing agree with any of you.

If you think they're wrong, be sure to let them know.


What Torvalds is complaining about is absolutely true, but the problem is that most users do not give a shit about those issues. Torvalds' disagreement wasn't about bugs in, or complaints about the quality of, the proprietary driver; he complained about nvidia's lack of open source contributions and bad behavior towards the kernel developer community. But users don't care if they run a proprietary driver as long as it works (and it does work fine for most people).

So you see now why that's not very relevant to end-users experiences they were talking about?


No.


Do you think Google and Bing are endorsing top results, and in particular endorsing a result like that in the specific context of what manufacturers I consider buying from?

That's the only way they would be disagreeing with me.


Torvalds has said nasty mean things to a lot of people in the past, and expressed regret over his temper & hyperbole. Try searching for something more recent https://youtu.be/wvQ0N56pW74


> AMD just works. I don't get the fastest ML machine, but I am just a tinkerer there and OpenCL works fine for my little toy apps and my 7900XTX blazes through every wine game.

That's the opposite of my experience. I'd love to support open-source. But the AMD experience is just too flaky, too card-dependent. NVidia is rock-solid (maybe not for Wayland, but I never wanted Wayland in the first place).


What kind of flakiness? The only AMD GPU problem I have had involved a lightning strike killing a card while I was gaming.

My nvidia problems are generally software and update related. The NVidia stuff usually works on popular distros, but as soon as anything custom or a surprise update happens, there is a chance things break.


> What kind of flakiness?

Black screens, X server crashes, OpenGL programs either crashing or running slow. Just general unreliability. Different driver versions seemed more reliable than others, which meant I was always very reluctant to upgrade, which then gives you more problems as you end up pinning old versions which then makes it harder to troubleshoot online...

> My nvidia problems are generally software and update related. The NVidia stuff usually works on popular distros, but as soon anything custom or a surprise update happens then there is a chance things break.

I mean if you run mixed versions then yeah that will work for some upgrades and not others. A decent package manager should prevent that; some distros refuse to put effort into packaging the nvidia-drivers out of principle. But if you keep the drivers in sync (which is what the official package from NVidia themselves does, it's not their fault some distros choose to explode it into multiple packages) and properly rebuild just the kernel module every time you do a kernel upgrade (or just reinstall the whole driver if you prefer), then it's rock solid.


Up to a couple of years ago, before permanently moving to AMD GPUs, I couldn't even boot Ubuntu with an Nvidia GPU. This was because Ubuntu booted by default with Nouveau, which didn't support a few/several series (I had at least two different series).

The cards worked fine with the binary drivers once the system was installed, but AFAIR, I had to integrate the binary driver packages into the Ubuntu ISO in order to boot.

I presume that the situation is much better now, but requiring binary drivers can be a problem in itself.


Are you using wayland or are you still on x11? My experience was that the closed source drivers were fine with x11 but a nightmare with wayland.


I did when my card stopped being supported by all the distros because it was too old while the legacy driver didn't fully work the same.


Me too. Now I have a laptop with discrete nvidia and an eGPU with 3090 in it, a desktop with 4090, another laptop with another discrete nvidia.. all switching combinations work, acceleration works, game performance is on par with windows (even with proton to within a small percentage or even sometimes better). All out of the box with stock Ubuntu and installing driver from Nvidia site.

The only "trick" is I'm still on X11 and probably will stay. Note that I did try wayland on few occasions but I steered away (mostly due to other issues with it at the time).


Likewise. Rock solid for decades in intel + nvidia proprietary drivers even when doing things like hot plugging for passthroughs.


Yeah I once worked at a cloud gaming company that used Wine on Linux on NVIDIA to stream cloud games. They were the only real option for multi-game performance, and very rock solid in terms of uptime. I truly have no idea what people are talking about. Yes I use X11.


Same here, been using the nvidia binary drivers on a dozen computers with various other HW and distros for decades with never any problems whatsoever.


3090 owner here.

Wayland is an even worse mess than it normally is. It used to flicker really badly before 555.58.02, less so with the latest driver - but it still has some glitches with games. A bunch of older Electron apps still fail to render anything and require hardware acceleration to be disabled. I gave up trying to make it all work - I can't get rid of all the flicker and drawing issues, plus Wayland seems to be a real pain in the ass with HiDPI displays.

X11 sort of works, but I had to entirely disable DPMS or one of my monitors never comes back online after going to sleep. I thought it was my KVM messing up, but that happened even with a direct connection... no idea what's going on there.

CUDA works fine, save for the regular version compatibility hiccups.


4070 Ti Super here, X11 is fine, I have zero issues.

Wayland is mostly fine, though I get some window-frame glitches when maximizing windows to the monitor, and another issue that I'm pretty sure is wayland, but it has only happened a couple of times and it locks the whole device up. I can't prove it yet.


I am not using Wayland and I do not have any intention to use it, therefore I do not care for any problems caused by Wayland not supporting NVIDIA and demanding that NVIDIA must support Wayland.

I am using only Linux or FreeBSD on all my laptop, desktop or server computers.

On desktop and server computers I did not ever have the slightest difficulty with the NVIDIA proprietary drivers, either for OpenGL or for CUDA applications or for video decoding/encoding or for multiple monitor support, with high resolution and high color depth, on either Gentoo/Funtoo Linux or FreeBSD, during the last two decades. I also have AMD GPUs, which I use for compute applications (because they are older models, which still had FP64 support). For graphics applications they frequently had annoying bugs, unlike NVIDIA (however my AMD GPUs have been older models, preceding RDNA, which might be better supported by the open-source AMD drivers).

The only computers on which I had problems with NVIDIA on Linux were those laptops that used the NVIDIA Optimus method of coexistence with the Intel integrated GPUs. Many years ago I have needed a couple of days to properly configure the drivers and additional software so that the NVIDIA GPU was selected when desired, instead of the Intel iGPU. I do not know if any laptops with NVIDIA Optimus still exist. The laptops that I bought later had video outputs directly from the NVIDIA GPU, so there was no difference between them and desktops and the NVIDIA drivers worked flawlessly.

Both on Gentoo/Funtoo Linux and FreeBSD I never had to do anything else but to give the driver update command and everything worked fine. Moreover, NVIDIA has always provided a nice GUI application "NVIDIA X Server Settings", which provides a lot of useful information and which makes very easy any configuration tasks, like setting the desired positions of multiple monitors. A few years ago there was nothing equivalent for the AMD or Intel GPU drivers, but that might have changed meanwhile.


great. rtx 4090 works out of the box after installing drivers from non-free. That's on debian bookworm.


I got my nvidia 1060 back during the crypto crisis, when the prices of AMD GPUs were inflated due to miners. Hesitant and skeptical about Linux support, I have upgraded the same machine with that GPU since 2016, from Ubuntu 14.04 to 18.04 and now 24.04 - without any nvidia driver issues whatsoever. When I read about issues with nvidia's drivers, it is mostly people with a rare distro or a rolling-release one, with kernel versions changing very frequently and failures to recompile the binary drivers. For LTS distros you will likely have no issues.


4070 worked out of the box on my arch system. I used the closed source drivers and X11 and I've not encountered a single problem.

My prediction is that it will continue to improve if only because people want to run nvidia on workstations.


My experience with an AMD iGPU on Linux was so bad that my next laptop will be Intel. Horrible instability to the point where I could reliably crash my machine by using Google Maps for a few minutes, on both Chrome and Firefox. It got fixed eventually - with the next Ubuntu release, so I had a computer where I was afraid to use anything with WebGL for half a year.


Depends on the driver version: the 550 version results in a black screen after waking from sleep (you have to kill and restart the X server). The 535 version doesn't have this bug. Don't know about 555.

Also tearing is a bitch. Still. Even with ForceCompositionPipeline.
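
For reference, on X11 the usual way to force it is through nvidia-settings (a sketch; the exact metamode string depends on your monitor layout):

    nvidia-settings --assign CurrentMetaMode="nvidia-auto-select +0+0 { ForceCompositionPipeline = On, ForceFullCompositionPipeline = On }"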


I've been running Arch with KDE under Wayland on two different laptops both with NVIDIA GPUs using proprietary drivers for years and have not run into issues. Maybe I'm lucky? It's been flawless for me.


Experiences always vary quite a lot; it depends so much on what you do with it. For example, Discord doesn't support screen sharing under Wayland - it's just one small example, but those can add up over time. Another example is display rotation, which was broken in KDE for a long time (recently fixed).


I have never had an issue with them. That said, I typically go mid-range on cards, so the architecture has usually been hardened by a year or two in the high end.


KDE Plasma 6 + the NVIDIA 555 beta works well. I have to make .desktop files to launch some applications explicitly under Wayland.


Whatever Pop!_OS uses has been quite stable for my 4070.


Pop uses X by default because of Nvidia.


Plug, install, then play. I've got 3 different NVIDIA GPU setups, all running without any issue; nothing crazy to do but follow the installation instructions.


To some of us, running any closed source software in userland qualifies as quite crazy indeed.


Throwing the tarball over the wall and saying "fetch!" is meaningless to me. Until they actually contribute a driver to the upstream kernel, I'll be buying AMD.


You can just use Nouveau and NVK for that if you only need workstation graphics (and the open-gpu-kernel-modules source code / separate GSP release has been a big uplift for Nouveau too, at least).


Nouveau is great, and I absolutely admire what the community around it has been able to achieve. But I can't imagine choosing that over AMD's first class upstream driver support today.


IIRC hardware video decoding of HEVC didn't work for me with nouveau


The title of this statement is misleading:

NVIDIA is not transitioning to open-source drivers for its GPUs; most or all user-space parts of the drivers (and most importantly for me, libcuda.so) are closed-source; and as I understand from others, most of the logic is now in a binary blob that gets sent to the GPU.

Now, I'm sure this open-sourcing has its uses, but for people who want to do something like a different hardware backend for CUDA with the same API, or to clear up "corners" of the API semantics, or to write things in a different language without going through the C API - this does not help us.


NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules

or

NVIDIA Transitions Towards Fully Open-Source GPU Kernel Modules?


Not much point in a "partially" open-source kernel module.


But “fully towards” is pretty ambiguous, like an entire partial implementation.

Anyhow, I read the article; I think they're saying fully as in exclusively - eventually there will not be both a closed-source and an open-source driver co-maintained. So “fully open source” does make more sense. The current driver situation IS partially open source, because their offerings currently include both open- and closed-source drivers, and in the future the closed-source drivers may be deprecated?


See my answer. It's not going to be fully open-source drivers; rather, all drivers will have open-source kernel modules.


You can argue against proprietary firmware, but is this all that different from other types of devices?


Other device manufacturers with proprietary drivers don't engage in publicity stunts to make it sound like their drivers are FOSS or that they embrace FOSS (or just OSS).


Haven't read it, but probably the former.


"towards" basically negates the "fully" before it for all real intents and purposes


Remember that time when Linus looked at the camera and gave Nvidia the finger? Has that time now passed? Is it time to reconcile? Or are there still some gotchas?


These are kernel modules, not the actual drivers. So the finger remains up.


Too late for me. I tried switching to Linux years ago but failed because of the awful state of NVIDIA's drivers. Switched to AMD last year and it's been a breeze ever since.

Gaming on Linux with an NVIDIA card (especially an old one) is awful. Of course Linux gamers aren't the demographic driving this recent change of heart so I expect it to stay awful for a while yet.


As someone who is pretty skeptical and reads the fine print, I think this is a good move and I really do not see a downside (other than the fact that this probably strengthens the nVidia monoculture).


AFAIK all they did was move the closed-source driver code into their opaque firmware blob, leaving a thin shim in the kernel.

In essence I don’t believe that much has really changed here.


Having all of the kernel, or more precisely all of the privileged code, be open source is much more important for security than having all of the firmware of the peripheral devices be open source.

Closed-source privileged code cannot be audited, and it may contain either intentional backdoors or, more likely, bugs that can cause various undesirable effects, like crashes or privilege escalation.

On the other hand, in a properly designed modern computer any bad firmware of a peripheral device cannot have a worse effect than making that peripheral unusable.

The kernel should ensure, e.g. by using the IOMMU, that the peripheral cannot access anything where it could do damage, like DRAM not assigned to it, non-volatile memory (e.g. SSDs), or the network interfaces used for communicating with external parties.
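
As a sketch, assuming the IOMMU has been enabled on the kernel command line (e.g. intel_iommu=on, or amd_iommu=on), the isolation groups that the kernel enforces can be inspected from userspace:

    # print each IOMMU group and the PCI devices contained in it
    for d in /sys/kernel/iommu_groups/*/devices/*; do
        printf 'IOMMU group %s: ' "$(echo "$d" | cut -d/ -f5)"
        lspci -nns "${d##*/}"
    done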

Even when the peripheral is as important as the display, a crash in its firmware would have no effect if the kernel reserved some key combination for resetting the GPU. While I am not aware of such a useful feature in Linux, its effect can frequently be achieved by switching, e.g. with Alt+F1, to a virtual console and then back to the GUI; the saving and restoring of the GPU state, together with the switching of video modes, is often enough to clear corruption caused by a buggy GPU driver or a buggy mouse or keyboard driver.

In conclusion, making the NVIDIA kernel driver open source does not deserve to have its importance minimized. It is an important contribution to a more secure OS kernel.

The only closed-source firmware that must be feared is that which comes from the CPU manufacturer, e.g. from Intel, AMD, Apple or Qualcomm.

All such firmware currently includes various features for remote management that are not publicly documented, so you can never be sure whether they can be properly disabled, especially when the remote management can be done wirelessly, like through the WiFi interface of Intel laptop CPUs, where you cannot interpose an external firewall to filter any "magic" packets out of the network traffic.

A paranoid laptop user can circumvent the lack of control over the firmware blobs from the CPU manufacturer by disconnecting the internal antennas and using a cheap, small external single-board computer for all wired and wireless network access, running a firewall with tight rules. Such an SBC should be chosen among those for which complete hardware documentation is provided, i.e. including its schematics.


Everything you wrote assumes that IOMMUs across the board are implemented 100% correctly, without errors/bugdoors.

People used to believe similar things about Hyperthreading, glitchability, ME, Cisco, boot-loaders, ... the list goes on.


There is still a huge difference between privileged code running on the CPU, where nothing limits what it can do, and code running on a device, which should normally be contained by the IOMMU, unless the IOMMU is buggy.

The functions of an IOMMU for checking and filtering transfers are very simple, so the probability of unintentional bugs is extremely small in comparison with the other things you enumerated.


Agreed that the feature set of an IOMMU is fairly small, but is this function not usually included in one of the chipset ICs, which run a lot of other code/functions alongside a (hopefully) faithfully correct IOMMU implementation?

Which, to my eyes, would increase the possibility of other system parts mucking with the IOMMU restrictions and/or triggering bugs.


Did you run this through an LLM? I'm not sure what the point is of arguing with yourself and bringing up points that seem tangential to what you started off talking about (…the security of GPUs?)


I have not argued with myself. I do not see what made you believe this.

I have argued with "I don’t believe that much has really changed here", which is the text to which I have replied.

As I have explained, an open-source kernel module, even together with closed-source device firmware, is much more secure than a closed-source kernel module.

Therefore the truth is that a lot has changed here, contrary to the statement to which I have replied, as this change makes the OS kernel much more secure.


But the firmware runs directly on the hardware, right? So they effectively rearchitected their system to move what used to be 'above' the kernel to 'below' the kernel, which seems like a huge effort.


It's some effort, but I bet they added a classical serial CPU to run the existing code. In fact, [1] suggests that's exactly what they did. I suspect they had other reasons to add the GSP, so the amortized cost of moving the driver code to firmware was actually not that large all things considered, and in the long term it reduces their costs (e.g. it further reduces the burden of supporting multiple OSes, they can theoretically improve performance further, etc.).

[1] https://download.nvidia.com/XFree86/Linux-x86_64/525.78.01/R...


That's exactly what happened - the Turing microarchitecture brought in the new[1] "GSP", which is capable enough to run the task. A similar architecture exists, AFAIK, on Apple M-series, where the GPU runs its own RTOS instance that talks with the "application OS" over RPC.

[1] The Turing GSP is not the first "classical serial CPU" in NVIDIA chips, it's just the first with enough juice to do the task. Unfortunately, without recalling the name of the component it seems impossible to find it again, thanks to search results being full of NVIDIA ARM and GSP pages...


>the name of the component

Falcon?


THANK YOU, that was the name I was forgetting :)

Here's[1] a presentation from NVIDIA regarding a plan (unsure whether it was carried out or not) to replace Falcon with RISC-V; [2] suggests the GSP is in fact the "NV-RISC" mentioned in [1]. Some work on reversing Falcon was apparently done for Switch hacking[3]?

[1] https://riscv.org/wp-content/uploads/2016/07/Tue1100_Nvidia_... [2] https://www.techpowerup.com/291088/nvidia-unlocks-gpu-system... [3] https://github.com/vbe0201/faucon


Would you happen to have a source or any further readings about Apple M-series GPUs running their own RTOS instance?


The Asahi Linux documentation has a pretty good writeup.

The GPU is described here[1] and the mailbox interface used generally between various components is described here [2]

[1] https://github.com/AsahiLinux/docs/wiki/HW%3AAGX#overview

[2] https://github.com/AsahiLinux/docs/wiki/HW%3AASC


Why? It should make it much easier to support Nvidia GPUs on Windows, Linux, Arm/x86/RISC-V and more OSes with a single firmware codebase per GPU now.


Yes, makes sense; in the long run it should make their life easier. I just suspect that the move itself was a big effort. But they can probably afford that nowadays.


Mind the wording they've used here - "fully towards open-source" and not "towards fully open-source".

Big difference. Almost nobody is going to give you the sauce hidden behind blobs. But I hope the dumb issues of the past (imagine using it on laptops with switchable graphics) slowly go away with this, and that it is not only about pleasing the enterprise crowd.


All `g_bindata_k*.c` files are essentially blobs with no source provided:

https://github.com/NVIDIA/open-gpu-kernel-modules/tree/main/...


My guess is Meta and/or Amazon told Nvidia that they would contribute considerable resources to development as long as the results were open source. Both companies' bottom lines would benefit from improved kernel modules, and like another commenter said elsewhere, Nvidia doesn't have much to lose.


I wonder if we'll ever get HDCP on NVIDIA. As much as I enjoy 480p video from streaming services.


Just download it to your PC. It's a better user experience and costs less.


Which service goes that low? The ones I know limit you from using 4k, but anything up to 1080p works fine.


Nonsense that a 1080p limit is acceptable for (and accepted by) paying customers.


Depends. I disagree with HDCP in theory on ideological grounds. In practice, my main movie device is below 720p (projector), so it will take another decade before it affects me in any way.


I really hope this makes it easier to install/upgrade NVIDIA drivers on Linux. It's a nightmare to figure out version mismatches between drivers, utils, container-runtime...
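
For what it's worth, a quick sanity check (assuming the usual tools are installed) is to compare what the loaded kernel module and the userspace tooling report:

    # version of the kernel module that is actually loaded
    cat /proc/driver/nvidia/version
    # what the userspace tooling reports; a mismatch typically shows up as an NVML initialization error
    nvidia-smi --query-gpu=driver_version --format=csv,noheader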


A nightmare how? When I used their cards, I'd just download the .run and run it. Done.
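
For illustration, the whole flow is something like this (the file name is just an example version; the display manager has to be stopped so the in-use module can be replaced):

    sudo systemctl stop display-manager
    # example file name; use whichever .run you actually downloaded
    sudo sh ./NVIDIA-Linux-x86_64-550.90.07.run
    sudo reboot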


And when it doesn't work, what do you do then?

Exactly, that's when the nightmare starts.


After a reboot of course :)

Everything breaks immediately otherwise.


From my limited experience with their open-sourcing of the kernel modules so far: it doesn't make things easier, but the silver lining is that, for the most part, it doesn't make installation and configuration harder! Which is no small thing, actually.


The transition is not done until their drivers are upstreamed into the mainline kernel and ALL features work out of the box, especially power management and hybrid graphics.


I thought power management was moved to the GPU firmware in the 20 series, which is why the new driver only supports those?


I read "NVIDIA transitions fully Torvalds..."


This is great. I've been having to build my own .debs of the OSS driver for some time because of the crapola NVIDIA puts in their proprietary driver that prevents it from working in a VM as a passthrough device. (just a regular whole-card passthru, not trying to use GRID/vGPU on a consumer card or anything)

NVIDIA can no longer get away with that nonsense when they have to show their code.
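
For anyone curious, building the open modules yourself is roughly the following (a sketch based on my reading of the repo's README; the matching userspace driver of the same version still has to be installed separately):

    git clone https://github.com/NVIDIA/open-gpu-kernel-modules.git
    cd open-gpu-kernel-modules
    make modules -j"$(nproc)"
    sudo make modules_install -j"$(nproc)"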


Thank you, NVIDIA hacker! You did it! The Lapsus$ team threatened a few years back that if NVIDIA was not going to open-source its drivers, they were going to release the code. That led to NVIDIA releasing the first open-source kernel module a few months later, but it was quite incomplete. Now it seems they are open-sourcing more fully.


didn't they say that many times before?


Not sure, but with the Turing series they support a cryptographically signed binary blob that they load onto the GPU. So where before their kernel driver was a thin shim for the userspace driver, now it's a thin shim for the black-box firmware loaded on the GPU.


The scope of what the kernel interface provides didn't change, but what was previously a blob wrapped by a source-provided "OS interface layer" has now moved to run on the GSP (RISC-V based) inside the GPU.
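
You can see that blob on disk, too. As a rough illustration (the version directory and file names vary per driver release), the GSP firmware images the kernel module uploads to the card are shipped under /lib/firmware:

    # expect files along the lines of gsp_tu10x.bin / gsp_ga10x.bin
    ls /lib/firmware/nvidia/*/gsp*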


I can't wait to use Linux without having to spend multiple weekends trying to get the right drivers to work.


Are NVIDIA Grace CPUs even available? I thought it was interesting that they mentioned them.


I'll update as soon as it's in NixOS unstable. Hopefully this will change the minds of the Sway maintainers about supporting NVIDIA cards; I'm using i3 and X but would like to try out Wayland.


Well, it is something, even if it's still only the kernel module, and it will probably never be upstreamed anyway.


So does this mean actually getting rid of the binary blobs of microcode that are in their current ‘open’ drivers?


No, it means the blob from the "closed" drivers is moved to run on GSP.


Does this mean we can aggressively volt mod, add/replace memory modules to our liking?


It’s kind of surprising that these haven’t just been reverse engineered yet by language models.


That's simply not how LLMs work; they are actually awful at reverse engineering of any kind.


Are you saying that they can't explain the contents of machine code in a human-readable format? Are you saying that they can't be used in a system that iteratively evaluates combinations of inputs and checks their results?


Just that they're horrible at it


This means Fedora can bundle it?


That's not upstream yet. But they have supposedly shown some interest in Nova too.


Does this mean you will be able to use NVK/Mesa and CUDA at the same time? The non-Mesa proprietary side of NVIDIA's Linux drivers is such a mess, and NVK is improving by the day, but I really need CUDA.


Maybe that’s one way to retain engineers who are effectively millionaires.


What is a GPU kernel module? Is it something like a driver for the GPU?


Yes. In modern operating systems, GPU drivers usually consist of a kernel component that is loaded into the kernel or runs in a privileged context, and a userspace component that talks to it and implements the GPU-specific part of the APIs that the windowing system and applications use. In the case of NVIDIA, they have decided to drop their proprietary kernel module in favour of an open one. Unfortunately, it's out of tree.

In Linux and BSD, you usually get all of your drivers with the system; you don't have to install anything, it's all mostly plug and play. For instance, this has been the case for AMD and Intel GPUs, which have a 100% open source stack. NVIDIA is particularly annoying due to the need to install the drivers separately and the fact they've got different implementations of things compared to anyone else, so NVIDIA users are often left behind by FOSS projects due to GeForce cards being more annoying to work with.
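
A rough way to see that split on a machine with the NVIDIA stack installed (module and library names vary a bit by distro and driver version):

    # kernel side: the module(s) loaded into the kernel
    lsmod | grep '^nvidia'
    # userspace side: the closed libraries that applications actually link against
    ldconfig -p | grep -E 'libGLX_nvidia|libcuda|libnvidia-glcore'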


Thanks. I'm not well versed in these things. It sounded like something you load into the GPU (it reminded me of an old HP printer, which required a firmware upload after starting).


Will this mean that we'll be able to remove the arbitrary distinctions between Quadro and GeForce cards, maybe by hacking some configs or such in the drivers?


They are worthless; the main code is in userspace.


They know the CUDA monopoly won't last forever.


CUDA lives in userspace; this kernel driver release does not contain any of that. It's still very useful to release an open source DKMS driver, but this doesn't change anything at all about the CUDA situation.
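
This is easy to check for yourself: a CUDA program goes through the closed userspace library, not the kernel module directly (the binary name below is just a placeholder):

    # resolves to the proprietary libcuda.so shipped with the userspace driver,
    # independently of whether the open or closed kernel module is loaded
    ldd ./my_cuda_app | grep -i cuda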


I hope Linux gets first-class open-source GPU drivers... and dare I hope that Go adds native support for GPUs too?


damn, only for new GPUs.


For varying definitions of "new". It supports Turing and up, which was released in 2018 with the 20xx line. That's two generations back at this point.


Hopefully, we get a plain and simple C99 userspace Vulkan implementation.


NVidia's revenue is now 78% from "AI" devices.[1] NVidia's market cap is now US$2.92 trillion. (Yes, trillion.) Only Apple and Microsoft can beat that. Their ROI climbed from about 10% to 90% in the last two years.[2] That growth has all been on the AI side.

Open-sourcing graphics drivers may indicate that NVidia is moving away from GPUs for graphics. That's not where the money is now.

[1] https://www.visualcapitalist.com/nvidia-revenue-by-product-l...

[2] https://www.macrotrends.net/stocks/charts/NVDA/nvidia/roi


It indicates nothing; they started this a few years ago, before that shift. They just transferred the most important parts of their driver into the (closed-source) firmware, to be handled by the onboard coprocessor, and open-sourced the rest.


Well, Nvidia seems to be claiming in the article that this is everything, not just graphics drivers: "NVIDIA GPUs share a common driver architecture and capability set. The same driver for your desktop or laptop runs the world’s most advanced AI workloads in the cloud. It’s been incredibly important to us that we get it just right."

And: "For cutting-edge platforms such as NVIDIA Grace Hopper or NVIDIA Blackwell, you must use the open-source GPU kernel modules. The proprietary drivers are unsupported on these platforms." (These are the two most advanced NVIDIA architectures currently.)


That's interesting. I've been expecting the AI cards to diverge more from the graphics cards. AI doesn't need triangle fill, Z-buffering, HDMI out, etc. 16 bit 4x4 multiply/add units are probably enough. What's going on in that area?


TL;DR - there seems to be not that much to gain from dropping the "graphics-only" parts of the chip if you already have a GPU, as opposed to breaking into the AI market with your first product.

1. NVIDIA's compute dominance is not due to a hyperfocus on AI (that's Google's TPU for you, or things like Intel's NPU in Meteor Lake), but because CUDA offers considerable general-purpose compute. In fact, considerable revenue came and still comes from non-AI compute. This also means that if you figure out a novel mechanism for AI that isn't based around 4x4 matrix multiply/add, or which mixes it with various other operations, you can do them inline. This also includes any pre- and post-processing you might want to do on the data.

2. The whole advantage they have in the software ecosystem builds upon their PTX assembly. Having it compile to the CPU and only implementing the specific variant of one or two instructions that map to "tensor cores" would be pretty much nonsensical (especially given that AI is not the only market they target with tensor cores - DSP, for example, is another).

Additionally, a huge part of why NVIDIA built such a strong ecosystem is that you could take the cheapest G80-based card and just start learning CUDA. Only some highest-end features, like RDMA and NVMe integration, are limited to the most expensive cards.

Compare this with AMD, where for many purposes only the most expensive compute-only cards are really supported. Or with specialized AI-only chips that are often programmable either in a very low-level way or essentially as "set up a graph of large-scale matrix operations that are a limited subset of the operations exposed by Torch/TensorFlow" (Google TPU, Intel Meteor Lake NPU, etc.).

3. CUDA literally began with how the evolution of the shader model led to a general-purpose "shader processor" instead of specialized vertex and pixel processors. The space taken by specialized graphics hardware that isn't also usable for general-purpose compute is pretty minimal, although some of it is, AFAIK, omitted in compute-only cards.

In fact, some of the "graphics-only" things like Z-buffering are done by the same logic that is used for compute (with a limited number of operations done by the fixed-function ROP block), and certain fixed-function graphics components like texture mapping units are also used for high-performance array access.

4. Simplified manufacturing and logistics - NVIDIA uses essentially the same chips in most compute and graphics cards, possibly with minor changes achieved by flipping chicken bits to route pins to different functions (as you mentioned, you don't need the DP outs of an RTX 4090 on an L40 card, but you can probably reuse the SERDES units to run NVLink on the same pins).


"Kernel" is an overloaded term when it comes to GPUs. This is about the Linux kernel.


"... Linux GPU Kernel Modules" is pretty unambiguous to me.


Yep the title was updated.


Guh, wish I could delete this now that the title was updated. The original title (shown on the linked page) wasn't super clear.


Nvidia has finally realized they couldn't write drivers for their own hardware, especially for Linux.

Never thought I would see the day.


Suddenly they went from powering gaming to being the winners of the AI revolution; AI is Serious Cloud Stuff, and Serious Cloud Stuff means Linux, so...



