Hacker News
Writing an NVMe Driver in Rust [pdf] (tum.de)
242 points by foooo4 4 months ago | 44 comments



For those curious: this is a userspace driver, written to open/mmap the PCI BAR via sysfs. It doesn't attempt to hook interrupts from the device and has to resort to polling for completion[1]. Also AFAICT there's really no attempt being made at parallel design: all the hardware interaction is handled by a single manager thread responding to queued commands from workers. That's a fine choice for simplicity and reliability but a very poor one for scalability in a storage backend that needs to be handling requests from filesystem drivers serving dozens or hundreds of CPUs.

Basically: this is good work (great work for an undergraduate thesis). But it's very much "solving the easy part" and not really showing off Rust in any particularly impressive way. You can write very similar userspace "drivers" (and I have!) in Python.

[1] Though on modern hardware designed to manage arbitrary scatter/gather command queues on its own, that's not really such a big deal. In performance situations the hardware will always have something to do anyway, and idle hardware can be sent a command synchronously. Fixing this amounts to a power optimization only.
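For reference, "polling for completion" here boils down to spinning on the phase bit the controller toggles when it posts a completion entry. A minimal runnable sketch (the struct and field layout are hypothetical stand-ins, not the thesis code; a real NVMe completion entry carries more fields, and real code polls device memory rather than a local value):

```rust
use std::sync::atomic::{AtomicU16, Ordering};

// Stand-in for one completion queue entry. In real NVMe, the phase tag
// is one bit of the status field in the entry's last dword.
struct CqEntry {
    status_phase: AtomicU16, // bit 0 = phase tag, remaining bits = status
}

// Spin until the entry's phase bit matches the phase we expect for this
// pass around the ring, then return the status bits.
fn poll_completion(entry: &CqEntry, expected_phase: u16) -> u16 {
    loop {
        let sp = entry.status_phase.load(Ordering::Acquire);
        if (sp & 1) == expected_phase {
            return sp >> 1;
        }
        std::hint::spin_loop(); // tell the CPU we're busy-waiting
    }
}

fn main() {
    // Pretend the controller already completed: phase = 1, status = 0.
    let entry = CqEntry { status_phase: AtomicU16::new(1) };
    let status = poll_completion(&entry, 1);
    assert_eq!(status, 0);
}
```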


I didn't read the whole thing, but the author briefly mentions VFIO, which would have given him access to interrupts, so interrupt-driven completion would have been possible (though for flat-out performance, polling is the only thing fast enough).

I wrote a small NVMe user-space driver using VFIO more than a decade ago this way. Coming from having virtualized ATAPI (SATA) and SCSI, NVMe was such a refreshingly excellent design.


I only read the paper (and the code in the paper) but not the complete source code, so maybe this would become clearer if I had, but...

Does Rust fundamentally guarantee that if you make a struct, its fields will be laid out in memory in the order you defined them? Can it be used to interact with APIs (really ABIs) that expect a C struct (or a pointer to one)?

I think my main frustration with stuff like Go and Swift in this case is that their structs are not binary-compatible with C structs in this way because they rearrange things to be better aligned/packed/whatever.


Yes, Rust supports annotating your types with a `#[repr(...)]` attribute to control how it gets laid out in memory. There's a "C" repr that gives you representations interoperable with C.

https://doc.rust-lang.org/reference/type-layout.html#the-c-r...


> Does Rust fundamentally guarantee that if you make a struct, its fields will be laid out in memory in the order you defined them? Can it be used to interact with APIs (really ABIs) that expect a C struct (or a pointer to one)?

You have to specify this behavior with #[repr(C)]. Otherwise, the compiler will rearrange fields to try to optimize packing and alignment.


Or, more precisely, reserves the right to do so in a future version. rustc does not currently do this. There are some compiler flags to randomize the struct layout just to make sure that you’re not implicitly depending on the order.

https://doc.rust-lang.org/reference/type-layout.html

https://github.com/rust-lang/compiler-team/issues/457


> Or, more precisely, reserves the right to do so in a future version. rustc does not currently do this

Does not currently do what?

Rust certainly will re-arrange the layout of a default repr(Rust) struct to make it smaller, for example: https://rust.godbolt.org/z/7KsqvnE9o

[Edited to provide a nicer Godbolt example which compares the two layout strategies]
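A runnable sketch of the difference, in the spirit of that Godbolt link. The repr(C) size is guaranteed; the default-repr size is explicitly unspecified, so the sketch only asserts it can't be worse:

```rust
// Declaration order is fixed: u8, 7 bytes padding, u64, u8, 7 bytes
// tail padding = 24 bytes.
#[repr(C)]
struct CLayout {
    a: u8,
    b: u64,
    c: u8,
}

// Default repr(Rust): the compiler may reorder fields. Current rustc
// packs this as (b, a, c) for 16 bytes, but that is not guaranteed.
struct RustLayout {
    a: u8,
    b: u64,
    c: u8,
}

fn main() {
    assert_eq!(std::mem::size_of::<CLayout>(), 24);
    assert!(std::mem::size_of::<RustLayout>() <= 24);
    println!(
        "repr(C): {} bytes, default repr: {} bytes",
        std::mem::size_of::<CLayout>(),
        std::mem::size_of::<RustLayout>()
    );
}
```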


I stand corrected


The source[0] has `repr(packed)` but they should be using `repr(C, packed)`. That'll blow up in a future Rust version (possibly 1.80), since `packed` alone is not sufficient to guarantee ordering[1].

[0] https://github.com/bootreer/vroom/blob/37bd8a22f5e0550b2cbc9... [1] https://doc.rust-lang.org/reference/type-layout.html#the-ali...
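To illustrate the distinction, a hypothetical command header (not the struct from the actual source): `C` pins the field order and `packed` removes padding, and a hardware-defined layout needs both:

```rust
// `C` => fields stay in declaration order; `packed` => no padding,
// so byte offsets match what a spec's wire/register layout expects.
#[repr(C, packed)]
struct CmdHeader {
    opcode: u8,
    flags: u8,
    cid: u16,
    nsid: u32,
}

fn main() {
    // 1 + 1 + 2 + 4 bytes, no padding anywhere.
    assert_eq!(std::mem::size_of::<CmdHeader>(), 8);
    // `packed` alone happens to give 8 bytes here too, but without `C`
    // the field *order* itself is unspecified, which is what matters
    // when the device reads these bytes.
}
```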


Go doesn't rearrange struct fields, which is also why there's a linter/analyzer that shows you whether you could be using less memory with better field ordering.

https://pkg.go.dev/golang.org/x/tools/go/analysis/passes/fie...


> I think my main frustration with stuff like Go and Swift in this case is that their structs are not binary-compatible with C structs in this way because they rearrange things to be better aligned/packed/whatever.

If you need binary-compatibility with C structs in Swift, you can define them in a bridging header.


In Rust you can control the layout and alignment of fields in a struct with `#[repr(...)]`.


Really cool and impressive work! I'm more on the software (high-level) side than the hardware side, but I've always wondered how USB and other ports actually work. It's an area of knowledge I'm shamefully lacking...

Does anyone have nice resources to share, like this one, that focus on a specific port/connection and implement a driver/reader/parser? I'd very much like to learn more about this.


"Big" hardware like a USB3 controller or the NVMe controller in the linked article actually looks more like software than hardware anyway, to be truthful. The devices have their own processors and accept commands that look like a software protocol: you put a linked list of these structs/headers/commands into memory, which reference buffers at these addresses via pointer, then turn the device on by writing to this register over here. As the device finishes, it signals completion with a given struct by putting a message into this output buffer (or by setting its "active" field to false, maybe), etc...

The only bits you need to worry about that look like hardware are getting the correct physical addresses filled in, and on some platforms worrying about memory ordering and/or cache management to be sure the device sees the same memory state you think it does.
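The doorbell step of that flow can be sketched like this. `ring_doorbell` is an illustrative name, the barrier choice is an assumption (real drivers may need platform-specific barriers beyond a Rust atomic fence), and an ordinary u32 stands in for the device register so the sketch runs anywhere:

```rust
use std::sync::atomic::{fence, Ordering};

// Publish queue entries to memory, then tell the device about them via
// a volatile write to its (memory-mapped) doorbell register.
unsafe fn ring_doorbell(doorbell: *mut u32, new_tail: u32) {
    // Ensure the queue entries written before this call are visible
    // before the device sees the updated tail.
    fence(Ordering::Release);
    // Volatile so the compiler can't elide or reorder the store.
    doorbell.write_volatile(new_tail);
}

fn main() {
    // Stand-in for a device register: an ordinary u32 in memory.
    let mut fake_doorbell: u32 = 0;
    unsafe { ring_doorbell(&mut fake_doorbell as *mut u32, 7) };
    assert_eq!(fake_doorbell, 7);
}
```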


Ben Eater has videos on USB, SPI and RS-232.

Bitluni has a couple of easily consumable videos about VGA.


Seconded on Ben Eater; the USB one showing inputs from a keyboard and such on an oscilloscope blew my mind.

I was like -> oh, it really just is a huge bus and there's A LOT happening.


It's Plan 9 and C, but there is some good info in these videos: https://www.youtube.com/@adventuresin9797/search?query=usb


That is a bachelor’s thesis - someone should hire that student

Clear writing and ideas


A lot of bachelor theses from TUM are very, very impressive. Back in the day when Paul Emmerich[1] held his talks at the CCC about user-space network drivers, I looked through a few of them, and a good portion were "only" bachelor theses. I felt very bad comparing them to my own bachelor thesis. On the other hand, I have a friend now who is doing her master's at TUM, and she says they sometimes have extreme difficulties getting their topics approved, because TUM wants to stay "elite" and the students are under immense pressure. But hats off to the NVMe thesis. Well done!

[1] https://www.net.in.tum.de/members/emmericp/


Currently a master's student at TUM, and I agree with your friend -- it's a painful journey haha.


Let him at least finish his M.Sc. first. Then he can get disappointed about the industry a bit later.


Get them out now before they find out the truth about academia...


It's cool to see more systems code written in Rust! I also previously worked on SPDK, so it was neat to see it being chosen as a point of comparison.

However, I was waiting for the touted memory safety to be mentioned beyond the introduction, but it never really came up again. I was hoping for the paper to make a stronger argument for memory-safe languages like Rust, something like "our driver did not have bugs X, Y, and Z, which were found in other drivers, because the compiler caught them".

Additionally, in a userspace device driver that is given control of DMA-capable hardware like an NVMe controller, the most critical memory safety feature is an IOMMU, which the driver covered by the paper does not enable. No amount of memory safety in the driver code itself matters when the hardware can be programmed to read or write anywhere in the physical address space, including memory belonging to other processes or even the kernel, from totally "safe" (in Rust semantics) code.

While the driver from the paper may certainly have a "simplified API and less code", I don't expect much of this to be related to the implementation language; it's comparing a clean-sheet minimal design to a project that has been around for a while and has had additional features incrementally added to it over time, making the older codebase inevitably larger and more complex. This doesn't seem like a particularly surprising result or an endorsement of a particular language, though it perhaps does indicate that it would be useful to start from scratch now and again just to see what the minimum viable system can look like. I certainly would have liked to rewrite it in Rust, but that wasn't really feasible. :)

In any case, it's great to see proof that a Rust driver can have comparable performance to one written in C, since it will hopefully encourage new code to be written in a nicer language than C. I definitely don't miss having to deal with manual memory management and chasing down use-after-frees now that I write Rust instead of C.

(As a side note, I'd encourage anyone thinking of using a userspace storage driver on Linux to check out io_uring first before going all in; if io_uring had existed before SPDK, I don't know that SPDK would have been written, given that io_uring gets you most of the way there performance-wise and integrates nicely with the rest of the kernel. A userspace driver has its uses, but I would consider it to be a last resort after exhausting all other options, since you have to reinvent all of the other functionality normally provided by the kernel like I/O scheduling, filesystems, encryption, etc., not just the NVMe driver itself. That is, assuming the io_uring security issues get resolved over time, and I expect they will.)


Comparing clean sheet designs to legacy bug-patched, security-focused implementations is pretty common for early-days Rustaceans. Most of the touted simplicity and compile speed is lost now that all the easy problems have been solved by an over-general crate that solves way more than you need it to. The language isn't going to save you from ecosystem bloat, and it isn't going to magically handle all security problems, especially those that occur at design time not compile or runtime.

But for those who want to get a handle on how rust might be used for something other than yet another crypto project or a toy webasm app, TFA is exactly what the doctor ordered.


Because writing a linked list by hand for the 1000th time is definitely safer than importing it from a crate with many collections already implemented... Not


I'm not saying we shouldn't use crates. I'm agreeing that maybe we still have to be cautious about them, and that the "early days", when we were doing hand-coded stuff and saying "See how easy this is? Why was this hard in C?", are long gone. For the very reasons you implied with your sarcasm.


There is even NVMe passthrough support via io_uring, so it's still possible to send custom NVMe commands when using io_uring instead of a userspace driver: https://www.usenix.org/system/files/fast24-joshi.pdf

Normal block I/O use cases don't really need NVMe io_uring passthrough, but it addresses the more exotic cases and is available in mainline Linux. And NVMe passthrough might eke out a little more performance.


Very interesting comment. By I/O scheduling, you mean across multiple processes (i.e. multiplexing the device)?


I/O scheduler was probably a bad example, since you might not need/want one for fast NVMe devices anyway, but yes, they help ensure limited resources (storage device bandwidth or IOPS) get shared fairly between multiple users/processes, as well as potentially reordering requests to improve batching (this matters more on spinning disks with seek latency, since a strategy of delaying a little bit to sort requests could save more time on seeks than it would spend on the delay+CPU overhead).

The more general point is that if you need any of the many features of a general-purpose OS kernel, a full userspace driver may not be a very good fit, since you will end up reinventing a lot of wheels. Cases where it could be a good fit would be things like database backends or dedicated block storage appliances, situations where the OS would just get in the way and where it's viable to dedicate a whole storage device (or several) and a whole CPU (or several) to one task.


Rust is a total gamechanger, and it's probably the thing that excites me the most about the future of kernel development.


In what sense is it a game changer?

A lot of memory and thread safety guarantees in Rust are only applicable to user space.

Removing that, there are still some great things in Rust (e.g. enums ergonomics) but also a few questionable things.


It's a common misconception that if the kernel needs direct hardware access and such, then it all becomes unsafe.

Rust programming style is building safe (zero-cost) abstractions on top of unsafe primitives, turning all other code into safe "glue" code. If you design for it, you can have a lot of "boring" code even in a kernel.

A kernel will have more unsafe primitives to implement, but the safe/unsafe division still helps testing the unsafe parts, and still prevents bugs caused by misuse of these APIs.

The type system that gives safety in userspace still exists in the kernel space. So even if the allocators and threads are different, you still have the type system tools to write safer APIs for them.
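As a concrete (entirely hypothetical) example of that division, a minimal MMIO register wrapper: the invariant is stated once in an unsafe constructor, and everything built on top is safe glue code. The test substitutes an in-memory u32 for a real register:

```rust
// One unsafe constructor carries the whole safety obligation;
// the accessors built on it are safe to call.
struct Mmio32 {
    addr: *mut u32,
}

impl Mmio32 {
    /// # Safety
    /// `addr` must point to a valid, aligned, device-owned 32-bit
    /// register for the lifetime of the returned value.
    unsafe fn new(addr: *mut u32) -> Self {
        Mmio32 { addr }
    }

    // Safe: the invariant was established once, in `new`.
    fn read(&self) -> u32 {
        unsafe { self.addr.read_volatile() }
    }

    fn write(&self, value: u32) {
        unsafe { self.addr.write_volatile(value) }
    }
}

fn main() {
    // Stand-in for a register so the sketch runs anywhere.
    let mut backing: u32 = 0;
    let reg = unsafe { Mmio32::new(&mut backing as *mut u32) };
    reg.write(0xdead_beef);
    assert_eq!(reg.read(), 0xdead_beef);
}
```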


Yes. Because it's popular and large, the Linux kernel has places where abstractions leak and a subtle implementation detail causes problems, because you were supposed to just know that certain APIs don't do quite what it seems like they'd do from the name.

Rust's culture says you must mark abstractions that leak safety as unsafe. If this Rust function named "make_doodad" is labelled safe, it is not OK that I could run it without a doodad_manager, and yet in this case it blows up. Either somehow require me to prove I have a doodad_manager, or, mark it unsafe and document the requirement, or re-design the function so that it checks for a doodad_manager and fails cleanly when one is not present. In some cases you might decide all three are needed: make_doodad_with_manager(&manager) -> Doodad, unsafe make_doodad_unchecked() -> Doodad and make_doodad() -> Result<Doodad,Problem>
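Sketching the three shapes from that example (all names are the comment's own hypotheticals; the runtime check is simulated with a flag):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct DoodadManager;
#[derive(Debug)]
struct Doodad;
#[derive(Debug, PartialEq)]
struct Problem;

// Stand-in for "a manager has been set up somewhere".
static MANAGER_PRESENT: AtomicBool = AtomicBool::new(false);

// 1. Prove you have a manager by passing a reference to one.
fn make_doodad_with_manager(_manager: &DoodadManager) -> Doodad {
    Doodad
}

/// # Safety
/// A `DoodadManager` must exist before calling this.
unsafe fn make_doodad_unchecked() -> Doodad {
    Doodad // 2. Obligation pushed onto the caller, visible in the signature.
}

// 3. Check at runtime and fail cleanly.
fn make_doodad() -> Result<Doodad, Problem> {
    if MANAGER_PRESENT.load(Ordering::Relaxed) {
        Ok(Doodad)
    } else {
        Err(Problem)
    }
}

fn main() {
    // No manager yet: the checked variant fails cleanly.
    assert!(make_doodad().is_err());
    let manager = DoodadManager;
    let _d = make_doodad_with_manager(&manager);
    // The caller vouches for the invariant here.
    let _u = unsafe { make_doodad_unchecked() };
    MANAGER_PRESENT.store(true, Ordering::Relaxed);
    assert!(make_doodad().is_ok());
}
```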


> A lot of memory and thread safety guarantees in Rust are only applicable to user space.

I'm not sure what you are thinking of here. Rust memory safety doesn't care about the environment at all.


In kernel space, ordinary operations like writing to a [u32] you own can have unexpected results. For example, the page may not exist, or the memory could point to some hardware component or be aliased. See also pornel's comment.

There are some ideas around Rust code patterns and structure for bare metal, see for example the RTFM work (now renamed). But they all have some drawbacks, such as reduced readability and, IMO, too much abstraction.

Anyway, my point was that some of the guarantees Rust provides build upon certain assumptions about the environment that generally don't hold in kernel space.


That's not quite correct. Sure, you will have some unsafe primitives that interact with hardware. But nothing stops you from creating abstractions on top of those.

Also, `unsafe` doesn't disable all language features or the type system; it just provides an escape hatch to use raw pointers. Which, yes, is quite a big step away from "normal Rust", but that's why we abstract around them.

It sure is extra and boring work, but it's entirely possible to create ergonomic APIs around unsafe low level primitives. I mean, that's how a lot of stuff gets implemented in stdlib or even in some crates. We just don't interact with it frequently, though.

Not a kernel developer myself, but I've done some embedded Rust and written drivers.


Even in firmware level code, you can still get most of the benefits. There are very few cases where you don't gain any safety at all.


Extremely impressive. Does anyone know if performance is likely to decrease as more features are implemented? Because if not, this is a winner.


One question that is unclear to me - what are the disadvantages of a user space driver like SPDK (in comparison to a kernel implementation)? It prevents multiplexing the device? The API is more complex to program?


It's not protected by the copyleft properties of the kernel D:

For the more technical points:

The kernel needs to have it available in a somewhat complex way to be able to mount the drive. I.e. init becomes a lot more involved, with an initramfs that first needs to load the driver (potentially loopback it?) and then mount the actual root.

To some degree there can also be issues around syscall boundaries, i.e. the usual monolith vs. microkernel tradeoff. I haven't checked the API they hook into to provide the device to other components, but it likely requires the kernel to jump back into userspace in various "hot-ish" paths for I/O.


Ok, it doesn't integrate with the system at all. This provides a library to access the device.

I.e. it can only be used by a single consumer, and that consumer doesn't get nice things like the kernel's file systems or device-mapper (raid/crypt/verity...) features. This can be fine, e.g. when the consumer is a database that just needs a block device, or something like ceph/minio etc. that provides the storage API to its consumers.

It'd have to use something like NBD (~fuse for block devices) to actually integrate and then my previous post describes some of the downsides to that setup.


I used to think reads are always faster than writes on SSDs. But in figures 5.3 and 5.4, it looks like read IOPS is lower than write IOPS.

When queue depth is low (like qd=1), random 4k read IOPS is far lower than random 4k write IOPS (14.5 kIOPS vs 128 kIOPS). When queue depth is high, like qd=32, read and write IOPS become similar, but read IOPS is still lower (436 kIOPS vs 608 kIOPS).

I wonder why reads are slower than writes? Is it because the SSD has a fast write cache and completes the write request once the data is in the cache? Or does it simply report that the data is written and actually write it in batches in the background?


This is extremely impressive work, especially for a bachelor thesis


Given the statement ‘we achieve SPDK-like throughput,’ I’m curious whether the performance is slightly worse than SPDK. If it is, do you have any comparison metrics for throughput?



