What Stefan (the author) calls my vision, I call my arrogance. :)
Some bits of my plan, especially with respect to how the async runtime in QEMU is designed, were good. However, I had vastly underestimated the complexity of one step: the various disk images form a graph that can change on the fly, for example when you make a live snapshot of a VM disk. The method I had thought of for handling changes to the graph would have added a lot of technical debt. Fortunately it was NACKed by the maintainer Kevin Wolf and replaced with something better.

For more details on both my plan and what was actually done, see https://kvm-forum.qemu.org/2023/Multiqueue_in_the_block_laye... (slides, also linked from the post) or https://youtu.be/Ubped0PgvZI?si=IsckfZ7uDNYJNp_y (video).
The important thing, at every step, was being committed to getting it done. Some of the intermediate steps were better in terms of bugs fixed, but worse in terms of complexity because you had to juggle both the "good" locks and the legacy global locks.
This is really everybody else's work. All I did was some mentoring of Emanuele, who is also the first presenter in the linked video.
We didn't use formal methods across the board, but some small parts of the async runtime were validated with Spin. The Promela sources are in the QEMU source repository.
In the end, the locking changes are relatively low-tech. The verification part is where the magic happens, and TSA plus my call graph analysis tool vrc are enough for that.
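For the curious, here is roughly what TSA looks like. This is a minimal standalone sketch using the raw Clang attributes, not code from the QEMU tree (QEMU wraps the same attributes in its own macros, and all names below are made up for the example):

    /* Build with: clang -Wthread-safety -c tsa-sketch.c */
    #include <pthread.h>

    /* Declare the mutex wrapper as a "capability" that the analyzer tracks. */
    struct __attribute__((capability("mutex"))) tsa_mutex {
        pthread_mutex_t m;
    };

    /* Annotated declarations: callers of these functions acquire/release the
     * capability.  The definitions opt out of the analysis, because the
     * analyzer cannot see inside pthread_mutex_lock(). */
    static void tsa_lock(struct tsa_mutex *mu)
        __attribute__((acquire_capability(*mu)));
    static void tsa_unlock(struct tsa_mutex *mu)
        __attribute__((release_capability(*mu)));

    __attribute__((no_thread_safety_analysis))
    static void tsa_lock(struct tsa_mutex *mu)
    {
        pthread_mutex_lock(&mu->m);
    }

    __attribute__((no_thread_safety_analysis))
    static void tsa_unlock(struct tsa_mutex *mu)
    {
        pthread_mutex_unlock(&mu->m);
    }

    struct counter {
        struct tsa_mutex lock;
        /* Touching 'value' without holding 'lock' is a compile-time warning. */
        int value __attribute__((guarded_by(lock)));
    };

    /* Callers must already hold c->lock; the analyzer checks every call site. */
    static void counter_inc_locked(struct counter *c)
        __attribute__((requires_capability(c->lock)));

    static void counter_inc_locked(struct counter *c)
    {
        c->value++;
    }

    void counter_inc(struct counter *c)
    {
        tsa_lock(&c->lock);
        counter_inc_locked(c);
        tsa_unlock(&c->lock);
        /* Calling counter_inc_locked() outside the lock, or forgetting
         * tsa_unlock(), is reported by -Wthread-safety at build time. */
    }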
Rather than adding more fine-grained locking, would there have been another way? In particular, I wonder if the problem domain of QEMU could have benefited from a thread-per-core architecture. Do guest OSes try to pin high-TPS devices to individual cores these days, and if so, could that provide a natural way to shard the I/O workload?
I didn't go into the various ways in which the AioContext lock was replaced in the article. You're right, sometimes new fine-grained locks weren't necessary.
When there is really only one thread accessing some data, locking isn't needed. That's what was done for the SCSI emulation layer, where request processing happens in only one thread. Here is a new function that was introduced to schedule work on the thread that runs SCSI emulation (a rare operation that is not performance-critical, which lets the rest of the code avoid locks):
https://gitlab.com/qemu-project/qemu/-/blob/master/hw/scsi/s...
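Just to illustrate the pattern (this is a made-up sketch, not the actual helper behind that link): work gets bounced to the thread that owns the state with a one-shot bottom half on that thread's AioContext, something like:

    /* Illustrative only.  The DemoDevice type and function names are
     * invented for this example; only the aio_bh_schedule_oneshot() API
     * is real QEMU code. */
    #include "qemu/osdep.h"
    #include "block/aio.h"

    typedef struct DemoDevice DemoDevice;   /* stand-in for an emulated device */

    typedef struct ResetRequest {
        DemoDevice *dev;
    } ResetRequest;

    /* Runs in the device's "home" AioContext, so it may touch per-device
     * state without taking any lock. */
    static void demo_reset_bh(void *opaque)
    {
        ResetRequest *req = opaque;

        /* ... walk request lists, update counters, etc. on req->dev ... */

        g_free(req);
    }

    /* May be called from any thread.  The hot I/O path never sees a lock;
     * only this rare operation pays for the thread hop. */
    void demo_device_schedule_reset(DemoDevice *dev, AioContext *home_ctx)
    {
        ResetRequest *req = g_new0(ResetRequest, 1);

        req->dev = dev;
        /* aio_bh_schedule_oneshot() is thread-safe; the callback runs exactly
         * once, in the thread that polls home_ctx. */
        aio_bh_schedule_oneshot(home_ctx, demo_reset_bh, req);
    }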
QEMU's IOThreads let the user configure the threads and get something similar to a thread-per-core architecture. But if one thread becomes a bottleneck, then some form of thread synchronization is needed again, even with thread per core. Some problems can be parallelized, and those work well with thread per core.
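For example (image paths and IDs are placeholders), a command line along these lines gives each virtio-blk device its own IOThread, so their event loops run on separate host threads:

    qemu-system-x86_64 ... \
        -object iothread,id=io1 \
        -object iothread,id=io2 \
        -blockdev driver=qcow2,node-name=disk1,file.driver=file,file.filename=disk1.qcow2 \
        -blockdev driver=qcow2,node-name=disk2,file.driver=file,file.filename=disk2.qcow2 \
        -device virtio-blk-pci,drive=disk1,iothread=io1 \
        -device virtio-blk-pci,drive=disk2,iothread=io2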