Yeah, back in 2014, I worked at HP on storage drivers for Linux, and we got 1 million IOPS (4k random reads) on a single controller with SSDs, but we had to do some fairly hairy stuff. This was back when NVMe was new and we were trying to do SCSI over PCIe. We set up multiple ring buffers for command submission and command completion, one of each per CPU, pinned threads to CPUs, and were very careful to avoid locking (e.g. spinlocks). I think we also had to pin some userland processes to particular CPUs to avoid NUMA-induced bottlenecks.
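A rough sketch of that queue layout, in Python just to show the structure. This is not the actual driver code (that was C in the kernel), and every name here (`Ring`, `QueuePair`, `make_queue_pairs`, `submit`, `device_process`, `pin_to_cpu`) is made up for illustration. The point is that each CPU gets its own submission/completion pair, so a thread pinned to CPU n only ever touches pair n and never needs a shared lock:

```python
import os


class Ring:
    """Fixed-size ring buffer meant for one producer and one consumer."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.head = 0  # index of the next item to consume
        self.tail = 0  # index of the next free slot to produce into

    def push(self, item):
        if self.tail - self.head == self.size:
            return False  # ring is full, caller has to back off
        self.slots[self.tail % self.size] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head % self.size]
        self.head += 1
        return item


class QueuePair:
    """One submission ring plus one completion ring, owned by one CPU."""

    def __init__(self, size):
        self.submission = Ring(size)
        self.completion = Ring(size)


def make_queue_pairs(ncpus, size=256):
    # One pair per CPU: no ring is ever shared between CPUs.
    return [QueuePair(size) for _ in range(ncpus)]


def pin_to_cpu(cpu):
    """Pin the calling thread to one CPU where the OS supports it (Linux)."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {cpu})
        return True
    except OSError:
        return False  # e.g. that CPU is not in our allowed set


def submit(pairs, cpu, request):
    # The submitter indexes by its own CPU, so there is nothing to lock.
    return pairs[cpu].submission.push(request)


def device_process(pair):
    """Stand-in for the hardware: complete everything queued on this pair.

    Assumes the completion ring has room; a real device would stall instead.
    """
    done = 0
    while True:
        request = pair.submission.pop()
        if request is None:
            return done
        pair.completion.push(("done", request))
        done += 1
```

This only models the layout. A real driver also has to worry about memory ordering, doorbell writes and interrupt steering, none of which show up here.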
The thing is, up until this point, for the entire history of computers, storage was so slow relative to memory and the CPU that drivers could be quite simple: chuck requests and completions into queues managed by simple locking, and the fraction of time that requests spent inside the driver would still be negligible compared to the time they spent waiting for the disks. Even if you could theoretically make your driver infinitely fast, this would only amount to maybe a 1% speedup. So there was no need to spend a lot of time thinking about how to make the driver super efficient. Until suddenly there was.
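To put rough numbers on that, here's a back-of-the-envelope calculation. The latencies are hypothetical round figures picked for illustration, not measurements, and `max_speedup` and `cpu_budget_us` are made-up helper names:

```python
def max_speedup(driver_us, device_us):
    """Best-case speedup if time spent in the driver dropped to zero."""
    return (driver_us + device_us) / device_us


def cpu_budget_us(iops, cpus):
    """Microseconds of CPU time available per request at a given IOPS."""
    return cpus * 1_000_000 / iops


# Assumed: 50 us in the driver, 5000 us waiting on a spinning disk.
hdd = max_speedup(driver_us=50.0, device_us=5000.0)

# Assumed: same 50 us driver, but a 100 us SSD behind it.
ssd = max_speedup(driver_us=50.0, device_us=100.0)

print(f"disk: {hdd:.2f}x, SSD: {ssd:.2f}x")

# At 1M IOPS spread over 16 CPUs, each request gets 16 us of CPU in total.
print(f"budget: {cpu_budget_us(1_000_000, 16):.0f} us per request")
```

With those assumed numbers, an infinitely fast driver buys about 1% on the disk but 50% on the SSD, which is the whole shift in a nutshell.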
Oh yeah, iirc, the 1M IOPS driver was a block driver. For the SCSI over PCIe stuff, there was the big problem at the time that the entire SCSI layer in the kernel was a bottleneck, so you could make the driver as fast as you wanted, but your requests were still coming through a single queue managed by locks, so you were screwed. There was a whole ton of work done by Christoph Hellwig, Jens Axboe and others to make the SCSI layer "multiqueue" around that time to fix that.