I think that's the entire point of this article: the existing system software APIs that we use aren't a good abstraction for the capabilities of the underlying hardware, leading to poor performance.
With good NVMe devices, single-thread, single-queue performance is much lower than the maximum the drive can deliver.
With increased concurrency and deeper queues, my Samsung 960 Pro, which has been running my Windows 10 desktop for several years, can still do 294k IOPS of random 4k reads and 2.5 GB/s of sequential reads.
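
As a rough back-of-the-envelope illustration of why queue depth matters so much, here is a small sketch using Little's Law (outstanding requests = request rate × latency). The 294k IOPS figure is from my measurement above; the 100 µs per-read latency is purely an assumed, illustrative number and not something I measured on this drive, and I'm assuming "4k" means 4096 bytes. The function names are just ones I made up for the example.

```python
# Back-of-the-envelope model, not a benchmark.

def throughput_bytes_per_s(iops, block_bytes):
    """Bandwidth implied by a given IOPS rate at a fixed block size."""
    return iops * block_bytes


def qd1_iops(latency_s):
    """IOPS ceiling when only one request is ever in flight."""
    return 1.0 / latency_s


def required_queue_depth(target_iops, latency_s):
    """Little's Law: requests in flight = arrival rate * time in system."""
    return target_iops * latency_s


MEASURED_IOPS = 294_000      # random 4k reads, from the measurement above
BLOCK_BYTES = 4096           # assumption: "4k" means 4 KiB
ASSUMED_LATENCY_S = 100e-6   # ASSUMPTION: illustrative per-read latency

if __name__ == "__main__":
    gb_per_s = throughput_bytes_per_s(MEASURED_IOPS, BLOCK_BYTES) / 1e9
    print(f"random-read bandwidth: {gb_per_s:.2f} GB/s")
    print(f"QD1 ceiling at assumed latency: {qd1_iops(ASSUMED_LATENCY_S):.0f} IOPS")
    depth = required_queue_depth(MEASURED_IOPS, ASSUMED_LATENCY_S)
    print(f"requests in flight needed: {depth:.1f}")
```

Under that assumed latency, a single thread issuing one blocking read at a time would top out around 10k IOPS, and it would take roughly 30 reads in flight to reach the measured rate. The exact numbers move with the real latency, but the shape of the argument doesn't: an API that only lets you have one request outstanding per thread can't get anywhere near what the device can do.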