A journey to io_uring, AIO and modern storage devices (clickhouse.tech)
131 points by digikata on May 24, 2021 | 37 comments



Thoroughly surprised by the lack of difference in read latency between the SATA SSD and the NVMe SSD. I thought NVMe was significantly more performant than SATA. I guess the difference is all on the writing end, so if you have a mostly-read workload it's fine.


It's IOPS, not throughput, that NVMe gets you. (Or rather, if you're not being bottlenecked by IOPS, then NVMe won't get you anything over a JBOD of SATA SSDs.)

SATA SSDs will serve you just fine if you're Netflix or YouTube, streaming a bunch of video files to people. High throughput, low IOPS — no advantage to NVMe.

But you'll get a huge boost from NVMe if you're doing a lot of little concurrent random reads from a much-larger-than-memory dataset. For example, if you're Twitter, fetching updates to users' timelines.

If you've got higher concurrency, you'll want several NVMe SSDs behind a hardware RAID controller, in RAID0.

And if you've got really high concurrency, you'll want several of these NVMe SSD RAID controllers, on separate PCI-e lanes, under software RAID0.

That's the point when you're starting to reach those "basically identical to memory" figures that are the use-case for Intel's Optane.
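
As a rough, self-contained sketch of what "lots of little concurrent random reads" looks like (illustrative only; the path, thread count and duration below are placeholders, not anything from the article):

    # Measure aggregate 4 KiB random-read IOPS with many concurrent readers (Linux only).
    # Assumes a pre-created multi-GiB file at TEST_FILE; O_DIRECT bypasses the page
    # cache and needs 4 KiB-aligned buffers and offsets.
    import mmap, os, random, threading, time

    TEST_FILE = "/mnt/nvme/testfile"   # placeholder path
    BLOCK, THREADS, SECONDS = 4096, 32, 10

    def worker(counts, idx):
        fd = os.open(TEST_FILE, os.O_RDONLY | os.O_DIRECT)
        blocks = os.fstat(fd).st_size // BLOCK
        buf = mmap.mmap(-1, BLOCK)     # anonymous mmap is page-aligned
        deadline = time.monotonic() + SECONDS
        n = 0
        while time.monotonic() < deadline:
            os.preadv(fd, [buf], random.randrange(blocks) * BLOCK)
            n += 1
        counts[idx] = n
        os.close(fd)

    counts = [0] * THREADS
    threads = [threading.Thread(target=worker, args=(counts, i)) for i in range(THREADS)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(f"~{sum(counts) / SECONDS:,.0f} random-read IOPS across {THREADS} threads")

Each read is tiny and independent, so the aggregate number is dominated by how many requests the device can keep in flight — which is exactly where NVMe's deep queues help.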


Thanks for this -- this is what I was missing. I was starting to suspect that NVMe is really more of a benefit for "hyperscaler"/IaaS providers... This lays it out well for me, including the how. I rent my hardware, so I don't have access to the hardware RAID stuff, but this really filled in some holes for me.

> If you've got higher concurrency, you'll want several NVMe SSDs behind a hardware RAID controller, in RAID0.

Hmm, why would you want RAID0 here? Faster speed but absolutely no redundancy? What about RAID 1+0/0+1?

> That's the point when you're starting to reach those "basically identical to memory" figures that are the use-case for Intel's Optane.

I need to do more reading on this but thanks for pointing this out


> Hmm, why would you want RAID0 here? Faster speed but absolutely no redundancy? What about RAID 1+0/0+1?

I can't speak to others†, but for our use-case, we're not using NVMe as "persistent storage but faster and smaller" but rather as "main memory but larger and slower".

Our data isn't canonically on our nodes' NVMe pools, any more than it's canonically in the nodes' local memory. The canonical representation of our (PGSQL data warehouse) data is its WAL segments, shipped to/from object storage.

The NVMe pool can be thought of as holding a "working" representation of the data — sort of like how, if you have a business layer that fetches a JSON document from a document store, then the business-layer node's main memory then holds a deserialized working representation of the data.

You wouldn't back up/protect the data on your NVMe pool, any more than you'd back up the memory backing the deserialized JSON. If you wanted to modify the canonical data, you'd modify it in the canonical store. (For JSON, the document database; for our use-case, the WAL segments + base-backups in the object store.)

† I would note that "large ephemeral working-set backing store" is seemingly largely considered by IaaS providers to be the main use-case for having NVMe attached to compute. The IaaS providers that support NVMe (AWS, GCP) only support it as instanced or scratch (i.e. ephemeral) storage, that doesn't survive node shutdown. These are strictly "L4 memory", not stores-of-record.


> I can't speak to others†, but for our use-case, we're not using NVMe as "persistent storage but faster and smaller" but rather as "main memory but larger and slower".

> Our data isn't canonically on our nodes' NVMe pools, any more than it's canonically in the nodes' local memory. The canonical representation of our (PGSQL data warehouse) data is its WAL segments, shipped to/from object storage.

I appreciate this insight -- one of the features that drew me to KeyDB[0] was their FLASH feature. With drives being as fast as they are these days, it feels like a tiered cache of DRAM + NVMe disk isn't so bad! Great to see others thinking this way as well.

Redis is one thing but treating pg this way is somewhat interesting -- Do you also have the usual hot standby setup with synchronous replication just in case of a machine going down before the latest WAL segment is shipped out?

> † I would note that "large ephemeral working-set backing store" is seemingly largely considered by IaaS providers to be the main use-case for having NVMe attached to compute. The IaaS providers that support NVMe (AWS, GCP) only support it as instanced or scratch (i.e. ephemeral) storage, that doesn't survive node shutdown. These are strictly "L4 memory", not stores-of-record.

Reasonable, thanks for explicitly noting this. I think the write characteristics of NVMe (and technically the read characteristics too, as this article shows) make it attractive and don't preclude using it as a store-of-record medium, but what you said definitely squares with the other configurations I've seen and the offerings at AWS (I haven't really looked at GCP node types).

[0]: https://github.com/EQ-Alpha/KeyDB/wiki/Legacy-FLASH-Storage


> Do you also have the usual hot standby setup with synchronous replication just in case of a machine going down before the latest WAL segment is shipped out?

We don't have to — the data in our data warehouse is a rearrangement/normalization/collation of origin data from public sources. If our DB cluster dies, we stand up a new one from a backup, let it catch up from WAL segments, and then our ETL pipeline will automatically re-bind to it, discover the newest data it has available inside it, and begin ETLing to it. From our ETL ingestion-agent's perspective, there's no difference between "loading to a DB freshly restored from backup because it lost 10 minutes of updates" and "loading to a DB that is/was up-to-date because we just received some new origin data."

(Our situation of our primary-source data being easily and reliably re-fetched is pretty unique to our vertical; but you can put pretty much any system into the same shape by first writing your primary-source data to an event store / durable unbounded-size MQ [e.g. Kafka] as you receive it; and then ETLing it into your DB from there. Forget DB disaster recovery; just keep the MQ data safe, and you'll always be able to replay the process that translates an MQ event-stream into a DB state.)
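
A minimal sketch of that "replay the MQ into DB state" shape, with hypothetical topic/table names (this is an illustration, not our actual pipeline):

    # Consume the primary-source event stream and upsert it idempotently, so that
    # replaying from offset 0 against a fresh DB rebuilds the same state.
    import json
    import psycopg2
    from kafka import KafkaConsumer  # kafka-python

    conn = psycopg2.connect("dbname=warehouse")
    consumer = KafkaConsumer(
        "events",                        # placeholder topic holding primary-source data
        bootstrap_servers="localhost:9092",
        group_id="etl-loader",
        auto_offset_reset="earliest",    # a fresh consumer group replays history
        enable_auto_commit=False,
    )

    for msg in consumer:
        event = json.loads(msg.value)
        with conn, conn.cursor() as cur:
            # Upserts converge: loading the same event twice yields the same row, so
            # "freshly restored from backup" and "already up to date" look identical
            # to the loader.
            cur.execute(
                "INSERT INTO facts (id, payload) VALUES (%s, %s) "
                "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                (event["id"], json.dumps(event)),
            )
        consumer.commit()                # ack the offset only after the DB commit

A real loader batches and handles schema evolution, of course; the point is just that the DB state is a pure function of the event log, so losing the DB is recoverable as long as the log is safe.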

We do have hot standbys, but they're async (using wal-g — master writes WAL segments to object store; replicas immediately fetch WAL segments from object store, as quickly as they can manage.) They're for query load scaling, not disaster recovery.


Appreciate the detailed description of what works for your warehousing solution!

> (Our situation of our primary-source data being easily and reliably re-fetched is pretty unique to our vertical; but you can put pretty much any system into the same shape by first writing your primary-source data to an event store / durable unbounded-size MQ [e.g. Kafka] as you receive it; and then ETLing it into your DB from there. Forget DB disaster recovery; just keep the MQ data safe, and you'll always be able to replay the process that translates an MQ event-stream into a DB state.)

I'm familiar with this almost-event-sourcing model. AFAIK it's what banks and financial firms do (at scales I can't imagine, and with technology they'll never open source), and it's the only way to build a rock-solid, lose-nothing architecture that is relatively easy to reason about and still flexible.

I have no need to do this, but Debezium w/ NATS + JetStream/Liftbridge would be my pick for it, since I'm a sucker for slightly newer tooling that boasts of being "cloud native" (i.e. easy for me to administer). The boring choice would be Debezium + Kafka with some RAIDx drives below it all, just in case.

> We do have hot standbys, but they're async (using wal-g — master writes WAL segments to object store; replicas immediately fetch WAL segments from object store, as quickly as they can manage.) They're for query load scaling, not disaster recovery.

Read-only replicas for load scaling are pretty common; I was surprised to hear wal-g, though -- PG has an absolutely dizzying array of replication options[0], and I don't think I landed on wal-g as the "best" choice for me (with a heavily application-centric viewpoint), so this gives me reason for some pause. I think I mostly had it down to Barman or pgBackRest. The object-store-capable options always caught my eye (and are what put backrest over the top for me, IIRC); I'll go back and revisit. Does Covalent maintain an engineering blog? Because it sounds like y'all are doing stuff I'd love to read about and cargo cult^H^H critically consider.

Weirdly enough, I've just been down a bit of a rabbit hole in the data engineering world and have been thoroughly impressed by dbt[1]. The concept is simple, but combined with a powerful database like PG it holds a lot of promise in my mind. I'm mostly excited by the idea of exposing the power of Postgres and its many extensions and features through the SQL-first view of the world that dbt encourages, and building some sort of "data platform" that is nothing more than enshrining and exposing the best ways to do things with that stack.

Is dbt or a similar tool a part of your stack in any way? Listening to people who sort of... distribute hype around data engineering, you'd think Snowflake is the greatest thing since sliced bread, but the pattern of Debezium + dbt + a big ol' well-configured and replicated Postgres cluster feels like more than enough. I'm aware, of course, that Snowflake is really just a management layer (essentially, they're a Redshift-as-a-service provider for any cloud), but it feels like that gets lost in discussion. Sounds like you all have a well-considered data pipeline (so the Snowflake part is done for you); I'm wondering what the other links in the chain look like. Any Metabase or Superset at the end of the pipeline?

[EDIT] Want to note the recent presentation from CERN on the topic of replication that most recently influenced me[2] (skip to the last 2 slides if you want the TL;DR)

[0]: https://wiki.postgresql.org/wiki/Binary_Replication_Tools

[1]: https://docs.getdbt.com

[2]: https://www.slideshare.net/AnastasiaLubennikova/advanced-bac...


pgBackRest is probably strictly better than wal-g for our use-case, but I already knew wal-g was at least satisfactory in solving our problems (from previous usage in implementing something Heroku-Dataclips-alike before), so I threw it on and filed an issue for doing a compare-and-contrast down the line. So far it hasn't caused too many issues, and does indeed solve our main issue: allowing truly-async replication (e.g. replication to DCs in other countries, where the link has a huge RTT, high potential packet loss, and might even go down for hours at a time due to, say, politics — without that causing any kind of master-side buffering or slowdown.)

> Does Covalent maintain an engineering blog?

Not yet, but we've certainly been considering it. Probably something we'll do when we have a bit more breathing room (i.e. when we get out from under all the scaling we're doing right now.)

> Is dbt or a similar tool a part of your stack in any way?

Nothing like that per se (though it'd very much be a data-layer match for how we're doing infra management using k8s "migrations" through GitOps. Imagine SQL schema-definition k8s CRDs.)

But that's mostly because — to our constant surprise — our PG cluster's query plans improve the more we normalize our data. We hardly build matviews or other kinds of denormalized data representations at all (and when we do, we almost always eventually figure out a partial computed-expression GIN index or something of the sort we can apply instead.)
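
For a concrete (made-up) example of the kind of index I mean -- an expression index restricted by a WHERE clause, instead of a separate denormalized table:

    # Hypothetical table/columns; just the shape of a partial, computed-expression
    # GIN index, not any real schema of ours.
    import psycopg2

    with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
        cur.execute("""
            CREATE INDEX IF NOT EXISTS docs_tags_gin
                ON docs USING gin ((payload -> 'tags'))   -- expression over jsonb
                WHERE NOT archived                         -- partial: live rows only
        """)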

We are planning to offer our users something like dbt, and we may or may not use dbt under the covers to manage it. (Analogous to how GitHub created their own git library to use in their Rails backend.)

One thing we do want to enable is avoiding the recomputation of intermediate "steps" within long chained CTE queries (imagine: taking 10 seconds to compute A, and then using it in a join to quickly select some Bs; then, in another query, taking another 10 seconds to compute A again, because now you need to select Cs instead), without keeping around any sort of persistent explicitly-materialized intermediate, or writing ugly procedural logic. https://materialize.com/ seems promising for that. (Just wish it was a library I could shove inside PG, rather than an external system that is apparently going to grow to having its own storage engine et al.)

> Any Metabase or Superset at the end of the pipeline?

We've always mentioned to people that Tableau has first-class support for connecting to/querying Postgres. Given the type of data we deal with, that's usually all they need to know. :)


> pgBackRest is probably strictly better than wal-g for our use-case, but I already knew wal-g was at least satisfactory in solving our problems (from previous usage in implementing something Heroku-Dataclips-alike before), so I threw it on and filed an issue for doing a compare-and-contrast down the line. So far it hasn't caused too many issues, and does indeed solve our main issue: allowing truly-async replication (e.g. replication to DCs in other countries, where the link has a huge RTT, high potential packet loss, and might even go down for hours at a time due to, say, politics — without that causing any kind of master-side buffering or slowdown.)

I always thought that Dataclips was just them running on ZFS, but honestly this would make more sense. I don't use Heroku much, so I didn't see any limitations that jumped out at me as "oh, they have that limitation because they're doing pgbackrest".

> Not yet, but we've certainly been considering it. Probably something we'll do when we have a bit more breathing room (i.e. when we get out from under all the scaling we're doing right now.)

Yup, sounds reasonable. I certainly am not a customer (nor in a position to be one, really, as I don't do any blockchain), so satisfying me brings you almost negative ROI. Scaling sounds fun in and of itself (when it isn't terrifying, that is) -- thanks for sharing so much about the setup!

> Nothing like that per se (though it'd very much be a data-layer match for how we're doing infra management using k8s "migrations" through GitOps. Imagine SQL schema-definition k8s CRDs.)

Now this is fascinating -- I'm also very much sold on k8s, and SQL-schemas-as-CRDs sounds like it'd be relatively easy to implement and very high ROI. Migrations were always a sort of weird infrastructure-mixed-with-dev thing, and being able to tie the CRDs for migrations to the StatefulSet running the DB and the Deployment running the application seems like it could open up a whole new world of more intelligent deployment. Cool stuff.

> But that's mostly because — to our constant surprise — our PG cluster's query plans improve the more we normalize our data. We hardly build matviews or other kinds of denormalized data representations at all (and when we do, we almost always eventually figure out a partial computed-expression GIN index or something of the sort we can apply instead.)

I don't know how people use open source DBs that are not postgres. I mean I do, but I can't wait until zheap removes the most common reason for people choosing MySQL to begin with.

> We are planning to offer our users something like dbt, and we may or may not use dbt under the covers to manage it. (Analogous to how GitHub created their own git library to use in their Rails backend.)

dbt does look pretty good; it seems like just enough structure. I do also want to bring up ddbt[0], in case y'all do a research/exploratory sprint on it.

> One thing we do want to enable is avoiding the recomputation of intermediate "steps" within long chained CTE queries (imagine: taking 10 seconds to compute A, and then using it in a join to quickly select some Bs; then, in another query, taking another 10 seconds to compute A again, because now you need to select Cs instead), without keeping around any sort of persistent explicitly-materialized intermediate, or writing ugly procedural logic. https://materialize.com/ seems promising for that. (Just wish it was a library I could shove inside PG, rather than an external system that is apparently going to grow to having its own storage engine et al.)

Interesting -- would Incremental View Maintenance[2] work for this? I don't know if this is useful to you, but the dbt conference I got sucked into watching most of had a talk with the Materialize CEO and some other people. I don't remember it being particularly high-signal (let's be real, it's basically a three-way ad for the CEOs involved), but it might be worth a watch[1].

> We've always mentioned to people that Tableau has first-class support for connecting to/querying Postgres. Given the type of data we deal with, that's usually all they need to know. :)

I really underestimated how much Tableau is the standard (probably because I'm not a data person) -- basically, between that, Looker, and Power BI, I'm not sure people use much else, even if it's open source/free.

[0]: https://github.com/monzo/ddbt

[1]: https://www.youtube.com/watch?v=0DDNjB4O0As

[2]: https://github.com/sraoss/pgsql-ivm


> Incremental View Maintenance

It’s nice to have that as an automatic feature rather than something that you have to create for yourself with stored procedures (doing e.g. dynamic programming in the DB), but a matview is still a matview — it’s a real table taking up space on disk and getting replicated through the WAL.

The thing Materialize gives you is the same thing you get from Apache Beam nee Google Cloud Dataflow at cluster scale, or from Elixir’s Flow / Java’s Akka Streams at the in-process scale on the business layer. Namely, the ability to have one stream node in a flow digraph that feeds N outputs that consume it at different rates, and to keep just enough of the intermediate stream node materialized in a buffer to keep all those readers served. Or: like the kind of ACK-truncated buffer that serves a Kafka/Redis Streams consumer group, but with all the stages in the same address-space without serialization+remoting overhead in-between.
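
A toy sketch of that idea (mine, not Materialize's internals): one producer, N readers at different speeds, and a buffer truncated up to the slowest reader's position:

    # Toy model of a garbage-collected, buffered stream node: items are retained
    # only until every consumer has read past them (like an ACK-truncated log,
    # but in-process).
    from collections import deque

    class SharedStream:
        def __init__(self, consumers):
            self.buf = deque()                      # retained, not-yet-fully-acked items
            self.base = 0                           # absolute index of buf[0]
            self.pos = {c: 0 for c in consumers}    # next index each consumer reads

        def push(self, item):
            self.buf.append(item)

        def read(self, consumer):
            i = self.pos[consumer]
            if i - self.base >= len(self.buf):
                return None                         # this consumer is caught up
            item = self.buf[i - self.base]
            self.pos[consumer] = i + 1
            low = min(self.pos.values())            # slowest consumer's position
            while self.base < low:
                self.buf.popleft()                  # GC everything everyone has seen
                self.base += 1
            return item

    s = SharedStream(["b_query", "c_query"])
    for a_id in range(5):
        s.push(a_id)                                # the expensive "A" rows, computed once
    s.read("b_query"); s.read("b_query")            # the B side races ahead...
    print(len(s.buf))                               # 5: the C side hasn't read anything yet
    s.read("c_query")
    print(len(s.buf))                               # 4: item 0 acked by both, dropped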

This “garbage-collected buffered materialization” is a very important property, if your A query returns a million A IDs, and you’re doing a LEFT JOIN from Bs to As that’s dropping most of the As because they don’t have a related B. You don’t want to be holding onto — and replicating! — a thousand concurrent-request copies of that million A ID list. (The million A IDs are also why you’re doing CTEs in your B/C queries in the first place. With a small list of A IDs, it’d be sensible to just round-trip the A ID list to the client, getting sent back as IN (VALUES ...) bind-params for the B/C queries. But 1. naively, those million IDs become a million bind-params, and that’s stupid, slow, and actually impossible [PG only accepts 65K bind-params per statement!]; and 2. trying to pass the fetched list as an A.id[] and using <@ will deoptimize the search.)

The closest thing PG has is the ability to BEGIN a transaction; create a TEMPORARY table for that transaction; SELECT INTO the temporary table the result of a query; and then use that temporary table to run two queries. Which you have to do in PL/pgSQL (losing the benefits of “whole program” query-plan optimization you get from CTE queries); and then you still have to do something at the end to get your two queries' differently-shaped outputs back to the user. (For example, encoding every row into JSONB with an explicit embedded row-type-tag, and then double-decoding at the receiver; materializing the results of the queries into two more UNLOGGED [but not TEMPORARY] tables, fetching the results from them in two more queries, and then dropping them; etc.)
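
For concreteness, a client-side sketch of that temp-table workaround (hypothetical tables and predicate; the single-call PL/pgSQL version has the same shape, minus the round trips):

    # Pay for the expensive "A" step once per transaction, then run the
    # differently-shaped B and C queries against it.
    import psycopg2

    conn = psycopg2.connect("dbname=warehouse")
    with conn, conn.cursor() as cur:                              # one transaction
        cur.execute("""
            CREATE TEMPORARY TABLE tmp_a ON COMMIT DROP AS
            SELECT a.id FROM a WHERE expensive_predicate(a.id)    -- the ~10s step, run once
        """)
        cur.execute("SELECT b.* FROM b JOIN tmp_a USING (id)")    # cheap: reuses tmp_a
        bs = cur.fetchall()
        cur.execute("SELECT c.* FROM c JOIN tmp_a USING (id)")    # cheap: reuses tmp_a
        cs = cur.fetchall()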


> It’s nice to have that as an automatic feature rather than something that you have to create for yourself with stored procedures (doing e.g. dynamic programming in the DB), but a matview is still a matview — it’s a real table taking up space on disk and getting replicated through the WAL.

Ah, yeah -- at the end of the day, unlogged materialized views aren't supported by Postgres[0], and a patch that introduces temporary materialized views is stuck behind some refactoring that started in 2019[1], which is unfortunate.

> (explanation of materialize)

Thanks for this -- I think I now see more clearly what you're getting out of Materialize. I'm not sure I quite understand the stream analogy, because the data being buffered here feels chunked (though I guess it could be incrementally maintained with some more smarts), but I think I understand the issue: ongoing queries need access to this temporarily-useful partial result. Basically a cross-session intermediate-query-result cache.

This feels like it could be reduced to a custom table access method plugin, but if that temporary table patch gets merged it could be even easier to hack something together.

[0]: https://www.postgresql-archive.org/CREATE-UNLOGGED-MATERIALI...

[1]: https://www.postgresql.org/message-id/CAKLmikMgkuwFKHem%2BSf...


> It's IOPS, not throughput, that NVMe gets you. (Or rather, if you're not being bottlenecked by IOPS, then NVMe won't get you anything over a JBOD of SATA SSDs

It's both. Throughput is a significant issue with SATA: you need about six SATA drives to match the throughput of a single PCIe 3.0 x4 drive (SATA III tops out around ~550 MB/s usable, versus roughly 3.5 GB/s for four lanes of PCIe 3.0). The SATA JBOD will also have significantly higher CPU overhead at combined line speed than NVMe.

> If you've got higher concurrency, you'll want several NVMe SSDs behind a hardware RAID controller, in RAID0.

The few examples of RAID controllers with NVMe support I've had access to increased latency enough to be quite painful.


NVMe will pull ahead with higher concurrency due to the much larger command queues. With single reads, or small sets of reads, the overall bus throughput doesn't matter much, and the limitation is how fast the controller on the SSD can pull in data and put it on the bus.


> Thoroughly surprised by the lack of difference in read latency between the SATA SSD and the NVMe SSD.

You did notice that the scales are different, yes? At the low end the best-case latencies are similar, but NVMe has much better worst cases:

> For a 4 kilobytes block size the average time improved only a little, but the 99 percentile is two times lower.

and as the block size increases, NVMe's advantage grows.


Do you mean the logarithmic scale? Yeah, I did notice that, but I definitely did overlook the 2x difference at the 99th percentile -- thanks for drawing my attention to that.

Internally I think I just had a mostly wrong intuition about how much of a technology shift NVMe was compared to SATA -- it's a bit hard to tell whether we're approaching theoretical limits or whether SATA was just very good for its time.

As a side note I wish this post didn't try to fit so much in one graph, but I can easily imagine someone complaining about the exact opposite so that's neither here nor there I guess. Separation by color and shade/intensity of color rather than the line patterns might have helped. Or maybe just completely separate charts for some of the dimensions... not sure.


I don't have scientific benchmarks, but I did observe noticeable performance differences in data ingestion workflows between a machine with a SATA 3 drive and a machine with an NVMe drive (and a mid-range one at that, with 2200/1700 MB/s speeds).

But the gains only come if you set up the machine properly; e.g., an NVMe drive can't make much use of its high concurrency if you don't have a strong multicore CPU and a newer motherboard that allows for more traffic.


The generation of flash chips in the drive can make a major difference there, as can how full the drive is, its wear, and its internal firmware architecture. The manufacturers also target different tiers of performance with different model lines, and that can be a huge difference.


I'm confused by how high their latencies are; something isn't right (/me goes and spends too much time that I should instead spend on paperwork).

The biggest contributor is that they tested random reads - that'll likely end up bottlenecked by the FTL, not the protocol & interface latency. Which, especially on consumer devices, can add a lot.

Testing sequential, non-buffered, 4k IO with fio, on my boot disk, a Samsung SSD 970 PRO 1TB, I see:

Average latency of 13.8us (99.9th percentile 33us), with fio running on the socket the NVMe is attached to, and 16.22us on the remote node (99.9th 37us). With polling that goes down to 8.19 / 8.79us.

If I instead test random reads, I see 69.68 / 70.63us (99.9th 147us/ 176us) local/remote node respectively, if I use a 128MB file. Unsurprisingly the local/remote difference nearly vanishes, as the transfer itself isn't the bottleneck anymore. Nor does polling help much. With larger files (I tested up to 128GB, didn't have enough free space for more), that increases a bit, but not by much.
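
Roughly the sort of fio invocation that exercises this (illustrative parameters, not my exact job files):

    # 4 KiB direct reads against a test file, sequential then random, with the
    # pvsync2 engine (add --hipri to test polled completions). Path/size/runtime
    # are placeholders; fio prints completion-latency percentiles by default.
    import subprocess

    for pattern in ("read", "randread"):
        subprocess.run([
            "fio",
            f"--name={pattern}",
            "--filename=/mnt/nvme/fio-testfile",   # placeholder path
            "--size=128m",
            f"--rw={pattern}",
            "--bs=4k",
            "--direct=1",            # non-buffered: bypass the page cache
            "--ioengine=pvsync2",
            "--iodepth=1",
            "--runtime=30",
            "--time_based",
        ], check=True)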

Even taking the effect of randomness into account, my disk is several years old, and the latencies reported in the post (and linked PDF) are well above what I see, on a two socket system, with plenty other things going on at the same time. I see considerably lower latencies on two different fast server grade SSDs as well (not quite to the optane level though).

The PDF's benchmark description says:

> The main goal of this work is to evaluate the hardware and not the filesystem. On each device we create a file of size approximately 90% that of a device size. We fill it with specific data to be sure that the space is allocated and that the reading takes place. To assert the latter we verify that received bytes match the ones we expect.

I don't really understand why they're doing that. On most disks that will ensure that the FTL cannot operate from cache, even if that cache is generously sized. IME even random workloads tend to have more locality than a uniform distribution over 90% of a 4TB device. But more importantly, this seems to counter the author's stated goals:

> In this work we present a detailed overview and evaluation of modern storage reading performance with regard to available Linux synchronous and asynchronous interfaces.

But the test setup chooses the scenario that is least likely to be influenced by the choice of interface?

There's a lot of odd stuff in the later sections. It's not at all clear what 7.1 really measures. The numbers are too high for QD1 sequential reads (even at 10us latency one would just be a bit below 1GB/s), but too low for high-QD sequential reads. So is it buffered - but then I see higher throughput at 4kB? Then 7.2 talks about "AIO" in the graphs, but I think it just means doing one read per thread. But how would that be "asynchronous reading"?


Hi!

Thanks for taking an interest in the paper. First of all, I'm sorry the setup doesn't meet your expectations. That's always the problem with benchmarking: why measure one thing and not another. In my case I was interested in a uniformly random distribution, since the whole work was motivated by the task of putting a key-value storage backend on NVMe. Hot data is already present in main memory, and NVMe is used for a long tail of infrequently accessed but voluminous data. That said, maybe some other distribution would match the real workload more closely; I just don't know of a better estimate.

It's pretty interesting that you question the working-set size. Given the complexity of FTLs, it wouldn't be surprising if latency depended on working-set size. I didn't do that kind of experiment, but I hope this discussion can motivate somebody to take such measurements. Anyway, thanks for the timings you provided; I'll remember them and keep them in mind.

As for section 7.1, QD was somewhere between 16 and 64, but I don't remember exactly. In section 8 more attention is given to QD, and I try to pick the best one.

Section 7.2 is probably the most confusing, due to the complexity of the read pattern. I mention there that I'm interested in requests where several disjoint blocks are requested at once. Naturally, that is transformed into several AIO requests. You have every right to complain about the asynchronicity here, because the reader waits for all of them to complete before issuing the next request. It's just that the AIO interface makes this kind of compound request more efficient than issuing the reads via pread, which is what the section demonstrates.


I mean, ultimately SATA and NVMe are just interconnects. The biggest difference is throughput. Yes, there can be some minor latency or round-trip gains in the protocol, but throughput is the biggest difference.


The number of non-cacheable writes for SATA/AHCI is, IIRC, 3x that of NVMe command submission (1), which does increase the lower bound of achievable latency. That doesn't matter for slower drives, but at the very low-latency end it's a measurable impact.

Lower confidence: I think there's also a latency difference due to better control over interrupts with NVMe than with SATA. Especially on a multi-socket system, or even a higher-core-count single-CPU system, that's quite a win.


Hey I think you might have forgotten to add your source for (1) to this analysis!


I can't quite tell if you're cheekily suggesting I should provide sources for my claims, or whether you misunderstood (1) to be a reference to some source? If the latter, it was just intended to document the number of uncached writes for NVMe.


If the latter, I think I just read the specs at some point & traced the Linux drivers.

There's prominent references to needing just one MMIO write for NVMe command submission in the NVMe spec:

"Does not require uncacheable / MMIO register reads in the command submission or completion path.

A maximum of one MMIO register write is necessary in the command submission path."

Page 7 of https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4b-2...

I didn't quickly re-find the relevant SATA/AHCI spec parts, but there's some references in other places, e.g. https://spdk.io/news/2019/05/06/nvme/ :

"The NVMe specification is designed to significantly reduce the number of required MMIO compared to older specifications like AHCI. AHCI requires as many as 7 total MMIO operations (many of which are reads) per I/O, whereas NVMe requires an MMIO write only to ring the doorbell both on the submission and completion side. MMIO writes are posted, meaning the CPU does not stall waiting for any acknowledgement that the write made it to the PCIe device. This means that MMIO writes are much faster than MMIO reads. That said, they still have a significant cost."

And even that one doorbell write (for each of submission and completion) can be batched.

I'm not sure why I remember three - that might just have been the average number in simple scenarios, rather than the max.


Definitely the latter! I thought (1) was a pointer to some source.


NVMe is a higher-performance interface, so initially drive manufacturers used it on their higher-performance SSDs.


Pretty interesting to see this and the other io_uring article on the front page today, given yesterday's post,

The Unwritten Contract of Solid State Drives (2017) - https://dl.acm.org/doi/10.1145/3064176.3064187 - https://news.ycombinator.com/item?id=27260522

I'm pretty sure io_uring didn't exist then, and wonder if the new kernel interface would change anything about the paper.


Everything I've seen says using the no-op scheduler for very-low-latency devices is the way to go - particularly for NVMe and Optane devices. Interesting that it doesn't seem to be mentioned here, or in the linked paper?


Recent Linux kernels default to not using an IO scheduler with devices that support multiple queues, which includes NVMe SSDs but not SATA (at least through AHCI HBAs). Individual distros can override the kernel's defaults. However, I'm not sure the difference between no-op and mq-deadline IO scheduling for SATA SSDs would be big enough to matter for the purposes of this article's measurements.
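
(You can check what you actually got per device; the bracketed entry is the active scheduler, e.g. "[none] mq-deadline":)

    # List the active I/O scheduler for each block device. Writing a scheduler
    # name into the same sysfs file (as root) changes it.
    import glob, pathlib

    for p in sorted(glob.glob("/sys/block/*/queue/scheduler")):
        dev = pathlib.PurePath(p).parts[3]          # /sys/block/<dev>/queue/scheduler
        print(f"{dev:12s} {pathlib.Path(p).read_text().strip()}")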


It's a really annoying proxy for determining the default configuration :(. There are plenty of devices with multiple queues that benefit from the increased merging a scheduler makes possible - particularly around writes, where a lot of consumer SSDs are weak. And conversely, there are plenty of workloads on single-queue devices that are hurt by the scheduler and by some of the other heuristics that trigger for single-queue devices...


FWIW, my own measurements tell a somewhat more complex story. Not hugely for reads, but for writes. And less so for the higher-end drives (with plenty of fast cache). Plenty of real-world, write-heavy scenarios can benefit from merging at positions other than the head/tail (which is what noop can do in the right circumstances). While FTLs are getting better and better, flash density is increasing (the number of QLC disks is growing, SLC ones are shrinking), and small writes are getting ever more expensive. This is especially bad for workloads involving fsync/O_DSYNC writes, because on drives with volatile caches those defeat the drive's efforts to hide the cost of small writes by buffering them.

Of course, there are also a lot of write workloads that'd still never see merges, making a scheduler just unnecessary cost.

It also doesn't help that none of the current Linux schedulers really work well for local low-latency storage. mq-deadline has completely antiquated heuristics and default constants that clearly come from the spinning-disk age. BFQ is too complicated (but works surprisingly well for higher-latency / throttled storage).

Nor does it help that nearly every storage benchmark ensures that no merging will ever occur outside of "plugging" (basically, sets of IOs submitted together). No wonder they never see any benefit from schedulers :(


I like the xkcd-style graphs. Anybody know what package that is?


It's a little over-the-top for my taste in this article. I would much rather have something a little crisper.



Thanks


Very likely matplotlib
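
Specifically matplotlib's xkcd mode -- something like this (made-up data, just to show the hand-drawn look) gets you charts like the article's:

    import matplotlib.pyplot as plt

    with plt.xkcd():
        fig, ax = plt.subplots()
        ax.plot([4, 8, 16, 32], [90, 100, 130, 190], label="SATA SSD")
        ax.plot([4, 8, 16, 32], [80, 85, 95, 120], label="NVMe SSD")
        ax.set_xlabel("block size, KiB")
        ax.set_ylabel("read latency, us")
        ax.legend()
        plt.show()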


Thanks



