I liked most of the piece, but some bits rubbed me the wrong way:
> I was taken by surprise by the fact that although every one of my peers is certainly extremely bright, most of them carried misconceptions about how to best exploit the performance of modern storage technology leading to suboptimal designs, even if they were aware of the increasing improvements in storage technology.
> In the process of writing this piece I had the immense pleasure of getting early access to one of the next generation Optane devices, from Intel.
The entire blog post is complaining about how great engineers have misconceptions about modern storage technology, and yet to prove it the author had to obtain benchmarks from early access to next-generation devices...?! And to top it off, from this we conclude "the disconnect" is due to the APIs? Not, say, from the possibility that such blazing-fast components may very well not even exist in users' devices? I'm not saying the conclusions are wrong, but the logic surely doesn't follow... and honestly it's a little tasteless to criticize people's understanding if you're going to base the criticism on things they in all likelihood don't even have access to.
Optane has been commercially available for five years already and it's not used in any device I'm aware of. Assuming it will find broad adoption at this point seems like a bad bet.
Optane in consumer devices is in a weird place. Write durability doesn't matter: SSD manufacturers reduce rewrite endurance with every new generation in the name of cheaper prices, and users are happy to accept that downgrade (see the Samsung 980 Pro vs the Samsung 970 Pro for a recent example). Speed matters to some extent, but many users don't observe performance improvements in their typical tasks when upgrading from a SATA SSD to an M.2 PCIe SSD, so they definitely wouldn't notice the speed improvement of going from M.2 to Optane. And yet Optane is not cheap.
I just don't see a place for Optane in any user devices. That leaves servers, where it's not an obvious choice, or the tiny market of benchmark enthusiasts. Maybe someone like Apple, with its vertical integration, could exploit its extreme speed to achieve something awesome, but I doubt it; they don't like to depend on a single supplier.
>Optane in consumer devices is in a weird place. Write durability doesn't matter: SSD manufacturers reduce rewrite endurance with every new generation in the name of cheaper prices, and users are happy to accept that downgrade (see the Samsung 980 Pro vs the Samsung 970 Pro for a recent example).
The odd thing about rewrite wear is that it shrinks very quickly the bigger your drive is. In the past you'd have a 128GB boot SSD and install a few games on it. That meant constantly deleting and reinstalling games on that low-capacity SSD, driving up the rewrite count.
Nowadays you can get 500GB SSDs for a decent price. If you can install every game, you might never have to rewrite data through uninstallation/reinstallation. A lot of files are only written once and then never changed, and with greater capacity you can have more of those "write once" files.
You the consumer might not feel the need to rewrite, but the drive itself has to do this periodically as a matter of internal flash maintenance. Even read disturb (think RowHammer, but for NAND) is a thing. Then you have MLC/TLC technology in conjunction with modern small geometries leading to faster charge leakage (worse data retention), and you end up with failures like the Samsung 840 slowing down to a crawl, or older drives forgetting all their data after being unplugged for a while, which led to a Samsung 840 firmware update forcing the drive to rewrite itself periodically in the background. https://www.techspot.com/news/60362-samsung-fix-slow-840-evo...
> It was announced in July 2015 and is available on the open market under brand names Optane (Intel) and subsequently QuantX (Micron) since April 2017.
I think that's the entire point of this article: the existing system software APIs that we use aren't a good abstraction for the capabilities of the underlying hardware, leading to poor performance.
Single-thread, single-queue performance is much lower than the maximum with good NVMe devices.
With increased concurrency and deeper queues, my Samsung 960 Pro, which has been running my Windows 10 desktop for several years, can still do 294k random 4K read IOPS and 2.5GB/s sequential reads.
Please forgive my naivety first, and second please forgive my disarray. I'm keeping the first question unchanged because I think it's the way a lot of people will be thinking about Optane, and I've realised that this duality, the memory and storage modes (and it doesn't end there either), needs far better marketing than I think Intel is capable of.
I immediately assumed that you were running Optane DIMMs and enjoying life with great sequential speed and acceptably OK latency, at memory capacities several multiples of the normal RAM address space for your machines.
I forgot not only that Optane M.2 and AIC SKUs exist, but that they in fact offer an incredible value improvement on smaller systems. It's further up the desktop computational capacity scale where the use cases aren't as elegant as the fairly easy to circumscribe NUC applications (at least those I can imagine).
Second, I forgot that Intel shipped the ScaleMP-authored RAM-to-block driver for Optane that lets you tell your OS to treat storage-attached Optane as RAM.
Because of unrelated factors, the systems acquisition I was running in H2 2018 is starting over now, and as a consequence I never tested components, so I'm not sure about relevance and second-order roadmap effects. Since the only pixels I've read about the RAM-to-block driver option have been in the launch specifications, and despite my three comments about this, I have no idea if you actually can use storage Optane as RAM.
(Dropping the "storage class" and "memory class" semantics would be my first edict at the helm of Intel strategy. I bet this nomenclature confuses internal planning and delivery itself; it's so insidiously confusing I could write one of those hideous corporate communications "style guides" about the possibilities for disaster here.)
Below is my original question, which I would still dearly appreciate learning your answer to. But I owe you an apology for doubting you, because I originally thought that Optane could take a lot of beating in a 1L chassis and smaller. Check out Patrick at Serve The Home for a super series of reviews of inexpensive, bargain-performance mini NUC-style desktops; I'm sure you get no better than with a stick of Optane added (to a PCIe riser adapter; n.b. some of these tiny models give you bifurcation on their only PCIe slot, which could transform your use cases).
[Original below]
Intel "new unit of computing" skus have block storage drivers for Optane???
touisteur, I must beg your indulgence for my desire for knowledge! Can you possibly post a SKU / model number of the most capable NUC you have working with Optane?
BTW you asked for a SKU: I have a good bunch of NUC7i7BNHX1 units and some i5 models with similar Optane storage. I don't know whether they're byte-addressable; I did talk about mmap and speed...
Yeah, how many people are running apps on servers backed at all, or even partially, by NVMe SSDs? Where I work, our on-prem stuff is basically all network storage.
You can do pretty amazing things with well designed (onprem) networked storage with NVMe drives/arrays.
At the company I was working with at the time, I replaced the traditional "enterprise" HPE SANs with standard Linux servers running a mix of NVMe and SATA SSDs, providing highly available, low latency, decent-throughput iSCSI over the network.
Gen 1, back in 2014/2015, did something like 70K random 4K read/write IOPS per VM (running on Xen back then) and would just keep scaling until you hit the cluster's ~4M IOPS limit (minus some overhead, obviously).
Gen 2 provided between 100K and 200K random 4K IOPS to each VM, up to a limit of about ~8M on the underlying units (which again were very affordable and low maintenance).
This provided very good storage performance (latency, throughput, and fast, minimally disruptive failover and recovery) for our apps. Some of them were written in highly blocking Python code and needed to be rewritten async to get the most out of it, but it made a _huge_ (business-changing) difference and saved us an insane amount of money.
These days I've moved into consulting and all the work I do is on GCP and AWS but I do miss the hands on high performing gear like that.
I hope it's worth noting here that an increasing variety of formerly vertically integrated storage system management layers, and more capabilities besides, have become available in VM form with per-GB licensing models.
If you touch health care or finance outside of the trading rooms, Hitachi Data Systems (now Hitachi Vantara, but I'm not seeing the Hitachi name disappear; it's too important for mindshare, I think) is the most surprisingly good and common installation. HDS wasn't scared to use OSS and build open platforms.
Big edit, sorry:
I wandered away from concluding with my own little dream: decoupling the block implementation from the filesystem the way DAOS does, only with a full selection of the commercial file systems available, paying for usable capacity and not raw installed drive specifications. Paying for per-terabyte capacity at enterprise mark-up sucks. Meanwhile, not too many people seem to be aware that you can run, on a very small budget and scale, file systems that were only available at multiples of house prices a couple of years ago. The latest all-flash NetApp filer is 20k, and I don't imagine many people who have the knowledge to debate these issues couldn't economically justify that, even in a home lab. The support options can cost less, but not many acquisitions can cost you more in sheer ownership, with orders of magnitude of difference between the costs.
For network storage some of his points are even stronger. Sure, the page cache becomes more useful as latency goes up but it also becomes more important to send more I/O at once, something that is hard to do with blocking APIs like read(2) and write(2). The page cache is pretty good at optimizing sequential I/O to do this, but not random I/O or workloads where you need to sync().
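To make the "send more I/O at once" point concrete, here's a minimal sketch using liburing (assuming Linux 5.6+ with liburing installed; the file path, batch size, and offsets are arbitrary and just for illustration). All 32 reads are queued and handed to the kernel in a single submission instead of 32 blocking pread() calls, which is exactly what read(2)/write(2) can't express:

```c
// Submit a batch of random reads in one go with liburing, instead of issuing
// them one blocking pread() at a time. Error handling is abbreviated.
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BATCH 32
#define BLK   4096

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    struct io_uring ring;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || io_uring_queue_init(BATCH, &ring, 0) < 0) return 1;

    for (int i = 0; i < BATCH; i++) {
        void *buf;
        posix_memalign(&buf, BLK, BLK);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        // Queue a 4 KiB read at some scattered offset; nothing is issued yet.
        io_uring_prep_read(sqe, fd, buf, BLK, (off_t)i * 1024 * BLK);
        io_uring_sqe_set_data(sqe, buf);
    }
    io_uring_submit(&ring);             // all 32 requests cross into the kernel at once

    for (int i = 0; i < BATCH; i++) {   // reap completions as they arrive
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0) fprintf(stderr, "read failed: %d\n", cqe->res);
        free(io_uring_cqe_get_data(cqe));
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

Over NVMe-oF the same pattern applies; the higher the latency of the pipe, the more it matters to keep many requests in flight.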
There are a series of hardware updates and software bottlenecks to work through before this kind of access becomes more common. The performance will bubble up from below: first more widespread NVMe devices, then faster NVMe-over-fabrics hardware becoming more common, and then drivers, hypervisors, filesystems, and storage apps will likely have to rethink things to re-optimize. That means different timelines for showing up in the cloud, on-prem, etc.
This is starting to change a bit because of things like the DPUs that companies are making. Basically it's an intelligent PCIe <=> network bridge that lets you emulate/share PCIe devices on the host while the actual hardware (NVMe storage, GPU, etc.) is located elsewhere. This lets you reconfigure the host in software without having to physically change the hardware in the servers themselves. It also lets you change how things are laid out in the server rack, since everything doesn't need to physically fit into every server case.
The linked article states that DPUs behave like a PCIe device implementing the NVMe protocol, but instead of directly attached storage they can forward requests over the network (fabric) via NVMe-oF.
This doesn't look like a generic PCIe over network/fabric bridge. Did I misunderstand you or did I fail to locate that information in the linked article?
IBM research put out a bunch of white papers a few years back about the huge performance difference between spinning rust, SATA SSDs, and NVMe. The performance gains were amazing, especially with things like parquet.
Not apps, and not running there, but we bought a NAS for our really small compute cluster and just put 16 cheap (= marketed as read-intensive) NVMe drives in (well, originally 8 – but then there was money left, which had to be spent fast), because IOPS are magnitudes better than HDDs, and at around 3 times the $/TB the hassle of configuring some caching layer just wasn't worth it in our opinion.
Performance is on par with typical desktop SSDs, over NFS (multiple clients possible). Definitely better than the old 1TB-per-disk RAID5.
It's a pretty common pattern to have a fleet of big beefy VM hosts all backed by a single giant SAN on a 10GbE switch. This lets you do things like seamlessly migrate a VM from one host to another, or do a high-availability setup with multiple synchronized instances and automatic failover (VMware called all this "vMotion"). In any case, lots of bandwidth to the storage, but also high latency, at least relative to a locally connected SATA or PCIe SSD.
So yeah, if that's your setup, you don't have much of an option in between your SAN and allocating an in-machine ramdisk, which will be super fast and low latency, but also extremely high cost.
Why not consider NVMe in this case then, as cheaper than RAM, slower than RAM, but faster than network storage? I don't know how you handle concurrency between VMs or virtualize that storage, but there must be some standard for that?
I think a lot of it depends what the machines are used for. I'm not actually the IT department, but I believe in my org, we started out with a SAN-backed high availability cluster, because the immediate initial need was getting basic infrastructure (wiki, source control, etc) off of a dedicated machine that was a single point of failure.
But then down the road a different set of hosts were brought online that had fast local storage, and those were used for short term, throwaway environments like Jenkins builders, where performance was far more important than redundancy.
I’m laughing a little bit because an old place I used to work had a similar setup. The SAN/NAS/whatever it was was pretty slow, provisioning VMs was slow, and as much as we argued that we didn’t need redundancy for a lot of our VMs (they were semi-disposable), the IT department refused to give us a fast non-redundant machine.
And then one day the SAN blew up. Some kind of highly unlikely situation where more disks failed in a 24h period than it could handle, and we lost the entire array. Most of the stuff was available on tapes, but rebuilding the whole thing resulted in a significant period of downtime for everyone.
It ended up being a huge win for my team, since we had been in the process of setting up Ansible scripts to provision our whole system. We grabbed an old machine and had our stuff back up and running in about 20 minutes, while everyone else was manually reinstalling and reconfiguring their stuff for days.
Ha, that's awesome. Yeah, for the limited amount of stuff I maintain, I really like the simple, single-file Ansible script— install these handful of packages, insert this config file, set up this systemd service, and you're done. I know it's a lot harder for larger, more complicated systems where there's a lot of configuration state that they're maintaining internal to themselves and they want you to be setting up in-band using a web gui.
I experienced this recently trying to get a Sentry 10 cluster going— it's now this giant docker-compose setup with like 20 containers, and so I'm like "perfect, I'll insert all the config in my container deployment setup and then this will be trivially reproducible." Nope, turns out the particular microservice that I was trying to configure only uses its dedicated config file for standalone/debugging purposes; when it's being launched as part of the master system, everything is passed in at runtime and can only be set up from within the main app. Le sigh.
It depends on what you want to spend your labor dollars on. I run a complex system on HPE storage which is all SSD or NVMe and is super fast. You pay, but the vendor mostly takes care of it operationally; I pay for about 20% of a SAN SME, mostly to do maintenance on the storage fabric.
Intel has a storage layer running Optane for metadata and logs, called DAOS. ^0
Intel DAOS was news to me only last week, an oversight I'm profoundly embarrassed by, given I'm responsible for a meaningful amount of storage iron due to be replaced, with orders to spend the estimate on eliminating dependencies if the performance comes with a sufficient reduction of vendor lock-in. Lock-in is a relative thing in enterprise storage. Apple could have been forgiven an eternity of nannying for a good ZFS port, but I have always believed that Steve knew Larry coveted Sun for ZFS alone, years before that forced wedding. I ran my business on VMS. Rdb, the database Palmer gifted to Oracle, was the upgrade option above licensing RMS (Record Management System), and RMS is plenty capable, being foundational to the hallowed-for-good-reason cluster capabilities VMS is remembered for, in the same way sclerotic drunks on their death beds remember their Sunday school teacher's good words on moderation.
ZFS, exposed to a new and enthusiastic development community unfettered by nuisance concepts (precisely those which were cast off with exhilarating energy by the NoSQL movement), would have been a genuine threat, providing functional basic data integrity underpinning their efforts during the pertinent time.
> Our CTO, Avi Kivity, made the case for async at the Core C++ 2019 event. The bottom line is this; in modern multicore, multi-CPU devices, the CPU itself is now basically a network, the intercommunication between all the CPUs is another network, and calls to disk I/O are effectively another. There are good reasons why network programming is done asynchronously, and you should consider that for your own application development too.
>
> It fundamentally changes the way Linux applications are to be designed: Instead of a flow of code that issues syscalls when needed, that have to think about whether or not a file is ready, they naturally become an event-loop that constantly add things to a shared buffer, deals with the previous entries that completed, rinse, repeat.
As someone that's been working on FRP related things for a while now, this feels very vindicating. :)
I feel like as recently as a few years ago, the systems world was content with its incremental hacks, but now the gap between the traditional interfaces and hardware realities has become too wide, and bigger redesigning is afoot.
> that have to think about whether or not a file is ready, they naturally become an event-loop that constantly add things to a shared buffer, deals with the previous entries that completed, rinse, repeat.
This is the exception, not the rule, and it bugs me when APIs default to this. Most consumers of data are not looking at the stream of data, and in many cases where streaming is what I want, there are tools and APIs for handling that outside of my application logic. Much of the time I’m dealing with units of data only after the entire unit has arrived. Because if the message is not complete there is no forward progress to be made.
My tools should reflect that reality, not what’s quickest for the API writers to create.
In fact, if I remember my queuing theory properly, responsiveness is improved if the system prioritizes IO operations that can be finished (eg, EOS, EOF) over processing buffers for one that is still in the middle, which can’t happen with an event stream abstraction.
> in modern multicore, multi-CPU devices, the CPU itself is now basically a network, the intercommunication between all the CPUs is another network, and calls to disk I/O are effectively another.
Interesting take, and NUMA CPUs have felt networked to me when I've used them, but typical multicore UMA CPUs sure haven't... is there a reason to believe this will change (or already has), or did the author mean to only talk about NUMA?
I guess it depends on what you consider to be "typical". If you look at a many-core chip from AMD you'll see a gradient of access latency from one core to another. You'll see the same on any Intel Skylake-X descendant, although the slope of that gradient is less. Your software will need to be very highly optimized already before you start to sweat the difference, though.
Yep. If you are sweating a microsecond, 100 nanoseconds is significant chunk of your budget. For this and possibly other reasons, a many-core CPU isn't always a great choice for hosted storage. If your goal is to export NVMe blocks over a network interface, you might be better off with an easier-to-program 4- or 8-core CPU. I don't like seeing 128 cores and a bunch of NVMe devices in the same box because it just causes trouble.
The problem is forcing everyone to program in an asynchronous way.
WinRT tried to go down that route (only asynchronous APIs) to drive developers down that path, but eventually they had to support synchronous APIs as well due to the resistance they received.
Before Herb Sutter's "free lunch is over", not many people wrote multi-threaded code. Today everyone is writing multi-threaded code one way or another: the majority indirectly, in newer languages like Go and Rust or with better frameworks like actors, message passing, and coroutines; and some noble souls, who are capable enough, in classic pthreads and other threading APIs.
Of course everyone is going to program in an async way; that's how nature works.
But it certainly will not be in a fashion that is repulsive to you, just give it some time. Maybe 10 years.
One thing I have started to realize is that best case latency of an NVMe storage device is starting to overlap with areas where SpinWait could be more ideal than an async/await API. I am mostly advocating for this from a mass parallel throughput perspective, especially if batching is possible.
I have started to play around with using LMAX Disruptor for aggregating a program's disk I/O requests and executing them in batches. This is getting into levels of throughput that are incompatible with something like what the Task abstractions in .NET enable. The public API of such an approach is synchronous as a result of this design constraint.
Software should always try to work with the physical hardware capabilities. Modern SSDs are most ideally suited to arrangements where all data is contained in an append-only log with each batch written to disk representing a consistent snapshot. If you are able to batch thousands of requests into a single byte array of serialized modified nodes, you can append this onto disk so much faster than if you force the SSD to make individual writes per new/modified entity.
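As a rough illustration of that batching idea (not the parent's actual Disruptor setup; the function and file names here are made up), here's a sketch that gathers a batch of serialized records and appends them with one vectored write and one flush, instead of a write() plus sync per record:

```c
// Sketch of batched appends: gather N serialized records and issue one vectored
// write to the log file, rather than one write() per record. Illustrative only;
// an O_DIRECT variant would additionally need aligned, padded buffers.
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define MAX_BATCH 1024

// Append a batch of records as a single syscall; returns bytes written or -1.
ssize_t append_batch(int log_fd, const void **records, const size_t *lens, int n) {
    struct iovec iov[MAX_BATCH];
    if (n > MAX_BATCH) n = MAX_BATCH;                 // IOV_MAX also applies
    for (int i = 0; i < n; i++) {
        iov[i].iov_base = (void *)records[i];
        iov[i].iov_len  = lens[i];
    }
    ssize_t w = writev(log_fd, iov, n);               // one append for the whole batch
    if (w >= 0 && fdatasync(log_fd) != 0) return -1;  // one flush makes the batch a snapshot
    return w;
}

int main(void) {
    int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    const char *a = "record-1\n", *b = "record-2\n", *c = "record-3\n";
    const void  *recs[] = { a, b, c };
    const size_t lens[] = { strlen(a), strlen(b), strlen(c) };
    if (append_batch(fd, recs, lens, 3) < 0) perror("append_batch");
    close(fd);
    return 0;
}
```

The point is simply that the drive (and the syscall layer) sees one large, consistent append per batch rather than thousands of tiny writes.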
On Linux, it's already an NVMe driver option to enable polling for (high-priority) I/O completion rather than sleeping until an interrupt. The latency of handling an interrupt and doing a couple of context switches is higher than the best-case latency of fast SSDs. The io_uring userspace API also has a polling mode.
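For reference, a minimal sketch of that io_uring polling mode (IORING_SETUP_IOPOLL). It assumes a device/filesystem that supports polled I/O and an O_DIRECT file descriptor with an aligned buffer; on many systems the nvme driver's poll_queues parameter also needs to be greater than zero, and error handling is abbreviated:

```c
// Completion polling with io_uring's IOPOLL mode: the kernel busy-polls the
// NVMe completion queue instead of sleeping on an interrupt.
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0) return 1;

    int fd = open(argv[1], O_RDONLY | O_DIRECT);   // IOPOLL requires O_DIRECT
    void *buf;
    posix_memalign(&buf, 4096, 4096);              // aligned buffer for O_DIRECT

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                // with IOPOLL this spins on the CQ, no IRQ
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}
```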
I agree with the premise, but disagree with the conclusion.
For a little background, my first computer was a Mac Plus around 1985, and I remember doing file copy tests on my first hard drive (an 80 MB) at over 1 MB/sec. If I remember correctly, SCSI could do 5 MB/sec copies clear back in the mid-80s. So until we got SSDs, hard drive speed stayed within the same order of magnitude for like 30 years (as most of you remember).
So the time to take our predictable deterministic synchronous blocking business logic into the maze of asynchronous promise spaghetti was a generation ago when hard drive speeds were two orders of magnitude slower than today.
In other words, fix the bad APIs. Please don't make us shift paradigms.
Now if we want to talk about some kind of compiled or graph-oriented way of processing large numbers of files performantly with some kind of async processing internally, then that's fine. Note that this solution will mirror whatever we come up with for network processing as well. That was the whole point of UNIX in the first place, to treat file access and network access as the same stream-oriented protocol. Which I think is the motive behind taking file access into the same problematic async domain that web development is having to deal with now.
But really we should get the web back to the proven UNIX/Actor model way of doing things with synchronous blocking I/O.
I suppose _modern_ storage is fast, but how many servers are running on storage this modern? None of mine are and my work dev machine is still rocking a SATA 2.5" SSD.
We're probably still a few years off from being able to switch to this fast I/O yet. With the new game consoles switching over to PCIe SSDs I expect the price of NVMe drives to drop over the next few years until they're cheap enough that the majority of computers are running NVMe drives.
Even with SATA drives like mine though, there's really not that much performance loss from doing IO operations. I've run my OS with 8GiB of SSD swap in active use during debugging and while the stutters are annoying and distracting, the computer didn't grind to a halt like it would with spinning rust. Storage speed has increased massively in the last five years, for the love of god fellow developers, please make use of it when you can!
That said, deferring IO until you're done still makes sense for some consumer applications because cheap laptops are still being sold with hard drives and those devices are probably the minimum requirement you'll be serving.
> I expect the price of NVMe drives to drop over the next few years until they're cheap enough that the majority of computers are running NVMe drives.
Price no longer has anything to do with it. PC OEMs are simply not shipping SATA SSDs any more, and major drive vendors have started to discontinue their client (OEM) SATA SSD product lines. We're just waiting for the SATA-based PC install base to be retired.
One SSD is sufficient for almost all consumer systems. The only reason to want more than two SSDs is if you're re-using at least one old tiny SSD in a new machine. SATA ports will stick around in desktops only for the sake of hard drives. There may be a few niches left where using several SATA SSDs in a workstation still makes some kind of sense, and obviously not all server platforms have migrated to NVMe yet. But as far as influencing the direction and design of consumer systems, SATA SSDs have only slightly more relevance than optical disc drives.
Drive price per GB doesn't even scale monotonically with capacity. Right now, the best price per GB is usually on 1TB or 2TB models. And if you need more than two such devices with SSD performance, you're far outside the bounds of consumer computing and into workstation territory.
It depends on what you mean by "consumer computing", really. The single largest app using up my disk space is Steam, and if I only had a single 1 TB SSD, it'd be full by now.
With PCIe lane bifurcation, you won’t even need a PCIe switch on your expansion card. I have 10 Samsung 980 Pro PCIe SSDs in my AMD ThreadRipper PRO/WX machine (2 in motherboard M.2 slots, and 2 x “expansion cards” that hold 4 SSDs each). Had to configure PCIe bifurcation in BIOS, so lanes connected to a PCIe x16 card will be treated like 4 x PCIe x4 instead.
So far the best aggregate results with io_uring are 10.5M 4K IOPS and 66.5 GB/s with large reads...
When doing research, I was careful to buy only kit that can achieve full PCIe 4.0 speed and not some old PCIe 3.0 stuff that's "compatible with PCIe 4". This applies both to SSDs and expansion cards...
Edit: It's worth adding that your CPU(s)/mobo must have enough PCIe lanes when trying to get max throughput. 10 SSDs need 40 PCIe lanes dedicated to them (many consumer CPUs/chipsets have 36 or 44, and some lanes are used for other stuff!). The new AMD ThreadRipper PRO WX has 128 :-)
I wasn't including the workstation market when I referred to what PC OEMs are doing.
Are you using 8 consumer SATA SSDs in your workstation? Is it for the sake of increased capacity, or for the sake of increased performance? Because it's pretty easy now to match the performance of an 8-drive SATA RAID-0 with a single NVMe drive, but 8TB consumer NVMe SSDs are still 50% more expensive than 8TB consumer SATA SSDs.
(Also, even 8 SATA ports is above average for consumer motherboards; it looks like about 17% of the retail desktop motherboard models currently on the market have at least 8 SATA ports.)
3 M.2 slots is common even on AMD's mainstream X570 and B550 platforms. I don't know if any of those motherboards also bundle riser cards for further M.2 PCIe SSDs, but they do support PCIe bifurcation so you can run your GPU at PCIe 4.0 x8 and use the second x16/x8 slot to run two more SSDs in a passive riser purchased separately.
The b550 Aorus Master I just got is unique in its bifurcation. It maps the CPU's gen 4 PCI-E to the main 16x slot and one M.2 4x, or drops the 16x slot down to 8x and then has three gen 4 M.2 4x. The other two PCI-E slots are gen 3 through the b550 chipset.
I chose this board for its absurdly overkill CPU voltage regulation. The pci-e configuration seems like a pretty good compromise, though.
Right now it's a bit more specialized to storage-oriented server platforms that can run 10-40 NVMe devices. You get this sort of imbalance where any one or two high-performance NVMe devices at full throughput can push more I/O than a single high-end network link.
Sure, you could run 24 NVMe drives with HighPoint PCIe 4.0 RAID cards on a TRX40 board. But then most boards still have like 10 SATA ports, so you can run those as well. It will be great when SATA is replaced by U.2, but who knows when that happens.
Well, if the whole async "I/O is the bottleneck" principle that was the refrain a few years ago is actually true, then servers running databases should be focusing on upgrading their storage to these levels, since that's where the most bang for the buck in performance gains comes from. (Of course, now the big thing is running everything on clouds like AWS, where everything is dog slow and really expensive, so perhaps it doesn't actually matter.) In fact, the main reason for the API changes covered in the article is that the CPU and RAM can no longer run laps around storage.
Intuitively one should be able to approach the max speed for sequential reads via some tuning (queue/read_ahead_kb) even with the traditional, blocking POSIX interface, as in the sketch below. This would require a large enough read-ahead and a large enough buffer size. Not poisoning the page cache / manually managing the page cache is an orthogonal issue and only relevant for some applications (and the additional memory copy barely makes a difference in the OP's post).
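Something like this minimal sketch of the plain blocking, buffered path (file name and buffer size are arbitrary): posix_fadvise hints the kernel to read ahead aggressively (on Linux, POSIX_FADV_SEQUENTIAL roughly doubles the readahead window), and raising /sys/block/<dev>/queue/read_ahead_kb lifts the ceiling further:

```c
// Blocking, buffered sequential scan with a readahead hint and large read()
// sizes to keep syscall overhead small. Illustrative sketch only.
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    // Hint: we'll read this file front to back, so prefetch aggressively.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    size_t bufsz = 1 << 20;                    // 1 MiB per read() call
    char *buf = malloc(bufsz);
    if (!buf) return 1;

    ssize_t n;
    unsigned long long total = 0;
    while ((n = read(fd, buf, bufsz)) > 0)     // kernel readahead runs ahead of us
        total += (unsigned long long)n;

    printf("read %llu bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}
```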
One advantage of using high level (Linux) kernel interfaces is that this "automatically" gets faster with newer Linux versions without a need of large application level changes. Maybe in a few years we'll have an extra cache layer, or it stores to persistent memory now. Linux will (slowly) improve and your application with it. This won't happen if it is specifically tuned for Direct I/O with Intel Optane in 2020.
But yeah, random I/O is (currently) another issue, and as said, the usual advice is to avoid it. With the old API this still holds. If one currently wants fast random I/O one needs to use io_uring/AIO (with Direct I/O), or just live with the performance not being optimal and hope that the page cache does more good than bad (like PostgreSQL).
I found this to be a good read, but I wish the author discussed the pros/cons of bypassing the file system and using a block device with direct I/O. I've found that with Optane drives the performance is high enough that the extra load from the file system (in terms of CPU) is significant. If the author was using a file system (which I assume is the case) which was it?
> ...misconceptions... Yet if you skim through specs of modern NVMe devices you see commodity devices with latencies in the microseconds range and several GB/s of throughput supporting several hundred thousands random IOPS. So where’s the disconnect?
Whoa there... let's not compare devices with 20+ GB/s and latencies in nanosecond ranges which translate to half a dozen giga-ops per second (aka RAM) with any kind of flash-based storage just yet.
The advertised bandwidth for RAM is not actually what you get per-core, which is what you care about in practice.
If you want to know the upper bound on your per-core RAM bandwidth:
64 bytes (the size of a cache line) × 10 slots (in a CPU core's LFB, or line fill buffer) / 100 ns (the typical cost of a cache miss) × 1,000,000,000 (to convert ns to seconds) = 6,400,000,000 bytes per second ≈ 5.96 GiB per second of RAM bandwidth per core
There's no escaping that upper bound per core.
Nanosecond RAM latencies don't help much when you're capped by the line fill buffer and queuing delay kicks in spiking your cache miss latencies. You can only fetch 10 lines at a time per core and when you exceed your 5.96 GiB per second budget your access times increase.
If you compare with NVMe SSD throughput plus Direct I/O plus io_uring, around 32 GiB per second, and divide that by 10 according to the difference in access latencies, then I think the author is about right on target. The point they are making is valid: it's the same order of magnitude.
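If you want to sanity-check the per-core figure on your own machine, here's a rough single-threaded sketch (buffer size and methodology are simplistic; results vary a lot with compiler flags, hardware prefetchers, and DRAM configuration, so treat it as a ballpark measurement, not a benchmark):

```c
// Rough single-core read-bandwidth check: stream once over a buffer much larger
// than the last-level cache and time it. Compile with -O2.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t bytes = 1ULL << 31;                 // 2 GiB, far bigger than any LLC
    uint64_t *buf = malloc(bytes);
    if (!buf) return 1;
    memset(buf, 1, bytes);                     // fault the pages in before timing

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t sum = 0;
    for (size_t i = 0; i < bytes / sizeof(uint64_t); i++)
        sum += buf[i];                         // one streaming pass of loads
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile uint64_t sink = sum;              // keep the loop from being optimized out

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GiB/s (checksum %llu)\n",
           bytes / secs / (double)(1ULL << 30), (unsigned long long)sink);
    free(buf);
    return 0;
}
```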
While I was in the hospital ICU earlier this year, I promised myself I would build a zen 3 desktop when it came out despite my 10 year old desktop still working just fine.
I've since bought all the pieces but the CPU; they are all sold out. So I got a 6-core 3600XT in the interim. I bought fairly high-binned RAM and overclocked it to 3600MHz, and was surprised to cap out at about 36GB/s throughput. Your 6GiB/s-per-core explanation checks out for me!
Cool! I had a similar empirical experience working on a Cauchy Reed-Solomon encoder awhile back, which is essentially measuring xor speed, but I just couldn't get it past 6 GiB/s per core either, until I guessed I was hitting memory bandwidth limits. Only a few weeks ago I stumbled on the actual formula to work it out!
> capped by the line fill buffer and queuing delay kicks in spiking your cache miss
could you point me to a little reading material on this? I know what an LFB is, more or less, but what is queueing delay, and how does that relate to cache misses? Thanks.
It means if a system can only do X of something per second, then if you push the system past that, new arriving stuff has to wait on existing work in the queue, and things take longer than if the queue was empty. You can think of it like a traffic jam and it applies to most systems.
For example, our local radio station here in Cape Town loves to talk about "queuing traffic" when they do the 8am traffic report, and I always think of Little's law.
Bufferbloat is another example of queueing delay, e.g. where you fill the buffer of your network router say with a large Gmail attachment upload and spike the network ping times for everyone else sharing the same WiFi.
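As a rough worked example with a drive: by Little's law (L = λW), if a device services, say, 500k IOPS and you keep 128 requests in flight, each request spends on average W = L/λ = 128 / 500,000 ≈ 256µs in the system, even if the device's unloaded latency is only a couple of tens of microseconds. Push arrivals past what it can service and the queue, and therefore the wait, keeps growing.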
> In the DRAM region we’re actually seeing a large change in behaviour of the new microarchitecture, with vastly improved load bandwidth from a single core, increasing from 14.8GB/S to 21GB/s
Yeah, that's odd. But the article's really about cache, so maybe it's a mistake. Next para says
> More importantly, memory copies between cache lines and memory read-writes within a cache line have respectively improved from 14.8GB/s and 28GB/s to 20GB/s and 34.5GB/s.
so it looks like it's talking about cache, not RAM, but... shrug
The article isn't exactly conflating RAM and flash; if it were, the conclusions would be very different. A synchronous blocking IO API is fine if you're working with nanosecond latencies of RAM, or with storage that's as painfully slow and serial as a mechanical hard drive.
Flash is special in that its latency is still considerably higher than that of DRAM, but its throughput can get reasonably close once you have more than a handful of SSDs in your system (or if you're willing to compare against the DRAM bandwidth of a decade-old PC). Extracting the full throughput from a flash SSD despite the higher-than-DRAM latency is what requires more suitable APIs (if you're doing random IO; sequential IO performance is easy).
True, but that applies more to writes than reads. Most real-world workloads do a lot more reads than writes, and what writes they do perform can usually tolerate a lot of buffering in the IO stack to further reduce the overall impact of low write performance on the underlying storage.
We’ve been using Ethernet cards for storage because the network round trip to RAM over TCP/IP on another machine in the same rack is far cheaper than accessing local storage. Latency compared to that option is likely the most noteworthy performance gain.
My understanding of distributed computing history is that the last time network>local storage happened was in the 80’s, and most of the rest of the history of computing, moving the data physically closer to the point of usage has always been faster.
Just as then, we’ve taken a pronounced software architecture detour. This one has lasted much longer, but it can’t really last forever. With this new generation of storage, we’ll probably see a lot of people trotting out 90’s era system designs as if they are new ideas rather than just regression to the mean.
Depending on the storage technology, the comparison to RAM is not that far off. Intel is trying to market it that way anyway [0]. It's obviously not RAM, but it's not the <500GB 5200RPM SATA 3Gb/s disk I started programming on either.
Yeah, back in 2014, I worked at HP on storage drivers for Linux, and we got 1 million IOPS (4K random reads) on a single controller with SSDs, but we had to do some fairly hairy stuff. This was back when NVMe was new and we were trying to do SCSI over PCIe. We set up multiple ring buffers for command submission and command completion, one each per CPU, pinned threads to CPUs, and were very careful to avoid locking (e.g. spinlocks, etc.). I think we also had to pin some userland processes to particular CPUs to avoid NUMA-induced bottlenecks.
The thing is, up until this point, for the entire history of computers, storage was so relatively slow compared to memory and the CPU that drivers could be quite simple, chuck requests and completions into queues managed by simple locking, and the fraction of time that requests spent inside the driver would still be negligible compared to the time they spent waiting for the disks. If you could theoretically make your driver infinitely fast, this would only amount to maybe a 1% speedup. So there was no need to spend a lot of time thinking about how to make the driver super efficient. Until suddenly there was.
Oh yeah, iirc, the 1M IOPS driver was a block driver. For the SCSI over PCIe stuff, there was the big problem at the time that the entire SCSI layer in the kernel was a bottleneck, so you could make the driver as fast as you wanted, but your requests were still coming through a single queue managed by locks, so you were screwed. There was a whole ton of work done by Christoph Hellwig, Jens Axboe and others to make the SCSI layer "multiqueue" around that time to fix that.
This is such a big deal! The assumptions made when I/O APIs were designed are so out of step with today's hardware that it really is time to have a big rethink. In graphics, the last 20 years of API development have very much been focused on harnessing a GPU that has again and again outgrown the CPU's ability to feed it. So much has been learned, and we really need to apply this to both storage and networking.
> The operating system reads data in page granularity, meaning it can only read at a minimum 4kB at a time. That means if you need to read 1kB split in two files, 512 bytes each, you are effectively reading 8kB to serve 1kB, wasting 87% of the data read.
SSDs (whether SATA or NVMe) all read and write whole sectors at a time, right? I'm not sure what the sector size is, but 4 KiB seems like a reasonable guess. So I think you're reading the 8 KiB no matter what; it may just be a question of what layer you drop it at (right when it gets to the kernel or not). Also, doesn't direct IO require sector size-aligned operations?
All SATA SSDs and almost all NVMe SSDs use 512-byte LBAs by default, but many NVMe SSDs can be reconfigured to use 4096-byte LBAs instead. The underlying NAND flash memory these days typically has a native page size of 16kB.
Thanks. What are the tradeoffs with NVMe sector size? There must be some reason to want 4096 if they added that option. Is there less write amplification as you get closer to the native page size or does the firmware avoid that anyway with its write leveling?
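For anyone curious what their own device reports, and what alignment O_DIRECT expects, here's a hedged sketch (Linux-specific ioctls; run it against e.g. /dev/nvme0n1, which needs read permission):

```c
// Query the logical and physical sector sizes a block device reports, then do
// one aligned O_DIRECT read. BLKSSZGET returns the logical sector size (what
// LBAs are addressed in); BLKPBSZGET the physical sector size the device advertises.
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;
    unsigned int physical = 0;
    ioctl(fd, BLKSSZGET, &logical);    // typically 512 or 4096
    ioctl(fd, BLKPBSZGET, &physical);  // often 4096
    printf("logical %d bytes, physical %u bytes\n", logical, physical);

    // O_DIRECT wants buffer, offset and length aligned: the logical size is the
    // hard requirement, the physical size is the friendlier one for the drive.
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) == 0) {
        ssize_t n = pread(fd, buf, 4096, 0);
        printf("pread returned %zd\n", n);
        free(buf);
    }
    close(fd);
    return 0;
}
```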
This is a really poor article. Only in very rare circumstances can developers change the APIs. APIs are not "bad"; they are built to various important requirements. Only some of those requirements have to do with performance.
> “Well, it is fine to copy memory here and perform this expensive computation because it saves us one I/O operation, which is even more expensive”.
"I/O operation" in fact refers to the API call, not to the raw hardware operation. If the developer measured this and found it true, how can it be a misconception?
It may be caused by a "bad" I/O API, but so what? The API is what it is.
APIs provide one requirement, which is stability: keeping applications working. That is king. You can't throw out APIs every two years due to hardware advancements.
> “If we split this into multiple files it will be slow because it will generate random I/O patterns. We need to optimize this for sequential access and read from a single file”
Though solid state storage doesn't have track-to-track seek times, the sequential-access-fast rule of thumb has not become false.
Random access may have to wastefully read larger blocks of the data than are actually requested by the application. The unused data gets cached, but if it's not going to be accessed any time soon, it means that something else got wastefully bumped out of the cache. Sequential access is likely to make use of an entire block.
Secondly, there is that API again. The underlying operating system may provide a read-ahead mechanism which reduces its own overheads, benefiting the application which structures its data for sequential access, even if there is no inherent hardware-level benefit.
If there is any latency at all between the application and the hardware, and if you can guess what the application is going to read next, that's an opportunity to improve performance. You can correctly guess what the application will read if you guess that it is doing a sequential read, and the application makes that come true.
I didn't get the impression that the author was suggesting to throw out the old APIs. It seems to me like the article is a proof of concept of new approaches that could be added as new APIs, only expected to be used by people who need them, using an approach that takes advantage of modern storage technology.
> "Random access may have to wastefully read larger blocks of the data than are actually requested by the application. The unused data gets cached, but if it's not going to be accessed any time soon, it means that something else got wastefully bumped out of the cache. Sequential access is likely to make use of an entire block."
I may have misread it, but I thought he addressed this in the article.
> "Random access files take a position as an argument, meaning there is no need to maintain a seek cursor. But more importantly: they don’t take a buffer as a parameter. Instead, they use io_uring’s pre-registered buffer area to allocate a buffer and return to the user. That means no memory mapping, no copying to the user buffer — there is only a copy from the device to the glommio buffer and the user get a reference counted pointer to that. And because we know this is random I/O, there is no need to read more data than what was requested."
John Ousterhout, in the RAMCloud project, bet that 2.5μs latency for a read across the network would be possible. NVMe drives seem to be inching towards that too, like <20μs at this point. Would be interesting to see how these numbers play out.
Because I am a "glue" programmer, and I realize that all storage options suck, I've decided to wait on any infrastructure choices for now, and just use the filesystem as a key-value store when developing my projects.
When I need indexing, I use SQLite, but I limit myself to a very basic subset of SQL that would work in any of Oracle, MariaDB, or Microsoft SQL Server without changes.
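For what it's worth, a minimal sketch of that setup with the sqlite3 C API; the table and column names are made up for illustration, and the SQL sticks to plain CREATE TABLE / INSERT with generic types:

```c
// SQLite as a local index over filesystem blobs; link with -lsqlite3.
// Table/column names are illustrative only.
#include <sqlite3.h>
#include <stdio.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("index.db", &db) != SQLITE_OK) return 1;

    const char *ddl =
        "CREATE TABLE documents ("
        "  id    INTEGER PRIMARY KEY,"
        "  path  VARCHAR(255) NOT NULL,"   /* the value itself lives on the filesystem */
        "  title VARCHAR(255)"
        ")";
    char *err = NULL;
    if (sqlite3_exec(db, ddl, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "ddl failed: %s\n", err);
        sqlite3_free(err);
    }

    // Parameterized insert via the C API; the row just points at a file on disk.
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "INSERT INTO documents (path, title) VALUES (?, ?)",
                       -1, &stmt, NULL);
    sqlite3_bind_text(stmt, 1, "data/0001.json", -1, SQLITE_STATIC);
    sqlite3_bind_text(stmt, 2, "first document", -1, SQLITE_STATIC);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_close(db);
    return 0;
}
```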
What I hate about Unix filesystems: the fact that you can't take a drive, put it in another computer and have permissions (user/group-ids) working instantly. Same for sharing over nfs.
Of course, people have tried to solve this, but I think not well enough. It's a huge amount of technical debt right there in the systems we use every day.
Permissions (and ownership info) can be useful even if you have complete access to a filesystem.
By the way, assume you have root permission. How would you replace a single file in a random tar-file, without changing any of the permissions/userids/groupids inside the tar-file? You can't untar it because the users inside the tar file don't correspond with the ones on your system. So, you'll have to use special tools, which is (only) one demonstration of the inadequacy of the permissions mechanism of our filesystems.
There are two problems with this. 1. Tar also stores user and group names, which you see when you use the "v" and "t" options. 2. If you try this as non-root user, you run into permission problems.
I think this is a loaded article, and it underscores the importance of having low-level engineers who understand your workload guide purchasing. No longer will fringe benefits and bribes be enough.
I stopped reading this when it became evident that the author was just promoting a library he’d written. I might have been a tiny bit more interested if I were a Rust developer.
Very well. The API Glauber (the author) describes has always been a good API for performance even with old HDDs and old CPUs.
In fact with very old CPUs the reduction in memory copies and system calls was more significant than it is now. The extra control over memory use with direct I/O is beneficial on some lower-end embedded systems too, e.g. video streaming from HDD on low memory systems.
It's just that before fast SSDs, there wasn't as much motivation to get the "best" I/O performance via complex APIs, because most of the time overhead was dominated by the HDD performance itself. (This was less true for big, fast RAIDs though.)
There was some benefit in a small amount of parallelism with HDDs, to improve block sorting, but that was usually best achieved with a few threads or processes, which is still a simple API to use, just read() and write() syscalls.
The only things really worth doing a complex API for before were O_DIRECT+AIO together, and even then they need a well-written, I/O-aware application to make them really worth using. In general, those applications have tended to be databases and VM hypervisors, but I've also seen it done for optimized file streaming to/from HDD on embedded systems. Though beneficial, the Linux AIO implementation had problems for a long time, even with O_DIRECT in some cases (such as filling holes in sparse files and extending files); so it was never reliably "true" async I/O. And there was no memory-buffer transfer and system call elision as there is with io_uring.
Now there is more motivation, so the API has been improved. O_DIRECT+AIO+io_uring is a better combination. The benefits are increasingly worth the effort for more kinds of applications due to the faster SSDs, and the SSDs' ability to handle large I/O request queues. But they would have been a good API combination for performance 20 years ago too.
Also: Modern storage is plenty fast, but also not reliable for long term use.
That is why I buy a new SSD every year and clone my current (worn out) SSD to the new one. I have several old SSDs that started to get unhealthy, well, according to my S.M.A.R.T utility that I used to check them. I could probably get away with using an SSD for another year, but will not risk the data loss. Anyone else do this?
No. Hardly anyone does this, because it's just conspicuous consumption, not actually sensible. Have any of your SSDs ever used even half of their warrantied write endurance in a single year?
I've looked into RAID. It seems a bit complicated to use. Is it trivial to create a RAID array in Linux with zero fuss and the whole thing 'just working' with very little knowledge of the filesystem itself other than it keeps your data 'safe' and redundancy baked in?
Do you mean “use” or “setup”? RAID is trivial to use. Mount the volume to a directory and use it like normal.
The setup is a bit more involved, but really not that bad. It’s a couple commands to join a few disks in an array and then you make a file system and mount it.
On the other hand, how many years does each of us have left? Ten? twenty? Thirty? Forty? Few of us can easily imagine ourselves still alive and productive in forty years. So much of what we do rests on an implicit assumption that we are going to live for eternity, and starts to seem pointless when we consider how short our existence is.
Very well said. There are times we lose the bigger picture of our lives and instead start wasting time with pointless stuff just to escape the reality of our lives.
Materialism (the dominant underlying philosophy of our culture) keeps us away from that higher level consciousness. It poisons our mental models and worldview.
What do you do that wears them out so fast? I've been running the same NVMe disk as my daily driver since 2015 and it's not showing any signs of degradation.
I forgot to mention I do a lot of heavy writes to it. It is common to see me creating a huge 20GB virtual machine disk image, using it for a few hours, then deleting it, before creating a new one in its place. I'm a huge virtualization freak.
In a lot of these systems (at least VMWare back when I used that, and Docker) you can clone an existing image with copy-on-write. This is a lot faster and would avoid 20GB of writes to spin up a new VM.
Eh, still that is not that much data. I've had much longer life out of SSDs that are used as cache drives and perform a huge amount of writes per day, with Intel and Samsung drives that is.
If you're losing drives that fast check your thermal management instead. Don't run your drives hot.
I work with bioinformatics data and tend to switch out an NVMe within 3-4 months. I'm usually maxing out read or write for 12 out of 24 hours a day. The slowdown is rapid and very noticeable.
That probably doesn't have anything to do with write endurance of the flash memory. When your drive's flash is mostly worn-out, you will see latency affected as the drive has to retry reads and use more complex error correction schemes to recover your data. But there are several other mechanisms by which a SSD's performance will degrade early in its lifetime depending on the workload. Those performance degradations are avoidable to some extent, and are not permanent.
Assuming these are consumer SSD, the most important way to maintain good performance is to ensure that it gets some idle time. Consumer SSDs are optimized for burst performance rather than sustained performance, and almost all use SLC write caching. Depending on the drive and how full it is, the SLC cache will be somewhere between a few GB up to about a fourth of the advertised capacity. You may be filling up the cache if you write 20GB in one shot, but the drive will flush that cache in the background over the span of a minute or two at most if you don't keep it too busy.
The other good strategy to maintain SSD performance in the face of a heavy write workload is to not let the drive get full. Reserving an extra 10-15% of the drive's capacity and simply not touching it will significantly improve sustained write speeds. (Most enterprise SSD product lines have versions that already do this; a 3.2TB drive and a 3.84TB drive are usually identical hardware but configured with different amounts of spare area.)
If a drive has already been pushed into a degraded performance state, then you can either erase the whole drive or, if your OS makes proper use of TRIM commands, you can simply delete files to free up space. Then let the drive have a few minutes to clean things up behind the scenes.
I think you could give it a shot with ATA Secure Erasing one of them and seeing if it performs faster. Although 4 months at 50% utilization at (say) 2GB/s is some ~10PB of I/O, so I'm not sure if I would expect what you're seeing to be a temporary slowdown...
Consumer SSDs have endurance ratings in TBW, which is terabytes written over the lifespan. They're often in the hundreds, with some drives over 1000. The faster drives also use MLC or TLC, which has lower latency, better endurance, and higher performance than the higher-capacity QLC.
For example the Samsung 1TB 970 PRO (not the 980 PRO) has a 1200TBW rating with a 5 year warranty. That's 1.2M gigabytes written or more than 600GB every day, and will usually handle far more.
It will highly vary depending on use case. I have been using the same SSD (Samsung 850 evo) since 2015. First used on my gaming desktop, then on my college laptop, now in my gaming desktop again. I just make sure to keep it at ~25% to ~50% capacity to give the controller an easy time and I try to stick to mostly read only workloads (gaming). SMART report from that drive: https://pastebin.com/raw/HyPE6aHm
For my disk for my exact use case: ~4 years of operation. 88% of lifespan remaining.
I've had one (very early and cheap) SSD fail on me. Other than that I don't think I've seen or heard of any issues across a large range of more modern SSDs. The reliability and endurance issues which occurred on earlier SSDs no longer seem to be a problem (this is in part because flash density has skyrocketed: because each flash chip can operate more or less independently, the more storage an SSD has, the faster it can run and the more write endurance it has).
I would add a new drive with zfs mirroring and enable simple compression. For most use cases it gets better read performance, ok write performance, and can tolerate both of the drives being a bit flaky so you can run it for a lot longer than the new drive alone.
Every year seems like a very short lifespan, but I guess every usecase is different. I definitely replace drive when SMART is starting to look bleak, but that is far more infrequent in my usecase I guess.
Yes but I forgot to mention I do a lot of heavy writes to it. It is common to see me creating a huge 20GB virtual machine disk image, using it for a few hours, then deleting it, before creating a new one in its place. I'm a huge virtualization freak.
> It is common to see me creating a huge 20GB virtual machine disk image, using it for a few hours, then deleting it
The SSD in my current desktop, a Samsung 960 Pro 1TB, has a warranty for 800 TBW or 5 years. So that's 800/5/365.25*1000 ~= 438 GB per day, every single day.
And it's been documented the Samsung drives can do a lot more than the warranty is good for.
Either you're doing something else weird, or you're not really wearing them out.
> does not necessarily mean you're actually writing out 20GB to the disk.
You mean like preallocation? I think Virtualbox now does that. In the past it didn't though, it just kept writing a bunch of zeroes to the drive until it reached 20GB.
I think having a backup solution is the better choice here. You can use your SSDs until they die or become too slow, and you won't lose your data if it breaks before you replace it after a year
> I think having a backup solution is the better choice here
Any particular provider you would recommend? I've looked into Backblaze but it seems a bit pricey. Also: I am aware that cloud-based backup solutions have a very low failure rate in terms of drives, since they're probably using RAID.
I use Backblaze's B2 (think S3-style) storage for backup, and I'm paying about ~$4.50 USD for ~1TB of storage per month. I don't use a ton of bandwidth though, so if you have a lot of churn in the files you're backing up you could see higher costs, but looking over the numbers Backblaze was by far the cheapest solution compared to the others I looked at.
Can somebody please write up modern SSDs and the state of the world regarding data retention, modes, and the applicability of SSDs replacing spinning rust for "on the shelf" offline storage...
SSDs make no sense for offline archival. They're more expensive than hard drives and will be for the foreseeable future. You don't need the improved random IO performance or power efficiency for a drive that's mostly sitting on a shelf.
If what you have is an SSD, you need to know how applicable it is to unplug/remove it and hold it offline. What I've read is that this is not a good fit for how the data retention of an SSD works; it expects to be plugged into power more frequently. Whether the storage is safe for 6 months, a year, or 10 years would be good to know. I believe it may not be.
I entirely agree that it's not the medium of choice yet, if ever. LTO still exists for many people; perhaps 25+ TB HDDs will make that moot.
I've only had two SSDs fail on me, and in both cases they died without any warning. Didn't get discovered during boot or anything. Two different brands, very different uses.
So while they _can_ fail in a graceful way, that's not been my experience.