I always liked the embedded system model where you get flash hardware that has two operations -- erase block and write block. GC, RAID, error correction, etc. are then handled at the application level. It was never clear to me that the current tradeoff with consumer-grade SSDs was right. On the one hand, things like error correction, redundancy, and garbage collection don't require attention from the CPU (and, more importantly, don't tie up any bus). On the other hand, the user has no control over what the software on the SSD's chip does. Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.
It would be nice if we could just buy dumb flash and let the application do whatever it wants (I guess that application would be your filesystem, but it could also be direct access for specialized use cases like databases). If you want maximum speed, adjust your settings for that. If you want maximum write durability, adjust your settings for that. People are always looking for a one-size-fits-all setting, but that's hard here. Some users are cloud providers who already have software to store a block on three different continents. Others are embedded systems with a fixed disk image that changes once a year, plus some temporary storage for logs. There probably isn't a single setting that gets optimal use out of the flash memory for both use cases. The cloud provider doesn't care if a block, flash chip, drive, server rack, availability zone, or continent goes away. The embedded system may be happy to lose logs in exchange for having enough writes left to install the next security update.
It's all a mess, but the constraints have changed since we made the mess. You used to be happy to get 1/6th of a PCI Express lane for all your storage. Now processors directly expose 128 PCIe lanes and have a multitude of underused efficiency cores waiting to be used. Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.
1. Contemporary mainstream OSes have not risen to the challenge of dealing appropriately with the multi-CPU, multi-address-space nature of modern computers. The proportion of the computer that the "OS" runs on has been shrinking for a long time, and there have only been a few efforts to try to fix that (e.g. HarmonyOS, nrk, RTKit)
2. Hardware vendors, faced with proprietary or non-malleable OSes and incentives to keep as much magic in the firmware as possible, have moved forward by essentially sandboxing the user OS behind a compatibility shim. And because it works well enough, OS developers do not feel the need to adjust to the hardware, continuing the cycle.
There is one notable recent exception in adapting filesystems to SMR/zoned devices. However, this is only on Linux, so desktop PC component vendors do not care. (Quite the opposite: they disable the feature on desktop hardware for market segmentation.)
Are there solutions to this in the high-performance computing space, where random access to massive datasets is frequent enough that the “sandboxing” overhead adds up?
HPC systems generally use Lustre, where multiple servers handle metadata and objects (files) separately. These servers have multiple tiers of drives: metadata servers are SSD-backed, while file servers run on SSD-accelerated spinning-disk boxes with a mountain of 10TB+ drives.
When this structure is fed by an EDR/HDR/FDR InfiniBand network, the result is a blazingly fast storage system where a very large number of servers can make a massive number of random accesses simultaneously. The whole structure won't even shiver.
There are also other tricks Lustre can pull for smaller files to accelerate access and reduce overhead even further.
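For instance, Data-on-MDT and composite file layouts can keep small files entirely on the SSD-backed metadata servers so they never touch the OSTs. A rough sketch of the client-side commands, assuming a Lustre release with DoM support (the directory path and sizes are only illustrative):

  # first 64K of each new file lives on the MDT, the rest is striped normally
  lfs setstripe -E 64K -L mdt -E -1 -c 1 /mnt/lustre/small_files
  lfs getstripe /mnt/lustre/small_files    # verify the layout took effect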
In this model, the storage boxes are somewhat sandboxed, but the whole thing is mounted via its own client, so the OS sits very close to the model Lustre provides.
On the GPU servers, if you're going to provide big NVMe scratch spaces (à la NVIDIA DGX systems), you soft-RAID the internal NVMe disks with mdadm.
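A minimal sketch of that scratch setup, assuming four local NVMe devices and that /scratch is just an example mount point:

  # stripe the internal NVMe drives into one fast, disposable scratch volume
  mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  mkfs.xfs /dev/md0
  mount /dev/md0 /scratch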
In both models, saturation happens at the hardware level (disks, network, etc.); processors and other soft components don't impose a meaningful bottleneck even under high load.
Additionally, in the HPC space, power loss is not a major factor; backup power systems exist, and rerunning the last few minutes of a half-completed job is common, so on either side you are unlikely to encounter the fallout of "I clicked save, why didn't it save?"
In my experience, hours and days of jobs need to be rerun; our researchers do a poor job of checkpointing.
Of all the issues we have with Lustre, data loss has never been one while I have been on the team.
> In my experience, hours and days of jobs need to be rerun; our researchers do a poor job of checkpointing.
Enable pre-emption in your queues by default and that'll change: after a job has been scheduled and run for 1-2 hours, it can be kicked out and a new one run instead once the first one's priority decays a bit.
> When would I want to use preemption? When would I not want to use it?
> When a job is designated as a preemptee, we increase the job's priority, and increase several limits, including the maximum number of running processors or jobs per user, and the maximum running time per job. Note that these increased limits only apply to the preemptable job. This allows preemptable jobs to potentially run on more resources, and for longer times, than normal jobs.
> Enable pre-emption in your queues by default and that'll change.
We run preemptive queues, and no, not all jobs are compatible with that, especially the code researchers developed themselves.
My own code doesn't support checkpointing either. Currently it's blazing fast, but for bigger jobs it might need that support, and it would take many more cogs inside the pipeline to make it possible.
This is absolutely correct. The cattle vs. pets analogy [0] applies perfectly there. On the other hand, HPC systems are far from being unprotected. Storage systems generally disable write caches on spinning drives automatically and keep all in-flight data in either battery-backed or flash-based caches. So FS-level corruption is kept to a minimum.
Also, yes, many longer jobs do checkpoint and restart where they left off, but it's not always possible, unfortunately.
Is it though? IBM hates GPFS and has been trying to kill it off since its initial release, but every time it tries, the government (by NSF/tertiary proxy) stuffs more money into its mouth. It lives despite being hated by its parent. Both GPFS and Lustre have their warts.
I can recommend the related talk "It's Time for Operating Systems to Rediscover Hardware". [1]
It explores how modern systems are a set of cooperating devices (each with their own OS) while our main operating systems still pretend to be fully in charge.
Fundamentally the job of the OS is resource sharing and scheduling. All the low level device management is just a side show.
Hence why SSDs use a block-layer abstraction (or, in the case of NVMe key/value, hello 1964/CKD) above whatever pile of physical flash, caches, non-volatile caps/batts, etc. That abstraction holds from the lowliest SD card to huge NVMe-oF/FC/etc. smart arrays which are thin provisioning, deduplicating, replicating, snapshotting, etc. You wouldn't want this running on the main cores for performance and power-efficiency reasons. Modern M.2/SATA SSDs have a handful of CPUs managing all this complexity, along with background scrubbing, error correction, etc., so you would be talking fully heterogeneous OS kernels knowledgeable about multiple architectures, etc.
Basically it would be insanity.
SSDs took a couple orders of magnitude off the time values of spinning rust/arrays, but many of the optimization points of spinning rust still apply. Pack your buffers and submit large contiguous read/write accesses, queue a lot of commands in parallel, etc.
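A quick way to see that for yourself is fio, assuming the target device holds nothing you care about (these commands hit the raw device directly; the device name, block sizes, and queue depths are illustrative):

  # shallow random 4K reads vs. deep queues of large sequential reads
  fio --name=qd1-rand --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k --iodepth=1 --runtime=30 --time_based
  fio --name=qd32-seq --filename=/dev/nvme0n1 --direct=1 --rw=read --bs=1M --iodepth=32 --runtime=30 --time_based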
So, the fundamental abstraction still holds true.
And this is true for most other parts of the computer as well. Just talking to a keyboard involves multiple microcontrollers, scheduling the USB bus, packet switching, and serializing/deserializing the USB packets, etc. This is also why every modern CPU has a management CPU that bootstraps it and manages its power/voltage/clock/thermals.
So, hardware abstractions are just as useful as software abstractions like huge process address spaces, file IO, etc.
And if the entire purpose of computer programming is to control and/or reduce complexity, I should think the discipline would be embarrassed by the direction the industry has been going the past several years. AWS alone should serve as an example.
> And if the entire purpose of computer programming is to control and/or reduce complexity
I honestly don’t know where you got that idea from. I always thought the whole point of computer programming was to solve problems. If it makes things more complex as a result, then so be it. Just as long as it creates fewer, less severe problems than it solves.
What are some examples of complex solutions to simple problems? That is, where a solution doesn't result in a reduction of complexity? I can't find any.
And this is where the increased complexity is necessary for a solution, not just Perl Anti-golf or FactoryFactory Java jokes.
More complex systems are liable to create more complex problems...
I don't think you can get away from this - yes, you can solve a problem, but if you model problems as entropy, increasing complexity increases entropy.
It's like the messy room problem - you can clean your room (moving it out of an arguably high-entropy state), but unless you are exceedingly careful doing so increases entropy overall. You merely move the mess to the garbage bin, expend extra heat, increase your consumption in your diet, possibly break your ankle, strain your muscles...
Arguably cleaning your room is important, but decreasing entropy? That's not a problem that's solvable, not in this universe.
> but unless you are exceedingly careful doing so increases entropy
In an isolated system, entropy can only increase. Moving at all heats up the air. Even if you are exceedingly careful, you increase entropy when doing any useful work.
An interesting approach would be to standardize a way to program the controllers in flash disks, maybe something similar to OpenFirmware. Mainframes farm out all sorts of IO to secondary processors, and it was relatively common to overwrite the firmware in Commodore 1541 drives, replacing the basic IO routines with faster ones (or with copy-protection shenanigans). I'm not sure anyone ever did that, but it should be possible to process data stored in files without tying up the host computer.
But it's still an abstraction, and would have to remain that way unless you're willing to segment it into a bunch of individual product categories, since the functionality of these controllers grows with the target market. I.e., the controller on an eMMC isn't anywhere near similar to the controller on a flash array. So like GP-GPU programming, it's not going to be a single piece of code, because it's going to have to be tuned to each controller for perf as well as power reasons, never mind functionality differences (e.g. it would be hard to do IP/network-based replication if the target doesn't have a network interface).
There isn't anything particularly wrong with the current HW abstraction points.
This "cheating" by failing to implement the spec as expected isn't a problem that will be solved by moving the abstraction somewhere else, someone will just buffer write page and then fail to provide non volatile ram after claiming its non volatile/whatever.
And it's entirely possible to build "open" disk controllers, but it's not as sexy as creating a new processor architecture. Meaning RISC-V has the same problems if you want to talk to industry-standard devices (i.e. the NVMe drive you plug into the RISC-V machine is still running closed-source firmware on a bunch of undocumented hardware).
> the controller on an eMMC isn't anywhere near similar to the controller on a flash array
That’s why I suggested something similar to OpenFirmware. With that in place, you could send a piece of Forth code and the controller would run it without involving the CPU or touching any bus other than the internal bus in the storage device.
Adding a JIT to the mix only makes the problem worse. It's a question of resources: you're not going to fit all that code into a tiny SD microcontroller, which has a very limited CPU/RAM footprint even in comparison to SATA SSDs. The resource availability takes another huge jump when you move to a full-blown "array", which not only has all this additional disk-management functionality but also network-management functionality, etc. Some of these arrays have full-blown Xeons in them.
Plus, disks are at least partially sold on their performance profile, and you're going to create another problem with a cross-platform IR versus coding it in C/etc. and using a heavyweight optimizing compiler targeting the microcontroller in question directly.
You also have to remember these are effectively deeply embedded systems in many cases, which are expected to be available before the OS even starts. Sometimes that includes operating over a network of some form (NVMe-OF). So it doesn't even make sense when that network device is shared/distributed.
Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology. But in the enterprise SSD space, there's a lot of experimentation with exactly this kind of thing. One of the most popular right now is zoned namespaces, which separates write and erase operations but otherwise still abstracts away most of the esoteric details that will vary between products and chip generations. That makes it a usable model for both flash and SMR hard drives. It doesn't completely preclude dishonest caching, but removes some of the incentive for it.
There is no strong reason why a consumer SSD can't allow reformatting to a smaller normal namespace and a separate zoned namespace.
Zone-aware CoW file systems allow efficiently combining FS-level compaction/space-reclamation with NAND-level rewrites/write-leveling.
I'd probably pay for "unlocking" ZNS on my Samsung 980 Pro, if just to reduce the write amplification.
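For drives that do expose zones, stock Linux tooling can already inspect them; a sketch, assuming a zoned namespace shows up as /dev/nvme0n2 (the ZNS-specific commands come from nvme-cli's zns plugin):

  nvme zns id-ns /dev/nvme0n2          # zone size, max open/active zones
  nvme zns report-zones /dev/nvme0n2   # per-zone state and write pointers
  blkzone report /dev/nvme0n2          # same view through the block layer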
Enabling those features on the drive side is little more than changing some #ifdef statements in the firmware, since the same controllers are used in high-end consumer drives and low-power data center drives. But that doesn't begin to address the changes necessary to make those features actually usable to a non-trivial number of customers, such as anyone running Windows.
Isn't this a chicken and egg problem? Why would OS vendors spend time implementing this on their side if the drives don't support it?
The difference here being that it's not clear to me that there's much cost on the drive side to actually allow this. Aside maybe from the desire to segment the market.
To me, this looks like the whole sector-size situation. OSes, including regular Windows, have supported 4K drives for quite a while now. I bought a Samsung 980 (non-Pro) the other day that still pretends to have 512B sectors. The OEM drive in my laptop (some kind of Samsung) can be formatted with a 4K namespace, but the default is also 512B. The 980 doesn't even support this.
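For anyone curious what their own drive advertises, nvme-cli can list the supported LBA formats and, where the drive allows it, switch to the 4K one; a sketch (the LBA format index varies per drive, and nvme format destroys the data on the namespace):

  nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'   # list supported sector formats
  nvme format /dev/nvme0n1 --lbaf=1                # destructive: reformat to the 4K entry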
It's not quite a chicken and egg problem. Features like ZNS come into existence in the first place because they are desired by the hyperscale cloud providers who control their entire stack and are willing to sacrifice compatibility for small efficiency gains that matter at large scale.
The problem for the rest of the market is that the man-hours to rewrite your software stack to work with a different storage interface that allows eg. a 2% reduction in write amplification isn't worthwhile if you only have a fleet of a few thousand drives to worry about. There's minimal trickling down because the benefits are small and the compatibility costs are non-zero.
Even simple stuff like switching to shipping drives with a 4kB LBA size by default has very little performance impact (since drives are tracking things with 4kB granularity either way) and would be problematic for customers that want to apply a 512B disk image. The downsides of switching are small enough that they could easily be tolerated for the sake of a significant improvement, but for most of the market the potential improvement is truly insignificant. (And of course, fancy features are a market segmentation tool for drive vendors.)
> Why would OS vendors spend time implementing this on their side if the drives don't support it?
In the case of Microsoft, forcing the adoption of a de facto standard they create (and refusing to support competing ones OOTB) is immensely beneficial in terms of licensing fees.
> Consumer SSDs don't have much room to offer a different abstraction from emulating the semantics of hard drives and older technology.
From what I understand, the abstraction works a lot like virtual memory. The drive shows up as a virtual address space pretending to be a disk drive, and then the drive's firmware maps virtual addresses to physical ones.
That doesn't seem at all incompatible with exposing the mappings to the OS through newer APIs so the OS can inspect or change the mappings instead of having the firmware do it.
The current standard block storage abstraction presented by SSDs is a logical address space of either 512-byte or 4kB blocks (but pretty much always 4kB behind the scenes). Allocation is implicit upon writing to a block, and deallocation is explicit but optional. This model is indeed a good match for how virtual memory is handled, especially on systems with 4kB pages; there are already NVMe commands analogous to eg. madvise().
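The "explicit but optional" deallocation is what TRIM/Deallocate hints are; on Linux they are usually issued through the filesystem rather than against the drive directly. A sketch:

  fstrim -v /              # discard the free space of a mounted filesystem
  blkdiscard /dev/nvme0n1  # discard an entire (unmounted!) block device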
The problem is that it's not a good match for how flash memory actually works, especially with regards to the extreme disparity between a NAND page write and a NAND erase block. Giving the OS an interface to query which blocks the SSD considers as live/allocated rather than deallocated and implicitly zero doesn't seem all that useful. Giving the OS an interface to manipulate the SSD's logical to physical mappings (while retaining the rest of the abstraction's features) would be rather impractical, as both the SSD and the host system would have to care about implementation details like wear leveling.
Going beyond the current HDD-like abstraction augmented with optional hints to an abstraction that is actually more efficient and a better match for the fundamental characteristics of NAND flash memory requires moving away from a RAM/VM-like model and toward something that imposes extra constraints that the host software must obey (eg. append-only zones). Those constraints are what breaks compatibility with existing software.
If anything, consumer-level SSDs are moving in the opposite direction. On the Samsung 980 Pro it is not even possible to change the sector size from 512 bytes to 4K.
It's called the program-erase model. Some flash devices do expose raw flash, although it's then usually used by a filesystem (I don't know if any apps use it natively).
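On Linux, that raw flash shows up as MTD devices, and the erase/program split is visible directly in the tooling; a sketch with mtd-utils (device numbers and the image name are illustrative):

  cat /proc/mtd                     # list the raw flash partitions
  flash_erase /dev/mtd2 0 0         # erase the whole partition (offset 0, all blocks)
  nandwrite -p /dev/mtd2 image.bin  # program pages from an image
  # or hand the device to a flash-aware filesystem such as UBIFS or JFFS2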
There's a _lot_ of problems doing high performance NAND yourself. You honestly don't want to do that in your app. If vendors would provide full specs and characterization of NAND and create software-suitable interfaces for the device then maybe it would be feasible to do in a library or kernel driver, but even then it's pretty thankless work.
You almost certainly want to just buy a reliable device.
Endurance management is very complicated. It's not just a matter of whether any given block's P/E cycles will meet the UBER spec at the data-retention limit with the given ECC scheme. Well, it could be in a naive scheme, but then your costs go up.
Even something as seemingly simple as error correction is not. Error correction is too slow to do on the host for most IOs, so you need hardware ECC engines on the controller. But those become very large if you give them a huge amount of correction capability, so if errors exceed their capability you might fall back to firmware. Either way, the error rate is still important for knowing the health of the data, so you would need error-rate data to be sent side-band with the data by the controller somehow. If you get a high error rate, does that mean the block is bad, or does it mean you chose the wrong Vt to issue the read with, the retention limit was approached, the page had read-disturb events, dwell time was suboptimal, or the operating temperature was too low? All these things might factor into your garbage collection and endurance management strategy.
Oh and all these things depend on every NAND design/process from each NAND manufacturer.
And then there's higher level redundancy than just per-cell (e.g., word line, chip, block, etc). Which all depend on the exact geometry of the NAND and how the controller wires them up.
I think a better approach would be a higher-level logical program/free model that sits above the low-level UBER guarantees. GC would have to heed direction coming back from the device about which blocks must be freed and which blocks must be allocated next.
> Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out.
I don't know, maybe if there was a "my files exist after the power goes out" column on the website, then I'd sort by that, too?
Ultimately the problem is on the review side, probably because there's no money in it. There just aren't enough channels to sell that kind of content into, and it all seems relatively celebrity-driven. That said, I bet there's room for a YouTube personality to produce weekly 10-minute videos where they torture hard drives old and new - and torture them properly, with real scientific/journalistic integrity. So basically you need an idealistic, outspoken nerd and a little money to burn on drives and an audio/video setup. Such a person would definitely have such a "column" included in their reviews!
(And they could review memory, too, and do backgrounder videos about standards and commonly available parts.)
>I don't know, maybe if there was a "my files exist after the power goes out" column on the website
more like, "don't lose the last 5 seconds of writes if the power goes out". If ordering is preserved you should keep your filesystem, just lose more writes than you expected.
I wouldn't expect ordering of writes to be preserved, absent a specific way of expressing that need; part of a write cache's job is reordering writes to be more efficient which means ordering is not generally preserved.
But then again, if they're willing to accept and confirm flush commands without flushing, I wouldn't expect them to actually follow ordering constraints.
>part of a write cache's job is reordering writes to be more efficient which means ordering is not generally preserved.
You can use some sort of WAL mechanism to ensure that the parallel writes appear as if ordering was preserved. That will allow you to lie and ignore fsyncs, but still ensure consistency in case of a crash.
>But then again, if they're willing to accept and confirm flush commands without flushing, I wouldn't expect them to actually follow ordering constraints.
It depends on which type of liar they think you are. If they're the "don't care, disable all safeguards" type, then yes, they're probably ignoring ordering as well. However, it's also possible they're the methodical liar, figuring out what they can get away with. As I mentioned in another comment in this thread, as long as ordering is preserved the lie wouldn't be noticed in typical use cases (i.e. not using it for some sort of prod DB, and not using it as part of a multi-drive array). Power losses are relatively common, and a drive that totally corrupts the filesystem on it will get noticed much more quickly than a drive that merely loses the last few seconds of writes.
The flip side of the tyranny of the hardware flash controller is that the user can't reliably lose data even if they want to. Your super-secure end-to-end messaging system that automatically erases older messages is probably leaving a whole bunch of copies of those "erased" messages lying around on the raw flash on the other side of the hardware flash controller. This can create a weird situation where it is literally impossible to reliably delete anything on certain platforms.
There is sometimes a whole device erase function provided, but it turns out that a significant portion of tested devices don't actually manage to do that.
But then you have to find a place to store the key that can be securely erased. Perhaps there is some sort of hardware enclave you can misuse. Even a tiny amount of securely erasable flash would be the answer.
A TPM can only store a limited number of keys. You need a separate key for anything you want to securely delete, and in a lot of applications you might have a lot of things you want to delete separately.
You can pretty easily expand one secure, rotatable key into N. 1. Don't use TPM key directly, use it to encrypt the list of working keys. 2. Store the TPM-encrypted list of working keys on disk. 3. When you need to drop a working key, remove it from the list, rotate the TPM key and reencrypt all the working keys, and store the new list on disk again. Remnants of the old list are irrecoverable because the old TPM key doesn't exist anymore, and the new list is inaccessible without the new TPM key. There, now you have an arbitrary number of secure keys and can drop them individually.
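A minimal sketch of the re-wrap step, with openssl standing in for the TPM (in a real design the wrapping key would live in, and be rotated by, the TPM rather than a file; the file names and the dropped-key filter are hypothetical):

  # generate a new wrapping key, then re-wrap the list minus the dropped key
  openssl rand -hex 32 > kek.new
  openssl enc -d -aes-256-cbc -pbkdf2 -pass file:kek.old -in keys.enc \
    | grep -v "$KEY_ID_TO_DROP" \
    | openssl enc -aes-256-cbc -pbkdf2 -pass file:kek.new -out keys.enc.new
  mv keys.enc.new keys.enc   # old list + old KEK together are now useless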
Great point. This assumes that the TPM does secure deletes. Their primary purpose is to protect keys, not get rid of them. I think in practice a TPM is a small enough system that the deletion would be secure simply because that is the simplest way to implement it. If you do this enough, some overwriting will likely occur. I guess media endurance could be a problem in some cases.
This is the theory, where you never have to store the key on disk. In reality you store the key on disk while performing actions that would block the TPM chip from releasing the key, such as upgrading the firmware.
Great, we'll just store the key persistently on... Disk? Dammit! Ok, how about we encrypt the key with a user auth factor (like passphrase) and only decrypt the key in memory! Great. Now all we need to do is make sure memory is not persisted to disk for some unrelated reason. Wait...
Swap on zram instead of disk-based swap prevents persisting memory to disk and also dramatically improves swap performance. It's enabled by default on Fedora. I use it everywhere - on my desktop and on production servers.
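A sketch of setting it up by hand (Fedora ships this via zram-generator and a config file instead; size, algorithm, and priority here are illustrative and depend on the kernel):

  zramctl --find --size 8G --algorithm zstd   # allocates e.g. /dev/zram0
  mkswap /dev/zram0
  swapon --priority 100 /dev/zram0            # prefer it over any disk swap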
For sure, I'm not saying it's unsolvable, just that the defaults are insecure. Even if I, as an app developer, wanted to provide security for my users, I can't confidently delete sensitive data since this happens below layers I can or should control. We can argue about who is responsible, but it's not a great situation.
> Maybe we could do all the "smart" stuff in the OS and application code, and just attach commodity dumb flash chips to our computer.
Yeah, this is how things are supposed to be done, and the fact that it's not happening is a huge problem. Hardware makers isolate our operating systems in the exact same way operating systems isolate our processes. The operating system is not really in charge of anything; the hardware just gives it an illusory sandboxed machine to play around in, a machine that doesn't even reflect what the hardware truly looks like. The real computers are all hidden and programmed by proprietary firmware.
Flash storage is complex in the extreme at the low level. The very fact we're talking about microcontroller flash as if it's even in the same ballpark as NVMe SSDs in terms of complexity or storage management says a lot on its own about how much people here understand the subject (including me.)
I haven't done research on flash design in almost a decade, back when I worked on backup software, and my conclusions back then were basically this: you're just better off buying a reliable drive that can meet your own reliability/performance characteristics, making tweaks to your application to match the underlying drive's operational behavior (coalesce writes, append as much as you can, take care with multithreading vs HDDs/SSDs, et cetera), and testing the hell out of that with a blessed software stack. So we also did extensive tests on what host filesystems, kernel versions, etc. seemed "valid" or "good". It wasn't easy.
The amount of complexity needed to manage error correction and wear leveling on these devices alone, including auxiliary constraints, probably rivals the entire Linux I/O stack. And it's all incredibly vendor-specific. An auxiliary case, e.g. the OP's case of handling power loss and flushing correctly, is vastly easier when you only have to consider some controller firmware and some capacitors on the drive, versus the whole OS being guaranteed to handle any given state the drive might be in, with adequate backup power, at the time of failure, for any vendor and any device class. You'll inevitably conclude the drive is the better place to do this job precisely because it eliminates a massive number of variables like this.
"Oh, but what about error correction and all that? Wouldn't that be better handled by the OS?" I don't know. What do you think "error correction" means for a flash drive? Every PHY on your computer for almost every moderately high-speed interface has a built in error correction layer to account for introduced channel noise, in theory no different than "error correction" on SSDs in the large, but nobody here is like, "damn, I need every number on the USB PHY controller on my mobo so that I can handle the error correction myself in the host software", because that would be insane for most of the same reasons and nearly impossible to handle for every class of device. Many "Errors" are transients that are expected in normal operation, actually, aside from the extra fact you couldn't do ECC on the host CPU for most high speed interfaces. Good luck doing ECC across 8x NVMe drives when that has to go over the bus to the CPU to get anything done...
You think you want this job. You do not want this job. And we all believe we could handle this job because all the complexity is hidden well enough and oiled by enough blood, sweat, and tears, to meet most reasonable use cases.
No, they look like any normal flash drive, actually. Traditionally, for any drive you can buy at the store, the storage controller sits on the literal NVMe drive next to the flash chips, mounted on the PCB, and the controller handles all the "meaty stuff"; it's what the OS talks to. The reason for this is obvious: you can just plug it into an arbitrary computer, and the controller abstracts the differences between vendors, so any NVMe drive works anywhere. The key takeaway is that the storage controller exists "between" the two.
Apple still has a flash storage controller that exists entirely separately from the host CPU and the host software stack, just like all existing flash drives do today. The difference? The controller just doesn't sit on a literal, physical drive next to the flash chips, because there is no drive; they just solder the flash directly onto the board without a mount like an M.2 drive. Again, no variability here, so it can all be "hard coded." The storage controller instead sits next to the CPU in the "T2 security chip", which also handles things like in-line encryption on the path from the host to the flash (which is instead normally handled by host software before being put on the bus). It also does some other stuff.
So it's not a matter of "architecture", really. The architecture of all T2 Macs which feature this design is very close, at a high level, to any existing flash drive. It's just that Apple is able to put the NVMe controller in a different spot, and their "NVMe controller" actually does more than that; it doesn't have to be located on a separate PCB next to the flash chips at all because it's not a general 3rd party drive. It just has to exist "on the path" between the flash chips and the host CPU.
I would absolutely love to have access to "dumb" flash from my application logic. I've got append only systems where I could be writing to disk many times faster if the controller weren't trying to be clever in anticipation of block updates.
The ECC and anything to do with multi- or triple-level-cell flash is quite non-trivial. You don't want to have to think about these things if you don't have to. But yes, better control over the flash controllers would be nice. There are alternative modes for NVMe like those specifically for key-value stores: https://nvmexpress.org/developers/nvme-command-set-specifica...
This is like the statement that if I optimize memcpy() for the number of controllers, levels of cache, and latency to each controller/cache, it's possible to make it faster than both the CPU microcoded version (rep stosq/etc) and the software versions provided by the compiler/glibc/kernel/etc. Particularly if I know what the workload looks like.
And it breaks down the instant you change the hardware, even in the slightest ways. Frequently the optimizations then made turn around and reduce the speed below naive methods. Modern flash+controllers are massively more complex than the old NOR flash of two decades ago. Which is why they get multiple CPUs managing them.
IMO the problem here is that even if your flash drive presents a "dumb flash" API to the OS, there can still be caching and other magic that happens underneath. You could still be in a situation where you write a block to the drive, but the drive only writes that to local RAM cache so that it can give you very fast burst write speeds. Then, if you try to read the same block, it could read that block from its local cache. The OS would assume that the block has been successfully written, but if the power goes out, you're still out of luck.
"Clearly vendors and users are at odds with each other here; vendors want the best benchmarks (so you can sort by speed descending and pick the first one), but users want their files to exist after their power goes out."
Clearly the vendors are at odds with the law, selling a storage device that doesn't store.
I think they are selling snake oil, otherwise known as committing fraud. Maybe they made a mistake in design, and at the very least they should be forced to recall faulty products. If they know about the problem and this behaviour continues, it is basically fraud.
If we allow this to continue, the manufacturers that actually do fulfill their obligations to the customer suffer financially, while unscrupulous ones laugh all the way to the bank.
I agree, all the way up to entire generations of SDRAM being unable to store data at their advertised speeds and refresh timings. (Rowhammer.) This is nothing short of fraud; they backed the refresh off WAY below what's necessary to correctly store and retrieve data accurately regardless of adjacent row access patterns. Because refreshing more often would hurt performance, and they all want to advertise high performance.
And as a result, we have an entire generation of machines that cannot ever be trusted. And an awful lot of people seem fine with that, or just haven't fully considered what it implies.
I don't know if a legal angle is the most helpful, but we probably need a Kyle Kingsbury type to step into this space and shame vendors who make inaccurate claims.
Which is currently all of them, but that was also the case in the distributed systems space when he first started working on Jepsen.
Sure, of course. But even if you did want to seek a legal remedy, someone would have to do the work to clearly document the issue for the purposes of making it clear to a non-technical courtroom.
And at the point where that documentation had been done, that on its own might be enough to right the ship without anyone actually having to get sued.
The manufacturers warrant these devices to behave on a motherboard with proper power hold up times, not in whatever enclosures.
If the enclosure vendor suggests that behavior on cable pull will fully mimic motherboard ATX power loss, then that is fraud. But they probably have fine print about that, I'd hope.
"The manufacturers warrant these devices to behave on a motherboard with proper power hold up times"
That's an interesting point; doesn't 'power failure' also include potential failure of the power supply, in which case you might not get that time?
Or what if a new write command is issued within the holdup time? Does the motherboard/OS know about power loss during those 16 milliseconds that the power is still holding?
'Power loss' or 'power failure' for a part designed to operate at ATX specs does not mean supply failure. Supply failure can cause anything up to and including destruction of all components and even death of operator.
Anyway, let's firm up how an SSD works and what the OS knows.
SSDs have volatile DRAM buffers as a staging area to use before writing to the flash.
Flush (OS ioctl) means the data is successfully residing in the volatile DRAM of the SSD.
This is all the OS knows and usually ever knows in the ioctl cycle.
If power is lost, there is some time before the >16ms is up that the power-good signal is lost on the motherboard. The voltage on the 3.3V rail will probably also drop enough from nominal to let the SSD controller know it had better get its housekeeping in order. In other words, dump the DRAM somewhere permanent and deal with it on the next power up.
Anything the OS is doing in the interim will not likely be acknowledged as flushed so that's not a concern. The OS userspace write will never complete. That loop works fine.
The thing that gets people up in arms is that flush means the SSD has the data only in volatile memory and not necessarily in non-volatile storage.
All performant SSDs seem to work this way. They need buffers.
The larger form-factor enterprise drives, which are maybe 25% more expensive, have PLP capacitor banks. These supply a solid 50ms of power. Some manufacturers supply oscilloscope screenshots and such.
Anything else seems to be variable in its approach to power loss, particularly the smaller, hotter M.2 parts.
Capacitor banks have issues like taking up space, causing inrush currents, gaining impedance over time, and mediocre reliability at the high temperatures that latest M.2 sticks experience.
> The larger form factor enterprise drives, which are maybe 25% more expensive, have PLP capacitor banks.
I wonder if there would be a market for a small board that contains the capacitors and passes the signals down to a M.2 female connector. The physical disconnect would probably help with the temperature as well (and the board could come with its own heatsink).
I want to revise my comments. There are indeed some capacitors on many M.2 boards -- not sure how much capacitance. It takes several mF or more to drive a few amps at 3.3V for some tens of milliseconds, which is not insignificant, so larger form factors are certainly at an advantage.
Nothing says that you can't both offload everything to hardware, and have the application level configure it. Just need to expose the API for things like FLUSH behavior and such...
Yeah, you're absolutely right. I'd prefer that the world dramatically change overnight, but if that's not going to happen, some little DIP switch on the drive that says "don't acknowledge writes that aren't durable yet" would be fine ;)
> the embedded system model where you get flash hardware that has two operations -- erase block and write block
> just attach commodity dumb flash chips to our computer
I kind of agree with your stance; it would be nice for kernel- or user-level software to get low-level access to hardware devices to manage them as they see fit, for the reasons you stated.
Sadly, the trend has been going toward smart devices for a very long time now. In the very old days, stuff like floppy disk seeks and sector management were done by the CPU, and "low-level formatting" actually meant something. Decades ago, IDE HDDs became common, LBA addressing became the norm, and the main CPU cannot know about disk geometry anymore.
I think the main reason they did not expose lower-level semantics is that they wanted a drop-in replacement for HDDs. The second is liability: unfettered access to arbitrary-location erases (and writes) can let you kill (wear out) a flash device in a really short time.
SATA vs NVMe vs SCSI/SAS only matters at the lowest levels of the operating system's storage stack. All the filesystem code and almost all of the block layer can work with any of those transports using the same HDD-like abstractions. Switching to a more flash-friendly abstraction breaks compatibility throughout the storage stack and potentially also with assumptions made by userspace.
I've actually run into some data loss running simple stuff like pgbench on Hetzner due to this -- I ended up just turning off write-back caching at the device level for all the machines in my cluster:
Granted, I was doing something highly questionable (running Postgres with fsync off on ZFS). It was very painful to get to the actual issue, but I'm glad I found out.
I've always wondered whether it would be worth starting a simple data product with tests like these on various cloud providers, so you know where these corners are and what you're really getting for the money (or lack thereof).
[EDIT] To save people some time (that post is long), the command to set the feature is the following:
nvme set-feature -f 6 -v 0 /dev/nvme0n1
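And to check what the drive currently reports before and after, something like the following should work (feature 6 is the Volatile Write Cache):

  nvme get-feature -f 6 -H /dev/nvme0n1       # current enable/disable state
  nvme id-ctrl /dev/nvme0 | grep -i '^vwc'    # whether the drive reports a volatile cache at all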
The docs for `nvme` (nvme-cli package, if you're Ubuntu based) can be pieced together across some man pages:
Unlike eg. ATA and SCSI, the NVMe specs are freely available to the public. They're a little more complicated to read now that the spec has been split into a few modules, but finding the descriptions of all the optional features isn't too hard.
The nvme-cli tool and its documentation are written with the assumption that the user is at least somewhat familiar with the NVMe spec or protocol itself, because a large part of the purpose of that tool is to expose NVMe functionality that the OS does not currently understand or make use of. It's meant to be a pretty raw interface.
From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.
If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.
That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.
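One way to at least pin down what the software side is doing, assuming you can attach to a backend process, is to watch the sync calls directly (the PID is a placeholder):

  strace -f -tt -e trace=fsync,fdatasync,sync_file_range -p <postgres_backend_pid>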
> From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.
Yeah I thought the same initially which is why I was super confused --
> If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.
Gulp.
> That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.
It looks like I'm going to have to do some more experimentation on this -- maybe I'll get a fresh machine and try to reproduce this issue again.
What led me to suspect the NVMe drive was dropping writes was the complete lack of errors on the pg and OS side (dmesg, etc).
I think this is something LTT could handle with their new test lab. They already said they want to set new standards when it comes to hardware testing, so if they can live up to what they promised and hire enough experts, this should be a trivial thing to add to a test regimen for disk drives.
LTT's commentary makes it difficult to trust they are objective (even if they are).
I loved seeing how giddy Linus got while testing Valve's Steamdeck, but when it comes to actual benchmarks and testing, I would appreciate if they dropped the entertainment aspect.
GamersNexus seems to really be trying to work on improving and expanding their testing methodology as much as possible.
I feel like they've successfully developed enough clout/trust that they have escaped the hell of having to pull punches in order to ensure they get review samples.
They eviscerated AMD for the 6500xt. They called out NZXT repeatedly for a case that was a fire hazard (!). Most recently they've been kicking Newegg in the teeth for trying to scam them over a damaged CPU. They've called out some really overpriced CPU coolers that underperform compared to $25-30 coolers. Etc.
I bet they'd go for testing this sort of thing, if they haven't already started working on it. They'd test it and then describe for what use cases it would be unlikely to be a problem vs what cases would be fine. For example, a game-file-only drive where if there's an error you can just verify the game files via the store application. Or a laptop that's not overclocked and is only used by someone to surf the web and maybe check their email.
> for starters i think the lab is going to focus on written for its own content and then supporting our other content [mainly their unboxing videos]... or we will create a lab channel that we just don't even worry about any kind of upload frequency optimization and we just have way more basic, less opinionated videos, that are just 'here is everything you need to know about it' in video form if, for whatever reason, you prefer to watch a video compared to reading an article
Ah, I see how my comment was misleading--it really meant to highlight that at times I do appreciate LTT's entertainment aspect, not that I expected there to be a technical review of the steamdeck.
I'd really like to see one of the popular influencers disrupt the review industry by coming up with a way to bring back high quality technical analysis. I'd love to see what the cost of revenue looks like in the review industry. I'm guessing in-depth technical analysis does really bad in the cost of revenue department vs shallow articles with a bunch of ads and affiliate links.
I think the current industry players have tunnel vision and are too focused on their balance sheets. Things like reputation, trust, and goodwill are crucial to their businesses, but no one is getting a bonus for something that doesn't directly translate into revenue, so those things get ignored. That kind of short sighted thinking has left the whole industry vulnerable to up and coming influencers who have more incentive to care about things like reputation and brand loyalty.
I've been watching LTT with a fair bit of interest to see if they can come up with a winning formula. The biggest problem is that in-depth technical analysis isn't exciting. I remember reading something many years ago, maybe from JonnyGuru, where the person was explaining how most visitors read the intro and conclusion of an article and barely anyone reads the actual review.
Basically you need someone with a long term vision who understands the value you get from in-depth technical analysis and doesn't care if the cost of it looks bad on the balance sheet. Just consider it the cost of revenue for creating content and selling merchandise.
The most interesting thing with LTT is that I think they've got the pieces to make it work. They could put the most relevant / interesting parts of a review on YouTube and skew it towards the entertainment side of things. Videos with in-depth technical analysis could be very formulaic to increase predictability and reduce production costs and could be monetized directly on FloatPlane.
That way they build their own credibility for their shallow / entertaining videos without boring the core audience, but they can get some cost recovery and monetization from people that are willing to pay for in-depth technical analysis.
I also think it could make sense as bait to get bought out. If they start cutting into the traditional review industry someone might come along and offer to buy them as a defensive move. I wonder if Linus could resist the temptation of a large buyout offer. I think that would instantly torpedo their brand, but you never know.
They rigorously test their hardware and you can filter/sort by literally hundreds of stats.
I just built a PC and I would have killed for a site that had apples-to-apples benchmarks for SSDs/RAM/etc. Motherboard reviews especially are a huge joke. We're badly missing a site like that for PC components.
Just google userbenchmark bias. Basically, when AMD shook things up a few years ago with huge numbers of cores, UserBenchmark responded by weighting down the importance of multithreaded workloads in their scores, so Intel would stay on top. Now their site is banned from most subreddits, including both r/intel and r/amd.
It wasn't one event, more like the ratings for CPUs just became laughably, transparently, utterly worthless, to the point where Intel i3 laptop CPUs were scoring higher than top-end AMD Threadripper CPUs. And they refused to acknowledge any issues.
Within a month or two after AMD started shipping CPUs with more than 8 cores, they tweaked the algorithm to ignore >8 cores. And various other ridiculous changes that hurt AMD's rankings.
Unfortunately Userbenchmark is totally useless for comparing components. They don’t even attempt to benchmark one change at a time while keeping all other parts of the testbench identical.
Worse yet every time I benchmark one of my machines, I score significantly higher than the average user results for the same hardware. Perhaps the average submitter has crapware/antivirus installed or their machines are misconfigured (e.g. XMP disabled) which makes all their data suspect.
I appreciate the links. But it's tough to believe stats uploaded by random users, especially when we're only talking a few percent difference. Not to mention, if you sort by "avg bench %", apparently WD released an NVMe drive that's faster than Intel Optane. You'd think that would have made the news.
fwiw the best motherboard comparison I found was on overclock.net[1]. It didn't list everything I cared about, but it was a great starting point
Individual benchmarks uploaded by random users would be hard to trust yes, but UserBenchmark collects thousands. If you click through to the page for a given product it'll even show you a distribution graph of the collected scores from different real-world machines.
> apparently WD released an NVMe drive that's faster than Intel Optane
"Faster" is a matter of opinion; it depends on what you're optimizing for. Optane obviously has faster random reads, but it's not so great at sequential writes. The UserBenchmark score tries to take all of that into account: https://ssd.userbenchmark.com/Compare/Intel-905P-Optane-NVMe...
I mentioned this in another comment, but I think GamersNexus is doing exactly what you want.
Regarding influencers: they're being leveraged by companies precisely because they are about "the experience", not actual objective analysis and testing. 99% of the "influencers" or "digital content creators" don't even pretend to try to do analysis or testing, and those that do generally zero in on one specific, usually irrelevant, thing to test.
I hope they do a good mix of entertainment and GamersNexus's depth. I'm struggling to watch GN without zoning out after a couple minutes. It's good deep content for sure, but if it was in written form I'd just skim and get the interesting bits.
You wrote: <<bring back high quality technical analysis>>
How about Tom's Hardware and AnandTech? If they don't count, who does? Years ago, I used to read CDRLabs for optical media drives. Their reviews were very scientific and consistent. (Of course, optical media is all but dead now.)
He’s recently pivoted a ton of his business to proper lab testing, and is hiring for it. It’ll be interesting to see, I think he might strike a better balance for those types of videos (I too am a bit tired of the clickbait nature these days).
So he says; I wish the funding were available to other groups who already have a more proven / technical track record in the area, though.
Like... what if LTT bought out Anandtech instead of trying to spin up a new 'labs' to replicate what has largely been lost (but still exists to an extent) in a few dusty corners of the tech journalism world.
I'm willing to give the benefit of the doubt, but there's so far been a lot of "just try me" and "we hired someone amazing" but I'll believe it when I see results!
But audience is also important. If it is only super-technical sources that are reporting faulty drives, then the manufacturers won't care much. However, if you get a very popular source that has a large audience, especially in the high-margin "gamer" vertical, then all of a sudden the manufacturers will care a lot.
So if LTT does start providing more objective benchmarks and reviews it could be a powerful market force.
I would personally leave this kind of testing to the pros, like Phoronix, Gamers Nexus, etc. LTT is a facade for useful performance testing and understanding of hardware issues.
I think "the problem with LTT" is that, as time goes on, they've slid off the purely informational stuff and into the whatever-gets-the-most-clicks stuff. I don't mind a little bit of humor or personality (Digital Foundry is great in that regard), but when Linus started uploading videos that defended his use of click-baity thumbnails and the bribes he received from Nvidia/Intel, his credibility fell off a cliff for me. If you're not going to stand for the objective truth, why even bother reviewing hardware? I'd imagine that pressure is what pushed them to invest in this lab, but even then I have a hard time trusting them.
Linus is welcome to chase whatever niche market he wants, but as a "purely informational source" he's got a pretty marred track record these days.
Why do clickbaity thumbnails matter more than the content of the video? Linus has made it clear that he hates using them, but videos with them consistently get way better viewership, which is kinda essential to keep the channel running.
I'm also very curious to see a source on the "bribes he received from Nvidia/Intel", because I'm not finding anything that looks relevant on Google.
> Why do clickbaity thumbnails matter more than the content of the video?
Take for example this recent video: "We ACTUALLY downloaded more RAM" [1] complete with grinning youtube face holding a stick of RAM marked '10TB' - and it's complete bullshit.
How can I trust the opinion of someone who publishes such embarrassing nonsense?
I like how you quoted my question and then completely ignored it. The fact that you disliked the title of a video is not in any way a meaningful criticism of its content.
But okay, let's take a closer look at that video. When I saw it in my YouTube recs, I rolled my eyes at the clickbait and skipped over it, but I didn't see how it makes LTT look bad. In fact, I just gave it a fair shot and skimmed through it for myself, and it actually looks like a pretty solid explanation of memory hierarchy and swap space for beginners, packaged in a format that will increase its reach. I don't see what's bullshit about that.
Look, say what you will about clickbait, the unfortunate truth is that it gets views, which channels like LTT need to survive and grow. Linus is on the record saying he hates it, but they've done the tests to confirm that the stupid thumbnails and titles just perform better.
And come on, let's be honest here: How many people are going to click on a video titled "What is swap space?" or "You can use Google Drive for swap space on Linux" or something similarly boring? Even the best explanation in the world isn't going to get traction with a title like that. I looked for comparable videos and it looks like "What is Linux swap?" by Average Linux User (https://youtu.be/0mgefj9ibRE) is the next most-viewed video on the same topic. That video has gotten about 90,000 views since it was posted in 2019. By comparison, the LTT video has averaged about 100,000 views per day in the 16 days since it was posted.
So it looks to me like LTT took a technical topic that most people would never think about, found an angle to make it interesting to random people browsing YouTube, and tricked potentially thousands of people into learning something. What exactly is embarrassing about that?
Not sure about the other stuff you claim; I'm not a big video guy for tech things (just let me skip and search ahead easily), but this came up with my friends a few years ago after people started noticing many videos from various creators moving to this format of thumbnail.
I developed SSD firmware in the past, and our team always made sure the drive would write the data and check the write status. We also used to analyze competitor products with bus analyzers and could determine that some didn't do that. Also, in the past, many OS filesystems would ignore many of the errors we returned anyway.
Edit: Here is an old paper on the subject of OS filesystem error handling.
There is a 970 Evo, a 970 Pro and a 970 Evo Plus, but no 970 Evo Pro as far as I am aware. It would be interesting to know what model OP is actually talking about and whether it is the same for other Samsung NVMe SSDs. I also prefer Samsung SSDs because they are reliable and they usually don't swap parts for lower-spec ones while keeping the same model number like some other vendors.
And watch out with the 980 Pro, Samsung has just changed the components.
Samsung have removed the Elpis controller from the 980 PRO and replaced it with an unknown one, and also removed any speed reference from the spec sheet.
It's OK for them to do this, but then they should give the new product a new name, not re-use the old name so that buying it becomes a "silicon lottery" as far as performance goes.
Link seems to be broken, shows a picture of the note 10 for me. I guess you wanted to link this one [1].
I knew about the changed controller in the 970 Evo Plus, but I wasn't aware they also changed the 980 Pro. That's disappointing. Is there anyone left that isn't doing those shenanigans?
I mostly buy Samsung Pro. Today I put an Evo in a box which I'm sending back for RMA because of damaged LBAs. I guess I'm stopping my tests on getting anything else but the Pros.
But IIRC Samsung was also called out for switching controllers last year.
"Yes, Samsung Is Swapping SSD Parts Too | Tom's Hardware"
I see they don't want a "thing", but that hardly seems to be a reason to not name names. Is there some special status of companies that the non-conformant status of their devices should be private?
It turning into a "thing" sounds like a net win for consumers.
> I see they don't want a "thing", but that hardly seems to be a reason to not name names.
I see you've never experienced the shit-storm of abusive messages sometimes received from fans when you say something bad about the products from a company they are unreasonably attached to. Or the rather aggressive stance some companies themselves take when something not complimentary is said. Or in the middle, paid shills (the company getting someone to pretend to be one of those overly attached people).
Everything you listed are external chilling factors.
You would blame the person neutrally shining the flashlight for the obscene response of others? Simply identifying a list of tested drives should not cause fear for someone's wellbeing. Anything otherwise is a successful stifling of knowledge.
> You would blame the person neutrally shining the flashlight for the obscene response of others?
Absolutely not, you seem to have misread me there.
I'm saying I understand the bad results not being published to avoid the potential for the obscene response from others.
Publishing the good results is enough of a public service. More than required, in fact. The test results could have been kept to themselves as nothing is owed to the rest of us.
Perhaps the author does not want to name the vendors that failed, giving them time to contact him with some attractive suggestions. Or am I too suspicious? :)
I'm curious whether the drives are at least maintaining write-after-ack ordering of FLUSHed writes in spite of a power failure. (I.e., whether the contents of the drives after power loss are nonetheless crash consistent.) That still isn't great, as it messes with consistency between systems, but at least a system solely dependent on that drive would not suffer loss of integrity.
Enterprise drives with PLP (power loss protection) are surprisingly affordable. I would absolutely choose them for workstation / home use.
The new Micron 7400 Pro M.2 960GB is $200, for example.
Sure, the published IOPS figures are nothing to write home about, but drives like these 1) hit their numbers every time, in every condition, and 2) can just skip flushes altogether, making them much faster in uses where data integrity is important (and flushes would otherwise be issued).
b...but the M in MLC stands for multi... as in multiple... right?
checks
Oh... uh; Apparently the obvious catch-all term MLC actually only refers to dual layer cells, but they didn't call it DLC, and now there's no catch-all term for > SLC. TIL.
The "L" stands for level, and that makes it even more wrong. MLC should have been called QLC or DBC, and TLC should have been OLC or TBC. It's two bits and four levels, or three bits and eight levels. The latest "QLC" flash has 16 levels (and is a massive step down in performance and reliability; it seems TLC is the sweet spot, at least right now, unless you really want the absolute cheapest, performance be damned).
Interestingly enough, Apple has a patent for 8-bit cell flash (256 levels!), going full blown analog processing and error correction, but I don't think that ever became a product.
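Spelling out the relationship the naming obscures (nothing new here, just the arithmetic from the comments above, in LaTeX):

    \text{levels per cell} = 2^{\text{bits per cell}}
    \quad\Rightarrow\quad \text{SLC}=2,\quad \text{MLC}=4,\quad \text{TLC}=8,\quad \text{QLC}=16,\quad \text{8-bit cell}=256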
I'm not sure there were ever consumer TLC SSDs that didn't use SLC caching (except for a handful that use MLC caching). SLC caching was starting to show up even in some MLC drives when the market was transitioning from MLC to TLC, and now there are even some enterprise drives that use SLC caching.
Samsung has a hardware testing lab where all new storage products (SSDs/memory cards) are rigorously put through (automated) tests with a ridiculous number of reads, writes and power scenarios. The numbers are then averaged out and dialed down a bit to provide some buffer and finally advertised on the models. I'm not surprised that they maintain data integrity. They also own their entire stack (software and hardware) so there is less scope for an untested OEM bug to slip through.
Per the title, four vendors were tested. Samsung was already mentioned as a non-loser, so it can't be one of the two losers (or else the title would be wrong and the SSDs would be from 3 vendors at most).
I didn't pay careful attention to the wording of the submitted title. I may have been confused because of the wording of the actual tweet: "I tested a random selection of four NVMe SSDs from four vendors."
The word "random" meant to me that Samsung drives could have been selected twice. But, yes, then there wouldn't be four distinct vendors.
Unstated but implied by you is there are only two (major) Korean vendors to choose from.
So if Samsung is a Korean winner then Hynix must be the Korean loser. Which is now clear to me.
Is it possible there's a third (minor) Korean player? Could I possibly still have a chance? :)
> Is it possible there's a third (minor) Korean player? Could I possibly still have a chance? :)
Well supposedly Zalman (another Korean company) makes SSDs, but I don't think I've ever seen one in the wild. Their specialty is heatsinks and fans, last I checked.
Please stop replying with misinformation all over this thread.
The NVMe spec is available for free; you should read it.
And you're 100% wrong about the enclosure too. It's driven by an Intel TB bridge JHL6240 and the drives are PCIe NVMe m.2 devices. Power specs are identical to on-board m.2 slots with PCIe support (which is all modern ones). There is no USB involved.
See my other reply to you where I explain what Flush actually does (your comments about it are also completely wrong).
Your TB test sounds valid but did you verify with manufacturer that power loss protection or power failure protection works in your TB enclosure? Is that a fair assumption or do you need to ask?
I'm a systems engineer but I've never done low level optimizations on drives. How does one even go about testing something like this? It sounds like something cool that I'd like to be able to do.
My script repeatedly writes a counter value "lines=$counter" to a file, then calls fcntl() with F_FULLFSYNC against that file descriptor which on macOS ends up doing an NVMe FLUSH to the drive (after sending in-memory buffers and filesystem metadata to the drive).
Once those calls succeed it increments the counter and tries again.
As soon as the write() or fcntl() fail it prints the last successfully written counter value, which can be checked against the contents of the file. Remember: the semantics of the API and the NVMe spec require that after a successful return from fcntl(fd, F_FULLFSYNC) on macOS the data is durable at that point, no matter what filesystem metadata OR drive internal metadata is needed to make that happen.
In my test while the script is looping doing that as fast as possible I yank the TB cable. The enclosure is bus powered so it is an unceremonious disconnect and power off.
Two of the tested drives always matched up: whatever the counter was when write()+fcntl() succeeded is what I read back from the file.
Two of the drives sometimes failed by reporting counter values < the most recent successful value, meaning the write()+ fcntl() reported success but upon remount the data was gone.
Anytime a drive reported a counter value +1 from what was expected I still counted that as a success... after all there's a race window where the fcntl() has succeeded but the kernel hasn't gotten the ACK yet. If disconnect happens at that moment fcntl() will report failure even though it succeeded. No data is lost so that's not a "real" error.
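For anyone who wants to reproduce something similar, here is a minimal C sketch of that kind of loop (not the author's actual script; the file path is illustrative, and F_FULLFSYNC is macOS-specific):

    /* Minimal sketch of a write+flush torture loop (not the author's actual
     * script). macOS only: F_FULLFSYNC asks the OS to flush its own buffers
     * and then issue a FLUSH to the drive. Yank the cable/power while this
     * runs, then compare the printed value with the file's last line. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned long counter = 0;
        char buf[64];
        int fd = open("counter.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (;;) {
            int len = snprintf(buf, sizeof(buf), "lines=%lu\n", counter + 1);
            if (write(fd, buf, len) != len) break;       /* device gone? */
            if (fcntl(fd, F_FULLFSYNC) == -1) break;     /* flush failed */
            counter++;   /* counter == last value the drive claims is durable */
        }
        /* After remount, a last line with a value < counter means the drive
         * lost an acknowledged flush; counter or counter+1 is a pass. */
        printf("last successfully flushed value: lines=%lu\n", counter);
        return 0;
    }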
On very recent Linux kernels you can open the raw NVMe device and use the NVMe pass thru ioctl to directly send NVMe commands (or you can use SPDK on essentially any Linux kernel) and bypass whatever the fsync implementation is doing. That gives a much more direct test of the hardware (and some vendors have automated tests that do this with SPDK and ip power switches!). There's a bunch of complexity around atomicity of operations during power failure beyond just flush that have to get verified.
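As a rough illustration of that passthrough route (a sketch only: the device path and namespace ID are assumptions, and this sends a single Flush rather than being a full power-fail harness):

    /* Sketch: send a raw NVMe Flush via the Linux passthrough ioctl,
     * bypassing the filesystem/fsync path. Requires root; /dev/nvme0n1 and
     * nsid=1 are assumptions for illustration. */
    #include <fcntl.h>
    #include <linux/nvme_ioctl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct nvme_passthru_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x00;   /* NVM command set: Flush */
        cmd.nsid   = 1;      /* namespace to flush */

        /* The controller must not complete this until previously completed
         * writes for the namespace (and its metadata) are on durable media. */
        int err = ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
        if (err < 0)
            perror("NVME_IOCTL_IO_CMD");
        else if (err > 0)
            fprintf(stderr, "flush failed, NVMe status 0x%x\n", err);
        else
            puts("flush completed");
        close(fd);
        return err ? 1 : 0;
    }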
Is it possible the next write was incomplete when the power cut out? Wouldn't this depend on how updates to file data are managed by the filesystem? The size and alignment of disk and filesystem data & metadata blocks?
Yes, kinda. If the drive completes the flush but gets disconnected before the kernel can read the ack then I can get an error from fcntl(). In theory it's possible I could get an error from write() even though it succeeded but I don't know if that is possible in practice.
In any case the file's last line will have a counter value +1 compared to what I expected. That is counted as a success.
Failure is only when a line was written to the file with counter==N, fcntl(fd, F_FULLFSYNC, 1) reports success all the way back to userspace, yet the file has a value < N. This gives the drive a fairly big window to claim it finished flushing as the ack races back to userspace but even so two of the drives still failed. The SK Hynix Gold P31 sometimes lost multiple writes (N-2) meaning two flush cycles were not enough.
This seems like it would only work with an external enclosure setup. I wonder if a test could be performed in the usual NVMe slot.
Of course, it seems it would be much harder to pull main power for the entire PC. I'm not sure how you'd do that - maybe high speed camera, high refresh monitor to capture the last output counter? Still no guarantee I'm afraid.
If you have a host system that has reasonable PCIe hotplug support and won't panic at a device dropping off the bus, then you can just use a riser card that can control power provided over a PCIe slot.
Quarch makes power injection fixtures for basically all drive connectors, to be paired with their programmable power supply for power loss testing or voltage margin testing (quite important when M.2 drives pull 2.5+A over the 3.3V rail and still want <5% voltage drop).
There's plenty of network controlled power outlets. Either enterprise/rackmount PDUs, or consumer wifi outlets, or rig something up with a serial/parallel port and a relay. You'd use an always on test runner computer to control the power state.
The computer under test would boot from PXE, on boot read from the drive and determine the last write, send that to the test runner for analysis, then begin the write sequence and report ASAP to the test runner at each flush. The test runner turns the power off at random, waits a minute (or 10 seconds, whatever) and turns it back on and starts again.
In a well functioning system, you should often get back the last reported successful write, and sometimes get back a write beyond the last reported write (two generals and all), but never a write before the last reported write. You can't use this testing to prove correct flushing, but if you run it for a week and it doesn't fail once, it's probably not lying.
I haven't evaluated the code, but here's a post from 2005 with a link to code that probably works for this. (Note: this doesn't include the PXE booting or the power control... This just covers what to write to the disk, how to report it to another machine, and how to check the results after a power cycle.)
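In case it helps, the pass/fail rule described above boils down to something like this (a sketch; the names are illustrative and not from the linked code):

    #include <stdio.h>

    /* last_acked: highest counter the machine under test reported as flushed
     *             before power was cut.
     * on_disk:    counter read back from the drive after power-up. */
    int check_cycle(unsigned long last_acked, unsigned long on_disk)
    {
        if (on_disk >= last_acked)
            return 0;    /* OK: equal, or a write whose ack was still racing back */
        fprintf(stderr, "FAIL: acked write lost (on_disk=%lu < last_acked=%lu)\n",
                on_disk, last_acked);
        return -1;
    }

    int main(void)
    {
        check_cycle(42, 42);   /* pass: exactly the last acked write        */
        check_cycle(42, 43);   /* pass: a write raced ahead of its ack      */
        check_cycle(42, 41);   /* fail: an acknowledged write has vanished  */
        return 0;
    }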
That kind of hotplug setup is more difficult (and sometimes slower) than STONITH-style devices which just kill power to the entire machine. The latter allow you to program the whole thing and run test cycle after test cycle where the device kills itself the moment it gets a successful flush.
The problem is you can't trust a model number of SSD. They change controllers, chips, etc after the reviews are out and they can start skimping on components.
This needs to be cracked down on from a consumer protection lens. Like, any product revision that could potentially produce a different behavior must have a discernable revision number published as part of the model number.
>Like, any product revision that could potentially produce a different behavior must have a discernable revision number published as part of the model number.
AFAIK samsung does this, but it doesn't really help anyone except enthusiasts because the packaging still says "980 PRO" in big bold letters, and the actual model number is something indecipherable like "MZ-V8P1T0B/AM". If this was a law they might even change the model number randomly for CYA/malicious compliance reasons. eg. firmware updated? new model number. DRAM changed, but it's the same spec? new model number. changed the supplier for the SMD capacitors? new model number. PCB etchant changed? new model number.
That's why the examples I listed are plausible reasons for changing the model number. For firmware, it's plausible that it warrants changing the model number because firmware can and does affect performance, as other comments have mentioned.
Also I really don't see this being something that judges will stop. You see other CYA behavior that has persisted for decades, eg. drug side effects (every possible symptom under the sun), or prop 65 warnings.
Doubtful. That already happens with the "known to the state of California to cause cancer" labelling on products sold in California. Some companies just put that on everything when they have no idea if it contains those chemicals or not.
No fine print. The first line is the same font size as the second line. You think that's going to help the average joe figure out whether it's been component swapped or not? Oh, by the way, "MZ-V8P6T0B/AM" isn't the model number from the last comment. I swapped one digit. Did you catch that? If you were already on the lookout for this sort of stuff, you'd already be checking the fine print at the back. This at best saves you 3 seconds of time. It also does nothing for the "randomly changing model numbers for trivial changes" problem mentioned earlier. In short, the proposed legislation would save 1% of enthusiasts 3 seconds when making a purchase.
Actually yeah, a random joe will be able to see that 980 isn't the whole deal, and if he does care, he might dig in. Most people don't even realise that this is a possibility.
>Actually yea, a random joe will be able to see that 980 isn't the whole deal, and if he does care, might dig in.
If that works that'll probably be because of the novelty factor. Once it wears off everyone will just tune out the meaningless jumble of letters and only look at much memorable marketing name. See also: prop 65 warnings.
The PC laptop manufacturers have worked around this for decades by selling so many different short-lived model numbers that you can rarely find information about the specific models for sale at a given moment.
This does mitigate the benefit. But it still provides solid ground for a trustworthy manufacturer to step in and break the trend.
Right now if a trustworthy manufacturer kept the same hardware for an extended period of time they would not be noticed, and no one could easily tell. Because many manufacturers are swapping components within the same model number, it is poisoning the well for everyone. If the law forced model number changes then you could see that there are 20 good reviews for this exact model number while all of the other drives only have reviews for different model numbers. All of a sudden that constant model number is a valuable differentiator for a careful consumer.
True. It’s the Gish Gallop of model numbering. Fortunately, it is the preserve of the crap brands. It’s sort of like seeing “in exchange for a free product, I wrote this honest and unbiased review”. Bam! Shitty product marker! Asus GL502V vs Asus GU762UV? Easy, neither. They’re clearly both shit or they wouldn’t try to hide in the herd.
I agree that long model numbers can be used to obscure a product, but a wide product line with small variations between products sensibly encoded in the model number isn't necessarily bad for the consumer. MikroTik switches and routers have long model numbers but segments of it are interpretable once you get to know them, and the model number is also the name of the product with a discoverable product page that describes its features in detail.
Don't throw the "meaningful long model numbers"-baby out with the "intentionally opaque model numbers"-bathwater.
>Shitty product marker! Asus GL502V vs Asus GU762UV? Easy, neither. They’re clearly both shit or they wouldn’t try to hide in the herd.
Is this based on empirical evidence or something? My sample size isn't very big, but I haven't really noticed any correlation between this practice and whether the product is crappy. I just chalked this up to manufacturers colluding with retailers to stop price matches, rather than because "clearly both shit or they wouldn’t try to hide in the herd".
Years ago, that kind of behavior got Dell crossed off my list of suppliers I'd work with for clients. We had to set up 30+ machines of the exact same model number, from the same order and set of pallets -- yet there were at least 7 DIFFERENT random configurations of chips on the motherboards & video cards -- all requiring different versions of the drivers. This was before the days of auto-setup drivers. Absolute flaming nightmare. It was basically random - the different chipsets didn't even follow serial number groupings, it was just hunting around versions for every damn box on the network. Dell's driver resources & tech support, even for VARs, were worthless.
This wasn't the first incident, but after such a blatant set of quality control failures I'll never intentionally select or work with a Dell product again.
What has changed is that most of the drivers are self-updating, so they can configure that at the factory. So, yes, the basic configuration has gotten better.
What has not changed is that Dell basically does grab-bag components in their systems and just because you have two machines of the same model# and rev#, does NOT mean that you have the same two machines.
This means that Dell has substandard abilities to 1) control their supply chain, 2) track and fix problems, 3) diagnose problems in the field, 4) prevent problems.
But sure, because Microsoft is now doing a far better job of auto-configuration in its OS products, the setup problem is mitigated.
I avoid Dell like the plague, and advise everyone else to do the same. Sure, you may know someone who got away with it for a long time — there's usually some good items in a grab-bag — but the systemic reliability is just not there.
Exactly. This was not the first such trustworthiness problem I saw with Dell, only the culmination of many over the years. So there is no reason to believe that Dell has changed one iota in this regard, and they should still be regarded as untrustworthy.
While I agree with the sentiment, even a firmware revision could cause a difference in behavior and it seems unreasonable to change the model number on every firmware release.
Er, no? Whataboutism is an attempt to claim hypocrisy by drawing in something else with the same flaw. This is pointing out a way for this exact proposal to fail.
Okay, I thought it was something along the lines of arguing against a proposal that is better but not perfect because "what about this edge case". I had a colleague who was a master at this craft and managed to get many good ideas shot down this way.
It seems unreasonable to me that there is unknown proprietary software running on my storage devices to begin with. This leads to insanity such as failure to report read errors or falsely reporting unwritten data as committed. This should be part of the operating system, not some obscure firmware hidden away in some chip.
It's complicated. Nowadays we have a shortage of electronic components and it's difficult to know what will not be available next month. So it's obvious that manufacturers have to make different variants of a product that can mount different components.
That complicates a lot of things. You are basically making a new product, with all the complications (and costs) associated with it. And where do you stop? What decides whether it has an impact or not? If I change the brand of a capacitor do I need a new model? Of course not. If I change the model of a switching controller? Well, it still shouldn't change anything, but it's less obvious. If I change another integrated circuit (i.e. a small SPI flash used to hold the firmware, or an i2c temperature sensor, or a supporting microcontroller)? Likely doesn't have much impact.
I don't want to live in a world where electronic components can't be commoditized because of fundamentally misinformed regulation.
There are alternatives to interchangeable parts, and none of them are good for consumers. And that is what you're talking about - the only reason for any part to supplant another in feature or performance or cost is if manufacturers can change them
"The Consumer Financial Protection Bureau (CFPB) is an agency of the United States government responsible for consumer protection in the financial sector. CFPB's jurisdiction includes banks, credit unions, securities firms, payday lenders, mortgage-servicing operations, foreclosure relief services, debt collectors, and other financial companies operating in the United States. "
It's definitely fraud. The only reason to hide the things they do is to mislead the customer as evidenced by previous cases of this that caused serious harm to consumers.
What do you expect? These companies are making toys for retail consumers. If you want devices that guarantee data integrity for life or death, or commercial applications, those exist, come with lengthy contracts, and cost 100-1000x more than the consumer grade stuff. Like I seriously have a hard time empathizing with someone who thinks they are entitled to anything other than a basic RMA if their $60 SSD loses data
There's a big difference in this depending on why the SSD lost the data.
If it was fraudulently declaring a lack of a write-back cache despite the observed lack of crash consistency, and not just innocent negligence in firmware development, that's far different from some genuine bug in the FTL messing up the data mapping.
Personally, I expect implementing specifications properly. That's it.
About "commercial applications", let's face it. Those "enterprise solutions" cost way more not because they are 10-1000x "better", but because they contain generous "bonuses" for senior staff.
Not really a problem when your computer has a large UPS built into it. Desktop macs, not so good.
But really, isn’t the point of a journaling file system to make sure it is consistent at one guaranteed point in time, not necessarily without incidental data loss?
From the filesystem's perspective looking at storage, the flush does indeed preserve the semantics. It is just that you can't rely on the contents of anything if the power goes out.
I don't have a clue how a journaling FS works. But any ordering violation should not be observable unless you have a power outage. Can you give an example of how a journaling FS could observe something that should not be observable?
The simplest answer is that the journal size isn't infinite, and not everything goes into the journal (like often actual file data). Therefore, stuff must be removed from the journal at some point. The filesystem only removes stuff from the journal once it has a clear message from the drive that the data that has been written elsewhere is safe and secure. If the drive lies about that, then the filesystem may overwrite part of the journal that it thinks is no longer needed, and the drive may write that journal-overwrite before it writes the long-term representation. That's how you get filesystem corruption.
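A toy sketch of that ordering (all functions are illustrative stubs, not any real filesystem's API; the point is where the flush barrier has to sit):

    /* Toy model of journal checkpointing. Stub functions only illustrate
     * ordering; no real I/O happens here. */
    #include <stdio.h>

    static void write_home_copies(void)   { puts("write data to its final locations"); }
    static void flush_drive(void)         { puts("FLUSH: wait for the drive to ack durability"); }
    static void reuse_journal_space(void) { puts("overwrite/free the journal records"); }

    static void checkpoint(void)
    {
        write_home_copies();
        flush_drive();          /* barrier: home copies must be durable first */
        reuse_journal_space();  /* safe only because of the barrier above; if the
                                   drive lied about the flush and power is lost
                                   now, both copies can be incomplete -> corruption */
    }

    int main(void) { checkpoint(); return 0; }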
No, during a crash or lockup, acknowledged writes are not lost. (Because the drive has acknowledged them, they are in the drive's internal queue and thus need no further action from the OS to be committed to durable storage.) Only power loss/power cycle causes this.
Why? During a crash or lockup acked writes still reached the drive. They will be flushed to the storage eventually by the SSD controller. As long as you have power that is.
The key word is ‘eventually’. How long? Seconds, or even minutes? If your machine locks up, you turn it off and on again. If the drive didn’t bother to flush its internal caches by then, that data is lost, just as in a power failure.
That would be a power failure. Kernel crash is not equivalent to that.
System reboot doesn't entail power failure either. The disks may be powered by an independent external enclosure (e.g. JBOD). All they see is that they stopped receiving commands for a short while.
It's at least 5 seconds on Apple SSDs. Not sure if they even lazily flush at all. They might rely on the OS to periodically issue flushes (I heard launchd does that every 30 seconds).
I measured the performance of different size direct I/Os on some Samsung SSDs, and the write performance when switching from one test to another was significantly affected by the previous test parameters, or not, depending on whether a "sleep 2" was inserted between each test.
The only explanation I can think of is that the flash reorganises or commits cached data during that 2 seconds.
If you were using consumer SSDs, you have multiple layers of caching to worry about: in the controller's SRAM, and in a portion of the flash that's operating as SLC. There are also power saving modes to consider; waiting long enough for the drive to drop to a sleep state means you'll incur a wakeup penalty on the next IO.
> Not really a problem when your computer has a large UPS built into it.
Actually it is (though a small one). To name some examples where it can still lose data without a full sync:
- OS crashes
- random hard reset, e.g. due to bit flips from cosmic radiation (happens), or someone putting their magnetic earphone case or similar on your laptop.
Also any application which cares about data integrity will do full syncs and in turn will get hit by a huge perf penalty.
I have no idea why people are so adamant about defending Apple in this case; it's pretty clear that they messed up, as performance with a full flush is just WAY too low, and this affects anything which uses full flushes, which any application should at least do on (auto-)save.
The point of a journaling file system is to make it less likely that the file system _itself_ is corrupted. Not that the files are not corrupted if they don't use full sync!
I had an NVMe controller randomly reset itself a few days ago. I think it was a heat issue. Not really sure though, may be that the motherboard is dodgy.
But the thread OP is about it not being a problem that SCSI-level flushes are super slow, which is only not a problem if you don't do them (e.g. only use fsync on Mac)?
But reading it again there might have been some confusion about what was meant.
Hard drive write caches are supposed to be battery-backed (i.e., internal to the drive) for exactly this reason. (Apparently the drives tested are not.) Data integrity should not be dependent on power supply (UPS or not) in any way; it's unnecessary coupling of failure domains (two different domains nonetheless -- availability vs. integrity).
The entire point of the FLUSH command is to flush caches that aren't battery backed.
Battery-backed drives are free to ignore such commands. Those that aren't need to honor them. That's the point.
Battery- or capacitor-backed enterprise drives are intended to give you more performance by allowing the drive and indeed the OS to elide flushes. They aren't supposed to give you more reliability if the drive and software are working properly. You can achieve identical reliability with software that properly issues flush requests, assuming your drive is honoring them as required by the NVMe spec.
You said caches should be battery backed, implying that it's wrong for them not to be. I'm saying FLUSH is what you use to maintain data integrity when caches are not battery backed, which is a perfectly valid use case. Modern drives are not expected to have battery backed caches; instead the software knows how to ask them to flush to preserve integrity. We've traded off performance to make up the integrity.
The problem is these drives don't provide integrity even when you explicitly ask them to.
As a systems engineer, I think we should be careful throwing words around like “should”. Maybe the data integrity isn’t something that’s guaranteed by a single piece of hardware but instead a cluster or a larger eventually consistent system?
There will always be trade-offs to any implementation. If you're just using your M.2 SSD to store games downloaded off Steam I doubt it really matters how well they flush data. However if your financial startup is using them without an understanding of the risks and how to mitigate them, then you may have a bad time.
The OS or application can always decide not to wait for an acknowledgement from the disk if it's not necessary for the application. The disk doesn't need to lie to the OS for the OS to provide that benefit.
Accidental drive pulls happen -- think JBODs and RAID. Ideally, if an operator pulls the wrong drive, and then shoves it back in in a short amount of time, you want to be able to recover from that without a full RAID rebuild. You can't do that correctly if the RAID's bookkeeping structures (e.g. write-intent bitmap) are not consistent with the rest of the data on the drive. (To be fair, in practice, an error arising in this case would likely be caught by RAID parity.)
Not saying UPS-based integrity solutions don't make sense, you are right it's a tradeoff. The issue to me is more device vendors misstating their devices' capabilities.
It doesn't need to, kernel panic alone does not cause acknowledged data not to be written to the drive.
UPS is not perfect though, it's better if your data integrity guarantees are valid independent of power supply. All that requires is that the drive doesn't lie.
> Not really a problem when your computer has a large UPS built into it.
Except that _one time_ you need to work until the battery fails to power the device, at 8%, because the battery's capacity is only 80%. Granted, this is only after a few years of regular use...
In Apple's defense, in the laptop form factor they probably have enough power even in the worst case to limp along long enough to flush, even if the power management components refuse to power the main CCXs. Speccing out enough caps in the desktop case would be very Apple as well.
Apple do not have PLP on their desktop machines (at least not the Mac Mini). I've tested over 5 seconds of written but not FLUSHed data loss, and confirmed via hypervisor tracing that macOS doesn't do anything when you yank power. It just dies.
You can design your FTL to be resilient to arbitrary power loss, as long as the NAND chips don't physically go off corrupting unrelated data on power down. That only requires extremely minimal capacitance. I believe Apple SSDs do have some of that in the NAND PMIC next to the storage chips themselves; it probably knows to detect falling voltage rails and trigger a stop of all writes to avoid any actual corruption due to out-of-spec voltages.
I've absolutely heard storage vendors talking about protecting just the FTL during power loss as PLP. You could have an FTL where any writes are atomic, but that gets in the way of throughput practically. The storage vendors don't seem to generally be on board that tradeoff except for 'industrial' branded SKUs that also make throughput tradeoffs anyway.
That's not necessarily true, it just comes down to the design. Apple seem to use a log-structured FTL which works great for performance and just requires a (very fast) log replay after a hard shutdown. You can see the syslog messages from the NVMe controller (via RTKit) talking about rebuilding the table when this happens.
Micron also talks about power-loss-resistant FTL design in their whitepaper, and although their older SSDs had caps, I think their recent ones mostly do away with them entirely.
The Micron whitepaper you cited talks about how their higher-tier strategy involves keeping enough capacitance around to write out the FTL, because its DRAM copy is allowed to get out of sync when the write cache is enabled.
> The hold-up circuitry also preserves enough time and energy to ensure that the FTL addressing table is properly saved to the NAND. This thorough amount of data protection not only ensures data integrity in unexpected power-loss events, but it also enables the system designer to leave the SSD’s write cache enabled, giving a significant advantage in data throughput speeds.
It's the other way around. The PMIC cuts the main system off at a certain voltage, and even in the worst case you have the extra watt-seconds to flush everything at that point.
I'm really hoping the battery has a low-voltage cutoff... I guess the question is: does the battery cut the power, or does the laptop? In the latter case, this may be "ok" for some definition of ok. The former, there's probably not enough juice to do anything.
> does the battery cut the power, or does the laptop?
Last time I checked (and I could very well be out of date on this), there wasn't really a difference. It wasn't like an 18650 where the cells themselves have protection, but a cohesive power management subsystem that managed the cells more or less directly. It had all the information to correctly make such choices (but, you know, could always have bugs, or it's a metric that was never a priority, etc.).
Batteries can technically be used until their voltage is 0 (it would be hard to get any useful current below ~1 volt for lithium cells, but still). The cutoff is either due to the BMS (battery management system) cutting off power to protect the cells from permanent damage or because the voltage is just too low to power the device (but in that case there's still voltage).
Running lithium cells under 2.7v leads to permanent damage. But, I'm sure laptops have a built-in safety margin and can cut off power to the system selectively. That's why you can still usually see some electronics powered (red battery light, flashing low battery on a screen, etc) even after you "run out" of battery.
I've never designed a laptop battery pack, but from my experience in battery packs in general, you always try to keep safety/sensing/logging electronics powered even after a low voltage cutoff.
Even in very cheap commodity packs that are device agnostic, the basic internal electronics of the battery itself always keep themselves powered even after cutting off any power output. Laptops have the advantage of a purpose built battery and BMS so they can have a very fine grained power management even at low voltages.
Batteries don't die instantly; you always have a few seconds where voltage is dropping fast and can take emergency action (shut down unnecessary power consumers, flush NVMe) before it's too late. Weak batteries die by having higher internal resistance; by lowering power consumption, you buy yourself time.
I don't know if macOS explicitly does this, but it does try to hibernate when the battery gets low, and that might be conservative enough to handle all normal cases.
That's why I gave the death at 8%. I imagine the OS controls the hibernation, so it probably won't think that is low enough to trigger hibernation, just a warning. My current laptop will die at 70%, and I have hibernation triggering at 80% just to be safe (it lasts about 10 minutes after being unplugged). From experience, when the power dies it just turns off. The OS has no idea the battery is about to die, and when rebooting it requires an fsck (or whatever it's called) to get back to normal. So I assume there is data loss.
Also, thanks for all you're doing with Linux on the new macs. I love your blog posts!
More complex systems are liable to create more complex problems...
I don't think you can get away from this - yes, you can solve a problem, but if you model problems as entropy, increasing complexity increases entropy.
It's like the messy room problem - you can clean your room (arguably ending up in a lower-entropy state), but unless you are exceedingly careful, doing so increases entropy overall. You merely move the mess to the garbage bin, expend extra heat, increase the consumption in your diet, possibly break your ankle, stress your muscles.
As a frame of reference, how much loss of FLUSH'd data should be expected on power loss for a semi-permanent storage device (including spinning-platter hard drives, if anyone still installs them in machines these days)?
I'm far more used to the mainframe space where the rule is "Expect no storage reliability; redundancy and checksums or you didn't want that data anyway" and even long-term data is often just stored in RAM (and then periodically cold-storage'd to tape). I've lost sight of what expected practice is for desktop / laptop stuff anymore.
The semantics of a FLUSH command (per NVMe spec) is that all previously sent write commands along with any internal metadata must be written to durable storage before returning success.
Basically the drive is saying "yup, it's all on NAND - not in some internal buffer. You can power off or whatever you want, nothing will be lost".
Some drives are doing work in response to that FLUSH but still lose data on power loss.
A flush command only guarantees, upon completion, that all writes COMPLETED prior to submission of the flush are non-volatile. Not all previously sent writes. NVMe base specification 2.0b section 7.1.
That's a very important distinction. You can't assume just because a write completed before the flush that it's actually durable. Only if it completed before you sent the flush.
I'm not very confident that software is actually getting this right all that often, although it probably is in this fsync test.
Nope. The reason this is so complex is that these devices are actually highly parallel machines with multiple queues accepting commands. It's quite difficult to even define "before" in terms of command sequence. For example, if you have a device with two hardware queues for submitting commands and a software thread for each, if you submit a flush on one queue, which commands on the other queue does it affect?
Or what if the device issues a pci write to the completion entry that passes a flush command being submitted on the wire?
I think the only interpretation that makes sense is from the perspective of a single software thread. If that particular thread has seen the completion via any mechanism and then that thread issues the flush, then you know the write is durable. Other than that, the device makes no promises.
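The same discipline at the OS/block layer looks roughly like this with io_uring (a sketch assuming Linux with liburing installed; the path is illustrative and most error handling is trimmed):

    /* Sketch: treat a write as covered by a flush only if its completion was
     * seen before the flush was submitted. Linux + liburing assumed. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        static char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = open("testfile", O_WRONLY | O_CREAT, 0644);  /* illustrative path */
        io_uring_queue_init(8, &ring, 0);

        /* 1. Submit the write and WAIT for its completion... */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);     /* check cqe->res in real code */
        io_uring_cqe_seen(&ring, cqe);

        /* 2. ...and only then submit the flush. A flush already in flight
         * when the write completed is not guaranteed to cover it. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }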
As far as I understand (which is little more than this twitter thread) the flush command should only return a success response once any data has been written to non-volatile storage.
If the storage still requires power after that point to maintain the data, that storage area is volatile, no?
So if the device has returned success (and I'm not going to claim that they've ensured that it was the device returning success and not the adapter, or that they even verified what the response was - those seem like valid questions) presumably the power wind-down should not be an issue?
That said, I presumed by "disconnect the cable" the test involved some extension cable from the motherboard straight to drive to make it easier to disconnect - would that therefore make it a valid test of the NVMe?
You understand incorrectly. Flush means data is in volatile DRAM in the device. That's how SSDs work.
Extension cable from motherboard would certainly make it invalid. These devices are not hot swap and may expect power hold up and sequencing from the supply.
You're wrong. Please read the NVMe spec, section 7.1:
> the Flush command shall commit data and metadata associated with the specified namespace(s) to non-volatile media
It has nothing to do with DRAM. The spec is explicit about what Flush does. The drive is not allowed to ack the flush until the data and drive metadata are on durable storage.
For a discussion on what happens when the drive has power loss protection, see 5.24.2.1 - in that case the write cache (aka DRAM) is considered non-volatile and Flush can be a no-op. None of the drives I tested fell into that category however.
I agree with all that. Thanks for your knowledge. I am not a storage or driver engineer so I apologize for imprecise comments. Let's take a peek at section 5.2x.x (depends on spec version):
"Note: If the controller is able to guarantee that data present in a write cache is written to non-volatile media on loss of power, then that write cache is considered non-volatile and this feature does not apply to that write cache"
You would likely be more in a position to raise the issue with OEMs but I believe they will claim they have power loss protection with heavy asterisks attached and that is why they ack your fcntl FLUSH while not actually flushing out of the [actually volatile] DRAM.
I don't know why a guarantee is not actually a guarantee and why there may actually be gradations of PLP. Sounds fishy but that is one possible explanation for what you are seeing. Do you know? Anyway if you want the real PLP guarantee you have to get a drive where the documentation specifies backup capacitors and preferably oscilloscope traces. Ultrastar drives have such info, for example.
Laptops, especially the likes of macOS machines with a T2 chip, in which all I/O goes through the T2, can do some clever things. They can essentially turn the underlying NVMe SSD into battery-backed storage. Even if the OS on the main CPU crashes and dies, the T2 chip with its own independent OS can ensure the SSD does a full flush before the battery runs out of power. Now, I don't know if Apple does this, but I sure hope they do. It would be great if they published the details so that even Linux on a MacBook can do this well.
SK Hynix Gold P31 2TB SHGP31-2000GM-2, FW 31060C20
Sabrent Rocket 512 (Phison PH-SBT-RKT-303 controller, no version or date codes listed)
I've ordered more drives and will report back once I have results:
Intel 670p
Samsung 980
WD Black SN750
WD Green SN350
Kingston NV1
Seagate Firecuda 530
Crucial P2
Crucial P5 Plus
These are just my results in my specific test configuration, done by me personally for fun in my own time. I may have made mistakes or the results might be invalid for reasons not yet known. No warranties expressed or implied.
Flush performance varies by 6x and is not necessarily correlated with overall perf or price. If you are doing lots of database writes or other workloads where durability matters don't just look at the random/sustained read/write performance!
High flush perf: Crucial P5 Plus (fastest) and WD Red
Despite being a relatively high end consumer drive the Seagate had really low flush performance. And despite being a budget drive the WD Green was really fast, almost as good as the WD Red in my test.
The SK Hynix drive had fast flush perf at times, then at other times it would slow down. But it sometimes lost flushed data so it doesn't matter much.
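If you want a rough feel for flush performance on your own drive, here's a minimal sketch (Linux shown; per the rest of this thread you'd use fcntl(fd, F_FULLFSYNC) instead of fsync() on macOS; numbers won't match raw NVMe FLUSH rates but track the same behaviour):

    /* Rough flush-rate micro-benchmark: count how many 1-byte write + fsync
     * cycles complete in ~10 seconds. File path is illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("flushbench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        unsigned long flushes = 0;
        char byte = 'x';

        do {
            if (pwrite(fd, &byte, 1, 0) != 1) { perror("pwrite"); break; }
            if (fsync(fd) != 0)               { perror("fsync");  break; }
            flushes++;
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while (now.tv_sec - start.tv_sec < 10);

        printf("%lu flushes in ~10s (%.1f per second)\n", flushes, flushes / 10.0);
        close(fd);
        return 0;
    }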
No surprise here. I've never encountered a redundancy feature in storage that worked. Power failure, drive controller failure, connection failure - and data is kaput. Regardless of what was promised.
How can it be so bleak? Can it be that nobody's data redundancy is real? Sure. If you don't test it, regularly, then by the hoary rules of computing it doesn't work.
How do these NVMe SSDs fare when setting the FUA or Force Unit Access bit for write through on Linux (O_DIRECT | O_DSYNC) instead of F_FULLFSYNC on macOS?
I imagine that different firmware machinery would be activated for FUA, and knowing whether FUA works properly would provide comfort to DB developers.
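I don't have numbers, but the Linux side of that test would look roughly like this (a sketch: whether the kernel turns this into FUA writes or write+flush depends on what the device advertises, and the file path, block size, and alignment are assumptions):

    /* Sketch of a write-through loop: O_DIRECT bypasses the page cache and
     * O_DSYNC makes each write durable before it returns (via FUA or an
     * implicit flush, depending on the device). */
    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) return 1;  /* O_DIRECT alignment */

        unsigned long counter = 0;
        for (;;) {
            memset(buf, 0, 4096);
            snprintf(buf, 4096, "lines=%lu\n", counter + 1);
            if (pwrite(fd, buf, 4096, 0) != 4096) break;
            counter++;   /* on a conformant drive this value is now durable */
        }
        printf("last write-through value: lines=%lu\n", counter);
        free(buf);
        close(fd);
        return 0;
    }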
Aren't there filesystem options that affect this? Wasn't there a whole controversy over ext4 and another filesystem not committing changes even after flush (under specific options/scenarios)?
The ext4 "issue" was in userspace. Certain software wasn't calling fsync() and had assumed the data was written anyway, because ext3 was more forgiving.
IIRC the "solution" was to give ext4 the ext3 semantics, i.e. not insist that every broken userspace program needed to be fixed.
This is an old topic, but I disagree that any program that doesn't grind your disk into dust is broken. Consistency without durability is a useful choice.
It is, but those programs are definitely "broken" if they expect any sort of durability guarantees, which was the problem with the programs that ran into problems with early versions of ext4.
I.e. it's fine to open(), write() and close() without an appropriate fsync() (note that you'll also need to fsync() the relevant directory entry or entries, which most people get wrong).
It's not fine to do so and come complaining to the OS maintainers when the kernel lost your data, you didn't tell it you wanted the data flushed to disk, and you didn't wait around for that to happen until you reported success to the user.
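For reference, the full dance being described looks roughly like this (a sketch; paths are illustrative, and on macOS you'd want F_FULLFSYNC instead of plain fsync() per the rest of this thread):

    /* Durably create a file: fsync the file contents AND the containing
     * directory, since the new directory entry lives in the directory's data. */
    #define _GNU_SOURCE   /* for O_DIRECTORY on older toolchains */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int durable_create(const char *dirpath, const char *filepath,
                              const char *data, size_t len)
    {
        int fd = open(filepath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }     /* file contents */
        close(fd);

        int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        if (fsync(dfd) < 0) { close(dfd); return -1; }   /* directory entry */
        close(dfd);
        return 0;
    }

    int main(void) { return durable_create(".", "./new.txt", "hello\n", 6) ? 1 : 0; }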
If you don't want to deal with any of that there's a widely supported way to get around it: Just mount your filesystems with the "sync" option, or run sync(1) after your program finishes, but before reporting success.
You'll grind your "disks into dust", but you'll have data safety, sans any HW or kernel issues being discussed in this thread. But hey, consumer hardware sucks. News at 11! :)
All that being said we live in the real world. IIRC the ext4 issue was that the delay for the implicit sync was changed from single-digit to double-digit seconds.
So people were experiencing data loss due to API (mis)use that they were getting away with before. After the "do it like ext3" change they might still be, it's just that the implicit sync window was narrowed again.
There's simply no way to avoid caring about any of this while still having some items from the "performance" column as well as the "data safety" column.
All of modern computing is structured around these trade-offs, even SSD I/O is glacially slow compared to the CPU throughput.
You need to juggle performance and data safety to get any reasonable I/O throughput, just like you can't have a reasonably performing OS without something like the OOM killer, "swap of death" or similar.
You can always opt-out of it, but it means mounting your FS with sync, replacing your performant kernel with a glacially slow (but guaranteed to be predictable) RTOS etc.
He is trusting consumer grade devices which don't have power loss protection by design. That is a "feature" for enterprise devices so they can increase the price for datacenter usage.
This has nothing to do with PLP. If the drive reports PLP then Flush is allowed to be a no-op because all acknowledged writes are durable by design - the OS need only wait for the data write and FS metadata writes to complete without needing to issue a special IO command. This is covered in 5.24.1.4 in the NVMe spec 2.0b
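For what it's worth, you can check which case a given drive claims to be in by reading the Volatile Write Cache bit from Identify Controller (byte 525, bit 0). A sketch, assuming Linux, root, and a /dev/nvme0 controller node:

    /* Sketch: issue Identify Controller (admin opcode 0x06, CNS=1) and
     * report whether the controller claims a volatile write cache. If it
     * doesn't, Flush may legitimately be a no-op. */
    #include <fcntl.h>
    #include <linux/nvme_ioctl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char id[4096];
        int fd = open("/dev/nvme0", O_RDONLY);   /* controller character device */
        if (fd < 0) { perror("open"); return 1; }

        struct nvme_admin_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode   = 0x06;                              /* Identify */
        cmd.cdw10    = 1;                                 /* CNS=1: controller */
        cmd.addr     = (unsigned long long)(uintptr_t)id;
        cmd.data_len = sizeof(id);

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0) { perror("identify"); return 1; }

        printf("volatile write cache present: %s\n",
               (id[525] & 1) ? "yes (Flush must do real work)"
                             : "no (Flush may be a no-op)");
        close(fd);
        return 0;
    }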
He is trusting that drives are conformant to their specs. This is an issue of non-conformance that increases marketable performance at the cost of data security. PLP is great, but in lieu of that the drives should be honest about the state of writes. How can you trust your data will be there after an ACPI shut down?
Quote: "Then I manually yanked the cable. Boom, data gone."
He never said what cable he yanked. If he did it to the internal one that goes from the PSU to the drive then that's a very niche test, not relevant. But if he did it to the one from the wall socket to the PC then yeah, that's a good test.
Regarding this, first of all Raymond Chen warned us more than a decade ago that vendors lie all the time about their hardware capabilities. He had one test on exactly this flushing behavior, where the HDD driver (written by the manufacturer, of course) was always returning S_OK regardless. Do note that this was in a time when HDDs were common, not SSDs.
Secondly, I always buy a PSU rated for more than double the system's power requirement. If, let's say, the system has a 500W requirement, my PSU for that system will be at least 1200W. That will definitely have big capacitors to keep your system alive for a couple of seconds after the power goes off. Those 2 seconds might seem small to us, but the drives, lying as they do to the OS, will still correctly flush pending data. I've never experienced data loss going this route.
I'm confident this is a means to cheat performance benchmarks, because some of those will run tests using those flushes to try and get 'real' hardware performance instead of buffered/cached performance.
I wonder if a small battery or capacitor on these devices would work to avoid data loss.
Surprised this is being rediscovered. Years ago only "enterprise" or "data center" SSDs had the supercap necessary to provide power for the controller to finish pending writes. Did we ever expect consumer SSDs to not lose writes on power fail?
As I said in the recent Apple discussion, pretty much all drives are lying and have been for decades at this point. The good brands just spec out enough capacitance that you don't see the difference externally.
The more you look at speculative execution and these drive issues, the more you see that we're giving up a lot of what makes computing "safe" just for performance.
So if SSDs rely solely on capacitors for data integrity and lie about flushes, what do they do on a flush that takes any amount of time? Are they just taking a speed hit for funsies? Heck, from this test, the magnitude of the speed hit isn't even correlated with whether they lose writes...
At one point it was different barriers on the different submission queues inside the drive. Not externally visible queues, but between internal implementation layers.
It's been a few years since I've checked up on this and it was for the most part pre SSDs though.
Never said they were, and you don't need several seconds of power buffer for this purpose. They might just as well serve a different purpose; it's just an observation. Either explain why you definitely need farad-range capacitance for this or don't comment. They are rated at 16V, so take that into account in your explanation.
This is not about pulling power during writes. Flush is supposed to force all non-committed (i.e. cached) writes to complete. Once that has been acknowledged there is no need for any further writes. So those drives are effectively lying about having completed the flush. I also have to wonder when they intended to write the data...
I’m actually interested in testing this scenario, a drive getting power loss. Is there a thing which will cut power to a server device on command? Or do you just pull the drive out of its bay?
If your server has IPMI or similar, you can use that to cut power (not to the BMC though), otherwise, network PDUs or many UPSes or consumer level networked outlets or something fun with a relay (be careful with mains voltages).
Pulling the drive is also worth testing though; you might get different results. Requires more human involvement though.
> The models that never lost data: Samsung 970 EVO Pro 2TB and WD Red SN700 1TB.
The others would probably be SK Hynix and Micron/Crucial, right? Curious why he's reluctant to name and shame. A drive not conforming to requirements and losing data is a legitimate problem that should be a "thing"!
Crucial seems plausible, but there's a surprising number of US brands for NVMe SSDs. I was able to find: Crucial, Corsair, Seagate, Plextor, Sandisk, Intel, Kingston, Mushkin, PNY, Patriot Memory, and VisionTek.
Micron/Crucial is the 3rd largest manufacturer of flash memory, most of the other brands in your list just make the PCB and whitelabel other flash chips and controllers (perhaps with some firmware customization, but they're usually not responsible for implementing features like FLUSH).
Toshiba/Kioxia is another big one, but they're based in Japan. The US brand could be Intel instead of Crucial, I suppose.
Looks like he works at Apple. Maybe what he's testing is work related or covered by some sort of NDA (e.g. doesn't want to risk harming supplier relations for the brands misbehaving)
I thought Crucial specifically designed some power loss protection as a differentiating selling point? Well at least that was the reason why I bought one back in M.2 days (gosh my PC is ancient...)
I think the most they ever promised for some of their consumer drives was that a write interrupted by a power failure would not cause corruption of data that was already at rest. Such a failure can be possible when storing multiple bits per memory cell but programming the cells in multiple passes, especially if later passes are writing data from an entirely different host write command.
It changed [1] from capacitors and "power loss protection" to something else described as "power loss immunity" with the MX500. I don't think I've ever seen it explained very well.
> With the release of the MX500, Crucial has included a new replacement for the traditional power loss protection feature, power loss immunity. Instead of relying on a bank of capacitors for power loss protection, Crucial was able to work the new 3D TLC NAND and the code to allow for more efficient NAND programming so that the capacitors are no longer needed.
That's just a regurgitated press release IMO.
A lot of consumer drives also stopped reporting DRAT/RZAT [2] around the Crucial MX500, Samsung 850 timeframe. They swap internals as others in this thread have pointed out and the write endurance has dropped since reviewers stopped reporting on it. I have a Crucial MX500 in my system right now with 11% life remaining and only 37TBW even though it's advertised as having 180TBW of endurance.
Edit: I actually found [3] an explanation of "power loss immunity".
> The impact is still the same: you don't get the full protection that is standard for enterprise SSDs, but data that has already been written to the flash will not be corrupted if the drive loses power while writing a second pass of more data to the same cells.
I always thought write operations on SSDs were more or less to write a new page or block or whatever the terminology is and to flag the old one(s) for garbage collection. I don't understand how it would be possible to lose the old data by doing that. Did they just invent a term that sounds like power loss protection, but doesn't actually do anything special?
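For reference, here's roughly the model I have in my head: a toy page-mapped sketch in C, not any real controller's FTL.

    #include <string.h>

    #define N_LOGICAL  16
    #define N_PHYSICAL 64
    #define PAGE_SIZE  4096

    static unsigned char flash[N_PHYSICAL][PAGE_SIZE]; /* stand-in for raw NAND */
    static int l2p[N_LOGICAL];           /* logical block -> physical page map  */
    static int page_state[N_PHYSICAL];   /* 0 = free, 1 = live, 2 = stale (GC)  */
    static int next_free = 0;

    void ftl_init(void) {
        memset(l2p, -1, sizeof l2p);     /* no logical block is mapped yet */
    }

    /* Remap-on-write: new data always lands in a fresh page, the map is
     * updated, and only then is the old page marked stale for GC. In this
     * model a write can never hurt data already at rest. The catch, per
     * the comments above, is that real TLC NAND may program the same
     * physical cells in multiple passes, so an interrupted later pass can
     * damage bits written earlier -- apparently that's the gap "power
     * loss immunity" is meant to close. */
    void write_logical(int lba, const unsigned char *data) {
        int new_page = next_free++;               /* naive allocator, no GC */
        memcpy(flash[new_page], data, PAGE_SIZE); /* "program" the new page */
        page_state[new_page] = 1;                 /* new copy is live       */
        if (l2p[lba] >= 0)
            page_state[l2p[lba]] = 2;             /* old copy becomes stale */
        l2p[lba] = new_page;                      /* remap                  */
    }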
In my opinion, as a consumer, this is up to you. If you need this, get a UPS battery backup (or a laptop which has its own battery). Or, you can get a super specialized SSD. Ultimately though, most consumer SSDs DON’T need this feature. And if they did include it by default, it would likely be environmentally questionable for a feature most people will never use (because most consumer SSDs these days go into laptops with their own batteries).
You didn't understand the issue. It's not that these drives lose data with sudden power loss. It's that you tell the drive "please write all data that is currently in your write cache to persistent storage now" and then the drive says "ok I'm all done, your data is safe" and then when you cut power AFTER this, your data is still sometimes gone. This has nothing to do with any batteries, or complicated technology. It just means make your drive not lie.
Correct me if I'm wrong, but if these drives are used for consumer applications, this behavior is probably not a big deal? If you made changes to a document, pressed control-S, and then 1 second later the power went out, you might lose that last save. That'd suck, but you would have lost the data anyway if the power loss had occurred 2 seconds earlier, so it's not that bad. As long as other properties weren't violated (e.g. ordering), your data should mostly be okay, aside from that 1 second of writes. It's a much bigger issue for enterprise applications, e.g. a bank's mainframe responsible for processing transactions told a client that a transaction went through, but a power loss occurred and the transaction was lost.
Modern SSDs, and especially NVMe drives, have extensive logic for reordering both reads and writes, which is part of why they perform best at high queue depths. So it's not just possible but expected that the drive will be reordering the queue. Also, as batteries age, it becomes quite common to lose power without warning while on a battery.
In general it's strange to hear excuses for this behavior since it's obviously an attempt to pass off the drive's performance as better than it really is by violating design constraints that are basic building blocks of data integrity.
>Modern SSDs, and especially NVMe drives, have extensive logic for reordering both reads and writes, which is part of why they perform best at high queue depths. So it's not just possible but expected that the drive will be reordering the queue.
If we're already in speculation territory, I'll further speculate that it's not hard to have some sort of WAL mechanism to ensure the writes appear in order. That way you can lie to the software that the writes made it to persistent memory, but still have consistent ordering when there's a crash.
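Something like this is what I'm picturing -- a toy C sketch of the general WAL idea, not a claim about how any actual firmware works:

    #include <stddef.h>
    #include <stdint.h>

    struct wal_record {
        uint64_t seq;            /* monotonically increasing sequence number */
        uint32_t len;            /* payload length                           */
        uint32_t checksum;       /* covers the payload                       */
        uint8_t  payload[4096];
    };

    static uint32_t fletcher32(const uint8_t *p, size_t n) {
        uint32_t a = 0, b = 0;
        for (size_t i = 0; i < n; i++) {
            a = (a + p[i]) % 65535;
            b = (b + a) % 65535;
        }
        return (b << 16) | a;
    }

    /* Recovery: a record counts only if its sequence number is the next
     * one expected and its checksum matches. Stop at the first gap or
     * torn record, so whatever gets replayed is a strict prefix of what
     * was acknowledged -- the tail may be lost, but write N+1 is never
     * applied without write N. */
    size_t wal_replay(const struct wal_record *log, size_t count) {
        uint64_t expect = 1;
        for (size_t i = 0; i < count; i++) {
            if (log[i].len > sizeof log[i].payload)
                break;
            if (log[i].seq != expect ||
                fletcher32(log[i].payload, log[i].len) != log[i].checksum)
                break;
            /* apply(&log[i]) would go here */
            expect++;
        }
        return (size_t)(expect - 1);  /* number of records safely applied */
    }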
>Also, as batteries age, it becomes quite common to lose power without warning while on a battery.
That's... totally consistent with my comment? If you're going for hours without saving and only save when the OS tells you there's 3% battery left, then you're already playing fast and loose with your data. Like you said yourself, it's common for old laptops to lose power without warning, so waiting until there's a warning to save is just asking for trouble. Play stupid games, win stupid prizes. Of course it doesn't excuse their behavior, but I'm just pointing out that for the typical consumer, the actual impact isn't as bad as people think.
> As long as other properties weren't violated (eg. ordering), your data should mostly be okay, aside from that 1s of data.
That's the thing though: ordering isn't guaranteed as far as I remember. If you want ordering, you do syncs/flushes, and if the drive isn't respecting those, then ordering is out the window. That means FS corruption and such. Not good.
The tweet only mentioned data loss when you yanked the power cable. That doesn't say anything about whether the ordering is preserved. It's possible to have a drive that lies about data written to persistent storage, but still keeps the writes in order.
> If you made changes to a document, pressed control-S, and then 1 second later the power went out, then you might lose that last save.
If you made changes to a document, pressed control-S, and then 1 second later the power went out, then the entire filesystem might become corrupted and you lose all data.
Keep in mind that small writes happen a lot -- a lot a lot. Every time you click a link in a web page it will hit cookies, update your browser history, etc., all of which triggers writes to the filesystem. If one of those writes modifies the superblock, a FLUSH is ignored while the superblock is in a temporarily invalid state, and the power goes out, you may completely hose your OS.
Nope, the problem here is that it violates a very basic ordering guarantee that all kinds of applications build on top of. Consider all the setups with these hybrid drives, or just multiple drives, where you fsync on one to journal that you did something on the other (e.g. Steam storing actual games on another drive).
This behavior will cause all kinds of weird data inconsistencies in super subtle ways.
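Concretely, the cross-drive pattern looks something like this (a minimal C sketch; the paths and the install_step() name are made up):

    #include <fcntl.h>
    #include <unistd.h>

    /* The journal entry on drive A must not become durable before the
     * payload on drive B, and fsync is the only barrier the application
     * has. If either drive acknowledges the flush without honoring it,
     * the two drives can tell different stories after a power cut. */
    int install_step(void) {
        /* 1. write the payload on drive B and make it durable */
        int data_fd = open("/mnt/driveB/game/asset.pak",
                           O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (data_fd < 0) return -1;
        if (write(data_fd, "...payload...", 13) != 13) return -1;
        if (fsync(data_fd) < 0) return -1;              /* barrier #1 */
        close(data_fd);

        /* 2. only then record "asset.pak is complete" on drive A */
        int log_fd = open("/mnt/driveA/install-journal.log",
                          O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (log_fd < 0) return -1;
        if (write(log_fd, "asset.pak done\n", 15) != 15) return -1;
        if (fsync(log_fd) < 0) return -1;               /* barrier #2 */
        close(log_fd);
        return 0;
    }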
> As long as other properties weren't violated (eg. ordering)
That is primarily what fsync is used to ensure. (SCSI provides other means of ensuring ordering, but AFAIK they're not widely implemented.)
EDIT: per your other reply, yes, it's possible the drives maintain ordering of FLUSHed writes, but not durability. I'm curious to see that tested as well. (Still an integrity issue for any system involving more than just one single drive though.)
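A test along those lines seems doable: keep appending fsync'd sequence numbers and report each acknowledged one out-of-band, then after the cut compare the highest acknowledged number against what's actually on disk. A missing number below the on-disk maximum would also show reordering. A rough C sketch, assuming you have an external way to cut power and a made-up mount point for the drive under test:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* hypothetical mount point for the drive under test */
        int fd = open("/mnt/testdrive/seq.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (uint64_t seq = 1; ; seq++) {
            char buf[32];
            int n = snprintf(buf, sizeof buf, "%llu\n", (unsigned long long)seq);
            if (write(fd, buf, n) != n) { perror("write"); return 1; }
            if (fsync(fd) < 0)          { perror("fsync"); return 1; }

            /* Only after fsync returns is seq considered "acknowledged".
             * Report it out-of-band (serial console, ssh, a second
             * machine) so the record of acknowledgements survives the
             * power cut. */
            printf("acked %llu\n", (unsigned long long)seq);
            fflush(stdout);
        }
    }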
> That'd suck, but you would have lost the data anyways if the power loss occurred 2s before,
But if you knew power was failing, which is why you did the ^S in the first place, it wouldn't just suck; it would be worse than that, because your expectations were shattered.
It's all fine and good to have the computers lie to you about what they're doing, especially if you're in on the gag.
But when you're not, it makes the already confounding and exasperating computing experience just that much worse.
Go back to floppies; at least you know the data is saved when the disk stops spinning.
>But if you knew power was failing, which is why you did the ^S in the first place, it wouldn't just suck; it would be worse than that, because your expectations were shattered.
The only situation I can think of where this is applicable is a laptop running low on battery. Even then, my guess is that there's enough variance in battery chemistry/operating conditions that you're already playing fast and loose with your data if you're saving when there are only a few seconds of battery left. I agree that having it not lose data is objectively better than having it lose data, but that's why I characterized it as "not a big deal".
Contractor: "Hi, we need to kill the power to the house now."
Me: "Oh, ok, let me shut down my computer."
And everything I've been reading lately says there's simply nothing safe about this. How is shutting down a computer ever safe now? How long do we have to wait to ensure our data is flushed correctly, by everything?
>Contractor: "Hi, we need to kill the power to the house now."
>Me: "Oh, ok, let me shut down my computer."
And they're killing the power 0.1 seconds after you told them the computer was shut down? If the drive only loses the last 2 or 3 seconds of writes, you'll be fine.
>And everything I've been reading lately says there's simply nothing safe about this. How is shutting down a computer ever safe now? How long do we have to wait to ensure our data is flushed correctly, by everything?
That's a good point. Maybe the drives still get some residual power from the motherboard even when the computer is "off", and that's enough to finish the writes?