NVMe is not a hard disk (koehntopp.info)
248 points by omnibrain on June 11, 2021 | 206 comments



"most database-like applications do their redundancy themselves, at the application level, so that...storage replication in distributed storage...is not only useless, but actively undesirable"

I do believe that's been the author's experience. However, I think he may be unaware that that's not everyone's, or even most people's, experience.


My biggest takeaway from this is "I certainly hope this person isn't responsible for actually designing the bare metal storage infrastructure meant to underlie something important". They seem to be operating from the premise that data replication at the 'cloud' level can solve all their problems.


Not sure what to say, but this is how it works on all the large systems I'm familiar with.

Imagine you have two servers, each with two 1TB disks (4TB of physical storage). And you have two distinct services with 1TB datasets, and want some storage redundancy.

One option is to put each pair of disks in a RAID-1 configuration, and so each RAID instance holds one dataset. This protects against a disk failure, but not against server or network failures.

Your other option is to put one copy of the dataset on each server. Now you are protected from the failure of any single disk, server, or the network to that server.

In both cases you have 2TB of logical storage available.


you're putting yourself at risk of split-brain though (or downtime due to fail-over (or the lack of it)).

In either case what you're describing isn't really the 'cloud' alternative.


Yeah, that's a complication/cost of HA that a significant portion of industry has long accepted. Everywhere I've been in the last 5 years has had this assumption baked in across all distributed systems.


Where possible, my organization tries to have services deployed in sets of three (and tries to require a quorum) to reduce/eliminate split-brain situations. And we're very small scale.


I think you're giving him too little credit there.

The parent's point is really spot on though: most websites aren't at a scale where every stateful API has redundant implementations. But the author's point does have merit: inevitably, something goes wrong with all systems - and when it does, your system goes down if all you did was trust your HA config to work. If you actually did go for redundant implementations, your users likely aren't even gonna notice anything went wrong.

It's however kinda unsustainable to maintain unless you have a massive development budget


But why is his layer the only correct layer for redundancy?

He's also doing wasted work, and his redundancy has bugs that a consumer has to worry about working around.


We're currently quite off-topic, if I'm honest.

I think the author was specifically talking about RAID-1 redundancy and is advocating that you can leave your systems with RAID-0 (so no redundant drives in each server), as you're gonna need multiple nodes in your cluster anyway... so if any of your system's disks break, you can just let it go down and replace the disk while the node is offline.

But despite being off-topic: redundant implementations are, from my experience, not used in a failover way. They're active at all times and load is spread across them if you can do that, so you'd likely find the inconsistencies in the integration test layer.


> not used in a failover way.

Aurora works like that. Read replicas are on standby and become writable when the writable node dies or is replaced. They can be used, of course, but so can other standby streaming replicas.


I'm not the original author, but I used to work with him and now work in storage infrastructure at Google. As others pointed out, what the author, Kris, writes kind of implies/requires a certain scale of infrastructure to make sense. Let me try to provide at least a little bit of context:

The larger your infrastructure, the smaller the relative efficiency win that's worth pursuing (duh, I know, engineering time costs the same, but the absolute savings numbers from relative wins go up). That's why an approach along the lines of "redundancy at all levels" (raid + x-machine replication + x-geo replication etc) starts becoming increasingly worth streamlining.

Another, separate consideration is the types of failures you have to consider: an availability incident (temporary unavailability) vs. durability (permanent data loss). And then it's worth considering that in the limit of long durations, an availability incident becomes the same as a durability incident. This is contextual: to pick an obvious/illustrative example, if your Snapchat messages are offline for 24h, you might as well have lost the data instead.

Now, machines fail, of course. Doing physical maintenance (swapping disks) is going to take significant, human time scales. It's not generally tolerable for your data to be offline for that long. So local RAID barely helps at all. Instead, you're going to want to make sure your data stays available despite a certain rate of machine failure.

You can now make similar considerations for different, larger domains. Network, power, building, city/location, etc. They have vastly different failure probabilities and also different failure modes (network devices failing is likely an availability concern, a mudslide into your DC is a bit less likely to recover). Depending on your needs, you might accept some of these but not others.

The most trivial way to deal with this is to simply make sure you have a replica of each chunk of data in multiple of each of these kinds of failure zones. A replica each on multiple machines (pick the amount of redundancy you need based on a statistical model from component failure rates), a replica each on machines under different network devices, on different power, in different geographies, etc.

That's expensive. The next most efficient thing would be to use the same concept as RAID (erasure codes) and apply that across a wider scope. So you basically get RAID, but you use your clever model of failure zones for placement.

This gets a bit complicated in practice. Most folks stick to replicas. (E.g., last I looked, HDFS/Hadoop only supported replication, but it did use knowledge of network topology for placing data.)
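
To make that concrete, here is a minimal, hypothetical sketch of topology-aware replica placement (the machine list and the greedy relaxation policy are made up for illustration, not how any real system does it): prefer machines whose site and rack haven't been used yet, and only relax those constraints when you run out of options.

  # Toy topology-aware placement: spread replicas across failure domains,
  # relaxing constraints (site, then rack) only when we run out of options.
  machines = [
      ("m1", "rack-a", "site-1"), ("m2", "rack-a", "site-1"),
      ("m3", "rack-b", "site-1"), ("m4", "rack-c", "site-2"),
      ("m5", "rack-d", "site-2"), ("m6", "rack-e", "site-3"),
  ]

  def place_replicas(machines, n):
      chosen, racks, sites = [], set(), set()
      # Pass 1: new site and new rack; pass 2: new rack; pass 3: anything left.
      for constraint in ("site", "rack", "none"):
          for mid, rack, site in machines:
              if len(chosen) == n:
                  return chosen
              if any(mid == m for m, _, _ in chosen):
                  continue
              if constraint == "site" and (site in sites or rack in racks):
                  continue
              if constraint == "rack" and rack in racks:
                  continue
              chosen.append((mid, rack, site))
              racks.add(rack)
              sites.add(site)
      return chosen

  print(place_replicas(machines, 3))  # one replica in each of the three sites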

The reason why you don't want to do this in your application is because it's really kinda complicated. You're far more likely to have many applications than many storage technologies (or databases).

Now, at some point of infrastructure size or complexity or team size it may make sense to separate your storage (the stuff I'm talking about) from your databases as well. But as Kris argues, many common databases can be made to handle some of these failure zones.

In any case, that's the extremely long version of an answer to your question why you'd handle redundancy in this particular layer. The short answer is: Below this layer is too small a scope or with too little meta information. But doing it higher in the stack fails to exploit a prime opportunity to abstract away some really significant complexity. I think we all know how useful good encapsulation can be! You avoid doing this in multiple places simply because it's expensive.

(Everything above is common knowledge among storage folks, nothing is Google specific or otherwise it has been covered in published articles. Alas, the way we would defend against bugs in storage is not public. Sorry.)


Erasure codes are indeed a great illustration of the scale concept. If you're designing to tolerate 2 simultaneous failures with 3 drives, you need RAID-1 with N=3, at 3x the cost of a single copy. If you have 15 drives you can do 12-of-15 erasure coding, which is only 1.25x the cost of a single copy.
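
As a quick back-of-the-envelope check on those numbers (nothing tied to any particular system):

  # Storage overhead vs. fault tolerance: replication stores n full copies,
  # while k-of-n erasure coding stores n/k of the original size.
  def replication_overhead(copies):      # tolerates copies-1 failures
      return copies

  def erasure_overhead(k, n):            # tolerates n-k failures
      return n / k

  print(replication_overhead(3))   # 3x to survive 2 failures with 3 drives
  print(erasure_overhead(12, 15))  # 1.25x with 12-of-15 across 15 drives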


If you are not running a database at home, you have the database in a replication setup.

That provides capacity, but also redundancy. Better redundancy than at the disk level - fewer resources are shared.

Here is how we run our databases: https://blog.koehntopp.info/2021/03/24/a-lot-of-mysql.html


"If you are not running a database at home, you have the database in a replication setup."

That's one of the perceptions I'm saying isn't always true. Especially in big, non-tech companies that have a mish-mash of crazy legacy stuff. Traditional H/A disks and backups still dominate some spaces.


Backups that are easily restored trump replication for so many cases, and much more cheaply. High availability is overrated when you have spare preconfigured commodity hardware you can spin up in minutes, which is often a perfectly acceptable human time scale. You really don't need the other insurance running all the time - that only makes it more fragile.

Only really huge services would care about downtime of this degree, or horizontal scaling.


> Backups that are easily restored trump replication for so many cases and much more cheaply

At 400 MB/s, you restore a terabyte in 45 minutes, and then you need to replay the changes that happened since the backup (in MySQL: replay the binlog), which will take approximately another 15 minutes.

"One hour of MTTR or provisioning time per Terabyte of data at 400 MB/s sustained linear I/O speed" is a useful rule of thumb.

Having a replica that you can promote reduces that to seconds. Having a time delayed replica where you just roll forward the binlog reduces that to minutes.

You can work with modern databases without using replication, but that in most cases just shows you suck at operations.


Agreed. It depends on what you do.

When I ran large-scale Exchange and database systems, we would always get into big fights with the storage people, who believed that SAN disk with performance tiering, replication, etc was the way to solve all problems.

The problem at the time was that the fancy storage cost $40+/GB/month and performed like hot garbage because the data didn’t tier well. For email, the “correct” answer in 2010 was local disk or dumb SAS arrays with Exchange DAG.

For databases, the answer was “it depends”. Often it made sense to use the SAN and replication at that layer.


Even if the author's assumptions are true and valid, this is like saying mirroring RAID is a safe backup option :)

It is not. If you have 10 DB hosts with the same brand of NVMe, and they fail under a workload, what good is it that you have 10 hot-hot failover hosts? You just bought yourself a few days, or hours if you are especially unlucky and using consumer grade.


Yep. Mirroring happily mirrors mistakes very well, and very quickly.


Replication also replicates mistakes.


Correct.

For that you have time delayed replicas and of course backups. But time delayed replicas are usually much faster to restore than a backup.


> Because flash does not overwrite anything, ever.

This is repeated multiple times in the article, and I refuse to believe it is true. If NVME/SSDs never overwrote anything, they would quickly run out of available blocks, especially on OSs that don't support TRIM.


There's nuance to this; the deletes / overwrites are accomplished by bulk wiping entire blocks.

Rather than change the paint color in a hallway you have to tear down the house and build a new house in the vacant lot next door that's a duplicate of the original, but with the new hallway paint.

To optimize, you keep a bucket of houses to destroy, and a bucket of vacant lots, and whenever a neighborhood has lots of "to be flattened houses" the remaining active houses are copied to a vacant lot and the whole neighborhood is flattened.

So, things get deleted, but not in the way people are used to if they imagine a piece of paper and a pencil and eraser.


Just to add to the explanation, SSDs are able to do this because they have a layer of indirection akin to virtual memory. This means that what your OS thinks is byte 800000 of the SSD may change its actual physical location on the SSD over time, even in the absence of writes or reads to said location.

This is a very important property of SSDs and is a large reason why log structured storage is so popular in recent times. The SSD is very fast at appends, but changing data is much slower.


> The SSD is very fast at appends, but changing data is much slower.

No, it's worse than that. The fact that it's an overly subtle distinction is the problem.

SSDs are fast while write traffic is light. From an operational standpoint, the drive is lying to you about its performance. Unless you are routinely stress testing your system to failure, you may have a very inaccurate picture of how your system performs under load, meaning you have done your capacity planning incorrectly, and you will be caught out with a production issue.

Ultimately it's the same sentiment as people who don't like the worst-case VACUUM behavior of Postgres - best-effort algorithms in your system of record make some people very cranky. They'd rather have higher latency with a smaller error range, because at least they can see the problem.


Are there write-once SSDs? They would have a tremendous capacity. Probably good for long term backups or archiving. Also possibly with a log structured filesystem only.


Making them write-once doesn't increase the capacity; that's mostly limited by how many analog levels you can distinguish on the stored charge, and how many cells you can fit. The management overhead and spare capacity to make SSDs rewritable is, to my knowledge, in the single-digit percentages.

(Also you need the translation layer even for write-once since flash generally doesn't come 100% defect free. Not sure if manufacturers could try to get it there, but that'd probably drive the cost up massively. And the translation layer is there for rewritable flash anyway... the cost/benefit tradeoff is in favor of just living with a few bugged cells.)


I suspect that hawki was assuming that a WORM SSD would be based on a different non-flash storage medium. I don't know any write once media that has similar read/write access times to an SSD.

FWIW, there are WORM microsd cards available but it looks like they still use flash under the hood.


I don't know enough specifics, so I didn't assume anything :) In fact I was not aware of non-flash SSDs.

Because of the Internet age there probably is not much place for write-once media anyway, even if it would be somewhat cheaper. But maybe for specialized applications, or if it would be much, much cheaper per GB.


The only write once media I'm aware of that is in significant use are WORM tapes. They don't offer significant advantages over regular tapes, but for compliance reasons it can be useful to just make it impossible to modify the backups.


What about EPROMs? I mean could those be scaled down with 7nm lithography to be energy efficient incorruptible fast storage?


You mean the UV erasable kind? Essentially phase change memory? Very hard to miniaturize?

Because older Flash isn't as stable when miniaturized as you'd expect. Current flash is a direct descendant of these; it's only more stable because the cells are much chunkier and thus have lower leakage.


I was thinking of the anti-fuse based PROMs, not EPROMs, sorry. I figure if you miniaturized those they'd be faster and denser and fully reliable in use.


I thought along that route as well but I'm not sure how the feature scale of a fuse compares to the size of a flash cell - especially since the latter can contain multiple bits worth of info (MLC). Assuming the fuse write results in a serious physical state change of some sort, I suspect that the energy required for high speed writes (at SSD speeds) may become substantial.

That being said, it's not clear how much innovation has occurred in this direction in the storage space.


> Making them write-once doesn't increase the capacity

It could theoretically make them cheaper. But I guess that there wouldn't be enough demand, so you'd be better off having some kind of OS enforced limitation on it.


I find this a super interesting question. I always assumed that long term stability of electronic non-volatile memory is worse than that of magnetic memory. When I think about it, I can't think of any compelling reason why that should be the case. Trapped electrons vs magnetic regions; I have no intuition which one of them is likely to be more stable.

There is a question on Super User about this topic with many answers but no definitive conclusion. There seem to be some papers touching the subject, but at a glance I couldn't find anything useful in them.

[1] https://superuser.com/questions/4307/what-lasts-longer-data-...


According to https://www.ni.com/en-no/support/documentation/supplemental/... (Seems kinda reputable at least)

"The level of charge in each cell must be kept within certain thresholds to maintain data integrity. Unfortunately, charge leaks from flash cells over time, and if too much charge is lost then the data stored will also be lost.

During normal operation, the flash drive firmware routinely refreshes the cells to restore lost charge. However, when the flash is not powered the state of charge will naturally degrade with time. The rate of charge loss, and sensitivity of the flash to that loss, is impacted by the flash structure, amount of flash wear (number of P/E cycles performed on the cell), and the storage temperature. Flash Cell Endurance specifications usually assume a minimum data retention duration of 12 months at the end of drive life."


> During normal operation, the flash drive firmware routinely refreshes the cells to restore lost charge. However, when the flash is not powered the state of charge will naturally degrade with time.

You have to be careful how you interpret this bit. "Normal operation" here assumes not just that the SSD is powered, but that it is actively used to perform IO. Writes to the SSD will eventually cause data to be refreshed as a consequence of wear leveling; if you write 1TB per month to a 1TB drive then every (in-use) cell will be refreshed approximately monthly, and data degradation won't be a problem.

If you have an extremely low-write workload, the natural turnover due to wear leveling won't keep the data particularly fresh and you'll be dependent on the SSD re-writing data when it notices (correctable) read errors, which means data that is never accessed could degrade without being caught. But in this scenario, you're writing so little to the drive that the flash stays more or less new, and should have quite long data retention even without refreshing stored data.


> When I think about it, I can't think of any compelling reason why that should be the case. Trapped electrons vs magnetic regions; I have no intuition which one of them is likely to be more stable.

My layman intuition (which could be totally wrong) is that trapped electrons have a natural tendency to escape due to pure thermal jitter. Whereas magnetic materials tend to stick together, so there's at least that. Don't know how much of this matches the actual electron physics/technology though...


Hmm I don't think this is conclusive. Thermal jitter makes magnetic boundaries change too, and of course you have to add to it that it is more susceptible to magnetic interference.

I don't have intuition either, but I don't think this explanation is sufficient


> Trapped electrons vs magnetic regions;

From the physics point of view, aren't both cases the same thing?

Aren't magnetic regions a state of the electric field? So if I move electrons in and out, the electric field should be changing as well.


No. A region of a piece of material is magnetized in a certain direction when its (ionized) atoms are mostly oriented in that direction, the presence of a constant magnetic field is (roughly speaking) only a consequence of that.

So flash memory is about the electrons, while magnetic memory is about the ions.


Aren't permanent magnetics a direct result of oriented spins? (So due to quantum effects?)


Modern multi-bit-per-cell flash has quite terrible data retention. It is especially low if it is stored in a warm place. You'd be lucky to see ten years without an occasional re-read + error-correct + re-write operation going on



Any SSD you go through the trouble of building a max capacity disk image for, then dd'ing onto the disk before removing?

I mean... this is general-purpose HW here. Write-once SSD is a workflow more than an economically tenable use case, in terms of making massive-size, write-once-then-burn-the-write-circuit devices.


I don't think anyone would make literally write-once drives with flash memory; that's more optical disk territory. But zoned SSDs and host-managed SMR hard drives make explicit the distinction between writes and larger-scale erase operations, while still allowing random-access reads.


That would be magnetic tapes.


Append-only garbage-collected storage was used in data centers even when hard disks were (and are) popular, because it's more reliable and scalable.


inspired by that last sentence, the analogy could be rewritten as:

  - lines on page
  - pages of paper
  - whole notebooks
and might be easier for people to grok than the earlier houses/paint analogy.


I don’t know, I like the drama of copying a neighborhood and tearing down the old one xD


Reminds me of https://xkcd.com/1737/.

"When a datacenter catches fire, we just rope it off and rebuild one town over."


Speaking of xkcd, 2021 is the return of "All your bases". See the alt-text on the image.

https://xkcd.com/286/


I think the explanation is sound maybe (I am not that familiar) but the analogy gets a bit lost when you talk about buckets of houses and buckets of vacant lots.

Maybe there is a better analogy or paradigm to view this through.


I should have been a little more clear -- the urban planner managing the house building/copying and the neighborhood destruction is the realtime controller. The rules are:
1) You can build a house kinda quickly.
2) You can't modify a house once it is built.
3) You can only build a house on a vacant lot.
4) You can change the "mailing address" (relative to the physical location) of the house.
5) You can only knock down whole blocks of houses at once (not one at a time).
6) Each time you flatten a block, more crap accumulates in that block, until after a while you can't build there anymore.
7) The flatten/rebuild step may be quite slow (because you have lots of houses to build).
8) You can lie and say you built a house before it is finished, if you don't have too many houses to build (if you've got an SSD with a capacitor/battery, or a tiny cache and a reserved area for that cache).
9) You've lied to the user and you actually have 5-100% more buildable area than you've advertised.
10) You have a finite area, so eventually the dead space accumulates to the point where you can no longer safely build.

So -- you keep track of vacant lots and "dead" houses (abandoned but not flattened); whenever you've got spare time you will copy blocks with some ratio of "live" to abandoned houses to new lots so the new block only has live houses.

These pending/anticipatory compaction/garbage-collection operations are what I refer to as "buckets" -- having to compact 300 blocks (neighborhoods) to achieve 300 writes is going to result in glacial performance because of this huge write amplification (behind the scenes the drive is duplicating 100s of MB/GB of data to write a small amount of user modifications).

As you might imagine, there are lots of strategies to how to approach this problem, some of which give you an SSD with extremely unpredictable (when full) performance, others will give a much more consistent but "slower" performance.


Spoiler alert - This is the plot to ‘The Prestige’.


It's true and untrue depending on how you look at it. Flash memory only supports changing/"writing" bits in one direction, generally from 1 to 0. Erase, as a separate operation, clears entire sectors back to 1, but is more costly than a write. (Erase block size depends on the technology but we're talking MB on modern flash AFAIK, stuff from 2010 already had 128kB.)

So, the drives do indeed never "overwrite" data - they mark the block as unused (either when the OS uses TRIM, or when it writes new data [for which it picks an empty block elsewhere]), and put it in a queue to be erased whenever there's time (and energy and heat budget) to do so.

Understanding this is also quite important because it can have performance implications, particularly on consumer/low-end devices. Those don't have a whole lot of spare space to work with, so if the entire device is "in use", write performance can take a serious hit when it becomes limited by erase speed.

[Add.: reference for block sizes: https://www.micron.com/support/~/media/74C3F8B1250D4935898DB... - note the PDF creation date on that is 2002(!) and it compares 16kB against 128kB size.]
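
A toy model of that asymmetry (pure illustration, not how any real controller is implemented): programming can only clear bits, so turning a 0 back into a 1 requires erasing the whole block first.

  # Toy model of program/erase asymmetry on a tiny flash "block":
  # erase sets all bits to 1; programming can only clear bits (bitwise AND).
  def erase_block(size):
      return bytearray([0xFF]) * size

  def program(block, offset, value):
      # A program operation can never turn a 0 back into a 1.
      block[offset] &= value

  blk = erase_block(4)
  program(blk, 0, 0b10101010)
  print(bin(blk[0]))           # 0b10101010
  program(blk, 0, 0b11111111)  # trying to "restore" 1s changes nothing
  print(bin(blk[0]))           # still 0b10101010; only an erase resets it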


> Understanding this is also quite important because it can have performance implications

Security implications too. The storage device cannot be trusted to securely delete data.


If you write whole drive capacity of random data, you should be fine.


No. Say a particular model of SSD has over-provisioning of 10%, then even after writing the "whole" capacity of the drive, you can still be left with up to 10% of data recoverable from the Flash chips.


Right, so one better write 2x or 10x drive capacity of random data to it.


You should be running flash with self-encryption (and make sure you have a drive that implements that correctly).

To zap a drive you ask it to securely drop the self-encryption key. The data will still be there, but without the key it is indistinguishable from random noise.


Well who has time and energy to verify that. Just overwrite it several times, or destroy the drive.


For some family photos? Probably. For sensitive material or crypto keys? Absolutely not, due to overprovisoning as mentioned (which can be way higher than 10% for enterprise drives), but also due to controllers potentially lying to you especially when drives have things like pSLC caches, etc.


By any reasonable definition they do overwrite data. It's just that they can't overwrite less than a block of data.


If a logical overwrite only involved bits going from 1 to 0, are drives smart enough to recognize this and do it as an actual overwrite instead of a copy and erase?


On embedded devices, yes, this is actually used in file systems like JFFS2. But in these cases the flash chip is just dumb storage and the translation layer is implemented on the main CPU in software. So there's no "drive" really.

On NVMe/PC type applications with a controller driving the flash chips… I have absolutely no idea. I'm curious too, if anyone knows :)


I do know. Apparently you downvoted my sibling response to you as too simplistic, but I was clearly responding to someone where the embedded bare drive situation is irrelevant.

When it comes to what non bare flash drives do, you can start here: http://www.vldb.org/pvldb/vol13/p519-kakaraparthy.pdf

This paper is imperfect and the following citations are worth skimming. There's a cohort of similar papers chasing the same basic question in recent years that aren't densely cited amongst each other.

Go here next: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46... but note that's just a jumping off point to the more recent papers.

It's hard to gain a full understanding of this layer because it's the basis of intense competition, hence held closely by controller manufacturers.

I'm far from world expert on this, but have read a lot about it and can answer with what I know to the best of my ability.


> Apparently you downvoted my sibling response to you as too simplistic,

I didn't downvote your sibling response, but I did ignore it since it provided neither any sources nor any context for why I should trust your knowledge. Apparently others were less kind on your short statement.

With the additional information in this post, I'm much more willing to accept it into my head — thanks for answering this!


Yeah sorry that was unnecessarily grouchy of me.


Generally no, because the unit of write is a page.


Flash has a flash translation layer (FTL). It translates linear block addresses (LBA) into physical addresses ("PHY").

Flash can write blocks at a granularity similar to a memory page (cells, around 4-16 KB). It can erase only sets of blocks, at a much larger granularity (around 512-ish cell sized blocks).

The FTL will try to find free pages to write your data to. In the background, it will also try to move data around to generate unused erase blocks and then erase them.

In flash, seeks are essentially free. That means that it no longer matters if blocks are adjacent. Also, because of the FTL, adjacent LBAs are not necessarily adjacent on the physical layer. And even if you do not rewrite a block, it may be that the garbage collection moves data around at the PHY layer in order to generate completely empty erase blocks.

The net effect is that positioning as seen from the OS no longer matters at all, and the OS layer has zero control over adjacency and erases at the PHY layer. Rewriting, defragging, or other OS-level operations cannot control what happens physically at the flash layer.

TRIM is a "blatant layering violation" in the Linus sense: It tells the disk "hardware" what the OS thinks it no longer needs. TRIM'ed blocks can be given up and will not be kept when the garbage collector tries to free up an erase page.


> In flash, seeks are essentially free. That means that it does no longer matter if blocks are adjacent.

> The net effect is that positioning as seen from the OS no longer matters at all from the OS layer, and that the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS level operations cannot control what happens physically at the flash layer.

I don't agree with this. The "OS visible position" is relevant, because it influences what can realistically be written together (multiple larger IOs targeting consecutive LBAs in close time proximity). And writing data in larger chunks is very important for good performance, particularly in sustained write workloads. And sequential IO (in contrast to small random IOs) does influence how the FTL will lay out the data to some degree.


Disagree, because to my understanding your OS-visible positions have zero relevance to what they will actually be translated to at the PHY layer.

If you feed your NVMe a stream of 1GB writes spread out at completely randomised OS visible places (LBAs), the FTL may very well write it sequentially and you get the solid sustained write performance.

Conversely, you may try to write 1GB of sequential LBAs, and your FTL may very well spread it out all across the physical blocks simply because that's what’s available.

What I'm saying is that sequential read and write workloads are good, but whether the OS considers them sequential or not in terms of LBAs is irrelevant. The controller ignores LBAs and abstracts everything away.

My understanding could be wrong, so please correct me if I am.


That may sometimes be true the first times you write the random data (but in my experience it's often not true even then, and only if you carefully TRIMed the whole filesystem and it was mostly empty). But on later random writes it's rarely true, unless your randomness pattern is exactly the same as in the first run. To make room, the FTL will (often in the background) need to read the non-rewritten parts of the erase-block-sized chunks assigned in the previous runs, just to be able to write out the new random writes. At some point new writes need to wait for this, slowing things down.

Whereas with larger/sequential writes, there's commonly no need for read-modify-write cycles. The entire previous erase block sized chunks can just be marked as reusable with new content - the old data isn't relevant anymore.

This is pretty easy to see by just running benchmarks with sustained sequential and random write IO. But on some devices it'll take a bit - initially the writes are all in a faster area (e.g. using SLC flash instead of denser/cheaper mlc/tlc/qlc).

Of course, if all the random writes are >= erase block size, with a consistent alignment to multiples of the write size, then you're not going to see this - it's essentially sequential enough.


Thanks for this part, I feel like this was a crucial piece of information I was missing. Also explains my observations about TRIM not being as important as people claim it is, the firmware on modern flash storage seems more than capable of handling this without OS intervention.


The GC in the device cleans up.

TRIM is useful, it gives the GC important information.

TRIM is not that important as long as the device is not full (less than 80%, generally speaking, but it is very easy to produce pathological cases that are way off in either direction). Once the device fills up above that it is crucial.


The author clearly explains how this works in the sentence immediately following. "Instead it has internally a thing called flash translation layer (FTL)" ...


I unfortunately skimmed over this, isotopp's explanation helped clear things up in my head.


I just saw his post, it's a great explanation.

It might also help to keep in mind that both regular disk drives and solid state drives remap bad sectors. Both types of disks maintain an unaddressable storage area which is used to transparently cover for faulty sectors.

In a hard drive, faulty sectors are mapped during production and stored in the p-list, and are remapped to sectors in this extra hidden area. Sectors that fail at runtime are recorded in the g-list and are likewise remapped.

Writes may usually go to the same place in a hard drive, but it's not guaranteed there either.


This is not true anymore for many recent SMR HDDs. They have a translation layer, just like flash storage.

This is because for SMR HDDs, each block can either be SMR (higher density, EXTREMELY SLOW WRITES like <10 MB/s possible, erases will remove multiple blocks just like flash memory) or normal (standard density, normal write speeds).

The controller abstracts this away and does writes as normal, but while the drive is idle the controller converts these standard blocks into SMR blocks in the background.

This is also why SMR HDDs support TRIM.


Thanks for the info that makes a lot of sense. It looks like this tech has emerged in the time since I last did much work with disk drives.

Seems it's increasingly a bad idea to presume the implementation of internals.


Perhaps they mean it must erase an entire block before writing any data, unlike a disk that can write a single sector at a time?


The issue is that DDR4 is like that too. Not only the 64 byte cache line, but DDR4 requires a transfer to the sense amplifiers (aka a RAS, row access strobe) before you can read or write.

The RAS command reads out (and thereby empties) the entire row, like 1024 bytes or so. This is because the DDR4 cells only have enough charge for one reliable read; after that the capacitors don't have enough electrons to know whether a 0 or 1 was stored.

A row close command returns the data from the sense amps back to the capacitors. Refresh commands renew the 0 or 1 as the capacitor can only hold the data for a few milliseconds.

------

The CAS latency statistic assumes that the row was already open. It's a measure of the sense amplifiers and not of the actual data.


It's vaguely similar, but there's a huge difference in that flash needs to be erased before you can write it again, and that operation is much slower and only possible on much larger sizes. DDR4 doesn't care, you can always write, just the read is destructive and needs to be followed by a write.

I think this makes the comparison unhelpful since the characteristics are still very different.


The difference is that on DDR you have infinite write endurance and you can do the whole thing in parallel.

If flash was the same way, and it could rewrite an entire erase block with no consequences, then you could ignore erase blocks. But it's nowhere near that level, so the performance impact is very large.


That's a good point.

There are only 10,000 erase cycles per Flash cell. So a lot of algorithms are about minimizing those erases.


What does DDR have to do with NVMe?


You can't write a byte, or a word, either.

The "fact" that you can do it in your program without disturbing bytes around it is a convenient fiction that the hardware fabricates for you.


DDR4 is effectively a block device and not 'random access'.

Pretty much only cache is RAM proper these days (aka: all locations have equal access time... that is, you can access it randomly with little performance loss).


I’m confused. What’s the difference between a cache line and a row in RAM? They’re both multiples of bytes. You have data sharing per chunk in either case.

The distinction seems to be how big the chunk is not uniformity of access time (is a symmetrical read disk not a block device?)


Hard disk chunks are 512 bytes classically, and smaller than the DDR4 row of 1024 bytes!

So yes, DDR4 has surprising similarities to a 512-byte-sector hard drive (modern hard drives have 4k blocks).

>> What’s the difference between a cache line and a row in RAM?

Well DDR4 doesn't have a cache line. It has a burst length of 8, so the smallest data transfer is 64 bytes. This happens to coincide with L1 cache lines.

The row is 1024 bytes long. It's pretty much the L1 cache on the other side, so to speak. When your CPU talks to DDR4, it needs to load a row (RAS, all 1024 bytes) before it can CAS-read a 64-byte burst-length-8 chunk.

-----------

DDR4, hard drives, and Flash are all block devices.

The main issue for Flash technologies, is that the erase size is even larger than the read/write block size. That's why we TRIM for NVMe devices.


Thanks, I see what you mean at the interface level.

In terms of performance analogy though, hard drives do not get random access to blocks, but RAM does. The practical block size of hard drives is sequential reads of 100kiB+ due to seeks.


Of course it does [0]. It's just that it assigns writes as evenly as possible (to spread wear as evenly as possible), so a log-like internal "file system" is the way to go.

[0] https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf


"Most people need their data a lot less than they think they do" - great way to put it and a thought provoking article.


Yes, interesting thought. On my ride in to work I was actually thinking how our situation is exactly opposite: in our environment (R&D test and measurement and production automation) data is everything and never in the cloud so we don't get to benefit from all the cool stuff the kids are doing these days. Historical data can go in the cloud (as long as we're reasonably sure it's secure) but operational data from our test and production tooling (e.g. assembly lines and end-of-line audit tools) has to be right there with super short latency.

So we're still very interested in things like hard disks, NVMe, etc.


I think it's often true at the business level (most departments in a large company don't need the redundancy and uptime they think they do),

and rarely if ever true at the personal/family level (most people don't think their phone/tablet/laptop could lose their data; most people don't think of their data past the photos they took this week - "Oh it's OK to lose... wait, my tax documents are gone? Photos of my baby from two years ago are gone? My poems are gone? My [.. etc] is gone???").


I know for a fact I don’t need 40T of movies and tv, but that changes nothing


Pfft. 440tb of data hoarding on my NAS. ;)


I can't imagine wasting NVMe storage on backup data. That's what I have my 8TB spinning HDDs for.


Isn't tape more cost-effective and reliable? (Also, your backups do not need to be spinning all the time, if that's what they are doing.)


The problem with really high capacity tape these days is there's almost literally 1 vendor for drives and tapes, which is even worse than the 3 companies worldwide that manufacture hard drives. If you want relatively not so fast storage for a lot of terabytes, I bet I could clone a backblaze storage pod with some modifications and achieve a better $/TB ratio than an enterprise priced tape solution.


There are at least 2 manufacturers for tapes (Fujifilm and Sony), but their tapes are also sold under many other brands (e.g. IBM, Quantum, HP).

The price per TB is currently about $7 or less (per real TB of LTO-8, not per marketing compressed TB), so there is no chance to approach the cost and reliability of tapes using HDDs.

The problem with tapes is that currently the tape drives are very expensive, because their market is small, so you need to store a lot of data before the difference in price between HDDs and tapes exceeds the initial expense for the tape drive.

Nevertheless, if you want to store data for many years, investing in a tape drive may be worthwhile just for the peace of mind, because when storing HDDs for years you can never be sure how they will work when operated again.
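
A rough break-even sketch for that trade-off (the media and disk prices follow the figures in this thread; the drive price is an assumption, not a quote):

  # How much data before a tape drive pays for itself versus buying HDDs?
  tape_drive_cost = 3000.0   # assumed one-off price for a current LTO drive
  tape_per_tb     = 7.0      # ~$7 per real TB of LTO-8 media
  hdd_per_tb      = 20.0     # ~$20/TB for cheap disks

  breakeven_tb = tape_drive_cost / (hdd_per_tb - tape_per_tb)
  print(round(breakeven_tb))  # ~231 TB before tape becomes the cheaper option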


> Nevertheless, if you want to store data for many years, investing in a tape drive may be worthwhile just for the peace of mind, because when storing HDDs for years you can never be sure how will they work when operated again.

You'd actually need two tape drives to be reasonably secure: if your current setup breaks and you need to access the data either many years later (when technology has moved on) or right now, getting a SATA adapter is much easier than getting another tape drive.

Having your data on a few hard drives, combined with some check-summing and somewhat regular checking, is probably easier. Or a combination of both - two different storage systems with two different failure modes are a lot safer. But when you're going that way the break-even point moves even farther away.


Tape is great for long term storage of archival records, but nobody wants to restore a live virtual machine from 7 years ago.

Depending on the business system, you might want your server to be backed up anywhere from multiple times a day to just weekly, and if you are doing multiple times a day you do want it to be decently fast, or to have some extra capacity so as not to slow down normal operations while the backup runs.


Maybe but I imagine automatic daily backups from 5 different machines would be a headache to do with tape.


Right now disks are cheaper than tapes, even if you don't count the very expensive tape readers.


This is incorrect. I recently acquired an LTO-6 tape library; the tapes are <20€ for 2.5TB true capacity (marketing = "6TB compressed".) That's <8€/TB. Disk drives start at 20€/TB for the cheapest garbage.

Sources:

https://geizhals.eu/?cat=zip&xf=1204_LTO-6 (tapes)

https://geizhals.eu/?cat=hde7s&sort=r (disks)

(For completeness, the tape library ran me 500€ on eBay used, but they normally run about twice that. It's a 16-slot library, which coincidentally matches the breakeven - filled for 40TB it's 820€, the same in disks would've been 800€, though I would never buy the cheapest crappy disks.)

FWIW, a major reason for going for tapes for me was that at some point my backup HDDs always ended up used for "real". The tape library frees me from trying to discipline myself to properly separate backup HDDs ;)


For newer tape formats the prices are similar or even a little lower.

For example, at Amazon in Europe an LTO-7 cartridge is EUR 48 for 6 real TB (EUR 8 per real TB), while an LTO-8 cartridge is EUR 84 for 12 real TB (EUR 7 per real TB).

At Amazon USA I see now $48 for LTO-7 and $96 for LTO-8, i.e. $8 per TB if you buy 1 tape. For larger quantities, there are discounts.

Older formats like LTO-6 have the advantage that you may find much cheaper tape drives, but you must handle more cartridges for a given data size.

Currently the cheapest HDDs are the external USB drives with 5400 rpm, which are much slower, about 3 times slower than the tapes, but even those are many times more expensive than the tapes (e.g. $27 ... $30 per TB).


I don't disagree much overall, but in my experience cheap or on-sale drives can beat $15 per TB.


Since someone... disagreed? I'll cite that I recently got a low end 14TB drive for less than $15 per TB, including sales tax. If I pretend it had a 20% VAT and convert that to euros it was actually well under 14€/TB.

That was a better than usual deal but $15/TB is the norm for low end drives with a bit of patience in the US, at least until Chia's influence hit extremely recently.


Yep, reminds me of a tech blog I read in the last year or so (can't remember where) that talked about using 6 replicas for each DB shard (across 3 AZ's IIRC)… they just used EC2's on-disk NVMe storage (which is "ephemeral") because it's faster and hey, if the machine dies, you have replicas!

This post's point that even with that setup, it's still nice to have volume-based storage for quicker image replacements is interesting, I'm not experienced enough with cloud setups to know if that makes sense (eg how long does it take to upgrade an EC2 instance that has data on disk? upgrade your OS? upgrade your pg version? does ephemeral storage vs volume storage affect these? I imagine not…)


Especially in a world where so much data now "lives" in the cloud. Between my dropbox, github, google photos, etc. Very little of my data only lives on a hard drive. The stuff that does lives on a Synology NAS and is mirrored to S3 Glacier weekly.


NAS mirrored to S3 Glacier. How much does it cost?


In my experience this is very cheap. I take it the parent is not retrieving from Glacier often/ever, which is where the significant costs go. It's a decent balance for disaster recovery.

I sync my photos to S3 (a mix of raw and jpeg, sidecar rawtherapee files) across a few devices so Glacier is prohibitively expensive in this regard, but I still pay <$100 a year for more stuff than I could ever store locally.


I did the math, and for me Glacier is great for backups where homeowners insurance is likely to be involved in the restoral. It was ferociously expensive for anything less drastic.


I'm trying to figure out the costs. To back up my NAS at full capacity, I need 10TB of storage. Using S3 Glacier Deep Archive, that seems to cost $10/month per full backup image I keep. That's not bad.

What's confusing is that the calculator has "Restore Requests" as the # of requests, "Data Retrievals" as TB/month, but there's also a "Data Transfer" section for the S3 calculator. If I add 1 restore request for 10TB of data (eg: restoring my full backup to the NAS), that adds about $26 for that month. Totally reasonable.

However, if "Data Transfer" is relevant, and I can't tell if it is or isn't, uploading my backup data is free but retrieving 10TB would cost $922! Is that right?

This is what has always deterred me from using AWS. It's so unclear what services and fees will apply to any given use case, and it seems like there's no way to know until Amazon decides that you've incurred them. At $10/month for storage and $26 if I need to restore, I can just set this up and I don't need to plan for disaster recovery expenses. But if it's going to cost me $922 to get my data back, I've got to figure out how to make sure my insurance is going to cover that. This isn't a no-brainer anymore. Also, what assurance do I have that the cost isn't going to be higher when I need the data, or that there won't be other fees tacked on that I've missed?

[1] https://calculator.aws/#/createCalculator/S3


Glacier pricing can be hard to grok...

With Glacier as it is usually used, you don't read data directly from the Glacier storage; it has to be restored to S3, where you then access it. That is where the restore charges and the delays come from: you can pay a low rate for the bulk option that takes up to 24 hours to restore your data to S3. But the real cost is the bandwidth from S3 back to your NAS/datacenter/etc, which brings it up to about $90 USD/TB.
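
A rough sketch of that cost structure using the figures from this thread (prices vary by region and change over time, so treat them as placeholders):

  # Deep-archive storage is cheap; getting everything back out is not.
  tb_stored        = 10
  storage_per_tb   = 1.0    # ~$1/TB/month for Glacier-style deep archive
  retrieval_per_tb = 90.0   # ~$90/TB all-in: restore to S3 plus egress bandwidth

  print(tb_stored * storage_per_tb)    # ~$10/month to keep it parked
  print(tb_stored * retrieval_per_tb)  # ~$900 for a one-time full restore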

Other fees would include request pricing, some low amount per 1000 requests. So costs can go up a bit if you store 1 million small files to Glacier vs 1000 large files. There is also a tipping point (IIRC about 170KB) where it is cheaper to store small files on S3 than Glacier.

Depending on your data and patterns it can be better to use Glacier as a second backup, which is what I do. All my data is backed up to a Google Workspace as that is "unlimited" for now. The most important subset (a few TB) also goes to Glacier. Glacier is pay as you go; there isn't some "unlimited" or "5TB for life" type deal that can change. If Google Workspace ever becomes not "unlimited" or something happens to it, I have the most important data in Glacier, and it's data that I have no qualms paying >$1k to get back.

But for me restoring from Glacier means that my NAS is dead (ZFS RAIDZ2 on good hardware) and Google Workspace has failed me at the same time.


Cool, thank you for the details. None of their marketing or FAQs for Glacier mention that getting the data back means going to S3 first and then paying S3's outgoing bandwidth costs. As deceptive as I expected.

I'll check out Google Workspace; that sounds like the right level of kludge for me, since this is the first time I've ever bothered to try to setup off-site backups. I only started using RAID a couple of years ago.


It makes more sense when you think of Glacier as a tier of S3, like Infrequent Access/etc, which it is now. There used to be a standalone Glacier service, but you had to upload a blob and track the UID-to-file-name mapping yourself. That skipped S3 but was far more complex.

Workspace makes more sense for large amounts of data where you can take advantage of the "unlimited".

Backblaze B2 might be what you are looking for as an in-between option.


Are you sure about the "restored to s3" bit? Their SDK seems to fetch directly from Glacier.

Note that the official name is "S3 Glacier", so from AWS's public perspective, it is S3.


> However, if "Data Transfer" is relevant, and I can't tell if it is or isn't, uploading my backup data is free but retrieving 10TB would cost $922! Is that right?

That's right. AWS charges offensive prices for bandwidth.

There are alternate methods to get data out for about half the price, or you can try your luck using Lightsail, and if they don't decide it's a ToS violation you could get the transfer costs down to around $50.


The $922 sounds about right. That jibes with my estimates.

There's another (unofficial!) calculator at http://liangzan.net/aws-glacier-calculator/ you can toy with.


Thanks, I'll check it out.


How would that process look in practice, should you need to call the insurance guys? As in, would you claim the cost to retrieve the data, or...? (This question is general, regardless of the actual country.)


I honestly don't know. I've never had to use it.


I don't have automated mirroring set up, but I have insight.

I use a Windows free tool called FastGlacier. I set up an IAM user on my AWS account for my backups, and use those creds to login. Then it's drag and drop! You can even use FastGlacier to encrypt/decrypt on the fly as you upload and download.

Glacier is cheap because the retrieval times are very slow - something like 1-12 hours depending on the tier.

I have about 100GB of critical data. Personal documents, photos and some music I don't want to have to search for if the house burns down. It's something like a dollar a month. Less than a cup of coffee.


Deep Archive is super cost effective, $1/TB/mo. For the house-burns-down scenario, I don't mind the 24hr retrieval time


Glacier is about $1USD/TB/month just for storing data. If you need to retrieve it ends up being about $90USD/TB, most of that is bandwidth charges.


That means that if you store the data for much more than half a year, Glacier becomes more expensive than storing on tapes.

Of course, tapes require a tape drive and you'd need a lot of data to compensate for its cost, but at such a high cost of retrieval it would not take much data to equal the cost of a tape drive.

Glacier is OK for a couple of TB, but for tens or hundreds it would not be suitable.


> Of course, tapes require a tape drive and its cost would require a lot of data to compensate the cost, but at a such high cost of retrieval it would not take much data to equal the cost of a tape drive.

But the less you expect to use it, the less this matters.

So I'd put the break-even point a bit higher. Tape is good for 100TB or more but for tens it's hard to justify a tape drive.

Also it's important to remember to get those tapes offsite every week!


Few bucks a month. More if I needed to retrieve something.


But Dropbox, Github, Google photos etc rely on massive piles of hard drives.


Sure, but the comment was about personal storage. Fault tolerance at the edge is less important in a cloud world.


My take on NVMe has evolved a lot after getting a chance to really play around with the capabilities of consumer-level devices.

The biggest thing I have realized (as have others) is that traditional IO wait strategies don't make sense for NVMe. Even "newer" strategies like async/await do not give you what you truly want for one of these devices (still too slow). The best performance I have been able to extract from these is when I am doing really stupid busy-wait strategies.

Also, single-writer principle and serializing+batching writes before you send them to disk is critical. With any storage medium where a block write costs you device lifetime, you want to put as much effective data per block as possible, and you also want to avoid editing existing blocks. Append-only log structures are what NVMe/flash devices live for.

With all of this in mind, the motivations for building gigantic database clusters should start going away. One NVMe device per application instance is starting to sound a lot more compelling to me. Build clusters with app logic (not database logic).

In testing of these ideas I have been able to push 2 million small writes (~64k) per second to a single Samsung 960 pro for a single business entity. I don't know of any SQL clusters that can achieve these same figures.
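
As a rough illustration of the single-writer, serialize-and-batch pattern described above (the batching threshold, record framing, and file name are arbitrary choices for this sketch, not the poster's actual setup):

  # Single writer thread drains a queue, packs many small records into one
  # large buffer, and issues a single append per batch.
  import os, queue, threading

  q = queue.Queue()

  def writer(path, batch_bytes=1 << 20):
      fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
      buf = bytearray()
      while True:
          rec = q.get()
          if rec is None:
              break
          buf += len(rec).to_bytes(4, "little") + rec   # length-prefixed record
          # Flush once the batch is big enough or the queue has drained.
          if len(buf) >= batch_bytes or q.empty():
              os.write(fd, buf)
              buf.clear()
      if buf:
          os.write(fd, buf)
      os.close(fd)

  t = threading.Thread(target=writer, args=("app.log",))
  t.start()
  for i in range(100_000):
      q.put(b"small record %d" % i)
  q.put(None)
  t.join()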


In NVMe you can get around 800,000 IOPS from a single device, but the latency gives you around 20,000 IOPS sequentially. You need to talk with deep queues or with multiple concurrent threads to the device in order to eat the entire IOPS buffet.

Traditional OLTP workloads do not tend to have the concurrency to actually saturate the NVME. You would need to be 40-way parallel, but most OLTP workloads give you 4-way.

Multiple instances per device are almost a must.


With a lot of NVMe devices, up to medium priced server gear, the bottleneck in OLTP workloads isn't normal write latency, but slow write cache flushes. On devices with write caches one either needs to fdatasync() the journal on commit (which typically issues a whole device cache flush) or use O_DIRECT | O_DSYNC (ending up as a FUA write which just tags the individual write as needing to be durable) for journal writes. Often that drastically increases latency and slows down concurrent non-durable IO, reducing the benefit of deeply queued IO substantially.

On top-line gear this isn't an issue: they don't signal a write cache (by virtue of either having a non-volatile cache or enough of a power reserve to flush the cache), which then prevents the OS from actually doing anything more expensive for fdatasync()/O_DSYNC. One can also manually ignore the need for caching by changing /sys/block/nvme*/queue/write_cache to say write through, but that obviously loses guarantees - it can be useful for testing on lower-end devices, though.
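
For concreteness, the two journal-write styles being contrasted look roughly like this from userspace (a Linux-flavoured sketch; real engines combine O_DSYNC with O_DIRECT and handle the alignment requirements that come with it):

  # Option 1: buffered write + fdatasync, which on consumer drives typically
  # triggers a full device write-cache flush.
  import os

  fd = os.open("journal.wal", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
  os.write(fd, b"commit record\n")
  os.fdatasync(fd)            # whole-cache flush on drives with volatile caches
  os.close(fd)

  # Option 2: open with O_DSYNC so each write is individually durable; with
  # O_DIRECT this usually ends up as a FUA write tagging just that I/O.
  fd = os.open("journal.wal", os.O_WRONLY | os.O_APPEND | os.O_DSYNC)
  os.write(fd, b"commit record\n")   # returns once this write is durable
  os.close(fd)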


One consequence of that is that:

> Multiple instances per device are almost a must.

Isn't actually unproblematic in OLTP, because it increases the number of journal writes that need to be flushed. With a single instance group commit can amortize the write cache flush costs much more efficiently than with many concurrent instances all separately doing much smaller group commits.


> You need to talk with deep queues or with multiple concurrent threads to the device in order to eat the entire IOPS buffet.

Completely agree. There is another angle you can play if you are willing to get your hands dirty at the lowest levels.

If you build a custom database engine that fundamentally stores everything as key-value, and then builds relational abstractions on top, you can leverage a lot more benefit on a per-transaction basis. For instance, if you are storing a KVP per column in a table and the table has 10 columns, you may wind up generating 10-20 KVP items per logical row insert/update/delete. And if you are careful, you can make sure this extra data structure expressiveness does not cause write amplification (single writer serializes and batches all transactions).
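
A tiny sketch of what "a KVP per column" could look like (the key encoding here is purely illustrative, not any particular engine's format):

  # Encode one logical row as several key-value pairs, one per column, so the
  # storage engine only ever sees ordered key-value writes it can batch.
  def row_to_kvps(table, pk, row):
      return [(f"{table}/{pk}/{col}".encode(), str(val).encode())
              for col, val in row.items()]

  kvps = row_to_kvps("users", 42, {"name": "ada", "email": "ada@example.org"})
  for k, v in kvps:
      print(k, v)
  # b'users/42/name' b'ada'
  # b'users/42/email' b'ada@example.org'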


> If you build a custom database engine that fundamentally stores everything as key-value, and then builds relational abstractions on top
Sounds like this could be FoundationDB, among other contenders like TiDB.

https://foundationdb.org


More like MyRocks. FoundationDB doesn't use an LSM tree and definitely wants to do lots of overwriting in place. TiDB uses RocksDB and would be closer.


You may want to play with a TiDB setup from Pingcap.


I guess it still makes sense for higher abstraction levels though, right? Like a filesystem or other shared access to a storage resource. So these asynchronous APIs aren’t writing as directly to storage, they’re placing something in the queue and notifying when that batch is committed.

> Append-only log structures are what NVMe/flash devices live for.

I would think this is also good for filesystems like ZFS, APFS, and BTRFS, yes? I had an inkling but never really looked into it. Aren’t these filesystems somewhat similar to append-only logs of changes, which serialize operations as a single writer?


> RAID or storage replication in distributed storage <..> is not only useless, but actively undesirable

I guess I'm different from most people, good news! When building my new "home server" half a year ago I set up RAID-1 (based on ZFS) with 4 NVMe drives. I'm rarely in that city, so I brought a fifth drive and put it into an empty slot. Well, one of the 4 NVMe drives lasted for 3 months and then stopped responding. One "zpool replace" and I'm back to normal, without any downtime, disassembly, or even a reboot. I think that's quite useful. The next time I'm there I'll replace the dead one, of course.


This article is speaking of large-scale multinode distributed systems. Hundreds of rack-sized systems. In those systems, you often don't need explicit disk redundancy, because you have data redundancy across nodes with independent disks.

This is a good insight, but you need to be sure the disks are independent.


Well, most often HBAs and RAID controllers are another thing that increases latency and makes maintenance costs go up quite a bit (more stuff to update), and they're another part that can break.

That's why they're not recommended when running Ceph.


I'm pretty sure discrete HBAs / Hardware RAID Controllers have effectively gone the way of the dodo. Software RAID (or ZFS) is the common, faster, cheaper, more reliable way of doing things.


Don’t lump HBAs and RAID controllers together. The former is just PCIe to SATA or SCSI or whatever (otherwise it is not just an HBA, but indeed a RAID controller). Such a thing is still useful, and perhaps necessary for software RAID if there are insufficient ports on the motherboard.


Hardware RAID doesn't seem to be going away quickly. Since the controllers are almost all made by the same company, and they can usually be flashed to be dumb HBAs, it's not too bad. But it was pretty painful when using managed hosting: the menu options with lots of disks all come with RAID controllers that are a pain to set up, and I'm not going to reflash their hardware (although I did end up doing some SSD firmware updates myself, because firmware bugs were causing issues and their firmware upgrade scripts weren't working well and were tremendously slow).


ZFS needs HBAs. Those get your disks connected but otherwise get out of the way of ZFS.

But yes, hardware RAID controllers and ZFS don't go together.


Hardware caching RAID controllers do have the advantage that if power is lost, the cache can still be written out without the CPU/software to do it. This lets you safely run without write-through caching on fsync. This was a common spec for the provisioned bare-metal MySQL servers I'd worked with.


The entire comment thread of this article is on-prem, low scale admins and high-scale cloud admins talking past each other.

You can build in redundancy at the component level, at the physical computer level, at the rack level, at the datacenter level, at the region level. Having all of them is almost certainly redundant and unnecessary at best.


Sometimes. Other times they may make things worse by lying to the filesystem (and thereby also the application) about writes being completed, which may confound higher-level consistency models.


It does seem to me that it's much easier to reason about the overall system's resiliency when the capacitor-protected caches are in the drives themselves (standard for server SSDs) and nothing between that and the OS lies about data consistency. And for solid state storage, you probably don't need those extra layers of caching to get good performance.


Since my experience was from a number of years back, I tried searching for more recent reports: "mysql ssd fsync performance". The top recent result I found was on Digital Ocean[0] in 2020. It says "average of about 20ms which matches your 50/sec" and mentions battery back-up controllers, which wasn't even in my search terms.

[0] https://www.digitalocean.com/community/questions/poor-fsync-...


I would be worried about my data behind held hostage by a black box proprietary RAID controller from a hostile manufacturer (unless you're paying them millions to build & design you a custom product, at which point you may have access to internal specs & a contact within their engineering team to help you).

I'd rather have ZFS or something equivalent in software. Software can be inspected, is (hopefully) battle-tested for years by many different companies with different workloads & requirements, and worst-case scenario, because it's software, you can freeze the situation in time by taking byte-level snapshots of the underlying drives as well as a copy of the software for later examination/reverse-engineering, something you can't do with a hardware black box where you're bound to the physical hardware and often have a single shot at a recovery attempt (as it may change the state of the black box).

Have you heard of the SSD failures about a decade ago where the SSD controller's firmware had a bug that bricked the drive past a certain lifetime? The data is technically still there, and would be recoverable if you could bypass the controller or fix its firmware, but unless you had a very good relationship with the manufacturer of the SSD to gain access to the internal tools and/or source code to allow you to tinker with the controller you were SOL.


It was RAID-1, so there's no data manipulation going on, a simple mirror copy with double the read bandwidth.


> > RAID or storage replication in distributed storage <..> is not only useless, but actively undesirable

> I guess I'm different from most people, good news!

The earlier part of the sentence helps explain the difference: "That is, because most database-like applications do their redundancy themselves, at the application level..."

Running one box I'd want RAID on it for sure. Work already runs a DB cluster because the app needs to stay up when an entire box goes away. Once you have 3+ hot copies of the data and a failover setup, RAID within each box on top of that can be extravagant. (If you do want greater reliability, it might be through more replicas, etc. instead of RAID.)

There is a bit of overgeneralization in how the blog post phrases it. As applied to databases, though, I get where they're coming from.


You omitted the context from the rest of the sentence:

> most database-like applications do their redundancy themselves, at the application level …

If that’s not the case for your storage (doesn’t sound like it), then the author’s point doesn’t apply to your case anyway. In which case, yes, RAID may be useful.


What setup do you use to put 4 NVMe drives in one box? I know it’s possible, I’ve just heard of so many different setups. I know there are some PCIe cards that allow for 4 NVMe drives, but you have to match that with a motherboard/CPU combo with enough lanes to not lose bandwidth.


For distributed storage, we use this: https://www.slideshare.net/Storage-Forum/operation-unthinkab...

We then install SDS software, Cloudian for S3, Quobyte for File, and we used to use Datera for iSCSI. Lightbits maybe in the future, I don't know.

These boxen get purchased with 4 NVME devices, but can grow to 24 NVME devices. Currently 11 TB Microns, going for 16 or more in the future.

For local storage, multiple NVME hardly ever make sense.


I've been looking into building some small/cheap storage, and this is one of the enclosures I've been looking at.

https://www.owcdigital.com/products/express-4m2


That’s exactly what they are doing. Anyone else is using proprietary controllers and ports for a server chassis


I’ve recently converted all my home workstation and NAS hard drives over to OpenZFS, and it’s amazing. Anyone who says RAID is useless or undesirable just hasn’t used ZFS yet.


The article’s author only said RAID was useless in a specific scenario, not generally, and the post you’re replying to omitted this crucial context.


Compare your solution with having 4 SBCs, with 1 NVME each, at different locations. The network client would handle replication and checksumming.

The total cost might be similar but you have increased reliability over SBC/controller/uplink failure.

Of course there are tradeoffs on performance and ease of management...


You think that building 4 systems in 4 locations is likely to have a similar cost to one system at one location? For small systems, the fixed costs are a significant portion of the overall system cost.

This is doubly true for physical or self-hosted systems.


My environment is not a home environment.

It looks like this: https://blog.koehntopp.info/2021/03/24/a-lot-of-mysql.html


We are currently converting our SSD-based Ganeti clusters from LVM on RAID to ZFS, to prepare for our NVMe future, without RAID cards (1). Was hoping to get the second box in our dev ganeti cluster reinstalled this morning to do further testing, but the first box has been working great!

1: LSI has an NVMe RAID controller for U.2 chassis, preparing for a non-RAID future, just in case.


Does zpool not automatically promote the hot spare like mdadm?


It can, if you set a disk as the hot spare for that pool.

But a disk can only be a hot spare in one pool, so to have a "global" hot spare it has to be done manually. That may be what that poster was doing.


Also, if I understand it correctly, there are a few other caveats with hot spares: It will only activate when another drive completely fails, so you can't decide to replace a drive when it's close to failure (probably not an issue in this case, though, with the unresponsive drive). Second, with the hot spare activated, the pool is still degraded, and the original drive still needs to be replaced; then the hot spare is removed from the vdev, and goes back to being a hot spare.

It's these reasons that I've decided to just keep a couple of cold spares ready that I can swap in to my system as needed, although I do have access to the NAS at any time. If I was remote like GP, I might decide to use a hot spare.


I've been messing with NVMe over TCP at home lately and it's pretty awesome. You can scoop up the last generation of 10GbE/40GbE networking on eBay for cheap and build your own fast disaggregated storage on upstream Linux. The kernel-based implementation saves you some context switching over other network file systems, and you can (probably) pay to play for on-NIC implementations (especially as they're getting smarter).

It seems like these solutions don't have a strong authentication/encryption-in-transit story. Are vendors building this into proprietary products or is this only being used on trusted networks? I think it'd be solid technology to leverage for container storage.


I just use iSCSI at home but using Mellanox’ RoCE which is pretty well performing.

One thing I’m noticing is that most of these storage protocols do, in fact, assume converged Ethernet; that is, zero packet loss and proper flow control.

Is this also the case with NVMe over TCP?


I haven't experimented with it yet, but I expect that over TCP things degrade more gracefully. It seems earlier iterations of storage over networking didn't want to pay the overhead of TCP and lost out on the general purpose benefits that it brings. IIRC some RoCE iterations aren't routable for example. In theory, you could expose your NVMe over TCP device over the internet.

It seems to me applications taking advantage of NVMe are focused on building out deep queues of operations, which may smooth out issues introduced by the network. But the only way to know is to benchmark.


The point of TCP is to provide protection against packet loss and reordering, and to provide flow control.


> build your own fast disaggregated storage on upstream Linux

Why is this better than just connecting the drive via PCIe bus directly to the CPU?


It certainly isn't faster or more reliable for a single node, but this is just homelab stuff. Nothing in mine is particularly necessary. I think it's interesting because it's now accessible and receiving support from multiple vendors.

At some scale, it's nice to separate the two. You don't care much about where the disks live vs. where your compute is running. You can evict work from a node, have it reschedule, and not have to worry about replicating the data to another machine with excess capacity. Though I'm no authority on this topic.


If you want a whole drive, sure. But if you want to virtualize storage (e.g. split up one drive) you need a network protocol.


I don't need a network protocol to use one drive for 100 VMs, PV drivers in Xen or VirtIO in KVM work well.


One thing I don’t get in the Mach.2 design is why not use the approach IBM used in mainframe hard drives of having multiple heads over the same platter. The Mach.2 is two drives sharing parts, and it will behave like a 2-drive RAID array. Having multiple heads on the same platter would allow it to read like a RAID-1 while writing like a RAID-0.


Yeah, I was quite surprised to see "the drive presents itself as two disks"... I can only imagine that building the drive this way was seen as the lowest risk in terms of both firmware NREs and additional hardware required.

That being said, dual head stacks (as opposed to just dual actuators on a single head stack) would almost certainly not fit within the footprint that ordinary drives do, which would mean not being able to use standard disk enclosures for them... which would be a non-starter. Alternatively you shrink the platters, which means less capacity - and given that hard disk drives have been stuck at a certain areal density limit for a while I imagine nobody's particularly keen on losing lots of capacity for a small IOPS benefit.

IDK - maybe if HAMR goes mainstream in hard drives then we can make that tradeoff without the drive seeming like an inferior product.


> would almost certainly not fit within the footprint that ordinary drives do

Yes. You'd probably need to use 2.5" platters for a 3.5" enclosure. Also not sure if the mechanics would be precise enough to allow one head to read what the other wrote. When IBM did this, densities were MUCH lower.

As for capacity vs IOPS, 10K and 15K RPM drives rarely have the same densities as the slower large capacity units. I don't think I ever saw a 10+TB drive faster than 7200 RPM.


Not following the leaf-and-spine figure. 40x10G is 400G, but 4x40G is 160G, so how is this completely oversubscription free?

Edit: followed the link, it is 2.5:1 oversubscribed
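
(Spelled out: 40 × 10G = 400G of downlink per leaf against 4 × 40G = 160G of uplink, and 400 / 160 = 2.5, hence 2.5:1.)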


it's worth noting that 40GbE is an absolute dead end in terms of technology, in the ISP world. Look at how many 40GbE members there are on many major IXes and other things. It's super cheap to buy an older Arista switch on ebay with N x 10GbE ports and a few 40GbE uplinks, because they're all getting pulled from service and sold off cheap. The upgrade path for 10GbE is 100GbE.

major router and switch manufacturers aren't even making new line cards with multiple 40GbE ports on them anymore, they're tech from 6-7 years ago. You can buy either something with dense 10GbE SFP+ ports, or something with QSFP/QSFP28 or similar 100GbE ports.

If you want something for a test/development lab, or are on a very tight budget, sure.


Meanwhile some budget homelabbers are jumping on the 2.5Gb bandwagon


The image is from 2012 and describes a topology, but not a technology.

Today you'd rather use an Arista 7060 or 7050cx, or a Juniper 5200, as ToR. You'd not build 1:1, but plan for it in terms of ports and cable space, then add capacity as needed.

Almost nobody in an Enterprise environment actually needs 1:1, unlike hosters or hyperscalers renting nodes to others. Even then you'd probably be able to get away with a certain amount of oversubscription.


Random perspective: I think the author is unintentionally speaking of a future not quite here yet & specifically with AI integrated into applications & their infrastructure. Essentially the need for low data gravity with how AI works today in viable business domains.

For large processing of data in general requests (maybe like a traditional SQL query), AI might typically require the retrieved data to map to vector formats. But it's not practical to add those to RDBMSs (you're not doing any SQL up front, and retrieval will be too slow), nor to compute them ad hoc (these are business app use cases, not general CV or NLP models, nor analytics).

AI/ML inputs/vectors need to live somewhere with fast access rather than behind abstracted software over that access, so use embedded DBs (LevelDB, RocksDB, Badger, LMDB, etc.). Anecdotally, I get top performance over 8-24 physical SSD attachments (TBs of data), NVMe RAID-0 as a single mount on individual nodes, where we need to gather anywhere from one to potentially tens of thousands of vectors every request. It's never more performant to distribute that, and nothing is really gained with persistence. Today's platforms are somewhat geared for fast recovery (though writing TBs of data isn't simple, you can model the automation and scale horizontally). Still, app platform infrastructure isn't really geared for AI application architectures like this. E.g., Kubernetes is too abstracted, but I imagine Google/GCP will, intentionally for AI workloads, enable CSI drivers and more robustness around Local SSDs on GKE.


NVME over X seems like fun. After a decade of shying away from fibre channel, we've finally got to the point where we've forgotten how awful iscsi was to look after. (and how expensive FC used to be, and how disappointing SAS switches were.)


"I don't know what features the network of the future will have, but it will be called Ethernet."

Ethernet is absorbing all the features of other networking technologies and adding them to its own perfection.


Don't worry. By the time the worst bugs are ironed out, and we have management tools worth crap, there will be a new shiny.

That being said, NVMe seems nice with no major footguns, so I guess NVMe over X is, in principle at least, a good idea.


Arista was pushing leaf-spine in 2008....


This blog post is the perfect example for "I think I have a deep understanding but I really don't."


> "Customers of the data track have stateless applications, because they have outsourced all their state management to the various products and services of the data track."

I have no idea what this is even supposed to mean. It's like somebody combined some buzzwords thought up by a fresh business school marketing graduate working in the 'cloud' industry with an attempt at actual x86-64 hardware systems engineering.

The whole premise of the first half of the article seems to be 'you don't need to design a lot of redundancy and fault tolerance', the second part then goes into a weird explanation of NVME targets on CentOS. I hope this person isn't actually responsible for building storage systems at the bare metal level supporting some production business application.


I think the article is saying that a web server (customers) should be stateless, because everything important should be in a database (data track) on another host. And that database probably has application level handling for duplicating writes to another disk or another host.

The conclusion seems to be that it's not important for hardware level data redundancy because existing database software already handles duplication in application code. I don't understand how that conclusion was reached. Hardware level redundancy like raid1 seems useful because it simplifies handling a common failure case when a single HDD or NVME fails on a database server. Hardware redundancy is just the first stage in a series of steps to handle drive failure. I do agree that a typical stateless server doesn't need raid1, but afaik it's not standard practice for a stateless web application to bother with raid1 anyway.


> I think the article is saying that a web server (customers) should be stateless, because everything important should be in a database (data track) on another host. And that database probably has application level handling for duplicating writes to another disk or another host.

Correct.

> Hardware level redundancy like raid1 seems useful because it simplifies handling a common failure case when a single HDD or NVME fails on a database server

Nobody needs that if your database replicates. Cassandra replicates data. MySQL in a replication setup replicates data. And so on. Individual nodes in such a setup are as expendable as individual disks in a RAID. More so, because you get not only protection against a disk failure, but, depending on deployment strategy, also against the loss of a node, a rack, or a rack row. Or even the loss of a DC or AZ.


It's basic 12-factor aka cloud native thinking: "Any data that needs to persist must be stored in a stateful backing service, typically a database."


Wish it were as swappable as RAM sticks; it's quite the pain to get in with a screwdriver after removing the GPU, when doing OS installs on 2 drives without bootloader crap.


NVMe is just a protocol for accessing flash memory over PCI Express; do not confuse it with the M.2 form factor (which also supports SATA and USB!). My Optane P900, which I use as a ZFS log device, is NVMe and plugs into a standard PCIe slot on my PowerEdge R520, and servers frequently use U.2 form factor drives.


I think where people commonly confuse it is because most often when they see a M.2 device, it is the variant of the slot that exposes PCI-E pins to the NVME SSD, such as for a $100 M.2 2280 SSD from Samsung, Intel, WD, etc. As you mention there's lots of other things which can be electrically connected to a motherboard's I/O buses in a M.2 slot.


Yeah agreed, but I'm clearly not referring to the protocol.


Not necessarily flash - the p900 you list isn't flash for one. Nor is it necessarily over PCIe...


Icy dock makes an entire range of hot-swap docks for m.2 NVMe drives.


Thanks!


E.g. U.2 drives can utilize NVMe but in a form factor easier to swap (including hot plugging support).

Edit: typo


"Flash shredders exist, too, but in order to be compliant the actual chips in their cases need to be broken. So what they produce is usually much finer grained, a “sand” of plastics and silicon."

For metal chip credit cards I can't shred, I put them on a brick in the back yard and torch them with a weed burner. There's a psychological bonus here if, for example, Amex made me change cards through no choice of my own because they were hacked.

This wouldn't be "compliant" for a flash drive, but it would be effective.



