Seagate Creates an NVMe Hard Disk Drive (pcmag.com)
170 points by drewrem11 on Nov 14, 2021 | 109 comments



Like the article explains, it makes sense if you can do this and reduce the amount of now 'legacy' interfaces.

We used to do IDE emulation over SATA for a while, then we got AHCI over SATA and AHCI over other fabrics. Makes sense to stop carrying all the legacy loads all the time. For people that really need it for compatibility reasons we still have the vortex86 style solutions that generally fit the bill as well as integrated controllers that do PCIe-to-PCI bridging with a classic IDE controller attached to that converted PCI bus. Options will stay, but cutting legacy by default makes sense to me. Except UART of course, that can stay forever.

Edit: I stand corrected. AHCI (as the name implies: Advanced Host Controller Interface) covers the communication up to the controller. Essentially, ATA commands are still sent to the drive either way; the difference is whether the host commands the controller in IDE or AHCI mode. This is also why the controller needs to know about the drive, whereas with NVMe it doesn't, because the controller no longer has to understand the ATA protocol (as posted here elsewhere).


> then we got AHCI over SATA

AHCI is the native SATA mode; AFAIK explicit mentions of AHCI mostly concern non-SATA interfaces (usually M.2, because an M.2 SSD can be SATA, AHCI over PCIe, or NVMe).


AHCI is the driver protocol used for communication between the CPU/OS and the SATA host bus adapter. It stops there, and nothing traveling over the SATA cable can be correctly called AHCI.

You can have fully standard SATA communication happening between a SATA drive and a SATA-compatible SAS HBA that uses a proprietary non-AHCI driver interface, and from the drive's end of the SATA cable this situation is completely indistinguishable from using a normal SATA HBA.

Likewise, you can have AHCI communication to a PCIe SSD and the OS will think it's talking to a single-port SATA HBA, with the peculiarity that sustained transfers in excess of 6Gbps are possible.


>It stops there, and nothing traveling over the SATA cable can be correctly called AHCI.

Informally, a lot of BIOSes (another informality there!) would give you the choice back in the day between IDE or AHCI when it came to communicating with the drive.

I think this is where most of the confusion came from.


That choice was largely about which drivers your OS included. The IDE compatibility mode meant you could use an older OS that didn't include an AHCI driver. The toggle didn't actually change anything about the communication between the chipset and the drive, but did change how the OS saw the storage controller built into the chipset on the motherboard. The newer host controller interface was required to make use of some newer features on drives, because IDE compatibility mode had no way to expose that functionality.

(Later, a third option appeared for proprietary RAID modes that required vendor-specific drivers, and that eventually led to more insanity and user misunderstandings when NVMe came onto the scene.)
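
If you're curious, you can see what the toggle changes from a running system: the same chipset controller advertises a different PCI class code depending on the firmware setting, and that class code is what the OS uses to pick a driver. A minimal sketch for Linux (the PCI address below is just an example; adjust for your machine):

    #include <stdio.h>

    int main(void)
    {
        /* Mass-storage subclass codes: 0x0101.. = IDE, 0x0104.. = RAID,
           0x0106.. = SATA (AHCI), 0x0108.. = NVM Express.               */
        FILE *f = fopen("/sys/bus/pci/devices/0000:00:17.0/class", "r");
        unsigned cls = 0;
        if (f && fscanf(f, "%x", &cls) == 1)
            printf("storage controller class code: 0x%06x\n", cls);
        if (f)
            fclose(f);
        return 0;
    }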


This wasn't about how the SATA controller talks to the drives, but about how the SATA controller is presented to the operating system. The IDE/legacy setting was useful at the time because Windows XP lacked AHCI drivers on the install media. This compatibility setting simplified the installation of Windows XP on newer hardware, which was very common in the DIY space at the time it was introduced. Afterward you could install AHCI drivers and switch to AHCI mode for improved performance (IIRC AHCI was required for NCQ).


> AFAIK explicit mentions of AHCI are mostly over non-SATA interfaces

No, I'm no expert in this at all, but every motherboard I've had has had an AHCI/IDE mode toggle for its SATA ports.

(Maybe the same thing is sometimes framed as 'legacy' mode vs. not or something? Just not that I've seen.)


On Intel motherboards, it was a switch between initializing the SATA controller into the legacy PIIX4-style IDE interface or the AHCI interface.

SATA is still IDE, just with a different physical layer.


>Except UART of course, that can stay forever.

Same for 3.5mm TRS for analog connections :D


This never really got "replaced" though. All the new connectors are for digital connections.

There's not much new for analog, which is probably the way it should be.


USB-C has a mode called Audio Adapter Accessory Mode. In this D+ and D- become right and left analog audio output, respectively, and SBU1 becomes analog microphone input.

Here's a reference design from TI showing how one might implement it on a device where you want to be able to switch between using your USB-C connector for data devices and for analog audio devices [1].

[1] https://www.ti.com/lit/ug/tidub66/tidub66.pdf


> All the new connectors are for digital connections.

USB-C headphones might disagree with you, as the other comment has pointed out (USB-C supports analog audio on data pins in order to create the one-connector-to-rule-them-all).


There won't be much new for analog until something lower level is found.


The new MacBook Pro even has automatic phantom power


Nice. I feel like that used to be common for PC microphone ports that were separate jacks, but IIRC it was at a much lower voltage than the 48V professional phantom power.


I actually don't see much point in these changes besides "new and shiny" trend-chasing, with perhaps a bit of planned obsolescence thrown in. NVMe is a massive complexity increase for little gain. Looking at all the bugs in various NVMe implementations and drivers (many of which are "worked around" by detecting specific devices) is enough to scare you away from trusting them for storage.


NVMe is certainly not more complex than either AHCI or SAS from a driver or hardware interface standpoint. And I've written drivers for all three that run in production today. Now maybe the controller design is more difficult - I wouldn't know - but there are several vendors that sell NVMe controllers out there so it can't be all that bad.

And as noted in the other comments, this is about simplifying support for things like connectors, LEDs, expanders, and software in data centers where they're mixing NVMe SSDs and HDDs today. It's just simpler to have fewer parts and software differences.


NVMe also does away with the controller knowing about the disk layout and addressing. Which may make sense for future disks and ever increasing cache sizes. At some point you probably want to put all the logic in the drive itself to optimise it (as SSDs already do)


Indeed, ideally the controller or HBA should really just provide the fabric and nothing more. A bit like Thunderbolt and USB4.


Hasn't LBA been around since 1990 or so?


Even the old cylinder/head/sector addressing almost never matched the real hard drive geometry, because the floppy-oriented BIOS CHS routines used a different number of bits for each address component than ATA did, so translation was required even for hard drives with capacities in the hundreds of MB.
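
To put numbers on that: BIOS INT 13h allowed 10 cylinder bits, 8 head bits, and 6 sector bits, while ATA allowed 16/4/8, so passing CHS straight through was capped by the worst of both (1024 cylinders x 16 heads x 63 sectors x 512 bytes, roughly 504 MiB), and the BIOS had to present a fake geometry and translate. A rough sketch of the standard conversion math:

    #include <stdint.h>
    #include <stdio.h>

    /* Classic CHS -> LBA conversion (sector numbers are 1-based). */
    static uint64_t chs_to_lba(uint32_t c, uint32_t h, uint32_t s,
                               uint32_t heads_per_cyl, uint32_t sectors_per_track)
    {
        return ((uint64_t)c * heads_per_cyl + h) * sectors_per_track + (s - 1);
    }

    int main(void)
    {
        /* Worst-of-both-worlds capacity when BIOS CHS is passed straight to ATA: */
        uint64_t limit = 1024ULL * 16 * 63 * 512;
        printf("untranslated CHS limit: %llu bytes (~504 MiB)\n",
               (unsigned long long)limit);
        printf("C/H/S 100/3/42 -> LBA %llu\n",
               (unsigned long long)chs_to_lba(100, 3, 42, 16, 63));
        return 0;
    }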


SCSI already got rid of knowledge of disk layout/addressing some 30 years ago.


SCSI you say? My favourite, "People Can't Memorise Computer Industry Acronyms", still stays with me to this day.


My go-to "I'm a smart computer guy because I know what the acronym is short for" acronym was PCMCIA.


The SATA protocol itself is rather outdated. For example, when using NCQ, if a single command gets an error (say, a medium error), all the other pending commands are aborted and you need to resubmit them. That's quite a pain and a performance killer, on top of the latency the medium error itself induced (reaching it already involved multiple retries taking more than a second).

Another advantage I can see is that the overhead of controlling NVMe is somewhat lower than SATA/AHCI: you can use kernel bypass (like SPDK) to control many drives more efficiently than through the kernel interface, spending less CPU time when you control a large number of drives from a single server. HDD bandwidth is rather low, so you can easily put a large number of bridges in there to increase the drive count.
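
The mechanics that make kernel bypass cheap are simple: the NVMe queues and doorbells are just memory, so a user-space driver can post commands and poll for completions without interrupts or syscalls. Conceptual sketch only (not SPDK's actual API), based on the NVMe queue/phase-tag scheme:

    #include <stdbool.h>
    #include <stdint.h>

    /* 16-byte NVMe completion queue entry (little-endian layout). */
    struct nvme_cqe {
        uint32_t dw0, dw1;
        uint16_t sq_head, sq_id;
        uint16_t cid;
        uint16_t status;   /* bit 0 of this half-word is the phase tag */
    };

    /* The controller flips the phase tag each time it wraps the completion
       queue, so the host can detect new entries purely by polling memory. */
    static bool cqe_is_new(const volatile struct nvme_cqe *cqe, unsigned phase)
    {
        return (cqe->status & 1u) == phase;
    }

    /* Submission is the mirror image: copy a 64-byte command into the
       submission queue, then write the new tail index to that queue's
       memory-mapped doorbell register; no kernel involvement per I/O. */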


There's a good chance the SATA 3.5 spec (July 2020) provides options to avoid the single-failure issue, with the Defined Ordered NCQ Commands feature.

But as you say, there's a ton of things that NVMe just does way better. From the start it had a much more optimized, slimmer, less chatty protocol (as mentioned), multiple queues, and much bigger possible queue depths. NVMe was a huge leap forward, and it continues to evolve really good capabilities: prioritization, the streams feature, zoned devices... there are so many better capabilities available on NVMe.


Disks have had complex CPUs on them for a while; might as well go full mainframe, admit they're smart storage subsystems, and put them on the first-class bus. Is "DASD" still an IBM trademark?

Of course there's a long history of "multiple interface" drives, which are always ugly hacks that turn up as rare collectors' items and examples of boondoggles.


> and admit they're smart storage subsystems and put them on the first class bus. is "DASD" still an IBM trademark?

Y'know that eventually we'll be running everything off Intel's Optane non-volatile RAM: we won't have block-addressable storage anymore, everything will be directly byte-addressable. All of the storage abstractions that have popped up over the past decades (tracks, heads, cylinders, ew, block addressing, unnecessarily large block sizes, etc.) will be obsolete because we'll already have perfect storage.

It's not quite DASD, but it's much better.


I feel like you're going to need different filesystem concepts to deal with full byte addressable storage.

There are a lot of simplifications you can use knowing everything is a world of 512-byte/8k blocks.

Everything that's a linked list of blocks would suddenly be lists of "start and offset" instead, because you might have a file fragmented as "1MB at offset A, 32 bytes at offset B, and 5MB at offset C". You'd likely need some tradeoffs for minimum fragment sizes to avoid a file sharding into a million 32-byte fragments and requiring a huge number of distinct reads to load, and maximum sizes that would be manageable without overloading the recipient device's buffer.
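
Roughly, the on-media bookkeeping shifts from "chains of fixed-size blocks" to "lists of extents", something like this (hypothetical structures, just to illustrate the difference):

    #include <stdint.h>

    /* Block-oriented world: a file is a chain of fixed-size blocks. */
    struct block_ptr {
        uint64_t block_no;       /* which 512B/4K/8K block */
        uint64_t next_block_no;  /* next link in the chain */
    };

    /* Byte-addressable world: a file is a list of arbitrary extents. */
    struct byte_extent {
        uint64_t start;          /* byte offset on the media */
        uint64_t length;         /* e.g. {A, 1 MiB}, {B, 32}, {C, 5 MiB} */
    };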


Optane didn’t exactly take the market by storm. Also: flash memory doesn’t work this way; it’s inherently organized into blocks/pages.


The "Optane" that Intel released as their 3DXPoint NVMe brand (and quietly withdrew recently) isn't the same Optane as their byte-addressable non-volatile storage+RAM combo. It isn't flash memory with blocks/pages; it really is byte-addressable: https://dl.acm.org/doi/10.1145/3357526.3357568

"true" Optane hasn't taken over the scene because, to my knowledge, there's no commercially supported OS that's built around a unified memory model (heck, why not have a single memory-space?) for both storage and memory.

We can't even write software that can just reanimate itself from a process image that would be automagically persisted in the unified storage model. We've got a long way to go before we'll see operating systems and applications that take advantage of it.


It's not software that's the problem; it's the hardware, which still doesn't perform at the level expected when it was under development. The 3DXPoint/Optane cells have higher-than-planned error rates, which require a controller, much like flash does, to perform error correction and wear leveling. This makes it impossible to provide the same latency as DRAM (the original goal), and pushes power requirements for the DIMM form factor up, since each DIMM has to have its own controller for the 3DXPoint memory. So now you have most of the problems of flash without cells that store multiple bits (as in TLC/QLC flash) or the 3D layering that flash vendors now use.

Suffice it to say that it was promising while it was under development, but Intel neglected to factor in just how much flash vendors would improve performance and reduce cost per bit.


I wonder how that compares with DDR5, which has a controller and new power systems on the DIMM for similar reasons. Combined with the upcoming CXL bus stuff, it makes me wonder if it'll still happen and the interface just wasn't ready yet.


DDR5 does not have a controller on the DIMM; it's nothing like the complexity of Optane DIMMs. I agree that CXL seems like a better fit for Optane than a memory bus.


Ah yes, it's just a clock driver; the ECC requirements are in the RAM chips themselves. I had misremembered.

https://www.rambus.com/memory-and-interfaces/server-dimm-chi...


It's more a question of Amdahl's law: the introduction of SSDs brought I/O latencies from the 5 to 10 milliseconds of HDDs down to tens to hundreds of microseconds. Going from tens of microseconds to 100-200 nanoseconds is much less of an improvement for most general purpose workloads.
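
A toy example of that (the 100 us of CPU work per request is an assumed number, purely for illustration):

    #include <stdio.h>

    int main(void)
    {
        double cpu_us = 100.0;                      /* assumed per-request CPU work */
        double media_us[] = { 10000.0, 50.0, 0.2 }; /* HDD, NAND SSD, Optane-ish    */
        const char *name[] = { "HDD", "NAND SSD", "Optane" };

        for (int i = 0; i < 3; i++)
            printf("%-9s media %8.1f us  ->  total %8.1f us per request\n",
                   name[i], media_us[i], cpu_us + media_us[i]);
        /* HDD -> SSD cuts the per-request total ~67x; SSD -> Optane only ~1.5x,
           because the non-I/O part now dominates. */
        return 0;
    }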


But RAM is itself accessed in blocks. The process is hidden from software, but memory is always fetched in word-aligned blocks. It doesn't contradict your point, but just pointing out that even DRAM is pulled in chunks not unlike traditional drives (if you squint).

(Of course, getting those chunks down to cache line sizes does open up a lot of possibilities...)


Well, yeah, the cacheline can be considered a kind of 64-byte block. But that's not forced by how RAM works - you could access DRAM in words if you wished to; it's just that it doesn't make sense because of the CPU cache. For flash, the blocks (and pages) are inherent to its design and there is no way around it.

Also, the RAM "block size" is 64B, while for flash it's more like 4kB. The CPU cache will "deblock" the 64B blocks, but it can't efficiently do it for 4kB ones.

And then there's the speed. Does replacing PCIe with a memory bus actually make a performance difference that's measurable given flash latency?


DRAM is accessed in rows of maybe 4096 bytes or so, the size of the sense amplifiers.

Also known as a bank. The transfer itself is 64 bytes at a time (burst length 8 for DDR4 x 64 bit channel). But internally, the RAS (row address strobe) is the access of note.


DDR4 DIMMs have (almost?) exclusively 8kiB pages. If you have 2 channels with interleaving, like AFAIK the default on all current x86 desktop platforms, you get the same 16kiB effective page size as the typical[0] TLB compressing (which smashes the 4 4kiB pagetable entries in a cacheline worth of lowest-level pagetable into a single virtual 16kiB TLB entry if they have the same attributes and look exactly as a smashed 16kiB page table entry would look like (alignment, contiguous, etc.)).

[0]: Zen3 does it for sure, but I think I also saw it somewhere else.


DDR4 pages ("rows") are 8kiB, with reads/writes being 64B bursts into the "row" buffer. (For a full DIMM; the following goes a bit more into detail and doesn't strictly assume full 64 data bit (+ optional ECC) wide channels.) Each bank has a single "row" buffer into which a page has to be loaded ("activate" command) before reading/writing into that page, and it has to be written back ("precharged") before a new page can be loaded or an internal bank refresh can be triggered.

"row" buffer locality is relatively important for energy efficiency, and can account for about a 2x factor in the DRAM's power consumption. Each rank has 4 bank groups of 4 banks each, with the frequency of "activate" commands to banks in the same group being restricted more severely than to separate bank groups. A rank is just a set of chips whose command/address lines are in parallel, and whose data lines are combined to get up to the typical 64/72 bits of a channel. A single chip is just 4, 8, or 16 bits wide. A channel can typically have between IIRC 1 and 12 ranks, which are selected with separate chip select lines the controller uses to select which rank the contents of the command/address bus are meant for.

Also, "activate" and "precharge" take about as long (most JEDEC standard timings have literally the same number of cycles for these 3 timings (also known as the "primary" timings)) as the delay between selecting the "column" in a "row" buffer, and the corresponding data bits flowing over the data bus (for reads and writes there may IIRC be a one-cycle difference due to sequencing and data buffers between the DDR data bus and the serdes that adapts the nominal 8-long bursts at DDR speeds to parallel accesses to the "row" buffer).

With JEDEC timings for e.g. DDR4-3200AA being 22-22-22 CL-tRCD-tRP ("column" address-to-data delay; "row"-to-"column" address delay; "precharge"-to-"activate" delay), and the burst length being 4 DDR cycles, this is far from random access.

In fact, within a bank, and even neglecting limits on activate frequency (as I'm too lazy to figure out if you can hit them when using just a single bank), with assumed-infinite many "rows" and thus zero "row" locality for random accesses, as well as perfect pipelining and ignoring the relatively rare "precharge"/"activate" pairs at the end of a 1024-columns-wide "row":

Streaming accesses would take 4 cycles per read/write of a 64B cacheline (assuming a 64-bit (72 with normal SECDED ECC) DIMM; with fewer bits, the cacheline transfer would be narrower), achieving 3200 Mbit/s per data pin. Random accesses would take 22+22+22=66 cycles per read/write, achieving 193 ³¹/₃₃ Mbit/s per data pin.
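
Spelling that arithmetic out (same assumptions as above: DDR4-3200AA, 22-22-22, one bank, no row reuse, 64-bit channel):

    #include <stdio.h>

    int main(void)
    {
        double clock_mhz     = 1600.0;       /* 3200 MT/s = 1600 MHz DDR clock       */
        double burst_cycles  = 4.0;          /* 8 transfers x 64 bits = one 64B line */
        double random_cycles = 22 + 22 + 22; /* CL + tRCD + tRP, fully serialized    */
        double bits_per_pin  = 8.0;          /* 8 transfers x 1 bit per data pin     */

        printf("streaming: %.0f Mbit/s per pin\n",
               bits_per_pin * clock_mhz / burst_cycles);     /* 3200    */
        printf("random:    %.2f Mbit/s per pin, %.2f%% efficient\n",
               bits_per_pin * clock_mhz / random_cycles,     /* ~193.94 */
               100.0 * burst_cycles / random_cycles);        /* ~6.06   */
        return 0;
    }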

So random accesses are just 6 ²/₃₃ % efficient in the worst case and assuming simplified pathological conditions. In practice, you have 16 banks to schedule your request queue over, and often have multiple ranks (AFAIK typically one rank per 4/8/16GiB on client (max 4 per channel on client, 2 optimal, with 1 having not enough banks and 3-4 limiting clock speeds due to limited transmitter power) and low-capacity servers; one rank per 16-32GiB on high-capacity servers).

There is a slight delay penalty for switching between ranks, iirc most notable when reading from multiple ones directly in sequence or possibly when reading from one and subsequently writing to another (both are pretty bad, but I don't recall which one typically has more stall cycles between transmissions; iirc it's like around 3 cycles (=6 bits due to DDR; close to one wasted burst)).


> DDR4 pages ("rows") are 8kiB

Wait, is the underlying hardware access size for memory bigger than the default Linux page size (4kB)? Wouldn't that introduce needless inefficiencies?

Can false sharing happen between different pages if they happen to be in the same row?


It's an internal implementation detail, invisible to the CPU.

https://faculty-web.msoe.edu/johnsontimoj/EE4980/files4980/m...

diagrams on pages 4-7.


Not really. Pages are about virtual memory and not about physical memory.

In practice, the only thing DDR4 banks do is make prefetching an important strategy, thus making sequential performance for DDR4 incredible.

A fact already known to high performance programmers. Accessing byte 65 after byte 64 is much more efficient than accessing byte 9001.


What you don't want is things accessed together residing in different rows of the same bank. Things being on the same row is a good thing.


Even for false sharing? That is, the problem when two unrelated atomics are allocated in the same page but accessed frequently from different threads, causing the memory to thrash between the two cores' L1 caches.
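
For concreteness, the cache-line version I mean looks roughly like this (minimal sketch; whether a DRAM-row analog exists is exactly my question):

    #include <stdatomic.h>
    #include <stdio.h>

    struct counters_bad {            /* both fields share one 64-byte line,    */
        atomic_long a;               /* so the line ping-pongs between the two */
        atomic_long b;               /* cores that increment them              */
    };

    struct counters_good {
        _Alignas(64) atomic_long a;  /* padded so each counter owns its own line */
        _Alignas(64) atomic_long b;
    };

    int main(void)
    {
        printf("bad: %zu bytes, good: %zu bytes\n",
               sizeof(struct counters_bad), sizeof(struct counters_good));
        return 0;
    }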


> commercially supported OS that's built around a unified memory model

Doesn't OS/400 work that way? (Of course then there is the question to which degree "commercially" should imply "readily open to third-party software and hardware vendors")


Yes. It uses a single-level store model. Everything gets mapped to a memory address.


The Wikipedia article on _IBM i_ (nee OS/400) is light on details, but eventually it links to https://en.wikipedia.org/wiki/Single-level_store and this part is prophetic:

> IBM's design of the single-level storage was originally conceived and pioneered by Frank Soltis and Glenn Henry in the late 1970s as a way to build a transitional implementation to computers with 100% solid state memory. The thinking at the time was that disk drives would become obsolete, and would be replaced entirely with some form of solid state memory.

...so they were literally 50-60 (maybe 70-80 at this rate) years ahead of their time.


Starting to see more complaints that the subsystems Linux is not really in control of are becoming more and more numerous.

Rumors seem to be that Oxide Computers is trying to make an OS to control many of these but I suspect the end result will be treating the case as a heterogeneous cluster. It's possible that these will continue acting cooperatively rather than anyone being under the misapprehension that they're actually 'in charge', but perhaps a different division of labor will give us some cool benefits.

Would still love to see SQLite or Postgres running directly on a storage subsystem.


See also https://www.youtube.com/watch?v=36myc8wQhLo

I'm not sure what Oxide is doing, but creating an OS to replace the firmware on all kinds of 3rd party devices sounds, well, quite quixotic. I'd guess it's more like a common OS for the firmware of their own hardware (rack controllers and whatnot).


> firmware on all kinds of 3rd party devices sounds, well, quite quixotic

You could say the same about device drivers in Linux. It was a very long process that required many, many hands. It took a long time even to cover all of the common devices let alone the more obscure ones.


Or alternatively, toss in a network interface and have drives PXE boot an application specific firmware and spin up a ceph-osd or similar server.


>"Hence, using the faster NVME protocol may seem rather pointless."

Isn't it the interface that is faster and not the protocol? PCIe vs SATA

Edit: after reading more, this article is littered with inaccuracies


It's both. Basic things like submitting a command to the drive require fewer round trips with NVMe than AHCI+SATA, allowing for lower latency and lower CPU overhead. But the raw throughput advantage of multiple lanes of PCIe each running at 8Gbps or higher compared to a single SATA link at 6Gbps is far more noticeable.


I get that, but with NVMe being designed from the ground up specifically for SSDs, wouldn't using it for an HDD present extra overhead for the controller, negating any theoretical protocol advantages?


NVMe as originally conceived was still based around the block storage abstraction implemented by hard drives. Any SSD you can buy at retail is still fundamentally emulating classic hard drive behavior, with some optional extra functionality to allow the host and drive to cooperate better (eg. Trim/Deallocate). But out of the box, you're still dealing with reading and writing to 512-byte LBAs, so there's not actually much that needs to be added back in to make NVMe work well for hard drives.

The low-level advantages of NVMe 1.0 were mostly about reducing overhead and improving scalability in ways that were not strictly necessary when dealing with mechanical storage and were not possible without breaking compatibility with old storage interfaces. Nothing about eg. the command submission and completion queue structures inherently favor SSDs over hard drives, except that allowing multiple queues per drive each supporting queue lengths of hundreds or thousands of commands is a bit silly in the context of a single hard drive (because you never actually want the OS to enqueue 18 hours worth of IO at once).
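
To make the "still fundamentally block storage" point concrete: an NVMe Read is still just "give me N logical blocks starting at this LBA", packed into a fixed 64-byte submission queue entry. A sketch from memory (worth double-checking field offsets against the spec):

    #include <stdint.h>

    /* 64-byte NVMe submission queue entry, NVM command set. */
    struct nvme_sqe {
        uint32_t cdw0;        /* opcode [7:0] (Read = 0x02), command ID [31:16] */
        uint32_t nsid;        /* namespace ID                                   */
        uint32_t cdw2, cdw3;  /* reserved for Read/Write                        */
        uint64_t mptr;        /* metadata pointer                               */
        uint64_t prp1;        /* data pointer: first PRP entry                  */
        uint64_t prp2;        /* second PRP entry, or pointer to a PRP list     */
        uint32_t cdw10;       /* starting LBA, low 32 bits                      */
        uint32_t cdw11;       /* starting LBA, high 32 bits                     */
        uint32_t cdw12;       /* [15:0] number of logical blocks, zero-based    */
        uint32_t cdw13, cdw14, cdw15;
    };

    /* A drive exposing 512-byte sectors just maps (SLBA, NLB) onto its media;
       nothing here cares whether that media is NAND or spinning platters. */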


> because you never actually want the OS to enqueue 18 hours worth of IO at once

As a thought experiment, I think there are use cases for this kind of thing for a hard drive.

The very nature of a hard drive is that sometimes accessing certain data happens to be very cheap - for example, if the head just happens to pass over a block of data on the way to another block of data I asked to read. In that case, the first read was 'free'.

If the drive API could represent this, then very low priority operations, like reading and compressing dormant data, defragmentation, error checking existing data, rebuilding RAID arrays etc. might benefit from such a long queue. Pretty much, a super long queue of "read this data only if you can do so without delaying the actual high priority queue".


When a drive only has one actuator for all of the heads, there's only a little bit of throughput to be gained from Native Command Queueing, and that only requires a dozen or so commands in the queue. What you're suggesting goes a little further than just plain NCQ, but I'd be surprised if it could yield more than another 5% throughput increase even in the absence of high-priority commands.

But the big problem with having the drive's queue contain a full second or more worth of work (let alone the hours possible with NVMe at hard drive speeds) is that you start needing the ability to cancel or re-order/re-prioritize commands that have already been sent to the drive, unless you're working in an environment with absolutely no QoS targets whatsoever. The drive is the right place for scheduling IO at the millisecond scale, but over longer time horizons it's better to leave things to the OS, which may be able to fulfill a request using a different drive in the array, or provide some feedback/backpressure to the application, or simply have more memory available for buffering and combining operations.


There are definitely use cases, but they're quite niche. Magnetic storage is still far cheaper per TB than solid state. Also, depending on workload, magnetic can handle heavy writes better. SATA is a dead man walking, with no plans for SATA IV or V.

HDD manufacturers get to keep selling their same tech with a different interface. From an end-user perspective, a drive like this lets you buy future-proof server equipment with the newer interfaces. You can take the plunge to full SSDs once the market's providing what you need.


Curiously, this appears to already exist, but hard drives implement it kind of backwards. You might find comments on https://reviews.freebsd.org/D26912 interesting.


Yes, you could have 20 drives or whatever the magic number is behind a single NVMe interface.


Could you use multiple NVMe namespaces to represent the separate actuators in a multiple-actuator drive? Would there be a benefit? Do different namespaces get separate command queues or whatever?


NVMe supports multiple queues even for a single namespace, and multiple queues are used more for efficiency on the software side (one queue per CPU core) than for exposing the parallelism of the storage hardware.

There are several NVMe features intended to expose some information about the underlying segmentation and allocation of the storage media, for the sake of QoS. Since current multi-actuator drives are merely multiple actuators per spindle but still only one head per platter, this split could be exposed as separate namespaces or separate NVM sets or separate endurance groups. If we ever see multiple heads per platter come back (a la Conner Chinook), that would be best abstracted with multiple queues.


Would this reduce the need to have expensive and power hungry RAID/HBA cards? I would assume splitting nvme/PCIe is a lot simpler than PCIe to SATA.


PCIe switches are a lot simpler and more standardized than RAID and HBA controllers, but I'm not sure they're any cheaper for similar bandwidth and port counts. Broadcom/Avago/PLX and Microchip/Microsemi are the only two vendors for large-scale current generation PCIe switches, and starting with PCIe gen3 they decided to price them way out of the consumer market, contributing to the disappearance of multi-GPU from gaming PCs.


The main reason for multi-GPU's decline is that single-GPU performance has grown much faster than both CPU and RAM speeds.

Going multi-GPU on a high-end GPU today lets you run an AAA game at 4K with all the eye candy at 150 fps rather than 110 fps, i.e. visually indistinguishable. And at the low end, it's always better performance per dollar to buy a single higher-end card.

Of course this is all orthogonal to your point of PCIe switches. Anyone thinking about multi-GPU will buy a motherboard with at least 2 PCIe 4.0 x16 slots anyway, no switches required. You can even get motherboards with 2 PCIe 5.0 x16 slots without breaking the bank.


> Anyone thinking about multi-GPU will buy a motherboard with at least 2 PCIe 4.0 x16 slots anyway, no switches required. You can even get motherboards with 2 PCIe 5.0 x16 slots without breaking the bank.

I'm pretty sure you're referring to motherboards that have two slots which are mechanically x16, but actually using the second slot borrows half of the lanes from the first slot, leaving you operating at x8+x8. This is not the same as having a full 48+ lane PCIe packet switch between the CPU and two slots each wired for x16, which was very common for high-end consumer motherboards in the PCIe gen2 era (eg. with NVIDIA's NF200 switch).

These days, you can get 32+ lanes straight from the CPU to enable two x16 slots, but that requires stepping up to a bigger CPU socket than the mainstream consumer CPU+motherboard platforms use. And I don't think you can yet buy a motherboard with two PCIe gen5 x16 slots at any price: as far as I can tell, none of the Intel Alder Lake motherboards announced so far include PCIe switches, because such switches have been cost-prohibitive for several generations now.


Yeah, you are right, I was mistaken of course. Looks like probably the Xeon Sapphire Rapids will be the first system with dual PCIe Gen5 when it's out next year. Then AMD will be a little later with Zen 4.


There is probably some server platform with 2 gen5 x16 slots and some more.

Also the Power10 systems from IBM hit GA very recently, but I won't expect those to be easy to get (in addition to their price).


Only HEDT and server platforms offer x86-64 with 32 PCIe lanes directly from the CPU. Most of those mainboards have two mechanically x16 slots, one electrically x16 until the other is populated, at which point both are electrically x8.


Not HBAs; in fact the design is pretty much directly for use with some very pricey HBAs. NVMe-oF seems the obvious target, as I don't see much benefit from "just" allowing HDDs in a U.2 bay and thus possibly standardising on a single fabric bay in server chassis.

NVMe-oF has options in the form of InfiniBand and RoCE HBA chips that can be configured to act as PCIe root complexes bifurcated over many, many links (in place of the default x16 endpoint), with minimal management logic or built-in SoC features. So a network JBOD with NVMe-oF can be essentially one chip + PCIe links to disks + support/glue logic.


I don't think so. The reason that I would use a RAID card with HDDs is to get the benefit of a battery-backed write cache, optimally performing RAID operations. I've only seen one RAID card that supports connecting to NVMe drives. It is terribly expensive and aimed at supporting the speed of flash-based NVMe drives.

If you want a battery backed write cache with flash-based NVMe, get an enterprise drive with power loss protection, and you are golden. You don't generally need the write cache to get reasonable throughput, but you can save a few microseconds in write latency.

I suspect that NVMe HDDs will have the same amount of write cache that existing SATA/SAS HDDs have. If you feel you need a RAID card for RAID and/or cache with SAS or SATA HDDs, you will probably have the same need for a RAID card with NVMe HDDs.

Both Intel and AMD support limited hardware RAID for NVMe natively on some of their CPUs. I don't think this has any of the cache benefits you normally have with a RAID card. That is, it may work well with flash-based NVMe drives, particularly those with power loss protection.

But really, if I'm using hard drives, I'd probably use ZFS with flash-based read and write cache or Storage Spaces with a flash tier and its integrated write cache on flash.


> Both Intel and AMD support limited hardware RAID for NVMe natively on some of their CPUs.

"Limited" here means the motherboard firmware understands the same array format as their proprietary software RAID drivers, so you can boot an OS from the RAID volume.


Seems like it. The article mentions a PCIe switch, but PCIe bifurcation may also be an option. (That's splitting a multiple-lane slot into multiple slots; it requires system firmware support though.)


Bifurcation has never been a thing on desktop platforms and even most entry-level (single socket, desktop-equivalent) servers don't support it. It seems to be reserved for HEDT and real server platforms. (This is of course purely a market segmentation decision by Intel/AMD).


PCIe bifurcation works fine on AMD consumer platforms. I've used a passive quad-M.2 riser card on an AMD B550 motherboard with no trouble other than changing a firmware setting to turn on bifurcation. It's only Intel that is strict about this aspect of product segmentation.


My A520 mini-ITX board supports it; can't get any more desktop than that. Although that has limited options: I think I can do either 2 x8, or one x8 and two x4. For this, it looks like each drive is expected to be x1, so you'd want one x16 to 16 x1s. It's doable, but not without mucking about in the firmware (either by the OEM, or dedicated enthusiasts), so a PCIe switch is probably advisable.


Ah I see. It seems like previously bifurcation was not qualified for anything but the X-series chipset, but in the 500 series it's qualified for all. On top of that, it seems like some boards just allowed it regardless in prior generations.

Another complication is of course that the non-PEG slots on the (non-HEDT) platforms are usually electrically only x4 or x1, so bifurcation really only makes sense in the PEG.


PCIe root ports in CPUs are generally designed to provide an x16 with bifurcation down to x4x4x4x4 (or merely x8x4x4 for Intel consumer CPUs). Large-scale PCIe switches also commonly support bifurcation only down to x4 or sometimes x2, though x1 may start catching on with PCIe gen5.

Smaller PCIe switches and motherboard chipsets usually support link widths from x4 down to x1. Treating each lane individually goes hand in hand with the fact that many of the lanes provided by a motherboard chipset can be reconfigured between some combination of PCIe, SATA and USB: they design a multi-purpose PHY, put down one or two dozen copies of it at the perimeter of the die, and connect them to an appropriate variety of MACs.


Yeah, but what I meant above was that if you only have an x4 slot electrically, sticking in an x16 -> 4x M.2 riser isn't going to do a whole lot, because the 12 lanes of 3 out of 4 slots aren't hooked up to anything. So in this scenario you'd really want a riser with a switch in it instead (which are more expensive than almost all motherboards).

So on the consumer platforms that give you two PEGs best you could do while still having a GPU is stick that riser in the second PEG and use the x8/x8 split. Now the question becomes whether the UEFI allows you to use the x8/x8 bifurcation meant for dual GPU or similar use in an x8/(x4+x4) triple bifurcation kind of setup.

Realistically this entire thing just doesn't make a lot of sense on the consumer platforms because they just don't have enough PCIe lanes out of the CPU. Intel used to be slightly worse here with 20 (of which 4 are reserved for the PCH but you know that), while AM4 has 28 (4 for the I/O hub again). On an HEDT platform with 40+ lanes though...

(When I say bifurcation I mean bifurcation of a slot on the mainboard, not the various ways the slots and ports on the board itself can be configured, though that's technically bifurcation as well (or even switching protocols).)


> Yeah, but what I meant above was that if you only have an x4 slot electrically, sticking in an x16 -> 4x M.2 riser isn't going to do a whole lot, because the 12 lanes of 3 out of 4 slots aren't hooked up to anything. So in this scenario you'd really want a riser with a switch in it instead (which are more expensive than almost all motherboards).

True; but given how PCIe speeds are no longer stalled, we may soon see motherboards offering an x4 slot that can be operated as x1x1x1x1. Currently, the only risers you're likely to find that split a slot into independent x1 ports are intended for crypto mining, and they require switches. A passive (or retimer-only) quad-M.2 riser that only provides one PCIe lane per drive currently sounds a bit bandwidth-starved and wouldn't work with current motherboards. But given PCIe gen5 SSDs or widespread availability of PCIe-native hard drives, those uses for an x4 slot will start to make sense.


Intel supported PCIe bifurcation on desktop chipsets up to Sandy Bridge. Just like Intel once supported memory parity on desktop chipsets, until deciding to use it to upsell server products.

Recent Intel chipsets were locked to Intel SSDs and only some magic hardware DRM handshake unlocked bifurcation https://www.asus.com/us/support/FAQ/1037507/ But apparently AMD competition forced their hand and at least some got unlocked https://forum.level1techs.com/t/new-bios-update-unlocks-4x-p...

On the AMD front there doesn't seem to be any problem; plenty of desktop boards have unlocked bifurcation BIOS menus.


What other connectors are coming down the pipeline or may currently be in draft specification phase?

Admittedly, I totally did not know NVMe drives were becoming a thing until a year ago, as I had been in the laptop-only space for a while or didn't need to optimize storage speed when connecting an existing drive to a secondhand motherboard.

I like being ahead of the curve and am now curious what's next.


In the server space, SAS, U.2 and U.3 connectors are mechanically compatible with each other and partially compatible with SATA connectors. U.3 is probably the dead end for that family, but they won't disappear completely for a long time.

Traditional PCIe add-in cards (PCIe CEM connector) are still around and also not going to be disappearing anytime soon, but are in decline as many use cases have switched over to other connectors and form factors, particularly for the sake of better hot-swap support.

M.2 (primarily SSDs) is in rapid decline in the server space. It may hang on for a while for boot drives, but for everything else you want hot-swap and better power delivery (12V rather than 3.3V).

The up and coming connector is SFF-TA-1002, used in the EDSFF form factors and a few other applications like the latest OCP NIC form factor. Its smaller configurations are only a bit larger than M.2, and the wider versions are quite a bit denser than regular PCIe add-in card slots. EDSFF provides a variety of form factors suitable for 1U and 2U servers, replacing 2.5" drives. The SFF-TA-1002 connector can also be used as direct replacement for the PCIe CEM connector, but I'm not sure if that's actually going to happen anytime soon.

I haven't seen any sign that EDSFF or SFF-TA-1002 will be showing up in consumer systems. Existing solutions like M.2 and PCIe CEM are good enough for now and the foreseeable future. The older connectors sometimes need to have tolerances tightened up to support higher signal rates, but so far a backwards-incompatible redesign hasn't been necessary to support newer generations of PCIe (though the practical distances usable without retimers have been decreasing).


>I like being ahead of the curve and am now curious whats next

Others correct me if I am wrong.

NVMe in itself is an interface specification; people often use the term NVMe when they mean M.2, the connector.

You won't get a new connector in the pipeline. But M.2 is essentially just 4-lane PCI Express, so it gets faster every time PCIe does: currently 4.0, with 5.0 around the corner, 6.0 in final draft, and 7.0 possibly within the next 3-4 years. So you can expect 14GB/s SSDs soon, 28GB/s in ~2024, and 50GB/s within this decade, assuming we can somehow get SSD controller power usage down to a reasonable level.
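
Back-of-the-envelope numbers behind that, using raw per-lane line rates (16/32/64 GT/s for 4.0/5.0/6.0, and assuming 7.0 doubles again); encoding and protocol overhead is why shipping x4 SSDs land a bit below these figures:

    #include <stdio.h>

    int main(void)
    {
        const char *gen[] = { "4.0", "5.0", "6.0", "7.0" };
        double gt_per_s[] = { 16, 32, 64, 128 };   /* raw per-lane line rate */

        for (int i = 0; i < 4; i++)
            printf("PCIe %s x4: ~%.0f GB/s raw\n", gen[i], gt_per_s[i] * 4.0 / 8.0);
        return 0;
    }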


> But the M.2 is essentially just 4 lane PCI-E express

M.2 is a bit more complex[1] than that, sadly.

[1]: https://en.wikipedia.org/wiki/M.2#Form_factors_and_keying


Hmm, insightful. Yes, I recall noticing that connector/specification thing when I was trying to get the right-size cards; I had figured NVMe and M.2 were just synonyms, but I see the cause for the correlation now.

So the NVMe card I added to an old motherboard’s PCI-E slot is really just PCI-E on PCI-E? yo dawg


> So the NVMe card I added to an old motherboard’s PCI-E slot is really just PCI-E on PCI-E?

Assuming that's a PCIe to M.2 adapter card, it's just rearranging the wires to a more compact connector. There's no nesting or layering of any protocols. Electrically, nothing changed about how the PCIe signals are carried (though M.2 lacks the 12V power supply), and either way you have NVMe commands encapsulated in PCIe packets (exactly analogous to how your network may have IP packets encapsulated in Ethernet frames).


What I'm wondering is why this took so long. Aren't there already NVMe cards that can serve SATA/PATA drives? If not, why not? If so, why this?


Does this have the same reliability as the rest of Seagate's lineup? Or is this actually something that isn't a ripoff?


Obligatory reminder that there are only three hard disk vendors and all of them have made bad drives at one time.


To be specific: Seagate, Toshiba and Western Digital.


What about Samsung?


Seagate took over Samsung's HDD division in 2011: https://www.seagate.com/news/news-archive/seagate-completes-...


What about OWC?

Are they all repackaged from those 3?


Do they even pretend to make hard drives? Not in their shop at least: https://eshop.macsales.com/shop/hard-drives/3.5-SerialATA. Perhaps they do produce their own SSDs, as do many companies, since that's trivial in comparison. Definitely not hard drives, though.


I didn't know SSDs were trivial to build compared to HDDs; I suspected it for a bit after reading your original comment.

So that explains why only 3 companies make HDDs.


SSDs are essentially just some chips (controller, flash memory, and often some DRAM cache) wired together, whereas mechanical drives have many moving parts operating within microscopic tolerances.


For SSDs the hard part is the controller and NAND and there are only a half-dozen companies making those parts. 90% of SSD brands are the same Phison controllers under the hood.


Only Seagate has shipped firmware that bricks an entire drive.


Hey, those were fixable if you ran a UART to the jumpers, if this is the right one. Not entirely a brick!

Also, firmware mistake = disk disappears from the bus is pretty common in enterprise SSD land. HPE did it twice, I've seen it on Intel drives, and I'm sure there are others I'm not remembering. Maybe Seagate was trying to be more enterprise... and at least their drives only failed to start up IIRC, if you were running them continuously it was fine (unlike the SSDs)


I don't recall the vendor that Dell used, might have been intel, but I had NVMe drives that would displace clocks for half the connector on reboot unless you did a full powercycle...


Only certain models supported the fix via UART.


Have you got a link to back this up/allow the curious to read up on it without having to search?


It's so well-known that I wonder about your intentions, but just in case...

https://en.wikipedia.org/wiki/Seagate_Barracuda#Firmware_bug

https://en.wikipedia.org/wiki/Seagate_Barracuda#Firmware_bug...

https://en.wikipedia.org/wiki/Seagate_Barracuda#Firmware_bug...

https://en.wikipedia.org/wiki/Seagate_Barracuda#Firmware_bug...

https://news.ycombinator.com/item?id=27419342

...and then there's this one which is infamous enough to have its own Wikipedia page:

https://news.ycombinator.com/item?id=27419072

I don't think I've seen something this bad since the IBM DeathStars.


I'm sure the first one they'll put out will be capable of 5400 rpm haha.


The article is written as if by someone who barely knows technology and is writing for people who don't know whether the monitor or the box it's attached to is the computer.

"Special drivers"? What year is it? Is "the datacenter" run on Windows ME?

Seriously, PCMAG, try a little harder.


> "Special drivers"

The only mention of drivers I can find in the article is an entirely correct reference to the fact that SAS and SATA HBAs frequently require proprietary (or at least vendor-specific) drivers. I'm not aware of any 8+ port SATA HBAs that use the vendor-neutral standard AHCI interface.



