NVMe, the fast future for SSDs (pcworld.com)
139 points by ryan_j_naughton on April 5, 2015 | 59 comments



The numbers in the article are all wrong. SATA's ceiling is 600 MBps (megabytes per second), not 600 Gbps (gigabits per second). SAS goes up to 12Gbps, not 12GBps; 12Gbps is the same thing as 1.5GBps. At least the PCIe numbers look right.


Yes they are all wrong.

6 Gbps is 600 MBps (the capital B is supposed to indicate 'bytes' versus 'bits'); the encoding is 8b/10b, which is 10 bauds per 8-bit byte.

PCIe 2.0 has 2.5Gbps "lanes", PCIe 3.0 has 5Gbps "lanes", and they can be ganged together for additional bandwidth (x1, x2, x4, x8, x16). It is also 8b/10b, so you divide by 10 to get bytes per second (250MBps/500MBps).

Both SATA and PCIe have a 'transaction limit', a function of the controller, which caps the total number of operations per second (IOPS). The product of the IOPS and the size of each transaction can never exceed the bandwidth of the channel, but it is often well under it. For example, a typical SATA disk controller (prior to the popularity of SSDs) would do about 25,000 IOPS, and with 512 byte (0.5K) block reads and writes you could read and write 25,000 * 0.5K, or about 12.5 MBps, which was much lower than the theoretical 300MBps bandwidth of a 3Gbps SATA II channel. Optimizing channel utilization requires figuring out how many IOPS your OS/controller can initiate and then sizing the payload to consume the max bandwidth: with large payloads you push IOPS down, and with smaller payloads you won't use all the bandwidth.
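
Roughly, as a back-of-the-envelope model in Python (the numbers are illustrative, not measurements):

    # Throughput is capped by both the controller's IOPS limit and the
    # channel bandwidth; whichever you hit first is your ceiling.
    def throughput_MBps(iops, transfer_bytes, channel_MBps):
        return min(iops * transfer_bytes / 1e6, channel_MBps)

    print(throughput_MBps(25_000, 512, 300))        # ~12.8 MBps: IOPS-bound (the ~12.5 MBps case above)
    print(throughput_MBps(25_000, 64 * 1024, 300))  # 300 MBps: channel-bound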

One of the nicer aspects of ATM was that it was designed and specified for full channel utilization with 64 byte packets which made it possible to reason about the performance and latency of an arbitrary number of streams of data moving through it.


  > For example, a typical SATA disk controller (prior to
  > the popularity of SSDs) would do about 25,000 IOPS, and
  > with 512 byte (0.5K) block reads and writes you could
  > read and write 25,000 * 0.5K, or about 12.5 MBps, which
  > was much lower than the theoretical 300MBps bandwidth
  > of a 3Gbps SATA II channel.
That doesn't seem to jibe with reality. What am I missing here? SATA HDDs would regularly hit 100MBps in sequential transfers.

To pick a 2009-era HDD review/benchmark at random that illustrates this: http://www.storagereview.com/western_digital_scorpio_black_5...


Oh, you can get faster throughput with longer reads: to get 100MBps on a 3Gbps channel you simply increase the read size until you've maxed out the bandwidth you can get.

So when characterizing a typical SATA drive you would start with 4K sequential reads and work up until your bandwidth hit the channel bandwidth or stopped going up (which would be the disk's own bandwidth). Unless you ran across a reallocated sector, many SATA drives could return data at a rate of 100MBps with 1MB reads, or even smaller read sizes if you had command caching available. Random r/w was an issue of course because of head movement (it burns your IOPS rate while waiting for the heads to change tracks).

You can do these experiments with iometer[1]; there was a great paper out of CMU about illuminating the inner workings of a drive by varying the workload[2]. Well worth playing with if you're ever trying to get the absolute most I/O out of a disk drive.

[1] http://www.iometer.org/

[2] http://repository.cmu.edu/cgi/viewcontent.cgi?article=1136&c...
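
Roughly what such a read-size sweep looks like as a quick-and-dirty Python sketch (assumes Linux, Python 3.7+, and a path you're allowed to read; a raw device like /dev/sdb needs root, and real tools like iometer or fio handle queue depth, alignment and caching far more carefully):

    import mmap, os, sys, time

    def sweep(path, sizes=(4096, 65536, 1 << 20), total=64 << 20):
        # O_DIRECT (where available) bypasses the page cache so you measure the device.
        fd = os.open(path, os.O_RDONLY | getattr(os, "O_DIRECT", 0))
        try:
            for size in sizes:
                buf = mmap.mmap(-1, size)   # page-aligned buffer, required for O_DIRECT
                start, offset, done = time.perf_counter(), 0, 0
                while done < total:
                    n = os.preadv(fd, [buf], offset)
                    if n <= 0:              # hit EOF
                        break
                    offset += n
                    done += n
                elapsed = time.perf_counter() - start
                print(f"{size:>8} B reads: {done / elapsed / 1e6:7.1f} MB/s, "
                      f"{done / size / elapsed:9.0f} IOPS")
        finally:
            os.close(fd)

    if __name__ == "__main__":
        sweep(sys.argv[1])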


Disk controllers since the earliest ATA/IDE have been able to read or write up to 256 sectors (128KB) at a time with one command, and apparently with LBA48 that was expanded to 64K sectors (32MB!), so it's quite possible to saturate the bandwidth of the hardware. The real bottleneck is at the filesystem level and above: this is only realised when reading/writing large amounts of data sequentially at once.
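
The arithmetic behind those per-command limits, for reference:

    SECTOR = 512
    print(256 * SECTOR)     # 131072 bytes = 128 KB per legacy ATA command
    print(65536 * SECTOR)   # 33554432 bytes = 32 MB per LBA48 command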


PCIe 3.0 uses 128b/130b line encoding, not 8b/10b.


Thanks for the update. So closer to 586MBps for full bandwidth.


PCIe2 has 5Gbps lanes. PCIe3 has 8Gbps with updated encoding.


Didn't ATM use 53 bytes/cell?


The author seems to be mixing these around randomly without really knowing what they mean. Another example:

"While a single SATA port is limited to 600Gbps, combining four makes for 2.4GBps of bandwidth."

600Gbps * 4 = 2400Gbps = 3GBps

Maybe he thinks a byte is 10 bits or something?

It's also really odd that he's using Bps at all. I've never seen MBps anywhere other than this article. Usually it's Mbps and MB/s.


Since SATA uses 8b/10b encoding, the data rate in bytes/s is actually 1/10th of the raw bit rate. The same applies to earlier PCIe versions. 6Gbps SATA can transfer 600MB/s (ignoring protocol overhead).

I don't really know why it's become standard to publish raw bit rates in bits/s and data rates in bytes/s with the line encoding taken into account but not the protocol overhead, but those are the two kinds of numbers you almost always see quoted nowadays. Raw bit rates at least map pretty directly to clock speed, and I guess protocol overhead must be too variable and too complicated for most people to bother explaining.
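
A tiny helper for the conversion being described, ignoring protocol overhead just like the commonly quoted numbers:

    # Encoding overhead: 8b/10b (SATA, PCIe 1.x/2.x) vs 128b/130b (PCIe 3.x).
    def data_rate_MBps(line_rate_gbps, encoding="8b/10b"):
        payload_fraction = {"8b/10b": 8 / 10, "128b/130b": 128 / 130}[encoding]
        return line_rate_gbps * 1e9 * payload_fraction / 8 / 1e6  # bits -> bytes, G -> M

    print(data_rate_MBps(6))                  # SATA 3: ~600 MB/s
    print(data_rate_MBps(8, "128b/130b"))     # PCIe 3.0 lane: ~985 MB/s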


2400Gbps = 3GBps

going from "b" to "B" is either 8 or 10 fold. As some of the other comments have noted. SATA uses 10 bits per byte.

so 2400 Gbps = 300 GBps (8 bit)

or 2400 Gbps = 240 GBps (10 bit)


PCI-Express, SAS, SATA, and many other protocols use an 8b/10b encoding — encoding 8-bit bytes in 10-bit words.

https://en.wikipedia.org/wiki/8b/10b_encoding


While the author clearly doesn't have a grasp of the bandwidth limitations of various interconnects, the size of a byte is hardware dependent. The de facto standard is 8 bits, but that's just what everyone tends to pick; that's why, for instance, 'octet' is sometimes used instead of 'byte' when exactly 8 bits is meant.


If people would just use Mbits/sec, MBytes/sec, then the only confusion that would be left is whether they actually meant 10^6 or 2^20 when they say Mega.
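
For what it's worth, the size of that remaining ambiguity:

    print(2**20 / 10**6)   # 1.048576: about a 5% gap at "mega"
    print(2**30 / 10**9)   # 1.073741824: about a 7% gap at "giga"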


Apparently it was already corrected once, but it's still chock-full of mistakes.


NVMe is one of the most important changes to storage over the past decade.

I'm currently replacing all our SANs with storage servers filled with NVMe SSDs (as the tier 1 storage, commodity SATA SSDs for second tier). I've posted the link to my first blog post about it in the comments on another NVMe post in the past: http://smcleod.net/building-a-high-performance-ssd-san/

I'm close to writing the next post about the actual build, my findings, benchmarks, etc. Hopefully I'll have that done next week, but the system comes first.

I'm a little disappointed with this article as I think it could do with a) some technical review and b) some more detailed information.


Awesome read; can't wait to see your results on performance, etc.


Here's a good review with lots of pics and benchmarks: http://www.pcper.com/reviews/Storage/Intel-SSD-750-Series-12...


One of the fundamental problems with SSDs is the impedance mismatch introduced by emulating HDDs. NVMe doesn't appear to help with that at all.

We need an interface that allows us to bypass the FTL and access the underlying erase blocks.


NVMe is about as simple an interface to a block device as it gets: read blocks, write blocks, and a few health/diagnostic commands. Commands and geometries that originated with spinning disks have been completely excised.

I'm not sure that I'd want the protocol to get involved in the intricacies of backing store housekeeping as you propose. Newer generations of SSDs may not even have erase blocks or translation layers; do we want to have yet another protocol when the technology changes?


But these details do matter when you want to achieve maximum performance and they also matter when you want to figure out what went wrong with the device when it fails.

The hiding that currently happens in the block interfaces (HDD, SSD & NVMe) definitely allows for easy integration and lets things work pretty well for most cases, but it prevents getting full performance from the device and also obstructs diagnostics when things fail.


Is there actually any evidence that this would improve performance significantly? Removing the translation layer means that every OS and their file systems have to do good wear leveling because otherwise you'll destroy blocks quickly.


That's pretty easy to do. Just use a log-structured filesystem[0]. The abstraction we use now is antiquated. It's very much reminiscent of the impedance mismatch in graphics APIs (such as OpenGL) which is now being solved (with Vulkan).

[0] https://en.wikipedia.org/wiki/Log-structured_file_system
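
A toy of that idea in Python, if it helps picture it: every write is an append, an in-memory index maps keys to offsets in the log, and stale records become garbage to be reclaimed later by rewriting live data into a new log (much like an FTL juggling erase blocks today). The file name and record format here are made up for the sketch.

    import os, struct

    class TinyLog:
        def __init__(self, path="kv.log"):
            self.path, self.index = path, {}
            self.f = open(path, "ab+")

        def put(self, key: bytes, value: bytes):
            self.f.seek(0, os.SEEK_END)
            offset = self.f.tell()
            self.f.write(struct.pack(">II", len(key), len(value)) + key + value)
            self.f.flush()
            self.index[key] = offset        # newest record wins; the old one is garbage

        def get(self, key: bytes) -> bytes:
            self.f.seek(self.index[key])
            klen, vlen = struct.unpack(">II", self.f.read(8))
            return self.f.read(klen + vlen)[klen:]

    log = TinyLog()
    log.put(b"a", b"1"); log.put(b"a", b"2")
    print(log.get(b"a"))                    # b'2'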


I think this is the idea behind FusionIO: doing the translation on the host CPU is faster, and allows you to expose additional commands (like atomic writes, or direct key-value interfaces).


It's hard to come up with evidence since there are very few options for creating your own SSD/NVMe firmware for a real-world-like device. The closest I've found is OpenSSD, and that required $3000 and targeted an old controller with very little documentation behind it.


I work on SSD firmware. There are lots of restrictions and algorithms involved in using NAND and presenting reliable storage to the end user, and many of those algorithms are tuned to the specific NAND. The FTL basically hides the ugly details.


Can you give some specific restrictions? Like sequential page programming within blocks, or 'LSB/MSB' things?



Why?


So the important question is, how does this affect how we write applications?

Is the api different, or are we still reading/writing disk files?

Should we do memory mapping of the files or not?

Should we parallelize access to different sections of big files? Or write a ton of small files?

How does this affect database design? Current big data apps emphasize large append-only writes and large sequential reads (think LSM trees). Does this make sense any more?

What does disk caching mean in the context of these new drives?


The API is the same. Memory map if you prefer that access style and it suits your OS/language preferences; the OS will almost certainly not let you map straight across into the device's PCI memory-mapped window, so you'll incur a copy-to-userspace penalty either way. Benchmark. There is probably no longer any advantage to sequential writes, but you still have per-syscall and per-IOP overhead, so one large write will be faster than N small ones. Disk caching is still there and still decreases latency, but it isn't so critical.
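
A minimal sketch of the two access styles in Python (the file name is a placeholder; benchmark on your own workload, as above):

    import mmap

    path = "data.bin"  # placeholder

    # Style 1: explicit read()s. Each call is a syscall, so one large read
    # beats many small ones even on NVMe.
    with open(path, "rb", buffering=0) as f:
        chunk = f.read(1 << 20)          # one 1 MiB read instead of 256 x 4 KiB reads

    # Style 2: memory mapping. The kernel pages data in on demand; you still
    # pay for page faults and the copy into your address space.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        header = m[:4096]                # touching the bytes triggers the actual I/O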


I can remember putting ISA "hardcards" in my 286s and similar. Full circle!


But did they conform to the LIM[1] spec? :-)

They also had RAM drives you could buy. The point, then as now, is that increasing the "high performance" working-set space of a program increases the amount of transactional data that can be "in flight" during an operation, and that increases the overall size of the data set you can work with.

I've been waiting for these boards to come down in price for about 4 years now. I started talking with Intel about them early on (we used their X25-M SSDs because it was a price point for flash that was "enough" better than spinning rust that it made sense), and they insisted on trying to sell us the same flash chips on a PCIe card for 10x the dollars. I (and many others, apparently) refused to pay that. Sure, if you have a 'cost is no object' database or something, but for a large internet working set where revenue differences are measured in cents per thousand transactions? Not so much. I know one company that went so far as to design and build their own PCIe flash card. I have heard it did great stuff for them.

[1] LIM - Lotus-Intel-Microsoft spec for extended memory on IBM PC compatible machines.


Hardcards have nothing to do with LIM.


Wow, so much for relying on my memory.

http://en.wikipedia.org/wiki/Expanded_memory vs

https://books.google.com/books?id=KjwEAAAAMBAJ&pg=PA61&lpg=P...

I was thinking of the plug-in expanded memory cards rather than the plug in hard drive cards.


I'm just happy that for once we got to see standards converge (SATA PHY ditched for PCIe) rather than proliferate.

BTW, does anyone know if NVMe uses the ATA command set for side channel stuff like configuring encryption and the like?


No, I don't think so; it has a new command set.


The NVMe command set is pretty close to the existing SCSI spec though. In case of security the commands are the SCSI Security Protocol IN/OUT.

Shameless plug: if you want to work with this stuff, check out purestorage.com/jobs


Hah, I wonder if it'll come full circle and we'll get to see AT commands sent over SCSI (all inside of NVMe). Revenge for ATAPI :P


There is already a standard for doing exactly that: passing through ATA commands wrapped in SCSI. This can come in handy when a SATA drive sits behind some bridge that speaks SCSI.


Another cool thing about NVMe is the smaller command set -- only about 10 commands (excluding admin commands), vs the 200+ that SCSI has grown to over the years. It's fairly quick to learn.
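
For a sense of scale, the NVM I/O command set looks roughly like this (opcodes recalled from the NVMe 1.x spec, reservation commands omitted; double-check against the spec before relying on it):

    # NVM command set I/O opcodes (admin commands are a separate, also small, set).
    NVME_IO_COMMANDS = {
        0x00: "Flush",
        0x01: "Write",
        0x02: "Read",
        0x04: "Write Uncorrectable",
        0x05: "Compare",
        0x08: "Write Zeroes",
        0x09: "Dataset Management",   # TRIM/deallocate lives here
    }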


Wasn't someone here saying these SSDs consume an abnormal amount of power to achieve those (2x?) faster speeds than current PCIe SSDs? Or was that applicable just to Intel's SSDs?


The Intel 750 that was linked on here earlier in the week does use more power than you would expect for a 'consumer' part. The stated reason was that it uses an 18 channel controller - apparently the same one they use in their higher-end enterprise offerings - and is not a low power part.

From [0]:

"The controller is the same 18-channel behemoth running at 400MHz that is found inside the SSD DC P3700. Nearly all client-grade controllers today are 8-channel designs, so with over twice the number of channels Intel has a clear NAND bandwidth advantage over the more client-oriented designs. That said, the controller is also much more power hungry and the 1.2TB SSD 750 consumes over 20W under load, so you won't be seeing an M.2 variant with this controller."

[0]: http://anandtech.com/show/9090/intel-ssd-750-pcie-ssd-review...


Samsung claims the opposite: NVMe SSD provides low energy consumption to help data centers and enterprises operate more efficiently and reduce expenses. Power-related costs typically represent 31% of total data center costs, with the memory and storage portion of the power (including cooling) consuming 32% of the total data center power. NVMe SSD requires lower power (less than 6W active power) with energy efficiency (IOPS/Watt) that is 2.5x as efficient as SATA.

http://www.samsung.com/global/business/semiconductor/product... http://www.pcworld.com/article/2866912/samsungs-ludicrously-...


And as far as I understand, isn't NVMe SSD just a "different name" for PCIe SSD? PCIe being the protocol (that's already being used for graphics cards), and NVMe the standard for SSDs to understand that protocol.


There's more to it than that. NVMe is a higher layer technology than PCIe.

SATA drives connect to the host system over a SATA PHY link to a SATA HBA that itself is connected to the host via PCIe. The OS uses AHCI to talk to the HBA to pass ATA commands to the drive(s).

PCIe SSDs that don't use NVMe exist, and work by unifying the drive and the HBA. This removes the speed limitation of the SATA PHY, but doesn't change anything else. The OS can't even directly know that there's no SATA link behind the HBA; it can only observe the higher speeds and 1:1 mapping of HBAs to drives. Some PCIe SSDs have been implemented using a RAID HBA, so the speed limitation has been circumvented by having multiple SATA links internally, presented to the OS as a single drive.

NVMe standardizes a new protocol that operates over PCIe, where the HBA is permanently part of the drive, and there's a new command set to replace ATA. New drivers are needed, and NVMe removes many bottlenecks and limitations of the AHCI+ATA protocol stack.
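
One concrete example of the removed limitations is the queueing model. The figures below are the spec maximums as I recall them (real NVMe devices usually expose far fewer queues):

    QUEUE_MODEL = {
        "AHCI + SATA (NCQ)": {"queues": 1,      "commands_per_queue": 32},
        "NVMe":              {"queues": 65_535, "commands_per_queue": 65_536},
    }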


Not quite. PCIe is just the bus transport protocol. PCIe SSDs still need a protocol to describe data operations between the OS and the drive. Currently they use the same protocols as those designed for HDDs, and those protocols make a lot of tradeoffs and assumptions about disk access times. The reason NVMe is exciting is that it is a brand-new data protocol, designed with low-latency, SSD-style storage in mind. We're already seeing speeds much higher than existing PCIe SSDs can manage (3+ GB/s!)


Most current PCIe SSDs behave to the PCI bus & operating system like an AHCI controller with a SATA SSD attached to it. Advantage of this is that it's well supported by firmware (BIOS/EFI) and operating systems (stock drivers). Such SSDs inherit the limitations of that architecture, which was designed for spinning platters. NVMe is a fresh start in that regard.


PCIe and NVMe are both full-fledged protocols, one on top of the other. Previous products used custom protocols on top of PCIe.

PCIe is packet level, NVMe is a block device interface.


That's just a design decision for those models. It's still PCIe; nothing about the interface is inherently causing wasted power.


Last time I looked at the SATA PHY it seemed pretty wasteful. During every transaction the back channel was running at full speed blasting "R_OK R_OK R_OK" to the transmitting party. Not a checksum, mind you, just a bunch of magic DWORDs indicating that the drive was receiving (it would use a checksum too once the transmission was over, of course). I guess it's an easy way to keep the PLLs locked but it seemed like a pretty huge piece of low-hanging fruit towards the goal of reducing power consumption.

In comparison, PCIe is a much more sophisticated serial protocol. It's an entire damned packet network with addresses, subnets (so to speak), routing, retry, a credit system for bandwidth sharing. Unlike ethernet, which typically tops out at achieving ~75% of its theoretical bandwidth, PCIe typically tops out at ~95% of its theoretical bandwidth. Crazy stuff.
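
A back-of-the-envelope way to see where a ~95% number can come from, counting only approximate per-TLP overhead and ignoring DLLP/flow-control traffic, so treat it as purely illustrative:

    # Assume ~24 bytes of per-packet overhead (framing + sequence number +
    # 12-16 byte header + LCRC) around each data payload.
    def pcie_payload_efficiency(max_payload_bytes, overhead_bytes=24):
        return max_payload_bytes / (max_payload_bytes + overhead_bytes)

    for mps in (128, 256, 512):
        print(mps, f"{pcie_payload_efficiency(mps):.0%}")   # ~84%, ~91%, ~96%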

I'd bet good money that the PCIe PHY can beat the crap out of the SATA PHY on energy/bit and that the disparity will only widen with time.


I'm usually getting 95%+ with gigabit ethernet. More than that with jumbo frames. Well, not with Realtek, but Intel and Broadcom chipsets are just fine.


SFF-8639 cables look pretty cool! Reminds me of what Apple/Intel was trying to do with thunderbolt. I can imagine some creative people will find other uses for this connector.


Why not simply make an SSD controller that has a Thunderbolt port? Since Intel is building Thunderbolt into its support chips now, this seems like a good way to get quick performance and plenty of bandwidth without having to come up with a new standard, separate drivers, etc. Thunderbolt ports could be put on motherboards fairly easily, etc.

Is there something I'm missing?

Plus this would have the advantage of driving down prices for thunderbolt and increasing adoption.


Isn't Thunderbolt just externalized PCI Express?


Thunderbolt is PCIe and DisplayPort.


Why Thunderbolt? Its license fees are the main barrier.


Is this relevant for consumers? SSDs were a huge improvement for everyday computing; would NVMe be a similar jump?



