Why Intel is adding instructions to speed up non-volatile memory (danluu.com)
118 points by joe_bleau on Nov 6, 2014 | 43 comments



There are several storage class memories that are nearing commercialization. Intel is betting big on at least one of them. Most technologies in this class are orders of magnitude faster and have orders of magnitude better endurance than flash memory, while being only slightly slower than DRAM, yet non-volatile.

It is plausible that with another layer of in-package cache they could eliminate DRAM altogether, replacing it with ultrafast NVM. Imagine the resume/suspend speed and power savings of a machine whose state is always stored in NVM.


> There are several storage class memories that are nearing commercialization.

I'm very interested in this. Could you point out which technologies are close to ready for commercialization?

My understanding is that the current cost is orders of magnitude higher per unit of storage for these new technologies compared to NAND flash or even DDR3 RAM. But of course, a dedicated fab could change that very quickly.


Well, nvDIMMs are available right now (from companies like Netlist, Agigatech, Viking, Smart, Micron). This is DRAM with an analog switch, a controller, and flash memory: when you lose power, the DRAM is disconnected from the processor and its contents are copied to the flash. The newer technologies might be cheaper, but as far as I know their write performance is still not as good as DRAM's.

The issue is the cache: the data is not non-volatile until it has been written back to DRAM. Even then, you need some advance warning of a power outage for it all to work.

Unibus (bus for PDP-11 core memory systems) had an early warning signal, to give the memory controller a chance to write back the previous (destructive) read.


Components are available on the market now based on PCM, MRAM, and FRAM. I know that Intel has large productization (not research) teams working on a variant of SCM. "Near" means 2-3 years, though: going from research exit to market-ready is always a 3-5 year cycle when process engineering is involved.


Is this basically memristors coming to market or are memristors still a few years off?


This should be useful for any type of NVRAM, be it battery-backed DRAM, MRAM, memristors or DMA-mapped flash.


I've heard predictions that a significant portion of new x86 servers will be using non-volatile memory within the next 5-7 years.

Memory is becoming the new disk. This could have major security implications, as memory contents are unencrypted in general.

Fortunately, Intel CPUs will have hardware support to encrypt SGX enclaves. Perhaps that support can be used for general memory access as well.


If non-volatile memory is becoming the new disk, why is it any more or less likely to be encrypted than current disk storage (mostly not, as far as I've seen)?


Long story short, memory bandwidth is much faster than the best x86 crypto implementations can handle.

Encrypting disks or network is no problem today, but we'll need architectural changes to support full memory encryption without a performance hit.


This could be done mostly transparently, with the encryption in the memory controller. Addresses and data are already scrambled with a (non-cryptographic) scrambling code for EMI reasons. Of course, a sufficiently fast hardware crypto core would be required.

EDIT: Also, I forgot that the last generation of consoles (and I assume the current) have transparent encryption of main memory.


Indeed, there is discussion of encrypted memory on the controller in Risc-V at http://lowrisc.org


How do you square that with the performance of the AES-NI instructions? That is theoretically 16 bytes per cycle from the manual. Per core. That is way in excess of memory bandwidth, even with DDR4.


The theoretical maximum for current chips is less than 16 bytes per cycle. On Haswell you can process (in parallel) 7 blocks in roughly the time it would take to process 1: the latency of each AESENC round is 7 cycles, so a full 10-round AES-128 takes ~70 cycles, and you can effectively process at most 1.6 bytes per cycle (7 × 16 bytes per ~70 cycles), or 1.14 with 256-bit keys and their 14 rounds (ignoring the cost of key scheduling and overhead here).

Even if you dedicate all CPU cores to the task of encrypting memory, you still fall short of theoretical memory bandwidth by quite a bit.


Do you believe it's reasonable to assume that AES performance will remain constant over the same 5-7 year timeframe? That's at least a couple of hardware generations for an improvement they could make in the current generation if there was a market for it.


There is certainly room for improvement, but I don't see a 16x speedup happening on a 5-year horizon using the current AES-NI instruction set.


Ah, I was forgetting about rounds, you're correct that you won't be able to match the memory bandwidth then.


The VIA C7 AES implementation could keep up with memory (ca. 20Gb/s). With suitable cipher modes you can use multiple pipelined units in parallel with negligible overhead.


Or fast, strong, pipelined hardware encryption.

AES is not the best you could do there.


three words for you: cold boot attacks


No.

Remanence attacks are pointless against non-volatile media. You use them against volatile media in a physical attack in an attempt to sneak under/manipulate the limits of that volatility to cause violations of security assumptions, such as "the keys are in RAM" (true) > "RAM is instantly volatile on shutdown" (not quite true) > "keys are instantly zeroised on shutdown" (not this easily they're not).

Some RAM is much more volatile than conventional bulk SRAM or DRAM (for example, L1/L2 caches on CPUs are frequently impractical to exploit). Properly encrypt bulk data held in high-remanence or non-volatile RAM with a key held in such low-remanence RAM, and your security problem is solved.


That still doesn't answer the question. If you treat non-volatile memory as a disk, then the data would never touch it unencrypted, so a cold boot attack is useless against the non-volatile memory. Of course, you could still launch a cold boot attack on the volatile memory, but we can do that already.


Computing really hasn't figured out how to handle non-volatile memory as yet. It's almost always used to emulate rotating disks, with file systems, named files, and a trip through the OS to access anything. Access times for non-volatile memory are orders of magnitude faster than disk access times, so small accesses are feasible. But that's not how it's treated under existing operating systems.

There are alternatives. Non-volatile memory could be treated as a key/value store, or a tree, with a storage controller between the CPU and the memory device. With appropriate protection hardware, this could be accessed from user space through special instructions. That's what I thought this article indicated. But no. This is just better cache management for the OS.


There have been systems where everything is memory mapped and disks are just used to emulate more memory.

It's called "single-level store" in the System/38 and descendants. File access in Multics was all memory mapped.

There's nothing inherently rotating-disky about current filesystem APIs from the user point of view, as they just provide a database interface with a certain type of namespace system for access. The block-level part is largely invisible to FS users (modulo leaky abstractions).


It is already treated as a k/v store, where the key is an LBA and the value is a 512/4096-byte block. The OS builds everything else (i.e. filesystems) on top of that. Applications can already access the raw k/v store directly if they wish (open /dev/sd? directly, permissions allowing).


This is not (specifically) for the OS. This is for non-volatile memory that is directly attached to the memory bus. The OS can then directly map NVRAM into the address space of a user-space process; the application could use these instructions to efficiently ensure the crash consistency of its persistent data.
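What that application-level "persist" primitive might look like is sketched below, using the older CLFLUSH (available on any SSE2 CPU) as a stand-in; CLWB performs the same write-back without evicting the line, which is the whole point of the new instruction:

```c
#include <emmintrin.h>  // _mm_clflush, _mm_sfence
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  // assumed line size

// Force stores in [addr, addr+len) out of the CPU cache so they
// reach (non-volatile) memory, then fence so later stores are
// ordered after the flushes.
static void persist(const void *addr, size_t len) {
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
}
```

Crash consistency then falls out of ordering: write the record, persist(record), set a valid flag, persist(flag) — the two persists ensure the flag can never reach NVRAM before the record it guards.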


Well, this is about userspace access to nvram, with the nvram mapped as memory. It just so happens that cache management is one of the hard parts of doing that, so that's what these new instructions are for.


Don't forget that core memory existed before rotating disks. The first internet routers were shipped with their program already loaded.


You can't use current flash chips that way, because of their limited write endurance.


Also, current flash memories do not allow single-address writes. The write endurance problem could at least be addressed by adding wear leveling to an address translation layer. The single-address problem could be addressed by a caching/grouping layer that interacts with the leveling mechanism. Add to that an all-core state dump to a block write and you can recover to an internally consistent state after a power failure.
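A toy version of such a translation layer (entirely illustrative: real flash translation layers also handle garbage collection, bad blocks, and power-fail atomicity of the map itself):

```c
#include <stdint.h>

#define NLOGICAL 6  // logical space is smaller than physical:
#define NPHYS    8  // the 2 spare blocks give wear leveling room

static uint32_t l2p[NLOGICAL];   // logical -> physical block map
static uint32_t erases[NPHYS];   // per-physical-block erase counts
static uint8_t  in_use[NPHYS];

static void ftl_init(void) {
    for (uint32_t i = 0; i < NLOGICAL; i++) {
        l2p[i] = i;
        in_use[i] = 1;
    }
}

// On each rewrite of a logical block, retire the old physical copy
// and steal the least-erased spare, spreading erases evenly.
static uint32_t remap(uint32_t logical) {
    in_use[l2p[logical]] = 0;  // old copy becomes a spare
    uint32_t best = NPHYS;
    for (uint32_t p = 0; p < NPHYS; p++)
        if (!in_use[p] && (best == NPHYS || erases[p] < erases[best]))
            best = p;
    in_use[best] = 1;
    erases[best]++;  // one erase per rewrite of this block
    return l2p[logical] = best;
}
```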


Very interesting! It's always fun to see "external" development in the general field of computer architecture affect low-level stuff like a CPU's cache and memory subsystems.

It wasn't super-easy to figure out who in the grand ecosystem view of things is going to have to care about these instructions, but I guess database and OS folks.

Also, if the author reads this, the first block quote with instruction descriptions has an editing fail, it repeats the same paragraph three times (text begins "CLWB instruction is ordered only by store-fencing operations").


Isn't there a higher risk of data loss, if your "hard drive" is 100% memory mapped - all it would take is one buggy kernel driver writing to an invalid pointer or memset'ing the whole thing to 0?


Certainly damage can happen faster, since the NVRAM is faster. But my buggy driver could write the whole disk to 0 already.


Well, the same is true now as well, right? For example, a buggy driver can overwrite a buffer-cache pointer with something else, and then you are hosed. If you are playing in kernel-land and not careful enough, you are courting disaster...


True, but if it overruns a buffer, it still needs to maintain a valid SCSI/ATAPI/whatever command packet format and submit the packet to the controller with repeatedly increasing block numbers - that's a lot of instructions, while something that clears the entire address space could probably be done in 1-2 assembly instructions (mov rcx, -1; rep stosq)


Support for non-volatile memory needs to be added to Linux. For example, one should be able to map the non-volatile memory into user space and directly access it. There needs to be some BIOS-OS interaction so that the OS doesn't treat the non-volatile memory as general memory (for the likely case where only some of the memory is non-volatile). Alternatively, the non-volatile memory should be usable as a block device.

The non-volatile memory needs a layer of RAID-like volume management. For example, when you transfer the memory from one system to another, there should be a way to determine that the memory is inserted in the correct slots (remember there is RAID like interleaving/striping across memory modules).


How to solve the context switch overhead issue: https://www.destroyallsoftware.com/talks/the-birth-and-death...


How about a CPU that has scores of hyperthreads? They don't block in the kernel; they stall on a semaphore register bitmask. That mask can cover a timer register matching another register, interrupt completion, or an event being signaled.

Now I can do almost all of my I/O, timer, and inter-process synchronization without ever entering the kernel or swapping out thread context. I've been waiting for this chip since the Z80.


While not exactly a chip (it never reached board stage) I designed a processor in college where the register file was keyed to a task-id register. This way, context switches could take no longer than an unconditional jump.

I dropped this feature when I switched to a single-task stack-based machine (inspired by my adventures with GraFORTH - thank you, Paul Lutus). This ended up being my graduation project.


[flagged]


I'm downvoting your comment because

- that seems to be copy and pasted (i.e. plagiarized) from the original blog post http://danluu.com/clwb-pcommit/

- your submission history suggests you are associated with improgrammer.net


Yes, it is 100% plagiarized. The last sentence in the original text:

"It doesn’t directly address the OS overhead issue, but that can, to a large extent, be worked around without extra hardware."

Their modification:

"It doesn’t directly address the OS overhead issue, but that can, to an astronomically immense extent, be worked around without extra hardware."


It looks like the original text was automatically processed by replacing various words with synonyms from a thesaurus, leading to hilariously non-idiomatic prose.


Artificial Awkwardness. It was bound to happen.


Ouch, that's bad. Also, plagiarizing luu doesn't sit well around these parts. That's the home team!



