Caching Beyond RAM: The Case for NVMe (memcached.org)
231 points by Rafuino on June 14, 2018 | 60 comments



Disclosure: I work at Intel and submitted the post.

Dormando mentions the test was done with the help of Accelerate With Optane, a collaboration we have with Packet to provide access to servers with Intel Optane SSDs. Check out https://www.acceleratewithoptane.com/ for more info, and you can find me and the Packet team over at slack.packet.net. We're especially interested in open source projects that want to test what they can do with the tech and are interested in sharing what they learned with the broader community. Thanks to Dormando for going first!


Skylake CPU? Which model?

Assuming 32GB DIMMs to populate all six channels (if Skylake)? This is something that has bitten me a few times - most Skylake CPUs have 6 memory channels, so a balanced config is 192GB, 384GB, etc. Not 4 channels like we are used to!

Would also recommend trying different RAM configs anyway; on our side we have seen better throughput on 768GB than on 384GB. Even 512GB performs better in some cases than 192GB.


https://www.acceleratewithoptane.com/access/ has the server info. We aimed for a balanced configuration to enable the widest variety of use cases. Not everyone will need to use everything provided, and it might not fit everyone's needs!


What's the lifetime of an Optane or NVMe drive when used as a constantly thrashing cache? Weeks? Months?

Edit: Missed this the first time through:

> Calculating this is done by monitoring an SSD's tolerance of "Drive Writes Per Day". If a 1TB device could survive 5 years with 2TB of writes per 24 hours, it has a tolerance of 2 DWPD. Optane has a high tolerance at 30 DWPD, while a high end flash drive is 3-6 DWPD.
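
In other words, DWPD is just sustained daily writes divided by drive capacity. A quick back-of-the-envelope sketch in Python, using the example numbers from that quote:

    # DWPD arithmetic using the figures quoted above (illustrative only)
    capacity_tb = 1.0          # drive capacity, TB
    daily_writes_tb = 2.0      # sustained writes per 24 hours, TB
    rated_years = 5

    dwpd = daily_writes_tb / capacity_tb                      # -> 2.0 DWPD
    lifetime_writes_tb = daily_writes_tb * 365 * rated_years  # ~3650 TB over the warranty
    print(f"{dwpd:.1f} DWPD, ~{lifetime_writes_tb:.0f} TB written over {rated_years} years")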


They also talk in the paper of keeping the thrashing parts of the cache in RAM. Facebook for example calculates the effective hit rate of a larger set of items and only caches those that won’t be quickly purged out / overwritten in secondary storage.


It's only four more years before the patent on adaptive replacement caching expires. Then we can use it in memcached ...


Thankfully there are better algorithms that are not encumbered. ARC is fairly memory hungry and requires a large cache to be effective. It is neither as scan-resistant nor as good at capturing frequency as many believe. LIRS- or TinyLFU-based policies are what new implementations should build on.

https://github.com/ben-manes/caffeine/wiki/Efficiency


That’s unfortunate. It’s a natural architecture you stumble into when you have a two tier cache (was doing this 10 years ago when we had a memory mapped cache and a secondary spinning disk cache for a popular website).


There is CAR which has almost the same performance and no patent. You can use that.


Why not just use LRU?


CLOCK and CAR can perform a bit better than LRU in certain situations.

Notably, CLOCK keeps items that are accessed at least once during a round, while LRU will kick out the least recently used item.

The benefit of using CLOCK is that you don't have to maintain a list, only a ring buffer. Removing an item from a CLOCK buffer can be almost free if you use a single bit to indicate presence. An LRU has to maintain some form of list, either an array or a linked list. In practice, LRU is expensive to implement while CLOCK is simple. CAR offers ARC-like performance with less complexity.
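
If it helps, here's a toy sketch of the CLOCK idea in Python (illustrative only, not anyone's production code): a fixed ring of slots with one reference bit each. A hit just sets the bit; on a miss the hand sweeps the ring, clearing bits until it finds an unreferenced slot to evict.

    class Clock:
        """Toy CLOCK cache: a ring of slots with one reference bit each."""
        def __init__(self, capacity):
            self.slots = [None] * capacity   # cached keys
            self.ref = [False] * capacity    # reference bits
            self.index = {}                  # key -> slot position
            self.hand = 0

        def access(self, key):
            pos = self.index.get(key)
            if pos is not None:
                self.ref[pos] = True         # hit: just set the bit, no list moves
                return True
            # miss: sweep the hand, giving referenced slots a second chance
            while self.slots[self.hand] is not None and self.ref[self.hand]:
                self.ref[self.hand] = False
                self.hand = (self.hand + 1) % len(self.slots)
            victim = self.slots[self.hand]
            if victim is not None:
                del self.index[victim]       # evict the unreferenced victim
            self.slots[self.hand] = key
            self.ref[self.hand] = True
            self.index[key] = self.hand
            self.hand = (self.hand + 1) % len(self.slots)
            return False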


Performance. The Wikipedia page on cache policies is pretty good.

https://en.wikipedia.org/wiki/Cache_replacement_policies#Clo...


Wiki says "substantially" better than LRU, but actual results seem to show performance converging to the same levels the larger the cache gets. (See page 5.)

https://dbs.uni-leipzig.de/file/ARC.pdf

[p.s. there is also the matter of the (patterns in the) various trace runs. Does anyone know where these traces can be obtained?]


All caches have equal hit rates in the limit when the size of the cache approaches infinity. For finite caches, ARC often wins. In practical experience I've found that a weighted ARC dramatically outperformed LRU for DNS RR caching, in terms of both hit rate and raw CPU time spent per access. This is because it was easy to code an ARC cache that had lock-free access to frequently referenced items; once an item had been promoted to T2 no locks were needed for most accesses. With LRU it's necessary to have exclusive access to the entire cache in order to evict something and add something else.

Of course there are more schemes than just LRU and ARC, and one can try to employ lock-free schemes more than I'm willing to do. This is just my experience.


ARC often wins against LRU, but there is a lot left on the table compared to other policies. That's because it does capture some frequency, but not very well imho.

You can mitigate the exclusive lock using a write-ahead log approach [1] [2]: record events into ring buffers, replay them in batches, and guard the replay with an exclusive tryLock (rough sketch below). This works really well in practice and lets you do much more complex policy work with much less worry about concurrency.

[1] http://highscalability.com/blog/2016/1/25/design-of-a-modern...

[2] http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/pap...
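
Roughly the shape of it, as a toy Python sketch (not Caffeine's actual code): readers append to a bounded buffer, and whichever thread wins a non-blocking tryLock drains the buffer into the policy in a batch, so reads never block on eviction bookkeeping.

    import threading
    from collections import deque

    class BufferedPolicy:
        def __init__(self, drain_threshold=64):
            self.buffer = deque(maxlen=4096)   # bounded; when full, oldest events are dropped
            self.lock = threading.Lock()
            self.drain_threshold = drain_threshold

        def record_access(self, key):
            self.buffer.append(key)            # cheap append on the read path
            if len(self.buffer) >= self.drain_threshold:
                self.try_drain()

        def try_drain(self):
            if not self.lock.acquire(blocking=False):
                return                         # someone else is already draining
            try:
                while self.buffer:
                    self.apply_to_policy(self.buffer.popleft())
            finally:
                self.lock.release()

        def apply_to_policy(self, key):
            pass                               # the expensive LRU/LFU/LIRS bookkeeping goes here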


I don't believe the table in question approached "infinity". Check again.


I wrote a simulator and link to the traces. One unfortunate aspect is they did not provide a real comparison with LIRS, except in a table that includes an incorrect number. It comes off a little biased since LIRS outperforms ARC in most of their own traces.

https://github.com/ben-manes/caffeine/wiki/Simulator


Thank you, you are awesome!


Serious? Doesn't ZFS already use it?


Sun licenses the patent and has related patents on the same thing. Sun provides an implementation under the CDDL license. ZFS on Linux is distributed under the CDDL. Linux is distributed under the GPL for which no patent holders have granted permission to use the patented inventions. Many other implementations exist including one under the Apache license and one under the Mozilla license that I found in two seconds on github.

The whole thing is a mess.


I think only a part of ARC was patented, so if you ran the lists differently you could basically use the double-list idea, where you move entries back and forth between them, as long as the lists aren't of the same kind.


I basically used a memory-weighted variant of the ARC algorithm, sufficiently different so that the patent doesn't cover it.

I imagine there are quite a few variants of ARC you can use without violating the patent.


Intel has stated that the next batch of enterprise Optane SSDs will increase the endurance rating to 60 drive writes per day, which will finally put it beyond even the historical records for drives that used SLC NAND flash.


Are there current writes-per-day/durability numbers for traditional spinny disks? I can't seem to find anything other than SSD numbers.


Some hard drives come with workload ratings. For example, the WD Gold is rated for 550TB (reads+writes) per year (if you run it 24/7). But because the wearout mechanisms are so different between hard drives and SSDs, you can't make a very meaningful comparison between them.


Spinning rust doesn't really express durability in terms of number of writes. The two are orthogonal for that technology.


Hard drives don't fail that way. They fail in other ways, so "writes per day" is simply irrelevant to the hard drive market.

Hard drives fail because of vibration, broken motors, and things like that. MTBF is the typical metric for hard drives. There are also errors that pop up if data sits still too long (on the order of years), because the magnetic signal weakens over time.


Many of the items that end up on an SSD are already long-lived. Every gigabyte of RAM runway before flushing items reduces the write load, since recent items are at the highest risk of being overwritten/deleted. Also, stuff with shorter TTLs won't be persisted (or can be persisted to bucketed pages to avoid being compacted).

TL;DR: there's a lot to it and I'll be going into it in future posts. The full extstore docs explain in a lot of detail too.


The datasheet says something along the lines of 5PB write endurance.


Facebook has recently published a paper on how they use NVM caches with their MyRocks (MySQL) databases; the morning paper has a really good write-up: https://blog.acolyer.org/2018/06/06/reducing-dram-footprint-...


The limiting factor of Optane is actually the controller: the power consumption of the controller itself, as well as the cost of getting high bandwidth to and from the CPU and memory. Current products strike a conservative balance, and in the future they could get much faster.


An adapter that splits a single PCIe slot (x16) to hold 4 x M.2 NVMe SSDs (x4 each) would be a great way to persist a Redis instance that is not just serving as a cache.

If the same can be done with Optane SSDs, the lower latency at higher queue depths will certainly help.



This is not a low profile card, and wastes quite a lot of space. It should take two cards on each side of the board, with the connectors facing orthogonal to those of the x16 slot.


You can't put something as tall as an M.2 connector on the back side of an expansion slot without violating the form factor guidelines and encroaching on the space of the next slot over. The only compliant way to put M.2 drives on the back is to use an offset edge connector so the main board is a bit lower than it usually would be. Amfeltec has some boards like this, but I think they have a patent on their offset connector. http://amfeltec.com/products/pci-express-gen-3-carrier-board...


Oh. Well, there are cases where it would fit without such an exotic connector, but those are non-compliant.

I assume you don't need licenses for just making a dumb PCIe card, if you don't name it with trademarks? Or are there patents you need to license to sell PCIe-compatible, non-electronic cards?


Why would that help -- genuinely curious? Shouldn't a redis server already not be bound by the secondary storage I/O speed? I thought it was a main memory system with asynchronous commits to disk?


What would be interesting to see is if this benefit would be better applied directly to the database instead.

The relative scarcity of NVMe ports/bandwidth per server may make that as unattractive as doing the (RAM) caching on the database server itself, but it's not obvious, if one could only spend the money in one place, where it would be best spent.


NVMe drives use 2-4 PCIe lanes, so you can have quite a few of them in a system with a suitable adapter.


I'd say that's comparable scarcity to DIMM slots, as the ratio between those and PCIe channels tends to be in the range of 1:2-4 on current CPUs.


Intel seems to be all in on NVDIMMs, i.e. adding more "RAM" slots that Optane can fit into.


I would expect that to favor memcache, until the RDBMSes get code to specifically take advantage of it, but even that may be a naive expectation.


AMD's SP3 (Epyc) offerings address the limited capacity. Compared to Intel's current offerings, they are much more cost-competitive in terms of density.


That's certainly true for PCIe lanes, though their memory channel density is unremarkable. (I've also read about cache latency affecting database performance, but I'm not sure what to think of that considering how synthetic benchmarks usually are.)

Historically, I've had trouble convincing management to get anywhere close to reaching the capacity of a server's PCIe lanes, possibly because of a general ignorance around direct attached (but external) storage options, but the popularity of NVMe may change that.

Now the challenge may become convincing people that 128 lanes per CPU is better than 40.

There's also the (extension of the perpetual) challenge that the mere existence of such a price-competitive option to Intel means one doesn't have to panic at the slightest hint of scale and invest immense engineering effort and non-linear infrastructure cost into a distributed database to replace a "single"/central [1] RDBMS.

[1] With all the usual replicas and caches for performance (and failover), as is the point of the article.


Is there any reason why memcached/redis don't provide an external store interface and let people implement the load/store operations? I would like to be able to implement a "hot (RAM) -> warm (NVM) -> cold (SSD) -> DB" style cache hierarchy.
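
Something like this hypothetical interface is what I have in mind (purely illustrative Python; nothing like it exists in memcached or redis today):

    from abc import ABC, abstractmethod

    class Tier(ABC):
        """Hypothetical storage tier: RAM, NVM, SSD, or the backing DB."""
        @abstractmethod
        def get(self, key): ...
        @abstractmethod
        def put(self, key, value): ...

    class TieredCache:
        """Look up tiers in order (hot -> cold) and promote hits upward."""
        def __init__(self, tiers):
            self.tiers = tiers

        def get(self, key):
            for i, tier in enumerate(self.tiers):
                value = tier.get(key)
                if value is not None:
                    for hotter in self.tiers[:i]:
                        hotter.put(key, value)   # promote into the hotter tiers
                    return value
            return None

        def put(self, key, value):
            self.tiers[0].put(key, value)        # writes land in the hottest tier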


No reason; it's doable with a bit more cleanup. The I/O bits are encapsulated entirely in extstore.c with a relatively clean interface. It's not a layered setup though: it's bolted into the same hash table, which means pluggable I/O systems don't make much sense.

That design ensures that a lot of operations never touch secondary storage (miss, touch, delete, sets of new items, etc.), which reduces the I/O load by quite a lot.

edit: Also, extstore itself will support NVM + SSD sort-of-layers soon enough. I'll be retesting that on the same Optane+SSD machine in a couple of weeks.


Could have just used the LMDB-backed memcachedb and gotten the RAM/NVMe cache management transparently, for free.

https://github.com/LMDB/memcachedb


It is very exciting to see Optane competitive with conventional NVMe drives -- this is something that Intel got right early and their exploration into DIMM-socketed Optane drives is similarly exciting.

It has already been established that for most consumer workloads, the latency difference between memory and Optane is negligible. This article shows that heavy-duty workloads (i.e. high-traffic memcached clusters) can be accommodated by Optane and NVMe too. Clusters of 500K IOPS drives can take us most of the way there.

I don't want to get all /r/hailcorporate, but Optane drives are great products and (more importantly) you can run them on AMD platforms (i.e. SP3) too. Granted, NAND NVMe drives bring the fight and are much more cost-competitive at scale, but that will change soon.


> and (more importantly) you can run them on AMD platforms (ie: SP3) too

Wait, you can? I thought Optane required specific motherboards for the DIMM versions. From what I understand, the M.2/NVMe devices just show up as a drive in Linux, right?

But I thought the actual DDR4/DIMM versions only work in certain Xeon boards.


I suppose they're referring to Optane SSDs[0], not Optane memory.

[0] https://www.intel.com/content/www/us/en/products/memory-stor...


Ah, the DDR4/DIMM Optane models are not yet publicly available (those should bring us significantly closer to parity with memory latency); I was referring to the Optane NVMe drives being compatible with AMD systems.


I love toys and I would love to jump feet first into Optane, but a dollar per gigabyte is hard to stomach.


About a decade ago, I paid around $600 for the 80GB X25-M, the first SSD that focused on performance rather than just low latency. One of the best system upgrades I've ever had.

A dollar per gigabyte doesn't seem half bad at all for the top end of performance.


My first hard drive upgrade was an 80MB HardCard - an ISA hard disk for my PC XT 286. I recall it cost close to $1,000.

There's always going to be a top tier of storage that costs an arm and a leg - that just means two steps down gets affordable. Optane pricing will drive down NVMe, which will drive down SATA SSDs.


The main issue is that video gamers don't see any benefits to NVMe NAND drives. So the "hardcore gamer" market is beginning to stall out on SATA SSD drives.

Why pay 2x more for NVMe NAND if your video game load times aren't any better?


They are better at load times. Just not 2x better.


The price/performance of a couple of 860 Evos in RAID 0 is hard to beat.


>a dollar per gigabyte is hard to stomach.

My first HD in 1991 cost like $100 and was 40MB. So, there's that.

Besides "a dollar per gigabyte" is $1000 for 1TB.

512GB SSDs used to cost ~$1,000 just 5-8 years back...


I feel my sibling comments (I paid $x y years ago, get off my lawn) are missing your point. If you're looking to fill a whole datacenter with Optane, the price compared to current storage tech is eye-watering. This technology is aimed at servers with a high IO load and a lot of RAM caching, not home users.

Of course, as with previous SSD tech, the price will come down fairly rapidly.


They have quite a ways to go before I'd be happy treating them as RAM for something like memcached... but it's a good start. The IOPS are still very low compared to RAM and the latency is quite a ways off. For a lot of other workloads it seems doable, if expensive.




