Constant Time LFU (arpitbhayani.me)
72 points by lukastyrychtr on Aug 28, 2020 | 24 comments



I read it; I will try to summarize the idea with some intuition.

A classic LRU cache has a map from key to value, with the entries of the map arranged in a doubly linked list. When a value is accessed, it is promoted to the front of the list. So the last item in the list is least-recently accessed.

In an LFU cache, you might try the same idea, except each node also has a counter. When an item is accessed, increment its counter, then swap it forward past items with lower counters to keep the list sorted. This way an infrequently accessed item does not get promoted.

However, this has a problem for certain access patterns. If every item has the same counter value, then accessing the last item in the list forces it to migrate up the whole list - O(N). You could use a heap instead and get O(log n) time, but that's still not the constant time that LRU provides.

What's proposed here is to bucket items by counter. Each bucket has a linked list of entries, and each entry knows its bucket. The buckets themselves are arranged in a linked list. When an item in the N bucket is accessed, it conceptually unhooks itself from the N bucket and adds itself to the N+1 bucket, creating it if necessary.
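
To make that concrete, here's a rough Python sketch of the idea (naming is mine, not the article's; instead of an explicit linked list of buckets it keys the buckets by count in a dict and tracks the minimum count, which keeps each operation O(1)):

    from collections import defaultdict, OrderedDict

    class LFUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.values = {}                         # key -> value
            self.counts = {}                         # key -> access count
            self.buckets = defaultdict(OrderedDict)  # count -> keys, oldest first
            self.min_count = 0

        def _touch(self, key):
            # Move key from its current bucket to the count+1 bucket.
            count = self.counts[key]
            del self.buckets[count][key]
            if not self.buckets[count]:
                del self.buckets[count]
                if self.min_count == count:
                    self.min_count += 1
            self.counts[key] = count + 1
            self.buckets[count + 1][key] = None

        def get(self, key):
            if key not in self.values:
                return None
            self._touch(key)
            return self.values[key]

        def put(self, key, value):
            if self.capacity <= 0:
                return
            if key in self.values:
                self.values[key] = value
                self._touch(key)
                return
            if len(self.values) >= self.capacity:
                # Evict the oldest key in the lowest-count bucket.
                victim, _ = self.buckets[self.min_count].popitem(last=False)
                if not self.buckets[self.min_count]:
                    del self.buckets[self.min_count]
                del self.values[victim]
                del self.counts[victim]
            self.values[key] = value
            self.counts[key] = 1
            self.buckets[1][key] = None
            self.min_count = 1

Evicting the oldest entry within the lowest-count bucket also gives you LRU tie-breaking for free.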

Incidentally, a tip for writing this sort of cache: make your list circular, starting with a sentinel node (I call it the "mouth" since it eats its tail). This eliminates all the null checks in the linked list manipulation.
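
Something like this (Python again, names are mine):

    class Node:
        def __init__(self, key=None, value=None):
            self.key, self.value = key, value
            self.prev = self.next = self  # points at itself until linked

    class CircularList:
        def __init__(self):
            self.mouth = Node()  # sentinel: the empty list is just the mouth

        def push_front(self, node):
            # Splice node in right after the sentinel; no null checks needed.
            node.prev, node.next = self.mouth, self.mouth.next
            self.mouth.next.prev = node
            self.mouth.next = node

        def unlink(self, node):
            node.prev.next = node.next
            node.next.prev = node.prev
            node.prev = node.next = node

    # The least recently promoted entry is always mouth.prev.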


I suspect we're going to see more of this sort of 'discovery' at this point. We have done a lot of the fundamental research but haven't puzzled out all of the implications or applications yet. Your description is sort of reminiscent of a recipe. All the ingredients have been there for a long time, but the mix is important.

You may know what eggs and wheat and milk and butter are, but if you've never seen a pancake before, then boy have I got some good news for you. We can talk about croissants later once the buzz wears off.


Not to rain on anyone's parade, but this looks like a bog-standard interview-style question. It wouldn't be unusual to ask a candidate to "invent" and implement both data structures in a 45 minute session.

Not saying this sort of thing makes for a good interview, just saying it's a cute little problem that has probably been solved a thousand times by a thousand different people.


Is there a way to make these data structures hardware friendly? So that SIMD instructions can be used, memory is traversed sequentially without multiple pointer indirections, etc.?

Are there any blogposts about this?


For chaining hash tables like this you can use a linked list of buckets where each bucket allows multiple entries, and entries can be scanned with SIMD operations. The downside is that it considerably complicates the insertion/update logic, so depending on the workload it may be a net performance loss. If the hash table is designed for concurrency, the extra complexity can easily lead to hard-to-understand bugs or performance cliffs.


You can do this at the hashtable level when comparing keys. I think TBB does it. https://software.intel.com/en-us/node/506171


There are many ways, and they look nothing like this of course.

For a hint, have a look at how a CPU implements LRU for its caches (e.g. L1, L2). It's not rocket science.


In practice I think cache eviction policies have trended toward optimizing _memory_ usage. Storing two pointers per cache entry for LRU on a 64-bit CPU uses 16 bytes per entry - which adds up quickly! Especially if you're storing e.g. a like count (so an int64 or float per cache value). This approach to LFU has an even higher overhead.

There are a few cool alternatives. The first and most obvious - use a good dense hash table implementation! For eviction strategies: a random eviction strategy actually does pretty well in the general case and requires no extra memory - great if you're caching an int32->int32 mapping and the cost of recomputing isn't astronomical. You can also try the CLOCK algorithm, which requires just 1 bit of storage per cache entry.
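
For reference, a bare-bones CLOCK sketch in Python (illustrative only; a real implementation would pack the reference bits into a bitmap rather than store them per slot):

    class ClockCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.slots = []   # each slot is [key, value, ref_bit]
            self.index = {}   # key -> slot position
            self.hand = 0

        def get(self, key):
            pos = self.index.get(key)
            if pos is None:
                return None
            self.slots[pos][2] = 1  # mark as recently referenced
            return self.slots[pos][1]

        def put(self, key, value):
            pos = self.index.get(key)
            if pos is not None:
                self.slots[pos][1] = value
                self.slots[pos][2] = 1
                return
            if len(self.slots) < self.capacity:
                self.index[key] = len(self.slots)
                self.slots.append([key, value, 1])
                return
            # Sweep the hand, clearing reference bits, until a slot with a
            # clear bit is found; evict and reuse that slot.
            while self.slots[self.hand][2]:
                self.slots[self.hand][2] = 0
                self.hand = (self.hand + 1) % self.capacity
            del self.index[self.slots[self.hand][0]]
            self.slots[self.hand] = [key, value, 1]
            self.index[key] = self.hand
            self.hand = (self.hand + 1) % self.capacity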

From an abstract perspective, all page replacement algorithms can also be used as cache eviction policies. There's a whole list of these at https://en.wikipedia.org/wiki/Page_replacement_algorithm#Pag... .


Cache sizes are nowhere near the size where an O(log n) factor is significant. In 1 TB of RAM you can fit at most 2^36 entries (64-bit key to 64-bit value without additional data structures). A factor of 36 is easily dominated by CPU-cache behaviour. This analysis should not be done with time complexities but with deeper tools.

While binary heaps are notorious for bad CPU-cache behaviour, improvements exist. Thus only benchmarks will convince me.


The overhead here, in terms of space per element and the amount of work done while holding a mutex, seems like it would only work well for some kinds of workloads. In particular, the design doesn't lend itself well to lock-free algorithms or other techniques for reducing contention.

Knowing how some real life high performance LFU caches work, I think this is optimizing the wrong thing. Look at the design of caffeine and ristretto[1] for some discussion of the tradeoffs with actual measurements. I believe both are based on TinyLFU with modifications.

I don't know any serious LFU implementations that actually use a min heap and are still competitive, so the premise that LFU is commonly log N is not actually true.

[1] https://dgraph.io/blog/post/introducing-ristretto-high-perf-...


I've implemented this for leetcode without realizing it's paper-worthy: https://leetcode.com/problems/lfu-cache/

My implementation: https://pastebin.com/48mMTTdK

I am surprised to see that the paper was only published in 2010. I would expect something this simple to have been discovered earlier. I guess it's easy only after someone tells you constant time is possible.


I'd guess it's more likely for someone to implement an algorithm in an innovative way inside their own program than for them to publish a paper about it. But yeah, maybe there's no blog post predating the paper, which would be surprising but possible.


I came here to say the same thing. I'm not even particularly good at this kind of thing.


I had this on a phone screen :D.


Did you get the job?


You could do this, although LRU has the nice property that it is at most a constant factor slower than an optimal (prescient) algorithm that has less memory available [1] (e.g. LRU with twice the memory incurs at most twice as many misses as the optimal algorithm). Interestingly FIFO has the same property, suggesting that even LRU's attempt to keep frequently used items in cache can only have a limited effect (so in some cases FIFO might be faster simply due to lower overhead).

Interestingly LFU has access patterns that force it to be slower than LRU. From the paper referenced above:

>A counterexample for LFU is a sequence of k + 1 accesses to each of N - 1 pages, followed by 2k alternating accesses to two new pages, k to each, with this pattern repeated indefinitely (so that all but the initial N - 1 pages have k accesses). [here N is the size of the cache]

[1]: https://www.cs.cmu.edu/~sleator/papers/amortized-efficiency....
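
In case the pattern is hard to parse from the prose, here's how I read it, as a small Python generator (names and comments are mine, not the paper's notation):

    def lfu_adversarial_stream(n, k):
        # n is the cache size (N in the quote).
        # Warm-up: k+1 accesses to each of n-1 pages, so their counts stay
        # above anything that comes later.
        for page in range(n - 1):
            for _ in range(k + 1):
                yield page
        # Then, forever: two brand-new pages, accessed alternately, k times
        # each (2k accesses per round). LFU keeps evicting whichever of the
        # two is currently cached, since its count never catches up with the
        # warm pages, so every one of these accesses is a miss; LRU only
        # misses on the first access to each new page.
        fresh = n - 1
        while True:
            a, b = fresh, fresh + 1
            fresh += 2
            for _ in range(k):
                yield a
                yield b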


Maybe also consider tinylfu (https://arxiv.org/abs/1512.00727). I've implemented the windowed version with segmented caches and it works really well. Though it is a much more complicated implementation.


Have you tried making it adaptive [1]? Using hill climbing with a decay rate has a nice impact to correct for LRU-biased workloads [2].

[1] https://dl.acm.org/doi/10.1145/3274808.3274816 [2] https://github.com/ben-manes/caffeine/wiki/Efficiency#adapti...


Recency and frequency are two orthogonal, useful pieces of information that can allow better evictions. It's not that LFU is superior to LRU or the reverse.

The best of both worlds is to use LRFU. Now, an O(1) LRFU implementation would be nice to see. Also, no talk on this blog about the constant factor, I believe.

After a code search on GitHub, Gecko, Chromium and Linux have zero LFUs despite having tons of LRUs. And zero LRFUs too. So it seems that browser and OS engineers should learn better cache algorithms; it could allow huge performance gains!


The main Linux page cache is 50% LRU and 50% LFU (half of the available memory devoted to each).


I have been asked to implement this in Google, Amazon and Facebook interviews and used the same method. :)


This is a core data structure built into Java.

https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHa...


You're thinking of the data structure commonly used for an LRU cache, not an LFU cache.


You're 100% right, stupid mistake on my part. Though the implementations are very similar, you just can't easily reuse the Java implementation I linked to for this.



