Start with a dumb-cache using "FIFO" (first in/first out)
Keep track of anything with "hits" while in the cache (simple boolean/bitmap instead of complicated locks/data structures)
Re-insert "hit" items when they would normally be evicted/age out (Lazy Promotion).
BTW, also have a mini-cache in front of the main cache (10% of cache size) which tries to "drop off quickly" (effectively: must have a "hit" within first 10% of being cached).
BTW, also keep track of a "ghost cache" ("hits only", not contents??) for the duration of the overall FIFO-cache, and use that to guide eviction/re-insertion. I'm a little bit unclear on this aspect, but it seems like an "obvious in retrospect" set of guidance.
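A minimal sketch of steps 1-3 in Python, as I read them (class and method names are mine, not from the paper; the probationary FIFO and ghost queue are left out here):

    from collections import OrderedDict

    class FifoReinsertionCache:
        """FIFO cache with one hit bit per entry and lazy promotion (CLOCK-like sketch)."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()   # key -> (value, hit_bit), kept in insertion (FIFO) order

        def get(self, key):
            if key in self.entries:
                value, _ = self.entries[key]
                self.entries[key] = (value, True)   # just set the bit; no reordering on a hit
                return value
            return None                             # miss: caller fetches and calls put()

        def put(self, key, value):
            if key in self.entries:
                self.entries[key] = (value, True)
                return
            while len(self.entries) >= self.capacity:
                old_key, (old_value, hit) = self.entries.popitem(last=False)  # oldest entry
                if hit:
                    # lazy promotion: re-insert at the tail with the bit cleared
                    self.entries[old_key] = (old_value, False)
                # else: evicted
            self.entries[key] = (value, False)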
Useful summary. Made me want to read about the ghost cache. The cache layout is:
Probationary 10% + Traditional 90% (of memory).
The idea is that most items are not requested again before dropping out of probation, and so never take up space or processing in the Traditional cache.
Some infrequently used items, however, would survive in the Traditional cache but never make it through the probationary cache. So a small "ghost" cache is used specifically to detect these items rejected from the Probationary cache.
Next time we go to fetch a ghost item, it goes directly into the Traditional cache. The ghost cache just stores metadata (pointers, essentially).
It is the same size as the traditional cache, but it is also a FIFO. Naturally it doesn't need to worry about tracking access, since the moment the ghost cache is "hit" the relevant memory location is loaded into the traditional cache and the ghost cache entry becomes irrelevant. So I guess that means the ghost cache could just be a queue.
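Putting the thread's description together, a rough sketch of that layout (the sizes, names, and the promote-on-hit-during-probation rule are my guesses from the comments above; the paper's exact rules may differ):

    from collections import OrderedDict, deque

    class SmallMainGhostCache:
        """Sketch: ~10% probationary FIFO, ~90% main FIFO, plus a ghost queue of keys only."""

        def __init__(self, capacity):
            self.small_cap = max(1, capacity // 10)
            self.main_cap = capacity - self.small_cap
            self.small = OrderedDict()      # probationary FIFO: key -> value
            self.main = OrderedDict()       # traditional FIFO: key -> value
            self.ghost = deque()            # keys only (metadata), FIFO order
            self.ghost_keys = set()
            self.ghost_cap = self.main_cap  # "same size as the traditional cache"

        def get(self, key):
            if key in self.small:           # hit during probation: promote to main
                value = self.small.pop(key)
                self._insert_main(key, value)
                return value
            return self.main.get(key)       # (no hit bit in main here, to keep the sketch short)

        def put(self, key, value):
            if key in self.ghost_keys:      # seen before but rejected from probation:
                self.ghost_keys.discard(key)
                self._insert_main(key, value)   # go directly into the traditional cache
            else:
                self._insert_small(key, value)

        def _insert_small(self, key, value):
            while len(self.small) >= self.small_cap:
                old_key, _ = self.small.popitem(last=False)   # no hit during probation
                self._remember_ghost(old_key)                 # remember the key, drop the value
            self.small[key] = value

        def _insert_main(self, key, value):
            while len(self.main) >= self.main_cap:
                self.main.popitem(last=False)                 # plain FIFO eviction
            self.main[key] = value

        def _remember_ghost(self, key):
            if key not in self.ghost_keys:
                self.ghost.append(key)
                self.ghost_keys.add(key)
            while len(self.ghost) > self.ghost_cap:
                self.ghost_keys.discard(self.ghost.popleft())

Combine this with the hit-bit/reinsertion sketch above for the main queue to get the full design.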
This sounds pretty similar to the pageout scan in Mach (i.e., NeXTSTEP, Darwin, macOS, iOS, and derivatives). It's also used by 4.4BSD and some of the modern BSDs.
There is an "active" queue and an "inactive" queue. The pageout daemon tries to maintain them in a 1:2 ratio. Active drains to inactive; inactive drains to be paged out. Both are simple FIFO, but the daemon clears the page table referenced bit of pages as they are enqueued onto inactive.
When dequeueing a page from inactive, the referenced bit is checked. If it's been set (i.e., the page was referenced since being moved to inactive), the page is diverted back up to active. Otherwise, it's paged out.
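A toy model of that two-queue scan (just the shape of what is described above, not actual Mach/XNU code; the reading of the 1:2 ratio is my interpretation):

    from collections import deque

    class ToyPageoutScan:
        """Toy active/inactive pageout scan with a referenced bit (not real kernel code)."""

        def __init__(self):
            self.active = deque()      # FIFO of pages
            self.inactive = deque()    # FIFO of pages
            self.referenced = set()    # stands in for the hardware referenced bit in the PTE

        def touch(self, page):
            self.referenced.add(page)  # hardware would set the referenced bit on access

        def rebalance(self):
            # the daemon drains active into inactive toward the 1:2 ratio mentioned above
            while self.active and 2 * len(self.active) > len(self.inactive):
                page = self.active.popleft()
                self.referenced.discard(page)   # bit cleared as the page is enqueued on inactive
                self.inactive.append(page)

        def reclaim_one(self):
            # dequeue from inactive; pages referenced since demotion go back to active
            while self.inactive:
                page = self.inactive.popleft()
                if page in self.referenced:
                    self.active.append(page)    # diverted back up to active
                else:
                    return page                 # not referenced: page this one out
            return None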
Not a great one offhand, but here's a decent one from Apple, from back in the days when they put energy into documenting internal stuff like this. (As a result, some of the details in it are slightly out-of-date, but the general shape is still correct.)
This is very helpful (for someone who has never worked on Darwin)! The algorithm uses two FIFO lists with a "visited" bit (lazy promotion), which is a great idea! There are two differences between what is described and our algorithm: 1. the two FIFO lists are not fixed-size, so quick demotion cannot be guaranteed; 2. new data (hard faults) are inserted into the active list (I assume it is larger than the inactive list), which is different from quick demotion.
1. You've got a cache miss
2. You need to evict something, so look at the next item due for eviction.
3. It's a candidate for promotion, so you reinsert it.
4. You still need to evict something, so goto 2.
Isn't that a potential infinite loop? And even if not (because there's some other exit clause), how is it quick?
There are two things. One is reinsertion or lazy promotion as you described, which is used to keep popular objects (reinserted objects have the bit reset to 0, so it won't be an infinite loop). And it performs less work than LRU because many hits on a popular object result in one "lazy promotion".
The second is quick demotion which uses a small FIFO to filter out most objects that are not requested soon after insertion in the cache.
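Concretely, the eviction step is something like this (a sketch; the queue holds (key, value, hit_bit) tuples with the oldest entry at the left, and is only scanned when the cache is full):

    from collections import deque

    def evict_one(fifo: deque):
        """Pop entries from the FIFO head until an unmarked one is found.

        Every marked entry that is checked gets re-inserted with its bit cleared, so
        after at most len(fifo) re-insertions the head must be unmarked: no infinite loop.
        """
        while True:
            key, value, hit = fifo.popleft()
            if hit:
                fifo.append((key, value, False))   # lazy promotion, bit reset to 0
            else:
                return key                         # this entry is the one evicted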
It looks like your lazy promotion algorithm is O(n). If you get a particularly unlucky request pattern you'll be walking the whole list to know which item to evict. I don't think the claim that this LP FIFO method is always less intensive than LRU will hold up to scrutiny. But I still find the paper inspiring. Thanks for shaking up a bit the established beliefs in the field :)
This is a common confusion. It is O(n) where n is the number of objects, but LRU is O(m) where m is the number of requests, which is often more than the number of objects :)
I think we are not talking about the same thing. LRU needs to access/modify the tail and head of the queue once per request for both Get and Insert.
Your algorithm, on the other hand, does not need to modify the order of the queue on a Get (it just marks a single item as hit if not marked already), which is really good. But on Insert your algorithm needs to do a reverse scan through the queue until it finds an item it can evict (not marked as hit). This might need to walk the whole queue if unlucky. And it can get so bad that it needs to do that on every single eviction, so latency and CPU usage could skyrocket. Granted, this is a worst-case scenario that shouldn't happen usually, but if it does then it could be a real problem.
So in that sense LRU would be O(m) in terms of queue modifications and reads but yours would be O(n*m) for queue reads.
The O(n) I mentioned was for the number of items the algorithm needs to check on a single insert that results in an eviction aka the usual case.
Yes, in the worst case it would need O(n) work to evict, and that can hurt tail latency, but it only happens once in a while, because each object checked during an eviction has its bit reset to 0, and it takes another hit on that object to set it back to 1.
I don't understand the O(n*m) argument; does the answer above invalidate it?
On the tail latency part, if it becomes a problem, we can bound the number of objects it checks.
This will result in your queue being 100% filled with items that are marked as hit (the a, a, b, b, c, c, ... pattern below), because of the second Fetch for each key. After the queue is full, each first Fetch will result in an eviction (a new key to insert), which has to walk the whole queue because the only item marked not-hit is at the head of the queue, having just been lazy-promoted.
My O(n*m) reply was because your O(m) was considering a whole run over multiple requests and my O(n) was only considering one request. So if I were to also talk about multiple requests then your eviction algorithm needs to look at O(n*m) items for evictions in the example pattern I presented. O(n) per request for m requests.
A malicious actor could abuse this fact to cause a DoS.
No, you can't just bound the number of objects it checks, because that would result in a flurry of cache misses: it would mean you have to fail the insert.
Consider the request sequence you gave:
a, a, b, b, c, c, d, d ...
and a cache size of 2, when the first c arrives, the cache is in the following state (left is head and right is the tail, inserted at the head):
b(1) a(1)
and it will trigger an eviction, which requires resetting 2 (or O(n)) objects, with the following state changes
b(1) a(1) -> b(1) a(0) -> b(0) a(0) -> c(0) b(0)
the next c will change the bit to 1, and we will have
c(1) b(0).
When d arrives, b will be evicted, and we have
d(0) c(1)
Note that we only check/reset one object this time instead of O(n).
Therefore, in the worst case, between each O(n) eviction you need O(n) requests to set the n objects to "visited", which is strictly no more work than an LRU cache.
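For anyone who wants to poke at the amortized claim, here is a quick simulation of that doubled pattern (a sketch using the re-insert-at-head variant, so the per-eviction counts differ slightly from the reset-in-place trace above):

    from collections import deque

    def scans_per_eviction(requests, capacity):
        queue = deque()   # FIFO of keys; popleft() removes the oldest
        hit = {}          # key -> hit bit; also serves as the membership test
        scans = []        # number of objects checked on each eviction
        for key in requests:
            if key in hit:
                hit[key] = True                # cache hit: just set the bit
                continue
            if len(queue) >= capacity:         # miss with a full cache: evict one
                checked = 0
                while True:
                    victim = queue.popleft()
                    checked += 1
                    if hit[victim]:
                        hit[victim] = False    # lazy promotion: clear the bit, re-insert
                        queue.append(victim)
                    else:
                        del hit[victim]        # evict the first unmarked object
                        break
                scans.append(checked)
            queue.append(key)
            hit[key] = False
        return scans

    # a, a, b, b, c, c, d, d, ... with a cache of 2
    pattern = [k for i in range(10) for k in (str(i), str(i))]
    print(scans_per_eviction(pattern, capacity=2))   # [3, 1, 3, 1, 3, 1, 3, 1]

The expensive scans alternate with cheap ones, and the total number of objects checked stays proportional to the number of requests, which is the amortized point being made above.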
Ok, I see where the difference is now. You are clearing the hit bit for all items marked as hit on a single eviction. I mean, yes, I guess that means that during the following request you don't have to check all items in the queue, but now you have a different problem: you evict items prematurely. With a cache of size 2 I can't show it, but consider a cache of size 3 and the following pattern: a, a, b, b, c, b, d, e
When d arrives state changes c(0) b(1) a(1) -> d(0) c(0) b(0).
When e arrives state changes d(0) c(0) b(0) -> e(0) d(0) c(0).
Note how b got evicted from the cache even though it was requested more recently and more frequently than c because d cleared out the hit bit for both b and c.
Now you have traded off hit rate vs. latency. But the worst-case latency is still O(n), it just can't happen multiple times in a row anymore.
Imagine you have a cache of 1 million items all marked as hit, and you get a new request whose eviction has to walk and reset all 1 million of them. You gained latency for some subsequent requests but increased it for the current one. Crucially, you also just cleared out the hit marker for the whole cache.
Any eviction algorithm has an adversarial workload, so a simpler algorithm is better because you can understand what the workload was and whether it happens in production; a complicated one, on the other hand, only makes this harder, and the failure of a cache is often catastrophic. :)
I can't agree with that. More sophisticated algorithms don't necessarily have to make things worse. In fact I think the opposite is true. Simple algorithms are the ones most likely to fail catastrophically because they are made for a certain range of access patterns.
You don't always know the workload beforehand and can pick the appropriate cache policy. Workloads can even change during runtime.
I haven't seen a simple cache algorithm that performs well in every workload and I believe any that wants to succeed needs to incorporate different sub-algorithms for different workloads and be adaptive.
Have you experienced workload changes that caused an algorithm to be less effective? I am happy to learn about it.
From my experience and study of traces collected in the past, I haven't seen such cases. Of course, there are diurnal changes, and sometimes short abrupt changes (e.g., failover, updating a deployment), but I have not seen a case where a workload change needs a different algorithm, e.g., switching from LRU-friendly to LFU-friendly.
Having an ideal adaptive algorithm is certainly better, and I do observe that different workloads favor different algorithms, but all adaptive algorithms have parameters that are not optimal for a portion of workloads and are also hard to tune (including ARC).
If you give me an algorithm, I can always find a production workload for you that the algorithm does not work well.
"Simple algorithms are the ones most likely to fail catastrophically because they are made for a certain range of access patterns" I do not see how FIFO and LRU fail catastrophically except on scanning/streaming workloads. Happy to learn about it.
The classic example of a changing workload is a database that has a normal workload resembling a zipf or bell distribution, but then someone runs exports which do full scans over large parts of the DB. These changes in workloads can be intermittent, but they exist. A very similar, well-known change in workload is the page cache for filesystems (e.g. a backup doing a full scan over the disk and evicting everything from the cache), which is why admission policies were invented in the first place. So you have two different algorithms already, just maybe not adaptive.
A friend of mine works on an online-shop / distribution system that during most of the day has a zipf-like distribution, but after normal working hours, when retail shops close, he gets completely different patterns as the shops push their data for the day into the system and then all kinds of BI tools run for a few hours.
Of course adaptive caches are never completely optimal, because it takes time for them to learn which workload is currently happening, but after a while they can get very close. A good adaptive cache should not need tuning to get good results, because the whole reason for it to be adaptive is to tune itself. Of course there are quite a few adaptive caches, especially the first ones that pioneered the approach, that still needed tuning.
I don't understand the last question. You say you can't see how FIFO can fail catastrophically and at the same time state a case where it can. Scanning workloads show up in the real world all the time. NovaX seems to have run your algorithm through some traces where it had a near-zero hit rate. It can be more important to avoid these extreme cases than to have a slightly better hit rate in the normal case.
I wonder why all these papers ignore comparison against W-TinyLFU.
https://github.com/ben-manes/caffeine/wiki/Efficiency shows that it really outperforms ARC as well, and they also have an optimal oracle version that they evaluate against to show how much room is left (admittedly the oracle version itself implies you're picking some global criterion to optimize, but that's trickier when in reality there are multiple axes along which to optimize and you can't simultaneously do well across all of them).
The lack of evaluation of the cache strategy against shifting workloads is also problematic.
It's mentioned in the paper, since W-TinyLFU is a means of qualifying things prior to going in the cache. That process is called QD in TFA.
They refer to the W-TinyLFU paper as reference 34.
> Moreover, admission algorithms, e.g., TinyLFU [33, 34], Bloom Filter [18, 54], probabilistic [15] and ML-based [35] admission algorithms, can be viewed as a form of QD — albeit some of them are too aggressive at demotion (rejecting objects from entering the cache).
Thank you! Yes, W-TinyLFU is very good; we have a follow-up on this. But we find that the adaptive frequency in W-TinyLFU is less robust than using a static FIFO (yeah, I know this is surprising). And the 1% LRU is sometimes too small. But I admit that W-TinyLFU is very good among all the state-of-the-art (though not as good as FIFO+LP+QD), especially when you consider the scalability issue.
Yeah, I'd love to see a comparison against W-TinyLFU, particularly across many more evaluation axes, e.g. what happens when traffic patterns shift, what if you want to better balance QoS between customers, etc.
In a shifting workload their algorithm achieves a 0.01% hit rate whereas LRU, ARC, and LIRS have a 20% hit rate and adaptive W-TinyLFU has an almost 40% hit rate.
Hi Ben, can you provide more details about the shifting workload? Every algorithm has weaknesses, and finding them in complex algorithms is often harder; however, this does not mean a simple algorithm is bad. I like W-TinyLFU: it has a slightly better result than LIRS (which also uses a 1% window) while being much simpler and more robust. However, we find that the adaptive frequency is not robust on the large dataset: a high-frequency object in the main SLRU can stop many objects from entering the main cache. Happy to chat more 1 on 1 if you are interested.
To clarify, the 1% window was a heuristic that we recommended in the first paper as a good default for most workloads. However, we showed that sometimes a larger window was needed, and left that as future work.
The follow-up paper, which you also cite, explored adaptive techniques to resolve this deficiency. There we introduced the idea of using hill climbing by monitoring the hit rate change over a sample period and adjusting the step direction. Caffeine incorporates this with a step size decay to allow for convergence. The implementation that you have does not include this improvement, but it is very simple to add and works surprisingly well.
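Not Caffeine's actual code, but for readers who have not seen the idea, the hill climber in miniature (all names and numbers below are made up): sample the hit rate over a period, keep moving the window size in the same direction while the hit rate improves, reverse when it drops, and decay the step so it converges.

    def adapt_window(window, direction, step, prev_hit_rate, hit_rate,
                     min_window, max_window, decay=0.98):
        """One adjustment period of a hit-rate hill climber (toy sketch, not Caffeine's code)."""
        if hit_rate < prev_hit_rate:
            direction = -direction                     # hit rate dropped: reverse direction
        window = max(min_window, min(max_window, window + direction * step))
        return window, direction, step * decay         # shrinking step size allows convergence

    # hypothetical usage: run once per sample period (say, every N requests)
    window, direction, step, prev = 100, +1, 10.0, 0.0
    for hit_rate in (0.30, 0.32, 0.31, 0.33, 0.335):   # made-up hit-rate samples
        window, direction, step = adapt_window(window, direction, step, prev,
                                               hit_rate, 10, 1000)
        prev = hit_rate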
Yes, I think that is my main concern: research papers often do not disclose the weaknesses of their approaches and the opposing tradeoffs. There is no silver bullet.
The stress workload that I use is to chain corda-large [1], a 5x loop [2], and corda-large again, at a cache size of 512 entries (6M requests). This shifts from a strongly LRU-biased pattern to an MRU one, and then back again. My solution to this was to use hill climbing, sampling the hit rate to adaptively size the admission window (aka your FIFO) and reconfigure the cache region sizes. You already have similar code in your CACHEUS implementation, which built on that idea to apply it to a multi-agent policy.
Caffeine adjusts the frequency comparison for admission slightly to allow ~1% of losing warm candidates to enter the main region. This is to protect against a hash flooding attack (HashDoS) [3]. That isn't intended to improve or correct the policy's decision making, so it should be unrelated to your observations, but it is an important change for real-world usage.
I believe LIRS2 [4] adaptively sizes the LIR region, but I do not recall the details, as it is a complex algorithm. It did very well across different workloads when I tried it out, and the authors were able to make a few performance fixes based on my feedback. Unfortunately, I find LIRS algorithms too difficult to maintain in an industry setting because, while excellent, the implementation logic is not intuitive, which makes it frustrating to debug.
Beating plain LRU isn't very interesting, but they also evaluated a bunch of other algorithms (e.g. ARC) and concluded that it performed better than those as well.
I know the ARC paper discusses that other algorithms are often better if properly tuned, although ARC is usually consistently decent in a variety of situations without tuning. It would be awesome to have a new algorithm in that space.
ARC is good overall, but we find that when evaluating on such a large dataset (over 5000 traces), the adaptive algorithm in ARC is not robust: sometimes the recency queue is too small, and sometimes it is too large. If we think about it closely, why does "moving one slot from the recency queue to the frequency queue upon a hit in the frequency ghost" make sense? Should we distinguish between a hit near the head of the ghost LRU and one near its tail? The implicit parameters may not be reasonable in some cases. But overall, ARC is a good algorithm, just not perfect yet. :)
A recent study on over 5000 (key-value, block, object) cache traces from 2007 to 2020 shows that FIFO-Reinsertion is not only faster and more scalable, it is also more efficient (has a lower miss ratio). Maybe it is time to drop LRU from our classroom?
I've spent the last month or so working on a write-behind cache in both Rust and Zig, with a RAM + Disk component, similar to what an OS kernel's memory page swap does, but at the app level and key-value-oriented.
My experience trying out cache algorithms is that they are all very generic; cache benchmarks are typically based on random distributions or on web-request datasets (Twitter, CDNs, ...) that may not match your use case. Mine is about caching data stream transformations with specific ORDER BYs. Hits may come very far apart in time, and LFU ended up working better for me. Your choice of eviction trigger is also very important, e.g. number of items vs. RAM use (my case). So don't go running to the latest "benchmark proven" cache algorithm paper without weighing your specific needs.
Do you mind sharing more details on the time span of the workloads in your benchmark? Are you using LFU with no aging? Drop me an email if you would like to chat more: juncheny@cs.cmu.edu
The LIRS papers do a good job of taking different workload patterns, such as loop and zig-zag, and mixing multiple patterns together to validate that their algorithm is robustly good as a general-purpose cache. This paper seems to take traces with similar workload patterns, distinguished mainly by their sheer size, to assert the algorithm's quality. When papers do this it looks good in an academic sense, but it is frustrating from an industry perspective, because the workload may not be known upfront, or one must cherry-pick an algorithm for a bespoke, consistent pattern. I find it more interesting when algorithm designers try to stress their work and discuss how/why it fails or succeeds. That gives more confidence to those of us who want to implement and use the work in a practical setting.
The large datasets from 2007 to 2020 cover a variety of workload patterns. If you have a workload that you would like to try out, our code and data are all open-sourced. Also feel free to drop us an email :)
You have a large corpus of similar workload patterns. Simply using a 750GB trace or many samples from a cluster of machines doing the same work does not make for a meaningful variety. It would be reasonable to have also tested with the traces that your comparison algorithms used, which would show that your implementations are valid and how they differ on those authors' terms. Skipping that may come across as cherry-picking, because I find that QDLP underperforms by 20% on zigzag, 15% on DS1, and 33% on blockchain mining. Your algorithm is good in specific cases, like many others are, but there are many real workloads where it is not usable.
Implementing FIFO-Reinsertion should be easy; adding quick demotion would take some effort. FIFO-Reinsertion would also allow Redis to avoid random sampling and to handle TTLs better.
Yes! TwoQ is very similar to FIFO-LP-QD, but we show that we need a smaller queue. This paper is meant to inspire people to investigate other lazy promotion and quick demotion techniques :)
Yes, FIFO-LP-QD is similar, with a smaller queue, and uses FIFO only. But the goal is to (1) inspire other lazy promotion and quick demotion techniques, and (2) show that FIFO-Reinsertion is better than LRU and there is no reason to use LRU any more. :)
Aha, there is someone who cares about this! We did study this: across the over 5000 traces, Belady's anomaly is very common! Our new algorithm does not have the fewest anomalies (CLOCK and LIRS have the fewest), but compared to other algorithms it has a smaller number of anomalies.