Consider the request sequence you gave
a, a, b, b, c, c, d, d ...
and a cache size of 2. When the first c arrives, the cache is in the following state (left is the head, right is the tail; new objects are inserted at the head):
b(1) a(1)
and it will trigger an eviction, which requires resetting 2 (i.e., O(m)) objects, with the following state changes:
b(1) a(1) -> b(1) a(0) -> b(0) a(0) -> c(0) b(0)
the next c will change the bit to 1, and we will have
c(1) b(0).
When d arrives, b will be evicted, and we have
d(0) c(1)
Note that we only check/reset one object this time instead of O(m).
Therefore, in the worst case, between two O(m) evictions you need O(m) requests to set the m objects to "visited", which is strictly no more work than an LRU cache.
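To make this concrete, here is a minimal Python sketch of the variant I am describing (the class and names are just for illustration): on a miss with a full cache, it clears hit bits from the tail toward the head until it finds a bit that is already 0, then evicts the tail object and inserts the new one at the head.

    from collections import OrderedDict

    class HitBitFIFO:
        def __init__(self, capacity):
            self.capacity = capacity
            # key -> hit bit; first entry is the tail (oldest), last entry is the head (newest)
            self.queue = OrderedDict()

        def request(self, key):
            if key in self.queue:
                self.queue[key] = 1                   # hit: set the bit, do not move the object
                return "hit"
            if len(self.queue) >= self.capacity:
                for k, bit in self.queue.items():     # scan from tail toward head
                    if bit == 0:
                        break                         # stop at the first unvisited object
                    self.queue[k] = 0                 # reset the visited bit
                self.queue.popitem(last=False)        # evict the tail (oldest) object
            self.queue[key] = 0                       # insert the new object at the head
            return "miss"

Replaying the sequence above reproduces the states I listed (printed head-first to match the notation):

    cache = HitBitFIFO(2)
    for key in "aabbccdd":
        cache.request(key)
        print(key, "->", list(cache.queue.items())[::-1])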
Ok, I see where the difference is now. You are clearing the hit bit for all items marked as hit on a single eviction. I mean, yes, I guess that means that during the following requests you don't have to check all items in the queue, but now you have a different problem: you evict items prematurely. With a cache of size 2 I can't show it, but consider a cache of size 3 and the following pattern: a, a, b, b, c, b, d, e
When d arrives, the state changes c(0) b(1) a(1) -> d(0) c(0) b(0).
When e arrives, the state changes d(0) c(0) b(0) -> e(0) d(0) c(0).
Note how b got evicted from the cache even though it was requested more recently and more frequently than c, because the eviction triggered by d cleared out b's hit bit.
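Here is a quick standalone replay of that pattern (Python, using the same clear-bits-then-evict-the-tail reading of your algorithm as the sketch above; the names are illustrative):

    from collections import OrderedDict

    def replay(requests, capacity):
        q = OrderedDict()                              # key -> hit bit; first entry is the tail
        evicted = []
        for key in requests:
            if key in q:
                q[key] = 1                             # hit: set the bit in place
                continue
            if len(q) >= capacity:
                for k, bit in q.items():               # scan from tail toward head
                    if bit == 0:
                        break
                    q[k] = 0                           # clear visited bits along the way
                evicted.append(q.popitem(last=False)[0])   # evict the tail
            q[key] = 0                                 # insert the new object at the head
        return evicted

    print(replay("aabbcbde", 3))                       # ['a', 'b']: b is evicted before c ever is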
Now you traded off hitrate vs latency. But the worst-case latency is still O(n), it just can't happen multiple times in a row anymore.
Imagine you have a cache of 1 million items, all marked as hit, and you get a new request whose eviction has to scan and reset all 1 million items. You gained latency for some subsequent requests but increased it for the current one. Crucially, you also just cleared out the hit marker of the whole cache.
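To put a rough number on it (same sketch of the algorithm as above; the size is just the example from this paragraph):

    from collections import OrderedDict

    def resets_for_one_eviction(n):
        q = OrderedDict((f"k{i}", 1) for i in range(n))   # n cached items, every hit bit set
        resets = 0
        for k, bit in q.items():                          # the single miss scans tail -> head
            if bit == 0:
                break
            q[k] = 0
            resets += 1
        q.popitem(last=False)                             # evict the tail
        q["new"] = 0                                      # insert the new object
        return resets

    print(resets_for_one_eviction(1_000_000))             # 1000000 bits cleared by one request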
Any eviction algorithm has some adversarial workload, so a simpler algorithm is better: you know what that workload is and whether it shows up in production. On the other hand, a complicated one only makes this harder to reason about, and the failure of a cache is often catastrophic. :)
I can't agree with that. More sophisticated algorithms don't necessarily have to make things worse. In fact I think the opposite is true. Simple algorithms are the ones most likely to fail catastrophically because they are made for a certain range of access patterns.
You don't always know the workload beforehand, so you can't always pick the appropriate cache policy upfront. Workloads can even change during runtime.
I haven't seen a simple cache algorithm that performs well in every workload and I believe any that wants to succeed needs to incorporate different sub-algorithms for different workloads and be adaptive.
Have you experienced workload changes that caused an algorithm to be less effective? I am happy to learn about it.
From my experience and study of traces collected in the past, I haven't seen such cases. Of course, there are diurnal changes and sometimes short abrupt changes (e.g., failover, updating a deployment), but I have not seen a case where a workload change requires a different algorithm, e.g., a switch from LRU-friendly to LFU-friendly.
Having an ideal adaptive algorithm is certainly better, and I do observe that different workloads favor different algorithms, but all adaptive algorithms have parameters that are not optimal for a portion of workloads and are also hard to tune (including ARC).
If you give me an algorithm, I can always find a production workload for you that the algorithm does not work well.
"Simple algorithms are the ones most likely to fail catastrophically because they are made for a certain range of access patterns" I do not see how FIFO and LRU fail catastrophically except on scanning/streaming workloads. Happy to learn about it.
The classic example of a changing workload is a database that has a normal workload resembling a Zipf or bell-shaped distribution, but then someone runs exports that do full scans over large parts of the DB. These changes in workload can be intermittent, but they exist. A very similar and well-known workload change is the page cache for filesystems (e.g., a backup doing a full scan over the disk and evicting everything from the cache), which is why admission policies were invented in the first place. So you have two different algorithms already, just maybe not adaptive ones.
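A rough sketch of what that kind of shift looks like as a request stream (Python; the row counts and parameters are made up, and the Pareto draw is only a Zipf-ish approximation):

    import random

    def mixed_workload(normal_requests=50_000, table_rows=100_000, seed=42):
        rng = random.Random(seed)
        # Normal phase: heavy-tailed popularity, a few hot rows dominate.
        hot = [f"row:{min(int(rng.paretovariate(1.2)), table_rows)}"
               for _ in range(normal_requests)]
        # Export/backup phase: every row is touched exactly once, pushing the hot set out of the cache.
        scan = [f"row:{i}" for i in range(1, table_rows + 1)]
        return hot + scan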
A friend of mine works on an online-shop / distribution system that during most of the day has a Zipf-like distribution, but after normal working hours, when retail shops close, he gets completely different patterns as the shops push their data for the day into the system and then all kinds of BI tools run for a few hours.
Of course adaptive caches are never completely optimal, because it takes time for them to learn which workload is currently happening, but after a while they can get very close. A good adaptive cache should not need tuning to get good results, because the whole reason for it to be adaptive is to tune itself. Of course there are quite a few adaptive caches, especially the first ones that pioneered this, that still needed tuning.
I don't understand the last question. You say you can't see how FIFO can fail catastrophically and at the same time state a case where it can. Scanning workloads are something that shows up in the real world all the time. NovaX seems to have run your algorithm through some traces where it had a near-zero hitrate. It can be more important to avoid these extreme cases than to have a slightly better hitrate in the normal case.