Hi Ben, can you provide more details about the shifting workload? Every algorithm has weaknesses, and finding them in complex algorithms is often harder; however, this does not mean simple algorithms are bad. I like W-TinyLFU: it has slightly better results than LIRS (which also uses a 1% window) while being much simpler and more robust. However, we find that the adaptive frequency is not robust on large datasets: a high-frequency object in the main SLRU can stop many objects from entering the main cache. Happy to chat more 1-on-1 if you are interested.
To clarify, the 1% window was a heuristic that we recommended in the first paper as a good default for most workloads. However, we showed that sometimes a larger window was needed and left resolving that as future work.
The follow-up paper, which you also cite, explored adaptive techniques to resolve this deficiency. There we introduced the idea of using hill climbing: monitoring the change in hit rate over a sample period and adjusting the step direction accordingly. Caffeine incorporates this with a step size decay to allow for convergence. The implementation that you have does not include this improvement, but it is very simple to add and works surprisingly well.
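For reference, a minimal sketch of that climber in Java might look like the following; the names and constants are illustrative assumptions, not Caffeine's actual code.

    // A minimal sketch of a hit-rate hill climber with step size decay.
    // The sample size, decay factor, and initial step are assumed values.
    final class HillClimber {
      private static final int SAMPLE_SIZE = 10_000;  // requests per sample period (assumed)
      private static final double DECAY = 0.98;       // shrink the step after each adjustment (assumed)

      private double stepSize = 0.0625;  // signed step, as a fraction of the cache size (assumed)
      private double previousHitRate;
      private int hits;
      private int requests;

      /** Records an access; returns the window adjustment to apply, or 0 while still sampling. */
      double record(boolean hit) {
        requests++;
        if (hit) {
          hits++;
        }
        if (requests < SAMPLE_SIZE) {
          return 0.0;
        }

        double hitRate = (double) hits / requests;
        if (hitRate < previousHitRate) {
          stepSize = -stepSize;  // the last move hurt the hit rate, so reverse direction
        }
        stepSize *= DECAY;       // decay toward zero so the sizing converges

        previousHitRate = hitRate;
        hits = 0;
        requests = 0;
        return stepSize;
      }
    }

The sampled hit rate decides whether to keep moving in the same direction or reverse, and the decay shrinks the step so the configuration settles once it stops finding improvements.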
Yes, I think that is my main concern: research papers often do not disclose the weaknesses of their approaches or the opposing tradeoffs. There is no silver bullet.
The stress workload that I use chains corda-large [1], 5x loop [2], and corda-large again at a cache size of 512 entries (6M requests). This shifts from a strongly LRU-biased pattern to an MRU one, and then back again. My solution was to use hill climbing, sampling the hit rate, to adaptively size the admission window (aka your FIFO) and thereby reconfigure the cache region sizes. You already have similar code in your CACHEUS implementation, which built on that idea to apply it to a multi-agent policy.
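To connect the two pieces, a rough sketch of applying the climber's output to rebalance the regions could look like the code below, building on the HillClimber sketch above. The 1% default window, the 80% protected split, and the rebalancing helpers are assumptions for illustration rather than Caffeine's internals.

    // Hypothetical wiring of the climber into region sizing.
    final class AdaptiveCache {
      private final long maximum;       // total capacity
      private long windowMaximum;       // capacity of the admission window (the FIFO)
      private long protectedMaximum;    // capacity of the main region's protected segment
      private final HillClimber climber = new HillClimber();

      AdaptiveCache(long maximum) {
        this.maximum = maximum;
        this.windowMaximum = Math.max(1, maximum / 100);            // 1% default window
        this.protectedMaximum = (maximum - windowMaximum) * 4 / 5;  // 80% of main is protected (assumed)
      }

      void onAccess(boolean hit) {
        double adjustment = climber.record(hit);
        long amount = (long) (adjustment * maximum);
        if (amount > 0) {
          increaseWindow(amount);    // grow the window at the expense of the protected region
        } else if (amount < 0) {
          decreaseWindow(-amount);   // shrink the window, giving space back to the protected region
        }
      }

      private void increaseWindow(long amount) {
        long delta = Math.min(amount, protectedMaximum);
        protectedMaximum -= delta;
        windowMaximum += delta;
        // ... demote entries from the main region into the window to fill the new space
      }

      private void decreaseWindow(long amount) {
        long delta = Math.min(amount, windowMaximum - 1);
        windowMaximum -= delta;
        protectedMaximum += delta;
        // ... promote or evict window entries as the window shrinks
      }
    }

When the workload turns MRU-friendly the climber keeps growing the window, and when it turns LRU-friendly again the window shrinks back toward the 1% default.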
Caffeine slightly adjusts the frequency comparison for admission to allow ~1% of losing warm candidates to enter the main region. This is to protect against a hash flooding attack (HashDoS) [3]. That isn't intended to improve or correct the policy's decision making, so it should be unrelated to your observations, but it is an important change for real-world usage.
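As a hedged illustration of that tweak, not Caffeine's exact code, the admission check might look like this; the warm threshold, the 1-in-100 draw, and the sketch interface are assumed values.

    // The candidate wins on frequency as usual, but roughly 1% of warm-but-losing
    // candidates are admitted anyway so an attacker cannot deterministically pin
    // artificially "hot" victims in the main region.
    final class TinyLfuAdmittor {
      private static final int WARM_THRESHOLD = 6;  // only candidates at least this frequent may win by luck (assumed)

      private final FrequencySketch sketch;
      private final java.util.Random random = new java.util.Random();

      TinyLfuAdmittor(FrequencySketch sketch) {
        this.sketch = sketch;
      }

      /** Returns whether the candidate should replace the main region's victim. */
      boolean admit(long candidateKeyHash, long victimKeyHash) {
        int candidateFreq = sketch.frequency(candidateKeyHash);
        int victimFreq = sketch.frequency(victimKeyHash);
        if (candidateFreq > victimFreq) {
          return true;
        }
        if (candidateFreq >= WARM_THRESHOLD) {
          return random.nextInt(100) == 0;  // ~1% of losing warm candidates still get in
        }
        return false;
      }

      /** Minimal frequency sketch interface assumed for this example. */
      interface FrequencySketch {
        int frequency(long keyHash);
      }
    }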
I believe LIRS2 [4] adaptively sizes its LIR region, but I do not recall the details, as it is a complex algorithm. It did very well across different workloads when I tried it out, and the authors were able to make a few performance fixes based on my feedback. Unfortunately, I find LIRS algorithms too difficult to maintain in an industry setting because, while excellent, the implementation logic is not intuitive, which makes it frustrating to debug.