The State of Caching in Go (dgraph.io)
201 points by boyter on March 10, 2019 | 49 comments



Great article (or start of article)!

The GitHub stars obsession is too much, though. I read the request in the main content area, then a nag banner popped up at the bottom of the page requesting it again. How annoyingly desperate and uncool.

Enough already, I'm trying to read!

Or was, anyway...

Plus I'm not even logged in to GitHub at present, and I can't log in without getting to another device for 2FA.

No offense, but now I'll likely never give that star.


Arrr... Sorry about that. That was my idea, so my bad. We will get rid of the pop-up banner.


Please do. Same here... read the initial request, but when the lower banner thing popped up also begging for a star my perception of that was negative.

Clicking on the [x] in the corner of the pop-up (to get rid of it) then launched a new window (well, a new tab in this case).

Didn't read the rest of the post.


Do you mind sharing what browser/OS you were using? Clicking on the [x] shouldn't do that.

(I'm the person who implemented that widget.)


On Firefox on Android, the X opens GitHub.


Thanks, will look into that.


Thanks, matey.


(co-author here) Thanks for sharing this post! We've elaborated on the various issues we encountered trying to use a concurrent LRU cache in Dgraph, and our dissatisfaction with existing choices, which fail to provide memory management, high concurrency, and good hit ratios.

With guidance from Ben Manes, we're planning to work on a new concurrent Go cache based on Caffeine (Java). If you're interested in helping (or already have something similar), do reach out to us!

Meanwhile, check out Dgraph, our distributed graph database which is what keeps us busy: https://github.com/dgraph-io/dgraph


What are you thinking about for the API signatures? Without beating a dead horse, this is where Go's lack of generics really becomes burdensome.

If it were an on-disk cache, []byte would likely make sense, but given it's in memory I'm not sure. If you use interface{}, you'll need to measure that cost as well.


As mentioned down in this thread, the key-val would be bytes-bytes or string-bytes. While this is more expensive in terms of CPU, it has the advantage that the sizes of elements in the cache are precisely understood and do not incur the cost of GC sweeps.

When deserialized, in-memory structs take up more memory on the heap, and it gets hard to compute how much -- we've tried that in vain in the past.

Having said that, there's still a general need to store pointers as values and that's something we'll look at once this byte-based implementation is ready.
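
For illustration, a minimal sketch of that bytes-bytes shape (the names here are hypothetical, not the final API):

    package cache

    // Cache is a hypothetical sketch of a bytes-keyed, bytes-valued
    // cache. An entry's size is exactly len(key)+len(val), so memory
    // accounting is precise and the GC has no pointer graph to
    // traverse inside cached values.
    type Cache interface {
        // Set stores val under key.
        Set(key, val []byte)

        // Get returns the stored bytes and whether the key was present.
        Get(key []byte) ([]byte, bool)

        // Del removes key from the cache.
        Del(key []byte)
    }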


Our current plan is to store []byte, given that badger stores blobs of bytes. Using interface{} may have some overhead, but we are not so worried about that right now.


So the []byte solution requires running through a serialization step, which for most in-memory uses will be expensive. Did you choose that because in your use case it eventually ends up serialized anyway?


Correct!


I'm a go novice, and don't have any particular knowledge about dgraph, but is this a case where auto-generated code could be a good solution? (Assuming it's a bounded set of types)


I don't see why it's a problem, caching solutions like Redis also store bytes. If you cache a JSON blob you don't have much choice.


For an in-memory cache (i.e. one not going to disk/network) you have to unnecessarily serialize your in-memory representation to bytes. This is very frequently the most expensive (computational) step in a system.

If you are just going to send that stuff to the NIC/disk eventually anyway, it's perfectly fine (which is why something like Redis, an out-of-process cache, can get away with it).
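
To make that cost concrete, a toy sketch using encoding/json and a plain map as a stand-in for a bytes-bytes cache (the struct and key are made up):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // User stands in for whatever rich struct you hold in memory.
    type User struct {
        ID   int64
        Name string
    }

    func main() {
        cache := map[string][]byte{} // stand-in for a bytes-bytes cache

        // Set: serialize first; often the dominant CPU cost.
        buf, _ := json.Marshal(User{ID: 1, Name: "ada"})
        cache["user/1"] = buf

        // Get: pay the deserialization cost again on every hit.
        var u User
        _ = json.Unmarshal(cache["user/1"], &u)
        fmt.Println(u.Name)
    }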


Very cool, I have missed Caffeine when making stuff in Go!


You might want to take a look at https://github.com/LiveAsynchronousVisualizedArchitecture/si...

It is a completely lock-free key-value store, though it does not have eviction built in.

Its memory management is done using a free list of blocks, which is part of how it maintains lock-free concurrency.


Cache hit ratio increases with the size of the cache and decreases with the number of "hot" items, regardless of the cache eviction policy - LRU, FIFO, etc. The hit ratio reaches 100% when the cache becomes large enough to hold all the "hot" data. It would be great to see https://github.com/VictoriaMetrics/fastcache in the benchmark results. It is faster and uses less memory compared to other solutions for caching the same amount of data.


I think it is a well put together analysis, but I feel I'm missing something.

> FreeCache and GroupCache reads are not lock-free and don’t scale after a point (20 concurrent accesses). (lower value is better on y axis)

If you look at the graphs, they level out and seem to work best with more concurrent load, the exact opposite of what they say. Am I missing something?

Also, I think it would be valuable to see higher concurrent requests: with what I'm used to, 60 concurrent requests would be low - I'm interested in maybe 1-5k concurrent requests.


I agree, those graphs were pretty confusing. As far as I can tell, they are not showing the average time each operation takes (i.e. average elapsed wall time); they must be taking the length of the test and dividing by the number of operations. It would be much easier to parse if they showed operations/second instead.

In other words, 9 concurrent pregnant mothers would show 1 month/baby on those charts.

If I have that right, their comment makes more sense. After 20 concurrent requests, the overall throughput of operations/second does not increase with additional concurrency.


It's true that the latency of each operation won't change much as we increase concurrent accesses, but the amortized latency would (be expected to) decrease, as the graphs show. When we ran the benchmarks, the Go benchmark framework gave us ns/op, which we used directly to plot the graphs. We could calculate ops/sec too, which is just one more step (throughput = 1/(time per op)), and it may be clearer to understand. The graphs would still flatten out after 20 concurrent accesses.
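
For what it's worth, the conversion is just a reciprocal. A tiny sketch (the ns/op value is a made-up example):

    package main

    import "fmt"

    func main() {
        // `go test -bench` reports ns/op; throughput is the
        // reciprocal: ops/sec = 1e9 / (ns per op).
        nsPerOp := 250.0 // hypothetical benchmark output
        fmt.Printf("%.0f ops/sec\n", 1e9/nsPerOp) // 4000000 ops/sec
    }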


We just updated the graphs to show throughput numbers instead. Should be clearer now.


The post speaks about Go not having a thriving ecosystem of packages. My view is that there are actually a lot of Go packages out there, but Go's indifference to packages, and the fact that it is only now getting an "official" package capability (RIP dep), have contributed to slower growth of the ecosystem. NPM has had many, many growing pains, but I sure do love the ease of use and discoverability of packages. I sincerely hope Go catches up. I love the language and tooling.


Go has lots of packages. They are easily discoverable on godoc.org or just some googling.

There is a cultural skepticism about the overuse of packages. There are occasions where the standard library copied a function instead of adding an import, for example. Like many things in Go, this pushback is needed but sometimes goes too far.

But maybe single-function packages in npm were a bad idea.

Anyway, caching is a mixed bag in my opinion. Local per-thread caching is often better than a global cache, as it avoids contention. It's also trivial to implement with a map and requires no special coordination.

It does require rethinking how you design a solution to a problem. FWIW that design process tends to lead you down a better direction anyway for producing distributed systems.

For example one giant, randomly distributed Kafka topic with a global redis db for a cache is probably a lot worse off than a system with more predictable data locality on the consumers.


(co-author here) The number of Go libraries tends to be just slightly below the number of their users (dramatically speaking).

Libraries are hardened by repeated usage: many different users each bring their own special use cases and improve the library until it reaches production quality. Go has no platform where certain well-written libraries can be recommended and given more exposure. Thus, libraries don't tend to mature beyond the specific use case they were written for, assuming they are still being maintained at all.

Arch Linux is a great example of giving well-maintained packages more exposure by promoting a package from the AUR to community to core/extra. That way, more users gather around those packages, improving them even further.

To the second point about per-thread caching, Go does not expose threads to end-users. So, there's no concept of thread-local. What you're describing results in lock striping, which has contention issues as described in the post.


> Go has no platform where certain well-written libraries can be recommended and get more exposure.

Godoc.org is that platform, it serves the purpose. Others in the module universe are under development.

> To the second point about per-thread caching, Go does not expose threads to end-users. So, there's no concept of thread-local. What you're describing results in lock striping, which has contention issues as described in the post.

In Go you'd have to orchestrate this locality yourself, i.e. a fixed worker (goroutine) pool each with its own cache. This is probably less work than it sounds.

Generally, Go does encourage you to author solutions to your specific problem, rather than adapting a general-purpose library. This is almost as much a part of the ethos of Go as implicitly-satisfied interfaces, or "share memory by communicating", or any of the other proverbs. Maybe Go takes it too far. But effectively no other language exists at this point on the spectrum, and I'm happy that we have at least some representation over here; I think lots of programmers live here, too, and appreciate the tradeoffs.


> Godoc.org is that platform, it serves the purpose.

Godoc.org is great. But (beyond documentation) it is at best a search engine, providing an equal platform to all libraries, however production-ready or broken they might be. Unless I'm missing something, it does not intend to promote certain libraries over others the way AUR -> community -> core works in Arch (which is the model I think is missing in Go).

> In Go you'd have to orchestrate this locality yourself, i.e. a fixed worker (goroutine) pool each with its own cache.

Having many small caches within the same process would result in more misses per key, which, if those misses turn into disk accesses, would not be ideal and might be worse than the contention.

Moreover, being able to spin up goroutines as and when required to branch a big job into smaller tasks is the beauty and benefit of Go compared to languages like C++ or Java, where you must start a thread pool upfront and ship tasks off to it.


>the same way as AUR -> community -> core works in Arch (which is the model I think is missing in Go).

Arch packager here.

It doesn't quite work like that. The distinction between community and extra is largely based on who packages it; it used to be defined as above 5% usage. The distinction between those and core is that core is considered essential to the distribution. No package is going to go from AUR to core without replacing some integral part of the system.


I see. Then I guess there's a different model that could more accurately capture the idea: giving certain well-written, actively maintained, popular packages a special platform to make it more enticing for the wider community to adopt them.

The underlying issue is that a production-ready library doesn't just happen: it needs one or a few initial authors and (along with a large group of users) a group of dedicated power users who continue to maintain the library and optimize it for their usage, in turn developing it enough to be considered production ready.


> Thus, they don't tend to mature more than the specific use case they get written for, provided they are still being maintained.

This is all a bit abstract. There are plenty of Go libraries which are suitably mature for production use, for example Google's cloud libraries or Amazon's AWS libraries. There's a platform for communicating issues and releases: GitHub. I guess I haven't really had an issue with it... The only problem I've had is the messiness of versioning, but there have been solutions to that for a long time (godep, govendor, glock, etc...).

> To the second point about per-thread caching, Go does not expose threads to end-users. So, there's no concept of thread-local. What you're describing results in lock striping, which has contention issues as described in the post.

Sorry for the confusion. I did not mean thread-local variables. I meant breaking up your workload across multiple goroutines so that you don't have to use shared state.

For example, suppose you receive a message with an org ID. You start 8 workers; all data for (id % 8 == 0) goes to worker 0, (id % 8 == 1) goes to worker 1, and so on.

Each worker maintains a local map of data and no mutex is needed to coordinate access, because only one goroutine can use the data at a time. So there's no lock-striping and no contention issues.

This is the recommended approach for Go concurrency:

> Do not communicate by sharing memory; instead, share memory by communicating.

I understand that this sort of approach doesn't work for every problem; there's not always an easy way to split up the data like this (you might end up with hotspots, for example). But I feel like that conclusion needs to come after a bit of reflection, instead of reflexively reaching for a global object + mutex. (Not saying that's what you guys are doing... just a common issue I've seen working on distributed systems with developers new to Go.)
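
A minimal sketch of that sharding pattern (the message shape and payloads are made up):

    package main

    import (
        "fmt"
        "sync"
    )

    type msg struct {
        orgID int
        data  string
    }

    func main() {
        const workers = 8
        chans := make([]chan msg, workers)
        var wg sync.WaitGroup

        for i := range chans {
            chans[i] = make(chan msg)
            wg.Add(1)
            go func(in <-chan msg) {
                defer wg.Done()
                // Local to this goroutine: no mutex, no contention.
                cache := map[int]string{}
                for m := range in {
                    cache[m.orgID] = m.data
                }
                fmt.Println("cached", len(cache), "orgs")
            }(chans[i])
        }

        // Route by org ID so a given ID always lands on the same
        // worker: state is partitioned, not shared.
        for id := 0; id < 100; id++ {
            chans[id%workers] <- msg{orgID: id, data: "payload"}
        }
        for _, c := range chans {
            close(c)
        }
        wg.Wait()
    }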


Not to nitpick here, but Google's cloud libraries and Amazon's AWS libraries work because there's probably a dedicated team of engineers assigned to these projects, considering cloud platforms are supposed to make these giant orgs a ton of money.

When Google needs something, it gets built, corrected, and optimized quickly (e.g. Go's context library). Open source / crowdsourcing doesn't work that way.


A thread-local cache is not going to be as performant, especially with large data sets. There is a huge area between a global Redis DB as a cache and a thread-local cache. At some point you have a large data set that is accessed by many different actors at different intervals. Having each application cache it (and re-cache it if the application goes down) is not always sustainable.


My feeling is that there's a good ecosystem in most areas, but Go discourages you pretty hard from producing collection types of your own, so it's particularly underdeveloped in this area.


It’s the lack of generics that makes code reuse hard in Go.


It is true that Go doesn't lend itself to general packages often. You are either left with interface{} parameters and return values (bye-bye compile-time type checking) or working against an interface like the ol' sort package. That said, I've only run up against this a couple of times in multiple years of working in the language, and it so happens that caching is one of them.


Is there a need for an npm or a rubygems style central repository? Seems like it would help the language out, based on your comments.


I'm not very familiar with Go, but one thing I don't understand is why the caches need to be shared. I wonder, why not just have one cache per thread?


You're probably imagining an async, thread-per-core model. One cache per thread might be reasonable then (although having more, smaller caches decreases hit rate, so might fail their requirement #5).

Go programs are written in a synchronous, thread-per-request model. You'd end up with _tons_ of small, very cold caches and a miserable hit rate.

You could approximate the former in Go by just having an array of N caches and picking from them randomly. This is similar to their "lock striping" with less contention (no stripe is a hot spot) but a lower hit rate.
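
A sketch of that array-of-N-caches idea, using mutex-guarded maps as stand-in caches (names are made up):

    package main

    import (
        "fmt"
        "math/rand"
        "sync"
    )

    type subCache struct {
        mu sync.Mutex
        m  map[string][]byte
    }

    // randomStriped: N independent caches, each op picks one at
    // random. No single lock becomes a hot spot, but a key lives
    // only in the stripe its Set landed on, hence the lower hit rate.
    type randomStriped struct {
        caches []*subCache
    }

    func newRandomStriped(n int) *randomStriped {
        c := &randomStriped{caches: make([]*subCache, n)}
        for i := range c.caches {
            c.caches[i] = &subCache{m: map[string][]byte{}}
        }
        return c
    }

    func (c *randomStriped) pick() *subCache {
        return c.caches[rand.Intn(len(c.caches))]
    }

    func (c *randomStriped) Set(key string, val []byte) {
        s := c.pick()
        s.mu.Lock()
        defer s.mu.Unlock()
        s.m[key] = val
    }

    func (c *randomStriped) Get(key string) ([]byte, bool) {
        s := c.pick()
        s.mu.Lock()
        defer s.mu.Unlock()
        v, ok := s.m[key]
        return v, ok
    }

    func main() {
        c := newRandomStriped(8)
        c.Set("k", []byte("v"))
        _, ok := c.Get("k") // may miss: Get picks a random stripe too
        fmt.Println(ok)
    }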


This was wonderfully simple and concise. Thank you.


Cache invalidation is pretty hard if you need to clear data from local caches for every thread.


That requires N times more memory, where N is the number of threads. 32x more memory??


Given the Zipf distribution, I wonder if just a small LRU cache per thread might not be a terrible thing?

Juggling caches in databases is a challenging thing, and there has been some back-and-forth on best practices. MySQL for the longest time shipped with a query cache. As of MySQL 5.7.20 it was deprecated, and it has now been removed in MySQL 8, largely because it was as likely to hurt you badly as to help you, particularly with a correctly sized InnoDB buffer pool.


Zipf is a little idealistic. It is perfect for a quick analysis as the base case that a cache should excel at. Unfortunately, LRU doesn't, because it can be easily polluted (examples: https://github.com/ben-manes/caffeine/wiki/Efficiency)

A database will often scan many records, which would flush an LRU. Postgres uses small LRU buffer caches in your per-thread model, backed by a larger LRU-like cache. The buffer caches are easily flushed by scans, but protect the shared cache from this noise. That shared cache could probably benefit from a smarter policy, and this is an ongoing topic.


This is assuming that all threads need a global view of the data. When this is not the case, it is sufficient to have many async goroutines ingest on a channel and fill their own local map, lock-free. This is our general strategy for high-performance streaming routines.


Right, fair enough. I only thought of that after I posted. Not used to thinking that memory is limited lol

The author did indicate that they had problems with OOM and memory eviction, so it probably is a memory-limited use case.


> LSB 9-16

What is this? Google thinks it's some sort of metal stamp.


Least significant bits


Makes sense now, thanks!



