
This looks like it would be a very useful design; however, the article doesn't discuss the implementation of the Promise.

This is not something memcached or redis support out of the box, as far as I know. It would seem to imply a cache manager service that has its own in-memory table of Promises.




> however the article doesn't discuss the implementation of the Promise.

Its not a "thundering herd" problem either. The Thundering herd is classically a scheduling problem.

You have 100 threads waiting on a resource (classically: a Mutex). The Mutex unlocks, which causes all 100 threads to wake up. You KNOW that only one thread will win the Mutex, so 99 of the threads wasted CPU time as they woke up. When the next thread is done, 98 threads will wake up (again, all wasting CPU time because only one can win).

Solving the thundering herd requires your scheduler to know all the resources that could be blocking a thread. The scheduler then only wakes ONE thread up at a time in these cases.

-----------

I'm not entirely sure what the problem should be named in the blog, but it definitely isn't a "thundering herd". I will admit that it's a similar-looking problem though.


Thundering herd has referred to demand spikes in service architectures for at least 8 years [0], probably much longer.

0. https://qconsf.com/sf2011/dl/qcon-sanfran-2011/slides/Siddha...


Hmm, the Netflix presentation there does seem to make sense, at least on a superficial read.

The key attribute of the "Thundering Herd" problem is the LOOP. The Thundering Herd causes another Thundering Herd... which later causes another Thundering Herd. In the Netflix presentation, the "Thundering Herd" causes all of the requests to time out, which causes two new servers to be added ("automatic scale up"), then everyone tries again.

When everyone tries again, there are more people waiting, so everyone times out AGAIN, which causes everything to shut down, add two more servers to scale up, and start over. Etc. etc. It's a cascading problem that gets worse and worse each loop. You solve the Thundering Herd not by adding more resources (that actually makes the problem worse!!), but by cutting off the feedback loop somehow.

The problem discussed in the blog post has no feedback loop. It's simply a problem that happens once on startup.


Very good point, thanks for clarifying.


System start is indeed a synchronization point, and for limited resources like the ones here, it's painful and closer to the vernacular use of the term.

Thundering herds can cause escalating and successive failures. That is very much an issue with service start/restart. A bad restart will cause a timeout, another restart, and eventually, restarts on further layers. Imagine all this running above k8s. So yes, this pattern is indeed about one of the failure modes that happen with thundering herds.

Though if your cache needs another cache, that feels like a bad cache. The promise pattern can be done transparently by the cache, coalescing GETs, instead of requiring a user protocol. We do app level caching to stay process-local because latency is fun in GPU land and we are a visual analytics tool... But that is not for the problem shown here.


FWIW I've always heard it described as a thundering herd. Though your description is spot on, according to Wikipedia [1]. The problem the article discusses is called a cache stampede or dog pile [2].

[1] https://en.wikipedia.org/wiki/Thundering_herd_problem

[2] https://en.wikipedia.org/wiki/Cache_stampede

I wouldn't fault anyone for getting these similar names mixed up though.


It's also possible the Wikipedia article is maintained by someone with a stronger opinion about the definition than the typical person using the term would have.


I'd hope so!


I agree with you; the article could have omitted the mention of the thundering herd entirely and it wouldn't feel any less complete.

I think it becomes a thundering herd problem when every request that's a potential cache miss tries to obtain a lock on the Promise for that request. That's likely what the author was trying to get at, but it got lost by over-generalising the problem.


"This is not something memcached or redis support out of the box, as far as I know. It would seem to imply a cache manager service that has its own in-memory table of Promises."

Note that it isn't even meaningful to suggest that memcache or Redis should support this, because they aren't responsible for filling cache values for misses. Only something actually generating the value upon the miss can implement this "promise".

And in the general case, you can't serialize promises either, so you can't be "putting a promise into memcached", because memcached only stores bytes. You'd have to wrap a lot of machinery around it, and even then, I personally would say it's the machinery doing it, not memcached.

(I think Redis does have some features that could be pressed into service here for notification, but Redis still wouldn't be doing the actual filling in of the value, since it can't.)


dogpile.cache [1] implements this pattern (or at least a very similar pattern) for both memcache and redis using locks. If the cache value doesn't exist, attempt to acquire the lock and generate the value. If the lock can't be acquired, wait until it frees then check for the value again.

[1] https://dogpilecache.sqlalchemy.org/en/latest/
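
If I'm reading the docs right, the flow is roughly this (sketched here in TypeScript with the ioredis client rather than Python, so this is emphatically not dogpile's actual API; the key names and TTLs are made up for illustration):

    import Redis from "ioredis";

    const redis = new Redis();
    const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

    // Lock-then-poll: one caller regenerates the value, everyone else
    // re-checks the cache until the value shows up (or the lock expires).
    async function getOrCreate(key: string, create: () => Promise<string>): Promise<string> {
      for (;;) {
        const cached = await redis.get(key);
        if (cached !== null) return cached;

        // NX: only one caller gets the lock. EX: the lock auto-expires so a
        // crashed worker can't block everyone forever.
        const gotLock = await redis.set(`lock:${key}`, "1", "EX", 30, "NX");
        if (gotLock === "OK") {
          try {
            const value = await create();
            await redis.set(key, value, "EX", 300);
            return value;
          } finally {
            await redis.del(`lock:${key}`);
          }
        }

        // Someone else holds the lock: wait briefly, then check the cache again.
        await sleep(100);
      }
    }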


Groupcache is one implementation in Go. However, it uses its own in-memory cache on your app servers instead of an external cache service.

https://github.com/golang/groupcache

https://news.ycombinator.com/item?id=6121501

https://talks.golang.org/2013/oscon-dl.slide#43


With Redis it's just a matter of locking the request with an expiring key, and the promise effect can be done with Pub/Sub.
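
Roughly, for the locking half (a sketch assuming the ioredis client; lock/channel names and TTLs are just illustrative): the caller that wins the expiring lock key computes the value and publishes it, and everyone else gets the promise effect by subscribing to that channel instead of polling.

    import Redis from "ioredis";

    const redis = new Redis();

    // Returns true if this caller won the lock and filled the cache,
    // false if another caller is already doing the work.
    async function fillOnce(key: string, compute: () => Promise<string>): Promise<boolean> {
      // Lock the request with an expiring key: NX means only one caller
      // succeeds, EX means a dead worker can't hold the lock forever.
      const won = await redis.set(`lock:${key}`, "1", "EX", 30, "NX");
      if (won !== "OK") return false;

      const value = await compute();
      await redis.set(key, value, "EX", 300);

      // The "promise" side: wake every waiter subscribed to this channel.
      await redis.publish(`ready:${key}`, value);
      return true;
    }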


My reading is that the cache is just another JavaScript service - a proxy that says "on: URL, if (in cache) return val, else fetch, add val to cache, return val."

Changing the cache to store promises is indeed an easy way to deduplicate most requests that occur roughly at the same time.

The logic now is: "on URL: if (in cache) return the cached promise, else create a promise for the fetch, add it to the cache, and return it."
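
Something like this, if I'm reading it right (a minimal in-process sketch; the Map and the URL-keyed getCached helper are placeholders, not anything from the article):

    // In-flight promises keyed by URL: concurrent misses for the same URL
    // share one fetch instead of each hitting the backend.
    const cache = new Map<string, Promise<string>>();

    function getCached(url: string): Promise<string> {
      const existing = cache.get(url);
      if (existing) return existing; // hit, or a miss that's already in flight

      const pending = fetch(url)
        .then((res) => res.text())
        .catch((err) => {
          cache.delete(url); // don't cache failures; let the next caller retry
          throw err;
        });

      cache.set(url, pending); // store the promise itself, not the value
      return pending;
    }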


But what does it mean for a promise to be in the cache? If you're using an external cache, like redis or memcached, you can't just put a JavaScript promise in there and expect everything to work. Neither of those services understands what it means for the promise to resolve so that all the clients waiting on it can continue.

At best you could serialize the promise and then have everyone who was waiting on it poll the cache to see if the promise was resolved yet, but that is pretty inefficient.

You probably want another layer in there that handles the conversion of a JavaScript promise to something that can be awaited on from your cache, like a redis pub/sub. Then in JavaScript land your services just await on a promise as normal but under the covers your service is doing something else which is specific to the cache you're using.
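
Something like a small helper that turns "the Pub/Sub message for this key arrived" into a plain promise (assuming ioredis; the names and timeout are purely illustrative), so the calling code just awaits and never sees the Redis plumbing:

    import Redis from "ioredis";

    // Dedicated connection: once a client subscribes it can't issue normal commands.
    const subscriber = new Redis();

    function waitForKey(key: string, timeoutMs = 5000): Promise<string> {
      return new Promise<string>((resolve, reject) => {
        const channel = `ready:${key}`;

        const onMessage = (msgChannel: string, message: string) => {
          if (msgChannel !== channel) return;
          clearTimeout(timer);
          subscriber.removeListener("message", onMessage);
          resolve(message);
        };

        const timer = setTimeout(() => {
          subscriber.removeListener("message", onMessage);
          reject(new Error(`timed out waiting for ${key}`));
        }, timeoutMs);

        subscriber.on("message", onMessage);
        subscriber.subscribe(channel);
        // A real version would re-check the cache after subscribing, in case
        // the value was published before the subscription took effect.
      });
    }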


I don't think they are using an external cache.


Even if you have 1000 servers that each have a local promise, you may still be helping the backend service by reducing 10,000,000 requests to 1000.


Yeah, some details on implementation would have been welcome.

I'm not sure how often "thundering herd" is actually a problem for your cache, but I could definitely see this being a useful feature to add to a caching system. Even just a way to tell Redis that something will exist there soon and to delay response until it arrives would be nice. (In essence, a Promise.)



