
> however the article doesn't discuss the implementation of the Promise.

It's not a "thundering herd" problem either. The thundering herd is classically a scheduling problem.

You have 100 threads waiting on a resource (classically: a mutex). The mutex unlocks, which causes all 100 threads to wake up. You KNOW that only one thread will win the mutex, so 99 of the threads wasted CPU time waking up. When that thread is done, 98 threads will wake up (again, all wasting CPU time, because only one can win).

Solving the thundering herd requires your scheduler to know all the resources that could be blocking a thread. The scheduler then only wakes ONE thread up at a time in these cases.
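To put numbers on the waste, here's a toy model of the two scheduling strategies (not real scheduler code; the function names are just for illustration):

```python
def herd_wakeups(waiters: int) -> int:
    """Total wakeups under a naive scheduler: every unlock wakes ALL
    remaining waiters, but only one of them can win the mutex."""
    total = 0
    while waiters > 0:
        total += waiters   # everyone wakes up...
        waiters -= 1       # ...but exactly one wins; the rest sleep again
    return total

def single_wakeups(waiters: int) -> int:
    """A herd-aware scheduler wakes exactly one waiter per unlock."""
    return waiters

# 100 waiters: 100 + 99 + ... + 1 = 5050 wakeups versus just 100
```

This is why the classic fix is in the scheduler (e.g. waking one waiter per unlock rather than broadcasting to all of them), not in the application.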

-----------

I'm not entirely sure what the problem in the blog should be named, but it definitely isn't a "thundering herd". I will admit that it's a similar-looking problem, though.




Thundering herd has referred to demand spikes in service architectures for at least 8 years[0], probably much longer.

0. https://qconsf.com/sf2011/dl/qcon-sanfran-2011/slides/Siddha...


Hmm, the Netflix presentation there seems to make sense superficially though.

The key attribute of the "Thundering Herd" problem is the LOOP. The Thundering Herd causes another Thundering Herd... which later causes another Thundering Herd. In the Netflix presentation, the "Thundering Herd" causes all of the requests to time out, which causes two new servers to be added ("automatic scale up"), then everyone tries again.

When everyone tries again, there are more people waiting, so everyone times out AGAIN, which causes everything to shut down, two more servers to be added to scale up, and the cycle to start over. Etc. etc. It's a cascading problem that gets worse each loop. You solve the thundering herd not by adding more resources (that actually makes the problem worse!!), but by cutting off the feedback loop somehow.
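One common way to cut that feedback loop is retry backoff with jitter, so a synchronized wave of timeouts doesn't come back as a synchronized wave of retries. A minimal sketch (function name and parameters are mine, not from the Netflix presentation or the blog):

```python
import random

def retry_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with "full jitter".

    Spreading each client's retry uniformly over [0, backoff) means the
    clients that all timed out together do NOT all come back together,
    which is exactly the synchronization that feeds the loop.
    """
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)
```

The cap matters too: without it, later attempts would back off essentially forever instead of settling at a bounded retry rate.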

The problem discussed in the blog post has no feedback loop. It's simply a problem that happens once on startup.


Very good point, thanks for clarifying.


System start is indeed a synchronization point, and for limited resources like the ones here it's painful, and closer to the vernacular use of the term.

Thundering herds can cause escalating and successive failures. That is very much an issue with service start/restart. A bad restart will cause a timeout, another restart, and eventually, restarts on further layers. Imagine all this running above k8s. So yes, this pattern is indeed about one of the failure modes that happen with thundering herds.

Though if your cache needs another cache, that feels like a bad cache. The promise pattern can be done transparently by the cache, coalescing GETs, instead of requiring a user protocol. We do app level caching to stay process-local because latency is fun in GPU land and we are a visual analytics tool... But that is not for the problem shown here.


FWIW I've always heard it described as a thundering herd. Though, your description is spot on, according to Wikipedia [1]. The problem the article discusses is called a cache stampede or dog pile [2].

[1] https://en.wikipedia.org/wiki/Thundering_herd_problem

[2] https://en.wikipedia.org/wiki/Cache_stampede

I wouldn't fault anyone for getting these similar names mixed up though.


It's also possible the Wikipedia article is maintained by someone with a stronger opinion about the definition than would be reflected in the typical person using the term.


I'd hope so!


I agree with you; the article could have omitted the mention of the thundering herd and wouldn't feel any less complete for it.

I think it becomes a thundering herd problem when every request that's a potential cache miss tries to obtain a lock on the Promise for that request. That's likely what the author was trying to get at, and it got lost by over-generalising the problem.



