
> however the article doesn't discuss the implementation of the Promise.

It's not a "thundering herd" problem either. The thundering herd is classically a scheduling problem.

You have 100 threads waiting on a resource (classically: a mutex). The mutex unlocks, which causes all 100 threads to wake up. You KNOW that only one thread will win the mutex, so 99 of the threads wasted CPU time waking up. When that thread is done, 98 threads will wake up (again, all wasting CPU time, because only one can win).

Solving the thundering herd requires your scheduler to know all the resources that could be blocking a thread. The scheduler then only wakes ONE thread up at a time in these cases.
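To put numbers on the waste, here's a toy model of the two scheduling strategies (not real scheduler code; the function names are just for illustration):

```python
def herd_wakeups(waiters: int) -> int:
    """Total wakeups under a naive scheduler: every unlock wakes ALL
    remaining waiters, but only one of them can win the mutex."""
    total = 0
    while waiters > 0:
        total += waiters   # everyone wakes up...
        waiters -= 1       # ...but exactly one wins; the rest sleep again
    return total

def single_wakeups(waiters: int) -> int:
    """A herd-aware scheduler wakes exactly one waiter per unlock."""
    return waiters

# 100 waiters: 100 + 99 + ... + 1 = 5050 wakeups versus just 100
```

This is why the classic fix is in the scheduler (e.g. waking one waiter per unlock rather than broadcasting to all of them), not in the application.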

-----------

I'm not entirely sure what the problem in the blog should be named, but it definitely isn't a "thundering herd". I will admit that it's a similar-looking problem, though.




Thundering herd has referred to demand spikes in service architectures for at least 8 years[0], probably much longer.

0. https://qconsf.com/sf2011/dl/qcon-sanfran-2011/slides/Siddha...


Hmm, the Netflix presentation there seems to make sense superficially though.

The key attribute of the "Thundering Herd" problem is the LOOP. The Thundering Herd causes another Thundering Herd... which later causes another Thundering Herd. In the Netflix presentation, the "Thundering Herd" causes all of the requests to time out, which causes two new servers to be added ("automatic scale up"), then everyone tries again.

When everyone tries again, there are more people waiting, so everyone times out AGAIN, which causes everything to shut down, two more servers to be added to scale up, and the cycle to start over. Etc. etc. It's a cascading problem that gets worse each loop. You solve the thundering herd not by adding more resources (that actually makes the problem worse!!), but by cutting off the feedback loop somehow.
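One common way to cut that feedback loop is retry backoff with jitter, so a synchronized wave of timeouts doesn't come back as a synchronized wave of retries. A minimal sketch (function name and parameters are mine, not from the Netflix presentation or the blog):

```python
import random

def retry_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with "full jitter".

    Spreading each client's retry uniformly over [0, backoff) means the
    clients that all timed out together do NOT all come back together,
    which is exactly the synchronization that feeds the loop.
    """
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)
```

The cap matters too: without it, later attempts would back off essentially forever instead of settling at a bounded retry rate.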

The problem discussed in the blog post has no feedback loop. It's simply a problem that happens once on startup.


Very good point, thanks for clarifying.


System start is indeed a synchronization point, and for limited resources like the ones here it's painful, and closer to the vernacular use of the term.

Thundering herds can cause escalating and successive failures. That is very much an issue with service start/restart. A bad restart will cause a timeout, another restart, and eventually, restarts on further layers. Imagine all this running above k8s. So yes, this pattern is indeed about one of the failure modes that happen with thundering herds.

Though if your cache needs another cache, that feels like a bad cache. The promise pattern can be done transparently by the cache, coalescing GETs, instead of requiring a user protocol. We do app level caching to stay process-local because latency is fun in GPU land and we are a visual analytics tool... But that is not for the problem shown here.


FWIW I've always heard it described as a thundering herd. Though, your description is spot on, according to Wikipedia [1]. The problem the article discusses is called a cache stampede or dog pile [2].

[1] https://en.wikipedia.org/wiki/Thundering_herd_problem

[2] https://en.wikipedia.org/wiki/Cache_stampede

I wouldn't fault anyone for getting these similar names mixed up though.


It's also possible the Wikipedia article is maintained by someone with a stronger opinion about the definition than would be reflected in the typical person using the term.


I'd hope so!


I agree with you; the article could have omitted the mention of the thundering herd and wouldn't feel any less complete for it.

I think it becomes a thundering herd problem when every request that's a potential cache miss tries to obtain a lock on the Promise for that request. That's likely what the author was trying to get at, and it got lost by over-generalising the problem.



