
"To avoid this situation, there is a termination logic in the Executor processes whereby an Executor process terminates itself as soon as three consecutive heartbeat calls fail. Each heartbeat timeout is large enough to eclipse three consecutive heartbeat failures. This ensures that the Store Consumer cannot pull such tasks before the termination logic ends them—the second method that helps achieve this guarantee."

Neither this nor the first method guarantees that a job is never executed concurrently. A long GC pause or VM migration right after the second heartbeat check could let the job get rescheduled due to the timeout. The first worker could then resume, believing it still has one heartbeat left before it must give up on the job, even though the job has already been handed out to another worker in the meantime.
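To make the race concrete, here is a minimal sketch in Python. It is not their implementation: the `Store` class, the `simulate` function, and the interval and timeout values are all made up for illustration. It only assumes what the quote states, namely that an executor gives up after three consecutive failed heartbeats and that the store's timeout is longer than three heartbeat intervals.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical numbers, chosen only so the lease outlasts three heartbeats.
HEARTBEAT_INTERVAL = 5   # seconds between heartbeat attempts
MAX_FAILURES = 3         # executor terminates after this many consecutive failures
LEASE_TIMEOUT = 20       # store re-queues a task after this long without a heartbeat


@dataclass
class Store:
    """Toy stand-in for the task store: tracks one task's owner and lease."""
    owner: Optional[str] = None
    last_heartbeat: float = 0.0

    def claim(self, worker: str, now: float) -> bool:
        # The task can be claimed if nobody owns it or the lease has expired.
        if self.owner is None or now - self.last_heartbeat > LEASE_TIMEOUT:
            self.owner = worker
            self.last_heartbeat = now
            return True
        return False


def simulate(pause_seconds: float) -> bool:
    """Return True if workers A and B end up running the task at the same time."""
    store = Store()
    assert store.claim("A", now=0.0)  # A owns the task; lease starts at t=0

    # Two heartbeats in a row fail, so the store has not heard from A since t=0.
    failures = 2
    now = 2.0 * HEARTBEAT_INTERVAL

    # A's local check passes: it believes it still has one heartbeat left.
    a_continues = failures < MAX_FAILURES

    # A is frozen right here (GC pause, VM migration). Its own clock does not
    # advance, so it counts no further failures, but the store's clock does.
    now += pause_seconds

    # The store sees an expired lease and hands the task to B.
    b_claimed = store.claim("B", now=now)

    # A resumes and keeps working based on the stale check it made before the pause.
    return a_continues and b_claimed


print(simulate(pause_seconds=30))  # pause outlasts the lease
print(simulate(pause_seconds=5))   # pause ends before the lease expires
```

With the long pause, both workers hold the task. With the short one, B's claim is rejected. The point is that the executor's decision is based on a local check that can be arbitrarily stale by the time it acts on it.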




I bet they’ve thought this through, given a system that operates at the scale they say it does.

But often with technical blogs like this, you get a “dumbed-down” version that is inaccurate but summarizes in a few minutes what is essentially many person-years of work.



