I don't want the pager to go off just because of some slight non-liveness. That's a likely outcome of high utilization (usually viewed as a good thing, isomorphic with low cost). If you're running really hot and a few tasks are shedding load by playing dead intermittently, that's OK up to a point; if a large portion of pods are doing that at a high rate, that might be bad. You might not even alert on it, just throw it up on a dashboard as informative indicator for operators.
> “ I don't want the pager to go off just because of some slight non-liveness.“
That’s just bad engineering. Really, one should want the pager to go off for that and be really pedantic to actually sniff out the root cause and actually fix it.
Hiding that type of issue by letting something like liveness/readiness policy tacitly conceal it is just going to result in a far worse or more systemic issue later with far worse pager disruptions to your life.
You’re skipping flossing every now and then only to need serious root canals later.