
> If it is not good for development it is not good for production because ideally your dev and production environment should be the same.

Correct: Your dev environment should also not let you do stuff on the host machines. In a k8s environment, you run everything in pods. Don't compromise on security and operational concerns just because it's a dev environment.

> If you can't login to it then it is not good for development.

You develop inside pods, and you are more than welcome to install any shell and other programs you want inside containers. (Or, for working at the k8s level, you `kubectl apply` or run helm against the k8s API; it doesn't matter what's happening on the host.)
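
For what it's worth, the day-to-day loop looks roughly like this (the file, deployment, and chart names here are made up):

    # everything goes through the API server, nothing touches the host directly
    kubectl apply -f my-dev-app.yaml            # or: helm upgrade --install my-dev-app ./chart
    kubectl exec -it deploy/my-dev-app -- sh
    # inside the container you can install whatever you want, e.g. on Alpine:
    apk add bash curl vim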




Some customers of an internal Kubernetes platform complained that their pods kept getting evicted because their nodes kept running out of disk space. The platform maintainers' first instinct was that the customers' pods should not be writing to ephemeral storage, e.g., should not be writing log files to the local filesystem unless it was mounted from external storage. But that turned out not to be the case: the customers' pods did not write anything to disk at all. So why were the nodes running out of disk space? Prometheus metrics showed which partitions used how much disk space, but could not go into more detail. The team wanted to inspect the node's filesystem to figure out what exactly was using so much space. The first thing they tried was to run a management pod that contained standard tools such as 'df' and that mounted the host's filesystem. Unfortunately, the act of scheduling a pod on that node pushed it into disk pressure, and so the management pod got evicted.
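
Concretely, the attempt was something along these lines (node name made up; `kubectl debug node/...` runs a throwaway pod on the node with the host filesystem mounted at /host):

    kubectl debug node/ip-10-0-1-23.ec2.internal -it --image=busybox
    # then, from inside the debug pod:
    df -h /host
    du -sh /host/var/lib/containerd/*    # exact paths vary by host OS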

So, being dogmatic about "the host should not have any tools installed" is good and all, but how do you debug this scenario without tools on the host?

We eventually figured it out. By logging into the host OS and using the shell tools there.


> So, being dogmatic about "the host should not have any tools installed" is good and all

Less dogma, more the lived experience that letting people log into hosts ends badly. Though I grant there's a cost/benefit trade-off both ways, and there may be edge cases.

> but how do you debug this scenario without tools on the host?

Cordon the node, evict any one pod to free up just enough room, and then schedule your debug pod with a toleration so the scheduler ignores the disk-pressure taint? I confess I've never had to do this, but it seems workable.
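
A minimal sketch of what I mean, untested and with made-up node/pod names (the second toleration is needed because a cordoned node carries the unschedulable taint):

    kubectl cordon ip-10-0-1-23.ec2.internal      # keep anything new off the node
    kubectl delete pod some-noncritical-pod       # free up just enough room
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: node-debug
    spec:
      nodeSelector:
        kubernetes.io/hostname: ip-10-0-1-23.ec2.internal
      tolerations:
      - key: node.kubernetes.io/disk-pressure     # the taint added under disk pressure
        operator: Exists
        effect: NoSchedule
      - key: node.kubernetes.io/unschedulable     # because the node is cordoned
        operator: Exists
        effect: NoSchedule
      containers:
      - name: shell
        image: busybox
        command: ["sleep", "3600"]
        volumeMounts:
        - name: host-root
          mountPath: /host
          readOnly: true
      volumes:
      - name: host-root
        hostPath:
          path: /
    EOF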


Sounds a lot like the host OS is generic instead of container-focused, which could be part of the problem.

What was the cause/solution? Images too big?


Actually, no, the host OS was Amazon Bottlerocket, a specifically container-focused OS.

The cause was indeed images being too big. Images (not only the raw images, but also their extracted contents on the filesystem) count towards ephemeral storage too. In their case they couldn't even control the size of the images, because those were supplied by a vendor.
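
Incidentally, the API itself will tell you which images a node is holding and how big they are, no host access needed (node name made up):

    kubectl get node ip-10-0-1-23.ec2.internal \
      -o jsonpath='{range .status.images[*]}{.sizeBytes}{"\t"}{.names[0]}{"\n"}{end}' \
      | sort -rn | head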

The solution was to increase the node's disk space.


Interesting, I use Bottlerocket on my work clusters too. I think we had issues like this with some ridiculous data-tool images that take up gigabytes each, so we just upped the EBS size. Easily done.
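
If the nodes come from eksctl, that's roughly the following, with made-up cluster and nodegroup names (double-check the flags against your eksctl version):

    # create a replacement nodegroup with a bigger volume, then delete the old one
    # (eksctl drains the old nodes before deleting them)
    eksctl create nodegroup --cluster work-cluster --name data-tools-v2 \
      --node-ami-family Bottlerocket --node-volume-size 200
    eksctl delete nodegroup --cluster work-cluster --name data-tools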


Is Talos for running inside pods, or for running on the node? It's not immediately clear from the website.


"Talos: Secure, immutable, and minimal Linux OS for running Kubernetes"

I guess we'll never know...

But in truth it's for running on hosts.



