
> If it is not good for development it is not good for production because ideally your dev and production environment should be the same.

Correct: Your dev environment should also not let you do stuff on the host machines. In a k8s environment, you run everything in pods. Don't compromise on security and operational concerns just because it's a dev environment.

> If you can't login to it then it is not good for development.

You develop inside pods, and you are more than welcome to install any shell and other programs you want inside containers. (Or, for working at the k8s level, you `kubectl apply` or run helm against the k8s API; it doesn't matter what's happening on the host.)
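
For what it's worth, the day-to-day loop looks roughly like this (the file, deployment, and chart names here are made up):

    # everything goes through the API server, nothing touches the host directly
    kubectl apply -f my-dev-app.yaml            # or: helm upgrade --install my-dev-app ./chart
    kubectl exec -it deploy/my-dev-app -- sh
    # inside the container you can install whatever you want, e.g. on Alpine:
    apk add bash curl vim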




Some customers of an internal Kubernetes platform complained that their pods kept getting evicted because their nodes kept running out of disk space. The platform maintainers' first instinct was that the customers' pods should not be writing to ephemeral storage, e.g., should not be writing log files to the local filesystem unless it was mounted from external storage. But that turned out not to be the case: the customers' pods did not write anything to disk at all. So why were the nodes running out of disk space? Prometheus metrics showed which partitions used how much disk space, but could not go into more detail. The team wanted to inspect the node's filesystem to figure out what exactly was using so much space. The first thing they tried was to run a management pod that contained standard tools such as 'df' and that mounted the host's filesystem. Unfortunately, the act of scheduling a pod on that node pushed it into disk pressure, and so the management pod got evicted.
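
Concretely, the attempt was something along these lines (node name made up; `kubectl debug node/...` runs a throwaway pod on the node with the host filesystem mounted at /host):

    kubectl debug node/ip-10-0-1-23.ec2.internal -it --image=busybox
    # then, from inside the debug pod:
    df -h /host
    du -sh /host/var/lib/containerd/*    # exact paths vary by host OS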

So, being dogmatic about "the host should not have any tools installed" is good and all, but how do you debug this scenario without tools on the host?

We eventually figured it out. By logging into the host OS and using the shell tools there.


> So, being dogmatic about "the host should not have any tools installed" is good and all

Less dogma, more the lived experience that letting people log into hosts ends badly. Though I grant there's a cost/benefit trade-off both ways, and there may be edge cases.

> but how do you debug this scenario without tools on the host?

Cordon the node, evict any one pod to free up just enough room, and then schedule your debug pod with a toleration so the scheduler ignores the disk-pressure taint? I confess I've never had to do this, but it seems workable.
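
A minimal sketch of what I mean, untested and with made-up node/pod names (the second toleration is needed because a cordoned node carries the unschedulable taint):

    kubectl cordon ip-10-0-1-23.ec2.internal      # keep anything new off the node
    kubectl delete pod some-noncritical-pod       # free up just enough room
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: node-debug
    spec:
      nodeSelector:
        kubernetes.io/hostname: ip-10-0-1-23.ec2.internal
      tolerations:
      - key: node.kubernetes.io/disk-pressure     # the taint added under disk pressure
        operator: Exists
        effect: NoSchedule
      - key: node.kubernetes.io/unschedulable     # because the node is cordoned
        operator: Exists
        effect: NoSchedule
      containers:
      - name: shell
        image: busybox
        command: ["sleep", "3600"]
        volumeMounts:
        - name: host-root
          mountPath: /host
          readOnly: true
      volumes:
      - name: host-root
        hostPath:
          path: /
    EOF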


Sounds a lot like the host OS is generic instead of container-focused, which could be part of the problem.

What was the cause/solution? Images too big?


Actually, no, the host OS was Amazon Bottlerocket, a specifically container-focused OS.

The cause was indeed images being too big. Images (not only the raw images, but also their extracted contents on the filesystem) count towards ephemeral storage too. In their case they couldn't even control the size of the images, because those were supplied by a vendor.
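
Incidentally, the API itself will tell you which images a node is holding and how big they are, no host access needed (node name made up):

    kubectl get node ip-10-0-1-23.ec2.internal \
      -o jsonpath='{range .status.images[*]}{.sizeBytes}{"\t"}{.names[0]}{"\n"}{end}' \
      | sort -rn | head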

The solution was to increase the node's disk space.


Interesting, I use Bottlerocket on my work clusters too. I think we had issues like this with some ridiculous data-tool images that take up gigabytes each, so we just upped the EBS size. Easily done.
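
If the nodes come from eksctl, that's roughly the following, with made-up cluster and nodegroup names (double-check the flags against your eksctl version):

    # create a replacement nodegroup with a bigger volume, then delete the old one
    # (eksctl drains the old nodes before deleting them)
    eksctl create nodegroup --cluster work-cluster --name data-tools-v2 \
      --node-ami-family Bottlerocket --node-volume-size 200
    eksctl delete nodegroup --cluster work-cluster --name data-tools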


Is Talos for running inside pods, or for running on the node? It's not immediately clear from the website.


"Talos: Secure, immutable, and minimal Linux OS for running Kubernetes"

I guess we'll never know...

But in truth it's for running on hosts.



