
IMO if you’re concerned about performance and yet are deploying databases this way — mmap should not even be on the radar.



How would containers even hurt performance? How does the database no longer having the ability to see other processes on the machine somehow make it slower?


There are many "holes" in these containers.

1. fsync. You cannot "divide" it between containers. Whoever does it stalls I/O for everyone else (see the sketch after this list).

2. Context switches. Unless you do a lot of configuration outside of the container runtime, you cannot ensure exclusive access to the number of CPU cores you need.

3. Networking has the same problem. You would either have to dedicate a whole NIC or an SR-IOV-style virtual NIC to your database server. Otherwise just the amount of chatter that goes on through the control plane of something like Kubernetes will be a noticeable disadvantage. Again, containers don't help here, they only get in the way: to get that kind of exclusive network access you need more configuration on the host, and possibly a CNI plugin to deal with it.

4. kubelet is not optimized to get out of your way. It needs a lot of resources and may spike, hindering or outright stalling the database process.

5. Kubernetes sucks at managing memory-intensive processes. It doesn't work (well or at all) with swap (which, again, cannot be properly divided between containers). It doesn't integrate well with the OOM killer (it cannot replace it, so any configuration you make inside Kubernetes is kind of irrelevant, because the system's OOM killer will do as it pleases, ignoring Kubernetes).
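
To make point 1 concrete, here's a rough C sketch (the WAL path is made up). fsync() targets one file descriptor, but completing it means committing the filesystem journal and flushing the device write cache, and both of those are shared by every container sitting on the same disk:

    /* Illustrative only: the path is hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/var/lib/db/wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char record[] = "commit\n";
        if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

        /* This call does not just affect this file: the journal commit and
         * the device cache flush behind it stall every other writer that
         * shares the device, whichever container they live in. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }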

---

Bottom line... Kubernetes is lame from an infrastructure perspective. It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in the event of a failure, will come to bite you. You don't want that kind of program near your database.


My background is more Borg than k8s, but…

Always allocate whole cores, just mask them off (rough sketch below)

Dedicate physical IO devices for sensitive workloads

You can have per cgroup swap if you want, but imo swap is not useful

I think all of this is possible in k8s
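
Roughly what "mask them off" looks like when done on the host rather than through the runtime. This is a sketch under made-up assumptions (the core numbers, the postgres exec); in practice you'd pair it with isolcpus or a systemd CPUAffinity= setting so nothing else schedules onto the reserved cores:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        cpu_set_t mask;
        CPU_ZERO(&mask);

        /* Hypothetical layout: cores 2-7 reserved for the database,
         * cores 0-1 left for the kubelet and OS housekeeping. */
        for (int cpu = 2; cpu <= 7; cpu++)
            CPU_SET(cpu, &mask);

        /* Pin the calling process to the reserved cores; children
         * (the database itself) inherit the mask. */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        execlp("postgres", "postgres", (char *)NULL);
        perror("execlp");
        return 1;
    }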


Whole core masking is not quite as easy as it should be, predominantly because the API is designed to hand wave away actual cores. The way you typically solve this is to go the other way and claim exclusive cores for the orchestrator and other overhead.


As these are obviously very real issues, and Kubernetes also isn’t going away imminently, how many of these can be fixed/improved with different design on the application front?

Would using direct-I/O APIs fix most of the fsync issues? If workloads pin their stuff to specific cores, can we avoid some of the overhead here? (Assuming we're only running a single dedicated workload + kubelet on the node).
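
On the direct-I/O part: O_DIRECT bypasses the page cache, so it cuts down cache-related interference, but it doesn't remove the durability barrier by itself; you still need fdatasync (or O_DSYNC) to know the data reached stable storage. A rough sketch, with a made-up path and an assumed 4 KiB block size:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096  /* assumed; check the device's logical block size */

    int main(void) {
        /* Hypothetical file on a volume dedicated to the database. */
        int fd = open("/mnt/dbvol/data", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT needs the buffer, offset and length aligned. */
        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 0, BLOCK);

        if (pwrite(fd, buf, BLOCK, 0) != BLOCK) { perror("pwrite"); return 1; }

        /* Still needed: flush the device write cache for durability. */
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }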

> You would either have to dedicate a whole NIC or an SR-IOV-style virtual NIC to your database server

Tbh I had no idea we could do this with commodity cloud servers, nor do I know how, but I'm terribly interested in finding out. Do you know if there's like a "dummy's guide to better networking"? Haha

> kubelet is not optimized to get out of your way...Kubernetes sucks at managing memory-intensive processes

Definitely agree on both these issues, I’ve blown up the kubelet by overallocating memory before, which basically borked the node until some watchdog process kicked in. Sounds like the better solution here is a kubelet rebuilt to operate more efficiently and more predictably? Is the solution a db-optimised kubelet/K8s?


This is extremely misinformed. No matter how you choose to manage workloads, ultimately you are responsible for tuning and optimization.

If you're not in control of the system, and thus kubelet, obviously your hands are tied. I'm not sure anyone is suggesting that for a serious workload.

Now to dispel your myths:

1. You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.

2. You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones. There are also a number of advanced techniques available if you want to be a control freak, such as creating your own cgroups, though they're not at all necessary. This isn't "outside" of the runtime. Kubernetes is designed to conform to your managed cgroups. That's the whole point. RTFM.

3. The general theme of your complaint has nothing to do with kubernetes. There's no beating a dedicated NIC and even network fabric. Some cloud providers even allow you to multi-NIC out of the box so this is pretty solvable. Also, like, the dumbest QoS rules can drastically minimize this problem generally. Who cares.

4. Nah. RTFM. This is total FUD.

5.a. I don't understand. Are you sharing resources on the node or not? If you're not, then swap works fine. If you are, then this smells like cognitive dissonance and maybe listen to your own advice, but also swap is still very doable. It's just disk. swapon to your heart's content. But also swap is almost entirely dumb these days. Are you suggesting swapping to your primary IO device? Come on. More FUD.

5.b. OOM killer does what it wants. What's a better alternative that integrates "well" with the OOM killer? Do you even understand how resource limits work? The OOM killer is only ever a problem if you either do not configure your workload properly (true regardless of execution environment) or you run out of actual memory.
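
For what it's worth, the "resource limits" point is just memory cgroups underneath: a container memory limit lands in memory.max, and when the cgroup blows past it and reclaim fails, the OOM kill is scoped to that cgroup rather than the whole machine. A hand-rolled sketch of the same mechanism, with made-up paths and a made-up 8 GiB figure (Kubernetes and the runtime normally write these files for you):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_file(const char *path, const char *value) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s", value);
        return fclose(f);
    }

    int main(void) {
        /* Create a dedicated cgroup (v2) for the database. */
        if (mkdir("/sys/fs/cgroup/dbtest", 0755) != 0) perror("mkdir");

        /* Hard memory limit: 8 GiB, purely illustrative. */
        write_file("/sys/fs/cgroup/dbtest/memory.max", "8589934592");

        /* Optional: on OOM, kill the whole cgroup together instead of one task. */
        write_file("/sys/fs/cgroup/dbtest/memory.oom.group", "1");

        /* Move the current process (e.g. the database launcher) into it. */
        char pid[32];
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        write_file("/sys/fs/cgroup/dbtest/cgroup.procs", pid);

        return 0;
    }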

Bottom line: come down off your high horse and acknowledge that dedicated resources and kernel tuning are the secret to extreme high performance. I don't care how you're orchestrating your workloads, the best practices are essentially universal.

And to be clear, I'm not recommending using Kubernetes to run a high performance database but it's not really any worse (today) than alternatives.

> It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in the event of a failure, will come to bite you.

What planet are you currently on? This makes no sense. It's a set of abstractions and patterns, the intent isn't to hide the complexity but to make it manageable at scale. I'd argue it succeeds at that.

Seriously, what is the alternative runtime you'd prefer here? systemd? hand rolled bash scripts? puppet and ansible? All of the above??


> You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.

This is word salad. Do you even know what fsync is for? I'm not even asking if you know how it works... What is "alien" fsync activity? Mount is perhaps the one system call that has nothing to do with fsync... so, I wouldn't expect any fsync activity when calling mount...

Finally, I didn't say that you cannot allocate a dedicated storage device -- what I said is that Kubernetes or Docker or Singularity or containerd or... well, none of the container (management) runtimes that I've ever used know how to do it. You need external tools to do it. The point isn't that you cannot, the point is that a container runtime will only stand in your way when you try to do it.

> You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones.

No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.

And... I don't have the time or the patience necessary to answer the rest of the nonsense. Bottom line: you don't understand what you are replying to, and you're arguing with something I either didn't say, or just stringing meaningless words together.


> Do you even know what fsync is for?

I do, though perhaps an ignorant life would be simpler. "Alien" is a word with a definition. Perhaps "foreign" is a better word. Forgive me for attempting to wield the English language.

No one else will use your fucking disk if you mount it exclusively in a pod. Does that make sense? You must be a joy to work with.

> The point isn't that you cannot, the point is that a container runtime will only stand in your way when you try to do it.

I have no idea what this means. How does kubernetes stand in your way?

> No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.

This is incorrect. You can absolutely configure the kubelet to reserve cores and offer exclusive cores to pods by setting a CPU management policy. I know because I was waiting for this for a very long time, for all of the reasons in this discussion. It works fine.

You clearly have an axe to grind and it seems pretty obvious you're not willing to do the work to understand what you're complaining about. It might help to start by googling what a container runtime even is, but I'm not optimistic.


I’ll assume the worst case:

- lots of containers running on a single host

- containers are each isolated in a VM (aka virtualized)

- workloads are not homogenous and change often (your neighbor today may not be your neighbor tomorrow)

I believe these are fair assumptions if you’re running on generic infrastructure with kubernetes.

In this setup, my concerns are pretty much noisy neighbors + throttling. You may get latency spikes out of nowhere and the cause could be any of:

- your neighbor is hogging IO (disk or network)

- your database spawned too many threads and got throttled by CFS (the sketch below shows how to check for this)

- CFS scheduled your DB's threads on a different CPU and you lost your cache lines

In short, the DB does not have stable, predictable performance, which are exactly the characteristics you want it to have. If you ran the DB on a dedicated host you avoid this whole suite of issues.
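
If you want to check whether the CFS-throttling case is actually what's biting you, the kernel already exports the evidence: on cgroup v2 the pod's cgroup has a cpu.stat file with nr_throttled and throttled_usec counters, and if throttled_usec climbs during your latency spikes, the quota is the culprit. A small sketch that reads them (the cgroup path is made up; on a node the pod's cgroup lives somewhere under /sys/fs/cgroup/kubepods.slice):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *path = "/sys/fs/cgroup/dbtest/cpu.stat";  /* hypothetical */
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }

        char key[64];
        long long value;
        while (fscanf(f, "%63s %lld", key, &value) == 2) {
            /* nr_throttled: CFS periods in which the quota was hit;
             * throttled_usec: total time runnable threads sat throttled. */
            if (strcmp(key, "nr_throttled") == 0 || strcmp(key, "throttled_usec") == 0)
                printf("%s = %lld\n", key, value);
        }
        fclose(f);
        return 0;
    }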

You can alleviate most of this if you make sure the DB’s container gets the entire host’s resources and doesn’t have neighbors.


> - containers are each isolated in a VM (aka virtualized)

Why are you assuming containers are virtualized? Is there some container runtime that does that as an added security measure? I thought they all use namespaces on Linux.


It’s becoming standard as a security measure. See: Kata containers, Firecracker VM


Not so; neither Kata containers nor Firecracker are in widespread public use today. (Source: I work for AWS and consult regularly with container services customers, who both use AWS and run on premise.)


Ah, good to know!


None of those are the fault of containers. You can do all of what you said without containers.



