
It's quite common on HPC systems to schedule jobs on every logical CPU except cpu0, and leave cpu0 to the OS for doing OS stuff. For many workloads it actually improves performance, since your job never ends up stalled waiting on a core that got preempted to run OS tasks.
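In Linux terms that mostly comes down to CPU affinity. A minimal sketch of the idea in plain Python over the standard affinity calls (real clusters usually do this through the batch scheduler's cgroup/cpuset config rather than per-process code):

  import os

  # Restrict this process (and any children it forks) to every CPU it may
  # currently run on, except cpu0, leaving cpu0 for OS housekeeping.
  all_cpus = os.sched_getaffinity(0)   # set of CPU ids we can run on now
  job_cpus = all_cpus - {0}
  os.sched_setaffinity(0, job_cpus)
  print("job pinned to CPUs:", sorted(job_cpus))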



I once debugged a pathological case of this where OnApp would configure Xen with a CPU cap of 200% for Dom0 on a 24 core machine, so Dom0 could only get 1/12th of the total CPU time - and that cap is measured in total time spent, spread across all 24 vCPUs, so each Dom0 vCPU effectively got about 8% of a core. It would more or less totally lock the machine, because one core would be waiting on a spinlock or similar held by another core, and they were scheduled so slowly it took FOREVER. Combine that with minimum timeslices of 30ms.. basically each 30ms slice it would spin on the lock and do nothing useful. Good times.

Even their support team couldn't figure that out, despite having other customers running into it. I managed to figure it out using xenalyze or something similar, basically tracing the scheduler actions, and found that the Dom0 cores were each only being scheduled something like once per second. Was kinda crazy.

No Batch Core scheduling in that version of Xen either. I think newer versions might have it? Might not work when you have every core assigned to a domain though.

The solution was to plug only 2 vCPUs into Dom0, rather than plugging in all 24 and giving it a 200% slice of them.
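From memory (so treat this as a sketch, the exact knobs may have changed between Xen versions), the difference was roughly:

  # What OnApp set up: Dom0 keeps all 24 vCPUs, the credit scheduler caps
  # it at 200% of one core in total
  xl sched-credit -d Domain-0 -c 200

  # What actually worked: give Dom0 only 2 vCPUs on the Xen boot line,
  # with no cap
  dom0_max_vcpus=2

Roughly the same overall CPU budget, but 2 vCPUs that actually get to run beat 24 vCPUs that each get ~8% of a core and keep spinning on each other's locks.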


HPC systems also tend to do an unusual amount of stuff in userspace, such as running custom TCP stacks or using RDMA to bypass the CPU entirely. You wouldn't want to run a 10Gb+ multiqueue Ethernet NIC through a single core; it'd choke on the interrupt load.
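Even without full kernel bypass, the usual first step is just spreading the NIC's per-queue IRQs across cores so no single core eats all of them. A rough sketch via the /proc interface (hypothetical NIC name eth0, needs root, IRQ naming varies by driver):

  import os

  def nic_irqs(pattern="eth0"):
      # IRQ numbers whose /proc/interrupts line mentions the NIC's queues.
      irqs = []
      with open("/proc/interrupts") as f:
          for line in f:
              head = line.split(":")[0].strip()
              if pattern in line and head.isdigit():
                  irqs.append(int(head))
      return irqs

  def spread(irqs, cpus):
      # Round-robin each queue's IRQ onto one CPU, skipping cpu0.
      for i, irq in enumerate(irqs):
          with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
              f.write(str(cpus[i % len(cpus)]))

  cpus = sorted(os.sched_getaffinity(0) - {0})
  spread(nic_irqs(), cpus)

In practice irqbalance or the driver's own RSS setup handles this, but the point stands: one core fielding every interrupt for a busy multiqueue NIC doesn't scale.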



