
With what is known about this bug so far, wouldn't it be possible to mitigate it by locking the kernel to one CPU core and running user processes on the other cores?

Also, if this bug lets the kernel leak data to user processes, would it also not be the case that different processes would leak data to each other? If that is true, then it seems that just isolating the kernel wouldn't be enough.




> wouldn't it be possible to mitigate it by locking the kernel to one CPU core and running user processes on the other cores?

That would be a much, much more invasive architectural change - and it would perform much worse than the page table isolation fixes.

> Also, if this bug lets the kernel leak data to user processes, would it also not be the case that different processes would leak data to each other?

No. The problem is with pages that are mapped, but (supposed to be) inaccessible from your current privilege level. The user mappings of other processes aren't in your page tables at all.
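
To make "mapped but (supposed to be) inaccessible" concrete, here's a rough sketch of my own - not actual kernel code - of the x86-64 User/Supervisor bit in a page table entry. Kernel pages leave that bit clear, so the translation exists but a ring-3 access faults:

    /* Sketch: how a page table entry encodes "mapped, but supervisor-only".
     * Bit 0 is Present, bit 2 is User/Supervisor; kernel pages clear bit 2,
     * so a user-mode access faults even though the translation is valid. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PTE_PRESENT (1ULL << 0)  /* translation is valid */
    #define PTE_USER    (1ULL << 2)  /* accessible from ring 3 */

    static bool user_can_access(uint64_t pte)
    {
        return (pte & PTE_PRESENT) && (pte & PTE_USER);
    }

    int main(void)
    {
        uint64_t kernel_pte = PTE_PRESENT;            /* mapped, supervisor-only */
        uint64_t user_pte   = PTE_PRESENT | PTE_USER; /* mapped, user-accessible */

        printf("kernel page: user access %s\n", user_can_access(kernel_pte) ? "ok" : "faults");
        printf("user page:   user access %s\n", user_can_access(user_pte) ? "ok" : "faults");
        return 0;
    }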


Nobody wants to waste a core, and passing cache lines from core to core isn't cheap, but yes... using a separate core is in fact more secure.

Leaking from one user process to another is already solved: the CR3 register gets reloaded on every context switch, even on older kernels. Well, it's true for anything released after about 1992. If you have Linux 0.01 or 0.02, you might need a patch for that too.
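
As a user-space illustration (my own, not from the kernel) of the separation that the CR3 reload enforces: after fork(), the same virtual address resolves to different physical pages in each process, so a write in one is invisible to the other.

    /* Sketch: each process has its own address space; the per-context-switch
     * CR3 reload selects which set of page tables the MMU walks. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int secret = 42;
        pid_t pid = fork();

        if (pid == 0) {          /* child: same virtual address...         */
            secret = 1337;       /* ...but its own physical page after CoW */
            printf("child sees  %d at %p\n", secret, (void *)&secret);
            return 0;
        }

        waitpid(pid, NULL, 0);
        printf("parent sees %d at %p\n", secret, (void *)&secret);  /* still 42 */
        return 0;
    }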

I have to wonder how this extra code compares to just letting the hardware switch CR3 via a doublefault exception task gate. With a doublefault task switch, the last bit of executable code could be unmapped.


> Also, if this bug lets the kernel leak data to user processes, would it also not be the case that different processes would leak data to each other? If that is true, then it seems that just isolating the kernel wouldn't be enough.

There is already a TLB flush when context-switching from one user-space process to another; that is one of the foundations of multitasking. The problem here, if I understand correctly, is processes accessing arbitrary kernel regions.


That might improve single core performance.

But you lock up two entire cores doing so.

For multi-core workloads, performance obviously decreases, mostly because you are missing an entire CPU core. Also, you would introduce locking between cores as only one user core could call the kernel at any one time.

Additionally, we have no idea how this bug (whatever it is) interacts with hyperthreading.


It's quite common on HPC systems to schedule jobs on every logical CPU except cpu0 and leave cpu0 to the OS for doing OS stuff. In many workloads it actually improves performance, since your job never stalls waiting on a thread that got preempted to run OS tasks - all of that noise stays on cpu0.
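
Roughly what that pinning looks like from the job's side (my sketch - HPC schedulers like SLURM normally set this up for you via cpusets):

    /* Sketch: keep this job off cpu0 and leave that core to the OS. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t mask;

        CPU_ZERO(&mask);
        for (long cpu = 1; cpu < ncpus; cpu++)   /* every logical CPU except 0 */
            CPU_SET(cpu, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the actual job here ... */
        return 0;
    }

The taskset utility gets you the same effect from the shell.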


I once debugged a pathological case of this where OnApp would configure Xen with a CPU limit of 200% for Dom0 on a 24-core machine, so Dom0 could only get 1/12th of the total CPU time - and that budget was spread across all 24 cores. It would more or less totally lock the machine, because one core would be waiting on a spinlock or similar held by another core, and they were scheduled so slowly it took FOREVER. Combine that with minimum timeslices of 30ms... basically every 30ms it would spin on a lock and do nothing useful. Good times.

Even their support team couldn't figure that out, despite having other customers running into it. I managed to figure it out using xenalyze or something similar, basically tracing the scheduler actions, and found that the Dom0 cores were each only being scheduled something like once per second. Was kinda crazy.

No Batch Core scheduling in that version of Xen either. I think newer versions might have it? Might not work when you have every core assigned to a domain though.

The solution was to only plug 2 CPUs into Dom0, rather than plugging in all 24 and giving it a 200% slice of them.
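
For reference, that looks roughly like the following hypervisor command line - treat it as a sketch, since the exact syntax depends on the bootloader and Xen version:

    # /etc/default/grub (Debian-style), then run update-grub:
    GRUB_CMDLINE_XEN_DEFAULT="dom0_max_vcpus=2 dom0_vcpus_pin"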


HPC systems also tend to do an unusual amount of stuff in userspace, such as running custom TCP stacks in userspace or using RDMA to bypass the CPU entirely. You wouldn't want to run a 10Gb+ multiqueue Ethernet NIC on a single core; it would choke on the interrupt load.
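
The usual alternative is to spread the NIC's per-queue interrupts across cores. A sketch of my own below - the IRQ numbers are made up, and in practice irqbalance or the vendor's scripts handle this:

    /* Sketch: write CPU affinity masks under /proc/irq/<n>/smp_affinity so
     * each NIC queue's interrupts land on a different core (needs root). */
    #include <stdio.h>

    int main(void)
    {
        int irqs[]     = {120, 121, 122, 123};   /* hypothetical queue IRQs */
        int cpu_mask[] = {0x2, 0x4, 0x8, 0x10};  /* CPUs 1-4, one per queue */

        for (int i = 0; i < 4; i++) {
            char path[64];
            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irqs[i]);
            FILE *f = fopen(path, "w");
            if (!f) { perror(path); continue; }
            fprintf(f, "%x\n", cpu_mask[i]);
            fclose(f);
        }
        return 0;
    }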


So if my app is running on core 1 and makes a syscall, the entire state must be shifted to the kernel core? That seems slower than swapping the memory maps.


How would syscalls work exactly?


They'd look more like Interprocessor Interrupts (IPIs). The kernel already internally uses those for TLB shootdowns.


That is a lot more expensive than even syscalls with KAISER



