If you can make sure something else won't emit watts (other cores, something usi...

rigtorp · on Aug 3, 2020

There was also this recent patch https://lwn.net/Articles/816211/ to deal with kthread affinities. Even with isolcpus I find I still need to run pgrep -P 2 | xargs -i taskset -p -c 0 {} and deal with the workqueues.
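
For anyone who prefers a script over the one-liner, here is a rough Python sketch of the same idea: find the children of kthreadd (PID 2) and try to move each one to CPU 0. The helper names (parse_ppid, children_of, pin_kthreads) are made up for illustration, it needs root, and per-CPU kthreads will refuse the change, so treat it as a sketch rather than a complete recipe.

```python
import os


def parse_ppid(stat_line):
    """Extract the parent PID from the contents of /proc/<pid>/stat."""
    # Layout is "pid (comm) state ppid ...". comm may contain spaces or
    # parentheses, so split on the LAST ')' before tokenizing.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[1])  # rest[0] is the state, rest[1] is the ppid


def children_of(parent_pid=2):
    """Return PIDs whose parent is parent_pid (PID 2 is kthreadd)."""
    pids = []
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            with open(f"/proc/{name}/stat") as f:
                if parse_ppid(f.read()) == parent_pid:
                    pids.append(int(name))
        except OSError:
            continue  # task exited while we were scanning
    return pids


def pin_kthreads(cpus=frozenset({0})):
    """Try to restrict every kthread to `cpus`; return (moved, refused)."""
    moved, refused = [], []
    for pid in children_of(2):
        try:
            os.sched_setaffinity(pid, cpus)
            moved.append(pid)
        except OSError:
            # Per-CPU kthreads reject affinity changes; without root the
            # call fails with a permission error instead.
            refused.append(pid)
    return moved, refused
```

Calling pin_kthreads() as root returns the PIDs that moved and the ones that refused. The workqueues still have to be dealt with separately.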

Have you tried "A full task-isolation mode for the kernel": https://lwn.net/Articles/816298/ ?

Running only a single thread per core, I see no difference between SCHED_FIFO and SCHED_OTHER, except that SCHED_FIFO can cause lockups if the thread runs at 100% CPU, since the cores are not completely isolated (e.g. the vmstat timer and some other stuff).

Yes, it's annoying that you cannot disable compaction. There is also work on proactive compaction now: https://nitingupta.dev/post/proactive-compaction/

angry_octet · on Aug 4, 2020

I like the Tosatti/WindRiver/Lameter patch. (Except the naming: _possible has the same meaning as _available, but they mean different things here depending on whether it's kthreads or user threads?) It just needs a proc interface.

With a patch like this you can force bottom half (interrupts), top half (kthreads) and user threads to all be on different cores.

The 'full task-isolation mode' seems weird, because why should you drop out of isolation because of something outside your control like paging or TLB? Anyway, mlockall that. It's fine to be told, I guess (except signals take time), but why drop out of isolation and risk glitches in re-isolating? It doesn't seem very polished.

Something else occurred to me: you still have to be careful about data flow and having enough allocatable memory. E.g., a lot of memory local to core 0 will be consumed by buffer cache, so it can be beneficial to drop it (free; numastat -ms; sync; echo 1 > /proc/sys/vm/drop_caches; free; numastat -ms).
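
A rough Python sketch of that check-and-drop sequence, assuming a Linux box with the usual sysfs NUMA layout: it reads node 0's MemFree and FilePages from /sys/devices/system/node/node0/meminfo instead of calling numastat, then does the sync and the drop_caches write. The function names are hypothetical, and drop_page_cache() needs root.

```python
import os


def parse_node_meminfo(text):
    """Parse a per-node meminfo file into a {field: value} dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0 MemFree:   123456 kB"
        if len(parts) >= 4 and parts[0] == "Node":
            fields[parts[2].rstrip(":")] = int(parts[3])
    return fields


def node_summary(node=0):
    """Free memory and page cache (in kB) local to one NUMA node."""
    path = f"/sys/devices/system/node/node{node}/meminfo"
    with open(path) as f:
        info = parse_node_meminfo(f.read())
    return {key: info.get(key) for key in ("MemFree", "FilePages")}


def drop_page_cache():
    """Equivalent of: sync; echo 1 > /proc/sys/vm/drop_caches (root only)."""
    os.sync()  # flush dirty pages first so more of the cache is droppable
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("1\n")
```

Comparing node_summary(0) before and after drop_page_cache() shows how much node-local memory the cache was holding.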

I find it bizarre that all this THP compaction stuff is for workloads that are commonly run under a virtual machine, i.e. another layer of indirection.

rigtorp · on Aug 4, 2020

As I understand it, 'full task-isolation mode' will prevent compaction, completely disable the vmstat timer, etc. So it provides additional isolation. Since you have already switched into kernel mode, it might as well deliver a signal to let you know it happened. If the signal is masked there should be no overhead at all except a branch to check the signal mask.