HmgNextOwned doesn't do all that much by itself. There is some mild pointer-chasing, but nothing too horrible. All it does is walk the global GDI handle table looking for handles owned by the terminating process. It is used roughly like this:
DWORD index = 0;
while(true) {
HANDLE handle;
index = HmgNextOwned(index, pid, &handle);
if(index == 0) break; // done / nothing found
// ... do some cleanup of the object behind 'handle' here
}
The thing is, this loop is executed 6 times (12, if you count the instances inlined into NtGdiCloseProcess) when terminating a process, once for each of the various object types, and every call to HmgNextOwned locks and unlocks the handle manager lock.
Sounds to me like they could really do with a reverse mapping, so that going from a process to its owned handles is O(1) and cleanup on exit only touches the handles the process actually owns.
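A minimal sketch of what such a reverse mapping could look like (every name here is invented for illustration; this has nothing to do with the real win32k structures): each handle entry carries intrusive list links, so the exit path walks only the terminating process's own handles instead of the global table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structures -- not the real win32k layout. */
typedef struct HandleEntry {
    unsigned type;                      /* 0 == free/released */
    struct HandleEntry *next_in_process;
    struct HandleEntry *prev_in_process;
} HandleEntry;

typedef struct Process {
    HandleEntry *owned_head;            /* reverse mapping: process -> handles */
} Process;

/* O(1): link the entry into its owner's list at allocation time. */
static void track_handle(Process *p, HandleEntry *e) {
    e->prev_in_process = NULL;
    e->next_in_process = p->owned_head;
    if (p->owned_head)
        p->owned_head->prev_in_process = e;
    p->owned_head = e;
}

/* Exit path: proportional to this process's handle count only,
   never to the system-wide table size. Returns handles freed. */
static int cleanup_process(Process *p) {
    int freed = 0;
    while (p->owned_head) {
        HandleEntry *e = p->owned_head;
        p->owned_head = e->next_in_process;
        e->type = 0;                    /* release the object */
        ++freed;
    }
    return freed;
}
```

The cost moves to allocation time (two pointer writes per handle), which is the usual trade for making teardown cheap.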
Linux had a similar issue with its memory-map structures (PTEs, VMAs, etc.) when under memory pressure (i.e. needing to swap). In 2002/2003 successive reverse-mapping (rmap) patches were added to the kernel so that a full memory-map scan wasn't required to go from a memory object to a task and back again.
If it scans the entire GDI handle table, then its runtime is proportional to the system-wide total number of GDI objects, correct? That sort of design can easily get expensive. I'm surprised a process can't iterate through its own GDI objects without having to traverse everybody else's as well.
Yeah, it's proportional to the total number of GDI objects. Approximately:
DWORD HmgNextOwned(DWORD index, DWORD pid, HANDLE * handle) {
GDI_TABLE_ENTRY entry;
GreAcquireHmgrSemaphore();
// GetNextEntryIndex is inlined in reality, this is the main loop you see in the disassembly
while((index = gpHandleManager->GetNextEntryIndex(index, &entry)) != 0) {
if(entry.Type != 0 && (((pid & 0xFFFFFFFD) ^ entry.ProcessId) & 0xFFFFFFFE) == 0) {
*handle = index | (entry.Upper << 16);
break;
}
}
GreReleaseHmgrSemaphore();
return index;
}
The best information I could find on the relevant structures is [1], which gives an idea of what's being traversed. I don't know why it was designed this way; it does seem suboptimal.
That link was extremely helpful! My unsubstantiated belief is that the process-shutdown bug is new in Windows 10 Anniversary Edition, and that link supports the idea. It sounds like Windows 10 Anniversary Edition changed how GDI objects are managed in order to avoid leaking kernel addresses, and that made HmgNextOwned more expensive, thus triggering the visible serialization of process destruction.
Cool!
I really wish I could upvote the parent post more than once. Failing that, I've updated the latest post to link to it as an explanation of how the whole problem got started.
For what it's worth, I just went back to an earlier build (10240) of win32kbase.sys, and the search is indeed a lot simpler: it's essentially a flat linear table scan (to be clear, it still iterates through every GDI object), which is much friendlier to the CPU than the structure traversal that happens now. That could very well be the culprit.
Thanks for the information. I guess if the scan used to be two to three times faster, that may have been fast enough that the issue, while still present, wasn't noticeable.
Note that the scanning happens even when the process has zero GDI objects. That seems... suboptimal.
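If the kernel already kept a per-process GDI object count (Windows does expose such counts via GetGuiResources), the exit path could skip the scan entirely in that case. A sketch, with invented names:

```c
#include <assert.h>

/* Hypothetical per-process bookkeeping -- names invented for the sketch. */
typedef struct {
    unsigned gdi_object_count;   /* incremented/decremented on alloc/free */
} ProcessInfo;

/* Guard the expensive table scan: a process that never created a GDI
   object pays nothing at exit. */
static int needs_gdi_cleanup(const ProcessInfo *p) {
    return p->gdi_object_count != 0;
}
```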
I've made a few updates to the end of the post based on the latest information.
tl;dr Traversing page-aligned structures can totally hose a set-associative cache, which picks its cache set from a simple slice of the lower address bits. (I hope, as a small-brained primate, I got that correct.)
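To make the tl;dr concrete, here is a toy model of how a set-associative cache picks its set (the parameters are illustrative: 64-byte lines, 64 sets, 4 KiB pages). The set index is taken from the address bits just above the line offset, so every page-aligned address hashes to the same set:

```c
#include <assert.h>
#include <stdint.h>

/* Toy cache geometry: 64-byte lines (offset = bits 0-5) and 64 sets
   (index = bits 6-11). Real caches vary, but the principle holds. */
enum { LINE_BITS = 6, SET_COUNT = 64 };

static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr >> LINE_BITS) & (SET_COUNT - 1));
}
```

With this geometry the index uses bits 6-11, and a 4 KiB-aligned address has those bits all zero. So page-aligned entries pile up in set 0 and evict each other long before the rest of the cache is even touched.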
One easy fix for this kind of code is to add an API that fetches multiple instances per call. Fetching 100 handles per call would speed up the processing (assuming the locking/unlocking is the pain point) by a factor of 100.
Adjust the magic 100 to the typical problematic scale and be done. (Although dividing something by 100 is usually good enough.)
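A sketch of what such a batched variant might look like (the table layout, the lock stubs, and every name here are invented for illustration, not the real win32k interface): one lock round-trip returns up to 'capacity' matching handles instead of one.

```c
#include <assert.h>
#include <stddef.h>

enum { TABLE_SIZE = 1024 };

typedef struct { unsigned type; unsigned pid; } Entry;

static Entry g_table[TABLE_SIZE];     /* stand-in for the handle table */
static int g_lock_roundtrips;         /* counts acquire/release pairs */

static void acquire_lock(void) { ++g_lock_roundtrips; }
static void release_lock(void) { }

/* Resume the scan from *cursor; fill 'out' with up to 'capacity' table
   indices owned by 'pid'. Returns the number found and advances *cursor,
   all under a single lock acquisition. */
static size_t next_owned_batch(size_t *cursor, unsigned pid,
                               size_t *out, size_t capacity) {
    size_t n = 0;
    acquire_lock();
    while (*cursor < TABLE_SIZE && n < capacity) {
        Entry *e = &g_table[*cursor];
        if (e->type != 0 && e->pid == pid)
            out[n++] = *cursor;
        ++(*cursor);
    }
    release_lock();
    return n;
}
```

Note this amortizes only the lock traffic; the scan itself is still proportional to the whole table, so it complements rather than replaces a reverse mapping.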