HmgNextOwned doesn't do all that much by itself. There is some mild pointer-chasing, but nothing too horrible. All it does is walk the global GDI handle table looking for handles owned by the terminating process. It is used roughly like this:
DWORD index = 0;
while(true) {
HANDLE handle;
index = HmgNextOwned(index, pid, &handle);
if(index == 0) break; // done / nothing found
// ... do some cleanup of the object behind 'handle' here
}
The thing is, this loop is executed 6 times (12, if you count the instances inlined into NtGdiCloseProcess) when terminating a process, once for each of the various object types, and every call to HmgNextOwned locks and unlocks the handle manager lock.
Sounds to me like they could really do with a reverse mapping, so that going from a process to its owned handles is O(1) and cleanup on exit only touches the handles the process actually owns.
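A minimal sketch of what such a reverse mapping could look like (every name here is invented for illustration; this has nothing to do with the real win32k structures): each handle entry carries intrusive list links, so the exit path walks only the terminating process's own handles instead of the global table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structures -- not the real win32k layout. */
typedef struct HandleEntry {
    unsigned type;                      /* 0 == free/released */
    struct HandleEntry *next_in_process;
    struct HandleEntry *prev_in_process;
} HandleEntry;

typedef struct Process {
    HandleEntry *owned_head;            /* reverse mapping: process -> handles */
} Process;

/* O(1): link the entry into its owner's list at allocation time. */
static void track_handle(Process *p, HandleEntry *e) {
    e->prev_in_process = NULL;
    e->next_in_process = p->owned_head;
    if (p->owned_head)
        p->owned_head->prev_in_process = e;
    p->owned_head = e;
}

/* Exit path: proportional to this process's handle count only,
   never to the system-wide table size. Returns handles freed. */
static int cleanup_process(Process *p) {
    int freed = 0;
    while (p->owned_head) {
        HandleEntry *e = p->owned_head;
        p->owned_head = e->next_in_process;
        e->type = 0;                    /* release the object */
        ++freed;
    }
    return freed;
}
```

The cost moves to allocation time (two pointer writes per handle), which is the usual trade for making teardown cheap.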
Linux had a similar issue with its memory-map structures (PTEs, VMAs, etc.) when under memory pressure (i.e. needing to swap). In 2002/2003 successive reverse-mapping (rmap) patches were added to the kernel so that a full memory-map scan wasn't required to go from a memory object to a task and back again.
If it scans the entire GDI handle table, then its runtime is proportional to the system-wide total number of GDI objects, correct? That sort of design can easily get expensive. I'm surprised a process can't iterate through its own GDI objects without having to traverse everybody else's as well.
Yeah, it's proportional to the total number of GDI objects. Approximately:
DWORD HmgNextOwned(DWORD index, DWORD pid, HANDLE * handle) {
GDI_TABLE_ENTRY entry;
GreAcquireHmgrSemaphore();
// GetNextEntryIndex is inlined in reality, this is the main loop you see in the disassembly
while((index = gpHandleManager->GetNextEntryIndex(index, &entry)) != 0) {
if(entry.Type != 0 && (((pid & 0xFFFFFFFD) ^ entry.ProcessId) & 0xFFFFFFFE) == 0) {
*handle = index | (entry.Upper << 16);
break;
}
}
GreReleaseHmgrSemaphore();
return index;
}
The best information I could find on the relevant structures is [1], which gives an idea of what's being traversed. I don't know why it was designed this way; it does seem suboptimal.
That link was extremely helpful! My unsubstantiated belief is that the process-shutdown bug is new in Windows 10 Anniversary Edition, and that link supports the idea. It sounds like Windows 10 Anniversary Edition changed how GDI objects are managed in order to avoid leaking kernel addresses, and that made HmgNextOwned more expensive, thus triggering the visible serialization of process destruction.
Cool!
I really wish I could upvote the parent post more than once. Failing that, I've updated the latest post to link to it as an explanation of how the whole problem got started.
For what it's worth, I just went back to an earlier build (10240) of win32kbase.sys, and the search is indeed a lot simpler: it's essentially a flat linear table scan (to be clear, it still iterates through every GDI object), which is much friendlier to the CPU than the structure traversal that happens now. That could very well be the culprit.
Thanks for the information. I guess if the scan used to be two to three times faster, that may have been fast enough that the issue, while still present, wasn't noticeable.
Note that the scanning happens even when the process has zero GDI objects. That seems... suboptimal.
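If the kernel already kept a per-process GDI object count (Windows does expose such counts via GetGuiResources), the exit path could skip the scan entirely in that case. A sketch, with invented names:

```c
#include <assert.h>

/* Hypothetical per-process bookkeeping -- names invented for the sketch. */
typedef struct {
    unsigned gdi_object_count;   /* incremented/decremented on alloc/free */
} ProcessInfo;

/* Guard the expensive table scan: a process that never created a GDI
   object pays nothing at exit. */
static int needs_gdi_cleanup(const ProcessInfo *p) {
    return p->gdi_object_count != 0;
}
```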
I've made a few updates to the end of the post based on the latest information.
tl;dr Traversing page-aligned structures can totally hose a set-associative cache, which picks its cache set from a simple slice of the lower address bits. (I hope, as a small-brained primate, I got that correct.)
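To make the tl;dr concrete, here is a toy model of how a set-associative cache picks its set (the parameters are illustrative: 64-byte lines, 64 sets, 4 KiB pages). The set index is taken from the address bits just above the line offset, so every page-aligned address hashes to the same set:

```c
#include <assert.h>
#include <stdint.h>

/* Toy cache geometry: 64-byte lines (offset = bits 0-5) and 64 sets
   (index = bits 6-11). Real caches vary, but the principle holds. */
enum { LINE_BITS = 6, SET_COUNT = 64 };

static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr >> LINE_BITS) & (SET_COUNT - 1));
}
```

With this geometry the index uses bits 6-11, and a 4 KiB-aligned address has those bits all zero. So page-aligned entries pile up in set 0 and evict each other long before the rest of the cache is even touched.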
One easy fix for this kind of code is to add an API that fetches multiple instances per call. Fetching 100 handles per call would speed up the processing (assuming the locking/unlocking is the pain point) by a factor of 100.
Adjust the magic 100 to the typical problematic scale and be done. (Although dividing something by 100 is usually good enough.)
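A sketch of what such a batched variant might look like (the table layout, the lock stubs, and every name here are invented for illustration, not the real win32k interface): one lock round-trip returns up to 'capacity' matching handles instead of one.

```c
#include <assert.h>
#include <stddef.h>

enum { TABLE_SIZE = 1024 };

typedef struct { unsigned type; unsigned pid; } Entry;

static Entry g_table[TABLE_SIZE];     /* stand-in for the handle table */
static int g_lock_roundtrips;         /* counts acquire/release pairs */

static void acquire_lock(void) { ++g_lock_roundtrips; }
static void release_lock(void) { }

/* Resume the scan from *cursor; fill 'out' with up to 'capacity' table
   indices owned by 'pid'. Returns the number found and advances *cursor,
   all under a single lock acquisition. */
static size_t next_owned_batch(size_t *cursor, unsigned pid,
                               size_t *out, size_t capacity) {
    size_t n = 0;
    acquire_lock();
    while (*cursor < TABLE_SIZE && n < capacity) {
        Entry *e = &g_table[*cursor];
        if (e->type != 0 && e->pid == pid)
            out[n++] = *cursor;
        ++(*cursor);
    }
    release_lock();
    return n;
}
```

Note this amortizes only the lock traffic; the scan itself is still proportional to the whole table, so it complements rather than replaces a reverse mapping.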