Hacker News
What is Windows doing while hogging that lock? (randomascii.wordpress.com)
154 points by ingve on July 27, 2017 | hide | past | favorite | 33 comments



HmgNextOwned doesn't do all that much by itself. There is some mild pointer-chasing, but nothing too horrible. All it does is walk the global GDI handle table looking for handles owned by the terminating process. It is used roughly like this:

    DWORD index = 0;
    while(true) {
      SOME_GDI_OBJECT_TYPE obj;
      index = HmgNextOwned(index, pid, &obj);
      if(index == 0) break; // done / nothing found
      // ... do some clean up of the object here
    }
The thing is, this loop is executed 6 (12, if counting the instances inlined into NtGdiCloseProcess) times when terminating a process, for various object types, and for each call of HmgNextOwned it will lock and unlock the handle manager lock.


Sounds to me like they could really do with a reverse mapping so that you have an O(1) procedure to clean up the handles on exit.

Linux had a similar issue with memory maps (pte, vma, etc) when under memory pressure (i.e. needing to swap). In 2002 / 2003 successive reverse mapping (rmap) patches were added to the kernel so that a full memory map scan wasn't required to go from a memory object to a task and back again.
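
A minimal sketch of what such a reverse mapping could look like — an intrusive per-process handle list, so cleanup walks only the dying process's own handles rather than the whole table. All names here are hypothetical, not the actual win32k data structures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry layout: each handle entry carries list links so that
   all handles owned by one process form an intrusive doubly-linked list.
   Cleanup then walks only that process's handles: O(owned), not O(table). */
typedef struct Entry {
    unsigned pid;
    struct Entry *prev_same_pid, *next_same_pid;
} Entry;

typedef struct {
    Entry *head;   /* per-process list head, e.g. hung off the process object */
} ProcessHandles;

static void track(ProcessHandles *p, Entry *e) {
    /* push onto the owning process's list at handle-creation time */
    e->prev_same_pid = NULL;
    e->next_same_pid = p->head;
    if (p->head) p->head->prev_same_pid = e;
    p->head = e;
}

static int cleanup(ProcessHandles *p) {
    /* at process exit: visit only owned entries; returns count freed */
    int n = 0;
    for (Entry *e = p->head; e; e = e->next_same_pid) n++;
    p->head = NULL;
    return n;
}
```

The cost is two pointers per entry and a little bookkeeping at creation/deletion time, which is the usual space-for-time trade the rmap patches made.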


Thanks for the explanation.

If it scans the entire GDI handle table then that suggests that its runtime will be proportional to the system's total number of GDI objects, correct? That sort of design can easily get expensive. I'm surprised it can't iterate through its own GDI objects without having to traverse everybody else's as well.


Yeah, it's proportional to the total number of GDI objects. Approximately:

  DWORD HmgNextOwned(DWORD index, DWORD pid, HANDLE * handle) {
    GDI_TABLE_ENTRY entry;
    GreAcquireHmgrSemaphore();
    // GetNextEntryIndex is inlined in reality; this is the main loop you see in the disassembly
    while((index = gpHandleManager->GetNextEntryIndex(index, &entry)) != 0) {
      if(entry.Type != 0 && (((pid & 0xFFFFFFFD) ^ entry.ProcessId) & 0xFFFFFFFE) == 0) {
        *handle = index | (entry.Upper << 16);
        break;
      }
    }
    GreReleaseHmgrSemaphore();
    return index;
  }
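
As an aside, the `*handle = index | (entry.Upper << 16)` line above packs the table index into the low word of the handle and the entry's upper data into the high word. A tiny illustration of that encoding (function and field names are mine, not Microsoft's):

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 16-bit table index and 16 bits of per-entry data (type/uniqueness)
   into a single 32-bit handle value, and pull them back apart. */
static uint32_t make_handle(uint16_t index, uint16_t upper) {
    return (uint32_t)index | ((uint32_t)upper << 16);
}

static uint16_t handle_index(uint32_t h) { return (uint16_t)(h & 0xFFFF); }
static uint16_t handle_upper(uint32_t h) { return (uint16_t)(h >> 16); }
```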
Best information I could find on the relevant structures is [1], which gives an idea of what you're traversing. I don't know why this is designed this way; it does seem suboptimal.

[1] https://cvr-data.blogspot.com/2016/11/windows-10-anniversary...


That link was extremely helpful! My unsubstantiated belief is that the process-shutdown bug was new in Windows 10 Anniversary Edition, and that link supports that idea. It sounds like Windows 10 Anniversary Edition changed the management of GDI objects to avoid leaking kernel addresses and that made HmgNextOwned more expensive, thus triggering the visible serialization of process destruction.

Cool!

I really wish I could upvote the parent post more times. Failing that, I've updated the latest post to link to it as an explanation for how the whole problem got started.


For what it's worth, I just went back to an earlier version (10240) of win32kbase.sys, and the search is indeed a lot simpler: it's essentially a flat linear table search (to be clear, it's still iterating through every GDI object), much friendlier to the CPU than the structure traversal that happens now. Could very well be the culprit.


Thanks for the information. I guess if scanning used to be two to three times faster, maybe that would be fast enough that the issue, while still present, would not be noticeable.

Note that the scanning happens even when the process has zero GDI objects. That seems... suboptimal.

I've made a few updates to the end of the post based on the latest information.


Scanning the entire GDI handle table reminds me of this:

https://news.ycombinator.com/item?id=14808652

Relevant sublink; see Linus Torvalds' comment:

https://groups.google.com/forum/#!msg/comp.arch/_uecSnSEQc4/...

tl;dr Traversing page-aligned structures can totally hose the associative cache, which relies on a simple hash of the lower address bits. (I hope as a small-brained primate I got that correct)
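
A back-of-the-envelope illustration of why page strides are bad, assuming a typical 32 KiB, 8-way L1 with 64-byte lines (illustrative numbers, not any specific CPU): that's 32768 / 64 / 8 = 64 sets, so the set index is bits 6-11 of the address, and any two addresses 4096 bytes (one page) apart land in the same set. Walking one pointer per page therefore hammers a single set while the rest of the cache sits idle:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative L1 geometry: 64-byte lines, 64 sets (32 KiB / 8-way). */
enum { LINE = 64, SETS = 64 };

/* The set an address maps to is a simple function of its low bits. */
static unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE) % SETS);
}
```

With these parameters, `set_index(0x0)`, `set_index(0x1000)`, and `set_index(0x2000)` are all the same set, so a page-strided traversal gets only 8 ways of cache instead of 512 lines.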


One easy fix for this kind of code is to add an API to fetch multiple instances per call. Fetching 100 handles per call would speed up the processing (assuming the locking/unlocking is the pain point) by a factor of 100.

Adjust the magical 100 for the typical problematic use case scale and be done. (Although dividing something by 100 is usually good enough.)
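
A toy model of that batching idea, counting lock round-trips to show the amortization. Every name here is hypothetical — this is a sketch of the suggestion, not a real win32k API:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the handle table; lock_ops counts acquire/release pairs. */
typedef struct { unsigned lock_ops; unsigned next; } Table;

/* Stand-in scan: hands out handles 1..limit, then 0 when exhausted. */
static unsigned next_owned(Table *t, unsigned limit) {
    return t->next < limit ? ++t->next : 0;
}

/* Batched variant: one lock round-trip returns up to `cap` handles,
   so the locking cost is amortized across the whole batch. */
static size_t next_owned_batch(Table *t, unsigned limit,
                               unsigned *out, size_t cap) {
    t->lock_ops++;               /* acquire + release once per call */
    size_t n = 0;
    while (n < cap) {
        unsigned h = next_owned(t, limit);
        if (h == 0) break;
        out[n++] = h;
    }
    return n;
}
```

Draining 1000 handles 100 at a time takes 11 lock round-trips (10 full batches plus one final empty call) instead of 1001 with the one-at-a-time API.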


Good followup to Bruce's previous blog, discussed here: https://randomascii.wordpress.com/2017/07/09/24-core-cpu-and...

Surprised at how much you can divine from the ETW traces with enough persistence.


Well.. what is it doing?

Analyzing the assembly is neat but this doesn't seem like a CPU performance problem, it's a high-level design issue. It's a bit like analyzing sleep sort.


Agreed. I was just curious, and it seemed like a fun way to demonstrate the technique, which can be useful when profiling CPU intensive parts of your own code.


Might be worth looking to see if the processes are creating hidden windows or other UI objects. The previous post references NtGdiCloseProcess, and it'd make sense that some of those handles would require a lock to clean up safely.


The test code is up there. The master process loads user32.dll and makes a couple of window-type calls, but the other processes do nothing but wait on semaphores and then die.


According to his code, he's tested hidden windows and regular ones.


It's great to see all this discussion. If anyone has a link to any Intel papers or systematic analysis of how sampling profilers interact with Intel super-scalar out-of-order processors that would be appreciated.


What are the full specs on your 24 core machine?


HP Z840, two sockets, 64 GB of RAM, SSD. I don't remember the details and I'm at home now.

The most recent tests were done on my four-core/eight-thread laptop. It's running a more recent version of Windows so it's a "fairer" test in that way, and it reproduces the problem just fine with my artificial test program, CreateProcessTests.


>If there are three instructions executing simultaneously when the interrupt fires then which one “gets the blame”? If a load instruction misses in the cache and forces the CPU to wait then will the samples show up on the load instruction, or on the first use of the data? Both seem to happen.

Specifically: in order to handle interrupts, a CPU may execute instructions out of order, but there's a final retire step where the results of all instructions are made 'official' in order. So from the outside it looks like the CPU was executing in order.

The upshot is that, except in crazily optimized code that might sometimes overwhelm the retire unit, the last instruction you see as not having finished hasn't finished, but there might be other instructions further along being worked on as well.


meh, i'm an idiot.


The question isn't about three instructions being executed by three separate OS threads. It's about three instructions being executed at the same time on a single thread on a single core.

The interrupt context will only contain a single program counter value; which of the instructions that could currently be considered to be executing on that thread will it actually point to?


I would guess the PC is the most recently retired instruction? Or would it be one not yet retired? Although that is an interesting scenario.

    slow eax, ebx
    fast ecx, ecx
Assuming slow and fast always finish at the same time, will the profiler ever see the CPU executing slow?


Working on the most recently retired instruction would be odd, since it'd lead to things like the cost of cache misses being assigned to instructions that have nothing to do with memory access.

But other than that, I've never been able to build a mental model for exactly how the accounting works. For example take an inner loop with no memory accesses, like this:

    18: add    $0x1,%rax
        cmp    %rax,%rdx
        jne    18
On a Skylake this will execute once per cycle (IPC of ~3), so we're clearly spending equal amounts of time on each instruction. But perf will assign basically all of the cost to the jump.

     0.28 │18:┌─→add    $0x1,%rax
     0.27 │   │  cmp    %rax,%rdx
    99.45 │   └──jne    18
So it could be consistent with some kind of "first in-flight" or "last in-flight" model. But change the loop to be 5 instructions long (and with an IPC of ~5, so still executing the whole loop once per cycle), and suddenly the PC gets smeared across two instructions:

     2.12 │20:┌─→add    %rsi,%rax
     0.94 │   │  add    $0x1,%rcx
    45.36 │   │  add    %rax,%rdx
     0.10 │   │  cmp    %rcx,%rsi
    51.48 │   └──jne    20
Take an almost equivalent loop, just swapping two instructions. It's still 5 instructions and an IPC of ~5. And...

    80.79 │20:   add    $0x1,%rcx
     1.29 │      add    %rax,%rdx
     0.89 │      add    %rsi,%rax
     0.00 │      cmp    %rcx,%rsi
    17.02 │    ↑ jne    20
Multiple runs of an individual test case will be pretty consistent on a specific machine; so it's not that the assignment is just random. It's deterministic, but the exact mechanism is very opaque.


> it'd lead to things like the cost of cache misses being assigned to instructions that have nothing to do with memory access.

But this happens all the time with perf. I often see cache misses attributed to a jump dependent on a data load.


Sure, but that's still an in-flight instruction that is stalled waiting on a memory access (just indirectly). You don't see the miss being attributed to the instructions that preceded the load.


Take a look at the "cmp ebx,eax" on line 24 of the spreadsheet linked from the article. ebx is only ever modified by pure register manipulations. eax was most recently modified by having a register added to it.

And yet the cmp instruction is quite expensive - much more so than those that precede it and which are executed equally frequently.

So, somehow the cost of the r8d access four instructions earlier must end up being propagated through the add in the previous instruction and then get charged to the "innocent bystander" compare. Crazy. I would have expected the "add eax,r8d" on line 23 to pay the price.


This looks funny, it seems that RDI is never modified by the loop (?) so lines 18-24 always load the same data, presumably from L1. It's mysterious why of all those instructions, 20 and 24 get most interrupt hits.


Good observation about rdi and its implications for the load of rcx (line 18) and therefore the load through rcx (line 20).

I added a note to that effect and also enabled adding comments to the spreadsheet.


Oh, nice examples. (And yeah, had I thought another minute, the PC will never be actually retired, but something after.)


https://stackoverflow.com/questions/8902132/when-an-interrup... has some discussion.

Basically the priority of the Intel design (and most successful processors) is to preserve the illusion of a series of instructions executing one after the other. So the control logic has to keep track of the "waterline", the point before which all instructions have been completely committed and retired. Everything beyond there is aborted on an interrupt.


The PC would be recorded from an interrupt handler as the address to resume execution at; I don’t think there’s any other way to record PCs on x86 (there’s Last Branch Record but that’s only for branch instructions). So, I’d expect it would be the first not-yet-retired instruction, and you’d always see ‘slow’, not ‘fast’.


> you’d always see ‘slow’, not ‘fast’.

And yet - line 34 of the spreadsheet shows "xor r8d,r8d" as being quite expensive. My theory is that the branch which targets it must be mispredicted and it gets blamed.

Similarly, line 24 of the spreadsheet shows "cmp ebx,eax" as being quite expensive. The only explanation I can come up with is that the previous instruction was blocked by a cache miss and then the "cmp" must have been folded in with the add so that mumble mumble nope I don't have a good mental model.

So, there are quite a few examples of instructions which by themselves are obviously always fast being hit by a huge portion of the CPU samples.


i figured i misunderstood. i guess if you run it long enough that statistically it doesn't matter? i still think the culprit is going to turn out to be something more mundane.



