Unless the issue is that your setup cannot composite at 60 fps (don’t get me wrong, I'm not pretending that Windows isn’t at fault if that’s the case), neither double buffering nor software cursors introduce delay.
Unless your goal is tearing updates (a whole other discussion), your only cause of latency is missed frame deadlines due to slow or badly scheduled rendering.
There is no need to switch to software cursor rendering unless you want to render something incompatible with the cursor plane, e.g. massive buffers or underlaying the cursor under another surface. Synchronization with primary plane updates is not at all an issue.
> Synchronization with primary plane updates is not at all an issue.
While I wouldn't be surprised if this is technically true in a hardware sense, software-wise, Windows knows where the cursor is before it's finished rendering the rest of the screen, and updates the hardware layer that contains the cursor before rendering has finished.
> While I wouldn't be surprised if this is technically true in a hardware sense, software-wise, Windows knows where the cursor is before it's finished rendering the rest of the screen
The earlier you sample the cursor position and update the cursor plane, the more the position is out of date once the next scanout comes around, increasing the perceived input delay.
The approach that leads to the smallest possible input latency is to sample the cursor position just before issuing the transaction that updates the cursor position and swaps in the new primary plane buffer (on Linux, this is called an atomic commit). Sampling just before composition starts instead maximizes content consistency while still giving very good input latency.
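To make the difference between the two sampling points concrete, here is a minimal sketch of one compositor frame. Every name in it (`run_frame`, `simulate`, the strategy labels) is hypothetical, and the clock is simulated; it only illustrates the ordering of the steps, not any real compositor's API.

```python
# Hypothetical compositor frame loop; all names are illustrative only.
def run_frame(strategy, read_cursor, compose, commit):
    """Run one frame and return the cursor position that was committed."""
    if strategy == "consistent":
        # Sample first, so cursor and content reflect the same moment.
        pos = read_cursor()
        buffer = compose()
    elif strategy == "lowest_latency":
        # Compose first, sample right before the commit.
        buffer = compose()
        pos = read_cursor()
    else:
        raise ValueError(strategy)
    commit(buffer, pos)  # stands in for one transaction (atomic commit)
    return pos


def simulate(strategy, compose_ms=1.0):
    """Return the step order and the age of the cursor sample at commit."""
    clock = {"now": 0.0}  # simulated time in milliseconds
    log = []
    sampled_at = {}

    def read_cursor():
        log.append("sample")
        sampled_at["t"] = clock["now"]
        return (100, 100)

    def compose():
        log.append("compose")
        clock["now"] += compose_ms  # assumption: composition takes compose_ms
        return "framebuffer"

    def commit(buffer, pos):
        log.append("commit")

    run_frame(strategy, read_cursor, compose, commit)
    return log, clock["now"] - sampled_at["t"]
```

In this toy model the only difference between the two strategies is whether the composition time is added to the age of the cursor sample.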
Note that "composition" does not involve rendering "content" as the user perceives it, but just placing and blending already rendered window content, possibly with a color transform applied as the pixels hit the screen. Unless Microsoft is doing something weird, this should be extremely fast. <1ms fast.
> The earlier you sample the cursor position and update the cursor plane, the more the position is out of date once the next scanout comes around, increasing the perceived input delay.
No, the cursor position is more up-to-date than the rest of the screen because it doesn't need to wait for a GPU pipeline to finish after it's moved.
> Unless Microsoft is doing something weird, this should be extremely fast. <1ms fast.
Look, I'm saying this is what's going on. (not to scale)
Frames are extremely fast to render, but they arrive the frame after they were originally scheduled, because GPU pipelines are asynchronous. However, the cursor position arrives immediately because the position of the hardware layer can be synchronously updated immediately before scanout. The effect is that updates to the cursor position are (essentially) displayed 1 frame sooner than updates to the rest of the screen. If you actually try any of the tests I mentioned in my original comment you'll see this for yourself.
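The claim above can be written down as a toy model. This is only a sketch of the behaviour being described, under the assumption that a cursor-plane move lands on the next vblank while composited content lands one vblank later; the function names and the 60 Hz figure are illustrative, not measurements of Windows.

```python
import math

FRAME_MS = 1000 / 60  # assumed 60 Hz refresh


def next_vblank(t_ms):
    """Time of the first vblank at or after t_ms."""
    return math.ceil(t_ms / FRAME_MS) * FRAME_MS


def displayed_at(input_ms, path):
    """When an input at input_ms becomes visible, per the claimed behaviour."""
    first = next_vblank(input_ms)
    if path == "cursor_plane":
        # Plane position is updated synchronously right before scanout.
        return first
    if path == "composited":
        # Assumption: the rendered frame misses that vblank and shows on the next.
        return first + FRAME_MS
    raise ValueError(path)
```

Under these assumptions the two paths always differ by exactly one frame period, which is the one-frame lead described above.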
I'm making some assumptions about your chart as it is not to scale, but it looks like the usual worst-case strategy. Given a 60Hz refresh rate and a 1ms composition time an example of an optimized composition strategy would look something like this:
In this case, both the composite and the cursor position are only 1.2ms old at the time the GPU starts scanning them out, and hardware vs. software cursor has no effect on latency. Moving the cursor update closer to scanout would make the cursor out of sync with the displayed content, which is not really worth it.
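The arithmetic behind that figure can be sketched as follows. The split of the 1.2ms into 1ms of composition plus a 0.2ms safety margin is my assumption for illustration, and `plan_frame` and `FramePlan` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FramePlan:
    sample_at_ms: float   # when to sample input and start compositing
    commit_by_ms: float   # latest point the commit should be issued
    input_age_ms: float   # age of the sampled input when scanout starts


def plan_frame(next_vblank_ms, compose_ms=1.0, margin_ms=0.2):
    """Schedule composition as late as possible before the next vblank.

    margin_ms is an assumed safety margin, not a value from any real compositor.
    """
    start = next_vblank_ms - compose_ms - margin_ms
    return FramePlan(
        sample_at_ms=start,
        commit_by_ms=next_vblank_ms - margin_ms,
        input_age_ms=next_vblank_ms - start,
    )


period_ms = 1000 / 60            # 60 Hz refresh
late = plan_frame(period_ms)     # start just in time for the next vblank
worst_case_age_ms = period_ms    # sampling right after the previous vblank
```

With these numbers the late-start plan samples about 15.5ms into the frame, while sampling right after the previous vblank leaves the input a full frame period old at scanout.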
(Games and other fullscreen applications can have their render buffer directly scanned out to remove the composition delay and read input at their own pace for simulation reasons, and those applications tend to be the subject at hand when discussing single or sub-millisecond input latency optimizations.)
> Frames are extremely fast to render, but they arrive the frame after they were originally scheduled, because GPU pipelines are asynchronous.
The display block is synchronous. While render pipelines are asynchronous, that is not a problem - as long as the render task completes before the scanout deadline, the resulting buffer can be included in that immediate scanout. Synchronization primitives are also there when you need them, and high-priority and compute queues can be used if you are concerned that the composition task might end up delayed by other work.
Also note that the scanout deadline is entirely virtual - the display block honors whatever framebuffer you point a plane to at any point, we just try to only do that during vblank to avoid tearing.
> If you actually try any of the tests I mentioned in my original comment you'll see this for yourself.
While it might be fun to see if Microsoft screwed up their composition and paint scheduling, that does not change the fact that it is not related to GPUs or the graphics stack itself. Working in the Linux display server space makes me quite comfortable in my understanding of GPU display controllers.
> that does not change the fact that it is not related to GPUs or the graphics stack itself. Working in the Linux display server space makes me quite comfortable in my understanding of GPU display controllers.
I didn't mean to suggest some sort of fundamental limitation in GPUs that makes it impossible to synchronize this. If you take a look at my previous comments, you'll see me explicitly pointing out that I'm talking about Windows, specifically, and I'm only using it as an example of how short a latency is still perceptible. How exactly that latency happens is almost certainly not a hardware issue, however, and I never meant to imply such.