Troubleshooting: Terminal Lag (cmpxchg8b.com)
227 points by janvdberg 45 days ago | 55 comments



Love such articles where I learn something new. cdb is completely new to me. It's apparently the Microsoft Console Debugger. For others like me who were wondering how `eb win32u!NtUserSetLayeredWindowAttributes c3` neutered the window animation:

"By executing this command, you are effectively replacing the first byte of the `NtUserSetLayeredWindowAttributes` function with a `ret` instruction. This means that any call to `NtUserSetLayeredWindowAttributes` will immediately return without executing any of its original code. This can be used to bypass or disable the functionality of this function"

(Thanks to GitHub Copilot for that)

Also see https://learn.microsoft.com/en-us/windows-hardware/drivers/d...


Nice. Here's a breakdown for anyone interested:

- eb[0] "enters bytes" into memory at the specified location;

- The RETN[1] instruction is encoded as C3 in x86 opcodes; and

- Debuggers will typically load debug symbols (PDBs on Windows, ELF/DWARF elsewhere) so you can refer to memory locations by name, i.e. a function name refers to its entry point.

Putting those three together, we almost get the author's command. I'm not sure about the "win32u!NtUser" name prefix, though. Is it name-munging performed on the compiler side? Maybe some debugger syntax thrown in to select the dll source of the name?

[0]:https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

[1]:http://ref.x86asm.net/geek64.html#xC3
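
Assembled into a session, it might look roughly like the sketch below. Hedged: this assumes the Debugging Tools for Windows (which ship cdb) and working symbol resolution, and the target process is only illustrative, not necessarily the one the author attached to.

    C:\> cdb -pn explorer.exe
    0:000> u win32u!NtUserSetLayeredWindowAttributes L1
    0:000> eb win32u!NtUserSetLayeredWindowAttributes c3
    0:000> u win32u!NtUserSetLayeredWindowAttributes L1
    0:000> qd
Running u before and after the eb shows the first instruction flipping to ret (c3); qd quits and detaches. Because the written page becomes copy-on-write, the patch only affects the process you attached to.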


Yes, NtUserSetLayeredWindowAttributes is in win32u.dll.

And if you're wondering what the difference is between win32u.dll and user32.dll:

> win32u.dll is a link for System calls between User mode (Ring 3) and Kernel mode (Ring 0) : Ring 3 => Ring 0 https://imgbb.com/L8FTP2C [0]

[0] - https://learn.microsoft.com/en-us/answers/questions/213495/w...


The "win32u!" prefix is for the name of the DLL where the symbol lives. On Windows, the imported symbols are bound to their DLLs, instead of floating in the ether like they do on Linux where the dynamic loader just searches for them in whatever shared objects it has previously loaded.


So the root cause of the slowness was not found; it was just circumvented by keeping 3 xterms open and hiding/showing them?


But that does not make his solution any less valid. Or does it?

In fact, keeping something preloaded and ready to go is quite common; these two examples are off the top of my head:

- The Emacs server way - https://ungleich.ch/u/blog/emacs-server-the-smart-way/

- SSH connection reuse (sketched below).
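
For the SSH one, a minimal ~/.ssh/config sketch (assuming OpenSSH; the ControlPath pattern is just one common choice):

    Host *
        ControlMaster auto
        ControlPath ~/.ssh/cm-%r@%h:%p
        ControlPersist 10m
The first connection to a host opens a master; later ones multiplex over it and start near-instantly, which is the same keep-something-warm trick as the article's xterm pool.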


I agree pooling is a valid strategy. I just love those articles when people use some dark profiling magic to find something like misaligned memory causing severe and unexpected performance degradations.


300ms for startup still sounds slow to me. Not ridiculously so, but it won't give that snappy feeling.


I thought so, too. I'm not interested enough to benchmark it, but for all practical purposes it's instantaneous on my machine. As fast to open a new terminal as it is to switch to the existing one.


Mine takes 50ms, assuming WSL is hot (recorded the screen and compared the mouse click frame to the window pop-up frame). I think OP should try a different WSL distro or a blank machine and compare differences. I have on-access scanning off, performance on, the Ubuntu WSL distro, and Windows 10.


I believe OP recorded himself pressing a key on the keyboard and counted from the moment the key is clearly pressed to the moment xterm is up.

Compared to a screen recording, this adds the latency introduced by the keyboard and monitor, which can sometimes be 100ms+. See https://danluu.com/input-lag/


Interesting side note: our brain compensates for delay, and it can do so up to around 250ms.

So if anything lags up to that amount, our brain will compensate and make it feel instantaneous.

There was an interesting experiment that I reproduced at university: create an app that slowly builds up a delay on clicks so the brain can adapt, then remove it completely. The result is that you feel it reacting just before you actually click, until the brain adapts again to the new timing.


I don't think it's right to say that the compensation makes things feel instantaneous, but rather that we are able to still feel the association between input and result, allowing for coordinated feedback loops to be maintained. We do grow accustomed to the latency, but I do not think it is right to say that it feels like zero latency.

If the delay is long enough, the output does not just feel delayed, but entirely unrelated to the input.

A latency perception test involving a switch can easily be thrown off by a disconnect between the actual point of actuation and the end-user's perceived point of actuation. For example, the user might feel - especially if exposed to a high system latency - that the switch actuates only after the button has physically bottomed out and been squeezed with increased force, as if they were trying to mechanically force the action, and later be surprised to realize, once the virtual latency is removed, that the actuation point was at less than half the key travel.

Without knowing the details of the experiment, I think this is a more likely explanation for a perception of negative latency: Not intuitively understanding the input trigger.


As a long time gamer, I can anecdotally corroborate your theory with my early experiences playing FPS games on a dial-up connection. Average ping was about 200ms, which allowed for an enjoyable and accurate experience after some adjustment. >250ms was unpleasant and had a significant impact on ability.

It was for this reason that I, and many others, for a short period got objectively "worse" at the game when we switched to ISDN/cable and suddenly found ourselves with 20-30ms pings; our brains were still including the compensating latency when firing.


This seems more like compensation for projectile velocity, no?

I am assuming that the latency is in enemy location, due to the game running hitscan (instantaneous weapons without trajectory simulation) on the server. In this case, your aim is where it was when you clicked the trigger, but the hit is only computed <latency> later, when the server processes the incoming shot request, at which point the enemy position has changed.

This makes the latency behave similarly to projectile velocity, where you need to aim not where a target is but where a target will be. Changing to a setup with lower latency would then be like switching to a much faster weapon, which requires new training to use.

(Input latency would mean that if you move your aim and click the trigger, your aim would continue to change for <latency> time before the bullet fires towards whatever your aim ends up being. This is much worse.)


> I’ve been using this configuration for a few days, so far it’s working great. I haven’t noticed any issues running it this way.

The journey was very useful, even if the destination may be pretty specific to your needs. The process of debugging minor annoyances like this is really hard to learn about.


Just for fun I filmed some video footage of my 60Hz monitor to see how quickly my terminal starts up. It seems to take 2-3 frames for the terminal window to show up, and 1-2 frames more for the shell prompt. So 50 ms - 83 ms. This is with the foot terminal on Sway.

My very unscientific methodology was to run

    $ echo hello && foot
in a terminal and measure the time between the hello text appearing and the new window appearing. Looking at my video, the time from physical key press to "hello" text appearing might be 20ish ms but that is less clear, so about 100 ms total from key press to shell prompt.

This is a pretty much completely untuned setup, I haven't done any tweaks to improve the figures. Enabling foot server might shave off some milliseconds, but tbh I don't feel that's necessary.

It'd be fun to do this with a better camera, and also with better monitors. Idk how difficult it would be to mod an LED into the keyboard to capture the exact moment the key is activated; just trying to eyeball the key movement is not very precise.


Wouldn't it make more sense to screen record with wf-recorder instead of a video camera?


In the end, click-to-photon latency is what matters, so measuring the whole system end to end is a good starting point, and that means a video camera pointing at the screen. Something like wf-recorder sees only part of the whole pipeline. How much latency is there between the compositor copying a frame to wf-recorder and the same frame getting pushed physically out on the display cable? Without knowing exactly how the whole system is built, such a question is difficult to answer.

But you also have to account for the fact that wf-recorder might interfere with the results: capturing the screen is not free, and it might even push some part of the pipeline onto less optimal paths. With a video camera you can be fairly confident that measuring isn't interfering with anything.


Mhm, makes sense. Maybe something like a capture card with a high refresh rate would be the best option, as it won't interfere with the OS and eliminates the mismatch between when your camera captures a frame and when your monitor refreshes.


Sure, a high-speed capture card could be nice. But many smartphones can do high-speed video, some even 960 fps, which makes them a very convenient (and low-cost) solution.


This is a tour de force on the type of curiosity it takes to be really successful with computers.


I'm at the tail end of my career, so working on efficiency gains like this doesn't usually add up for me.

However I was interested in knowing whether it does for the author.

Assuming he/she does suffer this 1300 ms delay "hundreds" of times a day (let's say 200), and for the sake of argument they use their computer 300 days a year and have 20 years of such work ahead of them with this config, then this inefficiency will total 1300 x 200 x 300 x 20 / 1000 / 60 / 60 hours wasted during the author's lifetime - some 430 hours.
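
The same back-of-the-envelope sum, for anyone who wants to plug in their own numbers (bc is just one way to run it):

    $ echo '1300 * 200 * 300 * 20 / 1000 / 60 / 60' | bc
    433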

So well worth the effort to fix!


I find that annoyances cost me much more than the wall-clock time of the delay. You're lucky to disagree :)


I had a printout of [1] at my office. Of course, at its base it is only a simple multiplication table, but nevertheless it reminded me several times that an issue is worth fixing.

[1] https://xkcd.com/1205/


I'm so distracted by latency that I run my macOS with vsync disabled 24/7 (through Quartz Debug).

When I used to use Windows 10+ years ago, I had decent luck using xming + cygwin + Cygwin/X + bblean to run xterm in a minimal latency/lag environment.

I also launch Chrome/Spotify/Slack desktop using:

$ open -a Google\ Chrome --args --disable-gpu-vsync --disable-smooth-scrolling


One way to have the cake and eat it too is to upgrade to a high refresh rate display. No tearing + less latency + smoother display. Although there are diminishing returns, even 60Hz -> 144Hz+ will make a lot of difference. On a 240Hz display, the vsync penalty is just ~4ms.

Also if you are using a miniLED M-class MBP, its pixel response is abysmal.


I've been running uncomposited X for years to reduce latency, but after getting a dual 120 Hz monitor setup, I might finally consider Wayland! This is good advice.

Too bad vscode doesn't support higher refresh rates. It's locked to 60 for some reason I haven't been able to grasp.


Since vscode is an electron app, have you tried opening it with

$ open -a Visual\ Studio\ Code --args --disable-gpu-vsync --disable-smooth-scrolling --disable-frame-rate-limit


Yes, that doesn't help unfortunately. There is a github issue with a long discussion, and none of the tips have seemed to help.

https://github.com/microsoft/vscode/issues/65142


Yep, have been planning to upgrade to a 240hz+ OLED for a while! I find the typing input latency on my M1 MacBook Pro to be pretty abysmal when using the built-in retina display and no external monitor - I almost feel like I can only get work done when I have it plugged into my external monitor in clamshell mode and disable vsync.


"abysmal"?? Literally the first time I've seen someone negatively mention latency on a recent Macbook


Pixel response time is something like 30ms on MacBook Pro. So the 120hz screen can feel more like 30hz.


I'm happy to report I have no idea what you're talking about and find the typing on the Macbook the best computer experience I ever had :D

I'll be careful not to use higher refresh rate devices though, that could show me what you're talking about :)


It just seems so wasteful to run desktop and office programs at hundreds of hertz…


This is one of those things where if your applications are written using native frameworks the difference is minimal, and you get the benefit of an actual smooth experience. Meanwhile if the app is a "custom lightweight framework", you're likely just burning CPU cycles.


How so? Anything under 1000Hz has obvious delays: https://youtu.be/vOvQCPLkPt4?si=oXDiV9gagyZdnkkM


Desktop programs should only repaint when they need to. So you are only actually rendering the programs at hundreds of hertz when something is animated.


Out of curiosity, have you tried a 144hz monitor on macOS with vsync enabled?


I use a cheap 75hz IPS from my office, though ideally I'd like to upgrade to a 240hz+ OLED w/ VRR since macOS now supports adaptive sync[0]; I've been waiting because I'm not satisfied w/ any of the OLED monitors currently on the market and my monitor upgrade request was denied by my employer.

Though I've used the Apple Magic Keyboard w/ Touch ID exclusively for a while, I'm also thinking about upgrading to the new Wooting 80HE keyboard this fall since it has an 8kHz polling rate, analog hall effect switches, and is designed to be ultra low latency w/ tachyon mode enabled.

[0]: https://support.apple.com/guide/mac-help/use-adaptive-sync-w...


Very nice article, I love such debugging. I sometimes do it myself too.

Anyway, this also made me think about the general bloat we have in new OSes and programs. I'm still on an old OS running spinning rust, and bash here starts instantly when the cache is hot. I think GUI designers have lost the engineer's touch...


We need a community of those obsessed with responsive applications. UI latency irks me on every device. Not only computers and smart phones, but now TVs, refrigerators, cars all have atrocious UI latency.

Great debugging work to come up with a solution!


it was a fun read


Upvote just for teaching me about the existence of `hyperfine`.

    $ hyperfine 'alacritty -e true'
    Benchmark 1: alacritty -e true
      Time (mean ± σ):      84.1 ms ±   4.9 ms    [User: 40.1 ms, System: 30.8 ms]
      Range (min … max):    80.5 ms … 104.4 ms    32 runs
    
    $ hyperfine 'xterm -e true'
    Benchmark 1: xterm -e true
      Time (mean ± σ):      81.9 ms ±   2.6 ms    [User: 21.7 ms, System: 7.9 ms]
      Range (min … max):    74.9 ms …  87.1 ms    37 runs
    
    $ hyperfine 'wezterm -e true'
    Benchmark 1: wezterm -e true
      Time (mean ± σ):     211.7 ms ±  13.4 ms    [User: 41.4 ms, System: 60.0 ms]
      Range (min … max):   190.5 ms … 240.5 ms    15 runs


If we're handing out tips, then as noted in a few examples from the article, hyperfine is even more useful when called with multiple commands directly. It presents a concise epilogue with the information you're probably trying to glean from a run such as yours:

    $ hyperfine -L arg '1,2,3' 'sleep {arg}'
    …
    Summary
      sleep 1 ran
        2.00 ± 0.00 times faster than sleep 2
        3.00 ± 0.00 times faster than sleep 3
If your commands don't share enough in common for that approach then you can declare them individually, as in "hyperfine 'blib 1' 'blob x y' 'blub --arg'", and still get the summary.


I once used hyperfine to micro-benchmark elisp functions. I set $SHELL to a script that evaluated its arguments in Emacs by talking to a long-running session over a named pipe. Hyperfine runs a few no-ops with $SHELL and factors out the overhead, though it was still helpful to run a nested loop in elisp for finer results.
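
Something in that spirit can be sketched with emacsclient standing in for the named pipe. Hedged: this assumes a running Emacs server, uses hyperfine's --shell option rather than the $SHELL variable, and elisp-shell.sh / my-function are made-up names:

    $ cat > /tmp/elisp-shell.sh <<'EOF'
    #!/bin/sh
    # hyperfine invokes this as: /tmp/elisp-shell.sh -c '<command>',
    # so "$2" is the elisp form to evaluate in the running Emacs.
    # (hyperfine's empty calibration runs are a no-op here)
    [ -z "$2" ] || emacsclient --eval "$2" >/dev/null
    EOF
    $ chmod +x /tmp/elisp-shell.sh
    $ hyperfine --shell /tmp/elisp-shell.sh "(my-function)"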


Besides learning about `hyperfine`, the combination of `xargs` to keep N warm processes ready, `LD_PRELOAD` to trick them into waiting to map their windows, and `pkill --oldest ...` to get one of those to go is quite neat.

But I have a very different solution to this problem: have just one terminal window and use and abuse `tmux`. I only use new windows (or tabs, if the terminal app has those) to run `ssh` to targets where I use `tmux`. I even nest `tmux` sessions, so essentially I've two levels of `tmux` sessions, and I title each window in the top-level session to match the name of the session running in that window -- this helps me find things very quickly. I also title windows running `vi` after the `basename` of the file being edited. Add in a simple PID-to-tmux window resolver script, scripts for utilities like `cscope` to open new windows, and this gets very comfortable, and it's fast. I even have a script that launches this whole setup should I need to reboot. Opening a new `tmux` window is very snappy!
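
For anyone who hasn't tried that style of workflow, a minimal sketch of the basic moves (the session/window names and $EDITED_FILE are just illustrative):

    $ tmux new-session -A -s main       # attach to "main", creating it if needed
    $ tmux new-window -t main -n build  # a new window ("tab") inside the session, near-instant
    $ tmux rename-window "$(basename "$EDITED_FILE")"   # title the window after the file being edited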



I also didn't know about `hyperfine`, very nice!

Even 80ms seems unnecessarily slow to me. 300ms would drive me nuts ...

I'm using a tiling window manager (dwm) and, interestingly, the spawning time varies depending on the position the terminal window has to be rendered at.

The fastest startup time I get is in the fullscreen tiling mode.

   hyperfine 'st -e true'
   Benchmark 1: st -e true
     Time (mean ± σ):      35.7 ms ±  10.0 ms    [User: 15.4 ms, System: 4.8 ms]
     Range (min … max):    17.2 ms …  78.7 ms    123 runs
The non-fullscreen one ends up at about 60ms which still seems reasonable.


You could maybe find out where the delay is by using st's Xembed support? Create a window with tabbed¹ in a tiling layout, and open st into it with "st -w <xid> -e true". If it is close to the monocle time, it is probably the other windows handling the resize event that are causing the slowdown, not the layout choice.

To prove it to myself: I'm using river² and I can see a doubling-ish of startup time with foot³, iff I allow windows from heavier apps to handle the resize event immediately. If the time was a little longer (or more common) I'd be tempted to wrap the spawn along the lines of "kill -STOP <other_clients_in_tag>; <spawn & hold for map>; kill -CONT <other_clients_in_tag>" to delay the resize events until my new window was ready. That way the frames still resize, but their content resize is delayed.

¹ https://tools.suckless.org/tabbed/

² https://codeberg.org/river/river

³ https://codeberg.org/dnkl/foot
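
A rough sketch of that Xembed test, hedged: it assumes tabbed's -d flag, which as far as I remember prints the container window's XID to stdout:

    $ xid=$(tabbed -d)                 # spawn an empty tabbed container, grab its XID
    $ hyperfine "st -w $xid -e true"   # time st embedding into it instead of creating a toplevel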


The result of running the same on st for me:

    Benchmark 1: st -e true
      Time (mean ± σ):      35.4 ms ±   6.9 ms    [User: 15.1 ms, System: 3.8 ms]
      Range (min … max):    24.2 ms …  65.2 ms    114 runs
This is on awesome-wm with the window opening as the 3rd tiled window on a monitor, which means it has to redraw at least the other two windows. I'm also running xfs on top of luks/dm-crypt for my filesystem, which shouldn't matter too much on this benchmark thanks to the page cache, but is a relatively common source of performance woes on this particular system. I really ought to migrate back to unencrypted ext4 and use my SSD's encryption but I haven't wanted to muck with it.


To get an idea of the cost of tiling (with bspwm, quarter screen tile and 2560x1440@60Hz screen):

  hyperfine -L args '','-c floating' 'st {args} -e true'
  Benchmark 1: st  -e true
    Time (mean ± σ):      25.0 ms ±   2.7 ms    [User: 10.5 ms, System: 3.7 ms]
    Range (min … max):    14.8 ms …  44.1 ms    197 runs

  Benchmark 2: st -c floating -e true
    Time (mean ± σ):      22.7 ms ±   2.6 ms    [User: 10.3 ms, System: 3.9 ms]
    Range (min … max):    20.7 ms …  35.4 ms    123 runs

  Summary
    'st -c floating -e true' ran
      1.10 ± 0.17 times faster than 'st  -e true'
Flexing my system too, heh.


Does it parse commands and call exec*() or spawn a new shell for every run of every command?


You can choose the behaviour with the --shell option¹. The default behaviour is nice because it allows you to benchmark pipelines easily, but if you want to change it you can.

¹ https://github.com/sharkdp/hyperfine#intermediate-shell
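
For example (assuming a reasonably recent hyperfine), you can skip the intermediate shell entirely when benchmarking a single binary:

    $ hyperfine --shell=none 'st -e true'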



