Digging for performance gold: finding hidden performance wins (chromium.org)
150 points by markdog12 on April 23, 2021 | 49 comments



I personally find performance optimization on an existing, high traffic system to be some of my favorite times as a developer. You have a large number, and you have to get it as close to zero as possible. There's no mystery as to whether you improved something and it has a tangible reward.


If you work at a sufficiently large company then optimisation work on sufficiently low level systems can save 7–8 figures per year, not that you’d likely see much of those savings yourself. It often turns out that some tiny bit of code with not-great performance becomes embedded deeply in the stack and the small cost can add up over many machines.


> There's no mystery as to whether you improved something

I'd nitpick a little bit and say it's possible that an optimization in one case causes a slowdown in another case, or worse, a bug. Benchmarks can also be inconsistent on the "same" case due to caches, etc., some of which may live outside the code you actually control. Even the simplest program will vary a bit when you re-run it on the same system due to the state of that system (other processes, temperature, CPU cache, etc).

Some optimizations are clear wins, but many of them involve trade-offs and can have some mystery. Thorough testing/benchmarking helps a lot, but it can only get you so far.
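
One mitigation is to never trust a single number: repeat the run and look at the spread. A minimal harness sketch (C++, the workload is whatever you pass in):

  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <vector>

  // Repeat a measurement and report the spread; run-to-run variance
  // (caches, other processes, thermal state) is part of the answer.
  template <typename Fn>
  void Measure(Fn&& work, int runs = 30) {
    std::vector<double> ms;
    for (int i = 0; i < runs; ++i) {
      auto t0 = std::chrono::steady_clock::now();
      work();
      auto t1 = std::chrono::steady_clock::now();
      ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(ms.begin(), ms.end());
    std::printf("min %.3f / median %.3f / max %.3f ms over %d runs\n",
                ms.front(), ms[ms.size() / 2], ms.back(), runs);
  }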


FYI Chrome performance is not generally guided by microbenchmarks, for the exact reasons you mention. It is guided by full-scale benchmarks (e.g. render the top 10000 sites) and by ChromeOS-wide profiles gathered in the wild. If a performance change doesn't work in the wild as indicated by profiles then it generally will be backed out and reconsidered. This is consistent with Google's backend performance culture where microbenchmarks are fine and good but changes need to be vetted on a full-scale production loadtest fixture.


Yep, makes sense to me. Unit tests assume that a clean-room environment translates reasonably well to the end result; the more naturally complex or unruly a product or a target metric is, the more your testing process should lean towards integration/real-world testing.


It looks like you are confusing testing and benchmarking. Those are different problems that require different skills and tools.

Integration/end-to-end tests are terrible for performance benchmarks. They are designed to capture functional issues at the seams. They are usually heavy and not very diverse (because they're fragile and/or expensive to maintain), and focused on a few key critical functional paths, e.g. "can I put something in my shopping cart and pay".

That's pretty much the opposite of "real-world". The "real-world" is the distribution of what the end-user experiences. With proper tracing, one can identify the key real-world hotspots (including the data associated with them) and then focus on optimizing those.


Not to mention maintenance and future development costs if the optimization makes the piece of software more complicated and less flexible.


It doesn't even have to be for a high traffic system; it could also be for a cost-constrained system. If you're an indie developer, you might be able to afford $50 a month but not $500 a month for your side project, and improving performance can keep you in business and give you a shot at success.


I saw a related but different variation, involving a low-use system.

Imagine it was a system for detectives to track complex murder* investigations. They only need to conduct approx 10 such investigations per year, but quality/correctness is crucial.

The full end to end tests are performed thousands and thousands of times per year, and if they are slow the entire development effort is slow. So you end up needing to make the whole system performant, even though end users aren't experiencing the pain; devs/QA are.

(* this is not the actual domain, but similar rarity + criticality)


I'm with you - one of my favourite things to do is optimise performance, whether it's memory, CPU, latency, whatever.

Actually, I often enjoy it a bit too much... it's frequently the case that I'll realise I've just spent an entire day reducing memory allocations that didn't really need reducing, rather than building features :(


Great write-up. I really feel like any kind of performance optimization, compromise, or other detail should be accompanied by tests or assertions that capture all of the inputs that supported the decision. In this example, ideally the compromise necessary to support Windows XP would have come with an assertion that the minimum supported version of Windows was still XP or earlier. This way, the decision is remembered and revisited if XP stops being supported, because the build would break. I don't know what the Chrome code looks like, but I imagine something like

  // TODO: Remove this hack once we drop Windows XP
  assert(min_win <= win_xp)
... simple. I recall finding a function deep in Google search that had been "optimized" in x86 assembly, but way back when the cache lines were 32 bytes. On Opteron and later the "optimized" code was slower than idiomatic C++. That's when I decided any kind of performance decision needs to be recorded, somehow. Either something like `assert cache_bytes==32` or just a FIXME($date) that forces someone to revisit the decision every year.
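
With C++17 the cache-line assumption can even be checked at compile time (a sketch, assuming a toolchain that provides std::hardware_destructive_interference_size):

  #include <new>

  // Record the tuning assumption where the compiler can check it; on a
  // machine with 64-byte cache lines (Opteron and later) this fails the
  // build and forces someone to revisit the hand-optimized path.
  static_assert(std::hardware_destructive_interference_size == 32,
                "fast path was tuned for 32-byte cache lines; re-benchmark");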


I've always thought that there should be a coding construct (especially for inline assembly) where you have the code in a "macro-like comment" in the original C, and then the inline "live" version, and one of the integration checks determines whether they deviate in performance or results (and therefore should be retested).
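
Today you can approximate that with a test that keeps the portable version alive next to the optimized one (a sketch with made-up functions):

  #include <cassert>
  #include <cstddef>
  #include <cstdint>

  // The "macro-like comment": a portable reference implementation that
  // documents what the optimized version must compute.
  uint64_t SumRef(const uint32_t* v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s;
  }

  // The "live" version; imagine inline asm or intrinsics here.
  uint64_t SumFast(const uint32_t* v, size_t n) {
    uint64_t s = 0;  // stand-in body for the sketch
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s;
  }

  // Integration check: results must match. A benchmark harness would also
  // flag the pair for re-tuning if SumFast stops beating SumRef.
  void CheckEquivalence(const uint32_t* v, size_t n) {
    assert(SumFast(v, n) == SumRef(v, n));
  }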


I wish the Chrome team would dig into why my Chrome uses nearly all of two cores all of the time with one tab open. The issue (or something similar) comes up all the time on their forums and they just lock all the topics[1]. Chrome is such a resource hog under normal operation that it's hard to say when something is going wrong.

[1] https://support.google.com/chrome/thread/17537877?hl=en


Use Chrome tracing [1] or Perfetto [2] to take a couple of traces while the problem is happening, then submit a bug with the traces attached. That's one of the most effective ways to report performance bugs. It's especially powerful on a Chromebook, because Chrome OS integrates Linux ftrace with this app-level tracing and draws a system-wide picture of the workload.

(Disclaimer: I used to work on Chrome many years ago.)

[1] https://www.chromium.org/developers/how-tos/trace-event-prof... [2] https://ui.perfetto.dev/


Thanks! I'll take a look. I've long since abandoned the browser but I like submitting bugs.


It doesn't take much CPU for me. Chrome's 19 processes combined sit at 1% CPU and about 1 GB of RAM. It's probably the fault of the sites you're visiting.

Edit: Looking through that thread, it seems like some plugins caused the issue.


I'm glad their browser works that well for you. It does not for me: https://dl3.pushbulletusercontent.com/dRiiaqbW844ZN3QHGcdWNe...


Try disabling all your browser extensions and see if the problem persists.

It's amazing to me how inefficiently a lot of browser extensions are written. E.g. last I checked, metamask pulls in web3, which is a clown car of javascript that takes hundreds of milliseconds to parse [1]. That code needs to be parsed every time you navigate to a new website. You might not notice a single extension like that, but with a few bad extensions it's easy for your browser to slow to a crawl. The obvious response is to blame the browser for stuff like this, but it's usually the extensions that are causing your problems.

[1] https://github.com/ChainSafe/web3.js/issues/1178


Yah, I've tried that. If you read the thread I linked, everyone in it (who has not resolved their problem) has tried it. Right now I opened Chrome back up and it's sitting at 186% CPU with one tab open and no extensions.

screenshot: https://dl3.pushbulletusercontent.com/dRiiaqbW844ZN3QHGcdWNe...


Open the Chrome task manager (shift-esc). What is loaded in that one tab? Does it have an associated service worker? Is it heavy on JS? WebAssembly? Multiple workers running?


Great write-up. I've done countless investigations like this and couldn't have worded it better.

> Depth vs. breadth.

Ah yes, which direction do you look at your program from? Do you look at which functions consume the most resources bottom-up (probably some string or memory function in libc), or top-down?

If you're the person writing the system libraries for enormous platforms, probably bottom-up; if you're an application developer, top-down. Sometimes, though, especially with the performance issue described in the article, you're in the middle. Those are tough to spot!


Personally, in an application, I would still start with a quick bottom-up look. It's unlikely, but maybe there's an obvious way to improve one of those hot functions; an unnecessary copy or similar can easily slip in.

Then, yes, spend your time top-down, which is where you're more likely to find consistent gains, usually by finding ways to call those low-level functions less often.


> Chrome measures jank every 30 seconds, so Jank in 1% of samples for a given user means jank once every 50 minutes

Is that actually true? Doesn't this just mean that once every 50 minutes the system has been janky for >= 60s? Anything less and you're below the Nyquist frequency and unlikely to actually be sampling it, no? My knowledge of signal analysis is just what I recall from some intro university classes, so there could be more to this claim; happy to learn if I'm misremembering (and it might be complicated further because they're also sampling across a population of users).

(Speaking as someone who regularly has to shut down Chrome because it's making my entire machine janky.)


I think they mean "we have code that sets a flag whenever the UI blocks for more than 100 milliseconds. We clear that flag every 30 seconds. We see it set 1% of the times that we clear it."
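
In other words, something like this (a sketch with made-up names, not Chrome's actual code):

  #include <atomic>
  #include <chrono>

  void ReportSample(bool janky);  // hypothetical metrics hook

  std::atomic<bool> g_saw_jank{false};

  // UI thread: mark the current window if any event blocked for >100 ms.
  void OnEventHandled(std::chrono::milliseconds latency) {
    if (latency > std::chrono::milliseconds(100)) g_saw_jank = true;
  }

  // Timer, every 30 s: report whether this window saw jank, then reset.
  // "Jank in 1% of samples" = this reports true in ~1 of 100 windows.
  void OnSampleWindowClosed() {
    ReportSample(g_saw_jank.exchange(false));
  }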


That would work, but it wouldn't let you say that jank happens "once every 50 minutes", because you don't actually know how many times it happened.

Also, this article isn't talking about UI blockage. It's talking about the time delta between user input and the result hitting the eyeball, presumably even across asynchronous threads/IPC.


I think the phrasing they used ("Let’s talk about 1%. 1% is quite large in practice. The core metric we use is “jank” which is a noticeable delay between when the user gives input and when software reacts to it. Chrome measures jank every 30 seconds, so Jank in 1% of samples for a given user means jank once every 50 minutes.") was just to give an example of what 1% means in practice (one janky sample per 100, at one sample per 30 seconds, is one per 3,000 seconds, i.e. every 50 minutes).

> Also, this article isn't talking about UI blockage. This is talking about the time delta between user input & the result hitting the eyeball, presumably even across any asynchronous threads/IPC.

Aren't those the same things?


That would be a pretty weird example to give, I think, if the article were solely focused on a specific jank issue. I think it's more that "at Chrome scale, 1% is a lot, especially when you're talking about number of users".

> Aren’t those the same things?

Depends how you define it. Typically I think of "UI blockage" as "main thread doing CPU work, or blocked on something, and not processing events". That's a subset of the problems described (and maybe not even a strict subset, since you may have some kinds of UI blockage that aren't directly tied to a user action). A user action might cause a repaint of the cursor/text. That repaint actually gets to the user through the compositor, which is a separate process (for security reasons). That's all asynchronous, which means you have to plumb your timestamps and metadata about the source event through all dependent work in a meaningful enough way to come up with an answer.
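
Concretely, that plumbing means carrying the source timestamp along with every piece of dependent work, roughly like this (a sketch; all names are made up):

  #include <chrono>

  using Clock = std::chrono::steady_clock;

  void ReportInputLatency(std::chrono::milliseconds);  // hypothetical hook

  // Travels with the event through every async hop (renderer, compositor,
  // GPU process), attached to whatever work the event causes.
  struct EventMetadata {
    Clock::time_point input_time;
  };

  // Called when the frame produced for the event actually hits the screen.
  void OnFramePresented(const EventMetadata& meta) {
    auto latency = std::chrono::duration_cast<std::chrono::milliseconds>(
        Clock::now() - meta.input_time);
    ReportInputLatency(latency);
  }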


You don't know how many times what happened?

You have to decide how to divide a continuous measurement into discrete instances of jank. Assuming that interpretation is right, they have decided to lump together up to 30 seconds of bad behavior into a single instance. Since the hit rate is still low, that seems like a pretty accurate way to get a count.

If they wanted to measure dropped frames, they could do that too, but it's a less useful number all by itself because you have no idea how they're distributed. Lumping every 30 seconds together gives you a much better idea of distribution.


> Speaking as someone who regularly has to shut down Chrome because it's making my entire machine janky

You either need more RAM, or a browser that uses less RAM...


Currently I have 32 GB, and I've used a machine with 96 GB. How much RAM do browsers need? FWIW, Firefox doesn't do much better.


Do you have a machine with slow storage (hard disk or early SSD)? Chrome's HTTP cache does a lot of tiny reads and can easily make the whole system slow, especially when the profile is gigabytes or more.

Clearing the cache, or even the entire Chrome profile, will fix it if that's the case.


Traditionally always an SSD, and more recently (on the 32 GB machine) NVMe. I/O is certainly a good hypothesis. Regardless of the specific resource, I think the fault actually lies with the kernel. I don't care how many subprocesses are started; the totality should be grouped under a jail that is fairly queued against all other work for CPU and I/O, unless I explicitly raise that jail's limits (heck, maybe even RAM: swap out Chrome more quickly if it's hogging all the RAM).


Chrome has a lot of processes, but only one or two of them do all the disk and network I/O, so I don't think that particular hypothesis holds up.

What you say probably is an issue with CPU scheduling though.


Yeah. I was thinking more that kernels historically haven't been able to achieve good I/O queuing for user-facing workloads (some of which was probably because the hardware interfaces weren't good enough; maybe that's since been resolved).

I do think it causes issues with CPU scheduling, but it could be any number of other issues. I don't think kernel developers are looking at improving the overall performance of a system with a large number of Chrome tabs.


Now that I've finished reading, I'm curious why the Chrome team didn't optimize GetLinkedFonts, since it's the obvious culprit to my eye. Querying the registry is slow. Really slow. The Chrome code appears to always read it on a missing value, so if you have a missing value in your already-populated cache, every miss is going to reach out to the infrequently changing registry. It makes far more sense to only invalidate your in-process cache when the registry actually changes.
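
On Windows that invalidation could be driven by the real RegNotifyChangeKeyValue API, roughly like this (a sketch; the cache function and the one-shot re-arm pattern are stand-ins, not the Chromium code):

  #include <windows.h>

  void ClearFontLinkCache();  // hypothetical

  HKEY g_key;
  HANDLE g_changed;  // signaled when the font-link key is modified

  void StartWatching() {
    RegOpenKeyExW(HKEY_LOCAL_MACHINE,
        L"SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\FontLink\\SystemLink",
        0, KEY_READ | KEY_NOTIFY, &g_key);
    g_changed = CreateEventW(nullptr, /*bManualReset=*/TRUE, FALSE, nullptr);
    RegNotifyChangeKeyValue(g_key, FALSE, REG_NOTIFY_CHANGE_LAST_SET,
                            g_changed, /*fAsynchronous=*/TRUE);
  }

  // On each cache lookup: flush only if the registry actually changed.
  void MaybeInvalidateCache() {
    if (WaitForSingleObject(g_changed, 0) == WAIT_OBJECT_0) {
      ClearFontLinkCache();
      ResetEvent(g_changed);
      // Notifications are one-shot, so re-arm after each signal.
      RegNotifyChangeKeyValue(g_key, FALSE, REG_NOTIFY_CHANGE_LAST_SET,
                              g_changed, TRUE);
    }
  }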


> Now that I've finished reading, I'm curious why the Chrome team didn't optimize GetLinkedFonts since it's the obvious culprit to my eye.

That's the point of the article: `GetLinkedFonts` is the "obvious culprit", but it's the fallback to the fallback; it should not be getting called in the first place. It doesn't really matter that it's slow because it should almost never be called.

And I assume they fixed (or will fix) the cache so that it caches failures too, so GetLinkedFonts is called once per failing lookup instead of over and over again (successes already get cached after the first call).


As you can see from another comment, they're being inefficient in the "uncommon" case, reading the registry even when there are no changes. How often is "not often"? The 1% case they're trying to tackle. It's ~1-2 weeks (maybe more) of one engineer's time to watch the registry, and either they didn't think of a registry watcher or considered it not yet worth the ROI.


Indeed, it would seem a trivial change to cache failed lookups here [1]

[1]: https://source.chromium.org/chromium/chromium/src/+/master:u...
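
The shape of that change would be something like this (a sketch with hypothetical types and a hypothetical query function, not the actual Chromium code):

  #include <map>
  #include <optional>
  #include <string>
  #include <vector>

  // Hypothetical slow registry query; returns nullopt when there is no
  // font-link entry for the family.
  std::optional<std::vector<std::wstring>> QueryRegistryForLinkedFonts(
      const std::wstring& family);

  // Cache successes AND failures, so repeated misses stop re-querying
  // the registry on every call.
  std::map<std::wstring, std::optional<std::vector<std::wstring>>> g_cache;

  const std::vector<std::wstring>* GetLinkedFonts(const std::wstring& family) {
    auto it = g_cache.find(family);
    if (it == g_cache.end())
      it = g_cache.emplace(family, QueryRegistryForLinkedFonts(family)).first;
    return it->second ? &it->second.value() : nullptr;
  }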


Yup. I looked at the source first to double-check there wasn't a legitimate reason for the registry read.

EDIT: Filed the suggestion upstream: https://bugs.chromium.org/p/chromium/issues/detail?id=120214...


I think you're assuming too much about what they mean by 'measure'. My guess is that they measure all jank regardless of duration, and record in frequency buckets that are 30s wide. But it's not exactly clear.


30s frequency buckets wouldn't be phrased as "every 30s" though, no?


I don't know how they would phrase it, but the whole sentence seems disjointed like it's been through too many editing passes and lost meaning.


> A subset of Canary users who have opted in to sharing anonymized metrics have circular-buffer tracing enabled

Where is that setting? I'm pretty sure it asks on install, but what about after that?

Update: Seems to be in settings -> Sync and Google Services -> Help improve Chrome's features and performance


Speaking of performance, Chrome's WebGL performance is quite good. In some of the stress tests I ran, it came out 40% faster than Firefox. It seems Chrome is faster at ferrying WebGL calls and large amounts of data to the GPU.


I wrote something using Python and Pillow that prints titles, credits, and qr codes on the back of art prints.

I ran very much into the problem that there aren't really "Unicode" fonts; rather, the web browser is patching together characters from different fonts when you use Chinese, Arabic, emoji, etc.

I want something that looks like the card at the art museum that introduces a piece, so I have just a nice serif English font and a Japanese font I like, because I have a lot of Japanese subject matter.

If I wanted to print some math characters or Arabic, I would have to register that typeface in my system, but it's a hassle at the moment.

What I get for this (as compared to HTML) is that the system understands the border of the card, which is a big deal for a 6x4 card, and I can align multiple printings on both sides to the limits of the hardware.

Jank is the least of my problems.


The "Noto" family of fonts may be of interest: https://www.google.com/get/noto/


Those are OK if I have band-aids for my eyes!


Is there anything like that for Firefox? I would love to participate.


I uninstalled Chrome, switched to another Chromium-based browser, and never looked back. I haven't had a performance issue on my Mac since then.



