Modern graphics hardware can composite multiple buffers at scan-out time (which avoids the extra write and read-back of a composited framebuffer, so it saves memory bandwidth), but it's not clear to me whether anyone besides perhaps DWM uses that.
"Multiple" means about 4 (that's the number for Skylake and Kaby Lake generation Intel), and one of them is the cursor plane. There might also be miscellaneous limitations regarding overlaps, so it might not be generally usable. Android in its earlier years used it for the notification area, as it basically split the screen vertically when the notification shade was moved.
Additionally, these buffers can be scaled at scan-out time. One thing this is used for is emulating lower resolutions for XRandR clients under XWayland (Wayland doesn't allow arbitrary apps to switch the display resolution).
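To make the scaling idea concrete, here is a rough sketch of the arithmetic involved when a small buffer is stretched onto a larger output. It is only an illustration: the names `Rect` and `fit_scaled` are made up, and the aspect-preserving, centered placement is my assumption, not necessarily what XWayland or any particular compositor does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int


def fit_scaled(src_w: int, src_h: int, out_w: int, out_h: int) -> Rect:
    """Destination rectangle on the output for a source buffer scaled as
    large as possible while keeping its aspect ratio (hypothetical policy)."""
    # Compare aspect ratios by cross-multiplying, which avoids float rounding.
    if src_w * out_h >= src_h * out_w:
        # Source is relatively wider: fill the width, bars above and below.
        w = out_w
        h = src_h * out_w // src_w
    else:
        # Source is relatively taller: fill the height, bars left and right.
        h = out_h
        w = src_w * out_h // src_h
    # Center the scaled image on the output.
    return Rect((out_w - w) // 2, (out_h - h) // 2, w, h)


# An 800x600 client buffer shown on a 1920x1080 output.
print(fit_scaled(800, 600, 1920, 1080))  # Rect(x=240, y=0, w=1440, h=1080)
```

With real hardware the display engine would be handed the source size and this destination rectangle and would do the resampling itself while scanning out, so the GPU never touches the pixels.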
Under macOS, scaling at scan-out time is used for fractional scaling of the entire framebuffer without using the GPU.
I was under the impression that it was mostly the mobile GPUs that supported blending a large number of planes at scan-out time. I've written software for random ARM SoCs where there were a dozen or so planes whose ordering and bounds you had to program. The first was typically the default framebuffer, another was the cursor, two were the outputs of the hardware video decoders, and the rest were up to the application developer to use.
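For anyone who hasn't dealt with planes, a toy model of "ordering and bounds" might look like the following. Everything here is hypothetical (the `Plane` fields, the `topmost` helper, the example geometry), and it treats every plane as fully opaque, which real hardware with per-pixel alpha does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plane:
    name: str
    zpos: int  # stacking order; higher values are shown on top
    x: int
    y: int
    w: int
    h: int

    def covers(self, px: int, py: int) -> bool:
        """True if pixel (px, py) lies inside this plane's bounds."""
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


def topmost(planes: list[Plane], px: int, py: int) -> Optional[str]:
    """Name of the plane visible at (px, py), or None if no plane covers it."""
    hits = [p for p in planes if p.covers(px, py)]
    return max(hits, key=lambda p: p.zpos).name if hits else None


# Made-up layout: full-screen framebuffer, a video window, and a cursor.
planes = [
    Plane("framebuffer", 0, 0, 0, 1920, 1080),
    Plane("video", 1, 100, 100, 1280, 720),
    Plane("cursor", 2, 600, 400, 64, 64),
]
print(topmost(planes, 50, 50))    # framebuffer
print(topmost(planes, 200, 200))  # video
print(topmost(planes, 610, 410))  # cursor
```

The display engine effectively makes this decision per pixel while scanning out, which is why the application only has to supply the stacking order and rectangles.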
The big desktop GPUs seem to only have the standard framebuffer, a cursor plane, and a small number (<= 2) of overlay planes. It seems that the general consensus is that they tend to have such a ridiculous amount of horsepower that rendering everything into an output buffer and displaying that won't even kick the GPU out of its idle power state.
That being said, I had a few hours of fun hacking glxgears and glmark2 to render into the cursor plane on Wayland.