
There are hardware reasons for this even setting aside any software scaling inefficiency. For tasks that can use many threads, modern hardware trades per-thread performance for more overall throughput from a given amount of silicon.

When you max out parallelism, you're using 1) SMT hardware threads, which "split" a physical core so that (ideally) each runs at somewhat more than half the core's single-thread speed, and 2) the small "efficiency" cores on newer Intel and Apple chips. Also, a single-threaded run can feed a ton of watts to the one active core, since it doesn't have to share much of the power/cooling budget with the others, letting it boost to a higher clock rate.
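You can see the "split" directly by comparing logical CPUs to distinct physical cores. A rough sketch, assuming Linux sysfs (the `/sys/devices/system/cpu` paths are Linux-specific; the function returns None elsewhere):

```python
import os
from pathlib import Path

def core_topology():
    """Return (logical_cpus, physical_cores), or None if sysfs is unavailable.

    On an SMT machine, logical_cpus is typically 2x physical_cores:
    two hardware threads share each physical core.
    """
    logical = os.cpu_count()
    cpu_dirs = sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*"))
    cores = set()
    for d in cpu_dirs:
        topo = d / "topology"
        try:
            # A physical core is identified by (package, core_id).
            pkg = (topo / "physical_package_id").read_text().strip()
            core = (topo / "core_id").read_text().strip()
        except OSError:
            return None  # non-Linux, or topology not exposed
        cores.add((pkg, core))
    if not cores:
        return None
    return logical, len(cores)

if __name__ == "__main__":
    print(core_topology())
```

On a chip with both performance and efficiency cores, the ratio won't be a clean 2x, since efficiency cores usually don't have SMT siblings.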

All these tricks improve throughput, or you wouldn't see the wall-time reduction and chipmakers wouldn't ship them. But they also increase how long each thread takes to finish a unit of work in a heavily multithreaded run, which is why the total CPU time ends up higher than in a single-threaded run.
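You can measure the effect yourself by comparing wall time against summed CPU time for the same total work at different worker counts. A minimal sketch, assuming a Unix system (it uses `os.times()` to read the CPU time of reaped child processes, which reports zero on Windows); `burn` and the loop size are arbitrary stand-ins for real work:

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor

def burn(n):
    """CPU-bound busy work: a hypothetical unit of work."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def measure(workers, n=1_500_000):
    """Run `workers` copies of burn() in parallel; return (wall_s, cpu_s).

    Child CPU time comes from os.times() after the pool shuts down
    (Unix-only; processes must be reaped for it to be counted).
    """
    t0 = time.perf_counter()
    before = os.times()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(burn, [n] * workers))
    wall = time.perf_counter() - t0
    after = os.times()
    cpu = (after.children_user - before.children_user) + \
          (after.children_system - before.children_system)
    return wall, cpu

if __name__ == "__main__":
    for w in (1, os.cpu_count() or 2):
        wall, cpu = measure(w)
        print(f"{w} worker(s): wall {wall:.2f}s, total cpu {cpu:.2f}s")
```

With all cores loaded, you'd expect wall time per unit of work to drop while total CPU time rises, since each worker is running on a slower (shared or efficiency) execution resource.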



