The biggest speedup one can get on modern hardware (in a single-threaded context) comes from using the CPU caches effectively and, in certain cases, from vectorization. Clever use of these can give you 10-100x speedups.

This kind of optimization runs somewhat against the very point of a managed language: the programmer doesn't want memory layout details to leak into the design or the APIs. In the rare case it is needed, though, it can be done just as well with the escape hatches these languages provide (byte buffers; value types also get you quite far). But business logic seldom involves these scorching hot loops to begin with, so there may not even be anything to optimize in this manner.
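To make the "byte buffer" escape hatch concrete, here is a rough Java sketch of the same data in two layouts: one heap object per record, and the records packed back to back in a single `ByteBuffer`. The names (`LayoutSketch`, `Point`, `pack`, `sumXPacked`) are made up for illustration, and whether the packed version is actually faster depends on the JVM, the data size and the access pattern, so measure before committing to it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LayoutSketch {
    // Hypothetical record type: each instance is its own heap object,
    // so an array of them is really an array of references.
    public static final class Point {
        final int x;
        final int y;

        public Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Walks the reference array and dereferences every element.
    public static long sumXObjects(Point[] points) {
        long sum = 0;
        for (Point p : points) {
            sum += p.x;
        }
        return sum;
    }

    // Packed layout: x then y, 8 bytes per record, contiguous in memory.
    static final int RECORD_BYTES = 2 * Integer.BYTES;

    public static ByteBuffer pack(Point[] points) {
        ByteBuffer buf = ByteBuffer
                .allocateDirect(points.length * RECORD_BYTES)
                .order(ByteOrder.nativeOrder());
        for (Point p : points) {
            buf.putInt(p.x).putInt(p.y);
        }
        buf.flip(); // limit = bytes written, position = 0
        return buf;
    }

    // Reads only the x field of each record using absolute offsets.
    public static long sumXPacked(ByteBuffer buf) {
        long sum = 0;
        for (int off = 0; off < buf.limit(); off += RECORD_BYTES) {
            sum += buf.getInt(off);
        }
        return sum;
    }

    public static void main(String[] args) {
        Point[] pts = { new Point(1, 2), new Point(3, 4), new Point(5, 6) };
        System.out.println(sumXObjects(pts));
        System.out.println(sumXPacked(pack(pts)));
    }
}
```

Note how the packed layout leaks straight into the API (callers now deal in offsets and a buffer instead of `Point` objects), which is exactly the trade-off described above.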