It should be fairly easy to add instruction fusing, where the JIT recognizes often-used instruction pairs, combines their C code, and then lets the compiler optimize the combined code. Combining LOAD_CONST with the instruction following it, if that instruction pops the const from the stack, seems an easy win, for example.
In the interpreter, I don’t think it would reduce overhead much, if at all. You’d still have to recognize the two bytecodes, and your interpreter would spend additional time deciding, for most bytecode pairs, that it doesn’t know how to combine them.
With a compiler, that part is done once and, potentially, run zillions of times.
If fusing a certain pair would significantly improve the performance of most code, you'd just add that fused instruction to your bytecode and let the C compiler optimize the combined code in the interpreter. I have to assume CPython has already done that for all the low-hanging fruit.
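To make the fusing argument concrete, here's a toy stack VM in Python (hypothetical opcodes, not CPython's real bytecode) with a fused LOAD_CONST_ADD instruction baked into the instruction set, the way the comment above suggests. The win is that the fused opcode takes one trip through the dispatch loop instead of two and skips a push/pop round-trip:

```python
# Toy stack VM: hypothetical opcodes, not CPython's actual bytecode.
LOAD_CONST, BINARY_ADD, LOAD_CONST_ADD, RETURN = range(4)

def run(code, consts):
    stack, pc = [], 0
    while True:
        op, arg = code[pc], code[pc + 1]
        pc += 2
        if op == LOAD_CONST:
            stack.append(consts[arg])
        elif op == BINARY_ADD:
            b = stack.pop()
            stack[-1] += b
        elif op == LOAD_CONST_ADD:
            # Fused opcode: one dispatch instead of two, and the
            # constant is added in place with no push/pop round-trip.
            stack[-1] += consts[arg]
        elif op == RETURN:
            return stack.pop()

# Computing `x + 1` two ways: three dispatches unfused, two fused.
unfused = [LOAD_CONST, 0, LOAD_CONST, 1, BINARY_ADD, 0, RETURN, 0]
fused   = [LOAD_CONST, 0, LOAD_CONST_ADD, 1, RETURN, 0]
print(run(unfused, [41, 1]))  # 42
print(run(fused, [41, 1]))    # 42
```

Note that the fused opcode only pays off if the `if`/`elif` chain (or computed goto, in a real interpreter) is what's expensive, which is exactly the point being argued: in a heavily optimized interpreter, that dispatch cost is already small.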
In fact, for such a fused instruction to be optimized that way by a copy-and-patch JIT, it would need to exist as a new bytecode in the interpreter. A JIT that fuses instructions itself is no longer a copy-and-patch JIT.
A copy-and-patch JIT reduces interpretation overhead by making sure the branches in the executed machine code are the branches in the code to be interpreted, not branches in the interpreter.
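A loose Python analogy for that point (real copy-and-patch stitches together precompiled machine-code templates and patches operands into them, which Python can't express): "compile" the bytecode once into a flat list of closures with their operands baked in. At run time there is no per-instruction opcode test left; the branches that remain are the program's own, which is the overhead reduction described above:

```python
# Loose analogy only: closures stand in for the JIT's copied machine-code
# templates, and the baked-in constant stands in for operand patching.
def compile_expr(code, consts):
    steps = []
    for op, arg in code:
        if op == "LOAD_CONST":
            c = consts[arg]
            # Operand "patched in" at compile time via default argument.
            steps.append(lambda stack, c=c: stack.append(c))
        elif op == "BINARY_ADD":
            def add(stack):
                b = stack.pop()
                stack[-1] += b
            steps.append(add)
    return steps

def run_compiled(steps):
    stack = []
    for step in steps:  # straight-line execution: no opcode switch here
        step(stack)
    return stack.pop()

steps = compile_expr(
    [("LOAD_CONST", 0), ("LOAD_CONST", 1), ("BINARY_ADD", None)],
    [40, 2],
)
print(run_compiled(steps))  # 42
```

The opcode branches still exist, but they ran once in `compile_expr` rather than on every execution, which is the "done once, run zillions of times" trade-off mentioned earlier.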
That makes a huge difference for more naive interpreters, not so much for a heavily optimized threaded-code interpreter.
The 10% is great, and nothing to sneeze at for a first commit. But I'd like to see some realistic analysis of the next steps for improvement, because I'm skeptical that instruction fusing and the other things being hand-waved are it. Certainly not on a copy-and-patch JIT.
For context: I spent significant effort trying to add such instruction fusing to a simple WASM AOT compiler and got nowhere (the equivalent of constant loading was precisely one of the pairs). Only moving to a much smarter JIT (capable of looking at whole basic blocks of instructions) started making a difference.