I think most optimisation work is like that. Early effort with each technique can yield large gains. But the marginal gain of using any specific technique decreases over time.
For example, I've gotten a lot of speedups from essentially decreasing the number of malloc calls my programs make. It's often the case that ~80% of all allocations in a program come from just a few hotspots. Rewriting those hotspots to use a better data structure can yield big speed improvements, both because malloc & free are expensive calls and because the CPU hates chasing pointers. But there's usually only so much benefit in reducing allocations. At some point, it makes sense to just accept the remaining malloc calls.
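To make that concrete, here's a minimal arena (bump) allocator sketch in C. All the names here (`Arena`, `arena_alloc`, etc.) are mine, just for illustration - the point is that a whole group of objects can come from one malloc call and be released with one free, and they end up packed together in memory, which is friendlier to the cache.

```c
#include <stdlib.h>
#include <stddef.h>

/* Illustrative bump allocator: grab one big block up front,
 * then hand out slices of it instead of calling malloc per object. */
typedef struct {
    char  *buf;   /* the single backing allocation */
    size_t used;  /* bytes handed out so far */
    size_t cap;   /* total capacity */
} Arena;

static int arena_init(Arena *a, size_t cap) {
    a->buf = malloc(cap);          /* the only malloc call */
    a->used = 0;
    a->cap = cap;
    return a->buf != NULL;
}

static void *arena_alloc(Arena *a, size_t n) {
    n = (n + 7) & ~(size_t)7;      /* round up to 8-byte alignment */
    if (a->used + n > a->cap) return NULL;  /* out of space */
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

static void arena_free(Arena *a) {
    free(a->buf);                  /* one free releases everything */
}
```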
The reason for the diminishing returns is simple arithmetic. Let's say you reduce the number of allocations your program makes by 10x, and that yields a 45% performance improvement (10 seconds -> 5.5 seconds). You might think reducing allocations by another 10x would yield another 45% improvement - but that's just not how the math works out. The first pass already eliminated 90% of the allocation cost, so only 0.5 seconds of it remain. Another 10x reduction can only shave off 0.45 of those seconds, taking you from 5.5 seconds to 5.05 seconds - about an 8% improvement. That's probably not worth it, given the next 10x reduction in malloc calls will likely be much harder to achieve than the first.
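This is just Amdahl's law at work. If a fraction $p$ of the runtime goes to allocation and you speed that part up by a factor of $s$ (leaving everything else untouched), the new runtime is:

$$
T_{\text{new}} = T_{\text{old}} \left( (1 - p) + \frac{p}{s} \right)
$$

In the example above, $p = 0.5$ and $s = 10$, giving $10 \times (0.5 + 0.05) = 5.5$ seconds. Run the same 10x reduction again and $p$ has shrunk to $0.5 / 5.5 \approx 0.09$, which is why the second pass only buys ~8%.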
If you want another 50% perf improvement, you need to run that profiler again and look at where the new hotspots are. If the CPU is spending less time following pointers, it'll now be spending more time (proportionately) running linear code. Maybe this time the performance wins will be found using SIMD. Or by swapping to a different algorithm. Or multithreading. Or by making better use of caching. Or something else - who knows.
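As one illustrative sketch of what a SIMD win can look like (this example is mine, not from any particular program): summing floats four lanes at a time with SSE intrinsics instead of one at a time. The function name `sum_floats` is made up for the example.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sum an array of floats four at a time using SSE intrinsics.
 * A scalar loop handles any leftover elements at the end. */
float sum_floats(const float *xs, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(xs + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; i++)   /* scalar tail */
        sum += xs[i];
    return sum;
}
```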