Back in the day, an integer division took something like 46 clocks (original Pentium); now on Ice Lake it's just 12, with a reciprocal throughput of 6. Multiply that by the clock speed increase and a modern CPU can "do division" about 300-400 times faster than a Pentium could. Then multiply that by the number of cores available now versus just one core, and that increases to about 2000 times faster!
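Back-of-the-envelope, assuming a ~60 MHz Pentium and a ~3 GHz, 6-core Ice Lake (the clock and core counts here are my assumptions; the cycle counts are the ones above):

    /* rough sketch of the arithmetic: divisions per second, then the ratio */
    #include <stdio.h>

    int main(void) {
        double pentium = 60e6 / 46;   /* ~1.3M divisions/s at 46 cycles each */
        double icelake = 3.0e9 / 6;   /* ~500M/s per core at a throughput of 6 */
        printf("per core: ~%.0fx\n", icelake / pentium);       /* ~380x */
        printf("6 cores:  ~%.0fx\n", icelake / pentium * 6);   /* ~2300x */
        return 0;
    }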
I used to play 3D games on Pentium-based machines and thought of them as a "huge upgrade" from the 486, which in turn was a huge upgrade from the 286, and so on.
Now, people with Ice Lake CPUs in their laptops and servers complain that things are slow.
And things are slow as we waste all that processing power on running javascript one way or another. And everything requires a slow blocking connection to the mainframe. Nowadays the “always connected” mindset is really slowing us down.
Ok, fair enough, but the mindset is spreading: json (javascript) parsing is what caused GTA Online loading times to balloon, and I dread playing Call of Duty online because it wants to download and install dozens of GBs every time I launch it.
It wasn’t json parsing per se, but a buggy roll-your-own implementation that used a function (sscanf iirc) with surprisingly nontrivial complexity on a long string. Fun part is, if they had just outsourced that load to javascript and its JSON.parse, they’d never have hit that quadratic slowdown. Javascript is a nice target to blame, but it isn’t the problem. CPUs got hundreds of times faster; javascript only divides that by some factor N, which has stayed low and roughly constant for decades. Do you really believe that if browsers only supported MSVC++-based DLLs without scripting, sites would run faster? That would be naive.
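The shape of the problem looks roughly like this (a minimal sketch, not the actual game code): calling sscanf repeatedly on one big buffer. On common C libraries (glibc included) sscanf wraps the string in a stdio-like stream whose setup effectively does a strlen of the remaining input, so scanning n items out of an m-byte string costs O(n*m) instead of O(m):

    /* quadratic-by-accident parsing: sscanf in a loop over a single large string */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 100000;                 /* ~600 KB of "1,2,3,..." style text */
        char *buf = malloc(n * 8 + 1);
        size_t off = 0;
        for (size_t i = 0; i < n; i++)
            off += sprintf(buf + off, "%zu,", i);

        const char *p = buf;
        long long sum = 0;
        while (*p) {
            int val, consumed;
            if (sscanf(p, "%d,%n", &val, &consumed) != 1)  /* rescans the rest of p */
                break;
            sum += val;
            p += consumed;
        }
        printf("%lld\n", sum);
        free(buf);
        return 0;
    }

Roughly: each sscanf call rescans the remaining buffer, so doubling the input size quadruples the parse time - which is how a few MB of json turned into minutes of loading.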
However, there is definitely still less intrinsic optimisation from a dev perspective, I think - people will iterate over the same array multiple times in different places rather than do it once.
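Toy illustration of what I mean (made-up example; both versions are O(n), but the fused one walks the data once):

    #include <stdio.h>

    int main(void) {
        int data[] = {7, 3, 9, 1, 4, 8};
        int n = sizeof data / sizeof data[0];

        /* pass 1, in one part of the codebase: find the minimum */
        int min = data[0];
        for (int i = 1; i < n; i++)
            if (data[i] < min) min = data[i];

        /* pass 2, somewhere else entirely: sum everything */
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += data[i];

        /* fused version: one walk over the array computes both */
        int min2 = data[0], sum2 = 0;
        for (int i = 0; i < n; i++) {
            if (data[i] < min2) min2 = data[i];
            sum2 += data[i];
        }
        printf("%d %d vs %d %d\n", min, sum, min2, sum2);
        return 0;
    }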
I guess our industry has decided moving faster is better than running faster for a lot of stuff.
Part of the reason computers seem slower than they did is that most programs (and most programmers) only use one of those cores. Most of the reason, though, is that programmers buy new computers, and also that programmers only optimise code that’s slow on their computer.
Those instruction latencies are in addition to the pipeline-created latency. (They are actually, specifically, the number of cycles added to the dependency chain.) The mult port has a small pipeline of its own, with 3 stages (that's why it's 3 cycles of latency). Intel has a 5 stage pipeline so the minimum latency is going to be 8 for just those two things.
Sorry, I dropped all the 1s in that message when I typed it (laptop keyboard is a little sketchy right now). Those should have been 15 and 18. I think the recent Intel microarchs take 14, plus however long uop decoding takes (at minimum 1), so 15-20 or close to that.
> The dependency chain length is what is normally intended as instruction latency.
Yes, the way I read the original post and others was that you actually get your result back in 3 cycles, which isn't correct. It doesn't get committed for a while (but following instructions can use the result even if it hasn't been committed yet). You're not getting a result in less than 20 cycles, basically.
> Sorry, I dropped all the 1s in that message when I typed it
It makes sense now! :D
> You're not getting a result in less than 20 cycles basically
But the end of the pipeline is an arbitrary point. It will take a few more cycles to get to L1 (when it makes it out of the write buffer), a few tens more to traverse L2 and L3, and hundreds to get to RAM (if it gets there at all) - see the rough sketch below. If it has to get to a human it will take thousands of cycles to get through the various buses to a screen or similar.
The only reasonably useful latency measure is what it takes for the value to be ready to be consumed by the next instruction, which is indeed 3-5 cycles depending on the specific microarchitecture.
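To put rough numbers on the L1/L2/L3/RAM point above, a quick-and-dirty pointer-chase sketch (not a proper benchmark; the working-set sizes are just picked to land roughly in each level on a typical part). Every load depends on the previous one, so you see the full latency of whichever level the data lives in:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* walk a randomly shuffled cycle of indices; report ns per dependent load */
    static double chase(size_t n_elems, size_t steps) {
        size_t *next = malloc(n_elems * sizeof *next);
        if (!next) return 0;
        for (size_t i = 0; i < n_elems; i++) next[i] = i;
        for (size_t i = n_elems - 1; i > 0; i--) {   /* Sattolo shuffle: one big cycle */
            size_t j = rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec a, b;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (size_t s = 0; s < steps; s++) p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &b);
        volatile size_t sink = p; (void)sink;
        free(next);
        double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
        return secs / steps * 1e9;
    }

    int main(void) {
        size_t kib[] = {16, 256, 8 * 1024, 256 * 1024};   /* ~L1, ~L2, ~L3, ~RAM */
        for (int i = 0; i < 4; i++)
            printf("%7zu KiB working set: %.1f ns per load\n",
                   kib[i], chase(kib[i] * 1024 / sizeof(size_t), 20000000));
        return 0;
    }

Exact numbers vary wildly from machine to machine, but the jump at each level is hard to miss.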
> which is indeed 3-5 cycles depending on the specific microarchitecture
I assume you are talking about from fetch to hitting the store buffer? That would be the absolute minimum time before the data could be seen elsewhere, I would think. It can still potentially be rolled back at that point, and it would be way too fast to sustain as a reciprocal throughput, but for a single-instruction burst, I'm not sure. So much happens at the same time. An L1 read hit will cost you 4 cycles minimum, but all but 1 of that is hidden. You can't avoid the mult cost of 3 or the add's 1. The decoding, uop cache hit, reservation, etc. will cost a few. I have no idea.
If you know of anything describing it in such detail, I would be completely curious.
https://www.agner.org/optimize/instruction_tables.pdf
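Agner's tables list both latency and reciprocal throughput; you can see the difference discussed above for yourself with something like this (a rough sketch, not Agner's methodology): a chain of dependent multiplies pays the full ~3-cycle imul latency each time, while independent ones overlap and approach the ~1-per-cycle throughput:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static double secs(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        const int64_t N = 400000000;
        const uint64_t K = 0x9E3779B97F4A7C15ull;   /* arbitrary odd constant */
        volatile uint64_t sink;

        uint64_t x = 1;
        double t = secs();
        for (int64_t i = 0; i < N; i++)
            x *= K;                               /* dependent chain: latency-bound */
        double dep = secs() - t;
        sink = x;

        uint64_t a = 1, b = 2, c = 3, d = 4;
        t = secs();
        for (int64_t i = 0; i < N; i += 4) {
            a *= K; b *= K; c *= K; d *= K;       /* independent: throughput-bound */
        }
        double ind = secs() - t;
        sink = a ^ b ^ c ^ d;
        (void)sink;

        printf("dependent:   %.2f ns per multiply\n", dep / N * 1e9);
        printf("independent: %.2f ns per multiply\n", ind / N * 1e9);
        return 0;
    }

Build it with -O2; on recent Intel parts you should see roughly a 3x gap between the two loops, modulo turbo and whatever else is running.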