
> If you hold up a sign with, say, a multiplication, a CPU will produce the result before light reaches a person a few metres away.

The latency on multiplication (register input to register output) is 5 clock ticks, and many computers run at 4 GHz or 5 GHz these days.

5 clock cycles at 5 GHz is 1 ns, which is 30 centimeters of light travel.

If we include an L1 cache read and an L1 cache write, IIRC it's 4 clock cycles for the read + 4 more for the write. So 13 clock ticks, which is almost 80 centimeters.

------------

A DDR4 read and L1 cache write will add about 50 nanoseconds (~250 cycles) of delay, and we're up to nearly 16 meters.

And now you know why caches exist: otherwise computers would spend all day waiting on DDR4 RAM rather than doing work.
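
Sketching that arithmetic out (a minimal back-of-the-envelope calculation, assuming a 5 GHz clock and light covering roughly 30 cm per nanosecond; the cycle counts are the rough figures from this thread, not exact numbers for any particular CPU):

    /* Back-of-the-envelope light-travel distances for the latencies above. */
    #include <stdio.h>

    int main(void) {
        const double ghz = 5.0;          /* clock frequency: cycles per nanosecond */
        const double cm_per_ns = 29.98;  /* speed of light in cm per nanosecond */

        const struct { const char *what; double cycles; } ops[] = {
            { "integer multiply (reg -> reg)",            5.0 },
            { "multiply + L1 read + L1 write",           13.0 },
            { "multiply + DDR4 read (~50 ns) + L1 write", 13.0 + 50.0 * ghz },
        };

        for (int i = 0; i < 3; i++) {
            double ns = ops[i].cycles / ghz;   /* cycles -> nanoseconds at 5 GHz */
            printf("%-45s %6.1f ns = %7.1f cm of light travel\n",
                   ops[i].what, ns, ns * cm_per_ns);
        }
        return 0;
    }

That prints roughly 30 cm, 78 cm, and 15.8 m for the three cases.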




> The latency on multiplication (register input to register output) is 5-clock ticks

3

https://www.agner.org/optimize/instruction_tables.pdf


Back in the day, an integer division took something like 46 clocks (original Pentium); on Ice Lake it's just 12, with a reciprocal throughput of 6. Multiply that by the clock speed increase and a modern CPU can "do division" roughly 300-400 times faster than a Pentium could. Then multiply that by the number of cores available now versus the single core back then, and it grows to about 2000 times faster!

I used to play 3D games on Pentium-based machines and thought of them as a "huge upgrade" from the 486, which in turn was a huge upgrade from the 286, etc...

Now, people with Ice Lake CPUs in their laptops and servers complain that things are slow.


And things are slow because we waste all that processing power on running javascript one way or another, and everything requires a slow, blocking connection to the mainframe. Nowadays the “always connected” mindset is really slowing us down.


That explains why Electron apps and Web pages are slow but the post you're replying to is about games...


Ok, fair enough, but the mindset is spreading: json (javascript) parsing is what caused GTA Online loading times to balloon, and I dread playing Call of Duty online as it wants to download and install dozens of GBs every time I launch it.


It wasn't json parsing per se, but a buggy roll-your-own implementation that used a function (sscanf, iirc) with surprisingly nontrivial complexity on a long string. The fun part is, if they had just outsourced that load to javascript and its JSON.parse, they'd never have hit that quadratic slowdown. Javascript is a nice target to blame, but it isn't the problem. CPUs got hundreds of times faster; javascript only divides that by a factor N, which has stayed low and roughly constant for decades. Do you really believe that if browsers only supported MSVC++-based DLLs without scripting, sites would run faster? That would be naive.
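
For a flavour of the failure mode, here is a minimal sketch of that pattern, assuming the widely reported explanation (each sscanf call on a long buffer effectively does a strlen of the whole remaining string on common libc implementations); the actual game code isn't public, so the names and data shape here are illustrative:

    /* Parse a long comma-separated list of integers by repeatedly calling
       sscanf on an advancing pointer. On libcs where sscanf wraps the buffer
       using strlen (glibc reportedly does), every call rescans the whole
       remaining string, so parsing N values costs O(N^2) character reads. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        enum { N = 100000 };
        char *buf = malloc((size_t)N * 12 + 1);
        if (!buf) return 1;
        size_t len = 0;
        for (int i = 0; i < N; i++)
            len += (size_t)sprintf(buf + len, "%d,", i);   /* "0,1,2,3,..." */

        const char *p = buf;
        long long sum = 0;
        int value, consumed;
        /* The slow pattern: an O(remaining length) scan hidden in every call. */
        while (sscanf(p, "%d,%n", &value, &consumed) == 1) {
            sum += value;
            p += consumed;
        }

        printf("parsed sum = %lld\n", sum);
        free(buf);
        return 0;
    }

A single forward pass with strtol (or anything that doesn't re-measure the remaining string on every token) makes the same job linear.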


You may be thinking of GTA Online, but the point is still valid.


Games do use JS; some UIs actually render using a webview (Unity etc. is capable of this).


I didn't see anyone complaining about games in the parent comments.


Afaik much JS these days is optimised by the engine; see things like V8's TurboFan JIT: https://v8.dev/blog/turbofan-jit

However, there is definitely still less optimisation from the dev's perspective, I think - people will iterate over the same array multiple times in different places rather than doing it once.
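
As a tiny illustration of that (an invented example, in C rather than JS for brevity): three separate passes over an array versus one fused pass doing the same work.

    #include <stddef.h>

    typedef struct { double sum, min, max; } stats_t;

    /* Three passes: easy to write piecemeal in different places, but the
       array is streamed through the cache three times. Assumes n > 0. */
    stats_t stats_three_passes(const double *a, size_t n) {
        stats_t s = { 0.0, a[0], a[0] };
        for (size_t i = 0; i < n; i++) s.sum += a[i];
        for (size_t i = 0; i < n; i++) if (a[i] < s.min) s.min = a[i];
        for (size_t i = 0; i < n; i++) if (a[i] > s.max) s.max = a[i];
        return s;
    }

    /* One fused pass: the same results, but each element is loaded once. */
    stats_t stats_one_pass(const double *a, size_t n) {
        stats_t s = { 0.0, a[0], a[0] };
        for (size_t i = 0; i < n; i++) {
            s.sum += a[i];
            if (a[i] < s.min) s.min = a[i];
            if (a[i] > s.max) s.max = a[i];
        }
        return s;
    }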

I guess our industry has decided moving faster is better than running faster for a lot of stuff.


Part of the reason computers seem slower than they did is that most programs (and most programmers) only use one of those cores. Most of the reason, though, is that programmers buy new computers, and also that programmers only optimise code that’s slow on their computer.


Those instruction latencies are in addition to the pipeline-created latency. (They are actually the number of cycles added to the dependency chain, specifically.) The mult port has a small pipeline of its own with 3 stages (that's why the latency is 3 cycles). Intel has a 5 stage pipeline, so the minimum latency is going to be 8 for just those two things.


I don't understand what you are trying to say. The dependency chain length is what is normally meant by instruction latency.

Also the pipeline length is certainly not 5 stages but more like 20-30.


Sorry, I dropped all the 1s in that message when I typed it (laptop keyboard is a little sketchy right now). Those should have been 15 and 18. I think the recent Intel microarchs take 14 cycles plus however long uop decoding takes, which is a minimum of 1, so 15-20 or close to that.

> The dependency chain length is what is normally meant by instruction latency.

Yes, the way I read the original post and others was that you actually get your result back in 3 cycles, which isn't correct. It doesn't get committed for a while (though following instructions can use the result even if it hasn't been committed yet). You're not getting a result in less than 20 cycles basically.


> Sorry, I dropped all the 1s in that message when i typed it

It makes sense now! :D

> You're not getting a result in less than 20 cycles basically

But the end of the pipeline is an arbitrary point. It will take a few more cycles to get to L1 (when it makes it out of the write buffer), a few tens more to traverse L2 and L3, and hundreds to get to RAM (if it gets there at all). If it has to get to a human, it will take thousands of cycles to get through the various buses to a screen or similar.

The only reasonably useful latency measure is what it takes for the value to be ready to be consumed by the next instruction, which is indeed 3-5 cycles depending on the specific microarchitecture.


> which is indeed 3-5 cycles depending on the specific microarchitecture

I assume you are talking about the time from fetch to hitting the store buffer? That would be the absolute minimum time before the data could be seen elsewhere, I would think. It can still potentially be rolled back at that point, and that rate would be way too fast to sustain (higher than the reciprocal throughput allows), but for a single-instruction burst, I'm not sure. So much happens at the same time. An L1 read hit will cost you 4 cycles minimum, but all but 1 of that is hidden. You can't avoid the multiply cost of 3 or the add cost of 1. The decoding, the uop cache hit, reservation, etc. will cost a few. I have no idea.

If you know of anything describing it in such detail, I would be completely curious.


This reminds me of that “todo” I wrote for myself a long time ago. These days processors come with bigger L1, L2, and L3 caches. Would it be possible for a program that works on a tiny bit of data (a few KB) to load it all up in the cache and provide ultimate response times?!

Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?


> Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?

I'm probably the worst person to explain this.

Long long ago, I took a parallel programming class in grad school.

It turns out the conventional way to do matrix multiplication results in plenty of cache misses.

However, if you carefully tweak the order of the loops and do certain minor modifications — I forget the details — you could substantially increase the cache hits and make matrix multiplication go noticeably faster on benchmarks.

Some random details that may be relevant:

* When the processor loads a single number M[x][y], it sort of loads in the adjacent numbers as well. You need to take advantage of this.

* Something about row-major vs. column-major array layout is an important detail.

What I'm trying to say is, it is possible to indirectly optimize cache hits with careful hand tweaking. I don't know if there's a general automagic way to do this though.

This probably wasn't very useful, but I'm just putting it out there. Maybe more knowledgeable folks can explain this better.
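
A rough sketch of the loop-order trick being described (one common version of it, not necessarily the exact variant from that class): with row-major C arrays, the naive order walks B column-wise in the inner loop, while the reordered version sweeps whole rows of B and C, so each loaded cache line is fully used.

    #include <stddef.h>

    #define N 512
    static double A[N][N], B[N][N], C[N][N];

    /* Naive i-j-k order: the inner loop strides down a column of B,
       touching a new cache line on almost every iteration. */
    void matmul_ijk(void) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double acc = 0.0;
                for (size_t k = 0; k < N; k++)
                    acc += A[i][k] * B[k][j];
                C[i][j] = acc;
            }
    }

    /* Reordered i-k-j: for a fixed A[i][k], sweep a whole row of B and C.
       Same arithmetic, far better spatial locality on row-major data. */
    void matmul_ikj(void) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                C[i][j] = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t k = 0; k < N; k++) {
                double a = A[i][k];
                for (size_t j = 0; j < N; j++)
                    C[i][j] += a * B[k][j];
            }
    }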


I'm guessing you are thinking of cache lines, where CPUs read/write data between memory and cache in 64-byte chunks.

As you said, being aware of this lets you optimize away cache misses by controlling the memory access pattern.
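
A small sketch of that point, assuming the typical 64-byte line: both functions below add up the same array, but the second walks it in line-sized strides, so for arrays much larger than the cache nearly every load misses instead of one miss per 16 loads.

    #include <stdint.h>
    #include <stddef.h>

    #define LINE_INTS (64 / sizeof(int32_t))   /* 16 int32_t per 64-byte line */

    /* Stride-1 walk: one cache miss brings in 16 useful values. */
    int64_t sum_sequential(const int32_t *a, size_t n) {
        int64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Same elements, visited one cache line apart: each load lands on a
       different line, so (for large n) almost every access is a miss. */
    int64_t sum_strided(const int32_t *a, size_t n) {
        int64_t s = 0;
        for (size_t off = 0; off < LINE_INTS; off++)
            for (size_t i = off; i < n; i += LINE_INTS)
                s += a[i];
        return s;
    }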


There is a data access strategy called cache-oblivious algorithms, which aims to exploit this property without knowing the actual cache size.

I used that approach once on a batch job that read two multi-megabyte files to produce a multi-gigabyte output file. It gave a massive speedup on a 32-bit Intel machine.

https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
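
For a flavour of the idea in code, here is a minimal cache-oblivious sketch (a recursive matrix transpose, not the file-processing job above): it keeps halving the problem until the block is tiny, so the working set ends up fitting whatever caches exist without the code ever naming a cache size.

    #include <stddef.h>

    #define N 512
    static double in_m[N][N], out_m[N][N];

    /* Transpose the half-open block [r0, r1) x [c0, c1) by recursively
       splitting the longer dimension until the block is small. */
    static void transpose_rec(size_t r0, size_t r1, size_t c0, size_t c1) {
        size_t dr = r1 - r0, dc = c1 - c0;
        if (dr <= 16 && dc <= 16) {                     /* tiny block: just copy */
            for (size_t r = r0; r < r1; r++)
                for (size_t c = c0; c < c1; c++)
                    out_m[c][r] = in_m[r][c];
        } else if (dr >= dc) {                          /* split the rows */
            transpose_rec(r0, r0 + dr / 2, c0, c1);
            transpose_rec(r0 + dr / 2, r1, c0, c1);
        } else {                                        /* split the columns */
            transpose_rec(r0, r1, c0, c0 + dc / 2);
            transpose_rec(r0, r1, c0 + dc / 2, c1);
        }
    }

    void transpose(void) { transpose_rec(0, N, 0, N); }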


> Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?

Not for general-purpose programs, because L1 caches change so much from year to year that there is no point.

For embedded real-time processors, yes. For GPUs, yes. (OpenCL __local, CUDA __shared__).

This is because Microsoft's DirectX platform guarantees 32 kB or so of __shared__ / tiled memory, so all GPU vendors who want DirectX 11 certification are guaranteed to provide that cache-like memory, and programmers can rely upon it. When DirectX 12 or DirectX 13 comes about, the new minimum specifications are published and all graphics programmers can then take advantage of them.

-------

No sane Linux/Windows programmer, however, would want these kinds of guarantees for normal CPU programs, outside of very strict realtime settings (at which point you can rely on the hardware being constant). Linux/Windows are designed as general-purpose OSes.

DirectX 9 / 10 / 11 / 12, however, is willing to tie itself to the "GPUs of the time" and includes such specifications.


I don't think you can generally control the cache with such granularity, since modern processors do all sorts of instruction-level parallelism and cache-coherency voodoo.


On CPUs you can't really force data to stay in the cache, but if you access it frequently and there is not too much load, it will stay there anyway.

Some architectures (e.g. GPUs) provide local "scratchpad" memories instead of (or in addition to) caches. These are separate, uninitialized, addressable memory regions with access times similar to an L1/L2 cache.


Explicit control over the fastest memory is what GPU local storage or the PlayStation 3 Cell SPUs allow/require.

For x86_64 there are cache hints, but no pinning/reserving of parts of the caches (as far as I know).

I wonder if the Apple M1 or M2 CPUs, with their unified CPU/GPU memory, have anything like pinning or explicit cache control?


If the data is contiguous in memory and frequently accessed it will almost certainly make its way into L1 cache and be there for the life of the program.

If the data is not contiguous it could make the CPU's life much harder.

There's also the matter of program size (the number of instructions in the actual program) and whether the program does anything that forces it to go to lower cache levels or RAM.

There are intrinsics for software prefetching, such as _mm_prefetch, but it is difficult to use them in a way that actually increases your performance.
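
For reference, a hedged sketch of what using that intrinsic looks like (it lives in <xmmintrin.h>). The prefetch distance here is a guess, and for a simple linear scan like this the hardware prefetcher usually does the job on its own:

    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    double sum_with_prefetch(const double *a, size_t n) {
        const size_t ahead = 16;   /* how many elements ahead to hint; tune per workload */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + ahead < n)
                _mm_prefetch((const char *)&a[i + ahead], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }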


One of the claimed advantages of the array programming language "K" is that the interpreter is small enough for its hot paths to stay in the CPU cache. It's hard to Google, but the claims come from people/places like this thread: https://news.ycombinator.com/item?id=15908394



