A slightly more elaborate answer than the sibling post, to drive home how much happens on a simple read that is not cached:
- request to L1D cache, misses
- request to L2 cache, misses
- a request packet is put on the mesh network to reach L3, likely misses
- L3 asks the memory controller to load the line from memory, the load is put in a queue
- DRAM access, latency ~100-150
- the data travels back through the above chain in reverse
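To make the chain above concrete, here is a rough tally of what each stop adds. All per-level costs are made-up illustrative numbers, not measurements, and the DRAM figure reads the ~100-150 above as nanoseconds, which is my assumption. Each cost is meant as the round trip for that level, so the "reverse" leg is folded in.

```python
# Assumed, illustrative round-trip costs in nanoseconds for each level a load
# visits. Real values depend on the CPU, clock speed and memory configuration.
LEVELS = [
    ("L1D", 1.0),
    ("L2", 4.0),
    ("L3", 20.0),     # includes the mesh hop (assumed)
    ("DRAM", 110.0),  # includes controller queueing; ~100-150 read as ns (assumed)
]


def load_latency_ns(served_by: str) -> float:
    """Sum the cost of every level visited until `served_by` supplies the data."""
    total = 0.0
    for name, cost in LEVELS:
        total += cost
        if name == served_by:
            return total
    raise ValueError(f"unknown level: {served_by!r}")


if __name__ == "__main__":
    for name, _ in LEVELS:
        print(f"served by {name:>4}: {load_latency_ns(name):6.1f} ns")
```

With these assumed numbers a load served from DRAM costs over a hundred times what an L1D hit costs, and that is before any TLB or coherency trouble.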
The chain above is the best case for a miss: there could also be a DTLB miss on the address (which is why huge pages are crucial in the paper), or dirty cache lines in other cores that trigger the coherency mechanism.
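A quick bit of arithmetic on why page size matters for the DTLB: the amount of memory a TLB can cover without missing ("TLB reach") is just entries times page size. The entry count below is a hypothetical round number, not the figure for any particular CPU; 4 KiB and 2 MiB are the usual x86-64 base and huge page sizes.

```python
KIB = 1024
MIB = 1024 * KIB

# Hypothetical DTLB size, chosen only to make the arithmetic easy to follow.
ASSUMED_DTLB_ENTRIES = 64


def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Bytes of address space the TLB can map at once without a miss."""
    return entries * page_size


if __name__ == "__main__":
    small = tlb_reach_bytes(ASSUMED_DTLB_ENTRIES, 4 * KIB)
    huge = tlb_reach_bytes(ASSUMED_DTLB_ENTRIES, 2 * MIB)
    print(f"4 KiB pages: {small // KIB} KiB reach")
    print(f"2 MiB pages: {huge // MIB} MiB reach")
```

Same number of entries, 512 times the coverage, so a working set that thrashes the DTLB with small pages can fit comfortably with huge ones.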