It has lots more besides AVX-512, so I'd consider waiting. These include: lower power, HEVC in hardware, JPEG in hardware, some VP9 in hardware, a better GPU, faster AES-NI, and Thunderbolt 3.
If Apple supports it in software, the chip can also jump between power-saving states much faster.
It's actually even more restricted than that. The E3-1200 V5 series Skylake Xeons that came out last month are in the "server" line, but do not support AVX-512. I think it's presumed that none of the Greenlow generation will support it, but that the Purley generation that follows will.
So far, my limited experience with Skylake would say that performance gains will be small. If you are buying a Skylake laptop, it should probably be for the increase in battery life rather than hopes of significantly higher speeds.
I think I figured out what was happening in the question I posted to Agner. I submitted a reply, but it's still in moderation. For me, it finally explains a lot of the lower-than-expected performance I've seen with AVX/AVX2. In case others are interested, here it is.
-----
Agner wrote:
You are always limited by cache ways, read/write buffers, faulty
prefetching, suboptimal reordering, etc.
Yes, although in my example I'm considering the much simpler case
where there are two reads but no writes, and all data is already in
L1. So although problematic in the real world, these shouldn't be a
factor here. In fact, I see the same maximum speed if I read the same
4 vectors over and over rather than striding over all the data. I've
refined my example, though, and think I now understand what's
happening. The problem isn't a bank conflict; rather, it's a slowdown
due to unaligned access. I don't think I've seen this discussed
before.
Contrary to my previous understanding, alignment makes a big
difference in the speed at which vectors are read from L1 into registers.
If your data is 16B aligned rather than 32B aligned, a sequential read
from L1 is no faster with 256-bit YMM reads than it is with 128-bit
XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot
achieve two 32B loads per cycle if the underlying data is not 32B
aligned. If the data is 32B aligned, you still can't quite sustain 64
B/cycle of loads with either, but you can get to about 54 B/cycle with
both.
What this says is that unless your loads are 32B aligned, you are
limited to about 40B loaded per cycle regardless of method. If you are
sequentially loading non-32B aligned data from L1, the speeds for 16B
loads and 32B loads are identical, and limited to less than 32B per
cycle. All alignments not shown were the same as 8B alignment.
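For anyone who wants to play with this, here's a minimal sketch of the kind of L1 load-throughput test I'm describing. The buffer size, repeat count, and rdtsc-based timing are choices I've made for illustration rather than the exact harness, and rdtsc counts reference cycles, so treat the numbers as approximate unless the clock is pinned. Compile with something like -O2 -mavx2 on gcc or clang.

    #include <immintrin.h>
    #include <x86intrin.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N    1024        /* 4 KB of floats: comfortably inside L1 */
    #define REPS 100000

    int main(void) {
        /* over-allocate so the starting pointer can be deliberately misaligned */
        float *base = aligned_alloc(64, (N + 16) * sizeof(float));
        for (int i = 0; i < N + 16; i++) base[i] = 1.0f;

        const int offs[] = {0, 4, 2};   /* 32B-, 16B-, and 8B-aligned starts */
        for (int k = 0; k < 3; k++) {
            float *p = base + offs[k];
            /* eight independent accumulators so the adds never limit load throughput */
            __m256 a0 = _mm256_setzero_ps(), a1 = a0, a2 = a0, a3 = a0,
                   a4 = a0, a5 = a0, a6 = a0, a7 = a0;
            unsigned long long t0 = __rdtsc();
            for (int r = 0; r < REPS; r++)
                for (int i = 0; i < N; i += 64) {    /* eight 32B loads per iteration */
                    a0 = _mm256_add_ps(a0, _mm256_loadu_ps(p + i));
                    a1 = _mm256_add_ps(a1, _mm256_loadu_ps(p + i + 8));
                    a2 = _mm256_add_ps(a2, _mm256_loadu_ps(p + i + 16));
                    a3 = _mm256_add_ps(a3, _mm256_loadu_ps(p + i + 24));
                    a4 = _mm256_add_ps(a4, _mm256_loadu_ps(p + i + 32));
                    a5 = _mm256_add_ps(a5, _mm256_loadu_ps(p + i + 40));
                    a6 = _mm256_add_ps(a6, _mm256_loadu_ps(p + i + 48));
                    a7 = _mm256_add_ps(a7, _mm256_loadu_ps(p + i + 56));
                }
            unsigned long long t1 = __rdtsc();
            /* consume the accumulators so the loads aren't optimized away */
            float sink[8];
            _mm256_storeu_ps(sink, _mm256_add_ps(
                _mm256_add_ps(_mm256_add_ps(a0, a1), _mm256_add_ps(a2, a3)),
                _mm256_add_ps(_mm256_add_ps(a4, a5), _mm256_add_ps(a6, a7))));
            printf("offset %2dB: %.1f B/cycle (sink %f)\n", offs[k] * 4,
                   (double)REPS * N * sizeof(float) / (double)(t1 - t0), sink[0]);
        }
        free(base);
        return 0;
    }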
Loading in a non-sequential order is about 20% faster for unaligned
XMM and unaligned YMM loads. It's possible there is a faster order
than I have found so far. Aligned loads are the same speed
regardless of order. Maximum speed for aligned XMM loads is about 30
B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.
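As one example of a non-sequential order, the inner loop of the sketch above can be changed to interleave loads from the front and back halves of the buffer. I can't pin down the exact order that produced the ~20% gain, so treat the pattern below as illustrative only; it reads the same total number of bytes per pass.

    /* variant of the inner loop above: alternate between the two halves of the
     * buffer instead of streaming straight through it */
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N / 2; i += 32) {
            a0 = _mm256_add_ps(a0, _mm256_loadu_ps(p + i));
            a1 = _mm256_add_ps(a1, _mm256_loadu_ps(p + N / 2 + i));
            a2 = _mm256_add_ps(a2, _mm256_loadu_ps(p + i + 8));
            a3 = _mm256_add_ps(a3, _mm256_loadu_ps(p + N / 2 + i + 8));
            a4 = _mm256_add_ps(a4, _mm256_loadu_ps(p + i + 16));
            a5 = _mm256_add_ps(a5, _mm256_loadu_ps(p + N / 2 + i + 16));
            a6 = _mm256_add_ps(a6, _mm256_loadu_ps(p + i + 24));
            a7 = _mm256_add_ps(a7, _mm256_loadu_ps(p + N / 2 + i + 24));
        }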
At L2 sizes, the effect still exists, but is less extreme. XMM loads
are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell,
YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are
24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at
27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake
are almost the same as aligned loads (26 B/cycle), while non-sequential
loads are much slower (17 B/cycle).
At L3 sizes, alignment is barely a factor. On Haswell, all loads are
limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13
B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.
Coming from memory, XMM and YMM loads on Haswell are the same
regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads
are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with
little dependence on alignment. It's possible that prefetch can
improve these speeds slightly.
Agner writes:
The write operations may sometimes use port 2 or 3 for address
calculation, where the maximum throughput requires that they use port
7.
I don't recall if you mention it in your manuals, but I presume you
are aware that Port 7 on Haswell and Skylake is only capable of
"simple" address calculations? Thus sustaining 2 loads and a store is
only possible if the store address is [const + base] form rather than
[const + index*scale + base]. And as you point out, even if you do
this, it can still be difficult to force the processor to use only
Port 7 for the store address.
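To make the addressing-mode point concrete, here's a rough C sketch. The function names are mine, and whether the compiler actually emits the indexed versus base-only store addressing suggested in the comments has to be verified in the generated asm.

    #include <stddef.h>

    /* The natural store address here is [dst + i*4], i.e. base + index*scale,
     * whose address calculation cannot issue on port 7. */
    void add_indexed(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Bumping the pointers keeps the store address in the simple [dst] form,
     * which is port-7 eligible, so 2 loads + 1 store per cycle is at least possible. */
    void add_pointer_bump(float *dst, const float *a, const float *b, size_t n) {
        const float *end = a + n;
        while (a < end)
            *dst++ = *a++ + *b++;
    }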
When dealing with unaligned loads, keep in mind that the penalty for a page-crossing load is huge pre-Skylake (~100 cycles); does your test set include page boundary crossings?
I believe there's still a throughput penalty for cacheline-crossing loads as well, though it's quite modest compared to pre-Nehalem (where each cacheline crossing cost you 20 cycles!)
Looking now, Intel's guide says that Skylake reduces the cross-page load penalty from 100 cycles to 5 cycles. I tried with and without crossing page boundaries, although this wasn't my intent. In one version I repeatedly loaded 16KB of floats, and in the other I reloaded the same 4 floats. I saw minimal difference between these two on both Haswell and Skylake. But since the penalty is increased latency, it's possible that the "load and throw away the result" approach doesn't expose this issue.
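If anyone wants to deliberately provoke the page-crossing case, something like the following works. It's a sketch; the allocation size and the 16-byte back-off from the 4 KB boundary are arbitrary choices of mine, and it assumes 4 KB pages.

    #include <immintrin.h>
    #include <stdlib.h>

    /* Place a 32B load so it straddles a 4 KB boundary: with 4 KB pages, the
     * 4096-byte offset inside a page-aligned allocation is a real page boundary.
     * The load starts 16 bytes before it and ends 16 bytes past it. */
    float straddling_load(void) {
        float *p = aligned_alloc(4096, 8192);
        for (int i = 0; i < 2048; i++) p[i] = 1.0f;
        __m256 v = _mm256_loadu_ps((float *)((char *)p + 4096 - 16));
        float out[8];
        _mm256_storeu_ps(out, v);
        free(p);
        return out[0];
    }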
I realize that this is another topic, although in a somewhat similar context, so I thought I might just ask: have there been any advances in reducing page-walk latency?
I'm thinking of the virtual address translation costs having an impact on the run times of common algorithms, e.g., as demonstrated in the following work by Jurkiewicz & Mehlhorn: http://arxiv.org/abs/1212.0703
Admittedly, it focuses specifically on one aspect (the Double Address Translation on Virtual Machines issue in the Jurkiewicz & Mehlhorn context).
What I'm wondering about is: Has there been any progress on that on the "practical implementation" side, in the recent/coming Intel (or other, for that matter) CPUs?
I haven't read these papers (thanks for the links!) but there has been one major recent improvement. Starting with Broadwell (and I presume continuing with Skylake), the CPU can now handle two page misses in parallel: http://www.anandtech.com/show/8355/intel-broadwell-architect...
The other interesting thing is that page walks themselves are actually not very expensive: something on the order of 10 cycles. They only become painfully expensive when the page tables are too large to fit in cache and spill into memory, necessitating a load from memory just to read the page-table entry. So improvements in memory (and cache) latency will have a strong positive effect.
Interesting about the parallel miss handling, thanks!
One worry is that this tends to compound other effects -- say, non-prefetch-friendly access combined with TLB misses, resulting in increasingly expensive slowdowns (as in the continuous-vs.-random array access example in the paper).