It has lots more besides AVX-512, so I'd consider waiting. These include: lower power, HEVC in hardware, JPEG in hardware, some VP9 in hardware, a better GPU, faster AES-NI, and Thunderbolt 3.
If Apple supports it in software, the chip can also jump between power-saving states much faster.
It's actually even more restricted than that. The E3-1200 V5 series Skylake Xeons that came out last month are in the "server" line, but do not support AVX-512. I think it's presumed that none of the Greenlow generation will support it, but that the Purley generation that follows will.
So far, my limited experience with Skylake would say that performance gains will be small. If you are buying a Skylake laptop, it should probably be for the increase in battery life rather than hopes of significantly higher speeds.
I think I figured out what was happening in the question I posted to Agner. I submitted a reply, but it's still in moderation. For me, it finally explains a lot of the lower-than-expected performance I've seen with AVX/AVX2. In case others are interested, here it is.
-----
Agner wrote:
You are always limited by cache ways, read/write buffers, faulty
prefetching, suboptimal reordering, etc.
Yes, although in my example I'm considering the much simpler case
where there are two reads but no writes, and all data is already in
L1. So although problematic in the real world, these shouldn't be a
factor here. In fact, I see the same maximum speed if I read the same
4 vectors over and over rather than striding over all the data. I've
refined my example, though, and think I now understand what's
happening. The problem isn't a bank conflict; rather, it's a slowdown
due to unaligned access. I don't think I've seen this discussed
before.
Contrary to my previous understanding, alignment makes a big
difference in the speed at which vectors are read from L1 into registers.
If your data is 16B aligned rather than 32B aligned, a sequential read
from L1 is no faster with 256-bit YMM reads than it is with 128-bit
XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot
achieve two 32B loads per cycle if the underlying data is not 32B
aligned. If the data is 32B aligned, you still can't quite sustain 64
B/cycle of loads with either, but you can get to about 54 B/cycle with
both.
What this says is that unless your loads are 32B aligned, you are
limited to about 40B loaded per cycle regardless of method. If you are
sequentially loading non-32B aligned data from L1, the speeds for 16B
loads and 32B loads are identical, and limited to less than 32B per
cycle. All alignments not shown were the same as 8B alignment.
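For anyone who wants to play with this, here's a minimal sketch of the kind of L1 load-throughput test I'm describing. The buffer size, repeat count, and rdtsc-based timing are choices I've made for illustration rather than the exact harness, and rdtsc counts reference cycles, so treat the numbers as approximate unless the clock is pinned. Compile with something like -O2 -mavx2 on gcc or clang.

    #include <immintrin.h>
    #include <x86intrin.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N    1024        /* 4 KB of floats: comfortably inside L1 */
    #define REPS 100000

    int main(void) {
        /* over-allocate so the starting pointer can be deliberately misaligned */
        float *base = aligned_alloc(64, (N + 16) * sizeof(float));
        for (int i = 0; i < N + 16; i++) base[i] = 1.0f;

        const int offs[] = {0, 4, 2};   /* 32B-, 16B-, and 8B-aligned starts */
        for (int k = 0; k < 3; k++) {
            float *p = base + offs[k];
            /* eight independent accumulators so the adds never limit load throughput */
            __m256 a0 = _mm256_setzero_ps(), a1 = a0, a2 = a0, a3 = a0,
                   a4 = a0, a5 = a0, a6 = a0, a7 = a0;
            unsigned long long t0 = __rdtsc();
            for (int r = 0; r < REPS; r++)
                for (int i = 0; i < N; i += 64) {    /* eight 32B loads per iteration */
                    a0 = _mm256_add_ps(a0, _mm256_loadu_ps(p + i));
                    a1 = _mm256_add_ps(a1, _mm256_loadu_ps(p + i + 8));
                    a2 = _mm256_add_ps(a2, _mm256_loadu_ps(p + i + 16));
                    a3 = _mm256_add_ps(a3, _mm256_loadu_ps(p + i + 24));
                    a4 = _mm256_add_ps(a4, _mm256_loadu_ps(p + i + 32));
                    a5 = _mm256_add_ps(a5, _mm256_loadu_ps(p + i + 40));
                    a6 = _mm256_add_ps(a6, _mm256_loadu_ps(p + i + 48));
                    a7 = _mm256_add_ps(a7, _mm256_loadu_ps(p + i + 56));
                }
            unsigned long long t1 = __rdtsc();
            /* consume the accumulators so the loads aren't optimized away */
            float sink[8];
            _mm256_storeu_ps(sink, _mm256_add_ps(
                _mm256_add_ps(_mm256_add_ps(a0, a1), _mm256_add_ps(a2, a3)),
                _mm256_add_ps(_mm256_add_ps(a4, a5), _mm256_add_ps(a6, a7))));
            printf("offset %2dB: %.1f B/cycle (sink %f)\n", offs[k] * 4,
                   (double)REPS * N * sizeof(float) / (double)(t1 - t0), sink[0]);
        }
        free(base);
        return 0;
    }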
Loading in a non-sequential order is about 20% faster for unaligned
XMM and unaligned YMM loads. It's possible there is a faster order
than I have found so far. Aligned loads are the same speed
regardless of order. Maximum speed for aligned XMM loads is about 30
B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.
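As one example of a non-sequential order, the inner loop of the sketch above can be changed to interleave loads from the front and back halves of the buffer. I can't pin down the exact order that produced the ~20% gain, so treat the pattern below as illustrative only; it reads the same total number of bytes per pass.

    /* variant of the inner loop above: alternate between the two halves of the
     * buffer instead of streaming straight through it */
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N / 2; i += 32) {
            a0 = _mm256_add_ps(a0, _mm256_loadu_ps(p + i));
            a1 = _mm256_add_ps(a1, _mm256_loadu_ps(p + N / 2 + i));
            a2 = _mm256_add_ps(a2, _mm256_loadu_ps(p + i + 8));
            a3 = _mm256_add_ps(a3, _mm256_loadu_ps(p + N / 2 + i + 8));
            a4 = _mm256_add_ps(a4, _mm256_loadu_ps(p + i + 16));
            a5 = _mm256_add_ps(a5, _mm256_loadu_ps(p + N / 2 + i + 16));
            a6 = _mm256_add_ps(a6, _mm256_loadu_ps(p + i + 24));
            a7 = _mm256_add_ps(a7, _mm256_loadu_ps(p + N / 2 + i + 24));
        }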
At L2 sizes, the effect still exists, but is less extreme. XMM loads
are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell,
YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are
24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at
27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake
are almost the same as aligned loads (26 B/cycle), while non-sequential
loads are much slower (17 B/cycle).
At L3 sizes, alignment is barely a factor. On Haswell, all loads are
limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13
B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.
Coming from memory, XMM and YMM loads on Haswell are the same
regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads
are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with
little dependence on alignment. It's possible that prefetch can
improve these speeds slightly.
Agner writes:
The write operations may sometimes use port 2 or 3 for address
calculation, where the maximum throughput requires that they use port
7.
I don't recall if you mention it in your manuals, but I presume you
are aware that Port 7 on Haswell and Skylake is only capable of
"simple" address calculations? Thus sustaining 2 loads and a store is
only possible if the store address is [const + base] form rather than
[const + index*scale + base]. And as you point out, even if you do
this, it can still be difficult to force the processor to use only
Port 7 for the store address.
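To make the addressing-mode point concrete, here's a rough C sketch. The function names are mine, and whether the compiler actually emits the indexed versus base-only store addressing suggested in the comments has to be verified in the generated asm.

    #include <stddef.h>

    /* The natural store address here is [dst + i*4], i.e. base + index*scale,
     * whose address calculation cannot issue on port 7. */
    void add_indexed(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Bumping the pointers keeps the store address in the simple [dst] form,
     * which is port-7 eligible, so 2 loads + 1 store per cycle is at least possible. */
    void add_pointer_bump(float *dst, const float *a, const float *b, size_t n) {
        const float *end = a + n;
        while (a < end)
            *dst++ = *a++ + *b++;
    }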
When dealing with unaligned loads, keep in mind that the penalty for a page-crossing load is huge pre-Skylake (~100 cycles); does your test set include page boundary crossings?
I believe there's still a throughput penalty for cacheline-crossing loads as well, though it's quite modest compared to pre-Nehalem (where each cacheline crossing cost you 20 cycles!)
Looking now, Intel's guide says that Skylake reduces the cross-page load penalty from 100 cycles to 5 cycles. I tried with and without crossing page boundaries, although this wasn't my intent. In one version I repeatedly loaded 16KB of floats, and in the other I reloaded the same 4 floats. I saw minimal difference between these two on both Haswell and Skylake. But since the penalty is increased latency, it's possible that the "load and throw away the result" approach doesn't expose this issue.
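If anyone wants to deliberately provoke the page-crossing case, something like the following works. It's a sketch; the allocation size and the 16-byte back-off from the 4 KB boundary are arbitrary choices of mine, and it assumes 4 KB pages.

    #include <immintrin.h>
    #include <stdlib.h>

    /* Place a 32B load so it straddles a 4 KB boundary: with 4 KB pages, the
     * 4096-byte offset inside a page-aligned allocation is a real page boundary.
     * The load starts 16 bytes before it and ends 16 bytes past it. */
    float straddling_load(void) {
        float *p = aligned_alloc(4096, 8192);
        for (int i = 0; i < 2048; i++) p[i] = 1.0f;
        __m256 v = _mm256_loadu_ps((float *)((char *)p + 4096 - 16));
        float out[8];
        _mm256_storeu_ps(out, v);
        free(p);
        return out[0];
    }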
I realize that this is another topic, although in a somewhat similar context, so I thought I might just ask: have there been any advances in reducing page-walk latency?
I'm thinking of the virtual address translation costs having an impact on the run times of common algorithms, e.g., as demonstrated in the following work by Jurkiewicz & Mehlhorn: http://arxiv.org/abs/1212.0703
Admittedly, it focuses specifically on one aspect (the Double Address Translation on Virtual Machines issue in the Jurkiewicz & Mehlhorn context).
What I'm wondering about is: Has there been any progress on that on the "practical implementation" side, in the recent/coming Intel (or other, for that matter) CPUs?
I haven't read these papers (thanks for the links!) but there has been one major recent improvement. Starting with Broadwell (and I presume continuing with Skylake), the CPU can now handle two page misses in parallel: http://www.anandtech.com/show/8355/intel-broadwell-architect...
The other interesting thing is that page walks themselves are actually not very expensive: something on the order of 10 cycles. They only become painfully expensive when the page tables are too large to fit in cache and spill into memory, necessitating a load from memory just to read the page-table entry. So improvements in memory (and cache) latency will have a strong positive effect.
Interesting about the parallel miss handling, thanks!
One worry is that this tends to compound other effects -- say, non-prefetch-friendly access combined with TLB misses, resulting in increasingly expensive slowdowns (as in the continuous-vs.-random array access example in the paper).