I think that's true if 7nm progress is independent of 10nm, and the mistakes made with 10nm aren't also delaying 7nm.
My understanding is that Intel's 14->10nm shrink was the most aggressive in the industry, promising to yield a greater increase in density than the geometry would imply, when usually there's a loss factor and the density increase isn't as good as you'd naively expect.
Even after 10nm was delayed by quite some time, Intel pointed to this and declared they weren't lagging the industry, since the step was closer to a 1.5-node shrink.
If the 10nm delay was the result of this aggressiveness, and Intel was well into developing 7nm in a similar fashion before it became obvious how 10nm was going to turn out, this may not be a matter of skipping over a single bad apple.
That's just early optimism. As far as these things can be known from the outside, the process is fine, very much unlike 10nm, which was known to be broken for years, even before Cannon Lake.
1. The targeted metal pitch (36nm) requires Self-Aligned Quad Patterning (SAQP), which is apparently very, very hard to do without EUV. TSMC is doing a 40nm metal pitch and Samsung does 36nm, BUT both of them are using EUV.
I remember them developing an EUV lithography rig at the ALS 11 bend magnet in the 90s... guess it never worked out. Looking at the website: dude's still there!
Possibly a meta question on instructions: is there such a thing as having 'too many'?
I know transistors are cheap, but at what point is it diminishing returns on new instructions? How many different use cases need to be handled? Certainly there are new situations that need to be handled (e.g., H.264/5/6, AV1 video), but will there ever be a point where we can say "this is enough"?
More instructions build a slightly higher moat for new companies who think they might design a CPU, and add a bit to AMD's costs to keep up. Customers who use those instructions will stick with Intel for a generation or two until AMD comes out with something compatible, and if Intel is smart, they'll contribute patches to open-source software projects that make use of them, with some low-performance fallback path for non-Intel processors.
As those instructions become part of the regular software ecosystem, other architectures like ARM will have to add something similar too, again increasing the cost of playing catch-up.
This may be true, but it is ridiculously cynical to think this is the main reason to add instructions to a processor. Every vendor I'm aware of for CPUs, GPUs, and other kinds of processors is adding instructions for ML workloads, and they often really do help a lot.
Also, if you think that ARM and x86 have parity in support for instructions, you are off your rocker.
The instructions are certainly not free of cost, but vendors don't have to implement new functional units for every new instruction. Often new instructions are just slightly different or more general modes of existing ones. And on the other side, legacy instructions get emulated via microcode instead of having dedicated units.
The software equivalent would be compatibility shims and wrapper functions with different signatures sharing a common, highly parameterized internal implementation.
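A minimal C sketch of that pattern (every name here is invented for illustration): thin entry points with different signatures funnelling into one highly parameterized core, much like instruction variants sharing a function unit.

    #include <stddef.h>
    #include <stdint.h>

    /* One highly parameterized internal implementation... */
    static void copy_core(void *dst, const void *src, size_t n,
                          int reverse, uint8_t fill)
    {
        const uint8_t *s = src;
        uint8_t *d = dst;
        for (size_t i = 0; i < n; i++)
            d[i] = s ? s[reverse ? n - 1 - i : i] : fill;
    }

    /* ...and several thin "shim" entry points layered on top of it. */
    void copy_forward(void *dst, const void *src, size_t n)  { copy_core(dst, src, n, 0, 0); }
    void copy_backward(void *dst, const void *src, size_t n) { copy_core(dst, src, n, 1, 0); }
    void fill_bytes(void *dst, uint8_t value, size_t n)      { copy_core(dst, NULL, n, 0, value); }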
When you can put more transistors on a chip, you have to use them for something. You can add more cores, but that requires software to be written in a way that utilizes those cores, which isn't always an easy task, or even possible. Or you can use those transistors to make some common operations faster, which can increase performance with small rewrites (or even no rewrites, if the work happens in libraries).
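A small C illustration of that trade-off (function names are mine): the same common operation written once as a portable loop and once as a compiler builtin that can map to a dedicated instruction when the hardware spends transistors on one.

    #include <stdint.h>

    /* Portable fallback: clears the lowest set bit each iteration. */
    int popcount_portable(uint64_t x)
    {
        int count = 0;
        while (x) {
            x &= x - 1;
            count++;
        }
        return count;
    }

    /* GCC/Clang lower this builtin to a single POPCNT instruction when the
       target supports it (e.g. -mpopcnt), and to a bit-trick sequence
       otherwise, i.e. a low-performance fallback path. */
    int popcount_builtin(uint64_t x)
    {
        return (int)__builtin_popcountll(x);
    }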
"This is enough" will be said when we wouldn't be able to put more transistors. I think that it won't happen in the next 10 years, but, of course, it'll happen eventually.
> When you can put more transistors on a chip, you have to use them for something.
I mean, it’d honestly be really interesting to know the performance characteristics (TDP et al.) of an Intel 8088 if it were shrunk to 14nm and then driven at a modern clock speed. Maybe less is more?
It would've been really bad compared to modern CPUs. Just one example: having no cache, instruction fetches would be slowed down to the speed of RAM. What's worse, if an instruction reads or writes a memory location, another RAM access would be issued, and since that access would very likely hit a very different location, additional timing penalties would pile up (row pre-charge, etc.).
Ha, no. It didn't have a cache, pipelining, separate bus and instruction subsystems, or a wide bus. It would run one instruction every N clock cycles instead of N instructions every clock cycle, and stall on every memory read or write. Multi-byte math would take many instructions. It would be pathetic.
My argument isn't that modern software would run faster on such a CPU; my argument is instead that:
1. the most trivially-true version of the argument: software written for that CPU would run faster on a modern "remastering" of that CPU, than it would run (directly, via a lot of microcode-level emulation; or indirectly, via actual emulation) on a modern CPU. (Yes, some software that's still binary-forward-compatible with modern CPUs—only using generic ISA ops—would be faster on the modern CPU. But I'm talking about the worst, most persnickety edge-case uses of the ISA. The kinds of "requires a whole different model of the world to have the right side-effects" ops that make IBM write emulators for their previous mainframe architectures, rather than just shimming those ops into their new POWER ISAs and doing load-time static recompilation to the new ISA.)
2. smaller transistor size would mean less total power draw per cycle—i.e. it's a rather dim bulb—which means you could overclock the heck out of that CPU.
3. As long as you don't also make the die-size any smaller (but rather just lay out your small transistors with super-long trace-paths between them), then you're not decreasing the thermal surface-area of the die in the process, so you can then attach a modern cooling setup to it to clock it even higher.
4. Or, if you like, you can shrink the die-size and produce a compact 10nm 8088, at which point it'd probably be, say... 1 sq. mm? Smaller than a Cortex-M0+, for sure. That's the point when things are small enough that you can start to do wacky things like covering the entire (uncapped) die surface in a focused laser beam, to cool it by entangling the coherent "negative temperature" photons with the traces' positive-temperature baryons, as an indiscriminate version of an atomic-force microscope's method of ion capture.
But what would you do with that speed? Let’s say you run that 8088 at 10GHz (about a factor 1,000 faster than a fast 8088).
What useful algorithm needs that speed but not more than 1MB memory (that CPU could read and write its entire address space a thousand times a second)?
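A quick back-of-the-envelope check on that parenthetical in C, assuming the original 8088's four-clock bus cycle per byte simply scales to a hypothetical 10GHz clock and ignoring the DRAM latency that the other replies point out would dominate in practice:

    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 10e9;            /* hypothetical 10GHz 8088 */
        double cycles_per_byte = 4.0;             /* 8088 bus cycle = 4 clocks */
        double address_space   = 1024.0 * 1024.0; /* 1MB (20-bit addressing) */

        double seconds_per_sweep = address_space * cycles_per_byte / clock_hz;
        printf("one full sweep: %.3f ms, about %.0f sweeps per second\n",
               seconds_per_sweep * 1e3, 1.0 / seconds_per_sweep);
        /* ~0.42 ms per sweep, ~2400 read sweeps per second, so a combined
           read-and-write pass lands around the "thousand times a second" mark. */
        return 0;
    }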
CISC is basically a compression codec based on macros. Each macro encodes a batch of micro ops and core engineers are free to decide how to implement them for the best speed vs silicon trade off for each instruction.
The thing that killed RISC (at least the true minimal kind) was the divergence of CPU and RAM speeds. All those instructions saturate the memory bus. Code density became important.
x86 is of course a bit of a legacy mess, but the worst thing about it is not too many instructions but how many instruction widths there are. The x86 instruction stream is hard to parse compared with ARM and others that have only one or two instruction word sizes.
A few ideas from RISC survived, chief among them fixed instruction widths, minimal instruction dependencies, and canonical encodings. RISC-V is really not RISC in the original minimal sense, but it kept the moniker for those aspects.
Adding more instructions (macros) to x64 will not make its downsides any worse. That said, I do wonder whether x64 has so many because it is using them to work around the inefficiencies of its instruction encoding. Maybe some of these ops would be just as fast without a big macro if the core instruction set were less crufty, but I am not a CPU engineer.
I remember reading somewhere that nowadays a significant chunk of the instructions isn't actually implemented on the CPU using transistors, but by using CPU microcode to sort of emulate these instructions by combining existing ones. Someone correct me if I'm wrong.
Micro-ops are the actual things that can be executed by the hardware. A floating-point FMA unit is going to support a floating point addition, subtraction, fused multiply add (with various intermediate sign twiddles), and integer multiplication and wide multiplication--all without adding much more hardware: you're adding a few xors or muxes to the big, fat multiplier in the middle of it all. Each of these might have distinct micro-ops, or you might be able to separate the processing stages and use a single multiplier micro-op with distinct preprocessing micro-ops for the different instructions. Realistically, though, you are adding new micro-ops, although the overall hardware burden may be light.
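A rough software analogy for that sharing, using C's standard fma() (just to show how the operations relate, not how any particular decoder or unit is built):

    #include <math.h>

    /* All three ride on the same a*b + c operation with one input pinned,
       much as an FMA unit reuses its multiplier and adder (modulo
       signed-zero corner cases on the multiply-only form). */
    double add_via_fma(double a, double b) { return fma(a, 1.0, b); }  /* a*1 + b */
    double sub_via_fma(double a, double b) { return fma(b, -1.0, a); } /* a - b   */
    double mul_via_fma(double a, double b) { return fma(a, b, 0.0); }  /* a*b + 0 */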
The motivation for adding new instructions is generally to get higher performance, so there's going to be pressure to have hardware that executes them well, as opposed to more naive emulation. But sometimes people add support without making it fast: AMD chips used to (still do? I'm not sure) implement the 256-bit AVX instructions by sending the 128-bit halves through their units in sequence, so that they technically supported AVX instructions but didn't see much benefit from them.
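For concreteness, a hedged intrinsics sketch of what "sending the 128-bit halves through in sequence" means at the software-visible level (function names are mine, and this says nothing about AMD's actual microarchitecture):

    #include <immintrin.h>

    /* One 256-bit add... */
    __m256 add256(__m256 a, __m256 b)
    {
        return _mm256_add_ps(a, b);
    }

    /* ...is semantically just two independent 128-bit adds. Executing it
       that way gives the same result as full-width hardware, but no more
       throughput than plain 128-bit SSE. */
    __m256 add256_as_two_halves(__m256 a, __m256 b)
    {
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a),
                               _mm256_castps256_ps128(b));
        __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1),
                               _mm256_extractf128_ps(b, 1));
        return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    }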
Back in the high CISC era every instruction would be backed by microcode as a series of instructions like "Load the first argument from memory location X; load the address of the second argument from memory location Y; now use that to get the second argument; store the result in memory location Z;"
Then in the RISC era the instructions being fed to processors more closely matched what was going on inside, though pipelining made that a bit more complicated.
These days a processor will still take the incoming instruction stream and sometimes break up instructions into pieces but it will also sometimes fuse two instructions into a single one like a compare followed by a branch.
That's not really the case. Many complex, obsolete, or non-timing-critical instructions are microcoded, but the large majority of instructions executed by the CPU are not. They are translated to micro-ops, but that's a different thing; normally there is a single micro-op that executes the bulk of the instruction.
I think we're a long long way from someone declaring that "we have found the best way to arrange the transistors - there are no further improvements to be made".
That feels like an orthogonal issue. I mean, what if the best way to arrange the transistors turns out to be a one-instruction set computer? It's not obvious to me that adding more instructions is a better use of the transistors available in the surface area we've got to work with.
That is why bytecode-based executables are so nice.
Sure, you need an AOT compiler as part of the OS, or a JIT as part of the runtime, but I'd rather let the compiler infer what might be best to use, even if not optimal, than play around with -march configurations.
Something that mainframes have been doing since the 70's.
Which bytecode-based compilers actually perform real optimizations, aside from WebAssembly? Both Python and Java seem to emit "stupid" bytecode and let the JIT take care of it. (And any bytecode implementation is almost always slower than native code.)
All of those that are actually compiled to native code?
Optimizations done at bytecode level aren't worthwhile unless you are targeting pure interpreters.
Still, there are bytecode optimizers for Java and .NET: that is part of what code obfuscators do, it's what many toolchains for embedded devices do as a packing step, and it's a standard step in Android builds.
As for the JIT/AOT compilers themselves, having bytecode with a kernel JIT is what has allowed IBM and Unisys to update their hardware designs while keeping software with several decades of history running.
It is also what allows Android and watchOS code to transition to 64 bits without most developers even having to care about it.
It is also how Android, Java, and .NET developers are able to take advantage of AVX auto-vectorization, SIMD string operations, and bit-operation optimizations without having to touch their decades-old binaries.
It also allows the usage of cloud compilers (UWP, Android, IBM J9), which gather PGO information across multiple deployments and use it as input to the top tier JIT/AOT optimizing compilers.
Is there a rationale behind Intel's codename scheme?
With (e.g.) Ubuntu they go through the English alphabet, so you can get some idea of the order things have/will come out, but it seems that Intel is using a random word generator for their X Lake names.
With Ubuntu et al. it is alphabetical order, and you can generally tell the timeline of releases. But how can one really tell what the current product from Intel is, what came before, and what the upcoming releases are?
It's not like they're going through Oregon lakes in some kind of order, or are they? By discovery, by size/volume, other?
> What you are looking for are monotonically increasing version numbers, which code names are definitely not.
They are not, but given that Intel seems to put them in their public information / marketing material, it seems like the "codenames" are being used as version numbers. If they were strictly internal-to-Intel I could see that POV, but that doesn't seem to be happening.
And Intel's model numbers / SKUs also seem to be created by a random number generator. :)
What's wrong with having public names that don't match a version number?
Like you said, even model/SKU numbers are mostly random. Why the expectation of a monotonically increasing version number if Intel has almost never done it before?
I think you might be able to connect the releases on a literal (driving) roadmap.
That’s also sort of how Microsoft did their codenames in the late 90s/early 2000s, when they were all places you could reach within a few hours’ drive of Redmond.
In Intel's "Tick-Tock" model there was a running theme to the names: Sandy Bridge and Ivy Bridge were successive models on the same core microarchitecture, and then similarly Haswell and Broadwell, and finally Skylake.
And then the 10nm debacle happened, so everything since then has been X lake.
The article says "LBRs (Last Branch Recording) in order to speed up branches". Reading the PDF, it seems to be more about recording branches (for profiling?). If it really is about speedups, can someone summarise how, and how that differs from using branch predictors?
I think it's for tooling like profilers, not directly for performance. But you could use that tool to improve performance in your code, so maybe that's what they mean.
Is it clear to anyone if the BFloat16 support in these chips means they have extra-low precision multipliers internally, or are the dot products still implemented with the single precision guts? Does it actually increase compute over FP32 or is it just a bandwidth win?
* VCVTNE2PS2BF16 — Convert Two Packed Single Data to One Packed BF16 Data [i.e., float32 -> bfloat16 conversion]
* VCVTNEPS2BF16 — Convert Packed Single Data to Packed BF16 Data [ditto]
* VDPBF16PS — Dot Product of BF16 Pairs Accumulated into Packed Single Precision [i.e., multiply two v2bf16 with each other, and then sum the two results into a f32.]
The last instruction is the only computation instruction, and that can be implemented by breaking the 23-bit multiplier of an FMA unit into two 7-bit multipliers, one for each bfloat16, as well as duplicating the normalization logic beforehand.
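To answer the compute-versus-bandwidth question above in concrete terms, here is a scalar C sketch of the semantics (type and function names are mine; conversion is shown as truncation for brevity, whereas the hardware rounds to nearest-even): the multiplies only need bfloat16-width significands, but the accumulation happens in float32.

    #include <stdint.h>
    #include <string.h>

    typedef uint16_t bf16; /* raw bit pattern: top 16 bits of an IEEE float32 */

    /* float32 -> bfloat16: keep sign, 8 exponent bits, 7 mantissa bits. */
    bf16 f32_to_bf16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (bf16)(bits >> 16);
    }

    float bf16_to_f32(bf16 h)
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    /* One lane of VDPBF16PS: multiply a pair of bf16 values elementwise and
       accumulate both products into a float32. The products only need
       narrow multipliers, which is the point about splitting the FMA
       unit's multiplier made above. */
    float bf16_dot2_acc(float acc, const bf16 a[2], const bf16 b[2])
    {
        return acc + bf16_to_f32(a[0]) * bf16_to_f32(b[0])
                   + bf16_to_f32(a[1]) * bf16_to_f32(b[1]);
    }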
I am still dead convinced we will never see a wide release of a 10nm desktop chip. It will be delayed then cancelled in favor of 7nm.
Server-wise, same: limited releases, more smoke and mirrors, etc.
Intel will limp along with 14nm until 7nm. 10nm is utterly broken and they can't fix it.