I think that's true if 7nm progress is independent of 10nm, and the mistakes made with 10nm aren't also delaying 7nm.
My understanding is that Intel's 14->10nm shrink was the most aggressive in the industry, promising to yield a greater increase in density than the geometry would imply, when usually there's a loss factor and the density increase isn't as good as you'd naively expect.
Even after 10nm was delayed by quite some time, Intel pointed to this and declared they weren't lagging the industry, since the step was closer to a 1.5-node shrink.
If the 10nm delay was the result of this aggressiveness, and Intel was well into developing 7nm in a similar fashion before it became obvious how 10nm was going to turn out, this may not be a matter of skipping over a single bad apple.
That's just early optimism. As far as these things can be known from the outside, the process is fine, very much unlike 10nm, which was known to be broken for years, even before Cannon Lake.
1. The targeted metal pitch (36nm) requires Self-Aligned Quad Patterning (SAQP), which is apparently very, very hard to do without EUV. TSMC is doing a 40nm metal pitch and Samsung does 36nm, BUT both of them are using EUV.
I remember them developing an EUV lithography rig at the ALS 11 bend magnet in the 90s... guess it never worked out. Looking at the website: dude's still there!
Possibly a meta question on instructions: is there such a thing as having 'too many'?
I know transistors are cheap, but at what point is it diminishing returns on new instructions? How many different use cases need to be handled? Certainly there are new situations that need to be handled (e.g., H.264/5/6, AV1 video), but will there ever be a point where we can say "this is enough"?
More instructions build a slightly higher moat for new companies who think they might design a CPU, and add a bit to AMD's costs to keep up. Customers who use those instructions will stick with Intel for a generation or two until AMD comes out with something compatible, and if Intel is smart, they'll contribute patches to open-source software projects that make use of them, with some low-performance fallback path for non-Intel processors.
As those instructions become part of the regular software ecosystem, other architectures like ARM will have to add something similar too, again increasing the cost of playing catch-up.
This may be true, but it is ridiculously cynical to think this is the main reason to add instructions to a processor. Every vendor I'm aware of for CPUs, GPUs, and other kinds of processors is adding instructions for ML workloads, and they often really do help a lot.
Also, if you think that ARM and x86 have parity in support for instructions, you are off your rocker.
The instructions are certainly not free of cost, but vendors don't have to implement new functional units for every new instruction. Often new instructions are just slightly different or more general modes of existing ones. And on the other side, legacy instructions get emulated via microcode instead of having dedicated units.
The software equivalent would be compatibility shims and wrapper functions with different signatures sharing a common, highly parameterized internal implementation.
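A minimal C sketch of that pattern (every name here is invented for illustration): thin entry points with different signatures funnelling into one highly parameterized core, much like instruction variants sharing a function unit.

    #include <stddef.h>
    #include <stdint.h>

    /* One highly parameterized internal implementation... */
    static void copy_core(void *dst, const void *src, size_t n,
                          int reverse, uint8_t fill)
    {
        const uint8_t *s = src;
        uint8_t *d = dst;
        for (size_t i = 0; i < n; i++)
            d[i] = s ? s[reverse ? n - 1 - i : i] : fill;
    }

    /* ...and several thin "shim" entry points layered on top of it. */
    void copy_forward(void *dst, const void *src, size_t n)  { copy_core(dst, src, n, 0, 0); }
    void copy_backward(void *dst, const void *src, size_t n) { copy_core(dst, src, n, 1, 0); }
    void fill_bytes(void *dst, uint8_t value, size_t n)      { copy_core(dst, NULL, n, 0, value); }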
When you can put more transistors on a chip, you have to use them for something. You can add more cores, but that requires software to be written in a way that utilizes those cores, which isn't always an easy task, or even possible. Or you can use those transistors to make some common operations faster, which can increase performance with small rewrites (or even no rewrites, if the work happens in libraries).
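A small C illustration of that trade-off (function names are mine): the same common operation written once as a portable loop and once as a compiler builtin that can map to a dedicated instruction when the hardware spends transistors on one.

    #include <stdint.h>

    /* Portable fallback: clears the lowest set bit each iteration. */
    int popcount_portable(uint64_t x)
    {
        int count = 0;
        while (x) {
            x &= x - 1;
            count++;
        }
        return count;
    }

    /* GCC/Clang lower this builtin to a single POPCNT instruction when the
       target supports it (e.g. -mpopcnt), and to a bit-trick sequence
       otherwise, i.e. a low-performance fallback path. */
    int popcount_builtin(uint64_t x)
    {
        return (int)__builtin_popcountll(x);
    }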
"This is enough" will be said when we wouldn't be able to put more transistors. I think that it won't happen in the next 10 years, but, of course, it'll happen eventually.
> When you can put more transistors on a chip, you have to use them for something.
I mean, it’d honestly be really interesting to know the performance characteristics (TDP et al.) of an Intel 8088 if it were shrunk to 14nm and then driven at a modern clock speed. Maybe less is more?
It would've been really bad compared to modern CPUs. Just one example: having no cache, instruction fetches would be slowed down to the speed of RAM. What's worse, if an instruction reads or writes a memory location, another RAM access would be issued, and since that access would very likely hit a very different location, additional timing penalties would pile up (row pre-charge, etc.).
Ha, no. It didn't have a cache, pipelining, separate bus and instruction subsystems, or a wide bus. It would run one instruction every N clock cycles instead of N instructions every clock cycle, and stall on every memory read or write. Multi-byte math would take many instructions. It would be pathetic.
My argument isn't that modern software would run faster on such a CPU; my argument is instead that:
1. the most trivially-true version of the argument: software written for that CPU would run faster on a modern "remastering" of that CPU, than it would run (directly, via a lot of microcode-level emulation; or indirectly, via actual emulation) on a modern CPU. (Yes, some software that's still binary-forward-compatible with modern CPUs—only using generic ISA ops—would be faster on the modern CPU. But I'm talking about the worst, most persnickety edge-case uses of the ISA. The kinds of "requires a whole different model of the world to have the right side-effects" ops that make IBM write emulators for their previous mainframe architectures, rather than just shimming those ops into their new POWER ISAs and doing load-time static recompilation to the new ISA.)
2. smaller transistor size would mean less total power draw per cycle—i.e. it's a rather dim bulb—which means you could overclock the heck out of that CPU.
3. As long as you don't also make the die-size any smaller (but rather just lay out your small transistors with super-long trace-paths between them), then you're not decreasing the thermal surface-area of the die in the process, so you can then attach a modern cooling setup to it to clock it even higher.
4. Or, if you like, you can shrink the die-size and produce a compact 10nm 8088, at which point it'd probably be, say... 1 sq. mm? Smaller than a Cortex-M0+, for sure. That's the point when things are small enough that you can start to do wacky things like covering the entire (uncapped) die surface in a focused laser beam, to cool it by entangling the coherent "negative temperature" photons with the traces' positive-temperature baryons, as an indiscriminate version of an atomic-force microscope's method of ion capture.
But what would you do with that speed? Let’s say you run that 8088 at 10GHz (about a factor 1,000 faster than a fast 8088).
What useful algorithm needs that speed but not more than 1MB memory (that CPU could read and write its entire address space a thousand times a second)?
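A quick back-of-the-envelope check on that parenthetical in C, assuming the original 8088's four-clock bus cycle per byte simply scales to a hypothetical 10GHz clock and ignoring the DRAM latency that the other replies point out would dominate in practice:

    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 10e9;            /* hypothetical 10GHz 8088 */
        double cycles_per_byte = 4.0;             /* 8088 bus cycle = 4 clocks */
        double address_space   = 1024.0 * 1024.0; /* 1MB (20-bit addressing) */

        double seconds_per_sweep = address_space * cycles_per_byte / clock_hz;
        printf("one full sweep: %.3f ms, about %.0f sweeps per second\n",
               seconds_per_sweep * 1e3, 1.0 / seconds_per_sweep);
        /* ~0.42 ms per sweep, ~2400 read sweeps per second, so a combined
           read-and-write pass lands around the "thousand times a second" mark. */
        return 0;
    }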
CISC is basically a compression codec based on macros. Each macro encodes a batch of micro ops and core engineers are free to decide how to implement them for the best speed vs silicon trade off for each instruction.
The thing that killed RISC (at least the true minimal kind) was the divergence of CPU and RAM speeds. All those instructions saturate the memory bus. Code density became important.
x86 is of course a bit of a legacy mess, but the worst thing about it is not too many instructions but how many instruction widths there are. The x86 instruction stream is hard to parse compared with ARM and others that have only one or two instruction word sizes.
A few ideas from RISC survived, chief among them fixed instruction widths, minimal instruction dependencies, and canonical encodings. RISC-V is really not RISC in the original minimal sense, but it kept the moniker for those aspects.
Adding more instructions (macros) to x64 will not make its downsides any worse. That said, I do wonder whether x64 has so many because it is using them to work around the inefficiencies of its instruction encoding. Maybe some of these ops would be just as fast without a big macro if the core instruction set were less crufty, but I am not a CPU engineer.
I remember reading somewhere that nowadays a significant chunk of the instructions isn't actually implemented on the CPU using transistors, but by using CPU microcode to sort of emulate these instructions by combining existing ones. Someone correct me if I'm wrong.
Micro-ops are the actual things that can be executed by the hardware. A floating-point FMA unit is going to support a floating point addition, subtraction, fused multiply add (with various intermediate sign twiddles), and integer multiplication and wide multiplication--all without adding much more hardware: you're adding a few xors or muxes to the big, fat multiplier in the middle of it all. Each of these might have distinct micro-ops, or you might be able to separate the processing stages and use a single multiplier micro-op with distinct preprocessing micro-ops for the different instructions. Realistically, though, you are adding new micro-ops, although the overall hardware burden may be light.
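A rough software analogy for that sharing, using C's standard fma() (just to show how the operations relate, not how any particular decoder or unit is built):

    #include <math.h>

    /* All three ride on the same a*b + c operation with one input pinned,
       much as an FMA unit reuses its multiplier and adder (modulo
       signed-zero corner cases on the multiply-only form). */
    double add_via_fma(double a, double b) { return fma(a, 1.0, b); }  /* a*1 + b */
    double sub_via_fma(double a, double b) { return fma(b, -1.0, a); } /* a - b   */
    double mul_via_fma(double a, double b) { return fma(a, b, 0.0); }  /* a*b + 0 */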
The motivation for adding new instructions is generally to get higher performance, so there's going to be pressure to have hardware that executes them well, as opposed to more naive emulation. But sometimes people add support without making it fast: AMD chips used to (still do? I'm not sure) implement the 256-bit AVX instructions by sending the 128-bit halves through their units in sequence, so that they technically supported AVX instructions but didn't see much benefit from them.
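For concreteness, a hedged intrinsics sketch of what "sending the 128-bit halves through in sequence" means at the software-visible level (function names are mine, and this says nothing about AMD's actual microarchitecture):

    #include <immintrin.h>

    /* One 256-bit add... */
    __m256 add256(__m256 a, __m256 b)
    {
        return _mm256_add_ps(a, b);
    }

    /* ...is semantically just two independent 128-bit adds. Executing it
       that way gives the same result as full-width hardware, but no more
       throughput than plain 128-bit SSE. */
    __m256 add256_as_two_halves(__m256 a, __m256 b)
    {
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a),
                               _mm256_castps256_ps128(b));
        __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1),
                               _mm256_extractf128_ps(b, 1));
        return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    }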
Back in the high CISC era every instruction would be backed by microcode as a series of instructions like "Load the first argument from memory location X; load the address of the second argument from memory location Y; now use that to get the second argument; store the result in memory location Z;"
Then in the RISC era the instructions being fed to processors more closely matched what was going on inside, though pipelining made that a bit more complicated.
These days a processor will still take the incoming instruction stream and sometimes break up instructions into pieces but it will also sometimes fuse two instructions into a single one like a compare followed by a branch.
That's not really the case. Many complex, obsolete, or non-timing-critical instructions are microcoded, but the large majority of instructions executed by the CPU are not. They are translated to micro-ops, but that's a different thing; normally there is a single micro-op that executes the bulk of the instruction.
I think we're a long long way from someone declaring that "we have found the best way to arrange the transistors - there are no further improvements to be made".
That feels like an orthogonal issue. I mean, what if the best way to arrange the transistors turns out to be a one-instruction set computer? It's not obvious to me that adding more instructions is a better use of the transistors available in the surface area we've got to work with.
That is why bytecode-based executables are so nice.
Sure, you need an AOT compiler as part of the OS, or a JIT as part of the runtime, but I'd rather let the compiler infer what might be best to use, even if not optimal, than play around with -march configurations.
Something that mainframes have been doing since the 70's.
Which bytecode-based compilers actually perform real optimizations, aside from WebAssembly? Both Python and Java seem to emit "stupid" bytecode and let the JIT take care of it. (And any bytecode implementation is almost always slower than native code.)
All of those that are actually compiled to native code?
Optimizations done at bytecode level aren't worthwhile unless you are targeting pure interpreters.
Still, there are bytecode optimizers for Java and .NET: that is part of what code obfuscators do, it's what many toolchains for embedded devices do as a packing step, and it's a standard step in Android builds.
As for the JIT/AOT compilers themselves, having bytecode with a kernel JIT is what has allowed IBM and Unisys to update their hardware designs while keeping software with several decades of history running.
It is also what allows Android and watchOS code to transition to 64 bits without most developers even having to care about it.
It is also how Android, Java, and .NET developers are able to take advantage of AVX auto-vectorization, SIMD string operations, and bit-operation optimizations without having to touch their decades-old binaries.
It also allows the usage of cloud compilers (UWP, Android, IBM J9), which gather PGO information across multiple deployments and use it as input to the top tier JIT/AOT optimizing compilers.
Is there a rationale behind Intel's codename scheme?
With (e.g.) Ubuntu they go through the English alphabet, so you can get some idea of the order things have/will come out, but it seems that Intel is using a random word generator for their X Lake names.
With Ubuntu et al. it is alphabetical order, and you can generally tell the timeline of releases. But how can one really tell what the current product from Intel is, what came before, and what the upcoming releases are?
It's not like they're going through Oregon lakes in some kind of order, or are they? By discovery, by size/volume, other?
> What you are looking for are monotonically increasing version numbers, which code names are definitely not.
They are not, but given that Intel seems to put them in their public information / marketing material, it seems like the "codenames" are being used as version numbers. If they were strictly internal-to-Intel I could see that POV, but that doesn't seem to be happening.
And Intel's model numbers / SKUs also seem to be created by a random number generator. :)
What's wrong with having public names that don't match a version number?
Like you said, even model/SKU numbers are mostly random. Why the expectation of a monotonically increasing version number if Intel has almost never done it before?
I think you might be able to connect the releases on a literal (driving) roadmap.
That’s also sort of how Microsoft did their codenames in the late 90s/early 2000s, when they were all places you could reach within a few hours’ drive of Redmond.
In Intel's "Tick-Tock" model there was a running theme to the names: Sandy Bridge and Ivy Bridge were successive models on the same core microarchitecture, and then similarly Haswell and Broadwell, and finally Skylake.
And then the 10nm debacle happened, so everything since then has been X lake.
The article says "LBRs (Last Branch Recording) in order to speed up branches". Reading the PDF, it seems to be more about recording branches (for profiling?). If it really is about speedups, can someone summarise how, and how that differs from using branch predictors?
I think it's for tooling like profilers, not directly for performance. But you could use that tool to improve performance in your code, so maybe that's what they mean.
Is it clear to anyone if the BFloat16 support in these chips means they have extra-low precision multipliers internally, or are the dot products still implemented with the single precision guts? Does it actually increase compute over FP32 or is it just a bandwidth win?
* VCVTNE2PS2BF16 — Convert Two Packed Single Data to One Packed BF16 Data [i.e., float32 -> bfloat16 conversion]
* VCVTNEPS2BF16 — Convert Packed Single Data to Packed BF16 Data [ditto]
* VDPBF16PS — Dot Product of BF16 Pairs Accumulated into Packed Single Precision [i.e., multiply two v2bf16 with each other, and then sum the two results into a f32.]
The last instruction is the only computation instruction, and that can be implemented by breaking the 23-bit multiplier of an FMA unit into two 7-bit multipliers, one for each bfloat16, as well as duplicating the normalization logic beforehand.
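To answer the compute-versus-bandwidth question above in concrete terms, here is a scalar C sketch of the semantics (type and function names are mine; conversion is shown as truncation for brevity, whereas the hardware rounds to nearest-even): the multiplies only need bfloat16-width significands, but the accumulation happens in float32.

    #include <stdint.h>
    #include <string.h>

    typedef uint16_t bf16; /* raw bit pattern: top 16 bits of an IEEE float32 */

    /* float32 -> bfloat16: keep sign, 8 exponent bits, 7 mantissa bits. */
    bf16 f32_to_bf16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (bf16)(bits >> 16);
    }

    float bf16_to_f32(bf16 h)
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    /* One lane of VDPBF16PS: multiply a pair of bf16 values elementwise and
       accumulate both products into a float32. The products only need
       narrow multipliers, which is the point about splitting the FMA
       unit's multiplier made above. */
    float bf16_dot2_acc(float acc, const bf16 a[2], const bf16 b[2])
    {
        return acc + bf16_to_f32(a[0]) * bf16_to_f32(b[0])
                   + bf16_to_f32(a[1]) * bf16_to_f32(b[1]);
    }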
I am still dead convinced we will never see a wide release of a 10nm desktop chip. It will be delayed then cancelled in favor of 7nm.
Server-wise, same: limited releases, more smoke and mirrors, etc.
Intel will limp along with 14nm until 7nm. 10nm is utterly broken and they can't fix it.