Intel Skylake/Kaby Lake processors: broken hyper-threading

userbinator · on June 25, 2017

The problem description is short and scary:

Problem: Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH) may cause unpredictable system behavior. This can only happen when both logical processors on the same physical processor are active.

I wonder how many users have experienced intermittent crashes etc. and just nonchalantly attributed it to something else like "buggy software" or even "cosmic ray", when it was actually a defect in the hardware. Or more importantly, how many engineers at Intel, working on these processors, saw this happen a few times and did the same.

More interestingly, I would love to read an actual detailed analysis of the problem. Was it a software-like bug in microcode e.g. neglecting some edge-case, or a hardware-level race condition related to marginal timing (that could be worked around by e.g. delaying one operation by a cycle or two)? It reminds me of bugs like https://news.ycombinator.com/item?id=11845770

This and the other rather scary post at http://danluu.com/cpu-bugs/ suggests to me that CPU manufacturers should do more regression testing, and far more of it. I would recommend demoscene productions, cracktros, and even certain malware, since they tend to exercise the hardware in ways that more "mainstream" software wouldn't come close to. ;-)

(To those wondering about ARM and other "simpler" SoCs in embedded systems etc.: They have just as much if not more hardware bugs than PCs. We don't hear about them often, since they are usually worked around in the software which is usually customised exactly for the application and doesn't change much.)

ereyes01 · on June 25, 2017

A few past lives ago, I used to work on the AIX kernel at IBM. I once spent a few weeks poring through trace data trying to investigate a very mysterious cache-aligned memory corruption induced by a memory stress test. Our trace data was quite comprehensive, and is always turned on due to its very low overhead. It was concerning enough (and took me long enough) that it eventually sucked in the rest of my team to aid in the investigation. None of these other guys were noobs- a couple of them had (at the time) built over 20 years of experience in this system, and in diagnosing similar memory corruption bugs beyond any doubt (many were due to errant DMAs from device drivers). I had too, though for much less than 20 years.

After several full days of team-wide debugging, we had no better explanation based on the available evidence than cosmic rays, or a hardware bug. IBM's POWER processor designers worked across the street from us, so we tried to get them to help- first by asking nicely, then by escalating through management channels.

Their reply was more or less: we've run our gamut of hardware tests for years, and your assertion that it's hardware related is vanishingly unlikely... we don't look into hardware bugs unless you can prove to us beyond a doubt it's hardware related. Cache-aligned memory corruption without any other circumstantial evidence isn't enough.

On a crashed test system sitting in the kernel debugger for several weeks now, there would be no more circumstantial evidence beyond the traces. A corruption like this was never seen again, by all accounts.

If we were right and it was evidence of a hardware failure, this is one way such a problem can go undetected. I hope it was something else, or even a cosmic ray, but we'll never know for sure, I guess.

ausimian · on June 25, 2017

I understand that someone at Microsoft Research once found a bug in the XB360's memory subsystem by model checking a TLA+ spec of it. The story goes that IBM initially refused to believe the bug report. A few weeks later they admitted that such a bug did indeed exist, had been missed by all their testing and would have resulted in system crashes after about 4 hours use.

jldugger · on June 25, 2017

Did you by chance see the paper a year back or so outlining that memory errors are more likely to occur near page boundaries? The author's premise was that a lot of 'cosmic rays' are just manufacturing flaws.

ereyes01 · on June 26, 2017

I debugged lots of memory corruption errors in my time working on AIX kernel stuff. Most of the ones I got to work on did indeed happen at a page boundary [1]. I think there is a pretty simple reason for this: "pages" are both the logical unit size of memory the kernel's allocator hands out, and the unit size that the hardware is capable of addressing in most cases. Therefore, when something at the kernel level is done incorrectly, it often references someone else's page.

It's also possible for a _byte_ offset to be wrong, and these types of errors need not occur at a page boundary. Useful things to do with a raw memory dump with this kind of corruption (at least in AIX kernel):

- Identify the page containing the corruption, and find all activity concerning the physical frame in the traces.

- Try to reverse engineer the bad data. Often times, there are pointers you can follow. You would have to manually translate the virtual address to physical frames, but that's pretty simple to do, both for user space and kernel space (in our case, it was always in kernel space, which was 64-bit and just used a simple prefix).

From there, you just have to be crafty and thorough in following the breadcrumbs and either identifying the bug in code, or at least who should investigate next.

In my original post, note that the corruption was on a _CPU cache_ boundary (128 bytes in POWER). IIRC, the containing page was allocated and pinned [2] for a longer time than the trace buffer tracked (it's been a few years, though).

[1] To make things fun, AIX and POWER supports multiple page sizes- 4K, 64K, 16MB, 16GB. Hardware also has the ability of promoting / demoting between 4K and 64K for in-use memory... lots of fun :-)

[2] AIX is a pageable kernel. If kernel code can't handle a page fault, it needs to "pin" the memory.

... note that any of this can be really outdated, it's been almost a decade since I was an expert in this stuff :-)

(Edit: formatting... how do you make a pretty bulleted list on HN??)

kev009 · on June 26, 2017

Pretty cool to read these stories about AIX. It's always been a nice system paired with HW often a couple years ahead of the performance curve, but it certainly became more and more of a sustaining effort after 2001.

scott_s · on June 25, 2017

Can you provide a citation? I'd like to read it.

jldugger · on June 26, 2017

Ah, found it: "Cosmic Rays Don't Strike Twice" http://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf

breck · on July 4, 2017

Fantastic paper. Thanks for sharing! Just posted it here: https://news.ycombinator.com/item?id=14697744

Bartweiss · on June 26, 2017

Shouldn't ECC offer a second form of benchmark on this? If you see transient, cosmic-ray looking errors in ECC, presumably that's much stronger evidence of a hardware bug. Of course, I've also heard it claimed that ECC design and manufacture are held to a higher standard that might hide the issue.

jldugger · on June 26, 2017

That's precisely what was used to benchmark it.

nol13 · on June 25, 2017

having often attributed these things to cosmic rays as well, anyone have any actual estimates on how often/likely cosmic rays are to cause errors in desktop-like setups?

had to have been a cosmic ray at least once right?

Shikadi · on June 25, 2017

Cosmic ray errors (called soft errors in the industry) are measured in FIT rate, (failures in time) which is measured in failures per billion device hours. Basically, it's very unlikely to hit a desktop computer, somewhat likely to happen higher in the atmosphere/space, and very likely to happen in a system with hundreds or thousands of processors.

matt4077 · on June 26, 2017

There are some numbers in http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabur...

cesarb · on June 25, 2017

> short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH)

This is yet another of the many places where the complexity of the x86 ISA shows up and makes its hardware implementations more complicated: the x86 ISA has instructions which can modify the second-lowest byte of a register, while keeping the rest of the rest of the register unmodified (but AFAIK no instructions which do the same for the third-lowest byte, showing its lack of orthogonality).

For in-order implementations, like the ones which originated the x86 ISA, it's not much of a problem. But for out-of-order implementations, which do register renaming, partial register updates are harder to implement, since the output value depends on both the output of the instruction and the previous value of the register. The simplest implementation would be to make a instruction depending on the new value wait until it's committed to the physical register file or equivalent, and that's probably how it was done for these instructions for these partial registers before Skylake.

For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case. That corner case probably can only be reached when some part of the out-of-order pipeline is completely full, which is why it needs a short loop (so the decoder is not the bottleneck, AFAIK there's a small u-op cache after the decoder) and two threads in the same core (one thread is probably not enough to use up all the resources of a single core). The microcode fix is probably "just" flipping a few bits to disable that optimization.

And this shows how a ISA is more than just the decoding stage; design decisions can affect every part of the core. In this case, if your ISA does not have partial register updates (usually by always zero-extending or sign-extending when writing to only part of a register, instead of preserving the non-overwritten parts of the previous value), you won't have the extra complexity which led to this bug. AMD partially avoided this when doing the 64-bit extension (a partial write to the lower 32 bits of a register clears the upper 32 bits), but they kept the legacy behavior for writes to the lower 16 bits, or to either of the 8-bit halves of the lower 16 bits.

pbsd · on June 25, 2017

The loop needs to be short because the loopback buffer is only active in loops of 64 or fewer entries (usually fewer real instructions, something like 40 or so). Moreover, Skylake introduced one loopback buffer per thread, instead of the previous loopback buffer shared between both threads.

My guess is that is where the bug is; the behavior for partial register access stalls---insert one extraneous uop to combine, e.g., ah with rax---is unchanged since Sandy Bridge.

CalChris · on June 25, 2017

Just as information, the Loop Stream Detector was introduced in Intel Core microarchitectures. With Sandy Bridge, it was 28 μops. It grows to 56 μops (with HT off) with Haswell and Broadwell. It grows again (considerably) with Skylake to 64 μops per thread (HT on or off).

The LSD resides in the BPU (branch prediction unit) and it is basically telling the BPU to stop predicting branches and just stream from the LSD. This saves energy. However, predicting is different than resolving. Branch resolution still happens and when resolution (speculation) fails, the LSD bails out.

In any case, 64 μops is a lot. That's a good sized inner loop.

Symmetry · on June 26, 2017

It's also a problem with SMT[1]. The design cost is pretty small, it's a fairly straightforward extension of what an out of order CPU is already doing. But due to the concurrency issues debugging/verifying it is incredibly difficult.

[1]Simultanious MultiThreading, which is marketed by Intel under the name Hyperthreading when using two threads.

pcwalton · on June 25, 2017

This is an amazing analysis, and seems entirely likely to be right to me. Thanks for writing it up.

valarauca1 · on June 25, 2017

You really don't know what you're talking about.

---

     For Skylake, they probably optimized partial register 
     writes to these four partial "high" registers (AH, BH, 
     CH, DH), but the optimization was buggy in some hard-to-
     hit corner case.

They did not do this.

The high registers (AH/BH/DH/CH) are nearly written out of existence with the REX Prefix in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.

The 16bit registers (AX/BX/DX/CX) are in worse situation, but it ends up costs additional cycles to even decode these instructions as the main encoder can't handle these instructions and you have to swap to the legacy encoder, and you'll end up losing alignment. This costs ~4-6 cycles, also the perf registers to track were only added in Haswell (and require Ring0 to use [2]).

High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.

    That corner case probably can only be reached when some
    part of the out-of-order pipeline is completely full,
    which is why it needs a short loop (so the decoder is not
    the bottleneck, AFAIK there's a small u-op cache after the decoder)

There is a 64uOP cache between the decoder and L1i cache that is called loop stream detector. Normally this exists to do batched writes to the L1i cache.

But in _some_ scenarios when a loop can fit completely within this cache it'll be given extremely priority. This is a way to max out the 5uOP per cycle Intel gives you [1]. It'll flush its register file to L1 cache piece meal as it continues to predict further and further and further ahead speculatively executing EVERYPART OF IT in parallel. [3]

In short this scenario is extremely rare. uOPs have stupidly weird alignment rules. Which you can boil down to:

    Intel x64 Processor are effectively 16byte VLIW RISC processors
    that can pretend to be 1-15byte AMD64 CISC processors at a minor performance
    cost.

---

The real issue here is when Loop Stream mode ends it is properly reloading the register file, and OoO state.

This is likely just a small micro-code fix. The 8low/8high/16bit/32bit/64bit weirdness is likely somebody wasn't doing alignment checks when flushing the register file.

---

[1] On Skylake/KabyLake. IvyBridge, SandyBridge, Haswell, and Boardwell limited this to 4.

[2] Volume 3 performance counting registers I think we're up to 12 now on Boardwell.

[3] Volume 3 Chapter 3.4.1.7 (Page 107)

CalChris · on June 25, 2017

> The high registers (AH/BH/DH/CH) are nearly written out of existence with the RAX flag in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.

I think you meant REX prefix but even that doesn't make any sense.

High registers are a first class element of the Intel 64 and IA-32 architectures. They aren't going anywhere. Microarchitectural implementations are an entirely different thing.

That aside, where in the manuals does Intel say not to use the high registers? They're pretty clear about such warnings and usually state them in Assembly/Compiler Coding Rules.

From the parent:

> For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case.

That is about right. I don't agree with the preceding slap at x86 but this is a good summary.

BTW writing to the low registers is in principle also a partial register hazard but then Intel sees fit to optimize that as a more common case.

In particular, mov AH,BH is not emulated from the MS-ROM which is just hella slow. It uses two μops for Sandy Bridge and above. This is covered in 3.5.2.4 Partial Register Stalls.

Lastly, there is no section 3.4.1.7 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual which is 3 volumes. You must be talking about the Intel® 64 and IA-32 Architectures Optimization Reference Manual which is a single volume. And it isn't clear how that section furthers your argument.

brigade · on June 25, 2017

> High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.

Someone really ought to tell clang and gcc this; they both happily use 16-bit registers for 16-bit arithmetic.

Anyway, Intel obviously already has special optimizations for many partial register accesses, dating back to Sandy Bridge. While it's quite possible that they left out the high registers initially (no clue, don't care), if they did they could have decided to include them in Skylake. Who knows though...

What are you even talking about with the LSD? The LSD is entirely before any register renaming and the entire out-of-order bits of the CPU. It's likely the LSD is involved only because that (plus hyperthreading) might be the only way to get enough in-flight µops to trigger whatever is going wrong, whether or not it's due to optimizations for partial register accesses.

userbinator · on June 25, 2017

Indeed, I've worked with CPUs that didn't have the register split of x86 and they are far less friendly to implementing certain algorithms, which would otherwise require many additional registers and lots of shift/add instructions to achieve the same effect. ("MOV AH, AL" being one simple example -- try doing that on a MIPS, for instance.)

pcwalton · on June 26, 2017

How often have you really needed to do "a = (a & 0xffff00ff) | ((a & 0xff) << 8)"? I don't think I've ever needed to do it, and I wouldn't be surprised if compilers don't even generate "mov ah,al" for that, due to the fact that AH/BH/CH/DH only exist for 4 registers.

Anyway, since you asked: In AArch64 that would be written "bfi w0,w0,#8,#8". "bfi" is an alias of "bfm", an instruction far more flexible and useful than any of the baroque x86 register names. BFM can move an arbitrary number of bits from any position in any register to any other register, and it has optional zero-extension and sign-extension modes.

Espressosaurus · on June 26, 2017

If you're talking to any memory-mapped registers, you'll be doing it all the time. Granted, you're much more likely to be using something like ARM to do that. x86 is a bit large/expensive/power hungry for embedded programming.

wbl · on June 25, 2017

CPU manufacturers do do huge amounts of testing, and Intel does formal verification of some functional units. The reliability is far better than most software, in part because making a new release costs billions.

davidmr · on June 25, 2017

In my limited experience, their root cause analyses are really impressive as well with lots of internal attention and resources. I'm not allowed to talk about any Intel issues, but we reported a very strange issue to Nvidia, sent a couple of dozen cards back and six months later got a truly fascinating report back we with hundreds of pages of compute test result tables and electron microscope images and chemistry lab reports. Anything that hints of a manufacturing problem is taken incredibly seriously.

baruch · on June 25, 2017

I worked for a large company that used thousands of Intel CPUs every year and when we suspected a CPU bug we were mostly brushed off. We had a very persistent person on the team who kept tracking the issue to find correlations and some very good kernel developers that went on to nearly pin-point the issue and only then did Intel pay attention and it then took them still several months to acknowledge the issue give a brief report on the issue and acknowledge that our proposed workaround will indeed work.

I've never seen Intel do a very good job at failure analysis or following on with failures unless prodded very hard.

magila · on June 25, 2017

Intel likely has very thorough data on the issue, but you'll never see it unless you are one of their tier 1 customers* and have an NDA with them. In my experience working for a large hardware manufacturer they are very skittish about releasing detailed failure analysis data to outside companies.

* For Intel that would be companies like Dell, Apple, HP, and maybe a couple of others.

Sharlin · on June 25, 2017

To be fair, I don't even want to think about the amount of spurious bug reports they are probably receiving.

efraim · on June 26, 2017

Here's a video of a tour of Nvidia's lab: https://www.youtube.com/watch?v=pRz_CG3DZb4

qb45 · on June 26, 2017

That's a nice lab. You could try to submit this to the front page.

dboreham · on June 25, 2017

This is because once upon a time people were put in jail for not doing that (when the customer was the DoD).

ploxiln · on June 25, 2017

That's absolutely true. When it comes to CPU/memory, skilled software engineers always think, "it must be my bug, it always is".

So in that super rare case of actually running into a CPU defect, it's a mindfuck, it'll drive you crazy. You'll be looking for the flaw in your algorithm which makes it fail once a week under production load. But you just can't find it, it makes no sense ...

(When it comes to drivers for network/storage/graphics etc devices, it's a whole different story. Those things are piles of bugs that need work-arounds in drivers.)

Cerium · on June 25, 2017

A little anecdote describing one such bug. I didn't find this, another one of my teammates did.

The symptom was that a board with a specific microcontroller on it would be working fine, then after a power cycle it might not keep working. A flash dump would show that the reset vector, the first byte of flash on that system, would be all zeroed out. Of course the system would not run anymore, but why did it happen? After months of intermittent debugging and trying to reproduce the cause was determined. At least under certain conditions the brownout detection level was lower than the voltage level that caused the CPU to make errors. If the board lost power slowly then the CPU would start executing corrupted / arbitrary instructions which generally included lots of zeros. It would occasionally write zeros to the zero address, bricking the board.

Since then we have external power monitoring and reset circuits on all the new boards, but existing ones needed a fix. Luckily the board had power failure interrupt connected, so when that triggers we reconfigure the CPU to execute on the slowest possible clock rate, which greatly reduced the occurrence.

DamonHD · on June 25, 2017

Yes, we have seen similar in AVRs (328p) where the lowest voltage brownout threshold/trip is not enough to prevent the EEPROM and/or Flash from getting corrupted somehow...

justsid · on June 25, 2017

Similarly: Kernel bugs! Once upon a time I spend a good week trying to reproduce a rare crash one of our users saw ever so often (just enough to get our attention, but not enough to get a repro case). Stack traces made no sense and just in general the whole crash made no sense the more I looked at it. Turns out it was a bug in XNUs pthread subsystem, a part where I would have never looked if it weren‘t for desperation, because, well, it‘s the kernel, it works, right?

wbl · on June 25, 2017

Emacs would regularly panic the kernel on one OS X version when I opened preview to look at a pdf and had TeX recompile.

inferiorhuman · on June 25, 2017

The DVD player on OSX with a MacBook (polycarbonate ones) would occasionally cause a kernel panic if you had 4GB of RAM installed.

digi_owl · on June 25, 2017

Or at least should have been long since spotted thanks to use and abuse.

digi_owl · on June 25, 2017

You have some lovely stories online about these things. Like the guy tracking down a stuck bit in RAM.

Or on the network level, a VPN that failed only when traversing one possible route between company offices.

thoughtsimple · on June 25, 2017

I have a very old one. I was working on a embedded system using an Intel 80186. The ‘186 had a memory mapped IO system that was very unusual for Intel x86 CPUs. The code that I started with was originally written by a bunch of Motorola 68000 guys so they used this mode. I had to modify the interrupt structure so I rewrote the interrupt service routines. Since it was memory mapped and ANDs were documented to be atomic, I assumed that I could just AND bits into the interrupt mask register. Big mistake. It turned out that ANDs to the IO registers were not atomic. And could be interrupted in the middle of the write after read. I took me about a month to realize that this was a hardware “bug” and not in my interrupt code. Drove me nuts.

chiph · on June 26, 2017

These problems have existed for years - I once tracked down a dead cell in core memory on a system I worked on in the Air Force. Because of the chance alignment, it caused incoming messages to never "end". Which was fun for the operators.

http://www.1882nd.com/images/DSTE.jpg

heisenbit · on June 25, 2017

It may be far better than average software quality but far more also relies on it. The question is whether the quality is adequate in light of what is at stake.

lorenzhs · on June 25, 2017

Really? How is it different from a kernel bug that causes random behaviour? Both apparently can be fixed with a software update. They're both bad, but why should Intel be held to a higher standard if mitigation is similarly complex?

wav-part · on June 25, 2017

Because futher down the stack, more reliable a tech has to be. Otherwise good luck debugging. Also agree with heisenbit.

justinclift · on June 25, 2017

Hmmm... on that note, if the universe is a simulation, then a bug in that could have some interesting ramifications.

"Don't use the bookshelf over there, physics is broken on that shiny spot." :D

krylon · on June 26, 2017

// FIXME Trying to read both the position and the momentum of a particle reliably crashes the simulation.

// Workaround: Make the accessor method return fuzzy values for either of those values.

roywiggins · on June 25, 2017

The problem with finding a physics bug is that it's liable to be amplifiable. Break conservation laws just a little bit and suddenly you have potential for exponential runaway leading to an unplanned reality excursion, and say goodbye to your light-cone.

justinclift · on June 25, 2017

Well, by definition the behaviour would be undefined so could go any which way. ;)

Modern day VM software has various levels of exception checking, and code to catch/mitigate/etc when bugs crop up.

So, a universe-capable simulator might have any kind of behaviour if/when a bug occurs. It doesn't need to be an unbounded, runaway scenario. :)

rbanffy · on June 25, 2017

And, if a very bad reality excursion happens, you scrap that branch and restore the last checkpoint. Nobody inside the simulation will ever remember it. You only bother fixing it if halt/scrap/restore becomes a burden.

It probably happens more often than we imagine ;-)

wolfgke · on June 26, 2017

> Hmmm... on that note, if the universe is a simulation, then a bug in that could have some interesting ramifications.

> "Don't use the bookshelf over there, physics is broken on that shiny spot." :D

You know "The Animatrix - Beyond"?

justinclift · on June 26, 2017

Nope, just looked it up though:

https://en.wikipedia.org/wiki/The_Animatrix#.22Beyond.22

Seems like the same kind of concept. :)

qubex · on June 25, 2017

I don't know why you were downvoted. I find this hilarious and thought-provoking.

dmead · on June 25, 2017

IIRC wasn't there some blog posts about how intel has cut their verification/QA drastically since 2010? is this the result?

userbinator · on June 25, 2017

Although it mentions 2015, I think that might be the one I linked in my original post above: http://danluu.com/cpu-bugs/

gcb0 · on June 25, 2017

releasing cpu micro code to work around bugs is very cheap.

CountSessine · on June 25, 2017

But that doesn't mean you necessarily get to keep the affected feature - the only 'work-around' might be to disable it. Consider TSX on Haswell and Broadwell, where Intel had to disable the whole feature because of a bug. And of course there was the 486 FP bug which couldn't be fixed by any kind of microcode update.

If Intel had to completely disable hyperthreading in Skylake and Kabylake that would make the premium anyone paid for i5 vs i7 worthless.

gcb0 · on June 25, 2017

you mean i3 and i7.

despite what cpuinfo tells you, no HT in i5.

and my previous comment was ironic :)

CountSessine · on June 25, 2017

But if you don't have HT in your i7, you basically have an i5 that you paid more than you needed for.

Or has that changed? At one point, i7 was full-featured, i5 was an i7 with HT disabled, and i3 was i7 with HT intact but smaller caches. Is that different with Skylake/Kabylake?

and my previous comment was ironic :)

Ack - sorry. I must be irony-impaired. That's why I don't post very often. :-)

theandrewbailey · on June 26, 2017

i3 also has half the cores.

happycube · on June 25, 2017

Mobile dual-core lie5's have hyperthreading too.

autokad · on June 25, 2017

i am more willing to accept software having bugs / failing. i have 0 tolerance for hardware to have bugs

pshc · on June 25, 2017

I wonder if there's a CPU out there that doesn't have bugs.

userbinator · on June 25, 2017

It depends on how exactly you define a bug, but something like a 6502 or Z80, which has essentially all the errata well-documented since it has existed for so long, might qualify.

DamonHD · on June 25, 2017

Z80 implementations for one had a whole slew of undocumented instructions, allegedly because they were buggy in some way.

protomyth · on June 26, 2017

The VIPER was built to be fully proven, but there was come debate on methodology.

autokad · on June 25, 2017

stop playing lawyer ball. i'm talking about the bugs that require turning off core functionality (such as hyperthreading) and/or that lead to system halts / corrupted data.

yes it happens, but software bugs happen every day where as if your system blue screened every day you know dang well you'd be on the phone with the hardware vendor for a refund / new system.

coldtea · on June 25, 2017

>yes it happens, but software bugs happen every day where as if your system blue screened every day you know dang well you'd be on the phone with the hardware vendor for a refund / new system.

Well, since hundreds of millions of people use Skylake/Kaby Lake CPUs for 2 years now, and only now we learn about this, obviously this is not of the "system blue screens every day" variety but a very rare bug.

sqldba · on June 26, 2017

> obviously this is not of the "system blue screens every day" variety but a very rare bug

Not to be anal but we can't know this.

CJefferson · on June 25, 2017

How much more would you pay, or how much of a speed penalty would you take, for bug-free CPUs?

baq · on June 25, 2017

nobody has, but it ain't easy.

wyager · on June 25, 2017

This seems like precisely the sort of thing that a competent manufacturer should rule out formally. Formal verification of individual FUs isn't exactly ambitious...

I think we're getting to levels of complexity where the process Intel uses, with lots of different QA and testing teams doing their best to look for bugs, just isn't going to cut it. We need formally verified models transformed step-by-verified-step all the way down to the silicon. It's already feasible, with free tools, to formally verify your high-level model (using e.g. LiquidHaskell) and then transform this to RTL (using e.g. Clash). With Intel's QA/testing budget, it's well within reach to A) verify the transformation steps and B) figure out how to close the performance gap between machine-generated (but maybe slower) and hand-rolled (faster, but evidently wrong) silicon.

laydn · on June 25, 2017

I've done formal verification on several units of a microprocessor, and I can assure you, formal verification on individual FUs is very, very ambitious (think impossible) for a modern microprocessor.

For example, you can not possibly formally verify the fetcher unit on its own, because the state space that you need to cover for several cycles for all the module inputs and outputs is beyond the capability of any formal verification tool.

Typically, you run formal verification on sub-blocks of sub-blocks of FUs.

For this particular bug, it looks like multiple functional units are involved, so it might have been missed by formal verification.

wyager · on June 26, 2017

> because the state space that you need to cover for several cycles for all the module inputs and outputs is beyond the capability of any formal verification tool.

Then use a proof methodology that doesn't require exhaustive enumeration. This objection is actually fairly alarming to me; perhaps there is a larger disconnect between industrial formal techniques for hardware and software than I thought. Strings also occupy a "large state space", but this obviously doesn't prevent us from doing formal verification on functions over strings.

Confusion · on June 26, 2017

  Then use a proof methodology that doesn't require exhaustive enumeration.

You may as well have said "then you use a magical tool that doesn't exist".

  Strings also occupy a "large state space", but this 
  obviously doesn't prevent us from doing formal 
  verification on functions over strings.

What is your reason to think this is 'obviously' the case?

wyager · on June 26, 2017

> You may as well have said "then you use a magical tool that doesn't exist".

Ok, so what this tells me is that you're not aware of what modern verification techniques look like.

There's not really any one resource I can point you to, but take a look at these links. I've used these or similar technologies personally, but there are others I haven't used.

https://en.m.wikipedia.org/wiki/Intuitionistic_type_theory

https://en.m.wikipedia.org/wiki/Homotopy_type_theory

https://leanprover.github.io/about/

http://goto.ucsd.edu/~rjhala/liquid/liquid_types.pdf

> What is your reason to think this is 'obviously' the case?

Because I've done this, and anyone who claims to be familiar with formal methods should be at least passingly familiar with all the things I mentioned.

Even if you aren't, if you've ever heard of SAT (a fairly universal CS concept) you should at least be familiar with the idea that non-exhaustive proofs are a thing.

Confusion · on June 26, 2017

  Ok, so what this tells me is that you're not aware of what 
  modern verification techniques look like.

Well, techniques are not tools. I am asserting that it is quite probable that no practically useful tool exists to non-exhaustively verify the state space laydn wants to cover.

Notwithstanding your example that it's possible to non-exhaustively verify some functions on strings. There is a quite some distance between that to verifying just any function.

  Because I've done this,

Sure, but how does that make it obvious to laydn that his state space is coverable? And what makes it so obvious to you that his state space is coverable? After all, the 'largeness' can have different sources, including those that make non-exhaustive methods infeasible.

I think this may be an example of the disconnect between research and the industry. Researchers say things are solved when they have shown something is possible and published about it. They feel it's then up to the industry to extrapolate, while research moves on to exciting new and greener pastures. Meanwhile the industry thinks the results are too meager, thinks the extrapolation involves a lot of technical difficulties and generally is not willing to spend enough money on what they cannot see as anything but a longshot.

If you do any model checking at all, with tools like SPIN or TLA+, you are already in the state-of-the-art minority in industry.

jlouis · on June 26, 2017

I've had great success (in software) with randomized/probability based testing. You don't walk the full state space but generate subsets randomly to walk.

If done naively, it doesn't work because you are unlikely to hit the problematic bug. So you have to be clever: do not generate inputs uniformly! Generate those inputs which are really really nasty all the time. If you find a problem, then gradually simplify the case until you have reduced the complexity to something which can be understood.

With some experience for where former bugs tend to live, I've been able to remove lots of bugs from my software in edge cases via this method. (See "QuickCheck").

lorenzhs · on June 25, 2017

So it's okay for software to have bugs that get fixed (I think everybody here acknowledges that software will always have bugs), but Intel isn't allowed to have issues in their processors, even if they can fix them with a software (microcode) update?

userbinator · on June 25, 2017

In general, people expect hardware to be far more bug-free than all the software which depends on it. I'm disgusted enough at the current trend of continuously releasing half-baked software and "updating" it to fix some of the bugs (while introducing new bugs in the process); and yet it seems to be spreading to hardware too. From the perspective of formal verification, buggy hardware is like an incorrect axiom.

but Intel isn't allowed to have issues in their processors, even if they can fix them with a software (microcode) update?

They could've caught and fixed this one with some more testing, before actually releasing. Formal verification isn't necessarily going to help, if it's a statistical type of defect --- thus the concept of wafer yield, and why dies manufactured from the exact same masks can behave completely differently with respect to the clock speeds attainable and voltages required.

Cyph0n · on June 25, 2017

Formal verification cannot be used to fully verify an entire Intel CPU, simply because the computational resources required to do so are astronomical. Heck, the cache coherence unit alone is basically impossible to verify.

Therefore, CPU designers verify the important units (e.g., the ALU) independently of each other, and then try to verify their interaction. But a system level design verification check is simply not possible.

Obviously, faults that result from fabrication can still take place, so they are tested for using BIST and JTAG. Test coverage can be pretty high, but obviously not 100%.

As you can see, there are still a ton of places where hardware errors can seep through.

lorenzhs · on June 25, 2017

Expectations of hardware are indeed higher than for software, but isn't it a valid trade-off for Intel to make as long as most of the issues can be fixed with a microcode update? On Windows, it's updated via Windows Update, and on most Linux distributions, a package exists for CPU microcode. Most users have an easy—if not automatic—way to obtain the updates.

I believe the main issue with formally verifying everything is that reasoning about parallel code is extremely hard and might well make the entire endeavour unattainable. It would be great to have formally verified CPUs, though.

> They could've caught and fixed this one with some more testing, before actually releasing.

Isn't this true of any bug?

sqldba · on June 26, 2017

Windows update applies CPU patches?!

I know they did AMT but that was a special case and different.

lorenzhs · on June 26, 2017

Seems like they do when it fixes serious bugs, a quick search yields e.g. https://support.microsoft.com/en-us/help/3064209/june-2015-i... ("This update fixes several issues with Intel CPUs that could cause computer crashes or functions incorrectly.")

ComputerGuru · on June 25, 2017

If this "fix" ends up disabling HT, how can I get a refund not just for the CPU but for the $3k laptop I spec'd around it? Without needing to sue?

lorenzhs · on June 25, 2017

That's just the workaround until a fix is available for your CPU, if I understood correctly.

ComputerGuru · on June 25, 2017

No, that's not what I meant. In the past, Intel's microcode updates have simply disabled the broken functionality as an official solution.

http://techreport.com/news/26911/errata-prompts-intel-to-dis...

lorenzhs · on June 26, 2017

You specifically said the fix might be to permanently disable HT.

The TSX situation was indeed unfortunate, I have one of the affected CPUs. But it was a new feature that was broken, not something that used to work, so the bug's impact was less serious, and disabling the feature didn't have too much of an impact.

brianwawok · on June 25, 2017

Contact your laptop OEM.

voltagex_ · on June 25, 2017

Have you ever had to do this?

The Dell XPS 9350 (Skylake) has numerous issues with GPU noise, USB-C compatibility, wifi reliability - Dell don't care unless your consumer laws are strong enough to make them care.

brianwawok · on June 26, 2017

No but dell had a 30 day full refund policy. Did those bugs not start until day 31?

voltagex_ · on June 26, 2017

TBH, everything was covered by warranty other than the coil whine and wifi - by sticking to one particular driver version you can avoid many of the wifi issues. Not everyone has had the USB-C problems and it's gotten a little better with the most recent updates.

kutkloon7 · on June 25, 2017

Well, if a software bug leads to data corruption it is not really "allowed" as well. It is quite amazing that there are not more bugs like this. Intel spends an insane amount of money on very large teams of very smart people to work on these insanely complicated systems.

I don't think this is sustainable, and I think that massive arrays of relatively simple processors. First, we will need a culture shift that learns programmers to program concurrent programs from the start, and this will take a lot of time (because technology moves a lot faster than culture).

sqldba · on June 26, 2017

It would be nice if they allowed users to update rather than vendors. I have a mobile Xeon. I don't know if it's affected or not.

throwaway2048 · on June 25, 2017

There are hardware bugs that are impossible to fix in microcode. What then?

baq · on June 25, 2017

you make workarounds in software, obviously.

userbinator · on June 25, 2017

...or recall all the hardware and replace it:

https://en.wikipedia.org/wiki/Pentium_FDIV_bug

coldtea · on June 25, 2017

There will always be. What then? We stop producing cpus and wait for some magical new technology that prevents all bugs?

throwaway2048 · on June 25, 2017

Its why you can't treat hardware bugs and dismiss them flippantly like software ones.

AsyncAwait · on June 25, 2017

It's funny, since many of today's software bugs are due to the legacy of C, faults of which can often times be attributed to mimicking the behaviour CPU hardware too closely.

EDIT: My wording was poor, but what I meant was that it's not like hardware vendors do not make bugs happen due to deficiencies in their design, which are not only tolerated, but often reflected in software, which runs counter to OP saying that apparently hardware vendors are not allowed to have bugs, but the software ones are:

> So it's okay for software to have bugs that get fixed (I think everybody here acknowledges that software will always have bugs), but Intel isn't allowed to have issues in their processors

blahman2 · on June 25, 2017

Go back to rust-land now :D

AsyncAwait · on June 25, 2017

It was meant as a response to OP basically saying that hardware vendors make way fewer bugs that the software ones, which is true, but many bugs of software can arguably be attributed to hardware design, so it's not like they're without fault. Rust has nothing to do with it, not sure why even bring it up.

blahman2 · on June 25, 2017

I get it - sorry, meant to playfully Sunday afternoon troll you.

Ever since I moved from C to Java (I like low level stuff but the company I joined is such) I have been having one of three major problems - logic bugs, too many frameworks, causing bugs due to lack of in depth understanding of each one of them, and GC bottlenecks. Of all of them, I hate the GC ones the most.

bsder · on June 25, 2017

Formal verification of a decode unit simply will never happen. The complexity is far too high.

Formally verifying something like a multiplier block is difficult but doable if you care. Formally verifying an FPU is probably at about the limit.

If you want formal verification, you would have to simplify a modern microprocessor a lot.

wyager · on June 26, 2017

That's the whole point of doing a provably correct transformation from a simpler model domain. Even complicated OOOE engines are fairly tractable if you don't try to implement them at an RTL level...

Perhaps we are imagining different formal verification methodologies. Can you tell me what kind of formal verification you're referring to?

qb45 · on June 25, 2017

> I wonder how many users have experienced intermittent crashes

I wonder if it's exploitable ;) Maybe that's why they never release the details of these CPU bugs.

> Was it a software-like bug in microcode e.g. neglecting some edge-case, or a hardware-level race condition related to marginal timing

Not sure about microcode, these x86 cores execute many simple operations natively, by means of dedicated circuits. Microcode is only involved in emulation of complex x86 instructions.

And hardware problem doesn't have to be marginal timing. It could simply be a logic bug, i.e. the circuit operates as designed but it was designed to do something else than it should be doing in some unforeseen circumstances.

rpcope1 · on June 25, 2017

I feel like a lot of the processors Intel has released recently that have had problems like this. Intel's Bay Trail processors like the Celeron J1900 have a huge problem around power state management (https://bugzilla.kernel.org/show_bug.cgi?id=109051) that's unlikely to ever get resolved and makes those processors almost unusable under a lot of conditions (random hard hangs on systems without watch dog timers really kind of sucks). I wonder if Intel has been more lax recently with how the systems get tested?

ibic · on June 26, 2017

I've no knowledge about IC design, but it sounds to me that even the biggest name in CPU industry doesn't (or do they ever) do formal verification? Is the process like when I'm writing some mediocre code and say to myself: "hmm, it probably works", and throw the bunch into the version control (whereas they throw it to the wafer fab)?

skrebbel · on June 26, 2017

> * I would recommend demoscene productions, cracktros, and even certain malware*

Modern PC demoscene productions don't really do very funky things CPU-wise anymore. Most run mostly in shaders, actually. Amiga and C64 is a different story, but Intel isn't making that many Amiga CPUs :-)

theGimp · on June 25, 2017

  The issue was being investigated by the OCaml community since
  2017-01-06, with reports of malfunctions going at least as far back as
  Q2 2016.  It was narrowed down to Skylake with hyper-threading, which is
  a strong indicative of a processor defect.  Intel was contacted about
  it, but did not provide further feedback as far as we know.
 
  Fast-forward a few months, and Mark Shinwell noticed the mention of a
  possible fix for a microcode defect with unknown hit-ratio in the
  intel-microcode package changelog.  He matched it to the issues the
  OCaml community were observing, verified that the microcode fix indeed
  solved the OCaml issue, and contacted the Debian maintainer about it.
 
  Apparently, Intel had indeed found the issue, *documented it* (see
  below) and *fixed it*.  There was no direct feedback to the OCaml
  people, so they only found about it later.

Inexcusable.

lorenzhs · on June 25, 2017

They forgot to follow up on a support ticket. As your quotation mentions, the issue was documented and fixed. Calling that "inexcusable" is a bit strong, don't you think?

I'm not a particularly big fan of Intel's practices, but the reactions in this thread seem a bit too strong to me.

aeleos · on June 25, 2017

It wouldn't be a big deal if this was just software, but the fact that Intel allowed a PROCESSOR bug to be reported, tested, and fixed without telling anyone that the bug actually exists is honestly horrible. You can't just let CPU bugs under the run since it can throw the stability and reliability of the entire system into question. The people that reported it shouldn't have to dig through Intel microcode updates and test different fixes to see if the bug they found was fixed. Hardware manufacturers (especially processor manufacturers) need to be held to a higher standard when it comes to bug reporting and this kind of behavior really has no excuse.

lorenzhs · on June 25, 2017

> Intel allowed a PROCESSOR bug to be reported, tested, and fixed without telling anyone that the bug actually exists

That's just not true. Intel published the erratum in April: https://www3.intel.com/content/dam/www/public/us/en/document... - search for SKL150. It was also clearly noted in Debian's intel-microcode changelog on May 15: http://metadata.ftp-master.debian.org/changelogs/non-free/i/...

They absolutely should have followed up to the OCaml people's support ticket. But sloppy followup is an issue that every large project encounters.

leovonl · on June 25, 2017

It's an excuse that is commonly given - but easily avoidable if you care enough about the ones that pay your salary by buying processors from your company.

And not only that, but they did much more than the average Joe and essentially pin-pointed the issue for them. So yeah, inexcusable it is.

lvh · on June 25, 2017

Do we have any evidence that that's where Intel even learned about the bug?

gus_massa · on June 25, 2017

I agree. It's totally possible that Intel found the bug in another bug report, fixed it, closed that bug report, and never realized that the OCaml bug was related.

d33 · on June 25, 2017

What exactly do you find inexcusable here?

mjw1007 · on June 25, 2017

I don't mind Intel keeping very quiet about fixed-in-microcode bugs that don't directly affect valid userspace programs, like

« Instruction Fetch May Cause Machine Check if Page Size and Memory Type Was Changed Without Invalidation »

or

« Execution of VAESIMC or VAESKEYGENASSIST With An Illegal Value for VEX.vvvv May Produce a #NM Exception »

but something like this should be announced clearly.

(I keep my microcode packages up to date, but I don't normally bother rebooting when an update comes in.)

ncr100 · on June 25, 2017

It was reported to Intel who fixed it and did not reply to the reporter.

theGimp · on June 25, 2017

The contempt for users. I know what I do when a user files a real bug: respond to them, acknowledge it's a problem, tell them when it's fixed.

The fact that Intel does not do that with a bug of this magnitude shows how much respect they have for their users.

wyldfire · on June 26, 2017

It's totally plausible that Intel detected this bug independently with their own verification effort or through another customer. Matching different defect reports when "unexplained" or nondeterministic behavior is the expected result can be challenging.

hueving · on June 26, 2017

'Contempt' is far too dramatic of a word. I think what you mean is 'indifference'.

coldtea · on June 25, 2017

>Inexcusable

What, the "found the issue, documented it, and fixed it" part?

toolslive · on June 26, 2017

We had been seeing this one on and off on some of our machines, and were already at least mentally pointing the finger to the LWT library. Turns out these machines were affected. That's one less worry.

fotcorn · on June 25, 2017

The latest intel-microcode package from Ubuntu 16.04 does not fix the problem. I installed the same package from Ubuntu 17.10 [0] which fixes the problem. You can check your system with the script linked in the mailing list thread [1].

[0] https://packages.ubuntu.com/en/artful/amd64/intel-microcode/...

[1] https://lists.debian.org/debian-devel/2017/06/msg00309.html

arde · on June 25, 2017

Indeed, the latest Intel microcode published for Ubuntu 16.04 is the ancient 20151106 [1]. Later Ubuntu releases do have more recent microcode packages [2]. I cannot understand why they left out 16.04 there. So much for LTS, it seems.

This recently came to my attention while debugging some increasingly frequent lockups, which took me a solid week of eliminating all seemingly more likely causes (VirtualBox, nVidia driver, faulty RAM, etc). In the end I found the culprit while digging into the Intel Specification updates: my Core i7-5820K (and most other Haswell-E and Broadwell processors) has a bug when leaving package C-states, and the only workaround is to disable C-states above level 1. Timely updated microcode, which applies this workaround, would have saved me my week.

[1] https://launchpad.net/ubuntu/xenial/+source/intel-microcode [2] https://launchpad.net/ubuntu/+source/intel-microcode/+change...

rlpb · on June 26, 2017

> Indeed, the latest Intel microcode published for Ubuntu 16.04 is the ancient 20151106.

By ancient, perhaps you mean the version that was current at the time 16.04 shipped?

> I cannot understand why they left out 16.04 there. So much for LTS, it seems.

See https://wiki.ubuntu.com/StableReleaseUpdates. The point of an LTS (or any stable release, for that matter) is that it doesn't change by default. For those who want to keep everything up-to-date, Ubuntu ships a new release on a six month cadence. If you choose not to use that, then you shouldn't be surprised when things aren't updated, since that's exactly what you opted in to.

The microcode package may warrant an exception, however, and we have a bug to track that. It's tricky because without the source we cannot pick apart what changed, or determine whether any changes meet our update policy. We have to be careful. Sooner or later some user will inevitably come along to tell us that a microcode update broke things, and ask why we didn't fulfill our LTS promise by not changing it.

arde · on June 26, 2017

By ancient I mean that it is 1.7 years old and by now Intel has published 4 later releases (20160607, 20160714, 20161104 and 20170511). I think it's a fair and tame adjective considering this is processor microcode we're talking about and that it made me spend a week of my own time hunting it.

You say Ubuntu has a bug to track the microcode package as an exception, but that doesn't seem to be having a positive effect, does it? Precisely because Ubuntu cannot pick it apart, what is it that they're trying to decide in the bug? Why is Ubuntu second guessing Intel in deciding which microcode update to apply and which to skip? How would Ubuntu know that better than the manufacturer? The Intel specification updates list tons of processor bugs including some very critical ones, so we know the microcode updates do help with some of those. When was the last time that an Intel microcode update brought a new bug or made something worse? I'm not aware of any such instance, and although that may indeed happen sometime it doesn't seem as likely as facing existing known bugs, right?

I think it could be argued that it is up to the user to decide (say, a warning during installation), or that Ubuntu could choose to apply all microcode updates by default and let the user opt out. Ubuntu might impose a certain delay, say a month or two at most, in order to see if a microcode update gets withdrawn or ends up too buggy. But I don't think Ubuntu could reasonably choose to skip all microcode updates for 1.7 years like it did in my case, or to choose which ones to apply and which ones to skip, like it seems to be trying to. Microcode should be treated like other propietary software, but with special dilligence due to its criticality. If nVidia says a particular driver release is very buggy and should be updated, Ubuntu promptly updates it. Why would Ubuntu sit on known critical microcode updates then? If, and it's really a big if, eventually some microcode update brings a new bug and Ubuntu deployed it, it would be Intel's fault and not Ubuntu's.

rlpb · on June 27, 2017

Why is this Ubuntu's sole responsibility? Can you not get a UEFI firmware update from your vendor?

> I think it's a fair and tame adjective considering this is processor microcode we're talking about...

Processor microcode updates haven't, to my knowledge, ever automatically been applied by distributions in the past. In light of that, I don't see how it's reasonable to have an expectation otherwise.

> You say Ubuntu has a bug to track the microcode package as an exception, but that doesn't seem to be having a positive effect, does it?

By being careful before pushing out an update to millions of users? I'd say that's a positive effect.

> ...what is it that they're trying to decide in the bug?

Whether to continue to let users have a choice, or by taking that choice away by doing things automatically for them. There are also packaging-based regressions to consider; not just the microcode ones. For example: if the wrong microcode is applied to the wrong processor because of a packaging error, who would you be blaming? Intel or Ubuntu?

> But I don't think Ubuntu could reasonably choose to skip all microcode updates for 1.7 years like it did in my case...

Ubuntu didn't "choose to skip" all microcode updates. Ubuntu didn't choose at all; pushing an update requires a specific effort.

In light of this issue, Ubuntu is now considering what to do about it, responsibly, for all users. Both for this particular issue, and for microcode updates going forward.

arde · on June 28, 2017

I don't think yours is a serious answer, what I said already refutes your points. Ubuntu has already made me waste a week and I don't want to waste any more, particularly when you can't or won't listen to what I say and look for irrelevant excuses like the availability of UEFI vendor updates. Just FYI Intel provides these updates for very good reasons and RHEL/CentOS/Fedora have been providing processor microcode updates for ages now and with relatively frequent updates (see the microcode_ctl RPM changelog). Bye now.

rlpb · on June 28, 2017

> RHEL/CentOS/Fedora have been providing processor microcode updates for ages now and with relatively frequent updates (see the microcode_ctl RPM changelog)

I wasn't aware of this, thanks. Though I searched, and I found that they've been causing their users problems by doing so: https://rhn.redhat.com/errata/RHBA-2017-0028.html

I think this backs up my point: care must be taken.

lmm · on June 27, 2017

> Why is Ubuntu second guessing Intel in deciding which microcode update to apply and which to skip? How would Ubuntu know that better than the manufacturer?

The whole point of an LTS release is that they will keep a stable baseline for all the packages they're distributing and apply a certain amount of integration testing. If you believe upstream knows best (and I'm not saying you're wrong to do so) then why use LTS in the first place?

arde · on June 28, 2017

> The whole point of an LTS release is that they will keep a stable baseline

I think you're confusing an objective (LTS) with a tactic (keeping a stable baseline). Certainly the whole point of LTS is not to keep a stable baseline, but to provide long-term support. And that is clearly violated when Ubuntu chooses to not provide support when it is known to be needed (e.g., listed in an Intel Spec update) and a solution is available by a vendor (e.g., Linux-specific microcode update being made available by Intel). The whole point of LTS is to avoid the bleeding edge while fixing known bugs. Microcode updates is not bleeding edge, it's just patches for known bugs.

lmm · on June 28, 2017

> the whole point of LTS is not to keep a stable baseline, but to provide long-term support.

No, you've got it backwards. If you just wanted long-term support you'd use a rolling release distribution, of which there are any number (yes some rolling release options are "bleeding edge", but there are stable options too). The whole point of LTS releases is that they are stable baselines that are supported in the long term.

arde · on June 29, 2017

If LTS' sole point were to keep a stable baseline, then that point would only be for the benefit of some short-sighted developer whose only interest were to keep his to-do list brief. Even a slightly more forward-looking developer would realize his work's point lies somewhere else, particularly when done in a corporate environment such as Canonical's and not as a hobby (but even when done as a hobby for a community project such as Debian). His work should be oriented towards goals somehow related to the costumer priorities, without whom it ceases to matter. Keeping a stable baseline is unrelated to any costumer or user interest.

The costumer or user of LTS wants, well, Support (fixes) for some extended time (Long Term) and avoiding problems (less bugs), perhaps at the cost of getting less new features. The whole point of LTS is self-explanatory.

Keeping a stable baseline is a tactic used to try to provide long-term support by reducing the amount of new bugs at the cost of less new features. But keeping a stable baseline does not attain the key word, the substantive in LTS: Support. Support is provided only by introducing fixes, and that is exactly what has been omitted in the case of the Intel microcode bugs.

edwinksl · on June 25, 2017

Launchpad bug report for this: https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+b...

gt565k · on June 26, 2017

How do you check if the issue is actually fixed after installing the microcode fix?

matrixagent · on June 26, 2017

How do you install the update from that version?

pedrocr · on June 25, 2017

Here's how to fix it on a Thinkpad on Linux. I've got a T460s and checked with the script[1] that it was indeed affected. The Debian instructions said to update your BIOS before updating the microcode package so I went to the model support page[2] to the BIOS/UEFI section and downloaded the "BIOS Update (Bootable CD)" one. The changelog included microcode updates so it looked promising[3]. To get the ISO onto a usb drive I did the following:

  $ geteltorito n1cur14w.iso > eltorito-bios.iso # provided by the genisoimage package on Ubuntu
  $ sudo dd if=eltorito-bios.iso of=/dev/sdXXX # replace with your usb drive with care to not write over your disk

I then had a bootable USB drive that I ran by rebooting the computer, pressing Enter and then F12 to get to the boot drive selection and selecting the USB. From then it's just following the options it gives you. It's basically pressing 2 to go into the update and then pressing Y and Enter a few times to tell it you really want to do it. After that just let it reboot a few times and the update is done. After booting again the same test script[1] now said I had an affected CPU but new enough microcode.

[1] https://lists.debian.org/debian-user/2017/06/msg01011.html

[2] http://pcsupport.lenovo.com/pt/en/products/laptops-and-netbo...

[3] https://download.lenovo.com/pccbbs/mobiles/n1cur14w.txt

tyingq · on June 25, 2017

There's a perl script on the debian mailing list that digs a bit deeper and tells you if you're affected in the first place, if you're affected but patched already, affected but have HT disabled, etc.

https://lists.debian.org/debian-user/2017/06/msg01011.html

fattire · on June 26, 2017

I ported this to bash since I have a chromebook w/o perl and (as for right now) the fs is read-only, so I just piped the script to it and sure enough my brand-new Samsung Chromebook Pro appears to be vulnerable, though apparently patchable.

Details and if you want the script I link to it from here: https://forum.xda-developers.com/hardware-hacking/chromebook... - don't judge my shitty bash skills.

ft

Syzygies · on June 25, 2017

In my experience with parallel code written in Haskell, hyper-threading offers only a very mild speedup, perhaps 10%. It is essentially an illusion, a logical convenience. (How long does it take to complete a parallel task on a dedicated machine? Four cores with hyper-threading off has nearly the performance of eight virtual cores with hyper-threading on.)

Many people have neither the interest nor the hardware access to overclock, and these processors have less overclocking headroom than earlier designs. Nevertheless, the hyper-threading hardware itself generates heat, restricting the overclocking range for given cpu cooling hardware. In this case, turning off hyper-threading pays for itself, because one can then overclock further, overtaking any advantage to hyper-threading.

barrkel · on June 25, 2017

It depends on what resources your code uses on-chip. If all threads are contending on the same resources, then you won't see a speedup; if they're using different resources, hyperthreading can increase throughput significantly. I've seen hyperthreading give me the equivalent of 50% of another CPU, particularly when I'm running multiple CPU-bound processes concurrently (so they're not executing the same code at the same time in some kind of parallel operation, and certainly aren't bound on synchronization primitive overheads).

Syzygies · on June 25, 2017

That makes sense. I'm a mathematician, and my experience is with pure computations, homogeneous across each (virtual) core.

RaleyField · on June 26, 2017

>It is essentially an illusion, a logical convenience.

Just checked numbers. That was my expectation as well until I came across code that experienced a bit over 80% speedup when HT was used.

esaym · on June 25, 2017

It normally helps "some", and rarely hurts performance. So you might as well enable it.

ploxiln · on June 25, 2017

I worry about hyperthreading hurting worst-case latency (since a thread might be assigned to run on a virtual core which does not work as fast as expected).

KayEss · on June 26, 2017

I've looked for this and I see little evidence that worries me (this on Linux). The kernel seems to schedule first on the real primaries and then on the secondaries within a pair, and the kernel is also not all that shy about moving a thread to a different CPU.

mjw1007 · on June 25, 2017

It's painful to have to read text like « select Intel Pentium processor models ».

If Intel used marketing names that were more closely related to technical reality, then when something like this happens they wouldn't have so many customers finding themselves in the "maybe I'm affected by this horrid bug" box.

ourcat · on June 25, 2017

So will this be affecting most Macbook Pros of the past few years?

If so, there's a way to disable hyper-threading, but you need Xcode (Instruments).

Open Instruments. Go to Preferences. Choose 'CPU'. Uncheck "Hardware Multi-Threading". Rebooting will reset it.

skm · on June 25, 2017

If you're on a Mac, you can find the exact processor model number by opening Terminal.app and entering

  sysctl -n machdep.cpu.brand_string

this will return something like

  Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz

(Found on http://osxdaily.com/2011/07/15/get-cpu-info-via-command-line... )

DamonHD · on June 25, 2017

Exactly the same output for me. Assuming that's not the only hardwired output from that command, is your password also hunter2? B^>

yborg · on June 25, 2017

This is kind of like cutting off your leg because of a hangnail. I've been running a Skylake MBP for more than 6 months for compilation workloads and haven't seen a single processor hang.

I'm much more annoyed by the completely unpredictable desktop assignment on monitors when hotplugging DisplayPort connections on multiple displays. This one bothers me every day.

richdougherty · on June 25, 2017

> haven't seen a single processor hang

If there was data corruption you might not notice.

codezero · on June 26, 2017

That's weird re monitor issues. I've been impressed with how consistent mine are.

What I see:

Same monitor/monitors plugged into same ports produce consistent configs.

I get a unique config per monitor/port.

ourcat · on June 26, 2017

I agree. I've been fine myself. But if people feel the need to turn it off. ;)

rayiner · on June 25, 2017

No. Even though Skylake came out in 2015, Apple used an older CPU (Haswell) until the October 2016 MacBook Pros.

reaperhulk · on June 25, 2017

They used Skylake in late 2016 MBPs and Kaby in the 2017 models.

isatty · on June 25, 2017

I'm on a Kaby

    Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz

I haven't seen any erratic crashes yet on code compiled by LLVM 8.1.0 on multiple projects (ROS, Qt5). I hope a microcode fix is pushed by Apple on behalf of Intel soon.

pedrocr · on June 25, 2017

Are you from the future? Isn't LLVM currently at 4.0 and going for 5.0 soon?

sclangdon · on June 25, 2017

macOS comes with Apple's build of Clang, which has different versioning from "proper" Clang.

pedrocr · on June 25, 2017

That's going to be mess when LLVM proper catches up to those version numbers...

isatty · on June 25, 2017

It prefixes the version with Apple LLVM so I assume it is versioned differently.

rsynnott · on June 25, 2017

From one of the 2016 MacBook Pros:

> machdep.cpu.model: 78 > ... > machdep.cpu.stepping: 3 > ... > machdep.cpu.microcode_version: 174

Can't find if 174 is the fixed version or not.

So this is one of the models for which there exists a fix, as per the email.

deefcee · on June 26, 2017

I'd also like to know if 78 is the fixed version.

onli · on June 25, 2017

Rule of thumb: On a desktop, if you have an i5 you do not have Hyperthreading. All i3s and i7s do have Hyperthreading, as do new Kaby Lake Pentiums (G4560, 4600, 4620).

On laptops, some i5s are not real quad cores but dual cores with Hyperthreading.

decisiveness · on June 25, 2017

>Rule of thumb: On a desktop, if you have an i5 you do not have Hyperthreading. All i3s and i7s do have Hyperthreading, as do new Kaby Lake Pentiums (G4560, 4600, 4620).

Hmm...either this statement is wrong or this desktop /proc/cpinfo is wrong:

    $ grep -E 'model|stepping|cpu cores' /proc/cpuinfo | sort -u
    cpu cores	    : 4
    model           : 94
    model name	    : Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
    stepping	    : 3
    $ grep -q '^flags.*[[:space:]]ht[[:space:]]' /proc/cpuinfo && echo "Hyper-threading is supported"
    Hyper-threading is supported

Intel's product spec page[1] lists this CPU as not supporting Hyper-Threading so I'm a bit puzzled as to why the ht flag is present.

[1]https://ark.intel.com/products/88188/Intel-Core-i5-6600-Proc...

justinclift · on June 25, 2017

Hmmm, checking for "ht" seems to be giving weird info. On a i5-750 here (few years old), running Fedora 25:

    $ grep '^flags.*[[:space:]]ht[[:space:]]' /proc/cpuinfo
    flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
    dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts
    rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2
    ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid dtherm ida

"ht" is being returned even though the CPU only has 4 cores and no hyperthreading:

https://ark.intel.com/products/42915/Intel-Core-i5-750-Proce...

dmidecode seems to give more accurate info for this:

    $ sudo dmidecode -t processor | grep 'Count:'
    	Core Count: 4
    	Thread Count: 4

decisiveness · on June 25, 2017

It looks like dmidecode also contradicts itself with the hyper threading flag:

    $ sudo dmidecode -t processor | grep -E 'Flags:|HTT|Status|Count'
    	Flags:
		HTT (Multi-threading)
	Status: Populated, Enabled
	Core Count: 4
	Thread Count: 4

justinclift · on June 25, 2017

Hmmm... yeah that's showing the same on mine too. According to dmidecode, this CPU has hyperthreading.

decisiveness · on June 25, 2017

To quote the Intel Developer Instructions[1] on the HTT flag:

>A value of 0 for HTT indicates there is only a single logical processor in the package and software should assume only a single APIC ID is reserved. A value of 1 for HTT indicates the value in CPUID.1.EBX[23:16] (the Maximum number of addressable IDs for logical processors in this package) is valid for the package.

UPDATE: It appears these flags refer to each initial APIC ID, so it seems the HTT flag value should be 0 in all cases where the overall processor:thread ratio is 1, suggesting there might either be incorrect information in the CPUID instruction for some Intel CPUs or the kernel is not correctly evaluating CPUID.1.EBX[23:16].

Hopefully, someone more versed in CPUs can correct me here.

[1]https://www.intel.com/content/www/us/en/architecture-and-tec...

strmpnk · on June 25, 2017

cpuinfo is probably wrong in this case. Looking at the intel spec sheet: http://ark.intel.com/products/88188/Intel-Core-i5-6600-Proce...

my123 · on June 25, 2017

Do you have 8 threads?

age_bronze · on June 25, 2017

I would've expected at least an example assembly code reproducing the bug? How was it not discovered before, but only with the OCaml compiler? They say "unexpected behavior", does this mean that code compiled with this can give incorrect results? Can this have any security implication? How much code was compiled with similar patterns? Can the problem reproduced with any JIT compiler? We need to know what can cause this, maybe compiled and working code already contains such patterns waiting to be abused...

zzalpha · on June 25, 2017

Charming. I picked up a 5th Gen X1 Carbon configured with a Kaby Lake processor, and apparently there's no way to disable hyperthreading in the BIOS, and according to Intel's errata, no fix available yet.

Oh well... so far the machine (running Windows 10) has been stable minus one or two random lockups in 2 months of heavy usage which could be attributed to this. Guess I wait...

luckydude · on June 25, 2017

That's a really nicely done announcement. Simple, to the point, no drama, all the info you could want, scripts to figure out your processor, etc.

Well done Debian folks!

rwmj · on June 25, 2017

So if I understand correctly, some affected processors can be fixed by a microcode update, but there are some which cannot be fixed at all?

Also the advisory seems to imply that the OCaml compiler uses gcc for code generation, which it does not -- it generates assembly directly, only using gcc as a front end to the linker.

brianwawok · on June 25, 2017

Sounds like they can be fixed by the patch isn't out yet?

tom_mellior · on June 26, 2017

> advisory seems to imply that the OCaml compiler uses gcc for code generation, which it does not -- it generates assembly directly

Yes, but that assembly code contains calls into the OCaml runtime, for garbage collection etc. If I understand correctly, the particular loop affected by this bug was somewhere in this memory management code. That code is written in C and compiled with a C compiler.

dooglius · on June 25, 2017

It sounds like they can all be fixed by disabling hyperthreading.

libeclipse · on June 25, 2017

> fixed

mitigated

a3_nm · on June 25, 2017

Assuming the BIOS/UEFI has an option for it... my laptop doesn't seem to have one (and there are no updates to fix the bug either...).

bhouston · on June 25, 2017

What is the percentage of non fixable chips for Skylake and kabylake? Sometimes those early steppings are not widely distributed.

mschuster91 · on June 25, 2017

According to the mail, the systems "cannot be fixed" because they lack HyperThreading in the first place so there is no fix to apply.

ComputerGuru · on June 25, 2017

So, serious question: If the microcode "fix" for this ends up disabling HT, how does one get a refund not just for the CPU but for the $3k laptop I spec'd around it? Without needing to sue?

This isn't a hypothetical; what did Intel do when the only fix for broken functionality was to disable TSX entirely?

mixmastamyk · on June 25, 2017

I remember the pentium bug in the mid 90s, they actually shipped out replacement processors. Doubt that could be pulled off on laptops. Perhaps a microcode update can work around it.

senectus1 · on June 26, 2017

Given my Surface Book got a 1/10 repair-ability score on ifixit... I dont think they'll just replace the chip :-P

paulddraper · on June 26, 2017

The same for any part of your computer, I imagine.

What happens when your laptop's display has frequently broken pixels?

sqldba · on June 26, 2017

> What happens when your laptop's display has frequently broken pixels?

Well they wait for 6 or more before they say it's out of spec

herpderperator · on June 26, 2017

If anyone on Windows wishes to update their CPU microcode without waiting for Microsoft to push it out via Windows Update, you can use this tool from VMware https://labs.vmware.com/flings/vmware-cpu-microcode-update-d... which can update microcode as well.

Windows stores its microcode in C:\Windows\System32\mcupdate_GenuineIntel.dll which is a proprietary binary file and you can't simply replace it with Intel's microcode.dat file (which is ASCII text), so you have to use a third-party tool such as VMware's one.

Simply: 1. Download and extract the zip file in the first paragraph 2. Modify the install.bat file so that the line which reads `for %%i IN (microcode.dat microcode_amd.bin microcode_amd_fam15h.bin) DO (` only contains the microcode.dat parameter (since you obviously don't have an AMD CPU, and the tool is made for both) 3. Download and extract microcode.dat from Intel's website (https://downloadcenter.intel.com/download/26798/Linux-Proces...) and place it into the same directory as the VMware tool 4. Run install.bat with admin privileges 5. Hit cancel when it tells you that the AMD microcode files are missing, and you're done

The CPU microcode will be updated immediately (yes, while Windows is running.) The service will also run on each boot and update your CPU microcode, since microcode updates are only temporary and are lost each time you restart. You can check Event Viewer for entries from `cpumcupdate` to see what it has done. It's advised to run a tool to view the microcode version before installing (such as HWiINFO64) so you can re-run the tool after installing and confirming that the version has changed.

I have done this and it works as described. I went from 0x74 to 0xba as shown by the μCU field in HWiNFO64, and I have an i7-6700k.

wscott · on June 26, 2017

Has anyone benchmarked one of these machines before and after applying this microcode update? The options in microcode are rather limited and all are likely to have performance impacts. This is likely disabling functionally to avoid this case. I would hope the patch is smart enough to not apply if threading is not enabled, but who knows.

jakeogh · on June 28, 2017

Would a performance hit go unnoticed?

Usually (always?) it's not a ROM update, the encrypted microcode blob is loaded into the CPU by the OS on every boot via CONFIG_MICROCODE.

some linkrot: http://imgur.com/a/z1uLv

ncrmro · on June 25, 2017

Just got the 2017 no touchbar 13 macbook pro with the kaby lake i7. Should I be worried, can I even disable HT with mac. And presumably the update will be provided so the whole laptop is still ok?

I've been using the thunderbolt 3 dock with two external monitors and occasionally get a little glitch prolly loose cable I think.

I've downloaded the bitcoin blockchain, done quite a bit of work in pycharm + chrome, multiple projects, flow and webpack in the background and haven't had any sort of crashes tho.

isaac_is_goat · on June 25, 2017

Holy cow. Definitely feel like I dodged a bullet by building an AMD/Ryzen system this time around - which had it's own set of issues (but seem to be more or less ironed out now).

gbin · on June 25, 2017

This is not a fair comment: Ryzen had a crash that can be triggered by compiling with GCC and a memory compatibility issue where it cannot run them at their nominal speed. Ryzen is a really young architecture, it already had like 6 stable patches of microcode and you can expect way more.

Laforet · on June 26, 2017

Is there any published errata for Ryzen or other family 17h CPUs?

isaac_is_goat · on June 25, 2017

Still seems to be doing better than "turn off half of your processor, sorry no fix just buy a new one".

pvdebbe · on June 25, 2017

Hyperthreads are just that - threads. It won't be 50 % slower with HT disabled.

SXX · on June 25, 2017

Intel still charge extra $100 for it.