I wonder at what point the hardware fixes for these issues stop being worthwhile, and whether we'll see a resurgence of processors without speculative execution or any of these other speed-ups.
Ironically, a high-performance, general-purpose architecture without speculative execution might require a deep reinvestment in SMT. Instead of trying to speculatively make one thread fast to mask IO stalls, run a large pool of threads that can stall frequently but still keep the execution units and memory channels busy.
To avoid reintroducing these Spectre-like bugs, you'd have to conservatively design the per-thread execution to avoid those covert channels: not only synchronously enforcing all the logical ISA guarantees for paging and other exception states, but also using more heavy-handed tagging methods to partition the TLB, caches, etc. between separate protection domains.
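To make the tagging idea concrete, here's a minimal sketch in C (entirely hypothetical, just modelling the lookup logic, not any real microarchitecture) of a cache whose hit check includes a protection-domain ID, so one domain's lines never register as hits for another:

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy 4-way set-associative cache where each line remembers which
     * protection domain filled it. A lookup only hits when both the
     * address tag AND the domain ID match, so one domain can't probe
     * another domain's lines via hit/miss timing. */
    #define WAYS 4
    #define SETS 64

    struct line {
        bool     valid;
        uint64_t tag;        /* address tag */
        uint16_t domain_id;  /* owning protection domain */
    };

    static struct line cache[SETS][WAYS];

    bool cache_hit(uint64_t paddr, uint16_t domain_id)
    {
        uint64_t set = (paddr >> 6) % SETS;  /* 64-byte lines */
        uint64_t tag = paddr >> 12;

        for (int way = 0; way < WAYS; way++) {
            struct line *l = &cache[set][way];
            if (l->valid && l->tag == tag && l->domain_id == domain_id)
                return true;  /* hit only within the same domain */
        }
        return false;         /* other domains' copies stay invisible */
    }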
> Instead of trying to speculatively make one thread fast to mask IO stalls, run a large pool of threads that can stall frequently but still keep the execution units and memory channels busy.
Isn't something like that done for GPUs? They have the advantage of having a massive number of threads to execute. For CPUs, the number of runnable threads tends to be lower.
You can kiss any semblance of reasonable performance goodbye if you eliminate "speculative execution". Pipelining is the most basic tool in the toolbox. Even microcontrollers do it.
To give ballpark numbers: modern Intel processors can retire a few instructions per cycle in tight loops (4 is the theoretical maximum; > 2 is realistic in a lot of high-performance code). A branch misprediction wastes 10-15 cycles.
So getting rid of speculation entirely, and stalling on every branch, would waste time equivalent to dozens of instructions. On typical code that has a branch every few instructions, this could slow down execution by several times.
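As a rough illustration (a sketch, not a benchmark of any particular core): with unpredictable data, the branchy loop below pays the misprediction penalty constantly, while the branchless form turns the condition into ordinary data flow and gives the predictor nothing to get wrong:

    #include <stddef.h>
    #include <stdint.h>

    /* Branchy version: if the values are unpredictable, the branch
     * mispredicts often, and every miss throws away ~10-15 cycles of
     * speculatively issued work. */
    int64_t sum_big_branchy(const int32_t *data, size_t n)
    {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (data[i] >= 128)          /* hard-to-predict branch */
                sum += data[i];
        }
        return sum;
    }

    /* Branchless version: the condition becomes a data dependency
     * (compilers typically emit a conditional move or a mask), so
     * there is no branch to mispredict. */
    int64_t sum_big_branchless(const int32_t *data, size_t n)
    {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            int64_t keep = -(int64_t)(data[i] >= 128);  /* 0 or all ones */
            sum += data[i] & keep;
        }
        return sum;
    }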
The simplified SIMD cores in early GPUs had to fake branching to some extent for their virtual threads: every branch in the shader code would be tested for each virtual thread, and any thread (really just a vector lane) it didn't apply to would be masked out for the instructions of that branch. The GPU would run both sides of the branch, relying on the mask. It was workable, but very slow.
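In software terms, that run-both-sides-under-a-mask behaviour looks roughly like this sketch (a toy model in C, not any particular GPU's ISA):

    #include <stddef.h>

    #define LANES 8   /* one group of "virtual threads" (vector lanes) */

    /* Toy model of SIMD branch divergence: the hardware can't branch per
     * lane, so it evaluates the condition into a mask, runs BOTH sides
     * of the branch for every lane, and uses the mask to pick which
     * result each lane keeps. If lanes disagree, you pay for both paths. */
    void divergent_branch(const float x[LANES], float out[LANES])
    {
        int   take_then[LANES];
        float then_val[LANES], else_val[LANES];

        for (size_t i = 0; i < LANES; i++)
            take_then[i] = (x[i] > 0.0f);   /* per-lane condition */

        for (size_t i = 0; i < LANES; i++)
            then_val[i] = x[i] * 2.0f;      /* "then" path, all lanes */

        for (size_t i = 0; i < LANES; i++)
            else_val[i] = -x[i];            /* "else" path, all lanes */

        for (size_t i = 0; i < LANES; i++)  /* mask selects the result */
            out[i] = take_then[i] ? then_val[i] : else_val[i];
    }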
Pipelining isn't strictly the same thing as speculation, though, is it? If I have,
    add %rax, %rbx
    add %rcx, %rdx
I can pipeline those without needing to speculate on anything. If there is a dependency on a previous instruction, then we might have to speculate, but hopefully there is still some case for pipelining?
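As a sketch of the difference (plain C standing in for the assembly, assuming the compiler keeps the accumulators in separate registers): a dependent chain serializes the adds, while independent chains overlap in the pipeline with no speculation involved:

    #include <stddef.h>

    /* Dependent chain: every add needs the previous result, so the adds
     * cannot overlap in the pipeline; throughput is bounded by latency. */
    long sum_serial(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];              /* depends on the previous iteration */
        return s;
    }

    /* Two independent chains: like the two adds above, these have no
     * dependency on each other, so both can be in flight at once with
     * no speculation required. */
    long sum_two_chains(const long *a, size_t n)
    {
        long s0 = 0, s1 = 0;
        for (size_t i = 0; i + 1 < n; i += 2) {
            s0 += a[i];             /* independent of the next add... */
            s1 += a[i + 1];         /* ...so they overlap in the pipe */
        }
        if (n & 1)
            s0 += a[n - 1];
        return s0 + s1;
    }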
Have any of these bugs been based purely on speculation, or does it always involve speculating across privilege boundaries? (Although I feel like even the former isn't safe, e.g., if you're in some form of VM attempting to maintain privilege separation.)
It's related. If you want decent performance with pipelining, you're going to want to speculate at least a bit -- assume that FP math doesn't trigger exceptions, assume that you predicted branches correctly, assume that memory accesses don't fault, etc.
Intel does more speculation, but you won't find anything beyond the tiniest embedded CPUs that doesn't do any.
E.g. this new one can only be reproduced on Intel, not on AMD or ARM.
If you want to ban speculative execution for everything, you need to make the case that it's a fundamental issue and not an implementation-specific one.
Right now, that's not the case for many of these vulnerabilities.
As I understand it, the Intel-only vulnerabilities (Foreshadow/L1TF, and this new set, whose details I haven't looked at yet) target specific Intel features, and there's no reason to believe a similar focus on the other companies' products wouldn't also find unique problems.
For example, the first version of Foreshadow went after the SGX enclave. Given how widespread Meltdown and Spectre bugs are, there's absolutely no reason to believe that the other vendors don't have similar unique problems.
As you say, only the first Foreshadow attack went after SGX - it turned out to be a broader flaw that also affected OS page table protections more generally and could be used to attack process-OS and VM-hypervisor isolation. Those variants relied only on Intel's implementation of standard x86 paging, and they don't exist on AMD because they didn't implement it in the flawed way Intel did. That is, Foreshadow/L1TF is Intel-only not because it relies on an Intel-only feature, but because it's an Intel-specific implementation flaw. (Linux had to substantially rework its paging code to work around this.)
AMD don't seem to have commented on ZombieLoad yet, presumably because it's much newer and they didn't have pre-announcement info about it. But they have commented on the other two vulnerabilities announced today and explained that the reason they're not vulnerable is that the corresponding units in their CPUs don't allow speculative data access unless the access checks pass, and their whitepaper seems to suggest the same is true of ZombieLoad: https://www.amd.com/system/files/documents/security-whitepap...
SGX does make for an easier and flashier demo for Foreshadow, though, so it makes sense that the researchers went after that target. They managed to recover the top-level SGX keys that all SGX security and encryption on the system relies on, something that I don't think anyone had ever managed before.
Also, as I've said elsewhere, Intel seems to speculatively leak data that shouldn't be accessible pretty much everywhere in their designs where memory is accessed.
" and there's no reason to believe a similar focus on the other companies' products wouldn't also find unique problems."
Sure there is. Just like in the first round last year, Intel totally threw AMD under the bus to save face and protect its stock price. That is literally the reason to mention AMD: to keep their stock price from crashing.
The industry will probably get dragged, kicking and screaming, into using tagged pointers. CPUs could then use that information to put a safe lid on speculative execution (see the sketch below).
And it will be tough, as no compiler supports it; moreover, C/C++ were architected from the beginning not to bother with runtime information about object types/sizes.
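As a rough sketch of what tagged pointers could look like at the software level (just packing a tag into the unused upper bits of a 64-bit pointer, in the spirit of ARM's top-byte-ignore/MTE; all names here are hypothetical, not an existing compiler feature):

    #include <stdint.h>

    /* Current 64-bit systems leave the top bits of user-space pointers
     * unused, so a small type/ownership tag can live there. Hardware
     * that understood the tag could refuse to forward speculatively
     * loaded data whose tag doesn't match the access. */
    #define TAG_SHIFT 56
    #define TAG_MASK  ((uintptr_t)0xff << TAG_SHIFT)

    static inline void *tag_ptr(void *p, uint8_t tag)
    {
        return (void *)(((uintptr_t)p & ~TAG_MASK)
                        | ((uintptr_t)tag << TAG_SHIFT));
    }

    static inline uint8_t ptr_tag(const void *p)
    {
        return (uint8_t)((uintptr_t)p >> TAG_SHIFT);
    }

    static inline void *strip_tag(void *p)
    {
        return (void *)((uintptr_t)p & ~TAG_MASK);
    }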
Ultimately, if we do transparent per-process memory encryption, then we can let the CPU do all the speculation it wants, but the result will be gibberish. And it's a lot easier to do a simple key switch than a full TLB flush. Of course, it probably doesn't do much about the timing attacks (side channels).
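A toy model of the idea (an XOR keystream standing in for whatever real cipher the hardware would use; purely illustrative, not how any shipping scheme works): data is stored under the victim's key, so anything leaked into another process and decoded with that process's key comes out as garbage:

    #include <stddef.h>
    #include <stdint.h>

    /* Toy stand-in for transparent memory encryption: a per-process key
     * is XORed over the stored bytes. Speculation can leak the stored
     * bytes, but another process only sees them through its own key. */
    static void xcrypt(uint8_t *buf, size_t n, uint64_t key)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] ^= (uint8_t)(key >> ((i % 8) * 8));
    }

    int main(void)
    {
        uint64_t victim_key   = 0x1122334455667788ULL;
        uint64_t attacker_key = 0xa1b2c3d4e5f60718ULL;

        uint8_t stored[16] = "top secret data";     /* victim's data    */
        xcrypt(stored, sizeof stored, victim_key);  /* stored encrypted */

        uint8_t leaked[16];
        for (size_t i = 0; i < sizeof leaked; i++)
            leaked[i] = stored[i];                  /* speculative leak  */
        xcrypt(leaked, sizeof leaked, attacker_key);/* wrong key: junk   */

        return 0;
    }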
My guess is that the performance loss from removing these features would make such CPUs less economical than strictly enforced separation between security domains on a hardware assignment and scheduling level. That is, just forget about having the same server run stuff from different contexts at the same time.