A cache invalidation bug in Linux memory management (googleprojectzero.blogspot.com)
312 points by known on Sept 30, 2018 | 82 comments



4 billion was a lot when I first got into programming, but it just isn't that much any more. Always start out by making your sequence numbers 64 bits: 2^64 nanoseconds = ~585 years. (I have in the past stolen a few bits for other uses, but I'm less certain of the wisdom of that these days.)

32 bits is suitable for things that can only happen once a frame - (2^32/60) seconds = 2.26 years - but not much more than that.

But there's a flip side to computers being able to wrap a uint32_t so quickly: if you've got 32 bits' worth of combinations, then you might actually be able to do an exhaustive test. 2^32 microseconds = 71 minutes. Work in C++, run on all threads, and you probably won't even need to wait that long.

Good example: https://randomascii.wordpress.com/2014/01/27/theres-only-fou... - runs in 90 seconds.

I did something similar a few years ago, prototyping a 16x16 multiply routine for an 8-bit CPU. It took about 5 minutes running on both threads on my 2.5GHz dual core laptop, and found some interesting cases I'd got completely wrong.
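
For anyone who wants to try it, here's a minimal sketch of that kind of exhaustive sweep (the "fast" routine and its reference are hypothetical stand-ins; assumes GCC/Clang with OpenMP):

    /* Exhaustive 32-bit sweep; build with: cc -O2 -fopenmp sweep.c */
    #include <stdint.h>
    #include <stdio.h>

    static int popcount_fast(uint32_t x) {        /* pretend this is the clever version under test */
        return __builtin_popcount(x);
    }

    static int popcount_ref(uint32_t x) {         /* slow but obviously correct reference */
        int n = 0;
        while (x) { n += x & 1; x >>= 1; }
        return n;
    }

    int main(void) {
        uint64_t bad = 0;
        #pragma omp parallel for reduction(+:bad) schedule(static)
        for (int64_t i = 0; i <= UINT32_MAX; i++) {   /* 64-bit loop variable so the bound fits */
            uint32_t x = (uint32_t)i;
            if (popcount_fast(x) != popcount_ref(x))
                bad++;
        }
        printf("%llu mismatches\n", (unsigned long long)bad);
        return bad != 0;
    }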


65536 was a lot when I first got into programming; there was also floating point, and then the realization that we don't need to be constrained to 4 bits, 8 bits, 16 bits, 24 bits, 32 bits, 64 bits, 128 bits, two-digit years, 640K of RAM, small stack frames, broken memory models and all the other ridiculous limits. Give it a decade and I bet the ~585 year time frame is reduced to a small enough interval where bugs appear. Bet on lots of 64-bit bugs appearing, and a programmer in the future commenting that 128-bits should have been used, and 128-bits will break in the future of that future, and so on.

For me, the biggest bug in the modern x86 chip is the ME: a full system that can be turned on, take over everything, and eat up valuable core space.


> Give it a decade and I bet the ~585 year time frame is reduced to a small enough interval where bugs appear.

I'm skeptical that much changes in the next decade regarding where 64-bit integers are useful. It's been a long time since there have been meaningful clock frequency increases in commodity CPUs. Even if we had a 100 GHz CPU and we had it increment a register once per clock cycle, we're still talking ~6 years for it to overflow. And that's assuming a 20+ times increase in clock frequency, when clock frequencies have been mostly flat for the last 10 years.
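
Back-of-the-envelope, in case anyone wants to check the arithmetic (a quick sketch, assuming one increment per cycle at a hypothetical 100 GHz):

    #include <stdio.h>

    int main(void) {
        double seconds = 18446744073709551616.0 / 100e9;      /* 2^64 increments at 100 GHz */
        printf("~%.1f years\n", seconds / (365.25 * 86400));  /* prints ~5.8 years */
        return 0;
    }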

> a programmer in the future commenting that 128-bits should have been used, and 128-bits will break in the future of that future, and so on.

I disagree with this statement. With 128 bits, it becomes difficult (though not impossible) to even come up with things that are numerous enough that you can't count them with that number of bits. That was never the case for 64-bit or smaller integers.


ME is implemented with a tiny Quark processor on the PCH. It doesn't consume any valuable space in the CPU. Every modern non-trivial support ASIC has an embedded processor for housekeeping duty. The problem is how much ME can extend its tendrils into every aspect of a modern PC.


Bear in mind that the 128 bit space is not double the size of the 64 bit space - it's 2^64 times the size.

Cryptography actually relies on there being a space that is too large to count though, even in parallel.


128 bits is actually good enough for basically anything. 64 bits comes close but isn't always good enough.


Exhaustive search is untenable for the 64-bit case. Any strategy here? Testing 2^40 values might be tenable; any way to select which ones would give the best ROI?


I'm not so sure that's true. You can rent a Skylake chip on Google Cloud that'll perform 1.6 trillion 64-bit operations per second for $0.96/hr as a preemptible instance. That's enough to run one instruction over a 64-bit address space exhaustively in about 120 days, or for ~$2800.

Obviously that needs to be multiplied by the number of instructions a test needs to execute (and it has to be well written with AVX-512), but we're not talking nation-state levels of resources here for simple cases, and that figure is only going to drop further with time.

Not to mention the potential for FPGA or ASIC implementations if the tests are important enough.


Hmm that's surprisingly within reach. I bet you could crowd-source that kind of verification with BOINC.



It's hard to give a general strategy without knowing the problem structure, but I guess you could say one general strategy is to exploit the problem structure in a way that lets you test multiple values in one shot.
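
One concrete (toy) instance of that: for purely bitwise routines, every bit position is independent, so you can pack the whole truth table into one machine word and cover every input combination with a single call. A sketch, where sel() is a hypothetical branchless select under test:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t sel(uint64_t m, uint64_t a, uint64_t b) {  /* routine under test */
        return (a & m) | (b & ~m);
    }

    int main(void) {
        /* Bit i of (m, a, b) carries row i of the 3-input truth table,
           so one call to sel() exercises all 8 input combinations at once. */
        uint64_t m = 0xF0, a = 0xCC, b = 0xAA;
        uint64_t expect = 0;
        for (int i = 0; i < 8; i++) {                          /* reference: one row at a time */
            uint64_t mi = (m >> i) & 1, ai = (a >> i) & 1, bi = (b >> i) & 1;
            expect |= (mi ? ai : bi) << i;
        }
        uint64_t got = sel(m, a, b) & 0xFF;
        printf("%s\n", got == expect ? "ok" : "MISMATCH");
        return got != expect;
    }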


Feed LLVM bitcode through Z3 and hope you didn't implement something with cryptographic behaviour.


Could you give some advice or pointers on this?


This should get you going. Come back if you get stuck. https://twitter.com/whitequark/status/915211914443153409

Edit: sorry, I just don't have the time to get you the details.


> 32 bits is suitable for things that can only happen once a frame - (2^32/60) seconds = 2.26 years

What do you mean by a 'frame' and how does 2.26 years relate to a frame?


Frame of video at 60fps. For games I suppose that's a benchmark burned into your skull.


144 please. :)

For real though, the difference between 60 and 144 is night and day. I can't go back to 60 after playing with 144. I wonder if going beyond 144 would have the same effect.


Yea but that would require spending more than $600 per 5 years on computer hardware :\


A frame is 1/60 of a second (generally), so if you are indexing your frames with a 32-bit integer, you have enough for 2.26 years of video (or video game, or whatever).


High refresh rate monitors are getting more and more popular; at 240 FPS, a signed 32-bit counter will last just a bit over 3 months. Competitive players may want to play at even higher frame rates to decrease input lag even further. It's _probably_ still fine, because people _generally_ don't leave the game running for months, but those 31 usable bits get used up surprisingly quickly.


Yes, that's a good point. I unit test all small functions that take no more than 32 bits of input on literally all inputs.


Note this is by Jann Horn, the same Google security researcher who discovered Spectre/Meltdown. Serious props to this guy, and hopefully he stays in a whitehat role where he can find and deal with these issues through responsible disclosure.



I mean, at 20 (while still in college), cure53 had already become curious about Horn, which is quite crazy. (cure53 is one of Germany's best penetration-testing/security companies; they even analyze your software as a service. They actually pentested curl: https://cure53.de/pentest-report_curl.pdf, Prometheus: https://cure53.de/pentest-report_prometheus.pdf, and probably other high-value software.)


Perhaps people new to the codebase (or the industry, or to life in general) are less likely to view existing constructs as obviously correct, and therefore more likely to point out flaws.


I think you are right. It definitely takes more than that, but you can be the smartest guy in the world and you won't discover many vulnerabilities if you assume certain components are "hallowed ground".

Apparently he thought of the Spectre-style vulnerabilities while reading through the Intel processor manuals[1]. How many established engineers would a) read through these reference manuals at all, and b) question the implementations described therein?

[1]: https://www.bloomberg.com/news/articles/2018-01-17/how-a-22-...


That's one reason big tech companies should allow easy inter-team transfers.


Also why you should give all the critical projects to interns.


New opinions are always suspected, and usually opposed, without any other reason but because they are not already common. ~John Locke


Math and computers are closely related. Loads of cool math discoveries were made by the under-30 group.


I think software is a young man's game. I think bugs like this take a lot of intense time and focus. As you get older you have other things to distract you, like kids, a wife, a house, etc.


I am a lot more careful now, as compared to the start of my career. I have a lot more experience with testing and by all metrics I can measure the code that I ship goes out with a much lower defect rate. I think it is just the reverse - I think software is an excellent profession to stick to for the long haul, because with even a moderate amount of effort you can improve tremendously over time.


If this were true, why is there rampant ageism in SV?


You could make exactly the same argument for math research. It is somewhat true, but not entirely. There are many breakthroughs by older people.


"This fits in with the kernel's policy of attempting to keep the system running as much as possible by default; if an accidental use-after-free bug occurs here for some reason, the kernel can probably heuristically mitigate its effects and keep the process working."

That's a pretty bad design decision. I'd much rather have a kernel panic than a kernel that continues to run with known-bad data structures. That way the bugs will never really shake out; silent failures like that are super dangerous because the system is essentially running in an undefined state past that point.


IIUC the heuristic here is how much of the system to take down (i.e., back out of fast path [this case] < kill thread < kill process < kernel panic) and this seems correct (but could still be a symptom of more corruption).

Anyway, there has been discussion about this issue in general previously on the Linux kernel mailing list, and Linus said [1] that the correct procedure is to first introduce by-default reporting and an opt-in kill switch, then make the kill switch the default, then remove the non-kill option. This is supposed to weed out bugs eventually without disrupting users too much, but I can imagine that it enables some exploits. There was an HN discussion about it too. [2]

[1] https://lkml.org/lkml/2017/11/21/356 [2] https://news.ycombinator.com/item?id=15754988


It's terrible from a correctness perspective, sure. But from a business perspective, that could mean $50 million of revenue not-lost to a flaw you (as an owner/operator) can't do much about. Sure, it could lead to an exploit like this, but I'd wager a cost-benefit analysis for almost any organization (except maybe CIA/NSA types) would support this design choice.

That's not to say it's a good design choice, but it's certainly a defensible one IMO. You can have the most secure OS in the world, but if no one wants to use it, all you've got is a very secure waste of hard drive space.


If a kernel panic can cost you $50 million you have other problems. Really, in an organization where downtime due to a server rebooting would be that expensive you'd hope they would be able to deal with that gracefully and that they would be deploying the rough equivalent of chaosmonkey to ensure that their stuff is protected against such errors.

After all, a hard drive or a CPU could die just the same.


This is a principle born of experience. There have been a number of cases where a new fatal-error check killed a bunch of systems that had apparently been working for a long time before. The usual preference is to WARN_ONCE() but not BUG_ON(). Otherwise, big corporate users, or a wide random assortment of desktop users, end up going back to an older kernel to make their systems work. The Linux kernel tries above all to never have regressions, even if the "regression" is just a new check catching probably very bad bugs that simply weren't caught before. This happens often enough!

Think of this through the lens of natural selection. There are many, many subtle, tricky low-level bugs that can result in memory corruption - many drivers, many features, many optimizations, many "cleanups", a surprisingly high rate of code changes. Many of them are caught when they cause problematic corruption. Others, just due to luck, very rarely cause any visible problem - for months or years. We do want to know about them. We do not want to take the whole system down; it may be one of those rare bugs that made it this far because it's usually (surprisingly) not fatal on its own.
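
A user-space caricature of that policy, just to make the shape of it concrete (a minimal sketch, not kernel code; the names and the "cache" are made up): report the impossible state once, drop the suspect cache, and fall back to the authoritative lookup instead of taking everything down.

    #include <stdbool.h>
    #include <stdio.h>

    static int table[16];                 /* authoritative data */
    static int cached_index = 3;          /* a cached hint that may have gone stale */

    static int lookup(int wanted) {
        static bool warned;
        if (cached_index >= 0 && table[cached_index] != wanted) {
            if (!warned) {                /* WARN_ONCE()-style: report the first occurrence only */
                warned = true;
                fprintf(stderr, "stale cache detected, falling back to slow path\n");
            }
            cached_index = -1;            /* drop the suspect cache... */
        }
        if (cached_index >= 0)
            return cached_index;          /* fast path */
        for (int i = 0; i < 16; i++)      /* ...and do the slow, authoritative search */
            if (table[i] == wanted) { cached_index = i; return i; }
        return -1;
    }

    int main(void) {
        for (int i = 0; i < 16; i++) table[i] = i * 10;
        printf("%d\n", lookup(70));       /* cached hint points at 30: warn once, slow path, prints 7 */
        printf("%d\n", lookup(70));       /* now-correct cache: fast path, prints 7 */
        return 0;
    }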


Your implicit assumption is that the server will only reboot once. I think if the Linux kernel is upfront about its security posture and your threat-model is incompatible, it is OK to go with OpenBSD.


The more frequent the reboots the quicker the problem will get diagnosed!

Especially if the system is still stable enough to write a log entry to disk.


The problem is that downtime sucks, but if you don't take the time to make the system more correct then these buglets pile up and then you have a huge technical debt problem. It's all about economics, really, and if you have the resources early on in a project, then aiming for correctness is better than sweeping things under the rug.


It's extremely useful for developing drivers. I've developed drivers on kernels that are hardwired to panic at the slightest problem and it's a real nuisance.

Whether you should leave it on in production is debatable, but I like at least having the option of making exceptions raised in drivers nonfatal.


And the reason for that is that these drivers are linked into the monolith. If they were standalone processes you'd love the ability to do post-mortem debugging on them, or maybe even the luxury of being able to run them directly under the debugger.

Whenever I had to write device drivers I wrote them under QNX first and only then ported them to other OSes; that saved so much time. At least I knew the hardware interface would be up and running and the data structures would all work as intended. After that, all I had to do was glue it to whatever calling convention the various *NIX flavors had.

That trick served me well, for motion capture devices as well as for high speed serial cards and some more exotic devices.


This is fairly analogous to our approach too, just substituting QNX w/ seL4.


IMO - it's actually a really hard call. Error isolation is hard but we all want it. One of the somewhat less controversial versions is failing a single user request (in node or your web framework) with a 5xx on an exception. You're essentially betting that the exception is a logic bug and not somewhere that will affect other users. That's probably true, unless, you know, your database driver is responsible. We bet that's usually not the case - and we're mostly right.

Honestly I'm ok with trying to keep the system limping along, with a huge "but": you must, at the point you first detect an error, dump...everything you know. You can try to limp along because you're trying to be a good host, but debugging past that state is, as you said, not trivial.

If you separate the two concerns (post-mortem debugging & uptime) there's a nice medium to be found. Ideally kernel panics aren't the only source of observability. You can have a daemon running that files a bug report to your favorite error tracker (sentry, etc) and (attempts to) gracefully reboot the system. That would be pretty sweet.


Erlang is the only environment that I know about that gets this right.


> I'd much rather have a kernel panic than a kernel that continues to run with known bad datastructures

I mean, that is literally what an Oops is.


I agree. Efforts to continue past unexpected states frequently just make the situation worse and end up causing more user pain than a deterministic restart. This kind of recovery also masks real bugs and leads to their persistence in the codebase. (Heuristic error recovery also makes debugging more difficult.) A fail-fast approach both preserves user invariants and provides an impetus to fix bugs fast.


Rebooting when this happens could be a kernel configuration option. That way users could choose.


There are only two hard problems in computer science.

Cache invalidation, and naming things. And off by one errors.


That's the difference between computer science and software engineering. Software engineering has only a single hard problem: people.


And off by 2^32 errors


Don't think I've ever heard of off-by-0 errors before. Doesn't sound that bad...


And asynchronous callbacks.


Why is that a hard problem?

(Once someone answers this, I promise to read and process the answer, but I'll go and do something else in the meantime)


The joke is that this answer came late.


What you did there, I see


I've been taking a pretty close look at Linux's mm code recently, and I've been wondering why it needs to be as complex as it is. There's a brain-exploding amount of complexity, most of it open-coded, that seems to my eye to be unnecessary, or at least amenable to abstraction.

For example:

1) Why do we have entirely different management structures for anonymous memory (anon_vma and friends) and shared memory (inodes) when they really end up doing the same job? (Anonymous memory can be shared, so both paths end up needing to do the same kind of thing.) ISTM we can just use shared objects in all cases, tweaking the semantics as needed for shared memory. Anonymous memory is just swap-backed shared memory, after all. It should use the same logic.

(Yes, you need to work without a swapfile. No, that doesn't change the conceptual model.)

2) Do we really need open-coded page table walking? Why duplicate essentially the same logic four times? Sure, you occasionally want to do slightly different things at each page table level, but you can still unify most of the logic - see the walker sketch after this comment. (If you really want, you can encode the per-level differences in code generated via C++ templates.)

3) Do micro-optimizations of the sort mentioned in the article really help? If these sequence numbers had been 64 bits long, they wouldn't have overflowed. If the VMA tree had been implemented as a splay tree (like on NT) instead of an RB tree with a per-thread cache, the per-thread cache might not have been needed. (Splay trees automatically cache recently-accessed nodes.) How much do these little tricks actually help? Is their cumulative effect positive?

4) Why is so much of the vm internal logic spread throughout the kernel? Both NT and the BSDs have a well-defined API that separates the MM subsystem from the rest of ring 0, but on Linux, ISTM that lots of code needs to care directly about low-level mm structures and locks. Why should code very far from the mm core (like the i915 driver) take mmap_sem directly? It feels like more abstraction would be possible here.

I get arguments based around ruthless pursuit of performance, but with function calls taking nanoseconds and page faults taking orders of magnitude more time, is anyone saving anything significant?
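
To make point 2 concrete, here's a toy, user-space, two-level version of what a single generic walker with per-level hooks could look like (purely illustrative; not the kernel's actual interface or types):

    #include <stdio.h>
    #include <stdlib.h>

    #define FANOUT 4                      /* entries per level in this toy */

    typedef void (*visit_fn)(int level, int index, void *entry);

    /* One walker for every level: recurse toward the leaves, calling the
       per-level hook (if any) on the way down. */
    static void walk(void **table, int level, int max_level, visit_fn hooks[]) {
        for (int i = 0; i < FANOUT; i++) {
            if (!table[i])
                continue;
            if (hooks[level])
                hooks[level](level, i, table[i]);
            if (level < max_level)
                walk((void **)table[i], level + 1, max_level, hooks);
        }
    }

    static void print_leaf(int level, int index, void *entry) {
        printf("leaf [%d] = %d\n", index, *(int *)entry);
    }

    int main(void) {
        int value = 42;
        void **leaf_table = calloc(FANOUT, sizeof(void *));
        void **top_table  = calloc(FANOUT, sizeof(void *));
        leaf_table[2] = &value;
        top_table[1]  = leaf_table;

        visit_fn hooks[2] = { NULL, print_leaf };  /* only hook the leaf level */
        walk(top_table, 0, 1, hooks);              /* prints: leaf [2] = 42 */
        free(leaf_table);
        free(top_table);
        return 0;
    }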


If you strip out all performance optimizations from modern computers they will be a lot more secure, but they will run slow as molasses. Starting with the hardware caches and ending with optimizations like these, we are not talking 'minor gains' but likely orders-of-magnitude differences between optimized and non-optimized. And all of those have security trade-offs.

It's a tough call, the mantra used to be 'get it right, then make it fast', but in practice whoever ships the fastest stuff will win the race, even if it is incorrect. As long as it is correct long enough to run the benchmarks I guess.

This is yet another reason why I love microkernels (real ones), they are a lot easier to reason about and to get right to the point that you can really rely on them not to suddenly exhibit some weird behavior due to the complexity of their execution context.


Also worth noting that for large companies that have dozens of data centers, even a 0.1% improvement can lead to millions of dollars saved.


Agreed, but much of the time these improvements can be had without complicating the code itself. That's one reason I much prefer C++ over C: it's much more amenable to "cost-free abstraction" techniques that provide both performance and clarity.


> If you strip out all performance optimizations from modern computers they will be a lot more secure but they will run slow as molasses

Each optimization needs to be evaluated on its own. Not every optimization actually helps, and due to the effects of path length, icache pressure, and code complexity, the cumulative effect of many micro-optimizations may end up being negative. (It's why -funroll-all-loops is not in fact very fun.)

There's no forced tradeoff between complexity and speed! The fastest code is the code that doesn't run at all. Often, the best way to optimize a program is to strip it down to the rafters and then re-add only the optimizations that actually help. Simpler is often faster.


That is mostly true, but the fact is that until Spectre and Meltdown everybody believed that the security implications of hardware caches were theoretical. It took decades after we first started using them for the issues that were warned about long ago to manifest in practice.

For pure software optimizations I would be more than happy to end up with a machine that is half as fast but bulletproof. But I am not sure that I am in the majority there.

And even though you are right that there is no forced tradeoff between complexity and speed, in practice this is often the case - so often that we have institutionalized it: we generally accept that in order to make things faster we will have to make them more complex. If not, then the most naive version of the code would in general be the fastest, and this just isn't the case.

So in the end it all boils down to where you draw an arbitrary line and say 'to the right there is too much complexity, to the left we are too slow'. Everybody will want to draw that line somewhere else, so we try to have our cake and eat it too: we optimize as much as we can, make stuff more and more complex and then we try to shoot holes in it. Sometimes successfully!


> The fastest code is the code that doesn't run at all.

I'm taken with Daniel Bernstein's observation that speed is totally unimportant for ~95% of the code generated. The reason being that the time it spends running is a small percentage of the total, or it hardly runs at all.

And I'll add that for most programs written, speed isn't important. Because the cost to run them is a small fraction of the cost to write and maintain them.

https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf


Regarding 3), I suspect that the optimization is rather important for page fault scalability when the process has many threads. You would traditionally synchronize access to the VMA tree using some sort of reader-writer lock, but scalable read locks impose a higher cost on writers. It's easy to believe that splay trees wouldn't help and might hurt in this case, as lookups may modify the tree structure and thus can require more synchronization than a read lock.

Calling this a micro-optimization is thus misleading; rather, it probably helps quite a lot in some particular workloads and has a negligible impact on others.


> Do we really need open-coded page table walking? Why duplicate essentially the same logic four times?

Historical reasons. There’s an early prototype to get rid of it.


Plus, when there were only two levels of page tables it didn't matter that much. Now that we're starting to see four and even five levels, it's becoming an issue.


I'm sorry if this is a trivial question, but what does "open coding" mean? This is the first time I've heard that term.


Manual inlining: writing the code out directly instead of calling a function.
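
For example, with the kernel's list_empty() helper (which is more or less a one-line check that a list's next pointer points back at the head), these two do the same thing; q here is a hypothetical struct with an embedded list_head:

    /* Calling the helper: */
    if (list_empty(&q->list))
        return;

    /* The same check "open-coded", i.e. the helper's body written out inline: */
    if (q->list.next == &q->list)
        return;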


Serious question - where would you start if you wanted to learn how to find/diagnose these types of issues and maybe one day work for Project Zero? Every time I see these posts it blows my mind, maybe because I have never done any kernel development, but I tend to get the feeling these guys/girls must be the best of the best.


I can see why they haven't come up with a catchy name for it yet.


It should obviously be The Mummy's Curse, because of all the wrappings.


Sounds like a classic case of "not thinking things through properly at the beginning".

I wonder how FreeBSD engineered this... Linux appears to grow organically at times.


You could say that about most bugs...

FreeBSD doesn't have a per-thread cache, but does similarly use a generation number to allow operations to detect stale state upon reacquiring the vm map lock. I don't see why it isn't prone to the same kinds of bugs.


I entirely disagree. You can have the most well-thought-out, architected, designed system in the world with:

if (i=0) {


You could argue that a system written in a language that permits such errors is not well-thought out. The terms you're using are not well-defined, so it's easy to disagree endlessly without arriving at a useful insight.


This is such fabulous work! Sometimes I look at such work with a tinge of jealousy. How does one get so good at programming? I try but I cannot get to be this good. I especially find it hard to understand someone else's code.


The bug was fixed by changing the sequence numbers to 64 bits, thereby making an overflow infeasible, and removing the overflow handling logic.

So the bug is still there then? It will just take longer to hit.

Until someone else decides to change the encoding of that field for some reason and not initialize it at 0, so that it instead starts with a random value (for safety reasons). Good luck figuring out that random crash that happens once in a blue moon.


> So the bug is still there then? It will just take longer to hit.

It will take so long that the hardware will have died many times over before the bug triggers, assuming society as we know it is still around.


Like I said in my original comment, not if a future kernel change randomized that value to make it unpredictable to attackers. Then the bug would be back.


The bug doesn't trigger when the counter resets to zero, the bug triggers when the counter reuses a previous value.

So:

A) There is no reason whatsoever to randomize the counter.

B) Even if they did, it wouldn't be a problem, because it would still take an absurdly long time to loop back around to the starting position.
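
For what it's worth, a minimal sketch of why reuse (not the value zero) is the hazard - names made up, loosely following the vmacache scheme the article describes. The cache is trusted whenever the sequence numbers happen to be equal, so any path back to a previously used value, wrap-around included, can make a stale entry look fresh:

    #include <stdint.h>
    #include <stddef.h>

    struct cache {
        uint32_t seq;        /* sequence number recorded when the entry was cached */
        void    *ptr;        /* possibly long-stale pointer */
    };

    /* Trusts the entry iff the sequence numbers match - so if the live counter
       has wrapped all the way around since ptr was cached, a stale pointer is
       handed back as if it were valid. */
    static void *cached_lookup(const struct cache *c, uint32_t current_seq) {
        return (c->seq == current_seq) ? c->ptr : NULL;
    }

    int main(void) {
        int old_object = 1;
        struct cache c = { .seq = 7, .ptr = &old_object };
        /* ...2^32 sequence bumps later, the live counter reads 7 again: */
        return cached_lookup(&c, 7) != NULL;   /* stale entry accepted as "valid" */
    }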



