I think I found a Mac kernel bug (2018)

alin23 · on Dec 9, 2022

Coincidentally, I also stumbled upon a way to make the kernel of Apple Silicon Macs panic and restart while developing the https://lowtechguys.com website.

I distilled the problem in a repo so it can be reproduced with a single command: https://github.com/alin23/m1-panic

I found it while on Monterey and reported it 2 times through Feedback Assistant, but it still happens on Ventura.

NOTE: Don't try it without saving all your work first; it has a very high chance of forcibly restarting your computer.

saagarjha · on Dec 10, 2022

The bug is in one of these two lines:

https://github.com/apple-oss-distributions/xnu/blob/5c2921b0...

This uses the macro SLIST_REMOVE (https://github.com/apple-oss-distributions/xnu/blob/5c2921b0...) to remove an item from the linked list. If you look at the code to do this it's a pretty simple linked list traversal: https://github.com/apple-oss-distributions/xnu/blob/5c2921b0.... However, it doesn't have a check for the end of the list, so the item must be in the list, or it's going to walk right off the end and dereference a null pointer. In this case that's exactly what it ended up doing.
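To make the failure mode concrete, here is a minimal Python analogue of an unchecked singly linked list removal. It is only a sketch of the traversal described above, not the XNU macro itself; `Node` and `unchecked_remove` are hypothetical names, and Python's `AttributeError` on `None` stands in for the kernel's null pointer dereference.

```python
class Node:
    """A singly linked list element (hypothetical stand-in for a kernel list entry)."""

    def __init__(self, value):
        self.value = value
        self.next = None


def unchecked_remove(head, target):
    """Unlink `target` from the list starting at `head` and return the new head.

    Like the traversal described above, this assumes `target` is in the list.
    There is no end-of-list check.
    """
    if head is target:
        return head.next
    cur = head
    # If `target` is absent, `cur` eventually becomes None and `cur.next`
    # raises AttributeError: the Python analogue of walking off the end
    # of the list and dereferencing a null pointer.
    while cur.next is not target:
        cur = cur.next
    cur.next = cur.next.next
    return head
```

Removing a node that is in the list works as expected; passing a node that was never linked in walks past the last element and fails, which is the shape of the crash discussed here.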

alin23 · on Dec 10, 2022

That's.. wow, ok. How exactly did you end up in that source code at that specific line?

I know you're well versed in reverse engineering Darwin, and I'm reading your posts and trying to improve my skills in this daily, but this seems way over my skillset.

Did you debug this using KDK or m1n1? Do you have a setup always ready for debugging a Darwin kernel?

saagarjha · on Dec 10, 2022

In theory I have all of those, but currently I have none, so it's manual work. Your best friend in diagnosing a kernel crash is a KDK. If you have one that matches your build, it will have symbols in it. With a little bit of math you can take the backtrace in the crash log and slide it appropriately to match the binary. Personally I use LLDB for this. Here's an example of what this looks like on an x86-64 kernel (Apple silicon has its own math but it's largely the same): https://github.com/saagarjha/unxip/issues/14#issuecomment-10.... The kernel is typically compiled with optimization, so there's a lot of inlining and code folding, but with function names, source files, and instruction offsets it's pretty trivial to match it to the code Apple publishes.
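The "little bit of math" is essentially a subtraction. The sketch below shows the idea in Python under stated assumptions: every address in it is made up for illustration, and `kernel_slide` and `unslide` are hypothetical helper names rather than LLDB commands. A real panic log reports the kernel's load address or slide, and the details differ between x86-64 and Apple silicon.

```python
def kernel_slide(runtime_base, static_base):
    """Slide = where the kernel was actually loaded minus where the binary says it starts."""
    return runtime_base - static_base


def unslide(runtime_address, slide):
    """Map an address from a backtrace back to the address in the on-disk binary."""
    return runtime_address - slide


# All values below are invented for illustration only.
static_base = 0xFFFFFF8000200000   # assumed base address in the symbolicated binary
runtime_base = 0xFFFFFF8012200000  # assumed load address taken from a panic log
frame = 0xFFFFFF80125A3C10         # assumed return address from the backtrace

slide = kernel_slide(runtime_base, static_base)
print(hex(slide), hex(unslide(frame, slide)))
```

The unslid address is what you would then look up in the KDK's symbol-rich binary to get a function name and offset.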

In this case I do not have a KDK for that build. In fact Apple has been unable to produce one for a couple of months, an inadequacy which I have repeatedly emphasized to them because of how critical they are for stuff like this. Supposedly they are working on it. Whatever; in lieu of that I got to figure out how good the tooling for analyzing kernels is these days, which was my real goal anyways.

For this crash log I downloaded the IPSW file for your build, 22A400. All of them get linked on The iPhone Wiki, e.g. https://www.theiphonewiki.com/wiki/Firmware/Mac/13.x. Once you unpack the IPSW (it's a zip file) there are compressed kernelcache files inside. Apple changed the format of these this year so most of the tooling breaks on it, but https://github.com/blacktop/ipsw was able to decompress them. Then I loaded it into Binary Ninja, which apparently doesn't support them either, but compiling this person's plugin (+166 submodules, and an LLVM & Boost build) gets it to work: https://github.com/skr0x1c0/binja_kc.
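Since an IPSW is a zip archive, the first step can be sketched with Python's standard `zipfile` module. This only lists candidate entries; it does not decompress the kernelcache format itself. The helper name `list_kernelcaches` and the assumption that the relevant entries have "kernelcache" in their names are illustrative, not taken from any tool mentioned above.

```python
import zipfile


def list_kernelcaches(ipsw):
    """Return the names of archive entries that look like kernelcache files.

    `ipsw` may be a filesystem path or a file-like object.
    Assumption: the entries of interest contain "kernelcache" in their name.
    """
    with zipfile.ZipFile(ipsw) as archive:
        return [name for name in archive.namelist()
                if "kernelcache" in name.lower()]
```

Calling it with the path to a downloaded IPSW would give the entries to extract and hand to a decompression tool.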

From there you can load up the faulting address from the crash log and see what the function looks like. In this case, a bunch of junk has been inlined into it but there's a really obvious and fairly unique string reference for "invalid knote %p detach, filter: %d". From there, you can compare it against the actual source code to see which one matches the "shape" of the function you're looking at. I happened to also pull up an older kernel which did have a KDK available and then compared its assembly to the new one to match it up to ptsd_kqops_detach. The disassembly of the crashing code is obviously doing a linked list walk so you can figure out exactly which line it is from that.
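As a rough illustration of the string-anchoring trick, the following Python sketch finds where a format string lives inside a raw binary blob. It is a simplification: a disassembler additionally resolves which instructions reference that location, which a plain byte search cannot do. `find_string_offsets` is a hypothetical helper name.

```python
def find_string_offsets(blob, needle):
    """Return every byte offset in `blob` at which the ASCII string `needle` starts."""
    pattern = needle.encode("ascii")
    offsets = []
    start = blob.find(pattern)
    while start != -1:
        offsets.append(start)
        # Continue searching just past the previous hit.
        start = blob.find(pattern, start + 1)
    return offsets
```

A string that occurs exactly once, like the one quoted above, makes a good landmark because it narrows the search to a single location in the binary.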

If I wasn't lazy I might also fire up a debugger to see why the function had walked off the end of the list but without KDKs things get pretty bad, not that they're very good to begin with. I don't have a m1n1 setup (I should probably do this at some point) and the things I do have, like remote debugging or the VM GDB stub, are not really worth suffering through for a Hacker News comment.

alin23 · on Dec 10, 2022

Saagar, thank you so much! This is priceless!

I was in the process of trying to get I²C working through the built-in HDMI port of the Apple Silicon Macs (the one containing the MCDP29xx HDMI-to-DP converter chip) and have been hitting a lot of dead ends while looking at kexts and opaque firmware blobs. This is going to help a lot as the KDK seems to contain logging messages related to DDC that I've never seen before.

I also found SIP disabled + Frida very useful for debugging without going through the KDK/m1n1 route. Not sure if it also helps with kext code though; I mostly used it for SkyLight and other private frameworks. Still, it's very nice to be able to alter the code while it is running, or sometimes simply log specific function calls with their argument values to get an idea of which action causes which code to run.

saagarjha · on Dec 11, 2022

Unfortunately patching the kernel or injecting your own code into it is quite difficult, unlike the situation in userspace. Though I haven’t gotten a chance to try it, I think running a kernel debugger through m1n1 is the best strategy for doing dynamic analysis of the kernel.

trollied · on Dec 9, 2022

Might have been better to report it to the security people. This sort of thing can be exploitable.

nemetroid · on Dec 9, 2022

They did report it to Apple, multiple times:

> I found it while on Monterey and reported it 2 times through Feedback Assistant, but it still happens on Ventura.

lilyball · on Dec 9, 2022

"to the security people" means emailing the relevant email address, not Feedback Assistant.

nemetroid · on Dec 10, 2022

If it doesn't end up with the relevant people, that's Apple's problem.

saagarjha · on Dec 10, 2022

It does eventually. If you want a prompt response you should contact product security.

hackmiester · on Dec 10, 2022

Shouldn't Apple be the ones who really want to respond promptly? Why should we work around bugs in Apple's issue reporting system?

saagarjha · on Dec 10, 2022

I don’t really see the problem with getting faster responses by contacting Apple’s security team directly for potential vulnerabilities when compared to the general-purpose bug tracker.

LawTalkingGuy · on Dec 10, 2022

By an hour or two, maybe. But since the last version of the OS? No.

nigamanth · on Dec 10, 2022

Usually big companies such as Discord give perks to the bug hunters who find bugs. Apparently Apple doesn't have that. There are probably people at Apple who won't admit that they have bugs, even though every operating system has them; the code is too big not to contain a single bug or exploit.

trogdor · on Dec 10, 2022

Huh? Apple has a robust bug bounty program.

https://security.apple.com/bounty/

bjoli · on Dec 10, 2022

I found 2 crashes in OS X back in Yosemite. I have reported them with every release since.

I have no idea if they work on the ARM Macs, but I will have the ability to check in a couple of days. Probably nothing exploitable, but still a hard crash.

saagarjha · on Dec 9, 2022

It’s a null deref.

catiopatio · on Dec 9, 2022

What’s the feedback ID #?

macshome · on Dec 9, 2022

Can you put the panic log text into the repo as well?

alin23 · on Dec 9, 2022

Sure, added here: https://github.com/alin23/m1-panic#panic-crash-report-after-...

metadat · on Dec 9, 2022

Do you know what the underlying problematic instruction sequence is? Or the precise location where it halts?

alin23 · on Dec 9, 2022

I have no idea how I could find that, given that the system freezes completely.

Maybe tracing the CPU instructions using LLDB might be possible, but the bug is most likely in the kernel code so this would not help much.

catiopatio · on Dec 9, 2022

You can debug the kernel remotely over Ethernet: https://developer.apple.com/documentation/apple-silicon/debu...

If that still fails, virtualization tools provide debugging interfaces you can use to step the execution of the virtualized CPU; e.g. VMware’s “debugStub” feature.

sharikous · on Dec 9, 2022

You can't with Apple Silicon. It's shameful in my opinion. You still can load a core dump or view the state after an NMI but you can't run the kernel under a debugger.

saagarjha · on Dec 9, 2022

You can, the experience is just not very good.

Firmwarrior · on Dec 9, 2022

haha, let's just let the Apple engineers on this thread figure that shit out

bri3d · on Dec 10, 2022

The best public kernel debugger for Apple Silicon is m1n1 from the Asahi project.

saagarjha · on Dec 9, 2022

ldr x9, [x8, #0x18]!

MuffinFlavored · on Dec 10, 2022

does it work in zsh or bash? only fish?

throwaway09223 · on Dec 9, 2022

It's surprisingly easy to stumble into crash bugs when playing around with processes.

I remember a decade or two ago I ran into a Linux bug where the kernel would panic if a process was killed with an open descriptor on its /proc entries. That is:

open /proc/$pid/something; kill -9 $pid #kernel crash

We unfortunately discovered this when using fuser in a runscript to kill stale versions of a process, eg:

sudo fuser --kill --namespace tcp 80 # kill whatever is listening on port 80

This would reliably cause kernel panics every so often, with one straightforward shell command. This ended up causing a big problem because it was part of a runscript which ran on bootup. But, it normally would do nothing so it went unnoticed until the app in question had a startup problem, leaving a copy of itself dangling listening on the port -- and instead of killing the old instance, it began crashing the entire system in a loop. Oops.

teawrecks · on Dec 9, 2022

It was also unsurprisingly easy to crash a kernel a decade or two ago.

jeffparsons · on Dec 10, 2022

I remember a time when you had to be careful to not reveal your IP address to untrusted peers (e.g. on IRC) because a single specially malformed packet called the "Ping of Death" would reliably crash any internet-connected Windows PC.

That was a wild time. Nobody talked about security back then. The idea that everything in our lives would eventually run over the internet just wasn't on people's minds.

kayodelycaon · on Dec 9, 2022

It's freezing the querying of process status, which is very not good, but that isn't the entire kernel. If it was the entire kernel, you wouldn't be able to use Ctrl+C.

anyfoo · on Dec 9, 2022

Yes. The title of the actual blog post seems more accurate.

hinkley · on Dec 9, 2022

In the long dark ago there was a program called 'crashme' which would generate and run random code from user space to see if it could cause kernel panics.

jwilk · on Dec 9, 2022

https://people.delphiforums.com/gjc/crashme.html

xcdzvyn · on Dec 9, 2022

I presume there's a low yet non-zero chance this inadvertently messes up something on the FS?

civopsec · on Dec 9, 2022

Did it predate the “fuzzing” term?

hinkley · on Dec 9, 2022

By at least a decade or two. The last time I saw one in the wild was probably around 1998, and it was a very old idea by then.

throwawaaarrgh · on Dec 9, 2022

It's very easy to freeze a system as a non-root user; cause too many interrupts, consume too many resources, etc. Many kinds of infinite loop will lock a system hard. Hell, you can crash systems with too many packets.

And it's very easy to cause ps to hang. Many different kernel syscalls hang / are blocking. Mostly you see this with kernel features dependent on a resource that doesn't resolve itself, like a stuck disk, network filesystem, etc. But other various quirks of the system can cause blocking.

adrian_b · on Dec 9, 2022

While what you say is true, these are nonetheless kernel bugs.

The kernel should never let any user process consume so many resources as to cause a system freeze.

The kernel must not only be able to preempt any user process at any time, stopping it from consuming all CPU time, but it must also prevent any user process from completely filling an SSD or HDD, because that can prevent many programs from starting.

throwawaaarrgh · on Dec 10, 2022

Preemptibility is an optional kernel design feature. Not all kernels have it and not in all ways. If it's intentionally designed that way it's not a bug. No kernel stops users from filling up disks (though some filesystems have such limits as features, which most of us turn off).

Pretty much the only kernels that are totally preemptible are RTOS and they still don't stop you from shooting yourself in the foot.

astrange · on Dec 10, 2022

A computer's job is to do whatever you ask it to do; that includes using all disk or memory up, if that's what you really want. There's not really a way to prevent breaking the OS (or just using up the battery) without preventing you from using all the computer you bought.

A phone is different since it always has to be able to make phone calls.

mort96 · on Dec 10, 2022

The things you mention cause "freezing" by asking the kernel to do too much so that it doesn't have time to deal with other stuff. Those issues are unfortunate, but really hard to completely avoid.

The interesting thing here is that the described bug isn't just overloading the kernel with work or starving it of resources, it's something which seems completely innocent.

saagarjha · on Dec 9, 2022

Past discussion of this bug: https://news.ycombinator.com/item?id=16082861

dang · on Dec 9, 2022

Thanks! Macroexpanded:

I think I found a Mac kernel bug? - https://news.ycombinator.com/item?id=16250677 - Jan 2018 (117 comments)

anon291 · on Dec 9, 2022

Not surprised. I wrote some kqueue code in C once that not only froze the Mac kernel, it caused the entire computer to crash. I reported the issue to Apple, and never really heard back. They don't really care, as long as all the Mac store apps work, in my experience.

This was one call to kqueue with incorrect (but not particularly malicious, just normal C silliness) arguments, and boom!

amenghra · on Dec 9, 2022

In 2017, I wrote:

Over the years, I have found four different bugs in Apple's Calculator app. Here is today's wtf.

Switch calculator to Scientific mode (⌘-2). Type: 1, 0, ^, 2, 0, enter, command-c (to copy), command-v (to paste)

Expected result: an amount in $$$ I wouldn't mind having. Actual result: smallest number to appear 6 times in Pascal's triangle

I reported all four and they never acknowledged any of them. They didn't fix any of them either. Doesn't motivate anyone to report more bugs to them.

Some of the bugs got auto-closed after a while. Eventually, the bugs did get fixed, except the clipboard thing ¯\_(ツ)_/¯.

lilyball · on Dec 9, 2022

I just tested right now, ⌘C copies the string "1e20" and ⌘V pastes that same string. So yeah it looks like it was fixed.

anamexis · on Dec 10, 2022

I also just tested now and reproduced the bug. The key thing is that you are pasting back into the calculator – which presumably is just stripping letters.

The behavior is maybe a bit surprising, but I could also see it being defensible. You can't type "1e20" into the calculator, so why would you be able to paste it in?

Outside of plain text, I don't think clipboard operations are necessarily expected to be reversible.

amenghra · on Dec 11, 2022

That’s a good point. Maybe you should be able to type “1e20” in scientific mode? It would make the calculator more full-featured and prevent “losing” data if you copy/paste in the middle of a long calculation.

lilyball · on Dec 13, 2022

"e" in scientific mode is a shortcut for the "ln" function.

That said, I tested again, and once again copying "1e20" and pasting it into the Calculator works just fine. It's definitely not treating it the same as pressing each key separately. I'm testing on macOS 13.0.1.

jwilk · on Dec 9, 2022

> smallest number to appear 6 times in Pascal's triangle

You mean exactly 6 times or at least 6 times?

amenghra · on Dec 9, 2022

Exactly 6 times. What's happening is if the result is 1e<something>, a roundtrip to the clipboard results in 1<something>. I.e. 1e20 becomes 120.
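A small Python sketch of the roundtrip, under the assumption (suggested elsewhere in this thread, not confirmed) that pasting simply drops letters. `paste_into_calculator` is a hypothetical model of that behavior, not Calculator's actual code; `pascal_occurrences` checks the Pascal's triangle claim by brute force.

```python
from math import comb


def paste_into_calculator(text):
    """Model of the reported behavior: letters are dropped, everything else is kept."""
    return "".join(ch for ch in text if not ch.isalpha())


def pascal_occurrences(n):
    """Count how many entries of Pascal's triangle equal n (for n > 1).

    Any n > 1 cannot appear below row n, so scanning rows 0..n is enough.
    """
    return sum(1
               for row in range(n + 1)
               for k in range(row + 1)
               if comb(row, k) == n)


pasted = paste_into_calculator("1e20")
print(pasted, pascal_occurrences(int(pasted)))
```

The six occurrences of 120 are C(10,3), C(10,7), C(16,2), C(16,14), C(120,1), and C(120,119).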

JonathonW · on Dec 9, 2022

Bugs do get fixed (eventually; they're not always timely about it depending on severity), but Apple's feedback systems are and always have been a black hole. Basically, as a reporter, the only time you hear anything back from Apple about a bug report is if they need additional information; nothing else in their process is visible externally (until you go back and retest a few macOS releases later and your bug is or isn't fixed).

catiopatio · on Dec 9, 2022

Is it still reproducible?

anon291 · on Dec 9, 2022

I haven't had a Mac in years. This was ~2010.

_7eoi · on Dec 9, 2022

My naïve guess is that this is probably some sort of lock contention thing.

enedil · on Dec 10, 2022

Lock contention usually impacts performance, but not liveness.

saagarjha · on Dec 10, 2022

Usually, but in truly pathological cases you may starve something important essentially indefinitely.

mort96 · on Dec 10, 2022

Lock contention from running a syscall once?

IceWreck · on Dec 9, 2022

Is this fixed now?

kayodelycaon · on Dec 9, 2022

Tested it on macOS 12.3. It's fixed.

eesmith · on Dec 9, 2022

https://github.com/hishamhm/htop/issues/682#issuecomment-377... (the htop issue) says "according to others above, Apple has seemingly fixed this in 10.13.4."

resters · on Dec 9, 2022

I asked ChatGPT to write some programs like this and it refused!