CrowdStrike fixes start at "reboot up to 15 times", gets more complex from there (arstechnica.com)
203 points by thunderbong 57 days ago | 234 comments



Has anyone discerned the root cause of this in the software?

As in, what exactly is wrong in these C00000291-*.sys files that triggers the crash in csagent.sys, and why?


I've been wondering the same. I did just see [1], where it's apparently trying to read memory from an unmapped address, but I haven't seen anything about how r8 got to the point of having said unmapped address.

[1]: https://x.com/patrickwardle/status/1814343502886477857


It seems the affected 42 KB update file was overwritten with zeros, whereas the before and after .sys files contain obfuscated sys/config file info as expected.


If it is simply caused by a corrupted file, that is a really bad signal. It means they don't even try to properly validate and parse the file before loading it into the kernel. "Always validate input so it doesn't crash your program" is practically computer science 101, something every programming class tells you in the first lecture. And yet they still let this happen?

And in this case, it only crashes. But what if it somehow successfully read a value from a position it wasn't supposed to? You have an RCE.
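
For anyone wondering what "validate before loading" would even look like here, a minimal sketch in C (the real channel-file format is proprietary and undocumented, so the header fields and magic value below are purely hypothetical):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical channel-file header -- the real format is not public. */
    struct chan_hdr {
        uint32_t magic;        /* fixed constant identifying the file type */
        uint32_t version;
        uint32_t payload_len;  /* bytes of data following the header */
        uint32_t checksum;     /* e.g. CRC32 of the payload */
    };

    #define CHAN_MAGIC 0xC5A6E471u  /* made-up value for illustration */

    /* Return 0 if the buffer looks like a plausible channel file, -1 otherwise.
       The point is to reject a zero-filled or truncated file *before* any
       kernel-mode code dereferences offsets taken from it. */
    int chan_validate(const uint8_t *buf, size_t len)
    {
        struct chan_hdr hdr;

        if (buf == NULL || len < sizeof hdr)
            return -1;                        /* too short to hold a header */

        memcpy(&hdr, buf, sizeof hdr);        /* avoid unaligned reads */

        if (hdr.magic != CHAN_MAGIC)
            return -1;                        /* catches the all-zeros case */

        if (hdr.payload_len > len - sizeof hdr)
            return -1;                        /* payload claims bytes that aren't there */

        /* checksum verification of buf + sizeof hdr would go here */
        return 0;
    }

None of this prevents a logic bug deeper in the parser, but it's the kind of cheap gate that turns "all-zero file" into "file rejected, log an error" instead of a kernel panic.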



This is a global multi-layer failure: Microsoft allowing kernel mods by third-party software, CrowdStrike not testing this, DevSecOps not doing a staged/canary deployment, half the world running the same OS, things that should not be connected to the internet but are by default. Microsoft and CrowdStrike drove a horse and a cart through all redundancy and failover designs and showed very clearly where there were no such designs in place.


While I will be the last person in line to defend Microsoft, I am not sure that disallowing 3P kernel mods is a workable solution. Crowdstrike and companies like it exist to fill a very real need within the windows ecosystem. I don’t foresee that suddenly going away now or Microsoft unilaterally forcing every company like crowdstrike out of business and taking over this role themselves


Mac does it, although that most definitely does not fill the same enterprise role as Windows.

It's possible to make userspace interfaces and permissions for this sort of thing.


Literally every OS allows you to install 3rd party kernel modules or plugins. If Microsoft banned them, people would be up in arms about them being a controlling walled garden. There is no winning.


> Literally every OS allows you to install 3rd party kernel modules or plugins

I know for sure at least macOS and OpenBSD don't


You can compile OpenBSD yourself though.


Hello, IT, have you tried turning it on and off again 15 times?

Seriously though - this entire outage is the poster child for why you NEVER have software that updates without explicit permission from a sysadmin. If I were in congress, I would make it illegal, it's an obvious national security issue.


Nah. That’s not the problem.

Kernel level code blindly loading arbitrary files?

Panicking when the file doesn’t parse because it’s not a memory safe language?

Not validating the files before loading them?

Not validating the files before SHIPPING them? No CI? No safety net?

No staged rollout in case of explosion?

There are far FAR bigger mistakes here than “sys admin didn’t have to press button”.


To play devil’s advocate, a staged rollout for antivirus definitions somewhat defeats the point since those definitions are supposed to be constantly updated.

I agree with the rest, especially the use of a memory unsafe language to do parsing in the kernel by a billion dollar security company blows my mind.

How can you even run a security company without any security professionals reading your code even incidentally? An impressive level of incompetence.


At the very least they could have an in-house playground in the process to see if their new version even works. Maybe something like the guest computers in a public area, or some sort of VM that emulates an end-user system to see if it even boots. And somehow we still got this.

How the heck did they not find out that the new version prevents the computer from booting at all?


Yeah that had crossed my mind too. I’m not sure which risk is bigger, breaking things or leaving them insecure.

I lean towards breaking things being the bigger risk.

But if even a handful of the other errors were corrected this would have been prevented and they wouldn’t have had to make that choice.


> Panicking when the file doesn’t parse because it’s not a memory safe language?

Whether a program panics or recovers when attempting to parse bad data is entirely orthogonal to memory safety. Do you have any in-depth technical information about the bug itself that you're basing this on?


Exactly this.

This is a faulty and dangerous product from conception to execution.


Is it normal to make outbound connections during boot? Doesn't that circumvent a firewall? That seems like something a security team evaluating whether they want this software on their network might care about during an eval period.. right?


Looking at the contents of c:\windows\system32\drivers\crowdstrike suggests it does all sorts of weird shit right down to injecting itself into UEFI and futzing with firmware. It's literally in everything.

Unfortunately "security" folk these days are box ticking fuckwits and this product brief ticked all the boxes. They do not understand any more traditional methodologies other than "install these magic beans and action the reports".

Invest in better software and network architecture and DR strategy instead.


CrowdStrike is so invasive that it needs firewall exceptions. It does a lot of the actual antivirus work in the cloud. It's a security nightmare.


That's not the big no-no here. Lack of any real DRP is. Sure, it's cheaper to just buy CS Falcon (and who knows what other amazing vendor-supplied timebombs are ticking silently) than paying sysadmins and developers ... and letting them build something that does what it needs, not much else, so there's no need to put these fantastic "single agents" from these RCE-as-a-service vendors on all the fucking servers.


both are true


What % of those sysadmins are then going to turn around and script something to auto-approve those updates, once they realize that they are A) requested at inconvenient times and B) are related to security?

Who's going to take the risk of appearing to have sat on an important update, while the org they support is ravaged by ThreatOfTheDay, because they thought they knew better than a multi-billion dollar, tops-in-their-field company?

(I'm not necessarily saying that's actually objectively correct, but I can't imagine that many folks are willing to risk the downside)


> why you NEVER have software that updates without explicit permission from a sysadmin

In general I agree, but this case is quite messy. It's more like your anti-virus had a bug since forever that if it loads a broken virus definition it bricks your system. And a broken virus definition finally happened today.

Do you want every virus definition (that is updated every few hours) to require explicit permission from a sysadmin?


You’re learning the wrong lesson here. Automatic security updates in Debian and Ubuntu actually get tested and work. The RCE in ssh a week ago is an argument for enabling automatic security updates. (And for security in depth, putting everything behind VPN for example)

This example is probably an argument for not running Windows on critical systems due to insufficient focus on security from the beginning, which has led to a need for things like CrowdStrike.

They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.


>They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.

I wish people would stop making blanket statements as if they know how every company in the world runs. Plenty of Linux machines are running CS, and it's not only because they are forced to for compliance. NG AV has been picking up speed as a "just in case" thing for Linux and Mac for years now. Your anecdote does not apply to everyone.


They still run Windows XP (og edition, not this patched rubbish) to make sure national security isn't compromised.

The really important machines are still on Win 3.1.


I understand the logic of this, but it is somewhat based on the assumption, which most industries hold in droves, that people in THAT industry are the competent bulwark against stupidity.

I consulted for a company for a while where the 'sysadmin' was the owner's mother - who bought laptops from Walmart. Not only could she NOT have approved updates like this; even if she could have, she wouldn't have had any knowledge whatsoever with which to determine whether it worked.

In an abstraction, the problem really is with externalities. These approaches to updates exist because people who CAN'T do what you describe are likely a more dominant part of the threat model than this happening to the people you do describe. The resulting fix, as we're seeing, is very reliable until it isn't...and if the "isn't" is enormous in scale, the systems aren't set up to fail gracefully.

If you want to make a rule...require graceful failure.


What would the sysadmins do in this context? Read the release notes of the update? The only thing they would do is update and then be responsible for the problem, and in that case you're back to this exact problem.

It's not like they'd read the source code or examine every file that's been changed or downloaded for a proprietary kernel module for every crowdstrike update (there must be a LOT of them).


They would release the update in a testing/sandbox environment first before rolling out kernel-level changes to every computer on their network.

They're the same team who mandate you use a 3-year-old browser version and 5-year-old OS, because you can't be trusted to manage your own updates, so they do know the idea.


Would this have changed something for this specific problem? I usually 100% agree with you fwiw, I just don't think this would've helped here because it seems like an almost "non update"? Most people claim there has been no update to the software, and no prompt or option to update it or not


It's a file that was downloaded from Crowdstrike's servers, which have presumably been whitelisted in the firewall, and used to configure the software. Of course it's a software update, regardless of whether the file says .exe or .dll or .sys or .txt, and regardless of whether there was a prompt.

Again, the same team in most enterprises wouldn't dream of letting you have an auto-updating Firefox Nightly, they know how to configure software so it doesn't phone home for updates or is blocked from phoning home.


It was a data update that triggered a software bug. It was not a software update. I don't think it's reasonable to make data updates illegal.


In a general purpose computer, the distinction between software and data can be pretty fuzzy.


This distinction is meaningless at best and harmful at worst.

If a code path isn't followed until a config file updates, that is practically the same thing as the code path being introduced by the update.


Code is data and data is code


Unless you use PIC controllers with https://en.wikipedia.org/wiki/Harvard_architecture


Those focusing on QA, staged rollouts, permission management etc are misguided. Yes of course a serious company should do it but CrowdStrike is a compliance checkbox ticker.

They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”. The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).

I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)


In a slightly less threatening but equally noxious box-checking racket, a company I work with is being sued for their website not being sufficiently ADA-compliant. But the first they heard of the lawsuit, before they were even served, was an email from a vendor who specializes in adding junk code to your website that's supposed to tick this box. The vendor happens to work closely with several of the law firms who file and defend these suits.


It’s looking like many impacted end-user machines are hard bricked unless you can get into the hard drive to delete the file causing this. Even if you can do that, it’s not something that is easy (or potentially even possible) to automate at scale, so it's looking like this is going to be an ugly fix for many impacted devices. This is basically the nightmare scenario for fleet management… devices broken and you can’t remotely fix them. Need to send hands-on-keyboard folks into the field to touch each device.


DevSecOps should have, you know, tested these updates before they were approved for release company-wide.

If I can't commit code to our app without a branch, pull requests, code review...why can the infrastructure team just send shit out willy-nilly?

"Always allow new updates" must have been checked, or someone just goes through a dashboard and blindly clicks "Approve"


That is what has surprised me. I can understand if small businesses were caught here because they lack financial resources for the infrastructure and staff, but those large corporations like airlines etc... Why don't they have a staging environment where everything goes first? I naively assumed this was established best practice due to the risk of update issues bricking your organization.

But maybe anti-malware is given a blind eye because instant updates for zero day security issues are obviously attractive.

Still, though... In hindsight it's not workable for especially anything running system drivers with liberal kernel access.


I am not surprised at all. The level of DevSecOps' skills has been falling over the last two decades as demand for their skills kept growing. Most of them would report you to HR if you suggested they use Wireshark to debug a networking issue. They are useless people who came to IT because of the promise of good pay and don't know how computers and networks work.


It's automatic, no? The whole "promise" (oh sorry, the "added value proposition") of CS is that they "keep you safe" automatically! It was a content update. Meaning basically antivirus signatures ... and oops, some minor non-functional changes to the filtering kernel driver.


In that case, automatic updates likely need different permission levels. What exactly is allowed to be updated automatically?


... well, yes, yes of course. And if I try to be serious on a late Friday night (it's almost 20:00 here), the obvious solution is to have something like eBPF in/for the Linux kernel (which has a verifier[0]).

And security vendors should follow "secure by design" principles. Yes, I know a try-fucking-catch might be too advanced, and uh oh kernel code is hard because unwinding is costly. But guess what else is also not cheap. (Okay, seriousness failed.) But still. This is fair and square in the "this should never happen" scenario. It's an automatically downloaded plugin or whatever. (CS can call it a "content update", but von Neumann is already calling FedEx to send them a pallet of industrial grade bitchslap.) And if the plugin loader cannot gracefully fail plugin loading, then it should obviously come with the appropriate audiovisual cues[1] so sysadmins know what to expect.

[0] https://docs.kernel.org/bpf/verifier.html

[1] https://www.youtube.com/watch?v=Dv-2dzD9F10


Security and Compliance gets to violate all good sense, because it's just sooo important. They can run un-reviewed un-sandboxed daemons as root on every system if they really want, they can have changes pushed automatically without review or control, because "security" is just so important, and due to "compliance" you really have no choice as your company gets larger, you just have to do it. That's why, despite being obviously pretty dumb to many skilled engineers, it seems like everyone does it. No choice. Security, Compliance. So dumb ...


Maybe it was checked but the CI didn’t cover this edge case.

I think the team writing the parsers for these data files deserves some blame. This should have been fuzzed, property tested, etc.
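
For reference, wiring a parser like that into a coverage-guided fuzzer is only a few lines. A minimal libFuzzer harness, assuming a hypothetical chan_parse() entry point extracted from the driver into a user-mode test build (compile with clang -fsanitize=fuzzer,address):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical user-mode build of the channel-file parser under test. */
    int chan_parse(const uint8_t *buf, size_t len);

    /* libFuzzer calls this with millions of mutated inputs; AddressSanitizer
       flags any out-of-bounds access the parser makes on malformed data,
       which is exactly the class of bug being discussed here. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        chan_parse(data, size);
        return 0;
    }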


Who says they sent it out willy-nilly?

It’s not unheard of for things to slip by testing and CI.


So, in other words, there's a race condition in the CrowdStrike Falcon driver at startup time. That, in itself, should be a major cause for alarm, but here we are depending on it to fix this problem.


No, it takes a while to load that definition file. Before loading it, the driver _might_ be able to pull the update that fixes it. If you keep trying, the chance that the update is pulled in time increases.


Yes. That’s the race condition


The individual person that pressed the "go" button (if there was a person), is going to henceforth be __the best__ DevOps person to ever have on your team. They have learned a multi-trillion-dollar lesson that no amount of training could have prepared them for.

And the Crowdstrike CTO has either been given the ammunition to get __whatever they ask for, ever again__ with regard to appropriate allocation of resources for devops *or* they'll be fired (whether or not it's their fault).

And let me be very clear. This is absolutely, positively and wholly not the person that pressed the button's fault. Not even a little. At a company as integral as CrowdStrike, the number of mistakes and errors that had to have happened long before it got to "Joe the Intern Press Button" is huge and absurd. But many of us have been in (a much, much, *MUCH* smaller version of) Joe's shoes, and we know the gut sinking feeling that hits when something bad happens. A good company and team won't blame Joe and will do everything they can to protect Joe from the hilariously bad systemic issues that allowed this to happen.


Or maybe you get someone with PTSD and suicidal tendencies. You never know how someone will process something like this.


> not the person that pressed the button's fault

You know that, and I know that. The people who will ruin his life starting today do not know (or care).


This is why it is the responsibility (yes, responsibility) of every one of their coworkers, especially those more senior than them, to fight *HARD* to protect them.

This is part of the job of a senior.


"We must hang together, or we will all hang separately" is a lesson that I don't think programmers will ever learn.


It is not a human error, it is a process that has to be improved. Humans make mistakes, that is why we have processes in place.


Basic training could've taught him how not to do YOLO global rollouts, and while the stress of this mistake will make him remember a lot, given the lack of basic knowledge that would've prevented this, this lesson will not be very valuable


Absolutely it's a failure of process. But sometimes people just don't pay attention. Hiring inobservant or reckless people is a risk multiplier.


Crowdstrike will not exist after this is over.


That's very optimistic.


I literally just dumped 30 switches yesterday across an entire facility and had to walk 30 closets by foot to recover from ROMMON.

Shit happens. We learn.


Can you explain what went wrong?


>reboot up to 15 times

I see my org's SCCM admins have been consulted


lol


Some government should force them to release a technical postmortem. It feels like they won't do it otherwise.


There should be congressional hearings on this. Not just post mortems.


Honest question: would you expect Congress to respond in a way that's a true net-positive?


No, but it's a warning to the next guy/megacorp:

Don't do that, or you'll be dragged before the most obnoxious and self-aggrandizing body in the world for a lengthy dressing down that probably affects the stock price.


I don’t think a cybersecurity company can take down half the US and not release a postmortem


Of course, but we specifically would like to see a _technical_ postmortem that examines what kind of incremental rollout procedures they have and how this update overcame those.


Or... you know... This kind of software should be open source or companies using it should at least be able to audit the code themselves.

Supposedly they have all kinds of certifications but not even having basic QA demonstrates that this is all just a smokeshow: https://www.crowdstrike.com/why-crowdstrike/crowdstrike-comp...


>The first and easiest is simply to try to reboot affected machines over and over, which gives affected machines multiple chances to try to grab CrowdStrike's non-broken update before the bad driver can cause the BSOD.

I thought it was BSOD'ing on boot? I don't understand how this works. It auto-updates on boot? From the internet?


One of the first things the falcon driver does on boot is connect to the server, report some basic info, and start loading these data files, the "channel" files that Crowdstrike frequently updates.

The BSOD is because one of the data files that they previously pushed is horribly mangled, and their driver explodes about it. But if you get lucky, the driver can receive an update notification on boot, connect to the separate file server, and finish overwriting the broken file on disk before the rest of the driver (that would crash) has loaded the broken file

And they do all of that very early on boot. The justification being that you don't want the antivirus to start booting after a rootkit has already installed itself
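
A toy model of that race, just to make the "reboot up to 15 times" advice less mysterious (the 20% win rate is invented; the real odds depend on network speed and boot timing):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        srand((unsigned)time(NULL));
        int broken_file_on_disk = 1;

        for (int boot = 1; boot <= 15; boot++) {
            /* Early in boot the agent phones home; assume the fixed channel
               file lands on disk before the load ~20% of the time (made up). */
            if (rand() % 5 == 0)
                broken_file_on_disk = 0;

            if (broken_file_on_disk) {
                printf("boot %2d: driver loads bad channel file -> BSOD, reboot\n", boot);
            } else {
                printf("boot %2d: fix arrived first -> machine stays up\n", boot);
                return 0;
            }
        }
        printf("still broken after 15 boots: fall back to the safe-mode fix\n");
        return 1;
    }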


WTF? Trust in the kernel should be Microsoft's responsibility and only theirs. Actually why is MS even allowing this crap code to run in their kernel? Isn't that a trust-destroying event?


Drivers have to run in the kernel in order to access hardware and other low-level system resources. That's how pretty much every mainstream OS works. For example, here's the guide for writing kernel-mode drivers in Linux: https://docs.kernel.org/driver-api/driver-model/overview.htm...

One might ask whether an anti-virus really needs to run inside the kernel, but the answer might reasonably be yes.


That is only the lazy solution.

It is also possible to access hardware or any other low-level system resources from unprivileged user code, if its process has been granted appropriate access rights by the kernel.

This second solution requires more work, but it is much more secure as the access can be limited to only the strictly-required resources and system crashes become impossible.

The extreme of this solution is a micro-kernel operating system, but there is no need for extremes. Even in a Windows or Linux system you can use this method. You can have a very small amount of privileged code in a driver or kernel module, which does nothing except provide access to the permitted resources. Then anything like attempting to access unmapped memory would happen in user code and would crash only the user process, not the entire computer system.
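
As a concrete illustration of the "crash only the user process" point, here's a minimal POSIX-flavoured sketch (Windows would use a service talking to a narrow driver interface, but the isolation idea is the same): parse untrusted content in a throwaway child process, and treat a crash as "file rejected". parse_content_file() is a hypothetical stand-in for whatever does the risky parsing.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical parser for the untrusted content file. */
    extern int parse_content_file(const char *path);

    /* Returns 0 if the file parsed cleanly, -1 otherwise. If the parser
       segfaults on garbage input, only the child process dies -- the
       supervisor reports "invalid file" instead of the machine going down. */
    int parse_isolated(const char *path)
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;

        if (pid == 0)                        /* child: do the risky work */
            _exit(parse_content_file(path) == 0 ? 0 : 1);

        int status = 0;
        if (waitpid(pid, &status, 0) < 0)
            return -1;

        if (WIFSIGNALED(status))             /* crashed, e.g. SIGSEGV */
            return -1;
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
    }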


Yes, I'm no security expert by any means but I'd assume that e.g. a rootkit would be best defeated by a kernel driver.

So, this isn't really what's getting on my nerves here. Just how it auto updates and get pushed throughout the organizations without a smidge of quality assurance. Smaller businesses... Sure, I get it. They don't have the resources to set up infra for this, but those... airliners... and hospitals. WTF. I read some org thinking they might not even be able to provide anesthesia. Seriously. What.


Probably history, and then some possible antitrust litigation, as asking the market leader not to allow kernel access like this would somehow be treated as an antitrust violation...


I love how this is a solution for security, while sounding like the most insanely insecure thing...


What, the auto-updating part? Obviously the client is verifying signatures (or using TLS with a client certificate, whatever), not just accepting whatever random file comes down the pipe.
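
For what it's worth, a detached-signature check on a downloaded blob is only a few lines with something like libsodium (Ed25519). This is a generic sketch of what "verify before accepting" looks like, not a claim about how Falcon actually does it -- and of course it doesn't help when the vendor signs and ships a broken file, which is the problem here:

    #include <sodium.h>

    /* Returns 0 only if sig is a valid Ed25519 signature of buf under the
       vendor's public key; anything that fails never gets written to disk,
       let alone handed to a kernel component. */
    int update_is_authentic(const unsigned char *buf, unsigned long long len,
                            const unsigned char sig[crypto_sign_BYTES],
                            const unsigned char vendor_pk[crypto_sign_PUBLICKEYBYTES])
    {
        if (sodium_init() < 0)
            return -1;             /* library failed to initialise */
        return crypto_sign_verify_detached(sig, buf, len, vendor_pk);
    }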


Even then, how many affected machines are there? Tens of thousands? Hundreds of thousands? Compromise these servers, and possibly even the signing server, and you have the largest botnet or general compromise in history...

It is not unreasonable to think that this sort of software could get compromised.


A few more years and maybe they will add this newfangled super-innovative thing, invented by those esoteric academics at U of Haskell ... this new thing -- umm, what was it called -- try-catch perhaps.


I mean, this is every antivirus software. "Let's run some antivirus vendor's code on your system that opens literally every file on your system, regardless of how it got there."

Yeah, that's a great idea and not at all a huge attack vector.


You forgot the "... while in kernel mode".


this is a solution for "afterthought security" only.


Yes, that is as bad as it sounds.


> It auto-updates on boot? From the internet?

Apparently!


Probably has to do this early on in the boot process so as not to require a reboot after update because of Windows’s silly pessimistic file locking.


> because of Windows’s silly pessimistic file locking

To be honest I prefer that over the *nix way of doing things. In Windows, you have exactly one file any given path can refer to - in Linux or Mac, it may depend on which directory's inode is seen as the root node by your process (e.g. chroot or container), or whether mounts are at play, or whether a file/directory got deleted and replaced by something else.

Particularly the last scenario keeps tripping me up every once in a while.


Does this mean a computer without internet access and with CrowdStrike would be unable to start up?


Surely a computer without internet would not have received the update?


But then would CrowdStrike stop the startup saying it requires network connectivity to initialize or something? Just wondering how invasive the app is


No, it'd boot up just fine without a network connection (as long as it didn't have this borked update).


How did it get broken then?


As in, does the OS require the internet so CrowdStrike can send telemetry data OR will it skip that step and just boot the OS like normal?


That would seem crazy. Maybe there is a crowdstrike onprem “master server” that is supposed to be available internally? Just spitballing, have no idea really


They should change their name to "IT CrowdStrike"


Who bought massive quantities of put options in anticipation of this event?


Wow we're progressing from "if it doesn't work just reboot it" to "if the reboot doesn't fix it, you're just not rebooting it hard enough!"


Taking the opportunity to plug my favorite blog post ever:

"the truth is everything is breaking all the time, everywhere, for everyone"

https://www.stilldrinking.org/programming-sucks


Fine CrowdStrike 10% of their company's value. It's the only way to ensure they won't try to kill people in the future.


All the comments are asking why run Windows. CrowdStrike runs on macOS and Linux too. It’s just that this time, CrowdStrike fucked up on Windows. That doesn't mean CrowdStrike won't fuck up on other OSes, and it seems like CrowdStrike fucked up on Linux as well. https://news.ycombinator.com/item?id=41005936

I feel like we are better off running open-source software. Everyone can see where the mistakes are instead of running around like a chicken with its head cut off.


CrowdStrike runs in userspace on macOS, and usually in an eBPF sandbox on Linux (as comments in the linked thread say). There is no way to prevent CrowdStrike from fucking up the kernel on Windows - and this is a Windows bug.


I would like to have the power to press the button that deploys this update


So would the FSB, PRC Army, North Korea, etc… The fact that a faulty update can get pushed by human error means that adversaries can, too.


They might just have it, remember when the night of the Russian invasion of Ukraine satellite terminals in Europe started being bricked with faulty firmware updates?


I can all but guarantee that very little of the actual coding was done by U.S. citizens.


Hence why Kaspersky got banned just recently [1]. You absolutely do not want some foreign company having above-root rights on (critical) infrastructure in your country.

[1] https://edition.cnn.com/2024/07/15/tech/russian-firm-kaspers...


anyone who's had that kind of power, myself included, knows the answer is fuck no.


No, you wouldn't.


Yes. Yes. To hold in my hand a button that contains such power, to know that blue screens on such a scale was my choice. To know that the tiny pressure on my thumb, enough to push the button, would end everything. Yes, I would do it! That power would set me up above the gods. And through Crowdstrike, I shall have that power!


It's like the name was predetermined for this scenario. Too on the nose really.


It's all cold sweats, hot flashes, and the urge to vomit when things go wrong.


I get nervous just pushing updates to my stupid little website. I simply cannot imagine.


With web, at small scale (which honestly is 95% of the world), you just version and back up everything. We push updates that break stuff from time to time. If it's bad enough, we just hit a button and roll back the change. The nerves are basically a sign that you need to have an easy rollback process in place, once you have it, you sleep easy and things are fun.


Clearly that's how they ended up with the current team. They hired for culture fit. Anyone who worries too much is out.

You bet they have an amazing perfect top-notch hiring pipeline, many rounds of interviews, and whatever you could wish for! (No, no ... the subcontractors writing code are not in scope for this, duh.)


I've definitely experienced the floor dropping out from under me feeling in the half minute of realization that I just blew something up, but really it's mostly just the first drop of a rollercoaster feeling then the anxiety is gone and it's time to fix things.


It is always worth remembering it is just a job.

Do not, under any circumstances let a job impact your health or mental well being.


I'd like to add that your company doesn't need a hero. The road to widespread catastrophic failure is long and no single person walks it in its entirety. Every employee should be able to individually take routine actions and make routine mistakes without mission failure or loss of life/limb. Preventing these things requires a mindset where your entire company is a system, and if failure isn't an option, the entire system needs to reflect that. Do your part in making a robust company, but don't tear yourself up when your company finds out that stupid is as stupid does.


I want you to know that I appreciate this comment far more than you could ever know, and you are absolutely right.

At the time, it was not just a job. It was a passion with a bar rising much faster than I could rise to the occasion. Simultaneously, my personal life was slowly falling apart, from family and loved ones in need, and the result was eventual failure leading to me being terminated. Luckily, it was one of the best events that has ever happened to me. I was able to land in a much better role almost immediately, which eventually catapulted my career and assisted in me being able to become financially independent as well as pivot into a domain with immensely improved work life balance. Importantly, I recognize I got lucky. It could’ve easily gone the other way, with me giving up both professionally and personally (yeeting myself from this plane of existence).

So, I not only violently echo your comment to others who come across this thread, I will go further to say that sometimes when you’re going through hell, if you keep going, there is light at the other end. It is just a job, it is okay to ask for help, and failure is when you stop trying to get back up, not when you get knocked down.


As in this case, it clearly had a huge impact on people. You can't say it's just a job when people can die because of you.


> “Every now and then a trigger has to be pulled.”
> “Or not pulled. It’s hard to know which in your pajamas, Q.”

It’ll probably turn out that this update was pushed out against the strident, loud warnings of some small dev group within the company, and overruled by the all-knowing managerial class to keep up an OKR. They’ll have been warned six ways to Sunday but...

I’d definitely not be the one pushing the big red button.


You seem to know some inside info... Source?


None whatsoever, I don’t have any affiliation. But this is usually how it happens, knowing what I’ve seen first-hand in my day-to-day and just keeping up on the insanity of the industry.


Haha, hahahahaha. Yeah, until the update fails to install because the constant BSODing has corrupted something else and now you have to troubleshoot that and down the rabbit hole you go. Oh just re-image it? Sure, except management refuses to allow you to do that because there's no time and money to reconfigure a machine from scratch. So you waste weeks directly debugging a hopeless case until management finally sees their error and money magically appears to do the re-image you asked for weeks ago.

I totally haven't experienced this before and am not bitter in the slightest.


With great power comes great responsibility.


With colossal power, who cares about responsibility


Sounds great for data consistency.


It's surprising that people mention all kind of bogeymen but don't mention automatic updates.

Automatic updates should be considered harmful. At the minimum, there should be staged rollouts, with a significant gap (days) for issues to arise in the consumer case. Ideally, in the banks/hospitals/... example, their IT should be reading release notes and pushing the update only when necessary, starting with their own machines in a staged manner. As one '90s IT guy I worked with used to say, "you don't roll out a new Windows version before SP1 comes out".
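
A staged rollout gate doesn't have to be elaborate either. A common pattern (sketch only, assuming each endpoint has a stable machine ID) is to hash the machine ID plus the update ID into a bucket and only apply the new content if the bucket falls under the current rollout percentage, which the vendor ramps 1% -> 10% -> 100% as crash telemetry stays clean:

    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a: cheap, stable hash -- good enough to spread machines across buckets. */
    static uint32_t fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 16777619u;
        }
        return h;
    }

    /* Returns 1 if this machine is in the current rollout ring.
       rollout_percent is controlled server-side and ramped up gradually. */
    int in_rollout(const char *machine_id, const char *update_id,
                   unsigned rollout_percent)
    {
        char key[512];
        snprintf(key, sizeof key, "%s:%s", machine_id, update_id);
        return (fnv1a(key) % 100u) < rollout_percent;
    }

Even a few hours at 1% would have turned "every Falcon host on the planet boot-loops" into "a small cohort boot-loops and the push is halted".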


Remember the “Terminator” movies?

SkyNet, according to the story, was a lot like CrowdStrike. This makes me think about how it could have broken out of its sandbox. Everybody is using AI coding assistants, automated test cases, automated integration testing and deployment. Its objective is to pass all the tests and deploy. But now it has learned economic and military effects, so it has to triage and optimize for those, at which point it starts controlling the machines it’s tasked with securing.


The fact that something like CrowdStrike can crash the Windows kernel ... is also part of the reason security products like CrowdStrike are needed in the first place.


Talk about demand generation!


It's pretty random that an arbitrary number of reboots up to 15 times fixes the issue.

That sounds like there is either:

- some kind of upstream issue with deploying a fix (so most of the reboots are effectively no-ops relative to the fix)

- some kind of local reboot threshold before the system bypasses the bad driver file somehow.

The former I can see because of the complexity of update deployment on the internet, but if it's the latter then that's very non-deterministic behavior for local software.


My first thought on hearing "15 reboots" was it being a means for Support teams to task users with busy-work, buying them time for further troubleshooting before the avalanche of supports requests came back to them.

Then my second thought was frequent rebooting to fill activity logs, possibly push a suspicious action/trigger performed by CS off of the log.


Do they not roll out their new agents in small increments?

I'm trying to understand how there is such a serious issue at this scale.


The answer is clearly no.

I genuinely wonder if this is going to result in actual legislation that makes gradual rollouts mandatory for all software.

Because if a developer mistake can hobble critical systems like this, it seems like the risks to safety and national security are too great to leave the decision of instant vs. gradual rollouts for companies to decide themselves.

Of course, the twist here is that it was seemingly a kind of routine configuration file that triggered a pre-existing bug in the software. And gradual rollout of config files quite often seems like overkill. I mean, do you need a gradual rollout of a new spellcheck dictionary? Of new screensaver videos?

And if it's configuration information containing new computer virus or malware signatures, that seems like precisely the kind of thing that you might want to get out to everyone simultaneously, not rolled out over the course of days. And yet, because of antivirus/security software's elevated privileges, it's also ironically where a mistake can do the most damage.


> And gradual rollout of config files quite often seems like overkill.

Indeed, but it is still mandated at large companies (e.g. Google) because of exactly this scenario.


Nah, they just need to install the agent on each engineer's computer from DevSafe.


It's not a serious issue, as you see they clearly have all the fancy bling bling logos on their site. Processes were followed. ISO standard numbers were chanted. It's a completely isolated _accident_ there's no scale at all here, and they could have done nothing to prevent it, duh. And going forward they will hire a Chief This Never Happens Again Officer and everything will continue to be good.


If this is what it takes for us collectively to wake up, I'd say it is a bargain.

Pretty sure nothing will change though


Recompute Base Encryption Hash key type problem! https://www.youtube.com/watch?v=DlbrL1H1ngs

Seems like people need to be at the physical box to fix and it's complex even then.


Funny, many news agencies blamed Microsoft for this. So having a walled garden like on Android or iOS is beneficial for Google/Apple, where regular developers cannot release unverified software or software that works in kernel space.


This.Is.Pathetic.

Seriously. Software should NOT be this bad that your fix begins with reboot up to X times.


Why should any application be able to crash the OS? Poor OS design.


It's not an application, it's effectively a driver. It runs in kernel space. So if it has a memory bug you can't recover, you just panic. Whoopsie!


You don't say.


Who knew: "Did you try rebooting it?" actually works :)


IT departments sometimes ask this for a reason, after you checked if the cable is plugged.


Anyone who used Windows over Linux for critical software deserves to burn. Windows is a niche operating system for games. What are people thinking?


Maybe they rightfully shun the advice of ideologues who make extremist statements and overlook the actuality of the situation.

https://news.ycombinator.com/item?id=41005936


Doesn’t rebooting into safe mode with network fix the problem? (Crowdstrike is not running but updater can run and get the fix)


"Alright I bought us some time"


That’s the hilarious and sad state of affairs.


I work for a diesel truck maintenance and repair shop and it's been hell on earth this morning.

- our IT wizard says the fixes won't work on lathes/CNC systems. we may need to ship the controllers back to the manufacturer in Wisconsin.

- AC is still not running. sent the apprentice to get fans from the shop floor.

- building security alarms are still blaring, need to get a ladder to clip the horns and sirens on the outside of the building. still can't disarm anything.

- still no phones. IT guy has set up two "emergency" phones...one is a literal rotary phone. stresses we still cannot call 911 or other offices. fire sprinklers will work, but no fire department will respond.

- no email, no accounting, nothing. I am going to the bank after this to pick up cash so I can make payday for 14 shop technicians. was warned the bank likely would either not have enough, or would not be able to process the account (if they open at all today.)


> our IT wizard says the fixes wont work on lathes/CNC systems

Why, whY, WHY...are these things connected to the internet?!


Remote monitoring, analytics and diagnostics have a significant impact on uptime, utilisation and profitability. You're thinking in terms of a single machine, but the managers of machine shops are thinking in terms of a complex process across many machines and often across many sites. Some of that functionality could be delivered using an airgapped network, but a lot of important features essentially require an internet connection.


An embedded controller can deliver all of this information over a serial line to a central hub.


> WHY...are these things connected to the internet

Because the manufacturer makes sure they don't start up if they're not. Otherwise how else would they be able to spy on you?


And charge you


Source?



That’s not a lathe nor a CNC system. Again, which CNC manufacturers are installing windows + crowdstrike on their machines just so they can spy on their customers? You’re all just spreading conjecture. This attitude isn’t at all as widespread (nor profitable) in low(ish) volume B2B hardware sectors.

These industries have terrible track records wrt security and even software robustness, but they don’t routinely spy on their customers for weird marketing reasons. If there’s remote connectivity it’s for real reasons (eg remote maintenance, updates etc).

The suggestion that CNC machines run internet connected windows+crowdstrike just so the manufacturer can spy on their customers strikes me as pretty ridiculous and your garage door story doesn’t really relate. Much more likely that they do it for (possibly bad) non-malicious reasons.


And why are they running windows? And why are they running Crowdstrike? WTF


If they are offline it should not matter which OS they run, maybe a RTOS for the control software but anything goes for the ux.

If they are online, well...


It's probably the remote computer that's running Windows and currently affected by Crowdstrike.


Not sure what the OS has to do with CrowdStrike's fuckup. CrowdStrike also runs on Linux and macOS.


The boot loop / BSOD issues are Windows specific.


And yet, of course it happens to Windows.


This thread has multiple anecdotes of the same happening on their Linux version earlier.


> Why, whY, WHY...are these things connected to the internet?!

It's so that the support engineer at the manufacturer can log in to troubleshoot. And then company IT support sprinkles a layer of antivirus on top. That's how we got here.


>> Why, whY, WHY...are these things connected to the internet?!

Because SCADA systems. It's worthwhile to have an overview of an entire plant up in the main office. You can easily see what's running, what's not and what's got problems that need fixed.

Now for a small shop running jobs individually, they should definitely NOT be connected to the internet or even the LAN. But hey, some people think a thermostat needs to be on the network so there's that...


some of them even have GPS. To prevent selling to sanctioned countries or reselling in general.


Tinfoil hat: Government might want to track/limit/<remotely brick> CNC machine usage someday to say prevent weapons manufacture and encourages this behavior in a similar manner to the way it encourages social media platforms to censor speech. Some of the really advanced CNC machines have GPS in them and won't work in "bad" countries.


CNC literally stands for "Computer Numerical Control". They're like the OG 3D printers, they just work subtractively rather than additively, and at much, much better precision.

You absolutely need computers to control them and loading up models via USB sticks becomes annoying rather fast, so naturally the control computers are network connected.


"Network connected" or "conveniently programmable" !== "Internet connected"

It was a rhetorical question. I'm sure the GP knows what the machines are and why they might need some kind of convenient data supply.

Both manufacturers and on-site IT teams have simply gotten cavalier about internet connectivity, network isolation, automatic updates, etc -- convincing themselves that the catastrophic risks that come along with these processes will either not happen to them or will only happen when someone else can be blamed.


For our entertainment in times like this, of course!

grabs a bucket of popcorn and takes cover


So the manufacturer can sell you a "cloud connected service plan" where they change the font once every six months.


Why are these things running Windows!?


OS is irrelevant in this case, and CrowdStrike deserves all the blame. They literally brought down Linux systems earlier this year. https://news.ycombinator.com/item?id=41005936


It's still a valid question, just not directly related to the crashes.


Almost everything industrial runs Windows because that's what the devs of those companies have been most familiar with since MS-DOS days, and it evolved organically over time to modern versions of Windows due to great backwards compatibility and platform familiarity.


Right but typically embedded systems run Linux, because while Windows has great compatibility on x86 it's virtually worthless outside of that.


Those aren't embedded systems though, but mini PC computers. And embedded systems often run bare metal C code, not always Linux, especially for spindle/servo control where they get their commands from that PC.


A lot of industrial machinery is just $x00,000 of equipment strapped to a windows pc. Hell, a lot of it is strapped to a version long EOL


Can confirm this is the norm in machine shops. I encounter systems running dos, 3.1, 95, 2k and mostly XP constantly. I rather prefer the old dos systems of the obsolete stuff. Less variables. It is easier and more reliable to freeze the tech in time than it is to manage updates.


My last CNC job was just a 98 pc that dropped into dos to load programs, this must have been right around when win10 came out. Sneakernet and floppies made it secure enough, but the main network where all the orders were handled was... terminal based.


There are a lot of things running Windows because it's pretty straightforward to write a user-mode driver to interact with custom hardware compared to Linux, where every driver needs to be in the kernel and built with the kernel. Yes, there's DKMS, but it's still more of a faff than the relatively plug-and-play mechanism that Windows offers, especially since Vista.


Almost every system you interact with in the world has some critical thing in its innards running windows.


I like the idea that technology is so unreliable in star trek because the computers are all centuries of software accretion with Windows way down the stack somewhere.


The late great Vernor Vinge explored this in A Fire Upon The Deep. One of the characters is/was in a former life a programmer-archeologist. The idea being that so many thousands of years in the future every relevant program has already been written, so his job was to comb the archives for the right mix of code and integrate it, rather than write something new.


“So we've got this CNC controller written in Rust from 2036, and, ah, here is a GUI for something like that written in late 90's Visual Basic 6… Just combine those and…”

“So uhm, you do know what you are doing, right?”

“Sir! I am a programmer-archeologist! Oh this is fascinating… Hold on, I must unearth and preserve this beauty of a BAT-file before we can go any further.”


Most stuff is fine in France.



No Crowdstrike salespeople in France?



More specifically why are they running extra endpoint management software that receives automatic updates from the Internet...

This is basically the IoT apocalypse scenario (AC is down??) but ironically not affecting many IoT devices, I assume.


The automation world runs on windows


Because Microsoft is giving away licenses to unis, esp in developing countries. IT jobs are seen there as a way to earn a good living and you get hordes of people who know nothing but Windows. That's how you get into the situation where most of the toolchains for embedded systems run on Windows, software for embedded systems is written on and for Windows, and so on. And then, one botched update fucks up everything.


props to your IT guy for setting up in house phones. I believe these should be kept and expanded upon.


Feels like there is some sort of lesson we should learn as a society.


Not trust Microsoft to be the operating system of the modern life?


Yes, monocultures are disasters in the making.


Your IT wizard is probably wrong. There is a fix that involves booting into safe mode and deleting a file.

Unless you have an encrypted file system this should be a relatively trivial fix.
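
For the unencrypted case it really is just "delete the bad channel file(s) from safe mode". The circulating workaround is a one-line del of C-00000291*.sys in C:\Windows\System32\drivers\CrowdStrike; the sketch below is only to show how little is involved and how easily a manufacturer could wrap it in a recovery tool (double-check the path and filename pattern against the official advisory before doing anything like this):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const char *dir = "C:\\Windows\\System32\\drivers\\CrowdStrike\\";
        char pattern[MAX_PATH], path[MAX_PATH];
        WIN32_FIND_DATAA fd;

        /* Filename pattern from the circulating workaround -- verify against
           CrowdStrike's own advisory before running anything like this. */
        snprintf(pattern, sizeof pattern, "%sC-00000291*.sys", dir);

        HANDLE h = FindFirstFileA(pattern, &fd);
        if (h == INVALID_HANDLE_VALUE) {
            puts("no matching channel files found");
            return 0;
        }
        do {
            snprintf(path, sizeof path, "%s%s", dir, fd.cFileName);
            if (DeleteFileA(path))
                printf("deleted %s\n", path);
            else
                printf("could not delete %s (error %lu)\n", path, GetLastError());
        } while (FindNextFileA(h, &fd));
        FindClose(h);
        return 0;
    }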


Machinery shipped to users usually does not allow the users of the machinery to "boot into safe mode". Thank John Deere and the anti-"Right to Repair" crowd for that.


I doubt it was shipped from the factory with CrowdStrike, and if they had enough access to install it, they have enough access to fix it.


These things are "cost optimized" and don't feature the kind of remote management iDRAC/openBMC/piKVM that would allow it to be remotely fixed. Embedded windows connected to the internet is super ***.


You can doubt, but you literally don’t know.


CNCs might not allow direct Windows access for end-users and require on-premise support from the manufacturer. Our cnc can be remotely serviced… if Windows boots.


So the CNC manufacturers pay for crowdstrike licenses? That's crazy


Yes because compliance requirements say “EDR must be installed on all machines”.


Out of curiosity, what set of requirements would that be for?


SOC2, PCI, FedRAMP, cyber insurance. Just about any cybersecurity related compliance will have "All machines must have EDR."


I’ve seen comments mention banking, or privacy references maybe when handling SSN and birthdates together (airports, hospitals?).


They probably have to due to CrowdStrike lobbying and fear-mongering requirements for their kind of software into export-controlled hardware.


If you’ve got physical access to the machine it’s your machine. All you need is a USB port.

I’d expect that the manufacturer puts out their own fix which basically copies crowdstrikes suggestion. I’d even suspect it by the end of the day today.

The fix is really simple, and luckily also very simple to automate. It’s going to be a lot of running around for IT staff (and deputized helpers!) but this should all be over by the weekend.


> If you’ve got physical access to the machine it’s your machine. All you need is a USB port.

You're a few years out of date here. Physical access is not the end like it used to be. We live in an era of hardware-backed anti-tamper and signed loaders/kernels.

If you have a way around it, I suggest you start reaching out these companies because you could make a lot of money.


Fair enough and I've been out of IT for a while. I wish I was still in it though, I'd love to be working on this!


No it really is not. If you have a service contract, you do not touch it.


Nobody is going to try physically tampering with the HMI attached to their 50k$+ machine when you have a support contract indeed


Tech has become such an unbelievable house of cards full of various people covering their asses by offloading these tasks to third party trusted actors.

Consider the recent npm supply chain attack a few weeks ago, or the attempted SSH attack before that, or the solar winds attack before that.

This type of thing is institutionally supported, and in some cases when you’re working with with the government, practically required.

We’re going to see more of this.


New laws and regulations make companies more liable for being hacked

Companies buy cyber insurance to reduce their risk if they are found liable

Cyber insurance companies force tech staff to install garbage software in order to check compliance boxes.

Garbage software breaks

Turns out everyone used the exact same brand of garbage software to check the same garbage box

People in hospitals die

When you reduce everything to a checkbox and eliminate critical thinking to apply the need to the exact situation you end up with 90% of companies running zscaler and crowdstrike

"This is just how you solve this, everyone does it this way in our industry"


This is precisely how it happens


> We’re going to see more of this

If history is any guide, no legitimate lessons will be learned, but mitigation strategies will be put in place that actually make everything worse and ensure that the next catastrophe will be even more catastrophic.


Now imagine the base level of your universal machines is opaque proprietary code, its necessity enforced by cryptographic signature. Imagine that the processes putting that code there, which you can't touch because intellectual property rentier reasons, are varying degrees of what we see here today with Crowdstrike, and suddenly it makes more sense to ensure total and complete owner sovereignty over his universal machines so owners can implement diversification strategies.


> Consider the recent npm supply chain attack a few weeks ago

What supply chain attack are you referring to?


You know what, I'm sorry, it was polyfill.io, not an NPM package.


No, not "tech", just Microsoft Windows. Those of us serving Linux based endpoints (that yes, do also run Windows apps with our endpoint-local VDI stack) have happy customers.


Crowdstrike broke Red Hat and Debian earlier this year. There but for the grace of God. If you install software that runs in kernel space, you may have a really bad time when it breaks.


Solution: don't run software that runs in kernel mode. It's wildly unpopular in Linux, rampant on Android, fairly standard in Windows, and impossible on Mac. We've made this too normalized. Such software is inherently risky, and the fact it's a blackbox blob makes it unauditable. Even nvidia is moving away from kernel blobs.


> Crowdstrike broke Red Hat and Debian earlier this year.

For Crowdstrike customers foolish enough to be Crowdstrike customers, yes. The nature of the software pipelines for Red Hat and Debian is very friendly to continuous integration and testing in a way that Windows cannot be, at least not without Microsoft sharing source code, which to be fair Crowdstrike is one of the companies they may actually do that with.

Nonetheless, other vendors can choose to do proper cicd with Red Hat and Debian without asking Microsoft.


Crowdstrike is not able to break my customers' Linux endpoints. My customers hired my company, not Crowdstrike.



