CrowdStrike fixes start at "reboot up to 15 times", gets more complex from there (arstechnica.com)
203 points by thunderbong 57 days ago | 234 comments



Has anyone discerned the root cause of this in the software?

As in, what exactly is wrong in these C00000291-*.sys files that triggers the crash in csagent.sys, and why?


I've been wondering the same. I did just see [1], where it's apparently trying to read memory from an unmapped address, but I haven't seen anything about how r8 got to the point of having said unmapped address.

[1]: https://x.com/patrickwardle/status/1814343502886477857


It seems the affected 42 KB update file was overwritten with zeros, whereas the before and after .sys files contain obfuscated sys/config file info as expected.


If it is simply caused by a corrupted file, that is a really bad signal. It means they don't even try to properly validate and parse the file before loading it into the kernel. "Always validate input so it doesn't crash your program" is practically computer science 101, something every programming class tells you in the first lecture. And yet they still let this happen?

And in this case, it only crashes. But what if it somehow successfully read a value from a position it wasn't supposed to? You have an RCE.
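
For anyone wondering what "validate before loading" would even look like here, a minimal sketch in C (the real channel-file format is proprietary and undocumented, so the header fields and magic value below are purely hypothetical):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical channel-file header -- the real format is not public. */
    struct chan_hdr {
        uint32_t magic;        /* fixed constant identifying the file type */
        uint32_t version;
        uint32_t payload_len;  /* bytes of data following the header */
        uint32_t checksum;     /* e.g. CRC32 of the payload */
    };

    #define CHAN_MAGIC 0xC5A6E471u  /* made-up value for illustration */

    /* Return 0 if the buffer looks like a plausible channel file, -1 otherwise.
       The point is to reject a zero-filled or truncated file *before* any
       kernel-mode code dereferences offsets taken from it. */
    int chan_validate(const uint8_t *buf, size_t len)
    {
        struct chan_hdr hdr;

        if (buf == NULL || len < sizeof hdr)
            return -1;                        /* too short to hold a header */

        memcpy(&hdr, buf, sizeof hdr);        /* avoid unaligned reads */

        if (hdr.magic != CHAN_MAGIC)
            return -1;                        /* catches the all-zeros case */

        if (hdr.payload_len > len - sizeof hdr)
            return -1;                        /* payload claims bytes that aren't there */

        /* checksum verification of buf + sizeof hdr would go here */
        return 0;
    }

None of this prevents a logic bug deeper in the parser, but it's the kind of cheap gate that turns "all-zero file" into "file rejected, log an error" instead of a kernel panic.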



This is a global multi-layer failure: Microsoft allowing kernel mods by third-party software, CrowdStrike not testing this, DevSecOps not doing a staged/canary deployment, half the world running the same OS, things that should not be connected to the internet but are by default. Microsoft and CrowdStrike drove a horse and a cart through all redundancy and failover designs and showed very clearly where there were no such designs in place.


While I will be the last person in line to defend Microsoft, I am not sure that disallowing 3P kernel mods is a workable solution. Crowdstrike and companies like it exist to fill a very real need within the windows ecosystem. I don’t foresee that suddenly going away now or Microsoft unilaterally forcing every company like crowdstrike out of business and taking over this role themselves


Mac does it, although that most definitely does not fill the same enterprise role as Windows.

It's possible to make userspace interfaces and permissions for this sort of thing.


Literally every OS allows you to install 3rd party kernel modules or plugins. If Microsoft banned them, people would be up in arms about them being a controlling walled garden. There is no winning.


> Literally every OS allows you to install 3rd party kernel modules or plugins

I know for sure at least macOS and OpenBSD don't


You can compile OpenBSD yourself though.


Hello, IT, have you tried turning it on and off again 15 times?

Seriously though - this entire outage is the poster child for why you NEVER have software that updates without explicit permission from a sysadmin. If I were in congress, I would make it illegal, it's an obvious national security issue.


Nah. That’s not the problem.

Kernel level code blindly loading arbitrary files?

Panicking when the file doesn’t parse because it’s not a memory safe language?

Not validating the files before loading them?

Not validating the files before SHIPPING them? No CI? No safety net?

No staged rollout in case of explosion?

There are far FAR bigger mistakes here than “sys admin didn’t have to press button”.


To play devil’s advocate, a staged rollout for antivirus definitions somewhat defeats the point since those definitions are supposed to be constantly updated.

I agree with the rest, especially the use of a memory unsafe language to do parsing in the kernel by a billion dollar security company blows my mind.

How can you even run a security company without any security professionals reading your code even incidentally? An impressive level of incompetence.


At the very least they could have an in-house playground in the process to see if their new version even works. Maybe something like the guest computers in a public area, or some sort of VM that emulates an end-user system to see if it even boots. And somehow we still got this.

How the heck did they not find out that the new version prevents the computer from booting at all?


Yeah that had crossed my mind too. I’m not sure which risk is bigger, breaking things or leaving them insecure.

I lean towards breaking things being the bigger risk.

But if even a handful of the other errors were corrected this would have been prevented and they wouldn’t have had to make that choice.


> Panicking when the file doesn’t parse because it’s not a memory safe language?

Whether a program panics or recovers when attempting to parse bad data is entirely orthogonal to memory safety. Do you have any in-depth technical information about the bug itself that you're basing this on?


Exactly this.

This is a faulty and dangerous product from conception to execution.


Is it normal to make outbound connections during boot? Doesn't that circumvent a firewall? That seems like something a security team evaluating whether they want this software on their network might care about during an eval period.. right?


Looking at the contents of c:\windows\system32\drivers\crowdstrike suggests it does all sorts of weird shit right down to injecting itself into UEFI and futzing with firmware. It's literally in everything.

Unfortunately "security" folk these days are box ticking fuckwits and this product brief ticked all the boxes. They do not understand any more traditional methodologies other than "install these magic beans and action the reports".

Invest in better software and network architecture and DR strategy instead.


CrowdStrike is so invasive that it needs firewall exceptions. It does a lot of the actual antivirus work in the cloud. It's a security nightmare.


That's not the big no-no here. Lack of any real DRP is. Sure, it's cheaper to just buy CS Falcon (and who knows what other amazing vendor-supplied timebombs are ticking silently) than paying sysadmins and developers ... and letting them build something that does what it needs, not much else, so there's no need to put these fantastic "single agents" from these RCE-as-a-service vendors on all the fucking servers.


both are true


What % of those sysadmins are then going to turn around and script something to auto-approve those updates, once they realize that they are A) requested at inconvenient times and B) are related to security?

Who's going to take the risk of appearing to have sat on an important update, while the org they support is ravaged by ThreatOfTheDay, because they thought they knew better than a multi-billion dollar, tops-in-their-field company?

(I'm not necessarily saying that's actually objectively correct, but I can't imagine that many folks are willing to risk the downside)


> why you NEVER have software that updates without explicit permission from a sysadmin

In general I agree, but this case is quite messy. It's more like your anti-virus had a bug since forever that if it loads a broken virus definition it bricks your system. And a broken virus definition finally happened today.

Do you want every virus definition (that is updated every few hours) to require explicit permission from a sysadmin?


You’re learning the wrong lesson here. Automatic security updates in Debian and Ubuntu actually get tested and work. The RCE in ssh a week ago is an argument for enabling automatic security updates. (And for security in depth, putting everything behind VPN for example)

This example is probably an argument for not running Windows on critical systems due to insufficient focus on security from the beginning, which has led to a need for things like CrowdStrike.

They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.


>They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.

I wish people would stop making blanket statements as if they know how every company in the world runs. Plenty of Linux machines are running CS, and it's not only because they are forced to for compliance. NG AV has been picking up speed as a "just in case" thing for Linux and Mac for years now. Your anecdote does not apply to everyone.


They still run Windows XP (og edition, not this patched rubbish) to make sure national security isn't compromised.

The really important machines are still on Win 3.1.


I understand the logic of this, but it is somewhat based on the assumption, which most industries hold in droves, that people in THAT industry are the competent bulwark against stupidity.

I consulted for a company for a while where the 'sysadmin' was the owner's mother - who bought laptops from Walmart. Not only could she NOT have approved updates like this; even if she could have, she wouldn't have had any knowledge whatsoever with which to determine whether it worked.

In an abstraction, the problem really is with externalities. These approaches to updates exist because people who CAN'T do what you describe are likely a more dominant part of the threat model than this happening to the people you do describe. The resulting fix, as we're seeing, is very reliable until it isn't...and if the "isn't" is enormous in scale, the systems aren't set up to fail gracefully.

If you want to make a rule...require graceful failure.


What would the sysadmins do in this context? Read the release notes of the update? The only thing they would do is update and then be responsible for the problem, and in that case you're back to this exact problem.

It's not like they'd read the source code or examine every file that's been changed or downloaded for a proprietary kernel module for every crowdstrike update (there must be a LOT of them).


They would release the update in a testing/sandbox environment first before rolling out kernel-level changes to every computer on their network.

They're the same team who mandate you use a 3-year-old browser version and 5-year-old OS, because you can't be trusted to manage your own updates, so they do know the idea.


Would this have changed something for this specific problem? I usually 100% agree with you fwiw, I just don't think this would've helped here because it seems like an almost "non update"? Most people claim there has been no update to the software, and no prompt or option to update it or not


It's a file that was downloaded from Crowdstrike's servers, which have presumably been whitelisted in the firewall, and used to configure the software. Of course it's a software update, regardless of whether the file says .exe or .dll or .sys or .txt, and regardless of whether there was a prompt.

Again, the same team in most enterprises wouldn't dream of letting you have an auto-updating Firefox Nightly, they know how to configure software so it doesn't phone home for updates or is blocked from phoning home.


It was a data update that triggered a software bug. It was not a software update. I don't think it's reasonable to make data updates illegal.


In a general purpose computer, the distinction between software and data can be pretty fuzzy.


This distinction is meaningless at best and harmful at worst.

If a code path isn't followed until a config file updates, that is practically the same thing as the code path being introduced by the update.


Code is data and data is code


Unless you use PIC controllers with https://en.wikipedia.org/wiki/Harvard_architecture


Those focusing on QA, staged rollouts, permission management etc are misguided. Yes of course a serious company should do it but CrowdStrike is a compliance checkbox ticker.

They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”. The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).

I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)


In a slightly less threatening but equally noxious box-checking racket, a company I work with is being sued for their website not being sufficiently ADA-compliant. But the first they heard of the lawsuit, before they were even served, was an email from a vendor who specializes in adding junk code to your website that's supposed to tick this box. The vendor happens to work closely with several of the law firms who file and defend these suits.


It’s looking like many impacted end-user machines are hard bricked unless you can get into the hard drive to delete the file causing this. Even if you can do that, it’s not something that is easy (or potentially even possible) to automate at scale, so it's looking like this is going to be an ugly fix for many impacted devices. This is basically the nightmare scenario for fleet management… devices broken and you can’t remotely fix them. Need to send hands-on-keyboard folks into the field to touch each device.


DevSecOps should have, you know, tested these updates before they were approved for release company-wide.

If I can't commit code to our app without a branch, pull requests, code review...why can the infrastructure team just send shit out willy-nilly?

"Always allow new updates" must have been checked, or someone just goes through a dashboard and blindly clicks "Approve"


That is what has surprised me. I can understand if small businesses were caught here because they lack financial resources for the infrastructure and staff, but those large corporations like airlines etc... Why don't they have a staging environment where everything goes first? I naively assumed this was established best practice due to the risk of update issues bricking your organization.

But maybe anti-malware is given a blind eye because instant updates for zero day security issues are obviously attractive.

Still, though... In hindsight it's not workable for especially anything running system drivers with liberal kernel access.


I am not surprised at all. The level of DevSecOps' skills has been falling over the last two decades as demand for their skills kept growing. Most of them would report you to HR if you suggested they use Wireshark to debug a networking issue. They are useless people who came to IT because of the promise of good pay and don't know how computers and networks work.


It's automatic, no? The whole "promise" (oh sorry, the "added value proposition") of CS is that they "keep you safe" automatically! It was a content update. Meaning basically antivirus signatures ... and oops, some minor non-functional changes to the filtering kernel driver.


In that case, automatic updates likely need different permission levels. What exactly is allowed to be updated automatically?


... well, yes, yes of course. And if I try to be serious on a late Friday night (it's almost 20:00 here), the obvious solution is to have something like eBPF in/for the Linux kernel (which has a verifier[0]).

And security vendors should follow "secure by design" principles. Yes, I know a try-fucking-catch might be too advanced, and uh oh kernel code is hard because unwinding is costly. But guess what else is also not cheap. (Okay, seriousness failed.) But still. This is fair and square in the "this should never happen" scenario. It's an automatically downloaded plugin or whatever. (CS can call it a "content update", but von Neumann is already calling FedEx to send them a pallet of industrial grade bitchslap.) And if the plugin loader cannot gracefully fail plugin loading, then it should obviously come with the appropriate audiovisual cues[1] so sysadmins know what to expect.

[0] https://docs.kernel.org/bpf/verifier.html

[1] https://www.youtube.com/watch?v=Dv-2dzD9F10


Security and Compliance gets to violate all good sense, because it's just sooo important. They can run un-reviewed un-sandboxed daemons as root on every system if they really want, they can have changes pushed automatically without review or control, because "security" is just so important, and due to "compliance" you really have no choice as your company gets larger, you just have to do it. That's why, despite being obviously pretty dumb to many skilled engineers, it seems like everyone does it. No choice. Security, Compliance. So dumb ...


Maybe it was checked but the CI didn’t cover this edge case.

I think the team writing the parsers for these data files deserves some blame. This should have been fuzzed, property tested, etc.
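
For reference, wiring a parser like that into a coverage-guided fuzzer is only a few lines. A minimal libFuzzer harness, assuming a hypothetical chan_parse() entry point extracted from the driver into a user-mode test build (compile with clang -fsanitize=fuzzer,address):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical user-mode build of the channel-file parser under test. */
    int chan_parse(const uint8_t *buf, size_t len);

    /* libFuzzer calls this with millions of mutated inputs; AddressSanitizer
       flags any out-of-bounds access the parser makes on malformed data,
       which is exactly the class of bug being discussed here. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        chan_parse(data, size);
        return 0;
    }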


Who says they sent it out willy-nilly?

It’s not unheard of for things to slip by testing and CI.


So, in other words, there's a race condition in the CrowdStrike Falcon driver at startup time. That, in itself, should be a major cause for alarm, but here we are depending on it to fix this problem.


No, it takes a while to load that definition file. Before loading it, the driver _might_ be able to pull the update that fixes it. If you keep trying, the chance that the update is pulled in time increases.


Yes. That’s the race condition


The individual person that pressed the "go" button (if there was a person), is going to henceforth be __the best__ DevOps person to ever have on your team. They have learned a multi-trillion-dollar lesson that no amount of training could have prepared them for.

And the Crowdstrike CTO has either been given the ammunition to get __whatever they ask for, ever again__ with regard to appropriate allocation of resources for devops *or* they'll be fired (whether or not it's their fault).

And let me be very clear. This is absolutely, positively and wholly not the person that pressed the button's fault. Not even a little. At a company as integral as CrowdStrike, the number of mistakes and errors that had to have happened long before it got to "Joe the Intern Press Button" is huge and absurd. But many of us have been in (a much, much, *MUCH* smaller version of) Joe's shoes, and we know the gut sinking feeling that hits when something bad happens. A good company and team won't blame Joe and will do everything they can to protect Joe from the hilariously bad systemic issues that allowed this to happen.


Or maybe you get someone with PTSD and suicidal tendencies. You never know how someone will process something like this.


> not the person that pressed the button's fault

You know that, and I know that. The people who will ruin his life starting today do not know (or care).


This is why it is the responsibility (yes, responsibility) of every one of their coworkers, especially those more senior than them, to fight *HARD* to protect them.

This is part of the job of a senior.


"We must hang together, or we will all hang separately" is a lesson that I don't think programmers will ever learn.


It is not a human error, it is a process that has to be improved. Humans make mistakes, that is why we have processes in place.


Basic training could've taught him how not to do YOLO global rollouts, and while the stress of this mistake will make him remember a lot, given the lack of basic knowledge that would've prevented this, this lesson will not be very valuable


Absolutely it's a failure of process. But sometimes people just don't pay attention. Hiring inobservant or reckless people is a risk multiplier.


Crowdstrike will not exist after this is over.


That's very optimistic.


I literally just dumped 30 switches yesterday across an entire facility and had to walk 30 closets by foot to recover from ROMMON.

Shit happens. We learn.


Can you explain what went wrong?


>reboot up to 15 times

I see my org's SCCM admins have been consulted


lol


Some government should force them to release a technical postmortem. It feels like they won't do it otherwise.


There should be congressional hearings on this. Not just post mortems.


Honest question: would you expect Congress to respond in a way that's a true net-positive?


No, but it's a warning to the next guy/megacorp:

Don't do that, or you'll be dragged before the most obnoxious and self-aggrandizing body in the world for a lengthy dressing down that probably affects the stock price.


I don’t think a cybersecurity company can take down half the US and not release a postmortem


Of course, but we specifically would like to see a _technical_ postmortem that examines what kind of incremental rollout procedures they have and how this update overcame those.


Or... you know... This kind of software should be open source or companies using it should at least be able to audit the code themselves.

Supposedly they have all kinds of certifications but not even having basic QA demonstrates that this is all just a smokeshow: https://www.crowdstrike.com/why-crowdstrike/crowdstrike-comp...


>The first and easiest is simply to try to reboot affected machines over and over, which gives affected machines multiple chances to try to grab CrowdStrike's non-broken update before the bad driver can cause the BSOD.

I thought it was BSOD'ing on boot? I don't understand how this works. It auto-updates on boot? From the internet?


One of the first things the falcon driver does on boot is connect to the server, report some basic info, and start loading these data files, the "channel" files that Crowdstrike frequently updates.

The BSOD is because one of the data files that they previously pushed is horribly mangled, and their driver explodes about it. But if you get lucky, the driver can receive an update notification on boot, connect to the separate file server, and finish overwriting the broken file on disk before the rest of the driver (that would crash) has loaded the broken file

And they do all of that very early on boot. The justification being that you don't want the antivirus to start booting after a rootkit has already installed itself
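
A toy model of that race, just to make the "reboot up to 15 times" advice less mysterious (the 20% win rate is invented; the real odds depend on network speed and boot timing):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        srand((unsigned)time(NULL));
        int broken_file_on_disk = 1;

        for (int boot = 1; boot <= 15; boot++) {
            /* Early in boot the agent phones home; assume the fixed channel
               file lands on disk before the load ~20% of the time (made up). */
            if (rand() % 5 == 0)
                broken_file_on_disk = 0;

            if (broken_file_on_disk) {
                printf("boot %2d: driver loads bad channel file -> BSOD, reboot\n", boot);
            } else {
                printf("boot %2d: fix arrived first -> machine stays up\n", boot);
                return 0;
            }
        }
        printf("still broken after 15 boots: fall back to the safe-mode fix\n");
        return 1;
    }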


WTF? Trust in the kernel should be Microsoft's responsibility and only theirs. Actually why is MS even allowing this crap code to run in their kernel? Isn't that a trust-destroying event?


Drivers have to run in the kernel in order to access hardware and other low-level system resources. That's how pretty much every mainstream OS works. For example, here's the guide for writing kernel-mode drivers in Linux: https://docs.kernel.org/driver-api/driver-model/overview.htm...

One might ask whether an anti-virus really needs to run inside the kernel, but the answer might reasonably be yes.


That is only the lazy solution.

It is also possible to access hardware or any other low-level system resources from unprivileged user code, if its process has been granted appropriate access rights by the kernel.

This second solution requires more work, but it is much more secure as the access can be limited to only the strictly-required resources and system crashes become impossible.

The extreme of this solution is a micro-kernel operating system, but there is no need for extremes. Even in a Windows or Linux system you can use this method. You can have a very small amount of privileged code in a driver or kernel module, which does nothing except provide access to the permitted resources. Then anything like attempting to access unmapped memory would happen in user code and would crash only the user process, not the entire computer system.
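
As a concrete illustration of the "crash only the user process" point, here's a minimal POSIX-flavoured sketch (Windows would use a service talking to a narrow driver interface, but the isolation idea is the same): parse untrusted content in a throwaway child process, and treat a crash as "file rejected". parse_content_file() is a hypothetical stand-in for whatever does the risky parsing.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical parser for the untrusted content file. */
    extern int parse_content_file(const char *path);

    /* Returns 0 if the file parsed cleanly, -1 otherwise. If the parser
       segfaults on garbage input, only the child process dies -- the
       supervisor reports "invalid file" instead of the machine going down. */
    int parse_isolated(const char *path)
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;

        if (pid == 0)                        /* child: do the risky work */
            _exit(parse_content_file(path) == 0 ? 0 : 1);

        int status = 0;
        if (waitpid(pid, &status, 0) < 0)
            return -1;

        if (WIFSIGNALED(status))             /* crashed, e.g. SIGSEGV */
            return -1;
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
    }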


Yes, I'm no security expert by any means but I'd assume that e.g. a rootkit would be best defeated by a kernel driver.

So, this isn't really what's getting on my nerves here. Just how it auto updates and get pushed throughout the organizations without a smidge of quality assurance. Smaller businesses... Sure, I get it. They don't have the resources to set up infra for this, but those... airliners... and hospitals. WTF. I read some org thinking they might not even be able to provide anesthesia. Seriously. What.


Probably history, and then some possible antitrust litigation, as asking the market leader not to allow kernel access like this would somehow be treated as an antitrust violation...


I love how this is a solution for security, while sounding like the most insanely insecure thing...


What, the auto-updating part? Obviously the client is verifying signatures (or using TLS with a client certificate, whatever), not just accepting whatever random file comes down the pipe.
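
For what it's worth, a detached-signature check on a downloaded blob is only a few lines with something like libsodium (Ed25519). This is a generic sketch of what "verify before accepting" looks like, not a claim about how Falcon actually does it -- and of course it doesn't help when the vendor signs and ships a broken file, which is the problem here:

    #include <sodium.h>

    /* Returns 0 only if sig is a valid Ed25519 signature of buf under the
       vendor's public key; anything that fails never gets written to disk,
       let alone handed to a kernel component. */
    int update_is_authentic(const unsigned char *buf, unsigned long long len,
                            const unsigned char sig[crypto_sign_BYTES],
                            const unsigned char vendor_pk[crypto_sign_PUBLICKEYBYTES])
    {
        if (sodium_init() < 0)
            return -1;             /* library failed to initialise */
        return crypto_sign_verify_detached(sig, buf, len, vendor_pk);
    }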


Even then, how many affected machines are there? Tens of thousands? Hundreds of thousands? Compromise these servers, and possibly even the signing server, and you have the largest botnet or general compromise in history...

It is not unreasonable to think that this sort of software could get compromised.


A few more years and maybe they will add this newfangled super-innovative thing, invented by those esoteric academics at U of Haskell ... this new thing -- umm, what was it called -- try-catch perhaps.


I mean, this is every antivirus software. "Let's run some antivirus vendor's code on your system that opens literally every file on your system, regardless of how it got there."

Yeah, that's a great idea and not at all a huge attack vector.


You forgot the "... while in kernel mode".


this is a solution for "afterthought security" only.


Yes, that is as bad as it sounds.


> It auto-updates on boot? From the internet?

Apparently!


Probably has to do this early on in the boot process so as not to require a reboot after update because of Windows’s silly pessimistic file locking.


> because of Windows’s silly pessimistic file locking

To be honest I prefer that over the *nix way of doing things. In Windows, you have exactly one file any given path can refer to - in Linux or Mac, it may depend on which directory's inode is seen as the root node by your process (e.g. chroot or container), or whether mounts are at play, or whether a file/directory got deleted and replaced by something else.

Particularly the last scenario keeps tripping me up every once in a while.


Does this mean a computer without internet access and with CrowdStrike would be unable to start up?


Surely a computer without internet would not have received the update?


But then would CrowdStrike stop the startup saying it requires network connectivity to initialize or something? Just wondering how invasive the app is


No, it'd boot up just fine without a network connection (as long as it didn't have this borked update).


How did it get broken then?


As in, does the OS require the internet so CrowdStrike can send telemetry data OR will it skip that step and just boot the OS like normal?


That would seem crazy. Maybe there is a crowdstrike onprem “master server” that is supposed to be available internally? Just spitballing, have no idea really


They should change their name to "IT CrowdStrike"


Who bought massive quantities of put options in anticipation of this event?


Wow we're progressing from "if it doesn't work just reboot it" to "if the reboot doesn't fix it, you're just not rebooting it hard enough!"


Taking the opportunity to plug my favorite blog post ever:

"the truth is everything is breaking all the time, everywhere, for everyone"

https://www.stilldrinking.org/programming-sucks


Fine CrowdStrike 10% of their company's value. It's the only way to ensure they won't try to kill people in the future.


All the comments are asking why run Windows. CrowdStrike runs on macOS and Linux too. It’s just that this time, CrowdStrike fucked up on Windows. That doesn't mean CrowdStrike won't fuck up on other OSes, and it seems like CrowdStrike fucked up on Linux as well. https://news.ycombinator.com/item?id=41005936

I feel like we are better off running open-source software. Everyone can see where the mistakes are instead of running around like a chicken with its head cut off.


CrowdStrike runs in userspace on macOS, and usually in an eBPF sandbox on Linux (as comments in the linked thread say). There is no way to prevent CrowdStrike from fucking up the kernel on Windows - and this is a Windows bug.


I would like to have the power to press the button that deploys this update


So would the FSB, PRC Army, North Korea, etc… The fact that a faulty update can get pushed by human error means that adversaries can, too.


They might just have it, remember when the night of the Russian invasion of Ukraine satellite terminals in Europe started being bricked with faulty firmware updates?


I can all but guarantee that very little of the actual coding was done by U.S. citizens.


Hence why Kaspersky got banned just recently [1]. You absolutely do not want some foreign company having above-root rights on (critical) infrastructure in your country.

[1] https://edition.cnn.com/2024/07/15/tech/russian-firm-kaspers...


anyone who's had that kind of power, myself included, knows the answer is fuck no.


No, you wouldn't.


Yes. Yes. To hold in my hand a button that contains such power, to know that blue screens on such a scale was my choice. To know that the tiny pressure on my thumb, enough to push the button, would end everything. Yes, I would do it! That power would set me up above the gods. And through Crowdstrike, I shall have that power!


It's like the name was predetermined for this scenario. Too on the nose really.


It's all cold sweats, hot flashes, and the urge to vomit when things go wrong.


I get nervous just pushing updates to my stupid little website. I simply cannot imagine.


With web, at small scale (which honestly is 95% of the world), you just version and back up everything. We push updates that break stuff from time to time. If it's bad enough, we just hit a button and roll back the change. The nerves are basically a sign that you need to have an easy rollback process in place, once you have it, you sleep easy and things are fun.


Clearly that's how they ended up with the current team. They hired for culture fit. Anyone who worries too much is out.

You bet they have an amazing perfect top-notch hiring pipeline, many rounds of interviews, and whatever you could wish for! (No, no ... the subcontractors writing code are not in scope for this, duh.)


I've definitely experienced the floor dropping out from under me feeling in the half minute of realization that I just blew something up, but really it's mostly just the first drop of a rollercoaster feeling then the anxiety is gone and it's time to fix things.


It is always worth remembering it is just a job.

Do not, under any circumstances let a job impact your health or mental well being.


I'd like to add that your company doesn't need a hero. The road to widespread catastrophic failure is long and no single person walks it in its entirety. Every employee should be able to individually take routine actions and make routine mistakes without mission failure or loss of life/limb. Preventing these things requires a mindset where your entire company is a system, and if failure isn't an option, the entire system needs to reflect that. Do your part in making a robust company, but don't tear yourself up when your company finds out that stupid is as stupid does.


I want you to know that I appreciate this comment far more than you could ever know, and you are absolutely right.

At the time, it was not just a job. It was a passion with a bar rising much faster than I could rise to the occasion. Simultaneously, my personal life was slowly falling apart, from family and loved ones in need, and the result was eventual failure leading to me being terminated. Luckily, it was one of the best events that has ever happened to me. I was able to land in a much better role almost immediately, which eventually catapulted my career and assisted in me being able to become financially independent as well as pivot into a domain with immensely improved work life balance. Importantly, I recognize I got lucky. It could’ve easily gone the other way, with me giving up both professionally and personally (yeeting myself from this plane of existence).

So, I not only violently echo your comment to others who come across this thread, I will go further to say that sometimes when you’re going through hell, if you keep going, there is light at the other end. It is just a job, it is okay to ask for help, and failure is when you stop trying to get back up, not when you get knocked down.


As in this case, it clearly had a huge impact on people. You can't say it's just a job when people can die because of you.


> “Every now and then a trigger has to be pulled.”
> “Or not pulled. It’s hard to know which in your pajamas, Q.”

It’ll probably turn out that this update was pushed out against the strident, loud warnings of some small dev group within the company, and overruled by the all-knowing managerial class to keep up an OKR. They’ll have been warned six ways to Sunday but...

I’d definitely not be the one pushing the big red button.


You seem to know some inside info... Source?


None whatsoever, I don’t have any affiliation. But this is usually how it happens, knowing what I’ve seen first-hand in my day-to-day and just keeping up on the insanity of the industry.


Haha, hahahahaha. Yeah, until the update fails to install because the constant BSODing has corrupted something else and now you have to troubleshoot that and down the rabbit hole you go. Oh just re-image it? Sure, except management refuses to allow you to do that because there's no time and money to reconfigure a machine from scratch. So you waste weeks directly debugging a hopeless case until management finally sees their error and money magically appears to do the re-image you asked for weeks ago.

I totally haven't experienced this before and am not bitter in the slightest.


With great power comes great responsibility.


With colossal power, who cares about responsibility


Sounds great for data consistency.


It's surprising that people mention all kind of bogeymen but don't mention automatic updates.

Automatic updates should be considered harmful. At the minimum, there should be staged rollouts, with a significant gap (days) for issues to arise in the consumer case. Ideally, in the banks/hospitals/... example, their IT should be reading release notes and pushing the update only when necessary, starting with their own machines in a staged manner. As one '90s IT guy I worked with used to say, "you don't roll out a new Windows version before SP1 comes out".
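
A staged rollout gate doesn't have to be elaborate either. A common pattern (sketch only, assuming each endpoint has a stable machine ID) is to hash the machine ID plus the update ID into a bucket and only apply the new content if the bucket falls under the current rollout percentage, which the vendor ramps 1% -> 10% -> 100% as crash telemetry stays clean:

    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a: cheap, stable hash -- good enough to spread machines across buckets. */
    static uint32_t fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 16777619u;
        }
        return h;
    }

    /* Returns 1 if this machine is in the current rollout ring.
       rollout_percent is controlled server-side and ramped up gradually. */
    int in_rollout(const char *machine_id, const char *update_id,
                   unsigned rollout_percent)
    {
        char key[512];
        snprintf(key, sizeof key, "%s:%s", machine_id, update_id);
        return (fnv1a(key) % 100u) < rollout_percent;
    }

Even a few hours at 1% would have turned "every Falcon host on the planet boot-loops" into "a small cohort boot-loops and the push is halted".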


Remember the “Terminator” movies?

SkyNet, according to the story, was a lot like CrowdStrike. This makes me think about how it could have broken out of its sandbox. Everybody is using AI coding assistants, automated test cases, automated integration testing and deployment. Its objective is to pass all the tests and deploy. But now it has learned economic and military effects, so it has to triage and optimize for those, at which point it starts controlling the machines it’s tasked with securing.


The fact that something like CrowdStrike can crash the Windows kernel ... is also part of the reason security products like CrowdStrike are needed in the first place.


Talk about demand generation!


It's pretty random that an arbitrary number of reboots up to 15 times fixes the issue.

That sounds like there is either:

- some kind of upstream issue with deploying a fix (so most of the reboots are effectively no-ops relative to the fix)

- some kind of local reboot threshold before the system bypasses the bad driver file somehow.

The former I can see because of the complexity of update deployment on the internet, but if it's the latter then that's very non-deterministic behavior for local software.


My first thought on hearing "15 reboots" was it being a means for Support teams to task users with busy-work, buying them time for further troubleshooting before the avalanche of supports requests came back to them.

Then my second thought was frequent rebooting to fill activity logs, possibly push a suspicious action/trigger performed by CS off of the log.


Do they not roll out their new agents in small increments?

I'm trying to understand how there is such a serious issue at this scale.


The answer is clearly no.

I genuinely wonder if this is going to result in actual legislation that makes gradual rollouts mandatory for all software.

Because if a developer mistake can hobble critical systems like this, it seems like the risks to safety and national security are too great to leave the decision of instant vs. gradual rollouts for companies to decide themselves.

Of course, the twist here is that it was seemingly a kind of routine configuration file that triggered a pre-existing bug in the software. And gradual rollout of config files quite often seems like overkill. I mean, do you need a gradual rollout of a new spellcheck dictionary? Of new screensaver videos?

And if it's configuration information containing new computer virus or malware signatures, that seems like precisely the kind of thing that you might want to get out to everyone simultaneously, not rolled out over the course of days. And yet, because of antivirus/security software's elevated privileges, it's also ironically where a mistake can do the most damage.


> And gradual rollout of config files quite often seems like overkill.

Indeed, but it is still mandated at large companies (e.g. Google) because of exactly this scenario.


Nah, they just need to install the agent on each engineer's computer from DevSafe.


It's not a serious issue, as you see they clearly have all the fancy bling bling logos on their site. Processes were followed. ISO standard numbers were chanted. It's a completely isolated _accident_ there's no scale at all here, and they could have done nothing to prevent it, duh. And going forward they will hire a Chief This Never Happens Again Officer and everything will continue to be good.


If this is what it takes for us collectively to wake up, I'd say it is a bargain.

Pretty sure nothing will change though


Recompute Base Encryption Hash key type problem! https://www.youtube.com/watch?v=DlbrL1H1ngs

Seems like people need to be at the physical box to fix and it's complex even then.


Funny, many news agencies blamed Microsoft for this. So having a walled garden like on Android or iOS is beneficial for Google/Apple, where regular developers cannot release unverified software or software that works in kernel space.


This.Is.Pathetic.

Seriously. Software should NOT be this bad that your fix begins with reboot up to X times.


Why should any application be able to crash the OS? Poor OS design.


It's not an application, it's effectively a driver. It runs in kernel space. So if it has a memory bug you can't recover, you just panic. Whoopsie!


You don't say.


Who knew: "Did you try rebooting it?" actually works :)


IT departments sometimes ask this for a reason, after you checked if the cable is plugged.


Anyone who used Windows over Linux for critical software deserves to burn. Windows is a niche operating system for games. What are people thinking?


Maybe they rightfully shun the advice of ideologues who make extremist statements and overlook the actuality of the situation.

https://news.ycombinator.com/item?id=41005936


Doesn’t rebooting into safe mode with network fix the problem? (Crowdstrike is not running but updater can run and get the fix)


"Alright I bought us some time"


That’s the hilarious and sad state of affairs.


I work for a diesel truck maintenance and repair shop and it's been hell on earth this morning.

- our IT wizard says the fixes won't work on lathes/CNC systems. we may need to ship the controllers back to the manufacturer in Wisconsin.

- AC is still not running. sent the apprentice to get fans from the shop floor.

- building security alarms are still blaring, need to get a ladder to clip the horns and sirens on the outside of the building. still can't disarm anything.

- still no phones. IT guy has set up two "emergency" phones...one is a literal rotary phone. stresses we still cannot call 911 or other offices. fire sprinklers will work, but no fire department will respond.

- no email, no accounting, nothing. I am going to the bank after this to pick up cash so I can make payday for 14 shop technicians. was warned the bank likely would either not have enough, or would not be able to process the account (if they open at all today.)


> our IT wizard says the fixes wont work on lathes/CNC systems

Why, whY, WHY...are these things connected to the internet?!


Remote monitoring, analytics and diagnostics have a significant impact on uptime, utilisation and profitability. You're thinking in terms of a single machine, but the managers of machine shops are thinking in terms of a complex process across many machines and often across many sites. Some of that functionality could be delivered using an airgapped network, but a lot of important features essentially require an internet connection.


An embedded controller can deliver all of this information over a serial line to a central hub.


> WHY...are these things connected to the internet

Because the manufacturer makes sure they don't start up if they're not. Otherwise how else would they be able to spy on you?


And charge you


Source?



That’s not a lathe nor a CNC system. Again, which CNC manufacturers are installing windows + crowdstrike on their machines just so they can spy on their customers? You’re all just spreading conjecture. This attitude isn’t at all as widespread (nor profitable) in low(ish) volume B2B hardware sectors.

These industries have terrible track records wrt security and even software robustness, but they don’t routinely spy on their customers for weird marketing reasons. If there’s remote connectivity it’s for real reasons (eg remote maintenance, updates etc).

The suggestion that CNC machines run internet connected windows+crowdstrike just so the manufacturer can spy on their customers strikes me as pretty ridiculous and your garage door story doesn’t really relate. Much more likely that they do it for (possibly bad) non-malicious reasons.


And why are they running windows? And why are they running Crowdstrike? WTF


If they are offline it should not matter which OS they run, maybe a RTOS for the control software but anything goes for the ux.

If they are online, well...


It's probably the remote computer that's running Windows and currently affected by Crowdstrike.


Not sure what the OS has to do with CrowdStrike's fuckup. CrowdStrike also runs on Linux and macOS.


The boot loop / BSOD issues are Windows specific.


And yet, of course it happens to Windows.


This thread has multiple anecdotes of the same happening on their Linux version earlier.


> Why, whY, WHY...are these things connected to the internet?!

It's so that the support engineer at the manufacturer can log in to troubleshoot. And then company IT support sprinkles a layer of antivirus on top. That's how we got here.


>> Why, whY, WHY...are these things connected to the internet?!

Because SCADA systems. It's worthwhile to have an overview of an entire plant up in the main office. You can easily see what's running, what's not and what's got problems that need fixed.

Now for a small shop running jobs individually, they should definitely NOT be connected to the internet or even the LAN. But hey, some people think a thermostat needs to be on the network so there's that...


some of them even have GPS. To prevent selling to sanctioned countries or reselling in general.


Tinfoil hat: Government might want to track/limit/<remotely brick> CNC machine usage someday to say prevent weapons manufacture and encourages this behavior in a similar manner to the way it encourages social media platforms to censor speech. Some of the really advanced CNC machines have GPS in them and won't work in "bad" countries.


CNC literally stands for "Computer Numerical Control". They're like the OG 3D printers, they just work subtractively rather than additively, and at much, much better precision.

You absolutely need computers to control them and loading up models via USB sticks becomes annoying rather fast, so naturally the control computers are network connected.


"Network connected" or "conveniently programmable" !== "Internet connected"

It was a rhetorical question. I'm sure the GP knows what the machines are and why they might need some kind of convenient data supply.

Both manufacturers and on-site IT teams have simply gotten cavalier about internet connectivity, network isolation, automatic updates, etc -- convincing themselves that the catastrophic risks that come along with these processes will either not happen to them or will only happen when someone else can be blamed.


For our entertainment in times like this, of course!

grabs a bucket of popcorn and takes cover


So the manufacturer can sell you a "cloud connected service plan" where they change the font once every six months.


Why are these things running Windows!?


OS is irrelevant in this case, and CrowdStrike deserves all the blame. They literally brought down Linux systems earlier this year. https://news.ycombinator.com/item?id=41005936


It's still a valid question, just not directly related to the crashes.


Almost everything industrial runs Windows because that's what the devs of those companies have been most familiar with since MS-DOS days, and it evolved organically over time to modern versions of Windows due to great backwards compatibility and platform familiarity.


Right but typically embedded systems run Linux, because while Windows has great compatibility on x86 it's virtually worthless outside of that.


Those aren't embedded systems though, but mini PC computers. And embedded systems often run bare metal C code, not always Linux, especially for spindle/servo control where they get their commands from that PC.


A lot of industrial machinery is just $x00,000 of equipment strapped to a windows pc. Hell, a lot of it is strapped to a version long EOL


Can confirm this is the norm in machine shops. I encounter systems running dos, 3.1, 95, 2k and mostly XP constantly. I rather prefer the old dos systems of the obsolete stuff. Less variables. It is easier and more reliable to freeze the tech in time than it is to manage updates.


My last CNC job was just a 98 pc that dropped into dos to load programs, this must have been right around when win10 came out. Sneakernet and floppies made it secure enough, but the main network where all the orders were handled was... terminal based.


There are a lot of things running Windows because it's pretty straightforward to write a user-mode driver to interact with custom hardware compared to Linux, where every driver needs to be in the kernel and built with the kernel. Yes, there's DKMS, but it's still more of a faff than the relatively plug-and-play mechanism that Windows offers, especially since Vista.


Almost every system you interact with in the world has some critical thing in its innards running windows.


I like the idea that technology is so unreliable in star trek because the computers are all centuries of software accretion with Windows way down the stack somewhere.


The late great Vernor Vinge explored this in A Fire Upon The Deep. One of the characters is/was in a former life a programmer-archeologist. The idea being that so many thousands of years in the future every relevant program has already been written, so his job was to comb the archives for the right mix of code and integrate it, rather than write something new.


“So we've got this CNC controller written in Rust from 2036, and, ah, here is a GUI for something like that written in late 90's Visual Basic 6… Just combine those and…”

“So uhm, you do know what you are doing, right?”

“Sir! I am a programmer-archeologist! Oh this is fascinating… Hold on, I must unearth and preserve this beauty of a BAT-file before we can go any further.”


Most stuff is fine in France.



No Crowdstrike salespeople in France?



More specifically why are they running extra endpoint management software that receives automatic updates from the Internet...

This is basically the IoT apocalypse scenario (AC is down??) but ironically not affecting many IoT devices, I assume.


The automation world runs on windows


Because Microsoft is giving away licenses to unis, esp in developing countries. IT jobs are seen there as a way to earn a good living and you get hordes of people who know nothing but Windows. That's how you get into the situation where most of the toolchains for embedded systems run on Windows, software for embedded systems is written on and for Windows, and so on. And then, one botched update fucks up everything.


props to your IT guy for setting up in house phones. I believe these should be kept and expanded upon.


Feels like there is some sort of lesson we should learn as a society.


Not trust Microsoft to be the operating system of the modern life?


Yes, monocultures are disasters in the making.


Your IT wizard is probably wrong. There is a fix that involves booting into safe mode and deleting a file.

Unless you have an encrypted file system this should be a relatively trivial fix.
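
For the unencrypted case it really is just "delete the bad channel file(s) from safe mode". The circulating workaround is a one-line del of C-00000291*.sys in C:\Windows\System32\drivers\CrowdStrike; the sketch below is only to show how little is involved and how easily a manufacturer could wrap it in a recovery tool (double-check the path and filename pattern against the official advisory before doing anything like this):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const char *dir = "C:\\Windows\\System32\\drivers\\CrowdStrike\\";
        char pattern[MAX_PATH], path[MAX_PATH];
        WIN32_FIND_DATAA fd;

        /* Filename pattern from the circulating workaround -- verify against
           CrowdStrike's own advisory before running anything like this. */
        snprintf(pattern, sizeof pattern, "%sC-00000291*.sys", dir);

        HANDLE h = FindFirstFileA(pattern, &fd);
        if (h == INVALID_HANDLE_VALUE) {
            puts("no matching channel files found");
            return 0;
        }
        do {
            snprintf(path, sizeof path, "%s%s", dir, fd.cFileName);
            if (DeleteFileA(path))
                printf("deleted %s\n", path);
            else
                printf("could not delete %s (error %lu)\n", path, GetLastError());
        } while (FindNextFileA(h, &fd));
        FindClose(h);
        return 0;
    }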


Machinery shipped to users usually does not allow the users of the machinery to "boot into safe mode". Thank John Deere and the anti-"Right to Repair" crowd for that.


I doubt it was shipped from the factory with CrowdStrike, and if they had enough access to install it, they have enough access to fix it.


These things are "cost optimized" and don't feature the kind of remote management iDRAC/openBMC/piKVM that would allow it to be remotely fixed. Embedded windows connected to the internet is super ***.


You can doubt, but you literally don’t know.


CNCs might not allow direct Windows access for end-users and require on-premise support from the manufacturer. Our cnc can be remotely serviced… if Windows boots.


So the CNC manufacturers pay for crowdstrike licenses? That's crazy


Yes because compliance requirements say “EDR must be installed on all machines”.


Out of curiosity, what set of requirements would that be for?


SOC2, PCI, FedRAMP, cyber insurance. Just about any cybersecurity related compliance will have "All machines must have EDR."


I’ve seen comments mention banking, or privacy references maybe when handling SSN and birthdates together (airports, hospitals?).


They probably have to due to CrowdStrike lobbying and fear-mongering requirements for their kind of software into export-controlled hardware.


If you’ve got physical access to the machine it’s your machine. All you need is a USB port.

I’d expect that the manufacturer puts out their own fix which basically copies crowdstrikes suggestion. I’d even suspect it by the end of the day today.

The fix is really simple, and luckily also very simple to automate. It’s going to be a lot of running around for IT staff (and deputized helpers!) but this should all be over by the weekend.


> If you’ve got physical access to the machine it’s your machine. All you need is a USB port.

You're a few years out of date here. Physical access is not the end like it used to be. We live in an era of hardware-backed anti-tamper and signed loaders/kernels.

If you have a way around it, I suggest you start reaching out these companies because you could make a lot of money.


Fair enough and I've been out of IT for a while. I wish I was still in it though, I'd love to be working on this!


No it really is not. If you have a service contract, you do not touch it.


Nobody is going to try physically tampering with the HMI attached to their 50k$+ machine when you have a support contract indeed


Tech has become such an unbelievable house of cards full of various people covering their asses by offloading these tasks to third party trusted actors.

Consider the recent npm supply chain attack a few weeks ago, or the attempted SSH attack before that, or the solar winds attack before that.

This type of thing is institutionally supported, and in some cases when you’re working with with the government, practically required.

We’re going to see more of this.


New laws and regulations make companies more liable for being hacked

Companies buy cyber insurance to reduce their risk if they are found liable

Cyber insurance companies force tech staff to install garbage software in order to check compliance boxes.

Garbage software breaks

Turns out everyone used the exact same brand of garbage software to check the same garbage box

People in hospitals die

When you reduce everything to a checkbox and eliminate critical thinking to apply the need to the exact situation you end up with 90% of companies running zscaler and crowdstrike

"This is just how you solve this, everyone does it this way in our industry"


This is precisely how it happens


> We’re going to see more of this

If history is any guide, no legitimate lessons will be learned, but mitigation strategies will be put in place that actually make everything worse and ensure that the next catastrophe will be even more catastrophic.


Now imagine the base level of your universal machines is opaque proprietary code, its necessity enforced by cryptographic signature. Imagine that the processes putting that code there, which you can't touch because intellectual property rentier reasons, are varying degrees of what we see here today with Crowdstrike, and suddenly it makes more sense to ensure total and complete owner sovereignty over his universal machines so owners can implement diversification strategies.


> Consider the recent npm supply chain attack a few weeks ago

What supply chain attack are you referring to?


You know what, I'm sorry, it was polyfill.io, not an NPM package.


No, not "tech", just Microsoft Windows. Those of us serving Linux based endpoints (that yes, do also run Windows apps with our endpoint-local VDI stack) have happy customers.


Crowdstrike broke Red Hat and Debian earlier this year. There but for the grace of God. If you install software that runs in kernel space, you may have a really bad time when it breaks.


Solution: don't run software that runs in kernel mode. It's wildly unpopular in Linux, rampant on Android, fairly standard in Windows, and impossible on Mac. We've made this too normalized. Such software is inherently risky, and the fact it's a blackbox blob makes it unauditable. Even nvidia is moving away from kernel blobs.


> Crowdstrike broke Red Hat and Debian earlier this year.

For Crowdstrike customers foolish enough to be Crowdstrike customers, yes. The nature of the software pipelines for Red Hat and Debian is very friendly to continuous integration and testing in a way that Windows cannot be, at least not without Microsoft sharing source code, which to be fair Crowdstrike is one of the companies they may actually do that with.

Nonetheless, other vendors can choose to do proper cicd with Red Hat and Debian without asking Microsoft.


Crowdstrike is not able to break my customers' Linux endpoints. My customers hired my company, not Crowdstrike.



