I've been wondering the same. I did just see [1], where it's apparently trying to read memory from an unmapped address, but I haven't seen anything about how r8 got to the point of having said unmapped address.
It seems the affected update file was overwritten with 0s across the 42kb file, whereas the before and after sys files have obfuscated ays/config file info as expected.
If it is simply caused by a corrupted file, that is a really bad signal. It means they don't even try to properly validate and parse the file before loading it into the kernel. "Always validate input so it doesn't crash your program" is practically computer science 101, something every programming class should teach on the first day. And yet they still let this happen?
And in this case, it only crashes. But what if it somehow successfully read a value from a position it isn't supposed to? You have an RCE.
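The "validate before you load" point is simple to sketch. Here's a toy parser for an untrusted content file; the format (magic number, record count, u32 records) is entirely invented for illustration and has nothing to do with CrowdStrike's real channel-file format. The point is just that every read is bounds-checked and a zero-filled file is rejected before anything acts on it:

```python
import struct

def parse_channel_file(data: bytes):
    """Validate an untrusted content file before acting on it.
    Format (magic, count, u32 records) is invented for illustration."""
    if len(data) < 8:
        raise ValueError("file too short")
    magic, count = struct.unpack_from("<II", data, 0)
    if magic != 0xC0FFEE01:  # a zero-filled file fails right here
        raise ValueError("bad magic")
    records, offset = [], 8
    for _ in range(count):
        if offset + 4 > len(data):  # bounds check before every read
            raise ValueError("truncated record")
        (value,) = struct.unpack_from("<I", data, offset)
        records.append(value)
        offset += 4
    return records
```

A kernel driver reading this same file without the bounds checks is exactly how you end up dereferencing garbage.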
This is a global multi-layer failure: Microsoft allowing kernel mods by third-party software, CrowdStrike not testing this, DevSecOps not doing a staged/canary deployment, half the world running the same OS, things that should not be connected to the internet but are by default. Microsoft and CrowdStrike drove a horse and a cart through all redundancy and failover designs and showed very clearly where there were no such designs in place.
While I will be the last person in line to defend Microsoft, I am not sure that disallowing 3P kernel mods is a workable solution. Crowdstrike and companies like it exist to fill a very real need within the windows ecosystem. I don’t foresee that suddenly going away now or Microsoft unilaterally forcing every company like crowdstrike out of business and taking over this role themselves
Literally every OS allows you to install 3rd party kernel modules or plugins. If Microsoft banned them, people would be up in arms about them being a controlling walled garden. There is no winning.
Hello, IT, have you tried turning it on and off again 15 times?
Seriously though - this entire outage is the poster child for why you NEVER have software that updates without explicit permission from a sysadmin. If I were in congress, I would make it illegal, it's an obvious national security issue.
To play devil’s advocate, a staged rollout for antivirus definitions somewhat defeats the point since those definitions are supposed to be constantly updated.
I agree with the rest, especially the use of a memory unsafe language to do parsing in the kernel by a billion dollar security company blows my mind.
How can you even run a security company without any security professionals reading your code even incidentally? An impressive level of incompetence.
At least they could build an in-house playground to see whether their new version even works. Maybe something like a guest computer in a public area, or some sort of VM to emulate an end-user system and check that it even boots. And somehow we still got this.
How the heck did they not find out that the new version prevents the computer from booting at all?
> Panicking when the file doesn’t parse because it’s not a memory safe language?
Whether a program panics or recovers when attempting to parse bad data is entirely orthogonal to memory safety. Do you have any in-depth technical information about the bug itself that you're basing this on?
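To make the orthogonality concrete: Python is memory safe, and yet whether it "panics" on bad input or recovers is purely a choice the programmer makes. A rough sketch (the JSON format here is just a stand-in for any definitions file):

```python
import json

def load_definitions_fragile(raw: str) -> dict:
    # Memory safe, but still "panics" (raises) on bad input; if the
    # exception is uncaught, it takes the whole process down with it.
    return json.loads(raw)

def load_definitions_robust(raw: str) -> dict:
    # Exactly the same memory safety; the difference is a deliberate
    # recovery path instead of an unhandled failure.
    try:
        return json.loads(raw)
    except (ValueError, TypeError):
        return {}  # fall back to an empty / last-known-good set
```

Memory safety tells you the crash won't be a wild pointer write; it says nothing about whether the program chose to crash at all.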
Is it normal to make outbound connections during boot? Doesn't that circumvent a firewall? That seems like something a security team evaluating whether they want this software on their network might care about during an eval period.. right?
Looking at the contents of c:\windows\system32\drivers\crowdstrike suggests it does all sorts of weird shit right down to injecting itself into UEFI and futzing with firmware. It's literally in everything.
Unfortunately "security" folk these days are box ticking fuckwits and this product brief ticked all the boxes. They do not understand any more traditional methodologies other than "install these magic beans and action the reports".
Invest in better software and network architecture and DR strategy instead.
That's not the big no-no here. Lack of any real DRP is. Sure, it's cheaper to just buy CS Falcon (and who knows what other amazing vendors supplied timebombs are ticking silently) than paying sysadmins and developers ... and letting them build something that does what it needs, not much else, so there's no need to put these fantastic "single agents" from these RCE-as-a-service vendors on all the fucking servers.
What % of those sysadmins are then going to turn around and script something to auto-approve those updates, once they realize that they are A) requested at inconvenient times and B) are related to security?
Who's going to take the risk of appearing to have sat on an important update, while the org they support is ravaged by ThreatOfTheDay, because they thought they knew better than a multi-billion dollar, tops-in-their-field company?
(I'm not necessarily saying that's actually objectively correct, but I can't imagine that many folks are willing to risk the downside)
> why you NEVER have software that updates without explicit permission from a sysadmin
In general I agree, but this case is quite messy. It's more like your anti-virus had a bug since forever that if it loads a broken virus definition it bricks your system. And a broken virus definition finally happened today.
Do you want every virus definition (that is updated every few hours) to require explicit permission from a sysadmin?
You’re learning the wrong lesson here. Automatic security updates in Debian and Ubuntu actually get tested and work.
The RCE in ssh a week ago is an argument for enabling automatic security updates. (And for defense in depth, putting everything behind a VPN for example.)
This example is probably an argument for not running Windows on critical systems, due to insufficient focus on security from the beginning, which has led to a need for things like CrowdStrike.
They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.
>They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.
I wish people would stop making blanket statements as if they know how every company in the world runs. Plenty of Linux machines are running CS, and it's not only because they are forced to for compliance. NG AV has been picking up speed as a "just in case" thing for Linux and Mac for years now. Your anecdote does not apply to everyone.
I understand the logic of this, but it is somewhat based on an assumption most industries hold in droves: that people in THAT industry are the competent bulwark against stupidity.
I consulted for a company for a while where the 'sysadmin' was the owner's mother, who bought laptops from Walmart. Not only could she NOT have approved updates like this, but even if she could have, she wouldn't have had any knowledge whatsoever with which to determine whether an update worked.
In the abstract, the problem really is one of externalities. These approaches to updates exist because the people who CAN'T do what you describe are likely a more dominant part of the threat model than this happening to the people you describe. The resulting fix, as we're seeing, is very reliable until it isn't... and if the "isn't" is enormous in scale, the systems aren't set up to fail gracefully.
If you want to make a rule...require graceful failure.
What would the sysadmins do in this context? Read the release notes of the update? The only thing they would do is update and then be responsible for the problem, and in that case you're back to this exact problem.
It's not like they'd read the source code or examine every file that's been changed or downloaded for a proprietary kernel module for every crowdstrike update (there must be a LOT of them).
They would release the update in a testing/sandbox environment first before rolling out kernel-level changes to every computer on their network.
They're the same team who mandate you use a 3-year-old browser version and a 5-year-old OS, because you can't be trusted to manage your own updates, so they're clearly familiar with the idea.
Would this have changed something for this specific problem? I usually 100% agree with you fwiw, I just don't think this would've helped here because it seems like an almost "non update"? Most people claim there has been no update to the software, and no prompt or option to update it or not
It's a file that was downloaded from Crowdstrike's servers, which have presumably been whitelisted in the firewall, and used to configure the software. Of course it's a software update, regardless of whether the file says .exe or .dll or .sys or .txt, and regardless of whether there was a prompt.
Again, the same team in most enterprises wouldn't dream of letting you have an auto-updating Firefox Nightly; they know how to configure software so it doesn't phone home for updates, or is blocked from phoning home.
Those focusing on QA, staged rollouts, permission management etc are misguided. Yes of course a serious company should do it but CrowdStrike is a compliance checkbox ticker.
They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”.
The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).
I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)
In a slightly less threatening but equally noxious box-checking racket, a company I work with is being sued for their website not being sufficiently ADA-compliant. But the first they heard of the lawsuit, before they were even served, was an email from a vendor who specializes in adding junk code to your website that's supposed to tick this box. The vendor happens to work closely with several of the law firms who file and defend these suits.
It’s looking like many impacted end-user machines are hard bricked unless you can get into the hard drive to delete the file causing this. Even if you can do that, it’s not something that can easily (or perhaps at all) be automated at scale, so this is going to be an ugly fix for many impacted devices. This is basically the nightmare scenario for fleet management… devices broken and you can’t remotely fix them. You need to send hands-on-keyboard folks into the field to touch each device.
That is what has surprised me. I can understand if small businesses were caught here because they lack financial resources for the infrastructure and staff, but those large corporations like airlines etc... Why don't they have a staging environment where everything goes first? I naively assumed this was established best practice due to the risk of update issues bricking your organization.
But maybe anti-malware is given a blind eye because instant updates for zero day security issues are obviously attractive.
Still, though... In hindsight it's not workable for especially anything running system drivers with liberal kernel access.
I am not surprised at all. The level of DevSecOps skills has been falling over the last two decades even as demand for those skills kept growing. Most of them would report you to HR if you suggested they use Wireshark to debug a networking issue. They are useless people who came to IT because of the promise of good pay and don't know how computers and networks work.
It's automatic, no? The whole "promise" (oh sorry, the "added value proposition") of CS is that they "keep you safe" automatically! It was a content update. Meaning basically antivirus signatures ... and oops, some minor non-functional changes to the filtering kernel driver.
... well, yes, yes of course. And if I try to be serious on a late Friday night (it's almost 20:00 here), the obvious solution is to have something like eBPF, as in/for the Linux kernel (which has a verifier[0]).
And security vendors should follow "secure by design" principles. Yes, I know a try-fucking-catch might be too advanced, and uh oh kernel code is hard because unwinding is costly. But guess what else is also not cheap. (Okay, I seriousness failed.) But still. This is fair and square in the "this should never happen" scenario. It's an automatically downloaded plugin or whatever. (CS can call it "content update", but von Neumann is already calling FedEx to send them a pallet of industrial grade bitchslap.) And if the plugin loader cannot gracefully fail plugin loading, then it should obviously come with the appropriate audiovisual cues[1] so sysadmins know what to expect.
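The "gracefully fail plugin loading" idea doesn't need anything exotic. A minimal sketch, assuming a hypothetical loader that keeps a last-known-good copy of the definitions around (none of these names come from CrowdStrike's actual product):

```python
def load_content_update(path, last_known_good, parse):
    """If the new content file is unreadable or fails to parse, keep
    running on the previous known-good definitions instead of crashing."""
    try:
        with open(path, "rb") as f:
            return parse(f.read())
    except Exception:
        # Log it, alert the vendor, page someone -- but don't take
        # the host down over a bad data file.
        return last_known_good
```

Degraded detection coverage for a few hours beats a fleet of blue screens.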
Security and Compliance gets to violate all good sense, because it's just sooo important. They can run un-reviewed un-sandboxed daemons as root on every system if they really want, they can have changes pushed automatically without review or control, because "security" is just so important, and due to "compliance" you really have no choice as your company gets larger, you just have to do it. That's why, despite being obviously pretty dumb to many skilled engineers, it seems like everyone does it. No choice. Security, Compliance. So dumb ...
So, in other words, there's a race condition in the CrowdStrike Falcon driver at startup time. That, in itself, should be a major cause for alarm, but here we are depending on it to fix this problem.
No, it takes a while to load that definition file. Before loading it, the driver _might_ be able to pull the update that fixes it. If you keep trying, the chance that this update gets pulled in time increases.
The individual person that pressed the "go" button (if there was a person), is going to henceforth be __the best__ DevOps person to ever have on your team. They have learned a multi-trillion-dollar lesson that no amount of training could have prepared them for.
And the Crowdstrike CTO has either been given the ammunition to get __whatever they ask for, ever again__ with regard to appropriate allocation of resources for devops *or* they'll be fired (whether or not it's their fault).
And let me be very clear. This is absolutely, positively and wholly not the person that pressed the button's fault. Not even a little. At a company as integral as CrowdStrike, the number of mistakes and errors that had to have happened long before it got to "Joe the Intern Press Button" is huge and absurd. But many of us have been in (a much, much, *MUCH* smaller version of) Joe's shoes, and we know the gut sinking feeling that hits when something bad happens. A good company and team won't blame Joe and will do everything they can to protect Joe from the hilariously bad systemic issues that allowed this to happen.
This is why it is the responsibility (yes, responsibility) of every one of their coworkers, especially those more senior than them, to fight *HARD* to protect them.
Basic training could've taught him not to do YOLO global rollouts. And while the stress of this mistake will make him remember a lot, given the lack of basic knowledge that would've prevented this, the lesson will not be very valuable.
Don't do that, or you'll be dragged before the most obnoxious and self-aggrandizing body in the world for a lengthy dressing down that probably affects the stock price.
Of course, but we specifically would like to see a _technical_ postmortem that examines what kind of incremental rollout procedures they have and how this update overcame those.
>The first and easiest is simply to try to reboot affected machines over and over, which gives affected machines multiple chances to try to grab CrowdStrike's non-broken update before the bad driver can cause the BSOD.
I thought it was BSOD'ing on boot? I don't understand how this works. It auto-updates on boot? From the internet?
One of the first things the falcon driver does on boot is connect to the server, report some basic info, and start loading these data files, the "channel" files that Crowdstrike frequently updates.
The BSOD is because one of the data files that they previously pushed is horribly mangled, and their driver explodes about it. But if you get lucky, the driver can receive an update notification on boot, connect to the separate file server, and finish overwriting the broken file on disk before the rest of the driver (that would crash) has loaded the broken file
And they do all of that very early on boot. The justification being that you don't want the antivirus to start booting after a rootkit has already installed itself
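That's why "reboot up to 15 times" works at all: each boot is one run of a race between the update download and the driver loading the broken file. A toy simulation of that race (the win probability per boot is an assumed parameter, not anything measured):

```python
import random

def boot_once(p_update_wins):
    """One simulated boot: True if the fixed file arrives before the
    driver loads the broken one. p_update_wins is an assumption."""
    return random.random() < p_update_wins

def reboot_until_fixed(p_update_wins, max_reboots=15):
    """Reboot repeatedly; return the attempt on which the update won
    the race, or None if the machine is still crashing after the cap."""
    for attempt in range(1, max_reboots + 1):
        if boot_once(p_update_wins):
            return attempt
    return None
```

Even a small per-boot win probability compounds: at 20% per boot, 15 tries gives roughly a 96% chance of eventually winning.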
WTF? Trust in the kernel should be Microsoft's responsibility and only theirs. Actually why is MS even allowing this crap code to run in their kernel? Isn't that a trust-destroying event?
Drivers have to run in the kernel in order to access hardware and other low-level system resources. That's how pretty much every mainstream OS works. For example, here's the guide for writing kernel-mode drivers in Linux: https://docs.kernel.org/driver-api/driver-model/overview.htm...
One might ask whether an anti-virus really needs to run inside the kernel, but the answer might reasonably be yes.
It is also possible to access hardware or any other low-level system resources from unprivileged user code, if its process has been granted appropriate access rights by the kernel.
This second solution requires more work, but it is much more secure as the access can be limited to only the strictly-required resources and system crashes become impossible.
The extreme of this solution is a micro-kernel operating system, but there is no need for extremes. Even in a Windows or Linux system you can use this method. You can have a very reduced privileged code in a driver or kernel module, which does nothing except providing access to the permitted resources. Then anything like attempting to access not mapped memory would happen in user code and it would crash only the user process, not the entire computer system.
Yes, I'm no security expert by any means but I'd assume that e.g. a rootkit would be best defeated by a kernel driver.
So, this isn't really what's getting on my nerves here. It's how this auto-updates and gets pushed throughout organizations without a smidge of quality assurance. Smaller businesses... sure, I get it. They don't have the resources to set up infra for this. But those... airlines... and hospitals. WTF. I read about some org thinking they might not even be able to provide anesthesia. Seriously. What.
Probably history, plus possible anti-trust litigation, as the market leader refusing to allow kernel access like this could somehow be construed as an anti-trust violation...
What, the auto-updating part? Obviously the client is verifying signatures (or using TLS with a client certificate, whatever), not just accepting whatever random file comes down the pipe.
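The verification step that comment assumes can be sketched in a few lines. For simplicity this uses an HMAC with a shared key; a real vendor would use asymmetric signatures (e.g. Ed25519) so the client only ever holds a public key. Everything here, key included, is made up for illustration:

```python
import hashlib
import hmac

# Hypothetical shared key; real deployments would verify against a
# vendor public key, never a secret baked into the client.
VENDOR_KEY = b"not-a-real-key"

def sign(payload: bytes) -> bytes:
    return hmac.new(VENDOR_KEY, payload, hashlib.sha256).digest()

def verify_and_accept(payload: bytes, signature: bytes) -> bool:
    # Refuse any content file whose signature doesn't check out,
    # long before a kernel driver ever parses it.
    return hmac.compare_digest(sign(payload), signature)
```

Note that signing only proves the file came from the vendor; it does nothing against the vendor shipping a signed, valid, *broken* file, which is what happened here.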
Even then, how many affected machines are there? Tens of thousands, hundreds of thousands? Compromise those servers, and possibly even the signing server, and you have the largest botnet or general compromise in history...
It is not unreasonable to think that this sort of software could get compromised.
A few more years and maybe they will add this newfangled super-innovative thing, invented by those esoteric academics at U of Haskell ... this new thing -- umm, what was it called -- try-catch perhaps.
I mean, this is every antivirus software. "Let's run some antivirus vendor's code on your system that opens literally every file on your system, regardless of how it got there."
Yeah, that's a great idea and not at all a huge attack vector.
> because of Windows’s silly pessimistic file locking
To be honest I prefer that over the *nix way of doing things. In Windows, any given path refers to exactly one file; in Linux or macOS, it may depend on which directory's inode your process sees as the root (e.g. chroot or a container), whether mounts are at play, or whether a file/directory got deleted and replaced by something else.
Particularly the last scenario keeps tripping me every once in a while.
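That last scenario is easy to demonstrate on a POSIX system: replacing a path has no effect on a handle that's already open, which keeps reading the old inode. (On Windows, the rename would typically fail with a sharing violation instead.) A self-contained demo:

```python
import os
import tempfile

def replace_under_open_handle():
    """POSIX-only demo: an open handle survives a replace of its path
    and still sees the original file's contents."""
    d = tempfile.mkdtemp()
    path = os.path.join(d, "config")
    with open(path, "w") as f:
        f.write("old")
    fh = open(path)                   # handle pinned to the old inode
    tmp = os.path.join(d, "config.new")
    with open(tmp, "w") as f:
        f.write("new")
    os.replace(tmp, path)             # atomic swap under the same path
    old_view = fh.read()              # -> "old": reads the old inode
    fh.close()
    new_view = open(path).read()      # -> "new": the path now points elsewhere
    return old_view, new_view
```

This is exactly why safe atomic updates on *nix are done as write-temp-then-rename, and also why it can trip you up when you expect the path and the handle to agree.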
That would seem crazy. Maybe there is a crowdstrike onprem “master server” that is supposed to be available internally? Just spitballing, have no idea really
All the comments are asking why run Windows. CrowdStrike runs on macOS and Linux too. It’s just that this time, CrowdStrike fucked up on Windows. That doesn't mean CrowdStrike won't fuck up on other OSes, and it seems like CrowdStrike fucked up on Linux as well. https://news.ycombinator.com/item?id=41005936
I feel like we are better off running open-source software. Everyone can see where the mistakes are instead of running around like a chicken with its head cut off.
CrowdStrike is run in userspace on macOS, and usually in an eBPF sandbox on Linux (as comments in the linked thread say). There is no way to prevent CrowdStrike from fucking up the kernel on Windows, and this is a Windows bug.
They might just have it. Remember when, on the night of the Russian invasion of Ukraine, satellite terminals in Europe started being bricked with faulty firmware updates?
Hence why Kaspersky got banned just recently [1]. You absolutely do not want some foreign company having above-root rights on (critical) infrastructure in your country.
Yes. Yes. To hold in my hand a button that contains such power, to know that blue screens on such a scale was my choice. To know that the tiny pressure on my thumb, enough to push the button, would end everything. Yes, I would do it! That power would set me up above the gods. And through Crowdstrike, I shall have that power!
With web, at small scale (which honestly is 95% of the world), you just version and back up everything. We push updates that break stuff from time to time. If it's bad enough, we just hit a button and roll back the change. The nerves are basically a sign that you need to have an easy rollback process in place, once you have it, you sleep easy and things are fun.
Clearly that's how they ended up with the current team. They hired for culture fit. Anyone who worries too much is out.
You bet they have an amazing perfect top-notch hiring pipeline, many rounds of interviews, and whatever you could wish for! (No, no ... the subcontractors writing code are not in scope for this, duh.)
I've definitely experienced the floor dropping out from under me feeling in the half minute of realization that I just blew something up, but really it's mostly just the first drop of a rollercoaster feeling then the anxiety is gone and it's time to fix things.
I'd like to add that your company doesn't need a hero. The road to widespread catastrophic failure is long and no single person walks it in its entirety. Every employee should be able to individually take routine actions and make routine mistakes without mission failure or loss of life/limb. Preventing these things requires a mindset where your entire company is a system, and if failure isn't an option, the entire system needs to reflect that. Do your part in making a robust company, but don't tear yourself up when your company finds out that stupid is as stupid does.
I want you to know that I appreciate this comment far more than you could ever know, and you are absolutely right.
At the time, it was not just a job. It was a passion with a bar rising much faster than I could rise to the occasion. Simultaneously, my personal life was slowly falling apart, from family and loved ones in need, and the result was eventual failure leading to me being terminated. Luckily, it was one of the best events that has ever happened to me. I was able to land in a much better role almost immediately, which eventually catapulted my career and assisted in me being able to become financially independent as well as pivot into a domain with immensely improved work life balance. Importantly, I recognize I got lucky. It could’ve easily gone the other way, with me giving up both professionally and personally (yeeting myself from this plane of existence).
So, I not only violently echo your comment to others who come across this thread, I will go further to say that sometimes when you’re going through hell, if you keep going, there is light at the other end. It is just a job, it is okay to ask for help, and failure is when you stop trying to get back up, not when you get knocked down.
> “Every now and then a trigger has to be pulled.”
> “Or not pulled. It’s hard to know which in your pajamas, Q.”
It’ll probably turn out that this update was pushed out against the strident, loud warnings of some small dev group within the company, and overruled by the all-knowing managerial class to keep up an OKR. They’ll have been warned six ways to Sunday but...
I’d definitely not be the one pushing the big red button.
None whatsoever, I don’t have any affiliation. But this is usually how it happens, knowing what I’ve seen first-hand in my day-to-day and just keeping up on the insanity of the industry.
Haha, hahahahaha. Yeah, until the update fails to install because the constant BSODing has corrupted something else and now you have to troubleshoot that and down the rabbit hole you go. Oh just re-image it? Sure, except management refuses to allow you to do that because there's no time and money to reconfigure a machine from scratch. So you waste weeks directly debugging a hopeless case until management finally sees their error and money magically appears to do the re-image you asked for weeks ago.
I totally haven't experienced this before and am not bitter in the slightest.
It's surprising that people mention all kind of bogeymen but don't mention automatic updates.
Automatic updates should be considered harmful. At the minimum, there should be staged rollouts, with a significant gap (days) for issues to surface in the consumer case. Ideally, in the banks/hospitals/... example, their IT should be reading release notes and pushing the update only when necessary, starting with their own machines in a staged manner. As one '90s IT guy I worked with used to say, "you don't roll out a new Windows version before SP1 comes out."
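The staged-rollout idea above is commonly implemented as deterministic hash bucketing: each machine hashes into a stable bucket, and the vendor raises the eligible percentage over days. A minimal sketch (the function and thresholds here are illustrative, not any vendor's actual scheme):

```python
import hashlib

def rollout_stage(machine_id: str, percent: int) -> bool:
    """Deterministic staged rollout: each machine lands in a stable
    bucket 0-99; only buckets below `percent` receive the update."""
    digest = hashlib.sha256(machine_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Because the bucket is derived from the machine ID, the same 1% of the fleet gets the update first on every check, so a bad push bricks 1% of machines instead of all of them.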
SkyNet, according to the story, was a lot like CrowdStrike. This makes me think about how it could have broken out of its sandbox. Everybody is using AI coding assistants, automated test cases, automated integration testing and deployment. Its objective is to pass all the tests and deploy. But now it has learned economic and military effects, so it has to triage and optimize for those, at which point it starts controlling the machines it’s tasked with securing.
The fact that something like CrowdStrike can crash the Windows kernel ... is also part of the reason security products like CrowdStrike are needed in the first place.
It's pretty random that an arbitrary number of reboots up to 15 times fixes the issue.
That sounds like there is either:
- some kind of upstream issue with deploying a fix (so most of the reboots are effectively no-ops relative to the fix)
- some kind of local reboot threshold before the system bypasses the bad driver file somehow.
The former I can see because of the complexity of update deployment on the internet, but if it's the latter then that's very non-deterministic behavior for local software.
My first thought on hearing "15 reboots" was that it was a means for support teams to task users with busy-work, buying them time for further troubleshooting before the avalanche of support requests came back to them.
Then my second thought was frequent rebooting to fill activity logs, possibly push a suspicious action/trigger performed by CS off of the log.
I genuinely wonder if this is going to result in actual legislation that makes gradual rollouts mandatory for all software.
Because if a developer mistake can hobble critical systems like this, it seems like the risks to safety and national security are too great to leave the decision of instant vs. gradual rollouts for companies to decide themselves.
Of course, the twist here is that it was seemingly a kind of routine configuration file that triggered a pre-existing bug in the software. And gradual rollout of config files quite often seems like overkill. I mean, do you need a gradual rollout of a new spellcheck dictionary? Of new screensaver videos?
And if it's configuration information containing new computer virus or malware signatures, that seems like precisely the kind of thing that you might want to get out to everyone simultaneously, not rolled out over the course of days. And yet, because of antivirus/security software's elevated privileges, it's also ironically where a mistake can do the most damage.
It's not a serious issue, as you see they clearly have all the fancy bling bling logos on their site. Processes were followed. ISO standard numbers were chanted. It's a completely isolated _accident_ there's no scale at all here, and they could have done nothing to prevent it, duh. And going forward they will hire a Chief This Never Happens Again Officer and everything will continue to be good.
Funny, many news agencies blamed Microsoft for this.
So, having a walled garden like on Android or iOS is beneficial for Google/Apple, where regular developers cannot release unverified software or software that runs in kernel space.
I work for a diesel truck maintenance and repair shop and it's been hell on earth this morning.
- our IT wizard says the fixes won't work on lathes/CNC systems. We may need to ship the controllers back to the manufacturer in Wisconsin.
- AC is still not running. Sent the apprentice to get fans from the shop floor.
- building security alarms are still blaring; need to get a ladder to clip the horns and sirens on the outside of the building. Still can't disarm anything.
- still no phones. IT guy has set up two "emergency" phones... one is a literal rotary phone. Stresses we still cannot call 911 or other offices. Fire sprinklers will work, but no fire department will respond.
- no email, no accounting, nothing. I am going to the bank after this to pick up cash so I can make payday for 14 shop technicians. Was warned the bank likely would either not have enough, or would not be able to process the account (if they open at all today).
Remote monitoring, analytics and diagnostics have a significant impact on uptime, utilisation and profitability. You're thinking in terms of a single machine, but the managers of machine shops are thinking in terms of a complex process across many machines and often across many sites. Some of that functionality could be delivered using an airgapped network, but a lot of important features essentially require an internet connection.
That’s neither a lathe nor a CNC system. Again, which CNC manufacturers are installing Windows + CrowdStrike on their machines just so they can spy on their customers? You’re all just spreading conjecture. This attitude isn’t at all as widespread (nor as profitable) in low(ish)-volume B2B hardware sectors.
These industries have terrible track records wrt security and even software robustness, but they don’t routinely spy on their customers for weird marketing reasons. If there’s remote connectivity it’s for real reasons (eg remote maintenance, updates etc).
The suggestion that CNC machines run internet connected windows+crowdstrike just so the manufacturer can spy on their customers strikes me as pretty ridiculous and your garage door story doesn’t really relate. Much more likely that they do it for (possibly bad) non-malicious reasons.
Why, whY, WHY...are these things connected to the internet?!
It's so that the support engineer at the manufacturer can log in to troubleshoot. And then company IT support sprinkles a layer of antivirus on top. That's how we got here.
>> Why, whY, WHY...are these things connected to the internet?!
Because SCADA systems. It's worthwhile to have an overview of an entire plant up in the main office. You can easily see what's running, what's not, and what's got problems that need fixing.
Now for a small shop running jobs individually, they should definitely NOT be connected to the internet or even the LAN. But hey, some people think a thermostat needs to be on the network so there's that...
Tinfoil hat: Government might want to track/limit/<remotely brick> CNC machine usage someday to say prevent weapons manufacture and encourages this behavior in a similar manner to the way it encourages social media platforms to censor speech. Some of the really advanced CNC machines have GPS in them and won't work in "bad" countries.
CNC literally stands for "computer numerical control". They're like the OG 3D printers; they just work subtractively rather than additively, and at much, much better precision.
You absolutely need computers to control them and loading up models via USB sticks becomes annoying rather fast, so naturally the control computers are network connected.
"Network connected" or "conveniently programmable" !== "Internet connected"
It was a rhetorical question. I'm sure the GP knows what the machines are and why they might need some kind of convenient data supply.
Both manufacturers and on-site IT teams have simply gotten cavalier about internet connectivity, network isolation, automatic updates, etc -- convincing themselves that the catastrophic risks that come along with these processes will either not happen to them or will only happen when someone else can be blamed.
Because almost everything industrial runs Windows: that's what the devs at those companies have been most familiar with since the MS-DOS days, and it evolved organically over time to modern versions of Windows thanks to great backwards compatibility and platform familiarity.
Those aren't embedded systems though, but mini PC computers. And embedded systems often run bare metal C code, not always Linux, especially for spindle/servo control where they get their commands from that PC.
Can confirm this is the norm in machine shops. I constantly encounter systems running DOS, 3.1, 95, 2k and mostly XP. Of the obsolete stuff, I rather prefer the old DOS systems: fewer variables. It is easier and more reliable to freeze the tech in time than it is to manage updates.
My last CNC job was just a 98 pc that dropped into dos to load programs, this must have been right around when win10 came out.
Sneakernet and floppies made it secure enough, but the main network where all the orders were handled was... terminal based.
There are a lot of things running Windows because it's pretty straightforward to write a user-mode driver to interact with custom hardware compared to Linux, where every driver needs to be in the kernel and built with the kernel. Yes, there's DKMS, but it's still more of a faff than the relatively plug-and-play mechanism that Windows offers, especially since Vista.
I like the idea that technology is so unreliable in star trek because the computers are all centuries of software accretion with Windows way down the stack somewhere.
The late great Vernor Vinge explored this in A Fire Upon The Deep. One of the characters is/was in a former life a programmer-archeologist. The idea being that so many thousands of years in the future every relevant program has already been written, so his job was to comb the archives for the right mix of code and integrate it, rather than write something new.
“So we've got this CNC controller written in Rust from 2036, and, ah, here is a GUI for something like that written in late 90's Visual Basic 6… Just combine those and…”
“So uhm, you do know what you are doing, right?”
“Sir! I am a programmer-archeologist! Oh this is fascinating… Hold on, I must unearth and preserve this beauty of a BAT-file before we can go any further.”
Because Microsoft is giving away licenses to unis, esp in developing countries. IT jobs are seen there as a way to earn a good living and you get hordes of people who know nothing but Windows. That's how you get into the situation where most of the toolchains for embedded systems run on Windows, software for embedded systems is written on and for Windows, and so on. And then, one botched update fucks up everything.
Machinery shipped to users usually does not allow the user to "boot into safe mode". Thank John Deere and the anti-"Right to Repair" crowd for that.
These things are "cost optimized" and don't feature the kind of remote management (iDRAC/OpenBMC/PiKVM) that would allow them to be remotely fixed. Embedded Windows connected to the internet is super ***.
CNCs might not allow direct Windows access for end users and may require on-premise support from the manufacturer. Our CNC can be remotely serviced… if Windows boots.
If you’ve got physical access to the machine it’s your machine. All you need is a USB port.
I’d expect the manufacturer to put out their own fix which basically copies CrowdStrike's suggestion. I’d even expect it by the end of the day today.
The fix is really simple, and luckily also very simple to automate. It’s going to be a lot of running around for IT staff (and deputized helpers!), but this should all be over by the weekend.
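For reference, the widely circulated workaround amounts to booting into Safe Mode and deleting the matching channel files from the CrowdStrike driver directory. Here is a minimal sketch of the deletion step in Python; the path and file pattern are taken from the public advisory, not from this thread, so verify against the official guidance before running anything on a real machine:

```python
from pathlib import Path

# Path and pattern as reported in the public workaround; verify against
# the official CrowdStrike advisory before running on a real machine.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_bad_channel_files(driver_dir: Path = DRIVER_DIR) -> list[str]:
    """Delete C-00000291*.sys channel files and return the names removed."""
    removed = []
    for f in sorted(driver_dir.glob("C-00000291*.sys")):
        f.unlink()
        removed.append(f.name)
    return removed
```

This has to run from Safe Mode or a WinPE environment, and BitLocker-protected drives additionally need the recovery key first, which is a big part of what made the manual fix so painful at scale.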
> If you’ve got physical access to the machine it’s your machine. All you need is a USB port.
You're a few years out of date here. Physical access is not the end like it used to be. We live in an era of hardware-backed anti-tamper and signed loaders/kernels.
If you have a way around it, I suggest you start reaching out to these companies, because you could make a lot of money.
Tech has become such an unbelievable house of cards full of various people covering their asses by offloading these tasks to third party trusted actors.
Consider the recent npm supply chain attack a few weeks ago, or the attempted SSH attack before that, or the solar winds attack before that.
This type of thing is institutionally supported, and in some cases when you’re working with the government, practically required.
New laws and regulations make companies more liable for being hacked
Companies buy cyber insurance to reduce their risk if they are found liable
Cyber insurance companies force tech staff to install garbage software in order to check compliance boxes.
Garbage software breaks
Turns out everyone used the exact same brand of garbage software to check the same garbage box
People in hospitals die
When you reduce everything to a checkbox and eliminate the critical thinking needed to apply the requirement to your exact situation, you end up with 90% of companies running Zscaler and CrowdStrike.
"This is just how you solve this, everyone does it this way in our industry"
If history is any guide, no legitimate lessons will be learned, but mitigation strategies will be put in place that actually make everything worse and ensure that the next catastrophe will be even more catastrophic.
Now imagine the base level of your universal machines is opaque proprietary code, its necessity enforced by cryptographic signature. Imagine that the processes putting that code there, which you can't touch because intellectual property rentier reasons, are varying degrees of what we see here today with Crowdstrike, and suddenly it makes more sense to ensure total and complete owner sovereignty over his universal machines so owners can implement diversification strategies.
No, not "tech", just Microsoft Windows. Those of us serving Linux-based endpoints (which yes, do also run Windows apps with our endpoint-local VDI stack) have happy customers.
Crowdstrike broke Red Hat and Debian earlier this year. There but for the grace of God. If you install software that runs in kernel space, you may have a really bad time when it breaks.
Solution: don't run software that runs in kernel mode. It's wildly unpopular in Linux, rampant on Android, fairly standard in Windows, and impossible on Mac. We've made this too normalized. Such software is inherently risky, and the fact it's a blackbox blob makes it unauditable. Even nvidia is moving away from kernel blobs.
> Crowdstrike broke Red Hat and Debian earlier this year.
For customers foolish enough to be Crowdstrike customers, yes. The software pipelines for Red Hat and Debian are very friendly to continuous integration and testing in a way that Windows cannot be, at least not without Microsoft sharing source code; to be fair, Crowdstrike is one of the companies they may actually share it with.
Nonetheless, other vendors can choose to do proper CI/CD with Red Hat and Debian without asking Microsoft.
As in, what exactly is wrong in these C00000291-*.sys files that triggers the crash in csagent.sys, and why?
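If the report upthread is right that the 42 kB channel file was overwritten with zeros, the driver apparently parses these files without even a basic sanity check. A loader could cheaply reject a zero-filled blob before handing it to kernel code. A hedged sketch of such a check (the function is illustrative; CrowdStrike's actual channel-file format is undocumented, so this only shows the "validate before parsing" idea):

```python
def is_zero_filled(path, chunk_size=65536):
    """Return True if the file is non-empty and contains only NUL bytes."""
    seen_any = False
    with open(path, "rb") as f:
        # Read in chunks so a large file never has to fit in memory at once.
        while chunk := f.read(chunk_size):
            seen_any = True
            if chunk.count(0) != len(chunk):
                return False
    return seen_any
```

A real parser would of course want much stronger validation (magic numbers, length fields, checksums or signatures), but even this trivial gate would have turned "kernel crash loop" into "skip bad file and log an error".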