CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes or accessing files they shouldn't be (using some drunk-ass heuristics).
What happened here was they pushed a new kernel driver out to every client without authorization, to fix an issue with slowness and latency in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this, but they pissed over everyone's staging and rules and just pushed this to production.
This has taken us out and we have 30 people currently doing recovery and DR. Most of our nodes are boot-looping with blue screens, which in the cloud is not something you can fix by just hitting F8 and removing the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file and bring it up. Either that or bring up a new node entirely from a snapshot.
This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.
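For anyone stuck scripting the same recovery, here's a rough sketch of the detach/fix/reattach dance with boto3. The instance and volume IDs are placeholders, and the file to delete is the CrowdStrike channel file that's been circulating (C-00000291*.sys); treat this as an outline under those assumptions, not a drop-in tool:

    import boto3

    ec2 = boto3.client("ec2")
    BROKEN = "i-0123456789abcdef0"   # boot-looping node (placeholder ID)
    HELPER = "i-0fedcba9876543210"   # healthy node to mount the disk on (placeholder ID)

    # 1. Force-stop the broken instance so its root volume can be detached.
    ec2.stop_instances(InstanceIds=[BROKEN], Force=True)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[BROKEN])

    # 2. Find and detach the root EBS volume.
    inst = ec2.describe_instances(InstanceIds=[BROKEN])["Reservations"][0]["Instances"][0]
    vol = next(m["Ebs"]["VolumeId"] for m in inst["BlockDeviceMappings"]
               if m["DeviceName"] == inst["RootDeviceName"])
    ec2.detach_volume(VolumeId=vol)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol])

    # 3. Attach it as a secondary disk on the healthy node, then (inside that node)
    #    delete C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys.
    ec2.attach_volume(VolumeId=vol, InstanceId=HELPER, Device="/dev/sdf")

    # 4. Reverse the dance: detach from the helper, attach back as the root device
    #    on the broken instance, and start it up again.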
I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.
Edit: to all the people moaning about Windows, we've had no problems with Windows. This is not a Windows issue. This is a third-party security vendor shitting in the kernel.
I did approximately this recently, but on a Linux machine on GCP. It sucked far worse than it should have: apparently GCP cannot reliably “stop” a VM in a timely manner. And you can’t detach a boot disk from a VM that isn’t “stopped”, nor can you multi-attach it, nor can you (AFAICT) convince a VM to boot off an alternate disk.
I used to have this crazy idea that fancy cloud vendors had competent management tools. Like maybe I could issue an API call to boot an existing instance from an alternate disk or HTTPS netboot URL. Or to insta-stop a VM and get block-level access to its disk via API, even if I had to pay for the instance while doing this.
And I’m not sure that it’s possible to do this sort of recovery at all without blowing away local SSD. There’s a “preview” feature for this on GCP, which seems to be barely supported, and I bet it adds massive latency to the process. Throwing away one’s local SSD on every single machine in a deployment sounds like a great way to cause potentially catastrophic resource usage when everything starts back up.
Hmm, I wonder if you’re even guaranteed to be able to get your instance back after stopping it.
WTF. Why can’t I have any means to access the boot disk of an instance, in a timely manner? Or any better means to recover an instance?
AWS is not really any better on this. In fact, two years ago (to the day!) we had a complete AZ outage in our local AWS region. This resulted in their control plane going nuts and being unable to shut down or start new instances. Then capacity problems.
That's happened several times, actually; that's probably just the latest one. The really fun one was when S3 went down in 2017 in Virginia. It caused global outages of multiple services because most services were housed out of Virginia, and when EC2 and other services went offline due to their dependency on S3, everything cascade-failed across multiple regions (in terms of start/stop/delete, i.e. API actions; stuff that was already running was, for the most part, still working in some places).
...I remember that day pretty well. It was a busy day.
> apparently GCP cannot reliably “stop” a VM in a timely manner.
In OCI we made a decision years ago that after 15 minutes from sending an ACPI shutdown signal, the instance should be hard powered off. We do the same for VM or BM. If you really want to, we take an optional parameter on the shutdown and reboot commands to bypass this and do an immediate hard power off.
So worst case scenario here, 15 minutes to get it shut down and be able to detach the boot volume to attach to another instance.
I had this happen to one of my VMs: I was trying to compile something and ran out of memory, then tried to stop the VM, and it only came back after 15 minutes. I think it is a good compromise, long enough to give a chance for a clean reboot but short enough to prevent longer downtimes.
I’m just a free tier user but OCI is quite powerful. It feels a bit like KDE to me where sometimes it takes a while to find out where some option is, but I can always find it somewhere, and in the end it beats feeling limited by lack of options.
We've tried at shorter time periods, back in the earlier days of our platform. Unfortunately what we've found is that the few times we've tried to lower it from 15 minutes, we've ended up with Windows users experiencing corrupt drives. Our best blind interpretation is that some things common enough on Windows can take up to 14 minutes to shut down under worst circumstances. So 15 minutes it is!
Based on your description, AWS has another level of stop, the "force stop", which one can use in such cases. I don't have statistics on the time, so I don't know if that meets your criteria of "timely", but I believe it's quick enough (sub-minute, I think).
There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.
As for "throwing away local SSD", that only happens on AWS with instance store volumes which used to be called ephemeral volumes as the storage was directly attached to the host you were running on and if you did a stop/start of an ebs-backed instance, you were likely to get sent to a different host (vs. a restart API call, which would make an ACPI soft command and after a duration...I think it was 5 minutes, iirc, the hypervisor would kill the instance and restart it on the same host).
When the instance would get sent to a different host, it would get different instance storage and the old instance storage would be wiped from the previous host and you'd be provisioned new instance storage on the new host.
However, EBS volumes travel from host to host across stop/start cycles; they're attached with very low latency over the network from EBS servers and presented as a local block device to the instance. It's not quite as fast as local instance store, but it's fast enough for almost every use case if you get enough IOPS provisioned, either through direct provisioning plus the correct instance size OR through a large enough drive plus a large enough instance to maximize the connection to EBS (there's a table detailing IOPS, throughput, and instance size in the docs).
Also, support can detach the volume as well if the instance is stuck shutting down and doesn't get manually shut down by the API after a timeout.
None of this is by any means "ideal", but the complexity of these systems is immense and what they're capable of at the scale they operate is actually pretty impressive.
The key is that lots of the things you talk about are doable at small scale, but when you add more and more operations and complexity to the tool stack for interacting with systems, you add a lot of back-end network overhead, which leads to extreme congestion even in very high-speed networks (it's an exponential scaling problem).
The "ideal" way to deal with these systems is to do regular interval backups off-host (ie. object/blob storage or NFS/NAS/similar) and then just blow away anything that breaks and do a quick restore to the new, fixed instance.
It's obviously easier said than done: most shops still, on some level, think about VMs/instances as pets rather than cattle, or have hurdles that make treating them as cattle much more challenging. But manual recovery in the cloud should, in general, be avoided in favor of spinning up something new and redeploying to it.
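A minimal sketch of that backup pattern with boto3, assuming volumes carry a backup=true tag (the tag key and everything else here is made up; in practice you'd more likely lean on AWS Backup or Data Lifecycle Manager):

    import boto3

    ec2 = boto3.client("ec2")

    # Snapshot every volume carrying a (hypothetical) backup=true tag.
    vols = ec2.describe_volumes(
        Filters=[{"Name": "tag:backup", "Values": ["true"]}])["Volumes"]
    for v in vols:
        ec2.create_snapshot(
            VolumeId=v["VolumeId"],
            Description="scheduled off-host backup",
            TagSpecifications=[{
                "ResourceType": "snapshot",
                "Tags": [{"Key": "source-volume", "Value": v["VolumeId"]}],
            }])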
> There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.
This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?
The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.
Also, there should be a way to force stop an instance that is already stopping.
>This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?
The issue is far more nuanced than that. The systems are very complex and they're a hypervisor that has layers of applications and interfaces to allow scaling. In fact, the hosts all have BMCs (last I knew...but I know there were some who wanted to get rid of the BMC due to BMCs being unreliable, which is, yes, an issue when you deal with scale because BMCs are in fact unreliable. I've had to reset countless stuck BMCs and had some BMCs that were dead).
The hypervisor is certainly capable of killing an instance instantly, but the preferred method is an orderly shutdown. In the case of a reboot or a stop (and a terminate where the EBS volume is not also deleted on termination), it's preferable to avoid data corruption, so the hypervisor attempts an orderly shutdown and then, after a timeout period, just kills the instance if it has not already shut down in an orderly manner.
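The underlying pattern is the same one you'd use against any local hypervisor: ask nicely, wait, then kill. An illustrative sketch (using psutil against a QEMU process; this is just the shape of it, not AWS's actual mechanism):

    import psutil  # third-party; any way of holding a process handle works

    def stop_guest(qemu_pid: int, grace_seconds: int = 300) -> None:
        """Ask the guest to shut down cleanly, then hard-kill after a timeout."""
        proc = psutil.Process(qemu_pid)
        proc.terminate()                      # polite request, analogous to the ACPI power button
        try:
            proc.wait(timeout=grace_seconds)  # give the guest time to flush and unmount
        except psutil.TimeoutExpired:
            proc.kill()                       # the "force" path: immediate power-off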
Furthermore, there's a lot more complexity to the problem than just "kill the guest". There are processes that manage the connection to the EBS backend that provides the interface for the EBS volume, as well as APIs and processes to manage network interfaces, firewall rules, monitoring, and a whole host of other things. If the monitoring process gets stuck, it may not properly detect an unhealthy host, and external automated remediation may not take action. That same monitoring is often responsible for individual instance health and recovery (i.e. auto-recover), and if it's not functioning properly, it won't take remediation actions to kill the instance and start it up elsewhere. The hypervisor itself may also not be properly responsive, so a call from the API won't trigger a shutdown action.

If the control plane and the data plane (in this case the hypervisor/host) are not syncing/communicating (particularly on a stop or terminate), the API needs to ensure that the state machine is properly preserved and the instance is not running in two places at once. You can then "force" stop or "force" terminate, and/or the control plane will update state in its database and the host will sync later. There is a possibility of data corruption or doubly sent/received data in the force case, which is why it's not preferred. Also, after the timeout (without the "force" flag), it will go ahead and mark the instance terminated/stopped and sync later; the "force" just tells the control plane to do it immediately, likely because you're not concerned about data corruption on the EBS volume, which may end up double-mounted if you start up again and the old one is not fully terminated.
>The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.
It does have a concept where all resources are still held and billed, except CPU and memory; that's effectively what a reboot does. Same with a stop (except you're not billed for compute usage and network usage will obviously be zero, though an EIP would still incur charges). The transition between stopped and running is also fast; the only delays incurred are via the control plane, either capacity constraints causing issues placing an instance/VM or the chosen host not communicating properly, but in most cases it is a fast transition. I'm usually up and running in under 20 seconds when I start an existing instance from a stopped state. There's also now a hibernate/sleep state the instance can be put into via the API if it's Windows, where the instance acts just like a regular Windows machine sleeping or hibernating.
>Also, there should be a way to force stop an instance that is already stopping.
There is. I believe I referred to it in my initial response. It's a flag you can set in the API/SDK/CLI/web console when you select "terminate" or "stop". If the stop/terminate command doesn't execute in a timely manner, you can call the same thing again with a "force" flag and tell the control plane to forcefully terminate, which marks the instance as terminated and asynchronously tries to rectify state when the hypervisor can execute commands. The control plane updates the state (though sometimes it can get stuck and require remediation by someone with operator-level access), is notified that you don't care about data integrity/orderly shutdown, and (once it has updated its own state, regardless of the state of the data plane) marks the instance as "stopped" or "terminated". Then you can either start again, which should kick you over to a different host (there are some exceptions), or, if you terminated, launch a new instance and attach the EBS volume (if you chose not to delete it on termination) to retrieve the data or keep using it.
Almost all of that information is actually in the public docs; there was only a little bit about how the backend operates that I added for color. There are hundreds of programs that run to make sure the hypervisor and control plane are both in sync and able to manage resources, and if just a few of them hang, can't communicate, or the system runs out of resources (more of a problem on older, non-Nitro hosts, which are a completely different architecture with completely different resource allocations), then the system can become only partially functional: enough so that remediation automation won't or can't step in, because other guests appear to be functioning normally. There are many different failure modes of varying degrees of "unhealthy", and many of them are undetectable or need manual remediation, but they are statistically rare, and by and large most hosts operate normally.

On a normally operating host, forcing a shutdown/terminate works just fine and is fast. Even when some of the programs managing the host are not functioning properly, launch/terminate/stop/start/attach/detach all tend to keep working (along with the "force" on detach, terminate, and stop), even if one or two functions of the host are broken. It's also possible (and has happened several times) that a particular resource vector is not functioning properly while the rest of the host is fine; in that case, the particular vector can be isolated and the rest of the host keeps working. It's these tiny edge cases, occurring maybe 0.5% of the time, that cause things to move slower, and at scale a normal host with a normal BMC would have the same issues. I've had to clear stuck BMCs on those hosts, and I've dealt with completely dead BMCs. When that happens, if there's also a host problem, remediation can't go in and remedy host-level problems, which can lead to those control-plane delays as well as the need to call a "force".
Conclusion: it may SEEM like it should be super easy, but there are about a million different moving parts at cloud vendors and it's not as simple as killing it with fire and vengeance (i.e. a QEMU guest kill). BMCs and hypervisors do have an instant kill switch (and guest kill is used on the hypervisor, as is a BMC power-off, in the right remediation circumstances), but you're assuming those things always work. BMCs fail. BMCs get stuck. You likely haven't had the issue because you're not dealing with enough scale; I've had to reset BMCs manually more times than I can count and I've also dealt with more than my fair share of dead ones. So "power off immediately" does not always work, which means a disconnect occurs between the control plane and the data plane. There are also delays in the remediation actions automation takes, to give things enough time to respond to the given commands, which adds additional wait time.
I understand that this complexity exists. But in my experience with Google Compute, this isn’t a 1%-of-the-time problem with something getting stuck. It’s a “GCP lacks the capability” issue. Here’s the API:
yeah, AWS rarely has significant capacity issues. While the capacity utilization typically sits around 90% across the board, they're constantly landing new capacity, recovering broken capacity, and working to fix issues that cause things to get stuck (and lots of alarms and monitoring).
I worked there for just shy of 7 years and dealt with capacity tangentially (knew a good chunk of their team for a while and had to interact with them frequently) across both teams I worked on (support and then inside the EC2 org).
Capacity management, while their methodologies for expanding capacity were in my opinion antiquated and unenlightened for a long time, was still rather effective; I'm pretty sure that's why they never updated their algorithm for increasing capacity to be more JIT. They have a LOT more flexibility in capacity now that they have resource vectoring, because you no longer have hosts fixed to one instance size for the entire host (homogeneous). You can now fit everything together like Lego as long as it's the same family (i.e. c4 with c4, m4 with m4, etc.), and there was additional work, already in use, to allow cross-family resource vectoring as well.
Resource vectors took a LONG time for them to get in place and when they did, capacity problems basically went away.
The old way of doing it was if you wanted to have more capacity for, say, c4.xlarge, you'd either have to drop new capacity and build it out to where the entire host had ONLY c4.xlarge OR you would have to rebuild excess capacity within the c4 family in that zone (or even down to the datacenter-level) to be specifically built-out as c4.xlarge.
Resource vectors changed all that. DRAMATICALLY. Also, reconfiguring a host's recipe now takes minutes, rather than rebuilding the host and needing hours. So capacity is infinitely more fungible than it was when I started there.
Also, I think resource vectoring came on the scene around 2019 or so. I don't think it was there in 2018 when I went to work for EC2, but it was there for a few years before I quit, and I think it was in use before the pandemic, so 2019 sounds about right.
Prior to that, though, capacity was a much more serious issue and much more constrained on certain instance types.
I always said if you want to create real chaos, don't write malware. Get on the inside of a security product like this, and push out a bad update, and you can take most of the world down.
> Most of our nodes are boot looping with blue screens which in the cloud is not something you can just hit F8 and remove the driver.
It took a bit to figure out with some customers, but we provide optional VNC access to instances at OCI, and with VNC the trick seems to be to hit esc and then F8, at the right stage in the boot process. Timing seems to be the devil in the details there, though. Getting that timing right is frustrating. People seem to be developing a knack for it though.
That would make sense but it appears everyone is doing EBS snapshots in our regions like mad so they aren't restoring. Spoke to our AWS account manager (we are a big big big org) and they have contention issues everywhere.
I really want our cages, C7000's and VMware back at this point.
I'm betting I have a good idea of one of the possible orgs you work for, since I used to work specifically with the largest 100 customers during my ~3yr stint in premium support.
Netflix isn't really that big. Two organizations ago our reverse proxy used 40k cores; Netflix's is less than 5k. Of course, that could just mean our nginx extensions are 8 times crappier than Netflix's.
Honest question, I've seen comments in these various threads about people having similar issues (from a few months/weeks back) with kernel extension based deployments of CrowdStrike on Debian/Ubuntu systems.
I haven't seen anything similar regarding macOS, which no longer allows kernel extensions.
Is macOS not impacted by these kinds of issues with CrowdStrike's product, or have we just not heard about it due to the small scale?
Personally, I think it's a shared responsibility issue. MS should build a product that is "open to extension but closed for modification".
> they pissed over everyone's staging and rules and just pushed this to production.
I am guessing that act alone is going to create a massive liability for CrowdStrike over this issue. You've made other comments that your organization is actively removing CrowdStrike. I'm curious how this plays out. Did CrowdStrike just SolarWind themselves? Will we see their CISO/CTO/CEO do time? This is just the first part of this saga.
The issue is where it is integrated. You could arguably implement CrowdStrike in BPF on Linux. On NT they literally hook NT syscalls in the kernel from a driver they inject into kernel space which is much bad juju. As for macOS, you have no access to the kernel.
There is no shared responsibility. CrowdStrike pushed a broken driver out, then triggered the breakage, overriding customer requirement and configuration for staging. It is a faulty product with no viable security controls or testing.
Yep, it's extremely lame that CS has been pushing the "Windows" narrative to frame it as a Windows issue in the press, so everyone will just default blame Microsoft (which everyone knows) and not Crowdstrike (which only IT/cybersec people are familiar with).
And then you get midwits who blame Microsoft for allowing kernel access in the first place. Yes Apple deprecated kexts on macOS; that's a hell of a lot easier to do when you control the entire hardware ecosystem. Go ahead and switch to Apple then. If you want to build your own machines or pick your hardware vendor, guess what, people are going to need to write drivers, and they are probably going to want kernel mode, and the endpoint security people like CrowdStrike will want to get in there too because the threat is there.
There's no way for Microsoft or Linux for that matter to turn on a dime and deny kernel access to all the thousands upon thousands of drivers and system software running on billions of machines in billions of potential configurations. That requires completely reworking the system architecture.
This midwit spent the day creating value for my customers instead of spinning in my chair creating value for my cardiologist.
Microsoft could provide adequate system facilities so that customers can purchase products that do the job without having the ability to crash the system this way. They choose not to make those investments. Their customers pay the price by choosing Microsoft. It's a shared responsibility between the parties involved, including the customers that selected this solution.
We all make bad decisions like this, but until customers start standing up for themselves with respect to Microsoft, they are going to continue to have these problems, and society is going to continue to pay the price all around.
We can and should do better as an industry. Making excuses for Microsoft and their customers doesn't get us there.
This midwit believes a half-decent operating system kernel would have a change-tracking system that can auto-roll back a change/update that impacts the boot process and causes a BSOD. We see this in Linux: multiple kernel boot options, failsafe modes, etc. It is trivial to code driver/.sys tracking at the kernel that can detect a failed boot and revert to the previous good config. A well-designed kernel would have rollback, just like SQL.
um.. don't have access to the kernel? what's with all the kexts then? [edit: just read 3rd parties don't get kexts on apple silicon. that's a step in the right direction, IMHO. I love to bitch about Mach/NeXTStep flaws, but happy to give them props when they do the right thing.]
Horrible for sure, not least because hackers now know that the channel file parser is fragile and perhaps exploitable. I haven't seen any significant discussion about follow-on attacks, it's all been about rolling back the config file rather than addressing the root cause, which is the shonky device driver.
pish! this isn't VM/SP! commodity OSes and hardware took over because customers didn't want to pay firms to staff people who grokked risk management. linux supplanted mature OSes because some dork implied even security bugs were shallow with all those billions of eyes. It's a weird world when MSFT does a security stand down in 2003 and in 2008 starts widening security holes because the new "secure" OS they wrote was a no-go for third parties who didn't want to pay $100 to hire someone who knew how to rub two primes together.
I miss my AS/400.
This might be a decent place to recount the experience I had when interviewing for office security architect in 2003. my background is mainframe VM system design and large system risk management modeling which I had been doing since the late 80s at IBM, DEC, then Digital Switch and Bell Canada. My resume was pretty decent at the time. I don't like Python and tell VP/Eng's they have a problem when they can't identify benefits from JIRA/SCRUM, so I don't get a lot of job offers these days. Just a crusty greybeard bitching...
But anyway... so I'm up in Redmond and I have a decent couple of interviews with people, and then the 3rd most senior dev in all of MSFT comes in and asks "how's your QA skills?" and I start to answer about how QA and safety/security/risk management are different things: QA is about ensuring the code does what it's supposed to; software security et al. is about making sure the code doesn't do what it's not supposed to, and the philosophic sticky wicket you enter when trying to prove a negative (worth a Google deep dive if you're unfamiliar). Dude cuts me off and says "meh. security is stupid. in a month, Bill will end this stupid security stand down and we'll get back to writing code and I need to put you somewhere and I figured QA is the right place."
When I hear that MSFT has systems that expose inadequate risk management abstractions, I think of the culture that promoted that guy to his senior position... I'm sure he was a capable engineer, but the culture in Redmond discounts the business benefits of risk management (to the point they outsource critical system infrastructure to third parties) because senior engineers don't want to be bothered to learn new tricks.
Culture eats strategy for breakfast, and MSFT has been fed on a cultural diet of junk food for almost half a century. At least from the perspective of doing business in the modern world.
> ”This is not a windows issue. This is a third party security vendor shitting in the kernel.“
Sure, but Windows shares some portion of the blame for allowing third-party security vendors to “shit in the kernel”.
Compare to macOS which has banned third-party kernel extensions on Apple Silicon. Things that once ran as kernel extensions, including CrowdStrike, now run in userspace as “system extensions”.
Back in 2006, Microsoft agreed to allow kernel-level access for security companies due to an EU antitrust investigation. They were being sued by antivirus companies because they were blocking kernel access in the soon-to-be-released Vista.
Yes... in the same sense that if a user bricks their own system by deleting system32 then Windows shares some small sliver of the blame. In other words, not much.
Why should Windows let users delete system32? If they don't make it impossible to do so accidentally (or even maliciously), then I would indeed blame Windows.
On macOS you can't delete or modify critical system files without both a root password and enough knowledge to disable multiple layers of hardware-enforced system integrity protection.
the difference is you can get most of the functionality you want without deleting system32, but if you want the super secure version of NT, you have to let idiots push untested code to your box.
Linux, Solaris, BSD and macOS aren't without their flaws, but MSFT could have done a much better job with system design.
...but still, if the userspace process is broken, macOS will fail as well. Maybe it's a bit easier to recover, but any broken process with non-trivial privileges can disrupt the whole system.
It's certainly not supposed to work like that. In the kernel, a crash brings down the entire system by design. But in userspace, failed services can be restarted and continued without affecting other services.
If a failure in a userspace service can crash the entire system, that's a bug.
It's kind of inevitable that a security system can crash the system. It just needs to claim that one essential binary is infected with malware, and the system won't run.
I'm a reporter with Bloomberg News covering cybersecurity. I'm trying to learn more about this Crowdstrike update potentially bypassing staging rules and would love to hear about your experience. Would you be open to a conversation?
Before reaching the "pushed out to every client without authorization" stage, a kernel driver/module should have been tested. Tested by Microsoft, not by "a third party security vendor shitting in the kernel" that some criminally negligent manager decided to trust.
Congratulations on actually fixing the root cause, as opposed to hand wringing and hoping they don't break you again. I'm expecting "oh noes, better keep it on anyway to be safe" to be the popular choice.
yeah, I agree. I think most places will at least keep it until the existing contract comes time for renegotiation and most will probably keep using cs.
It's far easier for IT departments to just keep using it than it is to switch and managers will complain about "the cost of migrating" and "the time to evaluate and test a new solution" or "other products don't have feature X that we need" (even when they don't need that feature, but THINK they do).
It's a shitty C++ hack job within CrowdStrike with a null pointer. Because the software has root access, Windows shuts it down as a security precaution. A simple unit test would have caught this, or any number of tools that look for null pointers in C++, not even full QA. It's unbelievable incompetence.
Took down our entire emergency department as we were treating a heart attack. 911 was down for our state too. Nowhere for people to be diverted to because the other nearby hospitals were down. Hard to imagine how many millions if not billions of dollars this one bad update caused.
Yup - my mom went into the ER for stroke symptoms last night and was put under MRI. The MRI imaging could NOT be sent to the off-site radiologist and they had to come in -- turned out the MRI outputs weren't working at all.
We were discharged at midnight by the doctor, the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.
A relative of mine had back surgery late yesterday. Today the hospital nursing staff couldn’t proceed with the pain medication process for patients recovering from surgery because they didn’t have access to the hospital systems.
Hope she's okay. For better or worse, our entire emergency department flow is orchestrated around Epic. If we can't even see the board, nurses don't know what orders to perform, etc.
If it’s so critical that nurses are left standing around clueless then if it goes down entire teams of people should be going to prison for manslaughter.
Or, we could build robust systems that can tolerate indefinite down time. Might cost more, might need more staff.
Pick one. I’ll always pick the one that saves human lives when systems go down.
Okay but that will affect hospital profits and our PE firms bought these hospitals specifically to wrench all redundancy out of these systems in the name of efficiency (higher margins and thus profit) so that just won't do.
Private equity people need to start getting multiple life sentences for fucking around with shit like this. It's unironically a national security issue.
Doctors can't even own hospitals now. Doctor-owned hospitals were banned with the passage of Obamacare in order to placate big hospital systems concerned about the growing competition.
Another way to look at it is that you can have more hospitals using lower-cost systems, thus saving more lives compared to only a few hospitals using an expensive system.
I hope no one uses such single-point-of-failure systems anymore, especially CS. The same is applicable to Cloudflare as well! But at least in their case the systems would still be functioning standalone and accessible; it could only cause a net-wide outage (i.e., if the CF infra goes down).
Anyways, who knows what is going to happen with such widespread vendor dependency?
The world gets reminded about supply chain attacks every year, which is a good (but scary) reminder that definitely needs some deep thinking...
> We were discharged at midnight by the doctor, the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.
That's an extra 4 hours of emergency room fees you ideally wouldn't have to pay for.
The system crashed while my coworker was running a code (aka doing CPR) in the ER last night. Healthcare IT is so bad at baseline that we are somewhat prepared for an outage while resuscitating a critical patient.
The second largest hospital group in Nashville experienced a ransomware attack about two months ago. Nurses told me they were using manual processes for three weeks.
It takes a certain type of criminal a55hole to attack hospitals and blackmail them. I would easily support a life sentence or the death penalty for anyone attempting this cr@p.
Yes. And I was told by multiple nurses at St. Thomas Midtown that the hospital did not have manual procedures already in place. In their press release they refer to their hospitals as "ministries" [0], so apparently they practice faith-based cyber security (as in "we believe that we don't need backups") since it took over 3 weeks to recover.
As a paramedic, there is very little about running a code that requires IT. You have the crash cart, so you're not even stuck trying to get meds out of the Pyxis. The biggest challenge is charting/scribing the encounter.
I used to work in healthcare IT. Running a code is not always only CPR.
Different medications may be pushed (injected into the patient) to help stabilize them. These medications are recorded via a barcode and added to the patient's chart in Epic. Epic is the source of truth for the current state of the patient, so if that is suddenly unavailable, that is a big problem.
Okay, not having historical data available to make a decision about what to put into a patient is understandable (but maybe also print the critical stuff per patient once a day?), but not being able to log an action in real time should not be a critical problem.
It is a critical problem if your entire record of life-saving drugs you've given them in the past 24 hours suddenly goes down. You have to start relying on people's memories, and it's made worse by shift turn-overs so the relevant information may not even be reachable once the previous shift has gone home.
There are plenty of drugs that can only be given in certain quantities over a certain period of time, and if you go beyond that, it makes the patient worse not better. Similarly there are plenty of bad drug interactions where whether you take a given course of action now is directly dependent on which drugs that patient has already been given. And of course you need to monitor the patient's progress over time to know if the treatments have been working and how to adjust them, so if you suddenly lose the record of all dosages given and all records of their vital signs, you've lost all the information you need to treat them well. Imagine being dropped off in the middle of nowhere, randomly, without a GPS.
That's why there's a sharpie in the first aid kit. If you're out of stuff to write on you can just write on the patient.
More seriously, we need better purpose-built medical computing equipment that runs on its own OS and only has outbound network connectivity for updating other systems.
I also think of things like the old-school "checklist boards" that used to be literally built into the yoke of the airplane they were made for.
I’m afraid the profitability calculation shifted it in favor of off-the-shelf OS a long time ago. I agree with you, though, that a general purpose OS has way too much crap that isn’t needed in a situation like this.
> It is a critical problem if your entire record of life-saving drugs you've given them in the past 24 hours suddenly goes down.
Will outages like this motivate a backup paper process? The automated process should save enough information on paper so that a switchover to a paper process at any time is feasible. Similar to elections.
Maybe if all the profit seeking entities were removed from healthcare that money could instead go to the development of useful offline systems.
Maybe a handheld device for scanning in drugs or entering procedure information that stores the data locally which can then be synced with a larger device with more storage somewhere that is also 100% local and immutable which then can sync to online systems if that is needed.
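Something along these lines is easy to sketch; everything below (the table layout, the sync endpoint) is invented for illustration and has nothing to do with how Epic or any real device actually works:

    import json
    import sqlite3
    import urllib.request

    db = sqlite3.connect("medlog.db")
    db.execute("""CREATE TABLE IF NOT EXISTS med_events (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        synced INTEGER NOT NULL DEFAULT 0)""")

    def record(event: dict) -> None:
        # Always succeeds locally, even with the network or the EMR down.
        db.execute("INSERT INTO med_events (payload) VALUES (?)", (json.dumps(event),))
        db.commit()

    def sync(url: str) -> None:
        # Best-effort push of unsynced rows once connectivity returns.
        rows = list(db.execute("SELECT id, payload FROM med_events WHERE synced = 0"))
        for row_id, payload in rows:
            req = urllib.request.Request(url, data=payload.encode(),
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req, timeout=5)
            db.execute("UPDATE med_events SET synced = 1 WHERE id = ?", (row_id,))
            db.commit()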
A good system is resilient. A paper process could take over when the system is down. From my understanding, healthcare systems undergo recurrent outages for various reasons.
Many places did revert back to paper processes. But it's a disaster model that has to be tested to make sure everyone can still function when your EMR goes down. Situations like this just reinforce that you can't plan for if IT systems go down, only for when they go down.
My experience with internet outages affecting retail is that the ability to rapidly and accurately calculate bill totals and change is not practiced much anymore. Not helped by things like 9.075% tax rates, to be sure.
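Even the software side needs more care than people assume; a toy example with Python's decimal module (the prices and tendered amount are made up):

    from decimal import Decimal, ROUND_HALF_UP

    TAX_RATE = Decimal("0.09075")                 # the 9.075% example rate
    items = [Decimal("4.99"), Decimal("12.50"), Decimal("0.89")]

    subtotal = sum(items)                                                 # 18.38
    tax = (subtotal * TAX_RATE).quantize(Decimal("0.01"), ROUND_HALF_UP)  # 1.67
    total = subtotal + tax                                                # 20.05
    change = Decimal("25.00") - total                                     # 4.95 back from $25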
Real paper is probably as much about breaking from the "IT culture" as it's about the physical properties. E-ink display would probably help with power outage, but happily display BSOD in an incident like this.
Honestly if you were designing a system to be resilient to events like this one, the focus would be on distributed data and local communication. The exact sort of things that have become basically dirty words in this SaaS future we are in. Every PC in the building, including the ones tethered to equipment, is presently basically a dumb terminal, dependent on cloud servers like Epic, meaning WAN connection is a single point of failure (I assume that a hospital hopefully has a credible backup ISP though?) and same for the Epic servers.
If medical data were synced to the cloud but also stored on the endpoint devices and local servers, you’d have more redundancy. Obviously much more complexity to it but that’s what it would take. Epic as single source of truth means everyone is screwed when it is down. This is the trade off that’s been made.
> synced to the cloud but also stored on the endpoint devices and local servers
That's a recipe for a different kind of disaster. I actually used Google Keep some years ago for medical data at home — counted pills nightly, so mom could either ask me or check on her phone if she forgot to take one. Most of the time it worked fine, but the failure modes were fascinating. When it suddenly showed data from half a year ago, I gave up and switched to paper.
I don't think historical data is required to make a decision; it is required to store the action for historical purposes. This is ultimately to bill you, to verify that a doctor isn't stealing medication or improperly treating the patient, and to keep a record for legal purposes.
Some hospitals require you to input this in order to even get physical access to the medications.
Although a crash cart would normally have common things necessary to save someone in an emergency, so I would think that if someone was truly dying they could get them what they needed. But of course there are going to be exceptions and a system being down will only make the process harder.
Of course the real backup plan should be designed based on the actual needs, perhaps the whole system needs an "offline mode" switch. I assume they already run things locally, in case the big cable seeker machine arrives in the neighborhood.
Most printers in these facilities run standalone on an embedded Linux variant. They can actually host whole folders of data for reproduction "offline"; all scan/print/fax multifunction machines can generally do that these days. If the on-site IT is good, though, the USB ports and storage on the devices should be locked down.
Oh yes. This would be a contingency measure, just to keep the record in a human readable form while requiring little manual labor. Printed codes could be scanned later into Epic and, if you need to transfer the patient, tear the paper and send it with them.
It is not necessarily crowdstrike's responsibility, but it should be someone's.
If I go to Home Depot to buy rope for belaying at my rock climbing center and someone falls, breaks the rope and dies, then I am on the hook for manslaughter.
Not the rope manufacturer, who clearly labeled the packaging with "do not use in situations where safety can be endangered". Not the retailer, who left it in the packaging with the warning, and made no claim that it was suitable for a climbing safety line. But me, who used a product in a situation where it was unsuitable.
If I instead go to Sterling Rope and the same thing happens, fault is much more complicated, but if someone there was sufficiently negligent they could be liable for manslaughter.
In practice, to convict of manslaughter, you would need to show an individual was negligent. However, our entire industry is bad at our job, so no individual involved failed to perform their duties to a "reasonable" standard.
Software engineering is going to follow the path that all other disciplines of meatspace engineering did. We are going to kill a lot of people; and every so often, enough people will die that we add some basic rules for safety-critical software, until eventually this type of failure occurring without gross negligence becomes nearly unthinkable.
It's on whoever runs the hospital's computer systems; allowing a ring 0 kernel driver to update ad hoc from the internet is just sheer negligence.
Then again, the management that put this in are probably also the same idiots that insist on a 7-day lead-time CAB process to update a typo on a brochureware website "because risk".
This patient is dead. They would not have been if the computer system was up. It was down because of CrowdStrike. CrowdStrike had a duty of care to ensure they didn't fuck over their client's systems.
I'm not even beyond two degrees of separation here. I don't think a court will have trouble navigating it.
If that really were how it worked, I don’t think that software would really exist at all. Open Source would probably be the first to disappear too — who would contribute to, say, Linux, if you could go to jail for a pull request you made because it turns out they were using it in a life or death situation and your code had a bug in it. That checks all the same boxes that your scenario does: someone is dead, they wouldn’t be if you didn’t have a bug in your code.
Now, a tort is less of a stretch than a crime, but thank goodness I’m not a lawyer so I don’t have to figure out what circumstances apply and how much liability the TOS and EULAs are able to wash away.
When I read something like this that has such a confident tone while being incredibly incorrect all I can do is shake my head and try to remember I was young once and thought I knew it all as well.
I don't think you understand the scale of this problem. Computers were not up to print from. Our Epic cluster was down for placing and receiving orders. Our lab was down and unable to process bloodwork - should we bring out the mortar and pestle and start doing medicine the old fashioned way? Should we be charged with "criminal negligence" for not having a jar of leeches on hand for when all else fails?
I was advocating for a paper fallback. That means that WHILE the computers are running, you must create a paper record, e.g. "medication x administered at time y", etc., hence the receipt printers, which are cheap and low-dependency.
The grandparent indicated that the problem was that when all the computers went down, they couldn't look up what had already been done for the patient. I suggested a simple solution for that: receipt printers.
After the computers fail, you tape the receipt to the wall and fall back to pen and paper until the computers come back up.
I completely understand the scale of the outage today. I am saying that it was a stupid decision and possibly criminally negligent to make a life critical process dependent on the availability of a distributed IT application not specifically designed for life critical availability. I strongly stand by that POV.
> I suggested a simple solution for that - receipt printers.
Just so I understand what you are saying: you are proposing that we constantly drown our hospital rooms in paper receipts, on the off chance the computers go down very rarely?
Do you see any possible drawbacks with your proposed solution?
> possibly criminally negligent to make a life critical process dependent on the availability of a distributed IT application
What process is not “life critical” in a hospital? Do you suggest that we don’t use IT at all?
Modern medicine requires computers. You literally cannot provide medical care in a critical care setting with the sophistication and speed required for modern critical care without electronic medical records. Fall back to paper? Ok, but you fall back to 1960s medicine, too.
Why would you ever need to move a patient from one hospital room containing one set of airgapped computers into another, containing another set of airgapped computers?
Why would you ever need to get information about a patient (a chart, a prescription, a scan, a bill, an X-Ray) to a person who is not physically present in the same room (or in the same building) as the patient?
Local area networks air gapped from the internet don't need to be air gapped from each other. You could have nodes in each network responsible for transmitting specific data to the other networks.. like, all the healthcare data you need. All other traffic, including windows updates? Blocked. Using IP still a risk? Use something else. As long as you can get bytes across a wire, you can still share data over long distances.
In my eyes, there is a technical solution there that keeps friction low for hospital staff: network stuff, on an internet, but not The Internet...
Edit: I've since been reading the other many many comment threads on this HN post which show the reasons why so much stuff in healthcare is connected to each other via good old internet, and I can see there's way more nuance and technicality I am not privy to which makes "just connect LANs together!" less useful. I wasn't appreciating just how much of medicine is telemedicine.
I think wiring computers within the hospital over LAN, and adding a human to the loop for inter-hospital communication seems like a reasonable compromise.
Yes there will be some pain, but the alternative is what we have right now.
> nobody wants to do it.
Tough luck. There's lots of things I don't want to do.
A hospital my wife worked at over a decade ago didn't use EMR's, it was all on paper. Each patient had a binder. Per stay. And for many of them it rolled into another binder. (This was neuro-ICU so generally lengthy patient stays with lots of activity, but not super-unusual or Dr House stuff, every major city in America will have 2-3 different hospitals with that level of care.)
But they switched over to EMR because the advantages of Pyxis[1] in getting the right medications to the right patients at the right time- and documenting all of that- are so large that for patient safety reasons alone it wins out over paper. You can fall back to paper, it's just a giant pain in the ass to do it, and then you have to do the data entry to get it all back into EMR's. Like my wife, who was working last night when everyone else in her department got Crowdstrike'd, she created a document to track what she did so it could be transferred into EMR's once everything comes back up. And the document was over 70 pages long! Just for one employee for one shift.
1: Workflow: Doctor writes prescription in EMR. Pharmacist reviews charts in EMR, approves prescription. Nurse comes to Pyxis cabinet and scans patient barcode. Correct drawer opens in cabinet so the proper medication- and only the proper medication- is immediately available to nurse (technicians restock cabinet when necessary). Nurse takes medication to patient's room, scans patient barcode and medication barcode, administers drug. This system has dramatically lowered the rates of wrong-drug administration, because the computers are watching over things and catch humans getting confused on whether this medication is supposed to go to room 12 or room 21 in hour 11 of their shift. It is a great thing that has made hospitals safer. But it requires a huge amount of computers and networks to support.
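A toy version of that barcode cross-check, just to make the logic concrete; the data structures and field names are invented and have nothing to do with Epic's or Pyxis's actual interfaces:

    from datetime import datetime, timedelta

    # (patient barcode, medication barcode) -> active order; schema is invented
    orders = {
        ("PT-0012", "MED-metoprolol-25mg"): {"route": "PO", "min_interval": timedelta(hours=6)},
    }
    last_given = {}

    def verify_scan(patient: str, med: str, now: datetime) -> bool:
        order = orders.get((patient, med))
        if order is None:
            return False                      # not prescribed for this patient
        prev = last_given.get((patient, med))
        if prev is not None and now - prev < order["min_interval"]:
            return False                      # too soon since the last dose
        last_given[(patient, med)] = now      # record the administration time
        return True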
Why would a Pyxis cabinet run Windows? I realize Windows isn't even necessarily at fault here, but why on earth would such a device run Windows? Is the 90s form of mass incompetence in the industry still a thing where lots of stuff is written for Windows for no reason?
I don't know what Pyxis runs on, my wife is the pharmacist and she doesn't recognize UI package differences with the same practiced eye that I do. And she didn't mention problems with the Pyxis. Just problems with some of their servers and lots of end user machines. So I don't know that they do.
For relying on windows to run this kind of stuff and not doing any kind of staged rollout but just blindly applying untested kernel driver 3rd party patching fleet wide? yeah honestly. We had safer rollouts for cat videos than y'all seem to have for life critical systems. Maybe some criminal liability would make y'all care about reliability a bit more.
Staged rollout in the traditional sense wouldn't have helped here because the skanky kernel driver worked under all test conditions. It just didn't work when it got fed bad data. This could have been mitigated by staging the data propagation, or by fully testing the driver with bad data (unlikely to ever have been done by any commercial organization). Perhaps some static analysis tool could have found the potential to crash (or the isomorphic "safe language" that doesn't yet exist for NT kernel drivers).
A QR code can store about 3 KB of data. Every patient gets a small QR sticker printer on their bed. Whenever Epic updates, print a new small QR sticker. When a patient is being moved, tear off the sticker and stick it to their wrist tag.
This much of the patient's state will be carried on their wrist. Maybe for complex cases you need two stickers. You'd have to be judicious in encoding the data, maybe just the last 48 hours.
Handheld QR readers, offline, that read and display the QR data strings.
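A back-of-the-envelope sketch with the Python qrcode library; the record layout is invented, and real patient data would obviously need far more thought (and encryption) than this:

    import json
    import qrcode  # third-party "qrcode" package

    snapshot = {                      # hypothetical last-48h summary for one patient
        "mrn": "00123456",
        "allergies": ["penicillin"],
        "meds_last_48h": [
            {"drug": "heparin", "dose": "5000 units SC", "time": "2024-07-19T06:30Z"},
            {"drug": "metoprolol", "dose": "25 mg PO", "time": "2024-07-19T08:00Z"},
        ],
    }

    payload = json.dumps(snapshot, separators=(",", ":"))  # compact JSON, stays well under ~3 KB
    qrcode.make(payload).save("bedside-sticker.png")       # print it and stick it on the wrist tag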
You need to document everything during a code arrest. All interventions, vitals and other pertinent information must be logged for various reasons. Paper and pen work but they are very difficult to audit and/or keep track of. Electronic reporting is the standard and deviating from the standard is generally a recipe for a myriad of problems.
We chart all codes on paper first and then transfer to computer when it's done. There's a nurse whose entire job is to stay in one place and document times while the rest of us work. You don't make the documenter do anything else because it's a lot of work.
And that's in the OR, where vitals are automatically captured. There just aren't enough computers to do real-time electronic documentation, and even if there were there wouldn't be enough space.
I chart codes on my EPCR, in the PT's house, almost everyday with one hand. Not joking about the one hand either.
It's easier, faster, and more accurate than writing, in my experience. We have a page solely dedicated to codes and the most common interventions. Got IO? I press a button and it's documented with a timestamp. Pushing EPI? Button press with timestamp. Dropping an I-Gel or intubating? Button press... you get the idea.
The details of the interventions can be documented later along with the narrative, but the bulk of the work was captured real-time. We can also sync with our monitors and show depth of compressions, rate of compressions and rhythms associated with the continuous chest compression style CPR we do for my agency.
Going back to paper for codes would be ludicrous for my department. The data would be shit for a start: handwriting is often shit and made worse under the stress of screaming bystanders. And depending on whether we achieved ROSC or not, the likelihood of losing the paper in the shuffle goes up.
The idea is to have the current system create a backup paper trail, and to practice resuming from it for when the computers go down. Nothing about your current process needs to change, only that you be familiar with falling back to the paper backups when the computers are down.
Which means that you have to be operating papered before the system goes down. If you aren't, the system never gets to transition because it just got CrowdStruck.
Correct. We use paper receipts for shopping and paper ballots for voting. Automation is fast and efficient, but there must be a manual fallback when power fails or automation is unreliable.
This wisdom is echoed in some religious practices that avoid complete reliance on modern technology.
You can do CPR without a computer system, but changing systems in the middle of resuscitation where a delay of seconds can mean the difference between survival and death is absolutely not ideal. CPR in the hospital is a coordinated team response and if one person can’t do their job without a computer then the whole thing breaks down.
If you're so close to death that you're depending on a few seconds give or take, you're in God's hands. I would not blame or credit anyone or any system for the outcome, either way.
Judgement is always part of the process, but yeah running a routine code is pretty easy to train for. It's one of the easiest procedures in medicine. There are a small number of things that can go wrong that cause quick death, and for each a small number of ways to fix them. You can learn all that in a 150 hour EMT class.
Hello, I'm a journalist looking to reach people impacted by the outage and wondering if you could kindly connect with your ER colleague. My email is sarah.needleman@wsj.com. Thanks!
I mean if they're finding sources through the comment and then corroborating their stories via actual interviews, it's completely fine practice. As long as what's printed is corroborated and cross-referenced I don't see a problem.
If they go and publish "According to hackernews user davycro ..." _then_ there's a problem.
> Took down our entire emergency department as we were treating a heart attack.
It makes my blood boil to be honest that there is no liability for what software has become. It's just not acceptable.
Companies that produce software with the level of access that Crowdstrike has (for all effective purposes a remote root exploit vector) must be liable for the damages that this access can cause.
This would radically change how much attention they pay to quality control. Today they can just YOLO-push barely tested code that bricks large parts of the economy and face no consequences. (Oh, I'm sure there will be some congress testimony and associated circus, but they will not ever pay for the damages they caused today.)
If a person caused the level and quantity of damage Crowdstrike caused today they would be in jail for life. But a company like Crowdstrike will merrily go on doing more damage without paying any consequence.
What about companies that deploy software with the level of quality that Crowdstrike has? Or Microsoft 365 for that matter.
That seems to be the bigger issue here; after all Crowdstrike probably says it is not suitable for any critical systems in their terms of use. You shouldn't be able to just decide to deploy anything not running away fast enough on critical infrastructure.
On the other hand, Crowdstrike Falcon Sensor might be totally suitable for a non-critical systems, say entertainment systems like the Xbox One.
CrowdStrike
https://www.crowdstrike.com › resources › infographics
Learn how CrowdStrike keeps your critical areas of risk such as endpoints, cloud workloads, data, and identity, safe and your business running
Wife is a nurse. They eventually got 2 computers working for her unit. I don't think it impacted patients already being treated, but they couldn't get surgeries scheduled and no charting was being done. Some of the other floors were in complete shambles.
Hi, as I noted to another commenter, I'm a journalist looking to speak with people who've been impacted by the outage. I'm wondering if I could speak with your wife. My email is sarah.needleman@wsj.com. Thanks.
Local emergency services were basically nonfunctioning for the better part of the day, and along with the heat wave and various events it seems like a number of deaths (locally at least, specific to what I know for my mid-sized US city) will be indirectly attributable to this.
It's entirely possible (likely, even) that someone died from this, but it's hard to know with critically ill patients whether they would have survived without the added delays.
We are in the process of calculating this but need this 24H period to roll over so we can benchmark the numbers against a similar 24H period. It's hard to tell if the numbers we get back will even be reliable, given that a lot of today's statistics, from what I can tell, have come back via email or similar.
Crowdstrike is on every machine in the hospital because hospitals and medical centers became a big target for ransomware a few years ago. This forced medical centers to get insured against loss of business and getting their data back. The insurance companies that insure companies against ransomware insist on putting host based security systems onto every machine or they won't cover losses. So Crowdstrike (or one of their competitors) has to run on every machine.
I wonder why we put software on every machine instead of relying on a good firewall and network separation.
Granted, you are still vulnerable to physical attacks (e.g. someone coming in with a USB stick), but I would say those are much more difficult, and if you also put firewalls between compartments of the internal network, more difficult still.
Also, I think the use of Windows in critical settings is not a good choice, and to me this was a demonstration of that. To those who say the same could have happened to Linux: yes, but you could have mitigated it. For example, a Linux system used in critical settings should have a read-only root filesystem, which you can't do on Windows. Then the worst case would be rebooting the machine to restore it.
A common attack vector is phishing, where someone clicks on an email link and gets compromised or supplies credentials on a spoofed login page. External firewalls cannot help you much there.
Segmenting your internal network is a good defence against lots of attacks, to limit the blast radius, but it's hard and expensive to do a lot of it in corporate environments.
Yup as you say, if you go for a state of the art firewall, then that firewall also becomes a point of failure. Unfortunately complex problems don't go away by saying the word "decentralize".
> I wonder if those same insurance policies are going to pay out due to the losses from this event?
They absolutely should be liable for the losses, in each case where they caused it.
(Which is most of them. Most companies install crowdstrike because their auditor want it and their insurance company says they must do whatever the auditor wants. Companies don't generally install crowdstrike out of their own desire.)
But of course they will not pay a single penny. Laws need to change for insurance companies, auditors and crowdstrike to be liable for all these damages. That will never happen.
Depends on what the policy (contract) says. But there's a good argument that your security vendor is inside the wall of trust at a business, and so not an external risk.
In a sense, it looks like these insurance company's policies work a little bit like regulation. Except that it's not monopolistic (different companies are free to have different rules), and when shit hits the fan, they actually have to put their money where their mouth is.
Despite this horrific outage, in the end it sounds like a much better and anti-fragile system than a government telling people how to do things.
A little bit, probably slightly better. But insurance companies don't want to eliminate risk (if they did that, no one would buy their product). They instead want to quantify, control and spread the risk by creating a risk pool. Good, competent regulation would be aimed at eliminating, as much as reasonably possible, the risk. Instead, insurance company audits are designed to eliminate the worst risk and put everyone into a similar risk bucket. After spending money on an insurance policy and passing an audit, why would a company spend even more money and effort? They have done "enough".
> The insurance companies that insure companies against ransomware insist on putting host based security systems onto every machine or they won't cover losses.
This is part of the problem too. These insurance/audit companies need to be made liable for the damage they themselves cause when they require insecure attack vectors (like Crowdstrike) to be installed on machines.
Crowdstrike and its ilk are basically malware. There have to be better anti-ransomware approaches, such as replicated, immutable logs for critical data.
2. Why would anyone trust a ransomware perpetrator to honor a deal to not reveal or exploit data upon receipt of a single ransom payment? Are organizations really going to let themselves be blackmailed for an indefinite period of time?
3. I'm unconvinced that crowdstrike will reliably prevent sensitive data exfiltration.
1. Double extortion is the norm, some groups don't even bother with the encryption part anymore, they just ask a ransom for not leaking the data
2. Apparently yes. Why do you think calls to ban payments exist?
3. At minimum it raises the bar for the hackers - sure, it's not like you can't bypass edr but it's much easier if you don't have to bypass it at all because it's not there
I agree edr is not a DLP solution, but edr is there to prevent* an attack getting to the point where staging the data exfil happens... In which case yes I would expect web/volumetric DLP kicks in as the next layer.
*Ok ok I know it's bypassable but one of the happy paths for an attack is to pivot to the machine that doesn't have edr and continue from there.
By "decentralized" I think you mean "doesn't auto-update with new definitions"?
I have worked at places which controlled the roll-out of new security updates (and Windows updates) for this very reason. If you invest enough in IT, it's possible. But you have to have a lot of money to invest in IT to have people good enough to manage it. If you can get SwiftOnSecurity to manage your network, you can have that. But can every hospital, doctor's office, pharmacy, scan center, etc. get top-tier talent like SwiftOnSecurity?
I used to work for a major retailer managing updates to over 6000 stores. We had no auto updates (all linux systems in stores) and every update went through our system.
When it came to audit time, the auditors were always impressed that our team had better timely updates than the corporate office side of things.
I never really thought we were doing anything all that special (in fact, there were always many things I wanted to improve about the process) but reading about this issue makes me think that maybe we really were just that much better than the average IT shop?
If, for example, they were doing slow rollouts for configs in addition to binaries, they could have caught the problem in their canary/test envs and not let it proceed to a full blackout.
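Roughly, the kind of gate being described could look like the sketch below (Python, purely illustrative): the config push goes through the same waves as a binary push, and a larger wave is only allowed to proceed if the previous one stays healthy. `deploy_to` and `healthy` are hypothetical stand-ins for whatever deployment and telemetry APIs an org actually has, and the cohort sizes are made up.

    import time

    # Hypothetical rollout waves: canary first, then progressively larger cohorts.
    WAVES = [
        ("canary", 0.001),   # a handful of internal/test machines first
        ("early",  0.01),
        ("broad",  0.10),
        ("all",    1.00),
    ]

    def staged_rollout(update, fleet, deploy_to, healthy, soak_minutes=30):
        """Push `update` wave by wave, halting if health checks regress."""
        done = set()
        for name, fraction in WAVES:
            wave = [h for h in fleet[:int(len(fleet) * fraction)] if h not in done]
            deploy_to(wave, update)
            done.update(wave)

            # Let the wave soak, then check for crash loops / error spikes
            # before a larger wave is allowed to proceed.
            time.sleep(soak_minutes * 60)
            if not healthy(wave):
                raise RuntimeError(f"halting rollout: wave '{name}' unhealthy")
        return done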
When I say decentralized, I mean security measures and updates taken locally at the facility. For example, MRI machines are local, and they get maintained and updated by specialists dispatched by the vendor (Siemens or GE)
Siemens or GE or whomever built the MRI machine aren't really experts in operating systems, so they just use one that everyone knows how to work, MS Windows. It's unfortunate that to do things necessary for modern medicine they need to be networked together with other computers (to feed the EMR's most importantly) but it is important in making things safer. And these machines are supposed to have 10-20 year lifespans (depending on the machine)! So now we have a computer sitting on the corporate network, attached to a 10 year old machine, and that is a major vulnerability if it isn't protected, patched, and updated. So is GE or Siemens going to send out a technician to every machine every month when the new Windows patch rolls out? If not, how long is the computer sitting on the network vulnerable?
Healthcare IT is very important, because computers are good at record-keeping, retrieval and storage, and that's a huge part of healthcare.
A large hospital takes in power from multiple feeds in case any one provider fails. It's amazing that we're even thinking in terms of "a security company" rather than "multiple security layers."
The fact that ransomware is still a concern is an indication that we've failed to update our IT management and design appropriately to account for them. We took the cheap way out and hoped a single vendor could just paper over the issue. Never in history has this ever worked.
Also speaking of generators a large enough hospital should be running power failure test events periodically. Why isn't a "massive IT failure test event" ever part of the schedule? Probably because they know they have no reasonable options and any scale of catastrophe would be too disastrous to even think about testing.
It's a lesson on the failures of monoculture. We've taken the 1970s design as far as it can go. We need a more organically inspired and rigorous approach to systems building now.
This. The 1970s design of the operating system and the few companies that deliver us the monoculture are simply not adequate or robust given the world of today.
> Hard to imagine how many millions of not billions of dollars this one bad update caused.
And even worse, possibly quite a few deaths as well.
I hope (although I will not be holding my breath) that this is the wake-up call we need to realise that we cannot have so much of our critical infrastructure rely on the bloated OS of a company known for its buggy, privacy-intruding, crapware-riddled software.
I'm old enough to remember the infamous blue-screen-of-death Windows 98 presentation. Bugs exist, but that was hardly a glowing endorsement of high-quality software. That was long ago, yet it is nigh on impossible to believe that the internal company culture has drastically improved since then, with regular high-profile screw-ups reminding us of what is hiding under the thin veneer of corporate respectability.
Our emergency systems don't need windows, our telephone systems don't need windows, our flight management systems don't need windows, our shop equipment systems don't need windows, our HVAC systems don't need windows, and the list goes on, and on, and on.
Specialized, high-quality OSes with low attack surfaces are what we need to run our systems. Not a generic OS stuffed with legacy code from a time when those applications were not even envisaged.
Keep it simple, stupid - KISS - is what we need to go back to; our lives literally depend on it.
With the multi-billion-dollar screw-up that happened yesterday, and an as-yet-unknown number of deaths, it's impossible to argue that the funds are unavailable to develop such systems. Plurality is what we need, built on top of strong standards for compatibility and interoperability.
OK, but this was a bug in an update of a kernel module that just happened to be deployed on Windows machines. How many OSs are there that can gracefully recover from an error in kernel space? If every machine that crashed had been running, say, Linux and the update had been coded equivalently, nothing would've changed.
Perhaps rather than an indictment on Windows, this is a call to re-evaluate microkernels, at least for critical systems and infrastructure.
What does this mean? Did the power go down? Is all the equipment connected? Or is it that the insurance software can't run so nothing gets done? Maybe you can't access patient files anymore, but is that taking down the whole thing?
Every computer entered a bluescreen loop. We are dependent on Epic for placing orders, for nursing staff to know what needs to be done, for viewing records, and for transmitting and interpreting images from the radiology machines. It's how we know the current state of the department and where each patient (out of 50+ people we are simultaneously treating) is at. Our equipment still works, but we're flying blind, having to shout orders at each other, and we have no way to send radiology images to other doctors for consultation.
Yeah in Radiology we depend on Epic and a remote reading service called VRAD. VRAD runs on AWS and went down just after 0130 hrs EST. Without Epic & VRAD we were pretty helpless.
Can't imagine how stressful this must have been for Radiology. I had two patients waiting on CT read with expectation to discharge if no acute findings. Had to let them know we had no clear estimate for when that would be, and might not even know when the read comes back if we can't access epic.
Have a family member in crit care who was getting a sepsis workup on a patient when this all happened. They somehow got plain film working offline after a bit of effort.
We have limited visibility into this in the emergency department. You stabilize the patient and admit them to the hospital, then they become internal medicine or ICU's patient. Thankfully most of the work was done and consults were called prior to the outage, but they were in critical condition.
I will say - the way we typically find out really sends a shiver down your spine.
You come in for you next shift and are finishing charting from your prior shift. You open one of your partially finished charts and a little popup tells you "you are editing the chart for a deceased patient".
i'll admit i have no idea what i'm talking about but aren't there some Plan B options? something that's more manual? or are surgeons too reliant on computers?
There are plan B options like paper charting, downtime procedures, alternative communication methods and so on. So while you can write down a prescription and cut a person open, you can't manually do things like pull up the patient's medical history for the last 10 years in a few seconds, have an image read remotely when there isn't a radiologist available on site, or electronically file for the meds to just show up instantly (all depending on what the outage issue is affecting of course). For short outages some of these problems are more "it caused a short rush on limited staff" than "things were falling apart". For longer outages it gets to be quite dangerous and that's where you hope it's just your system that's having issues and not everyone in the region so you can divert.
If the alternatives/plan b's were as good or better than the plan a's then they wouldn't be the alternatives. Nobody is going to have half a hospital's care capacity sit as backup when they could use that year round to better treat patients all the time, they just have plans of last resort to use when what they'd like to use isn't working.
(worked healthcare IT infrastructure for a decade)
> So while you can write down a prescription and cut a person open, you can't manually do things like pull up the patient's medical history for the last 10 years in a few seconds, have an image read remotely when there isn't a radiologist available on site, or electronically file for the meds to just show up instantly (all depending on what the outage issue is affecting of course).
I worked for a company that sold and managed medical radiology imaging systems. One of our customers' admins called and said "Hey, new scans aren't being properly processed so radiologists can't bring them up in the viewer". I told him I'd take a look at it right away.
A few minutes later, he called back; one of their ERs had a patient dying of a gunshot wound and the surgeon needed to get the xray up so he could see where the bullet was lodged before the guy bled out on the table.
Long outages are terrifying, but it only takes a few minutes for someone to die because people didn't have the information they needed to make the right calls.
Yep, when patients often still die while everything is working fine even a minor inconvenience like "all of the desktop icons reset by mistake" can be enough to tilt the needle the wrong way for someone.
I used to work for a company that provided network performance monitoring to hospitals. I am telling a story secondhand that I heard the CEO share.
One day, during a rapid pediatric patient intervention, a caregiver tried to log in to a PC to check a drug interaction. The computer took a long time to log in because of a VDI problem where someone had stored many images in a file that had to be copied on login. While the care team was waiting for the computer, an urgent decision was made to give the drug. But a drug interaction happened — one that would have been caught, had the VDI session initialized more quickly.
The patient died and the person whose VDI profile contained the images in the bad directory committed suicide. Two lives lost because files were in the wrong directory.
We can definitely get local imaging with X-Ray and ultrasound - we use bedside machines that can be used and interpreted quickly.
X-Ray has limitations though - most of our emergencies aren't as easy to diagnose as bullets or pneumonia. CT, CTA, and to a lesser extent MRI are really critical in the emergency department, and you definitely need four years of training to interpret them, and a computer to let you view the scan layer-by-layer. For many smaller hospitals they may not have radiology on-site and instead use a remote radiology service that handles multiple hospitals. It's hard to get doctors who want to live near or commute to more rural hospitals, so easier for a radiologist to remotely support several.
GP referred to "processed," which could mean a few things. I interpreted it to mean that the images were not recording correctly locally prior to any upload, and they needed assistance with that machine or the software on it.
Seems like a possible plan would be duplicate computer systems that are using last week's backup and not set to auto-update. Doesn't cover you if the databases and servers go down (unless you can have spares of those too), but if there is a bad update, a crypto-locker, or just a normal IT failure each department can switch to some backups and switch to a slightly stale computer instead of very stale paper.
We have "downtime" systems in place, basically an isolated Epic cluster, to prevent situations like this. The problem is that this wasn't a software update that was downloaded by our computers, it was a configuration change by Crowdstrike that was immediately picked up by all computers running its agent. And, because hospitals are being heavily targeted by encryption attacks right now, it's installed on EVERY machine in the hospital, which brought down our Epic cluster and the disaster recovery cluster. A true single point of failure.
Can only speak for the UK here, but having one computer system that is sufficiently functional for day-to-day operations is often a challenge, let alone two.
There are often such plans from DR systems to isolated backups to secondary system, as much as risk management budget allow at least. Of course it takes time to switch to these and back, the missing records cause chaos (both inside synced systems and with patient data) both ways and it takes a while to do. On top of that not every system will be covered so it's still a limited state.
There are problems with getting lab results, X-rays, CT and MRI scans. They do not have paper-based Plan B. IT outage in a modern hospital is a major risk to life and health of their patients.
It's often the case that the paper fallbacks can't handle anywhere near the throughput required. Yes, there's a mechanism there, but it's not usable beyond a certain load.
I think it's eventually manageable for some subset of medical procedures, but the transition to that from business as usual is a frantic nightmare. Like there's probably a whole manual for dealing with different levels of system failure, but they're unlikely to be well practiced.
Or maybe I'm giving these institutions too much credit?
I assume Crowdstrike is software you usually want to update quickly, given it is (ironically) designed to counter threats to your system.
Very easy for us to second-guess today, of course. But in another scenario a manager is being torn a new one because they fell victim to a ransomware attack via a zero day that systems were left vulnerable to because Crowdstrike wasn't updated in a timely manner.
Maybe, if there's a new zero-day major exploit that is spreading like wildfire. That's not the normal case. Most successful exploits and ransom attacks are using old vulnerabilites against unpatched and unprotected systems.
Mostly, if you are reasonably timely about keeping updates applied, you're fine.
> Maybe, if there's a new zero-day major exploit that is spreading like wildfire. That's not the normal case.
Sure. And Crowdstrike releasing an update that bricks machines is also not the normal case. We're debating between two edge cases here; the answers aren't simple. A zero day spreading like wildfire is not normal, but if it were to happen it could be just as, if not more, destructive than what we're seeing with Crowdstrike.
In the context of the GP where they were actively treating a heart attack, the act of restarting the computer (let alone it never come back) in of itself seems like an issue.
I believe this update didn't restart the computer, just loaded some new data into the kernel. Which didn't crash anything the previous 1000 times. A successful background update could hurt performance, but probably machines where that's considered a problem just don't run a general-purpose multitasking OS?
Crowdstrike pushed a configuration change that was a malformed file, which was picked up by every computer running the agent (millions of computers across the globe). It's not like hospitals and IT systems are manually running this update and can roll it back.
As to why they didn't catch this during tests, or why they don't perform gradual rollouts of changes to hosts, your guess is as good as mine. I hope we get a public postmortem for this.
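To be clear, nobody outside Crowdstrike knows what their channel-file format or load path actually looks like. But as a generic illustration of the fail-safe idea (validate any pushed blob, and keep running on the last known-good version instead of crashing), here is a minimal sketch; the JSON schema, field names, and `load_content_update` function are all invented for the example.

    import json

    def load_content_update(raw_bytes, last_known_good):
        """Parse a pushed content/config blob defensively.

        On any validation failure, keep running on `last_known_good`
        rather than propagating the error (which, in kernel code, is the
        difference between a logged warning and an unbootable machine).
        """
        try:
            blob = json.loads(raw_bytes)                  # malformed bytes -> exception
            if blob.get("schema_version") != 1:           # unknown schema -> reject
                raise ValueError("unsupported schema_version")
            rules = blob["rules"]
            if not isinstance(rules, list) or not rules:  # empty/garbage payload -> reject
                raise ValueError("rules missing or empty")
            return rules
        except Exception as err:
            print(f"rejecting content update: {err}")
            return last_known_good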
Considering Crowdstrike mentioned in their blog that systems that had their 'falcon sensor' installed weren't affected [1], and the update is falcon content, I'm not sure it was a malformed file, but just software that required this sensor to be installed. Perhaps their QA only checked if the update broke systems with this sensor installed, and didn't do a regression check on windows systems without it.
It says that if a system isn't "affected", meaning it doesn't reboot in a loop, then the "protection" works and nothing needs to be done. That's because the Crowdstrike central systems, on which the agents running on clients' machines rely, are working well.
The “sensor” is what the clients actually install and run on their machines in order to “use Crowdstrike”.
The crash happened in a file named csagent.sys which on my machine was something like a week old.
Likely because staggered updates would harm their overall security services. I'm guessing this software offers telemetry that gets shared across their clientele, and that gets hampered if you have a thousand different software versions.
My guess is this was an auto-update pushed out by whatever central management server they use. Given CS is supposed to protect you from malware, IT may have staged and pushed the update in one go.
High-end hospital-management software is not simple stuff, to roll your own. And the (very few) specialty companies which produce such software may see no reason to support a variety of OS's.
Are you sure that argument still holds when everyone has Android/iOS phone with apps that talk to Linux servers, and some use Windows desktops and servers as well?
There isn't, and never was, a benevolent dictator choosing the OS for computers in medical settings.
Instead, it's a bunch of independent-ish, for-profit software & hardware companies. Each one trying to make it cheap & easy to develop their own product, and to maximize sales. Given the dominance of MS-DOS and Windows on cheap-ish & ubiquitous PC's, starting in the early-ish 1980's, the current situation was pretty much inevitable.
To add detail for those that don't understand, the big healthcare players barely have unix teams, and the small mom and pop groups literally have desktops sitting under the receptionist desk running the shittiest software imaginable.
The big health products are built on windows because they are built by outsourced software shops and target the majority of builds which are basically the equivalent of bob's hardware store still running windows 95 on their point of sale box.
The major players that took over this space for the big players had to migrate from this, so they still targeted "wintel" platforms because the vast majority of healthcare servers are windows.
It's basically the tech equivalent of how railway gauge supposedly evolved from the width of a pair of oxen.
Because of critical mass. A significant amount of non-technically inclined people use Windows. Some use Mac. And they're intimidated by anything different.
There's a bunch of non-web proprietary software medical offices use to access patient files, result histories, prescription dispensation etc. At least here in Ontario my doctor uses an actual windows application to accomplish all that.
Then they use those apps. The point is that their usage of the OS as such is so minimal as to be irrelevant, as long as it has a launcher and an X in the top corner.
The question is: why did half or more of the Fortune 500 allow Crowdstrike - Windows hackers - access to and total control over businesses that are not about MS Windows? Obviously Crowdstrike doesn't differentiate between medicine and lifting cranes. "In the middle of the surgery" is not in their use case docs!
There was a Mercedes pit stop image somewhere with a wall of BSoD monitors :) But that is not Crowdstrike's business either...
And all of that over the public internet and miscellaneous clouds. Banks have their own fibre lines; why can't hospitals?
Airports should disconnect from the Internet too; selling tickets can be separate infra, and synchronization between POSes and checkout doesn't need to be in real time.
There is only one sane way to prevent such events: EDR controlled by the organization itself, which is sharply incompatible with 3rd-party online EDR providers. But they could sell it in a box and do real-time support when called.
Auditing: using Windows plus AV plus malware protection means you demonstrate compliance faster than trying to prove your particular version on Linux is secure. Hospitals have to demonstrate compliance in very short timeframes and every second counts. If you fail to achieve this, some or all of your units can be closed.
Dependency chains: many pieces of kit either only have drivers on windows or work much better on Windows. You are at the mercy of the least OS diverse piece of kit. Label printers are notorious for this as an e.g.
Staffing: Many of your staff know how to do their jobs excellently, but will struggle with tech. You need them to be able to assume a look and feel, because you don't want them fighting UX differences when every second counts. Their stress level is roughly equivalent to the worst 10 seconds of their day. And staff will quit or strike over UX. Even UI colour changes due to virtualization downscaling have triggered strife.
Change Mgmt: Hospitals are conservative and rarely push the envelope. We are seeing a major shift at the moment in key areas (EMR) but this is still happening slowly. No one is interested in increasing their risk just because Linux exists and has Win64 compatibility. There is literally no driver for change away from Windows.
> What are the hard problems? I can think of a few, but I'm probably wrong.
Billing and insurance reimbursement processes change all the time and are a headache to keep up to date. E.g. the actual dentist software is basically Paint, but mainly with the bucket tool and some way to quickly insert tooth objects to match your mouth. I.e. there's almost no medical skill embedded in the software itself helping the user.
It's not just that. A large portion of IT people who work in these industries find Windows much easier to administer. They're very resistant to switching out even if it was possible and everything the company needed was available elsewhere.
Even if they did switch, they'd then want to install all the equivalent monitoring crap. If such existed, it would likely be some custom kernel driver and it could bring a unix system to its knees when shit goes wrong too.
ER worker here. It really depends on the details. If she was C-STAT positive with last known normal within three hours, you assume stroke, activate the stroke team, and everything moves very quickly. This is where every minute counts, because you can do clot busting to recover brain function.
The fact that she was discharged without an overnight admit suggests to me that the MRI did not show a stroke, or perhaps she was outside the treatment window when she went to the hospital.
I remember a fed speaker at Defcon at the Alexis hotel in the 90s trying to rationalize their weirdly over-aggressive approach to enforcement by mentioning how hackers would potentially kill people in hospitals; fast forward to today, and it's literally the "security" software vendor that's causing it.
It's not like hackers haven't killed people in hospitals with e.g. ransomware. Our local dinky hospital system was hit by ransomware twice, which at the very least delayed some important surgeries.
I can't imagine why any critical system is connected to the internet at all. It never made sense to me. Wifi should not be present on any critical system board and ethernet plugged in only when needed for maintenance.
This should be the standard for any life sustaining or surgical systems, and any critical weapons systems.
I work for a large medical device company and my team works on securing medical devices. At least at my company as a general rule, the more expensive the equipment (and thus the more critical the equipment, think surgical robots) the less likely it will ever be connected to a network, and that is exactly because of what you said, you remove so many security issues when you keep devices in a disconnected state.
Most of what I do is creating the tools to let the field reps go into hospitals and update capital equipment in a disconnected state (IE, the reps must be physically tethered to the device to interact with it). The fact that any critical equipment would get an auto-update, especially mid-surgery is incredibly bad practice.
I work for the government supporting critical equipment - not in medical, in the transportation sector - and the systems my team supports not only are not connected to the internet, they aren't even capable of being so connected. Unfortunately the department responsible for flogging us to do cybersecurity reporting (a different org branch than my team) has all our systems miscategorized as IT data systems (when they don't even contain an operating system). So we now waste untold numbers of engineer hours reporting "0 devices affected" against lists of CVEs and answering data calls about SSH, Oracle, or Cisco vulnerabilities, etc., which we keep answering with "this system is air gapped and uses a microcontroller from 1980 that cannot run Windows or Linux" - but the cybersecurity-flogging department refuses to properly categorize us. My colleague is convinced they're doing that because it inflates their numbers of IT systems.
Anyway: it is getting to the point that I cynically predict we may be required to add things to the system (such as embedding PCs), just so we can turn around and "secure" them to comply with the requirements that shouldn't be applied to these systems. Maybe this current outage event will be a wake up call to how misplaced the priorities are, but I doubt it.
Have you ever tried to airgap a gigantic wifi network across several buildings?
Has to be wifi because the carts the nurses use roll around. Has to be networked so you can have EMR's that keep track of what your patients have gotten and the Pharmacists, doctors, and nurses can interface with the Pyxis machines correctly. The nurse scans a patients barcode at the Pyxis, the drawer opens to give them the drugs, and then they go into the patient's room and scan the drug barcode and the patients barcode before administering the drug. This system is to prevent the wrong drug from being administered, and has dramatically dropped the rates of mis-administering drugs. The network has to be everywhere on campus (often times across many buildings). Then the doctor needs to see the results of the tests and imaging- who is running around delivering all of these scans to the right doctors?
You don't know what you are talking about if you think this is easy.
Air-gapping the system from the external world is different from air-gapping internally. The systems are only updated via physical means. And possibly all data in and out is moved offline-style, via a double-firewall arrangement (no direct contact; you dump files in and out). Not common, but for industrial critical systems I've seen a few big shops do this.
So how does a doctor issue a discharge order via e-prescription to the patients pharmacy for them to pick up when they leave? How do you update the badge readers on the drug vaults when an employee leaves and you need to deactivate their badge? How do you update the EMR's from the hospital stay so the GP practice they use can see them after discharge? How do you order more supplies and pharmacy goods when you run out? How do you contact the DEA to get approval for using certain scheduled meds? I'm afraid that external networks are absolutely a requirement for modern hospitals.
If the system has to be networked with the outside world, who is responsible for physically updating all of these machines so they don't get ransomware'd? Who has to go out and visit each individual machine and update it each month so the MRI machine doesn't get bricked by some teen ransomware gang? Remember that was the main threat hospitals faced 3-4 years ago, which is why Crowdstrike ended up on everyone's computer: because the ransomware insurance people forced them to.
There is a reason that I am a software engineer and not an IT person. I prefer solving more tractable problems, and I think proving p!=np would be easier than effectively protecting a large IT network for people who are not computing professionals.
One of my favorite examples: in October 2013 casino/media magnate and right wing billionaire Sheldon Adelson gave a speech about how the US and Israel should use nuclear weapons to stop Iran nuclear program. In February 2014 a 150 line VB macro was installed on the Sands casino network that replicated and deleted all HDDs, causing 150 million dollars of damage. That was to a casino, which spends a lot of money on computer security, and even employs some guys named Vito with tire irons. And it wasn't nearly enough.
> Who has to go out and visit each individual machine and update it each month so the MRI machine doesn't get bricked by some teen ransomware gang?
The manufacturer does. As I mentioned in my OP I help build the software for our field reps to go into hospitals and clinics to update our devices in a disconnected state. Most of the critical equipment we manufacture has this as a requirement since it can't be connected to a network for security reasons.
As for discharge orders, etc, I can't speak to that, but that's also not what I would consider critical. I'm talking about things like surgical robots, which can not be connected to a network for obvious reasons, especially during a surgery.
External networks are required but it should be possible to air gap the critical stuff to read only. It’s just that it’s costly and hospitals are poor/cheap
My wife is a hospital pharmacist. (1) When she gets a new prescription in, she needs to see the patient's charts in the electronic medical records, and then if she approves the medication a drawer in the Pyxis cabinet (2) will open up when a nurse scans the patient's barcode, allowing them to remove the medication, and then the nurse will scan the patient's barcode and the medication barcode in the patient's room to record that it was delivered at a certain time. Computers are everywhere in healthcare, because they need records and computers are great at record-keeping. All of those need networks to connect them, mostly on wifi (so the nurses' scanners can read things).
In theory you could build an air-gapped network within a hospital, but then how do you transmit updates to the EMR's across different campuses of your hospital? How do you issue electronic prescriptions for patients to pick up at their home pharmacy? How do you handle off-site data backup?
Quite honestly, outside of defense applications I'm not aware of people building large air-gapped networks (and from experience, most defense networks aren't truly air-gapped any more, though I won't go into detail). Hospitals, power plants, dams, etc. all of them rely heavily on computers these days, and connect those over the regular internet.
1: My wife was the only pharmacist in her department last night whose computer was unaffected by Crowdstrike (for unknown reasons). She couldn't record her work in the normal ways, because the servers were Crowdstrike'd as well. So she spun up a document of her decisions and approvals, for later entry into the systems. It was over 70 pages long when she went off shift this morning. She's asleep right now.
First - drop the term "air-gapped" and replace it with "internet-gapped". Ta-da! And it already has a name: "the LAN"... Now teach managers the importance of the local net vs the open/public/world net. Tell them the cloud costs more because someone is making a fortune or three on it!
TIP: many buildings can be part of one LAN! It's called a VPN, and Russia and China don't like it because it is good for people!
TIP: data can easily be exchanged when needed! Including over the LAN.
> My wife is a hospital pharmacist. (1) When she gets a new prescription in, she needs to see the patient's charts in the electronic medical records, and then if she approves the medication a drawer in the Pyxis cabinet (2) will open up when a nurse scans the patient's barcode, allowing them to remove the medication, and then the nurse will scan the patient's barcode and the medication barcode in the patient's room to record that it was delivered at a certain time. Computers are everywhere in healthcare, because they need records and computers are great at record-keeping. All of those need networks to connect them, mostly on wifi (so the nurses' scanners can read things).
That was a description of a very local workflow...
It was a description of data flow - there's no reason it should be monopolized by an insecure-by-design OS vendor that then needs to be mandatorily secured by what is essentially a kernel rootkit, a.k.a. OS hacking. Which contradicts using that OS in the first place!
And it looks like Crowdstrike is just the "if you have to ask the price, you can't have it" version of SELinux :>>> RH++ for two decades of presentations on the necessity of SELinux.
But overall, allowing automatic updates from a 3rd party that has no clue about medicine onto hospital systems, etc., is criminal negligence by managers. Simple as that. Current state of the art? More negligence! Add (business) academia & co to the chronic offenders. Call them what they truly are - sociopaths, via their craft training facilities.
> In theory you could build an air-gapped network within a hospital, but then how do you transmit updates to the EMR's across different campuses of your hospital?
How do you transmit to other campuses or other hospitals? EASY! Transfer the mandatory data.
Please notice I used words like "mandatory" and "data". I DID NOT SAY "use the mandatory HTTP stack to transfer data"! NO. NO, I'm far, faaar from even suggesting THAT! :>
>How do you issue electronic prescriptions for patients to pick up at their home pharmacy?
Hard sold on that "air-gapped and in a cage" meme, eh? Send them the required data via a secure and private method! Communication channels already "hacked" - monopolized - by FB? Obviously that shouldn't happen in the first place. So resolve it as part of un-Windows-ing critical civilian infra.
>How do you handle off-site data backup?
That one I do not get. Are you saying that cloud access is the only possible way to have backups??? And that the Internet is a must to do it?? Is the medical staff brain dead? Ah, no... It's just the managers... Again.
>Quite honestly, outside of defense applications I'm not aware of people building large air-gapped networks
And DHCP and "super glue" and tons of other things were invented by the military, for a reason, but those things proliferated to civilians anyway. For good reasons. Air-gapping should be much more common when a wifi signal allows tracking how you move inside your own home. Not to mention GSM+-based "technologies"...
There is an old saying: computers maximize doing. And when there is chaos somewhere, the computers simply do their work on it.
I think the criticial systems here are often the ones that need to be connected to some network. Somebody up there mentioned how the MRI worked fine, but they still needed to get the results to the people who needed it. So the problem there was more doctor <-> doctor.
Yeah, our imaging devices were working fine, but with Epic down, you lose most of your communication between departments and your sole way of sharing radiology images and interpretations.
> Roslin: ...it tells people things like where the restroom is, and--
> Adama: It's an integrated computer network, and I will not have it aboard this ship.
> Roslin: I heard you're one of those people. You're actually afraid of computers.
> Adama: No, there are many computers on this ship. But they're not networked.
> Roslin: A computerized network would simply make it faster and easier for the teachers to be able to teach--
> Adama: Let me explain something to you. Many good men and women lost their lives aboard this ship because someone wanted a faster computer to make life easier. I'm sorry that I'm inconveniencing you or the teachers, but I will not allow a networked computerized system to be placed on this ship while I'm in command. Is that clear?
... at which point you will lose battles to enemies who have successfully networked their command and control operations. (For extra laughs, just wait until this is also true of AI.)
Ultimately there are just too darned many advantages to connecting, automating, and eventually 'autonomizing' everything in sight. It sucks when things don't go right, or when a single point of failure causes a black-swan event like this one, but in an environment where you're competing against either time or external adversaries, the alternatives are all worse.
Or the opposite: the enemy (or a third-party enemy who wasn't previously a combatant in the battle) hijacks your entire naval USV/UUV fleet & air force drone fleet using an advanced cyberattack, and suddenly your enemy's military force has almost doubled while yours is down to almost zero, and these hijacked machines are within your own lines.
Yes, the efficiency gains of remote automated administration and deployment make up for most outages that are caused by it.
A better thing to do is do phased deployment, so you can see if an update will cause issues in your environment before pushing it to all systems. As this incident shows, you can’t trust a software vendor to have done that themselves.
This wasn't a binary patch though, it was a configuration change that was fed to every device. Which raises a LOT of questions about how this could have happened and why it wasn't caught sooner.
Writing from the SRE side of the discipline, it's commonly a configuration change (or a "flag flip") that ultimately winds up causing an outage. All too seldom are configuration data considered part of the same deployable surface area (and, as a corollary, part of the same blast radius) as program text.
I've mostly resigned myself today to deploying the configuration change and watching for anomalies in my monitoring for a number of hours or days afterward, but I acknowledge that I also have both a process supervisor that will happily let me crash loop my programs and deployment infrastructure that will nonetheless allow me to roll things back. Without either of those, I'm honestly at a loss as to how I'd safely operate this product.
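As a toy version of those two safety nets (a supervisor that tolerates crash loops, plus the ability to roll back), here's a sketch where a service is restarted under a new config and too many failures in a short window trigger an automatic revert to the previous config. `run_service` is an invented stand-in for whatever actually launches the program; nothing here is any particular vendor's mechanism.

    import time

    def supervise(run_service, new_config, last_good_config,
                  max_crashes=3, window_seconds=300):
        """Run the service under `new_config`; revert if it crash-loops.

        `run_service(config)` is assumed to block until the process exits
        and return its exit code. The goal: a bad config flip degrades to
        the previous config instead of taking the host down for good.
        """
        crashes, config = [], new_config
        while True:
            if run_service(config) == 0:
                return  # clean shutdown requested, stop supervising

            now = time.monotonic()
            crashes = [t for t in crashes if now - t < window_seconds] + [now]
            if config is new_config and len(crashes) >= max_crashes:
                print("crash loop detected, reverting to last known-good config")
                config, crashes = last_good_config, []

            time.sleep(5)  # brief pause before restarting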
The most insidious part of this is when there are entire swaths of infrastructure in place that circumvent the usual code review process in order to execute those configuration changes. Boolean flags like your `config('foo')` here are most common, but I've also seen nested dictionaries shoved through this way.
When I was at FB there were a load of SEVs caused by config changes, such that the repo itself would print out a huge warning about updating configs and show you how to do a canary to avoid this problem.
As in, there was no way to have configured the sensors to prevent this? They were just going to get this if they were connected to the internet? If I was an admin that would make me very angry.
This is the way it's done in the nuclear industry across the US for power and enrichment facilities. Operational/secure section of the plant is airgapped with hardware data diodes to let info out to engineers. Updates and data are sneaker netted in.
At least hackers let people boot their machines, and some even have an automated way to restore the files after a payment. CS doesn't even do that. Hackers are looking better and more professional if we're going to put them in the same bucket, that is.
The criminal crews have a reputation to uphold. You don't deliver on payment, the word gets around and soon enough nobody is going to pay them.
These security software vendors have found a wonderful tacit moat: they have managed to infect various questionnaire templates by being present in a short list of "pre-vetted and known" choices in a dropdown/radiobutton menu. If you select the sane option ("other"), you get to explain to technically inept bean counters why you did so.
Repeat that for every single regulator, client auditing team, insurance company, etc. ... and soon enough someone will decide it's easier and cheaper to pick an option that gets you through the blind-leading-the-blind question karaoke with less headaches.
Remember: vast majority of so-called security products are sold to people high up in the management chain, but they are inflicted upon their victims. The incentives are perverse, and the outcomes accordingly predictable.
Funnily enough, a bit of snark can help from time to time.
For anyone browsing the thread archive in the future: you can have that quip in your back pocket and use it verbally when having to discuss the bingo sheet results with someone competent. It's a good bit of extra material, but it can not[ß] be your sole reason. The term you do want to remember is "additional benefit".
The reasons you actually write down boil down to four things. High-level technical overview of your chosen solution. Threat model. Outcomes. And compensating controls. (As cringy as that sounds.)
If you can demonstrate that you UNDERSTAND the underlying problem, and consider each bingo sheet entry an attempt at tackling a symptom, you will be on firmer ground. Focusing on threat model and the desired outcomes helps to answer the question, "what exactly are you trying to protect yourself from, and why?"
ß: I face off with auditors and non-technical security people all the time. I used to face off with regulators in the past. In my experience, both groups respond to outcome-based risk modeling. But you have to be deeply technical to be able to dissect and explain their own questions back to them in terms that map to reality and the underlying technical details.
The problem is concentration risk and incentives. Everyone is incentivized to follow the herd and buy Crowdstrike for EDR because of sentiment and network effects. You have to check the box, you have to be able to say you're defending against this risk (Evolve Bank had no EDR, for example), and you have to be able to defend your choice. You've now concentrated operational risk in one vendor, versus multiple competing vendors and products minimizing blast radius. No one ever got fired for buying Crowdstrike previously, and you will have an uphill climb internally attempting to argue that your org shouldn't pick what the bubble considers the best control.
With that said, Microsoft could've done this with Defender just as easily, so be mindful of system diversity in your business continuity and disaster recovery plans and enterprise architecture. Heterogeneous systems can have inherent benefits.
If you have a networked hybrid heterogeneous system though now you have weakest link issue, since lateral movement can now happen after your weaker perimeter tool is breached
A threat actor able to evade EDR and moving laterally or pivoting through your env should be an assumption you’ve planned for (we do). Defense in depth, layered controls. Systems, network, identity, etc. One control should never be the difference between success and failure.
> “This is a function of the very homogenous technology that goes into the backbone of all of our IT infrastructure,” said Gregory Falco, an assistant professor of engineering at Cornell University. “What really causes this mess is that we rely on very few companies, and everybody uses the same folks, so everyone goes down at the same time.”
Yes, but computers get infected by ransomware randomly; Crowdstrike infected a large number of life-critical systems worldwide over time, and then struck them all down at the same moment.
I'm not sure I agree, ransomware attacks against organizations are often targeted. They might not all happen on the same day, but it is even worse: an ongoing threat every day.
It's why it's not worse - an ongoing threat means only a small number of systems are affected at a time, and there is time to develop countermeasures. An attack on everything all at once is much more damaging, especially when it eliminates fallback options - like the hospital that can't divert their patients because every other hospital in the country is down too, and so is 911.
Ransomware that affects only individual computers does not get payouts outside of hitting extremely incompetent orgs.
If you want actually good payout, your crypto locker has to either encrypt network filesystems, or infect crucial core systems (domain controllers, database servers, the filers directly, etc).
Ransomware getting smarter about sideways movement, and proper data exfiltration etc attacks, are part of what led to proliferation of requirements for EDRs like Crowdstrike, btw
Ransomware vendors at least try to avoid causing damage to critical infrastructure, or hitting way too many systems simultaneously - it's good neither for business nor for their prospects of staying alive and free.
But that's beside the point. The point is, attacks distributed over time and space ultimately make the overall system more resilient; an attack happening everywhere at once is what kills complex systems.
> Ransomware getting smarter about sideways movement, and proper data exfiltration etc attacks, are part of what led to proliferation of requirements for EDRs like Crowdstrike, btw
To use medical analogy, this is saying that the pathogens got smarter at moving around, the immune system got put on a hair trigger, leading to a cytokine storm caused by random chance, almost killing the patient. Well, hopefully our global infrastructure won't die. The ultimate problem here isn't pathogens (ransomware), but the oversensitive immune system (EDRs).
A lot of security software - ranging from properly used EDRs like Crowdstrike to something as simple as setting a few rules in Windows File Server Resource Manager - has foiled many ransomware attacks, at the very least.
I'm guessing hundreds of billions if you could somehow add it all up.
I can't believe they pushed updates to 100% of Windows machines and somehow didn't notice a reboot loop. Epic gross negligence. Are their employees really this incompetent? It's unbelievable.
I wonder where MSFT and Crowdstrike are most vulnerable to lawsuits?
This outage seems to be the natural result of removing QA by a different team than the (always optimistic) dev team as a mandatory step for extremely important changes. And neglecting canary type validations. The big question is will businesses migrate away from such a visibly incompetent organization. (Note I blame the overall org; I am sure talented individuals tried their best inside a set of procedures that asked for trouble.)
So there was apparently an Azure outage prior to this big one. One thing that is a pretty common pattern in my company when there are big outages is something like this:
1. Problem A happens, it’s pretty bad
2. A fix is rushed out very quickly for problem A. It is not given the usual amount of scrutiny, because Problem A needs to be fixed urgently.
3. The fix for Problem A ends up causing Problem B, which is a much bigger problem.
tl;dr don’t rush your hotfixes through and cut corners in the process, this often leads to more pain
If you’ve ever been forced to use a PC with Crowdstrike it’s not amazing at all. I’m amazed incident of this scale didn’t happen earlier.
Everything about it reeks of incompetence and gross negligence.
It’s the old story of the user and purchaser being different parties-the software needs to be only good enough to be sold to third parties who never neeed to use it.
It’s a half-baked rootkit part of performative cyberdefence theatrics.
> It’s a half-baked rootkit part of performative cyberdefence theatrics.
That describes most of the space, IMO. In a similar vein, SOC2 compliance is bullshit. The auditors lack the technical acumen – or financial incentive – to actually validate your findings. Unless you’re blatantly missing something on their checklist, you’ll pass.
From an enterprise software vendor's perspective, cyber checklists feel like a form of regulatory capture. Someone looking to sell something gets a standard or best practice created, added to the checklists, and everyone is forced to comply, regardless of the context.
Any exception made to this checklist is reviewed by third parties that couldn't care less, bean counters, or those technically incapable of understanding the nuance, leaving only the large providers able to compete on the playing field they manufactured.
This will go on for multiple days, but hundreds of billions would be >$36 trillion annualized if it was that much damage for one day. World annual GDP is $100 trillion.
Their terms of use undoubtedly disclaim any warranty, fitness for purpose, or liability for any direct or incidental consequences of using their product.
I am LMFAO at the entire situation. Somewhere, George Carlin is smiling.
I wonder if companies are incentivized to buy Crowdstrike because of Crowdstrike's warranty that will allegedly reimburse you if you suffer monetary damage from a security incident while paying for Crowdstrike.
There must be an incentive. Because from a security perspective, bringing in a 3rd party to a platform (Microsoft) to do a job the platform already does is literally just the definition of opening up holes in your security. Completely b@tshit crazy; the salesmen for these products should hang their heads in shame. It's just straight-up bad practice. I'm astounded it's so widespread.
I saw one of the surgery videos recently. The doctor was saying, "Alexa, turn on suction." It boggled my mind. There could be so many points of failure.
Not to be that guy, but I often say software engineering as a field should have harsher standards of quality, and certainly liability, for things like this. You know, like civil engineers, electrical engineers, and most people whose work could kill someone if done wrong.
Usually when I write this, devs get all defensive and ask me what the worst thing is that could happen... I don't know. Could you guarantee it doesn't involve people dying?
Dear colleagues, software is great because one persons work multiplies. But it is also a damn fucking huge responsibility to ensure you are not inserting bullshit into the multiplication.
Some countries, such as Canada, have taken minor steps towards this, for example making it illegal to call oneself a software engineer unless you are certified by the province's professional engineering body; however, this is still missing a lot. I also don't wish to be "that guy", but I'll go further and say that the US is really holding this back by not making the use of Software Engineer as a title (without holding a PEng) illegal in a similar fashion.
If we can at least get that basis, then we can start to define more things, such as jobs that non-engineers cannot legally do, and legal ramifications for things such as software bugs. If someone will lose their professional license, and potentially their career, over shipping a large enough bug, suddenly the problem of having 25,000 npm dependencies and continuous deployment breaking things at any moment will magically cease to exist quite quickly.
I'd go a step farther and say software engineering as a field is not respected at the same level as those certified/credentialed engineers, because of this lack of standards and liability. That leads to common occurrences of systemic, destructive failures such as this one, due to organization-level direction being very lax in dealing with software failure potential.
I don't know, I get paid more than most of my licensed engineer friends. That's the only respect that really matters to me. Not saying there might not be other advantages to a professional organization for software.
I feel the same way but do agree there’s a general lack of respect for the field relative to other professions. Here’s another thread on the subject https://news.ycombinator.com/item?id=23676651
I believe instances like this will push people to reconsider the lax stance. Humans in general have a hard time regulating something abstract. The fact that people can be killed has been well known since the '80s; see https://en.wikipedia.org/wiki/Therac-25
I once worked on some software that generated PDFs of lab reports for drug companies monitoring clinical trials. These reports had been tested, but not exhaustively.
We got a new requirement to give doctors access to print them on demand. Before this, doctors only read dot-matrix-printed reports that had been vetted for decades. With our XSL-FO PDF generator, it was possible that a column could be pushed outside the print boundary, leading a doctor to see 0.9 as 0. I assume that in a worst-case scenario this could lead to a misdiagnosis, an intervention, and even a patient's death.
I was the only one in the company who cared about doing a ton more testing before we opened the reports to doctors. I had to fight hard for it, then I had to do all the work to come up with every possible lab report scenario and test it. I just couldn't stand the idea that someone might die or be seriously hurt by my software.
Imagine how many times one developer doesn't stand up in that scenario.
This is why I made that point; similar to you, I would not stand for having my code in something that I can't stand behind, especially if it could potentially harm people.
I'd endorse this. That way when my hypothetical PHB wants to know why something is taking so long I can say "See this part? Someone could die if we don't refactor it."
Not my story to tell, so I'm relaying it. A childhood friend works for a big company, you've heard their name; they make nuclear control systems for nuclear reactors, they have products out in the field they support, and there are new reactors in parts of the world from time to time. We were scheduled to have lunch a couple years back and he bailed; we rescheduled, and he bailed again because that was the day you couldn't defer XP updates anymore: the updates came in and some XP systems became Windows 10. XP was "nuclear reactor approved" by someone, they had a toolchain that didn't work right on other versions of Windows, and it all gave me chills.
They ended up giving MS a substantial amount of money to extend support for their use case for some number of years. I can't remember the number he told me but it was extremely large.
It sounds like he said XP machines auto-updated to Windows 10, and they would have had to have been connected to the internet in order to download that update. (I'm assuming, optimistically, that these were more remote-control computers than actual nuclear devices.)
Eh. There are a great many problems that could befall medical emergency systems that are unrelated to the OS. Like power loss. I think the core problem here really is a lack of redundancy.
Just a few weeks ago I had an OpenBSD box render itself completely unbootable after nothing more than a routine clean shutdown. Turns out their paranoid-idiotic "we re-link the kernel on every boot" coupled with their house-of-cards file system corrupted the kernel, then overwrote the backup copy when I booted from emergency media - which doesn't create device nodes by default so can't even mount the internal disks without more cryptic commands.
Counter-anecdote: I’ve been using Linux for 20 years, nearly half of that professionally. The only time I’ve broken a Linux box to the point it wasn’t functional was by mixing Debian unstable with stable, and I was still able to fix it.
I’ve had hardware stop working because I updated the kernel without checking if it removed support, but a. that’s easily reversible b. Linux kept working fine, as expected.
I’ll also point out, as I’m sure you know, that the BSDs are not Linux.
Funny, I broke my Debian twice (on two separate laptops) by doing exactly that, mixing stable with testing. I was kinda obliged to use "testing" because the Dell XPS was missing critical drivers.
In fairness, this is the number one way listed [0] on how to break Debian. That said, if you need testing (which isn’t that uncommon for personal use; Debian is slow to roll out changes, favoring stability), then running pure Sid is actually a viable option. It’s quite stable, despite its name.
I do not think windows is the problem here. The problem is that equipment that is critical infrastructure being connected to the internet, imo. There is little reason for a lot of computers in some settings to be connected to the internet, except for convenience or negligence. If data transfer needs to be done, it can happen through another computer. Some systems should exist on a (more or less) isolated network at best. Too often we do not really understand the risk of a device being connected to the internet, until something like this happens.
Why would a machine that is required for an MRI machine to work (as one of the examples given in the thread here) need to be online? I understand the logging argument, though even then I think it is too risky. Do all these machines _really_ need to be online, or has nobody bothered to change things after all the times something has happened? Or, even worse, do software companies profit in certain ways and not want to change their models? Can we imagine no other way to do things apart from connecting everything to some server, wherever that is?
MRI readouts are 3D, so they can't be printed for analysis. They are gigabytes in size, and the units are usually in a different part of the building. So you could sneakernet CDs every time an MRI is done, then sneakernet the results back. Or you could batch it, and then analysis is done slowly and all at once. Or you could connect it to a central server and results/analysis can be available instantly.
Smarter people than us have already thought through this and the cost-benefit analysis said "connect it to a server"
So in that case you set up a NAS server that it can push the reports to, with everything else firewalled off.
It's just laziness, and to be honest, an outage like this has no impact on their management's reputation, as a lot of other poorly run companies and institutions were also impacted, so the focus is on Crowdstrike and Azure, not them.
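As a sketch of that push-only pattern (the paths are hypothetical, and it assumes the NAS export is already mounted read-write on an otherwise firewalled modality PC):

    import shutil
    from pathlib import Path

    OUTBOX = Path("/data/finished_studies")   # hypothetical local spool on the modality PC
    NAS = Path("/mnt/nas/incoming")           # hypothetical mounted NAS export

    def push_reports():
        """Copy completed studies to the NAS; nothing ever pulls from this machine."""
        for study in OUTBOX.glob("*.zip"):
            dest = NAS / study.name
            if not dest.exists():
                shutil.copy2(study, dest)     # one-way push; local copy kept until verified

    if __name__ == "__main__":
        push_reports()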
I admit I'm not a medical professional, but these sound like problems with better solutions than lots of internet-connected terminals that can be taken down by EDR software.
Why not an internal-only network for all the terminals to talk to a central server, then disable any other networking for the terminals? Why do those terminals need a browser, which is where pretty much any malware is going to enter from? If hospitals are paying out the ass for their management software from Epic etc., they should be getting something with a secure design. If the central server is the only thing that can be compromised, then when EDR takes it down you at least still have all your other systems, presumably with cached data to work from.
Many X-rays (MRIs, CT scans, etc.) are read and interpreted by doctors who are remote. There are firms whose entire business is this: providing a way to connect radiologists and hospitals, and handling the usual back-end business work of billing, HR, and so on. Search for "teleradiology".
Same goes for electronic medical records. There are people who assign ICD-10 codes (insurance billing codes) to patient encounters. Often this is a second job for them and they work remote and typically at odd hours.
A modern hospital cannot operate without internet access. Even a medical practice with a single doctor needs it these days so they can file insurance claims, access medical records from referred patients and all the other myriad reasons we use the internet today.
Okay, so (as mentioned elsewhere in this thread), connect the offline box to an online NAS with the tightest security between the two humanly possible. You can get the relevant data out to those who need it.
This stuff isn't impossible to solve. Rather, the incentives just aren't there. People would rather build an apparatus for blame-shifting than actually build a better solution.
Do you think everyone involved is physically present? The gp was absolutely accurate that you guys have no idea how modern healthcare works and this had nothing to do with externally introduced malware.
This sounds a bit like someone just got run over by a truck because the driver couldn't see them, so people ask why trucks are so big that they're dangerous, and the response is "you just don't know how trucks work" rather than "yeah, maybe drivers should be able to see pedestrians".
If modern medicine is dangerous and fragile because of network connected equipment then that should be fixed even if the way it currently works doesn’t allow it.
This is a completely different discussion. They absolutely should be reliable. The part that is a complete non-starter is not being networked, because it ignores that telemedicine, PACS integration, and telerobotics exist.
If you don't understand why it has to be networked with extremely bad fallback to paper, then I suggest working in healthcare for a bit before pontificating on how everything should just go back to the stone age.
Networking puts their reliability at risk, as shown here and as shown in ransomware cases. It is not the first time something like this has happened.
The question is not whether hospitals need internet at all, or whether they should go back to printing things on paper; nobody ever said that. The question is whether everything in the hospital should be connected to the internet. Again, the example used was simple: having the computer that processes and exports the data from an MRI machine connected online in order to transfer the data, versus using a separate computer to transfer the data while the first computer stays offline. This is how we are supposed to transfer similar data at my work, for security reasons. I am not sure why it cannot happen there. If you cannot transfer data through that computer, there could be an emergency backup plan. But then you only need to solve the data transfer part, not everything.
You don’t print the images an MRI produced, you transmit them to the people who can interpret them, and they are almost never in the same room as the big machine, and sometimes they need to be called up in a different office altogether.
The comment [0] mentioned that they could not access the MRI outputs at all, even with the radiologist coming on site. Obviously, the software that was processing/exporting the data was running on a computer that was connected online, if not requiring an internet connection itself. Data transfer can happen from a computer other than the one the data is processed/obtained on. Less convenient, but this is common practice in many other places for security and other reasons.
I mean, this is incentivized by current monetization models. Remove the need to go through a payment based aaS infra, and all the libraries to do the data visualization could be running on the MRI dude's PC.
-aaS by definition requires you to open yourself to someone else to let them do the work for you. It doesn't empower you, it empowers them.
Yeah, I suspect -aaS monetisation models are one of the reasons for the current everything-to-the-internet mess. However, such software running on the machine and using a hardware USB key for authentication is not unheard of either. I wish that decisions on these subjects were made based on the specific needs of the users rather than the finance people of -aaS companies.
Is that an ironic question, or a serious one? I fail to detect the presence or absence of irony sometimes online. I just hope that my own healthcare system has some back-up plans for how to do day-to-day operations, like transferring my scan results to a specialist, in case the system they normally use fails.
"It seems like you’ve never worked with critical infra."
My entire career has been spent building, and maintaining, critical infra.[1]
Further, in my volunteer time, I come into contact with medical, dispatch and life-safety systems and equipment built on Windows and my question remains the same:
Why is Windows anywhere near critical infra ?
Just because it is common doesn't mean it's any less shameful and inadequate.
I repeat: We've fully understood these risks and frailties for 25 years.
[1] As a craft, and a passion - not because of "exciting career opportunities in IT".
Is this the rsync.net HN account? If so, lmao @ the comment you replied to.
> As a craft, and a passion
I believe you’ve nailed the core problem. Many people in tech are not in it because they genuinely love it, do it in their off time, and so on. Companies, doubly so. I get it, you have to make money, but IME, there is a WORLD of difference in ability and self-solving ability between those who love this shit, and those who just do it for the money.
What’s worse is that actual fundamental knowledge is being lost. I’ve tried at multiple companies to shift DBs off of RDS / Aurora and onto, at the very least, EC2s.
“We don’t have the personnel to support that.”
“Me. I do this at home, for fun. I have a rack. I run ZFS. Literally everything in this RFC, I know how to do.”
“Well, we don’t have anyone else.”
And that’s the damn tragedy. I can count on one hand the number of people I know with a homelab who are doing anything other than storing media. But you try telling people that they should know how to administer Linux before they know how to administer a K8s cluster, and they look at you like you’re an idiot.
The old school sysadmins who know technology well are still around, but there are increasingly fewer of them, while demand skyrockets as our species gives computers an increasing number of responsibilities.
There is tremendous demand for technology that works well and works reliably. Sure, setting up a database running on an EC2 instance is easy. But do you know all of the settings to make the db safe to access? Do you maintain it well, patch it, replicate it, etc? This can all be done by one of the old school sysadmins. But they are rare to find, and not easy to replace. It's hard to judge from the outside, even if you are an expert in the field.
So when the job market doesn't have enough sysadmins/devops engineers available, the cloud offers a good replacement. Even if you as an individual company can solve it by offering more money and having a tougher selection process, this doesn't scale over the entire field, as at that point you run into the total number of available experts.
Aurora is definitely expensive, but there are cheaper alternatives to it. Full disclosure: I'm employed by one of these alternative vendors (Neon). You don't have to use it, but many people do and it makes their life easier. The market is expected to grow a lot. Clouds seem to be one of the ways our industry is standardizing.
I’m not even a sysadmin, I just learned how to do stuff in Gentoo in the early ‘00s. Undoubtedly there are graybeards who will laugh at the ease of tooling that was available to me.
> But do you know all of the settings to make the db safe to access? Do you maintain it well, patch it, replicate it, etc?
Yes, but to be fair, I’m a DBRE (and SRE before that). I’m not advocating that someone without fairly deep knowledge attempt to do this in prod at a company of decent size. But your tiny startup? Absolutely; chuck a default install of Postgres or MySQL onto Debian, and optionally tune 2 – 3 settings (shared_buffers, effective_cache_size, and random_page_cost for Postgres; innodb_buffer_pool_* and sync_array_size for MySQL – the latter isn’t necessary until you have high concurrency, but it also can’t be changed without a restart, so you may as well). Pick any major backup solution for your DB (Barman for Postgres, XtraBackup for MySQL, etc.), and TEST YOUR BACKUPS. That’s about it. Apply any security patches (or use unattended-upgrades, just be careful) as they’re released, and don’t do anything outside of your distro’s package management. You’ll be fine.
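To make that concrete, here is a minimal postgresql.conf starting point for the three Postgres settings named above, assuming a dedicated 16 GB, SSD-backed box; the values are hypothetical starting points, not recommendations for any particular workload:

    # postgresql.conf -- hypothetical starting values for a dedicated 16 GB, SSD-backed host
    shared_buffers = 4GB           # ~25% of RAM is a common starting point
    effective_cache_size = 12GB    # planner hint: roughly the RAM left to the OS page cache
    random_page_cost = 1.1         # on SSDs, random reads cost nearly the same as sequential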
Re: Neon, I’ve not used it, but I’ve read your docs extensively. It’s the most interesting Postgres-aaS product I’ve seen, alongside postgres.ai, but you’re (I think) targeting slightly different audiences. I wish you luck!
> It’s the most interesting Postgres-aaS product I’ve seen, alongside postgres.ai, but you’re (I think) targeting slightly different audiences. I wish you luck!
Also, a lot of the passionate security people such as myself moved on to other fields, as it has just become bullshit artists sucking on the vendors' teat and filling out risk matrix sheets, with no accountability when their risk assessments invariably turn out to be wrong.
In the past, old versions of Windows were often considered superior because they stopped changing and just kept working. Today, that strategy is breaking down because attackers have a lot more technology available to them: a huge database of exploits, faster computers, IoT botnets, and so on. I suspect we're going to see a shift in the type of operating system hospitals run. It might be Linux or a more hardened version of Windows. Either way, the OS vendor should provide all security infrastructure, not a third party like Crowdstrike, IMHO.
> I suspect we're going to see a shift in the type of operating system hospitals run. It might be Linux or a more hardened version of Windows.
Why? "Hardening" the OS is exactly what Crowdstrike sells and bricked the machines with.
Centralization is the root cause here. There should be no by-design way for this to happen. That also rules out Microsoft's auto-updates. Only the IT department should be able to brick the hospital's machines.
Hardening is absolutely not what CrowdStrike sells. They essentially sell OS monitoring and anomaly detection. OS hardening involves minimizing the attack surface, usually by minimizing the number of services running and limiting the ability to modify the OS.
Nothing wrong with that. Windows XP-64 supports up to 128 GB of physical RAM; it could be 5 years until that is available on laptops. Windows 7 Pro supports up to 192 GB of RAM. Now if you were to ask me what you would run on those systems with maxed-out RAM, I wouldn't know. I also don't think the Excel version that runs on those versions of Windows allows partially filled cells for Gantt charts.
>Most of it runs on 6 to 10 year old unpatched versions of Windows…
Well, that's a pretty big problem. I don't know how we ended up in a situation where everybody is okay with the most important software being the most insecure, but the money needed to keep critical infra totally secure is clearly less than the money (and lives!) lost when the infra crashes.
Well, you can use stupid broken software with any OS, not just Windows. Isn't CrowdStrike Falcon available on Linux? Is there any reason why they couldn't have introduced a similar bug with similar consequences there?
None. There are a bunch of folks here who clearly haven’t spent a day in enterprise IT proclaiming Linux would’ve saved the day. 30 seconds of research would’ve led them to discover CrowdStrike also runs on Linux and has created similar problems on Linux in the past.
It's even better when you get told about the magical superiority of Apple for that...
... Except Apple pretty much pushes you to run such tools just to get reasonable management, let alone things like real-time integrity monitoring of important files (Crowdstrike in $DAYJOB[-1] is how security knew to ask whether it was me or something else that edited the PAM config for sudo on a corporate Mac).
Enterprise Mac always follows the same pattern: users proclaim its superiority while it's off the radar, then it gets McAfee, Carbon Black, Airlock, and a bunch of other garbage tooling installed and runs as poorly as enterprise Windows.
The best corporate dev platform at the moment is WSL2 - most of the activity inside the WSL2 VM isn't monitored by the Windows tooling, so performance is fast. Eventually security will start to mandate agents inside the WSL2 instance, but at the moment most orgs don't.
> Why would Windows systems be anywhere near critical infra ?
This is just a guess, but maybe the client machines are windows. So maybe there are servers connected to phone lines or medical equipment, but the doctors and EMS are looking at the data on windows machines.
No. The problem isn’t expertise — it’s CIOs who started their career in the 1990s and haven’t kept up with the times. I had to explain why we wanted PostgreSQL instead of MS SQL server. I shouldn’t have to have that conversation with an executive who should theoretically be a highly experienced expert. We also have CIOs that have MBAs but no actual background in software. (I happen to have an MBA but I also have 15+ years of development experience.) My point is CIOs generally know “business” and they know how to listen to pitches from “Enterprise” software companies — but they don’t actually have real-world experience using the stuff they’re forcing upon the org.
I recently did a project with a company that wanted to move their app to Azure from AWS — not for any good technical reason but just because “we already use Microsoft everywhere else.”
Completely stupid. S3 and Azure Blob don’t work the same way. MCS and AWS SES also don’t work the same way — but we made the switch not even for reasons of money, but because some Microsoft salesman convinced the CIO that their solution was better. Similar to why many Jira orgs force Bitbucket on developers — they listen to vendors rather than the people that have to use this stuff.
> I had to explain why we wanted PostgreSQL instead of MS SQL server.
Tbf, you are giving up a clustering index in that trade. May or may not matter for your workload, but it’s a remarkably different storage strategy that can result in massive performance differences. But also, you could have the same by shifting to MySQL, sooooo…
That’s so infuriating. But, while the people in your story sound dumb, they still sound way more technically literate than 95% of society. Azure is blue, AWS is followed by OME.
Teach a 60 year old industrial powertrain salesman to use Linux and to redevelop their 20 year old business software for a different platform.
Also explain why it’s worth spending food, house, and truck money on it.
Finally, local IT companies are often incompetent. You get entire towns worth of government and business managed by a handful of complacent, incompetent local IT companies. This is a ridiculously common scenario. It totally sucks, and it’s just how it is.
Windows servers are “niche” compared to Linux servers. Command line knowledge is not “uncommon expertise,” it’s imo the bare minimum for working in tech.
I’m not wildly opinionated here, I should clarify. I’d love a more Linux-y world. I’m just saying that a lot of small-medium towns, and small-medium businesses are really just getting by with what they know. And really, Windows can be fine. Usually, however, you get people who don’t understand tech, who can barely use a Windows PC, nevermind Linux, and don’t really have the budget to rebuild their entire tech ecosystem or the knowledge to inform that decision. It sucks, but it’s how it is.
Also, Open Office blows chunks. Business users use Windows. M365 is easy to get going, email is relatively hands-off, deliverability is abstracted. Also, a LOT of business software is Windows exclusive. And that also blows chunks.
I would LOVE a more open source, security minded, bespoke world! It’s just not the way it is right now.
> Why would Windows systems be anywhere near critical infra ?
Why would computers be anywhere near critical infra? This sounds like something that should fail safe: the control system goes down but the thing keeps running. If power goes down, hospitals have generator backups; it seems weird that computers would not be in the same situation.
Without access to Epic we can't place med orders, look up patient records, discharge patients from the hospital, enter them into our system, really much of anything. Every provider in the emergency department is on their computer placing orders and doing work when not interacting with a patient. Like most hospitals in this country, our entire workflow depends on Epic. We couldn't even run blood tests because the lab was down too.
The STEMI was stabilized, it's more that it was scary to lose every machine in the department at once while intubating a crashing patient. You're flying blind in a lot of ways.
If the computer system was down and medicine was needed to save a life, would some protocol dictate grabbing the medicine and dealing with the paperwork or consequences later? If protocol didn't allow for that, would staff start breaking protocol to save a life?
You can skip paperwork but what if the patient is allergic to a medicine and you need to check medical records? Or you need to call for a surgeon but VoIP is down? Etc…
My father's coworker died after being in the hospital for observation following a few scratches from a car accident, because they were accidentally given medication they were allergic to.
So, yeah. The paperwork can save lives too, and not all red tape is bad.
Otherwise you may go to the hospital to pick up your friend and be told to wait for the coroner.
I'm guessing they were being treated over the phone as the systems went down. I've been through a similar situation, the person on the phone will give step by step instructions while waiting for an ambulance to arrive.
Sounds like with the systems being down the call would have been cut off which sounds horrible.
No, treating in person. But we can't function as a department without computers. You call cardiology (on another floor) and none of their computers are working to be able to review the patient's records. You could take the EKG printout and run it to them, but we're just telling them lab results from what we can remember before our machines all bluescreened. The lab's computers were down so we can't do blood tests. Nursing staff know what to do next by looking at the board or their computer. Without that you're just a room full of people shouting things at each other, and you definitely can't see the 3-4x patients an hour you're expected to. Doctors and midlevels rely on Epic to place med orders too.
It's against the site guidelines to post like this, and we have to ban accounts that do it repeatedly, so if you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.
I just had a ten hour hospital shift from hell, apologies if my writing is lacking. I can't think of a better way to try to measure the scope of the damage caused by this.
Just completed a standing 24 due to this outage. My B-Shift brothers and sisters had to monitor the radios all night for their units to be called for emergencies. I heard every dispatch that went out.
We were back in the 1960's with paper and pen for everything, no updates on nature of call, no address information, nothing... find out when you show up and hope the scene is secure. It was wild as it was coupled to a relatively intense Monsoon storm.
Starting with an ER story kind of set up the expectation that you'll be "measuring the scope of the damage" in lives lost, not dollars. Though I guess at large enough scale, they're convertible.
Regardless, thanks for your report; seeing it was very sobering. I hope you can get some rest, and that things will soon return to normalcy.
A tiny bit of thought about your situation IMO should lead anyone to conclude that you just first-hand experienced the fallout of today's nightmare, and then took a step back and realized you were likely one of millions if not billions of other people experiencing the same, and relayed that thought in terms of immediately understandable loss. Someone else might see "wrong" but I saw empathy.
Billions in losses means a somewhat worse life for a huge number of people, and potentially much worse healthcare problems down the line; the NHS was affected.
When it comes to measuring the impact to society at scale, dollars is really the only useful common proxy. One can't enumerate every impact this is going to have on the world today -- there's too many.
I've told my testers for years their efficacy at their jobs would be measured in unnecessary deaths prevented. Nothing less. Exactly this outcome was something I've made unequivocally clear was possible, and came bundled with a cost in lives. Yet the "Management and bean counter types" insist "Oh, nope. Only the greenbacks matter. It's the only measure."
Bull. Shit. If we weren't so obsessed with imaginary value attached to little green strips of paper, maybe we'd have the systems we need so things like this wouldn't happen. You may not be able to enumerate every, but you damn well can enumerate enough. Y'all just don't want to because then work starts looking like work.
Why measure only death, as if it is the only terrible thing that can happen to someone?
That doesn’t count serious bodily injury, suffering, people who were victimized, people who had their lives set back for decades due to a missed opportunity, a person who missed the last chance to visit a loved one, etc.
There are uncountable different impacts that happen when you’re talking about events on the scale of an economy. Which is why economists use dollars. The proxy isn’t useful because it is more important than life; it is useful because the diversity of human experience is innumerable.
I understand your emotion, but perhaps people simply don't value human lives.
At least putting a number to a life is a genuine attempt, even though it may be distasteful.
The fact is that there already is a number on it, which one can derive entirely descriptively without making moral judgements. Insurance companies and government social security offices already attempt to determine the number.
> "Took down our entire emergency department as we were treating a heart attack."
Not questioning that it happened, but this was a boot loop after a content update. So if the computers were off and didn't get the update, and you booted them, they would be fine. And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.
How did it happen that you were rebooting in the middle of treating a heart attack? [Edit: BSOD -> auto reboot]
Beyond the BSOD that happened in this case, in general this is not true with Windows:
> And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.
Windows has been notorious for forcing updates down your throat, and rebooting at the least appropriate moments (like during time-sensitive presentations, because that's when you stepped away from the keyboard for 5 minutes to set up the projector). And that's in private setting. Corporate setting, the IT department is likely setting up even more aggressive and less workaround-able reboot schedule.
Things like this is exactly why people hate auto-updates.
But it has created a culture of everything needing to be kept up to date all the time no matter what, and pulling control of those updates out of your own hands into the provider's.
How do you propose ensuring critical security updates get deployed then?
Especially if an infected machine can attack others?
Users/IT regularly would never update or deploy patches, which has its own consequences. There's no perfect solution; it's rather a question of which pain to accept.
Yes. But you don't deploy experimental vaccines simultaneously across the entire population all at once. Inoculating an entire country takes months; the logistics incidentally provide protection against unforeseen immediate-term dangerous side effects. Without that delay, well, every now and then you'd kill half the population with a bad vaccine. The equivalent of what's happening now with CrowdStrike.
Windows Update has actually provided sensible control over when and how to apply updates since, I think, Windows 2000 (it was definitely there by Vista). You just need to use it.
It has been degrading since Windows 2000, with Microsoft steadily removing and patching up any clever workarounds people came up with to prevent the system from automatically rebooting. The pinnacle of that, an insult added to injury, was the introduction of "active hours" - a period of, initially, at most 8 or 10 hours, designated as the only time in the day your system would not reboot due to updates. Sucks if your computer isn't an office machine only ever used 9-to-5.
No, it was not degrading. Windows 10 introduced forced updating in Home editions because it was judged to be better for the general case (that it got abused later is a separate issue).
The assumption is that "pros" and "enterprise" either know how to use the provided controls or have a WSUS server set up, which takes over all scheduling of updates.
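For what it's worth, those scheduling controls are exposed as documented Windows Update policy registry values. A hedged sketch in Python (run as admin on a machine you manage; outside a tiny shop you would normally set the equivalent via Group Policy or WSUS rather than a script like this):

    import winreg

    AU_KEY = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"

    def schedule_updates(install_day=0, install_hour=3):
        """Set the Automatic Updates policy values so installs happen on a schedule
        and the machine doesn't reboot out from under a logged-on user."""
        with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, AU_KEY, 0,
                                winreg.KEY_SET_VALUE) as key:
            # 4 = auto-download and schedule the install
            winreg.SetValueEx(key, "AUOptions", 0, winreg.REG_DWORD, 4)
            # 0 = every day; 1-7 = Sunday through Saturday
            winreg.SetValueEx(key, "ScheduledInstallDay", 0, winreg.REG_DWORD, install_day)
            # hour of day, 0-23
            winreg.SetValueEx(key, "ScheduledInstallTime", 0, winreg.REG_DWORD, install_hour)
            # don't force a reboot while someone is logged on
            winreg.SetValueEx(key, "NoAutoRebootWithLoggedOnUsers", 0, winreg.REG_DWORD, 1)

(None of this would have helped with the CrowdStrike channel update, which bypassed Windows Update entirely; it only addresses the forced-reboot complaint above.)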
We do not know if the update was a new version of the driver (which can also be updated without a reboot on Windows, since ~17 years ago at least) or if it was data that was hot-reloaded and triggered a latent bug in the driver.
> "Windows has been notorious for forcing updates down your throat"
in the same way cars are notorious for forcing you to run out of gas while you're driving them and leaving you stranded... because you didn't make time to refill them before it became a problem.
> "Things like this is exactly why people hate auto-updates."
And people also hate making time for routine maintenance, and hate getting malware from exploits they didn't patch, and companies hate getting DDoS'd by compromised Windows PCs the owners didn't patch, and companies hate downtime from attackers taking them offline. There isn't an answer which will please everyone.
This isn't really a good faith response. This prevention of functionality during a critical period while forcing an update would be like if a modern car refused to drive during an emergency due to a forced over the air update that paused the ability to drive till the update was finished.
The parent response wasn't good faith; it was leaning on an emergency in a hospital department caused by CrowdStrike to whine about Microsoft in trollbait style.
> "This prevention of functionality during a critical period while forcing an update would be like if a modern car refused to drive during an emergency"
Machines don't know if there's an emergency going on; if you don't do maintenance, knowing that the thing will fail if you don't, then you're rolling the dice on whether it fails right when you need it. It's akin to not renewing an SSL certificate - you knew it was coming, you didn't deal with it, now it's broken - despite all reasonable arguments that the connection is approximately as safe 1 minute after midnight as it was 1 minute before, if the smartphone app (or whatever) doesn't give you any expired cert override then complaining does nothing. Windows updates are released the same day every month, and have been mandatory for eight years: https://www.forbes.com/sites/amitchowdhry/2015/07/20/windows...
And we all know why - because Windows had a reputation of being horribly insecure, and when Microsoft patched things, nobody installed the patches. So now people have to install the patches. Complaining "I want to do it myself" leads to the very simple reply: you can - why didn't you do it yourself before it caused you a problem?
If you're still stubbornly refusing to install them, refusing to disable them, refusing to move to macOS or Linux, and then complaining that they forced you to update at an inconvenient time, you should expect people to point out how ridiculous (and off-topic) you're being.
But that's the thing: forced updates are not akin to maintenance or certs that expire on an annual basis. I'm not sure where you're getting your "you should expect people to point out how ridiculous you're being" line from. You're the only one I'm seeing arguing this idea.
Disabling forced updates by using the proper managed-update features, which have existed longer than "forced updates" have, is table stakes for IT. In fact, it was considered important and critical before Windows became a major OS in business.
Not putting computers that are in any critical path on a proper maintenance schedule (which, btw, overrides automatic updates on Windows and doesn't require extra licenses!) is the same as willfully ignoring maintenance just because the car didn't punch you in the face every time you need to top up some fluids.
I agree that it is willfully ignoring maintenance, but I completely disagree with the analogy that it is the same as ignoring a fluid change in a car. A car will break down and may stop working without fluid changes. The same is almost assuredly not usually true if a windows, or other, update is ignored. If you disagree, then I'd be happy to review any evidence you have that these updates really are always as critical as you think.
A lot of things that come as "mandatory patches" in IT, not just for Windows, are things that tend to generate recalls - or "sucks to be you, buy new car" in automotive world.
In more professional settings than private small car ownership, you often will both have regular maintenance updates provided and mandates to follow them. Sometimes they are optional because your environment doesn't depend on them, sometimes they are mandatory fixes, sometimes they change from optional to mandatory overnight when previous assumptions no longer apply.
Several years ago, a bit over 100 people, and possibly uncounted more, had their lives endangered because an extra airflow-directing piece of metal was optional; after the incident it was quickly made mandatory, with hundreds of aircraft being stopped to have the fix applied (which previously was only required for hot locations - climate change really bit it).
Similarly, when you drive your car and it fails to operate, that's just you. When it's a more critical service, you're either facing corporate, or in worst case, governmental questions.
idk, a lot of systems are never meant to be rebooted outside of the update schedule, so they wouldn't have been off in the first place. And if those systems control others, then there is a domino effect.
I can see very well how one computer could have screwed all others. It's really not hard to imagine.
What happens when a computer gets rebooted as part of daily practice or because of the update, and then it becomes unusable, and then the treatment team needs to use it hours later?
I dunno, but they'd know about it hours earlier in time to switch to paper, or pull out older computers, or something - in that scenario it wouldn't have happened "as we were treating a heart attack" and they would have had time to prepare.
They're probably deployed to a virtualized system to ease maintenance and upkeep.
Updates are partially necessary to ensure you don't end up completely unsupported in the future.
It's been a long time, but I worked IT for an auto supplier. Literally nothing was worse than some old computer crapping out with an old version of Windows and a proprietary driver. Mind you, these weren't mission critical systems, but they did disrupt people's workflows while we were fixing the systems. Think, things like digital measurements or barcode scanners. Everything can be easily done by hand but it's a massive pain.
Most of these systems end up migrated to a local data center and then deployed via a thin client. Far easier to maintain and fix than some box that's been sitting in the corner of a shop collecting dust for 15 years.
The real problem is not that it's just a damn lift and shouldn't need full Windows. It's that something as theoretically solved and done as an operating system is not practically so.
An Internet of Lift can be done with <32MB of RAM and a <500MHz single-core CPU. Instead they (whoever "they" are) put in a GLaDOS-class supercomputer for it. That's the absurdity.
You’d be surprised at how entrenched Windows is in the machine automation industry. There are entire control-system algorithms implemented and run on realtime Windows; vendors like Beckhoff and ACS only have Windows builds of their control software, which developers extend and build on top of with Visual Studio.
Siemens is also very much in on this. Up to about the 90s most of these vendors were running stuff on proprietary software stacks running on proprietary hardware networked using proprietary networks and protocols (an example for a fully proprietary stack like this would be Teleperm). Then in the 90s everyone left their proprietary systems behind and moved to Windows NT. All of these applications are truly "Windows-native" in the sense that their architecture is directly built on all the Windows components. Pretty much impossible to port, I'd wager.
So, for maintenance and fault indications. Probably saves some time versus someone digging up manuals to check error codes from wherever they may or may not be kept. It could also display things like height and weight.
According to reports the ATMs of some banks also showed the BSOD which surprised me; i wouldn't have thought such "embedded" devices needed any type of "third-party online updates".
It's easier and cheaper (and a lil safer) to run wires to the up/down control lever and have those actuate a valve somewhere than it is to run hydraulic hoses to a lever like in lifts of old, for example.
That said it could also be run by whatever the equivalent of "PLC on an 8bit Microcontroller" is, and not some full embedded Windows system with live online virus protection so yeah, what the hell.
I'm having a hard time picturing a multi-story diesel repair shop. Maybe a few floors in a dense area but not so high that a lack of elevators would be show stopping. So I interpret "lift" as the machinery used to raise equipment off the ground for maintenance.
The most basic example is duty cycle monitoring and trouble shooting. You can also do things like digital lock-outs on lifts that need maintenance.
While the lift might not need a dedicated computer, they might be used in an integrated environment. You kick off the alignment or a calibration procedure from the same place that you operate the lift.
How many lifts, and how many floors, with how many people are you imagining? Yes, there's a dumb simple case where there's no need for a computer with an OS, but after the umpteenth car with umpteen floors, when would you put in a computer?
And then there's authentication. How do you expect key cards which say who's allowed to use the lift to work without some sort of database, which implies some sort of computer with an operating system?
It's a diesel repair shop, not an office building. I'm interpreting "lift" as a device for lifting a vehicle off the ground, not an elevator for getting people to the 12th floor.
Your understanding of Stuxnet is flawed. Iran was attacked by the US government in a very, very specific spearphishing attack with years of preparation to get Stux into the enrichment facilities - nothing to do with lifts connected to the network.
Also, the facility was air-gapped, so it wasn't connected to ANY outside network. They had to use other means to get Stux onto those computers, and then used something like 7 zero-days to move from Windows into the Siemens computers to inflict damage.
Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network.
"Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network."
The lesson here is that even in an air-gapped system the infrastructure should be as proprietary as is possible. If, by design, domestic Windows PCs or USB thumb drives could not interface with any part of the air-gapped system because (a) both hardwares were incompatible at say OSI levels 1, 2 & 3; and (b) software was in every aspect incompatible with respect to their APIs then it wouldn't really matter if by some surreptitious means these commonly-used products entered the plant. Essentially, it would be almost impossible† to get the Trojan onto the plant's hardware.
That said, that requires a lot of extra work. By excluding subsystems and components that are readily available in the external/commercial world means a considerable amount of extra design overhead which would both slow down a project's completion and substantially increase its cost.
What I'm saying is obvious, and no doubt noted by those who have similar intentions to the Iranians. I'd also suggest that individual controllers etc., such as the Siemens ones used by Iran, either wouldn't be used or would need to be modified from standard both in hardware and firmware (hardware mods would further bootstrap protection if an infiltrator knew the firmware had been altered and found a means of restoring the default factory version).
Unfortunately, what Stuxnet has done is to provide an excellent blueprint of how to make enrichment (or any other such) plants (chemical, biological, etc.) essentially impenetrable.
† Of course, that doesn't stop or preclude an insider/spy bypassing such protections. Building in tamper resistance and detection to counter this threat would also add another layer of cost and increase the time needed to get the plant up and running. That of itself could act as a deterrent, but I'd add that in war that doesn't account for much, take Bletchley and Manhattan where money was no object.
I once engineered a highly secure system that used (shielded) audio cables and amodem as the sole pathway to bridge the airgap. Obscure enough for ya?
Transmitted data was hashed on either side, and manually compared. Except for very rare binary updates, the data in/out mostly consisted of text chunks that were small enough to sanity-check by hand inside the gapped environment.
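A minimal sketch of that kind of manual check, assuming plain files on both sides of the gap (the command-line usage is made up for illustration): compute a digest before the transfer, recompute it inside the gapped environment, and compare the two hex strings by eye.

    import hashlib
    import sys

    def digest(path, algo="sha256", chunk=1 << 20):
        """Hash a file in chunks so large payloads never need to fit in memory."""
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    if __name__ == "__main__":
        # Run once outside the gap and once inside; trust the payload only
        # after the two printed digests match.
        print(digest(sys.argv[1]))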
Stux also taught other government actors what's possible with a few zero-days strung together, effectively starting the cyberwar we've been in for years.
To work with various private data, you need to be accredited and that means an audit to prove you are in compliance with whatever standard you are aspiring to. CS is part of that compliance process.
Another department in the corporation is probably accessing PII, so corporate IT installed the security software on every Windows PC. Special cases cost money to manage, so centrally managed PCs are all treated the same.
Anything that touches other systems is a risk and needs to be properly monitored and secured.
I had a lot of reservations about companies installing Crowdstrike but I'm baffled by the lack of security awareness in many comments here. So they do really seem necessary.
They optimize for small-batch development costs. Slapping a Windows PC on when you sell a few hundred to a thousand units is actually pretty cheap. The software itself is probably the same order of magnitude, cheaper for the UI itself...
And cheap both short and long term. Microsoft has 10-year lifecycles you don't need to pay extra for. With Linux you need IT staff to upgrade it every 3 years, not to mention hiring engineers to recompile software every 3 years with the distro upgrade.
Probably a Windows-based HMI (“human-machine interface”).
I used to build sorting machines that use variants of the typical “industrial” tech stack, and the actual controllers are rarely (but not never!) Windows. But it’s common for the HMI to be a Windows box connected into the rest of the network, as well as any server.
In a lot of cases you find tangential dependencies on Windows in ways you don't expect. For example a deployment pipeline entirely linux-based deploying to linux-based systems that relies on Active Directory for authentication.
I'm more confused because I have never, ever encountered a lift that wasn't just some buttons or joysticks on a controller attached to the lift. There is zero need of more computing power than a 8-bit microcontroller from the 1980s. I don't know where I would even buy such a lift with a windows PC.
No one sells 8 bit microcontrollers from the 1980s anymore. Just because you don't need the full power of modern computing hardware and software doesn't mean you are going to pay extra for custom, less capable options.
I think the same question can be asked for why lots of equipment seemingly requires an OS. My take is that these products went through a phase of trying to differentiate themselves from competitors and so added convenience features that were easier to implement with a general purpose computer and some VB script rather than focusing on the simplest most reliable way to implement their required state machines. It's essentially convenience to the implementors at the expense of reliability of the end result.
My life went sideways when organizations I worked for all started to make products solely for selling and not for using those. If the product was useful for something, that was the side effect of being sellable. Not the goal.
Worse is Better has eaten the world. The philosophy of building things properly with careful, bespoke, minimalist designs has been totally destroyed by a race to the bottom. Grab it off the shelf, duct tape together a barely-working MVP, and ship it.
Some idiot with a college degree, in an office nowhere near the place, sees that we have these PCs here. And then they go over the compliance list and mandate that this is needed. Now go install it, and the network there...
Or they want to protect their Windows-operated lifts from very real and life threatening events like an attacker jumping from host to host until they are able to lock the lifts and put people lives at risk or cause major inconveniences.
Not all security is done by stupid people. Crowdstrike messed up in many ways. It doesn't make the company that trusted them stupid for what they were trying to achieve.
For the same reason people want to automate their homes, or the industries run with lots of robots, etc: because it increases productivity. The repair shop could be monitoring for usage, for adequate performance of hydraulics, long-term performance statistics, some 3rd-party gets notified to fix it before it's totally unusable, etc.
I have a friend that is a car mechanic. The amount of automation he works with is fascinating.
Sure, lifts and whatnot should be in a separate network, etc, but even banks and federal agencies screw up network security routinely. Expecting top-tier security posture from repair shops is unrealistic. So yes, they will install a security agent on their Windows machines because it looks like a good idea (it really is) without having the faintest clue about all the implications. C'est la vie.
But what are you automating? It's a car lift, you need to be standing next to it to safely operate it. You can't remotely move it, it's too dangerous. Most of the things which can go wrong with a car lift require a physical inspection and for things like hydraulic pressure you can just put a dial indicator which can be inspected by the user. Heck, you can even put electronic safety interlocks without needing an internet connection.
There are lots of difficult problems when it comes to car repair, but cloud lift monitoring is not something I've ever heard anyone ask for.
The things you're describing are all salesman sales-pitch tactics, they're random shit which sound good if you're trying to sell a product, but they're all stuff nobody actually uses once they have the product.
It's like a six in one shoe horn. It has a screw driver, flash light, ruler, bottle opener, and letter opener. If you're just looking at two numbers and you see regular shoe horn £5, six in one shoe horn £10 then you might blindly think you're getting more for your money. But at the end of the day, I find it highly unlikely you'll ever use it for anything other than to put tight shoes on.
I imagine something monitors how many times the lift has gone up and down, for maintenance reasons. Maybe a nice model monitors fluid pressure in the hydraulics to watch for leaks. Perhaps a model watches strain, or balance, to prevent a catastrophic failure. Maybe those are just sensors, but if they can’t report their values they shut down for safety’s sake. There are all kinds of reasonable scenarios that don’t rely on bad people trying to screw or cheat someone.
None of these features require internet or a windows machine, most of them do not require a computer or even a microcontroller. Strain gauges can be useful for checking for an imbalanced load, but they cannot inspect the metal for you.
In my office, when we swipe our entry cards at the security gates, a screen at the gate tells us which lift to take based on the floor we work on, and sets the lift to go to that floor. It's all connected.
Remote monitoring and maintenance. Predictive maintenance: monitor certain parameters of operation and get maintenance done before the lift stops operating.
It's a car lift. Not only would it be irresponsible to rely on a computer to tell you when you should maintain it, as some inspections can only be done visually, it seems totally pointless as most inspections need to be done manually.
Get a reminder on your calendar to do a thorough inspection once a day/week (whatever is appropriate) and train your employees what to look for every time it's used. At the end of the day, a car lift on locks is not going to fail unless there's a weakness in the metal structure, no computer is going to tell you about this unless there's a really expensive sensor network and I highly doubt any of the car lifts in question have such a sensor network.
Moreover, even if they did have such a sensor network, why are these machines able to call out to the internet?
These requirements can be met by making the lift's systems and data observable, which is a uni-directional flow of information from the lift to the outside world. Making the lift's operation modifiable from the outside world is not required to have it be observable.
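As a sketch of what observable-but-not-modifiable can look like in practice (the sensor fields and the collector address are hypothetical), the lift side only ever emits telemetry over a fire-and-forget datagram socket and never listens for anything coming back:

    import json
    import socket
    import time

    COLLECTOR = ("192.0.2.10", 9000)   # hypothetical monitoring host (TEST-NET address)

    def read_sensors():
        """Placeholder for whatever the lift controller can actually measure."""
        return {"cycles": 12345, "hydraulic_psi": 2150, "ts": time.time()}

    def main():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(json.dumps(read_sensors()).encode(), COLLECTOR)
            time.sleep(60)             # outbound only: no inbound port, nothing to command remotely

    if __name__ == "__main__":
        main()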
The same reason everyone just uses a microcontroller on everything. It's like a universal glue and you can develop in the same environment you ship. Makes it easy.
Lathes probably have PCs connected to them to control them, and do CNC stuff (he did say the controllers). Laser alignment machines all have PCs connected to them these days.
The cranes and lifts though... I've never heard of them being networked or controlled by a computer. Usually it's a couple buttons connected to the motors and that's it. But maybe they have some monitoring systems in them?
Off the top of my head, based on limited experience in industrial automation:
- maintenance monitoring data shipping to centralised locations
- computer-based HMI system - there might be good old manual control, but it might require unreasonable amounts of extra work per work order
- centralised control system - instead of using a panel specific to the lift, you might be controlling a bunch of tools from a common panel
- integration with other tools, starting from things as simple as pulling up the manufacturer's service manual to check details, to things like automatically raising the lift to a position appropriate for a work order involving other (possibly also automated) tools, with adjustments based on the vehicle you're lifting
Remember that CNC is a programming environment. Now how do you actually see what program is loaded? Or where the execution is at the moment? For anything beyond a few lines of text on a dot-matrix screen, an actual OS starts to become desirable.
And all things considered, Windows is not that bad an option. Anything else would also have issues. And really, what is your other option, some outdated, unmaintained Android? Does your hardware vendor offer long-term support for Linux?
Windows actually offers extremely good long term support quite often.
> And all things considered, Windows is not that bad an option
I'm gonna go out on a limb and say that it actually is. It's a closed source OS which includes way more functionality than you need. A purpose-built RTOS running on a microcontroller is going to provide more reliability, and if you don't hook it up to the internet it will be more secure, too. Of course, if you want you can still hook it up to the internet, but at least you're making the conscious decision to do so at that point.
Displaying something on a screen isn't very hard in an embedded environment either.
I have an open source printer which has a display, and runs on an STM32. It runs reliably, does its job well, and doesn't whine about updates or install things behind my back, because it physically can't: it has no access to the internet (though I could connect it if I desired). A CNC machine is more complex and has more safety considerations, but is still in a similar class of product.
> Does your hardware vendor offer long term support for Linux?
This seems muddled. If the CNC manufacturer puts Linux on an embedded device to operate the CNC, they're the hardware manufacturer and it's up to them to pick a chip that's likely to work with future Linuxes if they want to be able to update it in the future. Are you asking if the chip manufacturer offers long-term-support for Linux? It's usually the other way around, whether Linux will support the chip. And the answer, generally, is "yes, Linux works on your chip. Oh you're going to use another chip? yes, Linux works on that too". This is not really something to worry about. Unless you're making very strange, esoteric choices, Linux runs on everything.
But that still seems muddled. Long-term support? How long are we talking? Putting an old Linux kernel on an embedded device and just never updating it once it's in the field is totally viable. The Linux kernel itself is extremely backwards compatible, and it's often irrelevant which version you're using in an embedded device. The "firmware upgrades" they're likely to want to do would be in the userspace code anyhow - whatever code is showing data on a display or running a web server you can upload files to or however it works. Any kernel made in the last decade is going to be just fine.
We're not talking about installing Ubuntu and worrying about unsolicited Snap updates. Embedded stuff like this needs a kernel with drivers that can talk to required peripherals (often over protocols that haven't changed in decades), and that can kick off userspace code to provide a UI either on a screen or a web interface. It's just not that demanding.
As such, people get away with putting FreeRTOS on a microcontroller, and that can show a GUI on a screen or a web interface too; you often don't need a "full" OS at all. A full OS can be a liability, since it's difficult to get real-time behaviour, which presumably matters for something like a CNC. You either run a real-time OS, or a regular OS (from which the GUI stuff is easier) which offloads work to additional microcontrollers that do the real-time stuff.
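As a rough illustration of that point (this is not any particular product's firmware; the display and button drivers are stubbed out and all the names are made up), a bare-metal FreeRTOS build is often little more than a scheduler plus a couple of tasks:

    /* Hedged sketch: a display task and an input task under FreeRTOS.
       Board-specific display/button code is stubbed out. */
    #include "FreeRTOS.h"
    #include "task.h"

    static void display_task(void *arg)
    {
        (void)arg;
        for (;;) {
            /* redraw the UI from the current machine state */
            vTaskDelay(pdMS_TO_TICKS(100));
        }
    }

    static void input_task(void *arg)
    {
        (void)arg;
        for (;;) {
            /* scan the control-panel buttons, update state */
            vTaskDelay(pdMS_TO_TICKS(10));
        }
    }

    int main(void)
    {
        xTaskCreate(display_task, "display", 512, NULL, tskIDLE_PRIORITY + 1, NULL);
        xTaskCreate(input_task,   "input",   256, NULL, tskIDLE_PRIORITY + 2, NULL);
        vTaskStartScheduler();   /* never returns if the scheduler starts */
        for (;;) { }             /* only reached if there was not enough heap */
    }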
I did not expect Windows to be running on CNCs. I didn't expect it to be running on supermarket checkouts. The existence of this entire class of things pointlessly running self-updating, internet-connected Windows confuses me. I can only assume that there are industries where people think "computer equals Windows" and there just isn't the experience present, for whatever reason, to know that whacking a random Linux kernel on an embedded computer and calling it a day is way easier than whatever hoops you have to jump through to make a desktop OS, let alone Windows, work sensibly in that environment.
5-10 years is not an unreasonable support expectation, I think.
And if you are someone manufacturing physical equipment, be it a CNC machine or a vehicle lift, hiring an entire team to keep Linux patched and make your own releases seems pretty unreasonable and a waste of resources. In the end, nothing you choose is error-free. And the box running the software is not the main product.
This is actually a huge challenge: finding a vendor that can deliver you a box to run your software on, with promised long-term support that is actually more than just a few years.
Also, I don't understand how it is any more acceptable to run unpatched Linux in a networked environment than it is Windows. These are very often not just stand-alone things, but instead connected to at least a local network if not larger networks, with possible internet connections too. So not patching vulnerabilities is as unacceptable as it would be with Windows.
With CNC there is a place for something like a Windows OS. You have a separate embedded system running the tools, but you still want a different piece managing the "programs", as you could have dozens or hundreds of these. And at that point, reading them from the network starts to make sense again. The time of dealing with floppies is over...
And with checkouts, you want more UI than just buttons, and Windows CE has been a reasonably effective tool for that.
Linux is nice on servers, but on the embedded side keeping it secure and up to date is often a massive amount of pain. Windows does offer excellent stability and long-term support, and you can simply buy a computer with sufficient support from MS. One could ask why massive companies do not run their own Linux distributions.
> 5-10 years is not an unreasonable support expectation, I think.
A couple of years ago, I helped a small business with an embroidery machine that runs Windows 98. Its physical computer died, and the owner could not find the spare parts. Fortunately, it used a parallel port to control the embroidery hardware, so it was easy to move to a VM with a USB parallel port adapter.
That was very lucky then. USB parallel port adapters are only intended to work with printers. They fail with any hardware that does custom signalling over the parallel port.
Maybe you want your lift to be able to diagnose itself. Tell you the likely fault, instead of spending man-hours troubleshooting every part each time, downtime included. With big lifts there are many parts that could go wrong. Being able to identify which one saves a lot of time, and time is money.
These sorts of outages are actually extremely rare nowadays. Considering how long these control systems have been kept around, they must not actually be causing enough issues to make replacing them worth it.
you log into the machine, download files, load files onto the program. that doesn't need a desktop environment? you want to reimplement half of one, poorly, because that would have avoided this stupid mistake, in exchange for half a dozen potential others, and a worse customer experience?
> you log into the machine, download files, load files onto the program. that doesn't need a desktop environment?
Believe it or not, it doesn't! An embedded device with a form of flash storage and an internet connection to a (hopefully) LAN-only server can do the same thing.
> you want to reimplement half of one, poorly
Who says I would do it poorly? ;)
> and a worse customer experience?
Why would a purpose-built system be a worse customer experience than _windows_? Are you really going to set the bar that low?
Or lathe, or cranes, or alarms, or hvac... what the actual fuck.
The next move should be some artisanal, as-mechanical-as-possible quality products, or at least a Linux(TM)-certified product or similar (or Windows-free(TM)). The opportunity is here, everybody noticed this clusterfuck, and smart folks don't like ignoring threats that are right in their face.
But I suppose in 2 weeks some other bombastic news will roll over this and most will forget. But there is always some hope.
I feel like this is the fake reason given to try to hide the obvious reason: automatic updates are a power move that allows companies to retain control of products they've sold.
Yep. And even aside from security, it's a nightmare needing to maintain multiple versions of a product. "Oh, our software is crashing? What version do you have? Oh, 4.5. Well, update 4.7 from 2 years ago may fix your problem, but we've also released major versions 5 and 6 since then - no, I'm not trying to upsell you ma'am. We'll pull up the code from that version and see if we can figure out the problem."
Having evergreen software that just keeps itself up to date is marvellous. The Google Docs team only needs to care about the current version of their software. There are no documents saved with an old version. There's no need to backport fixes to old versions, and no QA teams that need to test backported security updates on 10 year old hardware.
It's just a shame about, y'know, the aptly named CrowdStrike.
Fine. But Google can mass-migrate all of them to a new format any time they want. They don’t have the situation you used to have with Word, where you needed to remember to Save As Word 2001 format or whatever so you could open the file on another computer. (And if you forgot, the file was unreadable). It was a huge pain.
Yes, it is better than the Word situation, but no, it isn't "not caring". There do exist old-format docs, and Google does have to care - to make that migration.
Yes, they have to migrate once. But they don’t need to maintain 8 different versions of Word going back a decade, make sure all security patches get back ported (without breaking anything along the way), and make all of them are in some way cross compatible despite having differing feature sets.
If google makes a new storage format they have to migrate old Google docs. But that’s a once off thing. When migrations happen, documents are only ever moved from old file formats to new file formats. With word, I need to be able to open an old document with the new version of word, make changes then re-save it so it’s compatible with the old version of word again. Then edit it on an old version of word and go back and forth.
I’m sure the Google engineers are very busy. But by making Docs be evergreen software, they have a much easier problem to solve when it comes to this stuff. Nobody uses the version of Google docs from 6 months ago. You can’t. And that simplifies a lot of things.
They have to migrate each time they change the format, surely. Either that or maintain converters going back decades, to apply the right one when a document is opened.
> but they don’t need to maintain 8 different versions of Word going back a decade, make sure all security patches get back ported
Nor does Microsoft for Word.
> With word, I need to be able to open an old document with the new version of word, make changes then re-save it so it’s compatible with the old version of word again.
You don't have to, unless you want the benefit of that.
And Google Docs offers the same.
> Nobody uses the version of Google docs from 6 months ago. You can’t. And that simplifies a lot of things.
Well, I'd love to use the version of Gmail web from 6 months ago. Because three months ago Google broke email address input such that it no longer accesses the contacts list and I have to type/paste each address in full.
That's a price we pay for things being "simpler" for a software provider that can and does change the software I am using without telling me, let alone giving me the choice.
Not to mention the change that took away a large chunk of my working screen space for an advert telling me to switch to the app version, despite my having the latest version of Google's own Chrome. An advert I cannot remove despite having got the message 1000 times. Pure extortion. Simplification is no excuse.
It used to be the original reason why automatic updates were accepted and it was valid.
But since then it has been abused for all sorts of things that really are nothing more than consolidation of power, including an entire shift in mentality of what "ownership" even means: tech companies today seem to think it's the standard that they keep effective ownership of a product for its entire life cycle, no matter how much money a customer has paid for it, and no matter how deeply the customer relies on that product.
(Politicians mostly seem fine with that development or even encourage it)
I agree that an average nontechnical person can't be expected to keep track of all the security patches manually to keep their devices secure.
What I would expect would be an easy way to opt-out of automatic updates if you know what you're doing. The fact that many companies go to absurd lengths to stop you from e.g. replacing the firmware or unlocking the bootloader, even if you're the owner of the device is a pretty clear sign to me they are not doing this out of a desire to protect the end-user.
Also, I'm a bit baffled that there is no vetting at all of the contents of updates. A vendor can write absolutely whatever they want into a patch for some product of theirs and arbitrarily change the behaviour of software and devices that belong to other people. As a society, we're just trusting the tech companies to do the right thing.
I think a better system would be if updates would at the very least have to be vetted by an independent third party before being applied and a device would only accept an update if it's signed by the vendor and the third-party.
The third party could then do the following things:
- run tests and check for bugs
- check for malicious and rights-infringing changes deliberately introduced by the vendor (e.g. taking away functionality that was there at time of purchase)
- publicly document the contents of an update, beyond "bug fixes and performance improvements".
What you're describing is what Linux distro maintainers do: Debian maintainers check the changes of different software repos, look at new options and decide if anything should be disabled in the official Debian release, and compile and upload the packages.
The problem you are complaining about here is the weakening of labor and consumer organizations vis-a-vis capital or ownership organizations. The software must be updated frequently due to our lack of skill in writing secure software. Whether all the corporations will take advantage of everything under the sun to reduce the power the purchasers and producers of these products have is a political and legal question. If only the corporations are politically involved, then only they will have their voice heard by the legislatures.
no reason why both can't be true — the security is overall better, and companies are happy to invest in advancing this paradigm because it gives them more control
incentive can and does undermine the stated goal. what if the government decided to take control of everyone's investment portfolio to prevent the market doing bad things? or an airplane manufacturer got to take control of its own safety certification process, because obviously it's in their best interest that their planes are safe? or an imposed curfew, everyone has to be inside their homes while it's dark outside, because most violent crimes occur at night?
how much lathe-ing have you done recently? did you load files onto your CNC lathe with an SD card, and thus there is a computer, which needs updates, or are you thinking of a lathe that is a motor and a rubber band, and nothing else, from, like, high school woodshop?
I bought a 3D printer years ago, then let it sit collecting dust for two or more years because I was intimidated by it. Finally started using it and was blown away by how useful it has been to me. Then, a long time later, I realized holy shit, there are updates and upgrades one can easily do. I can add a camera and control and monitor everything from any online connected device. I always hated pulling out the SD card, bringing it to my computer, copying files over, and bringing it back to the printer, and so on. Being online makes things so much easier and faster. I have been rocking my basic printer for a few years now and have not paid much attention to the scene, and then I started seeing these multi-color prints; holy shit, am I slow and behind the times. The newer printers are pretty rad, but I will give props to my Anycubic Mega: it has been a workhorse and I have had very few problems. I don't want it to die on me, but a newer printer would be cool also.
There are immense benefits to using modern computing power, including both onboard and remote functionality. The cost of increased software security vulnerability is easily justified.
1. Nobody auto updates my linux machines. They have no malware.
2. It's my job to change the oil in my car. When Ford starts sending a tech to my house to tamper with my machines "because they need maintenance" will be the day I am no longer a Ford customer.
The irony of this comment is almost perfected by the fact that Ford were one of the leading companies in bringing ECUs (one of the myriad computer systems essential to modern vehicles that can and do receive regular updates) to market in *checks notes* 1975.
Those Linux systems that aren't getting updates must be the ones sending Mirai to my Linux systems, which are getting updates (and also Mirai, although it won't run because it's the wrong architecture).
No malware? Only if you have your head in the sand.
I assume that comment was saying that they handle the update process and that their machines don't have any malware on them.
I ignored it because it was somewhat abusive and is missing the problem that automatic updates are trying to solve: that most people, but not all, don't do updates.
Wow, this hits close to home. Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009. I added a check on the driver initialization path and didn't annotate the code as non-paged because frankly I didn't know at the time that the Windows kernel was paged. All my kernel development experience up to that point was with Linux, which isn't paged.
BitLocker is a storage driver, so that code turned into a circular dependency: the attempt to page in the code resulted in a call into that not-yet-paged-in code.
The reason I didn't catch it with local testing was because I never tried rebooting with BitLocker enabled on my dev box when I was working on that code. For everyone on the team that did have BitLocker enabled they got the BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
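For readers who haven't touched the WDK, the mechanism being described looks roughly like the sketch below (function names are invented; this is not BitLocker code). Driver code that may be paged out is placed in a PAGE section and guarded with PAGED_CODE(); anything that can run on the paging I/O path has to stay resident.

    /* Hedged sketch of pageable vs. non-paged code in a Windows driver.
       All names are illustrative. */
    #include <ntddk.h>

    DRIVER_INITIALIZE DriverEntry;
    NTSTATUS CheckVolumeConfig(void);   /* hypothetical init-path helper */

    /* Move CheckVolumeConfig into a pageable section. Anything that can be
       reached at IRQL >= DISPATCH_LEVEL, or while servicing the disk I/O
       needed to page code back in, must NOT be marked this way. */
    #ifdef ALLOC_PRAGMA
    #pragma alloc_text(PAGE, CheckVolumeConfig)
    #endif

    NTSTATUS CheckVolumeConfig(void)
    {
        /* On checked builds, asserts we're at an IRQL where a page fault is
           legal. Reaching this code from a path that cannot take page faults
           is the circular dependency described above, and the kernel
           bugchecks (BSOD). */
        PAGED_CODE();
        return STATUS_SUCCESS;
    }

    NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        UNREFERENCED_PARAMETER(DriverObject);
        UNREFERENCED_PARAMETER(RegistryPath);
        return CheckVolumeConfig();
    }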
> without even the most basic level of qualification
That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.
Exactly this is what I was missing from the story. Why not have a limited set of users get it before going live for the whole user base? For a mission-critical product like this, skipping that is beyond the comprehension of anyone who has ever come across software bugs (so, billions of people). And that's before we even get to the part about not testing internally well, or at all. Some clusterfuck must have happened there, which is still better than imagining that this is the normal way the organization operates. That would be a very scary vision. Serious rethinking of trusting this organization is due everywhere!
The funniest part was seeing the Mercedes F1 team pit crew staring at BSODs at their workstations[1] while wearing CrowdStrike t-shirts. Some jokes just write themselves. Imagine if they lose the race because of their sponsor.
But hey, at least they actually dogfood the products of their sponsors instead of just taking money to shill random stuff.
Because CrowdStrike is an EDR solution it likely has tamper-proofing features (scheduled tasks, watchdog services, etc.) that re-enables it. These features are designed to prevent malware or manual attackers from disabling it.
These features drive me nuts because they prevent me, the computer owner/admin, from disabling them. One person thought up techniques like "let's make a scheduled task that sledgehammers back the knobs these 'dumb' users keep turning", and then everyone else decided to copycat that awful practice.
Are you saying that the compliance rule requires that the software can't be uninstalled? That once it's installed it's impossible to uninstall? No one can uninstall it? I have a hard time believing it's impossible to remove the software. In the extreme case, you could reimage the machine and reinstall Windows without Crowdstrike.
Or are you saying that it is possible to uninstall, but once you do that, you're not in compliance, so while it's technically possible to uninstall, you'll be breaking the rules if you do so?
The person I originally replied to, rkagerer, said there was some technical measure preventing rkagerer from uninstalling it even though rkagerer has admin on the computer.
I was referring to the difficulty overriding the various techniques certain modern software like this use to trigger automatic updates at times outside admin control.
Disabling a scheduled task is easy, but unfortunately vendors are piling on additional less obvious hooks. Eg. Dropbox recreates its scheduled task every time you (run? update?) it, and I've seen others that utilize the various autostart registry locations (there are lots of them) and non-obvious executables to perform similar "repair" operations. You wind up in "Deny Access" whackamole and even that isn't always effective. Uninstalling isn't an option if there's a business need for the software.
The fundamental issue is their developers / product managers have decided they know better than you. For the many users out there who are clueless to IT this may be accurate, but it's frustrating to me and probably others who upvoted the original comment.
Is what you're saying relevant in the Crowdstrike case? If you don't want Crowdstrike and you're an admin, I assume there are instructions that allow you to uninstall it. I assume the tamper-resistant features of Crowdstrike won't prevent you from uninstalling it.
It's currently a DOS by the crashing component, so it's already broken the Availability part of Confidentiality/Integrity/Availability that defines the goals of security.
But a loss of availability is so much more palatable than the others, plus the others often result in manually restricting availability anyway when discovered.
I think the wider societal impact from the loss of availability today - particularly for those in healthcare settings - might suggest this isn't always the case
What about the importance of data integrity? If important pre-op data/instructions are missing or get saved to the wrong patient record, causing botched surgeries; if there are misprescribed post-op medications; if there is huge confusion and delays in critical follow-up surgeries because of a 100% available system that messed up patient data across hospitals nationwide; if there are malpractice lawsuits putting entire hospitals out of business, etc. - then is that fallout clearly worth having an available system in the first place?
Huh? We're talking about hypotheticals here. You're saying availability is clearly more important than data integrity. I'm saying that if a buggy kernel loadable module allowed systems to keep on running as if nothing was wrong, but actually caused data integrity problems while the system is running, that's just as bad or worse.
If Linux and Windows have similar architectural flaws, Microsoft must have some massive execution problems. They are getting embarrassed in QA by a bunch of hobbyists, lol.
If you're planning around bugs in security modules, you're better off disabling them - malware routinely use bugs in drivers to escalate, so the bug you're allowing can make the escalation vector even more powerful as now it gets to Ring 0 early loading.
Isn't DoSing your own OS an attack vector? and a worse one when it's used in critical infrastructure where lives are at stake.
There is a reasonable balance to strike, sometimes it's not a good idea to go to extreme measures to prevent unlikely intrusion vectors due to the non-monetary costs.
In the absence of a Crowdstrike bug, if an attacker is able to cause Crowdstrike to trigger a bluescreen, I assume the attacker would be able to trigger a bluescreen in some other way. So I don't think this is a good argument for removing the check.
That assumes it's more likely than crowdstrike mass bricking all of these computers... this is the balance, it's not about possibility, it's about probability.
I use Explorer Patcher on a Windows 11 machine. It had such a history of crash loops with Explorer that they implemented this circuit-breaker functionality.
It's baffling how fast and wide the blast radius was for this Crowdstrike update. Quite impressive actually, if you think about it - updating billions of systems that quickly.
This was my first thought too. I'm not that familiar with the space, but I would think for something this sensitive the rollout would be staggered at least instead of what looks like globally all at the same time.
This is the bit I am still trying to understand. On CrowdStrike you can define how many updates a host is behind. I.e. n (latest), n-1 (one behind) or n-2 etc. This update was applied to a 'latest' policy hosts and the n-2 hosts. To me it appears that there was more to this than just a corrupt update, otherwise how was this policy ignored? Unless it doesn't separate the update as deeply and maybe just a small policy aspect, which would also be very concerning.
I guess we won't really know until they release the post mortem...
Yeah, my guess is that they roll out the updates to every client at the same time, and then have the client implement the n-1/2/whatever part locally. That worked great-ish until they pushed a corrupt (empty) update file which crashed the client when it tried to interpret the contents... Not ideal, and obviously there isn't enough internal testing before sending stuff out to actual clients.
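To make that guess concrete: if version pinning is enforced client-side, it reduces to a check like the toy sketch below. Everything here is invented for illustration; we don't know how Falcon actually implements its n/n-1/n-2 pinning, and by all appearances the content ("channel") files were delivered outside whatever mechanism does.

    /* Toy sketch of client-side "n-1 / n-2" gating. All names and
       semantics are hypothetical. */
    #include <stdint.h>
    #include <stdbool.h>

    /* policy_offset: 0 = take the latest build (n), 1 = stay one behind (n-1), ... */
    bool host_should_apply_build(uint32_t current_build,
                                 uint32_t latest_build,
                                 uint32_t candidate_build,
                                 uint32_t policy_offset)
    {
        if (candidate_build <= current_build)
            return false;                 /* already installed, or a rollback   */
        if (latest_build < policy_offset)
            return false;                 /* defensive: avoid unsigned underflow */
        return candidate_build <= latest_build - policy_offset;
    }

    /* The failure mode the commenters describe: content files seemingly
       pushed outside this mechanism never hit a check like this at all,
       so the n-1/n-2 setting offered no protection. */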
But do you ever get free worldwide advertising that everyone uses your product? CrowdStrike sure did, and I'm sure they'll use that to sell it to more people.
> It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
Discussed elsewhere, it is claimed that the file causing the crash was a data file that had been corrupted in the delivery process. So the development team and their CI probably tested a good version, but the customer received a bad one.
If that is true, the problem is that the driver uses an unsigned file at all, so all customer machines are continuously at risk for local attacks. And then it does not do any integrity check on the data it contains, which is a big no no for all untrusted data, whether user space or kernel.
> And then it does not do any integrity check on the data it contains, which is a big no no for all untrusted data, whether user space or kernel.
To me, this is the inexcusable sin. These updates should be signed and signatures validated before the file is read. Ideally the signing/validating would be handled before distribution so that when this file was corrupted, the validation would have failed here.
But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
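A sanity check along those lines is only a few dozen lines. The sketch below is purely illustrative, not CrowdStrike's format: the magic value, header layout, and crc32_of() helper are all invented, and signature verification would sit in front of this, ideally before the file ever reaches the driver.

    /* Hedged sketch: reject an update/content file before interpreting it. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>
    #include <string.h>

    #define UPDATE_MAGIC "CFG1"            /* hypothetical 4-byte magic header */

    struct update_header {
        char     magic[4];    /* must equal UPDATE_MAGIC            */
        uint32_t version;     /* format version we know how to read */
        uint32_t body_len;    /* bytes that follow this header      */
        uint32_t body_crc32;  /* checksum of the body               */
    };

    uint32_t crc32_of(const uint8_t *buf, size_t len);  /* assumed provided */

    bool update_file_is_sane(const uint8_t *buf, size_t len)
    {
        struct update_header h;

        if (buf == NULL || len < sizeof h)
            return false;                         /* empty or truncated file */
        memcpy(&h, buf, sizeof h);
        if (memcmp(h.magic, UPDATE_MAGIC, 4) != 0)
            return false;                         /* wrong or zeroed-out magic */
        if (h.version != 1)
            return false;                         /* format we don't understand */
        if (h.body_len != len - sizeof h)
            return false;                         /* length mismatch */
        if (crc32_of(buf + sizeof h, h.body_len) != h.body_crc32)
            return false;                         /* corrupted in transit */
        return true;   /* only now hand the body to the parser */
    }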
the flawed data was added in a post-processing step of the configuration update, which is after it's been tested internally but before it's copied to their update servers
"And they promise fast threat mitigation... Let allow them to take over EVERYTHING! With remote access, of course. Some form of overwatch of what they in/out by our staff ? Meh...
And it even allow us to do cuts in headcount and infra by $<digits_here> a year."
> I didn't know at the time that the Windows kernel was paged.
At uni I had a professor in database systems who did not like written exams and mostly did oral exams. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. In my explanation I distinguished between kernel space and user space; I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never pages its own memory. I did not argue the point and passed the exam with the best grade. I never checked that book again to verify my claim, and I have never done any kernel-space development even vaguely close to memory management, so still today I don't know the exact details.
However, what strikes me here: when that exam happened, in 1985ish, the NT kernel did not exist yet, I believe. However, IIRC a significant part of the DEC VMS kernel team went to Microsoft to work on the NT kernel. So the concept of paging (a part of) kernel memory went with them? Whether VMS --> WNT, every letter increased by one, is just a coincidence or intentionally the next baby of those developers, I have never understood. As Linux has shown us, today much bigger systems can be successfully handled without the extra complication of paging kernel memory. Whether it's a good idea I don't know; at least it's not a necessary one.
The VMS --> WNT acronym relationship was not mentioned, maybe it was just made up later.
One thing I did not know (or maybe did not remember) is that NT was originally developed exclusively for the Intel i860, one of Intel's attempts to do RISC. Of course in the late 1980s CISC seemed doomed and everyone was moving to RISC. The code name of the i860 was N10. So that might well be the inside origin of "NT", with the marketing name "New Technology" retrofitted only later.
"New Technology", if you want to search the transcript. Per Dave, marketing did not want to use "NT" for "New Technology" because they thought no one would buy new technology.
Actually it was not only x86 hardware that was not really planned for the NT kernel, also Windows user space was not the first candidate. Posix and maybe even OS/2 were earlier goals.
So the current x86 Windows monoculture came about as an accident, because the strategically planned new options did not materialize. The user-space change should finally debunk the theory that VMS advancing into WNT was a secret plot by the engineers involved. It was probably a coincidence discovered after the fact.
"Perhaps the worst thing about being a systems person is that
other, non-systems people think that they understand the daily
tragedies that compose your life. For example, a few weeks ago,
I was debugging a new network file system that my research
group created. The bug was inside a kernel-mode component,
so my machines were crashing in spectacular and vindic-
tive ways. After a few days of manually rebooting servers, I
had transformed into a shambling, broken man, kind of like a
computer scientist version of Saddam Hussein when he was
pulled from his bunker, all scraggly beard and dead eyes and
florid, nonsensical ramblings about semi-imagined enemies.
As I paced the hallways, muttering Nixonian rants about my
code, one of my colleagues from the HCI group asked me what
my problem was. I described the bug, which involved concur-
rent threads and corrupted state and asynchronous message
delivery across multiple machines, and my coworker said,
“Yeah, that sounds bad. Have you checked the log files for
errors?” I said, “Indeed, I would do that if I hadn’t broken every
component that a logging system needs to log data. I have a
network file system, and I have broken the network, and I have
broken the file system, and my machines crash when I make
eye contact with them. I HAVE NO TOOLS BECAUSE I’VE
DESTROYED MY TOOLS WITH MY TOOLS. My only logging
option is to hire monks to transcribe the subjective experience
of watching my machines die as I weep tears of blood.”
Ah, the joys of trying to come up with creative ways to get feedback from your code when literally nothing is available. Can I make the beeper beep in morse code? Can I just put a variable delay in the code and time it with a stopwatch to know which value was returned from that function? Ughh.
Some of us have worked on embedded systems or board bringup. Scope and logic analyzer ... Serial port a luxury.
IIRC Windows has good support for debugging device drivers via the serial port. Overall the tooling for dealing with device drivers in windows is not bad including some special purpose static analysis tool and some pretty good testing.
Yeah. Been there, done that. Write to an unused address decode to trigger the logic analyzer when I got to a specific point in the code, so I could scroll back through the address bus and figure out what the program counter had done for me to get to that piece of code.
This is an interesting piece of creative writing, but virtual machines already existed in 2013. There are very few reasons to experiment on your dev machine.
At the time, Mickens worked at Microsoft Research, and with the Windows kernel development team. There may only be a few reasons to experiment on your dev machine, but that's one environment where they have those reasons.
>Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009.
Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.
Win8. I've been seeing your blog posts show up here and there on HN over the years, so I was half expecting you to pick up on my self-doxx. I'll ping you offline.
"It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification."
It was my understanding that MS now sign 3rd party kernel mode code, with quality requirements. In which case why did they fail to prevent this?
Drivers have had to be signed forever and pass pretty rigorous test suites and static analysis.
The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...
How so? Preventing roll-backs on software updates is a "security feature" in most cases, for better and for worse. Yeah, it would be convenient for tinkerers or in rare events such as these, but it would be a security issue the other 99.9...99% of the time for enterprise users, where security is the main concern.
I don't really understand this, many Linux distributions like Universal Blue advertise rollbacks as a feature. How is preventing a roll-back a "security feature"?
Imagine a driver has an exploitable vulnerability that is fixed in an update. If an attacker can force a rollback to the vulnerable older version, then the system is still vulnerable. Disallowing the rollback fixes this.
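A bare-bones sketch of that ratchet (names are invented; real implementations anchor the installed-version state in a TPM, fuse bank, or signed metadata so it can't simply be edited):

    /* Hedged sketch of anti-rollback: verify the signature, then refuse
       anything at or below the version already installed. */
    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed helpers, backed by tamper-resistant storage. */
    uint64_t read_installed_version(void);
    void     store_installed_version(uint64_t v);
    bool     signature_is_valid(const void *pkg, uint64_t pkg_len);

    bool accept_update(const void *pkg, uint64_t pkg_len, uint64_t pkg_version)
    {
        if (!signature_is_valid(pkg, pkg_len))
            return false;                          /* not from the vendor */
        if (pkg_version <= read_installed_version())
            return false;                          /* rollback to a known-vulnerable
                                                      (or identical) version */
        store_installed_version(pkg_version);      /* ratchet forward only */
        return true;
    }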
This is what I don't get: it's extremely hard for me to believe this didn't get caught in CI when things started blue screening. Everywhere I've ever worked, test rebooting/power-cycling was part of CI, with various hardware configs. This was before even our lighthouse customers saw it.
Apparently the flaw was added to the config file in post-processing after it had completed testing. So they thought they had testing, but actually didn't.
I was thinking, this doesn't seem like it's a case of all these machines still being on an old version of Windows, or some specific version, that is having issues, where QA just missed one particular variant in their smoke testing. It seems like it's every Windows instance with that software, so either they don't have basic automated testing, or someone pushed this outside of a normal process.
> Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
Did I mention this was 15 years ago? Software development back then looked very different than it does now, especially in Wincore. There was none of this "Cloud-native development" stuff that we all know and love today. GitHub was just about 1 year old. Jenkins wouldn't be a thing for another 2 years.
In this case the "automated test" flipped all kinds of configuration options with repeated reboots of a physical workstation. It took hours to run the tests, and your workstation would be constantly rebooting, so you wouldn't be accomplishing anything else for the rest of the day. It was faster and cheaper to require 8 devs to rollback to yesterday's build maybe once every couple of quarters than to snarl the whole development process with that.
The tests still ran, but they were owned and run by a dedicated test engineer prior to merging the branch up.
Oh I rebooted, I just didn't happen to have the right configuration options to invoke the failure when I rebooted. Not every dev workstation was bluescreening, just the ones with the particular feature enabled.
But as someone already pointed out, the issue was seen on all kinds of Windows hosts, not just the ones running a specific version, specific update, etc.
There's "something that requires highly specific conditions managed to slip past QA" and then there's "our update brought down literally everyone using the software". This isn't a matter of bad luck.
The memory used by the Windows kernel is either Paged or Non-Paged. Non-Paged means the memory is pinned in physical RAM. Paged means it might be swapped out to disk and paged back in when needed. OP was working on BitLocker, a storage/filesystem driver, which handles disk IO. It must be pinned in physical RAM to be available at all times; otherwise, if it's paged out, an incoming IO request would find the driver code missing from memory and try to page in the driver code, which triggers another IO request, creating an infinite loop. The Windows kernel will usually crash at that point to prevent a runaway system, stopping at the point of failure to let you fix the problem.
Linux is a bit unusual in that kernel memory is generally physically mapped, and unless you use vmalloc, any memory you allocate has to correspond to pages backed by RAM. This also ties into how file IO happens, swapping, and how Linux's approach to IO is actually closer to Multics and OS/400 than OG Unix.
Many other systems instead default to using full power of virtual memory including swapping kernel space to disk, with only things explicitly need to be kept in ram being allocated from "non-paged" or "wired" memory.
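On the Windows side the choice is made per allocation as well as per code section; a minimal sketch with the current WDK pool API (tag and sizes are arbitrary, older code uses ExAllocatePoolWithTag instead):

    /* Hedged sketch: a driver chooses per allocation whether the memory
       may be paged out. */
    #include <ntddk.h>

    #define EXAMPLE_TAG 'pmxE'   /* illustrative pool tag */

    void pool_example(void)
    {
        /* Safe to touch at any IRQL / on the paging path: pinned in RAM. */
        void *always_resident = ExAllocatePool2(POOL_FLAG_NON_PAGED, 4096, EXAMPLE_TAG);

        /* May be paged out; touching it at DISPATCH_LEVEL, or from code
           that services paging I/O, risks exactly the crash discussed above. */
        void *pageable = ExAllocatePool2(POOL_FLAG_PAGED, 4096, EXAMPLE_TAG);

        if (always_resident) ExFreePoolWithTag(always_resident, EXAMPLE_TAG);
        if (pageable)        ExFreePoolWithTag(pageable, EXAMPLE_TAG);
    }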
Must have been DNS... when they did the deployment run and the necessary code was pulled and the DNS failed and then the wrong code got compiled...</sarcasm>
that they don't even do staged/A-B pushes was also <mind-blown-away>
Some Canonical guy, I think, mentioned this as their sales strategy a few years ago, after a particularly nasty Windows outage:
We don't ask customers to switch all systems from Windows to Ubuntu, but to consider moving maybe a third to Ubuntu so they won't sit completely helpless next time Windows fail spectacularly.
While I see more and more Ubuntu systems, and recently have even spotted Landscape in the wild I don't think they were as successful as they hoped with that strategy.
That said, maybe there is a silver lining on todays clouds both WRT Ubuntu and Linux in general, and also WRT IT departments stopping to reconsider some security best practices.
Except further up this thread another poster mentions that CrowdStrike took down their Debian servers back in April as well. As soon as you're injecting third party software into your critical path with self-triggered updates you're vulnerable to the quality (or lack of) that software despite platform.
Honestly your comment highlights one of the few defenses... don't sit all on one platform.
Sure, but note the sales pitch was to encourage resiliency through diversity. While that may not be helpful in cases where one vendor may push the same breaking change through to multiple platforms, it also may be helpful. I remember doing some work with a mathematics package under Solaris while in university, while my peers were using the same package under Windows. Both had the same issue, but the behaviour was different. Under Solaris, it was possible to diagnose since the application crashed with useful diagnostic information. Under Windows, it was impossible to diagnose since it took out the operating system and (because of that) it was unable to provide diagnostic information. (It's worth noting that I've seen the opposite happen as well, so this isn't meant to belittle Windows.)
Yes, I already heard one manager at my company today say they're getting a mac for their next computer. That's great, the whole management team shouldn't be on Windows. The engineering team is already pretty diversified between mac, windows, and linux. The next one might take down all 3 but at least we tried to diversify the risk.
Yep, these episodes are the banana monoculture [0] applied to IT. The solution isn't to use this vendor or avoid that vendor, it's to diversify your systems such that you can have partial operability even if one major component is down.
Debian has automatic updates but they can be manual as well. That's not the case in Windows.
The best practice for security-critical infrastructure where people's lives are at stake is to install some version of BSD stripped down to its bare minimum. But then the company has to pay for much more expensive admins. Windows admins are much cheaper and more plentiful.
Also, as a user of Ubuntu and Debian for more than a decade, I have a hunch that this will not happen in India [1].
well, in another sense, Windows is certainly to blame partially. Several technical solutions have been put forward here and in other places, that would've at least limited the blast radius of a faulty update/driver/critical path.
Windows didn't implement any of those. Presumably by choice and for good reasons: A tradeoff would be that software like crowdstrike is more limited in protecting you. So the Windows devs deliberately opted for this risk.
Yeah, I see a lot of noise on social media blaming this on Microsoft/Windows... but AFAIK if you install a bad kernel driver into any major OS the result would be the same.
The specific of this CrowdStrike kernel driver (which AFAIK is intended to intercept and log/deny syscalls depending on threat assessment?) means that this is badnewsbears no matter which platform you're on.
Like sure, if an OS is vulnerable to kernel panics from code in userland, that's on the OS vendor, but this level of danger is intrinsic to kernel drivers!
It's interesting to me that lay people are asking the right questions, but many in the industry, such as the parent here, seem to just accept the status quo. If you want to be part of the solution, you have to admit there is a problem.
Apple deprecated kernel extensions with 10.15 in order to improve reliability and eventually added a requirement that end users must disable SIP in order to install kexts. Security vendors moved to leverage the endpoint security framework and related APIs.
On Linux, ebpf provides an alternative, and I assume, plenty of advantages over trying to maintain kernel level extensions.
I haven’t researched, but my guess is that Microsoft hasn’t produced a suitable alternative for Windows security vendors.
> My prior on competence of "cybersecurity" companies is very, very low.
Dmitri Alperovitch agrees with you.[0] He went on record a few months back in a podcast, and said that some of the most atrocious code he has ever seen was in security products.
I am certain he was implicitly referring, at least in part, to some of the code seen inside his past company's own code base.
Yeah, I think your point is totally valid. Why does CrowdStrike need syscall access on Windows when it doesn't need it elsewhere?
I do think there's an argument to be made that CrowdStrike is more invasive on Windows because Windows is intrinsically less secure. If this is true then yeah, MSFT has blame to share here.
I don't know about MacOS, but at least as recently as a couple years ago crowdstrike did ship a Linux kernel module. People were always complaining about the fact that it advertised the licensing as GPL and refused to distribute source.
I imagine they've simply moved to eBPF if they're not shipping the kernel module anymore.
I haven't looked too deeply into how EDRs are implemented on Linux and macOS, but I'd wager that CrowdStrike goes the way of its own bit of code in kernel space to overcome shortcomings in how ETW telemetry works. It was never meant for security applications; ETW's purpose was to aid in software diagnostics.
In particular, while it looks like macOS's Endpoint Security API[0] and Linux 4.x's inclusion of eBPF are both reasonably robust (if the literature I'm skimming is to be believed), ETW is still pretty susceptible to blinding attacks.
(But what about PatchGuard? Well, as it turns out, that doesn't seem to keep someone from loading their own driver and monkey patching whatever WMI_LOGGER_CONTEXT structures they can find in order to call ControlTraceW() with ControlCode = EVENT_TRACE_CONTROL_STOP against them.)
Linux and open source also have the potential to be far more modular than Windows is. At the moment we have airport display boards running a full windows stack including anti-virus/spyware/audit etc, just to display a table ... madness
I'm a Kubuntu user that, seemingly due to Canonical's decision to ship untested software regularly, has been repeatedly hit by problems with snaps - initially basic, obvious, and widespread issues with major software.
Yes, distribute your eggs, but check the handles on the baskets being sold to you by the guy pointing out bad handles.
I’ll never forgive them for the spyware they defaulted to on in their desktop stuff. It wasn’t the worst thing in the world, but they’re also the only major distro to ever do it, so Ubuntu (and Canonical as a whole) can get fucked, imo.
i started with RH (Non-EL) back in the mid-to-late 90s, and switched to gentoo as soon as one of my best (programmer) friends gushed about how much better of an admin it had made them[0], so i started down that path - by the time AWS appeared, we were both automating everything, using build (pump) servers, etc. I like debian, a lot - really! I think apt is about the best non-technical-user package manager, and the packages that were available without having to futz with keyrings was great.
Ubuntu spent a lot of time, talent, and treasure on trying to migrate people off windows instead of being a consistent, great OS. It is still with great dread that i open docs for some new package/program linked to from HN or elsewhere; dread that the first instruction is "start with ubuntu 18.04|20.04".
[0] They actually maintained the unofficial gentoo AWS images for over a decade. unsure if they still do, it could be automated to run a new build off every quarter. https://github.com/genewitch/gentoo/blob/master/gentoo_auto.... (a really old version of the script i keep to remind me that automation is possible with nearly everything...)
canonical has some of the most ridiculous IT job postings i've come across. just sounds like a bananas software shop. didn't give me much confidence in whatever they're cooking up in there
Sure but if that Canonical sales person was successful in that, I'd almost guarantee that after they switched the first third they'd be in there arguing to switch out the rest.
Canonical in particular are no better, they do the exact same thing with that aberration called snap. They have brought entire clusters down before with automatic updates.
Yes, but it's not included in the upstream Ubuntu security repository. In fact, it's not available via any repository AFAIK. It updates itself via fetching new versions from CrowdStrike backend according to your update policy for the host in question. However, as we've learned the past days, that policy does not apply to the "update channel" files...
things are so interdependent that in this scenario you might now just end up crashing the system if either Windows or Ubuntu are down instead of just the one of them you chose
I dunno. The stock price will probably dead cat bounce, but this is the sort of thing that causes companies to spiral eventually.
They just made thousands of IT people physically visit machines to fix them. Then all the other IT people watched that happen globally. CTOs got angry emails from other C-levels and VPs. Real money was lost. Nobody is recommending this company for a while.
I have a feeling that Microsoft's PR team will be able to navigate this successfully and Microsoft might even benefit from this incident as it tries to pull customers away from CrowdStrike Falcon and into its own EDR product -- Microsoft Defender for Endpoint.
My (very unprofessional) guess here is that investors in the near term will discount the company too heavily and the previously overvalued stock will blow past a realistic valuation and be priced too low for a little while. The software and company aren't going anywhere as far as I can tell, they have far too much marketshare and use of CrowdStrike is often a contractual obligation.
That said, I don't gamble against trading algorithms these days and am only guessing at what I think will happen. Anyone passing by, please don't take random online posts as financial advice.
Edited to add: I wonder what the economic fallout from this will be? 10x his monetary worth? 100x? (not trying to put a price on the people who will die because of the outage; for that he and everyone involved needs to go to jail)
He will be the guy that convinced the investors and stakeholders to pour more money into the company despite some world-wide incident.
He deserves at least 3x the pay.
PS: look at the stocks! They sank, and now they are gaining value again. People can't work, people die, flights get delayed/canceled because of their software.
If you invest based on fundamentals and company finances, you probably haven't had many chances to buy any positions in the last decade. Stock prices are completely unhinged from company financial reports.
"This is just a demonstration about how critical our software is and how undervalued we are. If the whole world economy requires us to run, we should be worth more" /s
>I run a software company that deploys antivirus software automatically to customers' computers and servers. What should I be aware of as part of this and what actions should I ensure occur?
I gave it this question, and the steps it suggested included deploying outside peak hours to avoid potential disruptions (such as on the weekend for businesses); including a way to detect deployment issues and easily roll back the software in case there is a problem; and testing the deployment in a controlled environment before rolling it out to all customers (including doing a small rollout before a larger one). All of these actions are very standard, and all would have greatly helped in this situation.
Crowdstrike (Falcon) has all possible compliance certifications (they are PCI, FedRAMP, ISO-27001, etc.), and these often include following standard practices of SDLC, such as testing, deploying, incident management, etc. And they are much more detailed than the answer from AI.
My suspicion, as someone pointed out, is that they deployed this update as a small patch, maybe bypassing some controls and cutting some corners. I really wouldn't blame it on lack of automated tests, etc. Only crowdstrike people know how this thing possibly made it to production.
ChatGPT
Obtaining relevant certifications can enhance the credibility of your antivirus software and reassure customers about its security, reliability, and compliance with industry standards. Here are some key certifications you should consider:
1. Security Certifications:
ISO/IEC 27001: This certification specifies the requirements for establishing, implementing, maintaining, and continually improving an information security management system (ISMS). It demonstrates your commitment to managing and protecting sensitive information.
ISO/IEC 15408 (Common Criteria): This certification provides assurance that the security features of your software have been rigorously tested and meet internationally recognized standards.
2. Quality Management Certifications:
ISO 9001: This certification focuses on quality management systems and ensures that your processes consistently provide products that meet customer and regulatory requirements.
ISO/IEC 25010: This standard specifies a model for evaluating software quality, ensuring that your software meets high standards of functionality, reliability, usability, efficiency, maintainability, and portability.
3. Data Privacy Certifications:
GDPR Compliance: If you operate in or serve customers in the European Union, demonstrating compliance with the General Data Protection Regulation (GDPR) is essential.
CCPA Compliance: For customers in California, compliance with the California Consumer Privacy Act (CCPA) is important.
4. Industry-Specific Certifications:
HIPAA Compliance: If your software handles protected health information (PHI) in the United States, ensuring compliance with the Health Insurance Portability and Accountability Act (HIPAA) is crucial.
PCI DSS: If your software processes payment card information, compliance with the Payment Card Industry Data Security Standard (PCI DSS) is necessary.
5. Product-Specific Certifications:
ICSA Labs Certification: ICSA Labs, an independent division of Verizon, provides testing and certification for security products, including antivirus software. Certification from ICSA Labs can validate the effectiveness of your software.
AV-TEST Certification: The AV-TEST Institute is an independent research institute that evaluates and certifies antivirus software for its protection, performance, and usability. An AV-TEST certification can enhance your product’s credibility.
VB100 Certification: Virus Bulletin (VB) offers the VB100 certification, which is awarded to antivirus products that meet specific criteria for detecting known malware without generating false positives.
OPSWAT Certification: OPSWAT offers certifications that validate the interoperability and effectiveness of cybersecurity products, including antivirus software.
6. Environmental and Occupational Health Certifications:
ISO 14001: This certification demonstrates your commitment to environmental management, which can be important for corporate social responsibility (CSR) and sustainability.
ISO 45001: This certification focuses on occupational health and safety management systems, ensuring a safe and healthy workplace for your employees.
Kurtz's response on X, blaming the customer, is ridiculous. He will probably find another company to hire him as CEO, though. It's just an upside-down world in the C-suite.
> One mistake doesn’t mean they are completely incompetent.
They are completely incompetent because for something as critical as crowdstrike code, you must build so many layers of validation that one, two or three mistakes don't matter because they will be caught before the code ends up in a customer system.
Looks like they have so little validation that one mistake (which is by itself totally normal) can end up bricking large parts of the economy without ever being caught. Which is neither normal nor competent.
Except this isn’t one mistake. Writing buggy code is a mistake. Not catching it in testing, QA, dogfooding or incremental rollouts is a complete institutional failure
Reminds me of Phil Harrison, who always seems to find himself in an executive position botching launches of new video game platforms: PlayStation 3, Xbox One, Google Stadia.
I didn't understand why, in 2010, it didn't seem to make much news…
Took out the entire company where I worked.
People thought it was a worm/virus — few minutes after plugging in laptop, McAfee got the DAT update, quarantined the file; which caused Windows to start countdown+reboot (leading to endless BSODs).
Yet another successful loser who somehow continues to ascend the corporate ranks despite poor company performance. It just shows how disconnected job performance is from C-suite peer review, a glorified popularity contest. Should add the Unity and Better.com folks here.
“There's an old saying in Tennessee — I know it's in Texas, probably in Tennessee — that says, fool me once, shame on — shame on you. Fool me — you can't get fooled again.”
This event is predicted in Sidney Dekker's book "Drift into Failure", which basically postulates that in order to prevent local failure we set up failure-prevention systems that increase the complexity beyond our ability to handle, and introduce systemic failures that are global. It's a sobering book to read if you ever thought we could make systems fault tolerant.
We need more local expertise is really the only answer. Any organization that just outsources everything is prone to this. Not that organizations that don't outsource aren't prone to other things, but at least their failures will be asynchronous.
Funny thing is that for decades there were predictions about a need for millions more IT workers. It was assumed companies would need local knowledge. Instead, what we got was more and more outsourced systems and centralized services. Today this is one of the many downsides.
The problem here would be that there aren't enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with a comparable level of expertise would likely be much more expensive. So companies who do their own IT would be routinely outcompeted by ones that outsource, only for the latter to get into trouble when the black swan swoops in. The problem is that all other kinds of companies are mostly extinct by then, unless their investors had superhuman foresight and the discipline to invest for years into something that, year after year, looks like losing money.
> The problem here would be that there aren't enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with a comparable level of expertise would likely be much more expensive.
Is that because of economies of scale or because the vendor is just cutting costs while hiding their negligence?
I don't understand how a single vendor was able to deploy an update to all of these systems virtually simultaneously, and _that_ wasn't identified as a risk. This smells of mindless box checking rather than sincere risk assessment and security auditing.
Kinda both, I think, with the principal-agent problem added in. If you've found a formula that provides the client with an acceptable CYA picture, it is very scalable. The model of an "IT person knowledgeable in both security, modern threats, and the company's business" is not very scalable. The former, as we now know, is prone to catastrophic failures, but those are rare enough that a particular decision-maker won't be bothered by them.
Depressing thought that this phenomenon is some kind of Nash equilibrium: in the space of competition between firms, the equilibrium is to outsource IT labor, saving on IT costs and passing those savings on to whatever service they are providing. Firms that outsource out-compete their competition while exposing their services to black-swan catastrophic risk.
Is regulation the only way out of this, from a game theory perspective?
The whole market in which crowdstrike can exist is a result of regulation, albeit bad regulation.
And since the returns of selling endpoint protection are increasing with volume, the market can, over time, only be an oligopoly or monopoly.
It is a screwed market with artificially increased demand.
Also the outsourcing is not only about cost and compliance. There is at least a third force. In a situation like this, no CTO who bought crowdstrike products will be blamed. He did what was considered best industry practice (box ticking approach to security). From their perspective it is risk mitigation.
In theory, since most security incidents (not this one) involve the loss of personal customer data, if end customers were willing to pay a premium for proper handling of their data, AND if firms that don't outsource and instead pay for competent administrators within their hierarchy had a means of signaling that, the equilibrium could be pushed to where you would like it to be.
Those are two very questionable ifs.
Also how do you recognise a competent administrator (even IT companies have problems with that), and how many are available in your area (you want them to live in the vicinity) even if you are willing to pay them like the most senior devs?
If you want to regulate the problem away, a lot of influencing factors have to be considered.
Also a major point in the Black Swan. In the Black Swan, Taleb describes that it is better for banks to fail more often than for them to be protected from any adversity. Eventually they will become "too big to fail". If something is too big to fail, you are fragile to a catastrophic failure.
I was wondering when someone would bring up Taleb RE: this incident.
I know you aren't saying it is, but I think Taleb would argue that this incident, as he did with the coronavirus pandemic for example, isn't even a Black Swan event. It was extremely easy to predict, and you had a large number of experts warning people about it for years but being ignored. A Black Swan is unpredictable and unexpected, not something totally predictable that you decided not to prepare for anyways.
That is interesting, where does he talk about this? I'm curious to hear his reasoning. What I remember from the Black Swan is that Black Swan events are (1) rare, (2) have a non-linear/massive impact, (3) and easy to predict retrospectively. That is, a lot of people will say "of course that happened" after the fact but were never too concerned about it beforehand.
Apart from a few doomsayers, I am not aware of anybody who was warning us about a CrowdStrike type of event. I do not know much about public health, but it was my understanding that there were playbooks for an epidemic.
Even if we had a proper playbook (and we likely do), the failure is so distributed that one would need a lot of books and a lot of incident commanders to fix the problem. We are dead in the water.
I think it was "predicted" by Sunburst, the Solarwinds hack.
I don't think centrally distributed anti-virus software is the only way to maintain reliability. Instead, I'd say companies tend to centralize things like administration because it's cost-effective and because they aren't actually concerned about a global outage like this.
JM Keynes said "A ‘sound’ banker, alas! is not one who foresees danger and avoids it, but one who, when he is ruined, is ruined in a conventional and orthodox way along with his fellows, so that no one can really blame him." and the same goes for corporate IT.
Many systems are fault tolerant, and many systems can be made fault tolerant. But once you drift into a level of complexity spawned by many levels of dependencies, it definitely becomes more difficult for system A to understand the threats from system B and so on.
Do you know of any fault-tolerant system? Asking because in all the cases I know, when we make a system "fault tolerant" we increase its complexity and introduce new systemic failure modes related to the machinery that makes it fault tolerant, making it effectively non-fault-tolerant.
In all the cases I know, we traded frequent and localized failure for infrequent but globalized catastrophic failures. Like in this case.
You can make a system tolerant to certain faults. Other faults are left "untolerated".
A system that can tolerate anything, and so has perfect availability, seems clearly impossible. So yeah, totally right, it's always a tradeoff. That's reasonable, as long as you trade smart.
I wonder if the people deciding to install Crowdstrike are aware of this. If they traded intentionally, and this is something they accepted, I guess it's fine. If not... I further wonder if they will change anything in the aftermath.
There will be lawsuits, there will be negotiations for better contracts, and likely there will be processes put in place to make it look like something was done at a deeper level. And yet this will happen again next year or the year after, at another company. I would be surprised if there was a risk assessment for the software that is supposed to be the answer to the risk assessment in the first place. Will be interesting to see what happens once the dust settles.
- This is system has a single point of failure, it is not fault tolerant. Lets introduce these three things to make it fault-tolerant
- Now you have three single points of failure...
It really depends on the size of the system and the definition of fault tolerance. If I have a website calling out to 10 APIs and one API failure takes down the site, that is not fault tolerance. If that 1 API failure gets caught and the rest operate as normal, that is fault tolerance, but 10% of the system is down. If you go to almost any site and open the dev console, you'll see errors coming from parts of the system, that is fault tolerance. Any twin engine airplane is fault tolerant...until both engines fail. I would say the solar system is fault tolerant, the universe even moreso if you consider it a system.
tldr there are levels to fault tolerance and I understand what you are saying. I am not sure if you are advocating for getting rid of fault handling, but generally you can mitigate the big scary monsters and what is left is the really edge case issues, and there really is no stopping one of those from time to time given we live in a world where anything can happen at anytime.
This instance really seems like a human related error around deployment standards...and humans will always make mistakes.
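To make the "1 of 10 APIs fails but the page still renders" idea concrete, here's a toy sketch of catching a single dependency's failure and degrading gracefully; the widget endpoints are made up:

    import concurrent.futures
    import requests

    WIDGET_APIS = {                        # hypothetical downstream services
        "profile": "https://api.example.com/profile",
        "feed": "https://api.example.com/feed",
        "ads": "https://api.example.com/ads",
    }

    def fetch_widget(name, url):
        try:
            return name, requests.get(url, timeout=2).json()
        except Exception:
            return name, None              # tolerate this fault; the widget just renders empty

    def render_page():
        with concurrent.futures.ThreadPoolExecutor() as pool:
            results = dict(pool.map(lambda kv: fetch_widget(*kv), WIDGET_APIS.items()))
        # the page still renders even if some widgets came back as None
        return {k: v for k, v in results.items() if v is not None}

One dead API means 10% of the page is missing; an uncaught exception means 100% of it is.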
well, you usually put a load balancer and multiple instances of your service to handle individual server failures. In a basic no-lb case, your single server fails, you restart it and move on (local failure). In a load balancer case, your lb introduces its own global risks e.g. the load balancer can itself fail, which you can restart, but the load balancer can have a bug and stop handling sticky sessions when your servers are relying on it, and now you have a much harder to track brown-out event that is affecting every one of your users for a longer time, it's hard to diagnose, might end up with hard to fix data issues and transactions, and restarting the whole might not be enough.
So yeah, there is no fault tolerance if the timeframe is large enough, there are just less events, with much higher costs. It's a tradeoff.
The cynic in me thinks that the one advantage of these complex CYA systems is that when systems fail catastrophically, like CrowdStrike did, we can all "outsource" the blame to them.
It's also in line with arguments made by Ted Kaczynski (the Unabomber)
> Why must everything collapse? Because, [Kaczynski] says, natural-selection-like competition only works when competing entities have scales of transport and talk that are much less than the scale of the entire system within which they compete. That is, things can work fine when bacteria who each move and talk across only meters compete across an entire planet. The failure of one bacteria doesn’t then threaten the planet. But when competing systems become complex and coupled on global scales, then there are always only a few such systems that matter, and breakdowns often have global scopes.
crazy how much he was right. if he hadn't gone down the path of violence out of self-loathing and anger he might have lived to see a huge audience and following.
I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.
There was a quote last year during the "Twitter files" hearing, something like, "it is axiomatic that the government cannot do indirectly what it is prohibited from doing directly".
Perhaps ironically, I had a difficult time using Google to find the exact wording of the quote or its source. The only verbatim result was from a NYPost article about the hearing.
>I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.
Be realistic, none his ideas would be blacklisted. They sound good on paper, but the instant it's time for everyone to return to mudhuts and farming, 99% of people will return to Playstations and ACs.
He wasn't "silenced" because the government was out to get him, no one talks about his ideas because they are just bad. Most people will give up on ecofascism once you tell them that you won't be able to eat strawberries out of season.
"would be blacklisted, deplatformed, or deamplified by consolidated authorities"
Sorry. Not true. You have Black Swan (Taleb) and Drift into Failure (Dekker) among many other books. These ideas are very well known to anyone who makes the effort.
The only thing that got the Unabomber blacklisted is that he started sending bombs to people. His manifesto was a dime a dozen; half the time you can expect a politician boosting such stuff for temporary polling wins.
Hell, if we take his alleged cousins (I haven't vetted the genealogy tree), his body count isn't even that impressive.
I think a surprising amount of people already share this view, even if they don't go into extensive treatment with references like Dekker presumably does (I haven't read it).
I suspect most people in power just don't subscribe to that, which is precisely why it's systemic to see the engineer shouting "no!" when John CEO says "we're doing it anyway." I'm not sure this is something you can just teach, because the audience definitely has reservations about adopting it.
You can't prevent failure. You can only mitigate the impact. Biology has pretty good answers as to how to achieve this without having to increase complexity as a result; in fact, it often shows that simpler systems increase resiliency.
Something we used to understand until OS vendors became publicly traded companies and "important to national security" somehow.
> if you ever thought we could make systems fault tolerant
The only possible way to fault tolerance is simplicity, and then more simplicity.
Things like CrowdStrike have the opposite approach. They add a lot of fragile complexity attempting to catch problems, but introduce more attack surface than they can remove. This will never succeed.
As an architect of secure, real-time systems, the hardest lesson I had to learn is there's no such thing as a secure, real-time system in the absolute sense. Don't tell my boss.
I haven't read it, but I'd take a leap to presume it's somewhere between the people that say "C is unsafe" and "some other language takes care of all of things".
The thing that amazes me is how they've rolled out such a buggy change at such a scale. I would assume that for such critical systems, there would be a gradual rollout policy, so that not everything goes down at once.
Lack of gradual, health-mediated rollout is absolutely the core issue here. False-positive signatures, crash-inducing blocks, etc. will always slip through testing at some rate, no matter how good testing is. The necessary defense in depth here is to roll out ALL changes (binaries, policies, etc.) in a staggered fashion with some kind of health check in between (did > 10% of the endpoints the change went to go down and stay down right after it was pushed?).
Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from followup conversations it was quite clear that this was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.
You can stagger changes out within a reasonable timeframe - the blocks already take hours/days/weeks to come up with, taking an extra hour or two to trickle the change out gradually with some basic sanity checks between staggers is a tradeoff everyone would embrace in order to avoid the disaster we're living through today.
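For what it's worth, the logic being asked for here isn't exotic. A minimal sketch of a staggered rollout with health gates might look like the following; the cohort sizes, soak time, and the push_to/offline_rate hooks are made-up placeholders, not anything CrowdStrike actually exposes:

    import time

    COHORT_FRACTIONS = [0.001, 0.01, 0.05, 0.25, 1.0]  # grow the blast radius slowly
    MAX_OFFLINE_RATE = 0.10                             # the ">10% went down and stayed down" check
    SOAK_SECONDS = 30 * 60                              # pause between stages to let telemetry arrive

    def staged_rollout(update, fleet, push_to, offline_rate):
        """Push `update` to progressively larger cohorts, halting if a cohort's health tanks."""
        pushed = set()
        for fraction in COHORT_FRACTIONS:
            cohort = [h for h in fleet[: max(1, int(len(fleet) * fraction))]
                      if h not in pushed]
            push_to(cohort, update)
            pushed.update(cohort)
            time.sleep(SOAK_SECONDS)
            if cohort and offline_rate(cohort) > MAX_OFFLINE_RATE:
                raise RuntimeError("rollout halted: cohort failed its health check")
        return pushed

Even a single small canary stage with a short soak would likely have turned this into a few thousand bricked machines instead of millions.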
Need a reset on their balance point of security:uptime.
Wow!! Good to know the real reason for the non-staggered release of the software...
> Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from followup conversations it was quite clear that this was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.
There's some irony there in that the whole point of CrowdStrike itself is that it does behavioural-based interventions, i.e. it notices "unusual" activity over time and then can react to that autonomously. So them telling you they can't engineer it is kind of like them telling you they don't know how to do a core feature they actually sell and market the product as doing.
It's quite handy that all the things that pass QA never fail in production. :)
On a serious note, we have no way of knowing whether their update passed some QA or not, likely it hasn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is: it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollbackable, rollouts.
Ultimately, unless it's a nuclear power plant or something mission critical with no redundancy, I don't care if it passes QA, I care that it doesn't cause damage in production.
Had this been halted after bricking 10, 100, 1,000, 10,000, heck, even 100,000 machines or a whopping 1,000,000 machines, it would have barely made it outside of the tech circle news.
> On a serious note, we have no way of knowing whether their update passed some QA or not
I think we can infer that it clearly did not go through any meaningful QA.
It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.
That's not what happened here. They bricked a huge portion of internet connected windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA which is even worse.
There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.
If there had been a QA process, the kill rate could not have been as high as it is, because there'd have to be at least one system configuration that's not subject to the issue.
I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.
Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.
My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet. Which I agree with GP is not a sufficient argument.
If and when there is a US Cyber Safety Review Board investigation of this incident, documents like that are going to be considered with great interest by the parties involved.
Often it is the engineers working for a heavily invested customer at the sharp end of the coal face who get a glimpse underneath the layers of BS and stare into the abyss.
This doesn’t look good, they say. It looks fine from up top! Keep shoveling! Comes the reply.
Sure, gradual rollout seems obviously desirable, but think of it from a liability perspective.
You roll out a patch to 1% of systems, and then a few of the remaining 99% get attacked and they sue you for having a solution but not making it available to them. It won't matter that your sales contract explains that this is how it works and the rollout is gradual and random.
Then push the decision down to the customer; better yet, provide integration points with other patch-management software (no idea if you can integrate with WSUS without doing insane crap, but it's not the only system that handles that, etc.)
Another version of the "fail big" or "big lie" type phenomenon. Impact 1% of your customers and they sue you saying the gradual rollout demonstrates you had prior knowledge of the risk. Impact 100% of your customers and somehow you get off the hook by declaring it a black swan event that couldn't have been foretold.
This. I can see such an update shipping out for a few users. I mean I've shipped app updates that failed spectacularly in production due to a silly oversight (specifically: broken on a specific Android version), but those were all caught before shipping the app out to literally everybody around the world at the same time.
The only thing I can think of is they were trying to defend from a very severe threat very quickly. But... it seems like if they tested this on one machine they'd have found it.
Unless that threat was a 0day bug that allows anyone to SSH to any machine with any public key, it was not worth pushing it out in haste. Full stop. No excuses.
I also blame the customers here to be completely honest.
The fact the software does not allow for progressive rollout of a version in your own fleet should be an instantaneous "pass". It's unacceptable for a vendor to decide when updates are applied to my systems.
Absolutely. I may be speaking from ignorance here, as I don't know much about Windows, but isn't it also a big security red flag that this thing is reaching out to the Internet during boot?
I understand the need for updating these files, they're essentially what encodes the stuff the kernel agent (they call it a "sensor"?) is looking for. I also get why a known valid file needs to be loaded by the kernel module in the boot process--otherwise something could sneak by. What I don't understand is why downloading and validating these files needs to be a privileged process, let alone something in the actual kernel. And to top it all off, they're doing it at boot time. Why?
I hope there's an industry wide safety and reliability lesson learned here. And I hope computer operators (IT departments, etc) realize that they are responsible for making sure the things running on their machines are safe and reliable.
At the risk of sounding like a douchebag, I honestly believe there's A LOT of incompetence in the tech-world, and it permeates all layers: security companies, AV companies, OS companies, etc.
I really blame the whole power structure. It used to look like the engineers had the power, but over the last 10 years tech has been turned upside down and exploited like any other industry, controlled by opportunistic and greedy people. Everything is about making money and shipping features; the engineering is lost.
Would you rather tick compliance boxes easily or think deeply about your critical path? Would you rather pay 100k for one skilled engineer or hire 5 cheaper (new) ones? Would you rather sell your HW now despite shipping a feature-incomplete, buggy app that ruins the experience for many, many customers? Will you listen to your engineers?
I also blame us, the software engineers; we are way too easily bossed around by these types of people who have no clue. Have professional integrity: tests are not optional or something that can be cut, they're part of the job. Gradual rollouts, feature toggles, fallbacks/watchdogs, etc. are basic tools everyone should know.
I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.
Apple recognised that kernel extensions brought all sorts of trouble for users, such as instability and crashing, and presented a juicy attack surface. They deprecated and eventually disallowed kernel extensions, supplanting them with a system extensions framework to provide interfaces for VPN functionality, EDR agents, etc.
A Crowdstrike agent couldn't panic or boot loop macOS due to a bug in the code when using this interface.
> I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.
Yes, the problem here is that the system owners had too much control over their systems.
No, no, that's the EXACT OPPOSITE of what happened. The problem is Crowdstrike had too much control of systems -- arguing that we should instead give that control to Apple is just swapping out who's holding the gun.
> arguing that we should instead give that control to Apple is just swapping out who's holding the gun.
apple wrote the OS, in this scenario they're already holding a nuke, and getting the gun out of crowdstrike's hands is in fact a win.
it is self-evident that 300 countries having nukes is less safe than 5 countries having them. Getting nukes (kernel modules) out of the hands of randos is a good thing even if the OS vendor still has kernel access (which they couldn't possibly not have) and might have problems of their own. IDK why that's even worthy of having to be stated.
don't let the perfect be the enemy of the good, incremental improvements in the state of things is still improvement. there is a silly amount of black-and-white thinking around "popular" targets like apple and nvidia (see: anything to do with the open-firmware-driver) etc.
"sure google is taking all your personal data and using it to target ads to your web searches, but apple also has sponsored/promoted apps in the app store!" is a similarly trite level of discourse that is nonetheless tolerated when it's targeted at the right brand.
This is good nuance to add to the conversation, thanks.
I think in most cases you have to trust some group of parties. As an individual you likely don't have enough time and expertise to fully validate everything that runs on your hardware.
Do you trust the OSS community, hardware vendors, OS vendors like IBM, Apple, M$, do you trust third party vendors like Crowdstrike?
For me, I prefer to minimize the number of parties I have to trust, and my trust is based on historical track record. I don't mind paying and giving up functionality.
Even if you've trusted too many people, and been burned, we should design our systems such that you can revoke that trust after the fact and become un-burned.
Having to boot into safe mode and remove the file is a pretty clumsy remediation. Better would be to boot into some kind of trust-management interface and distrust Crowdstrike updates dated after July 17, then rebuild your system accordingly (this wouldn't be difficult to implement with nix).
Of course you can only benefit from that approach if you trust the end user a bit more than we typically do. Physical access should always be enough to access the trust management interface, anything else is just another vector for spooky action at a distance.
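The nix route aside, even a crude version of "distrust updates dated after July 17" is easy to sketch. Something like the following, run from a recovery environment, would do it; the directory and file-name pattern are assumptions about a typical Falcon install, not verified details:

    import pathlib
    import shutil
    from datetime import datetime, timezone

    # Assumed default channel-file location for a Falcon install; adjust as needed.
    DRIVER_DIR = pathlib.Path(r"C:\Windows\System32\drivers\CrowdStrike")
    QUARANTINE = DRIVER_DIR / "quarantine"
    CUTOFF = datetime(2024, 7, 18, tzinfo=timezone.utc)   # "updates dated after July 17"

    def distrust_updates_after(cutoff=CUTOFF):
        """Move channel files newer than `cutoff` aside so the boot path ignores them."""
        QUARANTINE.mkdir(exist_ok=True)
        for f in DRIVER_DIR.glob("C-*.sys"):
            mtime = datetime.fromtimestamp(f.stat().st_mtime, tz=timezone.utc)
            if mtime > cutoff:
                shutil.move(str(f), str(QUARANTINE / f.name))

Which is really the point: the remediation itself is trivial; it's getting to a place where you can run it (safe mode, recovery console, detached disks) that hurts.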
It is some mix of priorities along the frontier, with Apple being on the significantly controlling end such that I wouldn't want to bother. Your trust should also be based on prediction, and giving a major company even more control over what your systems are allowed to do has been historically bad and only gets worse. Even if Apple is properly ethical now (I'm skeptical; I think they've found a decently sized niche and that most of their users wouldn't drop them even if they moved to significantly higher levels of telemetry, due in part to being a status good), there's little reason to give them that power in perpetuity. Removing that control when it is abused hasn't gone well in the past.
Microsoft is also trying to make drivers and similar safer with HVCI, WDAC, ELAM and similar efforts.
But given how a large part of their moat is backwards compatibility, very few of those things are the default and even then probably wouldn't have prevented this scenario.
These customers wouldn't be able to do that in time frames measured in anything but decades and/or they would risk going bankrupt attempting to switch.
Microsoft has far more leverage than they choose to exert, for various reasons.
I can't run a 10-year-old game on my Mac, but I can run a 30-year-old game on my Windows 11 box. Microsoft prioritizes backwards compatibility for older software.
For Apple you just need to be an Apple customer; they do a good job of crashing computers with their macOS updates, like Sonoma. I remember my first MacBook Pro Retina couldn't go to sleep because it wouldn't wake up until Apple decided to release a fix for it. Good thing they don't make server OSes.
I remember fearing every OS X update, because until they switched to just shipping read-only partition images you had a considerable chance of hitting a bug in Installer.app that resulted in an infinite loop... (the bug existed since ~10.6, until they switched to image-based updates...)
30 years ago would be 1994. Were there any 32-bit Windows games in 1994 other than the version of FreeCell included with Win32s?
16-bit games (for DOS or Windows) won't run natively under Windows 11 because there's no 32-bit version of Windows 11 and switching a 64-bit CPU back to legacy mode to get access to the 16-bit execution modes is painful.
Maybe. Have you tried? 30-year-old games often did not implement delta timing, so they advance ridiculously fast on modern processors. Or the games required a memory mode not supported by modern Windows (see real mode, expanded memory, protected mode), requiring DOSBox or another emulator to run today.
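For anyone unfamiliar with the delta-timing point: many old games moved things a fixed amount per frame, so on hardware rendering thousands of frames per second everything runs absurdly fast. The usual fix is to scale movement by elapsed wall-clock time; a toy sketch, not any particular game's code:

    import time

    SPEED = 120.0                   # units per second, independent of frame rate
    x = 0.0
    last = time.perf_counter()

    while x < 1000.0:
        now = time.perf_counter()
        dt = now - last             # wall-clock seconds since the previous frame
        last = now
        x += SPEED * dt             # vs. the per-frame "x += 2" that breaks on fast hardware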
Well - recognition where it's due - that actually looks pretty great. (Assuming that, contrary to prior behavior, they actually support it, and fix bugs without breaking backwards compatibility every release, and don't keep swapping it out for newer frameworks, etc etc)
> I also blame us, the software engineers; we are way too easily bossed around by these types of people who have no clue. Have professional integrity: tests are not optional or something that can be cut, they're part of the job.
Then maybe most of what's done in the "tech-industry" isn't, in any real sense, "engineering"?
I'd argue the areas where there's actual "engineering" in software are the least discussed---example being hard real-time systems for Engine Control Units/ABS systems etc.
That _has_ to work, unlike the latest CRUD/React thingy that had "engineering" processes of cargo-culting whatever framework is cool now and subjective nonsense like "code smells" and whatever design pattern is "needed" for "scale" or some such crap.
Perhaps actual engineering approaches could be applied to software development at large, but it wouldn't look like what most programmers do, day to day, now.
How is mission-critical software designed, tested, and QA'd? Why not try those approaches?
Amen to that. Software Engineering as a discipline badly suffers from not incorporating well-known methods for preventing these kinds of disasters from Systems Engineering.
> How is mission-critical software designed, tested, and QA'd? Why not try those approaches?
Ultimately, because it is more expensive and slower to do things correctly, though I would argue that while you lose speed initially with activities like actually thinking through your requirements and your verification and validation strategies, you end up gaining speed later when you're iterating on a correct system implementation because you have established extremely valuable guardrails that keep you focused and on the right track.
At the end of the day, the real failure is in the risk estimation of the damage done when these kinds of systems fail. We foolishly think that this kind of widespread disastrous failure is less likely than it really is, or the damage won't be as bad. If we accurately quantified that risk, many more systems we build would fall under the rigor of proper engineering practices.
Accountability would drive this. Engineering liability codes are a thing, trade liability codes are a thing. If you do work that isn't up to code, and harm results, you're liable. Nobody is holding us software developers accountable, so it's no wonder these things continue to happen.
"Listen to the engineers?" The problem is that there are no engineers, in the proper sense of the term. What there are is tons and tons of software developers who are all too happy to be lax about security and safe designs for their own convenience and fight back hard against security analysts and QA when called out on it.
Engineers can be lazy and greedy, too. But at least they should better understand the risks of cutting corners.
> Have professional integrity: tests are not optional or something that can be cut, they're part of the job. Gradual rollouts, feature toggles, fallbacks/watchdogs, etc. are basic tools everyone should know.
In my career, my solution for this has been to just include doing things "the right way" as part of the estimate, and not give management the option to select a "cutting corners" option. The "cutting corners" option not only adds more risk, but rarely saves time anyway when you inevitably have to manually roll things back or do it over.
Sigh, I've tried this. So management reassigned it to a dev who was happy to ship a simulacrum of the thing that, at best, doesn't work or, at worst, is full of security holes and gives incorrect results. And this makes management happy because something shipped! Metrics go up!
And then they ask why, exactly, did the senior engineer say this would take so long? Why always so difficult?
I don't know that incompetence is the best way to describe the forces at play but I agree with your sentiment.
There is always tension between business people and engineering. The engineers want things to be perfect and safe, because we're the ones who have to fix the resulting issues during nights and weekends.
The business people are interested in getting features released, and don't always understand the risks of pushing arbitrary dates.
It's a tradeoff which in healthy organizations where the two sides and leadership communicate effectively is well managed.
> The engineers want things to be perfect and safe, because we're the ones who have to fix the resulting issues during nights and weekends. The business people are interested in getting features released, and don't always understand the risks of pushing arbitrary dates.
Isn't this issue a vindication of the engineering approach to management, where you try to _not_ brick thousands of computers because you wanted to meet some internal deadline faster?
> There is always tension between business people and engineering.
Really? I think this situation (and the situation with Boeing!) shows that the tension is ultimately between responsibility and irresponsibility.
It can't be said that this is a win for short-sighted and incompetent business people, can it?
If people don't understand the risks they shouldn't be making the decisions.
I think this is especially true in businesses where the thing you are selling is literally your ability to do good engineering. In the case of Boeing the fundamental thing customers care about is the "goodness" of the actual plane (for example the quality, the value for money, etc). In the case of Crowdstrike people wanted high quality software to protect their computers.
Yeah, good point. If you buy a carton of milk and it's gone off you shrug and go back to the store. If you're sitting in a jet plane at 30,000ft and the door goes for a walk... Twilight Zone. (And if the airline's security contractor sends a message to all the planes to turn off their engines... words fail. It's not... I can't joke about it. Too soon.)
Yes. I have been working in the tech industry since the early aughts, and I have never seen the industry so weak on engineer-led firms. Something really happened and the industry flipped.
In most companies, businesspeople without any real software dev experience control the purse strings. Such people should never run companies that sell life-or-death software.
The reality is there is plenty of space in the software industry to trade off velocity against "competent" software engineering. Take Instagram as an example. No one is going to die if e.g. a bug causes someone's IG photo upload to only appear in a proper subset of the feeds where it should appear.
In the civil engineering world, at least in Europe, the lead engineer signs papers that make him personally liable if a bridge or a building structure collapses on its own. Civil engineers face literal prison time if they do sloppy work.
In the software engineering world, we have TOSs that deny any liability if the software fails. Why?
It boils my blood to think that the heads of CrowdStrike would maybe get a slap on the wrist and everything will slowly continue as usual as the machines will get fixed.
Let's think about this for a second. I agree to some extent with what you are trying to say; I just think there's a critical thing missing from your consideration, and that is usage of the product outside its intended purpose/marketing.
Civil engineers build bridges knowing that civilians use them and that structural failure can cause deaths. The line of responsibility is clear.
For SW companies (like CrowdStrike (CS)) it MAY BE less straightforward.
A relevant real-world example is the use of consumer drones in military conflicts. Companies like DJI design and market their drones for civilian use, such as photography. However, these drones have been repurposed in conflict zones, like Ukraine, to carry explosives. If such a drone malfunctioned during military use, it would be unreasonable to hold DJI accountable, as this usage clearly falls outside the product's intended purpose and marketing.
The liability depends on the guarantees they make. If they market it as AV for critical infrastructure, such as healthcare (it seems like they do: https://www.crowdstrike.com/platform/), then by all means it's reasonable to hold them accountable.
However, SW companies should be able to sell products as long as they're clear about what the limitations are, and those limitations need to be clearly communicated to customers.
We have those TOS's in the software world because it would be prohibitively expensive to make all software as reliable as a publicly used bridge. For those who died as a direct result of CrowdStrike, that's where the litigious nature of the US becomes a rare plus. And CrowdStrike will lose a lot of customers over this. It isn't perfect, but the market will arbitrate CrowdStrike's future in the coming months and years.
We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.
I mean back in the mid teens we had the whole “move fast and break things” motif. I think that quickly morphed into “be agile” because no one actually felt good about breaking things.
We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.” Like, let’s create our own oath.
> We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.”
I assume you realize that you don't get very far in many companies when you do that. I'm not humble-bragging, but I have said just this over the past 10-15 years, even in senior/leadership positions, and it ended up giving me a reputation of "oh, gedy is difficult", and you get sidelined by more "helpful" junior devs and managers who are willing to sling shit over the wall to please product. It's really not worth it.
It’s a matter of getting a critical mass of people who do that. In other words, changing the general culture. I’m lucky to work at a company that more or less has that culture.
Yeah I’ve found this is largely cultural, and it needs to come from the top.
The best orgs have a gnarly, time-wisened engineer in a VP role who somehow is also a good people person, and pushes both up and down engineering quality above all else. It’s a very very rare combination.
> We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.
Agreed. Thinking back to my experience at a company like Sun, every build was tested on every combination of hardware and OS releases (and probably patch levels, don't remember). This took a long time and a very large number of machines running the entire test suites. After that all passed ok, the release would be rolled out internally for dogfooding.
To me that's the base level of responsibility an engineering organization must have.
Here, apparently, Crowdstrike lets a code change through with little to no testing and immediately pushes it out to the entire world! And this is from a product that is effectively a backdoor to every host. What could go wrong? YOLO right?
This mindset is why I grow to hate what the tech industry has become.
As an infra guy, it seems like all my biggest fights at work lately have been about quality. Long abandoned dependencies that never get updated, little to no testing, constant push to take things to prod before they're ready. Not to mention all the security issues that get shrugged off in the name of convenience.
I find both management and devs are to blame. For some reason the amazingly knowledgeable developers I read on here daily are never to be found at work.
Yes. I’ve had the same experience. Literally have had engineers get upset with me when I asked them to consider optimizing code or refactor out complexity. “Yeah we’ll do it in a follow up, this needs to ship now,” is what I always end up hearing. We’re not their technical leads but we get pulled into a lot of PRs because we have oversight on a lot of areas of the codebase. From our purview, it’s just constantly deteriorating.
IMO, if you want to write code for anything mission critical you should need some kind of state certification, especially when you are writing code for stuff that is used by govt., hospitals, finance etc.
Not certification, licensure. That can and will be taken away if you violate the code of ethics. Which in this case means the code of conduct dictated to you by your industry instead of whatever you find ethical.
Like a license to be a doctor, lawyer, or civil engineer.
There’s - perhaps rightfully, but certainly predictably - a lot of software engineers in this thread moaning about how evil management makes poor engineers cut corners. Great, licensure addresses that. You don’t cut corners if doing so and getting caught means you never get to work in your field again. Any threat management can bring to the table is not as bad as that. And management is far less likely to even try if they can’t just replace you with a less scrupulous engineer (and there are many, many unscrupulous engineers) because there aren’t any because they’re all subject to the same code of ethics. Licensure gives engineers leverage.
I think that could cause a huge shift away from contributing to or being the maintainer of open source software. It would be too risky if those standards were applied and they couldn't use the standard "as is, no warranties" disclaimers.
Actually, no it wouldn't, as the licensure would likely be tied to providing the service on a paid basis to others. You could write or maintain any codebase you want. Once you start consuming it for an employer, though, the licensure kicks in.
Paid/subsidized maintainers may be a different story though. But there absolutely should be some level of teeth and stake wieldable by a professional SWE to resist pushes to "just do the unethical/dangerous thing" by management.
I might have misunderstood. I took it to mean that engineers would be responsible for all code they write - the same as another engineer may be liable for any bridge they build - which would mean the common "as is", "no warranty", "not fit for any purpose" cute clauses common to OSS would no longer apply as this is clearly skirting around the fact that you made a tool to do a specific thing, and harming your computer isn't the intended outcome.
You can already enforce responsibility via contract but sure, some kind of licensing board that can revoke a license so you can no longer practice as a SWE would help with pushback against client/employer pressure. In a global market though it may be difficult to present this as a positive compared to overseas resources once they get fed up with it. It would probably need either regulation, or the private equivalent - insurance companies finding a real, quantifiable risk to apply to premiums.
Trouble is, the bridge built by any licensed engineer stands in its location, and can't be moved or duplicated. Software, however, is routinely duplicated, and copied to places that might not be suitable for its original purpose.
I’d be ok with this so long as 1) there are rules about what constitutes properly built software and 2) there are protections for engineers who adhere to these rules
Far from being douchey, I think you've hit the nail on the head.
No one is perfect, we're all incompetent to some extent. You've written shitty code, I've definitely written shitty code. There's little time or consideration given to going back and improving things. Unless you're lucky enough to have financial support while working on a FOSS project where writing quality software is actually prioritized.
I get the appeal software developers have to start from scratch and write their own kernel, or OS, etc. And then you realize that working with modern hardware is just as messy.
We all stack our own house of cards upon another. Unless we tear it all down and start again with a sane stable structure, events like this will keep happening.
I think you are correct on that many SWEs are incompetent. I definitely am. I wish I had the time and passion to go through a complete self-training of CS fundamentals using Open Course resources.
> I honestly believe there's A LOT of incompetence in the tech-world
I can understand why. An engineer with expertise in one area can be a dunce in another; the line between concerns can be blurry; and expectations continue to change. Finding the right people with the right expertise is hard.
100%. What we've seen in the last couple of decades is the march of normies into the technosphere, to the detriment of the prior natives.
We've essentially watched digital colonialism, and it certainly peaks with Elon Musk's wealth and ego attempting to buy up the digital marketplace of ideas.
Applying rigorous engineering principles is not something I see developers doing often. Whether or not it's incompetence on their part, or pressure from 'imbecile MBAs and marketers', it doesn't matter. They are software developers, not engineers. Engineers in most countries have to belong to a professional body and meet specific standards before they can practice as professionals. Any asshat can call themselves a 'software engineer', the current situation being a prime example, or was this a marketing decision?
You're making the title be more than it is. This won't get solved by more certification. The checkbox of having certified security is what allowed it to happen in the first place.
No. Engineering means something. This is a software ‘engineering’ problem. If the field wants the nomenclature, then it behooves them to apply rigour to who can call themselves an engineer or architect. Blaming middle management is missing the wood for the trees. The root cause was a bad patch. That is development's fault, and no one else's. As to why this fault could happen, well, the design of Windows should be scrutinised. Again, middle management isn't really to blame here; software architects and engineers design the infrastructure, and they choose to use Windows for a variety of reasons.
The point I'm trying to make is that blaming “MBAs and marketing” shifts blame and misses the wood for the trees. The OP is on the holier-than-thou “engineer” trip. They are not engineers.
I think engineering only means something because of culture. It all starts from the culture of the collective of people who define and decide what principles are to be followed and why. All the certifications and licensing that are prerequisites to becoming an engineer are outcomes of the culture that defined them.
Today we have pockets of code produced by one culture linked (literally) with pockets of code produced by a completely different ones and somehow expect the final result to adhere to the most principled and disciplined culture.
Not entirely true. The company I worked for, major network equipment provider, had a customer user group that had self-organised to take it in turns to be the first customer to deploy major new software builds. It mostly worked well.
This is the thing that gets me most about this. Any Windows systems developer knows that a bug in a kernel driver can cause BSODs - why on earth would you push out such changes en-masse like this?!
In 2012 a local bank rolled out an update that basically took all of their customer services offline. Couldn't access your money. Took them a month to get things working again.
I'm confused as to how this issue is so widespread in the first place. I'm unfamiliar with how Crowdstrike works, do organizations really have no control over when these updates occur? Why can't these airlines just apply the updates in dev first? Is it the organizations fault or does Crowdstrike just deliver updates like this and there's no control? If that's just how they do it, how do they get away with this?
Can somebody summarize what CrowdStrike actually is/does? I can't figure it out from their web page (they're an "enterprise" "security" "provider", apparently). Is this just some virus scanning software? Or is it some bossware/spyware thing?
It's both. Antivirus along with spyware to also watch for anything the user is doing that could introduce a threat, such as opening a phishing email, posting on HN, etc.
It's not really up to the companies. In this day and age, everyone is a target for ransomware, so every company with common sense holds insurance against a ransomware attack. One of the requirements of the insurance is that you have to have monitoring software like Crowdstrike installed on all company machines. The company I work for fortunately doesn't use Crowdstrike, but we use something similar called SentinelOne. It's very difficult to remove, and it's a fireable offense if you manage to.
No doubt mandated so that the NSA can have a backdoor to everything just by having a deal with each one of those providers.
I think there's a Ben Franklin quote that applies here. "Those who would give up essential liberty, to purchase a little temporary safety, deserve neither liberty nor safety."
It is kinda implied throughout SP 800-171r3 that EDRs will make meeting the requirements easier, although they are only specifically mentioned in section 03.04.06
Most corporate places I've encountered over the last N years mandate one kind of antivirus/spyware combo or another on every corporate computer. So it'd be pretty much every major workplace.
Just because everyone does it doesn't mean it's not a dumb idea. Everyone eats sugar.
If the average corporation hates/mistrusts their employees enough to add a single point of failure to their entire business and let a 3rd party have full access to their systems, then well, they reap what they sow.
I think you have to look beyond the company. In my experience, even the people implementing these tools hate them and rarely have some evil desire to spy on their employees and slow down their laptops. But without them as part of the IT suite, the company can't tick the EDR or AV box, pass a certain certification, land a certain type of customer, etc. It is certainly an unfortunate cycle.
This goes way higher than the average corporation.
This is companies trying desperately to deliver value to their customer at a profit while also maintaining SOC 2, GDPR, PCI, HIPAA, etc. compliance.
If you're not a cybersecurity company, a company like CrowdStrike saying: 'hey, pay us a monthly fee and we'll ensure you're 100% compliant _and_ protected' sounds like a dream come true. Until today, it probably was! Hell, even after today, when the dust settles, still probably worth it.
Sounds like the all too common dynamic of centralized top-down government/corporate "security" mandates destroying distributed real security. See also TSA making me splay my laptops out into a bunch of plastic bins while showing everyone where and how I was wearing a money belt. (I haven't flown for quite some time, I'm sure it's much worse now)
There's a highly problematic underlying dynamic where 364 days out of the year, when you talk about the dangers of centralized control and proprietary software, you get flat out ignored as being overly paranoid and even weird (don't you know that "normal" people have zero ability or agency when it comes to anything involving computers?!). Then something like this happens and we get a day or two to say "I told you so". After which the managerial class goes right back to pushing ever-more centralized control. Gotta check off those bullet point action items.
They fixed that. Now you can fly without taking your laptop out, or taking your shoes and belt off. You just have to give them fingerprints, a facial scan and an in-person interview. They give you a little card. It's nifty.
My response was intended as sarcasm. But eventually, I don't think it will be a two-tiered system. You simply won't be allowed to fly without what is currently required for precheck.
And fwiw, I don't think the strong argument against precheck has to do with social class... it's not terribly expensive, and anyone can do it. It's just a further invasion of privacy.
Precheck is super cheap, it's like less than $100 once per 5 years. Yes, it is an invasion of privacy, but I suspect the government already has all that data anyway many times over.
> showing everyone where and how I was wearing a money belt
I only fly once every couple years, but I really hated emptying my pockets into those bins. The last time I went through, the agent suggested I put everything in my computer bag. That worked a lot better.
Last time I flew, in Sweden, the guy was angry at me for having to do his job, so he slipped my passport off the tray so that I'd lose it. Lucky for me, I saw him doing it.
At my work in the past year or 2 they rolled out Zscaler onto all of our machines which I think is supposed to be doing a similar thing. All it's done is caused us regular network issues.
I wonder if they also have the capability to brick all our Windows machines like this.
Zscaler is awful. It installs a root cert to act as a man-in-the-middle TCP traffic snooper. Probably does some other stuff, but all your TLS traffic is snooped with Zscaler. It is creepy software, IMO.
Ah, yeah, they gave us zscaler not too long ago. I wondered if it was logging my keystrokes or not, figured it probably was because my computer slowed _way_ down ever since it appeared.
Zscaler sounds like it would be a web server. Just looked it up: "zero trust leader". The descriptiveness of terms these days... if you say it gets installed on a system, how is that having zero trust in them? And what do they do with all this nontrust? Meanwhile, Wikipedia says they offer "cloud services", which is possibly even more confusing for what you describe as client software
Somebody upthread pointed out that it installs a root CA and forces all of your HTTPS connections to use it. I verified that he's correct - I'm on Hacker News right now with an SSL connection that's verified by "ZScaler Root CA", not Digicert.
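If you want to see this for yourself from a terminal, a quick check (assuming openssl is installed and your machine's traffic actually goes through the Zscaler proxy) is to ask who issued the certificate your machine is being presented with:

    # Print the issuer of the cert served for news.ycombinator.com.
    # On a Zscaler-intercepted machine this shows the Zscaler Root CA
    # rather than the site's real CA.
    echo | openssl s_client -connect news.ycombinator.com:443 -servername news.ycombinator.com 2>/dev/null \
      | openssl x509 -noout -issuer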
ZScaler has various deployment layouts. Instead of the client side TLS endpoint, you can also opt for the "route all web traffic to ZScaler cloud network" which office admins love because less stuff to install on the clients. The wonderful side effect is that some of these ZScaler IPs are banned from reddit, Twitter, etc, effectively banning half the company.
Zero trust means that there is no implicit trust whether you’re accessing the system from an internal protected network or from remote. All access to be authenticated to the fullest. In theory you should be doing 2FA every time you log in for the strictest definition of zero trust.
They are a SASE provider; I assume they offer a BeyondCorp-style product that lets companies move their apps off a private VPN and allow access over the public internet. They probably have a white paper on how they satisfy zero trust architecture.
See the recent waves of ransomware encrypting drives and similar attacks. They cause real cost as well, and this outage can be blamed on CrowdStrike without losing face. If you are in the news for phished data, or have an outage because all your data got encrypted, blaming somebody else is hard.
Well it’s not aimed at IT people and programmers (though the policies still apply to them), it’s aimed at everyone else who doesn’t understand what a phishing email looks like.
These comments make me think that both you and the commenter you replied to have never read 1984.
It's anti-totalitarian propaganda. There is, IIRC, not much about how Airstrip One came to be; it's kind of always been there, because the state controls history. People did not ask for the telescreens, they accept them.
The system in the book is so strongly based on heavy-handed coercion and manipulation that I actually find it psychologically implausible (though, North Korea...). The strength of the book, I would say, is not its plausibility, but the intensity of the nightmare and the quality of the prose that describes it.
So there's the control freak at the top who made this decision, and then there are the front lines who are feverishly booting into safe mode and removing the update, and then there are the people who can't get the data they need to safely perform surgeries.
So yeah, screw 'em. But let's be specific about it.
I think the question this raises is why critical systems like that have unrestricted 3rd party access and are open to being bricked remotely. And furthermore, why safety critical gear has literally zero backup options to use in case of an e.g. EMP, power loss, or any other disruption. If you are in charge of a system where it crashing means that people will die, you are a complete moron to not provide multiple alternatives in such a case and should be held criminally liable for your negligence.
Agreed on all points, but if we're going to start expecting people to do that kind of diligence, re: fail-safes and such (and we should), then we're going to have to stop stretching people as thin as we tend to, and we're going to have to give them more autonomy than we tend to.
Like the kind of autonomy that lets them uninstall CrowdStrike. Because how can you be responsible for a system which at any time could start running different code?
What I don't get is why nobody questions how an OS that needs all this third-party shit to function and be compliant gets into critical paths in the first place??
This kind of thing is required by FedRAMP. Good luck finding a company without endpoint management software that is legally allowed to be a US government vendor.
If you stick to small privately held companies you might be able to avoid endpoint management, but that's it... any big brand you can think of is going to be running this or something similar on their machines -- because they're required to.
Presumably endpoint detection & response (EDR) agents need to do things like dynamically fetch new malware signatures at runtime, which is understandable. But you'd think that would be treated as new "content", something they're designed to handle in day-to-day operation, hence very low risk.
That's totally different to deploying new "code", i.e. new versions of the agent itself. You'd expect that to be treated as a software update like any other, so their customers can control the roll out as part of their own change management processes, with separate environments, extensive testing, staggered deployments, etc.
I wonder if such a content vs. code distinction exists? Or has EDR software gotten so complex (e.g. with malware sandboxing) that such a distinction can't easily be made any more?
In any case, vendors shouldn't be able to push out software updates that circumvent everyone's change management processes! Looking forward to the postmortem.
My guess is it probably was a content update that tickled some lesser-trodden path in the parser/loader code, or created a race condition in the code which led to the BSOD.
Even if it’s ‘just’ a content update, it probably should follow the rules of a code update (canaries, pre-release channels, staged rollouts, etc).
CrowdStrike is an endpoint detection and response (EDR) system. It is deeply integrated into the operating system. This type of security software is very common on company-owned computers, and often have essentially root privileges.
Well, actually more than root. Even for an administrator user on Windows, it’s pretty hard to mess with things and get into BSOD. CrowdStrike has these files as drivers (as indicated by .sys file extension) which run in the kernel mode.
Companies operate on a high level of fear and trust. This is the security vendor, so in theory they want those updates rolled out as quickly as possible so that they don't get hacked. Heh.
These updates happen automatically and, as far as I can tell, there is no option to turn this feature off. From a security perspective, the vendor will always want you to be on the most recent software to protect from attack holes that may open up by operating on an older version. Your IT department will likely want this as well to avoid culpability. Just my 2 observations; whether it is the right way, or if CS is effective at what it does, no idea.
Crowdstrike did this to our production linux fleet back on April 19th, and I've been dying to rant about it.
The short version was: we're a civic tech lab, so we have a bunch of different production websites made at different times on different infrastructure. We run Crowdstrike provided by our enterprise. Crowdstrike pushed an update on a Friday evening that was incompatible with up-to-date Debian stable. So we patched Debian as usual, everything was fine for a week, and then all of our servers across multiple websites and cloud hosts simultaneously hard crashed and refused to boot.
When we connected one of the disks to a new machine and checked the logs, Crowdstrike looked like the culprit, so we manually deleted it and the machine booted; we tried reinstalling it and the machine immediately crashed again. OK, let's file a support ticket and get an engineer on the line.
Crowdstrike took a day to respond, and then asked for a bunch more proof (beyond the above) that it was their fault. They acknowledged the bug a day later, and weeks later produced a root cause analysis saying they hadn't covered our scenario (Debian stable running version n-1, I think, which is a supported configuration) in their test matrix. In our own post mortem there was no real ability to prevent the same thing from happening again -- "we push software to your machines any time we want, whether or not it's urgent, without testing it" seems to be core to the model, particularly if you're a small IT part of a large enterprise. What they're selling to the enterprise is exactly that they'll do that.
Oh, if you are also running Crowdstrike on linux, here are some things we identified that you _can_ do:
- Make sure you're running in user mode (eBPF) instead of kernel mode (kernel module), since it has less ability to crash the kernel. This became the default in the latest versions and they say it now offers equivalent protection.
- If your enterprise allows, you can have a test fleet running version n and the main fleet run n-1.
- Make sure you know in advance who to cc on a support ticket so Crowdstrike pays attention.
I know some of this sounds obvious, but it's easy to screw up organizationally when EDR software is used by centralized CISOs to try to manage distributed enterprise risk -- like, how do you detect intrusions early in a big organization with lots of people running servers for lots of reasons? There's real reasons Crowdstrike is appealing in that situation. But if you're the sysadmin getting "make sure to run this thing on your 10 boxes out of our 10,000" or whatever, then you're the one who cares about uptime and you need to advocate a bit.
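One more note on the user-mode (eBPF) point above: the backend is a falconctl setting. A minimal sketch, assuming the standard /opt/CrowdStrike install path, that your sensor version exposes --backend via -g, and that "bpf" is the user-mode value (check your sensor version's docs before relying on this):

    # Show the currently configured backend (kernel module vs eBPF)
    sudo /opt/CrowdStrike/falconctl -g --backend
    # Switch to the user-mode eBPF backend and restart the sensor
    sudo /opt/CrowdStrike/falconctl -s --backend=bpf
    sudo systemctl restart falcon-sensor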
I would wager that even most software developers who understand the difference between kernel and user mode aren't going to be aware there is a "third" address space, which is essentially a highly-restricted and verified byte code virtual machine that runs with limited read-only access to kernel memory
Not that it changes your point, and I could be wrong, but I'm pretty sure eBPF bytecode is typically compiled to native code by the kernel and runs in kernel mode with full privileges. Its safety properties entirely depend on the verifier not having bugs.
fwiw there's like a billion devices out there with cpus that can run java byte code directly - it's hardly experimental. for example, Jazelle for ARM was very widely deployed
Depending on what kernel I'm running, CrowdStrike Falcon's eBPF will fail to compile and execute, then fail to fall back to their janky kernel driver, then inform IT that I'm out of compliance. Even LTS kernels in their support matrix sometimes do this to me. I'm thoroughly unimpressed with their code quality.
JackC mentioned in the parent comment that they work for a civic tech lab, and their profile suggests they’re affiliated with a high-profile academic institution. It’s not my place to link directly, but a quick Google suggests they do some very cool, very pro-social work, the kind of largely thankless work that people don’t get into for the money.
Perhaps such organizations attract civic-minded people who, after struggling to figure out how to make the product work in their own ecosystem, generously offer high-level advice to their peers who might be similarly struggling.
It feels a little mean-spirited to characterize that well-meaning act of offering advice as “insane.”
This is gold. My friend and I were joking around that they probably did this to macOS and Linux before, but nobody gave a shit since it's... macOS and Linux.
(re: people blaming it on windows and macos/linux people being happy they have macos/linux)
I don’t think people are saying that causing a boot loop is impossible on Linux, anyone who knows anything about the Linux kernel knows that it’s very possible.
Rather it’s that on Linux using such an invasive antiviral technique in Ring 0 is not necessary.
On Mac I’m fairly sure it is impossible for a third party to cause such a boot loop due to SIP and the deprecation of kexts.
I believe Apple prevented this also for this exact reason. Third-parties cannot compromise the stability of the core system, since extensions can run only in user-space.
I might be wrong about it, but I feel that malware with root access can wreak quite a havoc. Imagine that this malware decides to forbid launch of every executable and every network connection, because their junior developer messed up with `==` and `===`. It won't cause kernel crash, but probably will render the system equally unusable.
Root access is a separate issue, but user space access to sys level functions is something Apple has been slowly (or quickly on the IOS platform, where they are trying to stop apps snooping on each other) clamping down on for years.
On both macOS and Linux, there's an increasingly limited set of things you can do from root. (but yeah, malware with root is definitely bad, and the root->kernel attack surface is large)
Malware can do tons of damage even with only regular user access, e.g. ransomware. That’s a different problem from preventing legitimate software from causing damage accidentally.
To completely neuter malware you need sandboxing, but this tends to annoy users because it prevents too much legitimate software. You can set up Mac OS to only run sandboxed software, but nobody does because it’s a terrible experience. Better to buy an iPad.
> but nobody does because it’s a terrible experience
To be fair, all apps from the App Store are sandboxed, including on macOS. Some apps that want/need extra stuff are not sandboxed, but still use Gatekeeper and play nice with SIP and such.
FWIW, according to Activity Monitor, somewhere around 2/3 to 3/4 of the processes currently running on my Mac are sandboxed.
Terrible dev experience or not, it's pretty widely used.
It depends on your setup. If you actually put in the effort to get apparmor or selinux set up, then root is meaningless. There have been so many privilege escalation exploits that simply got blocked by selinux that you should worry more about setting selinux up than some hypothetical exploit.
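For what it's worth, verifying that SELinux is actually enforcing (rather than merely installed) is cheap; a minimal sketch on a RHEL-style box:

    # Current mode: Enforcing, Permissive, or Disabled
    getenforce
    # Enforce for the running system (does not persist across reboot)
    sudo setenforce 1
    # Persist it by setting SELINUX=enforcing in /etc/selinux/config
    sudo sed -i 's/^SELINUX=.*/SELINUX=enforcing/' /etc/selinux/config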
It's not unnecessary, it's harder (no stable kernel ABI, and servers won't touch DKMS with a ten foot pole).
On the other hand you might say that lack of stable kernel ABI is what begot ebpf, and that Microsoft is paying for the legacy of allowing whatever (from random drivers to font rendering) to run in kernel mode.
I’ve had an issue with it before in my work MacBook. It would just keep causing the system to hang, making the computer unusable. Had to get IT to remove it.
> we push software to your machines any time we want, whether or not it's urgent, without testing it
Do they allow you to control updates? It sounds like what you want is for a small subset of your machines using the latest, while the rest wait for stability to be proven.
This is what happened to us. We had a small fraction of the fleet upgraded at the same time and they all crashed. We found the cause and set a flag to not install CS on servers with the latest kernel version until they fixed it.
I wonder if the changes they put in behind the scenes for your incident on Linux saved Linux systems in this situation and no one thought to see if Windows was also at risk.
So in a nutshell it is about corporations pushing for legislation which compels usage of their questionable products, because such products enable management to claim compliance when things go wrong, even when the thing that goes wrong is the compliance-ensuring product itself.
CrowdStrike Falcon may ship as a native package, but after that it completely self-updates to whatever they think you should be running. Often, I have to ask IT to ask CS to revert my version because the "current" one doesn't work on my up-to-date kernel/glibc/etc. The quality of code that they ship is pretty appalling.
Thanks for confirming. Is there any valid reason these updates couldn't be distributed through proper package repositories, ideally open repositories (especially data files which can't be copyrightable anyway)?
Yes, but that puts a lot of complexity on the end user and you end up with:
1. A software vendor that is unhappy about the speed they can ship new features at
2. Users that are unhappy the software vendor isn't doing more to reduce their maintenance burden, especially when they have a mixture of OS, distros and complex internal IT structures
IMO default package managers have failed on both Linux and Windows to provide a good solution for remote updates, so everyone reinvents the wheel with custom mini package managers + dedicated update systems.
This seems to be misinformation? The CrowdStrike KB says this was due to a Linux kernel bug.
---
Linux Sensor operating in user mode will be blocked from loading on specific 6.x kernel versions
Published Date: Apr 11, 2024
Symptoms
In order to not trigger a kernel bug, the Linux Sensor operating in user mode will be prevented from loading on specific 6.x kernel versions with 7.11 and later sensor versions.
Applies To
Linux sensor 7.11 in user mode will be prevented from loading:
For Ubuntu/Debian kernel versions:
6.5 or 6.6
For all distributions except Ubuntu/Debian, kernel versions:
6.5 to 6.5.12
6.6 to 6.6.2
Linux sensor 7.13 in user mode will be prevented from loading:
For all distributions except Ubuntu/Debian, kernel versions:
6.5 to 6.5.12
6.6 to 6.6.2
Linux Sensors running in kernel mode are not affected.
Resolution
CrowdStrike Engineering identified a bug in the Linux kernel BPF verifier, resulting in unexpected operation or instability of the Linux environment.
In detail, as part of its tasks, the verifier backtracks BPF instructions from subprograms to each program loaded by a user-space application, like the sensor. In the bugged kernel versions, this mechanism could lead to an out-of-bounds array access in the verifier code, causing a kernel oops.
This issue affects a specific range of Linux kernel versions, that CrowdStrike Engineering identified through detailed analysis of the kernel commits log. It is possible for this issue to affect other kernels if the distribution vendor chooses to utilize the problem commit.
To avoid triggering a bug within the Linux kernel, the sensor is intentionally prevented from running in user mode for the specific distributions and kernel versions shown in the above section
These kernel versions are intentionally blocked to avoid triggering a bug within the Linux kernel. It is not a bug with the Falcon sensor.
Sensors running in kernel mode are not affected.
No action required, the sensor will not load into user mode for affected kernel versions and will stay on kernel mode.
For Ubuntu 22.04 the following 6.5 kernels will load in user mode with Falcon Linux Sensor 7.13 and higher:
6.5.0-1015-aws and later
6.5.0-1016-azure and later
6.5.0-1015-gcp and later
6.5.0-25-generic and later
6.5.0-1016-oem and later
If for some reason the sensor needs to be switched back to kernel mode:
Switch the Linux sensor backend to kernel mode
sudo /opt/CrowdStrike/falconctl -s --backend=kernel
At one point overnight airlines were calling for an "international ground stop for all flights globally". Planes in the air were unable to get clearance to land or divert. I don't believe such a thing has ever happened before except in the immediate aftermath of 9/11.
A pilot WILL land, even without clearance. They're not going to crash their own plane. Either way, ATC has fallback procedures and can just use radio to communicate and manage everything manually. Get all the planes on the ground in safe order and then wait for a fix before clearing new takeoffs. https://aviation.stackexchange.com/questions/43379/is-there-...
Planes always get landing clearance via radio. "Planes in the air were unable to get clearance to land or divert" strongly suggests that the radios themselves were not working if it's actually true.
I wouldn't expect emergency rooms and 911 to stop working either, but here we are, so until someone says otherwise, I'm assuming some ATCs went down too.
I imagine the flight planning software they use was affected (so their ability to coordinate with other airport's ATC), but not their radio systems or aircraft radar (nearly all radar systems I've worked with are run on Linux, and are hardened to the Nth degree). Been out of the game for 12 years though, so things have likely changed.
The Tenerife disaster (second-deadliest aviation incident in history, after 9/11) was ultimately caused by chaotic conditions due to too many airplanes having to be diverted and land at an alternate airport that wasn't equipped to handle them comfortably.
I'd argue that Tenerife was due to taking off (in bad weather), not landing. But of course, a bunch of planes landing at the same airport without ATC sounds quite dangerous.
There were a lot of contributing causes, but it wouldn't have happened if not for the fact that Tenerife North airport was massively overcrowded due to Gran Canaria airport being suddenly closed (for unrelated reasons) and flights forced to divert.
The issue wasn't with landing specifically; I'm just using it as a general example of issues caused by havoc situations in aviation.
Pilots know where there are other places to land, e.g. there are a lot of military strips and private airfields where some craft can land, depending on size.
I would also point out that the backup plan (Radio and Binoculars) are not only effective but also extremely cheap & easy to keep ready in the control tower at all times.
Why does this tool exist, and why must it be installed on servers? Well, Windows OS design definitely plays a role here.
Why does this software run in a critical path that can cause the machine to BSOD? This is where the OS is a problem. If it is fragile enough that a bad service like this can cause it to crash in an unfixable state (without manual intervention), that’s on Windows.
> Why does this tool exist and must be installed on servers?
Fads, laziness, and lack of forethought. This tool didn't exist a few years ago. Nobody stopped IT departments worldwide and said "hey, maybe you shouldn't be auto-rolling critical software updates without testing, let alone doing this via a third-party tool with dubious checks."
This could have happened on any OS. Auto deployment is the root problem.
In this very thread there was a report of a Debian Linux fleet being kernel-crashed in exactly the same scenario by exactly the same malware a few months ago.
So the only blame Windows can take is its widespread usage, compared to Debian.
Yes, the Linux device driver has many of the same issues (monolithic drivers running in kernel space/memory). I’m not sure what the mitigations were in that case, but I’d be interested to know.
But we both know this isn’t the only model (and have commented as such in the thread). MacOS has been moving away from this risk for years, largely to the annoyance of these enterprise security companies. The vendor used by an old employer blamed Apple for their own inability to migrate their buggy EDM program to the new version of macOS. So much so that our company refused to upgrade for over 6 months, and then it was begrudgingly allowed.
A tool that has full control of the OS (which is apparently required by such security software) fundamentally must have a way to crash the system, and continue to do so at every restart.
This really should be a hell no. Perhaps Microsoft's greatest claim to fame is their enduring ability to quickly and decisively react to security breaches with updates. Their process is extremely public and hasn't significantly changed in decades.
If your company can't work with Microsoft's process, your company is the problem. Every other software company in the last forty years has figured it out.
I don't blame Windows, but do blame these systems for running Windows, if that makes sense.
I imagined a lot of this ran on some custom or more obscure and hardened specialty system. One that would generally negate the need for antiviruses and such. (and obviously, no, not off the shelf Linux/BSD either)
Legit question, not trolling. Android is the next biggest OS used to run a single application like POS, meter readers, digital menus, navigation systems. It might be the top one by now. It's prone to all the same 'spyware' drawbacks and easier to set up than "Linux".
It would be better than Windows for sure. You’ve got A/B updates, verified boot, selinux, properly sandboxed apps and a whole range of other isolation techniques.
For something truly mission critical, I’d expect something more bespoke with smaller complexity surface. Otherwise Android is actually not a bad choice.
Any sort of Immutable OS would be better for critical systems like this. The ability to literally just rollback the entire state of the system to before the update would have gotten things back online as fast as a reboot...
Something like Android Lollipop from 2014 supports all the latest techniques. It's likely there are no security issues left on Lollipop by now.
A lot of the new forced updates on Android are to prevent some apps from being used to spy on other apps, steal passwords, use notification backdoors etc., but you don't need that if it's just a car radio.
Around the same time the news showed up here, the WeChat TikTok clone (Moments, I think, in English) was showing animations of the US air traffic maps and how the tech blackout affected them. From those images I could tell it was huge.
We are a major CS client, with 50k windows-based endpoints or so. All down.
There exists a workaround but CS does not make it clear whether this means running without protection or not. (The workaround does get the windows boxes unstuck from the boot loop, but they do appear offline in the CS host management console - which of course may have many reasons).
Does CS actually offer any real protection? I always thought it was just feel-good software, that Windows had caught up to separating permissions since after XP or so. Either one is lying/scamming, but which one?
> Does CS actually offer any real protection? I always thought it was just feel-good software, that Windows had caught up to separating permissions since after XP or so. Either one is lying/scamming, but which one?
Our ZScaler rep (basically, they technically work for us) comes out with massive, impressive-looking numbers of the thousands of threats they detect and eliminate every month.
Oddly, before we had Zscaler we didn't seem to have any actual problems. Now we have it, and while we have lots of Zscaler-caused problems around performance and location, we still don't have any actual problems.
Feels very much like a tiger-repelling rock. But I'm sure the corporate hospitality is fun.
AFAIK, most of the people I know that deploy CrowdStrike (including us) just do it to check a box for audits and certifications. They don't care much about protections and will happily add exceptions on places where it gives problems (and that's a lot of places)
It's not about checking the boxes themselves, but the shifting of liability that enables. Those security companies are paid well not for actually providing security, but for providing a way to say, "we're not at fault, we adhered to the best security practices, there's nothing we could've done to prevent the problem".
Shouldn't that hit Crowdstrike's stock price much more than it has then? (so far I see ~11% down which is definitely a lot but it looks like they will survive).
Not quite. Insurance is a product that provides compensation in the event of loss. Deploying CrowdStrike with an eye toward enterprise risk management falls under one of either changing behaviors or modifying outcomes (or perhaps both).
Pay for what exactly though? Cybersecurity incidents result in material loss, and someone somewhere needs to provide dollars for the accrued costs. Reputation can't do that, particularly when legal liability (or, hell, culpability) is involved.
EDR deployment is an outcome-modifying measure, usually required as underwritten in a cybersecurity insurance policy for it to be in force. It isn't itself insurance.
Just adding my two cents: I work as a pentester and arguably all of my colleagues agree that engagements where Crowdstrike is deployed are the worst because it's impossible to bypass.
It definitely isn't impossible to bypass. It gets bypassed all the time, even publicly. There's like 80 different CrowdStrike bypass tricks that have been published at some point. It's hard to bypass and it takes skill, and yes it's the best EDR, but it's not the best solution - the best solution is an architecture where bypassing the EDR doesn't mean you get to own the network.
An attacker that's using a 0 day to get into a privileged section in a properly set up network is not going to be stopped by CrowdStrike.
By “impossible to bypass” are you meaning that it provides good security? Or that it makes pen testing harder because you need to be able to temporarily bypass it in order to do your test?
The first. AV evasion is a whole discipline in itself and it can be anything from trivial to borderline impossible. Crowdstrike definitely plays in the champions league.
I’ll say this: I did a small lab in college for a hardware security class and I got a scary email from IT because CrowdStrike noticed there was some program using speculative execution/cache invalidation to leak data on my account - they recognized my small scale example leaking a couple of bytes. Pretty impressive to be honest.
Those able to write and use FUD malware do not create public documentation. Crowdstrike is not impossible to bypass, but a junior security journeyman known as a pentester, working for corporate interests with no budget and absurdly limited scopes, under contract for n hours a week for 3 weeks, will never be able to do anything as simple as an EDR evasion. However, if you wish to actually learn the basics of this art as a common practitioner, please go study the OffSec evasion class. Then go read a lot of code and syscall documentation, and learn assembly.
I don't understand why you were downvoted. I'm interested in what you said. When you mentioned offsec evasion class, is this what you mean? It seems pretty advanced.
What kind of code should I read? Actually, let me ask this, what kind of code should I write first before diving into this kind of evasion technique? I feel I need to write some small Windows system software like duplicating Process Explorer, to get familiar with Win32 programming and Windows system programming, but I could be wrong?
I think I do have a study path, but it's full of gaps. I work as a data engineer -- the kind that I wouldn't even bother to call an engineer /s
I know quite a few offensive security pros that are way better than I will ever be at breaking into systems and evading detections that can only barely program anything beyond simple python scripts.
It’s a great goal to eventually learn everything, but knowing the correct tools and techniques and how and when to use them most effectively are very different skillsets from discovering new vulnerabilities or writing new exploit code and you can start at any of them.
Compare for instance a physiologist, a gymnastics coach, and an Olympic gymnast. They all “know how the human body works” but in very different ways and who you’d go to for expertise depends on the context.
Similarly just start with whatever part you are most interested in. If you want to know the techniques and tools you can web search and find lots of details.
If you want to know how best to use them you should set up vulnerable machines (or find a relevant CTF) and practice. If you want to understand how they were discovered and how people find new ones you should read writeups from places like Project Zero that do that kind of research. If you’re interested in writing your own then yes you probably need to learn some system programming. If you enjoy the field you can expand your knowledge base.
My contacts abroad are saying "that software US government mandated us to install on our clients and servers to do business with US companies is crashing our machines".
When did Crowdstrike get this gold standard super seal of approval? What could they be referring to?
I guarantee you that the damage caused by Crowdstrike today will significantly outweigh any security benefits/savings that using their software might have had over the years.
* lights-out interfaces not segregated from the business network. Bonus points if it's a Supermicro, which discloses the password hash to unauthenticated users as a design feature.
* operational technology not segregated from information technology
* Not a windows bug, but popular on windows: 3rd party services with unquoted exe and uninstall strings, or service executable in a user-writable directory.
I remediate pentests as well as real-world intrusion events and we ALWAYS find one of these as the culprit. An oopsie happening on the public website leading to an intrusion is actually an extreme rarity. It's pretty much always email > standard user > administrator.
I understand not liking EDR or AV but the alternative seems to be just not detecting when this happens. The difference between EDR clients and non-EDR clients is that the non-EDR clients got compromised 2 years ago and only found it today.
Thanks for the list. I got this job as the network administrator at a community bank 2 years ago and 9/9 of these were on/enabled/not secured. I've got it down to only 3/9 (dhcpv6, unquoted exe, operational tech not segregated from info tech).
I'm asking for free advice, so feel free to ignore me, but of these three unremediated vectors, which do you see as the culprit most often?
dhcpv6 poisoning is really easy to do with metasploit and creates a MITM scenario. It's also easy to fix (dhcpv6guard at the switch, a domain firewall rule, or a 'prefer ipv4' reg key).
Unquoted paths are used for persistence and are just an indicator of some other compromise. There are some very low-impact scripts on GitHub that can take care of it.
Network segregation, the big thing I see in financial institutions is the cameras. Each one has its own shitty webserver, chances are the vendor is accessing the NVR with teamviewer and just leaving the computer logged in and unlocked, and none of the involved devices will see any kind of update unless they break. Although I've never had a pentester do anything with this I consider the segment to be haunted.
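On the 'prefer ipv4' reg key mentioned for the DHCPv6 item above: the usual mechanism is Microsoft's documented DisabledComponents value, which makes Windows prefer IPv4 in prefix policies without disabling IPv6 outright (a sketch only; check it against your own group policy before rolling it out, and note it needs a reboot):

    rem Prefer IPv4 over IPv6 in prefix policies (0x20), per the
    rem DisabledComponents documentation; reboot to take effect.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters" /v DisabledComponents /t REG_DWORD /d 0x20 /f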
I believe the question was 'in which ways is windows vulnerable by default', and I answered that.
If customers wanted to configure them properly, they could, but they don't. EDR will let them keep all the garbage they seem to love so dearly. It doesn't just check a box, it takes care of many other boxes too.
At work we have two sets of computers. One gets beamed down by our multi-national overlords, loaded with all kinds of compliance software. The other is managed by local IT and only uses windows defender, has some strict group policies applied, BMCs on a separate vlans etc.
Both pass audits, for whatever that's worth.
Believe it or not, most users don't run around downloading random screensavers or whatever. Instead they are receiving phishing emails, often from trusted contacts who have recently been compromised, using the same style of message that they are used to receiving, that give the attacker a foothold on the computer. From there, you can use a commonly available insecure legacy protocol or other privilege escalation technique to gain administrative rights on the device.
You don't need exploits to remotely access and run commands on other systems, steal admin passwords, and destroy data. All the tools to do that are built into Windows. A large part of why security teams like EDR is that it gives them the data to detect abuse of built-in tools and automatically intervene.
Not the same poster, but one phase of a typical attack inside a corporate network is lateral movement. You find creds on one system and want to use them to log on to a second system. Often, these creds have administrative privileges on the second system. No vulnerabilities are necessary to perform lateral movement.
Just as an example: you use a mechanism similar to psexec to execute commands on the remote system using the SMB service. If the remote system has a capable EDR, it will shut that down and report the system from which the connection came from to the SOC, perhaps automatically isolate it. If it doesn't, an attacker moves laterally through your entire network with ease in no time until they have domain admin privs.
Anyone who claims CS is nothing but a compliance checkbox has never worked as an actual analyst. Of course it's effective... no, dur, it's worth 50bn for no reason... god some people are stupid AND loud.
Every company I’ve ever worked at has wound up having to install antivirus software to pass audits. The software only ever caused problems and never caught anything. But hey, we passed the audit so we’re good right?
A long time ago I was working for a web hoster and had to help customers operating web shops pass the audits required for credit card processing.
Doing so regularly involved allowing additional SSL ciphers we deemed insecure, and undoing other configurations for hardening the system. Arguing about it is pointless - either you make your system more insecure, or you don't pass the audit. Typically we ended up configuring it in a way that let us easily toggle between the two states: we reverted to a secure configuration once the customer got their certificate, and flipped it back to insecure when it was time to reapply for the certification.
This tracks for me. PA-DSS was a pain with ssl and early tls... our auditor was telling us to disable just about everything (and he was right) and the gateways took forever to move to anything that wasn't outdated.
Then our dealerships would just disable the configuration anyway.
The dreaded exposed loopback interface... I'm an (internal) auditor, and I see huge variations in competence. Not sure what to do about it, since most technical people don't want to be in an auditor role.
We did this at one place I used to work at. We had lots of Linux systems. We installed clamAV but kept the service disabled. The audit checkbox said “installed” and it fulfilled the checkbox…
Yes, it offers very real protection. CrowdStrike in particular is the best in the market, speaking from experience, having worked with their competitors' products as well and responded to real-world compromises.
I'm a dev rather than infra guy, but I'm pretty sure everywhere I've worked which has a large server estate has always done rolling patch updates, i.e. over multiple days (if critical) or multiple weekends (if routine), not blast every single machine everywhere all at once.
If this comment tree: https://news.ycombinator.com/item?id=41003390 is correct, someone at Crowdstrike looked at their documented update staging process, slammed their beer down, and said: "Fuck it, let's test it in production", and just pushed it to everyone.
Which of course begs the question: How were they able to do that? Was there no internal review? What about automated processes?
For an organization it's always the easiest, most convenient answer to blame a single scapegoat, maybe fire them... but if a single bad decision or error from an employee has this kind of impact, there's always a lack of safety nets.
This is not a patch per se; it was CrowdStrike updating their virus definitions, or whatever their internal database is called.
Such things are usually enabled by default to auto-update, because otherwise you lose a big part of the interest (if there's any) of running an antivirus.
Surely there should be at least some staging on update files as well, to avoid the "oops, we accidentally blacklisted explorer.exe" type things (or, indeed, this)?
This feels like an auto-update functionality. For something that's running in kernel space (presumably, if it can BSOD you?) Which is fucking terrifying.
Windows IT admins of the world, now is your time. This is what you've trained for. Everything else has led to this moment. Now, go and save the world!!
Does it require to physically go to each machine to fix it? Given the huge number of machines affected, it seems to me that if this is the case, this outage could last for days.
The workaround involves booting into Safe mode or Recovery environment, so I'd guess that's a personal visit to most machines unless you've got remote access to the console (e.g. KVM)
It gets worse if your machines have BitLocker active: lots of typing required. And it gets even worse if the servers that store your BitLocker keys also have BitLocker active and are also held captive by CrowdStrike lol
I've already seen a few posts mentioning people running into worst-case issues like that. I wonder how many organizations are going to not be able to recover some or all of their existing systems.
Presumably at some point they'll be back to a state where they can boot to a network image, but that's going to be well down the pyramid of recovery. This is basically a "rebuild the world from scratch" exercise. I imagine even the out of band management services at e.g. Azure are running Windows and thus Crowdstrike.
• Servers, you have to apply the workaround by hand.
• Desktops, if you reboot and get online, CrowdStrike often picks up the fix before it crashes. You might need a few reboots, but that has worked for a substantial portion of systems. Otherwise, it’ll need a workaround applied by hand.
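For the "by hand" part, the workaround being passed around boils down to one delete from the Safe Mode or recovery command prompt (same path and filename pattern as in the AWS guidance quoted further down the thread; adjust the drive letter if WinRE mounts the Windows volume as something other than C:):

    rem From Safe Mode or the WinRE command prompt
    del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys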
This is insane. The company I currently work for provides dinky forms for local cities and such, where the worst thing that could happen is that somebody will have to wait a day to get their license plates, and even we aren't this stupid.
I feel like people should have to go to jail for this level of negligence.
Maybe someone tried to backdoor Crowdstrike and messed up some shell code? It would fit and at this point we can't rule it out, but there is also no good reason to believe it. I prefer to assume incompetence over maliciousness.
>True for all systems, but AV updates are exempt from such policies. When there is a 0day you want those updates landing everywhere asap.
This is irrational. The risk of waiting for a few hours to test in a small environment before deploying a 0-day fix is marginal. If we assume the AV companies already spent their sweet time testing, surely most of the world can wait a few more hours on top of that.
Given this incident, it should be clear the downsides of deploying immediately at a global scale outweigh the benefits. The damage this incident caused might even be more than all the ransomware attacks combined. How long to take for extra testing will depend on the specific organization, but I hope nobody will let CrowdStrike unilaterally impose a standard again.
I wonder if the move to hybrid estates (virtual + on prem + issued laptops etc) is the cause. Having worked in only on prem highly secure businesses no patches would be rolled out intra week without a testing cycle on a variety of hardware.
I consider it genuinely insane to allow direct updated from vendors like this on large estates. If you are behind a corporate firewall there is also a limit to the impact of discovered security flaws and thus reduced urgency in their dissemination anyway.
Most IT departments would not be patching all their servers or clients at the same time when Microsoft release updates. This is a pretty well followed standard practice.
For security software updates this is not a standard practice, I'm not even sure if you can configure a canary update group in these products? It is expected any updates are pushed ASAP.
For an issue like this though Crowdstrike should be catching it with their internal testing. It feels like a problem their customers should not have to worry about.
Their announcement (see Reddit for example) says it was a “content deployment” issue which could suggest it’s the AV definitions/whatever rather than the driver itself… so even if you had gradual rollout for drivers, it might not help!
I came to HN hoping to find more technical info on the issue, and with hundreds of comments yours is the first I found with something of interest, so thanks! Too bad there's no way to upvote it to the top.
In most appreciations of risk around upgrades in environments with which I am familiar, changing config/static data etc. counts as a systemic update and is controlled in the same way.
A proper fix means that a failure like this causes you a headache, it doesn't close all your branches, or ground your planes, or stop operations in hospitals, or take your tv off air.
You do that by ensuring that a single point of failure (like virus definition updates, an unexpected bug in software which hits on Jan 29th, or leap seconds going backwards) can't affect all your machines at the same time.
Yes it will be a pain if half your checkin desks are offline, but not as much as when they are all offline.
Wow that's terrible. I'm curious as to whether your contract with them allows for meaningful compensation in an event like this or is it just limited to the price of the software?
Let's say you're a CISO and it's your task to evaluate Cybersecurity solutions to purchase and implement.
You go out there and find out that there are multiple organizations that periodically test (simulate attacks against) the EDR capabilities of these vendors and publish grades for them.
You pick the top 5 to narrow down your selection and pit them against each other in a PoC which consists of attack simulations and end-to-end solutions (that's the Response part of EDR).
The winner gets the contract.
Unless there are tie-breakers...
PS: I heard others (and read) said that CS was best-in-class which suggested that they probably won PoC and received high grades from those independent Organizations.
I don't mean this to be rude or as an attack, but do you just auto update without validation?
This appears to be a clear fault from the companies where the buck stops - those who _use_ CS and should be validating patches from them and other vendors.
I'm pretty sure crowdstrike autoupdates, with 0 option to disable or manually rollout updates. Even worse people running N-1 and N-2 channels also seem to have been impacted by this.
I think it's probably not a kernel patch per se. I think it's something like an update to a data file that Crowdstrike considers low risk, but it turns out that the already-deployed kernel module has a bug that means it crashes when it reads this file.
Apparently, CS and ZScaler can apply updates on their own and thats by design, with 0day patches expected to be deployed the minute they are announced.
Why do they "have to"? Why can't company sysadmins at minimum configure rolling updates or have a 48 hour validation stage - either of which would have caught this. Auto updating external kernel level code should never ever be acceptable.
But isn't that a fairly tiny risk, compared with letting a third party meddle with your kernel modules without asking nicely? I've never been hit by a zero-day (unless Drupageddon counts).
I would say no, it's definitely not a tiny risk. I'm confused what would lead you to call getting exploited by vulnerabilities a tiny risk -- if that were actually true, then Crowdstrike wouldn't have a business!
Companies get hit by zero days all the time. I have worked for one that got ransomwared as a result of a zero day. If it had been patched earlier, maybe they wouldn't have gotten ransomwared. If they start intentionally waiting two extra days to patch, the risk obviously goes up.
Companies get hit by zero day exploits daily, more often than Crowdstrike deploys a bug like this.
It's easy to say you should have done the other thing when something bad happens. If your security vendor was not releasing definitions until 48 hours later than they could have, when some huge hack happened becuase of that obviously the internet commentary would say they were stupid to be waiting 48 hours.
But if you think the risk of getting exploited by a vulnerability is less than the risk of being harmed by Crowdstrike software, and you are a decision maker at your organization, then obviously your organization would not be a Crowdstrike customer! That's fine.
CS doesn't force you to auto-upgrade the sensor software – there is quite some FUD thrown around at this moment. It's a policy you can adjust and apply to different sets of hosts if needed. Additionally, you can choose if you want the latest version or a number of versions behind the latest version.
What you cannot choose, however - at least to my knowledge - is whether or not to auto-update the release channel feed and IOC/signature files. The crashes that occurred seem to have been caused by the kernel driver not properly handling invalid data in these auxiliary files, but I guess we have to wait on/hope for a post-mortem report for a detailed explanation. Obviously, only the top-paying customers will get those details...
Stop the pandering. You know very well CrowdStrike doesn't offer good protection to begin with!
Everyone pays for legal protection. After something happens, you can show you did everything (which means nothing; well, now this shows it's even worse than nothing) by showing you paid them.
If they tell you to disable everything, what does it change? They're still your blame shield, which is the reason you have CS.
... the only real feature anybody cares about is inventory control.
You said Crowdstrike doesn't offer protection, but there are plenty of people in this thread who suggested it actually does, and it seems to be highly regarded in the field.
Facts speak louder than words. If you cared about protection you would be securing your system, not installing yet more things, especially one that requires you to open up several other attack vectors. But I will never manage to make you see it.
Writing software in the safest programming language, for a mission-critical product deployed on the most secure and stable OS that the world depends on, would be a developer's wet dream.
Crowdstrike though is not part of a system of engineered design.
It’s a half-baked rootkit sold as a fig leaf for incompetent IT managers so they can implement ”best practices” on their company's PCs.
The people purchasing it don’t actually know what it does; they just know it’s something they can invest their cybersecurity budget into and have an easy way to fulfill their ”implement cybersecurity” KPIs without needing to do anything themselves.
Exactly, and this is why I've heard the take that the companies who integrate this software need to be held responsible for not having proper redundancy, and while that's a fine take, we need to keep piling blame on CrowdStrike and even Microsoft. They're the companies that drum the beat of war every chance they get, scaring otherwise reasonable people into thinking that the cyber world is ending and only their software can save them, who push stupid compliance and security frameworks, and straight-up lie to their prospects about the capabilities and stability of their product. Microsoft sets the absolutely dog water standard of "you get updates, you can't turn them off, you can't stagger them, you can't delay them, you get no control, fuck you".
Perhaps true in some cases, but in regulated industries (for example, fed-regulated banks) a tool like CrowdStrike addresses several controls that, if left uncontrolled, result in regulatory fines. Regulated companies rarely employ home-grown tools due to maintenance risk. But now, as we see, these rootkit or even agent-based security tools bring their own risks.
I’m not arguing against the need to follow regulations. I’m not familiar with what specifically is required of banks. All I’m saying is that Crowdstrike sucks as a specific offering. I’m sure there are worse ways to check the boxes (there always are), but that’s not much of a praise.
My rant is from a perspective in an org that most certainly was not a bank (b2b software/hardware), and there was enough of a ruckus to tell it was not mandated there by any specific regulation (hence incompetence).
A properly used endpoint protection system is a powerful tool for security.
It's just that you can gamble on compliance by claiming you have certain controls handled by purchasing CrowdStrike... then leave it not properly deployed and without an actual security team in control of it (maybe there will be a few underpaid and overworked people getting pestered with BS from management).
I think a lot about software that is fundamentally flawed but gets propelled up in value due to great sales and marketing. It makes me question the industry.
It's interesting that this is being referred to as a black swan event in the markets. If you look at the SolarWinds fiasco from a few years ago, there are some differences, but it boils down to problems with shitty software having too many privileges being deployed all over the place. It's a weak mono culture and eventually a plague will cause devastation. I think a screw up for these sorts of software models shouldn't really be thought of as a black swan event, but instead an inevitability.
That is how all of these tools are. I have always told people that third-party virus scanners are just viruses that we are ok with having. They slow down our computers, reduce our security, many of them have keyloggers in them (to detect other keyloggers). We just trust them more than we trust unknown ones so we give it over to them.
CrowdStrike is a little broader of course. But yeah, it's a rootkit that we trust to protect us from other rootkits. It's like fighting fire with fire.
That’s my experience as an unfortunate user of a PC as a software engineer in an org where every PC was mandated to install crowdstrike. Fortune 1000.
It ran amok on every PC it was installed on. Nobody could tell exactly what it did, or why.
Engineering management attempted to argue against it. This resulted in quite public discourse, which made obvious the incompetence of the relevant parties in IT management related to its implementation.
Not _negligently_ incompetent. Just incompetent enough that it was obvious they did not understand the system they administered from any set of core principles.
It was also obvious it was implemented only because ”it was a product you could buy to implement cybersecurity”. What this actually meant from a systems architecture point of view was apparently irrelevant.
One could argue the only task of IT management is to act as a dumb middleman between the budget and service providers. So if it’s acceptable that IT managers don’t actually need to know anything about computers, then the claim of incompetence can of course be dropped.
If you realize something horrific, your options are to decide it's not your problem (and feel guilty when it blows up), carefully forget you learned it, or try to do something to get it changed.
Since the last of these involves overcoming everyone else's shared distress in admitting the emperor has no clothes, and the first of these involves a lot of distress for you personally, a lot of people opt for option B.
> overcoming everyone else's shared distress in admitting the emperor has no clothes
I don't disagree, but why do we react this way? Doesn't knowing the emperor has no clothes instill a bit of hope that things can change? I feel for the people who were impacted by this, but I'm also a little bit excited. Like... NOW can we fix it? Please?
The higher up in large organizations you go, in politics or employment or w/e, the more what matters is not facts, but avoiding being visibly seen to have made a mistake, so you become risk-averse, and just ride the status quo unless it's an existential threat to you or something you can capitalize on for someone else's misjudgment.
So if you can't directly gain from pointing out the emperor's missing clothes, there's no incentive to call it out, there's active risk to calling it out if other people won't agree, and moreover, this provides an active incentive for those with political capital in the organization to suppress the embarrassment of anyone pointing out they did not admit the problem everyone knew was there.
(This is basically how you get the "actively suppress any exceptions to people collectively treating something as a missing stair" behaviors.)
I've not seen that at my Fortune 100. I found others willing to agree and we walked it up to the most senior EVP in the corporation. Got face time and we weren't punished. Just, nothing changed. Some of the directors that helped walk it up the chain eventually became more powerful, and the suggested actions took place about 15 years later.
Sure, I've certainly seen exceptions, and valued them a lot.
But often, at least in my experience, exceptions are limited in scope to whatever part of the org chart the person who is the exception is in charge of, and then that still governs everything outside of that box...
It's a nice idea, but has that worked historically? Some people will make changes, but I think we'd be naive to think that things will change in any large and meaningful way.
Having another I-told-you-so isn't so bad, though - it does give us IT people a little more latitude when we tell people that buying the insecurity fix du jour increases work and adds more problems than it addresses.
Sure, on long enough timescales. I mean, there's less lead in the environment than there used to be. We don't practice bloodletting anymore. Things change. Eventually enough will be enough and we'll start using systems that are transparent about what their inputs are and that have a way of operating when the user disables one of those inputs because it's causing problems (e.g. CrowdStrike updates).
I'd just like it to be soon because I'm interested in building such systems and I'd rather be paid to do so instead of doing it on my off time.
There are way too many horrific things in the world to learn about... and then you realize you can't do something about every one of those things. But at least you can tackle one of them! (In my case, antibiotic resistance.)
My issue is: WTF do sooooooo many companies trust this one fucking company, lol. Like it's always some obscure company that every major corporation is trusting, lol. All because CrowdStrike apparently throws good parties for C-level execs, lol.
"First, in some cases, a reboot of the instance may allow for the CrowdStrike Falcon agent to be updated to a previously healthy version, resolving the issue.
Second, the following steps can be followed to delete the CrowdStrike Falcon agent file on the affected instance:
1. Create a snapshot of the EBS root volume of the affected instance
2. Create a new EBS volume from the snapshot in the same Availability Zone
3. Launch a new instance in that Availability Zone using a different version of Windows
4. Attach the EBS volume from step (2) to the new instance as a data volume
5. Navigate to the \windows\system32\drivers\CrowdStrike\ folder on the attached volume and delete "C-00000291*.sys"
6. Detach the EBS volume from the new instance
7. Create a snapshot of the detached EBS volume
8. Create an AMI from the snapshot by selecting the same volume type as the affected instance
9. Call replace root volume on the original EC2 Instance specifying the AMI just created"
Yes it can, that's what I ended up writing at 4am this morning, lol. We manage way more instances than is feasible to do anything by hand. This is probably too late to help anyone, but you can also just stop the instance, detach the root volume, attach it to another instance, delete the file(s), offline the drive, detach it, reattach it to the original instance, and then start the instance. You need a "fixer" machine in the same AZ.
FWIW, I find the high-level overview more useful, because then I can write a script tailored to my situation. Between `bash`, `aws` CLI tool, and Powershell, it would be straightforward to programmatically apply this remedy.
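For what it's worth, a rough boto3 sketch of the stop/detach/fix/reattach loop described above might look like the snippet below. The instance IDs, the "fixer" box, and the device names are placeholders, and the actual deletion of C-00000291*.sys still has to happen on the fixer instance itself (over SSM, RDP, or by hand) between the two halves.

    # Rough sketch, not a drop-in tool: automates only the EC2 control-plane part
    # of the workaround (stop, detach root volume, attach to a fixer instance in
    # the same AZ, then put everything back). Placeholders: affected_id, fixer_id,
    # and the device names.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def root_volume(instance_id: str) -> tuple[str, str]:
        """Return (volume_id, root_device_name) for the instance's root disk."""
        inst = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]
        root = inst["RootDeviceName"]
        for bdm in inst["BlockDeviceMappings"]:
            if bdm["DeviceName"] == root:
                return bdm["Ebs"]["VolumeId"], root
        raise RuntimeError(f"no root volume found for {instance_id}")

    def move_root_to_fixer(affected_id: str, fixer_id: str) -> tuple[str, str]:
        vol, root_dev = root_volume(affected_id)
        ec2.stop_instances(InstanceIds=[affected_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[affected_id])
        ec2.detach_volume(VolumeId=vol)
        ec2.get_waiter("volume_available").wait(VolumeIds=[vol])
        # Attach as a secondary disk on the fixer box (must be in the same AZ).
        ec2.attach_volume(VolumeId=vol, InstanceId=fixer_id, Device="xvdf")
        ec2.get_waiter("volume_in_use").wait(VolumeIds=[vol])
        return vol, root_dev

    # ...delete the CrowdStrike channel file on the fixer instance here...

    def return_root_to_affected(affected_id: str, vol: str, root_dev: str) -> None:
        ec2.detach_volume(VolumeId=vol)
        ec2.get_waiter("volume_available").wait(VolumeIds=[vol])
        ec2.attach_volume(VolumeId=vol, InstanceId=affected_id, Device=root_dev)
        ec2.start_instances(InstanceIds=[affected_id])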
When you see the size of the impact across the world, the number of people who will die because hospital, emergency and logistics systems are down…
You don’t need conventional war any more. State actors can just focus on targeting widely deployed “security systems” that will bring down whole economies and bring as much death and financial damage as a missile, while denying any involvement…
I always think it's easy for state actors to pull off this trick.
Considering that PR review is usually done within the team, a state actor can simply insert a manager, a couple of senior developers, and maybe a couple of junior developers into a large team to do the job. Push something in on a Friday so few people bother to check, get it approved by another implant, and there you go.
Seeing all the cancelled and delayed flights, it makes me think a hacking kind of climate activism/radicalism would be more useful than gluing hands to roads, or throwing paint on art.
Activism is mostly about awareness, because generally you believe your position to be the one a logical person will accept if they learn about it, so doing things that get in the news but only gets you a small fine or month in jail are preferred.
Taking destructive action is usually called "ecoterrorism" and isn't really done much anymore.
Given how obvious the vector is for targeting once it's so widespread, it stands to reason that the same state actors would push phishing schemes and other such efforts in order to justify having a tool like CrowdStrike used everywhere. We are focusing on the bear trap snapping shut here, but someone took the time to set that trap right where we'd be stepping in the first place.
I was in my 20s during the peak hysteria of post-9/11 and the GWOT. I had to work out whether the constant terror threat hyped 24/7 by the media and DHS was real.
The fact that global infra is so flimsy and vulnerable brought me tremendous relief. If the terror threats were real, we would have been experiencing infrastructure attacks daily.
I remember driving through rural California thinking if the terrorist cells were everywhere, they could trivially <attack critical infra that I don't want to be flagged by the FBI for>
I've read a lot of cyber security books like Countdown to Zero Day, Sandworm, and Ghost in the Wires, and each one brings me relief. Many of our industrial systems have the most flimsy, pathetic, unencrypted and uncredentialed wireless control protocols that are vulnerable to remote attack.
The fact that we rarely see incidents like this, and when they do happen, they are due to gross negligence rather than malice, is a tremendous relief.
This is the silver lining of global capitalism. When every power on earth is invested in the same assets there is little interest in rocking the boat unless the financial justification to do so is sufficiently massive.
Until deglobalization sufficiently spreads to the software ecosystem. Just a few hours ago I attended a lecture by a very high-profile German cybersecurity researcher (though he keeps a low profile). The guy is a real greybeard, can fluently read any machine code; he was building and selling Commodore 64 cards at 14. (I don't even know what that is.) He's hell-bent on not letting in any US code nor a single US chip. Intel is building a 2nm fab in Magdeburg, Germany, which will be the most advanced in the world when completed. German companies are developing their own fabs not based on or purchased from ASML. Germans are developing their own chip designs. A new German operating system is being built in Berlin.
Huawei, after their CFO was detained in Canada, took the Linux source code and rewrote it file by file in C++. Now they're using it in all their products, called HarmonyOS. The Chinese are recruiting ex-TSMC engineers in mainland China and giving them everything (free house, car, money, a free pass between Taiwan and China) just to build their own fab in a city whose name I don't know how to spell.
I'm not German, but I'll gladly go through hell alongside the move to deglobalize, or in other words, de-Americanize. This textarea cannot possibly express my anger and hatred toward the past fifty years of the domination of the Imperium Americana. Not for a single moment have they let us live without bloodshed and brutal oppression.
I am not against the idea of civilization, authority, hierarchy and empires. I am against those who are unjust and evil oppressors on the face of Earth.
We are far past that point. So many critical systems are running on autopilot, with people who built and understood them retiring, and a new batch of unaware, aloof, apathetic people at the helm.
There's no real need for some Bad Actor -- at some point, entropy will take care of it. Some trivial thing somewhere will fail, and create a cascade of failures that will be cataclysmic in its consequences.
It's not fear-mongering, it's kind of a logical conclusion to decades of outsourcing, chasing profit above and over anything else, and sheer ignorance borne of privilege. We forgot what it took to build the foundations that keep us alive.
That's just what old people like to think: that they are super important and could never be replaced. A few months ago I replaced a "critical" employee who was retiring, and everyone was worried about what would happen when he was gone. I learned his job in a month.
Most people aren't very important or special and most jobs aren't that difficult.
why the fuck is our critical infrastructure running on WINDOWS. Fuck the sad state of IT. CIOs and CTOs across the board need to be fired and held accountable for their shitty decisions in these industries.
Yes, CRWD is a shitty company, but it seems they are deemed a "necessity" by some stupid audit/regulatory board that oversees these industries. But at the end of the day, these CIOs/CTOs are completely fucking clueless as to the exact functions this software performs on a regular basis. A few minions might raise an issue, but they stupidly ignore them because "rEgUlAtOrY aUdIt rEqUiReS iT!1!"
While Linux isn't a panacea, the OS does matter, as Linux provides tools for security scanners like CrowdStrike to operate entirely in userspace, with just a sandboxed eBPF program performing the filtering and blocking within the kernel. And yes, CrowdStrike supports this mode of operation, which I'll be advocating we switch over to on Monday. So yeah, for this specific failure, Linux provides a specific feature that would have prevented it.
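To make the split concrete, here's a toy sketch using the BCC Python bindings (this is not CrowdStrike's code, and it only observes rather than blocks): the in-kernel part is a tiny eBPF program that the verifier sandboxes, while all the decision-making lives in an ordinary userspace process that can crash without taking the machine down with it.

    # Toy illustration of the userspace/eBPF split, not any vendor's implementation.
    # Requires root and the BCC Python bindings (the 'bcc' package).
    from bcc import BPF

    KERNEL_PROG = r"""
    int on_execve(struct pt_regs *ctx) {
        // Checked by the eBPF verifier: a bug here cannot panic the kernel.
        bpf_trace_printk("execve\n");
        return 0;
    }
    """

    b = BPF(text=KERNEL_PROG)
    b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="on_execve")

    print("watching process launches; Ctrl-C to stop")
    while True:
        try:
            task, pid, cpu, flags, ts, msg = b.trace_fields()
            # All the interesting policy logic happens here, in userspace.
            print(f"{ts:.3f} pid={pid} comm={task.decode(errors='replace')}")
        except KeyboardInterrupt:
            break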
> The OS doesn't matter, the question should be why is critical infrastructure online and allowed to receive OTA updates from third parties.
Not exactly. I think the question is why is critical infrastructure getting OTA updates from third parties automatically deployed directly to PROD without any testing.
These updates need to go to a staging environment first, get vetted, and only then go to PROD. Another upside of that it won't go to PROD everywhere all at once, resulting in such a worldwide shitshow.
I think you have the priority backwards. We shouldn’t be relying on trusting the QA process of a private company for national security systems. Our systems should have been resilient in the face of Crowdstrike incompetence.
> I think you have the priority backwards. We shouldn’t be relying on trusting the QA process of a private company for national security systems. Our systems should have been resilient in the face of Crowdstrike incompetence.
I think you misunderstood me. I wasn't talking about Crowdstrike having a staging environment, I was talking about their customers. So 911 doesn't go down immediately once Crowdstrike pushes a bad update, because the 911 center administrator stages the update, sees that it's bad, and refuses to push it to PROD.
I think that would even provide some resiliency in the face of incompetent system administrators, because even if they just hit "install" on every update, they'll tend to do it at different times of day, which will slow the rollout of bad updates and limit their impact. And the incompetent admin might not hit "install" because he read the news that day.
Lol, if they can't do staging to mitigate balls-ups on the high-availability infrastructure side (Optus in Australia earlier this year pushed a router config that took down 000 emergency calls for a good chunk of the nation), we've got bugger-all hope of big companies getting it right further up the stack in software.
In this case it wasn’t an update to the OS but an update to something running on the OS supplied by an unrelated vendor.
But if we entertain the idea that another OS would not need CrowdStrike or anything else that required updates to begin with, I have doubts. Even your CPU needs microcode updates nowadays.
Of course the OS matters! Windows is a nasty ball of patches in order to maintain backward compatibility with the 80s. Linux and OSX don't have to maintain all the nasty hacks to keep this backward compatibility.
Also, Crowdstrike is a security (patch) company because Windows security sucks to the point they have, by default, real-time virus protection running constantly (runs my CPU white hot for half the day, can you imagine the global impact on the environment?!).
It's so bad on security that it's given birth to a whole industry to fix it, i.e. CrowdStrike. Every time I pass a bluescreen in a train station or on an advertisement I'm like, "Ha! You deserve that for choosing Windows."
IBM's z/OS maintains compatibility with the '60s, and machines running it continue to process billions of transactions every second without taking a break.
The OS matters, as well as the ecosystem and, and this is most important, the developer and operations culture around it.
> Of course the OS matters! Windows is a nasty ball of patches in order to maintain backward compatibility with the 80s. Linux and OSX don't have to maintain all the nasty hacks to keep this backward compatibility.
Just don't tell that to Linus Torvalds :) Because Linux absolutely does maintain compatibility with old ABIs from the '90s.
> Just don't tell that to Linus Torvalds :) Because Linux absolutely does maintain compatibility with old ABIs from the '90s.
That's nothing. IBM's z/OS maintains compatibility with systems dating all the way back to the '60s. If they want to think they are reading a stack of punch cards, the OS is happy to fool them.
You should look into what a kernel driver is. You can panic a Linux kernel with 2 lines of code just as you can panic a Windows kernel, they just got lucky that this fault didn't occur in their Linux version.
And to be honest, I don't think recovering from this would be that much easier for non-technical folk on a fully encrypted Linux machine, not that it's particularly hard on Windows, it's just a lot of machines to do it on.
In Linux it could be implemented as an eBPF thing while most of the app runs in userspace.
And, for specialised uses, such as airline or ER systems, a cut-down specialised kernel with a minimal userland would not require the kind of protection Crowdstrike provides.
But this is 3rd-party software with ring-0 access to all of your computers deciding to break them. The technical features of the OS absolutely do not matter.
The question is whether other OSs would require it to have kernel mode privileges. People run complicated stuff in kernel mode for performance, because the switch to/from userspace is expensive.
Guess what’s also expensive? A global outage is expensive. Much more than taking the performance hit a better, more isolated, design would avoid.
This is true. Linux large fleet management is still missing some features large enterprises demand. Do they need all those features, idk, but they demand them if they're switching from Windows.
Windows also has better ways, such as filter drivers and hooks. If everybody used Linux, CrowdStrike would still opt for the kernel driver, since the software they create is effectively spyware that wants access as deep as possible.
If they opted for an eBPF service but put that into early boot chain, the bootloop or getting stuck could still happen.
The only long time solution is to stop buying software from a company that has a track record of being pushy and having terrible software practices like rolling out updates to the entire field.
I think the only real solution is for MSFT to stop allowing kernel-level drivers, as Apple has already (sort of, but nearly) done. Sure, lots and lots of crap runs on Windows in kernelspace, but what happened today cost a sizable fraction of the world's GDP. There won't be a better wake-up call.
But would the Linux sysadmins of the world play along in the way that the Windows sysadmins of the world did? I think they might have given CrowdStrike the finger and confined it to a smaller blast radius anyhow. And if they wouldn't have... well, they will now.
Once it got popular, I think it would happen. The business people and C-suite would request quick-and-dirty solutions like CrowdStrike's offerings to check boxes when entering new markets and to get around the red tape. So they'd force the Unix people to do as they say, or else.
Agreed. It's a safer culture because it grew up in the wild. Windows, by contrast, is for when everybody you're using it with has the same boss... places where sanity can be imposed by fiat.
If Microsoft is to be blamed here, it's not for the quality of their software, it's for fostering a culture where dangerous practices are deemed acceptable.
> If they opted for an eBPF service but put that into early boot chain, the bootloop or getting stuck could still happen.
If the in-kernel part is simple and passes data to a trusted userland application, the likelihood of a major outage like the one we saw is much reduced.
More specifically why is critical stuff not equipped properly to revert itself and keep working and/or fail over? This should be built-in stuff at this point, have the last working OS snapshot on its own storage chip and automatically flash it back, even if it takes a physical switch… things like this just shouldn’t happen.
> why the fuck is our critical infrastructure running on WINDOWS
Because it’s cheaper.
I feel like many in this thread are obsessing over the choice of OS when the actual core question is why, given the insane money we spend on healthcare, are all healthcare systems shitty and underinvested?
A sensible, well constructed system would have fallbacks, no matter if the OS of choice is Windows or Linux.
The difference is that lots of different companies can share the burden of implementing all that in Linux (or BSD, or anything else) while only Microsoft can implement that functionality in Windows and even their resources are limited.
Very little healthcare functionality would ever need to be created at the OS level. The burden could be shared no matter if machines were running Windows or Linux, they’re mostly just regular applications.
Not talking about the applications - those could be ported and, ideally, financed by something like the UNDP so that the same tools are available everywhere to any interested party.
I'm talking about Crowdstrike's Falcon-like monitoring. It exists to intercept "suspicious" activity by userland applications and/or other kernel modules.
Cheaper? Well, perhaps when you require your OS to have some sort of support contract. And your support vendor charges you unhealthy sums.
And then you get to see the value of the millions of dollars you've paid for support contracts that don't protect your systems at all. But those contracts do protect specific employees. When the sky falls down, the big money execs don't have a solution. But it's not their fault because the support experts they pay huge sums don't have solutions either. Somehow paying millions of dollars to support contractors that can't save you is not seen as a fireable offense. Instead it is a career-saving scapegoat.
Within companies that have been bitten this time, the team that wasn't affected because they made better process decisions will not be promoted as smarter. Their voice will continue to be marginalized by the people whose decisions led to this disaster. Because, hey, look, everyone got bit right? Nobody looks around to notice the people who were not bitten and recognize their better choices. And "I told you so" is a pretty bad look right now.
> I feel like many in this thread are obsessing over the choice of OS when the actual core question is why, given the insane money we spend on healthcare, are all healthcare systems shitty and underinvested?
Because it's basically impossible to compete in the space.
Epic is a pile of horseshit, but you try convincing a hospital to sign up to your better version.
Tons of critical infrastructure in the US is run on IBM z/OS. It doesn't matter what operating system you use; what matters is that updates aren't automatic and everything is as air-gapped as possible.
> why the fuck is our critical infrastructure running on WINDOWS.
That hits the nail on the head.
But it is a rhetorical question. We know why, generally, software sucks, and specifically why Windows is the worst and the most popular.
Good software is developed by pointy-headed nerds (like us), and successful software is marketed to business executives who have serious pathologies.
There are exceptions (I am struggling to think of one) where a serious piece of good software has survived being mass-marketed, but the constraints (basically business and science) conflict.
1/ Linux is just as vulnerable to kernel panics induced by such software. In fact, CS had a similar snafu in mid-April, affecting Linux kernels. Luckily, there are far fewer moronic companies running CS on Linux boxes at scale.
2/ it does offer protection - if you are running total shit architecture and you need to trust your endpoints not to be compromised, something like this is sadly a must.
Incidentally, Google, which prides itself on running a zero-trust architecture, sent a lot of people home on Friday. Not so zero-trust after all, it seems.
No, it's just soooooo bad at security/stability that it gave birth to CrowdStrike. The very fact that CrowdStrike is so big and prevalent is proof of the gaping hole in Windows security. It's given birth to a multibillion-dollar industry!
Crowdstrike/falcon use is not by any means limited to Windows. Plenty of Linux heavy companies mandate it on all infrastructure (although I hope that changes after this incident).
It’s mandated because someone believes Linux is as bad as Windows in that regard.
And, quite frankly, a well configured and properly locked down Windows would be as secure as a locked down Linux install. It’d also be a pain to use, but that’s a different question.
Critical systems should run a limited set of applications precisely to reduce attack surface.
The reality is the wetware that interfaces with any OS is always going to be the weakest link. Doesn't matter what OS they run, I guarantee they will click links and download files from anywhere.
I can pretty easily make it so a user on Linux can't download executables and, even then, can't do any damage without a severe vulnerability. That is actually pretty difficult to do in a typical Windows AD deployment. There is a big difference between the two OSes.
In fact, there's a couple billion Linux devices running around locked down hard enough that the most clueless users you can imagine don't get their bank details stolen.
> Yes, CRWD is a shitty company, but it seems they are deemed a "necessity" by some stupid audit/regulatory board that oversees these industries.
Yep, this is the problem. The part about Windows is a distraction here.
That bullshit regulation is a much larger security issue than Windows. Incomparably so. If you run it over Linux, you'll get basically the same lack of security.
Someone on X has shared the kernel stack trace of the crash.
The faulting driver in the stack trace was csagent.sys.
Now, Crowdstrike has got two mini filter drivers registered with Microsoft (for signing and allocation of altitude).
1) csagent.sys - Altitude (321410)
This altitude falls within the range for Anti-Virus filters.
2) im.sys - Altitude (80680)
This altitude falls within the range for access control drivers.
So, it is clear that the driver causing the crash is their AV driver, csagent.sys.
The workaround that CrowdStrike has given is to delete C-00000291*.sys files from the directory:
C:\Windows\System32\Drivers\CrowdStrike\
These files being suggested to be deleted are not driver files (.sys files) but probably some kind of virus definition database files.
The reason they name these files with the .sys extension is possibly to leverage the Windows System File Checker tool's ability to restore deleted system files.
This seems to be a workaround and the actual fix might be done in their driver, csagent.sys and the fix will be rolled out later.
Anyone with access to a Falcon endpoint might see a change in the timestamp of the driver csagent.sys when the actual fix rolls out.
I've picked the perfect day to return from vacation. Being greeted by thousands of users being mad at you and people asking for your head on a plate makes me reconsider my career choice. Here's to 12 hours of task force meetings...
Huge sympathies to you. If it's any consolation, because the scale of the outage is SO massive and widely reported, it will quickly become apparent that this was beyond your control, and those demanding your 'head on a plate' are likely to appear rather foolish. Hang in there my friend.
To their credit, the stakeholder that asked for my head personally came to me and apologised once they realised that entire airports had been shut down worldwide. But yeah, not a Friday/funday hahaha
Yeah, and these types make any problem worse. Any technical problem also becomes a social problem: dealing with these lunatics while keeping the house of cards from crumbling.
It's not a management thing, it's very much a personality trait ... that for whatever reason seems to survive in pockets of management in most organisations over a certain size.
It's not a trait that survives well at yard-crew level; trade assistants that freak out at spiders either get over it or never make it through apprenticeships to become tradespeople.
In IT, those who deal with failing processes, stopped jobs, smoking hardware, insufficient RAM, and tight deadlines learn to cope or get sidelined or fired (mostly).
To be clear, I've seen people get frazzled at most levels and many job types in various companies.
My thesis is there's a layer of management in which nervous types who utterly lose their cool at the first sign of trouble can survive better than elsewhere in large organisations.
But that's just been my experience over many years in several different types of work domains.
Ohhh absolutely. And it's not just users, it's also management. "How does this affect us? Are we compromised? What are our options? Why didn't we prevent this? How do you prevent this going forward? How soon can you have it back up? What was affected? Why isn't it everyone? Why are things still down? Why didn't X or Y unrelated vendor schlock prevent this?..."
And on and on and on. Just the amount of time spent unproductively discussing this nightmare is going to cost billions.
Nothing is more annoying than having a user ask a litany of questions that are obvious to the person working on the problem, while that person is busy working on the problem and looking for the answers.
They’re valid for a postmortem analysis. They’re not helpful while you’re actively triaging the incident, because they don’t get you any steps closer to fixing it.
Exactly my thinking. Asking these questions doesn't help us now. But after all the action is done, they should be asked. And really should be questions that always get asked from time to time, incident or no incident.
The problem is that you are only focusing on making the computers work and not the system.
"we don't know yet" is a valid response and gives the rest something to work, and it shouldn't annoy you that it's being asked, first of all because if they are asking is because you are already late.
you have to to tell the rest of the team what you know and you don't know, and update them accordingly.
until your team says something the rest don't know if it's a 30 minute thing or the end of the world or if we need to start dusting off the faxes.
Your head belongs on the plate if you can't point back to your recommendations for improving failover posture: identifying core business systems and core function roles, having fully offline emergency systems, warning of the dangers of making cloud services your only services, and then showing that the proposed cost of implementing these systems is lower than the damage an outage to core business services would cause.
Move to a new career if you feel you don't have the ability to push right back against this.
The only surprising thing is that this doesn't happen every month.
Nobody understands their runtime environment. Most IT orgs long ago "surrendered" control and understanding of it, and now even the "management" of it (I use the term loosely) is outsourced.
This is mostly physical machines: in-person kiosks and POS terminals, office desktops, and things like that. Windows is a tiny portion of GCP and AWS and the web in general.
I'm 100% "cloud" with tens of thousands of linux containers running and haven't been affected at all.
"I'm going to install an agent from Company X, on this machine, which it is essential that they update regularly, and which has the potential to both increase your attack surface and prevent not just normal booting but also successful operation of the OS kernel too". I am not going to provide you with a site specific test suite, you're going to just have to trust me that it wont interrupt your particular machine".
Why are so many mission critical hardware connected systems connected to the internet at all or getting automatic updates?
This is just basic IT common sense. You only do updates during a planned outage, after doing an easily reversible backup, or you have two redundant systems in rotation and update and test the spare first. Critical systems connected to things like medical equipment should have no internet connectivity, and need no security updates.
I follow all of this in my own home so a bad update doesn’t ruin my work day… how do big companies with professional IT not know this stuff?
Well that context makes it make a little more sense... I still wouldn't be trusting a service like that for mission critical hardware that shouldn't be connected to the internet in the first place.
The question with these types of services is: is your goal to keep the system as reliable as possible, or to be able to place the blame on a 3rd party when it goes down? If it's a critical safety system that human lives depend on, the answer better be the former.
But that's beside the point in any enterprise environment, or even in an SMB where third parties are doing IT for you.
Your opinion doesn't matter there. Compliance matters. Paper risk-aversion matters. And they don't always align with common IT sense and, as has now been proven, reality.
If you must trust the software not to do rogue updates then I have to swing back into the camp of blaming the operating system. Is Linux better at this?
I've noticed phones have better permissions controls than Windows, seemingly. You can control things like hardware access and file access at the operating system level, it's very visible to the user, and the default is to deny permissions.
But I've also noticed that phone apps can update outside of the official channel, if they choose. Is there any good way to police this without compromising the capabilities of all apps?
Microsoft has tried pushing app deployment and management platforms that would make this kind of thing really possible, but it constantly receives massive pushback. This was the concept of stuff like Windows S, where pretty much all apps have to be the new modern store app package and older "just run the install.exe as admin and double click the shortcut to run" was massively deprecated or impossible.
I’m not an IT professional, but I don’t use antivirus software on my personal macs and linux machines- I do regular rotated physical backups, and only install software digitally signed by trusted sources and well reviewed Pirate Bay accounts (that's a joke :-).
My only windows machine is what I would classify as a mission critical hardware connected/control device, an old Windows 8 tablet I use for car diagnostics- I do not connect it to the internet, and never perform updates on it.
I am an academic and use a lot of old multi-million dollar scientific instruments which have old versions of windows controlling them. They work forever if you don't network them, but the first time you do, someone opens up a browser to check their social media, and the entire system will fail quickly.
Yes. In an environment where you have so many clients that they can DDoS the antivirus management server, you have to stagger the update schedule anyway. The way we set it up, sysadmins/help desk/dev deployments updated on day 1, all IT workstations/test deployments updated on day 2, and all workstations/staging/production deployments on day 3.
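If anyone wants to formalize that kind of schedule, a rough sketch of the idea (deterministic rings plus a soak period per ring) is below; the ring names, percentages, and soak times are made up for illustration.

    # Minimal sketch of deterministic update "rings" for staggering agent updates.
    # Ring names, sizes, and soak times here are hypothetical.
    import hashlib
    from datetime import datetime, timedelta, timezone

    RINGS = [
        ("ring0-it-helpdesk-dev", 0.05, timedelta(hours=24)),   # 5% canary, 24h soak
        ("ring1-test-staging",    0.30, timedelta(hours=24)),   # next 25%, 24h soak
        ("ring2-everything-else", 1.00, timedelta(hours=0)),    # the rest
    ]

    def ring_for_host(hostname: str) -> str:
        """Deterministically map a hostname into a ring by hashing it to [0, 1)."""
        h = int(hashlib.sha256(hostname.encode()).hexdigest(), 16)
        bucket = (h % 10_000) / 10_000
        for name, cutoff, _ in RINGS:
            if bucket < cutoff:
                return name
        return RINGS[-1][0]

    def update_allowed(hostname: str, released_at: datetime, now: datetime) -> bool:
        """Allow the update only after the soak time of all earlier rings has elapsed."""
        target, elapsed = ring_for_host(hostname), timedelta()
        for name, _, soak in RINGS:
            if name == target:
                return now >= released_at + elapsed
            elapsed += soak
        return False

    if __name__ == "__main__":
        released = datetime(2024, 7, 18, 4, 0, tzinfo=timezone.utc)
        now = datetime(2024, 7, 19, 10, 0, tzinfo=timezone.utc)
        for host in ["helpdesk-07", "build-agent-3", "pos-terminal-12"]:
            print(host, ring_for_host(host), update_allowed(host, released, now))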
Probably, implicitly. Have automated regular backups, and don’t let your AV automatically update, or even if it does, don’t log into all your computers simultaneously. If you update/login serially, then the first BSOD would maybe prevent you from doing the same thing on the other (or possibly, send you running to the other to accomplish your task, and BSODing that one too!)
But yeah this is one reason why I don’t have automatic updates enabled for anything, the other major one being that companies just can’t resist screwing with their UIs.
What people aren’t understanding is MOST of the outage isn’t caused by a crowdstrike install itself, it’s caused because something upstream of it (a critical application server) is what got borked, and that’s having a domino effect on everything else.
Remember, there's someone out there right now, without irony, suggesting that AI can fix this. There's someone else scratching their head, wondering why AI hasn't fixed this yet. And there's someone doing a three-week bootcamp in AI, convinced that AI will fix this. I’m not sure which is worse
A heuristic that has served me well for years is that anyone who uses the word “cybersecurity” is likely incompetent and should be treated with suspicion.
My first encounter with CrowdStrike was overwhelmingly negative. I was wondering why for the last couple weeks my laptop slowed to a crawl for 1-4 hours on most days. In the process list I eventually found CrowdStrike using massive amounts of disk i/o, enough to double my compile times even with a nice SSD. Then they started installing it on servers in prod, I guess because our cloud bill wasn’t high enough.
It rather looks like Crowdstrike marketed heavily to corporate executives using a horror story about the bad IT tech guy who would exfiltrate all their data if they didn't give Crowdstrike universal access at the kernel level to all their machines...? It seems more aimed at monitoring the employees of a corporation for insider threats than for defense against APT actors.
How long before companies start consciously de-risking by replacing general-purpose systems like Windows with newer systems with smaller attack surfaces? Why does an airline need to use Windows at all for operations? From what I’ve seen, their backend systems are still running on mainframes. The terminals are accessed on PCs running Windows, but those could trivially be replaced with iPadOS devices that are more locked down than Windows and generally more secure by design.
One of the problems possibly preventing this is that budgets for buying software aren't controlled by people administering the software. Definitely not by people using it.
Often, the cost of switching is too high or too complex to justify. On top of that, many applications commonly run in manufacturing etc. simply do not run on any other OS.
The billions that have been lost, and the lives that have been lost, have, in the blink of an eye, rendered the "too costly to implement" argument moot.
For bean-counting purposes, it's just really convenient that the burden of that cost was transferred onto somebody else, so that the claim can continue to be made that another solution would still be too costly to implement.
Accepting the status quo that got us here in the first place, under the pseudo-rational argument that there are no realistic alternatives, is simply putting one's head in the sand and careening, full steam ahead, into the next wall waiting for us.
That there might not be an alternative available currently does not mean that a new alternative cannot be actively pursued, or that it is not time for extreme introspection.
Certain backend systems run on mainframes, yes. But the airline's website? No (only the booking portion interacts with a mainframe via API calls). Identity management system? No. Etc.
Banks are down so petrol stations and supermarkets are basically closed.
People can't check in to airline flights, various government services including emergency telephone and police are down. Shows how vulnerable these systems are if there's just one failure point taking all those down.
000 was never down, and most supermarkets and servos were still up. It was bad, but ABC appear to not have the internal capacity to validate all reports.
It's pretty bad when the main ABC 7pm news bulletin pretty much had them reading from their iPads, unable to use their normal studio background screens, and they didn't even give us the weather forecast!
CIO here. They are known to be incredibly pushy. In my company we RFP'd for our endpoint & cyber security. We found the CS salesperson had gone over my head to approach our CEO, who is completely non-technical, to try and seal a contract while I was on leave and out of contact for a week (and this was known to them). When our CEO informed me of the approach, we were happy to sign with SentinelOne.
One thing I'm really happy about at my current company is that when a sales person from a vendor (not Crowdstrike) tried that our CEO absolutely ripped them a new one and basically banned that company from being a vendor for a decade.
I had a very similar experience, I was leading the selection process for our new endpoint security vendor, Crowdstrike people:
- verbally attacked/abused a selection team member
- were ranting constantly about golf with our execs
- were dismissive and just annoying throughout
- raised hell with our execs when they learned they were not going to POC, basically went through every one of them simultaneously
- I had to get a rep kicked out of the RFP as he was constantly disrespectful
We did not pick them, and cancelled every other relationship we had with them, in the IR space for example.
I think the update will be applied overnight, which is a different window (no pun intended) dependent on timezone and the impact will be reported when users come back online (or not) and identify the issue.
Currently seeing this happening in real time in the UK.
I was at the supermarket here last night about the time it kicked off. It seemed payWave was down, there were a few people walking out empty handed as they only had Apple Pay, etc on them. But the vast majority of people seemed fine, my chipped credit card worked without issue.
> 7/18/24 10:20PT - Hello everyone - We have widespread reports of BSODs on windows hosts, occurring on multiple sensor versions. Investigating cause. TA will be published shortly. Pinned thread.
This was particularly interesting (from the reddit thread posted above):
> A colleague is dealing with a particularly nasty case. The server storing the BitLocker recovery keys (for thousands of users) is itself BitLocker protected and running CrowdStrike (he says mandates state that all servers must have "encryption at rest").
> His team believes that the recovery key for that server is stored somewhere else, and they may be able to get it back up and running, but they can't access any of the documentation to do so, because everything is down.
> but they can't access any of the documentation to do so, because everything is down.
One of my biggest frustrations with learning networking was not being able to access the internet. Nowadays you probably have a phone with a browser, but back in the day if you were sitting in a data room and you'd configured stuff wrong, you had a problem.
Isn’t that what office safes are for? I don’t know the location, but all the old guard at my company knew that room xyz at Company Office A held a safe with printed out recovery keys and the root account credentials. No idea where the key to the safe is or if it’s a keypad lock instead. Almost had to use it one time.
I'm guessing someone somewhere said that "it must be stored in hard copy in a safe" and the answer was in the range of "we don't have a safe, we'll be fine".
Or worse, if it's like where I worked in the past, they're still in the buying process for a safe (started 13 months ago) and the analysts are building up a general plan for the management of the safe combination.
They still have to start the discussions with the union to see how they'll adapt the salary for the people that will have to remember the code for the safe and who's gonna be legally responsible for anything that happens to the safe.
Last follow-up meeting summary is "everything's going well but we'll have to modify the schedule and postpone the delivery date of a few months, let's say 6 to be safe"
Not just financial / process barriers. I worked for a company in the early 90's that needed a large secure safe to store classified documents and removable hard drives. A significant part of the delay in getting it was figuring out how to get it into the upstairs office where it would be located. The solution involved removing a window and hiring a crane.
When we later moved to new offices, somebody found a solution that involved a 'stair-walking' device that could supposedly get the safe down to the ground floor. This of course jammed when it was halfway down the stairs. Hilarity ensued.
Didn't bookmark it or anything and going back to the original reddit thread I now see that there are close to 9,000 comments, so unfortunately the answer is no...
Absolutely correct. Unfortunately, there is no other solution to this issue. If the laptops were powered down overnight, there might be a stroke of luck. However, this will be one of the most challenging recoveries in IT history, making it a highly unpleasant experience.
Yeah in context we have about 1000 remote workers down. We have to call them and talk through each machine because we can't fix them remotely because they are stuck boot looping. A large proportion of these users are non-technical.
MS Windows Recovery screen (or the OS installer disk) might ask you for the recovery key only, but you can unlock the drive manually with the password as well! I had to do that a week ago after a disk clone gone wrong, so in case someone steps on the same issue (this here is tested with Win 10, but it should be just the same for W11 and Server):
1. Boot the affected machine from the Windows installer disk
2. Use "Repair options"
3. Click through to the option to spawn a shell
4. It will now ask you to unlock the disk with a recovery key. SKIP THAT.
5. In the shell, type: "manage-bde -unlock C: -Password", enter the password
6. The drive is unlocked, now go and execute whatever recovery you have to do.
> Can you even get the secret from the TPM in recovery mode?
Given that you can (relatively trivially) sniff the TPM communication to obtain the key [1], yes, it should be possible. Can't verify it, though, as I long ago switched to Mac as my daily driver and the old cheesegrater Mac I use as a gaming rig doesn't have a hardware TPM chip.
Yeah, I don't need an attack on a weak system; I mean the authorized, legal, normal way of unlocking BitLocker from Windows when you have the right credentials. Windows might not be able to unlock BitLocker with just your password.
I don't know how common it is to disable TPM-stored keys in companies, but on personal licenses, you need group policy to even allow that.
Although this is moot if Windows recovery mode is accepted as the right system by the TPM. But aren't permissions/privileges a bit neutered in that mode?
Most people installed CrowdStrike because an audit said they needed it. I find it exceedingly unlikely that the same audit did not say they have to enable Bitlocker and backup its keys.
I can confirm this. EDR checkbox for CrowdStrike, BitLocker enabled for local disk encryption checkbox. BitLocker backups to Entra because we know reality happens, no checkbox for that.
I know it does for personal accounts once linked to your machine. Years ago, I used the enterprise version and it didn’t, probably because it was “assumed” that it should be done with group policies, but that was in 2017.
Yes, you should be able to pull it from your domain controllers. Unless they're also down, which they're likely to be, seeing as Tier 0 assets are the most likely to have CrowdStrike on them. So you're now in a catch-22.
Rolling back an Active Directory server is a spectacularly bad idea. Better make doubly sure it's not connected to any network before you even attempt to do so.
In theory. I've seen it not happen twice. (The worst part is that you can hit the Bitlocker recovery somewhat randomly because of an irrelevant piece of hardware failing, and now you have to rebuild the OS because the recovery key is MIA.)
It includes PDFs of some relevant support pages that someone printed with their browser 5 hours ago. That's probably the right thing to do in such a situation to get this kind of info publicly available ASAP, but still, oof. Looks like lots of people in the Reddit thread had trouble accessing the support info behind the login screen.
Isn't CrowdStrike the same company that heavily lobbied to make all their features a requirement for government computers?
https://www.opensecrets.org/federal-lobbying/clients/summary...
They have plenty of money for Congress, but seemingly little for any kind of reasonable software development practices. This isn't the first time CrowdStrike has pushed system-breaking changes.
The DNC has since implemented many layers of protection, including CrowdStrike, hardware keys, as well as special auth software from Google. They learned many lessons from 2016.
If I were to hazard a guess I think the OP is attempting to say they are incompetent and wrong in fingering the GRU as the cause of the DNC hacks (even though they were one of many groups that made that very obvious conclusion).
The second link has nothing to do with the DNC breach. It's the Ukrainian military disagreeing with Crowdstrike attributing a hack of Ukrainian software to Russia. And ThreatConnect also attributed it to Russia: https://threatconnect.com/blog/shiny-object-guccifer-2-0-and...
>we assess Guccifer 2.0 most likely is a Russian denial and deception (D&D) effort that has been cast to sow doubt about the prevailing narrative of Russian perfidy
So Ukraine's military and the app creator denied their artillery app was hacked by Russians, which might have caused them to lose some artillery pieces? Sounds like they aren't entirely unbiased. Ironically, DNC initially didn't believe they were hacked either.
There's something of a difference between 'alternative scenarios' and demonstrating that the 'settled' story doesn't fit with the limited evidence. One popular example is that the exploit Crowdstrike claim was used wasn't in production until after they claimed it was used.
>There's something of a difference between 'alternative scenarios' and demonstrating that the 'settled' story doesn't fit with the limited evidence.
You've failed to demonstrate that, since your second link doesn't show the Ukrainian military disputing the DNC hack, just a separate hack of Ukrainian software, and the first link doesn't show ThreatConnect disagreeing with the assessment. ThreatConnect (and CrowdStrike, Fidelis, and FireEye) attributes the DNC hack to Russia.
>One popular example is that the exploit Crowdstrike claim was used wasn't in production until after they claimed it was used.
I see that now. I should have been more careful while searching for and sharing links. I have shot myself in the foot. And I'm not going to waste my time or others digging for and sharing what I think I remembered reading. I've done enough damage today. Thank you for your thorough reply.
According to that link, the most money they contributed to lobbying in any of the past 5 years was $600,000; most years it was around $200,000. That's barely the cost of a senior engineer.
That's probably only the part they had the hard proof for.
Also, the press release[1] says:
> between 2018 and 2022, Senator Menendez and his wife engaged in a corrupt relationship with Wael Hana, Jose Uribe, and Fred Daibes – three New Jersey businessmen who collectively paid hundreds of thousands of dollars of bribes, including cash, gold, a Mercedes Benz, and other things of value
and later:
> Over $480,000 in cash — much of it stuffed into envelopes and hidden in clothing, closets, and a safe — was discovered in the home, as well as over $70,000 in cash in NADINE MENENDEZ’s safe deposit box, which was also searched pursuant to a separate search warrant
This seems to be more than $120K over 4 years. Of course, not all of the cash found may be result of those bribes, but likely at least some of it is.
Ok but that point still defeats the premise that Crowdstrike are spending a large enough amount on lobbying that it is hampering their engineering dept.
I believe the OP was using figurative language. The point seems to be that _something_ is hampering their engineering department and they shouldn't be lobbying the government to have their software so deeply embedded into so many systems until they fix that.
Given its origin and involvement in these high profile cases I always thought Crowdstrike is a government subsidized company which barely has any real function or real product. I stand corrected I guess.
What many people are not talking about is why we are here.
One simple reason:
All eggs in one Microsoft PC basket.
Why in one Microsoft PC basket?
- Most corporate desktop apps are developed for Windows ONLY.
Why are most corporate desktop apps developed for Windows ONLY?
- It is cheaper to develop and distribute, since 90% of corporations use Windows PCs (a chicken-and-egg problem).
- The alternative, Mac laptops, are 3x more expensive, so corporations can't afford them.
- There are no robust industrial-grade Linux laptops from PC vendors (lack of support, fear that Microsoft may penalize them for promoting Linux laptops, etc.).
What could be done instead:
1/ Most large corporations (airlines, hospitals, etc.) can AFFORD to DEMAND that their software vendors provide their business desktop applications in both Windows and Linux versions, and run a mix of both operating systems.
2/ The majority of corporate desktop applications could be web applications (browser-based), removing the lock-in to a single vendor's (Microsoft) Windows PCs/laptops.
Windows is not the issue here. If all of the businesses used Linux, a similar software product, deployed as widely as Crowdstrike, with auto-update, could result in the same issue.
Same goes for the OS; if, let's say, the majority of businesses used RHEL with auto-updates, Red Hat could in theory push an update that would bring down all machines.
Agree. The monoculture simply accelerates the infection because there are no sizable natural barriers to stop it.
Windows and even Intel must take some blame, because in this day and age of vPro on the board and rollbacks built into the OS, it's incredible that there is no "last known good" procedure to boot into the most recent successfully booted environment (didn't NT have this 30 years ago?), or to remotely recover the system. I pity the IT staff that are going to have to talk Bob in Accounting through BitLocker and some .sys file, times 1000s.
IT gets some blame, because this notion that an update from a third party can reach past the logical gatekeeping function that IT provides, directly into their estate, and change things, is unconscionable. Why don't the PCs update from a local mirror that IT runs and that has been through canary testing? Do we trust vendors that much now?
I would posit that RedHat have a slightly longer and more proven track record than Crowdstrike, and more transparent process with how they release updates.
No entity is infallible but letting one closed source opaque corporation have the keys to break everything isn’t resilient.
Yes it is. Windows was created for the "Personal Computer" with zero thought initially put into security. It has been fighting that heritage for 30 years. The reason CrowdStrike exists at all is due to shortcomings (real or perceived) in Windows security.
Unix (and hence Linux and macOS) was designed as a multi-user system from the start, so access controls and permissions were there from the start. It may have been a flawed security model and has been updated over time, but at least it started with some notion of security. These ideas had already expanded to networks before Microsoft ever heard the word Netscape.
> was designed as a multi-user system from the start, so access controls and permissions were there from the start.
Right and Windows NT wasn't? Obviously it supported all of those things from the very beginning (possibly even in a superior way to Unix in some cases considering it's a significantly more modern OS)...
The fact that MS developed another OS called Windows (3.1 -> 95 -> 98) prior to that which was to some extent binary compatible with NT seems somewhat tangential. Otherwise the same arguments would surely apply to MacOS as well?
> These ideas had already expanded to networks before Microsoft ever heard the word Netscape.
Does not seem like a good thing on its own to me. It just solidifies the fact that it's an inherently less modern OS than Windows (NT) (which still might have various design flaws, obviously, that might be worth discussing; it just has nothing whatsoever to do with what you're claiming here...)
We have Crowdstrike on our Linux fleet. It is not merely a malware scanner but is capable of identifying and stopping zero-day attacks that attempt local privilege escalation. It can, for example, detect and block attempts to exploit CVE-2024-3094 - the xz backdoor.
Perhaps we need to move to an even more restrictive design like Fuchsia, or standardize on an open-source eBPF-based utility that's built, tested, and shipped with a distribution's specific kernel, but Windows is not the issue here.
Security is a complex and deeply evolved field. Many modern required security practices are quite recent from a historical perspective because we simply didn't know we would need them.
A safe security first OS from 20 years ago would most likely be horribly insecure now.
Yes, staggered software updates are the way to go. There was a reply in this thread on why CrowdStrike did not do it: they don't want the extra engineering cost.
Having a third of an airline's computers on Windows, a third on RHEL, and a third on Ubuntu makes it unlikely they all hit the same problem at the same time.
But you're more likely to encounter problems. That's likely a good thing as it improves your DR documentation and processes but could be a harder sell to the suits.
But then it'd be putting all eggs in the Linux PC basket, wouldn't it? I think the point was that more heterogeneity would make this not be a problem. If all your potatoes are the same potato, it only takes one bad blight epidemic to kill off all farmed potatoes in a country. With more heterogeneity, things like that don't happen.
The difference being that RHEL has a QA process which CrowdStrike apparently does not. The quality practices of companies involved in open source are apparently much higher than those of large closed-source "security" firms.
I guess getting whined at because obscure things break in beta or rc releases has a good effect for the people using LTS.
Maybe this is pie-in-the-sky thinking, but if all the businesses used some sort of desktop variant of Android, the Crowdstrike app (to the extent that such a thing would even be necessary in the first place) would be sandboxed and wouldn't have the necessary permissions to bring down the whole operating system.
When notepad hits an unhandled exception and the OS decides it's in an unpredictable state, the OS shuts down notepad's process. When there's an unhandled exception in kernel mode, the OS shuts down the entire computer. That's a BSOD in Windows or a kernel panic in Linux. The problem isn't that CrowdStrike is a normal user mode application that is taking down Windows because Windows just lets that happen, it's that CrowdStrike has faulty code that runs in kernel mode. This isn't unique to Windows or Linux.
The main reason they need to run in kernel mode is you can't do behavior monitoring hooks in user mode without making your security tool open to detection and evasion. For example, if your security tool wants to detect whenever a process calls ShellExecute, you can inject a DLL into the process that hooks the ShellExecute API, but malware can just check for that in its own process and either work around it or refuse to run. That means the hook needs to be in kernel mode, or the OS needs to provide instrumentation that allows third party code to monitor calls like that without running in kernel mode.
IMO, Windows (and probably any OS you're likely to encounter in the wild) could do better providing that kind of instrumentation. Windows and Office have made progress in the last several years with things like enabling monitoring of PowerShell and VBA script block execution, but it's not enough that solutions like CrowdStrike can do their thing without going low level.
Beyond that, there's also going to be a huge latency between when a security researcher finds a new technique for creating processes, doing persistence, or whatever and when the engineering team for an OS can update their instrumentation to support detecting it, so there's always going to be some need for a presence in kernel mode if you want up to date protection.
I mean, to me that's just a convincing argument that kernel-mode spywa-, err, endpoint protection with OTA updates that give you no way to stage or test them yourself cannot be secure.
How are those arguments against kernel level detection from a security perspective?
His arguments show that without kernel-level access you either can't catch all bad actors, since they can evade detection, or the latency is so high that an attacker basically has free rein for some time after detection.
The SolarWinds story was quickly forgotten, and this one will be too, and we'll continue to build such special single points of global catastrophic failure into our craftily architected, decentralized, highly robust, horizontally scaled, multi-datacenter-region systems.
The SolarWinds story wasn't forgotten. Late last year the SEC launched a complaint against SolarWinds and its CISO. It was only yesterday that many of the SEC's claims against the CISO were dismissed.
Solarwinds is still dealing with the reputation damage and fallout today from that breach. People don’t forget about this stuff. the lawsuits will likely be hitting crowdstrike for years to come
No less than three baskets, or you cannot apply for bailouts. If you want to argue your industry is a load-bearing element in the economy: no less than three baskets.
Making everything browser based doesn't help (unless you can walk across the room and touch the server). The web is all about creating fast-acting local dependency on the actions of far-away people who are not known or necessarily trusted by the user. Like crowdstrike, it's about remote control, and it's exactly that kind of dependency that caused this problem.
I love piling on Microsoft as much as the next guy, but this is bigger than that. It's a structural problem with how we (fail to) manage trust.
If it's true that a bad patch was the reason for this I assume someone, or multiple people, will have a really bad day today. Makes me wonder what kind of testing they have in place for patches like this, normally I wouldn't expect something to go out immediately to all clients but rather a gradual rollout. But who knows, Microsoft keeps their master keys on a USB stick while selling cloud HSM so maybe Crowdstrike just yolos their critical software updates as well while selling security software to the world.
Sounds like it was a 'channel file' which I think is akin to an av definition file that caused the problem rather than an actual software change. So they must have had a bug lurking in their kernel driver which was uncovered by a particular channel file. Still, seems like someone skipped some testing.
How about a try-catch block? The software reading the definition file should be minimally resilient against malformed input. That's like programming 101.
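To make that concrete, here's a minimal sketch in Rust of what "minimally resilient" parsing could look like. The record layout is invented for illustration (the real channel file format isn't public); the point is that every read is bounds-checked and malformed input comes back as an error the caller can handle, rather than a fault.

    // Hypothetical, simplified record layout: 4-byte id, 4-byte length, then
    // `length` pattern bytes, repeated. Nothing here panics or reads out of
    // bounds on a truncated or corrupted file.
    #[derive(Debug)]
    struct Signature { id: u32, pattern: Vec<u8> }

    #[derive(Debug)]
    enum ParseError { Truncated, PatternTooLarge }

    fn parse_signatures(data: &[u8]) -> Result<Vec<Signature>, ParseError> {
        const MAX_PATTERN_LEN: usize = 4096; // refuse absurd lengths instead of allocating blindly
        let mut out = Vec::new();
        let mut rest = data;
        while !rest.is_empty() {
            if rest.len() < 8 {
                return Err(ParseError::Truncated);
            }
            let id = u32::from_le_bytes(rest[0..4].try_into().unwrap());   // cannot fail: length checked above
            let len = u32::from_le_bytes(rest[4..8].try_into().unwrap()) as usize;
            if len > MAX_PATTERN_LEN {
                return Err(ParseError::PatternTooLarge);
            }
            if rest.len() < 8 + len {
                return Err(ParseError::Truncated);
            }
            out.push(Signature { id, pattern: rest[8..8 + len].to_vec() });
            rest = &rest[8 + len..];
        }
        Ok(out)
    }

    fn main() {
        // A driver would keep running on the last known-good definitions here;
        // this sketch just reports the rejection.
        match parse_signatures(&[0x01, 0x00]) {
            Ok(sigs) => println!("loaded {} signatures", sigs.len()),
            Err(e) => eprintln!("rejecting malformed channel file: {:?}", e),
        }
    }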
Reputational damage from this is going to be catastrophic. Even if that’s the limit of their liability it’s hard not to see customers leaving en masse.
Ironically some /r/wallstreetbets poster put out an ill-informed “due diligence” post 11 hours ago concerning CrowdStrike being not worth $83 billion and placing puts on the stock.
Everybody took the piss out of them for the post. Now they are quite likely to become very rich.
Not sure what material in their post is ill-informed. Looks like what happened today is exactly what that poster warned of in one of their bullet points.
Yea, everyone is dunking on OP here. But they essentially said that crowdstrike's customers were all vulnerable to something like this. And we saw a similar thing play out only a few years ago with SolarWinds. It's not surprising that this happened. Ofc with making money the timing is the crucial part which is hard to predict.
Is the alternative "mass hacking"? I thought all this software did was check a box on some compliance list. And slow down everyone's work laptop by unnecessarily scanning the same files over and over again.
As someone said earlier in these comments the software is required if you want to operate with government entities. So until that requirement changes it is not going anywhere and continues to print money for the company.
But then, if what you say is true and their software is indeed mandatory in some context, they also have no incentive or motivation to care about the quality of their product, about it bringing actual value or even about it being reliable.
They may just misuse this unique position in the market and squeeze as much profit from it as possible.
The mere fact that there exists such a position in the market is, in my opinion, a problem because it creates an entity which has a guaranteed revenue stream while having no incentive to actually deliver material results.
If the government agencies insist on using this particular product then you're right. If it's a choice between many such products then there should be some competition between them.
From experiencing different AV products at various jobs, they all use kernel level code to do their thing, so any one of them can have this situation happen.
You, the admin, don't get to see what Falcon is doing before it does it.
Your security ppl. have a dashboard that might show them alerts from selected systems if they've configured it, but Crowdstrike central can send commands to agents without any approval whatsoever.
We had a general login/build host at my site that users began having terrible problems using. Configure/compile stuff was breaking all the time. We thought...corrupted source downloads, bad compiler version, faulty RAM...finally, we started running repeated test builds.
Guy from our security org then calls us. He says: "Crowdstrike thinks someone has gotten onto linux host <host>, and has been trying to setup exploits for it and other machines on the network; it's been killing off the suspicious processes but they keep coming back..."
We had to explain to our security that it was a machine where people were expected to be building software, and that perhaps they could explain this to CS.
"No problem; they'll put in an exception for that particular use. Just let us know if you might running anything else unusual that might trigger CS."
TL;DR-please submit a formal whitelist request for every single executable on your linux box so that our corporate-mandate spyware doesn't break everyone's workflow with no warning.
Extremely unlikely. This isn't the first blowup Crowdstrike has had; though it's the worst (IIRC), Crowdstrike is "too big to fail" with tons of enterprise customers who have insane switching costs, even after this nonsense.
Unfortunately for all of us, Crowdstrike will be around for awhile.
Businesses would be crazy to continue with Crowdstrike after this. It's going to cause billions in losses to a huge number of companies. If I was a risk assessment officer at a large company I'd be speed dialling every alternative right now.
A friend of mine who used to work for Crowdstrike tells me they're a hot mess internally and it's amazing they haven't had worse problems than this already.
That sounds like any other companies I have ever worked for: looks great from the outside but a hot mess on the inside.
I have never worked for a company where everything is smooth sailing.
What I noticed is that the smaller the company, the less hot mess they are but at the same time they're also struggling to pay the bill because they don't innovate fast.
As of 4am NY time CRWD has lost $10Bn (~13%) in market cap. Of course they've tested, just not enough for this issue (as is often the case).
This is probably several seemingly inconsequential issues coming together.
I'm not sure why, though, when the system is this important, even successfully tested updates aren't rolled out piecemeal (or perhaps they were, and we're only seeing the result of partial failures around the world).
Testing is never enough. In fact, it won't catch 99% of issues, because tests usually cover only the happy paths, or only what humans can think of, and they are by no means exhaustive.
A robust canarying mechanism is the only way you can limit the blast radius.
Set up A/B testing infra at the binary level so you can ship updates selectively and compare their metrics.
Been doing this for more than 10 years now, it's the ONLY way.
I'm not sure that justifies potentially bricking the devices of hundreds(?) of your clients by shipping untested updates to them. Of course it depends... and would require deeper financial analysis.
> They won't be able to test exhaustively every failure mode that could lead to such issues.
That might be acceptable. My point is that if you are incapable of having even absolutely basic automated tests (that would take a few minutes at most) for extremely impactful software like this, then starting with something more complex seems like a waste of time (clearly the company is run by incompetent people, so they'd just mess it up).
And when it’s more costly for customers to walk back the mistake of adopting your service.
Yeah, I get the impression a lot of SaaS companies operate on this model these days. We just signed with a relatively unknown CI platform, because they were available for support during our evaluation. I wonder how available they’ll be when we have a contract in place…
Doesn't matter what testing exists. More scale. More complexity. More Bugs.
It's like building a gigantic factory farm. And then realizing that the environment itself is the birthing chamber and breeding ground of superbugs with the capacity to wipe out everything.
I used to work at a global response center for big tech once upon a time. We would get hundreds of issues we couldn't replicate, because we would literally have had to set up our own govt or airline or bank or telco to test certain things.
So I used to joke with the corporate robots to just hurry up and take over govts, airlines, banks and telcos already, because that's the only path to better control.
> It's like building a gigantic factory farm. And then realizing that the environment itself is the birthing chamber and breeding ground of superbugs with the capacity to wipe out everything.
Testing + a careful incremental rollout in stages is the solution. Don't patch all systems world-wide at once, start with a few, add a few more, etc. Choose them randomly.
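A sketch of what that staged, randomized selection can look like in practice, in Rust, with hypothetical host names and a hypothetical wave schedule: hash each host id into a bucket from 0 to 99 and only include hosts whose bucket falls under the current wave's percentage, so membership is deterministic and the waves only ever grow.

    // Wave-based rollout selection: a host is included once the rollout
    // percentage passes its bucket, and never drops back out.
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    fn bucket(host_id: &str, release: &str) -> u64 {
        let mut h = DefaultHasher::new();
        // Mixing in the release id reshuffles which hosts go first each release,
        // so the same machines aren't always the canaries.
        (host_id, release).hash(&mut h);
        h.finish() % 100
    }

    fn in_wave(host_id: &str, release: &str, percent_rolled_out: u64) -> bool {
        bucket(host_id, release) < percent_rolled_out
    }

    fn main() {
        let hosts = ["pos-0001", "pos-0002", "dc-eu-01", "kiosk-syd-7"];
        // Hypothetical schedule: 1% -> 10% -> 50% -> 100%, pausing between waves
        // to watch crash/telemetry metrics before widening the blast radius.
        for pct in [1, 10, 50, 100] {
            let selected: Vec<_> = hosts
                .iter()
                .filter(|h| in_wave(h, "2024-07-19", pct))
                .collect();
            println!("at {:>3}%: {:?}", pct, selected);
        }
    }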
i've seen photos of the bsod from an affected machine, the error code is `PAGE_FAULT_IN_NONPAGED_AREA`. here are some helpful takeaways from this incident:
1) mistakes in kernel-level drivers can and will crash the entire os
2) do not write kernel-level drivers
3) do not write kernel-level drivers
4) do not write kernel-level drivers
5) if you really need a kernel-level driver, do not write it in a memory unsafe language
I've said this elsewhere but the enabling of instant auto-updates on software relied on by a mission critical system is a much bigger problem than kernel drivers.
Just imagine that there's a proprietary firewall that everyone uses on their production servers. No kernel-level drivers necessary. A broken update causes the firewall to blindly reject any kind of incoming or outgoing request.
Easier to rollback because the system didn't break? Not really, you can't even get into the system anymore without physical access. The chaos would be just as bad.
A firewall is an easy example, but it can be any kind of application. A broken update can effectively bring the system down.
There sure are a lot of mission-critical systems and companies hit by this. I am surprised that auto-updates are enabled. I read about some large companies/services in my country being affected, but also a few which are unaffected. Maybe they have hired a good IT provider.
A k8s variety. By Canonical. Screams production, no one is using this for their gaming PC. Comes with.. auto-updates enabled through snap.
Yup, that once broke prod at a company I worked at.
Should our DevOps guy have prevented this? I guess so, though I don't blame him. It was a tiny company and he did a good job given his salary, much better than similar companies here. The blame goes to Canonical - if you make this the default it better come with a giant, unskippable warning sign during setup and on boot.
One thing to consider with security software, though, is that time is of the essence when it comes to getting protection against 0day vulnerabilities.
Gotta think that the pendulum might swing into the other direction now and enterprises will value gradual, canary deployments over instant 100% coverage.
I'm not a Windows programmer so the exact meaning of PAGE_FAULT_IN_NONPAGED_AREA is not clear to me. I am familiar with UNIX style terminology here.
Is this just a regular "dereferencing a bad pointer", what would be a "segmentation violation" (SEGV) on UNIX, a pointer that falls outside the mapped virtual address space?
As this is in ring 0 and potentially has direct access to raw, non-virtual physical addressing, is there a distinction between "paged memory" (virtual address space) and "nonpaged memory" (physical address) with this error?
Is it possible to have a page fault failure in a paged area (PAGE_FAULT_IN_PAGED_AREA?), or would that be non-fatal and would be like "minor page fault" (writing to a shared page, COW) or "major page fault" (having to hit disk/swap to bring the page into physical memory)?
Are there other PAGE_FAULT_ errors on Windows?
Searching for this is difficult, as all the results are for random spammy user-centric tech sites with "how do I solve PAGE_FAULT_IN_PAGED_AREA blue screen?" content, not for a programmer audience.
this all-or-nothing mindset is reductive and defeatist; harm reduction is valuable. sure, rust won't magically make your kernel driver bug free, but it will reduce the surface area for bugs, which will likely make it more stable.
Unfortunately, we have decades of first Haskell pseudo-fans, then a sidequest of generic "static typing (don't look at how weak the type system is)" pseudo-fans, and now Rust aficionados who do act like it's all-or-nothing and as if types will magically fix everything, including category and logic errors.
If you think AV cannot stop viruses at the same privilege level, then that is more reason for AV to run in kernel mode. Because by your logic, an AV in user mode cannot stop a virus in user mode.
5) Well, how many of the kernel-level drivers we rely upon ARE written in a memory unsafe language??? Like 99%?
And we are not crashing and dying every day?
Sure, Rust is the way to go. it just took Rust 18 years to mature to that level.
Also, quite frankly, if your unwrap() makes your program terminate because of an array out of bounds, isn't that exactly the same thing? (program terminates)
But IMHO if we are hopping along a minefield at this moment every second of every day, well...
If this is the worst case scenario, yeah it's not that worse after all.
> Well, how many of the kernel-level drivers we rely upon ARE written in a memory unsafe language??? Like 99%? And we are not crashing and dying every day?
we shouldn't discount the consequences of memory safety vulnerabilities just because flights haven't physically been grounded.
> Also, quite frankly, if your unwrap() makes your program terminate because of an array out of bounds, isn't that exactly the same thing? (program terminates)
this is a strawman; if you were writing a kernel-level driver in rust you'd configure the linter to deny code which can cause panics.
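For what it's worth, those lints do exist in clippy today. A sketch of the crate-level configuration, enforced whenever `cargo clippy` runs (e.g. in CI):

    // At the top of lib.rs: turn the usual panic escape hatches into errors
    // when the code is checked with `cargo clippy`.
    #![deny(clippy::unwrap_used)]
    #![deny(clippy::expect_used)]
    #![deny(clippy::panic)]
    #![deny(clippy::indexing_slicing)] // data[i] can panic; force data.get(i) instead

    /// Example of the style this pushes you toward: fallible, no hidden panics.
    pub fn first_byte(data: &[u8]) -> Option<u8> {
        data.first().copied() // `data[0]` would be rejected by indexing_slicing
    }

It doesn't catch every possible panic path, but it keeps the obvious ones out of the codebase by default.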
And I never said that anyone is telling me to use Java. It was an example.
Because of the nature of AV software, its code would be drowning in "unsafe" memory accesses no matter the language we chose. This is AV, it's always trying to read the memory that is not AV's, from its very design.
This is a story about bad software management processes, not programming languages.
This was apparently caused by a faulty "channel file"[0], which is presumably some kind of configuration database that the software uses to identify malware.
So there wasn't any new kernel driver deployed, the existing kernel driver just doesn't fail gracefully.
Also, why not have some sort of graceful degradation (well, kind of)? Like: the OS boots, loads the CS driver, the driver loads some new feature/config, and before/after that new thing a "runtime flag" records whether it successfully worked. If it didn't, then on the next reboot that thing gets either disabled or replaced by the previous known-good config (obviously some combination of things might cause another issue), instead of blindly rebooting into the same state...
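A userspace sketch of that "runtime flag" idea in Rust (the file names are made up): mark the new config as pending before activating it, clear the mark only after it loads successfully, and if the mark is still there on the next start, fall back to the last known-good file.

    use std::fs;
    use std::path::Path;

    fn load_config(path: &Path) -> std::io::Result<Vec<u8>> {
        fs::read(path) // stand-in for "parse and activate the channel file"
    }

    fn load_with_fallback(new: &Path, known_good: &Path, pending_flag: &Path) -> std::io::Result<Vec<u8>> {
        if pending_flag.exists() {
            // Previous run never cleared the flag: treat the new config as poisoned.
            eprintln!("previous load of {:?} did not complete; using known-good config", new);
            return load_config(known_good);
        }
        fs::write(pending_flag, b"loading")?;   // set flag before the risky part
        let cfg = load_config(new)?;            // if this dies, the flag stays set
        fs::remove_file(pending_flag)?;         // only cleared on success
        fs::copy(new, known_good)?;             // promote the new config to known-good
        Ok(cfg)
    }

    fn main() -> std::io::Result<()> {
        let cfg = load_with_fallback(
            Path::new("channel-new.bin"),
            Path::new("channel-known-good.bin"),
            Path::new("channel.pending"),
        )?;
        println!("loaded {} bytes of definitions", cfg.len());
        Ok(())
    }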
I think pfsense does this (from memory, been a while using it). Basically dual-partitions, and if it failed to come up on the active partition after an update it'd revert. Granted you need to have the space to have two partitions, but for a small partition/image not so bad.
What surprises me is that if it's a content update and the code fell over when dealing with it, isn't that just basically bad release engineering, not catering for that in the first place? i.e. some tests in the pipeline before releasing the content update would've picked it up, given it sounds like a 100% failure rate.
The problem space kind of dictates that this couldn't be a solution, cause malware could load an arbitrary feature/config and mark it as 0, then the AV would be disabled on next boot, right?
More importantly, why are CS customers not validating? Upstream patches should be treated as faulty/malicious if not tested to show otherwise, especially if they're kernel level.
For a while I've joked with family and colleagues that software is so shitty on a widespread basis these days that it won't be long before something breaks so badly that the planet stops working. Looks like it happened.
Perhaps a dumb question for someone who actually knows how Microsoft stuff works...
Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?
Added: OK, from another post I now know Crowdstrike has some sort of kernel mode that allows this sort of catastrophe on Linux. So I guess there is a bigger question here...
> Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?
Because malware that gets into a system will do just that -- install its own backdoor drivers -- and will then erect defenses to protect itself from future updates or security actions, e.g. change the path that Windows Updater uses to download new updates, etc.
Having a kernel module that answers to CrowdStrike makes it harder for that to happen, since CS has their own (non-malicious) backdoor to confirm that the rest of the stack is behaving as expected. And it's at the kernel level, so it has visibility into deeper processes that a user-space program might not see (or that are easy to spoof).
Or, much more likely, the malware will use a memory access bug in an existing, poorly written kernel module (say, CrowdStrike?) to load itself at the kernel level without anyone knowing, perhaps then flashing an older version of the BIOS/EFI and nestling there, or finding its way into a management interface. Hell, it might even go ahead and install an existing buggy driver by itself if it's not already there.
All of these invasive techniques end up making security even worse in the long term. Forget malware - there's freely available cheating software that does this. You can play around with it, it still works.
Maybe I am in the minority, but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.
If there is any place that historically was exploited more than anything else, it was broken parsers. Congratulations: if a file carrying such an exploit is now read by your AV software, the exploit sits at a position where it is allowed (expected) to read all files, and it would not surprise me if it could write them as well.
And you just doubled the number of places in which things can go wrong. Your system/software that reads a PNG image might do everything right, but do you know how well your AV-software parses PNGs?
This is just an example, but the question we really should ask ourselves is: why do we have systems where we expect malicious files to just show up in random places? The problem with IT security is not that people don't use AV software, it is that they run systems so broken by design that AV has to be sprinkled on top.
This is like installing a sprinkler system in a house full of gasoline. Imagine gasoline everywhere including in some of the water piping — in the best case your sprinkler system reacts in time and kills the fire, in the worst case it sprays a combustive mix into it.
The solution is of course not to build houses filled with gasoline. Meanwhile AV-world wants to sell you ever more elaborate, AI-driven sprinkler systems. They are not the ones profiting from secure systems, just saying..
> but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.
Because otherwise, a piece of malware that installs itself at a "mega-privileged" level can easily make itself completely invisible to a scanner running as a low-priv user.
Heck, just placing itself in /root and hooking a few system calls would likely be enough to prevent a low-priv process from seeing it.
You're ignoring the parent's question of "why do we have systems where we expect malicious files to just show up in random places?", which I think is a good question. If a system is truly critical, you don't secure it by adding antivirus. You secure it by restricting access to it, and restricting what all software on the machine can do, such that it's difficult to attack in the first place. If your critical machines are immune to commodity malware, now you only have to worry about high-effort targeted attacks.
My point exactly. Antivirus is a cheap add-on measure that makes people feel they have done something; the actual safety of a system comes from preventing people and software from doing things they shouldn't do.
Why would you design a system where a piece of malware can "install itself" at a mega-privileged position?
My argument was that this is the flaw, and everything else is just trying to put lipstick on a pig.
If you have a nightclub and you have problems controlling which people get in, the first idea would not be to keep a thousand unguarded doors and then recruit people to search the inside of your nightclub for anyone they think didn't pay.
You would probably think about reducing the number of doors and adding effective mechanisms to them that help you with your goals.
I am not saying we don't need software that checks files at the door, I am saying we need to reduce the number of doors leading directly to the nightclub's cash reserve.
Some file formats allow data to be appended or even prepended to the expected file data and will just ignore the extra data. This has been used to create executables that happen to also be a valid image file.
I don't know about PNG, but I'm fairly sure JPEG works this way. You can concatenate a JPEG file to the end of an executable, and any JPEG parser will understand it fine, as it looks for a magic string before beginning to parse the JPEG.
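If you want to see that for yourself, here's a small sketch in Rust (the input file name is hypothetical) that just checks where the JPEG start-of-image marker (FF D8) sits; anything before it has been prepended.

    use std::fs;

    /// Returns the offset of the first JPEG SOI marker (0xFF 0xD8), if any.
    fn jpeg_soi_offset(data: &[u8]) -> Option<usize> {
        data.windows(2).position(|w| w[0] == 0xFF && w[1] == 0xD8)
    }

    fn main() -> std::io::Result<()> {
        let data = fs::read("suspect.jpg")?; // hypothetical input file
        match jpeg_soi_offset(&data) {
            Some(0) => println!("SOI at offset 0: looks like a plain JPEG"),
            Some(off) => println!("SOI at offset {off}: {off} bytes prepended, worth a closer look"),
            None => println!("no JPEG marker found at all"),
        }
        Ok(())
    }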
A JPEG that has something prepended might raise an eyebrow. A JPEG that has something executable prepended should raise alarms.
Why make something like that executable in the first place? I like the Unix model where things that should be executable are marked so. I know bad parsers and format decoders can lead to executable exploits, but I've always felt uncomfortable with the windows .exe model. Also VBA in excel, word... I believe a better solution would be to have a minimal executable surface than invasive software.
Vendors are allowed to install drivers, even via Windows Update. Many vendors like HP install functionality like telemetry as drivers to make it more difficult for users to remove the software.
So next time you think you are doing a "clean install", you are likely just re-installing the same software that came with the machine.
It doesn't install the driver, it is the driver. As for the Linux version, it uses eBPF which has a sandbox designed to never crash the kernel. Windows does have something similar nowadays, but Crowdstrike's code probably predates it and was likely just rawdogging the kernel.
> Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?
While the files are named XXX.SYS they are apparently not drivers. The issue is that a corrupted XXX.SYS was loaded by the already-installed driver which promptly crashes.
I'm always curious on how security software can provide a ROI.
I had McAfee tell me one time that the hackersafe logo on our website would increase sales by 10%, this was at a Fortune 50 doing billions in sales online every year.
I was pretty hyped because it would have done wonders for my career, but then they walked it back and wouldn't explain it to me. I wasn't mad, I was disappointed.
I ran an A/B test in 2012, not sure it's relevant now: we tested the McAfee logo and conversion was boosted by 2%. A bigger boost came from a lock icon, 3%. It kept increasing the more locks we added and topped out at 5% after 5 lock icons.
"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. It demands the same skill, devotion, insight, and even inspiration as the discovery of the simple physical laws which underlie the complex phenomena of nature."
"The most important property of a program is whether it accomplishes the intention of its user."
Agreed but have you been in the industry lately? Nobody hires assembly programmers anymore. Want money you must work at wobbly top of abstraction mountain.
I am well aware, but the quotes are timeless for a reason. Not to be cheeky, but "Want money" is exactly how you get to the many routinely broken endpoint solutions that wind up reducing reliability and at times increasing the attack surface. Wherever you are in the stack, please make it more robust and easier to reason about. No matter how far from the assembly.
It’s not just about the tech abstraction mountain, it’s about the app logic and dev process too.
A react native JS app with a clear spec and a solid release process can be more reliable than bloated software that receives an untested hotfix, even if the latter was handwritten in assembly.
My entire emergency department got knocked offline by this. Really scary when you have ambulances coming in and are trying to stabilize a heart attack.
Update: 911 is down in Oregon too, no more ambulances at least.
We're really prepared for Epic to go down and have an isolated cluster that we access in emergencies. I transitioned from software engineering so I've only been in the ED for a year, but from what I could see there didn't seem to be a plan for what to do if every computer in the department bluescreened at once.
So assuming everyone uses sneaker-net to restart what's looking like millions of Windows boxes, there will be recriminations, but then ... what?
I think we need to look at minimum viable PC - certain things are protected more than others. Phones are a surprisingly good example - there is a core set of APIs and no fucker is ever allowed to do anything except through those. No matter how painful. At some point MSFT is going to enforce this the way Apple does. The EU court cases be damned.
For most tasks on most machines it's hard to argue that anything more than an OS and a web browser is needed.
We have been saying it for years - what I think we need is a manifesto for much smaller usable surface areas
In this case even dockerized environments would allow you to redeploy with ease.
But that's too much work; many of these systems are running Docker-resistant software. Management doesn't want to invest in modernization - it works this quarter, and it's someone else's problem next quarter.
You're basically proposing Windows 12 to radically limit what software and drivers can do. Even then eventually someone will probably still break it with weird code.
I'm actually amazed these updates are being tested in prod. Do they have no QA environments ?
Do I personally need to create a startup company called Paranoia... We actually run a clone of your prod environment minus any sensitive data, then we install all the weird and strange updates before they hit your production servers...
As an upsell we'll test out privileges, to make sure your junior engineers can't break prod.
Someone raise a seed round, I'm down to get started this week.
I think this is existential for Windows, and by extension MSFT. Something like 95% of corporate IT activity is either over http (ie every saas and web app) or is over the serial port (controlling that HVAC, that window blind, that garage lifter)
So what we need in 95% of boxes is not a fully capable PC - we need a really locked down OS. Or rather we can get by with a locked down OS.
I would put good money on there already being a tiny OS from the ground up in MSFT that could be relabelled windows-locked-Down(13) and sold exclusively to large corporates (and maybe small ones who sign a special piece of marketing paper)
The thing is, once you do that you are breaking the idea that Windows can run everywhere (or rather, we claim Linux runs everywhere, but the thing that's on my default Ubuntu install and the thing on my router are different).
Yet the chaos seems to continue. Could it be that this fix can't be rolled out automatically to affected machines because they crash during boot - before the Crowdstrike Updater runs?
That update is so tone-deaf and half-assed. There's no apology.
If you go to the website, there's nothing on their front-page. The post on their blog (https://www.crowdstrike.com/blog/statement-on-windows-sensor...) doesn't even link to the solution. There's no link to "Support Portal" anywhere to be seen on their front-page. So, you have to go digging to find the update.
And the "Fix" that they've "Deployed" requires someone to go to Every. Single. Machine. Companies with fleets of 50k machines are on this HN thread - how are they supposed to visit every machine?!?!
Any response they make in the middle of a global outage will be half-assed. They have all available resources figuring out what the hell just happened and how to fix it.
An apology this early is a lose-lose. If they do apologize they'll piss off the people dealing with it who want a fix, not an apology. If they don't apologize they're tone deaf and don't seem to care.
lol sounds good, but how the hell do they deploy a fix to a machine that has crashed and is looping BSOD with no internet or network connectivity...
You do what I've been doing for the last 10 hours or so. you walk to each and every desktop and manually type in the bitlocker key so you can remove the offending update.
at least the virtual devices can be fixed sitting at a desk while suckling at a comfort coffee..
There's potentially a huge issue here for people using BitLocker with on-prem AD, because they'll need the BitLocker recovery keys for each endpoint to go in and fix it.
And if all those recovery keys are stored in AD (as they usually are), and the Domain Controllers all had Crowdstrike on them...
Most of the large deployments I've seen don't use pre-boot PINs, because of the difficulty of managing them with users - they just use TPM and occasionally network unlock.
So might save a few people, but I suspect not many.
Was thinking about a bootable USB stick that would do that automagically. But I guess it is harder to boot from a USB stick in these environments than to just do the actual fix.
I guess more feasible and even neater to do it if you have network boot or similar.
This gem from the ABC news coverage has my mind 100% boggled:
"711 has been affected by the outage … went in to buy a sandwich and a coffee and they couldn’t even open the till. People who had filled up their cars were getting stuck in the shop because they couldn’t pay."
Can't even take CASH payment without the computer, what a world!
Technically a payment terminal can go into island mode and take offline credit card transactions and post them later. PIN can be verified against the card.
Depends if the retailer wants to take the chance of all that.
Having worked with some of these retail systems, yes, it depends on how they are configured.
There are stores in many places in the country with sporadic internet or where outages are not uncommon, and where you would want to configure the terminals to still work while offline. In these cases, the payment terminals can be configured to take offline transactions, and they are stored locally on the lane or a server located in the store until a connection to the internet is re-established.
At where though? The example given was in 711 which is a nationwide chain a bit like a Tesco Express or Sainsbury's Local, both of which still accept cash nationwide in the UK too.
This whole thing likely would have been averted had microkernel architectures caught on during the early days (with all drivers in user mode). Performance would have likely been a non-issue, not only due to the state of the art L4 designs that came later, but mostly because had it been adopted everything in the industry would have evolved with it (async I/O more prevalent, batched syscalls, etc.).
I will admit we've done pretty well with kernel drivers (and better than I would have ever expected tbh), but given our new security focused environment it seems like now is the time to start pivoting again. The trade offs are worth it IMO.
I wonder if for critical applications we'll ever go back to just PXE booting images from a central server: just load a barebones kernel and the app you want to run into a dedicated memory segment, mark everything else as NX, and you don't even have to worry about things like viruses and hacks anymore. Run into an issue? Just reboot!
I just skimmed through the news. A lot of airports, hospitals, and even governments are down! It's ironic how people put all their eggs in one basket, trying to avoid downtime caused by malware by relying on a company that then brought their systems down itself. A lot of lessons will be learned after this for sure.
Unless you run half your devices on one security vendor and half on another surely there is no way round it? Companies install this stuff over "Windows Defender" so they can point fingers at the security vendor when they get hacked, this is the other side of the coin.
It has happened before where security software has unwanted effects, can't say i remember anyone else managing to blue screen Windows and require a safe mode boot to fix the endpoints though.
Relying on easy-install "security vendors" is the problem. It's one thing to run an antivirus on a general purpose PC that doesn't have a qualified human admin. But many of the computers affected here are single-purpose devices, which should operate with a different approach to security.
Speaking as somebody who manages a large piece of a 911 style system for first responders and has done so for 10 years (and is not affected by this outage) - this is why we do not allow third parties to push live updates to our systems.
It's unfortunate, the ambulances are still running in our area of responsibility, but it's highly likely that the hospitals they are delivering patients to are in absolute chaos.
Disrespect to every CIO who makes their business depend on a single operating system, running automatic updates of system software without any canaries or phased deployments.
While I believe Linux is a more reasonable operating system than Windows, shit can happen everywhere.
So if you have truly mission critical systems you should probably have at least 2 significantly different systems, each of them able to maintain some emergency operations independently. Doing this with 2 Linux distros is easier than doing it with Linux and Windows. For workstations Macs could be considered, for servers BSD.
Probably many companies will accept the risk that everything goes down. (Well, they probably don't say that. They say maintaining a healthy mix is too expensive.)
In that case you need a clearly phased approach to all updates. First update some canaries used by IT. If that goes well update 10% of the production. If that goes well (well, you have to wait until affected employees have actually worked a reasonable time) you can roll out increasingly more.
No testing in a lab (whether at the vendor or you own IT) will ever find all problems. If something slips through and affects 10% of your company it's significantly different from affecting (nearly) everyone.
What makes you think windows is the only alternative? Have you never heard about Gnu Hurd?
More seriously I am not saying you should run some critical services on menuetos or riscos but the BSDs are still alive and kicking as well as illumos and its derivatives. And yes I think a bit of diversity allows some additional resilience. It may necessitate more workforce but imho it is worth the downsides.
Presumably they do test their updates, they're just maybe not good enough tests.
The ideal would be to do canary rollouts (1%, then 5%, 10% etc.) to minimise blast radius, but I guess that's incompatible with antiviruses protecting you from 0-day exploits.
While I'm usually a proponent of update waves like that, I know some teams can get loose with the idea if they determine the update isn't worth that kind of carefulness.
Not saying CS doesn't care enough, but what may have looked to the team that shipped it like a minor update not worth a slow rollout is actually exactly the kind of thing that should be supervised in that way.
Our worst outage occurred when we were deploying some kernel security patches and we grew complacent and updated the main database and its replica at the same time. We had a maintenance window with downtime anyway at the same time, so whatever. The update had worked on the other couple hundred systems.
Except, unknown to us, our virtualization provider had a massive infrastructural issue at exactly that moment preventing VMs from booting back up... That wasn't a fun night to failover services into the secondary DC.
The issue is update rollout process, lack of diversity of these kind of tools in the industry, and the absolute failure of the software industry to make decent software without bug and security holes.
Could you potentially do the same by just attaching the HDD to another computer as a secondary drive and renaming the folder if safe mode falls through?
Sky News (UK) is back on air, but they seem to have no astons / chyrons / on screen graphics at all, and I don't think they're able to show prerecorded material either (it's just people in the studio from what I've seen), so presumably they're still having fun issues with their general production systems.
Having another look in, it looks like they're overlaying a static banner along the bottom rather than having a fully working graphics system (as of 11:30), as that's the only graphics I've seen.
Good news for crowdstrike! It shows how critical their services are. Stock to go up! (And down a bit when they get sued, and up a bit when they don't get sued too much, etc.)
If the vending machine handles credit cards, wouldn't Visa / Mastercard / etc. basically require it as part of their security requirements? Or it's just general CYA from someone that's backfired badly.
"It's a vending machine that sells back physical stolen credit cards." Input the dollars, receive stolen credit card. Put it next to an airport terminal for maximum impact.
> The London Stock Exchange says it's working as normal - but says there are problems with its RNS (regulatory news service).
> "RNS news service is currently experiencing a third party global technical issue, preventing news from being published on www.londonstockexchange.com," the statement says.
> "Technical teams are working to restore the service. Other services across the group, including London Stock Exchange, continue to operate as normal."
CS is an EDR (Endpoint Detection & Response) and it connects to other parts like XDR (Extended Detection and Response) and MDM (Mobile Device Management). They differ from the typical antivirus in how they detect threats. The AV usually checks against known threats, while EDR detects endpoint behavior anomalies. For example, if your browser spawns a shell, it will be marked and the process quarantined. Of course, they do share a lot of common domains like real-time protection, cloud analysis, etc., and some AVs have most of the EDR capabilities, and some EDRs have most of the AV capabilities.
This is briefly described.
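For anyone curious what the "browser spawns a shell" style of rule looks like mechanically, here's a userspace-only sketch in Rust that walks /proc and flags shell processes whose parent looks like a browser. A real EDR does this at process-creation time in the kernel with far more context; the shell and browser names here are just illustrative.

    use std::fs;

    /// Process name from /proc/<pid>/comm.
    fn comm(pid: u32) -> Option<String> {
        fs::read_to_string(format!("/proc/{}/comm", pid))
            .ok()
            .map(|s| s.trim().to_string())
    }

    /// Parent pid from /proc/<pid>/stat. The comm field can contain spaces,
    /// so split after the closing ')' before taking whitespace-separated fields.
    fn ppid(pid: u32) -> Option<u32> {
        let stat = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
        let after = stat.rsplit(')').next()?;
        after.split_whitespace().nth(1)?.parse().ok()
    }

    fn main() {
        let shells = ["sh", "bash", "zsh", "dash"];
        let browsers = ["chrome", "chromium", "firefox", "msedge"];
        for entry in fs::read_dir("/proc").into_iter().flatten().flatten() {
            let Ok(pid) = entry.file_name().to_string_lossy().parse::<u32>() else { continue };
            let Some(name) = comm(pid) else { continue };
            if !shells.contains(&name.as_str()) { continue }
            if let Some(parent) = ppid(pid).and_then(comm) {
                if browsers.iter().any(|b| parent.contains(*b)) {
                    println!("suspicious: {} (pid {}) spawned by {}", name, pid, parent);
                }
            }
        }
    }

The agent-in-the-kernel version of this also gets to block or quarantine the child before it runs, which is exactly what a printed warning in userspace cannot do.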
We're running something similar ( not CS ) where I work.
It seems to me that these tools create lots of problems (slow down the machine significantly in particular, get things wrong and quarantine processes/machines when they shouldn't, inject themselves into processes and so change their behaviour, etc.).
The main question I have is: does anyone have an actual instance of such tools detecting something useful? No one in the office was able to show one.
I contracted for a company that gave me a company issued macbook with crowdstrike. It logged my execve() or something, because I did a curl from rustup | sh, and this alerted an admin who then contacted me to ask if this was legitimate behaviour.
Worked for a fairly largish org (~40k emps), and one of the "security" gurus roped me into a conversation because he found a batch file in my Teams shared files. The contents:
set JAVA_HOME="what_ever_path"
and asked me to explain this egregious hacking attempt.
My company had a mandatory req of installing it. If you look into it - it logs and spies on everything you do, every dns req, every website, every application etc.
Now my m3-ultra MacBook work computer that they gave me is a 4000 USD Teams/email machine, since I prefer to work on computers without spyware.
I understand your preference. I have two questions:
1) Do you think that an organization should have no protections in place?
2) Why not just work from the machine they provided you, and do everything else on a personal machine?
I assume from your rhetorical question that you don't. I personally don't know enough about it to say whether it does or not - but, I will make what I believe is a reasonable assumption and say that all else being equal, yes, a fleet of machines with a EDR sensor installed is more "protected" than a fleet without.
If you have a point to make, why not just say what you are trying to say; it will be more effective discourse. I am genuinely curious.
The key to tools like CrowdStrike is not so much protection as being able to trace an attack through the infrastructure. They can see that your credentials were compromised on your machine, and which systems you then connected to (or that a bad process did), so they can trace the attack and make sure it all gets cleaned up.
My favorite work story is from 10 years ago. We had an internal IRC server for the devs. I'd written an IRC bot to do some basic functions. It was running on my desktop.
I get a call from IT on my work phone. My co-workers hear my end of the conversation:
"No, it's not a bot net. It's just one bot. Yeah, I wrote it and it talks IRC."
It is one of the best systems available for realtime protection of windows systems against various threat actors. Prior to today you could probably have said 'no one gets fired for recommending Crowdstrike as the security tool for the company.' It is everywhere and in particular if you are a large org with a lot of Windows seats you are likely a Crowdstrike customer.
What the heck is it doing? My work laptop fan always seems to be blasting air whether it is 10pm or 3am. It's in a reboot loop now so I just shut it off.
All my Linux machines are quiet when nothing is running. In contrast, I go to the bathroom at 10pm or 3am and the work laptop fan is blasting. I've logged it and see some other security stuff taking up CPU cycles; it happens at least a few times an hour. I wonder how much electricity the world is wasting with this crap.
When I first got the laptop when I started this job 5 years ago I thought it must be infected with malware because it was always running the fan so I put it in a separate VLAN so it can't attack my home Linux machines. IT told me it is security software. Who knew that the cyber attack would come from inside the security software.
Some of these services go even further. One time, our IT department was being sales-bombed with a service that would remove our actual login credentials to servers, and then "for security" we'd access said servers using a MITM website kind of thing that would be behind our corporate AD-login. I didn't even find out the full intricate details before telling them to "nope this the fuck out" and stay away with a 10-ft pole.
It's like these people have nothing better to do with their time and just absolutely have to have to design and build a product for the sake of it, and then dump it on marketing for > 0 amounts of sales through pretty-much wearing IT departments down. Or in the case of this Crowdstrike thing, through the protection racket known as security audit compliance.
It injects itself into (at least) every executable startup and every executable write to disk. It's quite noticeable if you have it installed and run, say, an installer that unpacks a lot of DLL files, because each one gets checksummed and the checksum sent to a remote host. Every time.
I hated it before this incident and I will be bringing this incident up every time it is mentioned.
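To give a feel for the per-file cost being described, here's a rough sketch in Rust (assuming the third-party `sha2` crate; the file name and the idea of where the digest gets sent are made up): stream each newly written executable through SHA-256 and report the digest.

    use sha2::{Digest, Sha256};
    use std::fs::File;
    use std::io::{self, Read};

    fn sha256_hex(path: &str) -> io::Result<String> {
        let mut file = File::open(path)?;
        let mut hasher = Sha256::new();
        let mut buf = [0u8; 64 * 1024];
        loop {
            let n = file.read(&mut buf)?;
            if n == 0 { break; }
            hasher.update(&buf[..n]); // every byte of every unpacked DLL goes through here
        }
        Ok(hasher.finalize().iter().map(|b| format!("{:02x}", b)).collect())
    }

    fn main() -> io::Result<()> {
        let digest = sha256_hex("unpacked.dll")?; // hypothetical file an installer just wrote
        // A real agent would ship this off-box and possibly wait for a verdict;
        // do that for thousands of small files and an install visibly crawls.
        println!("would report {digest} for unpacked.dll");
        Ok(())
    }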
So it exists because nobody has any idea what the execution graph of their programs is, and CS is down because of that too... Do we really need this level of dynamism in our programs?
And like most AV systems it seems to be a bigger threat than what it supposedly protects against. Seriously how is it acceptable to have one corporation push a live update and take down tons of critical services all over the world. Just imagine what a malicious actor could accomplish with such a delivery vector.
Indeed. The xz backdoor team must be kicking themselves: "We spent years getting our own vector into a tool, only for our world domination plans to be thwarted at the last minute ... we could have just bribed someone at CS!"
Botnet that checks if your bots in the botnet act like bad bots and can be considered bad too. Also checking if some of your files match AV signature. Also reading all your logs if you really want.
Financial losses? The comment you're replying to is mentioning heart attack treatment here. We're talking about deaths. Most of us won't like to hear this, but for all of us who work on SaaS that is deployed on servers around the world, our bugs can cause people to die. It's a given that at least a dozen people will die directly (medical flights and hospitals both being hit) due to this broken update, let alone indirectly.
I don't think the parent comment was ignoring that. The penalty for a company who does this can't be to bring someone back from the dead, it's likely to be financial, which is the aspect they're talking about.
As others have already stated, yes, that is how we should be interpreting comments, in good faith and in the most charitable way as the site guidelines suggests us to.
If companies want the nice parts of being "a person", they should also deal with the bad parts of being a person. Financial fines are not enough. Though I'm not sure how we'd build a jail cell for an entire company.
Fines are not enough because a large enough fine will kill a company, destroying lots of jobs and supply chains.
Why not dilute the shareholder pool by a serious amount? There's no need for formal nationalization to happen; the government can sell the shares back over time without actually exercising control.
Also fire execs and ban them from holding office on publicly traded companies for the foreseeable future.
Seizing shares doesn't impact the cash flow of the company directly, thus shouldn't cause job losses, but shareholders (who should put pressure on executives and the board to act with prudence to avoid these kinds of disasters) are adequately punished.
This actually sounds like a workable idea, but the implementation would be extremely thorny (impact on covenants, governance, voting rights, non-listed companies, etc) and take forever to get done. It would also punish everyone equally, even though they clearly do not share equal blame.
You probably want, in addition to your proposal, executive stock-based compensation to be awarded in a different share class, used to finance penalties in such cases where the impact is deemed to be the result of gross negligence at the management level.
> but shareholders (who should put pressure on executives and the board to act with prudence to avoid these kinds of disasters) are adequately punished.
So if I own some Vanguard mutual fund as part of a retirement account, it’s now on me to put pressure on 500+ corporations?
Perhaps it’s on Vanguard to do so…but Vanguard isn’t going to just eat the cost of increased due diligence requirements. My fees will increase.
How does that increased due diligence even work? It’s not like I or Vanguard can see internal processes to verify that a company has adequate testing or backups or training to prevent cases like today’s failure.
When, on average, X number of those 500 companies in my mutual fund face this share seizure penalty per year…am I just supposed to eat the loss when those shares disappear? Does Vanguard start insuring against such losses? Who pays for that insurance in the end?
This doesn’t even really hurt the shareholders who are best placed to possibly pressure a company. This doesn’t hurt “billionaire executive who owns 40% of the outstanding shares”. I mean, sure, it will hurt that little part of their brain that keeps track of their monetary worth and just wants to see “huge number get huger”…but it doesn’t actually hurt them. It just hurts regular folks, as usual.
If you own a mutual fund, then you do not own shares of the 500 companies, rather you own shares of the mutual fund itself.
Consequently you don't put pressure on the 500 companies, you put pressure on the mutual fund and the mutual fund in turn puts pressure on the companies it invests in and exercises additional discretion in which companies it invests in.
>Perhaps it’s on Vanguard to do so…but Vanguard isn’t going to just eat the cost of increased due diligence requirements.
Yes they do, because mutual funds do compete with one another and a mutual fund that does the due diligence to avoid investing in companies that are held liable for these kinds of incidents will outperform the mutual funds that don't do this kind of due diligence.
> It’s not like I or Vanguard can see internal processes to verify that a company has adequate testing or backups or training to prevent cases like today’s failure.
I don't know specifically about Vanguard, but mutual funds in general do employ the services of firms like PwC, Deloitte, and KPMG to perform technical due diligence that assesses the target company's technology, product quality, development processes, and compliance with industry standards. VC firms like Sequoia Capital and Andreessen Horowitz do their own technical due diligence.
Just perhaps the idea of sticking everyone's retirement funds into massive passive vehicles was a bad one and has an unhealthy effect on the market, as you illustrate here. It is the way of things now so I see your point and it would be harmful to people, but getting in this situation has seemingly removed what could be a natural lever of consequence. We can't really hold companies accountable lest all the "regular folks" that can't actively supervise what they're investing in become collateral damage.
At least with AI you could do something like, destroy all copies including backups, destroy all training data and other code used to generate it. Which to me actually doesn't seem unreasonable punishment.
I did not mean to imply this, as there's a very long culpability chain. For this reason, I'm not sure if it makes any sense to imprison individuals for this. A lot of people playing a part in this causing such chaos.
But it is something to be very aware of for those of us who develop software run in e.g. hospitals and airlines, and should receive more attention, instead of only bringing up financial losses which is what usually happens. I noticed the same with the big ransomware attacks.
Indeed, it's a pity that we need major failures like these for governments to finally start paying attention and apply the same kinds of laws as to anything else, instead of careless EULAs and updates without field testing.
It's very bizarre to me how normalized we have made kernel-level software in critical systems. This software is inherently risky but companies throw it around like it's nothing. And cherry on top, we let it auto-update too. I'm surprised critical failures like this don't happen more often.
I can't tell if you're serious or sarcastic, but there is such a thing as criminal negligence.
CrowdStrike knows that their software runs on computers that are in fricken hospitals and airports, they know that a mistake can potentially cause a human death. They also know how to properly test software, and they know how to do staggered releases.
Given what we know now, it seems pretty likely that to any reasonable person, the amount of risk they took when deploying changes to clients was in no way reasonable. People absolutely should go to jail for this.
This more or less originated with the unfortunately named MS Herald of Free Enterprise sinking (https://en.wikipedia.org/wiki/MS_Herald_of_Free_Enterprise) - after that incident, regulators decided that maybe they didn't want enterprise quite as free as all that, and cracked down significantly on shipping operators (though the attempt to prosecute its execs for corporate manslaughter did fail).
Why don't orgs test their updates? Every decent IT management/governance framework under the sun demands that you test your updates. How the hell did so many orgs that are ISO 2700x, COBIT, PCI-DSS, NIST CSF, etc. certified fail so hard??
(ToS/contracts will probably get you out of any damages.)
Testing for most organizations is usually either really, incredibly expensive or an ineffective formality which leaves them at more risk than it saves. If you aren’t going to do a full run through all of your applications, it’s probably not doing much and very few places are going to invest the engineer time it takes to automate that.
What I take from this is that vendors need a LOT more investment in that work. They have both the money and are best positioned to do that testing since the incentives are aligned better for them than anyone else.
I’m also reminded of all of the nerd-rage over the years about Apple locking down kernel interfaces, or restricting FDE to their implementation, but it seems like anyone who wants to play at the system level needs a well-audited commitment to that level of rigorous testing. If the rumors of Crowdstrike blowing through their staging process are true, for example, that needs to be treated as seriously as browsers would treat a CA for failing to validate signing requests or storing the root keys on some developer’s workstation.
Because historically orgs have been really bad with applying updates: either no updates or delayed updates resulting in botnets taking over unpatched PC's. Microsoft's solution was to force the updates unconditionally upon everybody with very few opportunities to opt out (for large enterprise customers only).
Another complication comes from the fact that operating system updates are not essential for running a business and especially for small businesses – as long as the main business app runs, the business runs. And most businesses are too far removed from IT to even know what a update is and why it is important. Hence the dilemma of fully automated vs manually applied and tested updates.
> Microsoft's solution was to force the updates unconditionally upon everybody with very few opportunities to opt out (for large enterprise customers only).
Not a Microsoft fan, but this is not true. Everyone who has Windows Server somewhere, with some spare disk space for the updates, has this ability. Just install and run WSUS (included in Windows Server) and you can accept/reject/hold indefinitely any update you want.
1) the prevailing majority of laptop and desktop PC installations (home, business and enterprise) are not Windows Server;
2) kiosk style installs (POS terminals, airport check-in stands etc) are fully managed, unsupervised installations (the ones that ground to a complete halt today) and do not offer any sort of user interaction by design;
3) most Windows Server installations are also unsupervised.
> 1) the prevailing majority of laptop and desktop PC installations (home, business and enterprise) are not Windows Server;
They are not, but the point is elsewhere: that Windows Server is going to provide the WSUS service to your network, so your laptop and desktop installations (in business and enterprise) are going to be handled by this.
Homes, on the other hand, do not have any Windows Server on their network, that's true.
As a hack to disable Windows updates, it is possible to point the client to a non-existent WSUS server (so that can be done at home too). The client will then never receive any approval to update. It won't receive any info wrt available updates either.
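A minimal sketch of that hack, assuming Python's stdlib winreg on a Windows box with admin rights; the policy keys below are the standard WSUS client policy locations and the server URL is deliberately bogus:

    # Point the Windows Update client at a WSUS server that doesn't exist,
    # so it never gets offered (or approved for) any updates.
    import winreg

    WU_KEY = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"
    BOGUS = "http://wsus.invalid:8530"  # nothing listens here

    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, WU_KEY) as key:
        winreg.SetValueEx(key, "WUServer", 0, winreg.REG_SZ, BOGUS)
        winreg.SetValueEx(key, "WUStatusServer", 0, winreg.REG_SZ, BOGUS)

    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, WU_KEY + r"\AU") as key:
        winreg.SetValueEx(key, "UseWUServer", 0, winreg.REG_DWORD, 1)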
> 2) kiosk style installs (POS terminals, airport check-in stands etc) are fully managed, unsupervised installations (the ones that ground to a complete halt today) and do not offer any sort of user interaction by design;
That's fine; this is fully-configurable via GPO.
> 3) most Windows Server installations are also unsupervised.
IMHO the law should require such a firm, or any firm that may impact millions of other people, i.e. including all OS developers and many others, to maintain a certified Q/A process, maintain 24/7 coverage and spend X% on Q/A. Such companies should never be allowed to deploy without going through a stringent CD procedure with tests and such, and they need to renew the certificate annually.
These are infra companies. Their incompetence can literally kill people.
My point/problem is that EVERY company (sorry for the caps) that is ISO, PCI, COBIT, NIST CSF, etc. compliant MUST be doing this!! (again sorry for the caps)
So they drop half the 'safety' procedures once the auditor goes away? WTF! (I am semi-angry because there are so many easy solutions and workarounds to not fall for this!! Inside screaming.)
How irresponsible must someone be to roll out something to 1k-5k-10k machines without testing it first??
I hope eventually law regards these companies as "infrastructure" companies, just like companies that build roads, bridges and such, that may and will kill people if not run professionally.
I'm not trying to enforce certifications, because as a dev certifications always leave a bitter taste in my mouth. But those companies need certified processes that get re-certified every year. Sometimes even a cursory review from outsiders can find a lot of issues.
Updates do get tested. Windows updates can be held and selectively rolled out when a company is ready. As far as I can tell though, CrowdStrike doesn't give companies the agency to decide if updates should be applied or not.
Since we live under capitalism, financial losses are the only ones anyone cares about at scale. What's a human life worth nowadays? About 10 million for a healthy prime-age adult? Negative for the elderly?
I think it depends what passport etc. you hold...
One dystopian take is the trolley problem, where the self-driving car in question uses smartphones to determine the identity of the people involved, to work out who is cheaper to kill.
That reminds me of why McDonalds got such a high penalty in the court case everyone remembers as "person sues for spilling hot coffee on themselves".
The reason this reminds me of that, assuming that I remember right, is that I think they had even taken the decision that the cost of paying lawsuits for those injuries was lower than the increase in revenue for being able to say "we have the hottest coffee"… and that was why they were deemed so severely liable.
They were definitely shown to have known it was resulting in injuries from other settlements:
Not true. Making C-level executives of software companies criminally liable with the chance to go to jail did change their behaviour in some recent lawmaking situation (forgot which, sorry).
None whatsoever, their contracts with customers will limit liability to the price paid for the software/subscription. If there was open-ended liability for software failures then very little software would get written.
Yeah but if it's a hospital, they should be able to operate without these IT systems. Nothing critical / life-or-death / personal injury should rely on Windows / IT systems.
> they should be able to operate without these IT systems.
Is that even possible any more? (That said, "operate" isn't a boolean, it's a continuum between perfect service and none, with various levels of degraded service between, even if you mean "operate" in the sense of "perform a surgical operation" rather than "any treatment or care of any kind").
All medical notes being printed in hard-copy could be done, that's the relatively easy part. But there's a lot of stuff which is inherently IT these days, gene sequencing, CT scans, etc., there's a lot that computers add which humans can't do ourselves — even video consultation (let alone remote surgery) with experts from a different hospital, which does involve a human, that human can't be everywhere at once: https://en.wikipedia.org/wiki/Telehealth
> Nothing critical / life-or-death / personal injury should rely on Windows / IT systems.
Because the suppliers of IT systems (eg Microsoft, Crowdstrike) do not agree that they can be used for life-critical purposes.
If someone is injured or dies because the hospital has inadequate backup processes in the event of a Windows outage, some or maybe all liability for negligence falls on those who designed the hospital that way, not the IT supplier who didn't agree to it.
If your assumptions rest on corporate entities or actual decision makers being held legally liable, then you've got a lot of legwork ahead of you to demonstrate why that's a reasonable presupposition.
That's not about experience, that's about following the regulated standards. This is well known ever since technology (not computers) got into hospitals.
And? People and institutions constantly make bad decisions for which there are reasonable alternatives, and that's assuming that the incentives at play for decision makers are aligned with what we would want them to be, which is often not the case. Not that that ends up mattering much except as an explanatory device, because people and institutions constantly pursue bad ideas even seen in terms of their own interests.
Disclaimer. Neither Microsoft, nor the device manufacturer or installer, gives any other express warranties, guarantees, or conditions. Microsoft and the device manufacturer and installer exclude all implied warranties and conditions, including those of merchantability, fitness for a particular purpose, and non-infringement. If your local law does not allow the exclusion of implied warranties, then any implied warranties, guarantees, or conditions last only during the term of the limited warranty and are limited as much as your local law allows. If your local law requires a longer limited warranty term, despite this agreement, then that longer term will apply, but you can recover only the remedies this agreement allows.
It doesn’t really matter what the contract says. Laws take precedence over contracts. For example, Boeing’s liability for 737 airliners that crash due to faulty software certainly isn’t limited to the price of the planes.
Yes, the software industry as we know it would not exist if companies were held liable for all damages. But in the current state of affairs they have little incentive to improve software quality - when an incident like this happens they can suffer an insignificant short-term valuation loss, but unless it happens too often they can continue business as usual.
Many companies pay lip service to quality/reliability, but internal incentives almost always go against maintenance and quality-of-service work (and instead reward new projects, features, etc.).
> Yes, the software industry as we know it would not exist if companies were held liable for all damages.
Of course it would. Restaurants are held liable for food poisoning, but they still operate just fine. They just - y’know - take care that they don’t poison their customers.
If makers of computer systems were held liable, software would be a lot more expensive. There would be less of it. And it would also be better.
I like that future too, but to play devil's advocate:
Write me software that coordinates all flights to and from airports, capturing all edge-cases, that's bug free. Then tell me the number you estimate and the number of years to roll this out.
Sure, but ... that's not a spec. Specs have clear goals and limited scope. "All flights from all airports forever" is impossible to program, full stop.
The right way to write code like that is to start simple and small - we're going to service airports X, Y and Z. Those airports handle Q planes per day. The software will be used by (this user group) and have (some set of responsibilities). The software engineers will work with the teams on the ground during and after deployment to make sure the software is fit for purpose. Someone will sign off on using it and trusting its decisions. And lets also do a risk assessment where we lay out all the ways defects in the software could cost money and lives, so we can figure out how risk averse we need to be.
Give me scope like that, and sure - I'll put a team together to write that code. It'll be expensive, but not impossible. And once its working well, I'd happily roll it out to more airports in a controlled and predictable manner.
It honestly did not occur to me. In all seriousness, was a stock exchange ever really hacked (not just data exfiltration -- write access to everything)?
I'd expect crowdstrike to take a big hit. Between this and the russian hack [edit: actually not, sorry, confused with SolarWinds], I am not sure they are not causing more problems than they solve.
The waves that are already looking like a storm in a teacup ?
There is no 'AI', that is always only hype. There is machine learning, which is a very powerful technology but I doubt MSFT will be leading that revolution. As for LLMs, MSFT might have some competitiveness there but I doubt it's going to be a very lucrative market. MSFT is highly overvalued.
<< There is no 'AI', that is always only hype. There is machine learning, which is a very powerful technology
I agree with you on the technical aspect, but the distinction makes regular people's eyes glaze over within 5 seconds of that explanation. AI as a label for this is here to stay, the same way cyber stopped meaning text sex on IRC. The people have spoken.
<< MSFT is highly overvalued.
Yes, but so is NVDA, the entire stock exchange and US real estate market. We are obviously due for a major correction and have been for a while. As in, I actually moved stuff around in my 401k to soften the blow in that event 2 years ago now. edit: yes, I am a little miffed I missed out on that ride.
So far, everything was done to prevent a hard crash, and in an election year that is unlikely to change. Now, after the election, that is another story altogether.
<< I doubt MSFT will be leading that revolution.
I think I agree. I remain mildly hopeful that the open model approach is the way.
You should stop trying to predict the next crash. According to the study, most people (including institutional investors) consistently believe there is a >10% chance the market will crash in the next 6 months when historically the probability is only 1%
<< You should stop trying to predict the next crash.
Hmm? No. I will attempt to secure my own financial interest.
<< According to the study, most people (including institutional investors) consistently believe there is a >10% chance the market will crash in the next 6 months when historically the probability is only 1%
Historically is doing a fair amount of work here. I would argue there is little historical value to the data we face. Over the past few decades we went through several mini revolutions (industrial, information and whatever they end up calling this one) in terms of how we work, eat, communicate and, well, live.
All of these have upended how humans interact with the world, effectively changing the calculus on the data that preceded it, if not nullifying it altogether in some ways.
Your argument is to stop worrying since you are likely wrong anyway, by a factor of 10. What I am saying is that in 1935 people also thought they had time to ride the wave.
My brain goes there too, but the other part of my brain says "line always goes up." The richest among us are heavy owners of stocks, and this country does everything it can to keep those numbers up. Look at that insane COVID V-shaped recovery that happened. That's just not a real/natural market reaction in my book.
The worst part is that I get the need to do something to rein it in, but I get the feeling it will, as always, not be the actual rich (owns-the-color-blue rich level) who will suffer from those plans. There are fewer and fewer moves the government has as time progresses.
How is it their fault and responsibility? Isn’t falcon sensor basically running like a kernel module? Does it mean that Windows is not engineered properly when it can be crashed by this?
Are you saying that they should prevent or limit the ability of their users from installing third party software? Or at the very least prevent it from running in kernel mode?
This is an insane take. Do you think other industries get away with limiting their liability to the product cost? No, because that doesn't provide adequate incentives for making a safe product. The amount of software that gets written depends mostly on the demand for that software. Even if Microsoft were not willing to up their game to make the risk viable, someone else would.
The thing is we know how to make (eg) food that is safe or to a lesser extent bridges that don't fall down. If you sell food that makes people sick you should have known how to avoid that and so you can be held liable.
We don't have a good idea how to make software that is flawless, at least, not at scale for a cost that is acceptable. This is changing a little bit now with the drive by governments to use memory-safe languages, but that only covers a small part of the possible spectrum of bugs in software and hardware.
What's "critical software"? Software controlling flight systems in planes is already held to very high standards, but is enormously expensive to write and modify.
In this case it seems most of the software which is failing is dull back office stuff running on Windows - billing systems, train signage, baggage handling - which no one thought was critical, and there's no way on earth we could afford to rewrite it in the same way as we do aircraft systems.
Something that has managed to ground a lot of planes and disable emergency calls today is in fact critical. The outcome of it failing proves it is critical. Whatever it is.
Now, that it was not known previously to be critical, that may be. Whether we should have realised its criticality or not, is debatable. But going forward we should learn something from this. So maybe think more about cascading failures and classify more things as critical.
I have to wonder how the failure of billing and baggage handling has resulted in 911 being inoperative. I think maybe there's more to it than you mention here.
Agreed, there is no such thing as perfect software.
In the physical world, you can specify a tolerance of 0.0005 in, but the part is going to cost $25k apiece. It is trivially easy to specify a tolerance, very hard to engineer a whole system that doesn't blow the cost, and impossible to fund.
Given how widespread the issue is, it seems that proper testing on Crowdstrike's part could have revealed this issue before rolling out the change globally.
It's also common to roll out changes regionally to prevent global impact.
To me it seems Crowdstrike does not have a very good release process.
There's only one piece of software which (with adaptations) runs every Airbus plane. The cost of developing and modifying that -- which is enormous -- is amortized over all the Airbus planes sold. (I can't speak about Boeing)
What failed today is a bunch of Windows stuff, of which there is a vast amount of software produced by huge numbers of companies, all of very variable quality and age.
I meant critical software as shorthand for something like: the quality of software should be proportional to the amount of disruption caused by downtime.
Point of sale in a records store, less important. Point of sale in a pharmacy, could be problematic. Web shop customer call center, less important. Emergency services call center, could be problematic.
I, as a producer of software, have effectively no control over where it gets used. That's the point.
Outside of regulated industries it's the context in which software is used which determines how critical it is. (As you say.)
So what you seem to be suggesting (effectively) is that use of software be regulated to a greater/lesser extent for all industries... and that just seems completely unworkable.
What you're describing is a system where the degree of acceptable failure is determined after the software becomes a product because it is being determined by how important the buyer is. That is backwards and unworkable.
It isn't, though. "You may not sell into a situation that creates an unacceptable hazard" is essentially how hazardous chemical sale is regulated, and that's just the first example that I could find. It's not uncommon for a seller to have to qualify a buyer.
I think the system is rather a one where if you offer critical services then you're not allowed to use a software that hasn't been developed up to a particular high standard.
So if you develop your compression library it can't be used by anyone running critical infra unless you stamp it "critical certified", which in turn will make you liable for some quality issues with your software.
I assume you mean "if the buyer will use the software in critical systems."
That's very realistic and already happens by requiring certain standards from the resulting product. For example, there are security standards and auditing requirements for medical systems, payment systems, cars, planes, etc.
> Software controlling flight systems in planes is already held to very high standards, but is enormously expensive to write and modify.
Here's something I don't understand: those jobs pay chump change compared to places like FB and (afaik) social networks don't have the same life-or-death context
Would not shock me for AV companies to immediately work around that if it were to be implemented. “You want our protection all of the time, even if the attacker is corrupting your drivers!”
We don't know how to make general software safe, but we do know how to make any one piece of software safe. If your software is going to be used as infrastructure then it should be held to the same standards. If you don't want it to be treated as infrastructure, don't sell it to hospitals.
The production simplicity of having a standardized OS and being able to drop in a .exe and have it run everywhere without worrying about building for 1000 system combinations cannot be beat.
Enterprise Linux can fairly consistently be assumed to be RHEL, Ubuntu, or SuSE, with the first two being far more likely in the U.S. That’s not that much to ask for.
That's... not reality even on desktop PCs, and never was. If your business is more complex than selling hot dogs or ice cream (or even that, on a big enough scale), the IT of such a company will become a small monstrosity over time, and the complexity of such deployments on Unix vs Windows is nothing compared to the overall picture.
I see you somehow avoided learning what DLL hell is, what the various incompatible .NET runtime versions are, and what optional compatibility levels Windows 10 offers.
Plus, CrowdStrike runs on Linux as well. _This time_ they only crashed Windows devices, but there's no guarantee that switching to Linux would prevent any of it.
You can switch away from CrowdStrike but I doubt you'll be able to convince whoever mandated CS to be installed to not install an alternative that carries exactly the same risks.
>CrowdStrike runs on Linux as well. _This time_ they only crashed Windows devices, but there's no guarantee that switching to Linux would prevent any of it.
In fact there was a recent CrowdStrike-related crash in RHEL:
At least on Linux it runs on eBPF sniffing, so the chances of fudging something are lower. There are some supported Linux distributions where they also have a kernel module, and there might be a higher chance of that exploding.
There's nothing special about Windows beyond the fact that you can run arbitrary executable files. The problem could just as easily have happened for Linux or iOS/Mac and in fact it has. ChromeOS kind of works if you want to run a web application that's hosted on some web server... but it's not appropriate for running programs where a dumb browser doesn't suffice.
I'm not in IT anymore and we run 100% macs, so serious question here: isn't nearly everything a webapp nowadays? Every "non dev" thing that I have to do for work happens in my browser or an electron app. I guess maybe MS Office apps may be the biggest hitch? We use Google Workspace and that's all in browser.
It's horrible to use though. Google's suite is somewhat better than MSFT's web one, but it still is weak compared to any established desktop office suite, even libreoffice.
I've found it alright to be honest. I'd like to use libre office but the incompatibilities with .docx make it too annoying. Finally I can easily work with .docx on Linux, thanks to the web version :)
I don't think you can hold Microsoft liable for 3rd party software pushing its own update. Microsoft didn't make anyone install Crowdstrike or its update files.
Some people in the comments claim CS was used for compliance reasons. Some others claim Windows & CS do not offer warranties. How can a product satisfy the compliance check-box, if it does not offer the warranty and not accept liability for the related features?
While software is often warranted, contracts won't often accept liability in terms of business damages etc, and that's not usually a requirement for compliance.
If it was, it would also make it impractical for a small business to contract with a large one because of risk.
Depends. I'm at an EMR maker; our Windows machines (as well of those of our clients - read: hospitals and doctors offices) are down. That is, of course, bad for the patients under their care.
Do these clients have SLAs? If so, they're definitely on the hook for something. You could probably get a few businesses together for a decent class-action against Crowdstrike. You're then expecting a lawyer to be able to convince a dozen semi-random people with varying degrees of computer knowledge that Crowdstrike's software was negligently designed, developed, and deployed in a way that caused financial or life losses for customers.
What if your company mandated your customers run crowdstrike in order to run your software? What are the legal implications of that? Wouldn't that also put your contracts on the hook?
Sort of. They need to be sued into bankruptcy. Current shareholders get completely zeroed out; the company still exists, but is sold to the highest bidder with the proceeds paid out to affected customers.
We need this so that every company board is always asking "are we investing enough to make sure this never happens to us?"
A local rooflayer is absolutely corrupt. He cheats every customer, produces leaky roofs, doesn't even pay taxes completely.
It takes 2 years for the legal system to catch up, at which point he starts a new company, bankrupts the old one, sells all his tools cheaply to the new company, and fires and rehires his workers. I've seen this game going on for 14 years now.
I think Crowdstrike would do the same: Start a new one, sell the software, fire and rehire the workers, then go on as if nothing happened
I'd call BS on this story, but I know a friend that bought a home a few years back from a homebuilder that did a similar thing, except at a whole-home level. Absolute disaster. He's been chasing him for half a decade now via legal means to get things fixed.
Not really though. Whether they should continue to exist into the future should depend on if the expected positive value of their services in that future exceeds the expected damage from having a big meltdown every once in a while. That some of their devs made a fuckup doesn't mean the entire product line is now without merit.
Killing the company because they made a mistake doesn't just throw away a ton of learned lessons (because the devs will probably be scattered around the industry where their newly acquired domain knowledge will be less valuable) but also forces a lot of companies to spend resources changing their antivirus scanners. For all we know, Crowdstrike might never fuck up again after this and forcing that change would burn hundreds of millions for basically no reason.
"Whether they should continue to exist into the future should depend on if the expected positive value of their services in that future exceeds the expected damage from having a big meltdown every once in a while"
I don't think that's right, since it ignores externalities.
You want to create a system where every company is incentivized to make positive security decisions. If your response to a fuckup of unprecedented scale is just "they learned their lesson, they probably won't do that again", then the message these companies receive is that it is okay to neglect proper security procedures, because you get one global economic meltdown for free.
Do we not remember "Ma" Bell? This should perhaps be a wakeup call in regards to Microsoft and other large tech having concentrated fingers in too many pies. This appears to be an anti-trust issue at its core.
Was it really a botched update? Or was it a test run for holding the world hostage prior to a coup?
Negligence at Crowdstrike is not covered by any SLA. Even if insured, Crowdstrike could be fucked. Let alone that companies are going to try and work out how much this has cost them. Long term, they're fucked.
> Chances if Microsoft or Crowdstrike will be held liable for financial losses caused by this outage?
Zero. Exactly Zero.
Clearly you have never been involved in buying insurance or writing contracts for IT products/services.
Loss of contracts, profits, goodwill, economic loss, loss of data and all that jazz is excluded in whole or limited to a fixed monetary value.
It is known as indirect, consequential or special loss, damage or liability.
No lawyer worth their salt will let an IT product/service company draft a contract that does not have the above type of clause.
And good luck finding an insurance contract that will pay out for such losses, indeed most of them have conditions that state your contracts with customers must exclude or limit such losses.
Most software also has clauses excluding use in safety critical environments.
I'm just curious, don't they have something like "gradual rollout" to update their app? Do they just bulk-update simultaneously across all agents? No way. Something is a bit off for me. But there are good lessons to learn for sure.
I read that they pushed a new configuration file, so possibly they don't consider that a "software update" and pushed it to everyone. Which is obviously insane. If I am publishing software, it doesn't matter if I've changed a .py file or a .yaml file. A change is a change and it's going to be tagged with a new version.
A 75 billion dollar valuation, CNBC analysts praising the company this morning on how well it is run!... When in reality they can't master the most basic of the phased deployment methodologies known for 20 years...
Hundreds of handsomely paid CTOs, at companies with billions of dollars in valuations, critical healthcare, airlines, who can't master the most basic of concepts: "Everything fails all the time"...
I'll take it a step further and say that every industry is depressing when it comes to computers at scale.
Rather than build efficient, robust, fault-tolerant, deterministic systems built for correctness, we somehow manage to do the exact opposite. We have zettabytes and exaflops at our fingertips, and yet, we somehow keep making things slower. Our user interfaces are noisier than ever, and our helpdesks are less helpful than they used to be.
I am drifting towards hating to turn on my computer in the morning. The whole day is like pissing into the wind, trying to find workarounds for annoyances or even malfunctions, getting rid of obstructive noise from all directions; my productivity using modern computer systems is diminishing compared to where it was just a mere 10-15 years ago (still better than 25 years ago, not only because of experience but also the access to information on demand). Very depressing. I should have become a farmer perhaps.
What I find definitely depressing is the fact that we used to roll out even OS upgrades progressively (I guess now that is done through Intune?), and that was one point in favor of Windows (on Linux you had to do things yourself at the time AFAIK, I don't think the situation has improved much).
Nowadays we get mandated software upgrading all at once across the entire company fleet and no one bats an eye - I counted more than a dozen agents installed for "security" and "monitoring" purposes on my previous company's servers, many of those with hooks in the kernel obviously, and many of those installed with random policies to tick yet another compliance box...
> (on Linux you had to do things yourself at the time AFAIK, I don't think the situation has improved much)
You can schedule the updates any time you want: want to do it staggered, then configure that; want to do it all at the same time, then do that; want it with a random interval, that's also possible. I don't see the "you need to do everything yourself" any more than in any other managed environment.
I haven't been a sys admin in a very long time so my systems knowledge might be outdated, but I reckon functionality like intune's built-in monitoring of specific feature install failures would make a huge difference with a few dozen systems, let alone the hundreds of thousands you see in some of today's deployments. It's not like that stuff isn't possible on Linux, but if you're coordinating more than a few systems, that turns into a big, expensive project pretty quickly.
Centralized management is very useful, just a random delay is not enough. One of the (big) companies I worked with had jury-rigged something with Chef, I believe, to show different machines different "repositories" and roll things out progressively (1% of the fleet, 5%...).
Staggering is necessary in some cases. I've heard of scenarios where a company has lots of devices in the field which all simultaneously try to download a big update, and DDOS the servers hosting that update.
This borked our dispatch/911 call center then as well. However, it wasn't as bad as this one. This outage put our entire public safety system into the stone age and with that we were at stone age efficiency.
I work IT at a regional 911 center. We're fine but I sympathize with those who are back to pen and paper dispatching. Hard for most current dispatchers to realize the way we did it back in the day.
The worst part is that nobody will be held accountable. An F-up like this should wipe out the entire company, but instead everyone will just shrug it off as an oopsie, a few low-level employees will get punished, and nothing will change.
So CrowdStrike is deployed as third party software into the critical path of mission critical systems and then left to update itself. It's easy to blame CrowdStrike but that seems too easy on both the orgs that do this but also the upstream forces that compel them to do it.
My org which does mission critical healthcare just deployed ZScaler on every computer which is now in the critical path of every computer starting up and then in the critical path of every network connection the computer makes. The risk of ZScaler being a central point of failure is not considered. But - the risk of failing the compliance checkbox it satisfies is paramount.
All over the place I'm seeing checkbox compliance being prioritised above the actual real risks arising from how the compliance is implemented. Orgs are doing this because they are more scared of failing an audit than they are of the consequences of failure of the underlying systems the audits are supposed to be protecting. So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.
> The risk of ZScaler being a central point of failure is not considered. But - the risk of failing the compliance checkbox it satisfies is paramount.
You're conflating Risk and Impact, and you're not considering the target of that Risk and that Impact.
Failing an audit:
1. Risk: high (audits happen all the time)
2. Impact to business: minimal (audits are failed all the time and then rectified)
3. Impact to manager: high (manager gets dinged for a failing audit).
Compare with failing an actual threat/intrusion:
1. Risk: low (so few companies get hacked)
2. Impact to business: extremely high
3. Impact to manager: minimal, if audits were all passed.
Now, with that perspective, how do you expect a rational person to behave?
[EDIT: as some replies pointed out, I stupidly wrote "Risk" instead of "Odds" (or "Chance"). Risk is, of course, the expected value, which is probability X impact. My post would make a lot more sense if you mentally replace "Risk" with "probability".]
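To put rough numbers on that probability-times-impact point (all figures made up, purely to show why the manager's incentives point the way they do):

    # Toy expected-loss comparison; none of these figures are real.
    def expected_loss(probability: float, impact: float) -> float:
        return probability * impact

    # Failing an audit: likely, but cheap for the business to rectify.
    audit = expected_loss(probability=0.5, impact=50_000)

    # An actual intrusion: rare, but catastrophic.
    breach = expected_loss(probability=0.01, impact=20_000_000)

    print(f"audit:  {audit:,.0f}")   # 25,000
    print(f"breach: {breach:,.0f}")  # 200,000

The breach dominates on expected value for the business, but the manager's personal downside is almost entirely the audit, which is the asymmetry being described.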
Moreover no manager gets dinged for "internet-wide" outages unfortunately, so the compliance department keeps calling the shots. The amount of times I've had to explain there's no added security in adding an "antivirus" to our linux servers as we already have proper monitoring at eBPF level is annoying.
I'd be fired if I caused enough loss in revenue to pay my own salary for a year.
I am responsible for my choices. I'm CTO, I don't doubt that in some cases execs cover for each other, but at least I have anecdotal experience of what it would take for me to be fired- and this is clearly communicated to me.
Hope you get paid a lot! Otherwise you are either in a very young or very stupid job.
I regularly spend multiples of my salary every month on various commitments my company makes; any small mistake could easily mean it's a multiples-of-my-salary type of problem within 10 days.
A friend of mine spent half a million on a storage device that we never used. It sat in the IT area for years until we were acquired. Everyone gave him so much shit. Finance asked me about it numerous times (going around my friend the CTO) so they could properly depreciate it. He didn't get dinged by the board at all. It remained an open secret. We were making million dollar decisions once a month, though.
> I regularly spend multiples of my salary every month on various commitments my company makes.
Yeah, same here.
But if I choose a vendor and that vendor fails us so catastrophically as to make us financially insolvent, then it's my job to have run a risk analysis and to have an answer for why.
If it's more cost effective to take an outage, that's fine, if it's not: then why didn't I have a DRP in place, why did we rely so much on one vendor, what's the exposure.
It's a pretty important part of being a serious business person.
Sure, but that's not what I said or you said, and my commentary was about relative measures of your salary to your budget.
If you can't make a mistake of your salary size in your budget then your budget is small or very tight; most corporations fuck up big multiples of their CTO's salary quarterly (but that turns out to be single-digit percentage points of anything useful).
> I'd be fired if I caused enough loss in revenue to pay my own salary for a year.
I'm not so sure.
I know of a major company that had a glitch, multiple times, that caused them to lose about ~15 million dollars at least once (a non-prod test hit prod because of a poorly designed tool).
I was told the decision-makers decided not to fix the problem (the risk of losing more money again) because the "money had already been lost."
"no manager gets dinged for "internet-wide" outages"
Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".
But, seems like for uptime, someone should be identifiable. If your job is uptime, and there is a world wide outage, I'd think it would roll down hill onto someone.
> Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".
I wouldn't necessarily say IBM or SAP are "crap". It's much more likely that orgs buying into IBM or SAP don't do the due diligence on what it truly costs to properly set it up and keep it running, and therefore cut tons of corners.
They basically want to own a Ferrari, and when it comes to maintenance they want to run regular gas and try to get their local mechanic to slap Ford parts on it because it's too expensive to keep going back to the dealership.
The thing is usually this argument goes something like this:
A: Should prod be running a failover / <insert other safety mechanism>?
B: Yes!
A: This is how much it costs: <number>
B: Errm... Let me check... OK I got an answer, let's document how we'd do it, but we can't afford the overhead of an auto-failover setup.
And so then there will be 2 types of companies, the ones that "do it properly" will have more costs, their margins will be lower, over time they'll be less successful as long as no big incident happens. When a big incident happens though, for most businesses - recent history proves that if everyone was down, nobody really complains. If your customers have 1 vendor down due to this issue, they will complain, but if your customers have 10 vendors down, and are themselves down, they don't complain anymore. And so you get this tragedy of the commons type dynamic where it pays off to do what most people do rather than the right thing.
And the thing is, in practice, doing the thing most people do is probably not a bad yardstick - however disappointing that is. 20 years ago nobody had 2FA and it was acceptable, today most sites do and it's not acceptable anymore not to have it.
Parents may teach this to kids but the kids usually notice their parents don't practice what they preach. So they don't either.
The world is filled with people following everybody else off a cliff. If you're warning people or even just not playing along in a time of great hysteria, people at best ignore your warnings and direct verbal abuse at you. At worst, you can face active persecution for being right when the crowd has gone insane. So most people are cowards who go along to get along.
I think the parent was correct in the use of the word "Risk"; it's different than your definition, which appears to be closer to "likelihood".
Risk is a combination of likelihood and impact. If "risk" were just equivalent to "likelihood" then leaving without an umbrella on a cloudy day would be a "high-risk situation".
A rational person needs to weigh both the likelihood and impact of a threat in order to properly evaluate its risk. In many cases, the impact is high enough that even a low likelihood needs to be addressed.
ZScaler and similar software also has some hidden costs: Performance and all the other fun that comes with a proxy between you and the server you connect to.
> What I'm saying is that the business's interests are not aligned with the people comprising that business.
Yep, that's the point of capitalism.
> In that regard, what "the business" wants is irrelevant.
And yet here we are. Companies get fined left and right for breaching rules but it's ok because it earned them money. There are literal plans made to calculate whether it's profitable to cheat or not. In the current system, what the business wants always wins over individual qualms, unfortunately.
Because the punitive system in most countries doesn't affect individuals. As a manager, you're not going to jail for breaking environmental laws; a different entity (the company) is paying for being caught. So it's still the rational thing to do to break the environmental laws to make your group's numbers go up and get a promo or bonus.
Almost correct, but you mean 'chance' where you write 'risk':
Risk = Chance × Impact
The chance of failing an audit initially is high (or medium, present at least). The impact is usually low-ish. It means a bunch of people need to fix policy and set out improvement plans in a rush. It won't cost you your certification if the rectification is handled properly.
It's actually possible that both of your examples are awarded the same level of risk, but in practice the latter example will have its chance minimized to make the risk look acceptable.
> Now, with that perspective, how do you expect a rational person to behave?
They'd deploy the software on the critical path. That's exactly GP's point, isn't it? That's why GP explicitly wants us to shift some of the blame from the business to the regulators. GP advocates for different regulatory incentives so that a rational person would then do the right thing instead of the wrong thing.
I'm at risk of sounding like Chicken Little, but the reality is companies are getting popped all the time - you just don't hear about them very often. The bar for media reporting is constantly being raised to the point where you only hear about the really big ones.
If you read through any of the weekly Risky Biz News posts [1] you’ll often see a five or more highly impactful incidents affecting government and industry, and they’re just the reported ones.
I wonder how much that's still true now that ransomware has apparently become viable.
Finding an insecure target, setting up the data hostage situation, and having the victim come to pay is scalable and could work in volume. If getting small money from a range of small targets becomes profitable, small fish will bear similar risks to juicier targets.
But...surely you're also missing another point of consideration:
Single point of failure fails, taking down all your systems for an indeterminate length of time:
1. Risk: moderate (an auto-updating piece of software without adequate checks? yeah, that's gonna fail sooner or later)
2. Impact to business: high
3. Impact to manager: varies (depending on just how easy it is to spin the decision to go with a single point of failure rather than a more robust solution to the compliance mandate)
> 3. Impact to manager: minimal, if audits were all passed.
I don't know about you, but I'll be making sure everyone knows that the manager signed off on the spectacularly stupid idea to push through an update on a friday without testing.
Of course, disabling those auto updates will have you fail the external security audit and now your security team needs to fight with the rest of the leadership in the company explaining why you're generating needless delays, costs against the "state of the art in security industry" and why your security guys are smarter than the people who have the power to approve or deny your security certification.
I've taken part in some security audits where I work. They're not a joke only because they're a tragic story of incompetence, hubris, and rubberstamping. They 100% focus on checking boxes and cargo-culting, while leaving enormous vulnerabilities wide open.
What I don't understand is why they don't have a canary update process. Server side deployments do this all the time. You would think Windows would offer that to their institutional customers, for all types of updates including (especially) 3rd party.
This isn't a Windows update (which absolutely does let you do blue/green deployments via WSUS), but rather a Crowdstrike update, which also lets you stage rollouts, and I expect several administrators are finding out why that is important.
I know about update policies, but afaik those are about the “agent” version. Today’s update doesn’t look like an agent version. The version my box is running was released something like a week ago.
Is there some possibility to stage rollouts of the other stuff it seems to download?
Kind of a big thing most people don't understand about the various forms of "Business Insurance." For the most part, businesses have whatever insurance whatever they are doing requires them to have. Those requirements are set by laws/regulations applied to those entities and the various entities they want to do business with.
At every small shop I've worked when the topic of Business Insurance came up with one of the owners, the response was extremely negative -- basically summarized as "it's the most you will ever pay for something you won't ever be able to use".
Yep, it’s pretty much a toll on doing business with entities. I’ve no doubt the intention is so your customer can sue you without you winding up, whether it actually works… no idea.
>> It's easy to blame CrowdStrike but that seems too easy on both the orgs that do this but also the upstream forces that compel them to do it.
While orgs using auto update should reconsider, the fact that CrowdStrike don't test these updates on a small amount of live traffic (e.g. 1%) is a huge failure on their part. If they released to 1% of customers and waited even 24 hours before rolling out further this seems like it would have been caught and had minimal impact. You have to be pretty arrogant to just roll out updates to millions of customers devices in one fell swoop.
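A sketch of the kind of staged rollout being asked for here, assuming nothing about CrowdStrike's actual pipeline: hash each device ID into a fixed bucket and only serve the new channel file to buckets under the current rollout percentage, widening only after health checks stay green. The names (device_id, rollout_percent) are illustrative, not anything from Falcon.

    import hashlib

    def in_rollout(device_id: str, rollout_percent: int) -> bool:
        # Deterministic bucket 0-99 per device, stable across waves.
        bucket = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 100
        return bucket < rollout_percent

    # Day 1: 1% of the fleet; widen only if nothing starts bluescreening.
    for pct in (1, 5, 25, 100):
        eligible = sum(in_rollout(f"device-{i}", pct) for i in range(10_000))
        print(f"{pct}% wave -> {eligible} devices")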
Why even test the updates on a small amount of live customers first? Wouldn't this issue already have surfaced if they tested the update on a handful of their own machines?
You are completely right. BTW It wasn't a software update, it was a content update, a 'channel file'.
Someone didn't do enough testing. edit: or any testing at all?
It's an automatic update of the product. Semantic "channel vs. binary" doesn't indicate anything. If your software's definition files can cause a kernel mode driver to crash in a bootloop you have bigger problems, but the outcome is the same as if the driver itself was updated.
Indeed. It's worse really; it means there was a bug lurking in their product that was waiting for a badly formatted file to surface it.
Given how widespread the problem is it also means they are pushing these files out without basic testing.
edit: It will be very interesting to see how CrowdStrike wriggle out of the obvious conclusion that their company no longer deserves to exist after a f*k up like this.
That's funny, because IIRC McAfee back in the Windows XP days did this exact same thing! They added a system file to the signature registry and caused Windows computers to BSOD on boot.
That's even worse: they should be fuzz testing with bad definitions files to make sure this is safe. Inevitably the definitions updates will be rushed out to address zero days, and the work should be done ahead of time to make them safe.
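A sketch of that fuzzing idea; parse_channel_file here is a hypothetical stand-in for the real parser (a vendor would run this against the actual kernel-mode parser inside a throwaway VM), but the shape of the test is the point: a mangled definitions file should be rejected gracefully, never crash the host.

    import random

    def parse_channel_file(blob: bytes) -> None:
        # Stand-in parser; assume the real one raises on malformed input.
        if not blob.startswith(b"CSCF"):  # hypothetical magic header
            raise ValueError("bad magic")

    def mutate(data: bytes, flips: int = 8) -> bytes:
        buf = bytearray(data)
        for _ in range(flips):
            i = random.randrange(len(buf))
            buf[i] ^= random.randrange(1, 256)  # corrupt a few bytes
        return bytes(buf)

    def fuzz(good_file: bytes, rounds: int = 10_000) -> None:
        for _ in range(rounds):
            try:
                parse_channel_file(mutate(good_file))
            except ValueError:
                pass  # graceful rejection is the desired outcome
            # anything else (crash, hang, memory corruption) is a ship-blocking bug

    fuzz(b"CSCF" + bytes(64))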
Having spent time reverse-engineering Crowdstrike Falcon, a lot of funny things can happen if you feed it bad input.
But I suspect they don't have much motivation to make the sensor resilient to fuzzing, since the thing's a remote shell anyways, so they must think that all inputs are absolutely trusted (i.e. if any malicious packet can reach the sensor, your attackers can just politely ask to run arbitrary commands, so might as well assume the sensor will never see bad data..)
This is something funny to say when the inputs contain malware signatures, which are essentially determined by the malware itself.
I mean, how hard would it be to craft a malware that has the same signature as an important system file? Preferably one that doesn't cause immediate havoc when quarantined, just a BSOD after reboot, so it slips through QA.
Even if the signature is not completely predictable, the bad guys can try as often as they want and there would not even be way to detect these attempts.
> malware signatures, which are essentially determined by the malware itself.
No they're not. The tool vendor decides the signature, they pick something characteristic that the malware has and other things don't, that's the whole point.
> how hard would it be to craft a malware that has the same signature as an important system file?
Completely impossible, unless you mean, like, bribe one of the employees to put the signature of a system file instead of your malware or something.
Sure, but they do it following a certain process. It's not that CrowdStrike employees get paid to be extra creative in their job, so you likely could predict what they choose to include in the signature.
In addition to that, you have no pressure to get it right the first time. You can try as often as you want and analyzing the updated signatures you even get some feedback about your attempts.
Like, «We require that your employees open only links on a whitelist, and social networks cannot be put on this list, and we require a managed antivirus / firewall solution, but we are OK that this solution has a backdoor directly to a 3rd-party organization»?
It is crazy. All these PCI DSS and SOC 2 certifications look like a comedy if they allow such things.
At a former employer of about 15K employees, two tools come to mind that allowed us to do this on every Windows host on our network[0].
It's an absolute necessity: you can manage Windows updates and a limited set of other updates via things like WSUS. Back when I was at this employer, Adobe Flash and Java plug-in attacks were our largest source of infection. The only way to reliably get those updates installed was to configure everything to run the installer if an old version was detected, and then find some other ways to get it to run.
To do this, we'd often resort to scripts/custom apps just to detect the installation correctly. Too often a machine would be vulnerable but something would keep it from showing up on various tools that limit checks to "Add/Remove Programs" entries or other mechanisms that might let a browser plug-in slip through, so we'd resort to various methods all the way down to "inspecting the drive directory-by-directory" to find offending libraries.
We used a similar capability all the way back in the NIMDA days to deploy an in-house removal tool[1]
[0] Symantec Endpoint Protection and System Center Configuration Manager
[1] I worked at a large telecom at that time -- our IPS devices crashed our monitoring tool when the malware that immediately followed NIMDA landed. The result was a coworker and I dissecting/containing it and providing the findings to Trend Micro (our A/V vendor at the time) maybe 30 minutes before the news started breaking and several hours before they had anything that could detect it on their end.
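A rough sketch of that directory-by-directory sweep; the plug-in filenames and root path are illustrative stand-ins, not the actual tooling described above.

    import os

    # Example: old Flash plug-in binaries that shouldn't be on disk anymore.
    SUSPECT = {"npswf32.dll", "flash.ocx"}

    def find_stale_plugins(root: str = "C:\\"):
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if name.lower() in SUSPECT:
                    yield os.path.join(dirpath, name)

    for hit in find_stale_plugins():
        print(hit)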
Hilariously, my last employer was switching to Crowdstrike a few months ago when my contract ended. We previously used Trellis which did not have any remote control features beyond network isolation and pulling filesystem images. During the Crowdstrike onboarding, they definitely showed us a demo of basically a virtual terminal that you could access from the Falcon portal, kind of like the GCP or AWS web console terminals you can use if SSH isn't working.
As I understand, this only manifests after a reboot and if the 'content update' is tested at all it is probably in a VM that just gets thrown away after the test and is never rebooted.
Also, this makes me think:
How hard would it be to craft a malware that has the same signature as an important system file?
Preferably one that doesn't cause immediate havoc when quarantined, just a BSOD after reboot, so it slips through QA.
I don't believe this is what's happened, but I think it is an interesting threat.
Nope, not after a reboot. Once the "channel update" is loaded into Falcon, the machine will crash with a BSOD and then it will not boot properly until you remove the defective file.
> How hard would it be to craft a malware that has the same signature as an important system file?
Very, otherwise digital signatures wouldn’t be much use. There are no publicly known ways to make an input which hashes to the same value as another known input through the SHA256 hash algorithm any quicker than brute-force trial and error of every possibility.
This is the difficulty that BitCoin mining is based on - the work that all the GPUs were doing, the reason for the massive global energy use people complain about is basically a global brute-force through the SHA256 input space.
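A quick illustration of that property: change a single byte of the input and the SHA-256 digest is completely different, and there is no known way to work backwards from a target digest short of brute-forcing the input space.

    import hashlib

    original = b"important system file contents"
    tampered = b"important system file contentz"  # one byte differs

    print(hashlib.sha256(original).hexdigest())
    print(hashlib.sha256(tampered).hexdigest())
    # The two digests share no useful structure; finding *any* input that
    # matches a given SHA-256 digest is a ~2^256 search.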
I was talking about malware signatures, which do not necessarily use cryptographic hashes. They are probably more optimized for speed because the engine needs to check a huge number of files as fast as possible.
Cryptographic hashes are not the fastest possible hash, but they are not slow; CPUs have hardware SHA acceleration: https://www.intel.com/content/www/us/en/developer/articles/t... - compared to the likes of a password hash where you want to do a lot of rounds and make checking slow, as a defense against bruteforcing.
That sounds even harder; Windows Authenticode uses SHA1 or SHA256 on partial file bytes, the AV will use its own hash likely on the full file bytes, and you need a malware which matches both - so the AV will think it's legit and Windows will think it's legit.
AFAIK important system files on Windows are (or should be) cryptographically signed by Microsoft. And the presence of such signature is one of the parameters fed to the heuristics engine of the AV software.
> How hard would it be to craft a malware that has the same signature as an important system file?
If you can craft malware that is digitally signed with the same keys as Microsoft's system files, we got way bigger problems.
>How hard would it be to craft a malware that has the same signature as an important system file?
Extremely, if it were easy that means basically all cryptography commonly in use today is broken, the entire Public Key Infrastructure is borderline useless and there's no point in code signing anymore.
Admittedly, I don't know exactly what's in these files. When I hear 'content' I think 'config'. This is going to be very hypothetical, I ask for some patience. Not arguments.
The 'config file' parser is so unsafe that... not only will the thing consuming it break, but it'll take down the environment around it.
Sure, this isn't completely fair. It's working in kernel space so one misstep can be dire. Again, testing.
I think it's a reasonable assumption/request that something try to degrade itself, not the systems around it.
edit: When a distinction between 'config' and 'agent' releases is made, it's typically with the understanding that content releases move much faster/flow freely. The releases around the software itself tend to be more controlled, being what is actually executed.
In short, the risk modeling and such doesn't line up. The content updates get certain privileges under certain (apparently mistaken) robustness assumptions. Too much credit, or attention, is given to the Agent!
"All over the place I'm seeing checkbox compliance being prioritised above actual real risks from how the compliance is implemented."
Great statement and one that needs to be seriously considered - would DORA regulation in the EU address this, I wonder? It's a monster piece of tech legislation that SHOULD target this, but WILL it - someone should use today's disaster and apply it to the regs to see if it's fit for purpose.
Emphatically NO. I'm involved in (IT) Risk and DORA at a firm that actually does IT risk scenario planning (the sort-of opposite of checkbox compliance). DORA is rubber stamping all the way round. One caveat is that we are way ahead of DORA, so treating DORA as a checkbox exercise might be situational. But I haven't noticed a place where the rubber hits the road regulatory-wise. It's too easy to stay in checkbox compliance if the board doesn't see IT risk as a major concern. I'm happy one of our board members does. We've gone so far as to introduce a person-and-paper-based credit line, so we can continue an outgoing cashflow if most of our processes fail (for an insurer).
Well, yeah. If a regulation is broken and not achieving its goal it should be changed. What's the alternative? "Regulation? We tried that once and it didn't work perfectly, so now we let The Market™ sort out safety standards."
Who needs regulation when you can have free Fentanyl with your CrowdStrike subscription! All of your systems will go down, but you won't care, and the chance of accidental overdose is probably less than 10%!
Yes, in many contexts that may well be the correct conclusion. Your comment presumes that regulation here has proven itself useful and not resulted in a single point of failure which potentially reduces overall safety. It’s of course the correct comment from a regulator’s perspective.
For the market to work, wouldn't you need something to hold the corps accountable if they fail to be secure AND to make regular people whole if the corps' failures cause them problems?
Especially for something like technology and infosec, which change rapidly, it's silly to look to slow-moving regulations as a solution, not to mention that it ignores history and gambles that politicians will do it competently and without negative side effects, like distracting teams from doing real work that would actually help.
You can create fines and consequences after the fact for blatant security failures as incentives, but inventing a new "compliance" checklist of requirements is going to be out of date by the time it's widely adopted, and most companies will do the bare-minimum bullshit to pass these checklists.
There are so many English-centric assumptions here.
Regulation of liability can be very generic and broad, with open standards that don't need to be updated.
Case in point: most of continental Europe still uses Napoleon's Code civil to prescribe how and when private parties are liable. This is more than 200 years old.
The real issue is that most Americans are stuck with an old English regulatory system, which for fear of overreach was never modernized.
This can be true of security (and every other expense) whether it's regulated or not. Which do you think will result in fewer incidents: the regulated bare minimum, or the unregulated bare minimum?
> So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.
No, we need to hold Architects accountable, and this is the core of the issue. Creating systems with single, outsourced responsibility, in the critical path.
This is the point of much of the security efforts we see now.
Outsourcing of security functions, and things like login push a lot of liability and legal issues off into someone else's house.
It's hard to be the source of a password leak, or to be compromised, when you don't control the passwords. But like any chain, you're only as secure as your weakest link... Snowflake is a great current example of this. Meanwhile, the USPS just told us "oops", it had tracking pixels for a bunch of vendors all over its delivery preview tool.
Candidly, most people's stacks look a lot less like software and more like a toolbar-riddled IE5 install circa 2000. I don't think our industry is in a good place.
This is one of the interesting aspects in Ethereum.
If your validator is down, you lose a small amount of stake, but if a large percentage of the total set of validators are down, you all start being heavily penalized.
This incentivizes people running validators not to use the most popular Ethereum client, to avoid using a single compute provider, and overall to avoid relying on the popular choice, since doing so can cause them to lose the majority of their stake.
There hasn't been a major Ethereum consensus outage, but when that happens, the impact of being lazy and following the herd will be huge.
How is it lazy and herd-like to _not_ run the latest and greatest? Sounds like Ethereum's design is promoting a robustly diverse ecosystem rather than a monoculture.
> How is it lazy and herd-like to _not_ run the latest and greatest?
I'm not sure what you're asking here. Ethereum incentives don't make you run the latest version of your client's software (unless there's a hardfork you need to support). You can run any version that follows the network consensus rules.
The incentives are there to punish people who use the most common software. For example, let's say there are around 5 consensus clients which are each developed by independent teams. If everyone ran the same client, a bug could take down the entire network. If each of those 5 clients were used to run 20% of the network, then a bug in any one of them wouldn't be a problem for Ethereum users and the network would keep running.
If the network is evenly split across those 5 clients but all of them are running in AWS, then that still leaves AWS as a single point of failure.
The incentives baked into the consensus protocol exist to push people towards using a validator client that isn't used by the majority of other validators. That same logic applies to other things like physical host locations, 3rd party hosting providers, network providers, operating systems, etc... You never want to use the same dependencies as the majority of other validators. If you do and a wide-spread issue happens, you're setting yourself up to lose a lot of money.
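A toy model of that correlation penalty, just to show the shape of the incentive (the constants are invented for illustration; Ethereum's actual inactivity-leak math is more involved):

    def offline_penalty(base_penalty: float, offline_fraction: float) -> float:
        """Being offline costs more the larger the fraction of the network that is
        offline with you, so correlated failures hurt far more than solo ones."""
        correlation_multiplier = 1.0 + 10.0 * offline_fraction
        return base_penalty * correlation_multiplier

    print(offline_penalty(1.0, 0.01))  # down alone: ~1.1x the base penalty
    print(offline_penalty(1.0, 0.40))  # down with 40% of the network: 5x

Under a rule like this, picking the same client, host, and cloud as everyone else is exactly what maximizes your expected loss.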
It sounds like you're describing the advantages of diversity, with a little game theory thrown in to sweeten the deal. Still not sure how that can be described as lazy, or did I completely mis-read the original phrasing?
I find that in today's world it is no longer about one person being "accountable". There is always an interplay of factors; as others have pointed out, cyber security has a compliance angle. Other times it is a cost factor: redundancy costs money. Then there is the whole revolving door of employees coming and going, so the institutional knowledge about why a decision was made is lost with them.
That is hard to do for even a small company. How do you balance all that out for critical infrastructure at a much larger scale?
The problem is that, even knowing this is likely to happen, many companies would still put CrowdStrike into a critical system for the sake of security compliance / audit. And it's not even prioritization of security over reliability, because the incentives are to care more about check-boxes in the audit report than about actual security. Looks like almost no party in this tragic incident had a strong incentive to prevent it, so it's likely to happen again.
Can anyone explain how CrowdStrike could possibly fix this now? If affected machines are stuck in an endless BSOD cycle, is it even possible to remotely roll out a fix? My understanding is that the machines will never come to the point where a CS update would be automatically installed. Is the only feasible option the official workaround of manually deleting system files after booting into the recovery environment? How could this possibly be done on scale in organizations with tens of thousands of machines?
There are orgs out there right now with 50,000+ systems in a reboot loop. Each one needs to be manually configured to disable CS via safe mode so that the agent version can be updated to the fixed version. Throw BitLocker into the mix, which makes this process even longer, and we're talking about weeks of work to recover all systems.
CrowdStrike itself will not fix anything. They published a guide on how to work around the problem and that's it. Most likely a lot of sales reps and VPs will be fielding calls all over the weekend, explaining to large customers how they managed to screw up and how much discount they will offer on the next renewal cycle.
Legally, I think somewhere in their license it says that they're not responsible in any way or form if their software malfunctions in any way.
Like if I kill someone of course I go to jail. But if I get some people together, say we're a company, and then kill 100 people, nobody goes to jail. How does that work? What a huge loophole.
Phillips (the company) basically killed people with malfunctioning CPAP machines (which are meant to help against sleep apnea) and no one went to jail. So that's a practical example.
It's already the norm for devs not to be responsible for software malfunctions. Customers can choose to end their relationship with you, but they can't sue you for damages.
Yep, I've been involved in many vendor contracts at my company, and the contracts take weeks to months to finalize because every aspect of the agreement is up for discussion. Even things like SLAs (including how they're calculated), liability limitations, indemnity, and recourse in the event of system failure are all put through the wringer until both sides come to agreeable terms. This is true for big and tiny vendors.
This isn't a Github project with a MIT license. When you do B2B software, there aren't software licenses, there are contractual terms and conditions. The T&Cs outline any number of elements but including SLAs, financial penalties for contractual breaches, etc. Larger customers negotiate these T&Cs line by line. Smaller customers often accept the standard T&Cs.
Penalties, as far as I was involved in vendor discussions, are a part of the negotiation only when the software provider does any work on the client's premises and are liable to that extent.
For software, you don't pay penalties that it might malfunction once in a while, that's what bug-fixes are for and you get offered an SLA for that, but only for response time, not actual bug fixing. Where you do get penalties and maybe even your money back, is when the software is listed as being able to do X,Y,Z and it only does X and Z and the contract says it must do everything it said it does.
Well, probably no?
I've never seen liabilities in dollar value, or rather any significant value. Also, I saw our company's Crowdstrike contract for 10k+ seats; no liabilities there.
Sounds like people in some of these environments will be doing their level best to automate an appropriate fix.
Hopefully they have IPMI and remote booting of some form available for the majority of the affected boxes/VMs, as that could likely fix a large chunk of the problem.
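For anyone scripting that workaround against mounted-but-unbootable volumes (recovery environment, WinPE, or a disk attached to a healthy machine), the manual fix reduces to deleting the offending channel file. A hedged sketch; the C-00000291* pattern is the one from CrowdStrike's published guidance, but verify it against their advisory before automating anything like this:

    from pathlib import Path

    def remove_bad_channel_files(mounted_windows_root: str) -> list[str]:
        """Delete the faulty channel file(s) from an offline Windows volume.
        Filename pattern per CrowdStrike's published workaround - double-check it."""
        driver_dir = Path(mounted_windows_root) / "Windows" / "System32" / "drivers" / "CrowdStrike"
        removed = []
        for f in driver_dir.glob("C-00000291*.sys"):
            f.unlink()
            removed.append(str(f))
        return removed

    # e.g. remove_bad_channel_files("D:/") after mounting the broken machine's volume as D:

BitLocker is still the gating factor: none of this works until you can unlock the volume, which is why the recovery-key logistics dominate the timeline.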
Imagine if North Korea came out with a statement that they did it. It would spawn such an amount of work internally at CS to prove whether it was intentional or a simple mistake.
I work for government organization that is constantly audited and I've seen this play out over and over.
An important aspect I never see mentioned is most Cyber Security personnel don't have the technical experience to truly understand the systems they are assessing, they are, like you said, just pushing to check those compliance boxes.
I say this as someone who is currently in a Cyber Security role; unfortunately, as I'm coming to learn, cyber roles suck. But this isn't a jab at those Cyber Security personnel's intelligence. It's literally impossible to understand multiple systems at a deep level; it takes employees working on those systems weeks to months to understand this stuff, and that's with them being in the loop. Cyber is always on the outside looking in, trying like hell to piece it all together.
Sorry for the rant. I just wanted to add on with my personal opinions on the cyber security framework being severely broken because I deal with it on a daily basis.
> It's literally impossible to understand multiple systems at a deep level,
No, it's not. It takes above average intelligence, and major investment in actual education (not just "training"), and actual depth of experience, but it's not impossible.
Do you think it comes from a fundamental misconception of how these roles should be structured? My take is that you just can't fundamentally assess technical elements from the outside unless they have been designed that way in the first place (for assessability). For example, I educate my team that they have to structure their git commits in a way that demonstrates their safety for audit / compliance purposes (never ever combine a high-risk change with a low-risk one, for example). That should go all the way up the chain. Failure to produce an auditable output is failure to produce an output that can be deployed.
I know of an important company currently pushing to implement a redundant network data loss prevention solution, while they don't have persistent VPN enabled and multiple known misconfigurations of things that prevent web decryption working properly.
The flip side is: if you don't do auto updates, and an exploit is published and used against you before you have tested / pushed the patch that would have protected you, you are up the creek without a paddle in that situation as well.
To some degree you have to trust the software you are using not to mess things up.
So since I do mission critical healthcare I do run into this concept. But it's not as unresolvable as you portray. Consider for example HIPAA "break the glass" requirement. It says that whatever else you implement in terms of security you must implement a bypass that can be activated by routinely non-authorised staff to access health information if someone's life is in danger.
Similarly, when I questioned, "why can't users turn off ZScaler in an emergency" we were told that it wouldn't be compliant. But it's completely implementable at a technical level (Zscaler even supports this). You give users a code to use in an emergency and they can activate it and it will be logged and reviewed after use. But the org is too scared of compliance failure to let users do it.
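For what it's worth, the control being described is simple to build. This is not how ZScaler's own feature works, just the general shape of break-the-glass: the bypass always succeeds, but every use leaves an audit record that somebody is obliged to review afterwards.

    import secrets
    import time

    audit_log = []  # in practice an append-only store that actually gets reviewed

    def issue_bypass_code() -> str:
        """Pre-issue a one-time code that staff hold for emergencies."""
        return secrets.token_hex(8)

    def use_bypass(code: str, user: str, reason: str) -> bool:
        """Always allow the bypass, but record who, when and why for later review."""
        audit_log.append({"user": user, "code": code, "reason": reason, "time": time.time()})
        return True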
Well, if the vault says you have COPD, and the devious bank robber is interested in your continued breathing, perhaps we can just review the footage after the fact.
This is one of those cases where you don't disable emergency systems to defend against rogue employees. If people abuse emergency procedures, you let the legal system sort it out.
> It says that whatever else you implement in terms of security you must implement a bypass that can be activated by routinely non-authorised staff to access health information if someone's life is in danger.
Huh.
I can see why this needs to exist, but hadn't thought of it before. Same deal as cryptography and law-enforcement backdoors.
> logged and reviewed after use
I was going to ask how this has protection from mis-use.
Seems good to me… but then I don't, not really, not deeply, not properly, feel medical privacy. To me, violation of that privacy is clearly rude, but how the bar raises from "rude" to "illegal" is a perceptual gap where, although I see the importance to others, I don't really feel it myself.
So it seems good enough to me, but am I right or is this an imagination failure on my part? Is that actually good enough?
I don't think cryptography in general can use that, unfortunately. A simple review process can be too slow for the damage in other cases.
This is an oversimplification. IF we are talking about compliance with ISO 27001, you are supposed to do your own risk assessment and implement the necessary controls. The auditor will basically just check that you have done the risk assessment, and that you have implemented the controls you said yourself you need.
I'd say this has nothing to do with regulatory compliance at all. The real truth is that modern organizations are way too attached to cloud solutions. And this runs across all parts of the organization, with SaaS and PaaS, whether it's email (imagine Google Workspace having a major issue), AWS, Azure, Okta…
I've had the discussions so many times and the answer is always: the risks don't matter because the future is cloud, and even talking about self-hosting anything is naive, and honestly we need to evaluate your competence for even suggesting it.
(Also, the cloud would maybe not be this fragile if it wasn't for lock-in with different vendors. If you read the TOS, it basically says on all cloud services that you are responsible for the backup, but getting your data out of the service is still a pain in the ass, if it's possible at all.)
> The real truth is that modern organizations are way too attached to cloud solutions.
I'm confused. This is a security product for your local machine. Not the cloud.
Unless you call software auto-update "the cloud", but that's not what people usually mean. The cloud isn't about downloading files, it's about running programs and storage remotely.
I mean, if CrowdStrike were running entirely in the cloud, it seems like the problem would be vastly easier to catch immediately and fix. Cloud engineers can roll back software versions a lot more easily than millions of end users can figure out how to safe-boot and follow a bunch of instructions.
Well, there has usually been the option to run a local proxy/cache for your updates so that you can properly test them inside your own organization before rolling them out to all your clients (precisely to avoid this kind of shit show). But doing that requires an internal team running it and actually testing all updates. And modern organizations don't want an IT department; they want to be "cloud first". So they rely on services that promise they can solve everything for them (until they don't).
Cloud is not just about where things are – it's also about the idea that you can outsource every single piece of responsibility to an intangible vendor somewhere on the other side of the globe – or "in the cloud".
> Cloud is not just about where things are – it's about the idea that you can outsource every single piece of responsibility to a intangible vendor somewhere in the cloud.
I've never heard of a definition of cloud like that.
Cloud is entirely about where things are.
Outsourcing responsibility to a vendor is totally orthogonal to the idea of the cloud. You can outsource responsibility in the cloud or not. You can also outsource responsibility on local machines or not.
And outsourcing responsibility has existed since long before the concept of the cloud was invented.
The product affected here is literally called "CrowdStrike Falcon® Cloud Security". Meraki, although they sell routers and switches, market their products as a "cloud-based network platform". Jamf, although their product runs on endpoint devices, is marketed as "Jamf Cloud MDM". I think it's fair to say that "cloud" these days does not only mean storing data or running servers in the cloud, but also infrastructure that is in any way MANAGED in the cloud.
So, to tie back to what I wrote earlier: none of these services has to have the management part in the cloud. They could just give you a piece of software to run on your own server. That would certainly distribute the risk, since right now it only takes someone hacking the vendor to go after all their customers, or, as in this case, one faulty update to break things for every user. And as far as I can see, we are willing to take those risks because we think it's nice having someone else manage the infrastructure (and that was my main point in the first comment).
> My org which does mission critical healthcare just deployed ZScaler on every computer which is now in the critical path of every computer starting up
Hi fellow CVS employee. Are you enjoying your zscaler induced SSO outages every week that torpedo access to email and every internal application? Well now your VMs can bluescreen too. A few more vendor parasites and we'll be completely nonfunctional. Sit tight!
When we think "security" on HN we think about the people who escalate wiggling voltages at just the right time into a hypervisor shell on XBox, but I've had to recognize that my learned bias is not correct in the real world. In the real world, "computer security" is a profession full of hucksters that can't tell post-quantum from heap and whose daily work of telling people repeatedly to not click links in Outlook and filling out checklists made by people exactly like them has essentially no bearing on actual security of any sort.
It's driven by a lot of things. Part of it is driven by rising cyber liability insurance rates, for one. A lot of organizations would rather not pay for CrowdStrike, but the premiums for not having an "EDR/XDR/NGAV" solution can be astoundingly high at-scale.
Fundamentally there's a lot of factors in this ecosystem. It's really wild how incentives that seem unrelated end up with crazy "security" products or practices deployed.
> A lot of organizations would rather not pay for CrowdStrike, but the premiums for not having an "EDR/XDR/NGAV" solution can be astoundingly high at-scale.
Just like a lot of homeowners would rather not pay for ADT, but insurance requires a box-ticking “professionally-monitored fire alarm system.” Nevermind that I can dial 911 as well as the “professional” when I get the same notification as they do.
> In the real world, "computer security" is a profession full of hucksters
Always has been. The information security model is about analogizing digital systems as physical systems, and employing the analogues of those physical controls that date back hundreds of years on those digital systems. At no point, in my relatively long career, have I ever met anyone in Information Security who actually understands at depth anything about how to secure digital systems. I say this as someone who has spent a lot of my career trying to do information security correctly, but from the perspective of operations and software engineering, which is where it must start.
The entire information security model the world works with is tacking on security after the fact, thinking you need to build walls and a vault door to protect the room after the house has already been built, when in fact you need to build the house to be secure from the start, because attacks don't go through doors, attacks are airborne (I recognize the irony of my analogizing digital concepts to physical concepts surrounding security, but I do it because of any infosec people that may read my comment, so they can understand my point).
Because of this model, we have gone from buying "boxes" to buying "services", but it has never matured away from the box-checking exercise it's been since day one. In fact, many information security people have /no training or education/ in security, it's entirely in regulatory compliance.
I've met highly paid "security engineers" who talked about not really being into programming, or being okay with Python but finding everything else too complicated.
It shocks me that such a low level of technical competence is required.
> So CrowdStrike is deployed as third party software into the critical path of mission critical systems and then left to update itself.
TIL that the US government has pressured foreign nations to install a mystery blob in the kernel of machines that run critical software "for compliance".
If this wasn't a providential goof on the part of Crowdstrike -- the entire planet is now aware of this little known fact -- then some helpful soul in Crowdstrike has given us a heads-up.
Don't put all your eggs in one basket: I use multiple anti-virus products so that if one blows up, at least not all computers are affected. Looks like my old wisdom is still new wisdom.
Clarification: I mean that every computer has one anti-virus product, but not every computer has the same anti-virus product. I'm not installing multiple anti-virus products on the same computer.
You use multiple anti-virus products. Let's assume you use 3. Do you have multiple clusters of machines, each running their own AV product, so in case one has this problem the other two are unaffected?
How much overhead are we talking about here? Because if you're just using multiple AV software installed on one machine, 1) holy shit, the performance penalty, 2) you'd still be impacted by this, as CS would have taken it down.
They surely mean that all odd-numbered assets are running CrowdStrike and even-numbered ones are running SentinelOne (or similar; %3, %4, etc.). At least then you only lose half your estate.
I have never seen a company that uses multiple AV products rolled out to user machines, ever. Sure, when you transition from one product to another, but across the whole company, at the same time? Never... I have also never seen a distribution of something like Active Directory servers based on antivirus software. I think these stories are purely academic, "why didn't you just..." tall tales.
Mine certainly does: our key Windows-based control systems use Windows Defender; the corporate crap gets SentinelOne and Zscaler and whatever else has been bought on a whim.
I'd assumed that any essential company would be similar. OK, if the purchasing systems for your hospital are down for a couple of days, it's a pain. If you can't get X-rays, it's a catastrophe.
If half your x-ray machines are down and half are up, then it's a pain, but you can prioritise.
But lots of companies like a single supplier. Ho hum.
Not the person you're replying to, but in any reasonable organization with automated software deployment it should be easy to pool machines into groups, so you can make sure that each department has at least one machine that uses a different anti-virus software.
Bonus, in case you do catch a malware, chances are higher that one of the three products you use will flag it.
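A toy version of that split, assuming you want the assignment to be deterministic per host rather than random (the product names are just examples):

    import hashlib

    AV_PRODUCTS = ["CrowdStrike", "SentinelOne", "Defender"]  # illustrative choices

    def assign_av(hostname: str) -> str:
        """The same host always maps to the same product, and a large fleet ends up
        roughly evenly split across vendors, so no single bad update takes everything out."""
        digest = hashlib.sha256(hostname.encode()).digest()
        return AV_PRODUCTS[digest[0] % len(AV_PRODUCTS)]

You'd refine this so each department or site is guaranteed a mix, but the deterministic-spread idea is the core of it.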
So you have multiple AV products and you target those groups. You have those groups isolated on their own networks, right? With all the overhead that comes with strict firewall rules and transmission policies between the various services on each one. With redundant services on each network... you've doubled or tripled your network device costs solely to isolate for antivirus software. So if only one thing finds the zero-day, network-based virus, it won't propagate to the other networks that haven't been patched against it.
How far down the rabbit hole do we want to go? If you assume many companies are doing this kind of thing, or even a double digit percentage of companies, I have bad news for you.
But, cost of maintenance aside, it wouldn't be that bad to deploy each half of the fleet with a distinct EDR.
This is actually implicitly in place for big companies that support BYOD. If half your fleet is on Windows, another 40% on macOS and 10% on Linux, you need distinct EDR solutions, and a single issue can't affect your whole fleet at once.
I know a few people who have Zscaler deployed at work. It will routinely kick them off the internet, like multiple times a day. It has gotten to the point where they can sort of tell in advance that it's about to happen.
The theory so far is that it's related to their activities: working in DevOps, they will sometimes generate "suspicious" traffic patterns which then trigger some policy in Zscaler, but they're not actually sure.
ZScaler itself uses port 443 UDP, but blocks QUIC. The last time I checked it didn't support IPv6 so they told customers to disable IPv6. Security software is legacy software out of the box and cuts the performance of computers in half.
> more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.
Duh, else there would be no need to audit them to force compliance, they'd just do it by themselves. The only reason it needs forcing is that they otherwise aren't motivated enough.
> Good point. But the audit seems useless now. It's supposed to prevent the carelessness from causing... this thing that happened anyway.
> Sure, maybe it prevented even more events like this from happening. But still.
Because the point of an audit is not to prevent hacks; it's to prove that you did your due diligence to not get hacked, so the fact that a hack happened is not your fault.
You can hide under umbrella of "sometimes hacks happen no matter what you do".
CYA is the reason you do the audit. But the reason for the audit's existence and requirement is definitely so that hacks don't happen. Don't tell me regulatory agencies require things so that companies can hide behind them.
Who is them though? The airport that used this software? You can't put all the blame on the software vendor. It can be a good and useful component when not relied on exclusively for the functioning of the airport. Not relying on a single point of failure should be the responsibility of the business customer who knows the business context and requirements.
You will have each company person pointing at the others. That's why you have contracts in place.
You won't ever have real consequences for executives and real decision makers and stakeholders because the same kind of people make the laws. They are friends, revolving door etc.
There's no responsibility at any level, is the thing. Those people who couldn't fly might get a rebooking and some vouchers sent out to them, but they won't really get made whole. The airport knows they won't really be on the hook, so they don't demand real responsibility from their vendors, and so on.
In the grand scheme of things, being able to fly around the globe at these prices is a pretty good deal, even with these rare events taken into account. It's not like the planes fell out of the sky. If you must must definitely be somewhere at a time, plan to arrive one or two days earlier.
I don't even want to know how many mission critical systems automatically deploy open source software downloaded from github or (effectively random) public repositories.
Unlike on Windows, there is at least the option to use curated software distributions such as Debian or RH that won't apply random stuff from upstream repositories.
If I were running an organization that needs these audits, I'd always have fallback procedures in place that would keep everything running even if all computers suddenly stop working, like they did today. General-purpose software is too fragile to be fully relied upon, IMO.
If a general-purpose computer must be used for something mission-critical, it should not have an internet connection and it should definitely not allow an outside organization to remotely push arbitrary kernel-mode code to it. It should probably also boot from a read-only OS image so that it could always be restored to a known-good state by just rebooting.
Organizations don't want to increase risk by listening to an employee with their personal opinion. Orgs want an outside vendor who they can point at and say "it's their fault", and await a solution. Employees going rogue and not following the vendor defined SW updates is a much higher risk than this particular crisis.
Isn't there a way to schedule the updates? With Windows updates, when I used to work at a firm with a critical system running on Windows, we had main and DR servers, and the updates were scheduled to first roll out on the main server and, I think, a day later on the DR, which saved us at least once in the past from a bad Windows update...
More or less. You can set up update policies and apply those to subsets of your machines. You can disable updates during time blocks, or block them altogether. There's also the option of automatically installing the "n-1" update.
We run auto n-1 at work, but this also happened at the same time on my test machine which runs "auto n". It never happened before, so this looks like something different from the actual installed sensor version, especially since the latest version was released something like a week ago.
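For readers unfamiliar with the policy jargon, "n-1"/"n-2" just means pinning the fleet one or two sensor releases behind the latest. A hedged sketch of the idea, with made-up version strings (and, per the comment above, the catch is that channel/content files apparently bypass this entirely):

    def pick_sensor_version(releases: list[str], policy: str) -> str:
        """releases is newest-first, e.g. ["7.16", "7.15", "7.14"] (illustrative values).
        'auto' -> latest, 'n-1' -> one release behind, 'n-2' -> two behind."""
        offset = {"auto": 0, "n-1": 1, "n-2": 2}[policy]
        return releases[min(offset, len(releases) - 1)]

    print(pick_sensor_version(["7.16", "7.15", "7.14"], "n-1"))  # 7.15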
It's a big stretch to call this the regulator's fault when it's a basic lack of testing by Microsoft and/or Crowdstrike. If a car manufacturer made safety belts that broke, you don't blame the regulators.
The root cause is automatic, mindless software update without proper testing - nothing to do with regulators.
That's some very twisted logic. If I expect someone to clean the kitchen as part of a restaurant closing checklist, and they fuck it all up, would I blame the checklist, or the person doing the work?
You blame the person fucking it up. In this case, it's someone who only cares about checking a box. Or someone who pushes broken shit.
If this person simultaneously fucks up millions of kitchens around the world, you do not blame that person. You blame the checklist which encouraged giving a single person global interlocked control over millions of kitchens, without any compartmentalization.
> If this person simultaneously fucks up millions of kitchens around the world, you do not blame that person.
No, you definitely do, even more than before. Let's say for example that the requirement is to disinfect anything that touches food. And the bleach supplier fucks it all up. You blame the bleach supplier. You don't throw out the disinfectant requirement.
Most enterprises will have teams of risk and security people. They will be asking who authorized deployment of an untested update into production. If CrowdStrike deployments cannot be managed, then they will switch to a product which can be managed.
Well, if you fail at compliance, you can be fired and sometimes even sued. If your compliance efforts cause a system-wide outage, nobody's to blame; shit happens. I predict this screwup will end up with zero consequences for anyone who took the decisions that led to it, too. So how else do you expect this system to evolve, given this incentive structure?
> Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.
If a failed audit is the big scary monster in their closet, then it sounds like the senior leadership is not intimately familiar with the technology and software in general, and is unable to properly weigh the risks of their decisions.
More and more companies are becoming software companies whether they like it or not. The software is essential to the product. And just like you would never want a non-lawyer running your law firm, you don't want a non-software person running your software company.
Very sharp and to the point, this comment. I would like to add that in large companies the audit will, in my experience, very often examine documents only -- not actual configuration or code.
This is all well deserved for executives who trust MS to run their businesses. If you have the resources, like a bank, it is a crime to put your company in the hands of MS.
It's possible that CrowdStrike heavily incentivises being left to update itself.
Removing the features that would allow sysadmins to actually do it themselves, even via the installer, would definitely be one way, but another could be aggressive focus-stealing nags (similar to Windows' own nags), which in a server environment can actually cause some major issues, especially when automating processes in Windows (as you need to close the program when updating).
I think it's easy to blame the sysadmins, but I would also be remiss if I didn't point out that in the Windows world we have been slowly accepting these automatic dark patterns and alternative (more controlled) mechanisms have been removed over time.
I almost don't recognise the deployment environment today as to what it was in 2004; and yes, 20 years is a long time, but the total loss of control over what a computer is doing is only going to make issues like this significantly more common.
They say it was caused by a faulty channel file. I don't know what a channel file is, and they claim not to rely on virus signatures, but typically an antivirus product needs the latest signatures all the time and polls for them probably once an hour or so. So I'm not surprised that an antivirus product wants to stay hyper-updated and that updates are rolled out immediately to everyone globally.
No, I'm not surprised either. But if you're operating at this kind of scale and with this level of immediate roll-out, what I would expect are:
* A staggered process for the roll-out, so that machines that are updated check-in with some metrics that say "this new version is OK" (aka "canary deployment") and that the update is paused/rolled back if not.
* Basic smoke testing of the files before they're pushed to any customers
* Validation that the file is OK before accepting an update (via a checksum or whatever, matched against the "this update works" automated test checksums)
* Fuzz tests that broken files don't brick the machine
Literally any of the above would have saved millions and millions of dollars today.
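For the first bullet in particular, the control loop is not exotic. A minimal sketch of a staggered roll-out with a halt condition; the batch size, soak time, and health check here are stand-ins for whatever telemetry the vendor actually has:

    import time

    def staged_rollout(hosts, push_update, healthy, batch_fraction=0.01, soak_seconds=3600):
        """Push to a small canary batch first, wait, and only continue while the
        already-updated machines keep reporting healthy. Halt at the first sign of trouble."""
        batch_size = max(1, int(len(hosts) * batch_fraction))
        updated = []
        for i in range(0, len(hosts), batch_size):
            batch = hosts[i:i + batch_size]
            for h in batch:
                push_update(h)
            updated.extend(batch)
            time.sleep(soak_seconds)  # let the canaries check in before widening the blast radius
            if not all(healthy(h) for h in updated):
                raise RuntimeError("canaries unhealthy - rollout halted")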
In any kind of serious environment the admin should not have any interaction with any system's screen when performing any kind of configuration change. If it can't be applied in a GPO without any interaction it has no business being in a datacenter.
There are situations where you will interact with the desktop, if only for debugging; saying anything else is hopelessly naive. 1) For example: how do you know if your program didn't start due to missing DLL dependencies? There is no automated way: you must check the desktop, because Windows itself only shows a popup.
2) What displays on the screen is absolutely material to the functioning of the operating system.
The Windows shell (UI) is intrinsically intertwined with the NT kernel. There have been attempts to create headless systems with it (Windows Core, etc.); however, in those circumstances, if there is a popup, that UI prompt can crash the process, because it does not have the dependencies to show the pop-up.
If you're in a situation where you're running Windows Core, and a program crashes if auto-updates are not enabled... well, you're more likely than not to enable updates to avoid the crash; after all, what's the harm.
Otherwise, you will be aware that when a program has a UI (the Windows console), the execution speed of the process is linked to the draw rate of the screen, so having a faster draw rate or fewer things on screen can actually affect performance.
Those who write Linux programs are aware that this is also true for Linux (writes to STDOUT are blocking); however, you can't put I/O on another thread in the same way on Windows.
Anyway, all this to say: it's clear you've never worked in a serious Windows environment. I've deployed many thousands of bare-metal Windows machines across the world, and of course it was automated, from PXE/BIOS to application serving on the internet, the whole 9 yards, but believing that the UI has no effect on the effectiveness of administration is just absurd.
> So we need to hold regulatory bodies accountable as well...
My bank, my insurer, my payment card processor, my accounting auditor and probably others may all insist I have anti-virus and insist that it is up to date. That is why we have to have these systems. However, I used to prefer systems that allowed me to control the update cycle and push it to smaller groups.
> So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.
Replacing common-law liability with prescriptive regulation is one of the main drivers of this problem today. Instead of holding people accountable for the actual consequences of their decisions, we increasingly attempt to preempt their decisions, which is the very thing that incentivizes cargo-cult "checkbox compliance".
It motivates people who otherwise have skin in the game and immediate situational awareness to outsource their responsibility to systems of generalized rules, which by definition are incapable of dealing effectively with outliers.
No doubt there will be another piece of software mandated to check up on the compliance software. When that causes a global IT outage, software that checks up on the software that checks up on the compliance software will be mandated.
When Crowdstrike messes up and BSODs thousands of machines, they have a dedicated team of engineers working the problem and can deliver a solution.
When your company gets owned because you didn't check a compliance checkbox, it's on you to fix it (and you may not even currently have the talent to do so).
We see similar risk tradeoffs in cloud computing in general; yes, hosting your stuff on AWS leaves you vulnerable to AWS outages, but it's not like outages don't happen if you run your own iron. You're just going to have to dispatch someone a three hour drive away to the datacenter to fix it when they do.
CrowdStrike has various auto update policies, including not to automatically update to the latest version, but to the latest version -1 or even -2. Customers with those two policies are also impacted.
> Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.
I've been one of the people in those audit meetings, defending decisions made and defending things based on the records we keep, and I understand this, because it is both a deeply unpleasant and expensive affair to pull people from current projects and place them before auditors for several hours to debate what compliance actually means.
It's even worse. The consultants who run the audits (usually recent business school grads) work with other consultants who shill the third-party software and the implementation work.
So true! It seems like all of these were invented to create another market for b2b saas security, audit, monitoring, etc. companies. Nobody cares about actual security or infrastructure anymore. Everything is just buying some subscription for random saas companies, not checking their permissions and grant policies and ticking boxes because... compliance.
It depends on what your position is. Are you there to actually provide security to your org, or to tick a box in an audit? If both, which is more important? Because failing an audit has real consequences, while having breaches in security has almost none. Just look at the credit score companies.
Regulation or auditors rarely require specific solutions. It's the companies themselves that choose to achieve the goals by applying security like tinctures: "security solutions". The issue is that the tinctures are an approved remedy.
Zscaler is such insane garbage. Legitimately one of the worst pieces of software I have ever used. If your organization is structurally competent, it will never use Zscaler and will just use wireguard or something.
It's VERY easy to blame CrowdStrike and companies like them, as they are the ones LOBBYING for those checkboxes. Both Zscaler and Crowdstrike spent 500K last year lobbying.
There's a reasonable number of circumstances where cybersecurity standards get imposed on organisations: by insurance, from a customer, or from the government (especially if they are a customer). These standards are usually fairly reasonably written, but they are also necessarily vague and say stuff like "have a risk assessment" and "take industry-standard precautions". This vagueness can create a kind of escalation ratchet: when the people tasked with (or responsible for) compliance are risk-averse and/or lazy, they will essentially just try to find as many precautions as they can and throw them all in blanket-style, because it's the easiest and safest way to say that you're in compliance. This is especially true when you can more or less just buy one or two products which promise to basically tick every possible box. And if something else pops up as a suggestion, they'll throw that in too. Which then becomes the new 'industry standard', and it becomes harder to justify not doing it, and so on.
It's easy to blame CrowdStrike because they're the ones to blame here. They lit a billion system32 folders on fire with an untested product and push out fear-mongering, corny marketing material. Turns out you should be afraid.
> All over the place I'm seeing checkbox compliance being prioritized above actual real risks from how the compliance is implemented.
Because if everyone is doing their job and checks their box, they're not gonna get fired. Might be out of a job because the company goes under, but hey, it was no one's fault, they just did their job.
SMB here. Just spent a nine hour day fixing this. We had two machines that after a couple of reboots just came back up fine.
We were trialing CrowdStrike and about to purchase next week. If their rep doesn't offer us at least half off, we are going with Sentinel One which was half the price of CS already.
The incompetence that allowed this is baffling to me. I assumed with their billions of dollars they'd have tiers of virtual systems to test updates with.
I remember this happening once with Sophos, where it gobbled up Windows system files. If you had it set to Delete instead of Quarantine, you were toast.
> We were trialing CrowdStrike and about to purchase next week. If their rep doesn't offer us at least half off, we are going with Sentinel One which was half the price of CS already.
Crowdstrike marketing slogan on their website: "A radical new approach proven to stop breaches". I'll give them that: Putting all Windows computers within a company into an endless BSOD loop is a very radical approach to stop breaches. :)
The Windows ecosystem typically deployed in corporate PCs or workstations is often insecure, slow, and poorly implemented, resulting in ongoing issues visible to everyone. Examples include problems with malware, ransomware, and Windows botnets.
In corporate environments, IT staff struggle to contain these issues using antivirus software, firewalls, and proxies. These security measures often slow down PCs significantly, even on recent multi-core systems that should be responsive.
Microsoft is responsible for providing an operating system that is inherently insecure and vulnerable. They have prioritized user lock-in, dark patterns, and ease of use over security.
Apple has done a much better job with macOS in terms of security and performance.
The corporate world is now divided into two categories:
1. Software-savvy companies that run on Linux or BSD variants, occasionally providing macOS to their employees. These include companies like Google, Amazon, Netflix, and many others.
2. Companies that are not software-focused, as it's not their primary business. These organizations are left with Microsoft's offerings, paying for licenses and dealing with slow and insecure software.
The main advantage of Microsoft's products is the Office suite: Excel, Word and PowerPoint. But even Word is actually mediocre.
I think you represent the schism in your own post. Retail is hyper focused on the name Microsoft and Windows. But the enterprise and technical people are focused on rolling back a bad CrowdStrike bad update. They will spend hours and even days focusing on doing that, asking why they were vulnerable to such an update and what they should have done to avert being vulnerable to a bad update.
And for them it will be a bit of a stretch to say Microsoft should have stopped us deploying CrowdStrike. I’m sure Microsoft would love to do just that and sell its own Microsoft Solution.
> it will be a bit of a stretch to say Microsoft should have stopped us deploying CrowdStrike
I read GP's post to mean that if you take a step back, Windows' history of (in)security is what has led us to an environment where CrowdStrike is used / needed.
I can answer this. For the same reason I have run ClamAV on Linux development workstations. Because without it, we cannot attest that we have satisfied all requirements of the contract from the client's security organization.
Also if you are a small business and are required to have cybersecurity liability insurance, the underwriter will require such sensors to be in place or you will get no policy.
For the same reasons there's antivirus software for Mac and Linux.
People coming from Microsoft systems just expect it to be required, so there's demand for it (demand != need). And in hybrid environments it may remove a weak link: e.g. a Linux mailserver that serves mail to Windows users best has virus detection for windows viruses.
I’m not defending CrowdStrike here. This is a clearly egregious lack of test coverage, but CrowdStrike isn’t “just” antivirus. The Falcon Sensor does very useful things beyond that, like USB device control, firewall configuration, reporting, etc.
If your use case has a lesser need for antimalware you might still deploy CrowdStrike to achieve those ends. Which help to lessen reliance on antimalware as a singular defense (which of course it shouldn’t be).
It's not just those darn Windows admins. A lot of the certifications customers care about (SOC 2, ISO whatever, FedRAMP) have line items that require it.
I've had to install server antivirus onto my Linux laptop at 4 different companies. Every time it's been a pain in the ass, because the only antivirus solutions I've found for Linux assume that "this must be a file server used by Windows clients". None of them are actually useful, so I've installed them and disabled them. There, box-checking exercise done.
> For the same reasons there's antivirus software for Mac and Linux.
Because they can also get malware or could use the extra control CS provides, and the "I'm not a significant target so I'm safe" is not really a solid defense? Bad quality protection (as exemplified by the present CS issues) isn't a justification for no protection at all.
Would you ignore the principle of least privilege (least user access) and walk around with all the keys to the kingdom just because you're savvier than most at detecting an attack and anyway you're only one person, what are the chances you're targeted? You're the Linux/MacOS of the user world, and "everyone knows those principles are only for the Windows equivalent of users".
I'm not arguing that Linux or Mac need no protection.
There are serious threats to any Linux machine. And if you include Android, there are probably far more Linux machines out there. Hell, including her navigation, router, NAS, TV, and car, my 70+ yo mom runs at least 5 Linux machines at her home. It's a significant target. And Mac is quite obviously a neat target, if only because the demographic usually has higher income (hardly any Bangladeshi sweatshop worker will put down the cash to buy a MacBook or iPhone, but they might well own an Android or a Windows laptop).
I'm arguing that viruses aren't a threat, generally. Partly due to the architecture, partly due to their usage.
Neither Linux nor OSX are immune to viruses, though malware is more commonly written to target Windows given its position in the market. Both iOS and Android are frequent malware targets despite neither being related to Windows, and consequently, both have antivirus capabilities integrated deeply into both the OS and the app delivery ecosystem.
Any OS deployed on a user device needs some form of malware protection unless the device is blocked from doing anything interesting. You can generally forgo anti-malware on servers that are doing one thing that requires a smaller set of permissions (e.g., serving a website), but that's not because of the OS they are running.
Sure, “AVG Mobile Security” is available, but nobody needs it, and it isn’t anything like antivirus software on a computer. It provides... a photo vault, a VPN, and “identity protection.”
To tell people that they are vulnerable without something like this on their iPhone is ludicrous.
Nobody needs antivirus software or malware protection like this on their iPhone, unless they like just giving money away.
If you'll scroll up to the comment you originally replied to, you'll see that I said Android and iOS have AV capabilities built into the OS and app delivery ecosystem. That's more than enough for most users: mobile OSes have something much closer to a capability-based security paradigm than desktop OSes, and both Apple and Google are pretty quick to nerf app behavior that subverts user expectations via system updates (unless it was done by the platform to support ad sales).
Your mobile device is a Turing machine, and as such it is vulnerable to malware. However, the built-in protections are probably sufficient unless you have a specific reason to believe they are not.
The only AV software for mobile devices that I have seen used is bundled with corporate "endpoint management" features like a VPN, patch and policy management, and remote wipe support. It's for enterprise customers that provision phones for their employees.
> You can generally forgo anti-malware on servers that are doing one thing that requires a smaller set of permissions (e.g., serving a website), but that's not because of the OS they are running.
It seems to me like you’re trying to have it both ways.
It really is because of the OS that one doesn’t need to run anti-malware software on those servers and also on the iPhone, which you seem to have admitted.
It seems like we're both trying to make a distinction that the other person thinks is unimportant. But if the crucial marker for you is whether anti-malware protection is built into the OS, then I've got great news for you: Windows has built-in AV, too, and it's more than enough for most users.
The distinction I was trying to make is that the anti-malware strategy used by servers (restrict what the user can do, use formal change control processes, monitor performance trends and compare resource utilization against a baseline and expectations inferred from incoming work metrics) is different from the anti malware strategy used by "endpoints" (scanning binaries and running processes for suspicious patterns).
I'd say very special people need malware protection like this on their iPhone.
Remember NSO Group? Or the campaign Kaspersky exposed last year? Apple successfully made malware on iOS very rare unless you are targeted. But right now, it is impossible for these targeted people to get any kind of protection. Even forensics after being compromised is extremely difficult thanks to Apple's walled garden approach.
The usefulness of a theoretical app that might be able to stop high-power exploits isn’t being debated. The claim I’m objecting to is that everybody should be running (available) antivirus software on their phone.
But if you mean that these highly targeted people would have been helped by running “AVG Mobile Security” or one of the other available so-called “antivirus” apps, then I’ve got an enterprise security contract to sell you. :)
> The claim I’m objecting to is that everybody should be running (available) antivirus software on their phone.
You're objecting to the (much more specific) claim that everybody should be running 3P antivirus software on their phone. Nobody made this claim. You are already running AV software on your phone, and whatever is built into the platform is more than sufficient for most users.
I spent some time on the STIG website out of curiosity. There seem to be down-to-earth practical requirements, but only for Windows, cf. https://public.cyber.mil/stigs/gpo/
Why it justifies running antivirus on Linux is beyond my understanding.
Weak, impotent, speechless IT personnel that cannot stand up to incompetence?
Windows IT admins who don’t use or understand Linux/Mac. Who also buy at the enterprise level. And who probably have to install (perhaps unnecessary) endpoint protection to satisfy compliance checklists.
The amount of Windows-centric IT that gets pushed to Linux/Mac is crazy. I've been in meetings where using Windows-based file storage was discussed as a possibility for an HPC compute cluster (Linux). And they were being serious. This was in theory so that central IT could manage backups.
To make money? Just because CrowdStrike is available for Linux and Mac doesn't mean that a) people buy and use it in substantial numbers b) people need to buy it. It would be interesting to hear from someone using CrowdStrike in a Linux/Mac environment.
We run Crowdstrike on Linux and Macs so that we can tick some compliance checkbox.
Fun fact: they've recommended we don't install the latest kernel updates, since they usually lag a bit with support. We're running Ubuntu LTS, not some bleeding-edge Arch. It now supports eBPF, so it's somewhat better.
The policies are written by folks who have no understanding of different operating environments. The requirement "All servers and workstations must have EDR software installed" leads to top-level execs doing a deal with Crowdstrike because they "support" Linux, Mac, and Windows. So then every host must have their malware installed to check the box. Doesn't matter if it's useful or not.
Indeed and insurance too. For our business, our professional errors and omissions coverage for years had the ability to cover cyber issues. No more. That requires cybersecurity insurance and the underwriters will not entertain underwriting a policy unless EDR is in place. They don't care if you are running OpenBSD and are an expert in cybersecurity who testifies in court cases or none of that. EDR from our list or no insurance.
For macOS? Because without it you don't have certain monitoring and compliance capabilities that are standard built-ins in windows, plus for windows/linux/mac the monitoring capabilities are all useful and help detect unwanted operation.
> I read GP's post to mean that if you take a step back, Windows' history of (in)security is what has led us to an environment where CrowdStrike is used / needed.
Windows does have a history of insecurity, but it is no different from any other software in this regard. The environment would be the same in the absence of Windows.
Attacks are developed for Windows because attacks against Windows are more valuable -- they have a large number of potential targets -- not because they're easier to develop.
In the case of a bad Linux kernel update I would just reboot and pick the previous kernel from the boot menu. By default most Linux distributions keep the last 3. I'm not an IPMI remote management expert but it may be possible to script this.
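It is scriptable to a point. A rough sketch of what that could look like, assuming a BMC reachable over the network, ipmitool installed, and placeholder address/credentials (picking the older GRUB entry itself stays interactive over the serial console):

    import subprocess

    # Placeholder BMC address and credentials -- substitute real values.
    IPMI = ["ipmitool", "-I", "lanplus", "-H", "10.0.0.42", "-U", "admin", "-P", "secret"]

    # Force a power cycle on the machine stuck on the bad kernel.
    subprocess.run(IPMI + ["chassis", "power", "cycle"], check=True)

    # Attach to Serial-over-LAN so the boot menu is visible remotely and an
    # older kernel entry can be selected by hand.
    subprocess.run(IPMI + ["sol", "activate"], check=True)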
All my machines at home run Linux except for my work laptop. It is stuck in this infinite blue screen reboot loop. Because we use Bitlocker I can't even get it into safe mode or whatever to delete the bad file. I think IT will have to manually go around to literally 8,000 work laptops and fix them individually.
You would "just pick the previous kernel from the boot menu". That's funny, cause in this case you could "just delete the file causing the issue." Anything can sound easy and simple if you state it that way.
How do you access the boot menu for a server running in the cloud, which you normally just SSH into (RDP in Windows' case)?
About your last paragraph: we have just started sending out the bitlocker keys to everyone so it can be done by them too. Surely not best practice, but it beats everyone having to line up at the helpdesk.
> You would "just pick the previous kernel from the boot menu". That's funny, cause in this case you could "just delete the file causing the issue." Anything can sound easy and simple if you state it that way.
One small difference is that choosing the kernel from the boot menu is done before unlocking the encrypted drive, so no recovery keys would be necessary. And yes, choosing an entry from a menu (which automatically appears when the last boot has failed) is simpler than entering recovery mode and typing a command, even without disk encryption.
A better analogue would be a bad update on a non-kernel package which is critical to the boot sequence, for instance systemd or glibc. Unless it's one of the distributions which snapshot the whole root filesystem before doing a package update.
NixOS boots to a menu of system configuration revisions to choose from, which includes any config change, not just kernel updates.
It's not filesystem snapshots either. It keeps track of input parameters and then "rebuilds" the system to achieve the desired state. It sounds like it would be slow, but you've still got those build outputs cached from the first time, so it's quite snappy.
If you took a bad update, and then boot to a previous revision, the bad update is still in the cache, but it's not pointed to by anything. Admittedly it takes some discipline to maintain that determinism, but it's discipline that pays off.
I don't expect to use it much myself but I love the idea of reducing the OS to an interchangeable part. What matters is the software and its configuration. If windows won't boot for some reason, boot to the exact same environment but on a different OS, and get on with your day.
If something is broken about your environment, fix it in the code that generates that environment--not by booting into safe mode and deleting some file. Tamper with the cause, not with the effect. Cattle, not pets, etc.
This sort of thing is only possible with nix (and maybe a few others) because elsewhere "the exact same environment" is insufficiently defined, there's just not enough information to generate it in an OS-agnostic way.
I can't delete a file if the machine doesn't finish booting. Unless you are suggesting removing the drive and putting it in another machine. That requires a screwdriver and 5 minutes vs. the 10 seconds to reboot and pick a different kernel.
I'm not talking about the cloud. I am talking about the physical machines sitting in front of me specifically my work laptop.
I am an integrated circuit computer chip designer, not a data center IT person. I have seen IPMI on the servers in our office. Do cloud data centers have this available to people?
I have a cheap cloud VM that I pay $3.50 a month. I normally just SSH in but if I want to install a new operating system or SSH is not responding then I log in to the web site and get a management console. I can get a terminal window and login, I can force a reboot, or I can upload an ISO image of another operating system and select that as the boot device for the next reboot and install that.
Does your cloud service not have something like this?
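Most providers expose those same console actions over an API as well, so the "force reboot" part can be automated across a fleet. A minimal sketch against GCP's Compute API, purely as an illustration (the project/zone/instance names are placeholders; other clouds have equivalents):

    from googleapiclient import discovery

    # Uses application-default credentials; the names below are placeholders.
    compute = discovery.build("compute", "v1")

    # Hard-reset a VM that is no longer reachable over SSH.
    compute.instances().reset(
        project="my-project", zone="us-central1-a", instance="wedged-vm"
    ).execute()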
I don't know what our corporate IT dept wants to do. We all work from home on Friday and I can't login to check email so I'll just wait until Monday as there is nothing urgent today anyway.
The OS drive is encrypted with Bitlocker. I've seen another thread where corporate IT departments were giving out the recovery key to users. I don't need to get anything done today. I'll go into the office on Monday and see what they say.
Idk if this is a serious question, but you just turn on console access in the cloud provider. It’s super easy. Same concept as VMWare. It’s possible that not all cloud providers do that, I suppose.
MacOS has been phasing out support for third-party kernel extensions and CrowdStrike doesn't use a kernel extension there according to some other posts.
I’m convinced that one reason for this move by Apple was poor quality kernel extensions written by enterprise security companies. I had our enterprise virus/firewall program crash my Mac all the time. I eventually had to switch to a different computer (Linux) for that work.
It wasn't Crowdstrike, but quality kernel-level engineering isn't what I think of when I think of security IT companies.
But, also credit Apple here. They’ve made it possible for these programs to still run and do their jobs without needing to run in kernel mode and be susceptible to crashes.
Not only security software, but really any 3rd party drivers have caused issues on Windows for years. Building better interfaces less likely to crash the kernel was a smart move
When I started doing driver development on MacOS X in the early 2000s, there were a number of questions on the kernel/driver dev mailing lists for darwin from AV vendors implementing kernel extensions. Most of them were embarrassing questions like "Our kernel extension calls out to our user level application, and sometimes the system deadlocks" that made me resolve to never run 3rd party AV on any system.
Whether you like macOS or not, they definitely are innovating in this space. They (afaik) are the only OS with more granular data access for permissions as well (no unfettered filesystem access by default, for instance)
It's also a shame CrowdStrike doesn't take kernel reliability seriously
The user can change anything they want, but a process launched by your user doesn't inherit every user access by default. You (the user) can give a process full disk access, or just access to your documents, or just access to your contacts, etc. It's maximizing user control, not minimizing it.
You say they're planning to add a feature in the next release, but what you linked to is merely an uncompleted to-do item for creating a UI switch to toggle a feature that hasn't been written yet. I think you win the prize for the most ridiculous exaggeration in this thread. Unless you can link to something that actually comes anywhere close to supporting your claim, you're just recklessly lying.
The linked Issue #8553 is "just" about creating a toggle for GPU acceleration. It's blocked by Issue #8552 [0], which is the actual Issue about the acceleration and originally belonged to Milestone "Release 4.3". It seems to have been removed later, which I didn't expect or know about. Accusation of lying was completely unnecessary in your comment.
Moreover, the Milestone was removed not because they changed their mind about the Release but for other reasons [1].
Ok, so your [0] shows that the real work has barely been started. The only indication it was ever planned for the next release was a misunderstanding on your part about the meaning of a tag that was applied to the issue for less than one day last fall, and they've stopped tagging issues with milestones to prevent such misunderstandings in the future. It still looks to me like your exaggerated claim was grounded in little more than wishful thinking.
Am I missing something? This is to add a toggle button and the developers say they are blocked because GPU acceleration feature doesn't exist so the button wouldn't be able to do anything.
The issue with Crowdstrike on Linux did not cause widespread failures, so it's clear that the majority of enterprises that do run their servers on Linux were not affected. They were invulnerable because they do not need Crowdstrike or similar.
Linux (or BSD) servers do not usually require third party kernel modules. Linux desktops might have the odd video driver or similar.
>Apple has done a much better job with macOS in terms of security and performance.
I really like their corporate IT products that are going to push MS out as you say. I particularly love iActive Directory, iExchange, iSQLserver, iDynamics ERP, iTeams. Apple's office products are the reason no one uses Excel any more. Their integration with their corporate cloud, iAzure, is amazing. I love their server products in particular; it being so easy to spin up an iOS server and have DFS file sharing, DNS, etc. is great. MS must be quaking in their shoes.
All of those are products that create huge risks when deployed to mission-critical environments, and this is exactly the problem.
The entire Wintel ecosystem depends on people putting their heads in the sand, repeating "nobody ever got fired for buying Microsoft/Crowdstrike/IBM", and neglecting to run even the most trivial simulation of what happens when the well-understood design flaws of those platforms get triggered because a QA department you have no control over drops the ball.
The problem is that as long as nobody dares to recognize that the current monoculture around the "market leading providers" is itself the risk, this kind of event will remain really likely even when nobody is trying to break anything, and extremely likely once you add well-funded malicious actors (ranging from bored teenagers to criminal gangs and geopolitical rivals).
The problem is that adding fair-weather products that give the illusion of control through fancy dashboards on the days they work is not really a substitute for proper resilience testing and security hardening, but it is far less disruptive to companies that don't really want to leave the 90s PC metaphor behind.
You have 100,000 devices to manage. How do you handle that efficiently without creating a monoculture?
It's not a "90ies PC metaphor" problem. Swap Chromebooks for PCs and you still have the problem-- how do you handle centralized management of that "fleet"?
Should every employee "bring their own device" leaving corporate IT "hands-off"? There are still monocultures within that world.
Poor quality assurance on the part of software providers is the root cause. The monocultures and having software that treats the symptoms of bad computing metaphors aren't good either, but bad software quality assurance is the reason this happened today.
> Swap Chromebooks for PCs and you still have the problem-- how do you handle centralized management of that "fleet"?
Simplicity (and hence low cost) of fleet management, OS boot-verification, no third-party kernel updates, and A/B partitions for OS updates are among the major selling points of Chromebooks.
It's a big reason they have become so ubiquitous in primary education, where there is such a limited budget that there's no way they could hire a security engineer.
The OP was deriding monoculture. My point was that pushing out only Chromebooks is still perpetuating a monoculture. You're just shifting your risk over to Google instead of Crowdstrike / Microsoft.
re: Chromebooks themselves - The execution is really, really good. The need for legacy software compatibility limits their corporate penetration. I've done enough "power washes" to know that they're not foolproof, though.
ChromeOS is just Linux, isn't it? It's going to suffer from the same problem as NT re: a buggy kernel mode driver tanking the entire OS.
Google gets a pass because their Customers are okay with devices with limited general purpose ability. Google is big enough that the market molds product offerings to the ChromeOS limitations. I think MSFT suffers from trying to please everybody whereas Google is okay with gaining market share by usurping the market norms over a period of years.
> ChromeOS is just Linux, isn't it? It's going to suffer from the same problem as NT re: a buggy kernel mode driver tanking the entire OS.
ChromeOS is not just Linux. It uses the Linux kernel and several subsystems (while eschewing others), but it also has a security and update model that prevents third parties (or even the user themselves) from updating kernel space code and the OS's user space code, so basically any code that ships with the OS.
Therefore, the particular way that the Crowdstrike failure happened can't happen on ChromeOS.
However, Google themselves could push a breaking change to ChromeOS. That, however would be no different than Apple or Microsoft doing the same with their OS's.
I am familiar with Google's walled garden w/ ChromeOS. I didn't mean to give the impression that I was not.
It's "just Linux" in the sense that it has the same Boolean kernel mode/user mode separation that NT has. ChromeOS doesn't take advantage of the other processor protection rings, for example. A bad kernel driver can crash ChromeOS just as easily as NT can be crashed.
Hopefully Google just doesn't push bad kernel drivers. Crowdstrike can't, of course, because of the walled garden. That also means you can't add a kernel driver for useful hardware, either. That limits the usefulness of ChromeOS devices for general purpose tasks.
> That also means you can't add a kernel driver for useful hardware, either. That limits the usefulness of ChromeOS devices for general purpose tasks.
Its target market isn't niche hardware but rather the plethora of use cases that use bog-standard hardware, much like many of the use cases that CS broke a few days ago.
Yes. I said that in a post up-thread. Google is making the market mold itself to their offering, rather than being like Microsoft and molding their offering to the market. Google is content to grow their market share that way.
If Crowdstrike's QA department is all that stands between you and days of no operations, then you have chosen to live with the near certainty that you will have days rather than hours of unplanned company-wide downtime.
And if you cannot actually abandon someone like Microsoft that consistently screws up their QA, then it's basically dishonest to claim that reliability is even a concern for your desktop platform.
And that's essentially what I'm saying when I accuse modern enterprise IT's client-device teams of being stuck in the 90s: those risks were totally acceptable back when the stakes were low and outages only impacted non-time-critical back-office clerical work. But what we saw today is that those high-risk, cost-optimized systems got deployed into roles where the risk/consequence profile is entirely different.
So what you do is keep the low-impact data entry clerks and spreadsheet wranglers on the Windows platform, but move the customer-facing workers dealing with time-sensitive tasks to something a bit less risky.
It might not be as easy as just deploying the same old platform designed back in the 90s to everyone, but once you leave the Microsoft ecosystem, dual sourcing based on open standards becomes totally feasible at costs that aren't prohibitive: everything in the Unix-like ecosystem, including web browsers, has multiple independent implementations, so you basically just have to standardize on 2-4 platforms rather than one, which again isn't unfeasible.
It's telling that an Azure region failed this same news cycle without anyone noticing, because companies just don't tolerate for their backends the kind of risk people take with their Wintel desktops, so most critical services hosted in Microsoft's Iowa datacenter had a second site on standby.
>And if you cannot actually abandon someone like microsoft that consistantly screws up their QA
The last outage I can remember due to an ms update was 7 or 8 years ago. Desktops got stuck on 'update 100% complete'. After a couple of minutes I pressed ctrl+alt+del and it cleared. Before that...I don't remember. Btw MS provides excellent tools to manage updates, and you can apply them on a rolling basis.
> If crowdsource QA department is all that stands between you and days of no operations ...
For companies of a certain large size, I guess. For all but the largest companies, though, there's no choice but to outsource software risks to software manufacturers. The idea that every company is going to shoulder the burden of maintaining their own software is ridiculous. Companies use off-the-shelf software because it makes good financial sense.
> And if you cannot actually abandon someone like microsoft that consistantly screws up their QA then it's basically dishonest for you to claim that reliability is even a concern for your desktop platform.
When a company has significant software assets tied to a Microsoft platform there's no alternative. A company is going to use the software that best-suits their needs. Platform is a consideration, however I've never seen it be the dominant consideration.
Today's issue isn't a Microsoft problem. The blame rests squarely on Crowdstrike and their inability to do QA. The culture of allowing "security software" to automatically update is bad, but Crowdstrike threw the lit match into that tinderbox by pushing out this update globally.
As another comment points out, Microsoft has good tools for rolling update releases for corporate environments. They're not perfect but they're not terrible either.
> It's might not be as easy as just deploying the same old platform ...
When a company doesn't control their software platform they don't have this choice. Off-the-shelf software is going to dictate this.
In some fantasy world where every application is web-based and legacy code is all gone maybe that's a possibility. I have yet to work in that environment. Companies aren't maintaining the "wintel desktop" because they want to.
Blaming Crowdstrike's QA might feel good, but the problem is that no company in the history of the world has been good enough at QA for it not to be reckless to allow day-one patching of critical systems, or for that matter to allow single-vendor, single-design critical systems in the first place. And yet the cybersecurity guidelines required to sustain the pretense that Windows can be used securely all but demand that companies take that risk.
It's also fundamentally a problem of denial: everyone knows there will not be a good solution to any issue around security and stability that does not require the assets tied up inside fragile, monopoly-operated ecosystems to eventually be either extracted or written off, but nobody wants to blaze new trails.
Claiming powerlessness is just lazy. Yes, it might take a decade to get out from under the yoke of an abusive vendor (we saw this with IBM), but as IBM is now a footnote in the history of computing, it's pretty clear that it can be done once people start realizing there is a systematic problem and not just a series of one-off mistakes.
And we know how to design reliable systems; it's just that doing so is completely incompatible with allowing any of America's big IT vendors to remain big and profitable, and that's scary to every institution involved in the current market.
To be fair, IBM products back in the day when that saying made sense never had these kinds of problems. It's straight up insulting to compare them to somebody like Crowdstrike.
Wintel won by being cheaper and shittier and getting a critical mass of fly by night OEMs and software vendors on board.
IBM was more analogous to the way Apple handles things. Heavy vertical integration and premium price point with a select few software and hardware vendors working very closely with IBM when software and hardware analogous to Crowdstrike in terms of access was created.
> I really like their corporate IT products that are going to push MS out as you say. I particularly love iActive Directory, iExchange, iSQLserver, iDynamics ERP, iTeams.
You’re being sarcastic, but do you like those MS products, specifically Teams?
I genuinely believe that any business that doesn't make Teams is doing the lord's work.
I'm stuck with them on my company Macbook and will definitely say, they suck.
In the 5 years I've been here, Outlook has never addressed this bug (not even sure they consider it a bug): Get an invitation to an event. See it on calendar view. Respond to it on calendar view. Go to inbox. Unread invitation is sitting there in your inbox requesting a response.
I don't even need to talk about why Teams is trash. Terrible design is in Teams's DNA.
In enterprise software, you don't need to be good. Just better than your competitors. I distinctly remember doing a happy jig about 6 years ago when we moved from Skype for Business (shudder) to Teams. Did teams drive me nuts? Absolutely. But I was free from the particular hell of SFB.
TBF I have less experience with Dynamics than the others, but yes they are all excellent.
I include Teams in that. I don't think there is another app on the market that does what Teams does. Integrated video conferencing, messaging, and file sharing in one place. All free with the office package my team already use and fully integrated with Azure AD for sso. I use it all day with zero problems. I honestly can't see why anyone would use anything else
The fact Apple is not trying to be a tentacular behemoth syphoning profits in every enterprise environment does not invalidate the fact macOS is secure and performant.
Apple is a tentacular behemoth in the consumer space.
Not a single statement you purport as "fact" has been true across large-scale deployments in my experience. Especially the first part, which tells me you have not experienced working with them as a supplier. I think you mean "in your opinion or experience", but please don't dress up wishful thinking as factual statements. It derails objectivity and discussion.
As ridiculous as it sounds, this does work on a subset of the machines affected based on my experience of the last few hours. With other machines you can seemingly reboot endlessly with no effect.
Dynamics, Teams, Exchange, Active Directory all suck. There are better alternatives but CIOs are stuck in 1996. Apple themselves in their corporate IT environment use none of those things yet somehow are one of the biggest and most profitable companies in the world. Azure is garbage compared to AWS. Using Azure Blob vs S3 is a nightmare. MSSQL is garbage compared to PostgreSQL. Slack is vastly better than Teams in literally every aspect. I just did a project moving a company from AWS to Azure and it was simply atrocious. Nobody at the user level likes using MS products if they have experience using non-MS products. It’s like Bitbucket — nobody uses that by choice.
You've got to admire the Apple fanboys' nerve in saying Apple is a better company when it comes to IT in a professional setting.
It appears that whatever their own basic and narrow use case is becomes, for them, the whole of "corporate IT".
Windows sucks, and recently Microsoft has been on a path to make it suck more, but saying Apple is better for this part of the IT universe is... hilarious.
I think if someone wants to criticize Microsoft after experiencing their buggy products for 20 years straight, that is not “baseless,” although I accept that taking responsibility for literally anything our products do goes against the core values of our profession.
They do have some crappy products, but those crappy products make the world move, because nobody really makes better drop-in replacements. The same goes for SAP, Canonical, Android, etc.; none of them are fault tolerant, they all have issues, and they will fail if you fuzz them with enough edge cases. And according to this article, CrowdStrike caused the issue, not Windows, which is what I was pointing at.
Do you think MacOS can't fail if you fuck with it long enough? Sometimes you don't even have to, it just fails by itself. My Ubuntu 22.04 LTS at my previous job gave me more issues than Windows ever did. Thanks Snaps, Wayland and APT. No workstation OS is perfect.
If you want a fault-tolerant OS, you're gonna have to roll your own Linux/BSD build based on your requirements and do your own dev and testing. Which company has money for that? So of course they're gonna pick an off-the-shelf solution that best fits their needs on the budget. How is it Microsoft's fault what their customers choose to do with it? Did they guarantee anywhere that their desktop OS is fault tolerant or should be used in high-availability systems and emergency services, especially with crappy endpoint solutions hooked in at the kernel level?
lol. i’ll dunk on Apple as much as i’ll dunk on any other OS, but they wouldn’t be as praised for security if they had to manage the infrastructure and users that Windows supports
> I particularly love iActive Directory, iExchange, iSQLserver, iDynamics ERP, iTeams. Apples office products are the reason noone uses Excel any more.
I see your sarcasm backfiring, as most of what you are listing is just Microsoft dog food with no real usefulness. The only good thing in your list is Excel; all the rest is bloatware. Teams is a resource hog that serves no useful purpose. Skype was perfectly fine for sending messages or having a video call.
I admit I don't have experience as an IT administrator, but things like managing email, accounts, databases, and remote computers can be done with well-established tools from the Linux/BSD world.
Wild that you’d write this comment with such a confident voice then.
I worked at a company whose IT team managed both Windows and Mac computers, and apparently MS's Active Directory is leagues ahead of Apple's offering.
Which makes sense. MS is selling windows to administrators, not to users
I'm a die hard FOSS guy, but as someone who has done LDAP work with FreeIPA and OpenLDAP -- AD does a better job.
Admittedly, it's mostly a better job at integrating with Microsoft-powered systems, so it should damn well do a better job, but it's a core business offering and has polish on it in ways that many FOSS offerings don't.
disclaimer: haven't done FreeIPA and LDAP work in the last ~3 years, maybe they got better.
I would disagree. I work in healthcare and we’ve always used SQL Server. While I wouldn’t pick it, it’s been reliable and integrates with auth.
No one “loves” Teams, but honestly it serves its purpose for us at no cost.
No one loves OneDrive but it works.
I think people underestimate how much work it would take to integrate services, train people, and meet compliance requirements when using a handful of the best in class products instead of MS Suite.
People use Teams and OneDrive because it’s “Free” when you use Office. IMO, that’s a bit of an anti-trust problem. Both have good competitors (arguably better competitors) that are getting squeezed because of the monopoly pricing with Office.
But with SQL Server, on the other hand, I think you are right. It is a good piece of software. But it also has high quality competition from multiple vendors. Some of it enterprise (Oracle, DB2), some of it FOSS (Postgres, MySQL). Because of this, it has to be better quality to survive… they couldn’t bundle it to get market share, it actually had to compete.
People use Teams because it's well integrated into Office, 365, Entra and other MS products, they would (and recently do) pay for it. It has functionalities that no other alternative has, e.g. it can act as a full call centre solution through a SIP gateway.
Yeah, sure. But the marginal cost is zero, whereas a Slack subscription for every person in our org will cost about 1 million dollars a year. And it doesn’t integrate as well with every other piece of functional but mediocre software.
The person approving the $1 million budget item doesn't really care that Teams isn't "free" in the sense that there is no free lunch, and while they perhaps have moral qualms about antitrust, that's outside their purview. We're locked into the Office suite, and right now there is no extra charge for Teams.
> Teams is a resource hog that serve no useful purpose. Skype was perfectly fine to send messages or have some video call.
I’m sorry, this is a very silly take. I’m no fan of Teams or Slack but I can’t deny the functionality they offer, which is far above and beyond what Skype does.
> I admit I don’t have experience as an IT administrator
Time was, NeXT was a hard sell into corporations because it required so little administration, and what administration there was was so easily done that IT staffs were hugely cut back after implementing them.
Had to move my Cube this past week-end, and it made me incredibly sad.
Using a NeXT Cube w/ Wacom ArtZ and an NCR-3125 running Go Corp.'s PenPoint (and rebooting into Windows for Pen Computing when I wanted to run Futurewave Smartsketch) was the high-water mark of my computing experience.
It was elegant, consistent, reliable, and "just worked" in a way no computer has since (and I had such high hopes for Rhapsody and the Mac OS Public Beta).
Then you probably shouldn't speak on software exclusively understood and administered by IT administrators. I've worked in IT for some time and every single one of those products(aside from Dynamics) have been the most important parts of our administrative stack.
Even Excel is beginning to be regarded as a dangerous piece of software that gives the illusion of power while silently bankrupting departments who depend on the idea that large spreadsheets are an accurate and reliable way to analyze large/complex datasets.
The 90s are over, but for some reason the average enterprise department has a problem internalizing the fact that the demands today are different than they were 25 years ago.
Meanwhile, while HN bubble imagines people doing big data jobs on Excel, in the real world 10s or 100s of millions of people are perfectly satisfied doing small data jobs in Excel.
The problem is that without tools and processes to systematically validate those results, people might be perfectly happy with completely inaccurate results.
I know I have had to correct one in three Excel sheets I have ever gone over with pen and paper to validate the results, but I am a paranoid sod who actually does this kind of exercise on a regular basis.
Almost all of the disciplines known to rely on Excel have a serious issue with repeatability of results, either because nobody ever attempts it or because it's a messy field without a well-defined methodology.
I work in finance. We have double entry accounting and literal checks and balances to validate our results. It is not a messy field, and has a well defined methodology. We have been the biggest spreadsheet users at many of the companies I have worked with.
I used to run a C++ shop, writing heavy-duty image processing pipeline software.
It did a lot, and it needed to do it in realtime, so we were constantly busting our asses to profile and optimize the software.
Our IT department insisted that we install 'orrible, 'orrible Java-based sneakware onto all of our machines, including the ones we were profiling.
We ended up having "rogue" machines that would have gotten us in trouble if IT found out (and I learned that senior management will always side with IT, regardless of whether or not that makes sense. It resulted in the IT department acting like that little sneak who sticks his tongue out at you while hiding behind Sister Mary Elephant's habit).
But, to give them credit, they did have a tough job, and the risks were very real. Many baddies would have been thrilled to get their claws on our software.
We were in need of an MDM to help staff (non-techs) with their MacBooks. I haven't noticed any issues, nor have two of my staff who are trialling it. What's been your main gripe?
I'm a dev but also manage the IT team of one sysadmin, and I haven't noticed any performance hits. Yet, anyway; it's only been two weeks.
Installing software is painful (some of this is perhaps related to how much the IT group has restricted for us; I can't even change my screen saver), plus weirdness like bizarre pop-ups asking for your password from time to time. It just doesn't belong on a developer machine.
The poor quality of Windows and associated software is not the problem here. The problem is that Microsoft especially, but software vendors generally, encourage users to blindly accept updates which they do not understand or know how to roll back. And by "encourage" I mean that they've removed the "no thanks" and "undo" buttons.
Here on Linux (NixOS), I am prompted at boot time:
> which system config should be used?
If I applied a bad update today, I can just select the config that worked yesterday while I fix it. This is not a power that software vendors want users to have, and thus the users are powerless to fix problems of this sort that the vendors introduce.
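The same rollback also works from a running (or rescue) shell rather than the boot menu. A rough sketch, wrapped in Python purely for illustration, since the underlying commands are the real interface:

    import subprocess

    # List system generations -- each one is a bootable config the boot menu offers.
    subprocess.run(
        ["nix-env", "--list-generations", "-p", "/nix/var/nix/profiles/system"],
        check=True,
    )

    # Switch back to the previous generation without waiting for a reboot
    # (needs root on a real system).
    subprocess.run(["nixos-rebuild", "switch", "--rollback"], check=True)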
It's not faulty software, it's a problematic philosophy of responsibility. Faulty software is the wake-up call.
What makes you think the FAANG companies don't use windows? Spent four years at Amazon recently and unless you were a dev, you were more likely to have a windows PC than Mac. Saw zero Linux laptops.
Leave FAANG and most internal developers at large corporations are running Windows. It wasn't until I started at a smaller shop that I found people regularly using Linux to do their jobs, usually in a dual-boot or with a virtual Windows install "just in case" but most never touched it.
I'm presently working supporting a .NET web app (some of which is "old .NET Framework) but my work machine runs OpenSUSE Tumbleweed. I can't see that flying at the larger shops I have previously worked at. I'll admit, that might be different -- today -- I haven't worked at a large shop in more than a decade.
Most corporations have no interest in paying the cost of running a multi-OS IT shop nor dealing with the challenges of fleet management with both Linux and Mac that make running those fleets more expensive and challenging.
That's before you factor in that almost everyone in IT is a born and bred in Windows and in almost every case people tend to choose what they know best.
Depends on which FAANG, I guess. Approaching 10 years at Google now, and I saw Windows laptops used only by a very few sales people. Everyone else is either using Macs or Chromebooks.
Fellow Googler here. I'm the exception that proves the rule. After 7 years of Macbook and Linux devices, I needed Windows for a special project, so I got a "gWindows" device and found it very well supported.
Aside from the specific Windows-only software I needed, I would still just ssh into a Linux workstation, but gWindows can do basically everything my Mac can. I was pleasantly surprised.
> The Windows ecosystem typically deployed in corporate PCs or workstations is often insecure, slow, and poorly implemented
Yes, but that's not because of Windows itself (which is fast and secure out of the box) but because of a decades-old "security product" culture that insists on adding negative-value garbage like Crowdstrike and various anti-virus systems on the critical path, killing performance and harming real security.
It's a hard problem. No matter how good Windows itself gets and no matter how bad these "security products" become, Windows administrators are stuck in the same system of crappy incentives.
Decades of myth and superstition demand they perform rituals and make incantations they know harm system security, but they do them anyway, because fear and tradition.
It's no wonder that they see Linux and macOS as a way out. It's not that they're any better -- but they're different, and the difference gives IT people air cover for escaping from this suffocating "you must add security products" culture.
Disagree. At least in the context of business networks.
My favorite example is the SMB service, which is enabled by default.
In the Linux world, people preach:
- disabling SSH unless necessary
- use at least public key-based auth
- better both public key and password
- don't allow root login
In Windows, the SMB service:
- is enabled by default
- allows command execution as local admin via PsExec, so it's essentially like SSH except done poorly
- is only password-based
- doesn't even support MFA
- is not even encrypted by default
It's a huge part of why everyone gets encrypted by ransomware.
I always recommend disabling it using the Windows firewall unless it is actually used, and if it is necessary, defining a whitelist of address ranges. But apparently it is too hard to figure out who needs access to what, and much easier to deploy products like Crowdstrike, which admittedly strongly mitigate the issue.
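For anyone curious what that looks like in practice, a rough sketch using only the built-in firewall via netsh (run from an elevated prompt; the subnet is a made-up placeholder, and it assumes the default inbound policy of "block"):

    import subprocess

    # Disable the built-in "File and Printer Sharing" allow rules...
    subprocess.run(
        'netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=No',
        check=True,
    )

    # ...then allow SMB inbound only from the whitelisted range (placeholder
    # subnet); everything else falls through to the default inbound block.
    subprocess.run(
        'netsh advfirewall firewall add rule name="SMB from file servers only" '
        'dir=in action=allow protocol=TCP localport=445 remoteip=10.20.0.0/24',
        check=True,
    )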
The next thing is that Windows still allows the NTLM authentication protocol by default (now finally about to be deprecated), which is a laughably bad authentication protocol. If you manage to steal the hash of the local admin on one machine, you can simply use it to authenticate to the next machine. Before LAPS gained traction, the local admin account password was the same on all machines in basically every organization. NT hashes are neither salted nor do they have a cost factor.
I could go on, but Microsoft made some very questionable security decisions that still haunt them to this day because of their strong commitment to backwards compatibility.
You don't need Crowdstrike to disable any of these things. You can use regular group policy. I'm not saying Windows can't be hardened. I'm saying these third party kernel hooks add negative value.
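To make that concrete, the NTLM point above corresponds to the "Network security: LAN Manager authentication level" policy; a rough sketch of the registry value that policy controls (you would normally push this through group policy rather than script it per host, and writing HKLM needs admin rights):

    import winreg

    # 5 = send NTLMv2 response only, refuse LM & NTLM (the hardened setting).
    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE,
        r"SYSTEM\CurrentControlSet\Control\Lsa",
        0,
        winreg.KEY_SET_VALUE,
    ) as key:
        winreg.SetValueEx(key, "LmCompatibilityLevel", 0, winreg.REG_DWORD, 5)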
Fun fact, these negative value garbage offerings are often “required” by box checking certifications like SOC2.
Sure, if you have massive staffing to handle compliance you might be able to argue you've achieved the objective without this trash. The rest of us just shrug and do it.
Some of the “compliance managers as a service” push you in this direction as well.
Why do companies need these "box checking certifications"? I imagine the answer, as usual, is that either they or one of their customers is working with the government which requires this for its contractors. That's usually the answer whenever you find an idiotic practice that companies are mindlessly adopting.
Pretty much. We’re in the healthcare space and most of our customers are large hospital systems. Anything except “SOC2 compliant, no exceptions on report” will take an already long deal cycle (4-18 months) and double or triple it.
If you’re a startup it also means that your core people are now sitting in multiple cycles of IT review with their IT staff filling out spreadsheet after spreadsheet of “Do you encrypt data in transit?”
> > The Windows ecosystem typically deployed in corporate PCs or workstations is often insecure, slow, and poorly implemented
> Yes, but that's not because of Windows itself
Come on. There’s a reason Windows users all want to install crappy security products: they’ve been routinely having their files encrypted and held for ransom for the last decade.
And Linux/BSD generally would not help here. Ransomware is just ordinary file IO, and it is usually run "legitimately" by phished users rather than via actual code execution exploits.
I have a similar disdain for security bloatware with questionable value, but one actually effective corporate IT strategy is using one of those tools to operate a whitelist of safe software, with centralized updates
I think having a Linux/BSD might be helpful here in the general case, because the culture is different.
In Windows land it's pretty much expected that you go to random websites, download random executables, ignore the "make changes to your computer?" warnings and pretty much give the exe full permission to do anything. It's very much been the standard software install workflow for decades now on Windows.
In the Linux/BSD world, while you can do the above, people generally don't. Generally, they stick to trusted software sources with centralized updates, like your second point. In this case I don't think it's a matter of capability, both Windows and Unix-land is capable of what you're suggesting.
I think phishing is generally much less effective in the Mac/Linux/BSD world because of this.
Until a lucrative contract requires you to install prescribed boutique Windows-only software from a random company you've never heard of, and then it's back to that bad old workflow.
Yeah, because no one on Linux or Mac would clone a git repo they just found out about and blindly run the setup scripts listed in the readme.
And no one would pipe a script downloaded with wget/curl directly into bash.
And nobody would copy a script from a code-formatted block on a page, paste it directly into their terminal and then run it.
I'm not going to go so far as to claim that these behaviors are as common as installing software on Windows, but they are still definitely common, and all could lead to the same kinds of bad things happening.
I would agree this stuff DOES happen, but typically in development environments. And I also think it's crappy practice. Nobody should ever pipe a curl into sh. I see it in docs sometimes and yes, it does bother me.
I think though that the culture of robust repositories and package managers is MUCH more prominent on Mac/iOS/Linux/FreeBSD. It's coming to Windows too with the new(er) Windows store stuff, so hopefully people don't become too resistant to that.
A developer is much more likely to be able to fix their computer and/or restore from a backup than a typical user is. A significant problem is cascading failures, where one bozo installing malware either creates a business problem (e.g. allowing someone to steal a bunch of money) or is able to disable a bunch of other computers on the same network. It is not that common for macOS to be implicated in these sorts of issues. I know people have been saying for a long time that it’s theoretically possible but it really doesn’t seem that common in practice.
I'd wager if Linux had the same userbase as Windows, you'd see more ransomware attacks on that platform as well. Nothing about Linux is inherently more secure.
> Yeah I don't get where this "Linux is more secure" thing comes from.
It comes from the 1990s and early 2000s. Back then, Windows was a laughingstock from a security point of view (for instance, at one point connecting a newly installed Windows computer to the network was enough for it to be automatically invaded). Both Windows and Linux have become more secure since then.
> Basically any userspace program can read your .aws, .ssh, .kube, etc... The user based security model desktops have is the real issue. Compare that with Android and iOS for instance. No one needs anti-virus bloatware, just because apps are curated and isolated by default.
Things are getting better now, with things like flatpak getting more popular. For instance, the closed-source games running within the Steam flatpak won't have any access to my ~/.aws or ~/.ssh or ~/.kube or etc.
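As an illustration of that model, the sandbox grants are inspectable and can be tightened per app; a quick sketch (the app ID is Steam's Flathub ID, adjust for whatever you run):

    import subprocess

    APP = "com.valvesoftware.Steam"  # Steam's Flathub application ID

    # Show what the sandbox actually grants: filesystem paths, devices, D-Bus names.
    subprocess.run(["flatpak", "info", "--show-permissions", APP], check=True)

    # Tighten it further for the current user, e.g. drop any home-directory access.
    subprocess.run(["flatpak", "override", "--user", "--nofilesystem=home", APP], check=True)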
What fraction of ransomware attacks would these security products have prevented exactly? Windows already comes with plenty of monitoring and alerting functionality.
Probably close to none at some point. They may block some things.
But most of Windows falling to this is just that it's what people use. The only platform that is somewhat actually protected against attacks is the iPhone; the Mac can easily be ransomwared, it's just that the market is so small nobody bothers attacking it. No ROI.
Yeah. The mobile ecosystems are what real security design looks like. Everything is sandboxed, brokered, MACed, and fuzzed. We should either make the desktop systems work the same way or generalize the mobile systems into desktops.
The mobile ecosystem is what corporate IT should be. Centralized app store, siloed applications, immutable filesystem (other than the document area for each application), then VMs and special computers for activities like development. However locked down iOS may be, most upgrades happen without a hitch, and there's no need for security software.
Hard to say, but Windows Defender doesn't stop as many as EDRs can. There are actual tests for this, run by independent parties that check exactly this. Defender can be disabled extremely easily; modern EDRs cannot.
Yes, average Windows users are significantly less tech literate due to obvious reasons and there are way more of them. This create a very lucrative market.
How is desktop Linux somehow inherently particularly more secure than Windows?
Okay, so I did, and he definitely claims Windows was secure out of the box. So again, I ask if he really said that, haha, with a straight face.
SMB 1.0 is enabled, non-admin users have PowerShell access, Defender can be disabled with a single command, the user is an admin by default, and passwords can be reset by booting from external media and swapping a command prompt binary onto C:.
There are so many basic insecurities out of the box in Windows.
Apple on the desktop/laptop, Google in the cloud for email, collaboration, file sharing, office suite. I ran a substantial sized company this way for a decade. Then we did a merger and had to migrate to Microsoft- massive step backwards, quintupling of IT problems and staff.
> Companies that are not software-focused, as it's not their primary business. These organizations are left with Microsoft's offerings
I wonder why that is the case. These companies still have IT departments; someone has to manage these huge fleets of Windows machines. So nothing would prevent them from hiring Linux admins instead of Windows admins. What makes the management of these companies consider Windows to be the default choice?
1. Users are more comfortable running Windows and Office because it's Windows they likely used in school and on personal laptops.
2. This is the biggie: Microsoft's enterprise services for managing fleets of workstations are actually really good -- or at least a massive step up from the competition. Linux (and its ilk) is much better for managing fleets of servers, but workstations require a whole different type of tooling. And once you have AD and its ilk running, and thus Windows administrators hired, it's often easier to run other services from Windows too, rather than having to spin up another cluster of management services.
Software focused businesses generally start out with engineers running macOS or Linux, so they wouldn't have Windows management services pre-provisioned. And that's why you generally see them utilising stuff like Okta or Google Workspace
Unfortunately, Google did not succeed in getting further into schools around the globe with Chromebooks, which is a pity in my opinion. That helps the Windows/Office monopoly carry on in organizations and businesses that hire people who have never used software other than Microsoft's.
One reason being that Microsoft lobbies hard against low-end PCs and notebooks that are not aligned with its interests. [1]
Microsoft has a large, entrenched distribution network and market all over the world. That makes it an uphill battle to create low-end programs for schools, universities, governments, and SMBs.
Hence the phrase "no one was ever fired for buying Microsoft". It's too hard a battle to go against the flow.
Inertia, plus integration - AFAIK Exchange and SharePoint don't run on Linux, so if the company buys into that, then it's Windows all the way down.
Still, all this is a red herring. Using Linux instead of Windows on workstations won't change anything, because it's not the OS that's the problem. A typical IT department is locked in a war on three fronts - defending against security threats, pushing back on unreasonable demands from the top, and fighting the company employees who want to do their jobs. Linux may or may not help against external attackers, but the fight against employees (which IT does both to fulfill mandates from the top and to minimize their own workload) requires tools for totalitarian control over computing devices.
Windows actually is better suited for that, because it's designed to constrain and control users. Linux is designed for the smart user to be able to do whatever they want, which includes working around stupid IT policies and corporate malware. So it shouldn't be surprising corporate IT favors Windows workstations too - it puts IT at an advantage over the users, and minimizes IT workload.
>Windows actually is better suited for that, because it's designed to constrain and control users. Linux is designed for the smart user to be able to do whatever they want, which includes working around stupid IT policies and corporate malware.
This just tells me you don't know linux. Linux can be much more easily hardened and restricted than windows. It's trivial to make it so that a user can only install whitelisted software from private repos.
Excel. There is no other software that can currently fill excel’s role in business. It’s the best at what it does and what it does is usually very important. Unfortunately.
The situation might have changed since I last used Excel on Mac, but in 2018, the "Excel" on Mac barely resembled the Excel on Windows. Many obvious and useful features were missing.
My guess is that the fact you can buy about two to three cheap Dell desktop machines for the price of one Mac probably factors quite heavily into the equation.
If you’re only doing vacation travel planning, sure. But there’s a long tail of advanced functionality used across all kinds of industries (with plugins upon plugins) that are most certainly not even close to being supported by any of the options proposed.
I don't know, but I would guess that Microsoft Office is what retains people; personal anecdotal experience suggests that anything else (Apple's offerings, Google Docs, LibreOffice &c.) is not acceptable to the average user.
My suspicion is that Microsoft would be very unhappy to have MS Office running successfully on Linux systems.
A lot actually don’t, in any meaningful sense. My partner’s company has a skeleton IT staff with all support requests being sent offshore. An issue with your laptop? A new one gets dispatched from ??? and mailed to you, you mail the old one back, presumably to get wiped and redispatched to the new person that has a problem.
Tooling, infra, knowledge? The only reason why people are talking about "issues in Windows" because people are widely using it.
If Linux had software anywhere close to the amount that Windows has, it would have experienced the same issues too. After all, it is not just about running a server and tinkering with config files. It is about the ability to manage devices, roll out updates, and so on.
You have to also factor in competition. I think it's a big factor on why corporate IT is generally bad, Microsoft and their partners have no reason to improve on the status quo. If we had viable alternatives, in a market where no entity has more than 20% market share or something like that the standards would be much higher.
The whole idea of running a backdoor with OS privileges in order to increase system security screams Windows. In Linux, even if Crowdstrike (or similar endpoint management software) is allowed to update itself, it doesn't have to run as a kernel driver. So a buggy update to Crowdstrike would only kill Crowdstrike and nothing else.
And Linux is not even a particularly hardened OS. If we could take some resources from VC smoke and mirrors and dedicate them to securing our critical infrastructure we could have airports and hospitals running on some safety-critical microkernel OS with tailored software.
the comment I am replying to explicitly mentions Linux as an alternative to Windows. In any case, yes, one could use Mac, as I do, but it comes with its own issues, starting from price. I'd happily switch 100% to Linux if I didn't need to work on documents edited with Office. The online version may actually solve this, but it's still buggy as hell.
Word, Excel, Powerpoint and all the other windows software. Plus all the people that know how to use the windows software vs Linux equivalents (if they exist).
Purchasing decisions are made by purchasing managers. Purchasing managers spend their time torturing numbers in spreadsheets, writing reports, and getting free lunches from channel sales reps. Microsoft is just a sales organization with some technical prowess, and their channel reps are very effective.
Technical arguments, logic, and sense do not contribute much to purchasing decisions in the corporate world.
I'd say something implementing the ideas of NixOS, i.e. immutable versioned systems and declarative system definitions, is poised to replace the current deployment mess, which is extremely fragile.
With NixOS, you can upgrade without fear, as you can always roll back to a previous version of your system. Regular Linux distributions, macOS, and Windows make me very nervous because that is not the case.
> I'd say something implementing the ideas of NixOS, i.e. immutable versioned systems
NixOS isn't immutable; things aren't mounted read-only. AFAIK, it can't be set up that way.
> With NixOS, you can upgrade without fear, as you can always roll back to a previous version of your system. Regular Linux distributions, macOS, and Windows make me very nervous because that is not the case.
The store is immutable in the functional programming sense, as the package manager creates a new directory entry for each hash value.
Backups could be an option, but it is much better to have a system where two computers are guaranteed to be running the exact same software if configuration hashes are the same.
In other OSes, the state of your system could depend on previous actions.
> Regular Linux distributions, macOS, and Windows make me very nervous because that is not the case.
I'm personally only really nervous when updating Linux distributions. Besides security updates it usually hardly matters or is noticeable on macOS/Windows (well besides the random UX changes..).
Ideally there would be a usable, security-first OS based on something like seL4, with a declarative package system, for slow-to-change mission-critical appliances.
In NixOS, you have a bootloader to load your OS. Unless you botch your bootloader, you can't paint yourself into an unbootable state. If one system configuration doesn't work, you reboot and choose the prior one from a menu displayed by the bootloader, before the OS begins to load.
This is also true of most regular Linux setups. Except that in those, you can only choose the kernel. Hence, if you have broken other parts of your configuration, your system might not be bootable. So the safety net is much thinner.
Because you just want stuff to work and couldn't care less about the ideology part?
Also, there's no feature parity (it's not about Windows being "better" than Linux or the other way around; none of that matters): there are no out-of-the-box solutions to replace some of the stuff enterprise IT relies on in Windows etc., which would mean they'd have to hire expensive vendors to recreate/migrate their workflows. The costs of figuring out how to run all of your legacy Windows software, retraining staff, etc. would be very significant. Why spend so much money with no clear benefits?
To be fair, I'm not sure how Apple figures into this. They don't really cater to the enterprise market at all.
Why? Both things seem pretty tangential. Poorly written software exists or can exist on any platform, just like the IT infrastructure wouldn't somehow automagically become robust if they just switched to Linux.
When I took a Linux course in college I had an old laptop that I installed Linux on. However, for some reason my wireless card wouldn't work. I mentioned it to my professor and the next day he told me "It's actually quite simple, you just have to open up the source code for the wireless driver and make a one line change."
Maybe things have gotten better, but I think that's why people use Mac. It's POSIX but without having to jump through arcane hoops.
The problem with the Linux desktop was usually that most hardware companies either spent no time/effort on non-Windows drivers/compatibility, or, when they did, it was a tiny fraction of the effort that went into working around bugs in the Windows driver APIs.
Today, with the failure of Windows in both the mobile and industrial control space, we see vendors actually giving a damn about the quality of their Linux drivers.
Today the main factor keeping the enterprise market locked on Windows is the fat clients written around the turn of the millennium, and that's as much a problem for Mac adoption as it is for Linux adoption.
Macs are slick, well-designed devices that speak to a huge segment of the consumer market, so they will eventually find their way into the high-cost niches where no specific dependency on legacy software exists. But they are too expensive and inflexible to replace all of the Wintel systems, so for Microsoft and its partners to have their license to screw over the enterprise sector revoked, Linux (or FreeBSD) will have to play a role too.
Things have definitely gotten better. I remember the painful years. My most recent Ubuntu install on a new laptop was about 3 years ago. As someone who has used Linux as the daily driver for more than a decade (and dual-booted it as a second OS for another decade), I was pleasantly surprised that everything just worked! I think that was a first.
It was an HP from Costco, not something special sold with Linux. My wireless worked, dual monitors just worked, even the fingerprint reader that I never use. I remember sitting there thinking "I didn't have to fight anything?" Hopefully that becomes the norm, maybe it is - I haven't needed a new laptop yet.
Because for some people (certainly not all), their objection is not to a "corporate" OS, but to the specific things Microsoft does that Apple does not.
Honestly, windows out of the box is pretty secure. I don't want to defend Microsoft here, but adding third party security to Windows hasn't been anything but regulatory compliance at best and cargo culting at worst for over a decade now. If you actually look at core windows exploits compared to market share, they're comparable to Apple. Enterprises insist on adding extra attack surface area in the name of security.
I agree that people who actually know what they're doing are generally running Linux backends, but Microsoft have enterprise sewn up, and this attack is not their fault.
A lot of Active Directory defaults are wildly insecure, even on a newly built domain, and there are a lot of Active Directory admins out there who don't know how to properly delegate AD permissions.
This is true. You are basically one escalation attack on the CFO away from someone wiring money to hackers and a new remotely embedded admin freely roaming your network.
Windows is leagues ahead of MacOS in terms of granularity of permissions and remote management tools. It's not even close. That's mainly why enterprise IT prefers it to alternatives.
downvoted, because in your response you conflate two issues:
1. The problem with using Microsoft
2. The lack of institutional knowledge of securing BSD and MacOS and running either of those at the scale Microsoft systems are being run at.
The vast majority of corporate computer endpoints are running windows. The vast majority of corporate line-of-business systems are running Windows Server (or alternatively Microsoft 365).
That means a whole lot of people have knowledge of how to administer Windows machines and servers. That means the cost of the knowledge to administer those systems is going down as more people know how to do it.
Contrast that with macOS Server administration, endpoint administration, or BSD administration. Far fewer people know how to do that. Far fewer examples of documentation and fixes for the issues administrators hit are on the internet, waiting to help the hapless system administrator who has a problem.
It's not just about better vs. worse from your perspective; it's about the cost of change and the cost of acquiring the knowledge necessary to run these corporate systems at scale -- not to mention the cost of converting any applications running on these Windows machines to run on BSD or MacOS -- both from an endpoint perspective and a corporate IT system perspective.
It's really not even feasible to suggest alternatives to any of the corporations using Microsoft that are impacted by this outage.
If you want to create an alternative to Microsoft's Corporate IT Administration you're gonna need to do a lot more than point to MacOS or BSD being "better".
I watched a presentation by someone representing "I Am The Cavalry" at B-Sides, Las Vegas, a few years ago. Very interesting stuff, gave me a whole new perspective on "cyber security".
US Based and got a NANOG alert email just in time. At least half our windows servers down.
I went into our CrowdStrike policies and disabled auto-update of the sensor. Hopefully this means it doesn't hit everything. Double check your policies!!!
IMO, having a mix of servers would help in mitigating issues like that.
Like run stuff on Linux, Windows, and FreeBSD servers, so that you have OS redundancy should an issue affect one in particular (kernel or app).
Just like you want more than a single server handling your traffic, you'd want two different bases for those servers to avoid impacting them both with one update.
It's morning here in Europe, departure peak time. We're still flying, but...
The problems mean the takeoff and weight & balance data is missing. It needs to be done manually by each crew. Baggage handling is also manual, so that means counting bags and the cabin crew counting people. Then manually calculating performance data before you can take off.
It's not all of Europe. The airports in Norway are operating as normal. One of the airlines reported booking issues on their website, but I haven't read about any other issues.
Some international flights have been delayed or canceled of course, depending on the route.
KLM was flying, but all flights pass through Schiphol Airport (it's their hub) and Schiphol couldn't board flights for a while. Because of that they ran out of gates for arriving flights, so everyone had to cancel flights to avoid compounding delays. As the biggest user of Schiphol, that means KLM had to cancel a lot of flights.
I can't imagine believing that this computation, ordinarily automated by these out of service systems, can be performed correctly by crews that probably haven't had to do this in ... years?
At most places there is an iPad app that does the calculation locally. So it's mostly entering a lot of numbers and checking that the results make sense. Usually both crew members do it individually, and then cross check the results.
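To give a flavour of what "entering a lot of numbers" amounts to, here's a toy weight-and-balance check; every limit and standard mass below is invented for illustration, not real performance data for any aircraft:

    # Toy weight-and-balance check. All numbers are made up for illustration.
    LIMITS = {"max_takeoff_kg": 79_000, "cg_fwd_pct_mac": 15.0, "cg_aft_pct_mac": 35.0}

    def check(empty_kg, pax, bags, fuel_kg, cg_pct_mac):
        # Assumed standard masses: 84 kg per passenger, 15 kg per bag.
        total = empty_kg + pax * 84 + bags * 15 + fuel_kg
        within_limits = (total <= LIMITS["max_takeoff_kg"]
                         and LIMITS["cg_fwd_pct_mac"] <= cg_pct_mac <= LIMITS["cg_aft_pct_mac"])
        return total, within_limits

    # Each pilot runs the numbers independently, then they compare results.
    print(check(empty_kg=45_000, pax=150, bags=130, fuel_kg=12_000, cg_pct_mac=27.3))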
For much of their history they were written down or copied on carbon paper and manually processed by phone later. Electronic processing came in the 70s and wasn't universally used until much later. I saw plenty of credit card imprinters in use well into the 90s when I was growing up.
WTF is CrowdStrike and why is it affecting so many people and companies? I've never heard of it before. And apparently it isn't anything relevant to all Windows users as it didn't affect any computer of any person I personally know.
Very popular corporate endpoint protection (malware and spyware detection) that runs telemetry & monitoring agents installed as kernel-mode drivers on Windows. Thus if there is a crash, it crashes the entire kernel (BSOD). And their drivers load at boot.
Guess we will never read the real facts. Truth is RMS was right. Again. Closed source security software is too often malware by design. We need open solutions we can truly trust.
Gasoline is very useful. We also take a lot of precautions when using it.
We also have things like inspections and financial penalties if you were storing it in an unsafe manner.
It's clear we need to take more precautions before using Crowdstrike. More testing, ability by IT departments to not push updates, ability to rollback updates.
On a positive note, I'm in Morocco and getting money from ATMs wasn't working for the whole day, I believe because of this outage. I was at the till in a supermarket and people started asking if they could chip in to pay for some food I bought because I didn't have the cash.
Humanity 1 - Technology 0
Edit: The outage of all ATMs in Morocco was yesterday, not today, so not sure how the two are related.
such stupidity. our $$$ corporate geniuses mandate multiple so-called security software which is:
- unaccountable black boxes
- of questionable, and un-auditable, quality
- requires kernel modules, drivers, LocalSystem, root access, etc.
- updates at random times with no testing
- download these updates from where? and immediately trust and run that code at high privilege. using unaccountable-black-box crypto to secure it.
- all have known patterns of bad performance, bugs, and generally poor quality
all in the name of security. let's buy multiple "solutions" and widely deploy them to protect us from one boogeyman, or at least the shiny advertisements say. while punching all sorts of serious other holes in security. why even look for a Windows ZeroDay when we can look for a McAfee or Crowdstrike zero day?
According to Reddit it's hitting Croatia, the Philippines, the US, Germany, Mexico, India, Japan. SAP servers dropping like flies; that's Defence, Banks, Payroll all affected. Major retail chains like Big W down.
We have outages across whole APAC and most EMEA. Despite being a very big client of CS, we do not have an official resolution yet, an hour into the incident.
I'm a little late to the party, but I've uploaded my source code to GitHub in case anyone needs a more convenient tool to deploy/execute on running machines and/or needs something fast on USB flash drives to run around the office:
I'm sure you mean well, but it's not going to be most programmers or devs who will need to apply a fix for this; it'll be sysadmins/network admins/SREs who'll be doing this, and they're not going to download Go to build some random GitHub repo. Because it affects only Windows systems, it'd be much better to write a .bat or PowerShell script that non-programmers can read and comprehend before they execute anything on production/live systems.
The details (the particular companies / systems etc) of this global incident don't really matter.
When the entire society and economy are being digitized AND that digitisation is controlled and passes through a handful of choke points, it's an invitation to major disaster.
It is risk management 101, never put all your digital eggs in one (or even a few) baskets.
The love affair with oligopoly, cornered markets and power concentration (which creates abnormal returns for a select few) is priming the rest of us for major disasters.
As a rule of thumb there should be at least ten alternatives in any diversified set of critical infrastructure service providers, all of them instantly replaceable / forced to provide interoperability...
Some truths will hit you in the face again and again until you acknowledge the nature of reality.
Can you imagine having just one road connecting two big cities to cut costs? No alternative roads, neither big nor small.
That would be really cheap to maintain, and you could charge as much as you want in tolls as there are no alternatives. And you could add ads all over the road, as people have to watch them to move from one city to the other.
And if the road breaks, the government needs to pay for the cost, as it cannot allow the cities to go unconnected.
I'm writing this in the aftermath of the disclosure of the log4j zero-day vulnerability. But this is only a recent example of just one kind of networked risk.
With managed services we effectively add one more level to the Inception world of our software organisation. We outsource nice big chunks of supply chain risk management, but we in-source a different risk of depending critically on entities that we do not control and cannot fix if they fail.
Not to mention the fact that change ripples through the parallel yet deeply enmeshed dimensions of cyberspace and meatspace. Code running on hardware is inexorably tied to concepts running in wetware. Of course, at this level of abstraction, the notion applies to any field of human endeavour. Yet, it is so much more true of software. Because software is essentially the thoughts of people being played on repeat.
The oligopoly is not a "love affair", that's how IT works: first-mover advantage, "move fast and break things" (the first of them being interoperability), moats, the brittleness of programming...
The whole startup/unicorns ecosystem exists only because there is the possibility of becoming the dominant player in a field within a few years (or being bought out by one of the big players). This "love affair with oligopoly" is the reason why Ycombinator/HN exists.
It's correct that these are political/economical decisions. But most people in society neither have the knowledge for an informed opinion on such matters, nor a vote.
Centralisation vs decentralisation.
Cost-savings vs localisation of disaster.
It's a swinging pendulum of decisions. And developers know that software/hardware provision is a house of cards. The more levels of dependency, the more fragile the system is.
Absolutely horrible when lives will be lost, but determining the way our global systems are engineered and paid for will always be a moving target based on policy and incentive.
My heart goes out to life and death results of this. There are no perfect tech solutions.
Be aware that enterprise firms actively choose and "assess" who their AV suppliers are, on-premises and in the cloud; this is not imposed by MSFT. Googling around, it does seem that CrowdStrike has a history of kernel panics. Perhaps such interesting things as kernel panics should be part of the compliance checklist.
Googling around, it seems CrowdStrike has a history of causing kernel panics.
Every time there was a mysterious performance problem affecting a random subset of machines, it was Tanium. I know how difficult it is for anyone to just get rid of this type of software, but frankly it has been proven over and over that antivirus is just more attack surface, not less.
I think the enterprise software ecosystem currently is not really "all eggs in one basket", but rather you have a whole bunch of baskets, some of them you are not even aware of, some are full of eggs, some have grenades in them instead, some are buckets instead. All baskets are being constantly bombarded with a barrage of eggs from unknown sources, sometimes the eggs explode for inexplicable reasons. Oh yeah and sometimes the baskets themselves disintegrate all at once for no apparent reason.
The problem is allowing a single vendor, with a reputation of fucking up over and over again, to push code into your production systems at will with no testing on your part.
Right. I thought the "big guys" know better and they have some processes to vet Crowdstrike updates. Maybe even if they don't get its source code, they at least have a separate server that manages the updates, like Microsoft's WSUS.
But no, they are okay with a black box that calls home and they give it kernel access to their machines. What?
Monocultures are known to be points of failure, but people keep going down that path because they optimize for efficiency (heck, most modern economics is premised on the market being efficient).
This problem is pervasive and affects everything from the food supply (planting genetically identical seeds rather than diversified "heirloom" crops) to businesses across the board buying and gutting their competitors, thus reducing consumer choice.
It's a tough problem akin to a multi-armed bandit: exploit a known strategy or "waste" some effort exploring alternatives in the hopes of better returns. The more efficient you are (exploitation), the higher the likelihood of catastrophic failure in weird edge cases.
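If you want to see that trade-off in code form, here's a minimal epsilon-greedy bandit sketch (all payoffs are made-up toy numbers):

    import random

    def epsilon_greedy(arms, pulls=10_000, epsilon=0.1):
        # Mostly exploit the best-known arm; occasionally explore an alternative.
        counts = [0] * len(arms)
        means = [0.0] * len(arms)
        total = 0.0
        for _ in range(pulls):
            if random.random() < epsilon:
                i = random.randrange(len(arms))                    # explore
            else:
                i = max(range(len(arms)), key=lambda k: means[k])  # exploit
            reward = random.gauss(*arms[i])    # (mean, stddev) of this "vendor"
            counts[i] += 1
            means[i] += (reward - means[i]) / counts[i]            # running average
            total += reward
        return total, counts

    # Two hypothetical vendors with slightly different average payoff.
    print(epsilon_greedy([(1.0, 0.1), (0.95, 0.05)]))

The smaller epsilon is, the more efficient you are in the short run, and the longer it takes to notice that the arm you standardised on has gone bad.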
this isn't even the first time something like this has happened. it's literally a running joke in programmer circles that AWS East going down will take down half the internet, and yet there's absolutely zero initiative being taken by anyone who makes these sorts of decisions to maybe not have every major service on the internet be put into the same handful of points of failure. nothing will change, no one will learn anything, and this will happen again.
That’s very different though. That’s avoidable. We all can easily have our services running in different data centers around the world. Heck, the non-amateurs out there all have their services running in different Amazon data centers around the world. So you can get that even from a single provider. Hardware redundancy is just that cheap nowadays.
This CS thing, there’s no way around. You use it and they screw up, you get hit. Period. You don’t failover to another data center in Europe or Asia. You just go down.
Hardware, even cloud hardware, is rarely the issue. Cloud hardware in particular is probably not an issue, because failover is so inexpensive relative to software.
Software is a different issue entirely. How many of us will develop, shadow run, and maintain a parallel service written on a separate OS? My guess is “not many”. That’s the redundancy we’re talking about to avoid something like this. You’d have to be using a different OS and not using CS anywhere in that new software stack. (Though not using CS wouldn’t be much of a problem if the OS is different but I think you see what I mean.)
On Amazon, implementing failover for your hardware is a few clicks. But if you want to implement an identical service with different software, you'd better have a spare dev team somewhere.
AWS East going down will (and has) cause(d) disruption in other regions.
Last time it happened (maybe like 18 months ago), you ran into billing and quota issues, if my memory serves.
AWS is, as any company, centralized in a way or another.
Want to be sure you won't be impacted by AWS East going down, even if you run in another region? Well, better be prepared to run (or have a DRP) on another cloud provider then...
The cost of running your workload on two different CSPs is quite high, especially if your teams have been convinced to use AWS-specific technologies. You need to first make your software stack provider-agnostic and then manage the two platforms in sync from a technical and contract perspective, which is not always easy...
You just made the single point of failure your software stack hardware abstraction layer. There’s a bug in it, you’re down. Everywhere. Not only that, but if there is CS in either your HAL, or your application you’re down. So to get the redundancy the original commenter was talking about, you need to develop 2 different HALs with 2 different applications all using a minimum of 2 different OS and language stacks.
Why multiply your problems? Use your cloud service provider only to access hardware and leave the rest of that alone. That way any cloud provider will do. Any region on any cloud provider will do. You could even just fall back to your own racks if you want. Point is, you only want the hardware.
Now to get that level of redundancy, you would still have to create 2 different implementations of your application on 2 different software and OS stacks. But the hardware layer is now able to run anywhere. Again, you can even have a self hosted rack in your dispatch stack.
So hardware redundancy is easy to do at the level the original commenter recommends. Software redundancy is incredibly difficult and expensive to do at the level the original commenter was talking about. Your idea of a hardware/cloud abstraction layer only multiplies the number of software layers you would need to implement multiple times, shadow run, and maintain to achieve the hypothetical level of redundancy.
> It is risk management 101, never put all your digital eggs in one (or even a few) baskets.
The fact it's widespread is because so many individual organisations individually chose to use CrowdStrike, not because they all got together and decided to crown CrowdStrike as king, surely?
I agree with you in principle, but the only solution I can think of would be to split up a company with reach like CrowdStrike's. The consequences of having to do that are up for debate.
It's never that simple. There is a strong herd mentality in the business space.
Just yesterday I was in a presentation from the risk department, and they described the motives for choosing a specific security product as `safe choice, because a lot of other companies use it in our space, so the regulator can't complain`... the whole decision structure boiled down to: `I don't want to do extra work to check the other options, we go with whatever the herd chooses`. It's terrifying to hear this...
The whole point of software like this is a regulatory box-ticking exercise; no-one wants it to actually do anything except satisfy the regulator. Crowdstrike had less overhead and (until now) fewer outages than its competitors, and the regulators were willing to tick the box, so of course people picked them. There are bad cases of people following the herd when there are other solutions with actually better functionality, but this isn't that.
OTOH... I remember an O365 outage in London a few years ago.
You're down? Great, so are your competitors, your customers, and your suppliers. Head to the pub. Actually, you'll probably get more real value there, as your competitors, customers and suppliers are at that same pub. Insurance multinationals have been founded from less.
That didn't affect any OT though, so it was more just proof that 90% of work carried out via O365 adds no real value. Knowing where the planes are probably is important.
> You're down? Great, so are your competitors, your customers, and your suppliers. Head to the pub. Actually, you'll probably get more real value there, as your competitors, customers and suppliers are at that same pub. Insurance multinationals have been founded from less.
I mean yeah, that's the other thing - the Keynesian sound banker aspect. But that's more for software that you're intentionally using for your business processes. I don't think anyone was thinking about Crowdstrike being down in the first place, unless they were worried about an outage of the webpage that lists all the security certifications they have.
You say that as if it's some bad thing, but it's just other words for "use boring tech".
Yes, there could be reasons to choose a lesser-known product, but they better be really good reasons.
Because there are multiple general reasons in the other direction, and incidents like this are actually one of those reasons: they could happen with any product, but now you have a bigger community sharing heads-ups and workarounds, and vendor's incident response might also be better when the whole world is on fire, not only a couple of companies.
It's not just Crowdstrike, it's all up and down the software and hardware supply chain.
It's that so many people are on Azure - which is a de facto monopoly for people using the Microsoft stack - which is a de facto monopoly for people using .NET
And if they're doing that, the clients are on Windows as well, and probably also running Crowdstrike. The AD servers that you need to get around Bitlocker to automatically restore a machine are on Azure, running Windows, running Crowdstrike. The VM image storage? Same. This is basically a "rebuild the world from scratch" exercise to some greater or lesser degree. I hope some of the admins have non-windows machines.
How come AWS sometimes has even better tooling for .NET than Azure, while JetBrains offers a better IDE on Linux, macOS and, depending on your taste, Windows than Microsoft does? Or, for some reason, the most popular deployment target is just a container that is vendor-agnostic? Surely I must be missing something that you see.
All of that is absolutely true and in no way affects the behavior at hand. Big companies go with whoever sells them the best, not any kind of actual technical evaluation.
Perhaps the organisations have a similar security posture. And that creates a market that will eventually result in a few large providers who have the resources to service larger corporations. You see something similar in VPN software where Fortinet and Palo become the linchpin of security. The deeper question is to wonder at the soundness of the security posture itself.
There's a strong drive for everyone to do things the same way in IT. Some of the same pressure that drives us towards open standards can also drive us towards using a standard vendor.
> I agree with you in principle, but the only solution I can think of would be to split up a company with reach like CrowdStrike's.
Changing corporate structures doesn't necessarily help. It's possible that if CrowdStrike were split up into smaller companies, all the customers would go to the one with the "better" product and we'd be in a similar position.
Well, if they'd used a different vendor (or nothing) on the DR servers, we could have done a failover and gotten on with our day. But alas, nobody saw an app that can download data from the internet and update itself arbitrarily, whenever it wants, without user intervention, as a problem.
They choose because others have. "Look how many others choose us" is a common marketing cry. Perhaps instead "too popular" is a reason not to choose? Perhaps not parroting your competitors and industry is a reason not to choose?
When it comes to security products, the size of the customer base matters. More customers means more telemetry. More telemetry means better awareness of IOCs, better training sets to determine what's good and what's bad.
I wonder how many of those orgs were "independently" audited by security firms that made passing the audit without CrowdStrike specifically a hell.
Most of the crap security I've met in big organisations was driven by checklist audits and compliance audits by a few "security" firms. Either you did it the dumb way, or good luck fighting your org and their org to pass the audit.
Setting aside the utter fecklessness if not outright perniciousness of cybersecurity products such as this, I hope this incident (re-)triggers a discussion of our increasing dependence on computing technology in our lives, its utter inescapability, and our ever-growing inability to function without it in modern society.
Not everything needs to be done through a computer, and we are seeing the effects now of organizing our systems such that the only way to interface with them is through a digital device or a smartphone, with no alternative. Such are the consequences of moving everything "into the cloud" and onto digital devices as a result of easy monetary policy and the concomitant digital gold rush where everyone and their dog scrambled to turn everything into a smartphone app.
This past week I purchased a thermostat. There were "high-end" touch-only models, app-assisted models that also had analog controls, and then finally old-school analog only. I went with the middle/combo so that I have analog as a fallback if the pure tech mode fails.
Being prepared can cost more and/or be less flashy (read: I didn't get touch-only), but it buys peace of mind, at least for critical components. I want a thermostat that works; I don't get no satisfaction from any bragging rights. Nod to the Rolling Stones.
I literally dealt with this just a few hours ago. I need a new HVAC system. I wanted the high-end model, but it will only work with their fancy cloud-connected thermostat. You cannot replace it with an off-the-shelf thermostat.
Have home automation? Sorry, you'll have to use the Internet.
I vote with my dollars, so it cost them the higher-margin sale. I also went with the mid-tier system, and grabbed a Z-Wave compatible thermostat along with it. I wonder if I'll miss the nifty variable-speed system?
I really wish everyone would stop trying to trap us into their walled gardens. Apple at least lets people write software for theirs, but the hardware/appliance manufacturers (not to mention the automotive folks) are awful about this.
> The details (the particular companies / systems etc) of this global incident don't really matter.
It definitely matters. The main issue here is that CrowdStrike was able to push an update to every server around the world where their agent is installed ... it looks like an enormous botnet ...
We need a detailed post-mortem on what happened here.
The other aspect of risk management is an acceptance that something going wrong isn't necessarily a reason to change what you are doing. If the plan was tacitly to run something at a 99% uptime, then incidents causing 1% downtime can be ignored.
We are going to get hit by some terrible outage eventually (I hope someone is tracking things like what happens if a big war breaks out and the GPS constellations all go down together). But having 10x providers won't help against the big IT-related threats which are things like grid outages and suchlike having cascading effects into food supplies.
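For context, the downtime budgets behind those uptime numbers are bigger than people tend to assume (quick back-of-the-envelope, nothing more):

    # Downtime allowed per year by common availability targets.
    for target in (0.99, 0.999, 0.9999):
        hours = (1 - target) * 365 * 24
        print(f"{target:.2%} uptime allows ~{hours:.1f} hours of downtime per year")
    # 99.00% -> ~87.6 h, 99.90% -> ~8.8 h, 99.99% -> ~0.9 h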
> there should be at least ten alternatives in any diversified set of critical infrastructure service providers, all of them instantly replaceable / forced to provide interoperability...
And does anyone actually know how to implement this, at the scale required (dealing with billions of transactions daily), in a way that would resolve the problems we are seeing?
It very much seems like a data access problem; places can't access/modify data. The physical disks themselves are most likely fine, but the 'interfaces' are having troubles (assuming that the data isn't stored on the devices having the issue).
But in any case, how do you design a system where, if the 'main' interface is troubled, you can switch over, instantly, seamlessly, duplicating access controls, permissions, data validation, logic, etc.?
There is a reason everything is centralised: it makes no financial sense to duplicate for an extremely unlikely and rare event. The world is random and these things will happen, but a global outage on this scale is not a daily occurrence.
We'll look back in a few years and think "those were a crazy few hours" and move on...
> The details (the particular companies / systems etc) of this global incident don't really matter.
But they do matter. This is elementary. It's like saying "playing with matches doesn't matter". This is a problem that has happened before, albeit on a smaller scale, and the solution/cure is well known; imho it should have been established two decades ago in every org on the planet.
This is basic COBIT (or BYOFramework) stuff from 10-15-20 years ago.
How can you push a patch/update without testing it first? I get it if you are a tiny company with 1 IT person and 20 local PCs. Stuff like that cripples you for a couple of days. But when you are an org with 10k+ laptops, 500+ servers (half of them MS Win), how can you NOT test each and every update?
If you don't want to have the test/staging environments, then at least wait 1-3-5 days to see what the updates will do to others/the news.
Sorry not sorry guys and gals. I've been auditing systems and procedures for so many years, that this is a basic failure. "One cannot just push an update without testing it first" any update, no matter how small/innocent.
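For what it's worth, the kind of gate I'd expect even a small shop to run looks roughly like this (ring names, fractions, and the health check are all hypothetical; the point is just that nothing reaches "everyone" without surviving the earlier rings):

    import time

    # Hypothetical rollout rings: start tiny, widen only while the health signal holds.
    RINGS = [
        ("canary",   0.001),   # a handful of internal machines
        ("early",    0.02),    # opted-in customer fleet
        ("broad",    0.25),
        ("everyone", 1.00),
    ]

    def ring_is_healthy(ring: str) -> bool:
        # Placeholder: in reality, look at crash / boot-loop telemetry from this ring.
        return True

    def roll_out(update_id: str) -> bool:
        for ring, fraction in RINGS:
            print(f"pushing {update_id} to {ring} ({fraction:.1%} of fleet)")
            time.sleep(0)  # stand-in for hours or days of soak time
            if not ring_is_healthy(ring):
                print(f"halting {update_id}: {ring} looks unhealthy")
                return False   # nothing past this ring ever sees the update
        return True

    roll_out("hypothetical-content-update")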
> So I am not convinced that there need to be "at least ten alternatives" to be fail safe as society.
The required number "N for safety" is a good discussion to have. Risk-Return, Cost-Benefit etc are essential considerations. We live in the real world with finite resources and stark choices. But I would argue (without trying to be facetious) that they are risk management 102 type considerations.
Why? Because they must rely on the pretense of knowledge [1]. As digitization keeps expanding to encompass basically everything we do, the system becomes exceedingly complex, nobody has a good picture of all internal or external vulnerabilities and how much they might cascade inside an interconnected system.
Assessing cost versus benefit implies one can reasonably quantify all sides of the equation. In the absence of a demonstrably valid model of the "system" the prudent thing is to favor detail-agnostic rules of thumb. If these rules suggest that reducing unsafe levels of concentration is not economically viable there must be something wrong with the conceptual business model of digitization as it is now pursued.
Or perhaps it's just because companies release features, planes, devices, etc. without any form of QA, aiming just to increase their profits?
In this case, has CS done any QA on this release? Have they tested it for months on all the variations of the devices that they claim to support? It seems not.
Considering CS Falcon causes your performance to drop by about half and does the same to your battery life, I doubt they have any sort of QA that cares about anything but hitting stakeholder goals.
Yet, catastrophic failures like this happen, and people move on. Sure, there is that one guy who spent 10 years building a 10-fold redundancy plan, and his service didn't go down when the whole planet went down, but do people really care?
Unless his systems are up but critically dependent on other external systems (payment services, bucket storage, auth etc...) that are down. It's becoming increasingly difficult to not have those dependencies.
While this is a great theory, how would you actually accomplish this with antivirus software?
Multiple machines, each one using different vendor software? What other software needs to be partitioned this way? What about combinations of this software?
I’m just barely awake but don’t know if I’m affected yet. One of my devs is, our client support staff is, and I have no idea how our servers are doing just yet.
> It is risk management 101, never put all your digital eggs in one (or even a few) baskets.
I mean, plenty of businesses only have penguin eggs in their basket, and some sort of penguin problem would cause major problems for them. I believe that last time this happened was with the leap second thing around 2005 or thereabouts.
"Don't put all your eggs in one basket" sounds nice, but it would mean a completely different independent service all through your stack. That's not really realistic, IMHO.
The bigger issues here are that: 1) some driver update "just" gets pushed (or how does this work?), and 2) there's no easy way to say "this is broken, restore the last version". That is even something that could be automatic.
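The automatic part isn't exotic either; here's a crude sketch of a boot-attempt counter in the spirit of A/B boot slots (all paths and thresholds are hypothetical):

    import json, pathlib

    STATE = pathlib.Path("/var/lib/boot-guard/state.json")   # hypothetical state file
    MAX_FAILED_BOOTS = 3

    def on_early_boot() -> str:
        # Runs before the questionable update is loaded; returns which slot to boot.
        state = json.loads(STATE.read_text()) if STATE.exists() else {"attempts": 0, "slot": "new"}
        state["attempts"] += 1
        if state["slot"] == "new" and state["attempts"] > MAX_FAILED_BOOTS:
            state = {"attempts": 0, "slot": "previous"}       # give up on the update
        STATE.write_text(json.dumps(state))
        return state["slot"]

    def on_successful_boot() -> None:
        # Called once userspace is demonstrably healthy; the current slot is good.
        STATE.write_text(json.dumps({"attempts": 0, "slot": "new"}))

If the machine never reaches "demonstrably healthy", the counter runs out and the previous version gets booted instead of looping forever.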
This isn't some global conspiracy, it's just incentives and economies of scale. When it's cheaper to pay a hyperexpert to handle your security, why wouldn't you?
The fact that physical distance is no longer a limit to who you do business with means that you can select the cheapest vendor globally, but then that vendor has an incentive to hyperspecialize (because everyone goes to them for this one thing), which means that even more people go to them.
Avoiding once-in-a-century events just isn't something we're willing to pay the extra cost for, except now we have around twenty places where these once-in-a-century events can happen, which kind of makes them more frequent.
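Rough arithmetic on that last point (toy numbers, and assuming independence, which is itself generous):

    # Twenty independent "once-in-a-century" choke points.
    p_one = 0.01                      # per-year failure chance of a single provider
    p_any = 1 - (1 - p_one) ** 20     # chance at least one of them fails this year
    print(f"~{p_any:.0%} per year")   # ~18%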
How much stuff do you host on Hetzner instead of AWS?
Now they know the state of each of the affected companies' systems: how adept their sysops people are, a bird's-eye view of their security practices. Nice move, and plausibly deniable too :D.
I mean, how did this happen at all? Are there no checks in place @ CrowdStrike? Like deploying the new update to selected machines, checking whether everything is OK, and then releasing it to the wild incrementally?
> When the entire society and economy are being digitized AND that digitisation is controlled and passes through a handful of choke points its an invitation to major disaster.
Once again, it's Microsoft, directly, or indirectly, choosing a strategy of eventually getting all worldwide Windows desktops online, and connected via their systems.
Which is why I installed Fedora after Windows 7 and never looked back. 100% local, 100% offline if needed.
My company is looking to a non-Microsoft desktop. We're not affected by this, but it will certainly encourage us to move sooner rather than later.
Society was able to move to mass WFH on a global scale in a single month during Covid, thanks to the highly centralized and efficient cloud infrastructure. That could have easily saved tens of millions of lives (Imagine the Spanish flu with mass air travel, no vaccines, no strain-weakening)
These small 'downages' basically never cause serious issue. Your solutions are just alarmist and extremely costly (though they will provide developer employment...).
> These small 'downages' basically never cause serious issue.
Hospitals, airlines, 911, grocery stores, electric companies, gas companies, all offline. There will be more than a few people dead as an indirect result of this outage, depending on how long it lasts.
> These small 'downages' basically never cause serious issue.
Emergency Departments and 911 were knocked offline. People will indirectly die because of this, just like the last time 911 went down, and just like the last time EDs went down.
If CrowdStrike can cause this with a faulty update (allegedly), what do you think could happen to Western infrastructure from a full blown cyberwar? It's a valid risk.
> Society was able to move to mass WFH on a global scale in a single month during Covid
I don't know how much WFH saved lives, seeing as ordered isolation and social distancing was a thing during the Spanish Flu too (you just take the economic hit). But yes it allowed companies to keep maintaining profits. Those that couldn't WFH got paid in most countries anyway (furlough in England, etc).
True, but incentives should be in place to encourage a more diverse array of products. At the moment, with many solutions (especially security), it is a choice between the one popular known product (Okta, CrowdStrike, et al., $$$) and bespoke ($$$$$$$$$$).
If only because we can then move away from one-size-fits-all, while mitigating the short-term impact of events like the above.
I just landed at SeaTac an hour ago and the rideshare/app pickup was absolutely nutso. Like thousands of people standing around waiting for taxis and Ubers. The one person I asked what was going on said that the computer systems at all the regional hotels are down (not sure how that makes more people need cabs). Wonder if it’s from this
Just realized this is posted on the SeaTac website now: “ SEA is experiencing temporary issues with the system that populates flight and baggage information on in terminal screens and the flySEA app/website. Travelers are recommended to check with their airlines for current gate and baggage claim information. Check With Your Airlines”
> just landed at SeaTac an hour ago and the rideshare/app pickup was absolutely nutso. Like thousands of people standing around waiting for taxis and Ubers.
For years now antivirus solutions have ridiculous amount of control over the OS. I accidentally installed an adware antivirus the other day that was bundled-up with a third party software, and I had to boot to Linux to manage to completely remove the damn thing from Windows. The uninstall option left a process running that couldn’t be forcefully killed.
Microsoft needs to take control and forbid anyone and anything from running software with that kind of behavior.
I find it impossible to believe that Azure as a whole organisation takes security seriously. There might be individuals that do, but definitely nobody with decision making power. Half of the above described exploits are trivial and should have never passed any sort of competent review process.
Have spent all my afternoon and all evening on a bridge trying to support flailing systems. Was supposed to be on a plane in 5 hours to start my vacation. Guaranteed it's not gonna happen.
With hearing 911 and other safety critical systems going down, I hope that the worst that comes out of this is a couple delayed flights and a couple missed bank payments.
Yet their stock tanked only a couple of dollars. They (and their customers) should face some rather unpleasant lawsuits. If you let others own your systems, you should not be allowed to provide critical infrastructure.
I don't think you're going to see as many lawsuits are you think. Most of these contracts probably state that they had to follow reasonable precautions for business continuity and data recovery. Having Crowdstrike in the path seems to have been a reasonable and potentially best practice before today's outage.
I don't think that companies are going to be held liable at all.
Eh. I think you're underestimating how overmatched these IT depts are when it comes to cybersecurity.
Either sign a contract with a best-in-class (even if in name only) vendor who says that they'll do all of this for us or we need to become "experts" in cybersecurity and potentially still use them.
The CIO is overmatched here so they're making the decision that protects them and their clients in _almost all_ cases.
Once they are taken to court and all their crap gets subpoena'd I think we might find that reasonable precautions were not taken.
It's possible that this update was never properly QA'd and was just rushed out the door. If that's the case then it could be found to be negligence, and no amount of legal jargon protects you from negligence. It could be the end of CrowdStrike. /end fud.
They didn’t force anyone to use their software in critical infrastructure. The customers deploying the software as part of critical infrastructure should take the necessary precautions or insist on a contractual agreement that makes the vendor liable for any causally related failures of the critical infrastructure. The mistake is that so much software is being put into use without any substantial liability. Doing so would also make software much, much more expensive.
unqualified people on the internet shouldn't give legal advice, but since we're doing it anyway: no, this is definitely not true and if you make assurances about the fitness of the good you are on the hook when it fails, even if it's some absurdly improbable "I had no way of knowing that our pencils intended for schoolchildren would be used on a spacecraft" situation.
There is a reason you will see a ton of warranties and terms of service/EULA specifically forbid or disclaim the use in life-critical situations, in which case you are safe, because you said don't do it. But if you don't, you generally are going to be liable.
Sadly there is a reason the chainsaws say "do not stop chain with genitals". Not only did someone probably do that, but the damages stuck.
For example, I was talking about the CUDA license yesterday, and of course one of the clauses is:
> You acknowledge that the SDK as delivered is not tested or certified by NVIDIA for use in connection with the design, construction, maintenance, and/or operation of any system where the use or failure of such system could result in a situation that threatens the safety of human life or results in catastrophic damages (each, a “Critical Application”). Examples of Critical Applications include use in avionics, navigation, autonomous vehicle applications, ai solutions for automotive products, military, medical, life support or other life critical applications. NVIDIA shall not be liable to you or any third party, in whole or in part, for any claims or damages arising from such uses. You are solely responsible for ensuring that any product or service developed with the SDK as a whole includes sufficient features to comply with all applicable legal and regulatory standards and requirements.
Why is this here? because they'd be liable otherwise, and more generally they want to be on the record as saying "hey idiot don't use this in a life-critical system".
There might well be a clause like that in crowdstrike's license too, of course. But the problem is it's generally different when what you are providing is a mission-critical safety/security system... hard to duck responsibility for being in critical places when you are actively trying to position yourself in critical places.
>Sadly there is a reason the chainsaws say "do not stop chain with genitals". Not only did someone probably do that, but the damages stuck.
This is very dependent on your jurisdiction. The USA's laws leave a lot more room for litigating in a way which I would deem frivolous than those of Canada. If you sell a chainsaw with safety features that adhere to common standards you should reasonably expect people not to try to stop it with their ballsack and a court of law that holds the manufacturer liable for moronic use of the object is a poorly designed court.
I think you have it flipped. Any clause in a contract is a negotiation. Warranties and insurance coverage are part of it.
Any smart CIO would have said: I'll take what you sell, but if you fail I can come back and haunt you, and you are going to give me an endorsement for your product insurance, and I'll require upping your coverage + notifications that you are up to date with your insurance policy that has 50XXX M in coverage, minimum.
If the software is sitting on top of your business's core IT, you must protect the business in its entirety by demanding a proportional shield, and using the IT vendor's own insurance shield as if it were your own. And demanding more coverage if the shield is too small. Then once those elements are in place, you are protected. It's as simple as that.
I mean, seriously: You can cause a worldwide outage of gargantuan proportions, affecting actual human lives and untold points off GDP in several countries ...
... and the market gives you the equivalent of a wrist slap? No lawsuits?
What is it down to? Anonymous hit markets? Where is justice going to be served here?
PS. Most outlets are reporting "a fix has been issued" - as in "Whoopsy. No biggie" ...
... I mean, who's going to make affected (still alive!) people whole?
Naive question, if it’s a blue screen of death with a boot loop, how are they going to restore things? Don’t tell me the answer is going to every system manually.
Well, it seems that Windows is not yet accessible remotely when it crashes.
If the system administrator had too much free time and configured every system to try network boot first, and there is no disk encryption, it is possible to boot from a minimal Linux image with a script that automatically renames the driver and restarts.
The corporate version of the same approach uses Intel AMT (or however else it is called), but it is only available on licensed hardware from big suppliers.
Otherwise, you can distribute flash drives with the same auto-executing fix to everyone who is able to enter firmware setup, and boot from USB. If it's not available for security reasons, more manual work is required.
But what happens next? If Crowdstrike handled all the security measures, and there were no additional firewall rules, address checks, and so on, your network is now as open as it can be. I suppose certain groups have been celebrating, and uploading gigabytes of data from networks whose detection systems have been severed.
Lots of systems (not all) are able to reboot, and have CrowdStrike download the fix before the bad code is able to crash things. But otherwise, yes, you have to go to systems manually.
It's kind of surprising so much infra was using Windows servers or Windows cloud VMs for these things. I assumed these systems would all be Linux VMs in Azure/AWS/GCP at this point.
> We have been made aware of an issue impacting Virtual Machines running Windows Client and Windows Server, running the CrowdStrike Falcon agent, which may encounter a bug check (BSOD) and get stuck in a restarting state.
I am not sure in which one of his talks he briefly mentioned that one of his concerns is that we are basically building a digital Alexandria library, and if it burns, well ...
Even more devastating events like this will happen in the future.
We stand on the shoulders of giants and yet we learned nothing.
It was Windows in this case but nothing is stopping it from happening with any other widely used system that gets online updates. CrowdStrike has root on Linux/MacOS as well after all.
The problem is relying on networked computers for critical infrastructure with no contingency plan. This sort of thing will happen whether because of a bug or because of ransomware. The software and hardware industries are incapable of producing reliable and safe products in our economic system.
Important services such as hospitals, groceries, water treatment plants, and electric grids should be able to operate in offline mode when this sort of thing inevitably happens.
I was listening to Triple J (one of ABC's radio stations), they said: "welcome to our first and possibly last ever Triple J's USB Fridays, we can't play any of our usual music because the computers are all down, all we can play is the songs that happen to be on the USB stick that one of us had in our pocket". LOL!
What I'm curious about: other than checkbox compliance, how does Crowdstrike convince companies to buy their product? Do they present evidence that their product is effective at protecting customers? Because certainly Crowdstrike customers still get hacked.
I've watched it occur countless times. Often times the people making the purchase decision are largely incompetent.
They usually come out and take your team to a nice lunch. Then they run you through a fancy slide deck and convince you to let them run some scaremongering reporting tool over your infra. By the end of the day, most of your leadership is convinced they need the solution.
Rinse and repeat hundreds of times and you have the 3rd party vendor hodgepodge hellscape that constitutes most large corporations' IT infrastructure.
I would imagine that their best weapon is that so many other big organizations are using CS, so choosing CS gives the decision maker the best shield from responsibility, similar to "nobody gets fired for choosing IBM".
Of course, how they got started when they were small was a completely different story.
They all should have used some expensive corporate-and-government-level product that promises protection against exactly that kind of large scale attack on infrastructure.
When I saw 'Global IT Outage' trending I assumed it was another major cloud service failure. Obviously this has far wider impact because of the need for intervention on individual endpoints.
The irony is dawning on me that for much of the recent computing era we've developed defenses against massive endpoint outages (worms, etc.) and one of them is now inadvertently reproducing the exact problem we had mostly eradicated.
CrowdStrike offers a temporary solution for crashed systems
CrowdStrike has given users a potential way to fix their systems.
Boot Windows into Safe Mode or the Windows Recovery Environment (you can do that by holding down the F8 key before the Windows logo flashes on screen)
Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
Locate the file matching “C-00000291.sys” file, right click and rename it to “C-00000291.renamed”
Boot the host normally.
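If you have to do this at scale from a rescue environment (say, a Linux live image with the Windows volume mounted, as described elsewhere in the thread), the rename itself is trivial to script. A sketch, assuming the hypothetical mount point below and that the volume isn't BitLocker-encrypted:

    import pathlib

    MOUNT = pathlib.Path("/mnt/windows")   # hypothetical mount point of the Windows volume
    DRIVER_DIR = MOUNT / "Windows/System32/drivers/CrowdStrike"

    for f in DRIVER_DIR.glob("C-00000291*.sys"):
        target = f.with_suffix(".renamed")
        print(f"renaming {f} -> {target}")
        f.rename(target)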
Yup. Well in our case (and we are thankfully not affected), they could call IT support. But then again, if IT support themselves cannot boot their PCs...
Microsoft are going to be pissed that this is widely being discussed as a Microsoft outage. Do AV vendors like Crowdstrike need a license or something from Microsoft to push these kernel-driver-based things? Or is it just like anyone can make one?
It seems like this would indirectly tell us which systems use CrowdStrike. Could that in and of itself be information that could help an attacker? I know the security team at work is adamant about not leaking details of our systems.
In terms of analysing risk factors to minimise something like this happening again, what are the factors at play here?
A Crowdstrike update being able to blue-screen Windows Desktops and Servers.
Whilst Crowdstrike are going to cop a potentially existential-threatening amount of blame, an application shouldn't be able to do this kind of damage to an operating system. This makes me think that, maybe, Crowdstrike were unlucky enough to have accidentally discovered a bug that affects multiple versions of Windows (i.e. it's a Windows bug, maybe more so than it is a Crowdstrike bug).
There also seems to have been a ball dropped with regard to auto-updating all the things. Yes, you've got to keep your infrastructure up to date to prevent security incidents, but is this done in test environments before it's put into production?
Un-audited dependence on an increasingly long chain of third-parties.
All the answers are difficult, time consuming, and therefore expensive, and are only useful in times like now. And if everyone else is down, then there's safety in the crowd. Just point at "them too", and stay the path. This isn't a profitable differentiation. But it should be! (raised fists towards the sky).
> Whilst Crowdstrike are going to cop a potentially existential-threatening amount of blame, an application shouldn't be able to do this kind of damage to an operating system.
It doesn't operate in user space, they install a kernel driver.
It's a design decision. People want the antivirus to protect them even if an attacker exploits a local privilege escalation vulnerability or if an attacker that compromised an admin account (which happens all the time in Windows environments) wants to load malicious software. That's kind of the point of these things. Somebody exploits a memory vulnerability of one of the hundreds of services on a system, the antivirus is supposed to prevent that, and to their benefit, Crowdstrike is very good at this. If it didn't run in the kernel, an attacker with root can deactivate the antivirus. Since it's a kernel module, the attacker needs to load a signed kernel module, which is much harder to achieve.
Presumably Crowdstrikes driver also has the ELAM flag which guarantees it will be loaded before any other third party drivers, so even if a malicious driver is already installed they have the opportunity to preempt it at boot.
If we are being pedantic then an ELAM driver can't be guaranteed to load before another ELAM driver of course, but only a small list of vetted vendors are able to sign ELAM drivers so it is very unlikely that malware would be able to gain that privilege. That's the whole point.
Yep. We can't migrate our workstations to Ubuntu 24.04 because CrowdStrike's Falcon kernel modules don't support the kernel version yet. Presumably they wanted to move to eBPF, but I'm guessing that hasn't happened yet. Also: I can't find the source code of those kernel modules - they likely use GPL-only symbols; wouldn't that be a GPL violation?
I was given to understand that Crowdstrike provided some protection from unvetted export of data. I'm not sure that data would be useful without the rare domain expertise to use it, but I wasn't shown the risk analysis. And then someone else demands and gets ssh access to GitHub. Sigh.
I think "compliance" would be a better word to use that "safety" when it comes to a lot of "security" software on computers.
And I bring up the distinction because while compliance is "sometimes" about safety, it's also very often about KPIs of particular individuals or due to imaginary liability for having not researched every possible "compliance" checkbox conceivable and making sure it's been checked.
Some computer security software is completely out of hand because its primary purpose is to have the appearance of effectiveness for the exec whose job is to tick off as many safety checkboxes as they can find, as opposed to being actually pragmatically effective.
If the same methodologies were applied to car safety, cars would be so weighed down by safety features, that they wouldn't be able to go faster than 40km/h.
They mean distributing Linux + the module together. Like e.g. shipping the Nvidia kernel module alone is fine, but shipping a Linux distro with that module preinstalled is not fine.
Two different "it". As an analogy: selling pizza Hawaii is dicey, but you can sell pineapple slices and customers can add those to their pizza themselves.
Last time I dealt with HP, I had to use their fakeraid proprietary kernel module which "tainted" the kernel. Of course they never open-sourced it. I guess it's not necessary.
GPL exported symbols are the ones that are thought to be so tightly coupled to the kernel implementation that if you are using them, you are writing a derivative work of the kernel.
Yeah, that was also my understanding, and I can't imagine an AV module able to intercept filesystem operations and syscalls using only non-core symbols. But of course you never know without decompiling the module.
Are they? Apple has pretty much banned kernel drivers (kexts) in macOS on Apple Silicon. When they were still used, they were a common cause of crashes and instability, not to mention potential gaping security holes.
Most things that third-party kernel drivers used to do (device drivers, file systems, etc) are now done just as well, and much more safely, in userspace. I'm surprised if Microsoft isn't heading in this direction too?
Presumably, Crowdstrike runs on macOS without a kernel extension?
> Presumably, Crowdstrike runs on macOS without a kernel extension?
That's correct: CrowdStrike now only installs an "Endpoint Security" system extension and a "Network" system extension on macOS, but no kernel extension anymore.
One would hope that Crowdstrike does a similar thing on Linux and relies on fanotify and/or ebpf instead of using a kernel module. The other upside to this would be not having to wait for Crowdstrike to be constantly updating their code for newer kernels.
I believe so but would like better details. We used to use another provider that depended on exact kernel versions whereas the falcon-sensor seems quite happy with kernel updates.
Whatever protection is implemented in user-land can be removed from user-land too. This is why most EDR vendors are now gradually relying on kernel based mechanisms rather than doing stuff like injecting their DLL in a process, hooking syscalls, etc...
First, we were talking about EDR in Windows usermode.
Second, still, that doesn't change anything. You can make your malware jmp to anywhere so that the syscall actually comes from an authorized page.
In fact, in windows environment, this is actively done ("indirect syscalls"), because indeed, having a random executable directly calling syscalls is a clear indicator that something is malicious. So they take a detour and have a legitimate piece of code (in ntdll) do the syscall for them.
The original Windows NT had a microkernel architecture, where a driver/server could not crash the OS. So no, Crowdstrike didn't really have an option, but Microsoft did.
As PC got faster, Microsoft could have returned to the microkernel architecture, or at least focused on isolating drivers better.
They've done it to a degree but only for graphics drivers, Windows is (AFAIK) unique amongst the major OSes in that it can nearly always recover from a GPU driver or hardware crash without having to reboot. It makes sense that they would focus on that since graphics drivers are by far the most complex ones on most systems and there are only 3 vendors to coordinate API changes with, but it would be nice if they broadened it to other drivers over time.
NT was never a true microkernel. Most drivers are loaded into the kernel. Display drivers being a huge pain point, subsequently rolled back to user space in 2000, and printer drivers being the next pain point, but primarily with security -- hence moving to a Microsoft-supplied universal print driver, finally in Windows 11.
There's a grey area between "kernel drivers are required for crowdstrike" and "windows is not modular enough to expose necessary functionality to userspace". It could be solved differently given enough motivation.
My experience working with Crowdstrike was that they were super arrogant about these risks. I was working on a ~50k enterprise rollout, and our CS guy was very belligerent about how long we were taking to do it, how much testing we wanted to do, the way that we were staggering roll outs and managing rollback plans. He didn’t think any of this was necessary, that we should roll it out in one fell swoop, have everything to auto-update all the time, and constantly yapped about how many bigger enterprises than ours completed their rollouts in just a couple of weeks.
He actually threatened to fire us as a client because he claimed he didn’t want the CS brand associated with an org that wasn’t “fully protected” by CS. By far the worst vendor contact I’ve ever had. I’ve had nicer meetings with Oracle lawyers than I was having with this guy. I hope this sort of thing humbles them a little.
I was just a contractor there, and don’t work with them at the moment. But I’m a customer of theirs and they’re definitely having an outage right now, so I’m guessing it’s all still in place.
I don’t work there any more. But they were having an outage, so I’m guessing they never got fired as a client (guessing that they’re still using Crowdstrike) and could still take that offer (of being fired as a client) if they wanted to.
> I hope this sort of thing humbles them a little.
What I hope is that they cease to exist as a product and as a company. They have caused inconvenience and economic damage on a global scale, and probably also loss of life, given that many hospitals and ER units had outages. It has been proven that their whole way of working is wrong, from the very foundation to the top.
He was pretty senior for his role, but really I have no idea whether he was representative of the wider company culture.
We had a buggy client release during the rollout which consumed all the CPU in one of our test environments (something he assured us could never happen), and he calmed down a bit after that. Prior to that though he was doing stuff like finding our CISO on LinkedIn to let him know how worried he was about our rollout pace, and that without CS protection a major breach could be imminent.
At the end of the day, if you give an application a deep set of permissions, that's on you as an administrator, not the OS. This unchecked global rollout appears to just be a violation of every good software engineering practice we know.
Administrators are to blame because management (and a lot of 'cybersecurity policies') demand there's a virus scanner on the machines?
While virus scanners might pick up some threats not addressed by OS updates yet every one of them I've seen is a rootkit in disguise wanting full system privileges. There are numerous incidents with security holes and crashes caused by these security products. They also aren't that clever: repeatedly scanning the same files 'on access' over and over again wasting CPU and IO is not going to give you any extra security.
I often watch Crowdstrike thrash my laptop's resources, making compiles slow. Cybersecurity won't let me disable it either, so I just set it to a lower process priority.
As someone who worked for a company that's a Crowdstrike partner, I assure you that Crowdstrike does not sell to administrators. It is very much a product sold to management and company auditors.
Where you're correct is that it's on the administrators to rollout the updates, but I'm not sure that's how Crowdstrike works. It's a managed solution and updates are done for you, maybe that can be disabled, but I honestly don't know.
CS is not sold to sysadmins or technical types. It's sold to management as risk reduction.
The whole point is that if you are technical, you are so untrusted that management is willing to require circumvention of known good practices and force installation of this software against technical advice.
I have worked in Finance for 25 years, and the amount of pressure I had to withstand from Auditing on "Why do we have a 20-day window on applying most updates as we get them from suppliers? We are not best practice!" is gruelling.
These people report to the Board Chairman, don't understand any real implication of their work, and believe the world is a simplistic Red - Amber - Green grid.
I understand most CIOs / CTOs / CISOs in Corporate would buckle.
It's actually worse than phone updates. Ever looked at your phone and noticed it hasn't updated to the new OS despite it having been out for a few days already? This is why.
Wading out of my depth here, so forgive any stupidity following.
And there's a certain amount of sense to that, it has to get "under" the layer that viruses can typically get to, but I still think there should be another layer at which the OS is protected from misbehaving anti-virus software (which has been known to happen).
You're talking about how things are; the comment you're replying to is talking about how things could be. There's not a contradiction there.
Originally, x86 processors had 4 levels of hardware protection, from ring 0 up to ring 3 (if I remember right). The idea was indeed that non-OS drivers could operate at the intermediate levels. But no one used them and they're effectively abandoned now. (There's "ring -1" now for hypervisors and maybe other stuff, but that's beside the point.)
Whether those x86 were really suitable or not is not exactly important. The point is, it's possible to imagine a world where device drivers could have less than 100% permissions.
The problem I have with this is that anti-virus software has never felt like the most reliable, well-written, trustworthy software that's deserving of its place in Ring 0.
I understand I'm yelling into the storm here, because anti-virus also requires that level of system access due to the nature of what it's trying to detect. But then again, does it only need Ring 0 access for the worst of the worst? Can it run 99% of the time in Ring 1, or user space, and only invoke its Ring 0 privileges for regular but infrequent scans, or if it detects something else may be 'off'?
Default Ring 0? Earn it.
This turns into a "what's your threat model" discussion.
Crowdstrike is basically corporate malware - the failure is in large part with security dept deciders who signed off on policies that compel people to install these viruses on their work machines.
Redundant systems aside, it should be illegal to roll out updates like this to more than x% of any gov stuff at a time. A brute-force way of avoiding correlated-failure Armageddon.
There's a better joke: Crowdstrike sponsors the Mercedes Formula 1 team, and in 1955 Mercedes was involved in the worst motorsport accident ever, killing over 80 people watching from the stands when parts of the cars flew off and... struck the crowd...
I think it was a Dodge Charger. That pro-Trump KKK guy in the gray car who drove through the crowd at, I think, a George Floyd protest?
Mustangs have a reputation as being 'crowd (or streetlight) seeking' missiles.
This is due to their price making them relatively more available to the enthusiasts than say Hellcats, enthusiasts who may not be experienced enough to deal with having that much power available to them in a RWD car. This confluence of power, confidence and lack of skill often comes to a head when the enthusiast goes to a car meet to show off and meet with like minded folks. At the conclusion of the meet, or during a group drive, they'll often pull a sick burnout as they pull out of the parking lot on to a street.
A sick burnout they haven't practiced, which will often cause them to lose the back end, sending the car into the curb, a tree, or a crowd of like-minded attendees at the car meet. Hence the reputation.
Mustangs are famous for their high power and poor handling - there are lots of videos showing drivers doing burnouts, losing control, and striking the crowd they are showing off to.
Maybe it's time that critical systems switch to Linux. The major public clouds are already primarily running Linux. Emergency services, booking, and traditional point-of-sale have no strong reason to run Windows. In the past 10 years, the technological capability differences between Windows and Linux have widened considerably, with Linux being the most advanced operating system in the world without question.
Concerns about usability between Windows and Linux in the modern day are disingenuous at best and malicious at worst. There is no UX concern when everything runs off a webapp these days.
Just use Linux. You will save money and time, and your system will be supported for many years, you won't be charged per E-Core, you won't suffer BSoDs in 2024. Red Hat is a trustworthy American company based out of Raleigh, NC, in case you have concerns of provenance.
Really there's no downside. If you were building your own company you would base your tech stack on Linux and not Windows.
Critical systems cannot go down; therefore they cannot run Windows. If they do, they are being mismanaged and run negligently. Management should have no issue finding Linux engineers, they are everywhere. I could make a killing right now as a consultant going from company to company and just swapping out Windows backends for Linux. And quite frankly I might just do that, literally starting right now.
The discussed issue is not related to any meaningful difference between Windows and Linux – Crowdstrike used a kernel driver, apparently containing a serious bug, which took down the system, which is something any kernel driver can do, no matter which kernel you use. At least Windows has a well-developed framework for writing userspace drivers, unlike Linux.
> Linux being the most advanced operating system in the world without question.
Very strong and mostly unfounded claim; there are specific aspects where Linux is "more advanced", and others where Windows comes out ahead (e.g. almost anything related to hardware-based security and virtualization).
> your system will be supported for many years
Windows Server 2008 was supported until earlier this year, longer than any RHEL release.
> you won't suffer BSoDs in 2024
Until you install a shitty driver for a dubious (anti)malware service.
I don't understand this sort of blindness. Linux fails all the time, with rather terrible nobody-to-root vulns because some idiot failed to get a bounds check right. Ye gods, xz-utils was barely a few months ago!
Hmm? It was released for two plus months? 5.6.0 and 5.6.1
I'd also say this wasn't a good example of 'Linux handling it better': usually when a mess like this occurs on Windows, all the corps get a quiet tap on the shoulder that they need to patch immediately when MS releases the fix, and then a few days later it hits the news. In xz's case, the backdoor was published before the team knew about it. Huge mess.
You’re right that it went unnoticed for a long time, just one clarification
> all the corps get a quiet tap on the shoulder that they need to immediately patch when MS releases it, then a few days later it hits the news
AFAIK, distros were notified and released a patched version of xz like a week before it hit the news, so at least a lot of machines received it via automatic updates.
Depends which news you're talking about. MS guy who discovered it found it March 29th, published to oss. It was in infosec news same day as redhat, others pushed out critical advisories. Patch didn't come til a day or two later.
You're half right - people who compiled it from source could theoretically get those releases, but no, it wasn't released in any distros. So in practice since no linux distro released it, no-one relying on linux distros was exposed to it.
> Maybe it's time that critical systems switch to Linux.
I switched critical systems to illumos and BSD years ago and it's been smooth sailing ever since. Nowadays there really is no need to contribute to linux monoculturization whatsoever.
I too want to see Linux more widely adopted, but it won't prevent this from happening. People will install corrupted kernel modules on Linux too for anti-virus purposes.
All good points, but Windows didn't win because it had the best tech or user interface, merely the most developer support and thus user numbers. Legacy momentum is an incredibly difficult thing to sway. It has taken Apple decades and potentially hundreds of billions of dollars of marketing and goodwill to carve out its share of the market. Linux doesn't have that, despite its clear technical advantages.
It is an incredibly frustrating battle, akin to that of Sisyphus.
Crowdstrike has a Linux version. It is mandatory on our Linux servers at my company, so that is not the solution.
I would say issue 1 is management/compliance forcing admins to install malware like Crowdstrike. But issue 1 exists because of issue 2, which is that admins / app devs / users aren't careful enough to keep their machines from being compromised on a regular basis in the first place. And issue 2 exists because of issue 3: the software industry not focusing on quality and not making bug-free software.
All in all, this should be mitigated by more diversity in OSes, software, and "said security solutions". Standardization and monopolies work well until they don't, and then you get this kind of shit.
I think we don't do enough to push back on these requests in a language that management understands. Ask them to sign a security waiver assuming the risks of installing software that techs would classify as malware and an RCE risk.
Companies like CS live on reputation; it should be dragged down.
One place I'm at recently required us to install it in our Kubernetes cluster which powers a bunch of typical web apps.
Falcon sensor is the most CPU intensive app running in the cluster and produces a constant stream of disk activity (more so than any of our apps).
It hasn't crashed anything yet but it definitely leaves me feeling iffy about running it.
I don't like CrowdStrike at all. I got contacted by our security department because I used curl to download a file from GitHub on my dev box and it prompted a severe enough security warning that it required me to explain my intent. That was the day I learned I guess every command or maybe even keystroke I type is being logged and analyzed.
We were also forced to run that, until the agent introduced a memory leak that ate almost all the memory on all the hosts. Thankfully we managed to convince our compliance people that we could run an immutable OS rather than deploy this ~~malware~~ XDR agent.
Windows actually runs a lot of drivers in user mode, even GPU drivers. Largely this is because third-party drivers were responsible for the vast majority of blue screens, but users would blame Microsoft. Which makes sense; Windows crashes, so they blame Windows, though I doubt anyone blamed Linux for the kernel panic.
I think Windows can be blamed for how badly this kind of issue can be fixed. On Linux or any BSD, admins would build an ISO image that automatically runs a script to (optionally) decrypt the system drive and then remove Crowdstrike. Or, alternatively, build a live system that takes an address via DHCP and starts an SSH server, and admins would remotely and automatically run a playbook that mounts that ISO on the hypervisor, boots it, applies the fix remotely, then boots the system back from its own system drive.
Maybe this is just my ignorance about Windows and its ecosystem, but it seems most admins this morning were clueless about how to fix this automatically and remotely on n machines, and resorted to booting into safe mode and removing a file manually on each individual server. It is insane to think that supposed Windows sysadmins / cloud ops have no idea how to deploy a fix automatically on that platform.
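To make the "boot a rescue environment and strip the file" idea concrete, here's a minimal Python sketch of the per-node fix, assuming the broken disk has been attached to a Linux rescue node as /dev/sdb1, that ntfs-3g is installed, and that the offending files match the widely reported C-00000291*.sys pattern. All of those are assumptions to adapt, not verified specifics, and in practice you'd drive this from your orchestration tool (the "playbook" above) across every affected node.

```python
#!/usr/bin/env python3
"""Sketch: mount an affected Windows boot disk on a Linux rescue node
and remove the offending CrowdStrike channel file(s)."""
import glob
import os
import subprocess

DEVICE = "/dev/sdb1"                     # assumed device node of the attached broken disk
MOUNTPOINT = "/mnt/broken-windows"
CS_DIR = "Windows/System32/drivers/CrowdStrike"
BAD_PATTERN = "C-00000291*.sys"          # widely reported bad channel file pattern

def main() -> None:
    os.makedirs(MOUNTPOINT, exist_ok=True)
    # ntfs-3g gives read-write access to the NTFS system partition
    subprocess.run(["ntfs-3g", DEVICE, MOUNTPOINT], check=True)
    try:
        for path in glob.glob(os.path.join(MOUNTPOINT, CS_DIR, BAD_PATTERN)):
            print(f"removing {path}")
            os.remove(path)
    finally:
        subprocess.run(["umount", MOUNTPOINT], check=True)

if __name__ == "__main__":
    main()
```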
It can kill processes based on memory scanning. Imagine systemd getting killed at every boot.
An issue might not be as universal as on windows, because some distros do things differently like not using glibc, or systemd, or whatever. Yet there are some baselines common to the most popular ones.
Well, Microsoft tried to lock down its kernel with Windows Vista, and then antivirus vendors cried that they wouldn't be able to protect Windows, that it was anticompetitive, etc.
I could never get smooth scrolling to work on Linux in any mainstream web browser, most people don’t seem to see it, but I’m sensitive to things like that.
Like with a laptop trackpad? I'm smooth-scrolling through these comments right now, and don't remember when scrolling wasn't smooth by default on any trackpad.
It’s smooth to a point, but not smooth like OS X is. It might have improved (I think I last tried desktop Linux a year ago). I do enjoy using Linux as my default headless OS.
Anecdote: my first job was IT at a small org. We had somehow gotten a 15 minute remote meeting with Kevin Mitnick, and asked him several questions about security best practices and software recommendations. I don't remember a lot about that meeting, but I do remember his strong recommendation of Crowdstrike. Interesting to see it brought up again in this context.
Can someone explain to me why such systems need anti-virus in the first place?
Windows has pretty good facilities for locking down the system so that ordinary users, even those with local admin rights, cannot run or install unauthorised code so if nothing can get in why would the system need checking for viruses?
So why do most companies not lock down their machines?
Anything that has root/kernel access is a risk. It always has been. When will we learn. Probably never. Because money runs this world. So sad. Time to open a bakery and move on from this world.
Things like hospitals, airlines, 911, should have multiple systems with different software stacks and independent backends running in-parallel, so that when one infra goes down they can switch to another.
For some areas of our critical systems we have three independent software groups program the same exact system on different infrastructure. Just for moments like these...
There is an enormous cost associated with the kind of redundancy you're talking about. Capitalism prevents us from being set up in the way you're describing. Why invest in company A if company B can run the same business with half the operational expenses? Shareholder profit above all.
Is company B allowed to take the full brunt of all the problems when there is a failure, or does government protect it by limiting damages? If company B's cheaper choice leads to harm and lets people and estates sue company B into the ground, then company A is a safer investment even if it has lower returns. If government interaction limits such recovery options, then that is what leads to company B's higher returns not also having higher risks, so they'll be the better investment. But that is a result of government intervention, not the economic system in play.
How does such a huge company do “full deploys” like this?
At this number of endpoints, only a few % should have been updated (and faced the problems) before a full rollout
This is not a small startup with some SaaS, these guys are in most computers of too many huge companies. Not rolling out the updates to everyone at the same time seems just too obvious
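For reference, the kind of gate being asked for is not exotic. Here is a minimal sketch of ring-based promotion; the ring sizes, bake time, and failure threshold are made-up parameters for illustration, and nothing here reflects CrowdStrike's actual pipeline:

```python
import time

# Illustrative ring sizes and thresholds only.
RINGS = [0.01, 0.10, 1.00]       # 1% canary, then 10%, then everyone
BAKE_TIME_S = 6 * 3600           # how long each ring bakes before promotion
MAX_FAILURE_RATE = 0.001         # halt if >0.1% of updated hosts look unhealthy

def staged_rollout(hosts, push_update, is_healthy):
    """hosts: list of endpoints; push_update(h) installs the update;
    is_healthy(h) queries telemetry (heartbeats, crash reports)."""
    updated = []
    for fraction in RINGS:
        target = int(len(hosts) * fraction)
        for host in hosts[len(updated):target]:
            push_update(host)
            updated.append(host)
        time.sleep(BAKE_TIME_S)  # let the ring bake before widening the blast radius
        failures = sum(1 for h in updated if not is_healthy(h))
        if failures / len(updated) > MAX_FAILURE_RATE:
            raise RuntimeError("rollout halted: failure rate above threshold")
```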
Working late Thursday night in Florida, USA. I have someone in Australia wanting me to write a quick script in LSL for an object in Second Life. We were interrupted: Second Life kept running, but Discord went down, telling me to 'try another server', which doesn't make sense when you are 1-on-1 with someone. All my typing in Discord turned red. Additionally, I couldn't log into the email portal for outlook.com: I got a screen of tiny-fonted text all clinging to the left edge of the display, unreadable, unusable. Second Life, though, stayed online and kept working for me, but then I'm on Windows 7. My friend who had requested the collaboration froze in Second Life on his Windows 10 system, and I don't know what his Discord was doing. I ended the session since I couldn't get a go/no-go out of him for the latest script version.
Wow I didn't know second life was still a thing. Literally yesterday I looked at a 20 year old archived version of a freeware portal which also listed a version of second life.
This is a good example of why you don't want ring0 level access for clients.
Or just, you don't want client-based solutions. The provider just becomes another threat vector.
Those focusing on QA and staged rollouts are misguided. Yes of course a serious company should do it but CrowdStrike is a compliance checkbox ticker.
They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”.
The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).
I hope the company gets buried because of it. It's time regulators took a long, hard look at the dangers of these pretend turnkey solutions to compliance, and that we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don't)
Due to the scale I think it’s reasonable to state that in all likelihood many people have died because of this. Sure it might be hard to attribute single cases but statistically I would expect to see a general increase in probability.
I used to work at MS and didn’t like their 2:1 test to dev ratio or their 0:1 ratio either and wish they spent more work on verification and improved processes instead of relying on testing - especially their current test in production approach. They got sloppy and this was just a matter of time. And god I hate their forced updates, it’s a huge hole in the threat model, basically letting in children who like to play with matches.
My important stuff is basically air-gapped. There is a gateway but it’ll only accept incoming secure sockets with a pinned certificate and only a predefined in-house protocol on that socket. No other traffic allowed. The thing is designed to gracefully degrade with the idea that it’ll keep working unattended for decades, the software should basically work forever so long as equivalent replacement hardware could be found.
At one company I used to work for, we had boring, airgapped systems that just worked all the time, until one day security team demanded that we must install this endpoint security software. Usually, they would fight tooth and nail to prevent devs from giving any in-house program any network access, but they didn't even blink once to give internet access to those airgapped systems because CrowdStrike agents need to talk to their mothership in AWS. It's all good, it's for better security!
It never caught any legit threat, but constantly flagged our own code. Our devs talked to security every other week to explain why this new line of code is not a threat. It generated a lot of work and security team's headcount just exploded. The software checked a lot of security checkboxes, and our CISO can sleep better at night, so I guess end of day it's all worth it.
>It never caught any legit threat, but constantly flagged our own code
When I worked in large enterprise it got to the point that if a piece of my app infrastructure started acting weird the blackbox security agents on the machines were the first thing I suspected. Can't tell you how many times they've blocked legit traffic or blown up a host by failing to install an update or logging it to death. Best part is when I would reach out to the teams responsible for the agents they would always blame us, saying we didn't update, or weren't managing logs etc. Mind you these agents were not installed or managed by us in any way, were supposed to auto update, and nothing else on the system outran the logrotate utility. Large enterprise IT security is all about checking boxes and generating paperwork and jobs. Most of the people I've interacted with on it have never even logged into a system or cloud console. By the end I took to openly calling them the compliance team instead of the security team.
I know I've lost tenders due to not using pre-approved anti-virus vendors, which really does suck and has impeded the growth of my company, but since I'm responsible for the security it helps me sleep at night. This morning I woke up to a bunch of emails and texts asking me if my systems had been impacted by this, and it was nice to be able to confidently write back that we're completely unaffected.
I day-dream about being able to use immutable unikernels running on hypervisors so that even if something was to get past a gateway there would be no way to modify the system to work in a way that was not intended.
Air-gapping with a super locked-down gateway was already getting more popular precisely due to the forced-updates threat surface, and after today I expect it to be even more popular. At the very least I'll be able to point to this incident when explaining the rationale behind the architecture, which could help in getting exemptions from the antivirus box-ticking exercise.
I love their forced updates, because if you know what you're doing you can disable them, and if you don't know what you're doing, well you shouldn't be disabling updates to begin with. I think people forget how virus infested and bug addled Windows used to be before they enforced updates. People wouldn't update for years and then bitch how bad Windows was, when obviously the issue wasn't Windows at that point.
If the user wants to boot an older, known-insecure, version so that they can continue taking 911 calls or scheduling surgeries... I say let 'em. Whether to exercise this capability should be a decision for each IT department, not imposed by Microsoft on to their whole swarm.
No, after the fact. Where's the prompt at boot-time which asks you if you want to load yesterday's known-good state, or today's recently-updated state?
It's missing because users are not to be trusted with such things, and that's a philosophy with harmful consequences.
I don't have any affected systems to test with, but I'd be pretty surprised if that were an effective mechanism for un-breaking the crowdstruck machines. Registry and driver configuration is a rather small part of the picture.
And I don't think that's an accident either. Microsoft is not interested in providing end users with the kind of rollback functionality that you see in Linux (you can just pick which kernel to boot to) because you can get less money by empowering your users and more money by cooperating with people who want to spy on them.
1) It is not only the Enterprise version of Windows; it is any version capable of GPO (so Pro applies too, Home doesn't).
2) It is not disabling them; it is approving or rejecting them (or even holding up the decision indefinitely).
You can do that too, via WSUS. It is not reserved to large enterprises, as I've seen claimed several times in this thread. It is available to anyone who has Windows Server in their network and is willing to install the WSUS role there.
We took 911 calls all night, I was up listening to the radio all night for my unit to be called. The problem was the dispatching software didn't work so we used paper and pen. Glory Days!!!!
It doesn't really matter to me that it's possible to configure your way out of Microsoft's botnet. They've created a culture around Windows that is insufficiently concerned with user consent, a consequence of which is that the actions of a dubiously trusted few have impacts that are too far and wide for comfort, impacts which cannot be mitigated by the users.
The power to intrude on our systems and run arbitrary code aggregates in the hands of people that we don't know unless we're clever enough to intervene. That's not something to be celebrated. It's creepy and we should be looking for a better way.
We should be looking for something involving explicit trust which, when revoked at a given timestamp, undoes the actions of the newly-distrusted party following that timestamp, even if that party is Microsoft or cloudstrike or your sysadmin.
Sure, maybe the "sysadmin" is good natured Chuck on the other side of the cube partition: somebody that you can hit with a nerf dart. But maybe they're a hacker on the other side of the planet and they've just locked your whole country out of their autonomous tractors. No way to be sure, so let's just not engage in that model for control in the first place. Lets make things that respect their users.
I'm specifically talking about security updates here. Vehicles have the same requirement with forced OTA updates. Remember, every compromised computer is just one more computer spreading malware and being used for DDOS.
Ignoring all of the other approaches to that problem I wonder if this update will take the record for most damage done by a single virus/update. At some point the ‘cure’ might be worse than the disease. If it were up to me I would be suggesting different cures.
An immutable OS can be set up to revert to the previous version if a change causes a boot failure. Or even a COW filesystem with snapshots when changes are applied. Hell, Microsoft's own "System Restore" capability could do this, if MS provided default-on support for creating system restore points automatically when system files are changed & restoring after boot failures.
What's funny to me is that in college we had our computer lab set up such that every computer could be quickly reverted to a good working state just by rebooting. Every boot was from a static known good image, and any changes made while the computer was on were just stored as an overlay on a separate disk. People installed all manner of software that crashed the machines, but they always came back up. To make any lasting changes to the machine you had to have a physical key. So with the right kind of paranoia you can build systems that are resilient to any harmful changes.
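That same trick is straightforward to reproduce on Linux today with overlayfs: boot from a read-only base image and direct all writes to a throwaway upper layer. A minimal sketch with purely illustrative paths; a real setup would do this from the initramfs and make the lower layer the actual root filesystem:

```python
import os
import subprocess

# Illustrative paths only: a read-only "known good" base image plus a tmpfs
# upper layer, so every change made while running vanishes on reboot.
LOWER  = "/run/base-image"       # read-only system image
UPPER  = "/run/overlay/upper"    # throwaway writes land here
WORK   = "/run/overlay/work"     # overlayfs scratch space (same fs as upper)
MERGED = "/run/merged"           # the tree the system actually uses

def setup_ephemeral_overlay():
    # Back upper/work with tmpfs so nothing persists across reboots.
    os.makedirs("/run/overlay", exist_ok=True)
    subprocess.run(["mount", "-t", "tmpfs", "tmpfs", "/run/overlay"], check=True)
    for d in (UPPER, WORK, MERGED):
        os.makedirs(d, exist_ok=True)
    subprocess.run([
        "mount", "-t", "overlay", "overlay",
        "-o", f"lowerdir={LOWER},upperdir={UPPER},workdir={WORK}",
        MERGED,
    ], check=True)
```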
Well, not the OS per se, but macOS updating mechanisms have an auto-restart path, and I imagine any Linux update that touches the kernel can be configured that way too. It's more the admin's decision than the OS's, but on all common systems auto-restart is part of the menu too.
MS could've leaned more towards user-space kernel drivers though. Apple has been going in that direction for a while, and I haven't seen much of that (if anything) coming from MS.
That would have prevented a bad driver from taking down a device.
Apple created their own filesystem to make this possible.
The system volume is signed by Apple. If the signature on boot doesn't match, it won't boot.
When the system is booted, it's in read-only mode, no way to write anything to it.
If you bork it, you can simply reinstall macOS in place, without any data/application loss at all.
Of course, if you're a tinkerer, you can disable both, the SIP, and the signature validation, but that cannot be done from user-space. You'll need to boot into recovery mode to achieve that.
I don't think there's anything in NTFS or REFS that would allow for this approach. Especially when you account for the wide variety of setups on which an NTFS partition might sit on. With MBR, you're just SOL instantly.
Apple hardware on the other hand has been EFI (GPT) only for at least 15 years.
I don’t know the specifics of this case, but formal verification of machine code is an option. Sure it’s hard and doesn’t scale well but if it’s required then vendors will learn to make smaller kernel modules.
If something cannot be formally verified at the machine code level there should be a controls level verification where vendors demonstrate they have a process in place to achieving correctness by construction.
Driver devs can be quite sloppy and copy-paste bad code from the internet; in the machine code, Microsoft can detect specific instances of known copy-pasted code and knows how to patch them. I know they did this for at least one common error. But if I were in the business of delivering an OS that I want people to rely on, formal verification at some level would be table stakes.
I thought Microsoft did use formal verification for kernel-mode drivers and that this was supposed to be impossible. Is it only for their first-party code?
No, I believe 3rd party driver developers must pass Hardware Lab Kit testing for their drivers to be properly signed. This testing includes a suite of Driver Verifier passes that are done, but this is not formal verification in the mathematical sense of the term.
I wasn’t privy to the extent it was used, if this was formally verified to be correct and still caused this problem then that really would be something. I’m guessing given the size and scope of an antivirus kernel module that they may have had to make an exception but then didn’t do enough controls checking.
There is a windows release preview channel that exists for finding issues like this ahead of time.
To be fair - it is possible the conflicting OS update did not make it to that channel. It is also possible it is due to an embarrassing bug from MSFT (unknown as yet).
Until I hear that this is the case - I am pinning this on Crowdstrike. This should have been caught before prod.
Even if this is entirely due to Crowdstrike, I see it as Microsoft's failure to properly police their market.
There is the correctness by testing vs correctness by construction dynamic and in my view given the scale of interactions between an OS and the kernel modules trying to achieve correctness by testing is negligent. Even at the market scale Microsoft has there are not enough Windows computers to preview test every combination. Especially when taking into account the people on the preview ring have different behaviors to those on the mainline so many combinations simply won't appear in the preview.
I see it as Microsoft owning the Windows kernel module space and having allowed sloppiness by third parties and themselves. I don't know the specifics, but I could easily believe that this is due to a bug from Microsoft. The problem with allowing such sloppiness is that the sloppy operators outcompete the responsible operators; the bad pushes out the good until only the bad remains. A sloppy developer can push more code and gets promoted, while the careful developer gets fired.
There's not enough public information about it - but taking this talking point at face value, Microsoft did sign their kernel driver in order for it to be able to do this kind of damage. It's not publicly documented what validation they do as part of the certification and signing process.
The damage may have been done in a dependency which was not signed by Microsoft. Who knows? Hopefully we'll find out.
In general, a fair amount of the bad behavior of windows devices since Vista has been really about poorly written drivers misbehaving, so there appears to be value in that talking point. All the Vista crashes after release (according to some sources, 30% of all Vista crashes after release were due to NVidia drivers), notably, and more recently if you've ever tried to put your Windows laptop to sleep, and discovered when you take it out of your bag that it had promptly woken back up and cooked itself into having a dead battery. (Drivers not properly supporting sleep mode) WHQL has some things to answer for for sure.
As a tester, I'm frustrated by how little support testing gets in this industry. You can't blame bad testing if it's impossible to get reasonable time and cooperation to do more than a perfunctory job.
That's misplaced. Windows is an ancient platform. CrowdStrike is ubiquitous and routinely updated. There was no "move fast" here, at least on the part of the people operating these systems.
Stopped by a gas station in rural Wisconsin leaving from MSP. Thank God we were on a full tank when we left, nothing was operational except the bathrooms (which is why we stopped).
I left thinking about how anti-anti-fragile our systems have become. Maybe we should force cash operations…
Back in the 1990s, when Microsoft wanted to enter the embedded systems market, there was a saying: "You don't want Windows controlling your car's brakes". We now let them control a huge part of our lives. Should we let them add AI to the already unpalatable cocktail?
- CS: Have a staging (production-like) environment for proper validation. It looks like CS has one of these but they just skipped it
- IT Admins: Have controlled roll-outs, instead of doing everything in a single swoop.
- CS: Fuzz test your configuration (see the sketch below)
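A minimal sketch of what "fuzz the configuration" could mean in practice: feed randomly mutated channel/config files to the parser in a test harness and require that it only ever rejects bad input, never crashes. parse_channel_file here is a hypothetical stand-in for whatever code actually consumes these files, not anything from CrowdStrike:

```python
import random

def parse_channel_file(data: bytes) -> None:
    """Hypothetical stand-in for the code that consumes a channel/config file.
    It may raise ValueError for malformed input; anything else is a bug."""
    raise NotImplementedError

def mutate(data: bytes, flips: int = 8) -> bytes:
    buf = bytearray(data)
    for _ in range(flips):
        i = random.randrange(len(buf))
        buf[i] ^= 1 << random.randrange(8)   # flip a random bit
    return bytes(buf)

def fuzz(seed_file: str, iterations: int = 100_000) -> None:
    seed = open(seed_file, "rb").read()
    for _ in range(iterations):
        sample = mutate(seed)
        try:
            parse_channel_file(sample)
        except ValueError:
            pass                              # rejecting bad input is fine
        except Exception as exc:              # crashing on bad input is not
            raise AssertionError(f"parser blew up on fuzzed input: {exc!r}")
```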
It is possible Crowdstrike did a time-delayed activation here. Controlled roll-outs wouldn't help if all the daily chunked updates didn't activate in the kernel driver until some point in the future.
I am going to stop saying this, but people don't realize CS has an official RCE as a feature. As in, running remote commands as root/admin on Windows, Linux, or Mac through their web console.
Assuming this event itself isn't malicious, what an excellent POC for something that is. I sure hope every org out there with this level of market reach has good security in place. It's certainly going to be getting some probing after this.
This is a manifestation of almost everything wrong about software development and marketing practices.
I work in hardware development, and such a failure is almost impossible to imagine. It has to work, always. It puzzles me why this isn't the case for software. My SWE colleagues often get mad at us HW guys because we want to see their test coverage for the firmware/drivers etc. The focus is on having something which compiles, pushing the code to production as fast as possible, and then regressing in production. Most HW problems are a result of this. I've found it's often better to go over the firmware myself and read it line by line to understand what the code does. It saves so much time from endless debugging sessions later. It pisses off the firmware guys, but hey, you have to break some eggs to make an omelette.
> It puzzles me why this isn't the case for software
In my anecdotal experience, its because corporate software projects are not typically run by people who are good at building safe things - but rather, just building things quickly.
There's a huge issue with the mentality of "it works, ship it" being propagated.
I build systems software for safety-critical and mission-critical markets, and I can say without a doubt that if there aren't at least two quality stages in your process workflow (and your workflow isn't waterfall), then you're going to be in for a rough time, rookies.
Always, always delay your releases, and always, always eat your own dog food by testing your delayed releases in your own customer-like environment, which is to say, never release a developer's build.
This is also my experience. And the worst was always being lectured by SW project managers about being agile and having to move quick and release early. I won't release anything without making sure everything works in every possible condition. This is why it takes years to build a complex chip (CPU, Fpga, any SoC really). Their firmware is often squeezed into months, and often the developers are handling like 10 different projects. So, no focus, no time to understand the details of the design. At the end it's common to have firmware issues in the first year after release. It's kind of expected even.
Complexity. As you get further from the driver and the kernel software complexity expands massively. It gets to a point where it is beyond the abilities of humans and processes to manage it in a cost effective manner.
I understand that might be case for a lot of SW development. But in the context I was talking about, the HW is so much more complex than the SW. Valid for a lot of cases too. But then, why? If we know that we cannot build a 100km long bridge, nobody attempts to build that and waste resources. Why does software development lack this?
When is HW much more complex than the SW? I work in a company that designs (far from trivial) hardware, and develops embedded software, and in my experience software is always more complex than hardware, due to the many layers of abstraction (unless you are writing only, IDK, boot loaders in assembler? and even then it is about the same level of complexity)
There's supposedly a fix being deployed (https://x.com/George_Kurtz/status/1814235001745027317). Since it's a channel update I'm assuming that it would be downloaded automatically? Has anyone received it yet? Does the garbage driver disappear or is it replaced?
Edit: got in touch with an admin:
C-00000291-00000000-00000029.sys SHA256 1A30..4B60 is the bad file (timestamp 0409 UTC)
C-00000291-00000000-00000030.sys SHA256 E693..6FAE is the fix (timestamp >= 0527 UTC)
Do not rely on the hashes too much as these might vary from org to org I've read.
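For admins scripting the cleanup, something along these lines can report whether a machine still has a bad channel file. A sketch only: the directory is the commonly reported CrowdStrike driver location, and KNOWN_BAD_SHA256 is a placeholder you'd fill in from your own vendor advisory, since as noted above the hashes may vary from org to org:

```python
import hashlib
import pathlib

CS_DIR = pathlib.Path(r"C:\Windows\System32\drivers\CrowdStrike")
KNOWN_BAD_SHA256 = {
    "<fill in from your advisory>",   # placeholder: do not hard-code from hearsay
}

def sha256(path: pathlib.Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_bad_channel_files():
    """Yield C-00000291*.sys files whose hash matches a known-bad entry."""
    bad = {s.upper() for s in KNOWN_BAD_SHA256}
    for p in CS_DIR.glob("C-00000291*.sys"):
        if sha256(p).upper() in bad:
            yield p

if __name__ == "__main__":
    for path in find_bad_channel_files():
        print(f"bad channel file present: {path}")
```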
Ironically, the SolarWinds court case happened yesterday. The SEC won. SolarWinds was fraudulent in saying their software was "secure". They should rename a side-channel attack a "Tom and Jerry", because it's getting to be like a game of cat and mouse.
1. This is why kernel modules are a bad idea
2. This is why centralism is a bad idea
3. This is why sacrificing stability for security is a bad idea
4. Security still needs to factor in security of supply - not just data safety
Centralisation in a nutshell. Monopolies so big that they become globally fragile. CloudFlare outages break a lot of the internet, and now we can see, Windows-based updates bricking machines across the world.
We've all pushed bad updates but how was this not tested?
How many people still believe the "cloud" was worth it? Maybe we should go back to the days of buying software and running it ourselves with our own infrastructure.
Maybe a silly question, but: why hasn't this affected Linux? I assume it uses a proprietary kernel module just like it does on Windows. I guess this will come out in a post-mortem if they publish one, but it's been on my mind.
The sheer coverage of this outage across multiple businesses and industries, the impact must be greater than some of the malicious cyber attacks from ransomware, worms etc.
I want to say the problem is that the industry has systematically devalued software testing in favor of continuous delivery and the strategy of hoping that any problems are easy to roll back.
But it's deeper than that: the industry realizes that, once you get to a certain size, no one can hurt you much. Crowdstrike will not pay a lasting penalty for what has just happened, which means executives will shrug and treat this as a random bolt of lightning.
This is why I don't like fully automatic updates. I prefer having control over the "deploy" button for the ability to time it when I can tolerate downtime. In mission-critical production systems all updates should go through test staging pipelines that my team controls, not a vendor.
Broken updates have caused far more havoc than being a few hours or even days late on a so-called critical patch.
Troubleshooting an issue like this when I have the time and am prepared for a potential outage (with human resources hot and standing by for immediate action) is VASTLY different than encountering it the evening before some critical deadline for a multi-million dollar project (as Murphy's Law will be sure to have it).
When I have control over the deployment of updates I can push them through my own QA environment first, to uncover many of these kinds of issues before they hit production. Vendors pushing them out on their whim leaves me subject to whatever fast and loose practice they use and prevents me from being able to properly manage my own infrastructure.
A slow rollout certainly helps but doesn't satisfy the kind of 9's I demand in the environments I care for.
Their stock price will suffer but they can waive license fees for a year or so for every endpoint affected (~$50).
They better pin this on a rogue employee, but even then, force pushing updates shouldn't be in their capability at all! They must guarantee removal of that capability.
Lawsuits should be interesting. They offer(ed?) $1 mil breach insurance to their customers, so if they were to pay only that much per customer this might be compensation north of $10B. But to be honest, wouldn't surprise me if they can pay up without going bankrupt.
The sad situation is, as twitter people were pointing out, IT teams will use this to push back against more agents for a long time to come. But in reality, these agents are very important.
Crowdstrike Falcon alone is probably the single biggest security improvement any company can make and there is hardly any competition. This could have been any security vendor, the impact is so widespread because of how widely used they are, but there is a reason why they are so widely used to begin with.
Oh, and just FYI, the mitigation won't leave you unprotected: when you boot normally, the userspace executables will replace it with a fixed version.
Allow me to give a different, information-theoretic, perspective. How much damage can flipping a single bit cause? How much damage can altering two bits cause?
The fanout is a robustness measure on systems. If we can control the fanout we increase reliability. If all it takes is a handful of bits in a 3rd party update to kill IT infrastructure, we are doing it wrong.
Are you suggesting that a 3kb update be tested 3k times to assess the impact of each possible bit flip, and 9M times for the impact of each possible pair of bit flips?
Because I think that's effort better spent in other ways.
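For concreteness, the combinatorics behind those numbers, treating "3kb" as roughly 3,000 bits (the 9M figure corresponds to ordered pairs, n² ≈ 9 × 10⁶; with 3 KB = 24,576 bits the unordered pair count grows to roughly 3 × 10⁸):

```latex
\text{single-bit flips} = n, \qquad
\text{pairs of flips} = \binom{n}{2} = \frac{n(n-1)}{2}
\approx 4.5 \times 10^{6} \quad (n = 3000)
```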
Not remotely. I am aware of the state space explosion and the difficulty with brute forcing the testing. I am suggesting that the damage a broken antivirus update can do should be restricted.
"Incidents of this nature do occur in a connected world that is reliant on technology."
- Mike Maddison, CEO, NCC Group
Until I see an explanation of how this got past testing, I will assume negligence. I wasn't directly affected, but it seems every single Windows machine running their software in my org was affected. With a hit rate that high I struggle to believe any testing was done.
True, though tbf it's still part of the running system.
I read that many of those affected are global orgs. When I worked at an oil major, everything was tested to oblivion before going into production in the DCs, the reason being to avoid precisely this kind of situation where at all possible. There were clusters set aside for operational acceptance testing to ensure everything, from business application right down to kernel, ran successfully. The idea of leaving auto-update on in any production system was unthinkable. Yet here we are.
AFAIU, Google (and I presume other operators of large numbers of computers) deploy updates to their software first to a small set of nodes, and only after a given time, once the update has been deemed successful, do they continue updating an increasingly larger set until complete.
Isn't this done as well with automatic updates of end user software or embedded systems and if not, why not?
This event raises the question: What is the liability of Crowdstrike given its erroneous update caused the meltdown, and the impact certainly had negative personal or business outcomes globally.
See for example 6000 flights cancelled or the many statements posted here regarding it negatively impacting healthcare and other businesses.
we are bound to see the YouTube ads equivalent of late night spot ads for lawyers with accelerated audio "have you lost someone to the 2024, 2025 or 2029 crowdstrike global hospital outages? if so you may be entitled to compensation. DM law5237 on X to find more"
I wonder who exactly messed up the update, microsoft or crowdstrike.
Usually, there is pre-rollout update testing AND some companies use N-1 version staging for critical/production systems. For me it feels much more complex a failure than just "it's crowdstrike's fault". Everybody involved must have done something wrong.
Except it hasn't got much to do with Windows... it's a faulty kernel software package from a commercial vendor, unrelated to the OS.
Philosophically it's always good to have diversity, precisely to avoid such disruptions. But the real issue here is:
A) Apparently half the world runs Crowdstrike... so everything is disrupted.
B) Apparently Crowdstrike didn't test their update properly.
I'm very curious what will happen to Crowdstrike. This seems like a huge liability?
All my customers' endpoints are Linux-based. Because our users' Windows apps run in VDI with disposable instances based off snapshots, and highly restrictive networking on the Linux endpoints, none of our users are affected.
Running Windows on bare-metal was always obviously very stupid. The consequences of such stupidity are just being felt now.
Genuine question: how the heck did crapware like Crowdstrike get into all these critical systems, from 911 to hospitals to airlines? My understanding was that all these critical systems are just super lazy about upgrading or installing anything at all. I would love to know all the sales tactics CS used to get into millions of systems for money!
Reading other comments here (sorry, I don't have the link), one Crowdstrike salesperson threatened to cancel them as a client - yes, you read that right - if the client wasn't easier to work with. So they're bullies, or at least that one salesperson at Crowdstrike is a bully.
Another article talked about Crowdstrike being required for compliance, and people here are talking about checkbox compliance. So there's a systemic requirement, perhaps from insurers, for some kind of comprehensive, near-real-time-updated antivirus solution.
Furthermore, the haste-makes-waste philosophy seems not to be honored, in my opinion, by the minds who drive the impacted sectors of our economy: hospitals, banks, airlines. This kind of vulnerability should not have been accepted. It's a single point of failure. Even on Crowdstrike's website they have this radar-ring hotspot target kind of graphic, where they show at the very center one single client app (theirs), as if that one single client is the thing that's going to save us?
This is amazing sales tactics! So, you buddy up with insurance, they create a checkbox and recommend you for a revenue cut! Now you suddenly have millions of customers out of nowhere, and your product gets installed on billions of computers before you even know it. I have seen this tactic used for many mediocre products. For example, 3rd-party dishwasher soap recommended by the dishwasher company. Amazingly powerful. I don't think most CrowdStrike employees even knew they were on more than a billion computers with a paid service. The CEO was just busy doing brutal marketing of this pointless product.
I've been warning about the coming software apocalypse for years.
This isn't a one-off, this is the beginning of a pattern.
Tech recruitment is broken, software is more complex than ever, more and more people are turning to hacking, people are growing increasingly dissatisfied with the status quo...
I’m sure the patch itself can be fixed, and there will be a workaround to boot up the machine to fix it. My only concern is the BitLocker keys. If the hard drive is encrypted by Windows and assuming no backup for that key has been done, the system admins will have to activate their disaster recovery plans for these devices, and I hope they have that too, but hope isn’t a strategy!
If the keys aren't backed up, you will be locked out of the system: as soon as you try to boot into safe mode to perform that workaround, you will be asked to enter the key manually (or to supply it from a USB drive). If you don't have, or don't know, the key, you will have an encrypted drive with all of your data locked away.
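A sketch of the kind of pre-emptive key escrow that makes this survivable, assuming Windows hosts where you can run the built-in manage-bde tool. The parsing is naive and printing keys to stdout is purely illustrative; real environments typically escrow recovery passwords to AD, Entra ID, or an MDM instead:

```python
import re
import subprocess

def backup_recovery_passwords(volume: str = "C:") -> list[str]:
    """Query BitLocker protectors via manage-bde and return the 48-digit
    recovery password(s) so they can be escrowed somewhere safe."""
    out = subprocess.run(
        ["manage-bde", "-protectors", "-get", volume],
        capture_output=True, text=True, check=True,
    ).stdout
    # Recovery passwords are printed as eight groups of six digits.
    return re.findall(r"(?:\d{6}-){7}\d{6}", out)

if __name__ == "__main__":
    for pw in backup_recovery_passwords():
        print("escrow this recovery password:", pw)
```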
So far the fix requires going into recovery mode and removing/renaming the Crowdstrike file. Then you can boot into Windows from there; it will probably be a sysadmin task, depending on the organisation's setup.
Absolutely shameful display of how the cure can be worse than the disease. It's nonsense snake oil and security theater such as this that throws the cyber"security" industry into disrepute. One may as well have just installed McAfee Anti Virus.
This has been the story of the antivirus "industry" all along. They simultaneously seem to employ actual bona-fide security researchers while also making sure none of their software products are ever touched by people you could even refer to as "developers". I can't even imagine the noise at Microsoft from all the crash reports solely caused by antivirus software written by utter clowns injecting into other programs and, as here, into the kernel.
So much software - especially in the 'operating system' sphere of things - really is just snake oil.
It's just very functional oil in many cases - and highly toxic and slippery in many, many other cases.
>cyber "security" industry
Yes, I agree this is a market of smoke and mirrors, lies and propaganda.
The reason is, operating systems are broken. Pretty much all of them. Only, some of them work well enough to get a lot of work done, most of the time. For the 99.9995% of the time it works, it's great.
But, here's a thing I feel needs broader attention and discussion - It is my firm opinion that "Operating Systems Vendors" are a very poor, ragged class of professionals these days.
The decisions made at Microsoft - and other OS vendor corporations - have really lost the plot.
I can prove this by asking the general public the golden question and categorically getting a standard response. The question: "does this feature benefit the user, or does it benefit an advertiser?"
"No, this all seems to be some sort of setup. Windows doesn't feel like its for us, any more."
I mean, how many 3rd-party vendors secretly installing crippling 'updates' in my production systems do I need before I realize that there is no security here, and that the answer is to write better software that doesn't need all this utter junk?
I mean this sincerely: operating system vendors are treasonous to the user if 3rd parties matter more to the production runtime than the thing the user very definitely needs to be operating.
The cloud is for backups, encrypted. It is made of snake oil.
Yet Lennart Poettering and Redhat (spelled that way as I am one of the original pre-IPO investors in RedHat via Alex Brown/Deutsche Bank) want to put Linux networking into UEFI this quarter, inside the most sacrosanct PID 1.
They still won't learn anything from CrowdStrike's mistakes!
Is there an ELI5 on how this can happen? Like, I get that it's a boot loop, but what did CrowdStrike do to cause it? How can non-malicious code trigger a boot loop?
I would not call CrowdStrike "non-malicious". It's an incredibly incompetently implemented kit that's sold to organizations as snake oil that "protects them from cybercrime". Its purpose is to give incompetent IT managers a way to "implement something plausible" against cyber incidents, and when an incident happens, it gives them the excuse that "they followed best practices".
It craps up the user's PC while it's at it, too.
I hope the company burns to the ground and large organizations realize it's not a great idea to run a rootkit on every PC "just because everyone else does it".
I have to say, it saved our ass a few months ago. Some hacker got access to one of multiple brands' server infrastructure and started running PowerShell to weed through the rest, and CrowdStrike notified us (the owning brand) that something was off about the PowerShell being run. Turns out this small brand was running a remote-access tool that had an exploit. Had CrowdStrike not been on that server we wouldn't have known until someone manually got in there to look at it.
I've had CrowdStrike completely delete a debug binary I ran from Visual Studio. Its injected module in every single process shows up in all of our logging.
What specifically makes it "incredibly incompetently implemented", and would you simply derisively describe any system that can push updates requiring admin access a "rootkit", or is there some way you envision a "competently implemented rootkit" operating? Your opinion seems incredibly strong so I'm just curious how you arrived at it? I'm not in IT, but the idea of both rolling out updates remotely and outsourcing the timely delivery of these updates to my door* is a no brainer.
* if not directly to all my thousands of PCs without testing, which is 100% a "me" task and not a "that cloud provider over there" task
Rootkit means Crowdstrike literally intercepts commands before they can be executed in the CPU. It is like letting a third party implant a chip in your brain. If the chip thinks the command in your head is malicious, it will stop your brain from ever receiving the command.
Crowdstrike needs to be the first person in the room so that they can act like the boss. If other people show up before crowdstrike, there's a possibility that they'll somehow prevent crowdstrike from being the boss. For this reason, crowdstrike integrates with the boot process in ways that most software doesn't.
Their ability to monitor and intervene against all software on the system also puts them in a position to break all software on the system.
What they did is that they forgot to write a graceful failure mode for their driver loader. (And what they did on top of it is to ship it without testing.)
My assumption is that when you have graceful failure for something like this, you risk a situation where someone figures out how to make it gracefully fail, so it's disabled on this huge fleet.
It's likely that there have been multiple discussions about graceful failure at the load stage and decided against for 'security' reasons.
If the threat model includes "someone can feed corrupted files to us" then I would definitely want more robustness and verification, not less.
It's perfectly okay to make the protected services unavailable for security reasons, but a management API should still be available, and the device should periodically query whatever source of truth there is about the "imminent dangers". And as the uncertainty decreases, the service can be made available again.
(Sure, then there's the argument against complexity in the kernel ... true, but that simply means that they need to have all this complexity upstream, testing/QA/etc. And apparently what they had was not sufficient.)
Official CrowdStrike workaround:
1. Boot Windows into Safe Mode or the Windows Recovery Environment
2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
3. Locate the file matching “C-00000291*.sys”, and delete it.
4. Boot the host normally.
TBF although I worried about this possibility the first time the IT dude wandered into my office in 1989 holding a floppy he said he wanted to put into all the PCs we had (we had no PCs), it has actually taken a very long time for the shit to hit the fan.
This is what happens when you entrust software security to ex-hackers. Hackers love complexity because that's the kind of environment they thrive in; yet when they start working for the other side as security consultants, they still love complexity. Complexity ought to be the security consultant's worst enemy.
Ex-hackers often talk about security as if it's something you need to add to your systems... Security is achieved through good software development practices and it's about minimalism. You can't take intrinsically crappy, over-engineered, complex software and make it more secure by adding layer upon layer of complex security software on top.
It's bizarre reading all the headlines about companies offline, flights canceled, banks not working because of a piece of antivirus software in 2024.
Mostly because I lived through Y2K, and every fear about Y2K just materialised anyway, only because of CrowdStrike instead.
I can't imagine the amount of wasted work this will create. It's not only the loss of operations across many industries; recovery will be absolute hell with BitLocker. How many corporate users have access to their encryption keys? And when the keys are stored centrally, how many of those servers have CrowdStrike running and are stuck in a boot loop right now?
I don't envy the next days/weeks for Windows IT admins of the world...
I can't wait to see the CloudFlare traffic report after this. All those computers going down must have affected traffic worldwide. Even from Linux systems as their owners couldn't run jobs from their bricked Windows laptops.
It is interesting that operating systems exist for server applications at all.
What is the problem they are solving?
What is the difference between what an operating system contains and can do and what you need it to do?
Why would I want to rent a server to run a program that performs a task, and also have the same system performing extra tasks - like intrusion detection, intrusion detection software updates, etc.
I just don't understand why a compiled program that has enough disk and memory would ever be asked to restart for a random fucking reason having nothing to do with the task at hand. It seems like the architecture of server software is not created intelligently.
What we do is boot the PC into Safe Mode, then open CMD as admin and issue this command: sc delete csagent. Then reinstall CrowdStrike using the previous version.
It's eye-opening how bad our crucial IT infra is nowadays. Running in-kernel third-party tools (AV) on critical infrastructure on Windows? Central banks? Control towers? Seriously? We should fire everyone involved and start IT from scratch. This level of negligence cannot be fixed.
This EDR software is implemented as a kernel driver.
A third-party, closed-source Windows kernel driver that can't be audited. It gathers a massive amount of activity data and sends it back to the central server (data which can be sold), and it can also execute arbitrary payloads from the central server.
It becomes a single point of failure for your whole system.
If an attacker gains control of the sysadmin's PC, it's over.
If an attacker gains administrator privilege on an EDR-installed system, they run with the same privilege as the EDR, so they can hide their activities from it. There aren't many EDR products in the world where this can't be done.
How is it that these major companies aren't rolling out vendor updates to a small number of computers first to make sure that nothing broke, and then rolling out to the entire fleet? That's deployment 101.
It seems that an unexplored weirdness here is the prevalence of virtual Windows in the medical world. This approach appears to have become commonplace for HIPAA reasons (though it's unclear that it makes the world better versus using secure applications to handle HIPAA data). In the case of this CrowdStrike outage, one would think that virtual machines would simplify getting things up and running again, but instead just the opposite seems to be going on, where lack of hardware access is limiting restoring them.
If I were a cloud vendor, I would provide a "CrowdStrike recovery" button which queues the recovery image and restores the system for the entire project. Why didn't Hetzner, Linode, DO, GCP, or AWS do something like this? Why leave people to their own devices? Isn't this a basic application of centralization? It feels to me like this should be easier than managing your own data center.
I would expect this to be a kernel specific bug. I'm on a company laptop with falcon, and we have linux systems using the same, no signs of problems so far.
I was going to buy some put options against CRWD with spare pocket money, but it turns out that the service I have my investment money in is broken right now. I wonder if that's because of CrowdStrike.
How do so many super critical things rely on… Windows? I wouldn't trust Windows to run a laptop reliably, but here it is running pretty much everything. I guess that's why they need CrowdStrike.
Vendors of tools like this drive the cybersecurity industry discourse, so 'defense in depth' often practically sorta means 'add more software that does more things'.
But maybe this kind of thing can actually impart the lesson that loading your OS up with always-on, internet-connected agents that include kernel components in order to instrument every little thing any program does on the system is, uh, kinda risky.
But maybe not. I wonder if we'll just see companies flock to alternative vendors of the exact same type of product.
Anyone have a technical writeup of the actual bug? I'm trying to explain how this could happen to people who think this is related to AI or cyber attacks.
What happened to the QA testing, staggered rollouts, feature flags, etc.? It's really this easy to cause a boot loop?
To me, BSOD indicates kernel level errors, which I assume Crowdstrike would be able to cause because it has root access due to being a security application. And because it's boot-looping, there's not a way to automatically push out updates?
I don't have a technical writeup to offer, but your assessment around the BSOD seems correct enough. Without having an affected machine but knowing how NT loads drivers like this, I'd hazard a guess that the OS likely isn't even getting to the point where smss.exe starts before the kernel bugchecks. This means no userspace, which almost certainly means no hope of remotely remediating the problem.
By way of a data point for everyone else: I live in Hong Kong and haven't seen this level of disruption yet. I was also in Shenzhen, China yesterday, probably the world's highest density of Win95 machines, and everything was fine. At home we have only one old laptop on Win10 that only gets opened when the 8-year-old gets Windows homework - otherwise it's macOS and Linux on all laptops, desktops and SBCs.
So, if CrowdStrike licenses didn't say "We're responsible for nothing" and if all affected users sued them, they'd be worth negative 90 trillion dollars or so right now. iow out of business.
I can understand the frustration their customers feel. But how could a software company ever bear liability for all the possible damage they can cause with their software? If they built CrowdStrike to space mission standards nobody could afford it.
I guess the much-blamed European Commission will again have to do its job and bring in anti-oligopoly/monopoly regulations, which everyone will hate but which will still sort of work.
Architecting technical systems is much, much easier than architecting socio-economic systems. I hope one day all those tech-savvy web3 wannabe revolutionaries will start doing the real job of designing socially working systems, not just technically-barely-working, cryptographically strong hamster-tapping scams.
> I'm in Australia. All our banks are down and all supermarkets as well so even if you have cash you can't buy anything.
I hope the national security/defense people are looking at this closely. Because you can bet the bad guys are. What's the saying, civilisation is only ever three days away from collapse or something?
I am pretty convinced this is a fuckup not an attack, but if Iran or someone managed something like this, there would be hell to pay.
If you are IT team for a large impactful organization, you have to control updates to your organization's fleet. You cannot let vendors push updates directly. You have to stage those updates and test them and then do a gradual rollout to your whole organization.
Plus, for your critical communication systems, you must have a disaster recovery plan that actually helps you recover quickly in minutes, not hours or days. And you have to exercise this plan regularly.
If you are crowd strike, shame on you for not testing your product better. You failed to meet a very low bar. You just shipped a 100% reproducible widely impactful bug. Your customers must leave you for a more diligent vendor.
And I really hope the leadership teams in every software engineering organization learn a valuable lesson from this – listen to that lone senior engineer in your leadership team who pushes for better craft and operational rigor in your engineering culture; take it seriously - it has real business impact.
Today's incident shows that the real problem is actually that organisations spend too much (money, but too little time / manpower) on security.
Hey, third-party vendor, I'll give you all the money you want, I'll let you pwn all my systems, I'll be your little bitch, just make me secure, I don't have time for all that security shit, kthxbye.
The whole thing needs to be redesigned, so that antivirus and EDR solutions do not require such high privilege. We need a high-performance way for a possibly privileged service to export all the data that is needed for a decision, and then let the AV/EDR do its thing. If the AV/EDR is broken by an update, fine. At least the system won't go down.
My company has some bios bitlocker extension installed which prompts for a password on boot, so automatic updates (one of which tried to install last night) just get stuck there in jet engine mode. Normally this is extremely annoying but today I count myself lucky - aside from a couple of people with Chromebook thin clients I am the only person showing as online in Teams right now.
An update to an internal database. It still hasn't sunk in for developers that data carries the same risk as code. An A400 crashed because of an XML file update. I have witnessed my share of critical bugs caused by "innocent" updates to "data" which were treated less seriously because of this. Management and devs alike should change their conception of this.
Canary releases aren't a magical bug free fix. They might be doing it, but the conditions that trigger a problem can be sneaky and can happen outside your canary period, rendering it useless. It's a best effort method.
But yeah, I've seen US companies, for example, doing their initial releases in the USA only, which has zero value for issues that might appear with different localization/language settings, for example.
Security technology harming security? Shocker. We need less monoculture. Trouble is monoculture pays. Write the software once, deploy it everywhere - free money.
I manage a simple Tier-4 cloud application on Azure, involving both Windows and Linux machines. Crowdstrike, OMI, McAfee and endpoint protection in general has been the biggest thorn in my side.
This is pretty wild. I woke up to a news alert on my phone stating a "global IT outage" took down banks, airlines (who were calling for a global ground stop for all flights), hospitals, emergency services, etc. Expected it to be some sort of Tier 1 Network issue. Nope, a failed update for some third party Windows security app.
Isn't a Windows BSOD the equivalent of a kernel panic? I don't understand how this is CrowdStrike's fault. Vanilla userspace operations shouldn't cause a kernel panic--that's a bug in the OS, not a bug in some user software. If anything, we should be blaming Windows here?
> Vanilla userspace operations shouldn't cause a kernel panic...
The component Crowdstrike says you need to remove to restore functionality is a ".sys" file. That's a kernel-mode driver. The fault is happening on the kernel side.
I have been told 'not to worry' because it isn't a cyber attack. Yet the outcomes we are seeing feel a lot like the doomsday predictions of what a cyberattack would do. It is almost as if we are experiencing the cybersecurity/warfare equivalent of 'friendly fire'.
This has all the hallmarks of a SSCA (Software Supply Chain Attack).
Either that or Crowdstrike is testing critical software meddling in ring zero so poorly, causing crashes and bootloops out in the wild on 100% of the deployments, that they need to get sued out of existence.
I wonder what the rollout procedure is for CrowdStrike.
I put $100 down that this was a minor update they decided was so minimal it didn't need extensive testing.
So many places use the "emergency break glass rollout procedure" on every deploy because it doesn't require all the hassle
Considering what CrowdStrike's software does, I'd say the majority of their updates could be quite easily argued as being "emergency" updates, so yeah, quite possibly they've gotten into the habit of "omg URGENT must break glass" way too often.
Maybe the world can finally reconsider their use of software products that cater to security theater. And the politics in companies which lead to things like this being introduced ("nobody gets fired for buying IBM").
"Endpoint protection" is just the new, hip term for antivirus/intrusion prevention/incident logging of the past. Why not provide immutable Linux based machines (like Chromebooks, Fedora Silverblue) which are locked down outside of the browser? I am aware that this isn't possible in some areas of the industry that rely on large amounts of Windows-only desktop software, but in many cases it may be worth a thought.
If I am being naive here, happy to hear other opinions. But I hate opening my company Windows laptop and having the fans turn to 11 just because some "security" software is parsing random files for malicious signatures or running an update that BSOD loops.
Make a live CD Linux image that mounts the NTFS drives, locates the Windows directories from the bootloader, and deletes the file.
Also, you can mount BitLocker partitions from Linux iirc. If it encounters a BitLocker partition, have it read a text file of possible keys off the USB drive.
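For what it's worth, a minimal sketch of what such a live-image script could look like, assuming ntfs-3g and a cryptsetup build new enough for BITLK support are on the image; the device path, mapper name, and the keys.txt location are placeholders I made up, not anything CrowdStrike or Microsoft publish:

#!/bin/sh
# Rough sketch: delete the bad CrowdStrike channel file from a live Linux image.
# /dev/sda3 (the Windows partition) and /run/usbstick/keys.txt are assumptions.
set -e
DEV=/dev/sda3
MNT=/mnt/windows
mkdir -p "$MNT"

if ! mount -t ntfs-3g "$DEV" "$MNT" 2>/dev/null; then
    # Partition is probably BitLocker-protected: try each recovery key from the USB stick.
    while read -r key; do
        if printf '%s' "$key" | cryptsetup open --type bitlk "$DEV" winsys; then
            mount -t ntfs-3g /dev/mapper/winsys "$MNT"
            break
        fi
    done < /run/usbstick/keys.txt
fi

# Remove the offending channel file and shut down cleanly.
rm -f "$MNT"/Windows/System32/drivers/CrowdStrike/C-00000291*.sys
umount "$MNT"
cryptsetup close winsys 2>/dev/null || true

The partition detection and error handling are hand-waved here; in practice you'd probe partitions rather than hard-code /dev/sda3.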
We routinely implement phased / canary deployments in server-side systems to prevent faults from rolling out globally. How is it possible that CrowdStrike and/or Windows does not have a similar system built in for large, institutional customers? This is outrageous.
I take it patching remote machines is going to be difficult or impossible?
I haven't used windows in years, but from what I read you need to be in safe mode to delete a crowdstrike file in a system directory, but you need some 48 char key to get into safe mode now if it is locked down?
I don’t really understand why AV updates aren’t tested before being pushed out to critical systems and I don’t understand why every system would run the same AV.
But also I don’t understand why this corporate garbageware is still a thing in 2024 when it adds so little value.
While initially everyone blamed Microsoft and then quickly pointed the finger at CrowdStrike, I'd like to call out Microsoft especially their Azure division for making the recovery process unnecessarily difficult.
1) A key recovery step requires a snapshot to be taken of the disk. The Portal GUI is basically locking up, so scripting is the only way to do this for thousands of VMs. This command is undocumented and takes random combinations of strings as inputs that should be enums. Tab-complete doesn't work! See: https://learn.microsoft.com/en-us/powershell/module/az.compu...
E.g.: What are the accepted values for the -CreateOption parameter? Who knows! Good luck using this in a hurry. No stress, just apply it to a production database server at 1 am. (A rough CLI sketch of this snapshot step follows after point 4 below.)
2) There has been a long-standing bug where VMs can't have their OS disk swapped out unless the replacement disk matches its properties exactly. For comparison, VMware vSphere has no such restrictions.
3) It's basically impossible to get to the recovery consoles of VMs, especially VMs stuck in reboot loops. The serial console output is buggy, often filled with gibberish, and doesn't scroll back far enough to be useful. Boot diagnostics is an optional feature for "reasons". Etc..
4) It's absurdly difficult to get a flat list of all "down" VMs across many subscriptions or resource groups. Again, compare with VMware vSphere where this is trivial. Instead of a simple portal dashboard / view, you have to write this monstrous Resource Graph query:
Resources
| where type =~ 'microsoft.compute/virtualmachines'
| project subscriptionId, resourceGroup, Id = tolower(id), PowerState = tostring( properties.extended.instanceView.powerState.code)
| join kind=leftouter (
HealthResources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| where tostring(properties.targetResourceType) =~ 'microsoft.compute/virtualmachines'
| project targetResourceId = tolower(tostring(properties.targetResourceId)), AvailabilityState = tostring(properties.availabilityState))
on $left.Id == $right.targetResourceId
| project-away targetResourceId
| where PowerState != 'PowerState/deallocated'
| where AvailabilityState != 'Available'
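Circling back to point 1): if it helps anyone mid-firefight, here is a rough sketch of that snapshot step via the Azure CLI rather than the Az PowerShell cmdlet linked above. The resource group, VM name, and snapshot name are placeholders I made up:

# Sketch only: snapshot the OS disk of a broken VM so it can be repaired or swapped.
# 'myRG', 'brokenVM', and the snapshot name are placeholders.
OS_DISK_ID=$(az vm show -g myRG -n brokenVM --query "storageProfile.osDisk.managedDisk.id" -o tsv)
az snapshot create -g myRG -n brokenVM-osdisk-snap --source "$OS_DISK_ID"

Wrap that in a loop over the output of the Resource Graph query above and you at least have something scriptable.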
I wonder what CrowdStrike's opsec is like re: malicious actors gaining control of their automated update servers. This incident certainly highlights the power of that type of attack, even if this one just ends up being typical human incompetence.
Crazy, isn't it? I had no issues because my group policy updates have been off since last year. Guess the "everyone must forcefully update" for "security reasons" ended up backfiring. Who could've thought?
Why doesn't CrowdStrike follow standard deployment strategies such as canary or rolling releases? A gradual update would have uncovered this bug before it reached critical mass. Doing an all-at-once update is unacceptable for critical systems.
I haven't heard anyone ask this, but would this have happened on Linux? Obviously not many people run antivirus software there, but could something similar have caused this?
Are there any protections to prevent repeating reboots?
No rolling updates? How could a 100% repro BSOD pass QC? I'm more concerned about the deployment process than the crash itself. Everyone experiences a bad build from time to time. How did this possibly go live?
Go to the advanced repair options, then from the advanced menu open cmd. Go to Windows\System32\drivers\CrowdStrike. Then list the files and delete the one matching C-00000291*.sys, using the command "del C-00000291*.sys".
How do they test this before they roll it out? It looks like a bug that's easy to spot. I would presume they test it on several configurations and, when it passes the test (a reboot), they roll it out. Was this tested at all?
I was watching TV this morning in France (TF1, 8:00 CET) and the weather forecast map system was out. The presenter just gave us the information as if he were on the radio, saying he was sorry the system was failing.
They sponsor the Mercedes F1 team https://crowdstrikeracing.com/f1/about-partnership/ , who have a race this weekend and practice sessions today. It'd be funny if their cars can't go on track because their computers are down...
So, why did our little company's (little used) two Windows machines not BSOD overnight? They were just sitting idle. They run CS Falcon sensor. Did the update force a restart? Didn't seem to happen here.
Why would they roll out this update globally to all users immediately? Isn’t it normal to do gradual rollouts? Or did this update contain some critical security fix they wanted everyone to have as fast as possible?
People at my workplace were affected but I dodged the bullet because I left my computer turned on overnight because I always want to be able to RDP in the next morning in case I decide to stay home.
Can we end the whole "loading a kernel rootkit" thing? AFAIK Apple already shuns kernel extensions. What's preventing Microsoft from doing the same? As a bonus, shit like anti-cheat will go away too.
It is humbling (and lowkey reassuring?) to know that not all large players use the absolute cutting-edge approaches in their workflows.
It seems, and I hope, that after all is said and done there will be no major life-threatening consequence of this debacle. At the same time, my heart goes out to the dev who pushed the troubling code. It's very easy to point at them or at the team's processes, but we need to introspect on our own setups and also recognize that not all of us work on crucial systems like this.
Does crowdstrike work similarly on MacOS? I have to imagine the "walled garden" doesn't allow for 3rd parties to insert themselves into the OS kernel but I could be wrong.
I'd bet my career CS isn't spending enough on QA. It's always the first thing to be cut, no one cares about QA when everything is going well, but when things go wrong...
I just wanted to mention that Microsoft has 3 tiers of Windows beta releases before changes are pushed to production. I can't comprehend how this wasn't noticed before.
Probably a stupid question, but how can the Windows kernel recover so well after a graphics driver crash while being unable to do the same for other kinds of drivers?
What would be funny is if CrowdStrike demanded ransom from their customers.
security is a great business - you play on people's fears, your product does not have to deliver the goods.
like the lock maker, you sell a lock, the thief breaks it, but it is not your problem, and you sell a bigger badder lock the next year which promptly gets broken.
as a business, you don't have any consequences for how your product works or doesn't work. what a great business to be in!
The postmortem should be interesting. I can't imagine how even just basic integration testing didn't catch this, much less basic best practice like canarying.
So - what is the lesson learned? The only clear message for me is that critical programs that also demand kernel level access maybe shouldn't update themselves.
I’m guessing it’s completely incidental that the CEO of crowdstrike was critical of China earlier this year, and that China is somehow unaffected by this ‘global’ issue!
Rolling out updates in an A/B test slowly is the only way to reduce the occurrence of such issues _significantly_. There's no other way, literally, nothing.
Crowdstrike seems like the kind of thing that's sold to CEOs at conferences, forced on IT against objections, and the subject of a lot of discussion at Defcon.
The workaround suggests removing a file with .sys extension. What does the file do normally? If removed, what happens to the state of security on that system?
That's what you get for letting a company install a root kit on your servers and desktops ;-)
I mean, don't they do canary updates on CrowdStrike too? Every Windows admin has done this for the last 5+ years, test Windows updates on a small number of systems to see if they are stable. Why not do the same for 3rd party software?
I know I have the benefit of hindsight in this regard, but how isn't there redundant checks and tests that would prevent a mishap of this magnitude?
I mean, there should be extensive automated testing using many different platforms and hardware combinations as a prerequisite for any rollout.
I guess this is what we get when everything is opaque, not only the product and the code, but also the processes involved in maintaining and evolving the solution. They would think twice about not investing heavily in testing their deployment pipelines if everyone could inspect their processes.
It might also be the case that they indeed have a thorough production and testing process deployed to support the maintenance of crowdstrike solutions, but we are only left to wonder and to trust whatever their PR will eventually throw at us, since they are a closed company.
Until corporate decides to install new MDM software on the usually-blazingly-fast apple silicon chips :(
I can't even open files larger than 500 lines without my whole system slowing to a crawl because of the insanely aggressive and slow "antivirus" bloatware the MDM forces on me.
Their Windows sensor has made development almost unworkable. Not sure why but I haven't noticed the OSX sensor slow things down appreciably. I suspect my Windows profile is configured to be more aggressive?
I'm amazed it's just 14%, not more like 75%-80%. Surely a lot of customers are going to uninstall and move to competitors. The remainders are at least going to demand much cheaper service with better guarantees going forward.
Random strangers running unknown, untrusted code on your computers is the worst. It's a good thing we patched that security flaw by letting the _right_ random strangers run unknown, untrusted code on our computers.
As something of a friendly reminder, it was Microsoft this time, but it's a matter of "when" not "if" till every other OS with that flavor of security theatre is similarly afflicted (and it happens much more frequently when you consider the normal consequences of a company owning the device you paid for -- kicked out of email forever, ads intruding into basic system functions, paid-in-full device eventually requires a subscription, ...). Be cautious with automatic updates.
In my org, none of the essential systems went down (those used by labor). However all of management's individual PCs went down which got me wondering... Is this the beginning (or continuation) of whittling down what is "essential" human labor versus what could be done remotely (or eliminated completely)?
Or perhaps Microsoft is just garbage and soon will be as irrelevant as commercial real estate office parks and mega-call centers
My read is that Crowdstrike's update agent downloaded new security threat definitions and those definitions exposed a bug in the existing Crowdstrike drivers, causing the disaster.
CrowdStrike today has shown why it's absolutely crucial to test code before deployment, say no to YOLO deployments with LLM powered software testing https://github.com/codeintegrity-ai/mutahunter
I just can't imagine how it passed tests for a common configuration exhibited by a large number of Windows machines. Stuff can always go wrong, but "the OS does not boot" should be caught?
We had a few machines come out of the boot loop - only to re-enter it 20 mins later. I am sure CS pulled the patch from their CDNs but ...maybe some cached versions still linger?
this is really microsoft's fault for handing out kernel access to random 3rd parties, none of which are doing anything special that microsoft couldn't implement themselves (AV, anti-cheat, security)
Or do what Apple does: disallow kernel extensions and provide rigid kernel facilities for VPN clients, EDR agents, etc. to use, so they don't have to implement custom code resident in the kernel.
Apple can disallow kernel extensions because it fully controls the entire hardware and software stack. Everything that would need to be an extension is already in the kernel and Apple knows all of those things.
Somewhere out there, there is an engineer with the biggest "I told you so" shit eating grin scrolling through every social media site and basking in the glory.
Was involved in a "security mandated" mandatory rollout of Crowdstrike at my prior company.
This software was utter shit, and broke stuff all over the place. And installs itself as basically malware into critical paths everywhere. We objected to ever using it as a SPOF, but was overruled.
So yeah, not remotely surprised this happened.
Any kind of middleware/dynamic agent is highly suspect in my experience and to be avoided.
Fix:
1. Boot Windows into Safe Mode or the Windows Recovery Environment
2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
3. Locate the file matching “C-00000291*.sys”, and delete it.
4. Boot the host normally.
dumb techbro c-suites: what, why would you have an issue with a proprietary closed source app that frequently self updates and sends tons of data to a third party while essentially being a backdoor? We said we wanted security and this has Security(tm) all over the literature! Look we even have dashboards for the gui-ninjas like the security team!
That's called [Microsoft Defender for Endpoint](https://learn.microsoft.com/en-us/defender-endpoint/), which is used even on Linux servers in big corporations. (Largely because it's the easiest way to complete box ticking exercises with Windows servers: once you have it, it's easy to decide to extend it to non-Windows machines as well.)
The binary self-upgrades and runs in highly privileged mode, so it might not be immune from the kind of failure CrowdStrike had here. Though apparently there's at least a way to use a local mirror so you have some control on the updates: https://learn.microsoft.com/en-us/defender-endpoint/linux-su...
I know there's a better word to be used here, but what initially looked like a massive cyberattack turning out to be a massive defender foot-broom is chefs kiss.
I saw it was Windows and went to bed. What a great feeling.
I'm sorry to those of you dealing with this. I've had to wipe 1200 computers over a weekend in a past life when a virus got in.
Did I receive any appreciation? Nope. I was literally sleeping under cubicle desks, bringing up isolated rows one by one. I switched everything in that call center to Linux after that. Ironically, it turned out it was a senior engineer's SSH key that had leaked somehow and was used to get in and dig around servers in our datacenter outside of my network. My filesystem logging (in Windows, coincidentally) alerted me.
> Don't solicit upvotes, comments, or submissions. Users should vote and comment when they run across something they personally find interesting—not for promotion.
> A "content update" is how it was described. So, it wasn’t a major refresh of the cyber security software. It could have been something as innocuous as the changing of a font or logo on the software design.
I think we have reached an inflection point. I mean, we have to make an inflection point out of this.-
This outage represents more than just a temporary disruption in service; it's a black-swan cause célèbre for the perilous state of our current technological landscape. This incident must be seen as an inflection point, a moment where we collectively decide to no longer tolerate the erosion of craftsmanship, excellence, and accountability that I feel we've been seeing all over the place. All over critical places.-
Who are we to make this demand? Most likely technologists, managers, specialists, and concerned citizens with the expertise and insight to recognize the dangers inherent in our increasingly careless approach to ... many things, but, particularly technology. Who is to uphold the standards that ensure the safety, reliability, and integrity of the systems that underpin modern life? Government?
Historically, the call for accountability and excellence is not new. From Socrates to the industrial revolutions, humanity has periodically grappled with the balance between progress and prudence. People have seen - and complained about - life going to hell, downhill, fast, in a hand basket without brakes since at least Socrates.-
Yet today's technological failures have unprecedented potential for harm. The CrowdStrike outage halted businesses, disrupted hospitals, and posed serious risks to safety; consequences that were almost unthinkable in previous eras. This isn't merely a technical failure; it's a societal one, revealing a disregard for foundational principles of quality and responsibility. Craftsmanship. Care and pride in one's work.-
Part of the problem lies in the systemic undervaluation of excellence. In pursuit of speed and profit uber alles. Many companies have forsaken rigorous testing, comprehensive risk assessments, and robust security measures. The very basics of engineering discipline—redundancy, fault tolerance, and continuous improvement—are being sacrificed. This negligence is not just unprofessional; it’s dangerous. As this outage has shown, the repercussions are not confined to the digital realm but spill over into the physical world, affecting real lives. As it always has. But never before have the actions of so few "perennial interns" affected so many.-
This is a clarion call for all of us with the knowledge and passion to stand up and insist on change. Holding companies accountable, beginning with those directly responsible for the most recent failures.-
Yet, it must go beyond punitive measures. We need a cultural shift that re-emphasizes the value of craftsmanship in technology. Educational institutions, professional organizations, and regulatory bodies must collaborate to instill and enforce higher standards. Otherwise, lacking that, we must enforce them ourselves. Even if we only reach ourselves in that commitment.-
Perhaps we need more interdisciplinary dialogue. Technological excellence does not exist in a vacuum. It requires input from ethical philosophers, sociologists, legal experts. Anybody willing and able to think these things through.-
The ramifications of neglecting these responsibilities are clear and severe. The fallout from technological failures can be catastrophic, extending well beyond financial losses to endanger lives and societal stability. We must therefore approach our work with the gravity it deserves, understanding that excellence is not an optional extra but an essential quality sine qua non in certain fields.-
We really need to make this an actual turning point, and not just another Wikipedia page.-
It is for social security, taxes, unemployment benefits, whatever. And running under a foreign TLD, .ME for Montenegro. I am not a security specialist. But I think this is asking for trouble.
By the way, do you remember when fuck.yu became fuck.me ?
I want to add something to the discussion but it's difficult for me to accurately summarize and cite things. In a nutshell, there appears to be a lot of tomfoolery with CrowdStrike and the stuff that happened with the DNC during the 2016 election. Here's some of what I'm talking about:
This 2017 piece talks about doubt behind CrowdStrike's analysis of the DNC hack being the result of Russian actors. One of the groups disputing CrowdStrike's analysis was Ukraine's military.
https://www.voanews.com/a/crowdstrike-comey-russia-hack-dnc-...
"For one, the vulnerability he claims to have used to hack the NGP VAN ... was not introduced into the code until an update more than three months after Guccifer claims to have entered the DNC system."
"This was a very egregious breach and our data was stolen," Mook said. "We need to be sure that the Sanders campaign no longer has access to our data."
"This bug was a brief, isolated issue, and we are not aware of any previous reports of such data being inappropriately available," the company said in a blog post on its website.
By chance, I watched a few episodes of 911 and kept thinking that it was all completely unrealistic nonsense. Then there's an episode where the entire emergency call system for LA goes down, and even though there were different reasons in the episode (a transformer fire), I couldn't have imagined that it was actually possible to completely disable the emergency call system (and what else) of a city.
Here’s my take as a security software dev for 15 years.
We put too much code in kernel simply because it’s considered more elite than other software. It’s just dumb.
Also - if a driver is causing a crash MSFT should boot from the last known-good driver set so the install can be backed out later. Reboot loops are still the standard failure mode in driver development…
Not possible in this situation, the "driver" is fine, it's a file the driver loads during startup that is bad, causing the otherwise "good" driver to crash.
Going back to an earlier version—since the driver is "good"—would just re-load the same driver, which would load the updated file and then crash again.
Hopefully now people might wake up to the idea that these tech monopolies are not leading to safe, secure and reliable systems. They will wonder how a third party component could cause such breakage. I expect many will be calling for regulation.
I used to laugh at Dijkstra's idea that all code should be mathematically proven correct. I thought of it as a laughable idea from yet another out-of-touch mathematician.
I suppose true genius is seldom understood within someone's lifetime.
If I weren't an atheist I would say this is god's punishment for installing malware on your employees' machines, on one hand, and for being a spineless patsy for management by letting them install that crap on your work machine.
I mean... installing what is essentially a 3rd party enterprise rootkit that not only has root access to all files and network activity but also a self-update mechanism ... who could have seen this coming?
I'm curious about investing and economy, and I always wonder about P/E ratios like Crowdstrike's (currently 450-something, was over 500 last week).
Some P/E ratios for today, for some companies I find interesting:
- Shopify: 615.12
- Crowdstrike: 455.70
- Datadog: 341.98
- Palantir: 212.34
- Pinterest: 187.67
- Uber: 99.0
- Broadcom: 77.68
- Tesla: 58.33
- Autodesk: 52.36
- Adobe: 49.23
- Microsoft: 37.97
What's going on here? Do investors expect Shopify, for example, to increase their earnings by an order of magnitude despite already having done extraordinarily well in a very competitive market? Can anyone ELI5?
Former equity analyst here. Nobody on "The Street" is actually valuing these companies on PE ratios. Tech companies often intentionally re-invest earnings back into the business in real time and so their reported EPS is often quite low and a poor metric to evaluate the underlying business on. So instead, analysts typically use other metrics like EV/EBITDA or even P/Sales ratios in their valuation models.
Very generally speaking, trading these companies is kind of more of like placing a bet on whether or not their future top-line growth will be dramatically different than the market's current expectations.
The only common belief held by investors in a stock is that the price is going to go up. You may have value investors with a belief that Shopify is undervalued based on earnings, you may have investors betting that the rest of the market will buy Shopify, you may have people who’ve seen the line go up and decided to buy…
Stock prices have been decoupled from earnings or “value” for a long time now and that’s toothpaste we will never get back in the tube. We are in the Robinhood age where you can buy and sell a stock in seconds with no effort.
> Stock prices have been decoupled from earnings or “value”
No, they aren't, but the market can remain irrational for longer than you can remain solvent. It doesn't help that our dear government seems loath to actually ensure competitive markets.
not regardless, but only if. The future is unknown, so their bet is also based on that unknown. Is it foolish? Who knows. Did nvidia seem foolish if somebody made that bet before their ai boom?
You can also just say things you don't understand are always created by fools.
Now, there are some fools buying these stocks. But to say that each one of these has a high P/E because every shareholder is a fool is very reductionary.
> these are real, well run companies with good fundamentals
I'm not disputing that. But even "real" companies don't warrant P/E multiples in the three-digit range, unless there's a very good reason to expect them to grow their profits by 10x or more in the foreseeable future – and that has to be the expected value of earnings growth (roughly, the average growth over all possible futures), discounted by the time value of the investment.
P/E multiples over 100 are practically never justifiable, except as "someone else will come along and pay even more" – i.e., the greater fool theory.
TBH I don't think many figures here make any financial sense -- but I gotta hold it if my friends all hold it. And once everyone holds it no one is allowed to mass sell it because it's going to hurt your friends, and in finance that's a sin.
With numbers like that, either the market is crazy or the market believes the actual meaningful earnings are substantially higher than the GAAP reported numbers. Although even there the difference would have to be pretty big.
There's a lot of comments knocking the due diligence, but the call out of the threat vector and timing of this make it a bit hard to brush off as coincidence.
The problem was Windows giving arbitrary access to the kernel to software that can be updated OTA without user intervention and allowing that to crash the kernel, right? Wouldn't this mean that Windows is considerably less secure and stable than assumed?
Not really. These were kernel modules authorized and installed by the system admin. Of course kernel code runs the risk of crashing your system. The same is true on Linux, and according to another commenter it already has happened with Crowdstrike for Linux
He says he bought seven put contracts for $7.30 at the $185 strike. Absolute max profit, from CS going to $0, would be (185 − 7.30) × 7 × 100 = ~$125k.
I don’t know if the absolute amount of profit affects decisions here. It seems if he were more certain of what’s going on he would have bet a lot more.
> It seems if he were more certain of what’s going on he would have bet a lot more.
Outside of the HN bubble, $125K is already a pretty big sum of money to get all at once, and unlikely to bring too much scrutiny, if it was somehow not a coincident. Seems like a smart strategy, if the user was sitting on inside information and didn't want to ring too many alarm bells.
However posting on reddit about it, would not be such a smart strategy. I think it's genuinely just a coincidence, WSB gets plenty of worthless "DD" posts every day that end up amounting to nothing.
This smells of insider trading. Someone internal at CrowdStrike (or a relative/friend of theirs) got wind of this and is trying to cover themselves if they get investigated.
Reading the post, it's obvious they don't have a deep understanding of tech, despite that being core to their thesis.
It’s prohibitively hard to hack into a “cloud system” due to few possible entry points - as a reddit commenter said, open S3 buckets are tough to crack!
That is not an overstatement. This is literally the largest failing of internet infrastructure to date.
Alas, using the internet has given us a lot of efficiency. The trade-off is resilience. The entire global system is more brittle than ever, but that is what gave it such speed.
I'd argue the infrastructure of the Internet isn't to blame here, it sounds like a software/config bug at Crowdstrike. There are wider discussions around over-reliance on cloud-based tech too. But the good old Internet can hold its head up high IMHO.
I'm not really sure cloud has much blame here either.
Imagine it's 1998 and Norton push a new definition file that makes NAV think kernel32 is a virus. The only real difference today is that always-on means we all get the update together, instead of waiting for mum to get off the phone this evening.
We got an email this morning telling us none of our usual airlines could take bookings right now. That wouldn't have been much different in 1998, airline bookings have been centralised for my entire lifetime.
In a way, this might end up being a blessing in disguise. It's an emergency drill for something potentially catastrophic (e.g. massive cyberattack, solar flare), and it's a large enough wake-up call that society can't just ignore it.
This is not an Internet failure but a software functionality failure caused by a cybersecurity update. This is why I never use hosted cloud security, but rather run home cloud security on my server PC/NAS/SAN. When it crashes, okay, so be it, I don't get access to my cloud; but I always use an online backup cloud that doesn't need installation for the most important stuff. You should never rely on one piece of software to do everything; always have backups.
Indeed, this is different. This is worldwide, and it doesn't only affect Windows PCs; some friends of mine have Linux and Macs in trouble too. This is not your Windows Vista BS problem, etc.
It's most likely the flame-war filtering algorithm of HN. Posts that generate a lot of discussion quickly are down-ranked until an admin fixes the rank manually, or not.
Funny how I got rejected today from crowdstrike because I couldn’t code a hard leetcode problem under 40mins. I guess leetcode isn’t true software engineering after all.
This is a testing and deployment issue rather than coding... mistakes and bugs happen - but most serious businesses have routines setup to catch them before rolling them out globally!
Their "Statement" is remarkably aloof for having brought down flights, hospitals, and 911 services.
"The issue has been identified, isolated and a fix has been deployed."
Maybe I'm misunderstanding what I read elsewhere, but is the machine not BSODing upon boot, prior to a Windows Update service being able to run? The "fix" I see on reddit is roughly:
Workaround Steps:
1. Boot Windows into Safe Mode or the Windows Recovery Environment
2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
3. Locate the file matching “C-00000291*.sys”, and delete it.
I'm horrified at the thought of tens of thousands of novice Windows users digging through System32 to delete driver files; can someone set my mind at ease and assure me this will eventually be fixed in an automated fashion?
Of course it can be fixed in an automated fashion; it just requires effort. The machines should have netboot enabled so that new validated operating system images can be pushed to them anyway, so you just write a netboot script to mount the filesystem and delete the file, then tell the netboot server that you're done so it doesn't give you the same script again when it reboots.
It's like two hours of work with dnsmasq and a minimal Linux ISO. The only problem is that much of the work is not shareable between organisations; network structures differ, architectures may differ, partition layout may differ, the list of assets (and their MAC addresses) will differ.
Edit: + individual organisations won't be storing their BitLocker recovery keys in the same manner as each other either. You did back up the recovery keys when you enabled BitLocker, right? Modern cryptsetup(8) supports a BITLK extension for unlocking said volumes with a recovery key. Again, this can be scripted.
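For anyone curious what the dnsmasq side of that looks like, a rough sketch, assuming a proxy-DHCP setup running alongside the existing DHCP server and a PXE payload (pxelinux plus the minimal Linux image) already placed under /srv/tftp; the subnet and file names are placeholders:

# Sketch: dnsmasq as proxy-DHCP + TFTP so PXE clients chain-load the recovery image.
# The real DHCP server keeps handing out addresses; 192.168.0.0 and /srv/tftp are placeholders.
dnsmasq --no-daemon \
  --port=0 \
  --dhcp-range=192.168.0.0,proxy \
  --enable-tftp --tftp-root=/srv/tftp \
  --pxe-service=x86PC,"CrowdStrike recovery",pxelinux

From there the booted image's init just runs the mount-and-delete step and reboots, as described above.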
> so you just write a netboot script to mount the filesystem and delete the file
Because writing such a script (one that mounts the filesystem and deletes a file) under stress and time constraints is a great idea? That's a recipe for a worse disaster. The best solution, for now, is to go PC by PC manually. The sole reason the situation is what it is was the lack of testing before the rollout.
If the affected organizations had such an organized setup, they probably won't need crowdstrike in the first place. The product is made so that companies that don't understand (and won't invest) in security can just check that box by installing the software. Everyone is okay with this.
> I'm horrified at the thought of tens of thousands of novice Windows users digging through System32 to delete driver files; can someone set my mind at ease and assure me this will eventually be fixed in an automated fashion?
Nope. Both my orgs (+2000 each) have sent out a Google doc to personal emails on using CMD Prompt to delete that file.
Anyone with technical experience is being drafted to get on calls and help people manually delete this file.
The thousands of laptops my wife's work uses are BitLockered. I went to fix the issue, and that's when I found that out. I wonder if they will be giving out the keys or if IT will require hands-on access to those laptops to fix it... what a shitshow.
I agree but I've also personally witnessed how effective this crap is on a certain cohort of IT managers. You can see the 3 or 4 gears grinding together in their head... something like "oh my goodness look at all the things I get for one purchase order!".
The antivirus did its job: now you can't get viruses. Jokes aside, I've checked their website and it was full of AI buzzwords, so I guess that happens when you focus on nonsense instead of what your customers actually need (I know that all antiviruses have a machine learning component, but usually you don't advertise it as some sort of AI just to get a better stock price).
I think the AI talk is just the fashion now amongst C-level execs. Their product - no matter what it does - suddenly needs some sort of AI integration.
1. Stop putting mission-critical systems on Windows; it's not the reliable OS it once was, since MS has cut most of its QA
2. AV solutions are unnecessary if you properly harden your system, AV was needed pre-Vista because Windows was literally running everything as Administrator. AV was never a necessity on UNIX, whatever MS bundles in is usually enough
3. Do not install third party software that runs in kernel mode. This is just a recipe for disaster, no matter how much auditing is done beforehand by the OEM. Linux has taught multiple times that drivers should be developed and included with the OS. Shipping random binaries that rely on a stable ABI may work for printers, not for mission critical software.
None of this advice is useful for massive organizations like banks and hospitals who got hit by this. They cannot switch off of windows for a number of reasons.
There's nothing they can do right now, but my issue is that this will be forgotten when next update/purchasing round swings into action.
Take Mærsk who couldn't operate their freight terminals due to a cyber attack and had the entire operation being dependent on a hard drive in a server that happened to be offline. Have they improved network separation? Perhaps. Have they limited their critical infrastructure to only run whitelisted application? I assure you they have not. They've probably just purchased a Crowdstrike license.
Companies continuously fail to view their critical infrastructure as critical and severely underestimate risk.
Mærsk is kind of a bad example, because they made real security mitigations afterwards.[0] I cannot speak to whether they whitelist applications, but neither can you.
That's the reason why I wrote "stop putting" instead of "throw all of your PCs out of the window". Just like they migrated away from DOS, they should start planning to migrate away from Windows to more modern, sandboxed solutions. There are ZERO reasons why a cash register shouldn't boot from a read-only filesystem instead of running AV and so on.
All of the hardware attached to workstations in our hospital is designed for Windows. Certain departments have specific needs as well and depend on software that is Windows-only. After decades of Windows it develops an insidious grasp that is difficult to escape, even more so when your entire industry is dependent on Windows.
Switching away from Windows wouldn't just be extremely costly from an IT perspective; it would require millions of dollars in new hardware. We are in the red in part because of the pandemic, in part because of existing problems in our industry accelerated by the last few years, and in part because a large percentage of our patients are on Medicare, for which the federal government shrinks fixed service payments every year.
I can't imagine convincing our administration to switch over to Linux across the hospital without a clear, obvious, and more importantly short-term financial payoff.
I'm working for a company that has no Windows boxes at all, anywhere. Sure, some Windows software has no alternatives. We're running all of those programs in VMs.
Does this make financial sense? Probably not in the short run, which is an issue for most companies nowadays. But in the long run? I think it's the right choice.
It is not the hardware that's designed for Windows but the driver code, which is most probably written in plain C and can most probably be cross-compiled for use outside Windows – so instead of millions of dollars in new hardware it is really thousands in porting the drivers and GUIs to the new platform. What works on Windows is in 90% of cases an easy porting job for the manufacturer; they just won't do it unless customers stop paying for the Windows version and are willing to pay for a port to an alternative platform.
Anyway, I totally agree with you. The convincing part is anything but clear and obvious to administration types. Until MS finally bricks its OS and renders it totally unusable, they can continue to do whatever shit they want and keep mocking their loyal customers forever.
Well, there’s this one app, written in VB6 using lots of DCOM that produces XML and XSLT transforms that only work in IE6, and the entire organisation depends on it, and the nephew who built it is now a rodeo clown and is unavailable for consultation.
1/ imagine running >1000 legacy applications, some never updated in 20 years
2/ imagine a byzantine mix of local data centers, VPCs in aws/gcp/azure
3/ imagine an IT department run by a lot of people who have never learned anything new since they were hired
That would be your typical large, boring entity such as a bank, public utility or many of the big public companies.
Yeah, there is no law of physics preventing this, but it's actually nearly impossible to disentangle an organization from decades of mess.
People have continued to run old management systems inside of virtual machines and similar solutions. You can sandbox it, reset it, do all kinds of wondrous things if you use modern technologies in an era-appropriate way. Run your old Windows software inside of a VM, or tweak it to run well on Wine if you have the source. The reason this mess happened is that all of that software is literally running on a desktop OS in mission critical applications.
I have worked as an embedded engineer for a while and I can't count the amount of nonsensical stuff I've seen incompetent people running on unpatched, obsolescent Windows XP and 7 machines. This mess is 100% self inflicted.
I think these are just technical excuses; the real answer lies somewhere in the fields of politics and economics. If the people in charge make the decision, then we tech nerds will migrate and refactor 1000 applications and untangle 20 years of byzantine code mess. I've seen entities so large and boring they could barely move one step change rapidly and evolve once their economic stability was at stake, and this is a great example of the kind of disruption that can push them over the edge into change.
This issue could easily happen on any other OS - Linux, macOS, BSDs - because it's a third party kernel driver which would be installed by the corporate IT regardless of anyone's opinion for compliance reasons. Your advice is incompatible with how the real world operates.
Alas in the world of B2B, contracts from larger companies nearly always come with lists of specific requirements for security controls that must be implemented, which nearly always include requiring anti-virus.
It's just not as simple as commenters on this thread wish!
The contracts are rarely specifying stuff like antivirus explicitly, but instead compliance with one or more of the security standards like PCI DSS. Those say you have to use antivirus, but they all have an escape hatch called a "compensating control" which is basically "we solved the problem this is trying to solve this other way that's more conducive to our overall security posture, and got the auditor to agree with us".
My source: I review a lot of contracts. It's very common for things to be explicitly required.
Yes you can go back and forth and argue the toss, but it pushes up the cost of the sale and forces your customer to navigate a significant amount of bureaucracy to get a contract agreed. Or you could just run AV like they asked you to...
Can you propose an example of a compensating control for an "antivirus" that had a chance to pass? Would you propose something like custom SELinux/Apparmor setup + maybe auditd with alerting? Or some Windows equivalent of those.
compensating controls ftw. the spirit of the law vs the letter of the law. our system was more secure with the compensating controls vs the prescribed design. this meant not having to rotate passwords, because fuck that noise.
Same, I’ve been in an org that got PCI-DSS level 1 without antivirus beyond Windows Defender or any invasive systems to restrict application installation.
It did involve a lot of documentation of inter-machine security controls, network access restriction and a penetration test by an offensive security company starting with a machine inside the network, but it can be done! Also in my opinion it gives you a more genuinely secure environment.
Nothing like that, basically what sitharus said above you. Extra network level, zero trust to minimize lateral movement and giving the pen testers a leg up by letting them start already within the corporate network.
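To make that concrete: in my experience a compensating control for "antivirus" usually ends up being some mix of application allow-listing plus detective alerting, backed by evidence. Below is only a minimal sketch of the alerting half, assuming auditd is already logging execve syscalls; the log path, the approved prefixes, and the alert hook are all placeholders for whatever your environment actually uses, not something an auditor signed off on:

    import re
    import time

    AUDIT_LOG = "/var/log/audit/audit.log"   # assumes auditd is writing SYSCALL records here
    APPROVED_PREFIXES = ("/usr/bin/", "/usr/sbin/", "/opt/approved/")  # hypothetical allow-list

    def alert(msg):
        # stand-in for whatever pager/SIEM integration you actually use
        print("ALERT:", msg)

    def follow(path):
        """Yield lines appended to the file, like `tail -f`."""
        with open(path, "r") as f:
            f.seek(0, 2)  # jump to the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                yield line

    exe_re = re.compile(r'exe="([^"]+)"')

    for line in follow(AUDIT_LOG):
        # 59 is execve on x86_64; skip everything else
        if "type=SYSCALL" not in line or "syscall=59" not in line:
            continue
        m = exe_re.search(line)
        if m and not m.group(1).startswith(APPROVED_PREFIXES):
            alert(f"unapproved executable ran: {m.group(1)}")

The point isn't that this particular script would satisfy an auditor; it's that "we alert on anything executing outside these paths, and here's the evidence" is the kind of argument compensating controls get built from.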
> AV was never a necessity on UNIX, whatever MS bundles in is usually enough
What prevents someone from pushing a malicious package that takes my user data (which is accessible directly from a logged-in session) and sends it somewhere? Especially in non-system repos, like Maven/NuGet/npm/pip/RubyGems and so on? What about the too-widespread practice of piping shell scripts from the web, or applications with custom update mechanisms that might be compromised and pull in malicious code?
I'm not saying that AV software would protect against all of these, but even if users don't do stupid things (which they absolutely will anyways, sooner or later), then there are still vectors of attack against any system.
As for why *nix systems don't see that much malware, I've no idea, maybe because it's not as juicy of a target because of the lower count of desktop software installations (though the stuff that is on the systems might be more interesting to some, given the more tech savvy userbase), or maybe because a lot of the exploits focus on server software, like most CVEs.
On Windows, I guess the built in AV software is okay, maybe with occasional additional scans by something like Malwarebytes, but that's situational.
Nothing, in fact there have been many cases where python's and nodejs's package systems were exploited to achieve arbitrary code execution (because that's a feature, not a bug, to allow "complicated installation processes to just work").
AVs are the wrong way to go about security anyway; it's a reactive strategy in a cat-and-mouse game by definition. For prevention, I think the BSDs are doing some promising work with the "pledge" mechanism. And as much hate as they get, I like AppImages and snap et al. for forcing people to consider a better segmentation model and permission system for installed software.
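On the package-repo vector specifically: one mitigation that doesn't involve an AV at all is hash pinning, i.e. refusing to install anything whose digest you haven't reviewed (pip supports this natively with --require-hashes). A rough sketch of the same idea as a standalone check; the file name and digest below are made-up placeholders:

    import hashlib
    import sys

    # hypothetical pins; in real life you'd generate these once from artifacts
    # you have actually reviewed (or let `pip install --require-hashes` do it)
    PINNED = {
        "example_pkg-1.0.0-py3-none-any.whl":
            "0000000000000000000000000000000000000000000000000000000000000000",
    }

    def verify(path: str) -> None:
        name = path.rsplit("/", 1)[-1]
        expected = PINNED.get(name)
        if expected is None:
            raise SystemExit(f"{name}: not in the pin list, refusing to install")
        digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
        if digest != expected:
            raise SystemExit(f"{name}: hash mismatch, possible tampering")
        print(f"{name}: OK")

    if __name__ == "__main__":
        for artifact in sys.argv[1:]:
            verify(artifact)

It obviously doesn't help if the pinned version was malicious to begin with, but it does stop a later compromised release from silently replacing it.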
Crowdstrike agent is theoretically able to detect that what you just pipe-installed is now connecting to a known command and control server and can act accordingly.
Carbon Black will block any executables it pulls down though. And I think it may also block scripts as well. Executables have to be whitelisted before they can run.
It's an extremely strict approach, but it does address the situation you're talking about.
If you write a batch file on a Windows PC with Carbon Black on it, you will not be able to run it. Of course there is customisation available to tweak what is/isn't allowed.
Yes, but that's like 1% of the actual surface area for "running a script". I am not a Windows expert but on, say, Linux you can overwrite a script that someone has already run, or modify a script that is already running, or use an interpreter that your antivirus doesn't know about, or sit around and wait for a script to get run and then try to swap yourself into the authorization that gets granted for that, or…there's a whole lot of things. I assume Windows has most of the same problems. My confidence in Carbon Black stopping this is quite low.
If your malicious script starts doing things like running well known payloads or trying to move laterally or access things it really shouldn't be trying to access AV will flag/block it.
No one is suggesting it is 100% coverage, but you would be surprised at the amount of things XDR detects and prevents in an average organization with average users. Including the people who can't stop clicking YourGiftcard.pdf.exe.
I am not against trying to protect against people who do that. The problem is that you pay XDR big bucks to stop a lot more than that, and this mostly doesn't work.
In a perfect world, AV software wouldn’t be necessary. We don’t live in a perfect world. So we need defense-in-depth, covering prevention, mitigation, and remediation.
> What prevents someone pushing a malicious package that takes my user data
That's not an argument in good faith. If you install unvetted packages in your airline control system, bank, or supermarket, the kind of systems that we're talking about here, you have much bigger problems to worry about.
> I'm not saying that AV software would protect against all of these,
Or indeed any of these. Highly privileged users piping shell scripts from untrusted sources is out of scope for any antivirus system, on any platform.
That doesn't mean all platforms are identical, or share the same attack vectors. It is much more accepted to install kernel mode drivers on the Windows platform, where it is not only accepted but there are established quality control programs to manage it, than on Linux, where the major vendor will quite literally show you the middle finger on video, for everyone to see, for doing so.
The Linux community is more for doing that kind of work upstream. If some type of new access control or binary integrity checking is required, that work goes upstream for everyone to use. It is not bolted onto running systems with kernel mode drivers. That is because Linux is more like a shared platform, and less like a "product". That culture goes way beyond mere technical differences between the systems.
> If you install unvetted packages in your airline control system, bank, or supermarket, the kind of systems that we're talking about here, you have much bigger problems to worry about.
Surely we can agree that if it's a vector with an above 0% chance of it being exploited, then any methods for mitigating that are a good thing. Quite possibly even multiple overlaid methods for addressing the same risks. Defense in depth and all, the same reason why many run a WAF in front of their applications even though someone could just say: "Just have apps that are always up to date with no known CVEs".
> Or indeed any of these. Highly privileged users piping shell scripts from untrusted sources is out of scope for any antivirus system, on any platform.
You don't even have to be highly privileged to steal information, e.g. an "app" for running some web service could still serve to exfiltrate data. As others have mentioned, maybe this is not what AV software has been historically known for, but there are definitely pieces of software that attempt to mitigate some of the risks like this.
I'd rather have every binary or piece of executable code be scanned against a frequently updated database of bad stuff, or use heuristics to figure out what is talking with what, or have other sane defaults like preventing execution of untrusted code or to limit what can talk to what networks, not all of which is always trivial to configure in the OSes directly (even though often possible).
I won't pretend that AV software is necessarily the right place for this kind of functionality, but I also won't pretend that it couldn't be an added benefit to the security of a system, while also presenting different risks and shortcomings (a threat vector in and of itself, or something that impacts system stability at worst, or just a hog on resources and performance in most cases).
Use separate VMs, use secret management solutions, use separate networks, use principle of least privilege, make use of good system architecture, have good OS configuration, use WAFs, use AV software, use scanning software, use dependency management alerting software, use static code analysis, use whatever you need to mitigate the risk of waking up and realizing that there's been a breach and that your systems are no longer your own.
Even all of that might not be enough (and sometimes will actually make things worse), but you can at least try.
In that we can agree. But I would put "build on operating systems intended for the purpose" on top of that list, too. There is no excuse for building airline or bank systems on office operating systems and trying to compensate by bolting on endpoint protection systems.
The issue here is not simply scanning for known malware, "endpoint protection" systems go way beyond that. I have never, in practice, seen any of those systems be a net benefit for security. And I mean in a very serious and practical way. Depending on your needs, there are far more effective solutions that don't require backdooring your systems. There simply shouldn't be any unauthorized changes for this type of systems.
> In that we can agree. But I would put "build on operating systems intended for the purpose" on top of that list, too.
Agreed, most folks should probably use a proven *nix distro, or one of the BSD varieties. That would be a good starting point.
That said, I doubt whether the OS alone will be enough, even with a good configuration, but at some point the technical aspects have to contend with managing liability either way.
Carbon Black, running in DO NOT LET UNTRUSTED EXECUTABLES RUN mode, would not let you run binaries that curl | sh just grabbed unless they were allow-listed.
Windows Defender is more than sufficient for most of these companies, but they need that false sense of security, or maybe they have excess budget to spare, or they are transferring the risk per their risk management plan.
This isn't a windows issue. For what it's worth, I've had plenty of problems in the past with kernel panics from crowdstrike's macos system extension, although it was fairly random, nothing like today's issue.
Linux isn't exactly reliable either... I'm sorry but that OS is barely capable of outputting a stable HDMI signal, god help you if you are on a laptop with external monitor.
For 3 computers, 2 laptops, I've never _not_ had display bugs/oddities/issues. System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.
We cannot get computers perfect. They are too complicated. That's true for anything in life: as soon as it gets too complicated, you're left in a realm of statistics and emergent phenomena. As much as I dislike Windows (enough to keep using Linux), I never had display issues on Windows.
To anyone compelled to reply with a text that contains "just" or "simply": simply just consider that if you are able to think of it in 10 seconds, then I have thought of it as well, and tried it too.
In my comment I was referring to mission critical systems, which most definitely you don't put on cheap commodity hardware you buy in a brick and mortar store.
Linux is used EVERYWHERE for a reason. Most car HUDs now run on some form of embedded Linux, like basically all embedded and low-power devices. The problem here is that people still put embedded mission critical systems on a desktop OS and slap desktop software on it, which is _a bad choice_.
> Linux isn't exactly reliable either... I'm sorry but that OS is barely capable of outputting a stable HDMI signal, god help you if you are on a laptop with external monitor.
This is demonstrably false, given the amount of people that game on Linux nowadays.
> System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.
I had this happen to me once. Timeshift was painless to use, and in about 15 minutes I had my machine up and running again, and could apply all updates properly afterwards. If anything it made me bolder lol.
> Linux isn't exactly reliable either... I'm sorry but that OS is barely capable of outputting a stable HDMI signal, god help you if you are on a laptop with external monitor.
It just works for me, and has just worked with every laptop I have had in the last 15 years. My kids and I have several Linux installs and the only one with HDMI output issues is a cheap ARM tablet that is sold as a device for early adopters.
> For 3 computers, 2 laptops, I've never _not_ had display bugs/oddities/issues. System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.
I've run at least that number of machines (I do not know whether you mean three or five in total) for the last 20+ years and can recall one such issue.
> For 3 computers, 2 laptops, I've never _not_ had display bugs/oddities/issues. System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.
I also had Windows Update fucking up my VMs and physical installs multiple times - this stuff just happens _with desktop machines, on desktop OSes_. The point is, lots of companies are using random cheap x86 computers with Windows desktop for mission critical appliances and systems, which is nonsensical. The rule of thumb has always been, do not put Windows (client) on anything you can't format on a short notice at any time. Guess people just never learn
How is your lack of a stable HDMI signal relevant to that the world's airlines and supermarkets and banks probably shouldn't run Windows with third-party antivirus software bolted on? That is a platform originally intended for office style typewriter emulation and games.
Every engineering-first or Internet-native company that could choose chose Linux, and for simple reasons. Anything not Linux in The Cloud is a rounding error. Most of the world's mobile phones are Linux, and most cloud-first desktops too. They don't seem to be particularly more troubled by HDMI signal quality or other display issues than other devices.
> I'm sorry but that OS is barely capable of outputting a stable HDMI signal, god help you if you are on a laptop with external monitor.
You may have had particularly bad luck with poorly supported hardware, but I don't think this is a normal experience.
I've been using Linux exclusively on desktops and laptops (with various VGA, DVI, DisplayPort, HDMI, and PD-powered DisplayPort-over-USB-C monitors and TVs) since 2002 without any unstable behavior or incompatibility.
Most likely. I think laptops are particularly gnarly, especially when they have both an APU and a discrete GPU. While manufacturers use Windows' amenities for adding their own drivers and modifications to ensure that the OS understands the topology of the hardware (so that the product doesn't get mass RMA'd), there's no such incentive to go out of your way to make Linux support it.
But working with hundreds of computers, running many different distributions of Linux for decades, I just haven't ever seen what you're describing. It's really hard to reconcile what I read here with my hands-on experience.
2. plenty of malware and c2 systems happily operate off all systems, regardless of how hardened (or how unix) they are - IDS/IPS is a reactive way to try and mitigate this
3. you don't need third party software to compromise the unix kernel, you just need to wait a week or two until someone finds a bug in the kernel itself
all that being said, this has solarwinds vibes. the push for these enterprise IDS systems needs to be weighed, and the approach adjusted
Windows RTMs used to be shipped in a usable state (albeit buggy) for more than a decade. You installed it from a CD and it worked fine, you installed patches every once in a while from a random Service Pack CD you got from somewhere.
Modern Windows has had the habit of being so buggy after release, in such horrendous ways, that I can't imagine being able to use the same install CD for years. That reflects less attention to detail, in my view.
The slice of Microsoft stuff I worked at certainly did not have dedicated QA at the time I was there and used to have a QA team before, so there is some degree of truth to the statement. I can't speak for other Microsoft teams and offices. It was very disappointing for me, because I have had the opportunity to work with great QA staff before and in my current job and there is no way a developer dedicating 25 % of their time (which is what was suggested as a replacement for having dedicated QA) can do a job anywhere near as good.
I have a feeling most commenters (not just here) don't really know what Falcon is and does, if EDR (and more?) keeps getting compared to a plain antivirus.
Depending on the threats pertinent to the org they may require deep observability and the ability to perform threat hunting for new and emerging threats, or detect behaviour based signals, or move to block a new emerging threat. Not all threats require Administrator privileges!
Not installing AV might be fine for a small number of assets in a low risk industry, but is bad advice for a larger more complex environment.
If we're being unbiased here, the apparent Crowdstrike problem could occur on any OS and with any vendor where you have updates or configuration changes automatically deployed at scale.
> Do not install third party software that runs in kernel mode.
You mean don't install Steam nor the Epic Store, nor many of the games.
Note: I'm agreeing with you except that pretty much the only reason I have a Windows machine is for games. I do have Steam installed. I also have the Oculus software installed. I suspect both run in kernel mode. I have to cross my fingers that Valve and Facebook don't do bad things to me and don't leave too many holes.
I don't install games that require admin.
Oh, and I have Photoshop and I'm pretty sure Adobe effs with the system too >:(
Admin privileges aren't the same thing as a kernel-mode driver. Steam does require admin to be installed, but it does not install a kernel-mode driver.
I've never seen a program running in kernel mode other than AV software. Pretty sure none of the stuff you listed does. Asking for admin permissions doesn't mean it's kernel mode software.
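If you want to check for yourself on Windows, the built-in driverquery tool lists installed kernel drivers, and you can grep its CSV output. A quick sketch below; the list of name fragments to look for is a naive, purely illustrative guess at a few endpoint/anti-cheat driver names, not an authoritative inventory:

    import csv
    import subprocess

    # naive, illustrative fragments of driver names worth a second look
    SUSPECTS = ("csagent", "carbon", "sentinel", "easyanticheat")

    # driverquery ships with Windows; /fo csv gives machine-readable output
    out = subprocess.run(
        ["driverquery", "/fo", "csv"],
        check=True, text=True, capture_output=True,
    ).stdout

    for row in csv.DictReader(out.splitlines()):
        name = row.get("Module Name", "")
        display = row.get("Display Name", "")
        if any(s in (name + " " + display).lower() for s in SUSPECTS):
            print(f"kernel driver present: {name} ({display})")

It's a blunt instrument, but it makes the distinction concrete: Steam asking for admin during install shows up nowhere here, whereas an EDR agent's driver does.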
this "kernel level = invasive" paranoia that's been going on lately is complete FUD at its core and screams tech illiteracy
no software vendor needs to or wants to write a driver to spy on you or steal your data when they can do all of that with user-level permissions without triggering any AV.
3rd party drivers are completely fine, and it's normal that advanced peripherals like an Oculus use them
> Linux has taught multiple times that drivers should be developed and included with the OS.
I've had Linux GPU drivers fail multiple times due to system updates, to the point were I needed to roll back. I've had RHEL updates break systems in a way were even Red Hat support couldn't help me (I had to fix them myself).
I don't see how Linux is any better in this regard than Windows to be honest.
Also:
> AV was never a necessity on UNIX
Sure, why write a virus when you can just deploy your malware via official supply chains?
Do you have/need GPUs on your 'mission critical systems'? I would bet most of us don't.
I quite agree with OP here. VMs are now quite lightweight (compared to the resources available on machines, at least) and I would rather use a light, hardened Linux as my base OS that runs a Windows VM and takes snapshots for quick rollbacks. Actually, that's what I run on my own PC, and I think it would be the sanest way to operate.
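For anyone curious what that looks like in practice, here's a minimal sketch of the snapshot-and-rollback workflow using libvirt's virsh from Python. The domain name is hypothetical, and this assumes the guest disk is a qcow2 image that supports internal snapshots:

    import subprocess
    from datetime import datetime

    DOMAIN = "win10-pos"  # hypothetical name of the Windows guest

    def virsh(*args: str) -> str:
        # thin wrapper so a failed virsh call raises immediately
        result = subprocess.run(
            ["virsh", *args], check=True, text=True, capture_output=True
        )
        return result.stdout

    def snapshot_before_update() -> str:
        name = "pre-update-" + datetime.now().strftime("%Y%m%d-%H%M%S")
        virsh("snapshot-create-as", DOMAIN, name)
        return name

    def rollback(name: str) -> None:
        # revert the guest to the known-good state if the update breaks it
        virsh("snapshot-revert", DOMAIN, name)

    if __name__ == "__main__":
        snap = snapshot_before_update()
        print("took snapshot", snap)
        # ...apply the vendor update inside the guest, run smoke tests...
        # rollback(snap)  # only if the update turned the guest into a brick

Take the snapshot before letting the vendor push anything, and reverting becomes a one-liner instead of a disk-surgery exercise.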
Have some kind of soaking/testing environment for production critical systems, especially if you're a big business. If you're hip, something like a proper blue/green setup (please chime in with best practices!). If you're legacy, do it all by hand if you must.
Blindly enabling immediate internet-delivered auto-update on production systems will always allow a bad update to cause chaos. It doesn't matter how well you permission things off on your favourite Linux flavor. If an update is to be meaningful, the update can break the software. And clearly you're relying on the software, otherwise you wouldn't be using it.
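Concretely, the gate doesn't have to be fancy. Here's a sketch of a ring-based rollout where each ring soaks before the next one gets the update; RINGS, push_update() and health_ok() are placeholders for whatever fleet tooling you actually have:

    import time

    # placeholder rings; in reality these would come from your asset inventory
    RINGS = [
        ["canary-01", "canary-02"],    # a handful of expendable machines first
        ["branch-az1", "branch-az2"],  # one site / availability zone next
        ["rest-of-fleet"],             # everything else, only if earlier rings stay healthy
    ]

    SOAK_SECONDS = 6 * 3600  # let each ring soak before promoting further

    def push_update(host: str, version: str) -> None:
        # placeholder: call your config management / MDM / vendor API here
        print(f"pushing {version} to {host}")

    def health_ok(host: str) -> bool:
        # placeholder: heartbeat, boot-loop detection, error-rate checks, etc.
        return True

    def staged_rollout(version: str) -> None:
        for ring in RINGS:
            for host in ring:
                push_update(host, version)
            time.sleep(SOAK_SECONDS)
            if not all(health_ok(h) for h in ring):
                raise SystemExit(f"halting rollout of {version}: ring {ring} unhealthy")
        print(f"{version} promoted to the whole fleet")

    if __name__ == "__main__":
        staged_rollout("agent-update-42")

The specifics matter less than the principle: nothing reaches the last ring until the earlier ones have demonstrably survived it.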
100% nix based here so thankfully zero systems affected. Everything from routers to devices, we have a total blanket ban on any Windows based software.
BBC reports: “The cause is not known - but Microsoft says it's taking mitigation action”.
Most of the media I found say it’s because of “cloud infrastructure”. I have yet to see any major source factually report that this is caused by a bad patch in Crowdstrike software installed on top of Windows.
Goes to show how little competence there is in journalism nowadays. And it raises the question of how often they misinterpret and misreport things in other fields.
The BBC are starting to say that 'tech people are saying this is Crowdstrike', so I guess it's just a question of being certain? Perhaps we'd have similar concerns about rigour in journalism if it were to turn out that it's actually not Crowdstrike specifically, it's caused by the interplay of Crowdstrike and some other currently unknown thing, and actually it's not Crowdstrike that's behaving improperly, but this other currently unknown thing.
It's looking more and more like Crowdstrike screwed up, but I appreciate rigour and accuracy more than FRISTTT!!! type announcements.
"Selon le quotidien The Australian, qui relaie les déclarations du ministère australien des Affaires intérieures, l'entreprise Crowdstrike pourrait être en cause, après avoir été victime d'une brèche au sein de sa plateforme."
Translated/summarized: "According to the newspaper The Australian, relaying statements from the Australian Department of Home Affairs, Crowdstrike may be the cause of the outage after having suffered a breach of its platform"
I like how it redirects blame away from those responsible and perpetuates the idea that "hackers" are the real threat.
On BBC news a few minutes ago, an expert did describe the problem as affecting Microsoft Azure cloud systems as well as Windows systems running Crowdstrike due to an "update gone wrong".
Well, in BBC’s live coverage, just minutes ago, their technology editor said:
“ There have been reports suggesting that a cybersecurity company called Crowdstrike, which produces antivirus software, issued a software update that has gone horribly wrong and is bricking Windows devices - prompting the so-called "blue screen of death" on PCs.
Now, whether these two issues are the same thing, or whether it's a perfect storm of two big things happening simultaneously - I don't yet know. It certainly sounds like it's going to be causing a lot of havoc.”
What two issues? Two major independent outages? This is seriously bad and purely speculative.
There was a different Azure and other MS services (including Office 365) outage earlier which is separate from the crowdstrike thing that started a few hours later.
it's strange how people who work in professions that are considered crucial infrastructure are held to such a high standard, but there's always some tech problem that cripples them the hardest
"Windows" is the combination of the OS per se and all the things needed for it to run properly. That thing is a mess of proprietary drivers and pieces of software cobbled together. It can't be called "high reliability" with a straight face.
Crowdstrike is multiplatform malware that chronically damages computers on all major desktop OSes. This is a Crowdstrike problem and an admin problem.
That's a hell of a take that should not be taken seriously. Perhaps if you held everything else to the same standard, where anything used on macOS or Linux or whatever else fully and completely represents that core platform, then I'd agree.
Anecdotally, I have zero stability problems on my non-ECC consumer-grade 11th gen Intel Windows 11 system. It'll stay up for months, until I decide to shut it down. I had a loose GPU power cable that was causing me problems at a point, but since I reseated everything I haven't had a single issue. That was my fault, things happen. The system is great.
More significantly, I see no difference in stability between our Windows Server platform and Red Hat Enterprise (Oracle) server platform at work either. Work being one of the top 3 largest city governments in the USA.
I don't even think Linux is the definite answer. The majority of these critical apps are just full-screen UIs written in C, C++ or Java with minimal computing and networking, so they could just as easily run on Qubes or BSD without all the constant patching for dumb vulnerabilities that still persist even though Windows is 40 years old.
The problem is the middle management class at hospitals, governments, etc., only know how to use Word and maybe Excel, so they are comfortable with Microsoft, even though it's objectively the worst option if you aren't gaming. So then they make contracts with Microsoft and all the computers run Windows, so all the app developers have to write the apps for Windows.
Not really disagreeing with you, but "staying up for months" isn't a serious bar to clear; it really provides no information. In 2024 everything you can install should clear that bar.
Can you say with a straight face that if you were designing a system that had extremely high requirements of reliability that you would choose Windows over Linux? Like, all other things being equal? I'm sorry, but that would be an insane choice.
Well, yes? Of course, not the consumer deployment of Windows. Part of ensuring reliability is establishing contracts with suppliers that shift liability to them, so they're incentivized to keep their stuff reliable. You can't exactly do that with Linux (RHEL notwithstanding) and open source in general, which is why large enterprises have been so reluctant to adopt them in the past: they had to figure out how to fit OSS into the flow of liability and responsibility.
It's not as straightforward a choice as it may seem. In theory Linux would be a better choice, but there simply isn't the infrastructure or IT staffing in place to manage millions and millions of Linux desktops. I'm not saying it can't be done, but for various reasons it hasn't been done, and that's a major practical roadblock. From a staffing perspective alone, if you hand millions of Linux desktops to lifelong Microsofties you're begging for disaster.
For sure, no question! There's a reason people choose Microsoft. My question was narrower, just the question of reliability (hence "all else being equal"). I don't think you can say, leaving aside issues like this, that Windows is as reliable as or more reliable than Linux.
For instance, if you had to deploy a mission critical server, assuming cost and other software were the same, would you choose Linux or Windows for reliability? Of course you would choose Linux.
Well, with the proliferation of systemd and all the nightmares it's caused me over the past decade, I actually might. But thankfully BSD is an option.
But Linux isn't immune from this exact sort of issue, though - these overgrown antivirus solutions run as kernel drivers in linux as well, and I have seen them cause kernel panics.
Depends, I think. When I was working as a supermarket cashier the tills ran embedded XP. In 2 or 3 years they rarely had issues. The rare issues they did have were with the Java POS running on top.
Windows 10 for my home desktop crashed a lot more and just seems to have gotten more "janky" with time.
The people working in those professions are; their bosses and their IT departments are not. IT security is treated as a solved problem: if you deploy enough well-known solutions that prevent your employees from working, everything will be Safe from CyberAttacks. There's an assumption of quality like you'd normally have with drugs or food in the store. But this isn't the case in this industry, doubly so in security. Quality solutions are almost non-existent, so companies should learn to operate under the principle of caveat emptor.
> (Repeating my comment because other story is duped)
Please don't do this! It makes merging threads a pain. It's better to let us know at hn@ycombinator.com and then we'll merge the threads so your comment shows up in the main one.
Edit: it appears my comment has been moved to a top level comment, i.e. a peer of the parent, without any way of telling what happened - so now there is a whole other pointless branch polluting the relevance of the tree.
Previously:
It appears that someone was able to take my previous comment in this thread completely off hacker news, it's not even listed as flagged. It was at 40pts before disappearing, perhaps there is some reputation management going on here. If it was against the site rules it would be helpful to know which ones.
Edit: the link is https://news.ycombinator.com/item?id=41007985 - it was a high-up comment that no longer appears, even though flagged comments do appear. I checked whether it had been moved, but the parent comment is still the same. This feels like being hellbanned, in that there isn't an easy way for me to see if I've been shadowbanned. But I really don't know. I was commenting in good faith.
It's a vital moderation function to do this, particularly when the parent is the top comment of the entire thread. Those tend to attract non-reply-replies, and that has bad effects on the thread as a whole. It causes the top part of the page to fill up with generic rather than specific content, and it makes the top subthread too top-heavy.
I'm not saying that you did anything wrong or that your post was bad or that it was unrelated to the original parent. The problem is that the effects I'm describing pile up unintentionally and end up being a systemic problem. It isn't anybody's fault, but there does need to be someone whose job it is to watch out for the system as a whole, and that's basically what moderators do.
Sometimes we comment that we detached a post from its original parent (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...) and sometimes not. (Perhaps the software should display this information automatically.) I'm less likely to do it when a comment stands on its own just fine, which was the case with your post https://news.ycombinator.com/item?id=41007985 and which is usually the case with the more generic sort of reply—in fact it's one test for deciding that question.
> so now there is the whole other pointless branch polluting the relevance of the tree
Yes, please don't do that—especially in the top subthread. I understand the frustration of "WTF where did my comment go", but you can always reach us at hn@ycombinator.com and get an answer.
Oh, never mind, it's just that a third-party cybersecurity tool running on the server detected a potential threat and quarantined the offending database record, just in case!
I was looking forward to spending the day talking to people about cyber security but if my comments are going to disappear like that then maybe hacker news is not the site for me. A shame really.
Edit: I don't know for sure, but this is possibly the last straw for me on Hacker News. It really has gone downhill. If good faith discussions from experts are being secretly deleted for what I can only now assume are nefarious reasons, then I can't trust that what I find here is in any way representative. It's unfortunate in that there really isn't anywhere else to go. Now my best discussions are in small WhatsApp groups / Discords with friends. That's OK for me, where I've had a career to get to know people personally and build such groups, but if public forums are tainted in this way then younger people in this field will end up only talking to each other.
I appreciated your comment and saw it earlier before it was detached. Thank you for sharing it. It got decent visibility to readers, as your points suggest. I suggest you cut the mods some slack for adjusting things on one of the heaviest trafficked threads ever. The phenomena dang describes with only semi-related reply to the first thread does exist and I myself have gotten higher points on posts that benefit from it, unintentionally and intentionally (didn’t realize that was abuse, sorry dang). I think we are better off with an ecosystem that limits such point/visibility seeking or accidental behavior, even for good content. Don’t take it personally.
(I do think there should be some way to skim for, say top X% rated comments particularly on mega threads, somewhat like there was/is on slash dot with its point filtering. This would have helped visibility for a detached comment like yours, would reduce the ordering benefit dang mentions for those using it, and improve usability more generally for busier readers. But that’s my 2 cents. These things always cut multiple ways.)
A megathread is always a tough place to add value, on any platform. Who am I, but I appreciate you and your comments and hope you continue to share with a broader audience that includes me here.
I was triple downvoted in ~20s before it disappeared, so I noticed it very quickly because I keep an eye on points to see if people have interacted with something I've posted. I then scanned through the peer comments and noticed that a bunch of other comments had been very freshly flagged, within ~40s of the last time I checked. Those comments had been around for over an hour, so the odds that they were all independently flagged so quickly at exactly the same time are highly improbable. And I couldn't see whether my own comment had been flagged even though I could see others.
It's possibly a bug, but I've seen similar behavior before and it was due to flagging, just without the flagging marker appearing until later. I think it's a variation of hellbanning but for a single post. It was easy to notice because not only did the points go down just before the comment disappeared, they also stopped going up, as they had been doing reliably before it was removed.
I still keep the points, so perhaps it's a loss of possible future points. The points don't bother me; I restart anonymous accounts on occasion. I kind of use points to get an idea of where other people's opinions lie, which is half the reason I use this site. A negative signal to something I think is good is actually more interesting than a positive signal. I'm more worried about the damage such actions do to the 'marketplace of ideas', and whether, if it hasn't pushed me away, it has pushed away others I'm interested to hear from. And if so, where have they gone? Once I become disinterested in the opinions of others on this site, it's unlikely I'll have any further use for Hacker News - and I'm getting pretty close to that point.
I edited my post to include the link. I guess if people can see these posts I'm not completely banned, and maybe there isn't anything nefarious going on; the replies to a single comment are already off the first page, so there could be a software bug. But it does appear that there is some sort of reputation management going on.
Ah thanks, I guess it's been moved to its own top level comment - I did check, but only up to page 4. It's weird because this chain of comments - which is otherwise off topic, has nowhere near the points (2, edit: now 0, vs 37), and has the same parent - is on the first page. So I'm not sure that was the right remedy. Hacker News needs better tooling for this, or at least a note when something has been moved, to try to flatten out the tree.
This is why I don't use Windows and refuse any SWE jobs that require Windows machines. Additionally, I believe kernel-level game anti-cheat software should be banned.
The reason people "can't learn" how to operate alternative software is because we don't give software the weight it deserves. We don't consider it crucial, but evidently it is.
When a new surgical technique is standardized, we don't tell doctors "well don't worry about it - we can't expect you to learn cryptic things!" Because we understand the gravity of what they're doing and how crucial that technique may be.
For whatever reason, software is still treated like the wild west and customers/employees are still babied. They're told it's all optional, they don't need to learn more. We still tell Windows users its fine to download executables online, click "okay!" and have them run as Admin. And that's the root cause of why we're in this mess.
We have safer computing environments - just look at iOS or the Mac. Even Microsoft is slowly trying to phase this out with the new Windows Store. But alas, we cannot expect anyone to change anything ever, so we still use computers like it's 1995.
Assuming that other operating system, whether high-reliability or not, are necessarily "cryptic" and unnecessarily impair people in their ability to "get shit done" is naive at best and disingenuous at worst.
I'd say regardless of the OS, you might find a company like Netflix is less likely to impose security-theatre box ticking exercises.
Which makes it less likely to take a 3rd party agent from a snakeoil company that sells to execs, then embed it at low levels into mission critical services with elevated privileges, then give it realtime external updates that can break your platform at any point.
This is good and bad. This showcases the importance of CrowdStrike. This is a short term blip but in the long run they will learn from this and prevent this type of an issue in the future. On the flip side, they have a huge target on their back for the U.S. government to try and control them. They are also a huge target for malicious actors since they can clearly see that CS is part of critical US and western infra. Taking them down can cripple essential services.
On a related note, this also demonstrates the danger of centralized cloud services. I wish there were more players in this space and that governments would try their very best to prevent consolidation in it. Alternatively, I really wish CS did not have this centralized architecture that allows for such failure modes. The software industry should learn from great, age-old engineering design principles. For example, large ships have watertight doors that prevent compartments from flooding in case of a breach. It appears that CS didn't think the current scenario was possible and therefore didn't invest in anything meaningful to prevent this nightmare scenario.
I'm not that confident that they're going to be around to recover after their stock price falls into the toilet and they get sued out the yin-yang. I don't think 'read the EULA terms lol' is gonna cut it here.
Or, and that maybe a radical idea, YOU DON'T INSTALL THIS FUCKING SNAKE OIL IN THE FIRST PLACE.
The idea of antivirus software is laughable: if Adobe cannot implement a safe and secure PDF parser, then how can Crowdstrike, while simultaneously supporting the parsing of a million other protocols?
Everyone involved: Vendor, operator, and auditors who mandate this shit are responsible and should be punished.
YOU HAVE TO MINIMIZE THE ATTACK SURFACE, NOT INCREASE IT.