CrowdStrike Update: Windows Bluescreen and Boot Loops (reddit.com)
4489 points by BLKNSLVR 4 months ago | 3859 comments



All: there are over 3000 comments in this thread. If you want to read them all, click More at the bottom of each page, or like this:

https://news.ycombinator.com/item?id=41002195&p=2

https://news.ycombinator.com/item?id=41002195&p=3

https://news.ycombinator.com/item?id=41002195&p=4 (...etc.)


Throwaway account...

CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes or access files they shouldn't be (using some drunk ass heuristics).

What happened here was they pushed a new kernel driver out to every client without authorization to fix an issue with slowness and latency that was in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.

This has taken us out and we have 30 people currently doing recovery and DR. Most of our nodes are boot looping with blue screens, which in the cloud is not something you can fix by just hitting F8 and removing the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file and bring it up. Either that or bring up a new node entirely from a snapshot.
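For concreteness, the per-node recovery loop described above looks roughly like this with boto3 (a sketch only: instance IDs and the device name are placeholders, it assumes EBS-backed instances, and the actual .sys deletion happens out of band on the rescue node):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    BROKEN_INSTANCE = "i-0123456789abcdef0"   # boot-looping node (placeholder)
    RESCUE_INSTANCE = "i-0fedcba9876543210"   # known-good node (placeholder)

    # 1. Force-stop the boot-looping instance.
    ec2.stop_instances(InstanceIds=[BROKEN_INSTANCE], Force=True)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[BROKEN_INSTANCE])

    # 2. Find and detach its root volume.
    inst = ec2.describe_instances(InstanceIds=[BROKEN_INSTANCE])["Reservations"][0]["Instances"][0]
    root_dev = inst["RootDeviceName"]
    root_vol = next(m["Ebs"]["VolumeId"] for m in inst["BlockDeviceMappings"]
                    if m["DeviceName"] == root_dev)
    ec2.detach_volume(VolumeId=root_vol)
    ec2.get_waiter("volume_available").wait(VolumeIds=[root_vol])

    # 3. Attach it to the rescue instance as a secondary disk, so the offending
    #    .sys file can be deleted from there (that step is manual / out of band).
    ec2.attach_volume(VolumeId=root_vol, InstanceId=RESCUE_INSTANCE, Device="/dev/sdf")

    # 4. Afterwards: detach again, re-attach as the root device, and start the node.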

This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.

I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.

Edit: to all the people moaning about Windows, we've had no problems with Windows. This is not a Windows issue. This is a third-party security vendor shitting in the kernel.


I did approximately this recently, but on a Linux machine on GCP. It sucked far worse than it should have: apparently GCP cannot reliably “stop” a VM in a timely manner. And you can’t detach a boot disk from a VM that isn’t “stopped”, nor can you multi-attach it, nor can you (AFAICT) convince a VM to boot off an alternate disk.

I used to have this crazy idea that fancy cloud vendors had competent management tools. Like maybe I could issue an API call to boot an existing instance from an alternate disk or HTTPS netboot URL. Or to insta-stop a VM and get block-level access to its disk via API, even if I had to pay for the instance while doing this.

And I’m not sure that it’s possible to do this sort of recovery at all without blowing away local SSD. There’s a “preview” feature for this on GCP, which seems to be barely supported, and I bet it adds massive latency to the process. Throwing away one’s local SSD on every single machine in a deployment sounds like a great way to cause potentially catastrophic resource usage when everything starts back up.

Hmm, I wonder if you’re even guaranteed to be able to get your instance back after stopping it.

WTF. Why can’t I have any means to access the boot disk of an instance, in a timely manner? Or any better means to recover an instance?

Is AWS any better?


AWS really isn't any better on this. In fact, 2 years ago (to the day!) we had a complete AZ outage in our local AWS region. This resulted in their control plane going nuts and being unable to shut down or start new instances. Then came the capacity problems.


That's happened several times, actually. That's probably just the latest one. The really fun one was when S3 went down in 2017 in Virginia. It caused global outages of multiple services because most services were housed out of Virginia, and when EC2 and other services went offline due to their dependency on S3, everything cascade-failed across multiple regions (in terms of start/stop/delete, i.e. API actions; stuff that was running was, for the most part, still working in some places).

...I remember that day pretty well. It was a busy day.


> apparently GCP cannot reliably “stop” a VM in a timely manner.

In OCI we made a decision years ago that after 15 minutes from sending an ACPI shutdown signal, the instance should be hard powered off. We do the same for VM or BM. If you really want to, we take an optional parameter on the shutdown and reboot commands to bypass this and do an immediate hard power off.

So worst case scenario here, it's 15 minutes to get it shut down and be able to detach the boot volume to attach to another instance.
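For reference, that maps to the OCI Python SDK roughly as follows (a sketch; config-file auth assumed, and the instance OCID is a placeholder):

    import oci

    compute = oci.core.ComputeClient(oci.config.from_file())
    instance_id = "ocid1.instance.oc1..example"  # placeholder OCID

    # Soft stop: sends the ACPI shutdown signal; the platform hard powers off
    # after the timeout described above.
    compute.instance_action(instance_id, "SOFTSTOP")

    # Immediate hard power off, if you don't want to wait:
    # compute.instance_action(instance_id, "STOP")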


I had this happen to one of my VMs: I was trying to compile something and ran out of memory, then tried to stop the VM, and it only came back after 15 min. I think it is a good compromise: long enough to give a chance for a clean reboot but short enough to prevent longer downtimes.

I’m just a free tier user but OCI is quite powerful. It feels a bit like KDE to me where sometimes it takes a while to find out where some option is, but I can always find it somewhere, and in the end it beats feeling limited by lack of options.


We've tried shorter time periods, back in the earlier days of our platform. Unfortunately, what we've found is that the few times we've tried to lower it from 15 minutes, we've ended up with Windows users experiencing corrupt drives. Our best blind interpretation is that some things common enough on Windows can take up to 14 minutes to shut down under the worst circumstances. So 15 minutes it is!


This sounds appealing. Is OCI the only cloud to offer this level of control?


Based on your description, AWS has another level of stop, the "force stop", which one can use in such cases. I don't have statistics on the time, so I don't know if that meets your criteria of "timely", but I believe it's quick enough (sub-minute, I think).
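For what it's worth, the boto3 call looks like this (a minimal sketch; the instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2")
    # Force=True skips waiting for a clean guest shutdown and hard-stops the instance.
    ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"], Force=True)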


There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.

As for "throwing away local SSD", that only happens on AWS with instance store volumes which used to be called ephemeral volumes as the storage was directly attached to the host you were running on and if you did a stop/start of an ebs-backed instance, you were likely to get sent to a different host (vs. a restart API call, which would make an ACPI soft command and after a duration...I think it was 5 minutes, iirc, the hypervisor would kill the instance and restart it on the same host).

When the instance would get sent to a different host, it would get different instance storage and the old instance storage would be wiped from the previous host and you'd be provisioned new instance storage on the new host.

However, with EBS volumes, those travel from host to host across stop/start cycles. They're attached with very low latency across the network from EBS servers and presented as a local block device to the instance. It's not quite as fast as local instance store, but it's fast enough for almost every use case if you get enough IOPS provisioned, either through direct provisioning + the correct instance size OR through a large enough drive + a large enough instance to maximize the connection to EBS (there's a table detailing IOPS, throughput, and instance size in the docs).

Also, support can detach the volume as well if the instance is stuck shutting down and doesn't get manually shut down by the API after a timeout.

None of this is by any means "ideal", but the complexity of these systems is immense and what they're capable of at the scale they operate is actually pretty impressive.

The key is... lots of the things you talk about are doable at small scale, but when you add more and more operations and complexity to the tool stack for interacting with systems, you add a lot of back-end network overhead, which leads to extreme congestion, even in very high-speed networks (it's an exponential scaling problem).

The "ideal" way to deal with these systems is to do regular interval backups off-host (ie. object/blob storage or NFS/NAS/similar) and then just blow away anything that breaks and do a quick restore to the new, fixed instance.

It's obviously easier said than done and most shops still on some level think about VMs/instances as pets, rather than cattle or have hurdles that make treating them as cattle much more challenging, but manual recovery in the cloud, in general, should just be avoided in favor of spinning up something new and re-deploying to it.
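As a sketch of that backup/restore pattern with boto3 (IDs and the AZ are placeholders; error handling omitted):

    import boto3

    ec2 = boto3.client("ec2")

    # Regular interval backup: snapshot the data volume off-host.
    snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                               Description="scheduled backup")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Recovery: don't repair the broken node; build a fresh volume from the
    # snapshot and attach it to a newly launched replacement instance.
    vol = ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1a")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0fedcba9876543210",  # fresh replacement node
                      Device="/dev/sdf")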


> There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.

This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

Also, there should be a way to force stop an instance that is already stopping.


>This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The issue is far more nuanced than that. The systems are very complex: each host is a hypervisor with layers of applications and interfaces on top to allow scaling. In fact, the hosts all have BMCs (last I knew... but I know there were some who wanted to get rid of the BMC due to BMCs being unreliable, which is, yes, an issue when you deal with scale, because BMCs are in fact unreliable. I've had to reset countless stuck BMCs and had some BMCs that were dead).

The hypervisor is certainly capable of killing an instance instantly, but the preferred method is an orderly shutdown. In the case of a reboot and a stop (and a terminate where the EBS volume is not also deleted on termination), it's preferred to avoid data corruption, so the hypervisor attempts an orderly shutdown, then after a timeout period, it will just kill it if the instance has not already shutdown in an orderly manner.

Furthermore, there's a lot more complexity to the problem than just "kill the guest". There are processes that manage the connection to the EBS backend that provides the interface for the EBS volume as well as APIs and processes to manage network interfaces, firewall rules, monitoring, and a whole host of other things.

If the monitoring process gets stuck, it may not properly detect an unhealthy host and external automated remediation may not take action. Additionally, that same monitoring is often responsible for individual instance health and recovery (ie. auto-recover) and if it's not functioning properly, it won't take remediation actions to kill the instance and start it up elsewhere. Furthermore, the hypervisor itself may not be properly responsive and a call from the API won't trigger a shutdown action.

If the control plane and the data plane (in this case, that'd be the hypervisor/host) are not syncing/communicating (particularly on a stop or terminate), the API needs to ensure that the state machine is properly preserved and the instance is not running in two places at once. You can then "force" stop or "force" terminate and/or the control plane will update state in its database and the host will sync later. There is a possibility of data corruption or double send/receive data in a force case, which is why it's not preferred. Also, after the timeout (without the "force" flag), it will go ahead and mark it terminated/stopped and will sync later; the "force" just tells the control plane to do it immediately, likely because you're not concerned with data corruption on the EBS volume, which may be double-mounted if you start up again and the old one is not fully terminated.

>The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

It does have a concept where all resources are still held and billed, except CPU and memory. That's what a reboot effectively does. Same with a stop (except you're not billed for compute usage and network usage will obviously be zero, but if you have an EIP, that would still incur charges). The transition between stopped and running is also fast; the only delays incurred are via the control plane, either via capacity constraints causing issues placing an instance/VM or via the chosen host not communicating properly, but in most cases it is a fast transition. I'm usually up and running in under 20 seconds when I start up an existing instance from a stopped state. There's also now a hibernate/sleep state that the instance can be put into via the API (for Windows), where the instance acts just like a regular Windows machine going to sleep or hibernating.

>Also, there should be a way to force stop an instance that is already stopping.

There is. I believe I referred to it in my initial response. It's a flag you can throw in the API/SDK/CLI/web console when you select "terminate" or "stop". If the stop/terminate commands don't execute in a timely manner, you can call the same thing again with a "force" flag and tell the control plane to forcefully terminate, which marks the instance as terminated and will asynchronously try to rectify state when the hypervisor can execute commands.

The control plane updates the state (though sometimes it can get stuck and require remediation by someone with operator-level access) and is notified that you don't care about data integrity/orderly shutdown, and will (once it's updated the state in the control plane and regardless of the state of the data plane) mark it as "stopped" or "terminated". Then you can either start again, which should kick you over to a different host (there are some exceptions), or you can launch a new instance if you terminated, attach the EBS volume (if you chose not to terminate the EBS volume on termination), and retrieve the data (or use the data or whatever you were doing with that particular volume).

Almost all of that information is actually in the public docs. There was only a little bit of color about how the backend operates that I added.

There are hundreds of programs that run to make sure the hypervisor and control plane are both in sync and able to manage resources, and if just a few of them hang or are unable to communicate, or the system runs out of resources (more of a problem on older, non-Nitro hosts, as that's a completely different architecture with completely different resource allocations), then the system can become partially functional... enough so that remediation automation won't step in or can't step in because other guests appear to be functioning normally.

There are many different failure modes of varying degrees of "unhealthy" and many of them are undetectable or need manual remediation, but they are statistically rare and by and large most hosts operate normally. On a normally operating host, forcing a shutdown/terminate works just fine and is fast. Even when some of the programs that manage the host are not functioning properly, launch/terminate/stop/start/attach/detach all tend to continue to function (along with the "force" on detach, terminate, stop), even if one or two functions of the host are not working.

It's also possible (and has happened several times) that a particular resource vector is not functioning properly, but the rest of the host is fine. In that case, the particular vector can be isolated and the rest of the host works just fine. It's literally these tiny little edge cases that happen maybe 0.5% of the time that cause things to move slower, and at scale a normal host with a normal BMC would have the same issues. I.e., I've had to clear stuck BMCs before on those hosts, and I've also dealt with completely dead BMCs. When those states occur, if there's also a host problem, remediation can't go in and remedy host-level problems, which can lead to those control-plane delays as well as the need to call a "force".

Conclusion: it may SEEM like it should be super easy, but there are about a million different moving parts at a cloud vendor and it's not just as simple as kill it with fire and vengeance (i.e. QEMU guest kill). BMCs and hypervisors do have an instant kill switch (and guest kill is used on the hypervisor, as is a BMC power off, in the right remediation circumstances), but you're assuming those things always work. BMCs fail. BMCs get stuck.

You likely haven't had the issue because you're not dealing with enough scale. I've had to reset BMCs manually more times than I can count and I've also dealt with more than my fair share of dead ones. So "power off immediately" does not always work, which means a disconnect occurs between the control plane and the data plane. There are also delays in the remediation actions that automation takes, to give things enough time to respond to the given commands, which leads to additional wait time.


I understand that this complexity exists. But in my experience with Google Compute, this isn’t a 1%-of-the-time problem with something getting stuck. It’s a “GCP lacks the capability” issue. Here’s the API:

https://cloud.google.com/compute/docs/reference/rest/v1/inst...

AWS does indeed seem more enlightened:

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_S...


yeah, AWS rarely has significant capacity issues. While the capacity utilization typically sits around 90% across the board, they're constantly landing new capacity, recovering broken capacity, and working to fix issues that cause things to get stuck (and lots of alarms and monitoring).

I worked there for just shy of 7 years and dealt with capacity tangentially (knew a good chunk of their team for a while and had to interact with them frequently) across both teams I worked on (support and then inside the EC2 org).

Capacity management, while their methodologies for expanding it were, in my opinion, antiquated and unenlightened for a long time, was still rather effective. I'm pretty sure that's why they never updated their algorithm for increasing capacity to be more JIT. They have a LOT more flexibility in capacity now that they have resource vectoring, because you no longer have hosts with fixed instance sizes for the entire host (homogeneous). You now have the ability to fit everything like Legos as long as it is the same family (ie. c4 with c4, m4 with m4, etc.), and additional work was being done to bring cross-family resource vectoring into use as well.

Resource vectors took a LONG time for them to get in place and when they did, capacity problems basically went away.

The old way of doing it was if you wanted to have more capacity for, say, c4.xlarge, you'd either have to drop new capacity and build it out to where the entire host had ONLY c4.xlarge OR you would have to rebuild excess capacity within the c4 family in that zone (or even down to the datacenter-level) to be specifically built-out as c4.xlarge.

Resource vectors changed all that. DRAMATICALLY. Also, reconfiguring a host's recipe now takes minutes, rather than rebuilding a host and needing hours. So, capacity is infinitely more fungible than it was when I started there.
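To illustrate the idea only (this is a toy first-fit sketch, not AWS's actual placement algorithm): treat each host as a pool of capacity and pack same-family instance sizes onto it.

    # Toy illustration of "resource vectoring": first-fit mixed sizes onto hosts.
    HOST_VCPUS = 72

    SIZES = {"c4.xlarge": 4, "c4.2xlarge": 8, "c4.4xlarge": 16}

    def place(requests, hosts_free):
        """First-fit placement of instance requests onto hosts with free vCPU capacity."""
        placements = []
        for req in requests:
            need = SIZES[req]
            for i, free in enumerate(hosts_free):
                if free >= need:
                    hosts_free[i] -= need
                    placements.append((req, i))
                    break
            else:
                placements.append((req, None))  # shortfall: land new capacity
        return placements

    print(place(["c4.4xlarge", "c4.xlarge", "c4.2xlarge"], [HOST_VCPUS, HOST_VCPUS]))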

Also, I think resource vectoring came on the scene around 2019 or so? I don't think it was there in 2018 when I went to work for EC2...but it was there for a few years before I quit...and I think it was in-use before the pandemic...so, 2019 sounds about right.

Prior to that, though, capacity was a much more serious issue and much more constrained on certain instance types.


I always said if you want to create real chaos, don't write malware. Get on the inside of a security product like this, and push out a bad update, and you can take most of the world down.


So… Write malware?


*Malicious code in legit software


> Most of our nodes are boot looping with blue screens which in the cloud is not something you can just hit F8 and remove the driver.

It took a bit to figure out with some customers, but we provide optional VNC access to instances at OCI, and with VNC the trick seems to be to hit esc and then F8, at the right stage in the boot process. Timing seems to be the devil in the details there, though. Getting that timing right is frustrating. People seem to be developing a knack for it though.


> give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.

Interesting..

> We have to literally take each node down, attach the disk to a working node..

Probably the easiest solution for you is to go back in time to a previous scheduled snapshot, if you have that setup already.


That would make sense, but it appears everyone in our regions is doing EBS snapshot restores like mad, so they aren't completing. Spoke to our AWS account manager (we are a big big big org) and they have contention issues everywhere.

I really want our cages, C7000's and VMware back at this point.


Netflix big? Bigger or Smaller?

I'm betting I have a good idea of one of the possible orgs you work for, since I used to work specifically with the largest 100 customers during my ~3yr stint in premium support


Netflix isn't really that big. Two organizations ago our reverse proxy used 40k cores. Netflix's is less than 5k. Of course, that could just mean our nginx extensions are 8 times crappier than Netflix's.


Smaller. No one has heard of us :)


> Spoke to our AWS account manager (we are a big big big org)

Is this how you got the inside scoop on the rollout fiasco?


Beautiful


> This is not a windows issue.

Honest question, I've seen comments in these various threads about people having similar issues (from a few months/weeks back) with kernel extension based deployments of CrowdStrike on Debian/Ubuntu systems.

I haven't seen anything similar regarding Mac OS, which no longer allows kernel extensions.

Is Mac OS not impacted by these kinds of issues with CrowdStrike's product, or have we just not heard about it due to the small scale?

Personally, I think it's a shared responsibility issue. MS should build a product that is "open to extension but closed for modification".

> they pissed over everyone's staging and rules and just pushed this to production.

I am guessing that act alone is going to create a massive liability for CrowdStrike over this issue. You've made other comments that your organization is actively removing CrowdStrike. I'm curious how this plays out. Did CrowdStrike just SolarWind themselves? Will we see their CISO/CTO/CEO do time? This is just the first part of this saga.


The issue is where it is integrated. You could arguably implement CrowdStrike in BPF on Linux. On NT they literally hook NT syscalls in the kernel from a driver they inject into kernel space which is much bad juju. As for macOS, you have no access to the kernel.

There is no shared responsibility. CrowdStrike pushed a broken driver out, then triggered the breakage, overriding customer requirement and configuration for staging. It is a faulty product with no viable security controls or testing.
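For readers unfamiliar with the Linux side: the observability primitive such a sensor could sit on looks roughly like this with bcc (a minimal sketch of syscall-level visibility from BPF, nothing like Falcon's actual logic):

    from bcc import BPF

    prog = r"""
    TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
        // Log the PID of every process entering openat(); a real sensor would
        // apply policy here and forward events to a userspace agent.
        bpf_trace_printk("openat pid=%d\n", bpf_get_current_pid_tgid() >> 32);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.trace_print()  # stream events from the kernel trace pipe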


Yep, it's extremely lame that CS has been pushing the "Windows" narrative to frame it as a Windows issue in the press, so everyone will just default blame Microsoft (which everyone knows) and not Crowdstrike (which only IT/cybersec people are familiar with).

And then you get midwits who blame Microsoft for allowing kernel access in the first place. Yes Apple deprecated kexts on macOS; that's a hell of a lot easier to do when you control the entire hardware ecosystem. Go ahead and switch to Apple then. If you want to build your own machines or pick your hardware vendor, guess what, people are going to need to write drivers, and they are probably going to want kernel mode, and the endpoint security people like CrowdStrike will want to get in there too because the threat is there.

There's no way for Microsoft or Linux for that matter to turn on a dime and deny kernel access to all the thousands upon thousands of drivers and system software running on billions of machines in billions of potential configurations. That requires completely reworking the system architecture.


> midwits

This midwit spent the day creating value for my customers instead of spinning in my chair creating value for my cardiologist.

Microsoft could provide adequate system facilities so that customers can purchase products that do the job without having the ability to crash the system this way. They choose not to make those investments. Their customers pay the price by choosing Microsoft. It's a shared responsibility between the parties involved, including the customers that selected this solution.

We all make bad decisions like this, but until customers start standing up for themselves with respect to Microsoft, they are going to continue to have these problems, and society is going to continue to pay the price all around.

We can and should do better as an industry. Making excuses for Microsoft and their customers doesn't get us there.


This midwit believes a half-decent operating system kernel would have a change tracking system that can auto-roll back a change/update that impacts the boot process and causes a BSOD. We see this in Linux: multiple kernel boot options, failsafe modes, etc. It is trivial to add driver / .sys tracking at the kernel level that can detect a failed boot and revert to the previous good config. A well-designed kernel would have rollback, just like SQL.


Windows does have that and does do that. CrowdStrike does stuff at the UEFI level to install itself again.


Could Microsoft put pressure on UEFI vendors to coordinate a way for such reinstallation to be suppressed during this failsafe boot?


Not sure why you are being downvoted. Take a look at ChromeOS and MacOS to see how those mechanisms are implemented there.

They aren’t perfect, but they are an improvement over what is available on Windows. Microsoft needs to get moving in this same direction.


um.. don't have access to the kernel? what's with all the kexts then? [edit: just read 3rd parties don't get kexts on apple silicon. that's a step in the right direction, IMHO. I love to bitch about Mach/NeXTStep flaws, but happy to give them props when they do the right thing.]


Although it's a .sys file, it's not a device driver.

"Although Channel Files end with the SYS extension, they are not kernel drivers."

https://www.crowdstrike.com/blog/technical-details-on-todays...


Yeah it's a way of delivering a payload to the driver, which promptly crashed.

Which is horrible!


Horrible for sure, not least because hackers now know that the channel file parser is fragile and perhaps exploitable. I haven't seen any significant discussion about follow-on attacks, it's all been about rolling back the config file rather than addressing the root cause, which is the shonky device driver.


I suspect the wiley hackors have known how fragile that code is for years.


But it is a Windows issue, because the kernel should be able to roll back a bad update; there should NEVER be BSODs.


Windows does do that. Crowdstrike sticks it back in at the UEFI level by the looks, because you know, "security".


pish! this isn't VM/SP! commodity OSes and hardware took over because customers didn't want to pay firms to staff people who grokked risk management. linux supplanted mature OSes because some dork implied even security bugs were shallow with all those billions of eyes. It's a weird world when MSFT does a security stand down in 2003 and in 2008 starts widening security holes because the new "secure" OS they wrote was a no-go for third parties who didn't want to pay $100 to hire someone who knew how to rub two primes together.

I miss my AS/400.

This might be a decent place to recount the experience I had when interviewing for office security architect in 2003. my background is mainframe VM system design and large system risk management modeling which I had been doing since the late 80s at IBM, DEC, then Digital Switch and Bell Canada. My resume was pretty decent at the time. I don't like Python and tell VP/Eng's they have a problem when they can't identify benefits from JIRA/SCRUM, so I don't get a lot of job offers these days. Just a crusty greybeard bitching...

But anyway... so I'm up in Redmond and I have a decent couple of interviews with people and then the 3rd most senior dev in all of MSFT comes in and asks "how's your QA skills?" and I start to answer about how QA and Safety/Security/Risk Management are different things. QA is about ensuring the code does what it's supposed to; software security, et al., is about making sure the code doesn't do what it's not supposed to, and the philosophic sticky wicket you enter when trying to prove a negative (worth a google deep dive if you're unfamiliar.) Dude cuts me off and says "meh. security is stupid. in a month, Bill will end this stupid security stand down and we'll get back to writing code and I need to put you somewhere and I figured QA is the right place."

When I hear that MSFT has systems that expose inadequate risk management abstractions, I think of the culture that promoted that guy to his senior position... I'm sure he was a capable engineer, but the culture in Redmond discounts the business benefits of risk management (to the point they outsource critical system infrastructure to third parties) because senior engineers don't want to be bothered to learn new tricks.

Culture eats strategy for breakfast, and MSFT has been fed on a cultural diet of junk food for almost half a century. At least from the perspective of doing business in the modern world.


> ”This is not a windows issue. This is a third party security vendor shitting in the kernel.“

Sure, but Windows shares some portion of the blame for allowing third-party security vendors to “shit in the kernel”.

Compare to macOS which has banned third-party kernel extensions on Apple Silicon. Things that once ran as kernel extensions, including CrowdStrike, now run in userspace as “system extensions”.


Back in 2006, Microsoft agreed to allow kernel-level access for security companies due to an EU antitrust investigation. They were being sued by antivirus companies because they were blocking kernel access in the soon-to-be-released Vista.

https://arstechnica.com/information-technology/2006/10/7998/


Wow, that looks like a root cause


Wow! First cookie pop-ups, now Blue Friday...?


Sick and tired of EU meddling in tech. If third parties can muck around in the kernel, then there's nothing Microsoft can really do at that point. SMH


Can they simultaneously allow this, but recommend against it and deny support / sympathy if you do it to your OS?


Yes... in the same sense that if a user bricks their own system by deleting system32 then Windows shares some small sliver of the blame. In other words, not much.


Why should Windows let users delete system32? If they don't make it impossible to do so accidentally (or even maliciously), then I would indeed blame Windows.

On macOS you can't delete or modify critical system files without both a root password and enough knowledge to disable multiple layers of hardware-enforced system integrity protection.


And what do you think installing a deep level antivirus across your entire fleet is equivalent to?


lol. Never said they should, did I?


the difference is you can get most of the functionality you want without deleting system32, but if you want the super secure version of NT, you have to let idiots push untested code to your box.

Linux, Solaris, BSD and macOS aren't without their flaws, but MSFT could have done a much better job with system design.


...but still, if the user space process is broken, macOS will fail as well. Maybe it's a bit easier to recover, but any broken process with non-trivial privileges can interrupt the whole system.


It's certainly not supposed to work like that. In the kernel, a crash brings down the entire system by design. But in userspace, failed services can be restarted and continued without affecting other services.

If a failure in a userspace service can crash the entire system, that's a bug.


It's kind of inevitable that a security system can crash the system. It just needs to claim that one essential binary is infected with malware, and the system won't run.


Hello:

I'm a reporter with Bloomberg News covering cybersecurity. I'm trying to learn more about this Crowdstrike update potentially bypassing staging rules and would love to hear about your experience. Would you be open to a conversation?

I'm reachable by email at jbleiberg2@bloomberg.net or on Signal at JakeBleiberg.24. Here's my Bloomberg author page: https://www.bloomberg.com/authors/AWuCZUVX-Pc/jake-bleiberg.

Thank you.

Jake


Before reaching the "pushed out to every client without authorization" stage, a kernel driver/module should have been tested. Tested by Microsoft, not by "a third party security vendor shitting in the kernel" that some criminally negligent manager decided to trust.


> Tested by Microsoft

MS don't have testers any more. Where do you think CS learned their radically effective test-in-prod approach?


I think they learned it from Freedesktop developers.


Yeah, we have a staging and test process where we run their updated Falcon sensor releases.

They shit all over our controls and went to production.

This says we don't control it and should not trust it. It is being removed.


> It is being removed.

Congratulations on actually fixing the root cause, as opposed to hand wringing and hoping they don't break you again. I'm expecting "oh noes, better keep it on anyway to be safe" to be the popular choice.


yeah, I agree. I think most places will at least keep it until the existing contract comes up for renegotiation, and most will probably keep using CS.

It's far easier for IT departments to just keep using it than it is to switch and managers will complain about "the cost of migrating" and "the time to evaluate and test a new solution" or "other products don't have feature X that we need" (even when they don't need that feature, but THINK they do).


why would Microsoft be required to test some 3rd party software? Maybe I misunderstood.


It's a shitty C++ hack job within CrowdStrike with a null pointer. Because the software has root access, Windows shuts it down as a security precaution. A simple unit test would have caught this, or any number of tools that look for null pointers in C++, not even full QA. It's unbelievable incompetence.


Took down our entire emergency department as we were treating a heart attack. 911 down for our state too. Nowhere for people to be diverted to because the other nearby hospitals are down. Hard to imagine how many millions if not billions of dollars this one bad update cost.


Yup - my mom went into the ER for stroke symptoms last night and was put under MRI. The MRI imaging could NOT be sent to the off-site radiologist and they had to come in -- turned out the MRI outputs weren't working at all.

We were discharged at midnight by the doctor, the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.


A relative of mine had back surgery late yesterday. Today the hospital nursing staff couldn’t proceed with the pain medication process for patients recovering from surgery because they didn’t have access to the hospital systems.


My wife is a nurse. She has a non-critical job making care plans for patients and the system is STILL down.


Hope she's okay. For better or worse, our entire emergency department flow is orchestrated around Epic. If we can't even see the board, nurses don't know what orders to perform, etc.


If it’s so critical that nurses are left standing around clueless then if it goes down entire teams of people should be going to prison for manslaughter.

Or, we could build robust systems that can tolerate indefinite down time. Might cost more, might need more staff.

Pick one. I’ll always pick the one that saves human lives when systems go down.


Okay but that will affect hospital profits and our PE firms bought these hospitals specifically to wrench all redundancy out of these systems in the name of efficiency (higher margins and thus profit) so that just won't do.


Private equity people need to start getting multiple life sentences for fucking around with shit like this. It's unironically a national security issue.


1. Hospitals should not make profits.

2. Hospitals should not have executives.

3. Hospitals should be community funded with backstop by the federal government.

4. PE is a cancer - let the doctors treat it.


Doctors can't even own hospitals now. Doctor-owned hospitals were banned with the passage of Obamacare in order to placate big hospital systems concerned about the growing competition.


Another way to look at it is that you can have more hospitals using systems with a lower cost, thus saving more lives compared to only a few hospitals using an expensive system.


This isn't another way to look at it, this is the only way to look at it.


I hope your mother recovers promptly. And I’m glad she doesn’t run on Windows. ;-)


Ha ha! Good one! This is a save!

Wishes for a speedy recovery to your mom!

I hope no one uses such single-point-of-failure systems anymore, especially CS. The same applies to Cloudflare as well! But at least in their case the systems would keep functioning standalone and remain accessible, and it could only cause a netwide outage (i.e., if the CF infra goes down)!

Anyways, who knows what is going to happen with such widespread vendor dependency?

The world gets reminded about supply chain attacks every year, which is a good (but scary) reminder that definitely needs some deep thinking...

Up for it?


I am "saving" this comment :)

... and seconding all the best wishes for the mother involved. Do get well.-


I hope she's ok.


Wishing you and your mom the best


I wish your mother the best <3


Thank you <3


Idk… critical hospital systems should be air gapped.


All of the critical equipment is. But we need internet access on computers, or at the very least Epic does to pull records from other hospitals.


> We were discharged at midnight by the doctor, the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.

That's an extra 4 hours of emergency room fees you ideally wouldn't have to pay for.


Having a medical system that has the concept of "hours of emergency room fees" is also a pretty fundamental problem


It's actually per 15 minutes :)


Honestly, that sounds like a typical ER visit.


The system crashed while my coworker was running a code (aka doing CPR) in the ER last night. Healthcare IT is so bad at baseline that we are somewhat prepared for an outage while resuscitating a critical patient.


The second largest hospital group in Nashville experienced a ransomware attack about two months ago. Nurses told me they were using manual processes for three weeks.


It takes a certain type of a criminal a55hole to attack hospitals and blackmail them. I would easily support life or death penalty for anyone attempting this cr@p.


In this case it was tracked to Russia.


That is absolutely one of the A-tier "certain type of a criminal a55hole".


More than just Nashville, they have hospitals all over the country.


Ascension?


Yes. And I was told by multiple nurses at St. Thomas Midtown that the hospital did not have manual procedures already in place. In their press release they refer to their hospitals as "ministries" [0], so apparently they practice faith-based cyber security (as in "we believe that we don't need backups") since it took over 3 weeks to recover.

[0] https://about.ascension.org/cybersecurity-event


As a paramedic, there is very little about running a code that requires IT. You have the crash cart, so not even stuck trying to get meds out of the Pyxis. The biggest challenge is charting / scribing the encounter.


lol, yep, that was my take on this... If you need a computer to run an ACLS algorithm, something has gone seriously wrong.


Especially out in the field where we have a lot more autonomy. If our iPads break we'll just use paper.


Excuse my ignorance, but what systems are needed for CPR?


I used to work in healthcare IT. Running a code is not always only CPR.

Different medications may be pushed (injected into the patient) to help stabilize them. These medications are recorded via a barcode and added to the patient's chart in Epic. Epic is the source of truth for the current state of the patient. So if that is suddenly unavailable, that is a big problem.


Makes sense, thank you for the explanation.


Okay, not having historical data available to make decisions on what to put into a patient is understandable - but maybe also print critical stuff per patient once a day? - but not being able to log an action in real time should not be a critical problem.


It is a critical problem if your entire record of life-saving drugs you've given them in the past 24 hours suddenly goes down. You have to start relying on people's memories, and it's made worse by shift turn-overs so the relevant information may not even be reachable once the previous shift has gone home.

There are plenty of drugs that can only be given in certain quantities over a certain period of time, and if you go beyond that, it makes the patient worse not better. Similarly there are plenty of bad drug interactions where whether you take a given course of action now is directly dependent on which drugs that patient has already been given. And of course you need to monitor the patient's progress over time to know if the treatments have been working and how to adjust them, so if you suddenly lose the record of all dosages given and all records of their vital signs, you've lost all the information you need to treat them well. Imagine being dropped off in the middle of nowhere, randomly, without a GPS.


That's why there's a sharpie in the first aid kit. If you're out of stuff to write on you can just write on the patient.

More seriously, we need better purpose-built medical computing equipment that runs on its own OS and only has outbound network connectivity for updating other systems.

I also think of things like the old-school "checklist boards" that used to be literally built into the yoke of the airplane they were made for.


I’m afraid the profitability calculation shifted it in favor of off-the-shelf OS a long time ago. I agree with you, though, that a general purpose OS has way too much crap that isn’t needed in a situation like this.


> That's why there's a sharpie in the first aid kit.

That doesn't help when the system goes down and you lose the record of all medications administered prior to having to switch over to the Sharpie.


> It is a critical problem if your entire record of life-saving drugs you've given them in the past 24 hours suddenly goes down.

Will outages like this motivate a backup paper process? The automated process should save enough information on paper so that a switchover to a paper process at any time is feasible. Similar to elections.


Maybe if all the profit seeking entities were removed from healthcare that money could instead go to the development of useful offline systems.

Maybe a handheld device for scanning in drugs or entering procedure information that stores the data locally which can then be synced with a larger device with more storage somewhere that is also 100% local and immutable which then can sync to online systems if that is needed.


And with their luck, those handheld devices will also be sent the OTA update that temporarily bricks them along with everything else.


no money for that

there are backup paper processes, but they start fresh when the systems go down

If it was printing paper 24/7 in case of downtime, it would be massive wastage for the 99% of the time the system is up.


A good system is resilient. A paper process could take over when the system is down. From my understanding, healthcare systems undergo recurrent outages for various reasons.


Many places did revert back to paper processes. But it's a disaster model that has to be tested to make sure everyone can still function when your EMR goes down. Situations like this just reinforce that you can't plan for if IT systems go down; it is when they go down.


My experience with internet outages affecting retail is that the ability to rapidly and accurately calculate bill totals and change is not practiced much anymore. Not helped by things like 9.075% tax rates, to be sure.
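For example, even a simple ticket takes a few steps by hand (a toy sketch; the amounts are made up):

    from decimal import Decimal, ROUND_HALF_UP

    subtotal = Decimal("47.93")
    tax_rate = Decimal("0.09075")           # the awkward 9.075% rate
    total = (subtotal * (1 + tax_rate)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    tendered = Decimal("60.00")
    print(total, tendered - total)           # 52.28 total, 7.72 change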


How about an e-ink display for each patient that gets drug and administration info displayed on it?


Real paper is probably as much about breaking from the "IT culture" as it is about the physical properties. An e-ink display would probably help with a power outage, but would happily display a BSOD in an incident like this.


Honestly if you were designing a system to be resilient to events like this one, the focus would be on distributed data and local communication. The exact sort of things that have become basically dirty words in this SaaS future we are in. Every PC in the building, including the ones tethered to equipment, is presently basically a dumb terminal, dependent on cloud servers like Epic, meaning WAN connection is a single point of failure (I assume that a hospital hopefully has a credible backup ISP though?) and same for the Epic servers.

If medical data were synced to the cloud but also stored on the endpoint devices and local servers, you’d have more redundancy. Obviously much more complexity to it but that’s what it would take. Epic as single source of truth means everyone is screwed when it is down. This is the trade off that’s been made.


> synced to the cloud but also stored on the endpoint devices and local servers

That's a recipe for a different kind of disaster. I actually used Google Keep some years ago for medical data at home — counted pills nightly, so mom could either ask me or check on her phone if she forgot to take one. Most of the time it worked fine, but the failure modes were fascinating. When it suddenly showed data from half a year ago, I gave up and switched to paper.


I don't think it is the historical data that's required to make a decision; it is required to store the action for historical purposes in the future. This is ultimately to bill you, to track that a doctor isn't stealing medication or improperly treating the patient, and to track it for legal purposes.

Some hospitals require you to input this in order to even get physical access to the medications.

Although a crash cart would normally have common things necessary to save someone in an emergency, so I would think that if someone was truly dying they could get them what they needed. But of course there are going to be exceptions and a system being down will only make the process harder.


> maybe also print critical stuff per patient once a day?

Yep, the business continuity boxes are basically minimally connected PDF archives of patient records "printed" multiple times a day.


maybe non-volatile e-paper, which can be updated easily if things are up, and if the system is down it still works as well as the printouts


updatable e-paper is going to be very expensive


Compared to managing thousands of printers? And then the resulting printouts? Buying ink, changing the cartridges?

Technologically it seems doable. Big enough order brings down the costs.

https://soldered.com/product/soldered-inkplate-5-5-2%e2%80%b...

Of course the real backup plan should be designed based on the actual needs, perhaps the whole system needs an "offline mode" switch. I assume they already run things locally, in case the big cable seeker machine arrives in the neighborhood.


A small printer connected to the scanner should do.


in this case, it's the entire operating system going down on all computers, so I don't think the printers are working either


Most printers in these facilities run standalone on an embedded Linux variant. They actually can host whole folders of data for reproduction "offline". Actually, all scan/print/fax multi-function machines can generally do that these days. If the IT onsite is good, though, the USB ports and storage on devices should be locked down.


Looks like a small scanner + printer running a small minimalistic RTOS would be a good solution.


Ok, now you have a fleet of 200 of those devices to handle. And now you move a patient across a service or to another hospital, and then...

Reality is complex.


Oh yes. This would be a contingency measure, just to keep the record in a human readable form while requiring little manual labor. Printed codes could be scanned later into Epic and, if you need to transfer the patient, tear the paper and send it with them.


This.

Anyone involved in designing and/or deploying a system where an application outage threatens life safety, should be charged with criminal negligence.

A receipt printer in every patient room seems like a reasonable investment.


This would be challenging. Establishing CrowdStrike's duty to a hospital patient would be difficult if not impossible in some jurisdictions.


It is not necessarily crowdstrike's responsibility, but it should be someone's.

If I go to Home Depot to buy rope for belaying at my rock climbing center and someone falls, breaks the rope and dies, then I am on the hook for manslaughter.

Not the rope manufacturer, who clearly labeled the packaging with "do not use in situations where safety can be endangered". Not the retailer, who left it in the packaging with the warning, and made no claim that it was suitable for a climbing safety line. But me, who used a product in a situation where it was unsuitable.

If I instead go to Sterling Rope and the same thing happens, fault is much more complicated, but if someone there was sufficiently negligent they could be liable for manslaughter.

In practice, to convict of manslaughter, you would need to show an individual was negligent. However, our entire industry is bad at our job, so no individual involved failed to perform their duties to a "reasonable" standard.

Software engineering is going to follow the path that all other disciplines of meatspace engineering did. We are going to kill a lot of people; and every so often, enough people will die that we add some basic rules for safety-critical software, until eventually, this type of failure occurring without gross negligence becomes nearly unthinkable.


It's on whoever runs the hospital's computer systems - allowing a ring-0 kernel driver to update ad hoc from the internet is just sheer negligence.

Then again, the management that put this in are probably also the same idiots that insist on a 7-day lead-time CAB process to update a typo on a brochureware website "because risk".


This patient is dead. They would not have been if the computer system was up. It was down because of CrowdStrike. CrowdStrike had a duty of care to ensure they didn't fuck over their client's systems.

I'm not even beyond two degrees of separation here. I don't think a court'll have trouble navigating it.


I suppose it will come as a surprise to you that you have misleading intuitions about the duty of care.

CrowdStrike did not even have a duty of care to their customer, let alone their customer’s customer (speaking for my jurisdiction, of course).


If that really were how it worked, I don’t think that software would really exist at all. Open Source would probably be the first to disappear too — who would contribute to, say, Linux, if you could go to jail for a pull request you made because it turns out they were using it in a life or death situation and your code had a bug in it. That checks all the same boxes that your scenario does: someone is dead, they wouldn’t be if you didn’t have a bug in your code.

Now, a tort is less of a stretch than a crime, but thank goodness I’m not a lawyer so I don’t have to figure out what circumstances apply and how much liability the TOS and EULAs are able to wash away.


When I read something like this that has such a confident tone while being incredibly incorrect all I can do is shake my head and try to remember I was young once and thought I knew it all as well.


I don't think you understand the scale of this problem. Computers were not up to print from. Our Epic cluster was down for placing and receiving orders. Our lab was down and unable to process bloodwork - should we bring out the mortar and pestle and start doing medicine the old fashioned way? Should we be charged with "criminal negligence" for not having a jar of leeches on hand for when all else fails?


I was advocating for a paper fallback. That means that WHILE the computers are running, you must create a paper record, e.g. “medication x administered at time y”, etc., hence the receipt printers, which are cheap and low-dependency.

The grandparent indicated that the problem was that when all the computers went down, they couldn’t look up what had already been done for the patient. I suggested a simple solution for that - receipt printers.

After the computers fail you tape the receipt to the wall and fall back to pen and paper until the computers come back up.

I completely understand the scale of the outage today. I am saying that it was a stupid decision and possibly criminally negligent to make a life critical process dependent on the availability of a distributed IT application not specifically designed for life critical availability. I strongly stand by that POV.


> I suggested a simple solution for that - receipt printers.

Just so I understand what you are saying: you are proposing that we drown our hospital rooms in paper receipts constantly, on the off chance the computers go down very rarely?

Do you see any possible drawbacks with your proposed solution?

> possibly criminally negligent to make a life critical process dependent on the availability of a distributed IT application

What process is not “life critical” in a hospital? Do you suggest that we don’t use IT at all?


Modern medicine requires computers. You literally cannot provide medical care in a critical care setting with the sophistication and speed required for modern critical care without electronic medical records. Fall back to paper? Ok, but you fall back to 1960s medicine, too.


We need computers. But, how about we fall back to an air-gapped computer with no internet connection and a battery backup?

Why does everything need the internet?


> Why does everything need the internet?

Why would you ever need to move a patient from one hospital room containing one set of airgapped computers into another, containing another set of airgapped computers?

Why would you ever need to get information about a patient (a chart, a prescription, a scan, a bill, an X-Ray) to a person who is not physically present in the same room (or in the same building) as the patient?


You wouldn't airgap individual rooms.

And sending data out can be done quite securely. Then replies could be highly sanitized or kept on specific machines outside the air gap.


You also need to receive similar data from outside the hospital.

And now you've added an army of people running around moving USB sticks, or worse, printouts and feeding them into other computers.

It's madness, and nobody wants to do it.


Local area networks air gapped from the internet don't need to be air gapped from each other. You could have nodes in each network responsible for transmitting specific data to the other networks.. like, all the healthcare data you need. All other traffic, including windows updates? Blocked. Using IP still a risk? Use something else. As long as you can get bytes across a wire, you can still share data over long distances.

In my eyes, there is a technical solution there that keeps friction low for hospital staff: network stuff, on an internet, but not The Internet...

Edit: I've since been reading the other many many comment threads on this HN post which show the reasons why so much stuff in healthcare is connected to each other via good old internet, and I can see there's way more nuance and technicality I am not privy to which makes "just connect LANs together!" less useful. I wasn't appreciating just how much of medicine is telemedicine.


I think wiring computers within the hospital over LAN, and adding a human to the loop for inter-hospital communication seems like a reasonable compromise.

Yes there will be some pain, but the alternative is what we have right now.

> nobody wants to do it.

Tough luck. There's lots of things I don't want to do.


Less time urgent, and would not take an army.


This approach is also what popped into my head. I've seen people use white boards for this already, so it must be OK from a HIPAA standpoint.


A hospital my wife worked at over a decade ago didn't use EMR's, it was all on paper. Each patient had a binder. Per stay. And for many of them it rolled into another binder. (This was neuro-ICU so generally lengthy patient stays with lots of activity, but not super-unusual or Dr House stuff, every major city in America will have 2-3 different hospitals with that level of care.)

But they switched over to EMR because the advantages of Pyxis[1] in getting the right medications to the right patients at the right time- and documenting all of that- are so large that for patient safety reasons alone it wins out over paper. You can fall back to paper, it's just a giant pain in the ass to do it, and then you have to do the data entry to get it all back into EMR's. Like my wife, who was working last night when everyone else in her department got Crowdstrike'd, she created a document to track what she did so it could be transferred into EMR's once everything comes back up. And the document was over 70 pages long! Just for one employee for one shift.

1: Workflow: Doctor writes prescription in EMR. Pharmacist reviews charts in EMR, approves prescription. Nurse comes to Pyxis cabinet and scans patient barcode. Correct drawer opens in cabinet so the proper medication- and only the proper medication- is immediately available to nurse (technicians restock cabinet when necessary). Nurse takes medication to patient's room, scans patient barcode and medication barcode, administers drug. This system has dramatically lowered the rates of wrong-drug administration, because the computers are watching over things and catch humans getting confused on whether this medication is supposed to go to room 12 or room 21 in hour 11 of their shift. It is a great thing that has made hospitals safer. But it requires a huge amount of computers and networks to support.
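
To make that dependency concrete, here is a minimal sketch (in Python, with entirely made-up names - this is not Pyxis or Epic code) of the kind of check the bedside scanning step performs:

    # Hypothetical sketch of the bedside check described above; none of these
    # names come from Pyxis or Epic, it's just the shape of the logic.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class ApprovedOrder:
        patient_id: str      # from the patient's wristband barcode
        medication_ndc: str  # from the medication package barcode
        due: datetime        # scheduled administration time

    def ok_to_administer(order: ApprovedOrder, scanned_patient: str,
                         scanned_med: str, now: datetime,
                         window: timedelta = timedelta(hours=1)) -> bool:
        """Right patient, right drug, right time - refuse otherwise."""
        if scanned_patient != order.patient_id:
            return False  # wrong patient (room 12 vs room 21)
        if scanned_med != order.medication_ndc:
            return False  # wrong medication pulled from the cabinet
        return abs(now - order.due) <= window  # outside the window -> refuse

Every one of those scans and lookups needs the order data to be reachable from the bedside, which is why the network is load-bearing.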


> Pyxis cabinet

Why would a Pyxis cabinet run Windows? I realize Windows isn't even necessarily at fault here, but why on earth would such a device run Windows? Is the 90s form of mass incompetence in the industry still a thing where lots of stuff is written for Windows for no reason?


I don't know what Pyxis runs on, my wife is the pharmacist and she doesn't recognize UI package differences with the same practiced eye that I do. And she didn't mention problems with the Pyxis. Just problems with some of their servers and lots of end user machines. So I don't know that they do.


You only need one link in the chain of doctor -> pharmacist -> pixys -> nurse to be reliant on Windows for this to fail.


This would be a disaster from a HIPAA perspective, and an unimaginable amount of paperwork.


For relying on windows to run this kind of stuff and not doing any kind of staged rollout, but just blindly applying untested 3rd-party kernel driver patches fleet-wide? Yeah, honestly. We had safer rollouts for cat videos than y'all seem to have for life critical systems. Maybe some criminal liability would make y'all care about reliability a bit more.


Staged rollout in the traditional sense wouldn't have helped here because the skanky kernel driver worked under all test conditions. It just didn't work when it got fed bad data. This could have been mitigated by staging the data propagation, or by fully testing the driver with bad data (unlikely to ever have been done by any commercial organization). Perhaps some static analysis tool could have found the potential to crash (or the isomorphic "safe language" that doesn't yet exist for NT kernel drivers).
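
For what it's worth, a minimal sketch of what "fully testing the driver with bad data" could look like at the content-parser level - the file format, magic bytes, and parse function here are invented, not CrowdStrike's:

    # Sketch of fuzzing the content-file parser before shipping. The only
    # acceptable reaction to garbage input is a clean rejection.
    import os, random

    def parse_channel_file(blob: bytes) -> None:
        """Placeholder for whatever consumes the pushed content files."""
        if len(blob) < 16 or blob[:4] != b"CSCF":   # hypothetical magic header
            raise ValueError("malformed content file")
        # ... real parsing would continue here ...

    def fuzz_parser(trials: int = 10_000) -> None:
        cases = [b"", b"\x00" * 4096, os.urandom(8)]
        cases += [os.urandom(random.randint(0, 2048)) for _ in range(trials)]
        for blob in cases:
            try:
                parse_channel_file(blob)
            except ValueError:
                pass  # rejected cleanly - acceptable
            # any other exception propagates and fails the test run

    if __name__ == "__main__":
        fuzz_parser()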


If you don't see that the thing that happened today that blew up the world was the rollout, I don't know what to tell you.


A QR code can store 3 KB of data. Every patient has a small QR sticker printer on their bed. Whenever Epic updates, print a new small QR sticker. When a patient is being moved, tear off the sticker and stick it to their wrist tag.

This much of the patient's state will be carried on their wrist. Maybe for complex cases you need two stickers. You have to be judicious in encoding the data, maybe just the last 48 hours.

Handheld, offline QR readers that read and display the QR data strings.
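
A minimal sketch of what generating such a sticker might look like, assuming the third-party Python "qrcode" package and made-up field names (a real deployment would obviously need HIPAA review):

    # Rough sketch of the wristband sticker idea. The snapshot schema and the
    # 2,900-byte budget are my own assumptions.
    import json
    import qrcode  # pip install qrcode[pil]

    snapshot = {
        "pt": "DOE, JANE 1970-01-01",
        "alg": ["penicillin"],
        "last48h": [
            {"t": "2024-07-19T06:10", "ev": "metoprolol 25 mg PO"},
            {"t": "2024-07-19T07:45", "ev": "CT head ordered"},
        ],
    }

    payload = json.dumps(snapshot, separators=(",", ":"))  # keep it terse
    assert len(payload.encode()) < 2900  # stay under QR byte capacity

    qrcode.make(payload).save("bedside_sticker.png")  # print, tape to wrist tag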


You need to document everything during a code arrest. All interventions, vitals and other pertinent information must be logged for various reasons. Paper and pen work but they are very difficult to audit and/or keep track of. Electronic reporting is the standard and deviating from the standard is generally a recipe for a myriad of problems.


We chart all codes on paper first and then transfer to computer when it's done. There's a nurse whose entire job is to stay in one place and document times while the rest of us work. You don't make the documenter do anything else because it's a lot of work.

And that's in the OR, where vitals are automatically captured. There just aren't enough computers to do real-time electronic documentation, and even if there were there wouldn't be enough space.


I chart codes on my EPCR, in the PT's house, almost everyday with one hand. Not joking about the one hand either.

It's easier, faster, and more accurate than writing, in my experience. We have a page solely dedicated to codes and the most common interventions. Got IO? I press a button and it's documented with a timestamp. Pushing EPI, button press with timestamp. Dropping an I-Gel or intubating, button press... you get the idea.

The details of the interventions can be documented later along with the narrative, but the bulk of the work was captured real-time. We can also sync with our monitors and show depth of compressions, rate of compressions and rhythms associated with the continuous chest compression style CPR we do for my agency.

Going back to paper for codes would be ludicrous for my department. The data would be shit for a start. Handwriting is often shit and made worse under the stress of screaming bystanders. And depending on whether we achieved ROSC or not, the likelihood of losing paper in the shuffle goes up.


The idea is to have the current system create a backup paper trail, and to practice resuming from it for when computers go down. Nothing about your current process needs to change, only that you be familiar with falling back to the paper backups when the computers are down.


Which means that you have to be operating on paper before the system goes down. If you aren't, the system never gets to transition because it just got CrowdStruck.


Correct. We use paper receipts for shopping and paper ballots for voting. Automation is fast and efficient, but there must be a manual fallback when power fails or automation is unreliable.

This wisdom is echoed in some religious practices that avoid complete reliance on modern technology.


> depth of compressions

Okay, how does that monitor work? Genuinely curious.


Replace "require" and "must" with "expected to", and you get the difference between policy and reality.


You can do CPR without a computer system, but changing systems in the middle of resuscitation where a delay of seconds can mean the difference between survival and death is absolutely not ideal. CPR in the hospital is a coordinated team response and if one person can’t do their job without a computer then the whole thing breaks down.


If you're so close to death that you're depending on a few seconds give or take, you're in God's hands. I would not blame or credit anyone or any system for the outcome, either way.


I’m sure you meant “the physicians’ hands.”


No. The physician will be running a standard ER code protocol, following a memorized flow chart.


Judgement is always part of the process, but yeah running a routine code is pretty easy to train for. It's one of the easiest procedures in medicine. There are a small number of things that can go wrong that cause quick death, and for each a small number of ways to fix them. You can learn all that in a 150 hour EMT class.


My guess is the system that notifies the next caretaker in the chain that someone is currently receiving CPR.

if it works, there's a lot more to be done to get the patient to stable.


need to play bee gees on windows media player


probably the system used to pull and record medication uses in a hospital. It's been awhile, but "Pyxis" used to be the standard where I shadowed.

Nurses hated it.


Hello, I'm a journalist looking to reach people impacted by the outage and wondering if you could kindly connect with your ER colleague. My email is sarah.needleman@wsj.com. Thanks!


Surprised and impressed at your using HN as a resource.


The comments are the content. I have always said this.


I mean, if they're finding sources through the comments and then corroborating their stories via actual interviews, it's completely fine practice. As long as what's printed is corroborated and cross-referenced I don't see a problem.

If they go and publish "According to hackernews user davycro ..." _then_ there's a problem.


She is living in the future. Way to go.


I sent them your contact info, pretty sure they will be asleep for the next few hours


Now this is an unusual meeting of two meanings of "running a code".


there's a great meme out there that says something like: Everyone on my floor is coding! \n Software PMs: :-D \n Doctors: :-O


When you're a software engineer turned doctor you get sent that by all of your friends xD


> Took down our entire emergency department as we were treating a heart attack.

It makes my blood boil to be honest that there is no liability for what software has become. It's just not acceptable.

Companies that produce software with the level of access that Crowdstrike has (for all effective purposes a remote root exploit vector) must be liable for the damages that this access can cause.

This would radically change how much attention they pay to quality control. Today they can just YOLO-push barely tested code that bricks large parts of the economy and face no consequences. (Oh, I'm sure there will be some congress testimony and associated circus, but they will not ever pay for the damages they caused today.)

If a person caused the level and quantity of damage Crowdstrike caused today they would be in jail for life. But a company like Crowdstrike will merrily go on doing more damage without paying any consequence.


> Companies that produce software

What about companies that deploy software with the level of quality that Crowdstrike has? Or Microsoft 365 for that matter.

That seems to be the bigger issue here; after all Crowdstrike probably says it is not suitable for any critical systems in their terms of use. You shouldn't be able to just decide to deploy anything not running away fast enough on critical infrastructure.

On the other hand, Crowdstrike Falcon Sensor might be totally suitable for a non-critical systems, say entertainment systems like the Xbox One.


CrowdStrike (https://www.crowdstrike.com › resources › infographics): "Learn how CrowdStrike keeps your critical areas of risk such as endpoints, cloud workloads, data, and identity, safe and your business running"


Wife is a nurse. They eventually got 2 computers working for her unit. I don't think it impacted patients already being treated, but they couldn't get surgeries scheduled and no charting was being done. Some of the other floors were in complete shambles.


Hi, as I noted to another commenter, I'm a journalist looking to speak with people who've been impacted by the outage. I'm wondering if I could speak with your wife. My email is sarah.needleman@wsj.com. Thanks.


Sure I’ll pass your email along to her and see if she wants to do that.


I don't understand how this isn't bigger news?

Local emergency services were basically nonfunctioning for the better part of the day, along with the heat wave and various events; it seems like a number of deaths (locally at least, specific to what I know for my mid-sized US city) will be indirectly attributable to this.


It's entirely possible (likely, even) that someone died from this, but it's hard to know with critically ill patients whether they would have survived without the added delays.


On aggregate it is. How many deaths over the average for these conditions did we see?


We are in the process of calculating this but need this 24H period to roll over so we can benchmark the numbers against a similar 24H period. It's hard to tell if the numbers we get back will even be reliable, given that a lot of the statistics from today, from what I can tell, have come back via emails or similar.


So what?


Give it like, a week before bothering to ask such questions...


If true, it is insane that critical facilities like hospitals do not have decentralized security systems.


Crowdstrike is on every machine in the hospital because hospitals and medical centers became a big target for ransomware a few years ago. This forced medical centers to get insured against loss of business and getting their data back. The insurance companies that insure companies against ransomware insist on putting host based security systems onto every machine or they won't cover losses. So Crowdstrike (or one of their competitors) has to run on every machine.


I wonder why they put software on every machine, instead of relying on a good firewall and network separation.

Granted, you are still vulnerable to physical attacks (i.e. the person coming in with a USB stick), but I would say those are much more difficult, and if you also put firewalls between compartments of the internal network, even more difficult.

Also, I think the use of Windows in critical settings is not a good choice, and to me we just had a demonstration of that. For those who say the same could have happened to Linux: yes, but you could have mitigated it. For example, to me a Linux system used in critical settings should have a read-only root filesystem; on Windows you can't do that. Thus the worst you would have had to do is reboot the machine to restore it.


The physical security of computers in, say, a hospital is poor. You can't rely on random people not getting access to a logged-in computer.


A common attack vector is phishing, where someone clicks on an email link and gets compromised or supplies credentials on a spoofed login page. External firewalls cannot help you much there.

Segmenting your internal network is a good defence against lots of attacks, to limit the blast radius, but it's hard and expensive to do a lot of it in corporate environments.


There are no good firewalls on the market. It's always the pretend-firewall that becomes the vector.


Yup, as you say, if you go for a state-of-the-art firewall, then that firewall also becomes a point of failure. Unfortunately complex problems don't go away by saying the word "decentralize".


You highly overestimate the capabilities of the average IT person working for a hospital. I'm sure some could do it. But most who can, work elsewhere.


I wonder if those same insurance policies are going to pay out due to the losses from this event?


> I wonder if those same insurance policies are going to pay out due to the losses from this event?

They absolutely should be liable for the losses, in each case where they caused it.

(Which is most of them. Most companies install crowdstrike because their auditor wants it and their insurance company says they must do whatever the auditor wants. Companies don't generally install crowdstrike out of their own desire.)

But of course they will not pay a single penny. Laws need to change for insurance companies, auditors and crowdstrike to be liable for all these damages. That will never happen.


Why would they? Cybersecurity insurance doesn’t cover “we had an outage” - it covers a security breach.


Depends on what the policy (contract) says. But there's a good argument that your security vendor is inside the wall of trust at a business, and so not an external risk.


In a sense, it looks like these insurance companies' policies work a little bit like regulation. Except that it's not monopolistic (different companies are free to have different rules), and when shit hits the fan, they actually have to put their money where their mouth is.

Despite this horrific outage, in the end it sounds like a much better and anti-fragile system than a government telling people how to do things.


A little bit, probably slightly better. But insurance companies don't want to eliminate risk (if they did that, no one would buy their product). They instead want to quantify, control and spread the risk by creating a risk pool. Good, competent regulation would be aimed at eliminating, as much as reasonably possible, the risk. Instead, insurance company audits are designed to eliminate the worst risk and put everyone into a similar risk bucket. After spending money on an insurance policy and passing an audit, why would a company spend even more money and effort? They have done "enough".


> The insurance companies that insure companies against ransomware insist on putting host based security systems onto every machine or they won't cover losses.

This is part of the problem too. These insurance/audit companies need to be made liable for the damage they themselves cause when they require insecure attack vectors (like Crowdstrike) to be installed on machines.


Crowdstrike and its ilk are basically malware. There have to be better anti-ransomware approaches, such as replicated, immutable logs for critical data.


That only solves half the problem, it doesn't solve data theft


1. Is data theft the main risk of ransomware?

2. Why would anyone trust a ransomware perpetrator to honor a deal to not reveal or exploit data upon receipt of a single ransom payment? Are organizations really going to let themselves be blackmailed for an indefinite period of time?

3. I'm unconvinced that crowdstrike will reliably prevent sensitive data exfiltration.


1. Double extortion is the norm; some groups don't even bother with the encryption part anymore, they just ask a ransom for not leaking the data

2. Apparently yes. Why do you think calls to ban payments exist?

3. At minimum it raises the bar for the hackers - sure, it's not like you can't bypass EDR, but it's much easier if you don't have to bypass it at all because it's not there


> That only solves half the problem, it doesn't solve data theft

Crowdstrike is not a DLP solution. You can solve that problem (where necessary) by less intrusive means.


I agree edr is not a DLP solution, but edr is there to prevent* an attack getting to the point where staging the data exfil happens... In which case yes I would expect web/volumetric DLP kicks in as the next layer.

*Ok ok I know it's bypassable but one of the happy paths for an attack is to pivot to the machine that doesn't have edr and continue from there.


Is there any security company that provides decentralized service?


By "decentralized" I think you mean "doesn't auto-update with new definitions"?

I have worked at places which controlled the roll-out of new security updates (and windows updates) for this very reason. If you invest enough in IT it is possible. But you have to have a lot of money to invest in IT to have people good enough to manage it. If you can get SwiftOnSecurity to manage your network, you can have that. But can every hospital, doctor's office, pharmacy, scan center, etc. get top tier talent like SwiftOnSecurity?


I used to work for a major retailer managing updates to over 6000 stores. We had no auto updates (all linux systems in stores) and every update went through our system.

When it came to audit time, the auditors were always impressed that our team had better timely updates than the corporate office side of things.

I never really thought we were doing anything all that special (in fact, there were always many things I wanted to improve about the process) but reading about this issue makes me think that maybe we really were just that much better than the average IT shop?


> I have worked at places which controlled the roll-out of new security updates (and windows updates)

But did they also control the roll-out of virus/threat definition files? Because if not their goose would have been still cooked this time.


Maybe, maybe not, devil's in the details.

If, for example, they were doing slow rollouts for configs in addition to binaries, they could have caught the problem in their canary/test envs and not let it proceed to a full blackout.
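
Something like the following sketch of a ring-based rollout, where everything (host names, bake time, health check) is hypothetical rather than how any vendor actually does it - the point is just that the broad ring only gets a config after earlier rings have survived it:

    # Minimal sketch of ring-based rollout for config/definition pushes.
    import time

    RINGS = [
        ("canary", ["lab-vm-01", "lab-vm-02"]),
        ("early",  ["er-ws-01", "pharmacy-ws-01"]),
        ("broad",  ["rest-of-fleet"]),
    ]
    BAKE_TIME_S = 5  # hours in reality; seconds here so the sketch runs

    def push_config(hosts, version):
        print(f"pushing {version} to {hosts}")

    def ring_is_healthy(hosts):
        # Placeholder: in practice, check crash / heartbeat telemetry per host.
        return True

    def roll_out(version):
        for name, hosts in RINGS:
            push_config(hosts, version)
            time.sleep(BAKE_TIME_S)  # bake before advancing to the next ring
            if not ring_is_healthy(hosts):
                raise RuntimeError(f"{version} failed in ring {name}; rollout halted")

    if __name__ == "__main__":
        roll_out("defs-2024-07-19")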


When I say decentralized, I mean security measures and updates taken locally at the facility. For example, MRI machines are local, and they get maintained and updated by specialists dispatched by the vendor (Siemens or GE)


Siemens or GE or whomever built the MRI machine aren't really experts in operating systems, so they just use one that everyone knows how to work, MS Windows. It's unfortunate that to do things necessary for modern medicine they need to be networked together with other computers (to feed the EMR's most importantly), but it is important in making things safer. And these machines are supposed to have 10-20 year lifespans (depending on the machine)! So now we have a computer sitting on the corporate network, attached to a 10 year old machine, and that is a major vulnerability if it isn't protected, patched, and updated. So is GE or Siemens going to send out a technician to every machine every month when the new Windows patch rolls out? If not, the computer sitting on the network is vulnerable for how long?

Healthcare IT is very important, because computers are good at record-keeping, retrieval and storage, and that's a huge part of healthcare.


A large hospital takes in power from multiple feeds in case any one provider fails. It's amazing that we're even thinking in terms of "a security company" rather than "multiple security layers."

The fact that ransomware is still a concern is an indication that we've failed to update our IT management and design appropriately to account for them. We took the cheap way out and hoped a single vendor could just paper over the issue. Never in history has this ever worked.

Also, speaking of generators, a large enough hospital should be running power failure test events periodically. Why isn't a "massive IT failure test event" ever part of the schedule? Probably because they know they have no reasonable options and any scale of catastrophe would be too disastrous to even think about testing.

It's a lesson on the failures of monoculture. We've taken the 1970s design as far as it can go. We need a more organically inspired and rigorous approach to systems building now.


This. The 1970s design of the operating system and the few companies that deliver us the monoculture are simply not adequate or robust given the world of today.


It's insane that critical facilities use Windows rather than Linux/*BSD, which are rock-solid.



They'll still install crowdstrike or some other rootkit that will bring it all down anyway


> Hard to imagine how many millions of not billions of dollars this one bad update caused.

And even worse, possibly quite a few deaths as well.

I hope (although I will not be holding my breath) that this is the wake-up call we need to realise that we cannot have so much of our critical infrastructure rely on the bloated OS of a company known for its buggy, privacy-intruding, crapware-riddled software.

I'm old enough to remember the infamous blue-screen-of-death Windows 98 presentation. Bugs exist, but that was hardly a glowing endorsement of high-quality software. This was long ago, yet it is nigh on impossible to believe that the internal company culture has drastically improved since then, with regular high-profile screw-ups reminding us of what is hiding under the thin veneer of corporate respectability.

Our emergency systems don't need windows, our telephone systems don't need windows, our flight management systems don't need windows, our shop equipment systems don't need windows, our HVAC systems don't need windows, and the list goes on, and on, and on.

Specialized, high-quality OSes with low attack surfaces are what we need to run our systems. Not a generic OS stuffed with legacy code from a time when those applications were not even envisaged.

Keep it simple, stupid (KISS) is what we need to go back to; our lives literally depend on it.

With the multi-billion dollar screw-up that happened yesterday, and an as-of-yet unknown number of deaths, it's impossible to argue that the funds are unavailable to develop such systems. Plurality is what we need, built on top of strong standards for compatibility and interoperability.


OK, but this was a bug in an update of a kernel module that just happened to be deployed on Windows machines. How many OSs are there that can gracefully recover from an error in kernel space? If every machine that crashed had been running, say, Linux and the update had been coded equivalently, nothing would've changed.

Perhaps rather than an indictment on Windows, this is a call to re-evaluate microkernels, at least for critical systems and infrastructure.


It was not a call to replace windows systems with linux, but to replace it with specialised OSes that do less, with better stability guarantees.

And building something around microkernels would definitely not be a bad starting point.


> Took down our entire emergency department

What does this mean? Did the power go down? Is all the equipment connected? Or is it the insurance software that can't run do nothing gets done? Maybe you can't access patient files anymore but is that taking down the whole thing?


Every computer entered a bluescreen loop. We are dependent on Epic for placing orders, for nursing staff to know what needs to be done, for viewing records, for transmitting and interpreting images from radiology machines. It's how we know the current state of the department and where each patient (out of 50+ people we are simultaneously treating) is at. Our equipment still works, but we're flying blind and having to shout orders at each other, and we have no way to send radiology images to other doctors for consultation.


Yeah in Radiology we depend on Epic and a remote reading service called VRAD. VRAD runs on AWS and went down just after 0130 hrs EST. Without Epic & VRAD we were pretty helpless.


Can't imagine how stressful this must have been for Radiology. I had two patients waiting on CT read with expectation to discharge if no acute findings. Had to let them know we had no clear estimate for when that would be, and might not even know when the read comes back if we can't access epic.

Have a family member in crit care who was getting a sepsis workup on a patient when this all happened. They somehow got plain film working offline after a bit of effort.


Did the person survive?


We have limited visibility into this in the emergency department. You stabilize the patient and admit them to the hospital, then they become internal medicine or ICU's patient. Thankfully most of the work was done and consults were called prior to the outage, but they were in critical condition.


I will say - the way we typically find out really sends a shiver down your spine.

You come in for you next shift and are finishing charting from your prior shift. You open one of your partially finished charts and a little popup tells you "you are editing the chart for a deceased patient".


Sounds like this is hugely emotionally taxing. Do you just get used to it after a while, or is it a constant weight?

This is why I'm impressed by anyone who works in a hospital, especially the more urgent/intensive care


I'll admit I have no idea what I'm talking about, but aren't there some Plan B options? Something that's more manual? Or are surgeons too reliant on computers?


There are plan B options like paper charting, downtime procedures, alternative communication methods and so on. So while you can write down a prescription and cut a person open, you can't manually do things like pull up the patient's medical history for the last 10 years in a few seconds, have an image read remotely when there isn't a radiologist available on site, or electronically file for the meds to just show up instantly (all depending on what the outage issue is affecting of course). For short outages some of these problems are more "it caused a short rush on limited staff" than "things were falling apart". For longer outages it gets to be quite dangerous, and that's where you hope it's just your system that's having issues and not everyone in the region so you can divert.

If the alternatives/plan b's were as good or better than the plan a's then they wouldn't be the alternatives. Nobody is going to have half a hospital's care capacity sit as backup when they could use that year round to better treat patients all the time, they just have plans of last resort to use when what they'd like to use isn't working.

(worked healthcare IT infrastructure for a decade)


> So while you can write down a prescription and cut a person open, you can't manually do things like pull up the patient's medical history for the last 10 years in a few seconds, have an image read remotely when there isn't a radiologist available on site, or electronically file for the meds to just show up instantly (all depending on what the outage issue is affecting of course).

I worked for a company that sold and managed medical radiology imaging systems. One of our customers' admins called and said "Hey, new scans aren't being properly processed so radiologists can't bring them up in the viewer". I told him I'd take a look at it right away.

A few minutes later, he called back; one of their ERs had a patient dying of a gunshot wound and the surgeon needed to get the xray up so he could see where the bullet was lodged before the guy bled out on the table.

Long outages are terrifying, but it only takes a few minutes for someone to die because people didn't have the information they needed to make the right calls.


Yep, when patients often still die while everything is working fine even a minor inconvenience like "all of the desktop icons reset by mistake" can be enough to tilt the needle the wrong way for someone.


I used to work for a company that provided network performance monitoring to hospitals. I am telling a Story second hand that I heard the CEO share.

One day, during a rapid pediatric patient intervention, a caregiver tried to log in to a PC to check a drug interaction. The computer took a long time to log in because of a VDI problem where someone had stored many images in a file that had to be copied on login. While the care team was waiting for the computer, an urgent decision was made to give the drug. But a drug interaction happened — one that would have been caught, had the VDI session initialized more quickly.

The patient died and the person whose VDI profile contained the images in the bad directory committed suicide. Two lives lost because files were in the wrong directory.


What's insane medical malpractice is that radiology scans aren't displayed locally first.

You don't need 4 years of specialized training to see a bullet on a scan.


We can definitely get local imaging with X-Ray and ultrasound - we use bedside machines that can be used and interpreted quickly.

X-Ray has limitations though - most of our emergencies aren't as easy to diagnose as bullets or pneumonia. CT, CTA, and to a lesser extent MRI are really critical in the emergency department, and you definitely need four years of training to interpret them, and a computer to let you view the scan layer-by-layer. For many smaller hospitals they may not have radiology on-site and instead use a remote radiology service that handles multiple hospitals. It's hard to get doctors who want to live near or commute to more rural hospitals, so easier for a radiologist to remotely support several.


GP referred to "processed," which could mean a few things. I interpreted it to mean that the images were not recording correctly locally prior to any upload, and they needed assistance with that machine or the software on it.


I am talking out my ass, but...

Seems like a possible plan would be duplicate computer systems that are using last week's backup and not set to auto-update. Doesn't cover you if the databases and servers go down (unless you can have spares of those too), but if there is a bad update, a crypto-locker, or just a normal IT failure each department can switch to some backups and switch to a slightly stale computer instead of very stale paper.


We have "downtime" systems in place, basically an isolated Epic cluster, to prevent situations like this. The problem is that this wasn't a software update that was downloaded by our computers, it was a configuration change by Crowdstrike that was immediately picked up by all computers running its agent. And, because hospitals are being heavily targeted by encryption attacks right now, it's installed on EVERY machine in the hospital, which brought down our Epic cluster and the disaster recovery cluster. A true single point of failure.


Can only speak for the UK here, but having one computer system that is sufficiently functional for day-to-day operations is often a challenge, let alone two.


My hospital's network crashed this week (unrelated to this). Was out for 2-3 hours in early afternoon.

The "downtime" computers were affected just like everything else because there was no network.

Phones are all IP-based now; they didn't work.

Couldn't check patient histories, couldn't review labs, etc. We could still get drugs, thankfully, since each dispensing machine can operate offline.


There are often such plans, from DR systems to isolated backups to secondary systems, as much as the risk management budget allows at least. Of course it takes time to switch to these and back, the missing records cause chaos (both inside synced systems and with patient data) both ways, and it takes a while to do. On top of that not every system will be covered, so it's still a limited state.


Yes, but the more highly available you make it, the more it costs, and it's not like this happens every week.


As I was finishing my previous comment it occurred to me that costs are fungible.

Money spent on spares is not spent on cares.


Thank you, I'm quickly becoming tired of HN posters assuming they know how hospitals operate and asking why we didn't just use Linux.


There are problems with getting lab results, X-rays, CT and MRI scans. They do not have paper-based Plan B. IT outage in a modern hospital is a major risk to life and health of their patients.


I don't know about surgeons, but nursing and labs have paper fallback policies... they can backload the data later.


It's often the case that the paper fallbacks can't handle anywhere near the throughput required. Yes, there's a mechanism there, but it's not usable beyond a certain load.


I think it's eventually manageable for some subset of medical procedures, but the transition to that from business as usual is a frantic nightmare. Like there's probably a whole manual for dealing with different levels of system failure, but they're unlikely to be well practiced.

Or maybe I'm giving these institutions too much credit?


Why is the emergency department using windows?


Why did they update everything all at once?


I assume Crowdstrike is software you usually want to update quickly, given it is (ironically) designed to counter threats to your system.

Very easy for us to second guess today of course. But in another scenario a manager is being torn a new one because they fell victim to a ransomware attack via a zero day that systems were left vulnerable to because Crowdstrike wasn’t updated in a timely manner.


Maybe, if there's a new zero-day major exploit that is spreading like wildfire. That's not the normal case. Most successful exploits and ransom attacks are using old vulnerabilites against unpatched and unprotected systems.

Mostly, if you are reasonably timely about keeping updates applied, you're fine.


> Maybe, if there's a new zero-day major exploit that is spreading like wildfire. That's not the normal case.

Sure. And Crowdstrike releasing an update that bricks machines is also not the normal case. We're debating between two edge cases here; the answers aren’t simple. A zero day spreading like wildfire is not normal, but if it were to happen it could be just as, if not more, destructive than what we’re seeing with Crowdstrike.


In the context of the GP where they were actively treating a heart attack, the act of restarting the computer (let alone it never coming back) in and of itself seems like an issue.


I believe this update didn't restart the computer, just loaded some new data into kernel. Which didn't crash anything the previous 1000 times. A successful background update could hurt performance, but probably machines where that's considered a problem just don't run a general-purpose multitasking OS?


tfw you need to start staggering your virus updates in case your anti-virus software screws you over instead


Maybe those old boomer IT people were on to something by using different Citrix clusters and firewalling off the ones that run essential software...


Crowdstrike pushed a configuration change that was a malformed file, which was picked up by every computer running the agent (millions of computers across the globe). It's not like hospitals and IT systems are manually running this update and can roll it back.

As to why they didn't catch this during tests or why they don't perform gradual change rollouts to hosts, your guess is as good as mine. I hope we get a public postmortem for this.


Considering Crowdstrike mentioned in their blog that systems that had their 'falcon sensor' installed weren't affected [1], and the update is falcon content, I'm not sure it was a malformed file, but just software that required this sensor to be installed. Perhaps their QA only checked if the update broke systems with this sensor installed, and didn't do a regression check on windows systems without it.

[1]https://www.crowdstrike.com/blog/statement-on-falcon-content...


That’s not exactly what they’re saying.

It says that if a system isn’t “affected”, meaning it doesn’t reboot in a loop, then the “protection” works and nothing needs to be done. That’s because the Crowdstrike central systems, on which rely the agents running on the clients’ systems, are working well.

The “sensor” is what the clients actually install and run on their machines in order to “use Crowdstrike”.

The crash happened in a file named csagent.sys which on my machine was something like a week old.


I'm not familiar with their software, but I interpreted their wording to mean their bug can leave your system in one of two possible states:

(1) Entire system is crashed.

(2) System is running AND protected from security threats by Falcon Sensor.

And to mean that this is not a possible state:

(3) System is running but isn't protected by Falcon Sensor.

In other words, I interpreted it to mean that they're trying to reassure people they don't need to worry about crashes and hacks, just crashes.


> Why did they update everything all at once?

This is beyond hospital IT control. Clownstrike (sorry, Crowdstrike) unconditionally force-updates the hosts.


Likely because staggered updates would harm their overall security services. I'm guessing these software offer telemetry that gets shared across their clientele, so that gets hampered if you have a thousand different software versions.


My guess is this was an auto-update pushed out by whatever central management server they use. Given CS is supposed to protect your from malware, IT may have staged and pushed the update in one go.


Auto-updates are the only reason something like this gets so widespread so fast.


High-end hospital-management software is not simple stuff, to roll your own. And the (very few) specialty companies which produce such software may see no reason to support a variety of OS's.


A follow up question is why is the one OS chosen the one historically worst at security.


It appears insecure because it is under constant attack because it is so prevalent. Let’s not pretend the *nix world is any better.

I’m no fan of Windows or Microsoft but the commitment to backwards compatibility should not be underestimated.


Are you sure that argument still holds when everyone has Android/iOS phone with apps that talk to Linux servers, and some use Windows desktops and servers as well?


There isn't, and never was, a benevolent dictator choosing the OS for computers in medical settings.

Instead, it's a bunch of independent-ish, for-profit software & hardware companies. Each one trying to make it cheap & easy to develop their own product, and to maximize sales. Given the dominance of MS-DOS and Windows on cheap-ish & ubiquitous PC's, starting in the early-ish 1980's, the current situation was pretty much inevitable.


To add detail for those that don't understand, the big healthcare players barely have unix teams, and the small mom and pop groups literally have desktops sitting under the receptionist desk running the shittiest software imaginable.

The big health products are built on windows because they are built by outsourced software shops and target the majority of builds which are basically the equivalent of bob's hardware store still running windows 95 on their point of sale box.

The major players that took over this space for the big players had to migrate from this, so they still targeted "wintel" platforms because the vast majority of healthcare servers are windows.

It's basically the tech equivalent of everything having evolved from the width of oxen for railway gauge.


Because of critical mass. A significant amount of non-technically inclined people use Windows. Some use Mac. And they're intimidated by anything different.


Generally speaking, employees don't really per se use Windows so much as click the browser icon and proceed to use their employer's web-based tools.


There's a bunch of non-web proprietary software medical offices use to access patient files, result histories, prescription dispensation etc. At least here in Ontario my doctor uses an actual windows application to accomplish all that.


Then they use those apps. The point is that the usage of the OS as such is so minimal as to be irrelevant, as long as it has a launcher and an X in the top corner.

They could as well launch that app on OpenBSD.


Momentum as well. Many of these systems started in DOS. The DOS->Windows transition is pretty natural.


Exactly!

The question is: why did half or more of the Fortune 500 allow Crowdstrike - Windows hackers - access to and total control of their not-a-MS-Windows business? Obviously Crowdstrike does not differentiate between medicine and lifting cranes. "In the middle of the surgery" is not in their use case docs!

There was a Mercedes pitstop image somewhere with a wall of BSoD monitors :) But that is not Crowdstrike's business either...

And all of that via the public internet and misc clouds. Banks have their own fibre lines, why can't hospitals?

Airports should disconnect from the Internet too; selling tickets can be separate infra, and synchronization between POSes and checkout doesn't need to be in real time.

There is only one sane way to prevent such events: EDR controlled by the organization, which is sharply incompatible with 3rd party on-line EDR providers. But they can sell it in a box and do real-time support when called.


I mean this question in the most honest way; I am not trying to be snarky or superior.

What are the hard problems? I can think of a few, but I'm probably wrong.


Auditing: using Windows plus AV plus malware protection means you demonstrate compliance faster than trying to prove your particular version of Linux is secure. Hospitals have to demonstrate compliance in very short timeframes and every second counts. If you fail to achieve this, some or all of your units can be closed.

Dependency chains: many pieces of kit either only have drivers on windows or work much better on Windows. You are at the mercy of the least OS diverse piece of kit. Label printers are notorious for this as an e.g.

Staffing: Many of your staff know how to do their jobs excellently, but will struggle with tech. You need them to be able to assume a look and feel, because you don't want them fighting UX differences when every second counts. Their stress level is roughly equivalent to the worst 10 seconds of their day. And staff will quit or strike over UX. Even UI colour changes due to virtualization downscaling have triggered strife.

Change Mgmt: Hospitals are conservative and rarely push the envelope. We are seeing a major shift at the moment in key areas (EMR) but this is still happening slowly. No one is interested in increasing their risk just because Linux exists and has Win64 compatibility. There is literally no driver for change away from windows.


> There is literally no driver for change away from windows.

(Not including this colossal fuck up.)


No hospital will shift to Linux because of this incident. They may shift away from Crowdstrike, but not to another OS.


It's actually not that hard from a conceptual implementation standpoint, it's a matter of scale, network effects, and regulatory capture


> What are the hard problems? I can think of a few, but I'm probably wrong.

Billing and insurance reimbursement processes change all the time and are a headache to keep up to date. E.g. the actual dentist software is like Paint, but with mainly the bucket tool and some way to quickly insert teeth objects to match your mouth. I.e. there is almost no medical skill in the software itself helping the user.


Because essentially every large hospital in the USA does?


This is the result of vendor lock-in and the lesson for all businesses not to use Microsoft servers. Linux/*BSD are rock-solid and open source.


It's not just that. A large portion of IT people who work in these industries find Windows much easier to administer. They're very resistant to switching out even if it was possible and everything the company needed was available elsewhere.

Even if they did switch, they'd then want to install all the equivalent monitoring crap. If such existed, it would likely be some custom kernel driver and it could bring a unix system to its knees when shit goes wrong too.


I mean, Crowdstrike has a Linux equivalent which broke RHEL recently by triggering a kernel panic.


Contact a lawyer if this affected her health, please. Any delays in receiving stroke care could have injured her more, I imagine. Any docs here?


ER worker here. It really depends on the details. If she was C-STAT positive with last known normal within three hours, you assume stroke, activate the stroke team, and everything moves very quickly. This is where every minute counts, because you can do clot busting to recover brain function.

The fact that she was discharged without an overnight admit suggests to me that the MRI did not show a stroke, or perhaps she was outside the treatment window when she went to the hospital.


What if it was a cerebral bleed?


I can't even begin to imagine the cost of proving the health effects and attempting to win the case.


Yes. Reading and learning.


I remember a fed speaker in the 90s at the Alexis hotel Defcon trying to rationalize their weirdly over-aggressive approach to enforcement by mentioning how hackers would potentially kill people in hospitals. Fast forward to today and it's literally the "security" software vendor that's causing it.


Well, cryptolockers have actually compromised various hospitals, and I remember the first one was in the United Kingdom.


Don't forget that nearly all crypto lockers are run by North Korea or other state espionage groups pretending to be North Korea.

If we adjusted our foreign policy slightly, I think we would dissuade that whole class of attacker.


It's not like hackers haven't killed people in hospitals with e.g. ransomware. Our local dinky hospital system was hit by ransomware twice, which at the very least delayed some important surgeries.


I can't imagine why any critical system is connected to the internet at all. It never made sense to me. Wifi should not be present on any critical system board and ethernet plugged in only when needed for maintenance.

This should be the standard for any life sustaining or surgical systems, and any critical weapons systems.


I work for a large medical device company and my team works on securing medical devices. At least at my company as a general rule, the more expensive the equipment (and thus the more critical the equipment, think surgical robots) the less likely it will ever be connected to a network, and that is exactly because of what you said, you remove so many security issues when you keep devices in a disconnected state.

Most of what I do is creating the tools to let the field reps go into hospitals and update capital equipment in a disconnected state (i.e., the reps must be physically tethered to the device to interact with it). The fact that any critical equipment would get an auto-update, especially mid-surgery, is incredibly bad practice.


I work for the government supporting critical equipment - not in medical, in the transportation sector - and the systems my team supports not only are not connected to the internet, they aren't even capable of being so connected. Unfortunately the department responsible for flogging us to do cybersecurity reporting (a different org branch than my team) has all our systems miscategorized as IT data systems (when they don't even contain an operating system). So we waste untold numbers of engineer hours now reporting "0 devices affected" to lists of CVEs and answering data calls about SSH, Oracle or Cisco vulnerabilities, etc., which we keep answering with "this system is air gapped and uses a microcontroller from 1980 that cannot run Windows or Linux", but the cybersecurity-flogging department refuses to properly categorize us. My colleague is convinced they're doing that because it inflates their numbers of IT systems.

Anyway: it is getting to the point that I cynically predict we may be required to add things to the system (such as embedding PCs), just so we can turn around and "secure" them to comply with the requirements that shouldn't be applied to these systems. Maybe this current outage event will be a wake up call to how misplaced the priorities are, but I doubt it.


All this stuff could easily be airgapped or revert to USB stick fail safe.


Have you ever tried to airgap a gigantic wifi network across several buildings?

Has to be wifi because the carts the nurses use roll around. Has to be networked so you can have EMR's that keep track of what your patients have gotten and the Pharmacists, doctors, and nurses can interface with the Pyxis machines correctly. The nurse scans a patients barcode at the Pyxis, the drawer opens to give them the drugs, and then they go into the patient's room and scan the drug barcode and the patients barcode before administering the drug. This system is to prevent the wrong drug from being administered, and has dramatically dropped the rates of mis-administering drugs. The network has to be everywhere on campus (often times across many buildings). Then the doctor needs to see the results of the tests and imaging- who is running around delivering all of these scans to the right doctors?

You don't know what you are talking about if you think this is easy.


Air-gapping the system from the external world is different from air-gapping internally. The systems are only updated via physical means. And possibly all data in and out is moved offline, via a certain double-firewall arrangement (you do not allow direct contact, but dump files in and out). Not common, but for industrial critical systems I saw a few big shops do this.


So how does a doctor issue a discharge order via e-prescription to the patients pharmacy for them to pick up when they leave? How do you update the badge readers on the drug vaults when an employee leaves and you need to deactivate their badge? How do you update the EMR's from the hospital stay so the GP practice they use can see them after discharge? How do you order more supplies and pharmacy goods when you run out? How do you contact the DEA to get approval for using certain scheduled meds? I'm afraid that external networks are absolutely a requirement for modern hospitals.

If the system has to be networked with the outside world, who is responsible for physically updating all of these machines, so they don't get ransomware'd? Who has to go out and visit each individual machine and update it each month so the MRI machine doesn't get bricked by some teen ransomware gang? Remember that was the main threat hospitals faced 3-4 years ago, which is why Crowdstrike ended up on everyone's computer: because the ransomware insurance people forced them to.

There is a reason that I am a software engineer and not an IT person. I prefer solving more tractable problems, and I think proving p!=np would be easier than effectively protecting a large IT network for people who are not computing professionals.

One of my favorite examples: in October 2013 casino/media magnate and right wing billionaire Sheldon Adelson gave a speech about how the US and Israel should use nuclear weapons to stop Iran nuclear program. In February 2014 a 150 line VB macro was installed on the Sands casino network that replicated and deleted all HDDs, causing 150 million dollars of damage. That was to a casino, which spends a lot of money on computer security, and even employs some guys named Vito with tire irons. And it wasn't nearly enough.


> Who has to go out and visit each individual machine and update it each month so the MRI machine doesn't get bricked by some teen ransomware gang?

The manufacturer does. As I mentioned in my OP I help build the software for our field reps to go into hospitals and clinics to update our devices in a disconnected state. Most of the critical equipment we manufacture has this as a requirement since it can't be connected to a network for security reasons.

As for discharge orders, etc, I can't speak to that, but that's also not what I would consider critical. I'm talking about things like surgical robots, which can not be connected to a network for obvious reasons, especially during a surgery.


External networks are required but it should be possible to air gap the critical stuff to read only. It’s just that it’s costly and hospitals are poor/cheap


Did this actually happen to medical equipment mid-surgery today?


The OP for this very thread said as much.


My wife is a hospital pharmacist. (1) When she gets a new prescription in, she needs to see the patients charts on the electronic medical records, and then if she approves the medication a drawer in the Pyxis cabinet (2) will open up when a nurse scans the patients barcode, allowing them to remove the medication, and then the nurse will scan the patient's barcode and the medication barcode in the patients room to record that it was delivered at a certain time. Computers are everywhere in healthcare, because they need records and computers are great at record-keeping. All of those need networks to connect them, mostly on wifi (so the nurses scanners can read things).

In theory you could build an air-gapped network within a hospital, but then how do you transmit updates to the EMR's across different campuses of your hospital? How do you issue electronic prescriptions for patients to pick up at their home pharmacy? How do you handle off-site data backup?

Quite honestly, outside of defense applications I'm not aware of people building large air-gapped networks (and from experience, most defense networks aren't truly air-gapped any more, though I won't go into detail). Hospitals, power plants, dams, etc. all of them rely heavily on computers these days, and connect those over the regular internet.

1: My wife was the only pharmacist in her department last night whose computer was unaffected by Crowdstrike (for unknown reasons). She couldn't record her work in the normal ways, because the servers were Crowdstrike'd as well. So she spun up a document of her decisions and approvals, for later entry into the systems. It was over 70 pages long when she went off shift this morning. She's asleep right now.

2: https://www.bd.com/en-uk/products-and-solutions/products/pro...


First - drop "air-gapped" term and replace it with "internet-gapped". TA^h^h^a^a! And it already have a name: "The LAN"... Now teach managers about importance of local net vs open/public/world net. Tell them cloud costs more becouse someone is making a fortune or three on it !

TIP: many buildings can be part of one LAN! It's called a VPN, and Russia and China don't like it because it's good for people!

TIP: data can easily be exchanged when needed! Including over the LAN.

> My wife is a hospital pharmacist. (1) When she gets a new prescription in, she needs to see the patient's charts on the electronic medical records [...] All of those need networks to connect them, mostly on wifi (so the nurses' scanners can read things).

That was a description of a very local workflow...

It was a description of a data flow - and there's no reason it should be monopolized by an insecure-by-design OS vendor whose product then has to be secured by what is essentially a kernel rootkit, aka OS hacking. Which contradicts using that OS in the first place!

And it looks like Crowdstrike is just the "if you have to ask the price, you can't afford it" version of SELinux :>>> RH++ for two decades of presentations on why SELinux is necessary.

But overall, allowing automatic updates from a 3rd party with no clue about medicine onto hospital systems, etc. is criminal negligence by managers. Simple as that. Current state of the art? More negligence! Add (business) academia & co to the chronic offenders. Call them what they truly are - sociopaths, via their craft training facilities.

> In theory you could build an air-gapped network within a hospital, but then how do you transmit updates to the EMR's across different campuses of your hospital?

How do you transmit to other campuses of other hospitals? EASY! Transfer the mandatory data. Please notice I used the words "mandatory" and "data". I DID NOT SAY "use the mandatory HTTP stack to transfer data"! NO. NO, I'm far, faaar from even suggesting THAT! :>

> How do you issue electronic prescriptions for patients to pick up at their home pharmacy?

Hard sold on that "air-gapped and in cage" meme, eh? Send them required data via secure and private method! Communications channels already "hacked" - monopolized - by FB? Obviously that should do not happend in first place. So resolve it as part of un-win-dosing critical civilian infra.

> How do you handle off-site data backup?

That one I do not get. Are you saying that cloud access is the only possibility for backups??? And the Internet is a must to do it?? Is the medical staff brain dead? Ah, no... It's just the managers... Again.

> Quite honestly, outside of defense applications I'm not aware of people building large air-gapped networks

And dhcp and "super glue" and tons of other things was invented by military, for a reason, but that things proliferated to civilians anyway. For good reasons. Air-gapping should be much more common when wifi signal allows tracking how you move in your own home. Not to mention GSM+ based "technologies"...

There is an old saying: computers maximize whatever you're already doing. And where there is chaos, the computers simply do their work.


I think the critical systems here are often the ones that need to be connected to some network. Somebody up there mentioned how the MRI worked fine, but they still needed to get the results to the people who needed it. So the problem there was more doctor <-> doctor.


Yeah, our imaging devices were working fine, but with Epic down, you lose most of your communication between departments and your sole way of sharing radiology images and interpretations.


> Roslin: ...it tells people things like where the restroom is, and--

> Adama: It's an integrated computer network, and I will not have it aboard this ship.

> Roslin: I heard you're one of those people. You're actually afraid of computers.

> Adama: No, there are many computers on this ship. But they're not networked.

> Roslin: A computerized network would simply make it faster and easier for the teachers to be able to teach--

> Adama: Let me explain something to you. Many good men and women lost their lives aboard this ship because someone wanted a faster computer to make life easier. I'm sorry that I'm inconveniencing you or the teachers, but I will not allow a networked computerized system to be placed on this ship while I'm in command. Is that clear?

> Roslin: Yes, sir.

> Adama: Thank you. 'Scuse me.


and any critical weapons systems.

... at which point you will lose battles to enemies who have successfully networked their command and control operations. (For extra laughs, just wait until this is also true of AI.)

Ultimately there are just too darned many advantages to connecting, automating, and eventually 'autonomizing' everything in sight. It sucks when things don't go right, or when a single point of failure causes a black-swan event like this one, but in an environment where you're competing against either time or external adversaries, the alternatives are all worse.


Or the opposite: the enemy (or a third-party enemy who wasn't previously a combatant in the battle) hijacks your entire naval USV/UUV fleet & air force drone fleet using an advanced cyberattack, and suddenly your enemy's military force has almost doubled while yours is down to almost zero, and these hijacked machines are within your own lines.


Yes, the efficiency gains of remote automated administration and deployment make up for most outages that are caused by it.

A better thing to do is phased deployment, so you can see if an update will cause issues in your environment before pushing it to all systems. As this incident shows, you can’t trust a software vendor to have done that themselves.
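As a rough sketch, a ring-based (phased) rollout on the consuming org's side might look like the following. Everything in it is hypothetical - the ring names and the deploy_to/check_health/rollback helpers stand in for whatever your own tooling provides - but the shape is the point: each ring caps the blast radius, and widening it is gated on observed health rather than on the vendor's say-so.

  import time

  RINGS = ["canary", "early_adopters", "broad", "everything_else"]

  def deploy_to(ring, update):       # placeholder: push the update to one ring
      print(f"deploying {update} to {ring}")

  def check_health(ring):            # placeholder: crash loops, error rates, boot failures...
      return True

  def rollback(ring, update):        # placeholder: revert the ring
      print(f"rolling back {update} in {ring}")

  def rollout(update, soak_minutes=60):
      for ring in RINGS:
          deploy_to(ring, update)
          time.sleep(soak_minutes * 60)   # let it soak before widening the blast radius
          if not check_health(ring):
              rollback(ring, update)
              raise RuntimeError(f"{update} failed health checks in {ring}")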


This wasn't a binary patch though, it was a configuration change that was fed to every device. Which raises a LOT of questions about how this could have happened and why it wasn't caught sooner.


Writing from the SRE side of the discipline, it's commonly a configuration change (or a "flag flip") that ultimately winds up causing an outage. All too seldom are configuration data considered part of the same deployable surface area (and, as a corollary, part of the same blast radius) as program text.

I've mostly resigned myself these days to deploying the configuration change and watching for anomalies in my monitoring for a number of hours or days afterward, but I acknowledge that I also have both a process supervisor that will happily let me crash-loop my programs and deployment infrastructure that will nonetheless allow me to roll things back. Without either of those, I'm honestly at a loss as to how I'd safely operate this product.
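One partial mitigation when there is no rollback path: treat the config itself defensively, validating it on load and falling back to the last config that actually passed instead of crash-looping on a bad one. A rough sketch - the file names and the validate() checks are made up for illustration:

  import json, shutil

  CONFIG = "config.json"
  LAST_GOOD = "config.last_good.json"

  def validate(cfg):
      # whatever invariants matter for your program; illustrative checks only
      return isinstance(cfg.get("foo"), bool) and cfg.get("timeout_s", 1) > 0

  def load_config():
      try:
          with open(CONFIG) as f:
              cfg = json.load(f)
          if not validate(cfg):
              raise ValueError("config failed validation")
      except Exception:
          # unparsable or invalid config: fall back rather than crash-loop
          with open(LAST_GOOD) as f:
              return json.load(f)
      shutil.copy(CONFIG, LAST_GOOD)   # promote only configs that passed
      return cfg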


  # Update A
  
  ## config.ext
  
  foo = false
  
  ## src.py
  
  from config import config
  
  if config('foo'):
      work(2 / 0)    # latent divide-by-zero; dead code while foo = false
  else:
      work(10 / 5)
"Yep, we rigorously tested it."

  # Update B
  
  ## config.ext
  
  foo = true
"It's just a config change, let's go live."


Yeah, that's about right.

The most insidious part of this is when there are entire swaths of infrastructure in place that circumvent the usual code review process in order to execute those configuration changes. Boolean flags like your `config('foo')` here are most common, but I've also seen nested dictionaries shoved through this way.


When I was at FB there were a load of SEVs caused by config changes, such that the repo itself would print out a huge warning about updating configs and show you how to do a canary to avoid this problem.


As in, there was no way to have configured the sensors to prevent this? They were just going to get this if they were connected to the internet? If I was an admin that would make me very angry.


This is the way it's done in the nuclear industry across the US for power and enrichment facilities. The operational/secure section of the plant is air-gapped, with hardware data diodes to let info out to engineers. Updates and data are sneakernetted in.


Not like hackers haven’t done the same.


At least hackers let people boot their machines, and some even have an automated way to restore the files after a payment. CS doesn't even do that. Hackers are looking better and more professional if we're going to put them in the same bucket, that is.


The criminal crews have a reputation to uphold. You don't deliver on payment, the word gets around and soon enough nobody is going to pay them.

These security software vendors have found a wonderful tacit moat: they have managed to infect various questionnaire templates by being present in a short list of "pre-vetted and known" choices in a dropdown/radiobutton menu. If you select the sane option ("other"), you get to explain to technically inept bean counters why you did so.

Repeat that for every single regulator, client auditing team, insurance company, etc. ... and soon enough someone will decide it's easier and cheaper to pick an option that gets you through the blind-leading-the-blind question karaoke with less headaches.

Remember: vast majority of so-called security products are sold to people high up in the management chain, but they are inflicted upon their victims. The incentives are perverse, and the outcomes accordingly predictable.


> If you select the sane option ("other"), you get to explain to technically inept bean counters why you did so.

Tell them it’s for preserving diversity in the field.


Funnily enough, a bit of snark can help from time to time.

For anyone browsing the thread archive in the future: you can have that quip in your back pocket and use it verbally when having to discuss the bingo sheet results with someone competent. It's a good bit of extra material, but it can not[ß] be your sole reason. The term you do want to remember is "additional benefit".

The reasons you actually write down boil down to four things. High-level technical overview of your chosen solution. Threat model. Outcomes. And compensating controls. (As cringy as that sounds.)

If you can demonstrate that you UNDERSTAND the underlying problem, and consider each bingo sheet entry an attempt at tackling a symptom, you will be on firmer ground. Focusing on threat model and the desired outcomes helps to answer the question, "what exactly are you trying to protect yourself from, and why?"

ß: I face off with auditors and non-technical security people all the time. I used to face off with regulators in the past. In my experience, both groups respond to outcome-based risk modeling. But you have to be deeply technical to be able to dissect and explain their own questions back to them in terms that map to reality and the underlying technical details.


nothing like this scale. These machines are full blue screen and completely inoperable.


The problem is concentration risk and incentives. Everyone is incentivized to follow the herd and buy Crowdstrike for EDR because of sentiment and network effects. You have to check the box, you have to be able to say you're defending against this risk (Evolve Bank had no EDR, for example), and you have to be able to defend your choice. You've now concentrated operational risk in one vendor, versus multiple competing vendors and products minimizing blast radius. No one ever got fired for buying Crowdstrike previously, and you will have an uphill climb internally attempting to argue that your org shouldn't pick what the bubble considers the best control.

With that said, Microsoft could've done this with Defender just as easily, so be mindful of system diversity in your business continuity and disaster recovery plans and enterprise architecture. Heterogeneous systems can have inherent benefits.


If you have a networked hybrid heterogeneous system, though, now you have a weakest-link issue, since lateral movement can happen once your weaker perimeter tool is breached.


A threat actor able to evade EDR and moving laterally or pivoting through your env should be an assumption you’ve planned for (we do). Defense in depth, layered controls. Systems, network, identity, etc. One control should never be the difference between success and failure.

https://apnews.com/article/tech-outage-crowdstrike-microsoft...

> “This is a function of the very homogenous technology that goes into the backbone of all of our IT infrastructure,” said Gregory Falco, an assistant professor of engineering at Cornell University. “What really causes this mess is that we rely on very few companies, and everybody uses the same folks, so everyone goes down at the same time.”


WannaCry did about the same damage to be honest. To pretty much the same systems.

The irony is the NHS likely installed CrowdStrike as a direct reaction to WannaCry.


The difference is malware infection is usually random and gradual. CrowdStrike screwup is everything at once with 100% lethality.


Computers hit by ransomware are also inoperable, and ransomware is wildly prevalent.


Yes, but computers get infected by ransomware randomly; Crowdstrike infected a large number of life-critical systems worldwide over time, and then struck them all down at the same time.


I'm not sure I agree, ransomware attacks against organizations are often targeted. They might not all happen on the same day, but it is even worse: an ongoing threat every day.


It's why it's not worse - an ongoing threat means only a small number of systems are affected at a time, and there is time to develop countermeasures. An attack on everything all at once is much more damaging, especially when it eliminates fallback options - like the hospital that can't divert their patients because every other hospital in the country is down too, and so is 911.


Ransomware that affects only individual computers doesn't get payouts outside of hitting extremely incompetent orgs.

If you want actually good payout, your crypto locker has to either encrypt network filesystems, or infect crucial core systems (domain controllers, database servers, the filers directly, etc).

Ransomware getting smarter about sideways movement, and proper data exfiltration etc attacks, are part of what led to proliferation of requirements for EDRs like Crowdstrike, btw


Ransomware vendors at least try to avoid causing damage to critical infrastructure, or hitting way too many systems simultaneously - it's good neither for business nor for their prospects of staying alive and free.

But that's beside the point. The point is, attacks distributed over time and space ultimately make the overall system more resilient; an attack happening everywhere at once is what kills complex systems.

> Ransomware getting smarter about sideways movement, and proper data exfiltration etc attacks, are part of what led to proliferation of requirements for EDRs like Crowdstrike, btw

To use medical analogy, this is saying that the pathogens got smarter at moving around, the immune system got put on a hair trigger, leading to a cytokine storm caused by random chance, almost killing the patient. Well, hopefully our global infrastructure won't die. The ultimate problem here isn't pathogens (ransomware), but the oversensitive immune system (EDRs).


I want to agree with the point you're making, but WannaCry, to take one example, had an impact at roughly this scale.


I think recovering from this incident will be more straightforward than WannaCry.

At large-scale, you don’t solve problems, you only replace them with smaller ones.


Not like the security software has ever stopped it.


A lot of security software - ranging from properly used EDRs like Crowdstrike to things as simple as setting some rules in Windows File Server Resource Manager - has foiled many ransomware attacks, at the very least.


I'm guessing hundreds of billions if you could somehow add it all up.

I can't believe they pushed updates to 100% of Windows machines and somehow didn't notice a reboot loop. Epic gross negligence. Are their employees really this incompetent? It's unbelievable.

I wonder where MSFT and Crowdstrike are most vulnerable to lawsuits?


This outage seems to be the natural result of removing QA by a team separate from the (always optimistic) dev team as a mandatory step for extremely important changes. And neglecting canary-type validations. The big question is whether businesses will migrate away from such a visibly incompetent organization. (Note I blame the overall org; I am sure talented individuals tried their best inside a set of procedures that asked for trouble.)


So there was apparently an Azure outage prior to this big one. One thing that is a pretty common pattern in my company when there are big outages is something like this:

1. Problem A happens, it’s pretty bad

2. A fix is rushed out very quickly for problem A. It is not given the usual amount of scrutiny, because Problem A needs to be fixed urgently.

3. The fix for Problem A ends up causing Problem B, which is a much bigger problem.

tl;dr don’t rush your hotfixes through and cut corners in the process, this often leads to more pain


If you’ve ever been forced to use a PC with Crowdstrike, it’s not amazing at all. I’m amazed an incident of this scale didn’t happen earlier.

Everything about it reeks of incompetence and gross negligence.

It’s the old story of the user and purchaser being different parties - the software needs to be only good enough to be sold to third parties who never need to use it.

It’s a half-baked rootkit part of performative cyberdefence theatrics.


> It’s a half-baked rootkit part of performative cyberdefence theatrics.

That describes most of the space, IMO. In a similar vein, SOC2 compliance is bullshit. The auditors lack the technical acumen – or financial incentive – to actually validate your findings. Unless you’re blatantly missing something on their checklist, you’ll pass.


From an enterprise software vendor perspective, cyber checklists feel like a form of regulatory capture. Someone looking to sell something gets a standard or best practice created, added to the checklists, and everyone is forced to comply, regardless of the context.

Any exception made to this checklist is reviewed by third parties that couldn't care less, bean counters, or those technically incapable of understanding the nuance, leaving only the large providers able to compete on the playing field they manufactured.


This will go on for multiple days, but hundreds of billions would be >$36 trillion annualized if it was that much damage for one day. World annual GDP is $100 trillion.
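(For reference, the arithmetic: even the low end of "hundreds of billions" - $100B in a single day - works out to 100B × 365 ≈ $36.5T if sustained for a year, i.e. more than a third of annual world GDP.)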


Their terms of use undoubtedly disclaim any warranty, fitness for purpose, or liability for any direct or incidental consequences of using their product.

I am LMFAO at the entire situation. Somewhere, George Carlin is smiling.


MSFT doesn’t recommend or specify CrowdStrike


I wonder if companies are incentivized to buy Crowdstrike because of Crowdstrike's warranty that will allegedly reimburse you if you suffer monetary damage from a security incident while paying for Crowdstrike.


If such a warranty exists, the real question will be how Crowdstrike tries to spin this as a non-security incident.


The CEO says it isn't and we believe them apparently


There must be an incentive. Because from a security perspective, bringing in a 3rd party to a platform (Microsoft) to do a job the platform already does is literally just the definition of opening up holes in your security. Completely b@tshit crazy; the salesmen for these products should hang their heads in shame. It's just straight up bad practice. I'm astounded it's so widespread.


Insurance companies recommended them


Same people who destroyed a US bridge recently.

This is the result of giving away US jobs overseas at 1/10th the salary


Do you have some more details?


I saw one of the surgery videos recently. The doctor was saying, "Alexa, turn on suction." It boggled my mind. There could be so many points of failure.


Fwiw this is not typical; we don’t have Alexa/Siri type smart devices in any OR I work in, and suction is turned on and off with a button and a dial.


It's in a Maryland clinic doing plastic surgery.

Edit: Found it. https://www.youtube.com/watch?v=nS9nLvGMLH0&t=947s


ALEXA, TURN OFF THE SUCTION! ALEXA!!

“Loive from NPR news in Washington“


I don't suppose there was a doctor or nurse named Alexa involved?


Not to be that guy, but I often say software engineering as a field should have harsher standards of quality, and certainly liability, for things like this. You know, like civil engineers, electrical engineers and most people whose work could kill people if done wrongly.

Usually when I write this devs get all defensive and ask me what the worst thing is that could happen.. I don't know.. Could you guarantee it doesn't involve people dying?

Dear colleagues, software is great because one person's work multiplies. But it is also a damn fucking huge responsibility to ensure you are not inserting bullshit into the multiplication.


Some countries such as Canada have taken minor steps towards this, for example making it illegal to call oneself a software engineer unless you are certified by the province's professional engineering body; however, this is still missing a lot. I also don't wish to be "that guy" but I'll go further and say that the US is really holding this back by not making the use of Software Engineer as a title (without holding a PEng) illegal in a similar fashion.

If we can at least get that basis then we can start to define more things such as jobs that non-Engineers cannot legally do, and legal ramifications for things such as software bugs. If someone will lose their professional license and potentially their career over shipping a large enough bug, suddenly the problem of having 25,000 npm dependencies and continuous deployment breaking things at any moment will magically cease to exist quite quickly.


I'd go a step farther and say software engineering as a field is not respected at the same level as those certified/credentialed engineering disciplines, because of this lack of standards and liability. That leads to common occurrences of systemic destructive failures such as this one, because organization-level direction is very lax about the potential for software failure.


I don't know, I get paid more than most of my licensed engineer friends. That's the only respect that really matters to me. Not saying there might not be other advantages to a professional organization for software.


I feel the same way but do agree there’s a general lack of respect for the field relative to other professions. Here’s another thread on the subject https://news.ycombinator.com/item?id=23676651


Respect has to be earned.


I believe instances like this will push people to reconsider the lax stance. Humans in general have a hard time regulating something abstract. The fact that people can be killed has been well known since the '80s; see https://en.wikipedia.org/wiki/Therac-25


I once worked on some software that generated PDFs of lab reports for drug companies monitoring clinical trials. These reports had been tested, but not exhaustively.

We got a new requirement to give doctors access to print them on demand. Before this, doctors only read dot matrix-printed reports that had been vetted for decades. With our XSL-FO PDF generator, it was possible that a column could be pushed outside the print boundary, leading a doctor to see 0.9 as 0. I assume in a worst-case scenario, this could lead to a misdiagnosis, intervention, and even a patient's death.

I was the only one in the company who cared about doing a ton more testing before we opened the reports to doctors. I had to fight hard for it, then I had to do all the work to come up with every possible lab report scenario and test it. I just couldn't stand the idea that someone might die or be seriously hurt by my software.

Imagine how many times one developer doesn't stand up in that scenario.


This is why I made that point, similar to you I would not stand for having my code in something that I can't stand behind, especially if it potentially harms people.

But it should not hinge on us convincing people.


I'd endorse this. That way when my hypothetical PHB wants to know why something is taking so long I can say "See this part? Someone could die if we don't refactor it."


Related talk by Alan Kay: https://youtu.be/D43PlUr1x_E


It’s important not to disregard that software engineers are often overruled by management or product when strict deadlines and targets exist.


"If only we asked harder problems for our leetcode interview!"


And how many lives lost?


It's honestly terrifying that someone would opt for Windows in systems critical to medical emergencies.

I hope organisations start revisiting some of these insane decisions.


Not my story to tell, so I'm relaying it. A childhood friend works for a big company, you've heard their name; they make nuclear control systems for nuclear reactors. They have products out in the field they support, and there are new reactors in parts of the world from time to time. We were scheduled to have lunch a couple years back and he bailed; we rescheduled, and he bailed again because that was the day you couldn't defer XP updates anymore - they came in and some XP systems became Windows 10. XP was "nuclear reactor approved" by someone, and they had a toolchain that didn't work right on other versions of Windows. It all gave me chills.

They ended up giving MS a substantial amount of money to extend support for their use case for some number of years. I can't remember the number he told me but it was extremely large.


If it's not connected to the internet, who cares?


It sounds like he said XP machines auto-updated to Windows 10, and they would have had to have been connected to the internet in order to download that update. (I'm assuming, optimistically, that these were more remote-control computers than actual nuclear devices.)


Eh. There are a great many problems that could befall medical emergency systems that are unrelated to the OS. Like power loss. I think the core problem here really is a lack of redundancy.


I've had updates break Linux machines.

Just a few weeks ago I had an OpenBSD box render itself completely unbootable after nothing more than a routine clean shutdown. Turns out their paranoid-idiotic "we re-link the kernel on every boot" coupled with their house-of-cards file system corrupted the kernel, then overwrote the backup copy when I booted from emergency media - which doesn't create device nodes by default so can't even mount the internal disks without more cryptic commands.

Give me the Windows box, please.


Counter anecdote: I’ve been using Linux for 20 years, nearly half of that professionally. The only time I’ve broken a Linux box where it wasn’t functional was mixing Debian unstable with stable, and I was still able to fix it.

I’ve had hardware stop working because I updated the kernel without checking if it removed support, but a. that’s easily reversible b. Linux kept working fine, as expected.

I’ll also point out, as I’m sure you know, that the BSDs are not Linux.


Funny, I broke my Debian twice (on two separate laptops) by doing exactly that, mixing stable with testing. I was kinda obliged to use "testing" because the Dell XPS would otherwise miss critical drivers.

I switched to opensuse afterwards


In fairness, this is the number one way listed [0] on how to break Debian. That said, if you need testing (which isn’t that uncommon for personal use; Debian is slow to roll out changes, favoring stability), then running pure Sid is actually a viable option. It’s quite stable, despite its name.

[0]: https://wiki.debian.org/DontBreakDebian


you are comparing a broken bicycle to a trainwreck


some critical software has DRM that only works in Windows


"Took down our entire emergency department as we were treating a heart attack. 911 down for our state too."

Why would Windows systems be anywhere near critical infra ?

Heart attacks and 911 are not things you build with Windows based systems.

We understood this 25 years ago.


I do not think Windows is the problem here. The problem is critical-infrastructure equipment being connected to the internet, imo. There is little reason for a lot of computers in some settings to be connected to the internet, except for convenience or negligence. If data transfer needs to be done, it can happen through another computer. Some systems should exist on a (more or less) isolated network at best. Too often we do not really understand the risk of a device being connected to the internet, until something like this happens.


You have no idea how a hospital or modern medicine works. It needs to be online.


Why would a machine that is required for an MRI machine to work (as one of the examples given in the thread here) need to be online? I understand about logging, though even then I think it is too risky. Do all these machines _really_ need to be online, or has nobody bothered to change things after all the times something has happened - or, even worse, do software companies profit in certain ways and not want to change their models? Can we imagine no other way to do things apart from connecting everything to some server somewhere?


MRI readouts are 3D, so they can't be printed for analysis. They are gigabytes in size, and the units are usually in a different part of the building. So you could sneakernet CDs every time an MRI is done, then sneakernet the results back. Or you could batch it, and then analysis is done slowly and all at once. OR you could connect it to a central server and results/analysis can be available instantly.

Smarter people than us have already thought through this and the cost-benefit analysis said "connect it to a server"


So in that case you set up a NAS server that it can push the reports to, and everything else is firewalled off.

It's just laziness, and to be honest, an outage like this has no impact on their management's reputation, as a lot of other poorly run companies and institutions were also impacted, so the focus is on Crowdstrike and Azure, not them.


I admit I'm not a medical professional, but these sound like problems with better solutions than lots of internet-connected terminals that can be taken down by EDR software.

Why not an internal only network for all the terminals to talk to a central server, then disable any other networking for the terminals? Why do those terminals need a browser where pretty much any malware is going to enter from? If hospitals are paying out the ass for their management software from epic/etc, they should be getting something with a secure design. If the central server is the only thing that can be compromised then when edr takes it down you at least still have all your other systems, presumably with cached data to work from


Ever heard of a LAN? You don't need internet access for every single machine.


Many X-rays (MRIs, CT scans, etc.) are read and interpreted by doctors who are remote. There are firms whose entire business is exactly that - providing a way to connect radiologists and hospitals, and handling the usual business back-end work of billing, HR, and so on. Search for "teleradiology"

Same goes for electronic medical records. There are people who assign ICD-10 codes (insurance billing codes) to patient encounters. Often this is a second job for them and they work remote and typically at odd hours.

A modern hospital cannot operate without internet access. Even a medical practice with a single doctor needs it these days so they can file insurance claims, access medical records from referred patients and all the other myriad reasons we use the internet today.


Okay, so (as mentioned elsewhere in this thread), connect the offline box to an online NAS with the tightest security between the two humanly possible. You can get the relevant data out to those who need it.

This stuff isn't impossible to solve. Rather, the incentives just aren’t there. People would rather build an apparatus for blame-shifting than actually just build a better solution.


Do you think everyone involved is physically present? The gp was absolutely accurate that you guys have no idea how modern healthcare works and this had nothing to do with externally introduced malware.


This sounds a bit like someone just got run over by a truck because the driver couldn’t see them, so people ask why trucks are so big that they’re dangerous, and the response is “you just don’t know how trucks work” rather than “yeah, maybe drivers should be able to see pedestrians”.

If modern medicine is dangerous and fragile because of network connected equipment then that should be fixed even if the way it currently works doesn’t allow it.


This is a completely different discussion. They absolutely should be reliable. The part that is a complete non starter is not being networked because it ignores that telemedicine, pacs integration, and telerobotics exist.

If you don't understand why it has to be networked with extremely bad fallback to paper, then I suggest working in healthcare for a bit before pontificating on how everything should just go back to the stone age.


Networking puts their reliability at risk. As shown here, as shown in the ransomware cases. It is not the first time something like this has happened.

The question is not whether hospitals need internet at all, or whether they should go back to printing things on paper, or whatever - nobody ever said that. The question is whether everything in the hospital should be connected to the internet. Again, the example used was simple: having the computer that processes and exports the data from an MRI machine connected online in order to transfer the data, vs. keeping that computer offline and using a separate computer to transfer the data. This is how we are supposed to transfer similar data at my work, for security reasons. I am not sure why it cannot happen there. If you cannot transfer data through that computer, there could be an emergency backup plan. But then you only need to solve the data-transfer part. Not everything.


even the most secure outbound protection would likely whitelist the CrowdStrike update servers because they'd be considered part of the infrastructure


You don’t print the images an MRI produced, you transmit them to the people who can interpret them, and they are almost never in the same room as the big machine, and sometimes they need to be called up in a different office altogether.


The comment [0] mentioned that they could not get at the MRI outputs at all, even with the radiologist coming on site. Obviously, the software that was processing/exporting the data was running on a computer that was connected online, if it didn't require an internet connection itself. Data transfer can happen from a different computer than the one the data is processed/obtained on. Less convenient, but this is common practice in many other places for security and other reasons.

[0] https://news.ycombinator.com/item?id=41009018


I mean, this is incentivized by current monetization models. Remove the need to go through a payment based aaS infra, and all the libraries to do the data visualization could be running on the MRI dude's PC.

-aaS by definition requires you to open yourself to someone else to let them do the work for you. It doesn't empower you, it empowers them.


Yeah, I suspect -aaS monetisation models are one of the reasons for the current everything-to-the-internet mess. However, such software running on the machine, using a hardware USB key for authentication, is not unheard of either. I wish that decisions on these subjects were made based on the specific needs of the users rather than the finance people of -aaS companies.


Our critical devices were fine. But epic and all of our machines were down. How do you transmit radiology images without epic?


Is that an ironic question? Or a serious one? I fail to detect the presence or absence of irony sometimes online. I just hope that my own healthcare system has some back-up plans for how to do day-to-day operations, like transferring my scan results to a specialist, in case the system they normally use fails.


"It needs to be online."

No, it doesn't.

Some have chosen - for reasons of efficiency and scale and cost - to place it online.

However, this is a trade-off for fragility.

It's not insane to make this trade-off ...

... but it is insane to not realize one is making it.


It seems like you’ve never worked with critical infra. Most of it runs on 6 to 10 year old unpatched versions of Windows…


"It seems like you’ve never worked with critical infra."

My entire career has been spent building, and maintaining, critical infra.[1]

Further, in my volunteer time, I come into contact with medical, dispatch and life-safety systems and equipment built on Windows and my question remains the same:

Why is Windows anywhere near critical infra ?

Just because it is common doesn't mean it's any less shameful and inadequate.

I repeat: We've fully understood these risks and frailties for 25 years.

[1] As a craft, and a passion - not because of "exciting career opportunities in IT".


Is this the rsync.net HN account? If so, lmao @ the comment you replied to.

> As a craft, and a passion

I believe you’ve nailed the core problem. Many people in tech are not in it because they genuinely love it, do it in their off time, and so on. Companies, doubly so. I get it, you have to make money, but IME, there is a WORLD of difference in ability and self-solving ability between those who love this shit, and those who just do it for the money.

What’s worse is that actual fundamental knowledge is being lost. I’ve tried at multiple companies to shift DBs off of RDS / Aurora and onto at the very least, EC2s.

“We don’t have the personnel to support that.”

“Me. I do this at home, for fun. I have a rack. I run ZFS. Literally everything in this RFC, I know how to do.”

“Well, we don’t have anyone else.”

And that’s the damn tragedy. I can count on one hand the number of people I know with a homelab who are doing anything other than storing media. But you try telling people that they should know how to administer Linux before they know how to administer a K8s cluster, and they look at you like you’re an idiot.


The old-school sysadmins who know technology well are still around, but there are increasingly fewer of them while demand skyrockets as our species gives computers an increasing number of responsibilities.

There is tremendous demand for technology that works well and works reliably. Sure, setting up a database running on an EC2 instance is easy. But do you know all of the settings to make the db safe to access? Do you maintain it well, patch it, replicate it, etc? This can all be done by one of the old school sysadmins. But they are rare to find, and not easy to replace. It's hard to judge from the outside, even if you are an expert in the field.

So when the job market doesn't have enough sysadmins/devops engineers available, the cloud offers a good replacement. Even if you as an individual company can solve it by offering more money and having a tougher selection process, this doesn't scale over the entire field, because at that point the total number of available experts becomes the limit.

Aurora is definitely expensive, but there are cheaper alternatives to it. Full disclosure, I'm employed by one of these alternative vendors (Neon). You don't have to use it, but many people do and it makes their life easier. The market is expected to grow a lot. Clouds seem to be one of the ways our industry is standardizing.


I’m not even a sysadmin, I just learned how to do stuff in Gentoo in the early ‘00s. Undoubtedly there are graybeards who will laugh at the ease of tooling that was available to me.

> But do you know all of the settings to make the db safe to access? Do you maintain it well, patch it, replicate it, etc?

Yes, but to be fair, I’m a DBRE (and SRE before that). I’m not advocating that someone without fairly deep knowledge attempt to do this in prod at a company of decent size. But your tiny startup? Absolutely; chuck a default install of Postgres or MySQL onto Debian, and optionally tune 2 – 3 settings (shared_buffers, effective_cache_size, and random_page_cost for Postgres; innodb_buffer_pool_* and sync_array_size for MySQL – the latter isn’t necessary until you have high concurrency, but it also can’t be changed without a restart so may as well). Pick any major backup solution for your DB (Barman for Postgres, XtraBackup for MySQL, etc.), and TEST YOUR BACKUPS. That’s about it. Apply any security patches (or use unattended-upgrades, just be careful) as they’re released, and don’t do anything outside of your distro’s package management. You’ll be fine.
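For anyone wondering what those Postgres knobs look like in practice, a rough illustrative postgresql.conf fragment - the numbers below are just the usual rules of thumb (roughly 25% of RAM for shared_buffers, 50-75% for effective_cache_size, ~1.1 random_page_cost on SSDs), not recommendations for any particular box:

  # postgresql.conf - illustrative values for a 16 GB machine on SSD storage
  shared_buffers = 4GB              # ~25% of RAM
  effective_cache_size = 12GB       # planner hint, not an allocation; ~50-75% of RAM
  random_page_cost = 1.1            # the default of 4.0 assumes spinning disks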

Re: Neon, I’ve not used it, but I’ve read your docs extensively. It’s the most interesting Postgres-aaS product I’ve seen, alongside postgres.ai, but you’re (I think) targeting slightly different audiences. I wish you luck!


> It’s the most interesting Postgres-aaS product I’ve seen, alongside postgres.ai, but you’re (I think) targeting slightly different audiences. I wish you luck!

This is always great feedback to hear, thank you!


Also, a lot of the passionate security people such as myself moved on to other fields, as it has just become bullshit artists sucking on the vendors' teat and filling out risk matrix sheets, with no accountability when their risk assessments invariably turn out to be wrong.


That reminds me, I should check Twitter to see the most recent batch of “cybersecurity experts” take on Crowdstrike. Always a good time.


raises hand you guys hiring? I’ll be proof that there is indeed “anyone else.”


Not saying they're sufficient reasons but ..

1. more Windows programmers than Linux so they're cheaper.

2. more third-party software for e.g. reporting, graphing to integrate with

3. no one got fired for buying Microsoft

4. any PC can run Windows; IT departments like that.


My comment was tongue in cheek, of course it should not be this way but as you know it oftentimes is.


In the past, old versions of Windows were often considered superior because they stopped changing and just kept working. Today, that strategy is breaking down because attackers have a lot more technology available to them: a huge database of exploits, faster computers, IoT botnets, and so on. I suspect we're going to see a shift in the type of operating system hospitals run. It might be Linux or a more hardened version of Windows. Either way, the OS vendor should provide all security infrastructure, not a third party like Crowdstrike, IMHO.


> I suspect we're going to see a shift in the type of operating system hospitals run. It might be Linux or a more hardened version of Windows.

Why? "Hardening" the OS is exactly what Crowdstrike sells and bricked the machines with.

Centralization is the root cause here. There should be no by design way for this to happen. That also rules out Microsoft's auto updates. Only the IT department should be able to brick the hospitals machines.


Hardening is absolutely not what Crowdstrike sells. They essentially sell OS monitoring and anomaly detection. OS hardening involves minimizing the attack surface, usually by minimizing the number of services running and limiting the ability to modify the OS.


Nothing wrong with that. Windows XP-64 supports up to 128GB physical RAM, could be 5 years until that is available on laptops. Windows 7 Pro supports up to 192 GB of RAM. Now if you were to ask me what you would run on those systems with maxed out RAM, I wouldn't know. I also don't think the Excel version that runs on those versions of windows allows partially filled cells for Gantt charts.


>Most of it runs on 6 to 10 year old unpatched versions of Windows…

Well, that's a pretty big problem. I don't know how we ended up in a situation where everybody is okay with the most important software being the most insecure, but the money needed to keep critical infra totally secure is clearly less than the money (and lives!) lost when the infra crashes.


Well, you can use stupid broken software with any OS, not just Windows. Isn't CrowdStrike Falcon available on Linux? Is there any reason they couldn't have introduced a similar bug with similar consequences there?


None. There are a bunch of folks here who clearly haven’t spent a day in enterprise IT proclaiming Linux would’ve saved the day. 30 seconds of research would’ve led them to discover Crowdstrike also runs on Linux and has created similar problems on Linux in the past.


Oh could you link me the source of the claim that all linux clients of crowdstrike went down all at once? I'm very interested to hear it.


No it couldn't. Crowdstrike on Linux uses eBPF and therefore can't cause a kernel panic (which is the fundamental issue here).



It's even better when you get told about the magical superiority of Apple for that...

... Except Apple pretty much pushes you to run such tools just to get reasonable management, let alone things like real-time integrity monitoring of important files (Crowdstrike at $DAYJOB[-1] is how security knew to ask whether it was me or something else that edited the PAM config for sudo on a corporate Mac).


Enterprise Mac always follows the same pattern: users proclaim its superiority while it's off the radar, then it gets McAfee, Carbon Black, Airlock, and a bunch of other garbage tooling installed and runs as poorly as enterprise Windows.

The best corporate dev platform at the moment is WSL2 - most of the activity inside the WSL2 VM isn't monitored by the Windows tooling, so performance is fast. Eventually security will start to mandate agents inside the WSL2 instance, but at the moment most orgs don't.


> Why would Windows systems be anywhere near critical infra ?

This is just a guess, but maybe the client machines are windows. So maybe there are servers connected to phone lines or medical equipment, but the doctors and EMS are looking at the data on windows machines.


> Why would Windows systems be anywhere near critical infra ?

Maybe Heartbleed or the xz-utils debacles convinced them to switch.


Because Windows is accessible, and Linux requires uncommon expertise and a short-term cost that is just not practical for lots of places.

Good luck teaching administrators an entirely new ecosystem; good luck finding off-the-shelf software for Linux.

Bespoke is expensive, expertise is rare, Linux is sadly niche.


No. The problem isn’t expertise — it’s CIOs that started their career in the 1990s and haven’t kept up with the times. I had to explain why we wanted PostgreSQL instead of MS SQL server. I shouldn’t have to have that conversation with an executive that should theoretically be a highly experienced expert. We also have CIOs that have MBAs but not actual background in software. (I happen to have an MBA but I also have 15+ years of development experience.) My point is CIOs generally know “business” and they know how to listen to pitches from “Enterprise” software companies — but they don’t actually have real-world experience using the stuff they’re forcing upon the org.

I recently did a project with a company that wanted to move their app to Azure from AWS — not for any good technical reason but just because “we already use Microsoft everywhere else.”

Completely stupid. S3 and Azure Blob don’t work the same way. MCS and AWS SES also don’t work the same way — but we made the switch not even for reasons of money, but because some Microsoft salesman convinced the CIO that their solution was better. Similar to why many Jira orgs force Bitbucket on developers — they listen to vendors rather than the people that have to use this stuff.


> I had to explain why we wanted PostgreSQL instead of MS SQL server.

Tbf, you are giving up a clustering index in that trade. May or may not matter for your workload, but it’s a remarkably different storage strategy that can result in massive performance differences. But also, you could have the same by shifting to MySQL, sooooo…


That’s so infuriating. But, while the people in your story sound dumb, they still sound way more technically literate than 95% of society. Azure is blue, AWS is followed by OME.

Teach a 60 year old industrial powertrain salesman to use Linux and to redevelop their 20 year old business software for a different platform.

Also explain why it’s worth spending food, house, and truck money on it.

Finally, local IT companies are often incompetent. You get entire towns worth of government and business managed by a handful of complacent, incompetent local IT companies. This is a ridiculously common scenario. It totally sucks, and it’s just how it is.


Are. You. Kidding.

Windows servers are “niche” compared to Linux servers. Command line knowledge is not “uncommon expertise,” it’s imo the bare minimum for working in tech.


Most businesses aren’t working in tech.

I’m not wildly opinionated here, I should clarify. I’d love a more Linux-y world. I’m just saying that a lot of small-medium towns, and small-medium businesses are really just getting by with what they know. And really, Windows can be fine. Usually, however, you get people who don’t understand tech, who can barely use a Windows PC, nevermind Linux, and don’t really have the budget to rebuild their entire tech ecosystem or the knowledge to inform that decision. It sucks, but it’s how it is.

Also, Open Office blows chunks. Business users use Windows. M365 is easy to get going, email is relatively hands-off, deliverability is abstracted. Also, a LOT of business software is Windows exclusive. And that also blows chunks.

I would LOVE a more open source, security minded, bespoke world! It’s just not the way it is right now.


> Why would Windows systems be anywhere near critical infra ?

Why would computers be anywhere near critical infra? This sounds like something that should fail safe - the control system goes down but the thing keeps running. If power goes down, hospitals have generator backups; it seems weird that computers would not be in the same situation.


I mean, not just dollars but lives also, right? Do we have a way to track that?


Yup through electronic medical records... o wait


What's the NASDAQ ticker for lives?


> Hard to imagine how many millions of not billions of dollars this one bad update caused.

I mean, if the problem is that hospitals can't function anymore, money is hardly the biggest problem


[flagged]


Without access to Epic we can't place med orders, look up patient records, discharge patients from the hospital, enter them into our system, really much of anything. Every provider in the emergency department is on their computer placing orders and doing work when not interacting with a patient. Like most hospitals in this country, our entire workflow depends on Epic. We couldn't even run blood tests because the lab was down too.

The STEMI was stabilized, it's more that it was scary to lose every machine in the department at once while intubating a crashing patient. You're flying blind in a lot of ways.


If the computer system was down, and medicine was needed to save a life, would some protocol dictate grabbing the medicine and dealing with the paperwork or consequences later? If protocol didn't allow for it, would staff start breaking protocol to save a life?


You can skip paperwork but what if the patient is allergic to a medicine and you need to check medical records? Or you need to call for a surgeon but VoIP is down? Etc…


My father's coworker died while in the hospital for observation after a few scratches from a car accident, because they were accidentally given medication they were allergic to.

So, yeah. The paperwork can save lives too; not all red tape is bad.

Otherwise you may go to the hospital to pick up your friend and be told to wait for the coroner.


> Surely none of the medical devices needed to treat a heart attack are Windows PCs connected to the internet?

Wouldn't that be nice


I'm guessing they were being treated over the phone as the systems went down. I've been through a similar situation, the person on the phone will give step by step instructions while waiting for an ambulance to arrive.

Sounds like with the systems being down the call would have been cut off which sounds horrible.


No, treating in person. But we can't function as a department without computers. You call cardiology (on another floor) and none of their computers are working to be able to review the patient's records. You could take the EKG printout and run it to them, but we're just telling them lab results from what we can remember before our machines all bluescreened. The lab's computers were down so we can't do blood tests. Nursing staff knows what to do next by looking at the board or their computer. Without that you're just a room full of people shouting things at each other, and definitely can't see the 3-4x patients an hour you're expected to. Doctors and midlevels rely on Epic to place med orders too.


[flagged]


It's against the site guidelines to post like this, and we have to ban accounts that do it repeatedly, so if you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


May I say that starting from "treating a heart attack" and ending up worrying about millions lost in productivity sounds a bit "wrong"?


I just had a ten hour hospital shift from hell, apologies if my writing is lacking. I can't think of a better way to try to measure the scope of the damage caused by this.


Just completed a standing 24 due to this outage. My B-Shift brothers and sisters had to monitor the radios all night for their units to be called for emergencies. I heard every dispatch that went out.

We were back in the 1960s with paper and pen for everything - no updates on the nature of the call, no address information, nothing... find out when you show up and hope the scene is secure. It was wild, as it was coupled with a relatively intense monsoon storm.


Starting with an ER story kind of set up the expectation that you'll be "measuring the scope of the damage" in lives lost, not dollars. Though I guess at large enough scale, they're convertible.

Regardless, thanks for your report; seeing it was very sobering. I hope you can get some rest, and that things will soon return to normalcy.


A tiny bit of thought about your situation IMO should lead anyone to conclude that you just first-hand experienced the fallout of today's nightmare, and then took a step back and realized you were likely one of millions if not billions of other people experiencing the same, and relayed that thought in terms of immediately understandable loss. Someone else might see "wrong" but I saw empathy.


Sorry to hear this! I'm a journalist covering this mess and wondering if we could talk. Am at sarah.needleman@wsj.com


Take care of yourself. You're making the world a better place. You deserve better supportive technology, not this shit show.


Billions in losses means a somewhat worse life for a huge number of people, and potentially much worse healthcare problems down the line; the NHS was affected.


When it comes to measuring the impact to society at scale, dollars is really the only useful common proxy. One can't enumerate every impact this is going to have on the world today -- there's too many.


Bullshit. Absolute bullshit.

I've told my testers for years their efficacy at their jobs would be measured in unnecessary deaths prevented. Nothing less. Exactly this outcome was something I've made unequivocally clear was possible, and came bundled with a cost in lives. Yet the "Management and bean counter types" insist "Oh, nope. Only the greenbacks matter. It's the only measure."

Bull. Shit. If we weren't so obsessed with imaginary value attached to little green strips of paper, maybe we'd have the systems we need so things like this wouldn't happen. You may not be able to enumerate every one, but you damn well can enumerate enough. Y'all just don't want to, because then work starts looking like work.


Why measure only death, as if it is the only terrible thing that can happen to someone?

That doesn’t count serious bodily injury, suffering, people who were victimized, people who had their lives set back for decades due to a missed opportunity, a person who missed the last chance to visit a loved one, etc.

There are uncountable different impacts that happen when you're talking about events on the scale of an economy. Which is why economists use dollars. The proxy isn't useful because it is more important than life; it is useful because the diversity of human experience is innumerable.


I understand your emotion but perhaps people simply don't value human lives.

At least putting a number to life is a genuine attempt, even though it may be distasteful.

The fact is that there already is a number on it, which one can derive entirely descriptively without making moral judgements. Insurance companies and government social security offices already attempt to determine the number.

The number is not infinite or we'd have no cars.


[flagged]


"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

https://news.ycombinator.com/newsguidelines.html

https://news.ycombinator.com/item?id=41005274


Millions lost represents sizeable parts of people's lives they won't get back.


> "Took down our entire emergency department as we were treating a heart attack."

Not questioning that it happened, but this was a boot loop after a content update. So if the computers were off and didn't get the update, and you booted them, they would be fine. And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.

How did it happen that you were rebooting in the middle of treating a heart attack? [Edit: BSOD -> auto reboot]


Beyond the BSOD that happened in this case, in general this is not true with Windows:

> And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.

Windows has been notorious for forcing updates down your throat and rebooting at the least appropriate moments (like during time-sensitive presentations, because that's when you stepped away from the keyboard for 5 minutes to set up the projector). And that's in a private setting. In a corporate setting, the IT department is likely setting up an even more aggressive and less workaround-able reboot schedule.

Things like this are exactly why people hate auto-updates.


Windows Update has nothing to do with it.


But it has created a culture of everything needing to be kept up to date all the time no matter what, and pulling control of those updates out of your own hands into the provider's.


True, especially when a reboot of Windows takes several minutes because it started auto-applying updates!


How do you propose ensuring critical security updates get deployed then?

Especially if an infected machine can attack others?

Users/IT would regularly never update or deploy patches, which has its own consequences. There's no perfect solution; you just choose where to accept the pain.

It’s a lot like herd immunity in vaccines.


> It’s a lot like herd immunity in vaccines.

Yes. But you don't deploy experimental vaccines simultaneously across the entire population all at once. Inoculating an entire country takes months; the logistics incidentally provide protection against unforeseen immediate-term dangerous side effects. Without that delay, well, every now and then you'd kill half the population with a bad vaccine. The equivalent of what's happening now with CrowdStrike.


Windows Update has actually provided sensible control over when and how to apply updates since, I think, Windows 2000 (it was definitely there by Vista). You just need to use it.


It has been degrading since Windows 2000, with Microsoft steadily removing and patching up any clever workarounds people came up with to prevent the system from automatically rebooting. The pinnacle of that, an insult added to injury, was the introduction of "active hours" - a period of, initially, at most 8 or 10 hours, designated as the only time in the day your system would not reboot due to updates. Sucks if your computer isn't an office machine only ever used 9-to-5.


No, it was not degrading - Windows 10 introduced forced updating in home editions because it was judged to be better for the general case (that it got abused later is a separate issue).

The assumption is that "pros" and "enterprise" either know how to use the provided controls or have a WSUS server set up, which takes over all scheduling of updates.


We do not know if the update was a new version of the driver (which can also be updated without a reboot on Windows since... ~17 years ago at least) or if it was data that was hot-reloaded and triggered a latent bug in the driver.


> "Windows has been notorious for forcing updates down your throat"

in the same way cars are notorious for forcing you to run out of gas while you're driving them and leaving you stranded... because you didn't make time to refill them before it became a problem.

> "Things like this is exactly why people hate auto-updates."

And people also hate making time for routine maintenance, and hate getting malware from exploits they didn't patch, and companies hate getting DDoS'd by compromised Windows PCs the owners didn't patch, and companies hate downtime from attackers taking them offline. There isn't an answer which will please everyone.


This isn't really a good faith response. This prevention of functionality during a critical period while forcing an update would be like a modern car refusing to drive during an emergency because a forced over-the-air update paused the ability to drive until it finished.


The parent response wasn't good faith; it was leaning on an emergency in a hospital department caused by CrowdStrike to whine about Microsoft in trollbait style.

> "This prevention of functionality during a critical period while forcing an update would be like if a modern car refused to drive during an emergency"

Machines don't know if there's an emergency going on; if you don't do maintenance, knowing that the thing will fail if you don't, then you're rolling the dice on whether it fails right when you need it. It's akin to not renewing an SSL certificate - you knew it was coming, you didn't deal with it, now it's broken - despite all reasonable arguments that the connection is approximately as safe 1 minute after midnight as it was 1 minute before, if the smartphone app (or whatever) doesn't give you any expired cert override then complaining does nothing. Windows updates are released the same day every month, and have been mandatory for eight years: https://www.forbes.com/sites/amitchowdhry/2015/07/20/windows...

And we all know why - because Windows had a reputation of being horribly insecure, and when Microsoft patched things, nobody installed the patches. So now people have to install the patches. Complaining "I want to do it myself" leads to the very simple reply: you can - why didn't you do it yourself before it caused you a problem?

If you're still stubbornly refusing to install them, refusing to disable them, refusing to move to macOS or Linux, and then complaining that they forced you to update at an inconvenient time, you should expect people to point out how ridiculous (and off-topic) you're being.


(Your user name is wonderful.)

> It's akin to not renewing an SSL certificate.

Your choice of analogies is a good one. I have done SSL type stuff since 1997.

Doesn't matter: I would have to work a few hours very carefully before modifying my web server config. And test it.

I am terrified by the scale of deployment involved in this CrowdStrike update.


But that's the thing: forced updates are not akin to maintenance or certs that expire on an annual basis. I'm not sure where you seem to be getting your "you should expect people to point out how ridiculous you're being" line from. You're the only one I'm seeing arguing this idea.


Disabling forced updates by using proper managed-update features, which have existed for longer than "forced updates" have, is table stakes for IT. In fact, it was considered important and critical before Windows became a major OS in business.


Not putting computers that are in any critical path on a proper maintenance schedule (which, btw, overrides automatic updates on Windows and doesn't require extra licenses!) is the same as willfully ignoring maintenance just because the car didn't punch you in the face every time you needed to top up some fluids.


I agree that it is willfully ignoring maintenance, but I completely disagree with the analogy that it is the same as ignoring a fluid change in a car. A car will break down and may stop working without fluid changes. The same is almost assuredly not true, in most cases, if a Windows, or other, update is ignored. If you disagree, then I'd be happy to review any evidence you have that these updates really are always as critical as you think.


A lot of things that come as "mandatory patches" in IT, not just for Windows, are things that tend to generate recalls - or "sucks to be you, buy new car" in automotive world.

In more professional settings than private small car ownership, you often will both have regular maintenance updates provided and mandates to follow them. Sometimes they are optional because your environment doesn't depend on them, sometimes they are mandatory fixes, sometimes they change from optional to mandatory overnight when previous assumptions no longer apply.

Several years ago, a bit over 100 people (and an uncounted number more, potentially) had their lives endangered because an extra airflow-directing piece of metal was optional, and after the incident it was quickly made mandatory, with hundreds of aircraft being stopped to have the fix applied (which previously was only required for hot locations - climate change really bit it).

Similarly, when you drive your car and it fails to operate, that's just you. When it's a more critical service, you're facing corporate or, in the worst case, governmental questions.


Not OP, but some (most? many?) machines receiving the update crashed with a BSOD. So that's how they could enter the boot loop.


I just realised I had read that, but 4 minutes later and it's too late to delete my comment now; Thanks, yes it makes sense.


Half of the hotel's (Choice) computers were down. We never reboot the computers unless they're not working, working slowly, or there's a Windows update.


A lot of security software updates online, without rebooting.

If said update pushes you into a BSOD, where the automatic watchdog (enabled by default in Windows) reboots the machine... well, there you have a boot loop.


idk, a lot of systems are never meant to be rebooted outside of the update schedule, so they wouldn't have been off in the first place. And if those systems control others, then there is a domino effect.

I can see very well how one computer could have screwed all others. It's really not hard to imagine.


And such software is supposed to hot-patch itself, because you might not have time to take systems offline to deal with an ongoing attack, for example.


What happens when a computer gets rebooted as part of daily practice or because of the update, and then it becomes unusable, and then the treatment team needs to use it hours later?


I dunno, but they'd know about it hours earlier in time to switch to paper, or pull out older computers, or something - in that scenario it wouldn't have happened "as we were treating a heart attack" and they would have had time to prepare.


I work for a diesel truck repair facility and just locked up the doors after a 40 minute day :( .

- lifts won't operate.

- can't disarm the building alarms. (have been blaring nonstop...)

- cranes are all locked in standby/return/err.

- laser aligners are all offline.

- lathe hardware runs but controllers are all down.

- can't email suppliers.

- phones are all down.

- HVAC is also down for some reason (its getting hot in here.)

the police drove by and told us to close up for the day since we don't have 911 either.

alarms for the building are all offline/error so we chained things as best we could (might drive by a few times today.)

we don't know how many orders we have, we don't even know who's on schedule or if we will get paid.


How come lifts and cranes are affected by this?

Are they somehow controlled remotely? or do they need to ping a central server to be able to operate?

I can see how alarms, email and phones are affected but the heavy machinery?

(Clearly not familiar with any of these things so I am genuinely curious)


Lots and lots of heavy machinery uses Windows computers even for local control panels.


But why does it need to be remotely updated? Have there been major innovations in lift technology recently? They still just go up and down, right?

Once such a system is deployed why would it ever need to be updated?


They're probably deployed to a virtualized system to ease maintenance and upkeep.

Updates are partially necessary to ensure you don't end up completely unsupported in the future.

It's been a long time, but I worked IT for an auto supplier. Literally nothing was worse than some old computer crapping out with an old version of Windows and a proprietary driver. Mind you, these weren't mission critical systems, but they did disrupt people's workflows while we were fixing the systems. Think, things like digital measurements or barcode scanners. Everything can be easily done by hand but it's a massive pain.

Most of these systems end up migrated to a local data center and then deployed via a thin client. Far easier to maintain and fix than some box that's been sitting in the corner of a shop collecting dust for 15 years.


Ok but it’s a LIFT. How is Windows even involved? Is it part of the controls?


The real problem is not that it's just a damn lift that shouldn't need full Windows. It's that something as theoretically solved and done as an operating system is not practically so.

An Internet of Lift can be done with <32MB of RAM and a <500MHz single-core CPU. Instead they (for whatever value of "they") put a GLaDOS-class supercomputer on it. That's the absurdity.


An Internet of Lift can be done with <32KB of RAM and <500KHz single core CPU.


You'd be surprised at how entrenched Windows is in the machine automation industry. There are entire control system algorithms implemented and run on realtime Windows; vendors like Beckhoff and ACS only have Windows builds of their control software, which developers extend and build on top of with Visual Studio.


Absolutely correct. I've seen multi-axis machine tools that couldn't even be started, let alone get running properly, if Windows wouldn't start.

Incidentally, on more than one occasion I've not been able to use one of the nearby automatic tellers because of a Windows crash.


Siemens is also very much in on this. Up to about the 90s most of these vendors were running stuff on proprietary software stacks running on proprietary hardware networked using proprietary networks and protocols (an example for a fully proprietary stack like this would be Teleperm). Then in the 90s everyone left their proprietary systems behind and moved to Windows NT. All of these applications are truly "Windows-native" in the sense that their architecture is directly built on all the Windows components. Pretty much impossible to port, I'd wager.


Example of patent: https://patents.google.com/patent/US6983196B2/en

So for maintenance and fault indications. Probably saves some time over someone digging up manuals to check error codes, wherever those may or may not be kept. It could also display things like height and weight.


Perhaps "Windows Embedded" is involved somewhere in the control loop, it is a huge industry but not that well-known to the public;

https://en.wikipedia.org/wiki/Windows_Embedded_Industry

https://en.wikipedia.org/wiki/Windows_IoT


We do ATMs - they run on Windows IoT - before that it was OS/2.


Any info on whether this Crowdstrike Falcon crap is used here?


Fortunately for us not at all although we use it on our desktops - my work laptop had a BSOD on Friday morning, but it recovered.


According to reports the ATMs of some banks also showed the BSOD which surprised me; i wouldn't have thought such "embedded" devices needed any type of "third-party online updates".


Security for a device that can issue cash is kind of important.


It's easier and cheaper (and a lil safer) to run wires to the up/down control lever and have those actuate a valve somewhere than it is to run hydraulic hoses to a lever like in lifts of old, for example.

That said it could also be run by whatever the equivalent of "PLC on an 8bit Microcontroller" is, and not some full embedded Windows system with live online virus protection so yeah, what the hell.


Probably for things like this - https://www.kone.co.uk/new-buildings/advanced-people-flow-so...

There's a lot of value in Internet-of-Things everything, but it comes with its own risks.


I'm having a hard time picturing a multi-story diesel repair shop. Maybe a few floors in a dense area but not so high that a lack of elevators would be show stopping. So I interpret "lift" as the machinery used to raise equipment off the ground for maintenance.


Several elevator controllers automatically switch to the safe mode if they detect a fire or security alarm (which apparently is also happening).


The most basic example is duty cycle monitoring and trouble shooting. You can also do things like digital lock-outs on lifts that need maintenance.

While the lift might not need a dedicated computer, they might be used in an integrated environment. You kick off the alignment or a calibration procedure from the same place that you operate the lift.


how many lifts, and how many floors, with how many people are you imagining? Yes, there's a dumb simple case where there's no need for a computer with an OS, but after the umpteenth car with umpteen floors, when would you put in a computer?

and then there's authentication. how do you want key cards which say who's allowed to use the lift to work without some sort of database which implies some sort of computer with an operating system?


It's a diesel repair shop, not an office building. I'm interpreting "lift" as a device for lifting a vehicle off the ground, not an elevator for getting people to the 12th floor.


> But why does it need to be remotely updated?

Because it can be remotely updated by attackers.


Security patches, assuming it has some network access.


Why would a lift have network access?


Do you see a lot of people driving around applying software updates with diskettes like in the old days?

Have we learned nothing from how the uranium enrichment machines were hacked in Iran? Or how attackers routinely move laterally across the network?

Everything is connected these days. For really good reasons.


Your understanding of Stuxnet is flawed. Iran was attacked by the US government in a very, very specific spearfish attack with years of preparation to get Stux into the enrichment facilities - nothing to do with lifts connected to the network.

Also the facility was air-gapped, so it wasn't connected to ANY outside network. They had to use other means to get Stux onto those computers and then used something like 7 zero-days to move from Windows into the Siemens computers to inflict damage.

Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network.


"Stux got out potentially because someone brought their laptop to work, the malware got into said laptop and moved outside the airgap from a different network."

The lesson here is that even in an air-gapped system the infrastructure should be as proprietary as is possible. If, by design, domestic Windows PCs or USB thumb drives could not interface with any part of the air-gapped system because (a) both hardwares were incompatible at say OSI levels 1, 2 & 3; and (b) software was in every aspect incompatible with respect to their APIs then it wouldn't really matter if by some surreptitious means these commonly-used products entered the plant. Essentially, it would be almost impossible† to get the Trojan onto the plant's hardware.

That said, that requires a lot of extra work. By excluding subsystems and components that are readily available in the external/commercial world means a considerable amount of extra design overhead which would both slow down a project's completion and substantially increase its cost.

What I'm saying is obvious, and no doubt noted by those who have similar intentions to the Iranians. I'd also suggest that individual controllers such as the Siemens ones used by Iran either wouldn't be used or would need to be modified from standard, both in hardware and firmware (hardware mods would further bootstrap protection if an infiltrator knew the firmware had been altered and found a means of restoring the default factory version).

Unfortunately, what Stuxnet has done is to provide an excellent blueprint of how to make enrichment (or any other such) plants (chemical, biological, etc.) essentially impenetrable.

† Of course, that doesn't stop or preclude an insider/spy bypassing such protections. Building in tamper resistance and detection to counter this threat would also add another layer of cost and increase the time needed to get the plant up and running. That of itself could act as a deterrent, but I'd add that in war that doesn't account for much, take Bletchley and Manhattan where money was no object.


I once engineered a highly secure system that used (shielded) audio cables and a modem as the sole pathway to bridge the airgap. Obscure enough for ya?

Transmitted data was hashed on either side, and manually compared. Except for very rare binary updates, the data in/out mostly consisted of text chunks that were small enough to sanity-check by hand inside the gapped environment.


Stux also taught other government actors what's possible with a few zero-days strung together, effectively starting the cyberwar we've been in for years.

Nothing is impenetrable.


You picked a really odd day and thread to say that everything is connected for really good reasons.


Or being online in the first place. Sounds like an unnecessary risk.


Remember those good old fashioned windows that you could roll down manually after driving into a lake?

Yeah, can’t do it now: it’s all electronic.


I’m sure that lifts have been electronically controlled for decades. But why is Windows (the operating system) involved?


but why do they have CS on them? they should simply not be connected to any kind of network.

and if there's some sensor network in the building that should be completely separate from the actual machine controls.


Compliance.

To work with various private data, you need to be accredited and that means an audit to prove you are in compliance with whatever standard you are aspiring to. CS is part of that compliance process.


Which private data would a computer need to operate a lift?


Another department in the corporation is probably accessing PII, so corporate IT installed the security software on every Windows PC. Special cases cost money to manage, so centrally managed PCs are all treated the same.


Anything that touches other systems is a risk and needs to be properly monitored and secured.

I had a lot of reservations about companies installing Crowdstrike but I'm baffled by the lack of security awareness in many comments here. So they do really seem necessary.


It must be security tags on the lift which restrict entry to authorised staff.


who's allowed to use the lift? where do those keycards authenticate to?


Because there's some level of convenience involved with network connectivity for OT.


That sounds...suboptimal.

I would imagine they used specialized controller cards or something like that.


They optimize for small-batch development costs. Slapping a Windows PC on it when you sell a few hundred to a thousand units is actually pretty cheap. The software itself is probably the same order of magnitude, cheaper for the UI itself...


And cheap both short and long term. Microsoft has 10-year lifecycles you don't need to pay extra for. With Linux you need IT staff to upgrade it every 3 years, not to mention hiring engineers to recompile software every 3 years with the distro upgrade.


Ubuntu LTS has support for 5 years, can be extended to 10 years of maintenance/security support with ESM (which is a paid service).

Same with Rocky Linux, but the extra 5 years of maintenance/security support is provided for free.


that's just asking for trouble.


Probably a Windows-based HMI (“human-machine interface”).

I used to build sorting machines that use variants of the typical “industrial” tech stack, and the actual controllers are rarely (but not never!) Windows. But it’s common for the HMI to be a Windows box connected into the rest of the network, as well as any server.


I'm still running multiple pieces of CNC/industrial equipment with Win 3.1/98/XP. Only just retired one running DOS 6.2.


I'm just impressed that the lifts, alarms, cranes, phones, etc all run on Windows somehow.


In a lot of cases you find tangential dependencies on Windows in ways you don't expect. For example a deployment pipeline entirely linux-based deploying to linux-based systems that relies on Active Directory for authentication.


> Active Directory for authentication.

In my experience that'd be 90% of the equipment.

"Oh! It has LDAP integration! We can 'Single Sign On'."


I don't know if "impressed" is the right word..

"Appalled", "bewildered" and "horrified" and also comes to mind..


I'm more confused because I have never, ever encountered a lift that wasn't just some buttons or joysticks on a controller attached to the lift. There is zero need for more computing power than an 8-bit microcontroller from the 1980s. I don't know where I would even buy such a lift with a Windows PC.


No one sells 8 bit microcontrollers from the 1980s anymore. Just because you don't need the full power of modern computing hardware and software doesn't mean you are going to pay extra for custom, less capable options.


wow, why do lifts require an OS?


I think the same question can be asked for why lots of equipment seemingly requires an OS. My take is that these products went through a phase of trying to differentiate themselves from competitors and so added convenience features that were easier to implement with a general purpose computer and some VB script rather than focusing on the simplest most reliable way to implement their required state machines. It's essentially convenience to the implementors at the expense of reliability of the end result.


My life went sideways when the organizations I worked for all started to make products solely for selling and not for using. If the product was useful for something, that was a side effect of being sellable, not the goal.


Worse is Better has eaten the world. The philosophy of building things properly with careful, bespoke, minimalist designs has been totally destroyed by a race to the bottom. Grab it off the shelf, duct tape together a barely-working MVP, and ship it.

Now we are reaping what we sowed.


That's what you get for outsourcing to some generic shop with no domain expertise who implements to a spec for the lowest dollar.


the question is - why do lifts require Windows?


The question is, why do lifts require Crowdstrike?


Some idiot with a college degree in an office nowhere near the place sees that we have these PCs here. Then they go over a compliance list and mandate that this is needed. Now go install it, and the network there...


Or they want to protect their Windows-operated lifts from very real and life threatening events like an attacker jumping from host to host until they are able to lock the lifts and put people lives at risk or cause major inconveniences.

Not all security is done by stupid people. Crowdstrike messed up in many ways. It doesn't make the company that trusted them stupid for what they were trying to achieve.


Crowdstrike is malware and spyware. Trusting one malware to control another is your problem right there. It will always blow up in your face.


Why are the lifts networked or on a network which can route to the internet?

This is a car lift. It really doesn't need a computer to begin with. I've never seen one with a computer. WTF?


For the same reason people want to automate their homes, or the industries run with lots of robots, etc: because it increases productivity. The repair shop could be monitoring for usage, for adequate performance of hydraulics, long-term performance statistics, some 3rd-party gets notified to fix it before it's totally unusable, etc.

I have a friend that is a car mechanic. The amount of automation he works with is fascinating.

Sure, lifts and whatnot should be in a separate network, etc, but even banks and federal agencies screw up network security routinely. Expecting top-tier security posture from repair shops is unrealistic. So yes, they will install a security agent on their Windows machines because it looks like a good idea (it really is) without having the faintest clue about all the implications. C'est la vie.


But what are you automating? It's a car lift, you need to be standing next to it to safely operate it. You can't remotely move it, it's too dangerous. Most of the things which can go wrong with a car lift require a physical inspection and for things like hydraulic pressure you can just put a dial indicator which can be inspected by the user. Heck, you can even put electronic safety interlocks without needing an internet connection.

There are lots of difficult problems when it comes to car repair, but cloud lift monitoring is not something I've ever heard anyone ask for.

The things you're describing are all salesman sales-pitch tactics, they're random shit which sound good if you're trying to sell a product, but they're all stuff nobody actually uses once they have the product.

It's like a six in one shoe horn. It has a screw driver, flash light, ruler, bottle opener, and letter opener. If you're just looking at two numbers and you see regular shoe horn £5, six in one shoe horn £10 then you might blindly think you're getting more for your money. But at the end of the day, I find it highly unlikely you'll ever use it for anything other than to put tight shoes on.


I imagine something monitors how many times the lift has gone up and down for maintenance reasons. Maybe a nice model monitors fluid pressure in the hydraulics to watch for leaks. Perhaps a model watches strain, or balance, to prevent a catastrophic failure. Maybe those are just sensors, but if they can't report their values they shut down for safety's sake. There are all kinds of reasonable scenarios that don't rely on bad people trying to screw or cheat someone.


None of these features require internet or a windows machine, most of them do not require a computer or even a microcontroller. Strain gauges can be useful for checking for an imbalanced load, but they cannot inspect the metal for you.


The question is, why do lifts require internet connection on top of the rest.


In my office, when we swipe our entry cards at the security gates, a screen at the gate tells us which lift to take based on the floor we work on, and sets the lift to go to that floor. It's all connected.


In the context of a diesel repair shop, he likely was referring to fork lifts or vehicle lifts rather than elevators.


This doesn't require an internet, just a LAN.


Remote monitoring and maintenance. Predictive maintenance, monitor certain parameters of operation and get maintenance done before lift stops operating.


It's a car lift. Not only would it be irresponsible to rely on a computer to tell you when you should maintain it, as some inspections can only be done visually, it seems totally pointless as most inspections need to be done manually.

Get a reminder on your calendar to do a thorough inspection once a day/week (whatever is appropriate) and train your employees what to look for every time it's used. At the end of the day, a car lift on locks is not going to fail unless there's a weakness in the metal structure, no computer is going to tell you about this unless there's a really expensive sensor network and I highly doubt any of the car lifts in question have such a sensor network.

Moreover, even if they did have such a sensor network, why are these machines able to call out to the internet?


These requirements can be met by making the lift's systems and data observable, which is a uni-directional flow of information from the lift to the outside world. Making the lift's operation modifiable from the outside world is not required to have it be observable.


I mean... the beginning of mission impossible 1 should tell you.


The same reason everyone just uses a microcontroller on everything. It's like a universal glue and you can develop in the same environment you ship. Makes it easy.


Well, how else is the operator supposed to see outside?


Heh ...


Why do lathes, cranes and laser alignment systems need a new copy of Windows?


Very likely they use a manufacturing execution system like Dassault's DELMIA or Siemens MES.

These systems are intended to allow local control of a factory, or cloud based global control of manufacturing.

They can connect to individual PLCs (Programmable Logic Controllers), which handle the actual equipment.

They connect to a LAN, or to the internet. So they naturally need some form of security.

They could use Windows Server, Red Hat Linux, etc., but they need some form of security. Which is how a controller would be affected.

Usually you can just set them to manual though...


Lathes probably have PCs connected to them to control them, and do CNC stuff (he did say the controllers). Laser alignment machines all have PCs connected to them these days.

The cranes and lifts though... I've never heard of them being networked or controlled by a computer. Usually it's a couple buttons connected to the motors and that's it. But maybe they have some monitoring systems in them?


Off the top of my head, based on limited experience in industrial automation:

- maintenance monitoring data shipping to centralised locations

- computer based HMI system - there might be good old manual control but it might require unreasonable amounts of extra work per work order

- Centralised control system - instead of using panel specific to lift, you might be controlling bunch of tools from common panel

- integration with other tools, starting from things as simple as pulling up manufacturers' service manual to check for details to doing things like automatically raising the lift to position appropriate for work order involving other (possibly also automated) tools with adjustments based on the vehicle you're lifting

There could be more.


CNC machine tools can track use, maintenance, etc via the network. You can also push programs to them for your parts.

They need a new copy of Windows because running an old copy on a network is a worse idea.


This blows my mind because none of this requires windows, or a desktop OS at all.


No, they don't, absolutely. But there are very few companies that succeed without using Windows or another existing OS. The Apple HomePod runs iOS.


Remember that CNC is a programming environment. Now how do you actually see what program is loaded? Or where the execution is at the moment? For anything beyond a few lines of text on a dot-matrix screen, an actual OS starts to become desirable.

And all things considered, Windows is not that bad an option. Anything else would also have issues. And really, what is your other option: some outdated, unmaintained Android? Does your hardware vendor offer long-term support for Linux?

Windows actually offers extremely good long term support quite often.


> And all things considered, Windows is not that bad option

I'm gonna go out on a limb and say that it actually is. It's a closed source OS which includes way more functionality than you need. A purpose-built RTOS running on a microcontroller is going to provide more reliability, and if you don't hook it up to the internet it will be more secure, too. Of course, if you want you can still hook it up to the internet, but at least you're making the conscious decision to do so at that point.

Displaying something on a screen isn't very hard in an embedded environment either.

I have an open source printer which has a display and runs on an STM32. It runs reliably, does its job well, and doesn't whine about updates or install things behind my back because it physically can't: it has no access to the internet (though I could connect it if I desired). A CNC machine is more complex and has more safety considerations, but is still in a similar class of product.

https://youtu.be/FxIUs-pQBjk?si=N-W-Af6jBgGBiIgl&t=46


> Does your hardware vendor offer long term support for Linux?

This seems muddled. If the CNC manufacturer puts Linux on an embedded device to operate the CNC, they're the hardware manufacturer and it's up to them to pick a chip that's likely to work with future Linuxes if they want to be able to update it in the future. Are you asking if the chip manufacturer offers long-term-support for Linux? It's usually the other way around, whether Linux will support the chip. And the answer, generally, is "yes, Linux works on your chip. Oh you're going to use another chip? yes, Linux works on that too". This is not really something to worry about. Unless you're making very strange, esoteric choices, Linux runs on everything.

But that still seems muddled. Long-term support? How long are we talking? Putting an old Linux kernel on an embedded device and just never updating it once it's in the field is totally viable. The Linux kernel itself is extremely backwards compatible, and it's often irrelevant which version you're using in an embedded device. The "firmware upgrades" they're likely to want to do would be in the userspace code anyhow - whatever code is showing data on a display or running a web server you can upload files to or however it works. Any kernel made in the last decade is going to be just fine.

We're not talking about installing Ubuntu and worrying about unsolicited Snap updates. Embedded stuff like this needs a kernel with drivers that can talk to required peripherals (often over protocols that haven't changed in decades), and that can kick off userspace code to provide a UI either on a screen or a web interface. It's just not that demanding.

As such, people get away with putting FreeRTOS on a microcontroller, and that can show a GUI on a screen or a web interface too; you often don't need a "full" OS at all. A full OS can be a liability, since it's difficult to get real-time behaviour, which presumably matters for something like a CNC. You either run a real-time OS, or a regular OS (from which the GUI stuff is easier) which offloads work to additional microcontrollers that do the real-time stuff.

I did not expect Windows to be running on CNCs. I didn't expect it to be running on supermarket checkouts. The existence of this entire class of things pointlessly running self-updating, internet-connected Windows confuses me. I can only assume that there are industries where people think "computer equals Windows" and there just isn't the experience present, for whatever reason, to know that whacking a random Linux kernel on an embedded computer and calling it a day is way easier than whatever hoops you have to jump through to make a desktop OS, let alone Windows, work sensibly in that environment.


5-10 years is not an unreasonable support expectation, I think.

And if you are someone manufacturing physical equipment, be it a CNC machine or a vehicle lift, hiring an entire team to keep Linux patched and make your own releases seems pretty unreasonable and a waste of resources. In the end, nothing you choose is error free. And the box running the software is not the main product.

This is actually a huge challenge: finding a vendor that can deliver you a box to run software on with promised long-term support, where the support is actually more than just a few years.

Also, I don't understand how it is any more acceptable to run unpatched Linux in a networked environment than it is unpatched Windows. These are very often not just stand-alone things, but instead connected to at least a local network if not larger networks, with possible internet connections too. So not patching vulnerabilities is as unacceptable as it would be with Windows.

With CNC there is a place for something like a Windows OS. You have a separate embedded system running the tools, but you still want a different piece managing the "programs", as you could have dozens or hundreds of these. At that point, reading them from the network starts to make sense again. The time of dealing with floppies is over...

And with checkouts, you want more UI than just buttons, and Windows CE has been a reasonably effective tool for that.

Linux is nice on servers, but on the embedded side keeping it secure and up to date is often a massive amount of pain. Windows offers excellent stability and long-term support, and you can simply buy a computer with sufficient support from MS. One could ask why massive companies do not run their own Linux distributions?


> 5-10 years is not unreasonable expected support I think.

A couple of years ago, I helped a small business with an embroidery machine that runs Windows 98. Its physical computer died, and the owner could not find the spare parts. Fortunately, it used a parallel port to control the embroidery hardware, so it was easy to move to a VM with a USB parallel port adapter.


That was very lucky, then. USB parallel port adapters are only intended to work with printers. They fail with any hardware that does custom signalling over the parallel port.


Ok, just make the lift controller analogue. No digital processors at all. Nothing to update, so no updates needed.


Maybe you want your lift to be able to diagnose itself. Tell you about possible faults, instead of spending man-hours troubleshooting every part each time, downtime included. With big lifts there are many parts that could go wrong. Being able to identify which one saves a lot of time, and time is money.

These sorts of outages are actually extremely rare nowadays. Considering how long these control systems have been kept around, they must not actually be causing enough issues to make replacing them worth it.


you log into the machine, download files, load files onto the program. that doesn't need a desktop environment? you want to reimplement half of one, poorly, because that would have avoided this stupid mistake, in exchange for half a dozen potential others, and a worse customer experience?


> you log into the machine, download files, load files onto the program. that doesn't need a desktop environment?

Believe it or not, it doesn't! An embedded device with a form of flash storage and an internet connection to a (hopefully) LAN-only server can do the same thing.

> you want to reimplement half of one, poorly

Who says I would do it poorly? ;)

> and a worse customer experience?

Why would a purpose-built system be a worse customer experience than _windows_? Are you really going to set the bar that low?


and why do they run spyware?


Probably because some fraction of lift manufacturer's customer base has a compliance checklist requiring it.


Because we live deep into the internet of shit era.


How else are you going to update your grocery list while operating the lift?


> we dont have 911 either

Holy cow...

Who on earth requires a Windows-based backend (or whatever else had CrowdStrike, in the shop or outside) for regular (VoIP) phone calls?

This should really lead to some learnings for anyone providing any kind of phone infrastructure.


Or lathes, or cranes, or alarms, or HVAC... what the actual fuck.

The next move should be some artisanal, as-mechanical-as-possible quality products, or at least a Linux(TM)-certified product or similar (or Windows-free(TM)). The opportunity is here, everybody noticed this clusterfuck, and smart folks don't like ignoring threats that are in their face.

But I suppose in 2 weeks some other bombastic news will roll over this and most will forget. But there is always some hope


That’s not it. 911 itself was down.


Oh, great. I guess that counts as phone infrastructure.


what are the brands of these systems?


Oh man, you work with some cool (and dangerous) stuff.

Outage aside, do you feel safe using it while knowing that it accepts updates based on the whims of far away people that you don't know?


I hate to be that person, but things have moved to automatic updates because security was even shittier when the user was expected to do it.

I can't even imagine how much worse ransomware would be if, for example, Windows and browsers weren't updating themselves.


I feel like this is the fake reason given to try to hide the obvious reason: automatic updates are a power move that allows companies to retain control of products they've sold.


It's not a fake reason; it's a very real solution to a very real problem.

Of course companies are going to abuse it for grotesque profit motive, but that doesn't make their necessity a lie.


Yep. And even aside from security, its a nightmare needing to maintain multiple versions of a product. "Oh, our software is crashing? What version do you have? Oh, 4.5. Well, update 4.7 from 2 years ago may fix your problem, but we've also released major versions 5 and 6 since then - no, I'm not trying to upsell you ma'am. We'll pull up the code from that version and see if we can figure out the problem."

Having evergreen software that just keeps itself up to date is marvellous. The Google Docs team only needs to care about the current version of their software. There are no documents saved with an old version. There's no need to backport fixes to old versions, and no QA teams that need to test backported security updates on 10 year old hardware.

It's just a shame about, y'know, the aptly named CrowdStrike.


> The Google Docs team only needs to care about the current version of their software. There are no documents saved with an old version.

There sure are. I have dozens saved years ago.


Fine. But Google can mass-migrate all of them to a new format any time they want. They don’t have the situation you used to have with Word, where you needed to remember to Save As Word 2001 format or whatever so you could open the file on another computer. (And if you forgot, the file was unreadable). It was a huge pain.


Yes it is better than the Word situation, but no it isn't not caring. There do exist old format docs and Google does have to care - to make that migration.


Yes, they have to migrate once. But they don’t need to maintain 8 different versions of Word going back a decade, make sure all security patches get back ported (without breaking anything along the way), and make all of them are in some way cross compatible despite having differing feature sets.

If google makes a new storage format they have to migrate old Google docs. But that’s a once off thing. When migrations happen, documents are only ever moved from old file formats to new file formats. With word, I need to be able to open an old document with the new version of word, make changes then re-save it so it’s compatible with the old version of word again. Then edit it on an old version of word and go back and forth.

I’m sure the Google engineers are very busy. But by making Docs be evergreen software, they have a much easier problem to solve when it comes to this stuff. Nobody uses the version of Google docs from 6 months ago. You can’t. And that simplifies a lot of things.


> Yes, they have to migrate once.

They have to migrate each time they change the format, surely. Either that or maintain converters going back decades, to apply the right one when a document is opened.

> but they don’t need to maintain 8 different versions of Word going back a decade, make sure all security patches get back ported

Nor does Microsoft for Word.

> With word, I need to be able to open an old document with the new version of word, make changes then re-save it so it’s compatible with the old version of word again.

You don't have to, unless you want the benefit of that.

And Google Docs offers the same.

> Nobody uses the version of Google docs from 6 months ago. You can’t. And that simplifies a lot of things.

Well, I'd love to use the version of Gmail web from 6 months ago. Because three months ago Google broke email address input such that it no longer accesses the contacts list and I have to type/paste each address in full.

That's a price we pay for things being "simpler" for a software provider that can and does change the software I am using without telling me, let alone giving me the choice.

Not to mention the change that took away a large chunk of my working screen space for an advert telling me to switch to the app version, despite my having the latest version of Google's own Chrome. An advert I cannot remove despite having got the message 1000 times. Pure extortion. Simplification is no excuse.


It used to be the original reason why automatic updates were accepted and it was valid.

But since then it has been abused for all sorts of things that really are nothing more than consolidation of power, including an entire shift in mentality of what "ownership" even means: tech companies today seem to think it's the standard that they keep effective ownership of a product for its entire life cycle, no matter how much money a customer has paid for it, and no matter how deeply the customer relies on that product.

(Politicians mostly seem fine with that development or even encourage it)

I agree that an average nontechnical person can't be expected to keep track of all the security patches manually to keep their devices secure.

What I would expect would be an easy way to opt-out of automatic updates if you know what you're doing. The fact that many companies go to absurd lengths to stop you from e.g. replacing the firmware or unlocking the bootloader, even if you're the owner of the device is a pretty clear sign to me they are not doing this out of a desire to protect the end-user.

Also, I'm a bit baffled that there is no vetting at all of the contents of updates. A vendor can write absolutely whatever they want into a patch for some product of theirs and arbitrarily change the behaviour of software and devices that belong to other people. As a society, we're just trusting the tech companies to do the right thing.

I think a better system would be if updates would at the very least have to be vetted by an independent third party before being applied and a device would only accept an update if it's signed by the vendor and the third-party.

The third party could then do the following things:

- run tests and check for bugs

- check for malicious and rights-infringing changes deliberately introduced by the vendor (e.g. taking away functionality that was there at time of purchase)

- publicly document the contents of an update, beyond "bug fixes and performance improvements".


What you're describing is what Linux distro maintainers do: Debian maintainers check the changes of different software repos, look at new options and decide if anything should be disabled in the official Debian release, and compile and upload the packages.


The problem you are complaining about here is the weakening of labor and consumer organizations vis-a-vis capital or ownership organizations. The software must be updated frequently due to our lack of skill in writing secure software. Whether all the corporations will take advantage of everything under the sun to reduce the power the purchasers and producers of these products have is a political and legal question. If only the corporations are politically involved, then only they will have their voice heard by the legislatures.


no reason why both can't be true — the security is overall better, and companies are happy to invest in advancing this paradigm because it gives them more control


incentive can and does undermine the stated goal. what if the government decided to take control of everyone's investment portfolio to prevent the market doing bad things? or an airplane manufacturer takes control of its own safety certification process because obviously it's in their best interest that their planes are safe? imposed curfew, everyone has to be inside their homes while it's dark outside because most violent crimes occur at night?


This is for critical infrastructure though. You AT LEAST test it out first on some machines


That may apply to things that need to be online, but... a lathe?


how much lathe-ing have you done recently? did you load files onto your CNC lathe with an SD card, and thus there is a computer, which needs updates, or are you thinking of a lathe that is a motor and a rubber band, and nothing else, from, like, high school woodshop?


I bought a 3d printer years ago then let it sit collecting dust for like 2 or more years because I was intimidated by it. Finally started using it and was blown away how useful it has been to me. Then a long time later realized holy shit there are updates and upgrades one can easily do. I can add a camera and control everything and monitor everything from any online connected device. I always hated pulling out the sd card and bringing it to my computer and copying it over and back to the printer and so on. Being online makes things so much easier and faster. I have been rocking my basic printer for a few years now and have not paid much attention to the scene and then started seeing these multi color prints holy shit am I slow and behind the times. The newer printers are pretty rad but I will give props to my Anycubic Mega it has been a work horse and I have had very little problems. I don't want it to die on me but a newer printer would be cool also.


All fine... until it gets hacked.


And does what? Print something?

There are immense benefits to using modern computing power, including both onboard and remote functionality. The cost of increased software security vulnerability is easily justified.


More like infect something. Your computer.

> The cost of increased software security vulnerability is easily justified.

Sometimes yes, sometimes no.


wouldn't the lathe need to be online to get the OTA update from Crowdstrike?


What a load of horseshit.

1. Nobody auto-updates my Linux machines. They have no malware.

2. It's my job to change the oil in my car. The day Ford starts sending a tech to my house to tamper with my machines "because they need maintenance" will be the day I am no longer a Ford customer.


The irony of this comment is almost perfected by the fact that Ford was one of the leading companies in bringing ECUs (one of the myriad computer systems essential to modern vehicles that can and do receive regular updates) to market in *checks notes* 1975.

https://en.wikipedia.org/wiki/Ford_EEC


Carelessly handled Linux machines* can and do get infected by malware or compromised for data exfil; don't be obtuse.

*Let's not pretend this never happens


Not to mention CVE mitigation.


Those Linux systems that aren't getting updates must be the ones sending Mirai to my Linux systems, which are getting updates (and also Mirai, although it won't run because it's the wrong architecture).

No malware? Only if you have your head in the sand.


I assume that comment was saying that they handle the update process and that their machines don't have any malware on them.

I ignored it because it was somewhat abusive and is missing the problem that automatic updates are trying to solve: that most people, but not all, don't do updates.


yeah, you don't want day to day security (a) changing daily (b) at the kernel level


Wow, this hits close to home. Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009. I added a check on the driver initialization path and didn't annotate the code as non-paged because frankly I didn't know at the time that the Windows kernel was paged. All my kernel development experience up to that point was with Linux, which isn't paged.

BitLocker is a storage driver, so that code turned into a circular dependency. The attempt to page in the code resulted in a call to that not-yet-paged-in code.

The reason I didn't catch it with local testing was because I never tried rebooting with BitLocker enabled on my dev box when I was working on that code. For everyone on the team that did have BitLocker enabled they got the BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
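For readers who haven't done Windows driver work, the mechanism involved looks roughly like the sketch below. This is not the actual BitLocker code; the function name is made up, and it only illustrates how a routine gets marked pageable and how PAGED_CODE() asserts it is safe to take a page fault there. Note that this assertion would not have caught the circular paging dependency described above, which is exactly why the bug only surfaced on reboot with BitLocker enabled.

    #include <ntddk.h>

    NTSTATUS FveInitHelper(VOID);   /* hypothetical name, for illustration only */

    #ifdef ALLOC_PRAGMA
    /* Place the routine in the pageable PAGE section. If a routine like this
       sits on the paging I/O path of a storage driver, paging it back in can
       require the very driver that is still initializing. */
    #pragma alloc_text(PAGE, FveInitHelper)
    #endif

    NTSTATUS FveInitHelper(VOID)
    {
        /* On checked builds this asserts IRQL <= APC_LEVEL, i.e. a page fault
           here would normally be survivable. It does not detect the circular
           paging dependency described above. */
        PAGED_CODE();

        /* ...initialization work that is safe to page out... */
        return STATUS_SUCCESS;
    }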


> without even the most basic level of qualification

That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.
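
For readers curious what the client-side half of a rollout like this can look like, here is a minimal, illustrative C sketch (not tied to any real vendor's implementation): hash a stable device ID into a bucket so each wave of the rollout applies to only a fixed fraction of the fleet, and a bad build hits a small cohort first.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* FNV-1a: any stable hash works; we only need a uniform bucket in [0,100). */
    static uint32_t bucket_for(const char *device_id)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < strlen(device_id); i++) {
            h ^= (uint8_t)device_id[i];
            h *= 16777619u;
        }
        return h % 100;
    }

    /* The server only has to publish the current wave percentage; each client
       can then decide for itself whether the update applies to it yet. */
    static int update_applies(const char *device_id, uint32_t rollout_percent)
    {
        return bucket_for(device_id) < rollout_percent;
    }

    int main(void)
    {
        const char *ids[] = { "host-0001", "host-0002", "host-0003" };
        for (int i = 0; i < 3; i++)
            printf("%s in the 10%% wave? %s\n",
                   ids[i], update_applies(ids[i], 10) ? "yes" : "no");
        return 0;
    }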


Exactly this is what I was missing in the story. Why not have a limited set of users receive it before going live for the whole user base of a mission-critical product like this? That is beyond the comprehension of anyone who has ever come across software bugs (so, billions of people). And that's before we even get to the part about not testing internally well, or at all. Some clusterfuck must have happened there, which is still better than imagining that this is the normal way the organization operates. That would be a very scary vision. Serious rethinking of trusting this organization is due everywhere!


But that would require hiring staff to manage the process, and that is money taken away from sponsoring an F1 racing team.


The funniest part was seeing the Mercedes F1 team's pit crew staring at BSODs at their workstations[1] while wearing CrowdStrike t-shirts. Some jokes just write themselves. Imagine if they lose the race because of their sponsor.

But hey, at least they actually dogfood the products of their sponsors instead of just taking money to shill random stuff.

[1] https://www.thedrive.com/news/crowdstrike-sponsored-mercedes...


Or Windows could be made to stop loading drivers that keep crashing.

Third-party driver/module crashed more than 3 times in a row -> Third-party driver/module is punished and has to be manually re-enabled.
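
As a rough illustration of the proposal (this is not how Windows actually behaves; names and thresholds here are made up), the boot-time check could be as simple as a persisted per-driver crash counter:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_CONSECUTIVE_CRASHES 3

    /* Hypothetical persistent record for one third-party driver. */
    struct driver_record {
        const char *name;
        int consecutive_boot_crashes;  /* incremented if the last boot crashed in this driver */
        bool manually_re_enabled;      /* cleared by an admin after a fix is installed */
    };

    /* Decide at boot whether to load the driver or quarantine it. */
    static bool should_load(struct driver_record *d)
    {
        if (d->manually_re_enabled) {
            d->consecutive_boot_crashes = 0;   /* give it a fresh start */
            d->manually_re_enabled = false;
            return true;
        }
        if (d->consecutive_boot_crashes >= MAX_CONSECUTIVE_CRASHES) {
            printf("quarantining %s after %d crashing boots\n",
                   d->name, d->consecutive_boot_crashes);
            return false;                      /* boot continues without it */
        }
        return true;
    }

    int main(void)
    {
        struct driver_record rec = { "example_edr.sys", 3, false };
        printf("load %s? %s\n", rec.name, should_load(&rec) ? "yes" : "no");
        return 0;
    }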


Because CrowdStrike is an EDR solution, it likely has tamper-proofing features (scheduled tasks, watchdog services, etc.) that re-enable it. These features are designed to prevent malware or manual attackers from disabling it.


These features drive me nuts because they prevent me, the computer owner/admin, from disabling the software. One person thought up techniques like "let's make a scheduled task that sledgehammers back the knobs these 'dumb' users keep turning," and then everyone else decided to copycat that awful practice.


If you're the admin, I would assume you have the ability to disable Crowdstrike. There must be some way to uninstall it, right?


Not if you want to keep the magic green compliance checkbox!


Are you saying that the compliance rule requires the software to be impossible to uninstall? Once it's installed it's impossible to uninstall? No one can uninstall it? I have a hard time believing it's impossible to remove the software. In the extreme case, you could reimage the machine and reinstall Windows without Crowdstrike.

Or are you saying that it is possible to uninstall, but once you do that, you're not in compliance, so while it's technically possible to uninstall, you'll be breaking the rules if you do so?


It's obviously the second option.


The person I originally replied to, rkagerer, said there was some technical measure preventing rkagerer from uninstalling it even though rkagerer has admin on the computer.


I was referring to the difficulty overriding the various techniques certain modern software like this use to trigger automatic updates at times outside admin control.

Disabling a scheduled task is easy, but unfortunately vendors are piling on additional, less obvious hooks. E.g. Dropbox recreates its scheduled task every time you (run? update?) it, and I've seen others that utilize the various autostart registry locations (there are lots of them) and non-obvious executables to perform similar "repair" operations. You wind up in "Deny Access" whack-a-mole, and even that isn't always effective. Uninstalling isn't an option if there's a business need for the software.

The fundamental issue is that their developers / product managers have decided they know better than you. For the many users out there who are clueless about IT this may be accurate, but it's frustrating to me and probably to others who upvoted the original comment.


Is what you're saying relevant in the Crowdstrike case? If you don't want Crowdstrike and you're an admin, I assume there are instructions that allow you to uninstall it. I assume the tamper-resistant features of Crowdstrike won't prevent you from uninstalling it.


I cannot find that comment. Care to link it?



An admin can obviously disable a scheduled task... It's not "impossible" to remove the software, just annoying.


It's not obvious - the owner of the computer sets the rules.


If you're the owner, just turn it off and uninstall.


Doesn't malware do that as well?

But what other malware has been as successful? Crowdstrike can rest easy knowing it's taken down many of the most critical systems in the world.

Oh, no, actually, if Crowdstrike WAS malware, the authors would be in prison.. not running a $90B company.


It does. Several CrowdStrike alerts popped when I was remediating the broken driver on systems.


Wouldn't this be an attack vector? Use some low-hanging bug to bring down an entire security module, allowing you to escalate?


It's currently a DoS by the crashing component, so it has already broken the Availability part of the Confidentiality/Integrity/Availability triad that defines the goals of security.


But a loss of availability is so much more palatable than the others, plus the others often result in manually restricting availability anyway when discovered.


I think the wider societal impact from the loss of availability today - particularly for those in healthcare settings - might suggest this isn't always the case


Availability of a system that can’t ensure data integrity seems equally bad though.


Tell that to the millions of people whose flights were canceled, the surgeries not performed, etc etc.


What is the importance of data integrity? If important pre-op data/instructions are missing or get saved on the wrong patient record, causing botched surgeries; if there are misprescribed post-op medications; if there is huge confusion and delays in critical follow-up surgeries because of a 100% available system that messed up patient data across hospitals nationwide; if there are malpractice lawsuits putting entire hospitals out of business, etc., then is that fallout clearly worth having an available system in the first place?


How does crowdstrike protect against instructions being saved on the wrong patient’s record?


Huh? We're talking about hypotheticals here. You're saying availability is clearly more important than data integrity. I'm saying that if a buggy kernel loadable module allowed systems to keep on running as if nothing was wrong, but actually caused data integrity problems while the system is running, that's just as bad or worse.


Or anyone who owns CrowdStrike shares.


They’d surely have used some kind of Unix if uptime mattered.


Before you get all smug, recognize that Linux has the exact same architecture; it just wasn't impacted - this time.


Too late, I was born smug.

If Linux and Windows have similar architectural flaws, Microsoft must have some massive execution problems. They are getting embarrassed in QA by a bunch of hobbyists, lol.


I'm sure the people who missed their flights because of this disagree.


Or families of those who die.


If you're planning around bugs in security modules, you're better off disabling them - malware routinely uses bugs in drivers to escalate, so the bug you're allowing can make the escalation vector even more powerful, as now it gets to Ring 0 at early loading.


> Wouldn't this be an attack vector?

Isn't DoSing your own OS an attack vector? and a worse one when it's used in critical infrastructure where lives are at stake.

There is a reasonable balance to strike, sometimes it's not a good idea to go to extreme measures to prevent unlikely intrusion vectors due to the non-monetary costs.

See: The optimal amount of fraud is non-zero.


In the absence of a Crowdstrike bug, if an attacker is able to cause Crowdstrike to trigger a bluescreen, I assume the attacker would be able to trigger a bluescreen in some other way. So I don't think this is a good argument for removing the check.


That assumes it's more likely than crowdstrike mass bricking all of these computers... this is the balance, it's not about possibility, it's about probability.


I think we're in agreement. I now realize my previous comment replied to the wrong comment. I meant to reply to Lx1oG-AWb6h_ZG0. Sorry.


Requires state level social engineering.

Might be why North Koreans are trying to get work-from-home jobs.

https://www.businessinsider.com/woman-helped-north-korea-fin...


It does. CrowdStrike forced itself into the boot process. Normal Windows drivers will be disabled automatically if they cause a crash.


I use Explorer Patcher on a Windows 11 machine. It had such a history of crash loops with Explorer that they implemented this circuit-breaker functionality.


It's baffling how fast and wide the blast radius was for this Crowdstrike update. Quite impressive actually, if you think about it - updating billions of systems that quickly.


Certainly living up to the name


Indeed, far more damage caused than any actual malware!


This was my first thought too. I'm not that familiar with the space, but I would think for something this sensitive the rollout would be staggered at least instead of what looks like globally all at the same time.


This is the bit I am still trying to understand. On CrowdStrike you can define how many updates a host is behind, i.e. n (latest), n-1 (one behind), n-2, etc. This update was applied to 'latest' policy hosts and to n-2 hosts alike. To me it appears that there was more to this than just a corrupt update; otherwise how was this policy ignored? Unless the policy doesn't apply to this kind of update as deeply and only covers a small aspect of it, which would also be very concerning.

I guess we won't really know until they release the post mortem...


Yeah, my guess is that they roll out the updates to every client at the same time, and then have the client implement the n-1/2/whatever part locally. That worked great-ish until they pushed a corrupt (empty) update file which crashed the client when it tried to interpret the contents... Not ideal, and obviously there isn't enough internal testing before sending stuff out to actual clients.


But when do you ever get free worldwide advertising showing that everyone uses your product? CrowdStrike sure did, and I'm sure they'll use that to sell it to more people.


That is the right way to do it.


> It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.

As discussed elsewhere, it is claimed that the file causing the crash was a data file that was corrupted in the delivery process. So the development team and their CI probably tested a good version, but the customer received a bad one.

If that is true, the problem is that the driver uses an unsigned file at all, so all customer machines are continuously at risk from local attacks. And then it does not do any integrity check on the data it contains, which is a big no-no for all untrusted data, whether in user space or the kernel.


If the file was signed, wouldn't that have prevented the corrupted transmission file from being loaded?

I assume if the signed file was hacked (or parts were missing), then it wouldn't pass verification.


> And then it does not do any integrity check on the data it contains, which is a big no no for all untrusted data, whether user space or kernel.

To me, this is the inexcusable sin. These updates should be signed and signatures validated before the file is read. Ideally the signing/validating would be handled before distribution so that when this file was corrupted, the validation would have failed here.

But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
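
To make that concrete, here is a minimal, generic C sketch of the kind of fail-closed check being suggested. The real channel-file format is not public, so the header layout and names here are purely illustrative, and a cryptographic signature check would still be needed on top of it.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical header layout for an update/content file. */
    #define CONTENT_MAGIC   "CSCF"
    #define MIN_FILE_SIZE   (sizeof(struct content_header))

    struct content_header {
        char     magic[4];      /* must equal CONTENT_MAGIC */
        uint32_t version;
        uint32_t payload_len;   /* bytes following the header */
    };

    /* Returns 0 only if the blob looks structurally sane; the caller should
       additionally verify a cryptographic signature before trusting it. */
    static int validate_content(const uint8_t *buf, size_t len)
    {
        const struct content_header *hdr = (const struct content_header *)buf;

        if (buf == NULL || len < MIN_FILE_SIZE)
            return -1;                                  /* empty or truncated */
        if (memcmp(hdr->magic, CONTENT_MAGIC, 4) != 0)
            return -1;                                  /* wrong or zeroed header */
        if (hdr->payload_len != len - sizeof(*hdr))
            return -1;                                  /* length field lies */
        return 0;
    }

    int main(void)
    {
        uint8_t all_zero[64] = {0};                     /* a "file full of nulls" */
        printf("all-zero file accepted? %s\n",
               validate_content(all_zero, sizeof all_zero) == 0 ? "yes" : "no");
        return 0;
    }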


Still a staggered roll-out would have reduced the impact.


https://news.ycombinator.com/item?id=41006104#41006555

the flawed data was added in a post-processing step of the configuration update, which is after it's been tested internally but before it's copied to their update servers

per a new/green account


“And so that’s why we recommend using phased rollouts” -Every DevOps engineer from now on


“But that costs us money and time” - some suit.


"And they promise fast threat mitigation... Let allow them to take over EVERYTHING! With remote access, of course. Some form of overwatch of what they in/out by our staff ? Meh... And it even allow us to do cuts in headcount and infra by $<digits_here> a year."


So have we decided to stop using checksums or something?


Perhaps it was the checksum/signature process!


Ya gotta keep checksumming until you find a fixed point.


when something is changed, we usually re-test. that's the whole point of testing anyway. :)


> I didn't know at the time that the Windows kernel was paged.

At uni I had a professor in database systems, who did not like written exams, but mostly did oral exams. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. So in my explanation I made the difference for kernel space and user space. I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never does paging for its own memory. I did not argue on that and passed the exam with the best grade. Did not check that book again to verify my claim. I have never done any kernel space development even vaguely close to memory management, so still today I don't know the exact details.

However, what strikes me here: when that exam happened, in 1985-ish, the NT kernel did not exist yet, I believe. However, IIRC a significant part of the DEC VMS kernel team went to Microsoft to work on the NT kernel. So the concept of paging (a part of) kernel memory went with them? Whether VMS --> WNT (every letter increased by one) is just a coincidence or intentionally the next baby of those developers I have never understood. As Linux has shown us, today much bigger systems can be successfully handled without the extra complication of paging kernel memory. Whether it's a good idea I don't know; at least it's not a necessary one.


If you want to hear the history of [DEC/VMS] NT from the horse's mouth:

https://www.youtube.com/watch?v=xi1Lq79mLeE


Oh oh, 3 hours 10. I watched around half of it.

The VMS --> WNT acronym relationship was not mentioned, maybe it was just made up later.

One thing I did not know (or maybe did not remember) is that NT was originally developed exclusively for the Intel i860, one of Intel's attempts to do RISC. Of course in the late 1980s CISC seemed doomed and everyone was moving to RISC. The code name of the i860 was N10. So that might well be the inside origin of NT, with the marketing name New Technology retrofitted only later.


Here's a direct link:

https://youtu.be/xi1Lq79mLeE?t=4314

"New Technology", if you want to search the transcript. Per Dave, marketing did not want to use "NT" for "New Technology" because they thought no one would buy new technology.


Actually, it was not only x86 hardware that was not originally planned for the NT kernel; the Windows user space was not the first candidate either. POSIX and maybe even OS/2 were earlier goals.

So the current x86 Windows monoculture came about as an accident because the strategically planned alternatives did not materialize. The user-space change should finally debunk the theory that VMS advancing into WNT was a secret plot by the engineers involved. It was probably a coincidence discovered after the fact.


https://www.usenix.org/system/files/1311_05-08_mickens.pdf

"Perhaps the worst thing about being a systems person is that other, non-systems people think that they understand the daily tragedies that compose your life. For example, a few weeks ago, I was debugging a new network file system that my research group created. The bug was inside a kernel-mode component, so my machines were crashing in spectacular and vindic- tive ways. After a few days of manually rebooting servers, I had transformed into a shambling, broken man, kind of like a computer scientist version of Saddam Hussein when he was pulled from his bunker, all scraggly beard and dead eyes and florid, nonsensical ramblings about semi-imagined enemies. As I paced the hallways, muttering Nixonian rants about my code, one of my colleagues from the HCI group asked me what my problem was. I described the bug, which involved concur- rent threads and corrupted state and asynchronous message delivery across multiple machines, and my coworker said, “Yeah, that sounds bad. Have you checked the log files for errors?” I said, “Indeed, I would do that if I hadn’t broken every component that a logging system needs to log data. I have a network file system, and I have broken the network, and I have broken the file system, and my machines crash when I make eye contact with them. I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood.”


Ah, the joys of trying to come up with creative ways to get feedback from your code when literally nothing is available. Can I make the beeper beep in morse code? Can I just put a variable delay in the code and time it with a stopwatch to know which value was returned from that function? Ughh.


Some of us have worked on embedded systems or board bringup. Scope and logic analyzer ... Serial port a luxury.

IIRC Windows has good support for debugging device drivers via the serial port. Overall the tooling for dealing with device drivers in windows is not bad including some special purpose static analysis tool and some pretty good testing.


This is why power users want that standard old two digit '7 segment' display to show off that ONE hex code the BIOS writes to at various steps...

When stuff breaks, not if, WHEN it breaks, this at least gives a fighting chance at isolating the issue.


Yeah. Been there, done that. Write to an unused address decode to trigger the logic analyzer when I got to a specific point in the code, so I could scroll back through the address bus and figure out what the program counter had done for me to get to that piece of code.


Old school guys at my first job could send the contents of the program counter to the speaker, and diagnose problems by the sound of it.

Definitely Old School Cool


I call this "throwing dye in the water".


I certainly used beeping for debugging more than once! : - )


Quoting James Mickens is always the winning move. I recommend the entire collection of his wisdom, https://mickens.seas.harvard.edu/wisdom-james-mickens


James Mickens’s Monitorama 2014 presentation had me laughing to the point of tears. “Look a word cloud!”

Title: "Computers are a Sadness, I am the Cure" https://vimeo.com/95066828


Say "word count" one more time!


Somebody get this man a serial port, or maybe a PC Speaker to Morse out diagnostics signals.


That's beautiful.


This is an interesting piece of creative writing, but virtual machines already existed in 2013. There are very few reasons to experiment on your dev machine.


OS / driver development needs to be done on bare metal sometimes.


At the time, Mickens worked at Microsoft Research, and with the Windows kernel development team. There may only be a few reasons to experiment on your dev machine, but that's one environment where they have those reasons.


Sometimes you have to debug on a real machine. When you do, you'd usually use a serial port for your debug output. Everything has one.


>Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009.

Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.


Win8. I've been seeing your blog posts show up here and there on HN over the years, so I was half expecting you to pick up on my self-doxx. I'll ping you offline.


"It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification."

It was my understanding that MS now signs 3rd-party kernel-mode code, with quality requirements. In which case, why did they fail to prevent this?


Drivers have had to be signed forever and pass pretty rigorous test suites and static analysis.

The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...


There’s a design problem here if the driver can’t be self-contained in such a way that it’s possible to roll back the kernel to a known good state.


How so? Preventing roll-backs on software updates is a "security feature" in most cases for better and for worse. Yeah, it would be convenient for tinkerers or in rare events such as these, but would be a security issue in the 99,9..99% of the time for enterprise users where security is the main concern.


I don't really understand this, many Linux distributions like Universal Blue advertise rollbacks as a feature. How is preventing a roll-back a "security feature"?


Imagine a driver has an exploitable vulnerability that is fixed in an update. If an attacker can force a rollback to the vulnerable older version, then the system is still vulnerable. Disallowing rollbacks fixes this.
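
A minimal sketch of the idea (illustrative only, not any vendor's actual mechanism): the accepted-version floor only ever ratchets upward, so replaying an old, validly signed build is rejected.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative anti-rollback gate: the lowest version we will ever accept
       only ratchets upward, so an attacker cannot reinstall a known-vulnerable
       build just by replaying an old, validly signed update. */
    static uint32_t minimum_allowed_version = 42;   /* persisted in secure storage in practice */

    static int accept_update(uint32_t candidate_version)
    {
        if (candidate_version < minimum_allowed_version)
            return 0;                                /* rollback attempt: reject */
        minimum_allowed_version = candidate_version; /* ratchet forward */
        return 1;
    }

    int main(void)
    {
        printf("install v41? %s\n", accept_update(41) ? "yes" : "no");  /* no  */
        printf("install v43? %s\n", accept_update(43) ? "yes" : "no");  /* yes */
        printf("install v42? %s\n", accept_update(42) ? "yes" : "no");  /* no  */
        return 0;
    }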


ohh


> Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...

    "What are you complaining about? It works on my machine."™


> In which case why did they fail to prevent this?

"Oh, crowdstrike? Yeah, yeah, here's that Winodws kernel code signing key you paid for."


You can pay for it and sign a file full of null characters. Signing has nothing to do with quality from what I understand.


"Yours sincerely,

Crowdstrike

---

PS - If you get hit by some massive crash, we refer you to our company's name. What were you expecting?"


[flagged]


Please explain this comment. How is the Crowdstrike incident related to the Key Bridge collision?


I think he's implying there was some sort of conspiracy by foreign actors.


This is what I don't get; it's extremely hard for me to believe this didn't get caught in CI when things started blue screening. Everywhere I've worked, test rebooting/power-cycling was part of CI, with various hardware configs. That was all before even our lighthouse customers saw it.


What makes you think they have CI after what happened?


Apparently the flaw was added to the config file in post-processing after it had completed testing. So they thought they had testing, but actually didn't.


Disgruntled employee trying to use Crowd Strike to start a General Strike?


I was thinking: this doesn't seem like a case where only machines on some old or specific version of Windows are having issues, such that QA just missed one particular variant in their smoke testing. It seems like it's every Windows instance with that software, so either they don't have basic automated testing, or someone pushed this outside of the normal process.


> Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

Up the chain to automated test machines, right?


You would think automated tests would come before your teammates' workstations / commit to head.


Did I mention this was 15 years ago? Software development back then looked very different than it does now, especially in Wincore. There was none of this "Cloud-native development" stuff that we all know and love today. GitHub was just about 1 year old. Jenkins wouldn't be a thing for another 2 years.

In this case the "automated test" flipped all kinds of configuration options with repeated reboots of a physical workstation. It took hours to run the tests, and your workstation would be constantly rebooting, so you wouldn't be accomplishing anything else for the rest of the day. It was faster and cheaper to require 8 devs to rollback to yesterday's build maybe once every couple of quarters than to snarl the whole development process with that.

The tests still ran, but they were owned and run by a dedicated test engineer prior to merging the branch up.


Jenkins was called Hudson from 2005 until 2011, and version control is much, much older.

I'm surprised you didn't have two or more workstations.


Sorry, the comment wasn't meant to be a personal judgement on you.


I'm completely ignorant on the topic but isn't rebooting a default test for kernel code, given how sensitive it is?


Oh I rebooted, I just didn't happen to have the right configuration options to invoke the failure when I rebooted. Not every dev workstation was bluescreening, just the ones with the particular feature enabled.


But as someone already pointed out, the issue was seen on all kinds of windows hosts. Not just the ones running a specific version, specific update etc.


That sounds like it was caught by luck, unless there was some test explicitly with that configuration in the QA process?


A lot of QA, especially at the system level, is just luck. That’s why it’s so important to dogfood internally imho.

And by internally I don’t just mean the development team, but anyone and everyone at the company who is allowed to have access to early builds.


There's "something that requires highly specific conditions managed to slip past QA" and then there's "our update brought down literally everyone using the software". This isn't a matter of bad luck.


Maybe thru luck, they're gonna uncover another xz utils backdoor MS version, but its probably gonna get covered up because, Microsoft


What does this mean?

Windows kernel paged, linux non paged?


The memory used by the Windows kernel is either Paged or Non-Paged. Non-Paged means pinning the memory in physical RAM. Paged means it might be swapped out to disk and paged back in when needed. OP was working on BitLocker, a file system driver, which handles disk IO. It must be pinned in physical RAM to be available at all times; otherwise, if it's paged out, an incoming IO request would find the driver code missing in memory and try to page in the driver code, which triggers another IO request, creating an infinite loop. The Windows kernel usually would crash at that point to prevent a runaway system, stopping at the point of failure to let you fix the problem.
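
To put the distinction in code, here is a hedged WDK-style sketch (illustrative only, not from any real driver) of allocating from the two pools; anything touched on the paging I/O path, or at DISPATCH_LEVEL and above, has to come from the non-paged pool.

    #include <ntddk.h>

    #define DEMO_TAG 'omeD'   /* arbitrary four-character pool tag */

    VOID PoolDemo(VOID)
    {
        /* Pinned in physical RAM; safe to touch at any IRQL and on the paging path. */
        PVOID hot = ExAllocatePool2(POOL_FLAG_NON_PAGED, 4096, DEMO_TAG);

        /* May be written out to the pagefile under memory pressure; touching it
           at DISPATCH_LEVEL or above (or from the paging path) can bugcheck. */
        PVOID cold = ExAllocatePool2(POOL_FLAG_PAGED, 4096, DEMO_TAG);

        if (hot)  ExFreePoolWithTag(hot, DEMO_TAG);
        if (cold) ExFreePoolWithTag(cold, DEMO_TAG);
    }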


Thank you!


Linux is a bit unusual in that kernel memory is generally physically mapped, and unless you use vmalloc any memory you allocate has to correspond to pages backed by RAM. This also ties into how file IO happens, how swapping works, and how Linux's approach to IO is actually closer to Multics and OS/400 than OG Unix.

Many other systems instead default to using the full power of virtual memory, including swapping kernel space to disk, with only the things that explicitly need to be kept in RAM being allocated from "non-paged" or "wired" memory.

EDIT: fixed spelling thanks to writing on phone.
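
A small illustrative Linux kernel-module sketch of the two common allocators; in both cases the memory stays resident, since Linux does not page kernel allocations out to disk.

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/slab.h>      /* kmalloc/kfree */
    #include <linux/vmalloc.h>   /* vmalloc/vfree */

    static void *small_buf;
    static void *big_buf;

    static int __init memdemo_init(void)
    {
        /* Physically contiguous, never swapped; suitable for DMA and hot paths. */
        small_buf = kmalloc(4096, GFP_KERNEL);

        /* Virtually contiguous only, built from scattered pages, still resident:
           Linux does not page kernel allocations out to disk. */
        big_buf = vmalloc(16 * 1024 * 1024);

        if (!small_buf || !big_buf) {
            kfree(small_buf);
            vfree(big_buf);
            return -ENOMEM;
        }
        return 0;
    }

    static void __exit memdemo_exit(void)
    {
        kfree(small_buf);
        vfree(big_buf);
    }

    module_init(memdemo_init);
    module_exit(memdemo_exit);
    MODULE_LICENSE("GPL");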


Linux kernel memory isn’t paged out to disk, while Windows kernel memory can be: https://knowledge.broadcom.com/external/article/32146/third-...


Has that changed? I remember always creating a swap partition that was meant to be at least the size of RAM


I do not mean this to be blamey in any way shape or form and am asking only about the process:

Shouldn’t that have been caught in code review?


My manager actually blamed the more senior developer who reviewed my code for that one.


Must have been DNS... when they did the deployment run and the necessary code was pulled and the DNS failed and then the wrong code got compiled...</sarcasm>

That they don't even do staged/A-B pushes was also <mind-blown-away>.

But the most... ironic part was: https://www.theregister.com/2024/07/18/security_review_failu...


So the key test, the test that was not run, was to turn the machine off and on again? Classic windows.


A Canonical guy mentioned this as their sales strategy a few years ago, after a particularly nasty Windows outage:

We don't ask customers to switch all systems from Windows to Ubuntu, but to consider moving maybe a third to Ubuntu so they won't sit completely helpless next time Windows fails spectacularly.

While I see more and more Ubuntu systems, and recently have even spotted Landscape in the wild, I don't think they were as successful as they hoped with that strategy.

That said, maybe there is a silver lining on today's clouds, both WRT Ubuntu and Linux in general, and WRT IT departments pausing to reconsider some security best practices.


Except further up this thread another poster mentions that CrowdStrike took down their Debian servers back in April as well. As soon as you're injecting third-party software with self-triggered updates into your critical path, you're vulnerable to the quality (or lack thereof) of that software, regardless of platform.

Honestly your comment highlights one of the few defenses... don't sit all on one platform.


Sure, but note the sales pitch was to encourage resiliency through diversity. While that may not be helpful in cases where one vendor may push the same breaking change through to multiple platforms, it also may be helpful. I remember doing some work with a mathematics package under Solaris while in university, while my peers were using the same package under Windows. Both had the same issue, but the behaviour was different. Under Solaris, it was possible to diagnose since the application crashed with useful diagnostic information. Under Windows, it was impossible to diagnose since it took out the operating system and (because of that) it was unable to provide diagnostic information. (It's worth noting that I've seen the opposite happen as well, so this isn't meant to belittle Windows.)


Yes, I already heard one manager at my company today say they're getting a mac for their next computer. That's great, the whole management team shouldn't be on Windows. The engineering team is already pretty diversified between mac, windows, and linux. The next one might take down all 3 but at least we tried to diversify the risk.


Yep, these episodes are the banana monoculture [0] applied to IT. The solution isn't to use this vendor or avoid that vendor, it's to diversify your systems such that you can have partial operability even if one major component is down.

[0] https://en.m.wikipedia.org/wiki/Gros_Michel_banana


> don't sit all on one platform.

Debian has automatic updates but they can be manual as well. That's not the case in Windows.

The best practice for security-critical infrastructure in which people's lives are at stake is to install some version of BSD stripped down to its bare minimum. But then the company has to pay for much more expensive admins. Windows admins are much cheaper and more plentiful.

Also, as a user of Ubuntu and Debian for more than a decade, I have a hunch that this will not happen in India [1].

[1] https://news.itsfoss.com/indian-govt-linux-windows/


Windows updates can definitely be manual. And anyway, this was not a Windows update. It was a CrowdStrike update.


Oh, I thought it was tied to OS updates. So Windows is not to blame, if that's the case.


Well, in another sense, Windows is certainly partially to blame. Several technical solutions have been put forward, here and in other places, that would've at least limited the blast radius of a faulty update/driver/critical path. Windows didn't implement any of those. Presumably by choice and for good reasons: a tradeoff would be that software like CrowdStrike is more limited in protecting you. So the Windows devs deliberately opted for this risk.

Or they never considered it, which is far worse.


Hopefully they won't botch the update for two operating systems at the same time. But yeah. Hope.


Yeah, I see a lot of noise on social media blaming this on Microsoft/Windows... but AFAIK if you install a bad kernel driver into any major OS the result would be the same.

The specifics of this CrowdStrike kernel driver (which AFAIK is intended to intercept and log/deny syscalls depending on threat assessment?) mean that this is badnewsbears no matter which platform you're on.

Like sure, if an OS is vulnerable to kernel panics from code in userland, that's on the OS vendor, but this level of danger is intrinsic to kernel drivers!


> AFAIK if you install a bad kernel driver into any major OS the result would be the same

Updates should not be destructive. Linux doesn't typically overwrite previous kernels, and bootloaders let users choose a kernel during startup.

Furthermore, an immutable OS makes rollback trivial for the entire system, not just the kernel (reboot, select previous configuration).

I hope organizations learn from this, and we move to that model for all major OSes.

Immutability is great, as we know from functional programming. Nix and Guix are pushing these ideas forward, and other OSes should borrow them.


It's interesting to me that lay people are asking the right questions, but many in the industry, such as the parent here, seem to just accept the status quo. If you want to be part of the solution, you have to admit there is a problem.


True; except here's what's baffling:

CrowdStrike only uses a kernel level driver on Windows. It's not necessary for Mac, it's not necessary for Linux.

Why did they feel that they needed kernel level interventions on Windows devices specifically? Windows may have some blame there.


Apple deprecated kernel extensions with 10.15 in order to improve reliability and eventually added a requirement that end users must disable SIP in order to install kexts. Security vendors moved to leverage the endpoint security framework and related APIs.

On Linux, ebpf provides an alternative, and I assume, plenty of advantages over trying to maintain kernel level extensions.

I haven’t researched, but my guess is that Microsoft hasn’t produced a suitable alternative for Windows security vendors.


> Why did they feel that they needed kernel level interventions on Windows devices specifically?

Maybe because everyone else in "security" and DRM does it, so they figured this is how it's done and they should do it too?

My prior on competence of "cybersecurity" companies is very, very low.


> My prior on competence of "cybersecurity" companies is very, very low.

Dmitri Alperovitch agrees with you.[0] He went on record a few months back in a podcast, and said that some of the most atrocious code he has ever seen was in security products.

I am certain he was implicitly referring, at least in part, to some of the code seen inside his past company's own code base.

0: https://nationalsecurity.gmu.edu/dmitri-alperovitch/ ["Co-founder and former CTO of Crowdstrike"]


> Maybe because everyone else in "security" and DRM does it, so they figured this is how it's done and they should do it too?

What DRM uses kernel drivers? And how do you plan to prevent malware from usermode?


> CrowdStrike ONLY uses a kernel level driver on Windows

Crowdstrike uses a kernel level driver ONLY on Windows.


CrowdStrike uses a kernel level driver on Windows ONLY.

Even better..

ONLY on Windows does CrowdStrike use a kernel level driver.


Yeah, I think your point is totally valid. Why does CrowdStrike need syscall access on Windows when it doesn't need it elsewhere?

I do think there's an argument to be made that CrowdStrike is more invasive on Windows because Windows is intrinsically less secure. If this is true then yeah, MSFT has blame to share here.


I don't know about MacOS, but at least as recently as a couple years ago crowdstrike did ship a Linux kernel module. People were always complaining about the fact that it advertised the licensing as GPL and refused to distribute source.

I imagine they've simply moved to eBPF if they're not shipping the kernel module anymore.


I haven't looked too deeply into how EDRs are implemented on Linux and macOS, but I'd wager that CrowdStrike goes the way of its own bit of code in kernel space to overcome shortcomings in how ETW telemetry works. It was never meant for security applications; ETW's purpose was to aid in software diagnostics.

In particular, while it looks like macOS's Endpoint Security API[0] and Linux 4.x's inclusion of eBPF are both reasonably robust (if the literature I'm skimming is to be believed), ETW is still pretty susceptible to blinding attacks.

(But what about PatchGuard? Well, as it turns out, that doesn't seem to keep someone from loading their own driver and monkey patching whatever WMI_LOGGER_CONTEXT structures they can find in order to call ControlTraceW() with ControlCode = EVENT_TRACE_CONTROL_STOP against them.)

0: https://developer.apple.com/documentation/endpointsecurity


Non hardware "drivers" which cause a BSOD should be disabled automatically on next boot.

Windows offers it's users nothing here.


You can also make rollback easy. Just load the config before the one where you took the bad update.

Of course that means putting the user in control of when they apply updates, but maybe that would be a good thing anyway.


Linux and open source also have the potential to be far more modular than Windows is. At the moment we have airport display boards running a full windows stack including anti-virus/spyware/audit etc, just to display a table ... madness


I'm a Kubuntu user who, seemingly due to Canonical's decision to ship untested software regularly, has been repeatedly hit by problems with snaps; initially these were basic, obvious, and widespread issues with major software.

Yes, distribute your eggs, but check the handles on the baskets being sold to you by the guy pointing out bad handles.


FWIW, while some people like Kubuntu, I have had much better results with KDE Neon.

Stable Ubuntu core under the surface, and everything desktop related delivered by the KDE team.


Thanks for the tip, I'm looking to jump ship to MX-Linux, just procrastinating the move right now.


Still haven't forgiven Ubuntu for pushing a bad kernel of their own that caused a boot loop if you used containers...


I’ll never forgive them for the spyware they defaulted to on in their desktop stuff. It wasn’t the worst thing in the world, but they’re also the only major distro to ever do it, so Ubuntu (and Canonical as a whole) can get fucked, imo.


[flagged]


That's a long grudge to hold over a feature that was reconsidered and removed.


Maybe, but Canonical didn't learn and are back to pushing advertising and forcing unwanted changes.


To say it rather politely, the mindset exposed by introducing this feature is unlikely to go away.


As shown by Mozilla.


i started with RH (Non-EL) back in the mid-to-late 90s, and switched to gentoo as soon as one of my best (programmer) friends gushed about how much better of an admin it had made them[0], so i started down that path - by the time AWS appeared, we were both automating everything, using build (pump) servers, etc. I like debian, a lot - really! I think apt is about the best non-technical-user package manager, and the packages that were available without having to futz with keyrings was great.

Ubuntu spent a lot of time, talent, and treasure on trying to migrate people off windows instead of being a consistent, great OS. It is still with great dread that i open docs for some new package/program linked to from HN or elsewhere; dread that the first instruction is "start with ubuntu 18.04|20.04".

[0] They actually maintained the unofficial gentoo AWS images for over a decade. unsure if they still do, it could be automated to run a new build off every quarter. https://github.com/genewitch/gentoo/blob/master/gentoo_auto.... (a really old version of the script i keep to remind me that automation is possible with nearly everything...)


canonical has some of the most ridiculous IT job postings i’ve come across. just sounds like a bananas software shop. didn’t give me much confidence in whatever they cooking up in there


Not really.


Sure but if that Canonical sales person was successful in that, I'd almost guarantee that after they switched the first third they'd be in there arguing to switch out the rest.


Absolutely.

I'm just saying what they said their strategy was, not judging their sales people.


Many years ago an Ubuntu tech sales guy demoed their (OpenStack?) self-hosted cloud offering; his laptop was running Windows...


Canonical in particular are no better, they do the exact same thing with that aberration called snap. They have brought entire clusters down before with automatic updates.


Seems like a reasonable strategy. Not just Ubuntu but some redundancy in some systems.


Ubuntu has unattended-upgrades enabled by default


Yes, but by default the only repo enabled for it is $(cat /etc/os-release)-security.


But CrowdStrike is security as well?


Yes, but it's not included in the upstream Ubuntu security repository. In fact, it's not available via any repository AFAIK. It updates itself via fetching new versions from CrowdStrike backend according to your update policy for the host in question. However, as we've learned the past days, that policy does not apply to the "update channel" files...


Things are so interdependent that in this scenario you might just end up with a crashed system whenever either Windows or Ubuntu goes down, instead of only the one you chose.


Read on Mastodon: https://infosec.exchange/@littlealex/112813425122476301

The CEO of Crowdstrike, George Kurtz, was the CTO of McAfee back in 2010 when it sent out a bad update and caused similar issues worldwide.

If at first you don't succeed, .... ;-) j/k


If anything, this just shows how short-term our memory is. I imagine crowdstrike stock will be back to where it was by the end of next week.

I bet they don't even lose a meaningful amount of customers. Switching costs are too high.

A real shame, and a good reminder that we don't own the things we think we own.


> this just shows how short-term our memory is.

I've been out of IT proper for a while, so to me, I had to ask "the Russiagate guys are selling AV software now?"


I don't partake in the stock market these days, but this is the kind of event that you can make good money betting the price will come back up.

When a company makes major headlines for bad news like this, investors almost always overreact and drive the price too far down.


I dunno. The stock price will probably dead cat bounce, but this is the sort of thing that causes companies to spiral eventually.

They just made thousands of IT people physically visit machines to fix them. Then all the other IT people watched that happen globally. CTOs got angry emails from other C-levels and VPs. Real money was lost. Nobody is recommending this company for a while.

It may put a dent in Microsoft as splash damage.


>It may put a dent in Microsoft as splash damage.

I have a feeling that Microsoft's PR team will be able to navigate this successfully and Microsoft might even benefit from this incident as it tries to pull customers away from CrowdStrike Falcon and into its own EDR product -- Microsoft Defender for Endpoint.


My (very unprofessional) guess here is that investors in the near term will discount the company too heavily and the previously overvalued stock will blow past a realistic valuation and be priced too low for a little while. The software and company aren't going anywhere as far as I can tell, they have far too much marketshare and use of CrowdStrike is often a contractual obligation.

That said, I don't gamble against trading algorithms these days and am only guessing at what I think will happen. Anyone passing by, please don't take random online posts as financial advice.


After yesterday, CRWD is still up more than the S&P since the start of the year, and both are up insane amounts.

The stock market is unrelated to reality.


Honestly, it makes me angry. If we had a sense of justice in this world, this would devastate them financially.


With a P/E of over 573? Doubt it will recover that fast.


Worth $3.7B, paid $148M in 2022.

Edited to add: I wonder what the economic fallout from this will be? 10x his monetary worth? 100x? (not trying to put a price on the people who will die because of the outage; for that he and everyone involved needs to go to jail)


Nothing at all.

He will be the guy that convinced the investors and stakeholders to pour more money into the company despite some world-wide incident.

He deserves at least 3x the pay.

PS: look at the stocks! They sank, and now they are gaining value again. People can't work, people die, flights get delayed/canceled because of their software.


Regarding the stock. I'm sure people are "buying the dip".


From an investing perspective, that's fairly foolish until the financial liability of the company has been assessed.


Time will tell whether it's foolish or not.


so much seems based on sentiment now, might not matter as much as it would have 15 years ago.


If you invest based on fundamentals and company finances, you probably haven't had many chances to buy any positions in the last decade. Stock prices are completely unhinged from company financial reports.


"This is just a demonstration about how critical our software is and how undervalued we are. If the whole world economy requires us to run, we should be worth more" /s


I am still waiting for someone to explain how generative AI (= chatbots/copilot) would have solved this problem, or even would have "never let it happen"...


Generative AI would have mandated QA as best practice and to limit corporate liability.


Yeah well, obviously they don't need AI for that.

"We have unit tests coverage of 99%".

Let's be honest. :)


>I run a software company that deploys antivirus software automatically to customers' computers and servers. What should I be aware of as part of this and what actions should I ensure occur?

I gave it this question, and the steps it suggested included deploying outside peak hours to avoid potential disruptions (such as the weekend for businesses); including a way to detect deployment issues and easily roll back the software in case there is an issue; and testing the deployment in a controlled environment before rolling it out to all customers (this would include doing a small rollout before a larger one). All of these actions are very standard and all would have greatly helped in this situation.


Crowdstrike (Falcon) has all possible compliance certifications (they are PCI, FedRAMP, ISO-27001, etc.), and these often include following standard practices of SDLC, such as testing, deploying, incident management, etc. And they are much more detailed than the answer from AI.

My suspicion, as someone pointed out, is that they deployed this update as a small patch, maybe bypassing some controls and cutting some corners. I really wouldn't blame it on lack of automated tests, etc. Only crowdstrike people know how this thing possibly made it to production.


>What certifications should I have?

ChatGPT: Obtaining relevant certifications can enhance the credibility of your antivirus software and reassure customers about its security, reliability, and compliance with industry standards. Here are some key certifications you should consider:

1. Security Certifications:
- ISO/IEC 27001: This certification specifies the requirements for establishing, implementing, maintaining, and continually improving an information security management system (ISMS). It demonstrates your commitment to managing and protecting sensitive information.
- ISO/IEC 15408 (Common Criteria): This certification provides assurance that the security features of your software have been rigorously tested and meet internationally recognized standards.

2. Quality Management Certifications:
- ISO 9001: This certification focuses on quality management systems and ensures that your processes consistently provide products that meet customer and regulatory requirements.
- ISO/IEC 25010: This standard specifies a model for evaluating software quality, ensuring that your software meets high standards of functionality, reliability, usability, efficiency, maintainability, and portability.

3. Data Privacy Certifications:
- GDPR Compliance: If you operate in or serve customers in the European Union, demonstrating compliance with the General Data Protection Regulation (GDPR) is essential.
- CCPA Compliance: For customers in California, compliance with the California Consumer Privacy Act (CCPA) is important.

4. Industry-Specific Certifications:
- HIPAA Compliance: If your software handles protected health information (PHI) in the United States, ensuring compliance with the Health Insurance Portability and Accountability Act (HIPAA) is crucial.
- PCI DSS: If your software processes payment card information, compliance with the Payment Card Industry Data Security Standard (PCI DSS) is necessary.

5. Product-Specific Certifications:
- ICSA Labs Certification: ICSA Labs, an independent division of Verizon, provides testing and certification for security products, including antivirus software. Certification from ICSA Labs can validate the effectiveness of your software.
- AV-TEST Certification: The AV-TEST Institute is an independent research institute that evaluates and certifies antivirus software for its protection, performance, and usability. An AV-TEST certification can enhance your product's credibility.
- VB100 Certification: Virus Bulletin (VB) offers the VB100 certification, which is awarded to antivirus products that meet specific criteria for detecting known malware without generating false positives.
- OPSWAT Certification: OPSWAT offers certifications that validate the interoperability and effectiveness of cybersecurity products, including antivirus software.

6. Environmental and Occupational Health Certifications:
- ISO 14001: This certification demonstrates your commitment to environmental management, which can be important for corporate social responsibility (CSR) and sustainability.
- ISO 45001: This certification focuses on occupational health and safety management systems, ensuring a safe and healthy workplace for your employees.


"I asked ChatGPT how to prevent this and here's what it said. <generic bullet points about software testing> AI would have prevented this."


Every major outage when I worked at Google caused the stock price to increase, for this exact reason.

IT is always considered a cost until execs realize it's critical to the company's existence. Keeping the lights on always seems to be undervalued. :(


You’re joking but I actually think this is part of how the CEO will frame things to investors.


Kurtz's response on X, blaming the customer, is ridiculous. He will probably find another company to hire him as CEO though. It's just an upside-down world in the C-suite.


Don't forget the golden parachute. These guys always seem to fail upward.


That guy is gonna fail all the way right up to the top. Sheesh.


who is hiring these fucking idiots? they need to be blacklisted


Crowdstrike is run by humans just like you and me. One mistake doesn’t mean they are completely incompetent.


> One mistake doesn’t mean they are completely incompetent.

They are completely incompetent because for something as critical as crowdstrike code, you must build so many layers of validation that one, two or three mistakes don't matter because they will be caught before the code ends up in a customer system.

Looks like they have so little validation that one mistake (which is by itself totally normal) can end up bricking large parts of the economy without ever being caught. Which is neither normal nor competent.


Except this isn’t one mistake. Writing buggy code is a mistake. Not catching it in testing, QA, dogfooding or incremental rollouts is a complete institutional failure


Mistakes are perfectly fine, that's why multiple layers of testing exist


> Mistakes are perfectly fine, that's why multiple layers of testing exist

Indeed. Or in the case of crowdstrike, should exist. Which clearly doesn't for them.


The CTO with a shitty track record, not the line employees. He deserves zero reprieve


Reminds me of Phil Harrison, who always seems to find himself in an executive position botching launches of new video game platforms - PlayStation 3, Xbox One, Google Stadia.


CXOs usually have deep connections and great contracts (golden parachutes, etc.) that make them extremely difficult to fire and attractive to hire :)


He founded the company


I didn't understand why, back in 2010, it didn't seem to make much news...

Took out the entire company where I worked.

People thought it was a worm/virus: a few minutes after plugging in a laptop, McAfee got the DAT update and quarantined the file, which caused Windows to start a countdown and reboot (leading to endless BSODs).


Yet another successful loser who somehow continues to ascend the corporate ranks despite poor company performance. It just shows how disconnected job performance is from C-suite peer reviews, a glorified popularity contest. Should add the Unity and Better.com folks here.


Eh. To be fair, the higher profile your job is, the more likely you'll be the face of one of these in your career.


Ok but he faced two


“There's an old saying in Tennessee — I know it's in Texas, probably in Tennessee — that says, fool me once, shame on — shame on you. Fool me — you can't get fooled again.”

- GWB


fool me once...


This event is predicted in Sidney Dekker's book "Drift into Failure", which basically postulates that in order to prevent local failures we set up failure-prevention systems that increase complexity beyond our ability to handle it, and thereby introduce systemic failures that are global. It's a sobering book to read if you ever thought we could make systems fault tolerant.


More local expertise is really the only answer. Any organization that just outsources everything is prone to this. Not that organizations that don't outsource aren't prone to other things, but at least their failures will be asynchronous.


Funny thing is that for decades there were predictions of a need for millions more IT workers. It was assumed companies needed local knowledge. Instead what we got was more and more outsourced systems and centralized services. Today this is one of the many downsides.


Two weeks ago it was just about all car dealers


The problem here would be that there aren't enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with a comparable level of expertise would likely be much more expensive. So companies that do their own IT would be routinely outcompeted by ones that outsource, only for the latter to get into trouble when the black swan swoops in. The problem is that all other kinds of companies are mostly extinct by then, unless their investors had some superhuman foresight and the discipline to invest for years in something that year after year looks like losing money.


> The problem here would be that there's not enough people who can provide the level of protection a third-party vendor claims to provide, and a person (or persons) with comparable level of expertise would be much more expensive likely.

Is that because of economies of scale or because the vendor is just cutting costs while hiding their negligence?

I don't understand how a single vendor was able to deploy an update to all of these systems virtually simultaneously, and _that_ wasn't identified as a risk. This smells of mindless box checking rather than sincere risk assessment and security auditing.


Kinda both, I think, with the addition of the principal-agent problem. If you find a formula that provides the client with an acceptable CYA picture, it is very scalable. And the model of "IT person knowledgeable in security, modern threats, and the company's business" is not very scalable. The former, as we now know, is prone to catastrophic failures, but those are rare enough that a particular decision-maker won't be bothered by them.


> the vendor is just cutting costs while hiding their negligence?

That's how it works.


Depressing thought that this phenomenon is some kind of Nash equilibrium. That in the space of competition between firms, the equilibrium is for companies to outsource IT labor, saving on IT costs and passing those savings on to whatever service they are providing. -> Firms that outsource out-compete their competition + expose their services to black-swan catastrophic risk. Is regulation the only way out of this, from a game theory perspective?


Depressing, but a good way to think about it.

The whole market in which crowdstrike can exist is a result of regulation, albeit bad regulation.

And since the returns of selling endpoint protection are increasing with volume, the market can, over time, only be an oligopoly or monopoly.

It is a screwed market with artificially increased demand.

Also, the outsourcing is not only about cost and compliance. There is at least a third force. In a situation like this, no CTO who bought CrowdStrike products will be blamed. They did what was considered best industry practice (a box-ticking approach to security). From their perspective it is risk mitigation.

In theory, since most security incidents (not this one) involve the loss of personal customer data, if end customers were willing to pay a premium for proper handling of their data, AND if firms that don't outsource and instead pay for competent administrators within their hierarchy had a means of signaling that, the equilibrium could be pushed to where you would like it to be.

Those are two very questionable ifs.

Also how do you recognise a competent administrator (even IT companies have problems with that), and how many are available in your area (you want them to live in the vicinity) even if you are willing to pay them like the most senior devs?

If you want to regulate the problem away, a lot of influencing factors have to be considered.


It has been exactly the same with outsourcing production to China...


Also a major point in The Black Swan, where Taleb argues that it is better for banks to fail more often than for them to be protected from any adversity. Eventually they become "too big to fail", and if something is too big to fail, you are fragile to a catastrophic failure.


I was wondering when someone would bring up Taleb RE: this incident.

I know you aren't saying it is, but I think Taleb would argue that this incident, as he did with the coronavirus pandemic for example, isn't even a Black Swan event. It was extremely easy to predict, and you had a large number of experts warning people about it for years but being ignored. A Black Swan is unpredictable and unexpected, not something totally predictable that you decided not to prepare for anyways.


I think Grey Rhino is the term to use. Risks that we can see and acknowledge yet do nothing about.


That is interesting, where does he talk about this? I'm curious to hear his reasoning. What I remember from the Black Swan is that Black Swan events are (1) rare, (2) have a non-linear/massive impact, (3) and easy to predict retrospectively. That is, a lot of people will say "of course that happened" after the fact but were never too concerned about it beforehand.

Apart from a few doomsayers, I am not aware of anybody who was warning us about a CrowdStrike type of event. I do not know much about public health, but it was my understanding that there were playbooks for an epidemic.

Even if we had a proper playbook (and we likely do), the failure is so distributed that one would need a lot of books and a lot of incident commanders to fix the problem. We are dead in the water.


"Antifragile" is even more focused around this.


I think it was "predicted" by Sunburst, the SolarWinds hack.

I don't think centrally distributed anti-virus software is the only way to maintain reliability. Instead, I'd say companies centralize anything like administration because it's cost-effective and because they aren't actually concerned about a global outage like this.

JM Keynes said "A ‘sound’ banker, alas! is not one who foresees danger and avoids it, but one who, when he is ruined, is ruined in a conventional and orthodox way along with his fellows, so that no one can really blame him." and the same goes for corporate IT.


Many systems are fault tolerant, and many systems can be made fault tolerant. But once you drift into a level of complexity spawned by many levels of dependencies, it definitely becomes more difficult for system A to understand the threats from system B and so on.


Do you know of any fault tolerant system? Asking because in all the cases I know, when we make a system "fault tolerant" we increase the complexity and we introduce new systemic failure modes related to our fault-tolerant-making-system, making them effectively non fault tolerant.

In all the cases I know, we traded frequent and localized failure for infrequent but globalized catastrophic failures. Like in this case.


You can make a system tolerant to certain faults. Other faults are left "untolerated".

A system that can tolerate anything, and so has perfect availability, seems clearly impossible. So yeah, totally right, it's always a tradeoff. That's reasonable, as long as you trade smart.

I wonder if the people deciding to install Crowdstrike are aware of this. If they traded intentionally, and this is something they accepted, I guess it's fine. If not... I further wonder if they will change anything in the aftermath.


There will be lawsuits, there will be negotiations for better contracts, and likely there will be processes put in place to make it look like something was done at a deeper level. And yet this will happen again next year or the year after, at another company. I would be surprised if there was a risk assessment for the software that is supposed to be the answer to the risk assessment in the first place. Will be interesting to see what happens once the dust settles.


  - This system has a single point of failure; it is not fault tolerant. Let's introduce these three things to make it fault-tolerant
  - Now you have three single points of failure...


That makes it three times as durable...

...right?
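
(It does not. A quick toy calculation, assuming each of the three must be up for the system to work and each is independently up 99% of the time, numbers invented for illustration:)

  # Toy availability math: three components that must ALL be up
  # (i.e. three single points of failure) vs. one component.
  single = 0.99                  # assume each component is up 99% of the time
  all_three_up = single ** 3     # the system needs every one of them working

  print(f"one SPOF:    {single:.3%} uptime")         # 99.000%
  print(f"three SPOFs: {all_three_up:.3%} uptime")   # ~97.030%

Chaining must-work components in series only helps if each one is far more reliable than the thing it is supposed to protect.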


It really depends on the size of the system and the definition of fault tolerance. If I have a website calling out to 10 APIs and one API failure takes down the site, that is not fault tolerance. If that 1 API failure gets caught and the rest operate as normal, that is fault tolerance, but 10% of the system is down. If you go to almost any site and open the dev console, you'll see errors coming from parts of the system; that is fault tolerance. Any twin-engine airplane is fault tolerant...until both engines fail. I would say the solar system is fault tolerant, the universe even more so if you consider it a system.

tldr: there are levels to fault tolerance, and I understand what you are saying. I am not sure if you are advocating for getting rid of fault handling, but generally you can mitigate the big scary monsters, and what is left is the real edge-case issues; there really is no stopping one of those from time to time, given we live in a world where anything can happen at any time.

This instance really seems like a human error around deployment standards...and humans will always make mistakes.
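
To make the "one API down, the rest of the page still renders" bit concrete, here is a toy sketch (the endpoints are made up; the pattern is just catch-per-dependency and degrade):

  import concurrent.futures
  import urllib.request

  # Hypothetical endpoints a page might aggregate; any one of them can be down.
  ENDPOINTS = [f"https://api.example.com/widget/{i}" for i in range(10)]

  def fetch(url, timeout=2.0):
      """Fetch one widget; on failure, degrade that slot instead of failing the whole page."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return resp.read()
      except Exception:
          return None  # caller renders "widget unavailable" for this slot

  with concurrent.futures.ThreadPoolExecutor() as pool:
      widgets = list(pool.map(fetch, ENDPOINTS))

  up = sum(w is not None for w in widgets)
  print(f"{up}/{len(widgets)} widgets rendered; the rest degraded gracefully")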


well, you usually put a load balancer in front of multiple instances of your service to handle individual server failures. In the basic no-LB case, your single server fails, you restart it and move on (local failure). In the load-balancer case, the LB introduces its own global risks: the load balancer itself can fail, which you can restart, but it can also have a bug and stop handling sticky sessions while your servers rely on them. Now you have a much harder to track brown-out event that affects every one of your users for longer, is hard to diagnose, might end up causing hard-to-fix data and transaction issues, and restarting the whole thing might not be enough.

So yeah, there is no fault tolerance if the timeframe is large enough; there are just fewer events, with much higher costs. It's a tradeoff.

The cynic in me thinks that the one advantage of these complex CYA systems is that when they fail catastrophically, like CrowdStrike did, we can all "outsource" the blame to them.


It's also in line with arguments made by Ted Kaczynski (the Unabomber)

> Why must everything collapse? Because, [Kaczynski] says, natural-selection-like competition only works when competing entities have scales of transport and talk that are much less than the scale of the entire system within which they compete. That is, things can work fine when bacteria who each move and talk across only meters compete across an entire planet. The failure of one bacteria doesn’t then threaten the planet. But when competing systems become complex and coupled on global scales, then there are always only a few such systems that matter, and breakdowns often have global scopes.

https://www.overcomingbias.com/p/kaczynskis-collapse-theoryh...

https://en.wikipedia.org/wiki/Anti-Tech_Revolution


crazy how much he was right. if he hadn't gone down the path of violence out of self-loathing and anger he might have lived to see a huge audience and following.


I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.

There was a quote last year during the "Twitter files" hearing, something like, "it is axiomatic that the government cannot do indirectly what it is prohibited from doing directly".

Perhaps ironically, I had a difficult time using Google to find the exact wording of the quote or its source. The only verbatim result was from a NYPost article about the hearing.


>I suppose we wouldn't know whether an audience for those ideas exists today because they would be blacklisted, deplatformed, or deamplified by consolidated authorities.

Be realistic, none of his ideas would be blacklisted. They sound good on paper, but the instant it's time for everyone to return to mud huts and farming, 99% of people will return to PlayStations and ACs.

He wasn't "silenced" because the government was out to get him, no one talks about his ideas because they are just bad. Most people will give up on ecofascism once you tell them that you won't be able to eat strawberries out of season.


"would be blacklisted, deplatformed, or deamplified by consolidated authorities"

Sorry. Not true. You have Black Swan (Taleb) and Drift into Failure (Dekker) among many other books. These ideas are very well known to anyone who makes the effort.


> it is axiomatic that the government cannot do indirectly what it is prohibited from doing directly

Turns out SCOTUS decided it isn't, and the government is free to do exactly that as long as they are using the services of an intermediary.


The only thing that got the Unabomber blacklisted is that he started sending bombs to people. His manifesto was a dime a dozen; half the time you can expect a politician boosting such stuff for temporary polling wins.

Hell, if we consider his alleged (I haven't vetted the genealogy tree) cousins, his body count isn't even that impressive.


Being the subject of psychological experiments at Harvard probably did a number on him


I think a surprising amount of people already share this view, even if they don't go into extensive treatment with references like Dekker presumably does (I haven't read it).

I suspect most people in power just don't subscribe to that, which is precisely why it's systemic to see the engineer shouting "no!" while John CEO says "we're doing it anyway." I'm not sure this is something you can just teach, because the audience definitely has reservations about adopting it.


> we set up failure-prevention systems

You can't prevent failure. You can only mitigate the impact. Biology has pretty good answers as to how to achieve this without having to increase complexity as a result; in fact, it often shows that simpler systems increase resiliency.

Something we used to understand until OS vendors became publicly traded companies and "important to national security" somehow.


Just yesterday listened to a lecture by Moshe Vardi which covers adjacent topics:

https://simons.berkeley.edu/events/lessons-texas-covid-19-73...


> if you ever thought we could make systems fault tolerant

The only possible path to fault tolerance is simplicity, and then more simplicity.

Things like CrowdStrike take the opposite approach: add a lot of fragile complexity attempting to catch problems, while introducing more attack surface than they remove. This will never succeed.


As an architect of secure, real-time systems, the hardest lesson I had to learn is there's no such thing as a secure, real-time system in the absolute sense. Don't tell my boss.


I haven't read it, but I'd take a leap and presume it's somewhere between the people who say "C is unsafe" and "some other language takes care of all of that".

Basically delegation.


The thing that amazes me is how they've rolled out such a buggy change at such a scale. I would assume that for such critical systems, there would be a gradual rollout policy, so that not everything goes down at once.


Lack of gradual, health-mediated rollout is absolutely the core issue here. False-positive signatures, crash-inducing blocks, etc. will always slip through testing at some rate, no matter how good testing is. The necessary defense in depth here is to roll out ALL changes (binaries, policies, etc.) in a staggered fashion with some kind of health checks in between (did > 10% of the endpoints the change went to go down and stay down right after the change was pushed?).

Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.

You can stagger changes out within a reasonable timeframe - the blocks already take hours/days/weeks to come up with, taking an extra hour or two to trickle the change out gradually with some basic sanity checks between staggers is a tradeoff everyone would embrace in order to avoid the disaster we're living through today.

They need a reset on their security-vs-uptime balance point.
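
A minimal sketch of what that staggered, health-gated rollout could look like (the ring sizes, 10% threshold, soak time, and deploy/health hooks are all stand-ins, not anything CrowdStrike actually exposes):

  import time

  RINGS = [0.001, 0.01, 0.10, 0.50, 1.00]  # canary first, then wider slices of the fleet
  CRASH_THRESHOLD = 0.10                   # abort if >10% of updated endpoints go down
  SOAK_SECONDS = 1                         # illustration only; in reality minutes or hours per ring

  def rollout(push_to_fraction, crash_rate_for, rollback):
      """Push a change ring by ring, checking endpoint health before widening."""
      for ring in RINGS:
          push_to_fraction(ring)
          time.sleep(SOAK_SECONDS)         # give endpoints time to crash-loop, if they're going to
          if crash_rate_for(ring) > CRASH_THRESHOLD:
              rollback()
              raise RuntimeError(f"aborted rollout at ring {ring:.1%}")
      print("rollout complete")

  if __name__ == "__main__":
      try:
          rollout(
              push_to_fraction=lambda f: print(f"pushed to {f:.1%} of fleet"),
              crash_rate_for=lambda f: 0.30,   # pretend 30% of updated endpoints went down
              rollback=lambda: print("rolled back"),
          )
      except RuntimeError as err:
          print(err)

Even a scheme this crude would likely have contained the blast radius to a canary ring instead of the whole installed base.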


Wow!! Good to know the real reason for the non-staggered release of the software...

> Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.


There's some irony there, in that the whole point of CrowdStrike is that it does behavioural-based interventions, i.e. it notices "unusual" activity over time and can react to it autonomously. So them telling you they can't engineer it is kind of like telling you they don't know how to do a core feature they actually sell and market the product as doing.


The core issue? I'd say it's QA.

Deploy to a QA server fleet first. Stuff is broken. 100% prevention.


It's quite handy that all the things that pass QA never fail in production. :)

On a serious note, we have no way of knowing whether their update passed some QA or not, likely it hasn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is: it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollbackable, rollouts.

Ultimately, unless it's a nuclear power plant or something mission critical with no redundancy, I don't care if it passes QA, I care that it doesn't cause damage in production.

Had this been halted after bricking 10, 100, 1.000, 10.000, heck, even 100.000 machines or a whopping 1.000.000 machines, it would have barely made it outside of the tech circle news.


> On a serious note, we have no way of knowing whether their update passed some QA or not

I think we can infer that it clearly did not go through any meaningful QA.

It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.

That's not what happened here. They bricked a huge portion of internet connected windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA which is even worse.

There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.


If there had been a QA process, the kill rate could not have been as high as it is, because there'd have to be at least one system configuration that's not subject to the issue.


I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.

Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.


There's also the possibility that they did do QA, had issues in QA and were pressured to rush the release anyways.


Unsubstantiated (not even going to bother linking to the green-account-heard-it-from-a-friend comment), but supposedly the fault was added by a post-QA process.


My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet. Which I agree with GP is not a sufficient argument.


Yes, one of the first steps of this gradual rollout should be rolling out to your own company in the classic, "eat your own dogfood" style.


If and when there is a US Cyber Safety Review Board investigation of this incident, documents like that are going to be considered with great interest by the parties involved.

Often it is the engineers working for a heavily invested customer at the sharp end of the coal face who get a glimpse underneath the layers of BS and stare into the abyss.

This doesn’t look good, they say. It looks fine from up top! Keep shoveling! Comes the reply.


Sure, gradual rollout seems obviously desirable, but think of it from a liability perspective.

You roll out a patch to 1% of systems, and then a few of the remaining 99% get attacked and they sue you for having a solution but not making it available to them. It won't matter that your sales contract explains that this is how it works and the rollout is gradual and random.

Just a thought.


These suing hypotheticals work both ways: they can sue for crashing 100% of your computers, so they don't really explain any decision.


Then push it down to the customer, or better yet provide integration points with other patch-management software (no idea if you can integrate with WSUS without doing insane crap, but it's not the only system that handles that, etc.).


Another version of the "fail big" or "big lie" type phenomenon. Impact 1% of your customers and they sue you saying the gradual rollout demonstrates you had prior knowledge of the risk. Impact 100% of your customers and somehow you get off the hook by declaring it a black swan event that couldn't have been foretold.


Don't you think they will be sued now too?


CS recently went through cost-cutting measures, which is likely why there's no QA fleet to deploy to and no investment in improving their engineering processes.


Were they struggling with paying the employees?


In modern terms, you mean they simply weren't willing to babysit longer install frames.


This. I can see such an update shipping out for a few users. I mean I've shipped app updates that failed spectacularly in production due to a silly oversight (specifically: broken on a specific Android version), but those were all caught before shipping the app out to literally everybody around the world at the same time.


The only thing I can think of is they were trying to defend from a very severe threat very quickly. But... it seems like if they tested this on one machine they'd have found it.


Unless that threat was a 0day bug that allows anyone to SSH to any machine with any public key, it was not worth pushing it out in haste. Full stop. No excuses.


Can't boot, can't get cracked! Big brain thinking.


That’s the most charitable hypothesis, and I agree could be possible!

I myself have ninja-shipped a fix for a minor problem, but then caused a worse problem because I rushed it.


I pushed a one-character fix and broke it a second time


I'd love to know what the original threat was. I hope it was something dumb like applying new branding colors to the systray indicator.


"works on my machine" at Internet scale. What a scary thought


I also blame the customers here to be completely honest.

The fact the software does not allow for progressive rollout of a version in your own fleet should be an instantaneous "pass". It's unacceptable for a vendor to decide when updates are applied to my systems.


Absolutely. I may be speaking from ignorance here, as I don't know much about Windows, but isn't it also a big security red flag that this thing is reaching out to the Internet during boot?

I understand the need for updating these files, they're essentially what encodes the stuff the kernel agent (they call it a "sensor"?) is looking for. I also get why a known valid file needs to be loaded by the kernel module in the boot process--otherwise something could sneak by. What I don't understand is why downloading and validating these files needs to be a privileged process, let alone something in the actual kernel. And to top it all off, they're doing it at boot time. Why?

I hope there's an industry wide safety and reliability lesson learned here. And I hope computer operators (IT departments, etc) realize that they are responsible for making sure the things running on their machines are safe and reliable.
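
To illustrate the privilege point (my own hypothetical sketch, not how CrowdStrike's updater actually works): the fetch-and-verify step can live in an unprivileged process, and only content that matches a pinned/signed manifest ever lands in the directory the kernel component reads. All paths, filenames, and hashes below are invented.

  import hashlib
  import os
  import shutil

  STAGING_DIR = "/var/tmp/sensor-updates"   # unprivileged updater downloads here
  TRUSTED_DIR = "/opt/sensor/content"       # privileged side only ever reads from here
  MANIFEST = {
      # filename -> expected SHA-256 from a signed manifest (placeholder value)
      "channel-update.bin": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  }

  def promote_if_valid(name):
      """Verify a staged content file in user space; promote it only if it matches the manifest."""
      staged = os.path.join(STAGING_DIR, name)
      expected = MANIFEST.get(name)
      if expected is None:
          return False                      # unknown file: never promote
      with open(staged, "rb") as f:
          actual = hashlib.sha256(f.read()).hexdigest()
      if actual != expected:
          return False                      # corrupt or tampered download fails closed here
      shutil.move(staged, os.path.join(TRUSTED_DIR, name))
      return True

With something like this, a malformed download gets rejected in user space at update time, instead of being parsed by a kernel driver at boot.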


Well said. I can't fathom companies being fine with some 3rd party pushing arbitrary changes to their critical production systems.


At the risk of sounding like a douche-bag, I honestly believe there's A LOT of incompetence in the tech-world, and it permeates all layers: security companies, AV companies, OS companies, etc.

I really blame the whole power structure. It looked like the engineers had the power, but over the last 10 years tech has been turned upside-down and exploited like any other industry, controlled by opportunistic and greedy people. Everything is about making money and shipping features; the engineering is lost.

Would you rather tick compliance boxes easily or think deeply about your critical path? Would you rather pay 100k for a skilled engineer or hire 5 cheaper (new) ones? Would you rather sell your HW now despite pushing a feature-incomplete, buggy app that ruins the experience for many, many customers? Will you listen to your engineers?

I also blame us, the SWEs; we are way too easily bossed around by these types of people who have no clue. Have professional integrity: tests are not optional or something that can be cut, they're part of SWE. Gradual rollouts, feature toggles, fallbacks/watchdogs, etc. are basic tools everyone should know.


I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.

Apple recognised that kernel extensions brought all sorts of trouble for users, such as instability and crashes, and presented a juicy attack surface. They deprecated and eventually disallowed kernel extensions, supplanting them with a system-extensions framework that provides interfaces for VPN functionality, EDR agents, etc.

A Crowdstrike agent couldn't panic or boot loop macOS due to a bug in the code when using this interface.


> I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.

Yes, the problem here is that the system owners had too much control over their systems.

No, no, that's the EXACT OPPOSITE of what happened. The problem is Crowdstrike had too much control of systems -- arguing that we should instead give that control to Apple is just swapping out who's holding the gun.


> arguing that we should instead give that control to Apple is just swapping out who's holding the gun.

apple wrote the OS, in this scenario they're already holding a nuke, and getting the gun out of crowdstrike's hands is in fact a win.

it is self-evident that 300 countries having nukes is less safe than 5 countries having them. Getting nukes (kernel modules) out of the hands of randos is a good thing even if the OS vendor still has kernel access (which they couldn't possibly not have) and might have problems of their own. IDK why that's even worthy of having to be stated.

don't let the perfect be the enemy of the good; incremental improvement in the state of things is still improvement. there is a silly amount of black-and-white thinking around "popular" targets like apple and nvidia (see: anything to do with the open-firmware-driver) etc.

"sure google is taking all your personal data and using it to target ads to your web searches, but apple also has sponsored/promoted apps in the app store!" is a similarly trite level of discourse that is nonetheless tolerated when it's targeted at the right brand.


Perfectly stated!


This is good nuance to add to the conversation, thanks.

I think in most cases you have to trust some group of parties. As an individual you likely don't have enough time and expertise to fully validate everything that runs on your hardware.

Do you trust the OSS community, hardware vendors, OS vendors like IBM, Apple, M$, do you trust third party vendors like Crowdstrike?

For me, I prefer to minimize the number of parties I have to trust, and my trust is based on historical track record. I don't mind paying and giving up functionality.


Even if you've trusted too many people, and been burned, we should design our systems such that you can revoke that trust after the fact and become un-burned.

Having to boot into safe mode and remove the file is a pretty clumsy remediation. Better would be to boot into some kind of trust-management interface and distrust Crowdstrike updates dated after July 17, then rebuild your system accordingly (this wouldn't be difficult to implement with nix).

Of course you can only benefit from that approach if you trust the end user a bit more than we typically do. Physical access should always be enough to access the trust management interface, anything else is just another vector for spooky action at a distance.
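
Not nix, but a plain-Python sketch of the "distrust anything stamped after July 17" policy (the content path is invented; a real tool would run from recovery media and quarantine rather than just report):

  import datetime
  import pathlib

  CONTENT_DIR = pathlib.Path("/opt/sensor/content")   # invented vendor content directory
  CUTOFF = datetime.datetime(2024, 7, 17, tzinfo=datetime.timezone.utc)

  for f in CONTENT_DIR.glob("*"):
      mtime = datetime.datetime.fromtimestamp(f.stat().st_mtime, tz=datetime.timezone.utc)
      if mtime > CUTOFF:
          print(f"would quarantine {f} (stamped {mtime:%Y-%m-%d})")  # dry run: flag, don't load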


It is some mix of priorities along the frontier, with Apple being on the significantly controlling end, such that I wouldn't want to bother. Your trust should also be based on prediction, and giving a major company even more control over what your systems are allowed to do has historically been bad and only gets worse. Even if Apple is properly ethical now (I'm skeptical; I think they've found a decently sized niche and that most of their users wouldn't drop them even if they moved to significantly higher levels of telemetry, due in part to being a status good), there's little reason to give them that power in perpetuity. Removing that control when it is abused hasn't gone well in the past.


Microsoft is also trying to make drivers and similar safer with HVCI, WDAC, ELAM and similar efforts.

But given how a large part of their moat is backwards compatibility, very few of those things are the default and even then probably wouldn't have prevented this scenario.


Microsoft has routinely changed the display driver model, breaking backward compatibility. They've also barred print drivers.


> large part of their moat is backwards compatibility

This is more of a religious belief than truth, IMO. They could strong-arm recalcitrant customers, but they don't.


> They could strong-arm recalcitrant customers, but they don't.

They really can't. When the customers have to redo their stack, they might do that in a way that doesn't need Microsoft at all.


These customers wouldn't be able to do that in time frames measured in anything but decades and/or they would risk going bankrupt attempting to switch.

Microsoft has far more leverage than they choose to exert, for various reasons.


I can't run a 10-year-old game on my Mac, but I can run a 30-year-old game on my Windows 11 box. Microsoft prioritizes backwards compatibility for older software.


You can’t run a 30 year old driver in Windows, nor a 10 year old in all likelihood.

Microsoft prioritizes userspace compatibility, but their driver models have changed (relatively) frequently.


If you are a Crowdstrike customer you can’t run anything today.


For Apple you just need to be an Apple customer; they do a good job of crashing computers with their macOS updates, like Sonoma. I remember my first MacBook Pro Retina couldn’t go to sleep because it wouldn’t wake up until Apple decided to release a fix for it. Good thing they don’t make server OSes.


I remember fearing every OSX update, because until they switched to just shipping read-only partition images you had a considerable chance of hitting a bug in Installer.app that resulted in an infinite loop... (the bug existed from ~10.6 until they switched to image-based updates).


30 years ago would be 1994. Were there any 32-bit Windows games in 1994 other than the version of FreeCell included with Win32s?

16-bit games (for DOS or Windows) won't run natively under Windows 11 because there's no 32-bit version of Windows 11 and switching a 64-bit CPU back to legacy mode to get access to the 16-bit execution modes is painful.


Maybe. Have you tried? 30 year old games often did not implement delta timing, so they advance ridiculously fast on modern processors. Or the games required a memory mode not supported by modern Windows (see real mode, expanded memory, protected mode), requiring DOSBox or other emulator to run today.

DOSBox runs on Mac too, incidentally.
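
(For anyone who hasn't hit this: "delta timing" just means scaling movement by how long the previous frame actually took, instead of assuming a fixed frame rate. A toy loop, not from any real game:)

  import time

  position = 0.0
  speed = 100.0                # units per second, not per frame
  last = time.monotonic()

  for _ in range(5):
      now = time.monotonic()
      dt = now - last          # seconds since the previous frame
      last = now
      position += speed * dt   # frame-rate independent: a faster CPU doesn't speed the game up
      time.sleep(0.016)        # pretend we're rendering at ~60 fps
  print(f"position after 5 frames: {position:.1f}")

Games that instead did position += speed per frame run absurdly fast once frames take microseconds instead of milliseconds.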


A 10 year old driver would crash your system, and you can't install some VM like you can with some game. Not that great prioritization


If the user wants remote code execution in kernel space (which auto-updates are), let them.

Apple sells the whole hardware stack. I don't think limiting drivers would fly on Windows or Linux.


Pretty sure there's an exception for drivers, but it requires at minimum notarisation from Apple, and more likely a review as well.


They just developed a new framework that allows drivers to work just in user space https://developer.apple.com/documentation/driverkit


Well - recognition where it's due - that actually looks pretty great. (Assuming that, contrary to prior behavior, they actually support it, and fix bugs without breaking backwards compatibility every release, and don't keep swapping it out for newer frameworks, etc etc)


OK, what if they shipped it off by default, but there was a physical switch that could turn it on and required hardware access?

Good compromise?


That’s exactly how macOS works (except it’s not a physical switch). You can disable SIP if you have hardware access to a machine.


I would be fine with jumpers, ye.


No.

Go buy a different product if you want that functionality. I'm sticking with my Apple phone so outages like this are much less likely to affect me.


> I also blame us, the SWEs; we are way too easily bossed around by these types of people who have no clue. Have professional integrity: tests are not optional or something that can be cut, they're part of SWE.

Then maybe most of what's done in the "tech-industry" isn't, in any real sense, "engineering"?

I'd argue the areas where there's actual "engineering" in software are the least discussed---example being hard real-time systems for Engine Control Units/ABS systems etc.

That _has_ to work, unlike the latest CRUD/React thingy that had "engineering" processes of cargo-culting whatever framework is cool now and subjective nonsense like "code smells" and whatever design pattern is "needed" for "scale" or some such crap.

Perhaps actual engineering approaches could be applied to software development at large, but it wouldn't look like what most programmers do, day to day, now.

How is mission-critical software designed, tested, and QA'd? Why not try those approaches?


Amen to that. Software Engineering as a discipline badly suffers from not incorporating well-known methods for preventing these kinds of disasters from Systems Engineering.

And when I say Systems Engineering I don't mean Systems Programming, I mean real Systems Engineering: https://en.wikipedia.org/wiki/Systems_engineering

> How is mission-critical software designed, tested, and QA'd? Why not try those approaches?

Ultimately, because it is more expensive and slower to do things correctly, though I would argue that while you lose speed initially with activities like actually thinking through your requirements and your verification and validation strategies, you end up gaining speed later when you're iterating on a correct system implementation because you have established extremely valuable guardrails that keep you focused and on the right track.

At the end of the day, the real failure is in the risk estimation of the damage done when these kinds of systems fail. We foolishly think that this kind of widespread disastrous failure is less likely than it really is, or the damage won't be as bad. If we accurately quantified that risk, many more systems we build would fall under the rigor of proper engineering practices.


Accountability would drive this. Engineering liability codes are a thing, trade liability codes are a thing. If you do work that isn't up to code, and harm results, you're liable. Nobody is holding us software developers accountable, so it's no wonder these things continue to happen.


"Listen to the engineers?" The problem is that there are no engineers, in the proper sense of the term. What there are is tons and tons of software developers who are all too happy to be lax about security and safe designs for their own convenience and fight back hard against security analysts and QA when called out on it.


I would add that a lot of ppl in this industry also just blindly follow the herd too, without any independent thinking.

Oh, everyone is using Crowdstrike? I guess i have to do so too!

Oh, everyone is using Kubernetes? I guess i better start migrating our services to it too!

Oh, everyone is using this fancy Vercel stuff? We better use it too!

Oh everyone is migrating their workloads to the cloud even tho we dont need to and it costs 5x more?!! We better do so too!!


> it looked like the engineers

Engineers can be lazy and greedy, too. But at least they should better understand the risks of cutting corners.

> Have professional integrity: tests are not optional or something that can be cut, they're part of SWE. Gradual rollouts, feature toggles, fallbacks/watchdogs, etc. are basic tools everyone should know.

In my career, my solution for this has been to just include doing things "the right way" as part of the estimate, and not give management the option to select a "cutting corners" option. The "cutting corners" option not only adds more risk, but rarely saves time anyway when you inevitably have to manually roll things back or do it over.


Sigh, I've tried this. So management reassigned it to a dev who was happy to ship a simulacrum of the thing that, at best, doesn't work or, at worst, is full of security holes and gives incorrect results. And this makes management happy because something shipped! Metrics go up!

And then they ask why, exactly, did the senior engineer say this would take so long? Why always so difficult?


I don't know that incompetence is the best way to describe the forces at play but I agree with your sentiment.

There is always tension between business people and engineering. The engineers want things to be perfect and safe, because we're the ones fixing the resulting issues during nights and weekends. The business people are interested in getting features released, and don't always understand the risks created by pushing arbitrary dates.

It's a tradeoff that is well managed in healthy organizations where the two sides and leadership communicate effectively.


> The engineers want things to be perfect and safe, because we're the ones fixing the resulting issues during nights and weekends. The business people are interested in getting features released, and don't always understand the risks created by pushing arbitrary dates.

Isn't this issue a vindication of the engineering approach to management, where you try to _not_ brick thousands of computers because you wanted to meet some internal deadline faster?


You don't consider bricking a considerable fraction of the world's computers in a way that's difficult to recover from incompetence?


Maybe it is. I don't like the connotation of it though.

It implies some sort of individual failure, when what I was trying to say is that I think it's an organizational failure.


> There is always tension between business people and engineering.

Really? I think this situation (and the situation with Boeing!) shows that the tension is ultimately between responsibility and irresponsibility.

It can't be said that this is a win for short-sighted and incompetent business people, can it?

If people don't understand the risks they shouldn't be making the decisions.


I think this is especially true in businesses where the thing you are selling is literally your ability to do good engineering. In the case of Boeing the fundamental thing customers care about is the "goodness" of the actual plane (for example the quality, the value for money, etc). In the case of Crowdstrike people wanted high quality software to protect their computers.


Yeah, good point. If you buy a carton of milk and it's gone off you shrug and go back to the store. If you're sitting in a jet plane at 30,000ft and the door goes for a walk... Twilight Zone. (And if the airline's security contractor sends a message to all the planes to turn off their engines... words fail. It's not... I can't joke about it. Too soon.)


Yes. I have been working in the tech industry since the early aughts, and I have never seen it so weak on engineer-led firms. Something really happened and the industry flipped.


In most companies, businesspeople without any real software dev experience control the purse strings. Such people should never run companies that sell life-or-death software.

The reality is there is plenty of space in the software industry to trade off velocity against "competent" software engineering. Take Instagram as an example. No one is going to die if e.g. a bug causes someone's IG photo upload to only appear in a proper subset of the feeds where it should appear.

There's a lot of incompetence by choice.


In the civil engineering world, at least in Europe, the lead engineer signs papers that make them liable if a bridge or a building structure collapses on its own. Civil engineers face literal prison time if they do sloppy work.

In the software engineering world, we have TOSs that deny any liability if the software fails. Why?

It boils my blood to think that the heads of CrowdStrike would maybe get a slap on the wrist and everything will slowly continue as usual as the machines will get fixed.

People died for this bug.


Let's think about this for a second. I agree to some extent with what you are trying to say; I just think there's a critical thing missing from your consideration, and that is usage of the product outside its intended purpose/marketing.

Civil engineers build bridges knowing that civilians use them, and that structural failure can cause deaths. The line of responsibility is clear.

For SW companies (like CrowdStrike (CS)) it MAY BE less straightforward.

A relevant real-world example is the use of consumer drones in military conflicts. Companies like DJI design and market their drones for civilian use, such as photography. However, these drones have been repurposed in conflict zones, like Ukraine, to carry explosives. If such a drone malfunctioned during military use, it would be unreasonable to hold DJI accountable, as this usage clearly falls outside the product's intended purpose and marketing.

The liability depends on the guarantees they make. If they market it as AV for critical infrastructure, such as healthcare (it seems they do: https://www.crowdstrike.com/platform/), then by all means it's reasonable to hold them accountable.

However, SW companies should be able to sell products as long as they're clear about what the limitations are, and those limitations need to be clearly communicated to the customers.


We have those TOS's in the software world because it would be prohibitively expensive to make all software as reliable as a publicly used bridge. For those who died as a direct result of CrowdStrike, that's where the litigious nature of the US becomes a rare plus. And CrowdStrike will lose a lot of customers over this. It isn't perfect, but the market will arbitrate CrowdStrike's future in the coming months and years.


We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.

I mean back in the mid teens we had the whole “move fast and break things” motif. I think that quickly morphed into “be agile” because no one actually felt good about breaking things.

We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.” Like, let’s create our own oath.


> We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.”

I assume you realize that you don't get very far in many companies when you do that. I'm not humble-bragging, but I used to say just this over the past 10-15 years, even when in senior/leadership positions, and it ended up giving me a reputation of "oh, gedy is difficult", and you get sidelined by more "helpful" junior devs and managers who are willing to sling shit over the wall to please product. It's really not worth it.


It’s a matter of getting a critical mass of people who do that. In other words, changing the general culture. I’m lucky to work at a company that more or less has that culture.


Yeah I’ve found this is largely cultural, and it needs to come from the top.

The best orgs have a gnarly, time-wisened engineer in a VP role who somehow is also a good people person, and pushes both up and down engineering quality above all else. It’s a very very rare combination.


If it's a mature system and management is highly risk averse, not fucking up means more than slinging shit quickly


> We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.

Agreed. Thinking back to my experience at a company like Sun, every build was tested on every combination of hardware and OS releases (and probably patch levels, don't remember). This took a long time and a very large number of machines running the entire test suites. After that all passed ok, the release would be rolled out internally for dogfooding.

To me that's the base level of responsibility an engineering organization must have.

Here, apparently, Crowdstrike lets a code change through with little to no testing and immediately pushes it out to the entire world! And this is from a product that is effectively a backdoor to every host. What could go wrong? YOLO right?

This mindset is why I grow to hate what the tech industry has become.


As an infra guy, it seems like all my biggest fights at work lately have been about quality. Long abandoned dependencies that never get updated, little to no testing, constant push to take things to prod before they're ready. Not to mention all the security issues that get shrugged off in the name of convenience.

I find both management and devs are to blame. For some reason the amazingly knowledgeable developers I read on here daily are never to be found at work.


Yes. I’ve had the same experience. Literally have had engineers get upset with me when I asked them to consider optimizing code or refactor out complexity. “Yeah we’ll do it in a follow up, this needs to ship now,” is what I always end up hearing. We’re not their technical leads but we get pulled into a lot of PRs because we have oversight on a lot of areas of the codebase. From our purview, it’s just constantly deteriorating.


Need to make software developers legally liable like other engineers, that will cause a huge behavioral shift


IMO, if you want to write code for anything mission critical you should need some kind of state certification, especially when you are writing code for stuff that is used by govt., hospitals, finance etc.


Certifications by themselves don’t help if the culture around them doesn’t change. Otherwise it’s just rubber-stamping.


Not certification, licensure. That can and will be taken away if you violate the code of ethics. Which in this case means the code of conduct dictated to you by your industry instead of whatever you find ethical.

Like a license to be a doctor, lawyer, or civil engineer.

There’s - perhaps rightfully, but certainly predictably - a lot of software engineers in this thread moaning about how evil management makes poor engineers cut corners. Great, licensure addresses that. You don’t cut corners if doing so and getting caught means you never get to work in your field again. Any threat management can bring to the table is not as bad as that. And management is far less likely to even try if they can’t just replace you with a less scrupulous engineer (and there are many, many unscrupulous engineers) because there aren’t any because they’re all subject to the same code of ethics. Licensure gives engineers leverage.

Super unpopular concept, though.


Certifications and compliance regimes are what got us into this mess in the first place.


I think that could cause a huge shift away from contributing to or being the maintainer of open source software. It would be too risky if those standards were applied and they couldn't use the standard "as is, no warranties" disclaimers.


Actually, no it wouldn't, as the licensure would likely be tied to providing the service on a paid basis to others. You could write or maintain any codebase you want. Once you start consuming it for an employer, though, the licensure kicks in.

Paid/subsidized maintainers may be a different story though. But there absolutely should be some level of teeth and stake wieldable by a professional SWE to resist pushes to "just do the unethical/dangerous thing" by management.


I might have misunderstood. I took it to mean that engineers would be responsible for all code they write, the same way another engineer may be liable for any bridge they build. That would mean the common "as is", "no warranty", "not fit for any purpose" clauses common to OSS would no longer apply, as they clearly skirt around the fact that you made a tool to do a specific thing, and harming your computer isn't the intended outcome.

You can already enforce responsibility via contract but sure, some kind of licensing board that can revoke a license so you can no longer practice as a SWE would help with pushback against client/employer pressure. In a global market though it may be difficult to present this as a positive compared to overseas resources once they get fed up with it. It would probably need either regulation, or the private equivalent - insurance companies finding a real, quantifiable risk to apply to premiums.


Trouble is, the bridge built by any licensed engineer stands in its location and can't be moved or duplicated. Software, however, is routinely duplicated and copied to places that might not be suitable for its original purpose.


I’d be ok with this so long as 1) there are rules about what constitutes properly built software and 2) there are protections for engineers who adhere to these rules


Greed and MBAs have colonized the far ends of the techno sphere.


Far from being douchey, I think you've hit the nail on the head. No one is perfect, we're all incompetent to some extent. You've written shitty code, I've definitely written shitty code. There's little time or consideration given to going back and improving things. Unless you're lucky enough to have financial support while working on a FOSS project where writing quality software is actually prioritized.

I get the appeal software developers have to start from scratch and write their own kernel, or OS, etc. And then you realize that working with modern hardware is just as messy.

We all stack our own house of cards upon another. Unless we tear it all down and start again with a sane stable structure, events like this will keep happening.


I know 100k+ engineers (some in security) that definitely should not be described as skilled.


Wow, you know a lot of people!


I think you are correct on that many SWEs are incompetent. I definitely am. I wish I had the time and passion to go through a complete self-training of CS fundamentals using Open Course resources.


You do realize that knowledge of CS fundamentals is extremely unlikely to have prevented this?


> I honestly believe there's A LOT of incompetence in the tech-world

I can understand why. An engineer with expertise in one area can be a dunce in another; the line between concerns can be blurry; and expectations continue to change. Finding the right people with the right expertise is hard.


100%. What we've seen in the last couple of decades is the march of normies into the technosphere, to the detriment of the prior natives.

We've essentially watched digital colonialism, and it certainly peaks with Elon Musk's wealth and ego attempting to buy up the digital marketplace of ideas.


Pardon my snarkiness, but this is what you get when imbecile MBAs and marketers run every company instead of engineers.


Applying rigorous engineering principles is not something I see developers doing often. Whether or not it's incompetence on their part, or pressure from 'imbecile MBAs and marketers', it doesn't matter. They are software developers, not engineers. Engineers in most countries have to belong to a professional body and meet specific standards before they can practice as professionals. Any asshat can call themselves a 'software engineer', the current situation being a prime example, or was this a marketing decision?


You're making the title be more than it is. This won't get solved by more certification. The checkbox of having certified security is what allowed it to happen in the first place.


No. Engineering means something. This is a software ‘engineering’ problem. If the field wants the nomenclature, then it behooves it to apply rigour to who can call themselves an engineer or architect. Blaming middle management is missing the wood for the trees. The root cause was a bad patch. That is development's fault and no one else's. As to why this fault could happen, well, the design of Windows should be scrutinised. Again, middle management isn't really to blame here; software architects and engineers design the infrastructure, and they choose to use Windows for a variety of reasons.

The point I'm trying to make is that blaming "MBAs and marketing" shifts blame and misses the wood for the trees. The OP is on the holier-than-thou "engineer" trip. They are not engineers.


I think engineering only means something because of culture. It all starts from the culture of collective people who define and decide what principles are to be followed and why. All the certifications and licensing that are prerequsite to becoming an engineer are outcomes of the culture that defined them.

Today we have pockets of code produced by one culture linked (literally) with pockets of code produced by completely different ones, and somehow we expect the final result to adhere to the most principled and disciplined culture.


Nobody did gradual rollout in 1992


Not entirely true. The company I worked for, a major network equipment provider, had a customer user group that had self-organised to take turns being the first customer to deploy major new software builds. It mostly worked well.


Just wait until the PE licensing requirements come to legally charge money for code.


This is the thing that gets me most about this. Any Windows systems developer knows that a bug in a kernel driver can cause BSODs - why on earth would you push out such changes en-masse like this?!


In 2012 a local bank rolled out an update that basically took all of their customer services offline. Couldn't access your money. Took them a month to get things working again.


No concept of "canarying", eh?


I'm confused as to how this issue is so widespread in the first place. I'm unfamiliar with how Crowdstrike works, do organizations really have no control over when these updates occur? Why can't these airlines just apply the updates in dev first? Is it the organizations fault or does Crowdstrike just deliver updates like this and there's no control? If that's just how they do it, how do they get away with this?


Can somebody summarize what CrowdStrike actually is/does? I can't figure it out from their web page (they're an "enterprise" "security" "provider", apparently). Is this just some virus scanning software? Or is it some bossware/spyware thing?


It's both. Antivirus along with spyware to also watch for anything the user is doing that could introduce a threat, such as opening a phishing email, posting on HN, etc.


[flagged]


It's not really up to the companies. In this day and age, everyone is a target for ransomware, so every company with common sense holds insurance against a ransomware attack. One of the requirements of the insurance is that you have to have monitoring software like Crowdstrike installed on all company machines. The company I work for fortunately doesn't use Crowdstrike, but we use something similar called SentinelOne. It's very difficult to remove, and it's a fireable offense if you manage to.


No doubt mandated so that the NSA can have a backdoor to everything just by having a deal with each one of those providers.

I think there's a Ben Franklin quote that applies here. "Those who would give up essential liberty, to purchase a little temporary safety, deserve neither liberty nor safety."


Just remember that the liberty in question was the government's power to tax people for military spending.

Also, security monitoring is a well-established field that has actually produced a lot of results in preventing various attacks.


Highly likely, yes.


Yup, it's also a requirement for compliance with security standards like NIST.


What NIST requirement is that?


It is kinda implied throughout SP 800-171r3 that EDRs will make meeting the requirements easier, although they are only specifically mentioned in section 03.04.06


Most corporate places I've encountered over the last N years mandate one kind of antivirus/spyware combo or another on every corporate computer. So it'd be pretty much every major workplace.


Just because everyone does it doesn't mean it's not a dumb idea. Everyone eats sugar.

If the average corporation hates/mistrusts its employees enough to add a single point of failure to its entire business and let a 3rd party have full access to its systems, then well, they reap what they sow.


I think you have to look beyond the company. In my experience, even the people implementing these tools hate them and rarely have some evil desire to spy on their employees and slow down their laptops. But without them as part of the IT suite, the company can't tick the EDR or AV box, pass a certain certification, land a certain type of customer, etc. It is certainly an unfortunate cycle.


This goes way higher than the average corporation.

This is companies trying desperately to deliver value to their customer at a profit while also maintaining SOC 2, GDPR, PCI, HIPAA, etc. compliance.

If you're not a cybersecurity company, a company like CrowdStrike saying: 'hey, pay us a monthly fee and we'll ensure you're 100% compliant _and_ protected' sounds like a dream come true. Until today, it probably was! Hell, even after today, when the dust settles, still probably worth it.


Sounds like the all too common dynamic of centralized top-down government/corporate "security" mandates destroying distributed real security. See also TSA making me splay my laptops out into a bunch of plastic bins while showing everyone where and how I was wearing a money belt. (I haven't flown for quite some time, I'm sure it's much worse now)

There's a highly problematic underlying dynamic where 364 days out of the year, when you talk about the dangers of centralized control and proprietary software, you get flat out ignored as being overly paranoid and even weird (don't you know that "normal" people have zero ability or agency when it comes to anything involving computers?!). Then something like this happens and we get a day or two to say "I told you so". After which the managerial class goes right back to pushing ever-more centralized control. Gotta check off those bullet point action items.


They fixed that. Now you can fly without taking your laptop out, or taking your shoes and belt off. You just have to give them fingerprints, a facial scan and an in-person interview. They give you a little card. It's nifty.


And the most important part: pay them an annual subscription fee.

Sincerely,

PreCheck Premium with Clear Plus Extra customer


there's nothing socially repressive about having airline travel segregated into classes of passengers at all, nope, this is completely normal /s

I go through the regular TSA line out of solidarity and protest. Fuck the security theater.


My response was intended as sarcasm. But eventually, I don't think it will be a two-tiered system. You simply won't be allowed to fly without what is currently required for precheck.

And fwiw, I don't think the strong argument against precheck has to do with social class... it's not terribly expensive, and anyone can do it. It's just a further invasion of privacy.


Precheck is super cheap, it's like less than $100 once per 5 years. Yes, it is an invasion of privacy, but I suspect the government already has all that data anyway many times over.


Totally with you. Pre-check is an ugly band-aid on a flawed system.


> showing everyone where and how I was wearing a money belt

I only fly once every couple years, but I really hated emptying my pockets into those bins. The last time I went through, the agent suggested I put everything in my computer bag. That worked a lot better.


That's what I usually do, except when they ask you to take out all your devices and put them in bins individually.


Last time I flew, in Sweden, the guy was angry at me for having to do his job, so he slipped my passport away from the tray so that I'd lose it. Luckily for me, I saw him doing it.


I see this was downvoted by some Swede who doesn't think this stuff can happen in Sweden.

It can. It doesn't happen to you because you're white, blonde, and can pass through a scanner without triggering it.


At my work, in the past year or two, they rolled out Zscaler onto all of our machines, which I think is supposed to do a similar thing. All it's done is cause us regular network issues.

I wonder if they also have the capability to brick all our Windows machines like this.


Zscaler is awful. It installs a root cert to act as a man-in-the-middle traffic snooper. It probably does some other stuff, but all your TLS traffic is snooped with Zscaler. It is creepy software, IMO.


> installs a root cert

Wow, I didn't know that, but you're right. It even works in Brave, which I wouldn't have expected:

    % openssl x509 -text -noout -in news.ycombinator.com.pem 
    Certificate:
        Data:
            Version: 3 (0x2)
            Serial Number:
                6f:9e:b3:95:05:50:6e:4d:03:d6:0b:a9:81:8c:2f:c3
        Signature Algorithm: sha256WithRSAEncryption
            Issuer: C=US, ST=California, O=Zscaler Inc., OU=Zscaler Inc., CN=Zscaler Intermediate Root CA (zscalertwo.net) (t) 
            Validity
                Not Before: Jul 13 03:45:27 2024 GMT
                Not After : Jul 27 03:45:27 2024 GMT
            Subject: C=US, ST=California, L=Mountain View, O=Y Combinator Management, LLC., CN=news.ycombinator.com
            Subject Public Key Info:
                Public Key Algorithm: rsaEncryption
                    RSA Public-Key: (2048 bit)
It seems to hijack the browser somehow, though, because that doesn't happen from the command line:

    % openssl s_client -host news.ycombinator.com -port 443
    CONNECTED(00000005)
    depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root G2
    verify return:1
    depth=1 C = US, O = DigiCert Inc, CN = DigiCert Global G2 TLS RSA SHA256 2020 CA1
    verify return:1
    depth=0 C = US, ST = California, L = Mountain View, O = "Y Combinator Management, LLC.", CN = news.ycombinator.com
    verify return:1
    write W BLOCK
    ---
    Certificate chain
     0 s:/C=US/ST=California/L=Mountain View/O=Y Combinator Management, LLC./CN=news.ycombinator.com
       i:/C=US/O=DigiCert Inc/CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
     1 s:/C=US/O=DigiCert Inc/CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
       i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Global Root G2


Ah, yeah, they gave us zscaler not too long ago. I wondered if it was logging my keystrokes or not, figured it probably was because my computer slowed _way_ down ever since it appeared.


Zscaler sounds like it would be a web server. Just looked it up: "zero trust leader". The descriptiveness of terms these days... if you say it gets installed on a system, how is that having zero trust in them? And what do they do with all this nontrust? Meanwhile, Wikipedia says they offer "cloud services", which is possibly even more confusing for what you describe as client software


Somebody upthread pointed out that it installs a root CA and forces all of your HTTPS connections to use it. I verified that he's correct - I'm on Hacker News right now with an SSL connection that's verified by "ZScaler Root CA", not Digicert.


ZScaler has various deployment layouts. Instead of the client side TLS endpoint, you can also opt for the "route all web traffic to ZScaler cloud network" which office admins love because less stuff to install on the clients. The wonderful side effect is that some of these ZScaler IPs are banned from reddit, Twitter, etc, effectively banning half the company.


Zero trust means that there is no implicit trust whether you’re accessing the system from an internal protected network or from remote. All access to be authenticated to the fullest. In theory you should be doing 2FA every time you log in for the strictest definition of zero trust.


zero trust means absolutely nothing. Just a term void of any meaning.


There is a NIST paper on it. It's a requirement for government systems after they suffered major breaches.

https://www.nist.gov/publications/zero-trust-architecture


Now check how many zero trust companies have offerings that remotely compare to that.


It’s a tool to “zero trust” your employees


They are a SASE provider; I assume they offer a BeyondCorp-style product allowing companies to move their apps off a private VPN and allow access over the public internet. They probably have a white paper on how they satisfy zero trust architecture.


I certainly would have zero trust in a system that man in the middles all my traffic


See the recent waves of ransomware encrypting drives, and similar attacks. Those cause real costs as well, and this outage can be blamed on CrowdStrike without losing face. If you are in the news for phished data, or have an outage because all your data got encrypted, blaming somebody else is hard.


Once you get legal involved the employee becomes the liability, not the asset.


Well it’s not aimed at IT people and programmers (though the policies still apply to them), it’s aimed at everyone else who doesn’t understand what a phishing email looks like.


Do you think that IT and programmers are immune to these attacks?


Nope, but they're more trustworthy around that stuff than most.


This is the whole point of 1984. It's not some overbearing government entity that surveils their citizens; the citizens bring it on themselves.

They willingly relinquish their right to privacy in service of protection against a potential threat, or the appearance of one.


[flagged]


These comments make me think that both you and the commenter you replied to have never read 1984.

It's anti-totalitarian propaganda. There is, IIRC, not much about how Airstrip One came to be; it's kind of always been there, because the state controls history. People did not ask for the telescreens; they accept them.

The system in the book is so strongly based on heavy-handed coercion and manipulation that I actually find it psychologically implausible (though, North Korea...). The strength of the book, I would say, is not its plausibility, but the intensity of the nightmare and the quality of the prose that describes it.


So there's the control freak at the top who made this decision, and then there are the front lines who are feverishly booting into safe mode and removing the update, and then there are the people who can't get the data they need to safely perform surgeries.

So yeah, screw 'em. But let's be specific about it.


I think the question this raises is why critical systems like that have unrestricted 3rd party access and are open to being bricked remotely. And furthermore, why safety critical gear has literally zero backup options to use in case of an e.g. EMP, power loss, or any other disruption. If you are in charge of a system where it crashing means that people will die, you are a complete moron to not provide multiple alternatives in such a case and should be held criminally liable for your negligence.


Agreed on all points, but if we're going to start expecting people to do that kind of diligence, re: fail-safes and such (and we should), then we're going to have to stop stretching people as thin as we tend to, and we're going to have to give them more autonomy than we tend to.

Like the kind of autonomy that lets them uninstall Crowdstrike. Because how can you be responsible for a system which could start running different code at any time?


What I don't get is why nobody questions how an OS that needs all this third-party shit to function and be compliant gets into critical paths in the first place?


I think various auditing standards are forcing it, regardless of whether you think it's a good thing or not.


I'd think you'd want some sort of controls/detection on infrastructure-level machines.

The above comment is very naive.


This kind of thing is required by FedRAMP. Good luck finding a company without endpoint management software that is legally allowed to be a US government vendor.

If you stick to small privately held companies you might be able to avoid endpoint management, but that's it... any big brand you can think of is going to be running this or something similar on their machines -- because they're required to.


Paranoid? Phishing is very successful.


If my job didn't include clicking random links I get via email all the time, I'd be much more successful in not clicking random links I get via email.


> Is this just some virus scanning software?

Essentially, yes. It is fancy endpoint protection.


The thing people are paying for is regulatory compliance. The actual product is anti-virus software.


Presumably endpoint detection & response (EDR) agents need to do things like dynamically fetch new malware signatures at runtime, which is understandable. But you'd think that would be treated as new "content", something they're designed to handle in day-to-day operation, hence very low risk.

That's totally different to deploying new "code", i.e. new versions of the agent itself. You'd expect that to be treated as a software update like any other, so their customers can control the roll out as part of their own change management processes, with separate environments, extensive testing, staggered deployments, etc.

I wonder if such a content vs. code distinction exists? Or has EDR software gotten so complex (e.g. with malware sandboxing) that such a distinction can't easily be made any more?

In any case, vendors shouldn't be able to push out software updates that circumvent everyone's change management processes! Looking forward to the postmortem.


My guess is it probably was a content update that tickled some lesser-trodden path in the parser/loader code, or created a race condition in the code which led to the BSOD.

Even if it’s ‘just’ a content update, it probably should follow the rules of a code update (canaries, pre-release channels, staged rollouts, etc).


CrowdStrike is an endpoint detection and response (EDR) system. It is deeply integrated into the operating system. This type of security software is very common on company-owned computers, and often have essentially root privileges.


Well, actually more than root. Even for an administrator user on Windows, it's pretty hard to mess with things and get a BSOD. CrowdStrike ships these files as drivers (as indicated by the .sys file extension), which run in kernel mode.


Companies operate on a high level of fear and trust. This is the security vendor, so in theory they want those updates rolled out as quickly as possible so that they don't get hacked. Heh.


These updates happen automatically and, as far as I can tell, there is no option to turn this feature off. From a security perspective, the vendor will always want you to be on the most recent software, to protect against attack holes that may open up by operating on an older version. Your IT department will likely want this as well, to avoid culpability. Just my 2 observations; whether it is the right way, or whether CS is effective at what it does, no idea.


I mean, they pay a lot of money to crowdstrike. A failure this widespread is a Crowdstrike dev issue.


It's a Mossad/CIA sponsored spyware agent.


Source: bro trust me


Crowdstrike did this to our production linux fleet back on April 19th, and I've been dying to rant about it.

The short version was: we're a civic tech lab, so we have a bunch of different production websites made at different times on different infrastructure. We run Crowdstrike provided by our enterprise. Crowdstrike pushed an update on a Friday evening that was incompatible with up-to-date Debian stable. So we patched Debian as usual, everything was fine for a week, and then all of our servers across multiple websites and cloud hosts simultaneously hard crashed and refused to boot.

When we connected one of the disks to a new machine and checked the logs, Crowdstrike looked like the culprit, so we manually deleted it and the machine booted; we tried reinstalling it and the machine immediately crashed again. OK, let's file a support ticket and get an engineer on the line.

Crowdstrike took a day to respond, and then asked for a bunch more proof (beyond the above) that it was their fault. They acknowledged the bug a day later, and weeks later had a root cause analysis saying they hadn't covered our scenario (Debian stable running version n-1, I think, which is a supported configuration) in their test matrix. In our own post mortem there was no real way to prevent the same thing from happening again -- "we push software to your machines any time we want, whether or not it's urgent, without testing it" seems to be core to the model, particularly if you're a small IT part of a large enterprise. What they're selling to the enterprise is exactly that they'll do that.


Oh, if you are also running Crowdstrike on linux, here are some things we identified that you _can_ do:

- Make sure you're running in user mode (eBPF) instead of kernel mode (kernel module), since it has less ability to crash the kernel. This became the default in the latest versions and they say it now offers equivalent protection.

- If your enterprise allows, you can have a test fleet running version n and the main fleet run n-1.

- Make sure you know in advance who to cc on a support ticket so Crowdstrike pays attention.

I know some of this sounds obvious, but it's easy to screw up organizationally when EDR software is used by centralized CISOs to try to manage distributed enterprise risk -- like, how do you detect intrusions early in a big organization with lots of people running servers for lots of reasons? There's real reasons Crowdstrike is appealing in that situation. But if you're the sysadmin getting "make sure to run this thing on your 10 boxes out of our 10,000" or whatever, then you're the one who cares about uptime and you need to advocate a bit.
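For reference, switching the sensor backend is a one-liner; a minimal sketch below. The kernel-mode command is the one CrowdStrike's own KB (quoted further down this thread) gives; the "bpf" value for user mode is my assumption, so check the falconctl docs for your sensor version:

    # switch the Falcon sensor to the eBPF ("user mode") backend -- value assumed, verify locally
    sudo /opt/CrowdStrike/falconctl -s --backend=bpf

    # or revert to the kernel-module backend (command as given in CrowdStrike's KB)
    sudo /opt/CrowdStrike/falconctl -s --backend=kernel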


Just a nit, I don't think it's correct to call eBPF "user mode". It's just a different, much more sandboxed, way of running kernel-mode code.


If you can crash Linux with an eBPF program, many more asses will have fires lit under them than just this one vendor.


heh.. Linus would have a fit :-D


I would wager that even most software developers who understand the difference between kernel and user mode aren't going to be aware there is a "third" address space, which is essentially a highly-restricted and verified byte code virtual machine that runs with limited read-only access to kernel memory


Not that it changes your point, and I could be wrong, but I'm pretty sure eBPF bytecode is typically compiled to native code by the kernel and runs in kernel mode with full privileges. Its safety properties entirely depend on the verifier not having bugs.
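For anyone who hasn't played with it, here's a minimal sketch of that model using bpftrace (assuming it's installed): the one-liner is compiled to eBPF bytecode, run through the in-kernel verifier, and only attached if it passes, so a bad program is rejected at load time rather than crashing the kernel -- verifier bugs notwithstanding.

    # trace every execve() syscall and print the name of the calling process
    sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> execve\n", comm); }'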


All code is native code eventually (although there are experimental CPUs that can execute Java bytecode directly, e.g. https://en.wikipedia.org/wiki/Java_processor).


No, lots of VMs don't have any JIT and just interpret bytecode with a loop around a big switch statement (e.g. Python before 3.13).


fwiw there's like a billion devices out there with cpus that can run java byte code directly - it's hardly experimental. for example, Jazelle for ARM was very widely deployed


Listed in that wiki, along with the much older Picojava


It's what Crowdstrike calls it. To run the Falcon sensor as eBPF, you need to set it up as "user mode", which, I agree with you, is poorly named.


We could call it, I don't know, "Protected Mode"?


It'll never catch on.


Hear me out here: Maybe if we split the address space into various use-specific segments...


Call it Ring 0, 1, 2... for good measure


Let's start with 0 because there won't be anything less than 0. Same when using letters, start with A because things will never get better than A


Is this a good moment to relitigate the Tanenbaum-Torvalds debate?


Depending on what kernel I'm running, CrowdStrike Falcon's eBPF will fail to compile and execute, then fail to fall back to their janky kernel driver, then inform IT that I'm out of compliance. Even LTS kernels in their support matrix sometimes do this to me. I'm thoroughly unimpressed with their code quality.


I'm suspicious that turning it off entirely would also provide protection equivalent to kernel and user space mode. If not more.


Do you work for them? If not, what drove you to do free tech support? ... Honestly, this looks a little insane.


JackC mentioned in the parent comment that they work for a civic tech lab, and their profile suggests they’re affiliated with a high-profile academic institution. It’s not my place to link directly, but a quick Google suggests they do some very cool, very pro-social work, the kind of largely thankless work that people don’t get into for the money.

Perhaps such organizations attract civic-minded people who, after struggling to figure out how to make the product work in their own ecosystem, generously offer high-level advice to their peers who might be similarly struggling.

It feels a little mean-spirited to characterize that well-meaning act of offering advice as “insane.”


Civic-minded people would point out how to fully understand the basis of your mission-critical system, not help a crack addict choose a better toothpaste.


This is gold. My friend and I were joking around that they probably did this to macOS and Linux before, but nobody gave a shit since it's... macOS and Linux.

(re: people blaming it on windows and macos/linux people being happy they have macos/linux)


I don’t think people are saying that causing a boot loop is impossible on Linux, anyone who knows anything about the Linux kernel knows that it’s very possible.

Rather it’s that on Linux using such an invasive antiviral technique in Ring 0 is not necessary.

On Mac I’m fairly sure it is impossible for a third party to cause such a boot loop due to SIP and the deprecation of kexts.


I believe Apple prevented this also for this exact reason. Third-parties cannot compromise the stability of the core system, since extensions can run only in user-space.


I might be wrong about it, but I feel that malware with root access can wreak quite a lot of havoc. Imagine that this malware decides to forbid the launch of every executable and every network connection, because their junior developer messed up `==` and `===`. It won't cause a kernel crash, but it will probably render the system equally unusable.


Root access is a separate issue, but user space access to sys level functions is something Apple has been slowly (or quickly on the IOS platform, where they are trying to stop apps snooping on each other) clamping down on for years.


On both macOS and Linux, there's an increasingly limited set of things you can do from root. (but yeah, malware with root is definitely bad, and the root->kernel attack surface is large)


Malware can do tons of damage even with only regular user access, e.g. ransomware. That’s a different problem from preventing legitimate software from causing damage accidentally.

To completely neuter malware you need sandboxing, but this tends to annoy users because it prevents too much legitimate software. You can set up Mac OS to only run sandboxed software, but nobody does because it’s a terrible experience. Better to buy an iPad.


> but nobody does because it’s a terrible experience

To be fair, all apps from the App Store are sandboxed, including on macOS. Some apps that want/need extra stuff are not sandboxed, but still use Gatekeeper and play nice with SIP and such.

FWIW, according to Activity Monitor, somewhere around 2/3 to 3/4 of the processes currently running on my Mac are sandboxed.

Terrible dev experience or not, it's pretty widely used.


It depends on your setup. If you actually put in the effort to get apparmor or selinux set up, then root is meaningless. There have been so many privilege escalation exploits that simply got blocked by selinux that you should worry more about setting selinux up than some hypothetical exploit.


It's not unnecessary, it's harder (no stable kernel ABI, and servers won't touch DKMS with a ten foot pole).

On the other hand you might say that lack of stable kernel ABI is what begot ebpf, and that Microsoft is paying for the legacy of allowing whatever (from random drivers to font rendering) to run in kernel mode.


I’ve had an issue with it before in my work MacBook. It would just keep causing the system to hang, making the computer unusable. Had to get IT to remove it.


I’ve had sporadic kernel panics with the macos version.


Interesting that they push updates on a Friday, when support coverage will be way different across companies and organizations.


It makes you wonder if there was some critical vulnerability that forced them to deploy to everyone simultaneously at an awkward time.


AI probably thought there was a critical vuln.


CrowdStrike is the critical vulnerability.


> we push software to your machines any time we want, whether or not it's urgent, without testing it

Do they allow you to control updates? It sounds like what you want is for a small subset of your machines using the latest, while the rest wait for stability to be proven.


This is what happened to us. We had a small fraction of the fleet upgraded at the same time and they all crashed. We found the cause and set a flag to not install CS on servers with the latest kernel version until they fixed it.


It's worth noting that this feature is available, but the recent Windows issue [reportedly] ignored it.


I wonder if the changes they put in behind the scenes for your incident on Linux saved Linux systems in this situation and no one thought to see if Windows was also at risk.


I think a big thing is probably that few Linux systems are centrally upgraded.


> we're a civic tech lab

Obviously not the point of your post, but say more? This sounds like it could be pretty cool!


You should send this to every tech reporter you like.


So in a nutshell, it is about corporations pushing for legislation which compels usage of their questionable products, because such products enable management to claim compliance when things go wrong, even when the things that go wrong are the compliance-ensuring products themselves.

Sounds like a good gig to me.


Please tell me you've ended the contract with CrowdStrike after this?


Interesting. How was the faulty upgrade distributed? Not from Debian archives I assume.


CrowdStrike Falcon may ship as a native package, but after that it completely self-updates to whatever they think you should be running. Often, I have to ask IT to ask CS to revert my version because the "current" one doesn't work on my up-to-date kernel/glibc/etc. The quality of code that they ship is pretty appalling.


Thanks for confirming. Is there any valid reason these updates couldn't be distributed through proper package repositories, ideally open repositories (especially data files which can't be copyrightable anyway)?

How does Wazuh do it, for example in the [AUR packaged version](https://aur.archlinux.org/packages/wazuh-agent)?


Being able to update on a whim is a feature, not a bug, of CrowdStrike (according to them, you may disagree).


But in Debian that can be achieved with properly configured unattended-upgrades as well.
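Roughly what that looks like, as a minimal sketch of the stock Debian config files (exact origins/patterns depend on your release and on what you want auto-applied):

    // /etc/apt/apt.conf.d/20auto-upgrades
    APT::Periodic::Update-Package-Lists "1";
    APT::Periodic::Unattended-Upgrade "1";

    // /etc/apt/apt.conf.d/50unattended-upgrades
    Unattended-Upgrade::Origins-Pattern {
        "origin=Debian,codename=${distro_codename},label=Debian-Security";
    };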


Yes, but that puts a lot of complexity on the end user, and you end up with:

1. A software vendor that is unhappy about the speed they can ship new features at

2. Users that are unhappy the software vendor isn't doing more to reduce their maintenance burden, especially when they have a mixture of OS, distros and complex internal IT structures

IMO default package managers have failed on both Linux and Windows to provide a good solution for remote updates, so everyone re-invents the wheel with custom mini package managers plus dedicated update systems.


> we push software to your machines any time we want, whether or not it's urgent, without testing it

... Just, no... HOW does a vendor get away with that? No rolling releases allowed? No hostclasses or placement groups? No local cache or proxy?

The more I've learned over the past day or so, the more crowdstrike keeps edging towards the malware side of the malware/anti-malware spectrum for me


Every real malware programmer would be damn proud of this blast radius.


This seems to be misinformation? The CrowdStrike KB says this was due to a Linux kernel bug.

---

Linux Sensor operating in user mode will be blocked from loading on specific 6.x kernel versions

Published Date: Apr 11, 2024

Symptoms

In order to not trigger a kernel bug, the Linux Sensor operating in user mode will be prevented from loading on specific 6.x kernel versions with 7.11 and later sensor versions.

Applies To

Linux sensor 7.11 in user mode will be prevented from loading:

For Ubuntu/Debian kernel versions:

6.5 or 6.6

For all distributions except Ubuntu/Debian, kernel versions:

6.5 to 6.5.12

6.6 to 6.6.2

Linux sensor 7.13 in user mode will be prevented from loading:

For Ubuntu 22.04 running the 6.5 kernel:

6.5.0 - 6.5.0-1014-aws
6.5.0 - 6.5.0-1015-azure
6.5.0 - 6.5.0-1014-gcp
6.5.0 - 6.5.0-24-generic
6.5.0 - 6.5.0-1015-oem

Ubuntu kernel version:

6.6 to 6.6.2

For Debian kernel version:

6.5 to 6.5.12

6.6 to 6.6.2

For all distributions except Ubuntu/Debian, kernel versions:

6.5 to 6.5.12

6.6 to 6.6.2

Linux Sensors running in kernel mode are not affected.

Resolution

CrowdStrike Engineering identified a bug in the Linux kernel BPF verifier, resulting in unexpected operation or instability of the Linux environment.

In detail, as part of its tasks, the verifier backtracks BPF instructions from subprograms to each program loaded by a user-space application, like the sensor. In the bugged kernel versions, this mechanism could lead to an out-of-bounds array access in the verifier code, causing a kernel oops.

This issue affects a specific range of Linux kernel versions, that CrowdStrike Engineering identified through detailed analysis of the kernel commits log. It is possible for this issue to affect other kernels if the distribution vendor chooses to utilize the problem commit.

The commit where the kernel bug was introduced is seen at https://github.com/torvalds/linux/commit/fde2a3882bd07876c14... and the commit that resolves the issue is seen at https://github.com/torvalds/linux/commit/4bb7ea946a370707315...

To avoid triggering a bug within the Linux kernel, the sensor is intentionally prevented from running in user mode for the specific distributions and kernel versions shown in the above section

These kernel versions are intentionally blocked to avoid triggering a bug within the Linux kernel. It is not a bug with the Falcon sensor. Sensors running in kernel mode are not affected.

No action required, the sensor will not load into user mode for affected kernel versions and will stay on kernel mode.

For Ubuntu 22.04 the following 6.5 kernels will load in user mode with Falcon Linux Sensor 7.13 and higher:

6.5.0-1015-aws and later
6.5.0-1016-azure and later
6.5.0-1015-gcp and later
6.5.0-25-generic and later
6.5.0-1016-oem and later

If for some reason the sensor needs to be switched back to kernel mode, switch the Linux sensor backend to kernel mode:

    sudo /opt/CrowdStrike/falconctl -s --backend=kernel


At one point overnight airlines were calling for an "international ground stop for all flights globally". Planes in the air were unable to get clearance to land or divert. I don't believe such a thing has ever happened before except in the immediate aftermath of 9/11.


Flights have been delayed/canceled. Not as critical as the OP hospital, but not good.

Affecting airlines, broadcasters, and banks.

https://www.nytimes.com/live/2024/07/19/business/global-tech...


If a plane is in the air, and can't get permission to land anywhere, well, they only have a finite amount of fuel onboard.


A pilot WILL land, even without clearance. They're not going to crash their own plane. Either way, ATC has fallback procedures and can just use radio to communicate and manage everything manually. Get all the planes on the ground in safe order and then wait for a fix before clearing new takeoffs. https://aviation.stackexchange.com/questions/43379/is-there-...


Planes always get landing clearance via radio. "Planes in the air were unable to get clearance to land or divert" strongly suggests that the radios themselves were not working if it's actually true.


And if they can’t get it via radio they are trained to get clearance via visual sight with someone on the ground.


In the case of an emergency (which low fuel most definitely is) the captain has the ultimate authority and can tell ATC "I'm landing anyway".


That might be fine for a single or a few planes. But given the magnitude of the outage, what if a single airport had dozens of planes landing anyway.

There's a very good reason airport traffic control exists.


I'd be highly surprised if ATC systems were affected by this, but if anyone wants to correct me please do.


I wouldn't expect emergency rooms and 911 to stop working either, but here we are, so until someone says otherwise, I'm assuming some ATCs went down too.


I imagine the flight planning software they use was affected (so their ability to coordinate with other airports' ATC), but not their radio systems or aircraft radar (nearly all radar systems I've worked with run on Linux, and are hardened to the Nth degree). I've been out of the game for 12 years though, so things have likely changed.


The Tenerife disaster (second-deadliest aviation incident in history, after 9/11) was ultimately caused by chaotic conditions due to too many airplanes having to be diverted and land at an alternate airport that wasn't equipped to handle them comfortably.


I'd argue that Tenerife was due to taking off (in bad weather), not landing. But of course, a bunch of planes landing at the same airport without ATC sounds quite dangerous.


There were a lot of contributing causes, but it wouldn't have happened if not for the fact that Tenerife North airport was massively overcrowded due to Gran Canaria airport being suddenly closed (for unrelated reasons) and flights forced to divert.

The issue wasn't with landing specifically; I'm just using it as a general example of issues caused by havoc situations in aviation.


Also lack of visibility. The two pilots couldn't see each other through the fog.


Pilots know where there are other places to land, e.g. there are a lot of military strips and private airfields where some craft can land, depending on size.


I would expect this to affect airline services, e.g. check-in and boarding. I would be very surprised if this outage affects ATC systems.


They would use those 4 magic words in aviation: "I'm declaring an emergency"


Obviously, they're going to land planes before they run out of fuel.


Or shortly after


I would also point out that the backup plan (Radio and Binoculars) are not only effective but also extremely cheap & easy to keep ready in the control tower at all times.

The same cannot be said for medical records.


Unbelievable these systems run on Windows.


I feel Windows is wrongly blamed here; it's just an OS.

If you rely on your applications to be available you should have disaster recovery plans for scenarios like this.


A little yes, a little no.

Was this problem caused by Microsoft? No.

Why does this tool exist and must be installed on servers? Well, Windows OS design definitely plays a role here.

Why does this software run in a critical path that can cause the machine to BSOD? This is where the OS is a problem. If it is fragile enough that a bad service like this can cause it to crash in an unfixable state (without manual intervention), that’s on Windows.


> Why does this tool exist and must be installed on servers?

Fads, laziness, and lack of forethought. This tool didn't exist a few years ago. Nobody stopped IT departments worldwide and said "hey, maybe you shouldn't be auto-rolling critical software updates without testing, let alone doing this via a third-party tool with dubious checks."

This could have happened on any OS. Auto deployment is the root problem.


In this very thread there was a report of a Debian Linux fleet being kernel-crashed in exactly the same scenario by exactly the same malware a few months ago.

So the only blame Windows can take is its widespread usage, compared to Debian.


There's an eBPF mode for Linux which is safe(r),

so Windows can still be blamed for not providing a relatively safe way of doing this.


https://access.redhat.com/solutions/7068083

Kernel panic observed after booting 5.14.0-427.13.1.el9_4.x86_64 by falcon-sensor process.

eBPF program causes kernel panic on kernels 5.14.0-410+ .

Apparently not safe enough for CrowdStrike.


Windows supports eBPF too.


Why the whataboutism?

Yes, the Linux device driver has many of the same issues (monolithic drivers running in kernel space/memory). I’m not sure what the mitigations were in that case, but I’d be interested to know.

But we both know this isn’t the only model (and have commented as such in the thread). MacOS has been moving away from this risk for years, largely to the annoyance of these enterprise security companies. The vendor that was used by an old employer blamed Apple on their own inability to migrate their buggy EDM program to the new version of macOS. So much so that our company refused to upgrade for over 6 months and then it was begrudgingly allowed.


A tool that has full control of the OS (which is apparently required by such security software) fundamentally must have a way to crash the system, and continue to do so at every restart.


> Was this problem caused by Microsoft? No.

This really should be a hell no. Perhaps Microsoft's greatest claim to fame is their enduring ability to quickly and decisively react to security breaches with updates. Their process is extremely public and hasn't significantly changed in decades.

If your company can't work with Microsoft's process, your company is the problem. Every other software company in the last forty years has figured it out.


Windows is also the platform where this sort of spyware has been normalized for decades as best practice.


There are similar products for Linux. My organization runs Cortex XDR, it has a kernel module.

Had a few calls with them to figure out its features and how it would impact the systems. They didn't even know.


Fair enough.


I don't blame Windows, but do blame these systems for running Windows, if that makes sense.

I imagined a lot of this ran on some custom or more obscure and hardened specialty system. One that would generally negate the need for antiviruses and such. (and obviously, no, not off the shelf Linux/BSD either)


Their disaster recovery plans include the same faults that brought them down this time, guaranteed.


Do you expect them to run on Android?


Legit question, not trolling. Android is the next biggest OS used to run a single application like POS, meter readers, digital menus, navigation systems. It might be the top one by now. It's prone to all the same 'spyware' drawbacks and easier to set up than "Linux".


It would be better than Windows for sure. You’ve got A/B updates, verified boot, selinux, properly sandboxed apps and a whole range of other isolation techniques.

For something truly mission critical, I’d expect something more bespoke with smaller complexity surface. Otherwise Android is actually not a bad choice.


Any sort of Immutable OS would be better for critical systems like this. The ability to literally just rollback the entire state of the system to before the update would have gotten things back online as fast as a reboot...
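As a sketch of what that recovery could look like on an image-based distro (rpm-ostree here is just an example for illustration; it's an assumption, not what any of the affected fleets actually run):

    # deployments are immutable images; flip back to the previous one and reboot
    sudo rpm-ostree rollback
    sudo systemctl reboot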


Something like Android Lollipop from 2014 supports all the latest techniques. It's likely there are no security issues left on Lollipop by now.

A lot of the new forced updates on Android are there to prevent some apps from being used to spy on other apps, steal passwords, exploit notification backdoors, etc., but you don't need that if it's just a car radio.


Around the same time the news showed up here, the WeChat TikTok clone (Moments, I think, in English) was showing animations of the USA air traffic maps and how the tech blackout affected them. From those images I could tell it was huge.


We are a major CS client, with 50k windows-based endpoints or so. All down.

There exists a workaround but CS does not make it clear whether this means running without protection or not. (The workaround does get the windows boxes unstuck from the boot loop, but they do appear offline in the CS host management console - which of course may have many reasons).


Does CS actually offer any real protection? I always thought it was just feel-good software, that Windows had caught up to separating permissions since after XP or so. Either one is lying/scamming, but which one?


> Does CS actually offer any real protection? I always thought it was just feel-good software, that Windows had caught up to separating permissions since after XP or so. Either one is lying/scamming, but which one?

Our ZScaler rep (basically, they technically work for us) comes out with massive, impressive-looking numbers for the thousands of threats they detect and eliminate every month.

Oddly, before we had Zscaler we didn't seem to have any actual problems. Now we have it, and while we have lots of Zscaler-caused problems around performance and location, we still don't have any actual problems.

Feels very much like a tiger-repelling rock. But I'm sure the corporate hospitality is fun.


AFAIK, most of the people I know that deploy CrowdStrike (including us) just do it to check a box for audits and certifications. They don't care much about protections and will happily add exceptions on places where it gives problems (and that's a lot of places)


What a dream business.


It's not about checking the boxes themselves, but the shifting of liability that enables. Those security companies are paid well not for actually providing security, but for providing a way to say, "we're not at fault, we adhered to the best security practices, there's nothing we could've done to prevent the problem".


So in essence just a flavor of insurance?

Shouldn't that hit Crowdstrike's stock price much more than it has then? (so far I see ~11% down which is definitely a lot but it looks like they will survive).


Not quite. Insurance is a product that provides compensation in the event of loss. Deploying CrowdStrike with an eye toward enterprise risk management falls under one of either changing behaviors or modifying outcomes (or perhaps both).


They are paying with reputation and liability, no?

If the idea is "we'll hire Crowdstrike for CYA" when things like this happen, the blame is on CS and they pay with their reputation.


Pay for what exactly though? Cybersecurity incidents result in material loss, and someone somewhere needs to provide dollars for the accrued costs. Reputation can't do that, particularly when legal liability (or, hell, culpability) is involved.

EDR deployment is an outcome-modifying measure, usually required as underwritten in a cybersecurity insurance policy for it to be in force. It isn't itself insurance.


Not at all like insurance, because they don’t have to pay out at all when things go wrong.


In a way it's the new "nobody ever got fired for buying IBM".


You're so right. Pay us ludicrous sums to make your auditors feel good. Crazy.


So much of regulation is just a round-about to creating business for regulatory compliance.


Good for profits, but I bet there are some employees who feel a distinct lack of joy about their work.


Just adding my two cents: I work as a pentester and arguably all of my colleagues agree that engagements where Crowdstrike is deployed are the worst because it's impossible to bypass.


It definitely isn't impossible to bypass. It gets bypassed all the time, even publicly. There's like 80 different CrowdStrike bypass tricks that have been published at some point. It's hard to bypass and it takes skill, and yes it's the best EDR, but it's not the best solution - the best solution is an architecture where bypassing the EDR doesn't mean you get to own the network.

An attacker that's using a 0 day to get into a privileged section in a properly set up network is not going to be stopped by CrowdStrike.


Or you're a pentester playing 4D chess with your comment.

Or a CS salesperson playing 3D chess with your comment.

If so, well played!


By “impossible to bypass” are you meaning that it provides good security? Or that it makes pen testing harder because you need to be able to temporarily bypass it in order to do your test?


The first. AV evasion is a whole discipline in itself and it can be anything from trivial to borderline impossible. Crowdstrike definitely plays in the champions league.


[flagged]


I don't appreciate your aggressive tone. Which AV is better in your opinion? Are there a lot?


You should be asking how to get through, not what competitor is better. You totally sound like a marketing rep now.


That's not what the discussion was about. If you don't think crowdstrike qualifies as one of the best, justify your opinion.


Best not to respond to trolls.


How can one get through? I'm sure that knowledge costs gold, though.


valuable 2 cents

Are there any writeups from the pentesting side of things that we can read to learn more?


I’ll say this: I did a small lab in college for a hardware security class and I got a scary email from IT because CrowdStrike noticed there was some program using speculative execution/cache invalidation to leak data on my account - they recognized my small scale example leaking a couple of bytes. Pretty impressive to be honest.


Did you have CrowdStrike installed on your personal machine, or did they detect it over the network somehow?


We ran our code on our own accounts on the school’s system.


Those able to write and use FUD (fully undetectable) malware do not create public documentation. Crowdstrike is not impossible to bypass, but a junior security journeyman known as a pentester, working for corporate interests with no budget and absurdly limited scopes under contract for n hours a week for 3 weeks, will never be able to do something like EDR evasion. However, if you wish to actually learn the basics of this art, go study the OffSec evasion class. Then go read a lot of code and syscall documentation and learn assembly.


I don't understand why you were downvoted. I'm interested in what you said. When you mentioned offsec evasion class, is this what you mean? It seems pretty advanced.

https://www.offsec.com/courses/pen-300/

What kind of code should I read? Actually, let me ask this, what kind of code should I write first before diving into this kind of evasion technique? I feel I need to write some small Windows system software like duplicating Process Explorer, to get familiar with Win32 programming and Windows system programming, but I could be wrong?

I think I do have a study path, but it's full of gaps. I work as a data engineer -- the kind that I wouldn't even bother calling an engineer /s


I know quite a few offensive security pros that are way better than I will ever be at breaking into systems and evading detections that can only barely program anything beyond simple python scripts.

It’s a great goal to eventually learn everything, but knowing the correct tools and techniques and how and when to use them most effectively are very different skillsets from discovering new vulnerabilities or writing new exploit code and you can start at any of them.

Compare for instance a physiologist, a gymnastics coach, and an Olympic gymnast. They all “know how the human body works” but in very different ways and who you’d go to for expertise depends on the context.

Similarly just start with whatever part you are most interested in. If you want to know the techniques and tools you can web search and find lots of details.

If you want to know how best to use them you should set up vulnerable machines (or find a relevant CTF) and practice. If you want to understand how they were discovered and how people find new ones you should read writeups from places like Project Zero that do that kind of research. If you’re interested in writing your own then yes you probably need to learn some system programming. If you enjoy the field you can expand your knowledge base.


EDR vendors are generally lying unless all they tell you is "install us if you want to pass certification".


My contacts abroad are saying "that software US government mandated us to install on our clients and servers to do business with US companies is crashing our machines".

When did Crowdstrike get this gold standard super seal of approval? What could they be referring to?


[flagged]


> if it's not 100% secure

It's 100% broken though...

I guarantee you that the damage caused by Crowdstrike today will significantly outweigh any security benefits/savings that using their software might have had over the years.


The benefits include

1) Nice trips to golf courses before contract renewal

2) Nice meals at fancy restaurants before contract renewal

3) Someone for the CTO to blame when something goes wrong


Nah, more like your security/usability/reliability tradeoff needs to be better.


As a redteamer I guarantee you that a Windows endpoint without EDR is caviar for us...


Are there publicly known exploits which allow RCE or data extraction on a default windows installation?


* SMB encryption or signing not enforced

* NTLM/NTLMv1 enabled

* mDNS/llmnr/nbt-ns enabled

* dhcpv6 not controlled

* Privileged account doing plain LDAP (not LDAPS) binds or unencrypted FTP connections

* WPAD not controlled

* lights-out interfaces not segregated from the business network. Bonus points if it's a Supermicro, which discloses the password hash to unauthenticated users as a design feature.

* operational technology not segregated from information technology

* Not a windows bug, but popular on windows: 3rd party services with unquoted exe and uninstall strings, or service executable in a user-writable directory.

I remediate pentests as well as realworld intrusion events and we ALWAYS find one of these as the culprit. An oopsie happening on the public website leading to an intrusion is actually an extreme rarity. It's pretty much always email > standard user > administrator.

I understand not liking EDR or AV but the alternative seems to be just not detecting when this happens. The difference between EDR clients and non-EDR clients is that the non-EDR clients got compromised 2 years ago and only found it today.
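For a couple of the items above, the fixes really are one-liners. A minimal sketch using the commonly documented registry values; in a domain you'd normally push the equivalent via Group Policy rather than reg add:

    REM require SMB signing on the server side
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters" /v RequireSecuritySignature /t REG_DWORD /d 1 /f

    REM disable LLMNR name resolution (the llmnr item above)
    reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows NT\DNSClient" /v EnableMulticast /t REG_DWORD /d 0 /f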


Thanks for the list. I got this job as the network administrator at a community bank 2 years ago, and 9/9 of these were on/enabled/not secured. I've got it down to only 3/9 (dhcpv6, unquoted exe, operational tech not segregated from info tech). I'm asking for free advice, so feel free to ignore me, but of these three unremediated vectors, which do you see as the culprit most often?


dhcpv6 poisoning is really easy to do with metasploit and creates a MITM scenario. It's also easy to fix (dhcpv6guard at the switch, a domain firewall rule, or a 'prefer ipv4' reg key).

unquoted paths are used to make persistence and are just an indicator of some other compromise. There are some very low impact scripts on github that can take care of it

Network segregation, the big thing I see in financial institutions is the cameras. Each one has its own shitty webserver, chances are the vendor is accessing the NVR with teamviewer and just leaving the computer logged in and unlocked, and none of the involved devices will see any kind of update unless they break. Although I've never had a pentester do anything with this I consider the segment to be haunted.
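The "prefer ipv4" key mentioned above, for reference -- this is the Microsoft-documented DisabledComponents value (takes effect after a reboot); a sketch, and not a substitute for DHCPv6 guard at the switch:

    REM prefer IPv4 over IPv6 so a rogue DHCPv6 server can't insert itself into the path
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters" /v DisabledComponents /t REG_DWORD /d 0x20 /f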


None of those things require a kernel module with remote code execution to configure properly.


I believe the question was 'in which ways is windows vulnerable by default', and I answered that.

If customers wanted to configure them properly, they could, but they don't. EDR will let them keep all the garbage they seem to love so dearly. It doesn't just check a box, it takes care of many other boxes too.


At work we have two sets of computers. One gets beamed down by our multi-national overlords, loaded with all kinds of compliance software. The other is managed by local IT and only uses Windows Defender, has some strict group policies applied, BMCs on separate VLANs, etc. Both pass audits, for whatever that's worth.


This is the key question for me: is there a way to get [most of] the security benefits of EDR without giving away the keys to the kingdom.


No. If an EDR relies on userland mechanisms to monitor, these userland mechanisms can easily be removed by the malicious process too.


> It's pretty much always email > standard user > administrator

What does this mean?


Believe it or not, most users don't run around downloading random screensavers or whatever. Instead they receive phishing emails, often from trusted contacts who have recently been compromised, in the same style of message they are used to receiving, and those give the attacker a foothold on the computer. From there, you can use a commonly available insecure legacy protocol or other privilege escalation technique to gain administrative rights on the device.


standard user: why can't I open this pdf? It says Permission Denied

dumb admin: let me try .... boom game over man


It's the attack path.


>> always email > standard user > administrator

Maybe it's the boomers who can't give up Outlook? Otherwise they could've migrated everybody to Google Workspace or some other web alternative.


You don't need exploits to remotely access and run commands on other systems, steal admin passwords, and destroy data. All the tools to do that are built into Windows. A large part of why security teams like EDR is that it gives them the data to detect abuse of built-in tools and automatically intervene.


Not on a fully patched system. 0-days are relatively rare and fixed pretty quickly by Microsoft.


Remember WannaCry? The vuln it used was patched by MS two months prior to the attack. Yet it took the world by storm.


Not sure what you want from me, I simply answered the question. Yes I remember WannaCry.


How is it caviar then?


Not the same poster, but one phase of a typical attack inside a corporate network is lateral movement. You find creds on one system and want to use them to log on to a second system. Often, these creds have administrative privileges on the second system. No vulnerabilities are necessary to perform lateral movement.

Just as an example: you use a mechanism similar to psexec to execute commands on the remote system using the SMB service. If the remote system has a capable EDR, it will shut that down and report the system from which the connection came from to the SOC, perhaps automatically isolate it. If it doesn't, an attacker moves laterally through your entire network with ease in no time until they have domain admin privs.


A key part of breaching a network is having a beacon running on it, communicating out one way or another.

Running beacons with good EDRs is difficult, and has become the most challenging aspect of most red team engagements because of that.

No EDR, everything becomes suddenly super easy.


Anyone who claims CS is nothing but a compliance checkbox has never worked as an actual analyst. Of course it's effective... no, duh, it's worth 50bn for no reason... god, some people are stupid AND loud.


Every company I’ve ever worked at has wound up having to install antivirus software to pass audits. The software only ever caused problems and never caught anything. But hey, we passed the audit so we’re good right?


The real scam is the audit.

Many moons ago, I failed a "security audit" because `/sbin/iptables --append INPUT --in-interface lo --jump ACCEPT`

"This leaves the interface completely unfiltered"

Since then, I've not trusted any security expert until I've personally witnessed their competence.


Long time ago I was working for a web hoster, and had to help customers operating web shops to pass audits required for credit card processing.

Doing so regularly involved allowing additional SSL ciphers we deemed insecure, and undoing other configurations for hardening the system. Arguing about it is pointless -- either you make your system more insecure, or you don't pass the audit. Typically we ended up configuring it in a way that we could easily toggle between those two states: we reverted it back to a secure configuration once the customer got their certificate, and flipped it back to insecure when it was time to reapply for the certification.


This tracks for me. PA-DSS was a pain with ssl and early tls... our auditor was telling us to disable just about everything (and he was right) and the gateways took forever to move to anything that wasn't outdated.

Then our dealerships would just disable the configuration anyway.

It's been better in recent years.


The dreaded exposed loopback interface... I'm an (internal) auditor, and I see huge variations in competence. Not sure what to do about it, since most technical people don't want to be in an auditor role.


The companies I had the displeasure of dealing with were basically run by mindless people with a shell script.


I agree completely. It makes me wonder if other engineering disciplines have this same competency issue.


We did this at one place I used to work at. We had lots of Linux systems. We installed clamAV but kept the service disabled. The audit checkbox said “installed” and it fulfilled the checkbox…


Yes, it offers very real protection. Crowdstrike in particular is the best in the market, speaking from experience and having worked with their competitor's products as well and responded to real world compromises.


How did they fail to test for such a critical bug, then?

It clearly shows a lack of testing.

If they were good initially, the culture and products have probably rotted since.

Not fit to be in the security domain, if so.


I think this is more of a failure on the software development side than the domain specific functionality side.


Hubris. Clearly they have no form of internal testing for updates because this should have been caught immediately.


"best in the market"

I think the evidence shows that no, they aren't.


Go buy the second-best in the market then. Red Team would love you to do that.


Yes, from experience, I can say that CS does offer real protection.


> "50k windows-based endpoints or so. All down."

I'm a dev rather than infra guy, but I'm pretty sure everywhere I've worked which has a large server estate has always done rolling patch updates, i.e. over multiple days (if critical) or multiple weekends (if routine), not blast every single machine everywhere all at once.


If this comment tree: https://news.ycombinator.com/item?id=41003390 is correct, someone at Crowdstrike looked at their documented update staging process, slammed their beer down, and said: "Fuck it, let's test it in production", and just pushed it to everyone.


Which of course begs the question: How were they able to do that? Was there no internal review? What about automated processes?

For an organization it's always the easiest, most convenient answer to blame a single scapegoat, maybe fire them... but if a single bad decision or error from an employee has this kind of impact, there's always a lack of safety nets.


Even if true, the orgs whose machines they are have the responsibility to validate patches.


This is not a patch per se; it was CrowdStrike updating their internal virus-definition database (or whatever it's called).

Such things are usually enabled to auto-update by default, because otherwise you lose a big part of the point (if there is any) of running an antivirus.


Surely there should be at least some staging on update files as well, to avoid the "oops, we accidentally blacklisted explorer.exe" type things (or, indeed, this)?


Companies have staging and test processes, but CS bypassed them and deployed to prod.


If I understand the thread correctly, CS bypassed the organization's staging system


I'm guessing there's a lesson to be learned here.


This feels like an auto-update functionality. For something that's running in kernel space (presumably, if it can BSOD you?) Which is fucking terrifying.


Crowdstrike auto-updates. Please do not spread misinfo.


I think my company has more than 300k+ machines down right now :)

SLAs will be breached anyway


Windows IT admins of the world, now is your time. This is what you've trained for. Everything else has led to this moment. Now, go and save the world!!


"The world will look up and shout, "Save us!" and I'll whisper "No..." -- Rorschach


Or "log a ticket!"


Type in those Bitlocker recovery keys for as long as you can stay awake!


Or rather, go limp and demand to unionize!


Yeah, manual GUI work. Like any good MS product.


Or don't \o/


Probably go buy a mocha and cry in the corner :(


Does it require to physically go to each machine to fix it? Given the huge number of machines affected, it seems to me that if this is the case, this outage could last for days.


The workaround involves booting into Safe mode or Recovery environment, so I'd guess that's a personal visit to most machines unless you've got remote access to the console (e.g. KVM)

The info is apparently behind here: https://supportportal.crowdstrike.com/s/article/Tech-Alert-W...


That's crazy, imagine you have thousands of office PCs that all have to be fixed by hand.


It gets worse if your machines have BitLocker active: lots of typing required. And it gets even worse if the servers that store the BitLocker keys also have BitLocker active and are also held captive by CrowdStrike lol


I've already seen a few posts mentioning people running into worst-case issues like that. I wonder how many organizations are going to not be able to recover some or all of their existing systems.


Presumably at some point they'll be back to a state where they can boot to a network image, but that's going to be well down the pyramid of recovery. This is basically a "rebuild the world from scratch" exercise. I imagine even the out of band management services at e.g. Azure are running Windows and thus Crowdstrike.


Wow, why the fuck is that support article behind a login page.


Our experience so far has been:

• Servers, you have to apply the workaround by hand.

• Desktops, if you reboot and get online, CrowdStrike often picks up the fix before it crashes. You might need a few reboots, but that has worked for a substantial portion of systems. Otherwise, it’ll need a workaround applied by hand.


Even once in a boot loop, it can download the fix and recover?


Why the difference between servers and desktops ?


What happens if you've got remote staff?


The Dildo of Consequences rarely comes lubed, it seems.


This made me actually laugh out loud.


Surely it's not normal practice to allow patches to be rolled out without a staging/testing area on an estate of that size?


This is insane. The company I currently work for provides dinky forms for local cities and such, where the worst thing that could happen is that somebody will have to wait a day to get their license plates, and even we aren't this stupid.

I feel like people should have to go to jail for this level of negligence.


Which makes me think--are we sure this isn't malicious?


Unfortunately, any sufficiently advanced stupidity is indistinguishable from malice.


As strange as it sounds, this just seems way too sophisticated to be malicious.


Maybe someone tried to backdoor Crowdstrike and messed up some shell code? It would fit and at this point we can't rule it out, but there is also no good reason to believe it. I prefer to assume incompetence over maliciousness.


The AI said it was ok to deploy


I blame the Copilot


True for all systems, but AV updates are exempt from such policies. When there is a 0day you want those updates landing everywhere asap.

Things like zscaler, cs, s1 are updating all the time, nearly everywhere they run.


>True for all systems, but AV updates are exempt from such policies. When there is a 0day you want those updates landing everywhere asap.

This is irrational. The risk of waiting a few hours to test in a small environment before deploying a 0-day fix is marginal. If we assume the AV companies already took their time testing, surely most of the world can wait a few more hours on top of that.

Given this incident, it should be clear that the downsides of deploying immediately at global scale outweigh the benefits. The damage this incident caused might even exceed that of all the ransomware attacks combined. How much extra testing to do will depend on the specific organization, but I hope nobody will let CrowdStrike unilaterally impose a standard like this again.


It's incredibly bad practice, but it seems to be industry normal as we learned today.


I wonder if the move to hybrid estates (virtual + on-prem + issued laptops, etc.) is the cause. Having worked only in on-prem, highly secure businesses, I never saw patches rolled out intra-week without a testing cycle on a variety of hardware.

I consider it genuinely insane to allow direct updated from vendors like this on large estates. If you are behind a corporate firewall there is also a limit to the impact of discovered security flaws and thus reduced urgency in their dissemination anyway.


Most IT departments would not be patching all their servers or clients at the same time when Microsoft release updates. This is a pretty well followed standard practice.

For security software updates this is not a standard practice, I'm not even sure if you can configure a canary update group in these products? It is expected any updates are pushed ASAP.

For an issue like this though Crowdstrike should be catching it with their internal testing. It feels like a problem their customers should not have to worry about.


Their announcement (see Reddit for example) says it was a “content deployment” issue which could suggest it’s the AV definitions/whatever rather than the driver itself… so even if you had gradual rollout for drivers, it might not help!


It's definitely the driver itself if it blue screens the kernel. Quite possibly data-sensitive, of course.


https://x.com/brody_n77/status/1814185935476863321 [0]

The driver can't gracefully handle invalid content - so you're kinda both right.

[0] brody_n77 is:

   Director of OverWatch,
   CrowdStrike Inc.


I came to HN hoping to find more technical info on the issue, and with hundreds of comments yours is the first I found with something of interest, so thanks! Too bad there's no way to upvote it to the top.


Looks like a great way to bypass crowd strike if I'm an adversary nation state


Anyone copy the original text? Now getting: > Hmm...this page doesn’t exist. Try searching for something else


I don’t have the exact copy, but it said it was a ‘channel file’ which was broken.


It might have been a long-present bug in the driver, yes, but today's failure was apparently caused by content/data update.


In most appreciations of risk around upgrades in environments I am familiar with, changing config/static data etc. counts as a systemic update and is controlled in the same way.


You would lose a lot of the benefits of a system like crowdstrike if you waited to slowly roll out malware definitions and rules.


Survived this long without such convenience. Anything worth protecting lives behind a firewall anyway.


A bunch of unprotected endpoints all at once on critical systems… what could possibly go wrong?

Hope they roll out a proper fix soon…


They did, around four hours ago.


A proper fix means that a failure like this causes you a headache, it doesn't close all your branches, or ground your planes, or stop operations in hospitals, or take your tv off air.

You do that by ensuring a single point of failure, like virus definition updates, or an unexpected bug in software which hits on Jan 29th, or when leapseconds go backwards, can't affect all your machines at the same time.

Yes it will be a pain if half your checkin desks are offline, but not as much as when they are all offline.


Except this actually was an opportunity for malicious actors https://www.theregister.com/2024/07/19/cyber_criminals_quick...


Wow that's terrible. I'm curious as to whether your contract with them allows for meaningful compensation in an event like this or is it just limited to the price of the software?


Are you rolling out CS updates as is everywhere? Are you not testing any published updates immediately at least with some N-1 staging involved?


Do you need to manually fix all your windows boxes? Or is there a way to update it remotely?


Yeah, simply renaming the .sys files in safe mode does seem like it would inhibit protection.


Renaming it .old or whatever would be what a sane person does.

They recommended deleting the thing: https://news.ycombinator.com/item?id=41002199


While you're at it probably delete the rest of the software. Then cancel your contract.


Then your hardware becomes unprotected. Congratulations, you've won the Cybersecurity Award of the Year.


There are more options than "you are using Crowdstrike" and "you have no protection".


You're absolutely right, there are more options.

Let's say you're a CISO and it's your task to evaluate Cybersecurity solutions to purchase and implement.

You go out there and find that there are multiple organizations that periodically test (simulate attacks against) the EDR capabilities of these vendors and publish grades for them.

You take the top 5 to narrow down your selection and pit them against each other in a PoC consisting of attack simulations and end-to-end response (that's the Response part of EDR).

The winner gets the contract.

Unless there are tie-breakers...

PS: I've heard (and read) others say that CS was best-in-class, which suggests they probably won PoCs and received high grades from those independent organizations.


Then sue the shit out of them.


No, the userspace program will replace it with a good version.


I don't mean this to be rude or as an attack, but do you just auto update without validation?

This appears to be a clear fault from the companies where the buck stops - those who _use_ CS and should be validating patches from them and other vendors.


I'm pretty sure CrowdStrike auto-updates, with zero option to disable or manually roll out updates. Even worse, people running N-1 and N-2 channels also seem to have been impacted by this.


My point stands then. If you're applying kernel grade patches on machines which you knowingly cannot disable or test, that's just simple negligence.


I think it's probably not a kernel patch per se. I think it's something like an update to a data file that Crowdstrike considers low risk, but it turns out that the already-deployed kernel module has a bug that means it crashes when it reads this file.


Which suggests the question: What's the current state of "fuzz testing" within the Crowdstrike dev org?


Apparently, CS and ZScaler can apply updates on their own and thats by design, with 0day patches expected to be deployed the minute they are announced.


CS, S1, Zscaler, etc. auto-update, and they have to. That's the point of the product. If they don't get definitions, they cannot protect.


Why do they "have to"? Why can't company sysadmins at minimum configure rolling updates or have a 48-hour validation stage, either of which would have caught this? Auto-updating external kernel-level code should never, ever be acceptable.


If you have a 48 hour window on updating definitions, your machines all have 48 extra hours they are vulnerable to 0-days.


But isn't that a fairly tiny risk, compared with letting a third party meddle with your kernel modules without asking nicely? I've never been hit by a zero-day (unless Drupageddon counts).


I would say no, it's definitely not a tiny risk. I'm confused what would lead you to call getting exploited by vulnerabilities a tiny risk -- if that were actually true, then Crowdstrike wouldn't have a business!

Companies get hit by zero days all the time. I have worked for one that got ransomwared as a result of a zero day. If it had been patched earlier, maybe they wouldn't have gotten ransomwared. If they start intentionally waiting two extra days to patch, the risk obviously goes up.

Companies get hit by zero day exploits daily, more often than Crowdstrike deploys a bug like this.

It's easy to say you should have done the other thing when something bad happens. If your security vendor was not releasing definitions until 48 hours later than they could have, and some huge hack happened because of that, the internet commentary would obviously say they were stupid to be waiting 48 hours.

But if you think the risk of getting exploited by a vulnerability is less than the risk of being harmed by Crowdstrike software, and you are a decision maker at your organization, then obviously your organization would not be a Crowdstrike customer! That's fine.


CS doesn't force you to auto-upgrade the sensor software – there is quite some FUD thrown around at this moment. It's a policy you can adjust and apply to different sets of hosts if needed. Additionally, you can choose if you want the latest version or a number of versions behind the latest version.

What you cannot choose, however - at least to my knowledge - is whether or not to auto-update the release channel feed and IOC/signature files. The crashes that occurred seem to have been caused by the kernel driver not properly handling invalid data in these auxiliary files, but I guess we have to wait on/hope for a post-mortem report for a detailed explanation. Obviously, only the top-paying customers will get those details...


Stop the pandering. You know very well CrowdStrike doesn't offer good protection to begin with!

Everyone pays for legal protection: after something happens, you can show you did everything (which means nothing; well, now it shows something even worse than nothing) by showing you paid them.

If they tell you to disable everything, what does it change? They're still your blame shield, which is the reason you have CS.

... The only real feature anybody cares about is inventory control.


Quite a few people in this thread disagree with you though.


You mean their careers depend on them claiming ignorance.


You said CrowdStrike doesn't offer protection, but plenty of people in this thread suggested it actually does, and it seems to be highly regarded in the field.

Not sure who the ignoramus is here...


Facts speak louder than words. If you cared about protection you would be securing your systems, not installing yet more things, especially one that requires you to open up several other attack vectors. But I will never manage to make you see it.


Writing mission-critical products in the safest programming language, deployed on the most secure and stable OS the world depends on, would be a developer's wet dream.

But it's just that... a developer's wet dream.


This might be a good time for folks to go back and watch the first episode of James Burke's Connections: The Trigger Effect

https://www.youtube.com/watch?v=NcOb3Dilzjc

Interconnected systems can fail spectacularly in unforeseen ways. Strange that something so obvious is so often dismissed or overlooked.


Crowdstrike, though, is not part of a system of engineered design.

It's a half-baked rootkit sold as a fig leaf for incompetent IT managers so they can implement ”best practices” on their company's PCs.

The people purchasing it don't actually know what it does; they just know it's something they can invest their cybersecurity budget into and an easy way to fulfill their ”implement cybersecurity” KPIs without needing to do anything themselves.


>they just know it's something they can invest their cybersecurity budget into and an easy way to fulfill their ”implement cybersecurity” KPIs

To be fair, this is exactly what companies like CrowdStrike are selling to these managers. Emphasis on the word "SELLING".


Exactly, and this is why I've heard the take that the companies who integrate this software need to be held responsible for not having proper redundancy. While that's a fine take, we need to keep assigning blame squarely to CrowdStrike and even Microsoft. They're the companies that drum the beat of war every chance they get, scaring otherwise reasonable people into thinking that the cyber world is ending and only their software can save them, who push stupid compliance and security frameworks, and who straight-up lie to their prospects about the capabilities and stability of their product. Microsoft sets the absolutely dog-water standard of "you get updates, you can't turn them off, you can't stagger them, you can't delay them, you get no control, fuck you".


Exactly.


Perhaps true in some cases, but in regulated industries (e.g., federally regulated banks) a tool like CrowdStrike addresses several controls that, if left unaddressed, result in regulatory fines. Regulated companies rarely employ home-grown tools due to maintenance risk. But, as we now see, these rootkit-style or agent-based security tools bring their own risks.


I'm not arguing against the need to follow regulations. I'm not familiar with what specifically is required of banks. All I'm saying is that CrowdStrike sucks as a specific offering. I'm sure there are worse ways to check the boxes (there always are), but that's not much of an endorsement.

My rant comes from an org that most certainly was not a bank (B2B software/hardware), and there was enough of a ruckus to tell it was not mandated there by any specific regulation (hence incompetence).


The point is that CrowdStrike is only useful for compliance, not for security.


A properly used endpoint protection system is a powerful tool for security.

It's just that you can game compliance by claiming certain controls are handled because you purchased CrowdStrike... then leave it improperly deployed and without an actual security team in control of it (maybe a few underpaid and overworked people getting pestered by BS from management).


I think a lot about software that is fundamentally flawed but gets propelled up in value due to great sales and marketing. It makes me question the industry.

It's interesting that this is being referred to as a black swan event in the markets. If you look at the SolarWinds fiasco from a few years ago, there are some differences, but it boils down to problems with shitty software having too many privileges being deployed all over the place. It's a weak monoculture, and eventually a plague will cause devastation. I think a screw-up in these sorts of software models shouldn't really be thought of as a black swan event, but instead as an inevitability.


But most managers get to their positions by passing the buck, by not being present when the inevitable happens, or by being able to blame the other guy.


> have an easy way to fullfill their ”implement cybersecurity”

There's a typo in there. "Do cyber" is how the said managers would phrase it.


That kind of phrasing lends itself to some wild misunderstandings...


That is how all of these tools are. I have always told people that third-party virus scanners are just viruses that we are OK with having. They slow down our computers, reduce our security, and many of them have keyloggers in them (to detect other keyloggers). We just trust them more than we trust unknown ones, so we hand things over to them.

CrowdStrike is a little broader, of course. But yeah, it's a rootkit that we trust to protect us from other rootkits. It's like fighting fire with fire.


This is the same argument as saying the government is just the biggest gang — a mafia with uniforms.


The government metaphor is apt. Someone has overall authority over your compute and data on your PC. In general I would view the OS as the government.


Which is... true?


100%


It's like when you are using Wiz. You give your most secret files to former Israeli intelligence officers, and hope for the best.

It doesn't really make your data "more secure" or "private".


See also hosted VPN companies


This is an interesting response. I'm curious why you specifically believe "it’s a half-baked rootkit sold as a figleaf for incompetent it managers."


That's my experience as the unfortunate user of a PC, as a software engineer in a Fortune 1000 org where every PC was mandated to have CrowdStrike installed.

It ran amok on every PC it was installed on. Nobody could tell exactly what it did, or why.

Engineering management attempted to argue against it. This resulted in quite public discourse, which made obvious the incompetence of the relevant parties in IT management responsible for its implementation.

Not _negligently_ incompetent. Just incompetent enough that it was obvious they did not understand the system they administered from any set of core principles.

It was also obvious it was implemented only because ”it was a product you could buy to implement cybersecurity”. What this actually meant from a systems architecture point of view was apparently irrelevant.

One could argue the only task of IT management is to act as a dumb middleman between the budget and service providers. So if it's acceptable that IT managers don't actually need to know anything about computers, then the claim of incompetence can of course be dropped.


Because security software shouldn't be able to cause a kernel panic, and if it can, the kernel component should be rock solid.


because it took down half the world?


and it shows that the deployment process was not under control and that something malicious could have happened as well


Which part seems dubious to you?


Very well put. Compliance over actual ops or security.


If you realize something horrific, your options are to decide it's not your problem (and feel guilty when it blows up), carefully forget you learned it, or try to do something to get it changed.

Since the last of these involves overcoming everyone else's shared distress in admitting the emperor has no clothes, and the first of these involves a lot of distress for you personally, a lot of people opt for option B.


> overcoming everyone else's shared distress in admitting the emperor has no clothes

I don't disagree, but why do we do we react this way? Doesn't knowing the emperor has no clothes instill a bit of hope that things can change? I feel for the people who were impacted by this, but I'm also a little bit excited. Like... NOW can we fix it? Please?


The problem is, making noise incurs risk.

The higher up you go in large organizations, in politics or employment or w/e, the more what matters is not facts but avoiding being visibly seen to have made a mistake. So you become risk-averse and just ride the status quo, unless something is an existential threat to you or a chance to capitalize on someone else's misjudgment.

So if you can't directly gain from pointing out the emperor's missing clothes, there's no incentive to call it out, there's active risk to calling it out if other people won't agree, and moreover, this provides an active incentive for those with political capital in the organization to suppress the embarrassment of anyone pointing out they did not admit the problem everyone knew was there.

(This is basically how you get the "actively suppress any exceptions to people collectively treating something as a missing stair" behaviors.)


I've not seen that at my Fortune 100. I found others willing to agree and we walked it up to the most senior EVP in the corporation. Got face time and we weren't punished. Just, nothing changed. Some of the directors that helped walk it up the chain eventually became more powerful, and the suggested actions took place about 15 years later.


Sure, I've certainly seen exceptions, and valued them a lot.

But often, at least in my experience, exceptions are limited in scope to whatever part of the org chart the person who is the exception is in charge of, and then that still governs everything outside of that box...


And, another problem, if I may: this, too, will soon be forgotten. Our "attention cycle" is too short.

(Look at all the recent, severe supply chain attacks by state actors, and how quickly they fell out of focus...)


So if we want our technology to be more reliable, we need to make its current unreliability into an existential threat to certain people?

I mean, it already is to some people, as shown elsewhere in this thread. Seems like it's the wrong people though.


It's a nice idea, but has that worked historically? Some people will make changes, but I think we'd be naive to think that things will change in any large and meaningful way.

Having another I-told-you-so isn't so bad, though - it does give us IT people a little more latitude when we tell people that buying the insecurity fix du jour increases work and adds more problems than it addresses.


Sure, on long enough timescales. I mean, there's less lead in the environment than there used to be. We don't practice blood letting anymore. Things change. Eventually enough will be enough and we'll start using systems that are transparent about what their inputs are and have a way of operating in cases where the user disables one of those inputs because it's causing problems (e.g. crowdstrike updates).

I'd just like it to be soon because I'm interested in building such systems and I'd rather be paid to do so instead of doing it on my off time.


> We don't practice blood letting anymore. Things change

Gonna make me a Tshirt outta this :)


Going off of that framework:

Because doing number three requires convincing a bunch of people who are currently doing number two to do number one instead.


Option C is to quietly distance yourself from it.


There are way too many horrific things in the world to learn about... and then you realize you can't do something about every one of them. But at least you can tackle one! (In my case, antibiotic resistance.)


My issue is WTF do sooooooo many companies trust this 1 fucking company lol, like it's always some obscure company that every major corporation is trusting lol. All because CrowdStrike apparently throws good parties for C-level execs lol


Again, as I feel I must every time Connections is mentioned: That series is an unmitigated masterpiece. It really is ...


Except this is not that: good, well-known practices exist against these kinds of failures, and they are used by others.


Nassim Taleb is having a good day


AWS has posted some instructions for those affected by the issue using EC2.

[AWS Health Dashboard](https://health.aws.amazon.com/health/status)

"First, in some cases, a reboot of the instance may allow for the CrowdStrike Falcon agent to be updated to a previously healthy version, resolving the issue.

Second, the following steps can be followed to delete the CrowdStrike Falcon agent file on the affected instance:

1. Create a snapshot of the EBS root volume of the affected instance

2. Create a new EBS volume from the snapshot in the same Availability Zone

3. Launch a new instance in that Availability Zone using a different version of Windows

4. Attach the EBS volume from step (2) to the new instance as a data volume

5. Navigate to the \windows\system32\drivers\CrowdStrike\ folder on the attached volume and delete "C-00000291*.sys"

6. Detach the EBS volume from the new instance

7. Create a snapshot of the detached EBS volume

8. Create an AMI from the snapshot by selecting the same volume type as the affected instance

9. Call replace root volume on the original EC2 Instance specifying the AMI just created"


That is a lot of steps. Can this not be scripted?


Yes it can; that's what I ended up writing at 4am this morning, lol. We manage way more instances than is feasible to fix by hand. This is probably too late to help anyone, but you can also just stop the instance, detach the root volume, attach it to another instance, delete the file(s), offline the drive, detach it, reattach it to the original instance, and then start the instance. You need a "fixer" machine in the same AZ.
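For anyone still digging out, here's roughly what that flow looks like with boto3. This is a sketch under assumptions, not the exact script I ran: FIXER_INSTANCE_ID is a placeholder for a healthy Windows instance in the same AZ, and the actual deletion of \Windows\System32\drivers\CrowdStrike\C-00000291*.sys on the attached volume still has to happen on the fixer box (by hand or via SSM).

    import boto3

    ec2 = boto3.client("ec2")
    FIXER_INSTANCE_ID = "i-0fixer0000000000"  # placeholder: healthy Windows box in the same AZ

    def detach_root_for_repair(broken_id):
        """Stop the broken instance and move its root volume onto the fixer box."""
        ec2.stop_instances(InstanceIds=[broken_id], Force=True)
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[broken_id])

        # Find the root EBS volume of the broken instance.
        inst = ec2.describe_instances(InstanceIds=[broken_id])["Reservations"][0]["Instances"][0]
        root_dev = inst["RootDeviceName"]
        vol_id = next(m["Ebs"]["VolumeId"] for m in inst["BlockDeviceMappings"]
                      if m["DeviceName"] == root_dev)

        ec2.detach_volume(VolumeId=vol_id)
        ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
        # Attach as a secondary disk on the fixer, then delete
        # \Windows\System32\drivers\CrowdStrike\C-00000291*.sys there (by hand or via SSM).
        ec2.attach_volume(VolumeId=vol_id, InstanceId=FIXER_INSTANCE_ID, Device="xvdf")
        return vol_id, root_dev

    def reattach_and_boot(broken_id, vol_id, root_dev):
        """Once the bad channel file is gone, give the volume back and start the instance."""
        ec2.detach_volume(VolumeId=vol_id)
        ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
        ec2.attach_volume(VolumeId=vol_id, InstanceId=broken_id, Device=root_dev)
        ec2.start_instances(InstanceIds=[broken_id])

Wrap that in a loop over your instance IDs and you avoid the per-snapshot dance from the AWS guidance, at the cost of a stop/start on each box.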


FWIW, I find the high-level overview more useful, because then I can write a script tailored to my situation. Between `bash`, `aws` CLI tool, and Powershell, it would be straightforward to programmatically apply this remedy.

Here's something quick that ChatGPT ginned up: https://chatgpt.com/share/293ea9d5-b7ac-4064-b870-45f8266aea...


When you see the size of the impact across the world, the number of people who will die because hospital, emergency and logistics systems are down…

You don’t need conventional war any more. State actors can just focus on targeting widely deployed “security systems” that will bring down whole economies and bring as much death and financial damage as a missile, while denying any involvement…


I always think it's easy for state actors to pull off this trick.

Considering PR review is usually done within the team, a state actor can simply insert a manager, a couple of senior developers, and maybe a couple of junior developers into a large team to do the job. Push something in on a Friday so few people bother to check; it gets approved by another implant and there you go.

Everyone can then leave at their leisure.


This happened with IntelliJ a while back, didn't it? A spy pushed code that caused a supply chain outage somewhere; I can't remember the details.

Anyway, I believe this is what happened here in this case.


Seeing all the cancelled and delayed flights, it makes me think a hacking kind of climate activism/radicalism would be more useful than gluing hands to roads, or throwing paint on art.


Activism is mostly about awareness, because generally you believe your position to be the one a logical person will accept if they learn about it, so doing things that get in the news but only gets you a small fine or month in jail are preferred.

Taking destructive action is usually called "ecoterrorism" and isn't really done much anymore.


This is, in a way, why Kaspersky was banned in the US... "who scans the scanners?". Kaspersky is not that different from a CrowdStrike EDR product.

   https://news.ycombinator.com/item?id=4092187


But will Europe ban CrowdStrike?


Given how obvious a target the vector becomes once it's this widespread, it stands to reason that the same state actors would push phishing schemes and other such efforts in order to justify having a tool like CrowdStrike used everywhere. We are focusing on the bear trap snapping shut here, but someone took the time to set up that trap right where we'd be stepping in the first place.


I was in my 20s during the peak hysteria of post-9/11 and GWOT. I had to cope with the hysteria hyped 24/7 by media and DHS of a constant terror threat to determine if it was real.

The fact that global infra is so flimsy and vulnerable brought me tremendous relief. If the terror threats were real, we would have been experiencing infrastructure attacks daily.

I remember driving through rural California thinking if the terrorist cells were everywhere, they could trivially <attack critical infra that I don't want to be flagged by the FBI for>

I've read a lot of cybersecurity books like Countdown to Zero Day, Sandworm, and Ghost in the Wires, and each one brings me relief. Many of our industrial systems have the most flimsy, pathetic, unencrypted & uncredentialed wireless control protocols, vulnerable to remote attack.

The fact that we rarely see incidents like this, and when they do happen, they are due to gross negligence rather than malice, is a tremendous relief.


This is the silver lining of global capitalism. When every power on earth is invested in the same assets there is little interest in rocking the boat unless the financial justification to do so is sufficiently massive.


Until deglobalization sufficiently spreads to the software ecosystem. Just a few hours ago I attended a lecture by a very high-profile German cybersecurity researcher (though he keeps a low profile). The guy is a real greybeard, can fluently read any machine code, and was building and selling Commodore 64 cards at 14. (I don't even know what that is.) He's hell-bent on not letting in any US code nor a single US chip. Intel is building a 2nm fab in Magdeburg, Germany, which would be the most advanced in the world when completed. German companies are developing their own fabs, not based on or purchased from ASML, along with their own chip designs. A new German operating system in Berlin.

Huawei, after their CFO got detained in Canada, took the Linux source code and rewrote it file by file in C++. Now they're using it in all their products, called HarmonyOS. The Chinese are recruiting ex-TSMC engineers in mainland China and giving them everything - free house, car, money, free pass between Taiwan and China - just to build their own fab in a city whose name I don't know how to spell.

I'm not German, but I'll go to hell and back with the move to deglobalize, or in other words, de-Americanize. This textarea cannot possibly express my anger and hatred toward the past fifty years of the domination of Imperium Americana. Not for a single moment have they let us live without bloodshed and brutal oppression.


What do you think preceded the imperium americana? You'd have to go back thousands of years to find an example of a world not dominated by empires.


I am not against the idea of civilization, authority, hierarchy and empires. I am against those who are unjust and evil oppressors on the face of Earth.


it turns out, civilization works because most of us are civilized.


Ours does


This clusterfuck is a dress rehearsal if you ask my honest opinion.


-


We are far past that point. So many critical systems are running on autopilot, with people who built and understood them retiring, and a new batch of unaware, aloof, apathetic people at the helm.

There's no real need for some Bad Actor -- at some point, entropy will take care of it. Some trivial thing somewhere will fail, and create a cascade of failures that will be cataclysmic in its consequences.

It's not fear-mongering, it's kind of a logical conclusion to decades of outsourcing, chasing profit above and over anything else, and sheer ignorance borne of privilege. We forgot what it took to build the foundations that keep us alive.


That's just what old people like to think: that they are super important and could never be replaced. A few months ago I replaced a "critical" employee that was retiring and everyone was worried what would happen when he was gone. I learned his job in a month.

Most people aren't very important or special and most jobs aren't that difficult.


What is the parable about the engineers who made a beautiful public bath that stopped working, and nobody understood how to fix it?


I'm not so sure it's hyperbole: https://news.ycombinator.com/item?id=41002977


why the fuck is our critical infrastructure running on WINDOWS. Fuck the sad state of IT. CIOs and CTOs across the board need to be fired and held accountable for their shitty decisions in these industries.

yes CRWD is a shitty company but seems they are a "necessity" by some stupid audit/regulatory board that oversees these industries. But at the end of the day, these CIOs/CTOs are completely fucking clueless as to the exact functions this software does on a regular basis. A few minions might raise an issue but they stupidly ignore them because "rEgUlAtOrY aUdIt rEqUiReS iT!1!"


The OS doesn't matter, the question should be why is critical infrastructure online and allowed to receive OTA updates from third parties.


While Linux isn't a panacea, the OS does matter as Linux provides tools for security scanners like Crowdstrike to operate entirely in userspace, with just a sandboxed eBPF program performing the filtering and blocking within the kernel. And yes, CrowdStrike supports this mode of operation, which I'll be advocating we switch over to on Monday. So yeah, for this specific issue, Linux provides a specific feature that would have prevented this issue.


BPF-based CrowdStrike is relatively recent, partially because, from the Enterprise Linux perspective, kernel support is relatively recent.

For example, BPF-based CrowdStrike works on Enterprise Linux 9 and Debian 12. I don't know if the necessary support was in EL 8 or Debian 11.


Right! Windows should NEVER blue screen. Ever. From a third-party software.

Maybe Windows doesn't provide the right ABI or whatever for CS, but come on, you should never be able to kernel panic Windows.

That this blue screened is 100% Microsoft's fault. It's a mess all the way around.


Poe's law?


I mean, you can crash Linux too with bad kernel code.


> The OS doesn't matter, the question should be why is critical infrastructure online and allowed to receive OTA updates from third parties.

Not exactly. I think the question is why is critical infrastructure getting OTA updates from third parties automatically deployed directly to PROD without any testing.

These updates need to go to a staging environment first, get vetted, and only then go to PROD. Another upside is that the update won't hit PROD everywhere all at once, resulting in a worldwide shitshow like this one.
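The gate itself doesn't have to be elaborate. A toy sketch of ring-based promotion follows; the ring sizes, soak time, and the two helper functions are all made-up placeholders, and in reality the "crash rate" comes from real telemetry (BSOD reports, heartbeat loss, etc.):

    import time

    # Hypothetical rollout rings: a handful of canaries first, then broader waves.
    RINGS = [
        ("canary", ["host-001", "host-002"]),
        ("early",  ["host-%03d" % i for i in range(3, 13)]),
        ("broad",  ["host-%03d" % i for i in range(13, 113)]),
    ]

    MAX_CRASH_RATE = 0.01    # abort if more than 1% of a ring goes unhealthy
    SOAK_SECONDS = 4 * 3600  # let each ring run for a few hours before promoting

    def push_update(host):
        """Placeholder for whatever actually delivers the content update."""
        print(f"pushing update to {host}")

    def crash_rate(hosts):
        """Placeholder for real telemetry (crash dumps, heartbeat loss, ...)."""
        return 0.0

    def staged_rollout():
        for name, hosts in RINGS:
            for h in hosts:
                push_update(h)
            time.sleep(SOAK_SECONDS)  # soak period before promoting
            rate = crash_rate(hosts)
            if rate > MAX_CRASH_RATE:
                raise RuntimeError(f"halting rollout: ring '{name}' crash rate {rate:.1%}")
            print(f"ring '{name}' healthy, promoting to next ring")

    if __name__ == "__main__":
        staged_rollout()

Even with a soak window of only a few hours, a bug that bluescreens every host it touches gets caught at the canary ring instead of at airports and hospitals.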


I think you have the priority backwards. We shouldn’t be relying on trusting the QA process of a private company for national security systems. Our systems should have been resilient in the face of Crowdstrike incompetence.


> I think you have the priority backwards. We shouldn’t be relying on trusting the QA process of a private company for national security systems. Our systems should have been resilient in the face of Crowdstrike incompetence.

I think you misunderstood me. I wasn't talking about Crowdstrike having a staging environment, I was talking about their customers. So 911 doesn't go down immediately once Crowdstrike pushes a bad update, because the 911 center administrator stages the update, sees that it's bad, and refuses to push it to PROD.

I think that would even provide some resiliency in the face of incompetent system administrators, because even if they just hit "install" on every update, they'll tend to do it at different times of day, which will slow the rollout of bad updates and limit their impact. And an incompetent admin might not hit "install" at all because he read the news that day.


Lol, if they can't do staging to mitigate balls-ups on the high-availability infrastructure side (Optus in Australia earlier this year pushed a router config that took down 000 emergency services for a good chunk of the nation), we've got bugger-all hope of big companies getting it right further up the stack in software.


> why is critical infrastructure getting OTA updates from third parties automatically deployed directly to PROD without any testing.

I am missing some details, perhaps

From what I see this was an update from Crowdstrike. They are a first party, no?

Was another party involved?


They are a third party software provider


OS absolutely does matter. Windows has an enormous attack surface because Microsoft doesn't care.

There are a number of minimal operating systems without all the bells and whistles. The reason they aren't a more popular choice is exactly this "the OS doesn't matter" attitude.

If the OS is minimal, it doesn't need OTA updates, let alone from a third party...


In this case it wasn’t an update to the OS but an update to something running on the OS supplied by an unrelated vendor.

But if we entertain the idea that another OS would not need CrowdStrike or anything else that required updates to begin with, I have doubts. Even your CPU needs microcode updates nowadays.


Of course the OS matters! Windows is a nasty ball of patches in order to maintain backward compatibility with the 80s. Linux and OSX don't have to maintain all the nasty hacks to keep this backward compatibility.

Also, Crowdstrike is a security (patch) company because Windows security sucks to the point they have, by default, real-time virus protection running constantly (runs my CPU white hot for half the day, can you imagine the global impact on the environment?!).

It's so bad on security that it's given birth to a whole industry to fix it, i.e. CrowdStrike. Every time I pass a blue screen in a train station or on an advertisement I'm like "Ha! You deserve that for choosing Windows".


IBM's z/OS maintains compatibility with the '60s, and machines running it continue to process billions of transactions every second without taking a break.

The OS matters, as well as the ecosystem and, and this is most important, the developer and operations culture around it.


> Of course the OS matters! Windows is a nasty ball of patches in order to maintain backward compatibility with the 80s. Linux and OSX don't have to maintain all the nasty hacks to keep this backward compatibility.

Just don't tell that to Linus Torvalds :) Because Linux absolutely does maintain compatibility with old ABIs from the '90s.


> Just don't tell that to Linus Torvalds :) Because Linux absolutely does maintain compatibility with old ABIs from the '90s.

That's nothing. IBM's z/OS maintains compatibility with systems dating all the way back to the '60s. If they want to think they are reading a stack of punch cards, the OS is happy to fool them.


+1

With so much tooling and so many products, it should come down to:

- What am I running, and what is its current security state?

- What is the supply chain of any change that's happening?

- Test/stage/roll out any change - do not trust the vendor, as they do not know your infrastructure.

By allowing OTA updates, they assumed that the vendor had tested all permutations.


It matters as in it makes it easy for this kind of issue to cause this much damage with little to no recourse for a fast correction.

Not that Linux or whatever are all immune, but it definitely matters.


You should look into what a kernel driver is. You can panic a Linux kernel with 2 lines of code just as you can panic a Windows kernel, they just got lucky that this fault didn't occur in their Linux version.

And to be honest, I don't think recovering from this would be that much easier for non-technical folk on a fully encrypted Linux machine, not that it's particularly hard on Windows, it's just a lot of machines to do it on.


In Linux it could be implemented as an eBPF thing while most of the app runs in userspace.

And, for specialised uses, such as airline or ER systems, a cut-down specialised kernel with a minimal userland would not require the kind of protection Crowdstrike provides.

I’m sure the NSA wasn’t affected by this.


ebpf works in Windows as well.


The OS absolutely matters


The culture around the OS matters.

But this is a 3rd party software with ring-0 access to all of your computers deciding to break them. The technical features of the OS absolutely do not matter.


The question is whether other OSs would require it to have kernel mode privileges. People run complicated stuff in kernel mode for performance, because the switch to/from userspace is expensive.

Guess what’s also expensive? A global outage is expensive. Much more than taking the performance hit a better, more isolated, design would avoid.


EDRs run in kernel mode for access, not performance. They monkey-patch your syscalls.


The alternatives aren't in a position to fill the roles needed for the tasks at hand.


This is true. Linux large fleet management is still missing some features large enterprises demand. Do they need all those features, idk, but they demand them if they're switching from Windows.


What are the tasks in question?


No, what is stopping a similarly designed EDR from causing the same problem on Linux?


From a comment above, Linux has features (eBPF) that let CrowdStrike stay out of the kernel.

The old "everyone else is just as bad" adage is bullshit. Some OSs are better suited than others.


From a comment elsewhere, a CS update took out Linux machines earlier this year.


Didn't CRWD cause a similar issue with Debian/RHEL a little while ago?

It sounds to me that the problem lies with CRWD and not with whatever OS it's installed on.


A kernel driver can, definitely, take down a Linux machine.

The question is whether someone should implement something like this as a kernel module when there are better ways.


Windows also has better ways, such as filter drivers and hooks. If everybody used Linux, CrowdStrike would still opt for the kernel driver, since the software they create is effectively spyware that wants access to stuff as deep as possible.

If they opted for an eBPF service but put it into the early boot chain, the boot loop or getting stuck could still happen.

The only long-term solution is to stop buying software from a company with a track record of being pushy and of terrible software practices like rolling out updates to the entire field at once.


I think the only real solution is for MSFT to stop allowing kernel-level drivers, as Apple has already (sort of, but nearly) done. Sure, lots and lots of crap runs on Windows in kernel space, but what happened today cost a sizable fraction of the world's GDP. There won't be a better wake-up call.


I hope that, in the future, we have better robot firmware validation protocols in place when pushing OTA updates.

Maybe Skynet didn't mean any of that - it was just a botched update.


But would the Linux sysadmins of the world play along in the way that the Windows sysadmins of the world did? I think they might've given Crowd Strike the finger and confined them to a smaller blast radius anyhow. And if they wouldn't have... well they will now.


Third-party blobs running in kernel space being delivered through their own channels without anyone in the company signing them off?

I don’t think I ever met a Unix person with whom that idea would fly.


Once it gets popular, I think it would happen. The business people and C-suite would request quick dirty solutions like Crowd Strike's offerings to check boxes when entering new markets and go around the red tape. So they'll force Unix people to do as they say or else.


Agreed. It's a safer culture because it grew up in the wild. Windows, by contrast, is for when everybody you're using it with has the same boss... places where sanity can be imposed by fiat.

If Microsoft is to be blamed here, it's not for the quality of their software, it's for fostering a culture where dangerous practices are deemed acceptable.


> If they opted for an eBPF service but put that into early boot chain, the bootloop or getting stuck could still happen.

If the in-kernel part is simple and passes data to a trusted userland application, the likelihood of a major outage like the one we saw is much reduced.


More specifically, why is critical stuff not properly equipped to revert itself and keep working, and/or fail over? This should be built in at this point: keep the last working OS snapshot on its own storage chip and automatically flash it back, even if it takes a physical switch… things like this just shouldn't happen.


> why the fuck is our critical infrastructure running on WINDOWS

Because it’s cheaper.

I feel like many in this thread are obsessing over the choice of OS when the actual core question is why, given the insane money we spend on healthcare, are all healthcare systems shitty and underinvested?

A sensible, well constructed system would have fallbacks, no matter if the OS of choice is Windows or Linux.


The difference is that lots of different companies can share the burden of implementing all that in Linux (or BSD, or anything else) while only Microsoft can implement that functionality in Windows and even their resources are limited.


Very little healthcare functionality would ever need to be created at the OS level. The burden could be shared no matter if machines were running Windows or Linux, they’re mostly just regular applications.


Not talking about the applications - those could be ported and, ideally, financed by something like the UNDP so that the same tools are available everywhere to any interested part.

I'm talking about Crowdstrike's Falcon-like monitoring. It exists to intercept "suspicious" activity by userland applications and/or other kernel modules.


Cheaper? Well, perhaps when you require your OS to have some sort of support contract. And your support vendor charges you unhealthy sums.

And then you get to see the value of the millions of dollars you've paid for support contracts that don't protect your systems at all. But those contracts do protect specific employees. When the sky falls down, the big money execs don't have a solution. But it's not their fault because the support experts they pay huge sums don't have solutions either. Somehow paying millions of dollars to support contractors that can't save you is not seen as a fireable offense. Instead it is a career-saving scapegoat.

Within companies that have been bitten this time, the team that wasn't affected because they made better process decisions will not be promoted as smarter. Their voice will continue to be marginalized by the people whose decisions led to this disaster. Because, hey, look, everyone got bit right? Nobody looks around to notice the people who were not bitten and recognize their better choices. And "I told you so" is a pretty bad look right now.


> I feel like many in this thread are obsessing over the choice of OS when the actual core question is why, given the insane money we spend on healthcare, are all healthcare systems shitty and underinvested?

Because it's basically impossible to compete in the space.

Epic is a pile of horseshit, but you try convincing a hospital to sign up for your better version.


Tons of critical infrastructure in the US is run on IBM zOS. It doesn't matter what operating system you use, what matters is updates aren't automatic and everything is as air gapped as possible.


> why the fuck is our critical infrastructure running on WINDOWS.

That hits the nail on the head.

But it is a rhetorical question. We know why, generally, software sucks, and specifically why Windows is the worst and yet the most popular.

Good software is developed by pointy-headed nerds (like us), and successful software is marketed to business executives who have serious pathologies.

There are exceptions (I am struggling to think of one) where a serious piece of good software has survived being mass-marketed, but the constraints (basically business and science) conflict.


Nah, nope.

1/ Linux is just as vulnerable to kernel panics induced by such software. In fact, CS had a similar snafu in mid-April, affecting Linux kernels. Luckily, there are far fewer moronic companies running CS on Linux boxes at scale.

2/ It does offer protection - if you are running a total shit architecture and you need to trust your endpoints not to be compromised, something like this is sadly a must.

Incidentally, Google, which prides itself on running a zero-trust architecture, sent a lot of people home on Friday. Not so zero-trust after all, it seems.

Lots of armchair CIOs/CTOs in the comments today.


Source on google sending home people?


Windows is no more or less vulnerable to this class of issues than any other OS.


Debatable. macOS did away with third-party kernel extensions. On Windows, CS runs in the kernel, and the kernel can't load properly because of CS.


Apple also 100% controls their hardware, so they can afford to do away with third-party kernel extensions.


maybe that's what's required for critical infrastructure


Windows isn’t Crowdstrike.


No, it's just soooooo bad at security/stability that it gave birth to CrowdStrike. The very fact that CrowdStrike is so big and prevalent is proof of the gaping hole in Windows security. It's given birth to a multibillion-dollar industry!


Crowdstrike/falcon use is not by any means limited to Windows. Plenty of Linux heavy companies mandate it on all infrastructure (although I hope that changes after this incident).


It’s mandated because someone believes Linux is as bad as Windows in that regard.

And, quite frankly, a well configured and properly locked down Windows would be as secure as a locked down Linux install. It’d also be a pain to use, but that’s a different question.

Critical systems should run a limited set of applications precisely to reduce attack surface.


The reality is that the wetware that interfaces with any OS is always going to be the weakest link. It doesn't matter what OS they run, I guarantee they will click links and download files from anywhere.


I can pretty easily make it so a user on Linux can't download executables and, even then, can't do any damage without a severe vulnerability. That is actually pretty difficult to do in a typical Windows AD deployment. There is a big difference between the two OSes.

In fact, there's a couple billion Linux devices running around locked down hard enough that the most clueless users you can imagine don't get their bank details stolen.


There’s Crowdstrike for Linux and Mac


spoiler: crowdstrike is used by companies running on mac and linux as well


Odd that you choose Windows to swipe at when this was largely CRWD's problem + a mix of awful due diligence by IT departments.


The primary answer to your question is because it's expensive to switch.


> yes CRWD is a shitty company but seems they are a "necessity" by some stupid audit/regulatory board that oversees these industries.

Yep, this is the problem. The part about Windows is a distraction here.

That bullshit regulation is a much larger security issue than Windows. Incomparably so. If you run it over Linux, you'll get basically the same lack of security.


Someone on X has shared the kernel stack trace of the crash

The faulting driver in the stack trace was csagent.sys.

Now, Crowdstrike has got two mini filter drivers registered with Microsoft (for signing and allocation of altitude).

1) csagent.sys - Altitude (321410). This altitude falls within the range for anti-virus filters.

2) im.sys - Altitude (80680). This altitude falls within the range for access control drivers.

So, it is clear that the driver causing the crash is their AV driver, csagent.sys.

The workaround that CrowdStrike has given is to delete C-00000291*.sys files from the directory: C:\Windows\System32\Drivers\CrowdStrike\

The files suggested for deletion are not driver files (.sys files) but probably some kind of virus definition database files.

The reason they name these files with the .sys extension is possibly to leverage the Windows System File Checker tool's ability to restore deleted system files.

This seems to be a workaround, and the actual fix will likely be made in their driver, csagent.sys, and rolled out later.

Anyone with access to a Falcon endpoint might see a change in the timestamp of the csagent.sys driver when the actual fix rolls out.
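
For what it's worth, here's a dry-run sketch of what the documented workaround amounts to (Python purely for illustration; in practice it's a del from Safe Mode / WinRE, and the path and filename pattern below are the ones from their advisory):

    # Dry-run sketch of the documented workaround (path/pattern from the advisory).
    # In practice this is a `del` from Safe Mode / WinRE, not a Python script.
    import glob, os

    channel_dir = r"C:\Windows\System32\drivers\CrowdStrike"
    for path in glob.glob(os.path.join(channel_dir, "C-00000291*.sys")):
        print("would delete:", path)   # swap in os.remove(path) to actually do it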


I've picked the perfect day to return from vacation. Being greeted by thousands of users being mad at you and people asking for your head on a plate makes me reconsider my career choice. Here's to 12 hours of task force meetings...


Huge sympathies to you. If it's any consolation, because the scale of the outage is SO massive and widely reported, it will quickly become apparent that this was beyond your control, and those demanding your 'head on a plate' are likely to appear rather foolish. Hang in there my friend.


To their credit, the stakeholder that asked for my head personally came to me and apologised once they realised that entire airports have been shut down worldwide. But yeah, not a Friday/funday hahaha


Most organisations seem to have a section of management that dissolve into batshit crazy teapots at the first hint of a crisis.


Yeah, and these types make any problem worse. Any technical problem also becomes a social problem: dealing with these lunatics while keeping the house of cards from crumbling.


I don't think it's a management thing, per se. I see it more as a personality trait for how one handles stressful situations.

I think some people can improve upon it with time and effort.


It's not a management thing, it's very much a personality trait ... that for whatever reason seems to survive in pockets of management in most organisations over a certain size.

It's not a trait that survives well at yard crew level; trade assistants that freak out at spiders either get over it or never make it through apprenticeships to become tradespeople.

In IT, those who deal with failing processes, stopped jobs, smoking hardware, insufficient RAM, and tight deadlines learn to cope or get sidelined or fired (mostly).

To be clear, I've seen people get frazzled at most levels and many job types in various companies.

My thesis is there's a layer of management in which nervous types who utterly lose their cool at the first sign of trouble can survive better than elsewhere in large organisations.

But that's just been my experience over many years in several different types of work domains.


Unless he works high up at CrowdStrike lol


<Points at BBC news live feed>

It's not just us; form an orderly queue and you'll be seen soon.

Do you really all have these mentally unstable userbases?


Ohhh absolutely. And it's not just users, it's also management. "How does this affect us? Are we compromised? What are our options? Why didn't we prevent this? How do you prevent this going forward? How soon can you have it back up? What was affected? Why isn't it everyone? Why are things still down? Why didn't X or Y unrelated vendor schlock prevent this?..."

And on and on and on. Just the amount of time spent unproductively discussing this nightmare is going to cost billions.


those are all valid questions though.


Nothing is more annoying than having a user ask a litany of questions that are obvious to the person who is already working on the problem and looking for the answers.


They’re valid for a postmortem analysis. They’re not helpful while you’re actively triaging the incident, because they don’t get you any steps closer to fixing it.


Exactly my thinking. Asking these questions doesn't help us now. But after all the action is done, they should be asked. And really should be questions that always get asked from time to time, incident or no incident.


The problem is that you are only focusing on making the computers work and not the system.

"we don't know yet" is a valid response and gives the rest something to work, and it shouldn't annoy you that it's being asked, first of all because if they are asking is because you are already late.

you have to to tell the rest of the team what you know and you don't know, and update them accordingly.

until your team says something the rest don't know if it's a 30 minute thing or the end of the world or if we need to start dusting off the faxes.


Good candidate to do a copy and paste write up you send to everyone who asks.


A large portion of this was in person


Bring a billboard with you everywhere, and point at it?


"I'll email you the full update"


"Gentlemen! You can't fight in here, this is the War Room!"

Have a nice day, anyway.


Maybe you picked the right week to start sniffing glue.


> and people asking for your head on a plate

I'd say I'd give them what they want and leave to get an ice cream at the park.

Let them try to fix it themselves.


well, you did agree to go into business with crowdstrike, and base your company IT on windows, so...


Your head belongs on the plate for not being able to point back to your recommendations for failover posture enhancement: identifying core business systems and core function roles, having fully offline emergency systems, warning of the dangers of making cloud services your only services, and then pointing to the proposed cost of implementing these systems being lower than the damage caused by an outage of core business services.

Move to a new career if you feel you don't have the ability to push right back against this.


The only surprising thing is that this doesn't happen every month. Nobody understands their runtime environment. Most IT org's long ago "surrendered" control and understanding of it, and now even the "management" of it (I use the term loosely) is outsourced.


Nowadays it seems like everyone is running stuff inside VMs because IT removes all the rights in the host system.


“Cloud” is so great huh


This is mostly physical machines in person, kiosks and pos terminals, office desktops and things like that. Windows is a tiny portion of GCP and AWS and the web in general.

I'm 100% "cloud" with tens of thousands of linux containers running and haven't been affected at all.


"I'm going to install an agent from Company X, on this machine, which it is essential that they update regularly, and which has the potential to both increase your attack surface and prevent not just normal booting but also successful operation of the OS kernel too". I am not going to provide you with a site specific test suite, you're going to just have to trust me that it wont interrupt your particular machine".


"And ofc, you pay me a shitload of money for this, I don't have to tell you why, am I?"


Why are so many mission critical hardware connected systems connected to the internet at all or getting automatic updates?

This is just basic IT common sense. You only do updates during a planned outage, after doing an easily reversible backup, or you have two redundant systems in rotation and update and test the spare first. Critical systems connected to things like medical equipment should have no internet connectivity, and need no security updates.

I follow all of this in my own home so a bad update doesn’t ruin my work day… how do big companies with professional IT not know this stuff?


CrowdStrike lets you create update strategies and rollout groups.

This update bypassed all of those settings.


Well that context makes it make a little more sense... I still wouldn't be trusting a service like that for mission critical hardware that shouldn't be connected to the internet in the first place.

The question with these types of services is: is your goal to keep the system as reliable as possible, or to be able to place the blame on a 3rd party when it goes down? If it's a critical safety system that human lives depend on, the answer better be the former.


you wouldn't be trusting it.

But that's beside the point in any enterprise environment, or even in an SMB where third parties are doing IT stuff for you. Your opinion doesn't matter there. Compliance matters. Paper risk aversion matters. And they don't always align with common IT sense and, as has been proven now, reality.


If you must trust the software not to do rogue updates then I have to swing back into the camp of blaming the operating system. Is Linux better at this?

I've noticed phones have better permissions controls than Windows, seemingly. You can control things like hardware access and file access at the operating system level, it's very visible to the user, and the default is to deny permissions.

But I've also noticed that phone apps can update outside of the official channel, if they choose. Is there any good way to police this without compromising the capabilities of all apps?


Microsoft has tried pushing app deployment and management platforms that would make this kind of thing really possible, but it constantly receives massive pushback. This was the concept of stuff like Windows S, where pretty much all apps have to be the new modern store app package and older "just run the install.exe as admin and double click the shortcut to run" was massively deprecated or impossible.


How do you keep an airline ticketing system offline? How would anybody book tickets without access to the databases?


Whitelist the persistent store?


you don't need to airgap it. just limit the access to the specific APIs/access to the database and block everything else.

CrowdStrike won't be able to upgrade itself through your database API...


[flagged]


This worked for me!


> Why are so many mission critical hardware connected systems connected to the internet at all or getting automatic updates?

Because it lets them "scale" by having fewer and cheaper offsite IT and contractors to manage vs hiring pesky onsite employees.


You do that for antivirus definition updates?


I’m not an IT professional, but I don’t use antivirus software on my personal macs and linux machines- I do regular rotated physical backups, and only install software digitally signed by trusted sources and well reviewed Pirate Bay accounts (that's a joke :-).

My only windows machine is what I would classify as a mission critical hardware connected/control device, an old Windows 8 tablet I use for car diagnostics- I do not connect it to the internet, and never perform updates on it.

I am an academic and use a lot of old multi-million dollar scientific instruments which have old versions of windows controlling them. They work forever if you don't network them, but the first time you do, someone opens up a browser to check their social media, and the entire system will fail quickly.


Yes. In an environment where you have so many clients that they can DDoS the antivirus management server, you have to stagger the update schedule anyway. The way we set it up, sysadmins/help desk/dev deployments updated on day 1, all IT workstations/test deployments updated on day 2, and all workstations/staging/production deployments on day 3.
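
If it helps anyone picture it, the scheduling logic is nothing fancier than a ring-to-delay table. A minimal sketch with made-up group names and delays, not any particular product's configuration:

    # Minimal sketch of the deployment-ring idea described above; names and
    # delays are illustrative only.
    RING_DELAY_DAYS = {
        "sysadmins_helpdesk_dev": 0,   # day 1
        "it_workstations_test": 1,     # day 2
        "all_workstations_prod": 2,    # day 3
    }

    def update_day(group: str, release_day: int) -> int:
        """Day on which a given deployment ring receives the new definitions."""
        return release_day + RING_DELAY_DAYS[group]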


what happens if there's a 0-day RCE? 72 hours of your production systems hanging out in the open...


The schedules are shockingly easy to adjust.


Probably, implicitly. Have automated regular backups, and don’t let your AV automatically update, or even if it does, don’t log into all your computers simultaneously. If you update/login serially, then the first BSOD would maybe prevent you from doing the same thing on the other (or possibly, send you running to the other to accomplish your task, and BSODing that one too!)

But yeah this is one reason why I don’t have automatic updates enabled for anything, the other major one being that companies just can’t resist screwing with their UIs.


What people aren't understanding is that MOST of the outage isn't caused by the CrowdStrike install itself; it's caused by something upstream of it (a critical application server) getting borked, and that's having a domino effect on everything else.


Remember, there's someone out there right now, without irony, suggesting that AI can fix this. There's someone else scratching their head, wondering why AI hasn't fixed this yet. And there's someone doing a three-week bootcamp in AI, convinced that AI will fix this. I’m not sure which is worse


when even jsDevOpsv can see the king is naked....


The saddest one is definitely the bootcamp one.


A heuristic that has served me well for years is that anyone who uses the word “cybersecurity” is likely incompetent and should be treated with suspicion.

My first encounter with CrowdStrike was overwhelmingly negative. I was wondering why for the last couple weeks my laptop slowed to a crawl for 1-4 hours on most days. In the process list I eventually found CrowdStrike using massive amounts of disk i/o, enough to double my compile times even with a nice SSD. Then they started installing it on servers in prod, I guess because our cloud bill wasn’t high enough.


It rather looks like Crowdstrike marketed heavily to corporate executives using a horror story about the bad IT tech guy who would exfiltrate all their data if they didn't give Crowdstrike universal access at the kernel level to all their machines...? It seems more aimed at monitoring the employees of a corporation for insider threats than for defense against APT actors.


Employees are a very important attack vector; we had multiple incidents after they downloaded the wrong kind of stuff.


Cyber- is pretty much a code prefix for anything targeted at the public sector. I too see it as a kind of dirty word TBH.


"Cyber," used on its own, is the worst of them all.


How long before companies start consciously de-risking by replacing general-purpose systems like Windows with newer systems with smaller attack surfaces? Why does an airline need to use Windows at all for operations? From what I’ve seen, their backend systems are still running on mainframes. The terminals are accessed on PCs running Windows, but those could trivially be replaced with iPadOS devices that are more locked down than Windows and generally more secure by design.


They likely run software written for windows, patched together over decades, that wouldn’t port easily to an iPad.


One of the problems possibly preventing this is that budgets for buying software aren't controlled by people administering the software. Definitely not by people using it.


Often, the cost of switching is too high or too complex to justify. On top of that, many applications commonly run in manufacturing etc., simply does not run on any other OS.


It's true that a multi-billion-dollar screw-up, with possible deaths, is a cost that is much easier to justify...


And you think that any realistic alternative (which does require appropriate funding) does not have similar risks?


The billions that have been lost, and the lives that have been lost, have, in the blink of an eye, rendered the "too costly to implement" argument moot.

For bean-counting purposes, it's just really convenient that the burden of that cost was transferred onto somebody else, so that the claim can continue to be made that another solution would still be too costly to implement.

Accepting the status-quo that got us here in the first place, under the pseudo-rational argument that there are not realistic alternatives, is simply putting ones head in the sand and careening, full steam ahead, to the next wall waiting for us.

That there might not be an alternative available currently does not mean that a new alternative cannot be actively pursued, and that it is not time for extreme introspection.


Certain backend systems run on mainframes, yes. But the airline's website? No (only the booking portion interacts with a mainframe via API calls). Identity management system? No. Etc.


Embedded Windows has always seemed like an oxymoron to me.


You lost me at iPadOS.


Never if they can help it and have heard of Santander


“Nobody ever got fired for Buying IBM”


They should’ve been.


Looks like this also took down half of New Zealand's economy.

https://www.nzherald.co.nz/nz/bank-problems-reports-bnz-asb-...


Same in Australia https://www.abc.net.au/news/2024-07-19/global-it-outage-crow...

Banks are down so petrol stations and supermarkets are basically closed. People can't check in to airline flights, various government services including emergency telephone and police are down. Shows how vulnerable these systems are if there's just one failure point taking all those down.


000 was never down, and most supermarkets and servos were still up. It was bad, but ABC appear to not have the internal capacity to validate all reports.


It's pretty bad when the main ABC 7pm news bulletin pretty much had them reading from their iPads, unable to use their normal studio background screens, and they didn't even give us the weather forecast!


Banks seem fine now. At least ING and NAB.


Crowdstrike must have some slick salespeople in New Zealand. Seems like nearly everyone uses it. Single point of failure problem.


CIO here. They are known to be incredibly pushy. In my company we RFP'd for our endpoint & cyber security. The CS salesperson went over me to approach our CEO, who is completely non-technical, to try and seal a contract while I was on leave for a week (and this was known to them). When our CEO informed me of the approach, we were happy to sign with SentinelOne.


One thing I'm really happy about at my current company is that when a sales person from a vendor (not Crowdstrike) tried that our CEO absolutely ripped them a new one and basically banned that company from being a vendor for a decade.


I had a very similar experience. I was leading the selection process for our new endpoint security vendor. The Crowdstrike people:

- verbally attacked/abused a selection team member

- ranted constantly about golf with our execs

- were dismissive and just annoying throughout

- raised hell with our execs when they learned they were not going to POC, basically going through every one of them simultaneously

- I had to get a rep kicked out of the RFP as he was constantly disrespectful

We did not pick them, and cancelled every other relationship we had with them, in the IR space for example.


It sounds like they are a bunch of shithead type-A nihilists. How stereotypical.


I got confused and thought that you were the CIO of Crowdstrike until I read further into your comment.


"CIO of CS here: we suck."



Similar being reported in Australia. Can see the effects being reported as other timezones become more active.


You mean you think the update was timed and waves will hit every hour for the next nearly 24h, or it’s just 3AM in NYC right now and under the radar?


I think the update will be applied overnight, which is a different window (no pun intended) dependent on timezone and the impact will be reported when users come back online (or not) and identify the issue.

Currently seeing this happening in real time in the UK.


Netherlands is somewhat affected: Two main airports, Rotterdam harbour, a few hospitals and news reporting.

Surprisingly: banks, government, police, fire department, railways, buses are mostly unaffected.

Maybe they have a good IT department/provider.


Probably more to do with luck than being "good".


I was at the supermarket here last night about the time it kicked off. It seemed payWave was down, there were a few people walking out empty handed as they only had Apple Pay, etc on them. But the vast majority of people seemed fine, my chipped credit card worked without issue.


Satellite TV channels too


It's crowdstrike: https://www.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_e...

> 7/18/24 10:20PT - Hello everyone - We have widespread reports of BSODs on windows hosts, occurring on multiple sensor versions. Investigating cause. TA will be published shortly. Pinned thread.

> SCOPE: EU-1, US-1, US-2 and US-GOV-1

> Edit 10:36PT - TA posted: https://supportportal.crowdstrike.com/s/article/Tech-Alert-W...

> Edit 11:27 PM PT:

> Workaround Steps:

> Boot Windows into Safe Mode or the Windows Recovery Environment

> Navigate to the C:\Windows\System32\drivers\CrowdStrike directory

> Locate the file matching “C-00000291*.sys”, and delete it.

> Boot the host normally.


Right after you enter the bit locker recovery key.

You do have your bit locker recovery key, right? .....right?


This was particularly interesting (from the reddit thread posted above):

> A colleague is dealing with a particularly nasty case. The server storing the BitLocker recovery keys (for thousands of users) is itself BitLocker protected and running CrowdStrike (he says mandates state that all servers must have "encryption at rest").

> His team believes that the recovery key for that server is stored somewhere else, and they may be able to get it back up and running, but they can't access any of the documentation to do so, because everything is down.


> but they can't access any of the documentation to do so, because everything is down.

One of my biggest frustrations with learning networking was not being able to access the internet. Nowadays you probably have a phone with a browser, but back in the day if you were sitting in a data room and you'd configured stuff wrong, you had a problem.


Isn’t that what office safes are for? I don’t know the location, but all the old guard at my company knew that room xyz at Company Office A held a safe with printed out recovery keys and the root account credentials. No idea where the key to the safe is or if it’s a keypad lock instead. Almost had to use it one time.


Just hope there is no mutual recursion, i.e. recovery key A is stored on machine B, recovery key B is stored on machine A!


I find that hilarious


Me too, as I am also not affected. But I do pity those guys who now try to solve that deadlock.


Nobody, not one person, thought that documentation should be stored in hard copy?


I'm guessing someone somewhere said that "it must be stored in hard copy in a safe" and the answer was in the range of "we don't have a safe, we'll be fine".

Or worse, if it's like where I worked in the past, they're still in the buying process for a safe (started 13 months ago) and the analysts are building up a general plan for the management of the safe combination. They still have to start the discussions with the union to see how they'll adapt the salary for the people that will have to remember the code for the safe and who's gonna be legally responsible for anything that happens to the safe. Last follow-up meeting summary is "everything's going well but we'll have to modify the schedule and postpone the delivery date of a few months, let's say 6 to be safe"


Not just financial / process barriers. I worked for a company in the early 90's that needed a large secure safe to store classified documents and removable hard drives. A significant part of the delay in getting it was figuring out how to get it into the upstairs office where it would be located. The solution involved removing a window and hiring a crane.

When we later moved to new offices, somebody found a solution that involved a 'stair-walking' device that could supposedly get the safe down to the ground floor. This of course jammed when it was halfway down the stairs. Hilarity ensued.


Any chance you have a link to that comment?


Didn't bookmark it or anything and going back to the original reddit thread I now see that there are close to 9,000 comments, so unfortunately the answer is no...




BitLocker for Business stores the bitlocker key centrally. Still, it is a huge manual undertaking fixing every system.


Absolutely correct. Unfortunately, there is no other solution to this issue. If the laptops were powered down overnight, there might be a stroke of luck. However, this will be one of the most challenging recoveries in IT history, making it a highly unpleasant experience.


Yeah in context we have about 1000 remote workers down. We have to call them and talk through each machine because we can't fix them remotely because they are stuck boot looping. A large proportion of these users are non-technical.


Man, talk about a mass-phishing opportunity.


How fortunate the phone system is not vulnerable to CrowdStrike...


I heard the central system was on Azure, running CrowdStrike.


MS Windows Recovery screen (or the OS installer disk) might ask you for the recovery key only, but you can unlock the drive manually with the password as well! I had to do that a week ago after a disk clone gone wrong, so in case someone steps on the same issue (this here is tested with Win 10, but it should be just the same for W11 and Server):

1. Boot the affected machine from the Windows installer disk

2. Use "Repair options"

3. Click through to the option to spawn a shell

4. It will now ask you for unlocking the disk with a recovery key. SKIP THAT.

5. In the shell, type: "manage-bde -unlock C: -Password", enter the password

6. The drive is unlocked, now go and execute whatever recovery you have to do.

Good luck.


On my corporate Windows 11 22H2 "manage-bde -unlock C: -Password" does not unlock the disk with the user key. I guess it needs recovery key as well.


Don’t you need more options if the key is in a TPM, or there is a password but it’s only part of the key?

Can you even get the secret from the TPM in recovery mode?


> Can you even get the secret from the TPM in recovery mode?

Given that you can (relatively trivially) sniff the TPM communication to obtain the key [1], yes it should be possible. Can't verify it though as I've long ago switched to Mac for my primary driver and the old cheesegrater Mac I use as a gaming rig doesn't have a hardware TPM chip.

[1] https://pulsesecurity.co.nz/articles/TPM-sniffing


TPMs embedded in the processor (fTPM) are pretty popular and it's a lot harder to sniff communications that stay inside the cpu.


yea I don't need an attack on a weak system, I mean the authorized legal normal way of unlocking BL from Windows when you have the right credentials. Windows might not be able to unlock BitLocker with just your password.

I don't know how common it is to disable TPM-stored keys in companies, but on personal licenses, you need group policy to even allow that.

Although this is moot if Windows recovery mode is accepted as the right system by the TPM. But aren't permissions/privileges a bit neutered in that mode?


I doubt most of the clients who use CS know what BitLocker is, let alone how to back it up, assuming it wasn’t backed up automatically by Windows.


Most people installed CrowdStrike because an audit said they needed it. I find it exceedingly unlikely that the same audit did not say they have to enable Bitlocker and backup its keys.


I can confirm this. EDR checkbox for CrowdStrike, BitLocker enabled for local disk encryption checkbox. BitLocker backups to Entra because we know reality happens, no checkbox for that.


Doesn't that get backed up automatically to the Microsoft account?


I know it does for personal accounts once linked to your machine. Years ago, I used the enterprise version and it didn’t, probably because it was “assumed” that it should be done with group policies, but that was in 2017.


That's opt-in.

In Enterprise setups the key should be backed somewhere in Active Directory.


Yes you should be able to pull it from your domain controllers. Unless they're also down, which they're likely to be seeing as Tier 0 assets are most likely to have crowdstrike on them. So you're now in a catch 22.


Log into hypervisor, rollback VM


Rolling back an Active Directory server is a spectacularly bad idea. Better make doubly sure it's not connected to any network before you even attempt to do so.


Microsoft shops gonna be running Hyper-V. Probably also got hosed.


In theory. I've seen it not happen twice. (The worst part is that you can hit the Bitlocker recovery somewhat randomly because of an irrelevant piece of hardware failing, and now you have to rebuild the OS because the recovery key is MIA.)


Saved to my desktop? How does that help? /s


Happy weekend to everyone who works there.


Can you post a summary? We're affected but I don't have access to that portal.


They've bumped this support info to a blog post that's linked from their home page: https://www.crowdstrike.com/blog/statement-on-falcon-content...

It includes PDFs of some relevant support pages that someone printed with their browser 5 hours ago. That's probably the right thing to do in such a situation to get this kind of info publicly available ASAP, but still, oof. Looks like lots of people in the Reddit thread had trouble accessing the support info behind the login screen.


"Start your free trial now." Hahahahah you have got to ne kidding me :)


Someone posted this in the thread, but I also can't log in to verify

> Summary

> CrowdStrike is aware of reports of crashes on Windows hosts related to the Falcon Sensor.

> Details

> Symptoms include hosts experiencing a bugcheck\blue screen error related to the Falcon Sensor.

> Current Action

> Our Engineering teams are actively working to resolve this issue and there is no need to open a support ticket.

> Status updates will be posted below as we have more information to share, including when the issue is resolved.

> Latest Updates

> 2024-07-19 05:30 AM UTC | Tech Alert Published.

> Support

> Find answers and contact Support with our Support Portal


They had me at "crowdstrike engineering"

So engineer-like.


Isn't Crowdstrike the same company that heavily lobbied to make all their features a requirement for government computers? https://www.opensecrets.org/federal-lobbying/clients/summary... They have plenty of money for Congress, but seemingly little for any kind of reasonable software development practices. This isn't the first time Crowdstrike has pushed system-breaking changes.


Since we are in political season here in the US, they are also well known as the company that investigated the Russian hack of the DNC.

https://www.crowdstrike.com/blog/bears-midst-intrusion-democ...


The DNC has since implemented many layers of protection including Crowdstrike, hardware keys, as well as special auth software from Google. They learned many lessons from 2016.


If I were to hazard a guess I think the OP is attempting to say they are incompetent and wrong in fingering the GRU as the cause of the DNC hacks (even though they were one of many groups that made that very obvious conclusion).


What? No.


Not you, the person you were responding to.


Afaik didn't they hack republicans too? They only released democrat emails though.


Correct. Also, the DNC breach was investigated by FireEye and Fidelis as well (who also attributed it to Russia).



The second link has nothing to do with the DNC breach. It's the Ukrainian military disagreeing with Crowdstrike attributing a hack of Ukrainian software to Russia. And ThreatConnect also attributed it to Russia: https://threatconnect.com/blog/shiny-object-guccifer-2-0-and...

>we assess Guccifer 2.0 most likely is a Russian denial and deception (D&D) effort that has been cast to sow doubt about the prevailing narrative of Russian perfidy


So Ukraine's military and the app creator denied their artillery app was hacked by Russians, which might have caused them to lose some artillery pieces? Sounds like they aren't entirely unbiased. Ironically, DNC initially didn't believe they were hacked either.


And CrowdStrike accurately pointed out all the facts.

Seems like they're pretty good at what they do. Maybe that's why so much critical infrastructure depends on them.


I mean... the DNC thought Bernie hacked them so...


Yeah this is the fringe view. The fact that the GRU is responsible is the closest thing you can get to settled in infosec.

Especially since the alternative scenarios described usually devolve into conspiracy theories about inside jobs


There's something of a difference between 'alternative scenarios' and demonstrating that the 'settled' story doesn't fit with the limited evidence. One popular example is that the exploit Crowdstrike claim was used wasn't in production until after they claimed it was used.


>There's something of a difference between 'alternative scenarios' and demonstrating that the 'settled' story doesn't fit with the limited evidence.

You've failed to demonstrate that, since your second link doesn't show the Ukrainian military disputing the DNC hack, just a separate hack of Ukrainian software, and the first link doesn't show ThreatConnect disagreeing with the assessment. ThreatConnect (and CrowdStrike, Fidelis, and FireEye) attributes the DNC hack to Russia.

>One popular example is that the exploit Crowdstrike claim was used wasn't in production until after they claimed it was used.

Can you provide more info there?


> You've failed to demonstrate that

I see that now. I should have been more careful while searching for and sharing links. I have shot myself in the foot. And I'm not going to waste my time or others digging for and sharing what I think I remembered reading. I've done enough damage today. Thank you for your thorough reply.


Ok, who did it then?


According to that link, the most money they contributed to lobbying in any of the past 5 years was $600,000; most years it was around $200,000. That's barely the cost of a senior engineer.


You'd be surprised how cheap politicians are.


IIRC Menendez was accused and found guilty of accepting around $30,000 per year from foreign governments?


That's probably only the part they had the hard proof for.

Also, the press release[1] says:

> between 2018 and 2022, Senator Menendez and his wife engaged in a corrupt relationship with Wael Hana, Jose Uribe, and Fred Daibes – three New Jersey businessmen who collectively paid hundreds of thousands of dollars of bribes, including cash, gold, a Mercedes Benz, and other things of value

and later:

> Over $480,000 in cash — much of it stuffed into envelopes and hidden in clothing, closets, and a safe — was discovered in the home, as well as over $70,000 in cash in NADINE MENENDEZ’s safe deposit box, which was also searched pursuant to a separate search warrant

This seems to be more than $120K over 4 years. Of course, not all of the cash found may be result of those bribes, but likely at least some of it is.

[1] https://www.justice.gov/usao-sdny/pr/us-senator-robert-menen...


I always half-jokingly think "should I buy a politician?"

I feel like a few friends could go in on it.


It could be like an "insurance" where people pay for politician lobbying. Pool our resources and put it in the right spots.


Ok but that point still defeats the premise that Crowdstrike are spending a large enough amount on lobbying that it is hampering their engineering dept.


I believe the OP was using figurative language. The point seems to be that _something_ is hampering their engineering department and they shouldn't be lobbying the government to have their software so deeply embedded into so many systems until they fix that.


In the UK, a housing minister was bribed with £12,000 in return for a £45m tax break.

3750:1 return on investment, you don't get many investments that lucrative!


Given its origin and involvement in these high profile cases I always thought Crowdstrike is a government subsidized company which barely has any real function or real product. I stand corrected I guess.


This still doesn't demonstrate that it has any real function tbf.


Business Continuity Plan chaos gorilla as a service.


There's something missing here... You know nothing about Crowdstrike (as per your own statement) and critical infrastructure depends on them.

Those two things tell us something about your knowledge ;)


On the bright side, they are living up to their aptronym.


I wonder if it might start becoming a common turn of phrase. "Crowdstrike that directory", etc.


There's a brokenness spectrum. Here are some points on it:

- operational and configured

- operational and at factory defaults

- broken, remote fixable

- crowdstruck (broken remotely by vendor, but not fixable remotely)

- bricked

Usage:

> don't let them install updates or they'll crowdstrike it.


> Isn't Crowdstrike the same company the heavily lobbied to get make all their features a requirement for government computers?

Do you have any more sources on this specifically? The link you gave doesn't seem to reference anything specific.


Seems to be a perfectly rational decision to maximise short term returns for the owners of the company.

Now make of that what you will.


This demonstrated that Crowdstrike lacks the most basic of tests and staging environments.


Corporate brainrot strikes again.


What many people are not getting is why we are here:

one simple reason: all eggs in one Microsoft PC basket

Why in one Microsoft PC basket?

- most corporate desktop apps are developed for Windows ONLY

Why are most corporate desktop apps developed for Windows ONLY?

- it is cheaper to develop and distribute, since 90% of corporations use Windows PCs (chicken-and-egg problem)

- the alternative, Mac laptops, are 3x more expensive, so corporations can't afford them

- there are no robust industrial-grade Linux laptops from PC vendors (lack of support, fear that Microsoft may penalize them for promoting Linux laptops, etc.)

1/ Most large corporations (airlines, hospitals etc.) can AFFORD & DEMAND that their software vendors provide their business desktop applications in both Windows and Linux versions, and install a mix of both operating systems.

2/ The majority of corporate desktop applications can be web applications (browser-based), removing the dependence on the single vendor of Microsoft Windows PCs/laptops.


Windows is not the issue here. If all of the businesses used Linux, a similar software product, deployed as widely as Crowdstrike, with auto-update, could result in the same issue.

Same goes for the OS; if let's say majority of businesses used RHEL with auto updates, RedHat could in theory push an update, that would result bring down all machines.


Agree. The monoculture simply accelerates the infection because there are no sizable natural barriers to stop it.

Windows and even Intel must take some blame, because in this day and age of vPro on the board and rollbacks built into the OS, it's incredible that there is no "last known good" procedure to boot into the most recent successfully booted environment (didn't NT have this 30 years ago?), or to remotely recover the system. I pity the IT staff that are going to have to talk Bob in Accounting through BitLocker and some sys file, times 1000s.

IT gets some blame, because this notion that an update from a third party can reach past the logical gatekeeping function that IT provides, directly into their estate, and change things, is unconscionable. Why don't the PCs update from a local mirror that IT has put through canary testing? Do we trust vendors that much now?

Poor Crowdstrike. This might be the end for them.


https://access.redhat.com/solutions/7068083

Just last month there were issues between RHEL's kernel update and Crowdstrike.


I'd like to read more about it, but that link is... paywalled I think? It's not even clear.


I keep a RedHat developer account active for their documentation. I didn't notice.

I did find these related forum/reddit threads:

https://forums.rockylinux.org/t/crowdstrike-freezing-rockyli...

https://www.reddit.com/r/crowdstrike/comments/1cluxzz/crowds...

Good to see Rocky keeping their promise of bug for bug compatibility.


I would posit that RedHat have a slightly longer and more proven track record than Crowdstrike, and more transparent process with how they release updates.

No entity is infallible but letting one closed source opaque corporation have the keys to break everything isn’t resilient.


His example actually had 2 parts. One RH bricking the OS, the other one with a commercial vendor creating software with separate auto update.

You've only addressed the RH OS angle.


Yes but the problem here was bricking the OS


>> Windows is not the issue here.

Yes it is. Windows was created for the "Personal Computer" with zero thought initially put into security. It has been fighting that heritage for 30 years. The reason Crowdstrike exists at all is due to shortcomings (real or perceived) in Windows security.

Unix (and hence Linux and macOS) was designed as a multi-user system from the start, so access controls and permissions were there from the start. It may have been a flawed security model and has been updated over time, but at least it started with some notion of security. These ideas had already expanded to networks before Microsoft ever heard the word Netscape.


> was designed as a multi-user system from the start, so access controls and permissions were there from the start.

Right and Windows NT wasn't? Obviously it supported all of those things from the very beginning (possibly even in a superior way to Unix in some cases considering it's a significantly more modern OS)...

The fact that MS developed another OS called Windows (3.1 -> 95 -> 98) prior to that which was to some extent binary compatible with NT seems somewhat tangential. Otherwise the same arguments would surely apply to MacOS as well?

> These ideas had already expanded to networks before Microsoft ever heard the word Netscape.

Does not seem like a good thing on its own to me. It just solidifies the fact that it's an inherently less modern OS than Windows (NT) (which still might have various design flaws, obviously, that might be worth discussing; it just has nothing whatsoever to do with what you're claiming here...)


We have Crowdstrike on our Linux fleet. It is not merely a malware scanner but is capable of identifying and stopping zero-day attacks that attempt local privilege escalation. It can, for example, detect and block attempts to exploit CVE-2024-3094 - the xz backdoor.

Perhaps we need to move to an even more restrictive design like Fuchsia, or standardize on an open source eBPF based utility that's built, tested, and shipped with a distribution's specific kernel, but Windows is not the issue here.


Security is a complex and deeply evolved field. Many modern required security practices are quite recent from a historical perspective because we simply didn't know we would need them.

A safe security first OS from 20 years ago would most likely be horribly insecure now.


[flagged]


The Linux kernel predates Windows 95 (as do the first distributions). GNU predates even the first version of Windows.


The Mac predates Windows.


That's assuming in this alternate universe we'd also be using kernel antivirus software to counter malware. It's far from obvious.


Yes, staggered software updates are the way to go. There was a reply in this thread on why Crowdstrike did not do it -- they don't want the extra engineering cost.

Having 1/3 of an airline's computers each on Windows, RHEL, and Ubuntu makes it unlikely they'd all hit the same problem at the same time.


But you're more likely to encounter problems. That's likely a good thing as it improves your DR documentation and processes but could be a harder sell to the suits.


The update here is relevant for catching 0-day exploits.

Without the update, your system is "naked" for the duration.


But then it'd be putting all eggs in the Linux pc basket, wouldn't it? I think they point was that more heterogeneity would make this not be a problem. If all your potatoes are the same potato it only takes one bad blight epidemic to kill off all farmed potatoes in a country. If there's more heterogeneity things like that doesn't happen.


The difference being that RHEL has a QA process which CrowdStrike apparently does not. The quality practices of companies involved in open source are apparently much higher than those of large closed-source "security" firms.

I guess getting whined at because obscure things break in beta or rc releases has a good effect for the people using LTS.


Maybe this is pie-in-the-sky thinking, but if all the businesses used some sort of desktop variant of Android, the Crowdstrike app (to the extent that such a thing would even be necessary in the first place) would be sandboxed and wouldn't have the necessary permissions to bring down the whole operating system.


More secure OSes would consider an application being able to take down the entire OS as a security issue and would make that impossible.


When notepad hits an unhandled exception and the OS decides it's in an unpredictable state, the OS shuts down notepad's process. When there's an unhandled exception in kernel mode, the OS shuts down the entire computer. That's a BSOD in Windows or a kernel panic in Linux. The problem isn't that CrowdStrike is a normal user mode application that is taking down Windows because Windows just lets that happen, it's that CrowdStrike has faulty code that runs in kernel mode. This isn't unique to Windows or Linux.

The main reason they need to run in kernel mode is you can't do behavior monitoring hooks in user mode without making your security tool open to detection and evasion. For example, if your security tool wants to detect whenever a process calls ShellExecute, you can inject a DLL into the process that hooks the ShellExecute API, but malware can just check for that in its own process and either work around it or refuse to run. That means the hook needs to be in kernel mode, or the OS needs to provide instrumentation that allows third party code to monitor calls like that without running in kernel mode.

IMO, Windows (and probably any OS you're likely to encounter in the wild) could do better providing that kind of instrumentation. Windows and Office have made progress in the last several years with things like enabling monitoring of PowerShell and VBA script block execution, but it's not enough that solutions like CrowdStrike can do their thing without going low level.

Beyond that, there's also going to be a huge latency between when a security researcher finds a new technique for creating processes, doing persistence, or whatever and when the engineering team for an OS can update their instrumentation to support detecting it, so there's always going to be some need for a presence in kernel mode if you want up to date protection.


I mean, to me that's just a convincing argument that kernel-mode spywa-, err, endpoint protection with OTA updates that give you no way to stage or test them yourself cannot be secure.


How are those arguments against kernel-level detection from a security perspective? His arguments show that without kernel-level access, you either can't catch all bad actors, as they can evade detection, or the latency is so big that an attacker basically has free rein for some time after detection.


Easy: plenty of people in this forum aren't entrenched in the security field.

That's why there are so many misinformed assumptions


The SolarWinds story was quickly forgotten, and this one will be too, and we'll continue to build such special single points of global catastrophic failure into our craftily architected, decentralized, highly robust, horizontally scaled, multi-datacenter-region systems.


The SolarWinds story wasn't forgotten. Late last year the SEC launched a complaint against SolarWinds and its CISO. It was only yesterday that many of the SEC's claims against the CISO were dismissed.


Solarwinds is still dealing with the reputation damage and fallout today from that breach. People don’t forget about this stuff. the lawsuits will likely be hitting crowdstrike for years to come


Lenovo and Dell have some laptops with Linux, and they are very good ones.

(not sure if you meant rugged ones, that may not be the case, but I guess this is a tiny percentage of the market)


Crowdstrike also has an Ubuntu Linux version. We're required to install it at work.


No less than three baskets, or you cannot apply for bailouts. If you want to argue your industry is a load-bearing element in the economy: no less than three baskets.


Making everything browser based doesn't help (unless you can walk across the room and touch the server). The web is all about creating fast-acting local dependency on the actions of far-away people who are not known or necessarily trusted by the user. Like crowdstrike, it's about remote control, and it's exactly that kind of dependency that caused this problem.

I love piling on Microsoft as much as the next guy, but this is bigger than that. It's a structural problem with how we (fail to) manage trust.


If it's true that a bad patch was the reason for this I assume someone, or multiple people, will have a really bad day today. Makes me wonder what kind of testing they have in place for patches like this, normally I wouldn't expect something to go out immediately to all clients but rather a gradual rollout. But who knows, Microsoft keeps their master keys on a USB stick while selling cloud HSM so maybe Crowdstrike just yolos their critical software updates as well while selling security software to the world.


Sounds like it was a 'channel file' which I think is akin to an av definition file that caused the problem rather than an actual software change. So they must have had a bug lurking in their kernel driver which was uncovered by a particular channel file. Still, seems like someone skipped some testing.

https://x.com/George_Kurtz/status/1814235001745027317

https://x.com/brody_n77/status/1814185935476863321


The parser crashing the system on a malformed input file strongly suggests their software stack in general is trash


Sounds like something a fuzzer likely would have found pretty quickly.
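
Even the dumbest possible harness tends to shake this kind of thing out quickly. A minimal sketch, where parse_channel_file is a hypothetical stand-in for whatever actually consumes those channel files (not a real CrowdStrike API):

    # Dumbest-possible fuzz harness sketch. parse_channel_file is hypothetical --
    # a stand-in for whatever code reads the C-00000291*.sys content files.
    # Assumes a non-empty seed input.
    import random

    def fuzz(parse_channel_file, seed_bytes: bytes, iterations: int = 10_000):
        for _ in range(iterations):
            data = bytearray(seed_bytes)
            # flip a handful of random bytes, and sometimes truncate the input
            for _ in range(random.randint(1, 8)):
                data[random.randrange(len(data))] = random.randrange(256)
            if random.random() < 0.2:
                data = data[: random.randrange(len(data))]
            try:
                parse_channel_file(bytes(data))
            except Exception:
                pass  # handled errors are fine; what matters is a crash or hang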


How about a try-catch block? The software reading the definition file should be minimally resilient against malformed input. That's like programming 101.


A bad page fault in a kernel driver isn't exactly something you recover from with a try-catch like that.


Who needs testing when apologizing to your customers is cheaper?


Reputational damage from this is going to be catastrophic. Even if that’s the limit of their liability it’s hard not to see customers leaving en masse.


Ironically some /r/wallstreetbets poster put out an ill-informed “due diligence” post 11 hours ago concerning CrowdStrike being not worth $83 billion and placing puts on the stock.

Everybody took the piss out of them for the post. Now they are quite likely to become very rich.

https://www.reddit.com/r/wallstreetbets/s/jJ6xHewXXp



That user is the equivalent of using a screwdriver to look for gold and succeeding.


Not sure what material in their post is ill-informed. Looks like what happened today is exactly what that poster warned of in one of their bullet points.


Yea, everyone is dunking on OP here. But they essentially said that crowdstrike's customers were all vulnerable to something like this. And we saw a similar thing play out only a few years ago with SolarWinds. It's not surprising that this happened. Ofc with making money the timing is the crucial part which is hard to predict.


A convenient alibi?


The company will perish, there is no doubt in that.


Nah they'll be fine. It happened 7 months ago on a smaller scale, people forgot about that pretty quickly.

You don't ditch the product over something like this as the alternative is mass hacking.


Is the alternative "mass hacking"? I thought all this software did was check a box on some compliance list. And slow down everyone's work laptop by unnecessarily scanning the same files over and over again.


I assume you're not in Sec industry?

This sounds like someone who said "dropbox ain't hard to implement"


As someone said earlier in these comments the software is required if you want to operate with government entities. So until that requirement changes it is not going anywhere and continues to print money for the company.


But then, if what you say is true and their software is indeed mandatory in some context, they also have no incentive or motivation to care about the quality of their product, about it bringing actual value or even about it being reliable.

They may just misuse this unique position in the market and squeeze as much profit from it as possible.

The mere fact that there exists such a position in the market is, in my opinion, a problem because it creates an entity which has a guaranteed revenue stream while having no incentive to actually deliver material results.


If the government agencies insist on using this particular product then you're right. If it's a choice between many such products than there should be some competition between them.


Surely there are more than one anti-virus that can check the audit box?


From experiencing different AV products at various jobs, they all use kernel level code to do their thing, so any one of them can have this situation happen.


Presumably those other companies try running things at least once before pushing it to the entire world though.


I'd kind of expect IT administrators to try out these updates on a staging machine before fully deploying to all critical systems. But here we are.


You, the admin, don't get to see what Falcon is doing before it does it.

Your security ppl. have a dashboard that might show them alerts from selected systems if they've configured it, but Crowdstrike central can send commands to agents without any approval whatsoever.

We had a general login/build host at my site that users began having terrible problems using. Configure/compile stuff was breaking all the time. We thought...corrupted source downloads, bad compiler version, faulty RAM...finally, we started running repeated test builds.

Guy from our security org then calls us. He says: "Crowdstrike thinks someone has gotten onto linux host <host>, and has been trying to setup exploits for it and other machines on the network; it's been killing off the suspicious processes but they keep coming back..."

We had to explain to our security that it was a machine where people were expected to be building software, and that perhaps they could explain this to CS.

"No problem; they'll put in an exception for that particular use. Just let us know if you might running anything else unusual that might trigger CS."

TL;DR-please submit a formal whitelist request for every single executable on your linux box so that our corporate-mandate spyware doesn't break everyone's workflow with no warning.


EDR stands for Endpoint Detection and Response.

People don't realize there's that last bit: Response, what do you do when something is Detected.

That's your Admin setup.


Some of them might have saner rollout strategy and/or better quality control.


AV definitions need to be rolled out quickly for 0-days.

Developers aren't used to the security lifecycle, so quite a few commenters in this thread equate the SDLC with security.


Extremely unlikely. This isn't the first blowup Crowdstrike has had, though it's the worst (IIRC). Crowdstrike is "too big to fail", with tons of enterprise customers who have insane switching costs, even after this nonsense.

Unfortunately for all of us, Crowdstrike will be around for awhile.


Businesses would be crazy to continue with Crowdstrike after this. It's going to cause billions in losses to a huge number of companies. If I was a risk assessment officer at a large company I'd be speed dialling every alternative right now.


The cybersecurity industry has regular annual security testing and competitions run by various organizations that simulate tons of attacks.

Vendors are tested against these cases and graded on their effectiveness.

I heard Crowdstrike is "best-in-market" for good reasons, as others with deeper knowledge of the industry have shared in this thread.


> I heard Crowdstrike is "best-in-market"

A friend of mine who used to work for Crowdstrike tells me they're a hot mess internally and it's amazing they haven't had worse problems than this already.


That sounds like every other company I have ever worked for: looks great from the outside but is a hot mess on the inside.

I have never worked for a company where everything is smooth sailing.

What I've noticed is that the smaller the company, the less of a hot mess they are, but at the same time they're also struggling to pay the bills because they don't innovate fast.


it would be crazy not to at least investigate migration paths away from Crowdstrike, or better redundancies for yourself


While it probably should, I regret to inform you that SolarWinds is still alive and well.


I mean, Boeing is still around...


I would assume that its enterprise customers have an uptime SLA as part of their contract, and that breaching it isn't very cheap for Crowdstrike.


I highly doubt their SLA says something about compensating for damages. At most you won't have to pay for the time they were down.

And even more ironically: a botched update doesn't mean they are down. It means you are down. So I don't even think their SLA applies to this.


Yeah, they'll pay with "credits" for the downtime, if what is currently happening even technically qualifies as downtime.


Software doesn't have uptime guarantees. They might have time-to-fix on critical issues, though.

I assume this is gross negligence, which would leave them open to claims made through courts, though.


As of 4am NY time, CRWD has lost $10bn (~13%) in market cap. Of course they've tested, just not enough to catch this issue (as is often the case).

This is probably several seemingly inconsequential issues coming together.

What I'm not sure about is why, when the system is this important, even successfully tested updates aren't rolled out piecemeal (or perhaps they are, and we're only seeing the result of partial failures around the world).


Testing is never enough. In fact, it won't catch 99% of issues, because tests usually cover only the happy paths, and they only cover what humans can think of, which is by no means exhaustive.

A robust canarying mechanism is the only way you can limit the blast radius.

Set up A/B testing infra at the binary level so you can ship updates selectively and compare their metrics.

Been doing this for more than 10 years now, it's the ONLY way.

Testing is not.


Depends on what you mean by enough. It should be more than enough to catch issues like this one specifically.

If they can't even manage that they'll fail at your approach as well.


Canary offers more bang for the buck, and is much easier to set up. So I kind of disagree.


> Canary offers more bang for the buck

I'm not sure that justifies potentially bricking the devices of hundreds(?) of your clients by shipping untested updates to them. Of course it depends... and would require deeper financial analysis.


They won't be able to test exhaustively every failure mode that could lead to such issues.

That's why canaries are easier and more "economical" to implement and give better value per unit of effort.


> They won't be able to test exhaustively every failure mode that could lead to such issues.

That might be acceptable. My point is that if you are incapable of having even absolutely basic automated tests (that would take a few minutes at most) for extremely impactful software like this, starting with something more complex seems like a waste of time (clearly the company is run by incompetent people, so they'd just mess it up).


But they can test obvious failure modes like this one. You need both.


Exactly. They knocked half the world offline, probably killed thousands in ERs, and the stock is only down to about June lows.


And when it’s more costly for customers to walk back the mistake of adopting your service.

Yeah, I get the impression a lot of SaaS companies operate on this model these days. We just signed with a relatively unknown CI platform, because they were available for support during our evaluation. I wonder how available they’ll be when we have a contract in place…


hah that tweet was one heck of an apology. "we deployed a fix to the issue, speak with your customer rep"


Unfortunately cybersecurity still revolves around obscurity.


Doesn't matter what testing exists. More scale. More complexity. More Bugs.

It's like building a gigantic factory farm. And then realizing that the environment itself is the birthing chamber and breeding ground of superbugs with the capacity to wipe out everything.

I used to work at a global response center for big tech once upon a time. We would get hundreds of issues we couldn't replicate, because we would literally have had to set up our own government or airline or bank or telco to test certain things.

So I used to joke with the corporate robots that they should just hurry up and take over governments, airlines, banks and telcos already, because that's the only path to better control.


> It's like building a gigantic factory farm. And then realizing that the environment itself is the birthing chamber and breeding ground of superbugs with the capacity to wipe out everything.

Factorio player detected


Testing + a careful incremental rollout in stages is the solution. Don't patch all systems world-wide at once, start with a few, add a few more, etc. Choose them randomly.


Here's hoping they start from the top.

They won't, but hope springs eternal.


i've seen photos of the bsod from an affected machine; the error code is `PAGE_FAULT_IN_NONPAGED_AREA`. here are some helpful takeaways from this incident:

1) mistakes in kernel-level drivers can and will crash the entire os

2) do not write kernel-level drivers

3) do not write kernel-level drivers

4) do not write kernel-level drivers

5) if you really need a kernel-level driver, do not write it in a memory unsafe language


I've said this elsewhere but the enabling of instant auto-updates on software relied on by a mission critical system is a much bigger problem than kernel drivers.

Just imagine that there's a proprietary firewall that everyone uses on their production servers. No kernel-level drivers necessary. A broken update causes the firewall to blindly reject any kind of incoming or outgoing request.

Easier to roll back because the system didn't break? Not really: you can't even get into the system anymore without physical access. The chaos would be just as bad.

A firewall is an easy example, but it can be any kind of application. A broken update can effectively bring the system down.


There sure are a lot of mission-critical systems and companies hit by this. I am surprised that auto-updates are enabled. I read about some large companies/services in my country being affected, but also a few which are unaffected. Maybe they have hired a good IT provider.


I'm not surprised, seeing how this madness has even infected OSS/Linux.

https://github.com/canonical/microk8s/issues/1022

A k8s distribution. By Canonical. Screams production; no one is using this on their gaming PC. Comes with... auto-updates enabled through snap.

Yup, that once broke prod at a company I worked at.

Should our DevOps guy have prevented this? I guess so, though I don't blame him. It was a tiny company and he did a good job given his salary, much better than at similar companies here. The blame goes to Canonical: if you make this the default, it had better come with a giant, unskippable warning sign during setup and on boot.


Snap auto-updates pissed me off so much I started Nix-ifying my entire workflow.

Declarative, immutable configurations for the win...


One thing to consider with security software, though, is that time is of the essence when it comes to getting protection against 0-day vulnerabilities.

Gotta think that the pendulum might swing in the other direction now and enterprises will value gradual, canary deployments over instant 100% coverage.


I'm not a Windows programmer so the exact meaning of PAGE_FAULT_IN_NONPAGED_AREA is not clear to me. I am familiar with UNIX style terminology here.

Is this just a regular "dereferencing a bad pointer", what would be a "segmentation violation" (SEGV) on UNIX, a pointer that falls outside the mapped virtual address space?

As this is in ring 0 and potentially has direct access to raw, non-virtual physical addressing, is there a distinction between "paged memory" (virtual address space) and "nonpaged memory" (physical address) with this error?

Is it possible to have a page fault failure in a paged area (PAGE_FAULT_IN_PAGED_AREA?), or would that be non-fatal and would be like "minor page fault" (writing to a shared page, COW) or "major page fault" (having to hit disk/swap to bring the page into physical memory)?

Are there other PAGE_FAULT_ errors on Windows?

Searching for this is difficult, as all the results are for random spammy user-centric tech sites with "how do I solve PAGE_FAULT_IN_PAGED_AREA blue screen?" content, not for a programmer audience.




Basically all AV either runs as root or uses a kernel driver. I guess the former is preferable


Rust's memory safety does not prevent category errors like using nonpaged memory for things that are supposed to be paged, and vice versa.


this all-or-nothing mindset is reductive and defeatist: harm reduction is valuable. sure, rust won't magically make your kernel driver bug-free, but it will reduce the surface area for bugs, which will likely make it more stable.


Yes, I fully agree.

Unfortunately, we have decades of first Haskell pseudo-fans, then a sidequest of generic "static typing (don't look at how weak the type system is)" pseudo-fans, and now Rust aficionados who act like it's all-or-nothing and types will magically fix everything, including category and logic errors.

At some point tiredness and reactivity set in.


Other takeaways:

- do not put critical infrastructure online

- do not push updates that work around the update schedule

- do not push such updates to all machines at once

- do not skip testing and QA, relevant to the number and kind of the machines affected

Even one of these would have massively improved the situation, even with a kernel-level driver written in an unsafe language.


A memory-safe language does not prevent crashes.

Where there would have been potential UB (and then memory corruption), you get a guaranteed crash instead.

Wait, crash? :wink:


did you have a crowdstroke while writing this reply?


The problem is that some viruses may run in the kernel mode, so an AV has to do the same, or it will be powerless against such viruses.


If a virus got that far, you're already in trouble. What stops them from attacking the anti-virus?


If you think AV cannot stop viruses at the same privilege level, then that is all the more reason for AV to run in kernel mode. Because by your logic, an AV in user mode cannot stop a virus in user mode.


>5) if you really need a kernel-level driver, do not write it in a memory unsafe language

I C what you're doing... >_>


pointing out the obvious? why are you upset i’m stating mixing hot oil and water will make a mess?


0) don't load a new driver into your working kernel.


an audio driver once blue screen of death'd my windows whenever i started Discord.

i'm surprised i'm not hearing a stronger call for microkernels yet


5) Well, how many of the kernel-level drivers we rely upon ARE written in a memory-unsafe language? Like 99%?

And we are not crashing and dying every day?

Sure, Rust is the way to go. It just took Rust 18 years to mature to that level.

Also, quite frankly, if your unwrap() makes your program terminate because of an array out of bounds, isn't that exactly the same thing? (The program terminates.)

But IMHO, if we are hopping across a minefield every second of every day, well... if this is the worst-case scenario, it's not that bad after all.


> Well how much of those kernel-level drivers we rely upon ARE written in a memory unsafe language ??? Like 99% ? And we are not crashing and dying every day?

we shouldn't discount the consequences of memory safety vulnerabilities just because flights haven't physically been grounded.

> Also, quite frankly, if your unwrap() makes your program terminate because an array out of bounds isn't that exactly the same thing ? (program terminates)

this is a strawman, if you were writing a kernel-level driver in rust you'd configure the linter to deny code which can cause panics.

here's a subset:

- https://rust-lang.github.io/rust-clippy/master/index.html#/u...

- https://rust-lang.github.io/rust-clippy/master/index.html#in...


Not a helpful takeaway, I've yet to see a Java kernel driver.


Nobody is telling you to use Java. Although, if you want to revive Singularity that would be pretty neat.


And I never said that anyone is telling me to use Java. It was an example.

Because of the nature of AV software, its code would be drowning in "unsafe" memory accesses no matter the language we chose. This is AV; by its very design it's always trying to read memory that isn't its own.

This is a story about bad software management processes, not programming languages.


Reading memory from another process can be done through memory-safe APIs.

To give an example from the linux userspace world: https://docs.rust-embedded.org/rust-sysfs-gpio/nix/sys/uio/f...


be the change you wish to see


This was apparently caused by a faulty "channel file"[0], which is presumably some kind of configuration database that the software uses to identify malware.

So there wasn't any new kernel driver deployed, the existing kernel driver just doesn't fail gracefully.

[0]: https://x.com/brody_n77/status/1814185935476863321


Why on earth don't they have staged rollouts for updates?


Every time I look into such catastrophic issues, it always boils down to a lack of robust canarying mechanisms.

They have a large enough client base that they could even run A/B tests at the whole-binary level, but no.


Also, why not have some sort of graceful degradation (well, kind of)? Like: the OS boots, loads the CS driver, the driver loads some new feature/config, and before/after loading the new thing a "runtime flag" marks whether it worked. If it didn't, then on the next reboot that thing either gets disabled or falls back to the previous known-good config (obviously some combination of things might cause another issue), instead of blindly rebooting into the same state...


I think pfSense does this (from memory, it's been a while since I used it). Basically dual partitions: if it failed to come up on the active partition after an update, it would revert. Granted, you need the space for two partitions, but for a small partition/image that's not so bad.

What surprises me is that if it's a content update and the code fell over when dealing with it, isn't that just basically bad release engineering, not catering for that in the first place? I.e. some tests in the pipeline before releasing the content update would've picked it up, given it sounds like a 100% failure rate.


The problem space kind of dictates that this couldn't be a solution, because malware could load an arbitrary feature/config and mark it as 0, and then the AV would be disabled on the next boot, right?


fair point indeed!


Why put effort in engineering when you can just fear monger in marketing and buy politicians in sales?


More importantly, why are CS customers not validating? Upstream patches should be treated as faulty/malicious if not tested to show otherwise, especially if they're kernel level.


For a while I've joked with family and colleagues that software is so shitty on a widespread basis these days that it won't be long before something breaks so badly that the planet stops working. Looks like it happened.


Perhaps a dumb question for someone who actually knows how Microsoft stuff works...

Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

Added: OK, from another post I now know Crowdstrike has some sort of kernel mode that allows this sort of catastrophe on Linux. So I guess there is a bigger question here...


> Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

Because malware that gets into a system will do just that -- install its own backdoor drivers -- and will then erect defenses to protect itself from future updates or security actions, e.g. change the path that Windows Update uses to download new updates, etc.

Having a kernel module that answers to CrowdStrike makes it harder for that to happen, since CS has their own (non-malicious) backdoor to confirm that the rest of the stack is behaving as expected. And it's at the kernel level, so it has visibility into deeper processes that a user-space program might not have (or that are easy to spoof).


Or, much more likely, the malware will use a memory access bug in an existing, poorly written kernel module (say, CrowdStrike?) to load itself at the kernel level without anyone knowing, perhaps then flashing an older version of the BIOS/EFI and nestling there, or finding its way into a management interface. Hell, it might even go ahead and install an existing buggy driver by itself if it's not already there.

All of these invasive techniques end up making security even worse in the long term. Forget malware - there's freely available cheating software that does this. You can play around with it, it still works.


Maybe I am in the minority, but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.

If there is any place that historically was exploited more than everything else, it was broken parsers. Congratulations: if such an exploit-carrying file is now read by your AV software, it sits at a position where it is allowed (expected) to read all files, and it would not surprise me if it could write them as well.

And you just doubled the number of places in which things can go wrong. Your system/software that reads a PNG image might do everything right, but do you know how well your AV-software parses PNGs?

This is just an example, but the question we really should ask ourselves is: why do we have systems where we expect malicious files to just show up in random places? The problem with IT security is not that people don't use AV software, it is that they run systems so broken by design that AV is sprinkled on top.

This is like installing a sprinkler system in a house full of gasoline. Imagine gasoline everywhere, including in some of the water piping — in the best case your sprinkler system reacts in time and kills the fire, in the worst case it sprays a combustible mix into it.

The solution is of course not to build houses filled with gasoline. Meanwhile AV-world wants to sell you ever more elaborate, AI-driven sprinkler systems. They are not the ones profiting from secure systems, just saying..


> but it always puzzled me that anybody in IT would think a mega-privileged piece of software that looks into all files was a good idea.

Because otherwise, a piece of malware that installs itself at a "mega-privileged" level can easily make itself completely invisible to a scanner running as a low-priv user.

Heck, just placing itself in /root and hooking a few system calls would likely be enough to prevent a low-priv process from seeing it.


You're ignoring the parent's question of "why do we have systems where we expect malicous files to just show up in random places?", which I think is a good question. If a system is truly critical, you don't secure it by adding antivirus. You secure it by restricting access to it, and restricting what all software on the machine can do, such that it's difficult to attack in the first place. If your critical machines are immune to commodity malware, now you only have to worry about high-effort targeted attacks.


My point exactly. Antivirus is a cheap on top measure thst makes people feel they have done something, the actual safety of a system comes from preventing people and software from doing things they shouldn't do.


Why would you design a system where a piece of malware can "install itself" at a mega-privileged position?

My argument was that this is the flaw, and everything else is just trying to put lipstick on a pig.

If you have a nightclub and you have a problem controlling which people get in, the first idea would not be to keep a thousand unguarded doors and then recruit people to search the inside of your nightclub for people they think didn't pay.

You would probably think about reducing the number of doors and adding effective mechanisms to them that help you with your goals.

I am not saying we don't need software that checks files at the door; I am saying we need to reduce the number of doors leading directly to the nightclub's cash reserve.


I wonder why and how does security software read a PNG file. Sure it's not tough to parse a PNG file, but what does it look for exactly?


Some file formats allow data to be appended or even prepended to the expected file data and will just ignore the extra bytes. This has been used to create executables that happen to also be valid image files.

I don't know about PNG, but I'm fairly sure JPEG works this way. You can concatenate a JPEG file to the end of an executable, and any JPEG parser will understand it fine, as it looks for a magic string before beginning to parse the JPEG.

A JPEG that has something prepended might raise an eyebrow. A JPEG that has something executable prepended should raise alarms.
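To illustrate the concatenation trick (a toy example with made-up file names; cmd's binary copy simply joins the bytes of both files):

    rem append a JPEG after an executable; the loader ignores the trailing image data
    copy /b app.exe + photo.jpg app_polyglot.exe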


Why make something like that executable in the first place? I like the Unix model where things that should be executable are marked as such. I know bad parsers and format decoders can lead to executable exploits, but I've always felt uncomfortable with the Windows .exe model. Also VBA in Excel, Word... I believe a better solution would be a minimal executable surface rather than invasive software.


Vendors are allowed to install drivers, even via Windows Update. Many vendors, like HP, install functionality like telemetry as drivers to make it more difficult for users to remove the software.

So next time you think you are doing a "clean install", you are likely just re-installing the same software that came with the machine.


It doesn't install the driver, it is the driver. As for the Linux version, it uses eBPF which has a sandbox designed to never crash the kernel. Windows does have something similar nowadays, but Crowdstrike's code probably predates it and was likely just rawdogging the kernel.


> Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?

While the files are named XXX.SYS they are apparently not drivers. The issue is that a corrupted XXX.SYS was loaded by the already-installed driver, which promptly crashed.


As I understand it, it was a definition update that caused a crash inside the already-installed driver.


I guess this article might need some updating soon:

https://www.crowdstrike.com/resources/reports/total-economic...


"Falcon Complete managed detection and response (MDR) delivers 403% ROI, zero breaches and zero hidden costs"


I'm always curious about how security software can provide an ROI.

I had McAfee tell me one time that the HackerSafe logo on our website would increase sales by 10%; this was at a Fortune 50 doing billions in online sales every year.

I was pretty hyped because it would have done wonders for my career, but then they walked it back and wouldn't explain it to me. I wasn't mad, I was disappointed.


I ran an A/B test in 2012, not sure it's relevant now: we tested the McAfee logo and conversion was boosted by 2%. A bigger boost came from a lock icon: 3%. It kept increasing the more locks we added and topped out at 5% after 5 lock icons.


The intersection of ROI and human psychology!

1 lock: “looks safe, I buy”

2 locks: “wow really safe, I buy more”

50 locks: “I’m being lied to”


> ...delivers -407% ROI...

FTFY.


"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. It demands the same skill, devotion, insight, and even inspiration as the discovery of the simple physical laws which underlie the complex phenomena of nature."

"The most important property of a program is whether it accomplishes the intention of its user."

C.A.R. Hoare


Agreed, but have you been in the industry lately? Nobody hires assembly programmers anymore. If you want money, you must work at the wobbly top of abstraction mountain.


I am well aware, but the quotes are timeless for a reason. Not to be cheeky, but "Want money" is exactly how you get to the many routinely broken endpoint solutions that wind up reducing reliability and at times increasing the attack surface. Wherever you are in the stack, please make it more robust and easier to reason about. No matter how far from the assembly.


It’s not just about the tech abstraction mountain, it’s about the app logic and dev process too.

A react native JS app with a clear spec and a solid release process can be more reliable than bloated software that receives an untested hotfix, even if the latter was handwritten in assembly.


My entire emergency department got knocked offline by this. Really scary when you have ambulances coming in and are trying to stabilize a heart attack.

Update: 911 is down in Oregon too, no more ambulances at least.


Do you have offline backup processes at least? Nasty situation.


We're really prepared for Epic to go down and have an isolated cluster that we access in emergencies. I transitioned from software engineering so I've only been in the ED for a year, but from what I could see there didn't seem to be a plan for what to do if every computer in the department bluescreened at once.


"Always look on the bright side of life!" - M Python


But at least the best instant messaging app in the world Microsoft Teams and the best web browser in the world Microsoft Edge are working fine, right?


It's a bsod loop, so not really.


What do we do next week?

So assuming everyone uses sneaker-net to restart what's looking like millions of Windows boxes, the recriminations come, but then... what?

I think we need to look at minimum viable PC - certain things are protected more than others. Phones are a surprisingly good example - there is a core set of APIs and no fucker is ever allowed to do anything except through those. No matter how painful. At some point MSFT is going to enforce this the way Apple does. The EU court cases be damned.

For most tasks, for most people, it's hard to argue that anything more than an OS and a web browser is needed.

We have been saying it for years - what I think we need is a manifesto for much smaller usable surface areas


In this case even dockerized environments would allow you to redeploy with ease.

But that's too much work; many of these systems are running Docker-resistant software. Management doesn't want to invest in modernization: it works this quarter, and it's someone else's problem next quarter.

You're basically proposing Windows 12 to radically limit what software and drivers can do. Even then eventually someone will probably still break it with weird code.

I'm actually amazed these updates are being tested in prod. Do they have no QA environments ?

Do I personally need to create a startup company called Paranoia... We actually run a clone of your prod environment minus any sensitive data, then we install all the weird and strange updates before they hit your production servers...

As an upsell we'll test out privileges, to make sure your junior engineers can't break prod.

Someone raise a seed round, I'm down to get started this week.


> In this case even dockerized environments would allow you to redeploy with ease.

Not if the CIO mandated that your bare-metal OS hosting Docker has to run a rootkit developed by bozos.


Isn't that basically the point of WinRT and Windows 10 S Mode? The problem is getting developers to adopt the new more secure APIs.


I think this is existential for Windows, and by extension MSFT. Something like 95% of corporate IT activity is either over HTTP (i.e. every SaaS and web app) or over the serial port (controlling that HVAC, that window blind, that garage lifter).

So what we need in 95% of boxes is not a fully capable PC - we need a really locked down OS. Or rather we can get by with a locked down OS.

I would put good money on there already being a tiny OS from the ground up in MSFT that could be relabelled windows-locked-Down(13) and sold exclusively to large corporates (and maybe small ones who sign a special piece of marketing paper)

The thing is, once you do that you are breaking the idea that Windows can run everywhere (or rather, we claim Linux runs everywhere, but the thing that's on my default Ubuntu install and the thing on my router are different).


So apparently "The issue has been identified, isolated and a fix has been deployed" https://x.com/George_Kurtz/status/1814235001745027317

Yet the chaos seems to continue. Could it be that this fix can't be rolled out automatically to affected machines because they crash during boot - before the Crowdstrike Updater runs?


Correct. Many just end up in an endless loop and never actually boot.

It's about as bad as it gets.


That update is so tone-deaf and half-assed. There's no apology.

If you go to the website, there's nothing on their front-page. The post on their blog (https://www.crowdstrike.com/blog/statement-on-windows-sensor...) doesn't even link to the solution. There's no link to "Support Portal" anywhere to be seen on their front-page. So, you have to go digging to find the update.

And the "Fix" that they've "Deployed" requires someone to go to Every. Single. Machine. Companies with fleets of 50k machines are on this HN thread - how are they supposed to visit every machine?!?!


They won't apologize for legal reasons. Also, it will only make their stock fall further.


The CEO actually did apologize: "We're deeply sorry for the impact that we've caused to customers, to travelers, to anyone affected by this..."

https://www.reuters.com/technology/crowdstrike-ceo-apologize...


Any response they make in the middle of a global outage will be half-assed. They have all available resources figuring out what the hell just happened and how to fix it.

An apology this early is a lose-lose. If they do apologize they'll piss off the people dealing with it, who want a fix, not an apology. If they don't apologize they're tone deaf and don't seem to care.


Imagine being anywhere near the team that sent this...


lol sounds good, but how the hell do they deploy a fix to a machine that has crashed and is looping BSOD with no internet or network connectivity...

You do what I've been doing for the last 10 hours or so: you walk to each and every desktop and manually type in the BitLocker key so you can remove the offending update.

at least the virtual devices can be fixed sitting at a desk while suckling at a comfort coffee..


Yeah, you need to manually fix each affected system by booting in safe mode. Not possible to do remotely.


And you will need your BitLocker recovery key to access your encrypted drive in safe mode. Luckily I had mine available offline.

There's going to be a lot of handholding to get end users through this.
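For reference, the unlock step from a recovery command prompt looks roughly like this (a sketch; the drive letter and the 48-digit key are placeholders for whatever your environment actually uses):

    rem unlock the OS volume with the BitLocker recovery password before touching files
    manage-bde -unlock C: -RecoveryPassword 111111-222222-333333-444444-555555-666666-777777-888888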


You can enable safemode for next boot without the recovery key and then you can delete the offending file on that next boot.


That requires being able to boot in the first place


You can do a minimal boot. I'm told.


Ouch!



There's potentially a huge issue here for people using BitLocker with on-prem AD, because they'll need the BitLocker recovery keys for each endpoint to go in an fix it.

And if all those recovery keys are stored in AD (as they usually are), and the Domain Controllers all had Crowdstrike on them...


Bitlocker keys are apparently not necessary: https://x.com/AttilaBubby/status/1814216589559861673


It might work on some machines, but I doubt it will work on the rest. Worth a try.


This is the best definition of "single point of failure" I have ever seen.


Assuming that they also have a regular Bitlocker password, there's hope with a bit manual effort. https://news.ycombinator.com/item?id=41003893


Most of the large deployments I've seen don't use pre-boot PINs, because of the difficulty of managing them with users - they just use TPM and occasionally network unlock.

So might save a few people, but I suspect not many.


Yeah but TPM-only Bitlocker shouldn't be affected anyway by this issue, these machines should start up just fine.

Whoever only has AD-based Bitlocker encryption is straight up fucked. Man, and that on a Friday.


That's the easy part? just do the domain controller first?


I got around BitLocker and booted into safe mode by setting automatic boot to safe mode via bcdedit https://blog.vladovince.com/mitigating-the-crowdstrike-outag...
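For anyone searching later, the commands are roughly this (a sketch from a WinRE command prompt, assuming the default boot entry; see the linked post for the specifics):

    rem force the next boot into minimal Safe Mode
    bcdedit /set {default} safeboot minimal
    rem ...reboot, remove the offending CrowdStrike file, then undo it:
    bcdedit /deletevalue {default} safeboot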


> CrowdStrike Engineering has identified a content deployment related to this issue and reverted those changes.

> Workaround Steps:

> Boot Windows into Safe Mode or the Windows Recovery Environment

> Navigate to the C:\Windows\System32\drivers\CrowdStrike directory

> Locate the file matching “C-00000291*.sys”, and delete it.

> Boot the host normally.


Was thinking about a bootable usb-stick that would do that automagically. But I guess it is harder to boot from a usb-stick in these environments than the actual fix.

I guess more feasible and even neater to do it if you have network boot or similar.


So booting into safe mode should do the trick right, even if Bitlocker is enabled?


What if you have 50k workstations? Can you even do this remotely?

The problem may be fixed but I can see some companies having a really shit weekend.


2000s vibes.


This gem from the ABC news coverage has my mind 100% boggled:

"711 has been affected by the outage … went in to buy a sandwich and a coffee and they couldn’t even open the till. People who had filled up their cars were getting stuck in the shop because they couldn’t pay."

Can't even take CASH payment without the computer, what a world!


Technically a payment terminal can go into island mode and take offline credit card transactions and post them later. PIN can be verified against the card.

Depends if the retailer wants to take the chance of all that.


That is if the terminal is not dead itself


The terminal is probably not running Crowdstrike...


Terminal running Windows? Someone is going to make it run Crowdstrike too.


You might be surprised..


Crowdstrike != Windows ;)


"Probably" is a load bearing word


Dude. SO MUCH STUFF runs on Windows.


The terminal is probably fine; the machine that tells it the number to charge is dead... And it is probably not even set up to accept manual payment inputs.


Yeah, all depends on how much config you want to allow employees to do, but I’m sure the functionality is there if you wish to enable it.


Having worked with some of these retail systems, yes, it depends on how they are configured.

There are stores in many places in the country with sporadic internet or where outages are not uncommon, and where you would want to configure the terminals to still work while offline. In these cases, the payment terminals can be configured to take offline transactions, and they are stored locally on the lane or a server located in the store until a connection to the internet is re-established.


Not this time. Use paper, pen, and a non-electronic cash box.


Good luck putting a payment terminal into island mode when it's in a bluescreen loop.


I'm seeing several reports of things like being unable to buy tickets for the train on-line in Belgium.

They use Windows as a part of their server infrastructure?


At least they'd take cash if the computer wasn't broken. That's getting quite rare in the UK.


Not really? I've only really seen people not taking cash at trendy street food stalls and bougie coffee shops, pretty much everywhere else does.


Not just indie coffee shops - chains too. Pubs, clothes shops... Even the Raspberry Pi store


The Netherlands is the worst at this. More and more "PIN ONLY". Also, increasingly tight rules about how much cash you're allowed to have.

Luckily I can just give someone a paper wallet containing crypto. No transactions, no traceability, no rules.


In London it’s really common


Where, though? The example given was a 711, which is a nationwide chain a bit like a Tesco Express or Sainsbury's Local, both of which still accept cash nationwide in the UK too.


Aldi, apparently. That's where Piers Corbyn couldn't buy strawberries with cash. https://www.mirror.co.uk/news/uk-news/piers-corbyn-splits-op...


This whole thing likely would have been averted had microkernel architectures caught on during the early days (with all drivers in user mode). Performance would have likely been a non-issue, not only due to the state of the art L4 designs that came later, but mostly because had it been adopted everything in the industry would have evolved with it (async I/O more prevalent, batched syscalls, etc.).

I will admit we've done pretty well with kernel drivers (and better than I would have ever expected tbh), but given our new security focused environment it seems like now is the time to start pivoting again. The trade offs are worth it IMO.


Not disagreeing with you but we need operating systems with snapshots before updates and a trivial way to rollback the update.

Linux has some immutable OS versions and also btrfs snapshots and booting a specific snapshot from the GRUB bootloader


I wonder if for critical applications we'll ever go back to just PXE booting images from a central server: just load a barebones kernel and the app you want to run into a dedicated memory segment, mark everything else as NX, and you don't even have to worry about things like viruses and hacks anymore. Run into an issue? Just reboot!


I just skimmed through the news. A lot of airports, hospitals, and even governments are down! It's ironic how people put all their eggs in one basket, trying to avoid downtime caused by malware by relying on a company that took their systems down. A lot of lessons will be learned from this, for sure.


Unless you run half your devices on one security vendor and half on another surely there is no way round it? Companies install this stuff over "Windows Defender" so they can point fingers at the security vendor when they get hacked, this is the other side of the coin.

Security software has had unwanted effects before, but I can't say I remember anyone else managing to bluescreen Windows and require a safe-mode boot to fix the endpoints.


Relying on easy-install "security vendors" is the problem. It's one thing to run an antivirus on a general purpose PC that doesn't have a qualified human admin. But many of the computers affected here are single-purpose devices, which should operate with a different approach to security.


Speaking as somebody who manages a large piece of a 911 style system for first responders and has done so for 10 years (and is not affected by this outage) - this is why we do not allow third parties to push live updates to our systems.

It's unfortunate, the ambulances are still running in our area of responsibility, but it's highly likely that the hospitals they are delivering patients to are in absolute chaos.


Thank god all the critical infrastructure in my country is still on MS-DOS!


Hahaha, you mean all the CRITIC~1.INF ?


> Hahaha, you mean all the CRITIC~1.INF ?

Kids those days. It shall be CRITIC~1.COM


That school running their HVAC infra on an Amiga must be pretty happy.


The biggest mistake here is running a global update on a Friday. Disrespect to every sysadmin worldwide.


Disrespect to every CIO who made their business depend on a single operating system, running automatic updates of system software without any canaries or phased deployments.


You're saying I should diversify my 100% Linux operation to also use Windows?


While I believe Linux is a more reasonable operating system than Windows, shit can happen everywhere.

So if you have truly mission-critical systems you should probably have at least 2 significantly different systems, each able to maintain some emergency operations independently. Doing this with 2 Linux distros is easier than doing it with Linux and Windows. For workstations, Macs could be considered; for servers, BSD.

Probably many companies will accept the risk that everything goes down. (Well, they probably don't say that. They say maintaining a healthy mix is too expensive.)

In that case you need a clearly phased approach to all updates. First update some canaries used by IT. If that goes well update 10% of the production. If that goes well (well, you have to wait until affected employees have actually worked a reasonable time) you can roll out increasingly more.

No testing in a lab (whether at the vendor or you own IT) will ever find all problems. If something slips through and affects 10% of your company it's significantly different from affecting (nearly) everyone.


Maybe some OpenBSD would be a good hedge. It can also help spot over-reliance on some Linux quirks.


What makes you think Windows is the only alternative? Have you never heard of GNU Hurd?

More seriously, I am not saying you should run critical services on MenuetOS or RISC OS, but the BSDs are still alive and kicking, as are illumos and its derivatives. And yes, I think a bit of diversity allows some additional resilience. It may require more staff, but IMHO it is worth the downsides.


The biggest mistake is not ringfencing this update in a test environment before sign-off for general deployment.


Presumably they do test their updates; maybe the tests just aren't good enough.

The ideal would be to do canary rollouts (1%, then 5%, 10% etc.) to minimise blast radius, but I guess that's incompatible with antiviruses protecting you from 0-day exploits.


While I'm usually a proponent of update waves like that, I know some teams can get loose with the idea if they determine the update isn't worth that kind of carefulness.

Not saying CS doesn't care enough, but what may have looked like a minor update not worth a slow rollout to the team that shipped it is actually something that really should be supervised in that way.


Our worst outage occurred when we were deploying some kernel security patches and grew complacent, updating the main database and its replica at the same time. We had a maintenance window with downtime at the same time anyway, so whatever. The update worked on the other couple hundred systems.

Except, unknown to us, our virtualization provider had a massive infrastructure issue at exactly that moment, preventing VMs from booting back up... That wasn't a fun night, failing services over into the secondary DC.


Was this update meant to protect against a 0-day?


Update: change color of text in console


Agreed. What happened to Patch Tuesdays?!


I don't think the day matters anymore, really.

The issue is the update rollout process, the lack of diversity in these kinds of tools across the industry, and the absolute failure of the software industry to make decent software without bugs and security holes.


Yeah, airlines prefer mid-week chaos & grounding.


Crowdstrike is a perfect name for a company that could cause a worldwide outage.


Makes me think of flystrike which is also a perfect analogy https://en.wikipedia.org/wiki/Myiasis


Yeah, I've always thought it was a bad name. I see them during Formula 1 advertisements because they sponsor the Mercedes team.

They might as well have named themselves "cluster bomb" as they have done a huge amount of damage today and for the next few days.


Quick fix that worked for us, in safe mode:

1. Go into drive C:
2. Open the System32 folder
3. Open Drivers
4. Rename the CrowdStrike folder to something else (doesn't matter what).
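Roughly, from a Safe Mode command prompt, that's (a sketch assuming C: is the system drive; the new name is arbitrary):

    rem rename the folder so the driver can no longer find its channel files
    ren C:\Windows\System32\drivers\CrowdStrike CrowdStrike.bak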


Could you potentially do the same by just attaching the HDD to another computer as a secondary drive and renaming the folder if safe mode falls through?


Probably not, unless the disk isn't BitLockered.


GLOBAL OUTAGES

- Major banks, media and airlines affected by major IT outage

- Significant disruption to some Microsoft services

- 911 services disrupted in several US states

- Services at London Stock Exchange disrupted

- Sky News is off air

- Flights in Berlin grounded

- Reports the issue relates to problem at global cybersecurity firm Crowdstrike


Sky News (UK) is back on air, but they seem to have no astons / chyrons / on screen graphics at all, and I don't think they're able to show prerecorded material either (it's just people in the studio from what I've seen), so presumably they're still having fun issues with their general production systems.


Oh, look at that, TV has suddenly become watchable without all of the scrolling garbage all over the screen.


But my attention span! Nooo!


Just got their ticker back at 09:42 !


Having another look in, it looks like they're overlaying a static banner along the bottom rather than having a fully working graphics system yet (as of 11:30), as that's the only graphics I've seen.


Good news for crowdstrike! It shows how critical their services are. Stock to go up! (And down a bit when they get sued, and up a bit when they don't get sued too much, etc.)


CrowdStrike achieved PMF - Product-Market Fiasco!


Take a look at their stock right now.


Tried; "Real time quote data is not available". Big-brain move: can't crash the stock price if the Nasdaq is offline...


Google tells me "Pre-market 295.00 −48.05 (14.01%)"


That's 85.99% too high.


Yes, this has had no (statistically significant) effect.


Pre-market is down 12% at the time of writing. I'd say this definitely has an effect.


I have seen pictures shared of vending machines with a BSOD... why would you have EDR on a vending machine? To protect the valuable assets?!


If the vending machine handles credit cards, wouldn't Visa / Mastercard / etc. basically require it as part of their security requirements? Or it's just general CYA from someone that's backfired badly.


"It's a vending machine that sells back physical stolen credit cards." Input the dollars, receive stolen credit card. Put it next to an airport terminal for maximum impact.


presumably they handle credit card transactions


> - Services at London Stock Exchange disrupted

just their news service.

> The London Stock Exchange says it's working as normal - but says there are problems with its RNS (regulatory news service).

> "RNS news service is currently experiencing a third party global technical issue, preventing news from being published on www.londonstockexchange.com," the statement says.

> "Technical teams are working to restore the service. Other services across the group, including London Stock Exchange, continue to operate as normal."


I was just at Coles, one of Australia's big 2 supermarkets.

50% of the self-serve terminals were down. Presumably the others aren't far behind.

Total chaos.


Schiphol (Amsterdam, Europe's largest) and Melbourne Airport are down


Can someone explain what Crowdstrike actually is? Reading Wikipedia it seems to be some sort of anti-virus software?


CS is an EDR (Endpoint Detection & Response), and it connects to other parts like XDR (Extended Detection and Response) and MDM (Mobile Device Management). They differ from the typical antivirus in how they detect threats. An AV usually checks against known threats, while an EDR detects endpoint behavior anomalies. For example, if your browser spawns a shell, it will be flagged and the process quarantined. Of course, they share a lot of common ground, like real-time protection, cloud analysis, etc., and some AVs have most of the EDR capabilities, and some EDRs have most of the AV capabilities. That's the brief version.


We're running something similar ( not CS ) where I work.

It seems to me that these tools create lots of problems (they slow down the machine significantly, get things wrong and quarantine processes/machines when they shouldn't, inject themselves into processes and thus change behaviour, etc.).

The main question I have is: does anyone have an actual instance of such tools detecting something useful? No one in the office was able to show one.


I contracted for a company that gave me a company-issued MacBook with CrowdStrike. It logged my execve() or something, because I did a curl from rustup piped to sh, and this alerted an admin who then contacted me to ask if this was legitimate behaviour.


Worked for a fairly largish org (~40k emps), and one of the "security" gurus roped me into a conversation because he found a batch file in my Teams shared files. The contents:

set JAVA_HOME="what_ever_path"

and asked me to explain this egregious hacking attempt.


My company had a mandatory requirement to install it. If you look into it, it logs and spies on everything you do: every DNS request, every website, every application, etc.

Now my M3 Ultra MacBook work computer that they gave me is a 4000 USD Teams/email machine, since I prefer to work on computers without spyware.


I understand your preference. I have two questions:

1) Do you think that an organization should have no protections in place? 2) Why not just work from the machine they provided you, and do everything else on a personal machine?


> 1) Do you think that an organization should have no protections in place?

Do you think Crowdstrike offers protection ?


I assume from your rhetorical question that you don't. I personally don't know enough about it to say whether it does or not, but I will make what I believe is a reasonable assumption and say that, all else being equal, yes, a fleet of machines with an EDR sensor installed is more "protected" than a fleet without.

If you have a point to make, why not just say what you are trying to say; it will be more effective discourse. I am genuinely curious.


The key to tools like CrowdStrike is not so much protection as being able to trace an attack through the infrastructure. They can see that your credentials were compromised on your machine, and which systems you (or that bad process) then connected to, so they can trace the attack and make sure it all gets cleaned up.


My favorite work story is from 10 years ago. We had an internal IRC server for the devs. I'd written an IRC bot to do some basic functions. It was running on my desktop.

I get a call from IT on my work phone. My co-workers hear my end of the conversation:

"No, it's not a bot net. It's just one bot. Yeah, I wrote it and it talks IRC."

Thankfully they left me alone.


You also forgot the part that it is a tool to spy on everything the employees do if it is installed on their computers.


It’s watching the system for events like “file was opened” and “process started”, and looking for patterns resembling hackers/malware.

It’s different from AV in that it mostly looks at runtime behavior and not signatures.


AV with shiny bits stuck on the side and a good marketing team.


I see it's not just the Software development ecosystem that got affected by the cult of Hipsterism.

If I was Alex Jones, I'd go further and blame this on a decade of DEI and fluoride in the water. /s


Yes, it's rebranded antivirus for enterprise with a fancy new name: "endpoint security". It also has remote fleet management and firewall features.


It is one of the best systems available for real-time protection of Windows systems against various threat actors. Prior to today you could probably have said "no one gets fired for recommending Crowdstrike as the security tool for the company." It is everywhere, and in particular if you are a large org with a lot of Windows seats you are likely a Crowdstrike customer.


> realtime protection of windows systems against various threat actors

So it's AV + a firewall? What does it actually do?


Consume a lot of CPU and occasionally delete development build artifacts?


Sounds just like your average antivirus then.


What the heck is it doing? My work laptop fan always seems to be blasting air whether it is 10pm or 3am. It's in a reboot loop now so I just shut it off.

My Linux machines are all quiet when nothing is running. In contrast, I go to the bathroom at 10pm or 3am and the work laptop fan is blasting. I've logged in and seen security stuff taking up CPU cycles; it happens at least a few times an hour. I wonder how much electricity the world is wasting on this crap.

When I first got the laptop, when I started this job 5 years ago, I thought it must be infected with malware because it was always running the fan, so I put it in a separate VLAN so it couldn't attack my home Linux machines. IT told me it was security software. Who knew the cyber attack would come from inside the security software?


It massively increases your attack surface but lets you tick the "cybersecurity" box on your audit. It's a good trade for many people, it seems.


Can you please elaborate on the increase in attack surface? I know it's a kernel driver, so maybe that's what you meant?


You're giving a third-party company remote admin access to all your systems (by their ability to push their own code updates to your systems).


Some of these services go even further. One time, our IT department was being sales-bombed with a service that would remove our actual login credentials to servers, and then "for security" we'd access said servers using a MITM website kind of thing that would be behind our corporate AD-login. I didn't even find out the full intricate details before telling them to "nope this the fuck out" and stay away with a 10-ft pole.

It's like these people have nothing better to do with their time and just absolutely have to design and build a product for the sake of it, and then dump it on marketing for > 0 amounts of sales by pretty much wearing IT departments down. Or, in the case of this Crowdstrike thing, through the protection racket known as security audit compliance.


I'm mandated to use one of those.

The security tradeoffs don't make sense at all once you understand how it works.

Ssh or winrm are significantly more secure than whatever some security vendor thinks will tick an audit box.

10ft pole is an excellent approach.


It injects itself into (at least) every executable startup and every executable write to disk. It's quite noticeable if you have it installed and run, say, an installer that unpacks a lot of DLL files, because each one gets checksummed and the checksum sent to a remote host. Every time.

I hated it before this incident and I will be bringing this incident up every time it is mentioned.


ptraces all processes and feeds the resulting logs through some regex looking for suspicious patterns.


So it exists because nobody has any idea what the execution graph of their programs is, and CS is down because of that too... Do we really need this level of dynamism in our programs?


And like most AV systems it seems to be a bigger threat than what it supposedly protects against. Seriously, how is it acceptable to have one corporation push a live update and take down tons of critical services all over the world? Just imagine what a malicious actor could accomplish with such a delivery vector.


Indeed. The xz backdoor team must be kicking themselves: "We spent years getting our own vector into a tool, only for our world domination plans to be thwarted at the last minute ... we could have just bribed someone at CS!"


Of course, this tells you a lot about the sad state of "realtime protection" software.


> realtime protection of windows systems

and mac, and linux


More like enterprise-level spyware


A botnet that checks whether the bots in your botnet act like bad bots and could be considered bad too. It also checks whether some of your files match AV signatures, and reads all your logs if you really want.


seems to be an MDM solution that's doing tons of stuff


"MDM solution" leaves me even more confused than before.


It is just an RDBI deploying an FSM for enterprise GTX solutions.

(clarification: FSM refers to an AIIT for SEV)


Chances of Microsoft or Crowdstrike being held liable for financial losses caused by this outage?


Financial losses? The comment you're replying to is mentioning heart attack treatment here. We're talking about deaths. Most of us won't like to hear this, but for all of us who work on SaaS that is deployed on servers around the world, our bugs cause people to die. It's a given that at least a dozen people will die directly (medical flights and hospitals are both being hit) due to this broken update, let alone indirectly.


I don't think the parent comment was ignoring that. The penalty for a company that does this can't be to bring someone back from the dead; it's likely to be financial, which is the aspect they're talking about.


If this was a Japanese company, the entire c-suite would have committed seppuku by now.


[flagged]


As others have already stated, yes, that is how we should be interpreting comments: in good faith and in the most charitable way, as the site guidelines suggest.


Good. That's the HN way.


That's how I read it, ie "will there be severe fines for the loss of life and other losses for this?".


You’re basically asking for a virtue signalling disclaimer. I think you’d prefer a different social network.


I've finally learned to spot and ignore all emotional arguments.

https://www.scribbr.com/fallacies/appeal-to-emotion


If companies want the nice parts of being "a person", they should also deal with the bad parts of being a person. Financial fines are not enough. Though I'm not sure how we'd build a jail cell for an entire company.


Fines are not enough because a large enough fine will kill a company, destroying lots of jobs and supply chains.

Why not dilute the shareholder pool by a serious amount? There's no need for formal nationalization: the government can sell the shares back over time without actually exercising control.

Also fire execs and ban them from holding office on publicly traded companies for the foreseeable future.

Seizing shares doesn't impact the cash flow of the company directly, thus shouldn't cause job losses, but shareholders (who should put pressure on executives and the board to act with prudence to avoid these kinds of disasters) are adequately punished.


> Fines are not enough because a large enough fine will kill a company, destroying lots of jobs and supply chains.

That could be amazing: "Ooopsie, in punishing Crowdstrike they've ended up folding and now there's a second global outage."


This actually sounds like a workable idea, but the implementation would be extremely thorny (impact on covenants, governance, voting rights, non-listed companies, etc) and take forever to get done. It would also punish everyone equally, even though they clearly do not share equal blame.

You probably want, in addition to your proposal, executive stock-based compensation to be awarded in a different share class, used to finance penalties in such cases where the impact is deemed to be the result of gross negligence at the management level.


> but shareholders (who should put pressure on executives and the board to act with prudence to avoid these kinds of disasters) are adequately punished.

So if I own some Vanguard mutual fund as part of a retirement account, it’s now on me to put pressure on 500+ corporations?

Perhaps it’s on Vanguard to do so…but Vanguard isn’t going to just eat the cost of increased due diligence requirements. My fees will increase.

How does that increased due diligence even work? It’s not like I or Vanguard can see internal processes to verify that a company has adequate testing or backups or training to prevent cases like today’s failure.

When, on average, X number of those 500 companies in my mutual fund face this share seizure penalty per year…am I just supposed to eat the loss when those shares disappear? Does Vanguard start insuring against such losses? Who pays for that insurance in the end?

This doesn’t even really hurt the shareholders who are best placed to possibly pressure a company. This doesn’t hurt “billionaire executive who owns 40% of the outstanding shares”. I mean, sure, it will hurt that little part of their brain that keeps track of their monetary worth and just wants to see “huge number get huger”…but it doesn’t actually hurt them. It just hurts regular folks, as usual.


If you own a mutual fund, then you do not own shares of the 500 companies, rather you own shares of the mutual fund itself.

Consequently you don't put pressure on the 500 companies, you put pressure on the mutual fund and the mutual fund in turn puts pressure on the companies it invests in and exercises additional discretion in which companies it invests in.

>Perhaps it’s on Vanguard to do so…but Vanguard isn’t going to just eat the cost of increased due diligence requirements.

Yes they do, because mutual funds do compete with one another and a mutual fund that does the due diligence to avoid investing in companies that are held liable for these kinds of incidents will outperform the mutual funds that don't do this kind of due diligence.

> It’s not like I or Vanguard can see internal processes to verify that a company has adequate testing or backups or training to prevent cases like today’s failure.

I don't know specifically about Vanguard, but mutual funds in general do employ the services of firms like PwC, Deloitte, and KPMG to perform technical due diligence that assesses the target company's technology, product quality, development processes, and compliance with industry standards. VC firms like Sequoia Capital and Andreessen Horowitz do their own technical due diligence.


Just perhaps the idea of sticking everyone's retirement funds into massive passive vehicles was a bad one and has an unhealthy effect on the market, as you illustrate here. It is the way of things now so I see your point and it would be harmful to people, but getting in this situation has seemingly removed what could be a natural lever of consequence. We can't really hold companies accountable lest all the "regular folks" that can't actively supervise what they're investing in become collateral damage.


Other stocks will go up as a result. It's not like money is ever destroyed.


The death penalty could be an option? Dissolve the company, seize their assets, bar anyone involved from ever running or owning a company again.


Should be, but I don't know that that's appropriate for involuntary manslaughter.

Do it to Boeing, sure.


Hold the board of directors and the C-suite personally, corporally accountable -- immediate changes for the better will follow.


You'd seize the company from its current shareholders.

That gives shareholders of other companies good reason to care going forward.


> Though I'm not sure how we'd build a jail cell for an entire company.

Same thing with AI. You can't punish an AI, it has no body.


At least with AI you could do something like, destroy all copies including backups, destroy all training data and other code used to generate it. Which to me actually doesn't seem unreasonable punishment.


We must demand both financial and criminal liabilities against the perpetrators! Get the torches and pitchforks out! We need to teach them a lesson!


I did not mean to imply this, as there's a very long culpability chain. For this reason, I'm not sure if it makes any sense to imprison individuals for this. A lot of people played a part in this causing such chaos.

But it is something to be very aware of for those of us who develop software run in e.g. hospitals and airlines, and should receive more attention, instead of only bringing up financial losses which is what usually happens. I noticed the same with the big ransomware attacks.


Indeed. It's a pity that we need major failures like these for governments to finally start paying attention and give software the same kind of laws as anything else, instead of careless EULAs and updates without field testing.


It's very bizarre to me how normalized we have made kernel-level software in critical systems. This software is inherently risky but companies throw it around like it's nothing. And cherry on top, we let it auto-update too. I'm surprised critical failures like this don't happen more often.


I can't tell if you're serious or sarcastic, but there is such a thing as criminal negligence.

CrowdStrike knows that their software runs on computers that are in fricken hospitals and airports, they know that a mistake can potentially cause a human death. They also know how to properly test software, and they know how to do staggered releases.

Given what we know now, it seems pretty likely that to any reasonable person, the amount of risk they took when deploying changes to clients was in no way reasonable. People absolutely should go to jail for this.


Also corporate manslaughter, in some countries: https://en.wikipedia.org/wiki/Corporate_manslaughter

This more or less originated with the unfortunately named MS Herald of Free Enterprise sinking (https://en.wikipedia.org/wiki/MS_Herald_of_Free_Enterprise) - after that incident, regulators decided that maybe they didn't want enterprise quite as free as all that, and cracked down significantly on shipping operators (though the attempt to prosecute its execs for corporate manslaughter did fail).


I made a separate (longer) comment about this..

Why don't orgs test their updates? Every decent IT management/governance framework under the sun demands that you test your updates. How the hell did so many orgs that are ISO 2700x, COBIT, PCI-DSS, NIST CSF, etc. certified fail so hard??

(ToS/contracts will probably get you out of any damages.)


Testing for most organizations is usually either really, incredibly expensive or an ineffective formality which leaves them at more risk than it saves. If you aren’t going to do a full run through all of your applications, it’s probably not doing much and very few places are going to invest the engineer time it takes to automate that.

What I take from this is that vendors need a LOT more investment in that work. They have both the money and are best positioned to do that testing since the incentives are aligned better for them than anyone else.

I’m also reminded of all of the nerd-rage over the years about Apple locking down kernel interfaces, or restricting FDE to their implementation, but it seems like anyone who wants to play at the system level needs a well-audited commitment to that level of rigorous testing. If the rumors of Crowdstrike blowing through their staging process are true, for example, that needs to be treated as seriously as browsers would treat a CA for failing to validate signing requests or storing the root keys on some developer’s workstation.


> Why don't orgs test their updates?

Because historically orgs have been really bad with applying updates: either no updates or delayed updates resulting in botnets taking over unpatched PC's. Microsoft's solution was to force the updates unconditionally upon everybody with very few opportunities to opt out (for large enterprise customers only).

Another complication comes from the fact that operating system updates are not essential for running a business, especially for small businesses – as long as the main business app runs, the business runs. And most businesses are too far removed from IT to even know what an update is and why it is important. Hence the dilemma of fully automated vs manually applied and tested updates.


> Microsoft's solution was to force the updates unconditionally upon everybody with very few opportunities to opt out (for large enterprise customers only).

Not a Microsoft's fan, but this is not true. Everyone who has Windows Server somewhere, with some spare disk space for the updates, has this ability. Just install and run WSUS (included in Windows Server) and you can accept/reject/hold indefinitely any update you want.


Not disagreeing, however:

1) the prevailing majority of laptop and desktop PC installations (home, business and enterprise) are not Windows Server;

2) kiosk style installs (POS terminals, airport check-in stands etc) are fully managed, unsupervised installations (the ones that ground to a complete halt today) and do not offer any sort of user interaction by design;

3) most Windows Server installations are also unsupervised.


> 1) the prevailing majority of laptop and desktop PC installations (home, business and enterprise) are not Windows Server;

They are not, but the point is elsewhere: that Windows Server is going to provide the WSUS service to your network, so your laptop and desktop installations (in business and enterprise) are going to be handled by this.

Homes, on the other hand, do not have any Windows Server on their network, that's true.

As a hack to disable Windows updates, it is possible to point it to a non-existing WSUS server (so that can be done at home too). The client will then never receive any approval to update. It won't receive any info wrt available updates either.
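For reference, the mechanism behind both the WSUS setup and the "point it at a non-existent server" hack is a handful of policy registry values, roughly the following (from memory, with a placeholder server name; check current Group Policy documentation before relying on this):

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate]
    ; Point the update agent at a WSUS server (a non-existent host effectively freezes updates)
    "WUServer"="http://wsus.example.invalid:8530"
    "WUStatusServer"="http://wsus.example.invalid:8530"

    [HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU]
    ; Tell the agent to use the WSUS server above instead of Microsoft's public servers
    "UseWUServer"=dword:00000001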

> 2) kiosk style installs (POS terminals, airport check-in stands etc) are fully managed, unsupervised installations (the ones that ground to a complete halt today) and do not offer any sort of user interaction by design;

That's fine; this is fully-configurable via GPO.

> 3) most Windows Server installations are also unsupervised.

See 2.


IMHO law should require such a firm, or any firm that may impact millions of other people, i.e. including all OS developers and many others, to maintain a certified Q/A process, maintain 24/7 coverage and spend X% on Q/A. Such companies should never be allowed to deploy without going through a stringent CD procedure with tests and such, and they need to renew the certificate annually.

These are infra companies. Their incompetence can literally kill people.


My point/problem is that EVERY company (sorry for the caps) that is ISO, PCI, COBIT, NIST CSF, etc. compliant MUST be doing this!! (again sorry for the caps)

So they drop half the 'safety' procedures once the auditor goes away? WTF! (I am semi-angry because there are so many easy solutions and workarounds to not fall for this!! Inside screaming.)

How irresponsible must someone be to roll out something to 1k-5k-10k machines without testing it first??

Hubris-Atis-Nemesis-Tisis!!!!

https://www.greecehighdefinition.com/blog/hubris-atis-nemesi...


I hope eventually law regards these companies as "infrastructure" companies, just like companies that build roads, bridges and such, that may and will kill people if not run professionally.

I'm not trying to enforce certifications because as a dev certifications always raise a bitter taste in my mouth. But those companies need certified processes that get re-certified every year. Sometimes even a cursory review from outsiders can find a lot of issues.


What you described is not a “CD” procedure. Lack of precision around such terms is part of the problem here.


I thought that was a deployment issue? Or maybe a QA one, because it looks like no QA was performed...


Updates do get tested. Windows updates can be held and selectively rolled out when a company is ready. As far as I can tell though, CrowdStrike doesn't give companies the agency to decide if updates should be applied or not.


The updates should be rolled out incrementally rather than all at once


Since we live under capitalism, financial losses are the only ones anyone cares about at scale. What's a human life worth nowadays? About 10 million for a healthy prime-age adult? Negative for the elderly?


I think it depends what passport etc. you hold... One dystopian take is the trolley problem, where the self-driving car in question uses smartphones to determine the identity of the people involved, to work out who is cheaper to kill.


That reminds me of why McDonalds got such a high penalty in the court case everyone remembers as "person sues for spilling hot coffee on themselves".

The reason this reminds me of that, assuming that I remember right, is that I think they had even taken the decision that the cost of paying lawsuits for those injuries was lower than the increase in revenue for being able to say "we have the hottest coffee"… and that was why they were deemed so severely liable.

They were definitely shown to have known it was resulting in injuries from other settlements:

https://en.wikipedia.org/wiki/Liebeck_v._McDonald%27s_Restau...


Not true. Making C-level executives of software companies criminally liable with the chance to go to jail did change their behaviour in some recent lawmaking situation (forgot which, sorry).


None whatsoever, their contracts with customers will limit liability to the price paid for the software/subscription. If there was open-ended liability for software failures then very little software would get written.


Caveat to this: In the UK and many other countries, you cannot limit liabilities that cause death or personal injury arising from negligence.


Yeah but if it's a hospital, they should be able to operate without these IT systems. Nothing critical / life-or-death / personal injury should rely on Windows / IT systems.


> they should be able to operate without these IT systems.

Is that even possible any more? (That said, "operate" isn't a boolean, it's a continuum between perfect service and none, with various levels of degraded service between, even if you mean "operate" in the sense of "perform a surgical operation" rather than "any treatment or care of any kind").

All medical notes being printed in hard-copy could be done, that's the relatively easy part. But there's a lot of stuff which is inherently IT these days, gene sequencing, CT scans, etc., there's a lot that computers add which humans can't do ourselves — even video consultation (let alone remote surgery) with experts from a different hospital, which does involve a human, that human can't be everywhere at once: https://en.wikipedia.org/wiki/Telehealth

> Nothing critical / life-or-death / personal injury should rely on Windows / IT systems.

If you think that's bad, you may want to ensure you're seated before reading this about the UK nuclear deterrent: https://en.wikipedia.org/wiki/Submarine_Command_System


Also Silicon Valley: AI will replace doctors and nurses.


Why? Because you simply wish it to be so?


Because the suppliers of IT systems (eg Microsoft, Crowdstrike) do not agree that they can be used for life-critical purposes

If someone is injured or dies because the hospital has inadequate backup processes in the event of a Windows outage, some or maybe all liability for negligence falls on those who designed the hospital that way, not the IT supplier who didn't agree to it.


If your assumptions rest on corporate entities or actual decision makers being held legally liable, then you've got a lot of legwork ahead of you to demonstrate why that's a reasonable presupposition.


Because it's evidently a bad idea and there are reasonable alternatives.


That’s easy for you to say, with the benefit of recency bias, and with presumably zero experience in running a hospital.


That's not about experience, that's about following the regulated standards. This is well known ever since technology (not computers) got into hospitals.


None of the points you mention detracts from the correctness of his/her statement.


And? People and institutions constantly make bad decisions for which there are reasonable alternatives, and that's assuming that the incentives at play for decision makers are aligned with what we would want them to be, which is often not the case. Not that that ends up mattering much except as an explanatory device, because people and institutions constantly pursue bad ideas even seen in terms of their own interests.


It would be like orthopedic surgeons heading down to harbor freight to pick up their saws instead of using medical grade versions.

The tool isn't fit for purpose


Because you should always have a backup.


When has a software company successfully been sued (or settled) over this liability?


From windows tos:

Disclaimer. Neither Microsoft, nor the device manufacturer or installer, gives any other express warranties, guarantees, or conditions. Microsoft and the device manufacturer and installer exclude all implied warranties and conditions, including those of merchantability, fitness for a particular purpose, and non-infringement. If your local law does not allow the exclusion of implied warranties, then any implied warranties, guarantees, or conditions last only during the term of the limited warranty and are limited as much as your local law allows. If your local law requires a longer limited warranty term, despite this agreement, then that longer term will apply, but you can recover only the remedies this agreement allows.


"We give you no guarantees, unless the local law says we have to give them to you, in which case we do."

So they might get sued on a local level?


It doesn’t really matter what the contract says. Laws take precedence over contracts. For example, Boeing’s liability for 737 airliners that crash due to faulty software certainly isn’t limited to the price of the planes.


But only $243.6M for fraud which caused the deaths of 346 people.


Yes, the software industry as we know it would not exist if companies were held liable for all damages. But in the current state of affairs they have little incentive to improve software quality - when an incident like this happens they suffer an insignificant short-term valuation loss, but unless it happens too often they can continue business as usual.

Many companies pay lip service to quality/reliability, but internal incentives almost always go against maintenance and quality-of-service work (and instead reward new projects, features, etc.).


> Yes, the software industry as we know it would not exist if companies were held liable for all damages.

Of course it would. Restaurants are held liable for food poisoning, but they still operate just fine. They just - y’know - take care that they don’t poison their customers.

If computer systems were held liable, software would be a lot more expensive. There would be less of it. And it would also be better.

I think I can get behind that future.


I like that future too, but to play devil's advocate:

Write me software that coordinates all flights to and from airports, capturing all edge-cases, that's bug free. Then tell me the number you estimate and the number of years to roll this out.


Sure, but ... that's not a spec. Specs have clear goals and limited scope. "All flights from all airports forever" is impossible to program, full stop.

The right way to write code like that is to start simple and small - we're going to service airports X, Y and Z. Those airports handle Q planes per day. The software will be used by (this user group) and have (some set of responsibilities). The software engineers will work with the teams on the ground during and after deployment to make sure the software is fit for purpose. Someone will sign off on using it and trusting its decisions. And lets also do a risk assessment where we lay out all the ways defects in the software could cost money and lives, so we can figure out how risk averse we need to be.

Give me scope like that, and sure - I'll put a team together to write that code. It'll be expensive, but not impossible. And once it's working well, I'd happily roll it out to more airports in a controlled and predictable manner.


Crowdstrike's stock closed at $343 yesterday, I imagine that and MSFT are going to be cratering later this morning.


Pro tip: your stock can't go down if you crash the stock exchange


It honestly did not occur to me. In all seriousness, has a stock exchange ever really been hacked (not just data exfiltration -- write access to everything)?


Trading has been halted on stock exchanges due to technical issues many times. But there's also more than one stock exchange.


No, it can't, if there is no stock exchange online to process the prices.


"Tell me, Mr. Anderson, what good is a phone call when you are unable to speak?"


Pretty good time to buy MSFT I would imagine, given that this isn't really their fault.


So far MSFT is down by ~2%... Even Crowdstrike is only -20%. When they probably did more damage in a day than their entire net worth.


I'm mystified it's not much lower. Perhaps the market hasn't really priced in the damage yet.


Yeah, if I had a spare million, I can imagine buying that dip.


I'd expect crowdstrike to take a big hit. Between this and the russian hack [edit: actually not, sorry, confused with SolarWinds], I am not sure they are not causing more problems than they solve.


Crowdstrike was hacked by Russians?


Sorry I confused them with SolarWinds. Strike that


it hovers around -20% in pre-market (at the moment)


MSFT will be fine. They are riding the AI waves, this is not meaningful, especially since they are not at fault.


The waves that are already looking like a storm in a teacup?

There is no 'AI'; that has always been hype. There is machine learning, which is a very powerful technology, but I doubt MSFT will be leading that revolution. As for LLMs, MSFT might have some competitiveness there, but I doubt it's going to be a very lucrative market. MSFT is highly overvalued.


<< There is no 'AI', that is always only hype. There is machine learning, which is a very powerful technology

I agree with you on the technical aspect, but the distinction makes regular people's eyes glaze over within 5 seconds of that explanation. AI as a label for this is here to stay, the same way cyber stopped meaning text sex on IRC. The people have spoken.

<< MSFT is highly overvalued.

Yes, but so is NVDA, the entire stock exchange and US real estate market. We are obviously due for a major correction and have been for a while. As in, I actually moved stuff around in my 401k to soften the blow in that event 2 years ago now. edit: yes, I am a little miffed I missed out on that ride.

So far, everything has been done to prevent a hard crash, and in an election year that is unlikely to change. Now, after the election, that is another story altogether.

<< I doubt MSFT will be leading that revolution.

I think I agree. I remain mildly hopeful that the open model approach is the way.


https://www.aqr.com/-/media/AQR/Documents/Whitepapers/Unders...

You should stop trying to predict the next crash. According to the study, most people (including institutional investors) consistently believe there is a >10% chance the market will crash in the next 6 months when historically the probability is only 1%


<< You should stop trying to predict the next crash.

Hmm? No. I will attempt to secure my own financial interest.

<< According to the study, most people (including institutional investors) consistently believe there is a >10% chance the market will crash in the next 6 months when historically the probability is only 1%

Historically is doing a fair amount of work here. I would argue there is little historical value to the data we face. Over the past few decades we went through several mini revolutions (industrial, information, and whatever they end up calling this one) in terms of how we work, eat, communicate and, well, live.

All of these have upended how humans interact with the world, effectively changing the calculus on the data that preceded it, if not nullifying it altogether in some ways.

Your argument is to stop worrying since you are likely wrong anyway, by a factor of 10. I am saying that in 1935 people also thought they had time to ride the wave.

edit: ok, need coffee. too many edits


> Now after the election, that is another story altogether.

Agree. First half of 2025 could be pretty spectacular (if/when we get through 2024).

I suspect there might be some pretty radical plans for US debt monetisation being drawn up, to be implemented early in the new presidential term.


My brain goes there too, but the other part of my brain says "line always goes up." The richest among us are heavy owners of stocks, and this country does everything it can to keep those numbers up. Look at that insane COVID V-shaped recovery that happened. That's just not a real/natural market reaction in my book.


The worst part is that I get the need to do something to rein it in, but I get the feeling it will, as always, not be the actually rich (owns-the-color-blue rich) who will suffer from those plans. There are fewer and fewer moves the government has as time progresses.


It may not be their fault directly but it is causing Windows systems to bluescreen, which IS their fault and their responsibility, ultimately.


How is it their fault and responsibility? Isn’t falcon sensor basically running like a kernel module? Does it mean that Windows is not engineered properly when it can be crashed by this?


Are you saying that they should prevent or limit the ability of their users from installing third party software? Or at the very least prevent it from running in kernel mode?


A more reasonable claim would be that microsoft should have a way to allow virus-scanners to run without needing to be able to crash the kernel.

That isn't an easy thing to do, but it should be possible.


I don't think that is possible. How can an anti-virus not in kernel mode defend against viruses running in kernel mode then?


eBPF, I believe, cannot crash the kernel


Windows blue screen was never Microsoft's responsibility. /s


That’s what the license agreement says. Wait till every man and his dog sues them.


This is an insane take. Do you think other industries get away with limiting their liability to the product cost? No, because that doesn't provide adequate incentives for making a safe product. The amount of software that gets written depends mostly on the demand for that software. Even if Micrososft would not be willing to up their game to make the risk viable then someone else would.


The thing is we know how to make (eg) food that is safe or to a lesser extent bridges that don't fall down. If you sell food that makes people sick you should have known how to avoid that and so you can be held liable.

We don't have a good idea how to make software that is flawless, at least, not at scale for a cost that is acceptable. This is changing a little bit now with the drive by governments to use memory-safe languages, but that only covers a small part of the possible spectrum of bugs in software and hardware.


Nothing is without flaws, it's about limiting risk to an acceptable amount. Critical software should be held against higher standards.


What's "critical software"? Software controlling flight systems in planes is already held to very high standards, but is enormously expensive to write and modify.

In this case it seems most of the software which is failing is dull back office stuff running on Windows - billing systems, train signage, baggage handling - which no one thought was critical, and there's no way on earth we could afford to rewrite it in the same way as we do aircraft systems.


Something that has managed to ground a lot of planes and disable emergency calls today is in fact critical. The outcome of it failing proves it is critical. Whatever it is.

Now, that it was not known previously to be critical, that may be. Whether we should have realised its criticality or not, is debatable. But going forward we should learn something from this. So maybe think more about cascading failures and classify more things as critical.

I have to wonder how the failure of billing and baggage handling has resulted in 911 being inoperative. I think maybe there's more to it than you mention here.


Agreed, there is no such thing as perfect software.

In the physical world, you can specify a tolerance of 0.0005 in, but the part is going to cost $25k apiece. It is trivially easy to specify tolerance, very hard to engineer a whole system that doesn't blow the cost, and impossible to fund.

Great software architectures are the ones that operate cheaply, but are bulletproof when software fails. https://en.wikipedia.org/wiki/Chaos_engineering


Given how widespread the issue is, it seems that proper testing on Crowdstrike's part could have revealed this issue before rolling out the change globally.

It's also common to rollout changes regionally to prevent global impact.

To me it seems Crowdstrike does not have a very good release process.


> but is enormously expensive to write and modify.

We're talking about critical software. If we can't afford to reach the level of safety needed because it's too expensive, well so be it.

Besides, the enormously expensive flight systems don't seem to make my plane ticket expensive at all...


There's only one piece of software which (with adaptations) runs every Airbus plane. The cost of developing and modifying that -- which is enormous -- is amortized over all the Airbus planes sold. (I can't speak about Boeing)

What failed today is a bunch of Windows stuff, of which there is a vast amount of software produced by huge numbers of companies, all of very variable quality and age.


I meant critical software a short-hand for something like: quality of software should be proportional to the amount of disruption caused by downtime.

Point of sale in a records store, less important. Point of sale in a pharmacy, could be problematic. Web shop customer call center, less important. Emergency services call center, could be problematic.


I, as a producer of software, have effectively no control over where it gets used. That's the point.

Outside of regulated industries it's the context in which software is used which determines how critical it is. (As you say.)

So what you seem to be suggesting (effectively) is that use of software be regulated to a greater/lesser extent for all industries... and that just seems completely unworkable.


What you're describing is a system where the degree of acceptable failure is determined after the software becomes a product because it is being determined by how important the buyer is. That is backwards and unworkable.


It isn't, though. "You may not sell into a situation that creates an unacceptable hazard" is essentially how hazardous chemical sale is regulated, and that's just the first example that I could find. It's not uncommon for a seller to have to qualify a buyer.


I think the system is rather a one where if you offer critical services then you're not allowed to use a software that hasn't been developed up to a particular high standard.

So if you develop your compression library it can't be used by anyone running critical infra unless you stamp it "critical certified", which in turn will make you liable for some quality issues with your software.


I assume you mean "if the buyer will use the software in critical systems."

That's very realistic and already happens by requiring certain standards from the resulting product. For example, there are security standards and auditing requirements for medical systems, payment systems, cars, planes, etc.


> Software controlling flight systems in planes is already held to very high standards, but is enormously expensive to write and modify.

Here's something I don't understand: those jobs pay chump change compared to places like FB, and (afaik) social networks don't have the same life-or-death context


Hence, Windows should blue/green kernel modules and revert to a past known good version if things break
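Nobody outside Microsoft can actually implement this, but the policy being asked for is simple enough to sketch. A hypothetical illustration only (names and thresholds invented, Python used as pseudocode):

    MAX_FAILED_BOOTS = 2      # revert after this many consecutive crashes
    REQUIRED_CLEAN_BOOTS = 3  # promote after this many healthy boots

    def module_to_load(state):
        # Boot the candidate driver only while it hasn't proven itself bad.
        if state["candidate"] and state["failed_boots"] < MAX_FAILED_BOOTS:
            return state["candidate"]
        return state["known_good"]

    def on_clean_boot(state):
        state["failed_boots"] = 0
        state["clean_boots"] += 1
        if state["candidate"] and state["clean_boots"] >= REQUIRED_CLEAN_BOOTS:
            state["known_good"] = state["candidate"]  # promote candidate to known good
            state["candidate"] = None

    def on_failed_boot(state):
        state["clean_boots"] = 0
        state["failed_boots"] += 1  # once this hits the limit, module_to_load() reverts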


Would not shock me for AV companies to immediately work around that if it were to be implemented. “You want our protection all of the time, even if the attacker is corrupting your drivers!”


It seems like the kernel module was faulty for some time; the update only changed the input data for the module.


Crowdstrike should have higher testing standards, not every random back-office process.


> Software controlling flight systems in planes is already held to very high standards, but is enormously expensive to write and modify.

Boeing disagrees.


We don't know how to make general software safe, but we do know how to make any one piece of software safe. If your software is going to be used as infrastructure then it should be held to the same standards. If you don't want it to be treated as infrastructure, don't sell it to hospitals.


You're mixing up the responsibility: in your world, hospitals shouldn't purchase it.


Responsibility can be shared.


How about people in charge of choosing these clown solutions - both crowdstrike and windows?


Windows being a clown solution? That's hugely out of touch with reality


I can't imagine starting a project from ~2010 and on while choosing Windows as the stack.


The production simplicity of having a standardized OS and being able to drop in a .exe and have it run everywhere without worrying about building for 1000 system combinations cannot be beat.


Enterprise Linux can fairly consistently be assumed to be RHEL, Ubuntu, or SuSE, with the first two being far more likely in the U.S. That’s not that much to ask for.


That's... not reality even on desktop PCs, and never was. If your business is more complex than selling hot dogs or ice cream (or even that, at a big enough scale), the IT of such a company will become a small monstrosity over time, and the complexity of such deployments on Unix vs Windows is nothing compared to the overall picture.


I see you somehow avoided learning what DLL hell is, what the various incompatible .NET runtime versions are, and what optional compatibility levels Windows 10 offers.


Easily done if you target x86-64 statically


Clowns are taking over reality, on many levels. And they will tell you that you are the clown. Welcome to clown world.


Such a naivete


In your world i should switch my modest 1000 seats over to Linux desktops?

I'm not sure how I'm going to explain the productivity loss and retraining costs to the board, if I'm honest.


Plus, CrowdStrike runs on Linux as well. _This time_ they only crashed Windows devices, but there's no guarantee that switching to Linux would prevent any of it.

You can switch away from CrowdStrike but I doubt you'll be able to convince whoever mandated CS to be installed to not install an alternative that carries exactly the same risks.


>CrowdStrike runs on Linux as well. _This time_ they only crashed Windows devices, but there's no guarantee that switching to Linux would prevent any of it.

In fact there was a recent CrowdStrike-related crash in RHEL:

https://old.reddit.com/r/crowdstrike/comments/1cluxzz/crowds...

https://access.redhat.com/solutions/7068083


At least on Linux it runs on eBPF sniffing, so the chances of fudging something are lower. There are some supported Linux distributions where they also have a kernel module, and there might be a higher chance of that exploding.


No you should switch over to Chromeos, iPads, ... anything but Microsoft.

Crowdstrike only exists because Windows and other Microsoft products are so insecure in their default configuration.


There's nothing special about Windows beyond the fact that you can run arbitrary executable files. The problem could just as easily have happened for Linux or iOS/Mac and in fact it has. ChromeOS kind of works if you want to run a web application that's hosted on some web server... but it's not appropriate for running programs where a dumb browser doesn't suffice.


What defaults would those be, and how would you change them?


I'm not in IT anymore and we run 100% macs, so serious question here: isn't nearly everything a webapp nowadays? Every "non dev" thing that I have to do for work happens in my browser or an electron app. I guess maybe MS Office apps may be the biggest hitch? We use Google Workspace and that's all in browser.


Legacy apps are quite common. I have recently been doing IT for State Farm Insurance.

Every State Farm insurance office in the country is still using a DOS App from the 1980's to run their office.


I'm interested in what these DOS applications are running on. Is it virtualised or a real physical machine?


Not at all. "Industry", think: manufacturing is still big on desktop applications.


Shouldn't be too hard to bundle them together with qemu, or some other vm solution.


None of my enterprise ERP/PLM/CRM systems run on Mac Server OS


There are actually web versions of the office suite now.


It's horrible to use though. Google's suite is somewhat better than MSFT's web one, but it still is weak compared to any established desktop office suite, even libreoffice.


I've found it alright to be honest. I'd like to use libre office but the incompatibilities with .docx make it too annoying. Finally I can easily work with .docx on Linux, thanks to the web version :)


It's only good for viewing and simplistic editing. More complex stuff ends up being unavailable on the web version very often.


Your 1000 seats crashing won't prevent airplanes from landing.

These things should have gone from mainframes of yore to various unix systems, ideally a mix of different unix systems in hot failover.

Without running uncontrolled "agent" software of course.


Enjoy the circus then!


I don't think you can hold Microsoft liable for 3rd party software pushing its own update. Microsoft didn't make anyone install Crowdstrike or its update files.


Some people in the comments claim CS was used for compliance reasons. Some others claim Windows & CS do not offer warranties. How can a product satisfy the compliance check-box if it does not offer a warranty and does not accept liability for the related features?


While software is often warranted, contracts won't often accept liability in terms of business damages etc, and that's not usually a requirement for compliance.

If it was, it would also make it impractical for a small business to contract with a large one because of risk.


Depends. I'm at an EMR maker; our Windows machines (as well of those of our clients - read: hospitals and doctors offices) are down. That is, of course, bad for the patients under their care.

Do these clients have SLAs? If so, they're definitely on the hook for something. You could probably get a few businesses together for a decent class-action against Crowdstrike. You're then expecting a lawyer to be able to convince a dozen semi-random people with varying degrees of computer knowledge that Crowdstrike's software was negligently designed, developed, and deployed in a way that caused financial or life losses for customers.

So, really, it's a coin flip.


What if your company mandated your customers run crowdstrike in order to run your software? What are the legal implications of that? Wouldn't that also put your contracts on the hook?


Prison time for the CEO and board of directors would be nice.

Enough of this limited liability nonsense, there need to be serious, severe, life-changing consequences.


Hypothetically, even if they were liable, they would go bankrupt before even a few percent of damages is recovered. You cannot pluck a bald chicken.


>they would go bankrupt before even a few percent of damages is recovered

Wouldn't that be the desirable outcome, though? Given the amount of damage they have caused, they should cease to exist.


Sort of. They need to be sued into bankruptcy. Current shareholders get completely zeroed out; the company still exists, but is sold to the highest bidder with the proceeds paid out to affected customers.

We need this so that every company board is always asking "are we investing enough to make sure this never happens to us?"


A local rooflayer is absolutely corrupt. He cheats every customer, produces leaky roofs, doesn't even pay taxes completely.

It takes 2 years for the legal system to catch up, at which point he starts a new company, bankrupts the old one, sells all his tools cheaply to the new company, and fires and rehires his workers. I've seen this game going on for 14 years now.

I think Crowdstrike would do the same: Start a new one, sell the software, fire and rehire the workers, then go on as if nothing happened


I'd call BS on this story, but I know a friend that bought a home a few years back from a homebuilder that did a similar thing, except at a whole-home level. Absolute disaster. He's been chasing him for half a decade now via legal means to get things fixed.


Not really though. Whether they should continue to exist into the future should depend on if the expected positive value of their services in that future exceeds the expected damage from having a big meltdown every once in a while. That some of their devs made a fuckup doesn't mean the entire product line is now without merit.

Killing the company because they made a mistake doesn't just throw away a ton of learned lessons (because the devs will probably be scattered around the industry where their newly acquired domain knowledge will be less valuable) but also forces a lot of companies to spend resources changing their antivirus scanners. For all we know, Crowdstrike might never fuck up again after this and forcing that change would burn hundreds of millions for basically no reason.


"Whether they should continue to exist into the future should depend on if the expected positive value of their services in that future exceeds the expected damage from having a big meltdown every once in a while"

I don't think that's right, since it ignores externalities.

You want to create a system where every company is incentivized to make positive security decisions. If your response to a fuckup of unprecedented scale is just "they learned their lesson, they probably won't do that again", then the message these companies receive is that it is okay to neglect proper security procedures, because you get one global economic meltdown for free.


This is where public executions of executives help.


But the Ticketmaster software would buckle under the strain :-)


>You cannot pluck a bald chicken.

Haven't heard that one before but I love everything about that!


Financial liability often doesn't equate to actual recovery of damages


Yes, SLA. No one gets held liable if the legal work is done correctly and there were no guarantees, but on cloud there is a 100% SLA so they will pay out.


Do we not remember "Ma" Bell? This should perhaps be a wake-up call in regard to Microsoft and other large tech having concentrated fingers in too many pies. This appears to be an anti-trust issue at its core.

Was it really a botched update? Or was it a test run for holding the world hostage prior to a coup?


Negligence at Crowdstrike is not covered by any SLA. Even if insured, Crowdstrike could be fucked, let alone once companies start trying to work out how much this has cost them. Long term, they're fucked.


There will be no Crowdstrike left after this. I am just upset I can't short it…


> Chances if Microsoft or Crowdstrike will be held liable for financial losses caused by this outage?

Zero. Exactly Zero.

Clearly you have never been involved in buying insurance or writing contracts for IT products/services.

Loss of contracts, profits, goodwill, economic loss, loss of data and all that jazz is excluded in whole or limited to a fixed monetary value.

It is known as indirect, consequential or special loss, damage or liability.

No lawyer worth their salt will let an IT product/service company draft a contract that does not have the above type of clause.

And good luck finding an insurance contract that will pay out for such losses, indeed most of them have conditions that state your contracts with customers must exclude or limit such losses.

Most software also has clauses excluding use in safety critical environments.


I'm just curious, don't they have something like "gradual rollout" to update their app? They just bulk-update simultaneously across all agents? No way. Something seems a bit off to me. But there are good lessons to learn for sure.


My company stays multiple versions behind latest for this exact reason, but we were still affected


I read that they pushed a new configuration file, so possibly they don't consider that a "software update" and pushed it to everyone. Which is obviously insane. If I am publishing software, it doesn't matter if I've changed a .py file or a .yaml file. A change is a change and it's going to be tagged with a new version.


I'm a bit unfamiliar with this stuff anymore... supposedly it was a content update, not the agent itself :/


Surely though these content updates must go through some kind of regression testing right? Right?


Party 'try not to cry', me and you


Wondering, how were you affected if you didn't update?


They likely pushed an update to all versions, or updated their updater(?). Not exactly clear to us at the moment.


It's not the first time they pull something similar...1 month ago: "CrowdStrike bug maxes out 100% of CPU, requires Windows reboots" - https://www.thestack.technology/crowdstrike-bug-maxes-out-10...

A 75 billion dollar valuation, CNBC analysts praising the company this morning on how well it is run!... When in reality they can't master the most basic phased deployment methodologies, known for 20 years...

Hundreds of handsomely paid CTOs, at companies with billions of dollars in valuations, critical healthcare, airlines, who can't master the most basic of concepts: "Everything fails all the time"...

This whole industry is depressing....


> This whole industry is depressing....

I'll take it a step further and say that every industry is depressing when it comes to computers at scale.

Rather than build efficient, robust, fault-tolerant, deterministic systems built for correctness, we somehow manage to do the exact opposite. We have zettabytes and exaflops at our fingertips, and yet, we somehow keep making things slower. Our user interfaces are noisier than ever, and our helpdesks are less helpful than they used to be.


Exactly! :(

I am drifting towards hating to turn on my computer in the morning. The whole day is like pissing into the wind: trying to find workarounds for annoyances or even malfunctions, getting rid of obstructive noise from all directions. My productivity using modern computer systems is diminishing compared to where it was just a mere 10-15 years ago (still better than 25 years ago, not only because of experience but also because of access to information on demand). Very depressing. I should have become a farmer, perhaps.


What I find definitely depressing is the fact that we used to roll out even OS upgrades progressively (I guess now that is done through Intune?), and that was one point in favor of Windows (on Linux you had to do things yourself at the time AFAIK; I don't think the situation has improved much).

Nowadays we get mandated random software updating at once on the entire company fleet and no one bats an eye - I counted more than a dozen agents installed for "security" and "monitoring" purposes on my previous company's servers, many of them with hooks in the kernel obviously, and many of them installed with random policies to tick yet another compliance box...


> (on Linux you had to do things yourself at the time AFAIK, I don't think the situation has improved much)

You can schedule the updates any time you want: if you want to do it staggered, configure that; if you want to do it all at the same time, do that; a random interval is also possible. I don't see the "you need to do everything yourself" problem any more than in any managed environment.


I haven't been a sys admin in a very long time so my systems knowledge might be outdated, but I reckon functionality like intune's built-in monitoring of specific feature install failures would make a huge difference with a few dozen systems, let alone the hundreds of thousands you see in some of today's deployments. It's not like that stuff isn't possible on Linux, but if you're coordinating more than a few systems, that turns into a big, expensive project pretty quickly.


Centralized management is very useful; just a random delay is not enough. One of the (big) companies I worked with had jury-rigged something with Chef, I believe, to show different machines different "repositories" and roll things out progressively (1% of the fleet, 5%...).
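The usual trick, whatever the tooling, is to bucket machines deterministically by hashing a stable identifier, so the same hosts always land in the early rings and the rings nest. A minimal sketch (ring percentages are arbitrary):

    import hashlib

    def in_rollout(hostname, percent):
        # Deterministic bucket in [0, 100): the 1% ring is a strict subset of the 5% ring, etc.
        bucket = int(hashlib.sha256(hostname.encode()).hexdigest(), 16) % 100
        return bucket < percent

    hosts = ["web-01", "web-02", "db-01", "cache-07"]
    print([h for h in hosts if in_rollout(h, 5)])    # early ring gets the update first
    print([h for h in hosts if in_rollout(h, 50)])   # widened later, a superset of the above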


Staggering is necessary in some cases. I've heard of scenarios where a company has lots of devices in the field which all simultaneously try to download a big update, and DDOS the servers hosting that update.


This borked our dispatch/911 call center then as well. However, it wasn't as bad as this one. This outage put our entire public safety system into the stone age, and with that we were at stone-age efficiency.


I work IT at a regional 911 center. We're fine but I sympathize with those who are back to pen and paper dispatching. Hard for most current dispatchers to realize the way we did it back in the day.


The worst part is that nobody will be held accountable. An F-up like this should wipe out the entire company, but instead everyone will just shrug it off as an oopsie, a few low-level employees will get punished, and nothing will change.


Those "CNBC analysts" truly know nothing, especially when it comes to tech. They're just cheerleaders who repeat talking points all days.


Here's a visual representation of flight cancellations and delays at major US airports https://www.flightaware.com/miserymap/


Oh wow! Thanks for that. I use FlightAware and was not aware of that particular page.


These animations were all the rage on Chinese WeChat while this was going on


So CrowdStrike is deployed as third party software into the critical path of mission critical systems and then left to update itself. It's easy to blame CrowdStrike but that seems too easy on both the orgs that do this but also the upstream forces that compel them to do it.

My org which does mission critical healthcare just deployed ZScaler on every computer which is now in the critical path of every computer starting up and then in the critical path of every network connection the computer makes. The risk of ZScaler being a central point of failure is not considered. But - the risk of failing the compliance checkbox it satisfies is paramount.

All over the place I'm seeing checkbox compliance being prioritised above actual real risks from how the compliance is implemented. Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting. So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.


> The risk of ZScaler being a central point of failure is not considered. But - the risk of failing the compliance checkbox it satisfies is paramount.

You're conflating Risk and Impact, and you're not considering the target of that Risk and that Impact.

Failing an audit:

1. Risk: high (audits happen all the time)

2. Impact to business: minimal (audits are failed all the time and then rectified)

3. Impact to manager: high (manager gets dinged for a failing audit).

Compare with failing an actual threat/intrusion:

1. Risk: low (so few companies get hacked)

2. Impact to business: extremely high

3. Impact to manager: minimal, if audits were all passed.

Now, with that perspective, how do you expect a rational person to behave?

[EDIT: as some replies pointed out, I stupidly wrote "Risk" instead of "Odds" (or "Chance"). Risk is, of course, the expected value, which is probability X impact. My post would make a lot more sense if you mentally replace "Risk" with "probability".]
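To make that concrete with the corrected terminology, the expected-loss arithmetic from the individual manager's point of view looks something like this (all numbers invented purely for illustration):

    # Expected loss = probability x impact, seen from the manager's chair.
    audit_failure   = {"p": 0.30, "impact_to_manager": 50_000}  # career ding, missed bonus
    breach_after_ok = {"p": 0.01, "impact_to_manager": 5_000}   # "we passed every audit"

    def expected_loss(event):
        return event["p"] * event["impact_to_manager"]

    print(expected_loss(audit_failure))    # 15000.0
    print(expected_loss(breach_after_ok))  # 50.0
    # Ticking the compliance box wins by orders of magnitude, even when the box
    # itself adds real systemic risk to the business.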


Moreover, no manager gets dinged for "internet-wide" outages, unfortunately, so the compliance department keeps calling the shots. The number of times I've had to explain that there's no added security in adding an "antivirus" to our Linux servers, as we already have proper monitoring at the eBPF level, is annoying.


I'd be fired if I caused enough loss in revenue to pay my own salary for a year.

I am responsible for my choices. I'm CTO, I don't doubt that in some cases execs cover for each other, but at least I have anecdotal experience of what it would take for me to be fired- and this is clearly communicated to me.


Hope you get paid a lot! Otherwise you are either in a very young or very stupid job.

I regularly spend multiples of my salary every month on various commitments my company makes; any small mistake could easily mean a multiples-of-my-salary type of problem within 10 days.


A friend of mine spent half a million on a storage device that we never used. It sat in the IT area for years until we were acquired. Everyone gave him so much shit. Finance asked me about it numerous times (going around my friend the CTO) so they could properly depreciate it. He didn't get dinged by the board at all. It remained an open secret. We were making million dollar decisions once a month, though.


What sort of storage device, just out of curiosity?


> I regularly spend multiples of my salary every month on various commitments my company makes.

Yeah, same here.

But if I choose a vendor and that vendor fails us so catastrophically as to make us financially insolvent, then it's my job to have run a risk analysis and to have an answer for why.

If it's more cost effective to take an outage, that's fine, if it's not: then why didn't I have a DRP in place, why did we rely so much on one vendor, what's the exposure.

It's a pretty important part of being a serious business person.


Sure, but that's not what I said or you said, and my commentary was about relative measures of your salary to your budget.

If you can't make a mistake of your salary's size in your budget then your budget is small or very tight; most corporations fuck up big multiples of their CTO's salary quarterly (but that turns out to be single-digit percentage points of anything useful).


So you never messed up ever? That's the only thing that can fulfill both your comments, unless you've also been fired.


CTOs do not get fired because they chose a massive system like crowdstrike and it fails once a year

They would get fired if they chose a non-normal system and it failed once every 10 years


> I'd be fired if I caused enough loss in revenue to pay my own salary for a year.

I'm not so sure.

I know of a major company that had a glitch, multiple times, that caused them to lose ~15 million dollars at least once (a non-prod test hit prod because of a poorly designed tool).

I was told the decision-makers decided not to fix the problem (risking losing more money again) because the "money had already been lost."


"no manager gets dinged for "internet-wide" outages"

Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".

But, seems like for uptime, someone should be identifiable. If your job is uptime, and there is a world wide outage, I'd think it would roll down hill onto someone.


> Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".

I wouldn't necessarily say IBM or SAP are "crap". It's much more likely that orgs buying into IBM or SAP don't do the due diligence on the true costs of properly setting it up and keeping it running, and therefore cut tons of corners.

They basically want to own a Ferrari, but when it comes to maintenance, they want to run regular gas and get their local mechanic to slap Ford parts on it because it's too expensive to keep going back to the dealership.


> "look how many people are using them, how was I supposed to know they are crap".

if all your friends jump off a cliff, do you as well?

This is taught to children at a young age to teach them not to blindly follow others. Why do you think these adults deserve a pass?


The thing is usually this argument goes something like this:

A: Should prod be running a failover / <insert other safety mechanism>?

B: Yes!

A: This is how much it costs: <number>

B: Errm... Let me check... OK I got an answer, let's document how we'd do it, but we can't afford the overhead of an auto-failover setup.

And so then there will be two types of companies: the ones that "do it properly" will have higher costs, their margins will be lower, and over time they'll be less successful as long as no big incident happens. When a big incident happens, though, recent history proves that for most businesses, if everyone was down, nobody really complains. If your customers have one vendor down due to this issue, they will complain, but if your customers have 10 vendors down, and are themselves down, they don't complain anymore. And so you get this tragedy-of-the-commons type dynamic where it pays off to do what most people do rather than the right thing.

And the thing is, in practice, doing the thing most people do is probably not a bad yardstick - however disappointing that is. 20 years ago nobody had 2FA and it was acceptable, today most sites do and it's not acceptable anymore not to have it.


That's a lot of words to say: "Yes, I will jump off a cliff if all my friends do it!"

Besides, no one is seriously considering auto failover for desktop machines. Not sure where that came from?


Parents may teach this to kids but the kids usually notice their parents don't practice what they preach. So they don't either.

The world is filled with people following everybody else off a cliff. If you're warning people or even just not playing along in a time of great hysteria, people at best ignore your warnings and direct verbal abuse at you. At worst, you can face active persecution for being right when the crowd has gone insane. So most people are cowards who go along to get along.


"if all your friends jump off a cliff, do you as well?"

Sure, that is a common idiom, usually stated implying that people shouldn't, or won't, jump off the cliff. 'People must be smarter, right?'

And we would like to think that is logical, and people wouldn't jump off a cliff.

Sadly, it seems the opposite is more true: people DO jump off the cliff, following the illogical leader.

It seems to me more and more that it is human nature to follow the leader off the cliff.

Maybe something to do with being social animals, following the herd.


Depends how big the cliff is and whats at the bottom.


I think the parent was correct in the use of the word "Risk"; it's different than your definition, which appears to be closer to "likelihood".

Risk is a combination of likelihood and impact. If "risk" were just equivalent to "likelihood" then leaving without an umbrella on a cloudy day would be a "high-risk situation".

A rational person needs to weigh both the likelihood and impact of a threat in order to properly evaluate its risk. In many cases, the impact is high enough that even a low likelihood needs to be addressed.


ZScaler and similar software also has some hidden costs: Performance and all the other fun that comes with a proxy between you and the server you connect to.


Their local proxy is so poorly implemented that it's impossible to get more than 2 Mbps on a bypassed site.


> Now, with that perspective, how do you expect a rational person to behave?

What is a business to do, maximize business or manager contentment ?


> What is a business to do, maximize business or manager contentment ?

A "business" is still just a collection of people. Each person is going to take actions that are in their best interests.

What I'm saying is that the business's interests are not aligned with the people comprising that business.

In that regard, what "the business" wants is irrelevant.


> What I'm saying is that the business's interests are not aligned with the people comprising that business.

Yep, that's the point of capitalism.

> In that regard, what "the business" wants is irrelevant.

And yet here we are. Companies get fined left and right for breaching rules but it's ok because it earned them money. There are literal plans made to calculate whether it's profitable to cheat or not. In the current system, what the business wants always wins over individual qualms, unfortunately.


Because the punitive system in most countries doesn't affect individuals. As a manager, you're not going to jail for breaking environmental laws; a different entity (the company) is paying for being caught. So it's still the rational thing to do to break the environmental laws to make your group's numbers go up and get a promo or bonus.


Almost correct, but you mean 'chance' where you write 'risk':

    Risk = Chance × Impact
The chance of failing an audit initially is high (or at least medium). The impact is usually low-ish. It means a bunch of people need to fix policy and set out improvement plans in a rush. It won't cost you your certification if the rectification is handled properly.

It's actually possible that both of your examples are awarded the same level of risk, but in practice the latter example will have its chance minimized to make the risk look acceptable.
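Putting rough numbers on it makes the asymmetry obvious; everything below is invented purely to show the shape of the comparison:

    # Toy expected-loss comparison; every figure is made up for illustration.
    audit_failure = {"chance": 0.5, "impact": 50_000}        # rectification effort, consulting hours
    major_breach = {"chance": 0.02, "impact": 20_000_000}    # downtime, recovery, reputation

    for name, event in [("audit failure", audit_failure), ("major breach", major_breach)]:
        risk = event["chance"] * event["impact"]   # Risk = Chance x Impact
        print(f"{name}: expected loss = {risk:,.0f}")

    # audit failure: expected loss = 25,000
    # major breach: expected loss = 400,000

The breach dominates on risk to the business, but the manager's personal downside only tracks the audit line - which is the whole problem.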


Chance has more positive connotations than it has negative connotations IMO.

Probability is a more neutral word, and fits better.


> Now, with that perspective, how do you expect a rational person to behave?

They'd deploy the software on the critical path. That's exactly GP's point, isn't it? That's why GP explicitly wants us to shift some of the blame from the business to the regulators. GP advocates for different regulatory incentives so that a rational person would then do the right thing instead of the wrong thing.


> Risk: low (so few companies get hacked)

I'm at risk of sounding like Chicken Little, but the reality is companies are getting popped all the time - you just don't hear about them very often. The bar for media reporting is constantly being raised to the point where you only hear about the really big ones.

If you read through any of the weekly Risky Biz News posts [1] you'll often see five or more highly impactful incidents affecting government and industry, and those are just the reported ones.

[1] https://news.risky.biz/


If you read their comment holistically, they obviously agree with you and think that the outcome of the audit should be more meaningful.


> 1. Risk: low (so few companies get hacked)

I wonder how much that's still true now that ransomware has apparently become viable.

Finding an insecure target, setting up the data hostage situation, and having the victim come and pay is scalable and could work in volume. If getting small money from a range of small targets becomes profitable, small fish will bear similar risks to juicier targets.


But...surely you're also missing another point of consideration:

Single point of failure fails, taking down all your systems for an indeterminate length of time:

1. Risk: moderate (an auto-updating piece of software without adequate checks? yeah, that's gonna fail sooner or later)

2. Impact to business: high

3. Impact to manager: varies (depending on just how easy it is to spin the decision to go with a single point of failure rather than a more robust solution to the compliance mandate)


> 3. Impact to manager: minimal, if audits were all passed.

I don't know about you, but I'll be making sure everyone knows that the manager signed off on the spectacularly stupid idea to push through an update on a Friday without testing.


You’re conflating risk and frequency


>Risk: low (so few companies get hacked)

Come on.


Of course, disabling those auto updates will have you fail the external security audit, and now your security team needs to fight with the rest of the leadership in the company, explaining why you're generating needless delays and costs by going against the "state of the art in the security industry", and why your security guys are smarter than the people who have the power to approve or deny your security certification.


I've taken part in some security audits where I work. They're not a joke only because they're a tragic story of incompetence, hubris, and rubberstamping. They 100% focus on checking boxes and cargo-culting, while leaving enormous vulnerabilities wide open.


> you fail the external security audit

aka, you fail the cover-your-ass security, rather than actual security.


yep, but "trust us, we're secure, pinky promise by our internal employees" doesn't really work either.


Don't forget the press releases all saying "We take security very seriously!"


Employees would rather look after their employment and keep the boss happy than go against their orders.


What I don't understand is why they don't have a canary update process. Server side deployments do this all the time. You would think Windows would offer that to their institutional customers, for all types of updates including (especially) 3rd party.


This isn't a Windows update (which absolutely does let you do blue/green deployments via WSUS), but rather a CrowdStrike update, which also lets you stage rollouts, and I expect several administrators are finding out why that is important.


I know about update policies, but afaik those are about the “agent” version. Today’s update doesn’t look like an agent version. The version my box is running was released something like a week ago.

Is there some possibility to stage rollouts of the other stuff it seems to download?


I have been in these audits, and nowhere does it say that software has to be 'auto updated'; that is a ridiculous statement and requirement.

What a proper audit will look for is an update and testing control with supporting evidence.


Sounds like your employer has better auditing processes than most places.


Or lose IT/security insurance for not installing or disabling it.


Well, let’s see how much insurance companies will pay now


It's not about whether they pay out; large enough customers demand you have insurance as a condition of sale. It's cover your arse all the way down!


Kind of a big thing most people don't understand about the various forms of "Business Insurance." For the most part, businesses have whatever insurance the work they are doing requires them to have. Those requirements are set by laws/regulations applied to those entities and the various entities they want to do business with.

At every small shop I've worked at, when the topic of Business Insurance came up with one of the owners, the response was extremely negative -- basically summarized as "it's the most you will ever pay for something you won't ever be able to use".


Yep, it's pretty much a toll on doing business with entities. I've no doubt the intention is so your customer can sue you without you winding up; whether it actually works… no idea.


Why do we call managers "leaders" now? That's not what they are.


Well. Now you have something to point to. Next RFO you can ignore the blameless part and point to a executive override of a technical decision.


>> It's easy to blame CrowdStrike but that seems too easy on both the orgs that do this but also the upstream forces that compel them to do it.

While orgs using auto update should reconsider, the fact that CrowdStrike don't test these updates on a small amount of live traffic (e.g. 1%) is a huge failure on their part. If they released to 1% of customers and waited even 24 hours before rolling out further, this seems like it would have been caught and had minimal impact. You have to be pretty arrogant to just roll out updates to millions of customers' devices in one fell swoop.
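The gating logic for that kind of staged rollout doesn't have to be complicated; here's a minimal sketch (the ring percentages, hostname scheme, and health check are all hypothetical):

    import hashlib

    # Hypothetical rollout rings: 1% canary first, widening only after the previous
    # ring has stayed healthy for an agreed soak period.
    ROLLOUT_RINGS = [0.01, 0.10, 0.50, 1.00]

    def bucket_for_host(hostname):
        """Deterministically map a host to a bucket in [0, 1)."""
        digest = hashlib.sha256(hostname.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def should_update(hostname, open_ring):
        """Only hosts whose bucket falls inside the currently open ring get the update."""
        return bucket_for_host(hostname) < open_ring

    # Example: keep only the 1% canary ring open until its crash telemetry looks clean.
    print(should_update("host-0042.example.internal", ROLLOUT_RINGS[0]))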


Why even test the updates on a small amount of live customers first? Wouldn't this issue already have surfaced if they tested the update on a handful of their own machines?


I would hope they've done that and it passed internal QA. But maybe not a good idea to assume they're doing any sort of testing at all.


They’d have rigorous test harnesses, but you can’t really account for the complexities of a highly configurable platform like Windows.


The prevalence of the issue makes it seem unlikely to have been caused by site-specific configurations


You are completely right. BTW, it wasn't a software update, it was a content update, a 'channel file'. Someone didn't do enough testing. edit: or any testing at all?

https://x.com/George_Kurtz/status/1814235001745027317

https://x.com/brody_n77/status/1814185935476863321


It's an automatic update of the product. The "channel vs. binary" semantics don't change anything. If your software's definition files can cause a kernel mode driver to crash in a boot loop, you have bigger problems, but the outcome is the same as if the driver itself was updated.


Indeed. It's worse, really: it means there was a bug lurking in their product that was waiting for a badly formatted file to surface it. Given how widespread the problem is, it also means they are pushing these files out without basic testing.

edit: It will be very interesting to see how CrowdStrike wriggle out of the obvious conclusion that their company no longer deserves to exist after a f*k up like this.


Simple: They are obviously too big to fail now.


"Too big to uninstall" is a thing I guess


> President & CEO CrowdStrike, Former CTO of McAfee

Well that's certainly a track record.


That's funny, because IIRC McAfee back in the Windows XP days did this exact same thing! They added a system file to the signature registry and caused Windows computers to BSOD on boot.

https://www.zdnet.com/article/defective-mcafee-update-causes...


Showing yet again that the executive class only fails upwards.


That's even worse -- they should be fuzz testing with bad definition files to make sure this is safe. Inevitably the definition updates will be rushed out to address zero days, and the work should be done ahead of time to make them safe.


Having spent time reverse-engineering Crowdstrike Falcon, a lot of funny things can happen if you feed it bad input.

But I suspect they don't have much motivation to make the sensor resilient to fuzzing, since the thing's a remote shell anyways, so they must think that all inputs are absolutely trusted (i.e. if any malicious packet can reach the sensor, your attackers can just politely ask to run arbitrary commands, so might as well assume the sensor will never see bad data..)


"that all inputs are absolutely trusted"

This is something funny to say when the inputs contain malware signatures, which are essentially determined by the malware itself.

I mean, how hard would it be to craft a malware that has the same signature as an important system file? Preferably one that doesn't cause immediate havoc when quarantined, just a BSOD after reboot, so it slips through QA.

Even if the signature is not completely predictable, the bad guys can try as often as they want and there would not even be way to detect these attempts.


> malware signatures, which are essentially determined by the malware itself.

No they're not. The tool vendor decides the signature; they pick something characteristic that the malware has and other things don't. That's the whole point.

> how hard would it be to craft a malware that has the same signature as an important system file?

Completely impossible, unless you mean, like, bribe one of the employees to put the signature of a system file instead of your malware or something.


> The tool vendor decides the signature

Sure, but they do it following a certain process. It's not that CrowdStrike employees get paid to be extra creative in their job, so you likely could predict what they choose to include in the signature.

In addition to that, you have no pressure to get it right the first time. You can try as often as you want and analyzing the updated signatures you even get some feedback about your attempts.


> Sure, but they do it following a certain process.

Which is going to include checking that it doesn't match any OS files.

> You can try as often as you want and analyzing the updated signatures you even get some feedback about your attempts.

As others said, probably only if you can reverse a hash function.


More details please. What do you mean by "is a remote shell anyways"? Thanks!


Falcon has a feature called "Real Time Response". The sensor is in contact with a server with which it exchanges events serialized in protobuf.

One of the events you can get from the CrowdStrike server runs an arbitrary shell command.

https://www.crowdstrike.com/tech-hub/endpoint-security/the-p...


How can THIS pass any sane Audit?!

Like, «We require that your employees open only links on a whitelist, and social networks cannot be put on this list, and we require a managed antivirus / firewall solution, but we are OK that this solution has a backdoor directly to a 3rd party organization»?

It is crazy. All these PCI DSS and SOC 2 audits look like a comedy if they allow such things.


At a former employer of about 15K employees, two tools come to mind that allowed us to do this on every Windows host on our network[0].

It's an absolute necessity: you can manage Windows updates and a limited set of other updates via things like WSUS. Back when I was at this employer, Adobe Flash and Java plug-in attacks were our largest source of infection. The only way to reliably get those updates installed was to configure everything to run the installer if an old version was detected, and then find some other ways to get it to run.

To do this, we'd often resort to scripts/custom apps just to detect the installation correctly. Too often a machine would be vulnerable but something would keep it from showing up on various tools that limit checks to "Add/Remove Programs" entries or other mechanisms that might let a browser plug-in slip through, so we'd resort to various methods, all the way down to "inspecting the drive directory-by-directory", to find offending libraries.
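A stripped-down sketch of that kind of directory sweep might look like the following (the file names and root path are purely illustrative; the real checks also had to compare versions):

    import os

    # Hypothetical example: walk a drive looking for plug-in libraries known to be outdated.
    SUSPECT_FILES = {"npswf32.dll", "npjp2.dll"}   # illustrative Flash / Java plug-in names

    def find_suspect_libraries(root):
        """Yield full paths of any files matching the suspect list, wherever they hide."""
        for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
            for name in filenames:
                if name.lower() in SUSPECT_FILES:
                    yield os.path.join(dirpath, name)

    for path in find_suspect_libraries("C:\\"):
        print("possible stale plug-in:", path)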

We used a similar capability all the way back in the NIMDA days to deploy an in-house removal tool[1]

[0] Symantec Endpoint Protection and System Center Configuration Manager

[1] I worked at a large telecom at that time -- our IPS devices crashed our monitoring tool when the malware that immediately followed NIMDA landed. The result was a coworker and I dissecting/containing it and providing the findings to Trend Micro (our A/V vendor at the time) maybe 30 minutes before the news started breaking and several hours before they had anything that could detect it on their end.


At ring 0 I assume. Not that it would matter, I imagine privesc would be fairly trivial.


It's an interface to the ring 0 kernel module. Everything is a remote shell if it can talk to ring 0.


Hilariously, my last employer was switching to Crowdstrike a few months ago when my contract ended. We previously used Trellis which did not have any remote control features beyond network isolation and pulling filesystem images. During the Crowdstrike onboarding, they definitely showed us a demo of basically a virtual terminal that you could access from the Falcon portal, kind of like the GCP or AWS web console terminals you can use if SSH isn't working.


it's a rootkit with RCE, and the C&C is CS headquarters.


That approach only makes sense if trusted inputs are tested


As I understand it, this only manifests after a reboot, and if the 'content update' is tested at all, it is probably in a VM that just gets thrown away after the test and is never rebooted.

Also, this makes me think:

How hard would it be to craft a malware that has the same signature as an important system file?

Preferably one that doesn't cause immediate havoc when quarantined, just a BSOD after reboot, so it slips through QA.

I don't believe this is what's happened, but I think it is an interesting threat.


Nope, not after a reboot. Once the "channel update" is loaded into Falcon, the machine will crash with a BSOD and then it will not boot properly until you remove the defective file.


> How hard would it be to craft a malware that has the same signature as an important system file?

Very, otherwise digital signatures wouldn’t be much use. There are no publicly known ways to make an input which hashes to the same value as another known input through the SHA256 hash algorithm any quicker than brute-force trial and error of every possibility.

This is the difficulty that Bitcoin mining is based on - the work that all the GPUs were doing, the reason for the massive global energy use people complain about, is basically a global brute-force through the SHA256 input space.

See the “find a custom SHA256” challenge on HN last month discussions: https://news.ycombinator.com/item?id=40683564
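To get a feel for what "brute-force through the input space" means, here's a minimal sketch (the file contents and attempt count are arbitrary):

    import hashlib

    target = hashlib.sha256(b"important_system_file_contents").hexdigest()

    # Naive preimage search: try candidate inputs until one hashes to the target.
    # The space is 2**256 (~1.2e77 possibilities), so a million attempts is nothing.
    attempts = 1_000_000
    found = False
    for i in range(attempts):
        candidate = f"malware_attempt_{i}".encode()
        if hashlib.sha256(candidate).hexdigest() == target:
            found = True
            break

    print(f"match after {attempts:,} tries: {found}")   # False, and it always will be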


I was talking about malware signatures, which don't necessarily use cryptographic hashes. They are probably more optimized for speed because the engine needs to check a huge number of files as fast as possible.


Cryptographic hashes are not the fastest possible hash, but they are not slow; CPUs have hardware SHA acceleration: https://www.intel.com/content/www/us/en/developer/articles/t... - compared to the likes of a password hash where you want to do a lot of rounds and make checking slow, as a defense against bruteforcing.

That sounds even harder; Windows Authenticode uses SHA1 or SHA256 on partial file bytes, the AV will use its own hash likely on the full file bytes, and you need a malware which matches both - so the AV will think it's legit and Windows will think it's legit.


> same signature as an important system file

AFAIK important system files on Windows are (or should be) cryptographically signed by Microsoft. And the presence of such a signature is one of the parameters fed to the heuristics engine of the AV software.

> How hard would it be to craft a malware that has the same signature as an important system file?

If you can craft malware that is digitally signed with the same keys as Microsoft's system files, we got way bigger problems.


>How hard would it be to craft a malware that has the same signature as an important system file?

Extremely, if it were easy that means basically all cryptography commonly in use today is broken, the entire Public Key Infrastructure is borderline useless and there's no point in code signing anymore.


That makes me even more unsettled! Shouldn't this be closer to metadata than operational/mechanical?

Feels like they made unsafe data for the format they created. Untrustworthy. To your point, they aren't testing.


Why does it make you more unsettled? The number of parsers written in unsafe languages for difficult formats is immense. They're everywhere.


Admittedly, I don't know exactly what's in these files. When I hear 'content' I think 'config'. This is going to be very hypothetical, so I ask for some patience, not arguments.

The 'config file' parser is so unsafe that... not only will the thing consuming it break, but it'll take down the environment around it.

Sure, this isn't completely fair. It's working in kernel space so one misstep can be dire. Again, testing.

I think it's a reasonable assumption/request that something should try to degrade itself, not the systems around it.

edit: When a distinction between 'config' and 'agent' releases is made, it's typically with the understanding that content releases move much faster/flow freely. The releases around the software itself tend to be more controlled, being what is actually executed.

In short, the risk modeling and such doesn't line up. The content updates get certain privileges under certain (apparently mistaken) robustness assumptions. Too much credit, or attention, is given to the Agent!


It passed all unit tests!


It passed the type checker!


> It wasn't a software update, it was a content update, a 'channel file'

Because I know nothing about Crowdstrike... what is a "channel file"? Some kind of config file?


It is how they package their malware definitions. It's semantics.


So their malware definition turned into malware?

Good to know they don't check their definitions for defects before installing them.


It's possible there's no human involvement from detection to deployment.


This will go down as one of the worst examples of communication during an outage.


Since when is this content not software, just because it is not an .exe?


"All over the place I'm seeing checkbox compliance being prioritised above actual real risks from how the compliance is implemented."

Great statement and one that needs to be seriously considered - would the DORA regulation in the EU address this, I wonder? It's a monster piece of tech legislation that SHOULD target this, but WILL it - someone should take today's disaster and apply it to the regs to see if they're fit for purpose.


Emphatically NO. Involved in (IT) Risk and DORA in a firm that actually does IT risk scenario planning (the sort of thing that's the opposite of checkbox compliance). DORA is rubber stamping all the way round. One caveat is that we are way ahead of DORA, so treating DORA as a checkbox exercise might be situational. But I haven't noticed a place where the rubber hits the road regulatory-wise. It's too easy to stay in checkbox compliance if the board doesn't see IT risk as a major concern. I'm happy one of our board members does. We've gone so far as to introduce a person- and paper-based credit line, so we can continue an outgoing cashflow if most of our processes fail (for an insurer).


Broken regulations? Fix by adding more!


What's your suggestion for fixing broken regulations? Not having any? That is also "broken".


Well, yeah. If a regulation is broken and not achieving its goal it should be changed. What's the alternative? "Regulation? We tried that once and it didn't work perfectly, so now we let The Market™ sort out safety standards."


Who needs regulation when you can have free Fentanyl with your CrowdStrike subscription! All of your systems will go down, but you won't care, and the chance of accidental overdose is probably less than 10%!


The child slave labour is what really gets the deal across the line


Yes, in many contexts that may well be the correct conclusion. Your comment presumes that regulation here has proven itself useful and not resulted in a single point of failure which potentially reduces overall safety. It’s of course the correct comment from a regulator’s perspective.


For the market to work, wouldn't you need something to hold the corps accountable if they fail to be secure AND to make regular people whole if the corps' failures cause them problems?


Yes, like the court system … specifically class actions in the United States have been established for this exact purpose.


After attorney's fees, class action rarely pays enough to make the victims whole.

Suing individually is only an option if someone can afford a lawyer.


Especially for something like technology and infosec, which changes rapidly, it's silly to look to slow-moving regulations as a solution, not to mention ignoring history and gambling that politicians will do it competently and that it won't have negative side effects like distracting teams from doing real work that'd actually help.

You can make fines and consequences after the fact for blatant security failures as incentives, but inventing a new "compliance" checklist of requirements is going to be out of date by the time it's widely adopted, and most companies do the bare minimum bullshit to pass these checklists.


There are so many English-centric assumptions here.

Regulation of liability can be very generic and broad, with open standards that don't need to be updated.

Case in point: most of continental Europe still uses Napoleon's Code civil to prescribe how and when private parties are liable. This is more than 150 years old.

The real issue is that most Americans are stuck with an old English regulatory system, which for fear of overreach was never modernized.


> companies do the bare minimum bullshit

This can be true of security (and every other expense) whether it's regulated or not. Which do you think will result in fewer incidents: the regulated bare minimum, or the unregulated bare minimum?


EU tech regulation actually addressing an issue effectively? I wouldn't hold my breath, but there is a first time for everything.


I like USB-C in my iPhone.


> So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.

No, we need to hold Architects accountable, and this is the core of the issue. Creating systems with single, outsourced responsibility, in the critical path.


As a CTO, when your company goes down you get fired.

When every company goes down you get let off.

The sensible thing is to follow the herd and centralise. You're outsourcing the risk to your own job.


This is the point of much of the security efforts we see now.

Outsourcing of security functions, and things like login push a lot of liability and legal issues off into someone else's house.

It's hard to be the source of a password leak, or be compromised, when you don't control the passwords. But like any chain, you're only as secure as your weakest link... Snowflake is a great current example of this. Meanwhile the USPS just told us "oops", we had tracking pixels for a bunch of vendors all over our delivery preview tool.

Candidly, most people's stacks look a lot less like software and more like a toolbar-riddled IE5 install circa 2000. I don't think our industry is in a good place.


This is one of the interesting aspects in Ethereum.

If your validator is down, you lose a small amount of stake, but if a large percentage of the total set of validators are down, you all start being heavily penalized.

This incentivizes people running validators not to use the most popular Ethereum client, to avoid using a single compute provider, and, overall, to avoid relying on the popular choice, since doing so can cause them to lose the majority of their stake.

There hasn't been a major Ethereum consensus outage, but when that happens, the impact of being lazy and following the herd will be huge.


How is it lazy and herd-like to _not_ run the latest and greatest? Sounds like Ethereum's design is promoting a robustly diverse ecosystem rather than a monoculture.


> How is it lazy and herd-like to _not_ run the latest and greatest?

I'm not sure what you're asking here. Ethereum incentives don't make you run the latest version of your client's software (unless there's a hardfork you need to support). You can run any version that follows the network consensus rules.

The incentives are there to punish people who use the most common software. For example, let's say there are around 5 consensus clients which are each developed by independent teams. If everyone ran the same client, a bug could take down the entire network. If each of those 5 clients were used to run 20% of the network, then a bug in any one of them wouldn't be a problem for Ethereum users and the network would keep running.

If the network is evenly split across those 5 clients but all of them are running in AWS, then that still leaves AWS as a single point of failure.

The incentives baked into the consensus protocol exist to push people towards using a validator client that isn't used by the majority of other validators. That same logic applies to other things like physical host locations, 3rd party hosting providers, network providers, operating systems, etc... You never want to use the same dependencies as the majority of other validators. If you do and a wide-spread issue happens, you're setting yourself up to lose a lot of money.
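As a toy model of that incentive (not the actual protocol math, just the rough shape: the penalty grows sharply with how correlated your failure is with everyone else's):

    def offline_penalty(fraction_offline, base_penalty=1.0):
        """Toy model: being offline alone costs roughly base_penalty; being offline
        together with a large share of the network costs dramatically more."""
        return base_penalty * (1 + 100 * fraction_offline ** 2)

    print(offline_penalty(0.001))   # just you: ~1.0
    print(offline_penalty(0.40))    # 40% of the network down with you: 17x worse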


It sounds like you're describing the advantages of diversity, with a little game theory thrown in to sweeten the deal. Still not sure how that can be described as lazy, or did I completely mis-read the original phrasing?


If 90% of the world runs in AWS, and I'm the only one running on my own hardware, do I get a benefit when AWS goes down?


> When every company goes down you get let off.

And when only companies who use a certain OS and/or Cloud vendor go down? ;-) Do you also get let off?


If it's large enough. Nobody loses their job when Office 365 goes down, even if that happens once a year.

However if you decided to choose a small company which goes down once every 5 years, you're screwed.


I find that in today's world it is no longer about one person being "accountable". There is always an interplay of factors; like others have pointed out, cyber security has a compliance angle. Other times it is a cost factor: redundancy costs money. Then there is the whole revolving door of employees coming and going, so institutional knowledge about why a decision was made is lost with them.

That is hard to do for even a small company. How do you balance all that out for critical infrastructure at a much larger scale?


The problem is that even knowing that this is likely to happen, many companies would still put CrowdStrike into a critical system for the sake of security compliance / audit. And it's not even prioritization of security over reliability, because the incentives are to care more about check-boxes in the audit report than about actual security. It looks like almost no party in this tragic incident had a strong incentive to prevent it, so it's likely to happen again.


Can anyone explain how CrowdStrike could possibly fix this now? If affected machines are stuck in an endless BSOD cycle, is it even possible to remotely roll out a fix? My understanding is that the machines will never come to the point where a CS update would be automatically installed. Is the only feasible option the official workaround of manually deleting system files after booting into the recovery environment? How could this possibly be done on scale in organizations with tens of thousands of machines?


There are orgs out there right now with 50,000+ systems in a reboot loop. Each one needs to be manually configured via safe mode to disable CS so that the agent version can be updated to the fixed version. Throw BitLocker into the mix, which makes this process even longer, and we're talking about weeks of work to recover all systems.
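For the per-machine step itself, once you can reach the filesystem (safe mode, the recovery environment, or the volume attached to a working machine), my understanding of the public workaround is that it boils down to deleting the offending channel file. A rough sketch, assuming the broken volume is mounted at a hypothetical drive letter:

    import glob
    import os

    # Hypothetical: the broken system volume is attached to a working machine as D:\
    DRIVER_DIR = r"D:\Windows\System32\drivers\CrowdStrike"

    # Per the published workaround, remove the offending channel file(s), detach, and reboot.
    for path in glob.glob(os.path.join(DRIVER_DIR, "C-00000291*.sys")):
        print("removing", path)
        os.remove(path)

The hard part isn't this step; it's repeating it (plus BitLocker recovery keys) across tens of thousands of machines.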


CrowdStrike itself will not fix anything. They published a guide on how to work around the problem and that's it. Most likely a lot of sales reps and VPs will be fielding calls all over the weekend, explaining to large customers how they managed to screw up and how much of a discount they will offer on the next renewal cycle.

Legally, I think somewhere in their license it says that they're not responsible in any way or form if their software malfunctions in any way.


> Legally, I think somewhere in their license it says is that they're not responsible in any way or form if their software malfunctions in any way.

I really should add this to my resume and see if it’ll work.


Nah, it only works for corporations. Peons still have accountability.


I think about this all the time.

Like if I kill someone of course I go to jail. But if I get some people together, say we're a company, and then kill 100 people, nobody goes to jail. How does that work? What a huge loophole.


Philips (the company) basically killed people with malfunctioning CPAP machines (which are meant to help against sleep apnea) and no one went to jail. So that's a practical example.


While it sounds funny, it doesn't work like that. We'd be having real corporate shootouts every day all over the place :))


I don't think that's true in this case. I've never heard of an individual employee who introduced a bug being legally liable for it.


It's already the norm for devs to not be responsible for software malfunctions. They can choose to end their relationship with you, but they can't sue you for damages.


Small companies get the shitty generic license.

Big companies negotiate liability terms.


Yep, I've been involved in many vendor contracts at my company, and the contracts take weeks to months to finalize because every aspect of the agreement is up for discussion. Even things like SLAs (including how they're calculated), liability limitations, indemnity, and recourse in the event of system failure are all put through the wringer until both sides come to agreeable terms. This is true for big and tiny vendors.


> Big companies negotiate liability terms.

I have never heard of that. Can you point to some examples?

Not SLA's (which are standard), but actual liability? E.g. if we brick your computers we'll pay for replacements and lost employee productivity?


> Big companies negotiate liability terms.

Never heard that in the context of the software licenses.


This isn't a GitHub project with an MIT license. When you do B2B software, there aren't software licenses; there are contractual terms and conditions. The T&Cs outline any number of elements, including SLAs, financial penalties for contractual breaches, etc. Larger customers negotiate these T&Cs line by line. Smaller customers often accept the standard T&Cs.


Penalties, as far as I was involved in vendor discussions, are a part of the negotiation only when the software provider does work on the client's premises and is liable to that extent.

For software, you don't pay penalties for the possibility that it might malfunction once in a while; that's what bug-fixes are for, and you get offered an SLA for that, but only for response time, not actual bug fixing. Where you do get penalties, and maybe even your money back, is when the software is listed as being able to do X, Y, and Z, it only does X and Z, and the contract says it must do everything it said it does.


Pretty standard in enterprise b2b, most of the sales cycle is in contracts


Well, probably not? I've never seen liabilities in dollar value, or rather any significant value. Also, I saw our company's CrowdStrike contract for 10k+ seats; no liabilities there.


"THIS SOFTWARE IS PROVIDED AS-IS..."


I think I preferred it AS-WAS.


But they've already fixed it.

"CrowdStrike Engineering has identified a content deployment related to this issue and reverted those changes."

How they've reverted changes on non-booting PCs, goodness only knows... ;)


Sounds like people in some of these environments will be doing their level best to automate an appropriate fix.

Hopefully they have IPMI and remote booting of some form available for the majority of the affected boxes/VMs, as that could likely fix a large chunk of the problem.


Imagine if North Korea came out with a statement that they did it. It would spawn such an amount of work internally at CS to prove whether it was intentional or a simple mistake.


Amazing idea


I work for a government organization that is constantly audited and I've seen this play out over and over.

An important aspect I never see mentioned is that most Cyber Security personnel don't have the technical experience to truly understand the systems they are assessing; they are, like you said, just pushing to check those compliance boxes.

I say this as someone who is currently in a Cyber Security role, unfortunately, as I'm coming to learn cyber roles suck. But this isn't a jab at those Cyber Security personnel's intelligence. It's literally impossible to understand multiple systems at a deep level, it takes employees working on those systems weeks to months to understand this stuff, and that's with them being in the loop. Cyber is always on the outside looking in, trying like hell to piece it all together.

Sorry for the rant. I just wanted to add on with my personal opinions on the cyber security framework being severely broken because I deal with it on a daily basis.


> It's literally impossible to understand multiple systems at a deep level

No, it's not. It takes above average intelligence, and major investment in actual education (not just "training"), and actual depth of experience, but it's not impossible.


Do you think it comes from a fundamental misconception of how these roles should be structured? My take is that you just can't fundamentally assess technical elements from the outside unless they have been designed that way in the first place (for assessability). For example, I educate my team that they have to structure their git commits in a way that demonstrates their safety for audit / compliance purposes (never ever combine a high-risk change with a low-risk one, for example). That should go all the way up the chain. Failure to produce an auditable output is failure to produce an output that can be deployed.


Our compliance and security people turned up with an urgent request to patch our Linux kernels in AWS.

The PCMCIA driver had a vuln.

I don’t listen to them much anymore


I know of an important company currently pushing to implement a redundant network data loss prevention solution, while they don't have persistent VPN enabled and have multiple known misconfigurations that prevent web decryption from working properly.

Because someone needs a checkbox.


The flip side is, if you don't do auto updates, and an exploit that you would have been protected against is published and used against you before you've tested / pushed the patch, you are up the creek without a paddle in that situation as well.

To some degree you have to trust the software you are using not to mess things up.


So since I do mission critical healthcare I do run into this concept. But it's not as unresolvable as you portray. Consider for example HIPAA "break the glass" requirement. It says that whatever else you implement in terms of security you must implement a bypass that can be activated by routinely non-authorised staff to access health information if someone's life is in danger.

Similarly, when I questioned, "why can't users turn off ZScaler in an emergency" we were told that it wouldn't be compliant. But it's completely implementable at a technical level (Zscaler even supports this). You give users a code to use in an emergency and they can activate it and it will be logged and reviewed after use. But the org is too scared of compliance failure to let users do it.
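A break-the-glass control like that doesn't need to be exotic; here's a minimal sketch of the idea (the codes, the log destination, and the disable hook are all hypothetical):

    import datetime

    # Hypothetical one-time emergency codes issued out-of-band to staff.
    ISSUED_CODES = {"7FK2-9QHD", "X3MR-51TB"}
    AUDIT_LOG = "break_glass.log"

    def break_glass(user, code, reason):
        """Disable the agent only with a valid code, and always leave an audit trail."""
        ok = code in ISSUED_CODES
        with open(AUDIT_LOG, "a") as log:
            log.write(f"{datetime.datetime.utcnow().isoformat()} user={user} "
                      f"code_valid={ok} reason={reason!r}\n")
        if ok:
            ISSUED_CODES.discard(code)   # single use
            # disable_agent()            # hypothetical hook into the endpoint agent
        return ok

    print(break_glass("oncall-nurse-03", "7FK2-9QHD", "workstation unusable during emergency"))

Every activation gets reviewed afterwards, which is the trade-off: availability first, accountability after.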


While I agree with the requirement, it sounds like a vault would need to have an unlocked door with a sign.


Well, if the vault says you have COPD, and the devious bank robber is interested in your continued breathing, perhaps we can just review the footage after the fact.

This is one of those cases where you don't disable emergency systems to defend against rogue employees. If people abuse emergency procedures, you let the legal system sort it out.


A vault with firearms in the police station to which every staff member has a key. Sounds reasonable to me.

Users are not prisoners left in the burning building without a fire escape.


> It says that whatever else you implement in terms of security you must implement a bypass that can be activated by routinely non-authorised staff to access health information if someone's life is in danger.

Huh.

I can see why this needs to exist, but hadn't thought of it before. Same deal as cryptography and law-enforcement backdoors.

> logged and reviewed after use

I was going to ask how this has protection from mis-use.

Seems good to me… but then I don't, not really, not deeply, not properly, feel medical privacy. To me, violation of that privacy is clearly rude, but how the bar raises from "rude" to "illegal" is a perceptual gap where, although I see the importance to others, I don't really feel it myself.

So it seems good enough to me, but am I right or is this an imagination failure on my part? Is that actually good enough?

I don't think cryptography in general can use that, unfortunately. A simple review process can be too slow for the damage in other cases.


Yes. And the vast majority of the time, it doesn’t mess things up.

The notion that you may take on risk to net alleviate risk is somehow lost on a lot of people in these conversations.


This is an oversimplification. IF we are talking about compliance with ISO 27001, you are supposed to do your own risk assessment and implement the necessary controls. The auditor will basically just check that you have done the risk assessment, and that you have implemented the controls you said yourself you need.

I'd say this has nothing to do with regulatory compliance at all. The real truth is that modern organizations are way too attached to cloud solutions. And this runs across all parts of the organization, with SaaS and PaaS, whether it's email (imagine Google Workspace having a major issue), AWS, Azure, Okta…

I've had the discussions so many times and the answer is always – the risks don't matter because the future is cloud, and even talking about self-hosting anything is naive, and honestly we need to evaluate your competence for even suggesting it.

(Also, the cloud would maybe not be this fragile if it wasn't for lock-in with different vendors. If you read the TOS, basically all cloud services say that you are responsible for backups – but getting your data out of the service is still a pain in the ass – if possible at all.)


> The real truth is that modern organizations are way too attached to cloud solutions.

I'm confused. This is a security product for your local machine. Not the cloud.

Unless you call software auto-update "the cloud", but that's not what people usually mean. The cloud isn't about downloading files, it's about running programs and storage remotely.

I mean, if CrowdStrike were running entirely on the cloud, it seems like the problem would be vastly easier to catch immediately and fix. Cloud engineers can roll back software versions a lot more easily than millions of end users can figure out how to safe boot and follow a bunch of instructions.


Well, there has usually been the option to run a local proxy/cache for your updates so that you can properly test them inside your own organization before rolling them out to all your clients (precisely to avoid this kind of shit show). But doing that requires an internal team running it and actually testing all updates. And modern organizations don't want an IT department; they want to be "cloud first". So they rely on services that promise they can solve everything for them (until they don't).

Cloud is not just about where things are – it's also about the idea that you can outsource every single piece of responsibility to an intangible vendor somewhere on the other side of the globe – or "in the cloud".


> Cloud is not just about where things are – it's also about the idea that you can outsource every single piece of responsibility to an intangible vendor somewhere in the cloud.

I've never heard of a definition of cloud like that.

Cloud is entirely about where things are.

Outsourcing responsibility to a vendor is totally orthogonal to the idea of the cloud. You can outsource responsibility in the cloud or not. You can also outsource responsibility on local machines or not.

And outsourcing responsibility has existed since long before the concept of the cloud was invented.

It's important to keep definitions clear.


The product affected here is literally called "CrowdStrike Falcon® Cloud Security". Meraki, although they sell routers and switches, markets their products as a "cloud-based network platform". Jamf, although their product runs on endpoint devices, is marketed as "Jamf Cloud MDM". I think it's fair to say that cloud these days does not only mean storing data or running servers in the cloud, but also infrastructure that is in any way MANAGED in the cloud.

So to tie back to what I wrote earlier – none of these services has to have the management part in the cloud. They could just give you a piece of software to run on your own server. That would certainly distribute the risk, since as it stands it only takes someone hacking the vendor to go after all their customers – or, in this case, one faulty update to break every user's experience. And as far as I can see, it seems we are willing to take those risks because we think it's nice having someone else manage the infrastructure (and that was my main point in the first comment).


> My org which does mission critical healthcare just deployed ZScaler on every computer which is now in the critical path of every computer starting up

Hi fellow CVS employee. Are you enjoying your zscaler induced SSO outages every week that torpedo access to email and every internal application? Well now your VMs can bluescreen too. A few more vendor parasites and we'll be completely nonfunctional. Sit tight!


When we think "security" on HN we think about the people who escalate wiggling voltages at just the right time into a hypervisor shell on XBox, but I've had to recognize that my learned bias is not correct in the real world. In the real world, "computer security" is a profession full of hucksters that can't tell post-quantum from heap and whose daily work of telling people repeatedly to not click links in Outlook and filling out checklists made by people exactly like them has essentially no bearing on actual security of any sort.


It's driven by a lot of things. Part of it is driven by rising cyber liability insurance rates, for one. A lot of organizations would rather not pay for CrowdStrike, but the premiums for not having an "EDR/XDR/NGAV" solution can be astoundingly high at-scale.

Fundamentally there's a lot of factors in this ecosystem. It's really wild how incentives that seem unrelated end up with crazy "security" products or practices deployed.


> A lot of organizations would rather not pay for CrowdStrike, but the premiums for not having an "EDR/XDR/NGAV" solution can be astoundingly high at-scale.

Just like a lot of homeowners would rather not pay for ADT, but insurance requires a box-ticking “professionally-monitored fire alarm system.” Nevermind that I can dial 911 as well as the “professional” when I get the same notification as they do.


> In the real world, "computer security" is a profession full of hucksters

Always has been. The information security model is about analogizing digital systems as physical systems, and employing the analogues of those physical controls that date back hundreds of years on those digital systems. At no point, in my relatively long career, have I ever met anyone in Information Security who actually understands at depth anything about how to secure digital systems. I say this as someone who has spent a lot of my career trying to do information security correctly, but from the perspective of operations and software engineering, which is where it must start.

The entire information security model the world works with is tacking on security after the fact, thinking you need to build walls and a vault door to protect the room after the house has already been built, when in fact you need to build the house to be secure from the start, because attacks don't go through doors, attacks are airborne (I recognize the irony of my analogizing digital concepts to physical concepts surrounding security, but I do it for any infosec people that may read my comment, so they can understand my point).

Because of this model, we have gone from buying "boxes" to buying "services", but it has never matured away from the box-checking exercise it's been since day one. In fact, many information security people have /no training or education/ in security; it's entirely in regulatory compliance.


I’ve met highly paid “security engineers” who talked about not really being into programming, or being okay with Python but finding everything else too complicated.

It shocks me that such a low level of technical competence is required.


> So CrowdStrike is deployed as third party software into the critical path of mission critical systems and then left to update itself.

TIL that the US government has pressured foreign nations to install a mystery blob in the kernel of machines that run critical software "for compliance".

If this wasn't a providential goof on the part of Crowdstrike -- the entire planet is now aware of this little known fact -- then some helpful soul in Crowdstrike has given us a heads-up.


Don't put all your eggs in one basket: I use multiple anti-virus products so that if one blows up, at least not all computers are affected. Looks like my old wisdom is still new wisdom.

Clarification: I mean that every computer has one anti-virus product, but not every computer has the same anti-virus product. I'm not installing multiple anti-virus products on the same computer.


If you have all of them on a critical path then your risk of blow up increases!


Not sure that's a great idea. This stuff tends to have very high privilege access.

It's enough that one of your anti virus vendors get hacked for your whole organization to get owned...


I see one main challenge - it can increase the incidence of false positives.


You use multiple anti-virus products. Let's assume you use 3. Do you have multiple clusters of machines, each running their own AV product, so in case one has this problem the other two are unaffected?

How much overhead are we talking about here? Because if you're just using multiple AV software installed on one machine, 1) holy shit, the performance penalty, 2) you'd still be impacted by this, as CS would have taken it down.


They surely mean that all odd-numbered assets are running CrowdStrike and even-numbered ones are running SentinelOne (or similar; %3, %4, etc.). At least then you only lose half your estate.
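Mechanically, that kind of split is trivial to automate; a sketch of a deterministic assignment (the vendor names and hashing scheme are just illustrative):

    import hashlib

    VENDORS = ["CrowdStrike", "SentinelOne", "Defender"]   # illustrative choices

    def vendor_for(hostname):
        """Stable assignment: the same host always gets the same vendor."""
        digest = hashlib.sha256(hostname.encode()).digest()
        return VENDORS[digest[0] % len(VENDORS)]

    for host in ["pos-0001", "pos-0002", "xray-lab-07"]:
        print(host, "->", vendor_for(host))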


Yes, each computer has only one anti-virus product installed; it's basically a random distribution across the estate.


I have never seen a company that uses multiple AV products rolled out to user machines, ever. Sure, when you transition from one product to another, but across the whole company, at the same time? Never... I have also never seen a distribution of something like Active Directory servers based on antivirus software. I think these stories are purely academic, "why didn't you just..." tall tales.


Mine certainly does: our key Windows-based control systems use Windows Defender; the corporate crap gets SentinelOne and Zscaler and whatever else has been bought on a whim.

I'd assumed that any essential company would be similar. OK, if the purchasing systems for your hospital are down for a couple of days, it's a pain. If you can't get X-rays, it's a catastrophe.

If half your x-ray machines are down and half are up, then it's a pain, but you can prioritise.

But lots of companies like a single supplier. Ho hum.


Not the person you're replying to, but in any reasonable organization with automated software deployment it should be easy to pool machines into groups, so you can make sure that each department has at least one machine that uses a different anti-virus software.

Bonus, in case you do catch a malware, chances are higher that one of the three products you use will flag it.


Again, "should be" academic stuff.

So you have multiple AV products and you target those groups. You have those groups isolated on their own networks, right? With all the overhead that comes with strict firewall rules and transmission policies between various services on each one. With redundant services on each network... you've doubled or tripled your network device costs solely to isolate for anti virus software. So if only one thing finds the zero day network based virus, it won't propagate to the other networks that haven't been patched against this zero day thing.

How far down the rabbit hole do we want to go? If you assume many companies are doing this kind of thing, or even a double digit percentage of companies, I have bad news for you.


Basically every machine gets a randomly picked anti-virus suite assigned at deployment. I'm not running multiple AV products on one machine.


Was there any situation where having 3 anti-virus products was more beneficial than having only 2?


I'm reading this as first third of computers have AV brand A, second third have brand B, remainder have brand C.

Thus, if brand A does something actively harmful all by itself, only 1/3rd of machines are impacted.

This is an improvement on having only 2 brands, as having 1/3rd of your machines go down is better than having 1/2 of your machines go down.


This is much easier applied personally than it is to a 30k person organisation. No need to be condescending.


In both cases it's costly.

But maintenance cost aside, it wouldn't be that bad to deploy each half of the fleet with a distinct EDR.

This is actually implicitly in place for big companies that support BYOD. If half your fleet is on Windows, another 40% on macOS and 10% on Linux, you need distinct EDR solutions, and a single issue can't affect the whole fleet at once.


Indeed a wise strategy


Zscaler is truly amazing. It can't do HTTP/2. Our product is HTTP/2-only. So we can't use our own product at work.


I know a few people who have Zscaler deployed at work. It will routinely kick them off the internet, like multiple times a day. It has gotten to the point where they can sort of tell in advance that it's about to happen.

The theory so far is that it's related to their activities: working in DevOps, they will sometimes generate "suspicious" traffic patterns which then trigger some policy in Zscaler, but they're not actually sure.


ZScaler itself uses port 443 UDP, but blocks QUIC. The last time I checked it didn't support IPv6 so they told customers to disable IPv6. Security software is legacy software out of the box and cuts the performance of computers in half.


What Zscaler can and will do though is break your network randomly and in strange ways. They don't even seem to charge for that feature!


Their visibility and process in general for handling abuse of their services is also abysmal.


> more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.

Duh, else there would be no need to audit them to force compliance, they'd just do it by themselves. The only reason it needs forcing is that they otherwise aren't motivated enough.


Good point. But the audit seems useless now. It's supposed to prevent the carelessness from causing... this thing that happened anyway.

Sure, maybe it prevented even more events like this from happening. But still.


> Good point. But the audit seems useless now. It's supposed to prevent the carelessness from causing... this thing that happened anyway.

> Sure, maybe it prevented even more events like this from happening. But still.

Because the point of an audit is not to prevent hacks; it's to prove that you did your due diligence to not get hacked, so the fact that a hack happened is not your fault.

You can hide under the umbrella of "sometimes hacks happen no matter what you do".


CYA is the reason you do the audit. But the reason for the audit's existence and requirement is definitely so that hacks don't happen. Don't tell me regulatory agencies require things so that companies can hide behind them.


The reason for the audit's existence is CYA one level above. The chain ends with a politician's CYA in front of the electorate.


To be fair, I'd claim that it's pretty rare for anything anyone ever does to not be a trade-off.


Audit is papering over the problem rather than fixing it. The only way to make them responsible is to put real liability on them.


Who is them though? The airport that used this software? You can't put all the blame on the software vendor. It can be a good and useful component when not relied on exclusively for the functioning of the airport. Not relying on a single point of failure should be the responsibility of the business customer who knows the business context and requirements.

You will have each company person pointing at the others. That's why you have contracts in place.

You won't ever have real consequences for executives and real decision makers and stakeholders because the same kind of people make the laws. They are friends, revolving door etc.


There's no responsibility at any level, is the thing. Those people who couldn't fly might get a rebooking and some vouchers sent out to them, but they won't really get made whole. The airport knows they won't really be on the hook, so they don't demand real responsibility from their vendors, and so on.


In the grand scheme of things, being able to fly around the globe at these prices is a pretty good deal, even with these rare events taken into account. It's not like the planes fell out of the sky. If you must must definitely be somewhere at a time, plan to arrive one or two days earlier.


The dynamic between compliance and operational integrity


I don't even want to know how many mission critical systems automatically deploy open source software downloaded from github or (effectively random) public repositories.


Unlike Windows, there is at least the option to use curated software distributions such as Debian or RH that won't apply random stuff from upstream repositories.


I'm talking about all sorts of software projects implemented using especially Python, Ruby, node.js etc.


I like Debian, but it's not like they need random upstream repositories when they can make random patches themselves, e.g. the OpenSSL Purify issue.


These risks must be carefully managed, that's it, I think.


If I were running an organization that needs these audits, I'd always have fallback procedures in place that would keep everything running even if all computers suddenly stop working, like they did today. General-purpose software is too fragile to be fully relied upon, IMO.

If a general-purpose computer must be used for something mission-critical, it should not have an internet connection and it should definitely not allow an outside organization to remotely push arbitrary kernel-mode code to it. It should probably also boot from a read-only OS image so that it could always be restored to a known-good state by just rebooting.


Organizations don't want to increase risk by listening to an employee with their personal opinion. Orgs want an outside vendor who they can point at and say "it's their fault", and await a solution. Employees going rogue and not following the vendor defined SW updates is a much higher risk than this particular crisis.


Isn't there a way to schedule the updates? When I used to work at a firm with a critical system running on Windows, we had main and DR servers, and Windows updates were scheduled to roll out first on the main server and a day later, I think, on the DR, which saved us at least once in the past from a bad Windows update...


More or less. You can set up update policies and apply those to subsets of your machines. You can disable updates during time blocks, or block them altogether. There's also the option of automatically installing the "n-1" update.

We run auto n-1 at work, but this also happened at the same time on my test machine which runs "auto n". It never happened before, so this looks like something separate from the actual installed sensor version, especially since the latest version was released something like a week ago.
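To make the distinction concrete, here is a rough sketch of what an "n-1"-style version pin amounts to, purely as an illustration (this is not CrowdStrike's actual mechanism, and the version strings are made up):

    # Illustrative only: gate each host group to a pinned offset behind the
    # latest sensor release. Content/"channel file" pushes that bypass this
    # mechanism (as reportedly happened here) are never gated by it.
    from typing import List

    def pick_version(released: List[str], releases_behind: int) -> str:
        """released is ordered oldest -> newest; 0 means 'auto n', 1 means 'auto n-1'."""
        index = max(0, len(released) - 1 - releases_behind)
        return released[index]

    releases = ["7.13", "7.14", "7.15", "7.16"]   # hypothetical version strings
    print(pick_version(releases, 0))  # 7.16 -> the 'auto n' group
    print(pick_version(releases, 1))  # 7.15 -> the 'auto n-1' group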


It's a big stretch to call this the regulator's fault when it's a basic lack of testing by Microsoft and/or CrowdStrike. If a car manufacturer made safety belts that broke, you wouldn't blame the regulators.

The root cause is automatic, mindless software update without proper testing - nothing to do with regulators.


That's some very twisted logic. If I expect someone to clean the kitchen as part of restaurant closeup checklist, and they fuck it all up, would I blame the checklist, or the person doing the work?

You blame the person fucking it up. In this case, it's someone who only cares about checking a box. Or someone who pushes broken shit.


If this person simultaneously fucks up millions of kitchens around the world, you do not blame that person. You blame the checklist which encouraged giving a single person global interlocked control over millions of kitchens, without any compartmentalization.


> If this person simultaneously fucks up millions of kitchens around the world, you do not blame that person.

No, you definitely do, even more than before. Let's say for example that the requirement is to disinfect anything that touches food. And the bleach supplier fucks it all up. You blame the bleach supplier. You don't throw out the disinfectant requirement.


Most enterprises will have teams of risk and security people. They will be asking who authorized deployment of an untested update into production. If CrowdStrike deployments cannot be managed, then they will switch to a product which can be managed.


Well, if you fail at compliance, you can be fired and sometimes even sued. If your compliance efforts cause a system-wide outage, nobody's to blame, shit happens. I predict this screwup will end up with zero consequences for anyone who took the decisions that led to it, too. So how else do you expect this system to evolve, given this incentive structure?


The question is, how did they manage to not crash everything for so long without a staged/rolling update deployment strategy?


Perhaps it took this long for the offending file to pass 64K Excel rows... :)


> Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.

If a failed audit is the big scary monster in their closet, then it sounds like the senior leadership is not intimately familiar with the technology and software in general, and is unable to properly weigh the risks of their decisions.

More and more companies are becoming software companies whether they like it or not. The software is essential to the product. And just like you would never want a non-lawyer running your law firm, you don't want a non-software person running your software company.


Very sharp and to the point, this comment. I would like to add that in large companies the audit will, in my experience, very often examine documents only -- not actual configuration or code.


This is all well deserved for executives who trust MS to run their businesses. If you have the resources, like a bank, it is a crime to put your company in the hands of MS.


IME, Boomer managers refuse to use anything but Windows. We have a few more years of this.


It's possible that CrowdStrike heavily incentivises being left to update itself.

Removing the features that would allow sysadmins to actually manage updates themselves, even via the installer itself, would definitely be one way, but another could be aggressive focus-stealing nags (similar to Windows' own nags), which in a server environment can cause some major issues, especially when automating processes in Windows (as you need to close the program when updating).

I think it's easy to blame the sysadmins, but I would also be remiss if I didn't point out that in the Windows world we have been slowly accepting these automatic dark patterns and alternative (more controlled) mechanisms have been removed over time.

I almost don't recognise the deployment environment today compared to what it was in 2004; and yes, 20 years is a long time, but the total loss of control over what a computer is doing is only going to make issues like this significantly more common.


They say it was caused by a faulty channel file. I don't know what a channel file is, and they claim not to rely on virus signatures, but typically anti-virus products need the latest signatures all the time and poll for them probably once an hour or so. So I'm not surprised that an anti-virus product wants to stay hyper-updated and that updates are rolled out immediately to everyone globally.


No, I'm not surprised either. But if you're operating at this kind of scale and with this level of immediate roll-out, what I would expect are:

* A staggered process for the roll-out, so that machines that are updated check in with some metrics that say "this new version is OK" (aka "canary deployment") and the update is paused/rolled back if not.

* Basic smoke testing of the files before they're pushed to any customers

* Validation that the file is OK before accepting an update (via a checksum or whatever, matched against the "this update works" automated test checksums)

* Fuzz tests that broken files don't brick the machine

Literally any of the above would have saved millions and millions of dollars today.
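For what it's worth, the first three bullets don't require anything exotic. A back-of-the-envelope sketch of a checksum-gated, ring-based rollout; the ring sizes, bake time, and the stubbed push/health/rollback calls are all hypothetical placeholders:

    # Sketch of a staged rollout with an integrity check and a health gate.
    import hashlib
    import random
    import time

    RINGS = [0.001, 0.01, 0.10, 1.00]   # fraction of the fleet per stage
    BAKE_TIME_SECONDS = 1               # would be hours in real life

    def push_update(hosts, payload):    # placeholder for the real deploy call
        pass

    def rollback(hosts):                # placeholder for the real rollback call
        pass

    def host_reported_healthy(host):    # placeholder for real check-in telemetry
        return True

    def file_ok(payload: bytes, expected_sha256: str) -> bool:
        return hashlib.sha256(payload).hexdigest() == expected_sha256

    def staged_rollout(fleet, payload: bytes, expected_sha256: str) -> None:
        if not file_ok(payload, expected_sha256):
            raise RuntimeError("refusing to ship: payload fails integrity check")
        random.shuffle(fleet)
        done = 0
        for fraction in RINGS:
            target = max(done + 1, int(len(fleet) * fraction))
            batch = fleet[done:target]
            push_update(batch, payload)
            time.sleep(BAKE_TIME_SECONDS)   # let the canary ring soak
            if not all(host_reported_healthy(h) for h in batch):
                rollback(fleet[:target])
                raise RuntimeError("halting rollout: canary ring unhealthy")
            done = target

    if __name__ == "__main__":
        fleet = [f"host-{i:05d}" for i in range(10_000)]
        payload = b"channel file contents"
        staged_rollout(fleet, payload, hashlib.sha256(payload).hexdigest())

The point isn't the specific numbers; it's that any gate at all between "file built" and "file on every machine on Earth" would have limited the blast radius.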


In any kind of serious environment the admin should not have any interaction with any system's screen when performing any kind of configuration change. If it can't be applied in a GPO without any interaction it has no business being in a datacenter.


1) No true Scotsman fallacy at work.

There are situations where you will interact with the desktop, if only for debugging. Saying anything else is hopelessly naive. For example: how do you know if your program didn't start due to missing DLL dependencies? There is no automated way: you must check the desktop because Windows itself only shows a popup.

2) What displays on the screen is absolutely material to the functioning of the operating system.

The Windows shell (UI) is intrinsically intertwined with the NT kernel. There have been attempts to create headless systems with it (Windows Core etc.), but in those circumstances, if there is a popup, that UI prompt can crash the process because the dependencies needed to show the pop-up aren't there.

If you're in a situation where you're running Windows Core, and a program crashes if auto-updates are not enabled... well, you're more likely than not to enable updates to avoid the crash; after all, what's the harm.

Likewise, you will be aware that when a program has a UI (Windows console), the execution speed of the process is linked to the draw rate of the screen, so having a faster draw rate or fewer things on screen can actually affect performance.

Those who write Linux programs are aware that this is also true for Linux (writes to STDOUT are blocking); however, you can't put I/O on another thread in the same way on Windows.

Anyway, all this to say: it's clear you've never worked in a serious Windows environment. I've deployed many thousands of bare-metal Windows machines across the world, and of course it was automated, from PXE/BIOS to application serving on the internet, the whole 9 yards, but believing that the UI has no effect on administration is just absurd.


> So we need to hold regulatory bodies accountable as well...

My bank, my insurer, my payment card processor, my accounting auditor and probably others may all insist I have anti-virus and insist that it is up to date. That is why we have to have these systems. However, I used to prefer systems that allowed me to control the update cycle and push it to smaller groups.


hey, but at least (a) we have a process (b) we documented it and (c) we review it regularly!

what can go wrong?!


There should be a new term called compliance hell.


The term already exists, but the "hell" is mostly silent (even in writing).


Is there a process for that??


> So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.

Replacing common-law liability with prescriptive regulation is one of the main drivers of this problem today. Instead of holding people accountable for the actual consequences of their decisions, we increasingly attempt to preempt their decisions, which is the very thing that incentivizes cargo-cult "checkbox compliance".

It motivates people who otherwise have skin in the game and immediate situational awareness to outsource their responsibility to systems of generalized rules, which by definition are incapable of dealing effectively with outliers.


No doubt there will be another piece of software mandated to check up on the compliance software. When that causes a global IT outage, software that checks up on the software that checks up on the compliance software will be mandated.


Consolidation / optimization of labor.

When Crowdstrike messes up and BSODs thousands of machines, they have a dedicated team of engineers working the problem and can deliver a solution.

When your company gets owned because you didn't check a compliance checkbox, it's on you to fix it (and you may not even currently have the talent to do so).

We see similar risk tradeoffs in cloud computing in general; yes, hosting your stuff on AWS leaves you vulnerable to AWS outages, but it's not like outages don't happen if you run your own iron. You're just going to have to dispatch someone on a three-hour drive to the datacenter to fix it when they do.


> they have a dedicated team of engineers working the problem and can deliver a solution.

No. Can merely facilitate the customer's on-site admin to deliver a solution.


That for sure is a great point, it's actually something Dorota brought up in her talk this year https://vimeo.com/940487124


CrowdStrike has various auto update policies, including not to automatically update to the latest version, but to the latest version -1 or even -2. Customers with those two policies are also impacted.


> Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.

I've been in one of those audit meetings, defending decisions made and defending things based on the records we keep, and I understand this, because it is both a deeply unpleasant and expensive affair to pull people off current projects and place them before auditors for several hours to debate what compliance actually means.


It's even worse. The consultants who run the audits (usually recent business school grads) work with other consultants who shill the third-party software and implementation work.


"The metric becomes the measure", i.e. Goodhart's Law.

https://en.wikipedia.org/wiki/Goodhart%27s_law


So true! It seems like all of these were invented to create another market for B2B SaaS security, audit, monitoring, etc. companies. Nobody cares about actual security or infrastructure anymore. Everything is just buying some subscription from random SaaS companies, not checking their permissions and grant policies, and ticking boxes because... compliance.


CrowdStrike, ZScaler, and the rest of these people surely have lobbyists that ensure their software is compelled by regulators.


It depends on what your position is. Are you there to actually provide security to your org or to tick a box in an audit? If both, which is more important? Because failing an audit has real consequences, while having breaches in security has almost none. Just look at credit score companies.


Regulation or auditors rarely require specific solutions. It's the companies themselves that choose to achieve the goals by applying security like tinctures: "security solutions". The issue is that the tinctures are an approved remedy.


Zscaler is such insane garbage. Legitimately one of the worst pieces of software I have ever used. If your organization is structurally competent, it will never use Zscaler and will just use wireguard or something.


It's VERY easy to blame CrowdStrike and companies like them, as they are the ones LOBBYING for those checkboxes. Both Zscaler and CrowdStrike spent 500K last year lobbying.


> the risk of failing the compliance checkbox it satisfies is paramount.

I'm curious as to what compliance requirement needs to be satisfied to necessitate such a hardcore measure?


There's a reasonable number of circumstances where cybersecurity standards get imposed on organisations: insurance, from a customer, or from the government (especially if they are a customer). These standards are usually fairly reasonably written, but they are also necessarily vague and say stuff like "have a risk assessment" and "take industry-standard precautions". This vagueness can create a kind of escalation ratchet: when people tasked with (or responsible for) compliance are risk-averse and/or lazy, they will essentially just try to find as many precautions as they can and throw them all in blanket-style, because it's the easiest and safest way to say that you're in compliance. This is especially true when you can more or less just buy one or two products which promise to basically tick every possible box. And if something else pops up as a suggestion, they'll throw that in too. Which then becomes the new 'industry standard', and it becomes harder to justify not doing it, and so on.


I worked in orgs where customers put a certain security standard in the contract. So if you fail that, you are kind of in breach of contract.


It's easy to blame CrowdStrike because they're the ones to blame here. They lit a billion system32 folders on fire with an untested product and push out fear-mongering, corny marketing material. Turns out you should be afraid.


this is the world that lawyers gave us


When a metric becomes a target ...


Picking up pennies in front of a steamroller...


This is 100% the reality


It's what you get when you let luddites, also known as managers, make the rules.


> All over the place I'm seeing checkbox compliance being prioritized above actual real risks from how the compliance is implemented.

Because if everyone is doing their job and checks their box, they're not gonna get fired. Might be out of a job because the company goes under, but hey, it was no one's fault, they just did their job.


Sounds familiar


Electricity too is a SPOF.


Most mission critical medical systems have a backup generator.


And IT has dual power supply servers, and endpoint UPSs.

Electricity is aggressively made redundant for mission-critical systems.


SMB here. Just spent a nine-hour day fixing this. We had two machines that, after a couple of reboots, just came back up fine.

We were trialing CrowdStrike and about to purchase next week. If their rep doesn't offer us at least half off, we are going with Sentinel One which was half the price of CS already.

The incompetence that allowed this is baffling to me. I assumed with their billions of dollars they'd have tiers of virtual systems to test updates with.

I remember this happening once with Sophos where it gobbled up Windows system files. If you had it set to Delete instead of Quarantine, you were toast.


> We were trialing CrowdStrike and about to purchase next week. If their rep doesn't offer us at least half off, we are going with Sentinel One which was half the price of CS already.

You’re still considering them after all this?


It is baffling to me that you are still considering them.


Crowdstrike marketing slogan on their website: "A radical new approach proven to stop breaches". I'll give them that: Putting all Windows computers within a company into an endless BSOD loop is a very radical approach to stop breaches. :)


"We breach your systems to hackers can't!"


The Windows ecosystem typically deployed in corporate PCs or workstations is often insecure, slow, and poorly implemented, resulting in ongoing issues visible to everyone. Examples include problems with malware, ransomware, and Windows botnets.

In corporate environments, IT staff struggle to contain these issues using antivirus software, firewalls, and proxies. These security measures often slow down PCs significantly, even on recent multi-core systems that should be responsive.

Microsoft is responsible for providing an operating system that is inherently insecure and vulnerable. They have prioritized user lock-in, dark patterns, and ease of use over security.

Apple has done a much better job with macOS in terms of security and performance.

The corporate world is now divided into two categories: 1. Software-savvy companies that run on Linux or BSD variants, occasionally providing macOS to their employees. These include companies like Google, Amazon, Netflix, and many others. 2. Companies that are not software-focused, as it's not their primary business. These organizations are left with Microsoft's offerings, paying for licenses and dealing with slow and insecure software.

The main advantage of Microsoft's products is the Office suite: Excel, Word and PowerPoint, but even Word is actually mediocre.

EDIT: improved expression and fixed errors.


I think you represent the schism in your own post. Retail is hyper-focused on the name Microsoft and Windows. But the enterprise and technical people are focused on rolling back a bad CrowdStrike update. They will spend hours and even days doing that, asking why they were vulnerable to such an update and what they should have done to avert being vulnerable to a bad update.

And for them it will be a bit of a stretch to say Microsoft should have stopped us deploying CrowdStrike. I’m sure Microsoft would love to do just that and sell its own Microsoft Solution.

Now, if enterprises decided to run only Linux, BSD, or macOS, would they have been invulnerable to a bad CrowdStrike update? https://www.google.com/search?q=crowdstrike+kernel+panic

No, so your entire premise is invalidated by a single Google search.

On the other hand, I do feel Microsoft has life far too easy in so many enterprises, but the fault here lies as much with the competition.


> it will be a bit of a stretch to say Microsoft should have stopped us deploying CrowdStrike

I read GP's post to mean that if you take a step back, Windows' history of (in)security is what has led us to an environment where CrowdStrike is used / needed.


Well, then why would we have Linux and macOS versions of CrowdStrike Falcon Sensor (tm), too?


I can answer this. For the same reason I have run ClamAV on Linux development workstations. Because without it, we cannot attest that we have satisfied all requirements of the contract from the client's security organization.

Also if you are a small business and are required to have cybersecurity liability insurance, the underwriter will require such sensors to be in place or you will get no policy.


If said underwriters don't typically cover things like the current CrowdStrike problem, that seems like a pretty big case of misaligned incentives.


For the same reasons there's antivirus software for Mac and Linux.

People coming from Microsoft systems just expect it to be required, so there's demand for it (demand != need). And in hybrid environments it may remove a weak link: e.g. a Linux mailserver that serves mail to Windows users is best off having virus detection for Windows viruses.


I’m not defending CrowdStrike here. This is a clearly egregious lack of test coverage, but CrowdStrike isn’t “just” antivirus. The Falcon Sensor does very useful things beyond that, like USB device control, firewall configuration, reporting, etc.

If your use case has a lesser need for antimalware you might still deploy CrowdStrike to achieve those ends. Which help to lessen reliance on antimalware as a singular defense (which of course it shouldn’t be).


I know it isn't just antivirus. I was merely drawing a simpler analogy.


It's not just those darn Windows admins. A lot of the certifications customers care about (SOC 2, ISO whatever, FedRAMP) have line items that require it.


I've had to install server antivirus onto my Linux laptop at 4 different companies. Every time it's been a pain in the ass because the only antivirus solutions I've found for Linux assume that "this must be a file server used by Windows clients". None of them are actually useful, so I've installed them and disabled them. There, box-checking exercise done.


> For the same reasons there's antivirus software for Mac and Linux.

Because they can also get malware or could use the extra control CS provides, and the "I'm not a significant target so I'm safe" is not really a solid defense? Bad quality protection (as exemplified by the present CS issues) isn't a justification for no protection at all.

Would you ignore the principle of least privilege (least user access) and walk around with all the keys to the kingdom just because you're savvier than most at detecting an attack and anyway you're only one person, what are the chances you're targeted? You're the Linux/MacOS of the user world, and "everyone knows those principles are only for the Windows equivalent of users".


I'm not arguing that Linux or Mac need no protection.

There are serious threats to any Linux machine. And if you include Android, there are probably far more Linux machines out there. Hell, including her navigation, router, NAS, TV, and car, my 70+ yo mom runs at least 5 Linux machines at her home. It's a significant target. And Mac is quite obviously a neat target, if only because the demographic usually has higher income (hardly any Bangladeshi sweatshop worker will put down the cash to buy a MacBook or iPhone, but they might just own an Android or Windows laptop).

I'm arguing that viruses aren't a threat, generally. Partly due to the architecture, partly due to their usage.


Neither Linux nor OSX are immune to viruses, though malware is more commonly written to target Windows given its position in the market. Both iOS and Android are frequent malware targets despite neither being related to Windows, and consequently, both have antivirus capabilities integrated deeply into both the OS and the app delivery ecosystem.

Any OS deployed on a user device needs some form of malware protection unless the device is blocked from doing anything interesting. You can generally forgo anti-malware on servers that are doing one thing that requires a smaller set of permissions (e.g., serving a website), but that's not because of the OS they are running.


Wut?

You can’t run ClamAV on iPhone, can you?


No, ClamAV doesn't have an iOS version. There are plenty of iOS-specific AV programs available if you need one, though.


I just looked, and your claim is very misleading.

Sure, “AVG Mobile Security” is available, but nobody needs it, and it isn’t anything like antivirus software on a computer. It provides... a photo vault, a VPN, and “identity protection.”

To tell people that they are vulnerable without something like this on their iPhone is ludicrous.

Nobody needs antivirus software or malware protection like this on their iPhone, unless they like just giving money away.


If you'll scroll up to the comment you originally replied to, you'll see that I said Android and iOS have AV capabilities built into the OS and app delivery ecosystem. That's more than enough for most users: mobile OSes have something much closer to a capability-based security paradigm than desktop OSes, and both Apple and Google are pretty quick to nerf app behavior that subverts user expectations via system updates (unless it was done by the platform to support ad sales).

Your mobile device is a Turing machine, and as such it is vulnerable to malware. However, the built-in protections are probably sufficient unless you have a specific reason to believe they are not.

The only AV software for mobile devices that I have seen used is bundled with corporate "endpoint management" features like a VPN, patch and policy management, and remote wipe support. It's for enterprise customers that provision phones for their employees.


You said…

> You can generally forgo anti-malware on servers that are doing one thing that requires a smaller set of permissions (e.g., serving a website), but that's not because of the OS they are running.

It seems to me like you’re trying to have it both ways.

It really is because of the OS that one doesn’t need to run anti-malware software on those servers and also on the iPhone, which you seem to have admitted.


It seems like we're both trying to make a distinction that the other person thinks is unimportant. But if the crucial marker for you is whether anti-malware protection is built into the OS, then I've got great news for you: Windows has built-in AV, too, and it's more than enough for most users.

The distinction I was trying to make is that the anti-malware strategy used by servers (restrict what the user can do, use formal change control processes, monitor performance trends and compare resource utilization against a baseline and expectations inferred from incoming work metrics) is different from the anti malware strategy used by "endpoints" (scanning binaries and running processes for suspicious patterns).


I'd say very special people need malware protection like this on their iPhone.

Remember NSO Group? Or the campaign Kaspersky exposed last year? Apple successfully made malware on iOS very rare unless you are targeted. But right now, it is impossible for these targeted people to get any kind of protection. Even forensics after being compromised is extremely difficult thanks to Apple's walled garden approach.


It depends on what you mean by “like this.”

The usefulness of a theoretical app that might be able to stop high-power exploits isn’t being debated. The claim I’m objecting to is that everybody should be running (available) antivirus software on their phone.

But if you mean that these highly targeted people would have been helped by running “AVG Mobile Security” or one of the other available so-called “antivirus” apps, then I’ve got an enterprise security contract to sell you. :)


> The claim I’m objecting to is that everybody should be running (available) antivirus software on their phone.

You're objecting to the (much more specific) claim that everybody should be running 3P antivirus software on their phone. Nobody made this claim. You are already running AV software on your phone, and whatever is built into the platform is more than sufficient for most users.


It's not just fake demand; it's required in most instances (example: STIG requirements).


fake requirements?


I spent some time on the STIG website out of curiosity. There seem to be down-to-earth practical requirements, but only for Windows, cf. https://public.cyber.mil/stigs/gpo/

Why it justifies running antivirus on Linux is beyond my understanding.

Weak, impotent, speechless IT personnel that cannot stand up to incompetence?


Except at least on the Mac, your AV software is unlikely to be part of the boot process, and doesn't run in the kernel.

Shit like today is precisely why Apple kicked Mac developers out of kernel-space for the most part.


Windows IT admins who don’t use or understand Linux/Mac. Who also buy at the enterprise level. And who probably have to install (perhaps unnecessary) endpoint protection to satisfy compliance checklists.

The amount of Windows-centric IT that gets pushed to Linux/Mac is crazy. I've been in meetings where using Windows-based file storage was discussed as a possibility for an HPC compute cluster (Linux). And they were being serious. This was in theory so that central IT could manage backups.


To make money? Just because CrowdStrike is available for Linux and Mac doesn't mean that a) people buy and use it in substantial numbers b) people need to buy it. It would be interesting to hear from someone using CrowdStrike in a Linux/Mac environment.


Had it on my Mac a few years back and my long-lasting memory of it was how it:

a) slowed down the performance of my machine to a crawl in a NodeJS project

b) had my laptop fans spinning at full blast 24/7, even waking up the laptop overnight to do it

It was purely for compliance, but I also got the impression that it was a bloated enterprise solution for the problem.


We run Crowdstrike on Linux and Macs so that we can tick some compliance checkbox.

Fun fact: they've recommended we don't install the latest kernel updates since they usually lag a bit with support. We're running Ubuntu LTS, not some bleeding-edge Arch. It now supports using eBPF so it's somewhat better.


CS installed on my managed Mac. Generally no problems except randomly network stops working. Fixed by waiting.


The policies are written by folks who have no understanding of different operating environments. The requirement "All servers and workstations must have EDR software installed" leads to top-level execs doing a deal with Crowdstrike because they "support" Linux, Mac, and Windows. So then every host must have their malware installed to check the box. Doesn't matter if it's useful or not.


Indeed and insurance too. For our business, our professional errors and omissions coverage for years had the ability to cover cyber issues. No more. That requires cybersecurity insurance and the underwriters will not entertain underwriting a policy unless EDR is in place. They don't care if you are running OpenBSD and are an expert in cybersecurity who testifies in court cases or none of that. EDR from our list or no insurance.


Because of Security Theater.


So that work can't progress too fast?


For macOS? Because without it you don't have certain monitoring and compliance capabilities that are standard built-ins in Windows; plus, for Windows/Linux/Mac the monitoring capabilities are all useful and help detect unwanted operation.


Because it will look very bad if you answer, "No, our company has no Anti-virus because we are a macOS shop" on a security questionnaire


> I read GP's post to mean that if you take a step back, Windows' history of (in)security is what has led us to an environment where CrowdStrike is used / needed.

Windows does have a history of insecurity, but it is no different from any other software in this regard. The environment would be the same in the absence of Windows.

Attacks are developed for Windows because attacks against Windows are more valuable -- they have a large number of potential targets -- not because they're easier to develop.


In the case of a bad Linux kernel update I would just reboot and pick the previous kernel from the boot menu. By default most Linux distributions keep the last 3. I'm not an IPMI remote management expert but it may be possible to script this.

All my machines at home run Linux except for my work laptop. It is stuck in this infinite blue screen reboot loop. Because we use Bitlocker I can't even get it into safe mode or whatever to delete the bad file. I think IT will have to manually go around to literally 8,000 work laptops and fix them individually.
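On the "it may be possible to script this" point from the first paragraph: on a RHEL/Fedora-style distro that ships grubby, the fallback can indeed be scripted remotely (over SSH or an IPMI serial console). A rough sketch under those assumptions; entry ordering and tooling vary by distro:

    # Rough sketch, assuming a RHEL/Fedora-style system with grubby installed;
    # run as root. Picks the second boot entry's kernel and makes it the default.
    import subprocess

    def installed_kernels() -> list:
        # grubby prints one 'kernel="/boot/vmlinuz-..."' line per boot entry,
        # typically newest first.
        out = subprocess.run(["grubby", "--info=ALL"],
                             capture_output=True, text=True, check=True).stdout
        return [line.split("=", 1)[1].strip('"')
                for line in out.splitlines() if line.startswith("kernel=")]

    def boot_previous_kernel() -> None:
        kernels = installed_kernels()
        if len(kernels) < 2:
            raise RuntimeError("no older kernel available to fall back to")
        subprocess.run(["grubby", f"--set-default={kernels[1]}"], check=True)
        subprocess.run(["systemctl", "reboot"], check=True)

    if __name__ == "__main__":
        boot_previous_kernel()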


You would "just pick the previous kernel from the boot menu". That's funny, cause in this case you could "just delete the file causing the issue." Anything can sound easy and simple if you state it that way.

How do you access the boot menu for a server running in the cloud, which you normally just SSH into (RDP in Windows' case)?

About your last paragraph: we have just started sending out the bitlocker keys to everyone so it can be done by them too. Surely not best practice, but it beats everyone having to line up at the helpdesk.


> You would "just pick the previous kernel from the boot menu". That's funny, cause in this case you could "just delete the file causing the issue." Anything can sound easy and simple if you state it that way.

One small difference, is that choosing the kernel from the boot menu is done before unlocking the encrypted drive, so no recovery keys would be necessary. And yes, choosing an entry from a menu (which automatically appears when the last boot has failed) is simpler than entering recovery mode and typing a command, even without disk encryption.

A better analogue would be a bad update on a non-kernel package which is critical to the boot sequence, for instance systemd or glibc. Unless it's one of the distributions which snapshot the whole root filesystem before doing a package update.


NixOS boots to a menu of system configuration revisions to choose from, which includes any config change, not just kernel updates.

It's not filesystem snapshots either. It keeps track of input parameters and then "rebuilds" the system to achieve the desired state. It sounds like it would be slow, but you've still got those build outputs cached from the first time, so it's quite snappy.

If you took a bad update, and then boot to a previous revision, the bad update is still in the cache, but it's not pointed to by anything. Admittedly it takes some discipline to maintain that determinism, but it's discipline that pays off.


I hate to be the guy that's like "Nix is the solution," but...Nix is the solution.

Nearly every corporate machine that needs to run Windows should run it as a VM on a NixOS base, unless there is an extremely good reason not to.


Progress is slow, but eventually there will be nix on windows: https://discourse.nixos.org/t/nix-on-windows/1113/117 (fingers crossed).

I don't expect to use it much myself but I love the idea of reducing the OS to an interchangeable part. What matters is the software and its configuration. If windows won't boot for some reason, boot to the exact same environment but on a different OS, and get on with your day.

If something is broken about your environment, fix it in the code that generates that environment--not by booting into safe mode and deleting some file. Tamper with the cause, not with the effect. Cattle, not pets, etc.

This sort of thing is only possible with nix (and maybe a few others) because elsewhere "the exact same environment" is insufficiently defined, there's just not enough information to generate it in an OS-agnostic way.


I can't delete a file if the machine doesn't finish booting. Unless you are suggesting removing the drive and putting it in another machine. That requires a screwdriver and 5 minutes vs. the 10 seconds to reboot and pick a different kernel.

I'm not talking about the cloud. I am talking about the physical machines sitting in front of me specifically my work laptop.

I am an integrated circuit computer chip designer, not a data center IT person. I have seen IPMI on the servers in our office. Do cloud data centers have this available to people?

I have a cheap cloud VM that I pay $3.50 a month. I normally just SSH in but if I want to install a new operating system or SSH is not responding then I log in to the web site and get a management console. I can get a terminal window and login, I can force a reboot, or I can upload an ISO image of another operating system and select that as the boot device for the next reboot and install that.

Does your cloud service not have something like this?

I don't know what our corporate IT dept wants to do. We all work from home on Friday and I can't login to check email so I'll just wait until Monday as there is nothing urgent today anyway.


Booting into safe mode still works to delete the bad file.


The OS drive is encrypted with Bitlocker. I've seen another thread where corporate IT departments were giving out the recovery key to users. I don't need to get anything done today. I'll go into the office on Monday and see what they say.


Idk if this is a serious question, but you just turn on console access in the cloud provider. It’s super easy. Same concept as VMWare. It’s possible that not all cloud providers do that, I suppose.


The biggest cloud providers out there (AWS, Azure, GCP) don't.


> How do you access the boot menu for a server running in the cloud, which you normally just SSH into (RDP in Windows' case)?

They just said IPMI.


MacOS has been phasing out support for third-party kernel extensions and CrowdStrike doesn't use a kernel extension there according to some other posts.


I’m convinced that one reason for this move by Apple was poor quality kernel extensions written by enterprise security companies. I had our enterprise virus/firewall program crash my Mac all the time. I eventually had to switch to a different computer (Linux) for that work.

It wasn't CrowdStrike, but quality kernel-level engineering isn't what I think of when I think of security IT companies.

But, also credit Apple here. They’ve made it possible for these programs to still run and do their jobs without needing to run in kernel mode and be susceptible to crashes.


Not only security software, but really any 3rd-party drivers have caused issues on Windows for years. Building better interfaces that are less likely to crash the kernel was a smart move.


When I started doing driver development on MacOS X in the early 2000s, there were a number of questions on the kernel/driver dev mailing lists for darwin from AV vendors implementing kernel extensions. Most of them were embarrassing questions like "Our kernel extension calls out to our user level application, and sometimes the system deadlocks" that made me resolve to never run 3rd party AV on any system.


Whether you like macOS or not, they definitely are innovating in this space. They (afaik) are the only OS with more granular data access for permissions as well (no unfettered filesystem access by default, for instance)

It's also a shame CrowdStrike doesn't take kernel reliability seriously


I'm sorry, restricting user's ability to change their computer is not innovation. It is paternalism.


The user can change anything they want, but a process launched by your user doesn't inherit every user access by default. You (the user) can give a process full disk access, or just access to your documents, or just access to your contacts, etc. It's maximizing user control, not minimizing it.


I am talking about removing the ability to install kernel extensions.

As for full disk access, go try and remove Photo Booth from your Mac.


The user isn't being restricted. Third-party software is being restricted, by default, and those restrictions can be disabled by the user.


This is a feature not a bug in the enterprise.



Appears to be opt in vs opt out. I'm curious how many orgs use this


Qubes OS has a better model, security by compartmentalization: everything runs in separate VMs with hardware virtualization.


Qubes is great but no desktop GPU supports virtualization.


I could be happy if the GPU was only used for compositing.

If I were doing ML work, maybe I do that work in an ephemeral cloud environment.

I know this doesn’t cover everyone’s use case, but it doesn’t have to.


> Qubes is great but no desktop GPU supports virtualization.

Intel 12th-gen and newer iGPUs do, and AFAIK it can be unlocked on certain Arc cards as well but details are fuzzy.



> They plan to add GPU acceleration in the next release: https://github.com/QubesOS/qubes-issues/issues/8553

You say they're planning to add a feature in the next release, but what you linked to is merely an uncompleted to-do item for creating a UI switch to toggle a feature that hasn't been written yet. I think you win the prize for the most ridiculous exaggeration in this thread. Unless you can link to something that actually comes anywhere close to supporting your claim, you're just recklessly lying.


The linked Issue #8553 is "just" about creating a toggle for GPU acceleration. It's blocked by Issue #8552 [0], which is the actual Issue about the acceleration and originally belonged to Milestone "Release 4.3". It seems to have been removed later, which I didn't expect or know about. The accusation of lying was completely unnecessary in your comment.

Moreover, the Milestone was removed not because they changed their mind about the Release but for other reasons [1].

[0] https://github.com/QubesOS/qubes-issues/issues/8552

[1] https://github.com/QubesOS/qubes-issues/milestone/28

See also: https://forum.qubes-os.org/t/gpu-acceleration-development/24...


Ok, so your [0] shows that the real work has barely been started. The only indication it was ever planned for the next release was a misunderstanding on your part about the meaning of a tag that was applied to the issue for less than one day last fall, and they've stopped tagging issues with milestones to prevent such misunderstandings in the future. It still looks to me like your exaggerated claim was grounded in little more than wishful thinking.


Am I missing something? This is to add a toggle button and the developers say they are blocked because GPU acceleration feature doesn't exist so the button wouldn't be able to do anything.


See my other comment here.


Android and iOS have compartmentalization as well but it's not hardware level (at least as far as I know).


https://www.dropboxforum.com/t5/Apps-and-Installations/New-D...

Is this happening with or without kernel extensions?


Also, it does actually work on macOS despite this. We've had it catch someone getting malware.


The issue with CrowdStrike on Linux did not cause widespread failures, so it's clear that the majority of enterprises that do run their servers on Linux were not affected. They were invulnerable because they do not need CrowdStrike or similar.

Linux (or BSD) servers do not usually require third party kernel modules. Linux desktops might have the odd video driver or similar.


Crowdstrike on Linux is only useful for appeasing corporate auditors, and making Crowdstrike money.


If you ran "only Linux, BSD, or MacOS" on a Microsoft hypervisor, yes. I would never recommend that, and your link exemplifies one reason why.


The difference is that I can easily roll back a Linux system, even a complete update, but not on Windows.


>Apple has done a much better job with macOS in terms of security and performance.

I really like their corporate IT products that are going to push MS out, as you say. I particularly love iActive Directory, iExchange, iSQLserver, iDynamics ERP, iTeams. Apple's office products are the reason no one uses Excel any more. Their integration with their corporate cloud, iAzure, is amazing. I love their server products in particular; it being so easy to spin up an iOS server and have DFS file sharing, DNS etc. is great. MS must be quaking in their shoes.


All of those are products that create huge risks when deployed to mission-critical environments, and this is exactly the problem.

The entire Wintel ecosystem depends on people putting their heads in the sand, repeating "nobody ever got fired for buying Microsoft/CrowdStrike/IBM", and neglecting to run even the most trivial simulation of what happens when the very well understood design flaws of those platforms get triggered because a QA department you have no control over drops the ball.

The problem is that as long as nobody dares recognize the current monoculture around the "market leading providers", this kind of event will remain really likely even if nobody is trying to break anything, and extremely likely once you insert well-funded malicious actors (ranging from bored teenagers to criminal gangs and geopolitical rivals).

The problem is that adding fair-weather products that give the illusion of control through fancy dashboards on the days they work is not really a substitute for proper resilience testing and security hardening, but it is far less disruptive to companies that don't really want to leave the 90ies PC metaphor behind.


How should corporate IT do it?

You have 100,000 devices to manage. How do you handle that efficiently without creating a monoculture?

It's not a "90ies PC metaphor" problem. Swap Chromebooks for PCs and you still have the problem-- how do you handle centralized management of that "fleet"?

Should every employee "bring their own device" leaving corporate IT "hands-off"? There are still monocultures within that world.

Poor quality assurance on the part of software providers is the root cause. The monocultures and having software that treats the symptoms of bad computing metaphors aren't good either, but bad software quality assurance is the reason this happened today.


> Swap Chromebooks for PCs and you still have the problem-- how do you handle centralized management of that "fleet"?

Simplicity (and hence low cost) of fleet management, OS boot-verification, no third-party kernel updates, and A/B partitions for OS updates are among the major selling points of Chromebooks.

It's a big reason they have become so ubiquitous in primary education, where there is such a limited budget that there's no way they could hire a security engineer.


The OP was deriding monoculture. My point was that pushing out only Chromebooks is still perpetuating a monoculture. You're just shifting your risk over to Google instead of Crowdstrike / Microsoft.

re: Chromebooks themselves - The execution is really, really good. The need for legacy software compatibility limits their corporate penetration. I've done enough "power washes" to know that they're not foolproof, though.


I agree that monoculture is an issue that makes events like this more probable, regardless of OS.

That said, a third party being able to add/update a kernel driver ignores (even if out of business necessity) best practices for OS architecture.


ChromeOS is just Linux, isn't it? It's going to suffer from the same problem as NT re: a buggy kernel mode driver tanking the entire OS.

Google gets a pass because their Customers are okay with devices with limited general purpose ability. Google is big enough that the market molds product offerings to the ChromeOS limitations. I think MSFT suffers from trying to please everybody whereas Google is okay with gaining market share by usurping the market norms over a period of years.


> ChromeOS is just Linux, isn't it? It's going to suffer from the same problem as NT re: a buggy kernel mode driver tanking the entire OS.

ChromeOS is not just Linux. It uses the Linux kernel and several subsystems (while eschewing others), but it also has a security and update model that prevents third parties (or even the user themselves) from updating kernel space code and the OS's user space code, so basically any code that ships with the OS.

Therefore, the particular way that the Crowdstrike failure happened can't happen on ChromeOS.

However, Google themselves could push a breaking change to ChromeOS. That, however would be no different than Apple or Microsoft doing the same with their OS's.


> ChromeOS is not just Linux.

I am familiar with Google's walled garden w/ ChromeOS. I didn't mean to give the impression that I was not.

It's "just Linux" in the sense that it has the same Boolean kernel mode/user mode separation that NT has. ChromeOS doesn't take advantage of the other processor protection rings, for example. A bad kernel driver can crash ChromeOS just as easily as NT can be crashed.

Hopefully Google just doesn't push bad kernel drivers. Crowdstrike can't, of course, because of the walled garden. That also means you can't add a kernel driver for useful hardware, either. That limits the usefulness of ChromeOS devices for general purpose tasks.


> That also means you can't add a kernel driver for useful hardware, either. That limits the usefulness of ChromeOS devices for general purpose tasks.

Its target market isn't niche hardware but rather the plethora of use cases that use bog-standard hardware, much like many of the use cases that CS broke a few days ago.


Yes. I said that in a post up-thread. Google is making the market mold itself to their offering, rather than being like Microsoft and molding their offering to the market. Google is content to grow their market share that way.


If crowdsource QA department is all that stands between you and days of no operations then you chose to live with the near certainty that you will have days rather then hours of unplanned company wide downtime.

And if you cannot actually abandon someone like microsoft that consistantly screws up their QA then it's basically dishonest for you to claim that reliability is even a concern for your desktop platform.

And that's essentially what i say when i accuse the modern enterprise it's client device teams of being stuck in the 90ies as those risk were totally acceptable back when the stakes were low and outages only impacted non time critical back office clerical work. but what we saw today was that those high risk cost optimized systems got deployed into roles where the risk/consequence profile is entirely different.

So what you do is that you keep the low impact data entry clerks and spreadsheet wranglers on the windows platform but threat the customer facing workers dealing with time sensitive task something a bit less risky.

It's might not be as easy as just deploying the same old platform designed back in the 90ies to everyone but once you leave the Microsoft ecosystem dual sourcing based on open standards become totally feasible, at costs that might not be prohibitive as everything in the unix like ecosystem including web browsers have multiple independent implementations so you basically just have to standardize of 2-4 rather then one platform which again isnt unfeasible.

It's telling that an Azure region failed this news cycle without anyone noticing, because companies just don't tolerate, for their backends, the kind of risk people take with their Wintel desktops, so most critical services hosted in Microsoft's Iowa datacenter had a second site on standby.


> And if you cannot actually abandon someone like Microsoft, who consistently screws up their QA

The last outage I can remember due to an MS update was 7 or 8 years ago. Desktops got stuck on 'update 100% complete'. After a couple of minutes I pressed Ctrl+Alt+Del and it cleared. Before that... I don't remember. Btw, MS provides excellent tools to manage updates, and you can apply them on a rolling basis.


> If a crowdsourced QA department is all that stands between you and days of no operations ...

For companies of a certain large size, I guess. For all but the largest companies, though, there's no choice but to outsource software risks to software manufacturers. The idea that every company is going to shoulder the burden of maintaining their own software is ridiculous. Companies use off-the-shelf software because it makes good financial sense.

> And if you cannot actually abandon someone like Microsoft, who consistently screws up their QA, then it's basically dishonest for you to claim that reliability is even a concern for your desktop platform.

When a company has significant software assets tied to a Microsoft platform there's no alternative. A company is going to use the software that best-suits their needs. Platform is a consideration, however I've never seen it be the dominant consideration.

Today's issue isn't a Microsoft problem. The blame rests squarely on Crowdstrike and their inability to do QA. The culture of allowing "security software" to automatically update is bad, but Crowdstrike threw the lit match into that tinderbox by pushing out this update globally.

As another comment points out, Microsoft has good tools for rolling update releases for corporate environments. They're not perfect but they're not terrible either.

> It might not be as easy as just deploying the same old platform ...

When a company doesn't control their software platform they don't have this choice. Off-the-shelf software is going to dictate this.

In some fantasy world where every application is web-based and legacy code is all gone maybe that's a possibility. I have yet to work in that environment. Companies aren't maintaining the "wintel desktop" because they want to.


Blaming CrowdStrike's QA might feel good, but the problem is that no company in the history of the world has been good enough at QA for it not to be reckless to allow day-one patching of critical systems, or for that matter to allow single-vendor, single-design critical systems in the first place. And yet the cybersecurity guidelines required to allow the pretense that Windows can be used securely all but demand that companies take that risk.

It's also fundamentally a problem of denial: everyone knows there will not be a good solution to any issue around security and stability that does not require the assets tied up inside fragile, monopoly-operated ecosystems to eventually be either extracted or written off, but nobody wants to blaze new trails.

Claiming powerlessness is just lazy. Yes, it might take a decade to get out from under the yoke of an abusive vendor, we saw this with IBM, but as IBM is now a footnote in the history of computing it's pretty clear that it can be done once people start realizing there is a systematic problem and not just a series of one-off mistakes.

And we know how to design reliable systems, it's just that doing so is completely incompatible with allowing any of America's Big IT Vendors to remain big and profitable, and that's scary to every institution involved in the current market.


To be fair, IBM products back in the day when that saying made sense never had these kinds of problems. It's straight up insulting to compare them to somebody like Crowdstrike.

Wintel won by being cheaper and shittier and getting a critical mass of fly by night OEMs and software vendors on board.

IBM was more analogous to the way Apple handles things. Heavy vertical integration and premium price point with a select few software and hardware vendors working very closely with IBM when software and hardware analogous to Crowdstrike in terms of access was created.


> I really like their corporate IT products that are going to push MS out as you say. I particularly love iActive Directory, iExchange, iSQLserver, iDynamics ERP, iTeams.

You’re being sarcastic, but do you like those MS products, specifically Teams?

I genuinely believe that any business that doesn't make Teams is doing the Lord's work.


I'm stuck with them on my company Macbook and will definitely say, they suck.

In the 5 years I've been here, Outlook has never addressed this bug (not even sure they consider it a bug): Get an invitation to an event. See it on calendar view. Respond to it on calendar view. Go to inbox. Unread invitation is sitting there in your inbox requesting a response.

I don't even need to talk about why Teams is trash. Terrible design is in Teams's DNA.


Would you like to try new Teams?

It’s the same, but you get to start with a nag about it every time you open it.


In enterprise software, you don't need to be good. Just better than your competitors. I distinctly remember doing a happy jig about 6 years ago when we moved from Skype for Business (shudder) to Teams. Did teams drive me nuts? Absolutely. But I was free from the particular hell of SFB.


No, you just need to have a support contract so that you can blame the vendor and/or tell the users that you have raised a ticket with the vendor.


Teams isn't better than the competitors. SFB is MS too. You went from one POS to another.


TBF I have less experience with Dynamics than the others, but yes they are all excellent.

I include Teams in that. I don't think there is another app on the market that does what Teams does. Integrated video conferencing, messaging, and file sharing in one place. All free with the Office package my team already uses, and fully integrated with Azure AD for SSO. I use it all day with zero problems. I honestly can't see why anyone would use anything else.


Most of the software you list either has a Mac version or will interop well with Apple's ecosystem and has for a decade.


The fact Apple is not trying to be a tentacular behemoth syphoning profits in every enterprise environment does not invalidate the fact macOS is secure and performant.

Apple is a tentacular behemoth in the consumer space.


Not a single statement you purport as "fact" has been true across large-scale deployments in my experience. Especially the first part, which tells me you have not experienced working with them as a supplier. I think you mean "in my opinion or experience"; please don't present wishful thinking as factual statements. It derails objectivity and discussion.


Azure status/support page is amazingly amazing. Their current advice regarding virtual machines with the Crowdstrike problem? Keep rebooting!

https://azure.status.microsoft/en-us/status


As ridiculous as it sounds, this does work on a subset of the machines affected based on my experience of the last few hours. With other machines you can seemingly reboot endlessly with no effect.
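For what it's worth, the blunt version of that advice can be scripted. This is only a sketch, assuming a VM managed with the az CLI; the resource group and VM name are placeholders, and there is no guarantee any particular reboot wins the race:

    # reboot repeatedly; the hope is the VM picks up the reverted update
    # before the bad driver gets a chance to crash the boot
    for i in $(seq 1 15); do
      az vm restart --resource-group my-rg --name my-vm
      sleep 60
    done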


I think their "product" is getting people to astroturf on forums like this!

Apple always does just as bad, if not worse, on pwn2own https://www.bleepingcomputer.com/news/apple/apple-fixes-safa... as everyone else. And there are several companies that make a lot of money installing spyware on iPhones.


Dynamics, Teams, Exchange, Active Directory all suck. There are better alternatives but CIOs are stuck in 1996. Apple themselves in their corporate IT environment use none of those things yet somehow are one of the biggest and most profitable companies in the world. Azure is garbage compared to AWS. Using Azure Blob vs S3 is a nightmare. MSSQL is garbage compared to PostgreSQL. Slack is vastly better than Teams in literally every aspect. I just did a project moving a company from AWS to Azure and it was simply atrocious. Nobody at the user level likes using MS products if they have experience using non-MS products. It’s like Bitbucket — nobody uses that by choice.


You have to admire Apple fanboys' nerve in saying Apple is a better company when it comes to IT in a professional setting.

It appears that whatever their basic and narrow use case is becomes the whole of "corporate IT".

Windows sucks and recently Microsoft has been on a path to make it suck more, but saying Apple is better for this part of the IT universe is.. hilarious.


You know that the parent commenter was joking right?


Yes, hence my comment about what he was responding to/mocking.


I think he was talking about the grandparent, due to its baseless criticism of Microsoft and excessive praise of Apple based on a flawed understanding, or a lack of one.


I think if someone wants to criticize Microsoft after experiencing their buggy products for 20 years straight, that is not “baseless,” although I accept that taking responsibility for literally anything our products do goes against the core values of our profession.


They do have some crappy products, but those crappy products make the world move, because nobody really makes better drop-in replacements. Same as SAP, Canonical, Android, etc.: none of them are fault tolerant, they all have issues and will fail if you fuzz them with enough edge cases. And according to this article, CrowdStrike caused the issue, not Windows, which is what I was pointing at.

Do you think macOS can't fail if you fuck with it long enough? Sometimes you don't even have to; it just fails by itself. My Ubuntu 22.04 LTS at my previous job gave me more issues than Windows ever did. Thanks, Snaps, Wayland and APT. No workstation OS is perfect.

If you want a fault-tolerant OS, you're gonna have to roll your own Linux/BSD build based on your requirements and do your own dev and testing. Which company has money for that? So of course they're gonna pick an off-the-shelf solution that best fits their needs on their budget. How is it Microsoft's fault what their customers choose to do with it? Did they guarantee anywhere that their desktop OS is fault tolerant or should be used in high-availability systems and emergency services, especially with crappy endpoint solutions hooked in at the kernel level?


lol. I'll dunk on Apple as much as I'll dunk on any other OS, but they wouldn't be as praised for security if they had to manage the infrastructure and users that Windows supports.


> I particularly love iActive Directory, iExchange, iSQLserver, iDynamics ERP, iTeams. Apples office products are the reason noone uses Excel any more.

I see your sarcasm backfiring, as most of what you are listing is just Microsoft dog food with no real usefulness. The only good thing in your list is Excel; all the rest is bloatware. Teams is a resource hog that serves no useful purpose. Skype was perfectly fine for sending messages or having a video call.

I admit I don't have experience as an IT administrator, but things like managing email, accounts, databases, and remote computers can be done with well-established tools from the Linux/BSD world.


> I don’t have experience as an IT Admin

Wild that you’d write this comment with such a confident voice then.

I worked at a company whose IT team managed both Windows and Mac computers, and apparently MS's Active Directory is leagues ahead of Apple's offering. Which makes sense. MS is selling Windows to administrators, not to users.


I'm a die hard FOSS guy, but as someone who has done LDAP work with FreeIPA and OpenLDAP -- AD does a better job.

Admittedly, it's mostly a better job at integrating with Microsoft-powered systems, so it should damn well do a better job, but it's a core business offering and has polish on it in ways that many FOSS offerings don't.

disclaimer: haven't done FreeIPA and LDAP work in the last ~3 years, maybe they got better.


I would disagree. I work in healthcare and we’ve always used SQL Server. While I wouldn’t pick it, it’s been reliable and integrates with auth.

No one “loves” Teams, but honestly it serves its purpose for us at no cost.

No one loves OneDrive but it works.

I think people underestimate how much work it would take to integrate services, train people, and meet compliance requirements when using a handful of the best in class products instead of MS Suite.


People use Teams and OneDrive because it’s “Free” when you use Office. IMO, that’s a bit of an anti-trust problem. Both have good competitors (arguably better competitors) that are getting squeezed because of the monopoly pricing with Office.

But with SQL Server, on the other hand, I think you are right. It is a good piece of software. But it also has high quality competition from multiple vendors. Some of it enterprise (Oracle, DB2), some of it FOSS (Postgres, MySQL). Because of this, it has to be better quality to survive… they couldn’t bundle it to get market share, it actually had to compete.


Word, no one uses Teams because it's great. The only reason it's used is because it's bundled with $M365.


People use Teams because it's well integrated into Office, 365, Entra and other MS products, they would (and recently do) pay for it. It has functionalities that no other alternative has, e.g. it can act as a full call centre solution through a SIP gateway.


"Well integrated" is honestly a stretch, but it is fair to say it's integrated with no extra setup.


How to manage Slack access control via Azure AD groups? Even the most basic integrations are missing in other options...


> No one “loves” Teams, but honestly it serves its purpose for us at no cost.

Of course there's a cost, its just hidden and you are forced to pay it. Microsoft used its monopoly position to move into a new market.


Yeah, sure. But the marginal cost is zero, whereas a Slack subscription for every person in our org will cost about 1 million dollars a year. And it doesn’t integrate as well with every other piece of functional but mediocre software.

The person approving the $1 million dollar budget item doesn’t really care that Teams isn’t “free” in the sense that there is no free lunch, and while they perhaps have moral qualms of antitrust, that’s outside their purview. We’re locked into Office suite and right now there is no extra charge for Teams.


Which is why the legal process is simply too slow for big tech

Microsoft did a massively illegal thing (again) and got away with it

Time to hold companies responsible for their suppliers.


> Teams is a resource hog that serve no useful purpose. Skype was perfectly fine to send messages or have some video call.

I’m sorry, this is a very silly take. I’m no fan of Teams or Slack but I can’t deny the functionality they offer, which is far above and beyond what Skype does.

> I admit I don’t have experience as an IT administrator

Well, quite.


Time was, NeXT was a hard sell into corporations because it required so little administration, and what administration there was was so easily done that IT staffs were hugely cut back after implementing them.

I'd be glad to see Apple bring those tools back.


Looks fondly over at the old black pizza box


Had to move my Cube this past week-end, and it made me incredibly sad.

Using a NeXT Cube w/ Wacom ArtZ and an NCR-3125 running Go Corp.'s PenPoint (and rebooting into Windows for Pen Computing when I wanted to run Futurewave Smartsketch) was the high-water mark of my computing experience.

It was elegant, consistent, reliable, and "just worked" in a way no computer has since (and I had such high hopes for Rhapsody and the Mac OS X Public Beta).


> I don't have experience as an IT administrator

Then you probably shouldn't speak on software exclusively understood and administered by IT administrators. I've worked in IT for some time, and every single one of those products (aside from Dynamics) has been among the most important parts of our administrative stack.


Even Excel is beginning to be regarded as a dangerous piece of software that gives the illusion of power while silently bankrupting departments who depend on the idea that large spreadsheets are an accurate and reliable way to analyze large/complex datasets.

The '90s are over, but for some reason the average enterprise department has a problem internalizing the fact that the demands today are different than they were 25 years ago.


Meanwhile, while the HN bubble imagines people doing big-data jobs on Excel, in the real world tens or hundreds of millions of people are perfectly satisfied doing small-data jobs in Excel.


The problem is that without tools and processes to systematically validate those results, people might be perfectly happy with completely inaccurate results.

I know I have had to correct one in three Excel sheets I have ever gone over, using pen and paper to validate the results, but I am a paranoid sod who actually does this kind of exercise on a regular basis.

Almost all of the disciplines known to rely on Excel have a serious issue with repeatability of results, either because nobody ever attempts it, or because it's a messy field without a well-defined methodology.


I work in finance. We have double entry accounting and literal checks and balances to validate our results. It is not a messy field, and has a well defined methodology. We have been the biggest spreadsheet users at many of the companies I have worked with.


"I admit I don't have experience as an IT administator"

Then just hit the back button.


SQL Server ran and runs a lot of big companies (it ran MySpace!); however, everything else in your list is hot trash and should be yeeted into the sun.


StackOverflow runs on SQL Server.


Yeah, but Microsoft's been trying to convince them to move to Azure's stuff for years, so who knows :)


> Apple has done a much better job with macOS in terms of security and performance.

Do not underestimate corporate IT's ability to slow down Macs with endpoint security software.


This has been my experience.

I used to run a C++ shop, writing heavy-duty image processing pipeline software.

It did a lot, and it needed to do it in realtime, so we were constantly busting our asses to profile and optimize the software.

Our IT department insisted that we install 'orrible, 'orrible Java-based sneakware onto all of our machines, including the ones we were profiling.

We ended up having "rogue" machines, that would have gotten us in trouble, if IT found out (and I learned that senior management will always side with IT, regardless of whether or not that makes sense. It resulted in the IT department acting like that little sneak that sticks his tongue out at you, while hiding behind Sister Mary Elephant's habit).

But, to give them credit, they did have a tough job, and the risks were very real. Many baddies would have been thrilled to get their claws on our software.


Air-gapped systems: keeping you safe from IT (and incidentally hackers on government payrolls)



Had a problem with a "slow network" from a Mac to a NAS drive; it was capping out at about 800 Mbit a second, despite having a 10G link.

As I looked through it I killed Sophos. Suddenly speeds shot up above 7 Gbit. A few seconds later they dropped back down; Sophos had returned.

A "while (true) pkill sophos" later and the malware was sedated.

Having proved it wasn't a network problem I left it with the engineer to determine the best long term solution.


Mosyle is doing their best to make Macs unusable.


Oh?

We were in need of an MDM to help staff (non-techs) with their MacBooks. I haven't noticed any issues, nor have two of my staff who are trialling it. What's been your main gripe?

I'm a dev but also manage the IT team of one sysadmin, and haven't noticed any performance hits. Yet, anyway, but it's only been two weeks.


Installing software is painful. Some of this is perhaps related to how much the IT group has restricted for us; I can't even change my screen saver, and there's weirdness like bizarre pop-ups asking for your password from time to time. It just doesn't belong on a developer machine.


Fair enough.

I haven't had the same experience but that's mostly because I'm a Dev, implementing it for other developers.

It sounds less like a product problem and more like a configuration problem caused by your IT team.


Yeah, idk what they do, but in my company some new MacBook Pros with M3 are taking 15 minutes to log in after typing the user password.


The poor quality of Windows and associated software is not the problem here. The problem is that Microsoft especially, but software vendors generally, encourage users to blindly accept updates which they do not understand or know how to roll back. And by "encourage" I mean that they've removed the "no thanks" and "undo" buttons.

Here on Linux (NixOS), I am prompted at boot time:

> which system config should be used?

If I applied a bad update today, I can just select the config that worked yesterday while I fix it. This is not a power that software vendors want users to have, and thus the users are powerless to fix problems of this sort that the vendors introduce.
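The same rollback is available from a running system too. A minimal sketch, using the standard NixOS tooling:

    # list the system generations NixOS has kept around
    sudo nix-env --list-generations --profile /nix/var/nix/profiles/system

    # switch back to the previous generation
    sudo nixos-rebuild switch --rollback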

It's not faulty software, it's a problematic philosophy of responsibility. Faulty software is the wake-up call.


What makes you think the FAANG companies don't use Windows? I spent four years at Amazon recently, and unless you were a dev, you were more likely to have a Windows PC than a Mac. Saw zero Linux laptops.


It's funny how that works.

Leave FAANG and most internal developers at large corporations are running Windows. It wasn't until I started at a smaller shop that I found people regularly using Linux to do their jobs, usually in a dual-boot or with a virtual Windows install "just in case" but most never touched it.

I'm presently supporting a .NET web app (some of which is "old" .NET Framework), but my work machine runs OpenSUSE Tumbleweed. I can't see that flying at the larger shops I have previously worked at. I'll admit, that might be different -- today -- I haven't worked at a large shop in more than a decade.


Most corporations have no interest in paying the cost of running a multi-OS IT shop nor dealing with the challenges of fleet management with both Linux and Mac that make running those fleets more expensive and challenging.

That's before you factor in that almost everyone in IT is a born and bred in Windows and in almost every case people tend to choose what they know best.


Depends on which FAANG, I guess. Approaching 10 years now at Google, and I've seen Windows laptops used only by a very few sales people. Everyone else is either using Macs or Chromebooks.


Fellow Googler here. I'm the exception that proves the rule. After 7 years of Macbook and Linux devices, I needed Windows for a special project, so I got a "gWindows" device and found it very well supported.

Aside from the specific Windows-only software I needed, I would still just ssh into a Linux workstation, but gWindows can do basically everything my Mac can. I was pleasantly surprised.


What’s the secret sauce in gwindows? Do they add a hidden Russian keyboard or locale to neutralize malware?


At Apple nobody uses Windows.


The entire bootcamp team is an empty chair with a note tacked on that says "brb in 15; lunch"


Is there still a bootcamp team? I thought they abandoned that with the shift to Apple silicon


You can still Bootcamp if you have an Intel mac, but yes, on Apple silicon it's dead


They do develop some Windows software, so I’m sure some do.


The company developing that probably operates under a different name.


> The Windows ecosystem typically deployed in corporate PCs or workstations is often insecure, slow, and poorly implemented

Yes, but that's not because of Windows itself (which is fast and secure out of the box) but because of a decades-old "security product" culture that insists on adding negative-value garbage like Crowdstrike and various anti-virus systems on the critical path, killing performance and harming real security.

It's a hard problem. No matter how good Windows itself gets and no matter how bad these "security products" become, Windows administrators are stuck in the same system of crappy incentives.

Decades of myth and superstition demand they perform rituals and make incantations they know harm system security, but they do them anyway, because fear and tradition.

It's no wonder that they see Linux and macOS as a way out. It's not that they're any better -- but they're different, and the difference gives IT people air cover for escaping from this suffocating "you must add security products" culture.


> which is fast and secure out of the box

Disagree. At least in the context of business networks.

My favorite example is the SMB service, which is enabled by default.

In the Linux world, people preach:

- disabling SSH unless necessary

- use at least public key-based auth

- better both public key and password

- don't allow root login

In Windows, the SMB service:

- is enabled by default

- allows command execution as local admin via PsExec, so it's essentially like SSH except done poorly

- is only password-based

- doesn't even support MFA

- is not even encrypted by default

It's a huge issue why everyone gets encrypted by ransomware.

I always recommend disabling it using the Windows firewall unless it is actually used, and if it is necessary define a whitelist of address ranges, but apparently it is too hard to figure out who needs access to what, and much easier to deploy products like Crowdstrike which admittedly strongly mitigate the issue.

The next thing is that Windows still allows the NTLM authentication protocol by default (now finally about to be deprecated), which is a laughably bad authentication protocol. If you manage to steal the hash of the local admin on one machine, you can simply use it to authenticate to the next machine. Before LAPS gained traction, the local admin account password was the same on all machines in basically every organization. NT hashes are neither salted nor do they have a cost factor.

I could go on, but Microsoft made some very questionable security decisions that still haunt them to this day because of their strong commitment to backwards compatibility.


You don't need Crowdstrike to disable any of these things. You can use regular group policy. I'm not saying Windows can't be hardened. I'm saying these third party kernel hooks add negative value.


I know, I even said you should rather use the tools that the OS is providing, like the firewall.

All I did was challenge the statement that Windows is secure OOB.


Fun fact, these negative value garbage offerings are often “required” by box checking certifications like SOC2. Sure, if you have massive staffing to handle compliance you might be able to argue you’ve achieved the objective without this trash. The rest of us are just shrug and do it.

Some of the “compliance managers as a service” push you in this direction as well.


Why do companies need these "box checking certifications"? I imagine the answer, as usual, is that either they or one of their customers is working with the government which requires this for its contractors. That's usually the answer whenever you find an idiotic practice that companies are mindlessly adopting.


Pretty much. We’re in the healthcare space and most of our customers are large hospital systems. Anything except “SOC2 compliant, no exceptions on report” will take an already long deal cycle (4-18 months) and double or triple it.

If you’re a startup it also means that your core people are now sitting in multiple cycles of IT review with their IT staff filling out spreadsheet after spreadsheet of “Do you encrypt data in transit?”


> Windows itself (which is fast and secure out of the box)

That's a really bold claim. I'd say Windows comes with a lot of unsafe defaults OOB.


> Yes, but that's not because of Windows itself (which is fast and secure out of the box)

I think what you're really saying is that a Windows system is secure until you apply power to the computer.


> > The Windows ecosystem typically deployed in corporate PCs or workstations is often insecure, slow, and poorly implemented

> Yes, but that's not because of Windows itself

Come on. There’s a reason Windows users all want to install crappy security products: they’ve been routinely having their files encrypted and held for ransom for the last decade.


And Linux/BSD generally would not help here. Ransomware is just ordinary file IO and is usually run "legitimately" by phished users rather than via actual code-execution exploits.

I have a similar disdain for security bloatware with questionable value, but one actually effective corporate IT strategy is using one of those tools to operate a whitelist of safe software, with centralized updates


I think having a Linux/BSD might be helpful here in the general case, because the culture is different.

In Windows land it's pretty much expected that you go to random websites, download random executables, ignore the "make changes to your computer?" warnings and pretty much give the exe full permission to do anything. It's very much been the standard software install workflow for decades now on Windows.

In the Linux/BSD world, while you can do the above, people generally don't. Generally, they stick to trusted software sources with centralized updates, like your second point. In this case I don't think it's a matter of capability, both Windows and Unix-land is capable of what you're suggesting.

I think phishing is generally much less effective in the Mac/Linux/BSD world because of this.


Until a lucrative contract requires you to install prescribed boutique Windows-only software from a random company you've never heard of, and then it's back to that bad old workflow.


Yeah, because no one on Linux or Mac would clone a git repo they just found out about and blindly run the setup scripts listed in the readme.

And no one would pipe a script downloaded with wget/curl directly into bash.

And nobody would copy a script from a code-formatted block on a page, paste it directly into their terminal and then run it.

I'm not going to go so far as to claim that these behaviors are as common as installing software on Windows, but they are still definitely common, and all could lead to the same kinds of bad things happening.


I would agree this stuff DOES happen, but typically in development environments. And I also think its crappy practice. Nobody should ever pipe a curl into sh. I see it on docs sometimes and yes, it does bother me.

I think though that the culture of robust repositories and package managers is MUCH more prominent on Mac/iOS/Linux/FreeBSD. It's coming to Windows too with the new(er) Windows store stuff, so hopefully people don't become too resistant to that.


A developer is much more likely to be able to fix their computer and/or restore from a backup than a typical user is. A significant problem is cascading failures, where one bozo installing malware either creates a business problem (e.g. allowing someone to steal a bunch of money) or is able to disable a bunch of other computers on the same network. It is not that common for macOS to be implicated in these sorts of issues. I know people have been saying for a long time that it’s theoretically possible but it really doesn’t seem that common in practice.


I'd wager if Linux had the same userbase as Windows, you'd see more ransomware attacks on that platform as well. Nothing about Linux is inherently more secure.


Yeah I don't get where this "Linux is more secure" thing comes from.

Basically any userspace program can read your .aws, .ssh, .kube, etc... The user based security model desktops have is the real issue.

Compare that with Android and iOS for instance. No one needs anti-virus bloatware, just because apps are curated and isolated by default.


> Yeah I don't get where this "Linux is more secure" thing comes from.

It comes from the 1990s and early 2000s. Back then, Windows was a laughingstock from a security point of view (for instance, at one point connecting a newly installed Windows computer to the network was enough for it to be automatically invaded). Both Windows and Linux have become more secure since then.

> Basically any userspace program can read your .aws, .ssh, .kube, etc... The user based security model desktops have is the real issue. Compare that with Android and iOS for instance. No one needs anti-virus bloatware, just because apps are curated and isolated by default.

Things are getting better now, with things like flatpak getting more popular. For instance, the closed-source games running within the Steam flatpak won't have any access to my ~/.aws or ~/.ssh or ~/.kube or etc.
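You can verify and tighten that sandbox yourself. A rough sketch, using the standard Flathub app ID for Steam:

    # show what the sandboxed app is allowed to access
    flatpak info --show-permissions com.valvesoftware.Steam

    # explicitly deny it access to the home directory
    flatpak override --user --nofilesystem=home com.valvesoftware.Steam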


What fraction of ransomware attacks would these security products have prevented exactly? Windows already comes with plenty of monitoring and alerting functionality.


Probably close to none at some point. They may block some things.

But most of Windows falling to this is just that it's what people use. The only platform that is somewhat actually protected against attacks is the iPhone; the Mac can easily be ransomwared, it's just that the market is so small nobody bothers attacking it: no ROI.


Yeah. The mobile ecosystems are what real security design looks like. Everything is sandboxed, brokered, MACed, and fuzzed. We should either make the desktop systems work the same way or generalize the mobile systems into desktops.


The mobile ecosystem is what corporate IT should be. Centralized app store, siloed applications, immutable filesystem (other than the document area for each application), then VMs and special computers for activities like development. However locked down iOS may be, most upgrades happen without a hitch, and there's no need for security software.


Hard to say, but Windows Defender doesn't stop as many as EDRs can. There are actual tests for this, run by independent parties that check exactly this. Defender can be disabled extremely easily; modern EDRs cannot.


> There’s a reason Windows users

Yes, average Windows users are significantly less tech literate, for obvious reasons, and there are way more of them. This creates a very lucrative market.

How is desktop Linux somehow inherently particularly more secure than Windows?


Did you really just say Windows is secure out of the box?


You can just scroll back up and read it again.


Okay, so I did, and he definitely claims Windows is secure out of the box. So again, I ask if he really said that, haha, with a straight face.

SMB 1.0 is enabled, non-admin users have PowerShell access, Defender can be disabled with a single command, the user is an admin by default, and passwords can be reset by booting from external media and swapping a command prompt into C:.

There are so many basic insecurities out of the box in Windows.


Apple on the desktop/laptop, Google in the cloud for email, collaboration, file sharing, office suite. I ran a substantial sized company this way for a decade. Then we did a merger and had to migrate to Microsoft- massive step backwards, quintupling of IT problems and staff.


> Companies that are not software-focused, as it's not their primary business. These organizations are left with Microsoft's offerings

I wonder why that is the case. These companies still have IT departments; someone has to manage these huge fleets of Windows machines. So nothing would prevent them from hiring Linux admins instead of Windows admins. What makes the management of these companies consider Windows to be the default choice?


It's because of two things:

1. Users are more comfortable running Windows and Office because it's Windows they likely used in school and on personal laptops.

2. This is the biggie: Microsoft's enterprise services for managing fleets of workstations are actually really good -- or at least a massive step up from the competition. Linux (and its ilk) is much better for managing fleets of servers, but workstations require a whole different type of tooling. And once you have AD and its ilk running and thus Windows administrators hired, it's often easier to run other services from Windows too, rather than having to spin up another cluster of management services.

Software focused businesses generally start out with engineers running macOS or Linux, so they wouldn't have Windows management services pre-provisioned. And that's why you generally see them utilising stuff like Okta or Google Workspace


Unfortunately, Google did not succeed in getting further into schools around the globe with Chromebooks, which is a pity in my opinion. That helps the Win/Office monopoly persist in organizations and businesses hiring people who have never used any software other than Microsoft's.


One reason being that Microsoft lobbies hard against low-end PCs and notebooks that are not aligned with its interests. [1]

Microsoft has a large, entrenched distribution network and market all over the world. That makes it an uphill battle to create low-end programs for schools, universities, governments, and SMBs.

Hence the phrase "no one was ever fired for buying Microsoft". It's too hard a battle to go against the flow.

[1] https://www.tomshardware.com/software/windows/microsofts-dra...


Inertia, plus integration - AFAIK Exchange and SharePoint don't run on Linux, so if the company buys into that, then it's Windows all the way down.

Still, all this is a red herring. Using Linux instead of Windows on workstations won't change anything, because it's not the OS that's the problem. A typical IT department is locked in a war on three fronts - defending against security threats, pushing back on unreasonable demands from the top, and fighting the company employees who want to do their jobs. Linux may or may not help against external attackers, but the fight against employees (which IT does both to fulfill mandates from the top and to minimize their own workload) requires tools for totalitarian control over computing devices.

Windows actually is better suited for that, because it's designed to constrain and control users. Linux is designed for the smart user to be able to do whatever they want, which includes working around stupid IT policies and corporate malware. So it shouldn't be surprising corporate IT favors Windows workstations too - it puts IT at an advantage over the users, and minimizes IT workload.


>Windows actually is better suited for that, because it's designed to constrain and control users. Linux is designed for the smart user to be able to do whatever they want, which includes working around stupid IT policies and corporate malware.

This just tells me you don't know Linux. Linux can be much more easily hardened and restricted than Windows. It's trivial to make it so that a user can only install whitelisted software from private repos.
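A rough sketch of that idea on a Debian/Ubuntu-style system; the internal mirror URL and the user name "alice" are placeholders, not anything from a real deployment:

    # point APT at a curated internal mirror only
    echo 'deb https://mirror.internal.example/ubuntu jammy main' | sudo tee /etc/apt/sources.list
    sudo rm -f /etc/apt/sources.list.d/*.list

    # drop the user's admin rights so they can't add repos or bypass the whitelist
    sudo deluser alice sudo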


> It's trivial to make it so that a user can only install whitelisted software from private repos.

This is also straightforward on Windows with AD-managed Group Policies.


Excel. There is no other software that can currently fill Excel's role in business. It's the best at what it does, and what it does is usually very important. Unfortunately.


Excel is the driver for small businesses and individual departments. SharePoint is what keeps large businesses committed to Windows.


Excel runs just fine on Macs, though, so that only explains "why not Linux?", not "why Windows?"


The situation might have changed since I last used Excel on Mac, but in 2018, the "Excel" on Mac barely resembled the Excel on Windows. Many obvious and useful features were missing.


No longer. Everything is there. Just switched work machines to Mac from Win


Not everything is there. There are still important limitations: https://spreadsheeto.com/mac-vs-windows/


My guess is that the fact you can buy about two to three cheap Dell desktop machines for the price of one Mac probably factors quite heavily into the equation.


It definitely does not run “just fine”. It’s passable at best.


Gnumeric, Libreoffice, Google Sheets, Zoho...

There are plenty of sufficient replacements for Excel if organizations are willing to work with other tools


If you’re only doing vacation travel planning, sure. But there’s a long tail of advanced functionality used across all kinds of industries (with plugins upon plugins) that are most certainly not even close to being supported by any of the options proposed.


I don't know, but I would guess that Microsoft Office is what retains people; personal anecdotal experience suggests that anything else (Apple's offerings, Google Docs, LibreOffice &c.) is not acceptable to the average user. My suspicion is that Microsoft would be very unhappy to have MS Office running successfully on Linux systems.


> These companies still have IT departments

A lot actually don’t, in any meaningful sense. My partner’s company has a skeleton IT staff with all support requests being sent offshore. An issue with your laptop? A new one gets dispatched from ??? and mailed to you, you mail the old one back, presumably to get wiped and redispatched to the new person that has a problem.


Tooling, infra, knowledge? The only reason people are talking about "issues in Windows" is because people are widely using it.

If Linux had anywhere close to the amount of software that Windows has, it would have experienced the same issues too. After all, it is not just about running a server and tinkering with config files. It is about the ability to manage devices, roll out updates, and so on.


You have to also factor in competition. I think it's a big factor in why corporate IT is generally bad: Microsoft and their partners have no reason to improve on the status quo. If we had viable alternatives, in a market where no entity has more than 20% market share or something like that, the standards would be much higher.


Standards of what? Microsoft cannot force a third-party company to test its own builds before releasing them.


The whole idea of running a backdoor with OS privileges in order to increase system security screams Windows. In Linux, even if Crowdstrike (or similar endpoint management software) is allowed to update itself, it doesn't have to run as a kernel driver. So a buggy update to Crowdstrike would only kill Crowdstrike and nothing else.

And Linux is not even a particularly hardened OS. If we could take some resources from VC smoke and mirrors and dedicate them to securing our critical infrastructure we could have airports and hospitals running on some safety-critical microkernel OS with tailored software.


Office. The entire world runs on Excel, Word and Powerpoint. Unfortunately.


Word and PowerPoint are disposable. Pages and Keynote work just fine. Excel on the Mac is perfectly fine.

But that aside — Excel is a single application. That one app doesn’t determine an entire Corporate IT strategy.


> That one app doesn’t determine an entire Corporate IT strategy

It actually does in some industries, but it is dumb that it does.


You can get that on Mac right?


the comment I am replying to explicitly mentions Linux as an alternative to Windows. In any case, yes, one could use Mac, as I do, but it comes with its own issues, starting from price. I'd happily switch 100% to Linux if I didn't need to work on documents edited with Office. The online version may actually solve this, but it's still buggy as hell.


Word, Excel, Powerpoint and all the other windows software. Plus all the people that know how to use the windows software vs Linux equivalents (if they exist).


Purchasing decisions are made by purchasing managers. Purchasing managers spend their time torturing numbers in spreadsheets, writing reports, and getting free lunches from channel sales reps. Microsoft is just a sales organization with some technical prowess, and their channel reps are very effective.

Technical arguments, logic, and sense do not contribute much to purchasing decisions in the corporate world.


The business world runs on Windows, no way around that unless you only need a simple cash register and inventory software.


If you were ready to ditch corpo-Microsoft, why would you go to corpo-Apple instead of something FOSS like Debian, though?


I'd say something implementing the ideas of NixOS, i.e. immutable versioned systems and declarative system definitions, is poised to replace the current deployment mess, which is extremely fragile.

With NixOS, you can upgrade without fear, as you can always roll back to a previous version of your system. Regular Linux distributions, macOS, and Windows make me very nervous because that is not the case.


> I'd say something implementing the ideas of NixOS, i.e. immutable versioned systems

NixOS isn't immutable, things aren't mounted read only. AFAIK, it can't be setup that way.

> With NixOS, you can upgrade without fear, as you can always roll back to a previous version of your system. Regular Linux distributions, macOS, and Windows make me very nervous because that is not the case.

Because you can't roll back to a previous backup?


The store is immutable in the functional programming sense, as the package manager creates a new directory entry for each hash value.

Backups could be an option, but it is much better to have a system where two computers are guaranteed to be running the exact same software if configuration hashes are the same.

In other OSes, the state of your system could depend on previous actions.


> Regular Linux distributions, macOS, and Windows make me very nervous because that is not the case.

I'm personally only really nervous when updating Linux distributions. On macOS/Windows, updates besides security fixes usually hardly matter or are even noticeable (well, besides the random UX changes...).


Ideally there would be a usable, security-first OS based on something like seL4, with a declarative package system, for slow-to-change, mission-critical appliances.


How do you automatically roll back if you’re in a boot loop?


In NixOS, you have a bootloader to load your OS. Unless you botch your bootloader, you can't paint yourself into an unbootable state. If one system configuration doesn't work, you reboot and choose the prior one from a menu displayed by the bootloader, before the OS begins to load.

This is also true of most regular Linux setups. Except that in those, you can only choose the kernel. Hence, if you have broken other parts of your configuration, your system might not be bootable. So the safety net is much thinner.


I really have no problem imagining an antivirus company convinced the bootloader needs an upgrade =)


> foss

Because you just want stuff to work and couldn't care less about the ideology part?

Also, there's no feature parity (it's not about Windows being "better" than Linux or the other way around; none of that matters): there are no out-of-the-box solutions to replace some of the stuff enterprise IT relies on in Windows etc., which would mean they'd have to hire expensive vendors to recreate/migrate their workflows. The costs of figuring out how to run all of your legacy Windows software, retraining staff, etc. would be very significant. Why spend so much money with no clear benefits?

To be fair, I'm not sure how Apple figures into this. They don't really cater to the enterprise market at all.


> Because you just want stuff to work

I think the current outage undercuts this premise.


Why? Both things seem pretty tangential. Poorly written software exists or can exist on any platform, just like the IT infrastructure wouldn't somehow automagically become robust if they just switched to Linux.


When I took a Linux course in college I had an old laptop that I installed Linux on. However, for some reason my wireless card wouldn't work. I mentioned it to my professor and the next day he told me "It's actually quite simple, you just have to open up the source code for the wireless driver and make a one line change."

Maybe things have gotten better, but I think that's why people use Mac. It's POSIX but without having to jump through arcane hoops.


Things have definitely gotten better.

The problem with the linux desktop was usually that most hardware companies were either not spending any time/effort on non-windows drivers/compatibility or when they did it was a tiny fraction of the effort that went into working around bugs in the windows driver API's.

Today, with the failure of Windows in both the mobile and industrial-control spaces, we see vendors actually giving a damn about the quality of their Linux drivers.

Today the main factor keeping the enterprise market locked on Windows is the fat clients written around the turn of the millennium, and that's as much a problem for Mac adoption as it is for Linux adoption.

The Macs are slick, well-designed devices that speak to a huge segment of the consumer market, so they will eventually find their way into the high-cost niches where no specific dependency on legacy software exists. But they are too expensive and inflexible to replace all of the Wintel systems, so for Microsoft and its partners to have their license to screw over the enterprise sector revoked, Linux (or FreeBSD) will have to play a role too.


Things have definitely gotten better. I remember the painful years. My most recent Ubuntu install on a new laptop was about 3 years ago. As someone who has used Linux as the daily driver for more than a decade (and dual booted as a second OS for another decade) I was pleasantly surprised that everything just worked! I think that was a first

It was an HP from Costco, not something special sold with Linux. My wireless worked, dual monitors just worked, even the fingerprint reader that I never use. I remember sitting there thinking "I didn't have to fight anything?" Hopefully that becomes the norm, maybe it is - I haven't needed a new laptop yet.


Because for some people (certainly not all), their objection is not to a "corporate" OS, but to the specific things Microsoft does that Apple does not.


Because there is software that runs only on certain OSes, and not others.


Fewer and fewer. And there are VMs for that, so you can roll back in cases like this.


Honestly, windows out of the box is pretty secure. I don't want to defend Microsoft here, but adding third party security to Windows hasn't been anything but regulatory compliance at best and cargo culting at worst for over a decade now. If you actually look at core windows exploits compared to market share, they're comparable to Apple. Enterprises insist on adding extra attack surface area in the name of security.

I agree that people who actually know what they're doing are generally running Linux backends, but Microsoft have enterprise sewn up, and this attack is not their fault.


A lot of Active Directory defaults are wildly insecure, even on a newly built domain, and there are a lot of Active Directory admins out there that don't know how to properly delegate permissions.


This is true. You are basically one escalation attack on the CFO away from someone wiring money to hackers and a new remotely embedded admin freely roaming your network.


Windows is leagues ahead of MacOS in terms of granularity of permissions and remote management tools. It's not even close. That's mainly why enterprise IT prefers it to alternatives.


downvoted, because in your response you conflate two issues:

1. The problem with using Microsoft.

2. The lack of institutional knowledge of securing BSD and macOS and running either of those at the scale Microsoft systems are being run at.

The vast majority of corporate computer endpoints are running windows. The vast majority of corporate line-of-business systems are running Windows Server (or alternatively Microsoft 365).

That means a whole lot of people have knowledge on how to administer windows machines and servers. That means the cost of knowledge to adminster those systems is going down as more people know how to do it.

Contrast that with macOS Server administration, endpoint administration, or BSD administration. Far fewer people know how to do that. Far fewer examples of documentation and fixes for the issues administrators hit are on the internet, waiting to help the hapless system administrator who has a problem.

It's not just about better vs. worse from your perspective; it's about the cost of change and the cost of acquiring the knowledge necessary to run these corporate systems at scale -- not to mention the cost of converting any applications running on these Windows machines to run on BSD or MacOS -- both from an endpoint perspective and a corporate IT system perspective.

It's really not even feasible to suggest alternatives to any of the corporations using Microsoft that are impacted by this outage.

If you want to create an alternative to Microsoft's Corporate IT Administration you're gonna need to do a lot more than point to MacOS or BSD being "better".


This is a sample of what Y2K would look like if not for the countermeasures.


Actually really scary to see/read the comments on. Like Die Hard's 'Fire Sail"


*sale


I watched a presentation by someone representing "I Am The Cavalry" at B-Sides, Las Vegas, a few years ago. Very interesting stuff, gave me a whole new perspective on "cyber security".

https://iamthecavalry.org


US Based and got a NANOG alert email just in time. At least half our windows servers down.

I went into our CrowdStrike policies and disabled auto update of the sensor. Hopefully this means it doesn't hit everything. Double-check your policies!!!

Edit:

Crowdstrike has an article out on the manual fix:

https://supportportal.crowdstrike.com/s/article/Tech-Alert-W...


IMO, having a mix of servers would help in mitigating issues like that.

Like running stuff on Linux, Windows, and FreeBSD servers, so that you have OS redundancy should an issue affect one in particular (kernel or app).

Just like you want more than a single server handling your traffic, you'd want two different OS bases for those servers to avoid impacting them both with a single update.


Not using this crap security software would mitigate this issue.


So Crowdstrike protects your computers from cyber attacks. But who is going to protect you from Crowdstrike?


I dunno, Coast Guard?


Underrated joke, thanks


Not Microsoft, that’s for sure


All major US airlines have put in a total ground stop. No flights can take off anymore.


It's morning here in Europe, departure peak time. We're still flying, but...

The problems mean the takeoff and weight & balance data is missing. It needs to be done manually by each crew. Baggage handling is also manual, so that means counting bags and the cabin crew counting people. Then manually calculating performance data before you can take off.

Big delays everywhere.


It's not all of Europe. The airports in Norway are operating as normal. One of the airlines reported booking issues on their website, but I haven't read about any other issues.

Some international flights have been delayed or canceled of course, depending on the route.


> It's morning here in Europe, departure peak time. We're still flying, but...

Looks like some airlines aren't flying, KLM being one of them.


Is it KLM, or just the KLM flights operated by Delta (of which there are a decent chunk)?


Sounds like it's a majority of KLM flights, so seemingly it's themselves.


KLM was flying, but all flights pass through Schiphol Airport (it's their hub) and Schiphol couldn't board flights for a while. Because of that they ran out of gates for arriving flights, so everyone had to cancel flights to avoid compounding delays. As the biggest user of Schiphol that means KLM had to cancel a lot of flights.


> It needs to be done manual by each crew.

I can't imagine believing that this computation, ordinarily automated by these out of service systems, can be performed correctly by crews that probably haven't had to do this in ... years?


At most places there is an iPad app that does the calculation locally. So it's mostly entering a lot of numbers and checking that the results make sense. Usually both crew members do it individually, and then cross check the results.


All airlines in Russia are operating as usual. Sanctions work in unpredictable ways


Electronic payment at grocery stores in the US is affected.


I’m Amazed at that. I had no idea the US had caught up and had electronic payments.


I’m guessing for chip reading? It’s a one-time transaction key, so I don’t think it works async.


All five payment terminals in Germany are down. "Unfortunately, all online governmental services are affected. All two of them" says the chancellor.


What are you talking about? Credit cards have been around since the 1950's


For much of their history they were written down or copied on carbon paper and manually processed by phone later. Electronic processing came in the 70s and wasn't universally used until much later. I saw plenty of credit card imprinters in use well into the 90s when I was growing up.

https://en.wikipedia.org/wiki/Credit_card_imprinter

Card issuers only stopped embossing them recently.


POS down? Payment processor down?

Lotta different grocery stores and processors in USA. WHO?


Not all. Only some. And many are already lifted.


WTF is CrowdStrike and why is it affecting so many people and companies? I've never heard of it before. And apparently it isn't anything relevant to all Windows users as it didn't affect any computer of any person I personally know.


Very popular corporate endpoint protection (malware detection and spyware) that runs telemetry & monitoring agents installed as kernel-mode drivers on Windows. Thus if there is a crash, it crashes the entire kernel (BSOD). And their drivers load at boot.


Guess we will never read the real facts. Truth is RMS was right. Again. Closed source security software is too often malware by design. We need open solutions we can truly trust.


> Closed source security software is too often malware by design.

Can you be more specific? Genuinely curious what you mean here.


Crowdstrike is closed source security software.

What's the difference between malware and what Crowdstrike has done to the world today?

We might as well reclassify Crowdstrike as malware and remove it from all computers to avoid this situation in the future.


The difference is that the intent of malware is to disrupt.

Is gasoline useless just because it explodes when you light a match next to it?

edit to add: OSS is not inherently more secure than closed source.


Gasoline is very useful. We also take a lot of precautions when using it.

We also have things like inspections and financial penalties if you were storing it in an unsafe manner.

It's clear we need to take more precautions before using Crowdstrike. More testing, ability by IT departments to not push updates, ability to rollback updates.


This is why I subscribe to /r/sysadmin despite not being one ... like a canary in the coalmine for stuff like this


On a positive note, I'm in Morocco and getting money from an ATM wasn't working for the whole day, I believe because of this outage. I was at the till in a supermarket and people started asking if they could chip in to pay for some food I bought because I didn't have the cash.

Humanity 1 - Technology 0

Edit: The outage of all ATMs in Morocco was yesterday, not today, so I'm not sure how the two are related.


such stupidity. our $$$ corporate geniuses mandate multiple pieces of so-called security software which are:

- unaccountable black boxes

- of questionable, and un-auditable, quality

- require kernel modules, drivers, LocalSystem, root access, etc.

- update at random times with no testing

- download these updates from where? and immediately trust and run that code at high privilege. using unaccountable-black-box crypto to secure it.

- all have known patterns of bad performance, bugs, and generally poor quality

all in the name of security. let's buy multiple "solutions" and widely deploy them to protect us from one boogeyman, or at least the shiny advertisements say. while punching all sorts of serious other holes in security. why even look for a Windows ZeroDay when we can look for a McAfee or Crowdstrike zero day?


According to Reddit it's hitting Croatia, Philippines, US, Germany, Mexico, India, Japan. SAP servers dropping like flies; that's Defence, Banks, Payroll all affected. Major retail chains like Big W down.


We have outages across whole APAC and most EMEA. Despite being a very big client of CS, we do not have an official resolution yet, an hour into the incident.


SAP isn't linux?


Thundering herd? Idk


Husband is a deputy in California. His department and many others here are down as well (including PDs, jails, ambulance companies, etc.)


This seems like a pretty severe point of failure.


I'm a little late to the party, but I've uploaded my source codes to GitHub in case anyone needs a more convenient tool to deploy/execute on running machines and/or needs something fast on USB flash drives to run around the office:

https://github.com/cookiengineer/fix-crowdstrike-bsod

Releases section contains prebuilt binaries, but of course, I always recommend to check the source and then build it yourself.


I'm sure you mean well, but it's not going to be most programmers or devs who will need to apply a fix for this, it'll be sysadmin/network/SRE folks who'll be doing this, and they're not going to download Go to build this random GitHub repo. Because it affects only Windows systems, it'd be way better to write a bat or powershell script that non-programmers can read and comprehend before they execute anything in production/live systems.
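For illustration only, a minimal PowerShell sketch of that kind of script might look something like the following. It assumes the default install path and the file pattern from CrowdStrike's published workaround, and renames rather than deletes so the file can still be inspected later; it would have to be run from Safe Mode or a recovery shell with admin rights:

    # Sketch only - assumes the default CrowdStrike install path
    $dir = 'C:\Windows\System32\drivers\CrowdStrike'

    Get-ChildItem -Path $dir -Filter 'C-00000291*.sys' |
        ForEach-Object {
            # Rename instead of deleting so the file can be examined later
            Rename-Item -Path $_.FullName -NewName ($_.Name + '.renamed') -Verbose
        }

    Write-Host 'Done. Reboot the machine normally.'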


The details (the particular companies / systems etc) of this global incident don't really matter.

When the entire society and economy are being digitized AND that digitisation is controlled and passes through a handful of choke points, it's an invitation to major disaster.

It is risk management 101, never put all your digital eggs in one (or even a few) baskets.

The love affair with oligopoly, cornered markets and power concentration (which creates abnormal returns for a select few) is priming the rest of us for major disasters.

As a rule of thumb there should be at least ten alternatives in any diversified set of critical infrastructure service providers, all of them instantly replaceable / forced to provide interoperability...

Some truths will hit you in the face again and again until you acknowledge the nature of reality.


Efficiency goes against resilience.

Can you imagine having just one road connecting two big cities to cut costs? No alternative roads, neither big nor small.

That will be really cheap to maintain, and you can charge as much as you want in tolls as there are no alternatives. And you can add ads all over the road, as people have to watch them to move from one city to the other.

And if the road breaks, the government needs to pay for the cost as they cannot allow the cities to go unconnected.

We live in the middle-ages of technology.


We have multiple exits to avoid trapping people in a fire. Redundancy is important when the cost of failure is high.


America has much stricter rules about this than other places, actually.


Sure, and all those rules were created when a building without multiple exits burned down and many people died.

Maybe this crowdstrike outage will be the "burning building" that will bring change :)


Computer systems have gone down before. It's only when people actually die, rather than merely be inconvenienced, that change happens.


911 systems down overnight will certainly be found to have caused deaths.


> Efficiency goes against resilience.

Only when you are focusing on short-term effects. If you think long-term, it is always better not to be out of business.


> Can you imagine having just one road connecting two big cities to cut costs?

Sure! But now imagine those roads as private property.


Or like...one power company in Texas?


Sadly, I'm going to have to update examples in my blog post... https://www.evalapply.org/posts/software-debt/index.html#sof....

Software debt is networked.

I'm writing this in the wake of the aftermath of the disclosure of the log4j zero-day vulnerability. But this is only a recent example of just one kind of networked risk.

With managed services we effectively add one more level to the Inception world of our software organisation. We outsource nice big chunks of supply chain risk management, but we in-source a different risk of depending critically on entities that we do not control and cannot fix if they fail.

Not to mention the fact that change ripples through the parallel yet deeply enmeshed dimensions of cyberspace and meatspace. Code running on hardware is inexorably tied to concepts running in wetware. Of course, at this level of abstraction, the notion applies to any field of human endeavour. Yet, it is so much more true of software. Because software is essentially the thoughts of people being played on repeat.


The oligopoly is not a "love affair", that's how IT works: prime mover advantage, "move fast and break things" (the first of them being interoperability), moats, the brittleness of programming...

The whole startup/unicorns ecosystem exists only because there is the possibility of becoming the dominant player in a field within a few years (or being bought out by one of the big players). This "love affair with oligopoly" is the reason why Ycombinator/HN exists.


I feel like posting this as if it's a given means you are unaware of history. Just because something exists now doesn't mean it is always determined.


It's not an intrinsic property of IT, it's a property of how we've built it and allowed it to be built.


It works this way because we, as a society, decided we wanted it to.


The society didn't decide that we want this. The society didn't decide that we want something else.


It's correct that these are political/economical decisions. But most people in society neither have the knowledge for an informed opinion on such matters, nor a vote.


Centralisation vs decentralisation. Cost-savings vs localisation of disaster.

It's a swinging pendulum of decisions. And developers know that software/hardware provision is a house of cards. The more levels of dependency, the more fragile the system is.

Absolutely horrible when lives will be lost, but determining the way our global systems are engineered and paid for will always be a moving target based on policy and incentive.

My heart goes out to life and death results of this. There are no perfect tech solutions.


Be aware that enterprise firms actively choose and "asses" who their AV suppliers are, on-premises and in the cloud; this is not imposed by msft. Googling it does seem that CrowdStrike has a history of causing kernel panics. Perhaps such interesting things as kernel panics should be part of the compliance checklist.

https://www.google.com/search?q=crowdstrike+kernel+panic


They have a Gartner Magic Quadrant on the landing page.

This is checkbox software, selling to decision makers in enterprises who will never see nor touch the software they are buying.

This kind of software where the user isn’t the customer is always terrible.


> checkbox software

Never heard this name before. Thx.


Maybe if they got off their "asses" and did some more comprehensive assessments we wouldn't be in this mess.


What are they going to do, go back to Tanium?

Every time there was a mysterious performance problem affecting a random subset of machines, it was Tanium. I know how difficult it is for anyone to just get rid of this type of software, but frankly it has been proven over and over that antivirus is just more attack surface, not less.


I think the enterprise software ecosystem currently is not really "all eggs in one basket", but rather you have a whole bunch of baskets, some of them you are not even aware of, some are full of eggs, some have grenades in them instead, some are buckets instead. All baskets are being constantly bombarded with a barrage of eggs from unknown sources, sometimes the eggs explode for inexplicable reasons. Oh yeah and sometimes the baskets themselves disintegrate all at once for no apparent reason.


You seem to have put an awful lot of eggs in the basket of that metaphor.


That's not really the problem here.

The problem is allowing a single vendor, with a reputation of fucking up over and over again, to push code into your production systems at will with no testing on your part.


Right. I thought the "big guys" know better and they have some processes to vet Crowdstrike updates. Maybe even if they don't get its source code, they at least have a separate server that manages the updates, like Microsoft's WSUS.

But no, they are okay with a black box that calls home and they give it kernel access to their machines. What?


We do that. CS literally entirely pushed this over our staging system and straight into production.


Why did they have the technical means to do so?


Because our security guys are fuckwits.

(I am operations management and fought against this product and approach for months)


Monocultures are known to be points of failure, but people keep going down that path because they optimize for efficiency (heck, most modern economics is premised on the market being efficient).

This problem is pervasive and affects everything from food supply (planting genetically identical seeds rather than diversified "heirloom" crops) to businesses across the board buying and gutting their competitors, thus reducing consumer choice.

It's a tough problem akin to a multi-armed bandit: exploit a known strategy or "waste" some effort exploring alternatives in the hopes of better returns. The more efficient you are (exploitation), the higher the likelihood of catastrophic failure in weird edge cases.


this isn't even the first time something like this has happened. it's literally a running joke in programmer circles that AWS East going down will take down half the internet, and yet there's absolutely zero initiative being taken by anyone who makes these sorts of decisions to maybe not have every major service on the internet be put into the same handful of points of failure. nothing will change, no one will learn anything, and this will happen again.


That’s very different though. That’s avoidable. We all can easily have our services running in different data centers around the world. Heck, the non-amateurs out there all have their services running in different Amazon data centers around the world. So you can get that even from a single provider. Hardware redundancy is just that cheap nowadays.

This CS thing, there’s no way around. You use it and they screw up, you get hit. Period. You don’t failover to another data center in Europe or Asia. You just go down.

Hardware, even cloud hardware, is rarely the issue. Probably especially cloud hardware is not an issue because failover is so inexpensive relative to software.

Software is a different issue entirely. How many of us will develop, shadow run, and maintain a parallel service written on a separate OS? My guess is “not many”. That’s the redundancy we’re talking about to avoid something like this. You’d have to be using a different OS and not using CS anywhere in that new software stack. (Though not using CS wouldn’t be much of a problem if the OS is different but I think you see what I mean.)

Amazon, implementing failover for your hardware is a few clicks. But if you want to implement an identical service with different software, you better have a spare dev team somewhere.


AWS East going down will (and has) cause(d) disruption in other regions. Last time it happened (maybe like 18 months ago), you ran into billing and quota issues, if my memory serves.

AWS is, as any company, centralized in a way or another.

Want to be sure you won't be impacted by AWS East going down, even if you run in another region? Well, better be prepared to run (or have a DRP) on another cloud provider then...

The cost of running your workload on two different CSPs is quite high, especially if your teams have been convinced to use AWS-specific technologies. You need to first get your software stack provider-agnostic and then manage the two platforms in sync from a technical and contract perspective, which is not always easy...


You just made the single point of failure your software stack's hardware abstraction layer. There's a bug in it, you're down. Everywhere. Not only that, but if there is CS in either your HAL or your application, you're down. So to get the redundancy the original commenter was talking about, you need to develop 2 different HALs with 2 different applications, all using a minimum of 2 different OS and language stacks.

Why multiply your problems? Use your cloud service provider only to access hardware and leave the rest of that alone. That way any cloud provider will do. Any region on any cloud provider will do. You could even just fall back to your own racks if you want. Point is, you only want the hardware.

Now to get that level of redundancy, you would still have to create 2 different implementations of your application on 2 different software and OS stacks. But the hardware layer is now able to run anywhere. Again, you can even have a self hosted rack in your dispatch stack.

So hardware redundancy is easy to do at the level the original commenter recommends. Software redundancy is incredibly difficult and expensive to do at the level the original commenter was talking about. Your idea to make a hardware/cloud abstraction layer only multiplies the number of software layers you would need multiple implementations of, shadow run and maintain to achieve the hypothetical level of redundancy.


It's very avoidable. Just don't use shit software.


> It is risk management 101, never put all your digital eggs in one (or even a few) baskets.

The fact it's widespread is because so many individual organisations individually chose to use CrowdStrike, not because they all got together and decided to crown CrowdStrike as king, surely?

I agree with you in principle, but the only solution I can think of would be to split up a company with reach like CrowdStrike's. The consequences of having to do that are up for debate.


It's never that simple. There is a strong herd mentality in the business space. Just yesterday I was in a presentation from the risk department and they described the motives for choosing a specific security product as `safe choice, because a lot of other companies use it in our space, so the regulator can't complain`... the whole decision structure boiled down to: `I don't want to do extra work to check the other options, we go with whatever the herd chooses`. It's terrifying to hear this...


The whole point of software like this is a regulatory box-ticking exercise; no-one wants it to actually do anything except satisfy the regulator. Crowdstrike had less overhead and (until now) fewer outages than its competitors, and the regulators were willing to tick the box, so of course people picked them. There are bad cases of people following the herd where there are other solutions with actually better functionality, but this isn't that.


OTOH... I remember an O365 outage in London a few years ago.

You're down? Great, so are your competitors, your customers, and your suppliers. Head to the pub. Actually, you'll probably get more real value there, as your competitors, customers and suppliers are at that same pub. Insurance multinationals have been founded from less.

That didn't affect any OT though, so it was more just proof that 90% of work carried out via O365 adds no real value. Knowing where the planes are probably is important.


> You're down? Great, so are your competitors, your customers, and your suppliers. Head to the pub. Actually, you'll probably get more real value there, as your competitors, customers and suppliers are at that same pub. Insurance multinationals have been founded from less.

I mean yeah, that's the other thing - the Keynesian sound banker aspect. But that's more for software that you're intentionally using for your business processes. I don't think anyone was thinking about Crowdstrike being down in the first place, unless they were worried about an outage in the webpage that lists all the security certifications they have.


You say that as if it's some bad thing, but it's just other words for "use boring tech".

Yes, there could be reasons to choose a lesser-known product, but they better be really good reasons.

Because there are multiple general reasons in the other direction, and incidents like this are actually one of those reasons: they could happen with any product, but now you have a bigger community sharing heads-ups and workarounds, and vendor's incident response might also be better when the whole world is on fire, not only a couple of companies.


It's not just Crowdstrike, it's all up and down the software and hardware supply chain.

It's that so many people are on Azure - which is a de facto monopoly for people using the Microsoft stack - which is a de facto monopoly for people using .Net

And if they're doing that, the clients are on Windows as well, and probably also running Crowdstrike. The AD servers that you need to get around Bitlocker to automatically restore a machine are on Azure, running Windows, running Crowdstrike. The VM image storage? Same. This is basically a "rebuild the world from scratch" exercise to some greater or lesser degree. I hope some of the admins have non-windows machines.


How come AWS sometimes has even better tooling for .NET than Azure, while JetBrains offers better IDE on Linux, macOS and, depending on your taste, Windows than Microsoft? Or, for some reason, the most popular deployment target is just a container that is vendor-agnostic? Surely I must be missing something you don't.


All of that is absolutely true and in no way affects the behavior at hand. Big companies go with whoever sells them the best, not any kind of actual technical evaluation.


Perhaps the organisations have a similar security posture. And that creates a market that will eventually result in a few large providers who have the resources to service larger corporations. You see something similar in VPN software where Fortinet and Palo become the linchpin of security. The deeper question is to wonder at the soundness of the security posture itself.


There's a strong drive for everyone to do things the same way in IT. Some of the same pressure that drives us towards open standards can also drive us towards using a standard vendor.

> I agree with you in principle, but the only solution I can think of would be to split up a company with reach like CrowdStrike's.

Changing corporate structures doesn't necessarily help. It's possible that if CrowdStrike were split up into two smaller companies, all the customers would go to the one with the "better" product and we'd be in a similar position.


Well, if they'd used a different vendor (or nothing) on the DR servers we could have done a failover and gotten on with our day. But alas, nobody saw an app that can download data from the internet and update itself arbitrarily, whenever it wants, without user intervention, as a problem.

So here we are.


They choose because others have. "Look how many others choose us" is a common marketing cry. Perhaps instead being too popular is a reason not to choose? Perhaps not parroting your competitors and industry is a reason not to choose?


When it comes to security products, the size of the customer base matters. More customers means more telemetry. More telemetry means better awareness of IOCs, better training sets to determine what's good and what's bad.


I wonder how many of those orgs were "independently" audited by security firms which made passing the audit without Crowdstrike specifically a hell.

Most of the crap security I've come across in big organisations was driven by checklist audits and compliance audits by a few "security" firms. Either you did it the dumb way or good luck fighting your org and their org to pass the audit.


Setting aside the utter fecklessness if not outright perniciousness of cybersecurity products such as this, I hope this incident (re-)triggers a discussion of our increasing dependence on computing technology in our lives, its utter inescapability, and our ever-growing inability to function without it in modern society.

Not everything needs to be done through a computer, and we are seeing the effects now of organizing our systems such that the only way to interface with them is through a digital device or a smartphone, with no alternative. Such are the consequences of moving everything "into the cloud" and onto digital devices as a result of easy monetary policy and the concomitant digital gold rush where everyone and their dog scrambled to turn everything into a smartphone app.


This past week I purchased a thermostat. There were "high-end" touch-only models, models that were app-assisted but also with analog controls, and then finally old-school analog only. I went with the middle/combo option so that I have analog as a fallback if the pure tech mode fails.

Being prepared can cost more and/or be less flashy (read: I didn't get touch-only), but it buys peace of mind, at least for critical components. I want a thermostat that works, I don't get no satisfaction from any bragging rights. Nod to the Rolling Stones.


I literally dealt with this just a few hours ago. I need a new HVAC system. I wanted the high-end model, but it will only work with their fancy cloud-connected thermostat. You cannot replace it with an off-the-shelf thermostat.

Have home automation? Sorry, you'll have to use the Internet.

I vote with my dollars, so it cost them the higher-margin sale. I also went with the mid-tier system, and grabbed a Z-Wave compatible thermostat along with it. I wonder if I'll miss the nifty variable-speed system?

I really wish everyone would stop trying to trap us into their walled gardens. Apple at least lets people write software for theirs, but the hardware/appliance manufacturers (not to mention the automotive folks) are awful about this.


> I really wish everyone would stop trying to trap us into their walled gardens

*And* adding on a subscription.


> The details (the particular companies / systems etc) of this global incident don't really matter.

It definitely matters. The main issue here is that Crowdstrike was able to push an update to all servers around the world where their agent is installed ... it looks like an enormous botnet ...

We need a detailed post-mortem on what happened here.


The name sounds like something security researchers would name a botnet.


The other aspect of risk management is an acceptance that something going wrong isn't necessarily a reason to change what you are doing. If the plan was tacitly to run something at a 99% uptime, then incidents causing 1% downtime can be ignored.

We are going to get hit by some terrible outage eventually (I hope someone is tracking things like what happens if a big war breaks out and the GPS constellations all go down together). But having 10x providers won't help against the big IT-related threats which are things like grid outages and suchlike having cascading effects into food supplies.


> there should be at least ten alternatives in any diversified set of critical infrastructure service providers, all of them instantly replaceable / forced to provide interoperability...

And does anyone actually know how to implement this, at the scale required (dealing with billions of transactions daily), in a way that would resolve the problems we are seeing?

It very much seems like a data access problem; places can't access/modify data. The physical disks themselves are most likely fine, but the 'interfaces' are having troubles (assuming that the data isn't stored on the devices having the issue).

But in any case, how do you design a system where, if the 'main' interface is troubled, you can switch over instantly and seamlessly, duplicating access controls, permissions, data validation, logic etc.?

There is a reason everything is centralised: it makes no financial sense to duplicate for an extremely unlikely and rare chance. The world is random and these things will happen, but a global outage on this scale is not a daily occurrence.

We'll look back in a few years and think "those were a crazy few hours" and move on...


> The details (the particular companies / systems etc) of this global incident don't really matter.

But they do matter. This is elementary. It's like saying "playing with matches doesn't matter". This is a problem that has happened before, albeit on a smaller scale, and the solution/cure is well known; imho it should have been established two decades ago in every org on the planet.

This is basic COBIT (or BYOFramework) stuff from 10-15-20 years ago.

How can you push a patch/update without testing it first? I get it if you are a tiny company with 1 IT person and 20 local PCs. Stuff like that cripples you for a couple of days. But when you are an org with 10k+ laptops, 500+ servers (half of them MS Win), how can you NOT test each and every update?

If you don't want to have the test/staging environments, then at least wait 1-3-5 days to see what the updates will do to others/the news.

Sorry not sorry guys and gals. I've been auditing systems and procedures for so many years, that this is a basic failure. "One cannot just push an update without testing it first" any update, no matter how small/innocent.


It is also risk management 101 that managing (i.e. avoiding or insuring) risks doesn't come for free.

The cost and benefits of managing risks need to be balanced.

So I am not convinced that there need to be "at least ten alternatives" to be fail safe as society.

Where I agree with you is that these decisions should be deliberate and done after a cost / benefit analysis.


> So I am not convinced that there need to be "at least ten alternatives" to be fail safe as society.

The required number "N for safety" is a good discussion to have. Risk-Return, Cost-Benefit etc are essential considerations. We live in the real world with finite resources and stark choices. But I would argue (without trying to be facetious) that they are risk management 102 type considerations.

Why? Because they must rely on the pretense of knowledge [1]. As digitization keeps expanding to encompass basically everything we do, the system becomes exceedingly complex, nobody has a good picture of all internal or external vulnerabilities and how much they might cascade inside an interconnected system.

Assessing cost versus benefit implies one can reasonably quantify all sides of the equation. In the absence of a demonstrably valid model of the "system" the prudent thing is to favor detail-agnostic rules of thumb. If these rules suggest that reducing unsafe levels of concentration is not economically viable there must be something wrong with the conceptual business model of digitization as it is now pursued.

[1] https://www.nobelprize.org/prizes/economic-sciences/1974/hay...


Or perhaps it's just because companies release features, planes, devices, etc. without any form of QA, aiming just to increase their profits?

In this case, has CS done any QA on this release? Have they tested it for months on all the variations of the devices that they claim to support? It seems not.


Considering CS Falcon causes your performance to drop by about half and does the same to your battery life, I doubt they have any sort of QA that cares about anything but hitting stakeholder goals.


Just a week or so ago, there was an issue with a CS release pegging a whole CPU.


Yet, catastrophic failures like this happen, and people move on. Sure, there is that one guy who spent 10 years building a 10-fold redundancy plan, and his service didn't go down when the whole planet went down, but do people really care?


If he provides emergency services like fire and ambulance people care a lot.


His customers do


Unless his systems are up but critically dependent on other external systems (payment services, bucket storage, auth etc...) that are down. It's becoming increasingly difficult to not have those dependencies.


> when the whole planet went down

If his business is the most critical thing for people's lives, sure. If it's anything else, his customers will have other things to worry about.


While this is a great theory, how would you actually accomplish this with antivirus software?

Multiple machines, each one using different vendor software? What other software needs to be partitioned this way? What about combinations of this software?

I’m just barely awake but don’t know if I’m affected yet. One of my devs is, our client support staff is, and I have no idea how our servers are doing just yet.


While I agree 100% with what you say in principle, stats show that these occurrences are increasingly rare.


> It is risk management 101, never put all your digital eggs in one (or even a few) baskets.

I mean, plenty of businesses only have penguin eggs in their basket, and some sort of penguin problem would cause major problems for them. I believe the last time this happened was with the leap second thing around 2005 or thereabouts.

"Don't put all your eggs in one basket" sounds nice, but it would mean a completely different independent service all through your stack. That's not really realistic, IMHO.

The bigger issue here is: 1) some driver update "just" gets pushed (or how does this work?), and 2) there's no easy way to say "this is broken, restore to the last version". That is even something that could be automatic.


This isn't some global conspiracy, it's just incentives and economies of scale. When it's cheaper to pay a hyperexpert to handle your security, why wouldn't you?

The fact that physical distance is no longer a limit to who you do business with means that you can select the cheapest vendor globally, but then that vendor has an incentive to hyperspecialize (because everyone goes to them for this one thing), which means that even more people go to them.

Avoiding once-in-a-century events just isn't something we're willing to pay the extra cost for, except now we have around twenty places where these once-in-a-century events can happen, which kind of makes them more frequent.

How much stuff do you host on Hetzner instead of AWS?


Easy to state; non-trivial to implement.


Now they know the state of each of the affected companies' systems. How adept their sysops guys are, a bird's-eye view of their security practices. Nice move, and plausibly deniable too :D.

I mean how did this happen at all? Are there no checks in place @ crowdstrike? Like deploying the new update to a selected set of machines, checking whether everything is ok, and then releasing it to the wild incrementally?

Mind boggling.


I suspect the `assertNoBSOD()` test was marked as flakey


Hey hey, Silicon Valley just bought themselves a VP to ensure no regulation.


> As a rule of thumb there should be at least ten alternatives

I think you mean 14. https://xkcd.com/927/

> When the entire society and economy are being digitized AND that digitisation is controlled and passes through a handful of choke points its an invitation to major disaster.

Once again, it's Microsoft, directly, or indirectly, choosing a strategy of eventually getting all worldwide Windows desktops online, and connected via their systems.

Which is why I installed Fedora after Windows 7 and never looked back. 100% local, 100% offline if needed.

My company is looking to a non-Microsoft desktop. We're not affected by this, but it will certainly encourage us to move sooner rather than later.


Society was able to move to mass WFH on a global scale in a single month during Covid, thanks to the highly centralized and efficient cloud infrastructure. That could have easily saved tens of millions of lives (Imagine the Spanish flu with mass air travel, no vaccines, no strain-weakening)

These small 'downages' basically never cause serious issue. Your solutions are just alarmist and extremely costly (though they will provide developer employment...).


> These small 'downages' basically never cause serious issue.

Hospitals, airlines, 911, grocery stores, electric companies, gas companies, all offline. There will be more than a few people dead as an indirect result of this outage, depending on how long it lasts.


> These small 'downages' basically never cause serious issue.

Emergency Departments and 911 were knocked offline. People will indirectly die because of this, just like the last time 911 went down, and just like the last time EDs went down.

It's not alarmist, it's realist.


If CrowdStrike can cause this with a faulty update (allegedly), what do you think could happen to Western infrastructure from a full blown cyberwar? It's a valid risk.

> Society was able to move to mass WFH on a global scale in a single month during Covid

I don't know how much WFH saved lives, seeing as ordered isolation and social distancing was a thing during the Spanish Flu too (you just take the economic hit). But yes it allowed companies to keep maintaining profits. Those that couldn't WFH got paid in most countries anyway (furlough in England, etc).

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2862334/


true, but incentives should be in place to encourage a more diverse array of products. at the moment, with many solutions (especially security), it is a choice between that one popular known product (Okta, CrowdStrike, et al, $$$) and bespoke ($$$$$$$$$$).

If only because we can then move away from one-size-fits-all, while mitigating the short-term impact of events like the above.


I just landed at SeaTac an hour ago and the rideshare/app pickup was absolutely nutso. Like thousands of people standing around waiting for taxis and Ubers. The one person I asked what was going on said that the computer systems at all the regional hotels are down (not sure how that makes more people need cabs). Wonder if it’s from this


Just realized this is posted on the SeaTac website now: “ SEA is experiencing temporary issues with the system that populates flight and baggage information on in terminal screens and the flySEA app/website. Travelers are recommended to check with their airlines for current gate and baggage claim information. Check With Your Airlines”


I remember when my big regional train system had like 95% of its morning trains cancelled overnight due to a big snowstorm in 2014.

Of course, no news on its websites or socials because those people didn’t start until 9am.

I think they finally fixed that.


> just landed at SeaTac an hour ago and the rideshare/app pickup was absolutely nutso. Like thousands of people standing around waiting for taxis and Ubers.

So a normal day at SeaTac?


This is why you don't make changes on a Friday. Lots of weekends absolutely ruined now.


Early weekend here :D


Did Crowdstrike forget the rule, that one does not simply deploy on Friday?

https://www.reddit.com/r/ProgrammerHumor/comments/f79iag/don...


For years now antivirus solutions have had a ridiculous amount of control over the OS. I accidentally installed an adware antivirus the other day that was bundled with some third-party software, and I had to boot to Linux to manage to completely remove the damn thing from Windows. The uninstall option left a process running that couldn't be forcefully killed.

Microsoft needs to take control and forbid anyone and anything from running software with that kind of behavior.


High time to stop using Microsoft Windows/Azure, which is so full of security tech debt that you need all these tools, which themselves brick the computer


If anyone feels like disagreeing about Azure, here's a comment of mine from a few months ago:

A random selection of serious security incidents from Azure:

just from Wiz from the past 2-3 years, and of course they aren't the only ones:

https://www.wiz.io/blog/secret-agent-exposes-azure-customers...

https://www.wiz.io/blog/storm-0558-compromised-microsoft-key...

https://www.wiz.io/blog/azure-active-directory-bing-misconfi...

https://www.wiz.io/blog/omigod-critical-vulnerabilities-in-o...

https://www.wiz.io/blog/chaosdb-explained-azures-cosmos-db-v...

Of course Microsoft AI researchers sucking at security: https://www.wiz.io/blog/38-terabytes-of-private-data-acciden...

Nice overview from Corey Quinn that predates some of those but things were already horrifically bad: https://www.lastweekinaws.com/blog/azures-terrible-security-...

Go and look for similar things for AWS and GCP, and there's nothing on this level (cross-tenant, trivial to exploit).

Oh and there's also this, them selling your usage patterns to partners (hopefully they've stopped): https://twitter.com/QuinnyPig/status/1359769481539506180

Oh and another one where they bungled the response: https://twitter.com/QuinnyPig/status/1536868170815795200

I find it impossible to believe that Azure as a whole organisation takes security seriously. There might be individuals that do, but definitely nobody with decision making power. Half of the above described exploits are trivial and should have never passed any sort of competent review process.


>If anyone feels like disagreeing about Azure

Talking with people in the MSFT camp is like talking with people in a cult. I'm not being melodramatic.

Pointing out these issues is good, but to them, they'll just shrug it off.

And businesses will keep giving them money. Madness.


It's basically "nobody got fired for buying from Microsoft".


CrowdStrike Falcon has a Linux product line for 'cloud security'.


https://www.bbc.co.uk/news/live/cnk4jdwp49et, seems to be quite a wide impact from this, e.g. Sky News in the UK is off air!


So its not all bad then!


Wild that a piece of software so integral to basic function has such bad release discipline. A/B, Blue/Green, Canary, Rolling, etc.

I've worked on 4-person software teams that at least followed a basic user-group rolling release system.
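To make "progressive" concrete, here's a purely illustrative sketch of a ring-based gate (hypothetical PowerShell, made-up names, and nothing to do with how CrowdStrike actually distributes its channel files). Each host hashes its own name into one of N rings and only takes the update once its ring's wave has been opened:

    # Hash the hostname into one of $ringCount rings; the active wave is
    # advanced over hours/days as telemetry from earlier rings stays clean.
    $ringCount  = 10
    $activeWave = 1

    $md5   = [System.Security.Cryptography.MD5]::Create()
    $bytes = $md5.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($env:COMPUTERNAME))
    $ring  = $bytes[0] % $ringCount

    if ($ring -lt $activeWave) {
        Write-Host "Ring $ring is inside wave $activeWave -> apply the update"
    } else {
        Write-Host "Ring $ring not released yet -> skip for now"
    }

Even something this crude means a bad push takes out a tenth of the fleet instead of all of it.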


Integral? I would argue that it is wild that a piece of software so useless for basic and correct function gets so much privilege.


Well I mean integral here in that your PC can't boot and do anything useful if the software breaks.

It's not some solitaire app or saas website.

Completely nuked thousands of companies' ability to operate for a day, and the thing is auto-updating with an apparently global big-bang release push method.


i've never heard of crowdstrike ever but it (co-)runs half of the essential IT infrastructure, worldwide?

(also, great choice of name i must say)


I only know them because their CEO is a relatively good amateur racing driver lol.


seems like he's also a relatively good amateur kernel driver developer - or at least his team is

and he's the former CTO of mcafee?


Have spent all my afternoon and all evening on a bridge trying to support flailing systems. Was supposed to be on a plane in 5 hours to start my vacation. Guaranteed it's not gonna happen.

With 911 and other safety-critical systems going down, I hope that the worst that comes out of this is a couple of delayed flights and a couple of missed bank payments.


Good news (um, as in better than the bad news today) is the plane won't be taking off anyway, so you're golden.


Yet their stock tanked only a couple of dollars. They (and their customers) should face some rather unpleasant lawsuits. If you let others own your systems, you should not be allowed to provide critical infrastructure.


I don't think you're going to see as many lawsuits as you think. Most of these contracts probably state that they had to follow reasonable precautions for business continuity and data recovery. Having Crowdstrike in the path seems to have been a reasonable and potentially best practice before today's outage.

I don't think that companies are going to be held liable at all.


I do hope some organizations realize though it’s not a great idea to have a half-baked rootkit as your lucky charm against cybercrime.


Eh. I think you're underestimating how overmatched these IT depts are when it comes to cybersecurity.

Either sign a contract with a best-in-class (even if in name only) vendor who says that they'll do all of this for us or we need to become "experts" in cybersecurity and potentially still use them.

The CIO is overmatched here so they're making the decision that protects them and their clients in _almost all_ cases.


Crowdstrike should be held liable and sued out of existence.


They won’t. If crowdstrike was an individual or a state there would be repercussions. But this will all be a forgotten memory in two weeks or less.


Once they are taken to court and all their crap gets subpoena'd I think we might find that reasonable precautions were not taken.

It's possible that this update was never properly QA'd and was just rushed out the door. If that's the case then it could be found to be negligence, and no amount of legal jargon protects you from negligence. It could be the end of CrowdStrike. /end fud.


Think the parent meant the client companies probably won't generally be held liable. CrowdStrike is certainly going to be in all sorts of trouble.


They didn’t force anyone to use their software in critical infrastructure. The customers deploying the software as part of critical infrastructure should take the necessary precautions or insist on a contractual agreement that makes the vendor liable for any causally related failures of the critical infrastructure. The mistake is that so much software is being put into use without any substantial liability. Doing so would also make software much, much more expensive.


unqualified people on the internet shouldn't give legal advice, but since we're doing it anyway: no, this is definitely not true and if you make assurances about the fitness of the good you are on the hook when it fails, even if it's some absurdly improbable "I had no way of knowing that our pencils intended for schoolchildren would be used on a spacecraft" situation.

There is a reason you will see a ton of warranties and terms of service/EULA specifically forbid or disclaim the use in life-critical situations, in which case you are safe, because you said don't do it. But if you don't, you generally are going to be liable.

Sadly there is a reason the chainsaws say "do not stop chain with genitals". Not only did someone probably do that, but the damages stuck.

example, I was talking about the CUDA license yesterday and of course one of the clauses is:

> You acknowledge that the SDK as delivered is not tested or certified by NVIDIA for use in connection with the design, construction, maintenance, and/or operation of any system where the use or failure of such system could result in a situation that threatens the safety of human life or results in catastrophic damages (each, a “Critical Application”). Examples of Critical Applications include use in avionics, navigation, autonomous vehicle applications, ai solutions for automotive products, military, medical, life support or other life critical applications. NVIDIA shall not be liable to you or any third party, in whole or in part, for any claims or damages arising from such uses. You are solely responsible for ensuring that any product or service developed with the SDK as a whole includes sufficient features to comply with all applicable legal and regulatory standards and requirements.

Why is this here? because they'd be liable otherwise, and more generally they want to be on the record as saying "hey idiot don't use this in a life-critical system".

There might well be a clause like that in crowdstrike's license too, of course. But the problem is it's generally different when what you are providing is a mission-critical safety/security system... hard to duck responsibility for being in critical places when you are actively trying to position yourself in critical places.


>Sadly there is a reason the chainsaws say "do not stop chain with genitals". Not only did someone probably do that, but the damages stuck.

This is very dependent on your jurisdiction. The USA's laws leave a lot more room for litigating in a way which I would deem frivolous than those of Canada. If you sell a chainsaw with safety features that adhere to common standards, you should reasonably expect people not to try to stop it with their ballsack, and a court of law that holds the manufacturer liable for moronic use of the object is a poorly designed court.


I think you have it flipped. Any clause in a contract is a negotiation. Warranties and insurance coverage are part of it.

Any smart CIO would have said - I'll take what you sell, but if you fail I can come back and haunt you, and you are going to give me an endorsement for your product insurance, and I'll require upping your coverage + notification that you are up to date with your insurance policy that has 50XXX M in coverage, minimum.

If the software is sitting on top of your business core IT, you must protect the business in its entirety by demanding a proportional shield, and using the IT vendor's own IT insurance shield as if it was your own. And demanding more coverage, if the shield is too small. Then once those elements are in place, you are protected. It's as simple as that.


What you say doesn’t contradict my comment. I’m sure that CrowdStrike has disclaimers.


Their stock is down 13% right now... pretty huge for an intra-day drop.


I mean, seriously: You can cause a worldwide outage of gargantuan proportions, affecting actual human lives and untold points off GDP in several countries ...

... and the market gives you the equivalent of a wrist slap? No lawsuits?

What is it down to? Anonymous hit markets? Where is justice going to be served here?

PS. Most outlets are reporting "a fix has been issued" - as in "Whoopsy. No biggie" ...

... I mean, who's going to make affected (still alive!) people whole?


Naive question, if it’s a blue screen of death with a boot loop, how are they going to restore things? Don’t tell me the answer is going to every system manually.


Well, it seems that Windows is not yet accessible remotely when it crashes.

If the system administrator had too much free time, and configured every system to try network boot first, and there is no disk encryption, it is possible to boot from a minimal Linux image with a script that automatically renames the driver and restarts.

The corporate version of the same approach uses Intel AMT (or however else it is called), but it is only available on licensed hardware from big suppliers.

Otherwise, you can distribute flash drives with the same auto-executing fix to everyone who is able to enter firmware setup, and boot from USB. If it's not available for security reasons, more manual work is required.

But what happens next? If Crowdstrike handled all the security measures, and there were no additional firewall rules, address checks, and so on, your network is now as open as it can be. I suppose certain groups have been celebrating, and uploading gigabytes of data from networks whose detection systems have been knocked out.


Go to every system manually, boot to safe mode, rename the sys files, run a fix.

Easier to just rebuild from the image. For every windows machine your company has. lol.


Lots of systems (not all) are able to reboot, and have CrowdStrike download the fix before the bad code is able to crash things. But otherwise, yes, you have to go to systems manually.


Going to every system manually, then deleting a file via the command line in Windows' recovery environment.


Remote access control (e.g. iDRAC) or physical access.


It's kind of surprising so much infra was using Windows servers or Windows cloud VMs for these things. I assumed these systems would all be Linux VMs in Azure/AWS/GCP at this point.

on https://azure.status.microsoft/en-gb/status the message is currently:

> We have been made aware of an issue impacting Virtual Machines running Windows Client and Windows Server, running the CrowdStrike Falcon agent, which may encounter a bug check (BSOD) and get stuck in a restarting state.


Welcome to the enterprise. Where “lift and shift” was sold to corporate CTO’s as better than maintaining their own IT infrastructure.


Workaround + update (within an authenticated portal)

https://www.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_e...


If you write software, there is hardly a better time to watch "The Mess We're In" by Joe Armstrong - https://www.youtube.com/watch?v=lKXe3HUG2l4

I am not sure in which one of his talks he briefly mentioned that one of his concerns is that we are basically building a digital Alexandria library, and if it burns, well ...

Even more devastating events like this will happen in the future.

We stand on the shoulders of giants and yet we learned nothing.


We have ~50 thousand laptops in a reboot loop and ~1.5k servers as well. No resolution yet.


I guess people who continue to use Windows in 2024 arguably deserve this, particularly those utilizing it in a production environment.


It's truly horrifying how many critical systems run Windows.


It was Windows in this case but nothing is stopping it from happening with any other widely used system that gets online updates. CrowdStrike has root on Linux/MacOS as well after all.

The problem is relying on networked computers for critical infrastructure with no contingency plan. This sort of thing will happen whether because of a bug or because of ransomware. The software and hardware industries are incapable of producing reliable and safe products in our economic system.

Important services such as hospitals, groceries, water treatment plants, and electric grids should be able to operate in offline mode when this sort of thing inevitably happens.


what about all the people that use services provided by people that use windows? should there be some sort of pushback here?



The impact in Australia is immense. https://downdetector.com.au/, it's almost every major org in the country. NSW Government is completely offline.


Australian news reporting this has hit hospitals, fire and rescue, banking, media, airlines and many other companies worldwide.



Newsreaders on the ABC are reading off paper notepads lmao


I was listening to Triple J (one of ABC's radio stations), they said: "welcome to our first and possibly last ever Triple J's USB Fridays, we can't play any of our usual music because the computers are all down, all we can play is the songs that happen to be on the USB stick that one of us had in our pocket". LOL!


Our company is in panic mode. 15 machines blue screened for no apparent reason and stuck in boot loop. I’m a gloating Linux user :)

Also in Australia


Year of the Linux desktop at last!


I think there are companies with 10000+ or probably more


What I'm curious about: other than checkbox compliance, how does Crowdstrike convince companies to buy their product? Do they present evidence that their product is effective at protecting customers? Because certainly Crowdstrike customers still get hacked.


I've watched it occur countless times. Often times the people making the purchase decision are largely incompetent.

They usually come out and take your team to a nice lunch. Then they run you through a fancy slide deck and convince you to let them run some scaremongering reporting tool over your infra. By the end of the day, most of your leadership is convinced they need the solution.

Rinse and repeat hundreds of times and you have the 3rd party vendor hodgepodge hellscape that constitutes most large corporations' IT infrastructure.


I would imagine that their best weapon is that so many other big organizations are using CS, so choosing CS gives the decision maker the best shield from responsibility, similar to "nobody gets fired for choosing IBM".

Of course, how they got started when they were small was completely different.


They all should have used some expensive corporate-and-government-level product that promises protection against exactly that kind of large scale attack on infrastructure.


I assume this is irony, since isn’t Crowdstrike exactly that these days?


Is it believed to be an attack? I only saw mention of a bug so far.


CrowdStrike have finally posted publicly on it: https://www.crowdstrike.com/blog/statement-on-windows-sensor...


Brilliant press release:

- Not apologetic

- An 'issue' in a 'single' release

I bet they were originally planning on starting with a '[reported by a] small number of customers' too.


“We’ve encountered a small single issue in which.. cough we’ve crashed all the computers in the world…”


"A specific and limited subset of computers on a single globe"


Their legal team would definitely not allow anything else.


When I saw 'Global IT Outage' trending I assumed it was another major cloud service failure. Obviously this has far wider impact because of the need for intervention on individual endpoints.

The irony is dawning on me that for much of the recent computing era we've developed defenses against massive endpoint outages (worms, etc.) and one of them is now inadvertently reproducing the exact problem we had mostly eradicated.


CrowdStrike offers a temporary solution for crashed systems; they have given users a potential way to fix their machines:

- Boot Windows into Safe Mode or the Windows Recovery Environment (you can do that by holding down the F8 key before the Windows logo flashes on screen)

- Navigate to the C:\Windows\System32\drivers\CrowdStrike directory

- Locate the file matching “C-00000291.sys”, right click and rename it to “C-00000291.renamed”

- Boot the host normally.
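
If you have to script it (say, from a recovery environment where the affected OS volume is mounted as C:), a rough sketch of the rename step in Python. The path and file pattern follow the advisory above; treat them as assumptions and verify on your own systems:

    # Rename CrowdStrike channel files matching C-00000291*.sys so the faulty
    # content no longer loads. Run with the OS volume mounted and writable.
    import pathlib

    driver_dir = pathlib.Path(r"C:\Windows\System32\drivers\CrowdStrike")
    for f in driver_dir.glob("C-00000291*.sys"):
        f.rename(f.with_suffix(".renamed"))   # e.g. C-00000291.sys -> C-00000291.renamed
        print(f"renamed {f.name}")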



fun doing that on thousands of machines


It's easy, the users can do it themselves. Just send them an e-mail with the inst... oh wait...


One of the downsides of WFH. How do you contact your IT support via Microsoft Teams when you don’t have your laptop or Microsoft Teams?


Yup. Well in our case (and we are thankfully not affected), they could call IT support. But then again, if IT support themselves cannot boot their PCs...


Great. now I'll do this 10,000 times.


Practice makes a man perfect.


Microsoft are going to be pissed that this is widely being discussed as a Microsoft outage. Do AV vendors like Crowdstrike need a license or something from Microsoft to push these kernel driver based things? Or is it just like anyone can make one?


yes, and they have two. it is a windows problem.


The impact of this will be profound!

Obviously bugs are inevitable, but why this wasn't progressively rolled out is beyond me.


My understanding is that multiple recent versions are affected.


Yeah, progressive rollout would have dramatically reduced the impact, I think that should be mandatory for any of these systems.


It seems like this would indirectly tell us what systems use Crowdstrike. Could that in and of itself be information that could help an attacker? I know the security team at work is adamant about not leaking details of our system.


In terms of analysing risk factors to minimise something like this happening again, what are the factors at play here?

A Crowdstrike update being able to blue-screen Windows Desktops and Servers.

Whilst Crowdstrike are going to cop a potentially existential-threatening amount of blame, an application shouldn't be able to do this kind of damage to an operating system. This makes me think that, maybe, Crowdstrike were unlucky enough to have accidentally discovered a bug that affects multiple versions of Windows (ie. it's a Windows bug, maybe more-so than it is a Crowdstrike bug).

There also seems to have been a ball dropped with regard to auto-updating all the things. Yes, gotta keep your infrastructure up to date to prevent security incidents, but is this done in test environments before it's put into production?

Un-audited dependence on an increasingly long chain of third-parties.

All the answers are difficult, time consuming, and therefore expensive, and are only useful in times like now. And if everyone else is down, then there's safety in the crowd. Just point at "them too", and stay the path. This isn't a profitable differentiation. But it should be! (raised fists towards the sky).


> Whilst Crowdstrike are going to cop a potentially existential-threatening amount of blame, an application shouldn't be able to do this kind of damage to an operating system.

It doesn't operate in user space, they install a kernel driver.


> "they install a kernel driver"

And therein lies the problem!


It's a design decision. People want the antivirus to protect them even if an attacker exploits a local privilege escalation vulnerability or if an attacker that compromised an admin account (which happens all the time in Windows environments) wants to load malicious software. That's kind of the point of these things. Somebody exploits a memory vulnerability of one of the hundreds of services on a system, the antivirus is supposed to prevent that, and to their benefit, Crowdstrike is very good at this. If it didn't run in the kernel, an attacker with root can deactivate the antivirus. Since it's a kernel module, the attacker needs to load a signed kernel module, which is much harder to achieve.


Presumably Crowdstrike's driver also has the ELAM flag, which guarantees it will be loaded before any other third-party drivers, so even if a malicious driver is already installed they have the opportunity to preempt it at boot.

https://learn.microsoft.com/en-us/windows-hardware/drivers/i...


> guarantees it will be loaded before any other third party drivers

Point of information. "Guarantee" and "any" are unsubstantiated by that MS article.


If we are being pedantic then an ELAM driver can't be guaranteed to load before another ELAM driver of course, but only a small list of vetted vendors are able to sign ELAM drivers so it is very unlikely that malware would be able to gain that privilege. That's the whole point.


Not pedantic. Just accurate.

> an ELAM driver can't be guaranteed to load before another ELAM driver of course,

Thanks for the correction.


Yep. We can't migrate our workstations to Ubuntu 24.04 because Crowdstrike's Falcon kernel modules don't support the kernel version yet. Presumably they wanted to move to eBPF but I'm guessing that hasn't happened yet. Also: I can't find the source code of those kernel modules - they likely use GPL-only symbols, wouldn't that be a GPL violation?


Why would you use Crowdstrike on Ubuntu? Is because of a real security concern, or abiding to regulations (thou shalt have an antivirus) or else?


I was given to understand that Crowdstrike provided some protection from unvetted export of data. I'm not sure that data would be useful without the rare domain expertise to use it, but I wasn't shown the risk analysis. And then someone else demands and gets ssh access to GitHub. Sigh.


Ask my IT dep. AFAIK it's audit related, safety-critical software


I think "compliance" would be a better word to use that "safety" when it comes to a lot of "security" software on computers.

And I bring up the distinction because while compliance is "sometimes" about safety, it's also very often about KPIs of particular individuals or due to imaginary liability for having not researched every possible "compliance" checkbox conceivable and making sure it's been checked.

Some computer security software is completely out of hand because its primary purpose is to have the appearance of effectiveness for the exec whose job is to tick off as many safety checkboxes as they can find, as opposed to being actually pragmatically effective.

If the same methodologies were applied to car safety, cars would be so weighed down by safety features, that they wouldn't be able to go faster than 40km/h.


Just to be safe, of course! In my org they try to rollout sentinel one on every ‘endpoint’ regardless of operating system.


Probably only a violation if you distribute the linked result. Not if you only install it.


How would you install it without them distributing it?


They mean distributing Linux + the module together. Like e.g. shipping the Nvidia kernel module alone is fine, but shipping a Linux distro with that module preinstalled is not fine.


Two different "it". As an analogy: selling pizza Hawaii is dicey, but you can sell pineapple slices and customers can add those to their pizza themselves.


> We can't migrate our workstations to Ubuntu 24.04 because Crowdstrikes

Should you upgrade before 24.04.1 is released? It's scheduled for August 15.


IIRC, about 12-18 months ago CrowdStrike was recruiting for a developer with eBPF skills.


The generally accepted (but not well tested) legal position is that it's ok to have a proprietary kernel module that is dynamically loaded.

You can, for instance, ask a running kernel if it is "tainted" by having loaded a non-GPL module.


Last time I dealt with HP, I had to use their fakeraid proprietary kernel module which "tainted" the kernel. Of course they never open-sourced it. I guess it's not necessary.


GPL exported symbols are the ones that are thought to be so tightly coupled to the kernel implementation that if you are using them, you are writing a derivative work of the kernel.


Yeah that was also my understanding, and I can't imagine an AV module able to intercept filesystem and syscalls while only using non-core symbols. But of course you never know without decompiling the module


> and I can't imagine a av module able to intercept filesystem and syscalls to be only using non-core symbols.

I can, considering that you can do that from user space using strace. Or ebpf which is probably the actual right way to do this kind of thing.


Not like they have an option. Kernel drivers are required.


Are they? Apple has pretty much banned kernel drivers (kexts) in macOS on Apple Silicon. When they were still used, they were a common cause of crashes and instability, not to mention potential gaping security holes.

Most things that third-party kernel drivers used to do (device drivers, file systems, etc) are now done just as well, and much more safely, in userspace. I'm surprised if Microsoft isn't heading in this direction too?

Presumably, Crowdstrike runs on macOS without a kernel extension?


> Presumably, Crowdstrike runs on macOS without a kernel extension?

That's correct: CrowdStrike now only installs an "Endpoint Security" system extension and a "Network" system extension on macOS, but no kernel extension anymore.


One would hope that Crowdstrike does a similar thing on Linux and relies on fanotify and/or ebpf instead of using a kernel module. The other upside to this would be not having to wait for Crowdstrike to be constantly updating their code for newer kernels.
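
For illustration, the user-space flavour of that model looks roughly like this with the bcc Python bindings. This is purely a sketch of the approach (observing exec events from an eBPF program loaded from user space), not a claim about how Falcon actually works:

    # Minimal eBPF sketch via bcc: trace execve without a custom kernel module.
    # Requires root and the bcc package; illustrative only.
    from bcc import BPF

    prog = r"""
    TRACEPOINT_PROBE(syscalls, sys_enter_execve) {
        bpf_trace_printk("execve observed\n");
        return 0;
    }
    """

    b = BPF(text=prog)
    print("tracing execve... Ctrl-C to stop")
    b.trace_print()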


Wait, you still will be using CS? Why?


I believe so but would like better details. We used to use another provider that depended on exact kernel versions whereas the falcon-sensor seems quite happy with kernel updates.


Whatever protection is implemented in user-land can be removed from user-land too. This is why most EDR vendors are now gradually relying on kernel based mechanisms rather than doing stuff like injecting their DLL in a process, hooking syscalls, etc...


This is wrong, there are many facilities that, once applied, cannot be modified (short of a reboot)


Such as ?


Random example: https://man.openbsd.org/OpenBSD-7.3/msyscall

This is a syscall used by userspace to tell the kernel which memory portion is allowed to do syscalls

This syscall can only be used once: once the linker has done it, the kernel will refuse extra calls (so allowing more memory pages is not possible)


First, we were talking about EDR in Windows usermode.

Second, still, that doesn't change anything. You can make your malware jmp to anywhere so that the syscall actually comes from an authorized page.

In fact, in windows environment, this is actively done ("indirect syscalls"), because indeed, having a random executable directly calling syscalls is a clear indicator that something is malicious. So they take a detour and have a legitimate piece of code (in ntdll) do the syscall for them.


The original Windows NT had microkernel architecture, where a driver/server could not crash the OS. So no, Crowdstrike didn't have an option really, but Microsoft did.

As PC got faster, Microsoft could have returned to the microkernel architecture, or at least focused on isolating drivers better.


They've done it to a degree but only for graphics drivers, Windows is (AFAIK) unique amongst the major OSes in that it can nearly always recover from a GPU driver or hardware crash without having to reboot. It makes sense that they would focus on that since graphics drivers are by far the most complex ones on most systems and there are only 3 vendors to coordinate API changes with, but it would be nice if they broadened it to other drivers over time.


NT was never a true microkernel. Most drivers are loaded into the kernel. Display drivers being a huge pain point, subsequently rolled back to user space in 2000, and printer drivers being the next pain point, but primarily with security -- hence moving to a Microsoft-supplied universal print driver, finally in Windows 11.


Yep, this is technical legacy in action.


There's a grey area between "kernel drivers are required for crowdstrike" and "windows is not modular enough to expose necessary functionality to userspace". It could be solved differently given enough motivation.


An expanded explanation with the third option of: even with existing options, it was really badly implemented - https://social.treehouse.systems/@marcan/112812791936639598


Required for crowdstrike to do what crowdstrike does. Which is mostly useless security theatre.


The people installing crowdstrike have an option: Don't install it.


So what? Crowdstrike is a kernel AV. How else would you solve this?


My experience working with Crowdstrike was that they were super arrogant about these risks. I was working on a ~50k enterprise rollout, and our CS guy was very belligerent about how long we were taking to do it, how much testing we wanted to do, the way that we were staggering roll outs and managing rollback plans. He didn’t think any of this was necessary, that we should roll it out in one fell swoop, have everything to auto-update all the time, and constantly yapped about how many bigger enterprises than ours completed their rollouts in just a couple of weeks.

He actually threatened to fire us as a client because he claimed he didn’t want the CS brand associated with an org that wasn’t “fully protected” by CS. By far the worst vendor contact I’ve ever had. I’ve had nicer meetings with Oracle lawyers than I was having with this guy. I hope this sort of thing humbles them a little.


> constantly yapped about how many bigger enterprises than ours completed their rollouts in just a couple of weeks.

Evidence is pointing towards him actually being right about this, despite likely being wrong about everything else.

It'd be worth giving him a call, just to check in how he's going, and take him up on the offer to fire you as a client.


I was just a contractor there, and don’t work with them at the moment. But I’m a customer of theirs and they’re definitely having an outage right now, so I’m guessing it’s all still in place.


Mind rephrasing? I don't understand what you're saying.


I don’t work there any more. But they were having an outage, so I’m guessing they never got fired as a client (guessing that they’re still using Crowdstrike) and could still take that offer (of being fired as a client) if they wanted to.


What evidence are you referring to? Was there a company that was breached for taking a few days or weeks to update crowdstrike?


>I hope this sort of thing humbles them a little.

Hopefully not. It would be better that this company is sued into oblivion by all the customers that were affected by this huge outage.


Maybe humbles all the other surviving companies? We can only dream


> I hope this sort of thing humbles them a little.

What I hope is that they cease to exist as a product and as a company. They have caused inconvenience and economic damage on a global scale, and probably also loss of life, given that many hospitals and ER units had outages. It has been proven that their whole way of working is wrong, from the very foundation to the top.


Ouch, considering the devil works under Oracle's lawyers, that's bad!


sounds like a very inexperienced person.

If their mission is to protect businesses they should understand your concerns.

Speed is useless without control.


He was pretty senior for his role, but really I have no idea whether he was representative of the wider company culture.

We had a buggy client release during the rollout which consumed all the CPU in one of our test environments (something he assured us could never happen), and he calmed down a bit after that. Prior to that though he was doing stuff like finding our CISO on LinkedIn to let him know how worried he was about our rollout pace, and that without CS protection a major breach could be imminent.


I'm guessing he's not typical, only because if he was, CrowdStrike would be known far and wide for this behavior.

For example, the way Oracle's lawyers are known.


At the end of the day, if you give an application a deep set of permissions, that's on you as an administrator, not the OS. This unchecked global rollout appears to just be a violation of every good software engineering practice we know.


Administrators are to blame because management (and a lot of 'cybersecurity policies') demand there's a virus scanner on the machines?

While virus scanners might pick up some threats not addressed by OS updates yet every one of them I've seen is a rootkit in disguise wanting full system privileges. There are numerous incidents with security holes and crashes caused by these security products. They also aren't that clever: repeatedly scanning the same files 'on access' over and over again wasting CPU and IO is not going to give you any extra security.


Not so much in disguise.

CS has official RCE root/admin access on all the clients. Which skips any normal auth of the OS. Yes, on all windows, mac and linux.


I often watch Crowdstrike thrash my laptop's resources, making it slow to do compiles. Cybersecurity won't let me disable it either, so I just set it to lower priority process.
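
In case it helps anyone else, the sort of thing I mean is below (a rough psutil sketch; the service name is a guess for your install and it may need admin rights):

    # Drop the Falcon sensor's priority so builds aren't starved (Windows).
    # "CSFalconService.exe" is assumed -- check Task Manager for the real name.
    import psutil

    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == "csfalconservice.exe":
            proc.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)  # Windows-only constant
            print(f"lowered priority of PID {proc.pid}")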


You might have more luck asking Cybersecurity to add a path like ~/code which contains your source code to the exclusion list.


As someone who worked for a company that's a Crowdstrike partner, I assure you that Crowdstrike does not sell to administrators. It is very much a product sold to management and company auditors.

Where you're correct is that it's on the administrators to roll out the updates, but I'm not sure that's how Crowdstrike works. It's a managed solution and updates are done for you; maybe that can be disabled, but I honestly don't know.


This should clue you in.

CS is not sold to SA or technical types. It's sold to management as a risk reduction.

The whole point is that if you are technical, you are so untrusted that management is willing to require circumvention of known good practices and force installation of this software against technical advice.


> This unchecked global rollout appears to just be a violation of every good software engineering practice we know.

Yeah, this is what surprises me. Corporate infrastructure policy seems to have been matched to smart phone default settings.


I have worked in Finance for 25 years, and the amount of pressure I had to withstand from Auditing on "Why do we have a 20-day window on applying most updates as we get them from suppliers? We are not best practice!" is gruelling.

These people report to the Board Chairman, don't understand any real implication of their work, and believe the world is a simplistic Red - Amber - Green grid.

I understand most CIOs / CTOs / CISOs in Corporate would buckle.


So the silver lining from this incident would be that you can simply point to it, and tell those auditors to fuck off.


I'm pretty sure Apple does gradual rollouts of upgrades, so default smartphone settings are better than that.


It's actually worse than phone updates. Ever looked at your phone and noticed it hasn't updated to the new OS despite it having been out for a few days already? This is why.


> an application shouldn't be able to do this kind of damage to an operating system

Antivirus software by its nature probably needs the kind of access that would let it bluescreen your computer.


This is not the case. There are many possible AV architectures, with or without kernel drivers and/or administrator level permissions.


Wading out of my depth here, so forgive any stupidity following.

And there's a certain amount of sense to that, it has to get "under" the layer that viruses can typically get to, but I still think there should be another layer at which the OS is protected from misbehaving anti-virus software (which has been known to happen).


That usually makes it a port of entry for attacks. Antivirus products are really malware waiting to be exploited.


It's a kernel mode driver. There aren't layers in kernel drivers. Any kernel module/driver can crash your system if it wants to.


You're talking about how things are; the comment you're replying to is talking about how things could be. There's not a contradiction there.

Originally, x86 processors had 4 levels of hardware protection, from ring 0 up to ring 3 (if I remember right). The idea was indeed that non-OS drivers could operate at the intermediate levels. But no one used them and they're effectively abandoned now. (There's "level -1" now for hypervisors and maybe other stuff but that's besides the point.)

Whether those x86 were really suitable or not is not exactly important. The point is, it's possible to imagine a world where device drivers could have less than 100% permissions.


It runs at Ring 0, there's no lower ring (besides maybe IME and the like).


The problem I have with this is that anti-virus software has never felt like the most reliable, well-written, trustworthy software that's deserving of its place in Ring 0.

I understand I'm yelling into the storm here, because anti-virus also requires that level of system access due to the nature of what it's trying to detect. But then again, does it only need Ring 0 access for the worst of the worst? Can it run 99% of the time in Ring 1, or user space, and only invoke its Ring 0 privileges for regular but infrequent scans or if it detects something else may be 'off'?

Default Ring 0? Earn it.

This turns into a "what's your threat level" discussion.


Technically, there are rings -1 through -3; hypervisor/-1 actually seems widely used and maybe could be used here.

https://en.wikipedia.org/wiki/Protection_ring#Miscellaneous


Need something like a hypervisor OS/hardware that isn't IME.


Modern Windows installs already run under a hypervisor. It's called Core Isolation or Virtualization Based Security.


We need security layers all the way down ... :)


Don't rootkit yourself and then cry about it when it falls over. Problem solved.


Crowdstrike is basically corporate malware - the failure is in large part with security dept deciders who signed off on policies that compel people to install these viruses on their work machines.


Other than a lack of redundant systems, it should be illegal to roll out updates like this to more than x% of any gov stuff at a time. Brute force way of avoiding correlational Armageddon.


I feel the need to make a mustang car crowdstrike meme


There's a better joke, Crowdstrike sponsors the Mercedes Formula 1 team and in 1955 Mercedes was involved in the worst motorsport accident ever, killing over 80 people watching from the stands when parts of the cars flew off and... striked the crowd...


I think it was a Dodge Charger. That pro-Trump KKK guy in the gray car who drove through the crowd at, I think, a George Floyd protest?

Also heard "crowd stroke" today.


Could you elaborate?


Mustangs have a reputation as being 'crowd (or streetlight) seeking' missiles.

This is due to their price making them relatively more available to the enthusiasts than say Hellcats, enthusiasts who may not be experienced enough to deal with having that much power available to them in a RWD car. This confluence of power, confidence and lack of skill often comes to a head when the enthusiast goes to a car meet to show off and meet with like minded folks. At the conclusion of the meet, or during a group drive, they'll often pull a sick burnout as they pull out of the parking lot on to a street.

A sick burnout they haven't practiced, which will often cause them to lose the back end, sending the car into the curb, a tree, or a crowd of like-minded attendees at the car meet. Hence the reputation.

For example: https://www.youtube.com/watch?v=DPx5aBI8UTQ


Mustangs are famous for their high power and poor handling - there are lots of videos showing drivers doing burnouts, losing control, and striking the crowd they are showing off to.


maybe they installed crowdstrike because they wanted updates without testing, and crowdstrike failed at testing them in their environment.

sounds like they didn't test all cases and stumbled on a windows bug


Maybe it's time that critical systems switch to Linux. The major public clouds are already primarily running Linux. Emergency services, booking, and traditional point-of-sale have no strong reason to run Windows. In the past 10 years, the technological capability differences between Windows and Linux have widened considerably, with Linux being the most advanced operating system in the world without question.

Concerns about usability between Windows and Linux in the modern day are disingenuous at best and malicious at worst. There is no UX concern when everything runs off a webapp these days.

Just use Linux. You will save money and time, and your system will be supported for many years, you won't be charged per E-Core, you won't suffer BSoDs in 2024. Red Hat is a trustworthy American company based out of Raleigh, NC, in case you have concerns of provenance.

Really there's no downside. If you were building your own company you would base your tech stack on Linux and not Windows.

Critical systems cannot go down; therefore they cannot run Windows. If they do, they are being mismanaged and run negligently. Management should have no issue finding Linux engineers, they are everywhere. I could make a killing right now as a consultant going from company to company and just swapping out Windows backends for Linux. And quite frankly I might just do that, literally starting right now.


The discussed issue is not related to any meaningful difference between Windows and Linux – Crowdstrike used a kernel driver, apparently containing a serious bug, which took down the system, which is something any kernel driver can do, no matter which kernel you use. At least Windows has a well-developed framework for writing userspace drivers, unlike Linux.

> Linux being the most advanced operating system in the world without question.

Very strong and mostly unfounded claim; there are specific aspects where Linux is "more advanced", and others where Windows comes out ahead (e.g. almost anything related to hardware-based security and virtualization).

> your system will be supported for many years

Windows Server 2008 was supported until earlier this year, longer than any RHEL release.

> you won't suffer BSoDs in 2024

Until you install a shitty driver for a dubious (anti)malware service.


I don't understand this sort of blindness? Linux fails all the time, with rather terrible nobody-to-root vulns because some idiot failed to use the right bounds check. Ye gods, XZ utils was barely a few months ago!


But no damage actually ended up happening with the xz utils exploit. It didn't even get released because someone picked it up pre-release.

Every system gets attacked, but I think your point shows that even with state-level attacks Linux handles it better than other platforms.


Hmm? It was released for two plus months? 5.6.0 and 5.6.1

I'd also say this wasn't a good example of 'linux handling it better': usually when a mess like this occurs on windows all the corps get a quiet tap on the shoulder that they need to immediately patch when MS releases it, then a few days later it hits the news. In XZ's case, the backdoor was published before the team knew about it, huge mess.


You’re right that it went unnoticed for a long time, just one clarification

> all the corps get a quiet tap on the shoulder that they need to immediately patch when MS releases it, then a few days later it hits the news

AFAIK, distros were notified and released a patched version of xz like a week before it hit the news, so at least a lot of machines received it via automatic updates.


Depends which news you're talking about. MS guy who discovered it found it March 29th, published to oss. It was in infosec news same day as redhat, others pushed out critical advisories. Patch didn't come til a day or two later.


You're half right - people who compiled it from source could theoretically get those releases, but no, it wasn't released in any distros. So in practice since no linux distro released it, no-one relying on linux distros was exposed to it.


You mean 'xz utils'


zx sounds better


> Maybe it's time that critical systems switch to Linux.

I switched critical systems to illumos and BSD years ago and it's been smooth sailing ever since. Nowadays there really is no need to contribute to linux monoculturization whatsoever.


oh, you think security won't mandate to run CS on linux.

Granted it didn't down linux this time but nothing is stopping it.


It’s not security, it’s compliance. The two are sometimes aligned, sometimes less so.


We've had production outages caused by Microsoft Defender on our RHEL boxes :(


Yeah, they definitely would mandate it.

My work laptop is running Ubuntu, and corporate IT requires Symantec Antivirus to be running on it


I too want to see Linux more widely adopted, but it won't prevent this from happening. People will install corrupted kernel modules on Linux too for anti-virus purposes.


All good points, but Windows didn't win because it had the best tech or user interface, merely the most developer support and thus user numbers. Legacy momentum is an incredibly difficult thing to sway. It has taken Apple decades and potentially hundreds of billions of dollars of marketing and goodwill to carve out its share of the market. Linux doesn't have that despite its clear technical advantages.

It is an incredibly frustrating battle, akin to that of Sisyphus.


Crowdstrike has a linux version. It is mandatory on our linux servers at my company, so that is not the solution.

I would say issue 1 is management/compliance forcing admins to install malware like Crowdstrike. But issue 1 exists because of issue 2, which is that admins / app devs / users aren't careful enough to keep their machines from being compromised on a regular basis in the first place. And issue 2 exists because of issue 3: the software industry not focusing on quality and making bug-free software.

All in all this should be mitigated by more diversity in OS, software and "said security solution". Standardization and monopolies works well until they don't and you get this kind of shit.


I think we don't do enough to push back against these requests in a language that is understood by management. Ask them to sign a security waiver assuming the risks of installing software that techs would classify as malware and an RCE risk.

Companies like CS live on reputation; it should be dragged down.


> Crowdstrike has a linux version

But would it crash the OS?


One place I'm at recently required us to install it in our Kubernetes cluster which powers a bunch of typical web apps.

Falcon sensor is the most CPU intensive app running in the cluster and produces a constant stream of disk activity (more so than any of our apps).

It hasn't crashed anything yet but it definitely leaves me feeling iffy about running it.

I don't like CrowdStrike at all. I got contacted by our security department because I used curl to download a file from GitHub on my dev box and it prompted a severe enough security warning that it required me to explain my intent. That was the day I learned I guess every command or maybe even keystroke I type is being logged and analyzed.


We were also forced to run that until the agent had introduced a memory leak that ate almost all the memory on all the hosts. Thankfully we managed to convince our compliance people that we could run an immutable OS rather than deploy this ~~malware~~ XDR agent.


Yes, the CS Falcon agent caused a kernel panic on RHEL about a month ago.


and yet everyone is blaming Windows sigh.

Windows actually runs a lot of drivers in user-mode, even GPU drivers. largely this is because third-party drivers were responsible for the vast majority of blue screens, but the users would blame Microsoft. which makes sense; Windows crashes so they blame Windows, but I doubt anyone blamed Linux for the kernel panic.


I think Windows can be blamed for how badly you can fix this kind of issue. On Linux or any BSD, admins would build an ISO image that automatically runs a script to optionally decrypt the system drive and then remove Crowdstrike. Or alternatively, build a live system that takes an address via DHCP and starts an SSH server, and admins would remotely and automatically run a playbook that mounts that ISO on the hypervisor, boots it, applies the fix remotely, then boots the system back from the system drive.

Maybe this is just my ignorance about Windows and its ecosystem, but it seems most admins this morning were clueless about how to fix this automatically and remotely on n machines, and resorted to booting into safe mode and removing a file manually on each individual server. It is insane to think that supposed Windows sysadmins / cloudops have no idea how to deploy a fix automatically on that platform.


Linux is blamed for bad device drivers all the time, even on HN.



It can kill processes based on memory scanning. Imagine systemd getting killed at every boot.

An issue might not be as universal as on windows, because some distros do things differently like not using glibc, or systemd, or whatever. Yet there are some baselines common to the most popular ones.


If it works the same way - absolutely.


Why wouldn't it? This particular bug wouldn't, but another one...



I suggest switching to macOS. They don't allow third-party kernel drivers which is already a big advantage over Windows or Linux.


Well, Microsoft tried to lock down its kernel with Windows Vista, and then antivirus vendors cried that they wouldn't be able to protect Windows, that it was anticompetitive, etc.

https://www.computerworld.com/article/1642872/q-a-microsoft-...

https://betanews.com/2006/10/18/mcafee-ms-failing-to-provide...


> Linux being the most advanced operating system in the world without question.

Only if you don't need a GUI/Desktop.


I rate Linux DE higher than I do windows and Mac desktop tbh. Better ergonomics, better user experience and less bloat.


I could never get smooth scrolling to work on Linux in any mainstream web browser, most people don’t seem to see it, but I’m sensitive to things like that.


Imho that was somewhat true on x11 but on wayland I feel everything is much smoother. I am more a pgup/pgdown user though.


Like with a laptop trackpad? I'm smooth-scrolling through these comments right now, and don't remember when scrolling wasn't smooth by default on any trackpad.


It’s smooth to a point, but not smooth like OS X is. It might have improved (I think I last tried desktop Linux a year ago). I do enjoy using Linux as my default headless OS.


NOT SMOOTH SCROLLING!


I need a few accessibility settings and Mac just excels in this regard.


> Only if you don't need a GUI/Desktop.

I not only need a GUI/Desktop, it's my daily driver!

And there are precious few things that Windows GUI/Desktop provides which I don't have on Linux, while the reverse is never true.

When I used Mac (Big Sur, I think?) until a year ago, I was absolutely miserable about having to use such a primitive GUI.


I have a GUI/Desktop on Linux, not sure what you're referring to?


Do Linux systems not crash if a third party kernel module crashes? Or was your comment sarcastic?


My employers pays Crowdstrike to double my build times. Quite astounding really.


Anecdote: my first job was IT at a small org. We had somehow gotten a 15 minute remote meeting with Kevin Mitnick, and asked him several questions about security best practices and software recommendations. I don't remember a lot about that meeting, but I do remember his strong recommendation of Crowdstrike. Interesting to see it brought up again in this context.


Can someone explain to me why such systems need anti-virus in the first place?

Windows has pretty good facilities for locking down the system so that ordinary users, even those with local admin rights, cannot run or install unauthorised code. So if nothing can get in, why would the system need checking for viruses?

So why do most companies not lock down their machines?


easier to show a paid bill than to show true due diligence to your insurance when you're hit with ransomware.

that's the whole CS business model.


I assume you have to install "CrowdStrike" yourself (i.e. not bundled with Windows by default)? I had no idea what it was before.


Its paid antivirus software, they cater to businesses


Anything that has root/kernel access is a risk. It always has been. When will we learn. Probably never. Because money runs this world. So sad. Time to open a bakery and move on from this world.


Considering what Crowdstrike is intended to do, it's not really possible for it to work without running at the kernel level.


Things like hospitals, airlines, 911, should have multiple systems with different software stacks and independent backends running in-parallel, so that when one infra goes down they can switch to another.


For some areas of our critical systems we have three independent software groups program the same exact system on different infrastructure. Just for moments like these...


There is an enormous cost associated with the kind of redundancy you're talking about. Capitalism prevents us from being set up in the way you're describing. Why invest in company A if company B can run the same business with half the operational expenses? Shareholder profit above all.


Is company B allowed to take the full brunt of all the problems when there is a failure, or does government protect it by limiting damages? If company B's cheaper choice leads to harm and lets people and estates sue company B into the ground, then company A is a safer investment even if it has lower returns. If government interaction limits such recovery options, then that is what leads to company B's higher returns not also having higher risks, so they'll be the better investment. But that is a result of government intervention, not the economic system in play.


Maybe they do perform canary deployments and Australia was the canary?

Certainly feels like it's disproportionately affecting us down under.


We’re just the only ones awake to feel it


Anecdotal evidence: the global mega corp I unfortunately work for is definitely feeling this globally


It looks like the whole world is the canary, but they will have the release ready and in top shape for the Mars deployment.


How does such a huge company do “full deploys” like this? At this number of endpoints, only a few % should have been updated (and faced the problems) before a full rollout.

This is not a small startup with some SaaS; these guys are in most computers of too many huge companies. Not rolling out the updates to everyone at the same time seems just too obvious.


This incident definitely makes a good case for staggered deploys of patches.
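
The mechanics aren't complicated either. A toy sketch of ring-based selection (the ring percentages and host naming here are made up for illustration, not any vendor's actual scheme):

    # Staggered rollout: hash each host id into a stable bucket in [0, 1) and
    # only release to hosts whose bucket falls under the current ring's cutoff.
    import hashlib

    RINGS = [0.01, 0.05, 0.25, 1.00]  # 1% canary, then 5%, 25%, then everyone

    def bucket(host_id: str) -> float:
        digest = hashlib.sha256(host_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # stable per host

    def eligible(host_id: str, ring: int) -> bool:
        return bucket(host_id) < RINGS[ring]

    hosts = [f"host-{i}" for i in range(1000)]
    canary = [h for h in hosts if eligible(h, 0)]
    print(f"{len(canary)} of {len(hosts)} hosts get the update in the canary ring")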


Working late Thursday night in Florida, USA. I have someone in Australia wanting me to write a quick script in LSL for an object in Second Life. We were interrupted: Second Life kept running, but Discord went down, telling me to 'try another server', which doesn't make sense when you are 1-on-1 with someone. All my typing in Discord turned red. Additionally, I couldn't log into the email portal for outlook.com: I got a screen of tiny-fonted text all clinging to the left edge of the display, unreadable, unusable. Second Life, though, stayed online and kept working for me, but then I'm on Windows 7. My friend who had requested the collaboration froze in Second Life on his Windows 10 system, and I don't know what his Discord was doing. I ended the session since I couldn't get a go/no-go out of him for the latest script version.


Wow I didn't know second life was still a thing. Literally yesterday I looked at a 20 year old archived version of a freeware portal which also listed a version of second life.


This is a good example of why you don't want ring0 level access for clients. Or just, you don't want client-based solutions. The provider just becomes another threat vector.


Those focusing on QA and staged rollouts are misguided. Yes of course a serious company should do it but CrowdStrike is a compliance checkbox ticker.

They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”.

The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).

I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)


So can crowdstrike be classified as malware now?

Currently waiting in line for 2 hours + waiting for Delta to tell me when my connecting leg can be booked. My current flight is delayed 5 hours.


Due to the scale I think it’s reasonable to state that in all likelihood many people have died because of this. Sure it might be hard to attribute single cases but statistically I would expect to see a general increase in probability.

I used to work at MS and didn’t like their 2:1 test to dev ratio or their 0:1 ratio either and wish they spent more work on verification and improved processes instead of relying on testing - especially their current test in production approach. They got sloppy and this was just a matter of time. And god I hate their forced updates, it’s a huge hole in the threat model, basically letting in children who like to play with matches.

My important stuff is basically air-gapped. There is a gateway but it’ll only accept incoming secure sockets with a pinned certificate and only a predefined in-house protocol on that socket. No other traffic allowed. The thing is designed to gracefully degrade with the idea that it’ll keep working unattended for decades, the software should basically work forever so long as equivalent replacement hardware could be found.
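
For the curious, the pinning side of that is small. A minimal Python illustration of the idea; the certificate file, host and protocol bytes below are made up, and the real gateway is not Python:

    # Pinned-certificate client: trust exactly one certificate, so nothing else
    # can terminate the connection, CA-signed or not.
    import socket
    import ssl

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.load_verify_locations("pin.pem")  # the one and only trusted certificate
    ctx.check_hostname = False            # identity comes from the pin, not the name

    with socket.create_connection(("gateway.internal", 8443)) as sock:
        with ctx.wrap_socket(sock) as tls:
            tls.sendall(b"HELLO\n")       # predefined in-house protocol only
            print(tls.recv(1024))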


At one company I used to work for, we had boring, airgapped systems that just worked all the time, until one day security team demanded that we must install this endpoint security software. Usually, they would fight tooth and nail to prevent devs from giving any in-house program any network access, but they didn't even blink once to give internet access to those airgapped systems because CrowdStrike agents need to talk to their mothership in AWS. It's all good, it's for better security!

It never caught any legit threat, but constantly flagged our own code. Our devs talked to security every other week to explain why this new line of code is not a threat. It generated a lot of work and security team's headcount just exploded. The software checked a lot of security checkboxes, and our CISO can sleep better at night, so I guess end of day it's all worth it.


>It never caught any legit threat, but constantly flagged our own code

When I worked in large enterprise it got to the point that if a piece of my app infrastructure started acting weird the blackbox security agents on the machines were the first thing I suspected. Can't tell you how many times they've blocked legit traffic or blown up a host by failing to install an update or logging it to death. Best part is when I would reach out to the teams responsible for the agents they would always blame us, saying we didn't update, or weren't managing logs etc. Mind you these agents were not installed or managed by us in any way, were supposed to auto update, and nothing else on the system outran the logrotate utility. Large enterprise IT security is all about checking boxes and generating paperwork and jobs. Most of the people I've interacted with on it have never even logged into a system or cloud console. By the end I took to openly calling them the compliance team instead of the security team.


I know I've lost tenders due to not using pre-approved anti-virus vendors, which really does suck and has impeded the growth of my company, but since I'm responsible for the security it helps me sleep at night. This morning I woke up to a bunch of emails and texts asking me if my systems have been impacted by this and it was nice to be able to confidently write back that we're completely unaffected.

I day-dream about being able to use immutable unikernels running on hypervisors so that even if something was to get past a gateway there would be no way to modify the system to work in a way that was not intended.

Air-gapping with a super locked down gateway was already getting more popular precisely due to the forced updates threat surface area, and after today I expect it to be even more popular. At the very least I’ll be able to point to this instance when explaining the rational behind the architecture which could help in getting exemptions from the antivirus box ticking exercise.


I love their forced updates, because if you know what you're doing you can disable them, and if you don't know what you're doing, well you shouldn't be disabling updates to begin with. I think people forget how virus infested and bug addled Windows used to be before they enforced updates. People wouldn't update for years and then bitch how bad Windows was, when obviously the issue wasn't Windows at that point.


If the user wants to boot an older, known-insecure, version so that they can continue taking 911 calls or scheduling surgeries... I say let 'em. Whether to exercise this capability should be a decision for each IT department, not imposed by Microsoft on to their whole swarm.


Microsoft totally lets them. If you use any Enterprise version of Windows, the company can disable updates, but not the user.


No, after the fact. Where's the prompt at boot-time which asks you if you want to load yesterday's known-good state, or today's recently-updated state?

It's missing because users are not to be trusted with such things, and that's a philosophy with harmful consequences.


Isn't this in the boot options?

https://support.microsoft.com/en-us/windows/advanced-startup...

> Last Known Good Configuration (advanced). Starts Windows with the last registry and driver configuration that worked successfully.


I don't have any affected systems to test with, but I'd be pretty surprised if that were an effective mechanism for un-breaking the crowdstruck machines. Registry and driver configuration is a rather small part of the picture.

And I don't think that's an accident either. Microsoft is not interested in providing end users with the kind of rollback functionality that you see in Linux (you can just pick which kernel to boot to) because you can get less money by empowering your users and more money by cooperating with people who want to spy on them.


1) It is not enterprise version of Windows; it is any version capable of GPO (so Pro applies too, Home doesn't).

2) it is not disabling them; it is approving or rejecting them (or even holding up the decision indefinitely).

You can do that too, via WSUS. It is not reserved to large enterprises, as I've seen claimed several times in this thread. It is available to anyone, who has Windows Server in their network and is willing to install the WSUS role here.


We took 911 calls all night, I was up listening to the radio all night for my unit to be called. The problem was the dispatching software didn't work so we used paper and pen. Glory Days!!!!


Again, this is something the sysadmin can configure. Reread my comment.


It doesn't really matter to me that it's possible to configure your way out of Microsoft's botnet. They've created a culture around Windows that is insufficiently concerned with user consent, a consequence of which is that the actions of a dubiously trusted few have impacts that are too far and wide for comfort, impacts which cannot be mitigated by the users.

The power to intrude on our systems and run arbitrary code aggregates in the hands of people that we don't know unless we're clever enough to intervene. That's not something to be celebrated. It's creepy and we should be looking for a better way.

We should be looking for something involving explicit trust which, when revoked at a given timestamp, undoes the actions of the newly-distrusted party following that timestamp, even if that party is Microsoft or cloudstrike or your sysadmin.

Sure, maybe the "sysadmin" is good natured Chuck on the other side of the cube partition: somebody that you can hit with a nerf dart. But maybe they're a hacker on the other side of the planet and they've just locked your whole country out of their autonomous tractors. No way to be sure, so let's just not engage in that model for control in the first place. Lets make things that respect their users.


I'm specifically talking about security updates here. Vehicles have the same requirement with forced OTA updates. Remember, every compromised computer is just one more computer spreading malware and being used for DDOS.


Ignoring all of the other approaches to that problem I wonder if this update will take the record for most damage done by a single virus/update. At some point the ‘cure’ might be worse than the disease. If it were up to me I would be suggesting different cures.


I don't see what this has to much do with MS. A bad proprietary kernel module can crash any OS.


An immutable OS can be set up to revert to the previous version if a change causes a boot failure. Or even a COW filesystem with snapshots when changes are applied. Hell, Microsoft's own "System Restore" capability could do this, if MS provided default-on support for creating system restore points automatically when system files are changed & restoring after boot failures.
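
The boot-failure side of that is not much code either. A rough sketch of the idea; the snapshot tooling (snapper/Btrfs), paths and threshold are illustrative assumptions, and a real setup would wire this into the init system:

    # Run early in boot: count consecutive boot attempts; if we keep failing,
    # roll the root filesystem back to the previous snapshot and reboot.
    # A unit that runs after a *successful* boot should reset the counter to 0.
    import pathlib
    import subprocess
    import sys

    COUNTER = pathlib.Path("/var/lib/boot-attempts")   # illustrative location
    THRESHOLD = 3

    attempts = int(COUNTER.read_text() or "0") if COUNTER.exists() else 0
    attempts += 1
    COUNTER.write_text(str(attempts))

    if attempts >= THRESHOLD:
        # make the previous known-good root snapshot the default again
        subprocess.run(["snapper", "--config", "root", "rollback"], check=True)
        subprocess.run(["systemctl", "reboot"], check=False)
        sys.exit(1)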


What's funny to me is that in college we had our computer lab set up such that every computer could be quickly reverted to a good working state just by rebooting. Every boot was from a static known good image, and any changes made while the computer was on were just stored as an overlay on a separate disk. People installed all manner of software that crashed the machines, but they always came back up. To make any lasting changes to the machine you had to have a physical key. So with the right kind of paranoia you can build systems that are resilient to any harmful changes.


Right, an OS completely crashing like this is the fault of the OS and the problematic code.

An OS should be really resistant to this kind of thing.


What other OS besides recent CoreOS/Silverblue/etc does this auto-restore of system files automatically?


No other OS forces an auto-restart.


No restart was needed to cause this crash. As soon as Falcon downloads the updated .sys file ... BOOM.


Well, not the OS per se, but macOS updating mechanisms have an auto-restart path, and I imagine any Linux update that touches the kernel can be configured that way too. It's more the admin's decision than the OS's, but on all common systems auto-restart is part of the menu too.


MS could've leaned more towards user-space kernel drivers though. Apple has been going in that direction for a while and I haven't seen much of that (if anything) coming from MS.

That would have prevented a bad driver from taking down a device.


Apple created their own filesystem to make this possible.

The system volume is signed by Apple. If the signature on boot doesn't match, it won't boot.

When the system is booted, it's in read-only mode, no way to write anything to it.

If you bork it, you can simply reinstall macOS in place, without any data/application loss at all.

Of course, if you're a tinkerer, you can disable both, the SIP, and the signature validation, but that cannot be done from user-space. You'll need to boot into recovery mode to achieve that.

I don't think there's anything in NTFS or REFS that would allow for this approach. Especially when you account for the wide variety of setups on which an NTFS partition might sit on. With MBR, you're just SOL instantly.

Apple hardware on the other hand has been EFI (GPT) only for at least 15 years.


Well we all know where Microsoft is in security… even the government acknowledges it’s terrible


I blame Microsoft in the larger sense; they still allow kernel extensions for use cases that Apple has shown could be moved outside the kernel.


I don’t know the specifics of this case, but formal verification of machine code is an option. Sure it’s hard and doesn’t scale well but if it’s required then vendors will learn to make smaller kernel modules.

If something cannot be formally verified at the machine code level there should be a controls level verification where vendors demonstrate they have a process in place to achieving correctness by construction.

Driver devs can be quite sloppy and copy-paste bad code from the internet; in the machine code, Microsoft can detect specific instances of known copy-pasted code and knows how to patch it. I know they did this for at least one common error. But if I were in the business of delivering an OS that I want people to rely on, formal verification at some level would be table stakes.


I thought Microsoft did use formal verification for kernel-mode drivers and that this was supposed to be impossible. Is it only for their first-party code?


No, I believe 3rd party driver developers must pass Hardware Lab Kit testing for their drivers to be properly signed. This testing includes a suite of Driver Verifier passes that are done, but this is not formal verification in the mathematical sense of the term.


I wasn’t privy to the extent it was used, if this was formally verified to be correct and still caused this problem then that really would be something. I’m guessing given the size and scope of an antivirus kernel module that they may have had to make an exception but then didn’t do enough controls checking.


This is almost definitely on Crowdstrike.

There is a windows release preview channel that exists for finding issues like this ahead of time.

To be fair - it is possible the conflicting OS update did not make it to that channel. It is also possible it is due to an embarrassing bug from MSFT (unknown as yet).

Until I hear that this is the case - I am pinning this on Crowdstrike. This should have been caught before prod.


Even if this is entirely due to Crowdstrike, I see it as Microsoft's failure to properly police their market.

There is the correctness-by-testing vs correctness-by-construction dynamic, and in my view, given the scale of interactions between an OS and its kernel modules, trying to achieve correctness by testing is negligent. Even at the market scale Microsoft has, there are not enough Windows computers to preview-test every combination. Especially when taking into account that the people on the preview ring have different behaviors to those on the mainline, so many combinations simply won't appear in the preview.

I see it as Microsoft owning the Windows kernel module space and having allowed sloppiness by third parties and themselves. I don't know the specifics, but I could easily believe that this is due to a bug from Microsoft. The problem with allowing such sloppiness is that the sloppy operators outcompete the responsible operators; the bad pushes out the good until only the bad remains. A sloppy developer can push more code and gets promoted while the careful developer gets fired.


There's not enough public information about it - but taking this talking point at face value, Microsoft did sign their kernel driver in order for it to be able to do this kind of damage. It's not publicly documented what all validation they do as part of the certification and signing process:

https://learn.microsoft.com/en-us/windows-hardware/drivers/i...

The damage may have been done in a dependency which was not signed by Microsoft. Who knows? Hopefully we'll find out.

In general, a fair amount of the bad behavior of windows devices since Vista has been really about poorly written drivers misbehaving, so there appears to be value in that talking point. All the Vista crashes after release (according to some sources, 30% of all Vista crashes after release were due to NVidia drivers), notably, and more recently if you've ever tried to put your Windows laptop to sleep, and discovered when you take it out of your bag that it had promptly woken back up and cooked itself into having a dead battery. (Drivers not properly supporting sleep mode) WHQL has some things to answer for for sure.


Microsoft can prevent this and they should have prevented this, that they did not prevent this in the past does not make it any better.


Crowdstrike has released the detail that the bad files were configuration data.

It is their fault, not Microsoft's. The driver was fine.


> And god I hate their forced updates,

My windows machine notified me of the update, asked me to restart. I was busy, so I didn't. Then the news broke, then the update was rolled back.


It wasn't a Windows update. If you got a notification for an update, it wasn't the update that did this.


As a tester, I'm frustrated by how little support testing gets in this industry. You can't blame bad testing if it's impossible to get reasonable time and cooperation to do more than a perfunctory job.



Ah the "move fast and break things" philosophy gets a demonstration.


That's misplaced. Windows is an ancient platform. CrowdStrike is ubiquitous and routinely updated. There was no "move fast" here, at least on the part of the people operating these systems.


Pushing an update to all clients worldwide simultaneously isn't "move fast"?


No. It's routine. They're not promulgating some fabulous new invention. They're digital hall monitors, chasing bad actors.

They're just bad at it.


Stopped by a gas station in rural Wisconsin leaving from MSP. Thank God we were on a full tank when we left, nothing was operational except the bathrooms (which is why we stopped).

I left thinking about how anti-anti-fragile our systems have become. Maybe we should force cash operations…


Back in the 1990s when Microsoft wanted to enter the embedded systems market there was a saying "You don't want Windows controlling your car's breaks". We now let them control a huge part of our lives. Should we let them add AI to the already unpalatable cocktail?


Lessons learned from this:

- CS: Have a staging (production-like) environment for proper validation. It looks like CS has one of these but they just skipped it.
- IT Admins: Have controlled roll-outs, instead of doing everything in a single swoop.
- CS: Fuzz test your configuration (see the sketch below).

Anything I have missed?
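On the fuzzing point, a minimal sketch of what a harness could look like. The parse_channel_file function, magic bytes and sizes are all made up here; the real parser lives inside the kernel driver, so this only illustrates the idea of "corrupt input must produce a clean rejection, never a crash":

    import random

    MAGIC = b"CSCF"   # hypothetical header, just for the sketch

    def parse_channel_file(blob: bytes) -> None:
        # Stand-in for whatever actually consumes a channel file.
        if len(blob) < 8 or blob[:4] != MAGIC:
            raise ValueError("bad header")
        # ... real parsing of offsets/records would happen here ...

    def fuzz(seed: bytes, iterations: int = 10_000) -> None:
        for i in range(iterations):
            blob = bytearray(seed)
            for _ in range(random.randint(1, 8)):          # corrupt a few random bytes
                blob[random.randrange(len(blob))] = random.randrange(256)
            try:
                parse_channel_file(bytes(blob))
            except ValueError:
                pass                                        # clean, expected rejection
            except Exception as exc:                        # anything else ~= a bugcheck
                print(f"iteration {i}: parser blew up: {exc!r}")

    fuzz(MAGIC + bytes(1024))

Run long enough, with mutation of real seed files, this is the kind of check that should have flagged a config blob the driver can't tolerate before it ever shipped.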


It is possible CrowdStrike did a time-triggered release on this. Controlled roll-outs wouldn't help if all the daily chunked updates didn't activate in the kernel driver until some fixed point in the future.


Don't. Deploy. On. Fridays.


Maybe one day we will stop giving RCE to so many vendors via auto update.


I am going to stop saying this, but people don't realize CS has official RCE as a feature. As in, run remote commands as root/admin on Windows or Linux/Mac through their web UI.


Assuming this event itself isn't malicious, what an excellent POC for something that is. I sure hope every org out there with this level of market reach has good security in place. It's certainly going to be getting some probing after this.


This is a manifestation of almost everything wrong about software development and marketing practices.

I work in hardware development and such a failure is almost impossible to imagine. It has to work, always. It puzzles me why this isn't the case for software. My SWE colleagues often get mad at us HW guys because we want to see their test coverage for the firmware/drivers etc. The focus is having something which compiles, pushing the code to production as fast as possible, and then regressing in production. Most HW problems are a result of this. I found it's often better to go over the firmware myself and read it line by line to understand what the code does. It saves so much time from endless debugging sessions later. It pisses off firmware guys, but hey, you have to break some eggs to make an omelette.


> It puzzles me why this isn't the case for software

In my anecdotal experience, it's because corporate software projects are not typically run by people who are good at building safe things - but rather, just building things quickly.

There's a huge issue with the mentality of "it works, ship it" being propagated.

I build systems software for safety-critical and mission-critical markets, and I can say without a doubt that if there aren't at least two quality stages in your process workflow (and your workflow isn't waterfall), then you're going to be in for a rough time, rookies.

Always, always delay your releases, and always, always eat your own dog food by testing your delayed releases in your own customer-like environment. Which is to say, never release a developer's build.


This is also my experience. And the worst was always being lectured by SW project managers about being agile and having to move quick and release early. I won't release anything without making sure everything works in every possible condition. This is why it takes years to build a complex chip (CPU, Fpga, any SoC really). Their firmware is often squeezed into months, and often the developers are handling like 10 different projects. So, no focus, no time to understand the details of the design. At the end it's common to have firmware issues in the first year after release. It's kind of expected even.


Complexity. As you get further from the driver and the kernel software complexity expands massively. It gets to a point where it is beyond the abilities of humans and processes to manage it in a cost effective manner.


I understand that might be the case for a lot of SW development. But in the context I was talking about, the HW is so much more complex than the SW. That's valid for a lot of cases too. But then, why? If we know that we cannot build a 100 km long bridge, nobody attempts to build it and waste resources. Why does software development lack this?


When is HW much more complex than the SW? I work in a company that designs (far from trivial) hardware, and develops embedded software, and in my experience software is always more complex than hardware, due to the many layers of abstraction (unless you are writing only, IDK, boot loaders in assembler? and even then it is about the same level of complexity)


When it is IC design. PCB modules are not so complex but ICs are.


There's supposedly a fix being deployed (https://x.com/George_Kurtz/status/1814235001745027317). Since it's a channel update I'm assuming that it would be downloaded automatically? Has anyone received it yet? Does the garbage driver disappear or is it replaced?

Edit: got in touch with an admin:

C-00000291-00000000-00000029.sys SHA256 1A30..4B60 is the bad file (timestamp 0409 UTC)

C-00000291-00000000-00000030.sys SHA256 E693..6FAE is the fix (timestamp >= 0527 UTC)

Do not rely on the hashes too much as these might vary from org to org I've read.
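If you're checking hosts with a script, something like this prints what's actually on disk so you can compare it against whichever hashes your org has confirmed (path as per CrowdStrike's published workaround; this is just a sketch, not an official tool):

    import hashlib
    from pathlib import Path

    DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    for f in sorted(DRIVER_DIR.glob("C-00000291*.sys")):
        print(f.name, sha256(f))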


Ironically, the SolarWinds court case happened yesterday. The SEC won: SolarWinds was fraudulent in saying their software was "secure". They should rename a side-channel attack a "Tom and Jerry", because it's getting to be like a game of cat and mouse.


1. This is why kernel modules are a bad idea
2. This is why centralism is a bad idea
3. This is why sacrificing stability for security is a bad idea
4. Security still needs to factor in security of supply - not just data safety


Centralisation in a nutshell. Monopolies so big that they become globally fragile. CloudFlare outages break a lot of the internet, and now, as we can see, a Windows-based update is bricking machines across the world.

We've all pushed bad updates but how was this not tested?


Industry should move to Linux on desktop - we should not rely on single vendor


2024!!! The year of Linux on the desktop!


Finally! And here I thought it would never come.


We got a day, or maybe even a week, of Linux on the desktop at least :)


Don't forget about the BSD's. We should not rely on a single Finnish man.


At least Linux servers, this situation is crazy.


Renaming c:\system32\drivers\csagent.sys on the server after booting into safe mode fixes the issue, but it disables the agent.


BBC live coverage: https://www.bbc.com/news/live/cnk4jdwp49et

Looks like this is a big deal.


Looks like crowdstrike are just delivering what their name promised, striking crowds around the world


How many people still believe the "cloud" was worth it? Maybe we should go back to the days of buying software and running it ourselves with our own infrastructure.

I know, I'm dreaming.


Maybe a silly question, but: why hasn't this affected Linux? I assume it uses a proprietary kernel module just like it does on Windows. I guess this will come out in a post-mortem if they publish one, but it's been on my mind.

edit: aha https://news.ycombinator.com/item?id=41005936

They did do this to Linux, but in the past. Maybe whatever they did to deal with it saved Linux this time around


The sheer coverage of this outage across multiple businesses and industries, the impact must be greater than some of the malicious cyber attacks from ransomware, worms etc.


I want to say the problem is that the industry has systematically devalued software testing in favor of continuous delivery and the strategy of hoping that any problems are easy to roll back.

But it's deeper than that: the industry realizes that, once you get to a certain size, no one can hurt you much. Crowdstrike will not pay a lasting penalty for what has just happen, which means executives will shrug and treat this as a random bolt of lightning.


Greenspun's tenth rule:

"Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."

(https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule)

Arthur C. Clarke's third law:

"Any sufficiently advanced technology is indistinguishable from magic."

(https://en.wikipedia.org/wiki/Clarke%27s_three_laws#:~:text=....)

Apparently we now have the following, as well:

"Any sufficiently bad software update is indistinguishable from a cyberattack…"

(https://x.com/leighhoneywell/status/1814278230704111792)


This is why I don't like fully automatic updates. I prefer having control over the "deploy" button for the ability to time it when I can tolerate downtime. In mission-critical production systems all updates should go through test staging pipelines that my team controls, not a vendor.

Broken updates have caused far more havoc than being a few hours or even days late on a so-called critical patch.


Even if you deploy manually all at once: you have the same problem.

A solution is slow rollout. Not a manual deploy button


It's not the same problem at all.

Troubleshooting an issue like this when I have the time and am prepared for a potential outage (with human resources hot and standing by for immediate action) is VASTLY different than encountering it the evening before some critical deadline for a multi-million dollar project (as Murphy's Law will be sure to have it).

When I have control over the deployment of updates I can push them through my own QA environment first, to uncover many of these kinds of issues before they hit production. Vendors pushing them out on their whim leaves me subject to whatever fast and loose practice they use and prevents me from being able to properly manage my own infrastructure.

A slow rollout certainly helps but doesn't satisfy the kind of 9's I demand in the environments I care for.


Their stock price will suffer but they can waive license fees for a year or so for every endpoint affected (~$50).

They'd better pin this on a rogue employee, but even then, force-pushing updates shouldn't be in their capability at all! They must guarantee removal of that capability.

Lawsuits should be interesting. They offer(ed?) $1 mil breach insurance to their customers, so if they were to pay only that much per customer this might be compensation north of $10B. But to be honest, wouldn't surprise me if they can pay up without going bankrupt.

The sad situation is, as twitter people were pointing out, IT teams will use this to push back against more agents for a long time to come. But in reality, these agents are very important.

Crowdstrike Falcon alone is probably the single biggest security improvement any company can make and there is hardly any competition. This could have been any security vendor, the impact is so widespread because of how widely used they are, but there is a reason why they are so widely used to begin with.

Oh, and just FYI, the mitigation won't leave you unprotected: when you boot normally, the userspace exes will replace it with a fixed version.


> the single biggest security improvement

Clearly not, unless you don't count a world-wide economic and societal disruption of unprecedented scale a security incident.

> This could have been any security vendor

...that apparently deploys Kernel Extensions to millions of Windows devices at once, without any staggering.

> there is a reason why they are so widely used to begin with

Because companies need to check a box, and purchasing CrowdStrike checks that box.


> Crowdstrike Falcon

“Cybersecurity’s AI-native platform for the XDR era”.

I hope there’s a blockchain somewhere in it.


Allow me to give a different, information-theoretic, perspective. How much damage can flipping a single bit cause? How much damage can altering two bits cause?

The fanout is a robustness measure on systems. If we can control the fanout we increase reliability. If all it takes is a handful of bits in a 3rd party update to kill IT infrastructure, we are doing it wrong.


Are you suggesting that a 3kb update be tested 3k times to assess the impact of each possible bit flip, and 9M times for the impact of each possible pair of bit flips?

Because I think that's effort better spent in other ways.


Not remotely. I am aware of the state space explosion and the difficulty with brute forcing the testing. I am suggesting that the damage a broken antivirus update can do should be restricted.


"Incidents of this nature do occur in a connected world that is reliant on technology." - Mike Maddison, CEO, NCC Group

Until I see an explanation of how this got past testing, I will assume negligence. I wasn't directly affected, but it seems every single Windows machine running their software in my org was affected. With a hit rate that high I struggle to believe any testing was done.


A reminder why switching off auto-update is a thing.


It looks like it wasn’t a software update, it was an AV definitions update, so internal to the CS application.


True, though tbf it's still part of the running system.

I read that many of those affected are global orgs. When I worked at an oil major, everything was tested to oblivion before going into production in the DCs, the reason being to avoid precisely this kind of situation where at all possible. There were clusters set aside for operational acceptance testing to ensure everything, from business application right down to kernel, ran successfully. The idea of leaving auto-update on in any production system was unthinkable. Yet here we are.


Admin: We should turn off the AV auto update in prod and test it in staging first.

Manager/CISO: That would increase our exposure time on zero day vulnerabilities. Overruled.


Was just using the energy vic website and thought I'd been rate limited when their API stopped working. Seems like it could be this.


AFAIU Google (and I presume other operators of large numbers of computers) deploy updates to their software first to a small set of nodes, and only after the update has been deemed successful for a given time do they continue to update increasingly larger sets until complete.

Isn't this done as well with automatic updates of end user software or embedded systems and if not, why not?
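For what it's worth, the core of such a staged rollout is small. A rough sketch; the ring sizes, bake time, and the push_update/healthy callbacks are placeholders a vendor would have to define for their own fleet:

    import time

    RINGS = [0.001, 0.01, 0.10, 0.50, 1.00]   # fraction of the fleet per stage
    BAKE_TIME_S = 6 * 3600                     # how long to watch each ring before widening

    def staged_rollout(fleet, push_update, healthy):
        done = 0
        for fraction in RINGS:
            target = max(done + 1, int(len(fleet) * fraction))
            for host in fleet[done:target]:
                push_update(host)
            done = target
            time.sleep(BAKE_TIME_S)
            # A host that stops checking in looks exactly like a boot loop, so
            # "unhealthy" should include "has not phoned home since the push".
            failures = sum(1 for h in fleet[:done] if not healthy(h))
            if failures > max(1, done // 100):
                raise RuntimeError(f"rollout halted after {done} hosts ({failures} unhealthy)")

Nothing about this requires fancy infrastructure; the hard part is organisational willingness to wait out the bake time.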


This is affecting our company. A colleague visited her local supermarket (Woolworths) and all the self-service checkouts were affected.


Silver lining?


This event raises the question: What is the liability of Crowdstrike given its erroneous update caused the meltdown, and the impact certainly had negative personal or business outcomes globally.

See for example 6000 flights cancelled or the many statements posted here regarding it negatively impacting healthcare and other businesses.


we are bound to see the YouTube ads equivalent of late night spot ads for lawyers with accelerated audio "have you lost someone to the 2024, 2025 or 2029 crowdstrike global hospital outages? if so you may be entitled to compensation. DM law5237 on X to find more"


I wonder who exactly messed up the update, microsoft or crowdstrike. Usually, there is pre-rollout update testing AND some companies use N-1 version staging for critical/production systems. For me it feels much more complex a failure than just "it's crowdstrike's fault". Everybody involved must have done something wrong.


That was quick: Microsoft is blaming "3rd party" and announced that "a fix is forthcoming". Very curious indeed. https://techcrunch.com/2024/07/19/banks-airlines-brokerage-h...


I haven’t seen a simultaneous outage as big as this in my entire life. I’m just hoping this gets enterprises to move off of Windows.


Except it hasn't got much to do with Windows... it's a faulty kernel software package from a commercial vendor unrelated to the OS.

Philosophically it's always good to have diversity, precisely to avoid such disruptions. But the real issue here is: A) Apparently half the world runs CrowdStrike... so everything is disrupted. B) Apparently CrowdStrike didn't test their update properly.

I'm very curious what will happen to CrowdStrike. This seems like a huge liability?


We don't know that yet


We do now, hahaha

Delta is going after CrowdStrike, and I doubt they'll be the last.


Workaround fixed it for me, thankfully I had access to the bitlocker recovery keys. This will be a bad day for IT people worldwide.


All my customers' endpoints are Linux-based. Because our users' Windows apps run in VDI with disposable instances based off snapshots, and with highly restrictive networking on the Linux endpoints, none of our users are affected.

Running Windows on bare-metal was always obviously very stupid. The consequences of such stupidity are just being felt now.


Genuine question: how the heck did crapware like CrowdStrike get into all critical systems, from 911 to hospitals to airlines? My understanding was that all these critical systems are just super lazy about upgrading or installing anything at all. I would love to know all the sales tactics CS used to get into millions of systems for money!


Reading other comments here (sorry, I don't have the link), one CrowdStrike salesperson threatened to cancel them as a client, yes, you read that right, if the client wasn't easier to work with. So they're bullies, or at least that one salesperson at CrowdStrike is a bully.

Another article talked about CrowdStrike being required for compliance, and people here are talking about checkbox compliance. So there's a systemic requirement, perhaps from insurers, for some kind of comprehensive, near real-time-updated antivirus solution.

Furthermore, the "haste makes waste" philosophy seems, in my opinion, not to be honored by the minds who drive the impacted sectors of our economy: hospitals, banks, airlines. This kind of vulnerability should not have been accepted; it's a single point of failure. Even on CrowdStrike's website they have this kind of radar-ring hotspot target graphic, where they show at the very center one single client app, theirs, as if that one single client is the thing that's going to save us.


This is amazing sales tactics! So, you buddy up with insurance, they create a checkbox and recommend you for a revenue cut! Now you suddenly have millions of customers out of nowhere and your product gets installed on billions of computers before you even know it. I have seen this tactic used for many mediocre products. For example, 3rd-party dishwasher soap recommended by a dishwasher company. Amazingly powerful. I don't think most CrowdStrike employees even knew they were on more than a billion computers with a paid service. The CEO was just busy doing brutal marketing of this pointless product.


Any company that inserts itself so heavily into US politics cannot be counted on as a solid engineering organization.


There appears to be a workaround but my question is how are they going to get all of these endpoints out of a BSOD loop?


Might need to manually do it. Depends on if there is any lower level admin access than windows for each system.


I've been warning about the coming software apocalypse for years. This isn't a one-off, this is the beginning of a pattern. Tech recruitment is broken, software is more complex than ever, more and more people are turning to hacking, people are growing increasingly dissatisfied with the status quo...


OTA update went wrong? How can an update go live without proper testing for the millions of live connected endpoints?


Can they recover this OTA considering the systems can’t even boot?


I’m sure the patch itself can be fixed, and there will be a workaround to boot up the machine to fix it. My only concern is the BitLocker keys. If the hard drive is encrypted by Windows and assuming no backup for that key has been done, the system admins will have to activate their disaster recovery plans for these devices, and I hope they have that too, but hope isn’t a strategy!


Why would the BitLocker keys not be recoverable?


If the keys aren’t backed up, you will be locked out of the system: as soon as you try to boot into safe mode to perform that workaround, you will be asked to enter the key manually (or from a backup on a USB drive). If you don’t have, or don’t know, the key, you will have an encrypted drive with all of your data locked away.


So far it requires going into recovery mode and removing/renaming the CrowdStrike file. Then you can boot into Windows from there; it will probably be a sysadmin task depending on the organisation's setup.


Absolutely shameful display of how the cure can be worse than the disease. It's nonsense snake oil and security theater such as this that throws the cyber"security" industry into disrepute. One may as well have just installed McAfee Anti Virus.


This has been the story of the antivirus "industry" all along. They simultaneously seem to employ actual bona-fide security researchers while also making sure none of their software products are ever touched by people you could even refer to as "developers". I can't even imagine the noise at Microsoft from all the crash reports solely caused by antivirus software written by utter clowns injecting into other programs and, as here, into the kernel.

Previous: https://infosec.exchange/@wdormann/112530285189478825

Previous: https://thehackernews.com/2022/05/chinese-hackers-caught-exp...

Previous: https://www.ftc.gov/news-events/news/press-releases/2024/02/...

Previous: https://www.fortiguard.com/psirt/FG-IR-24-015

...


"We're only backdooring your machines for your own good! We are the good guy experts, we promise."


>snake oil

So much software - especially in the 'operating system' sphere of things - really is just snake oil.

It's just very functional oil in many cases - and highly toxic and slippery in many, many other cases.

>cyber "security" industry

Yes, I agree this is a market of smoke and mirrors, lies and propaganda.

The reason is, operating systems are broken. Pretty much all of them. It's just that some of them work well enough to get a lot of work done, most of the time. For the 99.9995% of the time it works, it's great.

But, here's a thing I feel needs broader attention and discussion - It is my firm opinion that "Operating Systems Vendors" are a very poor, ragged class of professionals these days.

The decisions made at Microsoft - and other OS vendor corporations - have really lost the plot.

I can prove this by asking the golden question among the general public, and categorically get a standard response: "does this feature benefit the user, or does it benefit an advertiser?"

"No, this all seems to be some sort of setup. Windows doesn't feel like its for us, any more."

I mean, how many 3rd-party vendors do I need secretly installing crippling 'updates' in my production systems before I realize that there is no security, and that the answer is to write better software that doesn't need all this utter junk?

I mean this sincerely, operating systems vendors are treasonous to the user if 3rd parties are of more relevance to production runtime, than the thing the user very definitely needs to be operating.

The cloud is for backups, encrypted. It is made of snake oil.

Always run your own machines.


Yet Lennart Poettering and RedHat (spelled that way, as I am one of the original pre-IPO investors of RedHat via Alex Brown/Deutsche Bank) want to put Linux networking into UEFI this quarter, inside the most sacrosanct PID 1.

They still won’t learn anything from CrowdStrike’s mistakes!

Maybe it is time for me to ditch that stock.


Source of claim?



Network sockets are in the systemd code repository.


Feels like what people imagined the millennium bug would have been like, just short of PCs catching on fire.



Is there an ELI5 on how this can happen? Like, I get it's a boot loop, but what did CrowdStrike do to cause it? How can non-malicious code trigger a boot loop?


I would not call Crowdstrike "non-malicious". It's incredibly incompetently implemented kit that's sold to organizations as snake oil that "protects them from cybercrime". Its purpose is to give incompetent IT managers a way to "implement something plausible" against cyber incidents, and when an incident happens, it gives them the excuse that "they followed best practices".

It craps up the user's PC while it's at it, too.

I hope the company burns to the ground and large organizations realize it's not a really great idea to run a rootkit on every PC "just because everyone else does it".


I have to say, it saved our ass a few months ago. Some hacker got access to one of our multiple brands' server infrastructure and started running PowerShell to weed through the rest, and CrowdStrike notified us (the owning brand) that something was off about the PowerShell being run. Turns out this small brand was running a remote-in tool that had an exploit. Had CrowdStrike not been on that server, we wouldn't have known until someone manually got in there to look at it.


Happy to know it works when needed!

But the implementation (when running on user PCs) is still half-baked.

My experience is using a PC with CrowdStrike for daily software development. In that setting it's quite terrible.

The server setting sounds like a much more reasonable use.


I've had CrowdStrike completely delete a debug binary I ran from Visual Studio. Its injected module in every single process shows up in all of our logging.


Yep. Exactly this and more.


I assume if you weren't running crowdstrike, you would have still had logging/alerting systems set up, no?


What specifically makes it "incredibly incompetently implemented"? And would you derisively describe any system that can push updates requiring admin access as a "rootkit", or is there some way you envision a "competently implemented rootkit" operating? Your opinion seems incredibly strong, so I'm just curious how you arrived at it. I'm not in IT, but the idea of both rolling out updates remotely and outsourcing the timely delivery of these updates to my door* is a no-brainer.

* if not directly to all my thousands of PCs without testing, which is 100% a "me" task and not a "that cloud provider over there" task


It's "rootkit" because it literally implements remote code execution as root as a feature.


Rootkit means Crowdstrike literally intercepts commands before they can be executed in the CPU. It is like letting a third party implant a chip in your brain. If the chip thinks the command in your head is malicious, it will stop your brain from ever receiving the command.


Crowdstrike needs to be the first person in the room so that they can act like the boss. If other people show up before crowdstrike, there's a possibility that they'll somehow prevent crowdstrike from being the boss. For this reason, crowdstrike integrates with the boot process in ways that most software doesn't.

Their ability to monitor and intervene against all software on the system also puts them in a position to break all software on the system.

more accurately: s/boss/most informed spy/g


Kernel driver bug that essentially defaults, then on reboot loads the same driver early on segfaults and reboots again, ad nauseum.


s/defaults/segfaults/ # stupid autocorrect


Thanks for catching that.


Best ELI5 ever!


What they did is that they forgot to write a graceful failure mode for their driver loader. (And what they did on top of it is to ship it without testing.)


My assumption is that when you have graceful failure for something like this, you risk a situation where someone figures out how to make it gracefully fail, so now it's disabled on this huge fleet.

It's likely that there have been multiple discussions about graceful failure at the load stage, and that it was decided against for 'security' reasons.


If the threat model includes "someone can feed corrupted files to us" then I would definitely want more robustness and verification, not less.

It's perfectly okay to make the protected services unavailable for security reasons, but still a management API should be available, and periodically the device should query whatever source of truth about the "imminent dangers". And as the uncertainty decreases the service can be made available again.

(Sure, then there's the argument against complexity in the kernel ... true, but that simply means that they need to have all this complexity upstream, testing/QA/etc. And apparently what they had was not sufficient.)


Crashes in kernel mode usually result in BSODs.


Cybersecurity company secures computers worldwide by not allowing them to be turned on. - not the onion


This article seems more relevant than ever and was posted a few days ago: https://ea.rna.nl/2024/07/12/no-it-really-no-i-t/


Why are we still running ANY operating systems based on Ambient Authority, as part of our infrastructure?

DoD shouldn't have given up on MULTICS. That premature optimization is going to sink the US and the Free World.

Personally, I'm still waiting for Genode to be my daily driver.


Official CrowdStrike workaround:

1. Boot Windows into Safe Mode or the Windows Recovery Environment
2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
3. Locate the file matching “C-00000291*.sys”, and delete it.
4. Boot the host normally.


TBF although I worried about this possibility the first time the IT dude wandered into my office in 1989 holding a floppy he said he wanted to put into all the PCs we had (we had no PCs), it has actually taken a very long time for the shit to hit the fan.


Good day for OSINTers, APTs, redteamers, to find out who uses Crowdstrike on their endpoints.


This is what happens when you entrust software security to ex-hackers. Hackers love complexity because that's the kind of environment they thrive in; yet when they start working for the other side as security consultants, they still love complexity. Complexity ought to be the security consultant's worst enemy.

Ex-hackers often talk about security as if it's something you need to add to your systems... Security is achieved through good software development practices and it's about minimalism. You can't take intrinsically crappy, over-engineered, complex software and make it more secure by adding layers upon layer of complex security software on top.


If you run RATs like these on your machines then I'm sorry, this is just a case of fucking around and finding out.

Just don't do it. Windows Defender is a thing, it does just fine. For everything else there is least-privilege and group policy.


Can't wait for the Kevin Fang video about this.


It's bizarre reading all the headlines about companies offline, flights canceled, banks not working because of a piece of antivirus software in 2024.

Mostly because I lived through Y2K and every fear about Y2K just materialised but because of Crowdstrike instead.

I can't imagine the amount of wasted work this will create: not only the loss of operations across many industries, but recovery will be absolute hell with BitLocker. How many corporate users have access to their encryption keys? And when stored centrally, how many of the servers have Crowdstrike running and just got stuck in a boot loop now?

I don't envy the next days/weeks for Windows IT admins of the world...


I can't wait to see the CloudFlare traffic report after this. All those computers going down must have affected traffic worldwide. Even from Linux systems as their owners couldn't run jobs from their bricked Windows laptops.


It looks quite normal so far: https://radar.cloudflare.com/traffic

DE-CIX traffic is also often a good indicator during global events, looks normal: https://www.de-cix.net/en/locations/frankfurt/statistics


Interesting! Thanks for that. I guess most servers and consumer endpoints are fine, and those are driving all the traffic.


It is interesting that operating systems exist for server applications at all.

What is the problem they are solving?

What is the difference between what an operating system contains and can do and what you need it to do?

Why would I want to rent a server to run a program that performs a task, and also have the same system performing extra tasks - like intrusion detection, intrusion detection software updates, etc.

I just don't understand why a compiled program that has enough disk and memory would ever be asked to restart for a random fucking reason having nothing to do with the task at hand. It seems like the architecture of server software is not created intelligently.


If the workstations are stuck in a boot loop, how will they be able to push a hotfix out?


What we do is boot the PC into safe mode, then open CMD as admin and issue this command: sc delete csagent. Then reinstall CrowdStrike using the previous version.


Not gonna be awesome for an org that has 20,000 laptops out there at home with users who don't have admin...


safemode PC i mean


After issuing the command, restart the PC and it will proceed.


Get out and push?


elbow grease


It's eye-opening how bad our crucial IT infra is nowadays. Running in-kernel third-party tools (AV) on critical infrastructure on Windows? Central banks? Control towers? Seriously? We should fire everyone involved and start IT from scratch. This level of negligence cannot be fixed.


I am most annoyed by built-in RCE. Who thought that to be a good idea?


All US flights are grounded too. The people I was traveling with can't check into hotels.


EDR software like this is implemented as a kernel driver.

A third-party, closed-source Windows kernel driver that can't be audited. It gathers a massive amount of activity data and sends it back to the central server (where it can be sold), as well as executing arbitrary payloads from the central server.

It becomes a single point of failure for your whole system.

If an attacker gains control of the sysadmin's PC, it's over.

If an attacker gains administrator privilege on an EDR-installed system, they run with the same privilege as the EDR, so they can hide their activities from it. There aren't many EDR products in the world where that can't be done.

I'd like to call it the "full trust security model".


How is it that these major companies aren't rolling out vendor updates to a small number of computers first to make sure that nothing broke, and then rolling out to the entire fleet? That's deployment 101.


There’s already a Wikipedia page on the outage.

https://en.wikipedia.org/wiki/July_2024_global_cyber_outages


It now redirects to a new page that names CrowdStrike specifically.

https://en.wikipedia.org/wiki/2024_CrowdStrike_incident


It seems that an unexplored weirdness here is the prevalence of virtual Windows in the medical world. It seems that this approach has become commonplace for HIPAA reasons (though it's unclear that it makes the world better versus using secure applications to handle HIPAA data). In the case of this CrowdStrike outage, one would think that virtual machines would simplify getting things up and running again, but instead there seems to be just the opposite going on, where lack of hardware access is making it harder to restore them.

Any insight from those affected?


If I were a cloud vendor, I would provide a "CrowdStrike recovery" button which queues the recovery image and restores the system for the entire project. Why didn't Hetzner, Linode, DO, GCP, AWS do something like this? Why leave people to their own devices? Isn't this a basic application of centralization? It feels to me like this should be easier than managing your own data center.


Are people counting this on the Windows TCO?


It's not Microsoft's fault someone installed third-party spyware/malware on their systems.


they have kernel modules for macos and linux too afaik, so i wouldn't be counting those chickens too fast


I would expect this to be a kernel specific bug. I'm on a company laptop with falcon, and we have linux systems using the same, no signs of problems so far.


I was going to buy some put options against CRWD with spare pocket money, but it turns out that the service I have my investment money in is broken right now. I wonder if that's because of CrowdStrike.


This title doesn’t nearly describe the breadth and severity of the problem…


How do so many super-critical things rely on… Windows? I wouldn't trust Windows to run a laptop reliably, but here it is running pretty much everything. I guess that's why they need CrowdStrike.


It's because Windows is a mature, professional-grade operating system developed and maintained by extremely competent people with a lot to lose.


Crowdstrike can run on Linux and some companies mandate it, with similar issues...


Vendors of tools like this drive the cybersecurity industry discourse, so 'defense in depth' often practically sorta means 'add more software that does more things'.

But maybe this kind of thing can actually impart the lesson that loading your OS up with always-on, internet-connected agents that include kernel components in order to instrument every little thing any program does on the system is, uh, kinda risky.

But maybe not. I wonder if we'll just see companies flock to alternative vendors of the exact same type of product.


Anyone have a technical writeup of the actual bug? I'm trying to explain how this could happen to people who think this is related to AI or cyber attacks.

What happened to the QA testing, staggered rollouts, feature flags, etc.? It's really this easy to cause a boot loop?

To me, BSOD indicates kernel level errors, which I assume Crowdstrike would be able to cause because it has root access due to being a security application. And because it's boot-looping, there's not a way to automatically push out updates?


I don't have a technical writeup to offer, but your assessment around the BSOD seems correct enough. Without having an affected machine but knowing how NT loads drivers like this, I'd hazard a guess that the OS likely isn't even getting to the point where smss.exe starts before the kernel bugchecks. This means no userspace, which almost certainly means no hope of remotely remediating the problem.



Crazy- wasn’t Azure having an outage earlier today? Is this related?


I'd guess that it's related. Azure, OneDrive, and Office 365 were having issues this morning Australian time (about 6 hours ago).


If that’s the case I wonder why we’re only now seeing everyone’s systems BSODing. Should be an interesting write up (if they release one)


Doesn't actually look like it. No one has really figured out the solution yet.


By way of a data point for everyone else: I live in Hong Kong and haven't seen any of this level of disruption yet. I was also in Shenzhen, China yesterday, probably the world's highest density of Win95 machines, and everything was fine. At home we have only one old laptop on Win10 that only gets opened when the 8yo gets Windows homework - otherwise it's macOS and Linux on all laptops, desktops and SBCs.

If I see some news I will update this comment.


So, if CrowdStrike licenses didn't say "We're responsible for nothing" and if all affected users sued them, they'd be worth negative 90 trillion dollars or so right now. iow out of business.

I can understand the frustration their customers feel. But how could a software company ever bear liability for all the possible damage they can cause with their software? If they built CrowdStrike to space mission standards nobody could afford it.


I guess the much-blamed European Commission will again have to do their job and bring anti-oligopoly/monopoly regulations, which everyone will hate but which will still sort of work.

Architecting technical systems is MUCH easier than architecting socio-economic systems. I hope one day all those tech-savvy web3 wannabe revolutionaries will start to do the real job of designing socially working systems, not just technically barely-working, cryptographically strong hamster-tapping scams.


I'd hate to be on Microsoft's teams today. They're catching a lot of stray blame for this in the public eye where it's entirely not their fault.


From reddit:

> I'm in Australia. All our banks are down and all supermarkets as well so even if you have cash you can't buy anything.

I hope the national security/defense people are looking at this closely. Because you can bet the bad guys are. What's the saying, civilisation is only ever three days away from collapse or something?

I am pretty convinced this is a fuckup not an attack, but if Iran or someone managed something like this, there would be hell to pay.


You can bet a substantial amount of money on CrowdStrike inadvertently painting a huuuuge hacker target on their back over this...


If you are IT team for a large impactful organization, you have to control updates to your organization's fleet. You cannot let vendors push updates directly. You have to stage those updates and test them and then do a gradual rollout to your whole organization.

Plus, for your critical communication systems, you must have a disaster recovery plan that actually helps you recover quickly in minutes, not hours or days. And you have to exercise this plan regularly.

If you are crowd strike, shame on you for not testing your product better. You failed to meet a very low bar. You just shipped a 100% reproducible widely impactful bug. Your customers must leave you for a more diligent vendor.

And I really hope the leadership teams in every software engineering organization learn a valuable lesson from this – listen to that lone senior engineer in your leadership team who pushes for better craft and operational rigor in your engineering culture; take it seriously - it has real business impact.


We often read about how organizations are so bad because they don't spend enough on security. That slope is particularly slippery.

Crowdstrike is very expensive.


Today's incident shows that the real problem is actually that organisations spend too much (money, but too little time / manpower) on security.

Hey, third-party vendor, I'll give you all the money you want, I'll let you pwn all my systems, I'll be your little bitch, just make me secure, I don't have time for all that security shit, kthxbye.


The whole thing needs to be redesigned, so that antivirus and EDR solutions do not require such high privilege. We need a high-performance way for a possibly privileged service to export all the data that is needed for a decision, and then let the AV/EDR do its thing. If the AV/EDR is broken by an update, fine. At least the system won't go down.
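As a sketch of that split (names and the socket path are invented for illustration): a thin kernel-supplied component only exports an event stream, and the decision logic runs as a normal process, so a bad update crashes the agent rather than the machine:

    import json
    import socket

    SOCKET_PATH = "/run/edr/events.sock"   # hypothetical event stream exported by the kernel side

    def verdict(event: dict) -> str:
        # All detection logic lives in userspace, where a crash is survivable:
        # the worst case is lost telemetry, not an unbootable machine.
        if event.get("type") == "process_start" and "mimikatz" in event.get("image", "").lower():
            return "block"
        return "allow"

    def main() -> None:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(SOCKET_PATH)
            for line in s.makefile("r"):
                s.sendall((verdict(json.loads(line)) + "\n").encode())

    if __name__ == "__main__":
        main()

The trade-off is performance and tamper-resistance, which is exactly why vendors argue they need to sit in the kernel; but days like today are the other side of that ledger.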


And in critical production systems AV/EDR upgrades should be first tested on lower environments.


Absolutely. Discipline can make all the difference.


This story is about to break into the top 15 of upvoted stories on HN, but it already seems safely within the top 10 by number of comments.


My company has some bios bitlocker extension installed which prompts for a password on boot, so automatic updates (one of which tried to install last night) just get stuck there in jet engine mode. Normally this is extremely annoying but today I count myself lucky - aside from a couple of people with Chromebook thin clients I am the only person showing as online in Teams right now.


An update to the internal database. It still has not sunk in for developers that data carries equivalent risk to code. An A400 crashed because of an XML file update. I have witnessed my share of critical bugs caused by "innocent" updates to "data" which were treated less seriously because of this. Management and devs alike should change their thinking about this.


Aren't they doing canary releases? It seems weird that this would not have been detected on a smaller scale first with a good release process.


Buying 10 computers with different configurations is not hard either. They are just lazy


Canary releases aren't a magical bug free fix. They might be doing it, but the conditions that trigger a problem can be sneaky and can happen outside your canary period, rendering it useless. It's a best effort method.


It depends on how global the issue is.

But yeah, I've seen US companies, for example, only doing their initial releases in the USA, which has zero value for issues that might appear with different localization/language settings.


Rock me Amadeus.

At least the central flight booking system is up, I guess. Google bought it years ago and it's a mainframe.

Hence why Google Flights is so tapped in :)


CrowdStrike’s faulty update crashed 8.5 million Windows devices, says Microsoft

https://www.theverge.com/2024/7/20/24202527/crowdstrike-micr...


This should at the very least put them out of business by causing each and every client to abandon them as their security solution.


CrowdStrike should have learned the lesson from the more seasoned players in the industry to slow roll their updates and observe.


Security technology harming security? Shocker. We need less monoculture. Trouble is monoculture pays. Write the software once, deploy it everywhere - free money.

I manage a simple Tier-4 cloud application on Azure, involving both Windows and Linux machines. Crowdstrike, OMI, McAfee and endpoint protection in general has been the biggest thorn in my side.


This is pretty wild. I woke up to a news alert on my phone stating a "global IT outage" took down banks, airlines (who were calling for a global ground stop for all flights), hospitals, emergency services, etc. Expected it to be some sort of Tier 1 Network issue. Nope, a failed update for some third party Windows security app.


I don’t know Windows systems. I’ve read it’s causing Blue Screen of Death.

I take that to mean that systems can’t even boot. Right?

Can this be fixed over the air?


Right now the workaround is doing brain surgery on the system in safe mode, so probably no ota fix.


== kernel panic if that clears it up


Seems CS themselves may have been hacked? For example, seems unlikely that both:

1. CS normally pushes global updates to entire user base simultaneously?

2. This made it through their testing. Not only 'just' QA but likely CS employees internally run a version or two ahead of their customer base?

Just speculation - folks who know either answer can validate or debunk.


(they confirmed there was no hack) I think you have too much faith in minimal software development practices being applied at companies.


Isn't a Windows BSOD the equivalent of a kernel panic? I don't understand how this is CrowdStrike's fault. Vanilla userspace operations shouldn't cause a kernel panic--that's a bug in the OS, not a bug in some user software. If anything, we should be blaming Windows here?


> Vanilla userspace operations shouldn't cause a kernel panic...

The component Crowdstrike says you need to remove to restore functionality is a ".sys" file. That's a kernel-mode driver. The fault is happening on the kernel side.


CrowdStrike is not a vanilla userspace program, it hooks deeply into the operating system.


HP laptops are booting and then looping.

Dell laptops are being observed after the blue screen ("blue dump"); the server is up and running fine.

The temporary workaround leads to compliance issues.

From India


Does anyone know how to proceed if I do not have administrator level access to the computer?

I do not have access to the c:\windows\system32\drivers\crowdstrike folder to delete the corrupted .sys file.

I was able to boot into recovery mode with networking; after waiting 30 min, I rebooted and the BSOD persisted.

Are there other alternatives on how to recover?


Create a bootable USB stick from Ubuntu, then see if you can mount the windows drive and delete the file that way.


This just in ‘CrowdStrike Strikes Crowd’


I have been told 'not to worry' because it isn't a cyber attack. Yet the outcomes we are seeing feel a lot like the doomsday predictions of what a cyberattack would do. It is almost as if we are experiencing the cybersecurity/warfare equivalent of 'friendly fire'.


Microsoft to give Vista kernel access to security firms (2006)

https://arstechnica.com/information-technology/2006/10/7998/


This has all the hallmarks of an SSCA (Software Supply Chain Attack).

Either that, or CrowdStrike is testing critical software that meddles in ring zero so poorly, causing crashes and boot loops out in the wild on 100% of deployments, that they need to get sued out of existence.

I hope for their sake it's the former.


I wonder what the rollout procedure is for CrowdStrike. I put $100 down that this was a minor update they decided was so minimal it didn't need extensive testing.

So many places use the "emergency break glass rollout procedure" on every deploy because it doesn't require all the hassle


Considering what CrowdStrike's software does, I'd say the majority of their updates could be quite easily argued as being "emergency" updates, so yeah, quite possibly they've gotten into the habit of "omg URGENT must break glass" way too often.


Maybe the world can finally reconsider their use of software products that cater to security theater. And the politics in companies which lead to things like this being introduced ("nobody gets fired for buying IBM").

Edit: took out a bit of snark.


I don't know how CS is considered snake oil. Or what IBM has to offer. CS and S1 are really just the best out there.


"Endpoint protection" is just the new, hip term for antivirus/intrusion prevention/incident logging of the past. Why not provide immutable Linux based machines (like Chromebooks, Fedora Silverblue) which are locked down outside of the browser? I am aware that this isn't possible in some areas of the industry that rely on large amounts of Windows-only desktop software, but in many cases it may be worth a thought.

If I am being naive here, happy to hear other opinions. But I hate opening my company Windows laptop and having the fans turn to 11 just because some "security" software is parsing random files for malicious signatures or running an update that BSOD loops.


The IBM quip is an old saying: pick the dominant vendor, because even if it's a mistake no one will blame you for it.


well, apparently they aren't


Make a live CD Linux image that mounts the NTFS drives, locates the Windows directories from the bootloader, and deletes the file.

Also, you can mount BitLocker partitions from Linux iirc. If it encounters a BitLocker partition, have it read a text file of possible keys off the USB drive.
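Once the volume is mounted (directly for plain NTFS, or via something like dislocker for BitLocker), the cleanup itself is a few lines. A sketch, assuming the Windows volume ends up at /mnt/windows:

    from pathlib import Path

    MOUNT = Path("/mnt/windows")   # wherever the live environment mounted the Windows volume
    DRIVERS = MOUNT / "Windows/System32/drivers/CrowdStrike"

    for f in DRIVERS.glob("C-00000291*.sys"):   # the channel file CrowdStrike's advisory says to remove
        print("removing", f)
        f.unlink()

The hard part is the BitLocker key handling and physically touching every box, not the deletion.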


CrowdStrike has managed to invade Starbucks' IT. All of the online order-taking systems are down.


What's the actual magnitude of this outage? Is there a way to estimate how many machines were down?


We routinely implement phased / canary deployments in server-side systems to prevent faults from rolling out globally. How is it possible that CrowdStrike and/or Windows does not have a similar system built in for large, institutional customers? This is outrageous.


I take it patching remote machines is going to be difficult or impossible?

I haven't used windows in years, but from what I read you need to be in safe mode to delete a crowdstrike file in a system directory, but you need some 48 char key to get into safe mode now if it is locked down?


I don’t really understand why AV updates aren’t tested before being pushed out to critical systems and I don’t understand why every system would run the same AV.

But also I don’t understand why this corporate garbageware is still a thing in 2024 when it adds so little value.


While initially everyone blamed Microsoft and then quickly pointed the finger at CrowdStrike, I'd like to call out Microsoft especially their Azure division for making the recovery process unnecessarily difficult.

1) A key recovery step requires a snapshot to be taken of the disk. The Portal GUI is basically locking up, so scripting is the only way to do this for thousands of VMs. This command is undocumented and has random combinations of strings as inputs that should be enums. Tab-complete doesn't work! See: https://learn.microsoft.com/en-us/powershell/module/az.compu...

E.g.: What are the accepted values for the -CreateOption parameter? Who knows! Good luck using this in a hurry. No stress, just apply it to a production database server at 1 am.

2) There has been a long-standing bug where VMs can't have their OS disk swapped out unless the replacement disk matches its properties exactly. For comparison, VMware vSphere has no such restrictions.

3) It's basically impossible to get to the recovery consoles of VMs, especially VMs stuck in reboot loops. The serial console output is buggy, often filled with gibberish, and doesn't scroll back far enough to be useful. Boot diagnostics is an optional feature for "reasons". Etc..

4) It's absurdly difficult to get a flat list of all "down" VMs across many subscriptions or resource groups. Again, compare with VMware vSphere where this is trivial. Instead of a simple portal dashboard / view, you have to write this monstrous Resource Graph query:

    Resources
    | where type =~ 'microsoft.compute/virtualmachines'
    | project subscriptionId, resourceGroup, Id = tolower(id), PowerState = tostring( properties.extended.instanceView.powerState.code)
    | join kind=leftouter (
      HealthResources
      | where type =~ 'microsoft.resourcehealth/availabilitystatuses'
      | where tostring(properties.targetResourceType) =~ 'microsoft.compute/virtualmachines'
      | project targetResourceId = tolower(tostring(properties.targetResourceId)), AvailabilityState = tostring(properties.availabilityState))
      on $left.Id == $right.targetResourceId
    | project-away targetResourceId
    | where PowerState != 'PowerState/deallocated'
    | where AvailabilityState != 'Available'


I wonder what Crowdstrike's opsec is like re: malicious actors gaining control of their automated update servers. This incident certainly highlights the power of that type of attack, even if this one just ends up being typical human incompetence-based.


What do card houses do for a living?


Texas, where software goes to die. Or maybe that is where killer software is developed?


Crazy, isn't it? I had no issues because my group policy updates have been off since last year. Guess the "everyone must forcefully update" for "security reasons" ended up backfiring; who could've thought?


Why doesn't CrowdStrike follow standard deployment strategies such as canary or rolling? A gradual update would have uncovered this bug before it reached critical mass. Doing an all-at-once update is unacceptable for critical systems.


I haven't heard anyone ask this, but would this have happened on Linux? Obviously not many people run antivirus software there, but could something similar have caused this?

Are there any protections to prevent repeating reboots?


“To err is human, but to really fuck things up requires a computer.” ~ Len Beattie


The IT security chief at my co (paraphrasing):

>talked to pres of Crowdstrike. His forthrightness was refreshing. He said “We got it wrong.”

>They are working with Microsoft to understand why this happened.

Pretty much the message minus even more boilerplate talk.


No rolling updates? How could a 100% repro BSOD pass QC? I'm more concerned about the deployment process than the crash itself. Everyone experiences a bad build from time to time. How did this possibly go live?


Go to the advanced repair options, then Advanced, then open cmd. Go to windows/system32/drivers/crowdstrike. Then list all the files and delete the one whose name ends in 291, using cmd: "del filenameendingwith291"


How do they test this before they roll it out? Looks like a bug that's easy to spot. I would presume they test it on several configurations and, when it passes the test (a reboot), they roll it out. Was this tested at all?


> We have collaborated with Intel to remediate affected hosts remotely using Intel vPro and with Active Management Technology.

This worries me. Does this mean Intel has the ability to remotely access my machine?!?!


What is with this paranoia on HN?

A 5-second Google search says that vPro is the remote management platform offered by Intel for business: basically a fleet management tool, which is typical for corporate fleets.


Yah. But the way it was worded is that a 3rd party company who goofed was able to go to another company and ask for access to it.

I assume that some sort of configuration and opt-in step would need to have happened.

If this had been done prior to the event it seems like this would have been the first thing to do, not a Hail Mary after being down.


Was watching TV this morning in France (TF1, 8:00 CET), the weather forecast map system was out. The journalist just gave us the information as if he was on the radio, telling he was sorry for the system to be failing.


They sponsor the Mercedes F1 team https://crowdstrikeracing.com/f1/about-partnership/ , who have a race this weekend and practice sessions today. It'd be funny if their cars can't go on track because their computers are down...


They did go down! https://www.reddit.com/r/formula1/comments/1e71dtn/mercedes_...

But someone probably fixed it, and the cars were able to go out on track for the first practice session.


So, why did our little company's (little used) two Windows machines not BSOD overnight? They were just sitting idle. They run CS Falcon sensor. Did the update force a restart? Didn't seem to happen here.


It looks like a configuration file update is the culprit. The software presumably picks up the update, then BSODs.


Microsoft to give Vista kernel access to security firms (2006) | Hacker News

https://news.ycombinator.com/item?id=41014426


Why would they roll out this update globally to all users immediately? Isn’t it normal to do gradual rollouts? Or did this update contain some critical security fix they wanted everyone to have as fast as possible?


People at my workplace were affected but I dodged the bullet because I left my computer turned on overnight because I always want to be able to RDP in the next morning in case I decide to stay home.


Oh, the foresight of the 1st episode of Connections. https://www.youtube.com/watch?v=XetplHcM7aQ


Wow, CrowdStrike made it to #6 of all time HN threads by now.. https://hn.algolia.com/?q=


I hope the narrative of "install CrowdStrike and pass the audit, or else" changes after this.

But having been in the industry for so long, I don't expect any changes whatsoever; it's either CS or some other tool.


Can we end the whole “loading a kernel rootkit” thing? AFAIK Apple already shuns kernel extensions. What’s preventing Microsoft from doing the same? As a bonus, shit like anti-cheat will go away too.


it is humbling (and lowkey reassuring?) to know that not all large players use the absolute cutting edge approaches in their workflow.

it seems and i hope that after all is said and done there is no major life-threatening consequence of this debacle. at the same time, heart goes out to the dev who pushed the troubling code. very easy to point at them or the team's processes, but we need to introspect at our own setup and also recognize that not all of us work in crucial systems like this.


Does crowdstrike work similarly on MacOS? I have to imagine the "walled garden" doesn't allow for 3rd parties to insert themselves into the OS kernel but I could be wrong.


I mean, you can always disable SIP and install your own extensions.


I'd bet my career CS isn't spending enough on QA. It's always the first thing to be cut, no one cares about QA when everything is going well, but when things go wrong...


I just wanted to mention that Microsoft has 3 tiers of Windows beta releases before changes are pushed to production. I can't comprehend how this wasn't noticed before.


It didn't come from Microsoft or Windows Update. It was pushed by Crowdstrike to their corporate security kernel extension.


3rd party got unsupervised access to kernel? I'd say it's even worse then


Probably a stupid question, but how can the Windows kernel recover so well after a graphics driver crash and at the same time be unable to do the same for other kinds of drivers?


I am quite sure that they have had three precious timezone hours to detect a total failure of telemetry after their fateful midnight upgrade.

Like the most useful Canary Island in the Coal Mine.


what would be funny is if crowdstrike demanded ransom from their customers.

security is a great business - you play on people's fears, and your product does not have to deliver the goods.

like the lock maker: you sell a lock, the thief breaks it, but it is not your problem, and you sell a bigger, badder lock the next year which promptly gets broken.

as a business, you don't have any consequences for how your product works or doesn't work. what a great business to be in !!


The postmortem should be interesting; I can't imagine how even just basic integration testing didn't catch this. Much less basic best practice like canarying.


There's already somebody trying to cash in on this problem:

https://fix-crowdstrike-apocalypse.com


Yep happened to us too. Its global. And it just started happening.


Same in Adelaide. Reports coming in from Gov agencies, utilities etc.


So - what is the lesson learned? The only clear message for me is that critical programs that also demand kernel level access maybe shouldn't update themselves.


Can we find an uptime(availability) graph for the CrowdStrike agent? Don't you think this graph should be included in the postmortem?


Next up: CrowdStrike Sued for Alleged Negligence.

I hate lawyers, but this is the reason why companies outsource. Why take the blame (and spend the money) when you can blame the vendor?


I’m guessing it’s completely coincidental that the CEO of CrowdStrike was critical of China earlier this year, and that China is somehow unaffected by this ‘global’ issue!


I mean… yeah? almost everyone in the west except for some highly corrupt people is critical of china, and china doesn’t deploy crowdstrike.


Rolling out updates in an A/B test slowly is the only way to reduce the occurrence of such issues _significantly_. There's no other way, literally, nothing.


I can't wait for rachelbythebay's comments on this.


Crowdstrike seems like the kind of thing that's sold to CEOs at conferences, forced on IT against objections, and the subject of a lot of discussion at Defcon.


I feel we are at a point in the evolution of our digital society where relying on general purpose OS-s is just not an option when moving forward.


Thank Chronos I switched to Qubes OS almost two years ago!


The front page of https://www.accel.com/

"Fail Fast. Evolve Faster"


The workaround suggests removing a file with .sys extension. What does the file do normally? If removed, what happens to the state of security on that system?


.sys files on Windows are typically drivers and driver-related files


Germany is not affected since it's Krautstrike only.


All the crazy people banging the drum for war with Russia and/or China.

Imagine what our IT systems would look like with someone _intentionally_ messing with them.


On the plus side this will help us develop an immune system against cyber attacks in any future war. Businesses will start thinking of contingencies.


Adding a comment to make this the most commented piece on hackernews and hence highlight the bad impact a bug can make on lives founded on IT.


Welp this fucked my night. A toast to the rest of you who are waaaaaay more screwed than me


How can an antivirus update affect Azure's servers?


Perversely, this may make many companies no longer invest in this type of cyber security software. Which may lead to a whole host of other problems...


This is the Irish potato famine (essentially due to the farming of a single species of potato) equivalent in IT infrastructure: a single vendor.


The correct solution is to have IT force push updates only when they deem fit (after they have tested internally on some ghost machines).


All flights grounded. World wide airline systems outage.


Just boarded a flight in Iceland, no mention of being grounded.



“ American Airlines, United and Delta have asked the FAA for global ground stop on all flights, according to an alert from the FAA on Friday morning.

The FAA is telling air traffic controllers to tell airborne pilots that airlines are currently experiencing communication issues.”


That's what you get for letting a company install a root kit on your servers and desktops ;-)

I mean, don't they do canary updates on CrowdStrike too? Every Windows admin has done this for the last 5+ years, test Windows updates on a small number of systems to see if they are stable. Why not do the same for 3rd party software?


For $150/hour, I will spend today consulting for businesses who need someone in the St Louis area to go reboot a remote workers’ machine.


CRWD dropped $50/share at market open. Wild.

Is this specific to only Windows machines “protected” with CS or is this impacting Linux/macOS as well?


I know I have the benefit of hindsight in this regard, but how are there no redundant checks and tests that would prevent a mishap of this magnitude?

I mean, there should be extensive automated testing using many different platforms and hardware combinations as a prerequisite for any rollout.

I guess this is what we get when everything is opaque, not only the product and the code, but also the processes involved in maintaining and evolving the solution. They would think twice about not investing heavily in testing their deployment pipelines if everyone could inspect their processes.

It might also be the case that they indeed have a thorough production and testing process deployed to support the maintenance of crowdstrike solutions, but we are only left to wonder and to trust whatever their PR will eventually throw at us, since they are a closed company.


The first time I experienced crowdstrike in a corporate environment it seemed obvious that something like this would eventually happen.


This is what AI's first strike will look like


That piece of "AV software" slowed down my brand new corporate i7 Lenovo to shit so I switched to M2 Pro. Best decision ever.


Until corporate decides to install new MDM software on the usually-blazingly-fast apple silicon chips :(

I can't even open files larger than 500 lines without my whole system slowing to a crawl because of the insanely aggressive and slow "antivirus" bloatware the MDM forces on me.


Their Windows sensor has made development almost unworkable. Not sure why but I haven't noticed the OSX sensor slow things down appreciably. I suspect my Windows profile is configured to be more aggressive?


How do you patch software causing a BSOD?

It seems like a chicken and egg problem.

I ran a team that developed a remote agent, and this was my nightmare scenario.


If I was North Korea, I would say that was me. It would, however, be a crazy story if Russia or China had had anything to do with it.


In pre-market, CRWD is 14% down. I think investors are a bit scared that THIS time there is going to be some consequences.


I'm amazed it's just 14%, not more like 75%-80%. Surely a lot of customers are going to uninstall and move to competitors. The remainders are at least going to demand much cheaper service with better guarantees going forward.


Yeah, and now recovered to -9.39%. Let's see what happens. I guess CrowdStrike is backed by enough powerful people to NOT lose too much business.


This is how I would start a war… surreal.

I hope it’s just a bug.


So what are going to be the consequences of this? In my country some healthcare institutions and emergency systems are working.


I'm actually very fond of "fail fast" and "no blame" culture, but someone needs to get fired for this!


There's a workaround: reboot 10-15 times. I've seen two people say it independently, so maybe it's for real.


Random strangers running unknown, untrusted code on your computers is the worst. It's a good thing we patched that security flaw by letting the _right_ random strangers run unknown, untrusted code on our computers.

As something of a friendly reminder, it was Microsoft this time, but it's a matter of "when" not "if" till every other OS with that flavor of security theatre is similarly afflicted (and it happens much more frequently when you consider the normal consequences of a company owning the device you paid for -- kicked out of email forever, ads intruding into basic system functions, paid-in-full device eventually requires a subscription, ...). Be cautious with automatic updates.


Were cloud providers (AWS and azure) so heavily impacted because they use CS internally or because so many users use CS?


Uninstall the Current Version:

Open the Command Prompt as an administrator. Run the following command to uninstall the current version:

    sc delete csagent


The most concerning thing about this is the realization of just how many incredibly critical systems run on Windows.



In my org, none of the essential systems went down (those used by labor). However all of management's individual PCs went down which got me wondering... Is this the beginning (or continuation) of whittling down what is "essential" human labor versus what could be done remotely (or eliminated completely)?

Or perhaps Microsoft is just garbage and soon will be as irrelevant as commercial real estate office parks and mega-call centers


Do all the machines need to be manually fixed? It doesn't seem like an automatic update will work here...


This is, of course, why they should be doing phased rollouts. 1% of their customers, then 10%, then all the rest.


Or just test your code?


Both. No testing is going to catch everything. And if it only hits 1 computer in 100 then your local testing will probably miss it


I absolutely abhor these end point solutions that "auto update for your convenience and safety."

I can control and manage my own systems. I do not need nanny state auto updating for me.

Crowdstrike should be held liable for financial losses associated with this nonsense.


the funny thing is it's often labeled a "Microsoft IT outage" - theguardian.com, for example


I love how their company name foreshadows this exact event. It’s malware pretending to be a security suite.


Plot twist: The * in C-00000291*.sys is "-block-ultron"

Premature deployment of Crowdstrike AGI disaster response plan.


I want to see the internal postmortem of why this happened to CrowdStrike (if they are still in business)


if I'm reading this correctly the short interest for the stock doubled over June? :)

https://www.nasdaq.com/market-activity/stocks/crwd/short-int...


Don't see how anyone is getting out of this without applying the workaround or reimaging their whole fleet.


Mission critical systems should be running something like ChromeOS.

Too bad ChromeOS seems be on the way out at Google.


This is the first time I'm hearing about crowdstrike, what is it and why is this such a big deal?


Seems like a modern operating system would have an automatic rollback mechanism for cases like this.


Windows has restore points that do this in the event of a failed update, but this wasn't Windows.


Do people not have test environments?


Why would you name your company "CrowdStrike" anyway? What does Crowd Strike even mean?


I'm confused, is this an issue with Windows or with Crowdstrike software installed on Windows?


With a crowdstrike kernel driver, so technically not a microsoft/windows issue.


Someone at CrowdStrike got fired for this. I'm curious to know who this person is.


Question : was this update delivered by Crowdstrike’s update agent or Windows Update ?


My read is that Crowdstrike's update agent downloaded new security threat definitions and those definitions exposed a bug in the existing Crowdstrike drivers, causing the disaster.


It’s that easy. A hacker that controls the update terminal at crowdstrike controls the world?


Ironic that the software intended to prevent exactly these kinds of outages ends up causing it.


Ok... Would a Linux based infrastructure be more resilient.

Does Linux require Crowdstrike style AV software?


technically no os requires such software, it's people who make a decision to run it

Someone in comments described crowd strike bringing down their fleet of Linux servers in April


A system restore helps. But obviously not when you’ve got an environment of ~500 or more clients


This outage may be more expensive and cause more damage than any cyberattack in history.


I heard all Windows PCs at the University of New South Wales were also boot-looping.


I wonder how unknown (yet) malware will be wiped out or enabled before this is over.


It seems monocultures are not only bad for resilience in agriculture, but also in IT.


I guess Microsoft can now offer something similar to a CrowdStrike solution for Azure users.


CrowdStrike today has shown why it's absolutely crucial to test code before deployment, say no to YOLO deployments with LLM powered software testing https://github.com/codeintegrity-ai/mutahunter


So where can I buy an ETF of companies specializing in software Quality Assurance?


Nobody does that anymore. :)


Is there a way to estimate number of affected devices? 10 million? 100 million?


Can someone with experience explain how integration tests did not detect that?


Why are you assuming there were tests?


Right.

I just can't imagine how it passed tests for a common configuration that is exhibited by a large number of Windows machines. Stuff can always go wrong, but "the OS does not boot" should be caught?


Is this just a massive mistake or is it deliberate and cover for something else


Good question. Weaponized/compromised update? Disgruntled employee Logic bomb?


Faced the same issue a few minutes back; after a few reboot loops my system is up.


We had a few machines come out of the boot loop - only to re-enter it 20 mins later. I am sure CS pulled the patch from their CDNs but ...maybe some cached versions still linger?


Don’t they have canary deployments? Do such huge updates happen all at once?


this is really microsoft's fault for handing out kernel access to random 3rd parties, none of which are doing anything special that microsoft couldn't implement themselves (AV, anti-cheat, security)


Yes, Microsoft should just grant itself a monopoly on all those markets "for your own safety" and see what happens with their lawyers.


Or do what Apple does: disallow kernel extensions, and provide rigid kernel facilities for VPN clients, EDR agents, etc. to use, so they don't have to implement custom code resident in the kernel.


Apple can disallow kernel extensions because it fully controls the entire hardware and software stack. Everything that would need to be an extension is already in the kernel and Apple knows all of those things.


Why aren't we upvoting a list of alternatives to CrowdStrike here?


Somewhere out there, there is an engineer with the biggest "I told you so" shit eating grin scrolling through every social media site and basking in the glory.


I believe that today they struck the entire crowd… (or should that be cloud)


What a fun time to be less than 48 hours out from a transcontinental flight


Oh wow, this is #5 for all time already, beating out Steve Jobs.


Why are "security" patches not tested before they are deployed?



4 hour delay at the airport in Los Cabos. At least they have tacos!


This never would have happened if all these orgs used a blockchain.

/sarcasm

/but is it really?


video summary from fireship: https://www.youtube.com/watch?v=4yDm6xNeYas


I wonder what happens to the engineer who deployed this patch.



Please at least an archive link. That website is trash.


Looks like it affected the Crowdstrike stock, but not Microsoft.


Was involved in a "security mandated" mandatory rollout of Crowdstrike at my prior company.

This software was utter shit, and broke stuff all over the place. And it installs itself as basically malware into critical paths everywhere. We objected to ever using it as a SPOF, but were overruled.

So yeah, not remotely surprised this happened.

Any kind of middleware/dynamic agent is highly suspect in my experience and to be avoided.


(FORCE) Pusheedd to Prod on FRIDAYY -- Burneeeddd by its Sins


I could not imagine so many critical systems run on Windows.


My team and I have begun to refer to this issue as CrowdStroke


I’m trying to refresh to get latest update… let’s keep posting


Anyone found any fixes, while Crowdstrike comes up with a fix?


Do this at your own risk!! Apparently there are two NON-OFFICIAL solutions:

1. Rename csagent.sys (the file causing the BSOD)

2. Rename c:\windows\System32\Drivers\Crowdstrike

Again, do this at your own risk. Both workarounds have been reported as "working". I am a Linux user so I cannot tell.


We found a workaround: we start in recovery mode and rename the path C:\windows\system32\drivers\Crowdstrike


"F8 + last known good configuration" worked for us


Meanwhile the linux desktop just keeps on truckin'.


This is what happens when you treat IT as a cost center.


Fix:

1. Boot Windows into Safe Mode or the Windows Recovery Environment

2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory

3. Locate the file matching “C-00000291*.sys”, and delete it

4. Boot the host normally

Did it work for you?


How can a billion-dollar company push an update before testing?


dumb techbro c-suites: what, why would you have an issue with a proprietary closed source app that frequently self updates and sends tons of data to a third party while essentially being a backdoor? We said we wanted security and this has Security(tm) all over the literature! Look we even have dashboards for the gui-ninjas like the security team!


It's backdoor as a feature. It even has a cute name - "CrowdStrike Real Time Response".


I love the name! Really tells you what's going on ^^


Do we have any estimates how many machines are affected?


Why don't they use Windows' own antivirus?


That's called [Microsoft Defender for Endpoint](https://learn.microsoft.com/en-us/defender-endpoint/), which is used even on Linux servers in big corporations. (Largely because it's the easiest way to complete box ticking exercises with Windows servers: once you have it, it's easy to decide to extend it to non-Windows machines as well.)

The binary self-upgrades and runs in highly privileged mode, so it might not be immune from the kind of failure CrowdStrike had here. Though apparently there's at least a way to use a local mirror so you have some control on the updates: https://learn.microsoft.com/en-us/defender-endpoint/linux-su...


Do you suppose they test before pushing updates out?


Who is responsible for this billion dollar mistake?


Google spending a boatload for Wiz looks smarter now


i work liquor distribution in the united states and our entire company is out across 44 states, allegedly due to this “crowd strike outage”


I’m trying to refresh to get the latest update …


I know there's a better word to be used here, but what initially looked like a massive cyberattack turning out to be a massive defender foot-broom is chef's kiss.

I saw it was Windows and went to bed. What a great feeling.

I'm sorry to those of you dealing with this. I've had to wipe 1200 computers over a weekend in a past life when a virus got in.

Did I receive any appreciation? Nope. I was literally sleeping under cubicle desks bringing up isolated rows one by one. I switched everything in that call center to Linux after that. Ironically, it turns out it was a senior engineer's SSH key that got leaked somehow and was used to get in and dig around servers in our datacenter outside of my network. My filesystem logging (in Windows, coincidentally) alerted me.

IT is fun.


Hi guys, what is the KB code of this update?


Sounds like a good time to buy Red Hat stock


Sounds like a time to buy SPX (always has been).

Joke being I have given up trying to time markets :-). With some rare exceptions to the rule (once in 10 year type things).


Has any thread had 3000 comments before?


No worries for us. https://pinggy.io/ is working like a charm :)


From the guidelines:

> Don't solicit upvotes, comments, or submissions. Users should vote and comment when they run across something they personally find interesting—not for promotion.


Windows breaking computers since 1985.


They should be sued into bankruptcy


Lots of issues in Spain and Germany.


Have we failed as an industry?


maybe this was just an enormous distraction while Spetssvyaz did a bunch of fun stuff


Never heard of a mainframe going down


From the BBC's cyber correspondent Joe Tidy [1]:

> A "content update" is how it was described. So, it wasn’t a major refresh of the cyber security software. It could have been something as innocuous as the changing of a font or logo on the software design.

He can't be serious, right? Right?

[1] https://www.bbc.co.uk/news/live/cnk4jdwp49et?post=asset%3Abd...


2024 years after 2k we have 2k.


Harvey Norman’s system is down


I think we have reached an inflection point. I mean, we have to make an inflection point out of this.-

This outage represents more than just a temporary disruption in service; it's a black swan célèbre of the perilous state of our current technological landscape. This incident must be seen as an inflection point, a moment where we collectively decide to no longer tolerate the erosion of craftsmanship, excellence, and accountability that I feel we've been seeing all over the place. All over critical places.-

Who are we to make this demand? Most likely technologists, managers, specialists, and concerned citizens with the expertise and insight to recognize the dangers inherent in our increasingly careless approach to ... many things, but, particularly technology. Who is to uphold the standards that ensure the safety, reliability, and integrity of the systems that underpin modern life? Government?

Historically, the call for accountability and excellence is not new. From Socrates to the industrial revolutions, humanity has periodically grappled with the balance between progress and prudence. People have seen - and complained about - life going to hell, downhill, fast, in a hand basket without brakes since at least Socrates.-

Yet, today’s technological failures have unprecedented potential for harm. The CrowdStrike outage killed, halted businesses, and posed serious risks to safety—consequences that were almost unthinkable in previous eras. This isn't merely a technical failure; it’s a societal one, revealing a disregard for foundational principles of quality and responsibility. Craftsmanship. Care and pride in one's work.-

Part of the problem lies in the systemic undervaluation of excellence. In pursuit of speed and profit uber alles. Many companies have forsaken rigorous testing, comprehensive risk assessments, and robust security measures. The very basics of engineering discipline—redundancy, fault tolerance, and continuous improvement—are being sacrificed. This negligence is not just unprofessional; it’s dangerous. As this outage has shown, the repercussions are not confined to the digital realm but spill over into the physical world, affecting real lives. As it always has. But never before have the actions of so few "perennial interns" affected so many.-

This is a clarion call for all of us with the knowledge and passion to stand up and insist on change. Holding companies accountable, beginning with those directly responsible for the most recent failures.-

Yet, it must go beyond punitive measures. We need a cultural shift that re-emphasizes the value of craftsmanship in technology. Educational institutions, professional organizations, and regulatory bodies must collaborate to instill and enforce higher standards. Otherwise, lacking that, we must enforce them ourselves. Even if we only reach ourselves in that commitment.-

Perhaps we need more interdisciplinary dialogue. Technological excellence does not exist in a vacuum. It requires input from ethical philosophers, sociologists, legal experts. Anybody willing and able to think these things through.-

The ramifications of neglecting these responsibilities are clear and severe. The fallout from technological failures can be catastrophic, extending well beyond financial losses to endanger lives and societal stability. We must therefore approach our work with the gravity it deserves, understanding that excellence is not an optional extra but an essential quality sine qua non in certain fields.-

We really need to make this an actual turning point, and not just another Wikipedia page.-


Speaking of security. I got an email yesterday that I need a different system now to log into my social security account. This one:

https://www.id.me/government

It is for social security, taxes, unemployment benefits, whatever. And running under a foreign TLD, .ME for Montenegro. I am not a security specialist. But I think this is asking for trouble.

By the way, do you remember when fuck.yu became fuck.me ?


Everything went down, and there was no backup plan.


I want to add something to the discussion but it's difficult for me to accurately summarize and cite things. In a nutshell, there appears to be a lot of tomfoolery with CrowdStrike and the stuff that happened with the DNC during the 2016 election. Here's some of what I'm talking about:

There's a strong link between the DNC, Hillary, and CrowdStrike. Here's one piece that links a cofounder of CrowdStrike with Hillary pretty far back: https://www.technologyreview.com/innovator/dmitri-alperovitc...

This 2017 piece talks about doubt behind CrowdStrike's analysis of the DNC hack being the result of Russian actors. One of the groups disputing CrowdStrike's analysis was Ukraine's military. https://www.voanews.com/a/crowdstrike-comey-russia-hack-dnc-...

This detailed analysis of CrowdStrike's explanation of the DNC hack goes so far as to say "this sounded made up" https://threatconnect.com/resource/webinar-guccifer-2-0-the-...

The Threat Connect analysis is also discussed here: https://thehill.com/business-a-lobbying/295670-prewritten-gu...

"For one, the vulnerability he claims to have used to hack the NGP VAN ... was not introduced into the code until an update more than three months after Guccifer claims to have entered the DNC system."

Noted at the end of this story they mention that CrowdStrike installed its software on all of the DNC's systems: https://www.ft.com/content/5eeff6fc-3253-11e6-bda0-04585c31b...

Finally, there's this famous but largely forgotten story of the time Bernie's campaign was accused to accessing Hillary's data: https://www.npr.org/2015/12/18/460273748/bernie-sanders-camp...

"This was a very egregious breach and our data was stolen," Mook said. "We need to be sure that the Sanders campaign no longer has access to our data."

"This bug was a brief, isolated issue, and we are not aware of any previous reports of such data being inappropriately available," the company said in a blog post on its website.

(edited for spelling)


By chance, I watched a few episodes of 911 and kept thinking that it was all completely unrealistic nonsense. Then there's an episode where the entire emergency call system for LA goes down, and even though there were different reasons in the episode (a transformer fire), I couldn't have imagined that it was actually possible to completely disable the emergency call system (and what else) of a city.


all Delta flights grounded


It should be obvious to everyone now that kernel extensions for ‘security’ is not worth it


heavy clouds this morning.

Maybe time to reconsider how solid a ground clouds are.


Would this issue not affect bare metal as well?


The great clownstrike.


Here’s my take as a security software dev for 15 years.

We put too much code in kernel simply because it’s considered more elite than other software. It’s just dumb.

Also - if a driver is causing a crash MSFT should boot from the last known-good driver set so the install can be backed out later. Reboot loops are still the standard failure mode in driver development…


Not possible in this situation, the "driver" is fine, it's a file the driver loads during startup that is bad, causing the otherwise "good" driver to crash.

Going back to an earlier version—since the driver is "good"—would just re-load the same driver, which would load the updated file and then crash again.


A driver that crashes with bad input is not “fine.” Bad design, bad configuration loading and crap input validation. Did they even fuzz the code?

We’d spend 20x development time on kernel code because BSOD is never an option.

I get that this was a bad release - but IMHO it’s incredible that they pushed this out to a billion devices before the red flags went up.


The machine stops.


How come anyone still uses CrowdStrike?


Anyone have any good news? Or is it still the BSOD loop?


use windows recovery to boot into safe mode and update latest crowdstrike. That is the only option at this time.


Hopefully now people might wake up to the idea that these tech monopolies are not leading to safe, secure and reliable systems. They will wonder how a third party component could cause such breakage. I expect many will be calling for regulation.


I used to laugh at Dijkstra's idea that all code should be mathematically proven correct. I thought of it as a laughable idea from yet another out-of-touch mathematician.

I suppose true genius is seldom understood within someone's lifetime.


> I used to laugh at Dijkstra's idea that all code should be mathematically proven correct.

That would not help for this outage.


If I weren't an atheist I would say this is god's punishment for installing malware on your employees' machines, on one hand, and for being a spineless patsy for management by letting them install that crap on your work machine.


SARRRRRRRSSSSS!


I mean... installing what is essentially a 3rd party enterprise rootkit that not only has root access to all files and network activity but also a self-update mechanism ... who could have seen this coming?


Year of Linux


Why are people still using Windows?


Half of the world's computers are down. The biggest tech failure of our time. Airports. Banks. NYSE. 298 of the fortune 500 companies. RIP.


> The biggest tech failure of our time.

Is it? Is this even the first time this year that all the airlines are down? How many times a day is a hospital a victim of ransomware?


Am I supposed to use some AI bot to summarize all this shit? Ain't no one got time to read 3000+ comments. Any good links?


My Commodore 64 never gave me a blue screen of death and my Atari ST never lapsed into a hardcore boot loop.

just sayin'


Finally the crowd has struck!

ba-dum ching!


Yeah, these events will be fun once new product liability directive (that includes sw) comes into force.


Good luck everyone. I just spent all night fixing my shit and we caught it early


fun times...


Sorry to be dense, but what is CrowdStrike and do I have it on my computer?


Now things are serious: I can't place a mobile order at Starbucks.


Someone accidentally shut down the planet with a code push -- rofl


The world just became a slightly better place.


Does Russia have something to do with this?


Lol, we were using Symantec software, so thankfully no effect.


Crowdstruck


Wow



I'm curious about investing and economy, and I always wonder about P/E ratios like Crowdstrike's (currently 450-something, was over 500 last week).

Some P/E ratios for today, for some companies I find interesting:

- Shopify: 615.12

- Crowdstrike: 455.70

- Datadog: 341.98

- Palantir: 212.34

- Pinterest: 187.67

- Uber: 99.0

- Broadcom: 77.68

- Tesla: 58.33

- Autodesk: 52.36

- Adobe: 49.23

- Microsoft: 37.97

What's going on here? Do investors expect Shopify, for example, to increase their earnings by an order of magnitude despite already having done extraordinarily well in a very competitive market? Can anyone ELI5?


Former equity analyst here. Nobody on "The Street" is actually valuing these companies on PE ratios. Tech companies often intentionally re-invest earnings back into the business in real time and so their reported EPS is often quite low and a poor metric to evaluate the underlying business on. So instead, analysts typically use other metrics like EV/EBITDA or even P/Sales ratios in their valuation models.

Very generally speaking, trading these companies is kind of more of like placing a bet on whether or not their future top-line growth will be dramatically different than the market's current expectations.


The only common belief held by investors in a stock is that the price is going to go up. You may have value investors with a belief that Shopify is undervalued based on earnings, you may have investors betting that the rest of the market will buy Shopify, you may have people who’ve seen the line go up and decided to buy…

Stock prices have been decoupled from earnings or “value” for a long time now and that’s toothpaste we will never get back in the tube. We are in the Robinhood age where you can buy and sell a stock in seconds with no effort.


> Stock prices have been decoupled from earnings or “value”

No, they aren't, but the market can remain irrational for longer than you can remain solvent. It doesn't help that our dear government seems loathe to actually ensure competitive markets.


> We are in the Robinhood age where you can buy and sell a stock in seconds with no effort.

I read somewhere that retail investors are less than 10% of trades.


a fairly small volume of trades can still have a large impact on prices, no?


The greater fool theory: https://en.m.wikipedia.org/wiki/Greater_fool_theory

Essentially, investors buy as long as they think they will be able to sell at a higher price in the future, regardless of economic fundamentals.


> regardless of economic fundamentals.

not regardless, but only if. The future is unknown, so their bet is also based on that unknown. Is it foolish? Who knows. Did nvidia seem foolish if somebody made that bet before their ai boom?


You can also just say things you don't understand are always created by fools.

Now, there are some fools buying these stocks. But to say that each one of these has a high P/E because every shareholder is a fool is very reductionary.


> You can also just say things you don't understand are always created by fools.

Do you have a better hypothesis that would explain the extreme valuations of those stocks?

> But to say that each one of these has a high P/E because every shareholder is a fool is very reductionary.

That's not what "greater fool theory" means.


> Do you have a better hypothesis that would explain the extreme valuations of those stocks?

This isn't crypto, these are real, well run companies with good fundamentals.

The trade may be a bet that they are able to corner the market and extract more value. Maybe, it's wrong, but doesn't mean it's just empty hype.


> these are real, well run companies with good fundamentals

I'm not disputing that. But even "real" companies don't warrant P/E multiples in the three-digit range, unless there's a very good reason to expect them to grow their profits by 10x or more in the foreseeable future – and that has to be the expected value of earnings growth (roughly, the average growth over all possible futures), discounted by the time value of the investment.

P/E multiples over 100 are practically never justifiable, except as "someone else will come along and pay even more" – i.e., the greater fool theory.


Capture the market by not making customers pay full costs => low profits.

Grow revenues without substantially increasing costs (i.e running a loss)

Hope you can turn up the profit dial later.

Seems like the modern way?



Stock buybacks help push these up. Buybacks are a way to pay investors at capital gain tax rates instead of normal income (dividend) rates.


The enterprise value is 80.58B. The gross profit is 2.5B. 80/2.5 is 32, similar to Tesla stock.

The earnings are affected by how much the company reinvests (which shows up as a cost) before it becomes earnings on the accounting sheet.


TBH I don't think many figures here make any financial sense -- but I gotta hold it if my friends all hold it. And once everyone holds it no one is allowed to mass sell it because it's going to hurt your friends, and in finance that's a sin.


With numbers like that, either the market is crazy or the market believes the actual meaningful earnings are substantially higher than the GAAP reported numbers. Although even there the difference would have to be pretty big.


arm is also surprisingly high at 557.95x according to my broker btw. yet it's the only stock in my portfolio that reliably goes up.


Past performance is not necessarily indicative of future performance


30-50 is a reasonable PE range for larger companies.


There's a lot of comments knocking the due diligence, but the call out of the threat vector and timing of this make it a bit hard to brush off as coincidence.


How is Microsoft stock down less than a percent?

The problem was Windows giving arbitrary access to the kernel to software that can be updated OTA without user intervention and allowing that to crash the kernel, right? Wouldn't this mean that Windows is considerably less secure and stable than assumed?


No one assumed Windows was stable and secure; that's why they install CrowdStrike.


Not really. These were kernel modules authorized and installed by the system admin. Of course kernel code runs the risk of crashing your system. The same is true on Linux, and according to another commenter it already has happened with Crowdstrike for Linux


Coincidence?


Probably.


That's bananas. Bro is about to get a knock-knock.


He says he bought seven put contracts for $7.30 at the $185 strike. Absolute max profit, from CS going to -0, would be (185-7.3) x 7 X 100 = ~$125k.

I don’t know if the absolute amount of profit affects decisions here. It seems if he were more certain of what’s going on he would have bet a lot more.


> It seems if he were more certain of what’s going on he would have bet a lot more.

Outside of the HN bubble, $125K is already a pretty big sum of money to get all at once, and unlikely to bring too much scrutiny, if it was somehow not a coincident. Seems like a smart strategy, if the user was sitting on inside information and didn't want to ring too many alarm bells.


However posting on reddit about it, would not be such a smart strategy. I think it's genuinely just a coincidence, WSB gets plenty of worthless "DD" posts every day that end up amounting to nothing.


Fair. But, one more point, even with its pre market drop, it’s still way above that strike, though the value of the put is going to be up.


Yeah, tongue firmly in cheek, but that was a very specific, prescient analysis.


Most timed post ever?


This smells of some insider trading. Someone internal at CrowdStrike (or their relative/friend) got wind of this and is trying to save face if they get investigated.

Reading the post its obvious they don’t have a deep understanding of tech, while having that be core to their thesis.

It’s prohibitively hard to hack into a “cloud system” due to few possible entry points - as a reddit commenter said, open S3 buckets are tough to crack!


It wouldn't work, the SEC has incredible tools for finding these things.

Especially for the mom/pop investors.


Yet another good example why liability in software should already be a common thing.


I don't understand why this outage and the Azure outage earlier don't make it to the front page.

I'm getting more up to date technical details from the regular media.

This outage looks to be huge.


This is literally a black swan event that'll be memorialized in textbooks, and that's not even considering the actual fallout that will follow.


That is not an understatement. This is literally the largest failing of internet infrastructure to date.

Alas, using the internet has given us a lot of efficiency. The trade-off is resilience. The entire global system is more brittle than ever, but that is what gave it such speed.


I'd argue the infrastructure of the Internet isn't to blame here, it sounds like a software/config bug at Crowdstrike. There are wider discussions around over-reliance on cloud-based tech too. But the good old Internet can hold its head up high IMHO.


I'm not really sure cloud has much blame here either.

Imagine it's 1998 and Norton push a new definition file that makes NAV think kernel32 is a virus. The only real difference today is that always-on means we all get the update together, instead of waiting for mum to get off the phone this evening.

We got an email this morning telling us none of our usual airlines could take bookings right now. That wouldn't have been much different in 1998, airline bookings have been centralised for my entire lifetime.


That is a fair clarification. More an issue with the stuff on top of the internet.

As per your point, the base infrastructure is working perfectly for better or worse!


It's not a failing of Internet infrastructure.

It's one vendor pushing out a bad update, and thousands of companies with no supply chain diversity.


I responded to another comment with this and you are right, that is a fair clarification.

The base infrastructure worked perfectly fine. The stuff on top of it, not so great.


In a way, this might end up being a blessing in disguise. It's an emergency drill for something potentially catastrophic (e.g. massive cyberattack, solar flare), and it's a large enough wake-up call that society can't just ignore it.


I hope so, but never underestimate businesses' willingness to cut corners to stay on top. I hope I am wrong.


Expectations: Russia, China or North Korea will take the Western tech infrastructure down.

Reality: the infrastructure took down itself.


This is not an Internet failure but a software failure from a cybersecurity update. This is why I never use real cloud security but rather home cloud security on my server PC/NAS/SAN: when it crashes, okay, so be it, I don't get access to my cloud. But I always use an online backup cloud that needs no installation for the most important stuff. You should never rely on one piece of software to do it all; always have backups.


Did you mean the Azure thing or CS thing? Just happened that the Azure thing did not impact me as I was already off...


It isn't clear at this point that the Azure thing and the CS thing are unrelated


Azure issue and CS issues are separate incidents. They are not related.


Indeed, this is different. This is worldwide, and it does not only affect Windows PCs; for some friends of mine, Linux and Macs are in trouble too. This is not your Windows Vista BS problem, etc.


The impact is more than huge. Our whole workforce is impacted as the world's largest tech service provider.


Y2K came 24+ years late!


And 12 years before the Unix epoch!


Agreed, I assume people are flagging this post for some reason?


It’s most likely the flame war filtering algorithm of HN. Posts that create a lot of discussion quickly are down-ranked until an admin fixes the rank manually, or not.


HN has some flame war protection I remember having read, downrating some discussions where it assumes nothing useful is being discussed.

Maybe a false positive? (100% speculation. That's the problem using closed source software)


Most HNers are on the clock while browsing (for work purposes of course) and forcibly afk at the moment.


I'd guess also a significant proportion of us are involved in our company's response to the incident.


Or we just woke up because we are on vacation


I’m on vacation right now, but I do compile a lot of code when at work


Funny how I got rejected today from crowdstrike because I couldn’t code a hard leetcode problem under 40mins. I guess leetcode isn’t true software engineering after all.


You probably weren't good enough to take down half the world's Windows systems.


Alternately, if he had gotten the job, maybe he would have taken down three quarters of the world's Windows systems.


This is a testing and deployment issue rather than coding... mistakes and bugs happen - but most serious businesses have routines setup to catch them before rolling them out globally!


So maybe they should swap out the Leetcode for a testing and deployment test during their interview!


Evidently you dodge bullets well, you should consider running for office.


And breaking patches is? …How did this get through QA with an issue big enough that it breaks many, many Windows machines?


I'm about willing to bet they don't have a qa team.


Discussed more thoroughly here: https://news.ycombinator.com/item?id=41002195 (Not sure why that's not on the frontpage)


Because the people who are locked out can't post to say they are locked out.


Probably shoved off by the flamewar detector.


This company has post-apocalyptic style photos to make you panic-buy their solution.

https://ibb.co/Bc6n527

"62 minutes could bring your business down"

I guess they could bring all the businesses down much quicker.

edit: link https://www.crowdstrike.com/en-us/#teaser-79minutes-adversar...


Their "Statement" is remarkably aloof for having brought down flights, hospitals, and 911 services.

"The issue has been identified, isolated and a fix has been deployed."

Maybe I'm misunderstanding what I read elsewhere, but is the machine not BSODing upon boot, prior to a Windows Update service being able to run? The "fix" I see on reddit is roughly:

Workaround Steps:

1. Boot Windows into Safe Mode or the Windows Recovery Environment

2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory

3. Locate the file matching “C-00000291*.sys”, and delete it.

I'm horrified at the thought of tens of thousands of novice Windows users digging through System32 to delete driver files; can someone set my mind at ease and assure me this will eventually be fixed in an automated fashion?

https://www.crowdstrike.com/blog/statement-on-windows-sensor...


> Their "Statement" is remarkably aloof for having brought down flights, hospitals, and 911 services.

Their lawyers certainly won't allow mentioning such dramatic (is "dramatic" appropriate here?) consequences.


It cannot and will not be fixed in an automated fashion.


Of course it can be fixed in an automated fashion; it just requires effort. The machines should have netboot enabled so that new validated operating system images can be pushed to them anyway, so you just write a netboot script to mount the filesystem and delete the file, then tell the netboot server that you're done so it doesn't give you the same script again when it reboots.

It's like two hours of work with dnsmasq and a minimal Linux ISO. The only problem is that much of the work is not shareable between organisations; network structures differ, architectures may differ, partition layout may differ, the list of assets (and their MAC addresses) will differ.

Edit: + individual organisations won't be storing their BitLocker recovery keys in the same manner as each other either. You did back up the recovery keys when you enabled BitLocker, right? Modern cryptsetup(8) supports a BITLK extension for unlocking said volumes with a recovery key. Again, this can be scripted.
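
A minimal sketch of such a recovery script, under assumptions not in the parent comment: the netboot image carries ntfs-3g and curl, the Windows system volume is /dev/sda3, the NIC shows up as eth0, and netboot.example.internal/done is a hypothetical endpoint on the netboot server that records completion:

    #!/bin/sh
    # Runs from the netbooted Linux image: delete the bad channel file, report back, reboot.
    set -e
    mkdir -p /mnt/win
    # If the volume is BitLocker-protected, unlock it first with the backed-up recovery key:
    #   cryptsetup open --type bitlk /dev/sda3 winvol   (then mount /dev/mapper/winvol instead)
    ntfs-3g /dev/sda3 /mnt/win
    rm -f /mnt/win/Windows/System32/drivers/CrowdStrike/C-00000291*.sys
    umount /mnt/win
    # Tell the netboot server this host is done so it boots from local disk next time.
    curl -fsS "http://netboot.example.internal/done?mac=$(cat /sys/class/net/eth0/address)" || true
    reboot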


> so you just write a netboot script to mount the filesystem and delete the file

Because writing such a script (that mounts the filesystem and delete a file) under stress and time constraint is a great idea? That's a recipe for a worse disaster. The best solution, for now, is to go PC by PC manually. The sole reason the situation is as is was the lack of backstage testing.


If the affected organizations had such an organized setup, they probably won't need crowdstrike in the first place. The product is made so that companies that don't understand (and won't invest) in security can just check that box by installing the software. Everyone is okay with this.


> I'm horrified at the thought of tens of thousands of novice Windows users digging through System32 to delete driver files; can someone set my mind at ease and assure me this will eventually be fixed in an automated fashion?

Nope. Both my orgs (+2000 each) have sent out a Google doc to personal emails on using CMD Prompt to delete that file.

Anyone with technical experience is being drafted to get on calls and help people manually delete this file.


So no bitlocker on the system?


And I guess if they use bitlocker then they need to enter the key as well? Imagine doing that to thousands of computers


The thousands of laptops my wife's work uses are BitLockered. I went to fix the issue, then found that out. I wonder if they will be giving out the keys or if IT will require hands-on access to those laptops to fix it... what a shitshow.
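
If IT does end up doing this hands-on, the recovery keys can at least be collected ahead of time; a minimal sketch using the built-in manage-bde tool on a healthy, unlocked machine (keys are also recoverable from AD/Entra if key escrow was configured):

    REM run as administrator; lists the protectors for C:, including the
    REM 48-digit numerical recovery password and its key ID
    manage-bde -protectors -get C: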


Good luck to Joe Schmoe in the IT dept who has to do this over and over flawlessly


Boy, that is some corny branding.


I agree but I've also personally witnessed how effective this crap is on a certain cohort of IT managers. You can see the 3 or 4 gears grinding together in their head... something like "oh my goodness look at all the things I get for one purchase order!".


I could certainly see that! Haha.


> "62 minutes could bring your business down"

> I guess they could bring all the businesses down much quicker.

It is because the buyer does not get the message. And, when they get it, it is too late.


And for much longer.


The antivirus did its job, now you can't get viruses. Jokes aside, I've checked their website and it was full of AI buzzwords so I guess that happens when you focus on nonsense instead of what your customers actually need (I know that all antiviruses have a machine learning component, but usually you don't advertise it as some sort of AI to get better stocks).


All antivirus software is indistinguishable from malware.


I have never seen malware take half the world's IT systems offline, even in the days of Code Red and Slammer.


Blaster legit took down over half of computers on the internet.


But the effects on society were far less because “online” wasn’t synonymous to “internet”


I think the AI talk is just the fashion now amongst C-level execs. Their product - no matter what it does - suddenly needs some sort of AI integration.


When will people learn?

1. Stop putting mission critical systems on Windows, it's not the reliable OS it once was since MS has cut off most of its QA

2. AV solutions are unnecessary if you properly harden your system, AV was needed pre-Vista because Windows was literally running everything as Administrator. AV was never a necessity on UNIX, whatever MS bundles in is usually enough

3. Do not install third party software that runs in kernel mode. This is just a recipe for disaster, no matter how much auditing is done beforehand by the OEM. Linux has taught multiple times that drivers should be developed and included with the OS. Shipping random binaries that rely on a stable ABI may work for printers, not for mission critical software.


None of this advice is useful for massive organizations like banks and hospitals who got hit by this. They cannot switch off of windows for a number of reasons.


There's nothing they can do right now, but my issue is that this will be forgotten when next update/purchasing round swings into action.

Take Mærsk who couldn't operate their freight terminals due to a cyber attack and had the entire operation being dependent on a hard drive in a server that happened to be offline. Have they improved network separation? Perhaps. Have they limited their critical infrastructure to only run whitelisted application? I assure you they have not. They've probably just purchased a Crowdstrike license.

Companies continuously fail to view their critical infrastructure as critical and severely underestimate risk.


Mærsk is kind of a bad example, because they made real security mitigations afterwards.[0] I cannot speak to whether they whitelist applications, but neither can you.

[0] https://www.csoonline.com/article/567845/rebuilding-after-no...


That's the reason why I wrote, "stop putting" instead of "throw all of your PCs out of the window". Just like they migrated away from DOS they should start planning to migrate away from Windows to more modern, sandboxed solutions. There are ZERO reasons why a cash register shouldn't boot from a read-only filesystem, run AV, and so on.


It's not a "3 step solution" that some IT guy can do now. It's a high level critique of using Windows and ignoring known problems.


Because they made a continuing choice over the years to write their systems to rely on windows?


Perhaps they're not willing? Why can't they?


All of the hardware that's attached to workstations in our hospital is designed for Windows. Certain departments have specific needs as well and depend on software that is Windows only. After decades of Windows it develops an insidious grasp that is difficult to escape, even more so when your entire industry is dependent on Windows.

Switching over to Linux wouldn't just be extremely costly from an IT perspective but would require millions of dollars in new hardware. We are in the red in part because of the pandemic, existing problems in our industry accelerated by the last few years, and because a large percentage of our patients are on Medicare, for which the federal government shrinks fixed service payments every year.

I can't imagine convincing our administration to switch over to Linux across the hospital without a clear, obvious, and more importantly short-term financial payoff.


I'm working for a company that has no Windows boxes at all, anywhere. Sure, some Windows software has no alternatives. We're running all of those programs in VMs.

Does this make financial sense? Probably not in the short run, which is an issue for most companies nowadays. But in the long run? I think it's the right choice.


It's not that the hardware is designed for Windows; it's the driver code, which is most probably written in plain C and can most probably be cross-compiled for use outside Windows – so instead of millions of dollars in new hardware it's really thousands in porting the drivers and GUIs to the new platform. What works on Windows is, in 90% of cases, an easy porting job for the manufacturer; they just won't do it unless someone stops paying for the Windows version and is willing to pay for a port to an alternative platform.

Anyway, I totally agree with you. The convincing part here is far from clear and obvious for administration types. Until MS finally bricks its OS and renders it totally unusable they can continue to do whatever shit they want and keep mocking their loyal customers forever.


Well, there’s this one app, written in VB6 using lots of DCOM that produces XML and XSLT transforms that only work in IE6, and the entire organisation depends on it, and the nephew who built it is now a rodeo clown and is unavailable for consultation.

True story.


He-he, entire organisation depending on IE6. I have good news and bad news...


I'd attempt an answer.

1/ Imagine running >1000 legacy applications, some never updated in 20 years.

2/ Imagine a byzantine mix of local data centers and VPCs in AWS/GCP/Azure.

3/ Imagine an IT department run by a lot of people who have never learned anything new since they were hired.

That would be your typical large, boring entity such as a bank, public utility or many of the big public companies.

Yeah, there is no law of physics preventing this, but it's actually nearly impossible to disentangle an organization from decades of mess.


That's why we've invented emulators, sandboxing, ...

People have continued to run old management systems inside of virtual machines and similar solutions. You can sandbox it, reset it, do all kinds of wondrous things if you use modern technologies in an era-appropriate way. Run your old Windows software inside of a VM, or tweak it to run well on Wine if you have the source. The reason this mess happened is that all of this software is literally running on a desktop OS in mission critical applications.

I have worked as an embedded engineer for a while and I can't count the amount of nonsensical stuff I've seen incompetent people running on unpatched, obsolescent Windows XP and 7 machines. This mess is 100% self inflicted.


I think these are just technical excuses; the real answer lies somewhere in the fields of politics and economics. If the people in charge decide to, then we tech nerds will migrate and refactor 1000 applications and update 20 years of byzantine code mess. I've seen entities so large and boring they could barely move one step change rapidly and evolve once their economic stability was at stake, and this is a great example of the kind of disruption that can push them over the chasm of change.


Because it would mess with the backroom deal the executive making the decision has with MS.


Because reasons!


Shareholders prefer profits to long-term investment. Thanks capitalism!


This issue could easily happen on any other OS - Linux, macOS, BSDs - because it's a third party kernel driver which would be installed by corporate IT regardless of anyone's opinion, for compliance reasons. Your advice is incompatible with how the real world operates.


I've seen orgs get through SOC 2 and PCI-DSS without kernel antivirus.

It's all about compensating controls.


Alas in the world of B2B, contracts from larger companies nearly always come with lists of specific requirements for security controls that must be implemented, which nearly always include requiring anti-virus.

It's just not as simple as commenters on this thread wish!


The contracts rarely specify stuff like antivirus explicitly, but instead require compliance with one or more security standards like PCI DSS. Those say you have to use antivirus, but they all have an escape hatch called a "compensating control", which is basically "we solved the problem this is trying to solve in this other way that's more conducive to our overall security posture, and got the auditor to agree with us".


My source: I review a lot of contracts. It's very common for things to be explicitly required.

Yes you can go back and forth and argue the toss, but it pushes up the cost of the sale and forces your customer to navigate a significant amount of bureaucracy to get a contract agreed. Or you could just run AV like they asked you to...


Wait, I thought in this case we are the customer!? Okay what kind of contracts are we talking about? :D


Can you propose an example of a compensating control for "antivirus" that would have a chance of passing? Would you propose something like a custom SELinux/AppArmor setup plus maybe auditd with alerting? Or some Windows equivalent of those?
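
To be concrete about the auditd half, I mean roughly this kind of thing (a minimal sketch only; the log path and allowlist are placeholder assumptions, it presumes audit rules already log execve syscalls, and a real setup would ship events to a SIEM rather than print):

    # Minimal sketch of "auditd with alerting": follow the audit log and flag
    # executions outside an allowlist. The log path and allowlist are assumptions,
    # and this presumes audit rules already log execve (e.g. -a always,exit -S execve).
    import re
    import time

    AUDIT_LOG = "/var/log/audit/audit.log"
    ALLOWED_PREFIXES = ("/usr/bin/", "/usr/sbin/", "/opt/approved/")
    EXE_RE = re.compile(r'exe="([^"]+)"')

    def follow(path):
        with open(path, "r", errors="replace") as f:
            f.seek(0, 2)  # start at the end of the file, like `tail -f`
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                yield line

    for line in follow(AUDIT_LOG):
        match = EXE_RE.search(line)
        if match and not match.group(1).startswith(ALLOWED_PREFIXES):
            # a real deployment would raise an alert in a SIEM here
            print(f"ALERT: unexpected executable ran: {match.group(1)}")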


compensating controls ftw. the spirit of the law vs the letter of the law. our system was more secure with the compensating controls vs the prescribed design. this meant not having to rotate passwords, because fuck that noise.


You should grab the folks who've done it and start authoring a book. You've got a 100x audience increase today!


Same, I’ve been in an org that got PCI-DSS level 1 without antivirus beyond Windows Defender or any invasive systems to restrict application installation.

It did involve a lot of documentation of inter-machine security controls, network access restriction and a penetration test by an offensive security company starting with a machine inside the network, but it can be done! Also in my opinion it gives you a more genuinely secure environment.


You should explain how they do it.

If for instance they're remoting into a restricted VM all day, that's a different set of tradeoffs many might not be happy with.


Nothing like that, basically what sitharus said above you: extra network-level controls, zero trust to minimize lateral movement, and giving the pen testers a leg up by letting them start already inside the corporate network.


corporate IT heads need to roll for that to ever change.

The Romans used to make the architects stand under the arches they built, to enforce the idea of consequences for bad work.


Corporate ITs need to stop mandating security malware tools on their systems just because someone showed them some nice powerpoints.


Yeah, my work requires me to run an antivirus kernel module on my Ubuntu laptop.

Corporate IT is always going to lean towards the "safe" compliance option.


> AV was never a necessity on UNIX, whatever MS bundles in is usually enough

What prevents someone from pushing a malicious package that takes my user data (which is directly accessible from a logged-in session) and sends it somewhere? Especially in non-system repos, like Maven/NuGet/npm/pip/RubyGems and so on? What about the all-too-widespread practice of piping shell scripts from the web, or applications with custom update mechanisms that might be compromised and pull in malicious code?

I'm not saying that AV software would protect against all of these, but even if users don't do stupid things (which they absolutely will anyways, sooner or later), then there are still vectors of attack against any system.

As for why *nix systems don't see that much malware, I've no idea; maybe they're not as juicy a target because of the lower count of desktop software installations (though the stuff that is on those systems might be more interesting to some, given the more tech-savvy userbase), or maybe because a lot of the exploits focus on server software, as most CVEs do.

On Windows, I guess the built in AV software is okay, maybe with occasional additional scans by something like Malwarebytes, but that's situational.


Nothing. In fact there have been many cases where Python's and Node.js's package systems were exploited to achieve arbitrary code execution (because that's a feature, not a bug, meant to allow "complicated installation processes to just work").

https://arstechnica.com/information-technology/2021/12/malic...

AVs are the wrong way to go about security anyway; it's a reactive strategy in a cat and mouse game by definition. For prevention, I think the BSDs are doing some promising work with the "pledge" mechanism. And as much hate as they get, I like AppImages and snap et al. for forcing people to consider a better segmentation model and permission system for installed software.
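
To make the install-time code execution point concrete, here's a minimal, harmless sketch (the package name is made up; a real attack would exfiltrate data instead of printing, and wheels skip setup.py but still run whatever the package does at import time):

    # setup.py - minimal sketch of why "pip install" of a source package is
    # arbitrary code execution. Harmless stand-in: a real attack would exfiltrate
    # data here instead of printing.
    from setuptools import setup

    # Anything at module level runs at install time, with the installing user's privileges.
    print("This runs during 'pip install', before you ever import the package.")

    setup(
        name="totally-not-malware",  # made-up package name
        version="0.0.1",
        py_modules=[],
    )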


I would like to inform you that none of the AV products on the market will be able to protect you from piping a bad script from the web. Case closed.


The Crowdstrike agent is theoretically able to detect that what you just pipe-installed is now connecting to a known command and control server, and can act accordingly.
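
Roughly the kind of check that implies, as a minimal sketch (not Crowdstrike's actual implementation; the blocklist addresses are placeholders from the TEST-NET ranges, and a real agent would kill or quarantine the offending process rather than just print):

    # Minimal sketch of "flag processes talking to known C2 addresses".
    # The blocklist is a placeholder; a real agent would also act on the process.
    import psutil

    KNOWN_C2 = {"203.0.113.7", "198.51.100.99"}  # placeholder TEST-NET addresses

    def suspicious_connections():
        for conn in psutil.net_connections(kind="inet"):
            if conn.raddr and conn.raddr.ip in KNOWN_C2:
                try:
                    name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
                except psutil.NoSuchProcess:
                    name = "exited"
                yield name, conn.pid, conn.raddr.ip, conn.raddr.port

    for name, pid, ip, port in suspicious_connections():
        print(f"ALERT: {name} (pid {pid}) is connected to {ip}:{port}")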


yes, as could any competent classic old-fart network-wide IPS/IDS

endpoint security is a great utopia to strive for, but to get there we ought to start with having secure-by-default endpoints.


Carbon Black will block any executables it pulls down, though. And I think it may block scripts as well. Executables have to be whitelisted before they can run.

It's an extremely strict approach, but it does address the situation you're talking about.


Scripts are not executables


Agreed, but Carbon Black can stop scripts from running.


If it lets you spawn a shell I would bet money against that


If you write a batch file on a Windows PC with Carbon Black on it, you will not be able to run it. Of course there is customisation available to tweak what is/isn't allowed.


Yes, but that's like 1% of the actual surface area for "running a script". I am not a Windows expert but on, say, Linux you can overwrite a script that someone has already run, or modify a script that is already running, or use an interpreter that your antivirus doesn't know about, or sit around and wait for a script to get run and then try to swap yourself into the authorization that gets granted for that, or…there's a whole lot of things. I assume Windows has most of the same problems. My confidence in Carbon Black stopping this is quite low.


If your malicious script starts doing things like running well-known payloads, trying to move laterally, or accessing things it really shouldn't, AV will flag/block it.


What happens when the malicious script tries a not-very-well-known payload? Hint: nothing good.


No one is suggesting it is 100% coverage, but you would be surprised at the amount of things XDR detects and prevents in an average organization with average users. Including the people who can't stop clicking YourGiftcard.pdf.exe.


I am not against trying to protect against people who do that. The problem is that you pay XDR big bucks to stop a lot more than that, and this mostly doesn't work.


That’s both untrue and missing the point.

In a perfect world, AV software wouldn’t be necessary. We don’t live in a perfect world. So we need defense-in-depth, covering prevention, mitigation, and remediation.


> What prevents someone pushing a malicious package that takes my user data

That's not an argument in good faith. If you install unvetted packages in your airline control system, bank, or supermarket, the kind of systems that we're talking about here, you have much bigger problems to worry about.

> I'm not saying that AV software would protect against all of these,

Or indeed any of these. Highly privileged users piping shell scripts from untrusted sources is out of scope for any antivirus system, on any platform.

That doesn't mean all platforms are identical, or share the same attack vectors. It is much more accepted to install kernel mode drivers on the Windows platform, where there are even established quality control programs to manage it, than on Linux, where the major vendor will very literally show you the middle finger on video for everyone to see for doing so.

The Linux community is more for doing that kind of work upstream. If some type of new access control or binary integrity checking is required, that work goes upstream for everyone to use. It is not bolted on running systems with kernel mode drivers. That is because Linux is more like a shared platform, and less like a "product". That culture goes way beyond mere technical differences between the systems.


> If you install unvetted packages in your airline control system, bank, or supermarket, the kind of systems that we're talking about here, you have much bigger problems to worry about.

Surely we can agree that if it's a vector with an above 0% chance of it being exploited, then any methods for mitigating that are a good thing. Quite possibly even multiple overlaid methods for addressing the same risks. Defense in depth and all, the same reason why many run a WAF in front of their applications even though someone could just say: "Just have apps that are always up to date with no known CVEs".

> Or indeed any of these. Highly privileged users piping shell scripts from untrusted sources is out of scope for any antivirus system, on any platform.

You don't even have to be highly privileged to steal information; e.g. an "app" running some web service could still serve to exfiltrate data. As others have mentioned, maybe this is not what AV software has historically been known for, but there definitely are pieces of software that attempt to mitigate some of the risks like this.

I'd rather have every binary or piece of executable code be scanned against a frequently updated database of bad stuff, or use heuristics to figure out what is talking with what, or have other sane defaults like preventing execution of untrusted code or to limit what can talk to what networks, not all of which is always trivial to configure in the OSes directly (even though often possible).

I won't pretend that AV software is necessarily the right place for this kind of functionality, but I also won't pretend that it couldn't be an added benefit to the security of a system, while also presenting different risks and shortcomings (threat vector in of itself or something that impacts system stability at worst, or just a hog on the resources and performance in most cases).

Use separate VMs, use secret management solutions, use separate networks, use principle of least privilege, make use of good system architecture, have good OS configuration, use WAFs, use AV software, use scanning software, use dependency management alerting software, use static code analysis, use whatever you need to mitigate the risk of waking up and realizing that there's been a breach and that your systems are no longer your own.

Even all of that might not be enough (and sometimes will actually make things worse), but you can at least try.


In that we can agree. But I would put "build on operating systems intended for the purpose" on top of that list, too. There is no excuse for building airline or bank systems on office operating systems and trying to compensate by bolting on endpoint protection systems.

The issue here is not simply scanning for known malware; "endpoint protection" systems go way beyond that. I have never, in practice, seen any of those systems be a net benefit for security. And I mean that in a very serious and practical way. Depending on your needs, there are far more effective solutions that don't require backdooring your systems. There simply shouldn't be any unauthorized changes on this type of system.


> In that we can agree. But I would put "build on operating systems intended for the purpose" on top of that list, too.

Agreed, most folks should probably use a proven *nix distro, or one of the BSD varieties. That would be a good starting point.

That said, I doubt whether the OS alone will be enough, even with a good configuration, but at some point the technical aspects have to contend with managing liability either way.


Carbon Black, running in DO NOT LET UNTRUSTED EXECUTABLES RUN mode, would not let you run binaries that curl | sh just grabbed unless they were allow-listed.


Meh, without proper MAC with process namespaces, I guess nothing.

SELinux and whatever Apple is doing looks right IMHO


Windows Defender is more than sufficient for most of these companies, but they need that false sense of security, or maybe they have excess budget to spare, or they are transferring the risk per their risk management plan.


Transferring the risk of malicious actors, but creating a resilience risk they are not owning.


Bingo


This isn't a Windows issue. For what it's worth, I've had plenty of problems in the past with kernel panics from Crowdstrike's macOS system extension, although they were fairly random, nothing like today's issue.


Linux isn't exactly reliable either... I'm sorry, but that OS is barely capable of outputting a stable HDMI signal; god help you if you are on a laptop with an external monitor.

For 3 computers, 2 laptops, I've never _not_ had display bugs/oddities/issues. System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.

We can not get computers perfect. They are too complicated. That's true for anything in life. As soon as it gets too complicated, you're left in a realm of statistics and emergent phenomena. As much as I dislike windows enough to keep using Linux, I never had display issues on windows.

To anyone compelled to reply with a text that contains "just" or "simply": simply just consider that if you are able to think of it in 10 seconds, then I have thought of it as well, and tried it too.


In my comment I was referring to mission critical systems, which most definitely you don't put on cheap commodity hardware you buy in a brick and mortar store.

Linux is used EVERYWHERE for a reason. Most car HUDs now run on some form of embedded Linux, as do basically all embedded and low-power devices. The problem here is that people still put embedded mission critical systems on a desktop OS and slap desktop software on it, which is _a bad choice_.


> Linux isn't exactly reliable either... I'm sorry but that OS is barely capable of outputting a stable HDMI signal, god help you if you are on a laptop with external monitor.

This is demonstrably false, given the amount of people that game on Linux nowadays.

> System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.

I had this happen to me once. Timeshift was painless to use, and in about 15 minutes I had my machine up and running again, and could apply all updates properly afterwards. If anything it made me bolder lol.


> Linux isn't exactly reliable either... I'm sorry but that OS is barely capable of outputting a stable HDMI signal, god help you if you are on a laptop with external monitor.

It just works for me, and has just worked with every laptop I have had in the last 15 years. My kids and I have several Linux installs, and the only one with HDMI output issues is a cheap ARM tablet that is sold as a device for early adopters.

> For 3 computers, 2 laptops, I've never _not_ had display bugs/oddities/issues. System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.

I've run at least that number of machines (I do not know whether you mean three or five in total) for the last 20+ years, and can recall one such issue.


> For 3 computers, 2 laptops, I've never _not_ had display bugs/oddities/issues. System upgrades always make me nervous because there is a very real chance of something getting fucked up and my screen staying black the next time it boots, having to go into a TTY, and manually fixing stuff up or booting the previous version that was still saved in GRUB.

I've had a Debian update break GRUB itself as well: https://blog.kronis.dev/everything%20is%20broken/debian-and-...

I still use some Linux distros because when they work, they're pretty good, but when they don't, be prepared for a bunch of annoyances and debugging.


I've also had Windows Update fuck up my VMs and physical installs multiple times - this stuff just happens _with desktop machines, on desktop OSes_. The point is, lots of companies are using random cheap x86 computers with desktop Windows for mission critical appliances and systems, which is nonsensical. The rule of thumb has always been: do not put Windows (client) on anything you can't format on short notice at any time. Guess people just never learn.


> The rule of thumb has always been, do not put Windows (client) on anything you can't format on a short notice at any time.

This is reasonable and sound.

> This stuff just happens _with desktop machines, on desktop OSes_

I wouldn't call Debian a desktop OS per se, though (albeit installing XFCE just in case does introduce a bit more risk of breakage).

Critique of consumer hardware is valid, but it's quite upsetting that bad software is the status quo.


How is your lack of a stable HDMI signal relevant to that the world's airlines and supermarkets and banks probably shouldn't run Windows with third-party antivirus software bolted on? That is a platform originally intended for office style typewriter emulation and games.

Every engineering-first or Internet-native company that could choose chose Linux, and for simple reasons. Anything not Linux in The Cloud is a rounding error. Most of the world's mobile phones are Linux. And most cloud-first desktops too. They don't seem to be particularly more troubled with HDMI signal quality or other display issues than other devices.


>Linux isn't exactly reliable either...

That's certainly a perspective.


What are you saying? Am I holding it wrong?


I'm saying you're allowed to have whatever belief you want, no one is stopping you from stating it on the internet either.


This is one of the comments ever.


> I'm sorry but that OS is barely capable of outputting a stable HDMI signal, god help you if you are on a laptop with external monitor.

You may have had particularly bad luck with poorly supported hardware, but I don't think this is a normal experience.

I've been using Linux exclusively on desktops and laptops (with various VGA, DVI, DisplayPort, HDMI, and PD-powered DisplayPort-over-USB-C monitors and TVs) since 2002 without any unstable behavior or incompatibility.


Most likely. I think laptops are particularly gnarly, especially when they have both an APU and a discrete GPU. While manufacturers use Windows' amenities for adding their own drivers and modifications to ensure the OS understands the topology of the hardware (so that the product doesn't get mass-RMA'd), there's no such incentive to go out of your way to make Linux support it.


Yeah I mean you probably shouldn't put mission critical systems on a laptop with an external monitor either.


I mean, this isn't a support forum.

But working with hundreds of computers, running many different distributions of Linux for decades, I just haven't ever seen what you're describing. It's really hard to reconcile what I read here with my hands-on experience.


1. this is a Crowdstrike issue, not a Windows one

2. plenty of malware and c2 systems happily operate off all systems, regardless of how hardened (or how unix) they are - IDS/IPS is a reactive way to try and mitigate this

3. you don't need third party software to compromise the unix kernel, you just need to wait a week or two until someone finds a bug in the kernel itself

all that being said, this has SolarWinds vibes. the push for these enterprise IDS systems needs to be weighed, and the approach adjusted


CrowdStrike Falcon is not an AV, Windows can be decently hardened and Microsoft did not "cut off most of its QA"


> Microsoft did not "cut off most of its QA"

Windows RTMs used to be shipped in a usable state (albeit buggy) for more than a decade. You installed it from a CD and it worked fine, you installed patches every once in a while from a random Service Pack CD you got from somewhere.

Modern Windows has had the habit of being so buggy after release, in such horrendous ways, that I can't imagine being able to use the same install CD for years. This definitely shows less attention to detail, in my view.


The slice of Microsoft I worked in certainly did not have dedicated QA at the time I was there, though it used to have a QA team before, so there is some degree of truth to the statement. I can't speak for other Microsoft teams and offices. It was very disappointing for me, because I have had the opportunity to work with great QA staff before and in my current job, and there is no way a developer dedicating 25% of their time (which is what was suggested as a replacement for having dedicated QA) can do a job anywhere near as good.


I have a feeling most commenters (not just here) don't really know what Falcon is and does, if EDR (and more?) keeps getting compared to a plain antivirus.


Same difference


> Microsoft did not "cut off most of its QA"

Yes they did. Hear about it first hand from an ex windows software tester:

https://www.youtube.com/watch?v=S9kn8_oztsA

https://www.youtube.com/watch?v=lRV6PXB6QLk


Crowdstrike is not just "Antivirus" capability.

Depending on the threats pertinent to the org they may require deep observability and the ability to perform threat hunting for new and emerging threats, or detect behaviour based signals, or move to block a new emerging threat. Not all threats require Administrator privileges!

Not installing AV might be fine for a small number of assets in a low risk industry, but is bad advice for a larger more complex environment.

If we're being unbiased here, the apparent Crowdstrike problem could occur on any OS and with any vendor where you have updates or configuration changes automatically deployed at scale.


> Do not install third party software that runs in kernel mode.

You mean don't install Steam nor the Epic Store, nor many of the games.

Note: I'm agreeing with you except that pretty much the only reason I have a Windows machine is for games. I do have Steam installed. I also have the Oculus software installed. I suspect both run in kernel mode. I have to cross my fingers that Valve and Facebook don't do bad things to me and don't leave too many holes.

I don't install games that require admin.

Oh, and I have Photoshop and I'm pretty sure Adobe effs with the system too >:(


Steam does not have any kernel-mode components.


Steam asks for admin to be installed and asks for it again to install more features related to screen sharing.

To me, any app that asks for admin is suspect.


Admin privileges aren't the same thing as a kernel-mode driver. Steam does require admin to be installed, but it does not install a kernel-mode driver.


are you sure about that? anti cheat stuff is pretty invasive these days


per Microsoft, admin to kernel is not a security boundary


Root and kernel are different levels.


I've never seen a program running in kernel mode other than AV software. Pretty sure all the stuff you listed doesn't. Asking for admin permissions doesn't mean it's kernel mode software.


> You mean don't install Steam nor the Epic Store, nor many of the games.

...would you install Steam on a POS machine? Is your gaming PC a "mission critical system"?


this "kernel level = invasive" paranoia that's been going on lately is complete FUD at its core and screams tech illiteracy

no software vendor needs to or wants to write a driver to spy on you or steal your data when they can do all of that with user-level permissions without triggering any AV.

3rd party drivers are completely fine, and it's normal that advanced peripherals like an Oculus use them


Yet we have rootkit level "anti-cheat protection", without which you cannot participate in some online games.


> Linux has taught multiple times that drivers should be developed and included with the OS.

I've had Linux GPU drivers fail multiple times due to system updates, to the point where I needed to roll back. I've had RHEL updates break systems in a way where even Red Hat support couldn't help me (I had to fix them myself).

I don't see how Linux is any better in this regard than Windows to be honest.

Also:

> AV was never a necessity on UNIX

Sure, why write a virus when you can just deploy your malware via official supply chains?


Do you have/need GPUs on your 'mission critical systems'? I would bet most of us don't.

I quite agree with OP here. VMs are now quite lightweight (compared to the resources available on machines, at least) and I would rather use a light, hardened Linux as my base OS that runs Windows VMs and takes snapshots for quick rollbacks. Actually, that's what I run on my own PC, and I think it would be the sanest way to operate.
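
The snapshot/rollback part is only a few lines with libvirt's Python bindings, for example (a minimal sketch; the VM name is hypothetical and it assumes a local qemu/KVM host with the libvirt-python package installed):

    # Minimal sketch: snapshot a Windows guest before an update, roll back if it breaks.
    # "win10-pos" is a hypothetical VM name; assumes libvirt-python and qemu:///system.
    import libvirt

    SNAPSHOT_XML = """
    <domainsnapshot>
      <name>pre-update</name>
      <description>Taken before applying vendor updates</description>
    </domainsnapshot>
    """

    def take_snapshot(domain_name):
        conn = libvirt.open("qemu:///system")
        try:
            dom = conn.lookupByName(domain_name)
            dom.snapshotCreateXML(SNAPSHOT_XML, 0)
        finally:
            conn.close()

    def roll_back(domain_name):
        conn = libvirt.open("qemu:///system")
        try:
            dom = conn.lookupByName(domain_name)
            snap = dom.snapshotLookupByName("pre-update", 0)
            dom.revertToSnapshot(snap, 0)
        finally:
            conn.close()

    take_snapshot("win10-pos")
    # ...apply the update inside the guest, run checks...
    # roll_back("win10-pos")  # if the update broke the guest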


Imagine a supply chain attack on a closed system and nobody finding out about it.


> Stop putting mission critical systems on Windows, it's not the reliable OS it once was since MS has cut off most of its QA

Dream on, that ship sailed long, long, long ago.

It’s always funny to me reading comments on this site from users like this who have no idea how the real world operates


Or doesn't operate, as it currently stands.


> Or doesn't operate, as it currently stands.

The definition of "operate" has changed over the years.


Why is this type of a comment upvoted to the top?

It's a knee jerk reaction to the midwit zeitgeist. No real understanding of how the real world operates. No maturity or thoughtfulness.

Totally misleading prognosis. Misleading advice.


You missed the most important one:

Have some kind of soaking/testing environment for production critical systems, especially if you're a big business. If you're hip, something like a proper blue/green setup (please chime in with best practices!). If you're legacy, do it all by hand if you must.

Blindly enabling immediate internet-delivered auto-update on production systems will always allow a bad update to cause chaos. It doesn't matter how well you permission things off on your favourite Linux flavor. If an update is to be meaningful, the update can break the software. And clearly you're relying on the software, otherwise you wouldn't be using it.
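
To sketch what even a minimal staged rollout looks like (the hostnames, health endpoint, and deploy step are all placeholders for whatever your real tooling does):

    # Minimal sketch of ring-based ("canary first") rollout logic.
    # Hostnames, the /health endpoint and deploy() are placeholders for real tooling.
    import time
    import urllib.request

    RINGS = [
        ["canary-01", "canary-02"],         # small ring, soak the update here first
        ["prod-01", "prod-02", "prod-03"],  # wider production ring
    ]
    SOAK_SECONDS = 3600  # let each ring run the update for an hour before widening

    def deploy(host, version):
        print(f"deploying {version} to {host}")  # stand-in for the real deploy step

    def healthy(host):
        try:
            with urllib.request.urlopen(f"http://{host}/health", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    def rollout(version):
        for ring in RINGS:
            for host in ring:
                deploy(host, version)
            time.sleep(SOAK_SECONDS)  # soak before widening the blast radius
            if not all(healthy(h) for h in ring):
                raise RuntimeError(f"{version} failed health checks in {ring}; halting rollout")

    rollout("sensor-7.16")  # hypothetical version string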


Not saying Microsoft's QA is stellar, but I also remember Heartbleed.


> Windows, it's not the reliable OS it once was

And when exactly was Windows a reliable OS ?

When it was turned off ? /s


Sitting in our work slack feeling pretty smug that I forced the migration to only Linux servers and Linux or macOS work computers now.


I don't understand? Linux servers don't need endpoint protection?

I venture the vast majority of servers with Crowdstrike are Linux.


I'm assuming Linux implementation is different and the bug is not present there.


Yes. CrowdStrike for Linux has had bugs in the past, though. This time the Windows version is affected.


> endpoint protection

That's a polite way to say malware


The bug only affects Windows machines


Until the VPN is affected by the Domain Controller that is on Windows.


100% nix based here so thankfully zero systems affected. Everything from routers to devices, we have a total blanket ban on any Windows based software.



Thanks!


BBC reports: “ The cause is not known - but Microsoft says it's taking mitigation action”.

Most of the media I found say it's because of "cloud infrastructure". I have yet to see any major source actually and factually report that this was caused by a bad patch in Crowdstrike software installed on top of Windows.

Goes to show how little competency there is in journalism nowadays. And it raises the question of how often they misinterpret and misreport things in other fields.



The BBC are starting to say that 'tech people are saying this is Crowdstrike', so I guess it's just a question of being certain? Perhaps we'd have similar concerns about rigour in journalism if it were to turn out that it's actually not Crowdstrike specifically, it's caused by the interplay of Crowdstrike and some other currently unknown thing, and actually it's not Crowdstrike that's behaving improperly, but this other currently unknown thing.

It's looking more and more like Crowdstrike screwed up, but I appreciate rigour and accuracy more than FRISTTT!!! type announcements.


BFMTV, French broadcaster, reports:

"Selon le quotidien The Australian, qui relaie les déclarations du ministère australien des Affaires intérieures, l'entreprise Crowdstrike pourrait être en cause, après avoir été victime d'une brèche au sein de sa plateforme."

Translated/summarized: "According to the publication The Australian, Crowdstrike may be the cause of the outage after having suffered a security breach"

I like how it redirects blame away from those responsible and perpetuates the idea that "hackers" are the real threat.

Source: https://www.bfmtv.com/tech/direct-une-panne-informatique-mon...


BBC are doing a rare awful job. I've got BBC News on here in the UK and they just keep saying a "Microsoft IT outage".


Or maybe they usually do an awful job, except this time this is our field, so we know for certain that they do.


On BBC news a few minutes ago, an expert did describe the problem as affecting Microsoft Azure cloud systems as well as Windows systems running Crowdstrike due to an "update gone wrong".


Well, in BBC’s live coverage, just minutes ago, their technology editor said:

“ There have been reports suggesting that a cybersecurity company called Crowdstrike, which produces antivirus software, issued a software update that has gone horribly wrong and is bricking Windows devices - prompting the so-called "blue screen of death" on PCs. Now, whether these two issues are the same thing, or whether it's a perfect storm of two big things happening simultaneously - I don't yet know. It certainly sounds like it's going to be causing a lot of havoc.”

What two issues? Two major independent outages? This is seriously bad and purely speculative.


There is also an Azure outage going on, and it is unknown if they are related.


Oh, fair enough.


There was a different Azure and other MS services (including Office 365) outage earlier which is separate from the crowdstrike thing that started a few hours later.


It's strange how people who work in professions that are considered crucial infrastructure are held to such a high standard, but there's always some tech problem that cripples them the hardest.


And they all invariably use Windows instead of a high-reliability OS.


Windows is high reliability. The problem here is what's basically a third party backdoor.


"Windows" is the combination of the OS per se and all the things needed for it to run properly. That thing is a mess of proprietary drivers and pieces of software cobbled together. It can't be called "high reliability" with a straight face.


Crowdstrike is multiplatform malware that chronically damages computers on all major desktop OSes. This is a Crowdstrike problem and an admin problem.


That's a hell of a take that should not be taken seriously. Perhaps if you held everything else to the same standard, where anything used on macOS or Linux or whatever else fully and completely represents that core platform, then I'd agree.

Anecdotally, I have zero stability problems on my non-ECC consumer-grade 11th gen Intel Windows 11 system. It'll stay up for months, until I decide to shut it down. I had a loose GPU power cable that was causing me problems at a point, but since I reseated everything I haven't had a single issue. That was my fault, things happen. The system is great.

More significantly, I see no difference in stability between our Windows Server platform and Red Hat Enterprise (Oracle) server platform at work either. Work being one of the top 3 largest city governments in the USA.


Meanwhile, I'm lucky if the laptop I installed Ubuntu on will keep from crashing for over an hour of continuous use.


its an accurate take, windows is a mess

didnt red hat have a massive DEI/anti white man scandal? I wouldnt trust their products

the smartest people use and maintain Arch, ergo everything should run on Arch for maximum stability


I don't even think Linux is the definite answer. The majority of these critical apps are just full-screen UIs written in C, C++ or Java with minimal computing and networking, so they could just as easily run on Qubes or BSD without all the constant patching for dumb vulnerabilities that still persist even though Windows is 40 years old.

The problem is the middle management class at hospitals, governments, etc., only know how to use Word and maybe Excel, so they are comfortable with Microsoft, even though it's objectively the worst option if you aren't gaming. So then they make contracts with Microsoft and all the computers run Windows, so all the app developers have to write the apps for Windows.


Not really disagreeing with you, but "staying up for months" isn't a serious bar to clear; it really provides no information. In 2024, everything you can install should clear that bar.


Can you say with a straight face that if you were designing a system that had extremely high requirements of reliability that you would choose Windows over Linux? Like, all other things being equal? I'm sorry, but that would be an insane choice.


Well, yes? Of course, not the consumer deployment of Windows. Part of ensuring reliability is establishing contracts with suppliers that shift liability to them, so they're incentivized to keep their stuff reliable. Can't exactly do that with Linux (RHEL notwithstanding) and open source in general, which is why large enterprises have been so reluctant to adopt them in the past - they had to figure out how to fit OSS into the flow of liability and responsibility.


I guess it depends whether you want your system to work, or whether you just want it to be not your fault when it breaks


It's not as straightforward a choice as it may seem. In theory Linux would be a better choice, but there simply isn't the infrastructure or IT staffing in place to manage millions and millions of Linux desktops. I'm not saying it can't be done, but for various reasons it hasn't been done, and that's a major practical roadblock. From a staffing perspective alone, if you hand millions of Linux desktops to lifelong Microsofties you're begging for disaster.


For sure, no question! There's a reason people choose Microsoft. My question was narrower, just the question on reliability (hence "all else being equal"). I don't think you can say that, leaving aside issues like this, that Windows is as or more reliable than Linux.

For instance, if you had to deploy a mission critical server, assuming cost and other software were the same, would you choose Linux or Windows for reliability? Of course you would choose Linux.


Well, with the proliferation of systemd and all the nightmares it's caused me over the past decade, I actually might. But thankfully BSD is an option.

But Linux isn't immune from this exact sort of issue, though - these overgrown antivirus solutions run as kernel drivers in linux as well, and I have seen them cause kernel panics.


>Windows is high reliability.

Depends, I think. When I was working as a supermarket cashier the tills had embedded XP. In 2 or 3 years it rarely had issues. The rare issues it did have were with the Java POS running on top.

Windows 10 for my home desktop crashed a lot more and just seems to have gotten more "janky" with time.


> Windows is high reliability.

lol no


The people working in those professions are; their bosses and their IT departments are not. IT security is treated as a solved problem - if you deploy enough well-known solutions that prevent your employees from working, everything will be Safe from CyberAttacks. There's an assumption of quality like you'd normally have with drugs or food in the store. But this isn't the case in this industry, doubly so in security. Quality solutions are almost non-existent, so companies should learn to operate under the principle of caveat emptor.


There are sooo many companies in the world, when snowflake or crowdstrike or solarwinds has an issue, it’s going to touch every industry.


This company has post-apocalyptic style photos to make you panic-buy their solution.

https://ibb.co/Bc6n527

"62 minutes could bring your business down"

I guess they could bring all the businesses down much quicker.

https://www.crowdstrike.com/en-us/#teaser-79minutes-adversar...

(Repeating my comment because other story is duped)


> (Repeating my comment because other story is duped)

Please don't do this! It makes merging threads a pain. It's better to let us know at hn@ycombinator.com and then we'll merge the threads so your comment shows up in the main one.


I'm so sorry, TIL.


I'm curious why this post is still not the 1st (but 2nd after an ebook reader announcement), despite all the upvotes.


Lots of people commenting drags down the position on the front page.


10 hours ago someone posted a critical post about CrowdStrike on the "wallstreetbets" subreddit.

https://old.reddit.com/r/wallstreetbets/comments/1e6ms9z/cro...


Edit; it appears my comment has been moved to a top level comment, i.e. a peer of its parent, without any way of telling what happened - so now there is a whole other pointless branch polluting the relevance of the tree.

Previously;

It appears that someone was able to take my previous comment in this thread completely off hacker news, it's not even listed as flagged. It was at 40pts before disappearing, perhaps there is some reputation management going on here. If it was against the site rules it would be helpful to know which ones.

Edit; the link is https://news.ycombinator.com/item?id=41007985 - it was a high-up comment that no longer appears, even though flagged comments do appear. I checked whether it had been moved, but the parent comment is still the same. This feels like being hellbanned, in that there isn't an easy way for me to see if I've been shadowbanned. But I really don't know. I was commenting in good faith.


(I've detached this offtopic subthread from https://news.ycombinator.com/item?id=41002977.)

Your original comment was https://news.ycombinator.com/item?id=41007985. I detached it from its original parent (https://news.ycombinator.com/item?id=41002977) because it was more of a generic tangent than a specific reply.

It's a vital moderation function to do this, particularly when the parent is the top comment of the entire thread. Those tend to attract non-reply-replies, and that has bad effects on the thread as a whole. It causes the top part of the page to fill up with generic rather than specific content, and it makes the top subthread too top-heavy.

I'm not saying that you did anything wrong or that your post was bad or that it was unrelated to the original parent. The problem is that the effects I'm describing pile up unintentionally and end up being a systemic problem. It isn't anybody's fault, but there does need to be someone whose job it is to watch out for the system as a whole, and that's basically what moderators do.

Sometimes we comment that we detached a post from its original parent (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...) and sometimes not. (Perhaps the software should display this information automatically.) I'm less likely to do it when a comment stands on its own just fine, which was the case with your post https://news.ycombinator.com/item?id=41007985 and which is usually the case with the more generic sort of reply—in fact it's one test for deciding that question.

> so now there is the whole other pointless branch polluting the relevance of the tree

Yes, please don't do that—especially in the top subthread. I understand the frustration of "WTF where did my comment go", but you can always reach us at hn@ycombinator.com and get an answer.


Oh, never mind, it's just that a third-party cybersecurity tool running on the server detected a potential threat and quarantined the offending database record, just in case!


I was looking forward to spending the day talking to people about cyber security but if my comments are going to disappear like that then maybe hacker news is not the site for me. A shame really.

Edit; I don't know for sure, but this is possibly the last straw for me on hacker news. It really has gone downhill. If good faith discussions from experts are being secretly deleted for what I can only now assume are nefarious reasons, then I can't trust that what I find here is in any way representative. It's unfortunate in that there really isn't anywhere else to go. Now my best discussions are in small WhatsApp groups / Discords with friends. That's ok for me, since I've had a career in which to get to know people personally and build such groups, but if public forums are tainted in this way then younger people in this field will end up only talking to each other.


I appreciated your comment and saw it earlier before it was detached. Thank you for sharing it. It got decent visibility to readers, as your points suggest. I suggest you cut the mods some slack for adjusting things on one of the most heavily trafficked threads ever. The phenomenon dang describes, of only semi-related replies to the top comment, does exist, and I myself have gotten higher points on posts that benefit from it, unintentionally and intentionally (didn't realize that was abuse, sorry dang). I think we are better off with an ecosystem that limits such point/visibility seeking or accidental behavior, even for good content. Don't take it personally.

(I do think there should be some way to skim for, say, the top X% rated comments, particularly on mega threads, somewhat like there was/is on Slashdot with its point filtering. This would have helped visibility for a detached comment like yours, would reduce the ordering benefit dang mentions for those using it, and would improve usability more generally for busier readers. But that's my 2 cents. These things always cut multiple ways.)

A megathread is always a tough place to add value, on any platform. Who am I, but I appreciate you and your comments and hope you continue to share with a broader audience that includes me here.


HN has pagination plus it could be just a bug...


I was triple downvoted in ~20s before it disappeared, so I noticed very quickly, because I keep an eye on points to see if people have interacted with something I've posted. I then scanned through the peer comments and noticed that a bunch of other comments had been very freshly flagged, within the ~40s since the last time I checked. Those comments had been around for over an hour, so the odds that they were all independently flagged so quickly at exactly the same time are highly improbable. And I couldn't see whether my own comment had been flagged, even though I could see that others had been.

It's possibly a bug, but I've seen similar behavior before and it was due to flagging, just without the flagged marker appearing until later. I think it's a variation of hellbanning but for a single post. It was easy to notice because not only did the points go down just before it disappeared, they also stopped going up, as they had done reliably before it was removed.


I could see your comment. Another option: voting bots.


I can see you lamenting the loss of your internet points. Chin up


I still keep the points, so perhaps it's a loss of possible future points. The points don't bother me; I restart anonymous accounts on occasion. I kind of use points to get an idea of where other people's opinions lie, which is half the reason I use this site. A negative signal on something I think is good is actually more interesting than a positive signal. I'm more worried about the damage such actions do to the 'marketplace of ideas', and whether, even if it hasn't pushed me away, it has pushed away others I'm interested in hearing from. And if so, where have they gone? Once I become disinterested in the opinions of others on this site, it's unlikely I'll have any further use for hacker news - and I'm getting pretty close to that point.


what was the comment about?


I edited my post to include the link. I guess if people can see these posts I'm not completely banned, and maybe there isn't anything nefarious going on; the replies to a single comment are already spilling off the first page, so there could be a software bug. But it does appear that there is some sort of reputation management going on.


FYI I do see https://news.ycombinator.com/item?id=41007985, it's currently on page six (out of ten) of this megathread and it looks normal (not dead or flagged).


Ah thanks, I guess it's been moved to its own top level comment - I did check, but only up to page 4. It's weird because this chain of comments, which is otherwise off topic, doesn't have anywhere near the points (2 (edit: now 0) vs 37), and has the same parent, is on the first page. So I'm not sure if that was the right remedy. Hacker News needs better tooling for this, or should at least let me know if something has been moved in order to flatten out the tree.


closed source will always fuck you in the ass


This is why I don't use Windows and refuse any SWE jobs that require Windows machines. Additionally, I believe kernel-level game anti-cheat software should be banned.


[flagged]


Please don't cross into the flamewar style in HN comments. It's not what this site is for, and destroys what it is for.

We detached this subthread from https://news.ycombinator.com/item?id=41007791.


Sure this isn't Windows' fault but it's hardly a "high reliability OS"


The reason people "can't learn" how to operate alternative software is because we don't give software the weight it deserves. We don't consider it crucial, but evidently it is.

When a new surgical technique is standardized, we don't tell doctors "well don't worry about it - we can't expect you to learn cryptic things!" Because we understand the gravity of what they're doing and how crucial that technique may be.

For whatever reason, software is still treated like the wild west and customers/employees are still babied. They're told it's all optional, they don't need to learn more. We still tell Windows users its fine to download executables online, click "okay!" and have them run as Admin. And that's the root cause of why we're in this mess.

We have safer computing environments - just look at iOS or the Mac. Even Microsoft is slowly trying to phase this out with the new Windows store. But alas, we cannot expect anyone to change anything ever, so we still use computers like it's 1995.


Assuming that other operating system, whether high-reliability or not, are necessarily "cryptic" and unnecessarily impair people in their ability to "get shit done" is naive at best and disingenuous at worst.


[flagged]


It's ridiculous how many people here are blaming Windows for something that's really completely unrelated


Obligatory XKCD : dependency [0]

[0] https://xkcd.com/2347/


I assure you that Crowdstrike is being paid very well for their software.


For sure, but Crowdstrike ultimately rests on a subproject some random person in Nebraska has been thanklessly maintaining since 2003 :p


worded badly whatever


Because it's not a Windows update.


Because it's not a Windows update perhaps


> worded badly whatever

I feel your pain. Perhaps it is time to increase the karma level required for downvoting.


Flippant commentary:

Thank fuck Netflix runs on Linux. I just hope the full chain from my TV to Netflix is immune...


It actually runs on a lot of FreeBSD iirc


Netflix uses FreeBSD for certain parts, but IIRC it's Linux outside of the streaming delivery.

(Some of the team members lurk around here, maybe we'll get lucky and one comments)


I'd say regardless of the OS, you might find a company like Netflix is less likely to impose security-theatre box ticking exercises.

Which makes it less likely to take a 3rd party agent from a snakeoil company that sells to execs, then embed it at low levels into mission critical services with elevated privileges, then give it realtime external updates that can break your platform at any point.


Even better!


The USA can no longer be trusted to supply the technology the rest of the world imports.


DO NOT REDEEM SAARRRRRRRRSSSSS! BLODDY BASTARDS INVALID FORMATING SARRRRRRSSSSS!!


Don’t be fooled, it’s Skynet, head to the bunkers!


Username "mehh" has been noted.


Is this something that could be solved by building AI code review directly into git clients? I can't help thinking Claude 3 would have caught this.


Yes, AI already solved it. My god, why are people so high on LLMs solving everything?


Poe's law in action.


LMAO

Yeah let's throw an LLM at the C++ kernel driver and auto-push to prod


This is good and bad. It showcases the importance of CrowdStrike. This is a short term blip, but in the long run they will learn from this and prevent this type of issue in the future. On the flip side, they have a huge target on their back for the U.S. government to try and control them. They are also a huge target for malicious actors, since they can clearly see that CS is part of critical US and western infra. Taking them down can cripple essential services.

On a related note, this also demonstrates the danger of centralized cloud services. I wish there were more players in this space and that governments would try their very best to prevent consolidation in it. Alternatively, I really wish CS did not have this centralized architecture that allows for such failure modes. The software industry should learn from great, age-old engineering design principles. For example, large ships have watertight doors that prevent compartments from flooding in case of a breach. It appears that CS didn't think the current scenario was possible and therefore didn't invest in anything meaningful to prevent this nightmare scenario.


I'm not that confident that they're going to be around to recover after their stock price falls into the toilet and they get sued out the yin-yang. I don't think 'read the EULA terms lol' is gonna cut it here.


> This is a short term blip

No security engineer in the world is going to trust the words CrowdStrike after this.


Security engineers are the ones who came up with this crap in the first place. Sales people are not to blame; they'll sell anything.


Or, and that maybe a radical idea, YOU DON'T INSTALL THIS FUCKING SNAKE OIL IN THE FIRST PLACE.

The idea of antivirus software is laughable: if Adobe cannot implement a safe and secure PDF parser, then how can Crowdstrike, while simultaneously supporting the parsing of a million other protocols?

Everyone involved: Vendor, operator, and auditors who mandate this shit are responsible and should be punished.

YOU HAVE TO MINIMIZE THE ATTACK SURFACE, NOT INCREASE IT.



