Microsoft has serious questions to answer after the biggest IT outage in history

Lx1oG-AWb6h_ZG0 · 2024-07-19T15:40:25.000000Z

Apparently Crowdstrike also brought down Linux hosts in the same way in April but it didn’t get widely reported: https://news.ycombinator.com/item?id=41005936

jamescun · 2024-07-19T15:39:15.000000Z

Not sure what questions Microsoft have to answer. A third-party vendor shipped defective software.

I guess the only question they could answer is why they don't provide a framework like Apple do with Endpoint Security for third-party vendors to use.

Daviey · 2024-07-19T15:45:04.000000Z

Because an essential enterprise security application was /able/ to bring down an entire OS like this. The issue is that Microsoft doesn't provide an interface for an application to operate in user-space to have the functionality it requires.

Linux has eBPF which can provide most of the capability that Crowdstrike needs, by using an "in-kernel verifier which performs static code analysis and rejects programs which crash, hang or otherwise interfere with the kernel negatively". If MS had this functionality, it is likely this incident would not have happened.

That said, from personal experience on Linux it's been an extremely long time since a bad kernel module has rendered a system entirely FUBAR'd.

(To Microsoft's credit, they have begun copying the eBPF methodology to Windows, but it is still in its infancy https://github.com/Microsoft/ebpf-for-windows/ ).
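
To make the "verifier rejects programs before they run" idea concrete, here is a toy sketch. It is not the eBPF verifier and does not use any eBPF API; the three-opcode instruction set, the `verify` function, and the `memory_size` parameter are all made up for illustration. It only shows the general shape: inspect the program statically, and refuse to load it if it could loop forever or touch memory out of bounds.

```python
# Toy illustration only: this is NOT the eBPF verifier, just the idea of
# rejecting a program by static analysis before it is ever executed.
from typing import List, Tuple

Instr = Tuple[str, int]  # hypothetical mini instruction set: (opcode, operand)


def verify(program: List[Instr], memory_size: int) -> Tuple[bool, str]:
    """Statically check a program without running it."""
    if not program or program[-1][0] != "exit":
        return False, "program must end with exit"
    for pc, (op, arg) in enumerate(program):
        if op == "jmp":
            # Relative jump: the target is counted from the next instruction.
            if arg < 0:
                return False, f"backward jump at {pc} (possible infinite loop)"
            if pc + 1 + arg >= len(program):
                return False, f"jump out of bounds at {pc}"
        elif op == "load":
            if not 0 <= arg < memory_size:
                return False, f"out-of-bounds load at {pc}"
        elif op != "exit":
            return False, f"unknown opcode {op!r} at {pc}"
    return True, "ok"
```

A loader built this way would call `verify` first and only hand the program to the privileged side if it returns `True`. The real thing is far more involved, but the design point is the same: the check happens before execution, not after a crash.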

jcranmer · 2024-07-19T15:57:19.000000Z

It's possible for a badly-written eBPF policy to prevent any application from starting up, AIUI, so that's more or less the same situation isn't it?

keneda7 · 2024-07-19T16:22:42.000000Z

Crowdstrike brought Linux machines down earlier this year in April. There are several posts in this thread about it.

netdevnet · 2024-07-24T08:45:10.000000Z

> Linux has eBPF which can provide most of the capability that Crowdstrike needs, by using an "in-kernel verifier which performs static code analysis and rejects programs which crash, hang or otherwise interfere with the kernel negatively". If MS had this functionality, it is likely this incident would not have happened.

It didn't stop Linux machines from being down so it is clearly not as easy as you put it. The reality is that writing software is hard yet devs often trivialise it to their own detriment

Daviey · 2024-07-24T08:55:20.000000Z

The issue I am raising is /design/, not /development/. The current model of an unconstrained, unforgiving, highly privileged execution space is a bad design; that is what eBPF tries to address.

netdevnet · 2024-07-24T12:42:27.000000Z

It didn't make a difference though. Linux still went down, so clearly the design is not enough.

Daviey · 2024-07-24T13:22:22.000000Z

It is a different issue[0]. The Linux issue from April was a Linux Kernel bug[1], that CS Falcon happened to trigger. The design to use eBPF is sound, but the implementation on the kernel side had a bug.

Also, CS Falcon didn't support RHEL 9.4 (only up to 9.3), so for this specific bug you highlighted, CS should not be held accountable for regression testing, because it was a platform they did not support.

With Windows, the current design is poor because it offers no way to run this kind of code in a safe manner. Most recently, it appears MS is blaming the EU for forcing them to create an interface for services such as CS to run[2]. Rather than lean into the problem and create a good design, they didn't create security boundaries - risking the entire system.

Bugs happen, and Linux will continue to harden and be more resilient - but unless MS focuses on secure design in this area, things like this will continue to happen (same as they have with AV before).

  [0] https://access.redhat.com/solutions/7068083
  [1] https://access.redhat.com/errata/RHSA-2024:3306
  [2] https://www.forbes.com/sites/davidphelan/2024/07/22/crowdstrike-outage-microsoft-blames-eu-while-macs-remain-immune/

politelemon · 2024-07-19T15:48:24.000000Z

Might be editorialised by OP, or Sky changed the title. It is currently:

"Serious questions to answer after what could be the biggest IT outage in history"

landr0id · 2024-07-19T15:51:16.000000Z

Assuming Sky since the URL slug shows "Microsoft"

landr0id · 2024-07-19T15:50:09.000000Z

>Not sure what questions Microsoft have to answer.

The only thing I could think of is if it was a driver update, the driver has to be "WHQL" signed. WHQL stands for "Windows Hardware Quality Lab" -- what quality are they ensuring? (spoiler alert from my time at Microsoft: it's not terribly robust :p )

It's not realistic for Microsoft to test drivers in a manner that represents real-world usage, but perhaps they need to start doing some basic "it works with whatever integrated agent/etc is required" testing as a requirement for signing a driver.

If it was a user-mode update? Yeah no real fault on Microsoft here.

KHRZ · 2024-07-19T15:59:13.000000Z

From what I heard Crowdstrike just updated their DB file, which means the bug was already there, waiting for someone to trigger it with a "low risk" quick rollout.

ghthor · 2024-07-20T05:10:01.000000Z

So kind of like the xz exploit, carefully placed and lying in wait.

I only hope this was a good guy move by someone to knock a placed chess piece off the board.

drpossum · 2024-07-19T15:43:58.000000Z

You're confusing the Crowdstrike issue with Azure being down. Microsoft is ultimately responsible for anything regarding Azure even if it was a vendor that did something wrong because they choose their vendors

danbruc · 2024-07-19T15:45:56.000000Z

The article is about the CrowdStrike incident and not the Azure configuration issue.

velcrovan · 2024-07-19T15:41:11.000000Z

“The Entire Culture Around So-Called ‘Software Engineering’ And Our Collective Failure to Build Strong Legal Institutions Around That Culture has serious questions to answer after the biggest IT outage in history”

there fixed it for you

zdragnar · 2024-07-19T15:46:52.000000Z

This is an odd take. My electrical power goes out more often than Windows crashes.

velcrovan · 2024-07-19T16:09:32.000000Z

Maybe you're in the ERCOT service area? I manage IT for a global firm based in Minnesota and that is definitely not my experience. I think perhaps you unintentionally make the point that the electrical grid provides a demonstration of the value of regulation and the consequences of neglecting or dismantling it.

zdragnar · 2024-07-19T16:26:51.000000Z

Nope, I just happen to live in a rural area with lots of trees and a few really strong storms a year. A tree going down on a power line or a car crashing into a pole is pretty inevitable, despite the company's best efforts to keep the lines clear, including regularly cutting back growth.

I don't recall the last time I've had Windows itself crash on me... Two years maybe?

velcrovan · 2024-07-19T16:57:12.000000Z

Ask my engineers...Dell and Lenovo laptops crashing due to thermal issues while using CAD...graphics driver updates causing crashes. I think we've worked those out but those are some recent examples. For vanilla office users I agree it's comparatively uncommon.

But we're not talking about isolated Windows crashes. We're talking about the impact of a single update to low-level software rolled out to thousands of critical systems as a security compliance checkbox-filling measure. No regulation, no oversight. I'm not calling for Joey Highschooler to need a government-issued license to hack on his SaaS, I'm saying that all components of critical infrastructure should be subject to engineering best practices, including software engineering and software management.

adrianN · 2024-07-20T04:19:34.000000Z

A tree going down on a power line is a solved problem in most of the civilized world: Just cut down trees that are too close to power lines. Having a few dozen meters of buffer or a sturdy railing between roads and power lines also prevents cars from crashing into them.

gjsman-1000 · 2024-07-19T15:48:26.000000Z

At a minimum, at a minimum, we should absolutely demand that Microsoft abandon requiring Microsoft Accounts to set up Windows. They have lost all rights to do that with a straight face.

velcrovan · 2024-07-19T16:50:44.000000Z

Maybe that is relevant to home users but there's no relevance to today's outage. And if you're in a business setting I'm not sure why you wouldn't be using a domain controller and/or Entra ID and Intune to manage your Windows endpoints.

blackoil · 2024-07-19T16:11:00.000000Z

A third party s/w crashed because users failed to validate deployments. So from MS perspective users can't be trusted and need to be further locked down.

bebop · 2024-07-19T15:51:38.000000Z

You can still use local users, you just have to work a little harder to do so.

rolph · 2024-07-19T16:46:22.000000Z

for me that translates as opt-out friction, vs it just opts right in.

commandlinefan · 2024-07-19T15:35:57.000000Z

I predict that the people who were actually responsible (the "deadline above all else" crowd) will not be the ones who are actually blamed.

johnnyo · 2024-07-19T15:34:26.000000Z

I don’t see how this is Microsoft’s fault or issue.

MS can’t prevent a software vendor from breaking the machine.

drpossum · 2024-07-19T15:40:50.000000Z

I'm going to follow the guidelines and not be snarky: Microsoft is not some weak company at the mercy of the market. They choose their vendors (they also actively throw their weight around with vendors) and MS on top of that is capable of doing anything in-house (or buying it to bring it in house).

The stance of "let's not hold companies accountable for cutting corners" is one reason everything is getting worse. It's because we collectively let it get worse.

alwa · 2024-07-19T16:05:57.000000Z

Is it possible that you’re talking about a different incident? In the incident at hand, didn’t enterprises in question choose the EDR vendor, not Microsoft?

Is the implication that Microsoft should be compelled to develop its own EDR product at a level of sophistication comparable to what CrowdStrike offered, and compete with them on that basis?

It feels strange to me to hold Microsoft accountable for the poor design decisions of firms who just develop third-party software on their platform.

johnnyo · 2024-07-19T15:45:06.000000Z

I’m not following your line of reasoning, can you clarify?

Is the argument that it was MS responsibility to bake something like this in at the OS level? And if they did it would be more robust?

I’m not sure I agree. MS has already gotten in trouble for monopolistic practices before, so from a legal standpoint, I’m not sure that’s the best course of action.

MattGaiser · 2024-07-19T15:46:34.000000Z

Except that doing those things could be viewed as monopolistic or anti-competitive behaviour.

If Microsoft is responsible, then they also need greater control. If Microsoft isn't supposed to have that kind of power, they cannot be blamed.

georgeplusplus · 2024-07-19T16:15:30.000000Z

>>>It's because we collectively let it get worse

What would you suggest we do to make sure it doesn't get worse?

A large majority of the population probably has no idea of the implications of this outage or what to do about it, because most are tech deaf.

MattGaiser · 2024-07-19T15:40:09.000000Z

They could basically eliminate vendors and make everyone use a Microsoft tool. But I suspect many would object to that solution also.

diego_sandoval · 2024-07-19T15:43:58.000000Z

Then they'd get sued by the EU for bundling these tools into the OS and eliminating the market for third party vendors.

MattGaiser · 2024-07-19T15:49:25.000000Z

Exactly. If you are going to demand a ton of different companies build different parts of a product, integration issues are unavoidable.

traveler1 · 2024-07-19T16:13:37.000000Z

Drivers have the right to crash the system in my books - software doesn't. They need to take a stronger stance on antiviruses and kernel based software in general and push Defender as the de facto antivirus for Windows.

dzonga · 2024-07-19T15:58:04.000000Z

Because the underlying OS they provide allows kernel access. If they had Windows fence off the kernel and maybe provide a security API, then this whole thing wouldn't be an issue.

jgeada · 2024-07-19T15:44:18.000000Z

WTF? Isn't that exactly one of the main jobs of the OS, to not crash regardless of what user-space software is doing?

zdragnar · 2024-07-19T15:48:28.000000Z

Endpoint protection is hardly user-space software. It gets deeply privileged access to the entire system.

swozey · 2024-07-19T16:01:37.000000Z

There's also the argument that a business OS that you spend thousands or pay a monthly licensing fee for should be hardened enough already to not need software like Crowdstrike. But I'm also completely ignorant to what it actually does and how critical it is.

I used to be a Windows Engineer in webhosting (RAX, Hostgator, 2-3 others), I assume before this software existed, and I had to hand-craft an insane amount of security services in posh and python. When I first got into Windows syseng stuff, I think IIS5 so win2k IIRC, IIS didn't have something as simple as URL Rewrite abilities. You had to buy a 3rd party package for EACH server at $25 or write one, and I had thousands of servers. Zero thought about people actually using IIS for webhosting. I had to make my own brute force detection service that continuously monitored eventviewer for an RDP permission denied error code, then wrote that IP to the Windows firewall. All this stuff is an apt-get away in Linux. Windows Server is so shockingly barebones and to be quite frank most Windows syseng people aren't the best engineers and wouldn't think to make almost any of this. On many of my teams I was the only one who could program.

We'd put servers up without a firewall and post their IPs on irc and see how long it took someone to pop one, if they didn't get popped before we got back to our NOC.

I dealt with that OS from sysadmin 1-3 over 10 years. I am so goddamned happy everything is an ephemeral Linux container now.
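
For anyone wondering what the brute-force detector described above roughly looks like, here is a minimal sketch of the counting part. The `LOGON_FAILED src=<ip>` log format, the `ips_to_block` name, and the threshold of 5 are assumptions for illustration, not the real Windows event log format or the original service. Actually adding the firewall rule is left out, since that depends on whatever firewall tooling is in use.

```python
# Sketch only: the log line format and the threshold are hypothetical.
import re
from collections import Counter
from typing import Iterable, Set

# Assumed flattened log format: "LOGON_FAILED src=<ipv4 address>"
FAILED_LOGON = re.compile(r"LOGON_FAILED src=(\d{1,3}(?:\.\d{1,3}){3})")


def ips_to_block(log_lines: Iterable[str], max_failures: int = 5) -> Set[str]:
    """Return source IPs with at least max_failures failed logons."""
    failures: Counter = Counter()
    for line in log_lines:
        match = FAILED_LOGON.search(line)
        if match:
            failures[match.group(1)] += 1
    return {ip for ip, count in failures.items() if count >= max_failures}
```

In practice this would run in a loop over new log entries, and each returned IP would be handed to the firewall as a block rule.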

johnnyo · 2024-07-19T15:47:47.000000Z

I think the idea is that CrowdStrike doesn’t run in user space.

If an Nvidia driver had bricked the machines, would that be MS fault or Nvidia fault?

flohofwoe · 2024-07-19T15:55:44.000000Z

IME a graphics driver crash recovers just fine on Windows. The screen goes black for half a second and you're back in business without losing progress.

velcrovan · 2024-07-19T17:56:35.000000Z

I've had NVidia drivers bluescreen Windows 10 and 11 machines within the past six months.

diffeomorphism · 2024-07-19T15:53:54.000000Z

Why not both? I am perfectly happy to blame multiple parties, not just one.

theblazehen · 2024-07-19T15:56:54.000000Z

It wasn't user space, it installs a kernel mode driver

fifteen1506 · 2024-07-19T15:50:08.000000Z

it was a kernel mode driver.

williamstein · 2024-07-19T15:35:57.000000Z

Apple

duxup · 2024-07-19T15:38:08.000000Z

Does Apple have a better method for preventing something like that?

I love my Mac, but I've had crashes that I suspect were caused by an application.

jnwatson · 2024-07-19T15:45:45.000000Z

They have sunset third party kext files. That means if the kernel crashes, it is Apple's fault.

I once did a little MacOS driver development and had a kext signing key. It was an unforgiving, poorly-documented environment. Good riddance.

rawgabbit · 2024-07-19T15:43:50.000000Z

https://support.apple.com/guide/security/operating-system-in...

duxup · 2024-07-19T16:03:59.000000Z

conscion · 2024-07-19T15:47:15.000000Z

Large corporations buy Windows _because_ they can have this level of control over their machines. The CTOs and auditors want to be able to say they've personally secured their systems using "top of the line" security software.

bradford · 2024-07-19T16:07:22.000000Z

I see discussion about who's at fault: Microsoft or Crowdstrike.

But one thing I don't get about this: what was the role of the enterprise admins?

Most administrators at large companies are cautious about rolling out new software versions to their employees. They (normally?) test before broad deployment.

Seems like one of three things would have had to have happened for this to be missed:

1. Admins ignored testing this update prior to enterprise rollout.

2. Crowdstrike forced the update on unwilling users.

3. Crowdstrike does not provide a framework for such pre-rollout testing, and enterprises chose to use it anyway.

Can anyone offer insight?

[Disclosure: I'm a Microsoft employee, but not an enterprise admin]

viridian · 2024-07-19T16:35:39.000000Z

> Most administrators at large companies are cautious about rolling out new software versions to their employees. They (normally?) test before broad deployment.

In my experience at both a 70,000-person company and a 260,000-person company, both of which I can confirm have outages right now, this just isn't the case.

The security vendor says update and sysadmins say "right away", because the institution has learned that "right away" is the only acceptable answer from auditors, both internal and external.

This story is interesting because there's an entire chain of places you can pass the buck and absolve responsibility if you so choose. You could, if you so desired, choose to blame:

1. The crowdstrike developer who pushed the change

2. The developer responsible for the kernel bug

3. crowdstrike as a company for not having better change management

4. microsoft for how they handle kernel access

5. system admins for not owning the update process of their entire body of devices

6. security teams / the CISO for operating on checklists that exist to please auditors rather than treating security as a living, breathing problem

7. Auditors for structuring security audits as a checklist rather than treating security as a living, breathing problem

8. Regulators for using one size fits all audits as the preferred method of determining security compliance

swozey · 2024-07-19T16:18:17.000000Z

As a previous IT Manager (SRE last decade), honestly, most IT Admins I know literally do nothing beyond click Auto Update checkboxes and let things churn until they break. I hate to put that career down but I worked at all levels of it starting from tech support in a call center. It's a very easy job to get complacent with your skill set, get comfortable, and really just half-ass things. I have a lot of friends that are on IT teams and most of them don't have any interest in what I do, like learning to write golang, rust, python, learn kubernetes, docker, etc. I tell them all the time about how much money they can make if they really buckle down and learn to program or just learn a cloud and terraform. They all bitch about how much they hate their jobs because they're doing basically line tech support but are fine where they're at. I had horrible IT jobs so I'm super sympathetic to it and always try to hire them when I can. I hired one last year, not a nepotism hire, I didn't know him, but he was an IT guy wanting to move into SRE.

They just use Windows and let it do its thing. It's their day job and they don't work on improving so their skillset is super subpar. To them it's all they need to do their job. I don't have that personality, I'm obsessively min-maxing things.

I even worked at MSPs (Managed Service Providers) so I did IT/Network admin work for tons of companies around different cities and every single MSP just either puts everything on auto update or has a strict rule of NOTHING on auto update which just means nothing ever gets updated until a customer calls in for $$ hourly support. You throw the updating in because you get more hours. Or you have a scheduled update ticket every month/etc.

I also saw a comment or a meme about Crowdstrike being able to update whenever wherever, no idea if that's true or not.

I wrote this post from my point of view which is that of being a Windows SysEng at 4-5 big webhosts over a decade. I had to write a TON of my own security and backup and whatever services because Windows was so barebones and at the fleet (tens of thousands) server level I was managing, with the revenue webhosting makes, we definitely couldn't buy expensive software. Most of my customers paid $10/mo and we were cramming thousands of them on one server, so the licensing was a huge pain for any software.

https://news.ycombinator.com/item?id=41007824

luma · 2024-07-19T15:58:20.000000Z

I'm not sure what to make of this but I'm noting something odd this morning: coverage of this event out of the UK near-unanimously is laying this outage on Microsoft. BBC ran a story this morning that didn't mention Crowdstrike until the 4th paragraph, and headline after headline is repeating the message that Microsoft caused a global outage.

Reporting from the US and elsewhere seems to be a bit more on point. Is it just because the Brits went to press earlier in the day before the problem was understood?

plesner · 2024-07-19T16:06:56.000000Z

There seemed to me to be a clear shift in focus from CrowdStrike to Microsoft somewhere along the way, maybe a little while after George Kurtz's message. I was wondering if it was either spin or the media collectively deciding that people understand what MS is better than CS.

multimoon · 2024-07-19T16:24:52.000000Z

I can’t explain enough how much I dislike Microsoft the corporation, but this wasn’t their fault - a 3rd party kernel driver crashed the system.

kkfx · 2024-07-19T16:06:42.000000Z

Well... The only question should be architectural:

- why automatic, silent upgrades

- why no boot environments/generations at boot to reboot into a previous snapshot of the system (since NTFS does have snapshots), meaning why no integration between the storage and the system management

- why massive rollout instead of partitioned testing rollout slowly propagating

For the rest, it's a third party tool, not mandated by the vendor, so... it's a user choice.
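
The "partitioned rollout slowly propagating" point from the list above can be sketched in a few lines. This is only an illustration under stated assumptions: `staged_rollout`, the ring fractions, and the `apply_update` / `is_healthy` callbacks are hypothetical names, not anything CrowdStrike or Microsoft actually ships.

```python
# Sketch only: ring sizes, callbacks, and host names are hypothetical.
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    ring_fractions: Sequence[float],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Update hosts ring by ring; halt after the first ring with an unhealthy host.

    ring_fractions are cumulative, e.g. (0.01, 0.1, 1.0).
    Returns the hosts that received the update.
    """
    updated: List[str] = []
    start = 0
    for fraction in ring_fractions:
        # Each ring ends at the cumulative fraction and holds at least one host.
        end = min(len(hosts), max(start + 1, int(len(hosts) * fraction)))
        ring = hosts[start:end]
        for host in ring:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in ring):
            break  # halt: later rings never receive the update
        start = end
    return updated
```

With 100 hosts and rings of 1%, 10%, and 100%, an update that breaks machines stops after a single host instead of reaching all of them. It does not cover the other item in the list, rolling a broken host back to a previous snapshot.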

duxup · 2024-07-19T15:32:39.000000Z

This is a pretty empty article that just seems like spin on the current issues going on.

arshiiita · 2024-07-20T02:39:41.000000Z

The problematic driver was downloaded from Microsoft managed infrastructure, even though it was a third-party module. Microsoft needs to do a better job at running integration tests between the Windows kernel and driver updates for sure. They can’t publish security updates without running integration tests; this is the basics of software engineering. Windows is their product, not Crowdstrike's. 100% Microsoft failure here.

1vuio0pswjnm7 · 2024-07-19T22:31:22.000000Z

I have been 100% Microsoft-free in all computers and networks I control for decades. I know I should miss Excel, etc. but strangely I feel like I have sacrificed nothing. There is more I can do without Microsoft than with it.

geodel · 2024-07-19T15:38:33.000000Z

The answer is a Cloud based Windows OS, which MS has been working towards for many years.

MattGaiser · 2024-07-19T15:38:58.000000Z

I'd be curious whether people would want the cost effective fix for this, which is basically to eliminate vendors for anything important.

Near complete vertical integration of security, like with Apple.

lloydatkinson · 2024-07-19T15:54:26.000000Z

What an infuriatingly poor article, Tom Clarke should be ashamed.

> A software update from cybersecurity company CrowdStrike has now taken a large number of those machines offline.

So Tom opens the article with the admission that it is CrowdStrike, not Microsoft.

> Thankfully, the update that caused the Microsoft meltdown did not affect these other software families - if it had, the impacts could have been catastrophic.

This is such a strawman (like the rest of the article honestly) I don't know where to begin. Inflammatory language.

A fucking "meltdown"? A meltdown of Microsoft, no less? Putting aside the fact that Microsoft and Windows are not the same thing, it is again nothing "meltdown" like that Microsoft did or could do.

> There are serious questions of course for CrowdStrike. As a leading provider of security software for large companies like Microsoft.

Tom, were you paid by CrowdStrike or what? What do you mean "of course"? It is literally the only party that should be answering questions here. I suspect that even if Tom were to "question" Microsoft, their answer about kernels, drivers, privileges, and how shipping seemingly untested code into the core of an operating system is a bad idea wouldn't even be comprehensible for him.

> The situation may also lead to calls from Microsoft users about what more the company could do to ensure products made for their software aren't going to cause major outages like this one.

This is getting absurd now, and I just can't give more energy to this. OK, it could now insist only memory safe languages such as Rust are allowed for drivers. Or outright permanently blacklist drivers from certain vendors. The bitching and moaning from manufacturers would then, of course, have people like Tom writing articles like "Microsoft is making manufacturers' lives harder, think of the poor IT professionals!".

> Any engineer will tell you over-reliance on one system leaves you open to a "single point of failure". Critical digital infrastructure has to have redundancy - back up systems - built in to ensure it is resilient.

Please Tom, tell us more on your thoughts about memory safe languages, failure recovery modes, the monolithic kernel vs microkernel debate, and how it's just a simple matter of overnight making operating systems "not a single point of failure".

This entire article is some kind of exercise in trying to get everything wrong while meeting a minimum word count, and I bet with some ChatGPT thrown in there too.

I flagged this post because I think it's far below even the minimum quality level for HN. It's outright clickbait drivel.