This should not have passed a competent C/I pipeline for a system in the critica...

martinky24 · 2024-07-19T19:17:23 1721416643

A lot of assumptions here that probably aren't worth making without more info -- For example it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN code" or something, at which point it passes a lot of things before the failure.

We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

EvanAnderson · 2024-07-19T19:19:13 1721416753

I haven't seen the file, but surely each build artifact should be signed and verified when it's loaded by the client. The failure mode of bit rot / malice in the CDN should be handled.

chatmasta · 2024-07-19T22:29:34 1721428174

The actual bug is not that they pushed out a data file with all nulls. It’s that their kernel module crashes when it reads this file.

I’m not surprised that there is no test pipeline for new data files. Those aren’t even really “build artifacts.” The software assumes they’re just data.

But I am surprised that the kernel module was deployed with a bug that crashed on a data file with all nulls.

(In fact, it’s so surprising, that I wonder if there is a known failing test in the codebase that somebody marked “skip” and then someone else decided to prove a point…)

Btw: is that bug in the kernel module even fixed? Or did they just delete the data file filled with nulls?

SoftTalker · 2024-07-19T22:55:43 1721429743

The instructions that my employer emailed were:

  1. Start Windows in Safe Mode or the Windows Recovery Environment (Windows 11 option).
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
  3. Locate the file matching C-00000291*.sys and delete it.
  4. Restart your device normally.

hansvm · 2024-07-19T22:40:27 1721428827

> Btw: is that bug in the kernel module even fixed? Or did they just delete the data file filled with nulls?

Is that a real question? They definitely didn't do anything more than delete the file, perhaps just rename it.

chatmasta · 2024-07-20T00:05:10 1721433910

Yeah they have been very obfuscatory in calling this a “fix.” I watched the CEO on Cramer and he kind of danced around this point.

xyst · 2024-07-19T20:18:57 1721420337

Hindsight is 20/20

This is a public company after all. In this market, you don’t become a “Top-Tier Cybersecurity Company At A Premium Valuation” with amazing engineering practices.

Priority is sales, increasing ARR, and shareholders.

StressedDev · 2024-07-19T22:32:08 1721428328

Not caring about the actual product will eventually kill a company. All companies have to constantly work to maintain and grow their customer base. Customers will eventually figure out if a company is selling snake oil, or a shoddy product.

Also, the tech industry is extremely competitive. Leaders frequently become laggards or go out of business. Here are some companies who failed or shrank because their products could not complete: IBM, Digital Equipment, Sun, Borland, Yahoo, Control Data, Lotus (later IBM), Evernote, etc. Note all of these companies were at some point at the top of their industry. They aren't anymore.

geodel · 2024-07-19T22:49:32 1721429372

Keyword is eventually. By then C-level would've been retired. Others in top management would've changed multiple jobs.

IMO point is not where are these past top companies now but where are top people in those companies now. I believe they end up being in very comfortable situation no matter which place.

Exceptions of course would be criminal prosecution, financial frauds etc.

feoren · 2024-07-19T23:15:25 1721430925

Bingo! It's the Principal Agent Problem. People focus too much on why companies do X and companies do Y, it's bad in the long term. The long term doesn't exist. No decision maker at these public companies gives a rat's ass about "the long term", because their goal is to parasitize from the company and fly off to another host before the damage they did becomes apparent. And they are very good at it: it's literally all they do. It's their entire profession.

worik · 2024-07-19T22:35:31 1721428531

> Not caring about the actual product will eventually kill a company.

Eventually

By then the principles are all very rich, and no longer care.

Do you think Bill Gates sleeps well?

s4Brains · 2024-07-20T18:32:45 1721500365

People stop caring when they see their friends getting laid off while the CEO and head of HR get big bonuses. That what happens at most big companies with subpar executives these days.

jjav · 2024-07-20T03:12:45 1721445165

> Not caring about the actual product will eventually kill a company.

Eventually is a long time.

Unfortunately for all of us ("us" being not just software engineers, but everyone impacted by this and similar lack of proper engineering outcomes) it is a proven path to wealth and success to ignore engineering a good product. Build something adequate on the surface and sell it like crazy.

Yeah, eventually enough disasters might kill the company. Countless billions of dollars will have been made and everyone responsible just moves on to the next one. Rinse & repeat.

fsloth · 2024-07-19T20:56:45 1721422605

This is the market. Good engineering practices don’t hurt but they are not mandatory. If Boeing can wing it so can everybody.

StressedDev · 2024-07-19T22:33:28 1721428408

Boeing has been losing market share to AirBus for decades. That is what happens when you cannot fix your problems, sell a safe product, keep costs in line, etc.

bsaul · 2024-07-19T23:19:57 1721431197

i wonder how far from the edge a company driven by business people can go before they start to put the focus back on good engineering. Probably much too late in general. Business bonus are yearly, and good/bad engineering practices take years to really make a difference.

moomin · 2024-07-19T23:21:01 1721431261

The question then becomes: if the market is producing near-monopolies of stuff that is barely fit for purpose, how do we fix the market?

MBCook · 2024-07-19T22:30:39 1721428239

That’s too much of an excuse.

This isn’t hindsight. It’s “don’t blow up 101” level stuff they messed up.

It’s not that this got past their basic checks, they don’t appear to have had them.

So let’s ask a different question:

The file parser in their kernel extension clearly never expected to run into an invalid file, and had no protections to prevent it from doing the wrong thing in the kernel.

How much you want to bet that module could be trivially used to do a kernel exploit early in boot if you managed to feed it your “update” file?

I bet there’s a good pile of 0-days waiting to be found.

And this is security software.

This is “we didn’t know we were buying rat poison to put in the bagels” level dumb.

Not “hindsight is 20/20”.

SoftTalker · 2024-07-19T22:58:45 1721429925

Truly an "the emperor has no clothes" moment.

AdamJacobMuller · 2024-07-19T20:25:54 1721420754

The file was just full of null bytes.

It's very possible the signature validation and verification happens after the bug was triggered.

wk_end · 2024-07-19T20:31:43 1721421103

"Load a kernel module and then verify it" is not the way any remotely competent engineer would do things.

(...which doesn't rule out the possibility that CS was doing it.)

justinclift · 2024-07-19T21:37:50 1721425070

The ClownStrike Falcon software that runs on both Linux and macOS was incredibly flaky and a constant source of kernel problems at my previous work place. We had to push back on it regardless of the security team's (strongly stated) wishes, just to keep some of the more critical servers functional.

Pretty sure "competence" wasn't part of the job description of the ClownStrike developers, at least for those pieces. :( :( :(

soraminazuki · 2024-07-19T22:39:29 1721428769

ClownStrike left kernel panics unfixed for a year until macOS deprecated kernel extensions altogether. It was scary because crash logs indicated that memory was corrupted while processing network packets. It might've been exploitable.

usr1106 · 2024-07-19T20:51:23 1721422283

Haven't used Windows for close to 15 years, but I read the file is (or rather supposed to be) a NT kernel driver.

Are those drivers signed? Who can sign them? Only Microsoft?

If it's true the file contained nothing but zeros that seems to be also kernel vulnerability. Even if signing were not mandatory, shouldn't the kernel check for some structure, symbol tables or the the like before proceeding?

poizan42 · 2024-07-19T21:37:39 1721425059

No the file is not a driver. It's a file loaded by a driver, some sort of threat/virus definition file I think?

And yes Windows drivers are signed. If it had been a driver it would just have failed to load. Nowadays they must be signed by Microsoft, see https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

MBCook · 2024-07-19T22:32:14 1721428334

That was my read.

The kernel driver was signed. The file it loaded as input with garbage data had seemingly no verification on it at all, and it crashed the driver and therefore the kernel.

usr1106 · 2024-07-19T22:54:25 1721429665

Hmm, the driver must be signed (by Microsoft I assume). So they sign a driver which in turn loads unsigned files. That does not seem to be good security.

anonymfus · 2024-07-19T22:17:00 1721427420

NT kernel drivers are Portable Executables, and kernel does such checks, displaying BSOD with stop code 0xC0000221 STATUS_IMAGE_CHECKSUM_MISMATCH if something went wrong.

https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

dagaci · 2024-07-19T21:22:56 1721424176

Think more, imagine that the your CrowdStrike security layer detects an 'unexpected' kernel level data file.

Choice #1 Diable security software and continue. Choice #2 Stop. BSOD message contact you administrator

There may be nothing wrong with the drivers.

derefr · 2024-07-19T21:56:07 1721426167

Choice #3 structure the update code so that verifying the integrity of the update (in kernel mode!) is upstream of installing the update / removing the previous definitions package, such that a failed update (for whatever reason) results in the definitions remaining in their existing pre-update state.

(This is exactly how CPU microcode updates work — the CPU “takes receipt” of the new microcode package, and integrity-verifies it internally, before starting to do anything involving updating.)

warkdarrior · 2024-07-19T22:30:52 1721428252

> a failed update (for whatever reason) results in the definitions remaining in their existing pre-update state

Fantastic solution! You just gave the attackers a way to stop all security updates to the system.

JohnBooty · 2024-07-20T04:28:47 1721449727

No, that doesn't follow.

For most systems, a sensible algorithm would be "keep running the last known good definition, until we get the next known good definition"

In other words: ignore bad updates but keep checking for valid ones. That doesn't mean you've permanently lost the ability to update.

Of course, for some systems, different behavior might make more sense.

monocasa · 2024-07-20T00:24:49 1721435089

When you can't verify an update, rolling back atomically to the previous state is generally considered the safest option. Best run what you can verify was a complete package from whoever you trust.

rahkiin · 2024-07-19T21:26:54 1721424414

The file was data used by the actual driver like some virus database. It is not code loaded by the kernel

kabdib · 2024-07-19T23:24:27 1721431467

Yet it was named ".sys", an extension normally reserved for driver executables AFAIK

Brillant! [sic]

gjsman-1000 · 2024-07-19T19:20:01 1721416801

Perhaps - but if I made a list of all of the things your company should be doing and didn't, or even things that your side project should be doing and didn't, or even things in your personal life that you should be doing and haven't, I'm sure it would be very long.

jjav · 2024-07-19T20:06:54 1721419614

> all of the things your company should be doing and didn't

Processes need to match the potential risk.

If your company is doing some inconsequential social app or whatever, then sure, go ahead and move fast and break things if that's how you roll.

If you are a company, let's call them Crowdstrike, that has access to push root privileged code to a significant percentage of all machine on the internet, the minimum quality bar is vastly higher.

For this type of code, I would expect a comprehensive test suite that covers everything and a fleet of QA machines representing every possible combination of supported hardware and software (yes, possibly thousands of machines). A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.

Anything short of that is highly irresponsible given the access and risk the Crowdstrike code represents.

Denvercoder9 · 2024-07-19T21:31:02 1721424662

> A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.

That doesn't work in the business they're in. They need to roll out definition updates quickly. Their clients won't be happy if they get compromised while CrowdStrike was still doing the dogfooding or phased rollout of the update that would've prevented it.

jjav · 2024-07-20T03:16:27 1721445387

> That doesn't work in the business they're in. They need to roll out definition updates quickly.

Well clearly we have incontrovertible evidence now (if it was needed) that YOLO-pushing insufficiently tested updates to everyone at once does not work either.

This is being called in many places (righfully) the largest IT outage in history. How many billions will be the cost? How many people died?

So yes, clearly not the correct way to operate.

nikau · 2024-07-21T04:33:39 1721536419

I mean this isn't some bizarre edge case bug, its a file full of nulls and a broken parser that blindly imported it.

Its negligence.

EvanAnderson · 2024-07-19T19:24:37 1721417077

A company deploying kernel-mode code that can render huge numbers of machines unusable should have done better. It's one of those "you had one job" kind of situations.

They would be a gigantic target for malware. Imagine pwning a CDN to pwn millions of client computers. The CDN being malicious would be a major threat.

soraminazuki · 2024-07-19T20:55:26 1721422526

Oh, they have one job for sure. Selling compliance. All else isn't their job, including actual security.

Antiviruses are security cosplay that works by using a combination of bug-riddled custom kernel drivers and unsandboxed C++ parsers running with the highest level of privileges to tamper with every bit of data it can get its hands on. They violate every security common sense. They also won't even hesitate to disable or delay rollouts of actual security mechanisms built into browsers and OSes if it gets in the way.

The software industry needs to call out this scam and put them out of business sooner than later. This has been the case for at least a decade or two and it's sad that nothing has changed.

https://ia801200.us.archive.org/1/items/SyScanArchiveInfocon... https://robert.ocallahan.org/2017/01/disable-your-antivirus-...

heraldgeezer · 2024-07-19T21:33:06 1721424786

Nope, I have seen software like Crowdstrike, S1, Huntress and Defender E5 stop active ransomware attacks.

jjav · 2024-07-20T03:20:39 1721445639

> Nope, I have seen software like Crowdstrike, S1, Huntress and Defender E5 stop active ransomware attacks.

Yes, occasionally they do. This is not an either-or situation.

While they do catch and stop attacks, it is also true that crowdstrike and its ilk are root-level backdoors into the system that bypass all protections and thus will cause problems sometimes.

soraminazuki · 2024-07-19T21:56:06 1721426166

That anecdote doesn't justify installing gaping security holes into the kernel with those tools. Actual security requires knowledge, good practice, and good engineering. Antiviruses can never be a substitute.

lytedev · 2024-07-19T23:24:48 1721431488

You seem security-wise, so surely you can understand that in some (many?) cases, antivirus is totally acceptable given the threat model. If you are wanting to keep the script kiddies from metasploiting your ordinary finance employees, it's certainly worth the tradeoff for some organizations, no? It's but one tool with its tradeoffs like any tool.

TeMPOraL · 2024-07-20T07:36:57 1721461017

That's like pointing at the occasional petty theft and mugging, and using it to justify establishing an extraordinary secret police to run the entire country. It's stupid, and if you do it anyway, it's obvious you had other reasons.

Antivirus software is almost universally malware. Enterprise endpoint "protection" software like CrowdStrike is worse, it's an aggressive malware and a backdoor controlled by a third party, whose main selling points are compliance and surveillance. Installing it is a lot like outsourcing your secret police to a consulting company. No surprise, everything looks pretty early on, but two weeks in, smart consultants rotate out to bring in new customers, and bad ones rotate in to run the show.

Yeah, that's definitely a good tradeoff against script kiddies metasploiting your ordinary finance employees. Wonder if it'll look as good when loss of life caused by CrowdStrike this weekend gets tallied up.

jmb99 · 2024-07-19T23:38:19 1721432299

How many attacks have they stopped that would have DoS’d a significant fraction of the world’s Windows machines roughly instantly?

The ends don’t justify the means.

cduzz · 2024-07-19T20:57:12 1721422632

Which is their "One Job" ?

Options include:

1. protected the systems always work even if things are messed up

2. protected systems are always protected even when things are messed up

The two failure modes are exclusive; ideally you let the end user decide what to do if the protection mechanism is itself unstable.

One could suggest "the system must always work" but that's ignoring that sometimes things don't go to plan.

None of the systems in boot loops were p0wned by known exploits while they were boot looping. As far as we know anyhow.

(edited to add the obvious default of "just make a working system" which is of course both a given and not going to happen)

jmb99 · 2024-07-19T23:44:52 1721432692

The failure mode here was a page fault due to an invalid definition file. That (likely) means the definition file was being used as-is without any validation, and pointers were being dereferenced based on that non-validated definition file. That means this software is likely vulnerable to some kind of kernel-level RCE through its definition files, and is (clearly) 100% vulnerable to DoS attacks through invalid definition files. Who knows how long this has been the case.

This isn’t a matter of “either your system is protected all the time, even if that means it’s down, or your system will remain up but might be unprotected.” It’s “your system is vulnerable to kernel-level exploits because of your AV software’s inability to validate definition files.”

The failure mode here should absolutely not be to soft-brick the machine. You can have either of your choices configurable by the sysadmin; definition file fails to validate? No problem, the endpoint has its network access blocked until the problem can be resolved. Or, it can revert to a known-good definition, if that’s within the organization’s risk tolerance.

But that would require competent engineering, which clearly was not going on here.

cduzz · 2024-07-21T14:17:14 1721571434

Well.... yeah, incuriously shoving unsigned data from elsewhere into ring 0 and executing it is malpractice.

EvanAnderson · 2024-07-20T00:47:33 1721436453

Their "one job" is to not make things worse than the default. DoS'ing the OS with an unhandled kernel mode exception would be not doing that job.

How about a different analogy: First do no harm.

bn-l · 2024-07-19T19:34:28 1721417668

I think in this case it’s reasonable for us to expect that they are doing what they should be doing.

chrisjj · 2024-07-19T21:03:59 1721423039

> it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN code" or something

I.e. only one link in the chain wasn't tested.

Sorry, but that will not do.

> We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

The parent post did not suggest they don't test anything. It suggested they did not test the whole chain.

martinky24 · 2024-07-19T22:25:58 1721427958

From the parent comment:

> it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support

I know nothing about Crowdstrike, but I can guarantee that "they need to test target images that they claim to support" isn't what went wrong here. The implication that they don't test against Windows is so incredulous, it's hard to take the poster of that comment seriously.

StressedDev · 2024-07-19T22:39:03 1721428743

Thank you for pointing this out. Whenever I read articles about security, or reliability failures, it seems like the majority of the commenters assume that the person or organization which made the mistake is a bunch of bozos.

The fact is mistakes happen (even huge ones), and the best thing to do is learn from the mistakes. The other thing people seem to forget is they are probably doing a lot of the same things which got CrowdStrike into trouble.

If I had to guess, one problem may be that CrowdStrike's Windows code did not validate the data it received from the update process. Unfortunately, this is very common. The lesson is to validate any data received from the network, from an update process, received as user input, etc. If the data is not valid, reject it.

Note I bet at least 50% of the software engineers commenting in this thread do not regularly validate untrusted data.

chrisjj · 2024-07-20T00:06:56 1721434016

I'll bet 50% aren't delivering code that can stiff millions of PCs.

And given Crowdstrike are, and data validation neglect is so common, why have they not already learned this lesson?

maxquesar · 2024-07-23T14:02:44 1721743364

Not validating an update signature is a huge security compliance issue. When you get certified, and I assume CroudStrike had many certifications, you provide proof of your compliance to many scenarios. Proving your updates are signed and verified is absolutely one of those.

arp242 · 2024-07-19T20:53:59 1721422439

> Like it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.

Who is saying they don't have that? Who is saying it didn't pass all of that?

You're making tons of assumptions here.

JKCalhoun · 2024-07-19T21:06:35 1721423195

To be sure. But the fact is the release broke.

I'm not sure: is having test servers that it passed any better than none at all?

strken · 2024-07-19T21:58:02 1721426282

It is absolutely better to catch some errors than none.

In this case it gives me vibes of something going wrong after the CI pipeline, during the rollout. Maybe they needed advice a bit more specific than "just use a staging environment bro", like "use checksums to verify a release was correctly applied before cutting over to the new definitions" and "do staged rollouts, and ideally release to some internal canary servers first".

martinky24 · 2024-07-19T22:16:53 1721427413

"Have these idiots even heard of CI/CD???" strangely seems to be a common condescending comment in this thread.

I honestly though HN was slightly higher quality than most of the comments here. I am proven wrong.

StressedDev · 2024-07-19T22:40:25 1721428825

Agreed - The worst part is most of the people making these unhelpful comments are probably doing the same sorts of things which caused this outage.

kristjansson · 2024-07-19T22:30:12 1721428212

Big threads draw a lot of people; we regress toward the mean

chuckadams · 2024-07-19T22:49:47 1721429387

> I honestly though HN was slightly higher quality

HN reminds me of nothing so much as Slashdot in the early 2000's, for both good and ill. Fewer stupid memes about Beowulf Clusters and Natalie Portman tho.

exe34 · 2024-07-19T22:06:40 1721426800

I don't understand why you wouldn't do staged roll outs at this scale. even a few hours delay might have been enough to stop the release from going global.

chuckadams · 2024-07-19T22:49:06 1721429346

They almost certainly have such a process, but it got bypassed by accident, probably got put into a "minor updates" channel (you don't run your model checker every time you release a new signature file after all). Surprise, business processes have bugs too.

But naw, must be every random commentator on HN knows how to run the company better.

TeMPOraL · 2024-07-20T07:48:18 1721461698

> (you don't run your model checker every time you release a new signature file after all)

Wonder if the higher-ups who mandated this software to be installed in their hospitals were informed about that fact.

martinky24 · 2024-07-19T21:25:37 1721424337

Yes, yes it is. Because there's tons more breakages that have likely been caught.

One uncaught downstream failure doesn't invalidate the effort into all the previously caught failures.

JKCalhoun · 2024-07-21T13:10:58 1721567458

At the same time (and I am looking directly at Unit Tests when I say this) I have seen what is perhaps dangerous confidence that, because such and such tests are in place, we can be a more lax when pushing out changes.

chatmasta · 2024-07-19T22:32:40 1721428360

The release didn’t break. A data file containing nulls was downloaded by a buggy kernel module that crashed when reading the file.

For all we know there is a test case that failed and they decided to push the module anyway (“it’s not like anyone is gonna upload a file of all nulls”).

Btw: where are these files sourced from? Could a malicious Crowdstrike customer trick the system into generating this data file, by e.g. reporting it saw malware with these (null) signatures?

leptons · 2024-07-19T22:35:10 1721428510

A lot of the software industry focuses on strong types, testing of all kinds, linting, and plenty of other sideshows that make programmers feel like they're in control, but these things only account for the problems you can test for and the systems you control. So what if a function gets a null instead of a float? It shouldn't crash half the tech-connected world. Software resilience is kind of lacking in favor of trusting that strong types and tests will catch most bugs, and that's good enough?

martinky24 · 2024-07-19T20:55:01 1721422501

Yeah... the comment above reads like someone who has read a lot of books on CI deployment, but has zero experience in a real world environment actually doing it. Quick to throw stones with absolutely no understanding of any of the nuances involved.

AndrewKemendo · 2024-07-19T21:29:50 1721424590

There is no nuance needed - this is a giant corporation that sells kernel layer intermediation at global scale. You better be spending billions on bulletproof deployment automation because *waves hands around in the air pointing at whats happening just like with solarwinds*

Bottom line this was avoidable and negligent

For the record I owned global infrastructure as CTO for the USAF Air Operations weapons system - one of the largest multi-classification networked IT systems ever created for the DoD - even moreso during a multi-region refactor as a HQE hire into the AF

So I don’t have any patience for millionaires not putting the work in when it’s critical infrastructure

People need to do better and we need accountability for people making bad decisions for money saving

arp242 · 2024-07-19T21:38:42 1721425122

Almost everything that goes wrong in the world is avoidable one way or the other. Simply stating "it was avoidable" as an axiom is simplistic to the point of silliness.

Lots of very smart people have been hard at work to prevent airplanes from crashing for many decades now, and planes still crash for all sorts of reasons, usually considered "avoidable" in hindsight.

Nothing is "bulletproof"; this is a meaningless buzzword with no content. The world is too complex for this.

HL33tibCe7 · 2024-07-19T21:59:29 1721426369

> You better be spending billions on bulletproof deployment automation

There is no such thing.

jonathanzufi · 2024-07-19T23:10:48 1721430648

You must have insanely cool stories :-)

What are your thoughts on MSFTs role in this?

They’ve been iterating Windows since 1985 - doesn’t it seem reasonable that their kernel should be able to survive a bad 3rd party driver?

AndrewKemendo · 2024-07-20T01:04:33 1721437473

1. System high/network isolation is a disaster in practice and is the root of MSFT and AD/ADFS architecture

2. The problem is the ubiquity of windows so it’s embedded in the infrastructure

We’ve put too many computers in charge of too much stuff for the level of combined capabilities of the computer and the human operator interface

chrisjj · 2024-07-19T21:01:07 1721422867

So let's hear the "nuances" that excuse this.

arp242 · 2024-07-19T21:32:42 1721424762

I am not defending of excusing anything. I am saying there is not enough information to make a judgement one way or the other. Right now, we have almost zero technical details.

Call me old-fashioned and boring, but I'd like to have some basic facts about the situation first. After this I decide who does and doesn't deserve a bollocking.

chrisjj · 2024-07-19T22:06:00 1721426760

I think we do have enough info to judge e.g. :This should not have passed a competent C/I pipeline for a system in the critical path."

Thay info includes that the faulty file consisted entirely of zeros.

arp242 · 2024-07-19T22:46:23 1721429183

> That info includes that the faulty file consisted entirely of zeros.

Even that is not certain. Some people are reporting that this isn't the case and that the all-zeroed file may be a "quick hack" to send out a no-op.

So no, we have very little info.

chrisjj · 2024-07-20T00:08:31 1721434111

But the all-zero file is version CS has IDed as the cause, right?

zmatt · 2024-07-24T02:07:03 1721786823

No, CS has explicitly stated that the cause was a logic error in the rules file. They have also stated "This is not related to null bytes contained within Channel File 291 or any other Channel File."

cweld510 · 2024-07-19T21:16:54 1721423814

It’s not a matter of excusing or not excusing it. Incidents like this one happen for a reason, though, and the real solution is almost never “just do better.”

Presumably crowdstrike employs some smart engineers. I think it’s reasonable to assume that those engineers know what CI/CD is, they understand its utility, and they’ve used it in the past, hopefully even at Crowdstrike. Assuming that this is the case, then how does a bug like this make it into production? Why aren’t they doing the things that would have prevented this? If they cut corners, why? It’s not useful or productive to throw around accusations or demands for specific improvements without answering questions like these.

jacobr1 · 2024-07-19T21:33:15 1721424795

Not an excuse - they should be testing for this exact thing - but Crowdstrike (and many similar security tools) have a separation between "signature updates" and "agent/code" updates. My (limited) reading of this situation is that this as a update of their "data" not the application. Now apparently the dynamic update included operating code, just just something the equivalent of a yaml file or whatever, but I can see how different kinds of changes like this go through different pipelines. Of course, that is all the more reason to ensure you have integration coverage.

ikiris · 2024-07-19T22:50:25 1721429425

Dude, the fact that it breaks directly.

You sound like the guy that a few years ago tried to argue (the company in question) tested os code that didn't include any drivers for their gear's local storage. Its obvious it wasn't to anyone competent.

carterschonwald · 2024-07-19T20:19:00 1721420340

The strange thing is that when I interviewed there years ago with the team that owns the language that runs in the kernel, they said their ci has 20k or 40k machine os combinations/configurations. Surely some of them were vanilla windows!

dboreham · 2024-07-19T20:49:02 1721422142

They used synthetic test data in CI that doesn't consist of zeros.

dlisboa · 2024-07-19T21:02:51 1721422971

Fuzz testing would've saved the day here.

azemetre · 2024-07-19T21:27:39 1721424459

I’m sure some team had it in their backlog for years.

queuebert · 2024-07-19T21:58:39 1721426319

That team was probably laid off because they weren't shipping product fast enough.

0x6c6f6c · 2024-07-19T21:51:37 1721425897

Oh yeah, FEAT#927261? Would love to see that ticket go out

TeMPOraL · 2024-07-20T07:55:53 1721462153

Why not? It's unlikely it was the last null byte in the data file that killed the driver.

dagaci · 2024-07-19T20:45:13 1721421913

/* Acceptance criteria #1: do not allow machine to boot if invalid data signatures are present, this could indicate a compromised system. Booting could cause presidents diary to transmit to rival 'Country' of the week */

if(dataFileIsNotValid) { throw FatalKernelException("All your base are compromised"); }

EDIT+ Explanation:

With hindsight not booting may be exactly the right thing to do since a bad datafile would indicate a compromised distribution/ network.

The machines should not fully boot until file with valid signature is downloaded.*

hnlmorg · 2024-07-19T20:38:12 1721421492

It seems unlikely that a file entirely full of null characters was the output of any automated build pipeline. So I’d wager something got built, passed the CI tests, then the system broke at some point after that when the file was copied ready for deployment.

But at this stage, all we are doing is speculating.

russdill · 2024-07-19T20:31:50 1721421110

You can have all the CI, staging, test, etc. If some bug after that process nulls the file, the rest doesn't matter

fabian2k · 2024-07-19T20:40:09 1721421609

Those signature files should have a checksum, or even a digital signature. I mean even if it doesn't crash the entire computer, a flipped bit in there could still turn the entire thing against a harmless component of the system and lead to the same result.

HL33tibCe7 · 2024-07-19T22:01:36 1721426496

What happens when your mechanism for checksumming doesn't work? What happens when your mechanism for installing after the checksum is validated doesn't work?

It's just too early to tell what happened here.

The likelihood is that it _was_ negligence. But we need a proper post-mortem to be able to determine one way or another.

LorenPechtel · 2024-07-19T21:07:24 1721423244

Yup. I had quite a battle with some sort of system bug (never fully traced) where I wrote valid data but what ended up on disk was all zero. It appeared to involve corrupted packets being accepted as valid.

It doesn't matter how much you test if something down the line zeroes out your stuff.

Jtsummers · 2024-07-19T20:35:17 1721421317

If a garbage file is pushed out, the program could have handled it by ignoring it. In this case, it did not and now we're (the collective IT industry) dealing with the consequences of one company that can't be bothered to validate its input (they aren't the only ones, but this is a particularly catastrophic demonstration of the importance of input validation).

russdill · 2024-07-19T20:42:23 1721421743

I'll agree that this appears to have been preventable. Whatever goes through CI should have a hash, deployment should validate that hash, and the deployment system itself should be rigorously tested to insure it breaks properly if the hash mismatches at some point in the process

Cerium · 2024-07-19T21:14:43 1721423683

What sort of sane system modifies the build output after testing?

Our release process is more like: build and package, sign package, run CI tests on signed package, run manual tests on signed package, release signed package. The deployment process should check those signatures. A test process should by design be able to detect any copy errors between test and release in a safe way.

jononor · 2024-07-19T22:59:46 1721429986

The issue is not that a file with nulls was produced. It is that an invalid file (or any kind) can trigger a blue screen of death.

0xcafecafe · 2024-07-19T19:23:07 1721416987

They could even have done slow rollouts. Roll it out to a geographical region and wait an hour or so before deploying elsewhere.

saati · 2024-07-19T21:24:51 1721424291

In theory CrowdStrike protects you from threats, leaving regions unprotected for an hour would be an issue.

Thaxll · 2024-07-19T22:34:33 1721428473

Not really, even for security updates are not needed by the minute. Do you think Microsoft rollout world wide updates to everyone?

easterncalculus · 2024-07-20T23:45:58 1721519158

This is definitely their sales pitch, and most orgs (evidently) don't follow the guidance of doing EDR rollouts in staging environments first. That being said, if your security posture is at the point where not getting the latest updates from CrowdStrike quick enough is why you're getting breached, you are frankly screwed already.

xyst · 2024-07-19T20:20:41 1721420441

Or test in local environments first. Slow rollouts like this tend to make deployments very very painful.

koliber · 2024-07-19T20:58:05 1721422685

Slow rollouts can be quite quick. We used to do 3-day rollouts. Day one was a tiny fraction. Day two was about 20%. Day three was a full rollout.

It was ages ago, but from what I remember, the first day rollout did occasionally catch issues. It only affected a small number of users and the risk was within the tolerance window.

We also tested locally before the first rollout.

rplnt · 2024-07-19T21:12:24 1721423544

I don't know about this particular update, but when I used to work for an AV vendor we did like 4 "data" updates a day. It is/was about being quick a lot of the time, you can't stage those over 3 days. Program updates are different, drivers of this level were very different (Microsoft had to sign those, among many things).

Not thay it exuces anything, just that this probably wasn't treated as an update at all.

daseiner1 · 2024-07-19T20:52:54 1721422374

You say even (emphasis mine). Is this not industry standard?

miki123211 · 2024-07-19T21:29:17 1721424557

Keep in mind that this was probably a data file, not necessarily a code file.

It's possible that they run tests on new commits, but not when some other, external, non-git system pushes out new data.

Team A thinks that "obviously the driver developers are going to write it defensively and protect it against malformed data", team B thinks "obviously all this data comes from us, so we never have to worry about it being malformed"

I don't have any non-public info about what actually happened, but something along these lines seems to be the most likely hypothesis to me.

Edit: Now what would have helped here is a "staged rollout" process with some telemetry. Push the update to 0.01% of your users and solicit acknowledgments after 15 minutes. If the vast majority of systems are still alive and haven't been restarted, keep increasing the threshold. If, at any point, too many of the updated systems stop responding or indicate a failure, immediately stop the rollout, page your on-call engineers and give them a one-click process to completely roll the update back, even for already-updated clients.

This is exactly the kind of issue that non-invasive, completely anonymous, opt-out telemetry would have solved.

adzm · 2024-07-19T21:50:26 1721425826

This was a .dll in all but name fwiw.

ar_lan · 2024-07-19T22:04:00 1721426640

> tests all of the possible target images that they claim to support.

Or even at the very least the most popular OS that they support. I'm genuinely imagining right now that for this component, the entirety of the company does not have a single Windows machine they run tests on.

tinytime · 2024-07-21T21:55:00 1721598900

It's wild that I'm out here boosting existing unit testing practices with mutation testing https://github.com/codeintegrity-ai/mutahunter and there are folks out there that don't even do the basic testing.

sonotathrowaway · 2024-07-19T20:14:16 1721420056

That’s not even getting into the fuckups that must have happened to allow a bad patch to get rolled out everywhere all at once.

notabee · 2024-07-19T19:56:36 1721418996

Without delving into any kind of specific conspiratorial thinking, I think people should also include the possibility that this was malicious. It's much more likely to be incompetence and hubris, but ever since I found out that this is basically an authorized rootkit, I've been concerned about what happens if another Solarwinds incident occurs with Crowdstrike or another such tool. And either way, we have the answer to that question now: it has extreme consequences. We really need to end this blind checkbox compliance culture and start doing real security.

dheera · 2024-07-19T21:03:32 1721423012

I don't know if people on Microsoft ecosystems even know what CI pipelines are.

Linux and Unix ecosystems in general work by people thoroughly testing and taking responsibility for their work.

Windows ecosystems work by blame passing. Blame Ron, the IT guy. Blame Windows Update. Blame Microsoft. That's how stuff works.

It has always worked this way.

But also, all the good devs got offered 3X the salary at Google, Meta, and Apple. Have you ever applied for a job at CrowdStrike? No? That's why they suck.

* A disproportionately large number of Windows IT guys are named Ron, in my experience.

kabdib · 2024-07-19T21:31:15 1721424675

That's a pretty broad brush.

jfyi · 2024-07-19T23:45:01 1721432701

Eh, it's not too broad... I think we should ask how Ron feels about the characterization though.