CrowdStrike Apologizes, Blames Flaw in Testing Software for Faulty Update that Took Down Millions of Windows Systems
- By Chris Paoli
- July 24, 2024
Security firm CrowdStrike revealed that a flaw in its testing software led to a faulty update, causing more than 8.5 million Windows systems to crash last week. In a blog post published today, the Austin-based company provided more details on the incident, which resulted in flight cancellations and disruptions to public services, including 911 systems.
The company noted that on July 19, it released a "Rapid Response" content configuration update for its Falcon software. The update was intended to collect information on ongoing and potential security incidents.
However, CrowdStrike's Rapid Response update, designed to address threats "at operational speed," contained an error in one of its "Template Instances," leading to the widespread outage.
CrowdStrike has been working to address the issue and restore affected systems while investigating the root cause of the error.
Here's CrowdStrike's summary of Template Instances:
"Rapid Response Content is delivered as 'Template Instances,' which are instantiations of a given Template Type. Each Template Instance maps to specific behaviors for the sensor to observe, detect or prevent. Template Instances have a set of fields that can be configured to match the desired behavior."
Before being pushed through, Template Instances are first checked by the company's Content Validator. The two Template Instances released last week both passed validation, even though one of them contained corrupted content data. This is what caused Windows machines running Falcon sensor version 7.11 and above to crash.
"When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception," said CrowdStrike. "This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."
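The failure class described in that quote, reading past the end of the data actually supplied, can be illustrated with a minimal Python sketch. This is not CrowdStrike's code: `TemplateInstance`, `read_field` and `load_instance` are hypothetical names, and Python itself would raise an `IndexError` rather than read stray memory. The sketch only shows how a bounds check lets a loader reject malformed content instead of failing partway through.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical stand-in for one piece of Rapid Response Content."""
    name: str
    fields: Sequence[str]


def read_field(instance: TemplateInstance, index: int) -> Optional[str]:
    """Return the field at `index`, or None if the index is out of range."""
    if 0 <= index < len(instance.fields):
        return instance.fields[index]
    return None


def load_instance(instance: TemplateInstance, required_indexes: Sequence[int]) -> bool:
    """Accept the instance only if every field the loader needs is present."""
    return all(read_field(instance, i) is not None for i in required_indexes)
```

Under these assumptions, an instance that is missing a required field is refused at load time, which is the graceful handling the quote says did not happen.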
In its preliminary assessment of the situation, CrowdStrike puts part of the blame on its internal testing procedures. While the company does extensive human and AI testing of its main Sensor Content, it relies only on the Content Validator for its smaller Rapid Response releases.
To avoid similar incidents in the future, the company is changing how it tests and deploys its Rapid Response updates. It plans to employ local developer testing, increased rollback and content testing, content interface testing and additional stress testing before any updates are pushed through.
As for deployment of content, the company said it will be making the following changes:
- Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
- Improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.
- Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.
- Provide content update details via release notes, which customers can subscribe to.
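The staggered deployment idea in the list above can be sketched in a few lines of Python. This is an illustrative sketch only, not CrowdStrike's rollout system: `staged_rollout`, the stage percentages and the `deploy` health callback are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_percents: Sequence[int],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy to growing slices of hosts and stop at the first unhealthy stage.

    `stage_percents` are cumulative, e.g. (1, 10, 100); the first stage acts
    as the canary. `deploy` returns True if the host is healthy afterwards.
    Returns the hosts that received the update.
    """
    updated: List[str] = []
    start = 0
    for percent in stage_percents:
        end = max(start, len(hosts) * percent // 100)
        batch = hosts[start:end]
        failures = 0
        for host in batch:
            updated.append(host)
            if not deploy(host):
                failures += 1
        if batch and failures / len(batch) > max_failure_rate:
            break  # halt here: later stages never receive the update
        start = end
    return updated
```

With 100 hosts and stages of 1, 10 and 100 percent, an update that breaks every machine would reach only the single canary host before the rollout halts.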
CrowdStrike said it will release a full "Root Cause Analysis" of the incident once its investigation is complete. In the meantime, some customers are reporting that the company has sent them a $10 Uber Eats gift card as an apology for the outage.