The majority of CrowdStrike Falcon sensors affected by a botched rapid response update were back up and running prior to the weekend of 27 and 28 July, as efforts to remediate the 19 July incident that caused more than eight million Windows machines to crash continue.
Writing on LinkedIn on 26 July, CrowdStrike CEO George Kurtz, who has been communicating information about the incident at a steady clip since it first unfolded, said that as of Thursday 25 July “over 97%” of Windows sensors were back online.
“This progress is thanks to the tireless efforts of our customers, partners, and the dedication of our team at CrowdStrike. However, we understand our work is not yet complete, and we remain committed to restoring every impacted system,” said Kurtz.
“To our customers still affected, please know we will not rest until we achieve full recovery. At CrowdStrike, our mission is to earn your trust by safeguarding your operations. I am deeply sorry for the disruption this outage has caused and personally apologise to everyone impacted. While I can’t promise perfection, I can promise a response that is focused, effective, and with a sense of urgency.”
Kurtz said the remedial efforts had been greatly helped by the use of automated recovery techniques and by mobilising all possible resources to support affected customers. He reiterated CrowdStrike’s commitment to its core mission – to stop breaches – but with a new focus on customer controls and resilience, as detailed in the firm’s preliminary incident report last week.
Fixed update set for implementation soon
Meanwhile, CrowdStrike confirmed to Computer Weekly’s sister title TechTarget Security prior to the weekend that the logic error that caused the chaos had been fixed, and that intensive testing was under way before the update is pushed live, which is expected in the coming days.
The tainted update was part of a rapid response roll-out normally used by CrowdStrike to enhance the dynamic protection mechanisms of its Falcon platform – that is to say, it was designed to identify new cyber security issues and help customers mitigate them.
The company performs such updates all the time, but on this occasion, some problematic content in a channel file made it past the beady eyes of CrowdStrike’s automated content validator. The two issues combined, the problematic content and the validator’s failure to catch it, led to an out-of-bounds memory read. This triggered an exception that the Windows operating system could not handle gracefully, causing affected devices to crash with the infamous blue screen of death.
CrowdStrike is attempting to make sure the issue cannot recur by improving the resilience of its rapid response updates through improved testing at multiple levels, and by adding refreshed validation checks to the automated content validator tool that let it down.
It also now plans to roll out rapid response updates on a staggered basis, deploying them across the Falcon sensor base more slowly and making use of “canary” deployments designed to highlight any major issues before they spread.
Sensor and system performance will also receive enhanced monitoring, and at some point, CrowdStrike customers are to be given more options to manage rapid response updates themselves.
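A staggered roll-out with a canary stage can be sketched roughly as follows. This is an illustrative Python example under stated assumptions, not a description of CrowdStrike's deployment system: the stage fractions, the failure threshold, and the `deploy` and `healthy` callbacks are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed wave sizes
    max_failure_rate: float = 0.02,  # assumed halt threshold
) -> tuple[list[str], bool]:
    """Deploy to progressively larger waves of hosts.

    The first wave acts as the canary. If the share of unhealthy hosts in
    any wave exceeds max_failure_rate, the roll-out halts.
    Returns the hosts that received the update and whether it completed.
    """
    updated: list[str] = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, int(len(hosts) * fraction))
        wave = list(hosts[len(updated):target])
        if not wave:
            continue
        for host in wave:
            deploy(host)
        updated.extend(wave)
        failures = sum(1 for host in wave if not healthy(host))
        if failures / len(wave) > max_failure_rate:
            return updated, False  # halt before the problem spreads
    return updated, True
```

With 100 hosts and a health check that always fails, the sketch stops after the single canary host rather than reaching the remaining 99, which is the behaviour a canary stage is meant to provide.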
Real-life impacts
Meanwhile, real-world impacts continue to be felt from the outage, which notably caused airlines all over the world to delay, reschedule and cancel flights.
Among the stories to have emerged is that of an 83-year-old man who became the subject of a search operation by authorities in the US. Patrick Bailey, who was scheduled to fly home from Florida to California on 19 July, was put up in a local hotel when his flight was cancelled.
Bailey checked out the following morning but accidentally left his mobile phone in his room, and went missing for several days. He eventually turned up in California on 28 July, having instead decided to take a long-distance Greyhound bus across the US.