The global Microsoft outage caused by a botched update from security firm CrowdStrike has highlighted the dangerous business continuity risk arising from concentrating so much of the world’s technology infrastructure in the hands of a very small number of businesses, experts are warning.
The outage, which began late on Thursday 18 July 2024 before spreading worldwide and hitting the headlines early in the morning of Friday 19 July, saw a faulty CrowdStrike update make it through quality control to worldwide deployment. When it hit computers, it threw them into what is known as a boot loop, causing them to crash on startup and display the infamous blue screen of death.
It’s estimated that it affected only about 8.5 million machines, which is a fraction of the global total, but with many of those belonging to public-facing organisations, pictures of bricked display screens in locations such as airports, railway stations and shops swiftly went viral.
Citing data from a study his firm published in May 2024, SecurityScorecard CEO and co-founder Aleksandr Yampolskiy revealed that IT products and services made by just 150 companies account for 90% of the global attack surface, while 62% of the global attack surface is concentrated in the line-ups of just 15 tech firms – including Microsoft.
The original study claimed that, when ranked on SecurityScorecard’s proprietary rating system, those 15 organisations all had below-average cyber security risk ratings. Given that ransomware gangs – and others – are known to systematically target third-party vulnerabilities at scale, this should be a significant worry for IT teams.
Yampolskiy described the state of much of global IT as a “precarious house perched on a cliff’s edge”, and said that in concentrating mission-critical services in a few big companies, businesses have created a single point of failure.
“When I used to work at Goldman Sachs, the policy was to get tools from multiple vendors,” he said. “This way, if one firewall goes down by one vendor, you have another vendor who may be more resilient. [Friday’s] global outage is a reminder of the fragility and systemic ‘nth-party’ concentration risk of the technology that runs everyday life: airlines, banks, telecoms, stock exchanges and more.”
Grasping the chaos
Yampolskiy said the study’s findings emphasised how a significant proportion of the global external attack surface is controlled by a small number of organisations, and that we are only just beginning to grasp the chaos – thrown into sharp relief thanks to events at CrowdStrike – that this could cause.
He argued that the CrowdStrike incident aptly demonstrated how knowing your supply chain (KYSC) was becoming an increasingly important part of operational resilience. IT teams needed to better understand the dependencies in their business and those of their tech suppliers, he added, because such knowledge is critical to responding to outages effectively, whether they result from malicious cyber attacks, human error or something else.
“Understanding and managing your supply chain is critical in mitigating these risks,” said Yampolskiy. “By proactively identifying dependencies and potential vulnerabilities within your ecosystem, you can strengthen your organisation’s resilience against such disruptive events.
“An outage is just another form of a security incident,” he said. “Antifragility in these situations comes from not putting all your eggs in one basket. You need to have diverse systems, know where your single points of failure are, and proactively stress-test through tabletop exercises and simulations of outages. Consider the ‘chaos monkey’ concept, where you deliberately break your systems – for example, shut down your database or make your firewall malfunction to see how your computers react.”
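For IT teams wondering what such an exercise might look like in practice, the idea of finding single points of failure by simulating outages can be sketched in a few lines of Python. This is a minimal illustration only, not a tool described by Yampolskiy or SecurityScorecard: the inventory, the vendor names (VendorA to VendorD) and the function names are all hypothetical. It "takes down" one vendor at a time and reports which services would be left without a provider for something they need.

```python
# Hypothetical inventory (illustrative names only): each service lists the
# capabilities it needs, and each capability lists the vendors that can supply it.
SERVICES = {
    "check-in kiosks": {
        "endpoint security": ["VendorA"],
        "network firewall": ["VendorB", "VendorC"],
    },
    "payments": {
        "endpoint security": ["VendorA", "VendorD"],
        "network firewall": ["VendorB", "VendorC"],
    },
}


def failed_services(services, down_vendor):
    """Return the services that lose a capability when down_vendor is out."""
    failed = []
    for name, capabilities in services.items():
        for vendors in capabilities.values():
            # A capability survives only if some other vendor can still supply it.
            if not set(vendors) - {down_vendor}:
                failed.append(name)
                break
    return sorted(failed)


def single_points_of_failure(services):
    """Simulate an outage of each vendor in turn and keep those that break something."""
    all_vendors = {
        vendor
        for capabilities in services.values()
        for vendors in capabilities.values()
        for vendor in vendors
    }
    report = {}
    for vendor in sorted(all_vendors):
        broken = failed_services(services, vendor)
        if broken:
            report[vendor] = broken
    return report


if __name__ == "__main__":
    for vendor, broken in single_points_of_failure(SERVICES).items():
        print(f"{vendor} is a single point of failure for: {', '.join(broken)}")
```

In this made-up inventory, only the endpoint security supplier for the check-in kiosks has no alternative, so it is the one vendor flagged. A real exercise would draw on an actual asset inventory and would also need to account for the suppliers' own dependencies, which this sketch ignores.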