Despite the best efforts of datacentre operators the world over to reduce the amount of downtime their facilities suffer, the severity and financial impact of server farm outages continue to spiral.
That is according to the fourth annual outage analysis survey by datacentre resiliency think-tank Uptime Institute, which says outage rates are increasing despite “strong investment” from operators in technologies designed to prevent downtime events.
“The overall impact and cost of outages is not shrinking – as might have been hoped – but is, in fact, growing,” said the organisation in its 23-page Annual outage analysis. “Investment in cloud-based and distributed resiliency may have helped reduce the impact of site-level failures, but it has also introduced error-prone complexity. Better management and staff training would help to reduce these failures.”
The report’s insights are based on an analysis of publicly available reports about datacentre outages, as well as data accrued by Uptime Institute through its own industry surveys and member feedback.
The report acknowledges that although datacentres are far more reliable than they used to be, thanks to “decades of innovation, investment and better management”, society’s growing reliance on them means “major failures seem more common”.
It continued: “Despite this, it is clear from Uptime’s extensive research that outages in 2021 and 2022 continue to occur at a rate that is not measurably down from previous years. The evidence suggests that the disruption and costs of outage is, in fact, increasing.
“In short, the critical infrastructure industry is struggling to achieve the high standards that customers expect – and that are embodied in service-level agreements.”
Its data revealed that one in five organisations reported suffering a “serious” or “severe” outage in the past three years, which constitutes a “slight upward trend in the prevalence of major outages”.
At the same time, the proportion of outages costing the affected company more than $100,000 has soared in recent years: more than 60% of failures now result in at least $100,000 in total losses, up markedly from 39% in 2019.
The share of outages that cost upwards of $1m increased from 11% to 15% over that same period.
Outages are also lasting longer, said the report. “The gap between the beginning of a major public outage and full recovery has stretched significantly over the last five years,” it said. “Nearly 30% of these outages in 2021 lasted more than 24 hours – a disturbing increase from just 8% in 2017.”
Power supply issues have traditionally been the most common cause of datacentre outages, but Uptime Institute predicted in its 2021 report that networking issues are set to become the most common source of server farm downtime events.
The 2022 report backs this view, saying outages are increasingly attributed to network, software and systems issues as the scale and complexity of the digital infrastructure underpinning enterprise cloud deployments increase.
“The increasing use of cloud services has changed the characteristics of outages in recent years,” said the report. “Failures are more likely to be due to software, systems or configuration errors – a reflection of the growing complexity of the IT and associated networking.
“These outages are also more likely to affect many IT services and organisations, reflecting system interdependency and the concentration of customers using single providers, often in single availability zones.”
Uptime Institute Intelligence founding member and executive director Andy Lawrence, who co-authored the report, said the situation will improve in time, but for now, outages will persist.
On this point, the organisation predicts – based on past public datacentre downtime data – that there will be at least 20 serious, high-profile IT downtime incidents worldwide each year.
“In time, both the technology and operational practices will improve,” said Lawrence. “But at present, outages remain a top concern for customers, investors and regulators. Operators will be best able to meet the challenge with rigorous staff training and operational procedures to mitigate the human error behind many of these failures.”