Causes and impacts of data centre outages

Annual outage analysis 2021.

Avoiding downtime remains a top priority for all managers of critical infrastructure. But as technology changes, and as the demands placed on IT change, so do the types, frequency and impacts of outages, as well as the best practices in outage avoidance.

According to the latest research report released by the Uptime Institute, the avoidance of outages has always been a major priority for operators of mission-critical systems.

Uptime Institute’s annual analysis of data centre outages finds that progress toward reducing downtime, and the impact of outages, is mixed. While systems and processes have generally improved uptime and reliability, the impact of some big failures, and a concentration of workloads in a small number of large data centres owned by powerful players, have led some customers and regulators to seek better oversight and evidence of good practices. Innovations and investment in cloud-based and distributed resiliency may have helped reduce the impact of site-level failures, but they have also introduced some error-prone complexity.

Key findings include:

  • In spite of improving technology and better management of availability, outages remain a major concern for the industry and, increasingly, for customers and regulators. The impact and cost of outages are growing.
  • The causes of outages are changing. Software and IT configuration and network issues are becoming more common, while power issues are less likely to cause a major IT service outage.
  • Human error continues to cause problems. Many outages could be prevented by improving management processes and training staff to follow them correctly.
  • There were fewer serious and severe outages reported in 2020 than in the previous year.
  • While progress in improving reliability and availability is always a factor, this decrease may, in part, be due to changes in IT use and management as a result of COVID-19.

Critical IT systems, networks and data centres are far more reliable than they once were. This is the result of many decades of innovation, investment and management.

Major failures seem more common only because there is so much critical IT in use, because society’s dependency on it is so great, and because of greatly increased visibility through news and social media. In 2020 — a year in which COVID-19 made a big impact on how and where IT was used — there were, as always, some big outages that affected financial trading, government services and telecom services. However, the outages that made headlines most often were less seismic, affecting consumers and workers at home, such as interruptions or slowdowns of collaboration tools (e.g., Microsoft Teams, Zoom), online betting sites and fitness trackers.

The financial consequences of outages can be high, and the numbers are increasing. The Uptime Institute Global Survey of IT and Data Centre Managers 2020 found that four in 10 outages cost between US$100,000 and US$1 million – and about one in six costs over US$1 million.

One common misunderstanding concerns the overall level of availability and outages. The reliability of data centres has been improving, not worsening. But this is not always clear from figures that show a high and consistent rate of outages experienced by IT and data centre management. The anomaly may be simply explained.

The level of investment in new data centres, in an ever-increasing amount of IT capacity, and in new IT services in recent years has dwarfed that of all previous decades. The frequency of outages has grown too, but much more slowly. Even so, the risk of an outage at any data centre, or for an IT service, is still high enough to concern managers and to justify high investment. The growing use of cloud-based or network-based resiliency has created some confusion, since some IT technicians have extremely high expectations of these technologies.

Some operators quote "five nines" (99.999%) availability or have implied that system-wide failure is nearly impossible. This is clearly not the case. Such modern IT architectures are designed to overcome component, equipment, and in some cases, site-level failures; equally, they are designed to support more fluid movement of data and processing, allowing rerouting of traffic to replicated data. But significant investment and expertise are required to operate this successfully, and some of this technology is still in its infancy. At scale, distributed resiliency can introduce complexity and other challenges that may lead to failures, some of which are not easily foreseeable.
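
To put the "five nines" figure in context, the short sketch below converts an availability percentage into the downtime it allows over a year. It is illustrative arithmetic only and is not taken from the Uptime Institute report; the function name `allowed_downtime_minutes` and the 365-day year are assumptions made for the example.

```python
# Illustrative arithmetic only: converts an availability percentage into
# the maximum downtime it permits over one year.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    targets = [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]
    for label, pct in targets:
        minutes = allowed_downtime_minutes(pct)
        print(f"{label} ({pct}%): {minutes:.2f} minutes of downtime per year")
```

On this basis, five nines leaves a budget of a little over five minutes of downtime per year, so a single prolonged incident can consume many years' worth of that allowance.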

This explains why a growing number of outages result from software and network systems and configuration errors. Distributed resiliency is a methodology still in development: it works well, but not perfectly. In the long term, greater investment and experience, and the use of advanced monitoring and optimisation technologies, will help to reduce failures more significantly.

The number of outages is only one metric, and not the one many managers will worry about most. A bigger concern is the likelihood — and possible impact — of outages for their type of operation.

In this regard, IT is paying the price of its success: The costs of outages are rising, along with the disruption caused. This is the result of several factors, including the growing dependency by business/society on IT; the concentration of IT in fewer companies/large data centres; and the difficulty of quickly resolving complex system outages, sometimes spanning multiple sites.

Given the importance of IT and data centres, and the impact of outages, many regulators of financial services, emergency services, telecoms and central governments are reaching the conclusion that greater visibility, accountability and control are needed. Prevention of outages is a constant challenge that requires attention, investment, and analysis on several fronts. But Uptime Institute’s research does point to one simple and actionable finding: Human error, which lies at the root of many outages, is often the result of failure to follow processes, or of having inadequate processes. Better focus, management and training will produce better results.

