What can other businesses learn from the data centre industry’s approach to power continuity?
Research from IT disaster recovery specialist Databarracks reveals that only just over half of UK organisations (54%) are confident they have an up-to-date business continuity plan to fall back on if they were hit by a crisis such as a prolonged power cut.
Without electricity, many companies would be crippled. Mission-critical sites such as manufacturing plants or financial services institutions (like banks or insurance firms) would have more to lose than most.
Modern manufacturing relies less and less on manual workers and more and more on automation and data-driven processes.
Production lines would grind to a halt without electricity, while supply chains would also be hit. The tech-based financial sector would collapse. People wouldn’t have access to their accounts and countless automated payments would fail to go through.
Public services would struggle too. Hospitals, GP surgeries, care homes, social security systems and more are all reliant on databases of sensitive information requiring real-time access and processing. No power, no processing…
Many of these billion-pound industries and vital services depend on nationwide and global clusters of data centres that store and process vast quantities of information 24/7/365.
In the UK alone, there’s approximately 900,000m² of server room space. That’s the equivalent of 140 full-size football pitches.
London and the M25 corridor are the country’s data centre heartland, but other significant hubs are based in Cardiff, Newcastle, Manchester, Leeds and Reading to name just a few.
Whether a sprawling cloud-based or colocation bit barn or a traditional on-site data centre, these mission-critical server rooms all share the same overarching goal – to minimise downtime.
Data centres are designed to mitigate potential problems ranging from basic component failure and human error through to a complete loss of power. Most are built with elements of redundant infrastructure, which means if a certain part fails, there’s a backup ready to pick up the slack.
They also employ standby systems such as uninterruptible power supplies (UPS) and backup generators to overcome any disruption with the electricity supply, from minor sags to a total power failure.
Typically, a data centre UPS runs on batteries whenever there’s an issue with the mains supply. This provides enough backup for servers and other essential equipment to run until the gas or diesel-powered generators take over.
Often this handover takes just a few minutes, or even seconds. There are certain scenarios, however, when a UPS and its batteries can provide several hours’ runtime as a last resort.
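As a very rough illustration of how that bridging window is sized, the sketch below estimates battery runtime from the supported load and the installed battery capacity. The figures, the helper function and the efficiency assumptions are purely illustrative rather than details of any particular installation.

```python
# Rough sketch: estimating how long a UPS battery string can carry the load
# before the generators must take over. All figures are illustrative
# assumptions, not measurements from a real facility.

def ups_runtime_minutes(load_kw: float, battery_kwh: float,
                        inverter_efficiency: float = 0.95,
                        depth_of_discharge: float = 0.8) -> float:
    """Return an approximate runtime in minutes for a given IT load."""
    usable_kwh = battery_kwh * depth_of_discharge * inverter_efficiency
    return usable_kwh / load_kw * 60

# Example: a 200 kW IT load on a 50 kWh battery string
print(f"{ups_runtime_minutes(200, 50):.1f} minutes")  # roughly 11.4 minutes
```

With bigger battery strings (or a lighter load), the same arithmetic is what turns those minutes into the multi-hour runtimes mentioned above.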
Tiers For Fears (Of Power Loss)
The most well-established method for measuring data centre resilience is the Tier Classification System, a certification introduced by the globally recognised advisory body the Uptime Institute.
It categorises data centres on a sliding scale from Tier I to Tier IV, from the most basic infrastructure to the most complex and, in theory, most resilient.
For the most basic Tier I setup, there are backup power supplies, while Tier II introduces redundancy into aspects such as the UPS, generators and pumps.
At Tier III, there are additional paths for both power and cooling, plus enough redundant components to enable the maintenance and replacement of equipment without requiring a complete system shutdown.
The pinnacle, Tier IV, has redundancy for every component across the entire computing and non-computing infrastructure.
In reality, there are far more Tier III data centres than Tier IV. For many operators, the costs involved with the latter are prohibitive, as it essentially means duplicating all infrastructure.
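That cost gap is easiest to see in the simple arithmetic of redundancy. The sketch below compares the number of UPS modules needed under an N+1 design (one spare beyond what the load requires, the minimum for Tier III in the table that follows) with a 2N design (a complete duplicate set, the minimum for Tier IV). The load and module ratings are assumptions chosen purely for illustration.

```python
import math

# Illustrative comparison of N+1 vs 2N UPS provisioning.
# The load and module ratings are assumed figures for the example.

def modules_needed(load_kw: float, module_kw: float) -> int:
    """Minimum number of UPS modules (N) required to carry the load."""
    return math.ceil(load_kw / module_kw)

load_kw, module_kw = 900, 250         # assumed IT load and module rating
n = modules_needed(load_kw, module_kw)

print(f"N   = {n} modules")           # 4 modules carry the load
print(f"N+1 = {n + 1} modules")       # Tier III style: one spare
print(f"2N  = {2 * n} modules")       # Tier IV style: a full duplicate set
```

On those assumed figures, 2N needs eight modules where N+1 needs five, and the same doubling applies to generators, switchgear and distribution paths.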
For both Tier III and Tier IV, it’s also considered best practice to connect to the grid via a minimum of two transformers to minimise any disruption if one of the elements should fail.
Many larger data centres, particularly hyperscale or colocation, will actually have electricity routed from different substations or separate parts of the grid to spread risk even further.
It’s this thoroughness that means Tier III and IV data centres are designed to continue running through a power outage for at least 72 and 96 hours respectively.
| Tier Classification | Availability | Average Annual Downtime | Description |
|---|---|---|---|
| Tier I (Basic Capacity) | 99.671% | 28.8 hours | A single path for power & cooling with few, if any, redundant components. Includes a UPS & generator to protect IT from power outages. |
| Tier II (Redundant Capacity Components) | 99.741% | 22 hours | A single path for power & cooling, plus limited redundant components including UPS modules, generators & pumps. Provides an increased safety margin against infrastructure equipment failures plus select maintenance opportunities. |
| Tier III (Concurrently Maintainable, i.e. minimum N+1) | 99.982% | 1.6 hours | Multiple paths for power & cooling, plus enough redundant components to enable equipment to be maintained & even replaced without requiring a system shutdown. Designed to protect against power outage for at least 72 hours. |
| Tier IV (Fault Tolerant, i.e. minimum 2N) | 99.995% | 26.3 minutes | Redundancy for every component across the entire computing & non-computing infrastructure. Designed to protect against power outage for at least 96 hours. |
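The downtime column follows directly from the availability column: the unavailable fraction multiplied by the 8,760 hours in a year. A quick sketch of that conversion:

```python
# Annual downtime is simply (1 - availability) multiplied by the hours in a year.
HOURS_PER_YEAR = 24 * 365  # 8,760

tiers = {"Tier I": 0.99671, "Tier II": 0.99741,
         "Tier III": 0.99982, "Tier IV": 0.99995}

for tier, availability in tiers.items():
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{tier}: {downtime_hours:.2f} hours/year")
# Tier I ~28.82, Tier II ~22.69, Tier III ~1.58, Tier IV ~0.44 (about 26 minutes)
```

The small discrepancies against the table (Tier II works out at roughly 22.7 hours, for instance) are simply down to how the published figures are rounded.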
So are there valuable lessons for businesses operating in other sectors? In fairness, many industries already go to great lengths to eliminate any potential single point of failure that could lead to their systems crashing down.
Healthcare facilities, for example, must comply with what’s known as the “3 Rs”:
- Robustness: a system or site should be able to absorb the effects of an event and continue to operate at the required level
- Redundancy: if robustness cannot be guaranteed, it is essential to provide more than one key facility or subsystem
- Reconfigurability: the most devastating risks are often unanticipated. For true resilience, a system or facility should be able to cope with the aftereffects of an unexpected event.
For the most critical applications, such as operating theatres and A&E departments, this means an alternative power source must be available in the event of a mains failure within half a second (or instantaneously if a break in power would stop vital equipment from working).
Other environments, from factories and refineries to water treatment works or power stations, also duplicate vital infrastructure to lessen the likelihood of failure.
However, is there only so much an organisation can do to mitigate the disruption from a sustained power outage?
Emergency backup from UPS systems and generators can hold the fort in the short term, but what happens when the fuel eventually runs out?
Even for facilities with onsite generation, it certainly wouldn’t be “business as usual”. These sites would soon be operating in isolation with just a skeleton staff, little support from suppliers, and with limited demand from customers who’d have far more pressing concerns about their own personal situation.
Then there’s the ongoing maintenance, much of which is contracted out to vendors and third-party providers.
When equipment breaks down during a blackout, who’ll fix it? There’s no phone signal or means of communicating with the supplier.