Q: “Will data centre cooling failures become more common?” A: “Yes.”

By Luke Neville, Managing Director, i3 Solutions Group.

  • Friday, 18th November 2022. Posted by Phil Alsop.

Heatwaves are changing the risk appetite for data centre operators when thinking about a safe operating temperature.

More extreme weather patterns resulting in higher temperature peaks, such as the record 40°C experienced in parts of the UK in the summer of 2022, will cause more data centre failures. However, while it is inevitable that failures will become more common, establishing a direct cause and effect is difficult: the growing number of sites and an ageing data centre stock will, statistically, also increase the number of outages.

What the increasing peak summer temperatures are doing is shifting the needle and changing conversations both about how data centre cooling should be designed and what constitutes “safe” design and operating temperatures. Since the beginning of the modern data centre industry over two decades ago, the design of data centre cooling system capacity has always been a compromise of installation cost vs risk.

Designers sought a balance whereby a peak ambient temperature and a level of plant redundancy are selected so that, should that temperature be reached, the system has the capacity to continue to support operations. The higher the peak ambient design temperature selected, the greater the size and cost of the plant, with greater resilience meaning further cost for redundant plant. It came down to the appetite for risk versus the cost for the owner and operator. Whenever the chosen ambient design temperature is exceeded, the risk of a failure is present and increases with the temperature.

So, what is the right ambient and peak temperature set point?

ASHRAE publish temperatures for numerous weather station locations based on expected peaks over 5-, 10-, 20- and 50-year periods. Typically, the data for the 20-year period is used for data centre ambient design.

However, this is a guideline only and each owner/operator chooses their own limit based on what they feel will reduce risks to acceptable levels without increasing costs too much. Hotter summers have seen design conditions trending upwards over the last twenty years.
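To put the choice of period in perspective, the short Python sketch below estimates the chance that a design peak is exceeded at least once during a facility's service life. It rests on assumptions that are not from ASHRAE or from this article: that each published period can be read as a return period with an annual exceedance probability of 1/n, that years are independent, and that the climate is stationary (which the trend towards hotter summers suggests is optimistic). The function name and the 15-year service life are illustrative only.

```python
def exceedance_probability(return_period_years: float, service_life_years: int) -> float:
    """Chance that an n-year peak is exceeded at least once over the service life.

    Assumes independent years and a stationary climate (a simplification).
    """
    if return_period_years < 1:
        raise ValueError("return period must be at least one year")
    annual_probability = 1.0 / return_period_years
    # Probability of no exceedance in any year, subtracted from 1.
    return 1.0 - (1.0 - annual_probability) ** service_life_years


if __name__ == "__main__":
    # Hypothetical 15-year service life, compared across the published periods.
    for period in (5, 10, 20, 50):
        p = exceedance_probability(period, service_life_years=15)
        print(f"{period:>2}-year peak: {p:.0%} chance of being exceeded in 15 years")
```

Under these assumptions, even the commonly used 20-year figure has a better than even chance of being exceeded at some point in a 15-year life, which is why the choice is ultimately a question of risk appetite rather than a guarantee.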

Legacy data centres were traditionally designed for much lower ambient temperatures, with design conditions rising from, say, 28°C to 30°C to the latterly accepted standard of 35°C to 38°C. Systems were often selected to operate past these points, even up to 45°C. (These figures are for the UK; all other regions will have temperatures selected to suit the local climate.)

The new record temperatures of over 40°C in the UK will sound the warning bell to some data centre operators who may find themselves in a situation where dated design conditions, ageing plant and high installed capacity will result in servers running at the limits of their design envelope.

All systems have a reduced capacity to reject heat as the ambient temperature increases, and also have a fixed limit, irrespective of load, at which they will be unable to reject heat. Should these conditions be reached, failure is guaranteed.

More commonly, low levels of actual load demand versus the system's design capability mean that data centres typically never experience conditions which stress the systems. However, relying on that requires confidence that IT workloads are either constant, 100% predictable, or both.
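The interaction between derating and actual load can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the linear derating curve, the 1,000 kW rated capacity, the 35°C design ambient and the 48°C cut-off are placeholders, and real plant should be assessed against the manufacturer's performance data rather than a straight line.

```python
def heat_rejection_capacity_kw(
    ambient_c: float,
    rated_kw: float = 1000.0,        # hypothetical capacity at design ambient
    design_ambient_c: float = 35.0,  # hypothetical design condition
    cutoff_ambient_c: float = 48.0,  # hypothetical limit where rejection stops
) -> float:
    """Illustrative linear derating: full capacity up to design ambient, zero at cut-off."""
    if ambient_c <= design_ambient_c:
        return rated_kw
    if ambient_c >= cutoff_ambient_c:
        return 0.0
    fraction = (cutoff_ambient_c - ambient_c) / (cutoff_ambient_c - design_ambient_c)
    return rated_kw * fraction


def can_support(it_load_kw: float, ambient_c: float) -> bool:
    """True if the derated capacity still covers the IT load at this ambient."""
    return heat_rejection_capacity_kw(ambient_c) >= it_load_kw


if __name__ == "__main__":
    for load_kw in (400.0, 900.0):
        ok = can_support(load_kw, ambient_c=40.0)
        print(f"{load_kw:.0f} kW at 40°C ambient: {'supported' if ok else 'capacity exceeded'}")
```

In this toy model a lightly loaded hall rides through a 40°C day while a heavily utilised one does not, and no load at all can be supported once the cut-off is reached.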

For now, the failure of data centre cooling is most likely to be the result of plant condition impacting heat rejection capacity rather than design parameter limitations. This was cited as the root cause of one failure during the UK’s summer heatwave when it was stated that cooling infrastructure within a London data centre had experienced an issue.

Should temperatures outside the data centre continue to rise while data centre utilisation also increases, this will change.

Know your limits

Inside the data centre, the increasing power requirements of modern chip and server designs also mean heat could become more of an issue. Whatever server manufacturers say about acceptable ranges, data centres and IT departments have traditionally remained nervous about running their rooms at the top end of the temperature envelope. Typically, data centre managers like their facilities to feel cool.

To reduce power consumption and the oversizing of plant, it is common to integrate evaporative cooling systems within the heat rejection. However, there has been much focus recently on the quantity of water used by data centres and its impact on the sustainability of such systems. Whilst modern designs can allow for vast water storage systems and rainwater collection/use, should summers continue to get longer and drier, more mains water use will be required to compensate and the risk of supply issues impacting the operation of the facility will increase.

In every sense, a paradigm shift away from cooling the technical space towards a focus on cooling the computing equipment itself may present the answer.

The adoption of liquid cooling systems, for example, can eliminate the need for both mechanical refrigeration and evaporative cooling solutions. In addition to some environmental benefits and a reduction in the number of fans at both room and server level, liquid cooling will help increase reliability and reduce failures generally as well as at times of extremely high temperatures.

Liquid cooling seems to be gaining traction. For example, at this year’s Open Compute Summit, Meta (formerly Facebook), outlined its roadmap for a shift to direct-to-chip liquid-cooled infrastructure in its data centres to support the much-heralded metaverse. However, one of the limitations of liquid cooling designs is that they leave little room for manoeuvre during a failure, as resilience can be more challenging to incorporate into these systems.

But for now, without retrofitting new cooling systems, many existing data centres will have to find ways to use air and water to keep equipment cool. And as the temperatures continue to rise inside and outside the facility, so too will the risks of failure.