Mission critical monitoring at HPC Wales’ highly efficient Dylan Thomas Data Centre

The Dylan Thomas Data Centre hosts the Swansea hub for HPC Wales – one of two hubs that form part of the UK’s largest distributed supercomputing network. HPC Wales is a unique collaboration between Welsh universities: Aberystwyth, Bangor, Cardiff and Swansea, together with the University of Wales and the University of South Wales. It provides businesses and researchers with access to world-class, secure and easy-to-use HPC technology, as well as the support and training necessary to fully exploit it.

Monday, 2nd June 2014, by Phil Alsop

The Dylan Thomas Data Centre implemented water cooling at the rack level in order to maximise the efficiency of the data centre. However, state-of-the-art cooling technology requires proactive monitoring and management to ensure that small problems do not result in a catastrophic failure.

Concurrent COMMAND ensures the data centre operates smoothly. With over 90 error conditions configured to trigger alarms and automated responses, the data centre managers have complete visibility of the entire facility and peace of mind that it is well protected.


The Dylan Thomas Data Centre was designed to house a High Performance Computing (HPC) system on behalf of HPC Wales.


The HPC system provides businesses and researchers with access to world-class, secure and easy-to-use HPC technology.


Compared to typical corporate data centres, those that host HPC systems present unique design challenges due to their use of high-density server and blade systems. These typically operate at close to 100% CPU utilisation, resulting in significantly higher power consumption per rack.


A side effect is the potential for generating much higher and localised temperatures in the data centre.


The cooling solution chosen by HPC Wales for the Dylan Thomas Data Centre employs water cooling at the rack level rather than traditional air conditioning. This enables them to drive much higher cooling efficiencies.


However, failures in water cooling systems can lead to instant and potentially catastrophic problems. Ensuring continuous service requires additional monitoring and risk mitigation capabilities.


The Data Centre Infrastructure Management (DCIM) tool, Concurrent COMMAND, is used to monitor all of the key systems at the Dylan Thomas Data Centre, including ambient temperatures and bespoke metrics that analyse how the water cooling systems and supporting infrastructure behave. It is also used for active management in the event of a failure of these systems.


End-to-end data centre monitoring
Centralised monitoring is essential in all data centres. However, the rack-level water cooling system at the Dylan Thomas Data Centre brings additional risks and drives the need to react quickly in order to protect the data centre from disastrous failures.


Concurrent COMMAND is a vital part of ensuring the data centre operates smoothly. Over 90 error conditions have been configured to trigger alarms and give the managers complete visibility of the entire data centre – no matter where they might be.


The real-time dashboard provided by Concurrent COMMAND displays high level information – from potential water leak threats to power consumption and availability. Much more detailed information is available by drilling down into each device.


For the majority of systems being monitored, it is sufficient to know whether they are functioning correctly. However, Concurrent COMMAND can also trigger automated responses, such as shutting servers down to protect hardware and data in critical circumstances.


An integrated console
Concurrent COMMAND provides a unified interface for managing the many varied systems across the Dylan Thomas Data Centre. Information is gathered from individual devices and collated in the Concurrent COMMAND console, making it easy for the team to monitor key performance metrics and potential problems.


Monitoring data centre efficiency: PUE
Power Usage Effectiveness (PUE) is a standard data centre efficiency metric that measures the ratio between the total power supplied and the actual power used to run computing equipment. A value of 1 implies that no energy is used by ancillary equipment to cool the data centre or provide back-up power. The higher the number, the less efficient the use of energy for a constant computational load – although it is important to note that PUE does not measure the usefulness of the IT load itself.
The Dylan Thomas Data Centre found that obtaining this figure manually was time-consuming, as data had to be collated from multiple sources. Concurrent COMMAND automatically and continuously collates the necessary information in real time. Because the PUE figure is sampled at regular intervals, the team are able to analyse trends over time.
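As a rough illustration of the arithmetic behind the metric, the sketch below computes PUE from power readings sampled at regular intervals. The figures and names are hypothetical, not values taken from the Dylan Thomas Data Centre or from Concurrent COMMAND.

```python
# Minimal sketch of the PUE calculation: PUE = total facility power / IT load.
# The sample readings below are invented purely for illustration.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Return the PUE ratio; 1.0 is the theoretical ideal."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw

# Hypothetical (total facility kW, IT load kW) pairs sampled at intervals.
samples = [
    (310.0, 250.0),
    (305.0, 248.0),
    (298.0, 245.0),
]

readings = [pue(total, it) for total, it in samples]
average_pue = sum(readings) / len(readings)

print(f"PUE per sample: {[round(r, 2) for r in readings]}")
print(f"Average PUE over the period: {average_pue:.2f}")
```

In practice the two readings would come from facility-level metering and from the IT load measurements described in the next section, rather than from hard-coded values.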


Detailed power and environmental monitoring

Concurrent COMMAND allows data centre managers to collect power usage information not only from distribution boards and rack-mounted PDUs, but also directly from the servers and blade chassis within each rack.


Similarly, detailed environmental information, such as server inlet temperatures and CPU temperatures, can be monitored.


This information can be used to identify hot-spots, measure the power efficiency of individual servers and ensure that each rack operates within its power and environmental constraints.
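As an illustration of this kind of check, the minimal sketch below flags racks whose inlet temperature or power draw exceeds a set limit. The rack names, readings and thresholds are assumptions made for the example, not values from the Dylan Thomas Data Centre or from the Concurrent COMMAND product.

```python
# Illustrative per-rack constraint check; all names and limits are hypothetical.
from dataclasses import dataclass

@dataclass
class RackReading:
    name: str
    inlet_temp_c: float   # server inlet temperature
    power_kw: float       # measured rack power draw

MAX_INLET_TEMP_C = 27.0   # assumed environmental limit
MAX_RACK_POWER_KW = 30.0  # assumed per-rack power budget

def check_rack(reading: RackReading) -> list[str]:
    """Return human-readable warnings for a single rack."""
    warnings = []
    if reading.inlet_temp_c > MAX_INLET_TEMP_C:
        warnings.append(f"{reading.name}: possible hot-spot, inlet {reading.inlet_temp_c:.1f} °C")
    if reading.power_kw > MAX_RACK_POWER_KW:
        warnings.append(f"{reading.name}: power draw {reading.power_kw:.1f} kW exceeds budget")
    return warnings

racks = [
    RackReading("rack-01", 24.5, 28.0),
    RackReading("rack-02", 29.1, 31.5),  # breaches both limits
]
for rack in racks:
    for warning in check_rack(rack):
        print(warning)
```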


Monitoring data centre health
A key benefit of implementing a Data Centre Infrastructure Management (DCIM) tool is being able to monitor events within the data centre and compare them to expectations.


For example, as part of the optimisation of a data centre’s power usage, it is important to balance the three phases of the mains power supply. During the design phase of the Dylan Thomas Data Centre, the systems in the HPC cluster were allocated to individual phases in order to achieve this balance.
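As a simple illustration of how such a balance might be checked from monitored readings, the sketch below compares the load on each phase against the mean. The readings, the 10% tolerance and the imbalance formula are assumptions made for the example rather than figures from the site.

```python
# Hedged sketch of a three-phase balance check. The imbalance measure used here
# (maximum deviation from the mean, as a percentage) is a common convention,
# not necessarily the one applied at the Dylan Thomas Data Centre.

phase_loads_kw = {"L1": 82.0, "L2": 78.5, "L3": 90.0}  # illustrative readings

mean_load = sum(phase_loads_kw.values()) / len(phase_loads_kw)
max_deviation = max(abs(load - mean_load) for load in phase_loads_kw.values())
imbalance_pct = 100.0 * max_deviation / mean_load

print(f"Mean phase load: {mean_load:.1f} kW")
print(f"Imbalance: {imbalance_pct:.1f}%")
if imbalance_pct > 10.0:  # assumed tolerance
    print("Warning: phase imbalance exceeds tolerance")
```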


The ability to monitor the health of key disaster-avoidance systems, such as UPS backup power, and to be warned when such systems fail has evident advantages.


Active system management
Human intervention is not always readily available when problems occur in the data centre – for example during evenings, weekends or holiday periods.


The Dylan Thomas Data Centre takes advantage of Concurrent COMMAND’s active system management feature to protect the data centre in such instances. This allows Concurrent COMMAND to act autonomously when predetermined situations are detected.


For example, should the cooling system temperature rise beyond a predefined level, Concurrent COMMAND will initially alert the data centre managers. Where possible, the team will intervene and resolve the issue. Should the temperature continue to rise past another predetermined threshold, Concurrent COMMAND will take over and initiate a controlled phased shutdown of the HPC cluster in order to prevent serious damage.
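The two-stage escalation described above can be pictured with the minimal sketch below: an alert at a warning threshold, then a phased shutdown if a critical threshold is passed. The thresholds, group names and functions are hypothetical stand-ins; the real automation is configured within Concurrent COMMAND itself.

```python
# Sketch of a two-stage escalation: alert first, phased shutdown if critical.
WARN_TEMP_C = 30.0      # assumed first threshold: alert the managers
CRITICAL_TEMP_C = 35.0  # assumed second threshold: start phased shutdown

SHUTDOWN_PHASES = [
    ["compute-node-group-a"],     # lowest-priority workloads first
    ["compute-node-group-b"],
    ["login-and-storage-nodes"],  # critical services last
]

def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real email/SMS notification

def shut_down(group: list[str]) -> None:
    print(f"Shutting down: {', '.join(group)}")  # stand-in for real shutdown calls

def handle_cooling_temperature(temp_c: float) -> None:
    if temp_c > CRITICAL_TEMP_C:
        send_alert(f"Cooling water at {temp_c:.1f} °C: starting phased shutdown")
        for group in SHUTDOWN_PHASES:
            shut_down(group)
    elif temp_c > WARN_TEMP_C:
        send_alert(f"Cooling water at {temp_c:.1f} °C: manual intervention required")

handle_cooling_temperature(31.2)  # triggers the alert only
handle_cooling_temperature(36.0)  # triggers the phased shutdown
```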


Concurrent COMMAND logs data at each step so that the team can later review why the problem occurred and what was done to resolve it.


Conclusion
HPC Wales recognises that an efficient data centre requires the best technologies to be used to their full potential – and reliable, easy-to-use monitoring is a pivotal component of this.


Maintaining high levels of efficiency and availability with fallible technologies can be difficult. Since the consequences of failure can be extremely serious, continuous monitoring of all the elements across the infrastructure is required to ensure potential disasters are averted.


With Concurrent COMMAND in their armoury, the managers at the Dylan Thomas Data Centre, as well as other HPC Wales remote technical staff, have continuous access to detailed information about all of their IT and facilities assets, allowing them to identify trends and make informed operational decisions. In extraordinary circumstances, they are also assured that Concurrent COMMAND will detect critical issues and automate an orderly shutdown of their systems, making it an integral part of the data centre.