The future of sustainable HPC

It is an unavoidable fact that high performance computing (HPC) is an energy-intensive environment. But what steps can we take to address this and achieve a more sustainable future for HPC in universities, research centres and other organisations? By Mischa van Kesteren, pre-sales engineer and sustainability officer at OCF.

  • Thursday, 22nd July 2021. Posted by Phil Alsop

Primary considerations include getting the most out of your power usage and considering renewable electricity sources. You can also offset the power you use within your organisation to make your operations more sustainable and ‘green’. Organisations like Plan Vivo can help verify that the carbon offset programme you choose to work with really is sustainable. The important thing to remember with offsetting is that it is best used as a tool to incentivise the reduction of resource consumption.

Offsetting helps to internalise the environmental cost of consuming power. However, for it to be effective, the additional cost must be passed on to the people consuming that power. If users are currently billed or allocated resources based on core hours or job run time, they won’t see the additional power cost as clearly. Switching to a resource allocation scheme based on power consumption would be more effective. Tools such as the EAR energy management framework can be used to provide per-job energy consumption accounting through integration with Intel CPUs and SLURM accounting functionality.
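To illustrate why the billing basis matters, here is a minimal sketch comparing the two schemes. It assumes a per-job energy figure in joules is already available from whatever accounting tool is in use. The `Job` record, the prices and the offset surcharge are hypothetical values chosen for illustration; none of them come from EAR or SLURM.

```python
from dataclasses import dataclass

JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 MJ


@dataclass
class Job:
    user: str
    core_hours: float     # cores multiplied by wall-clock hours
    energy_joules: float  # energy attributed to the job (assumed to be measured)


def charge_by_core_hours(job: Job, rate_per_core_hour: float) -> float:
    """Charge that ignores how much power the job actually drew."""
    return job.core_hours * rate_per_core_hour


def charge_by_energy(job: Job, price_per_kwh: float, offset_per_kwh: float) -> float:
    """Charge based on energy used, with the offset cost passed on per kWh."""
    kwh = job.energy_joules / JOULES_PER_KWH
    return kwh * (price_per_kwh + offset_per_kwh)


# Two hypothetical jobs with identical core hours but different energy use.
jobs = [
    Job("alice", core_hours=64.0, energy_joules=7_200_000),   # 2 kWh
    Job("bob", core_hours=64.0, energy_joules=14_400_000),    # 4 kWh
]

core_hour_charges = {j.user: charge_by_core_hours(j, 0.01) for j in jobs}
energy_charges = {j.user: charge_by_energy(j, 0.25, 0.05) for j in jobs}
```

Under core-hour billing both users pay the same amount, while under energy billing the second job costs twice as much, so only the second scheme makes the power cost visible to the person who incurred it.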

When building a new HPC system, it is important to understand what your workload is going to look like. Energy efficiency comes down to the level of utilisation within the cluster. Some architectures are definitely more energy efficient than others. Generally, higher core count, lower clock speed processors tend to provide greater raw compute performance per watt, but you will need an application that parallelises well and is able to use all those hundreds of cores at once.

If your application doesn’t parallelise well, or if it needs higher frequency processors, then the best thing you can do is pick the right processor and the right number of them so you are not wasting power on CPU cycles that are not being used. When cycles are not being used, the CPUs should be configured to downclock to save power.

HPC managers will be assessed on how satisfied users are with their service, so many will artificially force all of the processors and nodes to run continually at 100 percent clock speed, so that the processor is never put into a dormant state or allowed to reduce its frequency. Ultimately, for these managers energy consumption just isn’t a major concern.

If customers come to us and want to improve the energy efficiency of their current estate, we would look at features in the scheduling software which can power off compute nodes, or at least put them into a dormant state if the processor supports that technology. We would check whether these types of features are enabled and whether customers are making the most of them.
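The kind of policy such scheduler features apply can be sketched as follows. This is an illustrative model only, not the API of any particular scheduler: the `Node` record, the `nodes_to_suspend` function, the idle threshold and the minimum number of nodes kept on are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    idle_seconds: int  # time since the node last ran a job; 0 means busy
    powered_on: bool = True


def nodes_to_suspend(nodes, idle_threshold_s, min_nodes_on):
    """Return names of idle nodes to power down, longest idle first,
    while keeping at least min_nodes_on nodes powered on."""
    powered = [n for n in nodes if n.powered_on]
    candidates = sorted(
        (n for n in powered
         if n.idle_seconds > 0 and n.idle_seconds >= idle_threshold_s),
        key=lambda n: n.idle_seconds,
        reverse=True,
    )
    # Number of nodes that may be switched off without dropping below the floor.
    spare = max(0, len(powered) - min_nodes_on)
    return [n.name for n in candidates[:spare]]


# Hypothetical four-node cluster: one busy node and three idle for varying times.
cluster = [
    Node("node01", idle_seconds=0),
    Node("node02", idle_seconds=1800),
    Node("node03", idle_seconds=7200),
    Node("node04", idle_seconds=300),
]

to_suspend = nodes_to_suspend(cluster, idle_threshold_s=900, min_nodes_on=3)
```

Keeping a floor of powered-on nodes is one way to balance power saving against user satisfaction, since short jobs can still start immediately while longer-idle capacity is switched off.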

However, some older clusters do not support these features and generally provide much lower performance per watt than today’s technologies. In those cases it is often worth replacing a 200-node system that is 10 years old with something that is maybe 10 times smaller and provides just as much computing resource.

You can make a reasonable total cost of ownership (TCO) argument for ripping out and replacing that entire old system; in some cases that will actually save money (and resources) over the next three to five years. Sometimes replacing what you have got is the best option, but I think the least invasive approach, and the first thing that we would look at with customers, is whether they are being smart with their scheduling software and whether there are benefits they can get from reducing the power consumption of idle nodes.
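As a rough worked example of that TCO argument, the sketch below compares the electricity cost of keeping an old 200-node system against buying and running a system 10 times smaller. Every figure in it (watts per node, electricity price, purchase cost) is a hypothetical assumption for illustration, and it deliberately ignores cooling, maintenance and disposal costs.

```python
HOURS_PER_YEAR = 24 * 365


def energy_cost(nodes, watts_per_node, price_per_kwh, years):
    """Electricity cost of running `nodes` continuously for `years`."""
    kwh = nodes * watts_per_node / 1000 * HOURS_PER_YEAR * years
    return kwh * price_per_kwh


def replacement_saving(old_nodes, old_watts, new_nodes, new_watts,
                       new_capital_cost, price_per_kwh, years):
    """Positive result means replacing is cheaper than keeping the old system."""
    keep = energy_cost(old_nodes, old_watts, price_per_kwh, years)
    replace = new_capital_cost + energy_cost(new_nodes, new_watts,
                                             price_per_kwh, years)
    return keep - replace


# Hypothetical figures: 200 old nodes at 400 W each versus 20 new nodes at
# 800 W each, electricity at 0.20 per kWh, new system costing 250,000.
saving_3y = replacement_saving(200, 400, 20, 800, 250_000, 0.20, years=3)
saving_5y = replacement_saving(200, 400, 20, 800, 250_000, 0.20, years=5)
```

With these assumed numbers the replacement pays for itself within three years and the saving grows over five, but the outcome is sensitive to the electricity price and the purchase cost, so the calculation needs to be redone with real figures for any given site.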

Cloud bursting is another useful approach for sustainable HPC, particularly when users have peak or infrequent workloads that don’t fit their normal usage patterns. Sending these temporary workloads to automatically provisioned cloud resources makes a lot of sense. Rather than having 100 nodes on all the time, you may only need that capacity for four hours a day or once a week.
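The scale of that difference is easy to work out in node-hours. The short sketch below uses the 100-node figure from the paragraph above; treating the once-a-week case as a single four-hour session is an assumption made here for illustration.

```python
def weekly_node_hours(nodes, hours_per_day=24.0, days_per_week=7):
    """Node-hours consumed per week for a given usage pattern."""
    return nodes * hours_per_day * days_per_week


always_on = weekly_node_hours(100)                                     # on 24/7
burst_daily = weekly_node_hours(100, hours_per_day=4)                  # 4 h every day
burst_weekly = weekly_node_hours(100, hours_per_day=4, days_per_week=1)  # assumed 4 h once a week

# Fraction of the always-on capacity that the daily burst actually needs.
daily_fraction = burst_daily / always_on
```

In this example the daily burst needs only one sixth of the node-hours of an always-on system, and the weekly burst far less, which is the capacity that bursting avoids keeping powered on locally.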

Large cloud providers can offer bigger benefits from economies of scale in power and energy efficiency, using the latest cooling technologies and situating their data centres in more power-efficient locations with greater access to renewable energy, such as Iceland. This approach also lets an organisation outsource some of its concerns over environmental efficiency.

Many universities have already voluntarily agreed to abide by the principles of the Paris Agreement, targeting net zero by 2050. Some more ambitious institutions have committed to net zero by an earlier date. The high performance computing departments of these universities will certainly become an important factor in these considerations in the near future.

Ultimately, computing burns through energy to produce computational results. You cannot get away from the fact that you need electricity to produce results, so the best thing you can do is try to get the most computation out of every watt you use. That comes down to using your cluster to its maximum level, while also making sure you are not wasting power.

Configuring the cluster in more environmentally sensitive ways, considering cloud options and giving people the conscious choice to do so will all help towards creating more sustainable HPC in the future.