“What Just Happened?”

By Franek Sodzawiczny, Founder and CEO, Zenium.

Friday, 17th June 2016. Posted by Phil Alsop

The majority of data center failures are unfortunately a result of human error. There have been multiple articles, reports and statistics published on this subject, and it’s widely accepted that about 75% of failures can be attributed to mistakes made by staff. Why, knowing this, does it continue to happen?

Unfortunately, many issues are caused by the fact that the training of maintenance engineers tends to be very generalised and poorly formulated. Time isn’t taken to consider the complexity of data center projects and the longer term challenges that these individuals will face when dealing with fast moving businesses under pressure to keep data accessible at all times.

Decisions about the provision of maintenance have also traditionally been based upon finding the lowest entry point. Because maintenance is often seen as ‘just an operational overhead’, little attention is paid to what is often the most important component of running a quality data center business.

It’s no wonder then that ‘incidents’ largely occur as a result of poor human intervention.

The solution lies in recognising that whilst the data center business is predominantly ‘engineering led’, the maintenance aspect of what we do is still critical. Engineering and design can be negatively impacted by poor operational elements, so maintenance should remain a vital part of the core service offering. As a result, there has to be a major focus on the quality of the individuals employed to fulfil this role in the first place. Engineers should have an inquisitive and methodical mind, but it’s important to remember that this cannot be taught, so finding people with a natural flair for the work is key. That said, it is vital that bespoke training programmes are put in place for employees on each hall, combined with documented failure scenarios, so that staff are fully aware of the options open to them to ensure a safe recovery.

It’s also important to include maintenance staff during the multiple stages of commissioning in the development of the data center where they are likely to be working, so that they can become familiar with the products that they will be maintaining without risk to the systems themselves. We certainly believe that engineering-led maintenance, coupled with training and scenario testing, is the best way to reduce the potential exposure to human error.

Making full use of specialist suppliers and manufacturers that are involved in the construction and maintenance of a data center is also imperative. In-depth knowledge of complex products is essential and the people best placed to complete detailed maintenance are the manufacturers themselves, so when full servicing is required the most sensible approach is to call in the experts. Again, the issue here is focussing on the quality of the personnel carrying out the work, not their hourly rate.

Whilst the end user will not necessarily be aware of this background training and selection process, they will undoubtedly benefit from an honest and transparent approach to reporting. The success of the working relationship between a data center provider and its customers is ultimately based on the ability to create a partnership built on trust so two-way dialogue is a must. Reasonable and realistic operational expectations from the customer’s perspective are a good starting point but so too is a focus on best practice procedures by the data center operator.

It should come as no surprise then that the old adage of “measure twice, cut once” is perfectly suited to the data center environment. Working in teams of two engineers for all tasks, in constant communication, is the best way to ensure that decisions are tested and the ‘human element’ checked before any action is undertaken. We all know that the potential risk of a failure is something we can’t ignore, so making the time to pay attention to the smallest of details is the best way to maintain standards and increase the success rate of each and every task.

Instead of wondering ‘what happened’, it’s clearly best to invest significant budget in getting technical support right, putting sufficient emphasis on carrying out due diligence first and giving careful consideration to every facet of data center design and construction. Expenditure should be secondary to quality; good advice and proven engineering experience must be prioritised, before and after construction. Aggressive and detailed testing should be a feature of maintenance best practice, resulting in a day-to-day challenge to maintain extremely high standards. Choose the right people to make the right decisions based on sound engineering principles, not cost, and concerns about ‘what happened’ should become a thing of the past.