When Cloud goes wrong

By Roger Keenan, managing director of the central London data centre, City Lifeline.

Monday, 6th October 2014. Posted by Phil Alsop.

It’s a well-known fact that technical things go wrong. So what should businesses think about to ensure reliable and consistent operations when cloud adds a layer of complexity?

The first step is recognising that things will go wrong. Whether operations are in an in-house data centre, an external commercial colocation data centre, or in a hybrid cloud arrangement, with workload split between in-house and cloud, the principles are the same.

Cloud isn’t new
No matter what marketing would have us believe, cloud is not a new concept. It is simply remote hosting of some or all of the workload in a data centre, and is not dissimilar in principle to 1960s timesharing services. The difference between 1964 and 2014 is the speed and data capacity of fibre optic cables, which open up a whole host of new possibilities to business owners. But the principle remains the same, as do the principles of resilient design.

As some or all of the workload can be hosted remotely, the most critical new consideration is the communication between the user and data centres where cloud operations take place.

Securing the right data partner
It is important that businesses choose a high-quality data centre with strong data communications and cloud experience to help minimise risks. Any data centre which says it has never had an outage of any sort is either too new to have a track record or is not training its sales staff to be honest. Even major players such as Google, Facebook and Amazon, with more money to spend than most businesses can dream of, have experienced very public data centre outages in the last five years. Most recently, in June this year, Microsoft Office’s 50 million users in the US experienced a nationwide two-day outage. Operations managers and architects need to ask the right questions to find out the truth, and work through the concepts of automatic fail-overs or manual switching in the event of something going wrong. Ultimately, it comes down to choosing a data centre that you trust.

Moving the right workload
Choosing the right workload to move to cloud is also important, especially in the early days when in-house IT staff have less experience of cloud operations. In general, workload made up of infrequent, small transactions that are not latency-critical works well in cloud. A CRM system is a good example, where the submission of a visit report or the retrieval of a customer phone number is infrequent, small, and not time-critical. On the other hand, voice telephony, which is a continuous stream of time-critical data, is not a good application to move to cloud, except for specialist suppliers who know how to do this and will be located in carrier-rich, carrier-neutral data centres to get the connectivity and diversity they need.

Automatic switching of IP address allocations is a particular problem which needs careful thought. The difficulty of automatically detecting a failure and instantly transferring all the IP addresses to another set of equipment in another location leads many smaller installations to accept a short outage and transfer the addresses manually.
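The detection half of that problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular product: the class name FailoverMonitor, the site labels "primary" and "standby", and the threshold of three consecutive failed probes are all hypothetical. The sketch shows why detection is rarely instant: a single failed probe may be a glitch, so the monitor waits for several failures in a row before switching, and it leaves the switch back to the primary site as a manual decision.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides when service should move from a primary site to a standby site.

    All names and thresholds are illustrative assumptions.
    """
    failure_threshold: int = 3    # consecutive failed probes before switching
    consecutive_failures: int = 0
    active_site: str = "primary"

    def record_probe(self, healthy: bool) -> str:
        """Record one health-probe result and return the site that should be active."""
        if self.active_site != "primary":
            # Already failed over; switching back is left as a manual decision.
            return self.active_site
        if healthy:
            # Any successful probe resets the count, so isolated glitches are ignored.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "standby"
        return self.active_site


monitor = FailoverMonitor(failure_threshold=3)
probe_results = [True, False, True, False, False, False, True]
history = [monitor.record_probe(ok) for ok in probe_results]
print(history)
```

In this run the single failed probe early on is ignored, and the switch to the standby site happens only on the third failure in a row. Actually moving the IP addresses is the harder part and is deliberately left out of the sketch.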

In resilient or safety-critical design, every element must be considered, and there is one key question which must be asked: “what will happen if this element fails?” The design can then be changed so operations will continue without interruption. If that is not possible, then a plan has to be put in place to deal with the effects of a failure that cannot be mitigated.
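As a rough illustration of asking that question systematically, the Python sketch below fails each element of a small, made-up design in turn and reports the ones whose loss stops the service. The design itself (two power sources, two carriers, one database server) and the function names are hypothetical, and the model is deliberately simple: a function keeps working as long as at least one of its elements does.

```python
from typing import Dict, List, Set

# Hypothetical design: each function the service needs, mapped to the
# elements that can each provide it on their own.
design: Dict[str, List[str]] = {
    "power": ["utility_feed", "generator"],
    "network": ["carrier_a", "carrier_b"],
    "database": ["db_server_1"],
}


def service_survives(design: Dict[str, List[str]], failed: Set[str]) -> bool:
    """The service runs only if every function still has a working element."""
    return all(
        any(element not in failed for element in elements)
        for elements in design.values()
    )


def single_points_of_failure(design: Dict[str, List[str]]) -> List[str]:
    """Ask 'what will happen if this element fails?' for every element in turn."""
    all_elements = sorted({e for elements in design.values() for e in elements})
    return [e for e in all_elements if not service_survives(design, {e})]


print(single_points_of_failure(design))
```

Here only the lone database server is reported, so it is the element that needs either a redesign (a second server) or a documented plan for dealing with its failure.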

Testing is key
Continuous testing is essential, as is reconsidering the effects of each potential failure anew each time the system design or architecture is changed. So are rehearsal and practice of both automatic fail-overs and manual procedures to deal with failures. At least once a year, every likely failure should be forced to happen, so that its effect on the overall system operation can be checked. This is one of the main principles of ensuring reliable, continuous operations, and is the same whether a business is operating an in-house data centre or a remotely hosted operation in a data centre in a cloud environment.
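One simple way to keep the once-a-year rule honest is to record when each failure was last forced and flag anything overdue. The Python sketch below is a minimal example of that idea. The scenario names and dates are invented for illustration, and treating "once a year" as 365 days is an assumption.

```python
from datetime import date, timedelta
from typing import Dict, List

MAX_INTERVAL = timedelta(days=365)  # assumed reading of "at least once a year"


def overdue_drills(last_drilled: Dict[str, date], today: date) -> List[str]:
    """Return the failure scenarios not forced within the last MAX_INTERVAL."""
    return sorted(
        scenario
        for scenario, when in last_drilled.items()
        if today - when > MAX_INTERVAL
    )


# Hypothetical record of when each likely failure was last forced to happen.
last_drilled = {
    "mains power loss": date(2014, 3, 1),
    "primary carrier link down": date(2013, 9, 15),
    "database server failure": date(2014, 1, 20),
}

overdue = overdue_drills(last_drilled, today=date(2014, 10, 6))
print(overdue)
```

With these dates, only the carrier link drill is flagged, because it was last exercised more than a year before the check date.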