Are availability zones a disaster recovery solution?

By Richard Stinton, Enterprise Solutions Architect, iland.

Monday, 11th December 2017 Posted 8 years ago in by Phil Alsop

I recently read an article which began “you can’t predict a disaster, but you can be prepared for one!” It got me thinking, I can hardly remember a time when disaster recovery was a bigger challenge for infrastructure managers than it is today. In fact, with ever increasing threats to IT systems, a reliable Disaster Recovery strategy is now absolutely essential for an organisation, regardless of their vertical market.

What does all this have to do with availability zones, I hear you cry? Furthermore, what is an availability zone and is it a good disaster recovery strategy? The purpose of availability zones is to provide better availability while protecting against failure of the underlying platform (the hypervisor, physical server, network, and storage). They give customers more options in the event of a localised data centre fault. Availability zones can also allow customers to use cloud services in two regions simultaneously if these regions are in the same geographic area.

Let us begin our discussion about availability zones by looking at the core capabilities that provide availability and resilience. Dynamic Resource Schedulers (DRS) provide Virtual Machine (VM) placement. That is, which host should run a given VM? A DRS also moves VMs around a cluster based on usage in order to balance out the cluster. High Availability (HA) provides the capability to restart VMs on other hosts in a cluster when either a host fails, or a VM crashes for any reason.

Now, let us look at the advantages that availability zones offer, as well as areas where they may fall short of constituting an effective disaster recovery strategy. This analysis of availability zone effectiveness will be divided based on three key challenges that cloud providers face: handling crashes or downtime, performing maintenance, and offering sufficient storage.

Crashes or Downtime

It is not unusual for a cloud provider to only offer HA and not DRS. In this case, in the event of a host hypervisor crash or deliberate shutdown, VMs are restarted on other hosts because they have shared storage. This is done using initial placement calculation. However, providers often do not have the ability to move a running VM between hosts in a cluster with no loss of service, and to incorporate such a DRS capability would strengthen disaster recovery preparedness.

Maintenance

There is also a problem with this model around planned maintenance. When hosts are updated, it is not possible to move the VMs that are running on them without loss of service. Therefore, VMs occasionally have the rug pulled out from underneath them.

With this in mind, many service providers talk about a ‘Design for Failure’ model when designing resilient services. In a nutshell, this means designing cloud infrastructure on the premise that parts of it will inevitably fail. Resiliency is provided at the application level. At the very least, this requires the doubling up of all applications, and for many deployments this necessitates additional licensing and additional costs for the VMs themselves.

Storage

Another crucial area to factor into this analysis is persistent storage. In the past, storage was protected using RAID techniques. Yet as we move to the public cloud, object storage has appeared as a popular way of storing data. This method uses the availability zone topology to protect data — but only if you choose it and pay for it. To protect against individual disk failure, three copies of the data are spread across the storage subsystems.

For virtual machines requiring persistent storage, Elastic block storage (EBS) is often used, and is replicated within the availability zone to protect against failure of the underlying storage platform.

EBS storage is not always replicated to other regions.

Regardless, having data replicated to another region does not mean that the VMs are available there. It only guarantees back-up storage. VMs would need to be created from the underlying replicated storage. It is also important to note that replicating storage to another availability zone or region only protects against storage subsystem failure. It does not protect against storage corruption, accidental deletion, or recent threats such as ransomware encrypting the files within the storage. To that extent, it is not creating a Disaster Recovery solution.

So, we return to our original question: can availability zones theoretically offer the resiliency needed for a good disaster recovery strategy? In the event of a crash, Dynamic Resource Schedulers can be used to move a VM between hosts in a cluster with no loss of service. However, when hosts are being updated, it is very difficult to move the VMs that are running on them without loss of service. As we have just discussed, redundant storage does not guarantee VM availability in other regions. Most importantly, these capabilities do not protect against data corruption or threats such as ransomware that encrypt data. Given this, a disaster recovery solution should be implemented in addition to the use of availability zones.

Cloud-to-cloud Disaster Recovery as a Service (DRaaS) can be adopted between data centres. With the iland DRaaS solution, VMs can be rebooted within seconds in the event of a crash or downtime. iland DRaaS also offers a continuous replication solution with a journal supporting up to 30 days. This means that you can recover data if it is lost or corrupted; for example, you can recover data from a ransomware attack. Self-service testing can also be carried out whenever required, while replication carries on in the background. As customers think about migrating their traditional virtualised services to the public cloud, they need to consider crashes, maintenance, storage, and also a disaster recovery strategy.