Data resilience - high availability and disaster recovery planning for your data

By Charly Batista, PostgreSQL Tech Lead at Percona.

Monday, 31st July 2023. Posted by Phil Alsop.

Today, companies have a strained relationship with data. Data powers the applications that companies rely on to operate and make money, making it incredibly valuable to them. At the same time, this business value often does not translate into support for investments around protecting that data against operational problems. While we rely on this data to make money, we can find it hard to build a business case around why we should protect data as a commodity over time.

Why is this? Oscar Wilde once wrote that a cynic is someone who “knows the price of everything and the value of nothing.” Business leaders know the price of protecting data, but not the value. The problem is due to how IT and business teams think about risk: IT tends to think of risk in terms of interruptions to IT services, whereas business teams understand risk as any set of circumstances that might affect revenues. This difference in thinking makes it harder to get support for investment, yet a whole business model that depends on data alone can be the biggest risk of all.

The discipline of business continuity has existed for decades. It covers all the skills, processes, technology and people required to keep organisations running effectively. Whether it involves using high availability (HA) technologies to prevent downtime as much as possible, disaster recovery (DR) planning to make getting back up and running easier, or a mix of both, business continuity exists to make organisations more resilient and better able to survive.

What should your priorities be?

Typically, any planning around business continuity will move on to terms like recovery time objectives (RTO) and recovery point objectives (RPO). RTO covers how long your organisation can afford to be down, and RPO covers how much data you can afford to lose during that process. These terms are important for us in the technology space because they provide guidance on the financial impact of potential interruptions.
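As a rough illustration of how these two targets can be checked against what a team actually has in place, here is a minimal Python sketch. The names RecoveryTargets and check_targets, and all of the figures used, are hypothetical and chosen only for the example. It assumes the simplest possible setup, where data is protected only by periodic backups, so the worst-case data loss is one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    """Hypothetical container for the two business targets."""
    rto: timedelta  # longest outage the business can tolerate
    rpo: timedelta  # largest window of data loss the business can tolerate


def check_targets(targets, backup_interval, measured_restore_time):
    """Return a list of gaps; an empty list means both targets are met."""
    gaps = []
    # Assumption: periodic backups only. If a failure happens just before
    # the next backup runs, a full interval of data could be lost.
    if backup_interval > targets.rpo:
        gaps.append(
            f"RPO missed: up to {backup_interval} of data could be lost, "
            f"target is {targets.rpo}"
        )
    # The restore time should come from a real, timed restore test.
    if measured_restore_time > targets.rto:
        gaps.append(
            f"RTO missed: restore took {measured_restore_time}, "
            f"target is {targets.rto}"
        )
    return gaps


# Illustrative figures only: a 4-hour RTO and a 15-minute RPO,
# checked against nightly backups and a 3-hour restore.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
gaps = check_targets(
    targets,
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
)
for gap in gaps:
    print(gap)
```

In this example the restore time is within the target, but nightly backups cannot meet a 15-minute RPO. Putting the gap in those terms can make the conversation with the business more concrete.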

However, while they are useful for flagging those potential costs and challenges that companies will face, they are often perceived to be technical problems rather than business issues. At the same time, any of these projects can be viewed as being like insurance - costs that you pay out, but never get any value from if you don’t use them. This is the old problem of cost versus value.

This tends to be a problem where business leaders don’t understand the value of the physical data that the company processes, compared to the theoretical data that they work with and get value from. Without this understanding that data equates to physical information held in databases running on on-premises servers or on machines that host instances in the cloud, it is hard to convey the need for robust data support and disaster recovery. This level of protection and resilience does not happen by accident, yet it often becomes easier to get support when something goes wrong.

Changing this mindset should be your highest priority. Getting ahead of potential problems should be essential, but it will only happen if we can change the mindset from the top down. Once this is in place, support for projects that deliver business continuity, availability and resilience should follow suit.

Delivering data resilience

To make these projects work, we have to understand the real challenges that companies face, the risks that might affect the data we have and how we can protect it effectively. This might be through implementing HA for our critical application servers and databases. It might be through having an effective DR plan for restoring that data. But we also have to address other points of failure. This requires more insight into the whole business process that relies on that data, and a look at our whole infrastructure.

Similarly, a lot of companies today have backups in place. They will proudly point to the tape that they use, or the DR install that takes copies of their critical files and moves them off-site. However, how many of those teams have tested their backups to prove that they work? It is not enough to back up data alone. You have to be able to restore reliably and predictably too.
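The idea of proving that a backup can be restored can be sketched in a few lines of Python. This is a simplified, file-level illustration only: the function names make_backup and verify_restore are hypothetical, and a real database backup would need to be restored with the database's own tooling and checked at the level of tables and transactions. The principle is the same, though: record what you expect, restore somewhere disposable, and compare.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum of a single file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_backup(source_dir, archive_path):
    """Archive source_dir and return a manifest of relative path -> checksum."""
    source_dir = Path(source_dir)
    manifest = {
        str(p.relative_to(source_dir)): sha256_of(p)
        for p in sorted(source_dir.rglob("*"))
        if p.is_file()
    }
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")
    return manifest


def verify_restore(archive_path, manifest):
    """Restore into a scratch directory; return files that are missing or differ."""
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(scratch)
        for rel_path, expected in manifest.items():
            restored = Path(scratch) / rel_path
            if not restored.is_file() or sha256_of(restored) != expected:
                failures.append(rel_path)
    return failures


# Small demonstration with throwaway data.
with tempfile.TemporaryDirectory() as work:
    data = Path(work) / "data"
    data.mkdir()
    (data / "orders.csv").write_text("id,total\n1,9.99\n")
    archive = Path(work) / "backup.tar.gz"  # kept outside the data directory

    manifest = make_backup(data, archive)
    failures = verify_restore(archive, manifest)
    print("restore OK" if not failures else f"restore failed for: {failures}")
```

Running a check like this on a schedule, and timing it, also gives you a measured restore duration rather than an estimate.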

Recovery processes get harder as you scale. For example, you won’t have one user connected to your database at a time. Usually, you will have a lot of concurrent connections to your database, and you will have transactions trying to write to the same table multiple times a second. As you grow your application and database to cope with transaction volumes, you will have to grow your database instances and either shard your data or use clustering to keep up. With a single database instance, recovery is less complicated because incoming transactions effectively form a single queue to be processed. This makes it easy to determine the correct order for transactions.

When you have a cluster or set of shards to support, that queue will also be split. Recovering data in these circumstances is harder as you have to work out which node is right and what order of transactions is accurate. While it is not a common scenario, what happens if your backup is not reliable? Instead of having files from only one node of your cluster, you may have files on different nodes of the environment and no information on which transactions are correct and which ones are not. This can be made even more complicated if users then carry out more interactions with those transactions to create more data or analysis. When this does happen, you need specialist skills to manage the situation. Recovering data in the wrong way can actually cause more problems with data accuracy than losing the data in the first place.

The value of skills and experience

One of the biggest challenges in these circumstances is the lack of specialist skills around database administration. In the past, DBAs would have experience with availability and recovery processes, and the tools that provided this resilience. For split brain scenarios where recovery needs specific insight, it can be hard, if not impossible, to succeed without this expertise.

This is a problem that affects even the latest technologies where automation should take care of the situations involved. For instance, many developers choose to use Kubernetes because it promises faster and easier development processes, as well as the ability to scale up in response to demand. In one instance, a problem occurred because the developer did not understand how the backup process gathered data and files from different nodes. When the primary node experienced a problem, the secondary node was promoted and took over responsibility for operations. This worked fine and the application kept running. When they brought the previous primary node back, they did not take care to bring it back as a replica. To make matters worse, they did not stop the backup process.

Why is this such a problem? The two nodes that had each run as primary ended up fighting over which one held the true record of transactions processed. This led to a complete mix-up over which transactions had been processed and which were accurate. Luckily, thanks to an effective test process, the developer found this issue before a real-world disaster happened where that backup was needed. However, this could have been avoided by having good working knowledge of databases and data design available.
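To make the split brain problem more concrete, here is a small Python sketch of what "two versions of the truth" looks like. It is a toy model, not how any particular database or Kubernetes operator detects the condition: the transaction identifiers and the find_divergence function are hypothetical, and real systems track history with their own internal log positions. The point is simply that once both nodes accept writes, their histories share a common prefix and then disagree.

```python
def find_divergence(log_a, log_b):
    """Return the first index where two transaction logs disagree.

    Returns None when one log is simply a prefix of the other, which is
    the healthy case of a replica that is behind but not in conflict.
    """
    for index, (entry_a, entry_b) in enumerate(zip(log_a, log_b)):
        if entry_a != entry_b:
            return index
    return None


# Hypothetical histories: the old primary came back and accepted a write
# of its own instead of rejoining as a replica of the new primary.
old_primary = ["tx1", "tx2", "tx3", "tx4-old"]
new_primary = ["tx1", "tx2", "tx3", "tx4-new", "tx5-new"]

point = find_divergence(old_primary, new_primary)
if point is None:
    print("histories are consistent; the shorter node can catch up")
else:
    print(f"histories diverge at position {point}; manual reconciliation needed")
```

A backup process that keeps pulling files from both nodes after this point will mix the two histories, which is why the node needs to rejoin as a replica before backups continue.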

Today, there are fewer full-time DBAs employed at companies. Teams like DevOps and Site Reliability Engineering are expected to provide support for data and database instances alongside their other responsibilities. While many database deployments can now be automated and skills provided within the tools we use, there are still many challenges that require experience and skills to get right. Retaining this experience can help with those edge cases and prevent some of the bigger problems that can otherwise affect services and applications.

Going back to the quote from Oscar Wilde, when it comes to being cynical, many IT professionals wear this term as a badge of honour. However, rather than being only focused on the cost, we have to deliver both value from data and protection against interruptions in order to succeed. That is far from being cynical.