Prepare for ‘partial failures’ of IT infrastructure like Visa outage

Visa’s letter to the Treasury Select Committee, documenting the details behind the recent outage that left millions of people unable to complete card transactions, highlights a critical challenge that organisations face when exposed to a ‘partial failure’ of IT infrastructure. This is according to Peter Groucutt, managing director of Databarracks.

Friday, 22nd June 2018, by Phil Alsop

This week, Visa revealed that a ‘rare defect’ in a switch caused a partial failure in its primary UK data centre. The fault delayed its secondary data centre from taking over the handling of all card transactions and was the root cause of millions of failed card transactions over 10 hours on Friday 1st June 2018.

In the wake of the outage, the Committee contacted the payments firm, seeking clarification of the cause and assurances about what action Visa is taking to prevent a repeat. Groucutt says a number of lessons can be learned from the findings:

“Businesses are often better prepared for a complete outage than for ‘partial failures’. When a system fails completely, the fail-over process is more clearly defined, whether it is a manual action or an automatic process. Partial failures, however, make that change-over difficult. Once the problem has been identified, you have to decide either to switch fully to the secondary system or to fix the problem on the primary. The point at which to fail over is specific to each organisation and to the issue you are dealing with.

“A switch issue, for instance, will require a different response to a natural disaster. An organisation with good Incident and Crisis Management processes will have these plans in place: decisions will already have been made and documented, so in the event of an incident, a business knows exactly what to do.

“In practice, a business might decide that it can’t tolerate an outage of longer than four hours. If it takes two hours to be fully operational at a second site, that leaves a window of just two hours to fix the issue before committing to fail-over.”
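
The arithmetic in Groucutt’s example can be sketched in a few lines of Python. This is an illustration only, not a description of Visa’s or Databarracks’ tooling: the function names `repair_window` and `must_fail_over` are hypothetical, and the four-hour and two-hour figures are simply the ones quoted above.

```python
from datetime import timedelta


def repair_window(max_tolerable_outage: timedelta,
                  failover_duration: timedelta) -> timedelta:
    """Time available to attempt a fix on the primary before fail-over must start.

    Hypothetical helper: the window is the tolerable outage minus the time
    it takes to become fully operational at the second site.
    """
    window = max_tolerable_outage - failover_duration
    # If fail-over takes longer than the tolerable outage, there is no window.
    return max(window, timedelta(0))


def must_fail_over(elapsed: timedelta,
                   max_tolerable_outage: timedelta,
                   failover_duration: timedelta) -> bool:
    """True once the time spent on the incident has used up the repair window."""
    return elapsed >= repair_window(max_tolerable_outage, failover_duration)


# Figures from the example above: 4 hours tolerable, 2 hours to fail over.
tolerable = timedelta(hours=4)
failover = timedelta(hours=2)

print(repair_window(tolerable, failover))                       # 2:00:00
print(must_fail_over(timedelta(hours=1), tolerable, failover))  # False
print(must_fail_over(timedelta(hours=2), tolerable, failover))  # True
```

In a real plan the two inputs would come from the organisation’s own documented recovery objectives and from measured fail-over tests rather than from fixed constants.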

Groucutt continues: “We would expect Visa to have a very mature incident management process in place and, based on the reports, that was absolutely the case. Partial failures can be very difficult to plan for and manage, but the issue was identified and response protocols were initiated.”

Groucutt concludes: “The lesson Visa can take from the incident is that it wasn’t prepared for this particular partial failure and should address this by building new processes to allow the backup switch to take over. We can all do the same.

“It is a good idea to include issues like this in your testing. It’s not just switches; we’ve seen exactly this issue with UPS systems and generators too. An organisation will have a testing schedule for each of these technologies, so it’s important to include the impact of partial failures in those tests. A business should think about how quickly it can identify what the issue is and, importantly, the actions which then need to be taken: either fix the problem and recover or, alternatively, manually take the system offline and fail over to a secondary site.”