Incident readiness is the key to operational resilience

Many UK businesses are still reeling from last year’s global IT outage, which brought systems to a standstill. It was a stark reminder of how reliant businesses have become on third-party systems and of the dangerous ripple effect that system failures can cause. Steve Barrett, the VP of EMEA at Datadog, looks at how organisations can successfully maintain operational resilience across their ever-expanding and deeply interconnected cloud environments.

Friday, 3rd October 2025. Posted in Tech & Trends by Phil Alsop

The majority of large enterprises are becoming increasingly reliant on complex and distributed environments made up of virtual machines, Software-as-a-Service (SaaS) solutions and hybrid cloud infrastructures. Many of these services are managed by different IT, DevOps and security teams, across different locations, which is only adding to the operational complexity. It’s a patchwork of systems that can be difficult to manage if an organisation has limited visibility of operations across the cloud and IT stack. This can often be the case as large organisations tend to be reliant on siloed monitoring tools.

The CrowdStrike incident emphasised the fragility of these systems and how they can be undone by a single technical point of failure. With so many different monitoring tools in place, it’s hard to spot failures as they happen, assess their impact on downstream services and recover quickly. This is why it’s vital to have real-time infrastructure monitoring that can track metrics, detect and prioritise vulnerabilities, and identify any unusual activity across hybrid cloud environments.

A consolidated view provides clarity

In many scenarios, a single point of failure, or root issue, can lead to separate monitoring tools triggering different alerts across different consoles. This delays identifying and triaging the problem, and therefore the action needed to resolve it. That’s why it’s better to have a centralised view, enabled by observability platforms designed to unite metrics, logs, traces and security signals. With a system like this in place, related signals can be consolidated into a single correlated alert, so that IT teams can focus on the most urgent issues. One unified dashboard that spans SaaS platforms, applications and cloud infrastructures provides clarity and helps to shorten recovery time.
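To make the idea of a single correlated alert more concrete, here is a minimal Python sketch. It is an illustration only and not how any particular observability platform works: the service names, the DEPENDS_ON dependency map, the Alert fields and the five-minute window are all assumptions invented for the example. Alerts raised by different tools are grouped when they trace back to the same upstream service within the same time window, and the resulting incidents are ordered by their worst severity.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical dependency map: each service points to the upstream it relies on.
DEPENDS_ON = {
    "checkout-api": "payments-db",
    "orders-api": "payments-db",
    "payments-db": None,
    "search-api": None,
}

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Alert:
    source: str      # monitoring tool that raised the alert
    service: str     # affected service
    severity: str    # one of the keys in SEVERITY_RANK
    timestamp: int   # seconds since epoch


def root_of(service: str) -> str:
    """Walk the dependency map up to the top-most upstream service."""
    seen = set()
    while DEPENDS_ON.get(service) is not None and service not in seen:
        seen.add(service)
        service = DEPENDS_ON[service]
    return service


def correlate(alerts, window_seconds=300):
    """Group alerts that share a root service and time window into incidents."""
    groups = defaultdict(list)
    for alert in alerts:
        bucket = alert.timestamp // window_seconds
        groups[(root_of(alert.service), bucket)].append(alert)

    incidents = []
    for (root, _bucket), members in groups.items():
        worst = max(members, key=lambda a: SEVERITY_RANK[a.severity])
        incidents.append({
            "root_service": root,
            "severity": worst.severity,
            "sources": sorted({a.source for a in members}),
            "alert_count": len(members),
        })
    # Most urgent incidents first.
    incidents.sort(key=lambda i: SEVERITY_RANK[i["severity"]], reverse=True)
    return incidents


alerts = [
    Alert("apm", "checkout-api", "high", 1000),
    Alert("logs", "orders-api", "medium", 1010),
    Alert("infra", "payments-db", "critical", 1020),
    Alert("synthetics", "search-api", "low", 1030),
]
for incident in correlate(alerts):
    print(incident)
```

In this example, three alerts from three different tools collapse into one critical incident rooted at the hypothetical payments-db service, while the unrelated search-api alert stays separate. Fixed time buckets keep the sketch short, but alerts that straddle a bucket boundary would be split, so a real system would more likely use a sliding window.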

This process is bolstered by having an effective incident management plan in place, with open lines of communication that lead to carefully orchestrated responses. This methodical approach also gives teams the ability to adapt to different scenarios as the incident unfolds. It also creates a template for detailed post-mortems of incidents and attacks, helping teams to understand why they occur and to prevent them from happening again.

Adopting a culture of resilience and incident readiness

Real-time visibility into the health and performance of infrastructure, applications and services, supported by unified dashboards and a plan of action, can help to detect and neutralise issues before they escalate. This level of preparedness or ‘incident readiness’ is underlined by real-time access to telemetry data that makes it possible to spot incidents and anomalies as they occur.
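As a rough illustration of spotting anomalies in telemetry as they occur, the Python sketch below flags a metric value that deviates sharply from a rolling window of recent readings. It uses a simple z-score check, which is one common technique and not necessarily what any given platform uses. The class name, window size, threshold and sample latency figures are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag values that sit far outside the recent range of a metric."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # most recent readings only
        self.threshold = threshold          # allowed distance in standard deviations
        self.min_samples = min_samples      # readings needed before judging

    def observe(self, value: float) -> bool:
        """Record a reading and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
            else:
                # A perfectly flat history: any change counts as a deviation.
                is_anomaly = value != mu
        self.values.append(value)
        return is_anomaly


detector = RollingAnomalyDetector(window=10, threshold=3.0, min_samples=5)
latencies_ms = [100, 102, 98, 101, 99, 100, 450]  # made-up sample readings
flags = [detector.observe(v) for v in latencies_ms]
print(flags)
```

Only the final reading of 450 ms is flagged here. Note that the sketch still adds anomalous readings to the window, which would skew later comparisons during a prolonged incident. A production detector would typically also handle seasonality and gaps in the data.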

Incident readiness provides a framework that can be easily incorporated into business processes. It enables teams of responders to collaborate and coordinate an immediate response to problems, using internal channels like Slack or other tools to manage a constant flow of communication. This will add more clarity to the situation and keep the executive team and other business stakeholders updated on the scale of the incident and the path to resolution.

This leads to a culture of resilience and organisational accountability that can be reinforced through regular incident training and a continuous review of incident management processes. This is embodied by a healthy and proactive culture that eschews blame in the event of human error or systemic failures. Incidents can lead to high-pressure situations and it’s better to alleviate pressure by empowering responders to find creative solutions to a problem. This enables teams to quickly assess the severity of an incident, contain it and identify the root cause, before resolving the issue. This approach will help to build operational resilience and maintain transparency in the long term.

Achieving a state of operational resilience

Today, cloud infrastructures are in a constant state of flux. Legacy monitoring tools are too fragmented, unable to keep track of the sprawl of data, SaaS platforms and the containers that underpin them. UK business leaders can turn to observability to achieve greater visibility across their IT stack, gain deep insights into the performance of their applications and infrastructure, and bolster organisational resilience.

Equally, the same platforms and tools that provide insights into application performance and infrastructure can be used to detect threats in real time across dynamic cloud environments. This means security teams can quickly investigate and triage threats, mitigate the impact, and collect data that will help them to improve their security posture, recognise attack patterns and deal with known threats. This will bolster cyber resilience and break down silos between DevOps and security to create a DevSecOps framework that fosters greater collaboration.

UK businesses can achieve a high level of operational resilience through effective incident management based on real-time awareness of data and systems across the IT stack. This will allow them to scale their monitoring and observability dynamically with their cloud environments, so that they remain in a constant state of readiness.
