How to get started with AIOps

Managing major incidents can be stressful for IT Ops, and AIOps is one emerging solution that helps IT improve system reliability and customer satisfaction while reducing some of the manual work. By Issac Sacolick.

Posted Tuesday, 3rd August 2021 by Phil Alsop

“Houston, we have a problem.”

This is exactly what people in IT Operations think whenever a series of monitoring alerts goes off simultaneously. Within five minutes, they receive the invite to the bridge call and start to read out what each monitor is reporting. The team reviews incidents raised in Cherwell, network alerts from Nagios, system alerts in LogicMonitor, log files in Splunk, and Jenkins deployments to identify potential causes and decide on a course of action.

Fifteen minutes into the call, business leaders join to get a status update and remind everyone of the expected service levels for business-critical applications. Business leaders have higher expectations for system reliability and performance, especially for customer-facing applications and critical workflows.

These situations can be stressful for IT Ops to manage, and AIOps is one emerging solution that helps IT improve system reliability and customer satisfaction while reducing some of the manual work.

What is AIOps?

Proactive IT leaders look to apply AIOps capabilities to reduce complexities, enhance employee experiences, and improve service levels.

AIOps refers to applying AI and machine learning capabilities to support IT operations. A must-have capability of AIOps is correlating multiple monitoring alerts into a single, time-sequenced incident that’s easier to review and faster to resolve. For example, it might show that a Continuous Integration (CI)/Continuous Delivery (CD) deployment triggered database failures followed by application errors, and group all of these alerts into a single incident. An incident manager seeing this sequence can quickly deduce the root cause, consult the development team on the recent changes, and determine the required steps to restore service.
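To make the idea of a single, time-sequenced incident concrete, here is a minimal Python sketch. It assumes a simple rule: an alert belongs to the current incident if it arrives within five minutes of the previous alert. The `Alert` class, the source labels, and the five-minute window are hypothetical choices for illustration; commercial AIOps platforms typically use machine learning and topology data rather than a fixed time window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # hypothetical label for the tool that raised the alert
    message: str
    timestamp: datetime


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts into incidents.

    An alert joins the current incident when it arrives within `window`
    of the previous alert; otherwise it starts a new incident.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= window:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


# Illustrative alerts, deliberately out of order.
t0 = datetime(2021, 8, 3, 14, 0)
alerts = [
    Alert("app-monitor", "HTTP 500 rate spike", t0 + timedelta(minutes=4)),
    Alert("ci-cd", "Deployment finished", t0),
    Alert("db-monitor", "Connection failures", t0 + timedelta(minutes=2)),
    Alert("network-monitor", "Link flap on edge switch", t0 + timedelta(hours=3)),
]

incidents = correlate(alerts)
for number, incident in enumerate(incidents, start=1):
    print(f"Incident {number}:")
    for alert in incident:
        print(f"  {alert.timestamp:%H:%M} [{alert.source}] {alert.message}")
```

With this sample data, the deployment, database, and application alerts land in one incident in time order, while the unrelated network alert three hours later becomes its own incident.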

Using AIOps in incident management to process data from multiple monitoring tools is one use case of platform intelligence. Applying AI and machine learning in IT Operations also includes:

·         Discovery and Dependency Mapping (DDM) automations to capture hybrid cloud infrastructure changes, maintain the CMDB, and capture dependencies between systems.

·         Virtual agents use Natural Language Processing (NLP) to help end-users search and access the service catalog.

·         Sentiment analysis applied to end-user feedback can trigger a follow-up by the IT service desk when there is a negative customer response.

·         Machine learning categorization of requests can improve the mapping of requests to the correct services and route them to the right team.
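As a rough illustration of the last item, the sketch below routes a new request to the team whose past requests share the most words with it. This word-count scoring is a deliberately simple stand-in for a trained machine learning classifier, and the request texts and team names are invented for the example.

```python
from collections import Counter

# Hypothetical labelled history: (request text, team that resolved it).
HISTORY = [
    ("cannot connect to vpn from home", "network"),
    ("vpn keeps dropping connection", "network"),
    ("need license for design software", "software"),
    ("install software on new laptop", "software"),
    ("reset my password account locked", "identity"),
    ("account locked after password change", "identity"),
]


def train(history):
    """Count how often each word appears in requests handled by each team."""
    counts = {}
    for text, team in history:
        counts.setdefault(team, Counter()).update(text.lower().split())
    return counts


def route(text, counts):
    """Return the team whose past requests share the most words with `text`."""
    words = text.lower().split()
    scores = {team: sum(c[w] for w in words) for team, c in counts.items()}
    return max(scores, key=scores.get)


model = train(HISTORY)
print(route("my vpn connection is slow", model))
print(route("password expired and account locked", model))
```

A production system would more likely train a text classifier on thousands of historical tickets, but the flow is the same: learn from resolved requests, then score each new one.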

With several machine learning capabilities available, IT Ops leaders should consider the following steps in getting started with AIOps.

1. Configure the AIOps Data Sources

Machine learning algorithms require clean data sources, so the first steps to enable AIOps capabilities are to connect data sources and iterate through their configurations.

IT teams should start by configuring DDM to capture systems, network, and application data from data centers, private clouds, and public clouds, including AWS and Azure. DDM should update the CMDB regularly, and IT Ops should then map systems information to business services and service levels.

ITSM practices like incident management and change management have a lot more context with a DDM-powered CMDB. Incidents and change tickets already capture what happened, and with an integrated DDM-powered CMDB, the tickets can also include where it happened. The added context helps IT resolve issues faster and enables analytics on repeat problem areas.
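The sketch below shows one way this enrichment could look, assuming the CMDB can be queried as a simple lookup from configuration item to location and business service. All item names, fields, and tickets are hypothetical, and a real CMDB would be queried through its own API rather than a dictionary.

```python
from collections import Counter

# Hypothetical CMDB extract: configuration item -> location and business service.
CMDB = {
    "db-01": {"location": "datacenter-east", "service": "checkout"},
    "web-02": {"location": "cloud-region-a", "service": "checkout"},
    "mail-01": {"location": "datacenter-east", "service": "email"},
}

# Hypothetical incident tickets: what happened, and on which configuration item.
INCIDENTS = [
    {"id": 1, "ci": "db-01", "summary": "replication lag"},
    {"id": 2, "ci": "web-02", "summary": "HTTP 500 errors"},
    {"id": 3, "ci": "db-01", "summary": "connection pool exhausted"},
    {"id": 4, "ci": "printer-09", "summary": "paper jam"},  # not in the CMDB
]


def enrich(incident, cmdb):
    """Return a copy of the ticket with 'where' context added from the CMDB."""
    unknown = {"location": "unknown", "service": "unknown"}
    return {**incident, **cmdb.get(incident["ci"], unknown)}


enriched = [enrich(incident, CMDB) for incident in INCIDENTS]

# Simple repeat-problem analytics: incidents per business service.
by_service = Counter(ticket["service"] for ticket in enriched)
print(by_service.most_common())
```

Counting enriched tickets by business service or location is the simplest form of the repeat-problem analytics described above. Tickets that resolve to "unknown" also highlight gaps in CMDB coverage.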

Secondly, connect all the system, network, and application monitoring tools to a central AIOps solution. This solution should reduce the noise of reviewing multiple monitoring tools by correlating multiple alerts into one manageable incident. Connecting the data sources also starts capturing the historical data needed for predictive analytics and anomaly detection, which become possible once there is sufficient data to train machine learning algorithms.
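To show why historical data matters for anomaly detection, here is a minimal sketch that flags a metric value when it lies more than three standard deviations from the historical mean. The z-score rule and the threshold are assumptions chosen for simplicity, and the response times are invented; AIOps products generally use more sophisticated models that account for seasonality and trends.

```python
from statistics import mean, stdev


def is_anomaly(history, value, threshold=3.0):
    """Flag `value` when it lies more than `threshold` standard deviations
    from the mean of the historical samples."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # flat history: any change is unusual
    return abs(value - mu) / sigma > threshold


# Hypothetical response times (ms) collected from a monitoring tool.
baseline = [120, 118, 125, 130, 122, 119, 127, 121]

print(is_anomaly(baseline, 124))
print(is_anomaly(baseline, 400))
```

The first check returns `False` because 124 ms sits well within the normal spread, and the second returns `True`. With fewer than two historical samples the function declines to flag anything, which mirrors the point that detection only becomes useful once enough data has been collected.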

2. Discover ITSM Pain Points and Opportunities

Once IT connects DDM to the CMDB and aggregates monitoring data, it’s time to put the data to use and improve KPIs. IT should seek opportunities to improve customer satisfaction, mean time to resolve issues (MTTR), and system reliability.
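As a baseline for one of these KPIs, MTTR can be computed directly from ticket timestamps. The sketch below assumes each ticket records an opened time and a resolved time; the field names and sample tickets are hypothetical, and open tickets are excluded from the average.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - opened) over resolved incidents.

    Returns None when no incident has been resolved yet.
    """
    durations = [
        incident["resolved"] - incident["opened"]
        for incident in incidents
        if incident.get("resolved")
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical tickets exported from an ITSM tool.
tickets = [
    {"opened": datetime(2021, 8, 2, 9, 0), "resolved": datetime(2021, 8, 2, 10, 0)},
    {"opened": datetime(2021, 8, 2, 13, 0), "resolved": datetime(2021, 8, 2, 16, 0)},
    {"opened": datetime(2021, 8, 3, 8, 0), "resolved": None},  # still open, excluded
]

mttr = mean_time_to_resolve(tickets)
print(mttr)
```

Tracking this figure before and after each change gives the team a simple way to see whether an improvement actually shortened resolution times.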

Proactive IT operations teams approach these improvements strategically by following the steps below:

1.    Itemize and prioritize IT operational pain points from incident management, request management, and services with business impacts.

2.    Review the data and insights in DDM and monitoring tools and perform a data discovery exercise. Ask, “What does the data tell you?” and identify the improvement opportunities it suggests.

3.    Align pain points with the opportunities to help prioritize focus areas.

4.    Determine stakeholders that can steer process improvements and articulate success criteria.

5.    Identify which metrics or KPIs demonstrate whether improvements meet the selected success criteria.

These steps ensure that IT gains a business partner on the implementation and invests effort in areas that deliver the greatest business impact.

3. Use Agile Processes to Implement Solutions

This process of identifying stakeholders, priorities, and KPIs also helps define an ongoing process of improvements. As teams use data-driven insights to identify opportunities and partner with stakeholders on pain points, a backlog of process improvement projects should emerge.

The backlog is exactly what proactive IT leaders need when implementing AIOps capabilities. These leaders should form agile teams in IT Operations to implement improvements iteratively.

Why an agile process? Priorities are likely to change based on business needs, opportunities, and risks. The team might focus on end-user computing applications in the first sprints to support hybrid working environments. The group might then shift to work with an application development team that’s modernizing applications and migrating them to the cloud.

In both these scenarios, topology maps, infrastructure visualizations, and other DDM tools help the team identify implementation opportunities.

For example, let’s say end-users open incident tickets for slow application response time during the afternoon. The DDM’s topology report can help IT Ops identify the bottleneck by comparing application flows between poor and normal performance periods. The team may then choose to adjust cloud elasticity parameters to ramp up infrastructure in the bottlenecking areas ahead of peak periods. After the modifications, the team monitors performance and incidents to validate whether the change addressed the issue.
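The comparison in this example can be sketched in a few lines of Python. The sketch assumes the topology report can be reduced to an average latency per hop for a normal period and a slow period, and it picks the hop whose latency grew the most relative to normal. The hop names and latencies are invented for illustration.

```python
def find_bottleneck(normal, slow):
    """Return the hop whose latency grew the most relative to its normal
    value, together with that growth ratio."""
    ratios = {
        hop: slow[hop] / normal[hop]
        for hop in normal
        if hop in slow and normal[hop] > 0
    }
    hop = max(ratios, key=ratios.get)
    return hop, ratios[hop]


# Hypothetical average latency per hop (ms) from two reporting periods.
normal_period = {"load-balancer": 5, "web-tier": 40, "app-tier": 80, "database": 30}
slow_period = {"load-balancer": 6, "web-tier": 44, "app-tier": 90, "database": 150}

hop, ratio = find_bottleneck(normal_period, slow_period)
print(f"{hop}: {ratio:.1f}x normal latency")
```

Here the database tier stands out at five times its normal latency, so it would be the first candidate for elasticity changes ahead of the afternoon peak. Rerunning the same comparison after the change is one way to validate whether the adjustment worked.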

Using an AIOps solution, including its DDM capabilities, enables this data-driven cycle of process improvement. Machine learning in DDM and AIOps reduces the complexity of working with multiple data sources or stale data. Instead, it enables IT to focus on the customer, understand pain points, implement solutions, and validate results. The result is a proactive IT Operations team that’s constantly improving and delivering stronger system performance to business stakeholders.