Reducing operational costs and improving productivity in a post-Covid world

How the aftermath of the pandemic has caused a shift in IT operations worldwide. By Ann Hall, Marketing Manager, HEAL Software Inc.

  • Tuesday, 3rd August 2021. Posted by Phil Alsop

The current state of IT Operations has shifted to the widespread adoption of multiple AIOps Tools within an enterprise, each with niche capabilities such as observability and proactive anomaly detection. A single tool is frequently insufficient to deal with the increasing complexity of hybrid digital environments and multiple tools are needed to provide visibility across disparate silos. Operations teams are making the shift from monitoring / managing alert storms and disconnected events via dashboards, to focusing on root causes of issues in various dimensions like network, infrastructure, end user transactions and application code via analytics and intelligent event correlation.

2020 forced many teams out of their erstwhile comfort zones. While the pandemic made businesses rethink their overall organizational budgets, one area where expenses had to be pruned was IT Operations. IT Operations traditionally required large teams – mostly comprising contract workers – physically co-located at data centers. These teams were now forced to work remotely from disparate locations because offices had closed over Covid-19 concerns. On one hand was the challenge of having to operate with skeletal staff, and on the other, the heightened need to keep enterprises running 24×7 with virtually no downtime; most businesses from banking to retail storefronts started operating exclusively online, with branches and brick-and-mortar stores being shut. IT Operations teams were ill-equipped to straddle both these seemingly paradoxical requirements.

While 2020 was a wake-up call to teams to optimize IT Operations expenditure and look judiciously at making data center costs leaner, 2021 will be the year where these lessons are put into practice. In this whitepaper we look at the consequences of the pandemic on IT Operations and the steps that can be taken in 2021 to ensure that enterprises implement cost-cutting measures effectively.

The Pandemic Effect on IT Operations in 2020

1. Leaner Teams

Most jobs in Network Operations Centers are contract jobs. The first step that most organizations took in 2020 was to downsize the variable pool of talent. Leaner teams were expected to perform a similar quantum of work, which in turn emphasized the need to move to greater automation in ITOM.

Challenges in 2021: Attracting the right talent, doing more with less, introduction of automation in business processes to make do with smaller teams.

2. Remote Management

Businesses were forced to give up leased and rented office spaces to cut back on expenses – and this affected IT Operations teams in a big way. Skeletal staff were deployed in the remaining office(s) and the rest of the team was expected to move to a work-from-home mode of operation almost overnight. Managers had to oversee teams remotely and ensure that all employees had the required resources and know-how to use VPN, online meeting tools and collaborative solutions.

Gartner predicted that spending on services for remote workers, such as apps, platforms and equipment, would increase by up to 19% in 2020, with this number potentially being even higher by the end of the year and in 2021. One of the largest areas of cloud to experience an increase in demand was desktop-as-a-service (DaaS), with a predicted growth of 95.4% in 2020. DaaS is an inexpensive option for organizations that are looking to support their remote workers by providing secure access to enterprise applications remotely.

Challenges in 2021: Greater security, availability of right talent with the required software and hardware resources, need for collaboration among teams.

3. Automated Deployments

In 2020, organizations had to rethink their deployment strategies and come up with mechanisms for remote and automated deployments due to staff shortage and absence of a centrally located workforce. This created the need for dedicated automation architects who could guide DevOps and SRE teams on the use of Agile practices for breaking down organizational silos between software developers and IT operations personnel.

DevOps and Agile are collaborative practices, focusing on faster deployment, maintenance and upgrades, and automated recovery and failover in case of issues. Adopting continuous deployment allows applications to be quickly brought up in several different environments.

Challenges in 2021: Adoption and continued use of DevOps and Agile practices, increased automation in the application deployment and maintenance process.

4. Revamping Application Infrastructure

Cost-cutting measures extended to infrastructure as well. Organizations chose to review their spending on dedicated hardware and move to the cloud (virtualization, microservice and container-based architectures). They also assessed the number of software solutions deployed in the enterprise to see if a switch to open-source was possible. Virtualization reduces the number of physical servers required in the enterprise, a container’s smaller infrastructure footprint provides a more secure and easier-to-configure environment, and microservices make applications easier to understand, develop and test. Thus, costs of maintaining applications can be significantly reduced.

Challenges in 2021: Deployment of virtualization management systems, evaluating software tool replacement options, co-sourcing environment management to cut costs.

5. Higher level of Automation for Maximum Uptime

Downtime in any business has a direct impact on brand value, customer experience and revenue. This fact was accentuated during the pandemic due to more businesses starting to transact digitally and seeing a marked increase in online traffic due to storefronts being shut. The primary challenge was to provide close to 24x7x365 uptime with reduced IT Operations personnel, something made possible only thanks to automation. Enterprises adopted AIOps solutions providing proactive incident detection and autonomous resolution capabilities coupled with ITSM integrations, so the entire ticketing process was completely automated without the need for human intervention.

Challenges in 2021: Adoption of Preventive Healing Solutions for zero-downtime and complete automation, scaling businesses intelligently for increased online presence and traffic.

Ways to Achieve IT Operations Cost Reduction

In this section, we see how some of the challenges mentioned above can be tackled by enterprises in 2021, to cut back on costs and recover losses sustained due to the pandemic in 2020.

Cutting back on non-discretionary costs

Part of the non-discretionary costs in an enterprise goes toward NOC personnel, who are mostly contractors. They are tasked with assessing dashboards for alerts and anomalies, escalating notifications up the chain of command, and overseeing the ticketing process. Intelligent tools for notification and escalation, AI/ML solutions with which alert storms can be reduced, and automation of ticketing workflows via ITSM integrations are desirable, so the contract workforce can be reduced. Further, some amount of in-house work can be shifted to consultants and contractors, so you only contract for the work you need, and the contractor/consultant is no longer on the payroll after the project completes.

Infrastructure as a Service for Intelligent Scaling

One area where costs can be cut for immediate returns is cloud infrastructure; cost savings in cloud services have a real and perceptible cash impact. Many organizations want to move more of their systems to virtualized ones but are still figuring out how to deal with configuration, provisioning, and overall management. If they deploy a virtualization management system to make this move a reality, they will enable faster adoption of cloud platforms.

Many enterprises opted for co-sourcing their environment management functions. In this arrangement, the enterprise would still have the systems in its environment, but another firm would run and operate them. These services can be far less expensive than hiring a cadre of full-time IT Operations personnel and provide the added advantage of having the right talent managing the environment with technical know-how and service guarantees in place.

Maintenance and management cost savings occur in the following areas when you move in-house IT servers and services to the cloud:

Moving to the cloud reduces capital expenditures for servers and related network equipment, transforming one-time capital costs to monthly operating expenses.

Labor costs are handled by the cloud provider on an as-needed basis, instead of using dedicated staff.

Cloud providers can provision additional resources like disk space, CPU, memory, and communication lines faster and cheaper than an enterprise would be able to for on-premise servers and infrastructure.

Cloud providers can dynamically provision temporary capacity increases as service demand peaks. With on-premise servers, SRE teams need to plan capacity to handle peak demand.

Accurate provisioning of resources in the cloud – whether it is virtual machines, cloud storage or containerized services – is important to provide intelligent scaling for the business. To this end, enterprises can incorporate a workload trend-based capacity forecasting tool that can highlight under-provisioned and overprovisioned resources by running a what-if analysis on projected transaction volumes. This can help avoid a lot of headache in terms of unexpected outages due to resource crunches and bottlenecks, or unnecessary expenditure on cloud services.
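To make the idea of a trend-based what-if analysis concrete, here is a minimal sketch in Python. It is not how any particular capacity forecasting tool works. It assumes a plain linear trend over equally spaced historical transaction volumes, and the `Resource` type, the `capacity_tps` figure and the 30%/80% utilization thresholds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A provisioned tier (hypothetical) and the peak load it can sustain."""
    name: str
    capacity_tps: float  # transactions per second; illustrative figure


def linear_forecast(history, periods_ahead):
    """Fit a least-squares line to equally spaced samples and extrapolate."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two historical samples")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


def what_if(history, periods_ahead, growth_factor, resources,
            low=0.30, high=0.80):
    """Project volume, apply a what-if multiplier, and classify each tier.

    Assumes every tier must carry the full projected volume. The low/high
    utilization thresholds are assumptions, not industry standards.
    """
    projected = linear_forecast(history, periods_ahead) * growth_factor
    report = {}
    for res in resources:
        utilization = projected / res.capacity_tps
        if utilization > high:
            status = "under-provisioned"
        elif utilization < low:
            status = "over-provisioned"
        else:
            status = "ok"
        report[res.name] = (round(utilization, 2), status)
    return report


if __name__ == "__main__":
    monthly_peak_tps = [100, 110, 120, 130]  # made-up history
    tiers = [Resource("web", 200), Resource("db", 1000), Resource("cache", 400)]
    # What if traffic two months out is 20% above the trend line?
    for name, (util, status) in what_if(monthly_peak_tps, 2, 1.2, tiers).items():
        print(f"{name}: projected utilization {util:.0%} -> {status}")
```

A real tool would likely use seasonality-aware models and per-resource metrics rather than a single throughput number, but the shape of the analysis is the same: project the workload, vary the assumption, and compare against what has been provisioned.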

Evaluating Software Tool Licenses

In any enterprise today with a hybrid digital infrastructure in place, multiple AIOps and monitoring tools are required to collect telemetry data across silos in different environments. However, the cost of onboarding any new tool includes hidden license expenses as well, and 100% integration with the existing toolset is required to make the incorporation of the new tool as seamless, cost-effective, and pain-free as possible.

Several new vendors are offering many of the same capabilities as enterprise providers. An enterprise may be able to purchase new software and implement it for less than the cost of annual maintenance for its current vendor. In most organizations, there are software tools performing duplicate functions. This could include polling to monitor performance, faults, or other key functions. Eliminating vendors will reduce the annual maintenance bill and reduce staff time to keep the systems up and running.

Moving Toward 100% Automation

Automation replaces labor costs with software and configuration costs. DevOps enables collaboration, integration, and automation in otherwise siloed IT environments. By anticipating and reacting to failures faster and working to deliver feature improvements to end-users on a continuous basis, organizations spend less on deployment hurdles and incident resolution, and devote more resources to running the business.

Dedicated automation architects can ensure that DevOps and Agile practices are implemented across the enterprise. Data Center Automation (DCA) reduces the need for manual configuration and processing. Part of this also includes autonomous resolution of incidents and automation of ticketing workflows. Dynamic workload optimization, dynamic resource provisioning and proactive identification of hotspots lead to the mitigation of issues before they even occur, thereby allowing a smaller team to run a more efficient and error-free data center.

Reducing MTTR on Unpreventable Issues

Suppression of alert storms and false positives as well as predictive alerting can be achieved through AI/ML techniques, and preventive healing can help nip issues in the bud before they occur, leading to what is nowadays being referred to as “negative MTTR”. However, some issues may still slip through the cracks due to sudden network or storage outages, hardware glitches or third-party dependencies like APIs and payment gateways being unavailable. In such cases, accelerated root cause analysis with event correlations and suggestions on where the error originated can significantly reduce MTTR and the number of personnel required to bring an issue to a satisfactory closure, ultimately leading to cost savings.

Such root cause analysis is aided by supplementing incident details with all time-synchronized contextual and forensic data available at hand. This could include logs, diagnostic data, business error codes, recent configuration changes, query-level statistics from the database as well as code level tracing and instrumentation. In the hands of a skilled IT Operations analyst, this data proves invaluable in establishing the chain of causation and bringing the incident to a logical end as soon as possible.
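As a rough illustration of alert suppression and event correlation, the Python sketch below collapses repeated alerts inside a time window and then uses a service dependency map to suggest where the error may have originated. It is a simple rule-based stand-in, not an AI/ML technique and not any vendor's algorithm. The `Alert` type, the component names, the 60-second window and the `depends_on` topology are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    ts: float        # seconds since some reference time
    component: str   # hypothetical service name
    message: str


def suppress_duplicates(alerts, window_s=60):
    """Keep one alert per (component, message) within each window.

    The window is measured from the last alert that was kept, so a
    continuous storm produces one alert per window rather than none.
    """
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        key = (alert.component, alert.message)
        if key in last_kept and alert.ts - last_kept[key] < window_s:
            continue  # repeat inside the window: suppress
        last_kept[key] = alert.ts
        kept.append(alert)
    return kept


def root_cause_candidates(alerting, depends_on):
    """Return alerting components with no alerting dependency beneath them.

    depends_on maps a component to the components it calls. A component
    whose own dependencies are also alerting is treated as a likely
    symptom rather than a cause.
    """
    alerting = set(alerting)

    def has_alerting_dependency(component, seen):
        for dep in depends_on.get(component, ()):
            if dep in seen:
                continue  # guard against cycles in the map
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return sorted(c for c in alerting
                  if not has_alerting_dependency(c, {c}))


if __name__ == "__main__":
    storm = [
        Alert(0, "checkout", "HTTP 500 rate high"),
        Alert(5, "db", "connection pool exhausted"),
        Alert(8, "payments-api", "timeout"),
        Alert(10, "checkout", "HTTP 500 rate high"),
        Alert(30, "checkout", "HTTP 500 rate high"),
    ]
    topology = {"checkout": ["payments-api", "db"], "payments-api": ["db"]}
    kept = suppress_duplicates(storm)
    print(len(storm), "alerts reduced to", len(kept))
    print("look first at:", root_cause_candidates({a.component for a in kept}, topology))
```

The candidate list is only a starting point for the analyst; the contextual and forensic data described above is still needed to confirm the chain of causation.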

Integrations for Ticketing, Visualization, Notification and Collaboration

2020 accentuated the need for remote work and collaboration among endpoints spread across geographies. In 2021, there will be increased focus on the sustainability of such a model, and the ability to secure and manage large populations of remote devices without overtaxing IT resources. Endpoints need to be resilient and capable of self-healing to minimize the need for IT intervention.

Other important integrations are needed to ease the life of the IT Operations personnel, who are themselves tasked with keeping remote systems for the entire enterprise up and running. The “new normal” needs to deliver more agile support for remote workers and this is putting an increasing burden on ITOps teams. Rex McMillan, Principal Product Manager at Ivanti, states that their recent research shows that for 63% of IT professionals, IT workloads have increased 37% since going remote. Hyper automation is the only way for IT to scale and handle the additional challenges. Integrating notification and ticketing platforms can save substantial time and is well worth the investment, as staff will not have to correlate information from multiple systems to determine what the true problem is and then report in another system.

Preventive Healing – A New Paradigm for IT Ops Automation

Preventive healing solutions adopt patented techniques that map an application’s workload to its underlying behavior and learn these workload-behavior correlations over a period so they can flag anomalous transaction patterns or behavioral metrics ahead of time. This helps in true predictive detection of issues before they even occur and allows for remedial steps to be put in place so the issue can be averted. Some modes of preventive healing include dynamically optimizing or shaping the workload so the underlying system behavior remains unaffected, dynamically provisioning additional resources in cloud environments so the system can handle workload surges, or projecting resource requirements based on a what-if analysis of future workload trends, so businesses can perform app-aware scaling.
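The patented techniques themselves are not described here, so the following Python sketch only illustrates the general idea of learning a workload-behavior correlation and flagging deviations from it. It substitutes an ordinary least-squares line relating transaction rate to CPU utilization, with a residual-based threshold. The sample data, the function names and the three-sigma cutoff are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_baseline(workload, metric):
    """Learn how a behavior metric tracks workload from paired history.

    Returns (slope, intercept, sigma), where sigma is the spread of the
    historical residuals around the fitted line.
    """
    if len(workload) != len(metric) or len(workload) < 2:
        raise ValueError("need two or more paired samples")
    mx, my = mean(workload), mean(metric)
    sxx = sum((x - mx) ** 2 for x in workload)
    if sxx == 0:
        raise ValueError("workload samples must not all be identical")
    slope = sum((x - mx) * (y - my) for x, y in zip(workload, metric)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(workload, metric)]
    return slope, intercept, pstdev(residuals)


def is_anomalous(baseline, workload_now, metric_now, k=3.0):
    """Flag a reading that deviates from what this workload normally causes."""
    slope, intercept, sigma = baseline
    expected = intercept + slope * workload_now
    # Floor sigma so a perfectly linear history does not flag everything.
    return abs(metric_now - expected) > k * max(sigma, 1e-9)


if __name__ == "__main__":
    tps_history = [100, 200, 300, 400, 500]   # made-up transactions/sec
    cpu_history = [21, 39, 61, 79, 100]       # made-up CPU utilization (%)
    baseline = fit_baseline(tps_history, cpu_history)

    # High CPU under heavy load matches the learned relationship.
    print("500 tps, 99% CPU anomalous?", is_anomalous(baseline, 500, 99))
    # The same kind of reading under moderate load does not.
    print("300 tps, 90% CPU anomalous?", is_anomalous(baseline, 300, 90))
```

The point of tying the expected value to the workload is that a static threshold would either fire on every legitimate traffic peak or miss a problem that shows up at moderate load. A deviation flagged this way could then trigger one of the remedial modes described above, such as workload shaping or provisioning additional resources.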