Metrics only tell part of the story

By Tim McGraw, Loom Systems.

Friday, 29th September 2017. Posted by Phil Alsop.
The systems I am responsible for easily number several dozen, and that responsibility covers their security, performance, stability and, in some cases, development. Some are proprietary applications, some internal and some public facing, including several mobile apps on various mobile OS platforms. We manage this with a team of four people. If that number makes you gasp, I imagine you have a comparable story of your own to tell. As unnerving as that environment may be, my day-to-day challenges (performance monitoring, quality assurance and security) are not insurmountable. The answers are in the logs, and the right metrics can bring them out. Or that is what I thought, until I learned that I could do more.

 

The Need for Metrics

I have been fortunate enough to work for organisations with a rapid growth curve, and that often means that growing needs outstrip tangible resources. That is what being a startup is all about. Metrics help to bridge that gap, because they let a small team see the state of its systems quickly and efficiently.

 

Representing these metrics in a clear dashboard is valuable, yet I still need help sifting through the vast volume of data produced by my systems and filtering out the noise. Even with dashboards in place, I cannot help asking myself whether I have missed something. The IT world is a rapidly changing landscape and threatscape, and in this environment one small oversight could be disastrous. I have long since given up fighting accusations of being paranoid, and now embrace it.

 

 

Tell the Right Story

The point is that it is not enough simply to have a handle on the real-time performance of our environment: we need to be able to predict and prevent failures, and identify the root cause of issues as quickly as possible. Metrics can show when a particular data point fails, but nothing more. To know that a failure is about to happen, how quickly it is approaching, or what else it is affecting, you have to have already set up other gauges; and even if you have, they do not always indicate a root cause, and they cannot draw correlations for you.
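To make the difference concrete, here is a minimal Python sketch contrasting a plain threshold gauge with a simple trend projection. It is an illustration only: the function names, the straight-line trend and the sample figures are assumptions made for this example, not something taken from a particular monitoring product.

```python
from typing import Optional, Sequence


def breaches_threshold(value: float, limit: float) -> bool:
    """A classic gauge: it only reports that the limit has already been crossed."""
    return value >= limit


def samples_until_limit(samples: Sequence[float], limit: float) -> Optional[float]:
    """Fit a least-squares line through evenly spaced samples and project
    how many sampling intervals remain, after the last sample, before the
    line reaches the limit.

    Returns None when there are too few samples or the trend is flat or falling.
    Assumption: growth is roughly linear, which real workloads often are not.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (limit - intercept) / slope  # sample index where the line hits the limit
    return max(0.0, crossing - (n - 1))


# Hypothetical disk usage readings, in percent, taken once an hour.
disk_usage = [50, 55, 60, 65, 70]
print(breaches_threshold(disk_usage[-1], 90))  # the gauge is still green
print(samples_until_limit(disk_usage, 90))     # the projection says how long is left
```

The gauge stays silent until the limit is hit, while the projection at least estimates how much time is left. Neither one explains why the disk is filling up, which is the gap described above.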

 

There is No Way to Read It All Myself

What I quickly found with metrics-based monitoring is that I spent a lot of time creating those sexy dashboards for myself and my colleagues, but the simple truth is that I do not even know everything I need to look for. It is not possible to anticipate issues such as WannaCry and Heartbleed, and they are but two alarming examples. New security alerts are issued on a daily basis, and that does not even begin to address my internal monitoring needs. The sheer volume of logs produced by my systems means I would need a mammoth team to read and analyse them all – which is not possible on a small budget!

 

I Need Artificial Intelligence (AI)

What I realised I needed was an automated or AI-powered system that can read through everything and tell me when there is a meaningful change to my environment. No human, with any amount of experience, can read and process the volume of logs our systems are producing, even for a simple workstation. Automated monitoring gives me full visibility into my IT environment, which means I can see every time one of my systems does or experiences something out of the ordinary, and it enables me to act before my users are affected. An AI-powered solution can also remind me of insights and recommendations that were relevant the last time my systems encountered the same situation.
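As a rough idea of what "reading everything and reporting meaningful change" can mean at its very simplest, the Python sketch below reduces log lines to templates and compares a current window against a baseline. It is a toy under stated assumptions, not a description of how any AI-powered product works: the names template and unusual_templates, the spike factor and the sample log lines are all invented for this example, and both windows are assumed to cover the same length of time.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Numbers, dotted values such as IP addresses, and hex ids vary from line to
# line, so they are masked to leave the stable "shape" of the message.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def template(line: str) -> str:
    """Replace variable-looking tokens with a placeholder."""
    return _VARIABLE.sub("<*>", line.strip())


def unusual_templates(
    baseline: Iterable[str],
    current: Iterable[str],
    spike_factor: float = 3.0,
) -> Dict[str, str]:
    """Report templates that are new, or at least spike_factor times more
    frequent than in the baseline window."""
    seen = Counter(template(line) for line in baseline)
    now = Counter(template(line) for line in current)
    findings: Dict[str, str] = {}
    for tpl, count in now.items():
        if tpl not in seen:
            findings[tpl] = "new"
        elif count >= spike_factor * seen[tpl]:
            findings[tpl] = "spike"
    return findings


# Hypothetical log lines, for illustration only.
baseline = [
    "user 42 logged in from 10.0.0.5",
    "user 7 logged in from 10.0.0.9",
    "cache flushed in 12 ms",
]
current = [
    "user 9 logged in from 10.0.0.1",
    "cache flushed in 15 ms",
    "cache flushed in 11 ms",
    "cache flushed in 14 ms",
    "segfault at 0x7ffdeadbeef",
]
for tpl, reason in unusual_templates(baseline, current).items():
    print(reason, "->", tpl)
```

Nobody had to define a gauge for the segfault in advance; it surfaces simply because that kind of message had not been seen before. A real system would need far more than this, such as learning what normal looks like over time and correlating events across systems.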

 

Not so long ago, I would have thought this kind of knowledge, edging toward wisdom, was an impossible ask, even if I had the budget. I am very pleased to be proven wrong.