Imagine a situation where a high-speed passenger train is approaching a rail bridge, but the bridge track is broken. What would your reaction be? Shouldn't someone have inspected the track and acted before the incident happened?
The same scenario applies in our business environments today, where IT systems are the railway track on which the train of business runs. Shouldn't someone be looking at issues in IT before the business gets impacted?
The answer is simple – we must have a robust monitoring solution in place!
In this article I will try to answer a few questions which come up when one thinks about monitoring.
What is monitoring?
In this context, we define monitoring as ‘Continuous or periodic observation of IT systems, IT processes and business processes to identify issues which can potentially cause business disruption. Monitoring includes both identifying issues and alerting the right resolution group for resolution.’
Monitoring can be done using monitoring tools or even manually. For example, a monitoring tool can measure the loading time of a website every 5 minutes, or people can log in at periodic intervals and measure the loading time by hand. Irrespective of the method, when the loading time breaches the threshold it must be communicated to the respective groups and resolved.
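The load-time check above can be sketched in a few lines of Python. This is a minimal illustration only: the URL, the 2-second threshold and the print-based alert are placeholder assumptions, and a real deployment would run the check on a scheduler and notify the resolution group through email, a ticketing tool or a pager.

```python
import time
import urllib.request

# Hypothetical values for illustration -- real URLs and thresholds
# would come from your own service-level targets.
URL = "https://example.com"
THRESHOLD_SECONDS = 2.0

def measure_load_time(url: str) -> float:
    """Fetch the page once and return the elapsed time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()
    return time.monotonic() - start

def breaches_threshold(load_time: float, threshold: float) -> bool:
    """True when the measured load time exceeds the agreed threshold."""
    return load_time > threshold

def alert(message: str) -> None:
    # Stand-in for notifying the resolution group (email, ticket, pager).
    print(f"ALERT: {message}")

# A scheduler (cron, or a loop with time.sleep(300)) would call this
# every 5 minutes, as in the example above.
def check_once() -> None:
    load_time = measure_load_time(URL)
    if breaches_threshold(load_time, THRESHOLD_SECONDS):
        alert(f"{URL} took {load_time:.2f}s (threshold {THRESHOLD_SECONDS}s)")
```

Whether the measurement is taken by a tool or by a person, the decision logic is the same: compare against the threshold and escalate on breach.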
The objective is to ensure that any IT issue is identified and resolved before there is a business impact. Traditionally, organizations have had a command center for monitoring with many screens, where people look at dashboards and coordinate resolution. However, in recent times there have been advancements in the IT operations space which minimize the manual effort required to go from alert to action. I have listed a few of these trends later in this article.
Why do we need monitoring?
The primary value proposition of monitoring is to reduce business downtime and increase IT system health and operational efficiency.
- According to Gartner, the average cost of IT downtime is $5,600 per minute!
- Another survey, by Information Technology Intelligence Consulting Research, gives the following figures:
  - 98% of organizations say a single hour of downtime costs over $100,000
  - 81% of respondents indicated that 60 minutes of downtime costs their business over $300,000
  - 33% of those enterprises reported that one hour of downtime costs their firms $1-5 million
Apart from revenue or operational loss, business downtime also hurts customer satisfaction and the brand value of the company. All of these impacts can be greatly reduced by having an effective monitoring system in place.
What components should be considered?
Here are the 4 key components to consider while setting up monitoring –
- Layer: Monitoring can be deployed at different layers such as Network, Infrastructure, Platform, Application, IT process and Business process. It is good to have a clear understanding of the scope of deployment, collaboration and governance, as we often need to work with multiple partners to enable all layers of monitoring.
- Aspect: Monitoring can be done from an end-user perspective or from an IT operations team perspective. These perspectives decide what aspects need to be monitored in each layer. E.g. we can monitor the availability and performance of each layer from different angles.
- Level: We can have monitoring at 4 levels: predictive (predicting the probability of an issue), proactive (before an issue happens), real time (as soon as an issue happens) and reactive (after an issue happens). As we move from reactive to predictive there is a trade-off between the cost of monitoring and the cost of an issue – we must strike a balance when choosing the level for different parameters.
- Metrics: “What gets measured gets improved” – Everyone!
Metrics generally fall into two groups: high-level metrics and low-level metrics. Monitoring is deployed on low-level metrics and viewed on both high-level and low-level metrics.
High-level metrics are the tactical metrics which are used to measure the service levels. These metrics are used for reporting to external stakeholders, E.g. System Availability.
Low-level metrics are the operational or causal metrics whose movement ultimately impacts the high-level metrics. These metrics help in operational improvement and in identifying root causes. There should be a top-down approach to derive low-level metrics from high-level metrics, so that every monitoring point stays relevant.
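The relationship between the two kinds of metrics can be sketched as follows: a high-level metric such as System Availability is computed from many low-level probe results. The data shape below is a hypothetical illustration; real observations would come from your monitoring tool's API or database.

```python
# Each low-level observation is one probe result: did the component respond?
# Hypothetical sample data for illustration.
low_level_checks = [
    {"component": "web", "up": True},
    {"component": "web", "up": True},
    {"component": "db",  "up": False},
    {"component": "web", "up": True},
]

def system_availability(checks) -> float:
    """High-level metric: percentage of low-level checks that passed."""
    if not checks:
        return 0.0
    up = sum(1 for c in checks if c["up"])
    return 100.0 * up / len(checks)

print(system_availability(low_level_checks))  # 75.0
```

Working top-down, you would start from the availability target reported to stakeholders and then decide which component-level checks (web, database, network and so on) must feed into it.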
How to deploy effective and optimal monitoring?
Deploying effective monitoring is a highly collaborative activity between the monitoring service provider and the IT and Business stakeholders. Though there are standard metrics to be deployed, the effectiveness comes only when the metrics for a scenario are defined and the right level of criticality is assigned.
It takes monitoring experts, robust processes and effective tools to deliver a monitoring solution. Setting aside budget and time for monitoring, and involving the right teams, is the key to effectiveness.
After setting up monitoring there must be continuous improvement on the deployed solution and value articulation to ensure that the monitoring solution is relevant and delivering value to the organization.
Trends to watch out for!
- Correlation and causation between alerts – Alert correlation helps support teams find relationships between alerts and get to the root cause quickly, which in turn reduces Mean Time to Resolution (MTTR).
- Aggregation of metrics – In the current scenario, with many technologies and multiple partners supporting an IT landscape, it is inevitable to have different monitoring tools for different purposes. Aggregating metrics from multiple tools helps application and business owners get a holistic view of their applications irrespective of the underlying technologies.
- Robotic Process Automation (RPA) – There is a huge opportunity in the IT operations space to automate the process from alert to action using RPA, which can help us move towards autonomous systems.
- AIOps – AIOps stands for Artificial Intelligence for IT Operations. According to Gartner, monitoring is a key part of AIOps, which is powered by big data and machine learning. It refers to multi-layered technology platforms that automate and enhance IT operations by using analytics and machine learning to analyze big data collected from various IT operations tools and devices, to automatically spot and react to issues in real time.
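The alert-correlation trend above can be illustrated with a simple time-window grouping: alerts that fire close together are more likely symptoms of a single underlying cause. This is a minimal sketch assuming alerts arrive as timestamped records; real correlation engines use topology, text similarity and machine learning rather than time alone.

```python
from datetime import datetime, timedelta

def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts whose timestamps fall within `window` of the previous
    alert in the group -- a crude proxy for 'these probably share a root
    cause', which lets one resolver work the whole group at once."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if groups and alert["time"] - groups[-1][-1]["time"] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups

# Hypothetical alert stream: a morning burst of two related alerts,
# then an unrelated alert hours later.
alerts = [
    {"name": "db_cpu_high",      "time": datetime(2024, 1, 1, 9, 0)},
    {"name": "app_latency_high", "time": datetime(2024, 1, 1, 9, 2)},
    {"name": "disk_full",        "time": datetime(2024, 1, 1, 13, 0)},
]

print([len(g) for g in correlate(alerts)])  # [2, 1]
```

Instead of three separate tickets, the support team sees two incidents, and the paired alerts point them towards a shared root cause.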
In my current organization, we are on an exciting journey to make the best use of monitoring and live by the age-old proverb ‘Prevention is better than cure’!
By our guest blogger
Jones S., Associate General Manager, Enterprise IT
At Schneider Electric, Jones works in the IT monitoring service line and is responsible for managing the program plan, incubating new technologies and driving the platform maturity journey. Earlier he worked on digitization and analytics programs in the supply chain domain. He holds a Master of Business Administration and a Bachelor of Engineering degree, apart from various industry certifications. He is passionate about business transformation and social upliftment using technology.