Organizations struggle constantly to keep their IT infrastructure and applications up and running and to minimize unplanned downtime. This has become even more challenging since the adoption of modern architectures that include microservices, containerization, and hybrid-cloud deployments.
Service outages are becoming increasingly expensive, with nearly half of them costing organizations over $100,000 for a single outage. Given the huge volume of computer logs and metrics generated by today’s applications, and the increasing complexity of modern IT infrastructure, data centers require advanced monitoring technologies and capabilities. These could include distributed storage and processing of the raw logs and metrics, as well as machine learning techniques for understanding patterns buried amid the noise.
To cope with this huge amount of collected data, a methodology known as “data lake” has emerged in the last decade. The data lake methodology also comes with its own challenges. While unplanned downtime and outages are inevitable, organizations invest significant effort in keeping them to a minimum, using a variety of techniques. This is only part of the process that redefined the IT monitoring and observability space and led to an increased reliance on AI- and ML-powered IT operations (AIOps).
This survey is intended to present the market’s reaction to currently available AIOps tools and platforms, and to gauge what advances will be necessary to realize the “self-healing datacenter”.
In a survey conducted by Sensai with over 20 AIOps domain experts during 2020 and 2021, we focused on the following topics:
- Widely used monitoring goals and practices
- Widely used AIOps tools and platforms
- Challenges encountered while using these AIOps tools and platforms
Because the IT infrastructure layer has matured and now offers high redundancy and resiliency mechanisms, SRE groups are moving their focus toward the application layer, using application performance monitoring (APM) platforms.
The data center monitoring market is crowded with legacy as well as new-generation AIOps platforms that leverage AI and ML technologies to provide greater data center visibility and transparency, leading toward higher datacenter resiliency and uptime.
Gartner’s market guide for AIOps
Two of the major areas under focus when monitoring the datacenter are performance & availability.
Performance is measured by the following major KPIs:
- CPU utilization
AIOps platform – The Major Challenges
Following the deep-dive discussions and survey that we conducted with AIOps platform users, we have identified the following major challenges:
The anomaly detection method most commonly used by SREs is threshold-based monitoring:
- Setting sophisticated thresholds is one of the most daunting tasks (especially when dealing with seasonality aspects of the collected metrics). In some cases, this is described as a bigger challenge than Root Cause Analysis (RCA).
- Static thresholds, which are the most widely used, introduce a high level of false alarms (a poor signal-to-noise ratio) due to the agile, dynamic nature of the modern datacenter, especially when monitoring container-based platforms.
- Therefore, most of the SREs perform anomaly detection using a mixture of static and dynamic (baseline-based) thresholds in addition to complex rules-based thresholds, providing a partial solution for this problem.
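The difference between static and dynamic (baseline-based) thresholds can be sketched in a few lines of Python. This is a minimal illustration and not how any particular AIOps tool works. The function names, the window size, the multiplier `k`, and the sample CPU values are all assumptions made for the example.

```python
from statistics import mean, stdev

def static_alerts(values, limit):
    """Return the indices where the metric exceeds a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]

def baseline_alerts(values, window=5, k=3.0):
    """Return the indices where a value deviates by more than k standard
    deviations from the mean of the preceding `window` samples."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Guard against a perfectly flat history, where sigma is 0.
        if abs(values[i] - mu) > k * max(sigma, 1e-9):
            alerts.append(i)
    return alerts

# Hypothetical CPU utilization samples (percent).
spike = [40, 42, 41, 43, 42, 41, 85, 42, 43, 41]   # one sudden spike
ramp = [50, 55, 60, 65, 70, 75, 80, 85, 90]        # steady, expected growth

print(static_alerts(spike, limit=80), baseline_alerts(spike))
print(static_alerts(ramp, limit=80), baseline_alerts(ramp))
```

With these inputs, both methods flag the spike, but only the static threshold fires on the steady ramp, because the rolling baseline adapts to gradual change. A simple rolling baseline like this one has no notion of seasonality, which is part of why mixed approaches remain only a partial solution.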
Relationship between quality and risk
Correlation between different monitoring methods – metrics, logs, and traces – is a major challenge for data center monitoring managers. Quality correlation analyses reduce alert fatigue by filtering irrelevant anomalies and aggregating correlated or similar anomalies into a single alert.
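As a rough illustration of aggregating correlated anomalies into a single alert, the sketch below groups anomalies from metrics, logs, and traces when they concern the same service and occur close together in time. The `Anomaly` fields, the 60-second window, and the grouping rule are assumptions made for this example. Real correlation engines typically also use topology and statistical similarity.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    timestamp: float  # seconds since some reference point
    service: str      # hypothetical service name
    source: str       # "metric", "log", or "trace"
    message: str

def group_anomalies(anomalies, window_s=60.0):
    """Group anomalies of the same service when each one occurs within
    window_s seconds of the previous anomaly in its group."""
    groups = []
    open_group = {}  # service -> most recent group for that service
    for a in sorted(anomalies, key=lambda a: a.timestamp):
        current = open_group.get(a.service)
        if current and a.timestamp - current[-1].timestamp <= window_s:
            current.append(a)
        else:
            current = [a]
            open_group[a.service] = current
            groups.append(current)
    return groups

events = [
    Anomaly(0, "checkout", "metric", "latency high"),
    Anomaly(20, "checkout", "log", "timeout errors"),
    Anomaly(45, "checkout", "trace", "slow span"),
    Anomaly(30, "search", "metric", "cpu high"),
    Anomaly(500, "checkout", "metric", "latency high"),
]

# Each group would be raised as one alert instead of one alert per anomaly.
for group in group_anomalies(events):
    sources = sorted({a.source for a in group})
    print(group[0].service, len(group), sources)
```

Here the five anomalies collapse into three alerts, and the first alert carries evidence from all three monitoring methods.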
Cross-layer monitoring represents a unified, holistic monitoring paradigm for all entity layers: the IT infrastructure layer, the application layer, and the business layer.
One of the major issues in APM platform deployments is encountered when the different layers are monitored as segregated silos. As a result, specific areas and use cases remain without coverage, resulting in monitoring “black holes”.
This leads to a common conclusion that was expressed by the majority of the AIOps users – that true, end-to-end, full-stack monitoring is not really available today.
Data center monitored layers
- Very few of the available tools provide an AI-based engine accompanied by an automatic and autonomous RCA solution.
- Most of the advanced tools in the market are based on a correlation-based engine which does not include an RCA capability.
- Therefore, in most organizations, RCA is handled manually by the NOC staff (and is usually performed as a top-down process). In some cases, the process is initiated following an anomaly in the business KPIs.
Full cycle automation is not available in most of the tools.
Many of the tools require logging in to, and manually switching between, several of the vendor’s tools in order to:
- Cover the whole process from detection of the anomaly to RCA, mitigation, and prediction. One tool performs the anomaly detection, a second tool, which you must switch to manually, provides the RCA, and a third tool suggests the recommended mitigation.
- Get full-stack visibility of all the layers involved in the anomaly. One tool provides visibility at the DC infrastructure and network layer, while another tool provides application-level visibility.
Many enterprise applications, such as Outlook, Salesforce, and SAP, are challenging to monitor. Usually, these applications are treated as a “black box” since they provide few (if any) monitoring metrics.
For example, Salesforce, as one of these enterprise-level tools, does not provide sufficient metrics and does not facilitate the installation of agents that would enable higher levels of visibility.
Most datacenters have evolved into hybrid deployments, containing legacy on-prem deployments (in many cases due to security requirements, as dictated by financial regulations) as well as cloud-based deployments.
In such deployments, many of the services are hybrid services, in which a service is initiated on-prem, and completed in the cloud – or vice versa.
Monitoring hybrid services is a challenging task, as most data center managers use the same monitoring solutions for both the on-prem and cloud environments – as many of them do not see any major difference between the two with regards to application behavior and monitoring.
Some of the managers did not see significant added value in cloud providers’ monitoring tools, such as Azure Monitor and Amazon CloudWatch, over that of the monitoring-oriented tools they already use.
However, it was noted that most of the tools that work well in the public cloud environment do not work well in private cloud deployments.
Given the possibility that services migrated to the cloud may undergo performance degradation, it is crucial to perform “migration to cloud” benchmark tests, in which the datacenter manager compares the performance (and other metrics) of the service which they plan to migrate – between on-prem and cloud.
This is required in order to make sure that the performance (and/or any other important metric) of the critical service is not going to be negatively impacted following the migration to the cloud.
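A “migration to cloud” benchmark comparison of this kind might be sketched as follows. The choice of 95th-percentile latency as the metric, the 10% tolerance, and the synthetic sample data are assumptions for illustration. The survey does not prescribe a specific metric or threshold.

```python
from statistics import quantiles

def p95(samples):
    """95th percentile: quantiles(n=20) returns 19 cut points, and the
    last one is the 95% cut point."""
    return quantiles(samples, n=20)[-1]

def compare_migration(on_prem_ms, cloud_ms, tolerance=0.10):
    """Compare p95 latency before and after migration, and flag a
    regression if the cloud value is worse by more than `tolerance`."""
    base, candidate = p95(on_prem_ms), p95(cloud_ms)
    change = (candidate - base) / base
    return {
        "on_prem_p95": base,
        "cloud_p95": candidate,
        "change": change,
        "regressed": change > tolerance,
    }

# Synthetic latency samples in milliseconds (hypothetical benchmark runs).
on_prem = list(range(100, 200))
cloud = [x * 1.5 for x in on_prem]

report = compare_migration(on_prem, cloud)
print(report)
```

The same comparison can be repeated for any other metric considered critical for the service, such as error rate or throughput, before the migration is approved.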
Logs are an important factor in the anomaly detection and analysis process. However, many SREs still perform manual log analyses with the assistance of problem domain experts, analyzing the huge volumes of logs that are generated by the datacenter entities.
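One common first step toward reducing manual log analysis is to collapse raw log lines into templates, so that repeated messages can be counted rather than read one by one. The sketch below is a minimal illustration that assumes a simple regular-expression heuristic and hypothetical log lines. Production log parsers are considerably more elaborate.

```python
import re
from collections import Counter

# Treat hexadecimal values, integers, and dotted numbers (such as IP
# addresses or version strings) as variable parts of a message.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")

def template_of(line):
    """Replace the variable tokens in a log line with a placeholder."""
    return _VARIABLE.sub("<*>", line)

def template_counts(lines):
    """Count how many log lines share each template."""
    return Counter(template_of(line) for line in lines)

logs = [
    "Connection to 10.0.0.5 timed out after 30 s",
    "Connection to 10.0.0.9 timed out after 45 s",
    "Disk usage at 91 percent",
]

for template, count in template_counts(logs).most_common():
    print(count, template)
```

A sudden rise in the count of one template, or the appearance of a template never seen before, is then a candidate anomaly for a domain expert to review.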
The monitoring of container-based applications is often challenged by the agility and flexibility of the container-based platforms. One of the challenges is the need to reflect frequently changing application topologies, for which CMDB implementations are no longer relevant or applicable.
As of this writing, most SREs and monitoring managers have not yet developed a sufficient level of trust in available AI-based tools, for the following reasons:
- According to the users, the machine learning mechanisms are not transparent enough, providing little or no insight into the algorithms used.
- Alert fatigue is still an issue, as many of the alerts generated are not relevant – for example, false alerts, or low priority alerts that mask the higher priority alerts.
- This is especially true for the alerts that are generated just after the machine learning is deployed, when the ML mechanism is not yet familiar enough with the behavior of the system and its history.
- Not all tools mask alerts until the system and the ML models are fully familiar with the datacenter behavior.
- Baseline-based thresholds (sometimes called “intelligent thresholds”) do not function well on most tools:
- Baseline thresholds work well on “well-behaved sine curves”, but do not work well with spikes or with use cases in which there is insufficient traffic/activity to correctly learn system behavior. Common baseline-based models do not recognize seasonality well enough, especially when dealing with multi-tenant use cases that feature different seasonality behavior for each user.
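Two of the points above, masking alerts until enough history has been learned and keeping a separate seasonal baseline per tenant, can be sketched together. The class name, the `(tenant, hour)` key, the `min_samples` warm-up rule, and the sample values are assumptions for illustration, not a description of any specific product.

```python
from collections import defaultdict
from statistics import mean, stdev

class SeasonalBaseline:
    """Keeps a separate baseline per key, for example (tenant, hour of day),
    and stays silent for a key until it has seen min_samples values."""

    def __init__(self, k=3.0, min_samples=5):
        self.k = k
        self.min_samples = min_samples
        self.history = defaultdict(list)  # key -> values learned so far

    def observe(self, key, value):
        """Return True if value is anomalous for this key, and learn it
        otherwise."""
        past = self.history[key]
        anomalous = False
        if len(past) >= self.min_samples:  # warm-up: no alerts before this
            mu, sigma = mean(past), stdev(past)
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        if not anomalous:
            past.append(value)  # design choice: outliers are not learned
        return anomalous

baseline = SeasonalBaseline()
# Hypothetical request rates: tenant A is quiet at 03:00, tenant B is busy.
for day in range(5):
    baseline.observe(("tenant-a", 3), 10 + day % 2)
    baseline.observe(("tenant-b", 3), 100 + day % 2)

print(baseline.observe(("tenant-a", 3), 100))  # unusual for tenant A
print(baseline.observe(("tenant-b", 3), 100))  # normal for tenant B
```

The same value is anomalous for one tenant and normal for the other, which a single shared baseline cannot express. Excluding outliers from the learned history keeps spikes from inflating the baseline, at the cost of adapting slowly to a genuine, permanent change in level.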
Many users encounter user experience and usability challenges once they need to change the default configurations set by the monitoring platform. Feedback commonly provided by AIOps platform users includes:
- You can control any metric or parameter, but the learning curve is steep, long, and very complex.
- Complex anomaly rules are hard to configure. In some cases, it took the tool vendor’s sales engineer 4 days to configure the complex rules required by the customer.
- Some of the tools are simple to operate when using the default settings. When you want to deviate from the defaults, they become too complex and/or cumbersome, causing usability problems for non-technical staff members.
- In some of the tools, setting a rule requires configuration of several different modules.
And finally… the biggest challenge of them all:
One of the major KPIs of a successful IT team is the time it takes the organization to adopt new technologies and new best practices as part of the organizational culture. This is especially challenging when adopting state-of-the-art AIOps tools and platforms that integrate ML and AI technologies, and even more so when they mandate the use of new hybrid-cloud monitoring tools and work procedures.
AIOps tools are disrupting the data center monitoring and observability market, offering new capabilities that provide a new level of visibility and understanding, and help to achieve significant decreases in DC downtime.
But despite the advantages and progress, the challenges are still immense, and the market is evolving toward autonomous AI tools that will lead to the establishment of the “self-healing datacenter”.