Abstract:
Processes and systems described herein are directed to determining efficient sampling rates for metrics generated by different metric sources of a distributed computing system. In one aspect, processes and systems retrieve the metrics from metric data storage and identify the non-constant metrics among them. Processes and systems separately determine an efficient sampling rate for each non-constant metric by constructing a plurality of corresponding reduced metrics, each reduced metric comprising a different subsequence of the corresponding metric. Information loss is computed for each reduced metric. An efficient sampling rate is determined for each metric based on the information losses created by constructing the reduced metrics. The efficient sampling rates are applied to corresponding streams of run-time metric values and may also be used to resample the corresponding metrics already stored in metric data storage, reducing storage space for the metrics.
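The selection of a sampling rate from information losses can be illustrated with a minimal sketch. The abstract does not define the information-loss measure, so the one used here (mean absolute error of a hold-last-value reconstruction, scaled by the metric's range) is an assumption, as are the function names, the candidate steps, and the `max_loss` tolerance.

```python
from statistics import pstdev


def is_constant(values, eps=1e-12):
    """A metric is treated as constant when its values do not vary."""
    return pstdev(values) <= eps


def reduce_metric(values, step):
    """Reduced metric: the subsequence that keeps every `step`-th value."""
    return values[::step]


def information_loss(values, step):
    """Assumed loss measure: mean absolute error of a hold-last-value
    reconstruction of the metric, divided by the metric's range."""
    span = max(values) - min(values)
    if span == 0:
        return 0.0
    reduced = reduce_metric(values, step)
    reconstructed = [reduced[i // step] for i in range(len(values))]
    error = sum(abs(a - b) for a, b in zip(values, reconstructed)) / len(values)
    return error / span


def efficient_step(values, candidate_steps=(2, 4, 8, 16), max_loss=0.05):
    """Largest candidate step whose loss stays within `max_loss`.
    Returns None for constant metrics and 1 when no candidate qualifies."""
    if is_constant(values):
        return None
    best = 1
    for step in sorted(candidate_steps):
        if information_loss(values, step) <= max_loss:
            best = step
    return best


# A metric that changes value only once every 8 samples.
metric = [i // 8 for i in range(64)]
step = efficient_step(metric)
resampled = reduce_metric(metric, step)
```

In this sketch the returned step is expressed as "keep every k-th sample"; the same step could be applied to a run-time stream or to the stored metric.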
Abstract:
Automated processes and systems that detect abnormal performance of a complex computational system of a distributed computing system are described. The processes and systems determine time stamps of previous abnormal behavior of the complex computational system and determine uncorrelated metrics associated with the complex computational system. Rules are determined based on the uncorrelated metrics and the time stamps of previous abnormal behavior of the complex computational system. Each rule may be applied to run-time metric values of the uncorrelated metrics to detect abnormal behavior of the complex computational system and generate a corresponding alert in approximate real time. Each rule may include displaying a recommendation for addressing the abnormality based on remedial measures used to correct the same abnormality in the past. Each rule may also automatically trigger remedial action that automatically corrects the abnormality.
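As a rough illustration of the step that determines uncorrelated metrics, the sketch below greedily keeps a metric only if its Pearson correlation with every metric already kept is below a cutoff. The abstract does not say how correlation is measured or pruned, so the Pearson coefficient, the greedy order, the `max_abs_corr` cutoff, and the metric names are all assumptions. Rule construction is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences.
    Returns 0.0 when either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


def uncorrelated_metrics(metrics, max_abs_corr=0.9):
    """Greedy pruning: visit metrics in name order and keep one only if
    it is weakly correlated with every metric kept so far."""
    kept = []
    for name in sorted(metrics):
        if all(abs(pearson(metrics[name], metrics[k])) < max_abs_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical metrics: "cpu_copy" is a scaled duplicate of "cpu".
metrics = {
    "cpu": [1, 2, 3, 4, 5],
    "cpu_copy": [2, 4, 6, 8, 10],
    "mem": [5, 1, 4, 2, 3],
}
selected = uncorrelated_metrics(metrics)
```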
Abstract:
Computational processes and systems are directed to detecting abnormally behaving objects of a distributed computing system. An object can be a physical or a virtual object, such as a server computer, application, VM, virtual network device, or container. Processes and systems identify a set of metrics associated with an object and compute an indicator metric from the set of metrics. The indicator metric is used to label time stamps that correspond to outlier metric values of the set of metrics. The metrics and outlier time stamps are used to compute rules by machine learning. Each rule corresponds to a subset or combination of metrics and represents specific threshold conditions for metric values. The rules are applied to run-time metric data of the metrics to detect run-time abnormal behavior of the object.
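One possible reading of the indicator-metric step is sketched below. The abstract does not define the indicator metric, so this example assumes it is the largest absolute z-score across the object's metrics at each time stamp, and that a time stamp is labeled an outlier when the indicator exceeds a hypothetical `cutoff`. Learning rules from the labeled time stamps is not shown.

```python
from statistics import mean, pstdev


def indicator_metric(metrics):
    """Assumed indicator: at each time stamp, the largest absolute z-score
    over all metrics. `metrics` maps a name to time-aligned values."""
    z_scores = []
    for values in metrics.values():
        mu, sigma = mean(values), pstdev(values)
        z_scores.append([abs(v - mu) / sigma if sigma else 0.0 for v in values])
    return [max(column) for column in zip(*z_scores)]


def outlier_time_stamps(time_stamps, indicator, cutoff=2.0):
    """Time stamps whose indicator value exceeds the cutoff."""
    return [t for t, score in zip(time_stamps, indicator) if score > cutoff]


# Hypothetical metrics of one object; the last CPU value is a spike.
metrics = {
    "cpu": [10, 11, 10, 9, 10, 10, 11, 9, 10, 50],
    "mem": [5, 5, 6, 5, 4, 5, 5, 6, 4, 5],
}
indicator = indicator_metric(metrics)
labels = outlier_time_stamps(list(range(10)), indicator)
```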
Abstract:
Methods and systems are directed to quantifying and prioritizing the impact of problems or changes in a computer system. Resources of a computer system are monitored by management tools. When a change occurs at a resource of a computer system or in log data generated by event sources of the computer system, one or more of the management tools generates an alert. The alert may be an alert that indicates a problem with the computer system resource or the alert may be an alert trigger identified in an event message of the log data. Methods described herein compute an impact factor that serves as a measure of the difference between event messages generated before the alert and event messages generated after the alert. The value of the impact factor associated with an alert may be used to quantitatively prioritize the alert and generate appropriate recommendations for responding to the alert.
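To make the idea of an impact factor concrete, the sketch below compares the distribution of event types before an alert with the distribution after it. The abstract does not name the difference measure; the Jensen-Shannon divergence used here is an assumption, as are the function names, and both windows are assumed to be non-empty.

```python
import math
from collections import Counter


def event_type_distribution(event_types, vocabulary):
    """Relative frequency of each event type in `vocabulary`."""
    counts = Counter(event_types)
    total = len(event_types)
    return [counts[t] / total for t in vocabulary]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions,
    1 for distributions with no event types in common."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def impact_factor(before, after):
    """Assumed impact factor: divergence between the event-type
    distributions observed before and after the alert."""
    vocabulary = sorted(set(before) | set(after))
    p = event_type_distribution(before, vocabulary)
    q = event_type_distribution(after, vocabulary)
    return jensen_shannon_divergence(p, q)


# Hypothetical event types extracted from event messages.
before = ["info", "info", "info", "warn"]
after = ["info", "error", "error", "error"]
score = impact_factor(before, after)
```

Because the score is bounded between 0 and 1, alerts could be ordered by it directly when prioritizing.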
Abstract:
Automated computational methods and systems to classify and troubleshoot problems in information technology (“IT”) systems or services provided by a distributed computing system are described. Each IT system of the distributed computing system or IT service provided by the distributed computing system has an associated key performance indicator (“KPI”) used to monitor performance of the IT system or service. When real-time KPI data violates a KPI threshold, a real-time event-type distribution is computed from event messages generated by event sources associated with the IT system or service following the threshold violation. The real-time event-type distribution is compared with historical event-type distributions recorded for the KPI data in order to identify the problem and execute remedial action to resolve the problem.
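The comparison against historical event-type distributions can be pictured as a nearest-match lookup. In the sketch below, the total variation distance, the `max_distance` cutoff, and the problem labels are assumptions chosen for illustration; the abstract does not specify how distributions are compared.

```python
def total_variation(p, q):
    """Total variation distance between two event-type distributions
    given as dicts mapping event type to relative frequency."""
    event_types = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in event_types)


def classify_problem(realtime, historical, max_distance=0.2):
    """Label of the closest historical distribution, or None when no
    historical distribution lies within `max_distance`."""
    label, distance = min(
        ((name, total_variation(realtime, dist)) for name, dist in historical.items()),
        key=lambda item: item[1],
    )
    return label if distance <= max_distance else None


# Hypothetical historical distributions recorded for past KPI violations.
historical = {
    "disk_full": {"io_error": 0.7, "warn": 0.3},
    "net_down": {"timeout": 0.8, "warn": 0.2},
}
problem = classify_problem({"io_error": 0.6, "warn": 0.4}, historical)
unknown = classify_problem({"auth_fail": 1.0}, historical)
```

A label returned by `classify_problem` could then index a table of remedial actions; an unmatched distribution is left unclassified here.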
Abstract:
This disclosure presents computational systems and methods for detecting anomalies in data output from any type of monitoring tool. The data is aggregated and sent to an alerting system for abnormality detection via comparison with normalcy bounds. The anomaly detection methods are performed by construction of normalcy bounds of the data based on the past behavior of the data output from the monitoring tool. The methods use data quality assurance and data categorization processes that allow choosing a correct procedure for determination of the normalcy bounds. The methods are completely data agnostic, and as a result, can also be used to detect abnormalities in time series data associated with any complex system.
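A minimal sketch of normalcy bounds built from past behavior follows. It shows only one possible procedure, percentile bounds over historical values, and does not model the data quality assurance or data categorization steps; the percentile choices and function names are assumptions.

```python
from statistics import quantiles


def normalcy_bounds(history, lower_pct=1, upper_pct=99):
    """Lower and upper bounds taken as percentiles of past values."""
    cuts = quantiles(history, n=100)  # 99 cut points: 1st to 99th percentile
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def is_abnormal(value, bounds):
    """True when a new value falls outside the normalcy bounds."""
    lower, upper = bounds
    return value < lower or value > upper


# Hypothetical past output of a monitoring tool.
history = list(range(1, 101))
bounds = normalcy_bounds(history)
alerts = [v for v in (50, 500, -10) if is_abnormal(v, bounds)]
```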
Abstract:
Methods and systems described herein are directed to identifying anomalously behaving components of a distributed computing system. Methods and systems collect log messages generated by a set of event log sources running in the distributed computing system within an observation time window. Frequencies of various types of event messages generated within the observation time window are determined for each of the log sources. A similarity value is calculated for each pair of event sources. The similarity values are used to identify clusters of similar event sources of the distributed computing system for various management purposes. Components of the distributed computing system that host event source outliers may be identified as potentially having problems or as indicating future problems.
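The pairwise similarity step can be sketched as follows. Cosine similarity between event-type frequency vectors is an assumption, and instead of full clustering this example simply flags a source as an outlier when its mean similarity to all other sources is low. The host names and the `min_mean_similarity` cutoff are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two event-type frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarity(frequencies):
    """Similarity value for each unordered pair of event sources."""
    return {
        (a, b): cosine_similarity(frequencies[a], frequencies[b])
        for a, b in combinations(sorted(frequencies), 2)
    }


def outlier_sources(frequencies, min_mean_similarity=0.5):
    """Sources whose mean similarity to all other sources is low."""
    similarities = pairwise_similarity(frequencies)
    outliers = []
    for source in sorted(frequencies):
        values = [s for pair, s in similarities.items() if source in pair]
        if values and sum(values) / len(values) < min_mean_similarity:
            outliers.append(source)
    return outliers


# Counts of three event types per source within one observation window.
frequencies = {
    "host-a": [10, 1, 0],
    "host-b": [9, 2, 0],
    "host-c": [11, 1, 0],
    "host-d": [0, 1, 12],
}
suspects = outlier_sources(frequencies)
```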
Abstract:
Methods and systems to evaluate data center performance and prioritize data center objects and anomalies for remedial actions are described. Methods rank data center objects and determine object performance trends. Methods calculate an object rank of each object of the data center over a period of time and calculate an object trend of each object of the data center based on relative frequencies of alerts at different times. The objects may be prioritized for remedial actions based on the object ranks and object trends.
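The abstract does not give formulas for object rank or object trend, so the sketch below uses assumed definitions purely for illustration: rank is an object's share of all alerts over the whole period, and trend is the change in its share of alerts between an earlier window and a recent window. Object names are hypothetical.

```python
def relative_frequencies(alert_counts):
    """Share of all alerts attributed to each object."""
    total = sum(alert_counts.values())
    return {obj: (n / total if total else 0.0) for obj, n in alert_counts.items()}


def object_trends(earlier, recent):
    """Assumed trend: change in an object's share of alerts between windows."""
    old, new = relative_frequencies(earlier), relative_frequencies(recent)
    return {obj: new.get(obj, 0.0) - old.get(obj, 0.0) for obj in set(earlier) | set(recent)}


def prioritize(earlier, recent):
    """Objects ordered by assumed rank (share of all alerts), then by trend."""
    trends = object_trends(earlier, recent)
    combined = {obj: earlier.get(obj, 0) + recent.get(obj, 0) for obj in trends}
    ranks = relative_frequencies(combined)
    return sorted(trends, key=lambda obj: (ranks[obj], trends[obj]), reverse=True)


# Hypothetical alert counts per object in two time windows.
earlier = {"vm-1": 5, "vm-2": 5}
recent = {"vm-1": 1, "vm-2": 9}
order = prioritize(earlier, recent)
```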
Abstract:
This disclosure is directed to data-agnostic computational methods and systems for adjusting hard thresholds based on user feedback. Hard thresholds are used to monitor time-series data generated by a data-generating entity. The time-series data may be metric data that represents usage of the data-generating entity over time. The data is compared with a hard threshold associated with usage of the resource or process and when the data violates the threshold, an alert is typically generated and presented to a user. Methods and systems collect user feedback after a number of alerts to determine the quality and significance of the alerts. Based on the user feedback, methods and systems automatically adjust the hard thresholds to better represent how the user perceives the alerts.
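A feedback-driven adjustment might look like the sketch below. The update rule is an assumption, not the disclosed method: the threshold is treated as an upper bound, feedback is a list of booleans marking each alert as useful or not, and the threshold moves by a fixed relative `step` when the share of useful alerts leaves a hypothetical target band.

```python
def adjust_threshold(threshold, feedback, step=0.05, min_useful=0.5, max_useful=0.9):
    """Raise an upper threshold when too few alerts were rated useful,
    lower it when nearly all were, and leave it unchanged otherwise."""
    if not feedback:
        return threshold
    useful_share = sum(feedback) / len(feedback)
    if useful_share < min_useful:
        return threshold * (1 + step)  # too noisy: alert less often
    if useful_share > max_useful:
        return threshold * (1 - step)  # very useful: alert a bit earlier
    return threshold


# Hypothetical feedback on four alerts: only one was considered useful.
raised = adjust_threshold(80.0, [False, False, True, False])
lowered = adjust_threshold(80.0, [True] * 10)
```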
Abstract:
This disclosure is directed to automated computer-implemented methods and systems for detecting and correcting a trending problem with an application executing in a data center. The methods receive a new support request entered via a graphical user interface. The methods perform trend discovery of the new support request over recent time windows using a pre-trained and fine-tuned bidirectional encoder representations from transformers (“BERT”) model. In response to detecting a trending problem described in the new support request, the methods discover recommended remedial measures for the new support request based on similar support requests previously recorded in a support request data store or on similar knowledge base articles previously recorded in a knowledge base data store. The recommended remedial measures for correcting the trending problem are executed using an operations manager of the data center.