Abstract:
This disclosure presents systems and methods for run-time analysis of streams of log data for abnormalities using a statistical structure of meta-data associated with the log data. The systems and methods convert a log data stream into meta-data and perform statistical analysis in order to reveal a dominant statistical pattern within the meta-data. The meta-data is represented as a graph with nodes that represent the different event types detected in the stream, along with the event sources associated with the events. The systems and methods use real-time analysis to compare a portion of a current log data stream collected in an operational window with historically collected meta-data represented by a graph in order to determine the degree of abnormality of the current log data stream collected in the operational window.
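The comparison of an operational window against historical meta-data can be illustrated with a minimal sketch. It assumes, purely for illustration, that graph edges are consecutive event-type pairs and that the degree of abnormality is the Jensen-Shannon divergence between edge-frequency distributions; the abstract does not specify either choice, and the names `edge_distribution` and `js_divergence` and the sample events are hypothetical.

```python
from collections import Counter
from math import log2

def edge_distribution(events):
    """Relative frequency of consecutive event-type pairs (assumed graph edges)."""
    pairs = Counter(zip(events, events[1:]))
    total = sum(pairs.values())
    return {edge: n / total for edge, n in pairs.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical event-type streams: history, a typical window, an unusual window.
history = ["login", "read", "write", "logout"] * 50
normal_window = ["login", "read", "write", "logout"] * 3
odd_window = ["login", "error", "error", "error", "retry", "error"]

hist = edge_distribution(history)
score_normal = js_divergence(hist, edge_distribution(normal_window))
score_odd = js_divergence(hist, edge_distribution(odd_window))
print(score_normal, score_odd)
```

A window whose edges never occur in the history scores at the maximum of the scale, while a window that follows the dominant pattern scores near zero.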
Abstract:
Automated methods and systems for resolving potential root causes of performance problems with applications executing in a data center are described. The automated methods use machine learning to train an inference model that relates event types recorded in metrics, log messages, and traces of an application to values of a key performance indicator (“KPI”) of the application. The methods use the trained inference model to determine which of the event types are important event types that relate to performance of the application. In response to detecting a run-time performance problem in the KPI, the methods determine which of the important event types has a higher probability of being the potential root cause of the performance problem. A graphical user interface displays an alert that identifies the application as having the run-time performance problem, the identity of the important event types, and at least one recommendation for remedying the performance problem.
Abstract:
The current document is directed to improved methods and systems that collect, generate, and store multidimensional metric data used for monitoring, management, and administration of computer systems and that continuously optimize sampling rates for metric data. Multiple different metric-data streams are sampled for each of multiple different distributed-computer-system objects, and are hierarchically organized into a number of different individual and multidimensional metric-data streams. The sampling rates for the different individual and multidimensional metric-data streams are correspondingly hierarchically optimized in order to avoid oversampling the metric data while preserving the relevant information content of the sampled metric data for downstream data analysis.
Abstract:
Automated methods and systems for compressing log messages stored in a log message database are described herein. The automated methods and systems perform lossy compression of an original set of log messages by identifying log messages that represent each of the various types of events recorded in the original set. The log messages in the original set are overwritten by corresponding representative log messages. Source coding is used to construct a source coding scheme and variable length binary codewords for each of the representative log messages. The representative log messages are replaced by the codewords, which occupy significantly less storage space than the original set. The lossy compressed set of log messages can be decompressed to obtain the representative log messages using the source coding scheme.
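The compression steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the described implementation: Huffman coding stands in for the unspecified source coding scheme, the event type of a message is approximated by masking digit runs (the hypothetical helper `event_type`), and the first message seen for each type serves as its representative.

```python
import heapq
import re
from collections import Counter

def event_type(msg):
    """Hypothetical event-type key: mask digit runs so variable fields collapse."""
    return re.sub(r"\d+", "<*>", msg)

def huffman_code(freqs):
    """Return {symbol: bitstring} built with the standard Huffman procedure."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # The integer tie-breaker keeps the heap from ever comparing the dicts.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def compress(messages):
    """Overwrite each message with its representative, then encode as codewords."""
    representative = {}
    for m in messages:
        representative.setdefault(event_type(m), m)  # first message seen per type
    lossy = [representative[event_type(m)] for m in messages]
    code = huffman_code(Counter(lossy))
    return "".join(code[m] for m in lossy), code

def decompress(bits, code):
    """Recover the representative messages; the code is prefix-free."""
    decode = {b: m for m, b in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode:
            out.append(decode[buf])
            buf = ""
    return out

logs = [
    "user 17 logged in",
    "disk usage at 91 percent",
    "user 42 logged in",
    "user 8 logged in",
    "disk usage at 95 percent",
    "connection reset by peer",
]
bits, code = compress(logs)
restored = decompress(bits, code)
```

Decompression returns the representative messages rather than the originals, which is the lossy part: "user 42 logged in" comes back as "user 17 logged in".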
Abstract:
Methods and systems described herein are directed to troubleshooting anomalous behavior in a data center. Anomalous behavior in an object of a data center, such as a computational resource, an application, or a virtual machine (“VM”), may be related to the behavior of other objects at different hierarchies of the data center. Methods and systems provide a graphical user interface that enables a user to select a metric associated with an object of the data center experiencing a performance problem. Unexpected metrics of an object topology of the data center that correspond to the performance problem are identified. A recommendation for executing remedial measures to correct the performance problem is generated based on the unexpected metrics.
Abstract:
The current document is directed to methods and systems that generate forecasts based on input time-series data using a forecasting neural network or other machine-learning-based forecasting subsystem. In various implementations, an input time series is first classified and then transformed, based on the classification, to a corresponding stationary time series. The corresponding stationary time series is then submitted to a neural network or other machine-learning-based forecasting subsystem to generate an initial forecast for future time points. The initial forecast is then inverse transformed, based on the input-time-series classification, to generate a final, output forecast.
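The classify, transform, forecast, and inverse-transform pipeline can be sketched as follows. Everything here is an illustrative assumption: the two-class slope-based classifier, the differencing transform, and in particular `stand_in_forecaster`, which simply repeats the mean of its input and takes the place of the neural network only so that the sketch stays self-contained.

```python
from statistics import mean

def classify(series, slope_threshold=0.1):
    """Hypothetical classifier: 'trend' if the least-squares slope is large."""
    n = len(series)
    t_bar, y_bar = (n - 1) / 2, mean(series)
    num = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return "trend" if abs(num / den) > slope_threshold else "stationary"

def to_stationary(series, label):
    """First differences remove a linear trend; stationary input passes through."""
    if label == "trend":
        return [b - a for a, b in zip(series, series[1:])]
    return list(series)

def stand_in_forecaster(stationary, horizon):
    """Placeholder for the forecasting subsystem: repeats the input mean."""
    return [mean(stationary)] * horizon

def inverse_transform(forecast, label, last_value):
    """Undo differencing by accumulating from the last observed value."""
    if label != "trend":
        return list(forecast)
    out, level = [], last_value
    for step in forecast:
        level += step
        out.append(level)
    return out

series = [10.0, 12.0, 14.0, 16.0, 18.0]
label = classify(series)
initial = stand_in_forecaster(to_stationary(series, label), horizon=3)
final = inverse_transform(initial, label, series[-1])
print(label, final)
```

The key design point is that the inverse transform is selected by the same classification label as the forward transform, so the final forecast is expressed in the units of the original series.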
Abstract:
Computational processes and systems are directed to detecting abnormally behaving objects of a distributed computing system. An object can be a physical or a virtual object, such as a server computer, application, VM, virtual network device, or container. Processes and systems identify a set of metrics associated with an object and compute an indicator metric from the set of metrics. The indicator metric is used to label time stamps that correspond to outlier metric values of the set of metrics. The metrics and outlier time stamps are used to compute rules by machine learning. Each rule corresponds to a subset or combination of metrics and represents specific threshold conditions for metric values. The rules are applied to run-time metric data of the metrics to detect run-time abnormal behavior of the object.
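The sequence of indicator metric, outlier labels, learned rules, and run-time application can be sketched as follows. The specific choices are assumptions made for illustration: the indicator is taken to be the largest z-score across the metrics at each time stamp, and each rule is a single-metric threshold learned as a decision stump, whereas the abstract also allows rules over combinations of metrics. The metric names and values are hypothetical.

```python
from statistics import mean, pstdev

def zscores(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def label_outliers(metrics, cutoff=2.0):
    """Assumed indicator: the largest z-score across metrics at each time stamp."""
    z = [zscores(v) for v in metrics.values()]
    indicator = [max(column) for column in zip(*z)]
    return [value > cutoff for value in indicator]

def learn_rule(values, labels):
    """Decision stump: the threshold that misclassifies the fewest time stamps."""
    best = None
    for thr in sorted(set(values)):
        errors = sum((v >= thr) != lab for v, lab in zip(values, labels))
        if best is None or errors < best[1]:
            best = (thr, errors)
    return best

# Hypothetical historical metrics for one object, aligned by time stamp.
history = {
    "cpu": [20, 22, 21, 23, 20, 22, 95, 21, 22, 20],
    "mem": [40, 41, 39, 40, 42, 41, 43, 40, 41, 39],
}
labels = label_outliers(history)
rules = {name: learn_rule(vals, labels)[0] for name, vals in history.items()}

def violated(sample):
    """Apply the learned threshold rules to one run-time sample."""
    return [name for name, thr in rules.items() if sample[name] >= thr]

print(rules, violated({"cpu": 97, "mem": 40}))
```

Once learned, the rules are cheap to evaluate at run time because each one is a plain threshold comparison on incoming metric values.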
Abstract:
The current document is directed to methods and subsystems within computing systems, including distributed computing systems, that collect, store, process, and analyze population metrics for types and classes of system components, including components of distributed applications executing within containers, virtual machines, and other execution environments. In a described implementation, a graph-like representation of the configuration and state of a computer system includes aggregation nodes that collect metric data for a set of multiple object nodes and that collect metric data that represents the members of the set over a monitoring time interval. Population metrics are monitored, in certain implementations, to detect outlier members of an aggregation.
Abstract:
Computational processes and systems are directed to forecasting time series data and detecting anomalously behaving resources of a distributed computing system. Processes and systems comprise off-line and on-line modes that accelerate the forecasting process and identification of anomalously behaving resources. In the off-line mode, a recurrent neural network (“RNN”) is continuously trained using time series data associated with various resources of the distributed computing system. In the on-line mode, the latest RNN is used to forecast time series data for resources in a forecast time window and confidence bounds are computed over the forecast time window. The forecast time series data characterizes expected resource usage over the forecast time window so that usage of the resource may be adjusted. The confidence bounds may be used to detect anomalously behaving resources. Remedial measures may then be executed to correct problems indicated by the anomalous behavior.
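The use of confidence bounds to flag anomalous behavior can be sketched as follows. The sketch assumes that a forecast is already available (the RNN itself is not reproduced here) and that the bounds are symmetric, derived from the standard deviation of past forecast residuals with a normal-approximation multiplier of 1.96. The abstract does not say how the bounds are computed, so these choices and all sample values are hypothetical.

```python
from statistics import pstdev

def confidence_bounds(forecast, residuals, z=1.96):
    """Symmetric bounds around each forecast point (assumed construction)."""
    spread = z * pstdev(residuals)
    return [(f - spread, f + spread) for f in forecast]

def anomalous_points(observed, bounds):
    """Indices in the forecast window where the observation leaves the bounds."""
    return [
        t
        for t, (y, (low, high)) in enumerate(zip(observed, bounds))
        if not low <= y <= high
    ]

# Hypothetical values: past forecast errors, a forecast window, and what was observed.
residuals = [-1.0, 1.0, -1.0, 1.0]
forecast = [50.0, 52.0, 54.0]
observed = [50.5, 58.0, 53.0]

bounds = confidence_bounds(forecast, residuals)
flagged = anomalous_points(observed, bounds)
print(bounds, flagged)
```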
Abstract:
The current document is directed to methods and systems for detecting the occurrences of abnormal events and operational behaviors within a distributed computer system. The currently described methods and systems continuously collect metric data from various metric-data sources, generate a sequence of metric-data observations, each metric-data observation comprising a set of temporally aligned metric data, and employ principal-component analysis to transform the metric-data observations to facilitate reduction of the dimensionality of the metric-data observations. The currently described methods and systems then employ clustering methods to identify outlying transformed-metric-data observations, accordingly label the transformed metric-data observations to generate a training dataset, and then apply one or more of various types of machine-learning techniques to the training dataset in order to generate an abnormal-observation detector that can be used to detect, in real time, abnormal metric-data observations as they are generated within the distributed computing system.
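The chain of dimensionality reduction, clustering, labeling, and detector training can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the abstract: only the first principal component is kept (found by power iteration), clustering is a one-dimensional two-means, the smaller cluster is treated as the outlying one, and the trained detector is a nearest-centroid classifier. The observations are hypothetical.

```python
from statistics import mean

def fit_pca1(rows, iters=200):
    """Mean vector and first principal direction, found by power iteration."""
    dims = range(len(rows[0]))
    mu = [mean(r[j] for r in rows) for j in dims]
    centered = [[r[j] - mu[j] for j in dims] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / len(rows) for b in dims]
           for a in dims]
    v = [1.0 for _ in dims]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in dims) for a in dims]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mu, v

def project(row, mu, v):
    """Coordinate of one observation along the first principal direction."""
    return sum((x - m) * d for x, m, d in zip(row, mu, v))

def two_means(scores, iters=20):
    """One-dimensional k-means with k=2, seeded at the extremes."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        near_hi = [abs(s - hi) < abs(s - lo) for s in scores]
        lo = mean(s for s, h in zip(scores, near_hi) if not h)
        hi = mean(s for s, h in zip(scores, near_hi) if h)
    return lo, hi, near_hi

# Hypothetical temporally aligned observations of three metrics.
observations = [
    [10, 20, 30], [11, 21, 31], [9, 19, 29],
    [10, 21, 30], [11, 20, 31], [30, 60, 90],
]
mu, v = fit_pca1(observations)
scores = [project(r, mu, v) for r in observations]
lo, hi, near_hi = two_means(scores)

# Assumption: the smaller cluster holds the outlying observations.
outlier_is_hi = sum(near_hi) < len(near_hi) / 2
labels = [h == outlier_is_hi for h in near_hi]
normal_c, outlier_c = (lo, hi) if outlier_is_hi else (hi, lo)

def is_abnormal(row):
    """Nearest-centroid detector trained on the labeled projections."""
    s = project(row, mu, v)
    return abs(s - outlier_c) < abs(s - normal_c)

print(labels, is_abnormal([29, 58, 88]), is_abnormal([10, 20, 31]))
```

Because the detector only needs a projection and two distance comparisons per observation, it is inexpensive enough to apply as observations arrive.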