Abstract:
Methods determine a capacity-forecast model based on historical capacity-metric data and historical business-metric data. The capacity-forecast model may be used to estimate capacity requirements with respect to changes in demand for a data center customer's application program. The capacity-forecast model provides an analytical “what-if” approach to reallocating data center resources in order to satisfy projected business-level expectations of a data center customer and to calculate estimated capacities for different business scenarios.
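A minimal sketch of the “what-if” idea, assuming a simple linear relationship between one business metric and one capacity metric; the data, metric names, and model choice below are illustrative, not the patented method:

```python
import numpy as np

# Hypothetical historical observations for one customer application.
transactions_per_day = np.array([1e5, 2e5, 3e5, 4e5, 5e5])   # business metric
cpu_cores_used = np.array([12.0, 21.5, 33.0, 41.0, 52.5])    # capacity metric

# Ordinary least-squares fit: capacity ~ a * demand + b.
a, b = np.polyfit(transactions_per_day, cpu_cores_used, deg=1)

def forecast_capacity(projected_transactions: float) -> float:
    """Estimate required CPU cores for a projected demand level."""
    return a * projected_transactions + b

# "What-if" scenario: demand doubles relative to the last observation.
print(forecast_capacity(1e6))
```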
Abstract:
Methods and systems that manage large volumes of metric data generated by cloud-computing infrastructures are described. A cloud-computing infrastructure generates sets of metric data, each of which may represent usage or performance of an application or application module run by the cloud-computing infrastructure, or use or performance of cloud-computing resources used by the applications. The metric-data management methods and systems are composed of separate modules that sequentially apply metric-data reduction techniques at different levels of data abstraction in order to reduce the volume of metric data collected. In particular, the modules determine normalcy bounds, delete highly correlated metric data, and delete metric data with highly correlated normalcy-bound violations.
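One of the named reduction steps, deleting highly correlated metric data, might look roughly like the following sketch, where the correlation threshold and metric streams are assumptions:

```python
import numpy as np

def reduce_correlated_metrics(metrics: dict, threshold: float = 0.95) -> dict:
    """Keep one representative from each group of highly correlated metrics."""
    kept = {}
    for name, series in metrics.items():
        redundant = any(
            abs(np.corrcoef(series, kept_series)[0, 1]) >= threshold
            for kept_series in kept.values()
        )
        if not redundant:
            kept[name] = series
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=500)
metrics = {
    "cpu_usage": base,
    "cpu_ready": 2.0 * base + 0.01 * rng.normal(size=500),  # near-duplicate
    "disk_iops": rng.normal(size=500),                      # independent
}
print(sorted(reduce_correlated_metrics(metrics)))  # ['cpu_usage', 'disk_iops']
```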
Abstract:
Methods and systems estimate a degree of abnormality of a complex system based on historical time-series data representative of the system's past behavior, and use the historical degree of abnormality to determine whether a degree of abnormality computed from current time-series data representative of the same system's current behavior is worthy of attention. The time-series data may be metric data that represents the behavior of the complex system as a result of successive measurements made over time or within a time interval. A degree of abnormality represents the amount by which the time-series data violates a threshold. The further the degree of abnormality of the current time-series data departs from the historical degree of abnormality, the larger the threshold violation and the greater the probability that the violation in the current time-series data is worthy of attention.
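A rough sketch of the comparison, assuming the degree of abnormality is the total amount by which samples exceed a threshold and that a historical percentile decides what is worthy of attention; both choices are illustrative:

```python
import numpy as np

def degree_of_abnormality(window: np.ndarray, threshold: float) -> float:
    """Total amount by which samples in the window exceed the threshold."""
    return float(np.clip(window - threshold, 0.0, None).sum())

rng = np.random.default_rng(1)
threshold = 2.0

# Historical behavior: many past windows of the same metric.
history = np.array([
    degree_of_abnormality(w, threshold)
    for w in rng.normal(size=(1000, 60))
])

# Current behavior: a fresh window whose values drift upward.
current_score = degree_of_abnormality(rng.normal(loc=0.8, size=60), threshold)

# Flag only scores beyond, say, the 99th historical percentile.
if current_score > np.percentile(history, 99):
    print(f"abnormality worthy of attention: {current_score:.2f}")
```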
Abstract:
This disclosure presents systems and methods for run-time analysis of streams of log data for abnormalities using a statistical structure of meta-data associated with the log data. The systems and methods convert a log-data stream into meta-data and perform statistical analysis in order to reveal a dominant statistical pattern within the meta-data. The meta-data is represented as a graph whose nodes represent the different event types detected in the stream, along with the event sources associated with those events. The systems and methods use real-time analysis to compare a portion of a current log-data stream collected in an operational window with historically collected meta-data represented by a graph, in order to determine the degree of abnormality of the current log-data stream collected in the operational window.
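A simplified sketch of the meta-data comparison, assuming event-type counts per source stand in for the graph and an L1 distance stands in for the abnormality measure; event types, sources, and the distance are hypothetical:

```python
from collections import Counter

def event_type_distribution(events):
    """events: iterable of (source, event_type) pairs -> normalized counts."""
    counts = Counter(events)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def abnormality(current, historical):
    """L1 distance between current and historical distributions."""
    keys = set(current) | set(historical)
    return sum(abs(current.get(k, 0.0) - historical.get(k, 0.0)) for k in keys)

historical = event_type_distribution(
    [("web-01", "login"), ("web-01", "login"),
     ("db-01", "query"), ("db-01", "query")]
)
current = event_type_distribution(
    [("web-01", "login"), ("db-01", "error"), ("db-01", "error")]
)
print(f"degree of abnormality: {abnormality(current, historical):.2f}")
```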
Abstract:
Automated computer-implemented methods and systems for resolving performance problems with objects executing in a data center are described. The automated methods use machine learning to obtain rules defining relationships between probabilities of event types in log messages and performance problems identified by a key performance indicator (“KPI”) of the object. When a KPI violates a corresponding threshold, the rules are used to evaluate run-time log messages that describe the probable root cause of the performance problem. An alert identifying the KPI threshold violation and the corresponding log messages is displayed in a graphical user interface of an electronic display device.
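A sketch of how such rules might be evaluated at run time, with hypothetical event types, rule thresholds, and root causes standing in for the learned rules:

```python
from collections import Counter

# Hypothetical learned rules: (event type, minimum probability, probable cause).
RULES = [
    ("connection_timeout", 0.30, "network saturation"),
    ("out_of_memory", 0.10, "memory exhaustion on host"),
]

def probable_root_causes(log_event_types, kpi_value, kpi_threshold):
    """When the KPI violates its threshold, evaluate rules over the log window."""
    if kpi_value <= kpi_threshold:
        return []
    counts = Counter(log_event_types)
    total = sum(counts.values())
    probs = {etype: n / total for etype, n in counts.items()}
    return [cause for etype, p_min, cause in RULES
            if probs.get(etype, 0.0) >= p_min]

window = ["connection_timeout"] * 4 + ["request_ok"] * 6
print(probable_root_causes(window, kpi_value=950, kpi_threshold=500))
```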
Abstract:
The current document is directed to methods and systems that collect metric data within computing facilities, including large data centers and cloud-computing facilities. In a described implementation, two or more metric-data sets are combined to generate a multidimensional metric-data set. The multidimensional metric-data set is compressed for efficient storage by clustering the multidimensional data points within the multidimensional metric-data set to produce a covering subset of multidimensional data points and by then representing the multidimensional-data-point members of each cluster by a cluster identifier rather than by a set of floating-point values, integer values, or other types of data representations. The covering subset is constructed to ensure that the compression does not result in greater than a specified level of distortion of the original data.
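A sketch of the compression idea, with k-means (via scikit-learn) standing in for whatever clustering the implementation uses; the dimensionality, cluster count, and distortion check are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
points = rng.normal(size=(10_000, 3))  # combined 3-D metric-data set

km = KMeans(n_clusters=64, n_init=10, random_state=2).fit(points)
centroids = km.cluster_centers_

# Compressed form: one small integer per point plus the shared centroid codebook.
labels = km.labels_.astype(np.uint8)

# Check that the compression stays within a specified distortion level.
distortion = np.linalg.norm(points - centroids[km.labels_], axis=1).mean()
print(f"mean distortion: {distortion:.3f}, "
      f"bytes: {points.nbytes} -> {labels.nbytes + centroids.nbytes}")
```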
Abstract:
The current document is directed to methods and systems that employ distributed-computer-system metrics collected by one or more distributed-computer-system metrics-collection services, call traces collected by one or more call-trace services, and attribute values for distributed-computer-system components to identify attribute dimensions related to anomalous behavior of distributed-computer-system components. In a described implementation, nodes correspond to particular types of system components and node instances are individual components of the component type corresponding to a node. Node instances are associated with attribute values and nodes are associated with attribute-value spaces defined by attribute dimensions. A set of call traces is partitioned by clustering. Using attribute values and call traces, attribute dimensions that are likely related to particular anomalous behaviors of distributed-computer-system components are determined by decision-tree-related analyses for each partition and are reported to one or more computational entities to facilitate resolution of the anomalous behaviors.
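A sketch of the partitioning step alone, assuming each call trace is represented by per-service call counts and clustered with k-means; the per-partition decision-tree analysis (sketched after the next abstract) would then run separately on each group:

```python
import numpy as np
from sklearn.cluster import KMeans

services = ["frontend", "cart", "payment", "catalog"]

# Hypothetical call traces: how often each service appears in the trace.
traces = np.array([
    [1, 2, 1, 0],
    [1, 2, 1, 0],
    [1, 0, 0, 5],
    [1, 0, 0, 6],
])

partitions = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(traces)
for cluster_id in np.unique(partitions):
    members = np.where(partitions == cluster_id)[0]
    print(f"partition {cluster_id}: traces {members.tolist()}")
```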
Abstract:
The current document is directed to methods and systems that employ distributed-computer-system metrics collected by one or more distributed-computer-system metrics-collection services, call traces collected by one or more call-trace services, and attribute values for distributed-computer-system components to identify attribute dimensions related to anomalous behavior of distributed-computer-system components. In a described implementation, nodes correspond to particular types of system components and node instances are individual components of the component type corresponding to a node. Node instances are associated with attribute values and nodes are associated with attribute-value spaces defined by attribute dimensions. Using attribute values and call traces, attribute dimensions that are likely related to particular anomalous behaviors of distributed-computer-system components are determined by decision-tree-related analyses and are reported to one or more computational entities to facilitate resolution of the anomalous behaviors.
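A sketch of the decision-tree-related analysis, assuming encoded attribute values and anomaly labels for node instances; the attribute dimensions and data are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

attribute_dims = ["software_version", "availability_zone", "host_generation"]

# Hypothetical encoded attribute values, one row per node instance.
X = np.array([
    [2, 0, 1],
    [2, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [2, 0, 0],
    [1, 1, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = instance showed anomalous behavior

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
ranked = sorted(zip(attribute_dims, tree.feature_importances_),
                key=lambda kv: -kv[1])
print("attribute dimensions most related to the anomaly:", ranked)
```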
Abstract:
Automated processes and systems for detecting abnormally behaving objects of a distributed computing system are described. Processes and systems obtain metrics that are generated in a historical time window and are associated with an object of the distributed computing system. Processes and systems use the metrics to compute a time-dependent system indicator over the historical time window. Each value of the system indicator corresponds to a point in time of the historical time window when the object was in a normal or an abnormal state. Processes and systems use the normal and abnormal states of the system indicator in the historical time window to train a state classifier that is used to detect run-time abnormal behavior of the object. When the state classifier identifies abnormal behavior of the object, an alert indicating the abnormal behavior is generated.
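A sketch of the training pipeline, assuming the system indicator is the fraction of an object's metrics violating fixed thresholds and a random forest stands in for the state classifier; data, thresholds, and model choice are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
T, M = 2000, 4                        # time steps, metrics for one object
metrics = rng.normal(size=(T, M))
metrics[1500:] += 2.5                 # the object drifts into an abnormal regime

thresholds = np.full(M, 2.0)
indicator = (metrics > thresholds).mean(axis=1)  # time-dependent system indicator
labels = (indicator > 0.5).astype(int)           # 1 = abnormal state

state_classifier = RandomForestClassifier(n_estimators=50, random_state=0)
state_classifier.fit(metrics, labels)

runtime_sample = rng.normal(loc=2.5, size=(1, M))  # fresh run-time metric values
if state_classifier.predict(runtime_sample)[0] == 1:
    print("alert: abnormal behavior detected")
```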
Abstract:
Methods and systems of automatic confidence-controlled sampling to analyze monitoring data and event messages generated by sources of a distributed computing system, and to detect anomalies and problems in them, are described. A source can be a virtual or physical object of the distributed computing system, a resource of the distributed computing system, or an event source running in the distributed computing system. Monitoring data includes metric data generated by resources and data that represents meta-data properties of event sources. Confidence-controlled sampling is used to determine characteristics of the monitoring data, identify periodic patterns in the behavior of a source, detect changes in the behavior of a source, and compare the behavior of two sources. Confidence-controlled sampling speeds up characterization of the data sets, determination of behavior patterns, and detection and reporting of anomalies and problems of the resources and event sources of the distributed computing system.
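A sketch of confidence-controlled sampling for one task, estimating a mean, where sampling stops once a 95% confidence interval is narrower than a chosen tolerance instead of scanning all the data; the z-value, tolerance, and doubling schedule are assumptions:

```python
import numpy as np

def confidence_controlled_mean(data, tol=0.01, z=1.96, start=500):
    """Estimate the mean from samples sized to meet a confidence target."""
    rng = np.random.default_rng(0)
    n = start
    while True:
        sample = rng.choice(data, size=min(n, len(data)), replace=False)
        half_width = z * sample.std(ddof=1) / np.sqrt(len(sample))
        if half_width <= tol or len(sample) == len(data):
            return sample.mean(), half_width, len(sample)
        n *= 2  # interval still too wide: double the sample size and retry

data = np.random.default_rng(4).normal(loc=5.0, scale=1.0, size=1_000_000)
mean, hw, used = confidence_controlled_mean(data)
print(f"mean ~ {mean:.3f} +/- {hw:.3f} using {used} of {len(data)} values")
```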