Abstract:
Methods and systems to narrow a search for potential sources of problems in a distributed computing system are described. A volatile event type of event messages recorded in an event-log file is identified. The volatile event type is an event type that may have unexpectedly increased in frequency over an observation time window. A historical period of time may be selected to search for potential sources of the volatile event type. Frequencies of event messages in the event-log file with the same event type as the volatile event type are determined for time intervals of the historical period of time. The time interval of the historical period of time with the largest increase in frequency of event messages is identified. A list of event messages of the event-log file in a selected sub-time interval of that time interval is displayed in a graphical user interface.
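The interval search described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented method: the function name `largest_frequency_increase`, the fixed-width intervals, and the choice to measure "increase" as the difference in count from the immediately preceding interval are all assumptions made for the example.

```python
from collections import Counter

def largest_frequency_increase(timestamps, start, end, interval):
    """Return (interval_start, increase) for the interval whose event count
    rose the most relative to the preceding interval, or None if there are
    fewer than two intervals. Names and the definition of "increase" are
    assumptions made for this sketch."""
    n_bins = int((end - start) // interval)
    counts = Counter()
    for t in timestamps:
        # Only count events that fall inside the historical period.
        if start <= t < start + n_bins * interval:
            counts[int((t - start) // interval)] += 1

    best_bin, best_increase = None, float("-inf")
    for b in range(1, n_bins):
        increase = counts[b] - counts[b - 1]
        if increase > best_increase:
            best_bin, best_increase = b, increase
    if best_bin is None:
        return None
    return start + best_bin * interval, best_increase

# Timestamps (in seconds) of messages with the volatile event type.
ts = [5, 12, 25, 31, 32, 33, 34, 35, 47]
result = largest_frequency_increase(ts, start=0, end=50, interval=10)
print(result)
```

The same function could be applied again inside the returned interval, with a smaller `interval`, to obtain the sub-time intervals from which a user selects messages to inspect.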
Abstract:
Methods and systems to identify log write instructions of a source code as potential sources of an event message of interest are described. Methods identify non-parametric tokens, such as text strings and natural language words and phrases, of an event message of interest. Candidate log write instructions and associated line numbers in a source code are identified. Non-parametric tokens of each event message of the one or more candidate log write instructions are determined. A confidence score is calculated for each candidate log write instruction based on the number of non-parametric tokens that the event message of interest and the event message of the candidate log write instruction have in common. The candidate log write instructions are rank ordered based on the corresponding confidence scores, and the rank ordered candidate log write instructions and associated line numbers of the source code may be displayed in a graphical user interface.
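A rough Python sketch of the scoring and ranking step follows. The abstract does not give the scoring formula, so the example assumes the score is the fraction of the message's non-parametric tokens that also appear in the candidate's format string. The tokenizer (purely alphabetic tokens count as non-parametric), the function names, and the sample candidates are likewise hypothetical.

```python
def non_parametric_tokens(text):
    """Assumption for this sketch: a token is non-parametric if it is purely
    alphabetic once surrounding punctuation is stripped."""
    tokens = (tok.strip(".,:;()[]\"'") for tok in text.split())
    return {tok.lower() for tok in tokens if tok.isalpha()}

def confidence(message, template):
    """Fraction of the message's non-parametric tokens found in the template."""
    wanted = non_parametric_tokens(message)
    if not wanted:
        return 0.0
    return len(wanted & non_parametric_tokens(template)) / len(wanted)

# Event message of interest.
message = "Connection to host 10.0.0.5 timed out after 30 seconds"

# Hypothetical candidates: (line number, format string of the log write).
candidates = [
    (130, "Disk usage at %d percent"),
    (87, "Connection to host %s established"),
    (42, "Connection to host %s timed out after %d seconds"),
]

# Rank candidates from highest to lowest confidence score.
ranked = sorted(
    ((line, confidence(message, template)) for line, template in candidates),
    key=lambda item: item[1],
    reverse=True,
)
for line, score in ranked:
    print(f"line {line}: {score:.2f}")
```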
Abstract:
The current document is directed to methods and systems for processing, classifying, and efficiently storing large volumes of event messages generated in modern computing systems. In a disclosed implementation, received event messages are assigned to clusters based on metrics computed for the event messages. In addition, a significance value is determined for each received event message. When the significance value exceeds a threshold value, one or more actions are taken, including marking an event record corresponding to the event message, storing an event record corresponding to the event message in a significant-event log, and generating a notice or alarm.
Abstract:
Automated processes and systems for detecting abnormally behaving objects of a distributed computing system are described. Processes and systems obtain metrics that are generated in a historical time window and are associated with an object of the distributed computing system. Processes and systems use the metrics to compute a time-dependent system indicator over the historical time window. Each value of the system indicator corresponds to a point in time of the historical time window when the object was in a normal or an abnormal state. Processes and systems use the normal and abnormal states of the system indicator in the historical time window to train a state classifier that is used to detect run-time abnormal behavior of the object. When the state classifier identifies abnormal behavior of the object, an alert is generated, indicating the abnormal behavior of the object.
Abstract:
The current document is directed to methods and systems that process, classify, efficiently store, and display large volumes of event messages generated in modern computing systems. In a disclosed implementation, received event messages are assigned to event-message clusters based on non-parameter tokens identified within the event messages. A parsing function is generated for each cluster that is used to extract data from incoming event messages and to prepare event records from event messages that store event information more efficiently and accessibly. The parsing functions also provide an alternative basis for assignment of event messages to clusters. Event types associated with the clusters are used for gathering information from various information sources with which to automatically annotate event messages displayed to system administrators, maintenance personnel, and other users of event messages.
Abstract:
Automated methods and systems to determine a baseline event-type distribution of an event source and use the baseline event-type distribution to detect changes in the behavior of the event source are described. In one implementation, blocks of event messages generated by the event source are collected and an event-type distribution is computed for each block of event messages. Candidate baseline event-type distributions are determined from the event-type distributions. The candidate baseline event-type distribution has the largest entropy of the event-type distributions. A normal discrepancy radius of the event-type distributions is computed from the baseline event-type distribution and the event-type distributions. A block of run-time event messages generated by the event source is collected. A run-time event-type distribution is computed from the block of run-time event messages. When the run-time event-type distribution is outside the normal discrepancy radius, an alert is generated indicating abnormal behavior of the event source.
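The baseline selection and radius check can be illustrated with a small Python sketch. The abstract does not state how discrepancy between distributions is measured or how the radius is derived, so this example assumes total variation distance as the discrepancy and takes the radius to be the largest discrepancy between the baseline and any historical distribution. The helper names and sample blocks are hypothetical.

```python
import math
from collections import Counter

def distribution(block, event_types):
    """Relative frequency of each event type within one block of messages."""
    counts = Counter(block)
    return [counts[et] / len(block) for et in event_types]

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def discrepancy(p, q):
    # Assumption: total variation distance stands in for the unspecified
    # discrepancy measure.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

event_types = ["A", "B", "C"]
# Each block is represented by the event types of its messages.
blocks = [
    ["A", "A", "B", "B", "C", "C"],
    ["A", "A", "A", "B", "B", "C"],
    ["A", "A", "A", "A", "B", "C"],
]
dists = [distribution(b, event_types) for b in blocks]

# Baseline: the historical distribution with the largest entropy.
baseline = max(dists, key=entropy)

# Assumption: the normal discrepancy radius is the largest discrepancy
# observed between the baseline and the historical distributions.
radius = max(discrepancy(baseline, d) for d in dists)

runtime = distribution(["C", "C", "C", "C", "C", "A"], event_types)
alert = discrepancy(baseline, runtime) > radius
print(f"radius={radius:.3f} alert={alert}")
```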
Abstract:
Computational methods and systems for detecting and troubleshooting anomalous behavior in distributed applications executing in a distributed computing system are described herein. Methods and systems discover nodes comprising the application. Anomaly detection monitors the metrics associated with the nodes for anomalous behavior in order to identify an approximate point in time when anomalous behavior begins to adversely impact performance of the application. Anomaly detection also monitors log messages associated with the nodes to detect anomalous behavior recorded in the log messages. When anomalous behavior is detected in the metrics and/or the log messages, an alert identifying the anomalous behavior is generated. Troubleshooting guides an administrator and/or application owner to investigate the root cause of the anomalous behavior. Appropriate remedial measures may be determined based on the root cause and automatically or manually executed to correct the problem.
Abstract:
This disclosure is directed to tagging tokens or sequences of tokens in log messages generated by a logging source. Event types of log messages in a block of log messages are collected. A series of tagging operations is applied to each log message in the block. For each tagging operation, event types that are qualified to receive the corresponding tag are identified. When a log message is received, the event type is determined and compared with the event types of the block in order to identify a matching event type. The series of tagging operations is applied to the log message to generate a tagged log message, with the restriction that each tagging operation only applies a tag to a token or sequence of tokens when the event type is qualified to receive the tag. The tagged log message is stored in a data-storage device.
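The qualification step and the restricted tagging pass might look like the following Python sketch. Everything here is illustrative. The two regex-based tagging operations in `TAGGERS` and the angle-bracket tag syntax are made up for the example. The `event_type` function, which masks any token containing a digit, is a simple stand-in for whatever event-type determination the disclosure uses.

```python
import re

# Hypothetical tagging operations: tag name -> pattern for the tokens to tag.
TAGGERS = {
    "ip": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "duration": re.compile(r"\b\d+ms\b"),
}

def event_type(message):
    # Assumption: the event type is the message with every token that
    # contains a digit replaced by "*".
    return " ".join(
        "*" if any(c.isdigit() for c in tok) else tok for tok in message.split()
    )

def qualify(block):
    """For each tagging operation, collect the event types it matched in the block."""
    qualified = {tag: set() for tag in TAGGERS}
    for msg in block:
        et = event_type(msg)
        for tag, pattern in TAGGERS.items():
            if pattern.search(msg):
                qualified[tag].add(et)
    return qualified

def tag_message(message, qualified):
    """Apply only those tagging operations that are qualified for the event type."""
    et = event_type(message)
    tagged = message
    for tag, pattern in TAGGERS.items():
        if et in qualified[tag]:
            tagged = pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", tagged)
    return tagged

block = ["login from 10.0.0.5 took 35ms", "disk check took 120ms"]
qualified = qualify(block)

tagged_login = tag_message("login from 192.168.1.9 took 48ms", qualified)
tagged_disk = tag_message("disk check took 7ms", qualified)
print(tagged_login)
print(tagged_disk)
```

In this sketch the `ip` operation is never run against "disk check" messages, because that event type did not qualify for the tag in the training block.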
Abstract:
The current document is directed to methods and systems for detecting the occurrences of abnormal events and operational behaviors within a distributed computer system. The currently described methods and systems continuously collect metric data from various metric-data sources, generate a sequence of metric-data observations, each metric-data observation comprising a set of temporally aligned metric data, and employ principal-component analysis to transform the metric-data observations to facilitate reduction of the dimensionality of the metric-data observations. The currently described methods and systems then employ clustering methods to identify outlying transformed metric-data observations, accordingly label the transformed metric-data observations to generate a training dataset, and then apply one or more of various types of machine-learning techniques to the training dataset in order to generate an abnormal-observation detector that can be used to detect, in real time, abnormal metric-data observations as they are generated within the distributed computer system.
Abstract:
Methods and systems described herein are directed to identifying anomalously behaving components of a distributed computing system. Methods and systems collect log messages generated by a set of event sources running in the distributed computing system within an observation time window. Frequencies of various types of event messages generated within the observation time window are determined for each of the event sources. A similarity value is calculated for each pair of event sources. The similarity values are used to identify clusters of similar event sources of the distributed computing system for various management purposes. Components of the distributed computing system that host event-source outliers may be identified as potentially having problems or as indicating future problems.
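A minimal Python sketch of the pairwise-similarity step follows. The abstract does not name the similarity measure or the outlier criterion, so cosine similarity and a mean-similarity threshold of 0.5 are assumptions chosen for illustration. The host names and frequency counts are made-up sample data.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two event-type frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical counts of three event types per event source
# within one observation time window.
freqs = {
    "host-1": [50, 30, 2],
    "host-2": [48, 33, 1],
    "host-3": [52, 29, 3],
    "host-4": [5, 4, 60],
}

# One similarity value for each pair of event sources.
similarity = {
    frozenset(pair): cosine(freqs[pair[0]], freqs[pair[1]])
    for pair in combinations(freqs, 2)
}

def mean_similarity(source):
    """Average similarity of one source to every other source."""
    others = [s for s in freqs if s != source]
    return sum(similarity[frozenset((source, o))] for o in others) / len(others)

# Assumption: a source is an outlier when its mean similarity to the
# other sources falls below 0.5.
outliers = [s for s in freqs if mean_similarity(s) < 0.5]
print(outliers)
```

The same similarity table could feed a standard clustering routine to group the remaining sources, which this sketch leaves out.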