Abstract:
The disclosed embodiments provide a system that detects anomalous events in a virtual machine. During operation, the system obtains time-series garbage-collection (GC) data collected during execution of a virtual machine in a computer system. Next, the system generates one or more seasonal features from the time-series GC data. The system then uses a sequential-analysis technique to analyze the time-series GC data and the one or more seasonal features for an anomaly in the GC activity of the virtual machine. Finally, the system stores an indication of a potential out-of-memory (OOM) event for the virtual machine based at least in part on identifying the anomaly in the GC activity of the virtual machine.
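The flow described above (seasonal features, then sequential analysis, then an OOM indication) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed method: the seasonal feature is taken to be a per-position mean over a known period, a one-sided CUSUM stands in for the unspecified sequential-analysis technique, and the names `seasonal_profile`, `cusum_first_alarm`, `slack`, and `threshold`, as well as the sample data, are hypothetical.

```python
def seasonal_profile(values, period):
    """Seasonal feature: mean of the series at each position within the period."""
    sums = [0.0] * period
    counts = [0] * period
    for i, v in enumerate(values):
        sums[i % period] += v
        counts[i % period] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def cusum_first_alarm(values, profile, slack, threshold):
    """Return the index of the first CUSUM alarm, or None if none fires.

    Assumes values[0] is aligned with position 0 of the seasonal profile.
    CUSUM is a stand-in here; the source does not name the technique.
    """
    s = 0.0
    for i, v in enumerate(values):
        deviation = v - profile[i % len(profile)]
        # Accumulate only upward deviations beyond the slack allowance.
        s = max(0.0, s + deviation - slack)
        if s > threshold:
            return i
    return None


# Hypothetical GC overhead (percent of time spent in GC) with a period of 2.
baseline = [10.0, 20.0] * 10
live = [10.0, 20.0, 10.0, 20.0, 15.0, 25.0, 15.0, 25.0, 15.0, 25.0]

profile = seasonal_profile(baseline, period=2)
alarm_index = cusum_first_alarm(live, profile, slack=1.0, threshold=10.0)
potential_oom = alarm_index is not None  # the indication a system might store
```

Removing the seasonal profile before accumulating deviations keeps routine periodic GC activity from tripping the detector; only a sustained upward drift does.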
Abstract:
The disclosed system produces synthetic signals for testing machine-learning systems. During operation, the system generates a set of N composite sinusoidal signals, wherein each of the N composite sinusoidal signals is a combination of multiple constituent sinusoidal signals with different periodicities. Next, the system adds time-varying random noise values to each of the N composite sinusoidal signals, wherein a standard deviation of the time-varying random noise values varies over successive time periods. The system also multiplies each of the N composite sinusoidal signals by time-varying amplitude values, wherein the time-varying amplitude values vary over successive time periods. Finally, the system adds time-varying mean values to each of the N composite sinusoidal signals, wherein the time-varying mean values vary over successive time periods. The time-varying random noise values, amplitude values and mean values can be selected through a roll-of-the-die process from a library of values, which are learned from industry-specific signals.
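As a rough sketch of the generation steps above, the following applies noise, amplitude, and mean in the order the abstract lists them, redrawing each parameter at the start of every time period. The function name `synth_signals`, the `LIBRARY` values, the default periods, and the fixed segment length are all assumptions made for illustration; in the described system the library would be learned from industry-specific signals.

```python
import math
import random

# Hypothetical library of parameter values (placeholders, not learned values).
LIBRARY = {
    "noise_std": [0.05, 0.1, 0.3],
    "amplitude": [0.5, 1.0, 2.0],
    "mean": [-1.0, 0.0, 4.0],
}


def synth_signals(n_signals, n_samples, segment_len, library,
                  periods=(24.0, 168.0), seed=0):
    """Generate n_signals composite sinusoids whose noise level, amplitude,
    and mean change every segment_len samples."""
    rng = random.Random(seed)
    signals = []
    for _ in range(n_signals):
        # Each constituent sinusoid gets its own random phase.
        phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in periods]
        signal = []
        for t in range(n_samples):
            if t % segment_len == 0:
                # "Roll of the die": draw new parameters for this time period.
                sigma = rng.choice(library["noise_std"])
                amp = rng.choice(library["amplitude"])
                mean = rng.choice(library["mean"])
            base = sum(math.sin(2.0 * math.pi * t / p + ph)
                       for p, ph in zip(periods, phases))
            # Order follows the abstract: add noise, scale, then shift.
            signal.append((base + rng.gauss(0.0, sigma)) * amp + mean)
        signals.append(signal)
    return signals


signals = synth_signals(n_signals=3, n_samples=500, segment_len=100,
                        library=LIBRARY)
```

Seeding the generator makes a test signal reproducible, which is useful when the same synthetic data must be replayed against several machine-learning systems.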
Abstract:
Systems, methods, and other embodiments associated with detecting feedback control instability in computer thermal controls are described herein. In one embodiment, a method includes, for a set of dwell time intervals associated with a range of periods of time from an initial period to a base period: executing, on a computer during the dwell time interval, a workload that varies from minimum to maximum over the period; recording telemetry data from the computer during execution of the workload; incrementing the period towards the base period; determining that either the base period or a thermal inertia threshold has been reached; and analyzing the recorded telemetry data over the set of dwell time intervals to either (i) detect the presence of a feedback control instability in thermal control for the computer or (ii) confirm feedback control stability in thermal control for the computer.
Abstract:
We describe a system that performs prognostic-surveillance operations based on an inferential model that dynamically adapts to evolving operational characteristics of a monitored asset. During a surveillance mode, the system receives a set of time-series signals gathered from sensors in the monitored asset. Next, the system uses an inferential model to generate estimated values for the set of time-series signals, and then performs a pairwise differencing operation between actual values and the estimated values for the set of time-series signals to produce residuals. Next, the system performs a sequential probability ratio test (SPRT) on the residuals to produce SPRT alarms. When a tripping frequency of the SPRT alarms exceeds a threshold value, which is indicative of an incipient anomaly in the monitored asset, the system triggers an alert. While the prognostic-surveillance system is operating in the surveillance mode, the system incrementally updates the inferential model based on the time-series signals.
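The residual and SPRT steps above can be illustrated with a small sketch. It assumes Gaussian residuals and a Wald SPRT that tests a zero mean against a hypothesized mean shift, and it defines tripping frequency as the fraction of SPRT decisions that are alarms; the source fixes none of these details. The inferential model and its incremental update are out of scope here, and the function names and parameters (`shift`, `sigma`, `alpha`, `beta`) are hypothetical.

```python
import math


def residuals(actual, estimated):
    """Pairwise difference between actual and estimated signal values."""
    return [a - e for a, e in zip(actual, estimated)]


def sprt_alarms(res, shift, sigma, alpha=0.01, beta=0.01):
    """Wald SPRT for a mean shift in Gaussian residuals.

    H0: mean 0, H1: mean `shift`, both with standard deviation `sigma`.
    Returns (alarm indices, number of decisions). The log-likelihood
    ratio is reset after every decision.
    """
    upper = math.log((1.0 - beta) / alpha)   # accept H1: raise an alarm
    lower = math.log(beta / (1.0 - alpha))   # accept H0: no alarm
    llr = 0.0
    alarms = []
    decisions = 0
    for i, r in enumerate(res):
        llr += (shift / sigma ** 2) * (r - shift / 2.0)
        if llr >= upper:
            alarms.append(i)
            decisions += 1
            llr = 0.0
        elif llr <= lower:
            decisions += 1
            llr = 0.0
    return alarms, decisions


def tripping_frequency(alarms, decisions):
    """Fraction of SPRT decisions that were alarms (an assumed definition)."""
    return len(alarms) / decisions if decisions else 0.0


# Hypothetical data: the model estimates 5.0, but the asset drifts to 6.0.
actual = [5.0] * 20 + [6.0] * 20
estimated = [5.0] * 40
alarms, decisions = sprt_alarms(residuals(actual, estimated), shift=1.0, sigma=1.0)
alert = tripping_frequency(alarms, decisions) > 0.25  # hypothetical threshold
```

Alerting on the tripping frequency rather than on individual alarms tolerates the occasional false alarm that the SPRT is designed to allow at rate `alpha`.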
Abstract:
During operation, the system uses N sensors to sample an electromagnetic interference (EMI) signal emitted by a target asset while the target asset is running a periodic workload, wherein each of the N sensors has a sensor sampling frequency f, and wherein the N sensors perform sampling operations in a round-robin ordering with phase offsets between successive samples. During the sampling operations, the system performs phase adjustments among the N sensors to maximize phase offsets between successive sensors in the round-robin ordering. Next, the system combines samples obtained through the N sensors to produce a target EMI signal having an EMI signal sampling frequency F=f×N. The system then generates a target EMI fingerprint from the target EMI signal. Finally, the system compares the target EMI fingerprint against a reference EMI fingerprint for the target asset to determine whether the target asset contains any unwanted electronic components.
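The relationship F = f×N can be made concrete with a short sketch of round-robin interleaving. It assumes the phase adjustment has already converged to evenly spaced offsets of 1/(N·f) between successive sensors, which is one natural reading of maximizing the phase offsets; the function names are hypothetical, and fingerprint generation is not shown.

```python
def sample_times(n_sensors, f, n_samples):
    """Sampling instants per sensor: sensor k is delayed by k / (n_sensors * f)."""
    return [[(i + k / n_sensors) / f for i in range(n_samples)]
            for k in range(n_sensors)]


def interleave(per_sensor):
    """Merge per-sensor sample lists in round-robin order (sensor 0, 1, ..., N-1)."""
    n_samples = min(len(s) for s in per_sensor)
    return [per_sensor[k][i]
            for i in range(n_samples)
            for k in range(len(per_sensor))]


# Four hypothetical sensors at f = 10 Hz give a combined rate of F = 40 Hz.
N, f = 4, 10.0
times = sample_times(N, f, n_samples=5)
merged_times = interleave(times)
combined_rate = 1.0 / (merged_times[1] - merged_times[0])  # approximately f * N
```

The same `interleave` call applies to the sampled EMI values themselves; it is run on the timestamps here only to show that the merged stream is uniformly spaced at 1/(N·f).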
Abstract:
Techniques for identifying a root cause of an operational result of a deterministic machine learning model are disclosed. A system applies a deterministic machine learning model to a set of data to generate an operational result, such as a prediction of a “fault” or “no-fault” in the system. The set of data includes signals from multiple different data sources, such as sensors. The system applies an abductive model, generated based on the deterministic machine learning model, to the operational result. The abductive model identifies a particular set of data sources that is associated with the root cause of the operational result. The system generates a human-understandable explanation for the operational result based on the identified root cause.
Abstract:
The disclosed embodiments provide a system that estimates a remaining useful life (RUL) for a fan. During operation, the system receives telemetry data associated with the fan during operation of a critical asset, wherein the telemetry data includes a fan-speed signal. Next, the system uses the telemetry data to construct a historical fan-speed profile, which indicates a cumulative time that the fan has operated in specific ranges of fan speeds. The system then computes an RUL for the fan based on the historical fan-speed profile and empirical time-to-failure (TTF) data, which indicates a TTF for the same type of fan as a function of fan speed. Finally, when the RUL falls below a threshold, the system generates a notification indicating that the fan needs to be replaced.
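One way to combine a fan-speed profile with TTF data is sketched below. The abstract does not say how the two are combined, so this sketch assumes linear damage accumulation: each hour spent in a speed range consumes 1/TTF of the fan's life for that range. The bin edges, TTF values, threshold, and function names are all hypothetical.

```python
from bisect import bisect_right


def speed_profile(rpm_samples, bin_edges, dt_hours):
    """Cumulative hours spent in each fan-speed range.

    bin_edges = [2000, 4000] defines three ranges: below 2000 RPM,
    2000 up to (not including) 4000 RPM, and 4000 RPM or above.
    """
    hours = [0.0] * (len(bin_edges) + 1)
    for rpm in rpm_samples:
        hours[bisect_right(bin_edges, rpm)] += dt_hours
    return hours


def remaining_useful_life(hours, ttf_hours, current_bin):
    """RUL in hours if the fan keeps running in speed range current_bin.

    Assumes linear damage accumulation; this rule is an illustrative
    assumption, not taken from the source.
    """
    consumed = sum(h / ttf for h, ttf in zip(hours, ttf_hours))
    return max(0.0, 1.0 - consumed) * ttf_hours[current_bin]


# Hypothetical empirical TTF (hours) for the three speed ranges.
TTF_HOURS = [100000.0, 50000.0, 20000.0]

profile = speed_profile([1000, 3000, 3000, 5000], [2000, 4000], dt_hours=1.0)
rul = remaining_useful_life(profile, TTF_HOURS, current_bin=1)
needs_replacement = rul < 1000.0  # hypothetical notification threshold
```

Keeping the profile as cumulative hours per speed range means the RUL can be recomputed cheaply whenever new telemetry arrives, without storing the raw fan-speed signal.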
Abstract:
Techniques for providing decision rationales for machine-learning guided processes are described herein. In some embodiments, the techniques described herein include processing a query for an explanation of an outcome of a set of one or more decisions guided by one or more machine-learning processes with supervision by at least one human operator. Responsive to receiving the query, a system determines, based on a set of one or more rationale data structures, whether the outcome was caused by human operator error or the one or more machine-learning processes. The system then generates a query response indicating whether the outcome was caused by the human operator error or the one or more machine-learning processes.
Abstract:
Systems, methods, and other embodiments associated with autonomous cloud-node scoping for big-data machine learning use cases are described. In some example embodiments, an automated scoping tool, method, and system are presented that, for each of multiple combinations of parameter values, (i) set a combination of parameter values describing a usage scenario, (ii) execute a machine learning application according to the combination of parameter values on a target cloud environment, and (iii) measure the computational cost for the execution of the machine learning application. A recommendation regarding configuration of central processing unit(s), graphics processing unit(s), and memory for the target cloud environment to execute the machine learning application is generated based on the measured computational costs.
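The set, execute, and measure loop above can be sketched as a sweep over a parameter grid. This is only an outline under assumptions: the parameter names (`n_signals`, `n_obs`), the `timed` wrapper, and the use of wall-clock time as the computational cost are hypothetical, and the step that maps measured costs to a CPU, GPU, and memory recommendation is not shown because the abstract does not describe it.

```python
import itertools
import time


def timed(app):
    """Wrap an application so that calling it returns elapsed wall-clock seconds."""
    def measure(**params):
        start = time.perf_counter()
        app(**params)
        return time.perf_counter() - start
    return measure


def scope(param_grid, measure):
    """Run `measure` for every combination of parameter values.

    Returns a list of (params, cost) pairs, one per combination.
    """
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))            # (i) set the combination
        results.append((params, measure(**params)))  # (ii) execute, (iii) measure
    return results


def toy_app(n_signals, n_obs):
    """Hypothetical stand-in for a machine learning application."""
    return sum(i * j for i in range(n_signals) for j in range(n_obs))


grid = {"n_signals": [10, 20], "n_obs": [100, 200]}
results = scope(grid, timed(toy_app))
worst_params, worst_cost = max(results, key=lambda pc: pc[1])
```

Passing the measurement function in as an argument keeps the sweep independent of how cost is defined, so the same loop could record memory use or a billing estimate instead of elapsed time.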