Abstract:
One embodiment of the present invention provides a system that trains a pattern-recognition model for electronic prognostication for a computer system. First, the system monitors a performance parameter from a set of computer systems that includes at least two computer systems, wherein monitoring the performance parameter includes systematically monitoring and recording performance parameters in a set of performance parameters from computer systems in the set of computer systems, wherein the recording process keeps track of the temporal relationships between events in different performance parameters in the set of performance parameters. Next, the system generates a training data set based on the monitored performance parameter from the set of computer systems, wherein generating the training data set includes concatenating two or more time-series of the performance parameter from computer systems in the set of computer systems. Then, the system trains the pattern-recognition model using the training data set. Next, the system uses the pattern-recognition model to look for anomalies in performance parameters gathered during operation of a monitored computer system. The system then generates an alarm when the pattern-recognition model detects an anomaly in the performance parameters from the monitored computer system.
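The flow described above (concatenate per-system time series, train a model, scan a monitored system, raise an alarm) can be sketched as follows. This is a minimal illustration only: the abstract does not name the pattern-recognition technique, so a simple mean and standard-deviation model stands in for it, and all function names, the threshold, and the sample temperatures are hypothetical.

```python
from statistics import mean, stdev

def build_training_set(series_per_system):
    """Concatenate per-system time series of one performance parameter."""
    training = []
    for series in series_per_system:
        training.extend(series)
    return training

def train_model(training):
    """Stand-in 'model' (an assumption): mean and standard deviation."""
    return {"mean": mean(training), "std": stdev(training)}

def find_anomalies(model, observations, z_threshold=3.0):
    """Return indices of observations more than z_threshold deviations away."""
    return [i for i, x in enumerate(observations)
            if abs(x - model["mean"]) > z_threshold * model["std"]]

# Hypothetical temperature readings (deg C) from two computer systems.
system_a = [50.1, 50.4, 49.8, 50.0, 50.3]
system_b = [49.9, 50.2, 50.1, 49.7, 50.5]
model = train_model(build_training_set([system_a, system_b]))

# Readings gathered later from the monitored system.
alarms = find_anomalies(model, [50.2, 49.9, 58.0])
if alarms:
    print("ALARM at sample indices:", alarms)
```

A real pattern-recognition model would also exploit the recorded temporal relationships between parameters, which this single-parameter sketch ignores.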
Abstract:
One embodiment provides a technique for analyzing a target electromagnetic signal radiating from a monitored system. During the technique, the monitored system is positioned at a first locus of an ellipsoidal surface to amplify the target electromagnetic signal received at a second locus of the ellipsoidal surface. Next, the amplified target electromagnetic signal is monitored using an antenna positioned at the second locus of the ellipsoidal surface. Finally, the integrity of the monitored system is assessed by analyzing the amplified target electromagnetic signal monitored by the antenna.
Abstract:
Some embodiments provide a system that analyzes telemetry data from a monitored system. During operation, the system obtains the telemetry data as a set of telemetric signals from the monitored system and groups the telemetry data into one or more clusters of correlated telemetric signals. Next, the system increases a bandwidth associated with monitoring the telemetric signals. To increase the bandwidth, the system omits one or more of the correlated telemetric signals from each of the clusters during sampling of the telemetric signals and estimates the omitted correlated telemetric signals by applying a nonlinear, nonparametric regression technique to the sampled telemetric signals.
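To illustrate the estimation step, the sketch below reconstructs an omitted signal from a sampled, correlated one. The abstract does not say which nonlinear, nonparametric regression technique is used; Nadaraya-Watson kernel regression is chosen here purely as one example of that family. The history data, bandwidth, and function name are hypothetical.

```python
import math

def kernel_estimate(x_query, x_train, y_train, bandwidth=0.5):
    """Nadaraya-Watson estimate of y at x_query using a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2)
               for x in x_train]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from all training points")
    return sum(w * y for w, y in zip(weights, y_train)) / total

# Hypothetical history recorded while both signals of a cluster were sampled.
sampled_hist = [0.0, 1.0, 2.0, 3.0, 4.0]
omitted_hist = [0.0, 1.0, 4.0, 9.0, 16.0]  # nonlinearly related signal

# Later only the first signal is sampled; the second one is estimated.
estimate = kernel_estimate(2.0, sampled_hist, omitted_hist)
print(f"estimated omitted signal: {estimate:.2f}")
```

Because the omitted signal is no longer polled, the sampling slots it used can be given to the remaining signals, which is where the bandwidth gain comes from.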
Abstract:
Some embodiments of the present invention provide a system that controls a temperature variation in a computer system. First, a performance parameter of the computer system is monitored. Next, a future temperature of the computer system is predicted based on the performance parameter. Then, a pitch of one or more blades in a cooling device in the computer system is adjusted based on the future temperature to control the temperature variation in the computer system.
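The monitor, predict, adjust loop can be sketched as below. The abstract specifies neither the predictor nor the pitch control law, so this sketch assumes the monitored performance parameter is itself a temperature reading, uses linear extrapolation as a stand-in predictor, and maps predicted excess temperature to pitch with a clamped proportional rule. All names, gains, and limits are hypothetical.

```python
def predict_temperature(samples, steps_ahead=1):
    """Linearly extrapolate from the last two samples (stand-in predictor)."""
    slope = samples[-1] - samples[-2]
    return samples[-1] + slope * steps_ahead

def blade_pitch(predicted_temp, target_temp=60.0, gain=2.0,
                min_pitch=10.0, max_pitch=45.0):
    """Map predicted excess temperature to a blade pitch angle in degrees."""
    pitch = min_pitch + gain * (predicted_temp - target_temp)
    return max(min_pitch, min(max_pitch, pitch))

# Hypothetical recent temperature samples (deg C).
recent = [58.0, 60.0, 62.0]
future = predict_temperature(recent, steps_ahead=2)
pitch = blade_pitch(future)
print(f"predicted {future} deg C, set blade pitch to {pitch} degrees")
```

Acting on the predicted rather than the current temperature lets the cooling device respond before the temperature swing occurs, which is the point of the approach.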
Abstract:
One embodiment provides a system that mitigates vibrations caused by cooling fans in a computer system. More specifically, the system includes a cooling fan mechanically coupled to the chassis of the computer system, wherein vibrations generated by the cooling fan are coupled to the chassis. The system also includes an actuation mechanism that creates a relative displacement between the cooling fan and the chassis when a control signal is applied to the actuation mechanism. The system additionally includes a detection mechanism which detects the relative displacement and generates a feedback signal which represents the relative displacement. The system further includes a control signal generation mechanism which converts the feedback signal into the control signal, which is subsequently applied to the actuation mechanism. When the control signal is applied to the actuation mechanism, the relative displacement between the cooling fan and the chassis vibrationally decouples the cooling fan from the chassis.
Abstract:
Some embodiments of the present invention provide a system that controls a device that characterizes the health of a computer system power supply. During operation, a signature for the power supply is generated based on measurements of a set of performance parameters for the power supply. Then, the health of the power supply is characterized based on a comparison between the signature for the power supply and signatures for one or more other power supplies.
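One way to picture the signature comparison is sketched below. The abstract does not define how a signature is built or compared, so this example assumes a signature is the per-parameter mean of the measurements and that health is judged by the distance from the median signature of peer power supplies. The parameter choice (output voltage, ripple) and all values are hypothetical.

```python
import math
from statistics import mean, median

def signature(measurements):
    """Signature (assumed form): per-parameter mean over measurement tuples."""
    return [mean(column) for column in zip(*measurements)]

def deviation_from_peers(sig, peer_sigs):
    """Euclidean distance from the per-parameter median of peer signatures."""
    reference = [median(column) for column in zip(*peer_sigs)]
    return math.dist(sig, reference)

# Hypothetical (output voltage in V, ripple in mV) measurements.
peers = [signature([(12.0, 20.0), (12.0, 20.0)]),
         signature([(12.0, 22.0), (12.0, 22.0)]),
         signature([(12.0, 21.0), (12.0, 21.0)])]
healthy = signature([(12.0, 20.0), (12.0, 22.0)])
degraded = signature([(11.0, 60.0), (11.0, 62.0)])

print("healthy deviation:", deviation_from_peers(healthy, peers))
print("degraded deviation:", deviation_from_peers(degraded, peers))
```

In practice each parameter would need to be normalized before computing a distance, since raw volts and millivolts are not directly comparable.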
Abstract:
Some embodiments of the present invention provide a system that monitors a connection in a computer system between a connector and a component coupled to the connector. During operation, a first motion parameter of the connector and a second motion parameter of the component are measured. Then, the connection is monitored by comparing information related to the first motion parameter and information related to the second motion parameter.
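The comparison can be illustrated with a small sketch. It assumes, as one possibility the abstract does not confirm, that the motion parameters are acceleration samples taken at the same instants and that a firmly seated component moves with its connector, so a large RMS difference suggests a degraded connection. The threshold and sample values are hypothetical.

```python
import math

def rms_difference(a, b):
    """Root-mean-square difference between two equally long sample lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def connection_ok(connector_accel, component_accel, threshold=0.1):
    """Treat the connection as sound if both parts move nearly together."""
    return rms_difference(connector_accel, component_accel) <= threshold

# Hypothetical acceleration samples (in g) taken at the same instants.
connector = [0.0, 1.0, 0.0, -1.0]
seated = [0.0, 0.98, 0.01, -1.02]
loose = [0.0, 0.4, 0.5, -0.3]

print("seated component ok:", connection_ok(connector, seated))
print("loose component ok:", connection_ok(connector, loose))
```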
Abstract:
Some embodiments of the present invention provide a system for scheduling spin-up operations for a set of hard disk drives (HDDs) in a computer system. During operation, the system determines an available power of the computer system. Next, one or more HDDs are selected from the set of HDDs to be spun up based on the available power and the power required to spin up each HDD. Then, spin-up operations are scheduled for the selected HDDs.
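The selection and scheduling steps can be sketched as follows. The abstract does not state the selection policy, so this example assumes a greedy rule (lowest spin-up power first) and assumes that the available power is constant and that each batch finishes spinning up before the next begins. Drive names and wattages are hypothetical.

```python
def select_hdds(available_power, spinup_power):
    """Greedily pick HDDs (lowest spin-up power first) within the budget."""
    selected, remaining = [], available_power
    for hdd, power in sorted(spinup_power.items(), key=lambda kv: kv[1]):
        if power <= remaining:
            selected.append(hdd)
            remaining -= power
    return selected

def schedule_spinups(available_power, spinup_power):
    """Group all HDDs into successive spin-up batches."""
    pending = dict(spinup_power)
    batches = []
    while pending:
        batch = select_hdds(available_power, pending)
        if not batch:
            raise ValueError("an HDD needs more power than is available")
        batches.append(batch)
        for hdd in batch:
            del pending[hdd]
    return batches

# Hypothetical spin-up power per drive, in watts, with a 50 W budget.
drives = {"hdd0": 25, "hdd1": 25, "hdd2": 30, "hdd3": 20}
batches = schedule_spinups(50, drives)
for i, batch in enumerate(batches, start=1):
    print(f"batch {i}: {batch}")
```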
Abstract:
Embodiments of a system that adjusts a checkpointing frequency in a distributed computing system that executes multiple jobs are described. During operation, the system receives signals associated with the operation of computing nodes in the distributed computing system. Then, the system determines risk metrics for the computing nodes using a pattern-recognition technique to identify anomalous signals in the received signals. Next, the system adjusts a checkpointing frequency of a given checkpoint for a given computing node based on a comparison of a risk metric associated with the given computing node and a threshold, thereby implementing holistic fault tolerance, in which prediction and prevention of potential faults occur across the distributed computing system.
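The threshold comparison can be sketched as below. The abstract only says the frequency is adjusted based on comparing a risk metric with a threshold; the specific policy here (halve the checkpoint interval for risky nodes, relax it by 25% otherwise, within fixed bounds) is an assumption, as are the function name and all numeric values.

```python
def adjust_checkpoint_interval(interval_s, risk, threshold=0.7,
                               min_interval_s=60, max_interval_s=3600):
    """Checkpoint more often on risky nodes; otherwise relax gradually."""
    if risk > threshold:
        new_interval = interval_s / 2      # higher checkpointing frequency
    else:
        new_interval = interval_s * 1.25   # lower checkpointing frequency
    return max(min_interval_s, min(max_interval_s, new_interval))

# Hypothetical risk metrics per computing node (0 = healthy, 1 = failing).
risks = {"node-a": 0.9, "node-b": 0.2}
intervals = {node: adjust_checkpoint_interval(600, risk)
             for node, risk in risks.items()}
print(intervals)
```

Shortening the interval only where risk is elevated keeps checkpoint overhead low on healthy nodes while limiting lost work on nodes that appear likely to fail.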
Abstract:
A system is described that facilitates estimating power consumption in a computer system by inferring the power consumption from instrumentation signals. During operation, the system monitors instrumentation signals within the computer system, wherein the instrumentation signals do not include corresponding current and voltage signals that can be used to directly compute power consumption. The system then estimates the power consumption for the computer system by inferring the power consumption from the instrumentation signals and from an inferential power model generated during a training phase.
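The two phases (training, then inference) can be illustrated with a small sketch. The abstract does not specify the form of the inferential power model, so an ordinary least-squares line over a single instrumentation signal stands in for it. It assumes that during training the true power is available from an external meter; the signal choice (CPU utilization) and all values are hypothetical.

```python
from statistics import mean

def train_power_model(signal, measured_power):
    """Least-squares fit of power = intercept + slope * signal."""
    mx, my = mean(signal), mean(measured_power)
    slope = (sum((x - mx) * (y - my) for x, y in zip(signal, measured_power))
             / sum((x - mx) ** 2 for x in signal))
    return my - slope * mx, slope

def estimate_power(model, signal_value):
    """Infer power from an instrumentation signal using the trained model."""
    intercept, slope = model
    return intercept + slope * signal_value

# Training phase: hypothetical CPU utilization vs. externally metered watts.
utilization = [0.0, 0.5, 1.0]
metered_watts = [100.0, 150.0, 200.0]
model = train_power_model(utilization, metered_watts)

# Operation: no current or voltage signals, only the utilization signal.
estimated = estimate_power(model, 0.25)
print(f"estimated power: {estimated:.1f} W")
```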