Abstract:
One embodiment of the present invention provides a system that enhances reliability, availability and serviceability in a computer system by replacing a signal from a failed sensor with an estimated signal derived from correlations with other instrumentation signals in the computer system. During operation, the system determines whether a sensor has failed in the computer system while the computer system is operating. If so, the system uses an estimated signal for the failed sensor in place of the actual signal from the failed sensor during subsequent operation of the computer system, wherein the estimated signal is derived from correlations with other instrumentation signals in the computer system. This allows the computer system to continue operating without the failed sensor.
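The substitution step described above can be illustrated with a minimal sketch. The abstract does not say how the estimate is derived, so this sketch assumes the simplest possible approach: a one-variable least-squares fit against whichever other signal correlates most strongly with the failed sensor. The function names (`pearson`, `fit_estimator`, `read_signal`) and the signal names in the usage note are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def fit_estimator(history, target):
    """Fit target ~ intercept + slope * best, using the most correlated other signal.

    `history` maps signal names to sequences recorded while all sensors were healthy.
    Assumption: a single linear predictor is adequate; a real system would likely
    combine many signals.
    """
    ys = history[target]
    others = [name for name in history if name != target]
    best = max(others, key=lambda name: abs(pearson(history[name], ys)))
    xs = history[best]
    mx, my = mean(xs), mean(ys)
    var_x = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var_x if var_x else 0.0
    return best, slope, my - slope * mx


def read_signal(name, readings, failed, estimators):
    """Return the live reading, or the estimate if the sensor is marked as failed."""
    if name in failed:
        source, slope, intercept = estimators[name]
        return intercept + slope * readings[source]
    return readings[name]
```

For example, after fitting `fit_estimator(history, "cpu_temp")` on healthy data, a monitoring loop could call `read_signal("cpu_temp", readings, failed={"cpu_temp"}, estimators=...)` and keep operating on the estimated value.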
Abstract:
One embodiment of the present invention provides a system that optimizes support vector machine (SVM) kernel parameters. During operation, the system assigns sets of kernel parameter values to each node in a multiprocessor system. Next, the system performs a cross-validation operation at each node in the multiprocessor system based on a data set. This cross-validation operation computes an error cost value reflecting the number of misclassifications that arise while classifying the data set using the assigned set of kernel parameter values. The system then communicates the computed error cost values between nodes in the multiprocessor system, and eliminates nodes with relatively high error cost values. Next, the system performs a cross-over operation in which kernel parameter values are exchanged between remaining nodes to produce new sets of kernel parameter values. This process is repeated until a global winning set of kernel parameter values emerges.
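The assign, evaluate, eliminate, and cross-over loop described above can be sketched on a single machine, with each "node" modeled as one entry in a list. This is an illustration under stated assumptions, not the patented method: the `error_cost` callable stands in for the cross-validation operation, the loop runs for a fixed number of generations instead of detecting convergence, and the function name `evolve_kernel_params` is hypothetical.

```python
import random


def evolve_kernel_params(error_cost, population, generations=20, seed=0):
    """Evolve kernel parameter sets by elimination and cross-over.

    error_cost: callable mapping a parameter dict to a misclassification cost
                (a stand-in for cross-validation on a real data set).
    population: list of at least two parameter dicts, one per simulated node.
    """
    if len(population) < 2:
        raise ValueError("need at least two parameter sets")
    rng = random.Random(seed)
    nodes = [dict(p) for p in population]
    for _ in range(generations):
        # Rank nodes by error cost and eliminate the worse half.
        ranked = sorted(nodes, key=error_cost)
        survivors = ranked[: max(2, len(ranked) // 2)]
        # Cross-over: each child takes every parameter from one of two survivors.
        children = []
        while len(survivors) + len(children) < len(nodes):
            a, b = rng.sample(survivors, 2)
            children.append({key: rng.choice((a[key], b[key])) for key in a})
        nodes = survivors + children
    return min(nodes, key=error_cost)
```

Because the survivors are carried over unchanged into each new generation, the best cost never gets worse from one generation to the next. In a real multiprocessor setting, each node would compute its own `error_cost` in parallel and exchange only the cost values and the surviving parameters.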
Abstract:
Some embodiments provide a system that analyzes telemetry data from a computer system. During operation, the system obtains the telemetry data as a set of telemetric signals from the computer system and validates the telemetric signals using a nonlinear, nonparametric regression technique. Next, the system assesses the integrity of a power supply unit (PSU) in the computer system by comparing the telemetric signals to one or more reference telemetric signals associated with the computer system. If the assessed integrity falls below a threshold, the system performs a remedial action for the computer system.
Abstract:
One embodiment of the present invention provides a system that estimates the relative humidity inside a computer system. During operation, a set of performance parameters of the computer system and an external relative humidity outside of the computer system are monitored. Then, the relative humidity inside the computer system is estimated based on the set of performance parameters, the external relative humidity, and a relative humidity model, wherein training of the relative humidity model includes measuring an external training relative humidity outside of the computer system and a training relative humidity inside the computer system while monitoring the set of performance parameters of the computer system.
Abstract:
A system for generating a power consumption model of at least one server includes one or more computers configured to obtain n time series telemetry signals indicative of operating parameters of the at least one server, obtain a time series power signal indicative of power consumed by the at least one server, and correlate each of the n time series telemetry signals with the time series power signal. The one or more computers are further configured to select a set of the n time series telemetry signals having an overall correlation with the time series power signal greater than a predetermined threshold, and generate a power consumption model of the at least one server based on at least the set of the n time series telemetry signals.
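The correlate, select, and model steps described above can be sketched as follows. The abstract does not define "overall correlation" or the form of the model, so this sketch makes two assumptions: each signal is selected when its absolute Pearson correlation with power exceeds the threshold, and the model is a one-variable least-squares fit on a composite index (the mean of the sign-aligned z-scores of the selected signals). All function names are hypothetical.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_signals(telemetry, power, threshold=0.8):
    """Return {name: correlation} for signals with |correlation| above the threshold."""
    scores = {name: pearson(series, power) for name, series in telemetry.items()}
    return {name: r for name, r in scores.items() if abs(r) > threshold}


def fit_power_model(telemetry, power, selected):
    """Fit power ~ intercept + slope * composite and return a predict(sample) function."""
    if not selected:
        raise ValueError("no telemetry signal exceeded the correlation threshold")
    stats = {n: (mean(telemetry[n]), pstdev(telemetry[n])) for n in selected}

    def composite(sample):
        # Mean of z-scores, flipped so every term rises with power.
        return mean(
            (1 if selected[n] > 0 else -1) * (sample[n] - stats[n][0]) / stats[n][1]
            for n in selected
        )

    xs = [composite({n: telemetry[n][t] for n in selected}) for t in range(len(power))]
    mx, my = mean(xs), mean(power)
    var_x = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, power)) / var_x
    intercept = my - slope * mx
    return lambda sample: intercept + slope * composite(sample)
```

A caller would first run `select_signals` on the recorded telemetry, then pass its result to `fit_power_model` to obtain a function that predicts power from a single sample of the selected signals. A multivariate regression would be a natural alternative to the composite index used here.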
Abstract:
One embodiment provides a system that analyzes telemetry data from a computer system. During operation, the system periodically obtains the telemetry data from the computer system. Next, the system preprocesses the telemetry data using a sequential-analysis technique. If a statistical deviation is found in the telemetry data using the sequential-analysis technique, the system identifies a subset of the telemetry data associated with the statistical deviation and applies a root-cause-analysis technique to the subset of the telemetry data to determine a source of the statistical deviation. Finally, the system uses the source of the statistical deviation to perform a remedial action for the computer system, which involves correcting a fault in the computer system corresponding to the source of the statistical deviation.
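The abstract does not name a specific sequential-analysis technique. One common choice is Wald's sequential probability ratio test (SPRT), sketched below for detecting a shift in the mean of a Gaussian signal with known standard deviation. The function name, the hypothesis labels, and the default error rates are assumptions made for illustration; the root-cause-analysis and remedial steps are not covered.

```python
import math


def sprt_mean_shift(samples, mean0, mean1, sigma, alpha=0.01, beta=0.01):
    """Wald's SPRT for H0: mean = mean0 versus H1: mean = mean1 (Gaussian, known sigma).

    alpha is the tolerated false-alarm rate, beta the tolerated missed-alarm rate.
    Returns (decision, index), where decision is "degraded", "nominal", or
    "undecided", and index is the position of the sample that triggered the
    decision (None if undecided).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this value
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this value
    llr = 0.0
    for i, x in enumerate(samples):
        # Log-likelihood ratio increment for one Gaussian observation.
        llr += (mean1 - mean0) * (x - (mean0 + mean1) / 2) / sigma ** 2
        if llr >= upper:
            return "degraded", i
        if llr <= lower:
            return "nominal", i
    return "undecided", None
```

In a monitoring loop, a "degraded" decision would mark the point at which the surrounding subset of telemetry is handed to the root-cause-analysis step, while a "nominal" decision would typically reset the test for the next batch of samples.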
Abstract:
A system that determines whether components are not present in a computer system is presented. During operation, the system receives telemetry signals from sensors within the computer system. Next, the system dynamically generates a temperature map for the computer system based on the telemetry signals. The system then analyzes the temperature map to determine whether components are not present in the computer system.
Abstract:
One embodiment of the present invention provides a system that generates a synthetic workload to test power utilization in a computer system. During operation, the system monitors power utilization of a reference computer system while the reference computer system executes a workload-of-interest, wherein the monitoring process produces a power profile. Next, the system determines characteristics of the workload-of-interest from the power profile. Finally, the system uses the determined characteristics to construct the synthetic workload, wherein the synthetic workload has similar power utilization to the workload-of-interest.
Abstract:
One embodiment of the present invention provides a system that dynamically controls a temperature profile within a disk drive by generating disk drive activity. During operation, the system first receives a desired temperature profile. Next, the system generates a load profile based on the desired temperature profile, wherein the load profile specifies read/write operations on the disk drive. The system then applies the load profile to the disk drive to generate disk drive activity, wherein the disk activity causes the temperature in the disk drive to track the desired temperature profile.