Abstract:
Embodiments of the present invention provide a system that characterizes the reliability of a computer system. The system first collects samples of a performance parameter from the computer system. Next, the system computes the length of a line between the samples, wherein the line includes a component which is proportionate to a difference between values of the samples and a component which is proportionate to a time interval between the samples. The system then adds the computed length to a cumulative length variable which can be used to characterize the reliability of the computer system.
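The cumulative length described above can be illustrated with a minimal sketch. It assumes the two components are combined as a Euclidean distance, which the abstract does not state. The function name `cumulative_length` and the scale factors `value_scale` and `time_scale` are hypothetical.

```python
import math

def cumulative_length(samples, value_scale=1.0, time_scale=1.0):
    """Sum the lengths of the line segments joining consecutive samples.

    samples: sequence of (timestamp, value) pairs in time order.
    value_scale, time_scale: assumed proportionality constants for the
    value-difference and time-interval components of each segment.
    """
    total = 0.0
    previous = None
    for timestamp, value in samples:
        if previous is not None:
            dv = value_scale * (value - previous[1])    # value component
            dt = time_scale * (timestamp - previous[0]) # time component
            # Assumption: the two components combine as a Euclidean length.
            total += math.hypot(dv, dt)
        previous = (timestamp, value)
    return total
```

Under this assumption a perfectly flat signal accumulates only the elapsed time, while a noisy or drifting signal accumulates additional length, so a faster-growing total may indicate more dynamic behavior in the monitored parameter.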
Abstract:
A computer system to schedule loads across a set of processor cores is described. During operation, the computer system receives a process to be executed. Next, the computer system obtains one or more thermodynamic process characteristics associated with the process and one or more thermodynamic processor-core characteristics associated with operation of the set of processor cores. Then, the computer system schedules the process to be executed by at least one of the processor cores based on the one or more thermodynamic process characteristics and the one or more thermodynamic processor-core characteristics.
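One way to picture this scheduling decision is sketched below. The abstract does not say which thermodynamic characteristics are used, so the sketch assumes a per-process power estimate and, for each core, a current temperature and a thermal resistance. The names `Core` and `pick_core` and the selection rule are illustrative assumptions, not the patented method.

```python
from dataclasses import dataclass

@dataclass
class Core:
    core_id: int
    temperature_c: float       # assumed core characteristic: current temperature
    thermal_resistance: float  # assumed core characteristic: deg C per watt

def pick_core(process_power_w, cores):
    """Pick the core with the lowest estimated temperature after placement.

    process_power_w: assumed process characteristic (expected power draw).
    The estimate is a simple steady-state model: T + R * P.
    """
    return min(
        cores,
        key=lambda c: c.temperature_c + c.thermal_resistance * process_power_w,
    )
```

With this rule a hot but well-cooled core can still win a power-hungry process, while a light process goes to whichever core is currently coolest.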
Abstract:
A computer system that schedules loads across a set of processor cores is described. During operation, the computer system receives thermal measurements from sensors associated with the set of processor cores, and removes noise from the thermal measurements. Then, the computer system analyzes thermal properties of the set of processor cores based on the thermal measurements. Next, the computer system receives a process to be executed, and schedules the process to be executed by at least one of the processor cores based on the analysis. This scheduling is performed in a manner that reduces spatial and temporal thermal variations in the integrated circuit.
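The measure, denoise, analyze, and schedule sequence can be sketched as follows. The abstract does not name the noise-removal technique, so a trailing median filter stands in for it here. The function names and the "coolest core wins" policy are assumptions made for illustration.

```python
from statistics import median

def denoise(readings, window=3):
    """Trailing median filter: each output is the median of the last
    `window` readings seen so far (a stand-in noise-removal step)."""
    return [
        median(readings[max(0, i - window + 1): i + 1])
        for i in range(len(readings))
    ]

def coolest_core(history_by_core, window=3):
    """Return the id of the core whose latest filtered temperature is lowest.

    history_by_core: dict mapping core id -> list of raw sensor readings.
    Sending new work to the coolest core is one simple way to even out
    temperature differences between cores.
    """
    latest = {
        core: denoise(history, window)[-1]
        for core, history in history_by_core.items()
    }
    return min(latest, key=latest.get)
```

Filtering before comparing matters: a single spurious low reading on a hot core would otherwise attract new work to exactly the wrong place.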
Abstract:
Some embodiments of the present invention provide a system that controls temperature variations in a computer system. During operation, a telemetry variable of the computer system is monitored. Next, a future temperature of the computer system is predicted based on the telemetry variable. A signal is then generated in response to the future temperature. Then, the signal is sent to a cooling device in the computer system to control temperature variations of the computer system.
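The monitor, predict, and signal loop can be sketched as below. The abstract does not specify the prediction model or the form of the control signal. This sketch assumes a least-squares linear extrapolation of recent temperature telemetry and a fan duty cycle between 0 and 1. All names and the 40 to 80 degree range are hypothetical.

```python
def predict_temperature(times, temps, horizon):
    """Extrapolate a least-squares line through (times, temps) to
    `horizon` time units past the last sample (assumed prediction model).
    Requires at least two samples with distinct timestamps."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(temps) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, temps))
    slope = sxy / sxx
    return mean_y + slope * (times[-1] + horizon - mean_t)

def fan_duty(predicted_c, low_c=40.0, high_c=80.0):
    """Map a predicted temperature to a duty cycle in [0, 1]
    (assumed control signal for the cooling device)."""
    fraction = (predicted_c - low_c) / (high_c - low_c)
    return min(1.0, max(0.0, fraction))
```

Acting on the predicted rather than the current temperature lets the cooling device ramp up before the temperature rise arrives, which is what smooths out the variation.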
Abstract:
Embodiments of a system that adjusts a checkpointing frequency in a distributed computing system that executes multiple jobs are described. During operation, the system receives signals associated with the operation of computing nodes in the distributed computing system. Then, the system determines risk metrics for the computing nodes using a pattern-recognition technique to identify anomalous signals in the received signals. Next, the system adjusts a checkpointing frequency of a given checkpoint for a given computing node based on a comparison of a risk metric associated with the given computing node and a threshold, thereby implementing holistic fault tolerance, in which prediction and prevention of potential faults occur across the distributed computing system.
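The risk-driven adjustment can be sketched as follows. The abstract does not name the pattern-recognition technique, so a simple z-score test against a baseline stands in for it. The function names, the threshold, and the halving policy are assumptions for illustration only.

```python
from statistics import mean, pstdev

def risk_metric(signal, baseline, z_limit=3.0):
    """Fraction of samples in `signal` that look anomalous relative to
    `baseline` (z-score test used as a stand-in anomaly detector)."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # Constant baseline: treat any deviation as anomalous.
        anomalous = sum(1 for x in signal if x != mu)
    else:
        anomalous = sum(1 for x in signal if abs(x - mu) / sigma > z_limit)
    return anomalous / len(signal)

def adjust_interval(interval_s, risk, threshold=0.2, factor=2.0,
                    min_interval_s=60.0):
    """Shorten the checkpoint interval (raise the frequency) when the
    node's risk metric exceeds the threshold; otherwise leave it alone."""
    if risk > threshold:
        return max(min_interval_s, interval_s / factor)
    return interval_s
```

Checkpointing more often only on nodes that look risky keeps the overhead low on healthy nodes while limiting the work lost if a suspect node fails.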
Abstract:
One embodiment of the present invention provides a system that efficiently conducts vibrational characterizations for a computer system having variable component configurations. During operation, the system receives a given component configuration associated with the computer system. Next, the system looks up the given component configuration in a resonant spectra library, which contains structural resonant frequencies for a number of possible component configurations for the computer system. If the given component configuration is found in the resonant spectra library, the system retrieves a set of structural resonant frequencies associated with the given component configuration. The system subsequently controls one or more vibration sources within the computer system to avoid the set of structural resonant frequencies.
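The lookup-then-avoid flow can be sketched as below. The library layout, the use of fans as the vibration source, and the guard band are all assumptions, and the helper names are hypothetical. A configuration is modeled as an unordered set of component names.

```python
def lookup_resonances(config, library):
    """Return the resonant frequencies (Hz) stored for `config`,
    or None if the configuration is not in the library.

    library: dict mapping frozenset of component names -> list of Hz values
    (an assumed representation of the resonant spectra library).
    """
    return library.get(frozenset(config))

def safe_fan_speeds(candidate_rpms, resonances_hz, guard_hz=5.0):
    """Keep only fan speeds whose rotation frequency (rpm / 60) stays
    more than `guard_hz` away from every resonant frequency."""
    return [
        rpm for rpm in candidate_rpms
        if all(abs(rpm / 60.0 - f) > guard_hz for f in resonances_hz)
    ]
```

For example, with a made-up library entry of 100 Hz for a two-component configuration, a 6000 rpm fan setting (100 Hz) would be excluded while 5400 rpm and 7200 rpm would remain available. When the lookup returns None, the configuration has not been characterized and would need to be measured first.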