Abstract:
One embodiment of the present invention provides a system for predicting the remaining useful life (RUL) of a component in a set of components within a computer system. The system first collects values of at least one degradation-related parameter associated with the operation of a monitored component within the computer system. The degradation-related parameter is a direct measurement of the degree of degradation of the monitored component. The system also collects values of at least one stress-based parameter from the computer system. The stress-based parameter measures the cumulative stress in the operating environment of the set of components, which can cause those components to degrade. The system then uses the values of the at least one degradation-related parameter and the values of the at least one stress-based parameter to predict an RUL for a component in the set of components.
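The abstract does not say how the two kinds of parameter are combined, so the following is only a minimal sketch under stated assumptions: the degradation-related parameter is assumed to grow roughly linearly toward a known failure threshold, and the stress-based parameter is assumed to act as a simple acceleration factor relative to a nominal stress rate. The names `predict_rul`, `failure_threshold`, and `nominal_stress_rate` are hypothetical and do not come from the source.

```python
from statistics import mean


def linear_slope(times, values):
    """Least-squares slope of values against times."""
    t_bar, v_bar = mean(times), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def predict_rul(times, degradation, stress_rates,
                failure_threshold, nominal_stress_rate):
    """Estimate remaining useful life in the same units as `times`.

    Assumption (not from the source): degradation is extrapolated
    linearly to `failure_threshold`, and the result is shortened or
    lengthened by the ratio of observed to nominal stress rate.
    """
    slope = linear_slope(times, degradation)
    if slope <= 0:
        # No measurable degradation trend: no finite estimate.
        return float("inf")
    remaining = max(failure_threshold - degradation[-1], 0.0)
    baseline_rul = remaining / slope
    # Guard against a zero or negative observed stress rate.
    acceleration = max(mean(stress_rates) / nominal_stress_rate, 1e-9)
    return baseline_rul / acceleration


rul = predict_rul(
    times=[0, 1, 2, 3],
    degradation=[0.0, 1.0, 2.0, 3.0],
    stress_rates=[2.0, 2.0],
    failure_threshold=10.0,
    nominal_stress_rate=1.0,
)
```

In this example the degradation trend alone would give 7 time units until the threshold is reached. Because the observed stress rate is twice the assumed nominal rate, the sketch halves that estimate. A real system would likely replace both the linear fit and the scaling rule with models fitted to the component in question.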
Abstract:
One embodiment of the present invention provides a system that enhances throughput and fault tolerance in a parallel-processing system. During operation, the system first receives a task. Next, the system partitions N computing nodes into M set-aside nodes and N-M primary computing nodes, wherein M ≥ 1. The system then processes the task in parallel across the N-M primary computing nodes. While doing so, the system proactively monitors the health of each of the N-M primary computing nodes. If the system detects that one of the N-M primary computing nodes is at risk of failure, the system copies the portion of the task associated with the at-risk node to a subset of the M set-aside nodes. The system then processes that portion of the task in parallel across the subset of the M set-aside nodes while the N-M primary computing nodes continue executing.
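The partition-and-migrate flow described above can be sketched as follows. This is an illustrative outline only, not the patented implementation: the function names, the round-robin work split, the numeric `risk_scores`, and the `risk_limit` cutoff are all assumptions introduced for the example, since the abstract does not specify how health is scored or how work is divided.

```python
def partition_nodes(nodes, m):
    """Split N nodes into N-M primary nodes and M set-aside nodes (M >= 1)."""
    if not 1 <= m < len(nodes):
        raise ValueError("need 1 <= M < N")
    return nodes[:-m], nodes[-m:]


def assign_work(work_items, primary):
    """Deal work items round-robin across the primary nodes."""
    return {node: work_items[i::len(primary)]
            for i, node in enumerate(primary)}


def migrate_at_risk(assignments, set_aside, risk_scores,
                    risk_limit, spares_per_node=1):
    """Copy each at-risk node's portion onto a subset of set-aside nodes.

    `assignments` is left unchanged, so the primary nodes (including
    the at-risk one) keep executing while the spares work in parallel.
    """
    free = list(set_aside)
    migrated = {}
    for node, items in assignments.items():
        if risk_scores.get(node, 0.0) < risk_limit or not free:
            continue
        subset, free = free[:spares_per_node], free[spares_per_node:]
        for i, spare in enumerate(subset):
            # Split the at-risk portion across the chosen spares.
            migrated[spare] = items[i::len(subset)]
    return migrated


primary, set_aside = partition_nodes(["n0", "n1", "n2", "n3", "n4"], m=2)
assignments = assign_work(list(range(6)), primary)
migrated = migrate_at_risk(assignments, set_aside,
                           risk_scores={"n1": 0.9}, risk_limit=0.5,
                           spares_per_node=2)
```

Here node `n1` exceeds the assumed risk limit, so its two work items are copied to the set-aside nodes `n3` and `n4`, one item each, while `n1` keeps its original assignment. Copying rather than moving the work means the result is still available from the spares if `n1` fails, and nothing is lost if it does not.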