Abstract:
Machine learning is used to analyze the respective execution times of a plurality of tasks in a job performed in a distributed computing system, which includes a plurality of computing devices, and to determine that a subset of those tasks are straggler tasks in the job. A supervised machine learning algorithm is then performed using a set of inputs that includes performance attributes of the plurality of tasks, namely attributes observed during performance of the job, with labels generated from the determination of the straggler tasks. Applying the supervised learning algorithm identifies a set of rules that define conditions, based on the performance attributes of the plurality of tasks, indicative of which tasks will be straggler tasks in a job. Rule data is generated to describe the set of rules.
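The two stages described in this abstract (labeling stragglers from execution times, then learning rules from performance attributes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how stragglers are determined or which supervised learner is used, so a median-based cutoff stands in for the first stage and a one-rule decision stump stands in for the second. The attribute names `io_wait_frac` and `input_mb` and all sample values are hypothetical.

```python
from statistics import median

def label_stragglers(durations, factor=1.5):
    """Label a task a straggler when its run time exceeds factor x the job median.

    The median cutoff is an illustrative stand-in, not the method of the abstract.
    """
    cutoff = factor * median(durations)
    return [d > cutoff for d in durations]

def learn_stump(rows, labels):
    """Fit a one-rule classifier of the form 'attribute > threshold => straggler'.

    Tries every observed value of every attribute as a threshold and keeps
    the rule that misclassifies the fewest training tasks.
    """
    best = None  # (errors, attribute, threshold)
    for attr in rows[0]:
        for threshold in sorted({r[attr] for r in rows}):
            errors = sum((r[attr] > threshold) != y for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, attr, threshold)
    errors, attr, threshold = best
    return {"attribute": attr, "op": ">", "threshold": threshold, "train_errors": errors}

# Hypothetical observations for six tasks in one job.
durations = [10.0, 11.0, 9.5, 10.5, 30.0, 28.0]
attributes = [
    {"io_wait_frac": 0.05, "input_mb": 120},
    {"io_wait_frac": 0.07, "input_mb": 130},
    {"io_wait_frac": 0.04, "input_mb": 125},
    {"io_wait_frac": 0.06, "input_mb": 128},
    {"io_wait_frac": 0.55, "input_mb": 122},
    {"io_wait_frac": 0.60, "input_mb": 131},
]

labels = label_stragglers(durations)      # stage 1: generate labels
rule = learn_stump(attributes, labels)    # stage 2: learn a rule from the labels
```

The returned dictionary plays the role of the "rule data": a serializable description of a condition on observed attributes. A practical system would likely use a richer learner, such as a decision tree, to obtain several rules.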
Abstract:
Examples are disclosed for determining or using server transaction latency information. In some examples, a network input/output device coupled to a server may be capable of time-stamping information related to ingress request and egress response packets for a transaction. For these examples, elements of the server may be capable of determining transaction latency values based on the time-stamped information. The determined transaction latency values may be used to monitor or manage operating characteristics of the server, including the amount of power provided to the server or the ability of the server to support one or more virtual servers. Other examples are described and claimed.
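As a rough illustration of deriving latency values from time-stamped packets, the sketch below pairs each ingress request with its egress response by a transaction identifier and subtracts the timestamps. The `Stamp` record, its field names, and the microsecond unit are assumptions for the example; the abstract does not specify how packets are matched to transactions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stamp:
    txn_id: int       # hypothetical identifier tying packets to one transaction
    direction: str    # "ingress" for a request, "egress" for a response
    time_us: int      # timestamp applied by the network I/O device (assumed microseconds)

def transaction_latencies(stamps):
    """Return {txn_id: latency_us} for transactions that have both timestamps."""
    ingress, latencies = {}, {}
    for s in sorted(stamps, key=lambda s: s.time_us):
        if s.direction == "ingress":
            ingress.setdefault(s.txn_id, s.time_us)  # keep the first request packet
        elif s.direction == "egress" and s.txn_id in ingress:
            # A later response packet overwrites an earlier one for the same transaction.
            latencies[s.txn_id] = s.time_us - ingress[s.txn_id]
    return latencies

stamps = [
    Stamp(1, "ingress", 1_000),
    Stamp(2, "ingress", 1_200),
    Stamp(1, "egress", 1_450),
    Stamp(2, "egress", 2_100),
    Stamp(3, "ingress", 2_500),  # no response yet, so it is skipped
]
latencies = transaction_latencies(stamps)
mean_latency_us = sum(latencies.values()) / len(latencies)
```

A management policy could then compare `mean_latency_us` against a target to decide, for instance, whether the server has headroom for a lower power limit or for another virtual server.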
Abstract:
A system can predict memory device failure by identifying correctable error patterns based on the memory architecture. The failure prediction can thus account for the circuit-level structure of the memory rather than merely the number or frequency of correctable errors. A failure prediction engine correlates the hardware configuration of the memory device with correctable errors (CEs) detected in data of the memory device to predict an uncorrectable error (UE) based on the correlation.
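One way to picture correlating correctable errors with the memory's hardware layout is to group CEs by the physical structure they fall in instead of counting them per module. The sketch below is only a toy heuristic under stated assumptions: the record fields (`rank`, `bank`, `row`, `column`), the "errors spread along one row" signal, and the `min_columns` threshold are all hypothetical and are not taken from the abstract.

```python
from collections import defaultdict

def risky_rows(ce_log, min_columns=3):
    """Flag (rank, bank, row) locations whose correctable errors span many columns.

    Assumption for illustration: errors spread along one physical row hint at
    a row-level fault, which a plain per-module error count cannot tell apart
    from the same number of scattered single-cell errors.
    """
    columns_by_row = defaultdict(set)
    for ce in ce_log:
        columns_by_row[(ce["rank"], ce["bank"], ce["row"])].add(ce["column"])
    return sorted(loc for loc, cols in columns_by_row.items() if len(cols) >= min_columns)

# Hypothetical CE log: six errors in total, four of them in the same row.
ce_log = [
    {"rank": 0, "bank": 2, "row": 17, "column": 1},
    {"rank": 0, "bank": 2, "row": 17, "column": 5},
    {"rank": 0, "bank": 2, "row": 17, "column": 9},
    {"rank": 0, "bank": 2, "row": 17, "column": 5},   # repeat at a known column
    {"rank": 1, "bank": 0, "row": 3,  "column": 8},
    {"rank": 0, "bank": 6, "row": 90, "column": 2},
]
flagged = risky_rows(ce_log)
```

Here the four errors in one row are flagged while the two isolated errors are not, even though a frequency-only policy would treat all six alike.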
Abstract:
Systems, apparatuses and methods may provide for technology that identifies a plurality of fully correctable patterns associated with an error correction code (ECC) in a memory controller, detects one or more correctable errors in a memory module coupled to the memory controller, and generates an alert if an error-bit pattern of the one or more correctable errors does not match one or more of the plurality of fully correctable patterns.
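The matching step can be sketched as a set-membership test on error-bit masks. The layout below is an assumption chosen to keep the example small: a 16-bit word spread over four 4-bit devices, with the ECC assumed to fully correct any error confined to a single device. Real ECC schemes and their correctable pattern sets differ and are not specified in the abstract.

```python
DEVICE_BITS = 4   # assumed bits contributed by each memory device
DEVICES = 4       # assumed devices per word (16-bit word in total)

def fully_correctable_patterns():
    """Enumerate error masks confined to one device (assumed fully correctable)."""
    patterns = set()
    for dev in range(DEVICES):
        for bits in range(1, 1 << DEVICE_BITS):        # every nonzero 4-bit pattern
            patterns.add(bits << (dev * DEVICE_BITS))  # shifted into that device's lane
    return patterns

def needs_alert(error_mask, patterns):
    """Return True when a corrected error's bit pattern is outside the safe set."""
    return error_mask not in patterns

patterns = fully_correctable_patterns()

single_device_error = 0b0000_0000_0110_0000   # two flipped bits, both in device 1
cross_device_error  = 0b0000_0000_0001_0001   # one flipped bit in each of devices 0 and 1

alerts = [needs_alert(m, patterns) for m in (single_device_error, cross_device_error)]
```

The second mask was corrected this time, but because it spans two devices it falls outside the assumed guaranteed-correctable set, so it is the one that raises an alert.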
Abstract:
Embodiments of the invention relate generally to the field of power management of computer systems, and more particularly to a method and apparatus for dynamically allocating power to servers in a server rack. The method comprises: measuring the power consumption of a computer system having one or more servers; estimating a probability distribution of power demand for each of the one or more servers, the estimation based on the measured power consumption; estimating performance loss via the estimated probability distribution; computing power capping limits for each of the one or more servers, the computation based on the estimated probability distribution and the performance loss; and dynamically allocating the power capping limits to each of the one or more servers by modifying the previous power capping limits of each of the one or more servers.
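The steps of the method can be sketched as follows, with several simplifying assumptions that are not from the abstract: the probability distribution is the empirical distribution of measured samples, performance loss is approximated by the probability that demand exceeds the cap, and caps are computed by a greedy allocation in fixed increments. Server names, wattages, and the `step_w` parameter are hypothetical.

```python
def exceed_probability(samples, cap):
    """Empirical P(demand > cap), used here as a stand-in for performance loss."""
    return sum(s > cap for s in samples) / len(samples)

def allocate_caps(history, rack_budget_w, step_w=10):
    """Greedily hand out rack power in step_w increments.

    Each server starts at its lowest observed demand; every further increment
    goes to the server whose current cap is most likely to be exceeded.
    """
    caps = {name: min(samples) for name, samples in history.items()}
    remaining = rack_budget_w - sum(caps.values())
    if remaining < 0:
        raise ValueError("rack budget is below the sum of minimum demands")
    while remaining >= step_w:
        loss = {n: exceed_probability(history[n], caps[n]) for n in caps}
        worst = max(loss, key=loss.get)
        if loss[worst] == 0:
            break  # every server's observed demand is already covered
        caps[worst] += step_w
        remaining -= step_w
    return caps

# Hypothetical measured power samples (watts) per server.
history = {
    "server-a": [200, 210, 220, 230],
    "server-b": [150, 150, 160, 300],
}
caps = allocate_caps(history, rack_budget_w=450)
```

Re-running `allocate_caps` as new measurements arrive would replace the previous limits, which loosely corresponds to the dynamic reallocation step. In this toy run the occasional 300 W spike on `server-b` cannot be fully covered within the budget, so that server keeps a nonzero estimated loss.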