Abstract:
A temperature-aware task scheduling method, system, and computer program product for a GPU include receiving a request to execute a task, collecting task information including an intensiveness factor of computation by an arithmetic logic unit (ALU) and a memory usage of dynamic random-access memory (DRAM) for the task, obtaining a temperature of the ALU and a temperature of the DRAM, and accepting the task to the GPU based on the intensiveness factor, the ALU temperature, and the DRAM temperature.
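The admission step described in this abstract can be illustrated with a minimal sketch. The abstract does not state the decision rule, so the linear heating model, the temperature limits, the use of the DRAM footprint in the decision, and all names (`TaskInfo`, `accept_task`, `alu_heat_c`, `dram_heat_c_per_gb`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    alu_intensity: float  # hypothetical: share of time spent in ALU work, 0.0 to 1.0
    dram_usage_mb: float  # hypothetical: expected DRAM footprint in MB


def accept_task(task, alu_temp_c, dram_temp_c,
                alu_limit_c=85.0, dram_limit_c=80.0,
                alu_heat_c=10.0, dram_heat_c_per_gb=2.0):
    """Return True if the projected ALU and DRAM temperatures stay within limits.

    Assumed model: the task adds heat in proportion to its ALU intensity and
    to its DRAM footprint. Limits and coefficients are placeholder values.
    """
    projected_alu = alu_temp_c + task.alu_intensity * alu_heat_c
    projected_dram = dram_temp_c + (task.dram_usage_mb / 1024.0) * dram_heat_c_per_gb
    return projected_alu <= alu_limit_c and projected_dram <= dram_limit_c
```

Under these assumptions, a compute-heavy task is accepted while the ALU is cool and rejected once the ALU is already near its limit, even if the DRAM has headroom.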
Abstract:
A method and an apparatus of allocating available resources in a cluster system with learning models and tuning methods are provided. The learning model may be trained from historic performance data of previously executed jobs and used to project a suggested amount of resources for execution of a job. The tuning process may suggest a configuration for the projected amount of resources in the cluster system for an optimal operating point. An optimization may be performed with respect to a set of objective functions to improve resource utilization and system performance while suggesting the configuration. Through many executions and job characterization, the learning/tuning process for suggesting the configuration for the projected amount of resources may be improved by understanding correlations of historic data and the objective functions.
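The projection step in this abstract (a model trained on historic job data suggests an amount of resources) could look roughly like the sketch below. The abstract names neither the model nor the features, so the one-variable least-squares fit, the `(input_size_gb, cores_used)` history format, and the function names are assumptions. The tuning and multi-objective optimization steps are not covered.

```python
import math


def fit_line(history):
    """Least-squares fit of cores_used = slope * input_size_gb + intercept.

    history: list of (input_size_gb, cores_used) pairs from earlier jobs
    (a hypothetical record format).
    """
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in history)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in history)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def suggest_cores(model, input_size_gb, available_cores):
    """Project a core count for a new job, clamped to what the cluster has free."""
    slope, intercept = model
    projected = math.ceil(slope * input_size_gb + intercept)
    return max(1, min(projected, available_cores))
```

For example, `suggest_cores(fit_line([(10, 4), (20, 8), (40, 16)]), 26, 64)` rounds the projected 10.4 cores up to 11.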
Abstract:
A profiling tool identifies a code region with a false sharing potential. A static analysis tool classifies variables and arrays in the identified code region. A mapping detection library correlates memory access instructions in the identified code region with the variables and arrays in that region while a processor is running it. The mapping detection library identifies one or more instructions at risk in the identified code region, which are then subject to analysis by a false sharing detection library. The false sharing detection library performs a run-time analysis of the one or more instructions at risk while the processor is re-running the identified code region. The false sharing detection library determines, based on the performed run-time analysis, whether two different portions of the cache memory line are accessed by the generated binary code.
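The final check in this abstract, whether two different portions of one cache line are accessed, can be sketched over a recorded access trace. The 64-byte line size, the `(thread_id, address, is_write)` trace format, and the extra conditions that the accesses come from different threads and include a write are assumptions; the abstract does not specify them.

```python
from collections import defaultdict
from itertools import combinations


def find_false_sharing(accesses, line_bytes=64):
    """Return sorted cache-line indices that show a false sharing pattern.

    accesses: iterable of (thread_id, address, is_write) tuples
    (a hypothetical trace format). A line is flagged when two different
    threads touch different addresses inside it and at least one of the
    two accesses is a write.
    """
    by_line = defaultdict(set)
    for thread_id, addr, is_write in accesses:
        by_line[addr // line_bytes].add((thread_id, addr, is_write))

    flagged = []
    for line, entries in by_line.items():
        for (t1, a1, w1), (t2, a2, w2) in combinations(entries, 2):
            if t1 != t2 and a1 != a2 and (w1 or w2):
                flagged.append(line)
                break
    return sorted(flagged)
```

Two threads writing to the same address are not flagged, because that is true sharing rather than false sharing.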
Abstract:
A temperature-aware task scheduling method, system, and computer program product include learning a condition for accepting a task to a GPU based on a prior execution of the task on the GPU under a varying thermal characteristic of the GPU.
Abstract:
Transfer learning in machine learning can include receiving a machine learning model. Target domain training data for reprogramming the machine learning model using transfer learning can be received. The target domain training data can be transformed by performing a transformation function on the target domain training data. Output labels of the machine learning model can be mapped to target labels associated with the target domain training data. The transformation function can be trained by optimizing a parameter of the transformation function. The machine learning model can be reprogrammed based on input data transformed by the transformation function and a mapping of the output labels to target labels.
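The reprogramming flow in this abstract (frozen model, trainable input transformation, output-label mapping) can be shown with a toy sketch. The threshold "model", the additive transformation `x + theta`, the label names, and the grid search used as the optimizer are all stand-ins chosen for illustration; the abstract does not prescribe any of them.

```python
def source_model(x):
    """Frozen pretrained model (hypothetical): maps a number to source label 0, 1, or 2."""
    if x < 0:
        return 0
    return 1 if x < 10 else 2


# Mapping from the model's output labels to target-domain labels (assumed).
LABEL_MAP = {0: "cold", 1: "hot", 2: "hot"}


def transform(x, theta):
    """Trainable input transformation: a single additive offset."""
    return x + theta


def reprogrammed_predict(x, theta):
    """Run transformed input through the unchanged model, then map the label."""
    return LABEL_MAP[source_model(transform(x, theta))]


def train_theta(data, candidates):
    """Pick the offset with the fewest errors on target data (grid search)."""
    def errors(theta):
        return sum(reprogrammed_predict(x, theta) != y for x, y in data)
    return min(candidates, key=errors)
```

Only `theta` is optimized here; `source_model` is never modified, which is the point of reprogramming as opposed to fine-tuning.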
Abstract:
Embodiments are provided for crash recoverability for graphics processing units (GPUs) by a processor. GPU application data and kernel execution state of one or more GPUs may be checkpointed. The checkpointed GPU application data and the kernel execution state may be recovered. The checkpointed GPU application data and the kernel execution state may be persisted on non-volatile memory.
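The checkpoint, persist, and recover cycle in this abstract can be sketched on the host side. This is only an analogy: an ordinary file stands in for non-volatile memory, plain Python values stand in for GPU application data and kernel execution state, and the function names and JSON layout are assumptions. No GPU API is involved.

```python
import json
import os
import tempfile


def checkpoint(path, app_data, kernel_state):
    """Persist application data and kernel state so a crash cannot leave a torn file."""
    payload = {"app_data": app_data, "kernel_state": kernel_state}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to the storage device
    os.replace(tmp_path, path)  # atomic swap: old or new checkpoint, never a mix


def recover(path):
    """Return (app_data, kernel_state) from the last checkpoint, or None if absent."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        payload = json.load(f)
    return payload["app_data"], payload["kernel_state"]
```

Writing to a temporary file and renaming it over the old checkpoint means a crash in the middle of `checkpoint` leaves the previous checkpoint intact.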