SYSTEMS AND METHODS OF RESOURCE CONFIGURATION OPTIMIZATION FOR MACHINE LEARNING WORKLOADS
摘要:
Systems and methods are provided for optimally allocating resources used to perform multiple tasks/jobs, e.g., machine learning training jobs. The possible resource configurations or candidates that can be used to perform such jobs are generated. A first batch of training jobs can be randomly selected and run using one of the possible resource configuration candidates. Subsequent batches of training jobs may be performed using other resource configuration candidates that have been selected using an optimization process, e.g., Bayesian optimization. Upon reaching a stopping criterion, the resource configuration resulting in a desired optimization metric, e.g., fastest job completion time can be selected and used to execute the remaining training jobs.
信息查询
0/0