SYSTEMS AND METHODS OF RESOURCE CONFIGURATION OPTIMIZATION FOR MACHINE LEARNING WORKLOADS

    Publication No.: US20210357256A1

    Publication Date: 2021-11-18

    Application No.: US16874479

    Filing Date: 2020-05-14

    Abstract: Systems and methods are provided for optimally allocating resources used to perform multiple tasks/jobs, e.g., machine learning training jobs. The possible resource configurations, or candidates, that can be used to perform such jobs are generated. A first batch of training jobs can be randomly selected and run using one of the possible resource configuration candidates. Subsequent batches of training jobs may be performed using other resource configuration candidates that have been selected using an optimization process, e.g., Bayesian optimization. Upon reaching a stopping criterion, the resource configuration that yields the desired optimization metric (e.g., the fastest job completion time) can be selected and used to execute the remaining training jobs.
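    The search loop described in this abstract can be sketched in Python as follows. This is a minimal, hypothetical illustration, not the patented implementation. The configuration dimensions (workers, parameter servers, CPUs per node), the `measure` callable that stands in for running a batch of training jobs, the Gaussian-process surrogate with an RBF kernel and a lower-confidence-bound acquisition rule, and the stopping rule (a trial budget, or `patience` trials without improvement) are all assumptions. The abstract names Bayesian optimization only as one example of the optimizer.

```python
import itertools
import math
import random

# Hypothetical sketch: all names, the GP surrogate, and the stopping rule are
# illustrative assumptions, not the patented implementation.

def generate_candidates(workers, servers, cpus):
    """Enumerate every (num_workers, num_ps, cpus_per_node) combination."""
    return list(itertools.product(workers, servers, cpus))

def rbf(a, b, length=2.0):
    """Squared-exponential kernel on raw configuration tuples."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * length ** 2))

def cholesky(m):
    """Lower-triangular L with L @ L.T == m (m symmetric positive definite)."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(m[i][i] - s) if i == j else (m[i][j] - s) / L[j][j]
    return L

def forward(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward(L, y):
    """Solve L.T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x, noise=1e-3):
    """Posterior mean and std at x of a zero-mean GP fitted to standardized y."""
    mean = sum(y) / len(y)
    std = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y)) or 1.0
    z = [(v - mean) / std for v in y]
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = backward(L, forward(L, z))  # alpha = K^-1 z
    k = [rbf(a, x) for a in X]
    v = forward(L, k)
    mu = sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(1.0 - sum(vi * vi for vi in v), 1e-12)
    return mean + std * mu, std * math.sqrt(var)

def optimize(candidates, measure, max_trials=8, patience=3, kappa=2.0, seed=0):
    """Return (best_config, tried_configs, completion_times)."""
    rng = random.Random(seed)
    tried, times, stale = [], [], 0
    config = rng.choice(candidates)  # first trial: a randomly chosen candidate
    while True:
        t = measure(config)  # run a batch of jobs, record completion time
        stale = 0 if not times or t < min(times) else stale + 1
        tried.append(config)
        times.append(t)
        remaining = [c for c in candidates if c not in tried]
        if len(tried) >= max_trials or stale >= patience or not remaining:
            break  # stopping criterion reached
        # Next candidate minimizes the GP lower confidence bound mu - kappa*sigma.
        def lcb(c):
            mu, sd = gp_posterior(tried, times, c)
            return mu - kappa * sd
        config = min(remaining, key=lcb)
    best = tried[times.index(min(times))]
    return best, tried, times

def fake_measure(cfg):
    """Hypothetical cost model standing in for real job completion time."""
    workers, ps, cpus = cfg
    return 1000.0 / (workers * cpus) + 5.0 * workers / ps

candidates = generate_candidates([1, 2, 4, 8], [1, 2, 4], [2, 4, 8])
best, tried, times = optimize(candidates, fake_measure)
print("best config:", best, "time:", round(min(times), 2), "trials:", len(tried))
```

    The `fake_measure` cost model is invented. In practice each call would run real training jobs, which is why the loop tries to find a fast configuration in few trials before committing the remaining jobs to it.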

    SYSTEMS AND METHODS OF RESOURCE CONFIGURATION OPTIMIZATION FOR MACHINE LEARNING WORKLOADS

    Publication No.: US20220292303A1

    Publication Date: 2022-09-15

    Application No.: US17199294

    Filing Date: 2021-03-11

    Abstract: Systems and methods can be configured to determine a plurality of computing resource configurations used to perform machine learning model training jobs. A computing resource configuration can comprise: a first tuple including the numbers of worker nodes and parameter server nodes, and a second tuple including resource allocations for the worker nodes and parameter server nodes. At least one machine learning training job can be executed using a first computing resource configuration having a first set of values associated with the first tuple. During execution of the machine learning training job, resource usage of the worker nodes and parameter server nodes caused by a second set of values associated with the second tuple can be monitored, and whether to adjust the second set of values can be determined. Whether a stopping criterion is satisfied can be determined. One of the plurality of computing resource configurations can be selected.
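    The two-tuple configuration and the monitor-then-adjust step in this abstract can be sketched as below. This is a hedged illustration only. The abstract does not say which resources the second tuple holds or how the adjustment decision is made, so the CPU-only allocation, the utilization thresholds (`low`, `high`), and the double-or-halve rule in the hypothetical `adjust_allocation` function are assumptions. The first tuple (node counts) is held fixed while the second tuple is adjusted, matching the abstract's split between the two.

```python
from dataclasses import dataclass, replace

# Hypothetical sketch: field names, thresholds, and the doubling/halving rule
# are illustrative assumptions, not the patented method.

@dataclass(frozen=True)
class ResourceConfig:
    num_workers: int  # first tuple: node counts, fixed while the job runs
    num_ps: int
    worker_cpus: int  # second tuple: per-node allocations, may be adjusted
    ps_cpus: int

    @property
    def first_tuple(self):
        return (self.num_workers, self.num_ps)

    @property
    def second_tuple(self):
        return (self.worker_cpus, self.ps_cpus)

def step(cpus, util, low, high, min_cpus, max_cpus):
    """Double CPUs when utilization is high, halve them when it is low."""
    if util > high:
        return min(cpus * 2, max_cpus)
    if util < low:
        return max(cpus // 2, min_cpus)
    return cpus

def adjust_allocation(cfg, worker_util, ps_util, low=0.3, high=0.9,
                      min_cpus=1, max_cpus=16):
    """Return (new_config, changed) given monitored utilization in [0, 1]."""
    new = replace(
        cfg,
        worker_cpus=step(cfg.worker_cpus, worker_util, low, high, min_cpus, max_cpus),
        ps_cpus=step(cfg.ps_cpus, ps_util, low, high, min_cpus, max_cpus),
    )
    return new, new != cfg

cfg = ResourceConfig(num_workers=4, num_ps=2, worker_cpus=4, ps_cpus=4)
new_cfg, changed = adjust_allocation(cfg, worker_util=0.95, ps_util=0.2)
print(new_cfg.first_tuple, new_cfg.second_tuple, changed)  # (4, 2) (8, 2) True
```

    In a fuller sketch, this check would run periodically while the job executes, and an outer search loop would vary the first tuple across trials until a stopping criterion is met.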
