TRAINING LARGE DL MODELS VIA SERVERLESS ARCHITECTURE USING CLOUD STORAGE SERVICES-BASED COMMUNICATION CHANNEL

    Publication Number: US20230409967A1

    Publication Date: 2023-12-21

    Application Number: US18140219

    Application Date: 2023-04-27

    CPC classification number: G06N20/00

    Abstract: State-of-the-art methods require that the size of the DL model, or of its gradients, be less than the maximum data item size of the storage used as a communication channel for model training on a serverless platform. Embodiments of the present disclosure provide a method and system for training large DL models via a serverless architecture using such a communication channel even when the gradients are larger than the maximum size of one data item allowed by the channel. Gradients generated by each worker during the current training instance are chunked into segments and stored in the communication channel. Corresponding segments from each worker are aggregated by aggregators and stored back. Each of the aggregated segments is read by each worker to generate an aggregated model to be used during the successive training instance. Optimization techniques are used for reading from and writing to the channel, resulting in significant improvement in the performance and cost of training.
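    The sketch below illustrates the chunk-aggregate-reassemble pattern described in the abstract. The in-memory `storage` dict, the `MAX_ITEM_BYTES` limit, and all function names are illustrative assumptions rather than the patented implementation; a real deployment would use a cloud object store accessed from serverless worker and aggregator functions.

    ```python
    # Minimal sketch (assumptions only): workers chunk gradients into segments
    # that fit the channel's per-item size limit, aggregators average the
    # corresponding segments, and workers reassemble the aggregated result.
    import numpy as np

    MAX_ITEM_BYTES = 4 * 1024          # assumed per-item size limit of the channel
    storage = {}                       # stands in for the cloud storage channel

    def chunk(grad: np.ndarray, max_bytes: int):
        """Split a flat gradient vector into segments that fit one data item each."""
        elems_per_seg = max(1, max_bytes // grad.itemsize)
        return [grad[i:i + elems_per_seg] for i in range(0, grad.size, elems_per_seg)]

    def worker_write(worker_id: int, grad: np.ndarray):
        # Each worker chunks its gradients and writes every segment to the channel.
        for seg_id, seg in enumerate(chunk(grad, MAX_ITEM_BYTES)):
            storage[f"w{worker_id}/seg{seg_id}"] = seg

    def aggregate(seg_id: int, num_workers: int):
        # An aggregator averages the corresponding segment from every worker
        # and stores the result back in the channel.
        segs = [storage[f"w{w}/seg{seg_id}"] for w in range(num_workers)]
        storage[f"agg/seg{seg_id}"] = np.mean(segs, axis=0)

    def worker_read(num_segments: int) -> np.ndarray:
        # Each worker reads all aggregated segments and reassembles the full
        # aggregated gradient for the next training instance.
        return np.concatenate([storage[f"agg/seg{s}"] for s in range(num_segments)])

    if __name__ == "__main__":
        num_workers, grad_size = 4, 10_000
        grads = [np.random.randn(grad_size).astype(np.float32) for _ in range(num_workers)]
        for w, g in enumerate(grads):
            worker_write(w, g)
        num_segments = len(chunk(grads[0], MAX_ITEM_BYTES))
        for s in range(num_segments):
            aggregate(s, num_workers)
        assert np.allclose(worker_read(num_segments), np.mean(grads, axis=0), atol=1e-6)
    ```

    The key point of the pattern is that no single object written to the channel exceeds the storage service's per-item limit, so the channel can carry gradients of arbitrary total size.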
