EFFICIENT RECOVERY FROM FAILURES DURING DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS

Invention Application

US20240428082A1 EFFICIENT RECOVERY FROM FAILURES DURING DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS (status: in force)

Patent Title: EFFICIENT RECOVERY FROM FAILURES DURING DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS
Application No.: US18491604

Application Date: 2023-10-20
Publication No.: US20240428082A1

Publication Date: 2024-12-26
Inventor: Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, Yida Wang
Applicant: Amazon Technologies, Inc.
Applicant Address: US WA Seattle
Assignee: Amazon Technologies, Inc.
Current Assignee: Amazon Technologies, Inc.
Current Assignee Address: US WA Seattle
Main IPC: G06N3/098
IPC: G06N3/098

Abstract:

A placement plan for training state checkpoints of a machine learning model is generated based at least in part on a number of training servers of a distributed training environment. The plan indicates, with respect to an individual server, one or more other servers at which replicas of training state checkpoints of the individual server are to be stored. During selected periods of one or more training iterations of the model, respective portions of a replica of a training state checkpoint of a first server are transmitted to a second server selected based on the placement plan. After an event causes disruption of the training iterations, one of the checkpoints generated at the first server is retrieved from the second server and used to resume the training iterations.
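
The flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the application: the ring-style placement rule, the function names (make_placement_plan, split_checkpoint, replicate, recover), the in-memory dictionaries standing in for remote server storage, and the fixed portion count. The abstract states only that the plan depends at least in part on the number of training servers and names, for each server, the other servers that store its checkpoint replicas.

```python
from typing import Dict, List, Set


def make_placement_plan(num_servers: int, num_replicas: int = 1) -> Dict[int, List[int]]:
    """Map each server index to the peers that hold replicas of its checkpoints.

    Assumed ring rule (hypothetical): server i replicates to the next
    `num_replicas` servers, wrapping around at the end.
    """
    if num_servers < 2:
        raise ValueError("need at least two servers to replicate")
    if not 1 <= num_replicas < num_servers:
        raise ValueError("num_replicas must be in [1, num_servers - 1]")
    return {
        s: [(s + k) % num_servers for k in range(1, num_replicas + 1)]
        for s in range(num_servers)
    }


def split_checkpoint(blob: bytes, num_portions: int) -> List[bytes]:
    """Split a serialized checkpoint into at most `num_portions` contiguous portions."""
    if num_portions < 1:
        raise ValueError("num_portions must be positive")
    size = max(1, -(-len(blob) // num_portions))  # ceiling division, at least 1 byte
    return [blob[i:i + size] for i in range(0, len(blob), size)]


def replicate(src: int, blob: bytes, plan: Dict[int, List[int]],
              stores: Dict[int, Dict[int, bytes]], num_portions: int = 4) -> None:
    """Send the checkpoint of `src` to each planned peer, one portion at a time.

    In a real system each portion would be sent during a selected period of
    a training iteration; here each loop step stands in for one such send.
    """
    for peer in plan[src]:
        received = bytearray()
        for portion in split_checkpoint(blob, num_portions):
            received.extend(portion)
        stores[peer][src] = bytes(received)


def recover(failed: int, plan: Dict[int, List[int]],
            stores: Dict[int, Dict[int, bytes]], alive: Set[int]) -> bytes:
    """Fetch the failed server's checkpoint from the first surviving replica holder."""
    for peer in plan[failed]:
        if peer in alive and failed in stores[peer]:
            return stores[peer][failed]
    raise LookupError(f"no surviving replica for server {failed}")


if __name__ == "__main__":
    plan = make_placement_plan(num_servers=4, num_replicas=2)
    stores: Dict[int, Dict[int, bytes]] = {s: {} for s in plan}
    replicate(0, b"weights-and-optimizer-state", plan, stores)
    # Server 0 and its first replica holder (server 1) both fail.
    restored = recover(0, plan, stores, alive={2, 3})
    print(plan[0], restored == b"weights-and-optimizer-state")
```

With two replicas per server in this sketch, the checkpoint of server 0 remains retrievable even when server 0 and one of its replica holders are lost together. A production design would likely also weigh failure domains and network topology when choosing peers, which this ring rule ignores.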
