Publication No.: US20250029010A1
Publication Date: 2025-01-23
Application No.: US18895264
Filing Date: 2024-09-24
Inventor: Dianhai Yu, Gexiao Tian, Weibao Gong, Haifeng Wang, Yongsheng Xu, Jiabin Yang
IPC: G06N20/00
Abstract: A cluster-based training method includes: in response to a hardware fault in a training node, selecting a target standby node from a plurality of standby nodes, and obtaining a target training snapshot of a model training task in the training node, in which the target training snapshot includes training state data of the model training task; and initializing the target standby node based on a container image of a model training program in the training node and the training state data, so as to replace the training node with the target standby node and continue executing the model training task.
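
The failover flow described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `Node` and `Snapshot` classes, the `select_standby` and `fail_over` functions, the "first healthy standby" selection policy, and the "most recent snapshot by step" rule are all hypothetical names and assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    step: int    # training step at which the snapshot was taken
    state: dict  # training state data; the contents here are hypothetical


@dataclass
class Node:
    name: str
    healthy: bool = True
    image: Optional[str] = None          # container image of the training program
    snapshot: Optional[Snapshot] = None  # state the node was initialized from


def select_standby(standbys):
    """Pick the first healthy standby node (the policy is an assumption)."""
    for node in standbys:
        if node.healthy:
            return node
    raise RuntimeError("no healthy standby node available")


def fail_over(failed, standbys, snapshots):
    """Initialize a standby from the failed node's image and a snapshot."""
    target = select_standby(standbys)
    # Assumption: the target training snapshot is the most recent one by step.
    latest = max(snapshots, key=lambda s: s.step)
    target.image = failed.image
    target.snapshot = latest
    standbys.remove(target)  # the standby now acts as the training node
    return target


trainer = Node("train-0", healthy=False, image="trainer:v1")
pool = [Node("standby-0", healthy=False), Node("standby-1")]
history = [Snapshot(100, {"loss": 0.9}), Snapshot(200, {"loss": 0.7})]

replacement = fail_over(trainer, pool, history)
print(replacement.name, replacement.image, replacement.snapshot.step)
# prints: standby-1 trainer:v1 200
```

In this sketch the replacement node reuses the failed node's container image, so only the training state data has to be restored before the task continues.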