-
Publication No.: US20250029010A1
Publication Date: 2025-01-23
Application No.: US18895264
Application Date: 2024-09-24
Inventor: Dianhai Yu, Gexiao Tian, Weibao Gong, Haifeng Wang, Yongsheng Xu, Jiabin Yang
IPC: G06N20/00
Abstract: A cluster-based training method includes: in response to a hardware fault in the training node, selecting a target standby node from the plurality of standby nodes, and obtaining a target training snapshot of the model training task in the training node, in which the target training snapshot includes training state data of the model training task; and initializing the target standby node based on a container image of a model training program in the training node and the training state data to replace the training node with the target standby node to continue executing the model training task.
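The failover flow in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Node` and `Snapshot` classes, the `fail_over` function, and the "first healthy standby" selection rule are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    """Hypothetical training snapshot: the step reached plus training state data."""
    step: int
    state: Dict[str, float]


@dataclass
class Node:
    """Hypothetical cluster node; `image` stands in for a container image name."""
    name: str
    healthy: bool = True
    image: Optional[str] = None
    state: Dict[str, float] = field(default_factory=dict)
    step: int = 0


def fail_over(failed: Node, standbys: List[Node], snapshot: Snapshot) -> Node:
    """Replace a failed training node with a standby node.

    Assumption: the target standby is simply the first healthy one in the pool.
    """
    candidates = [n for n in standbys if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy standby node available")
    target = candidates[0]
    standbys.remove(target)              # the node leaves the standby pool
    target.image = failed.image          # reuse the training program's container image
    target.state = dict(snapshot.state)  # restore training state from the snapshot
    target.step = snapshot.step          # resume from the snapshotted step
    return target
```

Calling `fail_over(failed_node, standby_pool, snapshot)` returns the initialized replacement node, which would then continue the training task from `snapshot.step`.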
-
2.
Publication No.: US11625248B2
Publication Date: 2023-04-11
Application No.: US17572140
Application Date: 2022-01-10
Inventor: Weihang Chen, Jiabin Yang, Hongyu Liu, Xiang Lan
Abstract: The present disclosure provides an operator registration method and apparatus for a deep learning framework, a device and a storage medium, relates to the field of computer technologies, and specifically to the field of artificial intelligence such as deep learning. The operator registration method for a deep learning framework includes: receiving registration information provided by a user for registering operators with the deep learning framework, the registration information including: a custom calculation function, the custom calculation function being written in a manner irrelevant to the deep learning framework; building operator meta-information in the deep learning framework based on the registration information; and constructing a to-be-registered operator within the deep learning framework based on the operator meta-information, and registering the to-be-registered operator in a global operator table within the deep learning framework. The present disclosure can simplify an operator registration process.
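The three steps named in this abstract (receive registration information, build operator meta-information, register the constructed operator in a global table) can be sketched roughly as follows. This is an illustrative toy, not the framework's actual API: `OpMeta`, `GLOBAL_OP_TABLE`, `register_op`, and `custom_relu` are hypothetical names, and the custom calculation function is plain Python with no framework dependency.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class OpMeta:
    """Hypothetical operator meta-information built from user registration info."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    kernel: Callable[[List[float]], List[float]]


# Hypothetical global operator table, keyed by operator name.
GLOBAL_OP_TABLE: Dict[str, OpMeta] = {}


def register_op(
    name: str,
    inputs: Sequence[str],
    outputs: Sequence[str],
    kernel: Callable[[List[float]], List[float]],
) -> OpMeta:
    """Build meta-information from the registration info and add it to the table."""
    if name in GLOBAL_OP_TABLE:
        raise ValueError(f"operator {name!r} is already registered")
    meta = OpMeta(name, tuple(inputs), tuple(outputs), kernel)
    GLOBAL_OP_TABLE[name] = meta
    return meta


def custom_relu(xs: List[float]) -> List[float]:
    """User-written calculation function that does not depend on any framework."""
    return [max(0.0, x) for x in xs]


register_op("custom_relu", ["X"], ["Out"], custom_relu)
```

After registration, the operator can be looked up by name, for example `GLOBAL_OP_TABLE["custom_relu"].kernel([-1.0, 2.0])`.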
-