-
1.
Publication No.: US20230206080A1
Publication Date: 2023-06-29
Application No.: US18118339
Filing Date: 2023-03-07
Inventor: Shuohuan WANG , Weibao GONG , Zhihua WU , Yu SUN , Siyu DING , Yaqian HAN , Yanbin ZHAO , Yuang LIU , Dianhai YU
Abstract: A model training system includes at least one first cluster and a second cluster communicating with the at least one first cluster. The at least one first cluster is configured to acquire a sample data set, generate training data from the sample data set, and send the training data to the second cluster; the second cluster is configured to train a pre-trained model on the training data sent by the at least one first cluster.
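The producer/consumer split described in this abstract (first clusters generate training data, a second cluster consumes it to train the model) can be illustrated with a minimal sketch. The sketch below uses Python multiprocessing queues to stand in for the inter-cluster link; the process layout, toy linear model, and all names are illustrative assumptions, not the patent's implementation.

```python
import multiprocessing as mp
import random

def first_cluster(link: mp.Queue, num_batches: int) -> None:
    """Acquire a sample data set, generate training data, send it onward."""
    for _ in range(num_batches):
        batch = []
        for _ in range(8):                      # "generate training data"
            x = random.uniform(-1.0, 1.0)
            batch.append((x, 2.0 * x + 0.1))    # toy relation y = 2x + 0.1
        link.put(batch)
    link.put(None)                              # tell the trainer we are done

def second_cluster(link: mp.Queue, num_producers: int) -> None:
    """Train a (toy) pre-trained model on data received from first clusters."""
    w, b, lr = 1.5, 0.0, 0.05                   # pretend pre-trained parameters
    finished = 0
    while finished < num_producers:
        batch = link.get()
        if batch is None:
            finished += 1
            continue
        # One SGD step on mean squared error for the toy linear model.
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w, b = w - lr * gw, b - lr * gb
    print(f"trained parameters: w={w:.3f}, b={b:.3f}")

if __name__ == "__main__":
    link = mp.Queue()                           # stands in for the network link
    producers = [mp.Process(target=first_cluster, args=(link, 50))
                 for _ in range(2)]             # two "first clusters"
    for p in producers:
        p.start()
    second_cluster(link, num_producers=len(producers))
    for p in producers:
        p.join()
```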
-
2.
Publication No.: US20250036920A1
Publication Date: 2025-01-30
Application No.: US18026140
Filing Date: 2022-09-20
Inventor: Liang SHEN , Haifeng WANG , Huachao WU , Weibao GONG , Zhihua WU , Dianhai YU
IPC: G06N3/045 , G06N3/0495
Abstract: The present disclosure provides a mixture-of-experts (MoE) model implementation method and system, an electronic device, and a storage medium, and relates to fields of artificial intelligence (AI) such as deep learning and distributed storage. The method includes: constructing a communication group, the communication group including a tensor-parallelism communication group of at least two computing devices, with tensor-parallelism segmentation applied to the sparse parameters of each computing device in the same tensor-parallelism communication group; and training an MoE model based on the communication group. The solutions of the present disclosure ensure that model training proceeds normally.
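The core idea here, slicing each expert's (sparse) parameters across the devices of one tensor-parallelism communication group, can be sketched in-process. The snippet below simulates the group with NumPy and a column-parallel split; the shapes, group layout, and all-gather-as-concatenate shortcut are assumptions for illustration, not the patent's actual distributed implementation.

```python
import numpy as np

HIDDEN, FFN, NUM_EXPERTS, TP_DEGREE = 8, 16, 4, 2
rng = np.random.default_rng(0)

# Full expert weights (what one device would hold without segmentation).
full_experts = [rng.standard_normal((HIDDEN, FFN)) for _ in range(NUM_EXPERTS)]

# Tensor-parallelism communication group: each rank keeps a column slice of
# every expert, so per-device expert memory shrinks by a factor of TP_DEGREE.
tp_group = [
    {e: np.split(w, TP_DEGREE, axis=1)[rank] for e, w in enumerate(full_experts)}
    for rank in range(TP_DEGREE)
]

def moe_forward(x: np.ndarray, expert_id: int) -> np.ndarray:
    """Run one token through one expert across the tensor-parallel group."""
    # Each rank computes a partial activation with its column shard ...
    partials = [x @ tp_group[rank][expert_id] for rank in range(TP_DEGREE)]
    # ... and an all-gather (here: concatenation) restores the full output.
    return np.concatenate(partials, axis=-1)

token = rng.standard_normal((1, HIDDEN))
out_sharded = moe_forward(token, expert_id=1)
out_dense = token @ full_experts[1]
assert np.allclose(out_sharded, out_dense)
print("tensor-parallel expert output matches the dense computation")
```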
-
3.
Publication No.: US20220374713A1
Publication Date: 2022-11-24
Application No.: US17880070
Filing Date: 2022-08-03
Inventor: Zhihua WU , Dianhai YU , Yulong AO , Weibao GONG
IPC: G06N3/08
Abstract: The present disclosure provides a method and apparatus for performing distributed training on a deep learning model. The method may include: generating a distributed computation view based on data information of a to-be-trained deep learning model; generating a cluster resource view based on property information of a cluster hardware resource corresponding to the to-be-trained deep learning model; determining a target segmentation strategy of a distributed training task based on the distributed computation view and the cluster resource view; and performing distributed training on the to-be-trained deep learning model based on the target segmentation strategy.
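The selection step in this abstract (combining a computation view of the model with a resource view of the cluster to pick a segmentation strategy) can be illustrated with a toy search. The cost model, strategy space, and all names below are assumptions for illustration only, not the method claimed in the patent.

```python
from dataclasses import dataclass

@dataclass
class ComputationView:          # distilled from the to-be-trained model
    num_layers: int
    params_gb: float            # total parameter size in GB

@dataclass
class ClusterResourceView:      # distilled from the cluster hardware
    num_devices: int
    mem_per_device_gb: float
    interconnect_gbps: float

def candidate_strategies(cluster: ClusterResourceView):
    """Enumerate (data-parallel degree, model-parallel degree) splits."""
    n = cluster.num_devices
    return [(dp, n // dp) for dp in range(1, n + 1) if n % dp == 0]

def cost(model: ComputationView, cluster: ClusterResourceView,
         dp: int, mp: int) -> float:
    per_device = model.params_gb / mp          # model shard per device
    if per_device > cluster.mem_per_device_gb:
        return float("inf")                    # infeasible: out of memory
    # Toy cost: data-parallel gradient sync + model-parallel activation traffic.
    comm = model.params_gb * (dp - 1) / cluster.interconnect_gbps
    comm += model.num_layers * (mp - 1) * 0.01
    return comm

def target_strategy(model: ComputationView, cluster: ClusterResourceView):
    """Pick the cheapest feasible strategy from the two views."""
    return min(candidate_strategies(cluster),
               key=lambda s: cost(model, cluster, *s))

model = ComputationView(num_layers=48, params_gb=20.0)
cluster = ClusterResourceView(num_devices=8, mem_per_device_gb=16.0,
                              interconnect_gbps=100.0)
dp, mp = target_strategy(model, cluster)
print(f"target segmentation strategy: data-parallel={dp}, model-parallel={mp}")
```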
-