1.
Publication No.: US20220036241A1
Publication Date: 2022-02-03
Application No.: US17501003
Filing Date: 2021-10-14
Inventor: Tianjian He , Dianhai Yu , Zhihua Wu , Daxiang Dong , Yanjun Ma
Abstract: The present disclosure discloses a method, an apparatus and a storage medium for training a deep learning framework, and relates to artificial intelligence fields such as deep learning and big data processing. The specific implementation solution is: acquiring at least one task node in a current task node cluster that meets a preset opening condition when a target task meets a training start condition; judging whether a number of nodes of the at least one task node is greater than or equal to a preset number; synchronously training the deep learning framework of the target task by the at least one task node according to sample data if the number of nodes is greater than or equal to the preset number; and acquiring a synchronously trained target deep learning framework when the target task meets a training completion condition.
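The abstract's gating logic (acquire the ready nodes, compare their count to a preset number, then train synchronously) can be pictured with a minimal Python sketch. All names here (TaskNode, acquire_ready_nodes, train_if_quorum, the ready flag) are hypothetical stand-ins; the patent does not specify an implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TaskNode:
    node_id: str
    ready: bool  # stands in for the "preset opening condition" (assumption)

def acquire_ready_nodes(cluster: List[TaskNode]) -> List[TaskNode]:
    # Step 1: acquire the task nodes in the current cluster that meet the opening condition.
    return [n for n in cluster if n.ready]

def train_if_quorum(cluster: List[TaskNode], preset_number: int,
                    sample_data: List[float]) -> Optional[str]:
    # Step 2: judge whether the number of ready nodes is >= the preset number.
    nodes = acquire_ready_nodes(cluster)
    if len(nodes) < preset_number:
        return None  # quorum not met; training does not start
    # Step 3: each ready node takes a synchronous training step per batch
    # (a placeholder loop; a real system would launch distributed workers here).
    for batch in sample_data:
        for node in nodes:
            _ = batch * 0.1  # placeholder for one synchronous update on this node
    # Step 4: here the "training completion condition" is simply exhausting sample_data.
    return "target-deep-learning-framework"

if __name__ == "__main__":
    cluster = [TaskNode("n0", True), TaskNode("n1", True), TaskNode("n2", False)]
    print(train_if_quorum(cluster, preset_number=2, sample_data=[1.0, 2.0]))
```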
-
2.
Publication No.: US20230206024A1
Publication Date: 2023-06-29
Application No.: US17891617
Filing Date: 2022-08-19
Inventor: Ji Liu , Zhihua Wu , Danlei Feng , Chendi Zhou , Minxu Zhang , Xinxuan Wu , Xuefeng Yao , Dejing Dou , Dianhai Yu , Yanjun Ma
CPC classification number: G06N3/04 , G06F11/3409
Abstract: A resource allocation method, including: determining a neural network model to be allocated resources, and determining a set of devices capable of providing resources for the neural network model; determining, based on the set of devices and the neural network model, a first set of evaluation points including a first number of evaluation points, each of which corresponds to one resource allocation scheme and a resource use cost corresponding to the resource allocation scheme; updating and iterating the first set of evaluation points to obtain a second set of evaluation points including a second number of evaluation points, each of which corresponds to one resource allocation scheme and a resource use cost corresponding to the resource allocation scheme, the second number being greater than the first number; and selecting a resource allocation scheme with a minimum resource use cost from the second set of evaluation points as a resource allocation scheme for allocating resources to the neural network model.
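The expand-then-select search the abstract describes (a small first set of evaluated allocation schemes grown into a larger second set, then taking the cheapest) resembles a generic perturb-and-evaluate search. The sketch below is one such interpretation; the cost function, the layer-count encoding of a scheme, and all identifiers are assumptions, not the patent's method.

```python
import random
from typing import Dict, List, Tuple

Scheme = Dict[str, int]  # device -> number of model layers assigned (assumed encoding)

def use_cost(scheme: Scheme, capacity: Dict[str, float]) -> float:
    # Toy resource use cost: total mismatch between assigned load and device capacity.
    return sum(abs(load - capacity[dev]) for dev, load in scheme.items())

def random_scheme(devices: List[str], total_layers: int) -> Scheme:
    # Split total_layers into contiguous chunks at random cut points.
    cuts = sorted(random.randint(0, total_layers) for _ in range(len(devices) - 1))
    bounds = [0] + cuts + [total_layers]
    return {dev: bounds[i + 1] - bounds[i] for i, dev in enumerate(devices)}

def allocate(devices: List[str], capacity: Dict[str, float], total_layers: int,
             first_number: int = 4, second_number: int = 12) -> Tuple[Scheme, float]:
    # First set: first_number evaluation points, each pairing a scheme with its cost.
    points: List[Tuple[Scheme, float]] = []
    for _ in range(first_number):
        s = random_scheme(devices, total_layers)
        points.append((s, use_cost(s, capacity)))
    # Update and iterate: perturb the cheapest scheme until second_number points exist.
    while len(points) < second_number:
        base = dict(min(points, key=lambda p: p[1])[0])
        src, dst = random.sample(devices, 2)
        if base[src] > 0:
            base[src] -= 1  # move one layer between two devices
            base[dst] += 1
        points.append((base, use_cost(base, capacity)))
    # Select the scheme with the minimum resource use cost from the second set.
    return min(points, key=lambda p: p[1])

if __name__ == "__main__":
    devices = ["gpu0", "gpu1", "gpu2"]
    capacity = {"gpu0": 4.0, "gpu1": 4.0, "gpu2": 2.0}
    best, cost = allocate(devices, capacity, total_layers=10)
    print(best, cost)
```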
-
3.
Publication No.: US20230169351A1
Publication Date: 2023-06-01
Application No.: US18060705
Filing Date: 2022-12-01
Inventor: Haifeng Wang , Zhihua Wu , Dianhai Yu , Yanjun Ma , Tian Wu
IPC: G06N3/098
CPC classification number: G06N3/098
Abstract: A distributed training method based on end-to-end adaptation, a device and a storage medium. The method includes: obtaining slicing results by slicing a model to be trained; obtaining an attribute of computing resources allocated to the model for training by parsing the computing resources, in which the computing resources are determined based on a computing resource requirement of the model, computing resources occupied by another model being trained, and idle computing resources, and the attribute of the computing resources is configured to represent at least one of a topology relation and a task processing capability of the computing resources; determining a distribution strategy of each of the slicing results in the computing resources based on the attribute of the computing resources; and performing distributed training on the model using the computing resources based on the distribution strategy.
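As one way to picture the slice-then-place flow, here is a short Python sketch: it slices a layer list, reads a per-device attribute (a host name for the topology relation, a score for processing capability), and maps the largest slices to the most capable devices. Everything here (DeviceAttr, slice_model, distribution_strategy, the placement heuristic) is a hypothetical illustration, not the method claimed.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DeviceAttr:
    host: str          # topology relation: which host the device sits on (assumption)
    capability: float  # relative task processing capability (assumption)

def slice_model(layers: List[str], num_slices: int) -> List[List[str]]:
    # Slice the model into contiguous, near-equal groups of layers.
    k, r = divmod(len(layers), num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        end = start + k + (1 if i < r else 0)
        slices.append(layers[start:end])
        start = end
    return slices

def distribution_strategy(slices: List[List[str]],
                          attrs: Dict[str, DeviceAttr]) -> Dict[str, List[str]]:
    # Heuristic: most capable devices get the largest slices; host name breaks
    # ties so equally capable, co-located devices end up with adjacent slices.
    ranked = sorted(attrs, key=lambda d: (-attrs[d].capability, attrs[d].host))
    ordered = sorted(slices, key=len, reverse=True)
    return dict(zip(ranked, ordered))

if __name__ == "__main__":
    layers = [f"layer{i}" for i in range(7)]
    attrs = {"gpu0": DeviceAttr("hostA", 2.0),
             "gpu1": DeviceAttr("hostA", 1.0),
             "gpu2": DeviceAttr("hostB", 1.5)}
    for dev, sl in distribution_strategy(slice_model(layers, 3), attrs).items():
        print(dev, "->", sl)  # distributed training would then follow this strategy
```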