-
Publication No.: US20220027738A1
Publication Date: 2022-01-27
Application No.: US17450055
Filing Date: 2021-10-05
Applicant: Huawei Technologies Co., Ltd.
Inventor: Hui Zang , Xiaolin Cheng
Abstract: A computer-implemented method for distributed synchronous training of a neural network model includes performing, by a worker machine of a plurality of worker machines, a forward computation of a training data set using a plurality of N layers of the neural network model. The forward computation starts at Layer 1 and proceeds through Layer N of the neural network model. The method further includes performing, by the worker machine, a backward computation of the training data set, the backward computation starting at Layer N and proceeding through Layer 1 of the neural network model. The method further includes synchronizing, by the worker machine, a plurality of gradients outputted by the neural network model during the backward computation. The synchronizing of the plurality of gradients is performed with other worker machines of the plurality of worker machines and in parallel with the backward computation.
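The key idea in the abstract is that gradient synchronization overlaps the backward pass: as each layer's gradient is produced (Layer N down to Layer 1), its all-reduce can begin without waiting for the full backward computation. A minimal single-process sketch of that ordering follows; it is illustrative only, not the patented implementation, and it stands in for the cross-machine all-reduce with a simple per-layer mean.

```python
def backward_with_overlapped_sync(worker_grads):
    """Simulate gradient synchronization overlapped with the backward pass.

    worker_grads: one list per worker, each holding per-layer gradients
    indexed 0..N-1 (index 0 = Layer 1). Returns the synchronized
    (averaged) gradient for every layer.
    """
    num_workers = len(worker_grads)
    num_layers = len(worker_grads[0])
    synced = [None] * num_layers
    # Backward computation proceeds Layer N -> Layer 1. In a real system
    # each layer's all-reduce would be launched asynchronously as soon as
    # that layer's gradient is ready, overlapping communication with the
    # computation of earlier layers; here the mean stands in for it.
    for layer in reversed(range(num_layers)):
        synced[layer] = sum(w[layer] for w in worker_grads) / num_workers
    return synced
```

For example, with two workers producing per-layer gradients [1.0, 2.0, 3.0] and [3.0, 4.0, 5.0], the synchronized result is [2.0, 3.0, 4.0].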
-
Publication No.: US20210200521A1
Publication Date: 2021-07-01
Application No.: US17182753
Filing Date: 2021-02-23
Applicant: Huawei Technologies Co., Ltd.
Inventor: Hui Zang , Huaqing Zhang , Xiaolin Cheng
Abstract: A system and method are provided for optimizing general matrix multiplication (GEMM) on target hardware by splitting the matrices to be multiplied into tiles and formulating a tiling configuration search problem that explores a configuration search space to identify an optimal tiling configuration, i.e., one that minimizes the running time for multiplying matrices A (m×k) and B (k×n) on the target hardware, with configuration states expressed as a function of the matrix parameters m, k, and n and the number of nested loops for each of the dimensions m, k, and n. The optimal tiling configuration for the target hardware is obtained by implementing a Greedy Best-First-Search (GBFS) algorithm or a Neighborhood Actor Advantage Critic (N-A2C) algorithm that optimizes the running time for multiplication of the matrices on the target hardware, and the target hardware is configured and computations are run accordingly.
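The GBFS variant can be pictured as a best-first walk over candidate tile sizes for the m, k, and n dimensions. The sketch below is an assumption-laden illustration: the cost function is a toy stand-in (the patent evaluates actual running time on the target hardware), and the halve/double neighborhood is a plausible choice, not one the abstract specifies.

```python
import heapq

def toy_cost(tile, m, k, n):
    """Hypothetical cost model standing in for measured running time.

    Penalizes tiles that divide the matrix dimensions unevenly and tiles
    whose combined working set strays from an assumed on-chip budget.
    """
    tm, tk, tn = tile
    waste = (m % tm) + (k % tk) + (n % tn)
    footprint = tm * tk + tk * tn + tm * tn
    return waste * 100 + abs(footprint - 4096)

def gbfs_tiling(m, k, n, start=(32, 32, 32)):
    """Greedy best-first search over tiling configurations.

    Always expands the lowest-cost configuration seen so far; neighbors
    halve or double one tile dimension, bounded to [1, 256]. Returns the
    best (cost, tile) pair found.
    """
    frontier = [(toy_cost(start, m, k, n), start)]
    seen = {start}
    best = frontier[0]
    while frontier:
        cost, tile = heapq.heappop(frontier)
        if cost < best[0]:
            best = (cost, tile)
        for dim in range(3):
            for factor in (0.5, 2):
                t = list(tile)
                t[dim] = int(t[dim] * factor)
                t = tuple(t)
                if t in seen or any(x < 1 or x > 256 for x in t):
                    continue
                seen.add(t)
                heapq.heappush(frontier, (toy_cost(t, m, k, n), t))
    return best
```

Under this toy cost model, `gbfs_tiling(512, 512, 512)` returns a tile no worse than the starting configuration; swapping in a measured-runtime cost function would recover the hardware-driven search the abstract describes.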
-
Publication No.: US20210374544A1
Publication Date: 2021-12-02
Application No.: US17445139
Filing Date: 2021-08-16
Applicant: Huawei Technologies Co., Ltd.
Inventor: Hui Zang , Xiaolin Cheng
IPC: G06N3/08
Abstract: A computer-implemented method for distributed synchronous training of a neural network model includes detecting gradient sets from a plurality of worker machines, each worker machine generating a gradient set in a current iteration of a training data set, and each gradient set of the gradient sets comprising a plurality of gradients. A lagging gradient set from a lagging worker machine is detected. The lagging gradient set is generated by the lagging worker machine in a prior iteration of the training data set. Aggregated gradients are generated based on the gradient sets and the lagging gradient set. The neural network model is updated based on the aggregated gradients.
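The abstract only states that aggregation is "based on" the current gradient sets and the lagging set; one natural reading, sketched below, averages the current-iteration gradients together with the stale gradients while down-weighting the latter. The `stale_weight` parameter is an assumption introduced for illustration, not something the abstract specifies.

```python
def aggregate_with_lagging(current_sets, lagging_set, stale_weight=0.5):
    """Aggregate current-iteration gradients with one stale gradient set.

    current_sets: per-layer gradient lists from workers that finished the
    current iteration. lagging_set: the per-layer gradients the lagging
    worker produced in a prior iteration. The stale set contributes with
    a reduced weight (a hypothetical choice); the result is a weighted
    average used to update the model.
    """
    total_weight = len(current_sets) + stale_weight
    aggregated = []
    for i in range(len(lagging_set)):
        s = sum(g[i] for g in current_sets) + stale_weight * lagging_set[i]
        aggregated.append(s / total_weight)
    return aggregated
```

For instance, with current gradients 2.0 and 4.0 for a layer, a stale gradient of 1.0, and `stale_weight=0.5`, the aggregate is (2 + 4 + 0.5·1) / 2.5 = 2.6.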
-
Publication No.: US11842178B2
Publication Date: 2023-12-12
Application No.: US17182753
Filing Date: 2021-02-23
Applicant: Huawei Technologies Co., Ltd.
Inventor: Hui Zang , Huaqing Zhang , Xiaolin Cheng
Abstract: A system and method are provided for optimizing general matrix multiplication (GEMM) on target hardware by splitting the matrices to be multiplied into tiles and formulating a tiling configuration search problem that explores a configuration search space to identify an optimal tiling configuration, i.e., one that minimizes the running time for multiplying matrices A (m×k) and B (k×n) on the target hardware, with configuration states expressed as a function of the matrix parameters m, k, and n and the number of nested loops for each of the dimensions m, k, and n. The optimal tiling configuration for the target hardware is obtained by implementing a Greedy Best-First-Search (GBFS) algorithm or a Neighborhood Actor Advantage Critic (N-A2C) algorithm that optimizes the running time for multiplication of the matrices on the target hardware, and the target hardware is configured and computations are run accordingly.