DISTRIBUTED SYNCHRONOUS TRAINING ARCHITECTURE USING STALE WEIGHTS

    Publication Number: US20220027738A1

    Publication Date: 2022-01-27

    Application Number: US17450055

    Application Date: 2021-10-05

    Abstract: A computer-implemented method for distributed synchronous training of a neural network model includes performing, by a worker machine of a plurality of worker machines, a forward computation of a training data set using a plurality of N layers of the neural network model. The forward computation starts at Layer 1 and proceeds through Layer N of the neural network model. The method further includes performing, by the worker machine, a backward computation of the training data set, the backward computation starting at Layer N and proceeding through Layer 1 of the neural network model. The method further includes synchronizing, by the worker machine, a plurality of gradients outputted by the neural network model during the backward computation. The synchronizing of the plurality of gradients is performed with other worker machines of the plurality of worker machines and in parallel with the backward computation.
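    The overlap described in this abstract can be sketched in a few lines. This is an illustrative simulation only, not the patented implementation: worker gradients are plain floats, the "all-reduce" is a simulated average across workers, and all names are hypothetical.

```python
# Minimal sketch: per-layer gradient synchronization submitted to a thread
# pool so it runs in parallel with the rest of the backward pass.
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4
NUM_LAYERS = 3  # Layer 1 .. Layer N

def local_backward(worker_id, layer):
    """Simulated gradient for one layer on one worker."""
    return float(worker_id + layer)

def all_reduce(grads):
    """Simulated synchronous all-reduce: average gradients across workers."""
    return sum(grads) / len(grads)

def train_step():
    synced = {}
    with ThreadPoolExecutor() as pool:
        pending = []
        # Backward proceeds from Layer N down to Layer 1 ...
        for layer in range(NUM_LAYERS, 0, -1):
            grads = [local_backward(w, layer) for w in range(NUM_WORKERS)]
            # ... while each layer's gradient sync runs in parallel with it.
            pending.append((layer, pool.submit(all_reduce, grads)))
        # Collect the synchronized gradients once the pass is complete.
        for layer, fut in pending:
            synced[layer] = fut.result()
    return synced

print(train_step())
```

    The point of the structure is that synchronization for Layer N can start as soon as Layer N's gradients exist, rather than waiting for the whole backward pass to finish.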

    COMPILER-LEVEL GENERAL MATRIX MULTIPLICATION CONFIGURATION OPTIMIZATION

    Publication Number: US20210200521A1

    Publication Date: 2021-07-01

    Application Number: US17182753

    Application Date: 2021-02-23

    Abstract: A system and method are provided for optimizing general matrix multiplication (GEMM) on target hardware. The matrices to be multiplied are split into tiles, and a tiling-configuration search problem is formulated that explores a configuration search space to identify an optimal tiling configuration, one that minimizes the running time for multiplying matrices A (m×k) and B (k×n) on the target hardware. Configuration states are expressed as a function of the matrix parameters m, k, and n and of the number of nested loops for each of the dimensions m, k, and n. The optimal tiling configuration for the target hardware is obtained by implementing a Greedy Best-First-Search (GBFS) algorithm or a Neighborhood Actor Advantage Critic (N-A2C) algorithm that minimizes the running time of the multiplication, and the target hardware is configured and the computations are run accordingly.
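    A best-first search over tile-size configurations, as named in the abstract, can be sketched as follows. The cost function here is a made-up proxy for running time (the real system would measure or model execution on the target hardware), and the tile-size choices and neighbor rule are illustrative assumptions.

```python
# Hedged sketch of Greedy Best-First-Search (GBFS) over (tm, tk, tn)
# tile-size configurations for a GEMM of A (m×k) by B (k×n).
import heapq

M, K, N = 512, 512, 512
TILE_CHOICES = [16, 32, 64, 128]

def cost(cfg):
    """Toy stand-in for running time: penalize uneven tiling and a
    tile working set far from a nominal 4096-element cache budget."""
    tm, tk, tn = cfg
    waste = (M % tm) + (K % tk) + (N % tn)
    footprint = tm * tk + tk * tn + tm * tn
    return waste + abs(footprint - 4096)

def neighbors(cfg):
    """Configurations that move one tile dimension to an adjacent choice."""
    out = []
    for i in range(3):
        idx = TILE_CHOICES.index(cfg[i])
        for j in (idx - 1, idx + 1):
            if 0 <= j < len(TILE_CHOICES):
                nxt = list(cfg)
                nxt[i] = TILE_CHOICES[j]
                out.append(tuple(nxt))
    return out

def gbfs(start):
    """Expand configurations in order of estimated cost, tracking the best
    seen.  On this small space the frontier eventually drains."""
    frontier = [(cost(start), start)]
    seen = {start}
    best = (cost(start), start)
    while frontier:
        c, cfg = heapq.heappop(frontier)
        if c < best[0]:
            best = (c, cfg)
        for nb in neighbors(cfg):
            if nb not in seen:
                seen.add(nb)
                heapq.heappush(frontier, (cost(nb), nb))
    return best

print(gbfs((16, 16, 16)))
```

    In a real compiler the cost of a configuration would come from the target hardware, so the search order (cheapest-looking configuration first) matters far more than it does on this toy cost model.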

    LEVERAGING LAGGING GRADIENTS IN MACHINE-LEARNING MODEL TRAINING

    Publication Number: US20210374544A1

    Publication Date: 2021-12-02

    Application Number: US17445139

    Application Date: 2021-08-16

    Abstract: A computer-implemented method for distributed synchronous training of a neural network model includes detecting gradient sets from a plurality of worker machines, each worker machine generating a gradient set in a current iteration of a training data set, and each gradient set of the gradient sets comprising a plurality of gradients. A lagging gradient set from a lagging worker machine is detected. The lagging gradient set is generated by the lagging worker machine in a prior iteration of the training data set. Aggregated gradients are generated based on the gradient sets and the lagging gradient set. The neural network model is updated based on the aggregated gradients.
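    The aggregation step in this abstract can be sketched directly. The equal-weight average below is an illustrative choice, not the patent's aggregation rule; the point is only that the lagging worker's prior-iteration gradients contribute rather than being discarded.

```python
# Hedged sketch: aggregate current-iteration gradient sets together with a
# lagging gradient set produced by a slow worker in a prior iteration.
def aggregate(current_sets, lagging_set):
    """Average element-wise across all gradient sets, lagging one included."""
    all_sets = current_sets + [lagging_set]
    n = len(all_sets)
    return [sum(g[i] for g in all_sets) / n for i in range(len(lagging_set))]

current = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # fast workers, iteration t
lagging = [7.0, 8.0]                             # slow worker, iteration t-1
print(aggregate(current, lagging))  # -> [4.0, 5.0]
```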

    Compiler-level general matrix multiplication configuration optimization

    Publication Number: US11842178B2

    Publication Date: 2023-12-12

    Application Number: US17182753

    Application Date: 2021-02-23

    CPC classification number: G06F8/443 G06F7/16 G06F8/447

    Abstract: A system and method are provided for optimizing general matrix multiplication (GEMM) on target hardware. The matrices to be multiplied are split into tiles, and a tiling-configuration search problem is formulated that explores a configuration search space to identify an optimal tiling configuration, one that minimizes the running time for multiplying matrices A (m×k) and B (k×n) on the target hardware. Configuration states are expressed as a function of the matrix parameters m, k, and n and of the number of nested loops for each of the dimensions m, k, and n. The optimal tiling configuration for the target hardware is obtained by implementing a Greedy Best-First-Search (GBFS) algorithm or a Neighborhood Actor Advantage Critic (N-A2C) algorithm that minimizes the running time of the multiplication, and the target hardware is configured and the computations are run accordingly.
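    The tiling transformation itself, splitting the matrices into tiles and multiplying one tile at a time, can be sketched in plain Python. A compiler would emit the same loop nest with hardware-tuned tile sizes; the sizes and list-of-lists representation here are purely illustrative.

```python
# Hedged sketch: C = A @ B computed tile by tile, with tile sizes
# tm (rows of A), tk (shared dimension), tn (columns of B).
def tiled_matmul(A, B, tm=2, tk=2, tn=2):
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    # Outer loops walk tile corners; min() handles ragged edge tiles.
    for i0 in range(0, m, tm):
        for j0 in range(0, n, tn):
            for p0 in range(0, k, tk):
                # Inner loops multiply one tm×tk tile of A by one
                # tk×tn tile of B, accumulating into C.
                for i in range(i0, min(i0 + tm, m)):
                    for p in range(p0, min(p0 + tk, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + tn, n)):
                            C[i][j] += a * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B))  # -> [[19.0, 22.0], [43.0, 50.0]]
```

    The search described in the abstract chooses tm, tk, and tn (and the loop nesting) so that each tile's working set fits the target hardware's memory hierarchy.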
