Efficient parallel training of a network model on multiple graphics processing units

    Publication No.: US10949746B2

    Publication Date: 2021-03-16

    Application No.: US15423900

    Filing Date: 2017-02-03

    IPC Classes: G06N3/08 G06N3/04

    Abstract: A system and method provide efficient parallel training of a neural network model on multiple graphics processing units. A training module reduces the time and communication overhead of gradient accumulation and parameter updating of the network model in a neural network by overlapping processes in an advantageous way. In a described embodiment, a training module overlaps backpropagation, gradient transfer, and accumulation in a Synchronous Stochastic Gradient Descent algorithm on a convolutional neural network. The training module collects gradients of multiple layers during backpropagation of training from a plurality of graphics processing units (GPUs), accumulates the gradients on at least one processor, and then delivers the gradients of the layers to the plurality of GPUs during the backpropagation of the training. The whole model parameters can then be updated on the GPUs after receipt of the gradient of the last layer.
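    The layer-wise overlap described in the abstract can be sketched as follows. This is a minimal simulation, not the patented implementation: simulated "GPUs" are plain lists of per-layer gradients, and the accumulation that would run concurrently with backpropagation is modeled as a step inside the reversed layer loop.

```python
# Sketch of overlapping gradient accumulation with backpropagation:
# each layer's gradients are summed across workers and delivered back
# as soon as that layer's backward pass completes, rather than after
# the whole backward pass. Names are illustrative, not from the patent.

def backward_with_overlap(worker_grads):
    """worker_grads: one list of per-layer gradients per simulated GPU,
    ordered from first layer to last. Backprop visits layers in reverse,
    so each layer can be reduced and broadcast immediately."""
    num_layers = len(worker_grads[0])
    accumulated = [None] * num_layers
    for layer in reversed(range(num_layers)):   # backpropagation order
        total = sum(g[layer] for g in worker_grads)  # accumulate on host
        accumulated[layer] = total
        for g in worker_grads:
            g[layer] = total   # deliver the summed gradient to each GPU
    return accumulated

grads_gpu0 = [1.0, 2.0, 3.0]   # per-layer gradients from simulated GPU 0
grads_gpu1 = [0.5, 0.5, 0.5]   # per-layer gradients from simulated GPU 1
result = backward_with_overlap([grads_gpu0, grads_gpu1])
```

    Once the last layer's (index 0) summed gradient arrives, every GPU holds the full accumulated gradient and can update the whole model's parameters, matching the final step in the abstract.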

    APPLICATION PERFORMANCE SIMULATOR
    12.
    Invention application

    Publication No.: US20200065214A1

    Publication Date: 2020-02-27

    Application No.: US16110324

    Filing Date: 2018-08-23

    IPC Classes: G06F11/34 G06F1/32 G06F9/50

    Abstract: A computer-implemented method, system, and computer program product are provided to simulate a target system. The method includes determining system performance metrics for a target system and an execution system. The method also includes generating a ratio of estimation between the system performance metrics for the target system and the execution system. The method additionally includes throttling components in the execution system to adjust all of the system performance metrics of the execution system responsive to the ratio of estimation to create a throttled execution system. The method further includes measuring a throttled execution time while running an application on the throttled execution system. The method also includes estimating a target execution time for the application on the target system responsive to the throttled execution time.
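    The estimation flow above can be sketched as below. The metric names and the assumption that throttling can scale each component linearly are illustrative choices for the example, not details from the patent.

```python
# Sketch of the simulator's flow: compute a per-metric ratio of
# estimation, then throttle the execution system so its effective
# metrics match the target system's.

def ratio_of_estimation(target_metrics, exec_metrics):
    """Per-metric ratio between the target and execution systems."""
    return {k: target_metrics[k] / exec_metrics[k] for k in target_metrics}

def throttle(exec_metrics, ratios):
    """Throttle every component so the execution system's effective
    metrics equal the target system's (linear scaling assumed)."""
    return {k: exec_metrics[k] * ratios[k] for k in exec_metrics}

target = {"cpu_ghz": 2.0, "mem_gbps": 50.0}     # illustrative metrics
exec_sys = {"cpu_ghz": 4.0, "mem_gbps": 200.0}
ratios = ratio_of_estimation(target, exec_sys)  # cpu 0.5, memory 0.25
throttled = throttle(exec_sys, ratios)
# With all metrics matched, the execution time measured on the
# throttled system serves as the estimate of the target execution time.
```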

    ESTIMATING PERFORMANCE OF GPU APPLICATION FOR DIFFERENT GPU-LINK PERFORMANCE RATIO

    Publication No.: US20190325549A1

    Publication Date: 2019-10-24

    Application No.: US15956321

    Filing Date: 2018-04-18

    IPC Classes: G06T1/20 G06F9/38

    Abstract: A computer-implemented method is provided for estimating the performance of a GPU application on a new computing machine having an increased GPU-link performance ratio relative to a current computing machine having a current GPU-link performance ratio. The method includes adding a delay to CPU-GPU communication on the current computing machine to simulate a delayed-communication environment on the current computing machine. The method further includes executing the target GPU application in the delayed-communication environment. The method also includes measuring the performance of the target GPU application in the delayed-communication environment. The method additionally includes estimating the performance of the new computing machine having the increased GPU-link performance ratio, based on the measured performance of the target GPU application in the delayed-communication environment.
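    One way the delayed-communication measurement can drive an estimate is via a simple two-component time model. The linear compute-plus-communication model and the factor-of-2 delayed run below are assumptions made for this sketch, not details from the patent.

```python
# Hedged sketch: slow CPU-GPU communication by a known factor, then
# solve total = compute + comm from the baseline and delayed runs.

def decompose(time_baseline, time_delayed, delay_factor=2.0):
    """Assume time_delayed = compute + delay_factor * comm and
    time_baseline = compute + comm; solve for both components."""
    comm = (time_delayed - time_baseline) / (delay_factor - 1.0)
    compute = time_baseline - comm
    return compute, comm

def estimate_for_ratio(compute, comm, rel_comm_slowdown):
    """Estimated runtime when the GPU-link performance ratio changes so
    that communication costs rel_comm_slowdown times its current cost."""
    return compute + rel_comm_slowdown * comm

compute, comm = decompose(10.0, 14.0)           # baseline 10 s, delayed 14 s
t_new = estimate_for_ratio(compute, comm, 3.0)  # communication 3x relatively slower
```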

    MEMORY REDUCTION FOR NEURAL NETWORKS WITH FIXED STRUCTURES

    Publication No.: US20190303025A1

    Publication Date: 2019-10-03

    Application No.: US15943079

    Filing Date: 2018-04-02

    IPC Classes: G06F3/06 G06N3/08

    Abstract: A method is provided for reducing memory consumption in a propagation process for a neural network (NN) having fixed structures for computation order and node data dependency. The memory includes memory segments for allocation to nodes. The method collects, in an NN training iteration, information for each node relating to its allocation, size, and lifetime. Responsive to the information, the method chooses a first node having a maximum memory size relative to the remaining nodes, and a second node whose lifetime does not overlap the first node's. It chooses another node whose lifetime also does not overlap the first node's, provided the sum of the memory sizes of the second node and the other node does not exceed the first node's memory size. The method then reallocates the memory segment allocated to the first node to the second node and the other node, so that both can reuse it.
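    The segment-reuse selection can be sketched as below. Node names, sizes, and lifetime intervals are made-up inputs; the selection rule follows the abstract: pick the largest node, then pack non-overlapping nodes into its segment while their combined size fits.

```python
# Sketch of the reuse plan: the max-size node donates its memory
# segment to nodes whose lifetimes do not overlap its own, as long as
# their combined sizes fit within the donated segment.

def lifetimes_overlap(a, b):
    """Closed intervals (start, end) overlap iff each starts before the
    other ends."""
    return a[0] <= b[1] and b[0] <= a[1]

def plan_reuse(nodes):
    """nodes: dict name -> (size, (start, end)).
    Returns (donor, sharers): the max-size node and the nodes chosen to
    reuse its segment."""
    donor = max(nodes, key=lambda n: nodes[n][0])
    d_size, d_life = nodes[donor]
    sharers, used = [], 0
    for name, (size, life) in nodes.items():
        if name == donor or lifetimes_overlap(life, d_life):
            continue
        if used + size <= d_size:       # combined sizes must fit
            sharers.append(name)
            used += size
    return donor, sharers

nodes = {"A": (100, (0, 5)), "B": (40, (6, 8)),
         "C": (30, (7, 9)), "D": (50, (3, 7))}
donor, sharers = plan_reuse(nodes)   # D overlaps A's lifetime, so only B and C share
```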

    Packet communication system, communication method and program
    16.
    Granted patent (in force)

    Publication No.: US09066289B2

    Publication Date: 2015-06-23

    Application No.: US13890338

    Filing Date: 2013-05-09

    Abstract: A system including multiple nodes performing radio communication, wherein each node stores routing information, uses it to determine a transmission path, and performs cut-through transmission, exchanging packets with nodes on the determined path over radio waves given directivity by controlling their phases. In the system, time synchronization and the exchange of packet communication records are performed during a certain time period by carrying out cut-through transmission while controlling the phases of the radio waves so that all of the nodes form one or more closed loops. Each node transmits and receives packets in accordance with the routing information and a time frame assigned to it as the time when it is allowed to transmit and receive a packet, updates the routing information, and shares it with the other nodes.
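    The routing-table lookup and the per-node time-frame rule can be sketched as below. The routing-table shape, the one-slot-per-node round-robin schedule, and the node names are illustrative assumptions, not details from the patent.

```python
# Sketch of the two per-node rules in the abstract: forward along the
# stored route, and transmit only during the node's assigned time frame.

def next_hop(routing, src, dst):
    """Look up the next node on the path from src toward dst."""
    return routing[src][dst]

def can_transmit(schedule, node, t, frame_len=1):
    """A node may transmit only while the current time t falls inside
    its assigned frame; frames repeat round-robin over all nodes."""
    start = schedule[node]
    period = frame_len * len(schedule)
    return start <= t % period < start + frame_len

routing = {"A": {"C": "B"}, "B": {"C": "C"}}  # A reaches C via B
schedule = {"A": 0, "B": 1, "C": 2}           # one slot per node
hop = next_hop(routing, "A", "C")
ok = can_transmit(schedule, "A", 3)           # t=3 wraps back to A's frame
```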


    DATA SWAPPING FOR NEURAL NETWORK MEMORY CONSERVATION

    Publication No.: US20220138580A1

    Publication Date: 2022-05-05

    Application No.: US17089245

    Filing Date: 2020-11-04

    IPC Classes: G06N3/08

    Abstract: Methods and systems for training a neural network include identifying units within a neural network, including a first unit for memory swapping and a second unit for re-computation, to balance memory efficiency with computational efficiency. Each unit includes at least one layer of the neural network. Each unit has a first layer that is a checkpoint operation. During a feed-forward training stage, feature maps are stored in a first memory. The feature maps are output by the at least one layer of the first unit. The feature maps are swapped from the first memory to a second memory. During a backpropagation stage, the feature maps for the first unit are swapped from the second memory to the first memory. Feature maps for the second unit are re-computed.
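    The swap/re-compute split can be sketched as below. Dictionaries stand in for GPU ("fast") and host ("slow") memory, and the re-computation is simulated rather than actually re-running a forward pass; unit names and strategies are illustrative.

```python
# Sketch of the two memory-conservation strategies in the abstract:
# 'swap' units move their feature maps to slow memory after the forward
# pass; 'recompute' units discard them and re-derive them on backward.

class Unit:
    def __init__(self, name, strategy):
        self.name = name
        self.strategy = strategy   # 'swap' or 'recompute'

def forward(units, fast, slow):
    """Feed-forward: produce each unit's feature maps in fast memory,
    swapping 'swap' units out to slow memory right away."""
    for u in units:
        fast[u.name] = f"fmap:{u.name}"
        if u.strategy == "swap":
            slow[u.name] = fast.pop(u.name)

def backward(units, fast, slow):
    """Backpropagation: swap 'swap' units back into fast memory and
    re-compute the rest from their checkpoint (simulated here)."""
    for u in reversed(units):
        if u.strategy == "swap":
            fast[u.name] = slow.pop(u.name)
        else:
            fast[u.name] = f"fmap:{u.name}"   # stand-in for re-computation

units = [Unit("u1", "swap"), Unit("u2", "recompute")]
fast, slow = {}, {}
forward(units, fast, slow)    # u1's maps now live in slow memory
backward(units, fast, slow)   # everything needed is back in fast memory
```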

    Multi-GPU deep learning using CPUs
    18.
    Granted patent

    Publication No.: US11164079B2

    Publication Date: 2021-11-02

    Application No.: US15843244

    Filing Date: 2017-12-15

    IPC Classes: G06N3/08 G06T1/20 G06N3/04

    Abstract: A computer-implemented method, computer program product, and computer processing system are provided for accelerating neural network data parallel training in multiple graphics processing units (GPUs) using at least one central processing unit (CPU). The method includes forming a set of chunks. Each of the chunks includes a respective group of neural network layers other than a last layer. The method further includes performing one or more chunk-wise synchronization operations during a backward phase of the neural network data parallel training, by each of the multiple GPUs and the at least one CPU.
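    The chunk formation and chunk-wise synchronization can be sketched as below. The chunk size and the plain-Python reduction standing in for the CPU-side synchronization are assumptions for the example.

```python
# Sketch of chunk-wise synchronization: layers other than the last are
# grouped into chunks, and each chunk's gradients are reduced (here, a
# simple sum standing in for the CPU's role) during the backward phase.

def make_chunks(num_layers, chunk_size):
    """Group all layers except the last into chunks of chunk_size."""
    layers = list(range(num_layers - 1))
    return [layers[i:i + chunk_size] for i in range(0, len(layers), chunk_size)]

def backward_chunkwise(gpu_grads, chunks):
    """gpu_grads: one list of per-layer gradients per simulated GPU.
    Reduce each chunk as soon as its backward pass completes."""
    reduced = {}
    for chunk in reversed(chunks):        # backward phase visits chunks in reverse
        for layer in reversed(chunk):
            reduced[layer] = sum(g[layer] for g in gpu_grads)
    return reduced

chunks = make_chunks(5, 2)    # layer 4, the last layer, is excluded
grads = backward_chunkwise([[1, 1, 1, 1, 1], [2, 2, 2, 2, 2]], chunks)
```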

    Estimating performance of GPU application for different GPU-link performance ratio

    Publication No.: US10453167B1

    Publication Date: 2019-10-22

    Application No.: US15956321

    Filing Date: 2018-04-18

    IPC Classes: G06F11/34 G06T1/20 G06F9/38

    Abstract: A computer-implemented method is provided for estimating the performance of a GPU application on a new computing machine having an increased GPU-link performance ratio relative to a current computing machine having a current GPU-link performance ratio. The method includes adding a delay to CPU-GPU communication on the current computing machine to simulate a delayed-communication environment on the current computing machine. The method further includes executing the target GPU application in the delayed-communication environment. The method also includes measuring the performance of the target GPU application in the delayed-communication environment. The method additionally includes estimating the performance of the new computing machine having the increased GPU-link performance ratio, based on the measured performance of the target GPU application in the delayed-communication environment.