-
Publication No.: US10949746B2
Publication Date: 2021-03-16
Application No.: US15423900
Filing Date: 2017-02-03
Inventors: Imai Haruki, Tung Duc Le, Yasushi Negishi
Abstract: A system and method provide efficient parallel training of a neural network model on multiple graphics processing units. A training module reduces the time and communication overhead of gradient accumulation and parameter updating by overlapping processes in an advantageous way. In a described embodiment, the training module overlaps backpropagation, gradient transfer, and accumulation in a Synchronous Stochastic Gradient Descent algorithm on a convolutional neural network. The training module collects gradients of multiple layers from a plurality of graphics processing units (GPUs) during the backpropagation phase of training, accumulates the gradients on at least one processor, and then delivers the gradients of the layers back to the GPUs while backpropagation is still in progress. The whole model's parameters can then be updated on the GPUs after receipt of the gradient of the last layer.
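As a rough illustration of the overlap described above (a sketch only, not the patented implementation; all names are invented and plain Python lists stand in for GPU gradient buffers), gradient accumulation can proceed layer by layer in backward order, with the parameter update deferred until the last layer's gradient arrives:

```python
# Minimal sketch of layer-wise gradient accumulation during backpropagation
# in synchronous data-parallel SGD. Gradients from each "GPU" are summed on
# the host as soon as a layer produces them, instead of waiting for the
# whole backward pass to finish.

def accumulate_layerwise(per_gpu_grads):
    """per_gpu_grads: list (one entry per GPU) of per-layer gradients.
    Returns accumulated per-layer gradients, processed last-layer-first,
    in the order they become available during backpropagation."""
    num_layers = len(per_gpu_grads[0])
    accumulated = [None] * num_layers
    for layer in reversed(range(num_layers)):
        # Transfer + accumulation of this layer overlaps with the
        # (simulated) backward computation of earlier layers.
        accumulated[layer] = sum(g[layer] for g in per_gpu_grads)
    return accumulated

def sgd_update(params, grads, lr=0.1):
    # The full parameter update happens only after the last layer's
    # accumulated gradient has been delivered back to the GPUs.
    return [p - lr * g for p, g in zip(params, grads)]
```

A usage example: with two simulated GPUs each holding per-layer gradients `[0.5, 0.5]`, accumulation yields `[1.0, 1.0]`, which is then applied in a single update step.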
-
Publication No.: US20200065214A1
Publication Date: 2020-02-27
Application No.: US16110324
Filing Date: 2018-08-23
Inventors: Yasushi Negishi, Kiyokuni Kawachiya, Jun Doi
Abstract: A computer-implemented method, system, and computer program product are provided to simulate a target system. The method includes determining system performance metrics for a target system and an execution system. The method also includes generating a ratio of estimation between the system performance metrics of the target system and those of the execution system. The method additionally includes throttling components in the execution system to adjust all of its system performance metrics responsive to the ratio of estimation, creating a throttled execution system. The method further includes measuring a throttled execution time while running an application on the throttled execution system. The method also includes estimating a target execution time for the application on the target system responsive to the throttled execution time.
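The idea can be sketched numerically (a simplified model under my own assumptions, not the patent's formulas; metric lists and helper names are invented): pick a single scale factor so that throttling every component of the execution system makes it a uniformly scaled copy of the target, then scale the measured time back.

```python
# Sketch of ratio-based performance estimation: throttle the execution
# system so each of its performance metrics (e.g. GFLOPS, GB/s) becomes
# k times the target system's metric, for one common factor k.

def scale_factor(target_metrics, exec_metrics):
    """Largest k such that k * target_metric <= exec_metric for every
    component, i.e. the throttled settings stay achievable."""
    return min(e / t for t, e in zip(target_metrics, exec_metrics))

def throttled_metrics(target_metrics, k):
    # Settings to which the execution system's components are throttled.
    return [k * t for t in target_metrics]

def estimate_target_time(measured_throttled_time, k):
    # The throttled system is a uniformly k-scaled copy of the target,
    # so the target is assumed to run the application k times as fast
    # relative to the throttled measurement.
    return measured_throttled_time * k
```

For example, with target metrics `[200, 400]` and execution metrics `[100, 300]`, the factor is 0.5, the execution system is throttled to `[100, 200]`, and a measured 10 s throttled run estimates 5 s on the target.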
-
Publication No.: US20190325549A1
Publication Date: 2019-10-24
Application No.: US15956321
Filing Date: 2018-04-18
Inventors: Kiyokuni Kawachiya, Yasushi Negishi, Jun Doi
Abstract: A computer-implemented method is provided for estimating the performance of a GPU application on a new computing machine having an increased GPU-link performance ratio relative to a current computing machine having a current GPU-link performance ratio. The method includes adding a delay to CPU-GPU communication on the current computing machine to simulate a delayed-communication environment on the current computing machine. The method further includes executing the target GPU application in the delayed-communication environment. The method also includes measuring the performance of the target GPU application in the delayed-communication environment. The method additionally includes estimating the performance of the new computing machine having the increased GPU-link performance ratio, based on the measured performance of the target GPU application in the delayed-communication environment.
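One way to picture this (a toy sketch; the wrapper, the linear-fit extrapolation, and all names are my assumptions, not the patent's method) is to inject a sleep before each CPU-GPU transfer, measure runtime at several injected delays, fit a line, and evaluate it at the delay corresponding to a faster link:

```python
import time

def delayed_transfer(transfer_fn, delay_s):
    """Wrap a CPU-GPU transfer function so it simulates a slower link by
    sleeping delay_s seconds before each transfer."""
    def wrapped(*args, **kwargs):
        time.sleep(delay_s)
        return transfer_fn(*args, **kwargs)
    return wrapped

def extrapolate_runtime(delays, times, target_delay):
    """Least-squares fit of runtime = slope * delay + intercept over the
    measured (delay, runtime) points, evaluated at target_delay (which may
    be negative to model a link faster than the current one)."""
    n = len(delays)
    mx = sum(delays) / n
    my = sum(times) / n
    slope = sum((d - mx) * (t - my) for d, t in zip(delays, times)) \
        / sum((d - mx) ** 2 for d in delays)
    return my + slope * (target_delay - mx)
```

For instance, runtimes of 10 s, 12 s, and 14 s at injected delays 0, 1, and 2 units extrapolate to 8 s at delay -1, the hypothetical faster-link point.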
-
Publication No.: US20190303025A1
Publication Date: 2019-10-03
Application No.: US15943079
Filing Date: 2018-04-02
Inventors: Taro Sekiyama, Haruki Imai, Jun Doi, Yasushi Negishi
Abstract: A method is provided for reducing memory consumption in a propagation process for a neural network (NN) having fixed structures for computation order and node data dependency. The memory includes memory segments for allocation to nodes. The method collects, in an NN training iteration, information for each node relating to its allocation, size, and lifetime. The method chooses, responsive to the information, a first node having a maximum memory size relative to the remaining nodes, and a second node whose lifetime does not overlap the first node's lifetime. The method chooses another node whose lifetime likewise does not overlap the first node's, responsive to the sum of the memory sizes of the second node and the other node not exceeding the first node's memory size. The method reallocates the memory segment allocated to the first node to the second node and the other node, so that both reuse it.
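The reuse rule can be sketched as follows (a toy planner under invented data shapes; lifetimes as half-open step intervals and the node names are my assumptions, not the patent's representation): the largest node's segment is shared by nodes whose lifetimes do not overlap it and whose combined size fits.

```python
# Sketch of lifetime-based memory-segment reuse for neural network nodes.

def overlaps(a, b):
    """Lifetimes are (start, end) step intervals with end exclusive."""
    return a[0] < b[1] and b[0] < a[1]

def plan_reuse(nodes):
    """nodes: dict name -> (size_bytes, (start, end)).
    Returns (owner, nodes reusing the owner's segment, bytes shared)."""
    owner = max(nodes, key=lambda n: nodes[n][0])  # maximum memory size
    budget = nodes[owner][0]
    sharers, used = [], 0
    for name, (size, life) in nodes.items():
        if name == owner or overlaps(life, nodes[owner][1]):
            continue  # only lifetime-disjoint nodes may reuse the segment
        if used + size <= budget:  # combined size must fit in the segment
            sharers.append(name)
            used += size
    return owner, sharers, used
```

For example, a 100-byte node live during steps 0-3 can lend its segment to a 40-byte node live at steps 4-6 and a 50-byte node live at steps 6-8, since 40 + 50 ≤ 100.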
-
Publication No.: US10169874B2
Publication Date: 2019-01-01
Application No.: US15608814
Filing Date: 2017-05-30
Inventors: Hiroki Nakano, Yasushi Negishi, Masaharu Sakamato, Taro Sekiyama, Kun Zhao
Abstract: A target object may be identified by estimating a distribution of a plurality of orientations of a periphery of the target object, and identifying the target object based on the distribution.
-
Publication No.: US09066289B2
Publication Date: 2015-06-23
Application No.: US13890338
Filing Date: 2013-05-09
IPC Classification: H04J3/06, H04W56/00, H04L12/721, H04W40/02, H04W40/06
CPC Classification: H04W56/00, H04J3/0658, H04L45/40, H04W40/02, H04W40/06
Abstract: A system includes multiple nodes performing radio communication, wherein each node stores routing information, uses it to determine a transmission path, and performs cut-through transmission by transmitting and receiving packets to and from a node on the determined path, using transmission and reception radio waves given directivity by controlling their phases. In the system, time synchronization and the exchange of packet communication records are performed during a certain time period by carrying out the cut-through transmission while controlling the phases of the radio waves so that all of the nodes form one or more closed loops. Each node transmits and receives packets in accordance with the routing information and a time frame assigned to it as the time when it is allowed to transmit and receive a packet, updates the routing information, and shares it with the other nodes.
-
Publication No.: US20220138580A1
Publication Date: 2022-05-05
Application No.: US17089245
Filing Date: 2020-11-04
Inventors: Haruki Imai, Tung D. Le, Yasushi Negishi, Kiyokuni Kawachiya
IPC Classification: G06N3/08
Abstract: Methods and systems for training a neural network include identifying units within the neural network, including a first unit for memory swapping and a second unit for re-computation, to balance memory efficiency with computational efficiency. Each unit includes at least one layer of the neural network, and each unit's first layer is a checkpoint operation. During a feed-forward training stage, the feature maps output by the at least one layer of the first unit are stored in a first memory and swapped from the first memory to a second memory. During a backpropagation stage, the feature maps for the first unit are swapped from the second memory back to the first memory, while the feature maps for the second unit are re-computed.
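A toy version of the swap/recompute split (a sketch only; the unit tuples, dictionaries standing in for device and host memory, and function names are all invented) looks like this:

```python
# Sketch of mixing memory swapping with re-computation: "swap" units copy
# their feature maps to a second (host) memory during the forward pass;
# "recompute" units discard them and rerun their forward function from the
# checkpointed input when backpropagation needs them.

def forward(units, x, host_mem):
    """units: list of (name, fn, policy), policy in {'swap', 'recompute'}.
    Each unit's first layer acts as the checkpoint: its input is kept."""
    checkpoints = {}
    for name, fn, policy in units:
        checkpoints[name] = x      # checkpoint operation: retain the input
        x = fn(x)                  # the unit's output feature map
        if policy == "swap":
            host_mem[name] = x     # swap out to the second memory
        # recompute units deliberately do not retain their feature maps
    return x, checkpoints

def fetch_for_backward(name, units, checkpoints, host_mem):
    """Recover a unit's feature maps during the backpropagation stage."""
    fns = {n: f for n, f, p in units}
    policies = {n: p for n, f, p in units}
    if policies[name] == "swap":
        return host_mem[name]              # swap in from the second memory
    return fns[name](checkpoints[name])    # re-compute from the checkpoint
```

With a "swap" unit computing `x + 1` and a "recompute" unit computing `x * 2`, both units' feature maps are recoverable in the backward stage, one from host memory and one by rerunning the forward function.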
-
Publication No.: US11164079B2
Publication Date: 2021-11-02
Application No.: US15843244
Filing Date: 2017-12-15
Inventors: Tung D. Le, Haruki Imai, Taro Sekiyama, Yasushi Negishi
Abstract: A computer-implemented method, computer program product, and computer processing system are provided for accelerating neural network data-parallel training on multiple graphics processing units (GPUs) using at least one central processing unit (CPU). The method includes forming a set of chunks, each of which includes a respective group of neural network layers other than the last layer. The method further includes performing one or more chunk-wise synchronization operations during a backward phase of the neural network data-parallel training, by each of the multiple GPUs and the at least one CPU.
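Chunk-wise synchronization can be sketched as follows (a simplified model with invented names; plain lists stand in for GPU gradient buffers, and averaging stands in for the CPU-side synchronization): layers other than the last are grouped into chunks, and cross-GPU averaging fires per chunk as the backward pass completes it, rather than per layer or once at the end.

```python
# Sketch of chunk-wise gradient synchronization during the backward phase
# of data-parallel training.

def make_chunks(num_layers, chunk_size):
    """Group layers 0..num_layers-2 into chunks; the last layer is
    excluded, as in the described method."""
    layers = list(range(num_layers - 1))
    return [layers[i:i + chunk_size] for i in range(0, len(layers), chunk_size)]

def backward_with_chunk_sync(per_gpu_grads, chunks):
    """per_gpu_grads: list (one per GPU) of per-layer gradients.
    Returns per-layer averaged gradients, synchronizing one chunk at a
    time in backward (last-chunk-first) order."""
    num_gpus = len(per_gpu_grads)
    averaged = {}
    for chunk in reversed(chunks):       # backward-phase order
        for layer in reversed(chunk):
            averaged[layer] = sum(g[layer] for g in per_gpu_grads) / num_gpus
    return [averaged[i] for i in sorted(averaged)]
```

For a 5-layer network with chunk size 2, layers 0-3 form chunks `[[0, 1], [2, 3]]`, each synchronized as a unit while the last layer is handled separately.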
-
Publication No.: US20200174848A1
Publication Date: 2020-06-04
Application No.: US16781992
Filing Date: 2020-02-04
Abstract: A method is provided for consistent data processing by first and second distributed processing systems that have different data partitioning and routing mechanisms, such that the first system is stateless and the second system is stateful. The method includes dividing the data in each system into the same number of partitions based on the same key and the same hash function. The method includes mapping partitions between the systems in a one-to-one mapping. The mapping step includes calculating a partition ID based on the hash function and the total number of partitions, and dynamically mapping a partition in the first system to a partition in the second system, responsive to the partition in the first system being unmapped to a partition in the second system.
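The shared partitioning rule is straightforward to sketch (the abstract does not name a hash function, so `zlib.crc32` below is a stand-in assumption, as are the function names): because both systems derive the partition ID from the same key, hash function, and partition count, their partitions line up one-to-one.

```python
import zlib

def partition_id(key, num_partitions):
    """Same key + same hash function + same partition count => the same
    partition ID in both the stateless and the stateful system."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def map_partitions(num_partitions):
    # With identical partitioning on both sides, the one-to-one mapping
    # between the first and second system's partitions is the identity.
    return {pid: pid for pid in range(num_partitions)}
```

Any record keyed `"user-42"` therefore lands in the same-numbered partition of both systems, which is what makes consistent joint processing possible.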
-
Publication No.: US10453167B1
Publication Date: 2019-10-22
Application No.: US15956321
Filing Date: 2018-04-18
Inventors: Kiyokuni Kawachiya, Yasushi Negishi, Jun Doi
Abstract: A computer-implemented method is provided for estimating the performance of a GPU application on a new computing machine having an increased GPU-link performance ratio relative to a current computing machine having a current GPU-link performance ratio. The method includes adding a delay to CPU-GPU communication on the current computing machine to simulate a delayed-communication environment on the current computing machine. The method further includes executing the target GPU application in the delayed-communication environment. The method also includes measuring the performance of the target GPU application in the delayed-communication environment. The method additionally includes estimating the performance of the new computing machine having the increased GPU-link performance ratio, based on the measured performance of the target GPU application in the delayed-communication environment.
-