TRAINING NEURAL NETWORKS USING DISTRIBUTED BATCH NORMALIZATION

    Publication Number: US20240378416A1

    Publication Date: 2024-11-14

    Application Number: US18444267

    Filing Date: 2024-02-16

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including instructions encoded on storage media, for performing reduction of gradient vectors for distributed training of a neural network. One of the methods includes receiving, at each of the plurality of devices, a respective batch; performing, by each device, a forward pass comprising, for each batch normalization layer: generating, by each of the devices, a respective output of the corresponding other layer for each training example in the batch, determining, by each of the devices, a per-replica mean and a per-replica variance; determining, for each sub-group, a distributed mean and a distributed variance from the per-replica means and the per-replica variances for the devices in the sub-group; and applying, by each device, batch normalization to the respective outputs of the corresponding other layer generated by the device using the distributed mean and the distributed variance for the sub-group to which the device belongs.
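The forward pass described in the abstract can be illustrated with a small NumPy simulation. This is a minimal sketch, not the patented implementation: it assumes equal per-device batch sizes, a list of in-memory arrays standing in for device replicas, and simple averaging of per-replica statistics within each sub-group (combining variances via E[x²] − E[x]²). All function and parameter names are illustrative.

```python
import numpy as np

def per_replica_stats(batch):
    # Each device computes a mean and variance over its own local batch.
    return batch.mean(axis=0), batch.var(axis=0)

def distributed_batch_norm(replica_batches, sub_groups, eps=1e-5):
    # replica_batches: list of per-device activations, each of shape (B, F)
    # sub_groups: lists of device indices; statistics are shared per group
    means, variances = [], []
    for batch in replica_batches:
        m, v = per_replica_stats(batch)
        means.append(m)
        variances.append(v)

    outputs = [None] * len(replica_batches)
    for group in sub_groups:
        # Combine per-replica statistics into distributed statistics.
        # With equal batch sizes, the group mean is the mean of means,
        # and the group variance follows from E[x^2] - (E[x])^2.
        group_mean = np.mean([means[i] for i in group], axis=0)
        group_ex2 = np.mean([variances[i] + means[i] ** 2 for i in group], axis=0)
        group_var = group_ex2 - group_mean ** 2
        for i in group:
            # Each device normalizes its own outputs with the shared
            # distributed statistics of its sub-group.
            outputs[i] = (replica_batches[i] - group_mean) / np.sqrt(group_var + eps)
    return outputs
```

With this pairing, the concatenated outputs of all devices in a sub-group have approximately zero mean and unit variance, which is the effect of computing batch normalization over the sub-group's combined batch rather than each device's local batch alone.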

    Training neural networks using distributed batch normalization

    Publication Number: US11907825B2

    Publication Date: 2024-02-20

    Application Number: US16659543

    Filing Date: 2019-10-21

    Applicant: Google LLC

    CPC classification number: G06N3/044 G06N3/04 G06N3/08 G06N3/084 G06V10/82

    Abstract: Methods, systems, and apparatus, including instructions encoded on storage media, for performing reduction of gradient vectors for distributed training of a neural network. One of the methods includes receiving, at each of the plurality of devices, a respective batch; performing, by each device, a forward pass comprising, for each batch normalization layer: generating, by each of the devices, a respective output of the corresponding other layer for each training example in the batch, determining, by each of the devices, a per-replica mean and a per-replica variance; determining, for each sub-group, a distributed mean and a distributed variance from the per-replica means and the per-replica variances for the devices in the sub-group; and applying, by each device, batch normalization to the respective outputs of the corresponding other layer generated by the device using the distributed mean and the distributed variance for the sub-group to which the device belongs.

    Cross replica reduction on networks having degraded nodes

    Publication Number: US11715010B2

    Publication Date: 2023-08-01

    Application Number: US16543410

    Filing Date: 2019-08-16

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including instructions encoded on storage media, for performing reduction of gradient vectors for a network having one or more degraded nodes. A method comprises training a respective replica of a machine learning model on each node of multiple nodes organized in an n-dimensional network topology, combining the respective individual gradient vectors in the nodes to generate a final gradient vector by performing operations comprising: designating each group of nodes along the dimension as either a forwarding group or a critical group, updating, for each receiving node, a respective individual gradient vector with an intermediate gradient vector, performing a reduction on each critical group of nodes along the dimension to generate a respective partial final gradient vector for the critical group, and updating, for each critical group of nodes, an individual gradient vector for a representative node with the respective partial final gradient vector.
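The reduction steps in the abstract can be sketched for a single topology dimension. This is a simplified illustration under stated assumptions, not the patented method: groups along one dimension are represented as lists of node ids, a group containing a degraded node is treated as a forwarding group, each forwarding group is paired one-to-one with a critical group, and the first node of each critical group serves as the representative node. All names are illustrative.

```python
import numpy as np

def reduce_with_degraded_nodes(gradients, groups, degraded):
    # gradients: dict mapping node id -> individual gradient vector
    # groups: groups of node ids along one dimension of the topology
    # degraded: set of node ids that cannot take part in a reduction
    critical = [g for g in groups if not any(n in degraded for n in g)]
    forwarding = [g for g in groups if any(n in degraded for n in g)]
    assert critical, "at least one group must be free of degraded nodes"

    # Each receiving node in a critical group updates its individual
    # gradient with the intermediate gradient forwarded by a healthy
    # node of the paired forwarding group.
    for fgroup, cgroup in zip(forwarding, critical):
        for src, dst in zip(fgroup, cgroup):
            if src not in degraded:
                gradients[dst] = gradients[dst] + gradients[src]

    # Reduce each critical group and store the partial final gradient
    # on that group's representative node.
    partials = []
    for cgroup in critical:
        partial = sum(gradients[n] for n in cgroup)
        gradients[cgroup[0]] = partial
        partials.append(partial)
    return partials
```

The key idea this illustrates is that degraded nodes are routed around rather than waited on: their groups only forward, so every healthy node's gradient still reaches a reduction, and the partial final gradients end up on representative nodes of the critical groups.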

    CROSS REPLICA REDUCTION ON NETWORKS HAVING DEGRADED NODES

    Publication Number: US20210049408A1

    Publication Date: 2021-02-18

    Application Number: US16543410

    Filing Date: 2019-08-16

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including instructions encoded on storage media, for performing reduction of gradient vectors for a network having one or more degraded nodes. A method comprises training a respective replica of a machine learning model on each node of multiple nodes organized in an n-dimensional network topology, combining the respective individual gradient vectors in the nodes to generate a final gradient vector by performing operations comprising: designating each group of nodes along the dimension as either a forwarding group or a critical group, updating, for each receiving node, a respective individual gradient vector with an intermediate gradient vector, performing a reduction on each critical group of nodes along the dimension to generate a respective partial final gradient vector for the critical group, and updating, for each critical group of nodes, an individual gradient vector for a representative node with the respective partial final gradient vector.
