COMMUNICATION OPTIMIZATIONS FOR DISTRIBUTED MACHINE LEARNING

    Publication No.: EP3506095A2

    Publication date: 2019-07-03

    Application No.: EP18209320.3

    Filing date: 2018-11-29

    Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library, the instructions to cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network.
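
As a rough illustration of creating worker groups from a communication pattern, the sketch below partitions worker ranks differently depending on which messages they exchange. The rank layout, the function name `create_groups`, and the pattern names `gradient_allreduce` and `activation_exchange` are hypothetical and are not taken from the abstract; they only show one way such grouping could work.

```python
from typing import Callable, Dict, List


def create_groups(num_workers: int, model_parts: int, pattern: str) -> List[List[int]]:
    """Partition worker ranks into groups according to a communication pattern.

    Assumed (hypothetical) layout: rank r holds model partition
    r % model_parts of replica r // model_parts.
    """
    if model_parts <= 0 or num_workers % model_parts != 0:
        raise ValueError("num_workers must be a positive multiple of model_parts")

    key: Callable[[int], int]
    if pattern == "gradient_allreduce":
        # Workers holding the same model partition exchange gradients.
        key = lambda r: r % model_parts
    elif pattern == "activation_exchange":
        # Workers belonging to the same replica exchange activations.
        key = lambda r: r // model_parts
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    groups: Dict[int, List[int]] = {}
    for rank in range(num_workers):
        groups.setdefault(key(rank), []).append(rank)
    return [groups[k] for k in sorted(groups)]
```

With 8 workers and 2 model partitions, the gradient pattern yields two groups of four ranks, while the activation pattern yields four groups of two ranks.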

    INSTRUCTIONS FOR FUSED MULTIPLY-ADD OPERATIONS WITH VARIABLE PRECISION INPUT OPERANDS

    Publication No.: EP4325350A3

    Publication date: 2024-05-15

    Application No.: EP23213442.9

    Filing date: 2019-02-28

    Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor comprises: fetch circuitry to fetch a single multiply-accumulate (MAC) instruction having fields to indicate an opcode, a destination, a first source vector having a first element width, and a second source vector having a second element width that is smaller than the first element width; decode circuitry to decode the fetched single MAC instruction; and a single instruction multiple data (SIMD) execution circuit to execute the single MAC instruction and perform multiply-accumulate operations within each processing lane of a plurality of processing lanes, the multiply-accumulate operations in each processing lane including: multiplying a subset of elements of the first source vector by corresponding elements of the second source vector to produce a corresponding subset of products, and accumulating the subset of products with an accumulation data element corresponding to the processing lane to generate a result data element corresponding to the processing lane, each result data element having a width greater than the first element width and the second element width.
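
The per-lane multiply-accumulate described above can be modeled in software. The following is a minimal sketch, assuming signed 8-bit elements in the first source vector, signed 4-bit elements in the second, and a 32-bit accumulator per lane; these widths and the names `mac_lanes`, `wrap_signed`, and `check_range` are illustrative choices, not values stated in the abstract.

```python
from typing import List


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def check_range(values: List[int], bits: int) -> None:
    """Raise if any value does not fit in a signed field of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    for v in values:
        if not lo <= v <= hi:
            raise ValueError(f"{v} does not fit in {bits} signed bits")


def mac_lanes(a: List[int], b: List[int], acc: List[int], elems_per_lane: int,
              a_bits: int = 8, b_bits: int = 4, acc_bits: int = 32) -> List[int]:
    """Model one MAC instruction: each lane sums its products into its accumulator."""
    if len(a) != len(b) or len(a) != len(acc) * elems_per_lane:
        raise ValueError("vector lengths do not match the lane layout")
    check_range(a, a_bits)   # wider source elements
    check_range(b, b_bits)   # narrower source elements
    result = []
    for lane, c in enumerate(acc):
        start = lane * elems_per_lane
        stop = start + elems_per_lane
        products = [x * y for x, y in zip(a[start:stop], b[start:stop])]
        # The result element is wider than either source element.
        result.append(wrap_signed(c + sum(products), acc_bits))
    return result
```

For example, with two lanes of four elements each, `mac_lanes([1, 2, 3, 4, 5, 6, 7, 8], [1, -1, 2, -2, 3, -3, 4, -4], [10, 20], 4)` adds the lane sums -3 and -7 to the accumulators 10 and 20.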

    INSTRUCTIONS FOR FUSED MULTIPLY-ADD OPERATIONS WITH VARIABLE PRECISION INPUT OPERANDS

    Publication No.: EP4325350A2

    Publication date: 2024-02-21

    Application No.: EP23213442.9

    Filing date: 2019-02-28

    Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor comprises: fetch circuitry to fetch a single multiply-accumulate (MAC) instruction having fields to indicate an opcode, a destination, a first source vector having a first element width, and a second source vector having a second element width that is smaller than the first element width; decode circuitry to decode the fetched single MAC instruction; and a single instruction multiple data (SIMD) execution circuit to execute the single MAC instruction and perform multiply-accumulate operations within each processing lane of a plurality of processing lanes, the multiply-accumulate operations in each processing lane including: multiplying a subset of elements of the first source vector by corresponding elements of the second source vector to produce a corresponding subset of products, and accumulating the subset of products with an accumulation data element corresponding to the processing lane to generate a result data element corresponding to the processing lane, each result data element having a width greater than the first element width and the second element width.

    INSTRUCTIONS FOR FUSED MULTIPLY-ADD OPERATIONS WITH VARIABLE PRECISION INPUT OPERANDS

    Publication No.: EP3547117A2

    Publication date: 2019-10-02

    Application No.: EP19160082.4

    Filing date: 2019-02-28

    Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
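
To make the width constraints concrete, the sketch below models a single SIMD lane of the asymmetric FMA: it unpacks as many second-source elements as fit into the lane width, multiplies each by the corresponding first-source element, and adds the products to the previous destination contents. The permitted widths come from the abstract; the unsigned interpretation, the low-bits-first packing order, the unbounded Python destination, and the names `unpack` and `asymmetric_fma_lane` are assumptions made only for illustration.

```python
from typing import List

LANE_WIDTHS = (16, 32, 64)   # SIMD lane widths named in the abstract
FIRST_WIDTHS = (4, 8)        # first-source element widths
SECOND_WIDTHS = (1, 2, 4)    # second-source element widths


def unpack(word: int, width: int, count: int) -> List[int]:
    """Split a packed word into `count` unsigned fields, lowest bits first."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]


def asymmetric_fma_lane(dest: int, src1: List[int], packed_src2: int,
                        lane_width: int, first_width: int, second_width: int) -> int:
    """Accumulate src1[i] * src2[i] into dest for every element that fits in the lane."""
    if lane_width not in LANE_WIDTHS:
        raise ValueError("unsupported lane width")
    if first_width not in FIRST_WIDTHS or second_width not in SECOND_WIDTHS:
        raise ValueError("unsupported element width")
    count = lane_width // second_width   # elements of the second source per lane
    if len(src1) != count:
        raise ValueError(f"expected {count} first-source elements")
    if any(not 0 <= x < (1 << first_width) for x in src1):
        raise ValueError("first-source element out of range")
    if not 0 <= packed_src2 < (1 << lane_width):
        raise ValueError("packed second source does not fit in the lane")
    src2 = unpack(packed_src2, second_width, count)
    # Destination overflow is not modeled: Python integers are unbounded.
    return dest + sum(x * y for x, y in zip(src1, src2))
```

With a 16-bit lane and 4-bit second-source elements, four elements are processed per lane; with 1-bit elements the same lane holds sixteen.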

    INSTRUCTIONS FOR FUSED MULTIPLY-ADD OPERATIONS WITH VARIABLE PRECISION INPUT OPERANDS

    Publication No.: EP3547117A3

    Publication date: 2019-12-18

    Application No.: EP19160082.4

    Filing date: 2019-02-28

    Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
