-
Publication No.: EP3407183A3
Publication Date: 2019-02-13
Application No.: EP18170154.1
Filing Date: 2018-04-30
Applicant: INTEL Corporation
Inventor: DAS, Dipankar , GRAMUNT, Roger , SMELYANSKIY, Mikhail , CORBAL, Jesus , MUDIGERE, Dheevatsa , MELLEMPUDI, Naveen K. , HEINECKE, Alexander F.
IPC: G06F9/30
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a fetch unit to fetch a single instruction having multiple input operands, wherein the multiple input operands have an unequal bit-length, a first input operand having a first bit-length and a second input operand having a second bit-length; a decode unit to decode the single instruction into a decoded instruction; an operand length unit to determine the smaller of the first bit-length and the second bit-length; and a compute unit to perform a matrix operation on the multiple input operands to generate an output value having the smaller bit-length.
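The abstract above describes an instruction whose two operands have unequal bit-lengths and whose result is stored at the smaller of the two widths. The patent covers a full matrix operation; the following is only a minimal scalar sketch of one lane, assuming truncation as the narrowing policy (the function name and masking behavior are illustrative assumptions, not the claimed hardware):

```python
def mixed_width_multiply(a, a_bits, b, b_bits):
    """Hypothetical scalar model of one lane: multiply two operands of
    unequal bit-length and keep only the smaller of the two widths."""
    out_bits = min(a_bits, b_bits)   # role of the "operand length unit"
    product = a * b                  # role of the "compute unit"
    mask = (1 << out_bits) - 1
    return product & mask            # truncate to the smaller bit-length

# 200 (16-bit) * 3 (8-bit) = 600, truncated to 8 bits -> 88
print(mixed_width_multiply(200, 16, 3, 8))
```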
-
Publication No.: EP3547117A2
Publication Date: 2019-10-02
Application No.: EP19160082.4
Filing Date: 2019-02-28
Applicant: INTEL Corporation
Inventor: DAS, Dipankar , MELLEMPUDI, Naveen K. , DUTTA, Mrinmay , KUMAR, Arun , MUDIGERE, Dheevatsa , KUNDU, Abhisek
Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
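The asymmetric FMA described above packs as many narrow second-source elements as fit into one SIMD lane and accumulates their products into the destination. Below is a hedged pure-Python sketch of a single lane; the function name, flat-list operand representation, and element ordering are assumptions for illustration:

```python
def asymmetric_fma_lane(dst, src1, src2, lane_bits, second_width):
    """Toy model of one SIMD lane of the asymmetric FMA: process as many
    second_width-bit elements of src2 as fit into the lane, multiplying
    each by the corresponding src1 element and accumulating into dst."""
    assert lane_bits in (16, 32, 64)      # claimed SIMD lane widths
    assert second_width in (1, 2, 4)      # claimed second-source widths
    n = lane_bits // second_width         # elements of src2 per lane
    for a, b in zip(src1[:n], src2[:n]):
        dst += a * b                      # multiply-accumulate
    return dst

# 16-bit lane, 4-bit elements: 4 products accumulated onto dst=10
print(asymmetric_fma_lane(10, [1, 2, 3, 4], [5, 6, 7, 8], 16, 4))
```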
-
Publication No.: EP3783479A1
Publication Date: 2021-02-24
Application No.: EP20200955.1
Filing Date: 2018-04-30
Applicant: INTEL Corporation
Inventor: DAS, Dipankar , GRAMUNT, Roger , SMELYANSKIY, Mikhail , CORBAL, Jesus , MUDIGERE, Dheevatsa , MELLEMPUDI, Naveen K. , HEINECKE, Alexander F.
IPC: G06F9/30
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a fetch unit to fetch a single instruction having multiple input operands, wherein the multiple input operands have an unequal bit-length, a first input operand having a first bit-length and a second input operand having a second bit-length; a decode unit to decode the single instruction into a decoded instruction; an operand length unit to determine the smaller of the first bit-length and the second bit-length; and a compute unit to perform a matrix operation on the multiple input operands to generate an output value having the smaller bit-length.
-
Publication No.: EP4325350A3
Publication Date: 2024-05-15
Application No.: EP23213442.9
Filing Date: 2019-02-28
Applicant: Intel Corporation
Inventor: DAS, Dipankar , MELLEMPUDI, Naveen K. , DUTTA, Mrinmay , KUMAR, Arun , MUDIGERE, Dheevatsa , KUNDU, Abhisek
CPC classification number: G06F7/483 , G06F2207/3822 , G06F9/30036 , G06F9/30065 , G06N3/063 , G06F9/30014 , G06F7/5443 , G06F9/3887
Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor comprises: fetch circuitry to fetch a single multiply-accumulate (MAC) instruction having fields to indicate an opcode, a destination, a first source vector having a first element width, and a second source vector having a second element width that is smaller than the first element width; decode circuitry to decode the fetched single MAC instruction; and a single instruction multiple data (SIMD) execution circuit to execute the single MAC instruction and perform multiply-accumulate operations within each processing lane of a plurality of processing lanes, the multiply-accumulate operations in each processing lane including: multiplying a subset of elements of the first source vector by corresponding elements of the second source vector to produce a corresponding subset of products, and accumulating the subset of products with an accumulation data element corresponding to the processing lane to generate a result data element corresponding to the processing lane, each result data element having a width greater than the first element width and the second element width.
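Unlike the single-lane FMA above, this claim spreads the MAC across multiple processing lanes, each with its own accumulator that is wider than either source element width. A minimal sketch of that lane structure follows, assuming a flat element layout with a fixed per-lane element count (the function signature is an illustrative assumption; Python's unbounded integers stand in for the wider accumulator):

```python
def simd_mac(acc_lanes, src1, src2, per_lane):
    """Sketch of the single MAC instruction: each processing lane
    multiplies its subset of src1 elements by the corresponding src2
    elements and accumulates the products into that lane's accumulator."""
    results = []
    for lane, acc in enumerate(acc_lanes):
        lo = lane * per_lane                       # this lane's slice
        for a, b in zip(src1[lo:lo + per_lane], src2[lo:lo + per_lane]):
            acc += a * b                           # product into accumulator
        results.append(acc)                        # per-lane result element
    return results

# two lanes, two element pairs each: [0 + 1*5 + 2*6, 100 + 3*7 + 4*8]
print(simd_mac([0, 100], [1, 2, 3, 4], [5, 6, 7, 8], 2))
```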
-
Publication No.: EP4325350A2
Publication Date: 2024-02-21
Application No.: EP23213442.9
Filing Date: 2019-02-28
Applicant: Intel Corporation
Inventor: DAS, Dipankar , MELLEMPUDI, Naveen K. , DUTTA, Mrinmay , KUMAR, Arun , MUDIGERE, Dheevatsa , KUNDU, Abhisek
IPC: G06F7/48
Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor comprises: fetch circuitry to fetch a single multiply-accumulate (MAC) instruction having fields to indicate an opcode, a destination, a first source vector having a first element width, and a second source vector having a second element width that is smaller than the first element width; decode circuitry to decode the fetched single MAC instruction; and a single instruction multiple data (SIMD) execution circuit to execute the single MAC instruction and perform multiply-accumulate operations within each processing lane of a plurality of processing lanes, the multiply-accumulate operations in each processing lane including: multiplying a subset of elements of the first source vector by corresponding elements of the second source vector to produce a corresponding subset of products, and accumulating the subset of products with an accumulation data element corresponding to the processing lane to generate a result data element corresponding to the processing lane, each result data element having a width greater than the first element width and the second element width.
-
Publication No.: EP4089537A1
Publication Date: 2022-11-16
Application No.: EP22181956.8
Filing Date: 2018-04-30
Applicant: INTEL Corporation
Inventor: SRIDHARAN, Srinivas , MUDIGERE, Dheevatsa
Abstract: One embodiment provides for a system to configure distributed training of a neural network. The system includes memory to store a library to facilitate transmission of data during distributed training of the neural network; a network interface to transmit and receive gradient data associated with the trainable parameters; a general-purpose processor to execute instructions provided by the library, the instructions to cause the general-purpose processor to configure the network interface to transmit and receive the gradient data associated with the trainable parameters during a workflow of a machine learning framework; and a graphics processor to perform compute operations associated with machine learning framework workflow to generate the gradient data associated with the trainable parameters, wherein, based on the machine learning framework workflow, the library is to interleave the compute operations on the graphics processor with transmission and receipt of gradient data via the network interface.
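The key idea in this abstract is the interleaving of gradient compute with network transfer during distributed training. The sketch below models that overlap with a queue and a communication thread; every name here (`train_step`, `layer_backward`, `send_gradient`) is an illustrative assumption, not the library's actual API:

```python
import queue
import threading

def layer_backward(layer):
    """Stand-in for the per-layer backward compute (hypothetical)."""
    return layer * 2  # pretend this is the layer's gradient

def train_step(layers, send_gradient):
    """Sketch of the described interleaving: each layer's gradient is
    handed to a communication thread as soon as its backward pass
    finishes, so compute for earlier layers overlaps with network
    transfer of later layers' gradients."""
    q = queue.Queue()

    def comm_worker():
        while True:
            grad = q.get()
            if grad is None:          # sentinel: no more gradients
                break
            send_gradient(grad)       # network-interface transmit

    t = threading.Thread(target=comm_worker)
    t.start()
    for layer in reversed(layers):    # backward pass, last layer first
        q.put(layer_backward(layer))  # hand off without waiting
    q.put(None)
    t.join()

sent = []
train_step([1, 2, 3], sent.append)
print(sent)  # gradients arrive in backward order
```

In a real framework the hand-off point would be a hook into the framework's backward pass, and `send_gradient` would wrap an allreduce over the network interface rather than a local append.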
-
Publication No.: EP3547117A3
Publication Date: 2019-12-18
Application No.: EP19160082.4
Filing Date: 2019-02-28
Applicant: INTEL Corporation
Inventor: DAS, Dipankar , MELLEMPUDI, Naveen K. , DUTTA, Mrinmay , KUMAR, Arun , MUDIGERE, Dheevatsa , KUNDU, Abhisek
Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
-
Publication No.: EP3407183A2
Publication Date: 2018-11-28
Application No.: EP18170154.1
Filing Date: 2018-04-30
Applicant: INTEL Corporation
Inventor: DAS, Dipankar , GRAMUNT, Roger , SMELYANSKIY, Mikhail , CORBAL, Jesus , MUDIGERE, Dheevatsa , MELLEMPUDI, Naveen K. , HEINECKE, Alexander F.
IPC: G06F9/30
CPC classification number: G06F9/3887 , G06F9/30014 , G06F9/30036 , G06F9/3016 , G06F9/30181 , G06F9/30192 , G06F9/3851 , G06N3/00 , G06T1/20
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a fetch unit to fetch a single instruction having multiple input operands, wherein the multiple input operands have an unequal bit-length, a first input operand having a first bit-length and a second input operand having a second bit-length; a decode unit to decode the single instruction into a decoded instruction; an operand length unit to determine the smaller of the first bit-length and the second bit-length; and a compute unit to perform a matrix operation on the multiple input operands to generate an output value having the smaller bit-length.