-
1.
Publication No.: EP3937004A1
Publication Date: 2022-01-12
Application No.: EP21195277.5
Filing Date: 2018-03-26
Applicant: INTEL Corporation
Inventors: Kaul, Himanshu; Anders, Mark A.; Mathew, Sanu K.; Yao, Anbang; Ray, Joydeep; Tang, Ping T.; Strickland, Michael S.; Chen, Xiaoming; Appu, Abhishek R.; Koker, Altug; Sinha, Kamal; Vembu, Balaji; Galoppo von Borries, Nicolas C.; Nurvitadhi, Eriko; Barik, Rajkishore; Lin, Tsung-Han; Ranganathan, Vasanth; Jahagirdar, Sanjeev; Shpeisman, Tatiana
Abstract: The present disclosure provides a graphics processing unit, GPU, comprising: a plurality of memory controllers, a cache memory coupled with the plurality of memory controllers, and a graphics multiprocessor coupled with the cache memory and the plurality of memory controllers. The graphics multiprocessor has a single instruction, multiple thread, SIMT, architecture. The graphics multiprocessor includes a register file and a plurality of compute units coupled with the register file. The plurality of compute units includes a first compute unit to perform a mixed precision matrix operation and a second compute unit to perform, in response to a single instruction, multiple compute operations, wherein the multiple compute operations include a fused multiply-add operation and a rectified linear unit operation applied to an output of the fused multiply-add operation.
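
For illustration only, the combined operation named in this abstract (a multiply-add whose output feeds a rectified linear unit) can be sketched in software. The function names below are hypothetical and do not come from the patent, and the list of lanes merely stands in for SIMT threads.

```python
def fma_relu(a, b, c):
    """Return max(0, a*b + c): a multiply-add followed by a rectified linear unit."""
    # Assumption: plain Python arithmetic rounds after the multiply and again
    # after the add, whereas a hardware fused multiply-add rounds only once.
    return max(0.0, a * b + c)


def simt_fma_relu(a_lanes, b_lanes, c_lanes):
    """Apply the same operation to every lane, one lane per modelled thread."""
    return [fma_relu(a, b, c) for a, b, c in zip(a_lanes, b_lanes, c_lanes)]
```

Negative multiply-add results are clamped to zero, so a lane computing 2.0 * -4.0 + 1.0 yields 0.0.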
-
2.
Publication No.: EP4160387A1
Publication Date: 2023-04-05
Application No.: EP22210195.8
Filing Date: 2018-03-26
Applicant: Intel Corporation
Inventors: Kaul, Himanshu; Anders, Mark A.; Mathew, Sanu K.; Yao, Anbang; Ray, Joydeep; Tang, Ping T.; Strickland, Michael S.; Chen, Xiaoming; Appu, Abhishek R.; Koker, Altug; Sinha, Kamal; Vembu, Balaji; Galoppo von Borries, Nicolas C.; Nurvitadhi, Eriko; Barik, Rajkishore; Lin, Tsung-Han; Ranganathan, Vasanth; Jahagirdar, Sanjeev; Shpeisman, Tatiana
Abstract: The present disclosure provides a data processing system, a method, a computer-readable medium and a graphics processing unit, GPU, to accelerate machine-learning operations, the GPU comprising: a multiprocessor including a single instruction, multiple thread, SIMT, architecture, the multiprocessor to execute a single instruction across multiple threads; and a first compute unit included within the multiprocessor, the single instruction to cause the first compute unit to perform at least a two-dimensional matrix multiply and accumulate operation, wherein to perform the two-dimensional matrix multiply and accumulate operation includes to compute an intermediate product of 16-bit operands and to compute a 32-bit sum based on the intermediate product; wherein to compute a 32-bit sum based on the intermediate product, the first compute unit is to: perform a floating-point multiply of two or more 16-bit operands to generate the intermediate product; compute an intermediate sum based on the intermediate product; and convert the intermediate sum to a 32-bit result.
-
3.
Publication No.: EP3958116A1
Publication Date: 2022-02-23
Application No.: EP21202337.8
Filing Date: 2018-03-14
Applicant: INTEL Corporation
Inventors: Ould-Ahmed-Vall, ElMoustapha; Lakshmanan, Barath; Shpeisman, Tatiana; Ray, Joydeep; Tang, Ping T.; Strickland, Michael; Chen, Xiaoming; Yao, Anbang; Ashbaugh, Ben J.; Hurd, Linda L.; Ma, Liwei
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations. The compute apparatus comprises: instruction decode logic to decode a single instruction including multiple operands into a single decoded instruction, the multiple operands having differing precisions; and a general-purpose graphics compute unit including a first logic unit and a second logic unit, the general-purpose graphics compute unit to execute the single decoded instruction, wherein to execute the single decoded instruction includes to perform a first instruction operation on a first set of operands of the multiple operands at a first precision and simultaneously perform a second instruction operation on a second set of operands of the multiple operands at a second precision.
-
4.
Publication No.: EP4290370A1
Publication Date: 2023-12-13
Application No.: EP23181292.6
Filing Date: 2018-03-14
Applicant: Intel Corporation
Inventors: Ould-Ahmed-Vall, ElMoustapha; Lakshmanan, Barath; Shpeisman, Tatiana; Ray, Joydeep; Tang, Ping T.; Strickland, Michael; Chen, Xiaoming; Yao, Anbang; Ashbaugh, Ben J.; Hurd, Linda L.; Ma, Liwei
Abstract: One embodiment provides for a data processing system to perform machine learning operations. The data processing system comprises: a memory device configured to store instructions; and a graphics processing unit (GPU) to execute the instructions. The instructions include a first instruction and a second instruction, the first instruction to cause the GPU to perform a floating-point operation, and the second instruction to cause the GPU to perform an integer operation. A general-purpose graphics compute unit included within the GPU has a single instruction, multiple thread architecture and is to execute the first instruction concurrently with execution of the second instruction.
-
5.
Publication No.: EP4130976A1
Publication Date: 2023-02-08
Application No.: EP22198967.6
Filing Date: 2018-03-26
Applicant: INTEL Corporation
Inventors: Kaul, Himanshu; Anders, Mark A.; Mathew, Sanu K.; Yao, Anbang; Ray, Joydeep; Tang, Ping T.; Strickland, Michael S.; Chen, Xiaoming; Appu, Abhishek R.; Koker, Altug; Sinha, Kamal; Vembu, Balaji; Galoppo von Borries, Nicolas; Nurvitadhi, Eriko; Barik, Rajkishore; Lin, Tsung-Han; Ranganathan, Vasanth; Jahagirdar, Sanjeev; Shpeisman, Tatiana
Abstract: The present disclosure provides a method, a data processing system and a graphics processing unit, GPU, comprising a memory controller; and a graphics multiprocessor coupled with the memory controller. The graphics multiprocessor includes a plurality of compute units configured to execute an instruction to perform a mixed-precision matrix operation on first input data and second input data; generate intermediate data based on a result of the mixed-precision matrix operation; convert the intermediate data to a floating-point format determined based on statistics associated with first output data; and output, as second output data, the converted intermediate data in the determined floating-point format.
-
6.
Publication No.: EP3859519A1
Publication Date: 2021-08-04
Application No.: EP21165109.6
Filing Date: 2018-03-26
Applicant: INTEL Corporation
Inventors: Kaul, Himanshu; Anders, Mark A.; Mathew, Sanu K.; Yao, Anbang; Ray, Joydeep; Tang, Ping T.; Strickland, Michael S.; Chen, Xiaoming; Shpeisman, Tatiana; Appu, Abhishek R.; Koker, Altug; Sinha, Kamal; Vembu, Balaji; Galoppo von Borries, Nicolas C.; Nurvitadhi, Eriko; Barik, Rajkishore; Lin, Tsung-Han; Ranganathan, Vasanth; Jahagirdar, Sanjeev
Abstract: The present disclosure provides an apparatus comprising a memory interface and an array of processing clusters each including a multiprocessor unit coupled to the memory interface, wherein at least one multiprocessor unit is to execute a fused multiply-add instruction in parallel across multiple threads. The at least one multiprocessor unit comprises a register file to store data, and a compute unit coupled to the register file, wherein the compute unit is to execute a fused multiply-add instruction on matrix data. The compute unit comprises hardware logic to quantize the data from a higher precision, including a 32-bit floating point format, to a lower precision floating point format, including a 16-bit floating point format having a 1-bit sign, an 8-bit exponent, and a mantissa, wherein fewer bits are used for the mantissa of the lower precision floating point format; and one or more logic units to perform the fused multiply-add operation on the data in the lower precision floating point format.
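
As a rough software illustration of the quantization step in this abstract: a 16-bit format with a 1-bit sign and an 8-bit exponent leaves 7 bits for the mantissa, which resembles the layout commonly called bfloat16. The sketch below assumes that layout and assumes truncation as the rounding mode; the abstract states neither, and the function names are hypothetical.

```python
import struct


def quantize_fp32_to_16bit(x):
    """Return the upper 16 bits of the IEEE 754 binary32 encoding of x.

    The result keeps the sign bit, all 8 exponent bits, and the top 7 of the
    23 mantissa bits. Assumption: the dropped mantissa bits are truncated
    rather than rounded.
    """
    bits32 = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits32 >> 16


def expand_16bit_to_float(bits16):
    """Widen a 16-bit pattern back to a Python float by zero-filling the mantissa."""
    bits32 = (bits16 & 0xFFFF) << 16
    return struct.unpack('<f', struct.pack('<I', bits32))[0]
```

Because the exponent field is unchanged, the quantized value keeps the dynamic range of the 32-bit input and loses only precision: 3.14159 becomes 3.140625 under these assumptions.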
-
7.
Publication No.: EP3796154A1
Publication Date: 2021-03-24
Application No.: EP20207059.5
Filing Date: 2018-03-26
Applicant: INTEL Corporation
Inventors: Kaul, Himanshu; Anders, Mark A.; Mathew, Sanu K.; Yao, Anbang; Ray, Joydeep; Tang, Ping T.; Strickland, Michael S.; Chen, Xiaoming; Shpeisman, Tatiana; Appu, Abhishek R.; Koker, Altug; Sinha, Kamal; Vembu, Balaji; Galoppo von Borries, Nicolas C.; Nurvitadhi, Eriko; Barik, Rajkishore; Lin, Tsung-Han; Ranganathan, Vasanth; Jahagirdar, Sanjeev
Abstract: One embodiment provides for an apparatus comprising: an interconnect fabric; a memory interface coupled to the interconnect fabric; an input/output, IO, unit coupled to the interconnect fabric; and an array of multiprocessors coupled to the interconnect fabric, a multiprocessor in the array of multiprocessors to execute a mixed-precision instruction in parallel across multiple threads. The multiprocessor in the array of multiprocessors comprises: a plurality of registers to store packed floating-point operand values; and execution circuitry to execute one or more of the mixed-precision instructions to perform a fused multiply-accumulate operation. The execution circuitry comprises: a 16-bit multiplier to multiply a first 16-bit floating point source value and a second 16-bit floating point source value to generate an intermediate result; and a 32-bit accumulator to add the intermediate result to an accumulated floating-point value to generate a new accumulation result.
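
The multiply-then-accumulate data flow in this abstract can be approximated in software as follows. This is a sketch under assumptions: the 16-bit sources are taken to be IEEE 754 binary16 and the accumulator binary32, neither of which the abstract specifies, and the helper names are hypothetical.

```python
import struct


def to_f16(x):
    """Round x to the nearest IEEE 754 binary16 value (assumed source format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def to_f32(x):
    """Round x to the nearest IEEE 754 binary32 value (assumed accumulator format)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def fmac_mixed(a, b, acc):
    """Multiply two 16-bit sources and add the product to a 32-bit accumulator."""
    # The product of two binary16 values needs at most 22 significand bits,
    # so it is exact in Python's 64-bit float.
    product = to_f16(a) * to_f16(b)
    # The sum is formed in 64-bit arithmetic and then rounded to binary32;
    # real hardware may round differently in rare cases.
    return to_f32(to_f32(acc) + product)


def dot_mixed(xs, ys):
    """Accumulate a dot product of 16-bit inputs in a 32-bit accumulator."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fmac_mixed(x, y, acc)
    return acc
```

Keeping the accumulator wider than the sources limits the rounding error that builds up over long dot products, which is the usual motivation for this kind of mixed-precision arrangement.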
-
8.
Publication No.: EP3671439A1
Publication Date: 2020-06-24
Application No.: EP20155873.1
Filing Date: 2018-03-23
Applicant: Intel Corporation
Inventors: Nealis, Kevin; Yao, Anbang; Chen, Xiaoming; Ould-Ahmed-Vall, Elmoustapha; Baghsorkhi, Sara S.; Nurvitadhi, Eriko; Vembu, Balaji; Galoppo von Borries, Nicolas C.; Barik, Rajkishore; Lin, Tsung-Han; Sinha, Kamal
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands, including an input value and a quantized weight value associated with a neural network, and an arithmetic logic unit including a barrel shifter, an adder, and an accumulator register, wherein to execute the decoded instruction, the barrel shifter is to shift the input value by the quantized weight value to generate a shifted input value, and the adder is to add the shifted input value to a value stored in the accumulator register and update the value stored in the accumulator register.
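
The shift-and-add data path in this abstract can be sketched as below. The sketch assumes each quantized weight is a non-negative left-shift amount, that is, the weight is a power of two stored as its exponent; the abstract does not state the encoding, and the function name is hypothetical.

```python
def shift_accumulate(inputs, shift_weights, acc=0):
    """Accumulate inputs shifted by their quantized weights.

    Each step models the barrel shifter (x << s) followed by the adder
    updating the accumulator register. Assumption: s is a non-negative
    integer shift amount, so x << s equals x * 2**s.
    """
    for x, s in zip(inputs, shift_weights):
        shifted = x << s   # barrel shifter output
        acc += shifted     # adder writes back to the accumulator
    return acc
```

With power-of-two weights the multiplier in a multiply-accumulate is replaced by a shifter: inputs 3 and 5 with shift amounts 2 and 1 accumulate to 3*4 + 5*2 = 22.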
-
9.
Publication No.: EP3657323A1
Publication Date: 2020-05-27
Application No.: EP19218464.6
Filing Date: 2018-03-02
Applicant: Intel Corporation
Inventors: Appu, Abhishek R.; Koker, Altug; Hurd, Linda L.; Kim, Dukhwan; MacPherson, Mike B.; Weast, John C.; Chen, Feng; Akhbari, Farshad; Srinivasa, Narayan; Satish, Nadathur Rajagopalan; Tang, Ping T.; Ray, Joydeep; Strickland, Michael S.; Chen, Xiaoming; Yao, Anbang; Shpeisman, Tatiana
IPC Classification: G06F9/30, G06F9/38, G06F9/46, G06T1/20, G06N3/063, G06F3/14, G06N3/04, G06N3/08, G06T15/00, G09G5/36
Abstract: A graphics processing unit has a set of memory controllers, a cache memory, and at least one compute cluster with at least one graphics multiprocessor coupled to the set of memory controllers. The at least one graphics multiprocessor includes an instruction unit, a plurality of processing cores, and a shared memory coupled to the plurality of processing cores. The instruction unit is configured to dispatch instructions for execution by a processing core. Execution of a mixed precision fused multiply-accumulate, FMAC, operation is supported by a compute mechanism, wherein the FMAC operation comprises an arithmetic logic unit, ALU, operation of D = A * B + C, with A and B being 8-bit integer data elements and C being a 32-bit integer data element.
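
The operation D = A * B + C with 8-bit A and B and 32-bit C can be modelled in a few lines. The sketch assumes signed two's-complement operands and wraparound on 32-bit overflow; the abstract specifies neither signedness nor overflow behaviour, and the function names are hypothetical.

```python
def wrap_int32(v):
    """Reduce v to the signed 32-bit two's-complement range (assumed overflow rule)."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v & 0x80000000 else v


def int8_fmac(a, b, c):
    """Compute D = A * B + C for signed 8-bit A, B and signed 32-bit C."""
    if not (-128 <= a <= 127 and -128 <= b <= 127):
        raise ValueError("A and B must fit in a signed 8-bit integer")
    if not (-2**31 <= c <= 2**31 - 1):
        raise ValueError("C must fit in a signed 32-bit integer")
    # An 8-bit by 8-bit product always fits in 16 bits, so only the addition
    # of C can exceed the 32-bit range.
    return wrap_int32(a * b + c)
```

The wide accumulator lets many 8-bit products be summed before overflow becomes a concern: the largest single product, 127 * 127, is only 16129.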
-
10.
Publication No.: EP3637247A1
Publication Date: 2020-04-15
Application No.: EP19214829.4
Filing Date: 2018-03-26
Applicant: INTEL Corporation
Inventors: Kaul, Himanshu; Anders, Mark A.; Mathew, Sanu K.; Yao, Anbang; Ray, Joydeep; Tang, Ping T.; Strickland, Michael S.; Chen, Xiaoming; Shpeisman, Tatiana; Appu, Abhishek R.; Koker, Altug; Sinha, Kamal; Vembu, Balaji; Nurvitadhi, Eriko; Barik, Rajkishore; Lin, Tsung-Han; Ranganathan, Vasanth; Jahagirdar, Sanjeev; Galoppo von Borries, Nicolas
Abstract: One embodiment provides for a machine-learning hardware accelerator comprising a compute unit having an adder and a multiplier that are shared between an integer datapath and a floating-point datapath, the upper bits of input operands to the multiplier to be gated during floating-point operation.
-