-
Publication No.: EP4141674A1
Publication date: 2023-03-01
Application No.: EP22197260.7
Filing date: 2018-03-26
Applicant: INTEL Corporation
Inventors: OULD-AHMED-VALL, ElMoustapha , BAGHSORKHI, Sara S. , YAO, Anbang , NEALIS, Kevin , CHEN, Xiaoming , KOKER, Altug , APPU, Abhishek R. , WEAST, John C. , MACPHERSON, Mike B. , KIM, Dukhwan , HURD, Linda L. , ASHBAUGH, Ben J. , LAKSHMANAN, Barath , MA, Liwei , RAY, Joydeep , TANG, Ping T. , STRICKLAND, Michael S.
IPC classification: G06F9/50 , G06T15/00 , G06F9/30 , G06F9/38 , G06N3/04 , G06N3/063 , G06N3/08 , G06T1/20 , G06F12/0811 , G06N3/084 , G06N3/044 , G06N3/045
Abstract: The present disclosure provides a method and a graphics processing unit comprising a memory including a plurality of memory devices; compression logic to compress data to be written to the memory; and a streaming multiprocessor coupled with the memory. The streaming multiprocessor is to concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and is to execute multiple threads for multiple instructions. The multiple instructions include a first instruction to cause a first portion of the streaming multiprocessor to perform a floating-point operation on multiple floating-point input operands and a second instruction to cause a second portion of the streaming multiprocessor to perform an integer operation on multiple integer operands, the first instruction to execute concurrently with the second instruction.
-
Publication No.: EP3792839A1
Publication date: 2021-03-17
Application No.: EP20205015.9
Filing date: 2018-03-02
Applicant: INTEL Corporation
Inventors: APPU, Abhishek R. , KOKER, Altug , HURD, Linda L. , KIM, Dukhwan , MACPHERSON, Mike B. , WEAST, John C. , CHEN, Feng , AKHBARI, Farshad , SRINIVASA, Narayan , SATISH, Nadathur Rajagopalan , TANG, Ping T. , RAY, Joydeep , STRICKLAND, Michael S. , CHEN, Xiaoming , YAO, Anbang , SHPEISMAN, Tatiana
Abstract: The present disclosure provides an apparatus comprising an interconnect fabric comprising one or more switches, a memory interface coupled to the interconnect fabric, an input/output (IO) interface coupled to the interconnect fabric, and an array of processing clusters coupled to the interconnect fabric, the array of processing clusters to process instructions at variable precisions. At least one processing cluster comprises a plurality of registers to store source operands at variable precisions and an execution unit comprising a plurality of arithmetic logic units (ALUs) to execute one or more of the instructions to perform a mixed-precision fused multiply-accumulate (FMAC) operation of D = A ∗ B + C. Each source operand A, B, and C may be any of FP64, FP32, FP16, INT32, INT16, INT8 or INT4. An ALU is to generate the result operand D by multiplying source operand A with source operand B to generate an intermediate product, and adding the intermediate product to source operand C.
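The FMAC operation D = A ∗ B + C described in this abstract can be illustrated with a minimal software sketch. It assumes one particular operand combination (A and B in FP16, C and D in FP32); the helper names `to_fp16`, `to_fp32` and `fmac` are hypothetical and do not come from the patent, and Python's double-precision arithmetic stands in for the hardware's intermediate datapath.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value (returned as a Python float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fmac(a: float, b: float, c: float) -> float:
    """Sketch of D = A * B + C with FP16 A and B and FP32 C and D (assumed precisions)."""
    # The product of two FP16 values needs at most 22 significand bits,
    # so it is held exactly in a Python double: no rounding happens here.
    intermediate_product = to_fp16(a) * to_fp16(b)
    # Add the intermediate product to C, then round once to the FP32 result.
    return to_fp32(intermediate_product + to_fp32(c))


print(fmac(1.5, 2.0, 0.25))   # operands exactly representable in FP16
print(fmac(0.1, 1.0, 0.0))    # 0.1 is first rounded to FP16, so the result is not 0.1
```

The second call shows the effect of the low-precision inputs: the result equals the FP16 rounding of 0.1, not 0.1 itself.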
-
Publication No.: EP3792761A1
Publication date: 2021-03-17
Application No.: EP20205451.6
Filing date: 2018-03-26
Applicant: INTEL Corporation
Inventors: OULD-AHMED-VALL, ElMoustapha , BAGHSORKHI, Sara S. , YAO, Anbang , NEALIS, Kevin , CHEN, Xiaoming , KOKER, Altug , APPU, Abhishek R. , WEAST, John C. , MACPHERSON, Mike B. , KIM, Dukhwan , HURD, Linda L. , ASHBAUGH, Ben J. , LAKSHMANAN, Barath , MA, Liwei , RAY, Joydeep , TANG, Ping T. , STRICKLAND, Michael S.
Abstract: The present disclosure provides an interconnect fabric comprising one or more switches, a memory interface coupled to the interconnect fabric, an input/output (IO) interface coupled to the interconnect fabric and an array of processing clusters coupled to the interconnect fabric. The array of multiprocessors is to process mixed-precision instructions. At least one processing cluster comprises a plurality of registers to store a plurality of packed data elements at a first precision and an execution unit to execute mixed-precision dot-product instructions. The execution unit is to perform a plurality of multiplications of different pairs of the plurality of packed data elements to generate a corresponding plurality of products and to add the corresponding plurality of products to an accumulation value stored at a second precision greater than the first precision.
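A mixed-precision dot product of this kind can be sketched as follows. The abstract does not fix the precisions, so the sketch assumes signed 8-bit packed elements and a signed 32-bit accumulator, with two's-complement wrap-around on overflow (saturation would be an equally plausible choice). The names `wrap_signed` and `dot_accumulate` are hypothetical.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dot_accumulate(a, b, acc, elem_bits=8, acc_bits=32):
    """Multiply element pairs of a and b and add the products to a wider accumulator."""
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same length")
    lo, hi = -(1 << (elem_bits - 1)), (1 << (elem_bits - 1)) - 1
    for x, y in zip(a, b):
        if not (lo <= x <= hi and lo <= y <= hi):
            raise ValueError("element does not fit the packed element width")
        # Each product is added to the accumulator, which is kept at acc_bits.
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc


packed = [127, 127, 127, 127]
print(dot_accumulate(packed, packed, 0))               # 32-bit accumulator holds the sum
print(dot_accumulate(packed, packed, 0, acc_bits=8))   # accumulator as narrow as the inputs wraps
```

Comparing the two calls shows why the accumulation value is stored at a greater precision than the packed elements: a single INT8 product can already exceed the INT8 range.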
-
Publication No.: EP3607492A1
Publication date: 2020-02-12
Application No.: EP17904421.9
Filing date: 2017-04-07
Applicant: INTEL Corporation
Inventors: YAO, Anbang , WANG, Shandong , CHENG, Wenhua , CAI, Dongqi , WANG, Libin , XU, Lin , HU, Ping , GUO, Yiwen , YANG, Liu , HOU, Yuqing , SU, Zhou , CHEN, Yurong
IPC classification: G06K9/66
-
Publication No.: EP3594813A1
Publication date: 2020-01-15
Application No.: EP19182892.0
Filing date: 2018-03-26
Applicant: Intel Corporation
Inventors: OULD-AHMED-VALL, ElMoustapha , BAGHSORKHI, Sara S. , YAO, Anbang , NEALIS, Kevin , CHEN, Xiaoming , KOKER, Altug , APPU, Abhishek R. , WEAST, John C. , MACPHERSON, Mike B. , KIM, Dukhwan , HURD, Linda L. , ASHBAUGH, Ben J. , LAKSHMANAN, Barath , MA, Liwei , RAY, Joydeep , TANG, Ping T. , STRICKLAND, Michael S.
Abstract: An accelerator on a multi-chip module, a method of accelerating a machine-learning operation and a data processing system are provided. In one embodiment, the accelerator comprises: a memory stack including multiple memory dies; and a graphics processing unit (GPU) coupled with the memory stack via one or more memory controllers. The GPU includes a plurality of multiprocessors having a single instruction, multiple thread (SIMT) architecture, the multiprocessors to execute at least one single instruction, the at least one single instruction to accelerate a linear algebra subprogram associated with a machine learning framework. The at least one single instruction is to cause at least a portion of the GPU to perform a floating-point operation on input having differing precisions, the floating-point operation being a two-dimensional matrix multiply and accumulate operation. At least a portion of the plurality of multiprocessors includes a mixed precision core, the mixed precision core to execute a thread of the at least one single instruction, the mixed precision core including a floating-point unit to perform a first operation of the thread at a first precision and a second operation of the thread at a second precision. The first operation is a multiply having at least one 16-bit floating-point input and the second operation is an accumulate having a 32-bit floating-point input.
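The benefit of multiplying at 16-bit precision while accumulating at 32-bit precision can be seen in a small software model of the two-dimensional matrix multiply and accumulate operation D = A × B + C. This is only an illustrative sketch: the function names `round_to` and `matmul_acc` are hypothetical, matrices are plain nested lists, and rounding after every accumulation step is an assumption about the datapath rather than something the abstract states.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the IEEE 754 format named by a struct code ('<e' = FP16, '<f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def matmul_acc(a, b, c, acc_fmt="<f"):
    """D = A x B + C with FP16 multiply inputs; the accumulator is kept in acc_fmt."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[round_to(acc_fmt, c[i][j]) for j in range(cols)] for i in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = d[i][j]
            for p in range(inner):
                # First operation: multiply with 16-bit floating-point inputs.
                product = round_to("<e", a[i][p]) * round_to("<e", b[p][j])
                # Second operation: accumulate, rounding to the accumulator format.
                acc = round_to(acc_fmt, acc + product)
            d[i][j] = acc
    return d


A = [[0.5, 0.5, 0.5, 0.5]]           # 1 x 4
B = [[1.0], [1.0], [1.0], [1.0]]     # 4 x 1
C = [[2048.0]]                       # 1 x 1

print(matmul_acc(A, B, C))                  # FP32 accumulator
print(matmul_acc(A, B, C, acc_fmt="<e"))    # FP16 accumulator, for comparison
```

Near 2048 the spacing between adjacent FP16 values is 2, so with an FP16 accumulator each added 0.5 is rounded away and the sum never moves, whereas the FP32 accumulator retains every contribution.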
-
Publication No.: EP3396534A3
Publication date: 2019-01-23
Application No.: EP18159835.0
Filing date: 2018-03-02
Applicant: INTEL Corporation
Inventors: BARIK, Rajkishore , OULD-AHMED-VALL, Elmoustapha , CHEN, Xiaoming , SRIVASTAVA, Dhawal , YAO, Anbang , NEALIS, Kevin , NURVITADHI, Eriko , BAGHSORKHI, Sara S. , VEMBU, Balaji , SHPEISMAN, Tatiana , TANG, Ping T.
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to perform one or more machine learning operations, wherein the decode unit, based on parameters of the one or more machine learning operations, is to request a scheduler to schedule the one or more machine learning operations to one of an array of programmable compute units and a fixed function compute unit.
-
Publication No.: EP3396530A3
Publication date: 2018-11-14
Application No.: EP18161820.8
Filing date: 2018-03-14
Applicant: Intel Corporation
Inventors: OULD-AHMED-VALL, Elmoustapha , LAKSHMANAN, Barath , SHPEISMAN, Tatiana , RAY, Joydeep , TANG, Ping T. , STRICKLAND, Michael , CHEN, Xiaoming , YAO, Anbang , ASHBAUGH, Ben J. , HURD, Linda L. , MA, Liwei
CPC classification: G06F9/3887 , G06F9/3001 , G06F9/30014 , G06F9/30036 , G06F9/30094 , G06F9/30109 , G06F9/30112 , G06F9/3016 , G06F9/3802 , G06F9/3836 , G06F9/3851 , G06F9/50 , G06F13/4068 , G06F13/4282 , G06F15/80 , G06F2213/0026 , G06N3/00 , G06N99/005 , G06T1/20
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising instruction decode logic to decode a single instruction including multiple operands into a single decoded instruction, the multiple operands having differing precisions, and a general-purpose graphics compute unit including a first logic unit and a second logic unit, the general-purpose graphics compute unit to execute the single decoded instruction, wherein to execute the single decoded instruction includes to perform a first instruction operation on a first set of operands of the multiple operands at a first precision and to simultaneously perform a second instruction operation on a second set of operands of the multiple operands at a second precision.
-
Publication No.: EP3396528A1
Publication date: 2018-10-31
Application No.: EP18162625.0
Filing date: 2018-03-19
Applicant: INTEL Corporation
Inventors: KOKER, Altug , APPU, Abhishek R. , SINHA, Kamal , RAY, Joydeep , VEMBU, Balaji , OULD-AHMED-VALL, Elmoustapha , BAGHSORKHI, Sara S. , YAO, Anbang , NEALIS, Kevin , CHEN, Xiaoming , WEAST, John C. , GOTTSCHLICH, Justin E. , SURTI, Prasoonkumar , SAKTHIVEL, Chandrasekaran , AKHBARI, Farshad , SATISH, Nadathur Rajagopalan , MA, Liwei , BOTTLESON, Jeremy , NURVITADHI, Eriko , SCHLUESSLER, Travis T. , SHAH, Ankur N. , KENNEDY, Jonathan , RANGANATHAN, Vasanth , JAHAGIRDAR, Sanjeev
CPC classification: G06N3/08 , G06F9/505 , G06N3/0445 , G06N3/0454 , G06N3/0481 , G06N3/063 , G06N99/005
Abstract: In an example, an apparatus comprises a plurality of execution units comprising at least a first type of execution unit and a second type of execution unit and logic, at least partially including hardware logic, to analyze a workload and assign the workload to one of the first type of execution unit or the second type of execution unit. Other embodiments are also disclosed and claimed.
-
Publication No.: EP4369252A2
Publication date: 2024-05-15
Application No.: EP24166744.3
Filing date: 2018-03-19
Applicant: INTEL Corporation
Inventors: SINHA, Kamal , VEMBU, Balaji , NURVITADHI, Eriko , GALOPPO VON BORRIES, Nicolas C. , BARIK, Rajkishore , LIN, Tsung-Han , RAY, Joydeep , TANG, Ping T. , STRICKLAND, Michael S. , CHEN, Xiaoming , YAO, Anbang , SHPEISMAN, Tatiana , APPU, Abhishek R. , KOKER, Altug , AKHBARI, Farshad , SRINIVASA, Narayan , CHEN, Feng , KIM, Dukhwan , SATISH, Nadathur Rajagopalan , WEAST, John C. , MACPHERSON, Mike B. , HURD, Linda L. , RANGANATHAN, Vasanth , JAHAGIRDAR, Sanjeev S.
IPC classification: G06N3/045
CPC classification: G06F15/78 , G06F9/30014 , G06F9/30036 , G06F1/3287 , G06F1/3293 , G06T15/005 , G06F15/76 , G06N3/084 , G06N3/063 , G06T1/20 , Y02D10/00 , G06N3/044 , G06N3/045
Abstract: In an example, an apparatus comprises a compute engine comprising a high precision component and a low precision component; and logic, at least partially including hardware logic, to receive instructions in the compute engine; select at least one of the high precision component or the low precision component to execute the instructions; and apply a gate to at least one of the high precision component or the low precision component to execute the instructions. Other embodiments are also disclosed and claimed.
-
Publication No.: EP4242838A2
Publication date: 2023-09-13
Application No.: EP23182458.2
Filing date: 2018-03-26
Applicant: INTEL Corporation
Inventors: KAUL, Himanshu , ANDERS, Mark A. , MATHEW, Sanu K. , YAO, Anbang , RAY, Joydeep , TANG, Ping T. , STRICKLAND, Michael S. , CHEN, Xiaoming , SHPEISMAN, Tatiana , APPU, Abhishek R. , KOKER, Altug , SINHA, Kamal , VEMBU, Balaji , GALOPPO VON BORRIES, Nicolas C. , NURVITADHI, Eriko , BARIK, Rajkishore , LIN, Tsung-Han , RANGANATHAN, Vasanth , JAHAGIRDAR, Sanjeev
IPC classification: G06F9/30
Abstract: One embodiment provides for a processing unit comprising fetch and decode circuitry to fetch and decode a floating-point multiply-accumulate instruction; and execution circuitry to execute the floating-point multiply-accumulate instruction. The execution circuitry comprises mantissa multiplication circuitry, wherein the mantissa multiplication circuitry is shared with an integer datapath of the execution circuitry, wherein responsive to the floating-point multiply-accumulate instruction, the mantissa multiplication circuitry is to perform a multiplication operation with a mantissa value of each 16-bit floating-point data element of a first plurality of 16-bit floating-point data elements and a mantissa value of a corresponding 16-bit floating-point data element of a second plurality of 16-bit floating-point data elements to generate a corresponding plurality of mantissa results; exponent processing circuitry, responsive to the floating-point multiply-accumulate instruction, to perform an operation with an exponent value of each 16-bit floating-point data element of the first plurality of 16-bit floating-point data elements and an exponent value of each corresponding 16-bit floating-point data element of the second plurality of 16-bit floating-point data elements to generate a corresponding plurality of exponent results; circuitry to process the plurality of mantissa results and the plurality of exponent results to generate a corresponding floating-point product; and adder circuitry to generate a plurality of result floating-point values, each result floating-point value comprising a sum of one or more floating-point products of the plurality of floating-point products and a corresponding accumulated floating-point value of a plurality of accumulated floating-point values.
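The split between mantissa multiplication and exponent processing can be illustrated in software by decomposing a 16-bit floating-point multiply into an integer multiply of the significands and an addition of the exponents. The sketch below assumes IEEE 754 binary16 encoding and handles normal values only (no zeros, subnormals, infinities or NaNs); the abstract says only that "an operation" is performed on the exponents, so treating it as addition is an assumption. The names `fp16_fields` and `fp16_product` are hypothetical.

```python
import struct


def fp16_fields(x: float):
    """Return (sign, biased exponent, 11-bit integer significand) of x as binary16."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0 or exponent == 0x1F:
        raise ValueError("this sketch handles normal FP16 values only")
    # Restore the implicit leading 1 so the significand is a plain integer.
    return sign, exponent, fraction | 0x400


def fp16_product(a: float, b: float) -> float:
    """Multiply two FP16 values via an integer mantissa multiply and an exponent add."""
    sign_a, exp_a, mant_a = fp16_fields(a)
    sign_b, exp_b, mant_b = fp16_fields(b)
    mantissa_result = mant_a * mant_b            # integer multiply, at most 22 bits
    exponent_result = (exp_a - 15) + (exp_b - 15)  # add the unbiased exponents
    sign = -1.0 if sign_a ^ sign_b else 1.0
    # Each significand carries 10 fraction bits, so the integer product carries 20.
    return sign * mantissa_result * 2.0 ** (exponent_result - 20)


print(fp16_product(1.5, -2.5))
```

Because the mantissa step is an ordinary integer multiplication, it is the kind of operation that could share a multiplier with an integer datapath, as the abstract describes. The unrounded product returned here would then be passed to an adder together with the accumulated value.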