-
Publication No.: US20240045682A1
Publication Date: 2024-02-08
Application No.: US17958370
Filing Date: 2022-10-01
Applicant: Intel Corporation
Inventor: Alexander Heinecke, Menachem Adelman, Evangelos Georganas, Amit Gradstein, Christopher Hughes, Naveen Mellempudi, Simon Rubanovich, Uri Sherman, Zeev Sperber
IPC: G06F9/30
CPC classification number: G06F9/30145, G06F9/30036, G06F9/3001
Abstract: Techniques for scale and reduction of FP8 data elements are described. An exemplary instruction includes fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a packed data destination operand, wherein the opcode is to indicate that execution circuitry is to perform, for each data element position of the packed data source operands, a floating point scale operation of an FP8 data element of the first packed data source by multiplying the data element by a power of 2 value, wherein a value of the exponent of the power of 2 value is a floor value of an FP8 data element of the second packed data source, and store a result of the floating point scale operation into a corresponding data element position of the packed data destination operand.
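The per-element scale operation in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: plain Python floats stand in for FP8 values, the function name `scale_packed` is hypothetical, and FP8 encoding, rounding, saturation, and special-value handling are not modeled.

```python
import math


def scale_packed(src1, src2):
    """Return src1[i] * 2**floor(src2[i]) for every element position.

    Assumption: Python floats stand in for FP8 elements, so no FP8
    rounding or saturation is applied to the result.
    """
    if len(src1) != len(src2):
        raise ValueError("packed sources must have the same number of elements")
    # math.ldexp(a, e) computes a * 2**e exactly (barring overflow/underflow).
    return [math.ldexp(a, math.floor(b)) for a, b in zip(src1, src2)]


if __name__ == "__main__":
    print(scale_packed([1.5, -2.0, 3.0], [2.7, -1.2, 0.0]))
```

Note that the floor is taken toward negative infinity, so a second-source value of -1.2 selects the exponent -2, not -1.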
-
Publication No.: US20230418602A1
Publication Date: 2023-12-28
Application No.: US18456699
Filing Date: 2023-08-28
Applicant: Intel Corporation
Inventor: Robert Valentine, Galina Ryvchin, Piotr Majcher, Mark J. Charney, Elmoustapha Ould-Ahmed-Vall, Jesus Corbal, Milind B. Girkar, Zeev Sperber, Simon Rubanovich, Amit Gradstein
CPC classification number: G06F9/30014, G06F7/5443, G06F9/3818, G06F9/30036, G06F9/30105, G06F9/30018
Abstract: Embodiments of systems, apparatuses, and methods for fused multiply-add are described. In some embodiments, a decoder decodes a single instruction having an opcode, a destination field representing a destination operand, and fields for a first, second, and third packed data source operand, wherein packed data elements of the first and second packed data source operands are of a first size that is different from a second size of packed data elements of the third packed data source operand. Execution circuitry then executes the decoded single instruction to perform, for each packed data element position of the destination operand, a multiplication of M N-sized packed data elements from the first and second packed data sources that correspond to a packed data element position of the third packed data source, an addition of the results of these multiplications to a full-sized packed data element of a packed data element position of the third packed data source, and storage of the addition result in a packed data element position of the destination corresponding to the packed data element position of the third packed data source, wherein M is equal to the full-sized packed data element divided by N.
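As a rough illustration of the data flow in this abstract, the following Python sketch multiplies M narrow elements pairwise and adds the products to one full-sized accumulator element. The function name, the default widths (32-bit accumulators, 8-bit sources, so M = 4), and the use of unbounded Python integers are assumptions for illustration; wraparound and saturation are not modeled.

```python
def packed_multiply_accumulate(src1, src2, src3, full_bits=32, n_bits=8):
    """For each element of src3, add the products of its M narrow pairs.

    M = full_bits // n_bits narrow elements of src1 and src2 correspond to
    one full-sized element of src3. Widths are illustrative assumptions.
    """
    m = full_bits // n_bits
    if len(src1) != len(src2) or len(src1) != m * len(src3):
        raise ValueError("sources must hold M narrow elements per accumulator")
    dest = []
    for i, acc in enumerate(src3):
        lo = i * m
        products = (a * b for a, b in zip(src1[lo:lo + m], src2[lo:lo + m]))
        dest.append(acc + sum(products))
    return dest


if __name__ == "__main__":
    print(packed_multiply_accumulate([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8, [10, 20]))
```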
-
53.
Publication No.: US20230315450A1
Publication Date: 2023-10-05
Application No.: US18313026
Filing Date: 2023-05-05
Applicant: Intel Corporation
Inventor: Naveen Mellempudi, Alexander F. Heinecke, Robert Valentine, Mark J. Charney, Christopher J. Hughes, Evangelos Georganas, Zeev Sperber, Amit Gradstein, Simon Rubanovich
CPC classification number: G06F9/30036, G06F7/49915, G06F9/30196, G06F9/3887
Abstract: Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. A processor embodiment includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a destination matrix having single-precision elements, a first source matrix, and a second source matrix, the source matrices having elements that each comprise a quadruple of 8-bit floating-point values, the opcode to indicate execution circuitry is to cause, for each element of the first source matrix and corresponding element of the second source matrix, a conversion of the 8-bit floating-point values to single-precision values, a multiplication of different pairs of converted single-precision values to generate a plurality of results, and an accumulation of the results with previous contents of a corresponding element of the destination matrix, decode circuitry to decode the fetched instruction, and the execution circuitry to respond to the decoded instruction as specified by the opcode.
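The convert, multiply, and accumulate flow described here can be sketched in Python under several assumptions that the abstract does not state: the FP8 format is taken to be E5M2 (1 sign, 5 exponent, 2 mantissa bits), which can be decoded as the high byte of an IEEE half-precision value; source elements are paired in the usual matrix-multiply pattern; and accumulation uses Python floats rather than true single precision. Both function names are hypothetical.

```python
import struct


def fp8_e5m2_to_float(byte):
    """Decode an E5M2 byte by treating it as the high byte of a float16."""
    return struct.unpack("<e", bytes([0, byte]))[0]


def fp8_dot_product_accumulate(dest, a, b):
    """dest[m][n] += sum over k and 4 lanes of a[m][k][lane] * b[k][n][lane].

    a and b hold quadruples of FP8 bytes; dest holds float accumulators.
    The index pairing is an assumed matrix-multiply correspondence.
    """
    for m, a_row in enumerate(a):
        for k, a_quad in enumerate(a_row):
            for n, b_quad in enumerate(b[k]):
                for x, y in zip(a_quad, b_quad):
                    dest[m][n] += fp8_e5m2_to_float(x) * fp8_e5m2_to_float(y)
    return dest


if __name__ == "__main__":
    a = [[(0x3C, 0x40, 0x42, 0x38)]]  # 1.0, 2.0, 3.0, 0.5
    b = [[(0x3C, 0x3C, 0x3C, 0x40)]]  # 1.0, 1.0, 1.0, 2.0
    print(fp8_dot_product_accumulate([[10.0]], a, b))
```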
-
54.
Publication No.: US20230195465A1
Publication Date: 2023-06-22
Application No.: US17558368
Filing Date: 2021-12-21
Applicant: Intel Corporation
Inventor: Stanislav Shwartsman, Elad Shtiegmann, Sumeet Bandishte, Lihu Rappoport, Zeev Sperber, Jayesh Gaur
CPC classification number: G06F9/3802, G06F9/3818, G06F9/30032
Abstract: Techniques and mechanisms for efficiently making value prediction information available for use in a processor are described. In an embodiment, execution of an instruction is to include a loading of some data to a first location (e.g., a first register). A decoder of the processor accesses reference information which indicates that the execution is to comprise multiple micro-operations (μops) including a LoadCheck μop and a Move μop. The LoadCheck μop loads a first value to the first location, and checks whether the loaded first value is the same as a previously-determined second value which represents a prediction of what the first value would be. The Move μop moves the second value to the first location. In another embodiment, the Move μop is scheduled for execution out-of-order with respect to the LoadCheck μop, resulting in an early availability of the second value for access in a register file by another μop.
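A toy Python model of the two micro-operations may help make the ordering concrete. Everything here is a simplification under assumed names: registers and memory are dictionaries, `move_uop` and `load_check_uop` are hypothetical functions, and what the processor does after a failed check is not described in the abstract, so the sketch only reports the mismatch.

```python
def move_uop(regs, dst, predicted):
    """Move: write the predicted value to the destination register early."""
    regs[dst] = predicted


def load_check_uop(regs, memory, dst, addr, predicted):
    """LoadCheck: load the real value, then report whether the prediction held."""
    regs[dst] = memory[addr]
    return regs[dst] == predicted


if __name__ == "__main__":
    memory = {0x100: 42}
    regs = {}
    move_uop(regs, "r1", 42)      # issued first, out of order
    early = regs["r1"] + 1        # a dependent op can already consume r1
    ok = load_check_uop(regs, memory, "r1", 0x100, 42)
    print(early, ok)
```

Running the Move first is what gives dependent operations early access to the predicted value; the LoadCheck later overwrites the register with the loaded value either way.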
-
Publication No.: US11656971B2
Publication Date: 2023-05-23
Application No.: US17582051
Filing Date: 2022-01-24
Applicant: Intel Corporation
Inventor: Adarsh Chauhan, Jayesh Gaur, Franck Sala, Lihu Rappoport, Zeev Sperber, Adi Yoaz, Sreenivas Subramoney
CPC classification number: G06F11/3476, G06F9/24, G06F9/3836, G06F11/3024, G06F11/3055, G06F15/7875
Abstract: A processor comprises a microarchitectural feature and dynamic tuning unit (DTU) circuitry. The processor executes a program for first and second execution windows with the microarchitectural feature disabled and enabled, respectively. The DTU circuitry automatically determines whether the processor achieved worse performance in the second execution window. In response to determining that the processor achieved worse performance in the second execution window, the DTU circuitry updates a usefulness state for a selected address of the program to denote worse performance. In response to multiple consecutive determinations that the processor achieved worse performance with the microarchitectural feature enabled, the DTU circuitry automatically updates the usefulness state to denote a confirmed bad state. In response to the usefulness state denoting the confirmed bad state, the DTU circuitry automatically disables the microarchitectural feature for the selected address for execution windows after the second execution window. Other embodiments are described and claimed.
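The usefulness-state logic in this abstract resembles a small per-address state machine, which the Python sketch below approximates. The class name, the cycle-count performance metric, the threshold of two consecutive worse results, the reset on a non-worse result, and the sticky confirmed-bad state are all assumptions made for illustration.

```python
class DynamicTuningUnit:
    """Toy model: track consecutive 'worse with feature enabled' windows."""

    def __init__(self, confirm_after=2):
        self.confirm_after = confirm_after  # assumed threshold
        self.worse_streak = {}              # address -> consecutive worse count

    def is_confirmed_bad(self, address):
        return self.worse_streak.get(address, 0) >= self.confirm_after

    def record(self, address, cycles_disabled, cycles_enabled):
        """Compare a window pair; more cycles with the feature on means worse."""
        if cycles_enabled > cycles_disabled:
            self.worse_streak[address] = self.worse_streak.get(address, 0) + 1
        elif not self.is_confirmed_bad(address):
            self.worse_streak[address] = 0  # streak broken before confirmation

    def feature_enabled(self, address):
        """The feature stays off for an address once it is confirmed bad."""
        return not self.is_confirmed_bad(address)


if __name__ == "__main__":
    dtu = DynamicTuningUnit()
    dtu.record(0x40, 100, 120)
    dtu.record(0x40, 100, 130)
    print(dtu.feature_enabled(0x40))
```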
-
Publication No.: US11567765B2
Publication Date: 2023-01-31
Application No.: US16487766
Filing Date: 2017-07-01
Applicant: Intel Corporation
Inventor: Robert Valentine, Menachem Adelman, Milind B. Girkar, Zeev Sperber, Mark J. Charney, Bret L. Toll, Rinat Rappoport, Jesus Corbal, Stanislav Shwartsman, Dan Baum, Igor Yanover, Alexander F. Heinecke, Barukh Ziv, Elmoustapha Ould-Ahmed-Vall, Yuri Gebil
Abstract: Embodiments detailed herein relate to matrix operations, in particular the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand.
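A strided tile load of this kind can be pictured with a short Python sketch. Memory is modeled as a flat list of elements, and the function name and its parameters (`base`, `stride`, `rows`, `cols`) are hypothetical stand-ins for the source memory information and the configured tile shape.

```python
def tile_load(memory, base, stride, rows, cols):
    """Load `rows` groups of `cols` elements, each group `stride` apart.

    Row r of the tile comes from memory[base + r*stride : ... + cols].
    Units are elements, not bytes (an assumption of this sketch).
    """
    tile = []
    for r in range(rows):
        start = base + r * stride
        row = memory[start:start + cols]
        if len(row) != cols:
            raise IndexError("tile row extends past the end of memory")
        tile.append(list(row))
    return tile


if __name__ == "__main__":
    memory = list(range(20))
    print(tile_load(memory, base=2, stride=5, rows=3, cols=4))
```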
-
Publication No.: US11507369B2
Publication Date: 2022-11-22
Application No.: US17465905
Filing Date: 2021-09-03
Applicant: Intel Corporation
Inventor: Robert Valentine, Galina Ryvchin, Piotr Majcher, Mark J. Charney, Elmoustapha Ould-Ahmed-Vall, Jesus Corbal, Milind B. Girkar, Zeev Sperber, Simon Rubanovich, Amit Gradstein
Abstract: Embodiments of systems, apparatuses, and methods for fused multiply-add are described. In some embodiments, a decoder decodes a single instruction having an opcode, a destination field representing a destination operand, and fields for a first, second, and third packed data source operand, wherein packed data elements of the first and second packed data source operands are of a first size that is different from a second size of packed data elements of the third packed data source operand. Execution circuitry then executes the decoded single instruction to perform, for each packed data element position of the destination operand, a multiplication of M N-sized packed data elements from the first and second packed data sources that correspond to a packed data element position of the third packed data source, an addition of the results of these multiplications to a full-sized packed data element of a packed data element position of the third packed data source, and storage of the addition result in a packed data element position of the destination corresponding to the packed data element position of the third packed data source, wherein M is equal to the full-sized packed data element divided by N.
-
58.
Publication No.: US11455167B2
Publication Date: 2022-09-27
Application No.: US16701082
Filing Date: 2019-12-02
Applicant: Intel Corporation
Inventor: Raanan Sade, Thierry Pons, Amit Gradstein, Zeev Sperber, Mark J. Charney, Robert Valentine, Eyal Oz-Sinay
Abstract: Disclosed embodiments relate to efficient complex vector multiplication. In one example, an apparatus includes execution circuitry, responsive to an instruction having fields to specify multiplier, multiplicand, and summand complex vectors, to perform two operations: first, to generate a double-even multiplicand by duplicating even elements of the specified multiplicand, and to generate a temporary vector using a fused multiply-add (FMA) circuit having A, B, and C inputs set to the specified multiplier, the double-even multiplicand, and the specified summand, respectively, and second, to generate a double-odd multiplicand by duplicating odd elements of the specified multiplicand, to generate a swapped multiplier by swapping even and odd elements of the specified multiplier, and to generate a result using a second FMA circuit having its even product negated, and having A, B, and C inputs set to the swapped multiplier, the double-odd multiplicand, and the temporary vector, respectively.
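The two FMA passes in this abstract can be traced with a small Python sketch. It assumes complex vectors are stored interleaved as [re0, im0, re1, im1, ...], so even positions hold real parts and odd positions hold imaginary parts; the function name is hypothetical and each FMA is modeled as an ordinary multiply followed by an add, without single-rounding behavior.

```python
def complex_fma(multiplier, multiplicand, summand):
    """Compute multiplier * multiplicand + summand on interleaved complex data."""
    n = len(multiplier)
    if n % 2 or n != len(multiplicand) or n != len(summand):
        raise ValueError("vectors must have equal, even length")
    # Pass 1: duplicate even (real) elements of the multiplicand, then FMA.
    double_even = [multiplicand[i - i % 2] for i in range(n)]
    temp = [a * b + c for a, b, c in zip(multiplier, double_even, summand)]
    # Pass 2: duplicate odd (imaginary) elements and swap multiplier pairs.
    double_odd = [multiplicand[i - i % 2 + 1] for i in range(n)]
    swapped = [multiplier[i ^ 1] for i in range(n)]
    # Second FMA negates the product in even positions before adding temp.
    return [
        (-(a * b) if i % 2 == 0 else a * b) + t
        for i, (a, b, t) in enumerate(zip(swapped, double_odd, temp))
    ]


if __name__ == "__main__":
    # (1+2j) * (3+4j) + (5+6j) = 0+16j
    print(complex_fma([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))
```

After the first pass the temporary holds a_re*b_re + c_re and a_im*b_re + c_im; the second pass subtracts a_im*b_im in the real lane and adds a_re*b_im in the imaginary lane, which completes the complex product.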
-
Publication No.: US20220206925A1
Publication Date: 2022-06-30
Application No.: US17582051
Filing Date: 2022-01-24
Applicant: Intel Corporation
Inventor: Adarsh Chauhan, Jayesh Gaur, Franck Sala, Lihu Rappoport, Zeev Sperber, Adi Yoaz, Sreenivas Subramoney
Abstract: A processor comprises a microarchitectural feature and dynamic tuning unit (DTU) circuitry. The processor executes a program for first and second execution windows with the microarchitectural feature disabled and enabled, respectively. The DTU circuitry automatically determines whether the processor achieved worse performance in the second execution window. In response to determining that the processor achieved worse performance in the second execution window, the DTU circuitry updates a usefulness state for a selected address of the program to denote worse performance. In response to multiple consecutive determinations that the processor achieved worse performance with the microarchitectural feature enabled, the DTU circuitry automatically updates the usefulness state to denote a confirmed bad state. In response to the usefulness state denoting the confirmed bad state, the DTU circuitry automatically disables the microarchitectural feature for the selected address for execution windows after the second execution window. Other embodiments are described and claimed.
-
60.
Publication No.: US20220206801A1
Publication Date: 2022-06-30
Application No.: US17134373
Filing Date: 2020-12-26
Applicant: Intel Corporation
Inventor: Naveen Mellempudi, Alexander F. Heinecke, Robert Valentine, Mark J. Charney, Christopher J. Hughes, Evangelos Georganas, Zeev Sperber, Amit Gradstein, Simon Rubanovich
Abstract: Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. A processor embodiment includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a destination matrix having single-precision elements, a first source matrix, and a second source matrix, the source matrices having elements that each comprise a quadruple of 8-bit floating-point values, the opcode to indicate execution circuitry is to cause, for each element of the first source matrix and corresponding element of the second source matrix, a conversion of the 8-bit floating-point values to single-precision values, a multiplication of different pairs of converted single-precision values to generate a plurality of results, and an accumulation of the results with previous contents of a corresponding element of the destination matrix, decode circuitry to decode the fetched instruction, and the execution circuitry to respond to the decoded instruction as specified by the opcode.
-