-
公开(公告)号:US20250004764A1
公开(公告)日:2025-01-02
申请号:US18217544
申请日:2023-07-01
Applicant: Intel Corporation
Inventor: Michael ESPIG , Menachem ADELMAN , Jonathan COMBS , Amit GRADSTEIN , Christopher J. HUGHES , Vivekananthan SANJEEPAN , Wing Shek WONG
IPC: G06F9/30
Abstract: Techniques for providing 512-bit operands or smaller are described. In some examples, a prefix of an instruction is utilized to define the operand (vector) length. For example, an instruction is to at least include fields for a prefix, an opcode, and operand addressing information, wherein the prefix and addressing information are to be used by decoder circuitry to determine support for a particular a vector length for one or more operands of the instance of the single instruction and the opcode is to indicate one or more operations to perform on the one or more operands.
-
12.
公开(公告)号:US20240078283A1
公开(公告)日:2024-03-07
申请号:US18360793
申请日:2023-07-27
Applicant: Intel Corporation
Inventor: Amit GRADSTEIN , Simon RUBANOVICH , Sagi MELLER , Saeed KHAROUF , Gavri BERGER , Zeev SPERBER , Jose YALLOUZ , Ron SCHNEIDER
CPC classification number: G06F17/16 , G06F9/3001 , G06F9/30036 , G06F9/3851 , G06F9/34
Abstract: Systems, methods, and apparatuses relating to a matrix operations accelerator are described. In one embodiment, a processor includes a matrix operations accelerator circuit that includes a two-dimensional grid of fused multiply accumulate circuits that is switchable to a scheduling mode for execution of a decoded single instruction where the matrix operations accelerator circuit loads a first buffer of the two-dimensional grid of fused multiply accumulate circuits from a first plurality of registers that represents a first input two-dimensional matrix, checks if a second buffer of the two-dimensional grid of fused multiply accumulate circuits stores an immediately prior input two-dimension matrix that is the same as a second input two-dimensional matrix from a second plurality of registers that represents the first input two-dimensional matrix, and when the second buffer of the two-dimensional grid of fused multiply accumulate circuits stores the immediately prior input two-dimension matrix, from execution of a previous instruction, that is the same as the second input two-dimensional matrix: prevents reclamation of the second buffer between execution of the previous instruction and the decoded single instruction, performs an operation on the first input two-dimensional matrix from the first buffer and the immediately prior input two-dimension matrix from the second buffer to produce a resultant, and stores the resultant in resultant storage, and when the second buffer of the two-dimensional grid of fused multiply accumulate circuits does not store the immediately prior input two-dimension matrix, from execution of the previous instruction, that is the same as the second input two-dimensional matrix: loads the second input two-dimensional matrix into the second buffer of the two-dimensional grid of fused multiply accumulate circuits, performs the operation on the first input two-dimensional matrix from the first buffer and the second input two-dimension matrix from the second buffer to produce a resultant, and stores the resultant in the resultant storage.
-
公开(公告)号:US20230409333A1
公开(公告)日:2023-12-21
申请号:US17843823
申请日:2022-06-17
Applicant: Intel Corporation
Inventor: Menachem ADELMAN , Amit GRADSTEIN , Regev SHEMY , Chitra NATARAJAN , Igor ERMOLAEV
CPC classification number: G06F9/3818 , G06F9/30105
Abstract: Techniques for performing prefix sums in response to a single instruction are describe are described. In some examples, the single instruction includes fields for an opcode, one or fields to reference a first source operand, one or fields to reference a second source operand, one or fields to reference a destination operand, wherein the opcode is to indicate that execution circuitry is, in response to a decoded instance of the single instruction, to at least: perform a prefix sum by for each non-masked data element position of the second source operand adding a data element of that data element position to each data element of preceding data element positions and adding at least one data element of a defined data element position of the first source operand, and store each prefix sum for each data element position of the second source operand into a corresponding data element position of the destination operand.
-
公开(公告)号:US20230072105A1
公开(公告)日:2023-03-09
申请号:US17463410
申请日:2021-08-31
Applicant: Intel Corporation
Inventor: Alexander HEINECKE , Menachem ADELMAN , Robert VALENTINE , Zeev SPERBER , Amit GRADSTEIN , Mark CHARNEY , Evangelos GEORGANAS , Dhiraj KALAMKAR , Christopher HUGHES , Cristina ANDERSON
IPC: G06F9/30
Abstract: Techniques for comparing BF16 data elements are described. An exemplary BF16 comparison instruction includes fields for an opcode, an identification of a location of a first packed data source operand, and an identification of a location of a second packed data source operand, wherein the opcode is to indicate that execution circuitry is to perform, for a particular data element position of the packed data source operands, a comparison of a data element at that position, and update a flags register based on the comparison.
-
公开(公告)号:US20230068781A1
公开(公告)日:2023-03-02
申请号:US17463382
申请日:2021-08-31
Applicant: Intel Corporation
Inventor: Menachem ADELMAN , Alexander HEINECKE , Robert VALENTINE , Zeev SPERBER , Amit GRADSTEIN , Mark CHARNEY , Evangelos GEORGANAS , Dhiraj KALAMKAR , Christopher HUGHES , Cristina ANDERSON
IPC: G06F9/30
Abstract: Techniques for scale and reduction of BF16 data elements are described. An exemplary instruction includes fields for an having fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a packed data destination operand, wherein the opcode is to indicate that execution circuitry is to perform, for each data element position of the packed data source operands, a floating point scale operation of a BF16 data element of the first packed data source by multiplying the data element by a power of 2 value, wherein a value of the exponent of the power of 2 value is a floor value of a BF16 data element of the second packed data source, and store a result of the floating point scale operation into a corresponding data element position of the packed data destination operand.
-
公开(公告)号:US20220207107A1
公开(公告)日:2022-06-30
申请号:US17133473
申请日:2020-12-23
Applicant: Intel Corporation
Inventor: Menachem ADELMAN , Robert VALENTINE , Daniel TOWNER , Amit GRADSTEIN , Mark Jay CHARNEY
IPC: G06F17/16
Abstract: An apparatus and method for complex matrix multiplication. For example, one embodiment of a processor comprises: a decoder to decode a first complex matrix multiplication instruction; execution circuitry to execute the first complex matrix multiplication instruction, the execution circuitry comprising parallel multiplication circuitry to multiply real values from the first plurality of real and imaginary values with corresponding real values from the second plurality of real and imaginary values to generate a first plurality of real products, to multiply imaginary values from the first plurality of real and imaginary values with corresponding imaginary values from the second plurality of real and imaginary values to generate a second plurality of real products; and addition/subtraction circuitry to subtract each real product in the second plurality of real products from a corresponding real product in the first plurality of real products to produce a corresponding real value in the result matrix. The decoder may also decode and the execution circuitry may execute a second complex matrix multiplication instruction to multiply real and imaginary values from the first plurality with corresponding imaginary and real values, respectively, from the second plurality to generate first and second pluralities of imaginary products, and to add corresponding imaginary products to produce a corresponding imaginary value in the result matrix.
-
公开(公告)号:US20220197654A1
公开(公告)日:2022-06-23
申请号:US17133400
申请日:2020-12-23
Applicant: Intel Corporation
Inventor: Menachem ADELMAN , Robert VALENTINE , Daniel TOWNER , Amit GRADSTEIN , Mark Jay CHARNEY
Abstract: An apparatus and method for complex matrix conjugation. For example, one embodiment of a processor comprises: a decoder to decode a complex conjugate transpose instruction including a source operand to identify a complex source matrix and a destination operand to identify a complex result matrix, the complex source matrix to store a first plurality of complex values and the complex result matrix to store a second plurality of complex values, each complex value in the first and second plurality of complex values including a real component and an imaginary component; a plurality of registers or local memory to store all or a subset of the first plurality of complex values; and execution circuitry to execute the complex conjugate transpose instruction using matrix conjugation hardware logic to determine a plurality of complex conjugate values corresponding to the first plurality of complex values, and transpose hardware logic to perform a matrix transpose operation using the plurality of complex conjugate values to generate a result matrix.
-
18.
公开(公告)号:US20210124580A1
公开(公告)日:2021-04-29
申请号:US17133078
申请日:2020-12-23
Applicant: Intel Corporation
Inventor: Alexander F. HEINECKE , Robert VALENTINE , Mark J. CHARNEY , Raanan SADE , Menachem ADELMAN , Zeev SPERBER , Amit GRADSTEIN , Simon RUBANOVICH
Abstract: Disclosed embodiments relate to systems and methods for performing instructions to convert to 16-bit floating-point format. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a first source vector comprising N single-precision elements, and a destination vector comprising at least N 16-bit floating-point elements, the opcode to indicate execution circuitry is to convert each of the elements of the specified source vector to 16-bit floating-point, the conversion to include truncation and rounding, as necessary, and to store each converted element into a corresponding location of the specified destination vector, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction as specified by the opcode.
-
19.
公开(公告)号:US20210117194A1
公开(公告)日:2021-04-22
申请号:US17133396
申请日:2020-12-23
Applicant: Intel Corporation
Inventor: Alexander F. HEINECKE , Robert VALENTINE , Mark J. CHARNEY , Raanan SADE , Menachem ADELMAN , Zeev SPERBER , Amit GRADSTEIN , Simon RUBANOVICH
Abstract: Disclosed embodiments relate to systems and methods for performing 16-bit floating-point vector dot product instructions. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of first source, second source, and destination vectors, the opcode to indicate execution circuitry is to multiply N pairs of 16-bit floating-point formatted elements of the specified first and second sources, and accumulate the resulting products with previous contents of a corresponding single-precision element of the specified destination, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction as specified by the opcode.
-
公开(公告)号:US20240184585A1
公开(公告)日:2024-06-06
申请号:US18436982
申请日:2024-02-08
Applicant: Intel Corporation
Inventor: Alexander HEINECKE , Menachem ADELMAN , Robert VALENTINE , Zeev SPERBER , Amit GRADSTEIN , Mark CHARNEY , Evangelos GEORGANAS , Dhiraj KALAMKAR , Christopher HUGHES , Cristina ANDERSON
IPC: G06F9/30
CPC classification number: G06F9/3016 , G06F9/30014 , G06F9/30036 , G06F9/30101
Abstract: Techniques for comparing BF16 data elements are described. An exemplary BF16 comparison instruction includes fields for an opcode, an identification of a location of a first packed data source operand, and an identification of a location of a second packed data source operand, wherein the opcode is to indicate that execution circuitry is to perform, for a particular data element position of the packed data source operands, a comparison of a data element at that position, and update a flags register based on the comparison.
-
-
-
-
-
-
-
-
-