-
1.
Publication No.: EP4485178A1
Publication date: 2025-01-01
Application No.: EP23213640.8
Filing date: 2023-12-01
Applicant: INTEL Corporation
Inventor: Heinecke, Alexander; Wong, Wing Shek; Robinson, Stephen; Sade, Raanan; Gradstein, Amit; Rubanovich, Simon; Espig, Michael; Baum, Dan; Georganas, Evangelos; Kalamkar, Dhiraj
IPC: G06F9/30
Abstract: Decoder circuitry to decode an instruction indicating a first vector register having a 128-bit lane to store a first matrix having two rows by K columns of data elements having a number of bits, a storage location having 128 bits to store a second matrix having K rows by two columns of data elements having the number of bits, and a second vector register having a 128-bit lane to store a third matrix having two rows by two columns of data elements having a greater number of bits. Execution circuitry is to perform operations for the instruction, including to generate and store a result matrix having two rows by two columns of result data elements having the greater number of bits in the 128-bit lane of the second vector register. The result matrix represents accumulation of the third matrix with a product matrix generated from matrix multiplication using the first and second matrices.
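As a rough illustration of the arithmetic this abstract describes, the following Python sketch accumulates a 2x2 matrix with the product of a 2xK matrix and a Kx2 matrix. The function name `lane_matmul_accumulate`, the use of plain Python integers in place of fixed-width lane elements, and the choice K = 4 are assumptions made for illustration; none of them come from the publication.

```python
def lane_matmul_accumulate(a, b, c):
    """Return c + a @ b for a (2 x K), b (K x 2) and c (2 x 2).

    Hypothetical helper: plain Python ints stand in for the narrow
    source elements and the wider accumulator elements.
    """
    k = len(a[0])
    assert len(a) == 2 and all(len(row) == k for row in a)
    assert len(b) == k and all(len(row) == 2 for row in b)
    assert len(c) == 2 and all(len(row) == 2 for row in c)
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(2)]
        for i in range(2)
    ]


# Assumption: K = 4 with 16-bit inputs (2 * 4 * 16 = 128 bits) and
# 32-bit accumulators (2 * 2 * 32 = 128 bits) is one combination that
# fills a 128-bit lane; the abstract leaves the element widths open.
a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
c = [[10, 20], [30, 40]]
result = lane_matmul_accumulate(a, b, c)
print(result)  # [[14, 26], [42, 54]]
```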
-
2.
Publication No.: EP4485177A1
Publication date: 2025-01-01
Application No.: EP23211673.1
Filing date: 2023-11-23
Applicant: Intel Corporation
Inventor: Hughes, Christopher J.; Wang, Zhe; Baum, Dan; Madduri, Venkateswara Rao; Heinecke, Alexander; Georganas, Evangelos; Dan, Chen; Nuzman, Joseph
Abstract: Techniques for loading data with a hint related to data sharing with other cores. For example, one embodiment of an apparatus comprises: a plurality of cores to process instructions; a first core of the plurality of cores comprising: decoder circuitry to decode a single instruction, the single instruction having a first field for an opcode to indicate a load operation to read data from a memory, a second field to indicate a memory address for a location of the data in the memory, and a third field to store a value to indicate whether the data is expected to be shared between the first core and at least a second core of the plurality of cores; execution circuitry to execute the single instruction to read the data from the location in the memory; and cache controller circuitry to store the data in one or more caches in a state selected based on the value.
-
3.
Publication No.: EP4375835A2
Publication date: 2024-05-29
Application No.: EP24169357.1
Filing date: 2019-10-15
Applicant: Intel Corporation
Inventor: Heinecke, Alexander F.; Georganas, Evangelos; Hughes, Christopher J.; Sade, Raanan; Valentine, Robert
IPC: G06F9/38
CPC classification number: G06F9/30032, G06F9/30036, G06F9/30109, G06F9/3875, G06F9/30038
Abstract: Disclosed embodiments relate to transposing vectors while loading from memory. In one example, a processor comprises: a register file comprising one or more vector registers; a memory interface to read a plurality of data elements from a memory; fetch circuitry to fetch an instruction; decode circuitry to decode the instruction, and execution circuitry to execute the instruction. The instruction includes a plurality of fields to indicate an opcode, a subset of the plurality of data elements to be broadcast, and locations of the plurality of data elements, the plurality of data elements arranged in a corresponding plurality of relative positions, wherein the plurality of data elements include a first group of data elements and a second group of data elements. The execution circuitry performs a permute operation and a broadcast operation in accordance with the instruction, wherein the broadcast operation is to cause the subset of the plurality of data elements to be broadcast to a plurality of the relative positions associated with a corresponding plurality of other subsets of the plurality of data elements, the subset of the plurality of data elements to replace the other corresponding subsets at the plurality of relative positions.
-
4.
Publication No.: EP4276608A3
Publication date: 2024-01-10
Application No.: EP23195872.9
Filing date: 2021-09-14
Applicant: Intel Corporation
Inventor: Mellempudi, Naveen; Heinecke, Alexander F.; Valentine, Robert; Charney, Mark J.; Hughes, Christopher J.; Georganas, Evangelos; Sperber, Zeev; Gradstein, Amit; Rubanovich, Simon
IPC: G06F9/30
Abstract: Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. For example, a processing unit comprises circuitry to perform operations corresponding to an instruction, the instruction to specify a first matrix having M rows by 4*K columns of 8-bit floating-point data elements, a second matrix having 4*K rows by N columns of 8-bit floating-point data elements, and a third matrix having M rows by N columns of 32-bit single precision floating-point data elements. The operations include to, for each row m of the M rows of the first matrix, and for each column n of the N columns of the second matrix: convert 4*K 8-bit floating-point data elements of the row m of the first matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than an 8-bit floating-point data element, and convert 4*K 8-bit floating-point data elements of the column n of the second matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than the 8-bit floating-point data element; multiply the 4*K higher precision floating-point data elements corresponding to the row m of the first matrix with corresponding ones of the 4*K higher precision floating-point data elements corresponding to the column n of the second matrix to generate 4*K products; accumulate the 4*K products with a 32-bit single precision floating-point data element corresponding to a row m of the M rows, and a column n of the N columns, of the third matrix, to generate a result 32-bit single precision floating-point data element; and store the result 32-bit single precision floating-point data element at the row m and the column n of the third matrix.
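The convert, multiply, accumulate, and store steps listed in this abstract can be sketched in Python as follows. This is only a sketch under stated assumptions: the abstract does not name an 8-bit floating-point encoding, so the E5M2 layout (1 sign bit, 5 exponent bits, 2 mantissa bits, bias 15) decoded by the hypothetical `decode_e5m2` helper is an assumption, and Python floats stand in for both the higher-precision intermediates and the 32-bit accumulators.

```python
import math


def decode_e5m2(byte):
    """Decode one byte as an E5M2 value (assumed format, not from the abstract)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    man = byte & 0x03
    if exp == 0x1F:  # all-ones exponent: infinity or NaN
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:  # subnormal (or zero)
        return sign * (man / 4.0) * 2.0 ** -14
    return sign * (1.0 + man / 4.0) * 2.0 ** (exp - 15)


def fp8_dot_product_accumulate(a, b, c, decode=decode_e5m2):
    """Update c (M x N) in place with a (M x 4K) times b (4K x N).

    a and b hold raw 8-bit patterns; c holds Python floats that stand in
    for 32-bit single precision accumulators (an assumption).
    """
    depth = len(b)
    for m, a_row in enumerate(a):
        row = [decode(x) for x in a_row]              # convert row m
        for n in range(len(b[0])):
            col = [decode(b[k][n]) for k in range(depth)]  # convert column n
            products = [x * y for x, y in zip(row, col)]   # 4*K products
            c[m][n] += sum(products)                       # accumulate and store
    return c


a = [[0x3C, 0x40, 0x38, 0xBC]]        # 1.0, 2.0, 0.5, -1.0
b = [[0x3C], [0x3C], [0x40], [0x3E]]  # 1.0, 1.0, 2.0, 1.5
out = fp8_dot_product_accumulate(a, b, [[10.0]])
print(out)  # [[12.5]]
```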
-
5.
Publication No.: EP4293503A1
Publication date: 2023-12-20
Application No.: EP23171339.7
Filing date: 2023-05-03
Applicant: Intel Corporation
Inventor: Adelman, Menachem; Gradstein, Amit; Rubanovich, Simon; Ziv, Barukh; Sherman, Uri; Rip, Dana; Mizrahi, Shahar; Baum, Dan; Rappoport, Rinat; Jain, Nilesh; Sperber, Zeev; Stupp, Gideon; Heinecke, Alexander; Hughes, Christopher; Georganas, Evangelos
IPC: G06F9/30
Abstract: Techniques and mechanisms for processor circuitry to execute a load and expand instruction of an instruction set to generate decompressed matrix data. In an embodiment, the instruction comprises a source operand which indicates a location from which compressed matrix data, and corresponding metadata, are to be accessed. A destination operand of the instruction indicates a location which is to receive decompressed metadata, which is generated, during execution of the instruction, based on the compressed matrix data and the corresponding metadata. The metadata comprises compression mask information which identifies which elements of the matrix have been masked from the compressed matrix data. In another embodiment, the instruction further comprises a count operand which identifies a total number of the unmasked matrix elements which are represented in the compressed matrix data.
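One way to picture the load-and-expand behaviour described in this abstract is the Python sketch below. It assumes a flat, row-major mask in which 1 marks an element that is present in the compressed data and 0 marks a masked element, and it assumes masked positions are filled with zero; the function name `load_and_expand` and the `fill` parameter are hypothetical. The length of the compressed list plays the role of the count of unmasked elements mentioned in the abstract.

```python
def load_and_expand(compressed, mask, rows, cols, fill=0):
    """Rebuild a rows x cols matrix from compressed values and a mask.

    Assumptions (not from the abstract): mask is row-major, a 1 bit means
    "value present in compressed data", and masked elements become `fill`.
    """
    assert len(mask) == rows * cols
    assert sum(mask) == len(compressed)  # count of unmasked elements
    values = iter(compressed)
    flat = [next(values) if bit else fill for bit in mask]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


tile = load_and_expand([7, 3, 9], [1, 0, 0, 1, 0, 0, 1, 0], rows=2, cols=4)
print(tile)  # [[7, 0, 0, 3], [0, 0, 9, 0]]
```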
-
6.
Publication No.: EP4276608A2
Publication date: 2023-11-15
Application No.: EP23195872.9
Filing date: 2021-09-14
Applicant: Intel Corporation
Inventor: Mellempudi, Naveen; Heinecke, Alexander F.; Valentine, Robert; Charney, Mark J.; Hughes, Christopher J.; Georganas, Evangelos; Sperber, Zeev; Gradstein, Amit; Rubanovich, Simon
IPC: G06F9/30
Abstract: Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. For example, a processing unit comprises circuitry to perform operations corresponding to an instruction, the instruction to specify a first matrix having M rows by 4*K columns of 8-bit floating-point data elements, a second matrix having 4*K rows by N columns of 8-bit floating-point data elements, and a third matrix having M rows by N columns of 32-bit single precision floating-point data elements. The operations include to, for each row m of the M rows of the first matrix, and for each column n of the N columns of the second matrix: convert 4*K 8-bit floating-point data elements of the row m of the first matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than an 8-bit floating-point data element, and convert 4*K 8-bit floating-point data elements of the column n of the second matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than the 8-bit floating-point data element; multiply the 4*K higher precision floating-point data elements corresponding to the row m of the first matrix with corresponding ones of the 4*K higher precision floating-point data elements corresponding to the column n of the second matrix to generate 4*K products; accumulate the 4*K products with a 32-bit single precision floating-point data element corresponding to a row m of the M rows, and a column n of the N columns, of the third matrix, to generate a result 32-bit single precision floating-point data element; and store the result 32-bit single precision floating-point data element at the row m and the column n of the third matrix.
-
7.
Publication No.: EP4468146A3
Publication date: 2025-02-19
Application No.: EP24205150.6
Filing date: 2020-11-26
Applicant: INTEL Corporation
Inventor: Adelman, Menachem; Valentine, Robert; Ziv, Barukh; Gradstein, Amit; Rubanovich, Simon; Sperber, Zeev; Charney, Mark J.; Hughes, Christopher J.; Heinecke, Alexander F.; Georganas, Evangelos; Pham, Binh
IPC: G06F9/30
Abstract: Embodiments for a matrix transpose and multiply operation are disclosed. In an embodiment, a processor comprises: a plurality of registers to store a plurality of packed data elements including a first plurality of packed data elements of a first source matrix tile and a second plurality of packed data elements of a second source matrix tile, the first and second source matrix tiles comprising respective portions of a first source matrix and a second source matrix, and wherein each packed data element of the plurality of packed data elements has an element width; a decoder to decode one or more instructions, at least one instruction of the one or more instructions including an opcode field configured to specify an opcode, a first source operand configured to indicate the first source matrix tile, a second source operand configured to indicate the second source matrix tile, and a destination operand configured to indicate a result matrix tile; and execution circuitry to, in response to the one or more instructions, transpose the first source matrix tile in accordance with a granularity equal to the element width to generate a first transposed source matrix tile and to multiply the first transposed source matrix tile and the second source matrix tile. The execution circuitry comprises: a plurality of multipliers to multiply data elements of the first transposed source matrix tile and corresponding data elements of the second source matrix tile to produce a corresponding plurality of products; and one or more accumulators to add groups of the products to generate corresponding result data elements in the result matrix tile.
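The transpose-then-multiply operation in this abstract can be sketched, under simplifying assumptions, as the Python function below. Tiles are modelled as lists of rows, each element is transposed individually (matching a granularity equal to the element width), and plain Python numbers replace packed data elements; the function name `transpose_and_multiply` is hypothetical.

```python
def transpose_and_multiply(tile_a, tile_b):
    """Return transpose(tile_a) @ tile_b.

    tile_a is K x M and tile_b is K x N, so the result tile is M x N.
    Sketch only: plain Python numbers stand in for packed data elements.
    """
    assert len(tile_a) == len(tile_b)
    a_t = [list(col) for col in zip(*tile_a)]  # element-wise transpose: M x K
    n_cols = len(tile_b[0])
    result = []
    for row in a_t:
        out_row = []
        for n in range(n_cols):
            # multipliers: one product per pair of corresponding elements
            products = [x * tile_b[k][n] for k, x in enumerate(row)]
            # accumulator: add the group of products for this result element
            out_row.append(sum(products))
        result.append(out_row)
    return result


tile_a = [[1, 2], [3, 4], [5, 6]]  # K = 3, M = 2
tile_b = [[1, 0], [0, 1], [1, 1]]  # K = 3, N = 2
product = transpose_and_multiply(tile_a, tile_b)
print(product)  # [[6, 8], [8, 10]]
```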
-
8.
Publication No.: EP4462249A3
Publication date: 2025-02-19
Application No.: EP24203555.8
Filing date: 2020-11-26
Applicant: INTEL Corporation
Inventor: Adelman, Menachem; Valentine, Robert; Ziv, Barukh; Gradstein, Amit; Rubanovich, Simon; Sperber, Zeev; Charney, Mark J.; Hughes, Christopher J.; Heinecke, Alexander F.; Georganas, Evangelos; Pham, Binh
IPC: G06F9/30
Abstract: Embodiments for a matrix transpose and multiply operation are disclosed. In an embodiment, an apparatus comprises decode circuitry to decode an instance of an instruction having a format including an opcode field to specify an opcode, a first destination operand field to specify a destination matrix location, a first source operand field to specify a first source matrix location, a second source operand field to specify a second source matrix location, and a third operand field to specify a source/destination matrix location; and execution circuitry to, in response to the opcode of the decoded instance of the instruction, transpose columns of data element pairs of the first source matrix into rows, perform a dot product of data element pairs of the transposed columns of data element pairs of the first source matrix and corresponding row data element pairs of the second source matrix, and add a result of the dot product to a corresponding row data element of the source/destination matrix.
-
9.
Publication No.: EP4321992A3
Publication date: 2024-05-01
Application No.: EP23210931.4
Filing date: 2020-02-07
Applicant: Intel Corporation
Inventor: Adelman, Menachem; Valentine, Robert; Ziv, Barukh; Gradstein, Amit; Rubanovich, Simon; Heinecke, Alexander; Georganas, Evangelos
IPC: G06F9/30
CPC classification number: G06F7/483, G06F2207/3828, G06F9/30032, G06F9/30036, G06F9/30025
Abstract: An apparatus and method for down-converting and interleaving data elements. For example, one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed data elements; a second source register to store a second plurality of packed data elements; a destination register to store a third plurality and a fourth plurality of packed data elements, each of the third and fourth plurality of packed data elements to be encoded with fewer bits than each of the first and second plurality of packed data elements; execution circuitry to execute the decoded instruction, the execution circuitry comprising: down-conversion circuitry to down-convert each of the first plurality of packed data elements to generate one of the third plurality of packed data elements and to down-convert each of the second plurality of packed data elements to generate one of the fourth plurality of packed data elements; and interleave circuitry to interleave the third plurality of packed data elements with the fourth plurality of packed data elements within the destination register.
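A minimal Python sketch of the down-convert-and-interleave flow in this abstract is given below. The abstract only says the results use fewer bits than the sources, so the specific conversion shown (32-bit float to a 16-bit BF16-style pattern by truncating the low 16 bits) is an assumption, as are the interleave order (element from the first source, then element from the second) and the names `fp32_to_bf16_bits` and `downconvert_interleave`.

```python
import struct


def fp32_to_bf16_bits(x):
    """Return the upper 16 bits of x encoded as IEEE 754 binary32.

    Assumed conversion: simple truncation, no rounding.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16


def downconvert_interleave(src1, src2, convert=fp32_to_bf16_bits):
    """Down-convert both sources and interleave the results.

    Assumed layout: dest = [src1[0], src2[0], src1[1], src2[1], ...].
    """
    assert len(src1) == len(src2)
    dest = []
    for x, y in zip(src1, src2):
        dest.append(convert(x))  # element of the "third plurality"
        dest.append(convert(y))  # element of the "fourth plurality"
    return dest


dest = downconvert_interleave([1.0, 2.0], [-1.0, 0.5])
print([hex(v) for v in dest])  # ['0x3f80', '0xbf80', '0x4000', '0x3f00']
```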
-
10.
Publication No.: EP4357914A2
Publication date: 2024-04-24
Application No.: EP24161663.0
Filing date: 2022-07-08
Applicant: INTEL Corporation
Inventor: Heinecke, Alexander; Adelman, Menachem; Valentine, Robert; Sperber, Zeev; Gradstein, Amit; Charney, Mark; Georganas, Evangelos; Kalamkar, Dhiraj; Hughes, Christopher; Anderson, Cristina
IPC: G06F9/30
CPC classification number: G06F9/30036, G06F9/30021, G06F9/30094
Abstract: Techniques for comparing BF16 data elements are described. An exemplary instruction is to cause operations including to: provide, for each data element position of BF16 data elements of first and second packed data source operands, a data element result, wherein: for a predicate value that is a first value, the data element result is to include a corresponding data element that is a result of either a maximum comparison or a minimum comparison of a pair of corresponding BF16 data elements, wherein, when the BF16 data elements of the pair of corresponding BF16 data elements are both zero, of either sign, the data element result is to include the corresponding BF16 data element of the second packed data source operand; and for a predicate value that is a second value, the data element result is to include a corresponding data element that is either zero or remains unchanged.
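The per-element selection rules in this abstract can be sketched in Python as shown below. Python floats stand in for BF16 values, a truthy predicate entry plays the role of the "first value", and the `use_max` and `zeroing` flags are hypothetical parameters that choose between the alternatives the abstract leaves open (maximum or minimum, zero or unchanged). NaN handling is not covered by the abstract and is left out.

```python
import math


def predicated_minmax(src1, src2, dest, predicate, use_max=True, zeroing=False):
    """Element-wise min/max under a predicate (sketch, floats stand in for BF16)."""
    out = []
    for a, b, old, p in zip(src1, src2, dest, predicate):
        if not p:
            # predicate is the "second value": zero the element or keep it
            out.append(0.0 if zeroing else old)
        elif a == 0.0 and b == 0.0:
            # both zero, of either sign: take the second source's element
            out.append(b)
        else:
            out.append(max(a, b) if use_max else min(a, b))
    return out


result = predicated_minmax(
    src1=[1.0, 0.0, 5.0],
    src2=[2.0, -0.0, 3.0],
    dest=[9.0, 9.0, 9.0],
    predicate=[1, 1, 0],
)
# copysign exposes the sign of the zero picked from the second source
zero_sign = math.copysign(1.0, result[1])
print(result, zero_sign)  # [2.0, -0.0, 9.0] -1.0
```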