-
1.
Publication number: EP4375835A3
Publication date: 2024-08-14
Application number: EP24169357.1
Application date: 2019-10-15
Applicant: Intel Corporation
Inventor: Heinecke, Alexander F. , Georganas, Evangelos , Hughes, Christopher J. , Sade, Raanan , Valentine, Robert
IPC: G06F9/30
CPC classification number: G06F9/30032 , G06F9/30036 , G06F9/30109 , G06F9/30038
Abstract: Disclosed embodiments relate to transposing vectors while loading from memory. In one example, a processor comprises: a register file comprising one or more vector registers; a memory interface to read a plurality of data elements from a memory; fetch circuitry to fetch an instruction; decode circuitry to decode the instruction, and execution circuitry to execute the instruction. The instruction includes a plurality of fields to indicate an opcode, a subset of the plurality of data elements to be broadcast, and locations of the plurality of data elements, the plurality of data elements arranged in a corresponding plurality of relative positions, wherein the plurality of data elements include a first group of data elements and a second group of data elements. The execution circuitry performs a permute operation and a broadcast operation in accordance with the instruction, wherein the broadcast operation is to cause the subset of the plurality of data elements to be broadcast to a plurality of the relative positions associated with a corresponding plurality of other subsets of the plurality of data elements, the subset of the plurality of data elements to replace the other corresponding subsets at the plurality of relative positions.
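The permute-plus-broadcast behaviour can be modelled in a few lines. This is a minimal sketch, not the claimed circuitry: the function name `permute_and_broadcast`, the index-gather form of the permute, the assumption that the permute is applied before the broadcast, and the equal-sized contiguous subsets are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def permute_and_broadcast(src: Sequence[int], perm: Sequence[int],
                          subset_size: int, subset_index: int) -> List[int]:
    """Hypothetical model of a combined permute + subset broadcast."""
    if len(perm) != len(src) or subset_size <= 0 or len(src) % subset_size:
        raise ValueError("perm must match src, and src must split into equal subsets")
    n_subsets = len(src) // subset_size
    if not 0 <= subset_index < n_subsets:
        raise ValueError("subset_index out of range")
    # Permute step (assumed to come first): result[i] = src[perm[i]].
    permuted = [src[p] for p in perm]
    # Pick the subset that is to be broadcast.
    start = subset_index * subset_size
    chosen = permuted[start:start + subset_size]
    # Broadcast step: the chosen subset replaces the subset at every relative position.
    return chosen * n_subsets
```

For example, with an identity permutation, eight elements, subsets of two, and subset index 1, the result is `[2, 3]` repeated four times.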
-
2.
Publication number: EP4276609A3
Publication date: 2024-02-14
Application number: EP23200278.2
Application date: 2019-10-08
Applicant: Intel Corporation
Inventor: Heinecke, Alexander F. , Valentine, Robert , Charney, Mark J. , Sade, Raanan , Adelman, Menachem , Sperber, Zeev , Gradstein, Amit , Rubanovich, Simon
IPC: G06F9/30
Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processing unit comprises fetch circuitry to fetch an instruction, decode circuitry to decode the instruction, the instruction having a first field to specify a first storage location of a plurality of data elements corresponding to a first matrix having M rows by N columns of 32-bit single precision floating-point data elements, a second field to specify a second storage location of a plurality of data elements corresponding to a second matrix having M rows by K columns of pairs of 16-bit floating-point data elements having a bfloat16 format, and a third field to specify a third storage location of a plurality of data elements corresponding to a third matrix having K rows by N columns of pairs of 16-bit floating-point data elements having the bfloat16 format, and execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations corresponding to the instruction.
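The abstract gives the operand shapes but not the exact arithmetic. A natural reading of an M x N single-precision matrix combined with M x K and K x N matrices of bfloat16 pairs is a dot-product-accumulate. The sketch below assumes exactly that: each (m, n) accumulator adds the two-element dot product of A[m][k] and B[k][n] for every k. The function names are hypothetical, and rounding once per k step is a simplification; real hardware rounding may differ.

```python
import struct
from typing import List, Tuple

Bf16Pair = Tuple[int, int]  # two raw 16-bit bfloat16 bit patterns


def bf16_to_f32(bits: int) -> float:
    """A bfloat16 value is the upper 16 bits of an IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def round_to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tile_dot_bf16(c: List[List[float]], a: List[List[Bf16Pair]],
                  b: List[List[Bf16Pair]]) -> List[List[float]]:
    """Assumed semantics: C[m][n] += sum over k of dot(A[m][k], B[k][n])."""
    m_rows, k_dim, n_cols = len(a), len(b), len(c[0])
    out = [row[:] for row in c]  # leave the caller's accumulator untouched
    for m in range(m_rows):
        for n in range(n_cols):
            acc = out[m][n]
            for k in range(k_dim):
                (a0, a1), (b0, b1) = a[m][k], b[k][n]
                pair_dot = (bf16_to_f32(a0) * bf16_to_f32(b0)
                            + bf16_to_f32(a1) * bf16_to_f32(b1))
                # Simplification: round to binary32 once per k step.
                acc = round_to_f32(acc + pair_dot)
            out[m][n] = acc
    return out
```

With A = [[(1.0, 2.0)]] and B = [[(3.0, 0.5)]] encoded as bfloat16 (0x3F80, 0x4000, 0x4040, 0x3F00) and a zero accumulator, the single result element is 1 * 3 + 2 * 0.5 = 4.0.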
-
3.
Publication number: EP3671438A1
Publication date: 2020-06-24
Application number: EP19203199.5
Application date: 2019-10-15
Applicant: INTEL Corporation
Inventor: Heinecke, Alexander F. , Georganas, Evangelos , Hughes, Christopher , Sade, Raanan , Valentine, Robert
IPC: G06F9/30
Abstract: Disclosed embodiments relate to transposing vectors while loading from memory. In one example, a processor includes a register file, a memory interface, fetch circuitry to fetch an instruction, decode circuitry to decode the fetched instruction having fields to specify an opcode, a destination vector register, and a source vector having N groups of elements, N being a positive integer, the opcode to indicate the processor is to fetch the source vector, generate write data comprising one or more N-tuples, each N-tuple comprising corresponding elements from each of the N groups of elements, and write the write data to the destination vector register, and execution circuitry to execute the decoded instruction as per the opcode, the execution circuitry having a shuffle pipeline disposed between the memory and the register file, the shuffle pipeline to fetch, decode, and execute further instances of the instruction at one instruction per clock cycle.
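The N-tuple write data described above amounts to a transpose: the i-th tuple collects element i of every group. The following minimal sketch assumes the N groups are equal in size and stored back to back in the source vector. The function name `load_transpose` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def load_transpose(src: Sequence[T], n_groups: int) -> List[T]:
    """Hypothetical model: emit one N-tuple per position, taken across the N groups."""
    if n_groups <= 0 or len(src) % n_groups:
        raise ValueError("src must split into n_groups equal groups")
    group_len = len(src) // n_groups
    # Assumption: the N groups sit back to back in the source vector.
    groups = [src[g * group_len:(g + 1) * group_len] for g in range(n_groups)]
    write_data: List[T] = []
    for i in range(group_len):
        # The i-th N-tuple holds element i of every group, in group order.
        write_data.extend(group[i] for group in groups)
    return write_data
```

Treating each group as a matrix row, this is a row-major matrix transpose. For example, the source `[0, 1, 2, 3, 4, 5]` with N = 2 yields the tuples (0, 3), (1, 4), (2, 5).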
-
4.
Publication number: EP3651017A3
Publication date: 2020-06-24
Application number: EP19201841.4
Application date: 2019-10-08
Applicant: INTEL Corporation
Inventor: Heinecke, Alexander F. , Valentine, Robert , Charney, Mark J. , Sade, Raanan , Adelman, Menachem , Sperber, Zeev , Gradstein, Amit , Rubanovich, Simon
IPC: G06F9/30
Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify an M by N destination matrix, a first source identifier to identify an M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (m, k) of the specified first source matrix by a corresponding nibble of a doubleword element (k, n) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.
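The flow above can be sketched as a reference model. The abstract does not state whether nibbles are signed or unsigned. This sketch assumes two's-complement signed nibbles and signed 32-bit saturation, applied after each of the K steps. All function names are hypothetical.

```python
from typing import List

INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def nibbles(dword: int, signed: bool = True) -> List[int]:
    """Split a 32-bit doubleword into its eight 4-bit nibbles, lowest first."""
    values = []
    for i in range(8):
        v = (dword >> (4 * i)) & 0xF
        if signed and v >= 8:  # assumption: two's-complement signed nibbles
            v -= 16
        values.append(v)
    return values


def saturate_int32(x: int) -> int:
    """Clamp to the signed 32-bit range (assumed saturation bounds)."""
    return max(INT32_MIN, min(INT32_MAX, x))


def tile_dot_nibbles(c: List[List[int]], a: List[List[int]],
                     b: List[List[int]]) -> List[List[int]]:
    """C (M x N) accumulates A (M x K) times B (K x N), nibble by nibble."""
    m_rows, k_dim, n_cols = len(a), len(b), len(c[0])
    out = [row[:] for row in c]
    for m in range(m_rows):
        for n in range(n_cols):
            acc = out[m][n]
            for k in range(k_dim):  # the flow repeats K times per (m, n)
                products = [x * y for x, y in zip(nibbles(a[m][k]), nibbles(b[k][n]))]
                acc = saturate_int32(acc + sum(products))  # eight products per step
            out[m][n] = acc
    return out
```

For instance, 0x11111111 dotted with 0x22222222 gives eight products of 1 * 2, so a zero accumulator becomes 16.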
-
5.
Publication number: EP4361802A2
Publication date: 2024-05-01
Application number: EP24157718.8
Application date: 2019-06-26
Applicant: Intel Corporation
Inventor: Toll, Bret , Hughes, Christopher J. , Baum, Dan , Ould-Ahmed-Vall, ElMoustapha , Sade, Raanan , Valentine, Robert , Charney, Mark J. , Heinecke, Alexander F.
IPC: G06F9/30
CPC classification number: G06F9/30036 , G06F9/30032
Abstract: Disclosed embodiments relate to systems for performing instructions to quickly convert and use matrices (tiles) as one-dimensional vectors. In one example, a processor comprises decode circuitry to decode an instruction, the instruction to specify a two-dimensional, 2D, tile storage of the processor, either of multiple rows or multiple columns of the 2D tile storage, multiple vector registers of the processor, and a size of elements of the multiple rows or multiple columns of the 2D tile storage as any one of 8-bits, 16-bits, 32-bits, and 64-bits; and execution circuitry, coupled with the decode circuitry, to perform operations corresponding to the instruction, the operations to include storing elements from each row of the multiple rows or each column of the multiple columns of the 2D tile storage to a corresponding one of the multiple vector registers as a corresponding one-dimensional, 1D, vector.
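The row-or-column-to-vector operation can be modelled by serialising each tile row, or each column, into a byte image at the selected element width. This is a sketch only. The function name `tile_to_vectors`, the `bytes` representation of a vector register, and the little-endian element order are assumptions.

```python
from typing import List


def tile_to_vectors(tile: List[List[int]], element_bits: int,
                    by_rows: bool = True) -> List[bytes]:
    """Hypothetical model: each tile row (or column) becomes one 1D vector register image."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("element size must be 8, 16, 32 or 64 bits")
    nbytes = element_bits // 8
    mask = (1 << element_bits) - 1
    # Select rows directly, or transpose to walk the columns.
    lines = tile if by_rows else [list(col) for col in zip(*tile)]
    # Assumption: little-endian element order inside each vector register.
    return [b"".join((v & mask).to_bytes(nbytes, "little") for v in line)
            for line in lines]
```

A 2 x 2 tile with 16-bit elements therefore produces two 4-byte vectors, one per row or one per column.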
-
6.
Publication number: EP3916543A3
Publication date: 2021-12-22
Application number: EP21187080.3
Application date: 2019-06-27
Applicant: INTEL Corporation
Inventor: Sade, Raanan , Valentine, Robert , Toll, Bret , Hughes, Christopher J. , Heinecke, Alexander F. , Ould-Ahmed-Vall, ElMoustapha , Charney, Mark J.
IPC: G06F9/30
Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, a processor comprises decode circuitry to decode a single instruction into a decoded single instruction and execution circuitry to execute the decoded single instruction according to an opcode. The single instruction has a first field to specify a source matrix, a second field to specify a destination matrix, and the opcode to indicate the execution circuitry is to cause a store of: a first element and a second element from a first column of the source matrix respectively into a first element and a second element in a first row of the destination matrix, a first element and a second element from a second column of the source matrix respectively into a third element and a fourth element in the first row of the destination matrix, a third element and a fourth element from the first column of the source matrix respectively into a first element and a second element in a second row of the destination matrix, and a third element and a fourth element from the second column of the source matrix respectively into a third element and a fourth element in the second row of the destination matrix.
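The element moves listed above follow the rule dest[r][2j + t] = src[2r + t][j] for t in {0, 1}. In words, each destination row pairs two consecutive source rows column by column. The sketch below generalises the abstract's four-element example to any source with an even number of rows. That generalisation and the function name are assumptions.

```python
from typing import List, TypeVar

T = TypeVar("T")


def to_row_interleaved(src: List[List[T]]) -> List[List[T]]:
    """dest[r][2*j + t] = src[2*r + t][j] for t in {0, 1}."""
    rows = len(src)
    if rows % 2:
        raise ValueError("source must have an even number of rows")
    cols = len(src[0])
    # Walk columns j, and within each column take the element pair from rows 2r and 2r+1.
    return [[src[2 * r + t][j] for j in range(cols) for t in range(2)]
            for r in range(rows // 2)]
```

For a 4 x 2 source [[1, 2], [3, 4], [5, 6], [7, 8]], the first column is (1, 3, 5, 7) and the second is (2, 4, 6, 8). The destination rows are [1, 3, 2, 4] and [5, 7, 6, 8], matching the four stores described in the abstract.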
-
7.
Publication number: EP3822774A1
Publication date: 2021-05-19
Application number: EP20216494.3
Application date: 2019-10-08
Applicant: INTEL Corporation
Inventor: Heinecke, Alexander F. , Valentine, Robert , Charney, Mark J. , Sade, Raanan , Adelman, Menachem , Sperber, Zeev , Gradstein, Amit , Rubanovich, Simon
IPC: G06F9/30
Abstract: Disclosed embodiments relate to a processor, a system on a chip and a system for executing a format conversion instruction. In one example, a processor having a plurality of cores, including a core that, in response to a format conversion instruction having a first source operand including a first 32-bit single-precision floating point data element, and a second source operand including a second 32-bit single-precision floating point data element, is to: convert the first 32-bit single-precision floating point data element to a first 16-bit floating point data element, wherein, when the first 32-bit single-precision floating point data element is a normal data element, conversion is to be performed according to a rounding mode specified by the format conversion instruction, and the first 16-bit floating point data element is to have a sign bit, an 8-bit exponent, seven explicit mantissa bits, and one implicit mantissa bit, and wherein, when the first 32-bit single-precision floating point data element is a not-a-number, NaN, data element, the first 16-bit floating point data element is to have a mantissa with a most significant bit set to one; convert the second 32-bit single-precision floating point data element to a second 16-bit floating point data element, wherein, when the second 32-bit single-precision floating point data element is a normal data element, conversion is to be performed according to the rounding mode, and the second 16-bit floating point data element is to have a sign bit, an 8-bit exponent, seven explicit mantissa bits, and one implicit mantissa bit, and wherein when the second 32-bit single-precision floating point data element is a NaN data element, the second 16-bit floating point data element is to have a mantissa with a most significant bit set to one; and store the first 16-bit floating point data element in a lower order half of a destination register and the second 16-bit floating point data element in a higher order half of the destination register.
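The conversion can be sketched on raw bit patterns. The 16-bit result (1 sign bit, 8 exponent bits, 7 explicit mantissa bits) is the upper half of the binary32 input, after rounding. Forcing the mantissa MSB on NaN inputs matters because plain truncation of a NaN such as 0x7F800001 would otherwise yield 0x7F80, which is infinity. The sketch is under stated assumptions: the boolean `round_nearest_even` is a hypothetical stand-in for the instruction's rounding-mode field (round-to-nearest-even versus truncation only), the function names are illustrative, and denormal handling is not modelled.

```python
import struct


def f32_bits(x: float) -> int:
    """Raw IEEE-754 binary32 bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def f32_bits_to_bf16(bits: int, round_nearest_even: bool = True) -> int:
    """Convert binary32 bits to 16-bit bits: 1 sign, 8 exponent, 7 explicit mantissa."""
    bits &= 0xFFFFFFFF
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF and mantissa:
        # NaN: keep sign and upper bits, force the 16-bit mantissa MSB to one
        # so the result cannot collapse into an infinity pattern.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    if round_nearest_even:
        # Add just under half an ulp, plus one more if the kept LSB is odd (ties to even).
        bits += 0x7FFF + ((bits >> 16) & 1)
    # With round_nearest_even=False this is plain truncation (round toward zero).
    # Denormal inputs are not treated specially in this sketch.
    return (bits >> 16) & 0xFFFF


def pack_two_bf16(first_bits: int, second_bits: int,
                  round_nearest_even: bool = True) -> int:
    """First result goes to the low half of a 32-bit destination, second to the high half."""
    lo = f32_bits_to_bf16(first_bits, round_nearest_even)
    hi = f32_bits_to_bf16(second_bits, round_nearest_even)
    return (hi << 16) | lo
```

Converting 1.0 and 2.0 this way gives 0x3F80 and 0x4000, which pack into the 32-bit destination as 0x40003F80.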
-
8.
Publication number: EP3651017A2
Publication date: 2020-05-13
Application number: EP19201841.4
Application date: 2019-10-08
Applicant: INTEL Corporation
Inventor: Heinecke, Alexander F. , Valentine, Robert , Charney, Mark J. , Sade, Raanan , Adelman, Menachem , Sperber, Zeev , Gradstein, Amit , Rubanovich, Simon
IPC: G06F9/30
Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify an M by N destination matrix, a first source identifier to identify an M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (m, k) of the specified first source matrix by a corresponding nibble of a doubleword element (k, n) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.
-
9.
Publication number: EP4290371A3
Publication date: 2024-03-13
Application number: EP23205691.1
Application date: 2019-06-27
Applicant: Intel Corporation
Inventor: Sade, Raanan , Valentine, Robert , Toll, Bret , Hughes, Christopher J. , Heinecke, Alexander F. , Ould-Ahmed-Vall, ElMoustapha , Charney, Mark J.
IPC: G06F9/30
Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, an apparatus comprises: a plurality of registers, each register of the plurality of registers to store a plurality of matrix data elements and matrix processing circuitry to execute a matrix processing instruction to multiply a first source tile of a first matrix and a second source tile of a second matrix, the first source tile comprising rows and columns of a first subset of data elements of the first source matrix and the second source tile comprising rows and columns of a second subset of data elements of the second source matrix. The matrix processing circuitry comprises: circuitry to transform the first source tile by merging adjacent pairs of rows of the first source tile to generate corresponding row-interleaved data element sequences, each row-interleaved data element sequence to be loaded in a corresponding register of the plurality of registers; a set of multipliers to perform a parallel multiplication of each data element of the first subset of data elements stored in the corresponding registers of the plurality of registers with a corresponding data element of the second subset of data elements to generate a corresponding plurality of products; and accumulator circuitry to add the plurality of products to corresponding accumulated data elements of an accumulation matrix to generate corresponding result data elements of a result matrix.
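One way to see how the row-interleaved sequences can feed the multipliers: element position 2k + t of sequence r holds A[2r + t][k]. Multiplying it by each B[k][n] and adding into accumulator (2r + t, n) reproduces an ordinary matrix multiply-accumulate. This is a functional sketch only. It assumes "merging adjacent pairs of rows" means element-wise alternation of rows 2r and 2r + 1, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def interleave_row_pairs(tile: Matrix) -> Matrix:
    """Merge rows 2r and 2r+1 element by element: [a0, b0, a1, b1, ...]."""
    if len(tile) % 2:
        raise ValueError("tile must have an even number of rows")
    cols = len(tile[0])
    return [[tile[2 * r + t][k] for k in range(cols) for t in range(2)]
            for r in range(len(tile) // 2)]


def tile_multiply_accumulate(acc: Matrix, a: Matrix, b: Matrix) -> Matrix:
    """result = acc + a @ b, reading `a` through its row-interleaved sequences."""
    out = [row[:] for row in acc]
    n_cols = len(b[0])
    for r, seq in enumerate(interleave_row_pairs(a)):  # one sequence per register
        for pos, a_val in enumerate(seq):
            k, t = divmod(pos, 2)  # pos = 2*k + t: column k of source row 2r + t
            for n in range(n_cols):
                # One multiplier lane: the product is added into the matching accumulator.
                out[2 * r + t][n] += a_val * b[k][n]
    return out
```

With A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], and a zero accumulator, the result is [[19, 22], [43, 50]], the same as a direct matrix product.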
-
10.
Publication number: EP4290371A2
Publication date: 2023-12-13
Application number: EP23205691.1
Application date: 2019-06-27
Applicant: Intel Corporation
Inventor: Sade, Raanan , Valentine, Robert , Toll, Bret , Hughes, Christopher J. , Heinecke, Alexander F. , Ould-Ahmed-Vall, ElMoustapha , Charney, Mark J.
IPC: G06F9/30
Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, an apparatus comprises: a plurality of registers, each register of the plurality of registers to store a plurality of matrix data elements and matrix processing circuitry to execute a matrix processing instruction to multiply a first source tile of a first matrix and a second source tile of a second matrix, the first source tile comprising rows and columns of a first subset of data elements of the first source matrix and the second source tile comprising rows and columns of a second subset of data elements of the second source matrix. The matrix processing circuitry comprises: circuitry to transform the first source tile by merging adjacent pairs of rows of the first source tile to generate corresponding row-interleaved data element sequences, each row-interleaved data element sequence to be loaded in a corresponding register of the plurality of registers; a set of multipliers to perform a parallel multiplication of each data element of the first subset of data elements stored in the corresponding registers of the plurality of registers with a corresponding data element of the second subset of data elements to generate a corresponding plurality of products; and accumulator circuitry to add the plurality of products to corresponding accumulated data elements of an accumulation matrix to generate corresponding result data elements of a result matrix.