SYSTEMS AND METHODS TO TRANSPOSE VECTORS ON-THE-FLY WHILE LOADING FROM MEMORY

    Publication No.: EP4375835A3

    Publication date: 2024-08-14

    Application No.: EP24169357.1

    Filing date: 2019-10-15

    CPC classification number: G06F9/30032 G06F9/30036 G06F9/30109 G06F9/30038

    Abstract: Disclosed embodiments relate to transposing vectors while loading from memory. In one example, a processor comprises: a register file comprising one or more vector registers; a memory interface to read a plurality of data elements from a memory; fetch circuitry to fetch an instruction; decode circuitry to decode the instruction, and execution circuitry to execute the instruction. The instruction includes a plurality of fields to indicate an opcode, a subset of the plurality of data elements to be broadcast, and locations of the plurality of data elements, the plurality of data elements arranged in a corresponding plurality of relative positions, wherein the plurality of data elements include a first group of data elements and a second group of data elements. The execution circuitry performs a permute operation and a broadcast operation in accordance with the instruction, wherein the broadcast operation is to cause the subset of the plurality of data elements to be broadcast to a plurality of the relative positions associated with a corresponding plurality of other subsets of the plurality of data elements, the subset of the plurality of data elements to replace the other corresponding subsets at the plurality of relative positions.

    SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

    Publication No.: EP4276609A3

    Publication date: 2024-02-14

    Application No.: EP23200278.2

    Filing date: 2019-10-08

    Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processing unit comprises fetch circuitry to fetch an instruction, decode circuitry to decode the instruction, the instruction having a first field to specify a first storage location of a plurality of data elements corresponding to a first matrix having M rows by N columns of 32-bit single precision floating-point data elements, a second field to specify a second storage location of a plurality of data elements corresponding to a second matrix having M rows by K columns of pairs of 16-bit floating-point data elements having a bfloat16 format, and a third field to specify a third storage location of a plurality of data elements corresponding to a third matrix having K rows by N columns of pairs of 16-bit floating-point data elements having the bfloat16 format, and execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations corresponding to the instruction.
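
The abstract above gives the operand shapes (an M by N single-precision matrix, an M by K matrix of bfloat16 pairs, and a K by N matrix of bfloat16 pairs) but not the arithmetic. The sketch below assumes, based only on the title, that each pair contributes a two-element dot product accumulated into the single-precision matrix. The function name `tile_dp_bf16_pairs` is hypothetical, and plain Python floats stand in for both bfloat16 and single-precision values.

```python
def tile_dp_bf16_pairs(acc, a, b):
    """Accumulate pairwise dot products into an M x N matrix.

    acc: M x N list of floats (stand-in for 32-bit single precision).
    a:   M x K list of (float, float) pairs (stand-in for bfloat16 pairs).
    b:   K x N list of (float, float) pairs.
    Assumption: result[m][n] = acc[m][n] + sum over k of
                a[m][k][0] * b[k][n][0] + a[m][k][1] * b[k][n][1].
    """
    m_rows, k_dim, n_cols = len(a), len(b), len(b[0])
    result = [row[:] for row in acc]  # leave the caller's matrix untouched
    for m in range(m_rows):
        for n in range(n_cols):
            total = result[m][n]
            for k in range(k_dim):
                a0, a1 = a[m][k]
                b0, b1 = b[k][n]
                total += a0 * b0 + a1 * b1
            result[m][n] = total
    return result
```

For instance, with a 1 by 1 accumulator holding 1.0, the pair (1.0, 2.0) in the second matrix, and the pair (3.0, 4.0) in the third, this sketch yields 1.0 + 3.0 + 8.0 = 12.0.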

    SYSTEMS AND METHODS TO TRANSPOSE VECTORS ON-THE-FLY WHILE LOADING FROM MEMORY

    Publication No.: EP3671438A1

    Publication date: 2020-06-24

    Application No.: EP19203199.5

    Filing date: 2019-10-15

    Abstract: Disclosed embodiments relate to transposing vectors while loading from memory. In one example, a processor includes a register file, a memory interface, fetch circuitry to fetch an instruction, decode circuitry to decode the fetched instruction having fields to specify an opcode, a destination vector register, and a source vector having N groups of elements, N being a positive integer, the opcode to indicate the processor is to fetch the source vector, generate write data comprising one or more N-tuples, each N-tuple comprising corresponding elements from each of the N groups of elements, and write the write data to the destination vector register, and execution circuitry to execute the decoded instruction as per the opcode, the execution circuitry has a shuffle pipeline disposed between the memory and the register file, the shuffle pipeline to fetch, decode, and execute further instances of the instruction at one instruction per clock cycle.
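
The N-tuple construction described in this abstract can be illustrated in software. The sketch below is not the claimed hardware. It assumes the N groups are contiguous, equal-length slices of the source vector, which the abstract does not state, and the function name `interleave_groups` is hypothetical.

```python
def interleave_groups(source, n_groups):
    """Build N-tuples from a source vector holding N groups of elements.

    Assumption: the groups are contiguous and of equal length, e.g.
    [x0, x1, y0, y1, z0, z1] holds three groups of two elements.
    The write data places corresponding elements of every group side by side.
    """
    if n_groups <= 0 or len(source) % n_groups != 0:
        raise ValueError("source length must be a positive multiple of n_groups")
    group_len = len(source) // n_groups
    groups = [source[g * group_len:(g + 1) * group_len] for g in range(n_groups)]
    # zip(*groups) yields one N-tuple per element position; flatten the tuples
    return [elem for n_tuple in zip(*groups) for elem in n_tuple]
```

With three groups of two elements, `interleave_groups(["x0", "x1", "y0", "y1", "z0", "z1"], 3)` returns `["x0", "y0", "z0", "x1", "y1", "z1"]`, turning a structure-of-arrays layout into an array-of-structures layout.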

    SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

    Publication No.: EP3651017A3

    Publication date: 2020-06-24

    Application No.: EP19201841.4

    Filing date: 2019-10-08

    Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.
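
The flow in this abstract (eight nibble products per doubleword pair, accumulated with saturation over K steps) can be sketched as follows. Two details are assumptions, because the abstract does not specify them: nibbles are treated as unsigned 4-bit values, and the accumulator saturates to the signed 32-bit range. The names `nibbles`, `saturate_i32`, and `tile_dp_nibbles` are hypothetical.

```python
I32_MIN, I32_MAX = -(2 ** 31), 2 ** 31 - 1

def nibbles(dword):
    """Split a 32-bit doubleword into eight 4-bit values, lowest nibble first.
    Assumption: nibbles are unsigned."""
    return [(dword >> (4 * i)) & 0xF for i in range(8)]

def saturate_i32(value):
    """Clamp to the signed 32-bit range (assumed accumulator format)."""
    return max(I32_MIN, min(I32_MAX, value))

def tile_dp_nibbles(dest, src1, src2):
    """dest is M x N, src1 is M x K, src2 is K x N; all hold doublewords."""
    m_rows, k_dim, n_cols = len(src1), len(src2), len(src2[0])
    result = [row[:] for row in dest]
    for m in range(m_rows):
        for n in range(n_cols):
            acc = result[m][n]
            for k in range(k_dim):  # the flow runs K times per element (m, n)
                products = [x * y for x, y in
                            zip(nibbles(src1[m][k]), nibbles(src2[k][n]))]
                acc = saturate_i32(acc + sum(products))
            result[m][n] = acc
    return result
```

As a check, the doublewords 0x21 and 0x43 hold the nibble pairs (1, 2) and (3, 4), so one step adds 1*3 + 2*4 = 11 to the previous contents of the destination element.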

    SYSTEMS AND METHODS FOR PERFORMING INSTRUCTIONS TO TRANSFORM MATRICES INTO ROW-INTERLEAVED FORMAT

    Publication No.: EP3916543A3

    Publication date: 2021-12-22

    Application No.: EP21187080.3

    Filing date: 2019-06-27

    Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, a processor comprises decode circuitry to decode a single instruction into a decoded single instruction and execution circuitry to execute the decoded single instruction according to an opcode. The single instruction has a first field to specify a source matrix, a second field to specify a destination matrix, and the opcode to indicate the execution circuitry is to cause a store of: a first element and a second element from a first column of the source matrix respectively into a first element and a second element in a first row of the destination matrix, a first element and a second element from a second column of the source matrix respectively into a third element and a fourth element in the first row of the destination matrix, a third element and a fourth element from the first column of the source matrix respectively into a first element and a second element in a second row of the destination matrix, and a third element and a fourth element from the second column of the source matrix respectively into a third element and a fourth element in the second row of the destination matrix.
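
The element mapping spelled out in this abstract amounts to merging each adjacent pair of source rows into one destination row, column by column. The sketch below generalizes the four-row, two-column case from the abstract to any even number of rows; that generalization and the function name `row_interleave` are assumptions made for illustration.

```python
def row_interleave(src):
    """Merge adjacent row pairs so that dst[r][2*c + j] == src[2*r + j][c].

    src: list of rows with an even row count (assumed).
    Returns a matrix with half as many rows and twice as many columns.
    """
    if len(src) % 2 != 0:
        raise ValueError("source matrix needs an even number of rows")
    dst = []
    for r in range(0, len(src), 2):
        merged = []
        for c in range(len(src[r])):
            merged.append(src[r][c])      # element from the upper row of the pair
            merged.append(src[r + 1][c])  # element from the lower row, same column
        dst.append(merged)
    return dst
```

For the source `[[1, 2], [3, 4], [5, 6], [7, 8]]`, the first column is 1, 3, 5, 7 and the second is 2, 4, 6, 8, so the result is `[[1, 3, 2, 4], [5, 7, 6, 8]]`, matching the element-by-element description in the abstract.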

    SYSTEMS AND METHODS FOR PERFORMING INSTRUCTIONS TO CONVERT TO 16-BIT FLOATING-POINT FORMAT

    Publication No.: EP3822774A1

    Publication date: 2021-05-19

    Application No.: EP20216494.3

    Filing date: 2019-10-08

    Abstract: Disclosed embodiments relate to a processor, a system on a chip and a system for executing a format conversion instruction. In one example, a processor has a plurality of cores, including a core that, in response to a format conversion instruction having a first source operand including a first 32-bit single-precision floating point data element, and a second source operand including a second 32-bit single-precision floating point data element, is to: convert the first 32-bit single-precision floating point data element to a first 16-bit floating point data element, wherein, when the first 32-bit single-precision floating point data element is a normal data element, conversion is to be performed according to a rounding mode specified by the format conversion instruction, and the first 16-bit floating point data element is to have a sign bit, an 8-bit exponent, seven explicit mantissa bits, and one implicit mantissa bit, and wherein, when the first 32-bit single-precision floating point data element is a not-a-number, NaN, data element, the first 16-bit floating point data element is to have a mantissa with a most significant bit set to one; convert the second 32-bit single-precision floating point data element to a second 16-bit floating point data element, wherein, when the second 32-bit single-precision floating point data element is a normal data element, conversion is to be performed according to the rounding mode, and the second 16-bit floating point data element is to have a sign bit, an 8-bit exponent, seven explicit mantissa bits, and one implicit mantissa bit, and wherein when the second 32-bit single-precision floating point data element is a NaN data element, the second 16-bit floating point data element is to have a mantissa with a most significant bit set to one; and store the first 16-bit floating point data element in a lower order half of a destination register and the second 16-bit floating point data element in a higher order half of the destination register.
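
The conversion described in this abstract can be approximated at the bit level. The 16-bit layout it names (sign bit, 8-bit exponent, seven explicit mantissa bits) matches the upper half of a 32-bit single-precision value, so the sketch below rounds and keeps the top 16 bits. Round-to-nearest-even is assumed here, whereas the abstract leaves the rounding mode to the instruction; handling of denormal inputs is not modeled; the names `f32_to_bf16_bits` and `pack_pair` are hypothetical.

```python
import struct

def f32_bits(value):
    """Return the 32-bit single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def f32_to_bf16_bits(value):
    """Convert to a 16-bit pattern: sign, 8-bit exponent, 7 explicit mantissa bits.
    Assumption: round to nearest, ties to even."""
    bits = f32_bits(value)
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF and fraction != 0:
        # NaN input: keep the top bits and set the mantissa's most significant bit
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit so that exact ties round to even
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def pack_pair(first, second):
    """First result goes in the lower-order half, second in the higher-order half."""
    return (f32_to_bf16_bits(second) << 16) | f32_to_bf16_bits(first)
```

Under these assumptions, `f32_to_bf16_bits(1.0)` gives `0x3F80` and `pack_pair(1.0, -2.0)` gives `0xC0003F80`.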

    SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

    Publication No.: EP3651017A2

    Publication date: 2020-05-13

    Application No.: EP19201841.4

    Filing date: 2019-10-08

    Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.

    SYSTEMS AND METHODS FOR PERFORMING INSTRUCTIONS TO TRANSFORM MATRICES INTO ROW-INTERLEAVED FORMAT

    Publication No.: EP4290371A3

    Publication date: 2024-03-13

    Application No.: EP23205691.1

    Filing date: 2019-06-27

    Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, an apparatus comprises: a plurality of registers, each register of the plurality of registers to store a plurality of matrix data elements and matrix processing circuitry to execute a matrix processing instruction to multiply a first source tile of a first matrix and a second source tile of a second matrix, the first source tile comprising rows and columns of a first subset of data elements of the first source matrix and the second source tile comprising rows and columns of a second subset of data elements of the second source matrix. The matrix processing circuitry comprises: circuitry to transform the first source tile by merging adjacent pairs of rows of the first source tile to generate corresponding row-interleaved data element sequences, each row-interleaved data element sequence to be loaded in a corresponding register of the plurality of registers; a set of multipliers to perform a parallel multiplication of each data element of the first subset of data elements stored in the corresponding registers of the plurality of registers with a corresponding data element of the second subset of data elements to generate a corresponding plurality of products; and accumulator circuitry to add the plurality of products to corresponding accumulated data elements of an accumulation matrix to generate corresponding result data elements of a result matrix.

    SYSTEMS AND METHODS FOR PERFORMING INSTRUCTIONS TO TRANSFORM MATRICES INTO ROW-INTERLEAVED FORMAT

    Publication No.: EP4290371A2

    Publication date: 2023-12-13

    Application No.: EP23205691.1

    Filing date: 2019-06-27

    Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, an apparatus comprises: a plurality of registers, each register of the plurality of registers to store a plurality of matrix data elements and matrix processing circuitry to execute a matrix processing instruction to multiply a first source tile of a first matrix and a second source tile of a second matrix, the first source tile comprising rows and columns of a first subset of data elements of the first source matrix and the second source tile comprising rows and columns of a second subset of data elements of the second source matrix. The matrix processing circuitry comprises: circuitry to transform the first source tile by merging adjacent pairs of rows of the first source tile to generate corresponding row-interleaved data element sequences, each row-interleaved data element sequence to be loaded in a corresponding register of the plurality of registers; a set of multipliers to perform a parallel multiplication of each data element of the first subset of data elements stored in the corresponding registers of the plurality of registers with a corresponding data element of the second subset of data elements to generate a corresponding plurality of products; and accumulator circuitry to add the plurality of products to corresponding accumulated data elements of an accumulation matrix to generate corresponding result data elements of a result matrix.
