APPARATUS AND METHOD FOR A LOAD INSTRUCTION WITH A READ-SHARED INDICATION

    Publication No.: EP4485177A1

    Publication date: 2025-01-01

    Application No.: EP23211673.1

    Filing date: 2023-11-23

    Abstract: Techniques for loading data with a hint related to data sharing with other cores. For example, one embodiment of an apparatus comprises: a plurality of cores to process instructions; a first core of the plurality of cores comprising: decoder circuitry to decode a single instruction, the single instruction having a first field for an opcode to indicate a load operation to read data from a memory, a second field to indicate a memory address for a location of the data in the memory, and a third field to store a value to indicate whether the data is expected to be shared between the first core and at least a second core of the plurality of cores; execution circuitry to execute the single instruction to read the data from the location in the memory; and cache controller circuitry to store the data in one or more caches in a state selected based on the value.
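
The abstract above says only that the cache controller stores the loaded data "in a state selected based on the value" of the third field. As a minimal sketch, assuming MESI-style line states and assuming that a set hint maps to Shared and a clear hint to Exclusive (neither is stated in the abstract), the selection could look like this. The names `LineState`, `select_fill_state`, and `ToyCache` are hypothetical.

```python
from enum import Enum


class LineState(Enum):
    # Assumed MESI-style states; the abstract does not name a protocol.
    EXCLUSIVE = "E"
    SHARED = "S"


def select_fill_state(read_shared_hint: bool) -> LineState:
    """Choose the state for a newly filled line from the instruction's hint value."""
    # Assumption: a set hint means other cores are expected to read the data too.
    return LineState.SHARED if read_shared_hint else LineState.EXCLUSIVE


class ToyCache:
    """A single-level cache model that records only data and line state."""

    def __init__(self):
        self.lines = {}  # address -> (data, LineState)

    def load(self, memory, address, read_shared_hint):
        """Read from memory and install the line in the hint-selected state."""
        data = memory[address]
        self.lines[address] = (data, select_fill_state(read_shared_hint))
        return data
```

Calling `ToyCache().load(memory, address, True)` returns the data and leaves the line in the Shared state in this toy model; a real controller would also involve the coherence fabric, which is not modelled here.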

    SYSTEMS AND METHODS TO TRANSPOSE VECTORS ON-THE-FLY WHILE LOADING FROM MEMORY

    Publication No.: EP4375835A2

    Publication date: 2024-05-29

    Application No.: EP24169357.1

    Filing date: 2019-10-15

    Abstract: Disclosed embodiments relate to transposing vectors while loading from memory. In one example, a processor comprises: a register file comprising one or more vector registers; a memory interface to read a plurality of data elements from a memory; fetch circuitry to fetch an instruction; decode circuitry to decode the instruction; and execution circuitry to execute the instruction. The instruction includes a plurality of fields to indicate an opcode, a subset of the plurality of data elements to be broadcast, and locations of the plurality of data elements, the plurality of data elements arranged in a corresponding plurality of relative positions, wherein the plurality of data elements include a first group of data elements and a second group of data elements. The execution circuitry performs a permute operation and a broadcast operation in accordance with the instruction, wherein the broadcast operation is to cause the subset of the plurality of data elements to be broadcast to a plurality of the relative positions associated with a corresponding plurality of other subsets of the plurality of data elements, the subset of the plurality of data elements to replace the other corresponding subsets at the plurality of relative positions.
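
One possible reading of the permute-and-broadcast behaviour described above is sketched below. The abstract does not specify the permutation, the group size, or how the instruction encodes the target positions, so every parameter of the hypothetical `load_permute_broadcast` function (`permutation`, `group_size`, `source_group`, `target_groups`) is an assumption made for illustration.

```python
def load_permute_broadcast(memory, base, count, permutation,
                           group_size, source_group, target_groups):
    """Load `count` elements, permute them, then copy one group over others.

    All parameters are illustrative assumptions, not fields from the abstract.
    """
    # Load the elements from their locations in memory.
    elements = [memory[base + i] for i in range(count)]

    # Permute: output position i takes the element at index permutation[i].
    permuted = [elements[p] for p in permutation]

    # Broadcast: the chosen group replaces the groups at the target positions.
    start = source_group * group_size
    chosen = permuted[start:start + group_size]
    result = list(permuted)
    for group in target_groups:
        offset = group * group_size
        result[offset:offset + group_size] = chosen
    return result
```

For eight elements `10..17`, a pairwise-swap permutation, groups of two, and group 0 broadcast to groups 1 and 3, the sketch yields `[11, 10, 11, 10, 15, 14, 11, 10]`.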

    APPARATUSES, METHODS, AND SYSTEMS FOR 8-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

    Publication No.: EP4276608A3

    Publication date: 2024-01-10

    Application No.: EP23195872.9

    Filing date: 2021-09-14

    Abstract: Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. For example, a processing unit comprises circuitry to perform operations corresponding to an instruction, the instruction to specify a first matrix having M rows by 4*K columns of 8-bit floating-point data elements, a second matrix having 4*K rows by N columns of 8-bit floating-point data elements, and a third matrix having M rows by N columns of 32-bit single precision floating-point data elements. The operations include, for each row m of the M rows of the first matrix and for each column n of the N columns of the second matrix: converting the 4*K 8-bit floating-point data elements of the row m of the first matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than an 8-bit floating-point data element, and converting the 4*K 8-bit floating-point data elements of the column n of the second matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than the 8-bit floating-point data element; multiplying the 4*K higher precision floating-point data elements corresponding to the row m of the first matrix with corresponding ones of the 4*K higher precision floating-point data elements corresponding to the column n of the second matrix to generate 4*K products; accumulating the 4*K products with a 32-bit single precision floating-point data element corresponding to the row m and the column n of the third matrix to generate a result 32-bit single precision floating-point data element; and storing the result 32-bit single precision floating-point data element at the row m and the column n of the third matrix.
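
The convert, multiply, accumulate, and store steps in the abstract above can be sketched in plain Python. The abstract does not say which 8-bit floating-point encoding is used, so this sketch assumes E5M2 (whose bit pattern equals the top byte of an IEEE half-precision value); it also assumes the products are summed at higher precision and rounded to single precision once per element, which the abstract leaves open. The function names are hypothetical.

```python
import struct


def fp8_e5m2_to_float(byte: int) -> float:
    """Decode one FP8 byte, assuming E5M2 (the top 8 bits of an IEEE half)."""
    return struct.unpack("<e", bytes([0, byte & 0xFF]))[0]


def to_float32(x: float) -> float:
    """Round a Python float to the nearest 32-bit single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp8_matrix_dot_product(a, b, c):
    """Update c (M x N) in place with the product of a (M x 4K) and b (4K x N).

    a and b hold FP8 bytes; c holds single-precision accumulator values.
    """
    depth, n_cols = len(b), len(b[0])
    for m in range(len(a)):
        # Convert row m of the first matrix to higher precision.
        row = [fp8_e5m2_to_float(x) for x in a[m]]
        for n in range(n_cols):
            # Convert column n of the second matrix to higher precision.
            col = [fp8_e5m2_to_float(b[k][n]) for k in range(depth)]
            # Multiply corresponding elements to form the 4*K products.
            products = [row[k] * col[k] for k in range(depth)]
            # Accumulate with the existing element and store as float32.
            c[m][n] = to_float32(c[m][n] + sum(products))
    return c
```

With `a = [[0x3C, 0x40, 0x38, 0xC0]]` (1.0, 2.0, 0.5, -2.0), `b = [[0x3C], [0x3C], [0x40], [0x38]]` (1.0, 1.0, 2.0, 0.5), and `c = [[10.0]]`, the four products are 1, 2, 1, and -1, so the stored element becomes 13.0.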

    APPARATUSES, METHODS, AND SYSTEMS FOR 8-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

    Publication No.: EP4276608A2

    Publication date: 2023-11-15

    Application No.: EP23195872.9

    Filing date: 2021-09-14

    Abstract: Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. For example, a processing unit comprises circuitry to perform operations corresponding to an instruction, the instruction to specify a first matrix having M rows by 4*K columns of 8-bit floating-point data elements, a second matrix having 4*K rows by N columns of 8-bit floating-point data elements, and a third matrix having M rows by N columns of 32-bit single precision floating-point data elements. The operations include, for each row m of the M rows of the first matrix and for each column n of the N columns of the second matrix: converting the 4*K 8-bit floating-point data elements of the row m of the first matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than an 8-bit floating-point data element, and converting the 4*K 8-bit floating-point data elements of the column n of the second matrix to 4*K corresponding higher precision floating-point data elements having a higher precision than the 8-bit floating-point data element; multiplying the 4*K higher precision floating-point data elements corresponding to the row m of the first matrix with corresponding ones of the 4*K higher precision floating-point data elements corresponding to the column n of the second matrix to generate 4*K products; accumulating the 4*K products with a 32-bit single precision floating-point data element corresponding to the row m and the column n of the third matrix to generate a result 32-bit single precision floating-point data element; and storing the result 32-bit single precision floating-point data element at the row m and the column n of the third matrix.

    MATRIX TRANSPOSE AND MULTIPLY
    Document type: Published invention application

    Publication No.: EP4468146A3

    Publication date: 2025-02-19

    Application No.: EP24205150.6

    Filing date: 2020-11-26

    Abstract: Embodiments for a matrix transpose and multiply operation are disclosed. In an embodiment, a processor comprises: a plurality of registers to store a plurality of packed data elements including a first plurality of packed data elements of a first source matrix tile and a second plurality of packed data elements of a second source matrix tile, the first and second source matrix tiles comprising respective portions of a first source matrix and a second source matrix, and wherein each packed data element of the plurality of packed data elements has an element width; a decoder to decode one or more instructions, at least one instruction of the one or more instructions including an opcode field configured to specify an opcode, a first source operand configured to indicate the first source matrix tile, a second source operand configured to indicate the second source matrix tile, and a destination operand configured to indicate a result matrix tile; and execution circuitry to, in response to the one or more instructions, transpose the first source matrix tile in accordance with a granularity equal to the element width to generate a first transposed source matrix tile and multiply the first transposed source matrix tile and the second source matrix tile. The execution circuitry comprises: a plurality of multipliers to multiply data elements of the first transposed source matrix tile and corresponding data elements of the second source matrix tile to produce a corresponding plurality of products; and one or more accumulators to add groups of the products to generate corresponding result data elements in the result matrix tile.
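
As a rough software analogue of the tile operation above, the sketch below transposes the first source tile one element at a time and multiplies it by the second tile, summing each group of products into one result element. Tiles are modelled as lists of rows of plain numbers; the function names are hypothetical, and register layout, element widths, and tile sizes are not modelled.

```python
def transpose(tile):
    """Transpose a tile at single-element granularity."""
    return [list(column) for column in zip(*tile)]


def transpose_and_multiply(src1, src2):
    """Return transpose(src1) multiplied by src2 as a new result tile."""
    a_t = transpose(src1)
    if len(a_t[0]) != len(src2):
        raise ValueError("src1 and src2 must have the same number of rows")
    result = []
    for row in a_t:
        result_row = []
        for n in range(len(src2[0])):
            # Multipliers: one product per pair of corresponding elements.
            products = [row[k] * src2[k][n] for k in range(len(src2))]
            # Accumulator: add the group of products into one result element.
            result_row.append(sum(products))
        result.append(result_row)
    return result
```

For `src1 = [[1, 2], [3, 4], [5, 6]]` and `src2 = [[1, 0], [0, 1], [1, 1]]`, the transposed first tile is `[[1, 3, 5], [2, 4, 6]]` and the result tile is `[[6, 8], [8, 10]]`.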

    MATRIX TRANSPOSE AND MULTIPLY
    Document type: Published invention application

    Publication No.: EP4462249A3

    Publication date: 2025-02-19

    Application No.: EP24203555.8

    Filing date: 2020-11-26

    Abstract: Embodiments for a matrix transpose and multiply operation are disclosed. In an embodiment, an apparatus comprises decode circuitry to decode an instance of an instruction having a format including an opcode field to specify an opcode, a first destination operand field to specify a destination matrix location, a first source operand field to specify a first source matrix location, a second source operand field to specify a second source matrix location, and a third operand field to specify a source/destination matrix location; and execution circuitry to, in response to the opcode of the decoded instance of the instruction, transpose columns of data element pairs of the first source matrix into rows, perform a dot product of data element pairs of the transposed columns of data element pairs of the first source matrix and corresponding row data element pairs of the second source matrix, and add a result of the dot product to a corresponding row data element of the source/destination matrix.
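
The pair-wise variant described above can be illustrated as follows. This is a sketch under an assumed layout that the abstract does not spell out: the first source holds K rows by M columns of element pairs, the second source holds K rows by N columns of element pairs, and the source/destination matrix holds M rows by N columns of scalars. The function name is hypothetical.

```python
def transpose_pairs_dot_accumulate(src1, src2, acc):
    """Accumulate pair-wise dot products of transpose(src1) and src2 into acc.

    src1[k][m] and src2[k][n] are 2-tuples (element pairs); acc[m][n] is a
    scalar. This layout is an assumption made for the illustration.
    """
    for m in range(len(acc)):
        for n in range(len(acc[0])):
            for k in range(len(src1)):
                # Reading src1[k][m] while iterating over m is the transpose:
                # column m of pairs in src1 is consumed as a row.
                a0, a1 = src1[k][m]
                b0, b1 = src2[k][n]
                # Dot product of the two pairs, added to the accumulator.
                acc[m][n] += a0 * b0 + a1 * b1
    return acc
```

With `src1 = [[(1, 2), (3, 4)]]`, `src2 = [[(5, 6)]]`, and `acc = [[0], [100]]`, the updated accumulator is `[[17], [139]]`.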

    APPARATUS AND METHOD FOR DOWN-CONVERTING AND INTERLEAVING MULTIPLE FLOATING POINT VALUES

    Publication No.: EP4321992A3

    Publication date: 2024-05-01

    Application No.: EP23210931.4

    Filing date: 2020-02-07

    Abstract: An apparatus and method for down-converting and interleaving data elements. For example, one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed data elements; a second source register to store a second plurality of packed data elements; a destination register to store a third plurality and a fourth plurality of packed data elements, each of the third and fourth plurality of packed data elements to be encoded with fewer bits than each of the first and second plurality of packed data elements; and execution circuitry to execute the decoded instruction, the execution circuitry comprising: down-conversion circuitry to down-convert each of the first plurality of packed data elements to generate one of the third plurality of packed data elements and to down-convert each of the second plurality of packed data elements to generate one of the fourth plurality of packed data elements; and interleave circuitry to interleave the third plurality of packed data elements with the fourth plurality of packed data elements within the destination register.
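
The down-convert and interleave steps above can be sketched in a few lines. The abstract says only that the destination elements use fewer bits than the source elements, so this sketch assumes IEEE half precision as the narrower format and assumes that elements from the first source land in even destination slots and elements from the second source in odd slots. Python floats stand in for the wider source elements, and the function names are hypothetical.

```python
import struct


def down_convert_to_fp16_bits(value: float) -> int:
    """Return the 16-bit pattern of `value` rounded to IEEE half precision."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def down_convert_and_interleave(src1, src2):
    """Down-convert both sources and interleave them into one destination list."""
    if len(src1) != len(src2):
        raise ValueError("source registers must hold the same number of elements")
    dest = []
    for first, second in zip(src1, src2):
        # Assumed order: first-source element, then second-source element.
        dest.append(down_convert_to_fp16_bits(first))
        dest.append(down_convert_to_fp16_bits(second))
    return dest
```

For `src1 = [1.0, 2.0]` and `src2 = [0.5, -2.0]`, the destination holds the bit patterns `[0x3C00, 0x3800, 0x4000, 0xC000]`.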
