SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT VECTOR DOT PRODUCT INSTRUCTIONS

    Publication No.: EP4455870A2

    Publication date: 2024-10-30

    Application No.: EP24177208.6

    Filing date: 2019-10-08

    Applicant: Intel Corporation

    IPC classification: G06F9/30

    Abstract: Disclosed embodiments relate to systems and methods for performing a floating-point dot product instruction. In one example, a processor includes fetch circuitry to fetch a single instruction having fields to specify an opcode, a writemask, and locations of first source, second source, and destination vectors, decode circuitry to decode the fetched instruction, and execution circuitry to execute the instruction as per the opcode. The writemask is to control whether to mask the destination vector, with masked elements of the destination vector being either zeroed or merged. For elements which are not masked, the opcode is to indicate that the execution circuitry is to generate products of N pairs of 16-bit floating-point elements of the first and second source vectors, and accumulate each product with previous contents of a corresponding single-precision element of the destination vector to produce a corresponding result element. The execution circuitry, in generating products, is to convert each 16-bit floating-point element in each pair to a single-precision element by packing the 16 bits of the 16-bit floating-point element into the upper 16 bits of the single-precision element and zeroing the lower 16 bits of the single-precision element. A format of the 16-bit floating-point elements is bfloat16.
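
The widening and accumulation steps in this abstract can be sketched in Python. This is an illustrative model only, not the patented circuitry: the function names (`bf16_to_f32`, `round_f32`, `dot_bf16`), the choice of N = 2 pairs per destination element, and the rounding to single precision after every accumulation are assumptions made for the example.

```python
import struct

def bf16_to_f32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern: its 16 bits become the upper half of a
    32-bit single-precision pattern and the lower 16 bits are zero."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]

def round_f32(x: float) -> float:
    """Round a Python float (a double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_bf16(dst, src1, src2, mask, zeroing=False, n=2):
    """dst: single-precision accumulators (floats).
    src1, src2: bfloat16 bit patterns, n per destination element.
    mask: integer writemask, bit i controls destination element i."""
    out = []
    for i, acc in enumerate(dst):
        if not (mask >> i) & 1:
            # Masked element: either zeroed or merged (previous value kept).
            out.append(0.0 if zeroing else acc)
            continue
        for j in range(n):
            a = bf16_to_f32(src1[n * i + j])
            b = bf16_to_f32(src2[n * i + j])
            # Assumption: round to single precision after each accumulation.
            acc = round_f32(acc + a * b)
        out.append(acc)
    return out

# bfloat16 patterns: 0x3F80 = 1.0, 0x4000 = 2.0, 0x4040 = 3.0, 0x3F00 = 0.5
result = dot_bf16([1.0, 10.0],
                  [0x3F80, 0x4000, 0x4040, 0x3F00],
                  [0x4000, 0x4000, 0x3F80, 0x4000],
                  mask=0b01)
```

With mask `0b01` only element 0 is updated (1 + 1*2 + 2*2); element 1 keeps its previous value under merging and would become zero under zeroing.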

    INSTRUCTIONS TO CONVERT FROM FP16 TO FP8

    Publication No.: EP4318224A1

    Publication date: 2024-02-07

    Application No.: EP23182966.4

    Filing date: 2023-07-03

    Applicant: Intel Corporation

    IPC classification: G06F9/30

    Abstract: Techniques for converting FP16 or FP32 data elements to FP8 data elements using a single instruction are described. An exemplary apparatus includes decoder circuitry to decode a single instruction, the single instruction to include one or more fields to identify a source operand, one or more fields to identify a destination operand, and one or more fields for an opcode, the opcode to indicate that execution circuitry is to convert packed half-precision floating-point data or single-precision floating-point data from the identified source to packed FP8 data and store the packed bfloat8 data into corresponding data element positions of the identified destination operand; and execution circuitry to execute the decoded instruction according to the opcode to convert packed half-precision floating-point data or single-precision floating-point data from the identified source to packed bfloat8 data and store the packed bfloat8 data into corresponding data element positions.
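
A rough Python model of the half-precision case follows. The abstract does not give the bit layout or the rounding mode, so both are assumptions here: bfloat8 is taken to have 1 sign bit, 5 exponent bits, and 2 mantissa bits (the top byte of the FP16 layout), and rounding is to nearest, ties to even. The function names are hypothetical.

```python
import struct

def fp16_bits(x: float) -> int:
    """Return the IEEE half-precision bit pattern of x."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def fp16_to_bf8(h: int) -> int:
    """Narrow one FP16 bit pattern to an 8-bit pattern.
    Assumption: bf8 = sign(1) | exponent(5) | mantissa(2)."""
    h &= 0xFFFF
    is_nan = (h & 0x7C00) == 0x7C00 and (h & 0x03FF) != 0
    if is_nan:
        # Keep a NaN a NaN even if its payload sits in the dropped bits.
        return (h >> 8) | 0x02
    lsb = (h >> 8) & 1                    # lowest bit that survives
    return ((h + 0x7F + lsb) >> 8) & 0xFF  # round to nearest, ties to even

def convert_packed(src):
    """Convert each packed FP16 element to the same position in the result."""
    return [fp16_to_bf8(h) for h in src]

packed = convert_packed([fp16_bits(1.0), fp16_bits(-2.0), 0x3C80, 0x3D80])
```

The last two inputs are exact ties: `0x3C80` rounds down to `0x3C` and `0x3D80` rounds up to `0x3E`, so in both cases the result has an even low bit.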

    SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

    Publication No.: EP4276609A2

    Publication date: 2023-11-15

    Application No.: EP23200278.2

    Filing date: 2019-10-08

    Applicant: Intel Corporation

    IPC classification: G06F9/30

    Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processing unit comprises fetch circuitry to fetch an instruction, decode circuitry to decode the instruction, the instruction having a first field to specify a first storage location of a plurality of data elements corresponding to a first matrix having M rows by N columns of 32-bit single precision floating-point data elements, a second field to specify a second storage location of a plurality of data elements corresponding to a second matrix having M rows by K columns of pairs of 16-bit floating-point data elements having a bfloat16 format, and a third field to specify a third storage location of a plurality of data elements corresponding to a third matrix having K rows by N columns of pairs of 16-bit floating-point data elements having the bfloat16 format, and execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations corresponding to the instruction.

    SYSTEMS, METHODS, AND APPARATUSES FOR DOT PRODUCTION OPERATIONS

    Publication No.: EP4012555A1

    Publication date: 2022-06-15

    Application No.: EP22154164.2

    Filing date: 2017-07-01

    Applicant: Intel Corporation

    IPC classification: G06F9/30

    Abstract: Embodiments detailed herein relate to matrix operations. For example, an apparatus comprises decode circuitry to decode an instruction and execution circuitry, coupled with the decode circuitry. The instruction has fields to indicate a first M row by K column (MxK) matrix, a second K row by N column (KxN) matrix, and a third M row by N column (MxN) matrix. The first MxK matrix has data elements of a first size, the second KxN matrix has data elements of the first size, and the third MxN matrix has data elements of a second size four times the first size. The execution circuitry performs operations corresponding to the instruction, including to: for each row of the first MxK matrix, and each column of the second KxN matrix: generate a dot-product from all data elements of the row of the first MxK matrix and all data elements of the column of the second KxN matrix, and accumulate the dot-product with a data element from a corresponding row and a corresponding column of the third MxN matrix.
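
As a small illustration of the row-by-column accumulation described above, the sketch below uses signed 8-bit inputs and 32-bit accumulators, which is one instance of "a second size four times the first size". The element types, the wraparound on overflow, and the function names are assumptions for the example and are not stated in the abstract.

```python
def wrap_int32(x: int) -> int:
    """Wrap an integer into the signed 32-bit range (assumed overflow rule)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def matmul_accumulate(a, b, c):
    """a: M x K matrix of 8-bit values, b: K x N matrix of 8-bit values,
    c: M x N matrix of 32-bit accumulators. Returns the updated M x N matrix."""
    m_rows, k_len, n_cols = len(a), len(b), len(b[0])
    out = []
    for i in range(m_rows):
        row = []
        for j in range(n_cols):
            # Dot product of row i of a with column j of b.
            dot = sum(a[i][p] * b[p][j] for p in range(k_len))
            # Accumulate with the element at the same row and column of c.
            row.append(wrap_int32(c[i][j] + dot))
        out.append(row)
    return out

updated = matmul_accumulate([[1, 2], [3, 4]],
                            [[5, 6], [7, 8]],
                            [[1, 0], [0, 1]])
```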

    SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

    Publication No.: EP4002105A1

    Publication date: 2022-05-25

    Application No.: EP21217772.9

    Filing date: 2019-10-08

    Applicant: Intel Corporation

    IPC classification: G06F9/30

    Abstract: Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processing unit comprises: fetch circuitry to fetch an instruction; decode circuitry to decode the instruction; and execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations corresponding to the instruction. The instruction has an opcode, a first field to specify a first storage location of a plurality of data elements corresponding to a first matrix having M rows by N columns of 32-bit single precision floating-point data elements, a second field to specify a second storage location of a plurality of data elements corresponding to a second matrix having M rows by K columns of 16-bit floating-point data elements having a bfloat16 format, and a third field to specify a third storage location of a plurality of data elements corresponding to a third matrix having K rows by N columns of 16-bit floating-point data elements having the bfloat16 format. The execution circuitry is to perform operations corresponding to the instruction to, for each row m of the M rows of the second matrix, and for each column n of the N columns of the third matrix: generate a dot product from K 16-bit floating-point data elements corresponding to the row m of the second matrix and K 16-bit floating-point data elements corresponding to the column n of the third matrix; accumulate the dot product with a 32-bit single precision floating-point data element corresponding to a row m of the M rows, and corresponding to a column n of the N columns, of the first matrix to generate a result 32-bit single precision floating-point data element; and store the result 32-bit single precision floating-point data element in a position of the first storage location corresponding to the row m and the column n of the first matrix.
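
The three steps listed in this abstract (generate the dot product, accumulate, store back into the first matrix) can be modeled as below. The helper names and the rounding to single precision after every addition are assumptions; the abstract does not specify rounding or the order of the K additions.

```python
import struct

def bf16_to_f32(bits16: int) -> float:
    """Interpret a bfloat16 bit pattern as the upper 16 bits of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]

def to_f32(x: float) -> float:
    """Round a Python float (a double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tile_dot_bf16(c, a, b):
    """c: M x N single-precision matrix, updated in place.
    a: M x K matrix of bfloat16 bit patterns.
    b: K x N matrix of bfloat16 bit patterns."""
    m_rows, k_len, n_cols = len(a), len(b), len(c[0])
    for m in range(m_rows):
        for n in range(n_cols):
            dot = 0.0
            for k in range(k_len):
                # Product of element k of row m of a and element k of column n of b.
                dot = to_f32(dot + bf16_to_f32(a[m][k]) * bf16_to_f32(b[k][n]))
            # Accumulate with c[m][n] and store the result at the same position.
            c[m][n] = to_f32(c[m][n] + dot)
    return c

# bfloat16 patterns: 0x3F80 = 1.0, 0x4000 = 2.0, 0x4040 = 3.0, 0x3F00 = 0.5
tile = tile_dot_bf16([[1.0]], [[0x3F80, 0x4000]], [[0x4040], [0x3F00]])
```

Here M = N = 1 and K = 2, so the single accumulator becomes 1 + (1*3 + 2*0.5).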

    EFFICIENT IMPLEMENTATION OF COMPLEX VECTOR FUSED MULTIPLY ADD AND COMPLEX VECTOR MULTIPLY

    Publication No.: EP3979073A1

    Publication date: 2022-04-06

    Application No.: EP21203400.3

    Filing date: 2019-02-26

    Applicant: Intel Corporation

    IPC classification: G06F9/30

    Abstract: Disclosed embodiments relate to efficient complex vector multiplication. In one example, a processor comprises fetch and decode circuitry to fetch and decode an instruction having fields to specify an accumulation complex vector, a multiplier complex vector, and a multiplicand complex vector, and execution circuitry, responsive to the decoded instruction, to generate a double-even multiplicand vector by duplicating even elements of the specified multiplicand complex vector into adjacent more significant odd element positions, multiply elements of the multiplier complex vector and elements from corresponding positions of the double-even multiplicand vector to generate corresponding products, accumulate the products with elements from corresponding positions of the destination complex vector, and store a result in a destination storage location.
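
The double-even step can be illustrated with complex numbers stored as interleaved (real, imaginary) pairs, which is an assumed layout. `fma_double_even` follows the abstract; `finish_complex_fma` is not described in the abstract and is included only as an assumption, to show how a second pass over the odd (imaginary) elements would complete a full complex multiply-accumulate. All names are hypothetical.

```python
def double_even(v):
    """Copy each even element into the adjacent odd position: [c, d] -> [c, c]."""
    return [v[i - (i % 2)] for i in range(len(v))]

def fma_double_even(acc, multiplier, multiplicand):
    """Step from the abstract: acc + multiplier * double_even(multiplicand)."""
    dup = double_even(multiplicand)
    return [s + x * y for s, x, y in zip(acc, multiplier, dup)]

def finish_complex_fma(partial, multiplier, multiplicand):
    """Assumed second pass (not in the abstract): use the imaginary parts of the
    multiplicand with swapped multiplier elements and a sign flip on the real lane."""
    out = list(partial)
    for i in range(0, len(out), 2):
        a, b = multiplier[i], multiplier[i + 1]
        d = multiplicand[i + 1]
        out[i] -= b * d       # real part:      ac - bd
        out[i + 1] += a * d   # imaginary part: bc + ad
    return out

# (1 + 2i) * (3 + 4i) accumulated onto 0 + 0i
step1 = fma_double_even([0, 0], [1, 2], [3, 4])
full = finish_complex_fma(step1, [1, 2], [3, 4])
```

After the first step the vector holds `[a*c, b*c]`; the assumed second step turns it into the real and imaginary parts of the complex product.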

    SYSTEMS, APPARATUSES, AND METHODS FOR FUSED MULTIPLY ADD

    Publication No.: EP3971710A1

    Publication date: 2022-03-23

    Application No.: EP21207389.4

    Filing date: 2016-10-20

    Applicant: Intel Corporation

    IPC classification: G06F9/30, G06F15/76

    Abstract: In some embodiments, a single instruction is provided that has an opcode, a first field to represent a packed data source/destination operand, a second field to represent a first packed data source operand, and a third field to represent a second packed data source operand. Packed data elements of the first and second packed data source operands are of a first size and packed data elements of the packed data source/destination operand are of a second size greater than the first size. In response to the single instruction, execution circuitry of an apparatus, according to the opcode of the single instruction, for each packed data element position of the packed data source/destination operand is configured to: sign extend a plurality of packed data bytes from a corresponding packed data element position of the first packed data source operand; zero extend a plurality of packed data bytes from a corresponding packed data element position of the second packed data source operand; multiply each of the sign extended plurality of packed data bytes from the first packed data source operand with a corresponding one of the zero extended plurality of packed data bytes from the second packed data source operand to result in a plurality of results; add the plurality of results with a packed data element of the second size of a corresponding packed data element position of the packed data source/destination operand to result in an addition result; and store the addition result in the corresponding packed data element position of the packed data source/destination operand.
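
The per-position sequence above (sign extend, zero extend, multiply, add, store) can be sketched as follows. The example assumes four bytes per 32-bit source/destination element and wraparound on overflow; neither the exact sizes nor the overflow behavior is stated in the abstract, and the function names are hypothetical.

```python
def sign_extend8(b: int) -> int:
    """Interpret the low 8 bits of b as a signed byte."""
    b &= 0xFF
    return b - 256 if b & 0x80 else b

def bytes_of(x: int):
    """Split a 32-bit value into four bytes, least significant first."""
    return [(x >> (8 * i)) & 0xFF for i in range(4)]

def wrap_int32(x: int) -> int:
    """Wrap into the signed 32-bit range (assumed overflow rule)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def dot_bytes_accumulate(dst, src1, src2):
    """dst: signed 32-bit source/destination elements.
    src1: 32-bit elements whose bytes are sign extended.
    src2: 32-bit elements whose bytes are zero extended."""
    out = []
    for acc, s1, s2 in zip(dst, src1, src2):
        # bytes_of already yields zero-extended values for the second source.
        total = sum(sign_extend8(p) * q for p, q in zip(bytes_of(s1), bytes_of(s2)))
        out.append(wrap_int32(acc + total))
    return out

# src1 bytes (signed): 1, 3, 2, -1; src2 bytes (unsigned): 4, 2, 1, 200
res = dot_bytes_accumulate([100], [0xFF020301], [0xC8010204])
```

The top byte shows the asymmetry: `0xFF` in the first source is read as -1, while `0xC8` in the second source is read as 200.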

    SYSTEMS AND METHODS FOR PERFORMING INSTRUCTIONS TO CONVERT TO 16-BIT FLOATING-POINT FORMAT

    Publication No.: EP3889768A1

    Publication date: 2021-10-06

    Application No.: EP21169540.8

    Filing date: 2019-10-08

    Applicant: Intel Corporation

    IPC classification: G06F9/30

    Abstract: Disclosed embodiments relate to a processor and a method for executing a format conversion instruction. In one example, a processor comprises a decode unit to decode the format conversion instruction and an execution unit to execute the decoded format conversion instruction. The format conversion instruction indicates a location of a first source operand, a location of a second source operand, a destination register, a writemask register, and a type of masking, the first source operand to include a first plurality of 32-bit single-precision floating-point data elements, the second source operand to include a second plurality of 32-bit single-precision floating-point data elements, the writemask register to store a plurality of mask bits each corresponding to a data element position in the destination register, the type of masking to be either zeroing masking or merging masking.
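
A possible model of such a two-source conversion with a writemask is sketched below. The abstract stops before describing the conversion itself, so several points are assumptions: the target format is taken to be bfloat16, rounding is to nearest with ties to even, and the converted elements of the second source fill the lower destination positions. The function names are hypothetical.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f32_to_bf16(x: float) -> int:
    """Narrow to a bfloat16 bit pattern (assumed: round to nearest, ties to even)."""
    bits = f32_bits(x)
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return (bits >> 16) | 0x0040  # keep NaN a NaN after dropping low bits
    lsb = (bits >> 16) & 1
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF

def convert_two_sources(dst, src1, src2, mask, zeroing=False):
    """dst: previous 16-bit destination elements; src1, src2: lists of floats.
    Assumption: src2 results occupy the lower positions, src1 the upper ones."""
    converted = [f32_to_bf16(x) for x in src2] + [f32_to_bf16(x) for x in src1]
    out = []
    for i, old in enumerate(dst):
        if (mask >> i) & 1:
            out.append(converted[i])
        else:
            # Zeroing masking clears the element; merging masking keeps it.
            out.append(0 if zeroing else old)
    return out

merged = convert_two_sources([0xAAAA] * 4, [1.0, 2.0], [3.0, 0.5], mask=0b0111)
```

With mask `0b0111` the first three positions receive converted values and the last one keeps `0xAAAA` under merging masking.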