Abstract:
Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in at least a form of decode circuitry to decode an instruction having fields for an opcode, a source matrix operand identifier, and destination memory information, and execution circuitry to execute the decoded instruction to store each data element of configured rows of the identified source matrix operand to memory based on the destination memory information
Abstract:
An example processor includes a register and an ADD low functional unit. The register stores first, second, and third floating point (FP) values. The ADD low functional unit receives a request to perform an ADD low operation and, responsive to the request: adds the first FP value with the second FP value to obtain a first sum value; rounds the first sum value to generate an ADD value; adds the first FP value with the second FP value to obtain a second sum value; subtracts the ADD value from the second sum value to generate a difference value; normalizes the difference value to obtain a normalized difference value; rounds the normalized difference value to generate an ADD low value; and sends the ADD low value to an application.
Abstract:
Systems, methods, and apparatuses for data speculation execution (DSX) are described. In some embodiments, a hardware apparatus for DSX comprises execution hardware to execute instructions to begin and end a data speculative execution (DSX) and speculative instructions during the DSX, and DSX tracking hardware to track speculative memory accesses and detect ordering violations in a DSX of speculative instructions using a sequence number, addresses of instruction accesses, and whether an instruction being tracked is a write, and to trigger a mis-speculation upon an ordering violation.
Abstract:
Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand to memory.
Abstract:
I n some embodiments, packed data elements of first and second packed data source operands are of a first, different size than a second size of packed data elements of a third packed data operand. Execution circuitry executes decoded single instruction to perform, for each packed data element position of a destination operand, a multiplication of a M N-sized packed data elements from the first and second packed data sources that correspond to a packed data element position of the third packed data source, add of results from these multiplications to a full-sized packed data element of a packed data element position of the third packed data source, and storage of the addition result in a packed data element position destination corresponding to the packed data element position of the third packed data source, wherein M is equal to the full-sized packed data element divided by N.
Abstract:
Embodiments of systems, apparatuses, and method for getting even or odd data elements are described. For example, in some embodiments, an apparatus includes a decoder to decode an instruction, wherein the instruction to include fields for a first source operand, a second source operand, and a destination operand; and execution circuitry to execute the decoded instruction to extract data elements from even data element positions of the first and second source operands and store the extracted data elements into the destination operand.
Abstract:
A processor comprises a plurality of vector registers, and an execution unit, operatively coupled to the plurality of vector registers, the execution unit comprising a logic circuit implementing a load instruction for loading, into two or more vector registers, two or more data items associated with a data structure stored in a memory, wherein each one of the two or more vector registers is to store a data item associated with a certain position number within the data structure.
Abstract:
Systems, methods, and apparatuses for data speculation execution (DSX) are described. In some embodiments, a hardware apparatus for performing DSX comprises a hardware decoder to decode an instruction, the instruction to include an opcode, and execution hardware to execute the decoded instruction to continue a data speculative execution (DSX) and to determine that a DSX loop iteration is to be committed, commit speculative stores associated with the DSX loop iteration, and start a new DSX loop iteration.
Abstract:
Methods, apparatus, and system to optimize compilation of source code into vectorized compiled code, notwithstanding the presence of output dependencies which might otherwise preclude vectorization.
Abstract:
An Aggregate Scatter instruction is described. A processor may include a memory interface and a register to store data elements of a data structure. The data elements may be contiguously stored in a first location in a memory accessible via the memory interface. The processor may further include a decoder to decode an aggregate scatter instruction specifying a store operation for the data structure and an execution unit to contiguously store the data elements to a second storage location in the memory in response to the decoded aggregate scatter instruction. The second storage location may be identified by a starting memory address of the second storage location.