摘要:
An apparatus and method for processing array of structures (AoS) and structure of arrays (SoA) data. For example, one embodiment of a processor comprises: a destination tile register to store data elements in a structure of arrays (SoA) format; a first source tile register to store indices associated with the data elements; instruction fetch circuitry to fetch an array of structures (AoS) gather instruction comprising operands identifying the first source tile register and the destination tile register; a decoder to decode the AoS gather instruction; and execution circuitry to determine a plurality of system memory addresses based on the indices from the first source tile register, to read data elements from the system memory addresses in an AoS format, and to load the data elements to the destination tile register in an SoA format.
摘要:
A method of an aspect includes receiving an instruction indicating a destination storage location. A result is stored in the destination storage location in response to the instruction. The result includes a sequence of at least four consecutive non-negative integers in numerical order. In an aspect, the instruction does not indicate a source packed data operand having a plurality of packed data elements in an architecturally-visible storage location. Other methods, apparatus, systems, and instructions are disclosed.
摘要:
An apparatus and method for performing a reciprocal square root. For example one embodiment of a processor comprises: a decoder to decode a reciprocal square root instruction to generate a decoded reciprocal square root instruction; a source register to store at least one packed input data element; a destination register to store a result data element; and reciprocal square root execution circuitry to execute the decoded reciprocal square root instruction, the reciprocal square root execution circuitry to use a first portion of the packed input data element as an index to a data structure containing a plurality of sets of coefficients to identify a first set of coefficients from the plurality of sets, the reciprocal square root execution circuitry to generate a reciprocal square root of the packed input data element using a combination of the coefficients and a second portion of the packed input data element.
摘要:
An apparatus and method for performing a reciprocal. For example one embodiment of a processor comprises: a decoder to decode a reciprocal instruction to generate a decoded reciprocal instruction; a source register to store at least one packed input data element; a destination register to store a result data element; and reciprocal execution circuitry to execute the decoded reciprocal instruction, the reciprocal execution circuitry to use a first portion of the packed input data element as an index to a data structure containing a plurality of sets of coefficients to identify a first set of coefficients from the plurality of sets, the reciprocal execution circuitry to generate a reciprocal of the packed input data element using a combination of the coefficients and a second portion of the packed input data element.
摘要:
Detailed herein are systems, methods, and apparatuses for improving vector throughput. For example, an apparatus comprising a plurality of aliasable registers, wherein each of the plurality of aliasable registers is partitioned into a plurality of lanes and each lane is aliasable as a distinct register; and execution circuitry to execute instructions using data from the plurality of aliasable registers as input and output operands is described.
摘要:
Embodiments of systems, apparatuses, and methods for lane-based strided gather are disclosed. In an embodiment, an apparatus includes a decoder to decode an instruction, wherein the instruction to include fields for indices of addresses to memory, and a packed data destination register operand; and execution circuitry to execute the decoded instruction to extract data elements of a defined number of types from memory using the indices of the instruction, and for each type, store the extracted data elements in one or more lanes of a packed data destination register dedicated to that type, wherein relative data elements between types are strided data elements apart.
摘要:
An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector routing element circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths.
摘要:
An apparatus is described having instruction execution logic circuitry to execute first, second, third and fourth instruction. Both the first instruction and the second instruction insert a first group of input vector elements to one of multiple first non overlapping sections of respective first and second resultant vectors. The first group has a first bit width. Each of the multiple first non overlapping sections have a same bit width as the first group. Both the third instruction and the fourth instruction insert a second group of input vector elements to one of multiple second non overlapping sections of respective third and fourth resultant vectors. The second group has a second bit width that is larger than said first bit width. Each of the multiple second non overlapping sections have a same bit width as the second group. The apparatus also includes masking layer circuitry to mask the first and third instructions at a first resultant vector granularity, and, mask the second and fourth instructions at a second resultant vector granularity.
摘要:
Embodiments of systems, apparatuses, and methods for performing in a computer processor vector double block packed sum of absolute differences (SAD) in response to a single vector double block packed sum of absolute differences instruction that includes a destination vector register operand, first and second source operands, an immediate, and an opcode are described.
摘要:
Systems, apparatuses, and methods for performing delta decoding on packed data elements of a source and storing the results in packed data elements of a destination using a single packed delta decode instruction are described. A processor may include a decoder to decode an instruction, and execution unit to execute the decoded instruction to calculate for each packed data element position of a source operand, other than a first packed data element position, a value that comprises a packed data element of that packed data element position and all packed data elements of packed data element positions that are of lesser significance, store a first packed data element from the first packed data element position of the source operand into a corresponding first packed data element position of a destination operand, and for each calculated value, store the value into a corresponding packed data element position of the destination operand.