摘要:
Embodiments of systems, apparatuses, and methods for broadcast arithmetic in a processor are described. For example, execution circuitry executes a decoded instruction to broadcast a data value from a least significant packed data element position of a first packed data source operand to a plurality of arithmetic circuits and for each packed data element position of a second packed data source operand, other than a least significant packed data element position, perform the arithmetic operation defined by the instruction on a data value from that packed data element position of the second packed data source operand and all data values from packed data element positions of the second packed data source operand that are of lesser position significance to the broadcast data value from the least significant packed data element position of the first packed data source operand, and stores a result of each arithmetic operation into a packed data element position of the packed data destination operand that corresponds to a most significant packed data element position of the second packed data source operand.
摘要:
A method of an aspect includes receiving an instruction indicating a destination storage location. A result is stored in the destination storage location in response to the instruction. The result includes a sequence of at least four consecutive non-negative integers in numerical order. In an aspect, the instruction does not indicate a source packed data operand having a plurality of packed data elements in an architecturally-visible storage location. Other methods, apparatus, systems, and instructions are disclosed.
摘要:
Detailed herein are systems, methods, and apparatuses for improving vector throughput. For example, an apparatus comprising a plurality of aliasable registers, wherein each of the plurality of aliasable registers is partitioned into a plurality of lanes and each lane is aliasable as a distinct register; and execution circuitry to execute instructions using data from the plurality of aliasable registers as input and output operands is described.
摘要:
Embodiments of systems, apparatuses, and methods for lane-based strided gather are disclosed. In an embodiment, an apparatus includes a decoder to decode an instruction, wherein the instruction to include fields for indices of addresses to memory, and a packed data destination register operand; and execution circuitry to execute the decoded instruction to extract data elements of a defined number of types from memory using the indices of the instruction, and for each type, store the extracted data elements in one or more lanes of a packed data destination register dedicated to that type, wherein relative data elements between types are strided data elements apart.
摘要:
An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector routing element circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths.
摘要:
An apparatus is described having instruction execution logic circuitry to execute first, second, third and fourth instruction. Both the first instruction and the second instruction insert a first group of input vector elements to one of multiple first non overlapping sections of respective first and second resultant vectors. The first group has a first bit width. Each of the multiple first non overlapping sections have a same bit width as the first group. Both the third instruction and the fourth instruction insert a second group of input vector elements to one of multiple second non overlapping sections of respective third and fourth resultant vectors. The second group has a second bit width that is larger than said first bit width. Each of the multiple second non overlapping sections have a same bit width as the second group. The apparatus also includes masking layer circuitry to mask the first and third instructions at a first resultant vector granularity, and, mask the second and fourth instructions at a second resultant vector granularity.
摘要:
Embodiments of systems, apparatuses, and methods for performing in a computer processor vector double block packed sum of absolute differences (SAD) in response to a single vector double block packed sum of absolute differences instruction that includes a destination vector register operand, first and second source operands, an immediate, and an opcode are described.
摘要:
Systems, apparatuses, and methods for performing delta decoding on packed data elements of a source and storing the results in packed data elements of a destination using a single packed delta decode instruction are described. A processor may include a decoder to decode an instruction, and execution unit to execute the decoded instruction to calculate for each packed data element position of a source operand, other than a first packed data element position, a value that comprises a packed data element of that packed data element position and all packed data elements of packed data element positions that are of lesser significance, store a first packed data element from the first packed data element position of the source operand into a corresponding first packed data element position of a destination operand, and for each calculated value, store the value into a corresponding packed data element position of the destination operand.
摘要:
Instructions and logic provide vector horizontal majority voting functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read a number of values from data fields of the specified size in the source operand, corresponding to the mask specified by the instruction and store a result value to that number of corresponding data fields in the destination operand, the result value computed from the majority of values read from the number of data fields of the source operand.
摘要:
Embodiments of systems, apparatuses, and methods for performing in a computer processor a data element shuffle and an operation on the shuffled data elements in response to a single data element shuffle and an operation instruction that includes a destination vector register operand, a first and second source vector register operands, an immediate value, and an opcode are described.