Abstract:
An apparatus is described having instruction execution logic circuitry to execute first, second, third and fourth instructions. Both the first instruction and the second instruction insert a first group of input vector elements into one of multiple first non-overlapping sections of respective first and second resultant vectors. The first group has a first bit width. Each of the multiple first non-overlapping sections has the same bit width as the first group. Both the third instruction and the fourth instruction insert a second group of input vector elements into one of multiple second non-overlapping sections of respective third and fourth resultant vectors. The second group has a second bit width that is larger than the first bit width. Each of the multiple second non-overlapping sections has the same bit width as the second group. The apparatus also includes masking layer circuitry to mask the first and third instructions at a first resultant vector granularity, and to mask the second and fourth instructions at a second resultant vector granularity.
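The insert-with-masking behavior described above can be sketched in Python. This is only an illustrative model, not the claimed circuitry: the function name `insert_group`, the representation of a vector as a list of 32-bit lanes, the merge-style masking (unselected lanes keep their old value), and the `gran` parameter are all assumptions made for the example.

```python
def insert_group(dest, group, section, mask, gran=1):
    """Insert `group` into non-overlapping section `section` of `dest`.

    dest, group: lists of 32-bit lanes (assumed representation).
    mask: integer write mask, one bit per `gran` lanes of the resultant vector.
    gran: lanes per mask bit (1 -> 32-bit granularity, 2 -> 64-bit granularity).
    Lanes whose mask bit is 0 keep their old value (merge masking, assumed).
    """
    width = len(group)
    if len(dest) % width or not 0 <= section < len(dest) // width:
        raise ValueError("section does not fit the resultant vector")
    # Unmasked result: a copy of dest with the chosen section overwritten.
    full = list(dest)
    full[section * width:(section + 1) * width] = group
    # Apply the write mask at the requested granularity.
    return [full[i] if (mask >> (i // gran)) & 1 else dest[i]
            for i in range(len(dest))]
```

Under these assumptions, calling the function with the same four-lane group but `gran=1` or `gran=2` corresponds to two instructions that share a group width and differ only in mask granularity; a wider group (for example eight lanes) models the second pair of instructions.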
Abstract:
An apparatus and method for performing a vector bit shuffle. For example, one embodiment of a processor comprises: a first vector register to store a plurality of source data elements; a second vector register to store a plurality of control elements, each of the control elements comprising a plurality of bit fields, each bit field to be associated with a corresponding bit position in a destination mask register and to identify a bit from each of the source data elements to be copied to each of the particular bit positions; and vector bit shuffle logic to read each bit field from the second vector register to identify a bit from each of the source data elements and to responsively copy the bit from each of the source data elements to each of the corresponding bit positions in the destination mask register.
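A minimal Python sketch of the bit shuffle described above follows. The element width (64 bits), the field width (8 bits, of which only the low bits needed to index one element are used), the pairing of control element i with source element i, and the function name `vector_bit_shuffle` are assumptions for illustration; the abstract does not fix them.

```python
def vector_bit_shuffle(src, ctrl, elem_bits=64, field_bits=8):
    """Return a mask integer built by selecting one source bit per control field.

    src:  list of source data elements (ints, elem_bits wide).
    ctrl: list of control elements; each holds elem_bits // field_bits fields.
    Field j of control element i selects a bit of src[i] (assumed pairing) and
    that bit is written to mask bit i * fields + j.
    """
    fields = elem_bits // field_bits
    mask = 0
    for i, (data, control) in enumerate(zip(src, ctrl)):
        for j in range(fields):
            field = (control >> (j * field_bits)) & ((1 << field_bits) - 1)
            index = field % elem_bits          # keep only the usable index bits
            bit = (data >> index) & 1
            mask |= bit << (i * fields + j)
    return mask
```

With one 64-bit source element and eight fields per control element, each source element contributes eight bits to the destination mask.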
Abstract:
An apparatus and method for performing a vector bit gather. For example, one embodiment of a processor comprises: a first vector register to store one or more source data elements; a second vector register to store one or more control elements, each of the control elements comprising a plurality of bit fields, each bit field to be associated with a corresponding bit position in a destination vector register and to identify a bit from the one or more source data elements to be copied to each of the particular bit positions; and vector bit gather logic to read each bit field from the second vector register to identify a bit from the one or more source data elements and to responsively copy the bit from each of the one or more source data elements to each of the corresponding bit positions in the destination vector register.
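The bit gather differs from a shuffle into a mask register in that the selected bits land in a destination vector register. The following Python sketch shows one possible reading; the 64-bit elements, 8-bit fields, the per-element pairing of control and source, the placement of gathered bits in the low bit positions of each destination element, and the name `vector_bit_gather` are all assumptions.

```python
def vector_bit_gather(src, ctrl, elem_bits=64, field_bits=8):
    """Gather selected source bits into destination vector elements.

    Field j of control element i picks a bit of src[i] (assumed pairing);
    the picked bit becomes bit j of destination element i.
    """
    fields = elem_bits // field_bits
    dest = []
    for data, control in zip(src, ctrl):
        out = 0
        for j in range(fields):
            field = (control >> (j * field_bits)) & ((1 << field_bits) - 1)
            out |= ((data >> (field % elem_bits)) & 1) << j
        dest.append(out)
    return dest
```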
Abstract:
In one embodiment, a processing device implements a set of instructions to perform an inverse centrifuge operation using vector or general purpose registers. The inverse centrifuge operation interleaves bits from opposite regions of a source and writes the interleaved bits to a destination. The instructions use a control mask: each destination bit whose mask bit is one is obtained from one side of the source register or vector element, and each destination bit whose mask bit is zero is obtained from the opposing side.
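The interleaving can be illustrated with a short Python sketch. Which side of the source feeds the mask-one positions is not stated in the abstract; here it is assumed that the low-order region supplies bits for mask value one and the remaining high-order region supplies bits for mask value zero. The name `inverse_centrifuge` and the `width` parameter are likewise illustrative.

```python
def inverse_centrifuge(src, mask, width=8):
    """Interleave bits from the two regions of `src` under control of `mask`.

    Assumed layout: the low popcount(mask) bits of src feed positions where
    the mask bit is 1; the remaining high bits feed positions where it is 0.
    """
    mask &= (1 << width) - 1
    ones = bin(mask).count("1")
    lo_next = 0        # next bit to take from the low region (mask == 1)
    hi_next = ones     # next bit to take from the high region (mask == 0)
    dst = 0
    for i in range(width):
        if (mask >> i) & 1:
            bit = (src >> lo_next) & 1
            lo_next += 1
        else:
            bit = (src >> hi_next) & 1
            hi_next += 1
        dst |= bit << i
    return dst
```

For example, with mask `0b10101010` the source `0b00001111` has all ones in the low region, so every mask-one position in the destination is set, giving `0b10101010`.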
Abstract:
An apparatus and method are described for performing conflict detection operations. For example, one embodiment of a processor comprises: a first source vector register to store a first set of data elements; a second source vector register to store a second set of data elements; conflict detection logic to perform a specified comparison operation comparing each of the first set of data elements with specified data elements from the second set and generating a set of comparison results, the comparison operation to be selected from a group consisting of a greater than comparison, a less than comparison, a greater than or equal to comparison, a less than or equal to comparison, and a not equal to comparison.
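A compact Python sketch of conflict detection with a selectable comparison follows. The abstract says only that each element of the first set is compared with "specified" elements of the second set; the choice here to compare element i against the preceding elements j < i, the per-element bitmask result, and the name `conflict_detect` are assumptions for illustration.

```python
import operator

# Comparison operations named in the abstract, keyed by short labels (labels assumed).
COMPARISONS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "ge": operator.ge,
    "le": operator.le,
    "ne": operator.ne,
}

def conflict_detect(first, second, op):
    """For each first[i], set result bit j when COMPARISONS[op](first[i], second[j]).

    Only elements j < i of the second set are examined (assumed selection).
    """
    compare = COMPARISONS[op]
    return [sum(1 << j for j in range(i) if compare(a, second[j]))
            for i, a in enumerate(first)]
```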
Abstract:
An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and to replicate the second data structure when executing the second instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.
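The replicate-then-mask behavior can be modeled in a few lines of Python. In this sketch a packed data structure is a list of data values and the mask carries one bit per data value, so the mask granularity follows the value size. The function name `replicate_masked`, the `prior` argument, and the zeroing default for masked-off values are assumptions.

```python
def replicate_masked(structure, copies, mask, prior=None):
    """Replicate a packed structure `copies` times, then apply a per-value mask.

    mask:  one bit per data value of the replicated result.
    prior: values kept where the mask bit is 0; defaults to zeros (assumed).
    """
    replicated = list(structure) * copies
    if prior is None:
        prior = [0] * len(replicated)
    return [value if (mask >> i) & 1 else old
            for i, (value, old) in enumerate(zip(replicated, prior))]
```

Replicating a pair of 64-bit values twice and a pair of 32-bit values four times both fill 256 bits, but the first result takes a 4-bit mask and the second an 8-bit mask, which is the "twice as fine" relationship described above.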
Abstract:
Methods and apparatus relating to scalar core integration in a graphics processor. In an example, an apparatus comprises a processor to receive a set of workload instructions for a graphics workload from a host complex, determine a first subset of operations in the set of operations that is suitable for execution by a scalar processor complex of the graphics processing device and a second subset of operations in the set of operations that is suitable for execution by a vector processor complex of the graphics processing device, assign the first subset of operations to the scalar processor complex for execution to generate a first set of outputs, and assign the second subset of operations to the vector processor complex for execution to generate a second set of outputs. Other embodiments are also disclosed and claimed.
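The split-and-assign flow can be sketched as follows. The abstract does not say how suitability is decided, so the sketch takes the decision as a caller-supplied predicate; the names `partition_workload` and `dispatch`, the treatment of the two subsets as a strict partition, and the executor callables are all hypothetical.

```python
def partition_workload(operations, is_scalar_suitable):
    """Split operations into a scalar subset and a vector subset.

    `is_scalar_suitable` is a caller-supplied predicate (hypothetical); the
    abstract does not define the suitability criterion.
    """
    scalar_ops = [op for op in operations if is_scalar_suitable(op)]
    vector_ops = [op for op in operations if not is_scalar_suitable(op)]
    return scalar_ops, vector_ops

def dispatch(operations, is_scalar_suitable, scalar_exec, vector_exec):
    """Run each subset on its complex and return the two sets of outputs."""
    scalar_ops, vector_ops = partition_workload(operations, is_scalar_suitable)
    first_outputs = [scalar_exec(op) for op in scalar_ops]
    second_outputs = [vector_exec(op) for op in vector_ops]
    return first_outputs, second_outputs
```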
Abstract:
Systems, methods, and apparatuses relating to one or more instructions for loading a tile of a matrix operations accelerator are described. In one embodiment, a system includes a matrix operations accelerator circuit comprising a two-dimensional grid of processing elements, a plurality of registers that represents a two-dimensional matrix coupled to the two-dimensional grid of processing elements, and a coupling to a cache; and a hardware processor core coupled to the matrix operations accelerator circuit and comprising a vector register, a decoder circuit to decode a single instruction into a decoded instruction, the single instruction including a first field that identifies the two-dimensional matrix, a second field that identifies a location in the cache, and a third field that identifies the vector register, and an opcode that indicates an execution circuit of the hardware processor core is to load elements into the plurality of registers that represents the two-dimensional matrix from the location in the cache by the coupling to the cache, and load one or more elements from the vector register into the plurality of registers that represents the two-dimensional matrix by a coupling of the hardware processor core to the matrix operations accelerator circuit that is separate from the coupling to the cache, and the execution circuit of the hardware processor core to execute the decoded instruction according to the opcode.
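As a rough software model of the tile load, the Python sketch below fills a two-dimensional tile from a flat memory list (standing in for the cache location) and takes one row from a vector register instead. The row-major layout, the choice that the vector register supplies exactly one whole row, and the names `load_tile` and `vec_row` are assumptions; the abstract states only that some elements come from the cache and one or more from the vector register over separate couplings.

```python
def load_tile(memory, base, rows, cols, vec, vec_row):
    """Build a rows x cols tile from `memory`, with row `vec_row` taken from `vec`.

    memory:  flat list modeling the cache contents (row-major, assumed).
    base:    index in `memory` of the tile's first element.
    vec:     list modeling the vector register; needs at least `cols` elements.
    """
    if len(vec) < cols:
        raise ValueError("vector register holds fewer elements than a tile row")
    tile = []
    for r in range(rows):
        if r == vec_row:
            tile.append(list(vec[:cols]))          # path from the core's register
        else:
            start = base + r * cols
            tile.append(list(memory[start:start + cols]))  # path from the cache
    return tile
```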
Abstract:
Systems, methods, and apparatuses relating to one or more instructions for row or column aligning of a tile of a matrix operations accelerator are described. In one embodiment, a system includes a matrix operations accelerator circuit comprising a two-dimensional grid of processing elements, a first plurality of registers that represents a first two-dimensional matrix coupled to the two-dimensional grid of processing elements, and a second plurality of registers that represents a second two-dimensional matrix coupled to the two-dimensional grid of processing elements; and a hardware processor core coupled to the matrix operations accelerator circuit and comprising a decoder circuit to decode a single instruction into a decoded instruction, the single instruction including a first field that identifies the first two-dimensional matrix, a second field that identifies the second two-dimensional matrix, and an opcode that indicates an execution circuit of the hardware processor core is to cause a third two-dimensional matrix to be logically formed for input into the two-dimensional grid of processing elements from the first two-dimensional matrix and the second two-dimensional matrix without moving data elements within the first plurality of registers and the second plurality of registers, and the execution circuit of the hardware processor core to execute the decoded instruction according to the opcode.
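The idea of forming a third matrix logically, without moving data in the underlying registers, resembles an index-mapping view in software. The Python sketch below assumes the logical matrix is the two tiles placed one after the other along rows or columns; that interpretation, the class name `LogicalTile`, and the `axis` parameter are illustrative assumptions rather than details from the abstract.

```python
class LogicalTile:
    """Read-only view over two tiles; no element is copied or moved.

    axis = 0 places `second` below `first` (row direction);
    axis = 1 places `second` to the right of `first` (column direction).
    """

    def __init__(self, first, second, axis):
        if axis not in (0, 1):
            raise ValueError("axis must be 0 or 1")
        self.first, self.second, self.axis = first, second, axis

    def __getitem__(self, pos):
        r, c = pos
        if self.axis == 0:
            n = len(self.first)
            return self.first[r][c] if r < n else self.second[r - n][c]
        n = len(self.first[0])
        return self.first[r][c] if c < n else self.second[r][c - n]
```

Because the view only remaps indices, a later write to either underlying tile is visible through the view, which is the software analogue of not moving data elements within the registers.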
Abstract:
An apparatus and method for performing dual concurrent multiplications of packed data elements. For example, one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed doubleword data elements; a second source register to store a second plurality of packed doubleword data elements; and execution circuitry to execute the decoded instruction, the execution circuitry comprising: multiplier circuitry to multiply a first doubleword data element from the first source register with a second doubleword data element from the second source register to generate a first quadword product and to concurrently multiply a third doubleword data element from the first source register with a fourth doubleword data element from the second source register to generate a second quadword product; and a destination register to store the first quadword product and the second quadword product as first and second packed quadword data elements.
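The arithmetic can be illustrated with a small Python sketch. Which doubleword positions are multiplied and whether the operands are signed are not stated in the abstract; this example assumes unsigned operands taken from positions 0 and 2 of each four-element source, and the name `dual_multiply` is hypothetical. Python evaluates the two products sequentially, so the sketch shows the results only, not the concurrency.

```python
DWORD_MASK = 0xFFFFFFFF            # 32-bit doubleword
QWORD_MASK = 0xFFFFFFFFFFFFFFFF    # 64-bit quadword

def dual_multiply(src1, src2, positions=(0, 2)):
    """Multiply two pairs of doublewords into two packed quadword products.

    src1, src2: lists of packed doubleword elements (treated as unsigned).
    positions:  element indices multiplied pairwise (assumed to be 0 and 2).
    """
    products = []
    for p in positions:
        a = src1[p] & DWORD_MASK
        b = src2[p] & DWORD_MASK
        products.append((a * b) & QWORD_MASK)  # a 32x32 product fits in 64 bits
    return products
```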