Abstract:
Embodiments of systems, apparatuses, and methods for performing an align instruction in a computer processor are described. In some embodiments, the execution of an align instruction causes the selective storage of data elements of two concatenated sources to be stored in a destination.
Abstract:
Embodiments of systems, apparatuses, and methods for performing an align instruction in a computer processor are described. In some embodiments, the execution of an align instruction causes the selective storage of data elements of two concatenated sources to be stored in a destination.
Abstract:
Embodiments of systems, apparatuses, and methods for performing a jump instruction in a computer processor are described. In some embodiments, the execution of a blend instruction causes a conditional jump to an address of a target instruction when all of bits of a writemask are zero, wherein the address of the target instruction is calculated using an instruction pointer of the instruction and the relative offset.
Abstract:
A cache-coherent network interface includes registers or buffers addressable by a processor with reference to an address space of the processor. The processor and the cache-coherent network interface both share a common system bus. The registers or buffers are further cacheable into a cache of the processor with reference to the address space.
Abstract:
A method is described that includes fetching an instruction and decoding the instruction. The method further includes fetching a first mask vector from a first mask register space location identified by the instruction. The method further includes fetching a second mask vector from a second mask register space location identified by the instruction. The method also includes executing the instruction by merging the first and second mask vectors into a single data structure and causing the single data structure to be written into a memory location identified by the instruction.
Abstract:
Embodiments of systems, apparatuses, and methods for performing a blend instruction in a computer processor are described. In some embodiments, the execution of a blend instruction causes a data element-by-element selection of data elements of first and second source operands using the corresponding bit positions of a writemask as a selector between the first and second operands and storage of the selected data elements into the destination at the corresponding position in the destination.
Abstract:
A vector friendly instruction format and execution thereof. According to one embodiment of the invention, a processor is configured to execute an instruction set. The instruction set includes a vector friendly instruction format. The vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field, wherein the first instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field, and wherein only one of the different values may be placed in each of the base operation field, the modifier field, the alpha field, the beta field, and the data element width field on each occurrence of an instruction in the first instruction format in instruction streams.
Abstract:
An apparatus includes a decode unit to decode a permute instruction and a vector conflict instruction. A vector execution unit is coupled with the decode unit and includes a fully-connected interconnect. The fully-connected interconnect has at least four inputs to receive at least four corresponding data elements of at least one source vector. The fully-connected interconnect has at least four outputs. Each of the at least four inputs is coupled with each of the at least four outputs. The execution unit also includes a permute instruction execution logic coupled with the at least four outputs and operable to store a first vector result in response to the permute instruction. The execution unit also includes a vector conflict instruction execution logic coupled with the at least four outputs and operable to store a second vector result in a destination storage location in response to the vector conflict instruction.