摘要:
According to a first aspect, efficient data transfer operations can be achieved by: decoding by a processor device, a single instruction specifying a transfer operation for a plurality of data elements between a first storage location and a second storage location; issuing the single instruction for execution by an execution unit in the processor; detecting an occurrence of an exception during execution of the single instruction; and in response to the exception, delivering pending traps or interrupts to an exception handler prior to delivering the exception.
摘要:
According to a first aspect, efficient data transfer operations can be achieved by: decoding by a processor device, a single instruction specifying a transfer operation for a plurality of data elements between a first storage location and a second storage location; issuing the single instruction for execution by an execution unit in the processor; detecting an occurrence of an exception during execution of the single instruction; and in response to the exception, delivering pending traps or interrupts to an exception handler prior to delivering the exception.
摘要:
In one embodiment, a processor includes an instruction decoder to receive a first instruction having a prefix and an opcode and to generate, by an instruction decoder of the processor, a second instruction executable based on a condition determined based on the prefix, and an execution unit to conditionally execute the second instruction based on the condition determined based on the prefix.
摘要:
Embodiments of systems, apparatuses, and method for getting even or odd data elements are described. For example, in some embodiments, an apparatus includes a decoder to decode an instruction, wherein the instruction to include fields for a first source operand, a second source operand, and a destination operand; and execution circuitry to execute the decoded instruction to extract data elements from even data element positions of the first and second source operands and store the extracted data elements into the destination operand.
摘要:
An apparatus and method are described for providing low-latency invocation of accelerators. For example, a processor according to one embodiment comprises: a command register for storing command data identifying a command to be executed; a result register to store a result of the command or data indicating a reason why the commend could not be executed; execution logic to execute a plurality of instructions including an accelerator invocation instruction to invoke one or more accelerator commands, the accelerator invocation instruction to store command data specifying the command within the command register; one or more accelerators to read the command data from the command register and responsively attempt to execute the command identified by the command data, wherein if the one or more accelerators successfully execute the command, the one or more accelerators are to store result data comprising the results of the command in the result register; and if the one or more accelerators cannot successfully execute the command, the one or more accelerators are to store result data indicating a reason why the command cannot be executed, wherein the execution logic is to temporarily halt execution until the accelerator completes execution or is interrupted, wherein the accelerator includes logic to store its state if interrupted so that it can continue execution at a later time.
摘要:
An apparatus and method are described for providing low-latency invocation of accelerators. For example, a processor according to one embodiment comprises: a command register for storing command data identifying a command to be executed; a result register to store a result of the command or data indicating a reason why the commend could not be executed; execution logic to execute a plurality of instructions including an accelerator invocation instruction to invoke one or more accelerator commands, the accelerator invocation instruction to store command data specifying the command within the command register; one or more accelerators to read the command data from the command register and responsively attempt to execute the command identified by the command data, wherein if the one or more accelerators successfully execute the command, the one or more accelerators are to store result data comprising the results of the command in the result register; and if the one or more accelerators cannot successfully execute the command, the one or more accelerators are to store result data indicating a reason why the command cannot be executed, wherein the execution logic is to temporarily halt execution until the accelerator completes execution or is interrupted, wherein the accelerator includes logic to store its state if interrupted so that it can continue execution at a later time.
摘要:
A method of an aspect includes receiving a floating point scaling instruction. The floating point scaling instruction indicates a first source including one or more floating point data elements, a second source including one or more corresponding floating point data elements, and a destination. A result is stored in the destination in response to the floating point scaling instruction. The result includes one or more corresponding result floating point data elements each including a corresponding floating point data element of the second source multiplied by a base of the one or more floating point data elements of the first source raised to a power of an integer representative of the corresponding floating point data element of the first source. Other methods, apparatus, systems, and instructions are disclosed.
摘要:
An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second data instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.
摘要:
An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The first data structure is four times as large as the second data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second instruction to create a second replication data structure.
摘要:
A mask generating instruction is executed by a processor to improve efficiency of vector operations on an array of data elements. The processor includes vector registers, one of which stores data elements of an array. The processor further includes execution circuitry to receive a mask generating instruction that specifies at least a first operand and a second operand. Responsive to the mask generating instruction, the execution circuitry is to shift bits of the first operand to the left by a number of times defined in the second operand, and pull in a bit of one from the right each time a most significant bit of the first operand is shifted out from the left to generate a result. Each bit in the result corresponds to one of the data elements of the array.