Abstract:
Disclosed embodiments relate to instructions for dual-destination type conversion, accumulation, and atomic memory operations. In one example, a system includes a memory and a processor, the processor including: a fetch circuit to fetch an instruction from a code storage, the instruction including an opcode, a first destination identifier, and a source identifier to specify a source vector register, the source vector register including a plurality of single precision floating point data elements; a decode circuit to decode the fetched instruction; and an execution circuit to execute the decoded instruction to: convert the elements of the source vector register into double precision floating point values, store a first half of the double precision floating point values to a first location identified by the first destination identifier, and store a second half of the double precision floating point values to a second location.
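As an illustrative sketch only (not the claimed encoding or microarchitecture), the dual-destination conversion described above can be modeled in C++ as follows; the lane count, function name, and use of two register-sized destinations are assumptions:

// Minimal sketch of the dual-destination single-to-double conversion.
// kLanes and the two destination arrays are illustrative assumptions.
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // e.g. a source register holding 8 floats

// src: source vector register of single precision elements.
// dst_lo / dst_hi: the first and second destination locations.
void convert_dual_dest(const std::array<float, kLanes>& src,
                       std::array<double, kLanes / 2>& dst_lo,
                       std::array<double, kLanes / 2>& dst_hi) {
    for (std::size_t i = 0; i < kLanes / 2; ++i) {
        dst_lo[i] = static_cast<double>(src[i]);               // first half of converted values
        dst_hi[i] = static_cast<double>(src[i + kLanes / 2]);  // second half of converted values
    }
}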
Abstract:
An apparatus and method for loop flattening and reduction in a SIMD pipeline including broadcast, move, and reduction instructions. For example, one embodiment of a processor comprises: a decoder to decode a broadcast instruction to generate a decoded broadcast instruction identifying a plurality of operations, the broadcast instruction including an opcode, first and second source operands, and at least one destination operand, the broadcast instruction having a split value associated therewith; a first source register associated with the first source operand to store a first plurality of packed data elements; a second source register associated with the second source operand to store a second plurality of packed data elements; execution circuitry to execute the operations of the decoded broadcast instruction, the execution circuitry to copy a first number of contiguous data elements from the first source register to a first set of contiguous data element locations in a destination register specified by the destination operand, the execution circuitry to further copy a second number of contiguous data elements from the second source register to a second set of contiguous data element locations in the destination register, wherein the execution circuitry is to determine the first number and the second number in accordance with the split value associated with the broadcast instruction.
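A hedged C++ sketch of the split-controlled copy described above follows; the register width and the interpretation of the split value as "number of elements taken from the first source" are assumptions, not the claimed semantics:

// Sketch of the split move: 'split' contiguous elements come from src1,
// the remaining lanes come from src2. Width and names are assumptions.
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 16;

void split_move(const std::array<int, kLanes>& src1,
                const std::array<int, kLanes>& src2,
                std::size_t split,
                std::array<int, kLanes>& dst) {
    for (std::size_t i = 0; i < split; ++i)
        dst[i] = src1[i];          // first number of contiguous data elements
    for (std::size_t i = split; i < kLanes; ++i)
        dst[i] = src2[i - split];  // second number of contiguous data elements
}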
Abstract:
An apparatus and method are described for floating point operation (FLOP) accounting. For example, one embodiment of a processor comprises: an instruction fetch unit to fetch instructions from system memory, the instructions including at least one masked vector floating point instruction to perform operations on a plurality of floating point data elements; a mask register to store a mask value associated with the masked vector floating point instruction; a decoder to decode the masked vector floating point instruction; and floating point operations (FLOP) accounting circuitry to read the mask register to determine a number of floating point operations to be performed during execution of the masked vector floating point instruction.
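One way to read the mask-based accounting described above is that the FLOP count credited to a masked vector instruction equals the number of active (unmasked) lanes times the operations per lane. The sketch below assumes that reading and uses illustrative names; it is not the claimed hardware interface:

// Hedged sketch of mask-based FLOP accounting.
#include <bitset>
#include <cstdint>

// mask: per-lane write mask read from the mask register.
// flops_per_lane: e.g. 2 for a fused multiply-add.
uint64_t flops_for_masked_op(uint16_t mask, unsigned flops_per_lane) {
    // Only unmasked lanes contribute floating point operations.
    return static_cast<uint64_t>(std::bitset<16>(mask).count()) * flops_per_lane;
}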
Abstract:
A processor includes a front end, a cache, and a cache controller. The front end includes logic to receive an instruction defining a priority dataset. The priority dataset includes ranges of memory addresses each corresponding to a respective priority level. The cache controller includes logic to detect a miss in the cache for a requested cache value, determine a candidate cache victim from the cache, determine a priority of the requested cache value and the candidate cache victim according to the priority dataset, and evict the candidate cache victim based on a determination that the priority of the candidate cache victim is less than or equal to the priority of the requested cache value.
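A minimal C++ sketch of this eviction decision, assuming a simple list-of-ranges representation for the priority dataset (the data structures and default priority are illustrative assumptions):

// Sketch of priority-based eviction: evict the victim only if its priority
// does not exceed that of the requested value.
#include <cstdint>
#include <vector>

struct PriorityRange { uint64_t lo, hi; int priority; };

int priority_of(uint64_t addr, const std::vector<PriorityRange>& dataset) {
    for (const auto& r : dataset)
        if (addr >= r.lo && addr < r.hi) return r.priority;
    return 0;  // assumed default priority for addresses outside all ranges
}

// Returns true if the candidate victim may be evicted for the requested value.
bool may_evict(uint64_t requested_addr, uint64_t victim_addr,
               const std::vector<PriorityRange>& dataset) {
    return priority_of(victim_addr, dataset) <= priority_of(requested_addr, dataset);
}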
Abstract:
An embodiment of a semiconductor package apparatus may include technology to identify a nested loop in a set of executable instructions, and determine at runtime if the nested loop is a candidate for cache blocking. Other embodiments are disclosed and claimed.
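For context, the kind of transformation such a nested-loop candidate would receive is cache blocking (loop tiling). The C++ sketch below is a generic illustration of that transformation, not the disclosed runtime mechanism; the loop body and tile size are assumptions:

// Illustrative cache-blocked nested loop: the tile size is an assumption
// chosen so each tile's working set fits in cache.
constexpr int N = 1024;
constexpr int TILE = 64;

void blocked_transpose(const float (&a)[N][N], float (&b)[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; ++i)
                for (int j = jj; j < jj + TILE; ++j)
                    b[j][i] = a[i][j];  // tiled access pattern improves cache reuse
}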
Abstract:
An apparatus and method for loop flattening and reduction in a SIMD pipeline including broadcast, move, and reduction instructions. One embodiment of a processor comprises: a decoder to decode a broadcast instruction to generate a decoded broadcast instruction identifying a plurality of operations, the broadcast instruction including an opcode and first and second source operands, and having a split value associated therewith; and execution circuitry to execute the operations of the decoded broadcast instruction to copy a first data element specified by the first source operand to each of a first set of contiguous data element locations in a destination register and to copy a second data element specified by the second source operand to a second set of contiguous data element locations in the destination register, wherein the first and second sets of contiguous data element locations are determined in accordance with the split value.
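A hedged C++ sketch of this split broadcast follows: one scalar is replicated into the low lanes and another into the remaining lanes, with the boundary set by the split value. The register width, names, and split interpretation are assumptions:

// Sketch of the split broadcast: first_elem fills the first set of contiguous
// locations, second_elem fills the second set.
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 16;

void split_broadcast(int first_elem, int second_elem, std::size_t split,
                     std::array<int, kLanes>& dst) {
    for (std::size_t i = 0; i < split; ++i)
        dst[i] = first_elem;   // first set of contiguous data element locations
    for (std::size_t i = split; i < kLanes; ++i)
        dst[i] = second_elem;  // second set of contiguous data element locations
}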