Abstract:
Embodiments of systems, apparatuses, and methods for aggregate gather and scatter are disclosed. Some embodiments describe a decoder to decode an instruction that includes fields for an index of memory address locations, an immediate, a starting destination register operand, and an identifier of additional destination registers; and execution circuitry to execute the decoded instruction to gather data elements from memory at the locations indicated by the index and store them in multiple destination registers in sizes dictated by the immediate.
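The gather behavior in this abstract can be illustrated with a small software model. This is only a sketch: the function name `aggregate_gather`, the dictionary used as a register file, and the reading of the immediate as "elements per destination register" are all assumptions made for illustration, not details taken from the abstract.

```python
def aggregate_gather(memory, indices, imm, start_reg, num_regs, regs):
    """Software model of an aggregate gather (hypothetical semantics).

    memory    -- list standing in for addressable memory
    indices   -- the index of memory locations to read
    imm       -- ASSUMPTION: number of elements placed in each register
    start_reg -- number of the first destination register
    num_regs  -- total count of destination registers to fill
    regs      -- dict standing in for the register file
    """
    # Gather every element named by the index.
    gathered = [memory[i] for i in indices]
    # Spread the gathered elements across consecutive destination registers.
    for r in range(num_regs):
        regs[start_reg + r] = gathered[r * imm:(r + 1) * imm]
    return regs


memory = list(range(100, 132))
regs = {}
aggregate_gather(memory, [0, 5, 9, 2, 7, 1, 3, 8],
                 imm=4, start_reg=2, num_regs=2, regs=regs)
print(regs)  # {2: [100, 105, 109, 102], 3: [107, 101, 103, 108]}
```

A single call fills two registers here, which is the point of the "aggregate" form: one instruction replaces several separate gathers.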
Abstract:
An apparatus and method are described for enforcement of reserved bits. For example, one embodiment of a processor comprises: a memory management unit to store a set of bits including a set of reserved bits to a system memory; reserved bit enforcement logic to generate a pseudo-random pattern in the reserved bits and an error correction code over the pseudo-random pattern prior to storing the reserved bits; the memory management unit to load the reserved bits including the pseudo-random pattern and the error correction code; the reserved bit enforcement logic to use the error correction code to determine whether the reserved bits have been modified by software; and if the reserved bits have been modified, then the processor to generate an error condition and if not modified, then the processor to continue normal execution.
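The store-then-verify flow described in this abstract might look roughly like the following sketch. Several things here are assumptions: the 8-bit width of the reserved field, the helper names `seal_reserved` and `reserved_modified`, and the use of a CRC-32 as a stand-in for the error correction code (a CRC only detects changes; it does not correct them).

```python
import secrets
import zlib

RESERVED_BITS = 8  # ASSUMPTION: width of the reserved field


def _check_code(pattern):
    # CRC-32 stands in for the ECC named in the abstract (detection only).
    return zlib.crc32(pattern.to_bytes(1, "little"))


def seal_reserved():
    """Before the store: fill the reserved bits with a random pattern
    and compute a check code over that pattern."""
    pattern = secrets.randbits(RESERVED_BITS)
    return pattern, _check_code(pattern)


def reserved_modified(pattern, code):
    """After the load: report whether the reserved bits no longer match
    the check code, i.e. whether software has altered them."""
    return _check_code(pattern) != code


pattern, code = seal_reserved()
if reserved_modified(pattern, code):
    raise RuntimeError("reserved bits were modified")  # error condition
# otherwise, continue normal execution
```

Because the pattern is random, software cannot rely on the reserved bits holding a known value such as zero, and any write to them shows up as a mismatch on the next load.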
Abstract:
Single Instruction, Multiple Data (SIMD) technologies are described. A processing device can include a processor core and a memory. The processor core can generate a first bitmap comprising a plurality of bits, where the plurality of bits includes a first bit that represents a first memory location. The processor core can determine that the value of the first bit is equal to the value of a second bit in the first bitmap. The processor core can determine the location of the second bit in relation to the first bit in the first bitmap. The processor core can generate a second bitmap including a third bit indicating that the first bit is the last bit in the first bitmap with the same value as the second bit.
Abstract:
A processor includes a front end to receive an instruction to perform a vector-based bit manipulation, a decoder to decode the instruction, and a source vector register to store multiple data elements. The processor also includes an execution unit to execute the instruction with a first logic to apply a bit manipulation to each of the multiple data elements within the source vector register in parallel. In addition, the processor includes a retirement unit to retire the instruction.
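As an illustration of applying one bit manipulation to every element of a source vector, the following sketch uses bit reversal as the example operation. The abstract does not name a specific manipulation, so the choice of bit reversal, the 8-bit element width, and the function names are assumptions. The Python loop is sequential; the hardware described would operate on all elements in parallel.

```python
def reverse_bits(x, width=8):
    """Reverse the bit order of a single element of the given width."""
    return int(format(x, f"0{width}b")[::-1], 2)


def vector_bit_op(src, op, width=8):
    """Apply the same bit manipulation to each element of the source vector."""
    mask = (1 << width) - 1
    return [op(x & mask) & mask for x in src]


result = vector_bit_op([0b00000001, 0b00001111, 0b10100000], reverse_bits)
print(result)  # [128, 240, 5]
```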
Abstract:
Apparatus and methods are disclosed for example computer processors that are based on a hybrid dataflow execution model. Embodiments of the disclosed technology use read instructions to retrieve a value from a specified register in the register file of the processor architecture and send the value for use by one or more targets (e.g., other instructions in the instruction block). The read instruction may be predicated such that the instruction is only executed when a predicate condition is satisfied. In some examples of the disclosed technology, a compiler for such processors performs an analysis of the source and/or object code being compiled in order to determine whether operation(s) along conditional paths can be executed before or concurrently with determination of a condition on which the conditional operation(s) depend, thus improving processor efficiency.
Abstract:
Systems, apparatuses, and methods related to a block-based processor core composition register are disclosed. In one example of the disclosed technology, a processor can include a plurality of block-based processor cores for executing a program including a plurality of instruction blocks. A respective block-based processor core can include one or more sharable resources and a programmable composition control register. The programmable composition control register can be used to configure which resources of the one or more sharable resources are shared with other processor cores of the plurality of processor cores.
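One way to picture a programmable composition control register is as a bit field with one bit per sharable resource. The sketch below assumes that encoding; the resource names (`CACHE`, `REGISTER_FILE`, `EXECUTION_UNIT`) and the `Core` class are hypothetical and are not taken from the abstract.

```python
from enum import IntFlag


class Sharable(IntFlag):
    # Hypothetical sharable resources, one bit each.
    CACHE = 0b001
    REGISTER_FILE = 0b010
    EXECUTION_UNIT = 0b100


class Core:
    """Model of one block-based core with a composition control register."""

    def __init__(self):
        self.composition_control = Sharable(0)  # nothing shared by default

    def program(self, value):
        # Writing the register selects which resources are shared.
        self.composition_control = Sharable(value)

    def is_shared(self, resource):
        return bool(self.composition_control & resource)


core = Core()
core.program(0b101)  # share the cache and the execution unit
print(core.is_shared(Sharable.CACHE))          # True
print(core.is_shared(Sharable.REGISTER_FILE))  # False
```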
Abstract:
Distinct system registers for logical processors are disclosed. In one example of the disclosed technology, a processor includes a plurality of block-based physical processor cores for executing a program comprising a plurality of instruction blocks. The processor also includes a thread scheduler configured to schedule a thread of the program for execution, the thread using one or more of the instruction blocks. The processor further includes at least one system register. The at least one system register stores data indicating a number and placement of the plurality of physical processor cores to form a logical processor. The logical processor executes the scheduled thread. The logical processor is configured to execute the thread in a continuous instruction window.
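A system register that records the number and placement of physical cores could, for example, be decoded as shown below. The encoding is an assumption made for illustration: one bit per physical core, where a set bit means the core belongs to the logical processor. The function name and the eight-core default are likewise hypothetical.

```python
def decode_logical_processor(reg_value, num_physical_cores=8):
    """Decode a system register into (number, placement) of member cores.

    ASSUMPTION: bit c of reg_value is set when physical core c is part
    of the logical processor.
    """
    placement = [c for c in range(num_physical_cores)
                 if (reg_value >> c) & 1]
    return len(placement), placement


count, cores = decode_logical_processor(0b00110110)
print(count, cores)  # 4 [1, 2, 4, 5]
```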
Abstract:
Apparatus and methods are disclosed for dynamic nullification of memory access instructions, such as memory store instructions. In some examples of the disclosed technology, an apparatus can include memory and one or more block-based processor cores. One of the cores can include an execution unit configured to execute memory access instructions comprising a plurality of memory load and/or memory store instructions contained in an instruction block. The core can also include a hardware structure storing data for at least one predicate instruction in the instruction block, the data identifying whether one or more of the memory store instructions will issue if a condition of the predicate instruction is satisfied. The core may further include a control unit configured to control issuing of the memory access instructions to the execution unit based at least in part on the hardware structure data.
Abstract:
Apparatus and methods are disclosed for example computer processors that are based on a hybrid dataflow execution model. In particular embodiments, a processor core in a block-based processor comprises: one or more functional units configured to perform functions using one or more operands; an instruction window comprising buffers configured to store individual instructions for execution by the processor core, the instruction window including one or more operand buffers for an individual instruction configured to store operand values; a control unit configured to execute the instructions in the instruction window and control operation of the one or more functional units; and a broadcast value store comprising a plurality of buffers dedicated to storing broadcast values, each buffer of the broadcast value store being associated with a respective broadcast channel from among a plurality of available broadcast channels.