Abstract:
A data memory unit having a load/store unit and a data cache is provided which allows store instructions that are part of a load-op-store instruction to be executed with one access to a data cache. The load/store unit is configured with a load/store buffer having a checked bit and a way field for each buffer storage location. For load-op-store instructions, the checked bit associated with the store portion of the of the instruction is set when the load portion of the instruction accesses and hits the data cache. Also, the way field associated with the store portion is set to the way of the data cache in which the load portion hits. The data cache is configured with a locking mechanism for each cache line stored in the data cache. When the load portion of a load-op-store instruction is executed, the associated line is locked such that the line will remain in the data cache until a store instruction executes. In this way, the store portion of the load-op-store instruction is guaranteed to hit the data cache. The store may then store its data into the data cache without first performing a read cycle to determine if the store address hits the data cache.
Abstract:
A processor may efficiently implement register renaming and checkpoint repair even in instruction set architectures with large numbers of wide (bit-width) registers by (i) renaming all destination operand register targets, (ii) implementing free list and architectural-to-physical mapping table as a combined array storage with unitary (or common) read, write and checkpoint pointer indexing and (iiii) storing checkpoints as snapshots of the mapping table, rather than of actual register contents. In this way, uniformity (and timing simplicity) of the decode pipeline may be accentuated and architectural-to-physical mappings (or allocable mappings) may be efficiently shuttled between free-list, reorder buffer and mapping table stores in correspondence with instruction dispatch and completion as well as checkpoint creation, retirement and restoration.
Abstract:
A data processing device maintains register map information that maps accesses to architectural registers, as identified by instructions being executed, to physical registers of the data processing device. In response to determining that an instruction, such as a speculatively-executing conditional branch, indicates a checkpoint, the data processing device stores the register map information for subsequent retrieval depending on the resolution of the instruction. In addition, in response to the checkpoint indication the data processing device generates new register map information such that accesses to the architectural registers are mapped to different physical registers. The data processing device maintains a list, referred to as a free register list, of physical registers available to be mapped to an architectural registers.
Abstract:
A method includes, in one implementation, receiving a first set of instructions of a first thread, receiving a second set of instructions of a second thread, and allocating queues to the instructions from the first and second sets. During a time when the first and second threads are simultaneously being processed, changeable number of queues to can be allocated to the first thread based on factors such as the first and/or second thread's requirements or priorities, while maintaining a minimum specified number of queues that are allocated to the first and/or second thread. When needed, one thread may be stalled so that at least the minimum number of queues remains reserved for another thread while attempting to satisfy thread-priority requests or queue-requirement requests.
Abstract:
A processor may efficiently implement register renaming and checkpoint repair even in instruction set architectures with large numbers of wide (bit-width) registers by (i) renaming all destination operand register targets, (ii) implementing free list and architectural-to-physical mapping table as a combined array storage with unitary (or common) read, write and checkpoint pointer indexing and (iiii) storing checkpoints as snapshots of the mapping table, rather than of actual register contents. In this way, uniformity (and timing simplicity) of the decode pipeline may be accentuated and architectural-to-physical mappings (or allocable mappings) may be efficiently shuttled between free-list, reorder buffer and mapping table stores in correspondence with instruction dispatch and completion as well as checkpoint creation, retirement and restoration.
Abstract:
In some embodiments, a data processing system includes a processing unit, a first load/store unit LSU and a second LSU configured to operate independently of the first LSU in single and multi-thread modes. A first store buffer is coupled to the first and second LSUs, and a second store buffer is coupled to the first and second LSUs. The first store buffer is used to execute a first thread in multi-thread mode. The second store buffer is used to execute a second thread in multi-thread mode. The first and second store buffers are used when executing a single thread in single thread mode.
Abstract:
Systems and methods are disclosed for a computer system that includes a first load/store execution unit 210a, a first Level 1 L1 data cache unit 216a coupled to the first load/store execution unit, a second load/store execution unit 210b, and a second L1 data cache unit 216b coupled to the second load/store execution unit. Some instructions are directed to the first load/store execution unit and other instructions are directed to the second load/store execution unit when executing a single thread of instructions.
Abstract:
A branch prediction unit stores a set of branch selectors corresponding to each of a group of contiguous instruction bytes stored in an instruction cache. Each branch selector identifies the branch prediction to be selected if a fetch address corresponding to that branch selector is presented. In order to minimize the number of branch selectors stored for a group of contiguous instruction bytes, the group is divided into multiple byte ranges. The largest byte range may include a number of bytes comprising the shortest branch instruction in the instruction set (exclusive of the return instruction). For example, the shortest branch instruction may be two bytes in one embodiment. Therefore, the largest byte range is two bytes in the example. Since the branch selectors as a group change value (i.e. indicate a different branch instruction) only at the end byte of a predicted-taken branch instruction, fewer branch selectors may be stored than the number of bytes within the group.
Abstract:
A reorder buffer is configured into multiple lines of storage, wherein a line of storage includes sufficient storage for instruction results regarding a predefined maximum number of concurrently dispatchable instructions. A line of storage is allocated whenever one or more instructions are dispatched. A microprocessor employing the reorder buffer is also configured with fixed, symmetrical issue positions. The symmetrical nature of the issue positions may increase the average number of instructions to be concurrently dispatched and executed by the microprocessor. The average number of unused locations within the line decreases as the average number of concurrently dispatched instructions increases. One particular implementation of the reorder buffer includes a future file. The future file comprises a storage location corresponding to each register within the microprocessor. The reorder buffer tag (or instruction result, if the instruction has executed) of the last instruction in program order to update the register is stored in the future file. The reorder buffer provides the value (either reorder buffer tag or instruction result) stored in the storage location corresponding to a register when the register is used as a source operand for another instruction. Another advantage of the future file for microprocessors which allow access and update to portions of registers is that narrow-to-wide dependencies are resolved upon completion of the instruction which updates the narrower register.
Abstract:
A dependency table stores a reorder buffer tag for each register. When operand fetch is performed for a set of concurrently decoded instructions, dependency checking is performed including checking for dependencies between the set of concurrently decoded instructions as well as accessing the dependency table to select the reorder buffer tag stored therein. Either the reorder buffer tag of one of the concurrently decoded instructions, the reorder buffer tag stored in the dependency table, the instruction result corresponding to the stored reorder buffer tag, or the value from the register itself is forwarded as the source operand for the instruction. The dependency table stores the width of the register being updated. Prior to forwarding the reorder buffer tag stored within the dependency table, the width stored therein is compared to the width of the source operand being requested. If a narrow-to-wide dependency is detected the instruction is stalled until the instruction indicated in the dependency table retires.