Abstract:
Various techniques for processing and pre-decoding branches within an IT instruction block. Instructions are fetched and cached in an instruction cache, and pre-decode bits are generated to indicate the presence of an IT instruction and the likely boundaries of the IT instruction block. If an unconditional branch is detected within the likely boundaries of an IT instruction block, the unconditional branch is treated as if it were a conditional branch. The unconditional branch is sent to the branch direction predictor and the predictor generates a branch direction prediction for the unconditional branch.
Abstract:
In an embodiment, a processor includes a multi-level dispatch circuit configured to supply operations for execution by multiple parallel execution pipelines. The multi-level dispatch circuit may include multiple dispatch buffers, each of which is coupled to multiple reservation stations. Each reservation station may be coupled to a respective execution pipeline and may be configured to schedule instruction operations (ops) for execution in the respective execution pipeline. The sets of reservation stations coupled to each dispatch buffer may be non-overlapping. Thus, if a given op is to be executed in a given execution pipeline, the op may be sent to the dispatch buffer which is coupled to the reservation station that provides ops to the given execution pipeline.
Abstract:
A system and method for efficient branch prediction. A processor includes a next fetch predictor to generate a fast branch prediction for branch instructions at an early pipeline stage. The processor also includes a main return address stack (RAS) at a later pipeline stage for predicting the target of return instructions. When a return instruction is encountered, the prediction from the next fetch predictor is replaced by the top of the main RAS. If there are any recent call or return instructions in flight toward the main RAS, then a separate prediction is generated by a mini-RAS.
Abstract:
Various techniques for processing and pre-decoding branches within an IT instruction block. Instructions are fetched and cached in an instruction cache, and pre-decode bits are generated to indicate the presence of an IT instruction and the likely boundaries of the IT instruction block. If an unconditional branch is detected within the likely boundaries of an IT instruction block, the unconditional branch is treated as if it were a conditional branch. The unconditional branch is sent to the branch direction predictor and the predictor generates a branch direction prediction for the unconditional branch.
Abstract:
A circuit for implementing a branch target buffer. The branch target buffer may include a memory that stores a plurality of entries. Each entry may include a tag value, a target value, and a prediction accuracy value. A received index value corresponding to an indirect branch instruction may be used to select one of entries of the plurality of entries, and a received tag value may then be compared to the tag value of the selected entries in the memory. An entry in the memory may be selected in response to a determination that the received tag does not match the tag value of compared entries. The selected entry may be allocated to the indirect instruction branch dependent upon the prediction accuracy values of the plurality of entries.
Abstract:
A processor and method for fusing together an arithmetic instruction and a branch instruction. The processor includes an instruction fetch unit configured to fetch instructions. The processor may also include an instruction decode unit that may be configured to decode the fetched instructions into micro-operations for execution by an execution unit. The decode unit may be configured to detect an occurrence of an arithmetic instruction followed by a branch instruction in program order, wherein the branch instruction, upon execution, changes a program flow of control dependent upon a result of execution of the arithmetic instruction. In addition, the processor may further be configured to fuse together the arithmetic instruction and the branch instruction such that a single micro-operation is formed. The single micro-operation includes execution information based upon both the arithmetic instruction and the branch instruction.
Abstract:
A system and method for efficiently performing microarchitectural checkpointing. A register rename unit within a processor determines whether a physical register number qualifies to have duplicate mappings. Information for maintenance of the duplicate mappings is stored in a register duplicate array (RDA). To reduce the penalty for misspeculation or exception recovery, control logic in the processor supports multiple checkpoints. The RDA is one of multiple data structures to have checkpoint copies of state. The RDA utilizes a content addressable memory (CAM) to store physical register numbers. The duplicate counts for both the current state and the checkpoint copies for a given physical register number are updated when instructions utilizing the given physical register number are retired. To reduce on-die real estate and power consumption, a single CAM entry is stores the physical register number and the other fields are stored in separate storage elements.
Abstract:
A mechanism for reducing power consumption of a cache memory of a processor includes a processor with a cache memory that stores instruction information for one or more instruction fetch groups fetched from a system memory. The cache memory may include a number of ways that are each independently controllable. The processor also includes a way prediction unit. The way prediction unit may enable, in a next execution cycle, a given way within which instruction information corresponding to a target of a next branch instruction is stored in response to a branch taken prediction for the next branch instruction. The way prediction unit may also, in response to the branch taken prediction for the next branch instruction, enable, one at a time, each corresponding way within which instruction information corresponding to respective sequential instruction fetch groups that follow the next branch instruction are stored.
Abstract:
A mechanism for reducing power consumption of a cache memory of a processor includes a processor with a cache memory that stores instruction information for one or more instruction fetch groups fetched from a system memory. The cache memory may include a number of ways that are each independently controllable. The processor also includes a way prediction unit. The way prediction unit may enable, in a next execution cycle, a given way within which instruction information corresponding to a target of a next branch instruction is stored in response to a branch taken prediction for the next branch instruction. The way prediction unit may also, in response to the branch taken prediction for the next branch instruction, enable, one at a time, each corresponding way within which instruction information corresponding to respective sequential instruction fetch groups that follow the next branch instruction are stored.
Abstract:
A circuit for implementing a branch target buffer. The branch target buffer may include a memory that stores a plurality of entries. Each entry may include a tag value, a target value, and a prediction accuracy value. A received index value corresponding to an indirect branch instruction may be used to select one of entries of the plurality of entries, and a received tag value may then be compared to the tag value of the selected entries in the memory. An entry in the memory may be selected in response to a determination that the received tag does not match the tag value of compared entries. The selected entry may be allocated to the indirect instruction branch dependent upon the prediction accuracy values of the plurality of entries.