Abstract:
A branch predictor predicts a first outcome of a first branch in a first block of instructions. Fetch logic fetches instructions for speculative execution along a first path indicated by the first outcome. Information representing a remainder of the first block is stored in response to the first predicted outcome being taken. In response to the first branch instruction being not taken, the branch predictor is restarted based on the remainder block. In some cases, entries corresponding to second blocks along speculative paths from the first block are accessed using an address of the first block as an index into a branch prediction structure. Outcomes of branch instructions in the second blocks are concurrently predicted using a corresponding set of instances of branch conditional logic and the predicted outcomes are used in combination with the remainder block to restart the branch predictor in response to mispredictions.
Abstract:
Merging branch target buffer entries includes maintaining, in a branch target buffer, an entry corresponding to first branch instruction, where the entry identifies a first branch target address for the first branch instruction and a second branch target address for a second branch instruction; and accessing, based on the first branch instruction, the entry.
Abstract:
Processor-guided execution of offloaded instructions using fixed function operations is disclosed. Instructions designated for remote execution by a target device are received by a processor. Each instruction includes, as an operand, a target register in the target device. The target register may be an architected virtual register. For each of the plurality of instructions, the processor transmits an offload request in the order that the instructions are received. The offload request includes the instruction designated for remote execution. The target device may be, for example, a processing-in-memory device or an accelerator coupled to a memory.
Abstract:
Techniques for controlling prefetching of instructions into an instruction cache are provided. The techniques include tracking either or both of branch target buffer misses and instruction cache misses, modifying a throttle toggle based on the tracking, and adjusting prefetch activity based on the throttle toggle.
Abstract:
A set of entries in a branch prediction structure for a set of second blocks are accessed based on a first address of a first block. The set of second blocks correspond to outcomes of one or more first branch instructions in the first block. Speculative prediction of outcomes of second branch instructions in the second blocks is initiated based on the entries in the branch prediction structure. State associated with the speculative prediction is selectively flushed based on types of the branch instructions. In some cases, the branch predictor can be accessed using an address of a previous block or a current block. State associated with the speculative prediction is selectively flushed from the ahead branch prediction, and prediction of outcomes of branch instructions in one of the second blocks is selectively initiated using non-ahead accessing, based on the types of the one or more branch instructions.
Abstract:
A processor predicts a number of loop iterations associated with a set of loop instructions. In response to the predicted number of loop iterations exceeding a first loop iteration threshold, the set of loop instructions are executed in a loop mode that includes placing at least one component of an instruction pipeline of the processor in a low-power mode or state and executing the set of loop instructions from a loop buffer. In response to the predicted number of loop iterations being less than or equal to a second loop iteration threshold, the set of instructions are executed in a non-loop mode that includes maintaining at least one component of the instruction pipeline in a powered up state and executing the set of loop instructions from an instruction fetch unit of the instruction pipeline.
Abstract:
Systems, apparatuses, and methods for implementing scheduler queue assignment burst mode are disclosed. A scheduler queue assignment unit receives a dispatch packet with a plurality of operations from a decode unit in each clock cycle. The scheduler queue assignment unit determines if the number of operations in the dispatch packet for any class of operations is greater than a corresponding threshold for dispatching to the scheduler queues in a single cycle. If the number of operations for a given class is greater than the corresponding threshold, and if a burst mode counter is less than a burst mode window threshold, the scheduler queue assignment unit dispatches the extra number of operations for the given class in a single cycle. By operating in burst mode for a given operation class during a small number of cycles, processor throughput can be increased without starving the processor of other operation classes.
Abstract:
A subset of a set of architectural registers in a processing system is marked (or “tainted”) to indicate that speculative use of data in the subset of the architectural registers is constrained based on a taint handling policy. One or more speculation features supported by the processing system are disabled for the instruction so that the one or more speculation features cannot be used on data in the subset. In some cases, values of bits associated with the subset of architectural registers are modified to indicate that the subset is tainted. The taint handling policy can be indicated by values stored in a policy register. Taint markings are tracked in response to values stored in the tainted architectural registers being written to a memory or read from the memory.
Abstract:
Methods and systems for prefetching data for a processor are provided. A system is configured for and a method includes selecting one of a first prefetching control logic and a second prefetching control logic of the processor as a candidate feature, capturing the performance metric of the processor over an inactive sample period when the candidate feature is inactive, capturing a performance metric of the processor over an active sample period when the candidate feature is active, comparing the performance metric of the processor for the active and inactive sample periods, and setting a status of the candidate feature as enabled when the performance metric in the active period indicates improvement over the performance metric in the inactive period, and as disabled when the performance metric in the inactive period indicates improvement over the performance metric in the active period.
Abstract:
The present invention provides a method and apparatus for supporting embodiments of an out-of-order load to load queue structure. One embodiment of the apparatus includes a load queue for storing memory operations adapted to be executed out-of-order with respect to other memory operations. The apparatus also includes a load order queue for cacheable operations that ordered for a particular address.