Abstract:
Providing variable interpretation of usefulness indicators for memory tables in processor-based systems is disclosed. In one aspect, a memory system comprises a memory table providing multiple memory table entries, each including a usefulness indicator. A memory controller of the memory system comprises a global polarity indicator representing how the usefulness indicator for each memory table entry is interpreted and updated by the memory controller. If the global polarity indicator is set, the memory controller interprets a value of each usefulness indicator as directly corresponding to the usefulness of the corresponding memory table entry. Conversely, if the global polarity indicator is not set, the polarity is reversed such that the memory controller interprets the usefulness indicator value as inversely corresponding to the usefulness of the corresponding memory table entry. In this manner, the interpretation and updating of usefulness indicators by the memory controller can be varied using the global polarity indicator.
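As an illustration only, the following C++ sketch models the polarity-dependent interpretation and update of a usefulness counter. The type and member names (UsefulnessTable, is_useful, reward, penalize, global_polarity) and the 2-bit counter width are assumptions made for the sketch, not details taken from the disclosure.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical entry: a 2-bit saturating usefulness counter (values 0..3).
    struct TableEntry {
        uint8_t usefulness = 0;
    };

    struct UsefulnessTable {
        std::vector<TableEntry> entries;
        bool global_polarity = true;  // set: counter directly tracks usefulness
                                      // clear: counter inversely tracks usefulness

        explicit UsefulnessTable(std::size_t n) : entries(n) {}

        // Interpretation of the counter depends on the global polarity indicator.
        bool is_useful(std::size_t i) const {
            uint8_t v = entries[i].usefulness;
            return global_polarity ? (v >= 2) : (v < 2);
        }

        // Updates are likewise steered by the polarity: a reward always moves the
        // counter toward the value that the current polarity reads as "useful".
        void reward(std::size_t i) {
            uint8_t& v = entries[i].usefulness;
            if (global_polarity) { if (v < 3) ++v; }
            else                 { if (v > 0) --v; }
        }
        void penalize(std::size_t i) {
            uint8_t& v = entries[i].usefulness;
            if (global_polarity) { if (v > 0) --v; }
            else                 { if (v < 3) ++v; }
        }
    };

Flipping global_polarity reinterprets every counter in the table at once, without rewriting any entry.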
Abstract:
Performing distributed branch prediction using fused processor cores in processor-based systems is disclosed. In one aspect, a distributed branch predictor is provided as a plurality of processor cores supporting core fusion. Each processor core is configured to receive a program identifier from another of the processor cores (or from itself), generate a subsequent predicted program identifier, and forward the predicted program identifier (and, optionally, a global history indicator) to the appropriate processor core responsible for handling the next prediction. The processor core also fetches a header and/or one or more instructions for the received program identifier, and sends the header and/or the one or more instructions to the appropriate processor core for execution. The processor core also determines the processor core that will handle execution of the predicted program identifier, and sends that information to the processor core that received the predicted program identifier as an instruction window tracker.
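A minimal C++ sketch of the forwarding flow under stated assumptions: the predictor itself is a placeholder, and core selection (owner_of, pick_execution_core) is simple modulo arithmetic chosen purely for illustration; none of these names or policies come from the disclosure.

    #include <cstdint>
    #include <vector>

    // Hypothetical fused group: each core owns a slice of the distributed
    // predictor, and core selection is a simple function of the identifier.
    struct Prediction {
        uint64_t program_id;      // predicted next program identifier
        uint64_t global_history;  // optionally forwarded with the prediction
    };

    class FusedCore {
    public:
        explicit FusedCore(unsigned num_cores) : num_cores_(num_cores) {}

        // Receive a program identifier from another core (or from this core).
        void receive(uint64_t program_id, uint64_t history,
                     std::vector<FusedCore>& group) {
            // 1. Generate the next predicted program identifier (placeholder).
            Prediction next{predict(program_id, history), (history << 1) | 1u};

            // 2. Decide which core is responsible for the next prediction, and
            //    which core will execute the predicted block, then forward both.
            unsigned next_owner = owner_of(next.program_id);
            unsigned exec_core  = pick_execution_core(next.program_id);
            group[next_owner].track(next, exec_core);

            // 3. Fetching the header/instructions for 'program_id' and sending
            //    them to the execution core is elided in this sketch.
        }

        // Acting as the instruction window tracker for the forwarded prediction.
        void track(const Prediction& /*p*/, unsigned /*exec_core*/) {}

    private:
        uint64_t predict(uint64_t pid, uint64_t hist) const {
            return pid + 4 * ((hist & 1u) + 1);   // stand-in for a real predictor
        }
        unsigned owner_of(uint64_t pid) const {
            return static_cast<unsigned>(pid % num_cores_);
        }
        unsigned pick_execution_core(uint64_t pid) const {
            return static_cast<unsigned>((pid >> 4) % num_cores_);
        }

        unsigned num_cores_;
    };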
Abstract:
Systems and methods relate to a hierarchical register file system including a level 1 physical register file (L1 PRF) and a backing physical register file (PRF). A subset of the outputs of instructions executed in an instruction pipeline of a processor that are deemed to have a high likelihood of use by one or more future instructions is identified. This subset of instruction outputs is stored in the L1 PRF, while all instruction outputs are stored in the backing PRF.
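A minimal C++ sketch of the filtered write policy; the class and member names (HierarchicalPrf, likely_used_soon) and the use of a hash map as a stand-in for the small L1 PRF are illustrative only.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical sketch: every result goes to the backing PRF; only results
    // predicted to be needed soon are also installed in the small L1 PRF.
    class HierarchicalPrf {
    public:
        explicit HierarchicalPrf(std::size_t num_regs) : backing_(num_regs, 0) {}

        void write(unsigned preg, uint64_t value, bool likely_used_soon) {
            backing_[preg] = value;                  // all outputs land here
            if (likely_used_soon)                    // filtered subset lands here
                l1_[preg] = value;
        }

        uint64_t read(unsigned preg) const {
            auto it = l1_.find(preg);
            if (it != l1_.end()) return it->second;  // fast path: L1 PRF hit
            return backing_[preg];                   // slower path: backing PRF
        }

    private:
        std::vector<uint64_t> backing_;              // full physical register file
        std::unordered_map<unsigned, uint64_t> l1_;  // stand-in for the L1 PRF
    };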
Abstract:
Physical register scrubbing in computer microprocessors. Most instructions in a computer program produce some output value that is destined for one or more architected registers. These architected destination registers are renamed, in the processor pipeline, to physical registers in order to improve performance by exposing more instruction level parallelism to the processor. In one aspect, a method comprises identifying, in a reorder buffer, a first instruction and a second instruction, without intervening potential pipeline flushers, that write to the same architected destination register, in order to free the physical register corresponding to the older of the two instructions.
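The following C++ sketch illustrates one way such a reorder buffer scan could look. The entry layout, the scrub function, and the rule of forgetting prior writers at a potential flusher are assumptions made for illustration; a real implementation would also have to guarantee that the younger writer commits before the older physical register is reused.

    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // Hypothetical ROB entry: which architected register it writes, which physical
    // register holds that result, and whether the instruction could flush the pipe.
    struct RobEntry {
        int  arch_dest;          // -1 if the instruction writes no register
        int  phys_dest;
        bool potential_flusher;  // e.g. a branch or an instruction that may fault
    };

    // Walk the ROB oldest-to-youngest; when two writers of the same architected
    // register appear with no potential flusher in between, the older writer's
    // physical register can be scrubbed (freed) early.
    std::vector<int> scrub(const std::vector<RobEntry>& rob) {
        std::vector<int> freed;
        std::unordered_map<int, std::size_t> last_writer;  // arch reg -> latest writer index
        for (std::size_t i = 0; i < rob.size(); ++i) {
            if (rob[i].potential_flusher) {
                // A flush could restore older mappings, so forget prior writers.
                last_writer.clear();
            }
            if (rob[i].arch_dest < 0) continue;
            auto it = last_writer.find(rob[i].arch_dest);
            if (it != last_writer.end())
                freed.push_back(rob[it->second].phys_dest);  // older mapping is dead
            last_writer[rob[i].arch_dest] = i;
        }
        return freed;
    }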
Abstract:
Caching instruction block header data in block architecture processor-based systems is disclosed. In one aspect, a computer processor device, based on a block architecture, provides an instruction block header cache dedicated to caching instruction block header data. Upon a subsequent fetch of an instruction block, cached instruction block header data may be retrieved from the instruction block header cache (if present) and used to optimize processing of the instruction block. In some aspects, the instruction block header data may include a microarchitectural block header (MBH) generated upon the first decoding of the instruction block by an MBH generation circuit. The MBH may contain static or dynamic information about the instructions within the instruction block. As non-limiting examples, the information may include data relating to register reads and writes, load and store operations, branch information, predicate information, special instructions, and/or serial execution preferences.
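A minimal C++ sketch, assuming a simple address-indexed cache; the header fields and the names MicroarchBlockHeader and BlockHeaderCache are invented for the sketch and need not match the disclosed MBH format.

    #include <cstdint>
    #include <optional>
    #include <unordered_map>

    // Hypothetical microarchitectural block header: static/dynamic facts about the
    // instructions in one block, filled in the first time the block is decoded.
    struct MicroarchBlockHeader {
        uint32_t register_reads  = 0;
        uint32_t register_writes = 0;
        uint8_t  load_count      = 0;
        uint8_t  store_count     = 0;
        bool     has_branch      = false;
        bool     has_predicates  = false;
        bool     prefers_serial  = false;
    };

    // Cache keyed by the block's address; a later fetch of the same block can pull
    // the header and skip regenerating this information.
    class BlockHeaderCache {
    public:
        std::optional<MicroarchBlockHeader> lookup(uint64_t block_addr) const {
            auto it = cache_.find(block_addr);
            if (it == cache_.end()) return std::nullopt;
            return it->second;
        }
        void install(uint64_t block_addr, const MicroarchBlockHeader& mbh) {
            cache_[block_addr] = mbh;   // produced by the MBH generation circuit
        }
    private:
        std::unordered_map<uint64_t, MicroarchBlockHeader> cache_;
    };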
Abstract:
Reduced logic level operation folding of context history in a history register in a prediction system for a processor-based system is disclosed. The prediction system includes a prediction circuit employing reduced operation folding of the history register for indexing a prediction table containing prediction values used to process a consumer instruction when its value has not yet been resolved. To avoid having to perform successive logic folding operations to produce a folded context history of a resultant reduced bit width, a reduced logic level folding operation of the resultant reduced bit width is employed. This reduced logic level folding operation uses the current folded context history, formed from the previous contents of the history register, as the basis for determining the new folded context history. In this manner, logic folding of the history register is faster and operates with reduced power consumption as a result of fewer logic operations.
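This incremental update is in the spirit of the folded-history bookkeeping used in TAGE-style predictors. The C++ sketch below assumes illustrative names and widths (FoldedHistory, hist_len, folded_len); the exact wiring in the disclosure may differ.

    #include <cstdint>

    // Hypothetical sketch of incremental ("reduced logic level") history folding:
    // instead of XOR-ing every chunk of the full history each cycle, the new folded
    // value is derived from the previous folded value, the bit shifted into the
    // history register, and the bit shifted out of it.
    struct FoldedHistory {
        uint32_t folded = 0;   // current folded context history
        unsigned hist_len;     // length of the underlying history register
        unsigned folded_len;   // resultant reduced bit width (< 32 in this sketch)

        FoldedHistory(unsigned h, unsigned f) : hist_len(h), folded_len(f) {}

        // new_bit: bit shifted into the history register this cycle
        // old_bit: bit falling off the end of the history register this cycle
        void update(unsigned new_bit, unsigned old_bit) {
            folded = (folded << 1) | (new_bit & 1u);              // mix in newest bit
            folded ^= (old_bit & 1u) << (hist_len % folded_len);  // cancel aged-out bit
            folded ^= folded >> folded_len;                       // wrap the overflow bit
            folded &= (1u << folded_len) - 1u;                    // keep the reduced width
        }
    };

Because only the newly shifted-in bit and the aged-out bit are combined with the previous folded value, the per-cycle work is a constant handful of XORs instead of a fold over the entire history register.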
Abstract:
The disclosure generally relates to dynamic clock and voltage scaling (DCVS) based on program phase. For example, during each program phase, a first hardware counter may count each cycle in which a dispatch stall occurs and the oldest instruction in the load queue is a last-level cache miss, a second hardware counter may count total cycles, and a third hardware counter may count committed instructions. Accordingly, a software/firmware mechanism may read the various hardware counters once the committed-instruction counter reaches a threshold value and divide the value of the first hardware counter by the value of the second hardware counter to measure the stall fraction during the current program execution phase. The measured stall fraction can then be used to predict the stall fraction in the next program execution phase, such that optimal voltage and frequency settings can be applied in the next phase based on the predicted stall fraction.
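A minimal C++ sketch of the counter arithmetic and a possible mapping to operating points. The threshold values and the last-value prediction policy (assuming the next phase behaves like the current one) are assumptions made for the sketch, not details taken from the disclosure.

    #include <cstdint>

    // Hypothetical counters sampled by software/firmware at the end of a phase
    // (the phase boundary is the committed-instruction counter hitting a threshold).
    struct PhaseCounters {
        uint64_t llc_miss_stall_cycles;  // cycles with a dispatch stall while the
                                         // oldest load is a last-level cache miss
        uint64_t total_cycles;
        uint64_t committed_instructions;
    };

    // Measured stall fraction for the phase that just ended.
    double stall_fraction(const PhaseCounters& c) {
        return c.total_cycles ? static_cast<double>(c.llc_miss_stall_cycles) /
                                static_cast<double>(c.total_cycles)
                              : 0.0;
    }

    // Illustrative policy only: map the predicted stall fraction for the next
    // phase to a voltage/frequency operating point (thresholds are made up).
    enum class VfPoint { kHighPerf, kNominal, kLowPower };

    VfPoint pick_operating_point(double predicted_stall_fraction) {
        if (predicted_stall_fraction > 0.5) return VfPoint::kLowPower;  // memory bound
        if (predicted_stall_fraction > 0.2) return VfPoint::kNominal;
        return VfPoint::kHighPerf;                                      // compute bound
    }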
Abstract:
Storing narrow produced values for instruction operands directly in a register map in an out-of-order processor (OoP) is provided. An OoP is provided that includes an instruction processing system. The instruction processing system includes a number of instruction processing stages configured to pipeline the processing and execution of instructions according to a dataflow execution model. The instruction processing system also includes a register map table (RMT) configured to store address pointers mapping logical registers to physical registers in a physical register file (PRF) for storing produced data for use by consumer instructions without overwriting logical registers for later-executed, out-of-order instructions. In certain aspects, the instruction processing system is configured to write back (i.e., store) narrow values produced by executed instructions directly into the RMT, as opposed to writing the narrow produced values into the PRF in a write back stage.
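A minimal C++ sketch, assuming a single flag per RMT entry and a 16-bit narrow-width cutoff; the names RmtEntry, kNarrowMax, write_back, and read_source are invented for illustration.

    #include <cstdint>

    // Hypothetical RMT entry: for a narrow result the value itself is written into
    // the map entry at write-back instead of into the PRF; wide results keep the
    // usual pointer into the PRF.
    struct RmtEntry {
        bool     holds_value;   // true: 'payload' is the narrow value itself
        uint32_t payload;       // narrow value, or physical register number
    };

    constexpr uint64_t kNarrowMax = 0xFFFFu;   // illustrative narrow-width cutoff

    // Called at write-back for the logical destination register's RMT entry.
    void write_back(RmtEntry& e, uint64_t value, uint32_t preg,
                    uint64_t* prf /* physical register file */) {
        if (value <= kNarrowMax) {
            e.holds_value = true;              // store the produced value in the RMT
            e.payload = static_cast<uint32_t>(value);
        } else {
            e.holds_value = false;             // fall back to the normal PRF path
            e.payload = preg;
            prf[preg] = value;
        }
    }

    // A consumer reads its source either directly from the RMT or through the PRF.
    uint64_t read_source(const RmtEntry& e, const uint64_t* prf) {
        return e.holds_value ? e.payload : prf[e.payload];
    }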
Abstract:
Systems and methods are directed to instruction execution in a computer system having an out-of-order instruction picker, which is typically used in computing systems capable of executing multiple instructions in parallel. Such systems are typically block based, with multiple instructions grouped into execution units such as Reservation Station (RSV) Arrays. If an event such as an exception or page fault occurs, the block may have to be swapped out, that is, removed from execution, until the event clears. When the event clears, the block is typically brought back to be executed, but is usually assigned a different RSV Array and re-executed from the beginning of the block. Tagging instructions that may cause such events, and then untagging them by resetting the tag once they have executed, can eliminate much of this unnecessary re-execution of instructions.
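A minimal C++ sketch of the tag-and-untag idea under stated assumptions; RsvSlot, on_executed, and slots_to_replay are illustrative names, and the replay rule is deliberately simplified (it ignores data dependences between replayed and already-completed instructions).

    #include <cstddef>
    #include <vector>

    // Hypothetical RSV Array slot: each instruction carries a tag marking it as a
    // potential event source (exception, page fault, ...). The tag is reset once
    // the instruction has executed without raising the event.
    struct RsvSlot {
        bool may_cause_event = false;  // set at insert time for risky instructions
        bool executed        = false;
    };

    // After the instruction executes cleanly, clear its tag so that a later
    // re-activation of the block does not need to re-execute it.
    void on_executed(RsvSlot& s) {
        s.executed = true;
        s.may_cause_event = false;
    }

    // When the block returns after the event clears (possibly in a different RSV
    // Array), only slots that are still tagged or not yet executed are replayed.
    std::vector<std::size_t> slots_to_replay(const std::vector<RsvSlot>& block) {
        std::vector<std::size_t> replay;
        for (std::size_t i = 0; i < block.size(); ++i)
            if (!block[i].executed || block[i].may_cause_event)
                replay.push_back(i);
        return replay;
    }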
Abstract:
The clock frequency of a processor is reduced in response to a dispatch stall due to a cache miss. In an embodiment, the processor clock frequency is reduced for a load instruction that causes a last level cache miss, provided that the load instruction is the oldest load instruction and the number of consecutive processor cycles in which there is a dispatch stall exceeds a threshold, and provided that the total number of processor cycles since the last level cache miss does not exceed some specified number.
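A minimal C++ sketch of the decision, with made-up threshold values and state names (StallState, kStallThreshold, kMissCycleLimit); the disclosure leaves the actual threshold and cycle limit unspecified.

    #include <cstdint>

    // Hypothetical per-cycle check deciding whether to drop the clock frequency.
    struct StallState {
        bool     dispatch_stalled;
        bool     oldest_load_is_llc_miss;  // the oldest load missed the last-level cache
        uint32_t consecutive_stall_cycles;
        uint32_t cycles_since_llc_miss;
    };

    constexpr uint32_t kStallThreshold = 32;    // consecutive stall cycles required
    constexpr uint32_t kMissCycleLimit = 2048;  // stop reducing after this many cycles

    bool should_reduce_frequency(const StallState& s) {
        return s.dispatch_stalled &&
               s.oldest_load_is_llc_miss &&
               s.consecutive_stall_cycles > kStallThreshold &&
               s.cycles_since_llc_miss <= kMissCycleLimit;
    }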