Abstract:
Providing efficient recursion handling using compressed return address stacks (CRASs) in processor-based systems is disclosed. In one aspect, a processor-based system provides a branch prediction circuit including a CRAS. Each CRAS entry within the CRAS includes an address field and a counter field. When a call instruction is encountered, a return address of the call instruction is compared to the address field of a top CRAS entry indicated by a CRAS top-of-stack (TOS) index. If the return address matches the top CRAS entry, the counter field of the top CRAS entry is incremented instead of adding a new CRAS entry for the return address. When a return instruction is subsequently encountered in the instruction stream, the counter field of the top CRAS entry is decremented if its value is greater than zero (0), or, if not, the top CRAS entry is removed from the CRAS.
Abstract:
A method for predicting a fetch address of a next instruction to be fetched includes selecting, at a processor, a first way identifier or a second way identifier as a way pointer based on an active fetch address and historical prediction data. A first predictor table includes a first entry having the first way identifier and a second predictor table includes a second entry having the second way identifier. The method also includes selecting a first or second fetch address as a predicted fetch address based on the way pointer. A target table includes a first way storing the first fetch address and a second way storing the second fetch address. The first way and the second way are associated with the active fetch address. The first fetch address is associated with the first way identifier and the second fetch address is associated with the second way identifier.
Abstract:
The disclosure generally relates to dynamic clock and voltage scaling (DCVS) based on program phase. For example, during each program phase, a first hardware counter may count each cycle where a dispatch stall occurs and an oldest instruction in a load queue is a last-level cache miss, a second hardware counter may count total cycles, and a third hardware counter may count committed instructions. Accordingly, a software/firmware mechanism may read the various hardware counters once the committed instruction counter reaches a threshold value and divide a value of the first hardware counter by a value of the second hardware counter to measure a stall fraction during a current program execution phase. The measured stall fraction can then be used to predict a stall fraction in a next program execution phase such that optimal voltage and frequency settings can be applied in the next phase based on the predicted stall fraction.
Abstract:
Reduced logic level operation folding of context history in a history register in a prediction system for a processor-based system is disclosed. The prediction system includes a prediction circuit employing reduced operation folding of the history register for indexing a prediction table containing prediction values used to process a consumer instruction when value has not yet been resolved. To avoid the requirement to perform successive logic folding operations to produce a folded context history of a resultant reduced bit width, reduced logic level folding operation of the resultant reduced bit width is employed. Reduced logic level folding operation of the resultant reduced bit width involves using current folded context history from previous contents of a history register as basis for determining a new folded context history. In this manner, logic folding of the history register is faster and operates with reduced power consumption as a result of fewer logic operations.
Abstract:
Storing narrow produced values for instruction operands directly in a register map in an out-of-order processor (OoP) is provided. An OoP is provided that includes an instruction processing system. The instruction processing system includes a number of instruction processing stages configured to pipeline the processing and execution of instructions according to a dataflow execution. The instruction processing system also includes a register map table (RMT) configured to store address pointers mapping logical registers to physical registers in a physical register file (PRF) for storing produced data for use by consumer instructions without overwriting logical registers for later executed, out-of-order instructions. In certain aspects, the instruction processing system is configured to write back (i.e., store) narrow values produced by executed instructions directly into the RMT, as opposed to writing the narrow produced values into the PRF in a write back stage.
Abstract:
A method and apparatus for allowing an out-of-order processor to reuse an in-use physical register is disclosed herein. The method and apparatus uses identifiers, such as tokens and/or other identifiers in a rename map table (RMT) and a physical register file (PRF), to indicate whether an instruction result is allowed or disallowed to be written into a physical register.
Abstract:
Providing variable interpretation of usefulness indicators for memory tables in processor-based systems is disclosed. In one aspect, a memory system comprises a memory table providing multiple memory table entries, each including a usefulness indicator. A memory controller of the memory system comprises a global polarity indicator representing how the usefulness indicator for each memory table entry is interpreted and updated by the memory controller. If the global polarity indicator is set, the memory controller interprets a value of each usefulness indicator as directly corresponding to the usefulness of the corresponding memory table entry. Conversely, if the global polarity indicator is not set, the polarity is reversed such that the memory controller interprets the usefulness indicator value as inversely corresponding to the usefulness of the corresponding memory table entry. In this manner, the interpretation and updating of usefulness indicators by the memory controller can be varied using the global polarity indicator.
Abstract:
Storing narrow produced values for instruction operands directly in a register map in an out-of-order processor (OoP) is provided. An OoP is provided that includes an instruction processing system. The instruction processing system includes a number of instruction processing stages configured to pipeline the processing and execution of instructions according to a dataflow execution. The instruction processing system also includes a register map table (RMT) configured to store address pointers mapping logical registers to physical registers in a physical register file (PRF) for storing produced data for use by consumer instructions without overwriting logical registers for later executed, out-of-order instructions. In certain aspects, the instruction processing system is configured to write back (i.e., store) narrow values produced by executed instructions directly into the RMT, as opposed to writing the narrow produced values into the PRF in a write back stage.
Abstract:
The disclosure generally relates to dynamic clock and voltage scaling (DCVS) based on program phase. For example, during each program phase, a first hardware counter may count each cycle where a dispatch stall occurs and an oldest instruction in a load queue is a last-level cache miss, a second hardware counter may count total cycles, and a third hardware counter may count committed instructions. Accordingly, a software/firmware mechanism may read the various hardware counters once the committed instruction counter reaches a threshold value and divide a value of the first hardware counter by a value of the second hardware counter to measure a stall fraction during a current program execution phase. The measured stall fraction can then be used to predict a stall fraction in a next program execution phase such that optimal voltage and frequency settings can be applied in the next phase based on the predicted stall fraction.
Abstract:
The disclosure generally relates to dynamic clock and voltage scaling (DCVS) based on program phase. For example, during each program phase, a first hardware counter may count each cycle where a dispatch stall occurs and an oldest instruction in a load queue is a last-level cache miss, a second hardware counter may count total cycles, and a third hardware counter may count committed instructions. Accordingly, a software/firmware mechanism may read the various hardware counters once the committed instruction counter reaches a threshold value and divide a value of the first hardware counter by a value of the second hardware counter to measure a stall fraction during a current program execution phase. The measured stall fraction can then be used to predict a stall fraction in a next program execution phase such that optimal voltage and frequency settings can be applied in the next phase based on the predicted stall fraction.