摘要:
A system and method for reducing branch misprediction penalty. In response to detecting a mispredicted branch instruction, circuitry within a microprocessor identifies a predetermined condition prior to retirement of the branch instruction. Upon identifying this condition, the entire corresponding pipeline is flushed prior to retirement of the branch instruction, and instruction fetch is started at a corresponding address of an oldest instruction in the pipeline immediately prior to the flushing of the pipeline. The correct outcome is stored prior to the pipeline flush. In order to distinguish the mispredicted branch from other instructions, identification information may be stored alongside the correct outcome. One example of the predetermined condition being satisfied is in response to a timer reaching a predetermined threshold value, wherein the timer begins incrementing in response to the mispredicted branch detection and resets at retirement of the mispredicted branch.
摘要:
A system and method for reducing branch misprediction penalty. In response to detecting a mispredicted branch instruction, circuitry within a microprocessor identifies a predetermined condition prior to retirement of the branch instruction. Upon identifying this condition, the entire corresponding pipeline is flushed prior to retirement of the branch instruction, and instruction fetch is started at a corresponding address of an oldest instruction in the pipeline immediately prior to the flushing of the pipeline. The correct outcome is stored prior to the pipeline flush. In order to distinguish the mispredicted branch from other instructions, identification information may be stored alongside the correct outcome. One example of the predetermined condition being satisfied is in response to a timer reaching a predetermined threshold value, wherein the timer begins incrementing in response to the mispredicted branch detection and resets at retirement of the mispredicted branch.
摘要:
A system and method for servicing translation lookaside buffer (TLB) misses may manage separate input and output pipelines within a memory management unit. A pending request queue (PRQ) in the input pipeline may include an instruction-related portion storing entries for instruction TLB (ITLB) misses and a data-related portion storing entries for potential or actual data TLB (DTLB) misses. A DTLB PRQ entry may be allocated to each load/store instruction selected from the pick queue. The system may select an ITLB- or DTLB-related entry for servicing dependent on prior PRQ entry selection(s). A corresponding entry may be held in a translation table entry return queue (TTERQ) in the output pipeline until a matching address translation is received from system memory. PRQ and/or TTERQ entries may be deallocated when a corresponding TLB miss is serviced. PRQ and/or TTERQ entries associated with a thread may be deallocated in response to a thread flush.
摘要:
Systems and methods for efficient thread arbitration in a threaded processor with dynamic resource allocation. A processor includes a resource shared by multiple threads. The resource includes entries which may be allocated for use by any thread. Control logic detects long latency instructions. Long latency instructions have a latency greater than a given threshold. One example is a load instruction that has a read-after-write (RAW) data dependency on a store instruction that misses a last-level data cache. The long latency instruction or an immediately younger instruction is selected for replay for an associated thread. A pipeline flush and replay for the associated thread begins with the selected instruction. Instructions younger than the long latency instruction are held at a given pipeline stage until the long latency instruction completes. During replay, this hold prevents resources from being allocated to the associated thread while the long latency instruction is being serviced.
摘要:
A system and method for servicing translation lookaside buffer (TLB) misses may manage separate input and output pipelines within a memory management unit. A pending request queue (PRQ) in the input pipeline may include an instruction-related portion storing entries for instruction TLB (ITLB) misses and a data-related portion storing entries for potential or actual data TLB (DTLB) misses. A DTLB PRQ entry may be allocated to each load/store instruction selected from the pick queue. The system may select an ITLB- or DTLB-related entry for servicing dependent on prior PRQ entry selection(s). A corresponding entry may be held in a translation table entry return queue (TTERQ) in the output pipeline until a matching address translation is received from system memory. PRQ and/or TTERQ entries may be deallocated when a corresponding TLB miss is serviced. PRQ and/or TTERQ entries associated with a thread may be deallocated in response to a thread flush.
摘要:
Various techniques for mitigating dependencies between groups of instructions are disclosed. In one embodiment, such dependencies include “evil twin” conditions, in which a first floating-point instruction has as a destination a first portion of a logical floating-point register (e.g., a single-precision write), and in which a second, subsequent floating-point instruction has as a source the first portion and a second portion of the same logical floating-point register (e.g., a double-precision read). The disclosed techniques may be applicable in a multithreaded processor implementing register renaming. In one embodiment, a processor may enter an operating mode in which detection of evil twin “producers” (e.g., single-precision writes) causes the instruction sequence to be modified to break potential dependencies. Modification of the instruction sequence may continue until one or more exit criteria are reached (e.g., committing a predetermined number of single-precision writes). This operating mode may be employed on a per-thread basis.
摘要:
Various techniques for mitigating dependencies between groups of instructions are disclosed. In one embodiment, such dependencies include “evil twin” conditions, in which a first floating-point instruction has as a destination a first portion of a logical floating-point register (e.g., a single-precision write), and in which a second, subsequent floating-point instruction has as a source the first portion and a second portion of the same logical floating-point register (e.g., a double-precision read). The disclosed techniques may be applicable in a multithreaded processor implementing register renaming. In one embodiment, a processor may enter an operating mode in which detection of evil twin “producers” (e.g., single-precision writes) causes the instruction sequence to be modified to break potential dependencies. Modification of the instruction sequence may continue until one or more exit criteria are reached (e.g., committing a predetermined number of single-precision writes). This operating mode may be employed on a per-thread basis.
摘要:
Various techniques for dynamically allocating instruction tags and using those tags are disclosed. These techniques may apply to processors supporting out-of-order execution and to architectures that supports multiple threads. A group of instructions may be assigned a tag value from a pool of available tag values. A tag value may be usable to determine the program order of a group of instructions relative to other instructions in a thread. After the group of instructions has been (or is about to be) committed, the tag value may be freed so that it can be re-used on a second group of instructions. Tag values are dynamically allocated between threads; accordingly, a particular tag value or range of tag values is not dedicated to a particular thread.
摘要:
An apparatus and method for expediting parity checked TLB access operations is described in connection with a multithreaded multiprocessor chip. This parity checking mechanism eliminates the need to read a CAM entry from a TLB during a TLB access by storing the tag parity value in a RAM portion of a TLB, using the CAM key input to generate a tag parity check value for a matched entry, and comparing the generated tag parity check value to the stored tag parity value to determine if there is a parity match or error.
摘要:
In one embodiment, a processor comprises a register file, register management logic coupled to the register file, and at least two sources of window swap operations coupled to the register management logic. The register management logic is configured to control an interface to the register file to switch register windows in the register file in response to one or more window swap operations. The sources of window swap operations and the register management logic are configured to cooperate according to an arbitration scheme to arbitrate between conflicting window swap operations to be performed using the interface. In one particular implementation, for example, block signals may be used from higher priority sources to lower priority sources to block issuance of window swap operations by the lower priority sources.