Abstract:
A system and method for branch prediction in a microprocessor. A hybrid device stores branch prediction information in a sparse cache for up to a small number of branches, the common case, within each entry of the instruction cache. For the less common case in which an i-cache line contains additional branches, the device stores the corresponding branch prediction information in a dense cache. Each entry of the sparse cache stores a bit vector indicating whether or not the corresponding instruction cache line includes additional branch instructions. This indication may also be used to select an entry in the dense cache for storage. A second sparse cache stores entire entries evicted from the first sparse cache.
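The sparse/dense split described above can be illustrated with a minimal software sketch. It is not the patented hardware: the line size, the two sparse slots per line, the 8-byte chunk granularity of the bit vector, and all names (`HybridPredictor`, `record`, `lookup`) are assumptions made for illustration, and the second sparse cache for evicted entries is omitted.

```python
LINE_BYTES = 64    # assumed i-cache line size
CHUNK_BYTES = 8    # assumed granularity of one bit in the bit vector
SPARSE_SLOTS = 2   # assumed "common case" number of branches per line


class HybridPredictor:
    """Toy model: sparse entries per cache line, dense entries for overflow."""

    def __init__(self):
        # line address -> {"branches": {offset: info}, "dense_bits": int}
        self.sparse = {}
        # (line address, chunk index) -> {offset: info}
        self.dense = {}

    def record(self, pc, info):
        line, offset = pc - pc % LINE_BYTES, pc % LINE_BYTES
        entry = self.sparse.setdefault(line, {"branches": {}, "dense_bits": 0})
        branches = entry["branches"]
        if offset in branches or len(branches) < SPARSE_SLOTS:
            branches[offset] = info
            return "sparse"
        # Sparse slots are full: mark the chunk in the bit vector and use
        # that same indication to select the dense entry.
        chunk = offset // CHUNK_BYTES
        entry["dense_bits"] |= 1 << chunk
        self.dense.setdefault((line, chunk), {})[offset] = info
        return "dense"

    def lookup(self, pc):
        line, offset = pc - pc % LINE_BYTES, pc % LINE_BYTES
        entry = self.sparse.get(line)
        if entry is None:
            return None
        if offset in entry["branches"]:
            return entry["branches"][offset]
        chunk = offset // CHUNK_BYTES
        if (entry["dense_bits"] >> chunk) & 1:
            return self.dense.get((line, chunk), {}).get(offset)
        return None
```

In this sketch the dense structure is consulted only when the bit vector says the line holds extra branches, which is the role the abstract assigns to that indication.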
Abstract:
A system and method for branch prediction in a microprocessor. A branch prediction unit stores an indication of the location of a branch target instruction relative to its corresponding branch instruction. For example, a target instruction may be located within the same first region of memory as the branch instruction. Alternatively, the target instruction may be located outside the first region, but within a larger second region. The prediction unit comprises a branch target array corresponding to each region. Each array stores a bit range of a branch target address, wherein the stored bit range is based upon the location of the target instruction relative to the branch instruction. The prediction unit constructs a predicted branch target address by concatenating the bits stored in the branch target arrays.
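The concatenation step can be sketched as follows. The region sizes are assumptions chosen for illustration (a 4 KiB first region covered by 12 low bits and a 1 MiB second region covered by 8 further bits); the abstract does not specify them, and the function names are hypothetical. Upper bits that are not stored are taken from the branch address itself.

```python
NEAR_BITS = 12  # assumed: low bits held in the first-region target array
FAR_BITS = 8    # assumed: next bits held in the second-region target array


def classify(branch_pc, target):
    """Return which region the target falls in relative to the branch."""
    if branch_pc >> NEAR_BITS == target >> NEAR_BITS:
        return "near"
    wide = NEAR_BITS + FAR_BITS
    if branch_pc >> wide == target >> wide:
        return "far"
    return "other"


def split_target(target):
    """Split a target address into the bit ranges each array would store."""
    low = target & ((1 << NEAR_BITS) - 1)
    mid = (target >> NEAR_BITS) & ((1 << FAR_BITS) - 1)
    return low, mid


def predict_target(branch_pc, region, low, mid=0):
    """Rebuild the target by concatenating stored bits with branch PC bits."""
    if region == "near":
        return ((branch_pc >> NEAR_BITS) << NEAR_BITS) | low
    if region == "far":
        wide = NEAR_BITS + FAR_BITS
        return ((branch_pc >> wide) << wide) | (mid << NEAR_BITS) | low
    raise ValueError("target outside both modeled regions")
```

A nearby target therefore costs only the low bit range of storage, while a farther target also uses the second array.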
Abstract:
Techniques are disclosed relating to improving the performance of branch prediction in processors. In one embodiment, a processor is disclosed that includes a branch prediction unit configured to predict a sequence of instructions to be issued by the processor for execution. The processor also includes a pattern detection unit configured to detect a pattern in the predicted sequence of instructions, where the pattern includes a plurality of predicted instructions. In response to the pattern detection unit detecting the pattern, the processor is configured to switch from issuing instructions predicted by the branch prediction unit to issuing the plurality of instructions. In some embodiments, the processor includes a replay unit that is configured to replay fetch addresses to an instruction fetch unit to cause the plurality of predicted instructions to be issued.
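As a rough software analogy for the pattern detection and replay units, the sketch below looks for a repeating run at the end of a history of predicted fetch addresses and then replays it. The detection rule (shortest suffix repeated at least `min_repeats` times), the parameter names, and both functions are assumptions for illustration, not the disclosed circuit.

```python
from itertools import cycle, islice


def detect_pattern(history, min_repeats=2, max_len=8):
    """Return the shortest pattern repeated back-to-back at the end of
    history at least min_repeats times, or None if there is none."""
    for length in range(1, max_len + 1):
        need = length * min_repeats
        if len(history) < need:
            break
        tail = history[-need:]
        pattern = tail[:length]
        if all(tail[i] == pattern[i % length] for i in range(need)):
            return pattern
    return None


def replay(pattern, count):
    """Produce count fetch addresses by cycling through the pattern."""
    return list(islice(cycle(pattern), count))
```

Once a pattern is found, a model of the processor would take fetch addresses from `replay` instead of from the branch predictor, mirroring the switch described in the abstract.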
Abstract:
There is disclosed a data processor that uses bypass circuitry to transfer result data from late pipeline stages to earlier pipeline stages in an efficient manner and with a minimum amount of wiring. The data processor comprises: 1) an instruction execution pipeline comprising a) a read stage; b) a write stage; and c) a first execution stage comprising E execution units that produce data results from data operands. The data processor also comprises: 2) a register file comprising a plurality of data registers, each of the data registers being read by the read stage of the instruction pipeline via at least one of R read ports of the register file and each of the data registers being written by the write stage of the instruction pipeline via at least one of W write ports of the register file; and 3) bypass circuitry for receiving data results from output channels of source devices in at least one of the write stage and the first execution stage, the bypass circuitry comprising a first plurality of bypass tristate line drivers having input channels coupled to first output channels of a first plurality of source devices and tristate output channels coupled to a first common read data channel in the read stage.
Abstract:
There is disclosed a data processor having a clustered architecture that comprises a plurality of clusters and an interrupt and exception controller. Each of the clusters comprises an instruction execution pipeline having N processing stages. Each of the N processing stages is capable of performing at least one of a plurality of execution steps associated with instructions being executed by the clusters. The interrupt and exception controller operates to (i) detect an exception condition associated with one of the executing instructions, wherein this executing instruction was issued at time t0, and (ii) generate an exception in response to the exception condition once the earlier executing instructions, those issued at times preceding t0, have completed execution.
Abstract:
There is disclosed a data processor that executes variable latency load operations using bypass circuitry that allows load word operations to avoid stalls caused by shifting circuitry. The processor comprises: 1) an instruction execution pipeline comprising N processing stages, each of the N processing stages for performing one of a plurality of execution steps associated with a pending instruction being executed by the instruction execution pipeline; 2) a data cache for storing data values used by the pending instruction; 3) a plurality of registers for receiving the data values from the data cache; 4) a load store unit for transferring a first one of the data values from the data cache to a target one of the plurality of registers during execution of a load operation; 5) a shifter circuit associated with the load store unit for shifting the first data value prior to loading the first data value into the target register; and 6) bypass circuitry associated with the load store unit for transferring the first data value from the data cache directly to the target register without processing the first data value in the shifter circuit.
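The variable-latency idea can be shown with a small model in which a full-word load skips the shifter and narrower loads pass through it. The 32-bit word, little-endian byte numbering, and the cycle counts of 1 and 2 are assumptions for illustration only; the abstract gives no latencies.

```python
WORD_BYTES = 4  # assumed word size


def load(cache_word, size, byte_offset=0):
    """Return (value, latency_cycles) for a load of size bytes.

    A full-word load takes the bypass path; smaller loads go through
    the shifter, which is modeled here as costing one extra cycle.
    """
    if size == WORD_BYTES:
        return cache_word, 1  # bypass: value is not shifted
    mask = (1 << (8 * size)) - 1
    shifted = (cache_word >> (8 * byte_offset)) & mask
    return shifted, 2  # shifter path
```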
Abstract:
A data processor includes execution clusters, an instruction cache, an instruction issue unit, and alignment and dispersal circuitry. Each execution cluster includes an instruction execution pipeline having a number of processing stages, and each execution pipeline is a number of lanes wide. The processing stages execute instruction bundles, where each instruction bundle has one or more syllables. Each lane is capable of receiving one of the syllables of an instruction bundle. The instruction cache includes a number of cache lines. The instruction issue unit receives fetched cache lines and issues complete instruction bundles toward the execution clusters. The alignment and dispersal circuitry receives the complete instruction bundles from the instruction issue unit and routes each received complete instruction bundle to a correct one of the execution clusters. The complete instruction bundles are routed as a function of at least one address bit associated with each complete instruction bundle.
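Routing as a function of an address bit might look like the sketch below. The choice of bit 4 as the select bit (as if bundles were 16 bytes) and a power-of-two cluster count are assumptions; the abstract says only that at least one address bit is used.

```python
def route_bundle(bundle_addr, num_clusters=2, select_shift=4):
    """Pick a cluster index from address bits starting at select_shift.

    num_clusters is assumed to be a power of two so a mask can be used.
    """
    if num_clusters & (num_clusters - 1):
        raise ValueError("num_clusters must be a power of two")
    return (bundle_addr >> select_shift) & (num_clusters - 1)


def disperse(bundle_addrs, num_clusters=2):
    """Group bundle addresses by the cluster each one is routed to."""
    lanes = [[] for _ in range(num_clusters)]
    for addr in bundle_addrs:
        lanes[route_bundle(addr, num_clusters)].append(addr)
    return lanes
```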
Abstract:
There is disclosed a data processor containing an instruction issue unit that efficiently transfers instruction bundles from a cache to an instruction pipeline. The data processor comprises 1) an instruction pipeline comprising N processing stages; and 2) an instruction issue unit that feeds instructions fetched from the instruction cache into the instruction pipeline, each of the fetched instructions comprising from one to S syllables. The instruction issue unit comprises: a) a first buffer comprising S storage locations for storing up to S syllables associated with the fetched instructions, each of the S storage locations for storing one of the one to S syllables of each fetched instruction; b) a second buffer comprising S storage locations for storing up to S syllables associated with the fetched instructions, each of the S storage locations for storing one of the one to S syllables of each fetched instruction; and c) a controller for determining if a first one of the S storage locations in the first buffer is full, wherein the controller, in response to such a determination, stores the corresponding syllable of an incoming fetched instruction in one of the S storage locations in the second buffer.
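The controller's overflow rule can be sketched with two fixed-size buffers. The value S = 4, the use of the same slot index in the second buffer, and the class and method names are assumptions for illustration; the abstract requires only that the syllable go to one of the S locations in the second buffer.

```python
class IssueBuffers:
    """Toy model of two S-slot syllable buffers with overflow."""

    def __init__(self, s=4):
        self.first = [None] * s
        self.second = [None] * s

    def store(self, slot, syllable):
        """Place a syllable in the first buffer, or in the second buffer
        when the matching first-buffer slot is already full."""
        if self.first[slot] is None:
            self.first[slot] = syllable
            return "first"
        if self.second[slot] is None:
            self.second[slot] = syllable
            return "second"
        raise BufferError("both buffers are full for this slot")
```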