Abstract:
Embodiments of apparatuses and methods for variable-length instruction steering to instruction decode clusters are disclosed. In an embodiment, an apparatus includes a decode cluster and chunk steering circuitry. The decode cluster includes multiple instruction decoders. The chunk steering circuitry is to break a sequence of instruction bytes into a plurality of chunks, create a slice from one or more of the plurality of chunks based on one or more indications of a number of instructions in each of the one or more of the plurality of chunks, wherein the slice has a variable size and includes a plurality of instructions, and steer the slice to the decode cluster.
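The chunk-to-slice grouping described above can be illustrated with a minimal software sketch. The chunk size, the per-slice instruction budget, and the function names below are assumptions made for illustration only, and the per-chunk instruction counts are taken as given inputs (for example from a predecode stage) rather than derived from the bytes. The sketch also ignores instructions that straddle chunk boundaries.

```python
from typing import List, Tuple

CHUNK_BYTES = 8      # hypothetical chunk size, not taken from the abstract
TARGET_INSTRS = 4    # hypothetical instruction budget per slice


def split_into_chunks(data: bytes, chunk_bytes: int = CHUNK_BYTES) -> List[bytes]:
    """Break a byte sequence into consecutive fixed-size chunks."""
    return [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]


def build_slices(chunks: List[bytes],
                 instr_counts: List[int],
                 target: int = TARGET_INSTRS) -> List[Tuple[bytes, int]]:
    """Group consecutive chunks into variable-size slices.

    A slice is closed once the chunks it contains cover at least
    `target` instructions, so the slice size in bytes varies with
    the instruction density of the chunks.
    """
    slices: List[Tuple[bytes, int]] = []
    current = b""
    count = 0
    for chunk, n in zip(chunks, instr_counts):
        current += chunk
        count += n
        if count >= target:
            slices.append((current, count))
            current, count = b"", 0
    if current:
        # Remaining chunks form a final, possibly smaller, slice.
        slices.append((current, count))
    return slices
```

Each resulting slice would then be handed to a decode cluster; how a cluster is selected is not stated in the abstract and is left out of the sketch.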
Abstract:
A computer-readable storage medium, method and system for optimization-level aware branch prediction are described. A gear level is assigned to a set of application instructions that have been optimized. The gear level is also stored in a register of a branch prediction unit of a processor. Branch prediction is then performed by the processor based upon the gear level.
Abstract:
An apparatus and method for a dual return stack buffer (RSB) for use in binary translation systems. An embodiment of a processor includes: a dual return stack buffer (DRSB) comprising a native RSB and an extended RSB (XRSB), the dual RSB to be used within a binary translation execution environment in which guest call-return instruction sequences are translated to native call-return instruction sequences to be executed directly by the processor; the native RSB to store native return addresses associated with the native call-return instruction sequences; and the XRSB to store emulated return addresses associated with the guest call-return instruction sequences, wherein each native return address stored in the native RSB is associated with an emulated return address stored in the XRSB.
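The pairing of native and emulated return addresses can be sketched as two stacks that are pushed and popped together. The class name, the fixed depth, the overflow policy, and the check of the popped emulated address against the actual guest return target are all assumptions made for illustration; the abstract states only that each native entry is associated with an XRSB entry.

```python
from typing import List, Optional


class DualRSB:
    """Toy model of a native RSB paired entry-for-entry with an XRSB."""

    def __init__(self, depth: int = 16) -> None:
        self.depth = depth           # hypothetical capacity
        self.native: List[int] = []  # native return addresses
        self.xrsb: List[int] = []    # emulated (guest) return addresses

    def push(self, native_ret: int, guest_ret: int) -> None:
        """Record a translated call: push both addresses as one pair."""
        if len(self.native) == self.depth:
            # Assumed overflow policy: drop the oldest pair.
            self.native.pop(0)
            self.xrsb.pop(0)
        self.native.append(native_ret)
        self.xrsb.append(guest_ret)

    def predict_return(self, actual_guest_ret: int) -> Optional[int]:
        """Pop the top pair on a return.

        Returns the native return address if the paired emulated
        address matches the actual guest return target, else None.
        """
        if not self.native:
            return None
        native_ret = self.native.pop()
        guest_ret = self.xrsb.pop()
        return native_ret if guest_ret == actual_guest_ret else None
```

Keeping the two stacks the same length at all times is what preserves the one-to-one association between entries.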
Abstract:
A processor and method are described for alias detection. For example, one embodiment of an apparatus comprises: reordering logic to receive a set of read and write operations in a program order and to responsively reorder the read and write operations; adjustment information attachment logic to associate adjustment information with one or more of the set of read and write operations, wherein for a read operation the adjustment information is to indicate a number of write operations which the read operation has bypassed and for a write operation the adjustment information is to indicate a number of read operations which have bypassed the write operation; and out-of-order processing logic to determine whether execution of the reordered read and write operations will result in a conflict based, at least in part, on the adjustment information associated with the one or more reads and writes.
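The adjustment information can be illustrated in software by comparing a program order with a reordered schedule. The data types, function names, and the exact-address comparison below are assumptions made for illustration, not the claimed hardware logic. In this sketch a read's count is the number of earlier writes it now executes ahead of, a write's count is the number of later reads that now execute ahead of it, and only reads with a nonzero count are checked for an address conflict.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MemOp:
    kind: str  # "R" for read, "W" for write
    addr: int


def adjustment_info(program: List[MemOp], reordered: List[int]) -> Dict[int, int]:
    """Compute per-operation bypass counts.

    `reordered` lists indices into `program` in their new execution order.
    """
    pos = {idx: p for p, idx in enumerate(reordered)}
    adj: Dict[int, int] = {}
    for i, op in enumerate(program):
        if op.kind == "R":
            # Earlier writes that now execute after this read.
            adj[i] = sum(1 for j in range(i)
                         if program[j].kind == "W" and pos[j] > pos[i])
        else:
            # Later reads that now execute before this write.
            adj[i] = sum(1 for j in range(i + 1, len(program))
                         if program[j].kind == "R" and pos[j] < pos[i])
    return adj


def has_conflict(program: List[MemOp], reordered: List[int]) -> bool:
    """Return True if any read bypassed a write to the same address."""
    pos = {idx: p for p, idx in enumerate(reordered)}
    adj = adjustment_info(program, reordered)
    for i, op in enumerate(program):
        if op.kind != "R" or adj[i] == 0:
            continue  # nothing was bypassed, so no check is needed
        for j in range(i):
            w = program[j]
            if w.kind == "W" and pos[j] > pos[i] and w.addr == op.addr:
                return True
    return False
```

The counts let the check skip every operation that was not reordered past an operation of the opposite kind, which is the sense in which the conflict decision is based in part on the adjustment information.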
Abstract:
Embodiments of a method and apparatus for implementing and maintaining a stack of predicate values with stack synchronization instructions are described. In one embodiment the apparatus is an out-of-order hardware/software co-designed processor including instructions to explicitly manage the predicate register stack to maintain stack consistency across branches of execution that push a variable number of predicate values onto the predicate stack. In one embodiment the stack-based predicate register implementation enables early branch calculation and early branch misprediction recovery via early renaming of predicate registers.
Abstract:
A processor includes a binary translator and a decoder. The binary translator includes logic to analyze a stream of atomic instructions, identify words by boundary bits in the atomic instructions, generate a mask to identify the words, and load the mask and the plurality of words into an instruction cache line. The words include atomic instructions. At least one word includes more than one atomic instruction. The decoder includes logic to apply the mask to identify a first word from the instruction cache line and decode the first word based upon the applied mask.
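The mask generation and its use by the decoder can be sketched as follows. The encoding is an assumption made for illustration: each atomic instruction is taken to carry one boundary bit that is set when the instruction is the last one in its word, and the mask holds one bit per instruction slot in the cache line. The function names are hypothetical.

```python
from typing import List, Sequence


def build_mask(boundary_bits: Sequence[int]) -> int:
    """Pack per-instruction boundary bits into an integer mask.

    Bit i of the mask is set when instruction i ends a word.
    """
    mask = 0
    for i, bit in enumerate(boundary_bits):
        if bit:
            mask |= 1 << i
    return mask


def first_word(line: List[str], mask: int) -> List[str]:
    """Apply the mask to extract the first word from a cache line."""
    for i in range(len(line)):
        if mask & (1 << i):
            return line[:i + 1]
    # No boundary found: treat the whole line as a single word.
    return list(line)
```

For example, a line holding `["add", "mul", "ld", "st", "br"]` with boundary bits `[0, 1, 1, 0, 1]` has a first word made of two atomic instructions, `add` and `mul`.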
Abstract:
A combination of hardware and software collects profile data for asynchronous events, at code region granularity. An exemplary embodiment is directed to collecting metrics for prefetching events, which are asynchronous in nature. Instructions that belong to a code region are identified using one of several alternative techniques, causing a profile bit to be set for the instruction, as a marker. Each line of a data block that is prefetched is similarly marked. Events corresponding to the profile data being collected and resulting from instructions within the code region are then identified. Each time that one of the different types of events is identified, a corresponding counter is incremented. Following execution of the instructions within the code region, the profile data accumulated in the counters are collected, and the counters are reset for use with a new code region.
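The count, collect, and reset cycle can be modeled in a few lines. The class name, method names, and event-type strings are hypothetical, and the profile bit is passed in as a plain boolean rather than read from an instruction or cache line; this is a sketch of the bookkeeping only.

```python
from collections import Counter
from typing import Dict


class RegionProfiler:
    """Toy per-region event counters driven by a profile bit."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()

    def record(self, event_type: str, profile_bit: bool) -> None:
        """Count an event only if its source carries the profile bit."""
        if profile_bit:
            self.counters[event_type] += 1

    def collect_and_reset(self) -> Dict[str, int]:
        """Return the accumulated counts and clear them for the next region."""
        snapshot = dict(self.counters)
        self.counters.clear()
        return snapshot
```

Events from unmarked instructions or unmarked prefetched lines are ignored, which is what confines the collected data to the code region of interest.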
Abstract:
An embodiment of an integrated circuit may comprise a front end unit, and circuitry coupled to the front end unit, the circuitry to provide a high confidence, multiple branch offset predictor. For example, the circuitry may be configured to identify an entry in a multiple-taken-branch prediction table that corresponds to a conditional branch instruction, determine if a confidence level of the entry exceeds a threshold confidence level, and, if so determined, provide multiple taken branch predictions that stem from the conditional branch instruction from the entry in the multiple-taken-branch prediction table. Other embodiments are disclosed and claimed.
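The confidence-gated lookup can be sketched as a table keyed by the address of the conditional branch. The entry layout, the field names, and the use of a plain dictionary are assumptions made for illustration; the abstract specifies only the lookup, the threshold comparison, and the delivery of multiple taken-branch predictions from the matching entry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MtbEntry:
    confidence: int            # hypothetical confidence counter
    taken_targets: List[int]   # targets of the successive taken branches


def predict(table: Dict[int, MtbEntry],
            branch_pc: int,
            threshold: int) -> Optional[List[int]]:
    """Return multiple taken-branch targets if the entry is confident.

    Returns None when there is no entry for `branch_pc` or when the
    entry's confidence does not exceed `threshold`.
    """
    entry = table.get(branch_pc)
    if entry is None or entry.confidence <= threshold:
        return None
    return list(entry.taken_targets)
```

When `predict` returns None, a front end would presumably fall back to its ordinary single-branch predictor; that fallback is outside what the abstract describes.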
Abstract:
Various different embodiments of the invention are described including: (1) a method and apparatus for intelligently allocating threads within a binary translation system; (2) data cache way prediction guided by binary translation code morphing software; (3) fast interpreter hardware support on the data-side; (4) out-of-order retirement; (5) decoupled load retirement in an atomic OOO processor; (6) handling transactional and atomic memory in an out-of-order binary translation based processor; and (7) speculative memory management in a binary translation based out of order processor.