Abstract:
Fast unaligned memory access. In accordance with a first embodiment of the present invention, a computing device includes a load queue memory structure configured to queue load operations and a store queue memory structure configured to queue store operations. The computing device also includes at least one bit configured to indicate the presence of an unaligned address component for an entry of said load queue memory structure, and at least one bit configured to indicate the presence of an unaligned address component for an entry of said store queue memory structure. The load queue memory structure may also include memory configured to indicate data forwarding of an unaligned address component from said store queue memory structure to said load queue memory structure.
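By way of illustration, the C++ sketch below models a load queue entry carrying such an unaligned bit; here an access is flagged as unaligned when it straddles a cache-line boundary and thus decomposes into two address components. The struct fields, the 64-byte line size, and the forwarding flag are assumptions for illustration, not the claimed hardware.

    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kLineSize = 64;  // assumed cache-line size

    struct LoadQueueEntry {
        uint64_t addr;
        uint32_t size;          // access width in bytes
        bool     unaligned;     // set when the access spans two lines
        bool     fwd_unaligned; // set when the second address component is
                                // forwarded from the store queue structure
    };

    // An access is "unaligned" in this sketch if it straddles a line
    // boundary, i.e. it has two address components.
    bool spansLine(uint64_t addr, uint32_t size) {
        return (addr / kLineSize) != ((addr + size - 1) / kLineSize);
    }

    int main() {
        LoadQueueEntry e{0x103E, 8, false, false};
        e.unaligned = spansLine(e.addr, e.size);
        printf("unaligned=%d\n", e.unaligned);  // 1: 8 bytes at 0x103E cross 0x1040
    }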
Abstract:
A method for executing instructions using a plurality of virtual cores for a processor. The method includes receiving an incoming instruction sequence using a global front end scheduler, and partitioning the incoming instruction sequence into a plurality of code blocks of instructions. The method further includes generating a plurality of inheritance vectors describing interdependencies between instructions of the code blocks, and allocating the code blocks to a plurality of virtual cores of the processor, wherein each virtual core comprises a respective subset of resources of a plurality of partitionable engines. The code blocks are executed by using the partitionable engines in accordance with a virtual core mode and in accordance with the respective inheritance vectors.
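The following sketch shows one plausible form of inheritance-vector generation: for each code block, record, per architectural register, which earlier block last wrote it, so the block's interdependencies travel with it to its virtual core. The register count, types, and block contents are illustrative assumptions.

    #include <array>
    #include <cstdio>
    #include <vector>

    constexpr int kNumRegs = 8;  // assumed small architectural register file

    struct Inst { int dest; };   // only the written register matters here
    using InheritanceVector = std::array<int, kNumRegs>;  // reg -> producing block (-1 = none)

    int main() {
        // Three code blocks; each instruction writes one register.
        std::vector<std::vector<Inst>> blocks = {{{0}, {1}}, {{1}, {2}}, {{0}}};
        InheritanceVector lastWriter;
        lastWriter.fill(-1);

        for (size_t b = 0; b < blocks.size(); ++b) {
            // The vector handed to block b records, per register, which
            // earlier block produces the value block b may consume.
            InheritanceVector iv = lastWriter;
            printf("block %zu inherits: ", b);
            for (int r = 0; r < kNumRegs; ++r) printf("%d ", iv[r]);
            printf("\n");
            for (const Inst& i : blocks[b]) lastWriter[i.dest] = (int)b;
        }
    }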
Abstract:
A method of identifying instructions including accessing a plurality of instructions that comprise multiple branch instructions. For each branch instruction of the multiple branch instructions, a respective first mask is generated representing instructions that are executed if a branch is taken. A respective second mask is generated representing instructions that are executed if the branch is not taken. A prediction output is received that comprises a respective branch prediction for each branch instruction. For each branch instruction, the prediction output is used to select a respective resultant mask from among the respective first and second masks. For each branch instruction, a resultant mask of a subsequent branch is invalidated if a previous branch is predicted to branch over said subsequent branch. A logical operation is performed on all resultant masks to produce a final mask. The final mask is used to select a subset of instructions for execution.
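A minimal sketch of the mask-selection step, assuming an 8-instruction fetch group, one bit per instruction slot, and a bitwise AND as the final logical operation (the abstract does not fix which operation is used); the masks and targets are illustrative.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Branch {
        int      pos, target;     // branch slot and taken-target slot
        uint8_t  takenMask;       // slots executed if the branch is taken
        uint8_t  notTakenMask;    // slots executed if it is not taken
        bool     predictedTaken;  // branch-prediction output
    };

    int main() {
        std::vector<Branch> br = {
            {2, 6, 0xC7, 0xFF, true},   // taken: run slots 0-2, then 6-7
            {5, 7, 0xBF, 0xFF, false},  // sits inside the first branch's shadow
        };

        uint8_t finalMask = 0xFF;
        int skipBeyond = -1;  // slots below this index were branched over
        for (const Branch& b : br) {
            if (b.pos < skipBeyond) continue;  // invalidated: a prior branch jumped over it
            uint8_t resultant = b.predictedTaken ? b.takenMask : b.notTakenMask;
            finalMask &= resultant;            // assumed final logical operation: AND
            if (b.predictedTaken) skipBeyond = b.target;
        }
        printf("final mask = 0x%02X\n", (unsigned)finalMask);  // 0xC7: slots 0,1,2,6,7 execute
    }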
Abstract:
A method for outputting alternative instruction sequences. The method includes tracking repetitive hits to determine a set of frequently hit instruction sequences for a microprocessor. A frequently mispredicted branch instruction is identified, wherein the predicted outcome of the branch instruction is frequently wrong. An alternative instruction sequence for the branch instruction target is stored into a buffer. On a subsequent hit to the branch instruction where the predicted outcome of the branch instruction was wrong, the alternative instruction sequence is output from the buffer.
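A toy model of that buffer, assuming a per-branch-PC hit/misprediction counter and an arbitrary training threshold; the actual trigger conditions and sequence storage are not specified at this level.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    struct Entry {
        uint32_t hits = 0, mispredicts = 0;
        std::string altSequence;  // decoded alternative-path instructions
    };

    int main() {
        std::unordered_map<uint64_t, Entry> buffer;  // keyed by branch PC
        const uint64_t pc = 0x400123;
        // Training: repetitive hits with wrong predictions mark the branch as hot.
        for (int i = 0; i < 10; ++i) {
            Entry& e = buffer[pc];
            ++e.hits;
            ++e.mispredicts;
            if (e.mispredicts > 4 && e.altSequence.empty())
                e.altSequence = "<instructions at the not-predicted target>";
        }
        // Subsequent hit with a wrong prediction: emit the buffered sequence
        // instead of re-fetching down the pipeline.
        Entry& e = buffer[pc];
        if (!e.altSequence.empty())
            printf("outputting alternative sequence: %s\n", e.altSequence.c_str());
    }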
Abstract:
A method for translating instructions for a processor. The method includes accessing a plurality of guest instructions that comprise multiple guest branch instructions comprising at least one guest far branch, and building an instruction sequence from the plurality of guest instructions by using branch prediction on the at least one guest far branch. The method further includes assembling a guest instruction block from the instruction sequence. The guest instruction block is translated to a corresponding native conversion block, wherein the native conversion block includes at least one native far branch that corresponds to the at least one guest far branch, and wherein the at least one native far branch includes an opposite guest address for an opposing branch path of the at least one guest far branch. Upon encountering a misprediction, a correct instruction sequence is obtained by accessing the opposite guest address.
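A minimal sketch of the recovery step, assuming a native far branch that carries the guest address of the path the predictor did not follow, and a guest-to-native lookup table built at translation time; all names and addresses are assumptions.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    struct NativeFarBranch {
        uint64_t nativeTarget;   // predicted path, already translated
        uint64_t oppositeGuest;  // guest address of the opposing branch path
    };

    int main() {
        // Guest->native conversion table built when the block was translated.
        std::unordered_map<uint64_t, uint64_t> guestToNative = {{0x2000, 0x9000}};

        NativeFarBranch b{0x8000, 0x2000};
        bool mispredicted = true;
        if (mispredicted) {
            // Recover the correct sequence via the opposite guest address,
            // translating it on demand if no native block exists yet.
            uint64_t resume = guestToNative.count(b.oppositeGuest)
                                ? guestToNative[b.oppositeGuest]
                                : /* trigger translation of */ b.oppositeGuest;
            printf("resuming at 0x%llx\n", (unsigned long long)resume);
        }
    }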
Abstract:
A global interconnect system. The global interconnect system includes a plurality of resources having data for supporting the execution of multiple code sequences and a plurality of engines for implementing the execution of the multiple code sequences. A plurality of resource consumers are within each of the plurality of engines. A global interconnect structure is coupled to the plurality of resource consumers and coupled to the plurality of resources to enable data access and execution of the multiple code sequences, wherein the resource consumers access the resources through a per cycle utilization of the global interconnect structure.
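The per-cycle utilization can be pictured with the toy model below, in which consumers place requests on a shared interconnect that grants a fixed number of transfers per cycle; the port count and request pattern are assumptions for illustration.

    #include <cstdio>
    #include <vector>

    constexpr int kPorts = 2;  // assumed interconnect transfers per cycle

    int main() {
        std::vector<int> requests = {0, 2, 1, 3};  // consumer -> requested resource id
        int granted = 0;
        for (size_t c = 0; c < requests.size(); ++c) {
            if (granted < kPorts) {
                // Uses one interconnect port in the current cycle.
                printf("cycle 0: consumer %zu reaches resource %d\n", c, requests[c]);
                ++granted;
            } else {
                printf("cycle 0: consumer %zu retries next cycle\n", c);
            }
        }
    }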
Abstract:
Systems and methods for load canceling in a processor that is connected to an external interconnect fabric are disclosed. As part of a method for load canceling in a processor that is connected to an external bus, and responsive to a flush request and a corresponding cancellation of pending speculative loads from a load queue, the type of one or more pending speculative loads that are positioned in the instruction pipeline external to the processor is converted from load to prefetch. Data corresponding to one or more of the pending speculative loads that are positioned in the instruction pipeline external to the processor is accessed and returned to cache as prefetch data. The prefetch data is retired in a cache location of the processor.
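The key idea is that speculative loads already out on the external fabric cannot be recalled, so they are retyped as prefetches and their returning data fills the cache rather than a register. A minimal sketch under assumed names:

    #include <cstdio>
    #include <vector>

    enum class Type { Load, Prefetch };

    struct Request { int id; Type type; bool externalToProcessor; };

    int main() {
        std::vector<Request> pending = {{0, Type::Load, false}, {1, Type::Load, true}};
        // Flush: cancel in-core speculative loads; convert external ones.
        for (auto it = pending.begin(); it != pending.end();) {
            if (!it->externalToProcessor) { it = pending.erase(it); continue; }  // drop from load queue
            it->type = Type::Prefetch;  // returned data will retire into the cache only
            ++it;
        }
        for (const Request& r : pending)
            printf("request %d now %s\n", r.id, r.type == Type::Prefetch ? "prefetch" : "load");
    }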
Abstract:
A method for weak stream software data and instruction prefetching using a hardware data prefetcher is disclosed. The method includes determining, using a hardware data prefetcher, whether software includes software prefetch instructions, and accessing the software prefetch instructions if it does. Using the hardware data prefetcher, weak stream software data and instruction prefetching operations are executed based on the software prefetch instructions, free of training operations.
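The sketch below contrasts this with a conventionally trained stream: a "weak" stream is seeded directly from a software prefetch hint and issues immediately, skipping the usual confirmation/training phase. Field names, stride, and burst length are assumptions.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Stream { uint64_t addr; int64_t stride; bool weak; };

    int main() {
        std::vector<Stream> streams;
        // A software prefetch instruction is seen in the code: seed a weak
        // stream with its address and stride, free of training operations.
        streams.push_back({0x10000, 64, true});

        for (Stream& s : streams) {
            if (s.weak) {
                // Issue a short burst of prefetches straight away.
                for (int i = 0; i < 4; ++i)
                    printf("prefetch 0x%llx\n", (unsigned long long)(s.addr + i * s.stride));
            }
        }
    }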
Abstract:
A method for decentralized resource allocation in an integrated circuit. The method includes receiving a plurality of requests from a plurality of resource consumers of a plurality of partitionable engines to access a plurality of resources, wherein the resources are spread across the plurality of engines and are accessed via a global interconnect structure. At each resource, the number of requests for access to said resource is added. At said each resource, the number of requests is compared against a threshold limiter. At said each resource, a subsequent request that is received that exceeds the threshold limiter is canceled. Subsequently, requests that are not canceled within a current clock cycle are implemented.
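A minimal sketch of the threshold-limiter step: each resource sums the requests it receives in a cycle and cancels any request beyond its limit, with surviving requests implemented. The threshold value and request pattern are illustrative assumptions.

    #include <cstdio>
    #include <vector>

    constexpr int kThreshold = 2;  // assumed ports at each resource per cycle

    int main() {
        std::vector<int> requests = {0, 1, 0, 0, 1};  // consumer -> resource id
        std::vector<int> count(2, 0);                 // per-resource request adder

        for (size_t c = 0; c < requests.size(); ++c) {
            int r = requests[c];
            // Each resource compares its running request count against the
            // threshold limiter and cancels requests that exceed it.
            if (++count[r] > kThreshold)
                printf("consumer %zu -> resource %d: canceled (over limit)\n", c, r);
            else
                printf("consumer %zu -> resource %d: granted\n", c, r);
        }
    }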