Abstract:
In an embodiment, a processor includes a plurality of cores to independently execute instructions, a shared cache coupled to the cores and including a plurality of lines to store data, and a power controller including low power control logic to calculate a latency to flush the shared cache based on a state of the plurality of lines. Other embodiments are described and claimed.
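A minimal sketch, with all names and cost constants invented rather than taken from the patent, of how a flush latency might be estimated from per-line state: modified lines need a writeback, clean valid lines only an invalidate.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-line coherence states (not the patent's encoding). */
typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_MODIFIED } line_state_t;

/* Estimate cycles needed to flush the shared cache: modified lines must be
 * written back to memory, clean valid lines only need to be invalidated.
 * The cost parameters are illustrative placeholders. */
uint64_t estimate_flush_latency(const line_state_t *lines, size_t nlines,
                                uint64_t writeback_cycles,
                                uint64_t invalidate_cycles)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nlines; i++) {
        if (lines[i] == LINE_MODIFIED)
            total += writeback_cycles;
        else if (lines[i] != LINE_INVALID)
            total += invalidate_cycles;
    }
    return total;
}
```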
Abstract:
An efficient method for software pipelining (SWP) of loops when translating programs from higher-level languages into equivalent object or machine language code for execution on a computer. In one example embodiment, this is accomplished by spilling and filling multiple computed values in a register that are live across multiple stages of a software-pipelined loop, using multiple rotating stack memory locations to reduce the compile time of SWP and the complexity of the implemented SWP.
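A hedged C-level illustration of the rotating spill-slot idea: a value computed in an early stage of the pipelined loop is spilled to one of several memory slots indexed modulo the number of in-flight stages, and filled back by a later stage a fixed number of iterations later. The stage count, loop, and slot layout are assumptions for illustration, not the patent's implementation.

```c
#include <stdio.h>

#define STAGES 3                /* assumed number of overlapped pipeline stages */

int main(void)
{
    double spill[STAGES];       /* rotating spill slots, one per in-flight value */
    double out[16] = {0};

    for (int i = 0; i < 16 + (STAGES - 1); i++) {
        /* Stage 0: compute a value and "spill" it to the rotating slot. */
        if (i < 16)
            spill[i % STAGES] = i * 0.5;

        /* Final stage: "fill" the value spilled STAGES-1 iterations earlier. */
        if (i >= STAGES - 1)
            out[i - (STAGES - 1)] = spill[(i - (STAGES - 1)) % STAGES] + 1.0;
    }
    printf("%f %f\n", out[0], out[15]);
    return 0;
}
```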
Abstract:
A computing platform may include components to determine performance loss values and energy savings values, and/or a memory boundedness value, for each of a plurality of regions within an application. The computing platform may provide a user interface through which a user can supply an input indicating an acceptable performance loss. For the provided performance loss value, frequency values may be determined, and the processing element may be operated at those frequency values while processing each of the plurality of regions.
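A small hypothetical sketch of the selection step the abstract describes: for each profiled region, pick the lowest frequency whose predicted performance loss stays within the user-supplied bound. The region data, loss model, and frequency table are invented for illustration.

```c
#include <stdio.h>

/* Hypothetical per-region profile: predicted slowdown (fraction) at each
 * available frequency step, lower index = lower frequency. */
typedef struct {
    const char *name;
    double loss_at_freq[4];     /* loss at 1.0, 1.5, 2.0, 2.5 GHz (assumed) */
} region_t;

static const double freqs_ghz[4] = { 1.0, 1.5, 2.0, 2.5 };

/* Pick the lowest frequency whose loss does not exceed the user's bound. */
double pick_frequency(const region_t *r, double max_loss)
{
    for (int f = 0; f < 4; f++)
        if (r->loss_at_freq[f] <= max_loss)
            return freqs_ghz[f];
    return freqs_ghz[3];        /* fall back to the highest frequency */
}

int main(void)
{
    /* Memory-bound regions lose little at low frequency; compute-bound lose a lot. */
    region_t regions[2] = {
        { "memory_bound_loop",  { 0.03, 0.02, 0.01, 0.00 } },
        { "compute_bound_loop", { 0.40, 0.20, 0.08, 0.00 } },
    };
    double user_max_loss = 0.05;    /* user accepts up to 5% performance loss */

    for (int i = 0; i < 2; i++)
        printf("%s -> %.1f GHz\n", regions[i].name,
               pick_frequency(&regions[i], user_max_loss));
    return 0;
}
```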
Abstract:
A method of parallel execution of a first and a second instruction in an in-order processor. Embodiments of the invention enable parallel execution of memory instructions that are stalled by cache memory misses. The in-order processor handles cache memory misses in parallel by overlapping the first cache memory miss with cache memory misses that occur after it. Memory-level parallelism in the in-order processor increases as more parallel, outstanding cache memory misses are generated.
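The effect is easiest to see from the software side: independent loads allow several cache misses to be outstanding at once, while a pointer chase serializes them. A minimal illustrative C fragment (the data structures are assumptions, not from the patent):

```c
#include <stddef.h>

/* Independent loads: the a[idx[i]] misses can overlap, since no load depends
 * on the result of the previous one, so the core can keep issuing past a miss. */
long sum_independent(const long *a, const int *idx, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[idx[i]];
    return s;
}

/* Dependent loads: each load's address comes from the previous load, so the
 * misses serialize and no memory-level parallelism is exposed. */
struct node { struct node *next; long val; };

long sum_dependent(const struct node *p)
{
    long s = 0;
    for (; p != NULL; p = p->next)
        s += p->val;
    return s;
}
```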
Abstract:
A method and system for optimizing the execution of a software loop is provided. The method involves determining an edge in a critical recurrence cycle in the software loop. The edge is a dependency link between two instructions and connects a dependee and a dependent. The dependee is an instruction that produces a result, and the dependent is an instruction that uses the result. The method further involves performing predicate promotion of at least one of the dependee and the dependent if one or more pre-determined conditions are met.
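As a loose source-level analogue (the patent operates on compiler IR with predicate registers; the C loop and names below are invented for illustration), promoting a guarded update so it no longer waits on the guard can remove the compare-to-use edge from the recurrence cycle:

```c
/* Before: the compare that produces the guard depends on x from the previous
 * iteration, so the cycle x -> compare -> guarded add -> x bounds the
 * recurrence initiation interval. */
long clamp_accumulate_before(const long *step, long n, long limit)
{
    long x = 0;
    for (long i = 0; i < n; i++) {
        if (x < limit)
            x = x + step[i];
    }
    return x;
}

/* After promotion (conceptually): the add runs unconditionally into a
 * temporary, removing the compare-to-add dependence edge; the compare now
 * feeds only a select. */
long clamp_accumulate_after(const long *step, long n, long limit)
{
    long x = 0;
    for (long i = 0; i < n; i++) {
        long t = x + step[i];        /* speculatively executed every iteration */
        x = (x < limit) ? t : x;     /* guard applied via a select instead     */
    }
    return x;
}
```

Whether the promoted form actually shortens the cycle depends on the machine's compare-to-use and select latencies, which is presumably the kind of trade-off the pre-determined conditions are meant to capture.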
Abstract:
Code restructuring or reordering based on profiling information and memory hierarchy is provided by constructing a Program Execution Graph (PEG) corresponding to a level of the memory hierarchy, partitioning this PEG to reduce estimated memory overhead costs below an upper bound, and constructing a PEG for the next level of the memory hierarchy from the partitioned PEG. The PEG is constructed from control flow and frequency information from a profile of the program to be restructured. The PEG is a weighted undirected graph comprising nodes representing basic blocks and edges representing transfer of control between pairs of basic blocks. The weight of a node is the size of the basic block it represents, and the weight of an edge is the frequency of transition between the pair of basic blocks it connects.
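A compact hypothetical sketch of the structure the abstract describes: a weighted undirected graph whose nodes carry basic-block sizes and whose edges carry profiled transition frequencies. All names and types are illustrative, not the patent's.

```c
#include <stddef.h>

/* Node of a Program Execution Graph (PEG): one basic block, weighted by its
 * size in bytes. */
typedef struct {
    int    block_id;
    size_t size_bytes;          /* node weight = basic block size */
} peg_node;

/* Undirected edge, weighted by the profiled frequency of transitions between
 * the two basic blocks it connects (in either direction). */
typedef struct {
    int           a, b;         /* endpoints (block ids) */
    unsigned long freq;         /* edge weight = transition frequency */
} peg_edge;

typedef struct {
    peg_node *nodes;  size_t nnodes;
    peg_edge *edges;  size_t nedges;
} peg_graph;

/* Record one profiled transition between blocks x and y: bump the existing
 * undirected edge if present, otherwise append a new one. Assumes the caller
 * preallocated room in g->edges. */
void peg_add_transition(peg_graph *g, int x, int y)
{
    for (size_t i = 0; i < g->nedges; i++) {
        peg_edge *e = &g->edges[i];
        if ((e->a == x && e->b == y) || (e->a == y && e->b == x)) {
            e->freq++;
            return;
        }
    }
    g->edges[g->nedges++] = (peg_edge){ .a = x, .b = y, .freq = 1 };
}
```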
Abstract:
The invention is directed to the transformation of software loops having early exit conditions, thereby allowing the loops to be more effectively converted to a single basic block for software pipelining. The invention assigns a predicate register to each early exit condition of the software loop. A predicate register is set when its corresponding early exit condition is satisfied. In this manner, when the loop terminates, the predicate registers can be examined to determine which early exit conditions were satisfied. The invention produces loops having a lower recurrence II (initiation interval) and resource II than conventional techniques.
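A hedged source-level analogue of the transformation: each early-exit condition gets its own flag (standing in for a predicate register) that is set when the condition fires, the loop body stays a single block, and the flags are examined after the loop to see which exit was taken. The loop and data are invented for illustration.

```c
#include <stdio.h>

int main(void)
{
    int a[8] = { 3, 5, -2, 7, 0, 4, 9, 1 };
    int hit_negative = 0, hit_zero = 0;   /* stand-ins for predicate registers */
    long sum = 0;
    int i;

    /* Single-block body: the early-exit conditions only set flags, and the
     * flags both guard the work and terminate the loop. */
    for (i = 0; i < 8 && !hit_negative && !hit_zero; i++) {
        hit_negative |= (a[i] < 0);
        hit_zero     |= (a[i] == 0);
        if (!hit_negative && !hit_zero)   /* a predicated add in pipelined form */
            sum += a[i];
    }

    /* After the loop, the flags tell us which early exit (if any) fired. */
    if (hit_negative)      printf("stopped on negative at i=%d\n", i - 1);
    else if (hit_zero)     printf("stopped on zero at i=%d\n", i - 1);
    else                   printf("ran to completion, sum=%ld\n", sum);
    return 0;
}
```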
Abstract:
Code restructuring or reordering based on profiling information and memory hierarchy is provided by constructing a Program Execution Graph (PEG) corresponding to a level of the memory hierarchy, partitioning this PEG to reduce estimated memory overhead costs below an upper bound, and constructing a PEG for the next level of the memory hierarchy from the partitioned PEG. The PEG is constructed from control flow and frequency information from a profile of the program to be restructured. The PEG is a weighted undirected graph comprising nodes representing basic blocks and edges representing transfers of control between pairs of basic blocks. The weight of a node is the size of the basic block it represents, and the weight of an edge is the frequency of transitions between the pair of basic blocks it connects. The nodes of the PEG are partitioned, or clustered, into clusters such that the sum of the weights of the nodes in any cluster is no greater than an upper bound. A next PEG is then constructed from the clusters of the partitioned PEG such that a node in the next PEG corresponds to a cluster in the partitioned PEG, and such that there is an edge between two nodes in the next PEG if there is an edge between the clusters represented by the two nodes. Weights are assigned to the nodes and edges of the next PEG to produce a PEG for that level, and the PEG partitioning, basic block reordering, and PEG construction steps may then be repeated for each level of the memory hierarchy. After the clustering is completed, the basic blocks are reordered in memory by grouping all of the nodes of a cluster in adjacent order, beginning at a boundary, for all the levels of the memory hierarchy. Because clusters must not cross the boundaries of memory hierarchy levels, NOPs are added to fill out the portion of a memory hierarchy level that is not filled by the clusters.
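Continuing the hypothetical peg_graph sketch above, one simple stand-in for the partitioning step is to greedily merge the clusters joined by the most frequent edges while the merged size still fits the level's bound (for example, a cache line or page size). This heuristic, the MAX_NODES limit, and the union-find bookkeeping are illustrative assumptions, not the patent's algorithm.

```c
#include <stddef.h>
/* Uses the peg_graph / peg_node / peg_edge types from the sketch above. */

#define MAX_NODES 64              /* assumed upper bound on basic blocks */

int    cluster_of[MAX_NODES];     /* union-find parent, one entry per block */
size_t cluster_size[MAX_NODES];   /* current byte size of each cluster      */

int find_root(int x)
{
    while (cluster_of[x] != x)
        x = cluster_of[x] = cluster_of[cluster_of[x]];   /* path halving */
    return x;
}

/* Greedy clustering: walk the edges from most to least frequent and merge
 * the two clusters an edge connects whenever the merged size still fits the
 * bound for this memory-hierarchy level. Assumes block ids are 0..nnodes-1
 * and that g->edges is already sorted by decreasing freq. */
void cluster_peg(const peg_graph *g, size_t size_bound)
{
    for (size_t i = 0; i < g->nnodes; i++) {
        cluster_of[i]   = (int)i;
        cluster_size[i] = g->nodes[i].size_bytes;
    }
    for (size_t e = 0; e < g->nedges; e++) {
        int ra = find_root(g->edges[e].a);
        int rb = find_root(g->edges[e].b);
        if (ra != rb && cluster_size[ra] + cluster_size[rb] <= size_bound) {
            cluster_of[rb]    = ra;                 /* merge rb into ra */
            cluster_size[ra] += cluster_size[rb];
        }
    }
}
```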
Abstract:
The present invention provides a mechanism that facilitates speculative execution of instructions within software-pipelined loops. In accordance with one embodiment of the invention, a software-pipelined loop is initialized with a speculative instruction deactivated. At least one initiation interval of the software-pipelined loop is executed, and the speculative instruction is activated. Subsequent initiation intervals of the software-pipelined loop are then executed.
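A loose C-level analogue (the patent targets speculative machine instructions in a pipelined loop kernel; the flag, loop, and data here are invented): the consumer of the speculative result is kept deactivated for the first initiation interval, then activated for the remaining ones.

```c
#include <stdio.h>

int main(void)
{
    int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int prefetched = 0;     /* value produced by the "speculative" stage        */
    int spec_on = 0;        /* stands in for the predicate gating that stage's
                               consumer; starts deactivated                     */
    long sum = 0;

    for (int i = 0; i < 8; i++) {
        /* Consumer stage: only uses the speculative result once it is valid,
         * i.e. after the first initiation interval has filled the pipeline. */
        if (spec_on)
            sum += prefetched;

        /* Speculative stage: produces the value one iteration ahead of its use. */
        prefetched = a[i];
        spec_on = 1;        /* activate after the first interval */
    }
    sum += prefetched;      /* drain the last in-flight value */
    printf("sum=%ld\n", sum);
    return 0;
}
```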