Abstract:
A method and apparatus is disclosed for scheduling instructions to provide adequate prefetch latency during compilation of program code into a program. The prefetch scheduler component of the present invention selects a memory operation within the program code as a “martyr load” and removes the prefetch associated with the martyr load, if any. The prefetch scheduler takes advantage of the latency associated with the martyr load to schedule prefetches for memory operations which follow the martyr load. The prefetches are scheduled “behind” (i.e., prior to) the martyr load to allow the prefetches to complete before the associated memory operations are carried out. The prefetch scheduler component continues this process throughout the program code to optimize prefetch scheduling and overall program operation.
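The scheduling idea can be illustrated with a minimal sketch, assuming a toy instruction list with invented fields (op, addr) rather than the patent's actual compiler IR:

    # Illustrative sketch, not the patented implementation: pick a martyr
    # load, drop its own prefetch, and hoist prefetches for later loads
    # to just before ("behind") the martyr.
    def schedule_with_martyr(instrs):
        martyr_idx = next((i for i, ins in enumerate(instrs)
                           if ins["op"] == "load"), None)
        if martyr_idx is None:
            return instrs
        martyr = instrs[martyr_idx]
        # The martyr's latency is paid regardless, so its own prefetch
        # is wasted and is removed.
        body = [ins for ins in instrs
                if not (ins["op"] == "prefetch"
                        and ins["addr"] == martyr["addr"])]
        idx = body.index(martyr)
        # Prefetches for loads that follow the martyr overlap its stall.
        hoisted = [{"op": "prefetch", "addr": ins["addr"]}
                   for ins in body[idx + 1:] if ins["op"] == "load"]
        return body[:idx] + hoisted + body[idx:]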
Abstract:
One embodiment of the present invention provides a system for compiling source code into executable code that performs prefetching for memory operations within critical sections of code that are subject to mutual exclusion. The system operates by compiling a source code module containing programming language instructions into an executable code module containing instructions suitable for execution by a processor. Next, the system identifies a critical section within the executable code module by identifying a region of code between a mutual exclusion lock operation and a mutual exclusion unlock operation. The system schedules explicit prefetch instructions into the critical section in advance of associated memory operations. In one embodiment, the system identifies the critical section of code by using a first macro to perform the mutual exclusion lock operation, wherein the first macro additionally activates prefetching. The system also uses a second macro to perform the mutual exclusion unlock operation, wherein the second macro additionally deactivates prefetching.
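A rough sketch of the compiler-side pass, assuming the lock/unlock macros lower to recognizable lock/unlock operations in a toy instruction stream (the opcode and field names here are assumptions):

    # Illustrative sketch: emit explicit prefetches only between a lock
    # and its matching unlock, hoisted to just after the lock so they
    # complete well in advance of the loads that use them.
    def prefetch_critical_sections(instrs):
        out, i = [], 0
        while i < len(instrs):
            ins = instrs[i]
            out.append(ins)
            if ins["op"] == "lock":       # first macro: also activates prefetching
                j = i + 1
                section = []
                while j < len(instrs) and instrs[j]["op"] != "unlock":
                    section.append(instrs[j])
                    j += 1
                for s in section:
                    if s["op"] == "load":
                        out.append({"op": "prefetch", "addr": s["addr"]})
                out.extend(section)
                if j < len(instrs):
                    out.append(instrs[j])  # second macro: unlock, deactivates
                i = j
            i += 1
        return out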
Abstract:
The present invention discloses a method and device for ordering memory operation instructions in an optimizing compiler for a processor that can potentially enter a stall state if a memory queue is full. The method uses a dependency graph coupled with one or more memory queues. The dependency graph shows the dependency relationships between instructions in a program being compiled. After creating the dependency graph, the ready nodes are identified. Dependency graph nodes that correspond to memory operations may have the effect of adding an element to the memory queue or removing one or more elements from it. The ideal situation is to keep the memory queue as full as possible without exceeding the maximum desirable number of elements, by scheduling memory operations to maximize their parallelism while avoiding stalls on the target processor.
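As a rough sketch (the queue depth, node format, and drain model below are assumptions, not details from the patent), a list scheduler can track a model of the memory queue and prefer memory operations while there is room:

    # Illustrative sketch: schedule ready nodes from a dependency graph
    # while modeling the memory queue, keeping it near-full but never
    # issuing a memory op into a full queue (which would stall).
    MAX_QUEUE = 4
    def schedule(kinds, deps):
        """kinds: {node: 'mem' or 'alu'}; deps: {node: set of predecessors}."""
        done, queue, order = set(), 0, []
        pending = dict(deps)
        while pending:
            ready = [n for n in pending if pending[n] <= done]
            pick = next((n for n in ready
                         if kinds[n] == "mem" and queue < MAX_QUEUE), None)
            if pick is None:
                # No safe memory op: issue a non-memory op, which gives
                # the queue time to drain in this crude model.
                pick = next((n for n in ready if kinds[n] != "mem"), ready[0])
            if kinds[pick] == "mem":
                queue = min(queue + 1, MAX_QUEUE)
            else:
                queue = max(queue - 1, 0)
            order.append(pick)
            done.add(pick)
            del pending[pick]
        return order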
Abstract:
The present invention discloses a method and device for placing prefetch instructions in a low-level or assembly code instruction stream. It involves the use of a new concept called a martyr memory operation. When inserting prefetch instructions in a code stream, some instructions will still miss the cache because in some circumstances a prefetch cannot be added at all, or cannot be added early enough to allow the needed reference to be in cache before being referenced by an executing instruction. A subset of these instructions is identified using a new method and designated as martyr memory operations. Once identified, other memory operations that would also have been cache misses can “hide” behind the martyr memory operation and complete their prefetches while the processor, of necessity, waits for the martyr memory operation to complete. This increases the number of cache hits.
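A minimal sketch of the martyr-selection idea, assuming each load is summarized by the earliest slot at which its prefetch could legally be placed and the slot at which it is used (both fields are invented here):

    # Illustrative sketch: loads whose prefetch cannot be hoisted far
    # enough ahead will miss anyway; designate them martyrs, and let the
    # remaining misses hide their prefetches behind a martyr's stall.
    MIN_DIST = 10   # assumed prefetch latency, in instruction slots
    def classify(loads):
        """loads: iterable of (name, earliest_prefetch_slot, use_slot)."""
        martyrs, hidden = [], []
        for name, earliest, use in loads:
            if use - earliest < MIN_DIST:
                martyrs.append(name)   # cannot cover the latency
            else:
                hidden.append(name)    # prefetch fits behind a martyr
        return martyrs, hidden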
Abstract:
A heuristic algorithm is disclosed that identifies loads guaranteed to hit the processor cache and provides a “minimal” set of prefetches to be scheduled/inserted during compilation of a program. The heuristic algorithm of the present invention utilizes the concept of a “cache line” (i.e., the data chunks received during memory operations) in conjunction with the concept of “related” memory operations to determine which prefetches are unnecessary for related memory operations, thus generating a minimal number of prefetches for related memory operations.
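The cache-line grouping can be sketched as follows, assuming a 64-byte line and plain integer addresses (both assumptions made for illustration):

    # Illustrative sketch: emit one prefetch per cache line touched;
    # "related" loads that fall on an already-prefetched line are
    # guaranteed hits and need no prefetch of their own.
    LINE = 64
    def minimal_prefetches(addresses):
        seen, out = set(), []
        for a in addresses:
            line = a // LINE
            if line not in seen:       # first touch of this line
                seen.add(line)
                out.append(line * LINE)
        return out
    print(minimal_prefetches([0, 8, 16, 64, 72, 200]))   # [0, 64, 192]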
Abstract:
One embodiment of the present invention provides a system for compiling source code into executable code that performs prefetching for memory operations within regions of code that tend to generate cache misses. The system operates by compiling a source code module containing programming language instructions into an executable code module containing instructions suitable for execution by a processor. Next, the system runs the executable code module in a training mode on a representative workload and keeps statistics on cache miss rates for functions within the executable code module. These statistics are used to identify a set of “hot” functions that generate a large number of cache misses. Next, explicit prefetch instructions are scheduled in advance of memory operations within the set of hot functions. In one embodiment, explicit prefetch operations are scheduled into the executable code module by activating prefetch generation at the start of an identified function, and by deactivating prefetch generation at a return from the identified function. In one embodiment, the system further schedules prefetch operations for the memory operations by identifying a subset of memory operations of a particular type within the set of hot functions, and scheduling explicit prefetch operations for memory operations belonging to the subset.
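A sketch of the hot-function selection step, with an assumed statistics format and an assumed 90% coverage cutoff:

    # Illustrative sketch: from training-mode miss counts, select the
    # smallest set of functions covering most cache misses; prefetch
    # generation is then activated only inside these "hot" functions.
    def hot_functions(miss_counts, cutoff=0.9):
        total = sum(miss_counts.values())
        ranked = sorted(miss_counts.items(), key=lambda kv: kv[1],
                        reverse=True)
        hot, covered = [], 0
        for fn, misses in ranked:
            if covered >= cutoff * total:
                break
            hot.append(fn)
            covered += misses
        return hot
    print(hot_functions({"lookup": 9000, "insert": 700, "log": 50}))
    # -> ['lookup'] (it alone accounts for over 90% of the misses)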
Abstract:
A system that allows a programmer to specify a set of constraints that the programmer has adhered to in writing code so that a compiler is able to assume the set of constraints in disambiguating memory references within the code. The system operates by receiving an identifier for a set of constraints on memory references that the programmer has adhered to in writing the code. The system uses the identifier to select a disambiguation technique from a set of disambiguation techniques. Note that each disambiguation technique is associated with a different set of constraints on memory references. The system uses the selected disambiguation technique to identify memory references within the code that can alias with each other.
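A sketch of the dispatch, with invented constraint identifiers and deliberately simplified rules (the patent's actual constraint sets are not reproduced here):

    # Illustrative sketch: the programmer-supplied identifier selects
    # which aliasing assumption the compiler may rely on.
    def may_alias_strict_types(a, b):
        return a["type"] == b["type"]     # different types never alias
    def may_alias_restrict(a, b):
        return a["name"] == b["name"]     # distinct pointers never alias
    DISAMBIGUATORS = {
        "strict-types": may_alias_strict_types,
        "restrict": may_alias_restrict,
    }
    def may_alias(constraint_id, ref_a, ref_b):
        return DISAMBIGUATORS[constraint_id](ref_a, ref_b)
    p = {"name": "p", "type": "int*"}
    q = {"name": "q", "type": "float*"}
    print(may_alias("strict-types", p, q))   # False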
Abstract:
Operations including inserted prefetch operations that correspond to addressing chains may be scheduled above memory access operations that are likely-to-miss, thereby exploiting latency of the “martyred” likely-to-miss operations and improving execution performance of resulting code. More generally, certain pre-executable counterparts of likely-to-stall operations that form dependency chains may be scheduled above operations that are themselves likely-to-stall.
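In sketch form (toy IR, invented field names), the pre-executable counterparts of a dependent addressing chain are emitted just above the likely-to-miss operation they shelter behind:

    # Illustrative sketch: hoist prefetch counterparts of an addressing
    # chain above a likely-to-miss ("martyred") operation, so the chain
    # resolves during that operation's stall.
    def hoist_chain(instrs, chain, martyr_name):
        counterparts = [{"op": "prefetch", "for": n} for n in chain]
        out = []
        for ins in instrs:
            if ins["name"] == martyr_name:
                out.extend(counterparts)
            out.append(ins)
        return out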
Abstract:
By maintaining consistency of instruction or operation identification between code prepared for profiling and code prepared using profiling results, the efficacy of profile-directed code optimizations can be improved. In particular, profile-directed optimizations based on stall statistics are facilitated in an environment in which correspondence is maintained between (i) instructions or operations whose execution performance may be optimized (or which may provide an opportunity for optimization of other instructions or operations) and (ii) the particular instructions or operations profiled.
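One way to picture the correspondence (the ID scheme below is an assumption, not the patent's mechanism):

    # Illustrative sketch: stamp operations with IDs that survive both
    # the profiling build and the optimizing build, so stall statistics
    # keyed by ID map back onto the same operations.
    def tag_operations(instrs, unit):
        for i, ins in enumerate(instrs):
            ins["id"] = f"{unit}:{i}"       # identical in both builds
        return instrs
    def apply_profile(instrs, stall_cycles):
        """stall_cycles: {operation id: observed stall cycles}."""
        for ins in instrs:
            if stall_cycles.get(ins["id"], 0) > 0:
                ins["likely_to_stall"] = True   # optimization candidate
        return instrs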
Abstract:
Apparatus, methods, and computer program products are disclosed that improve the operation of a computer that uses a top-of-stack cache by reducing the number of overflow and underflow traps generated during the execution of a program. The invention maintains a predictor value that controls the number of stack elements that are spilled from, or filled to, the top-of-stack cache in response to an overflow trap or an underflow trap (respectively). The predictor reflects the history of overflow traps and underflow traps.
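A minimal sketch of such a predictor; the doubling/halving update rule is an illustrative guess at how the trap history could be reflected, not the patented rule:

    # Illustrative sketch: adapt how many stack elements to spill or fill
    # based on the recent pattern of overflow/underflow traps.
    class StackCachePredictor:
        def __init__(self, lo=1, hi=16):
            self.count, self.last = 2, None
            self.lo, self.hi = lo, hi
        def on_trap(self, kind):            # kind: "overflow"/"underflow"
            if kind == self.last:
                # Same direction twice: moving too few elements; move more.
                self.count = min(self.count * 2, self.hi)
            else:
                # Direction flipped: likely over-shot; move fewer.
                self.count = max(self.count // 2, self.lo)
            self.last = kind
            return self.count               # elements to spill or fill now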