Abstract:
A system and method for precisely identifying an instruction causing a performance-related event is disclosed. The instruction may be detected while in a pipeline stage of a microprocessor preceding a writeback stage and the microprocessor's architectural state may not be updated until after information identifying the instruction is captured. The instruction may be flushed from the pipeline, along with other instructions from the same thread. A hardware trap may be taken when the instruction is detected and/or when an event counter overflows or is within a given range of overflowing. A software trap handler may capture and/or log information identifying the instruction, such as one or more extended address elements, before returning control and initiating a retry of the instruction. The captured and/or logged information may be stored in an event space database usable by a data space profiler to identify performance bottlenecks in the application containing the instruction.
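As a rough illustration only (the disclosure describes a hardware/software mechanism, not this code), a software trap handler of the kind described might capture and log identifying information roughly as in the C sketch below; the record fields, ring buffer, and handler signature are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical record capturing the identity of the instruction that
   caused the performance event, taken before architectural state is
   updated and before the flushed instruction is retried. */
typedef struct {
    uint64_t pc;        /* program counter of the trapping instruction */
    uint64_t ea;        /* effective (data) address, if any            */
    uint32_t event;     /* event type, e.g. an L2 miss                 */
    uint32_t thread;    /* hardware thread the instruction ran on      */
} event_record;

#define RING_SIZE 4096
static event_record ring[RING_SIZE];
static unsigned ring_head;

/* Called from the trap vector: capture identifying information, then
   return control so the instruction can be retried. */
void perf_trap_handler(uint64_t pc, uint64_t ea,
                       uint32_t event, uint32_t thread)
{
    event_record *r = &ring[ring_head++ % RING_SIZE];
    r->pc = pc;
    r->ea = ea;
    r->event = event;
    r->thread = thread;
    /* control returns to the trapping context; the instruction retries */
}
```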
Abstract:
A system and method for profiling a software application may include means for capturing profiling information corresponding to an instruction identified as having executed coincident with the occurrence of a runtime event, and for associating the profiling information with the event in an event set. In some embodiments, the identified instruction, which may have triggered the event, may be located in the program code sequence at a predetermined position relative to the current program counter value at the time the event was detected. The predetermined relative position may be fixed dependent on the processor architecture and may also be dependent on the event type. The predetermined relative position may be zero, indicating that when the event was detected, the program counter value corresponded to an instruction associated with the event. If the identified instruction is an ambiguity-creating instruction, an indication of ambiguity may be associated with the event.
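To make the relative-position idea concrete, here is a minimal C sketch, assuming fixed-width 4-byte instructions and an invented per-event skid table; the real offsets are architecture- and event-specific.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-event-type skid (in instructions) between the
   sampled PC and the instruction that triggered the event. */
static int skid_for_event(int event_type)
{
    switch (event_type) {
    case 0:  return 0;   /* event reported precisely at the trigger  */
    case 1:  return -1;  /* trigger is one instruction behind the PC */
    default: return -2;
    }
}

/* Walk back from the sampled PC to the candidate trigger, flagging
   ambiguity if a control-transfer instruction lies in between. */
uint64_t locate_trigger(uint64_t sampled_pc, int event_type,
                        bool (*is_branch)(uint64_t pc), bool *ambiguous)
{
    int skid = skid_for_event(event_type);
    uint64_t pc = sampled_pc + (int64_t)skid * 4; /* 4-byte instructions */

    *ambiguous = false;
    for (uint64_t p = pc; p < sampled_pc; p += 4)
        if (is_branch(p))       /* ambiguity-creating instruction */
            *ambiguous = true;
    return pc;
}
```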
Abstract:
The present invention discloses a method and device for placing prefetch instructions in a low-level or assembly-code instruction stream. It involves a new concept called the martyr memory operation. When prefetch instructions are inserted into a code stream, some instructions will still miss the cache, because in some circumstances a prefetch cannot be added at all, or cannot be added early enough for the needed data to be in cache before an executing instruction references it. A subset of these instructions is identified using a new method and designated as martyr memory operations. Once a martyr is identified, other memory operations that would also have been cache misses can "hide" behind it and complete their prefetches while the processor, of necessity, waits for the martyr memory operation to complete. This increases the number of cache hits.
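The effect is easiest to see at the source level. In the hypothetical C loop below, the data-dependent load a[idx[i]] cannot be prefetched early enough and so plays the martyr; the prefetches placed just before it complete behind its stall. The arrays and the AHEAD distance are illustrative, and in practice the transformation is performed by the compiler on low-level code rather than written by hand.

```c
#include <stddef.h>

#define AHEAD 16  /* illustrative prefetch distance, in iterations */

/* Sum three arrays; a[idx[i]] is data-dependent and cannot be
   prefetched early enough, so it is the "martyr": the processor must
   stall on it anyway, and the prefetches issued just before it
   complete for free behind that stall. */
long sum(const long *a, const long *b, const long *c,
         const int *idx, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {
            __builtin_prefetch(&b[i + AHEAD]);  /* hides behind martyr */
            __builtin_prefetch(&c[i + AHEAD]);  /* hides behind martyr */
        }
        total += a[idx[i]]    /* martyr memory operation: likely miss */
               + b[i] + c[i];
    }
    return total;
}
```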
Abstract:
A heuristic algorithm is disclosed that identifies loads guaranteed to hit the processor cache and that provides a "minimal" set of prefetches to be scheduled and inserted during compilation of a program. The heuristic algorithm of the present invention uses the concept of a "cache line" (i.e., the chunk of data received during a memory operation) together with the concept of "related" memory operations to determine which prefetches are unnecessary for related memory operations, thus generating a minimal number of prefetches for related memory operations.
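A minimal sketch of the grouping idea, assuming 64-byte cache lines and an invented address list: related memory operations whose addresses fall in the same line share a single prefetch, since the later ones are guaranteed hits.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64   /* assumed cache line size */

/* Emit at most one prefetch per distinct cache line touched by a
   group of related memory operations; later operations on the same
   line are guaranteed hits and need no prefetch of their own. */
void schedule_prefetches(const uint64_t *addrs, int n)
{
    uint64_t seen[128];  /* lines already covered (illustrative cap) */
    int nseen = 0;

    for (int i = 0; i < n; i++) {
        uint64_t line = addrs[i] / LINE_BYTES;
        int dup = 0;
        for (int j = 0; j < nseen; j++)
            if (seen[j] == line) { dup = 1; break; }
        if (!dup && nseen < 128) {
            seen[nseen++] = line;
            printf("prefetch line 0x%llx\n",
                   (unsigned long long)(line * LINE_BYTES));
        }
    }
}
```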
Abstract:
Cache line optimization involves computing where cache misses occur in a control flow and assigning probabilities to those cache misses. Cache lines may be scheduled based on the assigned probabilities and on where the cache misses are in the control flow. Cache line probabilities may be calculated from the relationship between a cache line and the locations of cache misses in the control flow. A control flow may be pruned before calculating cache line probabilities; function call sites may be used to prune the control flow. The address generation of a cache miss may be duplicated to speculatively hoist the address generation and the associated prefetch. References may be selected for optimization, cache lines identified, and the selected references mapped to those cache lines. Dependencies within the cache lines may be determined, and the cache lines may be scheduled based on the determined dependencies and on their probabilities of usefulness. Instructions may be scheduled based on the scheduled cache lines and the target machine model so as to maximize outstanding memory transactions. Cache lines may be scheduled across call sites.
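One plausible reading of the probability calculation (an assumption, not spelled out in the abstract) is that a cache line's probability of usefulness is the product of the branch probabilities along the path from the prefetch point to the miss, as in this toy C sketch:

```c
#include <stdio.h>

/* Toy control-flow path: probability of reaching each successive
   block from the point where the cache line would be prefetched. */
typedef struct { const char *edge; double edge_prob; } path_step;

/* Probability that a prefetch issued at the head of the path is
   useful = product of the edge probabilities leading to the miss. */
double line_usefulness(const path_step *path, int len)
{
    double p = 1.0;
    for (int i = 0; i < len; i++)
        p *= path[i].edge_prob;
    return p;
}

int main(void)
{
    path_step path[] = { {"B1->B2", 0.9}, {"B2->B4", 0.6} };
    printf("p(useful) = %.2f\n", line_usefulness(path, 2)); /* 0.54 */
    return 0;
}
```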
Abstract:
One embodiment of the present invention provides a system for compiling source code into executable code that performs prefetching for memory operations within regions of code that tend to generate cache misses. The system operates by compiling a source code module containing programming language instructions into an executable code module containing instructions suitable for execution by a processor. Next, the system runs the executable code module in a training mode on a representative workload and keeps statistics on cache miss rates for functions within the executable code module. These statistics are used to identify a set of "hot" functions that generate a large number of cache misses. Next, explicit prefetch instructions are scheduled in advance of memory operations within the set of hot functions. In one embodiment, explicit prefetch operations are scheduled into the executable code module by activating prefetch generation at the start of an identified function and deactivating prefetch generation at a return from the identified function. In another embodiment, the system further schedules prefetch operations for the memory operations by identifying a subset of memory operations of a particular type within the set of hot functions, and scheduling explicit prefetch operations for memory operations belonging to the subset.
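A minimal sketch of the training-mode selection step, assuming per-function miss statistics gathered on the representative workload; the threshold, data layout, and function names are hypothetical.

```c
#include <stdio.h>

typedef struct {
    const char   *name;
    unsigned long misses;   /* cache misses observed in training run */
    unsigned long refs;     /* memory references executed            */
} func_stats;

/* Mark functions whose training-run miss rate exceeds a threshold;
   the compiler would then activate prefetch generation at entry to
   each hot function and deactivate it at return. */
void select_hot(const func_stats *f, int n, double threshold)
{
    for (int i = 0; i < n; i++) {
        double rate = f[i].refs ? (double)f[i].misses / f[i].refs : 0.0;
        if (rate > threshold)
            printf("hot: %s (miss rate %.3f)\n", f[i].name, rate);
    }
}
```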
Abstract:
Data address profiling allows the sources of code execution hindrance to be determined from different perspectives of memory references, and allows sampled runtime events to be correlated with memory reference objects, such as cache lines. Associating sampled runtime events with data addresses provides for efficient and targeted optimization of code with respect to data addresses and physical and/or logical memory reference objects (e.g., memory segments, heap variables, variable instances, stack variables, etc.). An instruction instance is identified in relation to a sampled runtime event. A data address is determined from the instruction instance, and from the determined address a memory reference object is ascertained.
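The last step, mapping a determined data address onto a memory reference object, can be pictured as a range lookup over known objects, as in the C sketch below; the table layout and object names are invented for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical table of memory reference objects (heap variables,
   stack ranges, segments) keyed by address range. */
typedef struct {
    uint64_t    base, size;
    const char *object;   /* e.g. "heap:matrix_a", "stack:frame 3" */
} mem_object;

/* Given the data address recovered from the instruction instance
   associated with a sampled runtime event, find the object it hits. */
const char *find_object(const mem_object *tab, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tab[i].base && addr < tab[i].base + tab[i].size)
            return tab[i].object;
    return "<unknown>";
}
```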
Abstract:
A system and method for performance monitoring may use data collected from a hardware event agent comprising a hardware sampling mechanism and/or one or more hardware counters to increment one or more synthesized performance counters by an amount dependent on an expression involving the collected data. Each synthesized performance counter may be configured to count events of a different type and may comprise a machine addressable storage location. The event types may include various memory references or misses, branches, branch mispredictions, or any other event of interest in performance monitoring. The hardware event agent may comprise one or more instruction counters, cycle counters, timers, or other hardware performance counters. One hardware performance counter may be used in a time-multiplexed or data-multiplexed manner to monitor events of multiple event types. The hardware sampling mechanism may return a statistical packet for sampled instructions, which may be examined to determine the event type.
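As a hedged sketch of how a synthesized counter might be driven from the statistical packet, assuming an invented packet layout and a weight-scaled update expression:

```c
#include <stdint.h>

/* Hypothetical statistical packet returned by the hardware sampling
   mechanism for one sampled instruction. */
typedef struct {
    int      is_load;
    int      l2_miss;
    uint64_t sample_weight;  /* instructions represented by the sample */
} sample_packet;

/* Software-visible synthesized counters, each a machine-addressable
   storage location counting one event type. */
static uint64_t synth_l2_load_misses;
static uint64_t synth_loads;

/* Increment synthesized counters by an amount that is an expression
   over the sampled data (here, scaled by the sampling weight). */
void on_sample(const sample_packet *p)
{
    if (p->is_load) {
        synth_loads += p->sample_weight;
        if (p->l2_miss)
            synth_l2_load_misses += p->sample_weight;
    }
}
```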
Abstract:
A data space profiler may include an analysis engine that associates runtime events of profiled software applications with execution costs and extended address elements. Relational agents in the analysis engine may apply functions to profile data collected for each event to determine the extended address element values to be associated with the event. Each extended address element may correspond to a data profiling object (e.g., hardware component, software construct, data allocation construct, abstract view) involved in each event. The extended address element values may be used to index into an event set for the profiled software application to present costs from the perspective of these profiling objects. A filtering mechanism may also be used to extract profile data from the event set corresponding to events that satisfy the filter criteria. By alternating between presentation of profiling object views and filtered event data, performance bottlenecks and their causes may be identified.
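One way to picture a relational agent is as a function applied to each event's recorded profile data that yields one extended address element value, which can then serve as an index or filter key. The C sketch below assumes invented event fields and a hypothetical 64-byte cache line element:

```c
#include <stdint.h>
#include <stddef.h>

/* Raw profile data recorded for one runtime event. */
typedef struct {
    uint64_t pc, data_addr;
    uint64_t cost;          /* e.g. stall cycles attributed */
} event_rec;

/* A relational agent maps an event to one extended address element
   (here: which hypothetical 64-byte cache line was involved). */
typedef uint64_t (*relational_agent)(const event_rec *e);

static uint64_t cache_line_agent(const event_rec *e)
{
    return e->data_addr / 64;
}

/* Sum the cost of all events whose extended address element matches
   a filter value -- the view from that profiling object's perspective. */
uint64_t cost_for(const event_rec *ev, size_t n,
                  relational_agent agent, uint64_t key)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (agent(&ev[i]) == key)
            total += ev[i].cost;
    return total;
}
```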
Abstract:
A system and method for profiling a network application may include means for operating on context-specific data and costs. The system may include an apparatus for associating local extended address elements with a message sent from a first computing system to a second computing system across a network. The second computing system may store the received information as remote extended address information and may store its own local extended address information. An event agent may capture values of local and/or remote extended address elements in response to detecting the message or another system event and may associate the extended address elements with the message or system event in an event set accessible by a data space profiler. The extended address information may include time stamps. An event agent may determine network latency dependent on time stamps of messages and may generate an event if the latency exceeds a predetermined threshold.
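A sketch of the latency check such an event agent might perform, assuming the message carries the sender's timestamp as remote extended address information and ignoring clock-synchronization issues; the threshold is an invented example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical extended address information carried with a message:
   the sender's (remote) timestamp plus the receiver's (local) one. */
typedef struct {
    uint64_t sent_ns;     /* remote timestamp, stamped by sender   */
    uint64_t received_ns; /* local timestamp, stamped on receipt   */
} msg_times;

#define LATENCY_THRESHOLD_NS 5000000ull  /* illustrative: 5 ms */

/* Event agent: derive network latency from the two timestamps and
   emit a profiler event if it exceeds the threshold. */
void check_latency(const msg_times *m)
{
    uint64_t latency = m->received_ns - m->sent_ns;
    if (latency > LATENCY_THRESHOLD_NS)
        printf("event: high network latency %llu ns\n",
               (unsigned long long)latency);
}
```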