Abstract:
A method for detecting performance bottlenecks in a target application is provided. In response to receiving hotspot selections from a user interface, bottleneck rules are extracted from a database. A hotspot is a region of source code in the target application whose execution time exceeds a time threshold. Metrics needed to evaluate the bottleneck rules extracted from the database are identified. The identified metrics are computed. It is determined whether each bottleneck rule extracted from the database evaluates to true using the computed metrics for hotspots in the target application. In response to determining that a bottleneck rule evaluates to true using the appropriate computed metric corresponding to that rule, a bottleneck description is created for the rule. Then, the bottleneck description is sent to the user interface.
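As a rough illustration of the rule-driven flow in this abstract, the Python sketch below evaluates bottleneck rules against computed metrics for each selected hotspot and emits a description whenever a rule evaluates to true. The rule names, metrics, thresholds, and data layout are invented for the example; they are not taken from the patented system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Hotspot:
    name: str       # identifier of the source-code region
    seconds: float  # measured execution time

@dataclass
class Rule:
    name: str
    metrics: List[str]                              # metrics the rule needs
    predicate: Callable[[Dict[str, float]], bool]   # True => bottleneck present
    description: str

# Hypothetical "database" of bottleneck rules.
RULES = [
    Rule("high-cache-miss", ["cache_miss_ratio"],
         lambda m: m["cache_miss_ratio"] > 0.10,
         "Cache miss ratio above 10% in hotspot {hotspot}"),
    Rule("load-imbalance", ["max_thread_time", "avg_thread_time"],
         lambda m: m["max_thread_time"] > 1.5 * m["avg_thread_time"],
         "Thread load imbalance detected in hotspot {hotspot}"),
]

def compute_metrics(hotspot: Hotspot, needed: List[str]) -> Dict[str, float]:
    """Stand-in for the measurement step: return only the metrics a rule needs."""
    fake_profile = {"cache_miss_ratio": 0.15,
                    "max_thread_time": 3.0,
                    "avg_thread_time": 1.2}
    return {name: fake_profile[name] for name in needed}

def detect_bottlenecks(hotspots: List[Hotspot]) -> List[str]:
    descriptions = []
    for hs in hotspots:                          # hotspots selected in the UI
        for rule in RULES:                       # rules extracted from the database
            metrics = compute_metrics(hs, rule.metrics)
            if rule.predicate(metrics):          # rule evaluates to true
                descriptions.append(rule.description.format(hotspot=hs.name))
    return descriptions                          # descriptions sent back to the UI

if __name__ == "__main__":
    for line in detect_bottlenecks([Hotspot("matmul_kernel", 12.4)]):
        print(line)
```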
Abstract:
A target application is automatically tuned. A list of solutions for identified performance bottlenecks in a target application is retrieved from a storage device. A plurality of modules is executed to compute specific parameters for solutions contained in the list of solutions. A list of modification commands associated with specific parameters computed by the plurality of modules is generated. The list of modification commands associated with the specific parameters is appended to a command sequence list. The list of modification commands is implemented in the target application. Specific source code regions corresponding to the identified performance bottlenecks in the target application are automatically tuned using the implemented list of modification commands. Then, the tuned target application is stored in the storage device.
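The sketch below is a minimal, hypothetical rendering of that tuning pipeline: the solution names, parameter modules, and command format are all assumptions made for the example, not the actual tuning framework. It shows parameters being computed per solution, modification commands generated from them, appended to a command sequence, and then applied to source regions (here, applying a command is only printed).

```python
from typing import Callable, Dict, List

# Hypothetical list of solutions retrieved from the storage device.
SOLUTIONS = ["unroll_loop", "set_thread_count"]

def unroll_parameters(solution: str) -> Dict[str, int]:
    """Module that picks an unroll factor for a loop-unrolling solution."""
    return {"unroll_factor": 4}

def thread_parameters(solution: str) -> Dict[str, int]:
    """Module that picks a thread count for a threading solution."""
    return {"num_threads": 8}

PARAMETER_MODULES: Dict[str, Callable[[str], Dict[str, int]]] = {
    "unroll_loop": unroll_parameters,
    "set_thread_count": thread_parameters,
}

def generate_commands(solutions: List[str]) -> List[str]:
    """Turn each solution's computed parameters into modification commands."""
    commands = []
    for solution in solutions:
        params = PARAMETER_MODULES[solution](solution)
        for key, value in params.items():
            commands.append(f"{solution}: set {key}={value}")
    return commands

command_sequence: List[str] = []    # the command sequence list

def tune(solutions: List[str]) -> None:
    command_sequence.extend(generate_commands(solutions))   # append to the sequence
    for cmd in command_sequence:
        # Stand-in for rewriting the corresponding source-code region.
        print("applying to source region:", cmd)

if __name__ == "__main__":
    tune(SOLUTIONS)
```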
Abstract:
A method for increasing performance of an operation on a distributed memory machine is provided. Asynchronous parallel steps in the operation are transformed into synchronous parallel steps. The synchronous parallel steps of the operation are rearranged to generate an altered operation that schedules memory accesses for increasing locality of reference. The altered operation that schedules memory accesses for increasing locality of reference is mapped onto the distributed memory machine. Then, the altered operation is executed on the distributed memory machine to simulate local memory accesses with virtual threads to check cache performance within each node of the distributed memory machine.
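A toy sketch of the underlying idea, under an assumed block-cyclic ownership map and a fixed step size (both invented for illustration): an asynchronous access stream is split into barrier-separated synchronous steps, each step is reordered by owning node to increase locality of reference, and a crude per-node access count stands in for the virtual-thread check of cache behavior within each node.

```python
from collections import defaultdict
from typing import Dict, List

NODES = 4
BLOCK = 64            # addresses per ownership block (assumed layout)

def owner(address: int) -> int:
    """Block-cyclic mapping of an address to the node that owns it."""
    return (address // BLOCK) % NODES

def synchronize_and_rearrange(accesses: List[int], step: int = 8) -> List[List[int]]:
    """Split an asynchronous access stream into synchronous steps, then
    sort each step by owning node to increase locality of reference."""
    steps = [accesses[i:i + step] for i in range(0, len(accesses), step)]
    return [sorted(s, key=owner) for s in steps]          # barrier between steps

def simulate_local_accesses(steps: List[List[int]]) -> Dict[int, int]:
    """Count, per node, how many accesses a 'virtual thread' on that node
    would serve locally -- a crude stand-in for per-node cache checking."""
    local_counts: Dict[int, int] = defaultdict(int)
    for s in steps:
        for address in s:
            local_counts[owner(address)] += 1
    return dict(local_counts)

if __name__ == "__main__":
    stream = [3, 260, 130, 7, 65, 900, 901, 66, 512, 513, 2, 768]
    steps = synchronize_and_rearrange(stream)
    print("rearranged steps:", steps)
    print("per-node access counts:", simulate_local_accesses(steps))
```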
Abstract:
During runtime of a binary program file, streams of instructions are executed and memory references, generated by instrumentation applied to given ones of the instructions that refer to memory locations, are collected. A transformation is performed, based on the executed streams of instructions and the collected memory references, to obtain a table. The table lists memory events of interest for active data structures for each function in the program file. The transformation is performed to translate memory addresses for given ones of the instructions and given ones of the data structures into locations and variable names in a source file corresponding to the binary file. At least the memory events of interest are displayed, and the display is organized so as to correlate the memory events of interest with corresponding ones of the data structures.
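The sketch below illustrates the address-to-name aggregation in spirit only: instrumented memory events carry a raw address, a symbol table translates the address into a variable name and source location, and events are tabulated per function and data structure. The symbol ranges, event stream, and table layout are assumptions for the example, not the tool's actual formats.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical symbol table: (start address, end address, variable, source location).
SYMBOLS = [(0x1000, 0x1FFF, "matrix_a", "solver.c:12"),
           (0x2000, 0x2FFF, "matrix_b", "solver.c:13")]

def resolve(address: int) -> Tuple[str, str]:
    """Translate a raw address into a variable name and source-file location."""
    for start, end, name, location in SYMBOLS:
        if start <= address <= end:
            return name, location
    return "<unknown>", "<unknown>"

def build_table(events: List[Tuple[str, int]]) -> Dict[Tuple[str, str, str], int]:
    """Aggregate instrumented memory events per (function, variable, location)."""
    table: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for function, address in events:
        name, location = resolve(address)
        table[(function, name, location)] += 1
    return table

if __name__ == "__main__":
    stream = [("gemm", 0x1004), ("gemm", 0x1010), ("gemm", 0x2004),
              ("solve", 0x2100)]
    for (function, variable, location), count in build_table(stream).items():
        print(f"{function:8s} {variable:10s} {location:12s} events={count}")
```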
Abstract:
A system for detecting performance bottlenecks in a target application is provided. In response to receiving hotspot selections from a user interface, bottleneck rules are extracted from a database. A hotspot is a region of source code in the target application whose execution time exceeds a time threshold. Metrics needed to evaluate the bottleneck rules extracted from the database are identified. The identified metrics are computed. It is determined whether each bottleneck rule extracted from the database evaluates to true using the computed metrics for hotspots in the target application. In response to determining that a bottleneck rule evaluates to true using the appropriate computed metric corresponding to that rule, a bottleneck description is created for the rule. Then, the bottleneck description is sent to the user interface.
Abstract:
A method for profiling performance of a system includes steps of: monitoring execution of the system at multiple points during the system's operation; analyzing results derived from the monitoring in order to provide analyzed results; reconfiguring the monitoring non-uniformly according to the analyzed results; and repeatedly performing iterations of the above steps until a particular event occurs. The iterations may be terminated upon: reaching a specified level of analysis precision, determining a source of one or more performance bottlenecks, determining a source of unexpectedly high output or low completion time, completing a predefined number of iterations, reaching an endpoint of an application, or having performed iterations for a specified period of time.
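A minimal sketch of the monitor/analyze/reconfigure loop follows, assuming an invented precision threshold, a fixed iteration cap, and a simple doubling rule as the non-uniform reconfiguration; none of these choices come from the patented method.

```python
import random
from typing import Dict, List

def monitor(regions: List[str], sampling: Dict[str, int]) -> Dict[str, float]:
    """Pretend to sample each region at its configured rate and return a cost estimate."""
    return {r: random.random() * sampling[r] for r in regions}

def analyze(results: Dict[str, float]) -> str:
    """Return the region that currently looks most expensive."""
    return max(results, key=results.get)

def profile(regions: List[str], max_iterations: int = 5) -> Dict[str, int]:
    sampling = {r: 1 for r in regions}              # uniform initial monitoring
    for _ in range(max_iterations):                 # stop after a fixed number of iterations
        results = monitor(regions, sampling)
        suspect = analyze(results)
        sampling[suspect] *= 2                      # reconfigure non-uniformly:
                                                    # sample the suspect region harder
        if sampling[suspect] >= 16:                 # assumed precision threshold
            break
    return sampling

if __name__ == "__main__":
    print(profile(["init", "solver_loop", "io"]))
```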
Abstract:
A computer implemented method analyzes shared memory accesses during execution of an application program. The method includes instrumenting events of shared memory accesses in the application program, where the application program is to be executed on a target configuration having p nodes; executing the application program using p1 processing nodes, where p1 is less than p and satisfies a constraint. For accesses made by the executing application program, the method determines a target thread and maps determined target threads to either a remote node or a local node corresponding to a remote memory access and to a local memory access, respectively. Also disclosed is a computer-readable storage medium that stores a program of executable instructions that implements the method, and a data processing system. The invention can be implemented using a language such as Unified Parallel C (UPC) directed to a partitioned global address space (PGAS) paradigm.
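The sketch below, assuming UPC-style cyclic ownership of shared elements and an even blocking of threads over nodes (both assumptions for the example), determines the target thread for each instrumented access and classifies it as local or remote with respect to the target configuration of p nodes, even though the analysis run itself would occupy only p1 of them.

```python
from collections import Counter
from typing import List, Tuple

THREADS = 16          # total threads of the target configuration
TARGET_NODES = 4      # p: nodes the program is meant to run on
USED_NODES = 2        # p1 < p: nodes actually used for the instrumented run

def target_thread(element_index: int) -> int:
    """Determine which thread owns a shared array element (cyclic layout)."""
    return element_index % THREADS

def node_of(thread: int, nodes: int) -> int:
    """Map a thread to a node, assuming threads are blocked evenly over nodes."""
    return thread // (THREADS // nodes)

def classify(accesses: List[Tuple[int, int]]) -> Counter:
    """Label each (issuing thread, element index) access as local or remote
    with respect to the target configuration of TARGET_NODES nodes."""
    counts: Counter = Counter()
    for issuing_thread, element in accesses:
        owner = target_thread(element)
        same_node = node_of(issuing_thread, TARGET_NODES) == node_of(owner, TARGET_NODES)
        counts["local" if same_node else "remote"] += 1
    return counts

if __name__ == "__main__":
    print(f"analysis run on {USED_NODES} of {TARGET_NODES} nodes")
    stream = [(0, 0), (0, 5), (3, 12), (7, 7), (9, 2)]
    print(classify(stream))
```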