Abstract:
A computer-implemented method, system, and computer program product for accessing threadprivate memory for threadprivate variables in a parallel program during program compilation. The method includes aggregating the threadprivate variables in the program, replacing references to the threadprivate variables with indirect references, moving the address load operations of the threadprivate variables, and replacing those address load operations with calls to runtime routines that access the threadprivate memory. The invention enables a compiler to minimize the number of runtime routine calls needed to access the threadprivate variables, thereby improving program performance.
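As a rough, hand-written illustration of the transformation described above, the C sketch below aggregates two threadprivate variables into one block and loads its address once per parallel region; the block layout, the routine name tp_get_block, and the use of OpenMP threadprivate are assumptions of this sketch rather than the patented runtime interface.

    /* Aggregated threadprivate block (a stand-in for what the compiler would build). */
    struct tp_block { int counter; double scale; };
    static struct tp_block tp;
    #pragma omp threadprivate(tp)

    /* Hypothetical stand-in for a runtime routine returning the calling
     * thread's copy of the aggregated block. */
    static struct tp_block *tp_get_block(void) { return &tp; }

    void work(int n)
    {
        #pragma omp parallel
        {
            /* One address load per region instead of one per reference. */
            struct tp_block *p = tp_get_block();
            for (int i = 0; i < n; i++) {
                p->counter += 1;          /* indirect reference */
                p->scale   *= 1.000001;   /* indirect reference */
            }
        }
    }

Caching the block address once per region is what reduces the number of runtime routine calls from one per variable reference to one per region.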
Abstract:
A mechanism for minimizing effective memory latency without unnecessary cost through fine-grained software-directed data prefetching using integrated high-level and low-level code analysis and optimizations is provided. The mechanism identifies and classifies streams, identifies data that is most likely to incur a cache miss, exploits effective hardware prefetching to determine the proper number of streams to be prefetched, exploits effective data prefetching on different types of streams in order to eliminate redundant prefetching and avoid cache pollution, and uses high-level transformations with integrated low-level cost analysis in the instruction scheduler to schedule prefetch instructions effectively.
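As a rough illustration of the kind of code such a mechanism emits for one regular stream, the C sketch below uses the GCC/Clang __builtin_prefetch builtin rather than the patent's own interface; the line size and prefetch distance are assumed values, standing in for ones the described cost analysis would derive.

    #include <stddef.h>

    /* Illustrative values: 64-byte cache lines, prefetch 8 lines ahead. */
    #define DOUBLES_PER_LINE 8
    #define LINES_AHEAD      8

    double sum_stream(const double *a, size_t n)
    {
        double s = 0.0;
        size_t dist = (size_t)LINES_AHEAD * DOUBLES_PER_LINE;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 0, 3);  /* read, high locality */
            s += a[i];
        }
        return s;
    }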
Abstract:
A method and system for modifying instructions forming a loop is provided. The method includes: determining static and dynamic characteristics of the instructions; selecting a modification factor for the instructions based on the number of separate, equivalent sections forming a cache in the processor that is processing the instructions; and, when the instructions satisfy a modification criterion based on the static and dynamic characteristics, modifying the instructions to interleave them within the loop according to the modification factor and those characteristics.
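As a hand-written illustration of the interleaving transformation, the C sketch below unrolls a reduction loop by a modification factor and spreads the work across independent accumulators; the factor of 4 is an assumed value standing in for one derived from the number of equivalent cache sections and the loop's static and dynamic characteristics.

    #include <stddef.h>

    #define FACTOR 4   /* illustrative modification factor */

    double dot(const double *a, const double *b, size_t n)
    {
        double acc[FACTOR] = {0.0};
        size_t i = 0;
        for (; i + FACTOR <= n; i += FACTOR)      /* interleaved loop body */
            for (int k = 0; k < FACTOR; k++)
                acc[k] += a[i + k] * b[i + k];
        for (; i < n; i++)                        /* remainder iterations */
            acc[0] += a[i] * b[i];
        double total = 0.0;
        for (int k = 0; k < FACTOR; k++)
            total += acc[k];
        return total;
    }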
Abstract:
A target operation in a normalized target loop is identified in source code; the operation is susceptible to vectorization and may, after compilation into a vectorized form, seek to operate on data in nonconsecutive physical memory. Hardware instructions are inserted into executable code generated from the source code, directing the system that will run the executable code to create a representation of the data in consecutive physical memory. A vector loop containing the target operation is replaced, in the executable code, with a call to a vector library function that operates on the representation to generate a result identical to the output expected from executing the vector loop containing the target operation. On execution, a representation of the data residing in nonconsecutive physical memory is created in consecutive physical memory, and the vectorized target operation is applied to that representation to process the data.
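As a rough source-level analogue of the gather-then-call pattern (the real mechanism performs it during compilation, on the executable code), the C sketch below copies a strided, nonconsecutive access pattern into a contiguous buffer, applies a vector routine, and writes the results back; vec_exp is a hypothetical library entry point, and the fixed 256-element buffer is an assumption of this sketch.

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical vector-library routine operating on contiguous data. */
    static void vec_exp(double *dst, const double *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = exp(src[i]);
    }

    /* Assumes n <= 256 for this sketch. */
    void exp_strided(double *a, size_t n, size_t stride)
    {
        double tmp[256];                   /* consecutive representation */
        for (size_t i = 0; i < n; i++)     /* gather nonconsecutive data */
            tmp[i] = a[i * stride];
        vec_exp(tmp, tmp, n);              /* vector function on contiguous data */
        for (size_t i = 0; i < n; i++)     /* write results back */
            a[i * stride] = tmp[i];
    }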
Abstract:
A computer-implemented method, apparatus, and computer-usable program code for compiling source code that performs a complex operation followed by a complex reduction operation. A method for generating executable code that performs the complex operation and the complex reduction operation is determined. In response to a determination that a reduced single-instruction, multiple-data (SIMD) method is appropriate, executable code is generated for computing sub-products, reducing the sub-products to intermediate results, and summing the intermediate results to produce a final result.
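As a scalar C illustration of the sub-product and intermediate-result structure for a complex multiply followed by a sum reduction (an actual reduced-SIMD lowering would keep the four partial accumulators in vector registers; the function and variable names are assumptions of this sketch):

    #include <complex.h>
    #include <stddef.h>

    double complex cdot(const double complex *a, const double complex *b, size_t n)
    {
        /* Sub-product accumulators. */
        double rr = 0.0, ii = 0.0, ri = 0.0, ir = 0.0;
        for (size_t i = 0; i < n; i++) {
            rr += creal(a[i]) * creal(b[i]);
            ii += cimag(a[i]) * cimag(b[i]);
            ri += creal(a[i]) * cimag(b[i]);
            ir += cimag(a[i]) * creal(b[i]);
        }
        /* Reduce the intermediate results into the final complex sum. */
        return (rr - ii) + (ri + ir) * I;
    }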
Abstract:
A method and system for reducing or avoiding store misses with a data cache block zero (DCBZ) instruction, in cooperation with the underlying hardware load-stream prefetching support, to help increase effective aggregate bandwidth. The method identifies and classifies unique streams in a loop based on dependency and reuse analysis, and performs loop transformations, such as node splitting, loop distribution, or stream unrolling, to obtain the proper number of streams. Static prediction and run-time profile information are used to guide loop and stream selection. Compile-time loop cost analysis and run-time check code with versioning are used to determine the number of cache lines ahead of each reference for data cache line zeroing and to tolerate the required data alignment relative to data cache lines.
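As a rough, PowerPC-only illustration of zeroing destination cache lines ahead of a store stream, the C sketch below issues dcbz through inline assembly; the 128-byte line size, the two-line look-ahead, and the alignment assumption are illustrative stand-ins for values the described cost analysis, run-time checks, and loop versioning would establish.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_BYTES   128
    #define LINES_AHEAD  2
    #define PER_LINE     (LINE_BYTES / sizeof(double))

    /* Assumes dst is 128-byte aligned and that every zeroed line is fully
     * overwritten by later iterations; dcbz is a PowerPC instruction. */
    void scale_copy(double *dst, const double *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (((uintptr_t)&dst[i] & (LINE_BYTES - 1)) == 0 &&
                i + (LINES_AHEAD + 1) * PER_LINE <= n)
                __asm__ volatile("dcbz 0,%0"
                                 :: "r"((uintptr_t)&dst[i] + LINES_AHEAD * LINE_BYTES)
                                 : "memory");
            dst[i] = src[i] * 2.0;         /* the store stream itself */
        }
    }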
Abstract:
A method, apparatus, and computer instructions for processing instructions. A data dependency graph is built. The data dependency graph is analyzed for recurrences, and unpipelined instructions that lie outside of the recurrences are expanded.
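As a small illustration of the recurrence analysis step, the C sketch below represents a data dependency graph as an adjacency matrix and classifies each instruction by whether it can reach itself through dependence edges (that is, whether it lies on a recurrence); the graph shape is made up, and the expansion of instructions outside recurrences is only indicated, not performed.

    #include <stdio.h>

    #define N 4   /* instructions in the loop body */

    static int edge[N][N];   /* edge[u][v] = 1 if v depends on u */

    static int reaches(int from, int target, int *seen)
    {
        if (seen[from]) return 0;
        seen[from] = 1;
        for (int v = 0; v < N; v++)
            if (edge[from][v] && (v == target || reaches(v, target, seen)))
                return 1;
        return 0;
    }

    int main(void)
    {
        /* 0 -> 1 -> 2 -> 0 forms a recurrence; instruction 3 lies outside it. */
        edge[0][1] = edge[1][2] = edge[2][0] = edge[2][3] = 1;
        for (int u = 0; u < N; u++) {
            int seen[N] = {0};
            printf("instr %d: %s\n", u,
                   reaches(u, u, seen) ? "in recurrence"
                                       : "outside recurrence (expansion candidate)");
        }
        return 0;
    }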