Abstract:
A method for compiling application source code that includes selecting multiple loops for parallelization. The multiple loops include a first loop and a second loop. The method further includes partitioning the first loop into a first set of chunks, partitioning the second loop into a second set of chunks, and calculating data dependencies between the first set of chunks and the second set of chunks. A first chunk of the second set of chunks is dependent on a first chunk of the first set of chunks. The method further includes inserting, into the first loop and prior to completing compilation, a precedent synchronization instruction for execution when execution of the first chunk of the first set of chunks completes, and completing the compilation of the application source code to create an application compiled code.
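The chunk-level synchronization described above can be illustrated with a small runtime simulation. This is only a sketch under stated assumptions, not the patented compiler transformation: the two loop bodies, the access pattern `reads`, and the helper names `partition`, `chunk_dependencies`, and `run_pipelined` are all hypothetical, and a `threading.Event` stands in for the precedent synchronization instruction that the compiler would insert into the first loop.

```python
import threading

def partition(n, chunk_size):
    """Split the iteration space [0, n) into half-open (lo, hi) chunks."""
    return [(lo, min(lo + chunk_size, n)) for lo in range(0, n, chunk_size)]

def chunk_dependencies(consumer_chunks, producer_chunks, reads):
    """For each consumer chunk, list the producer chunks it depends on.

    `reads(j)` returns the producer indices read by consumer iteration j.
    """
    deps = []
    for lo, hi in consumer_chunks:
        needed = {i for j in range(lo, hi) for i in reads(j)}
        deps.append([k for k, (plo, phi) in enumerate(producer_chunks)
                     if any(plo <= i < phi for i in needed)])
    return deps

def run_pipelined(n=16, chunk_size=4):
    a, b = [0] * n, [0] * n
    first = partition(n, chunk_size)    # chunks of the first loop
    second = partition(n, chunk_size)   # chunks of the second loop
    # Assumed access pattern: iteration j of loop 2 reads a[j-1] and a[j].
    reads = lambda j: [j] if j == 0 else [j - 1, j]
    deps = chunk_dependencies(second, first, reads)
    done = [threading.Event() for _ in first]

    def first_loop_chunk(k):
        lo, hi = first[k]
        for i in range(lo, hi):
            a[i] = i * i
        done[k].set()  # stand-in for the inserted synchronization instruction

    def second_loop_chunk(k):
        for d in deps[k]:
            done[d].wait()  # block until every producing chunk has completed
        lo, hi = second[k]
        for j in range(lo, hi):
            b[j] = a[j] + (a[j - 1] if j > 0 else 0)

    # Consumers are started first to show that the events enforce the order.
    threads = [threading.Thread(target=second_loop_chunk, args=(k,))
               for k in range(len(second))]
    threads += [threading.Thread(target=first_loop_chunk, args=(k,))
                for k in range(len(first))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a, b, deps
```

With these assumptions, a chunk of the second loop can start as soon as the one or two chunks of the first loop it reads from have finished, rather than waiting for the whole first loop.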
Abstract:
One embodiment of the present invention provides a system for communicating and performing synchronization operations between a main thread and a helper-thread. The system starts by executing a program in a main thread. Upon encountering a loop that has associated helper-thread code, the system commences execution of that code by the helper-thread separately from and in parallel with the main thread. While the helper-thread executes the code, the system periodically checks the progress of the main thread and deactivates the helper-thread if the code it is executing is no longer performing useful work. Hence, the helper-thread executes in advance of where the main thread is executing to prefetch data items for the main thread without unnecessarily consuming processor resources or hampering the execution of the main thread.
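The progress check can be sketched as follows. This is a minimal illustration under assumptions: Python cannot issue hardware prefetches, so the `prefetched` dictionary is a stand-in for a warmed cache, and `run_with_helper`, `check_every`, and the rule "stop once the main thread has passed the helper's position" are hypothetical choices rather than details taken from the abstract.

```python
import threading

def run_with_helper(data, check_every=8):
    progress = {"main": 0}        # loop index most recently reached by the main thread
    finished = threading.Event()  # set once the main loop has completed
    prefetched = {}               # stand-in for data brought into cache

    def helper():
        for i in range(len(data)):
            # Periodically compare positions; if the main thread has caught
            # up or finished, further prefetching is no longer useful work.
            if i % check_every == 0:
                if finished.is_set() or progress["main"] > i:
                    return
            prefetched[i] = data[i]

    t = threading.Thread(target=helper)
    t.start()

    total = 0
    hits = 0
    for i in range(len(data)):
        progress["main"] = i
        value = prefetched.get(i)
        if value is None:
            value = data[i]       # "cache miss": fetch the item directly
        else:
            hits += 1
        total += value
    finished.set()
    t.join()
    return total, hits, t.is_alive()
```

The result computed by the main loop does not depend on how far the helper got; only the number of hits varies from run to run.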
Abstract:
One embodiment of the present invention provides a system that generates code for software scouting the regions of a program. During operation, the system receives source code for a program. The system then compiles the source code. In the first step of the compilation process, the system identifies a first set of loops from a hierarchy of loops in the source code, wherein each loop in the first set of loops contains at least one effective prefetch candidate. Then, from the first set of loops, the system identifies a second set of loops where scout-mode prefetching is profitable. Next, for each loop in the second set of loops, the system produces executable code for a helper-thread which contains a prefetch instruction for each effective prefetch candidate. At runtime the helper-thread is executed in parallel with the main thread in advance of where the main thread is executing to prefetch data items for the main thread.
Abstract:
Methods and apparatus provide for a workload adjuster to estimate the startup cost of one or more non-main threads of loop execution and to estimate the amount of workload to be migrated between different threads. Upon deciding to parallelize the execution of a loop, the workload adjuster creates a scheduling policy with a workload for a main thread and workloads for respective non-main threads. The scheduling policy distributes iterations of a parallelized loop to the workload of the main thread and iterations of the parallelized loop to the workloads of the non-main threads. The workload adjuster evaluates a start-up cost of the workload of a non-main thread and, based on the start-up cost, migrates a portion of the workload for that non-main thread to the main thread's workload.
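A worked example may make the migration step concrete. The abstract gives no formula, so the cost model here is an assumption: the main thread finishes after `load * iter_cost`, each non-main thread finishes after `startup_cost + load * iter_cost`, and iterations are migrated until those finish times balance. The function names `adjust_workload` and `finish_times` are hypothetical.

```python
def adjust_workload(n_iters, n_threads, iter_cost, startup_cost):
    """Return per-thread iteration counts; index 0 is the main thread."""
    # Initial scheduling policy: spread iterations as evenly as possible.
    base, extra = divmod(n_iters, n_threads)
    loads = [base + (1 if t < extra else 0) for t in range(n_threads)]

    # Under the assumed cost model, balancing finish times requires each
    # non-main thread to hand startup_cost / (iter_cost * n_threads)
    # iterations over to the main thread.
    per_thread = startup_cost // (iter_cost * n_threads)
    for t in range(1, n_threads):
        moved = min(per_thread, loads[t])
        loads[t] -= moved
        loads[0] += moved
    return loads

def finish_times(loads, iter_cost, startup_cost):
    """Estimated completion time of each thread under the assumed model."""
    main = loads[0] * iter_cost
    others = [startup_cost + w * iter_cost for w in loads[1:]]
    return [main] + others
```

For 1000 iterations on 4 threads with a unit iteration cost and a start-up cost of 40, each non-main thread gives up 10 iterations, so the main thread runs 280 iterations and the others run 240 each, and all four are estimated to finish at time 280.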
Abstract:
Embodiments of the invention provide systems and methods for automatically parallelizing loops with non-speculative pipelined execution of chunks of iterations with pre-computation of selected values. Non-DOALL loops are identified and divided into chunks. The chunks are assigned to separate logical threads, which may be further assigned to hardware threads. As a thread performs its runtime computations, subsequent threads attempt to pre-compute their respective chunks of the loop. These pre-computations may result in a set of assumed initial values and pre-computed final variable values associated with each chunk. As subsequent pre-computed chunks are reached at runtime, those assumed initial values can be verified to determine whether to proceed with runtime computation of the chunk or to avoid runtime execution and instead use the pre-computed final variable values.
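The verify-or-recompute step can be sketched with a toy loop-carried dependence. The example is an assumption chosen for illustration: the loop tracks the length of the current run of nonzero elements, every chunk is pre-computed with an assumed initial value of 0, and the pre-computation phase runs sequentially here, whereas the abstract describes it running on other threads. The names `step`, `run_chunk`, and `pipelined` are hypothetical.

```python
def step(run, value):
    """Loop body: extend the current run of nonzero values or reset it."""
    return run + 1 if value else 0

def run_chunk(data, lo, hi, initial):
    """Execute iterations [lo, hi) starting from the given carried value."""
    run = initial
    for i in range(lo, hi):
        run = step(run, data[i])
    return run

def pipelined(data, chunk_size, assumed_initial=0):
    chunks = [(lo, min(lo + chunk_size, len(data)))
              for lo in range(0, len(data), chunk_size)]

    # Pre-computation phase: each chunk is evaluated from the assumed
    # initial value, yielding a pre-computed final value per chunk.
    precomputed = [run_chunk(data, lo, hi, assumed_initial)
                   for lo, hi in chunks]

    # Runtime phase: verify the assumption as each chunk is reached.
    run = 0
    reused = 0
    for (lo, hi), final in zip(chunks, precomputed):
        if run == assumed_initial:
            run = final                          # assumption held: skip the chunk
            reused += 1
        else:
            run = run_chunk(data, lo, hi, run)   # assumption failed: execute it
    return run, reused
```

Because a pre-computed value is used only after its assumed initial value has been checked against the actual one, the result always matches sequential execution; the pre-computation only affects how many chunks can be skipped.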
Abstract:
Methods are disclosed for compiling a software application having multiple functions. At least one of the functions is identified as a targeted function having a significant contribution to performance of the software application. A code version of the targeted function is generated with one of multiple machine models corresponding to different target utilizations for a target architecture, specifically the machine model corresponding to the greatest of the different target utilizations. The generated code version of the targeted function is matched with an application thread of the target architecture.
Abstract:
Embodiments of the invention provide systems and methods for throughput-aware software pipelining in compilers to produce optimal code for single-thread and multi-thread execution on multi-threaded systems. A loop is identified within source code as a candidate for software pipelining. An attempt is made to generate pipelined code (e.g., generate an instruction schedule and a set of register assignments) for the loop in satisfaction of throughput-aware pipelining criteria, like maximum register count, minimum trip count, target core pipeline resource utilization, maximum code size, etc. If the attempt fails to generate code in satisfaction of the criteria, embodiments adjust one or more settings (e.g., by reducing scalarity or latency settings being used to generate the instruction schedule). Additional attempts are made to generate pipelined code in satisfaction of the criteria by iteratively adjusting the settings, regenerating the code using the adjusted settings, and recalculating whether the code satisfies the criteria.
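The iterative adjust-and-retry loop can be sketched as follows. Everything concrete here is an assumption made for illustration: `generate_schedule` is a toy stand-in for a real software pipeliner, its register estimate is an invented formula, only a maximum register count is checked, and the order in which `latency` and `scalarity` are reduced is arbitrary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    scalarity: int
    latency: int

def generate_schedule(settings):
    """Toy pipeliner: report an invented register estimate for the settings."""
    return {"registers": 4 * settings.scalarity + 2 * settings.latency}

def satisfies(schedule, max_registers):
    """Check the single criterion modeled here: the register budget."""
    return schedule["registers"] <= max_registers

def pipeline_loop(settings, max_registers, max_attempts=10):
    """Regenerate code with progressively reduced settings until it fits."""
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        schedule = generate_schedule(settings)
        if satisfies(schedule, max_registers):
            return schedule, settings, attempt
        # Adjust one setting per attempt: latency first, then scalarity.
        if settings.latency > 1:
            settings = Settings(settings.scalarity, settings.latency - 1)
        elif settings.scalarity > 1:
            settings = Settings(settings.scalarity - 1, settings.latency)
        else:
            break  # nothing left to reduce; give up on pipelining this loop
    return None, settings, attempt
```

Starting from `Settings(scalarity=4, latency=8)` with a budget of 24 registers, the toy model needs 32 registers at first and reaches 24 after the latency setting has been reduced four times, so the fifth attempt succeeds.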
Abstract:
A method includes scheduling instructions within a trace while disregarding data dependencies from off-trace basic blocks. After scheduling, errors caused by instruction movement are corrected. Disregarding data dependencies from off-trace basic blocks exposes more parallelism, which allows more instruction motion. In this manner, efficiency is maximized.