Abstract:
Partitioning programs between a general purpose core and one or more accelerators is provided. A compiler front end is provided for converting program source code in a corresponding high level programming language into an intermediate code representation. This intermediate code representation is provided to an interprocedural optimizer, which determines which core processor or accelerator each portion of the program should execute on and partitions the program into sub-programs based on this set of decisions. The interprocedural optimizer may further add instructions to the partitions to coordinate and synchronize the sub-programs as required. Each sub-program is compiled on an appropriate compiler backend for the instruction set architecture of the particular core processor or accelerator selected to execute the sub-program. The compiled sub-programs are then linked to thereby generate an executable program.
Abstract:
A mechanism is provided for partitioning programs between a general purpose core and one or more accelerators. With this mechanism, a compiler front end is provided for converting program source code in a corresponding high level programming language into an intermediate code representation. This intermediate code representation is provided to an interprocedural optimizer, which determines which core processor or accelerator each portion of the program should execute on and partitions the program into sub-programs based on this set of decisions. The interprocedural optimizer may further add instructions to the partitions to coordinate and synchronize the sub-programs as required. Each sub-program is compiled on an appropriate compiler backend for the instruction set architecture of the particular core processor or accelerator selected to execute the sub-program. The compiled sub-programs are then linked to thereby generate an executable program.
Abstract:
An apparatus and method for partitioning programs between a general purpose core and one or more accelerators are provided. With the apparatus and method, a compiler front end is provided for converting program source code in a corresponding high level programming language into an intermediate code representation. This intermediate code representation is provided to an interprocedural optimizer, which determines which core processor or accelerator each portion of the program should execute on and partitions the program into sub-programs based on this set of decisions. The interprocedural optimizer may further add instructions to the partitions to coordinate and synchronize the sub-programs as required. Each sub-program is compiled on an appropriate compiler backend for the instruction set architecture of the particular core processor or accelerator selected to execute the sub-program. The compiled sub-programs are then linked to thereby generate an executable program.
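The three abstracts above describe the same flow: front end to intermediate representation, interprocedural partitioning across core and accelerator targets, insertion of coordination code, per-target backend compilation, and linking. The sketch below illustrates that flow in Python; every class, function, and target name in it is invented for illustration and is not taken from the patents.

    # Minimal sketch of the partition-compile-link flow described above.
    # All names here are illustrative, not from the patents.
    from dataclasses import dataclass, field

    @dataclass
    class IRRegion:
        name: str
        vectorizable: bool = False          # crude stand-in for the optimizer's analysis
        calls: list = field(default_factory=list)

    def front_end(source: str) -> list[IRRegion]:
        """Convert high-level source into an intermediate representation.
        Here we just fake one region per top-level function name."""
        return [IRRegion(name=fn, vectorizable=fn.startswith("kernel"))
                for fn in source.split()]

    def interprocedural_partition(ir: list[IRRegion]) -> dict[str, list[IRRegion]]:
        """Decide, per region, whether it runs on the general purpose core
        or an accelerator, and group regions into sub-programs."""
        parts = {"core": [], "accelerator": []}
        for region in ir:
            target = "accelerator" if region.vectorizable else "core"
            parts[target].append(region)
        return parts

    def add_coordination(parts):
        """Insert placeholder synchronization stubs between the sub-programs."""
        for target, regions in parts.items():
            for r in regions:
                r.calls.append(f"sync_with_{'core' if target == 'accelerator' else 'accelerator'}")
        return parts

    def compile_backend(target: str, regions) -> str:
        """Stand-in for invoking the backend for the target's instruction set."""
        return f"{target}.o({','.join(r.name for r in regions)})"

    def link(objects) -> str:
        return "a.out[" + " + ".join(objects) + "]"

    if __name__ == "__main__":
        ir = front_end("main kernel_fft kernel_blur io_loop")
        parts = add_coordination(interprocedural_partition(ir))
        executable = link([compile_backend(t, rs) for t, rs in parts.items()])
        print(executable)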
Abstract:
A compiler includes a mechanism for employing multiple synergistic processors to execute long vectors. The compiler receives a single source program. The compiler identifies vectorizable loop code in the single source program and extracts the vectorizable loop code from the single source program. The compiler then compiles the extracted vectorizable loop code for a plurality of synergistic processors. The compiler also compiles a remainder of the single source program for a principal processor to form an executable main program such that the executable main program controls operation of the executable vectorizable loop code on the plurality of synergistic processors.
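As a rough illustration of the control structure described above, in which the main program dispatches the extracted vectorizable loop code to several processors, the following Python sketch strip-mines a long vector operation across a pool of worker processes. The chunking scheme and all names are assumptions for illustration, not the compiler's actual output.

    # Illustrative only: the extracted loop body runs per worker, and the
    # main program splits the long vectors, dispatches, and reassembles.
    from concurrent.futures import ProcessPoolExecutor

    def vector_add_chunk(args):
        """The 'extracted loop body', one chunk per worker processor."""
        a, b = args
        return [x + y for x, y in zip(a, b)]

    def main_program(a, b, num_workers=4):
        """The 'executable main program' that controls the loop code."""
        n = len(a)
        bounds = [(i * n // num_workers, (i + 1) * n // num_workers)
                  for i in range(num_workers)]
        chunks = [(a[lo:hi], b[lo:hi]) for lo, hi in bounds]
        with ProcessPoolExecutor(max_workers=num_workers) as pool:
            results = pool.map(vector_add_chunk, chunks)
        out = []
        for r in results:
            out.extend(r)
        return out

    if __name__ == "__main__":
        a = list(range(1000))
        b = list(range(1000, 2000))
        assert main_program(a, b) == [x + y for x, y in zip(a, b)]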
Abstract:
A process, compiler, computer program product and system for workload partitioning in a heterogeneous system. The process includes determining heterogeneous alignment constraints in the workload, partitioning a portion of tasks to a processing element sensitive to alignment constraints, and partitioning a remaining portion of tasks to a processing element not sensitive to alignment constraints.
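A minimal sketch of the partitioning decision follows, assuming a 16-byte alignment requirement and a simple task record with an offset and length; neither assumption comes from the abstract.

    # Hedged sketch of alignment-based workload partitioning.
    ALIGNMENT = 16  # bytes; assumed alignment requirement

    def is_aligned(task):
        """A task satisfies the alignment constraint if its data offset and
        length both fall on the assumed boundary."""
        return task["offset"] % ALIGNMENT == 0 and task["length"] % ALIGNMENT == 0

    def partition_by_alignment(tasks):
        """Send alignment-friendly work to the element that is sensitive to
        alignment constraints; everything else goes to the element that is not."""
        sensitive, insensitive = [], []
        for t in tasks:
            (sensitive if is_aligned(t) else insensitive).append(t)
        return {"alignment_sensitive_pe": sensitive,
                "alignment_insensitive_pe": insensitive}

    if __name__ == "__main__":
        tasks = [{"offset": 0, "length": 64},
                 {"offset": 8, "length": 64},
                 {"offset": 32, "length": 48}]
        print(partition_by_alignment(tasks))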
Abstract:
A system and method to efficiently pre-fetch and batch compiler-assisted software cache accesses are provided. The system and method reduce the overhead associated with software cache directory accesses. With the system and method, the local memory address of the cache line that stores the pre-fetched data is itself cached, such as in a register or well known location in local memory, so that a later data access does not need to perform address translation and software cache operations and can instead access the data directly from the software cache using the cached local memory address. This saves processor cycles that would otherwise be required to perform the address translation a second time when the data is to be used. Moreover, the system and method directly enable software cache accesses to be effectively decoupled from address translation in order to increase the overlap between computation and communication.
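The following sketch illustrates the idea of caching the translated local address: the expensive directory lookup and miss handling run once at pre-fetch time, and the returned handle is used directly afterward. The cache line size, directory layout, and all names are assumptions for illustration, not the patent's design.

    # Rough sketch: cache the local address of a pre-fetched line so a later
    # access skips the software cache directory lookup.
    LINE_SIZE = 128  # assumed software cache line size in bytes

    class SoftwareCache:
        def __init__(self):
            self.directory = {}   # global line address -> local buffer

        def _translate(self, global_addr):
            """Expensive path: directory lookup plus (simulated) miss handling."""
            line_addr = global_addr - (global_addr % LINE_SIZE)
            if line_addr not in self.directory:
                # simulate transferring the line from global memory to local store
                self.directory[line_addr] = bytearray(LINE_SIZE)
            return self.directory[line_addr], global_addr % LINE_SIZE

        def prefetch(self, global_addr):
            """Pre-fetch and return a handle (local line, offset). The caller
            keeps this handle so the later use skips translation entirely."""
            return self._translate(global_addr)

        @staticmethod
        def load(handle):
            """Cheap path: use the cached local address directly."""
            line, offset = handle
            return line[offset]

    if __name__ == "__main__":
        cache = SoftwareCache()
        h = cache.prefetch(0x1008)       # translation and miss handling happen once
        value = SoftwareCache.load(h)    # later access: no directory lookup
        print(value)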
Abstract:
The present invention provides a system for computer program code size partitioning for multiple-memory multi-processor systems. At least one system parameter of a computer system comprising one or more disparate processing nodes is identified. Computer program code comprising a program to be run on the computer system is received. A whole-program representation is generated based on the received computer program code. At least one single-entry-single-exit (SESE) region is identified based on the whole-program representation. At least one SESE region of less than a certain size (store-size-specific) is identified based on the identified SESE regions and the at least one system parameter. Each store-size-specific SESE region is grouped into a node-specific subroutine. The non-node-specific parts of the computer program code are modified based on the partitioning into node-specific subroutines. The modified computer program code, including each node-specific subroutine, is compiled based on a specified node characteristic.
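A short sketch of the outlining step follows, assuming the SESE regions and their code sizes are already known and that the relevant system parameter is a per-node store size limit; all names and the threshold are illustrative.

    # Illustrative sketch of outlining small SESE regions into node-specific
    # subroutines. Region discovery is faked; sizes and the limit are assumed.
    from dataclasses import dataclass

    @dataclass
    class SESERegion:
        name: str
        code_size: int   # bytes of generated code, per the system parameter

    def outline_small_regions(regions, store_size_limit):
        """Group every region smaller than the node's store size into its own
        node-specific subroutine and return the call stubs for the main code."""
        subroutines, stubs = [], []
        for r in regions:
            if r.code_size < store_size_limit:
                subroutines.append(f"node_sub_{r.name}")
                stubs.append(f"call node_sub_{r.name}()   # replaces inlined region")
            else:
                stubs.append(f"<keep {r.name} in main program>")
        return subroutines, stubs

    if __name__ == "__main__":
        regions = [SESERegion("init", 2_000),
                   SESERegion("hot_loop", 60_000),
                   SESERegion("cleanup", 1_500)]
        subs, main_body = outline_small_regions(regions, store_size_limit=32_768)
        print(subs)
        print("\n".join(main_body))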
Abstract:
A method to efficiently pre-fetch and batch compiler-assisted software cache accesses is provided. The method reduces the overhead associated with software cache directory accesses. With the method, the local memory address of the cache line that stores the pre-fetched data is itself cached, such as in a register or well-known location in local memory, so that a later data access does not need to perform address translation and software cache operations and can instead access the data directly from the software cache using the cached local memory address. This saves processor cycles that would otherwise be required to perform the address translation a second time when the data is to be used. Moreover, the method directly enables software cache accesses to be effectively decoupled from address translation in order to increase the overlap between computation and communication.
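To illustrate the batching aspect mentioned above, the sketch below hoists all software cache lookups for a block of accesses ahead of the computation, so the compute phase touches only already-resolved local buffers. The line size and data structures are assumptions for illustration, not the patent's design.

    # Complementary sketch for batching: resolve every access up front, then
    # compute against the cached local addresses only.
    LINE = 128
    local_store = {}          # line address -> local buffer (simulated local memory)

    def lookup(global_addr):
        """Simulated software cache directory access plus miss handling."""
        line_addr = global_addr - global_addr % LINE
        buf = local_store.setdefault(line_addr, bytearray(LINE))
        return buf, global_addr % LINE

    def batched_sum(addresses):
        # pre-fetch/translation phase, batched ahead of the computation
        handles = [lookup(a) for a in addresses]
        # computation phase: no directory accesses, only cached local addresses
        return sum(buf[off] for buf, off in handles)

    if __name__ == "__main__":
        print(batched_sum([0, 130, 260, 4096]))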
Abstract:
A computer implemented method, apparatus, and computer usable program code for eliminating redundant read-modify-write code sequences in non-vectorizable code. Code is received comprising a sequence of operations. The sequence of operations includes a loop. Non-vectorizable operations that modify at least one sub-part of a storage location are identified within the loop. The non-vectorizable operations are modified to include a single store operation for the sub-parts of the storage location.
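The effect of the transformation can be pictured as follows: instead of a read-modify-write per sub-part, the sub-parts are assembled first and written with a single store. The 4-byte word and bytearray-backed memory in this sketch are assumptions for illustration, not the patent's representation.

    # Sketch of the before/after shape of the transformation.
    def store_subparts_naive(memory, word_addr, subparts):
        """One read-modify-write per sub-part (the redundant sequence)."""
        for i, b in enumerate(subparts):
            word = bytearray(memory[word_addr:word_addr + 4])  # read
            word[i] = b                                        # modify
            memory[word_addr:word_addr + 4] = word             # write

    def store_subparts_single(memory, word_addr, subparts):
        """Build the full word first, then perform a single store."""
        word = bytearray(memory[word_addr:word_addr + 4])
        for i, b in enumerate(subparts):
            word[i] = b
        memory[word_addr:word_addr + 4] = word                 # one store

    if __name__ == "__main__":
        mem_a, mem_b = bytearray(8), bytearray(8)
        store_subparts_naive(mem_a, 0, [1, 2, 3, 4])
        store_subparts_single(mem_b, 0, [1, 2, 3, 4])
        assert mem_a == mem_b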