Abstract:
Various embodiments are directed to a heterogeneous processor architecture comprising a CPU and a GPU on the same processor die. The heterogeneous processor architecture may optimize source code in a GPU compiler using vector strip mining, which reduces instructions of arbitrary vector lengths to GPU-supported vector lengths, together with loop peeling. The source code may first be determined to be eligible for optimization if more than one machine code instruction of the compiled source code under-utilizes the GPU's instruction bandwidth. The initial vector strip mining results may be discarded and the first iteration of the inner loop body may be peeled out of the loop. The types of the operands in the source code may be lowered, and the peeled inner loop body may be vector strip mined again to obtain the optimized source code.
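As an illustration of the strip-mining-plus-peeling idea described above, the following C++ sketch splits a loop of arbitrary trip count into chunks of a supported vector width and peels the leftover iterations into a scalar remainder loop. The width VLEN, the add function, and the inner scalar loop standing in for one vector instruction are all assumptions made for illustration, not details taken from the abstract.

    // Minimal sketch of vector strip mining with loop peeling.
    // VLEN and the float arrays are illustrative assumptions.
    #include <cstddef>

    constexpr std::size_t VLEN = 4;  // hypothetical GPU-supported vector width

    void add(const float* a, const float* b, float* c, std::size_t n) {
        std::size_t i = 0;
        // Strip-mined main loop: each iteration covers one full hardware vector.
        for (; i + VLEN <= n; i += VLEN) {
            for (std::size_t j = 0; j < VLEN; ++j)  // stands in for one vector op
                c[i + j] = a[i + j] + b[i + j];
        }
        // Peeled remainder: the iterations that do not fill a whole vector.
        for (; i < n; ++i)
            c[i] = a[i] + b[i];
    }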
Abstract:
A vector data access unit for accessing data stored within a data store in response to decoded vector data access instructions is disclosed. Each vector data access instruction comprises a plurality of elements indicating data accesses to be performed, the elements appearing within the instruction in the order in which the corresponding data accesses are instructed to be performed. The vector data access unit comprises data access ordering circuitry for issuing the data access requests indicated by the elements to the data store. The data access ordering circuitry responds to receipt of at least two decoded vector data access instructions, where an earlier of the instructions is received before a later of the instructions and one of the instructions is a write instruction, and to an indication that data accesses from the instructions can be interleaved to a limited extent, by being configured to: determine for each of the instructions, from the positions of the elements within the plurality of elements, which of the data accesses indicated by the elements is the next data access to be performed, the data accesses being performed in the instructed order; determine the element indicating the next data access for each of the instructions; and select one of the next data accesses as the next data access to be issued to the data store, in dependence upon the order in which the instructions were received and the positions of the elements indicating the next data accesses relative to each other within their respective pluralities of elements, subject to a constraint that the difference between the numerical position of the element indicating the next data access of the later instruction and the numerical position of the element indicating the next data access of the earlier instruction is less than a predetermined value.
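The following C++ sketch models one possible reading of that interleaving constraint: accesses from the later instruction may be issued only while its element position stays within a window of the earlier instruction's progress. The selection policy, the WINDOW value, and all names here are illustrative assumptions; the abstract does not specify them.

    // Hedged sketch of limited interleaving between an earlier and a
    // later vector instruction's element-by-element data accesses.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Access { std::size_t addr; bool is_write; };

    constexpr std::size_t WINDOW = 2;  // stands in for the "predetermined value"

    // Issue one access to the (modelled) data store.
    void issue(const Access& a) {
        std::printf("%s 0x%zx\n", a.is_write ? "store" : "load", a.addr);
    }

    void interleave(const std::vector<Access>& earlier,
                    const std::vector<Access>& later) {
        std::size_t e = 0, l = 0;  // next element index of each instruction
        while (e < earlier.size() || l < later.size()) {
            // The later instruction may only issue while its element index
            // stays within WINDOW of the earlier instruction's progress.
            bool later_ok = l < later.size() &&
                            (e >= earlier.size() || l < e + WINDOW);
            // Prefer the element at the smaller position; ties and
            // out-of-window cases fall back to the earlier instruction.
            if (later_ok && (e >= earlier.size() || l < e)) {
                issue(later[l++]);
            } else if (e < earlier.size()) {
                issue(earlier[e++]);
            } else {
                issue(later[l++]);
            }
        }
    }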
Abstract:
The present invention relates to a method for compiling code for a multi-core processor, comprising: detecting and optimizing a loop; partitioning the loop into partitions that are executable and mappable on the physical hardware with optimal instruction-level parallelism; optimizing the loop iterations and/or loop counter for ideal mapping onto the hardware; chaining the loop partitions; and generating a list representing the execution sequence of the partitions.
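A minimal C++ sketch of the partition-and-chain step, under the assumption that partitions are contiguous ranges of the iteration space and the "list representing the execution sequence" is simply the ordered list of partition descriptors; none of these names come from the patent.

    // Illustrative only: partition a loop's iteration space and record
    // the execution sequence of the partitions as an ordered list.
    #include <cstddef>
    #include <vector>

    struct Partition { std::size_t begin, end; };  // [begin, end) iterations

    std::vector<Partition> partition_loop(std::size_t trip_count,
                                          std::size_t per_partition) {
        std::vector<Partition> chain;  // execution-sequence list of partitions
        for (std::size_t i = 0; i < trip_count; i += per_partition) {
            std::size_t end = i + per_partition < trip_count
                                  ? i + per_partition : trip_count;
            chain.push_back({i, end});  // chained in execution order
        }
        return chain;
    }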
Abstract:
A computing device-implemented method includes receiving a program created by a technical computing environment, analyzing the program, generating multiple program portions based on the analysis of the program, dynamically allocating the multiple program portions to multiple software units of execution for parallel programming, receiving multiple results associated with the multiple program portions from the multiple software units of execution, and providing the multiple results or a single result to the program.
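The sketch below illustrates that allocate-execute-gather pattern in C++, using std::async tasks as stand-in "software units of execution" and a summing run_portion as a placeholder program portion; both are assumptions, not the claimed mechanism of the technical computing environment.

    // Portions are dispatched to concurrent tasks; their results are
    // gathered and combined into a single result.
    #include <future>
    #include <numeric>
    #include <vector>

    int run_portion(const std::vector<int>& portion) {
        return std::accumulate(portion.begin(), portion.end(), 0);
    }

    int main() {
        std::vector<std::vector<int>> portions = {{1, 2}, {3, 4}, {5, 6}};
        std::vector<std::future<int>> futures;
        for (const auto& p : portions)        // dynamically allocate portions
            futures.push_back(std::async(std::launch::async, run_portion, p));
        int single_result = 0;
        for (auto& f : futures)               // receive the multiple results
            single_result += f.get();
        return single_result == 21 ? 0 : 1;   // provide the single result
    }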
Abstract:
A loop can be executed on a parallel processor by partitioning the loop iterations into chunks of decreasing size. An increase in speed can be realized by reducing the time a thread takes to determine the next set of iterations assigned to it. The next set of iterations can be determined from a chunk index stored in a shared variable. Using a shared variable enables threads to perform operations concurrently, reducing a thread's wait time to only the period during which another thread increments the shared variable.
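A C++ sketch of this scheme, assuming chunk sizes are precomputed to decrease and a single atomic shared variable holds the chunk index; the halving rule in build_chunks and all names are illustrative, not taken from the abstract.

    // One atomic increment of the shared chunk index is the only
    // serialized step a thread performs per chunk.
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <utility>
    #include <vector>

    std::atomic<std::size_t> chunk_index{0};                   // shared variable
    std::vector<std::pair<std::size_t, std::size_t>> chunks;   // precomputed

    void build_chunks(std::size_t n, std::size_t threads) {
        // Decreasing chunk sizes: each chunk takes a fraction of the
        // remaining iterations, with a floor of one iteration.
        std::size_t lo = 0;
        while (lo < n) {
            std::size_t size = (n - lo) / (2 * threads);
            if (size == 0) size = 1;
            chunks.push_back({lo, lo + size});
            lo += size;
        }
    }

    void worker(void (*body)(std::size_t)) {
        for (;;) {
            std::size_t c = chunk_index.fetch_add(1);  // claim next chunk
            if (c >= chunks.size()) return;
            for (std::size_t i = chunks[c].first; i < chunks[c].second; ++i)
                body(i);
        }
    }

    int main() {
        build_chunks(1000, 4);
        std::vector<std::thread> pool;
        for (int t = 0; t < 4; ++t)
            pool.emplace_back(worker, +[](std::size_t) { /* loop body */ });
        for (auto& th : pool) th.join();
    }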
Abstract:
Systems and methods generate code from a source program, where the generated code may be compiled and executed on a Graphics Processing Unit (GPU). A parallel loop analysis check may be performed on regions of the source program identified for parallelization. One or more optimizations that convert mathematical operations into a parallel form may also be applied to the source program. The source program may be partitioned into segments for execution on a host and a device. Kernels may be created for the segments to be executed on the device. The sizes of the kernels may be determined, and memory transfers between the host and the device may be optimized.
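A conceptual host-side C++ sketch of the host/device partitioning and the transfer optimization: a transfer is inserted only when execution crosses the host/device boundary, so consecutive device kernels reuse data already resident on the device. Segment, copy_to_device, and copy_to_host are assumed placeholders, not the generated API.

    // Segments marked on_device run as kernels; transfers happen only at
    // host/device boundary crossings rather than around every segment.
    #include <vector>

    struct Segment { bool on_device; void (*run)(std::vector<float>&); };

    void copy_to_device(std::vector<float>&) { /* placeholder transfer */ }
    void copy_to_host(std::vector<float>&)   { /* placeholder transfer */ }

    void execute(std::vector<Segment>& program, std::vector<float>& data) {
        bool on_device = false;
        for (auto& seg : program) {
            if (seg.on_device && !on_device) { copy_to_device(data); on_device = true; }
            if (!seg.on_device && on_device) { copy_to_host(data);  on_device = false; }
            seg.run(data);
        }
        if (on_device) copy_to_host(data);  // final result returns to the host
    }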
Abstract:
Extracting a system architecture in high level synthesis includes determining a first function of a high level programming language description and a second function contained within a control flow construct of the high level programming language description. The second function is determined to be a data-consuming function of the first function. Within a circuit design, a port including a local memory is automatically generated. The port couples a first circuit block implementation of the first function to a second circuit block implementation of the second function within the circuit design.
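A software model (in C++, standing in for the synthesized hardware) of the generated port: the producer and consumer functions communicate only through the port's local memory, with the consumer invoked inside a loop as the enclosing control flow construct. The Port type and its depth of 16 are assumptions for illustration.

    // The port's local memory is the sole coupling between the two
    // functions, mirroring the two circuit blocks joined by the port.
    #include <array>
    #include <cstddef>

    struct Port { std::array<int, 16> local_mem; };  // models the port's RAM

    void producer(Port& p) {                 // first circuit block
        for (std::size_t i = 0; i < p.local_mem.size(); ++i)
            p.local_mem[i] = static_cast<int>(i) * 2;
    }

    int consumer(const Port& p) {            // second block, data consumer
        int sum = 0;
        for (int v : p.local_mem) sum += v;
        return sum;
    }

    int main() {
        Port port{};                              // the generated port
        for (int iter = 0; iter < 4; ++iter) {    // control flow construct
            producer(port);
            if (consumer(port) < 0) return 1;
        }
        return 0;
    }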