摘要:
Systems, methods and computer products for cross-thread scheduling. Exemplary embodiments include a cross thread scheduling method for compiling code, the method including scheduling a scheduling unit with a scheduler sub-operation in response to the scheduling unit being in a non-multithreaded part of the code and scheduling the scheduling unit with a cross-thread scheduler sub-operation in response to the scheduling unit being in a multithreaded part of the code.
摘要:
Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
摘要:
Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
摘要:
One embodiment of a microprocessor core capable of executing a plurality of threads substantially simultaneously includes a plurality of register resources available for use by the threads, where the register resources are fewer in number than the number threads multiplied by a number of architectural register resources required per thread, and a supervisor for allocating the register resources among the plurality of threads.
摘要:
Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated based on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.
摘要:
A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=−1.
摘要:
Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated based on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.
摘要:
A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=−1.
摘要:
A method (and structure) for performing a matrix subroutine, includes storing data for a matrix subroutine call in a computer memory in an increment block size that is based on a cache size.
摘要:
Apparatus for evaluating the performance of DMA-based algorithmic tasks on a target multi-core processing system includes a memory and at least one processor coupled to the memory. The processor is operative: to input a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; to evaluate performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and to provide results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.