Method and system for reducing memory reference overhead associated with threadprivate variables in parallel programs
    11.
    Granted patent (status: lapsed)

    Publication No.: US07590977B2

    Publication date: 2009-09-15

    Application No.: US11250833

    Filing date: 2005-10-13

    IPC classification: G06F9/45

    CPC classification: G06F8/445 G06F8/443 G06F8/453

    Abstract: A computer implemented method, system and computer program product for accessing threadprivate memory for threadprivate variables in a parallel program during program compilation. A computer implemented method for accessing threadprivate variables in a parallel program during program compilation includes aggregating threadprivate variables in the program, replacing references of the threadprivate variables by indirect references, moving address load operations of the threadprivate variables, and replacing the address load operations of the threadprivate variables by calls to runtime routines to access the threadprivate memory. The invention enables a compiler to minimize the runtime routines call times to access the threadprivate variables, thus improving program performance.

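The effect of aggregating threadprivate variables and moving the address load can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented compiler pass: `get_threadprivate_block`, `naive_sum`, `hoisted_sum` and `count_calls` are hypothetical names, and a Python `threading.local` object stands in for the threadprivate memory managed by a real runtime.

```python
import threading

_tls = threading.local()
runtime_calls = 0  # number of calls made to the hypothetical runtime routine


def get_threadprivate_block():
    """Hypothetical runtime routine: return the calling thread's private block."""
    global runtime_calls
    runtime_calls += 1
    if not hasattr(_tls, "block"):
        # The threadprivate variables are aggregated into one per-thread record.
        _tls.block = {"counter": 0, "total": 0}
    return _tls.block


def naive_sum(values):
    """Untransformed form: every threadprivate reference calls the runtime."""
    get_threadprivate_block()["counter"] = 0
    get_threadprivate_block()["total"] = 0
    for v in values:
        get_threadprivate_block()["counter"] += 1
        get_threadprivate_block()["total"] += v
    return get_threadprivate_block()["total"]


def hoisted_sum(values):
    """Transformed form: one address load, moved out of the loop."""
    block = get_threadprivate_block()  # single runtime call
    block["counter"] = 0               # indirect references through the block
    block["total"] = 0
    for v in values:
        block["counter"] += 1
        block["total"] += v
    return block["total"]


def count_calls(fn, values):
    """Run fn and report (result, number of runtime calls it made)."""
    global runtime_calls
    runtime_calls = 0
    result = fn(values)
    return result, runtime_calls
```

Both functions compute the same sum; only the number of runtime calls differs, which is the cost the abstract says the compiler tries to minimize.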

    Sparse vectorization without hardware gather / scatter
    15.
    Patent application (status: lapsed)

    Publication No.: US20080092125A1

    Publication date: 2008-04-17

    Application No.: US11549172

    Filing date: 2006-10-13

    IPC classification: G06F9/45

    CPC classification: G06F8/447

    Abstract: A target operation in a normalized target loop, susceptible of vectorization and which may, after compilation into a vectorized form, seek to operate on data in nonconsecutive physical memory, is identified in source code. Hardware instructions are inserted into executable code generated from the source code, directing a system that will run the executable code to create a representation of the data in consecutive physical memory. A vector loop containing the target operation is replaced, in the executable code, with a function call to a vector library to call a vector function that will operate on the representation to generate a result identical to output expected from executing the vector loop containing the target operation. On execution, a representation of data residing in nonconsecutive physical memory is created in consecutive physical memory, and the vectorized target operation is applied to the representation to process the data.

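The idea of packing nonconsecutive data into a consecutive buffer before calling a vector routine can be sketched in software. This is a minimal illustration, not the code a compiler would emit: `vector_sqrt`, `sparse_apply` and `scalar_reference` are hypothetical names, `vector_sqrt` stands in for a vector-library function, and writing the results back to their original positions is an assumption the abstract does not spell out.

```python
import math


def vector_sqrt(packed):
    """Stand-in for a vector-library routine that expects contiguous input."""
    return [math.sqrt(x) for x in packed]


def sparse_apply(data, indices, vector_fn):
    """Apply vector_fn to data[indices] via a contiguous temporary buffer."""
    # Gather: build a consecutive representation of the scattered elements.
    packed = [data[i] for i in indices]
    # Operate on the consecutive representation with the vector function.
    result = vector_fn(packed)
    # Scatter (assumed step): write results back to the original positions.
    for i, r in zip(indices, result):
        data[i] = r
    return data


def scalar_reference(data, indices):
    """The original indexed loop that the packed version must reproduce."""
    for i in indices:
        data[i] = math.sqrt(data[i])
    return data
```

The packed version is only valid if it produces the same output as the original indexed loop, which is the equivalence the abstract requires of the replacement function call.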

    Sparse vectorization without hardware gather/scatter
    16.
    Granted patent (status: lapsed)

    Publication No.: US08191056B2

    Publication date: 2012-05-29

    Application No.: US11549172

    Filing date: 2006-10-13

    IPC classification: G06F9/45

    CPC classification: G06F8/447

    Abstract: A target operation in a normalized target loop, susceptible of vectorization and which may, after compilation into a vectorized form, seek to operate on data in nonconsecutive physical memory, is identified in source code. Hardware instructions are inserted into executable code generated from the source code, directing a system that will run the executable code to create a representation of the data in consecutive physical memory. A vector loop containing the target operation is replaced, in the executable code, with a function call to a vector library to call a vector function that will operate on the representation to generate a result identical to output expected from executing the vector loop containing the target operation. On execution, a representation of data residing in nonconsecutive physical memory is created in consecutive physical memory, and the vectorized target operation is applied to the representation to process the data.


    Code generation for complex arithmetic reduction for architectures lacking cross data-path support
    17.
    Patent application (status: in force)

    Publication No.: US20080092124A1

    Publication date: 2008-04-17

    Application No.: US11548851

    Filing date: 2006-10-12

    IPC classification: G06F9/45

    CPC classification: G06F8/445 G06F8/45

    Abstract: A computer implemented method, apparatus, and computer usable program code for compiling source code for performing a complex operation followed by a complex reduction operation. A method is determined for generating executable code for performing the complex operation and the complex reduction operation. Executable code is generated for computing sub-products, reducing the sub-products to intermediate results, and summing the intermediate results to generate a final result in response to a determination that a reduced single instruction multiple data method is appropriate.

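One plausible reading of "computing sub-products, reducing the sub-products to intermediate results, and summing the intermediate results" is a complex dot product in which the four real sub-products are accumulated separately and only combined at the end. The sketch below follows that reading; it is an interpretation, not the patented code generator, and `complex_dot_subproducts` is a hypothetical name.

```python
def complex_dot_subproducts(a, b):
    """Sum of a[i] * b[i] for complex inputs, using four real accumulators."""
    # Intermediate results: one running sum per kind of sub-product, so no
    # value has to move between the real and imaginary parts inside the loop.
    s_rr = s_ii = s_ri = s_ir = 0.0
    for x, y in zip(a, b):
        s_rr += x.real * y.real  # real * real
        s_ii += x.imag * y.imag  # imag * imag
        s_ri += x.real * y.imag  # real * imag
        s_ir += x.imag * y.real  # imag * real
    # Final step: combine the intermediate results once, after the loop.
    return complex(s_rr - s_ii, s_ri + s_ir)
```

Deferring the subtraction and addition to the end keeps the loop body free of cross-lane operations, which is the property that matters on hardware lacking cross data-path support.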

    Code generation for complex arithmetic reduction for architectures lacking cross data-path support
    18.
    Granted patent (status: in force)

    Publication No.: US08423979B2

    Publication date: 2013-04-16

    Application No.: US11548851

    Filing date: 2006-10-12

    IPC classification: G06F9/45

    CPC classification: G06F8/445 G06F8/45

    Abstract: A computer implemented method, apparatus, and computer usable program code for compiling source code for performing a complex operation followed by a complex reduction operation. A method is determined for generating executable code for performing the complex operation and the complex reduction operation. Executable code is generated for computing sub-products, reducing the sub-products to intermediate results, and summing the intermediate results to generate a final result in response to a determination that a reduced single instruction multiple data method is appropriate.


    Aggregate bandwidth through management using insertion of reset instructions for cache-to-cache data transfer
    19.
    Granted patent (status: lapsed)

    Publication No.: US07168070B2

    Publication date: 2007-01-23

    Application No.: US10853304

    Filing date: 2004-05-25

    IPC classification: G06F9/45 G06F13/00

    Abstract: A method and system for reducing or avoiding store misses with a data cache block zero (DCBZ) instruction in cooperation with the underlying hardware load stream prefetching support for helping to increase effective aggregate bandwidth. The method identifies and classifies unique streams in a loop based on dependency and reuse analysis, and performs loop transformations, such as node splitting, loop distribution or stream unrolling to get the proper number of streams. Static prediction and run-time profile information are used to guide loop and stream selection. Compile-time loop cost analysis and run-time check code and versioning are used to determine the number of cache lines ahead of each reference for data cache line zeroing and to tolerate required data alignment relative to data cache lines.
