SIMD compare instruction using permute logic for distributed register files
    2.
    Granted invention patent
    Status: in force

    Publication No.: US09575753B2

    Publication date: 2017-02-21

    Application No.: US13420699

    Filing date: 2012-03-15

    IPC classification: G06F9/30 G06F9/38

    Abstract: Mechanisms, in a data processing system comprising a single instruction multiple data (SIMD) processor, for performing a data dependency check operation on vector element values of at least two input vector registers are provided. Two calls to a simd-check instruction are performed, one with input vector registers having a first order and one with the input vector registers having a different order. The simd-check instruction performs comparisons to determine if any data dependencies are present. Results of the two calls to the simd-check instruction are obtained and used to determine if any data dependencies are present in the at least two input vector registers. Based on the results, the SIMD processor may perform various operations.

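The two-call pattern in the abstract can be illustrated with a small scalar model. This is only a sketch: the abstract does not define what a single simd-check compares, so the one-directional rule used here (lane i of the first operand against earlier lanes of the second) and the names `simd_check` and `has_dependency` are assumptions made for illustration.

```python
# Hypothetical model of a simd-check instruction; the real semantics may differ.
def simd_check(va, vb):
    """Lane i is True when va[i] equals some vb[j] in an earlier lane (j < i)."""
    return [any(va[i] == vb[j] for j in range(i)) for i in range(len(va))]


def has_dependency(va, vb):
    """Issue the check twice, swapping the operand order, and merge the results."""
    first = simd_check(va, vb)   # va lanes against earlier vb lanes
    second = simd_check(vb, va)  # vb lanes against earlier va lanes
    return any(first) or any(second)


if __name__ == "__main__":
    print(has_dependency([0, 1, 2, 3], [4, 5, 6, 7]))   # disjoint values
    print(has_dependency([3, 8, 9, 10], [0, 1, 2, 3]))  # found by the swapped call
```

Under this assumed rule a single call sees only one direction of overlap, which is why the second call with the operands swapped is needed to cover the other direction.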

    Efficient software cache accessing with handle reuse
    3.
    Granted invention patent
    Status: in force

    Publication No.: US08819651B2

    Publication date: 2014-08-26

    Application No.: US12177543

    Filing date: 2008-07-22

    IPC classification: G06F9/45

    CPC classification: G06F8/4442

    Abstract: A mechanism for efficient software cache accessing with handle reuse is provided. The mechanism groups references in source code into a reference stream with the reference stream having a size equal to or less than a size of a software cache line. The source code is transformed into optimized code by modifying the source code to include code for performing at most two cache lookup operations for the reference stream to obtain two cache line handles. Moreover, the transformation involves inserting code to resolve references in the reference stream based on the two cache line handles. The optimized code may be output for generation of executable code.

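Because a reference stream is no larger than one cache line, it can touch at most two adjacent lines, so two lookups are enough. The sketch below shows the shape of the code such a transformation might emit. `SoftwareCache`, `read_stream`, and the 128-byte `LINE_SIZE` are hypothetical names and values, not taken from the patent.

```python
LINE_SIZE = 128  # bytes per software cache line (assumed value)


class SoftwareCache:
    """Toy software cache: a handle is (line base address, cached line bytes)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.lookups = 0  # counts the expensive lookup operations

    def lookup(self, addr):
        self.lookups += 1
        base = addr - addr % LINE_SIZE
        if base not in self.lines:  # miss: fetch the line from backing memory
            self.lines[base] = bytes(self.memory[base:base + LINE_SIZE])
        return base, self.lines[base]


def read_stream(cache, addrs):
    """Resolve every address of one reference stream with at most two lookups."""
    lo, hi = min(addrs), max(addrs)
    if hi - lo >= LINE_SIZE:
        raise ValueError("reference stream spans more than one cache line size")
    base0, line0 = cache.lookup(lo)
    # The second lookup is skipped when the whole stream sits in the first line.
    base1, line1 = (base0, line0) if hi < base0 + LINE_SIZE else cache.lookup(hi)
    values = []
    for a in addrs:
        # Reuse a handle: pick whichever of the two lines contains the address.
        base, line = (base0, line0) if a < base0 + LINE_SIZE else (base1, line1)
        values.append(line[a - base])
    return values
```

Without handle reuse, each of the references would pay for its own lookup. Here the per-reference cost reduces to one comparison and one indexed load.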

    Complex matrix multiplication operations with data pre-conditioning in a high performance computing architecture
    4.
    Granted invention patent
    Status: lapsed

    Publication No.: US08650240B2

    Publication date: 2014-02-11

    Application No.: US12542324

    Filing date: 2009-08-17

    IPC classification: G06F7/52

    Abstract: Mechanisms for performing a complex matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the complex matrix multiplication operation to a first target vector register. The first vector operand comprises a real and imaginary part of a first complex vector value. A complex load and splat operation is performed to load a second complex vector value of a second vector operand and replicate the second complex vector value within a second target vector register. The second complex vector value has a real and imaginary part. A cross multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the complex matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored in a result vector register.

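The load, splat, and cross multiply add sequence can be modeled with plain lists standing in for vector registers. This is a scalar sketch under assumptions: the interleaved (real, imaginary) register layout and the function names `vector_load`, `load_and_splat`, and `cross_multiply_add` are illustrative, and real hardware may split the cross multiply add into more than one instruction.

```python
def vector_load(column):
    """Load a column of complex values as interleaved [re0, im0, re1, im1, ...]."""
    reg = []
    for z in column:
        reg += [z.real, z.imag]
    return reg


def load_and_splat(z, pairs):
    """Replicate one complex value across every (re, im) pair of a register."""
    return [z.real, z.imag] * pairs


def cross_multiply_add(acc, va, vb):
    """acc += va * vb, treating each (re, im) pair as one complex number."""
    out = list(acc)
    for k in range(0, len(va), 2):
        ar, ai, br, bi = va[k], va[k + 1], vb[k], vb[k + 1]
        out[k] += ar * br - ai * bi      # real part
        out[k + 1] += ar * bi + ai * br  # imaginary part (the "cross" terms)
    return out


def complex_matmul(A, B):
    """C = A @ B, built column by column from accumulated partial products."""
    n, inner, m = len(A), len(B), len(B[0])
    C = [[0j] * m for _ in range(n)]
    for j in range(m):
        acc = [0.0] * (2 * n)  # result vector register for column j
        for k in range(inner):
            va = vector_load([A[i][k] for i in range(n)])
            vb = load_and_splat(B[k][j], n)
            acc = cross_multiply_add(acc, va, vb)
        for i in range(n):
            C[i][j] = complex(acc[2 * i], acc[2 * i + 1])
    return C
```

Each pass of the inner loop produces one partial product for a whole column of the result, which is what makes the splat of a single value from the second operand useful.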

    Write-through cache optimized for dependence-free parallel regions
    5.
    Granted invention patent
    Status: in force

    Publication No.: US08627010B2

    Publication date: 2014-01-07

    Application No.: US13604349

    Filing date: 2012-09-05

    IPC classification: G06F12/00

    CPC classification: G06F12/0837

    Abstract: An apparatus and computer program product for improving performance of a parallel computing system. A first hardware local cache controller associated with a first local cache memory device of a first processor detects an occurrence of a false sharing of a first cache line by a second processor running the program code and allows the false sharing of the first cache line by the second processor. The false sharing of the first cache line occurs upon updating a first portion of the first cache line in the first local cache memory device by the first hardware local cache controller and subsequent updating a second portion of the first cache line in a second local cache memory device by a second hardware local cache controller.

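A toy software model can show why tolerating false sharing is harmless when the parallel region is dependence-free. Everything here is an assumption made for illustration: `SharedLine`, `LocalCache`, the sharer set used for detection, and the per-element write-through are not described in the abstract, which covers hardware controllers.

```python
class SharedLine:
    """One cache line in shared memory plus the set of caches holding a copy."""

    def __init__(self, size):
        self.data = [0] * size
        self.sharers = set()


class LocalCache:
    def __init__(self, name, shared):
        self.name = name
        self.shared = shared
        self.line = None
        self.false_sharing_seen = 0

    def load(self):
        self.line = list(self.shared.data)
        self.shared.sharers.add(self.name)

    def write(self, offset, value):
        if self.shared.sharers - {self.name}:
            # Another cache holds the same line: note it, but do not invalidate.
            self.false_sharing_seen += 1
        self.line[offset] = value         # update the local copy
        self.shared.data[offset] = value  # write through only the changed portion


if __name__ == "__main__":
    shared = SharedLine(8)
    c0, c1 = LocalCache("cpu0", shared), LocalCache("cpu1", shared)
    c0.load()
    c1.load()
    c0.write(0, 11)  # first portion of the line
    c1.write(4, 22)  # second portion of the same line
    print(shared.data)
```

Each local copy is stale in the portion the other processor wrote, but in a dependence-free region neither processor reads that portion, and shared memory still ends up with both updates.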

    Efficient Enqueuing of Values in SIMD Engines with Permute Unit
    7.
    Invention patent application
    Status: under examination (published)

    Publication No.: US20130151822A1

    Publication date: 2013-06-13

    Application No.: US13315596

    Filing date: 2011-12-09

    IPC classification: G06F9/38

    Abstract: Mechanisms, in a data processing system having a processor, for generating enqueued data for performing computations of a conditional branch of code are provided. Mask generation logic of the processor operates to generate a mask representing a subset of iterations of a loop of the code that results in a condition of the conditional branch being satisfied. The mask is used to select data elements from an input data element vector register corresponding to the subset of iterations of the loop of the code that result in the condition of the conditional branch being satisfied. Furthermore, the selected data elements are used to perform computations of the conditional branch of code. Iterations of the loop of the code that do not result in the condition of the conditional branch being satisfied are not used as a basis for performing computations of the conditional branch of code.

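The mask-and-enqueue flow can be sketched in scalar code. The names `generate_mask`, `select`, and `run_loop`, the four-lane `VECTOR_WIDTH`, and the policy of draining the queue whenever a full vector is available are assumptions for illustration. The abstract specifies only that the mask selects the elements whose iterations satisfy the branch condition.

```python
VECTOR_WIDTH = 4  # lanes per SIMD register (assumed)


def generate_mask(values, condition):
    """One mask bit per loop iteration: True where the branch condition holds."""
    return [bool(condition(v)) for v in values]


def select(values, mask):
    """Keep only the lanes whose mask bit is set (a compress-style permute)."""
    return [v for v, m in zip(values, mask) if m]


def run_loop(data, condition, branch_body):
    queue, results = [], []
    for start in range(0, len(data), VECTOR_WIDTH):
        lanes = data[start:start + VECTOR_WIDTH]
        queue += select(lanes, generate_mask(lanes, condition))
        while len(queue) >= VECTOR_WIDTH:  # a full vector of useful work is ready
            full, queue = queue[:VECTOR_WIDTH], queue[VECTOR_WIDTH:]
            results += [branch_body(v) for v in full]
    results += [branch_body(v) for v in queue]  # drain the partial tail
    return results


if __name__ == "__main__":
    print(run_loop(list(range(16)), lambda x: x % 3 == 0, lambda x: x * x))
```

The branch body then always runs on densely packed vectors, and iterations that fail the condition never occupy a lane.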

    Optimizing scalar code executed on a SIMD engine by alignment of SIMD slots
    8.
    Granted invention patent
    Status: lapsed

    Publication No.: US08370817B2

    Publication date: 2013-02-05

    Application No.: US12127491

    Filing date: 2008-05-27

    IPC classification: G06F9/45 G06F15/00

    Abstract: A mechanism is provided for optimizing scalar code executed on a single instruction multiple data (SIMD) engine by aligning the slots of SIMD registers. With the mechanism, a compiler is provided that parses source code and, for each statement in the program, generates an expression tree. The compiler inspects all storage inputs to scalar operations in the expression tree to determine their alignment in the SIMD registers. This alignment is propagated up the expression tree from the leaves. When the alignments of two operands in the expression tree are the same, the resulting alignment is the shared value. When the alignments of two operands in the expression tree are different, one operand is shifted. For shifted operands, a shift operation is inserted in the expression tree. The executable code is then generated for the expression tree and shifts are inserted where indicated.

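The bottom-up alignment propagation described in the abstract can be sketched as a small tree rewrite. The node classes `Load`, `Op`, and `Shift` and the choice to always shift the right operand are assumptions made here. The abstract says only that one operand is shifted when the alignments differ.

```python
from dataclasses import dataclass


@dataclass
class Load:
    name: str
    slot: int  # SIMD slot the scalar occupies after a vector load


@dataclass
class Op:
    op: str
    left: object
    right: object


@dataclass
class Shift:
    child: object
    from_slot: int
    to_slot: int


def align(node):
    """Return (rewritten tree, slot of the result), propagating from the leaves."""
    if isinstance(node, Load):
        return node, node.slot
    left, left_slot = align(node.left)
    right, right_slot = align(node.right)
    if left_slot != right_slot:
        # Alignments differ: shift one operand (here, arbitrarily, the right one).
        right = Shift(right, right_slot, left_slot)
    return Op(node.op, left, right), left_slot


if __name__ == "__main__":
    tree = Op("+", Load("a", 1), Op("*", Load("b", 1), Load("c", 3)))
    print(align(tree))
```

In the usage example only `c` needs a shift: once it is moved to slot 1, the product and the sum both inherit slot 1 and no further shifts are inserted.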

    WRITE-THROUGH CACHE OPTIMIZED FOR DEPENDENCE-FREE PARALLEL REGIONS
    9.
    Invention patent application
    Status: in force

    Publication No.: US20120331232A1

    Publication date: 2012-12-27

    Application No.: US13604349

    Filing date: 2012-09-05

    IPC classification: G06F12/08

    CPC classification: G06F12/0837

    Abstract: An apparatus and computer program product for improving performance of a parallel computing system. A first hardware local cache controller associated with a first local cache memory device of a first processor detects an occurrence of a false sharing of a first cache line by a second processor running the program code and allows the false sharing of the first cache line by the second processor. The false sharing of the first cache line occurs upon updating a first portion of the first cache line in the first local cache memory device by the first hardware local cache controller and subsequent updating a second portion of the first cache line in a second local cache memory device by a second hardware local cache controller.


    Method using SLP packing with statements having both isomorphic and non-isomorphic expressions
    10.
    Granted invention patent
    Status: lapsed

    Publication No.: US08266587B2

    Publication date: 2012-09-11

    Application No.: US11964324

    Filing date: 2007-12-26

    IPC classification: G06F9/44

    CPC classification: G06F8/456

    Abstract: Disclosure for using SLP in processing a plurality of statements, wherein the statements are associated with an array having a number of array positions, and each statement includes one or more expressions. Expressions are gathered for each of the statements into a structure comprising a single merge stream furnished with a location for each expression. The location for a given expression is associated with one of the array positions. A plurality of expressions are selectively identified and SLP packing operations are applied to the identified expressions to merge into one or more isomorphic sub-streams. Expressions of the isomorphic sub-streams and other expressions of the single merge stream are combined into a number of input vectors that are substantially equal in length to one another. A location vector is generated that contains the respective locations for all of the expressions in the single merge stream.

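A heavily simplified model may help make the merge stream and location vector concrete. It assumes that each statement is a single binary expression tagged with its array position and that expressions with the same operator count as isomorphic. It also keeps one location list per sub-stream, whereas the abstract describes a single location vector for the whole merge stream. The names `pack` and `execute` are hypothetical.

```python
import operator

VECTOR_OPS = {"+": operator.add, "*": operator.mul}


def pack(statements):
    """Group a merge stream of (location, op, x, y) into isomorphic sub-streams."""
    substreams = {}
    for location, op, x, y in statements:
        substreams.setdefault(op, []).append((location, x, y))
    return substreams


def execute(statements, size):
    """Evaluate each sub-stream as one vector operation, then scatter by location."""
    result = [None] * size
    for op, members in pack(statements).items():
        locations = [loc for loc, _, _ in members]  # locations for this sub-stream
        left = [x for _, x, _ in members]           # first input vector
        right = [y for _, _, y in members]          # second input vector
        packed = [VECTOR_OPS[op](a, b) for a, b in zip(left, right)]
        for loc, value in zip(locations, packed):   # place results in array order
            result[loc] = value
    return result


if __name__ == "__main__":
    stream = [(0, "+", 1, 2), (1, "*", 3, 4), (2, "+", 5, 6), (3, "*", 7, 8)]
    print(execute(stream, 4))
```

The recorded locations are what let the interleaved additions and multiplications be regrouped for packing and still land in their original array positions.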