Matrix multiply with reduced bandwidth requirements
    21.
    Patent application
    Status: Under examination (published)

    Publication number: US20070271325A1

    Publication date: 2007-11-22

    Application number: US11430324

    Filing date: 2006-05-08

    CPC classification number: G06F17/16

    Abstract: Systems and methods for reducing the bandwidth needed to read the inputs to a matrix multiply operation may improve system performance. Rather than reading a row of a first input matrix and a column of a second input matrix to produce a column of a product matrix, a column of the first input matrix and a single element of the second input matrix are read to produce a column of partial dot products of the product matrix. Therefore, the number of input matrix elements read to produce each product matrix element is reduced from 2N to N+1, where N is the number of elements in a column of the product matrix.
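
The read pattern described in this abstract can be sketched in Python. This is only an illustration under stated assumptions, not the patented hardware: the function name `matmul_column_scalar` and the `reads` counter are hypothetical, and matrices are plain lists of rows. Each step reads one column of the first matrix and one element of the second, then accumulates a column of partial dot products.

```python
def matmul_column_scalar(a, b):
    """Multiply a (M x K) by b (K x N), both given as lists of rows.

    Hypothetical sketch: returns the product and the number of input
    elements read, counting each column/scalar fetch once.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]
    reads = 0
    for j in range(n):                  # one column of the product at a time
        for p in range(k):
            col_a = [a[i][p] for i in range(m)]  # read one column of A (m elements)
            scalar_b = b[p][j]                   # read a single element of B
            reads += m + 1
            for i in range(m):
                # accumulate a column of partial dot products
                c[i][j] += col_a[i] * scalar_b
    return c, reads
```

With N elements per product column, each step performs N multiply-adds from N+1 reads, whereas a row-times-column dot product reads two inputs per multiply-add (2N for the same amount of work).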

    Rapid execution of floating point load control word instructions
    22.
    Granted patent
    Status: In force

    Publication number: US06405305B1

    Publication date: 2002-06-11

    Application number: US09394024

    Filing date: 1999-09-10

    Abstract: A microprocessor with a floating point unit configured to rapidly execute floating point load control word (FLDCW) type instructions in an out of program order context is disclosed. The floating point unit is configured to schedule instructions older than the FLDCW-type instruction before the FLDCW-type instruction is scheduled. The FLDCW-type instruction acts as a barrier to prevent instructions occurring after the FLDCW-type instruction in program order from executing before the FLDCW-type instruction. Indicator bits may be used to simplify instruction scheduling, and copies of the floating point control word may be stored for instructions that have long execution cycles. A method and computer configured to rapidly execute FLDCW-type instructions in an out of program order context are also disclosed.

    Floating point addition pipeline including extreme value, comparison and accumulate functions
    23.
    Granted patent
    Status: Lapsed

    Publication number: US06298367B1

    Publication date: 2001-10-02

    Application number: US09055916

    Filing date: 1998-04-06

    Abstract: A multimedia execution unit configured to perform vectored floating point and integer instructions. The execution unit may include an add/subtract pipeline having far and close data paths. The far path is configured to handle effective addition operations and effective subtraction operations for operands having an absolute exponent difference greater than one. The close path is configured to handle effective subtraction operations for operands having an absolute exponent difference less than or equal to one. The close path is configured to generate two output values, wherein one output value is the first input operand plus an inverted version of the second input operand, while the second output value is equal to the first output value plus one. Selection of the first or second output value in the close path effectuates the round-to-nearest operation for the output of the adder. The execution unit may be configured to perform vectored addition and subtraction, integer/floating point conversion, reverse subtraction, accumulate, extreme value (minimum/maximum), and comparison instructions.
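
The far/close path split and the two close-path candidates can be illustrated with a small Python sketch. It is a software model under assumptions, not the circuit: the names `select_adder_path` and `close_path_candidates` are hypothetical, operands are assumed to be finite, nonzero, normal floats, and the rounding logic that picks between the two candidates is not modeled.

```python
import math

def select_adder_path(x, y, subtract=False):
    """Return 'close' or 'far' for x + y (or x - y when subtract=True)."""
    if x == 0.0 or y == 0.0 or not (math.isfinite(x) and math.isfinite(y)):
        raise ValueError("sketch assumes finite, nonzero operands")
    signs_differ = math.copysign(1.0, x) != math.copysign(1.0, y)
    # An effective subtraction is one whose magnitudes are subtracted:
    # an add of opposite signs, or a subtract of equal signs.
    effective_subtract = signs_differ != subtract
    exp_diff = abs(math.frexp(x)[1] - math.frexp(y)[1])
    if effective_subtract and exp_diff <= 1:
        return "close"
    return "far"

def close_path_candidates(a, b, width=24):
    """Two candidate results for unsigned significands a and b.

    first  = a + (inverted b)      (equals a - b - 1 modulo 2**width)
    second = first + 1             (equals a - b     modulo 2**width)
    """
    mask = (1 << width) - 1
    first = (a + (~b & mask)) & mask
    second = (first + 1) & mask
    return first, second
```

For example, `close_path_candidates(10, 3, width=8)` yields the pair `(6, 7)`, where 7 is the exact difference; per the abstract, choosing one of the two values is what implements round-to-nearest.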

    Bipartite look-up table with output values having minimized absolute error
    24.
    Granted patent
    Status: Lapsed

    Publication number: US06223192B1

    Publication date: 2001-04-24

    Application number: US09098482

    Filing date: 1998-06-16

    Abstract: A method for generating entries for a bipartite look-up table having base and difference table portions. In one embodiment, these entries are usable to form output values for a mathematical function, f(x), in response to receiving corresponding input values within a predetermined input range. The method first comprises partitioning the input range into I intervals, J subintervals/interval, and K sub-subintervals/subinterval. For a given interval M, the method includes generating K difference table entries and J base table entries. Each of the K difference table entries corresponds to a particular group of sub-subintervals within interval M, each of which has the same relative position within their respective subintervals. Each difference table entry is computed by averaging difference values for the sub-subintervals included in a corresponding group N. Each difference value which makes up this average is equal to f(X1)−f(X2), where X1 is the midpoint of the sub-subinterval within group N, and X2 is the midpoint of a predetermined reference sub-subinterval within the same subinterval as X1. Each of these midpoints is calculated such that maximum absolute error is minimized for all possible input values in the sub-subinterval. Each of the J base table entries, on the other hand, corresponds to a subinterval within interval M. Each entry is equal to f(X2)+adjust, where X2 is the midpoint of the reference sub-subinterval of the subinterval corresponding to the base table entry. The adjust value is calculated so that error introduced by the averaging of the difference table entries is evenly distributed over the entire subinterval.
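
A simplified version of the table construction can be sketched in Python. Several assumptions depart from the abstract and are labeled here: the names `build_bipartite_table` and `lookup` are hypothetical, plain arithmetic midpoints stand in for the error-minimizing midpoints, the reference sub-subinterval defaults to index 0, and the `adjust` term is omitted (treated as zero).

```python
def build_bipartite_table(f, lo, hi, I, J, K, ref=0):
    """Build base[I][J] and diff[I][K] tables for f on [lo, hi)."""
    width = (hi - lo) / (I * J * K)      # width of one sub-subinterval

    def mid(m, j, k):
        # Arithmetic midpoint of sub-subinterval k in subinterval j of interval m
        # (assumption: the patent computes error-minimizing midpoints instead).
        return lo + (((m * J + j) * K + k) + 0.5) * width

    # One base entry per subinterval: f at the reference midpoint X2.
    base = [[f(mid(m, j, ref)) for j in range(J)] for m in range(I)]
    # One difference entry per relative position k: f(X1) - f(X2),
    # averaged over the J subintervals of interval m.
    diff = [[sum(f(mid(m, j, k)) - f(mid(m, j, ref)) for j in range(J)) / J
             for k in range(K)] for m in range(I)]
    return base, diff

def lookup(base, diff, lo, hi, x):
    """Approximate f(x) as base entry plus difference entry."""
    I, J, K = len(base), len(base[0]), len(diff[0])
    total = I * J * K
    idx = min(int((x - lo) / (hi - lo) * total), total - 1)
    m, rest = divmod(idx, J * K)
    j, k = divmod(rest, K)
    return base[m][j] + diff[m][k]
```

The storage saving is the point of the split: interval M needs J + K entries rather than J × K, at the cost of the averaging error that the abstract's `adjust` value is meant to spread evenly.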

    Optimized 3D lighting computations using a logarithmic number system
    26.
    Granted patent
    Status: In force

    Publication number: US09304739B1

    Publication date: 2016-04-05

    Application number: US11609273

    Filing date: 2006-12-11

    Applicant: Norbert Juffa

    Inventor: Norbert Juffa

    CPC classification number: G06F7/4833 G06T15/005 G06T2210/32

    Abstract: Embodiments of the present invention set forth a technique for optimizing the performance and efficiency of complex, software-based computations, such as lighting computations. Data entering a graphics application programming interface (API) in a conventional arithmetic representation, such as floating-point or fixed-point, is converted to an internal logarithmic representation for greater computational efficiency. Lighting computations are then performed using logarithmic space arithmetic routines that, on average, execute more efficiently than similar routines performed in a native floating-point format. The lighting computation results, represented as logarithmic space numbers, are converted back to floating-point numbers before being transmitted to a graphics processing unit (GPU) for further processing. Because of efficiencies of logarithmic space arithmetic, performance improvements may be realized relative to prior art approaches to performing software-based floating-point operations.
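
The convert, compute, convert-back flow can be sketched in Python. This is a toy model under assumptions: the function names are hypothetical, values are restricted to positive reals, and a float holding log2(x) stands in for whatever fixed-point logarithmic encoding a real implementation would use. Addition in logarithmic space, which is the hard case, is not shown.

```python
import math

def to_lns(x):
    """Convert a positive float to its base-2 logarithmic representation."""
    if x <= 0.0:
        raise ValueError("sketch handles positive values only")
    return math.log2(x)

def from_lns(log_value):
    """Convert a logarithmic-space value back to an ordinary float."""
    return 2.0 ** log_value

def lns_mul(a, b):
    return a + b          # multiplication becomes addition

def lns_div(a, b):
    return a - b          # division becomes subtraction

def lns_pow(a, exponent):
    return a * exponent   # raising to a real power becomes multiplication

def specular_term(n_dot_h, shininess, intensity):
    """Hypothetical lighting term intensity * (n_dot_h ** shininess)."""
    result = lns_mul(lns_pow(to_lns(n_dot_h), shininess), to_lns(intensity))
    return from_lns(result)
```

A power function such as the specular exponent is where this representation pays off most visibly, since it reduces to a single multiply on the logarithmic value.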

    Graphics processor with memory management unit and cache coherent link
    27.
    Granted patent
    Status: In force

    Publication number: US08860741B1

    Publication date: 2014-10-14

    Application number: US11608436

    Filing date: 2006-12-08

    CPC classification number: G09G5/36 G06F9/50 G06F12/0831 G06F2212/302 G09G5/363

    Abstract: In contrast to a conventional computing system in which the graphics processor (graphics processing unit or GPU) is treated as a slave to one or several CPUs, systems and methods are provided that allow the GPU to be treated as a central processing unit (CPU) from the perspective of the operating system. The GPU can access a memory space shared by other CPUs in the computing system. Caches utilized by the GPU may be coherent with caches utilized by other CPUs in the computing system. The GPU may share execution of general-purpose computations with other CPUs in the computing system.

    Mapping the threads of a CTA to the elements of a tile for efficient matrix multiplication
    28.
    Granted patent
    Status: In force

    Publication number: US07912889B1

    Publication date: 2011-03-22

    Application number: US11454680

    Filing date: 2006-06-16

    CPC classification number: G06F17/16

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
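
One possible CTA-to-tile and thread-to-element mapping can be sketched in Python index arithmetic. This is not the mapping claimed in the patent: the function names are hypothetical, tiles are assumed square, the matrix is assumed to be stored row-major with a width divisible by the tile size, and the only goal is to show why consecutive thread IDs landing on consecutive addresses permits coalesced reads.

```python
def cta_to_tile(cta_id, tiles_per_row):
    """Map a linear CTA index to (tile_row, tile_col) in the result matrix."""
    return divmod(cta_id, tiles_per_row)

def thread_to_element(thread_id, tile_dim):
    """Map a linear thread index to (row, col) inside a tile_dim x tile_dim tile."""
    return divmod(thread_id, tile_dim)

def global_address(cta_id, thread_id, tile_dim, n_cols):
    """Linear element offset touched by one thread (row-major assumption)."""
    tile_row, tile_col = cta_to_tile(cta_id, n_cols // tile_dim)
    r, c = thread_to_element(thread_id, tile_dim)
    row = tile_row * tile_dim + r
    col = tile_col * tile_dim + c
    return row * n_cols + col
```

For an 8-column matrix with 4 × 4 tiles, threads 0 to 3 of CTA 0 touch offsets 0, 1, 2, 3, which is the contiguous pattern a coalesced read relies on; thread 4 starts the next tile row at offset 8.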

    Hardware/software-based mapping of CTAs to matrix tiles for efficient matrix multiplication
    29.
    Granted patent
    Status: In force

    Publication number: US07836118B1

    Publication date: 2010-11-16

    Application number: US11454499

    Filing date: 2006-06-16

    CPC classification number: G06F17/16 G06F9/3851 G06F9/3885 G06F9/3887

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.

    Graphics processing unit used for cryptographic processing
    30.
    Patent application
    Status: In force

    Publication number: US20070198412A1

    Publication date: 2007-08-23

    Application number: US11350137

    Filing date: 2006-02-08

    Applicant: Norbert Juffa

    Inventor: Norbert Juffa

    CPC classification number: G06F21/72 G06F9/30181 G06F9/3879 G06F2207/3824

    Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.
