Maximized memory throughput using cooperative thread arrays
    21.
    Granted Patent
    Maximized memory throughput using cooperative thread arrays (In Force)

    Publication Number: US07925860B1

    Publication Date: 2011-04-12

    Application Number: US11748298

    Filing Date: 2007-05-14

    CPC classification number: G06F9/3887 G06F9/3455 G06F9/3851 G06F9/3889

    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus may take relatively little time to compute compared to the memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, methods are provided for achieving improved, optimized, or ultimately maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

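    The load-balancing idea can be illustrated with a small CUDA sketch (illustrative only, not the patented implementation): a memory-bound streaming kernel in which each thread of each cooperative thread array walks the input with a grid-wide stride, so loads and stores stay coalesced and the work is spread evenly across threads and CTAs. The kernel name and the operation it performs are assumptions for the example.

    // Illustrative memory-bound streaming kernel: each thread processes multiple
    // elements with a grid-wide stride. Consecutive threads touch consecutive
    // addresses, so global-memory accesses coalesce, and the per-thread compute
    // (one multiply-add) is small relative to the memory traffic.
    __global__ void scale_add(const float *in, float *out, float a, int n)
    {
        int stride = blockDim.x * gridDim.x;                 // total threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            out[i] = a * in[i] + 1.0f;                       // cheap compute per element
    }

    // Host-side launch sketch: enough CTAs (thread blocks) to keep the memory
    // interface busy; exact sizes would be tuned for the device.
    // scale_add<<<128, 256>>>(d_in, d_out, 2.0f, n);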

    Efficient matrix multiplication on a parallel processing device
    22.
    Granted Patent
    Efficient matrix multiplication on a parallel processing device (In Force)

    Publication Number: US07792895B1

    Publication Date: 2010-09-07

    Application Number: US11454411

    Filing Date: 2006-06-16

    CPC classification number: G06F17/16

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses to global memory when computing a result tile, and enables coalesced read operations from global memory as well as write operations to local memory without bank conflicts.

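    A minimal CUDA sketch of the tile-by-tile scheme described above (illustrative, not the patented implementation): each CTA computes one TILE x TILE tile of the result, the CTA's threads first copy the needed source tiles into shared (local) memory with coalesced reads, and each thread then accumulates one result element from the staged tiles. Matrix dimensions are assumed to be multiples of TILE to keep the sketch short.

    #define TILE 16

    // Illustrative tiled matrix multiply: C = A * B, with N x N row-major matrices
    // whose dimension N is assumed to be a multiple of TILE.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];   // source tile of A staged in shared memory
        __shared__ float Bs[TILE][TILE];   // source tile of B staged in shared memory

        int row = blockIdx.y * TILE + threadIdx.y;   // this thread's result element
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // Coalesced copy of one tile of each source matrix into shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            // Each thread accumulates one element of the result tile from the staged
            // tiles, so each global element of A and B is read once per tile rather
            // than once per thread.
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }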

    Matrix multiply with reduced bandwidth requirements
    23.
    Patent Application
    Matrix multiply with reduced bandwidth requirements (Pending, Published)

    Publication Number: US20070271325A1

    Publication Date: 2007-11-22

    Application Number: US11430324

    Filing Date: 2006-05-08

    CPC classification number: G06F17/16

    Abstract: Systems and methods for reducing the bandwidth needed to read the inputs to a matrix multiply operation may improve system performance. Rather than reading a row of a first input matrix and a column of a second input matrix to produce an element of a product matrix, a column of the first input matrix and a single element of the second input matrix are read to produce a column of partial dot products of the product matrix. Therefore, the number of input matrix elements read to produce each product matrix element is reduced from 2N to N+1, where N is the number of elements in a column of the product matrix.

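    The reduced-read scheme can be written down directly (a host-side sketch assuming column-major N x N matrices, not the patent's hardware realization): column j of the product is built as a running sum of columns of the first matrix, each scaled by a single element of the second matrix, so each update step reads N elements of A plus one element of B.

    // Illustrative host-side sketch, column-major N x N matrices.
    // Column j of C is accumulated as C[:,j] = sum_k A[:,k] * B[k][j]: each step
    // reads one column of A (N elements) plus one element of B, i.e. N + 1 input
    // elements, instead of a full row of A and a full column of B (2N elements)
    // per product element.
    void matmul_column_scaled(const float *A, const float *B, float *C, int N)
    {
        for (int j = 0; j < N; ++j) {                 // one column of C at a time
            for (int i = 0; i < N; ++i) C[j * N + i] = 0.0f;
            for (int k = 0; k < N; ++k) {
                float b = B[j * N + k];               // the single element B[k][j]
                for (int i = 0; i < N; ++i)           // column k of A, reused for the whole column of C
                    C[j * N + i] += A[k * N + i] * b; // partial dot products
            }
        }
    }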

    Microprocessor including an efficient implementation of extreme value instructions
    24.
    Granted Patent
    Microprocessor including an efficient implementation of extreme value instructions (In Force)

    Publication Number: US06557098B2

    Publication Date: 2003-04-29

    Application Number: US09478139

    Filing Date: 2000-01-05

    Abstract: An execution unit is provided for executing a first instruction which includes an opcode field, a first operand field, and a second operand field. The execution unit includes a first input register for receiving a first operand specified by a value of the first operand field, and a second input register for receiving a second operand specified by a value of the second operand field. The execution unit further includes a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator unit is also coupled to receive the first and second operand values from the first and second input registers, respectively. The execution unit further includes a multiplexer which receives a plurality of inputs. These inputs include a first constant value, a second constant value, and the values of the first and second operands. If the decoded opcode value received by the comparator indicates that the first instruction is either a compare or extreme value function, the comparator conveys one or more control signals to the multiplexer for the purpose of selecting an output of the multiplexer as the result of the first instruction. If the first instruction is one of a plurality of extreme value instructions, the one or more control signals conveyed by the comparator unit select between the first operand and the second operand to determine the result of the first instruction. If the first instruction is one of a plurality of compare instructions, the one or more control signals conveyed by the comparator unit select between the first and second constant values to determine the result of the first instruction. In another embodiment, a similar execution unit is provided which handles vector operands.

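    The selection behaviour described above can be modelled in a few lines (a functional sketch only; the opcode names and the constants used for compare results are assumptions, not the patent's encodings): for extreme value opcodes the multiplexer's control signals pick one of the two operands, while for compare opcodes they pick one of two constants.

    // Functional model of the comparator/multiplexer datapath. Opcode names and
    // the compare-result constants (0 and all-ones) are illustrative.
    enum Opcode { OP_MIN, OP_MAX, OP_CMP_LT, OP_CMP_GE };

    __host__ __device__ unsigned execute(Opcode op, unsigned a, unsigned b)
    {
        const unsigned kFalse = 0u;              // first constant input to the mux
        const unsigned kTrue  = ~0u;             // second constant input to the mux
        bool a_lt_b = a < b;                     // a single comparison drives the mux select

        switch (op) {
        case OP_MIN:    return a_lt_b ? a : b;               // extreme value: select an operand
        case OP_MAX:    return a_lt_b ? b : a;
        case OP_CMP_LT: return a_lt_b ? kTrue : kFalse;      // compare: select a constant
        case OP_CMP_GE: return a_lt_b ? kFalse : kTrue;
        }
        return 0u;
    }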

    Rapid execution of floating point load control word instructions
    25.
    Granted Patent
    Rapid execution of floating point load control word instructions (In Force)

    Publication Number: US06405305B1

    Publication Date: 2002-06-11

    Application Number: US09394024

    Filing Date: 1999-09-10

    Abstract: A microprocessor with a floating point unit configured to rapidly execute floating point load control word (FLDCW) type instructions in an out-of-program-order context is disclosed. The floating point unit is configured to schedule instructions older than the FLDCW-type instruction before the FLDCW-type instruction is scheduled. The FLDCW-type instruction acts as a barrier to prevent instructions occurring after the FLDCW-type instruction in program order from executing before the FLDCW-type instruction. Indicator bits may be used to simplify instruction scheduling, and copies of the floating point control word may be stored for instructions that have long execution cycles. A method and computer configured to rapidly execute FLDCW-type instructions in an out-of-program-order context are also disclosed.

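    The barrier behaviour can be sketched as a simple eligibility check (a toy model under assumed data structures, not the processor's scheduler): an instruction younger than a pending FLDCW-type instruction is held back, while older instructions remain schedulable ahead of it.

    // Toy model of the scheduling rule: program order is represented by an age
    // index (smaller = older). An unissued FLDCW-type instruction blocks every
    // younger instruction until it has issued; older instructions may still be
    // scheduled ahead of it.
    struct Instr { int age; bool is_fldcw; bool issued; };

    bool can_schedule(const Instr &candidate, const Instr *window, int count)
    {
        for (int i = 0; i < count; ++i) {
            const Instr &other = window[i];
            if (other.is_fldcw && !other.issued && other.age < candidate.age)
                return false;   // an unissued, older FLDCW acts as a barrier
        }
        return true;
    }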

    Optimized 3D lighting computations using a logarithmic number system
    26.
    Granted Patent
    Optimized 3D lighting computations using a logarithmic number system (In Force)

    Publication Number: US09304739B1

    Publication Date: 2016-04-05

    Application Number: US11609273

    Filing Date: 2006-12-11

    Applicant: Norbert Juffa

    Inventor: Norbert Juffa

    CPC classification number: G06F7/4833 G06T15/005 G06T2210/32

    Abstract: Embodiments of the present invention set forth a technique for optimizing the performance and efficiency of complex, software-based computations, such as lighting computations. Data entering a graphics application programming interface (API) in a conventional arithmetic representation, such as floating-point or fixed-point, is converted to an internal logarithmic representation for greater computational efficiency. Lighting computations are then performed using logarithmic space arithmetic routines that, on average, execute more efficiently than similar routines performed in a native floating-point format. The lighting computation results, represented as logarithmic space numbers, are converted back to floating-point numbers before being transmitted to a graphics processing unit (GPU) for further processing. Because of efficiencies of logarithmic space arithmetic, performance improvements may be realized relative to prior art approaches to performing software-based floating-point operations.

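    The core efficiency argument, that multiplication collapses to addition in a logarithmic representation, can be illustrated with a small sketch (illustrative only; the patent's internal format and conversion routines are not shown, and the struct layout here is an assumption). Signs are tracked separately because logarithms are defined only for positive magnitudes.

    #include <math.h>

    // Illustrative logarithmic-space number: sign plus log2 of the magnitude.
    struct LogNum { float log2mag; int sign; };

    __host__ __device__ LogNum to_log(float x)
    {
        LogNum r;
        r.sign = (x < 0.0f) ? -1 : 1;
        r.log2mag = log2f(fabsf(x));       // conversion cost is paid once at the API boundary
        return r;
    }

    __host__ __device__ float from_log(LogNum x)
    {
        return x.sign * exp2f(x.log2mag);  // converted back before results go to the GPU
    }

    // In log space a multiply is a single add (and a divide a single subtract),
    // which is the source of the efficiency claim for multiply-heavy lighting math.
    __host__ __device__ LogNum log_mul(LogNum a, LogNum b)
    {
        LogNum r;
        r.sign = a.sign * b.sign;
        r.log2mag = a.log2mag + b.log2mag;
        return r;
    }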

    Mapping the threads of a CTA to the elements of a tile for efficient matrix multiplication
    27.
    Granted Patent
    Mapping the threads of a CTA to the elements of a tile for efficient matrix multiplication (In Force)

    Publication Number: US07912889B1

    Publication Date: 2011-03-22

    Application Number: US11454680

    Filing Date: 2006-06-16

    CPC classification number: G06F17/16

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses to global memory when computing a result tile, and enables coalesced read operations from global memory as well as write operations to local memory without bank conflicts.

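    One of the thread-to-element mappings mentioned above, a source tile copy-and-transpose, can be sketched as follows (illustrative layout choices only, not the patented mapping): consecutive thread indices read consecutive global addresses, so the load is coalesced, and the store into shared memory swaps the indices to produce the transposed tile without bank conflicts.

    #define TILE 16

    // Illustrative copy-and-transpose of one TILE x TILE source tile, launched
    // with one TILE x TILE thread block per tile. Consecutive threadIdx.x values
    // read consecutive global addresses (coalesced); the shared-memory store
    // swaps row and column to transpose.
    __global__ void copy_transpose_tile(const float *src, int ld, float *dst)
    {
        __shared__ float tile[TILE][TILE + 1];    // +1 column of padding avoids bank conflicts

        int r = threadIdx.y;                      // tile row read by this thread
        int c = threadIdx.x;                      // tile column read by this thread
        tile[c][r] = src[r * ld + c];             // transposed store into shared memory
        __syncthreads();

        dst[r * TILE + c] = tile[r][c];           // write the transposed tile out
    }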

    Hardware/software-based mapping of CTAs to matrix tiles for efficient matrix multiplication
    28.
    Granted Patent
    Hardware/software-based mapping of CTAs to matrix tiles for efficient matrix multiplication (In Force)

    Publication Number: US07836118B1

    Publication Date: 2010-11-16

    Application Number: US11454499

    Filing Date: 2006-06-16

    CPC classification number: G06F17/16 G06F9/3851 G06F9/3885 G06F9/3887

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses to global memory when computing a result tile, and enables coalesced read operations from global memory as well as write operations to local memory without bank conflicts.

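    The CTA-to-tile mapping itself is index arithmetic; two possible orderings are sketched below (assumed layouts, not the patent's specific hardware or software mapping): a linear CTA identifier is decomposed into the row and column of the result tile it will compute, either row-major or column-major across the grid of tiles.

    // Illustrative decomposition of a linear CTA identifier into result-tile
    // coordinates, for a result matrix split into a grid of tiles with tilesX
    // tiles per row and tilesY tiles per column. The ordering chosen changes
    // which tiles run concurrently and therefore which source tiles are being
    // fetched at the same time.
    __host__ __device__ void cta_to_tile_row_major(int cta, int tilesX,
                                                   int *tileRow, int *tileCol)
    {
        *tileRow = cta / tilesX;   // consecutive CTAs walk along a row of tiles
        *tileCol = cta % tilesX;
    }

    __host__ __device__ void cta_to_tile_col_major(int cta, int tilesY,
                                                   int *tileRow, int *tileCol)
    {
        *tileRow = cta % tilesY;   // consecutive CTAs walk down a column of tiles
        *tileCol = cta / tilesY;
    }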

    Graphics processing unit used for cryptographic processing
    29.
    Patent Application
    Graphics processing unit used for cryptographic processing (In Force)

    Publication Number: US20070198412A1

    Publication Date: 2007-08-23

    Application Number: US11350137

    Filing Date: 2006-02-08

    Applicant: Norbert Juffa

    Inventor: Norbert Juffa

    CPC classification number: G06F21/72 G06F9/30181 G06F9/3879 G06F2207/3824

    Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.

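    The parallelism argument can be made concrete with a toy sketch (illustrative only; no real cipher is shown, and the kernel is an assumption for the example): each thread applies an independent integer transformation, here modular exponentiation by square-and-multiply, to its own input word, so thousands of such operations proceed concurrently across the GPU.

    // Toy illustration of the data-parallel integer work cryptographic kernels
    // rely on: each thread computes (base[i] ^ e) mod m independently. Not a
    // real cipher; it only shows how many independent integer operations map
    // onto many threads.
    __global__ void modexp_many(const unsigned *base, unsigned *out,
                                unsigned e, unsigned m, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned long long result = 1;
        unsigned long long b = base[i] % m;
        for (unsigned exp = e; exp != 0; exp >>= 1) {   // square-and-multiply
            if (exp & 1) result = (result * b) % m;
            b = (b * b) % m;
        }
        out[i] = (unsigned)result;
    }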

Converting register data from a first format type to a second format type if a second type instruction consumes data produced by a first type instruction
    30.
    Granted Patent
    Converting register data from a first format type to a second format type if a second type instruction consumes data produced by a first type instruction (Expired)

    Publication Number: US6105129A

    Publication Date: 2000-08-15

    Application Number: US25233

    Filing Date: 1998-02-18

    Abstract: A microprocessor includes one or more registers which are architecturally defined to be used for at least two data formats. In one embodiment, the registers are the floating point registers defined in the x86 architecture, and the data formats are the floating point data format and the multimedia data format. The registers actually implemented by the microprocessor for the floating point registers use an internal format for floating point data. Part of the internal format is a classification field which classifies the floating point data in the extended precision defined by the x86 microprocessor architecture. Additionally, a classification field encoding is reserved for multimedia data. As the microprocessor begins execution of each multimedia instruction, the classification information of the source operands is examined to determine if the data is either in the multimedia class, or in a floating point class in which the significand portion of the register is the same as the corresponding significand in extended precision. If so, the multimedia instruction executes normally. If not, the multimedia instruction is faulted. Similarly, as the microprocessor begins execution of each floating point instruction, the classification information of the source operands is examined. If the data is classified as multimedia, the floating point instruction is faulted. A microcode routine is used to reformat the data stored in at least the source registers of the faulting instruction into a format useable by the faulting instruction. Subsequently, the faulting instruction is re-executed.

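    The classification mechanism can be modelled in a few lines (a behavioural sketch only; the class names and helper functions are assumptions, not the microarchitecture's encodings): each architectural register carries a class tag alongside its bits, an instruction checks the tags of its sources, and a mismatch raises a fault so microcode can reformat the data before the instruction is re-executed.

    // Behavioural sketch of the per-register classification check. The class
    // names and the notion of a "plain" floating point class whose significand
    // matches extended precision are simplified stand-ins for the internal
    // format described above.
    enum RegClass { CLASS_MMX, CLASS_FP_PLAIN, CLASS_FP_SPECIAL };

    struct Reg { RegClass cls; unsigned long long bits; };

    // True if a multimedia instruction may use this source directly; false means
    // the instruction faults and microcode reformats the register first.
    bool mmx_source_ok(const Reg &r)
    {
        return r.cls == CLASS_MMX || r.cls == CLASS_FP_PLAIN;
    }

    // True if a floating point instruction may use this source directly.
    bool fp_source_ok(const Reg &r)
    {
        return r.cls != CLASS_MMX;   // multimedia-classified data must be reformatted first
    }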
