Apparatus and method for handling tiny numbers using a super sticky bit in a microprocessor
    11.
    Granted invention patent (status: in force)

    Publication number: US06374345B1

    Publication date: 2002-04-16

    Application number: US09359919

    Filing date: 1999-07-22

    Abstract: An apparatus and method for handling tiny numbers using a super sticky bit are provided. In response to detecting that a preliminary result of an instruction corresponds to a tiny number and an underflow exception is masked, an execution pipeline can be configured to store a value corresponding to the preliminary result and a super sticky bit in a destination register. Also, a destination register tag corresponding to the destination register and a denormal exception indicator corresponding to the tiny number and masked underflow exception can be stored. A trap handler can be initiated to generate a corrected result for the instruction. The trap handler can detect that the denormal exception indicator has been set and can read the value and the super sticky bit from the destination register using the destination register tag. The trap handler can generate a corrected result for the instruction based on the value and the super sticky bit. An instruction subsequent to the trapping instruction can then be restarted.
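A tiny result has to be shifted right to fit the denormal format, and the bits shifted out decide how it rounds. The following is a minimal sketch of generic guard and sticky bit rounding, not the patented super sticky bit mechanism or its trap handler. The function names and the integer significand representation are assumptions made for illustration.

```python
def shift_with_sticky(significand: int, shift: int) -> tuple[int, int, int]:
    """Right-shift a significand, returning (kept, guard, sticky).

    guard is the most significant bit shifted out; sticky is the OR of
    every less significant bit shifted out.
    """
    if shift <= 0:
        return significand, 0, 0
    kept = significand >> shift
    guard = (significand >> (shift - 1)) & 1
    sticky = int(significand & ((1 << (shift - 1)) - 1) != 0)
    return kept, guard, sticky


def round_nearest_even(kept: int, guard: int, sticky: int) -> int:
    """Round to nearest, breaking exact ties toward an even result."""
    if guard and (sticky or (kept & 1)):
        return kept + 1
    return kept


def denormalize(significand: int, shift: int) -> int:
    """Shift and round in one step."""
    return round_nearest_even(*shift_with_sticky(significand, shift))
```

Without the sticky bit, a value slightly above the halfway point would be indistinguishable from an exact tie, so `denormalize(0b100101, 3)` and `denormalize(0b100100, 3)` could not be told apart.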

    Method and apparatus for achieving higher frequencies of exactly rounded results
    12.
    Granted invention patent (status: expired)

    Publication number: US6134574A

    Publication date: 2000-10-17

    Application number: US75073

    Filing date: 1998-05-08

    Abstract: A multiplier configured to obtain higher frequencies of exactly rounded results by adding an adjustment constant to intermediate products generated during iterative multiplication operations is disclosed. One such iterative multiplication operation is the Newton-Raphson iteration, which may be utilized by the multiplier to perform reciprocal calculations and reciprocal square root calculations. For each iteration, the results converge toward an infinitely precise result. To improve the frequency of the exactly rounded result, the results of the iterative calculations may be studied for a large number of differing input operands to determine the best suited value for the adjustment constant. The multiplier may also be configured to perform scalar and packed vector multiplication using the same hardware.
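The Newton-Raphson iterations mentioned in the abstract can be sketched in a few lines. This is a plain floating-point illustration of the textbook recurrences, assuming a caller-supplied starting estimate. It does not model the adjustment constant or the hardware multiplier, and the function names are hypothetical.

```python
def nr_reciprocal(b: float, x0: float, iterations: int = 4) -> float:
    """Approximate 1/b with x <- x * (2 - b * x)."""
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - b * x)
    return x


def nr_reciprocal_sqrt(b: float, y0: float, iterations: int = 4) -> float:
    """Approximate 1/sqrt(b) with y <- y * (1.5 - 0.5 * b * y * y)."""
    y = y0
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * b * y * y)
    return y
```

Each step uses only multiplications and a subtraction from a constant, which is why such iterations map well onto a multiplier. For the reciprocal, the relative error is squared on every pass, so a starting estimate within 10% reaches double-precision accuracy in about four iterations.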

    Microprocessor including an efficient implemention of an accumulate instruction
    13.
    Granted invention patent (status: expired)

    Publication number: US5918062A

    Publication date: 1999-06-29

    Application number: US14507

    Filing date: 1998-01-28

    Abstract: An execution unit configured to perform a plurality of arithmetic operations using the same set of operands. These operands include corresponding input vector values in each of a plurality of input registers. The execution unit is coupled to receive these input vector values, as well as an instruction value indicative of one of the plurality of arithmetic operations. In one embodiment, the plurality of arithmetic operations includes a vectored add instruction, a vectored subtract instruction, a vectored reverse subtract instruction, and an accumulate instruction. The vectored instructions perform arithmetic operations concurrently using corresponding values from each of the plurality of input registers. The accumulate instruction, however, is executable to add together all input values within a single input register. The execution unit further includes a multiplexer unit configured to selectively route the input vector values to a plurality of adder units according to the opcode value. In an embodiment in which the execution unit is configured to perform subtraction operations as well as addition, the multiplexer unit is additionally configured to selectively route negated versions (either one's or two's complement format) to the plurality of adder units. Each of the plurality of adder units is configured to generate a sum based upon the values conveyed from the multiplexer unit. The accumulate instruction advantageously allows important operations such as the matrix multiply to be performed rapidly. Because the matrix multiply is an integral part of many applications (particularly graphics applications), the accumulate instruction may lead to increased overall system performance.
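The routing idea in the abstract can be modelled as a function that picks which operand pairs reach the adders. The sketch below assumes two-element input registers, two adders, and the opcode names `add`, `sub`, `subr`, and `acc`. None of these details come from the patent text quoted here.

```python
def route(op: str, a: list[float], b: list[float]) -> list[tuple[float, float]]:
    """Model the multiplexer: choose the operand pair fed to each adder."""
    if op == "add":    # a[i] + b[i]
        return [(a[0], b[0]), (a[1], b[1])]
    if op == "sub":    # a[i] - b[i], by routing negated b
        return [(a[0], -b[0]), (a[1], -b[1])]
    if op == "subr":   # b[i] - a[i], by routing negated a
        return [(-a[0], b[0]), (-a[1], b[1])]
    if op == "acc":    # sum the values within each single register
        return [(a[0], a[1]), (b[0], b[1])]
    raise ValueError(f"unknown opcode: {op}")


def execute(op: str, a: list[float], b: list[float]) -> list[float]:
    """Every adder just adds whatever the multiplexer routed to it."""
    return [x + y for x, y in route(op, a, b)]
```

In this model the adders are identical for all four operations. Only the routing changes, which is the property the abstract attributes to the multiplexer unit. An accumulate following an element-wise multiply yields a dot product, the inner step of a matrix multiply.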

    Graphics processing unit used for cryptographic processing
    14.
    Granted invention patent (status: in force)

    Publication number: US07916864B2

    Publication date: 2011-03-29

    Application number: US11350137

    Filing date: 2006-02-08

    Applicant: Norbert Juffa

    Inventor: Norbert Juffa

    CPC classification number: G06F21/72 G06F9/30181 G06F9/3879 G06F2207/3824

    Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.

    Hardware resource based mapping of cooperative thread arrays (CTA) to result matrix tiles for efficient matrix multiplication in computing system comprising plurality of multiprocessors
    15.
    Granted invention patent (status: in force)

    Publication number: US07506134B1

    Publication date: 2009-03-17

    Application number: US11454542

    Filing date: 2006-06-16

    CPC classification number: G06F9/5066 G06F9/5038 G06F2209/5017

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
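The tile-by-tile scheme can be illustrated with a sequential model. In the sketch below each loop iteration over `cta` stands in for one cooperative thread array, the `(ty, tx)` loops stand in for its threads, and the `a_tile` and `b_tile` copies stand in for local memory. The row-major CTA-to-tile mapping and all names are assumptions, not the mappings claimed in the patent.

```python
def tiled_matmul(a: list[list[float]], b: list[list[float]], tile: int = 2) -> list[list[float]]:
    """Multiply a (n x k) by b (k x m), one result tile per simulated CTA."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    tiles_per_row = (m + tile - 1) // tile
    tiles_per_col = (n + tile - 1) // tile

    for cta in range(tiles_per_row * tiles_per_col):
        # Map the linear CTA index to the origin of its result tile.
        tile_row, tile_col = divmod(cta, tiles_per_row)
        row0, col0 = tile_row * tile, tile_col * tile

        for k0 in range(0, k, tile):
            # Copy the source tiles once; every thread then reuses them.
            a_tile = [row[k0:k0 + tile] for row in a[row0:row0 + tile]]
            b_tile = [row[col0:col0 + tile] for row in b[k0:k0 + tile]]

            # Each (ty, tx) pair plays the role of one thread in the CTA.
            for ty, a_row in enumerate(a_tile):
                for tx in range(len(b_tile[0])):
                    c[row0 + ty][col0 + tx] += sum(
                        a_row[i] * b_tile[i][tx] for i in range(len(b_tile))
                    )
    return c
```

Because each source tile is copied once and read by every thread of the tile, the number of reads from the large matrices drops by roughly a factor of the tile width. This is the benefit the abstract describes for local memory.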

    Method and apparatus for calculating a power of an operand
    16.
    Granted invention patent (status: in force)

    Publication number: US06381625B2

    Publication date: 2002-04-30

    Application number: US09782474

    Filing date: 2001-02-12

    Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
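Booth's algorithm, which the abstract uses to create and select partial products, can be sketched in software. The version below is a radix-4 recoding for a signed two's-complement multiplier of an assumed even bit width. The function names are hypothetical, and the patent's effective-sign handling for unsigned operands is not modelled.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a `bits`-wide two's-complement value into digits in {-2..2}.

    Digit i is b[2i] + b[2i-1] - 2*b[2i+1], with b[-1] taken as 0.
    """
    digits = []
    previous = 0
    for i in range(0, bits, 2):
        low = (multiplier >> i) & 1
        high = (multiplier >> (i + 1)) & 1
        digits.append(low + previous - 2 * high)
        previous = high
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> int:
    """Sum one partial product per Booth digit (bits must be even)."""
    mask = (1 << bits) - 1
    digits = booth_radix4_digits(multiplier & mask, bits)
    # Each digit selects 0, +/-1x or +/-2x the multiplicand, weighted by 4**i.
    return sum(d * multiplicand * 4 ** i for i, d in enumerate(digits))
```

The recoding halves the number of partial products compared with one per multiplier bit, and each partial product needs only a shift and an optional negation of the multiplicand.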

    Apparatus and method for using checking instructions in a floating-point execution unit
    17.
    Granted invention patent (status: in force)

    Publication number: US06247117B1

    Publication date: 2001-06-12

    Application number: US09265230

    Filing date: 1999-03-08

    Applicant: Norbert Juffa

    Inventor: Norbert Juffa

    CPC classification number: G06F9/226 G06F9/30014 G06F9/30192

    Abstract: The use of checking instructions to detect special and exceptional cases of a defined data format in a microprocessor is disclosed. Generally speaking, a checking instruction is included with the microcode of floating-point instructions to detect special and exceptional cases of operand values for the floating-point instructions. A checking instruction is configured to set one or more flags in a flags register if it detects a special or exceptional case for an operand value. A checking instruction may also set the result or results of a floating-point instruction to a result value if a special or exceptional case is detected. In addition, a checking instruction may be configured to set one or more bits in a status register if a special or exceptional case is detected. After a checking instruction completes execution, a subsequent microcode instruction can be executed to determine if one or more flags were set by the checking instruction. If one or more flags have been set by the checking instruction, the subsequent microcode instruction can branch to a non-sequential microcode instruction to handle the special or exceptional case detected by the checking instruction.
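The check-then-branch pattern can be mimicked in software. The sketch below classifies a value as a single-precision operand and returns flag bits, and a hypothetical `checked_reciprocal` routine branches on them. The flag names, their encodings, and the choice of the reciprocal as the example operation are assumptions, not details from the patent.

```python
import math
import struct

# Hypothetical flag bits set by the checking step.
ZERO, DENORMAL, INFINITY, NAN = 1, 2, 4, 8


def check_operand(x: float) -> int:
    """Return flags describing x when encoded as an IEEE 754 single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return ZERO if fraction == 0 else DENORMAL
    if exponent == 0xFF:
        return INFINITY if fraction == 0 else NAN
    return 0


def checked_reciprocal(x: float) -> float:
    """Run the check first, then branch to a special-case path if needed."""
    flags = check_operand(x)
    if flags & ZERO:
        return math.copysign(math.inf, x)
    if flags & INFINITY:
        return math.copysign(0.0, x)
    if flags & NAN:
        return x
    return 1.0 / x  # ordinary path, also used for denormals here
```

Keeping the classification in one place means the main routine contains a single test of the flags rather than a chain of comparisons against each special value.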

    EFFICIENT MATRIX MULTIPLICATION ON A PARALLEL PROCESSING DEVICE
    18.
    Invention patent application (status: in force)

    Publication number: US20100325187A1

    Publication date: 2010-12-23

    Application number: US12875961

    Filing date: 2010-09-03

    CPC classification number: G06F17/16

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.

    Method and apparatus for rounding in a multiplier
    19.
    Granted invention patent (status: in force)

    Publication number: US06397238B2

    Publication date: 2002-05-28

    Application number: US09782475

    Filing date: 2001-02-12

    Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
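The two-path rounding described in the abstract can be illustrated on integer significands. In the sketch below, `a` and `b` are `p`-bit significands with the top bit set, representing values in [1, 2). Rounding is done by adding a rounding constant before truncation, once for each assumed position of the leading bit. The function name, the round-half-up choice, and the selection logic are assumptions made for illustration.

```python
def round_product(a: int, b: int, p: int = 8) -> tuple[int, int]:
    """Return (rounded p-bit significand, exponent increment) of a * b."""
    assert a >> (p - 1) == 1 and b >> (p - 1) == 1, "operands must be normalized"
    product = a * b  # exact, 2p-1 or 2p bits wide

    # Path 1 assumes no overflow: the product lies in [1, 2).
    no_overflow = (product + (1 << (p - 2))) >> (p - 1)
    # Path 2 assumes overflow: the product lies in [2, 4).
    overflow = (product + (1 << (p - 1))) >> p

    # Hardware would compute both paths in parallel and then select one.
    if no_overflow >> p:
        return overflow, 1
    return no_overflow, 0
```

Selecting on the rounded no-overflow result also covers the case where a product just below 2 rounds up to exactly 2, so no separate renormalization step is needed in this model.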

    Pipelined integer division using floating-point reciprocal
    20.
    Granted invention patent (status: in force)

    Publication number: US08140608B1

    Publication date: 2012-03-20

    Application number: US11756188

    Filing date: 2007-05-31

    Applicant: Norbert Juffa

    Inventor: Norbert Juffa

    CPC classification number: G06F7/535 G06F7/4873 G06F2207/5351 G06F2207/5356

    Abstract: One embodiment of the present invention sets forth a technique for performing fast integer division using commonly available arithmetic operations. The technique may be implemented in a two-stage process using a single-precision floating point reciprocal in conjunction with integer addition and multiplication. Furthermore, the technique may be fully pipelined on many conventional processors for performance that is comparable to the best available high-performance alternatives.
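The general idea of dividing integers through a single-precision reciprocal can be sketched as follows. This is not the patented instruction sequence. It is a simplified two-stage estimate for unsigned 32-bit operands, with a final fix-up loop added so the result is exact regardless of floating-point rounding. The helper `f32` emulates single precision, and all names are hypothetical.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def udiv32(a: int, b: int) -> int:
    """Unsigned 32-bit a // b using a single-precision reciprocal of b."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    rcp = f32(1.0 / f32(b))

    # Stage 1: coarse quotient, limited by the 24-bit float significand.
    q = int(f32(f32(a) * rcp))
    r = a - q * b

    # Stage 2: refine using the (small) remainder and the same reciprocal.
    q += int(f32(f32(r) * rcp))
    r = a - q * b

    # Fix-up with integer arithmetic only.
    while r < 0:
        q -= 1
        r += b
    while r >= b:
        q += 1
        r -= b
    return q
```

The second stage is needed because a single-precision value holds only 24 significant bits, fewer than a 32-bit quotient can require. The remainder after stage 1 is small enough for the same reciprocal to estimate the missing part.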
