Patent search ap:("Norbert Juffa") AND inv:"Norbert Juffa" Page 2

11.

Granted patent
Apparatus and method for handling tiny numbers using a super sticky bit in a microprocessor
Status: In force

Publication number: US06374345B1

Publication date: 2002-04-16

Application number: US09359919

Application date: 1999-07-22

Applicant: Norbert Juffa, Stuart F. Oberman

Inventor: Norbert Juffa, Stuart F. Oberman

IPC: G06F7/00

CPC classification number: G06F7/483, G06F7/49905, G06F7/4991, G06F7/49952, G06F9/30014, G06F9/30036, G06F9/30105, G06F9/30116, G06F9/3865

Abstract: An apparatus and method for handling tiny numbers using a super sticky bit are provided. In response to detecting that a preliminary result of an instruction corresponds to a tiny number and an underflow exception is masked, an execution pipeline can be configured to store a value corresponding to the preliminary result and a super sticky bit in a destination register. Also, a destination register tag corresponding to the destination register and a denormal exception indicator corresponding to the tiny number and masked underflow exception can be stored. A trap handler can be initiated to generate a corrected result for the instruction. The trap handler can detect that the denormal exception indicator has been set and can read the value and the super sticky bit from the destination register using the destination register tag. The trap handler can generate a corrected result for the instruction based on the value and the super sticky bit. An instruction subsequent to the trapping instruction can then be restarted.

12.

Granted patent
Method and apparatus for achieving higher frequencies of exactly rounded results
Status: Lapsed

Publication number: US6134574A

Publication date: 2000-10-17

Application number: US75073

Application date: 1998-05-08

Applicant: Stuart F. Oberman, Norbert Juffa, Fred Weber

Inventor: Stuart F. Oberman, Norbert Juffa, Fred Weber

IPC: G06F7/52, G06F7/533, G06F7/544, G06F9/318, G06F9/38, G06F17/16, G06F7/552

CPC classification number: G06F7/53, G06F17/16, G06F7/5443, G06F9/30036, G06F9/3017, G06F9/3804, G06F9/3885, G06F2207/3828, G06F7/4991, G06F7/49936, G06F7/49963, G06F7/49994, G06F7/5338

Abstract: A multiplier configured to obtain higher frequencies of exactly rounded results by adding an adjustment constant to intermediate products generated during iterative multiplication operations is disclosed. One such iterative multiplication operation is the Newton-Raphson iteration, which may be utilized by the multiplier to perform reciprocal calculations and reciprocal square root calculations. For each iteration, the results converge toward an infinitely precise result. To improve the frequency of the exactly rounded result, the results of the iterative calculations may be studied for a large number of differing input operands to determine the best suited value for the adjustment constant. The multiplier may also be configured to perform scalar and packed vector multiplication using the same hardware.
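
For illustration only, the Newton-Raphson reciprocal iteration named in the abstract can be sketched in ordinary Python floating point. The function name `nr_reciprocal`, the linear seed estimate, and the `adjust` parameter are assumptions made for this sketch; the patent's actual adjustment constant and multiplier datapath are not given in this listing.

```python
import math

def nr_reciprocal(a: float, iterations: int = 5, adjust: float = 0.0) -> float:
    """Approximate 1/a for a > 0 using the iteration x <- x * (2 - a*x)."""
    if a <= 0.0:
        raise ValueError("this sketch handles positive inputs only")
    # Split a into mantissa and exponent: a = m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(a)
    # Linear seed for 1/m on [0.5, 1); relative error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        # 'adjust' is a hypothetical stand-in for a small constant added to
        # the intermediate product m*x; with adjust=0.0 this is the plain
        # Newton-Raphson step, whose error roughly squares each iteration.
        x = x * (2.0 - (m * x + adjust))
    # Undo the scaling: 1/a = (1/m) * 2**(-e).
    return math.ldexp(x, -e)
```

With `adjust` left at zero the error roughly squares on every pass, so a handful of iterations reaches double-precision accuracy from the coarse seed.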

13.

Granted patent
Microprocessor including an efficient implemention of an accumulate instruction
Status: Lapsed

Publication number: US5918062A

Publication date: 1999-06-29

Application number: US14507

Application date: 1998-01-28

Applicant: Stuart F. Oberman, Norbert Juffa

Inventor: Stuart F. Oberman, Norbert Juffa

IPC: G06F7/50, G06F7/509, G06F9/30, G06F9/302, G06F9/318, G06F9/38, G06F17/16, H03M7/24, G06F9/40

CPC classification number: G06F7/509, G06F17/16, G06F9/30014, G06F9/30021, G06F9/30036, G06F9/3017, G06F9/3804, G06F9/3885, H03M7/24

Abstract: An execution unit configured to perform a plurality of arithmetic operations using the same set of operands. These operands include corresponding input vector values in each of a plurality of input registers. The execution unit is coupled to receive these input vector values, as well as an instruction value indicative of one of the plurality of arithmetic operations. In one embodiment, the plurality of arithmetic operations includes a vectored add instruction, a vectored subtract instruction, a vectored reverse subtract instruction, and an accumulate instruction. The vectored instructions perform arithmetic operations concurrently using corresponding values from each of the plurality of input registers. The accumulate instruction, however, is executable to add together all input values within a single input register. The execution unit further includes a multiplexer unit configured to selectively route the input vector values to a plurality of adder units according to the opcode value. In an embodiment in which the execution unit is configured to perform subtraction operations as well as addition, the multiplexer unit is additionally configured to selectively route negated versions (either one's or two's complement format) to the plurality of adder units. Each of the plurality of adder units is configured to generate a sum based upon the values conveyed from the multiplexer unit. The accumulate instruction advantageously allows important operations such as the matrix multiply to be performed rapidly. Because the matrix multiply is an integral part of many applications (particularly graphics applications), the accumulate instruction may lead to increased overall system performance.
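
As a rough software analogy (not the hardware described in the patent), the difference between the vectored operations and the accumulate operation, and why accumulate helps matrix multiplication, might be sketched as follows. All function names are hypothetical, and the element-wise multiply is an assumption added here to complete the matrix-vector example; the abstract itself lists only add, subtract, reverse subtract, and accumulate.

```python
def vector_add(a, b):
    """Vectored add: combine corresponding elements of two registers."""
    return [x + y for x, y in zip(a, b)]

def vector_reverse_sub(a, b):
    """Vectored reverse subtract: b - a, element by element."""
    return [y - x for x, y in zip(a, b)]

def vector_mul(a, b):
    """Element-wise multiply (assumed helper, not named in the abstract)."""
    return [x * y for x, y in zip(a, b)]

def accumulate(reg):
    """Accumulate: add together all values held in a single register."""
    total = 0
    for value in reg:
        total += value
    return total

def matvec(matrix, vec):
    """Each output element is one multiply followed by one accumulate."""
    return [accumulate(vector_mul(row, vec)) for row in matrix]
```

For example, `matvec([[1, 2], [3, 4]], [5, 6])` forms the products `[5, 12]` and `[15, 24]` and accumulates each into one dot product.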

14.

Granted patent
Pipelined integer division using floating-point reciprocal
Status: In force

Publication number: US08140608B1

Publication date: 2012-03-20

Application number: US11756188

Application date: 2007-05-31

Applicant: Norbert Juffa

Inventor: Norbert Juffa

IPC: G06F7/52

CPC classification number: G06F7/535, G06F7/4873, G06F2207/5351, G06F2207/5356

Abstract: One embodiment of the present invention sets forth a technique for performing fast integer division using commonly available arithmetic operations. The technique may be implemented in a two-stage process using a single-precision floating point reciprocal in conjunction with integer addition and multiplication. Furthermore, the technique may be fully pipelined on many conventional processors for performance that is comparable to the best available high-performance alternatives.
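
The general idea of dividing integers with a single-precision reciprocal plus integer multiply and add can be sketched as below. This is a simplified illustration under stated assumptions, not the patented procedure: the names `to_f32` and `udiv32` are hypothetical, single precision is emulated with the `struct` module, and the trailing correction loops are a safety net added for this sketch rather than part of any pipelined design.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def udiv32(a: int, b: int) -> int:
    """Unsigned 32-bit division a // b via a single-precision reciprocal."""
    if not (0 <= a < 2**32 and 0 < b < 2**32):
        raise ValueError("expects 0 <= a < 2**32 and 0 < b < 2**32")
    recip = to_f32(1.0 / b)          # single-precision reciprocal of b

    # Stage 1: coarse quotient estimate, then an exact integer remainder.
    q = int(to_f32(a * recip))
    r = a - q * b

    # Stage 2: refine using the (small, possibly negative) remainder.
    q += int(to_f32(r * recip))
    r = a - q * b

    # Final fix-up so the sketch is exact regardless of rounding details.
    while r < 0:
        q -= 1
        r += b
    while r >= b:
        q += 1
        r -= b
    return q
```

Because the remainder is computed in exact integer arithmetic, any error left by the 24-bit reciprocal shows up as a small remainder that the second stage and the fix-up absorb.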

15.

Granted patent
Maximized memory throughput using cooperative thread arrays
Status: In force

Publication number: US07925860B1

Publication date: 2011-04-12

Application number: US11748298

Application date: 2007-05-14

Applicant: Norbert Juffa, Brett W. Coon

Inventor: Norbert Juffa, Brett W. Coon

IPC: G06F9/30

CPC classification number: G06F9/3887, G06F9/3455, G06F9/3851, G06F9/3889

Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

16.

Granted patent
Efficient matrix multiplication on a parallel processing device
Status: In force

Publication number: US07792895B1

Publication date: 2010-09-07

Application number: US11454411

Application date: 2006-06-16

Applicant: Norbert Juffa, Radoslav Danilak

Inventor: Norbert Juffa, Radoslav Danilak

IPC: G06F7/52

CPC classification number: G06F17/16

Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
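
The tile-by-tile structure described in the abstract can be pictured with a sequential sketch. This is only a conceptual model: the function name `tiled_matmul` and the `tile` parameter are assumptions, the outer two loops stand in for CTAs that would run concurrently on real hardware, and list slices stand in for copying source tiles into local memory. Thread mapping, coalescing, and bank-conflict avoidance are not modeled.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices a (n x k) and b (k x m) one result tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # each (i0, j0) pair is one result
        for j0 in range(0, m, tile):      # tile, i.e. one conceptual CTA
            for k0 in range(0, k, tile):
                # Copy the two source tiles once ("local memory"), then
                # reuse them for every element of the result tile.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[t] * b_tile[t][j] for t in range(len(b_tile))
                        )
    return c
```

Each source element copied into a tile is reused across a whole row or column of the result tile, which is the reason tiling cuts the number of global-memory reads.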

17.

Patent application
Matrix multiply with reduced bandwidth requirements
Status: Under examination (published)

Publication number: US20070271325A1

Publication date: 2007-11-22

Application number: US11430324

Application date: 2006-05-08

Applicant: Norbert Juffa, John Nickolls

Inventor: Norbert Juffa, John Nickolls

IPC: G06F7/52

CPC classification number: G06F17/16

Abstract: Systems and methods for reducing the bandwidth needed to read the inputs to a matrix multiply operation may improve system performance. Rather than reading a row of a first input matrix and a column of a second input matrix to produce a column of a product matrix, a column of the first input matrix and a single element of the second input matrix are read to produce a column of partial dot products of the product matrix. Therefore, the number of input matrix elements read to produce each product matrix element is reduced from 2N to N+1, where N is the number of elements in a column of the product matrix.
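
The access pattern in the abstract, one column of the first matrix combined with a single element of the second, can be sketched as follows. The function name `matmul_column_broadcast` is hypothetical, and this sequential loop nest is only a reading of the abstract, not the claimed system.

```python
def matmul_column_broadcast(a, b):
    """Compute c = a @ b by updating one column of partial dot products at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for j in range(m):                    # column j of the product
        for t in range(k):
            column = [a[i][t] for i in range(n)]   # N elements of the first matrix
            scalar = b[t][j]                       # 1 element of the second matrix
            # N + 1 input elements update N partial dot products, instead
            # of reading a separate row/column pair for every element.
            for i in range(n):
                c[i][j] += column[i] * scalar
    return c
```

After all `k` passes for a given `j`, the partial dot products in column `j` have become the finished product elements.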

18.

Granted patent
Microprocessor including an efficient implementation of extreme value instructions
Status: In force

Publication number: US06557098B2

Publication date: 2003-04-29

Application number: US09478139

Application date: 2000-01-05

Applicant: Stuart Oberman, Norbert Juffa

Inventor: Stuart Oberman, Norbert Juffa

IPC: G06F9/305

CPC classification number: G06F17/16, G06F9/30014, G06F9/30021, G06F9/30036, G06F9/30167, G06F9/3017, G06F9/3804, G06F9/3885

Abstract: An execution unit is provided for executing a first instruction which includes an opcode field, a first operand field, and a second operand field. The execution unit includes a first input register for receiving a first operand specified by a value of the first operand field, and a second input register for receiving a second operand specified by a value of the second operand field. The execution unit further includes a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator unit is also coupled to receive the first and second operand values from the first and second input registers, respectively. The execution unit further includes a multiplexer which receives a plurality of inputs. These inputs include a first constant value, a second constant value, and the values of the first and second operand. If the decoded opcode value received by the comparator indicates that the first instruction is either a compare or extreme value function, the comparator conveys one or more control signals to the multiplexer for the purpose of selecting an output of the multiplexer as the result of the first instruction. If the first instruction is one of a plurality of extreme value instructions, the one or more control signals conveyed by the comparator unit select between the first operand and second operand to determine the result of the first instruction. If the first instruction is one of a plurality of compare instructions, the one or more control signals conveyed by the comparator unit select between the first and second constant value to determine the result of the first instruction. In another embodiment, a similar execution unit is provided which handles vector operands.
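
A small behavioral model may help picture how one comparator and one multiplexer can serve both instruction families. The opcode names (`MIN`, `MAX`, `CMPLT`, `CMPEQ`), the function name `execute`, and the all-ones/all-zeros constants are assumptions chosen for this sketch; the abstract does not specify them.

```python
def execute(opcode, a, b, true_const=0xFFFFFFFF, false_const=0x00000000):
    """Model a shared comparator driving a 4-input result multiplexer."""
    # Comparator: evaluated once, whatever the opcode.
    a_lt_b = a < b
    a_eq_b = a == b

    # Multiplexer inputs: both operands plus two constants.
    mux_inputs = {"A": a, "B": b, "C1": true_const, "C0": false_const}

    # Control signal: extreme value opcodes pick an operand,
    # compare opcodes pick a constant.
    if opcode == "MIN":
        select = "A" if a_lt_b else "B"
    elif opcode == "MAX":
        select = "B" if a_lt_b else "A"
    elif opcode == "CMPLT":
        select = "C1" if a_lt_b else "C0"
    elif opcode == "CMPEQ":
        select = "C1" if a_eq_b else "C0"
    else:
        raise ValueError(f"unsupported opcode: {opcode}")
    return mux_inputs[select]
```

The only difference between the two families is which pair of multiplexer inputs the comparator's control signal chooses between, so no separate datapath is needed for either.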

19.

Granted patent
Rapid execution of floating point load control word instructions
Status: In force

Publication number: US06405305B1

Publication date: 2002-06-11

Application number: US09394024

Application date: 1999-09-10

Applicant: Stephan G. Meier, Jeffrey E. Trull, Derrick R. Meyer, Norbert Juffa

Inventor: Stephan G. Meier, Jeffrey E. Trull, Derrick R. Meyer, Norbert Juffa

IPC: G06F9/302

CPC classification number: G06F9/30087, G06F9/30043, G06F9/30094, G06F9/30101, G06F9/30189, G06F9/3836, G06F9/384, G06F9/3855, G06F9/3857, G06F9/3861

Abstract: A microprocessor with a floating point unit configured to rapidly execute floating point load control word (FLDCW) type instructions in an out of program order context is disclosed. The floating point unit is configured to schedule instructions older than the FLDCW-type instruction before the FLDCW-type instruction is scheduled. The FLDCW-type instruction acts as a barrier to prevent instructions occurring after the FLDCW-type instruction in program order from executing before the FLDCW-type instruction. Indicator bits may be used to simplify instruction scheduling, and copies of the floating point control word may be stored for instructions that have long execution cycles. A method and computer configured to rapidly execute FLDCW-type instructions in an out of program order context are also disclosed.

20.

Granted patent
Floating point addition pipeline including extreme value, comparison and accumulate functions
Status: Lapsed

Publication number: US06298367B1

Publication date: 2001-10-02

Application number: US09055916

Application date: 1998-04-06

Applicant: Stuart F. Oberman, Norbert Juffa, Fred Weber, Krishnan Ramani, Ravi Krishna

Inventor: Stuart F. Oberman, Norbert Juffa, Fred Weber, Krishnan Ramani, Ravi Krishna

IPC: G06F7/38

CPC classification number: G06F7/483, G06F9/30014, G06F9/30021, G06F9/30036, H03M7/24

Abstract: A multimedia execution unit configured to perform vectored floating point and integer instructions. The execution unit may include an add/subtract pipeline having far and close data paths. The far path is configured to handle effective addition operations and effective subtraction operations for operands having an absolute exponent difference greater than one. The close path is configured to handle effective subtraction operations for operands having an absolute exponent difference less than or equal to one. The close path is configured to generate two output values, wherein one output value is the first input operand plus an inverted version of the second input operand, while the second output value is equal to the first output value plus one. Selection of the first or second output value in the close path effectuates the round-to-nearest operation for the output of the adder. The execution unit may be configured to perform vectored addition and subtraction, integer/floating point conversion, reverse subtraction, accumulate, extreme value (minimum/maximum), and comparison instructions.
