Microprocessor including an efficient implementation of extreme value instructions
    11.
    Granted invention patent
    Microprocessor including an efficient implementation of extreme value instructions (status: in force)

    Publication number: US06557098B2

    Publication date: 2003-04-29

    Application number: US09478139

    Filing date: 2000-01-05

    Abstract: An execution unit is provided for executing a first instruction which includes an opcode field, a first operand field, and a second operand field. The execution unit includes a first input register for receiving a first operand specified by a value of the first operand field, and a second input register for receiving a second operand specified by a value of the second operand field. The execution unit further includes a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator unit is also coupled to receive the first and second operand values from the first and second input registers, respectively. The execution unit further includes a multiplexer which receives a plurality of inputs. These inputs include a first constant value, a second constant value, and the values of the first and second operands. If the decoded opcode value received by the comparator unit indicates that the first instruction is either a compare or an extreme value function, the comparator unit conveys one or more control signals to the multiplexer for the purpose of selecting an output of the multiplexer as the result of the first instruction. If the first instruction is one of a plurality of extreme value instructions, the one or more control signals conveyed by the comparator unit select between the first operand and the second operand to determine the result of the first instruction. If the first instruction is one of a plurality of compare instructions, the one or more control signals conveyed by the comparator unit select between the first and second constant values to determine the result of the first instruction. In another embodiment, a similar execution unit is provided which handles vector operands.
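The comparator-plus-multiplexer arrangement described in this abstract can be sketched in software. This is only an illustration under stated assumptions: the opcode names (`FMAX`, `FMIN`, `CMP_GT`, `CMP_EQ`), the all-ones/all-zeros constants, and the function names are hypothetical and are not taken from the patent.

```python
from enum import Enum


class Op(Enum):
    # Hypothetical opcodes: two extreme value and two compare instructions.
    FMAX = 1
    FMIN = 2
    CMP_GT = 3
    CMP_EQ = 4


# Assumed encodings for the two constant multiplexer inputs.
TRUE_MASK = 0xFFFFFFFF   # "first constant value" (comparison holds)
FALSE_MASK = 0x00000000  # "second constant value" (comparison fails)


def comparator(op, a, b):
    """Return the multiplexer select signal for the given opcode and operands."""
    if op is Op.FMAX:
        return "a" if a >= b else "b"
    if op is Op.FMIN:
        return "a" if a <= b else "b"
    if op is Op.CMP_GT:
        return "c1" if a > b else "c0"
    if op is Op.CMP_EQ:
        return "c1" if a == b else "c0"
    raise ValueError(f"unsupported opcode: {op}")


def execute(op, a, b):
    """Four-input multiplexer: both operands plus the two constants."""
    mux_inputs = {"a": a, "b": b, "c1": TRUE_MASK, "c0": FALSE_MASK}
    return mux_inputs[comparator(op, a, b)]
```

For example, `execute(Op.FMAX, 3, 7)` selects the second operand, while `execute(Op.CMP_GT, 7, 3)` selects the all-ones constant. The point of the design is that one comparison result drives both instruction families; only the multiplexer inputs being selected differ.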


    Rapid execution of floating point load control word instructions
    12.
    Granted invention patent
    Rapid execution of floating point load control word instructions (status: in force)

    Publication number: US06405305B1

    Publication date: 2002-06-11

    Application number: US09394024

    Filing date: 1999-09-10

    Abstract: A microprocessor with a floating point unit configured to rapidly execute floating point load control word (FLDCW) type instructions in an out-of-program-order context is disclosed. The floating point unit is configured to schedule instructions older than the FLDCW-type instruction before the FLDCW-type instruction is scheduled. The FLDCW-type instruction acts as a barrier to prevent instructions occurring after the FLDCW-type instruction in program order from executing before the FLDCW-type instruction. Indicator bits may be used to simplify instruction scheduling, and copies of the floating point control word may be stored for instructions that have long execution cycles. A method and computer configured to rapidly execute FLDCW-type instructions in an out-of-program-order context are also disclosed.


    Floating point addition pipeline including extreme value, comparison and accumulate functions
    13.
    Granted invention patent
    Floating point addition pipeline including extreme value, comparison and accumulate functions (status: lapsed)

    Publication number: US06298367B1

    Publication date: 2001-10-02

    Application number: US09055916

    Filing date: 1998-04-06

    Abstract: A multimedia execution unit configured to perform vectored floating point and integer instructions. The execution unit may include an add/subtract pipeline having far and close data paths. The far path is configured to handle effective addition operations and effective subtraction operations for operands having an absolute exponent difference greater than one. The close path is configured to handle effective subtraction operations for operands having an absolute exponent difference less than or equal to one. The close path is configured to generate two output values, wherein one output value is the first input operand plus an inverted version of the second input operand, while the second output value is equal to the first output value plus one. Selection of the first or second output value in the close path effectuates the round-to-nearest operation for the output of the adder. The execution unit may be configured to perform vectored addition and subtraction, integer/floating point conversion, reverse subtraction, accumulate, extreme value (minimum/maximum), and comparison instructions.
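A small sketch may help with the far/close path split and the two candidate outputs of the close path. It models only unsigned integer significands of a fixed width; the function names and the default 24-bit width are assumptions for illustration, and the rounding-based selection between the two candidates is not modeled.

```python
def select_path(exp_a, exp_b, effective_subtract):
    """Pick the data path from the operation type and the exponent difference."""
    if effective_subtract and abs(exp_a - exp_b) <= 1:
        return "close"
    return "far"  # all effective additions, and subtractions with difference > 1


def close_path_candidates(a, b, width=24):
    """Return the two close-path outputs for significands a and b.

    sum0 is a plus the bitwise inverse of b; sum1 is sum0 plus one.
    Both are truncated to `width` bits, as a fixed-width adder would do.
    """
    mask = (1 << width) - 1
    sum0 = (a + (~b & mask)) & mask
    sum1 = (sum0 + 1) & mask
    return sum0, sum1
```

With `a = 0b1010`, `b = 0b0011` and `width=4`, the candidates are 6 and 7. The second equals `a - b` by the two's-complement identity, and the first is one less, so having both available lets the hardware pick the correctly rounded value without a separate increment step.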


    Bipartite look-up table with output values having minimized absolute error
    14.
    Granted invention patent
    Bipartite look-up table with output values having minimized absolute error (status: lapsed)

    Publication number: US06223192B1

    Publication date: 2001-04-24

    Application number: US09098482

    Filing date: 1998-06-16

    Abstract: A method for generating entries for a bipartite look-up table having base and difference table portions. In one embodiment, these entries are usable to form output values for a mathematical function, f(x), in response to receiving corresponding input values within a predetermined input range. The method first comprises partitioning the input range into I intervals, J subintervals/interval, and K sub-subintervals/subinterval. For a given interval M, the method includes generating K difference table entries and J base table entries. Each of the K difference table entries corresponds to a particular group of sub-subintervals within interval M, each of which has the same relative position within their respective subintervals. Each difference table entry is computed by averaging difference values for the sub-subintervals included in a corresponding group N. Each difference value which makes up this average is equal to f(X1)−f(X2), where X1 is the midpoint of the sub-subinterval within group N, and X2 is the midpoint of a predetermined reference sub-subinterval within the same subinterval as X1. Each of these midpoints is calculated such that maximum absolute error is minimized for all possible input values in the sub-subinterval. Each of the J base table entries, on the other hand, corresponds to a subinterval within interval M. Each entry is equal to f(X2)+adjust, where X2 is the midpoint of the reference sub-subinterval of the subinterval corresponding to the base table entry. The adjust value is calculated so that error introduced by the averaging of the difference table entries is evenly distributed over the entire subinterval.
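The table-generation procedure in this abstract can be approximated with a short sketch. Several simplifications are assumptions made here and are not from the patent: plain arithmetic midpoints stand in for the error-minimizing midpoints, the reference sub-subinterval is the first one (`ref=0`), and `adjust` is taken as the midrange of the averaging error so that the error is centered around zero. The names `build_tables` and `lookup` are hypothetical.

```python
def build_tables(f, lo, hi, I, J, K, ref=0):
    """Build base (I x J) and difference (I x K) tables for f on [lo, hi)."""
    width = (hi - lo) / (I * J * K)  # width of one sub-subinterval

    def mid(i, j, k):
        # Arithmetic midpoint of sub-subinterval k of subinterval j of interval i.
        return lo + ((i * J + j) * K + k + 0.5) * width

    base, diff = [], []
    for i in range(I):
        # One difference entry per group of same-position sub-subintervals,
        # averaged over the J subintervals of this interval.
        d_row = [
            sum(f(mid(i, j, k)) - f(mid(i, j, ref)) for j in range(J)) / J
            for k in range(K)
        ]
        b_row = []
        for j in range(J):
            # Error introduced by using the averaged differences in subinterval j.
            errs = [f(mid(i, j, k)) - f(mid(i, j, ref)) - d_row[k] for k in range(K)]
            adjust = (max(errs) + min(errs)) / 2  # assumed centering rule
            b_row.append(f(mid(i, j, ref)) + adjust)
        base.append(b_row)
        diff.append(d_row)
    return base, diff


def lookup(base, diff, x, lo, hi, I, J, K):
    """Approximate f(x) as one base entry plus one difference entry."""
    width = (hi - lo) / (I * J * K)
    idx = min(int((x - lo) / width), I * J * K - 1)
    i, rem = divmod(idx, J * K)
    j, k = divmod(rem, K)
    return base[i][j] + diff[i][k]
```

As a usage example, `build_tables(lambda x: 1 / x, 1.0, 2.0, 4, 4, 4)` produces 16 base entries and 16 difference entries that together cover 64 sub-subintervals, which is the storage saving a bipartite table is meant to provide over a single 64-entry table of the same resolution.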


    EFFICIENT MATRIX MULTIPLICATION ON A PARALLEL PROCESSING DEVICE
    16.
    Invention patent application
    EFFICIENT MATRIX MULTIPLICATION ON A PARALLEL PROCESSING DEVICE (status: in force)

    Publication number: US20100325187A1

    Publication date: 2010-12-23

    Application number: US12875961

    Filing date: 2010-09-03

    CPC classification number: G06F17/16

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses to the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
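The tile-by-tile computation described here can be illustrated with a serial sketch. It is not the patented CTA or thread mapping: each `(ti, tj)` iteration merely stands in for the work one CTA would do on one result tile, the slices stand in for copies of source tiles into local memory, and concurrency, coalescing, and bank conflicts are not modeled. The function name and the `tile` parameter are assumptions.

```python
def matmul_tiled(A, B, tile=2):
    """Multiply A (m x k) by B (k x n), computing C one result tile at a time."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for ti in range(0, m, tile):
        for tj in range(0, n, tile):
            # One (ti, tj) pair corresponds to one result tile, i.e. one CTA's work.
            for tk in range(0, k, tile):
                # Copy the two source tiles once ("local memory"), then reuse them.
                a_tile = [row[tk:tk + tile] for row in A[ti:ti + tile]]
                b_tile = [row[tj:tj + tile] for row in B[tk:tk + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        C[ti + i][tj + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile))
                        )
    return C
```

For instance, `matmul_tiled([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])` returns `[[58, 64], [139, 154]]`. Each source element is fetched from the full matrices once per tile rather than once per result element, which is the reduction in global memory traffic the abstract refers to.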


    Floating point addition pipeline including extreme value, comparison and accumulate functions
    17.
    Granted invention patent
    Floating point addition pipeline including extreme value, comparison and accumulate functions (status: in force)

    Publication number: US06397239B2

    Publication date: 2002-05-28

    Application number: US09778352

    Filing date: 2001-02-06

    Abstract: A multimedia execution unit configured to perform vectored floating point and integer instructions. The execution unit may include an add/subtract pipeline having far and close data paths. The far path is configured to handle effective addition operations and effective subtraction operations for operands having an absolute exponent difference greater than one. The close path is configured to handle effective subtraction operations for operands having an absolute exponent difference less than or equal to one. The close path is configured to generate two output values, wherein one output value is the first input operand plus an inverted version of the second input operand, while the second output value is equal to the first output value plus one. Selection of the first or second output value in the close path effectuates the round-to-nearest operation for the output of the adder.


    Method and apparatus for rounding in a multiplier
    18.
    Granted invention patent
    Method and apparatus for rounding in a multiplier (status: in force)

    Publication number: US06397238B2

    Publication date: 2002-05-28

    Application number: US09782475

    Filing date: 2001-02-12

    Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
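The two-path rounding idea (one path assuming overflow, one assuming none) can be sketched for unsigned significands. This is a simplified model under assumptions that are not from the patent: operands are `n`-bit integers with the leading bit set, rounding is round-half-up by adding a rounding constant before truncation, and the function name is hypothetical.

```python
def round_two_paths(a, b, n=8):
    """Multiply two n-bit normalized significands and round to n bits.

    Returns (significand, exponent_increment).
    """
    p = a * b  # exact product, between 2**(2n-2) and 2**(2n) - 1

    # Path 1 assumes no overflow: keep bits [2n-2 : n-1], rounding constant below.
    no_overflow = (p + (1 << (n - 2))) >> (n - 1)
    # Path 2 assumes overflow: keep bits [2n-1 : n], rounding constant one bit higher.
    overflow = (p + (1 << (n - 1))) >> n

    # Both paths are computed; the top bit of the product selects the result.
    if (p >> (2 * n - 1)) & 1:
        sig, exp_inc = overflow, 1
    else:
        sig, exp_inc = no_overflow, 0

    # Rounding itself may carry out of the top bit; renormalize if so.
    if sig == 1 << n:
        sig >>= 1
        exp_inc += 1
    return sig, exp_inc
```

With `n=4`, `round_two_paths(0b1100, 0b1100, n=4)` (1.5 times 1.5) takes the overflow path and yields significand 9 with an exponent increment of 1, i.e. 1.125 times 2. Computing both paths in parallel avoids waiting for the overflow decision before starting the rounding addition.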


    Rapid execution of FCMOV following FCOMI by storing comparison result in temporary register in floating point unit
    19.
    Granted invention patent
    Rapid execution of FCMOV following FCOMI by storing comparison result in temporary register in floating point unit (status: in force)

    Publication number: US06393555B1

    Publication date: 2002-05-21

    Application number: US09370787

    Filing date: 1999-08-05

    Abstract: A microprocessor with a floating point unit configured to rapidly execute floating point compare (FCOMI) type instructions that are followed by floating point conditional move (FCMOV) type instructions is disclosed. FCOMI-type instructions, which normally store their results to integer status flag registers, are modified to store a copy of their results to a temporary register located within the floating point unit. If an FCMOV-type instruction is detected following an FCOMI-type instruction, then the FCMOV-type instruction's source for flag information is changed from the integer flag register to the temporary register. FCMOV-type instructions are thereby able to execute earlier because they need not wait for the integer flags to be read from the integer portion of the microprocessor. A computer system and method for rapidly executing FCOMI-type instructions followed by FCMOV-type instructions are also disclosed.


    Apparatus and method for handling tiny numbers using a super sticky bit in a microprocessor
    20.
    Granted invention patent
    Apparatus and method for handling tiny numbers using a super sticky bit in a microprocessor (status: in force)

    Publication number: US06374345B1

    Publication date: 2002-04-16

    Application number: US09359919

    Filing date: 1999-07-22

    Abstract: An apparatus and method for handling tiny numbers using a super sticky bit are provided. In response to detecting that a preliminary result of an instruction corresponds to a tiny number and an underflow exception is masked, an execution pipeline can be configured to store a value corresponding to the preliminary result and a super sticky bit in a destination register. Also, a destination register tag corresponding to the destination register and a denormal exception indicator corresponding to the tiny number and masked underflow exception can be stored. A trap handler can be initiated to generate a corrected result for the instruction. The trap handler can detect that the denormal exception indicator has been set and can read the value and the super sticky bit from the destination register using the destination register tag. The trap handler can generate a corrected result for the instruction based on the value and the super sticky bit. An instruction subsequent to the trapping instruction can then be restarted.

