Converting register data from a first format type to a second format
type if a second type instruction consumes data produced by a first
type instruction
    31.
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US6105129A

    Publication date: 2000-08-15

    Application number: US25233

    Filing date: 1998-02-18

    Abstract: A microprocessor includes one or more registers which are architecturally defined to be used for at least two data formats. In one embodiment, the registers are the floating point registers defined in the x86 architecture, and the data formats are the floating point data format and the multimedia data format. The registers actually implemented by the microprocessor for the floating point registers use an internal format for floating point data. Part of the internal format is a classification field which classifies the floating point data in the extended precision defined by the x86 microprocessor architecture. Additionally, a classification field encoding is reserved for multimedia data. As the microprocessor begins execution of each multimedia instruction, the classification information of the source operands is examined to determine if the data is either in the multimedia class, or in a floating point class in which the significand portion of the register is the same as the corresponding significand in extended precision. If so, the multimedia instruction executes normally. If not, the multimedia instruction is faulted. Similarly, as the microprocessor begins execution of each floating point instruction, the classification information of the source operands is examined. If the data is classified as multimedia, the floating point instruction is faulted. A microcode routine is used to reformat the data stored in at least the source registers of the faulting instruction into a format useable by the faulting instruction. Subsequently, the faulting instruction is re-executed.
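
    The check-and-fault flow in the abstract above can be sketched as a small Python model. This is an illustrative sketch only, not the patented implementation: the class names `FP_NORMAL` and `FP_DENORMAL`, the choice of which floating point class counts as significand-compatible, and the `reformat` stand-in for the microcode routine are all assumptions, since the abstract does not enumerate the classification encodings.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RegClass(Enum):
    # Hypothetical encodings; the abstract does not list the actual classes.
    FP_NORMAL = auto()    # assumed: significand identical to extended precision
    FP_DENORMAL = auto()  # assumed: internal form differs from extended precision
    MULTIMEDIA = auto()   # the encoding reserved for multimedia data


# Assumption: floating point classes a multimedia instruction may read directly.
SIGNIFICAND_COMPATIBLE = {RegClass.FP_NORMAL}


@dataclass
class Register:
    bits: int
    cls: RegClass


class FormatFault(Exception):
    """Raised when a source operand is in the wrong format for the instruction."""


def check_sources(kind, sources):
    # Examine the classification of every source operand before execution.
    for reg in sources:
        if kind == "multimedia":
            ok = reg.cls is RegClass.MULTIMEDIA or reg.cls in SIGNIFICAND_COMPATIBLE
        else:  # floating point instruction
            ok = reg.cls is not RegClass.MULTIMEDIA
        if not ok:
            raise FormatFault(reg)


def reformat(reg, kind):
    # Stand-in for the microcode routine. It only retags the register here;
    # a real routine would also rewrite the stored bits.
    reg.cls = RegClass.MULTIMEDIA if kind == "multimedia" else RegClass.FP_NORMAL


def execute(kind, sources):
    """Run the check, reformat on a fault, then retry. Returns the fault count."""
    faults = 0
    while True:
        try:
            check_sources(kind, sources)
            return faults
        except FormatFault:
            faults += 1
            for reg in sources:
                reformat(reg, kind)
```

    In this model a floating point instruction that reads a multimedia-tagged register faults once, the register is retagged, and the retry succeeds.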

    Efficient matrix multiplication on a parallel processing device
    32.
    Type: Granted invention patent
    Legal status: Active

    Publication number: US08589468B2

    Publication date: 2013-11-19

    Application number: US12875961

    Filing date: 2010-09-03

    CPC classification number: G06F17/16

    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
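
    The tile-by-tile computation described above can be sketched in plain Python. This is a serial model under stated assumptions, not the patented mapping: the tile size `TILE` is invented, each `(tr, tc)` loop iteration merely stands in for one CTA, each `(i, j)` pair stands in for one thread, and the list slices stand in for copies into local memory. Coalescing and bank conflicts are not modeled.

```python
TILE = 2  # assumed tile edge length; the abstract does not give a size


def tiled_matmul(a, b, tile=TILE):
    """Multiply matrices a (m x k) and b (k x n), one result tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Each (tr, tc) pair plays the role of one CTA that owns one result tile.
    for tr in range(0, m, tile):
        for tc in range(0, n, tile):
            # Walk the shared dimension one source tile at a time.
            for tk in range(0, k, tile):
                # Copy the source tiles into "local memory" (plain lists here).
                a_tile = [row[tk:tk + tile] for row in a[tr:tr + tile]]
                b_tile = [row[tc:tc + tile] for row in b[tk:tk + tile]]
                # Each (i, j) pair plays the role of one thread of the CTA.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[tr + i][tc + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile))
                        )
    return c
```

    Because each source tile is copied once and then reused by every element of the result tile, the number of reads from the large matrices drops compared with an untiled triple loop.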

    Maximized memory throughput on parallel processing devices
    33.
    Type: Granted invention patent
    Legal status: Active

    Publication number: US08327123B2

    Publication date: 2012-12-04

    Application number: US13069384

    Filing date: 2011-03-23

    CPC classification number: G06F9/3887 G06F9/3455 G06F9/3851 G06F9/3889

    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

    MAXIMIZED MEMORY THROUGHPUT ON PARALLEL PROCESSING DEVICES
    34.
    Type: Invention patent application
    Legal status: Active

    Publication number: US20110173414A1

    Publication date: 2011-07-14

    Application number: US13069384

    Filing date: 2011-03-23

    CPC classification number: G06F9/3887 G06F9/3455 G06F9/3851 G06F9/3889

    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

    Apparatus and method for superforwarding load operands in a microprocessor
    35.
    Type: Granted invention patent
    Legal status: Active

    Publication number: US06442677B1

    Publication date: 2002-08-27

    Application number: US09329497

    Filing date: 1999-06-10

    CPC classification number: G06F9/30043 G06F9/3826

    Abstract: An apparatus and method for superforwarding load operands in a microprocessor are provided. An execution unit in a microprocessor is configured to receive a load instruction and a subsequent instruction. If the load instruction corresponds to a simple load instruction, a destination operand of the load instruction can be superforwarded to a subsequent instruction if the subsequent instruction specifies a source operand that depends on the destination operand of the load instruction. The subsequent instruction is not required to wait until a load instruction executes or completes and can be scheduled and/or executed prior to or at the same time as the load instruction. Consequently, latencies associated with operand dependencies may be reduced.

    Method and apparatus for rapid execution of FCOM and FSTSW
    36.
    Type: Granted invention patent
    Legal status: Active

    Publication number: US06425074B1

    Publication date: 2002-07-23

    Application number: US09393524

    Filing date: 1999-09-10

    Abstract: A microprocessor configured to rapidly execute floating point store status word (FSTSW) type instructions that are immediately preceded by floating point compare (FCOM) type instructions is disclosed. FCOM-type instructions are modified to store their results to an architectural floating point status word and a temporary destination register. If an FSTSW-type instruction is detected immediately following an FCOM-type instruction, then the FSTSW-type instruction is transformed into a special fast floating point store status word (FSTSWEF) instruction. Unlike the FSTSW-type instruction, which is serializing and negatively impacts performance, the FSTSWEF instruction is not serializing and allows execution to continue without undue serialization. A computer system and method for rapidly executing FSTSW instructions immediately preceded by FCOM-type instructions are also disclosed.
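
    The instruction transformation described above can be sketched as a pass over a decoded opcode stream. This is a minimal sketch, assuming opcodes are plain strings: the sets `FCOM_TYPE` and `FSTSW_TYPE` are guesses at what the abstract means by "FCOM-type" and "FSTSW-type", and the function name `rewrite_fstsw` is hypothetical. The temporary destination register and the non-serializing execution of FSTSWEF are not modeled.

```python
# Assumed membership of the two instruction "types"; the abstract does not list them.
FCOM_TYPE = {"FCOM", "FCOMP", "FCOMPP"}
FSTSW_TYPE = {"FSTSW", "FNSTSW"}


def rewrite_fstsw(opcodes):
    """Replace an FSTSW-type opcode with FSTSWEF when it directly follows
    an FCOM-type opcode; leave every other opcode unchanged."""
    out = []
    prev = None
    for op in opcodes:
        if op in FSTSW_TYPE and prev in FCOM_TYPE:
            out.append("FSTSWEF")  # fast, non-serializing form
        else:
            out.append(op)  # includes FSTSW not preceded by a compare
        prev = op
    return out
```

    Only the adjacent pair is rewritten; an FSTSW that follows any other instruction keeps its original serializing form.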

    Multi-function bipartite look-up table
    37.
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US06256653B1

    Publication date: 2001-07-03

    Application number: US09015084

    Filing date: 1998-01-29

    Abstract: A multi-function look-up table for determining output values for predetermined ranges of a first mathematical function and a second mathematical function. In one embodiment, the multi-function look-up table is a bipartite look-up table including a first plurality of storage locations and a second plurality of storage locations. The first plurality of storage locations store base values for the first and second mathematical functions. Each base value is an output value (for either the first or second function) corresponding to an input region which includes the look-up table input value. The second plurality of storage locations, on the other hand, store difference values for both the first and second mathematical functions. These difference values are used for linear interpolation in conjunction with a corresponding base value in order to generate a look-up table output value. The multi-function look-up table further includes an address control unit coupled to receive a first input value and a signal which indicates whether an output value is to be generated for the first or second mathematical function. The address control unit then generates a first address value from these signals which is in turn conveyed to the first and second plurality of storage locations. In response to receiving the first address value, the first and second plurality of storage locations are configured to output a first base value and a first difference value, respectively. The first base value and first difference value are then conveyed to an output unit configured to generate a look-up table output value from the two values.
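
    The base-plus-difference scheme described above can be sketched in Python. This is a sketch under several assumptions, not the patented design: the two functions (reciprocal and reciprocal square root), the 9-bit input split into three 3-bit fields, the input range [1, 2), and the use of floating point table entries are all chosen for illustration. The function-select value `sel` is folded into the table key, standing in for the address control unit.

```python
import math

FIELD_BITS = 3                  # assumed width of each of the three input fields
N = 1 << FIELD_BITS             # entries per field
SCALE = 1 << (3 * FIELD_BITS)   # number of distinct input values in [1, 2)

# Assumed function pair; the abstract only speaks of a first and second function.
FUNCS = {
    0: lambda x: 1.0 / x,
    1: lambda x: 1.0 / math.sqrt(x),
}


def build_tables():
    """Build the base table, keyed by (sel, x0, x1), and the difference
    table, keyed by (sel, x0, x2)."""
    base, diff = {}, {}
    for sel, f in FUNCS.items():
        for x0 in range(N):
            lo = 1.0 + x0 / N
            # Average slope of f over the region selected by the high field.
            slope = (f(lo + 1.0 / N) - f(lo)) * N
            for x1 in range(N):
                # Base value: f at the start of the sub-interval (x0, x1).
                base[sel, x0, x1] = f(lo + x1 / (N * N))
            for x2 in range(N):
                # Difference value: linear correction for the low field.
                diff[sel, x0, x2] = slope * x2 / SCALE
    return base, diff


BASE, DIFF = build_tables()


def lookup(sel, idx):
    """Approximate FUNCS[sel](1 + idx / SCALE) as base value plus difference value."""
    x0 = idx >> (2 * FIELD_BITS)
    x1 = (idx >> FIELD_BITS) & (N - 1)
    x2 = idx & (N - 1)
    return BASE[sel, x0, x1] + DIFF[sel, x0, x2]
```

    With these assumed widths each table holds 128 entries across both functions, against 1024 entries for a direct table over the same inputs, at the cost of a small interpolation error.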
