Load-store unit and method of loading and storing single-precision
floating-point registers in a double-precision architecture
    1.
    Granted invention patent
    Legal status: expired

    Publication No.: US5805475A

    Publication date: 1998-09-08

    Application No.: US816067

    Filing date: 1997-03-11

    Abstract: A floating-point load-store unit includes a translator for converting between the single-precision and double-precision representations, and special-case logic for providing special-case signals when a store is being performed on zero, infinity, or NaN. A store-float-double instruction is executed by concatenating a suffix to the mantissa in the single-precision floating-point register and replacing the high-order bit of the exponent with a prefix selected as a function of the high-order bit, wherein the resulting mantissa and exponent form a double-precision floating-point number that is then stored to memory. A load-float-double instruction is executed by dropping the suffix from the mantissa of the double-precision floating-point number in memory, and replacing the prefix with the high-order bit, wherein the resulting mantissa and exponent form a single-precision floating-point number that is then loaded into the single-precision floating-point register.

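The conversion described in this abstract can be sketched in Python as bit manipulation on IEEE 754 encodings. This is a minimal sketch under stated assumptions, not the patented circuit: the 4-bit prefix values (`0b1000` and `0b0111`), the 29-bit zero suffix, and all function names are derived here from the IEEE 754 single and double formats rather than taken from the patent, and denormal inputs are deliberately rejected.

```python
import struct

SUFFIX_BITS = 29  # a double has 52 fraction bits, a single has 23


def single_to_double_bits(s: int) -> int:
    """Widen a 32-bit single encoding to a 64-bit double encoding (store path)."""
    sign = (s >> 31) & 1
    exp8 = (s >> 23) & 0xFF
    frac23 = s & 0x7FFFFF
    if exp8 == 0:
        if frac23:
            raise ValueError("denormal inputs are outside this sketch")
        exp11 = 0                      # special case: zero
    elif exp8 == 0xFF:
        exp11 = 0x7FF                  # special case: infinity or NaN
    else:
        high = exp8 >> 7               # high-order exponent bit
        prefix = 0b1000 if high else 0b0111   # assumed prefix encoding
        exp11 = (prefix << 7) | (exp8 & 0x7F)
    # Appending 29 zero bits to the fraction is the "suffix".
    return (sign << 63) | (exp11 << 52) | (frac23 << SUFFIX_BITS)


def double_to_single_bits(d: int) -> int:
    """Narrow a 64-bit double encoding to 32 bits (load path).

    Assumes the value was produced by single_to_double_bits, so the
    suffix is zero and the exponent fits the single format.
    """
    sign = (d >> 63) & 1
    exp11 = (d >> 52) & 0x7FF
    frac23 = (d >> SUFFIX_BITS) & 0x7FFFFF        # drop the suffix
    exp8 = ((exp11 >> 10) << 7) | (exp11 & 0x7F)  # prefix -> high-order bit
    return (sign << 31) | (exp8 << 23) | frac23


def float_to_single_bits(x: float) -> int:
    """Helper: round a Python float to single precision and return its bits."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def double_bits_to_float(d: int) -> float:
    """Helper: reinterpret a 64-bit pattern as a Python float."""
    return struct.unpack(">d", struct.pack(">Q", d))[0]
```

In this sketch the load path needs no special cases: zero and infinity/NaN exponents map back correctly through the same bit selection, which is consistent with the abstract mentioning special-case signals only for stores.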

    Method and apparatus for executing fixed-point instructions within idle
execution units of a superscalar processor
    2.
    Granted invention patent
    Legal status: expired

    Publication No.: US5809323A

    Publication date: 1998-09-15

    Application No.: US530552

    Filing date: 1995-09-19

    IPC classification: G06F9/302 G06F9/38

    Abstract: A superscalar processor and method for executing fixed-point instructions within a superscalar processor are disclosed. The superscalar processor has a memory and multiple execution units, including a fixed point execution unit (FXU) and a non-fixed point execution unit (non-FXU). According to the present invention, a set of instructions to be executed are fetched from among a number of instructions stored within memory. A determination is then made if n instructions, the maximum number possible, can be dispatched to the multiple execution units during a first processor cycle if fixed point arithmetic and logical instructions are dispatched only to the FXU. If so, n instructions are dispatched to the multiple execution units for execution. In response to a determination that n instructions cannot be dispatched during the first processor cycle, a determination is made whether a fixed point instruction is available to be dispatched and whether dispatching the fixed point instruction to the non-FXU for execution will result in greater efficiency. In response to a determination that a fixed point instruction is not available to be dispatched or that dispatching the fixed point instruction to the non-FXU will not result in greater efficiency, dispatch of the fixed point instruction is delayed until a second processor cycle. However, in response to a determination that dispatching the fixed point instruction to the non-FXU will result in greater efficiency, the fixed point instruction is dispatched to the non-FXU and executed, thereby improving execution unit utilization.


    Processor and method for managing execution of an instruction which
determine subsequent to dispatch if an instruction is subject to
serialization
    3.
    Granted invention patent
    Legal status: expired

    Publication No.: US5678016A

    Publication date: 1997-10-14

    Application No.: US512741

    Filing date: 1995-08-08

    IPC classification: G06F9/312 G06F9/38

    Abstract: A method and apparatus are disclosed for managing the execution of a floating-point store instruction within a data processing system including a memory and a superscalar processor having a number of floating-point registers (FPRs). According to the present invention, multiple instructions are dispatched for execution by the processor, including a floating-point store instruction having as an operand the content of a particular FPR. A determination is made whether the particular FPR is a destination register for results of a second instruction which precedes the store instruction in program order. If so, a determination is made whether the second instruction must complete before subsequent instructions can be successfully dispatched. In response to a determination that the second instruction must be completed prior to successfully dispatching subsequent instructions, the floating-point store instruction is cancelled and redispatched after the completion of the second instruction. In response to a determination that the second instruction need not be completed prior to successfully dispatching subsequent instructions, execution of the floating-point store instruction is initiated by computing the destination address within memory into which the operand of the floating-point store instruction is to be stored, thereby minimizing the delay in executing a floating-point store instruction.


    Processor having vector processing capability and method for executing a vector instruction in a processor
    4.
    Granted invention patent
    Legal status: in force

    Publication No.: US06324638B1

    Publication date: 2001-11-27

    Application No.: US09282268

    Filing date: 1999-03-31

    IPC classification: G06F1517

    Abstract: A processor capable of executing vector instructions includes at least an instruction sequencing unit and a vector processing unit that receives vector instructions to be executed from the instruction sequencing unit. The vector processing unit includes a plurality of multiply structures, each containing only a single multiply array, that each correspond to at least one element of a vector input operand. Utilizing the single multiply array, each of the plurality of multiply structures is capable of performing a multiplication operation on one element of a vector input operand and is also capable of performing a multiplication operation on multiple elements of a vector input operand concurrently. In an embodiment in which the maximum length of an element of a vector input operand is N bits, each of the plurality of multiply arrays can handle both N by N bit integer multiplication and M by M bit integer multiplication, where N is a non-unitary integer multiple of M. At least one of the multiply structures also preferably includes an accumulating adder that receives as a first input a result produced by that multiply structure and receives as a second input a result produced by another multiply structure. From these inputs, the accumulating adder produces as an output an accumulated sum of the results in response to execution of the same instruction that caused the multiply structures to produce the intermediate results.

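One way to picture a single multiply array serving both N by N and M by M multiplication is to suppress the partial products that cross element boundaries. The Python sketch below is an illustration under assumptions, not the circuit claimed in the patent: the widths `N = 16` and `M = 8`, the unsigned arithmetic, the masking approach, and the function name `multiply_array` are all hypothetical choices made for this example.

```python
N = 16  # assumed maximum element width in bits
M = 8   # assumed narrower element width; N is 2 * M


def multiply_array(a: int, b: int, split: bool) -> int:
    """Sum the partial products of one unsigned N x N array.

    With split=False this is an ordinary N x N multiply.
    With split=True, partial products that pair bits from different
    M-bit lanes are suppressed, so the same array yields two
    independent M x M products packed side by side.
    """
    total = 0
    for i in range(N):              # bit index into a
        if not (a >> i) & 1:
            continue
        for j in range(N):          # bit index into b
            if not (b >> j) & 1:
                continue
            if split and i // M != j // M:
                continue            # cross-lane term: masked off
            total += 1 << (i + j)
    return total
```

In split mode the low-lane product occupies bits 0 to 15 of the result and the high-lane product occupies bits 16 to 31, so the two results never overlap.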

    Method and apparatus for dynamic allocation of registers for
intermediate floating-point results
    5.
    Granted invention patent
    Legal status: expired

    Publication No.: US5805916A

    Publication date: 1998-09-08

    Application No.: US758017

    Filing date: 1996-11-27

    IPC classification: G06F9/302 G06F9/38

    Abstract: The present invention relates to a multiple stage execution unit for executing instructions in a microprocessor having a plurality of rename registers for storing execution results, an instruction cache for storing instructions, each instruction being associated with a rename register, a sequencer unit for providing an instruction to the execution unit, and a data cache for providing data to the execution unit. In one version, the execution unit includes a first stage which generates an intermediate result from the data according to an instruction; a means for providing a first portion of the intermediate result to an intermediate register; a means for providing a second portion of the intermediate result to a rename register associated with the instruction; a means for passing the first portion from the intermediate register to a second stage of the execution unit; a means for passing the second portion from the rename register to the second stage of the execution unit; wherein the second stage of the execution unit operates on the first and second portions according to the instruction.


    Processor and method for out-of-order execution of instructions based
upon an instruction parameter
    6.
    Granted invention patent
    Legal status: expired

    Publication No.: US5872948A

    Publication date: 1999-02-16

    Application No.: US616613

    Filing date: 1996-03-15

    IPC classification: G06F9/38 G06F9/28

    Abstract: A processor and method for out-of-order execution of instructions are disclosed which fetch a first and a second instruction, wherein the first instruction precedes the second instruction in a program order. A determination is made whether execution of the second instruction is subject to execution of the first instruction. In response to a determination that execution of the second instruction is subject to execution of the first instruction, the second instruction is selectively executed prior to the first instruction in response to a parameter of at least one of the first and second instructions. In one embodiment, the parameter is an execution latency parameter of the first and second instructions.


    Method for implementing a four-way least recently used (LRU) mechanism
in high-performance
    7.
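    Granted invention patent
    Legal status: expired
    Title as given in the listing's Chinese-language entry: Method for implementing a four-way least recently used (LRU) mechanism in a high-performance data processing system

    Publication No.: US5765191A

    Publication date: 1998-06-09

    Application No.: US641060

    Filing date: 1996-04-29

    IPC classification: G06F12/08 G06F12/12

    CPC classification: G06F12/123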

    Abstract: A method for implementing a four-way least recently used cache line replacement scheme in a four-way cache memory is disclosed. The cache memory includes multiple cache lines, and each cache line includes four congruence sets. In accordance with the present disclosure, a 5-bit Least Recently Used (LRU) field is associated with each of the cache lines within the cache memory. For a particular cache line, a set number of a least recently used set among the four congruence sets is stored in any two bits of the LRU field associated with that cache line. Next, a set number of the second least recently used set among the four congruence sets is stored in another two bits of the same LRU field associated with the same cache line. Finally, a last bit of the 5-bit LRU field is set to a specific state in response to a determination of which one of the remaining two sets is the second most recently used set.

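The 5-bit encoding in this abstract can be sketched in Python as follows. This is a minimal sketch under assumptions: the abstract allows the two set numbers to sit in any bits of the field, so the bit layout used here (bits 4-3, bits 2-1, bit 0), the meaning assigned to each state of the last bit, and the function names `encode`, `decode`, `touch`, and `victim` are hypothetical choices for illustration.

```python
WAYS = (0, 1, 2, 3)


def encode(order):
    """Pack a use order (least recently used first) into a 5-bit field."""
    lru, second, third, _mru = order
    rest = sorted(set(WAYS) - {lru, second})
    # Assumed convention: bit = 0 if the lower-numbered remaining set
    # is the second most recently used one, 1 otherwise.
    bit = rest.index(third)
    return (lru << 3) | (second << 1) | bit


def decode(field):
    """Recover the full use order (least recently used first)."""
    lru = (field >> 3) & 0b11
    second = (field >> 1) & 0b11
    bit = field & 1
    rest = sorted(set(WAYS) - {lru, second})
    return [lru, second, rest[bit], rest[1 - bit]]


def touch(field, way):
    """Return the updated field after `way` is accessed (becomes most recent)."""
    order = decode(field)
    order.remove(way)
    order.append(way)
    return encode(order)


def victim(field):
    """The set to replace is the least recently used one."""
    return (field >> 3) & 0b11
```

Five bits suffice because four sets have only 24 possible use orders, which fits within the 32 states of the field.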

    High performance parallel binary byte adder
    8.
    Granted invention patent
    Legal status: expired

    Publication No.: US4914617A

    Publication date: 1990-04-03

    Application No.: US66580

    Filing date: 1987-06-26

    CPC classification: G06F7/505 G06F2207/382

    Abstract: A parallel binary byte adder performs addition and subtraction on the individual bytes of an A-operand and a B-operand as well as on the entire A and B operands. An A-operand is input to a special adder circuit. A B-operand is modified in a set up logic circuit, in accordance with the specific operation to be performed, before being input to the special adder circuit. A set/mask logic generates set, mask and carry signals which are further input to the special adder circuit. The special adder circuit includes an auxiliary functions circuit and a pseudo carry circuit for generating a set of variables which are processed by a sum circuit to produce three partial results. The first partial result relates to bits 0-5 of the particular byte being processed, the second relates to bit 6, and the third relates to bit 7. A concatenation of the three partial results produces a final sum or difference of the particular byte or bytes involved.

    Instruction dispatch queue for improved instruction cache to queue timing
    9.
    Granted invention patent
    Legal status: expired

    Publication No.: US5754811A

    Publication date: 1998-05-19

    Application No.: US730606

    Filing date: 1996-10-08

    IPC classification: G06F5/10 G06F9/38 G06F9/00

    Abstract: A circular dispatch queue is used to implement an instruction queue, in a microprocessor, in order to reduce the delay associated with the critical timing path between an instruction cache memory and the instruction queue. In the circular dispatch queue, instructions are never moved from one stage to another. Instead, pointers are maintained that indicate the top and bottom instructions within the circular dispatch queue. This technique removes inputs from the multiplexor between the register stages in the circular dispatch queue and the instruction cache memory, thus reducing the critical delay.

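The pointer-based queue described in this abstract behaves like a ring buffer: an instruction stays in the slot it was written to, and only the pointers advance. The Python sketch below is a software analogy under assumptions, not the hardware design; the class name `CircularDispatchQueue`, the method names, and the choice to derive the bottom position from a top pointer plus an entry count are hypothetical.

```python
class CircularDispatchQueue:
    """Fixed-size instruction queue in which entries never move between slots."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.top = 0      # index of the oldest (next to dispatch) instruction
        self.count = 0    # number of valid entries

    def fill(self, instruction) -> bool:
        """Write an instruction from the cache into the bottom slot."""
        if self.count == len(self.slots):
            return False  # queue full: nothing is written
        bottom = (self.top + self.count) % len(self.slots)
        self.slots[bottom] = instruction
        self.count += 1
        return True

    def dispatch(self):
        """Read the top instruction and advance the top pointer."""
        if self.count == 0:
            return None
        instruction = self.slots[self.top]
        self.top = (self.top + 1) % len(self.slots)
        self.count -= 1
        return instruction
```

Because each slot is only ever written from the fill path, a slot never needs an input from a neighbouring slot, which mirrors the abstract's point about removing multiplexor inputs.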

    Apparatus and method for prediction of zero arithmetic/logic results
    10.
    Granted invention patent
    Legal status: expired

    Publication No.: US4947359A

    Publication date: 1990-08-07

    Application No.: US346147

    Filing date: 1989-05-01

    Abstract: The invention determines when two operands are equivalent directly from the operands without the use of an adder. In one embodiment, conditions for the sum being equal to zero are determined from half-sum, carry, and transmit operators derived from the input operands. These operators are used in some known types of adders and thus may be provided from a parallel adder to the condition prediction circuitry. In another embodiment, the equations for a carry-save adder are modified to provide a circuit specifically designed for the determination of the condition when the sum of the operands is equal to zero. This sum-is-equal-to-zero circuit greatly reduces the gate delay and gate count, thus allowing the central processing unit to determine the condition prior to the actual sum of two operands. This allows the CPU to react to the condition more quickly, thus increasing overall system speed.

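A zero sum can be detected without propagating carries by comparing per-bit signals. The Python sketch below uses a standard identity rather than the patent's specific equations: reading "half sum" as bitwise XOR and "transmit" as bitwise OR is an assumption, as are the function name `sum_is_zero` and the default width. The sum of `a` and `b` is zero modulo 2^width exactly when every half-sum bit equals the transmit bit one position below it, with the lowest position treated as having no incoming transmit.

```python
def sum_is_zero(a: int, b: int, width: int = 32) -> bool:
    """Return True if (a + b) mod 2**width == 0, without performing the addition."""
    mask = (1 << width) - 1
    half_sum = (a ^ b) & mask      # per-bit sum ignoring carries
    transmit = (a | b) & mask      # per-bit "a carry would pass or start here"
    # Each half-sum bit must match the transmit bit of the next lower position.
    return half_sum == ((transmit << 1) & mask)
```

Every bit of the comparison depends only on two adjacent bit positions of the inputs, so the check has constant logic depth per bit instead of a carry chain across the full width.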