Abstract:
A multi-function ALU (arithmetic/logic unit) for use in digital data processing facilitates the execution of instructions in parallel, thereby enhancing processor performance. The proposed apparatus reduces the instruction execution latency that results from data dependency hazards in a pipelined machine. This latency reduction is accomplished by collapsing the interlocks due to these hazards. The proposed apparatus achieves performance improvement while maintaining compatibility with previous implementations designed using an identical architecture.
Abstract:
In a digital computer system that requires both rotation of bits in a data byte and rotation in combination with additional manipulation, a multifunction permutation switch operates in two modes. In a cyclic mode of operation, it connects the input bit lines to the output bit lines so that the sequence of input bits is maintained on the output bit lines when the bits on the input lines are considered as arranged in a circle. In a non-cyclic mode of operation, it connects the input bit lines to the output bit lines in a manner that executes gather operations and spread operations.
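The cyclic mode can be pictured as a table that says which input line each output line is wired to. The sketch below is only an illustration of that idea, not the switch described in the abstract: the function names, the list-of-bits representation (index 0 is the leftmost bit), and the choice of a left rotation are all assumptions, and the non-cyclic gather and spread modes are not modeled because the abstract does not define them.

```python
def rotate_connections(width: int, amount: int) -> list[int]:
    """For each output line, return the index of the input line it connects to.

    Hypothetical helper: models a left rotation by `amount` positions.
    """
    return [(out + amount) % width for out in range(width)]


def apply_switch(bits: list[int], connections: list[int]) -> list[int]:
    """Drive every output line from the input line it is connected to."""
    return [bits[src] for src in connections]


byte = [1, 0, 1, 1, 0, 0, 0, 0]           # 0b10110000, leftmost bit first
wiring = rotate_connections(len(byte), 3)  # rotate left by three positions
rotated = apply_switch(byte, wiring)
print(rotated)
```

Because the wiring is a permutation of the input indices, no bit is lost or duplicated, and the circular order of the bits is preserved.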
Abstract:
Generation of functional status followed by the use of the status to control the sequencing of microinstructions is a well-known critical path in processor designs. The delay associated with the path is exacerbated in superscalar machines by the additional statuses that are produced by multiple functional units, from which the appropriate status must be selected for controlling the sequencing of microinstructions. This is especially true in horizontally microcoded machines. The adverse effects on the delay can be reduced by using a staged multiplexor design. For the staged multiplexor to be useful, all functional unit status should be produced as early as possible. In this invention, a status predictor is described that allows the status associated with the shifter to be generated directly from the inputs to the shifter. As a result, the status is available early in the pipeline cycle in which the shift is actually performed and is made available to the multiplexor producing the controls for microinstruction sequencing. In addition, the invention allows the early generation of all shifter status used to set condition codes. The predictor has been implemented in an ESA/390 processor implementation, where it was instrumental in achieving the desired cycle time.
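To illustrate what generating status "directly from the inputs to the shifter" can mean, the sketch below predicts one status (result equal to zero) for one operation (a logical left shift) without shifting the data operand. The abstract does not give the predictor's logic, so the 32-bit width, the function names, and the mask-based formulation are assumptions chosen for illustration.

```python
WIDTH = 32                    # assumed operand width
MASK = (1 << WIDTH) - 1


def predict_zero_after_left_shift(value: int, amount: int) -> bool:
    """Predict whether the shifted result is zero without shifting `value`.

    Only the low WIDTH - amount bits of the input survive a left shift,
    so the result is zero exactly when those bits are all zero. The mask
    depends on the shift amount alone, not on the data.
    """
    if amount >= WIDTH:
        return True
    surviving = MASK >> amount
    return (value & surviving) == 0


def zero_after_left_shift(value: int, amount: int) -> bool:
    """Reference: perform the shift, then test the result."""
    return ((value << amount) & MASK) == 0


print(predict_zero_after_left_shift(0x80000000, 1))
```

The predictor and the reference agree, but the predictor does not wait for the shifted value, which is the property the abstract relies on to shorten the path to the sequencing controls.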
Abstract:
An apparatus is developed that implements an algorithm for generating the carries due to the second instruction of an interlocked instruction pair when executing all combinations of logical and arithmetic instruction pairs. The algorithm is then applied to three interlock collapsing ALU means implementations that have been proposed. The critical path for calculating the carries is first presented. Next, the expression for generating these carries is used to derive a fast implementation for generating overflow, which is implemented in the apparatus. The resulting ALU status determination apparatus includes a three-to-one ALU means for executing plural instructions. It can predict the status of three-to-one ALU operations related to the presence or absence of carries in a three-to-one ALU designed to execute the second instruction of a pair of instructions in parallel, whether the second instruction of the pair is independent of or dependent on the result of the operation of the first instruction. Additionally, an implementation scheme for predicting a result equal to zero is developed for the three-to-one ALU operations.
Abstract:
An apparatus for the reduction of partial products of a multiplier combines attributes of pre-addition and the regularity found in array multipliers by employing improved four-to-two composite counter cells. This composite counter cell, the basic block for reducing the partial products, is itself comprised of two new four-to-two counters. One of the four-to-two counters is used to perform pre-addition of the partial products while the second counter is used to perform addition between the sum produced by the counter performing the pre-addition and the outputs from the second counter of a cell in a previous stage of the addition. The regularity of array multiplication schemes is preserved and interconnections required by the mechanism span no more than two columns of the matrix.
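For readers unfamiliar with four-to-two counters, the sketch below shows the textbook form built from two full adders. It is not the improved counter of the abstract, whose internals are not given here; the function names and the single-bit integer representation are assumptions.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def four_to_two(a: int, b: int, c: int, d: int, cin: int) -> tuple[int, int, int]:
    """Textbook 4-to-2 counter: four inputs plus a carry-in from the neighbor column.

    Returns (sum, carry, cout). `cout` depends only on a, b, c, so carries
    do not ripple along a row of these cells.
    """
    partial, cout = full_adder(a, b, c)
    total, carry = full_adder(partial, d, cin)
    return total, carry, cout


# The outputs preserve the arithmetic weight of the inputs:
# a + b + c + d + cin == sum + 2 * (carry + cout)
for a, b, c, d, cin in product((0, 1), repeat=5):
    s, carry, cout = four_to_two(a, b, c, d, cin)
    assert a + b + c + d + cin == s + 2 * (carry + cout)
```

Each cell reduces four partial-product bits in a column to two, and both of its carry outputs go to the adjacent column only, which is the kind of local interconnection that keeps an array regular.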
Abstract:
Three high performance implementations for an interlock collapsing ALU are presented as alternative embodiments. Each embodiment reduces the critical path delay. For one of the implementations, the delay is shown to be the same number of stages as a three-to-one adder requires, assuming a commonly available bookset. The delay for the other two implementations is comparable to that of the three-to-one adder. In addition, trade-offs in design complexity among the implementation alternatives are set out. The embodiments achieve minimum delays without a prohibitive increase in hardware.
Abstract:
A mechanism is presented for detecting overflow in an interlock collapsing hardware apparatus that simultaneously executes two instructions. The overflow is determined as if the second instruction executes by itself using results from execution of the first instruction. Overflow detection is accomplished by using only values input into, and generated within, the interlock collapsing apparatus.
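The phrase "as if the second instruction executes by itself" can be made concrete with a behavioral reference model. The sketch below states only the condition that such a mechanism has to reproduce for an add-add pair; it is not the detection circuit, which the abstract does not describe. The 32-bit two's complement width and all names are assumptions.

```python
WIDTH = 32                       # assumed operand width
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def to_signed(x: int) -> int:
    """Interpret a WIDTH-bit pattern as a two's complement integer."""
    return x - (1 << WIDTH) if x & SIGN else x


def second_add_overflow(a: int, b: int, c: int) -> bool:
    """Overflow of the second instruction in the pair r1 = a + b; r2 = r1 + c.

    The second add sees the wrapped result r1 of the first add, exactly as
    it would if the two instructions had executed one after the other.
    """
    r1 = (a + b) & MASK
    exact = to_signed(r1) + to_signed(c)
    return not (-SIGN <= exact < SIGN)


# First add wraps to the most negative value; adding 1 to it does not overflow,
# even though the exact three-operand sum is out of range.
print(second_add_overflow(0x7FFFFFFF, 1, 1))
```

The last case shows why the condition is not the same as asking whether the exact sum of all three operands fits in the word.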
Abstract:
A high speed three-to-one data dependency collapsing ALU can be used to support multiple issue of instructions. Because the computing apparatus supports multiple issue of instructions, it is useful in CISC, superscalar, superscalar RISC, and similar computer designs. The concept of the ALU is presented along with a detailed description of a design. The apparatus allows the execution of any combination of two independent or dependent arithmetic or logical instructions in a single machine cycle. The 3-1 collapsing ALU structure has a 3-2 carry save adder (CSA); a 2-1 control arithmetic logic unit (CALU) coupled for an input from the carry save adder; a first pre-adder logic block coupled with an output to the control arithmetic logic unit; a control generator; and a second controlled logic block coupled to receive an input from said control generator and having its output coupled to said control arithmetic logic unit. Instructions have an add/logical combinatorial operation that covers all four combinations: add-add, add-logical, logical-add, and logical-logical. Two or more disassociated ALU operations are specified by a single interlock collapsing ALU, which responds to the parallel issuance of a plurality of separate instructions, including RISC type instructions, each of which specifies ALU operations, and the computing apparatus executes the instructions in parallel in a single machine cycle.
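The add-add case shows the collapsing idea most simply: for the dependent pair r1 = a + b; r2 = r1 + c, the second result can be formed from a, b, and c directly by a 3-2 carry save adder followed by a 2-1 adder. The sketch below covers only that case under an assumed 32-bit width; the pre-adder logic block, control generator, and logical operations of the described structure are not modeled, and the names are illustrative.

```python
WIDTH = 32                    # assumed operand width
MASK = (1 << WIDTH) - 1


def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3-2 carry save adder: reduce three operands to a sum word and a carry word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word & MASK, carry_word & MASK


def collapsed_add_add(a: int, b: int, c: int) -> int:
    """Compute (a + b) + c in one pass, without waiting for a + b."""
    sum_word, carry_word = carry_save_add(a, b, c)
    return (sum_word + carry_word) & MASK


def sequential_add_add(a: int, b: int, c: int) -> int:
    """Reference: execute the two dependent adds one after the other."""
    r1 = (a + b) & MASK
    return (r1 + c) & MASK


print(hex(collapsed_add_add(0x7FFFFFFF, 1, 5)))
```

The carry save stage has no carry propagation, so the collapsed path contains only one carry-propagating addition, just like a single two-operand add.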
Abstract:
An apparatus is presented and proved for detecting storage operand overlap for instructions having the same overlap detection requirements as the move character (MVC) instruction. The apparatus is applicable to all Enterprise Systems Architecture (ESA)/390 addressing modes, encompassing access register addressing for either 24 bit or 31 bit addressing. S/370 addressing in 24 bit and 31 bit modes is also supported by the proposed apparatus and treated as a special case of access register addressing. In addition, the apparatus is extended to support other addressing modes, with an example provided to include a 64 bit addressing mode. A fast parallel implementation of the apparatus is also presented. The apparatus results in a one cycle savings for all invocations of the MVC instruction, which makes up approximately 2% of the dynamic instruction stream of a representative instruction mix. The one cycle savings results in a 21 percent improvement in the performance of the execution of the MVC instruction for the frequent case (84%), when the operand length is less than or equal to eight bytes, and a 9 percent improvement in performance for the less frequent case (16%), in which the operand length is greater than eight bytes.
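The overlap condition can be stated compactly if MVC is taken to move bytes left to right: the move is destructive when the destination starts inside the source field but not at its first byte. The sketch below encodes that reading as an assumption, not as the apparatus of the abstract. It also assumes both operands lie in the same address space, that `length_code` is the encoded length (number of bytes minus one), and that addresses wrap at the width of the addressing mode; all names are hypothetical.

```python
def mvc_destructive_overlap(dest: int, src: int, length_code: int,
                            addr_bits: int = 31) -> bool:
    """Return True when a left-to-right move would read bytes it already overwrote.

    The distance from source to destination is taken modulo the address
    space size, so operands that wrap around the top of storage are handled.
    """
    modulus = 1 << addr_bits
    distance = (dest - src) % modulus
    return 0 < distance <= length_code


# Destination one byte past the source start, eight-byte move: destructive.
print(mvc_destructive_overlap(0x1001, 0x1000, 7))
# Source near the top of 31-bit storage, destination just after the wrap.
print(mvc_destructive_overlap(0x00000001, 0x7FFFFFFE, 7))
```

Passing `addr_bits=24` or `addr_bits=64` applies the same test to the other addressing modes the abstract mentions.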