摘要:
A partitioned shift right logic circuit that is programmable and contains rounding support. The circuit of the present invention accepts a 32-bit value and a shift amount and then performs a right shift operation on the 32-bits and automatically rounds the result(s). Signed or unsigned values can be accepted. The right shift circuit is partitioned so that the 32-bit value can represent: (1) a single 32-bit number; or (2) two 16-bit values. A 1 bit selection input indicates the particular partition format. In operation, if the input value is not negative, then one (“1”) is added at the guard bit position and a right shift with truncate is performed. If the input is negative and the guard bit is zero, then no addition is done and a right shift with truncate is performed. If the input is negative and the guard bit is one and the sticky bit is zero, then no addition is done and a right shift with truncate is performed. If the input is negative and the guard bit is one and the sticky bit is one, then one is added at the guard bit position and a right shift with truncate is performed. The shift circuitry used by the present invention is fully partitioned to accept word or half-word input and contains multiple cascaded multiplexer stages for performing partitioned right shifting and supports signed shifting. Each multiplexer stage can be programmed to perform a selected shift amount (including 0 shift). The right shift circuit of the present invention can be used in multi-media applications and can also be used for general purpose and VLIW (very long instruction word) processor without performance degradation.
摘要:
An improved Booth encoder/selector circuit having an optimized critical path. The Booth encoder has a number of inverters coupled to several of the input multiplier bits. The inverted/non-inverted multiplier bits are then fed as inputs to NAND gates as well as a series of pass gates. The outputs of the pass gates are then fed as inputs to other NAND gates. The output from the NAND gates serve as control signals for controlling the Booth selector. The Booth selector is comprised of inverters and pass gates. Multiplicand bits are input to the pass gates. The control signals generated by the Booth encoder are selectively coupled to the inverters and pass gates such that they control which one of a plurality of multiplicand bits are selected for output. Basically, the Booth selector functions as a multiplexer whereby one of the following is output: the multiplicand bit is multiplied by zero, multiplied by one, multiplied by negative one, multiplied by two, or multiplied by negative two. The Booth encoder/selector is used in a multiplier circuit to minimize the number of partial products. An adder is then used to sum all of the partial products to arrive at the final answer. In the present invention, the critical path has been optimized such that the overall speed of the multiplier is greatly improved.
摘要:
A pipelined data path architecture for use, in one embodiment, in a multimedia processor. The data path architecture requires a maximum of two execution pipestages to perform all instructions including wide data format multiply instructions and specially adapted multimedia instructions, such as the sum of absolute differences (SABD) instruction and other multiply with add (MADD) instructions. The data path architecture includes two wide data format input registers that feed four partitioned 32×32 multiplier circuits. Within two pipestages, the multiply circuit can perform one 128×128 multiply operation, four 32×32 multiply operations, eight 16×16 multiply operations or sixteen 8×8 multiply operations in parallel. The multiply circuit contains a compressor tree which generates a 256-bit sum and a 256-bit carry vector. These vectors are supplied to four 64-bit carry propagate adder circuits which generate the multiply results. When the data path architecture is performing specially adapted multimedia instructions the input registers are supplied to a pipelined logic unit containing adders, subtractors, shifters, average/round/absolute value circuits, and other logic operation circuits, compressor circuits and multiplexers. The output of the pipelined logic unit is then fed to the four 64-bit carry propagate adder circuits. In this way, the adder circuits of the multiply operation can be effectively used to also process the specially adapted multimedia instructions thereby saving IC area. Multiply circuitry is disabled to save power when the data path architecture is not processing a multiplication instruction.
摘要:
A partitioned multiplier circuit which is designed for high speed operations. The multiplier of the present invention can perform one 32×32 bit multiplication, two 16×16 bit multiplications (simultaneously) or four 8×8 bit multiplications (simultaneously) depending on input partitioning signals. The time required to perform either the 32×32 bit or the 16×16 bit or the 8×8 bit multiplications is constant. Therefore, multiplication results are available with a constant latency regardless of operand bit-size. In one embodiment, the latency is two clock cycles but the multiplier circuit has a throughput of one clock cycle due to pipelining. The input operands can be signed or unsigned. The hardware is partitioned without any significant increase in the delay or area and the multiplier can provide six different modes of operation. In one embodiment, Booth encoding is used for the generation of 17 partial products which are compressed using a compression tree into two 64-bit values. This is performed in the first pipeline stage to generate a sum and a carry vector. These values are then added, in the second pipestage, using a carry propagate adder circuit to provide a single 64-bit result. In the case of 16×16 bit multiplication, the 64-bit result contains two 32-bit results. In the case of 8×8 bit multiplication, the 64-bit result contains four 16-bit results. Due to its high operating speed, the multiplier circuit is advantageous for use in multi-media applications, such as audio/visual rendering and playback.
摘要:
Reduction of the complexity of a Viterbi-type sequence detector is disclosed. It was based on elimination of less probably taken branches in the trellis. The method is applied to the design of the E2PR4 channel with 8/9 rate sliding block trellis code. Coding, by itself eliminates two states by coding constraints, and the disclosed method reduces the number of required ACS units from 14 to 11, while reducing their complexity as well. For the implementation of E2PR4 detection, 4 4-way, 3 3-way, 3 2-way and one 1-way ACSs are needed. System simulations show no BER performance drop at common SNRs when compared with a full 16-state E2PR4 implementation in magnetic disk drives.
摘要:
Methods and apparatus for an encryption processor for performing accelerated computations to establish secure network sessions. The encryption processor includes an execution unit and a decode unit. The execution unit is configured to execute Montgomery operations and including at least one adder and at least two multipliers. The decode unit is configured to determine if a square operation or a product operation needs to be performed and to issue the appropriate instructions so that certain multiply and/or addition operations are performed in parallel in the execution unit while performing either the Montgomery square or Montgomery product operation.
摘要:
The present invention relates to a method of context switching from a first task to a second task in a data processing unit having a register file with a plurality of general purpose registers and a context switch register, a memory comprising a previous context save area and an unused context save area. The memory is coupled with the register file and an instruction control unit with a program counter register and a program status word register coupled with the memory and the register file. The method comprises the steps of acquiring a new save area from said unused save area, storing the context of the first task in said new area, linking the new area with said previous context save area.
摘要:
An improved SR latch has a two stages. A generation block generates Q and {overscore (Q)} signals from a set signal and a reset signal. The generation block also has an inactive state. A storage block receives the Q and {overscore (Q)} signals and maintains the Q signal and {overscore (Q)} signals at the voltage level that was output by the generation block prior to when the generation block blocks becomes inactive. In another embodiment, an improved D flip-flop has a sensing block with the improved SR latch of the present invention.
摘要:
A floating point instruction control mechanism which processes loads and stores in parallel with arithmetic instructions. This results from register renaming, which removes output dependencies in the instruction control mechanism and allows computations aliased to the same register to proceed in parallel.
摘要:
A register selection mechanism for an instruction prefetch buffer which allows instructions having different lengths to be accessed on the instruction boundaries. The instruction prefetch buffer comprises a one-port-write, two-port-read array (10). Address generation and control logic (16) is responsive to a read pointer (15) for controlling access to odd and oven addresses in the array. Additional logic may be provided to provide an indication that the instruction prefetch buffer is empty.