Abstract:
In a digital processor performing division, a quotient accumulation apparatus is formed of a set of multiplexers and a single carry-save adder. Partial quotients are accumulated in carry-save form with proper sign extension. Delaying partial quotient bit fragments from one iteration to the following iteration lets the apparatus use only one carry-save adder. Using minimal logic, the quotient accumulation apparatus operates at a rate fast enough to keep up with fast dividers.
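As a rough illustration of accumulating sign-extended partial quotients in carry-save form, the sketch below models one carry-save adder (a 3:2 compressor) in Python. The function names, the 16-bit width, and the sample values are assumptions made for illustration; the delaying of bit fragments between iterations described in the abstract is not modeled.

```python
def carry_save_add(a, b, c, width):
    """3:2 compressor: reduce three addends to a (sum, carry) pair.

    No carry propagates across bit positions; each output bit depends
    only on the three input bits at the same position.
    """
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask
    cy = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return s, cy


def accumulate(partials, width=16):
    """Accumulate signed partial quotients in carry-save form."""
    mask = (1 << width) - 1
    s, cy = 0, 0
    for p in partials:
        # p & mask sign-extends a negative value to `width` bits
        # (two's complement), as the accumulator requires.
        s, cy = carry_save_add(s, cy, p & mask, width)
    return s, cy


def resolve(s, cy, width=16):
    """Final carry-propagate add, interpreted as a signed value."""
    v = (s + cy) & ((1 << width) - 1)
    return v - (1 << width) if v >> (width - 1) else v


s, cy = accumulate([5, -3, 12, -7])
print(resolve(s, cy))
```

Only the final `resolve` step needs a carry-propagate addition; every iteration before it costs one level of carry-save logic.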
Abstract:
A semiconductor processor is described. The semiconductor processor includes logic circuitry to perform a logical reduction instruction. The logic circuitry has swizzle circuitry to swizzle a vector's elements so as to form a swizzle vector. The logic circuitry also has vector logic circuitry to perform a vector logic operation on said vector and said swizzle vector.
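One way to picture a logical reduction built from swizzles and vector logic operations is the log-step sketch below. The rotate-style swizzle pattern, the power-of-two length restriction, and the function names are assumptions for illustration, not details taken from the abstract.

```python
from operator import and_, or_, xor


def swizzle(vec, shift):
    """Rotate elements by `shift` positions (one possible swizzle pattern)."""
    n = len(vec)
    return [vec[(i + shift) % n] for i in range(n)]


def logical_reduce(vec, op):
    """Reduce a vector with a commutative, associative bitwise op.

    Each step combines the vector with a swizzled copy of itself, so
    log2(n) vector operations are needed for n elements.
    """
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    shift = n // 2
    while shift:
        vec = [op(a, b) for a, b in zip(vec, swizzle(vec, shift))]
        shift //= 2
    return vec[0]  # every element now holds the full reduction


v = [0b1100, 0b1010, 0b1111, 0b1001]
print(logical_reduce(v, and_), logical_reduce(v, or_), logical_reduce(v, xor))
```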
Abstract:
Methods, apparatus, instructions and logic are disclosed providing double rounded combined floating-point multiply and add functionality as scalar or vector SIMD instructions or as fused micro-operations. Embodiments include detecting floating-point (FP) multiplication operations and subsequent FP operations specifying as source operands results of the FP multiplications. The FP multiplications and the subsequent FP operations are encoded as combined FP operations including rounding of the results of FP multiplication followed by the subsequent FP operations. The encoding of said combined FP operations may be stored and executed as part of an executable thread portion using fused-multiply-add hardware that includes overflow detection for the product of FP multipliers, first and second FP adders to add third operand addend mantissas and the products of the FP multipliers with different rounding inputs based on overflow, or no overflow, in the products of the FP multiplier. Final results are selected respectively using overflow detection.
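The reason the product must be rounded before the addition can be seen numerically: a multiply followed by an add rounds twice, while a conventional fused multiply-add rounds once, and the two can disagree. The sketch below contrasts the two behaviors using Python doubles and exact rational arithmetic; the function names and operand values are illustrative assumptions, and the code does not model the overflow-detection hardware the abstract describes.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Two roundings: the product is rounded to a double, then the sum is."""
    product = a * b
    return product + c


def single_rounded_fma(a, b, c):
    """One rounding: exact rational arithmetic, rounded once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double.
print(mul_then_add(a, b, c))        # double-rounded result
print(single_rounded_fma(a, b, c))  # single-rounded result
```

Combined operations that must match the original multiply-then-add sequence therefore need the intermediate rounding step; a plain single-rounded fused operation would change the program's results.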
Abstract:
A computer method and apparatus for performing a square root or division operation that generates a root or quotient are presented. A partial remainder is stored in radix-2 or radix-4 signed digit format. A decoder is provided for computing a root or quotient digit and a correction term, both dependent on a number of the most significant digits of the partial remainder. An adder is provided for computing the sum of the signed digit partial remainder and the correction term in binary format, and providing the result in signed digit format. The adder computes a carry-out independent of the carry-in bit and a sum dependent on the carry-in bit, yielding a fast adder free of carry-propagation delays. A scaler multiplies the signed digit result output from the adder by two to provide the signed digit next partial remainder.
Abstract:
A mechanism is described for facilitating dynamic and efficient fusion of computing instructions according to one embodiment. A method of embodiments, as described herein, includes monitoring a software program for a program region having fusion candidate instructions for a fusion operation at a computing system; evaluating whether the macro operation of the candidate instructions is valuable to the software program; and performing the fusion operation if it is evaluated to be valuable.
Abstract:
A method of performing vector operations on a semiconductor chip is described. The method includes performing a first vector instruction with a vector functional unit implemented on the semiconductor chip and performing a second vector instruction with the vector functional unit. The first vector instruction is a vector multiply add instruction. The second vector instruction is a vector leading zeros count instruction.
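The two instructions named above can be sketched element-wise in Python. This models only the arithmetic, not the sharing of one functional unit between the two operations; the function names and the 32-bit element width are assumptions.

```python
def vector_multiply_add(a, b, c):
    """Element-wise a[i] * b[i] + c[i]."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def vector_leading_zeros(v, width=32):
    """Element-wise count of leading zero bits in a `width`-bit element."""
    mask = (1 << width) - 1
    return [width - (x & mask).bit_length() for x in v]


print(vector_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))
print(vector_leading_zeros([0, 1, 255, 2 ** 31]))
```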
Abstract:
A vector friendly instruction format and execution thereof. According to one embodiment of the invention, a processor is configured to execute an instruction set. The instruction set includes a vector friendly instruction format. The vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field, wherein the first instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field, and wherein only one of the different values may be placed in each of the base operation field, the modifier field, the alpha field, the beta field, and the data element width field on each occurrence of an instruction in the first instruction format in instruction streams.
Abstract:
Embodiments of systems, apparatuses, and methods for performing a blend instruction in a computer processor are described. In some embodiments, the execution of a blend instruction causes a data element-by-element selection of data elements of first and second source operands using the corresponding bit positions of a writemask as a selector between the first and second operands and storage of the selected data elements into the destination at the corresponding position in the destination.
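A minimal sketch of the element-by-element selection follows. The polarity (a set mask bit selects the second source) and the function name are assumptions; the abstract says only that the writemask bit acts as the selector.

```python
def blend(src1, src2, writemask):
    """Select per element: mask bit i set -> src2[i], clear -> src1[i].

    The polarity is an assumption made for this sketch.
    """
    return [
        b if (writemask >> i) & 1 else a
        for i, (a, b) in enumerate(zip(src1, src2))
    ]


dest = blend([10, 11, 12, 13], [20, 21, 22, 23], 0b0101)
print(dest)
```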
Abstract:
The specification discloses a structure of and a method of operating a subtractive division (SD) cell where a portion of the partial remainder or estimated partial remainder directly indicates the next quotient digit. More particularly, by sufficiently constraining the prescaled range for each possible divisor, only a few bits of the partial remainder (the exact number dependent upon the radix), along with their related carries (if any), directly indicate the value of the next quotient digit. Because fewer bits of the partial remainder are needed to make this determination than needed in related art devices, and further because no look-up table or hard-coded decision tree is required, calculation time within each SD cell is shorter than related art devices. Having a shorter calculation time within each SD cell allows for either completion of a greater number of SD cells within each clock cycle, or completion of the calculation to full precision in less time.
Abstract:
A floating point multiply of two n-bit operands creates a 2n-bit result, but ordinarily only n-bit precision is needed, so rounding is performed. Some rounding algorithms require knowing whether any "1" is present in the n-2 low-order bits of the 2n-bit result. The presence of such a "1" indicates that the so-called "sticky bit" is set. The sticky bit is calculated in a path separate from the multiply operation, so the n-2 least significant sums need not be calculated. This saves time and circuitry in an array multiplier, for example. In an example method, the difference between n and the number of trailing zeros, "x", in one of the n-bit operands is detected by transposing the operand and detecting the leading one. The other operand is right-shifted by a number of bit positions equal to this difference. A sticky bit is generated if any logic "1's" are in the low-order n-x-2 bits right-shifted out of the second operand.
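The idea rests on the fact that the number of trailing zeros in a product of nonzero integers is the sum of the operands' trailing-zero counts, so the low n-2 bits of the product can be tested without computing them. The sketch below checks that equivalence exhaustively for a small n. It finds trailing zeros with the `v & -v` trick rather than by transposing the operand, tests the low n-x-2 bits of the second operand directly instead of modeling the shifter, and uses hypothetical function names.

```python
def trailing_zeros(v, n):
    """Trailing zero count of an n-bit value (n if the value is zero)."""
    if v == 0:
        return n
    return (v & -v).bit_length() - 1


def sticky_direct(a, b, n):
    """Reference: form the full product and inspect its low n-2 bits."""
    low_mask = (1 << (n - 2)) - 1
    return ((a * b) & low_mask) != 0


def sticky_separate(a, b, n):
    """Sticky bit from the operands alone, without forming the product."""
    x = trailing_zeros(a, n)
    k = n - x - 2  # how many low-order bits of b can affect the sticky bit
    if k <= 0:
        return False
    return (b & ((1 << k) - 1)) != 0


n = 6
agree = all(
    sticky_direct(a, b, n) == sticky_separate(a, b, n)
    for a in range(1 << n)
    for b in range(1 << n)
)
print(agree)
```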