摘要:
An apparatus and method for handling tiny numbers using a super sticky bit are provided. In response to detecting that a preliminary result of an instruction corresponds to a tiny number and an underflow exception is masked, an execution pipeline can be configured to store a value corresponding to the preliminary result and a super sticky bit in a destination register. Also, a destination register tag corresponding to the destination register and a denormal exception indicator corresponding to the tiny number and masked underflow exception can be stored. A trap handler can be initiated to generate a corrected result for the instruction. The trap handler can detect that the denormal exception indicator has been set and can read the value and the super sticky bit from the destination register using the destination register tag. The trap handler can generate a corrected result for the instruction based on the value and the super sticky bit. An instruction subsequent to the trapping instruction can then be restarted.
摘要:
A multiplier configured to obtain higher frequencies of exactly rounded results by adding an adjustment constant to intermediate products generated during iterative multiplication operations is disclosed. One such iterative multiplication operation is the Newton-Raphson iteration, which may be utilized by the multiplier to perform reciprocal calculations and reciprocal square root calculations. For each iteration, the results converge toward an infinitely precise result. To improve the frequency of the exactly rounded result, the results of the iterative calculations may be studied for a large number of differing input operands to determine the best suited value for the adjustment constant. The multiplier may also be configured to perform scalar and packed vector multiplication using the same hardware.
摘要:
An execution unit configured to execute vectored floating point and integer instructions. The execution unit may include an add/subtract pipeline having far and close data paths. The far data path is configured to handle effective addition operations, as well as effective subtraction operations for operands having an absolute exponent difference greater than one. The close data path is configured to handle effective subtraction operations for operands having an absolute exponent difference less than or equal to one. The close data path includes an adder unit configured to generate a first and second output value. The first output value is equal to the first input operand plus an inverted version of the second input operand, while the second output value is equal to the first output value plus one. The two output values are conveyed to a multiplexer unit, which selects one of the output values as a preliminary subtraction result based on a final selection signal received from a selection unit. The selection unit generates the final selection signal from a plurality of preliminary selection signals based on the carry in signal to the most significant bit of the first adder output value. Selection of the first or second output value in the close data path effectuates the round-to-nearest operation.
摘要:
An optimized multimedia execution unit configured to perform vectored floating point and integer instructions. In one embodiment, the execution unit includes an add/subtract pipeline having far and close data paths. The far data path is configured to handle effective addition operations, as well as effective subtraction operations for operands having an absolute exponent difference greater than one. The close data path, conversely, is configured to handle effective subtraction operations for operands having an absolute exponent difference less than or equal to one. The execution unit may also include a plurality of add/subtract pipelines, allowing vectored add, subtract, and integer/floating point conversion instructions to be performed. The execution unit may also be expanded to handle additional arithmetic instructions (such as reverse subtract and accumulate functions) by appropriate input multiplexing. The execution unit may also be configured with a leading one prediction unit that is configured to predict the position of a leading one value for certain results in order to improve normalization times.
摘要:
An execution unit configured to perform a plurality of arithmetic operations using the same set of operands. These operands include corresponding input vector values in each of a plurality of input registers. The execution unit is coupled to receive these input vector values, as well as an instruction value indicative of one of the plurality of arithmetic operations. In one embodiment, the plurality of arithmetic operations includes a vectored add instruction, a vectored subtract instruction, a vectored reverse subtract instruction, and an accumulate instruction. The vectored instructions perform arithmetic operations concurrently using corresponding values from each of the plurality of input registers. The accumulate instruction, however, is executable to add together all input values within a single input register. The execution unit further includes a multiplexer unit configured to selectively route the input vector values to a plurality of adder units according to the opcode value. In an embodiment in which the execution unit is configured to perform subtraction operations as well as addition, the multiplexer unit is additionally configured to selectively route negated versions (either one's or two's complement format) to the plurality of adder units. Each of the plurality of adder units is configured to generate a sum based upon the values conveyed from the multiplexer unit. The accumulate instruction advantageously allows important operations such as the matrix multiply to be performed rapidly. Because the matrix multiply is an integral part of many applications (particularly graphics applications), the accumulate instruction may lead to increased overall system performance.
摘要:
Multipurpose arithmetic functional units can perform planar attribute interpolation and unary function approximation operations. In one embodiment, planar interpolation operations for coordinates (x, y) are executed by computing A*x+B*y+C, and unary function approximation operations for operand x are executed by computing F2(xb)*xh2+F1(xb)*xh+F0(xb), where xh=x−xb. Shared multiplier and adder circuits are advantageously used to implement the product and sum operations for both classes of operations.
摘要翻译:多用途算术功能单元可以执行平面属性插值和一元函数近似运算。 在一个实施例中,通过计算A * x + B * y + C来执行坐标(x,y)的平面内插操作,并且通过计算F2(xb)* xh2 + F1(xb)来执行操作数x的一元函数近似运算 )* xh + F0(xb),其中xh = x-xb。 共享乘法器和加法器电路有利地用于实现两类操作的乘积和求和运算。
摘要:
A multipurpose functional unit is configurable to support a number of operations including multiply-add and format conversion operations, as well as other integer and/or floating-point arithmetic operations, Boolean operations, and logical test operations.
摘要:
A system and method for late-dropping packets in a network switch. A network switch may include multiple input ports, multiple output ports, and a shared random access memory coupled to the input ports and output ports by data transport logic. Packets entering the switch may be subject to input thresholding, and may be assigned to a flow within a group. A portion of a packet subject to input thresholding may be accepted into the switch and assigned to a group and flow even if, at the time of arrival of the portion, there are not enough resources available to receive the remainder of the packet. This partial receipt of the packet is allowed because of the possibility of additional resources becoming available between the time of receipt of and resource allocation for the portion of the packet and receipt of subsequent portions of the packet.
摘要:
Methods and apparatuses are presented for determining coefficients for a polynomial-based approximation of a function, by iteratively estimating a first coefficient, reducing the first coefficient to a lower precision to obtain a first limited-precision coefficient, analytically calculating a second coefficient by taking into account the first limited-precision coefficient, reducing the second coefficient to a lower precision to obtain a second limited-precision coefficient, iteratively estimating a third coefficient by taking into account at least one of the first limited-precision coefficient and the second limited-precision coefficient, and reducing the third coefficient to a lower precision to obtain a third limited-precision coefficient. In one embodiment of the invention, the polynomial-based approximation relates to a minimax approximation of the function approximated, and at least one of the steps for iteratively estimating the first coefficient and iteratively estimating the third coefficient involves use of a Remez exchange algorithm.
摘要:
A microprocessor with a floating point unit configured to efficiently allocate multi-pipeline executable instructions is disclosed. Multi-pipeline executable instructions are instructions that are not forced to execute in a particular type of execution pipe. For example, junk ops are multi-pipeline executable. A junk op is an instruction that is executed at an early stage of the floating point unit's pipeline (e.g., during register rename), but still passes through an execution pipeline for exception checking. Junk ops are not limited to a particular execution pipeline, but instead may pass through any of the microprocessor's execution pipelines in the floating point unit. Multi-pipeline executable instructions are allocated on a per-clock cycle basis using a number of different criteria. For example, the allocation may vary depending upon the number of multi-pipeline executable instructions received by the floating point unit in a single clock cycle.