摘要:
A vector functional unit implemented on a semiconductor chip to perform vector operations of dimension N is described. The vector functional unit includes N functional units. Each of the N functional units have logic circuitry to perform: a first integer multiply add instruction that presents highest ordered bits but not lowest ordered bits of a first integer multiply add calculation, and, a second integer multiply add instruction that presents lowest ordered bits but not highest ordered bits of a second integer multiply add calculation.
摘要:
A method and apparatus for operating on floating point numbers is provided that accepts two floating point numbers as operands in order to perform addition, a rounding adder circuit is provided which can accept the operands and a rounding increment bit at various bit positions. The circuit uses full adders at required bit positions to accommodate a bit from each operand and the rounding bit. Since the proper position in which the rounding bit should be injected into the addition may be unknown at the start, respective low and high increment bit addition circuits are provided to compute a result for both a low and a high increment rounding bit condition. The final result is selected based upon the most significant bit of the low rounding bit increment result. In this manner, the present rounding adder circuit eliminates the need to perform a no increment calculation used to select a result, as in the prior art. Through the use of full adders, the circuit not only accounts for the round increment bit, but can accept increment bits at any bit position to perform operations such as two's complement, thus further reducing the operations required to perform a desired floating point mathematical operation.
摘要:
In a floating point arithmetic execution unit, an additional adder unit and a selection network are added to the apparatus typically performing the arithmetic floating point function. The additional apparatus permits certain processes forming part of arithmetic operations to be executed in parallel. For selected arithmetic operations, the final result can be one of two values typically related by an intermediate shifting operation. By performing the processes in parallel and selecting the appropriate result, the execution time can be reduced when compared to the execution of the process in a serial implementation. The fundamental arithmetic operations of addition, subtraction, multiplication and division can each have the execution time decreased using the disclosed additional apparatus.
摘要:
A set of threads of an application are identified to be executed on a platform, where the platform comprises a multi-node architecture. A set of queues of an I/O device of the platform are reserved and associated with one of a plurality of nodes in the multi-node architecture. Data is received at the I/O device, where the I/O device is included in a particular one of the plurality of nodes. Response data is generated through execution of a thread in the set of threads using a processing core and memory of the particular node, and the response data is caused to be sent on the I/O device based on inclusion of the I/O device in the particular node.
摘要:
A mechanism is described for facilitating dynamic and efficient fusion of computing instructions according to one embodiment. A method of embodiments, as described herein, includes monitoring a software program for a program region having fusion candidate instructions for a fusion operation at a computing system; evaluating whether the macro operation of the candidate instructions is valuable to the software program; and performing the fusion operation if it is evaluated to be valuable.
摘要:
Vector blend and permute functionality are provided, responsive to instructions specifying: a destination vector register comprising fields to store vector elements, a first vector register, a vector element size, a second vector register, and a third operand. Indices are read from fields in the second register. Each index has a first selector portion and a second selector portion. Corresponding unmasked vector elements are stored to fields of the destination register, wherein each vector element, responsive to the respective first selector portion having a first value, is copied to an intermediate vector from a corresponding data field of the first register, and responsive to the respective first selector portion having a second value, is copied to the intermediate vector from a corresponding data field of the third operand. Then unmasked data fields of the destination are replaced by data fields in the intermediate vector indexed by the corresponding second selector portions.
摘要:
A vector functional unit implemented on a semiconductor chip to perform vector operations of dimension N is described. The vector functional unit includes N functional units. Each of the N functional units have logic circuitry to perform: a first integer multiply add instruction that presents highest ordered bits but not lowest ordered bits of a first integer multiply add calculation, and, a second integer multiply add instruction that presents lowest ordered bits but not highest ordered bits of a second integer multiply add calculation.
摘要:
Disclosed is an apparatus and method generally related to controlling a multimedia extension control and status register (MXCSR). A processor core may include a floating point unit (FPU) to perform arithmetic functions; and a multimedia extension control register (MXCR) to provide control bits to the FPU. Further an optimizer may be used to select a speculative multimedia extension status register (SPEC_MXSR) from a plurality of SPEC_MXSRs to update a multimedia extension status register (MXSR) based upon an instruction.
摘要:
In a digital processor performing division, quotient accumulation apparatus is formed of a set of muxes and a single carry save adder. Partial quotients are accumulated in carry-save form with proper sign extension. Delay of partial quotient bit fragments from one iteration to a following iteration enables the apparatus to limit use to one carry save adder. By enlarging minimal logic, the quotient accumulation apparatus operates at a rate fast enough to support the rate of fast dividers.
摘要:
A circuit is provided for using the input operands of a floating point addition or subtraction operation to detect the leading one or zero bit position in parallel with the arithmetic operation. This allows the alignment to be performed on the available result in the next cycle of the floating point operation and results in a significant performance advantage. The leading I/O detection is decoupled from the adder that is computing the result in parallel, eliminating the need for special circuitry to compute a carry-dependent adjustment signal. The single-bit fraction overflow that can result from leading I/O misprediction is corrected with existing circuitry during a later stage of computation.