Abstract:
The present invention is generally related to the field of image processing, and more specifically to vector units for supporting image processing. A dual vector unit implementation is described wherein two vector units are configured receive data from a common register file. The vector units may independently and simultaneously process instructions. Furthermore, the vector units may be adapted to perform scalar operations thereby integrating the vector and scalar processing. The vector units may also be configured to share resources to perform an operation, for example, a cross product operation.
Abstract:
A method and apparatus are provided for implementing a power of two estimation function in a general purpose floating-point processor. A floating point number is stored within a memory. The floating point number includes a sign bit, a plurality of exponent bits, and a mantissa having an implied bit and a plurality of fraction bits. In response to a floating-point instruction, the mantissa is partitioned into an integer part and a fraction part, based on the exponent bits. A floating-point result is provided by assigning the integer part of the floating point number as an unbiased exponent of the floating-point result, and by utilizing combinational logic hardware for converting the fraction part of the floating point number to a fraction part of the floating point result.
Abstract:
A circuit arrangement and method couple a hardware-based pseudorandom number generator (PRNG) to an execution unit in such a manner that pseudorandom numbers generated by the PRNG may be selectively output to the execution unit for use as an operand during the execution of instructions by the execution unit. A PRNG may be coupled to an input of an operand multiplexer that outputs to an operand input of an execution unit so that operands provided by instructions supplied to the execution unit are selectively overridden with pseudorandom numbers generated by the PRNG. Furthermore, overridden operands provided by instructions supplied to the execution unit may be used as seed values for the PRNG.
Abstract:
A circuit arrangement and method utilize texture data prefetching to prefetch texture data used by an anisotropic filtering algorithm. In particular, stride-based prefetching may be used to prefetch texture data for use in anisotropic filtering, where the value of the stride, or difference between successive accesses, is based upon a distance in a memory address space between sample points taken along the line of anisotropy used in an anisotropic filtering algorithm.
Abstract:
A pipelined execution unit incorporates one or more low power modes that reduce power consumption by dynamically merging pipeline stages in an execution pipeline together with one another. In particular, the execution logic in successive pipeline stages in an execution pipeline may be dynamically merged together by setting one or more latches that are intermediate to such pipeline stages to a transparent state such that the output of the pipeline stage preceding such latches is passed to the subsequent pipeline stage during the same clock cycle so that both such pipeline stages effectively perform steps for the same instruction during each clock cycle. Then, with the selected pipeline stages merged, the power consumption of the execution pipeline can be reduced (e.g., by reducing the clock frequency and/or operating voltage of the execution pipeline), often with minimal adverse impact on performance.
Abstract:
A method, computer-readable medium, and an apparatus for generating a transcendental value. The method includes receiving an input containing an input value and an opcode and determining whether the opcode corresponds to a trigonometric operation or a power-of-two operation. The method also includes calculating a fractional value and an integer value from the input value, generating the transcendental value based on the fractional value by adding at least a portion of the fractional value with at least one of a shifted fractional value produced by shifting the portion of the fractional value and a constant value, and providing the transcendental value in response to the request. In this fashion, the same circuit area may be used to carry out both trigonometric and power-of-two calculations, leading to greater circuit area savings and performance advantages while not sacrificing significant accuracy.
Abstract:
A computer-implemented method includes initializing a driver associated with an input/output adapter in response to receiving an initialize driver request from a client application. The computer-implemented method includes initializing the input/output adapter to enable adapter capabilities of the input/output adapter to be determined. The computer-implemented method also includes determining the adapter capabilities of the input/output adapter. The computer-implemented method further includes determining slot capabilities of a slot associated with the input/output adapter. The computer-implemented method also includes setting configurable capabilities of the input/output adapter based on the adapter capabilities and the slot capabilities.
Abstract:
A circuit arrangement and method utilize a plurality of execution units having different power and performance characteristics and capabilities within a multithreaded processor core, and selectively route instructions having different performance requirements to different execution units based upon those performance requirements. As such, instructions that have high performance requirements, such as instructions associated with primary tasks or time sensitive tasks, can be routed to a higher performance execution unit to maximize performance when executing those instructions, while instructions that have low performance requirements, such as instructions associated with background tasks or non-time sensitive tasks, can be routed to a reduced power execution unit to reduce the power consumption (and associated heat generation) associated with executing those instructions.
Abstract:
A processing unit includes multiple execution units and sequencer logic that is disposed downstream of instruction buffer logic, and that is responsive to a sequencer instruction present in an instruction stream. In response to such an instruction, the sequencer logic issues a plurality of instructions associated with a long latency operation to one execution unit, while blocking instructions from the instruction buffer logic from being issued to that execution unit. In addition, the blocking of instructions from being issued to the execution unit does not affect the issuance of instructions to any other execution unit, and as such, other instructions from the instruction buffer logic are still capable of being issued to and executed by other execution units even while the sequencer logic is issuing the plurality of instructions associated with the long latency operation.
Abstract:
A software-accessible special purpose register is architected into a processing unit in order to implement persistent vector multiplexer control of a vector-based execution unit. A persistent swizzle instruction is defined in an instruction set for the vector-based execution unit and is used to cause state information to be stored in the special purpose register such that the operand vectors processed by subsequent vector instructions executed by the vector-based execution unit will be selectively shuffled using the persisted state information. As a result, when multiple vector instructions require a common custom word ordering for one or more operand vectors, a single persistent swizzle instruction may be used to select the desired custom word ordering for all of the vector instructions.