Abstract:
A multirate execution unit is capable of being operated in a plurality of modes, with the execution unit capable of being clocked at multiple different rates relative to a multithreaded issue unit. In applications where maximum performance is desired, the execution unit can be clocked at a rate that is faster than the clock rate of the multithreaded issue unit; in applications where a lower power profile is desired, the execution unit can be throttled back to a slower rate to reduce its power consumption. When the execution unit is clocked at a faster rate than the multithreaded issue unit, the issue unit is permitted to issue more instructions per cycle than when the execution unit is throttled to the slower rate, which increases overall instruction throughput.
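The link between the clock ratio and the permitted issue width can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the `base_width` parameter, and the idea that issue width scales linearly with an integer clock ratio are hypothetical and are not taken from the abstract.

```python
def issue_width(exec_clock_ratio, base_width=1):
    """Instructions the issue unit may issue per issue-unit cycle.

    exec_clock_ratio is the execution-unit clock rate divided by the
    issue-unit clock rate (hypothetical integer parameter). Linear
    scaling is an assumption made for illustration only.
    """
    if exec_clock_ratio < 1:
        raise ValueError("exec_clock_ratio must be at least 1")
    return base_width * exec_clock_ratio


def instructions_issued(issue_cycles, exec_clock_ratio):
    """Total instructions issued over a number of issue-unit cycles."""
    return issue_cycles * issue_width(exec_clock_ratio)


# Performance mode (assumed 2x execution clock) versus throttled mode (1x).
fast = instructions_issued(100, exec_clock_ratio=2)
slow = instructions_issued(100, exec_clock_ratio=1)
```

In this model the throttled mode trades half of the instruction throughput for the lower execution-unit clock rate.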
Abstract:
The present invention is generally related to integrated circuit devices, and more particularly to methods, systems and design structures for the field of image processing, and more specifically to vector units for supporting image processing. A dual vector unit implementation is described wherein two vector units are configured to receive data from a common register file. The vector units may independently and simultaneously process instructions. Furthermore, the vector units may be adapted to perform scalar operations, thereby integrating the vector and scalar processing. The vector units may also be configured to share resources to perform an operation, for example, a cross product operation.
Abstract:
Embodiments of the invention provide methods and apparatus for reallocating workload related to traversal of a ray through a spatial index. In a first operating state a workload manager may be experiencing a first or a normal workload. In the first operating state the workload manager may be responsible for traversing the entire spatial index and a vector throughput engine may be responsible for performing ray-primitive intersection tests. In an increased workload state the workload manager may experience an increased workload. In response to the increased workload the image processing system may partition the spatial index such that the workload manager may be responsible for traversing a first portion of the spatial index and the vector throughput engine may be responsible for traversing a second portion of the spatial index and for performing ray-primitive intersection tests.
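The two operating states can be sketched as a choice of the depth at which traversal is handed from the workload manager to the vector throughput engine. The sketch below is illustrative only: the queue-length threshold, the halving policy, and all function names are assumptions, since the abstract does not say how the increased workload is detected or where the spatial index is partitioned.

```python
def choose_split_depth(levels, queued_rays, threshold):
    """Return the first tree level owned by the vector throughput engine.

    levels:      number of levels in the spatial index (depths 0..levels-1)
    queued_rays: hypothetical measure of the workload manager's load
    threshold:   hypothetical load above which the index is partitioned
    """
    if queued_rays <= threshold:
        # Normal state: the workload manager traverses the entire index.
        return levels
    # Increased workload: hand the lower half of the index to the vector
    # throughput engine (halving is an arbitrary illustrative policy).
    return levels // 2


def traversal_owner(node_depth, split_depth):
    """Name the unit that traverses a node at the given depth."""
    if node_depth < split_depth:
        return "workload_manager"
    return "vector_throughput_engine"
```

In either state the vector throughput engine still performs the ray-primitive intersection tests; only the ownership of traversal changes.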
Abstract:
Persistent vector multiplexer control is used in a vector-based execution unit to control the shuffling of words in operand vectors processed by the execution unit. In addition, a persistent swizzle instruction is defined in an instruction set for the vector-based execution unit and is used to cause state information to be persisted such that the operand vectors processed by subsequent vector instructions executed by the vector-based execution unit will be selectively shuffled using the persisted state information. As a result, when multiple vector instructions require a common custom word ordering for one or more operand vectors, a single persistent swizzle instruction may be used to select the desired custom word ordering for all of the vector instructions.
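The effect of a persistent swizzle can be sketched with a small software model. This is a minimal sketch, assuming four-word vectors and assuming the persisted ordering applies to the second operand only; the class and method names (`VectorUnit`, `set_swizzle`, `vadd`, `vmul`) are hypothetical and do not come from the abstract.

```python
class VectorUnit:
    """Toy model of a 4-word vector execution unit with persisted swizzle state."""

    def __init__(self):
        # Persisted multiplexer control; identity ordering by default.
        self.swizzle = (0, 1, 2, 3)

    def set_swizzle(self, order):
        """Model of the persistent swizzle instruction: store the word ordering."""
        order = tuple(order)
        if len(order) != 4 or any(i not in (0, 1, 2, 3) for i in order):
            raise ValueError("order must hold four word indices in 0..3")
        self.swizzle = order

    def _shuffle(self, vec):
        # Output word i is taken from input word swizzle[i].
        return [vec[i] for i in self.swizzle]

    def vadd(self, a, b):
        # Assumption: the persisted ordering is applied to operand b only.
        return [x + y for x, y in zip(a, self._shuffle(b))]

    def vmul(self, a, b):
        return [x * y for x, y in zip(a, self._shuffle(b))]


vu = VectorUnit()
vu.set_swizzle((1, 2, 0, 3))          # issued once
s = vu.vadd([0, 0, 0, 0], [10, 20, 30, 40])
p = vu.vmul([1, 1, 1, 1], [10, 20, 30, 40])
```

Both `vadd` and `vmul` see the same custom ordering without a separate shuffle before each instruction, which is the saving the abstract describes.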
Abstract:
Embodiments of the invention provide methods and apparatus for executing a multiple operand instruction. Executing the multiple operand instruction comprises computing an arithmetic result of a pair of operands in each processing lane of a vector unit. The arithmetic results generated in each processing lane of the vector unit may be transferred to a dot product unit. The dot product unit may compute an arithmetic result using the arithmetic result computed by each processing lane of the vector unit to generate an arithmetic result of more than two operands.
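The two-stage data flow can be sketched as follows. This is an illustration under assumptions: the per-lane operation is taken to be multiplication and the dot product unit is taken to sum the lane results, which is one plausible instance of the scheme; the function names are hypothetical.

```python
def lane_results(a, b):
    """Stage 1: each processing lane combines its own pair of operands."""
    return [x * y for x, y in zip(a, b)]


def dot_product_unit(partials):
    """Stage 2: combine the per-lane results into a single value."""
    total = 0
    for p in partials:
        total += p
    return total


def multi_operand(a, b):
    """Result that depends on more than two operands (eight for 4-lane inputs)."""
    return dot_product_unit(lane_results(a, b))


result = multi_operand([1, 2, 3, 4], [5, 6, 7, 8])  # 5 + 12 + 21 + 32
```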
Abstract:
The present invention is generally related to the field of image processing, and more specifically to an instruction set for processing images. Vector processing may involve performing a plurality of permute operations to arrange vector operands in desired locations of a register prior to performing a vector operation, for example, a cross product. The permute instructions may be dependent on one another and may require the use of temporary registers. Embodiments of the invention provide a permute instruction wherein a mask field may be used to specify a particular location of a target register in which to transfer data, thereby reducing the number of instructions for arranging data, reducing dependencies between instructions, and reducing the usage of temporary registers.
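A masked permute can be modeled in a few lines. This is a sketch only: the encoding of the mask as one write-enable flag per word, the `select` list of source indices, and the function name `masked_permute` are assumptions made for illustration, not the instruction format of the invention.

```python
def masked_permute(target, source, select, mask):
    """Write selected source words into the masked positions of target.

    select[i] is the index of the source word destined for position i;
    mask[i] is 1 if position i of the target is to be written, else 0.
    Positions with mask 0 keep their previous contents, so no temporary
    register is needed to preserve them.
    """
    out = list(target)
    for i, write in enumerate(mask):
        if write:
            out[i] = source[select[i]]
    return out


# Arrange (x, y, z, w) as (z, x, y, _) while leaving the last word untouched.
arranged = masked_permute([9, 9, 9, 9], [1, 2, 3, 4], [2, 0, 1, 3], [1, 1, 1, 0])
```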
Abstract:
Systems and methods are disclosed herein for providing improved cache structures and methods that are optimally sized to support a predetermined range of late stage adjustments and in which image data is intelligently read out of DRAM and cached in such a way as to eliminate re-fetching of input image data from DRAM and minimize DRAM bandwidth and power.
Abstract:
The present invention is generally related to the field of image processing, and more specifically to an instruction set for processing images. Vector processing may involve performing a plurality of permute operations to arrange vector operands in desired locations of a register prior to performing a vector operation, for example, a cross product. The permute instructions may be dependent on one another and may require the use of temporary registers. Embodiments of the invention provide a permute instruction wherein a mask field may be used to specify a particular location of a target register in which to transfer data, thereby reducing the number of instructions for arranging data, reducing dependencies between instructions, and reducing the usage of temporary registers.
Abstract:
Methods for preprocessing pixel data using a Direct Memory Access (DMA) engine during a data transfer of the pixel data from a first memory (e.g., a DRAM) to a second memory (e.g., an SRAM) are described. The pixel data may derive from a color camera or a depth camera in which individual pixel values are not a multiple of eight bits. In some cases, the DMA engine may perform a variety of image processing operations on the pixel data prior to the pixel data being written into the second memory. In one embodiment, the DMA engine may be configured to determine whether one or more pixels corresponding with the pixel data may be invalidated or skipped based on a minimum pixel value threshold and a maximum pixel value threshold and to embed pixel skipping information within unused bits of the pixel data.
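The thresholding step can be sketched in software. The layout here is assumed for illustration: 10-bit pixel values stored in 16-bit words, with bit 15 used as the skip flag. The abstract gives neither the pixel width nor the flag position, and the names `PIXEL_BITS`, `SKIP_FLAG`, and `dma_transfer` are hypothetical.

```python
PIXEL_BITS = 10                      # assumed pixel width (not a multiple of 8)
PIXEL_MASK = (1 << PIXEL_BITS) - 1
SKIP_FLAG = 1 << 15                  # assumed unused bit of a 16-bit word


def dma_transfer(src_words, min_val, max_val):
    """Copy pixel words, marking out-of-range pixels as skippable.

    Models a transfer from a first memory to a second memory in which each
    pixel outside [min_val, max_val] has the skip flag embedded in an
    otherwise unused bit of its word.
    """
    dst = []
    for word in src_words:
        value = word & PIXEL_MASK
        if value < min_val or value > max_val:
            dst.append(value | SKIP_FLAG)
        else:
            dst.append(value)
    return dst


out = dma_transfer([5, 500, 1023], min_val=10, max_val=1000)
skipped = [bool(w & SKIP_FLAG) for w in out]
```

A consumer of the second memory can then test the flag bit instead of repeating the threshold comparison for every pixel.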
Abstract:
The present invention is generally related to the field of image processing, and more specifically to an instruction set for processing images. Vector processing may involve performing a plurality of dot product operations to generate operands for a new vector. The dot product operations may require the issue of a plurality of permute instructions to arrange the vector operands in desired locations of a target register. Embodiments of the invention provide a dot product instruction wherein a mask field may be used to specify a particular location of a target register in which to transfer data, thereby avoiding the need for permute instructions for arranging data, reducing dependencies between instructions, and reducing the usage of temporary registers.
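Building a new vector from several masked dot products can be sketched as follows. This is an illustration under assumptions: the mask is modeled as one write-enable flag per target word, and the function name `masked_dot` and the matrix-vector example are hypothetical.

```python
def masked_dot(target, a, b, mask):
    """Compute dot(a, b) and write it into the masked positions of target.

    Positions whose mask flag is 0 keep their previous contents, so
    successive dot products can fill one target register without any
    intervening permute instructions.
    """
    d = sum(x * y for x, y in zip(a, b))
    return [d if m else t for t, m in zip(target, mask)]


# Three dot products fill the first three words of one target register.
rows = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0]]
v = [4, 5, 6, 1]
out = [0, 0, 0, 0]
for i, row in enumerate(rows):
    mask = [1 if j == i else 0 for j in range(4)]
    out = masked_dot(out, row, v, mask)
```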