摘要:
Methods and apparatus for source operand collector caching. In one embodiment, a processor includes a register file that may be coupled to storage elements (i.e., an operand collector) that provide inputs to the datapath of the processor core for executing an instruction. In order to reduce bandwidth between the register file and the operand collector, operands may be cached and reused in subsequent instructions. A scheduling unit maintains a cache table for monitoring which register values are currently stored in the operand collector. The scheduling unit may also configure the operand collector to select the particular storage elements that are coupled to the inputs to the datapath for a given instruction.
摘要:
Methods and apparatus for source operand collector caching. In one embodiment, a processor includes a register file that may be coupled to storage elements (i.e., an operand collector) that provide inputs to the datapath of the processor core for executing an instruction. In order to reduce bandwidth between the register file and the operand collector, operands may be cached and reused in subsequent instructions. A scheduling unit maintains a cache table for monitoring which register values are currently stored in the operand collector. The scheduling unit may also configure the operand collector to select the particular storage elements that are coupled to the inputs to the datapath for a given instruction.
摘要:
A parallel thread processor executes thread groups belonging to multiple cooperative thread arrays (CTAs). At each cycle of the parallel thread processor, an instruction scheduler selects a thread group to be issued for execution during a subsequent cycle. The instruction scheduler selects a thread group to issue for execution by (i) identifying a pool of available thread groups, (ii) identifying a CTA that has the greatest seniority value, and (iii) selecting the thread group that has the greatest credit value from within the CTA with the greatest seniority value.
摘要:
A parallel thread processor executes thread groups belonging to multiple cooperative thread arrays (CTAs). At each cycle of the parallel thread processor, an instruction scheduler selects a thread group to be issued for execution during a subsequent cycle. The instruction scheduler selects a thread group to issue for execution by (i) identifying a pool of available thread groups, (ii) identifying a CTA that has the greatest seniority value, and (iii) selecting the thread group that has the greatest credit value from within the CTA with the greatest seniority value.
摘要:
One embodiment of the present invention sets forth a technique for a program to access multi-dimensional formatted graphics surface memory. Multi-dimensional memory objects called “surfaces” stored in a user-specified data or pixel format and arranged in a graphics optimized layout are accessed by programs using surface instructions. A set of memory access instructions e.g., load, store, reduce, and atomic, referred to as surface instructions, may be used to access the surfaces. Coordinate bounds checking is performed with configurable clamping. Caching behavior may also be specified by the surface instructions. Data format conversion and packing to a specified storage format is supported for store, reduction, and atomic surface instructions. Data format conversion and unpacking from a specified storage format is supported for loads and atomic surface instructions.
摘要:
Systems and methods for compiling programs using condition codes and executing those programs when non-numeric values are present allow for explicit handling of non-numeric values. In addition to the conventional condition code values of positive, negative, and zero, a fourth value may be encoded, not a number (NaN) representing a non-numeric value. New condition tests are defined that explicitly account for condition code values of NaN. A compiler may produce code using the new condition tests to represent if and if-else statements. The code including the new condition tests generates deterministic results during execution when non-numeric values are present.
摘要:
One embodiment of the present invention sets forth a technique controlling the pixel location at which the plane equation is evaluated. Multiple pixel offsets (dx, dy) may be specified that each define to a sub-pixel sample position. Attributes are then calculated for each sub-pixel sample position that is covered by a geometric primitive. One advantage of the technique is that anti-aliasing quality may be improved since high frequency color components may be selectively supersampled for particular geometric primitives.
摘要:
One embodiment of the present invention sets forth a technique for generating a batch clip state stored in clip state machine (CSM) associated with a batch of vertices. Per-vertex clip state is generated for each vertex in the batch of vertices based on the position of each vertex relative to each clip plane. For a given vertex, per-vertex clip state indicates whether the vertex is inside or outside each of the one or more clip planes. The per-vertex clip states of all the vertices in the batch of vertices are coalesced into a batch clip state by determining whether each vertex in the batch of vertices is inside every clip plane, each vertex is outside at least one clip plane or neither. The batch clip state is stored in the CSM associated with the thread group that processes the batch of vertices that can be accessed by further stages of the graphics pipeline.
摘要:
One embodiment of the invention sets forth a mechanism for compiling a vertex shader program into two portions, a culling portion and a shading portion. The culling portion of the compiled vertex shader program specifies vertex attributes and instructions of the vertex shader program needed to determine whether early vertex culling operations should be performed on a batch of vertices associated with one or more primitives of a graphics scene. The shading portion of the compiled vertex shader program specifies the remaining vertex attributes and instructions of the vertex shader program for performing vertex lighting and performing other operations on the vertices in the batch of vertices. When the compiled vertex shader program is executed by graphics processing hardware, the shading portion of the compiled vertex shader is executed only when early vertex culling operations are not performed on the batch of vertices.
摘要:
One embodiment of the present invention sets forth a technique for collecting operands specified by an instruction. As a sequence of instructions is received the operands specified by the instructions are assigned to ports, so that each one of the operands specified by a single instruction is assigned to a different port. Reading of the operands from a multi-bank register file is scheduled by selecting an operand from each one of the different ports to produce an operand read request and ensuring that two or more of the selected operands are not stored in the same bank of the multi-bank register file. The operands specified by the operand read request are read from the multi-bank register file in a single clock cycle. Each instruction is then executed as the operands specified by the instruction are read from the multi-bank register file and collected over one or more clock cycles.