Abstract:
Techniques are disclosed relating to per-pipeline control for an operand cache. In some embodiments, an apparatus includes a register file and multiple execution pipelines. In some embodiments, the apparatus also includes an operand cache that includes multiple entries that each include multiple portions that are each configured to store an operand for a corresponding execution pipeline. In some embodiments, the operand cache is configured, during operation of the apparatus, to store data in only a subset of the portions of an entry. In some embodiments, the apparatus is configured to store, for each entry in the operand cache, a per-entry validity value that indicates whether the entry is valid and per-portion state information that indicates whether data for each portion is valid and whether data for each portion is modified relative to data in a corresponding entry in the register file.
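A minimal C sketch of the described entry layout, assuming four execution pipelines and 32-bit operands (both hypothetical constants); the per-entry validity value and per-portion valid/dirty state follow the abstract directly:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PIPES 4  /* hypothetical number of execution pipelines */

/* One operand cache entry: a portion per pipeline, a per-entry
 * validity value, and per-portion valid/dirty state. */
typedef struct {
    uint32_t operand[NUM_PIPES]; /* one operand per corresponding pipeline */
    uint8_t  portion_valid;      /* bit i set: portion i holds valid data */
    uint8_t  portion_dirty;      /* bit i set: portion i is modified relative
                                    to the corresponding register file entry */
    bool     entry_valid;        /* per-entry validity value */
} oc_entry;

/* Store data in only a subset of the portions: here, a single portion. */
static void oc_write_portion(oc_entry *e, unsigned pipe, uint32_t value)
{
    e->operand[pipe]  = value;
    e->portion_valid |= (uint8_t)(1u << pipe);
    e->portion_dirty |= (uint8_t)(1u << pipe); /* needs write back */
    e->entry_valid    = true;
}
```

The per-portion dirty bits let the write back of an entry touch only the register file words that actually changed.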
Abstract:
Techniques are disclosed relating to predication. In one embodiment, a graphics processing unit is disclosed that includes a first set of architecturally-defined registers configured to store predication information. The graphics processing unit further includes a second set of registers configured to mirror the first set of registers and an execution pipeline configured to discontinue execution of an instruction sequence based on predication information in the second set of registers. In one embodiment, the second set of registers includes one or more registers proximal to an output of the execution pipeline. In some embodiments, the execution pipeline writes back a predicate value determined for a predicate writer to the second set of registers. The first set of architecturally-defined registers is then updated with the predicate value written back to the second set of registers. In some embodiments, the execution pipeline discontinues execution of the instruction sequence without stalling.
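A minimal C sketch of the two predicate register sets, with the register count and function names as illustrative assumptions; the writer updates the mirror set at the pipeline output, and the architectural set is then updated from it:

```c
#include <stdbool.h>

#define NUM_PREDS 8  /* hypothetical number of predicate registers */

typedef struct {
    bool arch[NUM_PREDS];   /* architecturally-defined predicate registers */
    bool mirror[NUM_PREDS]; /* mirror set near the execution pipeline output */
} pred_regs;

/* A predicate writer's result is written back to the mirror set;
 * the architectural set is then updated with that value. */
static void pred_writeback(pred_regs *p, unsigned r, bool value)
{
    p->mirror[r] = value;
    p->arch[r]   = p->mirror[r];
}

/* The pipeline consults the mirror set, so it can discontinue a predicated
 * instruction sequence without stalling on the architectural set. */
static bool discontinue_sequence(const pred_regs *p, unsigned r, bool run_if)
{
    return p->mirror[r] != run_if;
}
```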
Abstract:
An apparatus includes an operand cache for storing operands from a register file for use by execution circuitry. In some embodiments, eviction priority for the operand cache is based on the status of entries (e.g., whether dirty or clean) and the retention priority of entries. In some embodiments, entries are flushed differently based on their retention priority (e.g., low-priority entries may be pre-emptively flushed). In some embodiments, timing for cache clean operations is specified on a per-instruction basis. Disclosed techniques may spread out write backs in time, facilitate cache clean operations, facilitate thread switching, extend the time operands are available in an operand cache, and/or improve the use of compiler hints, in some embodiments.
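One plausible reading of the eviction ordering, sketched in C; the priority encoding and the preference for clean entries at equal retention priority are assumptions for illustration:

```c
#include <stdbool.h>

#define OC_ENTRIES 16  /* hypothetical operand cache size */

typedef struct {
    bool valid;
    bool dirty;              /* modified relative to the register file */
    unsigned retention_prio; /* 0 = lowest retention priority (evict first) */
} oc_state;

/* Pick an eviction victim: lowest retention priority first, and among
 * equals prefer clean entries, which need no write back. */
static int pick_victim(const oc_state e[OC_ENTRIES])
{
    int best = -1;
    for (int i = 0; i < OC_ENTRIES; i++) {
        if (!e[i].valid)
            return i;  /* free entry: no eviction needed */
        if (best < 0 ||
            e[i].retention_prio < e[best].retention_prio ||
            (e[i].retention_prio == e[best].retention_prio &&
             !e[i].dirty && e[best].dirty))
            best = i;
    }
    return best;
}
```

Under this ordering, pre-emptively flushing low-priority dirty entries turns them into cheap clean victims, which spreads write backs out in time.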
Abstract:
Techniques are disclosed relating to implementation of gradient-type graphics instructions. In one embodiment, an apparatus includes first and second execution pipelines and a register file. In this embodiment, the register file is coupled to the first and second execution pipelines and configured to store operands for the first and second execution pipelines. In this embodiment, the apparatus is configured to determine that a graphics instruction imposes a dependency between the first and second pipelines. In response to this determination, the apparatus is configured to read a plurality of operands from the register file, including an operand assigned to the second execution pipeline, and to select the operand assigned to the second execution pipeline as an input operand for the first execution pipeline. The apparatus may be configured such that operands assigned to the second execution pipeline are accessible by the first execution pipeline only via the register file and not from other locations.
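A minimal C sketch of the operand selection for a gradient-type instruction; the two-pipeline setup and the subtraction used to form the gradient are illustrative assumptions:

```c
#include <stdio.h>

/* Both pipelines' operands are read from the register file; when a
 * gradient-type instruction imposes a cross-pipeline dependency, the
 * operand assigned to the second pipeline is selected as an input for
 * the first pipeline (reachable only via the register file). */
static float pipe0_input(const float rf_read[2], int is_gradient)
{
    return is_gradient ? rf_read[1] : rf_read[0];
}

int main(void)
{
    float rf_read[2] = { 1.0f, 4.0f }; /* operands read for pipes 0 and 1 */
    /* e.g., a dFdx-style gradient: neighbor's value minus own value */
    float grad = pipe0_input(rf_read, 1) - rf_read[0];
    printf("gradient = %f\n", grad);
    return 0;
}
```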
Abstract:
Disclosed techniques relate to work distribution in graphics processors. In some embodiments, an apparatus includes circuitry that implements a plurality of logical slots and a set of graphics processor sub-units that each implement multiple distributed hardware slots. The circuitry may determine different distribution rules for first and second sets of graphics work and map logical slots to distributed hardware slots based on the distribution rules. In various embodiments, disclosed techniques may advantageously distribute work efficiently across distributed shader processors for graphics kicks of various sizes.
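A minimal C sketch of per-kick distribution rules mapping logical slots to distributed hardware slots; the two rules and the size threshold are assumptions about one way such a policy could look:

```c
#define NUM_SUBUNITS 4  /* hypothetical number of graphics processor sub-units */

typedef enum {
    RULE_SINGLE_SUBUNIT, /* small kick: keep it on one sub-unit */
    RULE_ALL_SUBUNITS    /* large kick: spread across all sub-units */
} dist_rule;

/* Determine a distribution rule for a set of graphics work (a kick). */
static dist_rule pick_rule(unsigned kick_size, unsigned threshold)
{
    return kick_size < threshold ? RULE_SINGLE_SUBUNIT : RULE_ALL_SUBUNITS;
}

/* Map one logical slot to distributed hardware slots per the rule;
 * returns how many sub-units receive a distributed slot for it. */
static unsigned map_logical_slot(dist_rule rule)
{
    return rule == RULE_SINGLE_SUBUNIT ? 1u : (unsigned)NUM_SUBUNITS;
}
```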
Abstract:
Techniques are disclosed relating to dynamically adjusting buffering for distributing compute work in a graphics processor. In some embodiments, the graphics processor includes shader circuitry configured to process compute work from a compute kernel, multiple distributed workload parser circuits configured to send compute work to the shader circuitry, primary workload parser circuitry configured to send, via a communications fabric, compute work from the compute kernel to the distributed workload parser circuits, and buffer circuitry configured to buffer compute work received by one or more of the distributed workload parser circuits from the primary workload parser circuitry. In some embodiments, the graphics processor is configured to dynamically adjust a limit on the number of entries used in the buffer circuitry based on information indicating complexity of the compute kernel. This may advantageously maintain launch rates while reducing or avoiding workload imbalances, in some embodiments.
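A minimal C sketch of the dynamic buffer limit; the inverse relation between kernel complexity and the entry limit is an assumption about one way such an adjustment could work:

```c
/* Dynamically adjust the limit on buffer entries used by a distributed
 * workload parser. Complex kernels (long-running workgroups) get a lower
 * limit so work is not queued too far ahead of the shaders, which could
 * otherwise cause workload imbalance; simple kernels get a higher limit
 * to sustain launch rates. */
static unsigned buffer_entry_limit(unsigned max_entries, unsigned complexity)
{
    unsigned limit = max_entries / (complexity + 1u);
    return limit ? limit : 1u; /* always allow at least one entry */
}
```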
Abstract:
Techniques are disclosed relating to tracking compute workgroup completions in a distributed processor. In some embodiments, an apparatus includes a plurality of shader processors configured to perform operations for compute workgroups included in compute kernels, a master workload parser circuit, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and the master workload parser circuit. In some embodiments, a distributed workload parser circuit is configured to maintain, for each of a set of the shader processors, a data structure that specifies a count of workgroup completions for one or more kernels processed by the shader processor, determine, for the set of shader processors based on counts of workgroup completions for a first kernel, an aggregate count of completions to report for the first kernel, send the aggregate count to the master workload parser circuit over the communications fabric, and adjust the data structures to reflect counts included in the aggregate count.
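A minimal C sketch of the per-shader completion counts and the aggregate report, with the shader count as a hypothetical parameter:

```c
#include <stdint.h>

#define NUM_SHADERS 4  /* hypothetical shaders per distributed parser */

/* Per-kernel data structure: a workgroup completion count per shader. */
typedef struct {
    uint32_t completions[NUM_SHADERS];
} kernel_counts;

/* Aggregate completions for one kernel, adjust the per-shader counts to
 * reflect what was included in the report, and return the count to send
 * to the master workload parser over the communications fabric. */
static uint32_t report_kernel(kernel_counts *k)
{
    uint32_t aggregate = 0;
    for (unsigned s = 0; s < NUM_SHADERS; s++) {
        aggregate += k->completions[s];
        k->completions[s] = 0; /* these completions are now reported */
    }
    return aggregate;
}
```

Batching completions into one aggregate message per kernel keeps fabric traffic proportional to reports rather than to individual workgroups.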
Abstract:
Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.
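A minimal C sketch of the two parse pointers over the prefetch buffer; the item encoding, buffer depth, and callback for the indirect-fetch circuitry are illustrative assumptions:

```c
#include <stdbool.h>

#define BUF_SIZE 64  /* hypothetical prefetch buffer depth */

typedef struct {
    bool indirect;    /* item requires an indirect data access */
    unsigned payload; /* address or descriptor for the access */
} stream_item;

typedef struct {
    stream_item buf[BUF_SIZE];
    unsigned fetch_parse;   /* runs ahead, detecting indirect items */
    unsigned execute_parse; /* trails, emitting items for execution */
    unsigned filled;        /* entries written by stream fetch circuitry */
} stream_state;

/* Fetch-parse pass: advance ahead of the execute parse pointer, queueing
 * indirect accesses early so results are ready when execution arrives. */
static void fetch_parse_step(stream_state *s, void (*queue_indirect)(unsigned))
{
    while (s->fetch_parse < s->filled) {
        stream_item *it = &s->buf[s->fetch_parse % BUF_SIZE];
        if (it->indirect)
            queue_indirect(it->payload); /* to indirect-fetch circuitry */
        s->fetch_parse++;
    }
}
```

Letting the fetch parse pointer run ahead hides indirect-fetch latency behind the items the execute parse pointer has yet to reach.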