Abstract:
Techniques are disclosed relating to implementation of gradient-type graphics instructions. In one embodiment, an apparatus includes first and second execution pipelines and a register file. In this embodiment, the register file is coupled to the first and second execution pipelines and configured to store operands for the first and second execution pipelines. In this embodiment, the apparatus is configured to determine that a graphics instruction imposes a dependency between the first and second pipeline. In response to this determination, the apparatus is configured to read a plurality of operands from the register file including an operand assigned to the second execution pipeline and to select the operand assigned to the second execution pipeline as an input operand for the first execution pipeline. The apparatus may be configured such that operands assigned to the second execution pipeline are accessible by the first execution pipeline only via the register file and not from other locations.
Abstract:
Techniques are disclosed relating to multi-stage thread scheduling. In some embodiments, processor circuitry includes multiple channel pipelines for multiple channels and multiple execution pipelines shared by the channel pipelines and configured to perform different types of operations provided by the channel pipelines. First scheduler circuitry may arbitrate among threads to assign threads to channels. Second scheduler circuitry may arbitrate among channels to assign an operation from a given channel to a given execution pipeline. The execution pipelines may provide backpressure information to the first scheduler circuitry based on execution status and the first scheduler circuitry may adjust priority of a thread for assignment to a channel based on the backpressure information. Disclosed techniques may reduce channel conflicts and starvation for execution resources.
Abstract:
Techniques are disclosed relating to specifying memory consistency constraints. In some embodiments, an instruction may specify, for a memory operation, a type of memory consistency and a scope at which to enforce the type of consistency. For example, these fields may specify whether to sequence memory accesses relative to the operation at one or more of multiple different cache levels based on the type of memory consistency and the scope.
Abstract:
Techniques are disclosed relating to routing circuitry configured to perform permute operations for operands of threads in a single-instruction multiple-data group. In some embodiments, an apparatus includes hierarchical operand routing circuitry configured to route operands between a set of single-instruction multiple-data (SIMD) pipelines based on a permute instruction. In some embodiments, the routing circuitry includes a first level and a second level. The first level may include a set of multiple crossbar circuits each configured to receive operands from a respective subset of the pipelines and output one or more of the received operands on multiple output lines based on the permute instruction, where the crossbar circuits support full permutation within a respective subset. A second level may be configured to select an operand from a previous level for each of the pipelines, and may select from among only a portion of output operands from the previous level to provide an operand for a respective pipeline.
Abstract:
Techniques are disclosed relating to a thread-group-scoped gate instruction. In some embodiments, graphics processor circuitry is configured to execute, for multiple SIMD groups of a thread group, a graphics program that includes a gate instruction. During execution of the gate instruction for a first SIMD group, the processor accesses state information to determine that a threshold number of other SIMD groups in the thread group have not yet executed the gate instruction. Based on the determination, the processor executes a particular set of instructions of the graphics program for the first SIMD group (that is not executed by one or more other SIMD groups that reach the gate instruction after the first SIMD group). For example, the particular set of instructions may be a utility program that performs one or more operations for the entire thread group but is only executed by a subset of the SIMD groups.
Abstract:
In various embodiments, a resource allocation management circuit may allocate a plurality of different types of hardware resources (e.g., different types of registers) to a plurality of threads. The different types of hardware resources may correspond to a plurality of hardware resource allocation circuits. The resource allocation management circuit may track allocation of the hardware resources to the threads using state identification values of the threads. In response to determining that fewer than a respective requested number of one or more types of the hardware resources are available, the resource allocation management circuit may identify one or more threads for deallocation. As a result, the hardware resource allocation system may allocate hardware resources to threads more efficiently (e.g., may deallocate hardware resources allocated to fewer threads), as compared to a hardware resource allocation system that does not track allocation of hardware resources to threads using state identification values.
Abstract:
Techniques are disclosed relating to specifying memory consistency constraints. In some embodiments, an instruction may specify, for a memory operation, a type of memory consistency and a scope at which to enforce the type of consistency. For example, these fields may specify whether to sequence memory accesses relative to the operation at one or more of multiple different cache levels based on the type of memory consistency and the scope.
Abstract:
In general, embodiments are disclosed for tracking and allocating graphics processor hardware resources. More particularly, a graphics hardware resource allocation system is able to generate a priority list for a plurality of data masters for graphics processor based on a comparison between a current utilizations for the data masters and a target utilizations for the data masters. The graphics hardware resource allocation system designate, based on the priority list, a first data master with a higher priority to submit work to the graphics processor compared to a second data master. The graphics hardware resource allocation system determines a stall counter value for the data master and generates a notification to pause work for the second data master based on the stall counter value.
Abstract:
A data queuing and format apparatus is disclosed. A first selection circuit may be configured to selectively couple a first subset of data to a first plurality of data lines dependent upon control information, and a second selection circuit may be configured to selectively couple a second subset of data to a second plurality of data lines dependent upon the control information. A storage array may include multiple storage units, and each storage unit may be configured to receive data from one or more data lines of either the first or second plurality of data lines dependent upon the control information.
Abstract:
Techniques are disclosed relating to synchronizing access to pixel resources. Examples of pixel resources include color attachments, a stencil buffer, and a depth buffer. In some embodiments, hardware registers are used to track status of assigned pixel resources and pixel wait and pixel release instruction are used to synchronize access to the pixel resources. In some embodiments, other accesses to the pixel resources may occur out of program order. Relative to tracking and ordering pass groups, this weak ordering and explicit synchronization may improve performance and reduce power consumption. Disclosed techniques may also facilitate coordination between fragment rendering threads and auxiliary mid-render compute tasks.