Abstract:
One embodiment of the present invention includes techniques to decrease power consumption by reducing the number of redundant operations performed. In operation, a streamlining multiprocessor (SM) identifies uniform groups of threads that, when executed, apply the same deterministic operation to uniform sets of input operands. Within each uniform group of threads, the SM designates one thread as the anchor thread. The SM disables execution units assigned to all of the threads except the anchor thread. The anchor execution unit, assigned to the anchor thread, executes the operation on the uniform set of input operands. Subsequently, the SM sets the outputs of the non-anchor threads included in the uniform group of threads to equal the value of the anchor execution unit output. Advantageously, by exploiting the uniformity of data to reduce the number of execution units that execute, the SM dramatically reduces the power consumption compared to conventional SMs.
Abstract:
Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.
Abstract:
One embodiment of the present invention includes approaches for processing graphics primitives associated with cache tiles when rendering an image. A set of graphics primitives associated with a first render target configuration is received from a first portion of a graphics processing pipeline, and the set of graphics primitives is stored in a memory. A condition is detected indicating that the set of graphics primitives is ready for processing, and a cache tile is selected that intersects at least one graphics primitive in the set of graphics primitives. At least one graphics primitive in the set of graphics primitives that intersects the cache tile is transmitted to a second portion of the graphics processing pipeline for processing. One advantage of the disclosed embodiments is that graphics primitives and associated data are more likely to remain stored on-chip during cache tile rendering, thereby reducing power consumption and improving rendering performance.
Abstract:
Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.
Abstract:
A parallel counter accesses data generated by an application and stored within a register. The register includes different segments that include different portions of the application data. The parallel counter is configured to count the number of values within each segment that have a particular characteristic in a parallel fashion. The parallel counter may then return the individual segment counts to the application, or combine those segment counts and return a register count to the application. Advantageously, applications that rely on population count operations may be accelerated. Further, increasing the number of segments in a given register may reduce the time needed to count the values in that register, thereby providing a scalable solution to population counting. Additionally, the architecture of the parallel counter is sufficiently flexible to allow both register counting and segment counting, thereby combining two separate functionalities into just one hardware unit.
Abstract:
A hardware-based traversal coprocessor provides acceleration of tree traversal operations searching for intersections between primitives represented in a tree data structure and a ray. The primitives may include triangles used in generating a virtual scene. The hardware-based traversal coprocessor is configured to properly handle numerically challenging computations at or near edges and/or vertices of primitives and/or ensure that a single intersection is reported when a ray intersects a surface formed by primitives at or near edges and/or vertices of the primitives.
Abstract:
Techniques are provided for handling a trap encountered in a thread that is part of a thread array that is being executed in a plurality of execution units. In these techniques, a data structure with an identifier associated with the thread is updated to indicate that the trap occurred during the execution of the thread array. Also in these techniques, the execution units execute a trap handling routine that includes a context switch. The execution units perform this context switch for at least one of the execution units as part of the trap handling routine while allowing the remaining execution units to exit the trap handling routine before the context switch. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors.
Abstract:
A streaming multiprocessor in a parallel processing subsystem processes atomic operations for multiple threads in a multi-threaded architecture. The streaming multiprocessor receives a request from a thread in a thread group to acquire access to a memory location in a lock-protected shared memory, and determines whether a address lock in a plurality of address locks is asserted, where the address lock is associated the memory location. If the address lock is asserted, then the streaming multiprocessor refuses the request. Otherwise, the streaming multiprocessor asserts the address lock, asserts a thread group lock in a plurality of thread group locks, where the thread group lock is associated with the thread group, and grants the request. One advantage of the disclosed techniques is that acquired locks are released when a thread is preempted. As a result, a preempted thread that has previously acquired a lock does not retain the lock indefinitely.