Abstract:
Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to also access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as hacked by an L2 cache. Such distributed shared memory supports cooperative parallelism and strong scaling across multiple processing cores by permitting data sharing and communications previously possible only within the same processing core.
Abstract:
Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.
Abstract:
A streaming multiprocessor (SM) included within a parallel processing unit (PPU) is configured to suspend a thread group executing on the SM and to save the operating state of the suspended thread group. A load-store unit (LSU) within the SM re-maps local memory associated with the thread group to a location in global memory. Subsequently, the SM may re-launch the suspended thread group. The LSU may then perform local memory access operations on behalf of the re-launched thread group with the re-mapped local memory that resides in global memory.
Abstract:
Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.
Abstract:
A new level(s) of hierarchy—Cooperate Group Arrays (CGAs)—and an associated new hardware-based work distribution/execution model is described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
Abstract:
A method for scheduling work for processing by a GPU is disclosed. The method includes accessing a work completion data structure and accessing a work tracking data structure. Dependency logic analysis is then performed using work completion data and work tracking data. Work items that have dependencies are then launched into the GPU by using a software work item launch interface.
Abstract:
Techniques are provided for handling a trap encountered in a thread that is part of a thread array that is being executed in a plurality of execution units. In these techniques, a data structure with an identifier associated with the thread is updated to indicate that the trap occurred during the execution of the thread array. Also in these techniques, the execution units execute a trap handling routine that includes a context switch. The execution units perform this context switch for at least one of the execution units as part of the trap handling routine while allowing the remaining execution units to exit the trap handling routine before the context switch. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors.