摘要:
One embodiment of the present invention sets forth a technique instruction level and compute thread array granularity execution preemption. Preempting at the instruction level does not require any draining of the processing pipeline. No new instructions are issued and the context state is unloaded from the processing pipeline. When preemption is performed at a compute thread array boundary, the amount of context state to be stored is reduced because execution units within the processing pipeline complete execution of in-flight instructions and become idle. If, the amount of time needed to complete execution of the in-flight instructions exceeds a threshold, then the preemption may dynamically change to be performed at the instruction level instead of at compute thread array granularity.
摘要:
One embodiment of the present invention sets forth a technique instruction level and compute thread array granularity execution preemption. Preempting at the instruction level does not require any draining of the processing pipeline. No new instructions are issued and the context state is unloaded from the processing pipeline. When preemption is performed at a compute thread array boundary, the amount of context state to be stored is reduced because execution units within the processing pipeline complete execution of in-flight instructions and become idle. If, the amount of time needed to complete execution of the in-flight instructions exceeds a threshold, then the preemption may dynamically change to be performed at the instruction level instead of at compute thread array granularity.
摘要:
A time slice group (TSG) is a grouping of different streams of work (referred to herein as “channels”) that share the same context information. The set of channels belonging to a TSG are processed in a pre-determined order. However, when a channel stalls while processing, the next channel with independent work can be switched to fully load the parallel processing unit. Importantly, because each channel in the TSG shares the same context information, a context switch operation is not needed when the processing of a particular channel in the TSG stops and the processing of a next channel in the TSG begins. Therefore, multiple independent streams of work are allowed to run concurrently within a single context increasing utilization of parallel processing units.
摘要:
One embodiment of the present invention sets forth a technique for encapsulating compute task state that enables out-of-order scheduling and execution of the compute tasks. The scheduling circuitry organizes the compute tasks into groups based on priority levels. The compute tasks may then be selected for execution using different scheduling schemes. Each group is maintained as a linked list of pointers to compute tasks that are encoded as task metadata (TMD) stored in memory. A TMD encapsulates the state and parameters needed to initialize, schedule, and execute a compute task.
摘要:
A technique for controlling the distribution of compute task processing in a multi-threaded system encodes each processing task as task metadata (TMD) stored in memory. The TMD includes work distribution parameters specifying how the processing task should be distributed for processing. Scheduling circuitry selects a task for execution when entries of a work queue for the task have been written. The work distribution parameters may define a number of work queue entries needed before a cooperative thread array” (“CTA”) may be launched to process the work queue entries according to the compute task. The work distribution parameters may define a number of CTAS that are launched to process the same work queue entries. Finally, the work distribution parameters may define a step size that is used to update pointers to the work queue entries.
摘要:
One embodiment of the present invention is a control unit for distributing packets of work to one or more consumer of works. The control unit is configured to assign at least one processing domain from a set of processing domains to each consumer included in the one or more consumers, receive a plurality of packets of work from at least one producer of work, wherein each packet of work is associated with a processing domain from the set of processing domains, and a first packet of work associated with a first processing domain can be processed by the one or more consumers independently of a second packet of work associated with a second processing domain, identify a first consumer that has been assigned the first processing domain, and transmit the first packet of work to the first consumer for processing.