摘要:
One embodiment of the present invention sets forth a technique for managing the allocation and release of resources during multi-threaded program execution. Programmable reference counters are initialized to values that limit the amount of resources for allocation to tasks that share the same reference counter. Resource parameters are specified for each task to define the amount of resources allocated for consumption by each array of execution threads that is launched to execute the task. The resource parameters also specify the behavior of the array for acquiring and releasing resources. Finally, during execution of each thread in the array, an exit instruction may be configured to override the release of the resources that were allocated to the array. The resources may then be retained for use by a child task that is generated during execution of a thread.
摘要:
One embodiment of the present invention sets forth a technique for enabling the insertion of generated tasks into a scheduling pipeline of a multiple processor system allows a compute task that is being executed to dynamically generate a dynamic task and notify a scheduling unit of the multiple processor system without intervention by a CPU. A reflected notification signal is generated in response to a write request when data for the dynamic task is written to a queue. Additional reflected notification signals are generated for other events that occur during execution of a compute task, e.g., to invalidate cache entries storing data for the compute task and to enable scheduling of another compute task.
摘要:
One embodiment of the present invention sets forth a technique for selecting a first processor included in a plurality of processors to receive work related to a compute task. The technique involves analyzing state data of each processor in the plurality of processors to identify one or more processors that have already been assigned one compute task and are eligible to receive work related to the one compute task, receiving, from each of the one or more processors identified as eligible, an availability value that indicates the capacity of the processor to receive new work, selecting a first processor to receive work related to the one compute task based on the availability values received from the one or more processors, and issuing, to the first processor via a cooperative thread array (CTA), the work related to the one compute task.
摘要:
One embodiment of the present invention sets forth a technique for assigning a compute task to a first processor included in a plurality of processors. The technique involves analyzing each compute task in a plurality of compute tasks to identify one or more compute tasks that are eligible for assignment to the first processor, where each compute task is listed in a first table and is associated with a priority value and an allocation order that indicates relative time at which the compute task was added to the first table. The technique further involves selecting a first task compute from the identified one or more compute tasks based on at least one of the priority value and the allocation order, and assigning the first compute task to the first processor for execution.
摘要:
One embodiment of the present invention sets forth a technique for performing nested kernel execution within a parallel processing subsystem. The technique involves enabling a parent thread to launch a nested child grid on the parallel processing subsystem, and enabling the parent thread to perform a thread synchronization barrier on the child grid for proper execution semantics between the parent thread and the child grid. This technique advantageously enables the parallel processing subsystem to perform a richer set of programming constructs, such as conditionally executed and nested operations and externally defined library functions without the additional complexity of CPU involvement.
摘要:
Systems and methods for auto-throttling encapsulated compute tasks. A device driver may configure a parallel processor to execute compute tasks in a number of discrete throttled modes. The device driver may also allocate memory to a plurality of different processing units in a non-throttled mode. The device driver may also allocate memory to a subset of the plurality of processing units in each of the throttling modes. Data structures defined for each task include a flag that instructs the processing unit whether the task may be executed in the non-throttled mode or in the throttled mode. A work distribution unit monitors each of the tasks scheduled to run on the plurality of processing units and determines whether the processor should be configured to run in the throttled mode or in the non-throttled mode.
摘要:
One embodiment of the present invention sets forth a technique for error-checking a compute task. The technique involves receiving a pointer to a compute task, storing the pointer in a scheduling queue, determining that the compute task should be executed, retrieving the pointer from the scheduling queue, determining via an error-check procedure that the compute task is eligible for execution, and executing the compute task.
摘要:
A time slice group (TSG) is a grouping of different streams of work (referred to herein as “channels”) that share the same context information. The set of channels belonging to a TSG are processed in a pre-determined order. However, when a channel stalls while processing, the next channel with independent work can be switched to fully load the parallel processing unit. Importantly, because each channel in the TSG shares the same context information, a context switch operation is not needed when the processing of a particular channel in the TSG stops and the processing of a next channel in the TSG begins. Therefore, multiple independent streams of work are allowed to run concurrently within a single context increasing utilization of parallel processing units.
摘要:
One embodiment of the present invention sets forth a technique for encapsulating compute task state that enables out-of-order scheduling and execution of the compute tasks. The scheduling circuitry organizes the compute tasks into groups based on priority levels. The compute tasks may then be selected for execution using different scheduling schemes. Each group is maintained as a linked list of pointers to compute tasks that are encoded as task metadata (TMD) stored in memory. A TMD encapsulates the state and parameters needed to initialize, schedule, and execute a compute task.
摘要:
One embodiment of the present invention sets forth a technique for reducing the amount of memory required to store vertex data processed within a processing pipeline that includes a plurality of shading engines. The method includes determining a first active shading engine and a second active shading engine included within the processing pipeline, wherein the second active shading engine receives vertex data output by the first active shading engine. An output map is received and indicates one or more attributes that are included in the vertex data and output by the first active shading engine. An input map is received and indicates one or more attributes that are included in the vertex data and received by the second active shading engine from the first active shading engine. Then, a buffer map is generated based on the input map, the output map, and a pre-defined set of rules that includes rule data associated with both the first shading engine and the second shading engine, wherein the buffer map indicates one or more attributes that are included in the vertex data and stored in a memory that is accessible by both the first active shading engine and the second active shading engine.