Abstract:
A method of allocating a memory to a plurality of concurrent threads is presented. The method includes dynamically determining writer threads each having at least one pending write to the memory; and dynamically allocating respective contiguous blocks in the memory for each of the writer threads. Another method of allocating a memory to a plurality of concurrent threads includes launching the plurality of threads as a plurality of wavefronts, dynamically determining a group of wavefronts each having at least one thread requiring a write to the memory, and dynamically allocating respective contiguous blocks in the memory for each wavefront from the group of wavefronts. A corresponding method of assigning a memory to a plurality of reader threads includes determining a first number corresponding to a number of writer threads having a block allocated in said memory, launching a first number of reader threads, entering a first wavefront of said reader threads from said group of wavefronts to an atomic operation, and assigning a first block in the memory to the first wavefront during the corresponding atomic operation, where the first block is contiguous to a previously allocated block dynamically allocated to another wavefront from said group of wavefronts. Corresponding system embodiments and computer program product embodiments are also presented.
Abstract:
Embodiments describe herein provide a method of for managing task scheduling on a accelerated processing device. The method includes executing a first task within the accelerated processing device (APD), monitoring for an interruption of the execution of the first task, and switching to a second task when an interruption is detected.
Abstract:
A system, method, and computer program product are provided for improving resource utilization of multithreaded applications. Rather than requiring threads to block while waiting for data from a channel or requiring context switching to minimize blocking, the techniques disclosed herein provide an event-driven approach to launch kernels only when needed to perform operations on channel data, and then terminate in order to free resources. These operations are handled efficiently in hardware, but are flexible enough to be implemented in all manner of programming models.
Abstract:
A method resumes an accelerated processing device (APD) wavefront in which a subset of elements have faulted. A restore command for a job including a wavefront is received. A list of context states for the wavefront is read from a memory associated with a APD. An empty shell wavefront is created for restoring the list of context states. A portion of not acknowledged data is masked over a portion of acknowledged data within the restored wavefronts.
Abstract:
Provided is a method for processing a command in a computing system including an accelerated processing device (APD) having a command processor. The method includes executing an interrupt routine to save one or more contexts related to a first set of instructions on a shader core in response to an instruction to preempt processing of the first set of instructions.
Abstract:
A method tolerates virtual to physical address translation failures. A translation request is sent from a graphics processing device to a translation mechanism. The translation request is associated with a first wavefront. A fault notification is received within an accelerated processing device (APD) from the translation mechanism that a request cannot be acknowledged. The first wavefront is, stored within a shader core of the APD if the fault notification is received. The first wavefront is replaced with a second wavefront if the fault notification is received, the second wavefront being ready to be executed.
Abstract:
An apparatus with circuit redundancy includes a set of parallel arithmetic logic units (ALUs), a redundant parallel ALU, input data shifting logic that is coupled to the set of parallel ALUs and that is operatively coupled to the redundant parallel ALU. The input data shifting logic shifts input data for a defective ALU, in a first direction, to a neighboring ALU in the set. When the neighboring ALU is the last or end ALU in the set, the shifting logic continues to shift the input data for the end ALU that is not defective, to the redundant parallel ALU. The redundant parallel ALU then operates for the defective ALU. Output data shifting logic is coupled to an output of the parallel redundant ALU and all other ALU outputs to shift the output data in a second and opposite direction than the input shifting logic, to realign output of data for continued processing, including for storage or for further processing by other circuitry.
Abstract:
A method of allocating a memory to a plurality of concurrent threads is presented. The method includes dynamically determining writer threads each having at least one pending write to the memory; and dynamically allocating respective contiguous blocks in the memory for each of the writer threads. Another method of allocating a memory to a plurality of concurrent threads includes launching the plurality of threads as a plurality of wavefronts, dynamically determining a group of wavefronts each having at least one thread requiring a write to the memory, and dynamically allocating respective contiguous blocks in the memory for each wavefront from the group of wavefronts. A corresponding method of assigning a memory to a plurality of reader threads includes determining a first number corresponding to a number of writer threads having a block allocated in said memory, launching a first number of reader threads, entering a first wavefront of said reader threads from said group of wavefronts to an atomic operation, and assigning a first block in the memory to the first wavefront during the corresponding atomic operation, where the first block is contiguous to a previously allocated block dynamically allocated to another wavefront from said group of wavefronts. Corresponding system embodiments and computer program product embodiments are also presented.
Abstract:
A processor includes a first shader engine and a second shader engine. The first shader engine is configured to process pixel shaders for a first subset of pixels to be displayed on a display device. The second shader engine is configured to process pixel shaders for a second subset of pixels to be displayed on the display device. Both the first and second shader engines are also configured to process general-compute shaders and non-pixel graphics shaders. The processor may also include a level-one (L1) data cache, coupled to and positioned between the first and second shader engines.
Abstract:
Briefly, graphics data processing logic includes a plurality of parallel arithmetic logic units (ALUs), such as floating point processors or any other suitable logic, that operate as a vector processor on at least one of pixel data and vertex data (or both) and a programmable storage element that contains data representing which of the plurality of arithmetic logic units are not to receive data for processing. The graphics data processing logic also includes parallel ALU data packing logic that is operatively coupled to the plurality of arithmetic logic processing units and to the programmable storage element to pack data only for the plurality of arithmetic logic units identified by the data in the programmable storage element as being enabled.