Counter-based delay of dependent thread group execution
    1.
    发明授权
    Counter-based delay of dependent thread group execution 有权
    依赖线程组执行的基于计数器的延迟

    公开(公告)号:US07526634B1

    公开(公告)日:2009-04-28

    申请号:US11535871

    申请日:2006-09-27

    IPC分类号: G06F9/40

    摘要: Systems and methods for synchronizing processing work performed by threads, cooperative thread arrays (CTAs), or “sets” of CTAs. A central processing unit can load launch commands for a first set of CTAs and a second set of CTAs in a pushbuffer, and specify a dependency of the second set upon completion of execution of the first set. A parallel or graphics processor (GPU) can autonomously execute the first set of CTAs and delay execution of the second set of CTAs until the first set of CTAs is complete. In some embodiments the GPU may determine that a third set of CTAs is not dependent upon the first set, and may launch the third set of CTAs while the second set of CTAs is delayed. In this manner, the GPU may execute launch commands out of order with respect to the order of the launch commands in the pushbuffer.

    摘要翻译: 由线程执行的处理工作同步的系统和方法,协同线程数组(CIA)或CTA的“集合”。 中央处理单元可以加载针对第一组CTA和第二组CTA的推送命令,并且在第一组的执行完成时指定第二组的依赖关系。 并行或图形处理器(GPU)可以自主地执行第一组CTA并且延迟第二组CTA的执行,直到第一组CTA完成。 在一些实施例中,GPU可以确定第三组CTA不依赖于第一组,并且可以启动第三组CTA,同时第二组CTA被延迟。 以这种方式,GPU可以相对于推送缓冲器中的发射命令的顺序执行命令无序。

    Methods for scalably exploiting parallelism in a parallel processing system
    2.
    发明授权
    Methods for scalably exploiting parallelism in a parallel processing system 有权
    在并行处理系统中可扩展地利用并行性的方法

    公开(公告)号:US08099584B2

    公开(公告)日:2012-01-17

    申请号:US13099035

    申请日:2011-05-02

    IPC分类号: G06F9/30

    摘要: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across all of the available processing cores as available in a particular system implementation.

    摘要翻译: 并行处理子系统中的并行性以可扩展的方式被利用。 要解决的问题可以被分层分解成至少两个级别的子问题。 定义程序执行的各个线程来解决最低级别的问题。 线程被分组成一个或多个线程数组,每个线程数组都解决了较高级的子问题。 线程数组可以通过处理内核执行,每个核心可以一次执行至少一个线程数组。 线程数组可以分组成独立线程数组的网格,从而解决更高级的子问题或整个问题。 网格中的线程数组或整个网格可以分布在所有可用处理核心中,如特定系统实现中可用的。

    Methods for scalably exploiting parallelism in a parallel processing system
    3.
    发明授权
    Methods for scalably exploiting parallelism in a parallel processing system 有权
    在并行处理系统中可扩展地利用并行性的方法

    公开(公告)号:US07937567B1

    公开(公告)日:2011-05-03

    申请号:US11555623

    申请日:2006-11-01

    IPC分类号: G06F9/30

    摘要: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across all of the available processing cores as available in a particular system implementation.

    摘要翻译: 并行处理子系统中的并行性以可扩展的方式被利用。 要解决的问题可以被分层分解成至少两个级别的子问题。 定义程序执行的各个线程来解决最低级别的问题。 线程被分组成一个或多个线程数组,每个线程数组都解决了较高级的子问题。 线程数组可以通过处理内核执行,每个核心可以一次执行至少一个线程数组。 线程数组可以分组成独立线程数组的网格,从而解决更高级的子问题或整个问题。 网格中的线程数组或整个网格可以分布在所有可用处理核心中,如特定系统实现中可用的。

    PARALLEL DATA PROCESSING SYSTEMS AND METHODS USING COOPERATIVE THREAD ARRAYS
    4.
    发明申请
    PARALLEL DATA PROCESSING SYSTEMS AND METHODS USING COOPERATIVE THREAD ARRAYS 有权
    并行数据处理系统和使用合作螺纹阵列的方法

    公开(公告)号:US20110087860A1

    公开(公告)日:2011-04-14

    申请号:US12972361

    申请日:2010-12-17

    IPC分类号: G06F15/16

    摘要: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.

    摘要翻译: 并行数据处理系统和方法使用协同线程数组(CIA),即在输入数据集上同时执行相同程序的多线程组,以产生输出数据集。 CTA中的每个线程都有一个唯一的标识符(线程ID),可以在线程启动时分配。 线程ID控制线程的处理行为的各个方面,例如由每个线程处理的输入数据集的部分,由每个线程产生的输出数据集的部分和/或线程之间的中间结果的共享 。 还描述了在代表性处理核心中加载和启动CTA并在CTA内同步线程的机制。

    Synchronization of threads in a cooperative thread array
    5.
    发明授权
    Synchronization of threads in a cooperative thread array 有权
    协同线程数组中的线程同步

    公开(公告)号:US07788468B1

    公开(公告)日:2010-08-31

    申请号:US11303780

    申请日:2005-12-15

    IPC分类号: G06F15/00 G06F15/76

    摘要: A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reaches the barrier point.

    摘要翻译: “协同线程数组”或“CTA”是一组多个线程,它们在输入数据集上同时执行相同的程序以产生输出数据集。 CTA中的每个线程都具有在线程启动时分配的唯一线程标识符,用于控制线程的处理行为的各个方面,例如要由每个线程处理的输入数据集的部分,要生成的输出数据集的部分 通过每个线程,和/或在线程之间共享中间结果。 CTA的不同线程有利地在CTA执行期间在适当的点处同步,其中使用屏障同步技术,其中检测到CTA程序中的障碍指令并用于暂停某些线程的执行,直到指定数量的其他线程也到达屏障点。

    Parallel data processing systems and methods using cooperative thread arrays with unique thread identifiers as an input to compute an identifier of a location in a shared memory
    6.
    发明授权
    Parallel data processing systems and methods using cooperative thread arrays with unique thread identifiers as an input to compute an identifier of a location in a shared memory 有权
    使用具有唯一线程标识符的协作线程数组作为输入的并行数据处理系统和方法来计算共享存储器中位置的标识符

    公开(公告)号:US08112614B2

    公开(公告)日:2012-02-07

    申请号:US12972361

    申请日:2010-12-17

    IPC分类号: G06F15/16

    摘要: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.

    摘要翻译: 并行数据处理系统和方法使用协同线程数组(CIA),即在输入数据集上同时执行相同程序的多线程组,以产生输出数据集。 CTA中的每个线程都有一个唯一的标识符(线程ID),可以在线程启动时分配。 线程ID控制线程的处理行为的各个方面,例如由每个线程处理的输入数据集的部分,由每个线程生成的输出数据集的部分和/或线程之间的中间结果的共享 。 还描述了在代表性处理核心中加载和启动CTA并在CTA内同步线程的机制。

    METHODS FOR SCALABLY EXPLOITING PARALLELISM IN A PARALLEL PROCESSING SYSTEM
    7.
    发明申请
    METHODS FOR SCALABLY EXPLOITING PARALLELISM IN A PARALLEL PROCESSING SYSTEM 有权
    在平行处理系统中大量开发并行的方法

    公开(公告)号:US20110238955A1

    公开(公告)日:2011-09-29

    申请号:US13099035

    申请日:2011-05-02

    IPC分类号: G06F9/30

    摘要: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across all of the available processing cores as available in a particular system implementation.

    摘要翻译: 并行处理子系统中的并行性以可扩展的方式被利用。 要解决的问题可以被分层分解成至少两个级别的子问题。 定义程序执行的各个线程来解决最低级别的问题。 线程被分组成一个或多个线程数组,每个线程数组都解决了较高级的子问题。 线程数组可以通过处理内核执行,每个核心可以一次执行至少一个线程数组。 线程数组可以分组成独立线程数组的网格,从而解决更高级的子问题或整个问题。 网格中的线程数组或整个网格可以分布在所有可用处理核心中,如特定系统实现中可用的。

    Parallel data processing systems and methods using cooperative thread arrays and thread identifier values to determine processing behavior
    8.
    发明授权
    Parallel data processing systems and methods using cooperative thread arrays and thread identifier values to determine processing behavior 有权
    并行数据处理系统和方法使用协作线程数组和线程标识符值来确定处理行为

    公开(公告)号:US07861060B1

    公开(公告)日:2010-12-28

    申请号:US11305178

    申请日:2005-12-15

    IPC分类号: G06F15/16

    摘要: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.

    摘要翻译: 并行数据处理系统和方法使用协同线程数组(CIA),即在输入数据集上同时执行相同程序的多线程组,以产生输出数据集。 CTA中的每个线程都有一个唯一的标识符(线程ID),可以在线程启动时分配。 线程ID控制线程的处理行为的各个方面,例如由每个线程处理的输入数据集的部分,由每个线程产生的输出数据集的部分和/或线程之间的中间结果的共享 。 还描述了在代表性处理核心中加载和启动CTA并在CTA内同步线程的机制。

    Coalescing memory barrier operations across multiple parallel threads
    9.
    发明授权
    Coalescing memory barrier operations across multiple parallel threads 有权
    在多个并行线程之间合并记忆障碍操作

    公开(公告)号:US09223578B2

    公开(公告)日:2015-12-29

    申请号:US12887081

    申请日:2010-09-21

    IPC分类号: G06F9/46 G06F9/38 G06F9/30

    摘要: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.

    摘要翻译: 本发明的一个实施例提出了一种用于在多个并行线程之间聚合存储器屏障操作的技术。 来自给定并行线程处理单元的存储器屏障请求被合并以减少对系统其余部分的影响。 此外,存储器屏障请求可以指定针对其提交内存事务的一组线程的级别。 例如,第一类型的存储器障碍指令可以将存储器事务提交到共享L1(一级)高速缓存的一组协作线程的级别。 第二种类型的存储器障碍指令可以将存储器事务提交到共享全局存储器的一组线程的级别。 最后,第三种类型的存储器障碍指令可以将存储器事务提交到共享所有系统存储器的所有线程的系统级。 执行存储器屏障指令所需的延迟基于存储器屏障指令的类型而变化。