Abstract:
Systems and methods for synchronizing processing work performed by threads, cooperative thread arrays (CTAs), or “sets” of CTAs. A central processing unit can load launch commands for a first set of CTAs and a second set of CTAs into a pushbuffer, and specify that execution of the second set depends on completion of the first set. A parallel or graphics processor (GPU) can autonomously execute the first set of CTAs and delay execution of the second set of CTAs until the first set is complete. In some embodiments the GPU may determine that a third set of CTAs does not depend on the first set, and may launch the third set while the second set is delayed. In this manner, the GPU may execute launch commands out of order with respect to their order in the pushbuffer.
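The same dependency pattern can be expressed from the host with CUDA streams and events; the sketch below is an illustrative software analogue of the described pushbuffer mechanism, not the patented hardware path, and the kernel names setA, setB, and setC are hypothetical stand-ins for the three sets of CTAs.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the three sets of CTAs.
__global__ void setA(float* d) { d[threadIdx.x] += 1.0f; }
__global__ void setB(float* d) { d[threadIdx.x] *= 2.0f; }
__global__ void setC(float* d) { d[threadIdx.x] -= 1.0f; }

int main() {
    float *d1, *d2;
    cudaMalloc(&d1, 256 * sizeof(float));
    cudaMalloc(&d2, 256 * sizeof(float));
    cudaMemset(d1, 0, 256 * sizeof(float));
    cudaMemset(d2, 0, 256 * sizeof(float));

    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);
    cudaEvent_t firstSetDone;
    cudaEventCreate(&firstSetDone);

    // First set of CTAs; record an event at its completion.
    setA<<<1, 256, 0, s1>>>(d1);
    cudaEventRecord(firstSetDone, s1);

    // Second set: delayed until the first set is complete.
    cudaStreamWaitEvent(s2, firstSetDone, 0);
    setB<<<1, 256, 0, s2>>>(d1);

    // Third set has no dependency on the first, so it may run while
    // the second set is still waiting -- effectively an out-of-order
    // launch relative to submission order.
    setC<<<1, 256, 0, s3>>>(d2);

    cudaDeviceSynchronize();
    cudaFree(d1);
    cudaFree(d2);
    return 0;
}
```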
Abstract:
Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across the processing cores available in a particular system implementation.
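As a concrete illustration (not drawn from the patent text), the CUDA sketch below decomposes an element-wise problem the same way: one thread per element at the lowest level, one thread array per tile, and a grid of independent thread arrays covering the whole problem, so the identical launch scales across however many cores a system provides.

```cuda
#include <cuda_runtime.h>

// Lowest level: each thread solves one element.
// Middle level: each thread array (block) covers one tile.
// Top level: a grid of independent thread arrays covers the problem.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void launch(float a, const float* x, float* y, int n) {
    int threadsPerArray = 256;
    int arraysInGrid = (n + threadsPerArray - 1) / threadsPerArray;
    // The same launch runs unchanged whether the part has 2 cores or
    // 80; the hardware distributes the independent thread arrays.
    saxpy<<<arraysInGrid, threadsPerArray>>>(a, x, y, n);
}
```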
Abstract:
Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.
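A minimal CUDA sketch of those three uses of the thread ID (my illustration, not code from the patent): the ID selects each thread's input element, indexes its slot among the shared intermediates, and determines its output element.

```cuda
// Must be launched with 256 threads per block, with in/out sized to
// gridDim.x * 256 elements. Each thread's ID selects (1) its portion
// of the input, (2) its slot for the shared intermediate, and
// (3) its portion of the output.
__global__ void blockReverse(const float* in, float* out) {
    __shared__ float tile[256];            // shared intermediates
    int t = threadIdx.x;                   // per-CTA thread ID
    int g = blockIdx.x * blockDim.x + t;   // global position
    tile[t] = in[g];                       // (1) input portion
    __syncthreads();                       // intermediates now visible
    out[g] = tile[blockDim.x - 1 - t];     // (2)+(3) share and emit
}
```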
Abstract:
A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reaches the barrier point.
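In CUDA, the barrier instruction the abstract describes surfaces as __syncthreads(); the sketch below (an illustration, not the patent's code) places it between the phases of a tree reduction, so that no thread reads a partial sum before every thread has written its own.

```cuda
// Tree reduction over one CTA: each phase halves the active threads,
// and the barrier suspends faster threads until all have reached it.
// Launch as: ctaSum<<<blocks, threads, threads * sizeof(float)>>>(...)
__global__ void ctaSum(const float* in, float* out, int n) {
    extern __shared__ float partial[];
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;
    partial[t] = (g < n) ? in[g] : 0.0f;
    __syncthreads();                           // all loads complete
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();                       // phase done for all
    }
    if (t == 0) out[blockIdx.x] = partial[0];  // one result per CTA
}
```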
Abstract:
One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact on the rest of the system. Additionally, memory barrier requests may specify the level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to the level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to the level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute a memory barrier instruction varies with its type.
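The three levels map directly onto CUDA's fence intrinsics. The producer sketch below is my illustration of choosing a fence scope before publishing a result (the names result and ready are hypothetical; a consumer elsewhere would poll the flag); it uses only the documented intrinsics __threadfence_block(), __threadfence(), and __threadfence_system(). Consistent with the abstract's latency remark, the wider the scope, the more expensive the fence.

```cuda
__device__ float result;          // payload written by the producer
__device__ volatile int ready;    // publication flag, initially 0

__global__ void produce(float v) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        result = v;                 // memory transaction to commit
        // Pick the one scope matching who must observe the write:
        //   __threadfence_block();  // CTA level: threads sharing the L1
        //   __threadfence_system(); // system level: host and peer GPUs
        __threadfence();             // device level: all threads sharing
                                     // global memory (chosen here)
        ready = 1;                  // publish only after the fence
    }
}
```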
Abstract:
A parallel thread processor executes thread groups belonging to multiple cooperative thread arrays (CTAs). At each cycle of the parallel thread processor, an instruction scheduler selects a thread group to be issued for execution during a subsequent cycle. The instruction scheduler selects a thread group to issue for execution by (i) identifying a pool of available thread groups, (ii) identifying a CTA that has the greatest seniority value, and (iii) selecting the thread group that has the greatest credit value from within the CTA with the greatest seniority value.
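A host-side model of that two-step pick may make the policy concrete; this is entirely illustrative, the struct fields and names are hypothetical, and the real selection happens in scheduler hardware each cycle.

```cuda
#include <vector>

struct ThreadGroup { int ctaId; int credit; bool ready; };
struct CtaState    { int id;    int seniority; };

// (i) pool of available thread groups -> (ii) most-senior CTA that
// has a ready group -> (iii) highest-credit ready group in that CTA.
int pickNextThreadGroup(const std::vector<CtaState>& ctas,
                        const std::vector<ThreadGroup>& pool) {
    int bestCta = -1, bestSen = -1;
    for (const CtaState& c : ctas)
        for (const ThreadGroup& g : pool)
            if (g.ready && g.ctaId == c.id && c.seniority > bestSen) {
                bestSen = c.seniority;
                bestCta = c.id;
            }
    int pick = -1, bestCredit = -1;
    for (int i = 0; i < (int)pool.size(); ++i)
        if (pool[i].ready && pool[i].ctaId == bestCta &&
            pool[i].credit > bestCredit) {
            bestCredit = pool[i].credit;
            pick = i;
        }
    return pick;  // index of the group to issue next cycle, or -1
}
```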