-
1.
Publication No.: US09223578B2
Publication Date: 2015-12-29
Application No.: US12887081
Filing Date: 2010-09-21
CPC Classes: G06F9/3834, G06F9/3004, G06F9/30087, G06F9/3851
Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
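A minimal Python sketch of the coalescing idea in the abstract above. The scope constants (CTA for threads sharing an L1 cache, GL for threads sharing global memory, SYS for all system memories) and the rule that a wider barrier subsumes a narrower one are assumptions for illustration, not taken from the claims:

```python
from dataclasses import dataclass, field

# Assumed scope levels: a numerically higher level covers a wider set of threads.
CTA, GL, SYS = 1, 2, 3

@dataclass
class BarrierCoalescer:
    """Toy model: merge pending membar requests from one processing unit."""
    pending: dict = field(default_factory=dict)  # thread id -> requested level

    def request(self, thread_id: int, level: int) -> None:
        # Keep only the widest scope each thread has asked for so far.
        self.pending[thread_id] = max(self.pending.get(thread_id, 0), level)

    def flush(self) -> int:
        """Emit one coalesced barrier wide enough for all pending requests."""
        level = max(self.pending.values(), default=0)
        self.pending.clear()
        return level
```

Three threads requesting CTA, CTA, and GL barriers would thus be served by a single GL-level barrier instead of three separate ones.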
-
2.
Publication No.: US20110078692A1
Publication Date: 2011-03-31
Application No.: US12887081
Filing Date: 2010-09-21
IPC Classes: G06F9/46
CPC Classes: G06F9/3834, G06F9/3004, G06F9/30087, G06F9/3851
Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
-
3.
Publication No.: US20110072213A1
Publication Date: 2011-03-24
Application No.: US12888409
Filing Date: 2010-09-22
CPC Classes: G06F9/3887, G06F9/30043, G06F9/3009, G06F9/3836, G06F12/0811, G06F12/0862, G06F12/0875, G06F12/0897, G06F12/121, G06F2212/452
Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction from a scheduler unit, where the instruction comprises a load instruction or a store instruction; determining that the instruction includes a cache operations modifier that identifies a policy for caching data associated with the instruction at one or more levels of the parallel cache hierarchy; and executing the instruction and caching the data associated with the instruction based on the cache operations modifier.
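A toy Python sketch of a cache operations modifier steering which hierarchy levels are populated on a load. The modifier names ("ca" for cache-at-all-levels, "cg" for cache-at-global-level only, bypassing L1) are assumptions modeled on PTX-style cache operators and are not taken from the patent text:

```python
class CacheHierarchy:
    """Toy two-level hierarchy; the modifier picks which levels get filled."""

    def __init__(self, memory):
        self.l1, self.l2, self.memory = {}, {}, dict(memory)

    def load(self, addr, cache_op="ca"):
        if cache_op == "ca" and addr in self.l1:
            return self.l1[addr]            # L1 hit, only legal for .ca loads
        if addr in self.l2:
            value = self.l2[addr]
        else:
            value = self.memory[addr]
            self.l2[addr] = value           # L2 is filled regardless of modifier
        if cache_op == "ca":
            self.l1[addr] = value           # L1 is filled only under .ca policy
        return value
```

A "cg" load returns the data but leaves L1 untouched, which is the kind of per-instruction policy choice the abstract describes.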
-
4.
Publication No.: US09639479B2
Publication Date: 2017-05-02
Application No.: US12888409
Filing Date: 2010-09-22
IPC Classes: G06F12/121, G06F12/0811, G06F12/0862, G06F9/30
CPC Classes: G06F9/3887, G06F9/30043, G06F9/3009, G06F9/3836, G06F12/0811, G06F12/0862, G06F12/0875, G06F12/0897, G06F12/121, G06F2212/452
Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction from a scheduler unit, where the instruction comprises a load instruction or a store instruction; determining that the instruction includes a cache operations modifier that identifies a policy for caching data associated with the instruction at one or more levels of the parallel cache hierarchy; and executing the instruction and caching the data associated with the instruction based on the cache operations modifier.
-
5.
Publication No.: US08522000B2
Publication Date: 2013-08-27
Application No.: US12569831
Filing Date: 2009-09-29
Applicant: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
Inventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
IPC Classes: G06F9/00
CPC Classes: G06F9/327, G06F9/3851, G06F9/3861
Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.
-
6.
Publication No.: US20110078427A1
Publication Date: 2011-03-31
Application No.: US12569831
Filing Date: 2009-09-29
Applicant: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
Inventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
IPC Classes: G06F9/38
CPC Classes: G06F9/327, G06F9/3851, G06F9/3861
Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.
-
7.
Publication No.: US08700877B2
Publication Date: 2014-04-15
Application No.: US12890518
Filing Date: 2010-09-24
CPC Classes: G06F12/0284, G06F9/3851, G06F12/0607
Abstract: A method for thread address mapping in a parallel thread processor. The method includes receiving a thread address associated with a first thread in a thread group; computing an effective address based on a location of the thread address within a local window of a thread address space; computing a thread group address in an address space associated with the thread group based on the effective address and a thread identifier associated with a first thread; and computing a virtual address associated with the first thread based on the thread group address and a thread group identifier, where the virtual address is used to access a location in a memory associated with the thread address to load or store data.
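The three mapping steps in the abstract above can be sketched as arithmetic. The window base, the per-thread interleaving scheme, and the per-group stride below are all assumptions for illustration; the abstract does not specify concrete parameters:

```python
# Assumed parameters of the mapping (not given in the abstract).
WINDOW_BASE  = 0x1000   # start of the local window in the thread address space
GROUP_SIZE   = 32       # threads per thread group (e.g. one warp)
GROUP_STRIDE = 0x10000  # virtual address space reserved for each thread group

def thread_to_virtual(thread_addr: int, thread_id: int, group_id: int) -> int:
    # Step 1: effective address = offset of the thread address in the window.
    effective = thread_addr - WINDOW_BASE
    # Step 2: thread group address, interleaving per-thread data so that
    # consecutive threads touch consecutive words of group address space.
    group_addr = effective * GROUP_SIZE + thread_id
    # Step 3: virtual address = the group's base plus the group address.
    return group_id * GROUP_STRIDE + group_addr
```

Under this interleaving, thread 0 and thread 1 of the same group accessing the same thread address land in adjacent words, which is the usual motivation for such a mapping.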
-
8.
Publication No.: US07627723B1
Publication Date: 2009-12-01
Application No.: US11533896
Filing Date: 2006-09-21
CPC Classes: G06F13/4022, G06F9/3001, G06F9/30018, G06F9/30021, G06F9/3004, G06F9/30087, G06F9/3824, G06F9/3834, G06F9/3851, G06F9/3887, G06F9/526, G06F2209/521, G06T1/20, G09G5/363, G09G5/393
Abstract: Methods, apparatuses, and systems are presented for updating data in memory while executing multiple threads of instructions, involving receiving a single instruction from one of a plurality of concurrently executing threads of instructions, in response to the single instruction received, reading data from a specific memory location, performing an operation involving the data read from the memory location to generate a result, and storing the result to the specific memory location, without requiring separate load and store instructions, and in response to the single instruction received, precluding another one of the plurality of threads of instructions from altering data at the specific memory location while reading of the data from the specific memory location, performing the operation involving the data, and storing the result to the specific memory location.
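A small Python model of the single-instruction atomic read-modify-write described above. The lock stands in for the hardware guarantee that no other thread can alter the location between the read, the operation, and the write-back; the class and method names are illustrative only:

```python
import threading

class AtomicMemory:
    """Toy model of a memory supporting single-instruction read-modify-write."""

    def __init__(self):
        self._mem = {}
        self._lock = threading.Lock()

    def atomic_rmw(self, addr, op):
        # Other threads are precluded from touching addr for the whole
        # read -> operate -> store sequence, as in the abstract.
        with self._lock:
            old = self._mem.get(addr, 0)
            self._mem[addr] = op(old)
            return old   # atomics conventionally return the prior value
```

With this guarantee, many threads incrementing the same location lose no updates, which separate load and store instructions cannot ensure.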
-
9.
Publication No.: US08817031B2
Publication Date: 2014-08-26
Application No.: US12894001
Filing Date: 2010-09-29
IPC Classes: G06F15/80
CPC Classes: G06T1/00
Abstract: A technique for performing stream output operations in a parallel processing system is disclosed. A stream synchronization unit is provided that enables the parallel processing unit to track batches of vertices being processed in a graphics processing pipeline. A plurality of stream output units is also provided, where each stream output unit writes vertex attribute data to one or more stream output buffers for a portion of the batches of vertices. A messaging protocol is implemented between the stream synchronization unit and the plurality of stream output units that ensures that each of the stream output units writes vertex attribute data for the particular batch of vertices distributed to that particular stream output unit in the same order in the stream output buffers as the order in which the batch of vertices was received from a device driver by the parallel processing unit.
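One common way to enforce the ordering property described above is a sequence-number hold-back scheme, sketched in Python below. This is an assumed mechanism for illustration, not the protocol from the patent; output units may complete batches in any order, but writes are committed to the buffer strictly in driver-submission order:

```python
class StreamSync:
    """Toy synchronization unit: commit completed batches in arrival order."""

    def __init__(self):
        self.next_batch = 0     # next batch id allowed to commit
        self.held = {}          # batch id -> data finished but held back
        self.buffer = []        # the stream output buffer, in batch order

    def batch_complete(self, batch_id, data):
        self.held[batch_id] = data
        # Drain every batch that is now allowed to commit, in order.
        while self.next_batch in self.held:
            self.buffer.extend(self.held.pop(self.next_batch))
            self.next_batch += 1
```

If batch 1 finishes before batch 0, its data is held until batch 0 commits, so the buffer always reflects driver order.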
-
10.
Publication No.: US20120280992A1
Publication Date: 2012-11-08
Application No.: US13461666
Filing Date: 2012-05-01
Applicant: Michael C. Shebanow, Anjul Patney
Inventors: Michael C. Shebanow, Anjul Patney
IPC Classes: G06T17/00
CPC Classes: G06T11/40
Abstract: The grid walk sampling technique is an efficient sampling algorithm aimed at optimizing the cost of triangle rasterization for modern graphics workloads. Grid walk sampling is an iterative rasterization algorithm that intelligently tests the intersection of triangle edges with multi-cell grids, determining coverage for a grid cell while identifying other cells in the grid that are either fully covered or fully uncovered by the triangle. Grid walk sampling rasterizes triangles using fewer computations and simpler computations compared with conventional highly parallel rasterizers. Therefore, a rasterizer employing grid walk sampling may compute sample coverage of triangles more efficiently in terms of power and circuitry die area compared with conventional highly parallel rasterizers.
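The core primitive behind classifying grid cells as fully covered, fully uncovered, or partially covered is the edge-function test, sketched below in Python. This is a standard rasterization building block used for illustration, not the patent's specific walk order; boundary samples are counted as inside for simplicity (no top-left rule):

```python
def edge(p0, p1, x, y):
    # Signed area test: > 0 when (x, y) lies to the left of the
    # directed edge p0 -> p1 (inside, for a counter-clockwise triangle).
    return (p1[0] - p0[0]) * (y - p0[1]) - (p1[1] - p0[1]) * (x - p0[0])

def classify_cell(tri, x0, y0, x1, y1):
    """Classify one grid cell against a counter-clockwise triangle."""
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    edges = [(tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])]
    fully_in = True
    for p0, p1 in edges:
        vals = [edge(p0, p1, x, y) for x, y in corners]
        if all(v < 0 for v in vals):
            return "uncovered"      # the whole cell is outside this edge
        if any(v < 0 for v in vals):
            fully_in = False        # this edge cuts through the cell
    return "covered" if fully_in else "partial"
```

Cells classified "covered" or "uncovered" need no further per-sample tests, which is where the computation savings described in the abstract come from; only "partial" cells require finer-grained work.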
-