Sharing data crossbar for reads and writes in a data cache
    3.
    发明授权
    Sharing data crossbar for reads and writes in a data cache 有权
    在数据高速缓存中共享用于读写数据的交叉开关

    公开(公告)号:US09286256B2

    公开(公告)日:2016-03-15

    申请号:US12892862

    申请日:2010-09-28

    CPC分类号: G06F13/4022 G06F13/4031

    摘要: The invention sets forth an L1 cache architecture that includes a crossbar unit configured to transmit data associated with both read data requests and write data requests. Data associated with read data requests is retrieved from a cache memory and transmitted to the client subsystems. Similarly, data associated with write data requests is transmitted from the client subsystems to the cache memory. To allow for the transmission of both read and write data on the crossbar unit, an arbiter is configured to schedule the crossbar unit transmissions as well and arbitrate between data requests received from the client subsystems.

    摘要翻译: 本发明提出了一种L1缓存架构,其包括被配置为发送与读取数据请求和写入数据请求相关联的数据的交叉单元。 与读取数据请求相关联的数据从高速缓冲存储器检索并发送到客户机子系统。 类似地,与写数据请求相关联的数据从客户端子系统发送到高速缓冲存储器。 为了允许在交叉开关单元上传输读取和写入数据,仲裁器被配置为调度交叉单元传输以及在从客户端子系统接收的数据请求之间进行仲裁。

    Sharing Data Crossbar for Reads and Writes in a Data Cache
    4.
    发明申请
    Sharing Data Crossbar for Reads and Writes in a Data Cache 有权
    共享数据交叉开关用于在数据缓存中进行读写

    公开(公告)号:US20110082961A1

    公开(公告)日:2011-04-07

    申请号:US12892862

    申请日:2010-09-28

    IPC分类号: G06F13/36 G06F13/00

    CPC分类号: G06F13/4022 G06F13/4031

    摘要: The invention sets forth an L1 cache architecture that includes a crossbar unit configured to transmit data associated with both read data requests and write data requests. Data associated with read data requests is retrieved from a cache memory and transmitted to the client subsystems. Similarly, data associated with write data requests is transmitted from the client subsystems to the cache memory. To allow for the transmission of both read and write data on the crossbar unit, an arbiter is configured to schedule the crossbar unit transmissions as well and arbitrate between data requests received from the client subsystems.

    摘要翻译: 本发明提出了一种L1缓存架构,其包括被配置为发送与读取数据请求和写入数据请求相关联的数据的交叉单元。 与读取数据请求相关联的数据从高速缓冲存储器检索并发送到客户机子系统。 类似地,与写数据请求相关联的数据从客户端子系统发送到高速缓冲存储器。 为了允许在交叉开关单元上传输读取和写入数据,仲裁器被配置为调度交叉单元传输以及在从客户端子系统接收的数据请求之间进行仲裁。

    Unified addressing and instructions for accessing parallel memory spaces
    5.
    发明授权
    Unified addressing and instructions for accessing parallel memory spaces 有权
    统一寻址和访问并行存储空间的指令

    公开(公告)号:US08271763B2

    公开(公告)日:2012-09-18

    申请号:US12567637

    申请日:2009-09-25

    IPC分类号: G06F12/10

    摘要: One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces.

    摘要翻译: 本发明的一个实施例提出了一种用于将多个不同的并行存储器空间的寻址统一为用于线程的单个地址空间的技术。 统一的存储空间地址被转换为访问该线程的并行存储器空间之一的地址。 可以使用单一类型的加载或存储指令,其指定线程的统一存储器空间地址,而不是使用不同类型的加载或存储指令来访问每个不同的并行存储器空间。

    Unified Addressing and Instructions for Accessing Parallel Memory Spaces
    6.
    发明申请
    Unified Addressing and Instructions for Accessing Parallel Memory Spaces 有权
    统一寻址和访问并行内存空间的说明

    公开(公告)号:US20110078406A1

    公开(公告)日:2011-03-31

    申请号:US12567637

    申请日:2009-09-25

    IPC分类号: G06F12/10

    摘要: One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces.

    摘要翻译: 本发明的一个实施例提出了一种用于将多个不同的并行存储器空间的寻址统一为用于线程的单个地址空间的技术。 统一的存储空间地址被转换为访问该线程的并行存储器空间之一的地址。 可以使用单一类型的加载或存储指令,其指定线程的统一存储器空间地址,而不是使用不同类型的加载或存储指令来访问每个不同的并行存储器空间。

    Cooperative thread array reduction and scan operations
    7.
    发明授权
    Cooperative thread array reduction and scan operations 有权
    合作线程数组减少和扫描操作

    公开(公告)号:US08539204B2

    公开(公告)日:2013-09-17

    申请号:US12890227

    申请日:2010-09-24

    IPC分类号: G06F9/30 G06F9/40 G06F15/00

    摘要: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.

    摘要翻译: 本发明的一个实施例提出了一种用于跨独立执行的多个线程执行聚合操作的技术。 聚合被指定为屏障同步或屏障到达指令的一部分,其中除了执行屏障同步或到达之外,指令聚合(使用缩减或扫描操作)由每个线程提供的值。 当线程执行屏障聚合指令时,线程有助于扫描或缩小结果,并等待执行任何更多指令,直到所有线程都执行了阻挡聚合指令为止。 在所有线程执行了屏障聚合指令之后,向每个线程传送减少结果,并且当线程执行屏障聚合指令时,将扫描结果传送给每个线程。

    EFFICIENT IMPLEMENTATION OF ARRAYS OF STRUCTURES ON SIMT AND SIMD ARCHITECTURES
    8.
    发明申请
    EFFICIENT IMPLEMENTATION OF ARRAYS OF STRUCTURES ON SIMT AND SIMD ARCHITECTURES 有权
    对SIMT和SIMD建筑结构的有效实施

    公开(公告)号:US20120089792A1

    公开(公告)日:2012-04-12

    申请号:US13247855

    申请日:2011-09-28

    IPC分类号: G06F12/00

    摘要: One embodiment of the present invention sets forth a technique providing an optimized way to allocate and access memory across a plurality of thread/data lanes. Specifically, the device driver receives an instruction targeted to a memory set up as an array of structures of arrays. The device driver computes an address within the memory using information about the number of thread/data lanes and parameters from the instruction itself. The result is a memory allocation and access approach where the device driver properly computes the target address in the memory. Advantageously, processing efficiency is improved where memory in a parallel processing subsystem is internally stored and accessed as an array of structures of arrays, proportional to the SIMT/SIMD group width (the number of threads or lanes per execution group).

    摘要翻译: 本发明的一个实施例提出了一种技术,其提供了一种在多个线程/数据通道上分配和访问存储器的优化方式。 具体来说,设备驱动程序接收到作为阵列结构的阵列设置的存储器的指令。 设备驱动程序使用关于指令本身的线程/数据通道数和参数的信息来计算存储器中的地址。 结果是存储器分配和访问方法,其中设备驱动器正确地计算存储器中的目标地址。 有利的是,处理效率得到改善,其中并行处理子系统中的存储器被内部存储和访问为与SIMT / SIMD组宽度(每个执行组的线程或通道数)成比例的阵列结构的阵列。

    COOPERATIVE THREAD ARRAY REDUCTION AND SCAN OPERATIONS
    9.
    发明申请
    COOPERATIVE THREAD ARRAY REDUCTION AND SCAN OPERATIONS 有权
    合作螺线减排和扫描作业

    公开(公告)号:US20110078417A1

    公开(公告)日:2011-03-31

    申请号:US12890227

    申请日:2010-09-24

    IPC分类号: G06F9/38

    摘要: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.

    摘要翻译: 本发明的一个实施例提出了一种用于跨独立执行的多个线程执行聚合操作的技术。 聚合被指定为屏障同步或屏障到达指令的一部分,其中除了执行屏障同步或到达之外,指令聚合(使用缩减或扫描操作)由每个线程提供的值。 当线程执行屏障聚合指令时,线程有助于扫描或缩小结果,并等待执行任何更多指令,直到所有线程都执行了阻挡聚合指令为止。 在所有线程执行了屏障聚合指令之后,向每个线程传送减少结果,并且当线程执行屏障聚合指令时,将扫描结果传送给每个线程。

    Coalescing memory barrier operations across multiple parallel threads
    10.
    发明授权
    Coalescing memory barrier operations across multiple parallel threads 有权
    在多个并行线程之间合并记忆障碍操作

    公开(公告)号:US09223578B2

    公开(公告)日:2015-12-29

    申请号:US12887081

    申请日:2010-09-21

    IPC分类号: G06F9/46 G06F9/38 G06F9/30

    摘要: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.

    摘要翻译: 本发明的一个实施例提出了一种用于在多个并行线程之间聚合存储器屏障操作的技术。 来自给定并行线程处理单元的存储器屏障请求被合并以减少对系统其余部分的影响。 此外,存储器屏障请求可以指定针对其提交内存事务的一组线程的级别。 例如,第一类型的存储器障碍指令可以将存储器事务提交到共享L1(一级)高速缓存的一组协作线程的级别。 第二种类型的存储器障碍指令可以将存储器事务提交到共享全局存储器的一组线程的级别。 最后,第三种类型的存储器障碍指令可以将存储器事务提交到共享所有系统存储器的所有线程的系统级。 执行存储器屏障指令所需的延迟基于存储器屏障指令的类型而变化。