Coalescing memory barrier operations across multiple parallel threads
    1.
    Granted Patent
    Coalescing memory barrier operations across multiple parallel threads (In force)

    Publication Number: US09223578B2

    Publication Date: 2015-12-29

    Application Number: US12887081

    Filing Date: 2010-09-21

    IPC Classification: G06F9/46 G06F9/38 G06F9/30

    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact on the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to the level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to the level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to the system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.

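    The three barrier scopes and the per-unit coalescing described in the abstract can be sketched as a small simulation. This is an illustrative model, not the patented hardware; the names `BarrierScope` and `BarrierCoalescer` are invented for the sketch.

```python
from enum import IntEnum

class BarrierScope(IntEnum):
    # Ordered by visibility: a wider scope implies a longer commit latency.
    CTA = 1      # threads sharing an L1 cache (a set of cooperating threads)
    GPU = 2      # all threads sharing a global memory
    SYSTEM = 3   # all threads sharing all system memories

class BarrierCoalescer:
    """Coalesces memory barrier requests issued by the threads of one
    parallel thread processing unit into a single downstream operation."""

    def __init__(self):
        self.pending = {}  # unit id -> widest scope requested so far

    def request(self, unit, scope):
        # Merge with any pending request from the same unit; committing
        # at the widest requested scope satisfies all narrower requests.
        cur = self.pending.get(unit)
        self.pending[unit] = scope if cur is None else max(cur, scope)

    def flush(self):
        # Emit one coalesced barrier per unit instead of one per thread.
        out = sorted(self.pending.items())
        self.pending.clear()
        return out
```

    Because the widest scope subsumes the narrower ones, 33 per-thread requests from one unit collapse into a single GPU-level barrier: one downstream operation per processing unit rather than one per thread, which is the stated point of coalescing.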

    COALESCING MEMORY BARRIER OPERATIONS ACROSS MULTIPLE PARALLEL THREADS
    2.
    Patent Application
    COALESCING MEMORY BARRIER OPERATIONS ACROSS MULTIPLE PARALLEL THREADS (In force)

    Publication Number: US20110078692A1

    Publication Date: 2011-03-31

    Application Number: US12887081

    Filing Date: 2010-09-21

    IPC Classification: G06F9/46

    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact on the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to the level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to the level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to the system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.


    Configurable cache for multiple clients
    4.
    Granted Patent
    Configurable cache for multiple clients (In force)

    Publication Number: US08595425B2

    Publication Date: 2013-11-26

    Application Number: US12567445

    Filing Date: 2009-09-25

    IPC Classification: G06F12/00

    Abstract: One embodiment of the present invention sets forth a technique for providing an L1 cache that is a central storage resource. The L1 cache services multiple clients with diverse latency and bandwidth requirements. The L1 cache may be reconfigured to create multiple storage spaces, enabling the L1 cache to replace dedicated buffers, caches, and FIFOs in previous architectures. A “direct mapped” storage region that is configured within the L1 cache may replace dedicated buffers, FIFOs, and interface paths, allowing clients of the L1 cache to exchange attribute and primitive data. The direct mapped storage region may be used as a global register file. A “local and global cache” storage region configured within the L1 cache may be used to support load/store memory requests to multiple spaces. These spaces include global, local, and call-return stack (CRS) memory.

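    The reconfigurable partition in the abstract can be illustrated with a toy model: one storage pool split into an index-addressed direct-mapped region and a tagged region that caches load/store traffic for the global, local, and CRS spaces. All class and method names here are invented for illustration; the eviction policy is deliberately naive.

```python
class ConfigurableL1:
    """Toy model of an L1 storage pool repartitioned into a
    direct-mapped region (standing in for dedicated buffers and FIFOs)
    and a tagged cache region for load/store requests."""

    def __init__(self, total_lines, direct_lines):
        assert 0 <= direct_lines <= total_lines
        self.direct = [None] * direct_lines  # index-addressed slots
        self.cache = {}                      # (space, addr) -> cached value
        self.cache_capacity = total_lines - direct_lines

    # Direct-mapped region: clients exchange attribute/primitive data
    # by slot index; no tags or eviction involved.
    def direct_write(self, slot, data):
        self.direct[slot] = data

    def direct_read(self, slot):
        return self.direct[slot]

    # Cache region: backs load requests to the 'global', 'local', or
    # 'crs' (call-return stack) spaces.
    def load(self, space, addr, backing):
        key = (space, addr)
        if key not in self.cache:
            if len(self.cache) >= self.cache_capacity:
                # Naive eviction: drop the oldest inserted entry.
                self.cache.pop(next(iter(self.cache)))
            self.cache[key] = backing[addr]
        return self.cache[key]
```

    Changing the `direct_lines` argument at construction models reconfiguring the split between buffer-like storage and cache storage without adding any dedicated hardware structures.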

    CONFIGURABLE CACHE FOR MULTIPLE CLIENTS
    5.
    Patent Application
    CONFIGURABLE CACHE FOR MULTIPLE CLIENTS (In force)

    Publication Number: US20110078367A1

    Publication Date: 2011-03-31

    Application Number: US12567445

    Filing Date: 2009-09-25

    IPC Classification: G06F12/02 G06F12/00 G06F12/08

    Abstract: One embodiment of the present invention sets forth a technique for providing an L1 cache that is a central storage resource. The L1 cache services multiple clients with diverse latency and bandwidth requirements. The L1 cache may be reconfigured to create multiple storage spaces, enabling the L1 cache to replace dedicated buffers, caches, and FIFOs in previous architectures. A “direct mapped” storage region that is configured within the L1 cache may replace dedicated buffers, FIFOs, and interface paths, allowing clients of the L1 cache to exchange attribute and primitive data. The direct mapped storage region may be used as a global register file. A “local and global cache” storage region configured within the L1 cache may be used to support load/store memory requests to multiple spaces. These spaces include global, local, and call-return stack (CRS) memory.


    Dynamic bank mode addressing for memory access
    6.
    Granted Patent
    Dynamic bank mode addressing for memory access (In force)

    Publication Number: US09262174B2

    Publication Date: 2016-02-16

    Application Number: US13440945

    Filing Date: 2012-04-05

    IPC Classification: G06F13/00 G06F13/28 G06F9/38

    CPC Classification: G06F9/3887 G06F9/3851

    Abstract: One embodiment sets forth a technique for dynamically mapping addresses to banks of a multi-bank memory based on a bank mode. Application programs may be configured to read and write a memory accessing different numbers of bits per bank, e.g., 32 bits per bank, 64 bits per bank, or 128 bits per bank. On each clock cycle, an access request may be received from one of the application programs, and the per-processing-thread addresses of the access request are dynamically mapped based on the bank mode to produce a set of bank addresses. The bank addresses are then used to access the multi-bank memory. Allowing different bank mappings enables each application program to avoid bank conflicts when the memory is accessed, compared with using a single bank mapping for all accesses.

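    The effect of the bank mode on conflicts can be shown numerically. The sketch below assumes a 32-bank memory (a common configuration, not a detail stated in the abstract) and maps byte addresses to banks under a configurable bank width.

```python
from collections import Counter

NUM_BANKS = 32  # sketch assumption: 32 independently addressable banks

def bank_of(byte_addr, bank_bytes):
    """Map a byte address to a bank index under the given bank mode
    (bank_bytes = 4, 8, or 16 for 32-, 64-, or 128-bit banks)."""
    return (byte_addr // bank_bytes) % NUM_BANKS

def conflict_degree(addrs, bank_bytes):
    """Worst-case serialization for one parallel access: the maximum
    number of distinct addresses that map to the same bank."""
    counts = Counter(bank_of(a, bank_bytes) for a in set(addrs))
    return max(counts.values())
```

    For 32 threads reading consecutive 64-bit words, the 4-byte bank mode lands two addresses in every bank (a two-way conflict that serializes the access), while the 8-byte mode spreads them one per bank, which is why letting each application pick its mapping beats a single fixed mapping.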

    MECHANISM FOR TRACKING AGE OF COMMON RESOURCE REQUESTS WITHIN A RESOURCE MANAGEMENT SUBSYSTEM
    7.
    Patent Application
    MECHANISM FOR TRACKING AGE OF COMMON RESOURCE REQUESTS WITHIN A RESOURCE MANAGEMENT SUBSYSTEM (In force)

    Publication Number: US20130311686A1

    Publication Date: 2013-11-21

    Application Number: US13476825

    Filing Date: 2012-05-21

    IPC Classification: G06F5/00

    CPC Classification: H04L49/254 G06F9/46

    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when the needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.

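    Oldest-first wakeup from a total order queue can be sketched with age stamps. The class below is an illustrative reduction of the idea, not the patented TOQ; resource stealing and the replay cycles themselves are omitted.

```python
import itertools

class TotalOrderQueue:
    """Sketch: each access request receives a monotonically increasing
    age stamp on arrival; when a common resource becomes available, the
    oldest request sleeping on that resource is woken first, so newer
    requests cannot repeatedly overtake an older one."""

    def __init__(self):
        self._age = itertools.count()
        self._sleeping = []  # (age, request_id, resource)

    def enqueue(self, request_id, resource):
        # The request goes to sleep pending availability of `resource`.
        self._sleeping.append((next(self._age), request_id, resource))

    def wake_for(self, resource):
        # Grant the freed resource to the oldest waiter, if any.
        waiting = [r for r in self._sleeping if r[2] == resource]
        if not waiting:
            return None
        oldest = min(waiting)  # smallest age stamp wins
        self._sleeping.remove(oldest)
        return oldest[1]
```

    Selecting by minimum age stamp is what provides the fairness property in the abstract: a request that has waited longest is never starved by a stream of younger requests for the same resource.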

    DYNAMIC BANK MODE ADDRESSING FOR MEMORY ACCESS
    8.
    Patent Application
    DYNAMIC BANK MODE ADDRESSING FOR MEMORY ACCESS (In force)

    Publication Number: US20130268715A1

    Publication Date: 2013-10-10

    Application Number: US13440945

    Filing Date: 2012-04-05

    IPC Classification: G06F12/06

    CPC Classification: G06F9/3887 G06F9/3851

    Abstract: One embodiment sets forth a technique for dynamically mapping addresses to banks of a multi-bank memory based on a bank mode. Application programs may be configured to read and write a memory accessing different numbers of bits per bank, e.g., 32 bits per bank, 64 bits per bank, or 128 bits per bank. On each clock cycle, an access request may be received from one of the application programs, and the per-processing-thread addresses of the access request are dynamically mapped based on the bank mode to produce a set of bank addresses. The bank addresses are then used to access the multi-bank memory. Allowing different bank mappings enables each application program to avoid bank conflicts when the memory is accessed, compared with using a single bank mapping for all accesses.


    RESOURCE MANAGEMENT SUBSYSTEM THAT MAINTAINS FAIRNESS AND ORDER
    9.
    Patent Application
    RESOURCE MANAGEMENT SUBSYSTEM THAT MAINTAINS FAIRNESS AND ORDER (In force)

    Publication Number: US20130311999A1

    Publication Date: 2013-11-21

    Application Number: US13476791

    Filing Date: 2012-05-21

    IPC Classification: G06F9/50

    CPC Classification: G06F9/5011 G06F2209/507

    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when the needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.


    BATCHED REPLAYS OF DIVERGENT OPERATIONS
    10.
    Patent Application
    BATCHED REPLAYS OF DIVERGENT OPERATIONS (In force)

    Publication Number: US20130159684A1

    Publication Date: 2013-06-20

    Application Number: US13329066

    Filing Date: 2011-12-16

    IPC Classification: G06F9/38 G06F9/312

    CPC Classification: G06F9/3851 G06F9/3861

    Abstract: One embodiment of the present invention sets forth an optimized way to execute replay operations for divergent operations in a parallel processing subsystem. Specifically, the streaming multiprocessor (SM) includes a multistage pipeline configured to batch two or more replay operations for processing via a replay loop. A logic element within the multistage pipeline detects whether the current pipeline stage is accessing a shared resource, such as loading data from a shared memory. If the threads are accessing data distributed across multiple cache lines, the multistage pipeline batches two or more replay operations, where the replay operations are inserted into the pipeline back-to-back. Advantageously, divergent operations requiring two or more replay operations execute with reduced latency. Where memory access operations require the transfer of more than two cache lines to service all threads, the number of clock cycles required to complete all replay operations is reduced.

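    The batching arithmetic can be illustrated as follows. Assuming a 128-byte cache line and that the original issue services the first line touched (both assumptions of this sketch, not details from the abstract), batching replays in pairs halves the number of replay passes for a widely divergent access.

```python
CACHE_LINE = 128  # bytes; sketch assumption

def replay_batches(thread_addrs, batch_size=2):
    """Group the distinct cache lines touched by one divergent memory
    access into batches, so each pass through the replay loop services
    up to batch_size lines back-to-back instead of one."""
    lines = sorted({a // CACHE_LINE for a in thread_addrs})
    remaining = lines[1:]  # first line assumed served by the original issue
    return [remaining[i:i + batch_size]
            for i in range(0, len(remaining), batch_size)]
```

    An access touching five cache lines leaves four lines outstanding after the initial issue; batched two at a time, that is two replay passes through the pipeline instead of four, matching the abstract's claim of reduced clock cycles when more than two cache lines must be transferred.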