-
Publication No.: US08266382B1
Publication Date: 2012-09-11
Application No.: US12650214
Filing Date: 2009-12-30
Applicant: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Charles McCarver, Stewart Glenn Carlton, Anjana Rajendran, Yan Yan Tang
Inventor: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Charles McCarver, Stewart Glenn Carlton, Anjana Rajendran, Yan Yan Tang
CPC Classification: G06F13/28
Abstract: One embodiment of the present invention sets forth a technique for arbitrating requests received from one of the multiple clients of an L1 cache and for providing hints to the client to assist in arbitration. The L1 cache services multiple clients with diverse latency and bandwidth requirements and may be reconfigured to provide memory spaces for clients executing multiple parallel threads, where each memory space has a different scope.
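A minimal Python sketch of hint-assisted arbitration of the kind this abstract describes, assuming an age-based priority policy; the class names, request format, and hint encoding are all hypothetical, not taken from the patent.

from collections import deque

class Client:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # pending requests from this client
        self.hint = 0          # arbiter's estimate: grants ahead of us

class L1Arbiter:
    def __init__(self, clients):
        self.clients = clients

    def arbitrate(self):
        # Grant the client whose oldest request has waited longest.
        waiting = [c for c in self.clients if c.queue]
        if not waiting:
            return None
        by_age = sorted(waiting, key=lambda c: -c.queue[0]["age"])
        # Publish hints: each losing client learns how many grants stand
        # ahead of it, so it can throttle or reschedule its requests.
        for rank, c in enumerate(by_age):
            c.hint = rank
        return by_age[0].queue.popleft()

texture = Client("texture")
lsu = Client("load_store")
texture.queue.append({"age": 5, "addr": 0x100})
lsu.queue.append({"age": 2, "addr": 0x200})
arb = L1Arbiter([texture, lsu])
print(arb.arbitrate())  # texture's older request wins; lsu.hint is now 1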
-
Publication No.: US08335892B1
Publication Date: 2012-12-18
Application No.: US12650226
Filing Date: 2009-12-30
Applicant: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Charles McCarver, Stewart Glenn Carlton, Anjana Rajendran
Inventor: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Charles McCarver, Stewart Glenn Carlton, Anjana Rajendran
CPC Classification: G06F12/084, G06F12/0857
Abstract: One embodiment of the present invention sets forth a technique for arbitrating requests received by an L1 cache from multiple clients. The L1 cache outputs bubble requests to a first one of the multiple clients, causing that client to insert bubbles into its request stream, where a bubble is the absence of a request. The bubbles allow the L1 cache to grant access to another one of the multiple clients without stalling the first client. The L1 cache services multiple clients with diverse latency and bandwidth requirements and may be reconfigured to provide memory spaces for clients executing multiple parallel threads, where each memory space has a different scope.
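A minimal sketch of the bubble mechanism, assuming the cache may ask a non-stallable client to leave empty slots in its request stream; the one-slot-per-cycle model and all names are assumptions.

from collections import deque

BUBBLE = None  # a bubble is the absence of a request in a stream slot

class StreamingClient:
    """A client whose request stream cannot be stalled: one slot per cycle."""
    def __init__(self, name, requests):
        self.name = name
        self.pending = deque(requests)
        self.bubbles_owed = 0  # bubbles the cache has asked this client to insert

    def next_slot(self):
        if self.bubbles_owed:          # honor an outstanding bubble request
            self.bubbles_owed -= 1
            return BUBBLE
        return self.pending.popleft() if self.pending else BUBBLE

class L1BubbleArbiter:
    def request_bubbles(self, client, n):
        client.bubbles_owed += n       # ask the client for n empty slots

    def cycle(self, primary, secondary):
        slot = primary.next_slot()
        if slot is BUBBLE and secondary.pending:
            # The empty slot admits the other client without stalling primary.
            return (secondary.name, secondary.pending.popleft())
        return (primary.name, slot)

tex = StreamingClient("texture", ["t0", "t1", "t2"])
lsu = StreamingClient("lsu", ["l0"])
arb = L1BubbleArbiter()
arb.request_bubbles(tex, 1)
print([arb.cycle(tex, lsu) for _ in range(4)])
# cycle 1 services lsu through the bubble; t0..t2 follow unstalled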
-
Publication No.: US09952977B2
Publication Date: 2018-04-24
Application No.: US12890476
Filing Date: 2010-09-24
Applicant: Steven James Heinrich, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
Inventor: Steven James Heinrich, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
IPC Classification: G06F12/00, G06F12/0842, G06F12/0897
CPC Classification: G06F12/0842, G06F12/0897
Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction that includes a cache operations modifier identifying a level of the parallel cache hierarchy in which to cache data associated with the instruction, and implementing a cache replacement policy based on the cache operations modifier.
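The modifier-driven placement might look like the following Python sketch, with mnemonics modeled loosely on PTX-style cache operators (.ca, .cg, .cs, .cv); the mapping to levels and eviction priorities shown here is illustrative, not the patented policy.

from enum import Enum

class CacheOp(Enum):
    CA = "cache_all"        # cache at all levels (e.g., L1 and L2)
    CG = "cache_global"     # cache at L2 only, bypassing L1
    CS = "cache_streaming"  # cache, but mark evict-first: reuse unlikely
    CV = "cache_volatile"   # do not cache; fetch fresh on every access

class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}
    def insert(self, line, evict_priority):
        self.lines[line] = evict_priority   # input to the replacement policy

def handle_load(instr, l1, l2, line_bytes=128):
    line = instr["addr"] // line_bytes
    op = instr["cache_op"]
    if op is CacheOp.CA:
        l1.insert(line, "normal"); l2.insert(line, "normal")
    elif op is CacheOp.CG:
        l2.insert(line, "normal")           # L1 level skipped
    elif op is CacheOp.CS:
        l1.insert(line, "evict_first"); l2.insert(line, "evict_first")
    # CacheOp.CV: cached nowhere

l1, l2 = Cache("L1"), Cache("L2")
handle_load({"addr": 0x1080, "cache_op": CacheOp.CS}, l1, l2)
print(l1.lines, l2.lines)   # {33: 'evict_first'} {33: 'evict_first'}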
-
Publication No.: US20110078381A1
Publication Date: 2011-03-31
Application No.: US12890476
Filing Date: 2010-09-24
Applicant: Steven James Heinrich, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
Inventor: Steven James Heinrich, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
CPC Classification: G06F12/0842, G06F12/0897
Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction that includes a cache operations modifier identifying a level of the parallel cache hierarchy in which to cache data associated with the instruction, and implementing a cache replacement policy based on the cache operations modifier.
-
Publication No.: US08266383B1
Publication Date: 2012-09-11
Application No.: US12650189
Filing Date: 2009-12-30
Applicant: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Charles McCarver, Stewart Glenn Carlton, Ming Y. Siu, Yan Yan Tang, Robert J. Stoll
Inventor: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Charles McCarver, Stewart Glenn Carlton, Ming Y. Siu, Yan Yan Tang, Robert J. Stoll
CPC Classification: G06F12/0859, G06F12/084
Abstract: One embodiment of the present invention sets forth a technique for processing cache misses resulting from a request received from one of the multiple clients of an L1 cache. The L1 cache services multiple clients with diverse latency and bandwidth requirements, including at least one client whose requests cannot be stalled. The L1 cache includes storage to buffer pending requests arising from cache misses. When an entry is available to store a pending request, a request causing a cache miss is accepted. When the data for a read request becomes available, the cache instructs the client to resubmit the read request to receive the data. When an entry is not available to store a pending request, a request causing a cache miss is deferred and the cache provides the client with status information that is used to determine when the request should be resubmitted.
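A minimal sketch of the accept/defer behavior, assuming a fixed-capacity pending-request buffer and a client that resubmits on notification; the status fields and method names are hypothetical.

class L1MissHandler:
    """Bounded pending-request buffer for cache misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = {}   # addr -> "waiting" (fill in flight) | "ready"

    def handle_miss(self, addr):
        if addr in self.pending or len(self.pending) < self.capacity:
            self.pending.setdefault(addr, "waiting")   # accept the miss
            return "accepted"
        # No free entry: defer, and hand the client status it can use
        # to decide when the request should be resubmitted.
        return ("deferred", {"free_entries": 0, "in_flight": len(self.pending)})

    def fill_arrived(self, addr):
        self.pending[addr] = "ready"   # cache tells client: resubmit now

    def resubmit(self, addr):
        if self.pending.get(addr) == "ready":
            del self.pending[addr]     # entry freed once data is delivered
            return ("data", addr)
        return "not_ready"

mh = L1MissHandler(capacity=1)
print(mh.handle_miss(0x40))    # accepted
print(mh.handle_miss(0x80))    # ('deferred', ...) -- buffer full
mh.fill_arrived(0x40)
print(mh.resubmit(0x40))       # ('data', 64)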
-
Publication No.: US20130212364A1
Publication Date: 2013-08-15
Application No.: US13370173
Filing Date: 2012-02-09
Applicant: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
Inventor: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
CPC Classification: G06F9/3861, G06F9/3836, G06F9/3851, G06F9/3887
Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into the pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data distributed across multiple cache lines, the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, additional replay operations are inserted via the replay loop until all threads are serviced. One advantage of the disclosed technique is that divergent operations requiring one or more replay operations execute with reduced latency.
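A minimal sketch of pre-scheduled replay insertion, assuming 128-byte cache lines, a Python list standing in for the pipeline, and a pre-scheduling depth of two; all of these are assumptions for illustration.

LINE = 128  # assumed cache-line size in bytes

def distinct_lines(addrs):
    return sorted({a // LINE for a in addrs})

def issue(pipeline, instr, prescheduled=2):
    """Issue a possibly divergent load with pre-scheduled replays.

    The first line's access goes in as the instruction itself; up to
    (prescheduled - 1) replay ops are inserted directly behind it, so
    they flow down the multi-stage pipeline back-to-back. Any lines
    beyond that fall back to the replay loop.
    """
    lines = distinct_lines(instr["addrs"])
    pipeline.append(("exec", instr["op"], lines[0]))
    for ln in lines[1:prescheduled]:
        pipeline.append(("replay", instr["op"], ln))
    return lines[prescheduled:]   # leftovers for the replay loop

pipe = []
leftover = issue(pipe, {"op": "LD", "addrs": [0, 130, 260, 400]})
print(pipe)      # exec on line 0, pre-scheduled replay on line 1
print(leftover)  # lines [2, 3] still need the replay loop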
-
Publication No.: US10152329B2
Publication Date: 2018-12-11
Application No.: US13370173
Filing Date: 2012-02-09
Applicant: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
Inventor: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
IPC Classification: G06F9/38
Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into the pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data distributed across multiple cache lines, the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, additional replay operations are inserted via the replay loop until all threads are serviced. One advantage of the disclosed technique is that divergent operations requiring one or more replay operations execute with reduced latency.
-
Publication No.: US09286256B2
Publication Date: 2016-03-15
Application No.: US12892862
Filing Date: 2010-09-28
Applicant: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Stewart Glenn Carlton, John R. Nickolls
Inventor: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Stewart Glenn Carlton, John R. Nickolls
CPC Classification: G06F13/4022, G06F13/4031
Abstract: The invention sets forth an L1 cache architecture that includes a crossbar unit configured to transmit data associated with both read data requests and write data requests. Data associated with read data requests is retrieved from a cache memory and transmitted to the client subsystems. Similarly, data associated with write data requests is transmitted from the client subsystems to the cache memory. To allow for the transmission of both read and write data on the crossbar unit, an arbiter is configured both to schedule the crossbar unit transmissions and to arbitrate between data requests received from the client subsystems.
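A minimal sketch of sharing one crossbar between read-return and write data, assuming a single transfer per cycle and an oldest-first scheduling policy; both assumptions are mine, not the patent's arbiter.

from collections import deque

class Crossbar:
    def transfer(self, src, dst, data):
        return f"{src} -> {dst}: {data}"

class CrossbarArbiter:
    """One shared crossbar carries both read-return and write data."""
    def __init__(self):
        self.reads = deque()    # (t, client, data): cache -> client
        self.writes = deque()   # (t, client, data): client -> cache

    def cycle(self, xbar):
        # One transfer per cycle; the older head request wins, so
        # neither the read nor the write direction starves.
        pick_read = self.reads and (not self.writes
                                    or self.reads[0][0] <= self.writes[0][0])
        if pick_read:
            _, client, data = self.reads.popleft()
            return xbar.transfer("L1", client, data)
        if self.writes:
            _, client, data = self.writes.popleft()
            return xbar.transfer(client, "L1", data)
        return None   # crossbar idle this cycle

arb, xbar = CrossbarArbiter(), Crossbar()
arb.reads.append((0, "lsu", "line 0x40"))
arb.writes.append((1, "texture", "store 0x80"))
print(arb.cycle(xbar))   # L1 -> lsu: line 0x40
print(arb.cycle(xbar))   # texture -> L1: store 0x80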
-
Publication No.: US09626191B2
Publication Date: 2017-04-18
Application No.: US13335868
Filing Date: 2011-12-22
Applicant: Jack Hilaire Choquette, Michael Fetterman, Shirish Gadre, Xiaogang Qiu, Omkar Paranjape, Anjana Rajendran, Stewart Glenn Carlton, Eric Lyell Hill, Rajeshwaran Selvanesan, Douglas J. Hahn
Inventor: Jack Hilaire Choquette, Michael Fetterman, Shirish Gadre, Xiaogang Qiu, Omkar Paranjape, Anjana Rajendran, Stewart Glenn Carlton, Eric Lyell Hill, Rajeshwaran Selvanesan, Douglas J. Hahn
CPC Classification: G06F9/3851, G06F9/3012
Abstract: One embodiment of the present invention sets forth a technique for performing a shaped access of a register file that includes a set of N registers, wherein N is greater than or equal to two. The technique involves, for at least one thread included in a group of threads, receiving a request to access a first amount of data from each register in the set of N registers, and configuring a crossbar to allow the at least one thread to access the first amount of data from each register in the set of N registers.
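A minimal sketch of a shaped access, assuming a software model of the register file keyed by register and thread; the chunk masking and data layout are illustrative only, not the patented crossbar configuration.

def shaped_read(regfile, threads, reg_set, chunk_bytes):
    """regfile[r][t] is the value thread t holds in register r.

    For each participating thread, gather the requested chunk from
    every register in reg_set, as the crossbar would route them to
    the thread's lane in one shaped access.
    """
    mask = (1 << (8 * chunk_bytes)) - 1
    return {t: [regfile[r][t] & mask for r in reg_set] for t in threads}

# Two threads each pull the low 4 bytes from both registers of the
# set {R0, R1} (N = 2) in a single shaped access.
regfile = {
    "R0": {0: 0x11223344AABBCCDD, 1: 0x0000000000000001},
    "R1": {0: 0x0000000000000005, 1: 0x99990000FFFF0000},
}
print(shaped_read(regfile, threads=[0, 1], reg_set=["R0", "R1"], chunk_bytes=4))
# {0: [2864434397, 5], 1: [1, 4294901760]}  (0xAABBCCDD, 0x5, 0x1, 0xFFFF0000)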
-
Publication No.: US20130159684A1
Publication Date: 2013-06-20
Application No.: US13329066
Filing Date: 2011-12-16
Applicant: Michael Fetterman, Jack Hilaire Choquette, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Stewart Glenn Carlton, Rajeshwaran Selvanesan, Douglas J. Hahn, Steven James Heinrich
Inventor: Michael Fetterman, Jack Hilaire Choquette, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Stewart Glenn Carlton, Rajeshwaran Selvanesan, Douglas J. Hahn, Steven James Heinrich
CPC Classification: G06F9/3851, G06F9/3861
Abstract: One embodiment of the present invention sets forth an optimized way to execute replay operations for divergent operations in a parallel processing subsystem. Specifically, the streaming multiprocessor (SM) includes a multistage pipeline configured to batch two or more replay operations for processing via a replay loop. A logic element within the multistage pipeline detects whether the current pipeline stage is accessing a shared resource, such as loading data from a shared memory. If the threads are accessing data distributed across multiple cache lines, the multistage pipeline batches two or more replay operations, where the replay operations are inserted into the pipeline back-to-back. Advantageously, divergent operations requiring two or more replay operations operate with reduced latency. Where memory access operations require the transfer of more than two cache lines to service all threads, the number of clock cycles required to complete all replay operations is reduced.
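A minimal sketch of batching replays back-to-back in the replay loop, assuming a batch size of two; the names and the list-based pipeline are hypothetical.

BATCH = 2  # replays inserted back-to-back per replay-loop trip

def replay_loop(remaining_lines, pipeline):
    trips = 0
    while remaining_lines:
        batch = remaining_lines[:BATCH]
        remaining_lines = remaining_lines[BATCH:]
        for ln in batch:
            pipeline.append(("replay", ln))   # inserted back-to-back
        trips += 1                            # one loop trip per batch
    return trips

# A divergent load touching four cache lines takes 2 loop trips with
# batching (vs. 4 when replays are inserted one at a time).
print(replay_loop([1, 2, 3, 4], []))   # 2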