PRE-SCHEDULED REPLAYS OF DIVERGENT OPERATIONS
    1.
    Invention Application
    PRE-SCHEDULED REPLAYS OF DIVERGENT OPERATIONS (Pending, Published)

    Publication No.: US20130212364A1

    Publication Date: 2013-08-15

    Application No.: US13370173

    Filing Date: 2012-02-09

    IPC Classification: G06F9/38 G06F9/312

    Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into the pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data that is distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop until all threads are serviced. One advantage of the disclosed technique is that divergent operations requiring one or more replay operations execute with reduced latency.

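    Illustrative sketch (not code from the patent): the replay machinery described above lives inside the SM's load/store pipeline and is not visible in source code, but the CUDA kernel below shows the kind of divergent access pattern that triggers it. The kernel and parameter names are hypothetical.

        // If idx[] scatters the 32 threads of a warp across several cache lines
        // (128-byte lines assumed here), one warp-wide load instruction cannot be
        // serviced in a single pass; the hardware makes additional passes (replays)
        // until every thread has its data.
        __global__ void gather(const float *in, const int *idx, float *out, int n)
        {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid < n) {
                out[tid] = in[idx[tid]];
            }
        }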

    BATCHED REPLAYS OF DIVERGENT OPERATIONS
    4.
    Invention Application
    BATCHED REPLAYS OF DIVERGENT OPERATIONS (In Force)

    Publication No.: US20130159684A1

    Publication Date: 2013-06-20

    Application No.: US13329066

    Filing Date: 2011-12-16

    IPC Classification: G06F9/38 G06F9/312

    CPC Classification: G06F9/3851 G06F9/3861

    Abstract: One embodiment of the present invention sets forth an optimized way to execute replay operations for divergent operations in a parallel processing subsystem. Specifically, the streaming multiprocessor (SM) includes a multistage pipeline configured to batch two or more replay operations for processing via a replay loop. A logic element within the multistage pipeline detects whether the current pipeline stage is accessing a shared resource, such as loading data from a shared memory. If the threads are accessing data that is distributed across multiple cache lines, then the multistage pipeline batches two or more replay operations, where the replay operations are inserted into the pipeline back-to-back. Advantageously, divergent operations requiring two or more replay operations operate with reduced latency. Where memory access operations require the transfer of more than two cache lines to service all threads, the number of clock cycles required to complete all replay operations is reduced.

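    A rough host-side model, offered only as an assumption for illustration (this is not the patented logic): the number of replays a warp-wide load needs can be approximated as the number of distinct cache lines its threads touch beyond the first, and a batching scheme would enqueue those replays back-to-back rather than one per trip around the replay loop. The function name, the 128-byte line size, and the 32-thread warp are assumptions.

        #include <cstdint>
        #include <set>
        #include <vector>

        // Count replays implied by one warp's byte addresses: the first pass is the
        // original operation, and each additional distinct cache line is one replay.
        int replaysNeeded(const std::vector<uint64_t> &warpAddrs)
        {
            std::set<uint64_t> lines;          // distinct cache lines touched by the warp
            for (uint64_t a : warpAddrs)
                lines.insert(a / 128);         // 128-byte cache lines assumed
            return lines.empty() ? 0 : static_cast<int>(lines.size()) - 1;
        }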

    Dynamic bank mode addressing for memory access
    6.
    Invention Grant
    Dynamic bank mode addressing for memory access (In Force)

    Publication No.: US09262174B2

    Publication Date: 2016-02-16

    Application No.: US13440945

    Filing Date: 2012-04-05

    IPC Classification: G06F13/00 G06F13/28 G06F9/38

    CPC Classification: G06F9/3887 G06F9/3851

    Abstract: One embodiment sets forth a technique for dynamically mapping addresses to banks of a multi-bank memory based on a bank mode. Application programs may be configured to read and write a memory that is accessed using different numbers of bits per bank, e.g., 32 bits per bank, 64 bits per bank, or 128 bits per bank. On each clock cycle an access request may be received from one of the application programs, and the per-processing-thread addresses of the access request are dynamically mapped based on the bank mode to produce a set of bank addresses. The bank addresses are then used to access the multi-bank memory. Allowing different bank mappings enables each application program to avoid bank conflicts when the memory is accessed, compared with using a single bank mapping for all accesses.

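    For context, the CUDA runtime exposes a documented, programmer-visible shared-memory bank-size setting on some GPU generations. The sketch below uses that API only as an analogue of selecting a bank mode; the listing does not say this API is the claimed mechanism.

        #include <cuda_runtime.h>

        int main()
        {
            // Request 8-byte shared-memory banks (useful for double-precision data);
            // 4-byte banks are the other documented option.
            cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

            cudaSharedMemConfig cfg;
            cudaDeviceGetSharedMemConfig(&cfg);   // read back the active bank mode
            return (cfg == cudaSharedMemBankSizeEightByte) ? 0 : 1;
        }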

    DYNAMIC BANK MODE ADDRESSING FOR MEMORY ACCESS
    7.
    Invention Application
    DYNAMIC BANK MODE ADDRESSING FOR MEMORY ACCESS (In Force)

    Publication No.: US20130268715A1

    Publication Date: 2013-10-10

    Application No.: US13440945

    Filing Date: 2012-04-05

    IPC Classification: G06F12/06

    CPC Classification: G06F9/3887 G06F9/3851

    Abstract: One embodiment sets forth a technique for dynamically mapping addresses to banks of a multi-bank memory based on a bank mode. Application programs may be configured to read and write a memory that is accessed using different numbers of bits per bank, e.g., 32 bits per bank, 64 bits per bank, or 128 bits per bank. On each clock cycle an access request may be received from one of the application programs, and the per-processing-thread addresses of the access request are dynamically mapped based on the bank mode to produce a set of bank addresses. The bank addresses are then used to access the multi-bank memory. Allowing different bank mappings enables each application program to avoid bank conflicts when the memory is accessed, compared with using a single bank mapping for all accesses.

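    A minimal sketch of the address-to-bank mapping the abstract describes, assuming 32 banks and a configurable bank width; the function name and constants are illustrative and not taken from the patent.

        #include <cstdint>

        // bankBytes encodes the bank mode: 4, 8, or 16 bytes per bank
        // (i.e., 32, 64, or 128 bits per bank).
        __host__ __device__ inline unsigned bankOf(uint64_t byteAddr, unsigned bankBytes)
        {
            const unsigned kNumBanks = 32;   // assumed bank count
            return static_cast<unsigned>((byteAddr / bankBytes) % kNumBanks);
        }

    Changing the bank mode changes which addresses land in the same bank, which is how a program can pick the mapping that avoids conflicts for its own access stride.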

    UNIFORM LOAD PROCESSING FOR PARALLEL THREAD SUB-SETS
    8.
    Invention Application
    UNIFORM LOAD PROCESSING FOR PARALLEL THREAD SUB-SETS (In Force)

    Publication No.: US20130232322A1

    Publication Date: 2013-09-05

    Application No.: US13412438

    Filing Date: 2012-03-05

    IPC Classification: G06F9/312 G06F9/38

    Abstract: One embodiment of the present invention sets forth a technique for processing load instructions for parallel threads of a thread group when a sub-set of the parallel threads requests the same memory address. The load/store unit determines whether the memory addresses for each sub-set of parallel threads match based on one or more uniform patterns. When a match is achieved for at least one of the uniform patterns, the load/store unit transmits a read request to retrieve data for the sub-set of parallel threads. The number of read requests transmitted is reduced compared with performing a separate read request for each thread in the sub-set. A variety of uniform patterns may be defined based on common access patterns present in program instructions. Uniform patterns may also be defined based on interconnect constraints between the load/store unit and the memory when a full crossbar interconnect is not available.

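    The kernel below is an illustrative sketch, not code from the patent: every thread in a block reads the same coefficient address, which is the kind of uniform pattern the abstract says the load/store unit can detect and satisfy with a single read request for the whole sub-set of threads.

        __global__ void scale(const float *data, const float *coeff, float *out, int n)
        {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            float c = coeff[blockIdx.x];   // same address for every thread in the block
            if (tid < n)
                out[tid] = c * data[tid];
        }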

    Register transfer level simulation using a graphics processor
    10.
    Invention Grant
    Register transfer level simulation using a graphics processor (In Force)

    Publication No.: US07830386B1

    Publication Date: 2010-11-09

    Application No.: US11156001

    Filing Date: 2005-06-17

    Applicant: Douglas J. Hahn

    Inventor: Douglas J. Hahn

    IPC Classification: G06T15/00 G06T1/00 G06F9/45

    CPC Classification: G06T15/005 G06T1/20

    Abstract: Systems and methods for using a graphics processor as a coprocessor to a general-purpose processor to perform register transfer level (RTL) simulations may improve simulation performance compared with using only the general-purpose processor. The internal state of the memory elements of an RTL model of an electronic circuit is stored as surface data for each simulation timestep. Transform functions are used to determine a next state based on the current state and the simulation inputs. The transform functions are expressed as a graphics program, such as a shader or vertex program, that may be executed by a programmable graphics processor.

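    The patent expresses the transform functions as shader or vertex programs; as a modern-analogue sketch only (CUDA, which postdates the 2005 filing), the same double-buffered idea is one kernel launch per timestep that computes the next state from the current state and the inputs, with the host swapping the two state buffers between steps. The next-state logic here is a placeholder, not anything from the patent.

        __global__ void rtlStep(const int *state, const int *inputs, int *nextState, int nRegs)
        {
            int r = blockIdx.x * blockDim.x + threadIdx.x;
            if (r < nRegs) {
                // Placeholder next-state function; a real model would evaluate the
                // combinational logic feeding register r.
                nextState[r] = state[r] ^ inputs[r];
            }
        }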