System and method for performing shaped memory access operations

    Publication No.: US10255228B2

    Publication Date: 2019-04-09

    Application No.: US13312954

    Filing Date: 2011-12-06

    Abstract: One embodiment of the present invention sets forth a technique that provides an efficient way to retrieve operands from a register file. Specifically, the instruction dispatch unit receives one or more instructions, each of which includes one or more operands. Collectively, the operands are organized into one or more operand groups from which a shaped access may be formed. The operands are retrieved from the register file and stored in a collector. Once all operands are read and collected in the collector, the instruction dispatch unit transmits the instructions and corresponding operands to functional units within the streaming multiprocessor for execution. One advantage of the present invention is that multiple operands are retrieved from the register file in a single register access operation without resource conflict. Performance in retrieving operands from the register file is improved by forming shaped accesses that efficiently retrieve operands exhibiting recognized memory access patterns.
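The core idea above — grouping operand reads so that one register-file access services several operands without a resource conflict — can be sketched as a toy model. Everything here (the bank count, the interleaving rule, the function names) is illustrative and not taken from the patent:

```python
# Minimal model of forming "shaped" register-file accesses: operand reads
# that map to distinct banks can be serviced in one access; reads that
# collide on an already-used bank must go in a separate access.

NUM_BANKS = 4  # assumed bank count, purely illustrative

def bank_of(reg: int) -> int:
    """Assume registers are interleaved across banks by index."""
    return reg % NUM_BANKS

def form_shaped_accesses(operand_regs):
    """Greedily group operand register reads into conflict-free accesses."""
    accesses = []  # each access maps bank -> register read in that cycle
    for reg in operand_regs:
        b = bank_of(reg)
        for access in accesses:
            if b not in access:        # no bank conflict: join this access
                access[b] = reg
                break
        else:
            accesses.append({b: reg})  # conflicts everywhere: new access
    return accesses

# Reading R0..R3 touches four distinct banks, so one access suffices;
# adding R4 (same bank as R0) forces a second access.
print(len(form_shaped_accesses([0, 1, 2, 3])))     # 1
print(len(form_shaped_accesses([0, 1, 2, 3, 4])))  # 2
```

The greedy grouping is only one possible policy; the abstract leaves the exact pattern-recognition mechanism unspecified.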

    Speculative execution and rollback

    Publication No.: US09830158B2

    Publication Date: 2017-11-28

    Application No.: US13289643

    Filing Date: 2011-11-04

    IPC Classes: G06F9/38

    Abstract: One embodiment of the present invention sets forth a technique for speculatively issuing instructions to allow a processing pipeline to continue to process some instructions during rollback of other instructions. A scheduler circuit issues instructions for execution assuming that, several cycles later, when the instructions reach the multithreaded execution units, dependencies between the instructions will be resolved, resources will be available, operand data will be available, and other conditions will not prevent execution of the instructions. When a rollback condition exists at the point of execution for an instruction for a particular thread group, the instruction is not dispatched to the multithreaded execution units. However, other instructions issued by the scheduler circuit for execution by different thread groups, and for which a rollback condition does not exist, are executed by the multithreaded execution units. The instruction incurring the rollback condition is reissued after the rollback condition no longer exists.
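The key behavior — a rolled-back thread group is reissued later while other thread groups keep executing — can be illustrated with a small event-driven sketch. The reissue delay and data shapes are assumptions for illustration, not details from the patent:

```python
from collections import deque

def run(issued, rollback_until, reissue_delay=3):
    """Toy model of speculative issue with per-thread-group rollback.

    issued: list of (cycle, group) issue events, in issue order.
    rollback_until: group -> last cycle at which its rollback condition
    still holds. Returns the (cycle, group) events that actually execute.
    """
    pending = deque(issued)
    executed = []
    while pending:
        cycle, group = pending.popleft()
        if cycle <= rollback_until.get(group, -1):
            # Rollback condition holds at the point of execution: do not
            # dispatch; reissue the instruction some cycles later.
            pending.append((cycle + reissue_delay, group))
        else:
            executed.append((cycle, group))
    return executed

# Group 'A' hits a rollback condition through cycle 2, so its cycle-0
# issue is reissued; group 'B' is unaffected and executes first.
print(run([(0, 'A'), (1, 'B')], {'A': 2}))  # [(1, 'B'), (3, 'A')]
```

The point of the sketch is that 'B' is never stalled behind 'A': only the thread group incurring the rollback condition pays the reissue latency.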

    BATCHED REPLAYS OF DIVERGENT OPERATIONS

    Publication No.: US20130159684A1

    Publication Date: 2013-06-20

    Application No.: US13329066

    Filing Date: 2011-12-16

    IPC Classes: G06F9/38 G06F9/312

    CPC Classes: G06F9/3851 G06F9/3861

    Abstract: One embodiment of the present invention sets forth an optimized way to execute replay operations for divergent operations in a parallel processing subsystem. Specifically, the streaming multiprocessor (SM) includes a multistage pipeline configured to batch two or more replay operations for processing via a replay loop. A logic element within the multistage pipeline detects whether the current pipeline stage is accessing a shared resource, such as loading data from a shared memory. If the threads are accessing data distributed across multiple cache lines, then the multistage pipeline batches two or more replay operations, where the replay operations are inserted into the pipeline back-to-back. Advantageously, divergent operations requiring two or more replay operations operate with reduced latency. Where memory access operations require transfer of more than two cache lines to service all threads, the number of clock cycles required to complete all replay operations is reduced.
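The latency argument can be made concrete with a toy calculation: a divergent access touching N cache lines needs N-1 replays, and batching replays back-to-back reduces the number of trips around the replay loop. The cache-line size and batch width below are assumptions for illustration only:

```python
LINE = 128   # assumed cache-line size in bytes
BATCH = 2    # replays inserted back-to-back per trip around the replay loop

def lines_touched(addrs):
    """Number of distinct cache lines covered by the threads' addresses."""
    return len({a // LINE for a in addrs})

def replay_loop_trips(addrs):
    """Trips around the replay loop when replays are batched.

    The first issue services one cache line; each remaining line needs a
    replay, and each loop trip carries up to BATCH replays back-to-back.
    """
    remaining = lines_touched(addrs) - 1
    trips = 0
    while remaining > 0:
        remaining -= BATCH
        trips += 1
    return trips

# Four cache lines -> 3 replays -> 2 loop trips with batching,
# versus 3 trips if each replay made its own trip around the loop.
print(replay_loop_trips([0, 128, 256, 384]))  # 2
```

Each loop trip carries the full loop latency, so halving the trip count for heavily divergent accesses is where the cycle savings come from.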

    SPECULATIVE EXECUTION AND ROLLBACK

    Publication No.: US20130117541A1

    Publication Date: 2013-05-09

    Application No.: US13289643

    Filing Date: 2011-11-04

    IPC Classes: G06F9/30

    Abstract: One embodiment of the present invention sets forth a technique for speculatively issuing instructions to allow a processing pipeline to continue to process some instructions during rollback of other instructions. A scheduler circuit issues instructions for execution assuming that, several cycles later, when the instructions reach the multithreaded execution units, dependencies between the instructions will be resolved, resources will be available, operand data will be available, and other conditions will not prevent execution of the instructions. When a rollback condition exists at the point of execution for an instruction for a particular thread group, the instruction is not dispatched to the multithreaded execution units. However, other instructions issued by the scheduler circuit for execution by different thread groups, and for which a rollback condition does not exist, are executed by the multithreaded execution units. The instruction incurring the rollback condition is reissued after the rollback condition no longer exists.

    Methods and apparatus for scheduling instructions using pre-decode data

    Publication No.: US09798548B2

    Publication Date: 2017-10-24

    Application No.: US13333879

    Filing Date: 2011-12-21

    Abstract: Systems and methods for scheduling instructions using pre-decode data corresponding to each instruction. In one embodiment, a multi-core processor includes a scheduling unit in each core for selecting instructions from two or more threads each scheduling cycle for execution on that particular core. As threads are scheduled for execution on the core, instructions from the threads are fetched into a buffer without being decoded. The pre-decode data is determined by a compiler and is extracted by the scheduling unit during runtime and used to control selection of threads for execution. The pre-decode data may specify a number of scheduling cycles to wait before scheduling the instruction. The pre-decode data may also specify a scheduling priority for the instruction. Once the scheduling unit selects an instruction to issue for execution, a decode unit fully decodes the instruction.
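The two pre-decode fields named above — a wait count and a scheduling priority — are enough to drive a minimal scheduler model. The dictionary shape and priority convention (smaller number = higher priority) are assumptions for illustration:

```python
def schedule(instrs, num_cycles):
    """Toy scheduler driven only by compiler-supplied pre-decode data.

    instrs: list of dicts with 'name', 'wait' (cycles to hold the
    instruction back before it is eligible) and 'prio' (smaller = higher).
    Each cycle, issue the highest-priority eligible instruction.
    Returns the (cycle, name) issue order.
    """
    pending = {i['name']: i for i in instrs}
    order = []
    for cycle in range(num_cycles):
        ready = [n for n, i in pending.items() if i['wait'] <= cycle]
        if ready:
            pick = min(ready, key=lambda n: pending[n]['prio'])
            order.append((cycle, pick))
            del pending[pick]
    return order

# 'C' carries wait=2 pre-decode data (e.g. a known latency), so the
# scheduler fills cycles 0-1 with 'B' (higher priority) and 'A'.
instrs = [{'name': 'A', 'wait': 0, 'prio': 1},
          {'name': 'B', 'wait': 0, 'prio': 0},
          {'name': 'C', 'wait': 2, 'prio': 0}]
print(schedule(instrs, 4))  # [(0, 'B'), (1, 'A'), (2, 'C')]
```

Note that no decode happens anywhere in the selection loop: the scheduler reads only the pre-decode fields, matching the abstract's claim that full decode is deferred until after selection.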

    METHODS AND APPARATUS FOR SCHEDULING INSTRUCTIONS WITHOUT INSTRUCTION DECODE

    Publication No.: US20130166882A1

    Publication Date: 2013-06-27

    Application No.: US13335872

    Filing Date: 2011-12-22

    IPC Classes: G06F9/30 G06F9/38 G06F9/312

    CPC Classes: G06F9/3851 G06F9/382

    Abstract: Systems and methods for scheduling instructions without instruction decode. In one embodiment, a multi-core processor includes a scheduling unit in each core for scheduling instructions from two or more threads scheduled for execution on that particular core. As threads are scheduled for execution on the core, instructions from the threads are fetched into a buffer without being decoded. The scheduling unit includes a macro-scheduler unit for performing a priority sort of the two or more threads and a micro-scheduler arbiter for determining the highest-order thread that is ready to execute. The macro-scheduler unit and the micro-scheduler arbiter use pre-decode data to implement the scheduling algorithm. The pre-decode data may be generated by decoding only a small portion of the instruction or received along with the instruction. Once the micro-scheduler arbiter has selected an instruction to dispatch to the execution unit, a decode unit fully decodes the instruction.
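The two-level split described above — a macro-scheduler that priority-sorts threads and a micro-scheduler arbiter that picks the highest-order ready thread — can be sketched in a few lines. Field names and the priority convention are illustrative assumptions:

```python
def macro_sort(threads):
    """Macro-scheduler: stable priority sort of the candidate threads,
    using only pre-decode priority data (smaller = higher priority)."""
    return sorted(threads, key=lambda t: t['prio'])

def micro_arbitrate(sorted_threads):
    """Micro-scheduler arbiter: pick the highest-order thread in the
    sorted list whose next instruction is ready to execute."""
    for t in sorted_threads:
        if t['ready']:
            return t['name']
    return None  # nothing ready this cycle

# t1 has the best priority but is not ready, so the arbiter skips it
# and dispatches t2, the next thread in priority order.
threads = [{'name': 't0', 'prio': 2, 'ready': True},
           {'name': 't1', 'prio': 0, 'ready': False},
           {'name': 't2', 'prio': 1, 'ready': True}]
print(micro_arbitrate(macro_sort(threads)))  # t2
```

Splitting the sort (slow, coarse-grained) from the per-cycle readiness pick (fast, fine-grained) is what lets the arbiter run every cycle without re-sorting.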

    METHODS AND APPARATUS FOR SOURCE OPERAND COLLECTOR CACHING

    Publication No.: US20130159628A1

    Publication Date: 2013-06-20

    Application No.: US13326183

    Filing Date: 2011-12-14

    IPC Classes: G06F12/08

    Abstract: Methods and apparatus for source operand collector caching. In one embodiment, a processor includes a register file that may be coupled to storage elements (i.e., an operand collector) that provide inputs to the datapath of the processor core for executing an instruction. In order to reduce bandwidth between the register file and the operand collector, operands may be cached and reused in subsequent instructions. A scheduling unit maintains a cache table for monitoring which register values are currently stored in the operand collector. The scheduling unit may also configure the operand collector to select the particular storage elements that are coupled to the inputs to the datapath for a given instruction.
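The bandwidth-saving mechanism — a cache table tracking which register values already sit in collector storage, so repeated operands skip the register file — can be modeled as a tiny LRU cache. Slot count and replacement policy are assumptions; the patent does not pin these down in the abstract:

```python
class OperandCollector:
    """Toy model of source operand collector caching: register-file reads
    happen only when an operand is not already in a collector slot."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = []      # cached registers, least recently used first
        self.rf_reads = 0    # register-file bandwidth actually consumed

    def gather(self, regs):
        """Gather the source operands of one instruction."""
        for r in regs:
            if r in self.slots:
                self.slots.remove(r)   # hit: reuse, no register-file access
            else:
                self.rf_reads += 1     # miss: read from the register file
                if len(self.slots) >= self.num_slots:
                    self.slots.pop(0)  # evict least recently used register
            self.slots.append(r)       # mark r most recently used

# Back-to-back instructions sharing R2 and R3: the second gather reads
# only R4 from the register file (4 reads total instead of 6).
oc = OperandCollector(num_slots=4)
oc.gather([1, 2, 3])
oc.gather([2, 3, 4])
print(oc.rf_reads)  # 4
```

In hardware the "cache table" lives in the scheduling unit rather than in the collector itself, but the accounting is the same: every hit is a register-file port cycle saved.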

    PRE-SCHEDULED REPLAYS OF DIVERGENT OPERATIONS

    Publication No.: US20130212364A1

    Publication Date: 2013-08-15

    Application No.: US13370173

    Filing Date: 2012-02-09

    IPC Classes: G06F9/38 G06F9/312

    Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into the multi-stage pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop, until all threads are serviced. One advantage of the disclosed technique is that divergent operations requiring one or more replay operations execute with reduced latency.
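The split between replays inserted up-front behind the instruction and leftover replays handled by the replay loop can be illustrated with a small calculation. The cache-line size and the number of pre-scheduled slots are assumptions for illustration only:

```python
LINE = 128        # assumed cache-line size in bytes
PRESCHEDULED = 1  # replays inserted directly behind the instruction

def service(addrs):
    """Toy accounting for one divergent memory operation.

    Returns (prescheduled_replays, loop_replays): the instruction itself
    services one cache line; pre-scheduled replays (cheap, no loop
    latency) cover the next lines; any remainder goes around the
    replay loop.
    """
    lines = len({a // LINE for a in addrs})
    needed = max(lines - 1, 0)
    pre = min(needed, PRESCHEDULED)
    return pre, needed - pre

# Three cache lines: one pre-scheduled replay follows the instruction
# sequentially, and only one replay must take the full loop latency.
print(service([0, 128, 256]))  # (1, 1)
print(service([0, 8]))         # (0, 0) - no divergence, no replays
```

Compared to the batched-replay scheme above, the pre-scheduled variant pays no loop latency at all for the common mildly divergent case, falling back to the loop only when divergence exceeds the prediction.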
