METHODS AND APPARATUS TO AVOID SURGES IN DI/DT BY THROTTLING GPU EXECUTION PERFORMANCE
    1.
    Invention Application
    METHODS AND APPARATUS TO AVOID SURGES IN DI/DT BY THROTTLING GPU EXECUTION PERFORMANCE (In force)

    Publication No.: US20130262831A1

    Publication Date: 2013-10-03

    Application No.: US13437765

    Filing Date: 2012-04-02

    IPC Classification: G06F9/30

    Abstract: Systems and methods for throttling GPU execution performance to avoid surges in DI/DT. A processor includes one or more execution units coupled to a scheduling unit configured to select instructions for execution by the one or more execution units. The execution units may be connected to one or more decoupling capacitors that store power for the circuits of the execution units. The scheduling unit is configured to throttle the instruction issue rate of the execution units based on a moving average issue rate taken over a large number of scheduling periods. The number of instructions issued during the current scheduling period is limited to a throttling rate maintained by the scheduling unit, which is itself kept at or above a minimum throttling issue rate. At the end of each scheduling period, the throttling rate is set equal to the moving average plus an offset value.
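The throttling scheme in the abstract can be modeled in a few lines. This is an illustrative sketch, not the patented hardware: the class name `IssueThrottle` and the default `window`, `offset`, and `min_rate` values are assumptions chosen for demonstration.

```python
class IssueThrottle:
    """Illustrative model of the moving-average issue-rate throttle."""

    def __init__(self, window=64, offset=2, min_rate=1):
        self.window = window        # scheduling periods covered by the moving average
        self.offset = offset        # headroom added on top of the moving average
        self.min_rate = min_rate    # floor so execution always makes progress
        self.history = []           # instructions issued in recent periods
        self.throttle_rate = min_rate

    def issued_this_period(self, requested):
        # Instructions issued this period may not exceed the throttle rate.
        issued = min(requested, self.throttle_rate)
        self.history.append(issued)
        if len(self.history) > self.window:
            self.history.pop(0)
        # At the end of each period the throttle rate is reset to the moving
        # average plus the offset, never falling below the minimum rate.
        moving_avg = sum(self.history) // len(self.history)
        self.throttle_rate = max(self.min_rate, moving_avg + self.offset)
        return issued
```

Because the rate can only grow by `offset` above the recent average each period, a sudden burst of issue requests is smoothed out over many periods rather than hitting the decoupling capacitors all at once.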


    Throttling instruction issue rate based on updated moving average to avoid surges in DI/DT
    2.
    Granted Patent
    Throttling instruction issue rate based on updated moving average to avoid surges in DI/DT (In force)

    Publication No.: US09430242B2

    Publication Date: 2016-08-30

    Application No.: US13437765

    Filing Date: 2012-04-02

    Abstract: Systems and methods for throttling GPU execution performance to avoid surges in DI/DT. A processor includes one or more execution units coupled to a scheduling unit configured to select instructions for execution by the one or more execution units. The execution units may be connected to one or more decoupling capacitors that store power for the circuits of the execution units. The scheduling unit is configured to throttle the instruction issue rate of the execution units based on a moving average issue rate taken over a large number of scheduling periods. The number of instructions issued during the current scheduling period is limited to a throttling rate maintained by the scheduling unit, which is itself kept at or above a minimum throttling issue rate. At the end of each scheduling period, the throttling rate is set equal to the moving average plus an offset value.


    Methods and apparatus for scheduling instructions using pre-decode data

    Publication No.: US09798548B2

    Publication Date: 2017-10-24

    Application No.: US13333879

    Filing Date: 2011-12-21

    Abstract: Systems and methods for scheduling instructions using pre-decode data corresponding to each instruction. In one embodiment, a multi-core processor includes a scheduling unit in each core for selecting instructions from two or more threads each scheduling cycle for execution on that particular core. As threads are scheduled for execution on the core, instructions from the threads are fetched into a buffer without being decoded. The pre-decode data is determined by a compiler, extracted by the scheduling unit during runtime, and used to control the selection of threads for execution. The pre-decode data may specify a number of scheduling cycles to wait before scheduling the instruction, and may also specify a scheduling priority for the instruction. Once the scheduling unit selects an instruction to issue for execution, a decode unit fully decodes the instruction.
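The selection step can be sketched as a function over compiler-supplied pre-decode fields. This is a minimal illustration, assuming hypothetical field names `ready_cycle` (cycle the instruction may be scheduled, derived from the wait count) and `priority` (lower value means more urgent); neither name comes from the patent.

```python
def select_instruction(threads, cycle):
    """Pick the next instruction without decoding it, using pre-decode data.

    threads: list of (thread_id, predecode) pairs, where predecode is a dict
    carrying the compiler-assigned 'ready_cycle' and 'priority' for the
    thread's next buffered instruction. Returns the chosen pair, or None.
    """
    # Only threads whose wait period has elapsed are eligible this cycle.
    ready = [t for t in threads if t[1]["ready_cycle"] <= cycle]
    if not ready:
        return None
    # Highest scheduling priority wins; ties break by thread id for determinism.
    return min(ready, key=lambda t: (t[1]["priority"], t[0]))
```

Full decode happens only after this selection, which is the point of the technique: the scheduler's critical path touches a few pre-computed bits instead of the whole instruction encoding.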

    PRE-SCHEDULED REPLAYS OF DIVERGENT OPERATIONS
    4.
    Invention Application
    PRE-SCHEDULED REPLAYS OF DIVERGENT OPERATIONS (Pending; published)

    Publication No.: US20130212364A1

    Publication Date: 2013-08-15

    Application No.: US13370173

    Filing Date: 2012-02-09

    IPC Classification: G06F9/38, G06F9/312

    Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into the pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data distributed across multiple cache lines, the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, further replay operations are inserted via the replay loop until all threads are serviced. One advantage of the disclosed technique is that divergent operations requiring one or more replay operations execute with reduced latency.
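The flow above can be sketched with a small model. This is an illustrative assumption-laden sketch: `schedule_with_replays`, the `prescheduled` count, and the operation tags are invented names; each operation services one distinct cache line touched by the warp's threads.

```python
def schedule_with_replays(thread_lines, prescheduled=1):
    """Model of pre-scheduled replays for a divergent memory operation.

    thread_lines: the cache line touched by each active thread (non-empty).
    Returns a list of (op_kind, cache_line) pairs; each services one line.
    """
    # Distinct cache lines in first-seen order; each needs its own access.
    pending = list(dict.fromkeys(thread_lines))
    ops = [("instruction", pending.pop(0))]   # original instruction serves line 1
    # Pre-scheduled replays are inserted immediately behind the instruction,
    # so common divergence cases avoid a trip around the replay loop.
    for _ in range(prescheduled):
        if pending:
            ops.append(("pre-scheduled replay", pending.pop(0)))
    # Any threads still unserviced fall back to the (slower) replay loop.
    while pending:
        ops.append(("replay loop", pending.pop(0)))
    return ops
```

The latency win comes from the first branch: replays issued back-to-back in the pipeline are cheaper than replays that must re-enter through the loop.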


    Speculative execution and rollback

    Publication No.: US09830158B2

    Publication Date: 2017-11-28

    Application No.: US13289643

    Filing Date: 2011-11-04

    IPC Classification: G06F9/38

    Abstract: One embodiment of the present invention sets forth a technique for speculatively issuing instructions to allow a processing pipeline to continue to process some instructions during rollback of other instructions. A scheduler circuit issues instructions for execution assuming that, several cycles later, when the instructions reach the multithreaded execution units, dependencies between the instructions will be resolved, resources will be available, operand data will be available, and other conditions will not prevent execution. When a rollback condition exists at the point of execution for an instruction for a particular thread group, the instruction is not dispatched to the multithreaded execution units. However, other instructions issued by the scheduler circuit for execution by different thread groups, and for which a rollback condition does not exist, are executed by the multithreaded execution units. The instruction incurring the rollback condition is reissued after the rollback condition no longer exists.
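The dispatch-time decision can be illustrated with a toy model. All names here (`dispatch`, the `ready` set of thread-group ids) are assumptions for illustration; the key property is that one group's rollback never blocks other groups.

```python
def dispatch(issued, ready):
    """Model of speculative dispatch with per-thread-group rollback.

    issued: list of (group_id, instruction) pairs issued speculatively.
    ready:  set of group ids whose conditions (dependencies, resources,
            operands) turned out to be satisfied at dispatch time.
    Returns (executed, reissue): instructions sent to the execution units,
    and instructions rolled back for later reissue.
    """
    executed, reissue = [], []
    for group, instr in issued:
        if group in ready:
            executed.append((group, instr))   # dispatched to execution units
        else:
            reissue.append((group, instr))    # rollback: reissue this one later
    return executed, reissue
```

A rolled-back instruction simply rejoins the issue stream once its rollback condition clears, while independent thread groups keep the pipeline busy in the meantime.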

    SYSTEM AND METHOD FOR PERFORMING SHAPED MEMORY ACCESS OPERATIONS
    8.
    Invention Application
    SYSTEM AND METHOD FOR PERFORMING SHAPED MEMORY ACCESS OPERATIONS (Pending; published)

    Publication No.: US20130145124A1

    Publication Date: 2013-06-06

    Application No.: US13312954

    Filing Date: 2011-12-06

    IPC Classification: G06F9/30

    Abstract: One embodiment of the present invention sets forth a technique that provides an efficient way to retrieve operands from a register file. Specifically, the instruction dispatch unit receives one or more instructions, each of which includes one or more operands. Collectively, the operands are organized into one or more operand groups from which a shaped access may be formed. The operands are retrieved from the register file and stored in a collector. Once all operands are read and collected in the collector, the instruction dispatch unit transmits the instructions and corresponding operands to functional units within the streaming multiprocessor for execution. One advantage of the present invention is that multiple operands are retrieved from the register file in a single register access operation without resource conflict. Performance in retrieving operands from the register file is improved by forming shaped accesses that efficiently retrieve operands exhibiting recognized memory access patterns.
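One plausible reading of "without resource conflict" is conflict-free grouping across register-file banks, which the sketch below models. This is an assumption, not the patent's actual grouping rule: `form_shaped_accesses`, the modulo bank mapping, and `num_banks=4` are all illustrative.

```python
def form_shaped_accesses(operands, num_banks=4):
    """Group register reads into shaped accesses with no bank conflicts.

    operands: list of register indices to read.
    Returns a list of accesses; each access reads at most one register per
    bank, so it can complete in a single register-file access operation.
    """
    accesses = []  # each access is a dict: bank -> register index
    for reg in operands:
        bank = reg % num_banks           # assumed bank-mapping function
        for acc in accesses:
            if bank not in acc:          # slot free: join this shaped access
                acc[bank] = reg
                break
        else:
            accesses.append({bank: reg})  # all accesses conflict: start a new one
    return [sorted(acc.values()) for acc in accesses]
```

Operands that follow a recognized pattern (e.g. consecutive registers) land in distinct banks and collapse into one access, while conflicting registers spill into additional accesses.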


    Multi-level instruction cache prefetching
    9.
    Granted Patent
    Multi-level instruction cache prefetching (In force)

    Publication No.: US09110810B2

    Publication Date: 2015-08-18

    Application No.: US13312962

    Filing Date: 2011-12-06

    Abstract: One embodiment of the present invention sets forth an improved way to prefetch instructions in a multi-level cache. The fetch unit initiates a prefetch operation to transfer one of a set of multiple cache lines, based on a pseudorandom number generator and the sector corresponding to the current instruction L1 cache line. The fetch unit selects a prefetch target from the set of multiple cache lines according to a probability function. If the current instruction L1 cache line is located within the first sector of the corresponding L1.5 cache line, then the selected prefetch target is located at a sector within the next L1.5 cache line. The result is that the instruction L1 cache hit rate is improved and instruction fetch latency is reduced, even where the processor consumes instructions in the instruction L1 cache at a fast rate.
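The first-sector rule can be sketched as follows. This is a speculative illustration of only the case the abstract spells out: `choose_prefetch_target`, `lines_per_l15=4`, and the sequential fallback for non-first sectors are assumptions.

```python
import random

def choose_prefetch_target(current_line, lines_per_l15=4, rng=random):
    """Pick an instruction L1 line to prefetch, per the first-sector rule.

    current_line: index of the L1 cache line currently being fetched.
    lines_per_l15: L1-line-sized sectors per L1.5 cache line (assumed).
    """
    sector = current_line % lines_per_l15
    l15_base = current_line - sector
    if sector == 0:
        # Current line sits in the first sector of its L1.5 line: target a
        # pseudorandomly chosen sector of the *next* L1.5 cache line.
        return l15_base + lines_per_l15 + rng.randrange(lines_per_l15)
    # Other sectors are not specified by the abstract; assume a simple
    # sequential prefetch within the current L1.5 line.
    return current_line + 1
```

Prefetching into the next L1.5 line as soon as a new line's first sector is touched gives the slower cache level a head start on the fetch stream, which is what keeps the L1 hit rate up under a fast-consuming processor.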


    Methods and apparatus for source operand collector caching
    10.
    Granted Patent
    Methods and apparatus for source operand collector caching (In force)

    Publication No.: US08639882B2

    Publication Date: 2014-01-28

    Application No.: US13326183

    Filing Date: 2011-12-14

    IPC Classification: G06F12/00

    Abstract: Methods and apparatus for source operand collector caching. In one embodiment, a processor includes a register file that may be coupled to storage elements (i.e., an operand collector) that provide inputs to the datapath of the processor core for executing an instruction. In order to reduce bandwidth between the register file and the operand collector, operands may be cached and reused in subsequent instructions. A scheduling unit maintains a cache table for monitoring which register values are currently stored in the operand collector. The scheduling unit may also configure the operand collector to select the particular storage elements that are coupled to the inputs to the datapath for a given instruction.
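The bandwidth saving can be demonstrated with a small cache-table model. This is an illustrative sketch, assuming an LRU replacement policy and the invented names `OperandCollectorCache` and `read_operand`; the patent does not specify the replacement scheme.

```python
from collections import OrderedDict

class OperandCollectorCache:
    """Model of the cache table tracking which register values are
    already held in operand-collector storage elements."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.cached = OrderedDict()   # register index -> cached value
        self.reg_file_reads = 0       # register-file bandwidth consumed

    def read_operand(self, reg, register_file):
        if reg in self.cached:
            # Hit: reuse the value already sitting in the collector,
            # spending no register-file bandwidth.
            self.cached.move_to_end(reg)
            return self.cached[reg]
        # Miss: read from the register file and record it in the table.
        self.reg_file_reads += 1
        value = register_file[reg]
        if len(self.cached) >= self.capacity:
            self.cached.popitem(last=False)   # evict least recently used entry
        self.cached[reg] = value
        return value
```

Instruction streams that reuse source registers (common in accumulation loops, e.g. repeated reads of an accumulator register) hit in the table and skip the register-file read entirely.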
