1. DYNAMICALLY DETECTING UNIFORMITY AND ELIMINATING REDUNDANT COMPUTATIONS TO REDUCE POWER CONSUMPTION
    Invention application - Pending (Published)

    Publication No.: US20150100764A1

    Publication Date: 2015-04-09

    Application No.: US14048647

    Filing Date: 2013-10-08

    CPC classification number: G06F9/30072 G06F9/3836 G06F9/3851 G06F9/3887

    Abstract: One embodiment of the present invention includes techniques to decrease power consumption by reducing the number of redundant operations performed. In operation, a streaming multiprocessor (SM) identifies uniform groups of threads that, when executed, apply the same deterministic operation to uniform sets of input operands. Within each uniform group of threads, the SM designates one thread as the anchor thread. The SM disables the execution units assigned to all of the threads except the anchor thread. The anchor execution unit, assigned to the anchor thread, executes the operation on the uniform set of input operands. Subsequently, the SM sets the outputs of the non-anchor threads included in the uniform group of threads to equal the value of the anchor execution unit output. Advantageously, by exploiting the uniformity of data to reduce the number of execution units that execute, the SM dramatically reduces power consumption compared to conventional SMs.
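
    The anchor-thread mechanism is implemented in hardware inside the SM, but a rough software analog can be written with warp intrinsics. The CUDA sketch below is purely illustrative: the kernel name, the choice of a multiply as the deterministic operation, and the anchor-lane selection are assumptions, not the patented circuit.

```cuda
// Illustrative sketch only: detect that every active lane of a warp holds the
// same operands, let one "anchor" lane perform the operation, and broadcast
// its result to the rest of the warp.
__global__ void anchor_multiply(const float* a, const float* b, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned mask = __activemask();
    int anchor_lane = __ffs(mask) - 1;            // lowest active lane is the anchor

    float x = a[tid];
    float y = b[tid];

    // Uniformity check: every active lane compares its operands to the anchor's.
    bool uniform = __all_sync(mask, x == __shfl_sync(mask, x, anchor_lane)) &&
                   __all_sync(mask, y == __shfl_sync(mask, y, anchor_lane));

    float result;
    if (uniform) {
        // Only the anchor lane performs the multiply; the others reuse its value.
        int lane = threadIdx.x & 31;
        float prod = (lane == anchor_lane) ? x * y : 0.0f;
        result = __shfl_sync(mask, prod, anchor_lane);
    } else {
        result = x * y;                           // non-uniform: every lane computes
    }
    out[tid] = result;
}
```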

3. PRIMITIVE RE-ORDERING BETWEEN WORLD-SPACE AND SCREEN-SPACE PIPELINES WITH BUFFER LIMITED PROCESSING
    Invention application - Granted

    Publication No.: US20140118381A1

    Publication Date: 2014-05-01

    Application No.: US14023309

    Filing Date: 2013-09-10

    Abstract: One embodiment of the present invention includes approaches for processing graphics primitives associated with cache tiles when rendering an image. A set of graphics primitives associated with a first render target configuration is received from a first portion of a graphics processing pipeline, and the set of graphics primitives is stored in a memory. A condition is detected indicating that the set of graphics primitives is ready for processing, and a cache tile is selected that intersects at least one graphics primitive in the set of graphics primitives. At least one graphics primitive in the set of graphics primitives that intersects the cache tile is transmitted to a second portion of the graphics processing pipeline for processing. One advantage of the disclosed embodiments is that graphics primitives and associated data are more likely to remain stored on-chip during cache tile rendering, thereby reducing power consumption and improving rendering performance.
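
    A rough picture of the buffer-limited replay can be given in host-side code: primitives for one render-target configuration are buffered, and once the set is ready each cache tile receives only the primitives that intersect it. The sketch below is a hypothetical illustration; the struct layouts and function names are assumptions, not the pipeline's real interfaces.

```cuda
#include <vector>

struct Primitive { float xmin, ymin, xmax, ymax; };   // screen-space bounding box
struct CacheTile { float xmin, ymin, xmax, ymax; };

static bool intersects(const Primitive& p, const CacheTile& t)
{
    return p.xmin < t.xmax && p.xmax > t.xmin &&
           p.ymin < t.ymax && p.ymax > t.ymin;
}

// Flush the buffered set one cache tile at a time so each tile's primitives
// (and the tile's render-target data) can stay resident on chip while they
// are processed by the screen-space stage.
void replay_per_tile(const std::vector<Primitive>& buffered,
                     const std::vector<CacheTile>& tiles,
                     void (*screenSpaceStage)(const Primitive&, const CacheTile&))
{
    for (const CacheTile& tile : tiles)
        for (const Primitive& prim : buffered)
            if (intersects(prim, tile))
                screenSpaceStage(prim, tile);
}
```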

4. COOPERATIVE THREAD ARRAY GRANULARITY CONTEXT SWITCH DURING TRAP HANDLING
    Invention application - Granted

    Publication No.: US20170010914A1

    Publication Date: 2017-01-12

    Application No.: US15271171

    Filing Date: 2016-09-20

    CPC classification number: G06F9/461 G06F9/4812 G06F9/485

    Abstract: Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.
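
    The decision made by the restore routine can be pictured with a small device-side sketch. Everything below is an assumption for illustration (the real routine lives in driver microcode and hardware): a saved per-group flag records whether the group finished its assigned function, and that flag chooses between exiting the restore and continuing into trap-handler work.

```cuda
// Illustrative context layout; only the first portion needed for the decision
// is shown.
struct GroupContext {
    bool workDone;   // set when the thread group completed its assigned function
    // ... saved registers, program counter, barrier state, etc.
};

// Illustrative stub: in a real system this would perform the trap-handler work.
__device__ void trap_handler_body(GroupContext* /*ctx*/) { }

__device__ void context_restore(GroupContext* saved, int groupId)
{
    GroupContext* ctx = &saved[groupId];   // restore the first portion of context
    if (ctx->workDone) {
        return;                            // group already finished: exit the restore
    }
    // Group still has work: continue with trap-handler operations so it can
    // resume or complete its assigned function.
    trap_handler_body(ctx);
}
```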

5. TECHNIQUE FOR COUNTING VALUES IN A REGISTER
    Invention application - Granted

    Publication No.: US20150089207A1

    Publication Date: 2015-03-26

    Application No.: US14033385

    Filing Date: 2013-09-20

    CPC classification number: G06F9/30105 G06F9/30021 G06F9/30036

    Abstract: A parallel counter accesses data generated by an application and stored within a register. The register includes different segments that include different portions of the application data. The parallel counter is configured to count the number of values within each segment that have a particular characteristic in a parallel fashion. The parallel counter may then return the individual segment counts to the application, or combine those segment counts and return a register count to the application. Advantageously, applications that rely on population count operations may be accelerated. Further, increasing the number of segments in a given register may reduce the time needed to count the values in that register, thereby providing a scalable solution to population counting. Additionally, the architecture of the parallel counter is sufficiently flexible to allow both register counting and segment counting, thereby combining two separate functionalities into just one hardware unit.
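
    As a rough software analog, a segmented population count can be expressed with the per-word popcount primitive: the register image is split into fixed-width segments, each segment is counted on its own, and the segment counts are also summed into a whole-register count. The segment width and names below are illustrative assumptions, not the hardware's actual parameters.

```cuda
#include <cstdint>

// Illustrative sketch: count set bits per segment and for the whole register.
__device__ void segmented_popcount(uint32_t reg, int counts[], int* total)
{
    const int      kSegments = 4;
    const int      kSegBits  = 32 / kSegments;          // 8 bits per segment
    const uint32_t kSegMask  = (1u << kSegBits) - 1u;

    *total = 0;
    for (int s = 0; s < kSegments; ++s) {
        uint32_t segment = (reg >> (s * kSegBits)) & kSegMask;
        counts[s] = __popc(segment);   // per-segment count
        *total  += counts[s];          // combined, register-level count
    }
}
```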

7. COOPERATIVE THREAD ARRAY GRANULARITY CONTEXT SWITCH DURING TRAP HANDLING
    Invention application - Pending (Published)

    Publication No.: US20140189329A1

    Publication Date: 2014-07-03

    Application No.: US13728784

    Filing Date: 2012-12-27

    CPC classification number: G06F9/3851 G06F9/3861 G06F9/3887 G06F9/4812

    Abstract: Techniques are provided for handling a trap encountered in a thread that is part of a thread array that is being executed in a plurality of execution units. In these techniques, a data structure with an identifier associated with the thread is updated to indicate that the trap occurred during the execution of the thread array. Also in these techniques, the execution units execute a trap handling routine that includes a context switch. The execution units perform this context switch for at least one of the execution units as part of the trap handling routine while allowing the remaining execution units to exit the trap handling routine before the context switch. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors.
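
    The early-exit behavior at trap-handler entry can be sketched as follows; the global flag, function names, and stub are illustrative assumptions standing in for the data structure and microcoded routine the abstract describes.

```cuda
// Illustrative: a flag records which cooperative thread array raised the trap;
// that array alone pays for the context switch, the others leave immediately.
__device__ int g_trappedCtaId = -1;   // written when the trap is raised

__device__ void save_and_switch_context(int /*ctaId*/) { /* illustrative stub */ }

__device__ void trap_handler(int myCtaId)
{
    if (g_trappedCtaId != myCtaId) {
        return;   // other thread arrays exit the handler before the context switch
    }
    // Only the trapping thread array is context-switched out as part of the
    // trap-handling routine.
    save_and_switch_context(myCtaId);
}
```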

8. APPROACH FOR CONTEXT SWITCHING OF LOCK-BIT PROTECTED MEMORY
    Invention application - Granted

    Publication No.: US20140189260A1

    Publication Date: 2014-07-03

    Application No.: US13728813

    Filing Date: 2012-12-27

    Abstract: A streaming multiprocessor in a parallel processing subsystem processes atomic operations for multiple threads in a multi-threaded architecture. The streaming multiprocessor receives a request from a thread in a thread group to acquire access to a memory location in a lock-protected shared memory, and determines whether an address lock in a plurality of address locks is asserted, where the address lock is associated with the memory location. If the address lock is asserted, then the streaming multiprocessor refuses the request. Otherwise, the streaming multiprocessor asserts the address lock, asserts a thread group lock in a plurality of thread group locks, where the thread group lock is associated with the thread group, and grants the request. One advantage of the disclosed techniques is that acquired locks are released when a thread is preempted. As a result, a preempted thread that has previously acquired a lock does not retain the lock indefinitely.
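
    A software analogy of the lock-acquire decision, using atomics in place of the hardware lock bits, is sketched below. The table sizes, names, and preemption cleanup are assumptions for illustration only.

```cuda
// Illustrative: an address lock guards a shared-memory location; a per-group
// lock records that the thread group owns at least one lock so its holdings
// can be released if the group is preempted.
#define NUM_ADDR_LOCKS  256
#define NUM_GROUP_LOCKS  64

__device__ int g_addrLock[NUM_ADDR_LOCKS];    // 1 = asserted
__device__ int g_groupLock[NUM_GROUP_LOCKS];  // 1 = group holds a lock

__device__ bool try_acquire(unsigned addr, int groupId)
{
    int slot = addr % NUM_ADDR_LOCKS;
    // atomicCAS returns the previous value: non-zero means the address lock is
    // already asserted, so the request is refused.
    if (atomicCAS(&g_addrLock[slot], 0, 1) != 0) {
        return false;
    }
    atomicExch(&g_groupLock[groupId], 1);      // assert the thread-group lock
    return true;                               // request granted
}

// On preemption, the group lock (and, in a full design, the address locks the
// group owns) would be cleared so nothing is held indefinitely.
__device__ void release_on_preempt(int groupId)
{
    atomicExch(&g_groupLock[groupId], 0);
}
```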
