Cooperative thread array reduction and scan operations

    Publication No.: US09417875B2

    Publication date: 2016-08-16

    Application No.: US14025482

    Filing date: 2013-09-12

    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
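The behavior described in this abstract can be illustrated with a small CPU-side simulation. The sketch below is not the patented hardware instruction: `AggregatingBarrier`, `arrive_and_aggregate`, and `run_threads` are hypothetical names, addition is an assumed aggregation operator, and the scan is assumed to be inclusive and taken in arrival order. Each thread contributes a value, receives its scan result as it arrives, and receives the reduction result only once every thread has arrived.

```python
import threading


class AggregatingBarrier:
    """Toy single-use barrier that also sums the values of arriving threads."""

    def __init__(self, parties):
        self.parties = parties
        self.cond = threading.Condition()
        self.arrived = 0
        self.running = 0   # running sum over the threads that have arrived so far
        self.total = None  # reduction result, set by the last thread to arrive

    def arrive_and_aggregate(self, value):
        with self.cond:
            # Scan: known as soon as this thread arrives (assumed inclusive,
            # in arrival order).
            self.running += value
            scan = self.running
            self.arrived += 1
            if self.arrived == self.parties:
                # Last arrival: the reduction is complete, release everyone.
                self.total = self.running
                self.cond.notify_all()
            else:
                # Block until all threads have executed the instruction.
                self.cond.wait_for(lambda: self.total is not None)
            return scan, self.total


def run_threads(values):
    """Run one thread per value; return (value, scan, total) for each thread."""
    barrier = AggregatingBarrier(len(values))
    results = [None] * len(values)

    def worker(i, v):
        scan, total = barrier.arrive_and_aggregate(v)
        results[i] = (v, scan, total)

    threads = [threading.Thread(target=worker, args=(i, v))
               for i, v in enumerate(values)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_threads([1, 2, 3, 4])` gives every thread the same reduction total, while the scan values depend on the order in which the threads happen to arrive.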

    Microcontroller for memory management unit
    9.
    Granted invention patent (in force)

    Publication No.: US09588903B2

    Publication date: 2017-03-07

    Application No.: US14011655

    Filing date: 2013-08-27

    Abstract: One embodiment of the present invention includes a microcontroller coupled to a memory management unit (MMU). The MMU is coupled to a page table included in a physical memory, and the microcontroller is configured to perform one or more virtual memory operations associated with the physical memory and the page table. In operation, the microcontroller receives a page fault generated by the MMU in response to an invalid memory access via a virtual memory address. To remedy such a page fault, the microcontroller performs actions to map the virtual memory address to an appropriate location in the physical memory. By contrast, in prior-art systems, a fault handler would typically remedy the page fault. Advantageously, because the microcontroller executes these tasks locally with respect to the MMU and the physical memory, latency associated with remedying page faults may be decreased. Consequently, overall system performance may be increased.
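The fault-handling flow in this abstract can be sketched as a toy software model. Everything here is an illustrative assumption rather than the patented design: the 4 KiB page size, the dictionary page table, the free-frame list, and the names `MMU`, `Microcontroller`, `PageFault`, and `access` are all hypothetical. The point is only the division of labor: the MMU raises the fault, and the microcontroller next to it installs the mapping so the access can be retried.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class PageFault(Exception):
    """Raised by the toy MMU when a virtual page has no mapping."""

    def __init__(self, vpn):
        super().__init__(f"no mapping for virtual page {vpn}")
        self.vpn = vpn


class MMU:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            raise PageFault(vpn)
        return self.page_table[vpn] * PAGE_SIZE + offset


class Microcontroller:
    """Services faults locally by mapping the page to a free physical frame."""

    def __init__(self, page_table, free_frames):
        self.page_table = page_table
        self.free_frames = list(free_frames)

    def handle_fault(self, fault):
        frame = self.free_frames.pop(0)    # pick an available frame
        self.page_table[fault.vpn] = frame  # update the shared page table


def access(mmu, mcu, vaddr):
    """Translate vaddr, letting the microcontroller remedy a fault once."""
    try:
        return mmu.translate(vaddr)
    except PageFault as fault:
        mcu.handle_fault(fault)
        return mmu.translate(vaddr)  # retry after the mapping is installed
```

In this model the first access to an unmapped page consumes one free frame, and later accesses to the same page translate directly without involving the microcontroller.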


    DEEP LEARNING THREAD COMMUNICATION
    10.
    Invention patent application

    Publication No.: US20200334076A1

    Publication date: 2020-10-22

    Application No.: US16389548

    Filing date: 2019-04-19

    Abstract: An application binary interface (ABI) can be exposed in a processor to enable blocks of threads, which may correspond to separately compiled operators, to communicate without storing data to global memory external to the processor. The ABI can define how results of one computation, corresponding to a first thread block, will be organized in registers and shared memory of a processor at the end of one operator (i.e., kernel). The start of the next operator (i.e., kernel), corresponding to a second thread block, can consume the results from the registers and shared memory. Data can be stored to processor local storage for individual threads as they exit the block. Once published, libraries can be separately compiled, optimized, and tested as long as they adhere to the published ABI.
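The idea of two separately built operators agreeing only on a published layout can be sketched in a few lines. This is a loose analogy in plain Python, not the ABI from the application: `TileABI`, the row-major layout, the 2x3 tile size, and the operator functions are hypothetical, and an ordinary list stands in for registers and shared memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileABI:
    """Hypothetical published layout for a result tile in on-chip storage."""
    rows: int
    cols: int

    def index(self, r, c):
        # Row-major placement of element (r, c) in the shared buffer.
        return r * self.cols + c


ABI = TileABI(rows=2, cols=3)


def producer_operator(shared):
    """First operator: leaves its result tile in place according to the ABI."""
    for r in range(ABI.rows):
        for c in range(ABI.cols):
            shared[ABI.index(r, c)] = 10 * r + c


def consumer_operator(shared):
    """Second operator: reads the tile using only the ABI, returns row sums."""
    return [sum(shared[ABI.index(r, c)] for c in range(ABI.cols))
            for r in range(ABI.rows)]


def run_pipeline():
    shared = [0] * (ABI.rows * ABI.cols)  # stand-in for registers / shared memory
    producer_operator(shared)             # no round trip through "global memory"
    return consumer_operator(shared)
```

Neither operator calls the other or knows how it was written. As long as both follow `ABI`, either one could be replaced independently.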
