NETWORK-RELATED PERFORMANCE FOR GPUS
    31. Invention Application

    Publication No.: US20200034195A1

    Publication Date: 2020-01-30

    Application No.: US16049216

    Filing Date: 2018-07-30

    Abstract: Techniques for improved networking performance in systems where a graphics processing unit or other highly parallel non-central-processing-unit (referred to as an accelerated processing device or “APD” herein) has the ability to directly issue commands to a networking device such as a network interface controller (“NIC”) are disclosed. According to a first technique, the latency associated with loading certain metadata into NIC hardware memory is reduced or eliminated by pre-fetching network command queue metadata into hardware network command queue metadata slots of the NIC, thereby reducing the latency associated with fetching that metadata at a later time. A second technique involves reducing latency by prioritizing work on an APD when it is known that certain network traffic is soon to arrive over the network via a NIC.
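    The first technique can be illustrated with a small simulation. This is a minimal sketch under assumed names (NicModel, FETCH_LATENCY, and so on are invented for illustration); real NIC hardware exposes nothing like a Python API, but the latency asymmetry between a prefetched and a demand-fetched metadata slot is the point of the claim.

```python
# Hypothetical model of technique 1: pre-loading network command queue
# metadata into NIC hardware slots so a later APD-issued command does not
# stall on a metadata fetch. All names and latencies are illustrative.

FETCH_LATENCY = 100   # cost (arbitrary cycles) of fetching metadata on demand
HIT_LATENCY = 1       # cost when metadata is already resident in a slot

class NicModel:
    def __init__(self, num_slots):
        self.slots = {}           # queue_id -> metadata resident in NIC memory
        self.num_slots = num_slots

    def prefetch(self, queue_id, metadata):
        """Load metadata into a hardware slot ahead of time (technique 1)."""
        if len(self.slots) < self.num_slots:
            self.slots[queue_id] = metadata

    def issue_command(self, queue_id, backing_store):
        """Return the latency an APD-issued command would observe."""
        if queue_id in self.slots:
            return HIT_LATENCY
        # Miss: metadata must be fetched from memory before the command runs.
        self.slots[queue_id] = backing_store[queue_id]
        return FETCH_LATENCY

backing = {q: {"ring_base": q * 4096, "head": 0} for q in range(8)}
nic = NicModel(num_slots=4)
nic.prefetch(0, backing[0])
assert nic.issue_command(0, backing) == HIT_LATENCY     # prefetched: fast path
assert nic.issue_command(5, backing) == FETCH_LATENCY   # demand fetch: slow path
```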

    Conditional atomic operations in single instruction multiple data processors

    Publication No.: US10209990B2

    Publication Date: 2019-02-19

    Application No.: US14728643

    Filing Date: 2015-06-02

    Abstract: A conditional fetch-and-phi operation tests a memory location to determine whether the memory location stores a specified value and, if so, modifies the value at that location. The conditional fetch-and-phi operation can be implemented so that it can be executed concurrently by a plurality of threads, such as the threads of a wavefront at a GPU. To execute the conditional fetch-and-phi operation, one of the concurrently executing threads is selected to execute a compare-and-swap (CAS) operation at the memory location, while the other threads await the result. The CAS operation tests the value at the memory location and, if the CAS operation is successful, the value is passed to each of the concurrently executing threads.
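    The leader-election pattern the abstract describes can be sketched with ordinary threads. A lock stands in for the hardware CAS, and a barrier stands in for the wavefront's lockstep execution; class and function names are invented, not from the patent.

```python
# Sketch of a conditional fetch-and-phi: one thread per "wavefront" is
# elected to run the single compare-and-swap, and its result is broadcast
# to the waiting threads.

import threading

class SharedCell:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Atomically replace value with new if it currently equals expected."""
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

def conditional_fetch_and_phi(cell, expected, phi, num_threads):
    results = [None] * num_threads
    barrier = threading.Barrier(num_threads)
    broadcast = {}

    def worker(tid):
        barrier.wait()
        if tid == 0:  # elected leader performs the single CAS
            broadcast["ok"] = cell.compare_and_swap(expected, phi(expected))
        barrier.wait()  # other threads await the leader's result
        results[tid] = (broadcast["ok"], cell.value)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

cell = SharedCell(7)
out = conditional_fetch_and_phi(cell, expected=7, phi=lambda v: v + 1, num_threads=4)
assert all(ok and val == 8 for ok, val in out)  # every thread sees the CAS result
```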

    Flexible framework to support memory synchronization operations

    Publication No.: US10198261B2

    Publication Date: 2019-02-05

    Application No.: US15096205

    Filing Date: 2016-04-11

    Abstract: A method of performing memory synchronization operations is provided. The method includes receiving, at a programmable cache controller in communication with one or more caches, an instruction in a first language to perform a memory synchronization operation that synchronizes a plurality of instruction sequences executing on a processor; mapping the received instruction in the first language to one or more selected cache operations in a second language executable by the cache controller; and executing the one or more cache operations to perform the memory synchronization operation. The method further comprises receiving a second mapping that maps the received instruction to one or more other cache operations, mapping the received instruction accordingly, and executing those other cache operations to perform the memory synchronization operation.
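    The two-level mapping idea reduces to a replaceable translation table. The sketch below uses invented instruction and cache-operation names; the point is that the same synchronization instruction can be redirected to a different operation sequence by installing a second mapping, as the claim describes.

```python
# Sketch of a programmable cache controller that translates synchronization
# instructions (first language) into sequences of cache operations (second
# language), where the mapping itself can be replaced at run time.

class ProgrammableCacheController:
    def __init__(self, mapping):
        self.mapping = mapping   # sync instruction -> list of cache operations
        self.log = []

    def load_mapping(self, mapping):
        """Install a second mapping, redirecting instructions to other ops."""
        self.mapping = mapping

    def execute(self, instruction):
        for cache_op in self.mapping[instruction]:
            self.log.append(cache_op)   # stand-in for running the op on a cache

default_map = {"acquire": ["invalidate_L1"], "release": ["flush_L1", "flush_L2"]}
ctrl = ProgrammableCacheController(default_map)
ctrl.execute("release")
assert ctrl.log == ["flush_L1", "flush_L2"]

# Install a second mapping that routes "release" to a cheaper sequence.
ctrl.load_mapping({"release": ["flush_L1"]})
ctrl.execute("release")
assert ctrl.log == ["flush_L1", "flush_L2", "flush_L1"]
```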

    MONITOR SUPPORT ON ACCELERATED PROCESSING DEVICE

    Publication No.: US20190034151A1

    Publication Date: 2019-01-31

    Application No.: US15661843

    Filing Date: 2017-07-27

    Abstract: A technique for implementing synchronization monitors on an accelerated processing device (“APD”) is provided. Work on an APD includes workgroups that include one or more wavefronts. All wavefronts of a workgroup execute on a single compute unit. A monitor is a synchronization construct that allows workgroups to stall until a particular condition is met. Responsive to all wavefronts of a workgroup executing a wait instruction, the monitor coordinator records the workgroup in an “entry queue.” The workgroup begins saving its state to a general APD memory and, when such saving is complete, the monitor coordinator moves the workgroup to a “condition queue.” When the condition specified by the wait instruction is met, the monitor coordinator moves the workgroup to a “ready queue,” and, when sufficient resources are available on a compute unit, the APD schedules the ready workgroup for execution on a compute unit.
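    The monitor coordinator's life cycle is a small state machine over three queues. The sketch below models just that flow; the class and method names are illustrative, not the APD's actual interface.

```python
# Sketch of the monitor coordinator's three-queue life cycle: a waiting
# workgroup enters the entry queue, moves to the condition queue once its
# state is saved to APD memory, and to the ready queue once the awaited
# condition holds, after which it can be scheduled again.

from collections import deque

class MonitorCoordinator:
    def __init__(self):
        self.entry = deque()
        self.condition = deque()
        self.ready = deque()

    def wait(self, workgroup):
        """All wavefronts of the workgroup executed the wait instruction."""
        self.entry.append(workgroup)

    def state_saved(self, workgroup):
        """The workgroup finished saving its state to general APD memory."""
        self.entry.remove(workgroup)
        self.condition.append(workgroup)

    def condition_met(self, workgroup):
        self.condition.remove(workgroup)
        self.ready.append(workgroup)

    def schedule(self):
        """Compute-unit resources freed: pick a ready workgroup to run."""
        return self.ready.popleft() if self.ready else None

mc = MonitorCoordinator()
mc.wait("wg0")
mc.state_saved("wg0")
mc.condition_met("wg0")
assert mc.schedule() == "wg0"
assert mc.schedule() is None
```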

    DYNAMIC WAVEFRONT CREATION FOR PROCESSING UNITS USING A HYBRID COMPACTOR
    36. Invention Application (In Force)

    Publication No.: US20160239302A1

    Publication Date: 2016-08-18

    Application No.: US14682971

    Filing Date: 2015-04-09

    Abstract: A method, a non-transitory computer readable medium, and a processor for repacking dynamic wavefronts during program code execution on a processing unit are presented, where each dynamic wavefront includes multiple threads. If a branch instruction is detected, a determination is made whether all wavefronts following the same control path in the program code have reached a compaction point, which is the branch instruction. If no branch instruction is detected while executing the program code, a determination is made whether all wavefronts following the same control path have reached a reconvergence point, which is the beginning of a program code segment to be executed by both the taken and the not-taken branch of a previous branch instruction. The dynamic wavefronts are repacked with all threads that follow the same control path once all wavefronts following that path have reached the branch instruction or the reconvergence point.

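    The repacking step itself can be sketched as a regrouping of threads by control path. The wavefront size, thread tuples, and function names below are illustrative assumptions, not the patent's hybrid compactor design.

```python
# Sketch of repacking at a compaction or reconvergence point: threads from
# several divergent wavefronts that took the same control path are regrouped
# into full, uniform wavefronts.

WAVEFRONT_SIZE = 4

def repack(wavefronts, path_of):
    """Regroup threads by control path once all wavefronts on that path
    have reached the branch (compaction) or reconvergence point."""
    by_path = {}
    for wf in wavefronts:
        for thread in wf:
            by_path.setdefault(path_of(thread), []).append(thread)
    new_wavefronts = []
    for path, threads in by_path.items():
        for i in range(0, len(threads), WAVEFRONT_SIZE):
            new_wavefronts.append(threads[i:i + WAVEFRONT_SIZE])
    return new_wavefronts

# Threads are (id, taken?) pairs; two half-divergent wavefronts arrive.
wfs = [[(0, True), (1, False), (2, True), (3, False)],
       [(4, True), (5, True), (6, False), (7, False)]]
packed = repack(wfs, path_of=lambda t: t[1])
# Each repacked wavefront is now uniform: all its threads share one path.
assert all(len({t[1] for t in wf}) == 1 for wf in packed)
```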

    Stack cache management and coherence techniques
    37. Granted Patent (In Force)

    Publication No.: US09189399B2

    Publication Date: 2015-11-17

    Application No.: US13887196

    Filing Date: 2013-05-03

    Abstract: A processor system presented here has a plurality of execution cores and a plurality of stack caches, wherein each of the stack caches is associated with a different one of the execution cores. A method of managing stack data for the processor system is also presented. The method maintains a stack cache manager for the plurality of execution cores; the manager includes entries for stack data accessed by those cores. For a requesting execution core, the method processes a virtual address for requested stack data, accesses the stack cache manager to search for an entry that includes that virtual address, and uses information in the entry to retrieve the requested stack data.

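    The lookup path reduces to a shared directory from virtual addresses to the per-core stack cache holding the data. All names in the sketch below are invented for illustration.

```python
# Sketch of the shared stack cache manager: it maps virtual addresses of
# stack data to the per-core stack cache currently holding them, so a
# requesting core can locate stack data cached by another core.

class StackCacheManager:
    def __init__(self):
        self.entries = {}   # virtual address -> (owning core, stack value)

    def record_access(self, core, vaddr, value):
        """A core's stack cache filled this address; register the entry."""
        self.entries[vaddr] = (core, value)

    def lookup(self, vaddr):
        """Search for an entry that includes the requested virtual address."""
        return self.entries.get(vaddr)

mgr = StackCacheManager()
mgr.record_access(core=0, vaddr=0x7FFF_0010, value="frame-local")
hit = mgr.lookup(0x7FFF_0010)
assert hit == (0, "frame-local")        # another core can retrieve the data
assert mgr.lookup(0x7FFF_0020) is None  # miss: fall back to backing memory
```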

    HIERARCHICAL WRITE-COMBINING CACHE COHERENCE
    38. Invention Application (In Force)

    Publication No.: US20150058567A1

    Publication Date: 2015-02-26

    Application No.: US14010096

    Filing Date: 2013-08-26

    CPC classification number: G06F12/0811 G06F12/0804 Y02D10/13

    Abstract: A method, computer program product, and system are described that enforce a release consistency with special accesses sequentially consistent (RCsc) memory model and execute release synchronization instructions, such as a StRel event, without tracking an outstanding store event through the memory hierarchy, while using bandwidth resources efficiently. Also described is the decoupling of a store event from the ordering of that store event with respect to the RCsc memory model. The description further includes a set of hierarchical read-only caches and write-only combining buffers that coalesce stores from different parts of the system. In addition, a pool component maintains a partial order of received store events and release synchronization events to avoid content addressable memory (CAM) structures, full cache flushes, and direct write-throughs to memory. The approach improves the performance of both global and local synchronization events and reduces the overhead of maintaining write-only combining buffers.

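    The core of the write-only combining buffer can be sketched very simply: stores to the same address coalesce in the buffer, and a release event drains them without any per-store tracking. This greatly simplifies the patented hierarchy (no pool component, no partial ordering); the names and the buffer/memory split are assumptions for illustration only.

```python
# Sketch of a write-only combining buffer: stores coalesce by address, and a
# StRel-style release event publishes all of them to the next level at once.

class CombiningBuffer:
    def __init__(self, memory):
        self.pending = {}    # address -> latest value (stores coalesce here)
        self.memory = memory

    def store(self, addr, value):
        self.pending[addr] = value   # a later store to the same addr overwrites

    def release(self):
        """StRel-style event: publish all coalesced stores, then clear."""
        self.memory.update(self.pending)
        self.pending.clear()

mem = {}
buf = CombiningBuffer(mem)
buf.store(0x10, 1)
buf.store(0x10, 2)   # coalesces with the first store to 0x10
buf.store(0x20, 5)
assert mem == {}     # nothing is visible before the release event
buf.release()
assert mem == {0x10: 2, 0x20: 5}
```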

    Conditional Notification Mechanism
    39. Invention Application (Under Examination, Published)

    Publication No.: US20140250442A1

    Publication Date: 2014-09-04

    Application No.: US13782063

    Filing Date: 2013-03-01

    CPC classification number: G06F9/542 G06F2209/543

    Abstract: The described embodiments include a computing device. In these embodiments, an entity in the computing device receives an identification of a memory location and a condition to be met by a value in the memory location. Upon a predetermined event occurring, the entity causes an operation to be performed when the value in the memory location meets the condition.

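    The mechanism can be sketched as a watch list over memory locations. Here the "predetermined event" is modeled as a store to the watched location; the class and callback names are invented for illustration.

```python
# Sketch of the conditional notification mechanism: an entity registers a
# memory location and a condition on its value; when the predetermined event
# occurs, the operation runs only if the condition is met.

class ConditionalNotifier:
    def __init__(self):
        self.memory = {}
        self.watches = []    # (location, condition, operation) triples

    def register(self, location, condition, operation):
        self.watches.append((location, condition, operation))

    def store(self, location, value):
        """The predetermined event: a write to a watched location."""
        self.memory[location] = value
        for loc, cond, op in self.watches:
            if loc == location and cond(value):
                op(value)

fired = []
notifier = ConditionalNotifier()
notifier.register(location=0x40, condition=lambda v: v >= 10,
                  operation=lambda v: fired.append(v))
notifier.store(0x40, 3)     # condition not met: no operation performed
notifier.store(0x40, 12)    # condition met: operation performed
assert fired == [12]
```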

    CACHE COHERENCY USING DIE-STACKED MEMORY DEVICE WITH LOGIC DIE
    40. Invention Application (In Force)

    Publication No.: US20140181417A1

    Publication Date: 2014-06-26

    Application No.: US13726146

    Filing Date: 2012-12-23

    Abstract: A die-stacked memory device implements an integrated coherency manager to offload cache coherency protocol operations for the devices of a processing system. The die-stacked memory device includes a set of one or more stacked memory dies and a set of one or more logic dies. The one or more logic dies implement hardware logic providing a memory interface and the coherency manager. The memory interface operates to perform memory accesses in response to memory access requests from the coherency manager and the one or more external devices. The coherency manager comprises logic to perform coherency operations for shared data stored at the stacked memory dies. Due to the integration of the logic dies and the memory dies, the coherency manager can access shared data stored in the memory dies and perform related coherency operations with higher bandwidth and lower latency and power consumption compared to the external devices.

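    A minimal sketch of what the in-stack coherency manager does for shared data: a directory on the logic die tracks which external devices hold each line and invalidates stale copies on a write. The directory scheme and all names are assumptions for illustration, not the patent's protocol.

```python
# Sketch of a directory-based coherency manager on the logic die: it tracks
# sharers of each address and returns the set of devices whose copies must
# be invalidated when one device writes.

class CoherencyManager:
    def __init__(self):
        self.memory = {}
        self.sharers = {}    # address -> set of devices caching the line

    def read(self, device, addr):
        self.sharers.setdefault(addr, set()).add(device)
        return self.memory.get(addr, 0)

    def write(self, device, addr, value):
        """Invalidate other sharers before making the write visible."""
        invalidated = self.sharers.get(addr, set()) - {device}
        self.sharers[addr] = {device}
        self.memory[addr] = value
        return invalidated   # devices that must drop their stale copy

cm = CoherencyManager()
cm.read("cpu", 0x100)
cm.read("gpu", 0x100)
assert cm.write("cpu", 0x100, 42) == {"gpu"}   # the GPU's copy is invalidated
assert cm.read("gpu", 0x100) == 42             # re-read fetches the new value
```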
