Non-Blocking Parallel Bulk Memory Operations

    Publication Number: US20250110647A1

    Publication Date: 2025-04-03

    Application Number: US18477885

    Application Date: 2023-09-29

    Abstract: Non-blocking processing systems are described. In accordance with the described techniques, a pending range store receives, at the start of a bulk memory operation, the pending memory range of the bulk memory operation. A logic unit includes at least one of check conflict logic or check address logic. The logic unit detects a conflicting memory access based on a target address of the pending memory range conflicting with a memory access request separate from the bulk memory operation, and performs at least a portion of the bulk memory operation associated with the target address before the memory access request is allowed to proceed.
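
    The conflict check described above can be pictured as a range lookup that gates ordinary memory accesses. The following Python sketch is a minimal illustration, not the patented design: the names (PendingRangeStore, check_conflict, do_partial_bulk) and the flat list of ranges are assumptions.

        from dataclasses import dataclass

        @dataclass
        class PendingRange:
            """Address range still covered by an in-flight bulk memory operation."""
            start: int  # first address covered by the bulk operation
            end: int    # one past the last covered address

        class PendingRangeStore:
            """Receives a pending memory range at the start of each bulk operation."""

            def __init__(self):
                self.ranges = []

            def begin_bulk_op(self, start, length):
                pending = PendingRange(start, start + length)
                self.ranges.append(pending)
                return pending

            def check_conflict(self, target):
                """Check-address logic: return the pending range covering target, if any."""
                for pending in self.ranges:
                    if pending.start <= target < pending.end:
                        return pending
                return None

        def service_access(store, target, do_partial_bulk):
            """Gate an ordinary memory access against in-flight bulk operations."""
            conflict = store.check_conflict(target)
            if conflict is not None:
                # Complete only the slice of the bulk operation covering the
                # target address; the rest of the bulk operation stays pending,
                # so it never blocks unrelated accesses.
                do_partial_bulk(conflict, target)
            # The ordinary memory access is now allowed to proceed.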

    IN-SWITCH EMBEDDING BAG POOLING
    Type: Invention Application

    Publication Number: US20250110899A1

    Publication Date: 2025-04-03

    Application Number: US18478659

    Application Date: 2023-09-29

    Abstract: An apparatus and method are described for reducing the memory bandwidth consumed when executing machine learning models. A computing system includes two or more processing nodes, each including at least one or more processors and a corresponding local memory. Switch circuitry communicates with at least the local memories and a system memory of the computing system. The switch includes multiple direct memory access (DMA) interfaces. Each of the one or more processing nodes stores multiple embedding rows of embedding tables. A processor of the processing node identifies two or more embedding rows as source operands of a reduction operation. The switch executes memory access requests to retrieve data of the two or more embedding rows from the corresponding local memory, generates a result by performing the reduction operation, and sends the result to the local memory.
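
    The bandwidth saving comes from reducing embedding rows where they are stored instead of shipping every row to a processor. A minimal Python model of that switch-side reduction follows; the dictionary standing in for local memory and DMA transfers, and the function name, are assumptions for illustration.

        import numpy as np

        def switch_embedding_bag_pool(local_memory, row_ids, result_addr):
            """Reduce the requested embedding rows inside the 'switch'."""
            # Retrieve each embedding row (stands in for one DMA read per row).
            rows = [local_memory[row_id] for row_id in row_ids]
            # Perform the reduction (sum pooling here) without moving the rows
            # to a processing node.
            pooled = np.sum(rows, axis=0)
            # Write back only the reduced result (one DMA write), saving the
            # bandwidth of transferring every source row.
            local_memory[result_addr] = pooled

        # Example: pool rows 3 and 7 of a toy embedding table into address 100.
        memory = {i: np.full(4, float(i)) for i in range(10)}
        switch_embedding_bag_pool(memory, [3, 7], result_addr=100)
        print(memory[100])  # [10. 10. 10. 10.]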

    NETWORK-RELATED PERFORMANCE FOR GPUS
    Type: Invention Application

    Publication Number: US20200034195A1

    Publication Date: 2020-01-30

    Application Number: US16049216

    Application Date: 2018-07-30

    Abstract: Techniques are disclosed for improved networking performance in systems where a graphics processing unit or other highly parallel device that is not a central processing unit (referred to herein as an accelerated processing device or "APD") has the ability to directly issue commands to a networking device such as a network interface controller ("NIC"). In a first technique, the latency associated with loading certain metadata into NIC hardware memory is reduced or eliminated by pre-fetching network command queue metadata into hardware network command queue metadata slots of the NIC, so that the metadata does not have to be fetched at a later time. A second technique reduces latency by prioritizing work on an APD when it is known that certain network traffic will soon arrive over the network via a NIC.
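
    The first technique amounts to warming a bounded set of hardware slots before commands arrive, so the doorbell path skips a metadata fetch. The Python sketch below is illustrative only; the slot layout and the names (QueueMetadata, prefetch, ring_doorbell) are assumptions rather than the NIC's actual interface.

        from dataclasses import dataclass

        @dataclass
        class QueueMetadata:
            """State the NIC needs in order to consume a network command queue."""
            queue_id: int
            base_addr: int  # location of the command queue in memory
            head: int = 0
            tail: int = 0

        class NIC:
            def __init__(self, num_hw_slots):
                # Fixed number of hardware slots for command-queue metadata.
                self.num_hw_slots = num_hw_slots
                self.slots = {}

            def prefetch(self, meta):
                """Load queue metadata into a hardware slot ahead of time."""
                if len(self.slots) >= self.num_hw_slots:
                    return False  # all slots busy; caller may evict or retry
                self.slots[meta.queue_id] = meta
                return True

            def ring_doorbell(self, queue_id, new_tail, fetch_from_memory):
                """Process new commands; a prefetched queue skips the fetch."""
                meta = self.slots.get(queue_id)
                if meta is None:
                    # Slow path: fetch the metadata now, paying exactly the
                    # latency that prefetching was meant to hide.
                    meta = fetch_from_memory(queue_id)
                    self.slots[queue_id] = meta
                meta.tail = new_tail  # commands in [head, tail) are now visible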

    Processing Element-Centric All-to-All Communication

    Publication Number: US20240220336A1

    Publication Date: 2024-07-04

    Application Number: US18147081

    Application Date: 2022-12-28

    CPC classification number: G06F9/54 G06F9/5044 G06F15/17356

    Abstract: In accordance with the described techniques for PE-centric all-to-all communication, a distributed computing system includes processing elements (PEs), such as graphics processing units, distributed in clusters. An all-to-all communication procedure is performed by the processing elements, each of which generates data packets in parallel for all-to-all data communication between the clusters. The all-to-all communication procedure includes a first stage of intra-cluster parallel data communication between the respective processing elements of each cluster; a second stage of inter-cluster data exchange for all-to-all data communication between the clusters; and a third stage of intra-cluster data distribution to the respective processing elements of each cluster.
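
    The three stages can be modeled as a nested transpose over clusters. The Python sketch below is a toy model under the assumption of one processing element per destination cluster and cluster-granular payloads; the data layout and function name are illustrative, not from the patent.

        def pe_centric_all_to_all(data):
            """data[c][p][d]: payload PE p of cluster c holds for cluster d.

            Assumes the number of PEs per cluster equals the number of
            clusters, so PE p handles the payload bound for cluster p.
            """
            n = len(data)

            # Stage 1: intra-cluster parallel exchange. PE p of each cluster
            # gathers from its peers every payload destined for cluster p.
            gathered = [[[data[c][q][p] for q in range(n)]
                         for p in range(n)] for c in range(n)]

            # Stage 2: inter-cluster exchange. Each gathered bundle travels to
            # its destination cluster in a single bulk transfer.
            exchanged = [[gathered[src][dst] for src in range(n)]
                         for dst in range(n)]

            # Stage 3: intra-cluster distribution. The destination cluster
            # scatters the arrived bundles back to its local PEs, keyed by the
            # source-PE index.
            return [[[exchanged[dst][src][q] for src in range(n)]
                     for q in range(n)] for dst in range(n)]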
