Memory access operation in distributed computing system

    Publication No.: US11874785B1

    Publication date: 2024-01-16

    Application No.: US17949151

    Filing date: 2022-09-20

    Abstract: In one example, an apparatus comprises: a local on-chip memory; a computation engine configured to generate local data and to store the local data at the local on-chip memory; and a controller. The apparatus is configured to be coupled with a second device via an interconnect, the second device comprising a local memory. The controller is configured to: fetch the local data from the local on-chip memory; fetch remote data generated by another device from a local off-chip memory; generate output data based on combining the local data and the remote data; and store, via the interconnect, the output data at the local memory of the second device.
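
The data flow in this abstract can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the memories are modeled as plain dictionaries, the function name `reduce_and_forward` is hypothetical, and element-wise addition is an assumed choice for "combining" (the abstract does not say which operation is used).

```python
def reduce_and_forward(on_chip, off_chip, peer_memory, key):
    """Combine local and remote data, then store the result at a peer device.

    on_chip     -- dict modeling the local on-chip memory (holds local data)
    off_chip    -- dict modeling the local off-chip memory (holds remote data
                   that another device produced)
    peer_memory -- dict modeling the local memory of the second device,
                   reached over the interconnect
    """
    local_data = on_chip[key]      # fetch the local data
    remote_data = off_chip[key]    # fetch the remote data
    # Assumption: "combining" is an element-wise sum, as in a reduction step.
    output = [a + b for a, b in zip(local_data, remote_data)]
    peer_memory[key] = output      # store via the (modeled) interconnect
    return output
```

With `on_chip = {"g": [1, 2]}` and `off_chip = {"g": [10, 20]}`, the call leaves the combined list under `"g"` in `peer_memory`.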

    Data replication for accelerator
    Granted invention patent

    Publication No.: US11500802B1

    Publication date: 2022-11-15

    Application No.: US17301344

    Filing date: 2021-03-31

    Abstract: A direct memory access (DMA) engine can be used to multicast data from system memory to a target memory for loading into an array. The DMA engine may include a controller that is configured to receive a data transfer request, and generate a set of write operations for the output interface. The set of write operations can include, for each of multiple partitions of the target memory, a write operation to write usable data from the multicast data to an address offset in the corresponding partition, and an additional write operation to write filler data from the multicast data to a null device address.
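
The set of write operations described in this abstract can be sketched as follows. This is a hedged illustration, not the DMA engine's actual interface: `WriteOp`, `plan_multicast_writes`, and the `NULL_DEVICE_ADDR` value are hypothetical names and numbers, and the sketch assumes the usable data comes first in the multicast data, followed by the filler.

```python
from dataclasses import dataclass

# Hypothetical address of a null device that discards whatever is written.
NULL_DEVICE_ADDR = 0xFFFF0000


@dataclass(frozen=True)
class WriteOp:
    dst_addr: int    # destination address on the output interface
    src_offset: int  # offset into the multicast data
    length: int      # number of bytes to write


def plan_multicast_writes(partition_bases, addr_offset, usable_len, filler_len):
    """Build the write operations for one multicast transfer.

    For each partition: one write of the usable data to the address offset in
    that partition, plus one write of the filler data to the null device.
    """
    ops = []
    for base in partition_bases:
        ops.append(WriteOp(base + addr_offset, 0, usable_len))
        # Assumption: the filler write is skipped when there is no filler.
        if filler_len > 0:
            ops.append(WriteOp(NULL_DEVICE_ADDR, usable_len, filler_len))
    return ops
```

For two partitions, the plan contains four operations: a usable-data write and a filler write per partition.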

    Data synchronization operation at distributed computing system

    Publication No.: US11409685B1

    Publication date: 2022-08-09

    Application No.: US17031653

    Filing date: 2020-09-24

    Abstract: In one example, a method comprises: receiving, by a hardware data processor and from a network adapter, a transfer complete message indicating that the network adapter has initiated a transfer of data received from a network to the hardware data processor, the transfer being performed over an interconnect coupled between the hardware data processor and the network adapter; based on receiving the transfer complete message, performing, by the hardware data processor, a flush operation to fetch any remaining portion of the data buffered in the interconnect to a local memory of the hardware data processor; based on determining that the flush operation is complete, storing, by the hardware data processor, the transfer complete message at the local memory; and based on determining that the transfer complete message is stored at the local memory, starting a computation operation on the data at the hardware data processor or performing an error handling operation.
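
The ordering in this abstract (flush first, then record the completion message, then decide what to do) can be illustrated with a small Python model. It is a sketch under stated assumptions: the class and method names are hypothetical, the interconnect buffer is modeled as a `deque` of byte chunks, and the rule for choosing between computation and error handling (comparing the received byte count with an `expected_bytes` field) is an assumption that the abstract does not specify.

```python
from collections import deque


class HardwareDataProcessor:
    """Toy model of the receiving side of the synchronization sequence."""

    def __init__(self):
        self.local_memory = bytearray()  # data that has already landed
        self.completion_record = None    # stored transfer complete message

    def flush(self, interconnect_buffer):
        # Drain whatever is still buffered in the interconnect.
        while interconnect_buffer:
            self.local_memory += interconnect_buffer.popleft()

    def on_transfer_complete(self, message, interconnect_buffer):
        # 1. Flush the remaining data into local memory.
        self.flush(interconnect_buffer)
        # 2. Only after the flush completes, store the completion message.
        self.completion_record = message
        # 3. Assumed criterion: compute if all expected bytes arrived.
        if len(self.local_memory) == message["expected_bytes"]:
            return "compute"
        return "error"
```

Because the message is stored only after the flush, a consumer that sees the stored message can rely on the data in local memory being complete up to that point.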

    Control plane operation at distributed computing system

    Publication No.: US11354258B1

    Publication date: 2022-06-07

    Application No.: US17038623

    Filing date: 2020-09-30

    Abstract: In one example, an apparatus comprises: a first local memory, a computation engine configured to generate local data and to store the local data at the first local memory, and a controller. The apparatus is coupled with a host processor and a second device via an interconnect, the second device comprising a second local memory, the host processor hosting an application. The controller is configured to: receive, from the second device, a first message indicating that first data is stored in the second local memory; based on the first message: fetch the first data from the second local memory via the interconnect; control the computation engine to perform a computation operation on the first data to generate second data to support the application hosted by the host processor; and transmit, to the second device, a second message indicating that the second data is stored in the first local memory.
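
The message exchange in this abstract can be sketched as a pair of toy devices. Everything here is illustrative: `Device`, `handle_first_message`, the message fields, and the `"result"` key are hypothetical, and the memories and inboxes are ordinary Python containers rather than hardware structures.

```python
class Device:
    """Toy device with a local memory and an inbox for control messages."""

    def __init__(self, name):
        self.name = name
        self.local_memory = {}
        self.inbox = []


def handle_first_message(accel, peer, compute):
    """Process one 'data is stored' message on the accelerator side."""
    first_msg = accel.inbox.pop(0)
    # Fetch the first data from the peer's local memory (modeled interconnect).
    first_data = peer.local_memory[first_msg["key"]]
    # Run the computation engine on it to produce the second data.
    second_data = compute(first_data)
    accel.local_memory["result"] = second_data
    # Tell the peer that the second data is now in the accelerator's memory.
    peer.inbox.append({"type": "data_ready", "key": "result", "owner": accel.name})
    return second_data
```

The host application is not modeled; in this sketch the devices coordinate directly through the two messages.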

    SPECULATIVE TRAINING USING PARTIAL GRADIENTS UPDATE

    Publication No.: US20210304008A1

    Publication date: 2021-09-30

    Application No.: US16831060

    Filing date: 2020-03-26

    Abstract: The exchange of weight gradients among the processing nodes can introduce a substantial bottleneck to the training process. Instead of remaining idle during the weight gradients exchange process, a processing node can update its own set of weights for the next iteration of the training process using the processing node's local weight gradients. The next iteration of training can be started by using these speculative weights until the weight gradients exchange process completes and a global weights update is available. If the speculative weights are close enough to the weight values from the global weights update, the training process at the processing node can continue training using the results computed from the speculative weights to reduce the overall training time.
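
The decision described in this abstract can be sketched in a few lines of plain Python. This is a simplified illustration, not the method claimed in the application: the function names are hypothetical, the global update is assumed to be a plain gradient-descent step on the average of all nodes' gradients, and "close enough" is assumed to mean a maximum absolute difference within a tolerance `tol`.

```python
def sgd_step(weights, grads, lr):
    # Plain gradient-descent update: w <- w - lr * g.
    return [w - lr * g for w, g in zip(weights, grads)]


def average(grad_lists):
    # Element-wise mean of the gradients from all nodes (assumed reduction).
    n = len(grad_lists)
    return [sum(column) / n for column in zip(*grad_lists)]


def close_enough(a, b, tol):
    # Assumed closeness test: maximum absolute difference within tol.
    return max(abs(x - y) for x, y in zip(a, b)) <= tol


def speculative_update(weights, local_grads, all_grads, lr, tol):
    """Return (weights to continue with, whether speculation was kept)."""
    # Speculative weights use only this node's local gradients, so the next
    # iteration could start while the gradient exchange is still in flight.
    speculative = sgd_step(weights, local_grads, lr)
    # Once the exchange completes, the global update becomes available.
    global_weights = sgd_step(weights, average(all_grads), lr)
    keep = close_enough(speculative, global_weights, tol)
    return (speculative if keep else global_weights), keep
```

When the speculation is kept, work already done with the speculative weights does not need to be repeated; otherwise the node falls back to the global weights.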
