Sparse machine learning acceleration

    Publication Number: US12254398B2

    Publication Date: 2025-03-18

    Application Number: US17301271

    Application Date: 2021-03-30

    Abstract: To reduce the storage size of weight tensors and speed up loading of weight tensors from system memory, a compression technique can be employed to remove zero values from a weight tensor before storing the weight tensor in system memory. A sparsity threshold can be enforced to achieve a compression ratio target by forcing small weight values to zero during training. When the weight tensor is loaded from system memory, a direct memory access (DMA) engine with an in-line decompression unit can decompress the weight tensor on-the-fly. By performing the decompression in the DMA engine, expansion of the weight values back to the original weight tensor size can be carried out in parallel while other neural network computations are being performed by the processing unit.
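The scheme described in the abstract can be sketched in a few lines. This is an illustrative model only (function names and the bitmask representation are my own, not from the patent): small weights are forced to zero to hit a sparsity target, the zeros are removed before storage, and a mask records their positions so the decompressor can expand the tensor back to its original size.

```python
import numpy as np

def compress(weights: np.ndarray, threshold: float):
    """Force |w| < threshold to zero, then keep only nonzero values
    plus a mask of their positions (one bit per element in practice)."""
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    mask = pruned != 0
    values = pruned[mask]            # zeros removed before storing
    return values, mask

def decompress(values: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Expand back to the original tensor size, as the DMA engine's
    in-line decompression unit would do on-the-fly."""
    out = np.zeros(mask.shape, dtype=values.dtype)
    out[mask] = values
    return out

w = np.array([0.9, 0.01, -0.5, 0.002, 0.3])
vals, mask = compress(w, threshold=0.05)
restored = decompress(vals, mask)
```

Only `vals` and `mask` would be stored in system memory; the expansion runs in the DMA engine concurrently with other computation, which is where the speedup comes from.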

    Address generation for page collision prevention

    Publication Number: US11789859B1

    Publication Date: 2023-10-17

    Application Number: US17449579

    Application Date: 2021-09-30

    Abstract: To generate sequential addresses when multiple integrated circuit (IC) devices are accessing the same memory, an address token is sent along the IC devices communicatively coupled in a ring topology. The address token is first transferred along the ring topology during a memory reservation phase in which each IC device can set a corresponding memory request bit to indicate that the IC device has data to write to the memory. The modified address token is then transferred along the ring topology again during a memory access phase. During the memory access phase, each IC device that has data to write can perform a memory write operation using a sequential address determined from the contents of the address token.
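The two-phase token protocol can be modeled behaviorally as follows (a sketch under my own naming, not the patent's circuit): the first pass around the ring collects request bits, and on the second pass each requesting device derives a unique sequential address from how many requests precede it in ring order, so no two devices collide.

```python
def reservation_phase(devices):
    """First pass of the token: each device with data to write
    sets its request bit in the token."""
    return [1 if has_data else 0 for has_data in devices]

def access_phase(token, base_address, stride):
    """Second pass: a requesting device's address is the base plus
    the count of requests granted before it in ring order."""
    addresses = {}
    slot = 0
    for device_id, bit in enumerate(token):
        if bit:
            addresses[device_id] = base_address + slot * stride
            slot += 1
    return addresses

devices = [True, False, True, True]      # devices 0, 2, 3 have data
token = reservation_phase(devices)
addrs = access_phase(token, base_address=0x1000, stride=0x40)
```

Because every device sees the same token contents, the addresses are sequential and disjoint by construction, which is what prevents page collisions.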

    Data replication for accelerator
    Invention Grant

    Publication Number: US11500802B1

    Publication Date: 2022-11-15

    Application Number: US17301344

    Application Date: 2021-03-31

    Abstract: A direct memory access (DMA) engine can be used to multicast data from system memory to a target memory for loading into an array. The DMA engine may include a controller that is configured to receive a data transfer request, and generate a set of write operations for the output interface. The set of write operations can include, for each of multiple partitions of the target memory, a write operation to write usable data from the multicast data to an address offset in the corresponding partition, and an additional write operation to write filler data from the multicast data to a null device address.
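The write-operation generation described above can be sketched as follows (the constants and tuple layout are assumptions for illustration): for each partition of the target memory, the controller emits one write of usable data at the partition's offset and one write of the remaining multicast data to a null device address, keeping the multicast stream aligned without polluting the partitions.

```python
NULL_DEVICE_ADDR = 0xFFFF_0000   # hypothetical sink address for filler data

def generate_write_ops(num_partitions, partition_size, offset,
                       usable_len, filler_len):
    """Build the per-partition (kind, address, length) write operations
    for one multicast data transfer request."""
    ops = []
    for p in range(num_partitions):
        base = p * partition_size
        ops.append(("write_usable", base + offset, usable_len))
        ops.append(("write_filler", NULL_DEVICE_ADDR, filler_len))
    return ops

ops = generate_write_ops(num_partitions=4, partition_size=0x1000,
                         offset=0x20, usable_len=64, filler_len=192)
```

The filler writes are discarded at the null address; their only purpose is to consume the portion of the multicast data that a given partition does not need.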

    Powering-down or rebooting a device in a system fabric

    Publication Number: US11321179B1

    Publication Date: 2022-05-03

    Application Number: US17001145

    Application Date: 2020-08-24

    Abstract: A circuit at an interface between a device and an interconnect fabric is configured to track outstanding transactions associated with the device and ensure the completion of the outstanding transactions before rebooting or powering down the device. In some embodiments, the circuit is also configurable to provide appropriate responses when the device is powered down or is being rebooted such that other devices in the system can still operate even without knowing that the device is inactive and would not hang because no response is received from the device.
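A minimal behavioral model of this gating logic (my own modeling, not the patent's circuit): a counter tracks outstanding transactions, power-down is only permitted once the counter drains to zero, and while the device is inactive the interface circuit answers requests on its behalf so other devices do not hang waiting for a response.

```python
class TransactionTracker:
    """Tracks outstanding transactions at a device/fabric interface."""

    def __init__(self):
        self.outstanding = 0
        self.powered_down = False

    def issue(self):
        self.outstanding += 1          # request entered the fabric

    def complete(self):
        self.outstanding -= 1          # response returned

    def try_power_down(self) -> bool:
        """Only allow power-down once all transactions have completed."""
        if self.outstanding == 0:
            self.powered_down = True
        return self.powered_down

    def respond(self, request):
        """Provide a default response while the device is inactive."""
        if self.powered_down:
            return (request, "default_response")
        raise RuntimeError("device is active and responds itself")

t = TransactionTracker()
t.issue(); t.issue()
blocked = t.try_power_down()        # still two outstanding: refused
t.complete(); t.complete()
allowed = t.try_power_down()        # drained: power-down proceeds
```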

    GRADIENT COMPRESSION FOR DISTRIBUTED TRAINING

    Publication Number: US20210295168A1

    Publication Date: 2021-09-23

    Application Number: US16827444

    Application Date: 2020-03-23

    Inventors: Kun Xu; Ron Diamant

    Abstract: Techniques for exchanging compressed gradient data within a distributed system are disclosed. A set of gradients are computed at a first worker node of the distributed system using a neural network model and a set of weights associated with the neural network model. Each of the set of gradients having a value less than a threshold is clipped, resulting in non-clipped data elements and clipped data elements. A mapping indicating which of the set of gradients correspond to non-clipped data elements and which of the set of gradients correspond to clipped data elements is generated. Compressed data is generated based on the non-clipped data elements. The mapping and the compressed data are transmitted from the first worker node to a second worker node of the distributed system.
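The clip-map-compress steps can be sketched in a few lines (worker-node transport is omitted, and interpreting "value less than a threshold" as magnitude is my assumption): gradients below the threshold are clipped, a boolean mapping records which elements survived, and only the surviving values travel with the mapping.

```python
import numpy as np

def compress_gradients(grads: np.ndarray, threshold: float):
    """Clip small gradients; return the mapping and the compressed data."""
    mapping = np.abs(grads) >= threshold   # True = non-clipped element
    compressed = grads[mapping]            # only non-clipped values are sent
    return mapping, compressed

def decompress_gradients(mapping, compressed):
    """What the receiving worker node does with the mapping + data:
    clipped positions are reconstructed as zeros."""
    grads = np.zeros(mapping.shape, dtype=compressed.dtype)
    grads[mapping] = compressed
    return grads

g = np.array([0.5, 0.001, -0.3, 0.0002])
mapping, data = compress_gradients(g, threshold=0.01)
restored = decompress_gradients(mapping, data)
```

With sparse gradients, transmitting the mapping (one bit per element) plus the compressed values is far smaller than the full gradient tensor, which is the bandwidth win in distributed training.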

    Matrix transpose hardware acceleration

    Publication Number: US12141468B1

    Publication Date: 2024-11-12

    Application Number: US17875805

    Application Date: 2022-07-28

    Abstract: In one example, an apparatus comprises: a memory array having an array of memory elements arranged in rows and columns, each memory element being configured to store a data element; and a memory access circuit configured to: perform a row write operation to store a first group of data elements at a first row of the array of memory elements; perform a column read operation at a first column of the array of memory elements to obtain a second group of data elements; and perform a column write operation to store a third group of data elements at the first column of the array of memory elements to replace the second group of data elements.
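The access pattern above yields a transpose essentially for free: write rows in, read columns out. A pure-Python behavioral sketch (not RTL; the class and method names are mine) of the three operations the memory access circuit performs:

```python
class TransposeBuffer:
    """Square memory array supporting row writes and column reads/writes."""

    def __init__(self, n):
        self.mem = [[0] * n for _ in range(n)]

    def row_write(self, r, data):
        self.mem[r] = list(data)

    def column_read(self, c):
        return [row[c] for row in self.mem]

    def column_write(self, c, data):
        for r, v in enumerate(data):
            self.mem[r][c] = v

buf = TransposeBuffer(2)
buf.row_write(0, [1, 2])        # store the input matrix row by row
buf.row_write(1, [3, 4])
col0 = buf.column_read(0)       # each column read is a row of the transpose
```

Pairing each column read with a column write of the next matrix's data lets a new tensor stream in while the transposed one streams out, so the array never sits idle between transposes.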

    Communication of data between software applications

    Publication Number: US10860397B1

    Publication Date: 2020-12-08

    Application Number: US16297467

    Application Date: 2019-03-08

    Abstract: A computer system has a memory configured for sharing data between a first application and a second application. The memory includes a metadata region and a data region. The metadata region includes metadata that indicates how data being communicated between the first application and the second application is to be interpreted. The metadata also indicates whether the data can be found in the metadata itself or in a particular location in the data region. Each application can be assigned its own memory location containing a flag that can be set in order to indicate to the other application that the memory is ready to be accessed by the other application. The memory location can be implemented using a hardware register or in memory, either the same memory that includes the metadata and data regions or on a separate memory.
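A toy model of this layout (field names and the inline-size cutoff are assumptions for illustration): the metadata region describes the payload's type and says whether it sits inline in the metadata or at an offset in the data region, and a per-application ready flag hands the buffer to the other side.

```python
class SharedRegion:
    """Shared memory with a metadata region, a data region, and
    per-application ready flags."""

    def __init__(self):
        self.metadata = {"inline": False, "type": None,
                         "payload": None, "data_offset": None, "length": 0}
        self.data_region = bytearray(256)
        self.flags = {"app1_ready": False, "app2_ready": False}

    def write(self, sender_flag, payload: bytes, dtype: str):
        if len(payload) <= 16:      # small payload: store inline in metadata
            self.metadata.update(inline=True, type=dtype, payload=payload)
        else:                       # large payload: place in the data region
            self.data_region[0:len(payload)] = payload
            self.metadata.update(inline=False, type=dtype, payload=None,
                                 data_offset=0, length=len(payload))
        self.flags[sender_flag] = True   # signal the other application

    def read(self, sender_flag):
        assert self.flags[sender_flag], "nothing to read yet"
        if self.metadata["inline"]:
            return self.metadata["payload"]
        off, n = self.metadata["data_offset"], self.metadata["length"]
        return bytes(self.data_region[off:off + n])

r = SharedRegion()
r.write("app1_ready", b"hi", "utf8")
msg = r.read("app1_ready")
```

In the patent the flag locations may be hardware registers rather than memory, but the handshake is the same: set the flag last, after the metadata and data are in place.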

    PCI-based bus system having peripheral device address translation based on base address register (BAR) index

    Publication Number: US10740265B1

    Publication Date: 2020-08-11

    Application Number: US16144910

    Application Date: 2018-09-27

    Inventors: Kun Xu; Ron Diamant

    Abstract: Methods and apparatus for performing memory access are provided. In one example, an apparatus comprises a hardware processor, a memory, and a bus interface. The hardware processor is configured to: receive, from a host device and via the bus interface, a packet including a host input address, the host input address being defined based on a first host address space operated by the host device; determine, based on the host input address, a host relative address, the host relative address being relative to a first host base address of the first host address space; determine, based on the host relative address, a target device address of the memory; and access the memory at the target device address on behalf of the host device.
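The two-step translation in the abstract reduces to simple base-and-offset arithmetic. A hypothetical sketch (the table contents are invented for illustration): the host input address is first made relative to the host-side BAR base, then rebased onto the device-side region for that BAR index.

```python
# BAR index -> (host base address, device base address); values are assumed.
BAR_TABLE = {
    0: (0x8000_0000, 0x0000_0000),
    1: (0x9000_0000, 0x0010_0000),
}

def translate(host_input_addr: int, bar_index: int) -> int:
    """Translate a host input address to a target device address."""
    host_base, device_base = BAR_TABLE[bar_index]
    host_relative = host_input_addr - host_base   # offset within the BAR
    return device_base + host_relative            # address in device memory

addr = translate(0x8000_1234, bar_index=0)
```

Keying the translation on the BAR index means the device can expose several independently sized regions to the host while keeping its internal memory map private.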
