Distributive training with multicast

    Publication Number: US12189569B1

    Publication Date: 2025-01-07

    Application Number: US17449300

    Application Date: 2021-09-29

    Inventors: Kun Xu; Ron Diamant

    Abstract: Techniques for distributing data associated with the weight values of a neural network model are described. The techniques can include performing computations associated with the neural network model in a neural network accelerator to generate data associated with weights of the neural network model. A multicast request packet is then generated to distribute the data. The multicast request packet may contain the data associated with the weights, and an address in a multicast address range of a peripheral bus multicast switch. The multicast request packet is sent to a port of the peripheral bus multicast switch, and in response, the peripheral bus multicast switch generates multiple packets containing the data from the multicast request packet and forwards them to multiple peripheral bus ports corresponding to other processing nodes of the system.
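    Below is a minimal Python sketch of the multicast fan-out the abstract describes, modeling the switch as an object with per-port egress queues. The address range, port count, and packet fields are illustrative assumptions, not the patent's actual peripheral bus implementation.

        from dataclasses import dataclass

        MULTICAST_BASE = 0xC000_0000   # assumed start of the multicast address range
        MULTICAST_SIZE = 0x0100_0000   # assumed size of that range

        @dataclass
        class Packet:
            address: int
            payload: bytes             # e.g. weight data from the accelerator

        class MulticastSwitch:
            """Toy model of a peripheral bus switch with a multicast window."""

            def __init__(self, num_ports):
                self.ports = [[] for _ in range(num_ports)]  # per-port egress queues

            def receive(self, packet, ingress_port):
                hit = MULTICAST_BASE <= packet.address < MULTICAST_BASE + MULTICAST_SIZE
                if hit:
                    # Replicate the request to every port except the ingress port,
                    # so all other processing nodes receive a copy of the data.
                    for port, queue in enumerate(self.ports):
                        if port != ingress_port:
                            queue.append(Packet(packet.address, packet.payload))
                else:
                    self.ports[0].append(packet)  # unicast; routing table omitted

        switch = MulticastSwitch(num_ports=4)
        switch.receive(Packet(MULTICAST_BASE + 0x10, b"weights"), ingress_port=2)
        print([len(q) for q in switch.ports])  # -> [1, 1, 0, 1]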

    Matrix transpose hardware acceleration

    Publication Number: US12125124B1

    Publication Date: 2024-10-22

    Application Number: US18118251

    Application Date: 2023-03-07

    Inventors: Kun Xu; Ron Diamant

    Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.
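    The two-pass pattern in the abstract can be illustrated with a short Python sketch that uses a flat list as the buffer memory; the group sizes and the address offset are chosen for a small example rather than taken from the hardware's configuration registers.

        def transpose_via_buffer(src, rows, cols):
            buffer = [0] * (rows * cols)

            # Pass 1: each source row is a group at consecutive addresses; under
            # the first configuration its elements are scattered to non-consecutive
            # buffer addresses, so element (r, c) lands at address c * rows + r.
            for r in range(rows):
                for c in range(cols):
                    buffer[c * rows + r] = src[r * cols + c]

            # Pass 2: fetch groups of `rows` consecutive buffer addresses, each
            # group separated by the configured offset (equal to the group size
            # in this tiny example), and store them back-to-back at the destination.
            dst = []
            offset = rows
            for start in range(0, rows * cols, offset):
                dst.extend(buffer[start : start + rows])
            return dst                 # row-major layout of the transposed matrix

        a = [1, 2, 3,
             4, 5, 6]                  # 2x3 matrix, row-major
        print(transpose_via_buffer(a, rows=2, cols=3))  # -> [1, 4, 2, 5, 3, 6]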

    Matrix transpose hardware acceleration

    Publication Number: US11636569B1

    Publication Date: 2023-04-25

    Application Number: US17029609

    Application Date: 2020-09-23

    Inventors: Kun Xu; Ron Diamant

    Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.

    SPARSE MACHINE LEARNING ACCELERATION

    Publication Number: US20220318604A1

    Publication Date: 2022-10-06

    Application Number: US17301271

    Application Date: 2021-03-30

    Abstract: To reduce the storage size of weight tensors and speed up loading of weight tensors from system memory, a compression technique can be employed to remove zero values from a weight tensor before storing the weight tensor in system memory. A sparsity threshold can be enforced to achieve a compression ratio target by forcing small weight values to zero during training. When the weight tensor is loaded from system memory, a direct memory access (DMA) engine with an in-line decompression unit can decompress the weight tensor on-the-fly. By performing the decompression in the DMA engine, expansion of the weight values back to the original weight tensor size can be carried out in parallel while other neural network computations are being performed by the processing unit.
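    As a rough illustration of the scheme, the sketch below prunes small weights, packs the survivors next to a presence bitmask, and expands them again the way the in-line decompression unit would while streaming. The threshold value and mask encoding are assumptions; the abstract does not specify the actual format.

        import numpy as np

        SPARSITY_THRESHOLD = 0.05      # assumed cutoff enforced during training

        def prune(weights):
            """Force small weights to zero so the tensor compresses well."""
            out = weights.copy()
            out[np.abs(out) < SPARSITY_THRESHOLD] = 0.0
            return out

        def compress(weights):
            """Keep only nonzero values plus a bitmask of their positions."""
            mask = weights != 0.0
            return mask, weights[mask]

        def decompress(mask, values):
            """What the DMA engine's in-line unit would do on the fly."""
            out = np.zeros(mask.shape, dtype=values.dtype)
            out[mask] = values
            return out

        w = prune(np.array([0.3, 0.01, -0.02, 0.7, 0.0, -0.4]))
        mask, vals = compress(w)
        print(vals)                    # -> [ 0.3  0.7 -0.4]
        print(decompress(mask, vals))  # original tensor shape restored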

    Error reporting when reading data

    Publication Number: US10970155B1

    Publication Date: 2021-04-06

    Application Number: US16366169

    Application Date: 2019-03-27

    Abstract: System and method for performing a read transaction between a requester device, such as a host processor, and a completer device, such as a peripheral device. A device driver operating on the requester device receives a read request including a target address at which target data is to be read on the completer device. The length of the read request is increased from an initial length by an additional length for exchanging information with the completer device. The completer device generates and sends a read response comprising the target data and information about the target data. The length of the target data is equal to the initial length and the length of the information about the target data is less than or equal to the additional length. The device driver receives the read response and performs a resolution operation.
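    A minimal Python model of the padded-read exchange is sketched below; the 4-byte status field and its "OK!!"/"ERR!" codes are invented for illustration, since the abstract does not define the wire format.

        ADDITIONAL_LENGTH = 4          # extra bytes reserved for status info

        def completer_read(memory, target_address, initial_length):
            """Completer returns the target data plus information about it."""
            data = memory[target_address : target_address + initial_length]
            ok = len(data) == initial_length            # e.g. out-of-range read
            data = data.ljust(initial_length, b"\x00")  # keep lengths fixed
            status = b"OK!!" if ok else b"ERR!"
            return data + status       # initial_length + additional_length bytes

        def driver_read(memory, target_address, initial_length):
            """Driver inflates the request, then resolves the response."""
            response = completer_read(memory, target_address, initial_length)
            data, status = response[:initial_length], response[initial_length:]
            if status != b"OK!!":      # the resolution operation
                raise IOError(f"completer reported a read error: {status!r}")
            return data

        mem = bytes(range(64))
        print(driver_read(mem, target_address=8, initial_length=4))  # b'\x08\t\n\x0b'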

    Powering-down or rebooting a device in a system fabric

    Publication Number: US10761939B1

    Publication Date: 2020-09-01

    Application Number: US16219489

    Application Date: 2018-12-13

    Abstract: A circuit at an interface between a device and an interconnect fabric is configured to track outstanding transactions associated with the device and ensure the completion of the outstanding transactions before rebooting or powering down the device. In some embodiments, the circuit is also configurable to provide appropriate responses when the device is powered down or is being rebooted such that other devices in the system can still operate even without knowing that the device is inactive and would not hang because no response is received from the device.
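    The tracking logic can be modeled in a few lines of Python, as below; the counter-based bookkeeping and the dummy response value are assumptions standing in for the circuit's actual behavior.

        class FabricInterface:
            """Sits between one device and the interconnect fabric."""

            def __init__(self):
                self.outstanding = 0   # transactions issued but not completed
                self.active = True

            def request_issued(self):
                assert self.active, "no new transactions while quiescing"
                self.outstanding += 1

            def response_received(self):
                self.outstanding -= 1

            def safe_to_power_down(self):
                # Stop accepting new transactions; allow power-down only once
                # every outstanding transaction has drained.
                self.active = False
                return self.outstanding == 0

            def respond_for_inactive_device(self, request):
                # While the device is down, answer on its behalf so requesters
                # that do not know it is inactive never hang.
                assert not self.active
                return {"status": "OK", "data": 0}   # illustrative dummy reply

        iface = FabricInterface()
        iface.request_issued()
        print(iface.safe_to_power_down())   # -> False: one transaction in flight
        iface.response_received()
        print(iface.safe_to_power_down())   # -> True: safe to reboot the device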

    Strong ordered transaction for DMA transfers

    Publication Number: US12204757B1

    Publication Date: 2025-01-21

    Application Number: US18067514

    Application Date: 2022-12-16

    Abstract: A technique for processing strong ordered transactions in a direct memory access engine may include retrieving a memory descriptor to perform a strong ordered transaction, and delaying the strong ordered transaction until pending write transactions associated with previous memory descriptors retrieved prior to the memory descriptor are complete. Subsequent transactions associated with memory descriptors following the memory descriptor are allowed to be issued while waiting for the pending write transactions to complete. Upon completion of the pending write transactions, the strong ordered transaction is performed.
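    The ordering rule lends itself to a short simulation, sketched below; the descriptor format (a dict with a "strong_ordered" flag) is a stand-in for whatever the DMA engine actually parses.

        from collections import deque

        def process_descriptors(descriptors):
            pending_writes = set()     # write transactions not yet complete
            issued, deferred = [], deque()

            for i, desc in enumerate(descriptors):
                if desc.get("strong_ordered") and pending_writes:
                    deferred.append(i)  # hold until earlier writes complete...
                    continue            # ...but keep issuing later descriptors
                issued.append(i)
                if desc["op"] == "write":
                    pending_writes.add(i)

            pending_writes.clear()      # simulate the writes completing
            issued.extend(deferred)     # now release the strong ordered work
            return issued

        order = process_descriptors([
            {"op": "write"},                          # 0
            {"op": "write"},                          # 1
            {"op": "read", "strong_ordered": True},   # 2: waits for 0 and 1
            {"op": "read"},                           # 3: may bypass 2
        ])
        print(order)                   # -> [0, 1, 3, 2]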

    Address generation for page collision prevention in memory regions

    Publication Number: US11748253B1

    Publication Date: 2023-09-05

    Application Number: US17449580

    Application Date: 2021-09-30

    Abstract: To generate sequential addresses when multiple integrated circuit (IC) devices are accessing a memory region, an address token is passed along the IC devices, which are communicatively coupled in a ring topology. The address token includes a data increment value for the memory region. When an IC device receives the address token, a memory write address is determined based on the data increment value and a base address corresponding to the memory region for the current write cycle. The IC device can perform a write operation using the memory write address if the device has data to write. The data increment value of the address token is then updated based on the number of data units being written to the memory region by the IC device in the current write cycle, and the updated address token is transmitted to the next IC device of the ring topology.
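    One write cycle of the token pass can be simulated as below, treating a data unit as one byte for simplicity; the base address and per-node write sizes are made up for the example.

        BASE_ADDRESS = 0x8000_0000     # assumed base of the shared memory region

        def ring_write_cycle(pending_writes):
            """pending_writes[i] = data units node i wants to write this cycle."""
            increment = 0              # carried in the address token
            issued = {}
            for node, units in enumerate(pending_writes):  # token circles the ring
                if units:
                    issued[node] = (hex(BASE_ADDRESS + increment), units)
                    increment += units # update the token before passing it on
            return issued

        # Nodes 0 and 2 write 16 and 8 units; node 1 has nothing this cycle.
        print(ring_write_cycle([16, 0, 8]))
        # -> {0: ('0x80000000', 16), 2: ('0x80000010', 8)}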

    Programmable computations in direct memory access engine

    Publication Number: US11494326B1

    Publication Date: 2022-11-08

    Application Number: US17301273

    Application Date: 2021-03-30

    Inventors: Kun Xu; Ron Diamant

    Abstract: To perform complex arithmetic operations in neural networks without compromising the performance of the neural network accelerator, a programmable computation unit is integrated with a direct memory access (DMA) engine that is used to exchange neural network parameters between the neural network accelerator and system memory. The DMA engine may include a calculation circuit operable to perform a multiply-and-add calculation on a set of operands, and an operand selector circuit operable to select a source for each operand of the calculation circuit. The DMA engine may also include a control circuit operable to retrieve a meta-descriptor for performing a computation, configure the operand selector circuit based on the meta-descriptor, and use the calculation circuit to perform the computation based on the meta-descriptor to generate a computation result.
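    A behavioral sketch of the computation path is given below; the meta-descriptor layout and operand-source names ("imm", "reg", "dma") are assumptions, since the abstract does not publish the actual encoding.

        def execute_meta_descriptor(meta, dma_payload, registers):
            """Compute (a * b) + c with each operand routed by a selector."""

            def select(source):        # the operand selector circuit
                kind, value = source
                if kind == "imm":      # constant baked into the meta-descriptor
                    return value
                if kind == "reg":      # accumulator or scratch register
                    return registers[value]
                if kind == "dma":      # value streaming through the engine
                    return dma_payload[value]
                raise ValueError(f"unknown operand source: {kind}")

            a, b, c = (select(meta[name]) for name in ("a", "b", "c"))
            result = a * b + c         # the multiply-and-add calculation
            registers[meta["dest"]] = result
            return result

        regs = {"acc": 0.0}
        meta = {"a": ("dma", 0), "b": ("imm", 10.0),
                "c": ("reg", "acc"), "dest": "acc"}
        print(execute_meta_descriptor(meta, [2.0, 3.0], regs))  # -> 20.0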
