-
Publication Number: US12189569B1
Publication Date: 2025-01-07
Application Number: US17449300
Application Date: 2021-09-29
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant
IPC: G06F15/173, G06F13/40, G06N3/08
Abstract: Techniques for distributing data associated with the weight values of a neural network model are described. The techniques can include performing computations associated with the neural network model in a neural network accelerator to generate data associated with weights of the neural network model. A multicast request packet is then generated to distribute the data. The multicast request packet may contain the data associated with the weights and an address in a multicast address range of a peripheral bus multicast switch. The multicast request packet is sent to a port of the peripheral bus multicast switch, and in response, the switch generates multiple packets containing the data from the multicast request packet and forwards them to multiple peripheral bus ports corresponding to other processing nodes of the system.
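The fan-out behavior can be illustrated with a short simulation. The following Python sketch is a minimal model of the described flow; the names (MulticastSwitch, Packet, MULTICAST_BASE, MULTICAST_SIZE) and the address-range values are assumptions for illustration, not terms from the patent.

```python
# Minimal sketch of the multicast fan-out described above. All names and
# address values are hypothetical.

from dataclasses import dataclass

MULTICAST_BASE = 0x8000_0000   # start of the switch's multicast range (assumed)
MULTICAST_SIZE = 0x0010_0000   # size of that range (assumed)

@dataclass
class Packet:
    address: int
    payload: bytes   # data associated with the weights

class MulticastSwitch:
    """Peripheral bus switch that replicates multicast-addressed packets."""

    def __init__(self, num_ports):
        # Each port buffers the packets forwarded to its processing node.
        self.ports = [[] for _ in range(num_ports)]

    def receive(self, ingress_port, packet):
        if MULTICAST_BASE <= packet.address < MULTICAST_BASE + MULTICAST_SIZE:
            # Multicast hit: generate one copy per egress port, skipping
            # the port the request arrived on.
            for p, queue in enumerate(self.ports):
                if p != ingress_port:
                    queue.append(Packet(packet.address, packet.payload))
        else:
            # Unicast: would be routed to the single matching port (omitted).
            pass

# An accelerator on port 0 distributes freshly computed weight data to the
# other processing nodes in one request:
switch = MulticastSwitch(num_ports=4)
weights = bytes([1, 2, 3, 4])
switch.receive(0, Packet(MULTICAST_BASE + 0x40, weights))
assert all(q[0].payload == weights for q in switch.ports[1:])
```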
-
Publication Number: US12125124B1
Publication Date: 2024-10-22
Application Number: US18118251
Application Date: 2023-03-07
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant
CPC classification number: G06T1/60, G06F12/0862, G06N3/04, G06N3/08, G06T3/606, G06F2212/455, G06V10/95
Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.
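One concrete use of this two-pass scheme is a matrix transpose. The Python sketch below is a minimal illustration under assumed configurations (the scatter rule in pass 1 and the group offset in pass 2 are chosen to produce a transpose); it is not the patent's circuit.

```python
ROWS, COLS = 3, 4
PITCH = ROWS + 1        # assumed buffer pitch; leaves a one-element gap per group
first_memory = list(range(ROWS * COLS))       # first matrix, row-major

# Pass 1: fetch each "first group" (one row, consecutive in the first
# memory) and, per the first configuration, store its elements at
# non-consecutive buffer addresses so elements of one column end up adjacent.
buffer = [None] * (COLS * PITCH)
for r in range(ROWS):
    row = first_memory[r * COLS:(r + 1) * COLS]   # consecutive fetch
    for c, value in enumerate(row):
        buffer[c * PITCH + r] = value             # scattered store

# Pass 2: per the second configuration, fetch "second groups" of ROWS
# consecutive buffer elements, successive groups separated by a memory
# address offset of PITCH, and store them back-to-back in the destination.
destination = []
for g in range(COLS):
    start = g * PITCH
    destination.extend(buffer[start:start + ROWS])  # consecutive fetch

# destination now holds the second matrix: the transpose of the first.
expected = [first_memory[r * COLS + c] for c in range(COLS) for r in range(ROWS)]
assert destination == expected
```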
-
Publication Number: US11983128B1
Publication Date: 2024-05-14
Application Number: US18067109
Application Date: 2022-12-16
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant, Ilya Minkin, Mohammad El-Shabani, Raymond S. Whiteside, Uday Shilton Udayaselvam
CPC classification number: G06F13/30, G06F13/1621, G06F13/1642
Abstract: Techniques to reduce overhead in a direct memory access (DMA) engine can include processing descriptors from a descriptor queue to obtain a striding configuration for generating tensorized memory descriptors. The striding configuration can include, for each striding dimension, a stride and a repetition number indicating the number of times to repeat striding in that dimension. One or more sets of tensorized memory descriptors are generated based on the striding configuration, and data transfers are then performed based on the generated tensorized memory descriptors.
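A minimal Python sketch of the descriptor expansion, assuming a striding configuration given as (stride, repetitions) pairs; the function name and field layout are illustrative, not the engine's actual interface.

```python
# A single queue entry plus a striding configuration expands into many
# plain memory descriptors. Names and the (stride, repetitions) layout
# are assumptions for illustration.

from itertools import product

def expand_descriptors(base_address, length, striding_config):
    """Yield (address, length) pairs, one per point of the striding grid.

    striding_config: one (stride, repetitions) pair per striding dimension,
    so a single entry stands in for the product of all repetition counts
    in ordinary descriptors.
    """
    repeat_ranges = [range(reps) for _, reps in striding_config]
    for indices in product(*repeat_ranges):
        offset = sum(stride * i
                     for (stride, _), i in zip(striding_config, indices))
        yield (base_address + offset, length)

# Two striding dimensions: stride 0x1000 repeated twice, and within each,
# stride 0x40 repeated three times -> 2 * 3 = 6 tensorized descriptors.
config = [(0x1000, 2), (0x40, 3)]
for address, length in expand_descriptors(0x8000_0000, 0x20, config):
    print(hex(address), hex(length))
```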
-
Publication Number: US11636569B1
Publication Date: 2023-04-25
Application Number: US17029609
Application Date: 2020-09-23
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant
Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.
-
Publication Number: US20220318604A1
Publication Date: 2022-10-06
Application Number: US17301271
Application Date: 2021-03-30
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant, Patricio Kaplan
Abstract: To reduce the storage size of weight tensors and speed up loading of weight tensors from system memory, a compression technique can be employed to remove zero values from a weight tensor before storing the weight tensor in system memory. A sparsity threshold can be enforced to achieve a compression ratio target by forcing small weight values to zero during training. When the weight tensor is loaded from system memory, a direct memory access (DMA) engine with an in-line decompression unit can decompress the weight tensor on-the-fly. By performing the decompression in the DMA engine, expansion of the weight values back to the original weight tensor size can be carried out in parallel while other neural network computations are being performed by the processing unit.
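A minimal Python sketch of the idea, assuming a bitmask-plus-values encoding (the patent only specifies that zero values are removed and restored on the fly) and an illustrative sparsity threshold.

```python
# Minimal sketch of zero-removal compression and in-line decompression.
# The bitmask encoding and threshold value are assumptions.

SPARSITY_THRESHOLD = 0.05   # assumed threshold applied during training

def apply_sparsity(weights, threshold=SPARSITY_THRESHOLD):
    # Force small weights to zero so the compression ratio target is met.
    return [0.0 if abs(w) < threshold else w for w in weights]

def compress(weights):
    # Keep only nonzero values, plus a presence bitmask for reconstruction.
    mask = [w != 0.0 for w in weights]
    values = [w for w in weights if w != 0.0]
    return mask, values

def decompress(mask, values):
    # What the DMA engine's in-line decompression unit would do on the fly:
    # re-expand to the original tensor size as data streams through.
    it = iter(values)
    return [next(it) if present else 0.0 for present in mask]

weights = [0.7, 0.01, -0.3, 0.0, 0.02, 1.1]
sparse = apply_sparsity(weights)            # [0.7, 0.0, -0.3, 0.0, 0.0, 1.1]
mask, values = compress(sparse)             # only 3 of 6 values stored
assert decompress(mask, values) == sparse
```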
-
Publication Number: US10970155B1
Publication Date: 2021-04-06
Application Number: US16366169
Application Date: 2019-03-27
Applicant: Amazon Technologies, Inc.
Inventor: Brian Robert Silver, Kun Xu
Abstract: A system and method are described for performing a read transaction between a requester device, such as a host processor, and a completer device, such as a peripheral device. A device driver operating on the requester device receives a read request including a target address at which target data is to be read on the completer device. The length of the read request is increased from an initial length by an additional length used for exchanging information with the completer device. The completer device generates and sends a read response comprising the target data and information about the target data. The length of the target data equals the initial length, and the length of the information about the target data is less than or equal to the additional length. The device driver receives the read response and performs a resolution operation.
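A minimal Python sketch of the exchange, with an assumed metadata layout (a single status byte padded to the additional length); the function names are illustrative.

```python
# Minimal sketch of the lengthened read exchange described above.

ADDITIONAL_LENGTH = 4   # extra bytes reserved for information about the data

def completer_read(memory, target_address, initial_length):
    """Completer side: return the target data plus information about it."""
    target_data = memory[target_address:target_address + initial_length]
    status = b"\x01" if len(target_data) == initial_length else b"\x00"
    info = status.ljust(ADDITIONAL_LENGTH, b"\x00")   # <= ADDITIONAL_LENGTH
    return target_data + info

def driver_read(memory, target_address, initial_length):
    """Requester side: issue the lengthened read, then resolve the response."""
    response = completer_read(memory, target_address,
                              initial_length)   # length increased on the wire
    target_data = response[:initial_length]
    info = response[initial_length:]
    # Resolution operation: act on the returned information (here, a
    # simple validity check on the status byte).
    if info[0] != 0x01:
        raise IOError("completer reported incomplete data")
    return target_data

device_memory = bytes(range(64))
assert driver_read(device_memory, 8, 16) == device_memory[8:24]
```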
-
Publication Number: US10761939B1
Publication Date: 2020-09-01
Application Number: US16219489
Application Date: 2018-12-13
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Thomas A. Volpe, Ron Diamant, Mark Anthony Banse
Abstract: A circuit at an interface between a device and an interconnect fabric is configured to track outstanding transactions associated with the device and ensure their completion before the device is rebooted or powered down. In some embodiments, the circuit is also configurable to provide appropriate responses on behalf of the device while it is powered down or rebooting, so that other devices in the system can continue to operate without knowing that the device is inactive, and will not hang waiting for a response from it.
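A minimal Python sketch of the tracking behavior; the method names and the canned response are assumptions for illustration.

```python
# Minimal sketch of the interface circuit described above.

class InterfaceTracker:
    def __init__(self):
        self.outstanding = 0      # transactions issued but not yet completed
        self.device_active = True

    def issue(self):
        if not self.device_active:
            # Device is powered down or rebooting: answer on its behalf so
            # the requester gets a response and does not hang.
            return "default-response"
        self.outstanding += 1
        return "forwarded-to-device"

    def complete(self):
        self.outstanding -= 1

    def power_down(self):
        # Ensure all outstanding transactions finish before going inactive.
        while self.outstanding > 0:
            self.complete()       # stand-in for draining real completions
        self.device_active = False

tracker = InterfaceTracker()
tracker.issue(); tracker.issue()
tracker.power_down()                       # drains both, then deactivates
assert tracker.issue() == "default-response"
```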
-
Publication Number: US12204757B1
Publication Date: 2025-01-21
Application Number: US18067514
Application Date: 2022-12-16
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant, Ilya Minkin, Raymond S. Whiteside
IPC: G06F3/06
Abstract: A technique for processing strong ordered transactions in a direct memory access (DMA) engine may include retrieving a memory descriptor that calls for a strong ordered transaction, and delaying that transaction until pending write transactions associated with memory descriptors retrieved earlier are complete. Transactions associated with memory descriptors following the strong ordered descriptor are allowed to issue while waiting for the pending write transactions to complete. Upon their completion, the strong ordered transaction is performed.
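A minimal Python sketch of the ordering rule, using an assumed descriptor format; for brevity it handles one strong ordered descriptor per batch.

```python
# Minimal sketch: a strong ordered descriptor waits for earlier writes,
# while later descriptors may proceed past it.

from dataclasses import dataclass

@dataclass
class Descriptor:
    name: str
    strong_ordered: bool = False

def schedule(descriptors):
    issued, pending_writes, parked = [], [], None
    for d in descriptors:
        if d.strong_ordered and pending_writes:
            parked = d                 # delay until earlier writes complete
            continue
        issued.append(d.name)
        pending_writes.append(d.name)  # treat every issue as a pending write
    # Earlier writes complete...
    pending_writes.clear()
    if parked:
        issued.append(parked.name)     # now the strong ordered one goes out
    return issued

order = schedule([Descriptor("w0"), Descriptor("w1"),
                  Descriptor("barrier", strong_ordered=True),
                  Descriptor("w2")])
# w2 is not blocked behind the strong ordered transaction:
assert order == ["w0", "w1", "w2", "barrier"]
```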
-
Publication Number: US11748253B1
Publication Date: 2023-09-05
Application Number: US17449580
Application Date: 2021-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Suresh Hariharan, Kun Xu
IPC: G06F12/02, G06F12/1081, G06F13/16, G06F15/173, G06N3/04
CPC classification number: G06F12/0238, G06F12/1081, G06F13/1668, G06F15/17375, G06N3/04
Abstract: To generate sequential addresses when multiple integrated circuit (IC) devices are accessing a memory region, an address token is passed among the IC devices, which are communicatively coupled in a ring topology. The address token includes a data increment value for the memory region. When an IC device receives the address token, a memory write address is determined based on the data increment value and a base address corresponding to the memory region for the current write cycle. If the IC device has data to write, it performs a write operation using the memory write address. The data increment value of the address token is then updated based on the number of data units the IC device writes to the memory region in the current write cycle, and the updated address token is transmitted to the next IC device in the ring.
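A minimal Python sketch of one trip of the token around the ring; class and field names are assumptions for illustration.

```python
# Minimal sketch of the ring-topology address token described above.

class Device:
    def __init__(self, name, data_units):
        self.name = name
        self.data_units = data_units   # units this device writes this cycle
        self.writes = []               # (address, units) it actually issued

    def handle_token(self, base_address, increment):
        if self.data_units:
            # Write address for this device's data in the current cycle.
            address = base_address + increment
            self.writes.append((address, self.data_units))
            # Advance the token by the number of units just written, so the
            # next device in the ring gets the next sequential address.
            increment += self.data_units
        return increment

base = 0x1000
devices = [Device("acc0", 4), Device("acc1", 0), Device("acc2", 2)]

increment = 0                           # token's data increment value
for dev in devices:                     # token travels once around the ring
    increment = dev.handle_token(base, increment)

assert devices[0].writes == [(0x1000, 4)]
assert devices[1].writes == []          # nothing to write, token unchanged
assert devices[2].writes == [(0x1004, 2)]
```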
-
Publication Number: US11494326B1
Publication Date: 2022-11-08
Application Number: US17301273
Application Date: 2021-03-30
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant
Abstract: To perform complex arithmetic operations in neural networks without compromising the performance of the neural network accelerator, a programmable computation unit is integrated with a direct memory access (DMA) engine that is used to exchange neural network parameters between the neural network accelerator and system memory. The DMA engine may include a calculation circuit operable to perform a multiply-and-add calculation on a set of operands, and an operand selector circuit operable to select a source for each operand of the calculation circuit. The DMA engine may also include a control circuit operable to retrieve a meta-descriptor for performing a computation, configure the operand selector circuit based on the meta-descriptor, and use the calculation circuit to perform the computation based on the meta-descriptor to generate a computation result.
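A minimal Python sketch of the meta-descriptor-driven multiply-and-add, with assumed operand source names; the real operand selector and meta-descriptor format are hardware-specific.

```python
# Minimal sketch of the in-DMA computation described above.

def calculate(a, b, c):
    # Calculation circuit: one multiply-and-add over the selected operands.
    return a * b + c

def run_meta_descriptor(meta, sources):
    """Configure the operand selector from the meta-descriptor, then
    perform the computation and return the result.

    meta: {"a": source_name, "b": source_name, "c": source_name}
    sources: mapping of source name -> value (e.g. an immediate, data
    fetched by the DMA engine, or a previous result).
    """
    operands = {slot: sources[src] for slot, src in meta.items()}
    return calculate(operands["a"], operands["b"], operands["c"])

# Example: scale a parameter fetched by the DMA engine and add a bias,
# without occupying the neural network accelerator.
sources = {"immediate": 0.5, "dma_data": 8.0, "bias_reg": 1.0}
meta = {"a": "immediate", "b": "dma_data", "c": "bias_reg"}
assert run_meta_descriptor(meta, sources) == 5.0
```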