Distributive training with multicast

    Publication Number: US12189569B1

    Publication Date: 2025-01-07

    Application Number: US17449300

    Application Date: 2021-09-29

    Inventors: Kun Xu; Ron Diamant

    Abstract: Techniques for distributing data associated with the weight values of a neural network model are described. The techniques can include performing computations associated with the neural network model in a neural network accelerator to generate data associated with weights of the neural network model. A multicast request packet is then generated to distribute the data. The multicast request packet may contain the data associated with the weights, and an address in a multicast address range of a peripheral bus multicast switch. The multicast request packet is sent to a port of the peripheral bus multicast switch, and in response, the peripheral bus multicast switch generates multiple packets containing the data from the multicast request packet and forwards them to multiple peripheral bus ports corresponding to other processing nodes of the system.
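    Below is a minimal Python sketch of the multicast fan-out the abstract describes, modeling the switch as an object with per-port egress queues. The address range, port count, and packet fields are illustrative assumptions, not the patent's actual peripheral bus implementation.

        from dataclasses import dataclass

        MULTICAST_BASE = 0xC000_0000   # assumed start of the multicast address range
        MULTICAST_SIZE = 0x0100_0000   # assumed size of that range

        @dataclass
        class Packet:
            address: int
            payload: bytes             # e.g. weight data from the accelerator

        class MulticastSwitch:
            """Toy model of a peripheral bus switch with a multicast window."""

            def __init__(self, num_ports):
                self.ports = [[] for _ in range(num_ports)]  # per-port egress queues

            def receive(self, packet, ingress_port):
                hit = MULTICAST_BASE <= packet.address < MULTICAST_BASE + MULTICAST_SIZE
                if hit:
                    # Replicate the request to every port except the ingress port,
                    # so all other processing nodes receive a copy of the data.
                    for port, queue in enumerate(self.ports):
                        if port != ingress_port:
                            queue.append(Packet(packet.address, packet.payload))
                else:
                    self.ports[0].append(packet)  # unicast; routing table omitted

        switch = MulticastSwitch(num_ports=4)
        switch.receive(Packet(MULTICAST_BASE + 0x10, b"weights"), ingress_port=2)
        print([len(q) for q in switch.ports])  # -> [1, 1, 0, 1]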

    Matrix transpose hardware acceleration

    Publication Number: US12125124B1

    Publication Date: 2024-10-22

    Application Number: US18118251

    Application Date: 2023-03-07

    Inventors: Kun Xu; Ron Diamant

    Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.
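    The two-pass pattern in the abstract can be illustrated with a short Python sketch that uses a flat list as the buffer memory; the group sizes and the address offset are chosen for a small example rather than taken from the hardware's configuration registers.

        def transpose_via_buffer(src, rows, cols):
            buffer = [0] * (rows * cols)

            # Pass 1: each source row is a group at consecutive addresses; under
            # the first configuration its elements are scattered to non-consecutive
            # buffer addresses, so element (r, c) lands at address c * rows + r.
            for r in range(rows):
                for c in range(cols):
                    buffer[c * rows + r] = src[r * cols + c]

            # Pass 2: fetch groups of `rows` consecutive buffer addresses, each
            # group separated by the configured offset (equal to the group size
            # in this tiny example), and store them back-to-back at the destination.
            dst = []
            offset = rows
            for start in range(0, rows * cols, offset):
                dst.extend(buffer[start : start + rows])
            return dst                 # row-major layout of the transposed matrix

        a = [1, 2, 3,
             4, 5, 6]                  # 2x3 matrix, row-major
        print(transpose_via_buffer(a, rows=2, cols=3))  # -> [1, 4, 2, 5, 3, 6]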

    Matrix transpose hardware acceleration

    Publication Number: US11636569B1

    Publication Date: 2023-04-25

    Application Number: US17029609

    Application Date: 2020-09-23

    Inventors: Kun Xu; Ron Diamant

    Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.

    SPARSE MACHINE LEARNING ACCELERATION

    Publication Number: US20220318604A1

    Publication Date: 2022-10-06

    Application Number: US17301271

    Application Date: 2021-03-30

    Abstract: To reduce the storage size of weight tensors and speed up loading of weight tensors from system memory, a compression technique can be employed to remove zero values from a weight tensor before storing the weight tensor in system memory. A sparsity threshold can be enforced to achieve a compression ratio target by forcing small weight values to zero during training. When the weight tensor is loaded from system memory, a direct memory access (DMA) engine with an in-line decompression unit can decompress the weight tensor on-the-fly. By performing the decompression in the DMA engine, expansion of the weight values back to the original weight tensor size can be carried out in parallel while other neural network computations are being performed by the processing unit.
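    As a rough illustration of the scheme, the sketch below prunes small weights, packs the survivors next to a presence bitmask, and expands them again the way the in-line decompression unit would while streaming. The threshold value and mask encoding are assumptions; the abstract does not specify the actual format.

        import numpy as np

        SPARSITY_THRESHOLD = 0.05      # assumed cutoff enforced during training

        def prune(weights):
            """Force small weights to zero so the tensor compresses well."""
            out = weights.copy()
            out[np.abs(out) < SPARSITY_THRESHOLD] = 0.0
            return out

        def compress(weights):
            """Keep only nonzero values plus a bitmask of their positions."""
            mask = weights != 0.0
            return mask, weights[mask]

        def decompress(mask, values):
            """What the DMA engine's in-line unit would do on the fly."""
            out = np.zeros(mask.shape, dtype=values.dtype)
            out[mask] = values
            return out

        w = prune(np.array([0.3, 0.01, -0.02, 0.7, 0.0, -0.4]))
        mask, vals = compress(w)
        print(vals)                    # -> [ 0.3  0.7 -0.4]
        print(decompress(mask, vals))  # original tensor shape restored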

    Error reporting when reading data

    Publication Number: US10970155B1

    Publication Date: 2021-04-06

    Application Number: US16366169

    Application Date: 2019-03-27

    Abstract: System and method for performing a read transaction between a requester device, such as a host processor, and a completer device, such as a peripheral device. A device driver operating on the requester device receives a read request including a target address at which target data is to be read on the completer device. The length of the read request is increased from an initial length by an additional length for exchanging information with the completer device. The completer device generates and sends a read response comprising the target data and information about the target data. The length of the target data is equal to the initial length and the length of the information about the target data is less than or equal to the additional length. The device driver receives the read response and performs a resolution operation.
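    A minimal Python model of the padded-read exchange is sketched below; the 4-byte status field and its "OK!!"/"ERR!" codes are invented for illustration, since the abstract does not define the wire format.

        ADDITIONAL_LENGTH = 4          # extra bytes reserved for status info

        def completer_read(memory, target_address, initial_length):
            """Completer returns the target data plus information about it."""
            data = memory[target_address : target_address + initial_length]
            ok = len(data) == initial_length            # e.g. out-of-range read
            data = data.ljust(initial_length, b"\x00")  # keep lengths fixed
            status = b"OK!!" if ok else b"ERR!"
            return data + status       # initial_length + additional_length bytes

        def driver_read(memory, target_address, initial_length):
            """Driver inflates the request, then resolves the response."""
            response = completer_read(memory, target_address, initial_length)
            data, status = response[:initial_length], response[initial_length:]
            if status != b"OK!!":      # the resolution operation
                raise IOError(f"completer reported a read error: {status!r}")
            return data

        mem = bytes(range(64))
        print(driver_read(mem, target_address=8, initial_length=4))  # b'\x08\t\n\x0b'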

    Powering-down or rebooting a device in a system fabric

    Publication Number: US10761939B1

    Publication Date: 2020-09-01

    Application Number: US16219489

    Application Date: 2018-12-13

    Abstract: A circuit at an interface between a device and an interconnect fabric is configured to track outstanding transactions associated with the device and ensure the completion of the outstanding transactions before rebooting or powering down the device. In some embodiments, the circuit is also configurable to provide appropriate responses when the device is powered down or is being rebooted such that other devices in the system can still operate even without knowing that the device is inactive and would not hang because no response is received from the device.
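    The tracking logic can be modeled in a few lines of Python, as below; the counter-based bookkeeping and the dummy response value are assumptions standing in for the circuit's actual behavior.

        class FabricInterface:
            """Sits between one device and the interconnect fabric."""

            def __init__(self):
                self.outstanding = 0   # transactions issued but not completed
                self.active = True

            def request_issued(self):
                assert self.active, "no new transactions while quiescing"
                self.outstanding += 1

            def response_received(self):
                self.outstanding -= 1

            def safe_to_power_down(self):
                # Stop accepting new transactions; allow power-down only once
                # every outstanding transaction has drained.
                self.active = False
                return self.outstanding == 0

            def respond_for_inactive_device(self, request):
                # While the device is down, answer on its behalf so requesters
                # that do not know it is inactive never hang.
                assert not self.active
                return {"status": "OK", "data": 0}   # illustrative dummy reply

        iface = FabricInterface()
        iface.request_issued()
        print(iface.safe_to_power_down())   # -> False: one transaction in flight
        iface.response_received()
        print(iface.safe_to_power_down())   # -> True: safe to reboot the device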

    Strong ordered transaction for DMA transfers

    Publication Number: US12204757B1

    Publication Date: 2025-01-21

    Application Number: US18067514

    Application Date: 2022-12-16

    Abstract: A technique for processing strong ordered transactions in a direct memory access engine may include retrieving a memory descriptor to perform a strong ordered transaction, and delaying the strong ordered transaction until pending write transactions associated with previous memory descriptors retrieved prior to the memory descriptor are complete. Subsequent transactions associated with memory descriptors following the memory descriptor are allowed to be issued while waiting for the pending write transactions to complete. Upon completion of the pending write transactions, the strong ordered transaction is performed.
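    The ordering rule lends itself to a short simulation, sketched below; the descriptor format (a dict with a "strong_ordered" flag) is a stand-in for whatever the DMA engine actually parses.

        from collections import deque

        def process_descriptors(descriptors):
            pending_writes = set()     # write transactions not yet complete
            issued, deferred = [], deque()

            for i, desc in enumerate(descriptors):
                if desc.get("strong_ordered") and pending_writes:
                    deferred.append(i)  # hold until earlier writes complete...
                    continue            # ...but keep issuing later descriptors
                issued.append(i)
                if desc["op"] == "write":
                    pending_writes.add(i)

            pending_writes.clear()      # simulate the writes completing
            issued.extend(deferred)     # now release the strong ordered work
            return issued

        order = process_descriptors([
            {"op": "write"},                          # 0
            {"op": "write"},                          # 1
            {"op": "read", "strong_ordered": True},   # 2: waits for 0 and 1
            {"op": "read"},                           # 3: may bypass 2
        ])
        print(order)                   # -> [0, 1, 3, 2]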

    Address generation for page collision prevention in memory regions

    Publication Number: US11748253B1

    Publication Date: 2023-09-05

    Application Number: US17449580

    Application Date: 2021-09-30

    Abstract: To generate sequential addresses when multiple integrated circuit (IC) devices are accessing a memory region, an address token is passed along the IC devices, which are communicatively coupled in a ring topology. The address token includes a data increment value for the memory region. When an IC device receives the address token, a memory write address is determined based on the data increment value and a base address corresponding to the memory region for the current write cycle. The IC device can perform a write operation using the memory write address if the device has data to write. The data increment value of the address token is then updated based on the number of data units being written to the memory region by the IC device in the current write cycle, and the updated address token is transmitted to the next IC device of the ring topology.
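    One write cycle of the token pass can be simulated as below, treating a data unit as one byte for simplicity; the base address and per-node write sizes are made up for the example.

        BASE_ADDRESS = 0x8000_0000     # assumed base of the shared memory region

        def ring_write_cycle(pending_writes):
            """pending_writes[i] = data units node i wants to write this cycle."""
            increment = 0              # carried in the address token
            issued = {}
            for node, units in enumerate(pending_writes):  # token circles the ring
                if units:
                    issued[node] = (hex(BASE_ADDRESS + increment), units)
                    increment += units # update the token before passing it on
            return issued

        # Nodes 0 and 2 write 16 and 8 units; node 1 has nothing this cycle.
        print(ring_write_cycle([16, 0, 8]))
        # -> {0: ('0x80000000', 16), 2: ('0x80000010', 8)}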

    Programmable computations in direct memory access engine

    Publication Number: US11494326B1

    Publication Date: 2022-11-08

    Application Number: US17301273

    Application Date: 2021-03-30

    Inventors: Kun Xu; Ron Diamant

    Abstract: To perform complex arithmetic operations in neural networks without compromising the performance of the neural network accelerator, a programmable computation unit is integrated with a direct memory access (DMA) engine that is used to exchange neural network parameters between the neural network accelerator and system memory. The DMA engine may include a calculation circuit operable to perform a multiply-and-add calculation on a set of operands, and an operand selector circuit operable to select a source for each operand of the calculation circuit. The DMA engine may also include a control circuit operable to retrieve a meta-descriptor for performing a computation, configure the operand selector circuit based on the meta-descriptor, and use the calculation circuit to perform the computation based on the meta-descriptor to generate a computation result.
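    A behavioral sketch of the computation path is given below; the meta-descriptor layout and operand-source names ("imm", "reg", "dma") are assumptions, since the abstract does not publish the actual encoding.

        def execute_meta_descriptor(meta, dma_payload, registers):
            """Compute (a * b) + c with each operand routed by a selector."""

            def select(source):        # the operand selector circuit
                kind, value = source
                if kind == "imm":      # constant baked into the meta-descriptor
                    return value
                if kind == "reg":      # accumulator or scratch register
                    return registers[value]
                if kind == "dma":      # value streaming through the engine
                    return dma_payload[value]
                raise ValueError(f"unknown operand source: {kind}")

            a, b, c = (select(meta[name]) for name in ("a", "b", "c"))
            result = a * b + c         # the multiply-and-add calculation
            registers[meta["dest"]] = result
            return result

        regs = {"acc": 0.0}
        meta = {"a": ("dma", 0), "b": ("imm", 10.0),
                "c": ("reg", "acc"), "dest": "acc"}
        print(execute_meta_descriptor(meta, [2.0, 3.0], regs))  # -> 20.0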
