-
Publication Number: US12001352B1
Publication Date: 2024-06-04
Application Number: US17937395
Filing Date: 2022-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Rashika Kheria , Ron Diamant , Se Wang Oh , Guy Nakibly
CPC classification number: G06F13/1621 , G06F9/466
Abstract: Techniques are provided to maintain data coherency for data transfers among data processing devices in a distributed computing environment. A data buffer in each data processing device can be mapped to an address range that is assigned to transactions that allow out-of-order completions, and a message buffer in each data processing device can be mapped to an address range that is assigned to transactions that follow transaction ordering. Thus, based on the mapping, a transaction to store a set of data into the data buffer is completed before a transaction to write a synchronization message into the message buffer indicating that the set of data is stored, irrespective of the transaction ordering indicated by each individual transaction.
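A minimal sketch of the mapping described above, with assumed address ranges and names: writes into the data range may be buffered and completed out of order, but a write into the message range first drains all pending data writes, so the synchronization message can never overtake the data it announces.

```python
# Assumed address ranges for the two buffer mappings.
DATA_BASE, DATA_SIZE = 0x0000, 0x8000   # relaxed: out-of-order completion allowed
MSG_BASE,  MSG_SIZE  = 0x8000, 0x1000   # ordered: follows transaction ordering

class Interconnect:
    def __init__(self):
        self.pending_data = []          # data writes not yet committed
        self.memory = {}

    def write(self, addr, value):
        if DATA_BASE <= addr < DATA_BASE + DATA_SIZE:
            self.pending_data.append((addr, value))   # may commit later, in any order
        else:
            self._drain()                             # ordered write: flush data first
            self.memory[addr] = value

    def _drain(self):
        for addr, value in self.pending_data:
            self.memory[addr] = value
        self.pending_data.clear()

ic = Interconnect()
ic.write(0x0010, 42)        # data element: buffered, completion may be deferred
ic.write(MSG_BASE, "done")  # sync message: forces the data write to commit first
```

The ordering attribute is implied purely by the target address, which is the point of the mapping: the devices issuing the transactions do not need to request ordering explicitly.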
-
Publication Number: US11983128B1
Publication Date: 2024-05-14
Application Number: US18067109
Filing Date: 2022-12-16
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu , Ron Diamant , Ilya Minkin , Mohammad El-Shabani , Raymond S. Whiteside , Uday Shilton Udayaselvam
CPC classification number: G06F13/30 , G06F13/1621 , G06F13/1642
Abstract: Techniques to reduce overhead in a direct memory access (DMA) engine can include processing descriptors from a descriptor queue to obtain a striding configuration to generate tensorized memory descriptors. The striding configuration can include, for each striding dimension, a stride and a repetition number indicating a number of times to repeat striding in the corresponding striding dimension. One or more sets of tensorized memory descriptors can be generated based on the striding configuration. Data transfers are then performed based on the generated tensorized memory descriptors.
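A hypothetical sketch of descriptor expansion under the striding configuration described above: each dimension contributes a (stride, repetition) pair, and the Cartesian product of the repetitions yields one tensorized memory descriptor per point of the iteration space. The function name and descriptor shape are illustrative.

```python
from itertools import product

def tensorize(base_addr, length, striding):
    """Expand a base descriptor into tensorized descriptors.

    striding: list of (stride, repeat) pairs, one per striding dimension.
    Returns (address, length) tuples, one per element of the iteration space.
    """
    descriptors = []
    for idx in product(*(range(rep) for _, rep in striding)):
        offset = sum(i * stride for i, (stride, _) in zip(idx, striding))
        descriptors.append((base_addr + offset, length))
    return descriptors

# 2-D example: 3 row groups separated by 256 bytes, 2 columns separated by 16 bytes.
descs = tensorize(0x1000, 16, [(256, 3), (16, 2)])
```

Generating the descriptors from a compact configuration rather than enqueueing each one individually is what reduces the per-transfer overhead the abstract refers to.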
-
Publication Number: US11816559B2
Publication Date: 2023-11-14
Application Number: US17832039
Filing Date: 2022-06-03
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh , Ron Diamant
IPC: G06N3/063 , G06F15/80 , G06F17/15 , H04L49/9047 , G06V30/413
CPC classification number: G06N3/063 , G06F15/8046 , G06F17/153 , G06V30/413 , H04L49/9047
Abstract: In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; load a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the first weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.
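A sketch, with assumed shapes, of how the input subset for one weight of a dilated convolution is selected: the weight at coordinate (r, s) with dilation rate d multiplies input positions offset by (r·d, s·d) for every output position, so each weight streams against a strided slice of the input.

```python
import numpy as np

def inputs_for_weight(x, r, s, rate, out_h, out_w):
    """Select the input patch that weight element (r, s) multiplies."""
    return x[r * rate : r * rate + out_h, s * rate : s * rate + out_w]

x = np.arange(36, dtype=np.float32).reshape(6, 6)
w = np.ones((2, 2), dtype=np.float32)
rate, out_h, out_w = 2, 4, 4            # 2x2 kernel, dilation rate 2 -> 4x4 output

# Accumulate one partial product per weight element, mirroring how the
# systolic array streams each loaded weight against its selected inputs.
y = np.zeros((out_h, out_w), dtype=np.float32)
for r in range(2):
    for s in range(2):
        y += w[r, s] * inputs_for_weight(x, r, s, rate, out_h, out_w)
```

Selecting the subset per weight element, rather than materializing a dilated kernel with explicit zeros, avoids wasting multiply cycles on the zero positions.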
-
Publication Number: US20230359876A1
Publication Date: 2023-11-09
Application Number: US18352768
Filing Date: 2023-07-14
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh , Ron Diamant , Hongbin Zheng , Yizhi Liu , Animesh Jain , Yida Wang , Vinod Sharma , Richard John Heaton , Randy Renfu Huang , Sundeep Amirineni , Drazen Borkovic
Abstract: Generating instructions for programming a processing element array to implement a convolution operation can include determining that the convolution operation under-utilizes the processing element array. The convolution operation involves using the processing element array to perform a series of matrix multiplications between a set of filters and a set of input matrices. Each filter comprises a weight matrix. Each input matrix is assigned to a respective row in the processing element array. Under-utilization can be determined through detecting that less than a threshold number of rows would be used concurrently. In response to determining that the convolution operation under-utilizes the processing element array, instructions can be added for modifying the convolution operation to increase the number of rows used concurrently. The added instructions are executable to cause at least one input matrix to be processed in parallel across more rows compared to processing without modifying the convolution operation.
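A hypothetical sketch of the compile-time check described above: if fewer rows of the processing element array would be occupied than a threshold, each input matrix is split across multiple rows to raise concurrency. Array size, threshold, and function names are assumptions.

```python
PE_ROWS = 128                           # assumed processing element array height

def plan_rows(num_inputs, threshold=0.5):
    """Return (rows_used, split_factor) after the under-utilization check.

    num_inputs: number of input matrices, each normally assigned one row.
    """
    if num_inputs >= threshold * PE_ROWS:
        return num_inputs, 1            # well utilized: one row per input matrix
    # Under-utilized: spread each input matrix over as many rows as fit.
    split = PE_ROWS // num_inputs
    return num_inputs * split, split

rows, split = plan_rows(16)             # e.g. 16 input matrices on a 128-row array
```

With 16 inputs the unmodified convolution would occupy only 16 of 128 rows; splitting each input eight ways fills the array, which is the utilization gain the added instructions are meant to achieve.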
-
Publication Number: US20230351186A1
Publication Date: 2023-11-02
Application Number: US18144129
Filing Date: 2023-05-05
Applicant: Amazon Technologies, Inc.
Inventor: Dana Michelle Vantrease , Ron Diamant , Thomas A. Volpe , Randy Huang
CPC classification number: G06N3/082 , G06F3/0604 , G06F3/0644 , G06F3/0673 , G06N3/045
Abstract: Disclosed herein are techniques for performing multi-layer neural network processing for multiple contexts. In one embodiment, a computing engine is set in a first configuration to implement a second layer of a neural network and to process first data related to a first context to generate first context second layer output. The computing engine can be switched from the first configuration to a second configuration to implement a first layer of the neural network. The computing engine can be used to process second data related to a second context to generate second context first layer output. The computing engine can be set to a third configuration to implement a third layer of the neural network to process the first context second layer output and the second context first layer output to generate a first processing result of the first context and a second processing result of the second context.
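A toy sketch of the interleaved schedule the abstract describes, with a closure standing in for one reconfigurable computing engine: layer 2 of context 1 runs first, the engine is switched to layer 1 for context 2, and a third configuration then runs layer 3 over both contexts' outputs. All names are illustrative.

```python
def configure(layer):
    """Stand-in for setting the engine to implement one neural network layer."""
    return lambda data: f"L{layer}({data})"

engine = configure(2)              # first configuration: layer 2
c1_l2 = engine("c1_l1_out")        # first context second layer output
engine = configure(1)              # switch to the second configuration: layer 1
c2_l1 = engine("c2_in")            # second context first layer output
engine = configure(3)              # third configuration: layer 3
results = engine(c1_l2), engine(c2_l1)   # processing results for both contexts
```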
-
Publication Number: US11741345B2
Publication Date: 2023-08-29
Application Number: US17033573
Filing Date: 2020-09-25
Applicant: Amazon Technologies, Inc.
Inventor: Randy Huang , Ron Diamant
CPC classification number: G06N3/045 , G06F3/061 , G06F3/065 , G06F3/0683 , G06F13/28 , G06F13/4068 , G06F15/80
Abstract: Provided are systems, methods, and integrated circuits for a neural network processing system. In various implementations, the system can include a first array of processing engines coupled to a first set of memory banks and a second array of processing engines coupled to a second set of memory banks. The first and second sets of memory banks can store all the weight values for a neural network, with the weight values stored before any input data is received. Upon receiving input data, the system performs a task defined for the neural network. Performing the task can include computing an intermediate result using the first array of processing engines, copying the intermediate result to the second set of memory banks, and computing a final result using the second array of processing engines, where the final result corresponds to an outcome of performing the task.
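A sketch of the two-array flow with assumed matrix shapes: weights for both stages are resident before any input arrives, the first array of processing engines produces an intermediate result, and the second array computes the final result after the copy.

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 8)).astype(np.float32)   # preloaded in the first bank set
w2 = rng.standard_normal((8, 3)).astype(np.float32)   # preloaded in the second bank set

def run_task(x):
    intermediate = x @ w1          # computed by the first array of processing engines
    copied = intermediate.copy()   # copied into the second set of memory banks
    return copied @ w2             # second array computes the final result

y = run_task(rng.standard_normal((1, 4)).astype(np.float32))
```

Because both weight sets are loaded ahead of time, the only data movement on the critical path is the input and the intermediate copy, not the (typically much larger) weights.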
-
Publication Number: US11720523B2
Publication Date: 2023-08-08
Application Number: US16653578
Filing Date: 2019-10-15
Applicant: Amazon Technologies, Inc.
Inventor: Dana Michelle Vantrease , Ron Diamant
CPC classification number: G06F15/8046 , G06F15/173 , G06F17/15 , G06F17/16 , G06N3/02 , G06N3/045 , G06N3/063
Abstract: A processing element (PE) of a systolic array can perform neural network computations on two or more data elements of an input data set using the same weight, generating two or more output data elements of a corresponding output data set. Based on the size of the input data set and the input data type, the systolic array can process a single data element or multiple data elements in parallel.
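A sketch, with an assumed packing scheme, of one processing element reusing a single weight across two packed data elements: with a narrow input type, two elements share one multiply-accumulate step per cycle.

```python
import numpy as np

def pe_step(packed_inputs, weight, packed_accumulators):
    """One PE cycle: multiply every packed element by the same weight and accumulate."""
    return packed_accumulators + weight * packed_inputs

acc = np.zeros(2, dtype=np.int32)                        # one accumulator per lane
acc = pe_step(np.array([3, 5], dtype=np.int8), 2, acc)   # two narrow elements, one weight
```

Whether one or two elements are packed per step would be decided from the input data type, as the abstract notes: narrower types leave room to pack more elements into the same datapath.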
-
Publication Number: US11676021B1
Publication Date: 2023-06-13
Application Number: US17947355
Filing Date: 2022-09-19
Applicant: Amazon Technologies, Inc.
Inventor: Patricio Kaplan , Ron Diamant
Abstract: A first worker node of a distributed system computes a first set of gradients using a first neural network model and a first set of weights associated with the first neural network model. The first set of gradients are transmitted from the first worker node to a second worker node of the distributed system. The second worker node computes a first set of synchronized gradients based on the first set of gradients. While the first set of synchronized gradients are being computed, the first worker node computes a second set of gradients using a second neural network model and a second set of weights associated with the second neural network model. The second set of gradients are transmitted from the first worker node to the second worker node. The second worker node computes a second set of synchronized gradients based on the second set of gradients.
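A hypothetical sketch of the overlap described above, with threads standing in for worker nodes: the first worker hands off model A's gradients and immediately computes model B's gradients while the second worker synchronizes the first set. The halving in the synchronization step is a stand-in for an all-reduce; all names and values are illustrative.

```python
import threading
import queue

to_sync = queue.Queue()
synced = []

def second_worker():
    """Receive gradient sets and compute synchronized gradients for each."""
    for _ in range(2):
        name, grads = to_sync.get()
        synced.append((name, [g / 2 for g in grads]))   # stand-in for all-reduce

def first_worker():
    grads_a = [2.0, 4.0]                 # compute gradients for the first model
    to_sync.put(("A", grads_a))          # transmit; do NOT wait for synchronization
    grads_b = [6.0, 8.0]                 # overlap: compute the second model's gradients
    to_sync.put(("B", grads_b))

t = threading.Thread(target=second_worker)
t.start()
first_worker()
t.join()
```

The point of the arrangement is that the first worker never idles waiting for synchronization: gradient computation for the second model hides the communication latency of the first.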
-
Publication Number: US11636569B1
Publication Date: 2023-04-25
Application Number: US17029609
Filing Date: 2020-09-23
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu , Ron Diamant
Abstract: In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.
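A sketch, with assumed parameters, of the two-pass restructuring: the first pass stores contiguous row groups of the first matrix into the buffer, and the second pass gathers groups separated by a fixed memory address offset (the row pitch) back out, yielding a block-transposed second matrix.

```python
import numpy as np

def restructure(first_matrix, group):
    """Restructure a matrix via a buffer using fixed-offset group gathers."""
    rows, cols = first_matrix.shape
    buffer = first_matrix.reshape(-1)          # pass 1: rows stored contiguously
    # Pass 2: gather groups of `group` elements separated by the row pitch `cols`.
    out = np.empty((cols // group, rows * group), dtype=first_matrix.dtype)
    for j in range(cols // group):
        for i in range(rows):
            out[j, i * group:(i + 1) * group] = \
                buffer[i * cols + j * group : i * cols + (j + 1) * group]
    return out

a = np.arange(8).reshape(2, 4)                 # 2x4 first matrix
b = restructure(a, group=2)                    # gather groups of 2 at offset 4
```

Keeping every individual access to consecutive addresses, in both passes, is what makes the restructuring efficient on memories that favor sequential bursts.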
-
Publication Number: US20230004523A1
Publication Date: 2023-01-05
Application Number: US17363900
Filing Date: 2021-06-30
Applicant: Amazon Technologies, Inc.
Inventor: Paul Gilbert Meyer , Thomas A. Volpe , Ron Diamant , Joshua Wayne Bowman , Nishith Desai , Thomas Elmer
Abstract: Systems and methods are provided to perform multiply-accumulate operations on reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reducer can receive a particular input and generate multiple reduced inputs from it. The reduced inputs can include reduced input data elements and/or reduced weights. The systolic array may lack support for inputs of a first bit-length, so the reducers reduce a given input from the first bit-length to a second, shorter bit-length and provide multiple reduced inputs of that shorter bit-length to the array. The systolic array may perform multiply-accumulate operations on each unique combination of the reduced input data elements and the reduced weights to generate multiple partial outputs, then sum the partial outputs to generate the output.
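A hypothetical sketch of the reduction scheme on integers: an input wider than the array supports is split into high and low halves, every combination of the halves is multiplied, and the shifted partial products are summed to recover the full-precision result. Bit widths and function names are assumptions.

```python
def split(x, low_bits=8):
    """Reduce one wide unsigned integer into (high, low) shorter-bit-length parts."""
    return x >> low_bits, x & ((1 << low_bits) - 1)

def mac_reduced(a, w, low_bits=8):
    """Multiply via partial products over each unique combination of reduced inputs."""
    a_hi, a_lo = split(a, low_bits)
    w_hi, w_lo = split(w, low_bits)
    return ((a_hi * w_hi) << (2 * low_bits)) \
         + ((a_hi * w_lo + a_lo * w_hi) << low_bits) \
         + (a_lo * w_lo)
```

The identity behind the sketch is simply (a_hi·2^k + a_lo)(w_hi·2^k + w_lo) expanded into four partial products, which is why summing the shifted partials reproduces the exact wide product.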