-
Publication number: US20230385233A1
Publication date: 2023-11-30
Application number: US18446357
Filing date: 2023-08-08
Applicant: Amazon Technologies, Inc.
Inventor: Thomas A. Volpe, Sundeep Amirineni, Thomas Elmer
CPC classification number: G06F15/8046, G06F7/53, G06F7/5443, G06F7/505, G06F9/3001
Abstract: Systems and methods are provided to enable parallelized multiply-accumulate operations in a systolic array. Each column of the systolic array can include multiple busses, enabling independent transmission of input partial sums along each bus. Each processing element of a given columnar bus can receive an input partial sum from a prior element of that bus and perform arithmetic operations on the input partial sum. Each processing element can generate an output partial sum based on the arithmetic operations and provide it to the next processing element of the given columnar bus, without the output partial sum being processed by any intervening processing element of the column that uses a different columnar bus. Use of columnar busses can enable parallelization to increase speed or to tolerate increased latency at individual processing elements.
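A behavioral sketch of the dual-bus column idea may help make the abstract concrete. All names and the even/odd bus assignment below are illustrative assumptions, not details from the patent:

```python
# Hypothetical model of one systolic-array column with multiple columnar
# busses. Even-indexed processing elements drive bus 0 and odd-indexed ones
# drive bus 1, so each partial sum skips the intervening element that
# belongs to the other bus.

def column_partial_sums(weights, activations, num_busses=2):
    """Accumulate one independent running partial sum per columnar bus."""
    sums = [0] * num_busses
    for i, (w, a) in enumerate(zip(weights, activations)):
        bus = i % num_busses      # which columnar bus this PE drives
        sums[bus] += w * a        # PE adds its product to its own bus only
    return sums

# Two busses each carry an independent running sum; combining them at the
# column output yields the same total as a single-bus column would.
busses = column_partial_sums([1, 2, 3, 4], [10, 20, 30, 40])
total = sum(busses)
```

Because each bus carries only half the elements, each processing element has two cycles between the partial sums it must handle, which is one way to read the "increased latency at individual processing elements" benefit.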
-
Publication number: US20230359876A1
Publication date: 2023-11-09
Application number: US18352768
Filing date: 2023-07-14
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh, Ron Diamant, Hongbin Zheng, Yizhi Liu, Animesh Jain, Yida Wang, Vinod Sharma, Richard John Heaton, Randy Renfu Huang, Sundeep Amirineni, Drazen Borkovic
Abstract: Generating instructions for programming a processing element array to implement a convolution operation can include determining that the convolution operation under-utilizes the processing element array. The convolution operation involves using the processing element array to perform a series of matrix multiplications between a set of filters and a set of input matrices. Each filter comprises a weight matrix. Each input matrix is assigned to a respective row in the processing element array. Under-utilization can be determined by detecting that fewer than a threshold number of rows would be used concurrently. In response to determining that the convolution operation under-utilizes the processing element array, instructions can be added that modify the convolution operation to increase the number of rows used concurrently. The added instructions are executable to cause at least one input matrix to be processed in parallel across more rows than it would be without the modification.
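An illustrative version of the under-utilization check might look as follows. The functions and the ceiling-division heuristic are assumptions for the sketch, not the patented compiler logic:

```python
# Illustrative check: a convolution that assigns one input matrix per row
# under-utilizes the array when fewer rows than a threshold would be active,
# so each input matrix is split across more rows.

def rows_used(num_input_matrices, array_rows):
    """Rows active when each input matrix occupies exactly one row."""
    return min(num_input_matrices, array_rows)

def split_factor(num_input_matrices, array_rows, threshold):
    """How many rows each input matrix should span to reach the threshold."""
    if rows_used(num_input_matrices, array_rows) >= threshold:
        return 1                  # no modification needed
    # Spread each input over enough rows to meet the threshold (ceiling
    # division), capped by the array height.
    needed = (threshold + num_input_matrices - 1) // num_input_matrices
    return min(array_rows // num_input_matrices, needed)

# 3 input matrices on a 128-row array with a 64-row threshold: each input
# is split across 22 rows, raising concurrent row usage from 3 to 66.
factor = split_factor(3, 128, 64)
```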
-
Publication number: US11507378B1
Publication date: 2022-11-22
Application number: US17188548
Filing date: 2021-03-01
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant, Sundeep Amirineni, Mohammad El-Shabani, Sagar Sonar, Kenneth Wayne Patton
Abstract: In one example, an integrated circuit comprises: a memory configured to store a first mapping between a first opcode and first control information and a second mapping between the first opcode and second control information; a processing engine configured to perform processing operations based on the control information; and a controller configured to: at a first time, provide the first opcode to the memory to, based on the first mapping stored in the memory, fetch the first control information for the processing engine, to enable the processing engine to perform a first processing operation based on the first control information; and at a second time, provide the first opcode to the memory to, based on the second mapping stored in the memory, fetch the second control information for the processing engine, to enable the processing engine to perform a second processing operation based on the second control information.
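A minimal software model of the remappable opcode idea, with all class and field names invented for illustration:

```python
# Sketch of an opcode whose control information can be remapped at run time:
# the same opcode fetches different control words before and after the
# memory's mapping is reprogrammed.

class ControlStore:
    """Stands in for the memory that maps opcodes to control information."""

    def __init__(self):
        self.mappings = {}        # opcode -> control information

    def program(self, opcode, control):
        self.mappings[opcode] = control

    def fetch(self, opcode):
        return self.mappings[opcode]

store = ControlStore()
store.program(0x2A, {"op": "matmul", "precision": "fp16"})
first = store.fetch(0x2A)         # first processing operation
store.program(0x2A, {"op": "matmul", "precision": "int8"})
second = store.fetch(0x2A)        # same opcode, different behavior
```

The point of the design, as the abstract describes it, is that the processing engine's behavior for a given opcode is defined by the stored mapping rather than being fixed in hardware.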
-
Publication number: US11275997B1
Publication date: 2022-03-15
Application number: US15967318
Filing date: 2018-04-30
Applicant: Amazon Technologies, Inc.
Inventor: Dana Michelle Vantrease, Ron Diamant, Sundeep Amirineni
Abstract: Disclosed herein are techniques for obtaining weights for neural network computations. In one embodiment, an integrated circuit may include memory configured to store a first weight and a second weight; a row of processing elements comprising a first processing element and a second processing element, the first processing element comprising a first weight register, the second processing element comprising a second weight register, both the first weight register and the second weight register being controllable by a weight load signal; and a controller configured to: provide the first weight from the memory to the row of processing elements; set the weight load signal to enable the first weight to propagate through the row to reach the first processing element; and set the weight load signal to store the first weight at the first weight register and a flush value at the second weight register.
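A behavioral sketch (not RTL, and the shift direction and flush value are assumptions) of weights propagating through a row of processing elements under a weight load signal:

```python
# While the weight load signal is asserted, each shift moves a weight one
# processing element down the row; PEs that receive no weight are left
# holding a flush value, matching the "flush value at the second weight
# register" behavior in the abstract.

FLUSH = 0

def load_weights(weights, num_pes):
    """Shift weights into a row so that PE i ends up holding weights[i]."""
    registers = [FLUSH] * num_pes
    # One shift per weight, last weight first: after the final shift,
    # weights[0] sits in PE 0 and unused PEs hold the flush value.
    for w in reversed(weights):
        registers = [w] + registers[:-1]
    return registers

# One weight into a row of three PEs: the first register holds the weight,
# the remaining registers hold the flush value.
row = load_weights([5], 3)
```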
-
Publication number: US11232016B1
Publication date: 2022-01-25
Application number: US16138145
Filing date: 2018-09-21
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh, Ron Diamant, Sundeep Amirineni, Randy Renfu Huang
Abstract: Techniques disclosed herein relate generally to debugging complex computing systems, such as those executing neural networks. A neural network processor includes a processing engine configured to execute instructions to implement multiple layers of a neural network. The neural network processor includes a debugging circuit configured to generate error detection codes for input data to the processing engine or for output data generated by the processing engine. The neural network processor also includes an interface to a memory device, where the interface is configured to save the error detection codes generated by the debugging circuit into the memory device. The error detection codes generated by the debugging circuit are compared with expected error detection codes generated using a functional model of the neural network to identify defects of the neural network.
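The debugging flow can be illustrated in software. The patent speaks only of generic error detection codes; using CRC-32 here is an assumption, as are the function names:

```python
# Illustrative debugging flow: compute an error detection code for each
# layer's output on the hardware side and compare it against the codes
# produced by a functional (reference) model of the same network, to
# localize the first layer whose results diverge.

import zlib

def edc(tensor_bytes):
    """Error detection code for a layer's output buffer (CRC-32 assumed)."""
    return zlib.crc32(tensor_bytes)

def find_first_mismatch(hw_outputs, model_outputs):
    """Index of the first layer whose codes disagree, or None if all match."""
    for i, (hw, ref) in enumerate(zip(hw_outputs, model_outputs)):
        if edc(hw) != edc(ref):
            return i
    return None

layers_hw    = [b"layer0-ok", b"layer1-bad!", b"layer2"]
layers_model = [b"layer0-ok", b"layer1-good", b"layer2"]
first_bad = find_first_mismatch(layers_hw, layers_model)
```

Comparing small fixed-size codes rather than full tensors keeps the memory traffic and the comparison cost low, which is presumably why codes are saved to the memory device instead of raw intermediate data.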
-
Publication number: US11762803B2
Publication date: 2023-09-19
Application number: US17659642
Filing date: 2022-04-18
Applicant: Amazon Technologies, Inc.
Inventor: Thomas A. Volpe, Sundeep Amirineni, Thomas Elmer
CPC classification number: G06F15/8046, G06F7/505, G06F7/53, G06F7/5443, G06F9/3001, G06F9/3893
Abstract: Systems and methods are provided to enable parallelized multiply-accumulate operations in a systolic array. Each column of the systolic array can include multiple busses, enabling independent transmission of input partial sums along each bus. Each processing element of a given columnar bus can receive an input partial sum from a prior element of that bus and perform arithmetic operations on the input partial sum. Each processing element can generate an output partial sum based on the arithmetic operations and provide it to the next processing element of the given columnar bus, without the output partial sum being processed by any intervening processing element of the column that uses a different columnar bus. Use of columnar busses can enable parallelization to increase speed or to tolerate increased latency at individual processing elements.
-
Publication number: US11741350B2
Publication date: 2023-08-29
Application number: US16698461
Filing date: 2019-11-27
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh, Ron Diamant, Hongbin Zheng, Yizhi Liu, Animesh Jain, Yida Wang, Vinod Sharma, Richard John Heaton, Randy Renfu Huang, Sundeep Amirineni, Drazen Borkovic
Abstract: A computer-implemented method includes receiving a neural network model for implementation using a processing element array, where the neural network model includes a convolution operation on a set of input feature maps and a set of filters. The method also includes determining, based on the neural network model, that the convolution operation utilizes less than a threshold number of rows in the processing element array for applying a set of filter elements to the set of input feature maps, where the set of filter elements includes one filter element in each filter of the set of filters. The method further includes generating, for the convolution operation and based on the neural network model, a first instruction and a second instruction for execution by respective rows in the processing element array, where the first instruction and the second instruction use different filter elements of a filter in the set of filters.
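A hypothetical scheduling sketch of the "different filter elements per row" idea. The instruction encoding and round-robin assignment are invented for illustration:

```python
# When applying one filter element per input map would leave rows idle,
# different filter elements of the same filter can be assigned to different
# rows so they execute concurrently. Each generated "instruction" below is
# just a dict pairing a row with the filter element it applies.

def build_instructions(filter_matrix, num_rows):
    """One instruction per row, each applying a different filter element."""
    # Flatten the filter's weight matrix into a list of filter elements.
    elems = [e for row in filter_matrix for e in row]
    return [{"row": r, "filter_element": elems[r % len(elems)]}
            for r in range(num_rows)]

# A 2x2 filter spread over four rows: each row's instruction uses a
# different element of the same filter, as in the abstract.
instrs = build_instructions([[1, 2], [3, 4]], 4)
```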
-
Publication number: US11314842B1
Publication date: 2022-04-26
Application number: US16987830
Filing date: 2020-08-07
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant, Randy Renfu Huang, Mohammad El-Shabani, Sundeep Amirineni, Kenneth Wayne Patton, Willis Wang
Abstract: Methods and systems for performing hardware computations of mathematical functions are provided. In one example, a system comprises a mapping table that maps each base value of a plurality of base values to parameters related to a mathematical function; a selection module configured to select, based on an input value, a first base value and first parameters mapped to the first base value in the mapping table; and arithmetic circuits configured to: receive, from the mapping table, the first base value and the first parameters; and compute, based on a relationship between the input value and the first base value, and based on the first parameters, an estimated output value of the mathematical function for the input value.
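One common instance of this table-plus-arithmetic scheme is piecewise-linear approximation. The sketch below assumes that interpretation; the base values, the choice of exp(x), and the use of a slope as the stored parameter are all illustrative assumptions:

```python
# Sketch of the mapping-table scheme for f(x) = exp(x): each base value
# maps to parameters of the function -- here the function value and its
# slope at the base point, enabling a first-order estimate.

import math

# Mapping table: base value -> (f(base), slope at base). For exp, the
# slope equals the function value.
TABLE = {x: (math.exp(x), math.exp(x)) for x in [0.0, 0.5, 1.0, 1.5, 2.0]}

def select_base(x):
    """Selection module: nearest base value at or below the input value."""
    return max(b for b in TABLE if b <= x)

def estimate(x):
    """Arithmetic circuits: first-order estimate from the selected entry."""
    base = select_base(x)
    f_base, slope = TABLE[base]
    # Combine the input/base relationship (x - base) with the parameters.
    return f_base + slope * (x - base)

approx = estimate(0.6)   # close to math.exp(0.6), which is about 1.822
```

A denser table, or higher-order stored parameters, trades table size against estimation error in the usual way.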
-
Publication number: US11048569B1
Publication date: 2021-06-29
Application number: US16794467
Filing date: 2020-02-19
Applicant: Amazon Technologies, Inc.
Inventor: Kiran Kalkunte Seshadri, Sundeep Amirineni, Nafea Bshara, Asif Khan
Abstract: Disclosed herein are techniques for preventing or minimizing completion timeout errors on a computer device. An apparatus includes a processing logic circuit and timeout logic. The timeout logic is configured to: generate a timeout event based on a transaction not being completed by the processing logic circuit within a timeout period; determine a number of timeout events generated during a monitoring period; and, responsive to determining that the number equals or exceeds a threshold, reduce the timeout period.
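A behavioral sketch of the adaptive timeout logic, following the abstract directly; the class structure, halving factor, and field names are assumptions:

```python
# Timeout events are counted per monitoring period; once the count reaches
# a threshold, the timeout period is reduced, so slow transactions time out
# locally before they can trigger completion timeout errors upstream.

class TimeoutLogic:
    def __init__(self, timeout_period, threshold, reduction_factor=0.5):
        self.timeout_period = timeout_period
        self.threshold = threshold
        self.reduction_factor = reduction_factor
        self.events_this_period = 0

    def record_transaction(self, elapsed):
        """Generate a timeout event if a transaction outlived the period."""
        if elapsed > self.timeout_period:
            self.events_this_period += 1

    def end_monitoring_period(self):
        """Shrink the timeout period if events reached the threshold."""
        if self.events_this_period >= self.threshold:
            self.timeout_period *= self.reduction_factor
        self.events_this_period = 0   # start a fresh monitoring period

logic = TimeoutLogic(timeout_period=100.0, threshold=2)
for elapsed in (150.0, 90.0, 120.0):   # two transactions exceed the period
    logic.record_transaction(elapsed)
logic.end_monitoring_period()          # 2 events >= threshold: period halves
```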
-
Publication number: US10942742B1
Publication date: 2021-03-09
Application number: US16216212
Filing date: 2018-12-11
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant, Sundeep Amirineni, Mohammad El-Shabani, Sagar Sonar, Kenneth Wayne Patton
Abstract: A reconfigurable processing circuit and system are provided. The system allows a user to program machine-level instructions in order to reconfigure the way the circuit behaves, including by adding new operations. The system can include a profile access content-addressable memory (CAM) configured to receive an execution step value from a step counter. The execution step value can be incremented and/or reset by a step management logic. The profile access CAM can select an entry of a profile table based on an opcode and the execution step value, and the processing engine can execute microcode based on the selected entry of the profile table. The profile access CAM can translate the opcode to an internal short instruction identifier in order to select the entry of the profile table. The system can further include an instruction decoding module configured to merge multiple instruction fields into a single effective instruction field.
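A software model of the profile access CAM flow described above. The opcodes, short identifiers, microcode names, and table layout are all invented for illustration:

```python
# The CAM translates an opcode to a short instruction identifier; the
# (short id, execution step value) pair then selects a profile-table entry
# of microcode. The step management logic increments the step counter while
# entries remain and resets it when the sequence ends.

OPCODE_TO_SHORT_ID = {0x1A2B: 0, 0x3C4D: 1}   # CAM: opcode -> short id

PROFILE_TABLE = {
    (0, 0): "load_operands",
    (0, 1): "multiply",
    (0, 2): "writeback",
    (1, 0): "copy",
}

def run(opcode):
    """Execute an opcode's microcode sequence and return the trace."""
    short_id = OPCODE_TO_SHORT_ID[opcode]
    step, trace = 0, []
    while (short_id, step) in PROFILE_TABLE:   # step management logic
        trace.append(PROFILE_TABLE[(short_id, step)])
        step += 1                              # increment execution step value
    return trace                               # loop exit models the reset

sequence = run(0x1A2B)
```

Reprogramming `PROFILE_TABLE` entries corresponds to the user-level reconfiguration the abstract describes: new operations become new rows in the table rather than new hardware.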
-