-
Publication No.: US11507378B1
Publication Date: 2022-11-22
Application No.: US17188548
Filing Date: 2021-03-01
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant, Sundeep Amirineni, Mohammad El-Shabani, Sagar Sonar, Kenneth Wayne Patton
Abstract: In one example, an integrated circuit comprises: a memory configured to store a first mapping between a first opcode and first control information and a second mapping between the first opcode and second control information; a processing engine configured to perform processing operations based on control information fetched from the memory; and a controller configured to: at a first time, provide the first opcode to the memory to, based on the first mapping stored in the memory, fetch the first control information for the processing engine, to enable the processing engine to perform a first processing operation based on the first control information; and at a second time, provide the first opcode to the memory to, based on the second mapping stored in the memory, fetch the second control information for the processing engine, to enable the processing engine to perform a second processing operation based on the second control information.
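The mechanism is a programmable lookup: the opcode-to-control mapping lives in memory and can be rewritten between operations, so the same opcode drives different behavior at different times. A minimal Python sketch of that behavior follows; the class, field, and opcode names are illustrative, not taken from the patent.

```python
class RemappableOpcodeTable:
    """Memory holding opcode-to-control-information mappings
    that the controller can reprogram between operations."""

    def __init__(self):
        self._table = {}

    def program(self, opcode, control_info):
        # Overwrite the mapping for this opcode with new control information.
        self._table[opcode] = control_info

    def fetch(self, opcode):
        # Return whatever control information is currently mapped.
        return self._table[opcode]


class ProcessingEngine:
    def execute(self, control_info):
        # Stand-in for a hardware operation selected by control bits.
        op = control_info["op"]
        a, b = control_info["operands"]
        return op(a, b)


table = RemappableOpcodeTable()
engine = ProcessingEngine()

# First time: opcode 0x1 is mapped to an "add" operation.
table.program(0x1, {"op": lambda a, b: a + b, "operands": (2, 3)})
print(engine.execute(table.fetch(0x1)))  # 5

# Second time: the same opcode is remapped to a "multiply" operation.
table.program(0x1, {"op": lambda a, b: a * b, "operands": (2, 3)})
print(engine.execute(table.fetch(0x1)))  # 6
```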
-
Publication No.: US11468325B2
Publication Date: 2022-10-11
Application No.: US16835161
Filing Date: 2020-03-30
Applicant: Amazon Technologies, Inc.
Inventor: Patricio Kaplan, Ron Diamant
Abstract: A first worker node of a distributed system computes a first set of gradients using a first neural network model and a first set of weights associated with the first neural network model. The first set of gradients are transmitted from the first worker node to a second worker node of the distributed system. The second worker node computes a first set of synchronized gradients based on the first set of gradients. While the first set of synchronized gradients are being computed, the first worker node computes a second set of gradients using a second neural network model and a second set of weights associated with the second neural network model. The second set of gradients are transmitted from the first worker node to the second worker node. The second worker node computes a second set of synchronized gradients based on the second set of gradients.
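The key claim is overlap: while one set of gradients is being synchronized, the worker is already computing the next set. A toy Python sketch of that pipelining, with a thread standing in for the second worker and element-wise averaging standing in for gradient synchronization (all placeholder logic, not the patented implementation):

```python
import threading

def compute_gradients(weights, data):
    # Placeholder for a backward pass over one neural network model.
    return [w * d for w, d in zip(weights, data)]

def synchronize(local_grads, remote_grads, out):
    # Placeholder for the second worker's synchronization step
    # (element-wise averaging standing in for an all-reduce).
    out.extend((a + b) / 2 for a, b in zip(local_grads, remote_grads))

data = [1.0, 2.0]
weights_1, weights_2 = [0.1, 0.2], [0.3, 0.4]

# First worker: compute gradients for the first model and hand them off.
grads_1 = compute_gradients(weights_1, data)
synced_1 = []
sync = threading.Thread(target=synchronize, args=(grads_1, [0.2, 0.5], synced_1))
sync.start()

# While the first set is being synchronized, the worker already computes
# the second model's gradients -- the overlap the abstract describes.
grads_2 = compute_gradients(weights_2, data)

sync.join()
print(synced_1, grads_2)  # [0.15, 0.45] [0.3, 0.8]
```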
-
Publication No.: US20220318604A1
Publication Date: 2022-10-06
Application No.: US17301271
Filing Date: 2021-03-30
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu, Ron Diamant, Patricio Kaplan
Abstract: To reduce the storage size of weight tensors and speed up loading of weight tensors from system memory, a compression technique can be employed to remove zero values from a weight tensor before storing the weight tensor in system memory. A sparsity threshold can be enforced to achieve a compression ratio target by forcing small weight values to zero during training. When the weight tensor is loaded from system memory, a direct memory access (DMA) engine with an in-line decompression unit can decompress the weight tensor on-the-fly. By performing the decompression in the DMA engine, expansion of the weight values back to the original weight tensor size can be carried out in parallel while other neural network computations are being performed by the processing unit.
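A rough Python sketch of the compress/decompress round trip, with a boolean bitmask standing in for the compression metadata and NumPy in place of the DMA engine's in-line decompression unit (names and format are illustrative assumptions):

```python
import numpy as np

def sparsify(weights, threshold):
    # Force small weights to zero during training so the tensor meets a
    # compression-ratio target (the "sparsity threshold").
    out = weights.copy()
    out[np.abs(out) < threshold] = 0.0
    return out

def compress(weights):
    # Keep only the nonzero values plus a presence bitmask.
    mask = weights != 0.0
    return weights[mask], mask

def decompress(values, mask):
    # What an in-line decompression unit in the DMA engine would do on
    # the fly: re-expand the values to the original tensor shape.
    out = np.zeros(mask.shape, dtype=values.dtype)
    out[mask] = values
    return out

w = np.array([0.01, -0.8, 0.002, 0.5, -0.003, 0.9])
values, mask = compress(sparsify(w, threshold=0.05))
print(values)                    # [-0.8  0.5  0.9] -- what lands in memory
print(decompress(values, mask))  # [ 0.  -0.8  0.   0.5  0.   0.9]
```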
-
Publication No.: US11334358B2
Publication Date: 2022-05-17
Application No.: US16707857
Filing Date: 2019-12-09
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant
Abstract: In one example, a hardware accelerator comprises: a programmable hardware instruction decoder programmed to store a plurality of opcodes; a programmable instruction schema mapping table implemented as a content addressable memory (CAM) and programmed to map the plurality of opcodes to a plurality of definitions of operands in a plurality of instructions; a hardware execution engine; and a controller configured to: receive an instruction that includes a first opcode of the plurality of opcodes; control the hardware instruction decoder to extract the first opcode from the instruction; obtain, from the instruction schema mapping table and based on the first opcode, a first definition of a first operand; and forward the instruction and the first definition to the hardware execution engine to control the hardware execution engine to extract the first operand from the instruction based on the first definition, and execute the instruction based on the first operand.
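A small Python sketch of this decode path, with a dict standing in for the CAM-based instruction schema mapping table and a hypothetical 16-bit instruction format (the field layout is invented for illustration):

```python
# Hypothetical 16-bit instruction format: the top 4 bits hold the opcode,
# and the operand layout depends on the opcode. Each table entry lists
# operand definitions as (name, shift, mask).
SCHEMA_TABLE = {
    0x1: [("dst", 8, 0xF), ("src", 4, 0xF)],   # register-register form
    0x2: [("dst", 8, 0xF), ("imm", 0, 0xFF)],  # register-immediate form
}

def decode(instruction):
    opcode = (instruction >> 12) & 0xF            # extract the opcode field
    schema = SCHEMA_TABLE[opcode]                 # look up operand definitions
    operands = {name: (instruction >> shift) & mask
                for name, shift, mask in schema}  # extract per the schema
    return opcode, operands

print(decode(0x1370))  # (1, {'dst': 3, 'src': 7})
print(decode(0x2345))  # (2, {'dst': 3, 'imm': 69})
```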
-
Publication No.: US11308396B2
Publication Date: 2022-04-19
Application No.: US16455329
Filing Date: 2019-06-27
Applicant: Amazon Technologies, Inc.
Inventor: Jindrich Zejda, Jeffrey T. Huynh, Drazen Borkovic, Se jong Oh, Ron Diamant, Randy Renfu Huang
Abstract: Techniques are disclosed for debugging a neural network execution on a target processor. A reference processor may generate a plurality of first reference tensors for the neural network. The neural network may be repeatedly reduced to produce a plurality of lengths. For each of the lengths, a compiler converts the neural network into first machine instructions, the target processor executes the first machine instructions to generate a first device tensor, and the debugger program determines whether the first device tensor matches a first reference tensor. A shortest length is identified for which the first device tensor does not match the first reference tensor. Tensor output is enabled for a lower-level intermediate representation of the shortest neural network, and the neural network is converted into second machine instructions, which are executed by the target processor to generate a second device tensor.
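In outline, the debugger runs the reduced networks from shortest to longest and reports the first length whose device tensor diverges from the reference. A toy sketch of that loop, with integers standing in for tensors and a deliberately buggy target (all stand-ins, not the actual debugger):

```python
def shortest_failing_length(lengths, run_on_target, reference_tensors):
    # For each reduced network length (shortest first): compile and run on
    # the target processor, then compare the device tensor against the
    # reference tensor. Return the shortest length that mismatches.
    for length in sorted(lengths):
        device_tensor = run_on_target(length)  # compile + execute stand-in
        if device_tensor != reference_tensors[length]:
            return length
    return None  # every reduced network matched the reference

# Toy stand-ins: integer "tensors", and a target that diverges once the
# network is at least three layers deep.
reference = {1: 10, 2: 20, 3: 30, 4: 40}
buggy_target = lambda n: reference[n] + (1 if n >= 3 else 0)
print(shortest_failing_length([4, 2, 3, 1], buggy_target, reference))  # 3
```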
-
Publication No.: US11275997B1
Publication Date: 2022-03-15
Application No.: US15967318
Filing Date: 2018-04-30
Applicant: Amazon Technologies, Inc.
Inventor: Dana Michelle Vantrease, Ron Diamant, Sundeep Amirineni
Abstract: Disclosed herein are techniques for obtaining weights for neural network computations. In one embodiment, an integrated circuit may include memory configured to store a first weight and a second weight; a row of processing elements comprising a first processing element and a second processing element, the first processing element comprising a first weight register, the second processing element comprising a second weight register, both the first weight register and the second weight register being controllable by a weight load signal; and a controller configured to: provide the first weight from the memory to the row of processing elements; set the weight load signal to enable the first weight to propagate through the row to reach the first processing element; and set the weight load signal to store the first weight at the first weight register and a flush value at the second weight register.
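A simplified behavioral model of this weight-load path, sketched in Python: each PE has a shift stage and a weight register, and a shared load signal latches whatever each stage currently holds. The two-PE row and the flush value of 0.0 are illustrative assumptions:

```python
class ProcessingElement:
    """One PE in the row: a stage of the weight shift chain plus the
    weight register it can latch from that stage."""
    def __init__(self):
        self.shift_reg = 0.0
        self.weight_reg = 0.0

def clock(row, shift_in, weight_load):
    # Weights advance one PE per cycle (updated back-to-front so a value
    # is not consumed twice). Asserting the shared weight_load signal
    # makes every PE latch whatever its shift stage currently holds.
    for i in range(len(row) - 1, 0, -1):
        row[i].shift_reg = row[i - 1].shift_reg
    row[0].shift_reg = shift_in
    if weight_load:
        for pe in row:
            pe.weight_reg = pe.shift_reg

FLUSH = 0.0
row = [ProcessingElement() for _ in range(2)]

clock(row, shift_in=0.7, weight_load=False)   # weight propagates into the row
clock(row, shift_in=FLUSH, weight_load=True)  # latch: the weight lands in one
                                              # register, the flush in the other
print([pe.weight_reg for pe in row])          # [0.0, 0.7]
```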
-
Publication No.: US11275661B1
Publication Date: 2022-03-15
Application No.: US16582346
Filing Date: 2019-09-25
Applicant: Amazon Technologies, Inc.
Inventor: Dana Michelle Vantrease, Ron Diamant
Abstract: A method of generating instructions to be executed by a plurality of execution engines that share a resource is provided. The method comprises, in a first generation step: reading a first engine logical timestamp vector of a first execution engine of the execution engines, the logical timestamp vector representing a history of access operations for the resource; determining whether the first engine logical timestamp vector includes a most-up-to-date logical timestamp of the resource in the first generation step; based on the first engine logical timestamp vector including the most-up-to-date logical timestamp of the resource in the first generation step, generating an access instruction to be executed by the first execution engine to access the resource; and scheduling the first execution engine to execute the access instruction.
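A loose Python sketch of the idea, treating the logical timestamp vector as a per-engine map from resource to last-observed timestamp. The sync-insertion fallback for a stale engine is an assumption, not a detail from the abstract:

```python
def has_latest(engine_vector, resource, latest):
    # The engine's logical timestamp vector records the last access to
    # each shared resource that this engine has observed.
    return engine_vector.get(resource, -1) >= latest[resource]

def generate(engine_vector, resource, latest, program):
    if has_latest(engine_vector, resource, latest):
        # Most-up-to-date view: safe to generate the access instruction.
        latest[resource] += 1
        engine_vector[resource] = latest[resource]
        program.append(("access", resource))
    else:
        # Stale view: assumed fallback -- synchronize before accessing.
        engine_vector[resource] = latest[resource]
        program.append(("sync", resource))

latest = {"sram": 3}
engine_a = {"sram": 3}  # has observed every prior access
engine_b = {"sram": 1}  # missed accesses by other engines

prog_a, prog_b = [], []
generate(engine_a, "sram", latest, prog_a)
generate(engine_b, "sram", latest, prog_b)
print(prog_a, prog_b)  # [('access', 'sram')] [('sync', 'sram')]
```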
-
Publication No.: US11250319B1
Publication Date: 2022-02-15
Application No.: US15714924
Filing Date: 2017-09-25
Applicant: Amazon Technologies, Inc.
Inventor: Randy Huang, Ron Diamant
Abstract: Disclosed herein are techniques for classifying data with a data processing circuit. In one embodiment, the data processing circuit includes a probabilistic circuit configurable to generate a decision at a pre-determined probability, and an output generation circuit including an output node and configured to receive input data and a weight, and generate output data at the output node for approximating a product of the input data and the weight. The generation of the output data includes propagating the weight to the output node according to a first decision of the probabilistic circuit. The probabilistic circuit is configured to generate the first decision at a probability determined based on the input data.
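The trick is that an expected value can replace a multiplier: propagating the weight with probability equal to the input makes the average output approximate input × weight. A Monte Carlo sketch in Python (the normalization of the input to [0, 1] is an assumption):

```python
import random

def stochastic_multiply(input_value, weight, trials=100_000):
    # Approximate input * weight without a hardware multiplier: on each
    # trial the probabilistic circuit decides, at a probability set by
    # the input, whether to propagate the weight to the output node.
    total = 0.0
    for _ in range(trials):
        if random.random() < input_value:  # decision at probability = input
            total += weight                # weight propagates to the output
    return total / trials

x, w = 0.25, 0.8
print(stochastic_multiply(x, w))  # ~0.2, i.e. approximately x * w
```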
-
Publication No.: US11232016B1
Publication Date: 2022-01-25
Application No.: US16138145
Filing Date: 2018-09-21
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh, Ron Diamant, Sundeep Amirineni, Randy Renfu Huang
Abstract: Techniques disclosed herein relate generally to debugging complex computing systems, such as those executing neural networks. A neural network processor includes a processing engine configured to execute instructions to implement multiple layers of a neural network. The neural network processor includes a debugging circuit configured to generate error detection codes for input data to the processing engine or error detection codes for output data generated by the processing engine. The neural network processor also includes an interface to a memory device, where the interface is configured to save the error detection codes generated by the debugging circuit into the memory device. The error detection codes generated by the debugging circuit are compared with expected error detection codes generated using a functional model of the neural network to identify defects of the neural network.
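A Python sketch of the comparison flow, using CRC-32 via zlib as the error detection code (the abstract does not name a particular code, and the byte strings here are placeholders for real layer tensors):

```python
import zlib

def detection_code(tensor_bytes):
    # Stand-in for the debugging circuit: an error detection code
    # (CRC-32 here) over a layer's input or output data.
    return zlib.crc32(tensor_bytes)

# Codes the device-side debugging circuit would save to memory...
device_codes = [detection_code(b"layer0-out"), detection_code(b"layer1-out")]
# ...and expected codes derived from a functional model of the network.
expected_codes = [detection_code(b"layer0-out"), detection_code(b"layer1-bad")]

for layer, (got, want) in enumerate(zip(device_codes, expected_codes)):
    print(f"layer {layer}: {'ok' if got == want else 'MISMATCH'}")
```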
-
Publication No.: US11188302B1
Publication Date: 2021-11-30
Application No.: US16267031
Filing Date: 2019-02-04
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant, Randy Renfu Huang, Richard John Heaton
Abstract: Top-k is a process by which the largest elements among a set of elements are found. In various implementations, a top-k computation can be executed by a neural network accelerator, where the top-k computation is performed using a process that makes use of the accelerator's memory array. A set of numerical values on which to perform top-k can be stored in the memory array. The accelerator can locate the maximum value among the set of numerical values and store the maximum value back into the memory array. The accelerator can next remove the maximum value from the set of numerical values so that a next-largest value can be found. To remove the maximum value, the accelerator can write a value representing negative infinity to the memory array at each location of the maximum value.
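A NumPy sketch of the loop the abstract describes: find the maximum, store it as a result, then overwrite every location holding that maximum with negative infinity so the next-largest value surfaces:

```python
import numpy as np

def top_k(values, k):
    work = values.astype(np.float64).copy()  # scratch copy in the "memory array"
    result = []
    while len(result) < k:
        current_max = float(work.max())      # locate the maximum value
        count = int((work == current_max).sum())
        result.extend([current_max] * min(count, k - len(result)))
        work[work == current_max] = -np.inf  # erase every location holding it
    return result

print(top_k(np.array([3.0, 9.0, 1.0, 9.0, 5.0]), k=3))  # [9.0, 9.0, 5.0]
```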