-
Publication No.: US20210304010A1
Publication Date: 2021-09-30
Application No.: US16836421
Filing Date: 2020-03-31
Applicant: Amazon Technologies, Inc.
Inventor: Sudipta Sengupta , Randy Renfu Huang , Ron Diamant , Vignesh Vivekraja
Abstract: Methods and systems for training a neural network are provided. In one example, an apparatus comprises a memory that stores instructions; and a hardware processor configured to execute the instructions to: control a neural network processor to perform a loss gradient operation to generate data gradients; after the loss gradient operation completes, control the neural network processor to perform a forward propagation operation to generate intermediate outputs; control the neural network processor to perform a backward propagation operation based on the data gradients and the intermediate outputs to generate weight gradients; receive the weight gradients from the neural network processor; and update weights of a neural network based on the weight gradients.
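
A minimal numpy sketch of the operation ordering this abstract describes: the loss gradient runs first, a forward pass then regenerates the intermediate outputs, and backward propagation combines the two. The single linear layer and MSE loss are illustrative assumptions, not the patented apparatus.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # input batch
w = rng.normal(size=(4, 3))          # layer weights
y_true = rng.normal(size=(8, 3))     # targets

# 1. Loss gradient operation: compute dL/dy first; the forward pass used
#    here does not keep its intermediate outputs.
y_pred = x @ w
data_grads = 2.0 * (y_pred - y_true) / x.shape[0]   # MSE loss gradient

# 2. After the loss gradient completes, a forward propagation operation
#    regenerates the intermediate outputs (here, the layer input).
intermediates = x

# 3. Backward propagation combines data gradients and intermediates.
weight_grads = intermediates.T @ data_grads

# 4. The host receives the weight gradients and updates the weights.
w -= 0.01 * weight_grads
```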
-
Publication No.: US20210247984A1
Publication Date: 2021-08-12
Application No.: US17243415
Filing Date: 2021-04-28
Applicant: Amazon Technologies, Inc.
Inventor: Jeffrey T. Huynh , Drazen Borkovic , Jindrich Zejda , Randy Renfu Huang , Ron Diamant
Abstract: Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
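
One way to picture the reordering pass is a greedy scheduler that prefers to issue a ready operation on an engine different from the one just used, keeping both engines busy; the Op structure, engine names, and heuristic below are invented for illustration and are not the disclosed compiler.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    engine: str          # which execution engine runs it
    deps: tuple = ()     # names of operations it depends on

def reorder(ops):
    """Greedy pass: issue a ready op on an engine not used by the
    previously scheduled op, without violating dependencies."""
    done, ordered, pending = set(), [], list(ops)
    while pending:
        last_engine = ordered[-1].engine if ordered else None
        ready = [o for o in pending if all(d in done for d in o.deps)]
        pick = next((o for o in ready if o.engine != last_engine), ready[0])
        pending.remove(pick)
        done.add(pick.name)
        ordered.append(pick)
    return ordered

ops = [Op("matmul1", "pe_array"), Op("matmul2", "pe_array"),
       Op("pool1", "pooling", deps=("matmul1",))]
print([o.name for o in reorder(ops)])  # ['matmul1', 'pool1', 'matmul2']
```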
-
Publication No.: US10997277B1
Publication Date: 2021-05-04
Application No.: US16364837
Filing Date: 2019-03-26
Applicant: Amazon Technologies, Inc.
Inventor: Yu Zhou , Vignesh Vivekraja , Ron Diamant
Abstract: An integrated circuit device such as a neural network accelerator can be programmed to select a numerical value based on a multinomial distribution. In various examples, the integrated circuit device can include an execution engine that includes multiple separate execution units. The multiple execution units can operate in parallel on different streams of data. For example, to make a selection based on a multinomial distribution, the execution units can be configured to perform cumulative sums on sets of numerical values, where the numerical values represent probabilities. In this example, to then obtain cumulative sums across the sets of numerical values, the largest values from the sets can be accumulated and then added, in parallel, to the sets. The resulting cumulative sum across all the numerical values can then be used to randomly select a specific index, which can provide a particular numerical value as the selected value.
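
The parallel cumulative-sum trick lends itself to a short numpy sketch: per-set prefix sums run independently, the set totals become offsets, and one uniform draw selects an index. The set sizes here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random(16)
probs /= probs.sum()                 # a multinomial distribution

sets = probs.reshape(4, 4)           # 4 execution units, 4 values each
local = np.cumsum(sets, axis=1)      # per-set cumulative sums, in parallel
totals = local[:, -1]                # largest value of each set = set total
offsets = np.concatenate(([0.0], np.cumsum(totals)[:-1]))
full = (local + offsets[:, None]).ravel()   # cumulative sum across all sets

u = rng.random()                     # uniform draw in [0, 1)
index = int(np.searchsorted(full, u))
print(index, probs[index])           # index selected with probability probs[index]
```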
-
Publication No.: US20210097396A1
Publication Date: 2021-04-01
Application No.: US16588603
Filing Date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Vignesh Vivekraja , Thiam Khean Hah , Randy Renfu Huang , Ron Diamant , Richard John Heaton
Abstract: Methods and systems for performing a training operation of a neural network are provided. In one example, a method comprises: performing backward propagation computations for a second layer of a neural network to generate second weight gradients; splitting the second weight gradients into portions; causing a hardware interface to exchange a first portion of the second weight gradients with a second computer system; performing backward propagation computations for a first layer of the neural network to generate first weight gradients while the exchange of the first portion of the second weight gradients is underway, the first layer being a lower layer than the second layer in the neural network; causing the hardware interface to transmit the first weight gradients to the second computer system; and causing the hardware interface to transmit the remaining portions of the second weight gradients to the second computer system.
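
A rough sketch of the overlap, with a worker thread standing in for the hardware interface; the gradient shapes and the exchange function are hypothetical.

```python
import threading
import numpy as np

def exchange(grads):
    """Stand-in for transferring gradients to the second computer system
    via the hardware interface."""
    pass

rng = np.random.default_rng(0)
second_layer_grads = rng.normal(size=(1024,))
first_portion, remaining = np.split(second_layer_grads, [256])

# start exchanging the first portion of the layer-2 weight gradients
xfer = threading.Thread(target=exchange, args=(first_portion,))
xfer.start()

# backward propagation for the lower (first) layer proceeds while the
# exchange is underway
first_layer_grads = rng.normal(size=(512,))

xfer.join()
exchange(first_layer_grads)          # transmit first-layer gradients
exchange(remaining)                  # transmit remaining layer-2 portions
```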
-
Publication No.: US10884707B1
Publication Date: 2021-01-05
Application No.: US16455201
Filing Date: 2019-06-27
Applicant: Amazon Technologies, Inc.
Inventor: Haichen Li , Ron Diamant , Jeffrey T. Huynh , Yu Zhou , Se jong Oh
Abstract: Provided are systems and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array, and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.
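
A numpy sketch of the block-transpose idea: the identity multiplication leaves each block's values intact, and the transpose falls out of mapping column partitions of the results buffer to row partitions of the buffer memory. The 4x4 array size is an assumption.

```python
import numpy as np

ARRAY = 4                                   # systolic array is ARRAY x ARRAY
rng = np.random.default_rng(0)
tensor = rng.normal(size=(8, 8))
out = np.empty_like(tensor)

for i in range(0, 8, ARRAY):
    for j in range(0, 8, ARRAY):
        block = tensor[i:i+ARRAY, j:j+ARRAY]
        psum = np.eye(ARRAY) @ block        # identity multiplication
        # column partition k of the results buffer holds psum[:, k];
        # mapping it to row partition k of the buffer memory transposes
        for k in range(ARRAY):
            out[j+k, i:i+ARRAY] = psum[:, k]

assert np.allclose(out, tensor.T)
```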
-
Publication No.: US10761939B1
Publication Date: 2020-09-01
Application No.: US16219489
Filing Date: 2018-12-13
Applicant: Amazon Technologies, Inc.
Inventor: Kun Xu , Thomas A. Volpe , Ron Diamant , Mark Anthony Banse
Abstract: A circuit at an interface between a device and an interconnect fabric is configured to track outstanding transactions associated with the device and ensure the completion of the outstanding transactions before rebooting or powering down the device. In some embodiments, the circuit is also configurable to provide appropriate responses when the device is powered down or being rebooted, so that other devices in the system can continue to operate without knowing that the device is inactive and will not hang waiting for a response from it.
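
A software model of the tracking behavior, with a counter for outstanding transactions, a drain-before-power-down step, and a default responder for the inactive device; the class and method names are invented for illustration of the idea, not a hardware description.

```python
class InterfaceTracker:
    def __init__(self):
        self.outstanding = 0
        self.device_active = True

    def request_sent(self):
        self.outstanding += 1

    def response_received(self):
        self.outstanding -= 1

    def quiesce_and_power_down(self):
        # ensure completion of outstanding transactions before power-down
        while self.outstanding > 0:
            self.response_received()     # stand-in for draining responses
        self.device_active = False

    def handle_incoming(self, request):
        if self.device_active:
            return "forward to device"
        return "default response"        # keeps the fabric from hanging

t = InterfaceTracker()
t.request_sent(); t.request_sent()
t.quiesce_and_power_down()
print(t.handle_incoming("read"))         # -> "default response"
```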
-
Publication No.: US10761822B1
Publication Date: 2020-09-01
Application No.: US16217858
Filing Date: 2018-12-12
Applicant: Amazon Technologies, Inc.
Inventor: Drazen Borkovic , Jindrich Zejda , Taemin Kim , Ron Diamant
IPC: G06F11/36 , G06F17/10 , G06F3/048 , G06F13/40 , G06F17/50 , G06F9/30 , G06F12/00 , G06F8/41 , G06N3/02 , G06F12/1081 , G06F12/06 , G06F12/0888 , G06F8/34 , G06F9/50 , G06F9/455
Abstract: Provided are systems and methods for generating program code for an integrated circuit, where instructions in the code synchronize computation engines that support non-blocking instructions. In various examples, a computing device can receive an input data set including operations to be performed by an integrated circuit device and dependencies between the operations. The input data set can include a non-blocking instruction, and an operation that requires that the non-blocking instruction be completed. The computing device can generate instructions for performing the operation including a particular instruction to wait for a value to be set in a register of the integrated circuit device. The computing device can further generate program code including the non-blocking instruction and the instructions for performing the operation, wherein the non-blocking instruction is configured to set the value in the register.
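
A sketch of the code-generation idea: the generator pairs each non-blocking instruction with a register it sets on completion and emits a wait-on-register instruction before the dependent operation. The instruction mnemonics and data structures are hypothetical.

```python
def generate_program(ops, deps):
    """ops: list of (name, blocking) tuples; deps: {consumer: producer}."""
    program, regs = [], {}
    for name, blocking in ops:
        if name in deps:
            producer = deps[name]
            # particular instruction: wait for the register to be set
            program.append(f"WAIT_REG r{regs[producer]} == 1")
        if not blocking:
            reg = len(regs)
            regs[name] = reg
            # the non-blocking instruction is configured to set the register
            program.append(f"{name} (non-blocking, set r{reg} on done)")
        else:
            program.append(name)
    return program

prog = generate_program(
    ops=[("dma_load", False), ("matmul", True)],
    deps={"matmul": "dma_load"},
)
print("\n".join(prog))
```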
-
Publication No.: US10742555B1
Publication Date: 2020-08-11
Application No.: US15838245
Filing Date: 2017-12-11
Applicant: Amazon Technologies, Inc.
Inventor: Leah Shalev , Ron Diamant , Erez Izenberg , Nafea Bshara
IPC: H04L12/803 , H04L12/26 , H04L29/06 , H04L12/721 , H04L12/743 , H04L12/707 , H04J3/06
Abstract: A method and corresponding apparatus for detecting network congestion. The method includes capturing, using a local clock of a sender device, a send time of an outgoing packet sent from the sender device to a receiver device through a forward route, and capturing, using the local clock of the sender device, a receive time of an acknowledgment packet sent from the receiver device to the sender device through a backward route. The acknowledgment packet contains timing information, generated using a local clock of the receiver device, for determining an internal latency of the receiver device. A round trip time is computed as a difference between the send time and the receive time. The internal latency is subtracted from the round trip time to compute a total propagation time. If the total propagation time is above a threshold, the forward route and the backward route are changed.
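
The timing arithmetic reduces to a few lines; all numbers and the threshold below are illustrative.

```python
# Times captured with the sender's local clock, except the receiver's
# internal latency, which comes from timing info in the acknowledgment.
send_time = 1_000.000                # ms: outgoing packet sent
recv_time = 1_000.850                # ms: acknowledgment packet received
receiver_internal_latency = 0.300    # ms: reported by the receiver

round_trip_time = recv_time - send_time                          # 0.850 ms
total_propagation = round_trip_time - receiver_internal_latency  # 0.550 ms

THRESHOLD = 0.500                    # ms: illustrative congestion threshold
if total_propagation > THRESHOLD:
    print("congestion detected: change forward and backward routes")
```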
-
Publication No.: US10664282B1
Publication Date: 2020-05-26
Application No.: US16266731
Filing Date: 2019-02-04
Applicant: Amazon Technologies, Inc.
Inventor: Ilya Minkin , Ron Diamant , Mohammad El-Shabani , Dana Michelle Vantrease
Abstract: Methods for repeated execution of program code by an execution engine are provided. In order to execute large programs, the instruction buffer of an execution engine may be refilled many times with program code to complete one execution of the program. At completion of program execution, the program code needed to begin re-execution of the program is no longer in the instruction buffer. A runtime driver program can load instructions into the instruction buffer, or can cause instructions to be loaded. Once the instructions are loaded, the execution engine may be able to re-execute the instructions without needing further assistance from the runtime driver.
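
A toy model of the refill flow: the program exceeds the buffer, the buffer is refilled chunk by chunk during a run, and the driver restores the program's head before re-execution. Sizes and names are arbitrary assumptions.

```python
BUF_SIZE = 4
program = [f"instr_{i}" for i in range(10)]   # program larger than the buffer
buffer = program[:BUF_SIZE]                   # buffer initially holds the head

def execute():
    pc = 0
    while pc < len(program):
        buffer[:] = program[pc:pc + BUF_SIZE]   # refilled many times per run
        for instr in buffer:
            pass                                # execute the instruction
        pc += len(buffer)

execute()
# After a full run the buffer holds the tail of the program, so the runtime
# driver reloads the first chunk before the program can be re-executed.
buffer[:] = program[:BUF_SIZE]
execute()
```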
-
Publication No.: US10592250B1
Publication Date: 2020-03-17
Application No.: US16014646
Filing Date: 2018-06-21
Applicant: Amazon Technologies, Inc.
Inventor: Ron Diamant , Ilya Minkin
Abstract: Disclosed herein are techniques for self-refilling an instruction buffer by an execution engine while the execution engine executes instructions in the instruction buffer. An instruction loader splits instruction code into sections of code and creates a data store (e.g., a DMA ring) for loading the sections of code into the instruction buffer. In some embodiments, an instruction is added to some sections of code. The instruction, when executed by the execution engine, triggers the loading of one or more sections of code into the instruction buffer based on one or more entries in the data store. In some embodiments, a hardware logic in the execution engine is configured to trigger the loading of the sections of code into the instruction buffer. In some embodiments, the one or more sections of code are loaded into the instruction buffer through a refill page that is different from the instruction buffer.
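
A sketch of the self-refill scheme: the loader splits the code into sections, builds a ring of descriptors, and appends a trigger instruction to each section that loads the next one into the buffer; names and sizes are illustrative, not the disclosed design.

```python
SECTION = 3
code = [f"instr_{i}" for i in range(9)]

# split instruction code into sections and build the descriptor ring
sections = [code[i:i + SECTION] for i in range(0, len(code), SECTION)]
dma_ring = [{"src_section": i, "dst": "instruction_buffer"}
            for i in range(len(sections))]

# append a refill-trigger instruction to every section but the last; when
# executed, it loads the next section via the corresponding ring entry
for i, section in enumerate(sections[:-1]):
    section.append(f"trigger_dma(ring_entry={i + 1})")

for s in sections:
    print(s)
```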
-