NEURAL NETWORK TRAINING UNDER MEMORY RESTRAINT

    Publication No.: US20210304010A1

    Publication Date: 2021-09-30

    Application No.: US16836421

    Application Date: 2020-03-31

    Abstract: Methods and systems for training a neural network are provided. In one example, an apparatus comprises a memory that stores instructions; and a hardware processor configured to execute the instructions to: control a neural network processor to perform a loss gradient operation to generate data gradients; after the loss gradient operation completes, control the neural network processor to perform a forward propagation operation to generate intermediate outputs; control the neural network processor to perform a backward propagation operation based on the data gradients and the intermediate outputs to generate weight gradients; receive the weight gradients from the neural network processor; and update weights of a neural network based on the weight gradients.
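    The schedule described above trades compute for memory: intermediate outputs are not kept from the first forward pass but regenerated after the loss gradient is ready. A minimal sketch of that ordering for a single linear layer follows; all function names (`forward`, `loss_grad`, `backward`, `train_step`) are illustrative, not from the patent.

```python
def forward(w, x):
    # Intermediate output of a single linear layer: y = w * x
    return w * x

def loss_grad(y, target):
    # Gradient of the loss 0.5*(y - target)^2 with respect to y
    return y - target

def backward(data_grad, intermediate_x):
    # Weight gradient for y = w * x is dL/dw = dL/dy * x
    return data_grad * intermediate_x

def train_step(w, x, target, lr=0.1):
    y = forward(w, x)            # initial forward pass; intermediates discarded
    dy = loss_grad(y, target)    # 1. loss gradient operation completes first
    inter = x                    # 2. forward propagation regenerates intermediates
    dw = backward(dy, inter)     # 3. backward propagation uses both
    return w - lr * dw           # 4. host processor updates the weights
```

    With `w = 1.0`, `x = 2.0`, `target = 4.0`, the weight gradient is `-4.0` and the updated weight is `1.4`.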

    NEURAL NETWORK OPERATION REORDERING FOR PARALLEL EXECUTION

    Publication No.: US20210247984A1

    Publication Date: 2021-08-12

    Application No.: US17243415

    Application Date: 2021-04-28

    Abstract: Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
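    One simple form of the reordering the abstract describes is interleaving operations bound for different execution engines so that no engine idles behind a long run of another engine's work. The sketch below (invented names; assumes operations on different engines are independent) preserves each engine's internal order while round-robining across engines.

```python
from collections import deque

def reorder_for_parallelism(ops):
    """ops: list of (engine, op_name) in original program order.
    Returns a round-robin interleaving across engines that keeps
    each engine's operations in their original relative order."""
    queues, engine_order = {}, []
    for engine, name in ops:
        queues.setdefault(engine, deque()).append((engine, name))
        if engine not in engine_order:
            engine_order.append(engine)
    reordered = []
    while any(queues[e] for e in engine_order):
        for e in engine_order:
            if queues[e]:
                reordered.append(queues[e].popleft())
    return reordered
```

    For example, `[("pe", "a"), ("pe", "b"), ("act", "c")]` becomes `[("pe", "a"), ("act", "c"), ("pe", "b")]`, letting the activation engine start while the processing-engine work continues.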

    Multinomial distribution on an integrated circuit

    Publication No.: US10997277B1

    Publication Date: 2021-05-04

    Application No.: US16364837

    Application Date: 2019-03-26

    Abstract: An integrated circuit device such as a neural network accelerator can be programmed to select a numerical value based on a multinomial distribution. In various examples, the integrated circuit device can include an execution engine that includes multiple separate execution units. The multiple execution units can operate in parallel on different streams of data. For example, to make a selection based on a multinomial distribution, the execution units can be configured to perform cumulative sums on sets of numerical values, where the numerical values represent probabilities. In this example, to then obtain cumulative sums across the sets of numerical values, the largest values from the sets can be accumulated, and then added, in parallel to the sets. The resulting cumulative sum across all the numerical values can then be used to randomly select a specific index, which can provide a particular numerical value as the selected value.
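    The cumulative-sum scheme can be sketched as follows: probabilities are split into lanes, each lane computes a local prefix sum (these would run in parallel on the execution units), the lane totals become per-lane offsets, and a random draw selects an index from the resulting global cumulative sum. This is a hypothetical software model, not the hardware implementation.

```python
def multinomial_index(probs, u, lanes=2):
    """Select an index from `probs` (a multinomial distribution) given a
    uniform random draw u in [0, 1), using per-lane prefix sums."""
    n = len(probs)
    step = (n + lanes - 1) // lanes
    chunks = [probs[i:i + step] for i in range(0, n, step)]
    # Per-lane cumulative sums (parallel across execution units)
    local = []
    for chunk in chunks:
        sums, acc = [], 0.0
        for p in chunk:
            acc += p
            sums.append(acc)
        local.append(sums)
    # Accumulate each lane's largest (last) value as the lane offset,
    # then add it back to recover the global cumulative sum
    global_cumsum, offset = [], 0.0
    for sums in local:
        global_cumsum.extend(v + offset for v in sums)
        offset += sums[-1]
    # The random draw picks the first index whose cumulative sum exceeds it
    for i, v in enumerate(global_cumsum):
        if u < v:
            return i
    return n - 1
```

    With `probs = [0.1, 0.2, 0.3, 0.4]`, the global cumulative sums are approximately `[0.1, 0.3, 0.6, 1.0]`, so a draw of `0.35` selects index 2.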

    NEURAL NETWORK TRAINING IN A DISTRIBUTED SYSTEM

    Publication No.: US20210097396A1

    Publication Date: 2021-04-01

    Application No.: US16588603

    Application Date: 2019-09-30

    Abstract: Methods and systems for performing a training operation of a neural network are provided. In one example, a method comprises: performing backward propagation computations for a second layer of a neural network to generate second weight gradients; splitting the second weight gradients into portions; causing a hardware interface to exchange a first portion of the second weight gradients with a second computer system; performing backward propagation computations for a first layer of the neural network to generate first weight gradients while the exchange of the first portion of the second weight gradients is underway, the first layer being a lower layer than the second layer in the neural network; causing the hardware interface to transmit the first weight gradients to the second computer system; and causing the hardware interface to transmit the remaining portions of the second weight gradients to the second computer system.
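    The point of the ordering is that communication for the upper layer hides behind compute for the lower layer. A toy timeline (all event names invented) of that overlap, plus the gradient splitting, might look like:

```python
def overlap_schedule(layer2_grads, portions=2):
    """Return (event timeline, gradient portions) for the described overlap.
    The exchange of the first portion starts before layer-1 backward
    propagation, so the two proceed concurrently."""
    chunks = [layer2_grads[i::portions] for i in range(portions)]
    events = [
        ("compute", "backward_layer2"),
        ("comm_start", "exchange_layer2_chunk0"),  # hardware interface busy
        ("compute", "backward_layer1"),            # overlaps with the exchange
        ("comm", "send_layer1_grads"),
        ("comm", "send_layer2_remaining"),
    ]
    return events, chunks
```

    The key property is that `backward_layer1` appears after the exchange starts but before the remaining portions are sent.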

    Powering-down or rebooting a device in a system fabric

    Publication No.: US10761939B1

    Publication Date: 2020-09-01

    Application No.: US16219489

    Application Date: 2018-12-13

    Abstract: A circuit at an interface between a device and an interconnect fabric is configured to track outstanding transactions associated with the device and ensure the completion of the outstanding transactions before rebooting or powering down the device. In some embodiments, the circuit is also configurable to provide appropriate responses when the device is powered down or is being rebooted such that other devices in the system can still operate even without knowing that the device is inactive and would not hang because no response is received from the device.
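    The bookkeeping the circuit performs can be modeled as a counter of outstanding transactions that must drain to zero before power-down, plus a stub response issued on the device's behalf while it is inactive. A minimal sketch with invented names:

```python
class FabricInterface:
    """Toy model of the interface circuit between a device and the fabric."""

    def __init__(self):
        self.outstanding = 0   # transactions issued but not yet completed
        self.active = True     # whether the device behind us is powered on

    def issue(self):
        self.outstanding += 1

    def complete(self):
        self.outstanding -= 1

    def can_power_down(self):
        # Power-down or reboot is allowed only once all transactions drain
        return self.outstanding == 0

    def respond(self, request):
        # While the device is down, answer on its behalf so that peers
        # receive a response and do not hang waiting on the device
        if not self.active:
            return "default_response"
        return "device_response"
```

    Callers never observe the device's power state directly; they simply always get a response.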

    Network congestion detection and resolution

    Publication No.: US10742555B1

    Publication Date: 2020-08-11

    Application No.: US15838245

    Application Date: 2017-12-11

    Abstract: A method and corresponding apparatus for detecting network congestion. The method includes capturing, using a local clock of a sender device, a send time of an outgoing packet sent from the sender device to a receiver device through a forward route, and capturing, using the local clock of the sender device, a receive time of an acknowledgment packet sent from the receiver device to the sender device through a backward route. The acknowledgment packet contains timing information, generated using a local clock of the receiver device, for determining an internal latency of the receiver device. A round trip time is computed as a difference between the send time and the receive time. The internal latency is subtracted from the round trip time to compute a total propagation time. If the total propagation time is above a threshold, the forward route and the backward route are changed.
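    The timing arithmetic reduces to three lines: a round trip time measured entirely on the sender's clock, the receiver's internal latency subtracted out, and a threshold comparison that triggers the route change. A sketch with illustrative names:

```python
def needs_reroute(send_time, recv_time, receiver_internal_latency, threshold):
    """Decide whether the forward and backward routes should be changed.
    send_time and recv_time are captured on the sender's local clock;
    receiver_internal_latency comes from timing information carried in
    the acknowledgment packet."""
    round_trip = recv_time - send_time
    total_propagation = round_trip - receiver_internal_latency
    return total_propagation > threshold
```

    Using the sender's clock for both timestamps avoids any need for clock synchronization between the two devices; only the receiver's internal latency, a duration rather than an absolute time, crosses the network.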

    Runtime augmentation of engine instructions

    Publication No.: US10664282B1

    Publication Date: 2020-05-26

    Application No.: US16266731

    Application Date: 2019-02-04

    Abstract: Methods for repeated execution of program code by an execution engine are provided. In order to execute large programs, the instruction buffer of an execution engine may be refilled many times with program code to complete one execution of the program. At completion of program execution, the program code needed to begin re-execution of the program is no longer in the instruction buffer. A runtime driver program can load instructions into the instruction buffer, or can cause instructions to be loaded. Once the instructions are loaded, the execution engine may be able to re-execute the instructions without needing further assistance from the runtime driver.
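    A program larger than the instruction buffer is executed section by section, and re-execution simply restarts the sequence of refills from the head of the program. A toy model (all names invented) of that loop:

```python
def run_program(program, buffer_size, runs):
    """Execute `program` `runs` times through a buffer holding at most
    `buffer_size` instructions, refilling the buffer section by section."""
    executed = []
    for _ in range(runs):
        # For each run, the driver (or hardware it has set up) refills the
        # buffer from the start of the program, one section at a time
        for i in range(0, len(program), buffer_size):
            buffer = program[i:i + buffer_size]   # refill
            executed.extend(buffer)               # execute buffer contents
    return executed
```

    The driver's role is only to make the first section reappear in the buffer after a run completes; subsequent refills within a run proceed without it.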

    Self-refill for instruction buffer
    Invention Grant

    Publication No.: US10592250B1

    Publication Date: 2020-03-17

    Application No.: US16014646

    Application Date: 2018-06-21

    Abstract: Disclosed herein are techniques for self-refilling an instruction buffer by an execution engine while the execution engine executes instructions in the instruction buffer. An instruction loader splits instruction code into sections of code and creates a data store (e.g., a DMA ring) for loading the sections of code into the instruction buffer. In some embodiments, an instruction is added to some sections of code. The instruction, when executed by the execution engine, triggers the loading of one or more sections of code into the instruction buffer based on one or more entries in the data store. In some embodiments, a hardware logic in the execution engine is configured to trigger the loading of the sections of code into the instruction buffer. In some embodiments, the one or more sections of code are loaded into the instruction buffer through a refill page that is different from the instruction buffer.
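    Unlike the driver-assisted scheme above, here each section's execution triggers the load of the next section from a ring of descriptors, so the engine refills its own buffer. A toy model with invented names (the ring stands in for the DMA ring of the abstract):

```python
class SelfRefillEngine:
    """Toy model of an execution engine that refills its own buffer."""

    def __init__(self, program, section_size):
        # The instruction loader splits the code into sections and builds
        # a ring of descriptors, one entry per section
        self.sections = [program[i:i + section_size]
                         for i in range(0, len(program), section_size)]
        self.ring = list(range(len(self.sections)))

    def run(self):
        executed, cursor = [], 0
        buffer = list(self.sections[self.ring[cursor]])  # initial load
        while True:
            executed.extend(buffer)          # execute current buffer contents
            cursor += 1
            if cursor >= len(self.ring):
                break
            # The refill step at the end of each section loads the next
            # section, as directed by the next ring entry
            buffer = list(self.sections[self.ring[cursor]])
        return executed
```

    Running a five-instruction program through a two-entry buffer executes all five instructions in order across three refills.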
