Hardware engine with configurable instructions

    Publication No.: US11507378B1

    Publication Date: 2022-11-22

    Application No.: US17188548

    Application Date: 2021-03-01

    Abstract: In one example, an integrated circuit comprises: a memory configured to store a first mapping between a first opcode and first control information and a second mapping between the first opcode and second control information; a processing engine configured to perform processing operations based on the control information; and a controller configured to: at a first time, provide the first opcode to the memory to, based on the first mapping stored in the memory, fetch the first control information for the processing engine, to enable the processing engine to perform a first processing operation based on the first control information; and at a second time, provide the first opcode to the memory to, based on the second mapping stored in the memory, fetch the second control information for the processing engine, to enable the processing engine to perform a second processing operation based on the second control information.
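
    As a minimal sketch of the mechanism (class names, opcode values, and control words below are invented for illustration, not taken from the patent), the same opcode can fetch different control information at different times, depending on what is currently programmed into the mapping memory:

        # Hypothetical sketch: one opcode, two mappings, two behaviors.
        class MappingMemory:
            """Stands in for the on-chip memory holding opcode -> control info."""
            def __init__(self):
                self._table = {}

            def program(self, opcode, control_info):
                self._table[opcode] = control_info

            def fetch(self, opcode):
                return self._table[opcode]

        class ProcessingEngine:
            def execute(self, control_info):
                print(f"executing with control info {control_info}")

        memory = MappingMemory()
        engine = ProcessingEngine()
        OPCODE = 0x2A  # illustrative opcode value

        # First time: the opcode maps to control information for one operation.
        memory.program(OPCODE, {"op": "conv2d", "stride": 1})
        engine.execute(memory.fetch(OPCODE))

        # Second time: the same opcode is remapped to different control
        # information, so the engine performs a different operation.
        memory.program(OPCODE, {"op": "matmul", "tile": 128})
        engine.execute(memory.fetch(OPCODE))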

    Multi-model training pipeline in distributed systems

    Publication No.: US11468325B2

    Publication Date: 2022-10-11

    Application No.: US16835161

    Application Date: 2020-03-30

    Abstract: A first worker node of a distributed system computes a first set of gradients using a first neural network model and a first set of weights associated with the first neural network model. The first set of gradients are transmitted from the first worker node to a second worker node of the distributed system. The second worker node computes a first set of synchronized gradients based on the first set of gradients. While the first set of synchronized gradients are being computed, the first worker node computes a second set of gradients using a second neural network model and a second set of weights associated with the second neural network model. The second set of gradients are transmitted from the first worker node to the second worker node. The second worker node computes a second set of synchronized gradients based on the second set of gradients.
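
    The point of the pipeline is that synchronization of one model's gradients overlaps with gradient computation for the other. Below is a toy Python sketch of that overlap; the sleep-based functions are stand-ins I invented for backpropagation and the cross-worker all-reduce, not the patent's implementation:

        from concurrent.futures import ThreadPoolExecutor
        import time

        def compute_gradients(weights):
            time.sleep(0.1)  # stands in for backpropagation on this worker
            return [w * 0.01 for w in weights]

        def synchronize(grads):
            time.sleep(0.1)  # stands in for the all-reduce on the second worker
            return [g / 2.0 for g in grads]  # e.g., average over two workers

        weights_a = [1.0, 2.0]  # first model's weights on the first worker
        weights_b = [3.0, 4.0]  # second model's weights on the first worker

        # One worker thread plays the role of the second worker node.
        with ThreadPoolExecutor(max_workers=1) as pool:
            grads_a = compute_gradients(weights_a)
            sync_a = pool.submit(synchronize, grads_a)  # model A sync starts
            grads_b = compute_gradients(weights_b)      # overlaps with A's sync
            sync_b = pool.submit(synchronize, grads_b)
            print(sync_a.result(), sync_b.result())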

    Sparse machine learning acceleration

    Publication No.: US20220318604A1

    Publication Date: 2022-10-06

    Application No.: US17301271

    Application Date: 2021-03-30

    Abstract: To reduce the storage size of weight tensors and speed up loading of weight tensors from system memory, a compression technique can be employed to remove zero values from a weight tensor before storing the weight tensor in system memory. A sparsity threshold can be enforced to achieve a compression ratio target by forcing small weight values to zero during training. When the weight tensor is loaded from system memory, a direct memory access (DMA) engine with an in-line decompression unit can decompress the weight tensor on-the-fly. By performing the decompression in the DMA engine, expansion of the weight values back to the original weight tensor size can be carried out in parallel while other neural network computations are being performed by the processing unit.
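
    A minimal Python sketch of the zero-removal scheme, assuming an illustrative (bitmask, nonzero-values) storage format and a made-up threshold; the actual compressed format and the DMA engine's in-line decompression hardware are not specified here:

        THRESHOLD = 0.05  # illustrative sparsity threshold

        def compress(weights):
            # Small weights are forced to zero (as during training), then
            # zeros are removed before the tensor goes to system memory.
            pruned = [w if abs(w) >= THRESHOLD else 0.0 for w in weights]
            mask = [1 if w != 0.0 else 0 for w in pruned]
            values = [w for w in pruned if w != 0.0]
            return mask, values

        def decompress(mask, values):
            # Plays the role of the in-line decompression unit: zeros are
            # re-inserted wherever the mask says so, restoring full size.
            it = iter(values)
            return [next(it) if m else 0.0 for m in mask]

        weights = [0.3, 0.01, 0.0, -0.2, 0.02, 0.5]
        mask, values = compress(weights)
        print(mask, values)              # compact form stored in memory
        print(decompress(mask, values))  # expanded back to original size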

    Hardware accelerator having reconfigurable instruction set and reconfigurable decoder

    Publication No.: US11334358B2

    Publication Date: 2022-05-17

    Application No.: US16707857

    Application Date: 2019-12-09

    Inventor: Ron Diamant

    Abstract: In one example, a hardware accelerator comprises: a programmable hardware instruction decoder programmed to store a plurality of opcodes; a programmable instruction schema mapping table implemented as a content addressable memory (CAM) and programmed to map the plurality of opcodes to a plurality of definitions of operands in a plurality of instructions; a hardware execution engine; and a controller configured to: receive an instruction that includes a first opcode of the plurality of opcodes; control the hardware instruction decoder to extract the first opcode from the instruction; obtain, from the instruction schema mapping table and based on the first opcode, a first definition of a first operand; and forward the instruction and the first definition to the hardware execution engine to control the hardware execution engine to extract the first operand from the instruction based on the first definition, and execute the instruction based on the first operand.
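
    To make the reconfigurability concrete, here is a small Python sketch in which a dictionary plays the role of the CAM-based instruction schema mapping table; the bit-field positions, widths, and opcode values are invented for illustration:

        # Each opcode maps to a definition of where its operand lives in
        # the instruction word (shift and mask are the "definition").
        SCHEMA = {
            0x1: {"operand_shift": 8, "operand_mask": 0xFF},   # 8-bit operand
            0x2: {"operand_shift": 4, "operand_mask": 0xFFF},  # 12-bit operand
        }

        def decode(instruction):
            opcode = instruction & 0xF       # low 4 bits hold the opcode here
            definition = SCHEMA[opcode]      # stand-in for the CAM lookup
            operand = (instruction >> definition["operand_shift"]) \
                      & definition["operand_mask"]
            return opcode, operand

        # Reprogramming SCHEMA reinterprets the same bits, which is the
        # reconfigurability the abstract describes.
        print(decode(0x00AB1))  # opcode 0x1, operand taken from bits 8..15
        SCHEMA[0x1] = {"operand_shift": 4, "operand_mask": 0xFFFF}
        print(decode(0x00AB1))  # same instruction, different extraction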

    Neural network layer-by-layer debugging

    Publication No.: US11308396B2

    Publication Date: 2022-04-19

    Application No.: US16455329

    Application Date: 2019-06-27

    Abstract: Techniques are disclosed for debugging a neural network execution on a target processor. A reference processor may generate a plurality of first reference tensors for the neural network. The neural network may be repeatedly reduced to produce a plurality of lengths. For each of the lengths, a compiler converts the neural network into first machine instructions, the target processor executes the first machine instructions to generate a first device tensor, and the debugger program determines whether the first device tensor matches a first reference tensor. A shortest length is identified for which the first device tensor does not match the first reference tensor. Tensor output is enabled for a lower-level intermediate representation of the shortest neural network, and the neural network is converted into second machine instructions, which are executed by the target processor to generate a second device tensor.
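
    A toy Python rendering of the length-scan idea, where run_reference and run_device are hypothetical stand-ins for the reference processor and the target processor, and the injected defect in layer 3 is fabricated for the example:

        import math

        def run_reference(num_layers, x):
            for _ in range(num_layers):
                x = math.tanh(x)
            return x

        def run_device(num_layers, x):
            for i in range(num_layers):
                # Simulated hardware defect in the third layer.
                x = math.tanh(x) + (0.1 if i == 2 else 0.0)
            return x

        def shortest_failing_length(total_layers, x, tol=1e-6):
            # Scan increasing truncations of the network and return the
            # first length whose device tensor mismatches the reference.
            for n in range(1, total_layers + 1):
                if abs(run_device(n, x) - run_reference(n, x)) > tol:
                    return n
            return None

        print(shortest_failing_length(5, 0.7))  # -> 3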

    Weight loading in an array
    Invention Grant

    Publication No.: US11275997B1

    Publication Date: 2022-03-15

    Application No.: US15967318

    Application Date: 2018-04-30

    Abstract: Disclosed herein are techniques for obtaining weights for neural network computations. In one embodiment, an integrated circuit may include memory configured to store a first weight and a second weight; a row of processing elements comprising a first processing element and a second processing element, the first processing element comprising a first weight register, the second processing element comprising a second weight register, both the first weight register and the second weight register being controllable by a weight load signal; and a controller configured to: provide the first weight from the memory to the row of processing elements; set the weight load signal to enable the first weight to propagate through the row to reach the first processing element; and set the weight load signal to store the first weight at the first weight register and a flush value at the second weight register.
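
    A loose software model of the row-propagation scheme (the two-element row, the names, and the flush value of 0.0 are my own illustrative choices, and the latching semantics are my interpretation of the abstract): weights shift through the row one element per cycle, and the shared weight load signal latches them in place:

        FLUSH = 0.0  # assumed flush value

        class PE:
            def __init__(self):
                self.pipe = None        # weight currently propagating through
                self.weight_reg = None  # latched weight used for computation

        def clock(row, incoming, load):
            # Values shift one processing element per cycle, like a shift
            # register threaded through the row.
            for i in range(len(row) - 1, 0, -1):
                row[i].pipe = row[i - 1].pipe
            row[0].pipe = incoming
            if load:
                # Shared weight load signal: each PE latches the weight
                # that reached it, or the flush value if none did.
                for pe in row:
                    pe.weight_reg = pe.pipe if pe.pipe is not None else FLUSH

        row = [PE(), PE()]
        clock(row, 0.9, load=False)  # weight enters the row
        clock(row, None, load=True)  # weight has propagated; latch everywhere
        print([pe.weight_reg for pe in row])  # -> [0.0, 0.9]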

    Test generation of a distributed system

    Publication No.: US11275661B1

    Publication Date: 2022-03-15

    Application No.: US16582346

    Application Date: 2019-09-25

    Abstract: A method of generating instructions to be executed by a plurality of execution engines that share a resource is provided. The method comprises, in a first generation step: reading a first engine logical timestamp vector of a first execution engine of the execution engines, the logical timestamp vector representing a history of access operations for the resource; determining whether the first engine logical timestamp vector includes a most-up-to-date logical timestamp of the resource in the first generation step; based on the first engine logical timestamp vector including the most-up-to-date logical timestamp of the resource in the first generation step, generating an access instruction to be executed by the first execution engine to access the resource; and scheduling the first execution engine to execute the access instruction.
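
    A rough Python sketch of the timestamp check, using a vector-clock-like structure; the engine names, resource name, and scheduling details below are invented for illustration only:

        # Latest logical time recorded for each shared resource.
        resource_clock = {"buffer0": 3}

        # Each engine's logical timestamp vector: its view of resources.
        engines = {
            "engine0": {"buffer0": 3},  # has seen the latest access
            "engine1": {"buffer0": 2},  # stale view of the resource
        }

        def can_generate_access(engine_name, resource):
            # An access instruction is generated only if the engine's
            # logical timestamp for the resource is the most up to date.
            return engines[engine_name].get(resource, -1) >= \
                   resource_clock[resource]

        for name in engines:
            if can_generate_access(name, "buffer0"):
                print(f"schedule access instruction on {name}")
            else:
                print(f"{name} must wait; its view of buffer0 is stale")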

    Circuit architecture with biased randomization

    Publication No.: US11250319B1

    Publication Date: 2022-02-15

    Application No.: US15714924

    Application Date: 2017-09-25

    Abstract: Disclosed herein are techniques for classifying data with a data processing circuit. In one embodiment, the data processing circuit includes a probabilistic circuit configurable to generate a decision at a pre-determined probability, and an output generation circuit including an output node and configured to receive input data and a weight, and generate output data at the output node for approximating a product of the input data and the weight. The generation of the output data includes propagating the weight to the output node according to a first decision of the probabilistic circuit. The probabilistic circuit is configured to generate the first decision at a probability determined based on the input data.
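
    The output-generation step resembles stochastic multiplication: propagate the weight with a probability derived from the input, so the expected output approximates the product. Below is a software analogy of that idea (not the patented circuit; the assumption that the input is normalized to [0, 1] is mine):

        import random

        def stochastic_multiply(x, weight, trials=100_000):
            # x, assumed in [0, 1], biases the probabilistic circuit's
            # decision to propagate the weight to the output node.
            total = 0.0
            for _ in range(trials):
                if random.random() < x:  # decision taken at probability x
                    total += weight      # weight propagated to the output
            return total / trials

        # The average over many trials approximates the true product.
        print(stochastic_multiply(0.25, 0.8))  # close to 0.25 * 0.8 = 0.2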

    Debug for computation networks using error detection codes

    Publication No.: US11232016B1

    Publication Date: 2022-01-25

    Application No.: US16138145

    Application Date: 2018-09-21

    Abstract: Techniques disclosed herein relate generally to debugging complex computing systems, such as those executing neural networks. A neural network processor includes a processing engine configured to execute instructions to implement multiple layers of a neural network. The neural network processor includes a debugging circuit configured to generate error detection codes for input data to the processing engine or error detection codes for output data generated by the processing engine. The neural network processor also includes an interface to a memory device, where the interface is configured to save the error detection codes generated by the debugging circuit into the memory device. The error detection codes generated by the debugging circuit are compared with expected error detection codes generated using a function model of the neural network to identify defects of the neural network.
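
    A small Python illustration of the comparison flow, using CRC-32 as the error detection code (the abstract does not commit to a specific code, and the per-layer tensors here are fabricated); the appeal is that only compact codes, not full tensors, need to be saved and compared:

        import struct
        import zlib

        def tensor_crc(values):
            # Serialize a tensor (flat list of floats) and compute its CRC.
            data = b"".join(struct.pack("<f", v) for v in values)
            return zlib.crc32(data)

        reference_outputs = [[0.1, 0.2], [0.5, 0.6]]  # functional model, per layer
        device_outputs = [[0.1, 0.2], [0.5, 0.61]]    # device run; layer 2 differs

        for layer, (ref, dev) in enumerate(zip(reference_outputs,
                                               device_outputs), 1):
            status = "ok" if tensor_crc(ref) == tensor_crc(dev) else "MISMATCH"
            print(f"layer {layer}: {status}")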

    Top value computation on an integrated circuit device

    Publication No.: US11188302B1

    Publication Date: 2021-11-30

    Application No.: US16267031

    Application Date: 2019-02-04

    Abstract: Top-k is a process by which the largest elements among a set of elements are found. In various implementations, a top-k computation can be executed by a neural network accelerator, where the top-k computation is performed using a process that makes use of the accelerator's memory array. A set of numerical values on which to perform top-k can be stored in the memory array. The accelerator can locate the maximum value from among the set of numerical values, and can store the maximum value back into the memory array. The accelerator can next remove the maximum value from the set of numerical values, so that a next largest value can be found. To remove the maximum value, the accelerator can write a value representing negative infinity to the memory array at each location of the maximum value.
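
    The abstract's procedure translates almost directly into code. Here is a Python sketch in which a plain list stands in for the accelerator's memory array:

        def top_k(values, k):
            values = list(values)  # stand-in for the memory array
            result = []
            for _ in range(k):
                m = max(values)    # locate the current maximum
                result.append(m)   # store it back (here, into the result)
                # Remove the maximum by writing -infinity at each of its
                # locations, so the next largest value can be found.
                values = [float("-inf") if v == m else v for v in values]
            return result

        print(top_k([3.0, 9.0, 9.0, 1.0, 7.0], 3))  # -> [9.0, 7.0, 3.0]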
