Optimizing hardware FIFO instructions

    Publication number: US11221879B2

    Publication date: 2022-01-11

    Application number: US16919968

    Filing date: 2020-07-02

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus for scheduling first-in-first-out instructions are described. In one aspect, a method includes receiving data representing code of a program to be executed by a processing unit comprising hardware processors. For each of one or more of the hardware processors, an order of independent groups of first-in-first-out (FIFO) instructions for execution by the hardware processor is identified in the data representing the code of the program. For each independent group of FIFO instructions for execution by the hardware processor, a path length metric that represents how long it will take to reach an end of the program from the independent group of FIFO instructions is determined. A new order of the independent groups of FIFO instructions for execution by the hardware processor is generated based at least on the path length metric for each independent group of FIFO instructions for execution by the hardware processor.
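The reordering the abstract describes can be illustrated with a minimal sketch: compute, for each independent FIFO group, the longest remaining time to the end of the program, then schedule longest-path groups first. The `FifoGroup` structure, the latency model, and the greedy sort are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of path-length-based reordering of independent
# FIFO instruction groups. Names and the latency model are assumptions.
from dataclasses import dataclass, field

@dataclass
class FifoGroup:
    name: str
    latency: int                 # assumed cycles to execute this group
    successors: list = field(default_factory=list)  # groups reached afterward

def path_length(group, memo=None):
    """Longest remaining time from this group to the end of the program."""
    if memo is None:
        memo = {}
    if group.name in memo:
        return memo[group.name]
    tail = max((path_length(s, memo) for s in group.successors), default=0)
    memo[group.name] = group.latency + tail
    return memo[group.name]

def reorder(groups):
    """New order: groups with the longest remaining path go first."""
    return sorted(groups, key=path_length, reverse=True)

a = FifoGroup("a", latency=2)
b = FifoGroup("b", latency=1, successors=[a])
c = FifoGroup("c", latency=5)
print([g.name for g in reorder([a, b, c])])  # ['c', 'b', 'a']
```

Scheduling the longest remaining path first is a standard list-scheduling heuristic; the patent's actual metric computation may differ.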

    Architecture to support synchronization between core and inference engine for machine learning

    Publication number: US10929779B1

    Publication date: 2021-02-23

    Application number: US16420092

    Filing date: 2019-05-22

    Abstract: A system to support a machine learning (ML) operation comprises a core configured to receive and interpret commands into a set of instructions for the ML operation and a memory unit configured to maintain data for the ML operation. The system further comprises an inference engine having a plurality of processing tiles, each comprising an on-chip memory (OCM) configured to maintain data for local access by components in the processing tile and one or more processing units configured to perform tasks of the ML operation on the data in the OCM. The system also comprises an instruction streaming engine configured to distribute the instructions to the processing tiles to control their operations and to synchronize data communication between the core and the inference engine so that data transmitted between them correctly reaches the corresponding processing tiles while ensuring coherence of data shared and distributed among the core and the OCMs.
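A toy software model can clarify the streaming engine's role: it routes instructions and data to per-tile queues and on-chip memories, and a synchronization point drains all queues before the core observes results. The class shape, the queue/dict representation, and the `barrier` name are all illustrative assumptions about the architecture, not the patent's design.

```python
# Toy model of an instruction-streaming engine distributing work to
# processing tiles and synchronizing with the core. Structure is an
# illustrative assumption only.
from collections import defaultdict

class StreamingEngine:
    def __init__(self, num_tiles):
        self.queues = defaultdict(list)   # per-tile instruction queues
        self.ocm = defaultdict(dict)      # per-tile on-chip memory (OCM)
        self.num_tiles = num_tiles

    def distribute(self, tile_id, instruction):
        """Route an instruction to the tile that must execute it."""
        self.queues[tile_id].append(instruction)

    def write_data(self, tile_id, key, value):
        """Data from the core must land in the correct tile's OCM."""
        self.ocm[tile_id][key] = value

    def barrier(self):
        """Coherence point: every tile drains its queue before the core
        is allowed to observe results."""
        executed = []
        for t in range(self.num_tiles):
            while self.queues[t]:
                executed.append((t, self.queues[t].pop(0)))
        return executed

eng = StreamingEngine(num_tiles=2)
eng.write_data(0, "weights", [1, 2, 3])
eng.distribute(0, "matmul")
eng.distribute(1, "relu")
print(eng.barrier())  # [(0, 'matmul'), (1, 'relu')]
```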

    Core for a data processing engine in an integrated circuit

    Publication number: US10747531B1

    Publication date: 2020-08-18

    Application number: US15944315

    Filing date: 2018-04-03

    Applicant: Xilinx, Inc.

    Abstract: An example core for a data processing engine (DPE) includes a register file and a processor coupled to the register file. The processor includes a multiply-accumulate (MAC) circuit and permute circuitry coupled between the register file and the MAC circuit, the permute circuitry configured to concatenate at least one pair of outputs of the register file to provide at least one input to the MAC circuit. The core further includes an instruction decoder, coupled to the processor, configured to decode a very large instruction word (VLIW) to set a plurality of parameters of the processor, the plurality of parameters including first parameters of the permute circuitry and second parameters of the MAC circuit.
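The permute→MAC datapath can be modeled in a few lines: the permute stage concatenates a pair of register-file outputs into one wider operand, which then feeds a multiply-accumulate. The 16-bit lane width and function names are assumptions for illustration; the patent does not fix these values here.

```python
# Illustrative model of the permute->MAC path: permute circuitry
# concatenates two register-file outputs into a single wider MAC
# operand. The 16-bit width is an assumption.
def permute_concat(hi, lo, width=16):
    """Concatenate two register outputs into one 2*width-bit operand."""
    mask = (1 << width) - 1
    return ((hi & mask) << width) | (lo & mask)

def mac(acc, a, b):
    """Multiply-accumulate: acc + a * b."""
    return acc + a * b

operand = permute_concat(0x0001, 0x0002)
print(hex(operand))          # 0x10002
print(mac(10, operand, 2))   # 131086
```

In the actual hardware the VLIW decoder would select the permute pattern and MAC mode per instruction; here both are fixed for brevity.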

    Instruction types for providing a result of an arithmetic operation on a selected vector input element to multiple adjacent vector output elements

    Publication number: US10656943B2

    Publication date: 2020-05-19

    Application number: US15024095

    Filing date: 2014-09-17

    Inventor: David Van Kampen

    Abstract: According to an aspect, a digital signal processor obtains a program instruction and, depending on an instruction type, selects a first real valued input or a second real valued input as a given real valued input (the first and second real valued inputs organized as adjacent elements of a first input vector). The processor performs an arithmetic operation on the selected real valued input to provide a real valued result, and provides a first real valued output and a second real valued output during a first operation cycle (organized as adjacent elements of a second output vector). The real valued result is provided as both the first real valued output and the second real valued output, depending on the instruction type, and the second output vector is a real valued second output vector for real-complex multiplication with a complex valued third vector.
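A scalar sketch makes the broadcast behavior concrete: the instruction type selects one of two adjacent real inputs, an arithmetic operation is applied, and the single result is written to both adjacent output elements, so the duplicated real value lines up element-wise with the (re, im) pair of a complex vector. The instruction-type names and the choice of squaring as the operation are illustrative assumptions.

```python
# Hedged sketch: select one of two adjacent real inputs by instruction
# type, apply an op, broadcast the result to both adjacent output slots.
# Type names and the op (squaring) are illustrative assumptions.
def exec_broadcast(instr_type, in_pair, op=lambda x: x * x):
    first, second = in_pair             # adjacent elements of input vector
    selected = first if instr_type == "SELECT_FIRST" else second
    result = op(selected)
    return (result, result)             # both adjacent output elements

# The duplicated real output aligns with one complex element (re, im),
# so an element-wise product performs the real-complex multiplication.
out = exec_broadcast("SELECT_FIRST", (3.0, 4.0))       # (9.0, 9.0)
complex_vec = (2.0, 5.0)                               # (re, im)
print(tuple(o * c for o, c in zip(out, complex_vec)))  # (18.0, 45.0)
```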

    Apparatus and methods for matrix multiplication

    Publication number: US10592241B2

    Publication date: 2020-03-17

    Application number: US16171291

    Filing date: 2018-10-25

    Abstract: Aspects for matrix multiplication in neural networks are described herein. The aspects may include a master computation module configured to receive a first matrix and transmit a row vector of the first matrix. In addition, the aspects may include one or more slave computation modules respectively configured to store a column vector of a second matrix, receive the row vector of the first matrix, and multiply the row vector of the first matrix with the stored column vector of the second matrix to generate a result element. Further, the aspects may include an interconnection unit configured to combine the one or more result elements generated respectively by the one or more slave computation modules to generate a row vector of a result matrix and transmit the row vector of the result matrix to the master computation module.
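The master/slave dataflow reduces to a familiar pattern: the master broadcasts one row of the first matrix; each slave holds one column of the second matrix and computes a dot product; the interconnect gathers one element per slave into a row of the result. The function names below are illustrative, not taken from the patent.

```python
# Sketch of the master/slave matmul dataflow. Names are illustrative.
def slave_compute(row, column):
    """One slave: dot product of the broadcast row with its stored column."""
    return sum(a * b for a, b in zip(row, column))

def multiply(A, B):
    columns = list(zip(*B))     # each slave stores one column of B
    result = []
    for row in A:               # master transmits one row at a time
        # interconnect combines one element per slave into a result row
        result.append([slave_compute(row, col) for col in columns])
    return result

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(multiply(A, B))  # [[19, 22], [43, 50]]
```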

    Scatter reduction instruction
    Invention grant

    Publication number: US10191749B2

    Publication date: 2019-01-29

    Application number: US15301206

    Filing date: 2015-12-24

    Applicant: INTEL CORPORATION

    Abstract: Single Instruction, Multiple Data (SIMD) technologies are described. A processing device can include a processor core and a memory. The processor core can receive, from a software application, a request to perform an operation on a first set of variables that includes a first input value and a first register value and perform the operation on a second set of variables that includes a second input value and the first register value. The processor core can vectorize the operation on the first set of variables and the second set of variables. The processor core can perform the operation on the first set of variables and the second set of variables in parallel to obtain a first operation value and a second operation value. The processor core can perform a horizontal add operation on the first operation value and the second operation value and write the result to memory.
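A scalar emulation of this pattern: both lanes combine their input with the same shared register value (the conflicting scatter target), the lane operations run "in parallel", and a horizontal add collapses the lane results into a single value before the write-back. The lane count, the choice of addition as the operation, and the dict-as-memory representation are illustrative assumptions.

```python
# Scalar emulation of the SIMD scatter-reduction pattern described in
# the abstract. Lane structure and op choice are assumptions.
def simd_scatter_reduce(inputs, register_value, op=lambda x, r: x + r):
    # Vectorized lanes: every lane combines its input with the shared
    # register value (the conflicting scatter target).
    lane_results = [op(x, register_value) for x in inputs]
    # Horizontal add resolves the lanes into a single write.
    return sum(lane_results)

memory = {}
memory["out"] = simd_scatter_reduce([2, 3], 10)
print(memory["out"])  # (2 + 10) + (3 + 10) = 25
```

Without the horizontal add, the two lanes would race to update the same register; the reduction serializes that conflict into one memory write.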