Low latency long short-term memory inference with sequence interleaving

    Publication No.: US11769041B2

    Publication Date: 2023-09-26

    Application No.: US16177218

    Filing Date: 2018-10-31

    Abstract: Systems, apparatuses, and methods for implementing a low latency long short-term memory (LSTM) machine learning engine using sequence interleaving techniques are disclosed. A computing system includes at least a host processing unit, a machine learning engine, and a memory. The host processing unit detects a plurality of sequences which will be processed by the machine learning engine. The host processing unit interleaves the sequences into data blocks and stores the data blocks in the memory. When the machine learning engine receives a given data block, the machine learning engine performs, in parallel, a plurality of matrix multiplication operations on the plurality of sequences in the given data block and a plurality of coefficients. Then, the outputs of the matrix multiplication operations are coupled to one or more LSTM layers.
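
    A rough NumPy sketch of the interleaving idea follows. Several input sequences are interleaved timestep by timestep into data blocks so that every sequence in a block can be multiplied against the same coefficient matrix in one batched operation. The names, shapes, and layout are illustrative assumptions, not the patented data format.

        import numpy as np

        rng = np.random.default_rng(0)
        num_seqs, seq_len, feat, hidden = 4, 6, 8, 16   # hypothetical sizes

        sequences = [rng.standard_normal((seq_len, feat)) for _ in range(num_seqs)]
        coefficients = rng.standard_normal((feat, hidden))    # shared LSTM input weights

        # Host side: interleave the sequences into per-timestep data blocks.
        # Block t stacks timestep t of every sequence -> shape (num_seqs, feat).
        data_blocks = [np.stack([seq[t] for seq in sequences]) for t in range(seq_len)]

        # Engine side: one matrix multiplication per block covers all sequences at once.
        outputs = [block @ coefficients for block in data_blocks]   # each (num_seqs, hidden)
        print(outputs[0].shape)   # (4, 16); these outputs would feed the LSTM layers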

    INTEGRATED VIDEO CODEC AND INFERENCE ENGINE
    Invention Application

    Publication No.: US20190028752A1

    Publication Date: 2019-01-24

    Application No.: US15657613

    Filing Date: 2017-07-24

    Abstract: Systems, apparatuses, and methods for integrating a video codec with an inference engine are disclosed. A system is configured to implement an inference engine and a video codec while sharing at least a portion of its processing elements between the inference engine and the video codec. By sharing processing elements when combining the inference engine and the video codec, the silicon area of the combination is reduced. In one embodiment, the shared portion of processing elements includes a motion prediction/motion estimation/MAC engine with a plurality of multiplier-accumulator (MAC) units, an internal memory, and peripherals. The peripherals include a memory interface, a direct memory access (DMA) engine, and a microprocessor. The system is configured to perform a context switch to reprogram the processing elements to switch between operating modes. The context switch can occur at a frame boundary or at a sub-frame boundary.
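
    The sketch below is a software analogy of the sharing idea, not the hardware design: a single pool of multiplier-accumulator style compute is reprogrammed by a context switch between a codec mode and an inference mode. The class and method names are hypothetical.

        import numpy as np

        class SharedMacEngine:
            """Toy model of processing elements shared by a codec and an inference engine."""
            def __init__(self):
                self.mode = "codec"

            def context_switch(self, mode):
                # Reprogram the shared processing elements for the new operating mode,
                # e.g. at a frame or sub-frame boundary.
                self.mode = mode

            def run(self, a, b):
                # Both modes reduce to multiply-accumulate work on the same units:
                # absolute-difference sums for motion estimation, dot products for inference.
                if self.mode == "codec":
                    return np.abs(a - b).sum()
                return a @ b

        engine = SharedMacEngine()
        print(engine.run(np.ones((8, 8)), np.zeros((8, 8))))   # codec work: SAD = 64.0
        engine.context_switch("inference")                     # switch at a frame boundary
        print(engine.run(np.ones(8), np.ones(8)))              # inference work: dot product = 8.0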

    Sparse matrix-vector multiplication

    Publication No.: US11995149B2

    Publication Date: 2024-05-28

    Application No.: US17125457

    Filing Date: 2020-12-17

    IPC Classification: G06F17/16 G06F9/30 G06F15/80

    Abstract: A processing system includes a first set and a second set of general-purpose registers (GPRs) and memory access circuitry that fetches nonzero values of a sparse matrix into consecutive slots in the first set. The memory access circuitry also fetches values of an expanded matrix into consecutive slots in the second set of GPRs. The expanded matrix is formed based on values of a vector and locations of the nonzero values in the sparse matrix. The processing system also includes a set of multipliers that concurrently perform multiplication of the nonzero values in slots of the first set of GPRs with the values of the vector in corresponding slots of the second set. Reduced sum circuitry accumulates results from the set of multipliers for rows of the sparse matrix.
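
    The NumPy sketch below follows the described dataflow, using a coordinate list of nonzeros as an assumed stand-in for the register layout: nonzero values are packed into consecutive slots of one set, the matching vector entries are gathered into a second set (the expanded matrix), the two sets are multiplied slot by slot, and per-row reduced sums accumulate the result. Variable names are illustrative only.

        import numpy as np

        sparse = np.array([[0., 2., 0., 0.],
                           [1., 0., 0., 3.],
                           [0., 0., 4., 0.]])
        vector = np.array([10., 20., 30., 40.])

        rows, cols = np.nonzero(sparse)

        gpr_set_a = sparse[rows, cols]    # nonzero values packed into consecutive slots
        gpr_set_b = vector[cols]          # expanded-matrix values gathered per nonzero

        products = gpr_set_a * gpr_set_b  # slot-wise multiplications (concurrent in hardware)

        # Reduced-sum stage: accumulate the products belonging to each row.
        result = np.zeros(sparse.shape[0])
        np.add.at(result, rows, products)

        print(result)                     # [ 40. 130. 120.] == sparse @ vector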

    Dynamically adaptable arrays for vector and matrix operations

    Publication No.: US11409840B2

    Publication Date: 2022-08-09

    Application No.: US17032314

    Filing Date: 2020-09-25

    IPC Classification: G06F17/16 G06F13/28

    Abstract: An array processor includes processor element arrays distributed in rows and columns. The processor element arrays perform operations on parameter values. The array processor also includes memory interfaces that are dynamically mapped to mutually exclusive subsets of the rows and columns of the processor element arrays based on dimensions of matrices that provide the parameter values to the processor element arrays. In some cases, the processor element arrays are vector arithmetic logic unit (ALU) processors and the memory interfaces are direct memory access (DMA) engines. The rows of the processor element arrays in the subsets are mutually exclusive to the rows in the other subsets and the columns of the processor element arrays in the subsets are mutually exclusive to the columns in the other subsets. The matrices can be symmetric or asymmetric, e.g., one of the matrices can be a vector having a single column.
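
    A small mapping sketch, with assumed names and sizes, of how memory interfaces might be assigned mutually exclusive row and column subsets of the processor element grid based on operand dimensions; it illustrates the partitioning idea rather than the actual hardware scheduler.

        import numpy as np

        def map_interfaces(num_rows, num_cols, num_interfaces):
            """Assign disjoint row and column ranges of the PE grid to each memory interface."""
            row_splits = np.array_split(np.arange(num_rows), num_interfaces)
            col_splits = np.array_split(np.arange(num_cols), num_interfaces)
            return [{"rows": r.tolist(), "cols": c.tolist()}
                    for r, c in zip(row_splits, col_splits)]

        # An asymmetric operand (e.g. a matrix times a single-column vector) simply yields
        # narrow column subsets: here a 16x4 grid is shared by 4 DMA-like interfaces.
        for i, mapping in enumerate(map_interfaces(16, 4, 4)):
            print(f"interface {i}: rows {mapping['rows']}, cols {mapping['cols']}")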

    Broadcast command and response
    Invention Grant

    Publication No.: US11275632B2

    Publication Date: 2022-03-15

    Application No.: US16171451

    Filing Date: 2018-10-26

    IPC Classification: G06F9/30 G06F9/54 G06N3/08

    Abstract: Systems, apparatuses, and methods for implementing a broadcast read response protocol are disclosed. A computing system includes a plurality of processing engines coupled to a memory subsystem. A first processing engine executes a read and broadcast response command, wherein the read and broadcast response command targets first data at a first address in the memory subsystem. One or more other processing engines execute a wait command to wait to receive the first data requested by the first processing engine. After receiving the first data from the memory subsystem, the plurality of processing engines process the first data as part of completing a first operation. In one implementation, the first operation is implementing a given layer of a machine learning model. In one implementation, the given layer is a convolutional layer of a neural network.
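
    The thread-based sketch below is a software analogy of the protocol, with threads standing in for processing engines and assumed names throughout: one engine issues the read-and-broadcast command, the others execute a wait, and all of them process the same first data once the broadcast response arrives.

        import threading

        memory = {0x1000: [1.0, 2.0, 3.0, 4.0]}           # toy memory subsystem
        broadcast = {"data": None, "ready": threading.Event()}
        results = {}

        def engine(engine_id, issues_read):
            if issues_read:
                # "Read and broadcast response" command targeting address 0x1000.
                broadcast["data"] = memory[0x1000]
                broadcast["ready"].set()
            else:
                # "Wait" command: block until the broadcast response arrives.
                broadcast["ready"].wait()
            # Every engine processes the same first data, e.g. its share of a
            # convolutional layer; here just a per-engine scaled sum.
            results[engine_id] = sum(broadcast["data"]) * (engine_id + 1)

        threads = [threading.Thread(target=engine, args=(i, i == 0)) for i in range(4)]
        for t in threads: t.start()
        for t in threads: t.join()
        print(dict(sorted(results.items())))              # {0: 10.0, 1: 20.0, 2: 30.0, 3: 40.0}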

    MEMORY BANDWIDTH REDUCTION TECHNIQUES FOR LOW POWER CONVOLUTIONAL NEURAL NETWORK INFERENCE APPLICATIONS

    Publication No.: US20220129752A1

    Publication Date: 2022-04-28

    Application No.: US17571045

    Filing Date: 2022-01-07

    Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.
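
    A NumPy sketch of the blocking idea, with a 1x1 convolution standing in for the general convolution and all shapes assumed: the channel dimension is split into 3D blocks, each block is loaded and convolved for every output feature, and partial results are summed across channels in internal memory so that each feature map is written to external memory only once.

        import numpy as np

        rng = np.random.default_rng(1)
        channels, height, width, features = 8, 4, 4, 3
        block_channels = 2                                            # channels per 3D block

        input_data = rng.standard_normal((channels, height, width))
        weights = rng.standard_normal((features, channels))           # 1x1 convolution weights

        output = np.zeros((features, height, width))                  # internal-memory accumulator
        for c0 in range(0, channels, block_channels):
            block = input_data[c0:c0 + block_channels]                # load one 3D block
            for f in range(features):
                # Convolve the block for feature f and add across its channels
                # before anything is written back to external memory.
                output[f] += np.einsum('c,chw->hw', weights[f, c0:c0 + block_channels], block)

        # Only now is each accumulated feature map written out (one write per feature).
        print(np.allclose(output, np.einsum('fc,chw->fhw', weights, input_data)))  # True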

    MACHINE LEARNING INFERENCE ENGINE SCALABILITY

    Publication No.: US20190325305A1

    Publication Date: 2019-10-24

    Application No.: US16117302

    Filing Date: 2018-08-30

    IPC Classification: G06N3/08 G06N3/04

    Abstract: Systems, apparatuses, and methods for adaptively mapping a machine learning model to a multi-core inference accelerator engine are disclosed. A computing system includes a multi-core inference accelerator engine with multiple inference cores coupled to a memory subsystem. The system also includes a control unit which determines how to adaptively map a machine learning model to the multi-core inference accelerator engine. In one implementation, the control unit selects a mapping scheme which minimizes the memory bandwidth utilization of the multi-core inference accelerator engine. In one implementation, this mapping scheme involves having one inference core of the multi-core inference accelerator engine fetch first data and broadcast the first data to other inference cores of the inference accelerator engine. Each inference core fetches second data unique to the respective inference core. The inference cores then perform computations on the first and second data in order to implement the machine learning model.
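
    A minimal sketch of the bandwidth-saving mapping under assumed shapes and names: one core fetches the shared first data (here, input activations) once and broadcasts it, each core fetches only its own second data (its slice of the weights), and every core computes its portion of the layer.

        import numpy as np

        rng = np.random.default_rng(2)
        num_cores, in_feat, out_feat = 4, 16, 32

        activations = rng.standard_normal(in_feat)                 # first data, shared by all cores
        weights = rng.standard_normal((out_feat, in_feat))         # second data, sliced per core
        weight_slices = np.array_split(weights, num_cores)         # unique slice per inference core

        fetched = activations.nbytes                               # fetched once, then broadcast
        partials = []
        for core in range(num_cores):
            fetched += weight_slices[core].nbytes                  # each core's private fetch
            partials.append(weight_slices[core] @ activations)     # per-core computation

        output = np.concatenate(partials)
        print(np.allclose(output, weights @ activations), fetched)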

    Machine learning inference engine scalability

    Publication No.: US11948073B2

    Publication Date: 2024-04-02

    Application No.: US16117302

    Filing Date: 2018-08-30

    IPC Classification: G06N3/08 G06N3/04

    CPC Classification: G06N3/08 G06N3/04

    Abstract: Systems, apparatuses, and methods for adaptively mapping a machine learning model to a multi-core inference accelerator engine are disclosed. A computing system includes a multi-core inference accelerator engine with multiple inference cores coupled to a memory subsystem. The system also includes a control unit which determines how to adaptively map a machine learning model to the multi-core inference accelerator engine. In one implementation, the control unit selects a mapping scheme which minimizes the memory bandwidth utilization of the multi-core inference accelerator engine. In one implementation, this mapping scheme involves having one inference core of the multi-core inference accelerator engine fetch first data and broadcast the first data to other inference cores of the inference accelerator engine. Each inference core fetches second data unique to the respective inference core. The inference cores then perform computations on the first and second data in order to implement the machine learning model.

    Vertical and horizontal broadcast of shared operands

    Publication No.: US11635967B2

    Publication Date: 2023-04-25

    Application No.: US17032307

    Filing Date: 2020-09-25

    Abstract: An array processor includes processor element arrays distributed in rows and columns. The processor element arrays perform operations on parameter values. The array processor also includes memory interfaces that broadcast sets of the parameter values to mutually exclusive subsets of the rows and columns of the processor element arrays. In some cases, the array processor includes single-instruction-multiple-data (SIMD) units including subsets of the processor element arrays in corresponding rows, workgroup processors (WGPs) including subsets of the SIMD units, and a memory fabric configured to interconnect with an external memory that stores the parameter values. The memory interfaces broadcast the parameter values to the SIMD units that include the processor element arrays in rows associated with the memory interfaces and columns of processor element arrays that are implemented across the SIMD units in the WGPs. The memory interfaces access the parameter values from the external memory via the memory fabric.
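
    An illustrative NumPy sketch (names and shapes assumed) of the two broadcast paths: one set of memory interfaces broadcasts a row operand horizontally to every processor element array in its row, another broadcasts a column operand vertically down each column, and each element combines the two, which amounts to an outer-product-style update without per-element fetches from external memory.

        import numpy as np

        rows, cols = 4, 3
        row_operands = np.arange(1, rows + 1, dtype=float)     # broadcast horizontally, one per row
        col_operands = np.arange(1, cols + 1, dtype=float)     # broadcast vertically, one per column

        # Each (r, c) processor element array receives row_operands[r] and col_operands[c]
        # from its memory interfaces instead of fetching them individually.
        pe_results = row_operands[:, None] * col_operands[None, :]

        print(np.array_equal(pe_results, np.outer(row_operands, col_operands)))   # True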