Memory bandwidth reduction techniques for low power convolutional neural network inference applications

    Publication No.: US11227214B2

    Publication Date: 2022-01-18

    Application No.: US15812336

    Filing Date: 2017-11-14

    Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.
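
    The channel-blocked accumulation described above can be illustrated in a few lines. Below is a minimal NumPy sketch, assuming a direct (unoptimized) convolution; the function names, the block size, and the arrays standing in for internal and external memory are illustrative, not taken from the patent.

```python
# Minimal sketch of channel-blocked convolution, assuming NumPy arrays as
# stand-ins for internal/external memory. Names and block size are illustrative.
import numpy as np

def conv_block(plane, kernel):
    """Direct 2D convolution of one channel plane with one filter plane."""
    kh, kw = kernel.shape
    h, w = plane.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(plane[y:y+kh, x:x+kw] * kernel)
    return out

def tiled_conv(inputs, weights, channels_per_block=4):
    """inputs: (C, H, W); weights: (F, C, KH, KW).
    Partial sums are accumulated across channels in an "internal" buffer so
    each output element is written to "external" memory once per 3D block."""
    C, H, W = inputs.shape
    F, _, KH, KW = weights.shape
    out = np.zeros((F, H - KH + 1, W - KW + 1))   # external-memory stand-in
    for c0 in range(0, C, channels_per_block):    # select one 3D block
        block = inputs[c0:c0 + channels_per_block]  # "load" into internal memory
        for f in range(F):                          # one or more features
            acc = np.zeros_like(out[f])             # internal accumulator
            for c, plane in enumerate(block):
                acc += conv_block(plane, weights[f, c0 + c])
            out[f] += acc                           # single external write per block
    return out

x = np.random.rand(8, 16, 16)    # 8 channels of 16x16 input
w = np.random.rand(2, 8, 3, 3)   # 2 features, 3x3 kernels
y = tiled_conv(x, w)             # equals an untiled convolution of x with w
```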

    Integrated video codec and inference engine

    Publication No.: US10582250B2

    Publication Date: 2020-03-03

    Application No.: US15657613

    Filing Date: 2017-07-24

    Abstract: Systems, apparatuses, and methods for integrating a video codec with an inference engine are disclosed. A system is configured to implement an inference engine and a video codec while sharing at least a portion of its processing elements between the two. Sharing processing elements between the inference engine and the video codec reduces the silicon area of the combination. In one embodiment, the shared processing elements include a motion prediction/motion estimation/MACs engine with a plurality of multiplier-accumulator (MAC) units, an internal memory, and peripherals. The peripherals include a memory interface, a direct memory access (DMA) engine, and a microprocessor. The system is configured to perform a context switch to reprogram the processing elements when switching between operating modes. The context switch can occur at a frame boundary or at a sub-frame boundary.
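
    A rough software model of the sharing idea follows, assuming the same MAC array can compute either a sum of absolute differences (motion estimation) or a dot product (convolution) depending on the active mode. The class and method names are hypothetical, not from the patent.

```python
# Toy model of one MAC array serving two operating modes via a context
# switch. All names here are illustrative stand-ins for hardware blocks.
import numpy as np

class SharedMacEngine:
    """One multiply-accumulate array reprogrammed per operating mode."""
    def __init__(self, num_macs=16):
        self.num_macs = num_macs
        self.mode = None

    def context_switch(self, mode):
        # In hardware this reloads configuration at a frame or sub-frame
        # boundary; here we just record the active mode.
        assert mode in ("video_codec", "inference")
        self.mode = mode

    def run(self, a, b):
        # The same MACs compute either SAD for motion estimation (codec
        # mode) or multiply-accumulates for convolution (inference mode).
        if self.mode == "video_codec":
            return np.abs(a - b).sum()   # sum of absolute differences
        if self.mode == "inference":
            return (a * b).sum()         # multiply-accumulate
        raise RuntimeError("context_switch() must select a mode first")

engine = SharedMacEngine()
engine.context_switch("video_codec")
sad = engine.run(np.ones(16), np.zeros(16))     # motion-estimation-style use
engine.context_switch("inference")              # switch at a frame boundary
dot = engine.run(np.ones(16), np.full(16, 2.0))
```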

    MEMORY BANDWIDTH REDUCTION TECHNIQUES FOR LOW POWER CONVOLUTIONAL NEURAL NETWORK INFERENCE APPLICATIONS

    Publication No.: US20190147332A1

    Publication Date: 2019-05-16

    Application No.: US15812336

    Filing Date: 2017-11-14

    Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.

    Broadcast synchronization for dynamically adaptable arrays

    Publication No.: US11200060B1

    Publication Date: 2021-12-14

    Application No.: US17132002

    Filing Date: 2020-12-23

    Abstract: An array processor includes processor element arrays (PEAs) distributed in rows and columns. The PEAs are configured to perform operations on parameter values. A first sequencer receives a first direct memory access (DMA) instruction that includes a request to read data from at least one address in memory. A texture address (TA) engine requests the data from the memory based on the at least one address, and a texture data (TD) engine provides the data to the PEAs. The PEAs provide first synchronization signals to the TD engine to indicate the availability of registers for receiving the data. The TD engine provides second synchronization signals to the first sequencer in response to receiving acknowledgments that the PEAs have consumed the data.
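
    The two-level handshake can be modeled sequentially: PEAs signal free registers to the TD engine, the TD engine broadcasts the fetched data, and consumption acknowledgments flow back toward the sequencer. The sketch below is a toy model under those assumptions; every class and queue name is an illustrative stand-in for the hardware blocks.

```python
# Toy sequential model of the broadcast handshake. A real design would
# stall and retry; this sketch only shows the order of the signals.
class PEA:
    def __init__(self):
        self.register = None

    def register_free(self):          # first synchronization signal
        return self.register is None

    def consume(self, data):          # returns an acknowledgment value
        self.register = data
        result = self.register * 2    # placeholder per-element compute
        self.register = None          # register freed after consumption
        return result

def broadcast(sequencer_queue, peas, memory, address):
    data = memory[address]                        # TA-engine-style fetch
    if all(pea.register_free() for pea in peas):  # wait for sync signals
        acks = [pea.consume(data) for pea in peas]
        sequencer_queue.append(("done", address)) # second sync to sequencer
        return acks

memory = {0x100: 7}
queue = []
results = broadcast(queue, [PEA() for _ in range(4)], memory, 0x100)
```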

    BROADCAST COMMAND AND RESPONSE

    Publication No.: US20200089550A1

    Publication Date: 2020-03-19

    Application No.: US16171451

    Filing Date: 2018-10-26

    Abstract: Systems, apparatuses, and methods for implementing a broadcast read response protocol are disclosed. A computing system includes a plurality of processing engines coupled to a memory subsystem. A first processing engine executes a read and broadcast response command, wherein the read and broadcast response command targets first data at a first address in the memory subsystem. One or more other processing engines execute a wait command to wait to receive the first data requested by the first processing engine. After receiving the first data from the memory subsystem, the plurality of processing engines process the first data as part of completing a first operation. In one implementation, the first operation is implementing a given layer of a machine learning model. In one implementation, the given layer is a convolutional layer of a neural network.
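
    A hedged sketch of the read-and-broadcast pattern follows, using Python threads as stand-ins for processing engines: one engine issues the read, the others execute a wait, and all of them consume the same response. The memory subsystem, method names, and the per-engine computation are illustrative only.

```python
# Sketch of a broadcast read response: one reader, many waiters, one copy
# of the data delivered to all engines. Names are illustrative.
import threading

class BroadcastBus:
    def __init__(self):
        self.data = None
        self.ready = threading.Event()

    def read_and_broadcast(self, memory, address):
        self.data = memory[address]   # one engine issues the read...
        self.ready.set()              # ...and the response is broadcast

    def wait_for_data(self):
        self.ready.wait()             # other engines execute a wait command
        return self.data

memory = {0x40: [1.0, 2.0, 3.0]}      # e.g. weights of one network layer
bus = BroadcastBus()
outputs = []

def engine(bus, is_reader):
    if is_reader:
        bus.read_and_broadcast(memory, 0x40)
    weights = bus.wait_for_data()
    outputs.append(sum(weights))      # every engine processes the same data

threads = [threading.Thread(target=engine, args=(bus, i == 0)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```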

    Concurrent image compression and thumbnail generation

    Publication No.: US10284861B2

    Publication Date: 2019-05-07

    Application No.: US15414466

    Filing Date: 2017-01-24

    Abstract: A first memory stores values of blocks of pixels representative of a digital image, a second memory stores partial values of destination pixels in a thumbnail image, and a third memory stores compressed images and thumbnail images. A processor retrieves the values of a block of pixels from the first memory. The processor then concurrently compresses the values to generate a compressed image and modifies a partial value of a destination pixel based on the values of pixels in portions of the block that overlap the scaling window for that destination pixel. The processor stores the modified partial value in the second memory and stores the compressed image and the thumbnail image in the third memory.
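
    The key point is that each pixel block is read once and feeds both outputs. Below is a single-pass sketch under simplifying assumptions: zlib stands in for the compressor and a box filter for the scaling windows; the arrays playing the roles of the three memories are illustrative.

```python
# Single-pass sketch: each 8x8 block is read once, compressed, and folded
# into the thumbnail's partial sums. Compressor and filter are stand-ins.
import numpy as np
import zlib

def compress_and_thumbnail(image, block=8, scale=4):
    H, W = image.shape
    partial = np.zeros((H // scale, W // scale))  # second memory: partial sums
    compressed = []                               # third memory: compressed blocks
    for by in range(0, H, block):
        for bx in range(0, W, block):
            tile = image[by:by+block, bx:bx+block]   # one read from first memory
            compressed.append(zlib.compress(tile.astype(np.uint8).tobytes()))
            for y in range(tile.shape[0]):           # add each pixel into the
                for x in range(tile.shape[1]):       # scaling window it overlaps
                    partial[(by + y) // scale, (bx + x) // scale] += tile[y, x]
    thumbnail = partial / (scale * scale)         # finalize destination pixels
    return b"".join(compressed), thumbnail

img = np.arange(64 * 64, dtype=float).reshape(64, 64) % 256
data, thumb = compress_and_thumbnail(img)         # one pass, two outputs
```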
