-
11.
Publication No.: US11227214B2
Publication date: 2022-01-18
Application No.: US15812336
Filing date: 2017-11-14
Applicant: Advanced Micro Devices, Inc.; ATI Technologies ULC
Inventor: Sateesh Lagudu, Lei Zhang, Allen Rush
IPC: G06N3/08, G06F1/3296, G06N3/04, G06N3/063
Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.
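As a rough illustration of the tiling idea in this abstract, here is a minimal NumPy sketch that walks the input one 3D (channel × height × width) block at a time and sums the per-channel partial results in an internal accumulator before writing a single result per feature back to a stand-in for external memory. The block sizes, array shapes, and function name are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def tiled_conv2d(inputs, weights, block_c, block_h, block_w):
    """inputs: (C, H, W) input channels; weights: (F, C, K, K) features.
    Walks the input one 3D block at a time, summing the per-channel
    partial products in an internal accumulator so that only one result
    per feature and tile is written back to (stand-in) external memory."""
    C, H, W = inputs.shape
    F, _, K, _ = weights.shape
    out_h, out_w = H - K + 1, W - K + 1          # valid (no-padding) output
    external_out = np.zeros((F, out_h, out_w))   # stand-in for external memory

    for c0 in range(0, C, block_c):              # channel slab of the 3D block
        for y0 in range(0, out_h, block_h):      # spatial tile rows
            for x0 in range(0, out_w, block_w):  # spatial tile columns
                c1 = min(c0 + block_c, C)
                y1 = min(y0 + block_h, out_h)
                x1 = min(x0 + block_w, out_w)
                # Load one 3D block (plus its K-1 halo) into "internal" memory.
                blk = inputs[c0:c1, y0:y1 + K - 1, x0:x1 + K - 1]
                for f in range(F):
                    acc = np.zeros((y1 - y0, x1 - x0))  # internal accumulator
                    for c in range(c0, c1):             # add across channels...
                        for ky in range(K):
                            for kx in range(K):
                                acc += weights[f, c, ky, kx] * blk[
                                    c - c0, ky:ky + y1 - y0, kx:kx + x1 - x0]
                    # ...before the single write-back per feature and tile.
                    external_out[f, y0:y1, x0:x1] += acc
    return external_out
```

The result matches a plain valid convolution; the point of the tiling is that each input block is read from external memory once, and each channel's partial sum never leaves internal memory.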
-
12.
Publication No.: US10582250B2
Publication date: 2020-03-03
Application No.: US15657613
Filing date: 2017-07-24
Applicant: Advanced Micro Devices, Inc.; ATI Technologies ULC
Inventor: Lei Zhang, Sateesh Lagudu, Allen Rush, Razvan Dan-Dobre
IPC: H04N21/4143, G06F3/14, H04N7/14, H04N7/15, H04N19/172, H04N19/90, H04N21/418, H04N19/42, G06T9/00
Abstract: Systems, apparatuses, and methods for integrating a video codec with an inference engine are disclosed. A system is configured to implement an inference engine and a video codec while sharing at least a portion of its processing elements between the inference engine and the video codec. By sharing processing elements when combining the inference engine and the video codec, the silicon area of the combination is reduced. In one embodiment, the portion of processing elements which are shared include a motion prediction/motion estimation/MACs engine with a plurality of multiplier-accumulator (MAC) units, an internal memory, and peripherals. The peripherals include a memory interface, a direct memory access (DMA) engine, and a microprocessor. The system is configured to perform a context switch to reprogram the processing elements to switch between operating modes. The context switch can occur at a frame boundary or at a sub-frame boundary.
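A minimal scheduling sketch of the shared-engine idea described above, assuming a single engine object that is reprogrammed between a codec mode and an inference mode at frame boundaries; the class, mode names, and context_switch behavior are assumptions for illustration, not the patent's interface.

```python
from dataclasses import dataclass

@dataclass
class SharedMacEngine:
    mode: str = "idle"   # current operating mode: "codec" or "inference"

    def context_switch(self, new_mode: str) -> None:
        if new_mode != self.mode:
            # Reprogram the shared MAC array, internal memory map, and DMA
            # descriptors for the new workload (modeled here as a mode flag).
            self.mode = new_mode

    def run(self, task: str) -> None:
        print(f"[{self.mode}] {task}")

def schedule(engine: SharedMacEngine, frames) -> None:
    """Alternate codec and inference work on one shared engine,
    context-switching only at frame boundaries."""
    for frame in frames:
        engine.context_switch("codec")
        engine.run(f"decode {frame}")
        engine.context_switch("inference")
        engine.run(f"run inference on {frame}")

schedule(SharedMacEngine(), ["frame0", "frame1"])
```

The same pattern extends to sub-frame boundaries by switching inside the per-frame loop instead of once per frame.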
-
13.
Publication No.: US20190147332A1
Publication date: 2019-05-16
Application No.: US15812336
Filing date: 2017-11-14
Applicant: Advanced Micro Devices, Inc.; ATI Technologies ULC
Inventor: Sateesh Lagudu, Lei Zhang, Allen Rush
Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.
-
14.
Publication No.: US11200060B1
Publication date: 2021-12-14
Application No.: US17132002
Filing date: 2020-12-23
Applicant: Advanced Micro Devices, Inc.
Inventor: Sateesh Lagudu, Arun Vaidyanathan Ananthanarayan, Michael Mantor, Allen H. Rush
Abstract: An array processor includes processor element arrays (PEAs) distributed in rows and columns. The PEAs are configured to perform operations on parameter values. A first sequencer receives a first direct memory access (DMA) instruction that includes a request to read data from at least one address in memory. A texture address (TA) engine requests the data from the memory based on the at least one address and a texture data (TD) engine provides the data to the PEAs. The PEAs provide first synchronization signals to the TD engine to indicate availability of registers for receiving the data. The TD engine provides second synchronization signals to the first sequencer in response to receiving acknowledgments that the PEAs have consumed the data.
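A toy software model of the handshake described in this abstract, with simple counters standing in for the hardware synchronization signals; the class names, register counts, and memory stand-in are assumptions for illustration.

```python
from collections import deque

class PEA:
    """Processor element array with a small register file for incoming data."""
    def __init__(self, free_regs=2):
        self.free_regs = free_regs
        self.pending = deque()

    def can_accept(self):            # models the first synchronization signal
        return self.free_regs > 0

    def receive(self, item):         # TD engine writes into a free register
        self.free_regs -= 1
        self.pending.append(item)

    def consume(self):               # PEA uses the data, freeing the register
        self.free_regs += 1
        return self.pending.popleft()

def dma_read(addresses, peas):
    """Sequencer-issued DMA read: the TA stage fetches each address, the TD
    stage delivers to PEAs only when they report free registers, and the call
    returns once every delivery is consumed (the second synchronization)."""
    memory = {a: f"data@{hex(a)}" for a in addresses}   # stand-in for DRAM
    acks = 0
    for addr in addresses:
        data = memory[addr]                  # TA engine: address -> data
        for pea in peas:
            if not pea.can_accept():         # back-pressure from the PEA
                pea.consume()                # PEA drains older data first
                acks += 1
            pea.receive(data)                # TD engine: deliver the data
    for pea in peas:
        while pea.pending:                   # PEAs consume remaining data
            pea.consume()
            acks += 1
    return acks                              # all acks seen: instruction done

print(dma_read([0x100, 0x104, 0x108], [PEA(), PEA()]))  # 6 acknowledgments
```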
-
15.
Publication No.: US20200089550A1
Publication date: 2020-03-19
Application No.: US16171451
Filing date: 2018-10-26
Applicant: Advanced Micro Devices, Inc.; ATI Technologies ULC
Abstract: Systems, apparatuses, and methods for implementing a broadcast read response protocol are disclosed. A computing system includes a plurality of processing engines coupled to a memory subsystem. A first processing engine executes a read and broadcast response command, wherein the read and broadcast response command targets first data at a first address in the memory subsystem. One or more other processing engines execute a wait command to wait to receive the first data requested by the first processing engine. After receiving the first data from the memory subsystem, the plurality of processing engines process the first data as part of completing a first operation. In one implementation, the first operation is implementing a given layer of a machine learning model. In one implementation, the given layer is a convolutional layer of a neural network.
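A minimal threaded sketch of the read-and-broadcast idea: one engine performs the single memory read and broadcasts the response, while the other engines block on a wait primitive until it arrives. The slot abstraction, names, and memory model are assumptions for illustration, not the patent's command set.

```python
import threading

class BroadcastSlot:
    """One shared response slot: a single read fans out to all waiters."""
    def __init__(self):
        self._ready = threading.Event()
        self._data = None

    def read_and_broadcast(self, memory, address):
        self._data = memory[address]   # one memory-subsystem read...
        self._ready.set()              # ...broadcast to every waiting engine
        return self._data

    def wait(self):
        self._ready.wait()             # wait command: block until broadcast
        return self._data

memory = {0x1000: [1.0, 2.0, 3.0]}     # stand-in for the memory subsystem
slot = BroadcastSlot()
results = []

def other_engine(eid):
    data = slot.wait()                 # engines 1..3 wait for the broadcast
    results.append((eid, data))

workers = [threading.Thread(target=other_engine, args=(i,)) for i in (1, 2, 3)]
for w in workers:
    w.start()
slot.read_and_broadcast(memory, 0x1000)  # engine 0 issues the single read
for w in workers:
    w.join()
print(results)  # every engine got the same data from one memory read
```

The benefit mirrors the abstract: N consumers of the same layer input cost one memory read instead of N.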
-
16.
Publication No.: US10284861B2
Publication date: 2019-05-07
Application No.: US15414466
Filing date: 2017-01-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Mahalakshmi Thikkireddy, Sateesh Lagudu
IPC: H04N19/176, H04N19/136, H04N19/182, H04N19/423, H04N19/625, H04N19/80
Abstract: A first memory stores values of blocks of pixels representative of a digital image, a second memory stores partial values of destination pixels in a thumbnail image, and a third memory stores compressed images and thumbnail images. A processor retrieves values of a block of pixels from the first memory. The processor also concurrently compresses the values to generate a compressed image and modifies a partial value of a destination pixel based on values of pixels in portions of the block that overlap a scaling window for the destination pixel. The processor stores the modified partial value in the second memory and stores the compressed image and the thumbnail image in the third memory.
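A single-pass software sketch of the concurrent compress-and-downscale idea, using a trivial stand-in for block compression and a box-filter scaling window; both are assumptions for illustration, not the actual codec or scaler in the patent.

```python
import numpy as np

def compress_block(tile):
    # Trivial stand-in for real block compression (e.g., DCT + quantization).
    return tile.astype(np.int16) // 8

def compress_with_thumbnail(image, block=8, scale=4):
    """One pass over the source: each block is compressed while its pixels
    are accumulated into the partial values of the thumbnail pixels whose
    scaling windows they overlap; partials are finalized at the end."""
    H, W = image.shape
    assert H % scale == 0 and W % scale == 0   # for brevity, assume even tiling
    partial = np.zeros((H // scale, W // scale))   # the "second memory"
    compressed = []                                # goes to the "third memory"
    for by in range(0, H, block):
        for bx in range(0, W, block):
            tile = image[by:by + block, bx:bx + block]  # "first memory" read
            compressed.append(compress_block(tile))     # compression path
            for y in range(tile.shape[0]):              # thumbnail path
                for x in range(tile.shape[1]):
                    partial[(by + y) // scale, (bx + x) // scale] += tile[y, x]
    thumbnail = partial / (scale * scale)   # finalize each destination pixel
    return compressed, thumbnail

img = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
blocks, thumb = compress_with_thumbnail(img)
print(len(blocks), thumb.shape)   # 64 compressed blocks, (16, 16) thumbnail
```

Reading each source block exactly once and carrying only the thumbnail partials between blocks is what lets both outputs be produced in a single pass over the image.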