-
Publication No.: US20190028752A1
Publication Date: 2019-01-24
Application No.: US15657613
Filing Date: 2017-07-24
Inventors: Lei Zhang, Sateesh Lagudu, Allen Rush, Razvan Dan-Dobre
IPC Classification: H04N21/4143, G06F3/14, H04N7/14, H04N7/15
Abstract: Systems, apparatuses, and methods for integrating a video codec with an inference engine are disclosed. A system is configured to implement an inference engine and a video codec while sharing at least a portion of its processing elements between the inference engine and the video codec. By sharing processing elements when combining the inference engine and the video codec, the silicon area of the combination is reduced. In one embodiment, the portion of processing elements that are shared includes a motion prediction/motion estimation/MACs engine with a plurality of multiplier-accumulator (MAC) units, an internal memory, and peripherals. The peripherals include a memory interface, a direct memory access (DMA) engine, and a microprocessor. The system is configured to perform a context switch to reprogram the processing elements to switch between operating modes. The context switch can occur at a frame boundary or at a sub-frame boundary.
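To make the sharing concrete, here is a minimal Python sketch of the mode-switch idea. The `Mode` and `SharedMacEngine` names are illustrative assumptions rather than interfaces from the patent; the sketch only models that one shared MAC array is reprogrammed between codec and inference work at legal boundaries.

```python
from enum import Enum, auto

class Mode(Enum):
    VIDEO_CODEC = auto()   # MAC array runs motion estimation/prediction
    INFERENCE = auto()     # MAC array runs convolution arithmetic

class SharedMacEngine:
    """Models one pool of multiplier-accumulator (MAC) units shared
    between the video codec and the inference engine."""

    def __init__(self, num_macs: int) -> None:
        self.num_macs = num_macs
        self.mode = Mode.VIDEO_CODEC

    def context_switch(self, new_mode: Mode) -> None:
        # In hardware this would reload configuration registers and
        # microcode; here it is modeled as a simple state change.
        if new_mode is not self.mode:
            self.mode = new_mode

def process_frame(engine: SharedMacEngine) -> None:
    # Switches happen only at frame (or sub-frame) boundaries, so each
    # phase runs to completion on the shared MAC array.
    engine.context_switch(Mode.INFERENCE)
    # ... run the inference pass for this frame ...
    engine.context_switch(Mode.VIDEO_CODEC)
    # ... encode/decode this frame on the same MAC units ...
```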
-
Publication No.: US20220129752A1
Publication Date: 2022-04-28
Application No.: US17571045
Filing Date: 2022-01-07
Inventors: Sateesh Lagudu, Lei Zhang, Allen Rush
IPC Classification: G06N3/08, G06N3/063, G06N3/04, G06F1/3296
Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.
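A minimal NumPy sketch of the scheme described above, under assumed layouts: input activations as (channels, height, width), weights as (features, channels, K, K), and 3D blocks taken along the channel axis only (the patent's blocks may also tile height and width). All function and variable names are illustrative.

```python
import numpy as np

def conv2d_one_channel(plane, kernel):
    """Valid 2D convolution of a single HxW channel with one KxK kernel."""
    K = kernel.shape[0]
    H, W = plane.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(plane[y:y + K, x:x + K] * kernel)
    return out

def blocked_conv(inp, weights, block_c):
    C, H, W = inp.shape              # input resides in "external memory"
    F, _, K, _ = weights.shape       # F output features, KxK kernels
    out = np.zeros((F, H - K + 1, W - K + 1))
    for c0 in range(0, C, block_c):
        block = inp[c0:c0 + block_c]         # load one 3D block on-chip
        for f in range(F):
            # Accumulate partial outputs across the block's channels
            # *before* writing back, so external memory sees one summed
            # write per feature instead of one write per channel.
            acc = sum(conv2d_one_channel(block[c], weights[f, c0 + c])
                      for c in range(block.shape[0]))
            out[f] += acc
    return out
```

The point of the inner sum is that per-channel partial results never leave on-chip memory; only one channel-accumulated plane per feature is written back externally.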
-
Publication No.: US20190325305A1
Publication Date: 2019-10-24
Application No.: US16117302
Filing Date: 2018-08-30
Inventors: Lei Zhang, Sateesh Lagudu, Allen Rush
Abstract: Systems, apparatuses, and methods for adaptively mapping a machine learning model to a multi-core inference accelerator engine are disclosed. A computing system includes a multi-core inference accelerator engine with multiple inference cores coupled to a memory subsystem. The system also includes a control unit which determines how to adaptively map a machine learning model to the multi-core inference accelerator engine. In one implementation, the control unit selects a mapping scheme which minimizes the memory bandwidth utilization of the multi-core inference accelerator engine. In one implementation, this mapping scheme involves having one inference core of the multi-core inference accelerator engine fetch first data and broadcast the first data to other inference cores of the inference accelerator engine. Each inference core fetches second data unique to the respective inference core. The inference cores then perform computations on the first and second data in order to implement the machine learning model.
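A minimal sketch of the broadcast mapping, assuming the shared first data are weights and the per-core second data are activations; `InferenceCore` and the dict-based external memory are illustrative stand-ins, not structures from the patent.

```python
import numpy as np

class InferenceCore:
    def __init__(self, core_id: int) -> None:
        self.core_id = core_id
        self.shared = None     # first data, received via broadcast
        self.private = None    # second data, fetched by this core alone

    def compute(self):
        # e.g. one layer's result: shared weights x per-core activations
        return self.shared @ self.private

def run_layer(external_memory, cores):
    # One designated core fetches the shared weights once; broadcasting
    # replaces N separate external-memory reads of the same data.
    shared = external_memory["weights"]
    for core in cores:
        core.shared = shared
    for core in cores:
        core.private = external_memory[f"inputs_{core.core_id}"]
    return [core.compute() for core in cores]

# Tiny usage demo with made-up tensors:
mem = {"weights": np.eye(4),
       "inputs_0": np.ones(4), "inputs_1": np.arange(4.0)}
outs = run_layer(mem, [InferenceCore(0), InferenceCore(1)])
```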
-
Publication No.: US11948073B2
Publication Date: 2024-04-02
Application No.: US16117302
Filing Date: 2018-08-30
Inventors: Lei Zhang, Sateesh Lagudu, Allen Rush
Abstract: Systems, apparatuses, and methods for adaptively mapping a machine learning model to a multi-core inference accelerator engine are disclosed. A computing system includes a multi-core inference accelerator engine with multiple inference cores coupled to a memory subsystem. The system also includes a control unit which determines how to adaptively map a machine learning model to the multi-core inference accelerator engine. In one implementation, the control unit selects a mapping scheme which minimizes the memory bandwidth utilization of the multi-core inference accelerator engine. In one implementation, this mapping scheme involves having one inference core of the multi-core inference accelerator engine fetch first data and broadcast the first data to other inference cores of the inference accelerator engine. Each inference core fetches second data unique to the respective inference core. The inference cores then perform computations on the first and second data in order to implement the machine learning model.
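For intuition about why the broadcast scheme reduces bandwidth, a back-of-the-envelope comparison with assumed sizes (none of these figures come from the patent):

```python
N, S = 8, 4 << 20              # assumed: 8 cores, 4 MiB of shared weights
private = [1 << 20] * N        # assumed: 1 MiB of unique data per core

naive_reads = N * S + sum(private)      # every core refetches the weights
broadcast_reads = S + sum(private)      # one fetch + on-chip broadcast

print(naive_reads / broadcast_reads)    # ~3.3x fewer external bytes read
```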
-
Publication No.: US11227214B2
Publication Date: 2022-01-18
Application No.: US15812336
Filing Date: 2017-11-14
Inventors: Sateesh Lagudu, Lei Zhang, Allen Rush
IPC Classification: G06N3/08, G06F1/3296, G06N3/04, G06N3/063
Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.
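One practical question the abstract raises is how deep a 3D block can be while still fitting in internal memory. A small sketch of one possible sizing rule, with assumed capacity and data-type figures (the patent does not give these numbers):

```python
def max_block_channels(h, w, bytes_per_elem, internal_mem_bytes):
    """Deepest channel slice of an HxW activation plane that still
    fits in the internal buffer (one possible 3D-block sizing rule)."""
    per_channel = h * w * bytes_per_elem
    return max(1, internal_mem_bytes // per_channel)

# e.g. 112x112 fp16 activations against a 1 MiB internal buffer:
print(max_block_channels(112, 112, 2, 1 << 20))   # -> 41 channels/block
```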
-
Publication No.: US10582250B2
Publication Date: 2020-03-03
Application No.: US15657613
Filing Date: 2017-07-24
Inventors: Lei Zhang, Sateesh Lagudu, Allen Rush, Razvan Dan-Dobre
IPC Classification: H04N21/4143, G06F3/14, H04N7/14, H04N7/15, H04N19/172, H04N19/90, H04N21/418, H04N19/42, G06T9/00
Abstract: Systems, apparatuses, and methods for integrating a video codec with an inference engine are disclosed. A system is configured to implement an inference engine and a video codec while sharing at least a portion of its processing elements between the inference engine and the video codec. By sharing processing elements when combining the inference engine and the video codec, the silicon area of the combination is reduced. In one embodiment, the portion of processing elements that are shared includes a motion prediction/motion estimation/MACs engine with a plurality of multiplier-accumulator (MAC) units, an internal memory, and peripherals. The peripherals include a memory interface, a direct memory access (DMA) engine, and a microprocessor. The system is configured to perform a context switch to reprogram the processing elements to switch between operating modes. The context switch can occur at a frame boundary or at a sub-frame boundary.
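Since this abstract also allows switching at sub-frame boundaries, here is a sketch of interleaving codec and inference work within one frame. It reuses the hypothetical `Mode`/`SharedMacEngine` names from the sketch after the first entry above, and the slice granularity is an assumption, not a detail from the patent.

```python
def process_frame_interleaved(engine, frame_slices):
    """Alternates codec and inference work within a single frame,
    switching the shared MAC array at each (hypothetical) slice."""
    for sl in frame_slices:
        engine.context_switch(Mode.VIDEO_CODEC)
        # ... encode/decode slice `sl` on the shared MAC array ...
        engine.context_switch(Mode.INFERENCE)
        # ... run inference over already-processed data for `sl` ...
```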
-
Publication No.: US20190147332A1
Publication Date: 2019-05-16
Application No.: US15812336
Filing Date: 2017-11-14
Inventors: Sateesh Lagudu, Lei Zhang, Allen Rush
Abstract: Systems, apparatuses, and methods for implementing memory bandwidth reduction techniques for low power convolutional neural network inference applications are disclosed. A system includes at least a processing unit and an external memory coupled to the processing unit. The system detects a request to perform a convolution operation on input data from a plurality of channels. Responsive to detecting the request, the system partitions the input data from the plurality of channels into 3D blocks so as to minimize the external memory bandwidth utilization for the convolution operation being performed. Next, the system loads a selected 3D block from external memory into internal memory and then generates convolution output data for the selected 3D block for one or more features. Then, for each feature, the system adds convolution output data together across channels prior to writing the convolution output data to the external memory.
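To see what the accumulate-before-write step buys on the output side, a worked write-traffic comparison with assumed dimensions (illustrative numbers only):

```python
C, F = 64, 32                 # assumed: 64 input channels, 32 features
out_elems, bpe = 54 * 54, 2   # assumed output plane size, fp16 bytes

per_channel_writes = C * F * out_elems * bpe   # naive: write every partial
accumulated_writes = F * out_elems * bpe       # one summed write/feature

print(per_channel_writes // accumulated_writes)   # -> 64x less traffic
```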
-
Publication No.: US12067401B2
Publication Date: 2024-08-20
Application No.: US15855637
Filing Date: 2017-12-27
Inventors: Jiasheng Chen, Yunxiao Zou, Michael J. Mantor, Allen Rush
CPC Classification: G06F9/3867, G06F7/5443, G06F9/3001, G06F9/30036, G06F9/30101, G06F17/16
Abstract: Systems, apparatuses, and methods for implementing a low power parallel matrix multiply pipeline are disclosed. In one embodiment, a system includes at least first and second vector register files coupled to a matrix multiply pipeline. The matrix multiply pipeline comprises a plurality of dot product units. The dot product units are configured to calculate dot or outer products for first and second sets of operands retrieved from the first vector register file. The results of the dot or outer product operations are written back to the second vector register file. The second vector register file provides the results from the previous dot or outer product operations as inputs to subsequent dot or outer product operations. The dot product units receive the results from previous phases of the matrix multiply operation and accumulate these previous dot or outer product results with the current dot or outer product results.
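A minimal NumPy sketch of the phased accumulation described above; the two register files are modeled as plain arrays, and the 4-wide phase is an assumed width, not a figure from the patent.

```python
import numpy as np

def dot_product_unit(a, b, prev_acc):
    """One pipeline phase: dot(a, b) accumulated onto the prior result."""
    return float(np.dot(a, b)) + prev_acc

def matmul_pipeline(A, B, phase_width=4):
    M, K = A.shape
    _, N = B.shape
    vrf_src = (A, B)               # first vector register file: operands
    vrf_acc = np.zeros((M, N))     # second register file: running results
    for k0 in range(0, K, phase_width):
        for i in range(M):
            for j in range(N):
                # Operands come from the first register file; the prior
                # partial sum is read from, and the new one written back
                # to, the second register file.
                vrf_acc[i, j] = dot_product_unit(
                    vrf_src[0][i, k0:k0 + phase_width],
                    vrf_src[1][k0:k0 + phase_width, j],
                    vrf_acc[i, j])
    return vrf_acc
```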
-
Publication No.: US20190171448A1
Publication Date: 2019-06-06
Application No.: US15855637
Filing Date: 2017-12-27
Inventors: Jiasheng Chen, Yunxiao Zou, Michael J. Mantor, Allen Rush
Abstract: Systems, apparatuses, and methods for implementing a low power parallel matrix multiply pipeline are disclosed. In one embodiment, a system includes at least first and second vector register files coupled to a matrix multiply pipeline. The matrix multiply pipeline comprises a plurality of dot product units. The dot product units are configured to calculate dot or outer products for first and second sets of operands retrieved from the first vector register file. The results of the dot or outer product operations are written back to the second vector register file. The second vector register file provides the results from the previous dot or outer product operations as inputs to subsequent dot or outer product operations. The dot product units receive the results from previous phases of the matrix multiply operation and accumulate these previous dot or outer product results with the current dot or outer product results.
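A quick sanity check of that sketch (reusing `matmul_pipeline` from the entry above): the phased, register-file-accumulated result must match a plain matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((3, 8)), rng.random((8, 5))
# Phased accumulation should agree with NumPy's reference matmul.
assert np.allclose(matmul_pipeline(A, B), A @ B)
```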