CNN SEAMLESS TILE PROCESSING FOR LOW-POWER INFERENCE ACCELERATOR

    Publication Number: US20240112297A1

    Publication Date: 2024-04-04

    Application Number: US17957689

    Filing Date: 2022-09-30

    CPC classification number: G06T1/60

    Abstract: Methods and devices are provided for processing image data on a sub-frame portion basis using layers of a convolutional neural network. The processing device comprises memory and a processor. The processor is configured to determine, for an input tile of an image, a receptive field via backward propagation and to determine a size of the input tile based on the receptive field and an amount of local memory allocated to store data for the input tile. The processor determines whether the amount of local memory allocated is sufficient to store the data of the input tile and the padded data for the receptive field.
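
    A minimal Python sketch of the tile-sizing step, assuming plain convolution layers described by (kernel, stride) pairs; the layer list, memory budget, and function names are illustrative assumptions, not values or interfaces from the patent. It back-propagates an output-tile size through the layers to obtain the receptive field, then grows the output tile until the padded input tile no longer fits in the local memory budget.

        def receptive_field(output_tile, layers):
            """Back-propagate an output tile size through conv layers to get
            the input tile size (receptive field) it depends on."""
            size = output_tile
            for kernel, stride in reversed(layers):
                size = (size - 1) * stride + kernel
            return size

        def pick_tile_size(layers, local_mem_bytes, channels, bytes_per_elem=1):
            """Largest square output tile whose padded input tile fits in local memory."""
            best = 0
            for out_tile in range(1, 257):
                in_tile = receptive_field(out_tile, layers)
                needed = in_tile * in_tile * channels * bytes_per_elem
                if needed <= local_mem_bytes:
                    best = out_tile
                else:
                    break
            return best

        if __name__ == "__main__":
            # Three 3x3 convolutions, the middle one with stride 2 (illustrative).
            layers = [(3, 1), (3, 2), (3, 1)]
            print("input tile for 8x8 output:", receptive_field(8, layers))
            print("max output tile in 64 KiB:", pick_tile_size(layers, 64 * 1024, channels=16))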

    PUSHED PREFETCHING IN A MEMORY HIERARCHY

    Publication Number: US20240111678A1

    Publication Date: 2024-04-04

    Application Number: US17958120

    Filing Date: 2022-09-30

    CPC classification number: G06F12/0862 G06F12/0811

    Abstract: Systems and methods for pushed prefetching include: multiple core complexes, each core complex having multiple cores and multiple caches, the multiple caches configured in a memory hierarchy with multiple levels; an interconnect device coupling the core complexes to each other and coupling the core complexes to shared memory, the shared memory at a lower level of the memory hierarchy than the multiple caches; and a push-based prefetcher having logic to: monitor memory traffic between caches of a first level of the memory hierarchy and the shared memory; and based on the monitoring, initiate a prefetch of data to a cache of the first level of the memory hierarchy.
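
    A toy Python sketch in the spirit of the abstract, assuming a simple next-line stride heuristic; the class, the cache representation, and the constants are illustrative assumptions rather than the patent's design. The prefetcher monitors requests flowing from the first-level caches toward shared memory and pushes the following cache line into the requesting core's cache before it is demanded.

        CACHE_LINE = 64

        class PushPrefetcher:
            def __init__(self, caches):
                self.caches = caches      # core_id -> set of resident line addresses
                self.last_addr = {}       # core_id -> last observed miss address

            def observe(self, core_id, addr):
                """Called on each demand request that goes from a first-level
                cache down to shared memory."""
                line = addr // CACHE_LINE * CACHE_LINE
                prev = self.last_addr.get(core_id)
                self.last_addr[core_id] = line
                if prev is not None and line - prev == CACHE_LINE:
                    # Next-line stride detected: push the following line
                    # into the core's cache ahead of demand.
                    self.push(core_id, line + CACHE_LINE)

            def push(self, core_id, line):
                self.caches[core_id].add(line)

        if __name__ == "__main__":
            caches = {0: set()}
            pf = PushPrefetcher(caches)
            for addr in (0x1000, 0x1040, 0x1080):
                pf.observe(0, addr)
            print(sorted(hex(a) for a in caches[0]))  # lines pushed ahead of demand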

    Data Reuse Cache

    Publication Number: US20240111674A1

    Publication Date: 2024-04-04

    Application Number: US17955618

    Filing Date: 2022-09-29

    CPC classification number: G06F12/0811 G06F12/0875 G06F12/0884

    Abstract: Data reuse cache techniques are described. In one example, a load instruction is generated by an execution unit of a processor unit. In response to the load instruction, data is loaded by a load-store unit for processing by the execution unit and is also stored to a data reuse cache communicatively coupled between the load-store unit and the execution unit. Upon receipt of a subsequent load instruction for the data from the execution unit, the data is loaded from the data reuse cache for processing by the execution unit.
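
    A minimal Python sketch of a data reuse cache sitting between the execution unit and the load-store unit; the LRU policy, capacity, and class names are illustrative assumptions, since the abstract does not specify them. The first load goes through the load-store unit and is copied into the reuse cache; a repeated load for the same address is served from the reuse cache.

        from collections import OrderedDict

        class LoadStoreUnit:
            def __init__(self, memory):
                self.memory = memory
            def load(self, addr):
                return self.memory[addr]

        class DataReuseCache:
            def __init__(self, lsu, capacity=8):
                self.lsu = lsu
                self.capacity = capacity
                self.entries = OrderedDict()   # addr -> value, kept in LRU order

            def load(self, addr):
                if addr in self.entries:       # reuse hit: skip the load-store unit
                    self.entries.move_to_end(addr)
                    return self.entries[addr]
                value = self.lsu.load(addr)    # miss: fetch via the load-store unit
                self.entries[addr] = value     # and keep a copy for later reuse
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)
                return value

        if __name__ == "__main__":
            memory = {0x10: 42}
            cache = DataReuseCache(LoadStoreUnit(memory))
            print(cache.load(0x10))   # first load goes through the load-store unit
            print(cache.load(0x10))   # repeated load is served from the reuse cache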

    SPECULATIVE DRAM REQUEST ENABLING AND DISABLING

    Publication Number: US20240111420A1

    Publication Date: 2024-04-04

    Application Number: US17956417

    Filing Date: 2022-09-29

    CPC classification number: G06F3/0611 G06F3/0653 G06F3/0673

    Abstract: Methods, devices, and systems are provided for retrieving information based on cache miss prediction. It is predicted, based on a history of cache misses at a private cache, that a cache lookup for the information will miss a shared victim cache. A speculative memory request is enabled based on the prediction that the cache lookup for the information will miss the shared victim cache. The information is fetched based on the enabled speculative memory request.
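
    A small Python sketch of the enable/disable decision, assuming a 2-bit saturating counter trained on shared-victim-cache outcomes; the predictor, thresholds, and stub classes are illustrative assumptions, not the patent's mechanism. When a miss is predicted, the DRAM request starts speculatively in parallel with the victim-cache lookup and is cancelled if the lookup hits.

        class MissPredictor:
            """2-bit saturating counter over recent shared-victim-cache outcomes."""
            def __init__(self):
                self.counter = 0              # 0..3; higher means "likely to miss"
            def predict_miss(self):
                return self.counter >= 2
            def update(self, missed):
                self.counter = min(3, self.counter + 1) if missed else max(0, self.counter - 1)

        class VictimCache:
            def __init__(self, lines):
                self.lines = lines            # addr -> value
            def lookup(self, addr):
                return addr in self.lines
            def read(self, addr):
                return self.lines[addr]

        class Dram:
            def __init__(self, mem):
                self.mem = mem
                self.inflight = set()
            def start_fetch(self, addr):
                self.inflight.add(addr)       # request issued (speculative or demand)
            def cancel_fetch(self, addr):
                self.inflight.discard(addr)
            def wait_fetch(self, addr):
                self.inflight.discard(addr)
                return self.mem[addr]

        def handle_private_miss(addr, victim_cache, dram, predictor):
            """On a private-cache miss, optionally start the DRAM request
            speculatively, in parallel with the victim-cache lookup."""
            speculative = predictor.predict_miss()
            if speculative:
                dram.start_fetch(addr)        # speculative request enabled
            hit = victim_cache.lookup(addr)
            predictor.update(missed=not hit)
            if hit:
                if speculative:
                    dram.cancel_fetch(addr)   # lookup hit after all; drop the speculation
                return victim_cache.read(addr)
            if not speculative:
                dram.start_fetch(addr)        # predictor said hit, but it missed
            return dram.wait_fetch(addr)

        if __name__ == "__main__":
            dram = Dram({a: a * 2 for a in range(0, 0x200, 0x40)})
            victim = VictimCache({0x40: 0x80})
            pred = MissPredictor()
            for addr in (0x00, 0x80, 0xC0, 0x40):
                print(hex(addr), handle_private_miss(addr, victim, dram, pred))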

    Machine learning inference engine scalability

    Publication Number: US11948073B2

    Publication Date: 2024-04-02

    Application Number: US16117302

    Filing Date: 2018-08-30

    CPC classification number: G06N3/08 G06N3/04

    Abstract: Systems, apparatuses, and methods for adaptively mapping a machine learning model to a multi-core inference accelerator engine are disclosed. A computing system includes a multi-core inference accelerator engine with multiple inference cores coupled to a memory subsystem. The system also includes a control unit which determines how to adaptively map a machine learning model to the multi-core inference accelerator engine. In one implementation, the control unit selects a mapping scheme which minimizes the memory bandwidth utilization of the multi-core inference accelerator engine. In one implementation, this mapping scheme involves having one inference core of the multi-core inference accelerator engine fetch first data and broadcast the first data to the other inference cores of the inference accelerator engine. Each inference core also fetches second data unique to the respective inference core. The inference cores then perform computations on the first and second data in order to implement the machine learning model.
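
    A back-of-the-envelope Python sketch of the mapping choice, assuming a cost model in which on-chip broadcast traffic is free and only external memory traffic counts; the scheme names and data sizes are illustrative assumptions. It compares replicating the shared data to every core against having one core fetch it once and broadcast it.

        def external_traffic(num_cores, shared_bytes, per_core_bytes, broadcast):
            """Bytes fetched from external memory for one layer.

            broadcast=True : one core fetches the shared data once and broadcasts
                             it on-chip; every core still fetches its private data.
            broadcast=False: every core fetches both shared and private data itself.
            """
            shared_fetches = 1 if broadcast else num_cores
            return shared_fetches * shared_bytes + num_cores * per_core_bytes

        def pick_mapping(num_cores, shared_bytes, per_core_bytes):
            schemes = {
                "broadcast": external_traffic(num_cores, shared_bytes, per_core_bytes, True),
                "replicate": external_traffic(num_cores, shared_bytes, per_core_bytes, False),
            }
            return min(schemes, key=schemes.get), schemes

        if __name__ == "__main__":
            # 8 cores sharing 2 MiB of weights, each with 256 KiB of private activations.
            best, costs = pick_mapping(8, 2 << 20, 256 << 10)
            print(best, costs)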

    Gang scheduling for low-latency task synchronization

    Publication Number: US11948000B2

    Publication Date: 2024-04-02

    Application Number: US17219365

    Filing Date: 2021-03-31

    CPC classification number: G06F9/4881 G06F9/544 G06T1/20

    Abstract: Systems, apparatuses, and methods for performing command buffer gang submission are disclosed. A system includes at least first and second processors and a memory. The first processor (e.g., CPU) generates a command buffer and stores the command buffer in the memory. A mechanism is implemented in which the granularity of work provided to the second processor (e.g., GPU) is increased, which, in turn, increases the opportunities for parallel work. In gang submission mode, the user-mode driver (UMD) specifies a set of multiple queues and the command buffers to execute on those queues, and that work is guaranteed to execute as a single unit from the point of view of the GPU operating system scheduler. Using gang submission, synchronization between command buffers executing on multiple queues in the same submit is safe. This opens up optimization opportunities both for application use (explicit gang submission) and for internal driver use (implicit gang submission).
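
    A minimal Python sketch of gang submission as a data structure, assuming a toy scheduler that dispatches all command buffers of a gang as one unit; the class names and dispatch order are illustrative assumptions, not the driver's actual interfaces.

        from dataclasses import dataclass, field

        @dataclass
        class GangSubmission:
            # queue id -> list of command buffers to run together as one unit
            work: dict = field(default_factory=dict)

            def add(self, queue_id, command_buffer):
                self.work.setdefault(queue_id, []).append(command_buffer)

        class GpuScheduler:
            def __init__(self):
                self.pending = []

            def submit_gang(self, gang):
                # The whole gang is enqueued atomically: all of its queues'
                # command buffers are scheduled in the same window, or none are.
                self.pending.append(gang)

            def run(self):
                for gang in self.pending:
                    for queue_id, buffers in gang.work.items():
                        for cb in buffers:
                            print(f"queue {queue_id}: executing {cb}")
                self.pending.clear()

        if __name__ == "__main__":
            gang = GangSubmission()
            gang.add(0, "compute_cb")   # e.g. a compute command buffer
            gang.add(1, "copy_cb")      # and a copy command buffer it synchronizes with
            sched = GpuScheduler()
            sched.submit_gang(gang)
            sched.run()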
