High-speed selective cache invalidates and write-backs on GPUS

    Publication number: US10540280B2

    Publication date: 2020-01-21

    Application number: US15390080

    Filing date: 2016-12-23

    Abstract: Techniques for performing cache invalidates and write-backs in an accelerated processing device (e.g., a graphics processing device that renders three-dimensional graphics) are disclosed. The techniques involve receiving requests from a “master” (e.g., the central processing unit) and invalidating the virtual-to-physical address translations specified in an address translation request. The requests are then split up based on whether they target virtually or physically tagged caches. For speed, addresses for the portions of a request that target physically tagged caches are translated using the virtual-to-physical address translations that were just invalidated. The split-up request is processed to generate micro-transactions for the individual caches it targets, and micro-transactions for physically and virtually tagged caches are processed in parallel. Once all micro-transactions for a request have been processed, the unit that made the request is notified.
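
The flow in this abstract can be illustrated with a small Python toy model. It is a sketch under stated assumptions, not the patented design: it models only the invalidate path (not write-backs), assumes a 4 KiB page, represents the TLB as a dict from virtual page to physical page, and uses hypothetical names (`Cache`, `invalidate_range`, `page_walk`, `notify`). The key idea shown is that translations removed from the TLB are reused to address the physically tagged caches.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size; not specified in the abstract


@dataclass
class Cache:
    name: str
    physically_tagged: bool
    invalidated: list = field(default_factory=list)  # addresses hit so far


def invalidate_range(vaddrs, tlb, caches, page_walk, notify):
    """Toy invalidate path: tlb maps virtual page -> physical page."""
    # 1. Drop the TLB entries covered by the request, but keep the removed
    #    translations so the physically tagged caches can reuse them.
    reused = {}
    for va in vaddrs:
        vpn = va // PAGE_SIZE
        if vpn in tlb:
            reused[vpn] = tlb.pop(vpn)

    def to_physical(va):
        vpn, offset = divmod(va, PAGE_SIZE)
        # Reuse an invalidated translation if we have one; otherwise fall
        # back to the caller-supplied (hypothetical) page-table walk.
        ppn = reused[vpn] if vpn in reused else page_walk(vpn)
        return ppn * PAGE_SIZE + offset

    # 2. Split by tag type: one micro-transaction per (cache, address).
    micro = [(c, to_physical(va) if c.physically_tagged else va)
             for c in caches for va in vaddrs]

    # 3. Apply them (sequentially here; the abstract describes the
    #    virtual and physical streams running in parallel).
    for cache, addr in micro:
        cache.invalidated.append(addr)

    # 4. Tell the requesting unit that every micro-transaction is done.
    notify(len(micro))
    return micro
```

For example, with `tlb = {0x10: 0x80}`, one virtually tagged and one physically tagged cache, and a request for `0x10040`, the virtual cache sees `0x10040`, the physical cache sees `0x80040`, and the TLB entry is gone afterwards.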

    SHADER WRITES TO COMPRESSED RESOURCES
    Invention application

    Publication number: US20180182155A1

    Publication date: 2018-06-28

    Application number: US15389075

    Filing date: 2016-12-22

    Abstract: Systems, apparatuses, and methods for performing shader writes to compressed surfaces are disclosed. In one embodiment, a processor includes at least a memory and one or more shader units. In one embodiment, a shader unit of the processor is configured to receive a write request targeted to a compressed surface. The shader unit is configured to identify a first block of the compressed surface targeted by the write request. Responsive to determining that the data of the write request targets less than the entirety of the first block, the shader unit reads the first block from the cache and decompresses it. Next, the shader unit merges the data of the write request with the decompressed first block. Then, the shader unit compresses the merged data and writes it to the cache.
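
The read-decompress-merge-compress sequence can be sketched in Python as follows. This is only an illustration: `zlib` stands in for whatever compression a GPU would actually use, the 256-byte block size is an assumption, the cache is a hypothetical dict from block index to compressed bytes, and `shader_write` is a hypothetical name. Skipping the read for a full-block write is an inference from the "less than the entirety" condition, not something the abstract states.

```python
import zlib

BLOCK_SIZE = 256  # assumed compressed-block size in bytes


def shader_write(cache, offset, data):
    """Toy write into a compressed surface held as {block index: zlib bytes}."""
    block = offset // BLOCK_SIZE
    start = offset % BLOCK_SIZE
    if start + len(data) > BLOCK_SIZE:
        raise ValueError("toy model: a write must stay within one block")
    if len(data) == BLOCK_SIZE:
        # Assumed: a full-block write needs no read of the old contents.
        merged = bytes(data)
    else:
        # Partial write: read and decompress the existing block, then merge.
        old = bytearray(zlib.decompress(cache[block]))
        old[start:start + len(data)] = data
        merged = bytes(old)
    # Recompress the merged block and write it back to the cache.
    cache[block] = zlib.compress(merged)
```

The partial-write branch is the case the abstract focuses on: the unchanged bytes of the block must be recovered before recompression, which is why a decompress step is unavoidable there.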

    Graphics discard engine
    Invention grant

    Publication number: US12236529B2

    Publication date: 2025-02-25

    Application number: US17562653

    Filing date: 2021-12-27

    Abstract: Systems, apparatuses, and methods for implementing a discard engine in a graphics pipeline are disclosed. A system includes a graphics pipeline with a geometry engine launching shaders that generate attribute data for the vertices of each primitive of a set of primitives. The attribute data is consumed by pixel shaders, with each pixel shader generating a deallocation message when it no longer needs the attribute data. A discard engine gathers deallocations from multiple pixel shaders and determines when the attribute data is no longer needed. Once a block of attributes has been consumed by all potential pixel shader consumers, the discard engine deallocates that block. The discard engine then sends a discard command to the caches so that the attribute data is invalidated without being written back to memory.
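
The bookkeeping described here behaves like a per-block reference count. The Python sketch below is a hypothetical model (`DiscardEngine`, `register_block`, and `deallocate` are illustrative names): it assumes the number of potential pixel-shader consumers is known when a block is registered, and models the discard command as dropping the block's cached data without copying it to memory.

```python
class DiscardEngine:
    """Toy model: count outstanding consumers per attribute block and
    discard a block once every potential consumer has released it."""

    def __init__(self, cache):
        self.cache = cache      # hypothetical cache: block id -> dirty data
        self.pending = {}       # block id -> consumers still outstanding
        self.discarded = []     # blocks discarded so far, in order

    def register_block(self, block_id, num_consumers):
        # Assumed: the consumer count is known when the block is written.
        self.pending[block_id] = num_consumers

    def deallocate(self, block_id):
        # One deallocation message from a pixel shader that is done.
        self.pending[block_id] -= 1
        if self.pending[block_id] == 0:
            del self.pending[block_id]
            # Discard: drop the cached data instead of writing it back.
            self.cache.pop(block_id, None)
            self.discarded.append(block_id)
```

Because the discarded lines never reach memory, the model reflects the abstract's point that dead attribute data costs no write-back bandwidth.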

    Data driven scheduler on multiple computing cores

    Publication number: US10649810B2

    Publication date: 2020-05-12

    Application number: US14981257

    Filing date: 2015-12-28

    Abstract: Methods, devices, and systems for data-driven scheduling of a plurality of computing cores of a processor are disclosed. A plurality of threads may be executed on the computing cores according to a default schedule. Based on that execution, the threads may be analyzed to determine correlations among them. A data-driven schedule may then be generated from the correlations, and the threads may be executed on the computing cores according to the data-driven schedule.
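
The abstract does not say how correlations are measured or turned into a schedule, so the Python sketch below fills both gaps with labeled assumptions: the correlation between threads is taken to be the number of memory addresses they share (as observed during the default-schedule run), and a greedy pass places each thread on the core whose already-placed threads share the most data with it, with a per-core capacity to keep the load balanced. `data_driven_schedule` is a hypothetical name.

```python
def data_driven_schedule(accesses, num_cores):
    """accesses: thread id -> set of addresses the thread touched while
    running under the default schedule. Returns one thread list per core."""
    capacity = -(-len(accesses) // num_cores)  # ceil(threads / cores)
    cores = [[] for _ in range(num_cores)]

    def affinity(tid, core):
        # Assumed correlation metric: addresses shared with the threads
        # already placed on this core.
        return sum(len(accesses[tid] & accesses[other]) for other in cores[core])

    for tid in sorted(accesses):
        open_cores = [c for c in range(num_cores) if len(cores[c]) < capacity]
        # Most shared data wins; ties go to the emptier core, then lower index.
        best = max(open_cores, key=lambda c: (affinity(tid, c), -len(cores[c]), -c))
        cores[best].append(tid)
    return cores
```

With `accesses = {0: {1, 2}, 1: {10, 11}, 2: {1, 3}, 3: {10, 12}}` and two cores, threads 0 and 2 (which share address 1) land on one core and threads 1 and 3 (which share address 10) on the other, giving `[[0, 2], [1, 3]]`.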

    FLEXIBLE SHADER EXPORT DESIGN IN MULTIPLE COMPUTING CORES

    Publication number: US20180314528A1

    Publication date: 2018-11-01

    Application number: US15607118

    Filing date: 2017-05-26

    Abstract: Systems, apparatuses, and methods for generating flexibly addressed memory requests are disclosed. In one embodiment, a system includes a processor, control unit, and memory subsystem. The processor launches a plurality of threads on a plurality of compute units, wherein each thread generates memory requests without specifying target memory addresses. The threads executing on the plurality of compute units convey a plurality of memory requests to the control unit. The control unit generates target memory addresses for the plurality of received memory requests. In one embodiment, the memory requests are write requests, and the control unit interleaves write requests from the plurality of threads into a single output buffer stored in the memory subsystem. The control unit can be located in a cache, in a memory controller, or in another location within the system.
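
The address-generation idea can be illustrated with a small Python model in which threads hand over payloads without addresses and a control unit packs them into one shared output buffer. This is a sketch, not the patented design: `ExportControlUnit`, `submit`, the fixed slot size, and the dict standing in for the memory subsystem are all hypothetical, and arrival order stands in for whatever interleaving policy real hardware would use.

```python
class ExportControlUnit:
    """Toy control unit: threads submit payloads with no target address and
    the unit assigns each one the next free slot in a shared output buffer."""

    def __init__(self, base_addr, num_slots, slot_size):
        self.base_addr = base_addr
        self.num_slots = num_slots
        self.slot_size = slot_size
        self.next_slot = 0
        self.memory = {}  # stand-in for the memory subsystem: addr -> data

    def submit(self, thread_id, payload):
        if self.next_slot >= self.num_slots:
            raise OverflowError("output buffer is full")
        # The address is generated here, not by the issuing thread.
        addr = self.base_addr + self.next_slot * self.slot_size
        self.next_slot += 1
        self.memory[addr] = (thread_id, payload)
        return addr
```

Because only the control unit advances `next_slot`, writes from different threads interleave into one dense buffer without any thread needing to know how many others have written before it.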
