On-demand Memory Allocation
    Invention patent application

    Publication number: US20210271606A1

    Publication date: 2021-09-02

    Application number: US16804128

    Filing date: 2020-02-28

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to dynamically allocating and mapping private memory for requesting circuitry. Disclosed circuitry may receive a private address and translate the private address to a virtual address (which an MMU may then translate to a physical address to actually access a storage element). In some embodiments, private memory allocation circuitry is configured to generate page table information and map private memory pages for requests if the page table information is not already set up. In various embodiments, this may advantageously allow dynamic private memory allocation, e.g., to efficiently allocate memory for graphics shaders with different types of workloads. Disclosed caching techniques for page table information may improve performance relative to traditional techniques. Further, disclosed embodiments may facilitate memory consolidation across a device such as a graphics processor.
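
The private-to-virtual translation with on-demand mapping described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the 4 KiB page size, the bump-style virtual allocator, and all names (`PrivateMemoryAllocator`, `translate`, `virtual_base`) are hypothetical and do not come from the patent, which describes hardware circuitry rather than software.

```python
PAGE_SIZE = 4096  # assumed page size; the abstract does not state one


class PrivateMemoryAllocator:
    """Toy model: maps private pages to virtual pages on first use."""

    def __init__(self, virtual_base):
        self.page_table = {}              # private page number -> virtual page base
        self.next_virtual = virtual_base  # next unused virtual page (hypothetical bump allocator)
        self.allocations = 0              # number of pages mapped so far

    def translate(self, private_addr):
        """Translate a private address to a virtual address."""
        page, offset = divmod(private_addr, PAGE_SIZE)
        if page not in self.page_table:
            # Page table information is not set up yet: map a page on demand.
            self.page_table[page] = self.next_virtual
            self.next_virtual += PAGE_SIZE
            self.allocations += 1
        return self.page_table[page] + offset


allocator = PrivateMemoryAllocator(virtual_base=0x10000000)
first = allocator.translate(0x0010)   # first touch of private page 0 maps it
second = allocator.translate(0x0020)  # same private page, no new mapping
third = allocator.translate(0x1008)   # private page 1, mapped on demand
```

In this model, memory is committed only for private pages that are actually touched, which is the property the abstract attributes to dynamic private memory allocation. The later virtual-to-physical step performed by the MMU is not modeled.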

    Compression Techniques for Pixel Write Data

    Publication number: US20210134052A1

    Publication date: 2021-05-06

    Application number: US16673883

    Filing date: 2019-11-04

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to compression of data stored at different cache levels. In some embodiments, programmable shader circuitry is configured to execute program instructions of compute kernels that write pixel data. In some embodiments, a first cache is configured to store pixel write data from the programmable shader circuitry and first compression circuitry is configured to compress a first block of pixel write data in response to full accumulation of the first block in the first cache circuitry. In some embodiments, second cache circuitry is configured to store pixel write data from the programmable shader circuitry at a higher level in a storage hierarchy than the first cache circuitry and second compression circuitry is configured to compress a second block of pixel write data in response to full accumulation of the second block in the second cache circuitry. In some embodiments, write circuitry is configured to write the first and second compressed blocks of pixel data in a combined write to a higher level in the storage hierarchy.
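
The "compress on full accumulation" trigger for a single cache level can be sketched as follows. This is an illustrative model only: the 64-byte block size, the use of `zlib` as a stand-in compressor, and the names `AccumulatingCache` and `write` are assumptions, not details from the patent. The second cache level and the combined write of both compressed blocks are not modeled.

```python
import zlib

BLOCK_SIZE = 64  # assumed block size in bytes; not stated in the abstract


class AccumulatingCache:
    """Toy model of one cache level that compresses a block once it is fully written."""

    def __init__(self):
        self.pending = {}     # block index -> (data buffer, set of written offsets)
        self.compressed = {}  # block index -> compressed bytes

    def write(self, addr, value):
        """Store one byte (0..255) of pixel write data at the given address."""
        block, offset = divmod(addr, BLOCK_SIZE)
        data, written = self.pending.setdefault(
            block, (bytearray(BLOCK_SIZE), set())
        )
        data[offset] = value
        written.add(offset)
        if len(written) == BLOCK_SIZE:
            # Full accumulation: every byte of the block has been written.
            # zlib stands in for the hardware compression circuitry.
            self.compressed[block] = zlib.compress(bytes(data))
            del self.pending[block]


cache = AccumulatingCache()
for addr in range(BLOCK_SIZE):
    cache.write(addr, 7)       # fills block 0 completely, triggering compression
cache.write(BLOCK_SIZE, 9)     # block 1 is only partially written
```

After these writes, block 0 sits in `cache.compressed` while block 1 remains in `cache.pending`, since it has not yet been fully accumulated.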

    Dependency Scheduling for Control Stream in Parallel Processor

    Publication number: US20200301753A1

    Publication date: 2020-09-24

    Application number: US16361910

    Filing date: 2019-03-22

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to processing a control stream such as a compute control stream. In some embodiments, the control stream includes kernels and commands for multiple substreams. In some embodiments, multiple substream processors are each configured to: fetch and parse portions of the control stream corresponding to an assigned substream and, in response to a neighbor barrier command in the assigned substream that identifies another substream, communicate the identified other substream to a barrier clearing circuitry. In some embodiments, the barrier clearing circuitry is configured to determine whether to allow the assigned substream to proceed past the neighbor barrier command based on communication of a most-recently-completed command from a substream processor to which the other substream is assigned (e.g., based on whether the most-recently-completed command meets a command identifier communicated in the neighbor barrier command). The disclosed techniques may facilitate parallel control stream parsing and substream synchronization.
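
The neighbor barrier check can be sketched in software as below. The sketch assumes that command identifiers increase monotonically within a substream and interprets "meets" as "greater than or equal to"; both are assumptions, as are the names `BarrierClearing`, `report_completion`, and `may_proceed`. The patent describes circuitry, not this code.

```python
class BarrierClearing:
    """Toy model of barrier clearing logic shared by several substream processors."""

    def __init__(self, num_substreams):
        # Most recently completed command id per substream; -1 means none yet.
        self.last_completed = [-1] * num_substreams

    def report_completion(self, substream, command_id):
        """Record that a substream processor completed the given command."""
        self.last_completed[substream] = max(
            self.last_completed[substream], command_id
        )

    def may_proceed(self, neighbor_substream, required_command_id):
        """Return True if a neighbor barrier naming this substream and command id can clear."""
        return self.last_completed[neighbor_substream] >= required_command_id


clearing = BarrierClearing(num_substreams=2)
# Substream 0 hits a neighbor barrier that waits on command 3 of substream 1.
blocked = clearing.may_proceed(neighbor_substream=1, required_command_id=3)
clearing.report_completion(substream=1, command_id=3)
cleared = clearing.may_proceed(neighbor_substream=1, required_command_id=3)
```

Each substream processor only reports its own progress and names the neighbor it waits on, so parsing of the substreams can run in parallel while synchronization is handled in one place.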

    Punch-through techniques for graphics processing

    Publication number: US10074210B1

    Publication date: 2018-09-11

    Application number: US15659188

    Filing date: 2017-07-25

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to rendering graphics objects that require shader operations to determine visibility. In some embodiments, a graphics unit is configured to process feedback objects, which may require shading to determine whether they are visible relative to previously-processed objects, out of draw order. For example, in embodiments where a buffer is used to store fragment data for deferred rendering, the graphics unit may bypass the buffer and shade feedback objects ahead of earlier non-feedback objects whose fragment data is stored in the buffer. This may allow a determination of whether to remove occluded non-feedback fragment data from the buffer, which may reduce graphics overdraw. In disclosed two-pass techniques, data for feedback objects is first allowed to bypass the buffer for visibility shading, but is then stored in the buffer for a second pass to perform fragment shading to actually determine pixel attributes, which may further reduce overdraw.

    Compute Kernel Parsing with Limits in one or more Dimensions

    Publication number: US20240345892A1

    Publication date: 2024-10-17

    Application number: US18673959

    Filing date: 2024-05-24

    Applicant: Apple Inc.

    CPC classification number: G06F9/505 G06T1/20

    Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, a graphics processor executes instructions of compute kernels. Workload parser circuitry may determine, for distribution to the graphics processor circuitry, a set of workgroups from a compute kernel that includes workgroups organized in multiple dimensions, including a first number of workgroups in a first dimension and a second number of workgroups in a second dimension. This may include determining multiple sub-kernels for the compute kernel, wherein a first sub-kernel includes, in the first dimension, a limited number of workgroups that is smaller than the first number of workgroups. The parser circuitry may iterate through workgroups in both the first and second dimensions to generate the set of workgroups, proceeding through the first sub-kernel before iterating through any of the other sub-kernels. Disclosed techniques may provide desirable shapes for batches of workgroups.
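
The iteration order described in this abstract, where a limit in the first dimension splits the kernel into sub-kernels that are each traversed completely before the next, can be sketched as a generator. The function name `iterate_workgroups`, its parameters, and the choice to iterate the first dimension innermost are hypothetical; the abstract does not specify them.

```python
def iterate_workgroups(num_x, num_y, limit_x):
    """Yield (x, y) workgroup coordinates, one sub-kernel at a time.

    num_x, num_y: workgroup counts in the first and second dimensions.
    limit_x: maximum number of workgroups per sub-kernel in the first dimension.
    """
    for start_x in range(0, num_x, limit_x):
        end_x = min(start_x + limit_x, num_x)  # last sub-kernel may be narrower
        # Finish this sub-kernel in both dimensions before moving to the next one.
        for y in range(num_y):
            for x in range(start_x, end_x):
                yield (x, y)


order = list(iterate_workgroups(num_x=4, num_y=2, limit_x=2))
# order == [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (3, 0), (2, 1), (3, 1)]
```

With a 4 x 2 kernel and a limit of 2, consecutive workgroups form 2 x 2 tiles rather than full-width rows, which is one way to read the abstract's remark about desirable shapes for batches of workgroups.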

    Compute kernel parsing with limits in one or more dimensions with iterating through workgroups in the one or more dimensions for execution

    Publication number: US12020075B2

    Publication date: 2024-06-25

    Application number: US17018913

    Filing date: 2020-09-11

    Applicant: Apple Inc.

    CPC classification number: G06F9/505 G06T1/20

    Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, a graphics processor executes instructions of compute kernels. Workload parser circuitry may determine, for distribution to the graphics processor circuitry, a set of workgroups from a compute kernel that includes workgroups organized in multiple dimensions, including a first number of workgroups in a first dimension and a second number of workgroups in a second dimension. This may include determining multiple sub-kernels for the compute kernel, wherein a first sub-kernel includes, in the first dimension, a limited number of workgroups that is smaller than the first number of workgroups. The parser circuitry may iterate through workgroups in both the first and second dimensions to generate the set of workgroups, proceeding through the first sub-kernel before iterating through any of the other sub-kernels. Disclosed techniques may provide desirable shapes for batches of workgroups.

    On-demand Memory Allocation
    Invention patent publication

    Publication number: US20240045808A1

    Publication date: 2024-02-08

    Application number: US18490588

    Filing date: 2023-10-19

    Applicant: Apple Inc.

    CPC classification number: G06F12/1018 G06F12/084 G06F30/392

    Abstract: Techniques are disclosed relating to dynamically allocating and mapping private memory for requesting circuitry. Disclosed circuitry may receive a private address and translate the private address to a virtual address (which an MMU may then translate to a physical address to actually access a storage element). In some embodiments, private memory allocation circuitry is configured to generate page table information and map private memory pages for requests if the page table information is not already set up. In various embodiments, this may advantageously allow dynamic private memory allocation, e.g., to efficiently allocate memory for graphics shaders with different types of workloads. Disclosed caching techniques for page table information may improve performance relative to traditional techniques. Further, disclosed embodiments may facilitate memory consolidation across a device such as a graphics processor.

    Compression techniques for pixel write data

    Publication number: US11062507B2

    Publication date: 2021-07-13

    Application number: US16673883

    Filing date: 2019-11-04

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to compression of data stored at different cache levels. In some embodiments, programmable shader circuitry is configured to execute program instructions of compute kernels that write pixel data. In some embodiments, a first cache is configured to store pixel write data from the programmable shader circuitry and first compression circuitry is configured to compress a first block of pixel write data in response to full accumulation of the first block in the first cache circuitry. In some embodiments, second cache circuitry is configured to store pixel write data from the programmable shader circuitry at a higher level in a storage hierarchy than the first cache circuitry and second compression circuitry is configured to compress a second block of pixel write data in response to full accumulation of the second block in the second cache circuitry. In some embodiments, write circuitry is configured to write the first and second compressed blocks of pixel data in a combined write to a higher level in the storage hierarchy.
