Low Latency Fetch Circuitry for Compute Kernels

    Publication Number: US20210026638A1

    Publication Date: 2021-01-28

    Application Number: US17065761

    Application Date: 2020-10-08

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.
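    The two-pointer scheme in this abstract can be sketched as a software model: a leading fetch parse pointer detects indirect-data-access items and issues their fetches early, while a trailing execute parse pointer emits items (and resolved indirect results) in stream order. This is an illustrative Python sketch; names like `StreamBuffer` and the item encoding are assumptions, not from the patent.

    ```python
    # Hypothetical model of the two-pointer fetch scheme described above.
    class StreamBuffer:
        def __init__(self, items):
            self.items = items          # pre-fetched command-stream items
            self.fetch_ptr = 0          # leads: detects indirect accesses
            self.exec_ptr = 0           # trails: emits items for execution
            self.indirect_results = {}  # item index -> resolved indirect data

        def fetch_parse(self, fetch_indirect):
            """Scan ahead, resolving indirect-data-access items early."""
            while self.fetch_ptr < len(self.items):
                kind, payload = self.items[self.fetch_ptr]
                if kind == "indirect":
                    # Issue the indirect fetch now, well before execution
                    # needs it, hiding the fetch latency.
                    self.indirect_results[self.fetch_ptr] = fetch_indirect(payload)
                self.fetch_ptr += 1

        def execute_parse(self):
            """Emit item data and indirect-fetch results in stream order."""
            out = []
            while self.exec_ptr < self.fetch_ptr:
                kind, payload = self.items[self.exec_ptr]
                if kind == "indirect":
                    out.append(self.indirect_results[self.exec_ptr])
                else:
                    out.append(payload)
                self.exec_ptr += 1
            return out
    ```

    In the hardware the two pointers advance concurrently; the sequential calls here only model the ordering constraint that the execute pointer never passes the fetch pointer.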

    Hardware Resource Allocation System for Allocating Resources to Threads

    Publication Number: US20210248006A1

    Publication Date: 2021-08-12

    Application Number: US17240406

    Application Date: 2021-04-26

    Applicant: Apple Inc.

    Abstract: In various embodiments, a resource allocation management circuit may allocate a plurality of different types of hardware resources (e.g., different types of registers) to a plurality of threads. The different types of hardware resources may correspond to a plurality of hardware resource allocation circuits. The resource allocation management circuit may track allocation of the hardware resources to the threads using state identification values of the threads. In response to determining that fewer than a respective requested number of one or more types of the hardware resources are available, the resource allocation management circuit may identify one or more threads for deallocation. As a result, the hardware resource allocation system may allocate hardware resources to threads more efficiently (e.g., may deallocate hardware resources allocated to fewer threads), as compared to a hardware resource allocation system that does not track allocation of hardware resources to threads using state identification values.
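    The allocation-tracking idea above can be illustrated with a small software model: holdings of several resource types are tracked per thread state ID, and when a request cannot be satisfied, a victim thread is deallocated to free its resources. This is a hypothetical sketch; the class name, the victim-selection policy, and the assumption that every request fits total capacity are all illustrative choices, not details from the patent.

    ```python
    # Hypothetical model of tracking multi-type resource allocation by
    # thread state ID, with deallocation of victim threads on shortage.
    class ResourceAllocator:
        def __init__(self, capacity):
            self.capacity = dict(capacity)  # resource type -> total units
            self.free = dict(capacity)      # resource type -> unallocated units
            self.held = {}                  # state_id -> {type: units}

        def allocate(self, state_id, request):
            # Assumes the request fits total capacity; while any type is
            # short, deallocate a victim thread to reclaim its resources.
            while any(self.free[t] < n for t, n in request.items()):
                victim = next(iter(self.held))  # simplistic victim policy
                self.deallocate(victim)
            for t, n in request.items():
                self.free[t] -= n
            self.held[state_id] = dict(request)

        def deallocate(self, state_id):
            # Return every unit the thread held, across all resource types.
            for t, n in self.held.pop(state_id).items():
                self.free[t] += n
    ```

    Tracking holdings per state ID is what lets the deallocation step free a small number of threads rather than flushing all allocations.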

    Distributed Compute Work Parser Circuitry using Communications Fabric

    Publication Number: US20200098160A1

    Publication Date: 2020-03-26

    Application Number: US16143412

    Application Date: 2018-09-26

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups. In some embodiments, the distributed parsers maintain coordinate information for the kernel and update the coordinate information in response to the batch information, even when the distributed parsers are not assigned to execute the batch.
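    The batch-distribution behavior described above, including the detail that every distributed parser updates its coordinate state even for batches it does not execute, can be sketched as follows. This is an illustrative Python model; the round-robin assignment, the function names, and the flattened coordinate walk are assumptions for the sketch, not details from the patent.

    ```python
    # Hypothetical sketch: a master parser walks workgroup coordinates in
    # batches and broadcasts each batch over the "fabric" (here, a plain
    # method call); every distributed parser keeps its coordinates in
    # sync, but only the assigned parser executes the batch.
    def master_dispatch(kernel_dims, batch_size, parsers):
        x, y, z = kernel_dims
        coords = [(i, j, k) for k in range(z) for j in range(y) for i in range(x)]
        for n, start in enumerate(range(0, len(coords), batch_size)):
            batch = coords[start:start + batch_size]
            assigned = n % len(parsers)  # illustrative round-robin policy
            for idx, p in enumerate(parsers):
                p.receive(batch, execute=(idx == assigned))

    class DistributedParser:
        def __init__(self):
            self.position = (0, 0, 0)  # tracked even when not executing
            self.executed = []

        def receive(self, batch, execute):
            self.position = batch[-1]  # coordinate update for every batch
            if execute:
                self.executed.extend(batch)
    ```

    Keeping coordinates synchronized everywhere means the master only needs to send compact batch information, not full coordinates for each workgroup.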

    Distributed compute work parser circuitry using communications fabric

    Publication Number: US10593094B1

    Publication Date: 2020-03-17

    Application Number: US16143412

    Application Date: 2018-09-26

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups. In some embodiments, the distributed parsers maintain coordinate information for the kernel and update the coordinate information in response to the batch information, even when the distributed parsers are not assigned to execute the batch.

    Low latency fetch circuitry for compute kernels

    Publication Number: US11256510B2

    Publication Date: 2022-02-22

    Application Number: US17065761

    Application Date: 2020-10-08

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.

    Cache memory with transient storage for cache lines

    Publication Number: US11023162B2

    Publication Date: 2021-06-01

    Application Number: US16548784

    Application Date: 2019-08-22

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to caches that support transient storage fields for cache entries. In some embodiments, cache circuitry includes a set of multiple cache entries that each include a tag field and a data field. In some embodiments, transient storage circuitry includes a transient storage field for each of the multiple cache entries. In some embodiments, cache control circuitry stores received first data in the data field of a cache entry and stores received transient data in a corresponding transient storage field. In response to an eviction determination for the cache entry, however, the cache control circuitry may write the first data but not the transient data to a backing memory for the cache circuitry. In various embodiments, disclosed techniques may allow caching additional data that is transient without increasing bandwidth to the backing memory.
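    The key behavior in this abstract, writing back the data field on eviction while discarding the transient field, can be shown with a minimal model. This is an illustrative Python sketch; the class name, the FIFO-style eviction, and the dict-backed memory are assumptions for the example, not details from the patent.

    ```python
    # Hypothetical model of cache entries with a transient storage field:
    # on eviction, only the data field reaches backing memory, so the
    # transient field adds cached state without adding write bandwidth.
    class TransientCache:
        def __init__(self, capacity, backing):
            self.capacity = capacity
            self.backing = backing   # tag -> data (the backing memory)
            self.entries = {}        # tag -> (data, transient)

        def store(self, tag, data, transient=None):
            if tag not in self.entries and len(self.entries) >= self.capacity:
                self._evict()
            self.entries[tag] = (data, transient)

        def _evict(self):
            # Oldest-entry eviction, purely for illustration.
            tag, (data, transient) = next(iter(self.entries.items()))
            del self.entries[tag]
            self.backing[tag] = data  # write back data only;
                                      # the transient field is dropped
    ```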

    Cache Memory with Transient Storage for Cache Lines

    Publication Number: US20210055883A1

    Publication Date: 2021-02-25

    Application Number: US16548784

    Application Date: 2019-08-22

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to caches that support transient storage fields for cache entries. In some embodiments, cache circuitry includes a set of multiple cache entries that each include a tag field and a data field. In some embodiments, transient storage circuitry includes a transient storage field for each of the multiple cache entries. In some embodiments, cache control circuitry stores received first data in the data field of a cache entry and stores received transient data in a corresponding transient storage field. In response to an eviction determination for the cache entry, however, the cache control circuitry may write the first data but not the transient data to a backing memory for the cache circuitry. In various embodiments, disclosed techniques may allow caching additional data that is transient without increasing bandwidth to the backing memory.

    Techniques for context switching using distributed compute workload parsers

    Publication Number: US10901777B1

    Publication Date: 2021-01-26

    Application Number: US16143432

    Application Date: 2018-09-26

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to context switching using distributed compute workload parsers. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels, a plurality of distributed workload parser circuits each configured to dispatch workgroups to a respective set of the shader units, a communications fabric, and a master workload parser circuit configured to communicate with the distributed workload parser circuits via the communications fabric. In some embodiments, the master workload parser circuit maintains a first set of master state information that does not change for a compute kernel based on operations by the shader units and a second set of master state information that may be changed by operations specified by the kernel. In some embodiments, the master workload parser circuit performs a multi-phase state storage process in communications with the distributed workload parser circuits.
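    The split between static and mutable master state, and the multi-phase storage handshake with the distributed parsers, can be sketched in software terms. This is a hypothetical model: the class names, the two-phase ordering, and the snapshot format are illustrative assumptions, not details from the patent.

    ```python
    # Hypothetical sketch of a multi-phase context-store handshake: the
    # master first quiesces the distributed parsers and collects their
    # state, then snapshots only its own mutable state. Static kernel
    # state never changes during execution, so it need not be saved.
    class MasterParser:
        def __init__(self, static_state, mutable_state, distributed):
            self.static_state = static_state    # fixed for the kernel
            self.mutable_state = mutable_state  # may change via kernel ops
            self.distributed = distributed

        def context_store(self):
            # Phase 1: halt distributed parsers and gather their state.
            saved = [p.halt_and_report() for p in self.distributed]
            # Phase 2: snapshot the master's mutable state only.
            return {"distributed": saved,
                    "master": dict(self.mutable_state)}

    class DistParser:
        def __init__(self, state):
            self.state = state
            self.halted = False

        def halt_and_report(self):
            self.halted = True
            return dict(self.state)
    ```

    Saving only the mutable portion keeps the per-switch traffic over the fabric small relative to storing all kernel state.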

    Initial object shader run for graphics workload distribution

    Publication Number: US12182926B1

    Publication Date: 2024-12-31

    Application Number: US18055111

    Application Date: 2022-11-14

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to using an initial version of an object shader to determine a child count and distribute geometry work based on the child count. In some embodiments, graphics shader circuitry is configured to execute shader programs including object shaders and mesh shaders. Vertex control circuitry is configured to, for a given object shader: launch an initial version of the given object shader to determine a number of meshlets to be generated by the given object shader (e.g., where the initial version of the given object shader does not commit side effects to architectural state of the apparatus) and select shader circuitry to execute a complete version of the given object shader based on the determined number of meshlets.
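    The two-pass launch described above, a side-effect-free initial run that only determines the meshlet count, followed by shader-circuitry selection for the complete run, can be sketched as follows. This is an illustrative Python model; the least-loaded selection policy and all names are assumptions for the sketch, not details from the patent.

    ```python
    # Hypothetical sketch: run a cheap "initial" version of an object
    # shader that only computes its meshlet count (committing no side
    # effects), then use that count to pick which shader unit runs the
    # complete version.
    def launch_object_shader(count_only_shader, full_shader, shader_units):
        # Initial run: determine the number of meshlets, no side effects.
        meshlet_count = count_only_shader()
        # Select shader circuitry based on the expected amount of child
        # work (least-loaded unit, a purely illustrative policy).
        unit = min(shader_units, key=lambda u: u["load"])
        unit["load"] += meshlet_count
        # Complete run: the full shader executes on the chosen unit.
        return full_shader(unit, meshlet_count)
    ```

    Knowing the child count before the complete run is what allows work distribution to account for the geometry a shader will actually generate.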
