Software control techniques for graphics hardware that supports logical slots and reservation of graphics hardware based on a priority threshold

    Publication No.: US12175300B2

    Publication Date: 2024-12-24

    Application No.: US17399759

    Application Date: 2021-08-11

    Applicant: Apple Inc.

    Abstract: Disclosed embodiments relate to software control of graphics hardware that supports logical slots. In some embodiments, a GPU includes circuitry that implements a plurality of logical slots and a set of graphics processor sub-units that each implement multiple distributed hardware slots. Control circuitry may determine mappings between logical slots and distributed hardware slots for different sets of graphics work. Various mapping aspects may be software-controlled. For example, software may specify one or more of the following: priority information for a set of graphics work, to retain the mapping after completion of the work, a distribution rule, a target group of sub-units, a sub-unit mask, a scheduling policy, to reclaim hardware slots from another logical slot, etc. Software may also query status of the work.
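The software-controlled mapping aspects enumerated in the abstract can be sketched as a small data model. This is a hypothetical illustration only; the class names, fields, defaults, and allocation policy below are assumptions, not Apple's actual hardware or driver interface.

```python
from dataclasses import dataclass

# Hypothetical sketch of the software-visible controls the abstract lists:
# priority, retention, distribution rule, sub-unit mask, slot reclaim.
@dataclass
class SlotControls:
    priority: int = 0                # priority information for the set of work
    retain_after_done: bool = False  # keep the mapping after the work completes
    distribution_rule: str = "all"   # e.g. spread across sub-units vs. "single"
    sub_unit_mask: int = 0xF         # bitmask selecting eligible sub-units
    reclaim: bool = False            # allow reclaiming slots from another logical slot

def map_logical_slot(controls: SlotControls,
                     free_hw_slots: dict[int, list[int]]) -> dict[int, int]:
    """Map one logical slot to distributed hardware slots.

    free_hw_slots: sub-unit index -> list of free hardware slot ids.
    Returns: sub-unit index -> chosen hardware slot id.
    """
    mapping: dict[int, int] = {}
    for sub_unit, slots in free_hw_slots.items():
        if not (controls.sub_unit_mask >> sub_unit) & 1:
            continue  # sub-unit masked out by software
        if slots:
            mapping[sub_unit] = slots.pop(0)
        if controls.distribution_rule == "single" and mapping:
            break  # one sub-unit suffices under this (assumed) rule
    return mapping
```

For example, a mask of `0b0110` restricts the mapping to sub-units 1 and 2 even when slots are free elsewhere.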

    Instruction Storage
    2.
    Invention Application

    Publication No.: US20210358078A1

    Publication Date: 2021-11-18

    Application No.: US17334139

    Application Date: 2021-05-28

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.
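The clause-caching behavior described above (store decoded instructions once, replay them for several SIMD groups) can be modeled in a few lines. This is an illustrative sketch, not the actual circuit: the entry count, eviction policy, and instruction representation are all assumptions.

```python
from collections import OrderedDict

class ClauseCache:
    """Toy model of caching circuitry whose entries each hold the decoded
    instructions of one clause; a cached clause can be executed repeatedly
    for different SIMD groups without re-decoding."""

    def __init__(self, num_entries: int = 8):
        self.num_entries = num_entries
        self.entries: OrderedDict[int, list[str]] = OrderedDict()

    def store(self, clause_id: int, decoded: list[str]) -> None:
        if len(self.entries) >= self.num_entries:
            self.entries.popitem(last=False)  # evict the oldest clause (assumed policy)
        self.entries[clause_id] = decoded

    def execute(self, clause_id: int, simd_group: int) -> list[str]:
        # The same stored clause is replayed for whichever SIMD group asks.
        return [f"group{simd_group}: {insn}" for insn in self.entries[clause_id]]
```

The point of the structure is that `execute` never touches the decoder: one decode feeds many SIMD-group executions.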

    Instruction storage
    3.
    Invention Grant

    Publication No.: US11023997B2

    Publication Date: 2021-06-01

    Application No.: US15657531

    Application Date: 2017-07-24

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.

    Clock routing techniques
    4.
    Invention Grant (In force)

    Publication No.: US09594395B2

    Publication Date: 2017-03-14

    Application No.: US14160179

    Application Date: 2014-01-21

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to clock routing techniques in processors with both pipelined and non-pipelined circuitry. In some embodiments, an apparatus includes execution units that are non-pipelined and configured to perform instructions without receiving a clock signal. In these embodiments, one or more clock lines routed throughout the apparatus do not extend into the one or more execution units in each pipeline, reducing the length of the clock lines. In some embodiments, the apparatus includes multiple such pipelines arranged in an array, with the execution units located on an outer portion of the array and clocked control circuitry located on an inner portion of the array. In some embodiments, clock lines do not extend into the outer portion of the array. In some embodiments, the array includes one or more rows of execution units. These arrangements may further reduce the length of clock lines.

    CLOCK ROUTING TECHNIQUES
    5.
    Invention Application (In force)

    Publication No.: US20150205324A1

    Publication Date: 2015-07-23

    Application No.: US14160179

    Application Date: 2014-01-21

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to clock routing techniques in processors with both pipelined and non-pipelined circuitry. In some embodiments, an apparatus includes execution units that are non-pipelined and configured to perform instructions without receiving a clock signal. In these embodiments, one or more clock lines routed throughout the apparatus do not extend into the one or more execution units in each pipeline, reducing the length of the clock lines. In some embodiments, the apparatus includes multiple such pipelines arranged in an array, with the execution units located on an outer portion of the array and clocked control circuitry located on an inner portion of the array. In some embodiments, clock lines do not extend into the outer portion of the array. In some embodiments, the array includes one or more rows of execution units. These arrangements may further reduce the length of clock lines.

    MULTI-THREADED GPU PIPELINE
    6.
    Invention Application (In force)

    Publication No.: US20150035841A1

    Publication Date: 2015-02-05

    Application No.: US13956299

    Application Date: 2013-07-31

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to a multithreaded execution pipeline. In some embodiments, an apparatus is configured to assign to an execution pipeline a number of threads that is an integer multiple of the minimum number of cycles an execution unit is configured to use to generate an execution result from a given set of input operands. In one embodiment, the apparatus is configured to require strict ordering of the threads. In one embodiment, the apparatus is configured so that the same thread accesses (e.g., reads and writes) a register file in a given cycle. In one embodiment, the apparatus is configured so that the same thread does not write back an operand and a result to an operand cache in a given cycle.
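The thread-count constraint can be illustrated numerically: with strict round-robin thread ordering, making the thread count a multiple of the execution latency means a thread's result is ready exactly when that thread issues again. The latency value used below is an arbitrary example, not a figure from the patent.

```python
# Illustration of the constraint: thread count = integer multiple of the
# execution unit's minimum result latency (in cycles).

def valid_thread_counts(min_cycles: int, max_threads: int) -> list[int]:
    """Thread counts that are integer multiples of the execution latency."""
    return list(range(min_cycles, max_threads + 1, min_cycles))

def issuing_thread(cycle: int, num_threads: int) -> int:
    """Strict thread ordering: thread i issues on cycles i, i+n, i+2n, ..."""
    return cycle % num_threads
```

With a (hypothetical) 4-cycle execution unit and 4 threads, a result issued by thread 1 on cycle 1 is complete by cycle 5, which is precisely thread 1's next issue slot, so no per-thread stall logic is needed.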

    Software Control Techniques for Graphics Hardware that Supports Logical Slots

    Publication No.: US20230051906A1

    Publication Date: 2023-02-16

    Application No.: US17399759

    Application Date: 2021-08-11

    Applicant: Apple Inc.

    Abstract: Disclosed embodiments relate to software control of graphics hardware that supports logical slots. In some embodiments, a GPU includes circuitry that implements a plurality of logical slots and a set of graphics processor sub-units that each implement multiple distributed hardware slots. Control circuitry may determine mappings between logical slots and distributed hardware slots for different sets of graphics work. Various mapping aspects may be software-controlled. For example, software may specify one or more of the following: priority information for a set of graphics work, to retain the mapping after completion of the work, a distribution rule, a target group of sub-units, a sub-unit mask, a scheduling policy, to reclaim hardware slots from another logical slot, etc. Software may also query status of the work.

    Dependency scheduling for control stream in parallel processor

    Publication No.: US11080101B2

    Publication Date: 2021-08-03

    Application No.: US16361910

    Application Date: 2019-03-22

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to processing a control stream such as a compute control stream. In some embodiments, the control stream includes kernels and commands for multiple substreams. In some embodiments, multiple substream processors are each configured to: fetch and parse portions of the control stream corresponding to an assigned substream and, in response to a neighbor barrier command in the assigned substream that identifies another substream, communicate the identified other substream to barrier clearing circuitry. In some embodiments, the barrier clearing circuitry is configured to determine whether to allow the assigned substream to proceed past the neighbor barrier command based on communication of a most-recently-completed command from the substream processor to which the other substream is assigned (e.g., based on whether the most-recently-completed command meets a command identifier communicated in the neighbor barrier command). The disclosed techniques may facilitate parallel control stream parsing and substream synchronization.
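The neighbor-barrier mechanism reduces to a simple comparison: each substream processor reports its most-recently-completed command id, and a barrier in one substream clears once the named neighbor has passed a required id. The sketch below assumes integer command identifiers and is illustrative only, not the actual circuit.

```python
class BarrierClearing:
    """Toy model of barrier clearing circuitry: tracks the most-recently-
    completed command id per substream and decides whether a waiting
    substream may proceed past a neighbor barrier."""

    def __init__(self, num_substreams: int):
        self.completed = [-1] * num_substreams  # -1 = nothing completed yet

    def report_completion(self, substream: int, command_id: int) -> None:
        # Substream processors communicate completions as they retire commands.
        self.completed[substream] = command_id

    def may_proceed(self, neighbor: int, required_id: int) -> bool:
        # Clear the barrier only once the neighbor's most-recently-completed
        # command meets the id carried in the neighbor barrier command.
        return self.completed[neighbor] >= required_id
```

Because each substream only ever waits on a specific neighbor's progress counter, all other substreams can be parsed and executed in parallel.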

    Low Latency Fetch Circuitry for Compute Kernels

    Publication No.: US20200097293A1

    Publication Date: 2020-03-26

    Application No.: US16143416

    Application Date: 2018-09-26

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.
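The two-pointer structure can be sketched as two passes over the prefetch buffer: the fetch parse pointer runs ahead launching indirect fetches, and the execute parse pointer trails, combining item data with the indirect results that are ready by the time it arrives. The item encoding and the immediate-lookup model of the indirect fetch are assumptions made for illustration.

```python
def parse_stream(buffer: list[dict]) -> list:
    """Toy model of fetch-parse / execute-parse over a prefetch buffer.

    Items are dicts; {"indirect": True, "address": a} models an
    indirect-data-access item, {"payload": p} a direct item.
    """
    indirect_results: dict[int, str] = {}
    output = []

    # Fetch parse pointer: runs ahead, detecting indirect-data-access items
    # and starting their fetches (modeled here as an immediate lookup).
    for i, item in enumerate(buffer):
        if item.get("indirect"):
            indirect_results[i] = f"data@{item['address']}"

    # Execute parse pointer: trails the fetch parse pointer, emitting item
    # data from the buffer together with any indirect-fetch results.
    for i, item in enumerate(buffer):
        output.append(indirect_results.get(i, item.get("payload")))
    return output
```

The latency benefit comes from the gap between the two pointers: indirect memory accesses are in flight while earlier items are still being executed.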

    Dependency handling for set-aside of compute control stream commands

    Publication No.: US10475152B1

    Publication Date: 2019-11-12

    Application No.: US15896831

    Application Date: 2018-02-14

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to managing dependencies in a compute control stream that specifies operations to be performed on a programmable shader (e.g., of a graphics unit). In some embodiments, the compute control stream includes commands and kernels. In some embodiments, dependency circuitry is configured to maintain dependencies such that younger kernels are allowed to execute ahead of a type of cache-related command (e.g., a command that signals a cache flush and/or invalidate). Disclosed circuitry may include separate buffers for commands and kernels, command dependency circuitry, and kernel dependency circuitry. In various embodiments, the disclosed architecture may improve performance in a highly scalable manner.
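The set-aside behavior (younger kernels running ahead of cache-maintenance commands held in a separate buffer) can be sketched as a tiny scheduler. This is a deliberately simplified model with assumed command kinds; the real dependency circuitry tracks much finer-grained ordering than "commands retire last".

```python
def schedule(stream: list[tuple[str, str]]) -> list[str]:
    """Toy model of set-aside scheduling over a compute control stream.

    Stream items are (kind, name) pairs with kind in {"kernel", "flush"},
    where "flush" stands for a cache flush/invalidate command. Kernels are
    allowed to execute ahead of set-aside flush commands, which retire
    afterwards (a simplification of the real dependency tracking).
    """
    commands, order = [], []
    for kind, name in stream:
        if kind == "flush":
            commands.append(name)   # set the cache command aside in its own buffer
        else:
            order.append(name)      # younger kernel runs ahead of the flush
    return order + commands
```

Keeping commands and kernels in separate buffers is what makes this reordering cheap: neither queue has to search past entries of the other type.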
