APPARATUS IMPLEMENTING INSTRUCTIONS THAT IMPOSE PIPELINE INTERDEPENDENCIES
    Invention application
    APPARATUS IMPLEMENTING INSTRUCTIONS THAT IMPOSE PIPELINE INTERDEPENDENCIES (granted)

    Publication number: US20150009223A1

    Publication date: 2015-01-08

    Application number: US13935299

    Application date: 2013-07-03

    Applicant: Apple Inc.

    CPC classification number: G06T1/20 G06F9/3838 G06F9/3851 G06F9/3867 G06F9/3885

    Abstract: Techniques are disclosed relating to implementation of gradient-type graphics instructions. In one embodiment, an apparatus includes first and second execution pipelines and a register file. In this embodiment, the register file is coupled to the first and second execution pipelines and configured to store operands for the first and second execution pipelines. In this embodiment, the apparatus is configured to determine that a graphics instruction imposes a dependency between the first and second pipelines. In response to this determination, the apparatus is configured to read a plurality of operands from the register file, including an operand assigned to the second execution pipeline, and to select the operand assigned to the second execution pipeline as an input operand for the first execution pipeline. The apparatus may be configured such that operands assigned to the second execution pipeline are accessible by the first execution pipeline only via the register file and not from other locations.
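    The cross-pipeline operand selection described above can be sketched in software. This is an illustrative model, not the patented hardware: the class and function names are assumptions, and a screen-space derivative (a typical gradient-type instruction) stands in for the dependent instruction.

```python
# Illustrative sketch (not the patented circuitry): a shared register file
# holds per-pipeline operands; a gradient-type instruction executing on the
# first pipeline selects the second pipeline's operand as one of its inputs.

class RegisterFile:
    """Stores operands keyed by (pipeline, register): the only path
    through which one pipeline may observe another pipeline's operands."""
    def __init__(self):
        self.slots = {}

    def write(self, pipeline, reg, value):
        self.slots[(pipeline, reg)] = value

    def read(self, pipeline, reg):
        return self.slots[(pipeline, reg)]

def dfdx(regfile, reg):
    # Screen-space derivative: the difference between the operand assigned
    # to the second pipeline and the one assigned to the first.
    a = regfile.read(0, reg)   # operand for the first execution pipeline
    b = regfile.read(1, reg)   # operand for the second execution pipeline
    return b - a

rf = RegisterFile()
rf.write(0, "r0", 2.0)   # value at pixel x
rf.write(1, "r0", 5.0)   # value at adjacent pixel x+1
print(dfdx(rf, "r0"))    # -> 3.0
```

    Routing the neighbor's operand through the register file, rather than through a bypass network, matches the constraint in the last sentence of the abstract.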

    Compute Kernel Parsing with Limits in one or more Dimensions

    Publication number: US20240345892A1

    Publication date: 2024-10-17

    Application number: US18673959

    Application date: 2024-05-24

    Applicant: Apple Inc.

    CPC classification number: G06F9/505 G06T1/20

    Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, a graphics processor executes instructions of compute kernels. Workload parser circuitry may determine, for distribution to the graphics processor circuitry, a set of workgroups from a compute kernel that includes workgroups organized in multiple dimensions, including a first number of workgroups in a first dimension and a second number of workgroups in a second dimension. This may include determining multiple sub-kernels for the compute kernel, wherein a first sub-kernel includes, in the first dimension, a limited number of workgroups that is smaller than the first number of workgroups. The parser circuitry may iterate through workgroups in both the first and second dimensions to generate the set of workgroups, proceeding through the first sub-kernel before iterating through any of the other sub-kernels. Disclosed techniques may provide desirable shapes for batches of workgroups.

    Compute kernel parsing with limits in one or more dimensions with iterating through workgroups in the one or more dimensions for execution

    Publication number: US12020075B2

    Publication date: 2024-06-25

    Application number: US17018913

    Application date: 2020-09-11

    Applicant: Apple Inc.

    CPC classification number: G06F9/505 G06T1/20

    Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, a graphics processor executes instructions of compute kernels. Workload parser circuitry may determine, for distribution to the graphics processor circuitry, a set of workgroups from a compute kernel that includes workgroups organized in multiple dimensions, including a first number of workgroups in a first dimension and a second number of workgroups in a second dimension. This may include determining multiple sub-kernels for the compute kernel, wherein a first sub-kernel includes, in the first dimension, a limited number of workgroups that is smaller than the first number of workgroups. The parser circuitry may iterate through workgroups in both the first and second dimensions to generate the set of workgroups, proceeding through the first sub-kernel before iterating through any of the other sub-kernels. Disclosed techniques may provide desirable shapes for batches of workgroups.
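    The sub-kernel iteration that this abstract (and the related application above) describes can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the two-dimensional kernel, and the slicing of only the first dimension are illustrative, not the disclosed circuit.

```python
# Hypothetical sketch of the described parsing: split a 2-D kernel into
# sub-kernels limited in the first dimension, then walk the first
# sub-kernel completely before iterating through any other sub-kernel.

def iterate_workgroups(dim_x, dim_y, limit_x):
    """Yield (x, y) workgroup IDs, finishing each x-limited sub-kernel in full."""
    order = []
    for base_x in range(0, dim_x, limit_x):          # one sub-kernel per slice
        for y in range(dim_y):                       # second dimension
            for x in range(base_x, min(base_x + limit_x, dim_x)):
                order.append((x, y))
    return order

batch = iterate_workgroups(dim_x=4, dim_y=2, limit_x=2)
# The first sub-kernel (x in 0..1) is fully dispatched before x in 2..3:
print(batch[:4])   # -> [(0, 0), (1, 0), (0, 1), (1, 1)]
```

    Limiting the first dimension keeps each dispatched batch compact in both dimensions, which is one way to read the "desirable shapes for batches of workgroups" claim.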

    Low latency fetch circuitry for compute kernels

    Publication number: US11256510B2

    Publication date: 2022-02-22

    Application number: US17065761

    Application date: 2020-10-08

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.
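    The two-pointer scheme above can be sketched in software. This is an illustrative model, not Apple's circuitry: the fixed look-ahead distance, the callback names, and the item encoding are all assumptions.

```python
# Illustrative sketch: a prefetch buffer walked by two pointers. A
# fetch-parse pointer runs ahead to spot indirect accesses and issue them
# early; a trailing execute-parse pointer emits items for execution,
# substituting buffered indirect-fetch results where needed.

from collections import deque

def process_stream(items, is_indirect, resolve):
    buffer = list(items)            # sequentially pre-fetched items
    indirect_results = deque()      # indirect-fetch circuitry's buffer
    fetch_parse = 0
    out = []
    for execute_parse in range(len(buffer)):
        # Fetch-parse stays ahead (here by an assumed distance of 2),
        # resolving indirect items before execute-parse reaches them.
        while fetch_parse < len(buffer) and fetch_parse <= execute_parse + 2:
            if is_indirect(buffer[fetch_parse]):
                indirect_results.append(resolve(buffer[fetch_parse]))
            fetch_parse += 1
        item = buffer[execute_parse]
        out.append(indirect_results.popleft() if is_indirect(item) else item)
    return out

stream = ["k0", ("ind", 7), "k1"]
print(process_stream(stream,
                     is_indirect=lambda it: isinstance(it, tuple),
                     resolve=lambda it: it[1] * 10))  # -> ['k0', 70, 'k1']
```

    Because the indirect fetch is issued while earlier items are still being emitted, its latency overlaps useful work, which is the latency reduction the abstract claims.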

    Completion Signaling Techniques in Distributed Processor

    Publication number: US20210279832A1

    Publication date: 2021-09-09

    Application number: US16812724

    Application date: 2020-03-09

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to tracking compute workgroup completions in a distributed processor. In some embodiments, an apparatus includes a plurality of shader processors configured to perform operations for compute workgroups included in compute kernels, a master workload parser circuit, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and the master workload parser circuit. In some embodiments, a distributed workload parser circuit is configured to maintain, for each of a set of the shader processors, a data structure that specifies a count of workgroup completions for one or more kernels processed by the shader processor, determine, for the set of shader processors based on counts of workgroup completions for a first kernel, an aggregate count of completions to report for the first kernel, send the aggregate count to the master workload parser circuit over the communications fabric, and adjust the data structures to reflect counts included in the aggregate count.
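    The per-shader counting and aggregate reporting can be sketched as follows. A minimal sketch under assumed structure (dictionary per shader, a single report call standing in for the fabric message); it is not the patented circuit.

```python
# Hedged sketch: a distributed workload parser keeps, for each shader it
# serves, a per-kernel count of workgroup completions. Reporting sums the
# counts into one aggregate for the master parser and then adjusts the
# local data structures to reflect what was reported.

class DistributedParser:
    def __init__(self, num_shaders):
        # One data structure per shader: kernel -> completion count.
        self.counts = [dict() for _ in range(num_shaders)]

    def workgroup_done(self, shader, kernel):
        self.counts[shader][kernel] = self.counts[shader].get(kernel, 0) + 1

    def report(self, kernel):
        """Aggregate completions for one kernel, then clear the counts
        that were folded into the aggregate."""
        total = 0
        for per_shader in self.counts:
            total += per_shader.pop(kernel, 0)
        return total   # sent to the master parser over the fabric

p = DistributedParser(num_shaders=3)
for shader in (0, 0, 1, 2):
    p.workgroup_done(shader, kernel="k0")
print(p.report("k0"))   # -> 4
print(p.report("k0"))   # -> 0 (counts were adjusted after reporting)
```

    Batching many per-shader completions into one aggregate message is what keeps completion traffic on the communications fabric low.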

    Techniques for context switching using distributed compute workload parsers

    Publication number: US10901777B1

    Publication date: 2021-01-26

    Application number: US16143432

    Application date: 2018-09-26

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to context switching using distributed compute workload parsers. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels, a plurality of distributed workload parser circuits each configured to dispatch workgroups to a respective set of the shader units, a communications fabric, and a master workload parser circuit configured to communicate with the distributed workload parser circuits via the communications fabric. In some embodiments, the master workload parser circuit maintains a first set of master state information that does not change for a compute kernel based on operations by the shader units and a second set of master state information that may be changed by operations specified by the kernel. In some embodiments, the master workload parser circuit performs a multi-phase state storage process in communications with the distributed workload parser circuits.
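    The split between the two sets of master state can be illustrated with a small sketch. Everything here is an assumption for illustration (the field names, the two-phase ordering, the snapshot mechanism); the patent's actual multi-phase protocol is more involved.

```python
# Hedged sketch of the state split described above: master state that is
# fixed for a compute kernel versus state that the kernel's operations
# may change, captured in separate phases during a context switch.

class MasterParserState:
    def __init__(self, grid_dims):
        self.static = {"grid_dims": grid_dims}   # unchanged by shader operations
        self.dynamic = {"next_workgroup": 0}     # advanced as work dispatches

    def dispatch(self):
        wg = self.dynamic["next_workgroup"]
        self.dynamic["next_workgroup"] += 1
        return wg

    def context_save(self):
        # Phase 1: the static set can be snapshotted immediately.
        phase1 = dict(self.static)
        # Phase 2: the dynamic set is snapshotted only once in-flight
        # dispatches have settled (assumed already drained here).
        phase2 = dict(self.dynamic)
        return phase1, phase2

s = MasterParserState(grid_dims=(8, 8))
s.dispatch()
s.dispatch()
print(s.context_save())   # -> ({'grid_dims': (8, 8)}, {'next_workgroup': 2})
```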

    Re-using graphics vertex identifiers for primitive blocks across states

    Publication number: US10269091B1

    Publication date: 2019-04-23

    Application number: US15809687

    Application date: 2017-11-10

    Applicant: Apple Inc.

    Abstract: Techniques are disclosed relating to storage techniques for storing primitive information with vertex re-use. In some embodiments, graphics circuitry aggregates primitive information (including vertex data) for multiple primitives into a primitive block data structure. This may include storing only a single instance of a vertex for multiple primitives that share the vertex. The graphics circuitry may switch between primitive blocks, with one being active and the others non-active. For non-active primitive blocks, the graphics circuitry may track whether vertex identifiers have been used for a new vertex, which may prevent vertex re-use. If an identifier is not used for a new vertex, however, a vertex may be re-used across deactivation and reactivation of a primitive block.
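    The single-instance vertex storage can be sketched directly. This is an illustrative data structure, not the disclosed hardware; the class shape and the use of 2-D tuples as vertex data are assumptions.

```python
# Illustrative sketch of vertex re-use in a primitive block: each unique
# vertex is stored once, and primitives reference vertices by identifier.

class PrimitiveBlock:
    def __init__(self):
        self.vertices = []        # a single instance per unique vertex
        self.vertex_ids = {}      # vertex data -> identifier
        self.primitives = []      # triangles as triples of identifiers

    def add_vertex(self, v):
        if v not in self.vertex_ids:           # re-use across primitives
            self.vertex_ids[v] = len(self.vertices)
            self.vertices.append(v)
        return self.vertex_ids[v]

    def add_triangle(self, a, b, c):
        self.primitives.append(tuple(self.add_vertex(v) for v in (a, b, c)))

block = PrimitiveBlock()
block.add_triangle((0, 0), (1, 0), (0, 1))
block.add_triangle((1, 0), (0, 1), (1, 1))   # shares two vertices
print(len(block.vertices))   # -> 4, not 6
```

    The tracking the abstract describes, whether an identifier in a non-active block has been reassigned to a new vertex, is what decides whether such re-use stays valid across deactivation and reactivation.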

    Clause Chaining for Clause-Based Instruction Execution

    Publication number: US20180067748A1

    Publication date: 2018-03-08

    Application number: US15257386

    Application date: 2016-09-06

    Applicant: Apple Inc.

    CPC classification number: G06F9/3867 G06F9/3851 G06F9/3887

    Abstract: Techniques are disclosed relating to clause-based execution of program instructions, which may be single-instruction multiple-data (SIMD) computer instructions. In some embodiments, an apparatus includes execution circuitry configured to receive clauses of instructions and SIMD groups of input data to be operated on by the clauses. In some embodiments, the apparatus further includes one or more storage elements configured to store state information for clauses processed by the execution circuitry. In some embodiments, the apparatus further includes scheduling circuitry configured to send instructions of a first clause and corresponding input data for execution by the execution circuitry and to indicate, prior to sending instructions and input data of a second clause to the execution circuitry for execution, whether the second clause and the first clause are assigned to operate on groups of input data corresponding to the same instruction stream. In some embodiments, the apparatus is configured to determine, based on the indication, whether to maintain as valid, for use by the second clause, stored state information for the first clause.
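    The validity decision can be sketched as a small state machine. The scheduler logic below is an assumption for illustration (stream identifiers as strings, a single stored-state slot), not the disclosed circuit.

```python
# Sketch: keep a clause's stored state valid for the next clause only
# when both clauses operate on SIMD groups from the same instruction
# stream; otherwise the stored state is treated as stale.

class ExecutionState:
    def __init__(self):
        self.stream_id = None
        self.state_valid = False

    def schedule_clause(self, stream_id):
        # Indicated before the next clause is sent: does it share an
        # instruction stream with the previously executed clause?
        same_stream = self.state_valid and stream_id == self.stream_id
        if not same_stream:
            self.stream_id = stream_id   # state from the old stream is stale
        self.state_valid = True
        return same_stream               # True -> reuse stored state

st = ExecutionState()
print(st.schedule_clause("streamA"))   # -> False (nothing stored yet)
print(st.schedule_clause("streamA"))   # -> True  (chained clause)
print(st.schedule_clause("streamB"))   # -> False (different stream)
```

    Chaining clauses from one stream lets the second clause skip re-loading state the first clause already established.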

    UNIFIED INTEGER AND FLOATING-POINT COMPARE CIRCUITRY

    Publication number: US20170357506A1

    Publication date: 2017-12-14

    Application number: US15180725

    Application date: 2016-06-13

    Applicant: Apple Inc.

    CPC classification number: G06F9/30021 G06F9/3001 G06F9/30083

    Abstract: Techniques are disclosed relating to comparison circuitry. In some embodiments, compare circuitry is configured to generate comparison results for sets of inputs in both one or more integer formats and one or more floating-point formats. In some embodiments, the compare circuitry includes padding circuitry configured to add one or more bits to each of first and second input values to generate first and second padded values. In some embodiments, the compare circuitry also includes integer subtraction circuitry configured to subtract the first padded value from the second padded value to generate a subtraction result. In some embodiments, the compare circuitry includes output logic configured to generate the comparison result based on the subtraction result. In various embodiments, using at least a portion of the same circuitry (e.g., the subtractor) for both integer and floating-point comparisons may reduce processor area.
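    The shared-subtractor idea rests on a well-known property of IEEE-754: once a float's bit pattern is remapped to a monotonic integer ordering, the same integer subtract that compares plain integers also compares floats. The sketch below shows that property; the remap shown is a standard software trick and only an analogue of the patent's padding circuitry.

```python
# Sketch: one integer subtraction serves both formats. Floats are first
# remapped so that their bit patterns order the same way as their values
# (for finite inputs; the remap handles the sign-magnitude encoding).

import struct

def float_key(x):
    """Map a binary32 bit pattern to an unsigned integer whose ordering
    matches the float ordering for finite values."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative floats: flip all bits. Non-negative floats: set the top bit.
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000

def compare(a_key, b_key):
    # The shared integer subtraction: the sign of the difference yields
    # less (-1), equal (0), or greater (1).
    diff = a_key - b_key
    return (diff > 0) - (diff < 0)

# Integer inputs use their values directly; floats go through float_key.
print(compare(3, 7))                              # -> -1
print(compare(float_key(-1.5), float_key(0.25)))  # -> -1
print(compare(float_key(2.0), float_key(2.0)))    # ->  0
```

    Reusing one subtractor for both formats is the area saving the abstract points to; only the input remap and the output logic differ per format.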

    Floating-Point Multiply-Add with Down-Conversion

    Publication number: US20170293470A1

    Publication date: 2017-10-12

    Application number: US15092401

    Application date: 2016-04-06

    Applicant: Apple Inc.

    CPC classification number: G06F7/483 G06F7/5443

    Abstract: Techniques are disclosed relating to floating-point operations with down-conversion. In some embodiments, a floating-point unit is configured to perform fused multiply-addition operations based on first and second different instruction types. In some embodiments, the first instruction type specifies fused multiply addition of input operands to generate a result in the first floating-point format, and the second instruction type specifies fused multiply addition of input operands in the first floating-point format to generate a result in a second, lower-precision floating-point format. For example, the first format may be a 32-bit format and the second format may be a 16-bit format. In some embodiments, the floating-point unit includes rounding circuitry, exponent circuitry, and/or increment circuitry configured to generate signals for the second instruction type in the same pipeline stage as for the first instruction type. In some embodiments, disclosed techniques may reduce the number of pipeline stages included in the floating-point circuitry.
