-
Publication number: US12175300B2
Publication date: 2024-12-24
Application number: US17399759
Application date: 2021-08-11
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, Steven Fishwick, Melissa L. Velez
Abstract: Disclosed embodiments relate to software control of graphics hardware that supports logical slots. In some embodiments, a GPU includes circuitry that implements a plurality of logical slots and a set of graphics processor sub-units that each implement multiple distributed hardware slots. Control circuitry may determine mappings between logical slots and distributed hardware slots for different sets of graphics work. Various aspects of the mapping may be software-controlled. For example, software may specify one or more of the following: priority information for a set of graphics work, whether to retain the mapping after the work completes, a distribution rule, a target group of sub-units, a sub-unit mask, a scheduling policy, whether to reclaim hardware slots from another logical slot, and so on. Software may also query the status of the work.
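As an illustration of the kind of mapping the control circuitry might perform, here is a minimal Python sketch. All names (SubUnit, map_logical_slot, the single distribution rule, and the two-slots-per-sub-unit sizing) are invented for this example; the patent's actual scheduling policies, priority handling, and slot-reclaim behavior are not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SubUnit:
    """A graphics processor sub-unit with a fixed number of distributed hardware slots."""
    index: int
    num_slots: int
    owner: list = field(default_factory=list)  # owner[i] = logical slot id or None

    def __post_init__(self):
        self.owner = [None] * self.num_slots

    def free_slot(self) -> Optional[int]:
        for i, logical_id in enumerate(self.owner):
            if logical_id is None:
                return i
        return None

def map_logical_slot(logical_id, sub_units, sub_unit_mask, distribute_to_all):
    """Map one logical slot onto distributed hardware slots.

    sub_unit_mask restricts which sub-units are eligible (a software-specified
    target group).  distribute_to_all=True claims one hardware slot on every
    eligible sub-unit; False claims a single slot on the first eligible
    sub-unit with one free.  Returns (sub_unit_index, hw_slot) pairs,
    possibly empty if nothing is available.
    """
    mapping = []
    for su in sub_units:
        if not (sub_unit_mask >> su.index) & 1:
            continue
        slot = su.free_slot()
        if slot is None:
            continue
        su.owner[slot] = logical_id
        mapping.append((su.index, slot))
        if not distribute_to_all:
            break
    return mapping

gpu = [SubUnit(index=i, num_slots=2) for i in range(4)]
print(map_logical_slot(0, gpu, sub_unit_mask=0b1111, distribute_to_all=True))
print(map_logical_slot(1, gpu, sub_unit_mask=0b0011, distribute_to_all=False))
```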
-
Publication number: US20210358078A1
Publication date: 2021-11-18
Application number: US17334139
Application date: 2021-05-28
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, Dzung Q. Vu, Liang Kai Wang
Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries, each configured to store a set of decoded instructions belonging to the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.
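A toy model of the caching circuitry described above might look like the following Python sketch. The FIFO replacement policy, entry count, and instruction strings are assumptions made for illustration; only the "decode once, replay per SIMD group" behavior is taken from the abstract.

```python
class ClauseCache:
    """Caching circuitry model: each entry stores the decoded instructions of one clause."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}   # clause_id -> list of decoded instructions
        self.order = []     # insertion order, used as a simple FIFO replacement policy

    def insert(self, clause_id, decoded_instructions):
        """Fill an entry with a clause's decoded instructions (decode happens once)."""
        if clause_id in self.entries:
            return
        if len(self.entries) == self.num_entries:
            victim = self.order.pop(0)
            del self.entries[victim]
        self.entries[clause_id] = list(decoded_instructions)
        self.order.append(clause_id)

    def execute(self, clause_id, simd_group):
        """Replay the cached clause for one SIMD group without decoding again."""
        for instruction in self.entries[clause_id]:
            print(f"SIMD group {simd_group}: {instruction}")

cache = ClauseCache(num_entries=4)
cache.insert("clause_7", ["fmul r0, r1, r2", "fadd r3, r0, r4"])
for group in (0, 1):   # the same stored clause is executed for two SIMD groups
    cache.execute("clause_7", group)
```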
-
Publication number: US11023997B2
Publication date: 2021-06-01
Application number: US15657531
Application date: 2017-07-24
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, Dzung Q. Vu, Liang Kai Wang
Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries, each configured to store a set of decoded instructions belonging to the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.
-
Publication number: US09594395B2
Publication date: 2017-03-14
Application number: US14160179
Application date: 2014-01-21
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, James S. Blomgren, Terence M. Potter
CPC classification number: G06F1/10, G06F1/32, G06F1/3243, G06F9/30014, G06F9/3869, G06F9/3871, G06F9/3887, Y02D10/152
Abstract: Techniques are disclosed relating to clock routing in processors with both pipelined and non-pipelined circuitry. In some embodiments, an apparatus includes execution units that are non-pipelined and configured to perform instructions without receiving a clock signal. In these embodiments, one or more clock lines routed throughout the apparatus do not extend into the one or more execution units in each pipeline, reducing the length of the clock lines. In some embodiments, the apparatus includes multiple such pipelines arranged in an array, with the execution units located on an outer portion of the array and clocked control circuitry located on an inner portion of the array. In some embodiments, clock lines do not extend into the outer portion of the array. In some embodiments, the array includes one or more rows of execution units. These arrangements may further reduce the length of clock lines.
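The floorplan idea can be made concrete with a small back-of-the-envelope model. The sketch below is purely illustrative: the 6x6 tile grid, the inner/outer classification, and the trunk-and-branch length metric are all assumptions, not details from the patent; it only shows why routing the clock to the inner control tiles alone shortens the clock lines.

```python
# Toy floorplan: tiles on a square grid.  Execution units sit on the outer ring
# and are self-timed (no clock); clocked control circuitry fills the inner region.
def is_outer(row, col, size):
    return row in (0, size - 1) or col in (0, size - 1)

def clocked_tiles(size):
    """Tiles that a clock line must actually reach (inner control tiles only)."""
    return [(r, c) for r in range(size) for c in range(size)
            if not is_outer(r, c, size)]

def clock_tree_length(sinks):
    """Crude length estimate: one vertical trunk plus one horizontal branch per row."""
    if not sinks:
        return 0
    rows = sorted({r for r, _ in sinks})
    trunk = rows[-1] - rows[0]
    branches = sum(
        max(c for rr, c in sinks if rr == r) - min(c for rr, c in sinks if rr == r)
        for r in rows
    )
    return trunk + branches

size = 6
every_tile = [(r, c) for r in range(size) for c in range(size)]
print("clock routed to every tile:       ", clock_tree_length(every_tile))
print("clock routed to inner tiles only: ", clock_tree_length(clocked_tiles(size)))
```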
-
Publication number: US20150205324A1
Publication date: 2015-07-23
Application number: US14160179
Application date: 2014-01-21
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, James S. Blomgren, Terence M. Potter
CPC classification number: G06F1/10, G06F1/32, G06F1/3243, G06F9/30014, G06F9/3869, G06F9/3871, G06F9/3887, Y02D10/152
Abstract: Techniques are disclosed relating to clock routing in processors with both pipelined and non-pipelined circuitry. In some embodiments, an apparatus includes execution units that are non-pipelined and configured to perform instructions without receiving a clock signal. In these embodiments, one or more clock lines routed throughout the apparatus do not extend into the one or more execution units in each pipeline, reducing the length of the clock lines. In some embodiments, the apparatus includes multiple such pipelines arranged in an array, with the execution units located on an outer portion of the array and clocked control circuitry located on an inner portion of the array. In some embodiments, clock lines do not extend into the outer portion of the array. In some embodiments, the array includes one or more rows of execution units. These arrangements may further reduce the length of clock lines.
-
Publication number: US20150035841A1
Publication date: 2015-02-05
Application number: US13956299
Application date: 2013-07-31
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, James S. Blomgren, Terence M. Potter
CPC classification number: G06T1/20, G06F9/3012, G06F9/30138, G06F9/3826, G06F9/3851, G06F9/3867, G06F9/3873, G06T1/60
Abstract: Techniques are disclosed relating to a multithreaded execution pipeline. In some embodiments, an apparatus is configured to assign to an execution pipeline a number of threads that is an integer multiple of the minimum number of cycles an execution unit uses to generate an execution result from a given set of input operands. In one embodiment, the apparatus is configured to require strict ordering of the threads. In one embodiment, the apparatus is configured so that the same thread accesses (e.g., reads and writes) a register file in a given cycle. In one embodiment, the apparatus is configured so that the same thread does not write back both an operand and a result to an operand cache in a given cycle.
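One way to see why the thread count is tied to the execution latency: if each thread owns the register file on a fixed round-robin turn, and the number of threads is at least (here, an integer multiple of) the execution unit's latency, then a thread's previous result is always ready by its next turn, so the same thread does its write-back and its operand reads in one cycle. The Python sketch below is a toy timing model built on that reading of the abstract; the specific latency, thread count, and turn-taking scheme are assumptions.

```python
EXECUTION_LATENCY = 4                 # minimum cycles to produce a result from its operands
NUM_THREADS = 2 * EXECUTION_LATENCY   # thread count: an integer multiple of that latency

def register_file_turn(cycle):
    """Strict round-robin ordering: exactly one thread owns the register file per cycle."""
    return cycle % NUM_THREADS

for cycle in range(NUM_THREADS, NUM_THREADS + 8):
    thread = register_file_turn(cycle)
    issued_at = cycle - NUM_THREADS            # this thread's previous register-file turn
    ready_at = issued_at + EXECUTION_LATENCY   # when that result left the execution unit
    # Because NUM_THREADS >= EXECUTION_LATENCY, the result is always ready in time,
    # so the write-back and the next operand reads belong to the same thread this cycle.
    assert ready_at <= cycle
    print(f"cycle {cycle}: thread {thread} writes back its result from cycle {issued_at} "
          f"(ready at {ready_at}) and reads operands for its next instruction")
```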
-
Publication number: US20230051906A1
Publication date: 2023-02-16
Application number: US17399759
Application date: 2021-08-11
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, Steven Fishwick, Melissa L. Velez
Abstract: Disclosed embodiments relate to software control of graphics hardware that supports logical slots. In some embodiments, a GPU includes circuitry that implements a plurality of logical slots and a set of graphics processor sub-units that each implement multiple distributed hardware slots. Control circuitry may determine mappings between logical slots and distributed hardware slots for different sets of graphics work. Various aspects of the mapping may be software-controlled. For example, software may specify one or more of the following: priority information for a set of graphics work, whether to retain the mapping after the work completes, a distribution rule, a target group of sub-units, a sub-unit mask, a scheduling policy, whether to reclaim hardware slots from another logical slot, and so on. Software may also query the status of the work.
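This publication shares its abstract with the granted patent listed first above; rather than repeat that mapping sketch, the following toy Python fragment illustrates a different software-controlled aspect the abstract mentions: letting a higher-priority logical slot reclaim a distributed hardware slot currently held by a lower-priority one. The priority scheme, data layout, and function names are all assumptions for illustration.

```python
# hardware_slots[i] describes one distributed hardware slot on some sub-unit:
# which logical slot holds it and that logical slot's software-specified priority.
hardware_slots = [
    {"hw_slot": 0, "logical": 3, "priority": 1},
    {"hw_slot": 1, "logical": 5, "priority": 0},   # lowest-priority current holder
    {"hw_slot": 2, "logical": 2, "priority": 2},
]

def reclaim_for(logical_id, priority, slots):
    """Reclaim the lowest-priority hardware slot if the requester outranks its holder.

    Returns the reclaimed hardware slot index, or None if every current holder
    has priority greater than or equal to the requester's.
    """
    victim = min(slots, key=lambda s: s["priority"])
    if victim["priority"] >= priority:
        return None
    victim.update(logical=logical_id, priority=priority)
    return victim["hw_slot"]

print(reclaim_for(logical_id=7, priority=2, slots=hardware_slots))  # 1: slot taken from logical 5
print(reclaim_for(logical_id=8, priority=0, slots=hardware_slots))  # None: nothing lower priority
```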
-
Publication number: US11080101B2
Publication date: 2021-08-03
Application number: US16361910
Application date: 2019-03-22
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, Jason D. Carroll, Karl D. Mann
Abstract: Techniques are disclosed relating to processing a control stream such as a compute control stream. In some embodiments, the control stream includes kernels and commands for multiple substreams. In some embodiments, multiple substream processors are each configured to: fetch and parse portions of the control stream corresponding to an assigned substream and, in response to a neighbor barrier command in the assigned substream that identifies another substream, communicate the identified other substream to a barrier clearing circuitry. In some embodiments, the barrier clearing circuitry is configured to determine whether to allow the assigned substream to proceed past the neighbor barrier command based on communication of a most-recently-completed command from a substream processor to which the other substream is assigned (e.g., based on whether the most-recently-completed command meets a command identifier communicated in the neighbor barrier command). The disclosed techniques may facilitate parallel control stream parsing and substream synchronization.
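A compact way to picture the barrier-clearing behavior is the sketch below: each substream processor reports the identifier of its most recently completed command, and a barrier that names another substream clears once that substream's reported identifier reaches the one carried in the barrier. Class and method names are invented for this illustration; the actual fetch/parse machinery is not modeled.

```python
class BarrierClearingCircuit:
    """Tracks the most-recently-completed command identifier for each substream."""

    def __init__(self, num_substreams):
        self.completed = [0] * num_substreams   # 0 means "nothing completed yet"

    def report_completion(self, substream, command_id):
        # Substream processors communicate their most-recently-completed command here.
        self.completed[substream] = max(self.completed[substream], command_id)

    def barrier_cleared(self, other_substream, required_command_id):
        """A neighbor barrier naming `other_substream` clears once that substream
        has completed at least the command identified in the barrier."""
        return self.completed[other_substream] >= required_command_id

circuit = BarrierClearingCircuit(num_substreams=2)

# Substream 1 parses a neighbor barrier command that names substream 0, command 3.
print(circuit.barrier_cleared(other_substream=0, required_command_id=3))   # False: substream 1 stalls

# Substream 0's processor makes progress and reports completions.
circuit.report_completion(substream=0, command_id=3)
print(circuit.barrier_cleared(other_substream=0, required_command_id=3))   # True: substream 1 proceeds
```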
-
Publication number: US20200097293A1
Publication date: 2020-03-26
Application number: US16143416
Application date: 2018-09-26
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, Jeffrey T. Brady
Abstract: Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.
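The two-pointer arrangement can be sketched as follows. In this toy Python model the fetch parse pointer runs over the pre-fetched buffer first and resolves every indirect access up front, while the execute parse pointer trails behind and pairs each indirect item with a result from the indirect-fetch queue; real hardware would overlap the two and bound the buffers, and all item formats here are invented.

```python
from collections import deque

# Items pre-fetched from the compute command stream into a buffer.  Items marked
# "indirect" name a memory location whose contents are needed before execution.
buffer = [
    {"kind": "direct",   "payload": "kernel A header"},
    {"kind": "indirect", "address": 0x1000},
    {"kind": "direct",   "payload": "kernel B header"},
]

memory = {0x1000: "indirectly fetched kernel arguments"}   # stand-in for device memory
indirect_results = deque()    # results produced by the indirect-fetch circuitry

# Fetch parse pointer: scan ahead, detect indirect-data-access items, and hand
# them to the indirect-fetch circuitry so their data is ready early.
for item in buffer:
    if item["kind"] == "indirect":
        indirect_results.append(memory[item["address"]])

# Execute parse pointer: trail behind, emitting buffered item data and pairing
# each indirect item with the corresponding indirect-fetch result.
for item in buffer:
    if item["kind"] == "indirect":
        print("execute:", indirect_results.popleft())
    else:
        print("execute:", item["payload"])
```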
-
Publication number: US10475152B1
Publication date: 2019-11-12
Application number: US15896831
Application date: 2018-02-14
Applicant: Apple Inc.
Inventor: Andrew M. Havlir, Jeffrey T. Brady
IPC: G06T1/20, G06F12/0891, G06F9/38
Abstract: Techniques are disclosed relating to managing dependencies in a compute control stream that specifies operations to be performed on a programmable shader (e.g., of a graphics unit). In some embodiments, the compute control stream includes commands and kernels. In some embodiments, dependency circuitry is configured to maintain dependencies such that younger kernels are allowed to execute ahead of a type of cache-related command (e.g., a command that signals a cache flush and/or invalidate). Disclosed circuitry may include separate buffers for commands and kernels, command dependency circuitry, and kernel dependency circuitry. In various embodiments, the disclosed architecture may improve performance in a highly scalable manner.
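The "younger kernels may pass a cache-flush command" rule can be sketched with two buffers and a sequence number, as below. The command types, the single bypass rule, and the scheduling loop are assumptions for illustration only; the patent's dependency circuitry is certainly more involved.

```python
from collections import deque

class ComputeControlStreamModel:
    """Separate buffers for commands and kernels, tagged with stream order."""

    def __init__(self):
        self.commands = deque()   # (sequence_number, command_type)
        self.kernels = deque()    # (sequence_number, kernel_name)

    def next_work(self):
        if self.kernels:
            kernel_seq, kernel = self.kernels[0]
            older = [c for c in self.commands if c[0] < kernel_seq]
            # A younger kernel may execute ahead of older cache-flush/invalidate
            # commands, but must wait behind any other kind of older command.
            if all(cmd_type == "cache_flush" for _, cmd_type in older):
                self.kernels.popleft()
                return f"run kernel {kernel}"
        if self.commands:
            _, cmd_type = self.commands.popleft()
            return f"process command {cmd_type}"
        return None

stream = ComputeControlStreamModel()
stream.commands.append((1, "cache_flush"))
stream.kernels.append((2, "K0"))        # younger than the flush, yet runs first
stream.commands.append((3, "barrier"))
stream.kernels.append((4, "K1"))        # must wait for the barrier to be processed

work = stream.next_work()
while work is not None:
    print(work)
    work = stream.next_work()
```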