-
Publication Number: US20230047481A1
Publication Date: 2023-02-16
Application Number: US17399784
Filing Date: 2021-08-11
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Ajay Simha Modugala , Benjamin Bowman , Yunjun Zhang
Abstract: Techniques are disclosed relating to affinity-based scheduling of graphics work. In disclosed embodiments, first and second groups of graphics processor sub-units may share respective first and second caches. Distribution circuitry may receive a software-specified set of graphics work and a software-indicated mapping of portions of the set of graphics work to groups of graphics processor sub-units. The distribution circuitry may assign subsets of the set of graphics work based on the mapping. This may improve cache efficiency, in some embodiments, by allowing graphics work that accesses the same memory areas to be assigned to the same group of sub-units that share a cache.
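A minimal Python sketch of the affinity idea this abstract describes: a software-indicated mapping routes portions of the graphics work to groups of sub-units that share a cache, so work touching the same memory lands on the same group. The function name, tuple layout, and round-robin fallback are illustrative assumptions, not the claimed distribution circuitry.

```python
# Illustrative model of affinity-based work distribution (assumed structures).
from collections import defaultdict

def distribute(work_portions, affinity_map, num_groups):
    """Assign each portion of graphics work to a group of sub-units.

    work_portions : list of (portion_id, memory_region) tuples
    affinity_map  : software-indicated mapping {memory_region: group_index}
    num_groups    : number of sub-unit groups, each sharing one cache
    """
    assignments = defaultdict(list)
    fallback = 0
    for portion_id, region in work_portions:
        if region in affinity_map:
            group = affinity_map[region] % num_groups
        else:
            # No affinity hint: fall back to round-robin for load balance.
            group = fallback
            fallback = (fallback + 1) % num_groups
        assignments[group].append(portion_id)
    return dict(assignments)

portions = [(0, "tileA"), (1, "tileA"), (2, "tileB"), (3, "tileC")]
print(distribute(portions, {"tileA": 0, "tileB": 1}, num_groups=2))
# Portions that touch "tileA" land on group 0 and can reuse that group's shared cache.
```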
-
Publication Number: US20220083377A1
Publication Date: 2022-03-17
Application Number: US17018913
Filing Date: 2020-09-11
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Ajay Simha Modugala , Karl D. Mann
Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, a graphics processor executes instructions of compute kernels. Workload parser circuitry may determine, for distribution to the graphics processor circuitry, a set of workgroups from a compute kernel that includes workgroups organized in multiple dimensions, including a first number of workgroups in a first dimension and a second number of workgroups in a second dimension. This may include determining multiple sub-kernels for the compute kernel, wherein a first sub-kernel includes, in the first dimension, a limited number of workgroups that is smaller than the first number of workgroups. The parser circuitry may iterate through workgroups in both the first and second dimensions to generate the set of workgroups, proceeding through the first sub-kernel before iterating through any of the other sub-kernels. Disclosed techniques may provide desirable shapes for batches of workgroups.
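As a rough sketch of the iteration order the abstract describes, the following Python generator splits a two-dimensional kernel into sub-kernels limited in the first dimension and walks all workgroups of one sub-kernel before moving to the next; the limit value and coordinate layout are assumptions for illustration.

```python
def iterate_workgroups(dim0, dim1, limit0):
    """Yield (x, y) workgroup coordinates for a dim0 x dim1 kernel, split into
    sub-kernels of at most `limit0` workgroups in dimension 0. Each sub-kernel
    is walked completely before the next sub-kernel starts."""
    for base in range(0, dim0, limit0):            # one sub-kernel per slice of dim 0
        for y in range(dim1):                      # iterate the second dimension
            for x in range(base, min(base + limit0, dim0)):
                yield (x, y)

# A 6x2 kernel with sub-kernels limited to 4 workgroups in dimension 0:
print(list(iterate_workgroups(6, 2, 4)))
```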
-
Publication Number: US20200098160A1
Publication Date: 2020-03-26
Application Number: US16143412
Filing Date: 2018-09-26
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Benjamin Bowman , Jeffrey T. Brady
Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups. In some embodiments, the distributed parsers maintain coordinate information for the kernel and update the coordinate information in response to the batch information, even when the distributed parsers are not assigned to execute the batch.
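A small Python model of the hierarchy sketched here (and in the granted patent listed next): a master parser walks the kernel in multiple dimensions and broadcasts batch information, while every distributed parser updates its local coordinate state even when the batch is assigned elsewhere. The class names, round-robin assignment, and broadcast loop stand in for the fabric and are assumptions.

```python
class DistributedParser:
    """Tracks kernel coordinates locally; executes only batches assigned to it."""
    def __init__(self, ident):
        self.ident = ident
        self.coord = (0, 0)          # locally maintained kernel position
        self.executed = []

    def receive(self, batch_coord, batch_size, assigned_to):
        # Every parser updates its coordinates from the broadcast,
        # even when the batch is assigned to a different parser.
        self.coord = batch_coord
        if assigned_to == self.ident:
            self.executed.append((batch_coord, batch_size))

def master_dispatch(dim0, dim1, batch, parsers):
    """Master parser: walk the kernel and broadcast each batch over the 'fabric'."""
    target = 0
    for y in range(dim1):
        for x in range(0, dim0, batch):
            size = min(batch, dim0 - x)
            for p in parsers:
                p.receive((x, y), size, assigned_to=target)
            target = (target + 1) % len(parsers)   # simple round-robin assignment

parsers = [DistributedParser(i) for i in range(2)]
master_dispatch(dim0=8, dim1=2, batch=4, parsers=parsers)
print([p.executed for p in parsers], [p.coord for p in parsers])
```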
-
Publication Number: US10593094B1
Publication Date: 2020-03-17
Application Number: US16143412
Filing Date: 2018-09-26
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Benjamin Bowman , Jeffrey T. Brady
Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups. In some embodiments, the distributed parsers maintain coordinate information for the kernel and update the coordinate information in response to the batch information, even when the distributed parsers are not assigned to execute the batch.
-
Publication Number: US20170323420A1
Publication Date: 2017-11-09
Application Number: US15657531
Filing Date: 2017-07-24
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Dzung Q. Vu , Liang Kai Wang
CPC classification number: G06T1/60 , G06F9/30145 , G06F9/3017 , G06F9/38 , G06F9/3838 , G06F9/3851 , G06F9/3853 , G06F9/3887 , G06F12/08 , G06F12/0875 , G06F2212/452 , G06T1/20 , Y02D10/13
Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions belonging to the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.
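A toy Python sketch of the caching idea: decoded instructions are stored per clause, keyed by clause identifier, so a clause decoded once can be executed repeatedly (for example, for different SIMD groups) without re-decoding. The capacity, eviction policy, and decode callback are assumptions made for illustration.

```python
class ClauseCache:
    """Toy model: each entry stores the decoded instructions of one clause."""
    def __init__(self, num_entries):
        self.entries = {}            # clause_id -> list of decoded instructions
        self.capacity = num_entries

    def lookup_or_fill(self, clause_id, decode_fn, raw_instructions):
        if clause_id in self.entries:
            return self.entries[clause_id]               # hit: reuse decoded clause
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # naive eviction
        decoded = [decode_fn(i) for i in raw_instructions]
        self.entries[clause_id] = decoded
        return decoded

cache = ClauseCache(num_entries=4)
decode = lambda ins: ("decoded", ins)
# The same clause is executed for several SIMD groups while it stays cached;
# only the first lookup pays the decode cost.
for simd_group in range(3):
    clause = cache.lookup_or_fill(clause_id=7, decode_fn=decode,
                                  raw_instructions=["add r0, r1", "mul r2, r0"])
print(len(cache.entries), clause)
```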
-
Publication Number: US20170075810A1
Publication Date: 2017-03-16
Application Number: US14851859
Filing Date: 2015-09-11
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Terence M. Potter , Liang-Kai Wang
IPC: G06F12/08
CPC classification number: G06F12/0884 , G06F9/3012 , G06F9/3824 , G06F9/383 , G06F9/3834 , G06F9/3838 , G06F9/3859 , G06F12/0848 , G06F2212/604
Abstract: Techniques are disclosed relating to per-pipeline control for an operand cache. In some embodiments, an apparatus includes a register file and multiple execution pipelines. In some embodiments, the apparatus also includes an operand cache that includes multiple entries that each include multiple portions that are each configured to store an operand for a corresponding execution pipeline. In some embodiments, the operand cache is configured, during operation of the apparatus, to store data in only a subset of the portions of an entry. In some embodiments, the apparatus is configured to store, for each entry in the operand cache, a per-entry validity value that indicates whether the entry is valid and per-portion state information that indicates whether data for each portion is valid and whether data for each portion is modified relative to data in a corresponding entry in the register file.
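A minimal sketch, in Python, of the per-entry and per-portion state the abstract describes: each operand-cache entry holds one operand portion per execution pipeline, with a per-entry valid bit and per-portion valid/modified bits, so only a subset of an entry's portions may hold data. The field names and the write method are assumptions, not the patented structure.

```python
from dataclasses import dataclass, field

@dataclass
class OperandCacheEntry:
    num_pipes: int
    entry_valid: bool = False                            # per-entry validity value
    portion_valid: list = field(default_factory=list)    # per-portion validity
    portion_dirty: list = field(default_factory=list)    # modified vs. register file
    data: list = field(default_factory=list)

    def __post_init__(self):
        self.portion_valid = [False] * self.num_pipes
        self.portion_dirty = [False] * self.num_pipes
        self.data = [None] * self.num_pipes

    def write(self, pipe, value, modified):
        """Store an operand for one pipeline only; other portions stay untouched."""
        self.entry_valid = True
        self.data[pipe] = value
        self.portion_valid[pipe] = True
        self.portion_dirty[pipe] = modified

entry = OperandCacheEntry(num_pipes=4)
entry.write(pipe=1, value=3.5, modified=True)   # only a subset of portions is filled
print(entry.entry_valid, entry.portion_valid, entry.portion_dirty)
```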
-
Publication Number: US09378146B2
Publication Date: 2016-06-28
Application Number: US13971811
Filing Date: 2013-08-20
Applicant: Apple Inc.
Inventor: James S. Blomgren , Terence M. Potter , Timothy A. Olson , Andrew M. Havlir
CPC classification number: G06F12/0875 , G06F9/30043 , G06F9/30138 , G06F9/30145 , G06F9/30185
Abstract: Instructions may require one or more operands to be executed, which may be provided from a register file. In the context of a GPU, however, a register file may be a relatively large structure, and reading from a register file may be energy and/or time intensive. An operand cache may be used to store a subset of operands, and may use less power and have quicker access times than the register file. Selectors (e.g., multiplexers) may be used to read operands from the operand cache. Power savings may be achieved in some embodiments by activating only a subset of the selectors, which may be done by activators (e.g., flip-flops). Operands may also be concurrently provided to two or more locations via forwarding, which may be accomplished via a source selection unit in some embodiments. Operand forwarding may also reduce power consumption and/or speed up execution in one or more embodiments.
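A hedged Python sketch of the two ideas above: only the selectors for the operands actually requested are activated, and a single operand value can be forwarded to two consumers at once instead of being read twice. The dictionary-based cache and counters are illustrative stand-ins for the hardware muxes and flip-flops.

```python
def read_operands(operand_cache, requested_slots):
    """Activate only the selectors (muxes) for the slots actually requested;
    untouched selectors stay idle, modeling the power-gating idea."""
    activated = 0
    results = {}
    for slot in requested_slots:
        activated += 1                     # enable this selector only
        results[slot] = operand_cache[slot]
    return results, activated

cache = {"r0": 1.0, "r1": 2.0, "r2": 3.0, "r3": 4.0}
values, active_selectors = read_operands(cache, ["r1", "r3"])

# Forwarding: the same source value is delivered to two consumers at once
# instead of being read again from the cache or register file.
forwarded = values["r1"]
consumer_a, consumer_b = forwarded, forwarded
print(values, active_selectors, consumer_a, consumer_b)
```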
-
Publication Number: US20160055833A1
Publication Date: 2016-02-25
Application Number: US14463271
Filing Date: 2014-08-19
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Michael A. Geary , Robert Kenney
IPC: G09G5/399
CPC classification number: G09G5/399 , G06T1/60 , G09G5/393 , G09G2330/021
Abstract: Embodiments of a unified shading controller are disclosed. The embodiments may provide a first functional unit configured to send a write request to a second functional unit. The write request may include data, and the data may include one or more control bits. Upon receiving the write request, the second functional unit may check the one or more control bits and hold the data in a given queue depending on the control bits.
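A brief sketch of the handshake described above, assuming a simple encoding in which one control bit selects between two queues; the bit meanings, queue names, and class structure are illustrative assumptions only.

```python
from collections import deque

class ReceivingUnit:
    """Second functional unit: inspects control bits and queues the data accordingly."""
    def __init__(self):
        self.queues = {"high_priority": deque(), "normal": deque()}

    def handle_write(self, data, control_bits):
        # Assumed encoding: bit 0 of the control bits selects the high-priority queue.
        queue = "high_priority" if control_bits & 0b1 else "normal"
        self.queues[queue].append(data)
        return queue

unit = ReceivingUnit()
print(unit.handle_write(data=b"\x01\x02", control_bits=0b1))   # -> high_priority
print(unit.handle_write(data=b"\x03\x04", control_bits=0b0))   # -> normal
```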
-
Publication Number: US20150058571A1
Publication Date: 2015-02-26
Application Number: US13971782
Filing Date: 2013-08-20
Applicant: Apple Inc.
Inventor: Terence M. Potter , Timothy A. Olson , James S. Blomgren , Andrew M. Havlir , Michael Geary
CPC classification number: G06F9/30043 , G06F9/38 , G06F12/0862 , G06F12/0875 , G06F2212/452 , G06T1/60 , Y02D10/13
Abstract: Instructions may require one or more operands to be executed, which may be provided from a register file. In the context of a GPU, however, a register file may be a relatively large structure, and reading from the register file may be energy and/or time intensive. An operand cache may be used to store a subset of operands, and may use less power and have quicker access times than the register file. Hint values may be used in some embodiments to suggest that a particular operand should be stored in the operand cache (so that it is available for current or future use). In one embodiment, a hint value indicates that an operand should be cached whenever possible. Hint values may be determined by software, such as a compiler, in some embodiments. One or more criteria may be used to determine hint values, such as how soon in the future or how frequently an operand will be used again.
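A compiler-style Python sketch of how such hint values might be derived from reuse distance: an operand reused within a short window of instructions is marked for the operand cache. The window size, labels, and lookahead scheme are assumptions, not the patented criteria.

```python
def operand_cache_hints(instruction_operands, window=4):
    """Mark an operand 'cache' if it is reused within `window` instructions;
    the threshold and labels are illustrative."""
    hints = {}
    for i, operands in enumerate(instruction_operands):
        for op in operands:
            # Look ahead for the next instruction that uses this operand again.
            reuse = next((j for j in range(i + 1, len(instruction_operands))
                          if op in instruction_operands[j]), None)
            if reuse is not None and reuse - i <= window:
                hints[op] = "cache"            # reused soon: keep in operand cache
            else:
                hints.setdefault(op, "no-cache")
    return hints

program = [("r0", "r1"), ("r2", "r0"), ("r3",), ("r1",)]
print(operand_cache_hints(program))
# {'r0': 'cache', 'r1': 'cache', 'r2': 'no-cache', 'r3': 'no-cache'}
```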
-
Publication Number: US20150049106A1
Publication Date: 2015-02-19
Application Number: US13970578
Filing Date: 2013-08-19
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Sreevathsa Ramachandra , William V. Miller
IPC: G06T1/60
CPC classification number: G06T1/20 , G06F9/30098 , G06F9/30105 , G06F9/3012 , G06F9/30123 , G06F9/3824 , G06F9/3885
Abstract: Techniques are disclosed relating to arbitration of requests to access a register file. In one embodiment, an apparatus includes a write queue and a register file that includes multiple entries. In one embodiment, the apparatus is configured to select a request from a plurality of requests based on a plurality of request characteristics, and write data from the accepted request into a write queue. In one embodiment, the request characteristics include: whether a request is a last request from an agent for a given register file entry and whether the request finishes a previous request. In one embodiment, a final arbiter is configured to select among requests from the write queue, a read queue, and multiple execution pipelines to access banks of the register file in a given cycle.
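A small sketch of selecting one write request per cycle from the two characteristics the abstract names: whether the request is the last one from an agent for a given register-file entry, and whether it finishes a previous request. The scoring weights and dictionary fields are assumptions for illustration; a final arbiter would then choose among the write queue, the read queue, and the execution pipelines for access to a register-file bank.

```python
def pick_write_request(requests):
    """Select one write request per cycle; the scoring weights are illustrative.

    Each request carries two boolean characteristics:
      'is_last_for_entry' - last request from an agent for a register-file entry
      'finishes_previous' - completes a previously started request
    """
    def score(req):
        return 2 * req["finishes_previous"] + 1 * req["is_last_for_entry"]
    return max(requests, key=score)

requests = [
    {"agent": 0, "entry": 5, "is_last_for_entry": True,  "finishes_previous": False},
    {"agent": 1, "entry": 7, "is_last_for_entry": False, "finishes_previous": True},
]
accepted = pick_write_request(requests)
print(accepted["agent"])   # request from agent 1 wins: it finishes a previous request
```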