-
Publication No.: US10467724B1
Publication Date: 2019-11-05
Application No.: US15896923
Filing Date: 2018-02-14
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Jeffrey T. Brady
Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, workgroup batch circuitry is configured to select (e.g., in a single clock cycle) multiple workgroups to be distributed to different shader circuitry. In some embodiments, iterator circuitry is configured to determine next positions in different dimensions at least partially in parallel. For example, in some embodiments, first circuitry is configured to determine a next position in a first dimension and an increment amount for a second dimension. In some embodiments, second circuitry is configured to determine, at least partially in parallel with the determination of the next position in the first dimension, next positions in the second dimension for multiple possible increment amounts. In some embodiments, this may facilitate a configurable number of workgroups per batch and may increase performance, e.g., by increasing the overall number of workgroups dispatched per clock cycle.
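Code sketch (illustrative, not the patented circuitry): a minimal Python model of the parallel-iterator idea, assuming a 2-D workgroup grid and hypothetical function names. The "first circuitry" computes the next x position and the y increment; the "second circuitry" speculatively computes candidate y positions for every possible increment, so the selection can happen late, as hardware could do at least partially in parallel.
```python
def next_positions(x, y, batch, grid_x, grid_y):
    """Advance a 2-D workgroup iterator by `batch` workgroups per cycle."""
    # First circuitry: next position in dimension 0 and carry into dimension 1.
    total_x = x + batch
    next_x = total_x % grid_x
    y_increment = total_x // grid_x          # how many rows were crossed

    # Second circuitry: candidate next-y values for every possible increment,
    # computed without waiting for y_increment (speculative, in parallel).
    max_increment = (grid_x - 1 + batch) // grid_x
    candidate_y = {inc: (y + inc) % grid_y for inc in range(max_increment + 1)}

    # Late select: pick the candidate that matches the actual increment.
    return next_x, candidate_y[y_increment]


if __name__ == "__main__":
    # Dispatch batches of 4 workgroups from an 8x3 grid.
    pos = (0, 0)
    for _ in range(6):
        print(pos)
        pos = next_positions(*pos, batch=4, grid_x=8, grid_y=3)
```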
-
Publication No.: US10282169B2
Publication Date: 2019-05-07
Application No.: US15092401
Filing Date: 2016-04-06
Applicant: Apple Inc.
Inventor: Liang-Kai Wang , Terence M. Potter , Andrew M. Havlir , Yu Sun , Nicolas X. Pena , Xiao-Long Wu , Christopher A. Burns
Abstract: Techniques are disclosed relating to floating-point operations with down-conversion. In some embodiments, a floating-point unit is configured to perform fused multiply-addition operations based on first and second different instruction types. In some embodiments, the first instruction type specifies fused multiply-addition of input operands in a first floating-point format to generate a result in the first floating-point format, and the second instruction type specifies fused multiply-addition of input operands in the first floating-point format to generate a result in a second, lower-precision floating-point format. For example, the first format may be a 32-bit format and the second format may be a 16-bit format. In some embodiments, the floating-point unit includes rounding circuitry, exponent circuitry, and/or increment circuitry configured to generate signals for the second instruction type in the same pipeline stage as for the first instruction type. In some embodiments, disclosed techniques may reduce the number of pipeline stages included in the floating-point circuitry.
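Code sketch (illustrative, not the hardware datapath): a NumPy approximation of the two instruction semantics. The wide float64 intermediate and the function names are assumptions used only to model the behavior of a fused operation that rounds once to the destination format.
```python
import numpy as np

def fma_f32(a, b, c):
    """First instruction type: fp32 inputs, fp32 result."""
    return np.float32(np.float64(a) * np.float64(b) + np.float64(c))

def fma_f32_to_f16(a, b, c):
    """Second instruction type: fp32 inputs, result down-converted to fp16.

    The wide intermediate is rounded once to fp16, rather than being rounded
    to fp32 first and then to fp16 (which could round twice).
    """
    return np.float16(np.float64(a) * np.float64(b) + np.float64(c))

a, b, c = np.float32(1.1), np.float32(2.3), np.float32(0.7)
print(fma_f32(a, b, c))         # 32-bit result
print(fma_f32_to_f16(a, b, c))  # 16-bit result from the same operands
```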
-
Publication No.: US09846579B1
Publication Date: 2017-12-19
Application No.: US15180725
Filing Date: 2016-06-13
Applicant: Apple Inc.
Inventor: Liang-Kai Wang , Terence M. Potter , Andrew M. Havlir
CPC classification number: G06F9/30021 , G06F9/3001 , G06F9/30083
Abstract: Techniques are disclosed relating to comparison circuitry. In some embodiments, compare circuitry is configured to generate comparison results for sets of inputs in both one or more integer formats and one or more floating-point formats. In some embodiments, the compare circuitry includes padding circuitry configured to add one or more bits to each of first and second input values to generate first and second padded values. In some embodiments, the compare circuitry also includes integer subtraction circuitry configured to subtract the first padded value from the second padded value to generate a subtraction result. In some embodiments, the compare circuitry includes output logic configured to generate the comparison result based on the subtraction result. In various embodiments, using at least a portion of the same circuitry (e.g., the subtractor) for both integer and floating-point comparisons may reduce processor area.
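Code sketch (illustrative): the abstract's padding approach is not reproduced here; instead this sketch uses the related, well-known trick of mapping IEEE-754 bit patterns to monotonically ordered integer keys, so that a single integer subtraction serves both integer and floating-point comparisons. Function names are hypothetical.
```python
import struct

def float_key(x):
    """Map an IEEE-754 single to an integer whose order matches float order.

    Positive floats keep their bit-pattern order (with the sign bit set);
    negative floats are bit-inverted so more-negative values get smaller keys.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000

def compare(a, b, fmt):
    """Shared comparison: one integer subtraction serves both formats."""
    ka, kb = (a, b) if fmt == "int" else (float_key(a), float_key(b))
    diff = ka - kb                      # the shared integer subtractor
    return "lt" if diff < 0 else "gt" if diff > 0 else "eq"

print(compare(3, 7, "int"))          # lt
print(compare(-1.5, 2.25, "float"))  # lt
print(compare(2.25, 2.25, "float"))  # eq
```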
-
Publication No.: US20170024323A1
Publication Date: 2017-01-26
Application No.: US14805124
Filing Date: 2015-07-21
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Terence M. Potter
CPC classification number: G06F12/0875 , G06F9/383 , G06F12/0815 , G06F12/126 , G06F2212/1028 , G06F2212/1056 , G06F2212/452 , G06F2212/6046 , Y02D10/13
Abstract: An apparatus includes an operand cache for storing operands from a register file for use by execution circuitry. In some embodiments, eviction priority for the operand cache is based on the status of entries (e.g., whether dirty or clean) and the retention priority of entries. In some embodiments, flushes are handled differently based on their retention priority (e.g., low-priority entries may be pre-emptively flushed). In some embodiments, timing for cache clean operations is specified on a per-instruction basis. Disclosed techniques may spread out write backs in time, facilitate cache clean operations, facilitate thread switching, extend the time operands are available in an operand cache, and/or improve the use of compiler hints, in some embodiments.
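Code sketch (illustrative): a minimal victim-selection policy combining the two factors the abstract names, clean/dirty status and retention priority. The specific ordering (low retention first, then clean before dirty) is an assumption, not the patented policy.
```python
from dataclasses import dataclass

@dataclass
class Entry:
    tag: int
    dirty: bool            # needs write-back to the register file on eviction
    retention: int         # 0 = low retention priority, higher = keep longer

def pick_victim(entries):
    """Prefer low retention priority, then clean over dirty.

    Clean entries can be dropped without a write-back, so among equally
    unimportant entries they are evicted first.
    """
    return min(entries, key=lambda e: (e.retention, e.dirty))

cache = [Entry(0xA, dirty=True, retention=1),
         Entry(0xB, dirty=False, retention=0),
         Entry(0xC, dirty=True, retention=0)]
print(hex(pick_victim(cache).tag))   # 0xb: low retention priority and clean
```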
-
Publication No.: US20160371810A1
Publication Date: 2016-12-22
Application No.: US14746034
Filing Date: 2015-06-22
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Dzung Q. Vu , Liang Kai Wang
CPC classification number: G06T1/60 , G06F9/30145 , G06F9/3017 , G06F9/38 , G06F9/3838 , G06F9/3851 , G06F9/3853 , G06F9/3887 , G06F12/08 , G06F12/0875 , G06F2212/452 , G06T1/20 , Y02D10/13
Abstract: Techniques are disclosed relating to low-level instruction storage in a graphics unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the hazard circuitry is configured to generate hazard information that specifies dependencies between ones of the decoded graphics instructions in the same clause. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause and hazard information generated by the decode circuitry for the clause. This may reduce power consumption, in some embodiments, by reducing hazard checking when clauses are executed multiple times.
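Code sketch (illustrative): a minimal model of caching a decoded clause together with precomputed intra-clause dependency information, so hazard analysis runs once at fill time instead of on every execution. Data layout and names are assumptions.
```python
def find_hazards(clause):
    """Per instruction, record which earlier instructions in the clause it
    depends on (read-after-write on a destination register)."""
    deps = []
    for i, (dst, srcs) in enumerate(clause):
        deps.append([j for j, (d, _) in enumerate(clause[:i]) if d in srcs])
    return deps

clause_cache = {}   # clause_id -> (decoded clause, hazard info)

def fetch_clause(clause_id, raw_clause):
    """Decode and analyze a clause only on a cache miss; reuse the stored
    hazard information on every later execution of the same clause."""
    if clause_id not in clause_cache:
        clause_cache[clause_id] = (raw_clause, find_hazards(raw_clause))
    return clause_cache[clause_id]

# Each instruction: (destination register, source registers).
clause = [("r0", ["r4", "r5"]), ("r1", ["r0"]), ("r2", ["r0", "r1"])]
print(fetch_clause(7, clause)[1])   # [[], [0], [0, 1]]
```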
-
Publication No.: US09459869B2
Publication Date: 2016-10-04
Application No.: US13971800
Filing Date: 2013-08-20
Applicant: Apple Inc.
Inventor: Timothy A. Olson , Terence M. Potter , James S. Blomgren , Andrew M. Havlir
CPC classification number: G06F9/30043 , G06F9/38 , G06F12/0862 , G06F12/0875 , G06F2212/452 , G06T1/20 , Y02D10/13
Abstract: Instructions may require one or more operands to be executed, which may be provided from a register file. In the context of a GPU, however, a register file may be a relatively large structure, and reading from the register file may be energy and/or time intensive. An operand cache may store a subset of operands, and may use less power and have quicker access times than the register file. In some embodiments, intelligent operand prefetching may speed execution by reducing memory bank conflicts (e.g., conflicts within a register file containing multiple memory banks). An unused operand slot for another instruction (e.g., an instruction that does not require a maximum number of source operands allowed by an instruction set architecture) may be used to prefetch an operand for another instruction in one embodiment. Prefetched operands may be stored in an operand cache, and prefetching may occur based on software-provided information.
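Code sketch (illustrative): an instruction that uses fewer source operands than the ISA allows leaves a free read slot, which this sketch spends on prefetching an upcoming operand into an operand cache. The three-slot limit, register naming, and scheduling policy are assumptions.
```python
MAX_SRC_SLOTS = 3          # assumed ISA limit on source operands per instruction

def schedule_reads(instructions):
    """Fill unused source-operand slots with prefetches for later instructions.

    `instructions` is a list of source-register lists. Operands already staged
    in the operand cache are not read from the register file again.
    """
    operand_cache = set()
    schedule = []
    for i, srcs in enumerate(instructions):
        reads = list(srcs)
        # Upcoming operands that are neither cached nor already read this cycle.
        upcoming = list(dict.fromkeys(
            r for later in instructions[i + 1:] for r in later
            if r not in operand_cache and r not in reads))
        # Spend any free slots on prefetches (could be driven by compiler hints).
        while len(reads) < MAX_SRC_SLOTS and upcoming:
            reads.append(("prefetch", upcoming.pop(0)))
        operand_cache.update(srcs)
        operand_cache.update(item[1] for item in reads if isinstance(item, tuple))
        schedule.append(reads)
    return schedule

# Source operands per instruction.
for reads in schedule_reads([["r1", "r2"], ["r3", "r4", "r5"], ["r6"]]):
    print(reads)   # the first instruction's spare slot prefetches r3
```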
-
Publication No.: US09330432B2
Publication Date: 2016-05-03
Application No.: US13970578
Filing Date: 2013-08-19
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Sreevathsa Ramachandra , William V. Miller
CPC classification number: G06T1/20 , G06F9/30098 , G06F9/30105 , G06F9/3012 , G06F9/30123 , G06F9/3824 , G06F9/3885
Abstract: Techniques are disclosed relating to arbitration of requests to access a register file. In one embodiment, an apparatus includes a write queue and a register file that includes multiple entries. In one embodiment, the apparatus is configured to select a request from a plurality of requests based on a plurality of request characteristics, and write data from the accepted request into a write queue. In one embodiment, the request characteristics include: whether a request is a last request from an agent for a given register file entry and whether the request finishes a previous request. In one embodiment, a final arbiter is configured to select among requests from the write queue, a read queue, and multiple execution pipelines to access banks of the register file in a given cycle.
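Code sketch (illustrative): a minimal model of ranking write requests by the characteristics the abstract names, plus a final per-bank arbiter choosing among the write queue, read queue, and execution pipelines. The priority ordering is an assumption.
```python
from dataclasses import dataclass

@dataclass
class WriteRequest:
    agent: str
    entry: int
    finishes_previous: bool   # completes a partially written entry
    last_for_entry: bool      # agent's final write to this register file entry

def accept_write(requests):
    """Pick one request this cycle; favor requests that free an entry soonest."""
    return max(requests, key=lambda r: (r.finishes_previous, r.last_for_entry))

def final_arbiter(write_queue, read_queue, pipelines):
    """Per-bank grant for this cycle; a fixed priority is assumed for the sketch."""
    for source in (write_queue, read_queue, *pipelines):
        if source:
            return source[0]
    return None

reqs = [WriteRequest("tex", 4, False, True), WriteRequest("alu", 7, True, True)]
print(accept_write(reqs).agent)                          # alu
print(final_arbiter(["wq0"], ["rq0"], [["p0"], []]))     # wq0 under assumed priority
```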
-
Publication No.: US20150058572A1
Publication Date: 2015-02-26
Application No.: US13971800
Filing Date: 2013-08-20
Applicant: Apple Inc.
Inventor: Timothy A. Olson , Terence M. Potter , James S. Blomgren , Andrew M. Havlir
CPC classification number: G06F9/30043 , G06F9/38 , G06F12/0862 , G06F12/0875 , G06F2212/452 , G06T1/20 , Y02D10/13
Abstract: Instructions may require one or more operands to be executed, which may be provided from a register file. In the context of a GPU, however, a register file may be a relatively large structure, and reading from the register file may be energy and/or time intensive. An operand cache may store a subset of operands, and may use less power and have quicker access times than the register file. In some embodiments, intelligent operand prefetching may speed execution by reducing memory bank conflicts (e.g., conflicts within a register file containing multiple memory banks). An unused operand slot for another instruction (e.g., an instruction that does not require a maximum number of source operands allowed by an instruction set architecture) may be used to prefetch an operand for another instruction in one embodiment. Prefetched operands may be stored in an operand cache, and prefetching may occur based on software-provided information.
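Code sketch (illustrative): this entry is the earlier publication of the same application as US09459869B2 above, so as a complementary angle this sketch shows why prefetching into an operand cache can reduce bank conflicts, by counting same-cycle reads that land in one register file bank. The bank count and register-to-bank mapping are assumptions.
```python
NUM_BANKS = 4   # assumed number of register file banks

def bank(reg):
    """Assumed mapping: register number modulo the bank count."""
    return int(reg[1:]) % NUM_BANKS

def conflicts(source_regs, cached=()):
    """Count extra cycles caused by two same-cycle reads hitting one bank.

    Operands already in the operand cache never touch the register file.
    """
    needed = [r for r in source_regs if r not in cached]
    per_bank = {}
    for r in needed:
        per_bank[bank(r)] = per_bank.get(bank(r), 0) + 1
    return sum(n - 1 for n in per_bank.values() if n > 1)

srcs = ["r0", "r4", "r8"]                    # all map to bank 0
print(conflicts(srcs))                       # 2 conflicts without prefetching
print(conflicts(srcs, cached={"r4", "r8"}))  # 0 once two operands were prefetched
```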
-
Publication No.: US12086644B2
Publication Date: 2024-09-10
Application No.: US17399711
Filing Date: 2021-08-11
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Steven Fishwick , David A. Gotwalt , Benjamin Bowman , Ralph C. Taylor , Melissa L. Velez , Mladen Wilder , Ali Rabbani Rankouhi , Fergus W. MacGarry
CPC classification number: G06F9/5044 , G06F9/4881 , G06F9/505 , G06T1/20 , G06T1/60
Abstract: Disclosed techniques relate to work distribution in graphics processors. In some embodiments, an apparatus includes circuitry that implements a plurality of logical slots and a set of graphics processor sub-units that each implement multiple distributed hardware slots. The circuitry may determine different distribution rules for first and second sets of graphics work and map logical slots to distributed hardware slots based on the distribution rules. In various embodiments, disclosed techniques may advantageously distribute work efficiently across distributed shader processors for graphics kicks of various sizes.
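Code sketch (illustrative): a minimal model of choosing a distribution rule per set of graphics work and mapping a logical slot to distributed hardware slots accordingly. The size threshold, rule names, and slot-index mapping are assumptions, not the patented policy.
```python
def pick_rule(kick_size, threshold=64):
    """Assumed policy: small kicks stay on one sub-unit, large kicks spread out."""
    return "single" if kick_size < threshold else "all"

def map_logical_slot(logical_slot, kick_size, num_sub_units):
    """Return the (sub_unit, hardware_slot) pairs this logical slot occupies."""
    rule = pick_rule(kick_size)
    targets = [logical_slot % num_sub_units] if rule == "single" else range(num_sub_units)
    return [(sub_unit, logical_slot) for sub_unit in targets]

print(map_logical_slot(logical_slot=2, kick_size=16, num_sub_units=4))
# [(2, 2)]                          -- small kick: one distributed slot
print(map_logical_slot(logical_slot=2, kick_size=512, num_sub_units=4))
# [(0, 2), (1, 2), (2, 2), (3, 2)]  -- large kick: a slot on every sub-unit
```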
-
Publication No.: US11727530B2
Publication Date: 2023-08-15
Application No.: US17334139
Filing Date: 2021-05-28
Applicant: Apple Inc.
Inventor: Andrew M. Havlir , Dzung Q. Vu , Liang Kai Wang
CPC classification number: G06T1/60 , G06F9/3017 , G06F9/30145 , G06F9/38 , G06F9/3838 , G06F9/3851 , G06F9/3853 , G06F9/3887 , G06F12/08 , G06T1/20 , G06F12/0875 , G06F2212/452 , Y02D10/00
Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.
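Code sketch (illustrative): this abstract is closely related to US20160371810A1 above, so as a complementary angle this sketch shows a cached decoded clause being reused across multiple SIMD groups, decoding once and executing many times. Names and the decode-counting are purely illustrative.
```python
decoded_clauses = {}   # clause_id -> decoded instruction list
decode_count = 0

def decode(raw):
    global decode_count
    decode_count += 1
    return [("decoded", op) for op in raw]

def run_clause(clause_id, raw, simd_group):
    """Decode a clause only the first time; later SIMD groups reuse the cached copy."""
    if clause_id not in decoded_clauses:
        decoded_clauses[clause_id] = decode(raw)
    return simd_group, decoded_clauses[clause_id]

raw_clause = ["fmul", "fadd", "mov"]
for group in range(4):                     # four SIMD groups execute the same clause
    run_clause(clause_id=3, raw=raw_clause, simd_group=group)
print(decode_count)                        # 1: decoded once, executed four times
```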