-
Publication No.: US20240134719A1
Publication Date: 2024-04-25
Application No.: US17973234
Filing Date: 2022-10-24
Applicant: Intel Corporation
Inventors: Fangwen Fu, Chunhui Mei, John A. Wiegert, Yongsheng Liu, Ben J. Ashbaugh
CPC Classification: G06F9/522, G06F9/4881
Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics core including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.
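As a software analogy only, the idea of re-usable named barriers can be sketched with Python threads. The `NamedBarriers` class, the barrier names "produce" and "consume", and the `run_rounds` helper are hypothetical and are not taken from the patent; the sketch only shows that one fixed set of named barriers can serve several synchronization rounds.

```python
import threading


class NamedBarriers:
    """A fixed pool of reusable barriers addressed by name (hypothetical API)."""

    def __init__(self, names, parties):
        # threading.Barrier resets itself once all parties have arrived,
        # so each named barrier can be reused in later rounds.
        self._barriers = {name: threading.Barrier(parties) for name in names}

    def wait(self, name):
        return self._barriers[name].wait()


def run_rounds(num_threads=4, rounds=3):
    barriers = NamedBarriers(["produce", "consume"], num_threads)
    slots = [0] * num_threads
    sums = [[] for _ in range(num_threads)]

    def worker(tid):
        for r in range(rounds):
            slots[tid] = r + tid
            barriers.wait("produce")  # every thread has written its slot
            sums[tid].append(sum(slots))
            barriers.wait("consume")  # every thread has read before the next write

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums
```

Calling `run_rounds()` reuses the same two named barriers in each of the three rounds, and every thread observes the same per-round sums.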
-
Publication No.: US20220309124A1
Publication Date: 2022-09-29
Application No.: US17211627
Filing Date: 2021-03-24
Applicant: Intel Corporation
Inventors: Chunhui Mei, Hong Jiang, Jiasheng Chen, Yongsheng Liu, Yan Li
Abstract: Matrix multiply units can take advantage of input sparsity by zero gating ALUs, which saves power consumption, but compute throughput does not increase. To improve compute throughput from sparsity, processing resources in a matrix accelerator can skip computation with zero involved in input or output. If zeros in input can be skipped, the processing units can focus calculations on generating meaningful non-zero output.
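The difference between gating and skipping can be illustrated with a small software sketch that counts how many multiplies are actually issued. The function name `matmul_zero_skip` and the list-of-lists matrix layout are assumptions made for illustration; the accelerator described in the abstract does this in hardware.

```python
def matmul_zero_skip(a, b):
    """Multiply a (m x k) by b (k x n), skipping products with a zero operand.

    Returns (result, multiplies_performed).
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    multiplies = 0
    for i in range(m):
        for p in range(k):
            if a[i][p] == 0:
                continue  # a zero in the first input skips n products at once
            for j in range(n):
                if b[p][j] == 0:
                    continue  # a zero in the second input skips one product
                out[i][j] += a[i][p] * b[p][j]
                multiplies += 1
    return out, multiplies
```

For two 2 x 2 inputs a dense multiply issues 8 products; with the zeros in the test inputs `[[1, 0], [0, 2]]` and `[[3, 4], [0, 5]]` only 3 are issued, which is where the throughput gain would come from.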
-
Publication No.: US20240220420A1
Publication Date: 2024-07-04
Application No.: US18148994
Filing Date: 2022-12-30
Applicant: Intel Corporation
IPC Classification: G06F12/121, G06F12/0895
CPC Classification: G06F12/121, G06F12/0895
Abstract: Locally biased cache replacement for a clustered cache architecture is described. An example of an apparatus includes clusters of cores; a clustered cache including multiple cache partitions for the clusters of cores, each cache partition including multiple cachelines; and a computer memory including memory partitions, each of the cache partitions being associated with a respective local memory partition, wherein each cacheline of the cache partitions includes a cacheline tag, each cacheline tag including a local tag to indicate whether data stored in the cacheline is local data stored in the local memory partition or remote data stored in a remote memory partition, and a used tag to indicate whether data stored in the cacheline is recently accessed; and wherein the clustered cache includes circuitry to select cachelines for cache replacement in a cache partition based on values of the tags of the cachelines.
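A minimal sketch of victim selection driven by the two tags follows. The abstract states only that selection is based on the tag values; the specific priority order used here (unused before used, and remote before local within each group) is an assumption, as are the names `CacheLine`, `eviction_rank`, and `select_victim`.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    addr: int
    local: bool  # local tag: data belongs to the partition's local memory
    used: bool   # used tag: line was recently accessed


def eviction_rank(line):
    # ASSUMED priority, lowest rank evicted first:
    # (unused, remote) < (unused, local) < (used, remote) < (used, local)
    return (line.used, line.local)


def select_victim(lines):
    """Return the cacheline that the assumed policy would replace."""
    return min(lines, key=eviction_rank)
```

Under this assumed order, a partition holding one line of each tag combination evicts the unused remote line first and keeps the recently used local line longest.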
-
Publication No.: US20240220254A1
Publication Date: 2024-07-04
Application No.: US18148997
Filing Date: 2022-12-30
Applicant: Intel Corporation
Inventors: Chunhui Mei, Yongsheng Liu, John A. Wiegert, Vasanth Ranganathan, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, James Valerio, Alan M. Curtis, Maxim Kazakov
CPC Classification: G06F9/30087, G06F9/3877, G06F9/5072, G06F9/544
Abstract: Data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a first processor, the first processor including one or more clusters of cores and a memory, wherein each cluster of cores includes multiple cores, each core including one or more processing resources, shared memory, and broadcast circuitry; and wherein a first core in a first cluster of cores is to request a data element, determine whether any additional cores in the first cluster require the data element, and, upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores via interconnects between the broadcast circuitry of the cores of the first core cluster.
-
Publication No.: US11977885B2
Publication Date: 2024-05-07
Application No.: US17107823
Filing Date: 2020-11-30
Applicant: Intel Corporation
Inventors: Subramaniam Maiyuran, Jorge Parra, Ashutosh Garg, Chandra Gurram, Chunhui Mei, Durgesh Borkar, Shubra Marwaha, Supratim Pal, Varghese George, Wei Xiong, Yan Li, Yongsheng Liu, Dipankar Das, Sasikanth Avancha, Dharma Teja Vooturi, Naveen K. Mellempudi
CPC Classification: G06F9/30036, G06F9/3001, G06F9/30101, G06F9/3893, G06F15/8046
Abstract: An apparatus to facilitate utilizing structured sparsity in systolic arrays is disclosed. The apparatus includes a processor comprising a systolic array to receive data from a plurality of source registers, the data comprising unpacked source data, structured source data that is packed based on sparsity, and metadata corresponding to the structured source data; identify portions of the unpacked source data to multiply with the structured source data, the portions of the unpacked source data identified based on the metadata; and output, to a destination register, a result of multiplication of the portions of the unpacked source data and the structured source data.
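The roles of the three inputs (unpacked source data, packed structured data, and metadata) can be sketched in software. The 2-of-4 sparsity pattern, the per-group index encoding, and the names `pack_2_of_4` and `sparse_dot` are assumptions chosen for illustration; the abstract does not specify the pattern or the metadata layout.

```python
def pack_2_of_4(dense):
    """Pack a vector in which every group of 4 holds at most 2 non-zeros.

    Returns (values, metadata): 2 kept values per group and, as metadata,
    the position (0..3) of each kept value inside its group.
    The length of `dense` is assumed to be a multiple of 4.
    """
    values, metadata = [], []
    for g in range(0, len(dense), 4):
        group = dense[g:g + 4]
        idx = [i for i, v in enumerate(group) if v != 0]
        if len(idx) > 2:
            raise ValueError("group has more than 2 non-zeros")
        # Pad with zero-valued positions so each group always yields 2 entries.
        idx += [i for i in range(4) if i not in idx][:2 - len(idx)]
        idx.sort()
        values += [group[i] for i in idx]
        metadata += idx
    return values, metadata


def sparse_dot(unpacked, values, metadata):
    """Dot product that reads only the unpacked elements named by the metadata."""
    total = 0
    for n, (v, i) in enumerate(zip(values, metadata)):
        group_base = (n // 2) * 4  # two packed entries per group of four
        total += unpacked[group_base + i] * v
    return total
```

With `dense = [0, 3, 0, 5, 2, 0, 0, 0]`, packing halves the stored values, and `sparse_dot` touches 4 elements of the unpacked operand instead of 8 while producing the same result as the dense dot product.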
-
Publication No.: US20240104025A1
Publication Date: 2024-03-28
Application No.: US17951914
Filing Date: 2022-09-23
Applicant: Intel Corporation
IPC Classification: G06F12/123, G06F12/0862
CPC Classification: G06F12/123, G06F12/0862, G06F2212/1021
Abstract: Prefetch aware LRU cache replacement policy is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a load store cache having multiple cache lines (CLs), each including bits for a cache line level (CL level) and one or more sectors for data storage; wherein the graphics processor is to receive one or more data elements for storage in the cache; set a CL level to track each CL receiving data, including setting CL level 1 for a CL receiving data in response to a miss in the cache and setting a CL level 2 for a CL receiving prefetched data in response to a prefetch request, and, upon determining that space is required in the cache to store data, apply a cache replacement policy, the policy being based at least in part on set CL levels for the CLs.
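A small software model can show how CL levels might feed a replacement decision. The abstract fixes only the two fill levels (1 for a demand miss, 2 for a prefetch); the eviction order used here (oldest level-2 line first, otherwise plain LRU), the promotion of a prefetched line to level 1 on its first hit, and the class name `PrefetchAwareLRU` are all assumptions.

```python
from collections import OrderedDict

DEMAND, PREFETCH = 1, 2  # CL levels named in the abstract


class PrefetchAwareLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # addr -> CL level, least recently used first

    def access(self, addr):
        """Demand access. Returns True on a hit, False on a miss."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            self.lines[addr] = DEMAND  # ASSUMPTION: a used prefetch counts as demand data
            return True
        self._fill(addr, DEMAND)
        return False

    def prefetch(self, addr):
        if addr not in self.lines:
            self._fill(addr, PREFETCH)

    def _fill(self, addr, level):
        if len(self.lines) >= self.capacity:
            self._evict()
        self.lines[addr] = level

    def _evict(self):
        # ASSUMED policy: drop the oldest never-used prefetched line first,
        # and fall back to ordinary LRU when no such line exists.
        victim = next((a for a, lvl in self.lines.items() if lvl == PREFETCH), None)
        if victim is None:
            victim = next(iter(self.lines))
        del self.lines[victim]
        return victim
```

In a three-line cache filled by a demand miss, a prefetch, and another demand miss, the next miss evicts the unused prefetched line even though an older demand line is present.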
-
Publication No.: US20230153176A1
Publication Date: 2023-05-18
Application No.: US17528386
Filing Date: 2021-11-17
Applicant: Intel Corporation
Inventors: Chunhui Mei, James Valerio, Supratim Pal, Guei-Yuan Lueh, Hong Jiang
Abstract: An apparatus to facilitate a forward progress guarantee using single-level synchronization at individual thread granularity is disclosed. The apparatus includes a processor comprising barrier synchronization hardware circuitry to assign a set of global named barrier identifiers (IDs) to individual execution threads of a plurality of execution threads and synchronize execution of the individual execution threads on a single level via the set of global named barrier IDs; and a plurality of processing resources to execute the plurality of execution threads and comprising divergent barrier scheduling hardware circuitry to facilitate execution flow switching from a first divergent branch executed by a first thread to a second divergent branch executed by a second thread, the execution flow switching performed responsive to the first thread stalling to wait on a named barrier of the set of global named barrier IDs.
-
Publication No.: US20220414054A1
Publication Date: 2022-12-29
Application No.: US17304797
Filing Date: 2021-06-25
Applicant: Intel Corporation
Inventors: Jorge Parra, Jiasheng Chen, Supratim Pal, Fangwen Fu, Sabareesh Ganapathy, Chandra Gurram, Chunhui Mei, Yue Qi
Abstract: A processing apparatus described herein includes a general-purpose parallel processing engine comprising a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.
-
Publication No.: US20240111534A1
Publication Date: 2024-04-04
Application No.: US17957486
Filing Date: 2022-09-30
Applicant: Intel Corporation
Inventors: Fangwen Fu, Chunhui Mei, Maxim Kazakov, Biju George, Jorge Parra, Supratim Pal
CPC Classification: G06F9/30047, G06F9/3009, G06F9/542
Abstract: Embodiments described herein provide a technique to enable a broadcast load from an L1 cache or shared local memory to register files associated with hardware threads of a graphics core. One embodiment provides a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes a plurality of hardware threads and memory access circuitry to facilitate access to memory by the plurality of hardware threads. The graphics core is configurable to process a plurality of load requests from the plurality of hardware threads, detect duplicate load requests within the plurality of load requests, perform a single read from the cache memory in response to the duplicate load requests, and transmit data associated with the duplicate load requests to requesting hardware threads.
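The detect-duplicates, read-once, fan-out sequence can be sketched as a small function. Modelling the cache as a dictionary and a request as a `(thread_id, address)` pair are assumptions for illustration, as is the name `broadcast_load`.

```python
def broadcast_load(requests, cache):
    """Serve load requests, reading each distinct address only once.

    requests: list of (thread_id, address) pairs.
    cache: mapping from address to data (stand-in for L1 or shared local memory).
    Returns (delivered, reads), where delivered maps (thread_id, address) to data.
    """
    # Group requesting threads by address to detect duplicate loads.
    waiting = {}
    for tid, addr in requests:
        waiting.setdefault(addr, []).append(tid)

    delivered, reads = {}, 0
    for addr, tids in waiting.items():
        data = cache[addr]  # single read per distinct address
        reads += 1
        for tid in tids:    # fan the same data out to every requester
            delivered[(tid, addr)] = data
    return delivered, reads
```

Four requests that touch two distinct addresses result in two cache reads, with the duplicated address delivered to all three threads that asked for it.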
-
Publication No.: US11494163B2
Publication Date: 2022-11-08
Application No.: US16562979
Filing Date: 2019-09-06
Applicant: Intel Corporation
Inventors: Naveen Mellempudi, Dipankar Das, Chunhui Mei, Kristopher Wong, Dhiraj D. Kalamkar, Hong H. Jiang, Subramaniam Maiyuran, Varghese George
Abstract: An apparatus to facilitate a computer number format conversion is disclosed. The apparatus comprises a control unit to receive data format information indicating a first precision data format in which input data is to be received and converter hardware to receive the input data and convert the first precision data format to a second precision data format based on the data format information.
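A software sketch of conversion steered by format information follows, using Python's standard `struct` module. The format tags "fp16" and "fp32", the `FORMATS` table, and the `convert` function are hypothetical; the abstract does not name the precisions involved.

```python
import struct

# Hypothetical format tags mapped to struct codes:
# "e" is IEEE 754 binary16 and "f" is IEEE 754 binary32, both little-endian here.
FORMATS = {"fp16": "<e", "fp32": "<f"}


def convert(raw, src_format, dst_format):
    """Decode `raw` bytes as src_format values and re-encode them as dst_format."""
    src, dst = FORMATS[src_format], FORMATS[dst_format]
    size = struct.calcsize(src)
    values = [struct.unpack_from(src, raw, off)[0] for off in range(0, len(raw), size)]
    return b"".join(struct.pack(dst, v) for v in values)
```

The format tags play the role of the data format information: the same input bytes are decoded differently depending on `src_format`. Note that `struct.pack` raises `OverflowError` when a value is too large for binary16, so a fuller sketch would need an explicit saturation or rounding rule.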