-
公开(公告)号:US20240220448A1
公开(公告)日:2024-07-04
申请号:US18148998
申请日:2022-12-30
申请人: Intel Corporation
发明人: Chunhui Mei , Jiasheng Chen , Ben J. Ashbaugh , Fangwen Fu , Hong Jiang , Guei-Yuan Lueh , Rama S.B. Harihara , Maxim Kazakov
CPC分类号: G06F15/8046 , G06F13/1668
摘要: A scalable and configurable clustered systolic array is described. An example of apparatus includes a cluster including multiple cores; and a cache memory coupled with the cluster, wherein each core includes multiple processing resources, a memory coupled with the plurality of processing resources, a systolic array coupled with the memory, and one or more interconnects with one or more other cores of the plurality of cores; and wherein the systolic arrays of the cores are configurable by the apparatus to form a logically combined systolic array for processing of an operation by a cooperative group of threads running on one or more of the plurality of cores in the cluster.
-
公开(公告)号:US20240220335A1
公开(公告)日:2024-07-04
申请号:US18148993
申请日:2022-12-30
申请人: Intel Corporation
发明人: Chunhui Mei , Yongsheng Liu , John A. Wiegert , Vasanth Ranganathan , Ben J. Ashbaugh , Fangwen Fu , Hong Jiang , Guei-Yuan Lueh , James Valerio , Alan M. Curtis , Maxim Kazakov
CPC分类号: G06F9/522 , G06F9/3877 , G06F9/5072 , G06F9/3887
摘要: Synchronization for data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a graphics processing unit (GPU), the GPU including one or more clusters of cores and a memory, wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared local memory, and gateway circuitry, wherein the GPU is to initiate broadcast of a data element from a producer core to one or more consumer cores, and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores, and wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
-
公开(公告)号:US20240112295A1
公开(公告)日:2024-04-04
申请号:US17958216
申请日:2022-09-30
申请人: Intel Corporation
发明人: Biju George , Fangwen Fu , Supratim Pal , Jorge Parra , Chunhui Mei , Maxim Kazakov , Joydeep Ray
CPC分类号: G06T1/20 , G06F9/30098 , G06F9/3836
摘要: Shared local registers for thread team processing is described. An example of an apparatus includes one or more processors including a graphic processor having multiple processing resources; and memory for storage of data, the graphics processor to allocate a first thread team to a first processing resource, the first thread team including hardware threads to be executed solely by the first processing resource; allocate a shared local register (SLR) space that may be directly reference in the ISA instructions to the first processing resource, the SLR space being accessible to the threads of the thread team and being inaccessible to threads outside of the thread team; and allocate individual register spaces to the thread team, each of the individual register spaces being accessible to a respective thread of the thread team.
-
公开(公告)号:US20240168807A1
公开(公告)日:2024-05-23
申请号:US18056949
申请日:2022-11-18
申请人: Intel Corporation
发明人: Jorge Eduardo Parra Osorio , Guei-Yuan Lueh , Maxim Kazakov , Fangwen Fu , Supratim Pal , Kaiyu Chen
CPC分类号: G06F9/5027 , G06F9/48 , G06F9/522 , G06F15/8046
摘要: An apparatus to facilitate cross-thread register sharing for matrix multiplication compute is disclosed. The apparatus includes matrix acceleration hardware comprising a plurality of data processing units, wherein the respective plurality of data processing units are to: receive a decoded instruction for a first thread having a first register space, wherein the decoded instruction is for a matrix multiplication operation and comprises an indication to utilize a second register space of a second thread for an operand of the decoded instruction for the first thread; access the second register space of the second thread to obtain data for the operand of the decoded instruction; and perform the matrix multiplication operation for the first thread using the data for the operand from the second register space of the second thread.
-
公开(公告)号:US20240111534A1
公开(公告)日:2024-04-04
申请号:US17957486
申请日:2022-09-30
申请人: Intel Corporation
发明人: Fangwen Fu , Chunhui Mei , Maxim Kazakov , Biju George , Jorge Parra , Supratim Pal
CPC分类号: G06F9/30047 , G06F9/3009 , G06F9/542
摘要: Embodiments described herein provide a technique enable a broadcast load from an L1 cache or shared local memory to register files associated with hardware threads of a graphics core. One embodiment provides a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes a plurality of hardware threads and memory access circuitry to facilitate access to memory by the plurality of hardware threads. The graphics core is configurable to process a plurality of load request from the plurality of hardware threads, detect duplicate load requests within the plurality of load requests, perform a single read from the cache memory in response to the duplicate load requests, and transmit data associated with the duplicate load requests to requesting hardware threads.
-
公开(公告)号:US20240220254A1
公开(公告)日:2024-07-04
申请号:US18148997
申请日:2022-12-30
申请人: Intel Corporation
发明人: Chunhui Mei , Yongsheng Liu , John A. Wiegert , Vasanth Ranganathan , Ben J. Ashbaugh , Fangwen Fu , Hong Jiang , Guei-Yuan Lueh , James Valerio , Alan M. Curtis , Maxim Kazakov
CPC分类号: G06F9/30087 , G06F9/3877 , G06F9/5072 , G06F9/544
摘要: Data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a first processor, the first processor including one or more clusters of cores and a memory, wherein each cluster of cores includes multiple cores, each core including one or more processing resources, shared memory, and broadcast circuitry; and wherein a first core in a first cluster of cores is to request a data element, determine whether any additional cores in the first cluster require the data element, and, upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores via interconnects between the broadcast circuitry of the cores of the first core cluster.
-
-
-
-
-