-
Publication No.: US20230289190A1
Publication Date: 2023-09-14
Application No.: US17691288
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Apoorv PARLE , Ronny KRASHINSKY , John EDMONDSON , Jack CHOQUETTE , Shirish GADRE , Steve HEINRICH , Manan PATEL , Prakash Bangalore PRABHAKAR, JR. , Ravi MANYAM , Wish GANDHI , Lacky SHAH , Alexander L. Minkin
CPC classification number: G06F9/3887 , G06F9/522 , G06F13/4022 , G06F13/1689 , H04L49/101 , G06T1/20 , G06T1/60
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization, enabling strong scaling and smaller tile sizes.
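As a rough illustration of the idea in this abstract, a minimal CPU-side Python model might look like the sketch below. All names here are invented for illustration; this is not the patented circuit. The point it shows is that one requester fetches a line once on behalf of a group, and tracking logic fans the data out, so the backing store (standing in for the L2 cache) services one read instead of N.

```python
class MulticastTracker:
    """Toy model of the tracking circuitry: it sits between requesters and the
    backing store and fans one fetched value out to every registered consumer."""
    def __init__(self, backing_store):
        self.backing_store = backing_store   # models L2 / global memory
        self.reads = 0                       # reads that actually reach the store

    def multicast_load(self, address, consumers):
        """One thread requests `address` on behalf of all `consumers`."""
        self.reads += 1                      # a single backing-store access...
        value = self.backing_store[address]
        for consumer in consumers:           # ...delivered to every consumer
            consumer[address] = value
        return value

l2 = {0x100: 42}
smem_a, smem_b, smem_c = {}, {}, {}          # per-core shared memories
tracker = MulticastTracker(l2)
tracker.multicast_load(0x100, [smem_a, smem_b, smem_c])
```

Without the tracker, three consumers would each issue their own load, tripling the bandwidth consumed at the backing store.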
-
Publication No.: US20250060938A1
Publication Date: 2025-02-20
Application No.: US18449381
Filing Date: 2023-08-14
Applicant: NVIDIA Corporation
Inventor: Jack CHOQUETTE , Po-An TSAI , Alexander L. MINKIN , Manan PATEL , Neal Clayton CRAGO , Daniel STIFFLER , Kefeng DUAN , Yu-Jung CHEN , Jing LI , Qian WANG , Ronny KRASHINSKY , Jun YANG , Feng XIE
Abstract: Systems and methods for efficient convolution based on matrix multiply and add (MMA) are described. An example processor having a plurality of processing lanes is configured to perform convolution of a matrix of activation elements and a filter matrix in accordance with a configurable series of instructions including a plurality of MMA instructions and shift instructions while reusing activation elements already loaded to the datapath or associated memory over a plurality of MMA operations. Associated methods are also described.
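The shift-and-accumulate structure the abstract describes can be sketched in a few lines of plain Python (a 1-D scalar stand-in for the real tiled MMA datapath; function and variable names are illustrative). Each filter tap contributes one multiply-accumulate pass over a shifted view of the activations, so the activations are loaded once and reused across all taps.

```python
def conv1d_mma_shift(activations, filt):
    """Illustrative: 1-D convolution expressed as repeated multiply-accumulate
    over a shifting view of activations that stay resident in the 'datapath'."""
    n, k = len(activations), len(filt)
    out = [0.0] * (n - k + 1)
    for tap, w in enumerate(filt):      # one 'MMA instruction' per filter tap
        for i in range(len(out)):       # activations reused, only shifted by `tap`
            out[i] += activations[i + tap] * w
    return out
```

In the patented setting the inner loop is a wide matrix multiply-add and the shift is an instruction over elements already resident in the datapath, but the reuse pattern is the same.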
-
Publication No.: US20210124627A1
Publication Date: 2021-04-29
Application No.: US16712236
Filing Date: 2019-12-12
Applicant: NVIDIA Corporation
Inventor: Olivier GIROUX , Jack CHOQUETTE , Ronny KRASHINSKY , Steve HEINRICH , Xiaogang QIU , Shirish GADRE
Abstract: To synchronize operations of a computing system, a new type of synchronization barrier is disclosed. In one embodiment, the disclosed synchronization barrier allows synchronization mechanisms such as “Arrive” and “Wait” to be split, allowing for greater flexibility and efficiency in coordinating synchronization. In another embodiment, the disclosed synchronization barrier allows hardware components, such as dedicated copy or direct-memory-access (DMA) engines, to be synchronized with software-based threads.
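A minimal CPU-side sketch of the split Arrive/Wait idea, assuming invented names (this models the programming pattern, not the disclosed hardware): `arrive()` is non-blocking and merely counts, so a thread (or a DMA-completion callback) can signal progress early and keep working; `wait()` blocks only when the caller actually needs the phase to finish.

```python
import threading

class SplitBarrier:
    """Illustrative arrive/wait-split barrier for CPU threads."""
    def __init__(self, expected):
        self.expected = expected
        self.count = 0
        self.phase = 0
        self.cond = threading.Condition()

    def arrive(self):
        """Non-blocking: record one arrival, return a token for this phase."""
        with self.cond:
            token = self.phase
            self.count += 1
            if self.count == self.expected:   # last arrival completes the phase
                self.count = 0
                self.phase += 1
                self.cond.notify_all()
            return token

    def wait(self, token):
        """Block until the phase identified by `token` has completed."""
        with self.cond:
            while self.phase == token:
                self.cond.wait()

barrier = SplitBarrier(expected=2)
log = []

def worker():
    token = barrier.arrive()       # signal early...
    log.append("worker arrived")   # ...do independent work in between...
    barrier.wait(token)            # ...and only then block
    log.append("worker resumed")

t = threading.Thread(target=worker)
t.start()
tok = barrier.arrive()             # second arrival completes the phase
barrier.wait(tok)                  # returns immediately
t.join()
```

The gap between `arrive()` and `wait()` is where the flexibility comes from: work that does not depend on the barrier can be scheduled there.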
-
Publication No.: US20240289132A1
Publication Date: 2024-08-29
Application No.: US18660763
Filing Date: 2024-05-10
Applicant: NVIDIA Corporation
Inventor: Apoorv PARLE , Ronny KRASHINSKY , John EDMONDSON , Jack CHOQUETTE , Shirish GADRE , Steve HEINRICH , Manan PATEL , Prakash Bangalore PRABHAKAR, JR. , Ravi MANYAM , Wish GANDHI , Lacky SHAH , Alexander L. Minkin
CPC classification number: G06F9/3887 , G06F9/522 , G06F13/1689 , G06F13/4022 , G06T1/20 , G06T1/60 , H04L49/101
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization, enabling strong scaling and smaller tile sizes.
-
Publication No.: US20230289398A1
Publication Date: 2023-09-14
Application No.: US17691406
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Jack CHOQUETTE , Manan PATEL , Matt TYRLIK , Ronny KRASHINSKY
CPC classification number: G06F17/16 , G06F9/3001 , G06F7/5443
Abstract: This specification describes techniques for implementing matrix multiply and add (MMA) operations in graphics processing units (GPUs) and other processors. The implementations allow a plurality of warps of threads to collaborate in generating the result matrix by enabling each thread to share its respective register file for access by the datapaths associated with other threads in the group of warps. A state machine circuit controls MMA execution among the warps executing on asynchronous computation units. A group MMA (GMMA) instruction accepts a descriptor as a parameter, where the descriptor may include information regarding the size and format of input data to be loaded into shared memory and/or the datapath.
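To make the descriptor-as-parameter idea concrete, the sketch below shows one hypothetical shape such a descriptor might take. The field names and encoding are invented for illustration; the actual descriptor format is defined by the hardware, not by this sketch.

```python
from dataclasses import dataclass

@dataclass
class MMADescriptor:
    """Hypothetical operand descriptor for a group-MMA instruction: instead of
    passing the tile data itself, the instruction receives a compact record
    saying where the tile lives and how it is laid out."""
    base_addr: int        # shared-memory address of the operand tile
    rows: int             # tile dimensions
    cols: int
    element_format: str   # e.g. "fp16" or "bf16"
    stride_bytes: int     # leading-dimension stride between rows

# A warp group would pass one descriptor per operand rather than the data:
desc = MMADescriptor(base_addr=0x4000, rows=64, cols=16,
                     element_format="fp16", stride_bytes=32)
```

Passing a descriptor keeps the instruction encoding small while letting the hardware fetch variably sized and formatted tiles itself.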
-
Publication No.: US20230289189A1
Publication Date: 2023-09-14
Application No.: US17691690
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Prakash BANGALORE PRABHAKAR , Gentaro HIROTA , Ronny KRASHINSKY , Ze LONG , Brian PHARRIS , Rajballav DASH , Jeff TUCKEY , Jerome F. DULUK, JR. , Lacky SHAH , Luke DURANT , Jack CHOQUETTE , Eric WERNESS , Naman GOVIL , Manan PATEL , Shayani DEB , Sandeep NAVADA , John EDMONDSON , Greg PALMER , Wish GANDHI , Ravi MANYAM , Apoorv PARLE , Olivier GIROUX , Shirish GADRE , Steve HEINRICH
IPC: G06F3/06
CPC classification number: G06F3/064 , G06F3/0604 , G06F3/0679
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner, and using interconnects, that does not interfere with the processing cores' access to main or global memory such as that backed by an L2 cache. Such distributed shared memory supports cooperative parallelism and strong scaling across multiple processing cores by permitting data sharing and communications previously possible only within the same processing core.
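The addressing model the abstract describes can be sketched as a toy Python object (names invented; real DSMEM access goes over dedicated interconnects, not a Python list): every core owns a local block, but any core can address any block in the collection by a (rank, offset) pair.

```python
class Cluster:
    """Toy model of distributed shared memory: each 'core' owns one local
    block, and every core can read or write any peer's block by rank."""
    def __init__(self, num_cores, block_size):
        self.blocks = [[0] * block_size for _ in range(num_cores)]

    def store(self, rank, offset, value):
        """Store into the block owned by `rank` (local or remote)."""
        self.blocks[rank][offset] = value

    def load(self, rank, offset):
        """Load from the block owned by `rank` (local or remote)."""
        return self.blocks[rank][offset]

cluster = Cluster(num_cores=4, block_size=8)
cluster.store(rank=2, offset=0, value=99)   # core 0 writes into core 2's block
```

The key property is that the same store/load interface covers both the fast local case and the remote case, so cooperating cores can share data without round trips through global memory.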
-
Publication No.: US20230289242A1
Publication Date: 2023-09-14
Application No.: US17691296
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Timothy GUO , Jack CHOQUETTE , Shirish GADRE , Olivier GIROUX , Carter EDWARDS , John EDMONDSON , Manan PATEL , Raghavan MADHAVAN, JR. , Jessie HUANG , Peter NELSON , Ronny KRASHINSKY
IPC: G06F9/52
CPC classification number: G06F9/522 , G06F2209/521
Abstract: A new transaction barrier synchronization primitive enables executing threads and asynchronous transactions to synchronize across parallel processors. The asynchronous transactions may include, for example, transactions initiated by hardware data movement units such as direct memory access (DMA) engines. A hardware synchronization circuit may provide for the synchronization primitive to be stored in a cache memory so that barrier operations may be accelerated by the circuit. A new wait mechanism reduces the software overhead associated with waiting on a barrier.
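A minimal sketch of the completion rule such a barrier might enforce, with invented names and single-threaded bookkeeping standing in for the hardware circuit: the barrier releases only when every participating thread has arrived and the expected volume of asynchronous transaction data has been counted in.

```python
class TransactionBarrier:
    """Toy model: completes only when all threads have arrived AND the
    expected number of asynchronous transaction bytes has been reported."""
    def __init__(self, expected_threads, expected_bytes):
        self.pending_threads = expected_threads
        self.pending_bytes = expected_bytes

    def arrive(self):
        """A thread signals its arrival at the barrier."""
        self.pending_threads -= 1

    def complete_tx(self, num_bytes):
        """An asynchronous agent (e.g. a DMA engine) reports delivered bytes."""
        self.pending_bytes -= num_bytes

    def is_done(self):
        return self.pending_threads <= 0 and self.pending_bytes <= 0

bar = TransactionBarrier(expected_threads=2, expected_bytes=256)
bar.arrive()
bar.arrive()
still_waiting = not bar.is_done()   # threads arrived, DMA bytes still pending
bar.complete_tx(256)
```

Counting bytes rather than completion events lets a single barrier wait on transfers that arrive in arbitrarily sized pieces.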
-
Publication No.: US20230289215A1
Publication Date: 2023-09-14
Application No.: US17691621
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Greg PALMER , Gentaro HIROTA , Ronny KRASHINSKY , Ze LONG , Brian PHARRIS , Rajballav DASH , Jeff TUCKEY , Jerome F. DULUK, JR. , Lacky SHAH , Luke DURANT , Jack CHOQUETTE , Eric WERNESS , Naman GOVIL , Manan PATEL , Shayani DEB , Sandeep NAVADA , John EDMONDSON , Prakash BANGALORE PRABHAKAR , Wish GANDHI , Ravi MANYAM , Apoorv PARLE , Olivier GIROUX , Shirish GADRE , Steve HEINRICH
CPC classification number: G06F9/4881 , G06F9/3851 , G06F9/3009 , G06F9/544
Abstract: A new level of hierarchy, Cooperative Group Arrays (CGAs), and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
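The concurrency guarantee can be illustrated with a deliberately simplified scheduling sketch (invented names; real CGA placement spans multiple streaming multiprocessors inside a hardware partition): a CGA launches only when one partition can hold all of its CTAs at once, and otherwise defers rather than launching a partial grid.

```python
def schedule_cga(cta_count, partitions):
    """Toy model of the co-scheduling guarantee: place ALL CTAs of a CGA in
    one hardware partition, or defer the whole CGA.

    `partitions` is a list of free CTA slots per partition, mutated in place.
    Returns the chosen partition index, or None if the launch must wait."""
    for rank, free_slots in enumerate(partitions):
        if free_slots >= cta_count:
            partitions[rank] -= cta_count   # every CTA placed together
            return rank
    return None                             # concurrency cannot be guaranteed

partitions = [2, 4]                         # free CTA slots per partition
first = schedule_cga(3, partitions)         # fits only in partition 1
second = schedule_cga(3, partitions)        # no partition has 3 free slots
```

Deferring the whole CGA, rather than trickling CTAs out as slots free up, is what lets the CTAs rely on each other being resident for barriers and shared data.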
-
Publication No.: US20230288471A1
Publication Date: 2023-09-14
Application No.: US17691759
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Jerome F. DULUK , Gentaro HIROTA , Ronny KRASHINSKY , Greg PALMER , Jeff TUCKEY , Kaushik NADADHUR , Philip Browning JOHNSON , Praveen JOGINIPALLY
IPC: G01R31/28
CPC classification number: G01R31/2884 , G01R31/2889 , G01R31/2896 , G01R31/2839
Abstract: Processing hardware of a processor is virtualized to provide a façade between a consistent programming interface and specific hardware instances. Hardware processor components can be permanently or temporarily disabled when not needed to support the consistent programming interface and/or to balance hardware processing across a hardware arrangement such as an integrated circuit. Executing software can be migrated from one hardware arrangement to another without the need to reset the hardware.
-
Publication No.: US20230315655A1
Publication Date: 2023-10-05
Application No.: US17691303
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Jack CHOQUETTE , Ronny KRASHINSKY , Timothy GUO , Carter EDWARDS , Steve HEINRICH , John EDMONDSON , Prakash Bangalore PRABHAKAR, JR. , Apoorv PARLE , Manan PATEL , Olivier GIROUX , Michael PELLAUER
IPC: G06F13/16
CPC classification number: G06F13/1689 , G06F13/1673
Abstract: A new synchronization system synchronizes data exchanges between producer processes and consumer processes, which may be on the same or different processors in a multiprocessor system. The synchronization incurs less than one round trip of latency; in some implementations, approximately 0.5 round-trip times. A key aspect of the fast synchronization is that the producer's data store is followed without delay by the update of a barrier on which the consumer is waiting.
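The "store, then immediately update the barrier" pattern can be sketched with ordinary CPU threads (invented names; an `Event` stands in for the hardware barrier): because the consumer is already waiting on the barrier when the producer stores, the handoff costs one one-way notification rather than a request/acknowledge round trip.

```python
import threading

data = {}
flag = threading.Event()       # stands in for the barrier the consumer waits on

def producer():
    data["payload"] = 123      # the data store...
    flag.set()                 # ...followed without delay by the barrier update

def consumer(out):
    flag.wait()                # single one-way wait; no request/ack exchange
    out.append(data["payload"])

out = []
t = threading.Thread(target=consumer, args=(out,))
t.start()                      # consumer parks on the barrier first
producer()                     # then the producer stores and signals
t.join()
```

The ordering is the whole trick: the consumer's wait is posted before the data exists, so the producer's single store-plus-signal is all the traffic required.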
-