-
Publication No.: US20250060938A1
Publication Date: 2025-02-20
Application No.: US18449381
Filing Date: 2023-08-14
Applicant: NVIDIA Corporation
Inventor: Jack CHOQUETTE , Po-An TSAI , Alexander L. MINKIN , Manan PATEL , Neal Clayton CRAGO , Daniel STIFFLER , Kefeng DUAN , Yu-Jung CHEN , Jing LI , Qian WANG , Ronny KRASHINSKY , Jun YANG , Feng XIE
Abstract: Systems and methods for efficient convolution based on matrix multiply and add (MMA) are described. An example processor having a plurality of processing lanes is configured to perform convolution of a matrix of activation elements and a filter matrix in accordance with a configurable series of instructions including a plurality of MMA instructions and shift instructions while reusing activation elements already loaded to the datapath or associated memory over a plurality of MMA operations. Associated methods are also described.
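A minimal CUDA sketch of the reuse idea this abstract describes: each thread loads a window of activations into registers once, then reuses those same elements across every filter tap instead of reloading them per multiply-add. The kernel name, the 1-D shape, and the scalar fmaf standing in for a hardware MMA/shift instruction sequence are illustrative assumptions, not the patented datapath.

```cuda
#include <cuda_runtime.h>

#define K 5  // filter width (assumed for this sketch)

// Illustrative 1-D convolution: activations are loaded into registers once
// and reused across all K filter taps.
__global__ void conv1d_register_reuse(const float* __restrict__ act,
                                      const float* __restrict__ filt,
                                      float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + K > n) return;

    // Load a window of activations once into registers.
    float window[K];
    #pragma unroll
    for (int t = 0; t < K; ++t) window[t] = act[i + t];

    // Reuse the loaded window across every tap instead of reloading
    // from memory for each multiply-add.
    float acc = 0.0f;
    #pragma unroll
    for (int t = 0; t < K; ++t) acc = fmaf(window[t], filt[t], acc);

    out[i] = acc;
}
```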
-
Publication No.: US20240354106A1
Publication Date: 2024-10-24
Application No.: US18755097
Filing Date: 2024-06-26
Applicant: NVIDIA Corporation
Inventor: Srinivas Santosh Kumar MADUGULA , Olivier GIROUX , Wishwesh Anil GANDHI , Michael Allen PARKER , Raghuram L , Ivan TANASIC , Manan PATEL , Mark HUMMEL , Alexander L. MINKIN , Gregory Michael THORSON
IPC: G06F9/30
CPC classification number: G06F9/30043 , G06F9/30087
Abstract: Various embodiments include techniques for performing self-synchronizing remote memory operations in a data center or multiprocessor computing system. During a remote memory operation, a source processor transmits multiple data segments to a destination processor. For each data segment, the source processor transmits to the destination processor a remote memory operation that includes associated metadata identifying the memory location of a corresponding synchronization object, which represents either a count of data segments to be stored or a flag for each data segment to be stored. The remote memory operation, along with the metadata, is transmitted as a single unit to the destination processor. The destination processor splits the operation into the remote memory operation and the memory synchronization operation. As a result, the source processor avoids the need to perform a separate memory synchronization operation, thereby reducing inter-processor communications and increasing the performance of remote memory operations.
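A hedged CUDA sketch of the fused "data plus synchronization" idea (the count variant): each transferred unit carries the address of a synchronization object, and the receiving side performs the store and then decrements the counter as one logical step, so no separate synchronization message is needed. The packet layout and function names are assumptions for illustration, not the patented mechanism.

```cuda
#include <cuda_runtime.h>

// One "remote memory operation": payload plus metadata locating the
// synchronization object (a count of segments still outstanding).
struct RemoteWrite {
    float*        dst;        // where the segment is stored
    float         payload;    // the segment data (one element, for brevity)
    unsigned int* sync_count; // metadata: counter the consumer waits on
};

// Destination-side handling: split the single unit into the store and
// the synchronization update, as the abstract describes.
__device__ void apply_remote_write(const RemoteWrite& op) {
    *op.dst = op.payload;          // the remote memory operation itself
    __threadfence();               // make the store visible first
    atomicSub(op.sync_count, 1u);  // the fused synchronization update
}

// Consumer: waits until every segment has arrived, then proceeds.
__device__ void wait_for_segments(volatile unsigned int* sync_count) {
    while (*sync_count != 0) { /* spin */ }
}
```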
-
Publication No.: US20190146817A1
Publication Date: 2019-05-16
Application No.: US15897090
Filing Date: 2018-02-14
Applicant: NVIDIA CORPORATION
Inventor: Ajay TIRUMALA , Jack CHOQUETTE , Manan PATEL , Shirish GADRE , Praveen KAUSHIK , Amanpreet GREWAL , Shekhar DIVEKAR , Andrei KHODAKOVSKY
Abstract: A just-in-time (JIT) compiler binds constants to specific memory locations at runtime. The JIT compiler parses program code derived from a multithreaded application and identifies an instruction that references a uniform constant. The JIT compiler then determines a chain of pointers that originates within a root table specified in the multithreaded application and terminates at the uniform constant. The JIT compiler generates additional instructions for traversing the chain of pointers and inserts these instructions into the program code. A parallel processor executes this compiled code and, in doing so, causes a thread to traverse the chain of pointers and bind the uniform constant to a uniform register at runtime. Each thread in a group of threads executing on the parallel processor may then access the uniform constant.
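A small CUDA sketch of what the generated traversal might look like conceptually: starting from a root table, the code follows a chain of pointers down to the constant and keeps the result in a single per-thread variable that plays the role of the uniform register. The table layout and two-level chain depth are assumptions; the real JIT emits processor-specific instructions rather than C++ source.

```cuda
// Conceptual shape of the code a JIT might emit: follow a pointer chain
// from a root table down to the constant, then keep it in one variable
// that stands in for the uniform register.
struct ConstBank { const float* constants; };
struct RootTable { const ConstBank* bank;  };

__global__ void use_uniform_constant(const RootTable* root,
                                     int index, float* out) {
    // Traverse root -> bank -> constant (the inserted instructions).
    const ConstBank* bank = root->bank;
    float uniform_value = bank->constants[index]; // "bound" at runtime

    // Every thread in the group now reads the same bound value.
    out[blockIdx.x * blockDim.x + threadIdx.x] = uniform_value * 2.0f;
}
```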
-
Publication No.: US20230315655A1
Publication Date: 2023-10-05
Application No.: US17691303
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Jack CHOQUETTE , Ronny KRASHINSKY , Timothy GUO , Carter EDWARDS , Steve HEINRICH , John EDMONDSON , Prakash Bangalore PRABHAKAR , Apoorv PARLE, JR. , Manan PATEL , Olivier GIROUX , Michael PELLAUER
IPC: G06F13/16
CPC classification number: G06F13/1689 , G06F13/1673
Abstract: A new synchronization system synchronizes data exchanges between producer processes and consumer processes, which may be on the same or different processors in a multiprocessor system. The synchronization incurs less than one round trip of latency; in some implementations, approximately 0.5 round-trip times. A key aspect of the fast synchronization is that the producer's data store is followed without delay by an update of the barrier on which the consumer is waiting.
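A minimal CUDA sketch of the "store immediately followed by barrier update" pattern the abstract highlights: the producer writes its data and, with nothing between beyond the required fence, updates the flag the consumer is spinning on. The names and the single-flag barrier are illustrative assumptions, not the patented hardware path.

```cuda
#include <cuda_runtime.h>

// Producer: the data store is followed without delay by the barrier
// update the consumer is waiting on.
__device__ void produce(float* data, float value, volatile int* barrier) {
    *data = value;
    __threadfence();   // order the store before the barrier update
    *barrier = 1;      // release the waiting consumer immediately
}

// Consumer: waits on the barrier, then reads the freshly stored data.
__device__ float consume(const float* data, volatile int* barrier) {
    while (*barrier == 0) { /* spin until the producer arrives */ }
    __threadfence();   // order the flag read before the data read
    return *data;
}
```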
-
Publication No.: US20200050920A1
Publication Date: 2020-02-13
Application No.: US16514078
Filing Date: 2019-07-17
Applicant: NVIDIA Corporation
Inventor: Sachin IDGUNJI , Michael SIU , Alex GU , James REILLEY , Manan PATEL , Raj SELVANESAN , Ewa KUBALSKA
IPC: G06N3/04 , G06N3/08 , G06F1/3206 , G06F9/30 , G06F9/38
Abstract: An integrated circuit such as, for example, a graphics processing unit (GPU), includes a dynamic power controller for adjusting operating voltage and/or frequency. The controller may receive the current power used by the integrated circuit and a predicted power determined based on instructions pending in a plurality of processors. The controller determines the adjustments that need to be made to the operating voltage and/or frequency to minimize the difference between the current power and the predicted power. An in-system reinforcement learning mechanism is included to self-tune the parameters of the controller.
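A hedged host-side sketch (C++ as used in CUDA host code) of the control loop the abstract outlines: a proportional controller nudges frequency to close the gap between measured and predicted power, and a crude reward-driven rule self-tunes the gain. The class name, single gain parameter, and tuning constants are assumptions for illustration only.

```cuda
#include <cmath>

// Host-side sketch of a self-tuning power controller: adjust frequency to
// minimize |current_power - predicted_power|, and tune the gain based on
// whether the last adjustment reduced that error (the "reward").
struct PowerController {
    double gain     = 0.01;  // self-tuned parameter
    double last_err = 0.0;

    double step(double current_power, double predicted_power,
                double frequency_mhz) {
        double err = current_power - predicted_power;

        // Proportional adjustment of the operating frequency.
        double new_freq = frequency_mhz - gain * err;

        // Reinforcement-learning-flavored self-tuning: if the error grew,
        // back the gain off; if it shrank, grow the gain slightly.
        if (std::fabs(err) > std::fabs(last_err)) gain *= 0.9;
        else                                      gain *= 1.02;

        last_err = err;
        return new_freq;
    }
};

// Usage: PowerController pc;
//        freq = pc.step(watts_measured, watts_predicted, freq);
```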
-
Publication No.: US20190146796A1
Publication Date: 2019-05-16
Application No.: US15897092
Filing Date: 2018-02-14
Applicant: NVIDIA CORPORATION
Inventor: Ajay TIRUMALA , Jack CHOQUETTE , Manan PATEL , Shirish GADRE , Praveen KAUSHIK
Abstract: A compiler parses a multithreaded application into cohesive blocks of instructions. Cohesive blocks include instructions that do not diverge or converge. Each cohesive block is associated with one or more uniform registers. When a set of threads executes the instructions in a given cohesive block, each thread in the set may access the uniform register independently of the other threads in the set. Accordingly, the uniform register may store a single copy of data on behalf of all threads in the set of threads, thereby conserving resources.
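A CUDA sketch of the property a uniform register exploits: within a non-divergent (cohesive) block of instructions, some values are identical across all threads of a warp, so one copy suffices. Broadcasting lane 0's copy with __shfl_sync is a software analogue of storing the value once in a uniform register; the kernel and names are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void uniform_value_demo(const int* params, int* out) {
    // This value depends only on blockIdx, so it is identical for every
    // thread in the warp: a candidate for a uniform register.
    int uniform_candidate = params[blockIdx.x];

    // Software analogue of the uniform register: keep one copy (lane 0's)
    // and broadcast it, rather than 32 redundant per-thread copies.
    int uniform = __shfl_sync(0xffffffffu, uniform_candidate, 0);

    out[blockIdx.x * blockDim.x + threadIdx.x] = uniform + threadIdx.x;
}
```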
-
Publication No.: US20240393951A1
Publication Date: 2024-11-28
Application No.: US18768983
Filing Date: 2024-07-10
Applicant: NVIDIA Corporation
Inventor: Srinivas Santosh Kumar MADUGULA , Olivier GIROUX , Wishwesh Anil GANDHI , Michael Allen PARKER , Raghuram L , Ivan TANASIC , Manan PATEL , Mark HUMMEL , Alexander L. MINKIN
IPC: G06F3/06
Abstract: Various embodiments include techniques for performing self-synchronizing remote memory operations in a multiprocessor computing system. During a remote memory operation in the multiprocessor computing system, a source processing unit transmits multiple segments of data to a destination processing unit. For each segment of data, the source processing unit transmits to the destination processing unit a remote memory operation that includes associated metadata identifying the memory location of a corresponding synchronization object. The remote memory operation, along with the metadata, is transmitted as a single unit to the destination processing unit. The destination processing unit splits the operation into the remote memory operation and the memory synchronization operation. As a result, the source processing unit avoids the need to perform a separate memory synchronization operation, thereby reducing inter-processor communications and increasing the performance of remote memory operations.
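This continuation describes the same fused operation as the related abstract above; as a complementary hedged sketch, here is the flag-per-segment variant (the earlier sketch showed the count variant). The struct fields and function names are again assumptions for illustration.

```cuda
#include <cuda_runtime.h>

// Flag-per-segment variant: each fused operation carries the address of
// its own flag, which the destination sets once the segment is stored.
struct SegmentOp {
    float* dst;     // destination address for this segment
    float  payload; // the segment data (one element, for brevity)
    int*   flag;    // metadata: this segment's synchronization object
};

// Destination side: each arriving unit is split into the store and the
// per-segment flag update; no separate synchronization message exists.
__device__ void receive_segment(const SegmentOp& op) {
    *op.dst = op.payload;     // the remote memory operation
    __threadfence();          // store becomes visible before the flag
    atomicExch(op.flag, 1);   // the fused synchronization update
}

// Consumer: waits for one specific segment's flag before using it.
__device__ float read_segment(const float* dst, volatile int* flag) {
    while (*flag == 0) { /* spin */ }
    __threadfence();
    return *dst;
}
```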
-
Publication No.: US20240289132A1
Publication Date: 2024-08-29
Application No.: US18660763
Filing Date: 2024-05-10
Applicant: NVIDIA Corporation
Inventor: Apoorv PARLE , Ronny KRASHINSKY , John EDMONDSON , Jack CHOQUETTE , Shirish GADRE , Steve HEINRICH , Manan PATEL , Prakash Bangalore PRABHAKAR, JR. , Ravi MANYAM , Wish GANDHI , Lacky SHAH , Alexander L. Minkin
CPC classification number: G06F9/3887 , G06F9/522 , G06F13/1689 , G06F13/4022 , G06T1/20 , G06T1/60 , H04L49/101
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, level 2 cache) bandwidth utilization, enabling strong scaling and smaller tile sizes.
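A hedged CUDA sketch of the software-visible effect only: one leader thread per block issues the global-memory requests on behalf of the whole group and the data is distributed through shared memory, so the L2 cache services one request stream instead of many. The patented multicast does this across blocks with dedicated tracking circuitry; this per-block analogue (with an assumed block size of 256) merely illustrates the bandwidth-saving idea.

```cuda
#include <cuda_runtime.h>

__global__ void leader_fetch_broadcast(const float* __restrict__ src,
                                       float* __restrict__ dst, int n) {
    __shared__ float tile[256];  // staging buffer (assumes blockDim.x <= 256)

    int base = blockIdx.x * blockDim.x;

    // One thread requests the data on behalf of the others, so the
    // memory system sees a single stream of requests per block.
    if (threadIdx.x == 0) {
        for (int i = 0; i < blockDim.x && base + i < n; ++i)
            tile[i] = src[base + i];
    }
    __syncthreads();  // everyone waits for the leader's fetch

    if (base + threadIdx.x < n)
        dst[base + threadIdx.x] = tile[threadIdx.x] + 1.0f;
}
```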
-
Publication No.: US20230289398A1
Publication Date: 2023-09-14
Application No.: US17691406
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Jack CHOQUETTE , Manan PATEL , Matt TYRLIK , Ronny KRASHINSKY
CPC classification number: G06F17/16 , G06F9/3001 , G06F7/5443
Abstract: This specification describes techniques for implementing matrix multiply and add (MMA) operations in graphics processing units (GPUs) and other processors. The implementations provide for a plurality of warps of threads to collaborate in generating the result matrix by enabling each thread to share its respective register file with the datapaths associated with other threads in the group of warps. A state machine circuit controls MMA execution among the warps executing on asynchronous computation units. A group MMA (GMMA) instruction provides for a descriptor to be passed as a parameter, where the descriptor may include information regarding the size and format of the input data to be loaded into shared memory and/or the datapath.
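For context, a minimal example using CUDA's public per-warp MMA interface (nvcuda::wmma), which computes C += A*B on 16x16x16 half-precision tiles on a tensor-core-capable GPU. The GMMA instruction in the abstract generalizes this so a group of warps cooperates and a descriptor parameter conveys operand size and format; those descriptor mechanics are not part of this public API.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile: C += A * B, the building block that
// the abstract's group MMA coordinates across multiple warps.
__global__ void warp_mma_16x16x16(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);      // tensor-core multiply-add
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```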
-
Publication No.: US20230289189A1
Publication Date: 2023-09-14
Application No.: US17691690
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Prakash BANGALORE PRABHAKAR , Gentaro HIROTA , Ronny KRASHINSKY , Ze LONG , Brian PHARRIS , Rajballav DASH , Jeff TUCKEY , Jerome F. DULUK, JR. , Lacky SHAH , Luke DURANT , Jack CHOQUETTE , Eric WERNESS , Naman GOVIL , Manan PATEL , Shayani DEB , Sandeep NAVADA , John EDMONDSON , Greg PALMER , Wish GANDHI , Ravi MANYAM , Apoorv PARLE , Olivier GIROUX , Shirish GADRE , Steve HEINRICH
IPC: G06F3/06
CPC classification number: G06F3/064 , G06F3/0604 , G06F3/0679
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low-latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner, and using interconnects, that does not interfere with the processing cores' access to main or global memory, such as that backed by an L2 cache. Such distributed shared memory supports cooperative parallelism and strong scaling across multiple processing cores by permitting data sharing and communications previously possible only within the same processing core.
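CUDA exposes this capability publicly through thread block clusters (CUDA 11.8+, compute capability 9.0+). A minimal sketch, assuming a cluster of two blocks, where each block reads the other block's shared memory via cluster.map_shared_rank; this shows the programming-model surface of distributed shared memory, not its hardware implementation.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two blocks form a cluster; each block reads the other block's shared
// memory directly (distributed shared memory), bypassing global memory.
__global__ void __cluster_dims__(2, 1, 1) dsmem_peek(int* out) {
    __shared__ int local[1];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned rank = cluster.block_rank();

    if (threadIdx.x == 0) local[0] = 100 + (int)rank;
    cluster.sync();  // both blocks' shared memory is now populated

    // Map the peer block's shared-memory buffer into our address space.
    int* remote = cluster.map_shared_rank(local, rank ^ 1u);
    if (threadIdx.x == 0) out[rank] = remote[0];

    cluster.sync();  // keep shared memory alive until the peer has read it
}
```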