-
1.
Publication Number: US20230289304A1
Publication Date: 2023-09-14
Application Number: US17691276
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Alexander L. Minkin, Alan Kaatz, Olivier Giroux, Jack Choquette, Shirish Gadre, Manan Patel, John Tran, Ronny Krashinsky, Jeff Schottmiller
IPC: G06F13/16
CPC classification number: G06F13/1663
Abstract: A parallel processing unit comprises a plurality of processors, each coupled to memory access hardware circuitry. Each memory access circuit is configured to receive, from its coupled processor, a memory access request specifying a coordinate of a multidimensional data structure, the circuit being one of a plurality of memory access circuits each coupled to a respective one of the processors. In response to the memory access request, the circuit translates the coordinate into plural memory addresses for the multidimensional data structure and, using those addresses, asynchronously transfers at least a portion of the data structure for processing by at least the coupled processor. The memory locations may be in the shared memory of the coupled processor and/or in an external memory.
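The coordinate-based access described above is performed by hardware; the following is a minimal software sketch of the same idea at the CUDA level, assuming nothing beyond the public cooperative-groups API. A (tileRow, tileCol) coordinate is translated into global addresses in software, and the tile is copied asynchronously into shared memory with cooperative_groups::memcpy_async. The kernel name, TILE size, and single 32x32-block launch shape are illustrative, not from the patent.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 32;

// Launch with a single 32x32 thread block, e.g.:
//   fetch_tile<<<1, dim3(TILE, TILE)>>>(src, dst, width, 2, 3);
__global__ void fetch_tile(const float* src, float* dst,
                           int width, int tileRow, int tileCol)
{
    __shared__ float smem[TILE][TILE];
    cg::thread_block block = cg::this_thread_block();

    // Coordinate -> address translation: the tile's rows are not
    // contiguous in the source matrix, so issue one async copy per row.
    for (int r = 0; r < TILE; ++r) {
        const float* rowSrc = src + (size_t)(tileRow * TILE + r) * width
                                  + (size_t)tileCol * TILE;
        cg::memcpy_async(block, smem[r], rowSrc, sizeof(float) * TILE);
    }
    cg::wait(block);  // block until all outstanding copies have landed

    // Stand-in for real compute: write the fetched tile out densely.
    dst[(size_t)threadIdx.y * TILE + threadIdx.x] =
        smem[threadIdx.y][threadIdx.x];
}
```

The hardware in the abstract performs the address arithmetic and the asynchronous transfer itself; this sketch keeps that arithmetic in software so the translation step stays visible.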
-
2.
Publication Number: US12141082B2
Publication Date: 2024-11-12
Application Number: US17691276
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Alexander L. Minkin, Alan Kaatz, Olivier Giroux, Jack Choquette, Shirish Gadre, Manan Patel, John Tran, Ronny Krashinsky, Jeff Schottmiller
IPC: G06F13/16
Abstract: A parallel processing unit comprises a plurality of processors, each coupled to memory access hardware circuitry. Each memory access circuit is configured to receive, from its coupled processor, a memory access request specifying a coordinate of a multidimensional data structure, the circuit being one of a plurality of memory access circuits each coupled to a respective one of the processors. In response to the memory access request, the circuit translates the coordinate into plural memory addresses for the multidimensional data structure and, using those addresses, asynchronously transfers at least a portion of the data structure for processing by at least the coupled processor. The memory locations may be in the shared memory of the coupled processor and/or in an external memory.
-
3.
Publication Number: US20230289292A1
Publication Date: 2023-09-14
Application Number: US17691422
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Alexander L. Minkin, Alan Kaatz, Olivier Giroux, Jack Choquette, Shirish Gadre, Manan Patel, John Tran, Ronny Krashinsky, Jeff Schottmiller
IPC: G06F12/0875
CPC classification number: G06F12/0875, G06F2212/62, G06F2212/452
Abstract: A parallel processing unit comprises a plurality of processors, each coupled to memory access hardware circuitry. Each memory access circuit is configured to receive, from its coupled processor, a memory access request specifying a coordinate of a multidimensional data structure, the circuit being one of a plurality of memory access circuits each coupled to a respective one of the processors. In response to the memory access request, the circuit translates the coordinate into plural memory addresses for the multidimensional data structure and, using those addresses, asynchronously transfers at least a portion of the data structure for processing by at least the coupled processor. The memory locations may be in the shared memory of the coupled processor and/or in an external memory.
-
4.
Publication Number: US11347668B2
Publication Date: 2022-05-31
Application Number: US16921795
Filing Date: 2020-07-06
Applicant: NVIDIA Corporation
Inventor: Xiaogang Qiu, Ronny Krashinsky, Steven Heinrich, Shirish Gadre, John Edmondson, Jack Choquette, Mark Gebhart, Ramesh Jandhyala, Poornachandra Rao, Omkar Paranjape, Michael Siu
IPC: G06F13/28, G06F12/0891, G06F12/0811, G06F12/084, G06F12/0895, G06F12/122, G11C7/10
Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to the data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) queue until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.
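As a reading aid, here is a tiny host-side C++ model (it also compiles under nvcc) of the two pathways the abstract describes: shared-memory transactions go straight to the data memory, while other transactions pass through a tag check that either reroutes them to the direct pathway (hit) or parks them in a FIFO until the fill returns (miss). All names are illustrative; this models the control flow only, not NVIDIA's hardware.

```cuda
#include <cassert>
#include <cstdint>
#include <queue>
#include <unordered_map>

struct UnifiedCacheModel {
    std::unordered_map<uint64_t, size_t> tags;  // cache line -> data-memory slot
    std::queue<uint64_t> missFifo;              // misses wait here for DRAM fills

    // Shared-memory transactions bypass the tag pipeline entirely.
    size_t accessShared(size_t slot) { return slot; }

    // Everything else goes through the tag pipeline first.
    bool accessCached(uint64_t lineAddr, size_t& slotOut) {
        auto it = tags.find(lineAddr);
        if (it != tags.end()) {        // hit: reroute to the direct pathway
            slotOut = it->second;
            return true;
        }
        missFifo.push(lineAddr);       // miss: park until data returns
        return false;
    }
};

int main() {
    UnifiedCacheModel c;
    c.tags[0x1000] = 7;
    size_t slot = 0;
    assert(c.accessCached(0x1000, slot) && slot == 7);  // hit path
    assert(!c.accessCached(0x2000, slot));              // miss path
    assert(c.missFifo.front() == 0x2000);               // queued for fill
    return 0;
}
```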
-
5.
Publication Number: US10459861B2
Publication Date: 2019-10-29
Application Number: US15716461
Filing Date: 2017-09-26
Applicant: NVIDIA Corporation
Inventor: Xiaogang Qiu, Ronny Krashinsky, Steven Heinrich, Shirish Gadre, John Edmondson, Jack Choquette, Mark Gebhart, Ramesh Jandhyala, Poornachandra Rao, Omkar Paranjape, Michael Siu
IPC: G06F12/02, G06F13/28, G06F12/0891, G06F12/0811, G06F12/084
Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to the data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) queue until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.
-
6.
Publication Number: US12020035B2
Publication Date: 2024-06-25
Application Number: US17691288
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Apoorv Parle, Ronny Krashinsky, John Edmondson, Jack Choquette, Shirish Gadre, Steve Heinrich, Manan Patel, Prakash Bangalore Prabhakar, Jr., Ravi Manyam, Wish Gandhi, Lacky Shah, Alexander L. Minkin
IPC: G06F5/06, G06F9/38, G06F9/48, G06F9/52, G06F13/16, G06F13/40, G06T1/20, G06T1/60, H04L49/101
CPC classification number: G06F9/3887, G06F9/522, G06F13/1689, G06F13/4022, G06T1/20, G06T1/60, H04L49/101
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization, enabling strong scaling and smaller tile sizes.
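The patent's multicast is implemented in hardware tracking circuitry, but the effect can be approximated in software on GPUs that expose thread block clusters (compute capability 9.0+). In this hedged sketch, only cluster rank 0 fetches a tile from global memory; the other blocks read it out of rank 0's shared memory, so the L2 sees one fetch instead of four. The kernel name and sizes are illustrative.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;

// Requires sm_90; launched as a 4-block cluster via __cluster_dims__.
__global__ void __cluster_dims__(4, 1, 1)
multicast_tile(const float* src, float* dst)
{
    __shared__ float smem[TILE];
    cg::cluster_group cluster = cg::this_cluster();

    if (cluster.block_rank() == 0) {
        // One block fetches on behalf of the whole cluster.
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            smem[i] = src[i];
    }
    cluster.sync();  // rank 0's tile is now visible cluster-wide

    // Every block reads the single shared copy instead of global memory.
    const float* shared0 = cluster.map_shared_rank(smem, 0);
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        dst[cluster.block_rank() * TILE + i] = shared0[i];
}
```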
-
7.
Publication Number: US20240118899A1
Publication Date: 2024-04-11
Application Number: US18105679
Filing Date: 2023-02-03
Applicant: NVIDIA Corporation
Inventor: Aditya Avinash Atluri, Jack Choquette, Carter Edwards, Olivier Giroux, Praveen Kumar Kaushik, Ronny Krashinsky, Rishkul Kulkarni, Konstantinos Kyriakopoulos
IPC: G06F9/38
CPC classification number: G06F9/3851
Abstract: Apparatuses, systems, and techniques to adapt instructions in a SIMT architecture for execution on serial execution units. In at least one embodiment, a set of one or more threads is selected from a group of active threads associated with an instruction and the instruction is executed for the set of one or more threads on a serial execution unit.
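One way to picture the technique is its software analogue at the warp level: take the mask of currently active threads and retire them one at a time, so a SIMT instruction is replayed serially. The hardware in the patent performs this selection when dispatching to a serial execution unit; the sketch below only illustrates the selection loop, with illustrative names throughout.

```cuda
#include <cstdio>

__device__ void serial_op(int lane)
{
    printf("lane %d executes serially\n", lane);  // stand-in for the instruction
}

__global__ void serialize_active_threads()
{
    unsigned mask = __activemask();   // threads that arrived here together
    int lane = threadIdx.x % 32;

    // Peel off the lowest-numbered live lane each iteration.
    for (unsigned m = mask; m != 0; m &= m - 1) {
        int chosen = __ffs(m) - 1;    // __ffs is 1-based; 0 means no bits set
        if (lane == chosen)
            serial_op(lane);          // exactly one lane runs per step
        __syncwarp(mask);             // keep the participating lanes together
    }
}
```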
-
8.
Publication Number: US11803380B2
Publication Date: 2023-10-31
Application Number: US16712236
Filing Date: 2019-12-12
Applicant: NVIDIA Corporation
Inventor: Olivier Giroux, Jack Choquette, Ronny Krashinsky, Steve Heinrich, Xiaogang Qiu, Shirish Gadre
IPC: G06F7/08, G06F9/30, G06F12/0808, G06F12/0888, G06F9/32, G06F9/38, G06F9/52, G06F9/54
CPC classification number: G06F9/30043, G06F9/3009, G06F9/321, G06F9/3838, G06F9/3871, G06F9/522, G06F9/542, G06F9/544, G06F9/546, G06F12/0808, G06F12/0888, G06F9/3004, G06F2212/621
Abstract: To synchronize operations of a computing system, a new type of synchronization barrier is disclosed. In one embodiment, the disclosed synchronization barrier allows synchronization mechanisms such as "Arrive" and "Wait" to be split, providing greater flexibility and efficiency in coordinating synchronization. In another embodiment, the disclosed synchronization barrier allows hardware components such as dedicated copy or direct-memory-access (DMA) engines to be synchronized with software-based threads.
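The split arrive/wait style the abstract describes is exposed to CUDA programs through libcu++'s cuda::barrier (the patent covers an underlying hardware mechanism, not this particular API). A minimal sketch: each thread arrives, overlaps some independent work with other threads' phase-1 progress, and only then waits.

```cuda
#include <cuda/barrier>

__global__ void split_barrier_demo(float* data)
{
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (threadIdx.x == 0)
        init(&bar, blockDim.x);      // expect one arrival per thread
    __syncthreads();

    data[threadIdx.x] *= 2.0f;       // phase-1 work

    auto token = bar.arrive();       // signal completion without blocking...
    float local = data[threadIdx.x] + 1.0f;  // ...independent work overlaps

    bar.wait(std::move(token));      // now block until every thread arrived
    data[threadIdx.x] = local;       // phase-2 work after full synchronization
}
```

Because Arrive and Wait are separate calls, the work between them never stalls on stragglers, which is exactly the flexibility the split is meant to buy.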
-
9.
Publication Number: US12248788B2
Publication Date: 2025-03-11
Application Number: US17691690
Filing Date: 2022-03-10
Applicant: NVIDIA Corporation
Inventor: Prakash Bangalore Prabhakar, Gentaro Hirota, Ronny Krashinsky, Ze Long, Brian Pharris, Rajballav Dash, Jeff Tuckey, Jerome F. Duluk, Jr., Lacky Shah, Luke Durant, Jack Choquette, Eric Werness, Naman Govil, Manan Patel, Shayani Deb, Sandeep Navada, John Edmondson, Greg Palmer, Wish Gandhi, Ravi Manyam, Apoorv Parle, Olivier Giroux, Shirish Gadre, Steve Heinrich
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as backed by an L2 cache. Such distributed shared memory supports cooperative parallelism and strong scaling across multiple processing cores by permitting data sharing and communications previously possible only within the same processing core.
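CUDA exposes this capability on Hopper-class GPUs as cluster distributed shared memory. The hedged sketch below has every block in a 4-block cluster write a value directly into its neighbor's shared memory, an access that travels SM-to-SM rather than through global memory or the L2. Kernel and variable names are illustrative.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Requires sm_90; each block deposits its rank in its neighbor's shared memory.
__global__ void __cluster_dims__(4, 1, 1)
dsmem_ring_exchange(int* out)
{
    __shared__ int slot;
    cg::cluster_group cluster = cg::this_cluster();
    unsigned rank = cluster.block_rank();
    unsigned next = (rank + 1) % cluster.num_blocks();

    if (threadIdx.x == 0) {
        // Remote write straight into the next block's shared memory.
        int* neighborSlot = cluster.map_shared_rank(&slot, next);
        *neighborSlot = (int)rank;
    }
    cluster.sync();  // all remote writes are visible after this point

    if (threadIdx.x == 0)
        out[rank] = slot;  // each block now holds its predecessor's rank
}
```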
-
10.
Publication Number: US20230110438A1
Publication Date: 2023-04-13
Application Number: US17497507
Filing Date: 2021-10-08
Applicant: NVIDIA Corporation
Inventor: Pradeep Ramani, Alex Minkin, Alan Kaatz, Yang Xu, Ronny Krashinsky
Abstract: Apparatuses, systems, and techniques are presented to perform one or more operations. In at least one embodiment, one or more data values, to be used by one or more neural networks, are caused to be replaced by one or more invalid data values.
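The abstract is deliberately broad; one concrete reading, offered here purely as an assumption, is poisoning selected tensor entries with NaN so that any unintended use of a value that should have been skipped (for example, a pruned weight) surfaces immediately in downstream math. The mask semantics and names below are illustrative, not from the patent.

```cuda
#include <math_constants.h>  // CUDART_NAN_F

// Replace every masked element with a quiet-NaN "invalid" marker.
__global__ void poison_masked(float* tensor, const bool* invalidMask, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && invalidMask[i])
        tensor[i] = CUDART_NAN_F;  // NaN propagates through later arithmetic
}
```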