-
Publication number: US20210191754A1
Publication date: 2021-06-24
Application number: US16722422
Application date: 2019-12-20
Applicant: Nvidia Corporation
Inventor: Jonathon Evans , Lacky Shah , Phil Johnson , Jonah Alben , Brian Pharris , Greg Palmer , Brian Fahs
Abstract: Apparatuses, systems, and techniques to optimize processor resources at a user-defined level. In at least one embodiment, the priority of one or more tasks is adjusted to prevent one or more other dependent tasks from entering an idle state due to a lack of resources to consume.
-
Publication number: US09830262B2
Publication date: 2017-11-28
Application number: US14133488
Application date: 2013-12-18
Applicant: NVIDIA CORPORATION
Inventor: Jerome F. Duluk, Jr. , Cameron Buschardt , James Leroy Deming , Brian Fahs
CPC classification number: G06F12/08 , G06F11/3037 , G06F11/3442 , G06F11/3471 , G06F2201/81 , G06F2201/815 , G06F2201/88 , G06F2212/205
Abstract: Embodiments of the approaches disclosed herein include a subsystem that includes an access tracking mechanism configured to monitor access operations directed to a first memory and a second memory. The access tracking mechanism detects an access operation generated by a processor for accessing a first memory page residing on the second memory. The access tracking mechanism further determines that the first memory page is included in a first subset of memory pages residing on the second memory. The access tracking mechanism further locates, within a reference vector, a reference bit that corresponds to the first memory page, and sets the reference bit. One advantage of the present invention is that memory pages in a hybrid system migrate as needed to increase overall memory performance.
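The access-tracking mechanism described in this abstract can be illustrated with a minimal sketch. All names here are hypothetical (the patent publishes no code): a tracker keeps one reference bit per page in the tracked subset and sets the bit when the processor accesses that page.

```python
# Hypothetical sketch of an access tracker that maintains one reference
# bit per tracked memory page, per the abstract above.

class AccessTracker:
    def __init__(self, tracked_pages):
        # Pages in the "first subset" residing on the second memory.
        self.tracked = {page: i for i, page in enumerate(tracked_pages)}
        self.reference_vector = [0] * len(tracked_pages)

    def on_access(self, page):
        # Locate the reference bit for this page and set it; accesses to
        # untracked pages leave the reference vector unchanged.
        idx = self.tracked.get(page)
        if idx is None:
            return False
        self.reference_vector[idx] = 1
        return True

tracker = AccessTracker([0x1000, 0x2000, 0x3000])
tracker.on_access(0x2000)
print(tracker.reference_vector)  # [0, 1, 0]
```

The reference vector is what a migration policy would later consult to decide which pages to move between the two memories.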
-
Publication number: US09417875B2
Publication date: 2016-08-16
Application number: US14025482
Application date: 2013-09-12
Applicant: NVIDIA Corporation
Inventor: Brian Fahs , Ming Y. Siu , Brett W. Coon , John R. Nickolls , Lars Nyland
CPC classification number: G06F9/522 , G06F8/458 , G06F9/3004 , G06F9/30087 , G06F9/30145 , G06F9/3851
Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction, the thread contributes to a scan or reduction result and waits to execute any further instructions until all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction, and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
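The barrier-plus-reduction behavior can be emulated in software with a barrier whose completion action performs the aggregation. This is only an illustrative sketch of the semantics, not the patented hardware instruction:

```python
import threading

# Each thread supplies a value, then waits at the barrier; the reduction
# (here, a sum) runs once after all threads arrive, and every thread can
# then observe the result, mirroring the "barrier aggregation" semantics.

N = 4
contributions = [0] * N
result = []

def publish():
    # Barrier action: executed exactly once, after all N threads arrive.
    result.append(sum(contributions))

barrier = threading.Barrier(N, action=publish)

def worker(tid, value):
    contributions[tid] = value   # supply this thread's value
    barrier.wait()               # synchronize and aggregate
    # After the wait returns, every thread sees the reduction result.

threads = [threading.Thread(target=worker, args=(i, i + 1)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result[0])  # 10
```

In the patented design the aggregation is fused into the barrier instruction itself, so no separate reduction pass over memory is needed.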
-
Publication number: US20140168245A1
Publication date: 2014-06-19
Application number: US13720745
Application date: 2012-12-19
Applicant: NVIDIA CORPORATION
Inventor: Brian Fahs , Eric T. Anderson , Nick Barrow-Williams , Shirish Gadre , Joel James McCormack , Bryon S. Nordquist , Nirmal Raj Saxena , Lacky V. Shah
IPC: G06F13/14
CPC classification number: G06F13/14 , G06T1/20 , G06T1/60 , G06T15/005 , G06T2210/36
Abstract: A texture processing pipeline can be configured to service memory access requests that represent texture data access operations or generic data access operations. When the texture processing pipeline receives a memory access request that represents a texture data access operation, the texture processing pipeline may retrieve texture data based on texture coordinates. When the memory access request represents a generic data access operation, the texture pipeline extracts a virtual address from the memory access request and then retrieves data based on the virtual address. The texture processing pipeline is also configured to cache generic data retrieved on behalf of a group of threads and to then invalidate that generic data when the group of threads exits.
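The dual-mode dispatch the abstract describes can be sketched as a small request router. The request fields and function names below are hypothetical stand-ins for the pipeline's internal signals:

```python
# Hypothetical sketch: one pipeline services both texture requests
# (resolved via texture coordinates) and generic requests (resolved via a
# virtual address carried in the request itself).

def service(request, texture_lookup, memory):
    if request["kind"] == "texture":
        # Texture data access: resolve through texture coordinates.
        return texture_lookup(request["coords"])
    # Generic data access: extract the virtual address, then load.
    return memory[request["vaddr"]]

memory = {0x40: "payload"}
tex = lambda coords: ("texel", coords)
print(service({"kind": "texture", "coords": (2, 3)}, tex, memory))
print(service({"kind": "generic", "vaddr": 0x40}, tex, memory))
```

The cache-invalidation-on-thread-group-exit behavior is omitted here; the sketch only shows the two request paths sharing one entry point.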
-
Publication number: US11954518B2
Publication date: 2024-04-09
Application number: US16722422
Application date: 2019-12-20
Applicant: Nvidia Corporation
Inventor: Jonathon Evans , Lacky Shah , Phil Johnson , Jonah Alben , Brian Pharris , Greg Palmer , Brian Fahs
CPC classification number: G06F9/4831 , G06N3/08
Abstract: Apparatuses, systems, and techniques to optimize processor resources at a user-defined level. In at least one embodiment, the priority of one or more tasks is adjusted to prevent one or more other dependent tasks from entering an idle state due to a lack of resources to consume.
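The priority-adjustment idea can be sketched as a small rebalancing pass: when a dependent task is close to running out of input, the priority of its producer is raised so the consumer does not go idle. Task fields and the low-water threshold below are hypothetical illustrations, not the patented mechanism:

```python
# Illustrative sketch: boost the priority of a producer task whenever its
# dependent consumer is about to starve for input.

def rebalance(tasks, low_water=1):
    for task in tasks:
        producer = task.get("producer")
        if producer is not None and task["queued_inputs"] <= low_water:
            producer["priority"] += 1  # keep the consumer fed

producer = {"name": "encode", "priority": 0}
consumer = {"name": "decode", "queued_inputs": 0, "producer": producer}
rebalance([consumer])
print(producer["priority"])  # 1
```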
-
Publication number: US10061526B2
Publication date: 2018-08-28
Application number: US15169532
Application date: 2016-05-31
Applicant: NVIDIA Corporation
Inventor: John Mashey , Cameron Buschardt , James Leroy Deming , Jerome F. Duluk, Jr. , Brian Fahs
IPC: G06F3/06 , G06F12/1027 , G06F12/1009
CPC classification number: G06F3/0622 , G06F3/0631 , G06F3/0647 , G06F3/0685 , G06F12/1009 , G06F12/1027 , G06F2212/656 , G06F2212/684
Abstract: One embodiment of the present invention is a memory subsystem that includes a sliding window tracker that tracks memory accesses associated with a sliding window of memory page groups. When the sliding window tracker detects an access operation associated with a memory page group within the sliding window, the sliding window tracker sets a reference bit that is associated with the memory page group and is included in a reference vector that represents accesses to the memory page groups within the sliding window. Based on the values of the reference bits, the sliding window tracker causes a memory page in a memory page group that has fallen into disuse to be migrated from a first memory to a second memory. Because the sliding window tracker tunes the memory pages that are resident in the first memory to reflect memory access patterns, the overall performance of the memory subsystem is improved.
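A minimal sketch of the sliding-window idea, with hypothetical names: page groups inside the window get a reference bit on access, and a group whose bit is still clear is a candidate for migration out of the first memory.

```python
# Sketch of sliding-window access tracking over memory page groups.

class SlidingWindowTracker:
    def __init__(self, groups_in_window):
        self.window = list(groups_in_window)
        self.bits = {g: 0 for g in self.window}  # the reference vector

    def on_access(self, group):
        # Set the reference bit only for groups currently in the window.
        if group in self.bits:
            self.bits[group] = 1

    def migration_candidates(self):
        # Groups that fell into disuse: never accessed while windowed.
        return [g for g in self.window if self.bits[g] == 0]

t = SlidingWindowTracker(["g0", "g1", "g2"])
t.on_access("g1")
print(t.migration_candidates())  # ['g0', 'g2']
```

Sliding the window forward (dropping old groups, admitting new ones, and resetting their bits) is omitted for brevity.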
-
Publication number: US09830276B2
Publication date: 2017-11-28
Application number: US15437400
Application date: 2017-02-20
Applicant: NVIDIA Corporation
Inventor: James Leroy Deming , Jerome F. Duluk, Jr. , John Mashey , Mark Hairgrove , Lucien Dunning , Jonathon Stuart Ramsey Evans , Samuel H. Duncan , Cameron Buschardt , Brian Fahs
IPC: G06F12/08 , G06F9/46 , G06F12/1027
CPC classification number: G06F12/1027 , G06F9/467 , G06F12/08 , G06F2212/301 , G06F2212/684
Abstract: One embodiment of the present invention is a parallel processing unit (PPU) that includes one or more streaming multiprocessors (SMs) and implements a replay unit per SM. Upon detecting a page fault associated with a memory transaction issued by a particular SM, the corresponding replay unit causes the SM, but not any unaffected SMs, to cease issuing new memory transactions. The replay unit then stores the faulting memory transaction and any faulting in-flight memory transaction in a replay buffer. As page faults are resolved, the replay unit replays the memory transactions in the replay buffer—removing successful memory transactions from the replay buffer—until all of the stored memory transactions have successfully executed. Advantageously, the overall performance of the PPU is improved compared to conventional PPUs that, upon detecting a page fault, stop performing memory transactions across all SMs included in the PPU until the fault is resolved.
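The per-SM replay behavior can be sketched as follows (the class and its methods are hypothetical illustrations): a fault stalls only the affected SM, faulting transactions accumulate in a replay buffer, and the buffer is drained as faults are resolved.

```python
# Hedged sketch of per-SM fault replay, per the abstract above.

class ReplayUnit:
    def __init__(self):
        self.buffer = []
        self.stalled = False

    def on_fault(self, transaction):
        self.stalled = True              # only this SM stops issuing
        self.buffer.append(transaction)  # park the faulting transaction

    def replay(self, try_execute):
        # Retry buffered transactions; keep only the ones still failing.
        self.buffer = [t for t in self.buffer if not try_execute(t)]
        if not self.buffer:
            self.stalled = False         # resume once the buffer drains

unit = ReplayUnit()
unit.on_fault("load @0x10")
unit.replay(lambda t: True)  # fault resolved; the retry succeeds
print(unit.stalled, unit.buffer)  # False []
```

The key property the abstract claims is visible here: other (unaffected) SMs never interact with this unit and are never stalled.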
-
Publication number: US09639474B2
Publication date: 2017-05-02
Application number: US14134148
Application date: 2013-12-19
Applicant: NVIDIA CORPORATION
Inventor: Jerome F. Duluk, Jr. , John Mashey , Mark Hairgrove , Chenghuan Jia , Cameron Buschardt , Lucien Dunning , Brian Fahs
IPC: G06F13/00 , G06F12/1009 , G06F12/0804
CPC classification number: G06F3/0604 , G06F3/0647 , G06F3/0664 , G06F12/0804 , G06F12/1009 , G06F13/4022 , G06F13/4282 , G06F2212/657
Abstract: Techniques are provided by which memory pages may be migrated among PPU memories in a multi-PPU system. According to the techniques, a UVM driver determines that a particular memory page should change ownership state and/or be migrated between one PPU memory and another PPU memory. In response to this determination, the UVM driver initiates a peer transition sequence to cause the ownership state and/or location of the memory page to change. Various peer transition sequences involve modifying mappings for one or more PPUs and copying a memory page from one PPU memory to another PPU memory. Several steps in peer transition sequences may be performed in parallel for increased processing speed.
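A peer transition can be sketched as two steps: copy the page between PPU memories, then update every PPU's mapping to point at the new owner. The dict-based memories and mapping tables below are hypothetical stand-ins for the driver's data structures:

```python
# Illustrative sketch of a peer transition sequence in a two-PPU system.

def migrate(page, src_mem, dst_mem, mappings, new_owner):
    dst_mem[page] = src_mem.pop(page)    # copy, then release the source copy
    for ppu in mappings:                 # remap every PPU to the new owner
        mappings[ppu][page] = new_owner
    return dst_mem[page]

ppu0_mem = {"pageA": b"data"}
ppu1_mem = {}
mappings = {"ppu0": {"pageA": "ppu0"}, "ppu1": {"pageA": "ppu0"}}
migrate("pageA", ppu0_mem, ppu1_mem, mappings, "ppu1")
print(mappings["ppu0"]["pageA"], "pageA" in ppu0_mem)  # ppu1 False
```

In the real driver the copy and the per-PPU remappings can proceed in parallel, which is the speedup the abstract highlights.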
-
Publication number: US09588903B2
Publication date: 2017-03-07
Application number: US14011655
Application date: 2013-08-27
Applicant: NVIDIA CORPORATION
Inventor: Cameron Buschardt , Jerome F. Duluk, Jr. , John Mashey , Mark Hairgrove , James Leroy Deming , Brian Fahs
Abstract: One embodiment of the present invention includes a microcontroller coupled to a memory management unit (MMU). The MMU is coupled to a page table included in a physical memory, and the microcontroller is configured to perform one or more virtual memory operations associated with the physical memory and the page table. In operation, the microcontroller receives a page fault generated by the MMU in response to an invalid memory access via a virtual memory address. To remedy such a page fault, the microcontroller performs actions to map the virtual memory address to an appropriate location in the physical memory. By contrast, in prior-art systems, a fault handler would typically remedy the page fault. Advantageously, because the microcontroller executes these tasks locally with respect to the MMU and the physical memory, latency associated with remedying page faults may be decreased. Consequently, overall system performance may be increased.
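The microcontroller's remedy for a page fault can be sketched as mapping the faulting virtual address to a free physical page so the retried access succeeds. The page-table dict and free list are hypothetical simplifications:

```python
# Sketch of fault handling local to the MMU: map the faulting virtual
# address to an available physical page, then return the mapping.

def handle_fault(page_table, vaddr, free_pages):
    if vaddr not in page_table:          # the invalid access that faulted
        page_table[vaddr] = free_pages.pop()
    return page_table[vaddr]

page_table = {}
free = [0x9000]
print(hex(handle_fault(page_table, 0x1234, free)))  # 0x9000
```

Handling this next to the MMU, rather than bouncing out to a remote fault handler, is what reduces the latency the abstract describes.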
-
Publication number: US20200334076A1
Publication date: 2020-10-22
Application number: US16389548
Application date: 2019-04-19
Applicant: Nvidia Corporation
Inventor: Brian Fahs , Michael Lightstone , Mostafa Hagog
Abstract: An application binary interface (ABI) can be exposed in a processor to enable blocks of threads, which may correspond to separately compiled operators, to communicate without storing data to global memory external to the processor. The ABI can define how results of one computation, corresponding to a first thread block, will be organized in registers and shared memory of a processor at the end of one operator (i.e., kernel). The start of the next operator (i.e., kernel), corresponding to a second thread block, can consume the results from the registers and shared memory. Data can be stored to processor local storage for individual threads as they exit the block. Once published, libraries can be separately compiled, optimized, and tested as long as they adhere to the published ABI.
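The core idea, two separately compiled operators exchanging results through a published slot layout instead of a round trip through global memory, can be sketched in miniature. The `ABI` dict below is a hypothetical stand-in for the published register/shared-memory layout:

```python
# Hedged sketch: producer and consumer operators agree on an "ABI" that
# says where the result lives, so they compose without external storage.

ABI = {"result": 0}  # published layout: operator output lives in slot 0

def op_a(slots, x):
    slots[ABI["result"]] = x * x         # producer writes per the ABI

def op_b(slots):
    return slots[ABI["result"]] + 1      # consumer reads per the ABI

slots = [None]    # stands in for registers / shared memory
op_a(slots, 3)
print(op_b(slots))  # 10
```

As the abstract notes, either operator can be recompiled, optimized, or tested independently as long as it keeps honoring the published layout.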