-
Publication No.: US12130750B2
Publication Date: 2024-10-29
Application No.: US18118020
Filing Date: 2023-03-06
Applicant: NVIDIA Corporation
Inventor: Aninda Manocha, Zi Yan, David Nellans
IPC: G06F12/1027
CPC classification number: G06F12/1027
Abstract: Computer systems often employ virtual address translation hierarchies in which virtual memory addresses are mapped to physical memory. Use of the virtual address translation hierarchy speeds up virtual address translation when the required mapping is stored in one of the higher levels of the hierarchy. To reduce the number of misses occurring in the virtual address translation hierarchy, huge memory pages may be selectively employed; these map larger contiguous regions of virtual memory to contiguous regions of physical memory, thereby increasing the coverage of each entry in the virtual address translation hierarchy. The present disclosure provides hardware support for optimizing this huge memory page selection.
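To make the coverage arithmetic concrete, the sketch below (not taken from the patent; the 1536-entry TLB and 4 KiB/2 MiB page sizes are illustrative assumptions) shows how much further a fixed-size translation cache reaches when its entries map huge pages instead of base pages.

```python
# Illustrative only: TLB "reach" is entries * page size. The page sizes and
# entry count below are assumptions, not figures from the patent.

KIB = 1024
MIB = 1024 * KIB

def tlb_reach(entries: int, page_size: int) -> int:
    """Total bytes of address space covered by a fully populated TLB."""
    return entries * page_size

entries = 1536                       # hypothetical last-level TLB capacity
base = tlb_reach(entries, 4 * KIB)   # classic 4 KiB base pages
huge = tlb_reach(entries, 2 * MIB)   # x86-64-style 2 MiB huge pages

print(f"4 KiB pages cover {base / MIB:.0f} MiB")  # 6 MiB
print(f"2 MiB pages cover {huge / MIB:.0f} MiB")  # 3072 MiB, 512x the coverage
```

The 512x factor is simply the ratio of the two page sizes, which is why each huge-page entry does the work of 512 base-page entries.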
-
Publication No.: US11609879B2
Publication Date: 2023-03-21
Application No.: US17365315
Filing Date: 2021-07-01
Applicant: NVIDIA CORPORATION
Inventor: Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, Stephen William Keckler, David Nellans
IPC: G06F15/78, G06F12/0811, G06F12/12, G06F13/40
Abstract: In various embodiments, a parallel processor includes a parallel processor module implemented within a first die and a memory system module implemented within a second die. The memory system module is coupled to the parallel processor module via an on-package link. The parallel processor module includes multiple processor cores and multiple cache memories. The memory system module includes a memory controller for accessing a DRAM. Advantageously, the memory system module allows the performance of the parallel processor module to be tailored to the memory bandwidth demands that typify one or more application domains.
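As a rough, hypothetical illustration of why a swappable memory-system die lets one compute die serve different application domains, the sketch below applies a simple roofline bound to one assumed compute die paired with memory dies of different bandwidths; every name and number here is invented, not taken from the patent.

```python
# Roofline-style sketch: attainable throughput is capped either by the compute
# die's peak FLOP/s or by the memory die's DRAM bandwidth. All module names
# and specs are hypothetical.

PEAK_TFLOPS = 100.0  # assumed peak of the parallel processor module (compute die)

# Hypothetical memory system modules pairable with the same compute die
# over the on-package link.
MEMORY_DIES_GBPS = {
    "high-bandwidth MSM": 3000.0,
    "mainstream MSM":     1200.0,
    "low-cost MSM":        600.0,
}

def attainable_tflops(flops_per_byte: float, bw_gbps: float) -> float:
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(PEAK_TFLOPS, bw_gbps * flops_per_byte / 1000.0)

for name, bw in MEMORY_DIES_GBPS.items():
    for ai in (1.0, 16.0, 64.0):  # FLOPs performed per byte fetched from DRAM
        print(f"{name:>18} @ AI={ai:>4.0f}: {attainable_tflops(ai, bw):6.1f} TFLOP/s")
```

Under this toy model, bandwidth-hungry workloads (low arithmetic intensity) only benefit from the high-bandwidth memory die, while compute-bound workloads hit the same compute ceiling on all three, which is the tailoring the abstract describes.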
-
Publication No.: US20230079978A1
Publication Date: 2023-03-16
Application No.: US17709720
Filing Date: 2022-03-31
Applicant: Nvidia Corporation
Inventor: Evgeny Bolotin, Yaosheng Fu, Zi Yan, Gal Dalal, Shie Mannor, David Nellans
Abstract: A system, method, and apparatus for power management of computing systems are disclosed herein that use machine learning to optimize the individual frequencies of a computing system's components. The computing systems can be tightly integrated systems that consider an overall operating budget shared between the components while adjusting the frequencies of the individual components. An example of an automated method of power management includes: (1) learning, using a power management (PM) agent, frequency settings for different components of a computing system during execution of a repetitive application, and (2) adjusting the frequency settings of the different components using the PM agent, wherein the adjusting is based on the repetitive application and one or more limitations corresponding to a shared operating budget for the computing system.
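The two-step method above lends itself to a simple learning loop. The sketch below is a heavily simplified, hypothetical stand-in for the PM agent (greedy hill-climbing rather than any particular RL algorithm); the power model and `measure_performance` stub are invented for illustration.

```python
import random

# Hypothetical PM-agent stand-in: nudge one component's frequency per step,
# keep the change only if measured performance improves and the shared power
# budget is still respected. Not the patent's mechanism.

COMPONENTS = ["gpu_core", "memory", "interconnect"]
STEP_MHZ = {"gpu_core": 100.0, "memory": 50.0, "interconnect": 50.0}
POWER_BUDGET_W = 300.0

def estimated_power(freqs: dict[str, float]) -> float:
    return 0.1 * sum(freqs.values())  # toy linear power model (assumed)

def measure_performance(freqs: dict[str, float]) -> float:
    # Stand-in for timing one run of the repetitive application; concave in
    # frequency so the shared budget, not the model, ends up binding.
    return sum(f ** 0.5 for f in freqs.values()) + random.gauss(0.0, 0.2)

freqs = {c: 800.0 for c in COMPONENTS}   # 240 W under the toy model
best = measure_performance(freqs)
for _ in range(500):                     # learning phase over repeated runs
    c = random.choice(COMPONENTS)
    trial = {**freqs, c: freqs[c] + random.choice((-1.0, 1.0)) * STEP_MHZ[c]}
    if min(trial.values()) <= 0 or estimated_power(trial) > POWER_BUDGET_W:
        continue                         # stay inside the shared operating budget
    perf = measure_performance(trial)
    if perf > best:                      # keep only helpful adjustments
        freqs, best = trial, perf

print({c: round(f) for c, f in freqs.items()}, f"~{estimated_power(freqs):.0f} W")
```

Because the application is repetitive, each loop iteration can reuse the same workload as its measurement period, which is what makes per-component tuning against a shared budget tractable at all.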
-
Publication No.: US20240303201A1
Publication Date: 2024-09-12
Application No.: US18118020
Filing Date: 2023-03-06
Applicant: NVIDIA Corporation
Inventor: Aninda Manocha, Zi Yan, David Nellans
IPC: G06F12/1027
CPC classification number: G06F12/1027
Abstract: Computer systems often employ virtual address translation hierarchies in which virtual memory addresses are mapped to physical memory. Use of the virtual address translation hierarchy speeds up virtual address translation when the required mapping is stored in one of the higher levels of the hierarchy. To reduce the number of misses occurring in the virtual address translation hierarchy, huge memory pages may be selectively employed; these map larger contiguous regions of virtual memory to contiguous regions of physical memory, thereby increasing the coverage of each entry in the virtual address translation hierarchy. The present disclosure provides hardware support for optimizing this huge memory page selection.
-
Publication No.: US11625279B2
Publication Date: 2023-04-11
Application No.: US16787967
Filing Date: 2020-02-11
Applicant: NVIDIA Corporation
Inventor: Daniel Lustig, Oreste Villa, David Nellans
IPC: G06F9/50, G06F11/30, G06F9/54, G06F12/1027, G06F11/07, G06F12/0882
Abstract: In general, an application executes on a compute unit, such as a central processing unit (CPU) or graphics processing unit (GPU), to perform some function(s). In some circumstances, improved performance of an application, such as a graphics application, may be achieved by executing the application across multiple compute units. However, when using multiple compute units in this manner, synchronization must be provided between the compute units. Synchronization, including the sharing of data, is typically accomplished through memory. While a shared memory may cause bottlenecks, employing local memory for each compute unit may itself require synchronization (coherence), which can be costly in terms of resources, delay, etc. The present disclosure provides read-write page replication for multiple compute units that avoids the traditional challenges associated with coherence.
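Purely as an illustration of page replication in general (the patent's actual hardware mechanism is not reproduced here), the sketch below models a generic replicate-on-read, invalidate-on-write scheme across compute units' local memories.

```python
# Generic replicate-on-read / invalidate-on-write model of page replication.
# An assumption-laden illustration, not the patent's hardware design.

class ReplicatedPages:
    def __init__(self) -> None:
        self.replicas: dict[int, dict[int, bytes]] = {}  # page -> {unit: data}
        self.home: dict[int, bytes] = {}                 # authoritative copy

    def read(self, unit: int, page: int) -> bytes:
        local = self.replicas.setdefault(page, {})
        if unit not in local:       # miss: replicate into this unit's local memory
            local[unit] = self.home.get(page, bytes(1))
        return local[unit]          # later reads hit the local replica

    def write(self, unit: int, page: int, data: bytes) -> None:
        # Drop every other unit's replica so no stale copy can be read,
        # then install the writer's copy and update the home copy.
        self.replicas[page] = {unit: data}
        self.home[page] = data

pages = ReplicatedPages()
pages.write(unit=0, page=7, data=b"hello")
print(pages.read(unit=1, page=7))   # b'hello', now replicated locally on unit 1
```

In this toy scheme, reads are always local after the first miss, and the cost of coherence is concentrated into the write path, which is one common way to trade bandwidth for synchronization.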
-
Publication No.: US11880261B2
Publication Date: 2024-01-23
Application No.: US17709720
Filing Date: 2022-03-31
Applicant: Nvidia Corporation
Inventor: Evgeny Bolotin, Yaosheng Fu, Zi Yan, Gal Dalal, Shie Mannor, David Nellans
CPC classification number: G06F1/324, G06F1/206, G06F11/3495
Abstract: A system, method, and apparatus for power management of computing systems are disclosed herein that use machine learning to optimize the individual frequencies of a computing system's components. The computing systems can be tightly integrated systems that consider an overall operating budget shared between the components while adjusting the frequencies of the individual components. An example of an automated method of power management includes: (1) learning, using a power management (PM) agent, frequency settings for different components of a computing system during execution of a repetitive application, and (2) adjusting the frequency settings of the different components using the PM agent, wherein the adjusting is based on the repetitive application and one or more limitations corresponding to a shared operating budget for the computing system.
-
Publication No.: US20230137205A1
Publication Date: 2023-05-04
Application No.: US17514735
Filing Date: 2021-10-29
Applicant: Nvidia Corporation
Inventor: Yaosheng Fu, Shie Mannor, Evgeny Bolotin, David Nellans, Gal Dalal
IPC: G06F12/123, G06N20/00, G06T1/60
Abstract: Introduced herein is a technique that uses ML to autonomously find a cache management policy that achieves optimal execution of a given workload of an application. Leveraging ML such as reinforcement learning, the technique trains an agent in an ML environment over multiple episodes of a stabilization process. At each time step in these training episodes, the agent executes the application while making an incremental change to the current policy, i.e., the cache-residency statuses of the memory address space associated with the workload, until the application can be executed at a stable level. A stable level of execution can be indicated, for example, by performance variations, such as standard deviations, between a certain number of neighboring measurement periods remaining within a certain threshold. The agent, which has been trained in the training episodes, infers the final cache management policy during a final inference episode.
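The stabilization criterion described above maps naturally onto a rolling standard-deviation test. The sketch below renders one hypothetical training episode: toggle one address range's cache-residency status per step, re-measure, and stop once performance across neighboring measurement periods is stable; `run_workload` and all thresholds are invented.

```python
import random
import statistics

# Hypothetical rendering of one training episode of the stabilization process.
NUM_RANGES = 16        # address ranges covered by the policy (assumed)
WINDOW = 8             # neighboring measurement periods to compare
STABLE_STDEV = 0.02    # stability threshold on measured runtime (assumed)

policy = [False] * NUM_RANGES   # True = keep this range resident in cache

def run_workload(policy: list[bool]) -> float:
    # Stand-in for executing the real workload under `policy` and timing it;
    # here, runtime improves as more hot ranges are kept resident, plus noise.
    return 1.0 - 0.3 * sum(policy) / NUM_RANGES + random.gauss(0.0, 0.01)

history: list[float] = []
while True:
    i = random.randrange(NUM_RANGES)
    policy[i] = not policy[i]            # incremental change to the current policy
    runtime = run_workload(policy)
    if history and runtime > history[-1]:
        policy[i] = not policy[i]        # greedy: revert changes that hurt
        continue
    history.append(runtime)
    if len(history) >= WINDOW and statistics.stdev(history[-WINDOW:]) < STABLE_STDEV:
        break                            # execution has reached a stable level

print(f"learned policy: {policy}")
```

The break condition is one concrete reading of the abstract's criterion: the standard deviation across the last WINDOW measurement periods must fall within a threshold before the episode ends.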
-
Publication No.: US20210248014A1
Publication Date: 2021-08-12
Application No.: US16787967
Filing Date: 2020-02-11
Applicant: NVIDIA Corporation
Inventor: Daniel Lustig, Oreste Villa, David Nellans
IPC: G06F9/50, G06F9/54, G06F12/0882, G06F12/1027, G06F11/07, G06F11/30
Abstract: In general, an application executes on a compute unit, such as a central processing unit (CPU) or graphics processing unit (GPU), to perform some function(s). In some circumstances, improved performance of an application, such as a graphics application, may be achieved by executing the application across multiple compute units. However, when using multiple compute units in this manner, synchronization must be provided between the compute units. Synchronization, including the sharing of data, is typically accomplished through memory. While a shared memory may cause bottlenecks, employing local memory for each compute unit may itself require synchronization (coherence), which can be costly in terms of resources, delay, etc. The present disclosure provides read-write page replication for multiple compute units that avoids the traditional challenges associated with coherence.