-
Publication Number: US11977766B2
Publication Date: 2024-05-07
Application Number: US17683292
Application Date: 2022-02-28
Applicant: NVIDIA Corporation
Inventor: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
IPC: G06F3/06
CPC classification number: G06F3/0655, G06F3/0604, G06F3/0679
Abstract: A hierarchical network enables access for a stacked memory system including one or more memory dies that each include multiple memory tiles. The processor die includes multiple processing tiles that are stacked with the one or more memory dies. The memory tiles that are vertically aligned with a processing tile are directly coupled to the processing tile and comprise the local memory block for the processing tile. The hierarchical network provides access paths for each processing tile to access the processing tile's local memory block, the local memory block coupled to a different processing tile within the same processor die, memory tiles in a different die stack, and memory tiles in a different device. The ratio of memory bandwidth (bytes) to floating-point operations (B:F) may improve 50× for accessing the local memory block compared with conventional memory. Additionally, the energy consumed to transfer each bit may be reduced by 10×.
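As a rough illustration of the four access paths the abstract enumerates, the sketch below classifies a request by comparing (device, stack, tile) coordinates. The coordinate encoding and the path names are assumptions for the example, not the patent's actual addressing scheme.

```python
# Hypothetical sketch: classify which level of the hierarchical network a
# processing tile would use to reach a memory tile. Vertically aligned
# memory tiles share the same (device, stack, tile) triple as their
# processing tile, so an exact match means the directly coupled local block.

def classify_access(src, dst):
    """Return the assumed network level for src processing tile to reach dst memory tile."""
    if src == dst:
        return "local"            # directly coupled local memory block
    if src[:2] == dst[:2]:
        return "intra-die"        # another tile's local block, same die stack
    if src[0] == dst[0]:
        return "inter-stack"      # a different die stack, same device
    return "inter-device"         # memory tiles in a different device

print(classify_access((0, 0, 3), (0, 0, 3)))  # local
print(classify_access((0, 0, 3), (0, 1, 7)))  # inter-stack
```

The claimed B:F and energy improvements apply only to the first ("local") case, which is why keeping data in the vertically aligned block matters.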
-
Publication Number: US20240411709A1
Publication Date: 2024-12-12
Application Number: US18810657
Application Date: 2024-08-21
Applicant: NVIDIA Corporation
Inventor: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
IPC: G06F13/16, G11C8/12, H03K19/1776
Abstract: Embodiments of the present disclosure relate to application partitioning for locality in a stacked memory system. In an embodiment, one or more memory dies are stacked on a processor die. The processor die includes multiple processing tiles and each memory die includes multiple memory tiles. Vertically aligned memory tiles are directly coupled to and comprise the local memory block for a corresponding processing tile. An application program that operates on dense multi-dimensional arrays (matrices) may partition the dense arrays into sub-arrays associated with program tiles. Each program tile is executed by a processing tile using the processing tile's local memory block to process the associated sub-array. Data associated with each sub-array is stored in a local memory block and the processing tile corresponding to the local memory block executes the program tile to process the sub-array data.
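The partitioning the abstract describes can be sketched minimally: split a dense matrix into one sub-array per processing tile so each program tile works out of its own local memory block. The 2×2 tile grid and row-major ordering below are assumptions for the example, not the patent's prescribed layout.

```python
# Minimal sketch: partition a dense 2-D array into tiles_y * tiles_x
# sub-arrays, one per processing tile (row-major order). Each sub-array
# would be stored in the corresponding tile's local memory block.

def partition(matrix, tiles_y, tiles_x):
    """Split a 2-D list into tiles_y * tiles_x equally sized sub-arrays."""
    rows, cols = len(matrix), len(matrix[0])
    h, w = rows // tiles_y, cols // tiles_x
    subs = []
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            subs.append([row[tx * w:(tx + 1) * w]
                         for row in matrix[ty * h:(ty + 1) * h]])
    return subs

m = [[r * 4 + c for c in range(4)] for r in range(4)]
subs = partition(m, 2, 2)   # four 2x2 sub-arrays for four processing tiles
```

Because each processing tile then touches only its own sub-array, most accesses stay on the vertical local path rather than crossing the inter-tile network.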
-
Publication Number: US20240281300A1
Publication Date: 2024-08-22
Application Number: US18528333
Application Date: 2023-12-04
Applicant: NVIDIA Corporation
Inventor: Donghyuk Lee, Leul Wuletaw Belayneh, Niladrish Chatterjee, James Michael O'Connor
CPC classification number: G06F9/5083, G06F9/542, G06F2209/509
Abstract: An initiating processing tile generates an offload request that may include a processing tile ID, the source data needed for the computation, a program counter, and a destination location where the computation result is stored. The offload processing tile may execute the offloaded computation. Alternatively, the offload processing tile may deny the offload request based on congestion criteria. The congestion criteria may include a processing workload measure, whether a resource needed to perform the computation is available, and an offload request buffer fullness. In an embodiment, the denial message that is returned to the initiating processing tile may include the data needed to perform the computation (read from the local memory of the offload processing tile). Returning the data with the denial message results in the same inter-processing-tile traffic that would occur if no attempt to offload the computation were initiated.
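The offload-or-deny protocol can be sketched as below. The congestion threshold and message fields are invented for illustration; the property the sketch preserves is the one the abstract emphasizes: a denial carries the operand back, so it costs no more traffic than a plain remote read would have.

```python
# Illustrative sketch of the offload protocol. Thresholds and field names
# are assumptions, not the patent's wire format.

class OffloadTile:
    def __init__(self, local_mem, busy=0.0, buffer_free=True):
        self.local_mem = local_mem          # tile-local memory block
        self.busy = busy                    # workload measure in [0, 1]
        self.buffer_free = buffer_free      # offload request buffer not full

    def handle(self, request):
        addr = request["src_addr"]
        if self.busy > 0.9 or not self.buffer_free:
            # Deny, but return the operand read from local memory so the
            # initiator can compute locally with no extra round trip.
            return {"status": "denied", "data": self.local_mem[addr]}
        result = request["op"](self.local_mem[addr])
        return {"status": "done", "result": result}

tile = OffloadTile(local_mem={0x10: 21})
reply = tile.handle({"src_addr": 0x10, "op": lambda x: x * 2})
```

An idle tile executes the computation near the data; a congested tile degrades gracefully to behaving like an ordinary remote memory read.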
-
Publication Number: US20240256153A1
Publication Date: 2024-08-01
Application Number: US18163167
Application Date: 2023-02-01
Applicant: NVIDIA Corporation
IPC: G06F3/06
CPC classification number: G06F3/0625, G06F3/0644, G06F3/0659, G06F3/0673
Abstract: Embodiments of the present disclosure relate to memory page access instrumentation for generating a memory access profile. The memory access profile may be used to co-locate data near the processing unit that accesses the data, reducing memory access energy by minimizing distances to access data that is co-located with a different processing unit (i.e., remote data). Execution thread arrays and memory pages for execution of a program are partitioned across multiple processing units. The partitions are then each mapped to a specific processing unit to minimize inter-partition traffic given the processing unit physical topology.
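A heavily simplified stand-in for the mapping step: given per-page access counts per processing unit (the "memory access profile"), place each page with the unit that touches it most, so most accesses become local. The real scheme partitions both thread arrays and pages and optimizes against the physical topology; this greedy per-page placement and the profile values are assumptions for illustration only.

```python
# Simplified sketch: use an access profile to co-locate each memory page
# with the processing unit that accesses it most often.

def place_pages(profile):
    """profile: {page: {processing_unit: access_count}} -> {page: processing_unit}."""
    placement = {}
    for page, counts in profile.items():
        placement[page] = max(counts, key=counts.get)  # most frequent accessor
    return placement

profile = {
    "page0": {"pu0": 90, "pu1": 10},
    "page1": {"pu0": 5, "pu1": 80},
}
print(place_pages(profile))  # {'page0': 'pu0', 'page1': 'pu1'}
```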
-
Publication Number: US20230393788A1
Publication Date: 2023-12-07
Application Number: US18454693
Application Date: 2023-08-23
Applicant: NVIDIA Corporation
Inventor: Niladrish Chatterjee, James Michael O'Connor, Donghyuk Lee, Gaurav Uttreja, Wishwesh Anil Gandhi
CPC classification number: G06F3/0659, G06F3/0604, G06F12/0607, G06F12/10, H01L25/18, G06F2212/657, G06F2212/151, G06F2212/154, G06F3/0673
Abstract: A combined on-package and off-package memory system uses a custom base-layer within which are fabricated one or more dedicated interfaces to off-package memories. An on-package processor and on-package memories are also directly coupled to the custom base-layer. The custom base-layer includes memory management logic between the processor and the memories (both off-package and on-package) to steer requests. The memories are exposed as a combined memory space having greater bandwidth and capacity compared with either the off-package memories or the on-package memories alone. The memory management logic services requests while maintaining quality of service (QoS) to satisfy bandwidth requirements for each allocation. An allocation may include any combination of the on-package and/or off-package memories. The memory management logic also manages data migration between the on-package and off-package memories.
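One simple way to picture the request steering in the base-layer is an address-range table that maps each region of the combined memory space to an on-package or off-package target. The ranges, names, and table form below are assumptions for the example; the patent's logic additionally handles QoS and migration.

```python
# Assumed sketch of base-layer request steering: a range table over the
# combined memory space decides which physical memory serves each address.

RANGES = [
    (0x0000_0000, 0x4000_0000, "on-package"),   # high-bandwidth, lower capacity
    (0x4000_0000, 0xC000_0000, "off-package"),  # higher capacity, lower bandwidth
]

def steer(addr):
    """Return the memory target serving addr in the combined space."""
    for lo, hi, target in RANGES:
        if lo <= addr < hi:
            return target
    raise ValueError(f"unmapped address {addr:#x}")

print(steer(0x1000))        # on-package
print(steer(0x8000_0000))   # off-package
```

Migration between the two memories would then amount to copying data and updating the table, transparently to the processor's view of the combined space.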
-
Publication Number: US20230315651A1
Publication Date: 2023-10-05
Application Number: US17709031
Application Date: 2022-03-30
Applicant: NVIDIA Corporation
Inventor: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
IPC: G06F13/16, H03K19/1776, G11C8/12
CPC classification number: G06F13/161, G06F13/1689, G06F13/1673, H03K19/1776, G11C8/12
Abstract: Embodiments of the present disclosure relate to application partitioning for locality in a stacked memory system. In an embodiment, one or more memory dies are stacked on a processor die. The processor die includes multiple processing tiles and each memory die includes multiple memory tiles. Vertically aligned memory tiles are directly coupled to and comprise the local memory block for a corresponding processing tile. An application program that operates on dense multi-dimensional arrays (matrices) may partition the dense arrays into sub-arrays associated with program tiles. Each program tile is executed by a processing tile using the processing tile's local memory block to process the associated sub-array. Data associated with each sub-array is stored in a local memory block and the processing tile corresponding to the local memory block executes the program tile to process the sub-array data.
-
Publication Number: US11709812B2
Publication Date: 2023-07-25
Application Number: US17325133
Application Date: 2021-05-19
Applicant: NVIDIA Corporation
Inventor: Hanrui Wang, James Michael O'Connor, Donghyuk Lee
IPC: G06F16/22, G06F17/16, G06F18/2134
CPC classification number: G06F16/2237, G06F16/2246, G06F16/2272, G06F17/16, G06F18/21345
Abstract: One embodiment sets forth a technique for generating a tree structure within a computer memory for storing sparse data. The technique includes dividing a matrix into a first plurality of equally sized regions. The technique also includes dividing at least one region in the first plurality of regions into a second plurality of regions, where the second plurality of regions includes a first region and one or more second regions that have a substantially equal number of nonzero matrix values and are formed within the first region. The technique further includes creating the tree structure within the computer memory by generating a first plurality of nodes representing the first plurality of regions, generating a second plurality of nodes representing the second plurality of regions, and grouping, under a first node representing the first region, one or more second nodes representing the one or more second regions.
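The tree build can be sketched as a recursive quadrant split. For brevity this sketch reduces the patent's equal-nonzero balancing of the second plurality of regions to a simple nonzero-count threshold; the matrix, node fields, and threshold are assumptions for the example.

```python
# Simplified sketch: build a tree over a sparse matrix by splitting into
# four equal quadrants and recursively subdividing any quadrant whose
# nonzero count exceeds a threshold.

def build(m, r0, c0, size, threshold):
    """Return a tree node covering the size x size region at (r0, c0)."""
    count = sum(1 for r in range(r0, r0 + size)
                  for c in range(c0, c0 + size) if m[r][c] != 0)
    node = {"r": r0, "c": c0, "size": size, "nnz": count, "children": []}
    if count > threshold and size > 1:
        half = size // 2
        for dr in (0, half):          # four equally sized sub-regions
            for dc in (0, half):
                node["children"].append(
                    build(m, r0 + dr, c0 + dc, half, threshold))
    return node

m = [[1, 0, 0, 0],
     [0, 2, 0, 0],
     [0, 0, 0, 0],
     [3, 0, 0, 4]]
tree = build(m, 0, 0, 4, threshold=1)   # root holds all 4 nonzeros
```

Dense pockets of the matrix produce deep subtrees while empty quadrants stay as shallow leaves, which is what makes such a tree compact for sparse data.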
-
Publication Number: US12223201B2
Publication Date: 2025-02-11
Application Number: US18438139
Application Date: 2024-02-09
Applicant: NVIDIA Corporation
Inventor: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
IPC: G06F3/06
Abstract: A hierarchical network enables access for a stacked memory system including one or more memory dies that each include multiple memory tiles. The processor die includes multiple processing tiles that are stacked with the one or more memory dies. The memory tiles that are vertically aligned with a processing tile are directly coupled to the processing tile and comprise the local memory block for the processing tile. The hierarchical network provides access paths for each processing tile to access the processing tile's local memory block, the local memory block coupled to a different processing tile within the same processor die, memory tiles in a different die stack, and memory tiles in a different device. The ratio of memory bandwidth (bytes) to floating-point operations (B:F) may improve 50× for accessing the local memory block compared with conventional memory. Additionally, the energy consumed to transfer each bit may be reduced by 10×.
-
Publication Number: US12141451B2
Publication Date: 2024-11-12
Application Number: US18163167
Application Date: 2023-02-01
Applicant: NVIDIA Corporation
Abstract: Embodiments of the present disclosure relate to memory page access instrumentation for generating a memory access profile. The memory access profile may be used to co-locate data near the processing unit that accesses the data, reducing memory access energy by minimizing distances to access data that is co-located with a different processing unit (i.e., remote data). Execution thread arrays and memory pages for execution of a program are partitioned across multiple processing units. The partitions are then each mapped to a specific processing unit to minimize inter-partition traffic given the processing unit physical topology.
-
Publication Number: US20240211166A1
Publication Date: 2024-06-27
Application Number: US18438139
Application Date: 2024-02-09
Applicant: NVIDIA Corporation
Inventor: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
IPC: G06F3/06
CPC classification number: G06F3/0655, G06F3/0604, G06F3/0679
Abstract: A hierarchical network enables access for a stacked memory system including one or more memory dies that each include multiple memory tiles. The processor die includes multiple processing tiles that are stacked with the one or more memory dies. The memory tiles that are vertically aligned with a processing tile are directly coupled to the processing tile and comprise the local memory block for the processing tile. The hierarchical network provides access paths for each processing tile to access the processing tile's local memory block, the local memory block coupled to a different processing tile within the same processor die, memory tiles in a different die stack, and memory tiles in a different device. The ratio of memory bandwidth (bytes) to floating-point operations (B:F) may improve 50× for accessing the local memory block compared with conventional memory. Additionally, the energy consumed to transfer each bit may be reduced by 10×.