-
Publication number: US20240402250A1
Publication date: 2024-12-05
Application number: US18203258
Filing date: 2023-05-30
Applicant: NVIDIA Corporation
Inventor: Saurabh Hukerikar, Nirmal Saxena, Atieh Lotfi, Samuel H. Duncan, Yanxiang Huang, Jason Campbell, Paul Racunas
IPC: G01R31/3187, G01R31/319
Abstract: In various examples, faults are detected based at least in part on result value(s) generated by hardware component(s) by performing one or more diagnostic tests in accordance with a diagnostic test pattern. The diagnostic test pattern may be used to perform an assessment of functionality of the hardware component(s) by causing the hardware component(s) to generate the result value(s), which may be used to identify one or more hardware faults (e.g., by comparing the result value(s) to expected value(s)).
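Illustrative sketch (not taken from the patent): the comparison step described in the abstract can be modeled as running a list of test vectors through a component and flagging any result that differs from its expected value. The names `run_diagnostic`, `ADDER_PATTERN`, `healthy_adder`, and `stuck_at_zero_adder` are hypothetical, and a 32-bit adder stands in for the hardware component purely as an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical test pattern: each entry is (operand_a, operand_b, expected_result).
TestPattern = Sequence[Tuple[int, int, int]]

MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath, for illustration only


def run_diagnostic(unit: Callable[[int, int], int], pattern: TestPattern) -> List[int]:
    """Return the indices of test vectors whose result differs from the expected value."""
    faults = []
    for index, (a, b, expected) in enumerate(pattern):
        result = unit(a, b)
        if result != expected:
            faults.append(index)
    return faults


def healthy_adder(a: int, b: int) -> int:
    """Software stand-in for a fault-free 32-bit adder."""
    return (a + b) & MASK32


def stuck_at_zero_adder(a: int, b: int) -> int:
    """Same adder, but with output bit 3 stuck at 0 to simulate a hardware fault."""
    return ((a + b) & MASK32) & ~(1 << 3)


# Assumed vectors chosen to toggle every output bit and exercise carry-out.
ADDER_PATTERN = [
    (0x00000000, 0x00000000, 0x00000000),
    (0xFFFFFFFF, 0x00000001, 0x00000000),
    (0x55555555, 0xAAAAAAAA, 0xFFFFFFFF),
    (0x00000004, 0x00000004, 0x00000008),
]
```

Under these assumptions, `run_diagnostic(healthy_adder, ADDER_PATTERN)` reports no faulty vectors, while the stuck-at variant fails every vector whose expected result has bit 3 set.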
-
Publication number: US11720440B2
Publication date: 2023-08-08
Application number: US17373678
Filing date: 2021-07-12
Applicant: NVIDIA CORPORATION
Inventor: Naveen Cherukuri, Saurabh Hukerikar, Paul Racunas, Nirmal Raj Saxena, David Charles Patrick, Yiyang Feng, Abhijeet Ghadge, Steven James Heinrich, Adam Hendrickson, Gentaro Hirota, Praveen Joginipally, Vaishali Kulkarni, Peter C. Mills, Sandeep Navada, Manan Patel, Liang Yin
IPC: G06F11/07, G06F11/10, G06F12/1018, G06F11/14, G06F12/1027
CPC classification number: G06F11/1016, G06F11/0772, G06F11/0793, G06F11/1407, G06F12/1018, G06F12/1027
Abstract: Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.
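Illustrative sketch (not taken from the patent): the containment behavior described in the abstract can be modeled in software as a per-word error marker that travels with the data on copies, plus a per-client flag that blocks stores once the client has loaded marked data. The classes `Word`, `Memory`, and `Client` and the function `copy_word` are hypothetical, and a boolean field stands in for the "specific bit pattern" as an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Word:
    data: int
    poisoned: bool = False  # assumed stand-in for the stored error bit pattern


class Memory:
    """Toy memory that keeps the error marker alongside each data word."""

    def __init__(self) -> None:
        self.cells: Dict[int, Word] = {}

    def store(self, addr: int, word: Word) -> None:
        # Store a copy so the marker is kept with the data at this address.
        self.cells[addr] = Word(word.data, word.poisoned)

    def load(self, addr: int) -> Word:
        return self.cells[addr]


class Client:
    """Toy memory client that stops storing after it loads marked data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stores_enabled = True

    def load(self, mem: Memory, addr: int) -> Word:
        word = mem.load(addr)
        if word.poisoned:
            # Models the stall: only this client loses the ability to store.
            self.stores_enabled = False
        return word

    def store(self, mem: Memory, addr: int, data: int) -> bool:
        if not self.stores_enabled:
            return False  # store suppressed so the error cannot spread
        mem.store(addr, Word(data))
        return True


def copy_word(src: Memory, src_addr: int, dst: Memory, dst_addr: int) -> None:
    """Copy a word between memories; the error marker is copied with the data."""
    dst.store(dst_addr, src.load(src_addr))
```

In this model, a client that loads a marked word has its later stores rejected, while a second client that never touched the marked word keeps storing normally, and the memory itself continues to service both.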
-