-
Publication number: US20230409876A1
Publication date: 2023-12-21
Application number: US17845543
Application date: 2022-06-21
Applicant: NVIDIA Corporation
Inventor: Vibhor Agrawal, Tamar Viclizki, Vadim Gechman
CPC classification number: G06N3/0454, G06N3/08
Abstract: Apparatuses, systems, and techniques to predict a probability of an error in processing units, such as those of a data center. In at least one embodiment, the probability of an error occurring in a processing unit is identified using a machine learning model trained using one or more previously trained machine learning models, in which the machine learning model is smaller than the previously trained machine learning models.
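The idea of training a smaller model from previously trained models can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the application: the two teacher functions, their weights, the two input features, and the choice of averaging teacher outputs into soft labels. The student is a single logistic unit fitted to those soft labels.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical stand-ins for previously trained (teacher) models. Each maps
# two normalized telemetry features to an error probability.
def teacher_a(x):
    return sigmoid(2.0 * x[0] + 0.5 * x[1] - 1.0)

def teacher_b(x):
    return sigmoid(1.5 * x[0] + 1.0 * x[1] - 1.2)

def soft_labels(samples, teachers):
    """Average the teachers' predicted probabilities for each sample."""
    return [sum(t(x) for t in teachers) / len(teachers) for x in samples]

def train_student(samples, targets, lr=0.5, epochs=2000):
    """Fit one logistic unit to the soft labels with full-batch gradient
    descent on the cross-entropy loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(samples, targets):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of cross-entropy w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def student_predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Synthetic feature grid in [-1, 1] x [-1, 1]; real training data would be
# telemetry from processing units.
samples = [(i / 4.0, j / 4.0) for i in range(-4, 5) for j in range(-4, 5)]
targets = soft_labels(samples, [teacher_a, teacher_b])
w, b = train_student(samples, targets)
```

After training, `student_predict(w, b, x)` approximates the averaged teacher output with three parameters instead of the six used by the two teachers together.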
-
Publication number: US20240406058A1
Publication date: 2024-12-05
Application number: US18629132
Application date: 2024-04-08
Applicant: Nvidia Corporation
Inventor: Elad Alon, Eitan Zahavi, Gaby Diengott, Shie Mannor, Vadim Gechman
IPC: H04L41/0659, H04L41/147, H04L43/06, H04L43/0811
Abstract: A network monitor may execute, or communicate with, one or more stored machine learning models that are trained to predict a failure probability for one or more ports and/or links within a network fabric. Systems and methods may monitor a set of ports and/or links to generate predictions for failure probabilities using a first trained model and low frequency telemetry data. For a subset of ports and/or links with failure probabilities exceeding a first threshold, high speed telemetry data may be used by a second trained model to generate predictions for failure probabilities for the subset of ports. Suspicious ports may then be isolated and undergo various remediation and/or monitoring actions prior to de-isolating the isolated ports.
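The two-stage screening described in this abstract can be sketched as follows. The model functions, the telemetry field `symbol_errors_per_min`, the bit-error-rate samples, and both threshold values are hypothetical placeholders. The abstract names only a first threshold, so the second threshold used to decide isolation is an assumption.

```python
def first_model(low_freq):
    """Hypothetical first model: failure probability from a coarse,
    low-frequency counter."""
    return min(1.0, low_freq["symbol_errors_per_min"] / 100.0)

def second_model(high_speed_samples):
    """Hypothetical second model: fraction of high-speed samples whose
    bit error rate exceeds 1e-9."""
    bad = sum(1 for ber in high_speed_samples if ber > 1e-9)
    return bad / len(high_speed_samples)

def find_ports_to_isolate(low_freq_by_port, read_high_speed,
                          first_threshold=0.3, second_threshold=0.5):
    """Screen every port with the first model, then re-score only the
    flagged subset with the second model on high-speed telemetry."""
    candidates = [port for port, telemetry in low_freq_by_port.items()
                  if first_model(telemetry) > first_threshold]
    to_isolate = [port for port in candidates
                  if second_model(read_high_speed(port)) > second_threshold]
    return candidates, to_isolate

# Illustrative data only.
low_freq = {
    "p1": {"symbol_errors_per_min": 2},
    "p2": {"symbol_errors_per_min": 45},
    "p3": {"symbol_errors_per_min": 80},
}
high_speed = {
    "p2": [1e-12, 2e-12, 5e-9, 1e-12],
    "p3": [3e-9, 8e-9, 1e-12, 4e-9],
}

candidates, to_isolate = find_ports_to_isolate(low_freq, high_speed.__getitem__)
```

High-speed telemetry is read only for ports that pass the first threshold, which keeps the more expensive data collection limited to the suspicious subset. Remediation and de-isolation are outside the scope of this sketch.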
-
Publication number: US20240394130A1
Publication date: 2024-11-28
Application number: US18794219
Application date: 2024-08-05
Applicant: NVIDIA Corporation
Inventor: Tamar Viclizki, Fay Wang, Divyansh Jain, Avighan Majumder, Vadim Gechman, Vibhor Agrawal
Abstract: Apparatuses, systems, and techniques to predict a probability of an error or anomaly in processing units, such as those of a data center. In at least one embodiment, the probability of an error occurring in a processing unit is identified using multiple trained machine learning models, in which the trained machine learning models each outputs, for example, the probability of an error occurring within a different predetermined time period.
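A minimal sketch of the multi-model arrangement, with one model per predetermined time period, might look like this. The window lengths (1, 24, and 168 hours), the logistic form, the weights, and the two input features are all illustrative assumptions; the abstract does not specify them.

```python
import math

def make_model(weights, bias):
    """Build a hypothetical logistic model. Real models would be trained on
    telemetry from processing units."""
    def model(features):
        z = sum(w * x for w, x in zip(weights, features)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return model

# One model per look-ahead window, keyed by window length in hours.
# Windows and weights are assumptions for illustration.
MODELS_BY_HORIZON_HOURS = {
    1: make_model([0.8, 0.1], -4.0),
    24: make_model([1.0, 0.3], -3.0),
    168: make_model([1.2, 0.5], -2.0),
}

def predict_error_probabilities(features):
    """Return, for each window, that model's probability of an error
    occurring within the window."""
    return {hours: model(features)
            for hours, model in MODELS_BY_HORIZON_HOURS.items()}

# Hypothetical features: two normalized telemetry readings for one unit.
probabilities = predict_error_probabilities([1.5, 2.0])
```

Keeping the models separate lets each window be trained and thresholded independently, at the cost of running several models per unit.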
-
Publication number: US12055995B2
Publication date: 2024-08-06
Application number: US17683191
Application date: 2022-02-28
Applicant: NVIDIA Corporation
Inventor: Tamar Viclizki, Fay Wang, Divyansh Jain, Avighan Majumder, Vadim Gechman, Vibhor Agrawal
CPC classification number: G06F11/004, G06N20/20, G06F2201/86
Abstract: Apparatuses, systems, and techniques to predict a probability of an error or anomaly in processing units, such as those of a data center. In at least one embodiment, the probability of an error occurring in a processing unit is identified using multiple trained machine learning models, in which the trained machine learning models each outputs, for example, the probability of an error occurring within a different predetermined time period.
-
Publication number: US20230297453A1
Publication date: 2023-09-21
Application number: US17683191
Application date: 2022-02-28
Applicant: NVIDIA Corporation
Inventor: Tamar Viclizki, Fay Wang, Divyansh Jain, Avighan Majumder, Vadim Gechman, Vibhor Agrawal
CPC classification number: G06F11/004, G06N20/20, G06F2201/86
Abstract: Apparatuses, systems, and techniques to predict a probability of an error or anomaly in processing units, such as those of a data center. In at least one embodiment, the probability of an error occurring in a processing unit is identified using multiple trained machine learning models, in which the trained machine learning models each outputs, for example, the probability of an error occurring within a different predetermined time period.
-