-
Publication No.: US20210097432A1
Publication Date: 2021-04-01
Application No.: US16588930
Filing Date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Rahul Raghavendra Huilgol, Vikas Kumar
Abstract: Methods, systems, and computer-readable media for GPU code injection to summarize machine learning training data are disclosed. Training of a machine learning model is initiated using a graphics processing unit (GPU) associated with a machine learning training cluster. The training of the machine learning model generates tensor data in a memory of the GPU. The GPU determines a summary of the tensor data according to a reduction operator. The summary is smaller in size than the tensor data and is output by the GPU. A machine learning analysis system performs an analysis of the training of the machine learning model based at least in part on the summary of the tensor data. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis.
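
As a rough illustration of the summarization step described in this abstract, the sketch below reduces a tensor to a handful of scalars with reduction operators. It is a CPU stand-in in plain Python, not the GPU code the application describes; the operator set (`min`, `max`, `mean`, `l2_norm`) and the names `REDUCTIONS`, `flatten`, and `summarize` are assumptions made for this example and do not come from the patent text.

```python
import math

# Hypothetical reduction operators; the names are illustrative only.
REDUCTIONS = {
    "min": min,
    "max": max,
    "mean": lambda xs: sum(xs) / len(xs),
    "l2_norm": lambda xs: math.sqrt(sum(x * x for x in xs)),
}


def flatten(tensor):
    """Yield the scalars of a nested-list tensor in row-major order."""
    for item in tensor:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield float(item)


def summarize(tensor, operators=("min", "max", "mean", "l2_norm")):
    """Reduce a tensor to one scalar per requested reduction operator."""
    values = list(flatten(tensor))
    if not values:
        raise ValueError("cannot summarize an empty tensor")
    return {name: REDUCTIONS[name](values) for name in operators}


# A 2x2 "gradient" tensor is reduced to four scalars, so the summary
# stays the same size no matter how large the tensor grows.
gradients = [[0.5, -1.5], [2.0, 3.0]]
summary = summarize(gradients)
```

The point of the design is that only the summary would need to leave the device: its size depends on the number of operators, not on the number of tensor elements.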
-
Publication No.: US20210097431A1
Publication Date: 2021-04-01
Application No.: US16588913
Filing Date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Lakshmi Naarayanan Ramakrishnan, Jeffrey John Geevarghese, Denis Davydenko, Vikas Kumar, Rahul Raghavendra Huilgol, Amol Ashok Lele, Stefano Stefani, Vladimir Zhukov
Abstract: Methods, systems, and computer-readable media for debugging and profiling of machine learning model training are disclosed. A machine learning analysis system receives data associated with training of a machine learning model. The data was collected by a machine learning training cluster. The machine learning analysis system performs analysis of the data associated with the training of the machine learning model. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis. The machine learning analysis system generates one or more alarms describing the one or more conditions associated with the training of the machine learning model.
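
The analyze, detect, and alarm flow in this abstract could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the record fields (`loss`, `grad_norm`), the three conditions, the threshold values, and the names `Alarm` and `analyze` are hypothetical and are not taken from the patent text.

```python
import math
from dataclasses import dataclass


@dataclass
class Alarm:
    step: int        # training step at which the condition was seen
    condition: str   # short machine-readable condition name
    detail: str      # human-readable description


def analyze(records, vanish_below=1e-7, explode_above=1e3):
    """Scan per-step training data and return an alarm per detected condition.

    The thresholds are arbitrary illustrative defaults.
    """
    alarms = []
    for step, rec in enumerate(records):
        loss = rec["loss"]
        norm = rec["grad_norm"]
        if math.isnan(loss) or math.isinf(loss):
            alarms.append(Alarm(step, "non_finite_loss", f"loss={loss}"))
        if norm < vanish_below:
            alarms.append(Alarm(step, "vanishing_gradient", f"grad_norm={norm}"))
        elif norm > explode_above:
            alarms.append(Alarm(step, "exploding_gradient", f"grad_norm={norm}"))
    return alarms


# Hypothetical data as it might arrive from a training cluster.
records = [
    {"loss": 2.3, "grad_norm": 1.2},
    {"loss": 1.9, "grad_norm": 5e-9},
    {"loss": float("nan"), "grad_norm": 4.0e4},
]
alarms = analyze(records)
```

In this sketch the healthy first step raises nothing, the second step raises a vanishing-gradient alarm, and the third raises two alarms, one per condition.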
-
Publication No.: US11468365B2
Publication Date: 2022-10-11
Application No.: US16588930
Filing Date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Rahul Raghavendra Huilgol, Vikas Kumar
Abstract: Methods, systems, and computer-readable media for GPU code injection to summarize machine learning training data are disclosed. Training of a machine learning model is initiated using a graphics processing unit (GPU) associated with a machine learning training cluster. The training of the machine learning model generates tensor data in a memory of the GPU. The GPU determines a summary of the tensor data according to a reduction operator. The summary is smaller in size than the tensor data and is output by the GPU. A machine learning analysis system performs an analysis of the training of the machine learning model based at least in part on the summary of the tensor data. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis.
-
Publication No.: US12039415B2
Publication Date: 2024-07-16
Application No.: US16588913
Filing Date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Lakshmi Naarayanan Ramakrishnan, Jeffrey John Geevarghese, Denis Davydenko, Vikas Kumar, Rahul Raghavendra Huilgol, Amol Ashok Lele, Stefano Stefani, Vladimir Zhukov
Abstract: Methods, systems, and computer-readable media for debugging and profiling of machine learning model training are disclosed. A machine learning analysis system receives data associated with training of a machine learning model. The data was collected by a machine learning training cluster. The machine learning analysis system performs analysis of the data associated with the training of the machine learning model. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis. The machine learning analysis system generates one or more alarms describing the one or more conditions associated with the training of the machine learning model.
-