-
Publication No.: US11468365B2
Publication date: 2022-10-11
Application No.: US16588930
Application date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Rahul Raghavendra Huilgol, Vikas Kumar
Abstract: Methods, systems, and computer-readable media for GPU code injection to summarize machine learning training data are disclosed. Training of a machine learning model is initiated using a graphics processing unit (GPU) associated with a machine learning training cluster. The training of the machine learning model generates tensor data in a memory of the GPU. The GPU determines a summary of the tensor data according to a reduction operator. The summary is smaller in size than the tensor data and is output by the GPU. A machine learning analysis system performs an analysis of the training of the machine learning model based at least in part on the summary of the tensor data. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis.
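Illustrative sketch (not taken from the patent): the abstract describes reducing tensor data to a smaller summary with a reduction operator before it leaves the GPU. The Python below shows only the reduction idea, on the CPU and over a flat list of floats. The function name `summarize_tensor` and the particular set of reductions (count, NaN count, min, max, mean) are assumptions chosen for illustration.

```python
import math
from typing import Dict, Iterable


def summarize_tensor(values: Iterable[float]) -> Dict[str, float]:
    """Reduce a flat tensor to a handful of scalars.

    Hypothetical helper: the patent abstract only says "a reduction operator";
    the specific statistics below are an assumption.
    """
    count = 0
    nan_count = 0
    total = 0.0
    lo = math.inf
    hi = -math.inf
    for v in values:
        count += 1
        if math.isnan(v):
            # NaNs are counted separately so they do not poison min/max/mean.
            nan_count += 1
            continue
        total += v
        lo = min(lo, v)
        hi = max(hi, v)
    valid = count - nan_count
    return {
        "count": count,
        "nan_count": nan_count,
        "min": lo if valid else math.nan,
        "max": hi if valid else math.nan,
        "mean": total / valid if valid else math.nan,
    }


# Four gradient values shrink to five scalars regardless of tensor size.
gradients = [0.5, -1.5, 2.0, float("nan")]
summary = summarize_tensor(gradients)
```

The summary has a fixed size no matter how large the input is, which is the property the abstract relies on when it says the summary is smaller than the tensor data.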
-
Publication No.: US12039415B2
Publication date: 2024-07-16
Application No.: US16588913
Application date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Lakshmi Naarayanan Ramakrishnan, Jeffrey John Geevarghese, Denis Davydenko, Vikas Kumar, Rahul Raghavendra Huilgol, Amol Ashok Lele, Stefano Stefani, Vladimir Zhukov
Abstract: Methods, systems, and computer-readable media for debugging and profiling of machine learning model training are disclosed. A machine learning analysis system receives data associated with training of a machine learning model. The data was collected by a machine learning training cluster. The machine learning analysis system performs analysis of the data associated with the training of the machine learning model. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis. The machine learning analysis system generates one or more alarms describing the one or more conditions associated with the training of the machine learning model.
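Illustrative sketch (not taken from the patent): the abstract describes analyzing collected training data, detecting conditions, and generating alarms that describe them. The Python below shows one possible shape of such a check over a sequence of loss values. The `Alarm` record, the `detect_conditions` function, the two condition names, and the `patience` threshold are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Alarm:
    condition: str  # hypothetical condition name
    step: int       # training step at which the condition was detected
    detail: str     # human-readable description


def detect_conditions(losses: Sequence[float], patience: int = 3) -> List[Alarm]:
    """Scan per-step losses and raise alarms for two assumed conditions:
    a non-finite loss, and no improvement for `patience` consecutive steps."""
    alarms: List[Alarm] = []
    best = math.inf
    stale = 0
    for step, loss in enumerate(losses):
        if math.isnan(loss) or math.isinf(loss):
            alarms.append(Alarm("non_finite_loss", step, f"loss={loss}"))
            continue
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale == patience:
                alarms.append(
                    Alarm(
                        "loss_not_decreasing",
                        step,
                        f"no improvement for {patience} steps",
                    )
                )
    return alarms


alarms = detect_conditions([1.0, 0.8, 0.9, 0.85, 0.81, float("nan")])
```

In this example the loss fails to beat 0.8 for three steps in a row, so one alarm fires at step 4, and the NaN at step 5 triggers a second.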
-
Publication No.: US20210097431A1
Publication date: 2021-04-01
Application No.: US16588913
Application date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Lakshmi Naarayanan Ramakrishnan, Jeffrey John Geevarghese, Denis Davydenko, Vikas Kumar, Rahul Raghavendra Huilgol, Amol Ashok Lele, Stefano Stefani, Vladimir Zhukov
Abstract: Methods, systems, and computer-readable media for debugging and profiling of machine learning model training are disclosed. A machine learning analysis system receives data associated with training of a machine learning model. The data was collected by a machine learning training cluster. The machine learning analysis system performs analysis of the data associated with the training of the machine learning model. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis. The machine learning analysis system generates one or more alarms describing the one or more conditions associated with the training of the machine learning model.
-
Publication No.: US12189717B1
Publication date: 2025-01-07
Application No.: US17105998
Application date: 2020-11-27
Applicant: Amazon Technologies, Inc.
Inventor: Can Karakus, Rahul Raghavendra Huilgol, Anirudh Subramanian, Fei Wu, Christopher Cade Daniel, Akhil Mehra, Ajay Paidi, Yutong Zhang, Indu Thangakrishnan, Luis Alves Pereira Quintela
Abstract: Automatic partitioning of a machine learning model may be performed for training across multiple processing devices. A training job for a machine learning model may specify a number of partitions for a machine learning model. An optimization parameter may be determined for the machine learning model. Different partitions of the machine learning model to be trained across multiple processing devices may be determined based on the specified number of partitions and the optimization parameter. A schedule for executing the training job may be generated according to the respective partitions of the machine learning model. The training job may be executed according to the schedule.
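Illustrative sketch (not taken from the patent): the abstract describes splitting a model into a specified number of partitions guided by an optimization parameter. One simple way to picture this is to give each layer a cost and split the layers into contiguous groups so that the most expensive group is as cheap as possible. The Python below does that with dynamic programming. The function name `partition_layers`, the per-layer cost model, and the min-max objective are assumptions, and the scheduling step mentioned in the abstract is not covered.

```python
from typing import List, Sequence


def partition_layers(costs: Sequence[float], num_partitions: int) -> List[List[int]]:
    """Split layer indices into contiguous groups, minimizing the largest
    group cost. The cost vector stands in for the abstract's "optimization
    parameter" (an assumption; it could be memory or compute time)."""
    n = len(costs)
    if not 1 <= num_partitions <= n:
        raise ValueError("num_partitions must be between 1 and the layer count")

    # prefix[i] holds the total cost of layers 0..i-1.
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[k][i]: smallest achievable bottleneck when the first i layers
    # are split into k groups; cut[k][i]: where the last group starts.
    best = [[inf] * (n + 1) for _ in range(num_partitions + 1)]
    cut = [[0] * (n + 1) for _ in range(num_partitions + 1)]
    best[0][0] = 0.0
    for k in range(1, num_partitions + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the cut table backwards to recover the groups.
    groups: List[List[int]] = []
    i = n
    for k in range(num_partitions, 0, -1):
        j = cut[k][i]
        groups.append(list(range(j, i)))
        i = j
    groups.reverse()
    return groups


layer_costs = [4.0, 2.0, 3.0, 1.0, 6.0]
partitions = partition_layers(layer_costs, 2)
```

With these costs the best two-way split places layers 0 to 2 (cost 9) on one device and layers 3 to 4 (cost 7) on the other; every other contiguous split has a bottleneck of at least 10.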
-
Publication No.: US20210097432A1
Publication date: 2021-04-01
Application No.: US16588930
Application date: 2019-09-30
Applicant: Amazon Technologies, Inc.
Inventor: Andrea Olgiati, Rahul Raghavendra Huilgol, Vikas Kumar
Abstract: Methods, systems, and computer-readable media for GPU code injection to summarize machine learning training data are disclosed. Training of a machine learning model is initiated using a graphics processing unit (GPU) associated with a machine learning training cluster. The training of the machine learning model generates tensor data in a memory of the GPU. The GPU determines a summary of the tensor data according to a reduction operator. The summary is smaller in size than the tensor data and is output by the GPU. A machine learning analysis system performs an analysis of the training of the machine learning model based at least in part on the summary of the tensor data. The machine learning analysis system detects one or more conditions associated with the training of the machine learning model based at least in part on the analysis.
-