Patent search ap:("Huawei Technologies Co., Ltd.") AND inv:"Zach Melamed"

1.

Granted invention patent
Systems and methods for fault tolerance recover during training of a model of a classifier using a distributed system (status: in force)

Publication No.: US11461695B2

Publication date: 2022-10-04

Application No.: US16363639

Filing date: 2019-03-25

Applicant: Huawei Technologies Co., Ltd.

Inventor: Roman Talyansky, Zach Melamed, Natan Peterfreund, Zuguang Wu

IPC: G06K9/00, G06N20/00, G06N20/20, G06F17/18, G06K9/62, G06N5/04, G06F11/14

Abstract: A distributed system for training a classifier is provided. The system comprises machine learning (ML) workers and a parameter server (PS). The PS is configured for parallel processing to provide the model to each of the ML workers, receive model updates from each of the ML workers, and iteratively update the model using each model update. The PS contains: gradient datasets, each associated with a respective ML worker, for storing a model-update-identification (delta-M-ID) indicative of the computed model update together with the respective model update; a global dataset that stores the delta-M-ID, an identification of the ML worker (ML-worker-ID) that computed the model update, and a model version that marks a new model in the PS computed by merging the model update with a previous model in the PS; and a model download dataset that stores the ML-worker-ID and the model version of each transmitted model.
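
The three bookkeeping datasets named in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the class name `ParameterServer`, its method names, the integer delta-M-IDs, and the subtractive merge rule with a `learning_rate` are all assumptions made for illustration. The abstract does not describe the recovery procedure itself, so none is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ParameterServer:
    """Hypothetical PS holding the three datasets listed in the abstract."""
    model: List[float]
    version: int = 0
    # Gradient datasets: one per ML worker, mapping delta-M-ID -> model update.
    gradient_datasets: Dict[str, Dict[int, List[float]]] = field(default_factory=dict)
    # Global dataset: (delta-M-ID, ML-worker-ID, model version after the merge).
    global_dataset: List[Tuple[int, str, int]] = field(default_factory=list)
    # Model download dataset: (ML-worker-ID, model version that was transmitted).
    model_download_dataset: List[Tuple[str, int]] = field(default_factory=list)
    _next_delta_id: int = 0

    def download_model(self, worker_id: str) -> Tuple[List[float], int]:
        """Send the current model to a worker and log which version it received."""
        self.model_download_dataset.append((worker_id, self.version))
        return list(self.model), self.version

    def push_update(self, worker_id: str, update: List[float],
                    learning_rate: float = 1.0) -> int:
        """Store a worker's update, merge it, and record the new model version."""
        delta_id = self._next_delta_id
        self._next_delta_id += 1
        # Keep the raw update in the worker's own gradient dataset.
        self.gradient_datasets.setdefault(worker_id, {})[delta_id] = list(update)
        # Assumed merge rule: a plain gradient step on the previous model.
        self.model = [w - learning_rate * g for w, g in zip(self.model, update)]
        self.version += 1
        self.global_dataset.append((delta_id, worker_id, self.version))
        return self.version


ps = ParameterServer(model=[0.0, 0.0])
ps.download_model("w1")
ps.push_update("w1", [1.0, -2.0], learning_rate=0.5)
print(ps.global_dataset)
```

In this sketch the global dataset fixes the order in which updates were merged, while the per-worker gradient datasets keep the updates themselves, which is the kind of record a fault-tolerance scheme could replay from.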
