Patent search ap:("Google LLC") AND inv:"Dayou Du" Page 1

1.

Invention application
PREFLIGHT CHECKS FOR HARDWARE ACCELERATORS IN A DISTRIBUTED SYSTEM (In force)

Publication No.: US20240385873A1

Publication date: 2024-11-21

Application No.: US18667501

Filing date: 2024-05-17

Applicant: Google LLC

Inventor: Jiafan Zhu, Jianqiao Liu, Xiangyu Dong, Xiao Zhang, Jikai Tang, Kexin Yang, Yong Zhao, Alireza Ghaffarkhah, Arash Rezaei, Dayou Du, Yazhou Zu, Xiangling Kong, Hoang-Vu Dang, Alexander Vadimovich Kolbasov

IPC: G06F9/48, G06F9/50, G06F11/30, G06F11/34

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for performing preflight checks of a distributed computing system, are described. In one aspect, a method includes assigning a computing workload to a first subset of hardware accelerator machines each having one or more hardware accelerators. A preflight check on the first subset is performed before performing the computing workload to verify the functionality of each machine in the first subset. For each hardware accelerator machine of the first subset, a program code package is installed, including a task action based at least in part on characteristics of the computing workload. The task action including a sequence of operations is performed on the hardware accelerator machine to determine whether the task action fails. Whenever the task action fails, the computing workload is re-assigned to a second subset of hardware accelerator machines different from the first subset.
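
The flow in this abstract (assign a workload to a subset of machines, run a task action on each machine, re-assign on failure) can be illustrated with a minimal sketch. Everything named here (`Machine`, `TaskAction`, `preflight`, `assign_workload`, the `healthy` flag) is a hypothetical stand-in for illustration and is not taken from the patent; the real task action would be installed as a program code package and exercise actual accelerator hardware.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Sequence


@dataclass(frozen=True)
class Machine:
    """Hypothetical hardware accelerator machine."""
    name: str
    healthy: bool  # assumption: stands in for real accelerator state


# A task action is modeled as a sequence of operations; each returns True on success.
TaskAction = Sequence[Callable[[Machine], bool]]


def run_task_action(machine: Machine, action: TaskAction) -> bool:
    """Run the operations in order; the action fails at the first failing operation."""
    return all(op(machine) for op in action)


def preflight(subset: Iterable[Machine], action: TaskAction) -> bool:
    """A subset passes only if the task action succeeds on every machine in it."""
    return all(run_task_action(m, action) for m in subset)


def assign_workload(
    candidate_subsets: Sequence[List[Machine]], action: TaskAction
) -> Optional[List[Machine]]:
    """Return the first subset that passes preflight, or None if none does."""
    for subset in candidate_subsets:
        if preflight(subset, action):
            return subset
    return None


# Usage: the first subset contains a faulty machine, so the workload
# is re-assigned to the second subset.
action = [lambda m: m.healthy]
first = [Machine("a0", True), Machine("a1", False)]
second = [Machine("b0", True), Machine("b1", True)]
chosen = assign_workload([first, second], action)
```

In this sketch the choice of the second subset is a simple linear scan over candidates; the abstract only requires that the second subset differ from the first and does not say how it is selected.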

2.

Granted invention patent
Smart online link repair and job scheduling in machine learning supercomputers (In force)

Publication No.: US12289196B2

Publication date: 2025-04-29

Application No.: US18077884

Filing date: 2022-12-08

Applicant: Google LLC

Inventor: Yazhou Zu, Alireza Ghaffarkhah, Dayou Du

IPC: G06F15/177, H04L41/0659, H04L41/16, H04L43/02, H04L43/0823

Abstract: Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.
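
The disable, reroute, and re-enable cycle described in this abstract can be sketched as follows. The `NetworkController` class, its method names, and the four-node ring are assumptions made for illustration only. The abstract does not specify the routing or topology-aware link selection algorithm, so a plain breadth-first search over enabled links is used here as a stand-in.

```python
from collections import deque


class NetworkController:
    """Hypothetical centralized controller tracking disabled undirected links."""

    def __init__(self, links):
        self.links = {frozenset(link) for link in links}
        self.disabled = set()

    def disable(self, a, b):
        """Take a broken link out of service while it is repaired."""
        self.disabled.add(frozenset((a, b)))

    def enable(self, a, b):
        """Return a repaired and tested link to service."""
        self.disabled.discard(frozenset((a, b)))

    def route(self, src, dst):
        """Fewest-hop path over enabled links only (BFS); None if unreachable."""
        adjacency = {}
        for link in self.links - self.disabled:
            a, b = tuple(link)
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
        parents = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for nxt in adjacency.get(node, []):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None


# Usage on a four-node ring 0-1-2-3-0.
controller = NetworkController([(0, 1), (1, 2), (2, 3), (3, 0)])
before = controller.route(0, 1)   # direct link is healthy
controller.disable(0, 1)          # link reported broken, e.g. by a pre-flight check
during = controller.route(0, 1)   # traffic goes the long way around the ring
controller.enable(0, 1)           # repair workflow has fixed and tested the link
after = controller.route(0, 1)
```

The sketch keeps the job running on healthy links within the same network during the repair, which is the behavior the abstract attributes to the network controller; the per-node agents and failure detection are omitted.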

3.

Published invention application
SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS (Under examination, published)

Publication No.: US20240195679A1

Publication date: 2024-06-13

Application No.: US18077884

Filing date: 2022-12-08

Applicant: Google LLC

Inventor: Yazhou Zu, Alireza Ghaffarkhah, Dayou Du

IPC: H04L41/0654, H04L41/16, H04L43/02, H04L43/0823

CPC classification number: H04L41/0672, H04L41/16, H04L43/02, H04L43/0823

Abstract: Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.

4.

Published invention application
PREFLIGHT CHECKS FOR HARDWARE ACCELERATORS IN A DISTRIBUTED SYSTEM (Under examination, published)

Publication No.: US20230168919A1

Publication date: 2023-06-01

Application No.: US17540123

Filing date: 2021-12-01

Applicant: Google LLC

Inventor: Jiafan Zhu, Jianqiao Liu, Xiangyu Dong, Xiao Zhang, Jikai Tang, Kexin Yang, Yong Zhao, Alireza Ghaffarkhah, Arash Rezaei, Dayou Du, Yazhou Zu, Xiangling Kong, Hoang-Vu Dang, Alexander Vadimovich Kolbasov

IPC: G06F9/48, G06F9/50, G06F11/34, G06F11/30

CPC classification number: G06F9/4843, G06F9/5027, G06F11/3433, G06F11/3024

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for performing preflight checks of a distributed computing system, are described. In one aspect, a method includes assigning a computing workload to a first subset of hardware accelerator machines each having one or more hardware accelerators. A preflight check on the first subset is performed before performing the computing workload to verify the functionality of each machine in the first subset. For each hardware accelerator machine of the first subset, a program code package is installed, including a task action based at least in part on characteristics of the computing workload. The task action including a sequence of operations is performed on the hardware accelerator machine to determine whether the task action fails. Whenever the task action fails, the computing workload is re-assigned to a second subset of hardware accelerator machines different from the first subset.

5.

Granted invention patent
Preflight checks for hardware accelerators in a distributed system (In force)

Publication No.: US12020063B2

Publication date: 2024-06-25

Application No.: US17540123

Filing date: 2021-12-01

Applicant: Google LLC

Inventor: Jiafan Zhu, Jianqiao Liu, Xiangyu Dong, Xiao Zhang, Jikai Tang, Kexin Yang, Yong Zhao, Alireza Ghaffarkhah, Arash Rezaei, Dayou Du, Yazhou Zu, Xiangling Kong, Hoang-Vu Dang, Alexander Vadimovich Kolbasov

IPC: G06F9/50, G06F9/48, G06F11/30, G06F11/34

CPC classification number: G06F9/4843, G06F9/5027, G06F11/3024, G06F11/3433

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for performing preflight checks of a distributed computing system, are described. In one aspect, a method includes assigning a computing workload to a first subset of hardware accelerator machines each having one or more hardware accelerators. A preflight check on the first subset is performed before performing the computing workload to verify the functionality of each machine in the first subset. For each hardware accelerator machine of the first subset, a program code package is installed, including a task action based at least in part on characteristics of the computing workload. The task action including a sequence of operations is performed on the hardware accelerator machine to determine whether the task action fails. Whenever the task action fails, the computing workload is re-assigned to a second subset of hardware accelerator machines different from the first subset.
