-
Publication No.: US20240385873A1
Publication date: 2024-11-21
Application No.: US18667501
Filing date: 2024-05-17
Applicant: Google LLC
Inventor: Jiafan Zhu, Jianqiao Liu, Xiangyu Dong, Xiao Zhang, Jikai Tang, Kexin Yang, Yong Zhao, Alireza Ghaffarkhah, Arash Rezaei, Dayou Du, Yazhou Zu, Xiangling Kong, Hoang-Vu Dang, Alexander Vadimovich Kolbasov
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for performing preflight checks of a distributed computing system, are described. In one aspect, a method includes assigning a computing workload to a first subset of hardware accelerator machines each having one or more hardware accelerators. A preflight check on the first subset is performed before performing the computing workload to verify the functionality of each machine in the first subset. For each hardware accelerator machine of the first subset, a program code package is installed, including a task action based at least in part on characteristics of the computing workload. The task action including a sequence of operations is performed on the hardware accelerator machine to determine whether the task action fails. Whenever the task action fails, the computing workload is re-assigned to a second subset of hardware accelerator machines different from the first subset.
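The flow in this abstract (assign a workload to a subset of machines, run a task action on each machine, re-assign on failure) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `TaskAction`, `preflight`, and `assign_with_preflight`, the machine labels, and the single smoke-test operation are all hypothetical, and installing the program code package is modeled simply as passing the task action to each machine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical model: an operation takes a machine name and returns True on success.
Operation = Callable[[str], bool]


@dataclass
class TaskAction:
    """A named sequence of operations (illustrative, not from the patent text)."""
    name: str
    operations: List[Operation]

    def run(self, machine: str) -> bool:
        """Run operations in order; the action fails on the first failure or exception."""
        for op in self.operations:
            try:
                if not op(machine):
                    return False
            except Exception:
                return False
        return True


def preflight(machines: Sequence[str], action: TaskAction) -> List[str]:
    """Return the machines on which the task action failed."""
    return [m for m in machines if not action.run(m)]


def assign_with_preflight(
    subsets: Sequence[Sequence[str]], action: TaskAction
) -> Optional[List[str]]:
    """Try candidate subsets in order; return the first one that passes preflight."""
    for subset in subsets:
        if not preflight(subset, action):
            return list(subset)
    return None  # no candidate subset passed


# Assumed scenario: machine "m2" has a faulty accelerator.
faulty = {"m2"}
action = TaskAction("accelerator_smoke_test", [lambda m: m not in faulty])
chosen = assign_with_preflight([["m1", "m2"], ["m3", "m4"]], action)
print(chosen)  # ['m3', 'm4']
```

In this sketch the first subset fails because of `m2`, so the workload falls through to the second subset. A real scheduler would presumably choose the task action from the workload's characteristics, which the sketch leaves out.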
-
Publication No.: US11704110B2
Publication date: 2023-07-18
Application No.: US17889508
Filing date: 2022-08-17
Applicant: Google LLC
Inventor: Jianqiao Liu, Xiangyu Dong, Pedram Z. Dashti, Kais Belgaied
IPC: G06F8/65
CPC classification number: G06F8/65
Abstract: A uniform and unified firmware in-field upgrade capability for the optics modules may ensure compatibility, security and code quality, and scalability. In some examples, an intermediate representation, which includes vendor firmware upgrade operations and control logic, may be defined, received, and parsed. Read/write operations may be communicated to optical module(s) based on the control logic. In some examples, a unified optics module firmware in-field upgrade framework, which has multiple defined software layers, may ensure a uniform and unified approach to managing optics module(s) from different vendors and used by different projects. The software layers may properly translate optics module read/write operations, abstract and make uniform the read/write operations, provide libraries of intermediate representations, package the intermediate representations into executables/scripts, monitor optics module status, determine when a new firmware is released, and gradually upgrade the optics module firmware.
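The idea of parsing an intermediate representation and turning it into module read/write operations can be sketched as below. This is a hedged illustration only: the JSON format, the `write` and `expect` operation names, the register addresses, and the `FakeOpticsModule` class are assumptions made for the example and are not taken from the patent or from any vendor interface.

```python
import json
from typing import Dict, List


class FakeOpticsModule:
    """Stand-in for a vendor module: a register map with read/write access (hypothetical)."""

    def __init__(self) -> None:
        self.registers: Dict[int, int] = {}

    def read(self, addr: int) -> int:
        return self.registers.get(addr, 0)

    def write(self, addr: int, value: int) -> None:
        self.registers[addr] = value


def parse_ir(text: str) -> List[dict]:
    """Parse the assumed JSON IR and reject unknown operations."""
    steps = json.loads(text)
    for step in steps:
        if step.get("op") not in {"write", "expect"}:
            raise ValueError(f"unknown op: {step.get('op')!r}")
    return steps


def run_ir(steps: List[dict], module: FakeOpticsModule) -> bool:
    """Issue reads/writes in order.

    "expect" plays the role of control logic here: the upgrade aborts
    if the register does not hold the expected value.
    """
    for step in steps:
        if step["op"] == "write":
            module.write(step["addr"], step["value"])
        elif module.read(step["addr"]) != step["value"]:
            return False
    return True


# Made-up upgrade sequence: set a flag, confirm it, then write a data byte.
IR_TEXT = """
[
  {"op": "write",  "addr": 122, "value": 1},
  {"op": "expect", "addr": 122, "value": 1},
  {"op": "write",  "addr": 123, "value": 255}
]
"""

module = FakeOpticsModule()
ok = run_ir(parse_ir(IR_TEXT), module)
print(ok, module.registers)
```

Keeping the vendor-specific sequence in data rather than code is one way the same executor could serve modules from different vendors, which appears to be the point of the layered framework the abstract describes.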
-
Publication No.: US20250053810A1
Publication date: 2025-02-13
Application No.: US18928941
Filing date: 2024-10-28
Applicant: Google LLC
Inventor: Xiangyu Dong, Kais Belgaied, Yazhou Zu
Abstract: This disclosure generally provides solutions for improving the performance of a custom-built, packet-switched, TPU accelerator-side communication network. Specifically, a set of solutions to improve the flow-control behavior by tuning the packet buffer queues in the on-chip router in the distributed training supercomputer network is described.
-
Publication No.: US12159225B2
Publication date: 2024-12-03
Application No.: US17136229
Filing date: 2020-12-29
Applicant: Google LLC
Inventor: Xiangyu Dong, Kais Belgaied, Yazhou Zu
Abstract: This disclosure generally provides solutions for improving the performance of a custom-built, packet-switched, TPU accelerator-side communication network. Specifically, a set of solutions to improve the flow-control behavior by tuning the packet buffer queues in the on-chip router in the distributed training supercomputer network is described.
-
Publication No.: US12014167B2
Publication date: 2024-06-18
Application No.: US18201365
Filing date: 2023-05-24
Applicant: Google LLC
Inventor: Jianqiao Liu, Xiangyu Dong, Pedram Z. Dashti, Kais Belgaied
IPC: G06F8/65
CPC classification number: G06F8/65
Abstract: A uniform and unified firmware in-field upgrade capability for the optics modules may ensure compatibility, security and code quality, and scalability. In some examples, an intermediate representation, which includes vendor firmware upgrade operations and control logic, may be defined, received, and parsed. Read/write operations may be communicated to optical module(s) based on the control logic. In some examples, a unified optics module firmware in-field upgrade framework, which has multiple defined software layers, may ensure a uniform and unified approach to managing optics module(s) from different vendors and used by different projects. The software layers may properly translate optics module read/write operations, abstract and make uniform the read/write operations, provide libraries of intermediate representations, package the intermediate representations into executables/scripts, monitor optics module status, determine when a new firmware is released, and gradually upgrade the optics module firmware.
-
Publication No.: US20220398090A1
Publication date: 2022-12-15
Application No.: US17889508
Filing date: 2022-08-17
Applicant: Google LLC
Inventor: Jianqiao Liu, Xiangyu Dong, Pedram Z. Dashti, Kais Belgaied
IPC: G06F8/65
Abstract: A uniform and unified firmware in-field upgrade capability for the optics modules may ensure compatibility, security and code quality, and scalability. In some examples, an intermediate representation, which includes vendor firmware upgrade operations and control logic, may be defined, received, and parsed. Read/write operations may be communicated to optical module(s) based on the control logic. In some examples, a unified optics module firmware in-field upgrade framework, which has multiple defined software layers, may ensure a uniform and unified approach to managing optics module(s) from different vendors and used by different projects. The software layers may properly translate optics module read/write operations, abstract and make uniform the read/write operations, provide libraries of intermediate representations, package the intermediate representations into executables/scripts, monitor optics module status, determine when a new firmware is released, and gradually upgrade the optics module firmware.
-
Publication No.: US20220291915A1
Publication date: 2022-09-15
Application No.: US17201256
Filing date: 2021-03-15
Applicant: Google LLC
Inventor: Jianqiao Liu, Xiangyu Dong, Pedram Z. Dashti, Kais Belgaied
IPC: G06F8/65
Abstract: A uniform and unified firmware in-field upgrade capability for the optics modules may ensure compatibility, security and code quality, and scalability. In some examples, an intermediate representation, which includes vendor firmware upgrade operations and control logic, may be defined, received, and parsed. Read/write operations may be communicated to optical module(s) based on the control logic. In some examples, a unified optics module firmware in-field upgrade framework, which has multiple defined software layers, may ensure a uniform and unified approach to managing optics module(s) from different vendors and used by different projects. The software layers may properly translate optics module read/write operations, abstract and make uniform the read/write operations, provide libraries of intermediate representations, package the intermediate representations into executables/scripts, monitor optics module status, determine when a new firmware is released, and gradually upgrade the optics module firmware.
-
Publication No.: US20230168919A1
Publication date: 2023-06-01
Application No.: US17540123
Filing date: 2021-12-01
Applicant: Google LLC
Inventor: Jiafan Zhu, Jianqiao Liu, Xiangyu Dong, Xiao Zhang, Jikai Tang, Kexin Yang, Yong Zhao, Alireza Ghaffarkhah, Arash Rezaei, Dayou Du, Yazhou Zu, Xiangling Kong, Hoang-Vu Dang, Alexander Vadimovich Kolbasov
CPC classification number: G06F9/4843, G06F9/5027, G06F11/3433, G06F11/3024
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for performing preflight checks of a distributed computing system, are described. In one aspect, a method includes assigning a computing workload to a first subset of hardware accelerator machines each having one or more hardware accelerators. A preflight check on the first subset is performed before performing the computing workload to verify the functionality of each machine in the first subset. For each hardware accelerator machine of the first subset, a program code package is installed, including a task action based at least in part on characteristics of the computing workload. The task action including a sequence of operations is performed on the hardware accelerator machine to determine whether the task action fails. Whenever the task action fails, the computing workload is re-assigned to a second subset of hardware accelerator machines different from the first subset.
-
Publication No.: US11467822B2
Publication date: 2022-10-11
Application No.: US17201256
Filing date: 2021-03-15
Applicant: Google LLC
Inventor: Jianqiao Liu, Xiangyu Dong, Pedram Z. Dashti, Kais Belgaied
IPC: G06F8/65
Abstract: A uniform and unified firmware in-field upgrade capability for the optics modules may ensure compatibility, security and code quality, and scalability. In some examples, an intermediate representation, which includes vendor firmware upgrade operations and control logic, may be defined, received, and parsed. Read/write operations may be communicated to optical module(s) based on the control logic. In some examples, a unified optics module firmware in-field upgrade framework, which has multiple defined software layers, may ensure a uniform and unified approach to managing optics module(s) from different vendors and used by different projects. The software layers may properly translate optics module read/write operations, abstract and make uniform the read/write operations, provide libraries of intermediate representations, package the intermediate representations into executables/scripts, monitor optics module status, determine when a new firmware is released, and gradually upgrade the optics module firmware.
-
Publication No.: US20220114440A1
Publication date: 2022-04-14
Application No.: US17136229
Filing date: 2020-12-29
Applicant: Google LLC
Inventor: Xiangyu Dong, Kais Belgaied, Yazhou Zu
Abstract: This disclosure generally provides solutions for improving the performance of a custom-built, packet-switched, TPU accelerator-side communication network. Specifically, a set of solutions to improve the flow-control behavior by tuning the packet buffer queues in the on-chip router in the distributed training supercomputer network is described.