SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS

    Publication number: US20240195679A1

    Publication date: 2024-06-13

    Application number: US18077884

    Application date: 2022-12-08

    Applicant: Google LLC

    CPC classification number: H04L41/0672 H04L41/16 H04L43/02 H04L43/0823

    Abstract: Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.
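
The disable, reroute, and re-enable cycle described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `NetworkController` class, its method names, and the use of a plain breadth-first search for rerouting are all assumptions made for illustration, and failure detection by per-node agents is left out.

```python
from collections import deque


class NetworkController:
    """Hypothetical controller that tracks which links are enabled."""

    def __init__(self, links):
        # links: iterable of (node_a, node_b) pairs; all start enabled.
        self.links = {frozenset(link) for link in links}
        self.disabled = set()

    def disable(self, a, b):
        """Take a broken link out of service while it is repaired."""
        self.disabled.add(frozenset((a, b)))

    def enable(self, a, b):
        """Return a repaired and tested link to service."""
        self.disabled.discard(frozenset((a, b)))

    def route(self, src, dst):
        """Shortest path over enabled links (BFS), or None if unreachable."""
        adjacency = {}
        for link in self.links - self.disabled:
            a, b = tuple(link)
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
        parent = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in adjacency.get(node, []):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None
```

On a four-node ring, disabling the link between nodes 0 and 1 makes `route(0, 1)` go the long way around through nodes 3 and 2; calling `enable(0, 1)` after the repair restores the direct path for later jobs.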

    Smart online link repair and job scheduling in machine learning supercomputers

    Publication number: US12289196B2

    Publication date: 2025-04-29

    Application number: US18077884

    Application date: 2022-12-08

    Applicant: Google LLC

    Abstract: Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.

    FAULT-TOLERANT ROUTING ALGORITHM FOR TOROIDAL NETWORK TOPOLOGIES

    Publication number: US20240195732A1

    Publication date: 2024-06-13

    Application number: US18077906

    Application date: 2022-12-08

    Applicant: Google LLC

    CPC classification number: H04L45/28 H04L45/54

    Abstract: Generally disclosed herein is an approach for optimizing routing strategy to tolerate faults in a toroidal network topology including, but not limited to, N-dimensional mesh, torus, and twisted torus. The approach may include balancing a load for a specified input traffic pattern operating offline or online. The approach may also include an optimization enhancement technique specifically applicable to symmetric, dynamically composable toroidal networks based on a set of centrally connected circuit switches.
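
To make the idea of tolerating faults in a torus concrete, here is a small sketch under stated assumptions. It only computes the hop count between two nodes of an N-dimensional torus while skipping faulty links, using an ordinary breadth-first search. It does not attempt the load balancing or the circuit-switch optimization mentioned in the abstract, and the function names `torus_neighbors` and `hops_avoiding_faults` are hypothetical.

```python
from collections import deque


def torus_neighbors(node, dims):
    """Yield the neighbors of `node` in a torus, with wraparound per axis."""
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(node)
            nxt[axis] = (nxt[axis] + step) % size
            yield tuple(nxt)


def hops_avoiding_faults(src, dst, dims, faulty_links=()):
    """Fewest hops from src to dst that avoid faulty links, or None."""
    faulty = {frozenset(link) for link in faulty_links}
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in torus_neighbors(node, dims):
            # Skip links marked faulty and nodes already reached.
            if frozenset((node, nxt)) in faulty or nxt in dist:
                continue
            dist[nxt] = dist[node] + 1
            queue.append(nxt)
    return None
```

In a 4x4 torus, nodes `(0, 0)` and `(0, 3)` are one hop apart through the wraparound link; if that link is marked faulty, the shortest remaining route takes three hops.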
