Smart online link repair and job scheduling in machine learning supercomputers

    Publication number: US12289196B2

    Publication date: 2025-04-29

    Application number: US18077884

    Application date: 2022-12-08

    Applicant: Google LLC

    Abstract: Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.

    SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS

    Publication number: US20240195679A1

    Publication date: 2024-06-13

    Application number: US18077884

    Application date: 2022-12-08

    Applicant: Google LLC

    CPC classification number: H04L41/0672 H04L41/16 H04L43/02 H04L43/0823

    Abstract: Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in Machine Learning (ML) and High-Performance Computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.
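
The disable, reroute, and re-enable cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the `NetworkController` class, its method names, the breadth-first routing, and the four-node ring topology are all assumptions made for illustration, and the per-node agents are reduced to a single `report_broken` call. Links are assumed to be undirected and to connect two distinct nodes.

```python
from collections import deque


class NetworkController:
    """Hypothetical centralized controller tracking which links are enabled."""

    def __init__(self, links):
        # Each link is an undirected pair of distinct node names (assumption).
        self.links = {frozenset(link) for link in links}
        self.disabled = set()

    def report_broken(self, a, b):
        """Stand-in for a node agent's report; disables the link for repair."""
        link = frozenset((a, b))
        if link not in self.links:
            raise KeyError(f"unknown link {a}-{b}")
        self.disabled.add(link)

    def mark_repaired(self, a, b):
        """Re-enable a link once the repair workflow has fixed and tested it."""
        self.disabled.discard(frozenset((a, b)))

    def healthy_neighbors(self, node):
        """Yield nodes reachable from `node` over one enabled link."""
        for link in self.links - self.disabled:
            if node in link:
                (other,) = link - {node}
                yield other

    def route(self, src, dst):
        """Breadth-first search over enabled links; returns a path or None."""
        parents = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for nxt in sorted(self.healthy_neighbors(node)):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None


# Usage: a four-node ring A-B-C-D-A (hypothetical topology).
ctl = NetworkController([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
ctl.report_broken("A", "B")        # link goes offline for repair
detour = ctl.route("A", "B")       # traffic takes the long way around the ring
ctl.mark_repaired("A", "B")        # link is enabled again for future jobs
```

While the A-B link is disabled, traffic between A and B still has a path through D and C, which mirrors the abstract's point that user jobs may continue to run on healthy links during an online repair.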
