-
1.
Publication No.: US20190007278A1
Publication date: 2019-01-03
Application No.: US15640114
Filing date: 2017-06-30
Inventors: Sathyanarayana SINGH, Gaurav JAGTIANI, Rohit PANDEY, Durmus Ugur KARATAY, Gil Lapid SHAFRIRI
IPC classification: H04L12/24
Abstract: Methods, systems, and computer program products are described herein for minimizing the downtime for nodes in a network-accessible server set. The downtime may be minimized by determining an optimal timeout value for which a fabric controller waits to perform a recovery action. The optimal timeout value may be determined for each cluster in the network-accessible server set. The optimal timeout value advantageously reduces the overall downtime for customer workloads running on a node for which contact has been lost. The optimal timeout value for each cluster may be based on a predictive model based on the observed historical patterns of the nodes within that cluster. In the event that an optimal timeout value is not determined for a particular cluster (e.g., due to a lack of observed historical patterns), the fabric controller may fall back to a less than optimal timeout value.
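The per-cluster timeout selection with a fallback can be illustrated with a minimal sketch. It is not the patent's predictive model: it assumes a simple cost model in which a node that comes back on its own within the timeout costs its own recovery time, and any other node costs the timeout plus a fixed recovery-action cost. The names `choose_timeout`, `expected_downtime`, `DEFAULT_TIMEOUT_S`, and `MIN_SAMPLES` are hypothetical.

```python
DEFAULT_TIMEOUT_S = 600.0  # hypothetical fallback timeout (seconds)
MIN_SAMPLES = 5            # hypothetical minimum history size per cluster


def expected_downtime(timeout_s, history, recovery_cost_s):
    """Mean downtime if the controller waits timeout_s before acting.

    history holds, per past incident, the seconds until the node came
    back on its own, or None if it never did.
    """
    total = 0.0
    for t in history:
        if t is not None and t <= timeout_s:
            total += t                            # node recovered by itself
        else:
            total += timeout_s + recovery_cost_s  # waited, then forced recovery
    return total / len(history)


def choose_timeout(history, recovery_cost_s, candidates,
                   default_s=DEFAULT_TIMEOUT_S, min_samples=MIN_SAMPLES):
    """Pick the candidate timeout with the lowest mean historical downtime.

    Falls back to default_s when the cluster has too little history.
    """
    if len(history) < min_samples:
        return default_s
    return min(candidates,
               key=lambda t: expected_downtime(t, history, recovery_cost_s))
```

Called with one cluster's incident history, a recovery-action cost, and a list of candidate timeouts, `choose_timeout` returns the candidate that would have minimized average downtime, or the default when fewer than `min_samples` incidents were observed.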
-
2.
Publication No.: US20240338282A1
Publication date: 2024-10-10
Application No.: US18330651
Filing date: 2023-06-07
Inventors: Binit Ranjan MISHRA, Mukhtar AHMED, Christina Marianne CURLETTE, Steven Adrian WEST, Gaurav JAGTIANI, Naga Kiran GOVINDARAJU, James George CAVALARIS, Drew Douglas CROSS, Jason Stewart WOHLGEMUTH, James Anthony SCHWARTZ, JR., Jennifer Marie BOURLIER, Sri Harsha KANUKUNTLA, Emma Sutherland BOYD, Scott Chao-Chueh LEE, Vijaybalaji MADHANAGOPAL, Terence Kwok Tak CHAN, Yuri DOTSENKO, Peter Hanpeng JIANG, Aacer Hatem DAKEN, Emily Nicole WILSON, Emily Cara CLEMENS, Cody Dean HARTWIG, Raz Meir ALONI, Sharon Scarlet TANG, Minsang KIM, Shen WANG
CPC classification: G06F11/1471, G06F11/0772, G06F11/1441
Abstract: In-place recovery of fatal system errors at virtualization hosts. A device identifies an occurrence of a fatal system error in the first instance of a host operating system (OS) executing in a computer system. The device determines to perform an in-place recovery for the fatal system error. The device performs the in-place recovery, including pausing the execution of a virtual machine (VM) by the first instance of the host OS, preserving a state of the VM within system memory of the computer system, and resuming the execution of the VM by a second instance of the host OS executing in the computer system based on the state of the VM that is preserved within the system memory of the computer system.
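The pause, preserve, and resume sequence can be pictured with a toy simulation. This is only a sketch under simplifying assumptions: a VM's state is a plain dictionary, "system memory" is a dictionary that outlives the first host OS instance, and the names `HostOS` and `in_place_recovery` are hypothetical rather than taken from the patent.

```python
class HostOS:
    """Toy stand-in for one instance of a host operating system."""

    def __init__(self, label):
        self.label = label
        self.running_vms = {}  # vm_id -> VM state (a plain dict here)


def in_place_recovery(failed_host, fresh_host, preserved_memory):
    """Move every VM from failed_host to fresh_host via preserved_memory.

    preserved_memory models a region of system memory that survives
    the replacement of the first host OS instance.
    """
    for vm_id in list(failed_host.running_vms):
        # Pause: the failed instance stops executing the VM.
        state = failed_host.running_vms.pop(vm_id)
        # Preserve: keep a copy of the VM state in system memory.
        preserved_memory[vm_id] = dict(state)
    for vm_id, state in preserved_memory.items():
        # Resume: the second instance picks the VM up from the preserved state.
        fresh_host.running_vms[vm_id] = state
    return fresh_host
```

The point of the sketch is the ordering: the VM's state never leaves the machine, so the second host OS instance can continue from where the first one stopped.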
-
3.
Publication No.: US20230396511A1
Publication date: 2023-12-07
Application No.: US17833238
Filing date: 2022-06-06
Inventors: Shandan ZHOU, Sam Prakash BHERI, Karthikeyan SUBRAMANIAN, Yancheng CHEN, Gaurav JAGTIANI, Abhay Sudhir KETKAR, Hemant MALIK, Thomas MOSCIBRODA, Shweta Balkrishna PATIL, Luke Rafael RODRIGUEZ, Dalianna Victoria VAYSMAN
CPC classification: G06F11/1415, G06N20/20, G06F9/5072, G06F2209/505
Abstract: A computer implemented method includes receiving telemetry data corresponding to capacity health of nodes in a cloud based computing system. The received telemetry data is processed via a prediction engine to provide predictions of capacity health at multiple dimensions of the cloud based computing system. Node recoverability information is received and node recovery execution is initiated as a function of the representations of capacity health and node recoverability information.
-
4.
Publication No.: US20240201767A1
Publication date: 2024-06-20
Application No.: US18084822
Filing date: 2022-12-20
Inventors: Emma Sutherland BOYD, Shekhar AGRAWAL, Amruta Bhalchandra PATHAK, Yu YAO, Aravind Narayanan KRISHNAMOORTHY, Derek James BOYER, Binit Ranjan MISHRA, Gaurav JAGTIANI, Abhay Sudhir KETKAR, Tri Minh TRAN
CPC classification: G06F1/30, G06F11/0721, G06F11/0793
Abstract: The present disclosure relates to utilizing a host failure recovery system to efficiently and accurately determine the health of host devices. For example, the host failure recovery system detects when a host server is failing by utilizing a power failure detection model that determines whether a host server is operating in a healthy power state or an unhealthy power state. In particular, the host failure recovery system utilizes a multi-layer power failure detection model that determines power-draw failure events on a host device. The failure detection model determines, with high confidence, the health of a host device based on power-draw signals and/or usage characteristics of the host device. Additionally, the host failure recovery system can initiate a quick recovery of a failing host device.
-
5.
Publication No.: US20200150972A1
Publication date: 2020-05-14
Application No.: US16186340
Filing date: 2018-11-09
Inventors: Abhay Sudhir KETKAR, Gaurav JAGTIANI, Ajay MANI, Richard Thomas RUSSO, Shweta Balkrishna PATIL, James Cameron WHITE
Abstract: A method for opportunistically performing an action in a cloud computing system may include detecting a reboot event corresponding to a computing entity in the cloud computing system. The computing entity may be, for example, a host machine in the cloud computing system or a virtual machine in the cloud computing system. The method may also include causing the computing entity to be held in a stopped state and performing the action while the computing entity is being held in the stopped state, thereby eliminating a need to perform the action at a future time subsequent to the reboot event. The nature of the action is such that it would affect the computing entity if the action were performed subsequent to the reboot event. The method may also include causing the computing entity to be started after the action has been performed.
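The hold, act, and start sequence can be sketched as a small reboot handler. This is an illustration under assumptions, not the claimed implementation: the entity is a toy object, pending actions are plain callables, and the names `Entity` and `handle_reboot` are hypothetical.

```python
class Entity:
    """Toy stand-in for a host machine or virtual machine."""

    def __init__(self, name):
        self.name = name
        self.state = "running"
        self.log = []  # ordered record of what happened to the entity


def handle_reboot(entity, pending_actions):
    """On a reboot event, run queued actions while the entity is held stopped.

    pending_actions is a list of callables taking the entity; it is
    drained so that nothing is left to disrupt the entity later.
    """
    entity.state = "stopped"       # hold the entity instead of restarting it
    entity.log.append("held")
    while pending_actions:
        action = pending_actions.pop(0)
        action(entity)             # perform the action during the hold
    entity.state = "running"       # only now let the entity start
    entity.log.append("started")
```

Queuing, say, an update as a pending action and calling `handle_reboot` when a reboot is detected applies the update during downtime that was going to happen anyway, which is the saving the abstract describes.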
-