Abstract:
Disclosed are aspects of proactive high availability that identify and predict hardware failure scenarios and migrate virtual resources to healthy hardware resources. In some aspects, a mapping that maps virtual resources to hardware resources is identified. A plurality of hardware events are identified in association with a first hardware resource. A hardware failure scenario is predicted based on a health score of the first hardware resource. The health score is determined based on the hardware events and a fault model that indicates a level of severity of the hardware events. A particular virtual resource is migrated from the first hardware resource to another hardware resource that has a greater health score.
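The scoring and migration flow in this abstract can be illustrated with a minimal Python sketch. The event names, the severity weights, the subtractive scoring formula, and the 0.5 threshold are all assumptions made for illustration; the abstract only states that the score depends on the hardware events and on a fault model giving their severity.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical fault model: event type -> severity weight (0 = harmless, 1 = critical).
FAULT_MODEL = {
    "fan_degraded": 0.2,
    "memory_ecc_error": 0.5,
    "disk_predictive_failure": 0.8,
}

@dataclass
class Host:
    name: str
    events: list[str] = field(default_factory=list)

def health_score(host: Host, fault_model: dict[str, float]) -> float:
    """Return a score in [0, 1]; events missing from the fault model are ignored."""
    penalty = sum(fault_model.get(event, 0.0) for event in host.events)
    return max(0.0, 1.0 - penalty)

def plan_migrations(mapping: dict[str, str], hosts: list[Host],
                    fault_model: dict[str, float],
                    threshold: float = 0.5) -> dict[str, str]:
    """Return {virtual_resource: target_host} for resources on low-scoring hosts."""
    scores = {h.name: health_score(h, fault_model) for h in hosts}
    healthiest = max(scores, key=scores.get)
    plan = {}
    for vm, host_name in mapping.items():
        # Predict a failure scenario when the current host scores below the threshold.
        if scores[host_name] < threshold and scores[healthiest] > scores[host_name]:
            plan[vm] = healthiest
    return plan
```

In this sketch every affected virtual resource goes to the single healthiest host; a real placement policy would also have to account for capacity on the target.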
Abstract:
The subject matter described herein is generally directed towards detection and remediation of virtual computing instance (VCI) failure on host devices. Monitoring is performed to detect suspected failures of different guest operating systems, identify failure information, and perform remediation to provide high availability for the VCI.
Abstract:
Exemplary methods, apparatuses, and systems include a hypervisor receiving an error message from an agent within a first virtual machine run by the hypervisor. In response to the error message, the hypervisor determines and initiates a corrective action for the hypervisor to take in response to the error message. An exemplary corrective action includes initiating a reset of the first virtual machine or a reset of a second virtual machine.
Abstract:
Embodiments maintain high availability of software application instances in a fault domain. Subordinate hosts are monitored by a master host. The subordinate hosts publish heartbeats via a network and datastores. Based at least in part on the published heartbeats, the master host determines the status of each subordinate host, distinguishing between subordinate hosts that are entirely inoperative and subordinate hosts that are operative but partitioned (e.g., unreachable via the network). The master host may restart software application instances, such as virtual machines, that are executed by inoperative subordinate hosts or that cease executing on partitioned subordinate hosts.
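The distinction between inoperative and partitioned hosts can be sketched as follows. This is a simplified illustration, not the disclosed implementation: the timestamps, the 15-second timeout, and the `still_running` input are assumptions, and the rule that a fresh datastore heartbeat with a stale network heartbeat means "partitioned" is one plausible reading of the abstract.

```python
from enum import Enum

class HostStatus(Enum):
    LIVE = "live"
    PARTITIONED = "partitioned"  # operative but unreachable via the network
    DEAD = "dead"                # entirely inoperative

def classify_host(last_network_hb: float, last_datastore_hb: float,
                  now: float, timeout: float = 15.0) -> HostStatus:
    """Classify a subordinate host from its two most recent heartbeat times."""
    network_ok = now - last_network_hb <= timeout
    datastore_ok = now - last_datastore_hb <= timeout
    if network_ok:
        return HostStatus.LIVE
    if datastore_ok:
        # Still writing heartbeats to shared storage, so the host itself is running.
        return HostStatus.PARTITIONED
    return HostStatus.DEAD

def vms_to_restart(status: HostStatus, vms: list, still_running: set) -> list:
    """Pick the VMs the master host may restart for a given host status.

    `still_running` is a hypothetical input naming the VMs known to be
    executing on a partitioned host.
    """
    if status is HostStatus.DEAD:
        return list(vms)
    if status is HostStatus.PARTITIONED:
        return [vm for vm in vms if vm not in still_running]
    return []
```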
Abstract:
Techniques are disclosed for orchestrating high availability (HA) failover for virtual machines (VMs) running on host systems of a host cluster, where the host cluster aggregates locally-attached storage resources of the host systems to provide an object store, and where persistent data for one or more of the VMs is stored as per-VM storage objects across the locally-attached storage resources comprising the object store. In one embodiment, a host system in the host cluster executing an HA module determines a VM to be restarted on an active host system in the host cluster. The host system further determines if the VM's persistent data is stored in the object store. If so, the host system adds the VM to a list of VMs to be immediately restarted. Otherwise, the host system checks whether the VM is accessible to the host system by querying a storage layer of the host system configured to manage the object store.
Abstract:
A system for monitoring a virtual machine executed on a host. The system includes a processor that receives an indication that a failure caused a storage device to be inaccessible to the virtual machine, the inaccessible storage device impacting an ability of the virtual machine to provide service, and applies a remedy to restore access to the storage device based on a type of the failure.
Abstract:
Exemplary methods, apparatuses, and systems determine a list of virtual machines to be subject to a corrective action. When one or more of the listed virtual machines have dependencies upon other virtual machines, network connections, or storage devices, the determination of the list includes determining that the dependencies of the one or more virtual machines have been met. An attempt to restart or take another corrective action for the first virtual machine within the list is made. A second virtual machine that is currently deployed and running or powered off or paused in response to the corrective action for the first virtual machine is determined to be dependent upon the first virtual machine. In response to the second virtual machine's dependencies having been met by the attempt to restart or take corrective action for the first virtual machine, the second virtual machine is added to the list of virtual machines.
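The dependency handling described above resembles a worklist that grows as corrective actions succeed. The sketch below is an assumption-laden illustration: the function name, the dictionary-of-sets representation, and the premise that every attempted restart succeeds are hypothetical and not taken from the abstract.

```python
def restart_in_dependency_order(dependencies: dict, failed: set, available: set) -> list:
    """Return the order in which VMs could be restarted.

    dependencies: VM name -> set of names it depends on (other VMs,
                  network connections, or storage devices).
    failed:       VMs that need a corrective action.
    available:    names of dependencies that are already met.
    VMs whose dependencies can never be met are left out of the result.
    """
    met = set(available)
    pending = set(failed)
    order = []
    progressed = True
    while pending and progressed:
        progressed = False
        for vm in sorted(pending):  # sorted() copies, so pending can be modified
            if dependencies.get(vm, set()) <= met:
                order.append(vm)
                met.add(vm)  # assumption: the restart attempt succeeds
                pending.discard(vm)
                progressed = True
    return order
```

For example, with a database that needs storage, an application that needs the database, and a web tier that needs the application and a network, the database is handled first, which in turn makes the application and then the web tier eligible.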
Abstract:
Recovery of virtual machines when one or more hosts fail includes identifying virtual machines running on the remaining functioning hosts. Some of the identified powered on virtual machines are suspended in favor of restarting some of the failed virtual machines from the failed host(s). A subsequent round of identifying virtual machines for suspension and virtual machines for restarting is performed. Virtual machines for suspension and restarting may be identified based on their associated “recovery time objective” (RTO) values or their “maximum number of RTO violations” value.
Abstract:
The disclosure provides an approach for the dynamic configuration of virtualized objects. A virtual object may be associated with a desired state defining a first plurality of resources for allocating to the virtual object. The first plurality of resources correspond to one or more resource types. Techniques include determining that each of a plurality of hosts does not have sufficient available resources to allocate the first plurality of resources to the virtual object according to the desired state. Techniques include selecting a first host of the plurality of hosts to run the virtual object. Techniques include allocating a second plurality of resources to the virtual object from the first host, wherein the second plurality of resources is less than the first plurality of resources, and running the virtual object in the first host.
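A rough Python sketch of this fallback allocation follows. The abstract does not say how the first host is selected, so the "largest fraction of the request covered" rule here is purely an assumption, as are the function name and the resource type names used in the usage note.

```python
def place(desired: dict, hosts: dict) -> tuple:
    """Choose a host and an allocation for a virtual object.

    desired: resource type -> requested amount (the desired state).
    hosts:   host name -> {resource type -> available amount}.
    Returns (host_name, allocation). The allocation equals `desired` when
    some host fits it; otherwise it is a reduced allocation.
    """
    # Preferred path: a host that can satisfy the full desired state.
    for name, free in hosts.items():
        if all(free.get(r, 0) >= amount for r, amount in desired.items()):
            return name, dict(desired)

    # Fallback (hypothetical policy): pick the host covering the largest
    # fraction of the request, summed over resource types.
    def coverage(free: dict) -> float:
        return sum(min(free.get(r, 0), amount) / amount
                   for r, amount in desired.items() if amount > 0)

    name = max(hosts, key=lambda n: coverage(hosts[n]))
    allocation = {r: min(hosts[name].get(r, 0), amount)
                  for r, amount in desired.items()}
    return name, allocation
```

With a request of 8 CPUs and 32 GB of memory, and hosts offering (4 CPUs, 32 GB) and (6 CPUs, 16 GB), neither host fits, so the sketch selects the first host and allocates 4 CPUs and 32 GB.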
Abstract:
Exemplary methods, apparatuses, and systems include a target site management server transmitting, to a source site management server, a plurality of protection service plans available for replication of data from the source site to the target site. The transmission of the protection service plans includes a description of one or more service level characteristics provided by each protection service plan and excludes a listing of physical and virtual resources within the target site that are to provide the service level characteristics. The target site management server receives selection of one of the protection service plans and determines the physical resources within the target site to provide the advertised service level characteristics for the data replication. The target site management server further transmits configuration details to one or more of the determined physical resources to implement the replication infrastructure within the target site according to the selected protection service plan.