Abstract:
Embodiments maintain high availability of software application instances in a fault domain. Subordinate hosts are monitored by a master host. The subordinate hosts publish heartbeats via a network and datastores. Based at least in part on the published heartbeats, the master host determines the status of each subordinate host, distinguishing between subordinate hosts that are entirely inoperative and subordinate hosts that are operative but partitioned (e.g., unreachable via the network). The master host may restart software application instances, such as virtual machines, that are executed by inoperative subordinate hosts or that cease executing on partitioned subordinate hosts.
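The status determination described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `HostStatus` and `classify_host`, the use of heartbeat ages in seconds, and the 15-second timeout are all assumptions introduced here.

```python
from enum import Enum


class HostStatus(Enum):
    LIVE = "live"                # reachable via the network
    PARTITIONED = "partitioned"  # operative, but unreachable via the network
    DEAD = "dead"                # entirely inoperative


def classify_host(network_age, datastore_age, timeout=15.0):
    """Classify a subordinate host from the age (in seconds) of its latest
    heartbeats. An age of None means no heartbeat was ever observed on that
    channel. The timeout value is a hypothetical default."""
    network_ok = network_age is not None and network_age <= timeout
    datastore_ok = datastore_age is not None and datastore_age <= timeout
    if network_ok:
        return HostStatus.LIVE
    if datastore_ok:
        # Still writing datastore heartbeats, so the host is running
        # but cannot be reached over the network.
        return HostStatus.PARTITIONED
    return HostStatus.DEAD
```

Under these assumptions, a master would restart virtual machines from hosts classified as `DEAD` immediately, and from `PARTITIONED` hosts only once those virtual machines cease executing.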
Abstract:
Exemplary methods, apparatuses, and systems include a hypervisor receiving an error message from an agent within a first virtual machine run by the hypervisor. In response to the error message, the hypervisor determines a corrective action and initiates it. An exemplary corrective action includes initiating a reset of the first virtual machine or a reset of a second virtual machine.
Abstract:
Techniques are disclosed for maintaining high availability (HA) for virtual machines (VMs) running on host systems of a host cluster, where each host system executes an HA module in a plurality of HA modules and a storage module in a plurality of storage modules, where the host cluster aggregates, via the plurality of storage modules, locally-attached storage resources of the host systems to provide an object store, where persistent data for the VMs is stored as per-VM storage objects across the locally-attached storage resources comprising the object store, and where a failure causes the plurality of storage modules to observe a network partition in the host cluster that the plurality of HA modules do not. In one embodiment, a host system in the host cluster executing a first HA module invokes an API exposed by the plurality of storage modules for persisting metadata for a VM to the object store. If the API is not processed successfully, the host system: (1) identifies a subset of second HA modules in the plurality of HA modules; (2) issues an accessibility query for the VM to the subset of second HA modules in parallel, the accessibility query being configured to determine whether the VM is accessible to the respective host systems of the subset of second HA modules; and (3) if at least one second HA module in the subset indicates that the VM is accessible to its respective host system, transmits a command to the at least one second HA module to invoke the API on its respective host system.
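The three-step fallback in this embodiment can be sketched as follows. This is a toy model under stated assumptions: the `HAModule` class, its `can_access` and `persist` methods, and the function `persist_vm_metadata` are hypothetical names, and in-process method calls stand in for the network messages a real cluster would exchange.

```python
from concurrent.futures import ThreadPoolExecutor


class HAModule:
    """Hypothetical stand-in for an HA module and its host's storage module."""

    def __init__(self, name, accessible_vms):
        self.name = name
        self.accessible_vms = set(accessible_vms)
        self.store = {}

    def can_access(self, vm_id):
        # Accessibility query: is the VM's storage object visible from this host?
        return vm_id in self.accessible_vms

    def persist(self, vm_id, metadata):
        # Models the storage API call; it fails if the object is unreachable.
        if vm_id not in self.accessible_vms:
            return False
        self.store[vm_id] = metadata
        return True


def persist_vm_metadata(local, peers, vm_id, metadata):
    """Try the local API first; on failure, query the peers in parallel and
    delegate to one that reports the VM as accessible. Returns the name of
    the module that persisted the metadata, or None if none could."""
    if local.persist(vm_id, metadata):
        return local.name
    if not peers:
        return None
    # Step (2): issue the accessibility query to the whole subset in parallel.
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        answers = list(pool.map(lambda p: p.can_access(vm_id), peers))
    # Step (3): command an accessible peer to invoke the API on its own host.
    for peer, accessible in zip(peers, answers):
        if accessible and peer.persist(vm_id, metadata):
            return peer.name
    return None
```

How the subset of peers in step (1) is chosen is left to the caller here, since the abstract does not specify it.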
Abstract:
The subject matter described herein is generally directed towards detection and remediation of virtual computing instance (VCI) failure on host devices. Monitoring is performed to detect suspected failures of different guest operating systems, identify failure information, and perform remediation to provide high availability for the VCI.
Abstract:
A system for monitoring a virtual machine executed on a host. The system includes a processor that receives an indication that a failure caused a storage device to be inaccessible to the virtual machine, the inaccessible storage device impacting an ability of the virtual machine to provide service, and applies a remedy to restore access to the storage device based on a type of the failure.
Abstract:
The present disclosure is related to methods, systems, and machine-readable media for workflows for series of snapshots. A server can manage replication of a number of series of snapshots of a virtual computing instance (VCI). An on-host agent can replicate a parent series of the number of series of snapshots to at least one child series of the number of series of snapshots. The parent series can precede the at least one child series in the number of series of snapshots. A change in the parent series can be propagated to the child series. Management of the replication of the number of series of snapshots can be switched from the server to a different server.
Abstract:
A method for restoring a data volume using incremental snapshots of the data volume includes creating a first series of incremental snapshots according to a first predefined interval. The method further includes creating a second series of incremental snapshots according to a second predefined interval that is an integer multiple of the first predefined interval. The method also includes receiving a request to restore the data volume to a point-in-time. The method further includes restoring the data volume to the point-in-time using none or some of the snapshots in the first series that were created at or prior to the point-in-time, and all of the snapshots in the second series that were created at or prior to the point-in-time.
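One way to read the restore step is sketched below. The model is an assumption made for illustration: each incremental snapshot is a dictionary of changed blocks, each snapshot in the second (coarser) series holds the merged changes since the previous snapshot in that series, and the names `Snapshot`, `select_snapshots`, and `restore` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    time: int      # creation time, in units of the first predefined interval
    changes: dict  # block id -> block contents changed since the prior snapshot


def select_snapshots(fine, coarse, point_in_time):
    """Pick every coarse snapshot at or before the point-in-time, plus only
    the fine snapshots that fall after the last coarse snapshot used."""
    coarse_used = [s for s in coarse if s.time <= point_in_time]
    last_coarse = max((s.time for s in coarse_used), default=float("-inf"))
    fine_used = [s for s in fine if last_coarse < s.time <= point_in_time]
    return coarse_used, fine_used


def restore(base, fine, coarse, point_in_time):
    """Rebuild the volume by replaying the selected snapshots in time order."""
    volume = dict(base)
    coarse_used, fine_used = select_snapshots(fine, coarse, point_in_time)
    for snap in sorted(coarse_used + fine_used, key=lambda s: s.time):
        volume.update(snap.changes)
    return volume
```

When the point-in-time coincides with a coarse snapshot, no fine snapshots are selected, which corresponds to the "none" case in the abstract. Otherwise, at most one coarse interval's worth of fine snapshots is replayed.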
Abstract:
The present disclosure is related to methods, systems, and machine-readable media for modifying an instance catalog to perform operations. A storage system can include a plurality of packfiles that store data. The storage system can include a plurality of streams that include a plurality of hashes that identify the plurality of packfiles. The storage system can include an instance catalog that includes an identification of the plurality of streams. The storage system can include an operation engine to perform a number of operations on the plurality of packfiles by modifying the instance catalog using the identification of the plurality of streams.
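As a rough illustration of performing an operation by editing the catalog rather than the packfiles, consider the sketch below. The data layout (a list of stream identifiers for the catalog, a dictionary mapping each stream to its packfile hashes) and the function names `delete_stream` and `live_packfiles` are assumptions, not details from the disclosure.

```python
def live_packfiles(catalog, streams):
    """Return the set of packfile hashes reachable from the catalog.

    catalog: list of stream identifiers
    streams: dict mapping stream identifier -> list of packfile hashes
    """
    return {h for stream_id in catalog for h in streams[stream_id]}


def delete_stream(catalog, stream_id):
    """Delete a stream by dropping its identifier from the catalog.
    No packfile is touched; a new catalog list is returned."""
    return [sid for sid in catalog if sid != stream_id]
```

In this model, a deletion only changes which packfiles are reachable. Packfiles that no remaining stream references could then be reclaimed by a separate cleanup pass.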