摘要:
A cluster recovery and maintenance technique for a server cluster having plural nodes implementing a server tier in a client-server computing architecture. A first group of N active nodes each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on the clients. A second group of M spare nodes each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide client application services. First and second zones in the cluster are determined in response to an active node membership change involving one or more active nodes departing from or being added to the first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node.
摘要:
A method and system for maintaining a discovery record and a cluster bootstrap record is provided. The discovery record enables shared storage system discovery and the cluster bootstrap record enables cluster discovery and cooperative cluster startup. The cluster bootstrap record is updated in response to a change in the cluster membership. The update is performed by a cluster leader in the form of a transactionally consistent I/O update to the cluster bootstrap record on disk and a distributed cache update across the cluster (30, 50). The update is aborted (80) in the event of a failure in the cluster leaving the cluster bootstrap record in a consistent state. In the event of a disastrous cluster and/or storage system failure, the discovery record may be recovered (228) from a restored storage system (214) and the cluster bootstrap record may be reset to install a new cluster in the old cluster's place (232).
摘要:
A cluster recovery and maintenance technique for a server cluster having plural nodes implementing a server tier in a client-server computing architecture. A first group of N active nodes each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of one or more client applications running in a client application tier on the clients. A second group of M spare nodes each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide client application services. First and second zones in the cluster are determined in response to an active node membership change involving one or more active nodes departing from or being added to the first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node.
摘要:
A method and system for localizing and resolving a fault in a cluster environment. The cluster is configured with at least one multi-homed node, and at least one gateway for each network interface. Heartbeat messages are sent between peer nodes and the gateway in predefined periodic intervals. In the event of loss of a heartbeat message by any node or gateway, an ICMP echo is issued to each node and gateway in the cluster for each network interface. If neither a node loss not a network loss is validated in response to the ICMP echo, an application level ping is issued to determine if the fault associated with the absence of the heartbeat message is a transient error condition or an application software fault.
摘要:
A method and system for localizing and resolving a fault in a cluster environment. The cluster is configured with at least one multi-homed node, and at least one gateway for each network interface. Heartbeat messages are sent between peer nodes and the gateway in predefined periodic intervals. In the event of loss of a heartbeat message by any node or gateway, an ICMP echo is issued to each node and gateway in the cluster for each network interface. If neither a node loss not a network loss is validated in response to the ICMP echo, an application level ping is issued to determine if the fault associated with the absence of the heartbeat message is a transient error condition or an application software fault.
摘要:
A cluster recovery and maintenance technique for use in a server cluster having plural nodes implementing a server tier in a client-server computing architecture. A first group of N active nodes each run a software stack comprising a cluster management tier and a cluster application tier that actively provides services on behalf of client applications running in a client application tier. A second group of M spare nodes each run a software stack comprising a cluster management tier and a cluster application tier that does not actively provide services on behalf of client applications. First and second zones in the cluster are determined in response to an active node membership change involving active nodes departing from or being added to the first group as a result of an active node failing or becoming unreachable or as a result of a maintenance operation involving an active node.
摘要:
A system, method and computer program product for use in a server cluster having plural server nodes implementing a server tier in a client-server computing architecture in order to determine which of two or more partitioned server subgroups has a quorum. A determination is made of relative priorities of the subgroups and a quorum is awarded to the subgroup having a highest relative priority. The relative priorities are determined by policy rules that evaluate comparative server node application state information. The server node application state information may include one or more of client connectivity, application priority, resource connectivity, processing capability, memory availability, and input/output resource availability, etc. The policy rules evaluate the application state information for each subgroup and can assign different weights to different types of application state information. An interface may be provided for receiving policy rules specified by a cluster application.
摘要:
A method and system for localizing and resolving a fault in a cluster environment. The cluster is configured with at least one multi-homed node, and at least one gateway for each network interface. Heartbeat messages are sent between peer nodes and the gateway in predefined periodic intervals. In the event of loss of a heartbeat message by any node or gateway, an ICMP echo is issued to each node and gateway in the cluster for each network interface. If neither a node loss nor a network loss is validated in response to the ICMP echo, an application level ping is issued to determine if the fault associated with the absence of the heartbeat message is a transient error condition or an application software fault.