Abstract:
A system and method are provided for ensuring delivery of a communication from one computer system or node to another. A first node includes an object handler, such as an ORB (Object Request Broker), that receives object references from higher-level services operating on the first node, wherein the referenced object resides on a second node. The first node's object handler generates a message to an object handler on the second node and attempts to send the message to the second node through a transport module. The message is assigned a unique identifier, such as a sequence number. If the first object handler receives an uncertain status concerning the message (e.g., other than a certain success or failure), it issues a query to the second node to determine if the message was received. If the query is received by the second object handler before the message itself is received, the message is considered lost or rescinded by the first node. The first node stores the identifier so that it will not be re-assigned to another message and the message is then re-sent with a different identifier. The second object handler notes the identifier and status of the rescinded message and will discard any message having that identifier that is received. The second node includes two or more data structures to track the status of communications sent from the first node. The first node, in addition to a collection of identifiers of lost messages, may also record the status of communications it attempts to send and may also note the identifiers of messages that could not be transmitted (e.g., because of communication link errors).
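The rescind-and-resend exchange described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `Sender`, `Receiver`, `Status`, and `make_flaky_transport`, the three transport outcomes (`"ok"`, `"failed"`, `"uncertain"`), the retry limit, and the choice to retry after a definite failure are all assumptions made for the example. The query is also sent directly rather than over the transport, to keep the sketch short.

```python
from enum import Enum


class Status(Enum):
    RECEIVED = "received"
    RESCINDED = "rescinded"


class Receiver:
    """Stands in for the second node's object handler (hypothetical name)."""

    def __init__(self):
        self.status = {}     # identifier -> Status
        self.delivered = []  # payloads passed on to higher-level services

    def on_message(self, ident, payload):
        # Discard any message whose identifier was rescinded earlier.
        if self.status.get(ident) is Status.RESCINDED:
            return False
        self.status[ident] = Status.RECEIVED
        self.delivered.append(payload)
        return True

    def on_query(self, ident):
        # A query that arrives before the message marks it as rescinded.
        if ident not in self.status:
            self.status[ident] = Status.RESCINDED
        return self.status[ident]


class Sender:
    """Stands in for the first node's object handler (hypothetical name)."""

    def __init__(self, receiver, transport):
        self.receiver = receiver
        self.transport = transport  # callable returning "ok", "failed" or "uncertain"
        self.next_id = 0
        self.lost_ids = set()        # identifiers of rescinded messages
        self.untransmitted = set()   # identifiers that could not be transmitted

    def send(self, payload, max_attempts=5):
        for _ in range(max_attempts):
            ident = self.next_id
            self.next_id += 1  # identifiers are never handed out twice
            outcome = self.transport(self.receiver, ident, payload)
            if outcome == "ok":
                return ident
            if outcome == "failed":
                # Assumption: retry after a definite failure as well.
                self.untransmitted.add(ident)
                continue
            # Uncertain status: ask the receiver whether the message arrived.
            if self.receiver.on_query(ident) is Status.RECEIVED:
                return ident
            self.lost_ids.add(ident)  # rescinded; re-send under a new identifier
        raise RuntimeError("message could not be delivered")


def make_flaky_transport():
    """Transport whose first attempt is delayed and reported as uncertain."""
    attempts = []

    def transport(receiver, ident, payload):
        attempts.append(ident)
        if len(attempts) == 1:
            return "uncertain"
        receiver.on_message(ident, payload)
        return "ok"

    return transport


receiver = Receiver()
sender = Sender(receiver, make_flaky_transport())
final_id = sender.send("invoke foo()")
# The delayed first copy finally shows up and must be discarded.
late_copy_accepted = receiver.on_message(0, "invoke foo()")
```

In this run the first identifier is rescinded, the payload is delivered once under the second identifier, and the late copy carrying the rescinded identifier is dropped.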
Abstract:
One embodiment of the present invention provides a system that facilitates communications between a cluster of nodes within a clustered computing system in a manner that tolerates failures of communication pathways between the nodes. The system operates by configuring a distinct logical pathway between each possible source node and each possible destination node in the cluster, so that each distinct logical pathway is routed across one of at least two disjoint physical pathways between each possible source node and each possible destination node. In doing so, the system configures a first logical pathway between a first node and a second node across a first physical pathway of at least two disjoint physical pathways between the first node and the second node. Upon detecting a failure of the first physical pathway, the system reroutes the first logical pathway across a second physical pathway from the at least two disjoint physical pathways between the first node and the second node. In one embodiment of the present invention, the system associates a distinct per-node logical address with each node in the cluster. For each source node, the system associates the per-node logical address of each possible destination node with a corresponding logical pathway to the destination node. In this way, a communication from a given source node to a per-node logical address of a given destination node is directed across the corresponding logical pathway to the given destination node.
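The rerouting of a logical pathway onto a surviving physical pathway might look like the sketch below. The class `PathwayTable`, the switch names, and the rule of picking the first healthy alternative are hypothetical; the abstract only requires that at least two disjoint physical pathways exist for each source and destination pair.

```python
class PathwayTable:
    """Maps (source, destination logical address) to an active physical pathway."""

    def __init__(self, physical_paths):
        # physical_paths: {(src, dst_addr): [path, ...]}, at least two disjoint paths each
        for pair, paths in physical_paths.items():
            if len(set(paths)) < 2:
                raise ValueError(f"{pair} needs at least two physical pathways")
        self.physical = physical_paths
        self.failed = set()
        # Each logical pathway starts out on the first physical pathway listed.
        self.active = {pair: paths[0] for pair, paths in physical_paths.items()}

    def route(self, src, dst_addr):
        # Traffic sent to a per-node logical address follows its logical pathway.
        return self.active[(src, dst_addr)]

    def report_failure(self, path):
        self.failed.add(path)
        for pair, current in list(self.active.items()):
            if current != path:
                continue
            healthy = [p for p in self.physical[pair] if p not in self.failed]
            if not healthy:
                raise RuntimeError(f"no physical pathway left for {pair}")
            self.active[pair] = healthy[0]  # reroute the logical pathway


table = PathwayTable({
    ("node-a", "addr-b"): ["switch-1", "switch-2"],
    ("node-b", "addr-a"): ["switch-2", "switch-1"],
})
before = table.route("node-a", "addr-b")
table.report_failure("switch-1")
after = table.route("node-a", "addr-b")
```

The sender keeps addressing `addr-b` throughout; only the mapping underneath the logical pathway changes.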
Abstract:
A software version management system for a multi-node computer system is described. Its version manager ensures that cluster nodes running completely incompatible software are unable to communicate with each other. The version manager provides a mechanism for determining when nodes in the cluster are running incompatible software and for determining the exact version of software that each node must run. It also supports rolling upgrades, ensuring that the chosen version of software that runs the cluster stays constant even while the software installed on individual nodes is changing.
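One way to picture the rolling-upgrade behavior is the sketch below. It rests on several assumptions that the abstract does not state: versions are plain integers, each node advertises the set of versions its installed software can run, and an explicit `commit_upgrade` step moves the cluster to the highest version common to all nodes. The class and method names are hypothetical.

```python
class VersionManager:
    def __init__(self, running_version):
        self.running_version = running_version  # version the cluster currently runs
        self.installed = {}  # node -> versions its installed software can run

    def join(self, node, versions):
        # A node that cannot run the cluster's version is kept out entirely.
        if self.running_version not in versions:
            return False
        self.installed[node] = set(versions)
        return True

    def upgrade_node(self, node, versions):
        # New software may be installed, but the running version does not change.
        if self.running_version not in versions:
            raise ValueError("upgrade would make the node incompatible")
        self.installed[node] = set(versions)

    def commit_upgrade(self):
        if not self.installed:
            return self.running_version
        # Assumption: move to the highest version every member can run.
        common = set.intersection(*self.installed.values())
        self.running_version = max(common)
        return self.running_version


vm = VersionManager(running_version=1)
joined = [vm.join("n1", {1}), vm.join("n2", {1}), vm.join("n3", {2})]
vm.upgrade_node("n1", {1, 2})
mid_upgrade = vm.commit_upgrade()   # n2 still only runs version 1
vm.upgrade_node("n2", {1, 2})
after_upgrade = vm.commit_upgrade()
```

While nodes are upgraded one at a time, the running version stays fixed; it only advances once every member can run the newer version.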
Abstract:
One embodiment of the present invention provides a system that facilitates interactions between different versions of software that support remote object invocations. During operation, the system receives a reference to an object that is implemented on a server. Next, the system identifies one or more versions of the object supported by the reference, wherein each successive version of the object inherits methods from a preceding version of the object. The system then invokes a method on the object that is supported by the one or more versions of the object.
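The idea of a reference that advertises which versions of an object it supports can be illustrated with a small sketch. Everything here is hypothetical: the `Counter` classes, the `INTERFACE_METHODS` table, integer version numbers, and the `ObjectRef.invoke` call are stand-ins chosen for the example, and the remote call is simulated with a local one.

```python
# Methods introduced by each interface version (hypothetical example interface).
INTERFACE_METHODS = {1: {"get"}, 2: {"add"}}


def methods_for(version):
    """Each version inherits every method of the versions before it."""
    methods = set()
    for v in range(1, version + 1):
        methods |= INTERFACE_METHODS[v]
    return methods


class CounterV1:
    def __init__(self):
        self.value = 0

    def get(self):
        return self.value


class CounterV2(CounterV1):  # version 2 inherits get() from version 1
    def add(self, n):
        self.value += n
        return self.value


class ObjectRef:
    """Reference to a server-side object plus the versions it supports."""

    def __init__(self, target, versions):
        self.target = target
        self.versions = set(versions)

    def supports(self, method):
        return method in methods_for(max(self.versions))

    def invoke(self, method, *args):
        if not self.supports(method):
            raise NotImplementedError(f"{method} not supported by this reference")
        return getattr(self.target, method)(*args)


old_ref = ObjectRef(CounterV1(), {1})
new_ref = ObjectRef(CounterV2(), {1, 2})
old_value = old_ref.invoke("get")
new_value = new_ref.invoke("add", 5)
```

A newer client holding `old_ref` can check `supports("add")` first and fall back to version 1 methods when the server is older.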
Abstract:
One embodiment of the present invention provides a system for selecting a node to host a primary server for a service from a plurality of nodes in a distributed computing system. The system operates by receiving an indication that a state of the distributed computing system has changed. In response to this indication, the system determines if there is already a node hosting the primary server for the service. If not, the system selects a node to host the primary server using the assumption that a given node from the plurality of nodes in the distributed computing system hosts the primary server. The system then communicates rank information between the given node and other nodes in the distributed computing system, wherein each node in the distributed computing system has a unique rank with respect to the other nodes in the distributed computing system. The system next compares the rank of the given node with the rank of the other nodes in the distributed computing system. If one of the other nodes has a higher rank than the given node, the system disqualifies the given node from hosting the primary server.
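The rank comparison can be sketched as below, assuming all ranks are gathered into one dictionary rather than exchanged over a network. The function name `select_primary`, the integer ranks, and the rule that the one node left undisqualified becomes the host are assumptions made for the example.

```python
def select_primary(ranks, current_primary=None):
    """Pick the node that hosts the primary server.

    ranks: {node: rank}, with a unique rank per node.
    current_primary: node already hosting the primary, if any.
    """
    if not ranks:
        raise ValueError("no nodes available")
    if len(set(ranks.values())) != len(ranks):
        raise ValueError("ranks must be unique")

    # If a live node already hosts the primary, nothing needs to change.
    if current_primary in ranks:
        return current_primary

    # Every node starts by assuming that it hosts the primary.
    candidates = set(ranks)
    for node, rank in ranks.items():
        for other, other_rank in ranks.items():
            # A node is disqualified once any other node outranks it.
            if other != node and other_rank > rank:
                candidates.discard(node)
                break

    # Unique ranks leave exactly one candidate standing.
    (winner,) = candidates
    return winner


ranks = {"node-a": 1, "node-b": 3, "node-c": 2}
fresh_choice = select_primary(ranks)
kept_choice = select_primary(ranks, current_primary="node-a")
failover_choice = select_primary(ranks, current_primary="node-z")  # old primary is gone
```

Because ranks are unique, every node reaches the same answer independently once it has seen the ranks of the others.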