Abstract:
A method and apparatus for transparent failover of a filesystem within a computer cluster is provided. For failover protection, a filesystem is physically connected to an active server node and a standby server node. A cluster file system provides distributed access to the filesystem throughout the computer cluster. The cluster file system monitors the progress of each operation performed on the failover protected filesystem. If the active server node should fail during an operation, all processes performing operations on the failover protected filesystem are caused to sleep. The filesystem is then relocated to the standby server node. The cluster file system then awakens each sleeping process and retries each pending operation.
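The failover sequence described above (block the caller, relocate the filesystem to the standby node, then retry the pending operation) can be sketched as follows. This is only an illustrative single-process model, not the patented implementation: the names `ServerNode`, `ClusterFS`, `NodeFailure`, `relocate`, and `run` are hypothetical, "sleeping" is modeled simply as the caller being held inside `run` until relocation finishes, and operations are assumed to be safe to retry.

```python
class NodeFailure(Exception):
    """Raised when the server node handling an operation has failed."""


class ServerNode:
    """Hypothetical server node that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def perform(self, op, store):
        # A failed node cannot service any operation.
        if not self.alive:
            raise NodeFailure(self.name)
        return op(store)


class ClusterFS:
    """Hypothetical cluster file system front end with one standby node."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.store = {}      # stands in for the physically shared filesystem
        self.failovers = 0   # number of relocations performed so far

    def relocate(self):
        # Move the filesystem to the standby node by swapping roles.
        self.active, self.standby = self.standby, self.active
        self.failovers += 1

    def run(self, op):
        # The caller stays blocked in this loop (the "sleep") until the
        # operation succeeds on whichever node is currently active.
        while True:
            try:
                return self.active.perform(op, self.store)
            except NodeFailure:
                if not self.standby.alive:
                    raise  # no surviving node left to fail over to
                self.relocate()
                # Loop around: retry the pending operation on the new node.


fs = ClusterFS(ServerNode("active"), ServerNode("standby"))
fs.run(lambda store: store.__setitem__("x", 1))  # write through the active node
fs.active.alive = False                          # simulate failure of the active node
value = fs.run(lambda store: store["x"])         # retried transparently on the standby
```

From the caller's point of view the second `run` call simply returns; the failure and relocation are visible only through `fs.failovers` and the swapped `fs.active`.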
Abstract:
A system for recovery of process relationships following node failure within a computer cluster is provided. For relationship recovery, each node maintains a set of care relationships. Each relationship is of the form "carer cares about care target". Care relationships describe process relations such as parent-child or group leader-group member. Care relationships are stored at the origin node of their care targets. Following node failure, a surrogate origin node is selected. The surviving nodes then cooperate to rebuild, at the surrogate origin node, the vproc structures and care relationships for the processes that originated at the failed node. The surviving nodes then determine which of their own care targets were terminated by the node failure. For each terminated care target, notifications are sent to the appropriate carers. This allows surviving processes to correctly recover from severed process relationships.
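The notification step can be illustrated with a small sketch. It assumes, purely for illustration, that care relationships are held as (carer, care target) pairs of process IDs and that a table maps each process to the node it runs on; the function name `notifications_after_failure` and both data layouts are hypothetical. The selection of a surrogate origin node and the rebuilding of vproc structures are not modeled.

```python
from collections import defaultdict


def notifications_after_failure(care_relationships, process_node, failed_node):
    """Return {carer: [terminated care targets]} for surviving carers.

    care_relationships: iterable of (carer, target) process IDs.
    process_node: dict mapping each process ID to the node it runs on.
    failed_node: the node that has failed.
    """
    notices = defaultdict(list)
    for carer, target in care_relationships:
        target_terminated = process_node[target] == failed_node
        carer_survives = process_node[carer] != failed_node
        # Only surviving carers need to hear about terminated targets.
        if target_terminated and carer_survives:
            notices[carer].append(target)
    return {carer: sorted(targets) for carer, targets in notices.items()}


# Processes 2 and 4 run on node "B", which fails.
relationships = [(1, 2), (1, 3), (4, 2), (2, 5)]
placement = {1: "A", 2: "B", 3: "A", 4: "B", 5: "A"}
notices = notifications_after_failure(relationships, placement, "B")
```

Here only process 1 is notified, about process 2: process 4 also cared about 2 but was itself lost with node "B", and the targets 3 and 5 survived.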
Abstract:
A digital computer comprising a plurality of message generating nodes interconnected by a routing network. The routing network transfers messages among the message generating nodes in accordance with address information identifying a destination message generating node. Each message generating node includes a message data generator and a network interface. The message data generator generates message data items each including an address data portion comprising a destination identifier. The network interface includes a message generator and an address translation table, the table including a plurality of entries identifying, for at least one destination identifier, a translated destination identifier. The message generator, in response to the receipt of a message data item from the message data generator, generates a message for transmission to the routing network. In generating the message, the message generator performs an address translation operation in connection with the address data and the contents of the address translation table to generate updated address data, which it uses in connection with generating address information for the message.
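The address translation step can be sketched in a few lines. The dictionary layout of a message data item, the function name `build_message`, and the choice to pass through any destination identifier that has no table entry are all assumptions made for illustration; the abstract only states that the table holds a translated identifier for at least one destination identifier.

```python
def build_message(data_item, translation_table):
    """Build a network message from a message data item.

    data_item: dict with a "dest" destination identifier and a "payload".
    translation_table: dict mapping destination identifiers to translated ones.
    """
    dest = data_item["dest"]
    # Assumption: identifiers without a table entry are used unchanged.
    translated = translation_table.get(dest, dest)
    return {"address": translated, "payload": data_item["payload"]}


table = {7: 42}  # hypothetical entry: destination 7 is translated to 42
translated_msg = build_message({"dest": 7, "payload": "hello"}, table)
passthrough_msg = build_message({"dest": 3, "payload": "world"}, table)
```

Keeping the table in the network interface means the message data generator can keep using its own destination identifiers while the interface decides where messages actually go.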
Abstract:
A system for protection of filesystem data integrity within a computer cluster is provided. The system uses redundant data caches at client and server nodes within the computer cluster. Caching of filesystem data is controlled so that non-shared files are preferably cached at client nodes. This increases filesystem performance within the computer cluster and ensures that a failure cannot cause a loss of modified filesystem data without a corresponding loss of the process(es) accessing that data. Shared files are cached at the server node and a backup cache node. This protects modified filesystem data against any single node failure.
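The cache placement rule can be sketched as a single decision function. The sketch assumes that a file counts as non-shared when exactly one client node is accessing it, and that a file with no accessors is handled like a shared file; the function name `cache_locations` and its parameters are hypothetical.

```python
def cache_locations(accessing_clients, server_node, backup_node):
    """Return the set of nodes that should cache a file's modified data."""
    clients = set(accessing_clients)
    if len(clients) == 1:
        # Non-shared file: cache at the single client node. If that node
        # fails, the only process using the data is lost along with it.
        return clients
    # Shared file (or, by assumption, no current accessor): keep two copies
    # so that any single node failure leaves one copy intact.
    return {server_node, backup_node}


private_file = cache_locations(["client1"], "server", "backup")
shared_file = cache_locations(["client1", "client2"], "server", "backup")
```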
Abstract:
Illustrated is a system and method for executing a checkpoint scheme as part of processing a workload using an application. The system and method also include identifying a checkpoint event that requires an additional checkpoint scheme. The system and method include retrieving checkpoint data associated with the checkpoint event and building a checkpoint model based upon the checkpoint data. The system and method further include identifying, based upon the checkpoint model, the additional checkpoint scheme to be executed as part of the processing of the workload using the application.