Abstract:
This computer-implemented software invention supervises networked system resources with the goal of maximizing service availability, providing on-demand and uninterrupted access to service, and minimizing downtime due to failures. It is a cluster-wide solution that coordinates the states and activities of resources, assigns availability roles, implements recovery from failures, and implements overall system policy. To do this, it maintains a system model of the system's physical and logical configuration and models the resources using managed objects that provide an extensive representation of the states, roles, and relationships of the system's resources.
Abstract:
Architecture that reduces data loss resulting from failover in an asynchronous log shipping deployment by leveraging mid-tier and frontend servers to fill in lost data. In an asynchronous log shipping operation, a replication component asynchronously replicates messaging data to a backend server in accordance with one or more replication operations, which can be updates to databases on the backend server. These databases can include messaging data, such as email address books, mailboxes, etc. A history component maintains a history of replication operations on a frontend server. In the event of a lossy failover, a replay component is used for replaying the replication operations from the history to the backend server.
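The history-and-replay idea can be illustrated with a minimal sketch. All names here (`Backend`, `Frontend`, `submit`, `replay`, the sequence numbers) are hypothetical and are not taken from the abstract; the sketch also assumes each operation carries a monotonically increasing sequence number so the replay component can tell which operations the new active backend is missing.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical backend database keyed by mailbox item."""
    data: dict = field(default_factory=dict)
    last_applied: int = 0  # sequence number of the newest applied operation

    def apply(self, seq, key, value):
        self.data[key] = value
        self.last_applied = max(self.last_applied, seq)


@dataclass
class Frontend:
    """Hypothetical frontend that records every operation it submits."""
    history: list = field(default_factory=list)

    def submit(self, backend, key, value):
        seq = len(self.history) + 1
        self.history.append((seq, key, value))
        backend.apply(seq, key, value)

    def replay(self, backend):
        """Re-apply recorded operations the backend has not seen yet."""
        missing = [op for op in self.history if op[0] > backend.last_applied]
        for seq, key, value in missing:
            backend.apply(seq, key, value)
        return len(missing)


frontend, primary, passive = Frontend(), Backend(), Backend()
for key, value in [("inbox/1", "hello"), ("inbox/2", "report"), ("contacts/7", "Ana")]:
    frontend.submit(primary, key, value)

# Asynchronous log shipping had delivered only the first two operations
# when the primary failed (a lossy failover).
for seq, key, value in frontend.history[:2]:
    passive.apply(seq, key, value)

replayed = frontend.replay(passive)  # fills in the lost third operation
```

After the replay, the passive copy holds the same data the failed primary held, even though the log for the last operation was never shipped.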
Abstract:
Techniques to leverage replication to provide rolling point in time backup are described. Some embodiments are directed to techniques to provide rolling point in time backup with simplified restoration through distributed transactional re-creation. In one embodiment, for example, a technique may comprise creating a plurality of availability copies of a primary set of data; designating at least one of the plurality of availability copies as a backup copy; creating a log file that indicates changes to the primary set of data; updating the plurality of availability copies from the log file in near real time, without updating the backup copy; and restoring at least one of: the primary set of data and an availability copy using the backup copy and content resubmitted from a content contributor when an error occurs in at least one of: the primary set of data and an availability copy. Other embodiments are described and claimed.
Abstract:
High availability architecture that employs a mid-tier proxy server to route client communications to active data store instances in response to failover and switchover. The proxy server includes an active manager client that interfaces to an active manager in each of the backend servers. State information and configuration information are maintained separately and according to semantics consistent with the needs of the corresponding data: the configuration information changes less frequently and is more available, while the state information changes more frequently and is less available. The active manager indicates to the proxy server which of the data store instances is currently the active instance. In the event that the currently active instance becomes inactive, the proxy server selects a different backend server that currently hosts the active data store instance. Client communications are then routed to the different backend server with minimal or no interruption to the client.
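A rough sketch of the routing decision follows. The classes `BackendServer` and `ProxyServer` and the method `active_manager_query` are hypothetical stand-ins for the active manager and its client; the abstract does not specify an interface, and a real proxy would cache the answer rather than poll every backend on each request.

```python
class BackendServer:
    """Hypothetical backend with a minimal 'active manager' query."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.hosts_active = False

    def active_manager_query(self):
        # True only if this server is reachable and hosts the active instance.
        return self.up and self.hosts_active

    def handle(self, request):
        return f"{self.name} handled {request}"


class ProxyServer:
    """Hypothetical mid-tier proxy that routes to whichever backend is active."""

    def __init__(self, backends):
        self.backends = backends

    def route(self, request):
        for backend in self.backends:
            if backend.active_manager_query():
                return backend.handle(request)
        raise RuntimeError("no active data store instance available")


a, b = BackendServer("backend-a"), BackendServer("backend-b")
a.hosts_active = True
proxy = ProxyServer([a, b])
before = proxy.route("GET inbox")

# Failover: backend-a goes down and backend-b now hosts the active instance.
a.up = False
b.hosts_active = True
after = proxy.route("GET inbox")
```

The client issues the same request both times; only the proxy's choice of backend changes.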
Abstract:
Architecture for detecting lost writes using timestamps. During a replication process, lost writes in data replicated from a stream can be detected by noting discrepancies between the timestamps of data in the replica and the timestamps associated with the corresponding data from the source in the original data store. A lost write either in the original data store or in the replica data store can be inferred by comparing these timestamps with the timestamps in a number of other replica data stores. Additionally, check entries can be added to the replicas by the original data store to allow expanded comparison between recently modified data and the source data in the original data store. The check entries can be added to the replication journal after a time delay, thereby increasing effectiveness of the check by decreasing the likelihood that caching in the hardware will defeat the test.
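The timestamp comparison can be sketched as follows. The function name `find_lost_writes`, the per-page timestamp dictionaries, and the use of a simple majority vote to decide which store is the outlier are all assumptions made for illustration; the abstract says only that the inference compares timestamps across a number of replicas.

```python
from collections import Counter


def find_lost_writes(source, replicas):
    """Return {key: [stores whose timestamp disagrees with the majority]}.

    source   -- dict mapping a data key (e.g. a page id) to its timestamp
    replicas -- dict mapping a replica name to a dict of the same shape
    """
    suspects = {}
    for key, source_stamp in source.items():
        stamps = {"source": source_stamp}
        for name, replica in replicas.items():
            stamps[name] = replica.get(key)  # None if the replica lacks the key
        majority_stamp, _ = Counter(stamps.values()).most_common(1)[0]
        outliers = sorted(n for n, ts in stamps.items() if ts != majority_stamp)
        if outliers:
            suspects[key] = outliers
    return suspects


source = {"page1": 100, "page2": 205, "page3": 300}
replicas = {
    "replica1": {"page1": 100, "page2": 205, "page3": 310},
    "replica2": {"page1": 100, "page2": 190, "page3": 310},
}
suspects = find_lost_writes(source, replicas)
```

Here `page2` points at a lost write on `replica2`, while `page3` points at the original data store itself, because both replicas agree on a newer timestamp.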
Abstract:
Failures of file transfers made through a network copy service from a source computer to a destination computer are detected by means of a network monitoring service. If the network monitoring service determines that the source computer is no longer available, the destination computer initiates a second file transfer request via the network copy service from a second source of the file.
Abstract:
A primary active manager can manage a first copy of a database in a first computer system cluster according to a set of management rules that provide for an active copy and one or more passive copies of the database at a given time. The primary active manager can also manage a second copy of the database in a second computer system cluster according to the rules. The rules can allow the first copy of the database or the second copy of the database to be the active copy if one or more criteria in the rules are met for that active copy. The first copy can be designated as the active copy and the second copy can be designated as a passive copy. A failure of the first copy can be detected, and in response, the second copy can be automatically designated as the active copy.
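The designation logic can be sketched in a few lines. The class `ActiveManager`, its `eligible` rule (a copy qualifies only if it is healthy), and the copy names are hypothetical; the actual management rules and criteria are not given in the abstract.

```python
class ActiveManager:
    """Hypothetical manager tracking one active copy and its passive copies."""

    def __init__(self, copy_names):
        self.healthy = {name: True for name in copy_names}
        self.active = copy_names[0]  # the first copy starts as the active copy

    def eligible(self, name):
        # Assumed rule: a copy may become active only if it is healthy.
        return self.healthy[name]

    def passive_copies(self):
        return [n for n in self.healthy if n != self.active]

    def report_failure(self, name):
        """Mark a copy failed; promote another copy if the active one failed."""
        self.healthy[name] = False
        if name == self.active:
            self.active = next((n for n in self.healthy if self.eligible(n)), None)
        return self.active


manager = ActiveManager(["cluster1/db", "cluster2/db"])
initial_active = manager.active
new_active = manager.report_failure("cluster1/db")
```

If no copy satisfies the rule, the sketch leaves `active` as `None` rather than promoting an unhealthy copy.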
Abstract:
Asynchronous transaction log replication from a source database to a destination database utilizing file change notifications for a source log directory generated by an operating system of a source computing machine and received by a destination computing machine. In response to the received file change notification, a source transaction log in the source log directory is copied to a destination transaction log in a destination log directory of the destination machine. After the copy is completed, transactions contained in the destination transaction log are applied to the destination database.
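The copy-then-apply step can be sketched as a handler that a file watcher would call when a notification arrives. The handler name `on_log_changed`, the JSON-lines log format, and the dictionary standing in for the destination database are assumptions for illustration only; the operating-system notification mechanism itself is not modeled here.

```python
import json
import shutil
from pathlib import Path


def on_log_changed(source_log: Path, dest_log_dir: Path, dest_db: dict) -> Path:
    """Handle one file change notification for a source transaction log.

    Assumed log format: one JSON object per line with "key" and "value".
    """
    dest_log = dest_log_dir / source_log.name
    shutil.copy2(source_log, dest_log)  # step 1: copy the log to the destination

    # Step 2: only after the copy completes, apply its transactions in order.
    for line in dest_log.read_text().splitlines():
        txn = json.loads(line)
        dest_db[txn["key"]] = txn["value"]
    return dest_log
```

Applying transactions from the destination copy, rather than reading the source log directly, keeps the destination database consistent with what has actually been shipped.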
Abstract:
A computer cluster can be divided into a plurality of failure scopes and a voting constraint can be enforced. The voting constraint can allow a portion of the cluster to provide the service if a majority of health votes from cluster members is obtained by that portion. A loss of connectivity between a first failure scope, which has a majority of cluster members in the cluster, and one or more other failure scopes in the cluster can be detected. The loss of connectivity can be such that the first failure scope does not have connectivity to a member in any other failure scope in the cluster. In response to detecting the loss of connectivity, a split brain situation in the cluster can be automatically protected against by preventing the first failure scope from providing the service.
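One possible reading of this rule is sketched below: a failure scope may provide the service only if it holds a majority of votes and can still reach at least one member in another failure scope. The function `may_provide_service` and the data layout are hypothetical, and the second condition is an interpretation of the abstract rather than a confirmed algorithm.

```python
def may_provide_service(scope, reachable, members_by_scope):
    """Decide whether `scope` may provide the service.

    scope            -- name of the failure scope making the decision
    reachable        -- set of member names it can reach, including its own
    members_by_scope -- dict mapping each failure scope to its member names
    """
    total = sum(len(members) for members in members_by_scope.values())
    has_majority = len(reachable) > total / 2

    # Extra guard: a scope cut off from every other failure scope stays
    # offline even when it holds the majority on its own.
    reaches_other_scope = any(
        member in reachable
        for other, members in members_by_scope.items()
        if other != scope
        for member in members
    )
    return has_majority and reaches_other_scope


members = {"dc1": ["a", "b", "c"], "dc2": ["d", "e"]}
isolated_majority = may_provide_service("dc1", {"a", "b", "c"}, members)
connected_majority = may_provide_service("dc1", {"a", "b", "c", "d"}, members)
isolated_minority = may_provide_service("dc2", {"d", "e"}, members)
```

In this example the isolated majority scope `dc1` is kept offline, which is the protective behavior the abstract describes; it may serve again once it reaches a member of `dc2`.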
Abstract:
A central controlling service for datacenter activation/deactivation control in a cluster deployment to assist in preventing a split-brain scenario. The central controlling service provides a central point of control in the datacenter for application servers to periodically query as to whether to go offline, online, or normal. Redundancy of the central service facilitates detection of datacenter failure by the redundant services interacting to resolve the state of control information. This control information is then used to answer the server queries. On startup from a datacenter failure, a single instance of the central service queries other redundant instance(s) to determine if the single instance is starting up from a datacenter-wide failure or from operations other than total datacenter failure. If the failure is datacenter-wide, a central service protocol assists in resolving to the single service keeping the associated datacenter servers offline; otherwise, the server queries are answered to go online.