Abstract:
Instead of disabling PCI communication between system resources in a host computing device and I/O devices when a PCI Host Bridge (PHB) is unable to function, the host computing device may include a redundant PCI communication path for maintaining communication between the system resources and the I/O devices after a first PHB experiences an unrecoverable error. In one embodiment, the redundant PCI communication path includes a second PHB that is maintained in a standby state so long as the first PHB is functioning normally. However, once the first PHB experiences an unrecoverable error, the second PHB is changed to the master state and assumes the responsibility for maintaining communication between the system resources and the I/O devices.
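
The standby-to-master handoff described above can be sketched as a small state machine. This is only an illustrative model, not the patented implementation: the class names `HostBridge` and `RedundantPath`, the bridge names `phb0`/`phb1`, and the `FAILED` state are assumptions introduced for the example.

```python
from enum import Enum


class State(Enum):
    MASTER = "master"
    STANDBY = "standby"
    FAILED = "failed"  # assumed extra state for a bridge that has errored out


class HostBridge:
    """Hypothetical stand-in for a PCI Host Bridge (PHB)."""

    def __init__(self, name, state):
        self.name = name
        self.state = state


class RedundantPath:
    """Tracks a first PHB and a standby second PHB (names are illustrative)."""

    def __init__(self):
        self.first = HostBridge("phb0", State.MASTER)
        self.second = HostBridge("phb1", State.STANDBY)

    def active(self):
        # Return whichever bridge currently holds the master state, if any.
        for phb in (self.first, self.second):
            if phb.state is State.MASTER:
                return phb
        return None

    def report_unrecoverable_error(self, phb):
        # Mark the failed bridge, then promote a standby bridge if one exists.
        was_master = phb.state is State.MASTER
        phb.state = State.FAILED
        if was_master:
            for other in (self.first, self.second):
                if other.state is State.STANDBY:
                    other.state = State.MASTER
                    break
```

Calling `report_unrecoverable_error(path.first)` on a fresh `RedundantPath` leaves `phb1` as the active bridge, so traffic between system resources and I/O devices would continue over the second path.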
Abstract:
A method, system, and computer program product for performing failover in a redundancy group, where the redundancy group comprises a plurality of routers including an active router and a standby router, the failover being characterized by zero or significantly reduced black hole conditions compared with a conventional failover system. The method comprises the steps of: receiving an incoming message at a switch; sending an identification request to the plurality of routers to identify a current active router, where the current active router represents a virtual router of the redundancy group; and in response to receiving a reply containing an identification from the current active router within a predetermined time, forwarding the incoming message to the current active router.
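
The switch-side steps above can be sketched as follows. This is a simplified simulation under stated assumptions: the function name `select_active_router` is hypothetical, and each router is modeled as a probe callable that returns whether it is the active router and how long its reply took, rather than as real network I/O.

```python
def select_active_router(routers, timeout):
    """Return the name of the router to forward to, or None.

    `routers` maps a router name to a hypothetical probe callable that
    returns (is_active, reply_delay_seconds) for an identification request.
    """
    for name, probe in routers.items():
        is_active, delay = probe()
        # Forward only when the current active router identifies itself
        # within the predetermined time.
        if is_active and delay <= timeout:
            return name
    # No timely reply from an active router: the caller should not forward,
    # which avoids sending traffic into a black hole.
    return None
```

With one standby router and one active router that answers in 0.02 s, a 0.05 s timeout selects the active router, while a 0.01 s timeout yields `None`.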
Abstract:
A method and a system are provided for determining an Availability Management Framework (AMF) configuration of a highly available system with respect to whether to fail over or restart a component when the component fails. The AMF configuration specifies at least two service-units containing components that represent resources, and a set of service-instances representing workload incurred by provision of services using the resources. The method identifies a failover duration and a restart duration for each component in a service-unit, and determines a failover outage and a restart outage for each service-instance impacted by a failure of a given component, based on the failover duration and the restart duration of each component in the service-unit. The method further determines whether to fail over or to restart the given component if the given component fails, based on the failover outage and the restart outage of each service-instance impacted by the failure of the given component.
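
The comparison of failover outage against restart outage might look like the sketch below. The abstract does not say how the outages are computed, so the outage model here is an assumption: a restart interrupts only the service-instances served by the failed component, while a failover moves the whole service-unit, so each service-instance waits for the slowest failover among its components. The function name and data layout are also hypothetical.

```python
def choose_recovery(failed, components, serves):
    """Pick "restart" or "failover" for a failed component.

    components: name -> (failover_duration, restart_duration)
    serves: service-instance name -> list of component names in the
            service-unit that provide it
    """
    # Assumed model: restart outage hits only service-instances that
    # depend on the failed component.
    restart_outage = {
        si: components[failed][1]
        for si, comps in serves.items()
        if failed in comps
    }
    # Assumed model: failing over moves the whole service-unit, so every
    # service-instance waits for its slowest component to fail over.
    failover_outage = {
        si: max(components[c][0] for c in comps)
        for si, comps in serves.items()
    }
    total_restart = sum(restart_outage.values())
    total_failover = sum(failover_outage.values())
    return "restart" if total_restart <= total_failover else "failover"
```

Under this model, a component that restarts slowly but fails over quickly is failed over, while a component that restarts quickly is restarted in place.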
Abstract:
The disclosed embodiments provide techniques for performing physical domain error isolation and recovery in a multi-domain system, where the multi-domain system includes two or more processor chips and one or more switch chips that provide connectivity and cache-coherency support for the processor chips, and the processor chips are divided into two or more distinct domains. During operation, one of the switch chips detects a fault in the multi-domain system. The switch chip determines an originating domain that is associated with the fault, and then signals the fault and an identifier for the originating domain to its internal units, some of which perform clearing operations that clear out all traffic for the originating domain without affecting the other domains of the multi-domain system.
Abstract:
A high performance computing (HPC) system includes computing blades having a first region that includes computing circuit boards having processors for performing a computation, and a second region that includes non-volatile memory for use in performing the computation. The regions are connected by a plurality of power connectors that convey power from the computing circuit boards to the memory, and a plurality of data connectors that convey data between the first and second regions. The power and data connectors are configured redundantly so that failure of a computing circuit board, a power connector, or a data connector does not interrupt the computation. A method of performing such a computation, and a computer program product implementing the method, are also disclosed.
Abstract:
Methods and systems for load balancing and failover among gateway devices are disclosed. One method provides for assigning communication transaction handling to a gateway. The method includes receiving a request for a license from a computing device at a control gateway within a group of gateway devices including a plurality of gateway devices configured to support communication of cryptographically split data. The method also includes assigning communications from the computing device to one of the plurality of gateway devices based on a load balancing algorithm, and routing the communication request to the assigned gateway device.
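
A control gateway that assigns devices to gateways could be sketched as below. The abstract only says "a load balancing algorithm", so the least-loaded policy used here is an assumption, as are the class name `ControlGateway`, the method names, and the reassignment behavior on failure. Licensing and cryptographic splitting are omitted.

```python
class ControlGateway:
    """Hypothetical control gateway that tracks per-gateway load."""

    def __init__(self, gateway_names):
        self.load = {name: 0 for name in gateway_names}
        self.assignment = {}  # device id -> assigned gateway name

    def request_license(self, device_id):
        # A device that is already assigned keeps its gateway.
        if device_id in self.assignment:
            return self.assignment[device_id]
        # Assumed policy: pick the least-loaded gateway, name as tie-break.
        chosen = min(self.load, key=lambda name: (self.load[name], name))
        self.load[chosen] += 1
        self.assignment[device_id] = chosen
        return chosen

    def mark_failed(self, gateway):
        # Failover: drop the gateway and reassign its devices to the rest.
        self.load.pop(gateway)
        orphans = [d for d, g in self.assignment.items() if g == gateway]
        for device_id in orphans:
            del self.assignment[device_id]
            self.request_license(device_id)
```

With two gateways, successive requests alternate between them, and marking one gateway as failed moves its devices to the survivor.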
Abstract:
A storage device includes: a plurality of controller modules; a bus disposed among the plurality of controller modules, the bus including a plurality of transmission paths; a detector configured to detect an error in data communication through the bus; and a connection controller configured to carry out partial fallback processing of the bus if the number of errors has exceeded a given number.
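
One way to picture partial fallback is a bus that retires an individual transmission path once its error count passes a threshold, while the remaining paths keep carrying traffic. The per-path counting, the class name `Bus`, and the rule that the last path is never retired are assumptions made for this sketch; the abstract only states that fallback happens when the number of errors exceeds a given number.

```python
class Bus:
    """Hypothetical bus with several transmission paths (lanes)."""

    def __init__(self, lane_count, max_errors):
        self.active_lanes = set(range(lane_count))
        self.max_errors = max_errors
        self.errors = {lane: 0 for lane in range(lane_count)}

    def record_error(self, lane):
        # Ignore errors reported for lanes that are already retired.
        if lane not in self.active_lanes:
            return
        self.errors[lane] += 1
        # Partial fallback: retire only the faulty lane, and keep at least
        # one lane so the controller modules stay connected.
        if self.errors[lane] > self.max_errors and len(self.active_lanes) > 1:
            self.active_lanes.discard(lane)
```

For a four-lane bus with `max_errors=2`, the third error on lane 1 retires that lane and leaves lanes 0, 2, and 3 active.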
Abstract:
In one embodiment, a management device receives one or more fate-sharing reports locally generated by one or more corresponding reporting nodes in a shared-media communication network, the fate-sharing reports indicating a degree of localized fate-sharing between one or more pairs of nodes local to the corresponding reporting nodes. The management device may then determine, globally from aggregating the fate-sharing reports, one or more fate-sharing groups indicating sets of nodes having a global degree of fate-sharing within the communication network. As such, the management device may then advertise the fate-sharing groups within the communication network, wherein nodes of the communication network are configured to select a plurality of next-hops that minimizes fate-sharing between the plurality of next-hops.
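
The aggregation and next-hop selection described above can be sketched as follows. The report format `(node_a, node_b, degree)`, the averaging of repeated reports, the threshold, and the use of union-find to merge pairs into groups are all assumptions for illustration; the abstract does not specify how groups are derived.

```python
from collections import defaultdict


def fate_sharing_groups(reports, threshold):
    """Merge node pairs whose average reported degree meets the threshold.

    reports: iterable of (node_a, node_b, degree) with distinct nodes.
    Returns a sorted list of sorted node-name lists.
    """
    degrees = defaultdict(list)
    for a, b, degree in reports:
        degrees[frozenset((a, b))].append(degree)

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, values in degrees.items():
        a, b = sorted(pair)
        root_a, root_b = find(a), find(b)
        if sum(values) / len(values) >= threshold:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


def pick_next_hops(candidates, groups, count):
    """Choose up to `count` candidates, at most one per fate-sharing group."""
    group_of = {n: i for i, g in enumerate(groups) for n in g}
    chosen, used = [], set()
    for node in candidates:
        gid = group_of.get(node, node)  # ungrouped nodes stand alone
        if gid not in used:
            chosen.append(node)
            used.add(gid)
        if len(chosen) == count:
            break
    return chosen
```

A node that receives the advertised groups could then call `pick_next_hops` so that its primary and backup next-hops do not fall in the same group.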
Abstract:
Intelligent client computing devices track and record the changes they make to data, applications, and services. Systems, devices, and computer readable media for detecting service tier failures and maintaining application services provide a resilient client architecture that allows a client application on an intelligent client to automatically detect the unavailability of server tiers or sites and re-route requests and updates to secondary sites to maintain application services at the client tier in a manner that is transparent to a user. The resilient client architecture tracks how current each secondary site is in order to select the best secondary site and to automatically and transparently bring this secondary site up to date, ensuring that no data updates are missing from the secondary site.
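
Selecting the most current secondary site and bringing it up to date can be sketched with sequence numbers. The representation is an assumption: the client's change log is modeled as an ascending list of `(sequence_number, update)` pairs, and each secondary site is described by the highest sequence number it has applied. The function name `fail_over` is hypothetical.

```python
def fail_over(change_log, secondaries):
    """Pick the most current secondary and the updates it still needs.

    change_log: list of (sequence_number, update) recorded by the client,
                in ascending sequence order.
    secondaries: site name -> highest sequence number already applied.
    Returns (site_name, list_of_missing_updates).
    """
    # The best secondary is the one that has applied the most updates;
    # the site name breaks ties deterministically.
    best = max(secondaries, key=lambda site: (secondaries[site], site))
    # Replay only what the chosen site has not seen yet.
    missing = [update for seq, update in change_log if seq > secondaries[best]]
    return best, missing
```

For a log of three updates and secondaries at sequence 1 and 2, the site at sequence 2 is chosen and only the third update is replayed to it.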