摘要:
A fault tolerant computer system having two virtual machines (VMs), each running on a separate host device, is connected over a network to one or more I/O devices. The system operates to monitor the health of one or more operational characteristics associated with each VM, and in the event that the health of both virtual machines dictates that one or the other of the VMs should be downgraded, but the system is not able to determine which VM should be downgraded and there is an imbalance in a monitored system operational characteristic, the system can defer downgrading one VM for a selected period of time during which the operational characteristic that is in imbalance is monitored. If the imbalance is resolved, the downgrade is cancelled, if an operational fault is confirmed prior to the expiration of the deferral period or if the deferral period expires, then one host is downgraded.
摘要:
In a first aspect, a method of synchronizing at least two computing elements that each have clocks that operate asynchronously of the clocks of the other computing elements includes selecting one or more signals, designated as meta time signals, from a set of signals produced by the computing elements, monitoring the computing elements to detect the production of a selected signal by one of the computing elements, waiting for the other computing elements to produce a selected signal, transmitting equally valued time updates to each of the computing elements, and updating the clocks of the computing elements based on the time updates.In a second aspect, fault resilient or fault tolerant computers are produced by designating a first processor as a computing element, designating a second processor as a controller, connecting the computing element and the controller to produce a modular pair, and connecting at least two modular pairs to produce a fault resilient or fault tolerant computer. Each computing element of the computer performs all instructions in the same number of cycles as the other computing elements.Computer systems include one or more controllers and at least two computing elements. System is provided for intercepting I/O operations by the computing elements and transmitting them to the one or more controllers.
摘要:
A dual processor data processing system having interprocessor error checking includes a first central processing unit executing a series of instructions. A second central processing unit executes the same series of instructions independently of and in synchronism with the first central processing unit. A first data bus is coupled to the first central processing unit for receiving data to be input to the first central processing unit and a second data bus is coupled to the second central processing unit for receiving data to be input to the second central processing unit. Error checking devices are coupled to the first and second data busses for checking data transmitted over the first and second data busses and for detecting errors on I/O reads prior to delivery of the data to the first and second central processing units. The error checking devices include comparison means for indicating an error when the data on the first and second data busses are unequal. Error isolation devices are responsive to an error detected from the error checking means for analyzing the cause of error while maintaining system synchronization.
摘要:
Method and apparatus for testing the operation of modules for use in a fault tolerant computing system that consists of two distinct computing zones. Diagnostic testing is performed when the system is powered on, the modules being subjected to module, zone and, if both zones are available, system diagnostic tests. Indications of faults detected during diagnostic testing are stored in an EEPROM on each module. Such fault indications can be cleared in the field by correcting the fault condition and successfully rerunning the diagnostic test during which the fault was detected. Indications of operating system detected faults are also stored in each module EEPROM. However, such fault indications are not field clearable.
摘要:
A method for determining a delay in a dynamic, event driven, checkpoint interval. In one embodiment, the method includes the steps of determining the number of network bits to be transferred; determining the target bit transfer rate; calculating the next cycle delay as the number of bits to be transferred divided by the target bit transfer rate. In another aspect, the invention relates to a method for delaying a checkpoint interval. In one embodiment, the method includes the steps of monitoring the transfer of a prior batch of network data and delaying a subsequent checkpoint until the transfer of a prior batch of network data has reached a certain predetermined level of completion. In another embodiment, the predetermined level of completion is 100%.
摘要:
Data transfer to computing elements is synchronized in a computer system that includes the computing elements and controllers that provide data from data sources to the computing elements. A request for data made by a computing element is intercepted and transmitted to the controllers. At least a first controller responds by transmitting requested data to the computing element and by indicating how a second controller will respond to the intercepted request.
摘要:
A dual processor data processing system having interprocessor error checking includes a first central processing unit executing a series of instructions. A second central processing unit executes the same series of instructions independently of and in synchronism with the first central processing unit. A first data bus is coupled to the first central processing unit for receiving data to be input to the first central processing unit and a second data bus is coupled to the second central processing unit for receiving data to be input to the second central processing unit. Error checking devices are coupled to the first and second data busses for checking data transmitted over the first and second data busses and for detecting errors on I/O reads prior to delivery of the data to the first and second central processing units. The error checking devices include comparison means for indicating an error when the data on the first and second data busses are unequal. Error isolation devices are responsive to an error detected from the error checking means for analyzing the cause of error while maintaining system synchronization.
摘要:
Method and apparatus for controlling initiating of bootstrap loading in a computer system having first and second discrete computing zones is disclosed. Each computing zone includes a status register for storing an operating system run (OSR) bit indicating that the zone has initiated bootstrap loading. A cable connects the computing zones to allow the first and second zones to read the status registers in the second and first zones, respectively. A CPU in each zone only enables initiation of bootstrap loading if the OSR bit in the other zone is not set.
摘要:
A fault tolerant computer system includes a fault tolerant data processing module which has means for detecting and correcting errors in the operation of the data processing module to maintain a high degree of data integrity. Data transmission control devices control the transmission of all data to the fault tolerant data processing module and the receipt of all data into the fault tolerant data processing module. Input/output terminals are coupled to the data transmission control means for receiving and transmitting data. A non-fault tolerant input/output module is coupled to transmit the data to the input/output terminals of the fault tolerant data processing module. This module includes a read device for transferring data to the fault tolerant computing system in response to requests from the data transmission control devices, and a firewall for preventing the non-fault tolerant input/output module from initiating transfers of data to the fault tolerant data processing module.
摘要:
A dual processor computer system with error checking includes a first processing system for executing a series of instructions including output instructions. A second processing system executes the series of instructions independently of and in synchronism with the first processing system. Shared resource devices are coupled to the first and second processing systems for receiving data from output instructions from the first and second processing systems substantially simultaneously. Error checking devices are located downstream of the shared resource means for checking the data received from the first and second processing systems only following a write operation into the shared resource means.