摘要:
A fault tolerant/fault resilient computer system includes a first coserver and a second coserver. The first coserver includes a first application environment (AE) processor and a first I/O subsystem processor on a first common motherboard. The second coserver includes a second AE processor and a second I/O subsystem processor on a second common motherboard.
摘要:
Data transfer to computing elements is synchronized in a computer system that includes the computing elements and controllers that provide data from data sources to the computing elements. A request for data made by a computing element is intercepted and transmitted to the controllers. At least a first controller responds by transmitting requested data to the computing element and by indicating how a second controller will respond to the intercepted request.
摘要:
Data transfer to computing elements is synchronized in a computer system that includes the computing elements and controllers that provide data from data sources to the computing elements. A request for data made by a computing element is intercepted and transmitted to the controllers. At least a first controller responds by transmitting requested data to the computing element and by indicating how a second controller will respond to the intercepted request.
摘要:
Synchronized execution is maintained by compute elements processing instruction streams in a computer system including the compute elements and a controller. Each compute element includes a clock that operates asynchronously with respect to clocks of the other compute elements. Each compute element processes instructions from an instruction stream and counts the instructions processed. Upon processing a quantum of instructions from the instruction stream, the compute element initiates a synchronization procedure and continues to process instructions from the instruction stream and to count instructions processed from the instruction stream. The compute element halts processing of instructions from the instruction stream after processing an unspecified number of instructions from the instruction stream in addition to the quantum of instructions. Upon halting processing, the compute element sends a synchronization request to the controller and waits for a synchronization reply.
摘要:
A computer system configured to provide fault tolerance includes a first host system and a second host system. The first host system is programmed to monitor a number of portions of memory of the first host system that have been modified by a guest running on the first host system and, upon determining that the number of portions exceeds a threshold level, determine that a checkpoint needs to be created. Upon determining that the checkpoint needs to be created, operation of the guest is paused and checkpoint data is generated. After generating the checkpoint data, operation of the guest is resumed while the checkpoint data is transmitted to the second host system.
摘要:
A symmetric multiprocessing fault-tolerant computer system controls memory access in a symmetric multiprocessing computer system. To do so, virtual page structures are created, where the virtual page structures reflect physical page access privileges to shared memory for processors in a symmetric multiprocessing computer system. Access to shared memory is controlled based on physical page access privileges reflected in the virtual paging structures to coordinate deterministic shared memory access between processors in the symmetric multiprocessing computer system. A symmetric multiprocessing fault-tolerant computer system may use duplication or continuous replay.
摘要:
In a first aspect, a method of synchronizing at least two computing elements that each have clocks that operate asynchronously of the clocks of the other computing elements includes selecting one or more signals, designated as meta time signals, from a set of signals produced by the computing elements, monitoring the computing elements to detect the production of a selected signal by one of the computing elements, waiting for the other computing elements to produce a selected signal, transmitting equally valued time updates to each of the computing elements, and updating the clocks of the computing elements based on the time updates.In a second aspect, fault resilient or fault tolerant computers are produced by designating a first processor as a computing element, designating a second processor as a controller, connecting the computing element and the controller to produce a modular pair, and connecting at least two modular pairs to produce a fault resilient or fault tolerant computer. Each computing element of the computer performs all instructions in the same number of cycles as the other computing elements.Computer systems include one or more controllers and at least two computing elements. System is provided for intercepting I/O operations by the computing elements and transmitting them to the one or more controllers.
摘要:
The effects of I/O race conditions caused by asynchrony between processors concurrently executing the same software and I/O devices are eliminated by executing an application program and a first associated operating system with firs processors, and executing an I/O processing program and a second associated operating system with an I/O processor. Memory requests from the application program or the first associated operating system are processed with the first processors, and memory requests from the application program to memory addresses associated with I/O devices are trapped and transmitted to the I/O processor. The I/O processor then performs the trapped memory requests with the I/O processing program after waiting for the identical request to be received from each of the first processors to eliminate the effects of race conditions caused by asynchrony between processors concurrently executing the application program or the first associated operating system and I/O devices. I/O requests may be trapped and performed by the I/O processor for the same purpose.
摘要:
A method for determining a delay in a dynamic, event driven, checkpoint interval. In one embodiment, the method includes the steps of determining the number of network bits to be transferred; determining the target bit transfer rate; calculating the next cycle delay as the number of bits to be transferred divided by the target bit transfer rate. In another aspect, the invention relates to a method for delaying a checkpoint interval. In one embodiment, the method includes the steps of monitoring the transfer of a prior batch of network data and delaying a subsequent checkpoint until the transfer of a prior batch of network data has reached a certain predetermined level of completion. In another embodiment, the predetermined level of completion is 100%.
摘要:
A fault tolerant/fault resilient computer system includes at least two compute elements connected to at least one controller. Each compute element has clocks that operate asynchronously to clocks of the other compute elements. The compute elements operate in a first mode in which the compute elements each execute a first stream of instructions in emulated clock lockstep, and in a second mode in which the compute elements each execute a second stream of instructions in instruction lockstep. Each compute element may be a multi-processor compute element.