Abstract:
A fault-tolerant computer uses multiple commercial processors operating synchronously, i.e., in lock-step. In an exemplary embodiment, redundancy logic isolates the outputs of the processors from other computer components, so that the other components see only the majority-vote outputs of the processors. Processor resynchronization, initiated at predetermined times, at milestones, and/or in response to processor faults, protects the computer from single event upsets. During resynchronization, processor state data are flushed, and an instance of these data in accordance with the processor majority vote is stored. Processor caches are flushed to update computer memory with more recent data stored in the caches. The caches are invalidated and disabled, and snooping is disabled. A controller is notified that snooping has been disabled. In response to the notification, the controller performs a hardware reset of the processors. The processors are loaded with the stored state data, and snooping and caches are re-enabled.
Abstract:
The described arrangement is characterized in that each of the system components to be monitored is assigned at least one dedicated monitoring device that can be operated independently of the system components to be monitored. Such an arrangement makes it possible to detect reliably, under all circumstances, whether any of the monitored system components is operating faultily and, if so, which one.
Abstract:
A method and system for increasing the availability of a server cluster (60₁-60₅) while reducing its cost by requiring at a minimum only one node and a quorum replica set (57A) of storage devices (replica members) (58₁-58₂) to form and continue operating as a cluster. A plurality of replica members maintain the cluster operational data and are independent from any given node. A cluster may be formed and continue to operate as long as one server node possesses a quorum (majority) of the replica members. This ensures that a new or surviving cluster has at least one replica member that belonged to the immediately prior cluster and is thus correct with respect to the cluster operational data. Update sequence numbers and/or timestamps are used to determine the most recently updated replica member among those in the quorum, which is then used to reconcile the other replica members.
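The quorum rule and the reconciliation step can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `Replica`, `has_quorum`, and `reconcile` are hypothetical, "quorum" is assumed to mean a strict majority of all replica members, and only update sequence numbers (not timestamps) are modeled.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    sequence: int  # hypothetical update sequence number of the stored cluster data


def has_quorum(owned: int, total: int) -> bool:
    """Assumed rule: a node needs a strict majority of all replica members."""
    return owned > total // 2


def reconcile(owned: list[Replica], total: int) -> list[Replica]:
    """Bring every owned replica up to the most recent sequence number.

    Raises RuntimeError when the node does not hold a quorum, because a
    cluster must not be formed in that case.
    """
    if not has_quorum(len(owned), total):
        raise RuntimeError("no quorum: cluster must not form")
    newest = max(owned, key=lambda r: r.sequence)
    return [Replica(r.name, newest.sequence) for r in owned]
```

Because any two majorities of the same replica set overlap in at least one member, the newest replica in the quorum carries the operational data of the immediately prior cluster, which is why the sketch copies its sequence number to the others.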
Abstract:
A fault tolerant computer system is disclosed which uses redundant voting at the hardware clock level to detect and to correct single event upsets (SEU) and other random failures. In one preferred embodiment, the computer (30) includes four or more commercial processing units (CPUs) (32) operating in strict "lock-step" and whose outputs (33, 37) to system memory (46) and system bus (12) are voted by a gate array (50) which may be implemented in a custom integrated circuit (34). A custom memory controller (18) interfaces to the system memory (46) and system bus (12). The data and address (35, 37) at each write to and read from memory (46) within the computer (30) are voted at each CPU clock cycle. A vote status and control circuit (38) "reads" the status of the vote and controls the state of the CPUs using hardware and software. The majority voted signals (35) are used by the agreeing CPUs (32) to continue processing operations without interruption. The system logic selects the action with the best chance of recovering from a detected fault: re-synchronizing all CPUs (32), powering down a faulty CPU, or switching to a spare computer (30) and resetting and re-booting the substituted CPUs (32).
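As a rough software analogue of the per-cycle voting described above, the following Python sketch takes the bitwise majority of the words driven by several CPUs and lists the CPUs that disagree with the voted result. The function names, the 32-bit default width, and the tie rule (a bit with no strict majority is voted as 0) are assumptions for illustration; the patent performs this in a gate array, not in software.

```python
from __future__ import annotations


def majority_vote(words: list[int], width: int = 32) -> int:
    """Return the bitwise majority of the given words.

    A result bit is 1 only when more than half of the inputs have it set.
    Assumption: a tied bit (possible with an even number of CPUs) is 0.
    """
    result = 0
    for bit in range(width):
        ones = sum((w >> bit) & 1 for w in words)
        if ones * 2 > len(words):
            result |= 1 << bit
    return result


def dissenters(words: list[int], voted: int) -> list[int]:
    """Return the indices of CPUs whose output differs from the voted word."""
    return [i for i, w in enumerate(words) if w != voted]


# Example: CPU 2 suffers a single bit flip; the vote masks it.
outputs = [0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEF ^ 0x10, 0xDEADBEEF]
voted = majority_vote(outputs)
faulty = dissenters(outputs, voted)
```

The list of dissenting CPUs corresponds loosely to what a vote status circuit would report, so that recovery logic can decide whether to resynchronize or power down a processor.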
Abstract:
A method of synchronizing at least two computing elements (CE1, CE2) that each have clocks that operate asynchronously of the clocks of the other computing elements includes selecting one or more signals, designated as meta time signals, from a set of signals produced by the computing elements (CE1, CE2), monitoring the computing elements (CE1, CE2) to detect the production of a selected signal by one of the computing elements (CE1), waiting for the other computing elements (CE2) to produce a selected signal, transmitting equally valued time updates to each of the computing elements, and updating the clocks of the computing elements (CE1, CE2) based on the time updates. In a second aspect of the invention, fault resilient, or tolerant, computers (200) are produced by designating a first processor as a computing element (204), designating a second processor (202) as a controller, connecting the computing element (204) and the controller (202) to produce a modular pair, and connecting at least two modular pairs to produce a fault resilient or fault tolerant computer (200). Each computing element (202, 204) of the computer (200) performs all instructions in the same number of cycles as the other computing elements (202, 204). The computer systems include one or more controllers (202) and at least two computing elements (204).
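The first aspect, waiting for every computing element to produce a meta time signal and then sending all of them the same time update, can be sketched in Python as follows. The class `MetaTimeCoordinator`, its `signal` method, and the fixed `tick` increment are hypothetical names and simplifications, not details from the patent.

```python
from __future__ import annotations


class MetaTimeCoordinator:
    """Hands out identical time updates once all elements have signalled."""

    def __init__(self, element_ids: list[str], tick: int = 1) -> None:
        self.elements = set(element_ids)
        self.waiting: set[str] = set()  # elements that have signalled this round
        self.time = 0                   # assumed shared meta time value
        self.tick = tick                # assumed fixed increment per round

    def signal(self, element_id: str) -> dict[str, int] | None:
        """Record a meta time signal from one computing element.

        Returns None while other elements are still outstanding. When the
        last element signals, returns the same time value for every element.
        """
        if element_id not in self.elements:
            raise KeyError(element_id)
        self.waiting.add(element_id)
        if self.waiting != self.elements:
            return None
        self.waiting.clear()
        self.time += self.tick
        return {e: self.time for e in self.elements}
```

Because every element receives an equally valued update at the same point in its instruction stream, the elements observe identical clock values even though their hardware clocks run asynchronously.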
Abstract:
In a multi-computer system, in particular a 2-out-of-3 computer system, a computer identified as defective is to be isolated in accordance with the "fail-safe" principle so that the non-defective computers can continue operating. According to the invention, the defective computer receives a command (102) from the non-defective computers to shut itself down completely and thereby cease outputting data. If the defective computer does not comply with this command and continues to output data, the non-defective computers shut themselves down (104). The system thereby assumes a safe state, since, owing to system-internal comparison processes, a single computer cannot produce effective outputs on its own. The solution according to the invention is therefore particularly suitable for control systems governed by the "fail-safe" principle, such as those required for securing routes in railway traffic or for monitoring nuclear power plants. The method can be implemented entirely in software; the relay switches required until now become superfluous.
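The two-step isolation logic (command 102, then self-shutdown 104) can be sketched in Python under simplifying assumptions: each computer is modeled only as a running flag, the faulty computer's compliance is passed in as a boolean, and an output is treated as effective only when at least two running computers can agree. The function names are hypothetical.

```python
from __future__ import annotations


def isolate_faulty(running: dict[str, bool], faulty: str, obeys: bool) -> dict[str, bool]:
    """Return the new running state after trying to isolate `faulty`.

    Step 102: the healthy computers command the faulty one to shut down.
    Step 104: if it keeps outputting data, the healthy ones shut down instead.
    """
    state = dict(running)
    if obeys:
        state[faulty] = False          # faulty computer complied with command 102
        return state
    for name in state:
        if name != faulty:
            state[name] = False        # healthy computers shut themselves down (104)
    return state


def outputs_effective(state: dict[str, bool]) -> bool:
    """Assumed comparison rule: at least two running computers must agree."""
    return sum(state.values()) >= 2
```

In both branches the system ends in an acceptable state: either two healthy computers keep producing effective outputs, or only the faulty computer remains and its outputs have no effect.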
Abstract:
In each module (11) of three or more central processor modules of a fault tolerant computer system, a detector (45) receives a comparator output signal and like comparator output signals from two adjacent modules and produces a detector output signal which indicates the absence or presence of a fault in that module. When the fault is confirmed, a controller or processor (49) isolates the module under consideration from the system by inhibiting delivery of a controlled output signal to a bus (31) and by connecting switching units (53(1), 53(2)) of the adjacent modules so that the module in question is bypassed. Preferably, one of the modules of the system is used as a master module, which ordinarily delivers the controlled output signal to the bus, with the others used as checker modules, which ordinarily inhibit the delivery. When a fault appears in the master module, its controller delivers a module operation switching signal to the controllers of the checker modules, so that one of the checker modules takes over from the faulty master module.
Abstract:
Bus interface units (BIUs) (54) perform fault detection, identification, and reconfiguration for all information transfers between redundant central processing units (CPUs) (56) and memory or input/output (I/O) (57A-C) in a mesh interconnected array of a highly reliable fault-tolerant computer system. Errors are detected by self-checking within the BIUs, signal parity checks by the BIUs, cross channel comparisons, and mesh transaction assessments. Fault identification and mesh reconfiguration are performed such that no faulty unit remains active in decision making after reconfiguration, and the number of good units isolated during reconfiguration is minimized.