SELF-CORRECTING COMPUTER
    1.
    Invention publication
    SELF-CORRECTING COMPUTER (in force)

    Publication No.: EP1625484A2

    Publication date: 2006-02-15

    Application No.: EP04809302.5

    Filing date: 2004-02-24

    IPC classification: G06F1/00

    Abstract: A fault-tolerant computer uses multiple commercial processors operating synchronously, i.e., in lock-step. In an exemplary embodiment, redundancy logic isolates the outputs of the processors from other computer components, so that the other components see only majority vote outputs of the processors. Processor resynchronization, initiated at predetermined times, at milestones, and/or in response to processor faults, protects the computer from single event upsets. During resynchronization, processor state data is flushed, and an instance of these data selected by processor majority vote is stored. Processor caches are flushed to update computer memory with more recent data stored in the caches. The caches are invalidated and disabled, and snooping is disabled. A controller is notified that snooping has been disabled. In response to the notification, the controller performs a hardware reset of the processors. The processors are loaded with the stored state data, and snooping and caches are enabled.
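
The majority-vote idea in this abstract can be illustrated with a minimal sketch. It is not the patent's redundancy logic, which is hardware; the function names `majority_vote` and `resynchronize` and the representation of outputs and state as plain Python values are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (voted_value, dissenters) for one lock-step step.

    outputs holds one hashable value per processor. A ValueError is
    raised when no value is held by a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among processor outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


def resynchronize(states):
    """Reload every processor with the majority-voted state (hypothetical model)."""
    voted_state, _ = majority_vote(states)
    return [voted_state] * len(states)
```

For example, `majority_vote([0x1F, 0x1F, 0x17])` yields the voted value `0x1F` and marks processor 2 as the dissenter, so downstream components would only ever see `0x1F`.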

    METHOD AND SYSTEM FOR CONSISTENT CLUSTER OPERATIONAL DATA IN A SERVER CLUSTER USING A QUORUM OF REPLICAS
    4.
    Invention publication
    METHOD AND SYSTEM FOR CONSISTENT CLUSTER OPERATIONAL DATA IN A SERVER CLUSTER USING A QUORUM OF REPLICAS (in force)

    Publication No.: EP1222540A1

    Publication date: 2002-07-17

    Application No.: EP00916146.4

    Filing date: 2000-03-06

    IPC classification: G06F9/50 G06F11/20

    Abstract: A method and system for increasing the availability of a server cluster (60_1-60_5) while reducing its cost by requiring at a minimum only one node and a quorum replica set (57A) of storage devices (replica members) (58_1-58_2) to form and continue operating as a cluster. A plurality of replica members maintain the cluster operational data and are independent from any given node. A cluster may be formed and continue to operate as long as one server node possesses a quorum (majority) of the replica members. This ensures that a new or surviving cluster has at least one replica member that belonged to the immediately prior cluster and is thus correct with respect to the cluster operational data. Update sequence numbers and/or timestamps are used to determine the most updated replica member from among those in the quorum for reconciling the other replica members.
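
The quorum rule and the reconciliation step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: replicas are modelled as a dictionary from a replica name to its update sequence number, and the names `has_quorum` and `form_cluster` are hypothetical, not taken from the patent.

```python
def has_quorum(owned, total_replicas):
    """True when a node owns a strict majority of all replica members."""
    return 2 * len(owned) > total_replicas


def form_cluster(owned, total_replicas):
    """Reconcile owned replicas to the newest one, or return None without quorum.

    owned maps a replica name to its update sequence number
    (a hypothetical layout chosen for this sketch).
    """
    if not has_quorum(owned, total_replicas):
        return None
    # The replica with the highest sequence number holds the latest data.
    newest = max(owned, key=owned.get)
    reconciled = {name: owned[newest] for name in owned}
    return newest, reconciled
```

Because any two majorities of the same replica set overlap, a node that passes `has_quorum` necessarily holds at least one replica from the immediately prior cluster, which is why picking the highest sequence number recovers the latest operational data.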

    FAULT TOLERANT COMPUTER SYSTEM
    5.
    Invention publication
    FAULT TOLERANT COMPUTER SYSTEM (lapsed)

    Publication No.: EP0972244A4

    Publication date: 2000-11-15

    Application No.: EP98915246

    Filing date: 1998-03-31

    Inventor: WARDROP ANDREW J

    Abstract: A fault tolerant computer system is disclosed which uses redundant voting at the hardware clock level to detect and to correct single event upsets (SEU) and other random failures. In one preferred embodiment, the computer (30) includes four or more commercial processing units (CPUs) (32) operating in strict "lock-step" and whose outputs (33, 37) to system memory (46) and system bus (12) are voted by a gate array (50) which may be implemented in a custom integrated circuit (34). A custom memory controller (18) interfaces to the system memory (46) and system bus (12). The data and address (35, 37) at each write to and read from memory (46) within the computer (30) are voted at each CPU clock cycle. A vote status and control circuit (38) "reads" the status of the vote and controls the state of the CPUs using hardware and software. The majority voted signals (35) are used by the agreeing CPUs (32) to continue processing operations without interruption. The system logic selects the best chance of recovering from a detected fault by re-synchronizing all CPUs (32), powering down a faulty CPU or switching to a spare computer (30), resetting and re-booting the substituted CPUs (32).
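
Voting the data and address lines at each clock cycle can be pictured as a bit-by-bit majority over the words driven by the CPUs. The sketch below is a software analogy, not the gate array (50) itself; the function name `bitwise_vote`, the `width` parameter, and the choice to treat an even split as an error are assumptions.

```python
def bitwise_vote(words, width=32):
    """Vote each bit position across the CPUs' output words.

    Returns (voted_word, disagreeing_cpu_indexes). An even split on a
    bit (possible with four CPUs) is treated here as unresolvable.
    """
    voted = 0
    for bit in range(width):
        ones = sum((word >> bit) & 1 for word in words)
        zeros = len(words) - ones
        if ones == zeros:
            raise ValueError(f"tie on bit {bit}")
        if ones > zeros:
            voted |= 1 << bit
    disagreeing = [i for i, word in enumerate(words) if word != voted]
    return voted, disagreeing
```

With four CPUs driving `0b1010, 0b1010, 0b1110, 0b1010`, the voted word is `0b1010` and CPU 2 is flagged, which is the kind of status a vote status and control circuit could use when choosing a recovery action.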

    FAULT RESILIENT/FAULT TOLERANT COMPUTING
    6.
    Granted patent
    FAULT RESILIENT/FAULT TOLERANT COMPUTING (lapsed)

    Publication No.: EP0731945B1

    Publication date: 2000-05-17

    Application No.: EP95902615.4

    Filing date: 1994-11-15

    Abstract: A method of synchronizing at least two computing elements (CE1, CE2) that each have clocks that operate asynchronously of the clocks of the other computing elements includes selecting one or more signals, designated as meta time signals, from a set of signals produced by the computing elements (CE1, CE2), monitoring the computing elements (CE1, CE2) to detect the production of a selected signal by one of the computing elements (CE1), waiting for the other computing elements (CE2) to produce a selected signal, transmitting equally valued time updates to each of the computing elements, and updating the clocks of the computing elements (CE1, CE2) based on the time updates. In a second aspect of the invention, fault resilient, or tolerant, computers (200) are produced by designating a first processor as a computing element (204), designating a second processor (202) as a controller, connecting the computing element (204) and the controller (202) to produce a modular pair, and connecting at least two modular pairs to produce a fault resilient or fault tolerant computer (200). Each computing element (202, 204) of the computer (200) performs all instructions in the same number of cycles as the other computing elements (202, 204). The computer systems include one or more controllers (202) and at least two computing elements (204).
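
The meta time steps in this abstract (detect a selected signal from one computing element, wait for the others, then send every element the same time update) can be sketched as a small coordinator. The class name `MetaTimeCoordinator`, its `tick` parameter, and the use of an integer counter as the time value are assumptions for illustration, not details from the patent.

```python
class MetaTimeCoordinator:
    """Hands out equal time updates once every element has signalled."""

    def __init__(self, element_ids, tick=1):
        self.element_ids = set(element_ids)
        self.waiting = set()   # elements that have signalled this round
        self.meta_time = 0     # shared time value (hypothetical unit)
        self.tick = tick

    def signal(self, element_id):
        """Record a meta time signal from one computing element.

        Returns None while other elements are still outstanding, or a
        dict giving every element the same new time value.
        """
        if element_id not in self.element_ids:
            raise KeyError(element_id)
        self.waiting.add(element_id)
        if self.waiting != self.element_ids:
            return None
        self.meta_time += self.tick
        self.waiting.clear()
        return {eid: self.meta_time for eid in sorted(self.element_ids)}
```

A coordinator for `["CE1", "CE2"]` returns `None` when CE1 signals first, then `{"CE1": 1, "CE2": 1}` once CE2 signals, so both clocks advance by the same amount regardless of which element arrived first.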

    Verfahren zur Isolation eines defekten Rechners in einem fehlertoleranten Mehrrechnersystem
    7.
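    Invention publication
    Verfahren zur Isolation eines defekten Rechners in einem fehlertoleranten Mehrrechnersystem (in force)
    Method for isolating a defective computer in a fault-tolerant multi-computer system

    Publication No.: EP0902369A3

    Publication date: 1999-07-28

    Application No.: EP98440187.7

    Filing date: 1998-08-28

    Applicant: ALCATEL

    IPC classification: G06F11/16 G06F11/18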

    Abstract: In a multi-computer system, in particular a 2-out-of-3 computer system, a computer identified as defective is to be isolated, in compliance with the "fail-safe" principle, in such a way that the non-defective computers can continue working. According to the invention, the defective computer receives a command (102) from the non-defective computers to shut itself down completely and thus stop outputting data. If the defective computer does not comply with this command and continues to output data, the non-defective computers shut themselves down (104). The system thereby assumes a safe state, since, owing to system-internal comparison processes, a single computer on its own cannot produce effective outputs. The solution according to the invention is therefore particularly suitable for control systems that obey the "fail-safe" principle, such as those required for securing routes in railway traffic or for monitoring nuclear power plants. The method can be implemented entirely in software; the relay switches required until now become superfluous.
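
The two-step isolation logic in this abstract (command 102, then self-shutdown 104 if the command is ignored) can be sketched as a small state function. The names `isolate_defective` and `system_can_output`, the boolean `obeys_shutdown`, and the string states are assumptions for illustration; the patent does not prescribe this representation.

```python
def isolate_defective(computers, defective, obeys_shutdown):
    """Return the state of each computer after the isolation attempt.

    Step 102: the healthy computers command the defective one to shut down.
    Step 104: if it keeps producing output, the healthy ones shut down instead.
    """
    state = {name: "running" for name in computers}
    if obeys_shutdown:
        state[defective] = "shut down"
    else:
        for name in computers:
            if name != defective:
                state[name] = "shut down"
    return state


def system_can_output(state, quorum=2):
    """Outputs take effect only if at least `quorum` computers still agree."""
    running = sum(1 for s in state.values() if s == "running")
    return running >= quorum
```

In a 2-out-of-3 system, a compliant defective computer leaves two healthy computers running and the system keeps working. A non-compliant one is left running alone, and since one computer is below the quorum of two, `system_can_output` is false, which corresponds to the safe state.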

    Mesh interconnected array in a fault-tolerant computer system
    9.
    Invention publication
    Mesh interconnected array in a fault-tolerant computer system (lapsed)

    Publication No.: EP0811916A2

    Publication date: 1997-12-10

    Application No.: EP97201662.0

    Filing date: 1997-06-06

    IPC classification: G06F11/00

    Abstract: Bus interface units (BIUs) (54) perform fault detection, identification, and reconfiguration for all information transfers between redundant central processing units (CPUs) (56) and memory or input/output (I/O) (57A-C) in a mesh interconnected array of a highly reliable fault-tolerant computer system. Errors are detected by self-checking within the BIUs, signal parity checks by the BIUs, cross-channel comparisons, and mesh transaction assessments. Fault identification and reconfiguration of the mesh are performed such that no faulty unit remains active in decision making after reconfiguration, and the number of good units isolated during reconfiguration is minimized.


    FAULT RESILIENT/FAULT TOLERANT COMPUTING
    10.
    Invention publication
    FAULT RESILIENT/FAULT TOLERANT COMPUTING (lapsed)

    Publication No.: EP0731945A1

    Publication date: 1996-09-18

    Application No.: EP95902615.0

    Filing date: 1994-11-15

    IPC classification: G06F11 G06F1 G06F13 G06F15

    Abstract: A method of synchronizing at least two computing elements (CE1, CE2) that each have clocks that operate asynchronously of the clocks of the other computing elements includes selecting one or more signals, designated as meta time signals, from a set of signals produced by the computing elements (CE1, CE2), monitoring the computing elements (CE1, CE2) to detect the production of a selected signal by one of the computing elements (CE1), waiting for the other computing elements (CE2) to produce a selected signal, transmitting equally valued time updates to each of the computing elements, and updating the clocks of the computing elements (CE1, CE2) based on the time updates. In a second aspect of the invention, fault resilient, or tolerant, computers (200) are produced by designating a first processor as a computing element (204), designating a second processor (202) as a controller, connecting the computing element (204) and the controller (202) to produce a modular pair, and connecting at least two modular pairs to produce a fault resilient or fault tolerant computer (200). Each computing element (202, 204) of the computer (200) performs all instructions in the same number of cycles as the other computing elements (202, 204). The computer systems include one or more controllers (202) and at least two computing elements (204).
