Fault tolerant computing systems using checkpoints
    1.
    Granted patent
    Status: In force

    Publication No.: US08812907B1

    Publication date: 2014-08-19

    Application No.: US13186087

    Filing date: 2011-07-19

    IPC classification: G06F11/00

    Abstract: A computer system configured to provide fault tolerance includes a first host system and a second host system. The first host system is programmed to monitor a number of portions of memory of the first host system that have been modified by a guest running on the first host system and, upon determining that the number of portions exceeds a threshold level, determine that a checkpoint needs to be created. Upon determining that the checkpoint needs to be created, operation of the guest is paused and checkpoint data is generated. After generating the checkpoint data, operation of the guest is resumed while the checkpoint data is transmitted to the second host system.
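The sequence in this abstract (count modified memory portions, pause the guest once the count exceeds a threshold, capture checkpoint data, resume, then transmit) can be illustrated with a minimal Python sketch. The class `PrimaryHost`, its methods, and the `send_to_secondary` callback are hypothetical names chosen for illustration and are not taken from the patent. The sketch is single-threaded, so it only records the order of the steps and does not model transmission overlapping with guest execution.

```python
class PrimaryHost:
    """Hypothetical model of the first host system; all names are illustrative."""

    def __init__(self, threshold, send_to_secondary):
        self.threshold = threshold                  # max modified pages before a checkpoint
        self.send_to_secondary = send_to_secondary  # callback standing in for the link
        self.memory = {}                            # page number -> contents
        self.dirty = set()                          # pages modified since the last checkpoint
        self.guest_paused = False
        self.events = []                            # ordering trace for inspection

    def guest_write(self, page, value):
        """Record a guest write and checkpoint once the dirty count exceeds the threshold."""
        self.memory[page] = value
        self.dirty.add(page)
        if len(self.dirty) > self.threshold:
            self.checkpoint()

    def checkpoint(self):
        # 1. Pause the guest so the captured state is consistent.
        self.guest_paused = True
        self.events.append("pause")
        # 2. Generate checkpoint data from the modified pages only.
        data = {page: self.memory[page] for page in self.dirty}
        self.dirty.clear()
        # 3. Resume the guest before the data is transmitted.
        self.guest_paused = False
        self.events.append("resume")
        # 4. Transmit to the second host system.
        self.send_to_secondary(data)
        self.events.append("transmit")


received = []  # stands in for the second host system
host = PrimaryHost(threshold=2, send_to_secondary=received.append)
for page in range(3):
    host.guest_write(page, f"v{page}")
```

With a threshold of 2, the third distinct page write triggers the checkpoint, and `received` then holds a single dictionary containing the three modified pages.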


    Loosely-coupled, synchronized execution
    5.
    Granted patent
    Status: Expired

    Publication No.: US5896523A

    Publication date: 1999-04-20

    Application No.: US868670

    Filing date: 1997-06-04

    IPC classification: G06F11/16, G06F11/00

    Abstract: Synchronized execution is maintained by compute elements processing instruction streams in a computer system including the compute elements and a controller. Each compute element includes a clock that operates asynchronously with respect to clocks of the other compute elements. Each compute element processes instructions from an instruction stream and counts the instructions processed. Upon processing a quantum of instructions from the instruction stream, the compute element initiates a synchronization procedure and continues to process instructions from the instruction stream and to count instructions processed from the instruction stream. The compute element halts processing of instructions from the instruction stream after processing an unspecified number of instructions from the instruction stream in addition to the quantum of instructions. Upon halting processing, the compute element sends a synchronization request to the controller and waits for a synchronization reply.
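The counting and halting behavior described in this abstract can be sketched in Python as follows. The abstract does not say what the synchronization reply contains, so the reply used here (the largest instruction count reported by any element, to which every element then advances) is an assumption made only to complete the example. `ComputeElement`, `controller_sync`, and the fixed `overshoot` values are likewise hypothetical.

```python
class ComputeElement:
    """Hypothetical compute element that counts the instructions it processes."""

    def __init__(self, name, quantum, overshoot):
        self.name = name
        self.quantum = quantum
        # The abstract says only "an unspecified number" of extra instructions;
        # a fixed per-element overshoot is used here purely for illustration.
        self.overshoot = overshoot
        self.count = 0
        self.halted = False

    def run_quantum(self):
        """Process one quantum plus the overshoot, halt, and return a sync request."""
        start = self.count
        while self.count < start + self.quantum:
            self.count += 1  # process and count one instruction
        # The synchronization procedure starts here, but processing continues.
        for _ in range(self.overshoot):
            self.count += 1
        self.halted = True   # halt, then wait for the reply
        return (self.name, self.count)

    def on_sync_reply(self, target_count):
        """Assumed reply handling: advance to the agreed count, then resume."""
        while self.count < target_count:
            self.count += 1
        self.halted = False


def controller_sync(elements):
    """Collect one request per element and reply with a common target count.

    Using the maximum reported count as the target is an assumption.
    """
    requests = [element.run_quantum() for element in elements]
    target = max(count for _, count in requests)
    for element in elements:
        element.on_sync_reply(target)
    return target


elements = [ComputeElement("A", 100, 3), ComputeElement("B", 100, 7)]
target = controller_sync(elements)
```

Here the two elements halt at different counts (103 and 107) because their overshoots differ, and the assumed reply brings both to the same count before they resume.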


    Dynamic Checkpointing Systems and Methods
    7.
    Patent application
    Status: In force

    Publication No.: US20150205671A1

    Publication date: 2015-07-23

    Application No.: US14571383

    Filing date: 2014-12-16

    IPC classification: G06F11/14

    CPC classification: G06F11/1484

    Abstract: A method for determining a delay in a dynamic, event-driven checkpoint interval. In one embodiment, the method includes the steps of determining the number of network bits to be transferred; determining the target bit transfer rate; and calculating the next cycle delay as the number of bits to be transferred divided by the target bit transfer rate. In another aspect, the invention relates to a method for delaying a checkpoint interval. In one embodiment, the method includes the steps of monitoring the transfer of a prior batch of network data and delaying a subsequent checkpoint until the transfer of the prior batch of network data has reached a certain predetermined level of completion. In another embodiment, the predetermined level of completion is 100%.
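The delay calculation stated in this abstract is a single division, and the second aspect is a completion check on the prior batch. The Python sketch below shows both under stated assumptions: the function names, the choice of bits and bits per second as units, and the treatment of an empty batch as complete are illustrative and do not come from the patent.

```python
def next_cycle_delay(bits_to_transfer: int, target_bit_rate: float) -> float:
    """Return the next cycle delay in seconds: bits to transfer / target bit rate.

    Units (bits, bits per second) are assumed for illustration.
    """
    if target_bit_rate <= 0:
        raise ValueError("target_bit_rate must be positive")
    return bits_to_transfer / target_bit_rate


def may_start_checkpoint(bits_sent: int, bits_total: int,
                         required_completion: float = 1.0) -> bool:
    """Return True once the prior batch has reached the required completion level.

    required_completion=1.0 corresponds to the 100% embodiment.
    An empty batch is treated as already complete (an assumption).
    """
    if bits_total == 0:
        return True
    return bits_sent / bits_total >= required_completion


delay = next_cycle_delay(8_000_000, 1_000_000_000)  # 8 Mbit at 1 Gbit/s
ready_half = may_start_checkpoint(4_000_000, 8_000_000)
ready_full = may_start_checkpoint(8_000_000, 8_000_000)
```

For 8 Mbit at a target rate of 1 Gbit/s, the computed delay is 0.008 seconds. A half-transferred batch does not permit the next checkpoint at the 100% level, while a fully transferred one does.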


    Fault resilient/fault tolerant computing
    8.
    Granted patent
    Status: Expired

    Publication No.: US5600784A

    Publication date: 1997-02-04

    Application No.: US405193

    Filing date: 1995-03-16

    Abstract: In a first aspect, a method of synchronizing at least two computing elements that each have clocks that operate asynchronously of the clocks of the other computing elements includes selecting one or more signals, designated as meta time signals, from a set of signals produced by the computing elements, monitoring the computing elements to detect the production of a selected signal by one of the computing elements, waiting for the other computing elements to produce a selected signal, transmitting equally valued time updates to each of the computing elements, and updating the clocks of the computing elements based on the time updates. In a second aspect, fault resilient or fault tolerant computers are produced by designating a first processor as a computing element, designating a second processor as a controller, connecting the computing element and the controller to produce a modular pair, and connecting at least two modular pairs to produce a fault resilient or fault tolerant computer. Each computing element of the computer performs all instructions in the same number of cycles as the other computing elements. Computer systems include one or more controllers and at least two computing elements. A system is provided for intercepting I/O operations by the computing elements and transmitting them to the one or more controllers.
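The first aspect of this abstract (wait until every computing element has produced a selected meta time signal, then send each the same time update) resembles a barrier. A minimal Python sketch follows. `MetaTimeController`, the fixed `time_step`, and keeping each element's clock as a counter inside the controller are simplifying assumptions for illustration, not details from the patent.

```python
class MetaTimeController:
    """Hypothetical controller that issues equally valued time updates."""

    def __init__(self, element_ids, time_step=1):
        # One modeled clock per computing element, all starting at zero.
        self.clocks = {element_id: 0 for element_id in element_ids}
        self.time_step = time_step
        self.signaled = set()  # elements that have produced a selected signal

    def on_selected_signal(self, element_id):
        """Record a meta time signal; update all clocks once every element has signaled.

        Returns True if this signal completed the round and updates were sent.
        """
        self.signaled.add(element_id)
        if self.signaled != set(self.clocks):
            return False  # still waiting for the other computing elements
        for eid in self.clocks:
            self.clocks[eid] += self.time_step  # same update value for every element
        self.signaled.clear()
        return True


ctrl = MetaTimeController(["ce0", "ce1"])
first = ctrl.on_selected_signal("ce0")   # ce1 has not signaled yet
second = ctrl.on_selected_signal("ce1")  # completes the round
```

No clock advances after the first signal. Once the second element signals, both clocks receive the same update, so they stay equal even though the signals arrived at different times.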


    Fault resilient/fault tolerant computing
    9.
    Granted patent
    Status: In force

    Publication No.: US06279119B1

    Publication date: 2001-08-21

    Application No.: US09190269

    Filing date: 1998-11-13

    IPC classification: G06F11/00

    Abstract: A fault tolerant/fault resilient computer system includes at least two compute elements connected to at least one controller. Each compute element has clocks that operate asynchronously to clocks of the other compute elements. The compute elements operate in a first mode in which the compute elements each execute a first stream of instructions in emulated clock lockstep, and in a second mode in which the compute elements each execute a second stream of instructions in instruction lockstep. Each compute element may be a multi-processor compute element.
