System of performing checkpoint/restart of a parallel program
    1.
    Granted invention patent
    Status: in force

    Publication number: US06401216B1

    Publication date: 2002-06-04

    Application number: US09181981

    Filing date: 1998-10-29

    IPC classification: G06F11/00

    CPC classification: G06F11/1458

    Abstract: A checkpoint of a parallel program is taken in order to provide a consistent state of the program in the event the program is to be restarted. Each process of the parallel program is responsible for taking its own checkpoint; however, the timing of when the checkpoint is to be taken by each process is the responsibility of a coordinating process. During the checkpointing, various data is written to a checkpoint file. This data includes, for instance, in-transit message data, a data section, file offsets, signal state, executable information, stack contents and register contents. The checkpoint file can be stored in either local or global storage. When it is stored in global storage, migration of the program is facilitated. When a parallel program is to be restarted, each process of the program initiates its own restart. The restart logic restores the process to the state at which the checkpoint was taken.
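
    The division of labor described in the abstract (a coordinator decides when, each process saves and restores its own state) can be illustrated with a small sketch. This is not the patented implementation: the `Worker` class, the `coordinate_checkpoint` function, the JSON file layout, and the `ckpt.<rank>.json` naming are all assumptions made for illustration, and only three of the listed state items (data section, file offsets, in-transit messages) are modeled.

```python
import json
import os
import tempfile


class Worker:
    """Toy stand-in for one process of a parallel program (hypothetical)."""

    def __init__(self, rank):
        self.rank = rank
        self.data = {"counter": 0}   # stands in for the data section
        self.file_offsets = {}       # file name -> byte offset
        self.in_transit = []         # messages received but not yet consumed

    def take_checkpoint(self, directory):
        """Each process writes its own checkpoint file."""
        state = {
            "rank": self.rank,
            "data": self.data,
            "file_offsets": self.file_offsets,
            "in_transit": self.in_transit,
        }
        path = os.path.join(directory, f"ckpt.{self.rank}.json")
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # rename so a half-written file is never used
        return path

    @classmethod
    def restart(cls, directory, rank):
        """Each process initiates its own restart from its own file."""
        path = os.path.join(directory, f"ckpt.{rank}.json")
        with open(path) as f:
            state = json.load(f)
        worker = cls(rank)
        worker.data = state["data"]
        worker.file_offsets = state["file_offsets"]
        worker.in_transit = state["in_transit"]
        return worker


def coordinate_checkpoint(workers, directory):
    """The coordinator only decides when; it saves no worker state itself."""
    return [w.take_checkpoint(directory) for w in workers]


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()  # a shared path here would model global storage
    workers = [Worker(rank) for rank in range(3)]
    workers[1].data["counter"] = 7
    coordinate_checkpoint(workers, ckpt_dir)
    restored = Worker.restart(ckpt_dir, 1)
    print(restored.data)
```

    In this sketch, pointing `ckpt_dir` at storage visible to every node corresponds to the "global storage" case, since a restarted worker could then load its file from a different machine.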

    Program products for performing checkpoint/restart of a parallel program
    2.
    Granted invention patent
    Status: no longer in force

    Publication number: US06338147B1

    Publication date: 2002-01-08

    Application number: US09182555

    Filing date: 1998-10-29

    IPC classification: G06F11/00

    CPC classification: G06F11/1458

    Abstract: A checkpoint of a parallel program is taken in order to provide a consistent state of the program in the event the program is to be restarted. Each process of the parallel program is responsible for taking its own checkpoint; however, the timing of when the checkpoint is to be taken by each process is the responsibility of a coordinating process. During the checkpointing, various data is written to a checkpoint file. This data includes, for instance, in-transit message data, a data section, file offsets, signal state, executable information, stack contents and register contents. The checkpoint file can be stored in either local or global storage. When it is stored in global storage, migration of the program is facilitated. When a parallel program is to be restarted, each process of the program initiates its own restart. The restart logic restores the process to the state at which the checkpoint was taken.

    Concurrent access of an unsegmented buffer by writers and readers of the buffer
    3.
    Granted invention patent
    Status: no longer in force

    Publication number: US06658525B1

    Publication date: 2003-12-02

    Application number: US09672642

    Filing date: 2000-09-28

    IPC classification: G06F13/00

    CPC classification: G06F13/4059

    Abstract: Data is written to an unsegmented buffer located within shared memory. While data is being written to the unsegmented buffer, at least a portion of the data is being read from the buffer. A counter is used to indicate how much space is available in the buffer to receive data. Further, the counter is employed to ensure that the reader does not advance beyond the writer.
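
    One way to picture the counter's two roles is the following sketch of a byte buffer with no fixed-size slots. The abstract does not disclose the actual mechanism, so everything here is an assumption: the `CountedBuffer` name, the wrap-around indexing, the single-writer/single-reader restriction, and the use of a lock around the counter in place of whatever shared-memory primitive the patent uses.

```python
import threading


class CountedBuffer:
    """Unsegmented byte buffer guarded by one free-space counter (hypothetical).

    Assumes exactly one writer thread and one reader thread.
    """

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.free = capacity          # the counter: bytes the writer may fill
        self.write_pos = 0
        self.read_pos = 0
        self.lock = threading.Lock()  # protects updates to the counter only

    def write(self, data):
        """Copy in as many bytes as currently fit; return how many were written."""
        with self.lock:
            n = min(len(data), self.free)
        for i in range(n):
            self.buf[(self.write_pos + i) % self.capacity] = data[i]
        self.write_pos = (self.write_pos + n) % self.capacity
        with self.lock:
            self.free -= n            # publish the new bytes to the reader
        return n

    def read(self, max_bytes):
        """Return up to max_bytes of written data, never passing the writer."""
        with self.lock:
            n = min(max_bytes, self.capacity - self.free)
        out = bytes(
            self.buf[(self.read_pos + i) % self.capacity] for i in range(n)
        )
        self.read_pos = (self.read_pos + n) % self.capacity
        with self.lock:
            self.free += n            # hand the space back to the writer
        return out
```

    The writer consults `free` to learn how much space is available, and the reader derives the amount of unread data as `capacity - free`, so the same counter stops the reader from advancing beyond the writer.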

    Method of performing checkpoint/restart of a parallel program
    4.
    Granted invention patent
    Status: no longer in force

    Publication number: US06393583B1

    Publication date: 2002-05-21

    Application number: US09181985

    Filing date: 1998-10-29

    IPC classification: H02H3/05

    CPC classification: G06F11/1438, G06F11/1458

    Abstract: A checkpoint of a parallel program is taken in order to provide a consistent state of the program in the event the program is to be restarted. Each process of the parallel program is responsible for taking its own checkpoint; however, the timing of when the checkpoint is to be taken by each process is the responsibility of a coordinating process. During the checkpointing, various data is written to a checkpoint file. This data includes, for instance, in-transit message data, a data section, file offsets, signal state, executable information, stack contents and register contents. The checkpoint file can be stored in either local or global storage. When it is stored in global storage, migration of the program is facilitated. When a parallel program is to be restarted, each process of the program initiates its own restart. The restart logic restores the process to the state at which the checkpoint was taken.
