Apparatus and method for passively monitoring liveness of jobs in a clustered computing environment
    Granted invention patent
    Status: lapsed

    Publication number: US06990668B1

    Publication date: 2006-01-24

    Application number: US09421585

    Filing date: 1999-10-20

    IPC classification: G06F9/461

    Abstract: An apparatus and method passively determine when a job in a clustered computing environment is dead. Each node in the cluster has a cluster engine for communicating between each job on the node and jobs on other nodes. A protocol is defined that includes one or more acknowledge (ACK) rounds and that performs only local processing between ACK rounds. The protocol is executed by jobs that are members of a defined group. Each job in the group has one or more work threads that execute the protocol. In addition, each job has a main thread that communicates between the job and jobs on other nodes (through the cluster engine), routes appropriate messages from the cluster engine to a work thread, and signals the cluster engine when a fault occurs while a work thread executes the protocol. By assuring that a dead job is reported to other members of the group, liveness information for group members can be monitored without the overhead associated with active liveness checking.
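The mechanism in the abstract can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions, not the patented implementation: the names `ClusterEngine`, `Job`, `ack`, `report_fault`, and `fail_at` are hypothetical, a single shared object stands in for the per-node cluster engines, and a Python exception stands in for a work-thread fault. Each job's work thread does local processing and then blocks in an ACK round; when a work thread fails, the job's main thread reports the fault, which releases the surviving members without any heartbeat or polling.

```python
import threading


class ClusterEngine:
    """Hypothetical stand-in for the cluster engine (one shared instance)."""

    def __init__(self, members):
        self.live = set(members)   # group members believed alive
        self.dead = set()          # members reported dead
        self.acked = set()         # members that ACKed the current round
        self.round = 0
        self.cond = threading.Condition()

    def ack(self, name):
        """Block until every live member has ACKed the current round."""
        with self.cond:
            my_round = self.round
            self.acked.add(name)
            self._maybe_advance()
            while self.round == my_round:
                self.cond.wait()
            return set(self.dead)  # snapshot of members known to be dead

    def report_fault(self, name):
        """Called by a job's main thread when its work thread has failed."""
        with self.cond:
            self.live.discard(name)
            self.dead.add(name)
            self.acked.discard(name)
            self._maybe_advance()  # survivors may now complete the round

    def _maybe_advance(self):
        # Caller must hold self.cond.
        if self.live and self.acked >= self.live:
            self.acked.clear()
            self.round += 1
            self.cond.notify_all()


class Job:
    def __init__(self, name, engine, rounds, fail_at=None):
        self.name, self.engine, self.rounds = name, engine, rounds
        self.fail_at = fail_at     # round in which to simulate a fault
        self.completed = 0         # ACK rounds completed
        self.seen_dead = set()     # dead members this job has learned of
        self.error = None

    def _work(self):
        """Work thread: local processing, then an ACK round, repeated."""
        try:
            for r in range(self.rounds):
                if r == self.fail_at:
                    raise RuntimeError(f"{self.name} failed in round {r}")
                self.seen_dead |= self.engine.ack(self.name)
                self.completed += 1
        except Exception as exc:
            self.error = exc

    def main(self):
        """Main thread: run the work thread and report any fault."""
        worker = threading.Thread(target=self._work)
        worker.start()
        worker.join()
        if self.error is not None:
            self.engine.report_fault(self.name)


def run_demo():
    names = ["A", "B", "C"]
    engine = ClusterEngine(names)
    jobs = [Job(n, engine, rounds=3, fail_at=1 if n == "B" else None)
            for n in names]
    mains = [threading.Thread(target=j.main) for j in jobs]
    for t in mains:
        t.start()
    for t in mains:
        t.join()
    return engine, jobs


if __name__ == "__main__":
    engine, jobs = run_demo()
    for j in jobs:
        print(j.name, j.completed, sorted(j.seen_dead))
```

In this sketch no job ever polls another. Because the work threads only do local processing between ACK rounds, a member that stops responding is either still working or has been reported dead by its own main thread, so the survivors learn of the death as a side effect of the round completing.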
