Apparatus for recovery from failures in a multiprocessing system
    1.
    Granted invention patent
    Apparatus for recovery from failures in a multiprocessing system (expired)

    Publication number: US4503535A

    Publication date: 1985-03-05

    Application number: US393906

    Filing date: 1982-06-30

    IPC classification: G06F11/00 G06F11/07

    Abstract: A number of intelligent nodes (bus interface units, BIUs, and memory control units, MCUs) are provided in a matrix composed of processor buses (105) with corresponding error-reporting and control lines (106), and memory buses (107) with corresponding error-reporting and control lines (108). Error-detection mechanisms deal with information flow occurring across area boundaries. Each node (100, 101, 102, 103) has means for logging errors and reporting errors on the error report lines (106, 108). If an error recurs, the node at which the error exists initiates an error message, which is received and repropagated on the error report lines by all nodes. The error message identifies the type of error and the ID of the node at which the error was detected. Confinement area isolation logic in a node isolates a faulty confinement area of which the node is a part when the node ID in an error report message identifies the node as part of that faulty confinement area. Logic in the node reconfigures at least part of the system when the node ID in the error report message identifies the node as part of a confinement area that should be reconfigured to recover from the reported error.
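
The report-and-isolate behaviour described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the names `ErrorReport`, `Node`, `detect`, `receive`, and `build_system`, the ring of peers standing in for the error report lines, and the grouping of node IDs into two confinement areas are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ErrorReport:
    error_type: str  # hypothetical label such as "parity"
    node_id: int     # ID of the node at which the error was detected


@dataclass
class Node:
    node_id: int
    area: frozenset                                # assumed: IDs sharing this node's confinement area
    peers: list = field(default_factory=list)      # nodes reachable over the error report lines
    local_log: list = field(default_factory=list)  # errors detected at this node
    seen: set = field(default_factory=set)         # reports already repropagated
    isolated: bool = False

    def detect(self, error_type):
        """Log a local error; on recurrence, initiate an error report."""
        recurred = error_type in self.local_log
        self.local_log.append(error_type)
        if recurred:
            self.receive(ErrorReport(error_type, self.node_id))

    def receive(self, report):
        """Handle a report once, then repropagate it to every peer."""
        if report in self.seen:
            return
        self.seen.add(report)
        if report.node_id in self.area:
            self.isolated = True  # this node belongs to the faulty confinement area
        for peer in self.peers:
            peer.receive(report)


def build_system():
    """Four nodes in two assumed confinement areas, linked in a ring."""
    areas = [frozenset({0, 1}), frozenset({2, 3})]
    nodes = [Node(i, areas[i // 2]) for i in range(4)]
    for i, node in enumerate(nodes):
        node.peers = [nodes[(i + 1) % 4], nodes[(i - 1) % 4]]
    return nodes


nodes = build_system()
nodes[2].detect("parity")  # first occurrence: logged only
nodes[2].detect("parity")  # recurrence: a report is initiated and repropagated
print([n.isolated for n in nodes])  # prints [False, False, True, True]
```

Every node sees the report exactly once, but only the nodes whose assumed confinement area contains the reported ID mark themselves isolated.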

    Apparatus for redundant operation of modules in a multiprocessing system
    2.
    Granted invention patent
    Apparatus for redundant operation of modules in a multiprocessing system (expired)

    Publication number: US4503534A

    Publication date: 1985-03-05

    Application number: US393905

    Filing date: 1982-06-30

    IPC classification: G06F11/00

    Abstract: A number of intelligent nodes (bus interface units, BIUs, and memory control units, MCUs) are provided in a matrix composed of processor buses (105) with corresponding error-reporting and control lines (106), and memory buses (107) with corresponding error-reporting and control lines (108). Each node (100, 101, 102, 103) has means for logging errors and reporting errors on the error-report lines (106, 108). Processor modules (110) and memory modules (112) are each connected to a node which controls access to a common memory bus (107). Each node includes means (a married bit 170 and a shadow bit 172) for marrying modules in pairs such that each module in the pair tracks the operations directed to the module pair, and each module in the pair alternates with the other module in the handling of requests or replies. Each node registers the ID of the other node in a spouse ID register. Comparison logic (162, 164) in each node resets the married bit when the node ID in an error-report message (identifying the node at which the error occurred) equals the ID stored in the spouse ID register, thus identifying the spouse node (the partner of the node in which the comparison logic is located) as the source of the error. Resetting the married bit splits apart the primary/shadow pair, so that the error-free module takes over and ceases to alternate with its partner.
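
The married-bit mechanism in this abstract can be modelled in a few lines. This is a sketch under assumptions rather than the patented circuit: the class `PairedNode`, its `turn` counter, the rule that the primary takes even-numbered requests, and the node IDs 4 and 5 are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairedNode:
    node_id: int
    spouse_id: int        # spouse ID register
    married: bool = True  # married bit
    shadow: bool = False  # shadow bit: assumed True for the shadow half of the pair
    turn: int = 0         # assumed count of requests tracked by this node

    def handles(self):
        """Track one request; return True if this node should service it."""
        my_turn = (self.turn % 2 == 1) == self.shadow
        self.turn += 1
        # While married, the pair alternates; once split, this node takes every request.
        return my_turn if self.married else True

    def on_error_report(self, reported_id):
        """Comparison logic: reset the married bit if the spouse is the error source."""
        if self.married and reported_id == self.spouse_id:
            self.married = False


primary = PairedNode(node_id=4, spouse_id=5)
shadow = PairedNode(node_id=5, spouse_id=4, shadow=True)

# Both nodes track every request, but only one of them services each.
before = [(primary.handles(), shadow.handles()) for _ in range(4)]

# An error report naming node 4 reaches both nodes of the pair.
for node in (primary, shadow):
    node.on_error_report(4)

# Only the error-free node is queried from here on; it no longer alternates.
after = [shadow.handles() for _ in range(3)]
print(before)          # prints [(True, False), (False, True), (True, False), (False, True)]
print(shadow.married)  # prints False
print(after)           # prints [True, True, True]
```

The model covers only the surviving node's side of the split. How the faulty module is taken out of service is outside what this abstract describes, so it is not modelled here.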

    Apparatus of fault-handling in a multiprocessing system
    3.
    Granted invention patent
    Apparatus of fault-handling in a multiprocessing system (expired)

    Publication number: US4438494A

    Publication date: 1984-03-20

    Application number: US296025

    Filing date: 1981-08-25

    IPC classification: G06F11/20 G06F11/07 G06F11/00

    Abstract: A number of intelligent crossbar switches (100) are provided in a matrix of orthogonal lines interconnecting processor (110) and memory control unit (MCU) modules (112). The matrix is composed of processor buses (105) with corresponding error-reporting lines (106), and memory buses (107) with corresponding error-reporting lines (108). At each intersection of these lines is a crossbar switch node (100). The crossbar switches pass memory requests from a processor to a memory module attached to an MCU node, together with any data associated with the requests. The system is organized into confinement areas, at the boundaries of which error-detection mechanisms are positioned to deal with information flow occurring across area boundaries. Each crossbar switch and MCU node has means for logging errors and signaling them to other nodes. Means are provided to reconfigure the system to reroute traffic around the confinement area at fault and to restart system operation in a possibly degraded mode.
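
The rerouting idea in this abstract can be sketched as a path-selection function over the switch matrix. This is an illustration under assumptions, not the patented design: it assumes a memory module is reachable over more than one memory bus, and the names `select_switch`, `NoRouteError`, `faulty_buses`, and `faulty_switches` are hypothetical.

```python
class NoRouteError(Exception):
    """Raised when every candidate path crosses a faulty confinement area."""


def select_switch(processor_bus, candidate_memory_buses, faulty_buses, faulty_switches):
    """Pick the crossbar switch (processor bus, memory bus) for a memory request.

    Candidate memory buses are tried in order; any bus or switch that has been
    marked faulty is skipped, so traffic is routed around it.
    """
    for memory_bus in candidate_memory_buses:
        switch = (processor_bus, memory_bus)
        if memory_bus in faulty_buses or switch in faulty_switches:
            continue
        return switch
    raise NoRouteError(f"no healthy path from processor bus {processor_bus}")


faulty_buses, faulty_switches = set(), set()

# Normal operation: the first candidate memory bus is used.
normal = select_switch(0, [0, 1], faulty_buses, faulty_switches)

# After memory bus 0 is reported faulty, requests take the remaining bus.
faulty_buses.add(0)
degraded = select_switch(0, [0, 1], faulty_buses, faulty_switches)

print(normal, degraded)  # prints (0, 0) (0, 1)
```

Running on the single remaining bus corresponds to the "possibly degraded mode" mentioned in the abstract. If no candidate survives, the sketch raises `NoRouteError` instead of guessing a path.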
