Granted invention patent
- Patent title: Methods and apparatus using commutative error detection values for fault isolation in multiple node computers
- Patent title (Chinese): 使用多节点计算机故障隔离交换误差检测值的方法和装置
- Application number: US11106069
- Filing date: 2005-04-14
- Publication number: US07383490B2
- Publication date: 2008-06-03
- Inventors: Gheorghe Almasi, Matthias Augustin Blumrich, Dong Chen, Paul Coteus, Alan Gara, Mark E. Giampapa, Philip Heidelberger, Dirk I. Hoenicke, Sarabjeet Singh, Burkhard D. Steinmacher-Burow, Todd Takken, Pavlos Vranas
- Applicants: Gheorghe Almasi, Matthias Augustin Blumrich, Dong Chen, Paul Coteus, Alan Gara, Mark E. Giampapa, Philip Heidelberger, Dirk I. Hoenicke, Sarabjeet Singh, Burkhard D. Steinmacher-Burow, Todd Takken, Pavlos Vranas
- Applicant address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current assignee: International Business Machines Corporation
- Current assignee address: US NY Armonk
- Agency: Harrington & Smith, PC
- Main classification: G06F11/00
- IPC classification: G06F11/00; H03M13/00
Abstract:
Methods and apparatus perform fault isolation in multiple node computing systems using commutative error detection values (for example, checksums) to identify and isolate faulty nodes. When information associated with a reproducible portion of a computer program is injected into a network by a node, a commutative error detection value is calculated. At intervals, node fault detection apparatus associated with the multiple node computer system retrieves the commutative error detection values associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created and stored in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in values indicate a possible faulty node.
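The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the 32-bit byte-sum checksum, the function names `commutative_checksum` and `find_suspect_nodes`, and the dictionary layout (node id mapped to a list of checksums sampled at intervals) are all assumptions made for illustration. The only property taken from the abstract is that the error detection value is commutative, so the order in which a node injects its packets does not change the result.

```python
from typing import Dict, Iterable, List

MASK32 = 0xFFFFFFFF  # assumed 32-bit checksum width (illustrative choice)


def commutative_checksum(packets: Iterable[bytes]) -> int:
    """Sum every byte of every packet modulo 2**32.

    Addition is commutative, so the value does not depend on
    the order in which the packets were injected.
    """
    total = 0
    for packet in packets:
        total = (total + sum(packet)) & MASK32
    return total


def find_suspect_nodes(
    run_a: Dict[int, List[int]], run_b: Dict[int, List[int]]
) -> List[int]:
    """Return ids of nodes whose stored checksum histories differ between two runs."""
    return sorted(node for node in run_a if run_a[node] != run_b.get(node))


# Hypothetical data: two runs of the same reproducible program section.
# Node 0 injects the same packets in a different order; node 1 corrupts one byte.
run1 = {
    0: [commutative_checksum([b"abc", b"de"])],
    1: [commutative_checksum([b"xyz"])],
}
run2 = {
    0: [commutative_checksum([b"de", b"abc"])],
    1: [commutative_checksum([b"xyZ"])],
}

suspects = find_suspect_nodes(run1, run2)
print(suspects)  # [1]
```

In this sketch node 0 is not flagged even though its packets were injected in a different order, while node 1 is flagged because its checksum changed between runs. As the abstract notes, a difference indicates only a possible faulty node.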