Re-utilizing partially failed resources as network resources
    1.
    Granted invention patent
    Status: lapsed

    Publication No.: US07620841B2

    Publication date: 2009-11-17

    Application No.: US11335784

    Filing date: 2006-01-19

    IPC classification: G06F11/00

    CPC classification: G06F11/0793 G06F11/0724

    Abstract: A method and apparatus for re-utilizing partially failed compute resources in a massively parallel supercomputer system. In the preferred embodiments, the compute node comprises a number of clock domains that can be enabled separately. When an error in a compute node is detected and the failure is not in the network communication blocks, a clock enable circuit enables only the clocks to the network communication blocks, allowing the partially failed compute node to be re-utilized as a network resource. The computer system can then continue to operate with only slightly diminished performance, thereby improving performance and perceived overall reliability.
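
    The clock-gating decision described in the abstract can be sketched in a few lines. This is only an illustrative software model, not the patented circuit: the domain names (net0, net1, cpu0, cpu1, memory), the ComputeNode class, and the "offline" outcome for a failure inside a network block are all assumptions, since the abstract does not say what happens in that case.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical clock-domain names. The abstract only distinguishes the
# network communication blocks from the rest of the compute node.
NETWORK_DOMAINS = frozenset({"net0", "net1"})
COMPUTE_DOMAINS = frozenset({"cpu0", "cpu1", "memory"})


@dataclass
class ComputeNode:
    """Toy model of a node whose clock domains can be enabled separately."""

    enabled: Set[str] = field(
        default_factory=lambda: set(NETWORK_DOMAINS | COMPUTE_DOMAINS)
    )

    def handle_failure(self, failed_domain: str) -> str:
        """Update the enabled clock domains after a detected failure."""
        if failed_domain in NETWORK_DOMAINS:
            # Assumption: a node with a failed network block cannot be
            # re-utilized, so every clock domain is disabled.
            self.enabled = set()
            return "offline"
        if failed_domain not in COMPUTE_DOMAINS:
            raise ValueError(f"unknown clock domain: {failed_domain}")
        # Failure is outside the network blocks: keep only their clocks
        # running so the node can still pass traffic.
        self.enabled = set(NETWORK_DOMAINS)
        return "network-only"
```

    In this model, a node whose cpu0 domain fails ends up with only net0 and net1 clocked, which corresponds to the "network resource" role named in the title.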

    Apparatus and method of repairing a processor array for a failure detected at runtime
    2.
    Granted invention patent
    Status: lapsed

    Publication No.: US06851071B2

    Publication date: 2005-02-01

    Application No.: US09974967

    Filing date: 2001-10-11

    Abstract: An apparatus and method of repairing a processor array for a failure detected at runtime in a system supporting persistent component deallocation are provided. The apparatus and method of the present invention allow redundant array bits to be used for recoverable faults detected in arrays during runtime, instead of only at system boot, while still maintaining the dynamic and persistent processor deallocation features of the computing system. With the apparatus and method of the present invention, a failure of a cache array is detected and a determination is made as to whether a repairable failure threshold is exceeded during runtime. If this threshold is exceeded, a determination is made as to whether cache array redundancy may be applied to correct the failure, i.e., a bit error. If so, the cache array redundancy is applied without marking the processor as unavailable. At some later time, the system undergoes a re-initial program load (re-IPL), at which time it is determined whether a second failure of the processor occurs. If a second failure occurs, a determination is made as to whether any status bits are set for arrays other than the cache array that experienced the present failure. If so, the processor is marked unavailable. If not, a determination is made as to whether cache redundancy can be applied to correct the failure. If so, the failure is corrected using the cache redundancy. If not, the processor is marked unavailable.
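
    The two decision points in the abstract (at runtime and at re-IPL) can be written out as a small decision table. This is a sketch for reading the flow, not the patented implementation: the Action enum and both function names are hypothetical, "no action" below the threshold is an assumption, and the runtime case where redundancy cannot be applied is left as UNSPECIFIED because the abstract does not cover it.

```python
from enum import Enum


class Action(Enum):
    NO_ACTION = "no action"
    APPLY_REDUNDANCY = "apply cache array redundancy"
    MARK_UNAVAILABLE = "mark processor unavailable"
    UNSPECIFIED = "not covered by the abstract"


def runtime_decision(threshold_exceeded: bool, redundancy_available: bool) -> Action:
    """Decision taken when a cache array failure is detected at runtime."""
    if not threshold_exceeded:
        # Assumption: below the repairable failure threshold nothing is done.
        return Action.NO_ACTION
    if redundancy_available:
        # Repair in place; the processor is NOT marked unavailable.
        return Action.APPLY_REDUNDANCY
    return Action.UNSPECIFIED


def re_ipl_decision(
    second_failure: bool,
    other_array_status_bits_set: bool,
    redundancy_available: bool,
) -> Action:
    """Decision taken at the later re-initial program load (re-IPL)."""
    if not second_failure:
        return Action.NO_ACTION
    if other_array_status_bits_set:
        # Status bits are set for arrays other than the failing cache array.
        return Action.MARK_UNAVAILABLE
    if redundancy_available:
        return Action.APPLY_REDUNDANCY
    return Action.MARK_UNAVAILABLE
```

    Note that the processor is marked unavailable on two distinct paths at re-IPL: when other arrays also report status bits, and when no cache redundancy remains to correct the failure.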

    Method and apparatus for processing an invalid address request
    3.
    Granted invention patent
    Status: lapsed

    Publication No.: US6047388A

    Publication date: 2000-04-04

    Application No.: US838723

    Filing date: 1997-04-09

    IPC classification: G06F12/14 G06F11/00

    CPC classification: G06F12/1441

    Abstract: A method, apparatus, and computer program product are provided for processing an invalid address request in a computer system. A processor in the computer system receives an address requested from software and compares the requested real address with the available real address range. An invalid address request is a requested real address outside the available real address range. Responsive to identifying an invalid address, the processor issues an interrupt to supervising software. An address exception is then posted to the user software, if appropriate.
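
    The range check and notification order in the abstract can be modeled as follows. This is a minimal sketch under stated assumptions: the function name, the half-open range [range_start, range_end), the notify_user flag standing in for "if appropriate", and the event strings are all hypothetical and do not come from the patent.

```python
from typing import List


def process_address_request(
    real_address: int,
    range_start: int,
    range_end: int,
    notify_user: bool,
) -> List[str]:
    """Return the ordered events raised for one requested real address.

    Assumption: the available real address range is half-open,
    i.e. valid addresses satisfy range_start <= real_address < range_end.
    """
    events: List[str] = []
    if range_start <= real_address < range_end:
        # Valid request: nothing is raised.
        return events
    # Invalid request: the processor first interrupts supervising software.
    events.append("interrupt: supervising software")
    if notify_user:
        # The address exception reaches user software only "if appropriate".
        events.append("address exception: user software")
    return events
```

    The supervising software is always notified first; whether the user software also sees an exception is a separate decision, represented here by the notify_user flag.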