Nearest neighbor approach for improved training of real-time health monitors for data processing systems
    1.
    Granted invention patent
    Legal status: In force

    Publication No.: US07243265B1

    Publication date: 2007-07-10

    Application No.: US10690917

    Filing date: 2003-10-22

    IPC classification: G06F11/00

    Abstract: Methods, systems, and articles of manufacture consistent with the present invention train a real-time health-monitor for a computer-based system while simultaneously monitoring the health of the system. A plurality of signals that each describe an operating condition of a subject data processing system are monitored in real-time. It is determined whether there is a problem with the subject data processing system by comparing at least one of the monitored signals to a corresponding at least one signal in a known signal dataset. The known signal dataset includes a signal value for at least one signal that describes an operating condition of one of a plurality of subject data processing systems. A new signal dataset having an entry for each monitored signal and a corresponding signal value is prepared simultaneously with monitoring the plurality of signals and determining whether there is a problem.
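
The monitoring loop in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: the function names, the dictionary layout of a signal sample, the Euclidean distance measure, and the fixed `threshold` are all hypothetical choices made for the example.

```python
import math

def nearest_distance(sample, known_dataset):
    """Distance from `sample` to its closest known entry.

    Assumption: each sample is a dict mapping signal name -> value, and
    closeness is plain Euclidean distance over the signals both share.
    """
    best = math.inf
    for entry in known_dataset:
        shared = sample.keys() & entry.keys()
        if not shared:
            continue  # nothing comparable in this entry
        d = math.sqrt(sum((sample[k] - entry[k]) ** 2 for k in shared))
        best = min(best, d)
    return best

def monitor(samples, known_dataset, threshold):
    """Flag problem samples while building a new signal dataset in the same pass."""
    problems = []
    new_dataset = []
    for t, sample in enumerate(samples):
        # Health check: is the sample far from everything seen on known systems?
        if nearest_distance(sample, known_dataset) > threshold:
            problems.append(t)
        # Training side: record every monitored signal and its value.
        new_dataset.append(dict(sample))
    return problems, new_dataset
```

In practice the signals would probably need to be scaled to comparable units before distances are meaningful; the sketch skips that step.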

    Dynamic self-tuning soft-error-rate-discrimination for enhanced availability of enterprise computing systems
    2.
    Granted invention patent
    Legal status: In force

    Publication No.: US07526683B1

    Publication date: 2009-04-28

    Application No.: US11141844

    Filing date: 2005-06-01

    IPC classification: G06F11/00

    CPC classification: G06F11/008

    Abstract: A method for use in a computer system provides a dynamic, “self tuning” soft-error-rate-discrimination (SERD) method and apparatus. Specially designed SRAMs or other circuits are “tuned” in a manner that gives them extreme susceptibility to cosmic neutron events (soft errors), higher than that of the “regular” SRAM components, memory modules or other components in the computer system. One such specially designed SRAM is deployed per server. An interface algorithm continuously sends read/write traffic to the special SRAM to infer the soft error rate (SER), which is directly proportional to cosmic neutron flux. The inferred cosmic neutron flux rate is employed in a Poisson SPRT algorithmic approach that dynamically compensates the soft error discrimination sensitivity in accordance with the instantaneous neutron flux for all of the regular SRAM components in the server.
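
To make the Poisson SPRT idea concrete, here is a minimal sketch of a Wald sequential probability ratio test on per-interval error counts, where the expected background rate may change from interval to interval. It is an assumption-laden illustration, not the patent's algorithm: `poisson_sprt`, `baseline_rates`, the alternative hypothesis "rate is `ratio` times the baseline", and the default error probabilities are all hypothetical. The baseline rates are assumed to have been derived already from the special SRAM's inferred soft error rate.

```python
import math

def poisson_sprt(counts, baseline_rates, ratio=3.0, alpha=0.01, beta=0.01):
    """Sequential test of H0 (rate = baseline) against H1 (rate = ratio * baseline).

    counts[i]         : soft errors observed on a regular component in interval i
    baseline_rates[i] : expected errors in interval i under the current neutron flux
    Returns ("alarm" | "ok" | "undecided", index of the deciding interval or None).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0
    for i, (x, lam0) in enumerate(zip(counts, baseline_rates)):
        lam1 = ratio * lam0
        # Poisson log-likelihood ratio for one interval:
        # x * ln(lam1 / lam0) - (lam1 - lam0), with lam1 / lam0 == ratio.
        llr += x * math.log(ratio) - (lam1 - lam0)
        if llr >= upper:
            return "alarm", i
        if llr <= lower:
            return "ok", i
    return "undecided", None
```

Because the baseline enters every step, the same error counts can raise an alarm when the inferred flux is low and be accepted as normal when it is high, which is the compensation effect the abstract describes.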

    Correlating and aligning monitored signals for computer system performance parameters
    3.
    Granted invention patent
    Legal status: In force

    Publication No.: US07292659B1

    Publication date: 2007-11-06

    Application No.: US10671705

    Filing date: 2003-09-26

    IPC classification: H03D1/00 H04L27/06

    Abstract: One embodiment of the present invention provides a system that facilitates aligning a first signal with a second signal in a manner that optimizes a correlation between the first signal and the second signal. The system starts by receiving a set of signals, including the first signal and the second signal. The system then determines a correlation between the first signal and the second signal. Next, the system adjusts an alignment between the first signal and the second signal and again determines a correlation between the first signal and the second signal. If the correlation is greater with the alignment adjustment, the system adjusts the alignment between the first signal and the second signal. This process of adjusting the alignment is repeated for different alignments to find an optimal alignment. Hence, the present invention operates effectively for signal sources which may be independently speeding up and slowing down with respect to each other while under surveillance.
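
The adjust-and-recheck loop described in this abstract can be illustrated with a small lag search. This is a sketch under assumptions: it uses Pearson correlation, integer sample shifts, and an exhaustive scan over `-max_lag..max_lag`, none of which is specified by the abstract. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_at_lag(first, second, lag):
    """Correlate the overlap when first[i + lag] is paired with second[i]."""
    if lag >= 0:
        a, b = first[lag:], second[:len(second) - lag]
    else:
        a, b = first[:len(first) + lag], second[-lag:]
    n = min(len(a), len(b))
    return pearson(a[:n], b[:n]) if n >= 2 else 0.0

def best_alignment(first, second, max_lag):
    """Try each candidate alignment and keep the one with the highest correlation."""
    best_lag, best_corr = 0, correlation_at_lag(first, second, 0)
    for lag in range(-max_lag, max_lag + 1):
        corr = correlation_at_lag(first, second, lag)
        if corr > best_corr:  # keep the adjustment only if correlation improves
            best_lag, best_corr = lag, corr
    return best_lag, best_corr
```

A positive result means the first signal lags the second by that many samples. A single global lag does not cover sources that speed up and slow down over time; one plausible extension is to rerun the search on successive windows.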

    Detecting and correcting a failure sequence in a computer system before a failure occurs
    5.
    Granted invention patent
    Legal status: In force

    Publication No.: US07181651B2

    Publication date: 2007-02-20

    Application No.: US10777532

    Filing date: 2004-02-11

    IPC classification: G06F11/00

    Abstract: One embodiment of the present invention provides a system that detects a failure sequence that leads to undesirable computer system behavior and that subsequently takes a corresponding remedial action. During operation, the system receives instrumentation signals from the computer system while the computer system is operating. The system then uses these instrumentation signals to determine if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, such as a system crash, wherein the determination involves considering predetermined multivariate correlations between multiple instrumentation signals and a failure sequence that is likely to lead to undesirable system behavior. Next, if the computer system is in a failure sequence that is likely to lead to undesirable system behavior, the system takes a remedial action.
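
One way to picture the detect-then-remediate flow is the sketch below. It is a loose illustration only: the abstract does not say how the multivariate correlations are represented, so the "signature" here (a stored pattern per signal, matched by Pearson correlation over a recent window) is an assumption, as are the function names and the 0.9 threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def in_failure_sequence(window, signature, threshold=0.9):
    """True if every signal in the signature matches its stored pattern.

    window    : signal name -> recent values (hypothetical layout)
    signature : signal name -> pattern previously seen before failures
    Assumes the window holds every signature signal at the same length.
    """
    scores = [pearson(window[name], pattern) for name, pattern in signature.items()]
    if not scores:
        return False
    return min(scores) >= threshold

def supervise(window, signature, remedial_action, threshold=0.9):
    """Run the remedial action when the window looks like a failure sequence."""
    if in_failure_sequence(window, signature, threshold):
        remedial_action()
        return True
    return False
```

Requiring the weakest per-signal match to clear the threshold is a conservative choice for the example; a real detector might weight signals or model their joint behavior instead.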
