Enhanced error handling for I/O load/store operations to a PCI device via bad parity or zero byte enables
    1.
    Granted invention patent
    Status: Expired

    Publication No.: US06223299B1

    Publication date: 2001-04-24

    Application No.: US09072418

    Filing date: 1998-05-04

    IPC classification: G06F11/00

    Abstract: Device select lines from each I/O device are brought into a PCI host bridge individually so that the device number of a failing device may be logged in an error register when an error is seen on the PCI bus. Until the error register is reset, subsequent load and store operations are delayed until the device number of the subject device may be checked against the error register. If the subject device is a previously failing device, the load/store operation to that device is prevented from completing, either by forcing bad parity or by zeroing all byte enables. With bad parity or zero byte enables forced, the I/O device will respond to the load or store request by activating its device select line, but will not accept store data. Operations to devices which are not logged in the error register are permitted to proceed normally, as are all load/store operations when the error register is clear. Normal system operations are thus not impacted, and operations during error recovery are permitted to proceed if no further damage will be caused by such operations.
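
The gating decision described in this abstract can be sketched in software as follows. This is only an illustrative model, not the patented hardware: the class `HostBridge`, the enum `Action`, and the use of a set to hold several logged device numbers are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"                # operation completes normally
    FORCE_BAD_PARITY = "bad_parity"    # device selects, but rejects the data
    ZERO_BYTE_ENABLES = "zero_be"      # device selects, but no byte lane is enabled


@dataclass
class HostBridge:
    """Toy model of the bridge's error register (hypothetical names)."""
    blocking_action: Action = Action.ZERO_BYTE_ENABLES
    # Assumption: the error register can hold more than one device number.
    failed_devices: set = field(default_factory=set)

    def log_error(self, device_number: int) -> None:
        """Record the device whose select line was active when the error occurred."""
        self.failed_devices.add(device_number)

    def reset_error_register(self) -> None:
        self.failed_devices.clear()

    def gate(self, device_number: int) -> Action:
        """Decide how a load/store to `device_number` is allowed to proceed."""
        if not self.failed_devices:
            return Action.PROCEED          # register clear: no delay, no check
        if device_number in self.failed_devices:
            return self.blocking_action    # previously failing device: block it
        return Action.PROCEED              # healthy device during recovery
```

For example, after `bridge.log_error(3)`, `bridge.gate(3)` returns the blocking action while `bridge.gate(5)` still returns `Action.PROCEED`.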

    Method and system for boot-time deconfiguration of a processor in a symmetrical multi-processing system
    2.
    Granted invention patent
    Status: Expired

    Publication No.: US06233680B1

    Publication date: 2001-05-15

    Application No.: US09165952

    Filing date: 1998-10-02

    IPC classification: G06F15/177

    Abstract: A method and system for deconfiguring a CPU in a processing system are disclosed. In one aspect, a processing system is disclosed that comprises a central processing unit (CPU) and a memory coupled to the CPU. The CPU includes an error status register for capturing information concerning the status of the CPU. The processing system includes a service processor for gathering and analyzing status information from the CPU error register. The processing system also includes a nonvolatile device coupled to the service processor. The nonvolatile device includes a deconfiguration area. The deconfiguration area stores information concerning the status of the CPU from the service processor. The deconfiguration area also provides information for deconfiguring a CPU during a boot time of the processing system. Accordingly, through the present invention, CPU errors are detected during normal computer operations by error detection logic. This detection is utilized during any subsequent boot process by service processor firmware to deallocate the defective CPU. This is accomplished through the use of error status registers within the CPU and through the use of a deconfiguration area in the nonvolatile device which provides information directly to the service processor.
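
The two phases in this abstract (recording errors at run time, then consulting the record at boot) can be sketched as below. The function names, the dictionary standing in for the nonvolatile deconfiguration area, and the convention that a zero register value means "no error" are all assumptions for illustration.

```python
def record_cpu_errors(deconfig_area: dict, error_registers: dict) -> None:
    """Service-processor step: persist the status of every CPU that reported an error.

    `error_registers` maps a CPU id to its raw error status value;
    zero is assumed to mean "no error".
    """
    for cpu_id, status in error_registers.items():
        if status != 0:
            deconfig_area[cpu_id] = status


def cpus_to_boot(all_cpus: list, deconfig_area: dict) -> list:
    """Boot-time step: skip every CPU listed in the deconfiguration area."""
    return [cpu for cpu in all_cpus if cpu not in deconfig_area]
```

With four CPUs and an error recorded for CPU 2, `cpus_to_boot([0, 1, 2, 3], area)` yields `[0, 1, 3]` on the next boot.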

    Method and system for boot-time deconfiguration of a memory in a processing system
    4.
    Granted invention patent
    Status: Expired

    Publication No.: US06243823B1

    Publication date: 2001-06-05

    Application No.: US09165955

    Filing date: 1998-10-02

    IPC classification: G06F15/177

    CPC classification: G06F11/142

    Abstract: A method and system for deconfiguring memory in a processing system are disclosed. In one aspect, a processing system comprises a central processing unit (CPU) and a memory coupled to the CPU. The memory includes a memory array and a memory controller for capturing information concerning the status of the memory array. The processing system includes a service processor for gathering and analyzing status information from the memory controller. The processing system also includes a nonvolatile device coupled to the CPU and the service processor. The nonvolatile device includes a deconfiguration area. The deconfiguration area stores information concerning the status of the memory array from the service processor. The deconfiguration area also provides information for deconfiguring at least a portion of the memory array during a boot time of the processing system. Accordingly, through the present invention, memory errors are detected during normal computer operations by error detection logic. This detection is utilized during any subsequent boot process by service processor and CPU boot firmware to deallocate the defective memory module. This is accomplished through the use of error status registers within the memory controller and through the use of a deconfiguration area in the nonvolatile device which provides information directly to the CPU boot firmware.

    Recovery mechanism for L1 data cache parity errors
    5.
    Granted invention patent
    Status: Expired

    Publication No.: US06332181B1

    Publication date: 2001-12-18

    Application No.: US09072324

    Filing date: 1998-05-04

    IPC classification: G06F12/08

    Abstract: A method of handling a cache error (such as a parity error) that allows software recovery by reporting the error through an unrelated system resource, such as an interrupt service, and particularly a data storage interrupt. The parity error can be reported by generating a data storage interrupt and using the data storage interrupt status register (DSISR) to indicate that the data storage interrupt is a result of the parity error. The context of the processor can be fully synchronized while handling the parity error.

    Apparatus and method of repairing a processor array for a failure detected at runtime
    6.
    Granted invention patent
    Status: Expired

    Publication No.: US06851071B2

    Publication date: 2005-02-01

    Application No.: US09974967

    Filing date: 2001-10-11

    Abstract: An apparatus and method of repairing a processor array for a failure detected at runtime in a system supporting persistent component deallocation are provided. The apparatus and method of the present invention allow redundant array bits to be used for recoverable faults detected in arrays during run time, instead of only at system boot, while still maintaining the dynamic and persistent processor deallocation features of the computing system. With the apparatus and method of the present invention, a failure of a cache array is detected and a determination is made as to whether a repairable failure threshold is exceeded during runtime. If this threshold is exceeded, a determination is made as to whether cache array redundancy may be applied to correct the failure, i.e. a bit error. If so, the cache array redundancy is applied without marking the processor as unavailable. At some time later, the system undergoes a re-initial program load (re-IPL), at which time it is determined whether a second failure of the processor occurs. If a second failure occurs, a determination is made as to whether any status bits are set for arrays other than the cache array that experienced the present failure. If so, the processor is marked unavailable. If not, a determination is made as to whether cache redundancy can be applied to correct the failure. If so, the failure is corrected using the cache redundancy. If not, the processor is marked unavailable.
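
The re-IPL branch of this decision flow can be sketched as below. Only the order of the checks comes from the abstract; the names `Outcome` and `handle_reipl_failure`, and the idea of tracking available redundancy as a count of spare bits per array, are assumptions for illustration.

```python
from enum import Enum


class Outcome(Enum):
    REPAIRED = "repaired with cache redundancy"
    UNAVAILABLE = "processor marked unavailable"


def handle_reipl_failure(failing_array: str,
                         status_bits: dict,
                         spare_bits_left: dict) -> Outcome:
    """Decide what to do with a processor whose array fails again at re-IPL.

    status_bits      maps array name -> True if its failure status bit is set.
    spare_bits_left  maps array name -> remaining redundant bits (hypothetical model).
    """
    # Step 1: a status bit set for any *other* array makes the processor unusable.
    other_arrays_flagged = any(
        is_set for name, is_set in status_bits.items() if name != failing_array
    )
    if other_arrays_flagged:
        return Outcome.UNAVAILABLE

    # Step 2: otherwise repair with redundancy if any is still available.
    if spare_bits_left.get(failing_array, 0) > 0:
        spare_bits_left[failing_array] -= 1
        return Outcome.REPAIRED

    # Step 3: no redundancy left, so the processor is deallocated.
    return Outcome.UNAVAILABLE
```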

    Fault tolerant computer memory systems and components employing dual level error correction and detection with disablement feature
    7.
    Granted invention patent
    Status: Expired

    Publication No.: US5682394A

    Publication date: 1997-10-28

    Application No.: US012186

    Filing date: 1993-02-02

    IPC classification: G06F11/00 G06F11/10

    CPC classification: G06F11/1052 G06F11/1008

    Abstract: In a memory system comprising a plurality of memory units, each of which possesses unit-level error correction capabilities and each of which is tied to a system-level error correction function, memory reliability is enhanced by providing a mechanism for disabling the unit-level error correction capability, for example, in response to the occurrence of an uncorrectable error in one of the memory units. This counter-intuitive approach, which disables an error correction function, nonetheless enhances overall memory system reliability, since it enables the employment of the complement/recomplement algorithm, which depends upon the presence of reproducible errors for proper operation. Thus, chip-level error correction systems, which are increasingly desirable at high packaging densities, are employed in a way which does not interfere with system-level error correction methods.
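
To show why the complement/recomplement algorithm needs reproducible (hard) errors, here is a minimal sketch of the generic double-complement technique on a word with stuck-at bits. It is not taken from the patent text: the `StuckWord` model and the function name are hypothetical, and the system-level ECC step that would normally follow is omitted.

```python
class StuckWord:
    """A memory word in which some bit cells are stuck at fixed values (toy model)."""

    def __init__(self, width: int, stuck_mask: int, stuck_values: int):
        self.all_ones = (1 << width) - 1
        self.stuck_mask = stuck_mask                   # 1 = this cell is stuck
        self.stuck_values = stuck_values & stuck_mask  # value each stuck cell returns
        self.cells = 0

    def write(self, value: int) -> None:
        self.cells = value & self.all_ones

    def read(self) -> int:
        # Healthy cells return what was written; stuck cells return their stuck value.
        healthy = self.cells & ~self.stuck_mask & self.all_ones
        return healthy | self.stuck_values


def complement_recomplement(word: StuckWord):
    """Return (candidate_data, stuck_positions) after one double-complement pass."""
    first = word.read()                     # data as corrupted by the stuck cells
    word.write(~first & word.all_ones)      # write back the complement
    second = word.read()                    # stuck cells corrupt the complement too
    candidate = ~second & word.all_ones     # recomplement
    # Healthy bits survive both passes unchanged; stuck bits come back inverted,
    # so the XOR of the two views marks exactly the stuck positions.
    stuck_positions = first ^ candidate
    return candidate, stuck_positions
```

In the candidate word, every stuck bit that was wrong on the first read is now right, while a stuck bit that happened to match the data comes back inverted, so the candidate would still be passed through system-level ECC. The procedure only works if the fault shows up identically on both reads, which is the reason given in the abstract for disabling unit-level correction.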

    Run time error probe in a network computing environment
    8.
    Granted invention patent
    Status: Expired

    Publication No.: US5978936A

    Publication date: 1999-11-02

    Application No.: US974574

    Filing date: 1997-11-19

    Abstract: A first set of test instructions is provided for a first node in a computer network. A corresponding second set is provided for a second node in the network. The test instruction sets are partitioned into modules. The nodes process their respective sets of test instructions independently to generate test results for each module on each node, except when a synchronizing event occurs. Each node stores its test results for each test module. Since the test modules have an ordered processing sequence, each node's test results for corresponding test modules can be compared asynchronously on an ongoing basis.
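
The asynchronous comparison described here can be sketched as follows. The function name, the use of integer module indices as dictionary keys, and the assumption that a mismatch between corresponding results signals an error are illustrative choices, not details from the abstract.

```python
def compare_completed_modules(results_a: dict, results_b: dict) -> list:
    """Compare the results of every test module that BOTH nodes have finished.

    Each dict maps a module index (its position in the ordered sequence) to
    the result that node stored. Modules finished by only one node are
    skipped for now and can be compared on a later call.
    """
    mismatches = []
    for module in sorted(results_a.keys() & results_b.keys()):
        if results_a[module] != results_b[module]:
            mismatches.append(module)
    return mismatches
```

Because the nodes run independently, one node may be ahead of the other; calling the function repeatedly as results arrive gives the ongoing comparison without forcing the nodes to synchronize.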

    Method, apparatus, and computer program product for deconfiguring a processor
    9.
    Granted invention patent
    Status: Active

    Publication No.: US06789048B2

    Publication date: 2004-09-07

    Application No.: US10116626

    Filing date: 2002-04-04

    IPC classification: G06F11/30

    CPC classification: G06F11/2236

    Abstract: According to a method form of the invention, in a computer system having a processing load distributed among a number of processors in the system, test computations are performed at intervals by floating point logic of a processor responsive to stored test instructions. Responsive to the test computations indicating an erroneous result by one of the processors, information is passed by a firmware process and entered into an operating system error log. Responsive to the information, an operating system deconfiguration service is notified of the error log entry, and the service deconfigures the indicated processor while the system is still running.
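
A rough software analogy of this test-and-deconfigure loop is sketched below. Everything here is hypothetical: each processor is modeled as a callable standing in for its floating-point unit, the test vectors are arbitrary, and the firmware and operating-system services are collapsed into one function.

```python
import math


def run_fp_self_test(compute=math.sqrt) -> bool:
    """Run stored test computations and compare against known-good answers."""
    cases = [(4.0, 2.0), (9.0, 3.0), (2.25, 1.5)]  # exactly representable results
    return all(compute(x) == expected for x, expected in cases)


def check_processors(processors: dict, error_log: list, online: set) -> None:
    """Test each online processor; log and deconfigure any that miscomputes.

    processors maps a CPU id to a callable standing in for that CPU's
    floating-point logic (an assumption of this sketch).
    """
    for cpu_id, compute in processors.items():
        if cpu_id in online and not run_fp_self_test(compute):
            error_log.append({"cpu": cpu_id, "error": "fp_self_test"})
            online.discard(cpu_id)  # remove from service while the system keeps running
```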

    Method and system for end-to-end problem determination and fault isolation for storage area networks
    10.
    Granted invention patent
    Status: Active

    Publication No.: US06636981B1

    Publication date: 2003-10-21

    Application No.: US09478306

    Filing date: 2000-01-06

    IPC classification: G06F15/177

    Abstract: A method and system for problem determination and fault isolation in a storage area network (SAN) are provided. A complex configuration of multi-vendor host systems, FC switches, and storage peripherals is connected in a SAN via a communications architecture (CA). A communications architecture element (CAE) is a network-connected device that has successfully registered with a communications architecture manager (CAM) on a host computer via a network service protocol, and the CAM contains problem determination (PD) functionality for the SAN and maintains a SAN PD information table (SPDIT). The CA comprises all network-connected elements capable of communicating information stored in the SPDIT. The CAM uses a SAN topology map and the SPDIT to create a SAN diagnostic table (SDT). A failing component in a particular device may generate errors that cause devices along the same network connection path to generate errors. As the CAM receives error packets or error messages, the errors are stored in the SDT, and each error is analyzed by temporally and spatially comparing the error with other errors in the SDT. If a CAE is determined to be a candidate for generating the error, then the CAE is reported for replacement if possible.
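
One way to picture the temporal and spatial comparison is the sketch below. The abstract does not say how the candidate is chosen, so the rule used here (blame the device with the earliest correlated error) is purely an assumption, as are the names `ErrorRecord`, `correlated`, `candidate_device`, and the 5-second window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    device: str
    timestamp: float  # seconds


def correlated(new: ErrorRecord, sdt: list, paths: list, window: float) -> list:
    """Return SDT entries close to `new` in time AND on a shared connection path."""
    def share_path(a: str, b: str) -> bool:
        return any(a in path and b in path for path in paths)

    return [
        e for e in sdt
        if abs(e.timestamp - new.timestamp) <= window   # temporal comparison
        and share_path(e.device, new.device)            # spatial comparison
    ]


def candidate_device(new: ErrorRecord, sdt: list, paths: list,
                     window: float = 5.0) -> str:
    """Pick a replacement candidate: the earliest reporter among correlated errors."""
    group = correlated(new, sdt, paths, window) + [new]
    return min(group, key=lambda e: e.timestamp).device
```

Here `paths` plays the role of the topology map (each entry lists the devices along one connection path) and `sdt` the diagnostic table of previously received errors.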
