摘要:
Device selects lines from each I/O device are brought into a PCI host bridge individually so that the device number of a failing device may be logged in an error register when an error is seen on the PCI bus. Until the error register is reset, subsequent load and store operations are delayed until the device number of the subject device may be checked against the error register. If the subject device is a previously failing device, the load/store operation to that device is prevented from completing, either by forcing bad parity or zeroing all byte enables. By forcing bad parity of zero byte enables, the I/O device will respond to the load or store request by activating its device select line, but will not accept store data. Operations to devices which are not logged in the error register are permitted to proceed normally, as are all load store operations when the error register is clear. Normal system operations are thus not impacted, and operations during error recovery are permitted to proceed if no further damage will be caused by such operations.
摘要:
A method and system for deconfiguring a CPU in a processing system is disclosed. In one aspect, a processing system is disclosed that comprises a central processing unit (CPU), and a memory coupled to the CPU. The error status register for capturing information concerning the status of the CPU. The processing system includes a service processor for gathering and analyzing status information from the CPU error register. The processing system also includes a nonvolatile device coupled to the service processor. The nonvolatile device includes a deconfiguration area. The deconfiguration area stores information concerning the status of the CPU from the service processor. The deconfiguration area also provides information for deconfiguring a CPU during a boot time of the processing system. Accordingly, through the present invention, CPU errors are detected during normal computer operations by error detection logic. This detection is utilized during any subsequent boot process by service processor firmware to deallocate the defective CPU. This is accomplished through the use of error status registers within the CPU and through the use of a deconfiguration area in the nonvolatile device which provides information directly to the service processor.
摘要:
A method and apparatus in a multiprocessor data processing system for managing a plurality of processors. Monitoring for recoverable errors in a set of processors is performed. Responsive to detecting a recoverable error for a processor in the set of processors, a determination is made as to whether the recoverable error indicates a trend towards an unrecoverable error. Responsive to a determination that the recoverable error indicates a trend towards an unrecoverable error, actions are initiated to stop the processor.
摘要:
A method and system for deconfiguring software in a processing system is disclosed. In one aspect, a processing system comprises a central processing unit (CPU), and a memory coupled to the CPU. The memory includes a memory array and a memory controller for capturing information concerning the status of the memory array. The processing system includes a service processor for gathering and analyzing status information from the memory controller. The processing system also includes a nonvolatile device coupled to the CPU and the service processor. The nonvolatile device includes a deconfiguration area. The deconfiguration area stores information concerning the status of the memory array from the service processor. The deconfiguration area also provides information for deconfiguring at least a portion of the memory array during a boot time of the processing system. Accordingly, through the present invention, memory errors are detected during normal computer operations by error detection logic. This detection is utilized during any subsequent boot process by service processor and CPU boot firmware to deallocate the defective memory module. This is accomplished through the use of error status registers within the memory controller and through the use of a deconfiguration area in the nonvolatile device which provides information directly to the CPU boot firmware.
摘要:
Method and system aspects for check stop error handling are provided. A method aspect for check stop error handling in a computer system, the computer system comprising a plurality of components including a processor that supports an operating system and firmware, includes utilizing a service processor following a check stop error for error data retrieval and attempting a reboot of the computer system. The method further includes initiating firmware for failure reporting based on the error data retrieval when the reboot is successful.
摘要:
Aspects for detecting environmental faults in redundant components of a computer system are described. In an exemplary method aspect, the method includes monitoring system environment conditions, including a status for redundant power supply and cooling components. The method further includes registering a failure condition with an appropriate error type when a monitored system environment condition exceeds a design threshold, and utilizing the registered failure condition as data in an architected error log.
摘要:
The present invention provides method and system aspects for performing error data gathering from fault isolation registers of a computer system following a machine check occurrence. A method aspect includes utilizing firmware to perform failure information retrieval in software accessible registers and initiating a service processor (SP) for failure data retrieval in non-software accessible registers. The method further includes coordinating the combination of the failure information retrieved and the failure data retrieved in an error log for use in isolation of a fault source in the computer system.
摘要:
A computer system with reboot capability includes a processing mechanism, the processing mechanism supporting an operating system. The system further includes a service processor coupled to the processing mechanism, the service processor determining whether a reboot operation is needed. In addition, the computer system includes a memory mechanism coupled to the processing mechanism and the service processor, the memory mechanism storing a plurality of platform policy parameters and an automatic restart policy of the operating system to support the reboot operation of the service processor.
摘要:
A method, system, and computer program product are disclosed for managing an application's persistent data across multiple different release versions of the application. A format for a memory area is defined wherein the persistent data will be stored according to a first release version of the application. The format is organized to permit the application running at different release versions to access the memory area. The memory area is accessed using the application that is running at a second release version. The memory area is divided into individually accessible sections. The format includes a template file for each section for each release version.
摘要:
The present invention provides a method, apparatus, and computer implemented instructions for managing a set of objects for a plurality of terminals. The set of objects are stored in a memory, such as a nonvolatile random access memory in a data processing system. The set of objects are used to configure logical partitions within the data processing system. Access to the set of objects is provided to the plurality of terminals through a service processor.