Abstract:
A system and method for fast system recovery that bypasses diagnostic routines by disconnecting failed hardware from the system before rebooting. Failed hardware, together with any hardware that would be affected by its removal, is disconnected from the system. The system is then restarted; because the failed hardware is disconnected, diagnostic routines can safely be omitted from the reboot process.
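The recovery flow described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how affected hardware is identified, so the `dependents` map, the component names, and both function names are hypothetical.

```python
from collections import deque


def components_to_disconnect(failed, dependents):
    """Return the failed components plus everything that depends on them.

    `dependents` is a hypothetical map: component -> components that stop
    working if it is removed. The traversal is a plain breadth-first search.
    """
    to_remove = set(failed)
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in to_remove:
                to_remove.add(child)
                queue.append(child)
    return to_remove


def plan_fast_reboot(all_components, failed, dependents):
    """Build a reboot plan that skips diagnostics once failed parts are gone."""
    removed = components_to_disconnect(failed, dependents)
    active = [c for c in all_components if c not in removed]
    return {
        "disconnected": sorted(removed),
        "active": active,
        # Assumption: diagnostics are skipped only because every failed or
        # affected component has already been disconnected.
        "run_diagnostics": False,
    }
```

For example, if a hypothetical `bus0` fails and `nic0` and `disk0` hang off it, the plan disconnects all three and restarts with the remaining components.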
Abstract:
A method, apparatus, and computer instructions for processing trace data in a logical partitioned data processing system. A partition causing an exception is identified in response to detecting the exception. The partition is one within a set of partitions in the logical partitioned data processing system. The trace data for the identified partition is stored in an error log or other data structure for a machine check interrupt handler.
Abstract:
Aspects for detecting environmental faults in redundant components of a computer system are described. In an exemplary method aspect, the method includes monitoring system environment conditions, including a status for redundant power supply and cooling components. The method further includes registering a failure condition with an appropriate error type when a monitored system environment condition exceeds a design threshold, and utilizing the registered failure condition as data in an architected error log.
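A minimal sketch of the monitoring step, assuming each reading carries its own design threshold. The `Reading` fields, the `ERROR_TYPES` labels, and the log entry layout are hypothetical; the abstract does not define the architected error log format.

```python
from dataclasses import dataclass

# Hypothetical error-type labels; the abstract only says "an appropriate error type".
ERROR_TYPES = {"power": "REDUNDANT_POWER_FAULT", "cooling": "REDUNDANT_COOLING_FAULT"}


@dataclass
class Reading:
    component: str  # e.g. "fan1" or "psu0" (illustrative names)
    kind: str       # "power" or "cooling"
    value: float    # measured value
    limit: float    # design threshold for this component


def register_failures(readings, error_log):
    """Append one log entry for every reading that exceeds its design threshold."""
    for r in readings:
        if r.value > r.limit:
            error_log.append({
                "component": r.component,
                "error_type": ERROR_TYPES[r.kind],
                "value": r.value,
                "limit": r.limit,
            })
    return error_log
```

Readings at or below the threshold produce no entry, so the log holds only registered failure conditions.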
Abstract:
The present invention provides method and system aspects for performing error data gathering from fault isolation registers of a computer system following a machine check occurrence. A method aspect includes utilizing firmware to perform failure information retrieval in software accessible registers and initiating a service processor (SP) for failure data retrieval in non-software accessible registers. The method further includes coordinating the combination of the failure information retrieved and the failure data retrieved in an error log for use in isolation of a fault source in the computer system.
Abstract:
A computer system with reboot capability includes a processing mechanism, the processing mechanism supporting an operating system. The system further includes a service processor coupled to the processing mechanism, the service processor determining whether a reboot operation is needed. In addition, the computer system includes a memory mechanism coupled to the processing mechanism and the service processor, the memory mechanism storing a plurality of platform policy parameters and an automatic restart policy of the operating system to support the reboot operation of the service processor.
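One way the service processor's decision could combine the stored policies is sketched below. The abstract does not list the platform policy parameters, so `reboot_enabled`, `max_reboot_attempts`, and the decision rule itself are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlatformPolicy:
    reboot_enabled: bool      # hypothetical platform policy parameter
    max_reboot_attempts: int  # hypothetical platform policy parameter


def reboot_needed(os_failed, platform, os_auto_restart, attempts_so_far):
    """Decide whether the service processor should start a reboot.

    Assumed rule: reboot only when the operating system has failed, both the
    platform policy and the operating system's automatic restart policy allow
    it, and the attempt limit has not been reached.
    """
    if not os_failed:
        return False
    if not (platform.reboot_enabled and os_auto_restart):
        return False
    return attempts_so_far < platform.max_reboot_attempts
```

Keeping both policies in memory that the service processor can read lets it make this decision without help from the failed operating system.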
Abstract:
A method, apparatus, and computer instructions for preserving trace data in a logical partitioned data processing system. A call is received from a partition in a plurality of partitions to register a buffer in the partition for the trace data. The call includes a pointer to the buffer. The buffer is associated with a trace routine in platform firmware. The trace routine stores the trace data for calls made by the partition to the platform firmware in the buffer.
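The registration scheme can be illustrated with a small model. This is a sketch under assumptions: `PlatformFirmware`, its method names, and the use of a bounded `deque` as the trace buffer are hypothetical, and a Python object reference stands in for the pointer passed in the registration call.

```python
from collections import deque


class PlatformFirmware:
    """Toy model of platform firmware that traces calls per partition."""

    def __init__(self):
        # partition id -> buffer owned by that partition
        self._trace_buffers = {}

    def register_trace_buffer(self, partition_id, buffer):
        """Associate the partition's buffer with the trace routine."""
        self._trace_buffers[partition_id] = buffer

    def call(self, partition_id, name, *args):
        """Handle a firmware call and record it in the caller's buffer, if any."""
        buffer = self._trace_buffers.get(partition_id)
        if buffer is not None:
            buffer.append((name, args))
        return 0  # placeholder status code
```

Because the buffer lives in the partition, a bounded buffer such as `deque(maxlen=2)` keeps only the most recent calls made by that partition, and calls from unregistered partitions are not traced.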
Abstract:
A system, method, and computer program product are disclosed for preventing machine crashes due to hard errors in one of multiple, different processors that are included in a logically partitioned data processing system. An error occurring in one of the processors is detected. A determination is then made regarding whether the processor has been deconfigured. The partition is then rebooted only in response to a determination that the processor has been deconfigured and will not be included in the partition processor resources. Thus, only the configured processors are rebooted. The deconfigured processor is not rebooted.
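The reboot condition above can be expressed compactly. In this sketch the function name, the processor identifiers, and the use of `None` to signal "do not reboot" are assumptions, not details from the abstract.

```python
def reboot_partition(partition_cpus, failed_cpu, deconfigured):
    """Return the processors to reboot, or None if the reboot must not happen.

    The partition is rebooted only once the failed processor is confirmed
    deconfigured; deconfigured processors are left out of the reboot.
    """
    if failed_cpu not in deconfigured:
        # Rebooting now would bring the faulty processor back and risk a crash.
        return None
    return [cpu for cpu in partition_cpus if cpu not in deconfigured]
```

With processors `["cpu0", "cpu1", "cpu2"]` and a failed `cpu1`, the reboot proceeds on `cpu0` and `cpu2` only after `cpu1` appears in the deconfigured set.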