摘要:
A method, computer program product, and data processing system for handling errors or other events in a logical partition (LPAR) data processing system is disclosed. When an operating system is initialized in a logical partition, it registers its capabilities for handling particular errors or other events with management software. When an error or other event affecting that logical partition occurs, the management software checks to see if the particular error or event is one that the operating system is capable of handling. If so, the operating system is notified. Otherwise, the management software directs the operating system to take other appropriate action, such as termination of the operating system and/or partition.
摘要:
A method, computer program product, and data processing system for locating hardware faults occurring in multiple devices in a data processing system is disclosed. The devices have a scanning order in which the devices (or at least information regarding the devices) are scanned to analyze any possible error condition. When a new error is detected in a device, an identification of the device is stored in a data structure. If another error is detected and causes the devices to be scanned again, the scanning process will skip over the device whose identity is stored in the data structure so that the new error can be located.
摘要:
A system, method, and computer program product are disclosed for preventing machine crashes due to hard errors in one of multiple, different processors that are included in a logically partitioned data processing system. An error occurring in one of the processors is detected. A determination is then made regarding whether the processor has been deconfigured. The partition is then rebooted only in response to a determination that the processor has been deconfigured and will not be included in the partition processor resources. Thus, only the configured processors are rebooted. The deconfigured processor is not rebooted.
摘要:
An apparatus, program product and method propagate errors detected in an IO fabric element from an IO fabric that is used to couple a plurality of endpoint IO resources to processing elements in a computer. In particular, such errors are propagated to the endpoint IO resources affected by the IO fabric element in connection with recovering from the errors in the IO fabric element. By doing so, a device driver or other program code used to access each affected IO resources may be permitted to asynchronously recover from the propagated error in its associated IO resource, and often without requiring the recovery from the error in the IO fabric element to wait for recovery to be completed for each of the affected IO resources. In addition, an IO fabric may be dynamically configured to support both recoverable and non-recoverable endpoint IO resources. In particular, IO fabric elements within an IO fabric may be dynamically configured to enable machine check signaling in such IO fabric elements in response to detection that an endpoint IO resource is non-recoverable in nature. The IO fabric elements that are dynamically configured as such are disposed within a hardware path that is defined between the non-recoverable resource and a processor that accesses the non-recoverable resource.
摘要:
A method, apparatus, and computer instructions for halting input/output error propagation in the logically partitioned data processing system. All components associated with the bridge are identified to form a set of failed components in response to detecting an error state in a bridge within a set of bridges in the logical partitioned data processing system. An identification of the failed components is stored in which the identification is used by each partition during a boot process.
摘要:
A system, method, and product in a logically partitioned data processing system are disclosed for preserving trace data after a partition crash. The logically partitioned data processing system includes multiple, different processors. An error is encountered in one of the processors. Data associated with the error is stored in a trace buffer. Contents of the trace buffer are stored prior to the data being overwritten.
摘要:
A interrupt is generated for all processors in a multiprocessor system when a critical datapath experiences an error. Serialization code in the interrupt handling routine for that interrupt suspends all processors except one and places the suspended processors in a waiting queue while the one processor handles the error. After the error has been handled, the remaining processors are allow to execute the interrupt handler, which simply exits detecting no error.
摘要:
A method, apparatus, and computer instructions for processing errors in a hierarchical input/output sub-system having an input/output bridge with a plurality of hardware devices in a level below the bridge. A value is read from a selected register to form a read value in response to detecting an error. The selected register is reset. Each bit in the read value associated with the error is cleared to form a cleared value. The cleared value is written into the selected register such that errors occurring since the register was cleared are preserved. The error registers below the bridge are scanned in response to an absence of an error being detected in a bridge within the input/output sub-system. A determination is made as to whether the error has previously occurred in response to a presence of an error is; being found by scanning the registers below the bridge. The error is reported in response to an absence of a determination that the error has previously occurred.
摘要:
A method, system, and computer program product for enforcing logical partitioning of a shared device to which multiple partitions within a data processing system have access is provided. In one embodiment, a firmware portion of the data processing system receives a request from a requesting device, such as a processor assigned to one of a plurality of partitions within the data processing system, to access (i.e., read from or write to) a portion of the shared device, such as an NVRAM. The request includes a virtual address corresponding to the portion of the shared device for which access is desired. If the virtual address is within a range of addresses for which the requesting device is authorized to access, the firmware provides access to the requested portion of the shared device to the requesting device. If the virtual address is not within a range of addresses for which the requesting device is authorized to access, the firmware denies the request.
摘要:
A method, apparatus, and computer implemented instructions for controlling power in a data processing system having a plurality of logical partitions. Responsive to receiving a request to turn off the power for a logical partition within the plurality of logical partitions in the data processing system, a determination is made as to whether an additional partition within the plurality of logical partitions is present in the data processing system. The power is turned off in the data processing system in response to a determination an additional partition within the plurality of logical partitions is absent in the data processing system. The logical partition is shut down in response to a determination that an additional partition within the plurality of logical partitions is present in the data processing system. The mechanism of the present invention also provides for rebooting logical partitions. A request is received to reboot a logical partition within the plurality of logical partitions. A reset signal is activated only for each processor assigned to the logical partition.