摘要:
A method, apparatus, system, and computer-readable storage medium that, in an embodiment, set uncorrectable error indicators in logical memory blocks in response to detecting an uncorrectable error in memory pages associated with the logical memory blocks. If the logical memory block is allocated to a hypervisor, the memory page may be deallocated in response to detection of the uncorrectable error. When an IPL of a partition is subsequently performed, a determination is made whether a logical memory block allocated to the partition previously encountered the uncorrectable error via the uncorrectable error indicator. If the logical memory block did previously encounter the uncorrectable error, the logical memory block is deallocated from the partition. In an embodiment, if spare memory exists, the logical memory block with the previously encountered uncorrectable error is replaced with the spare memory and the IPL of the partition is continued with the spare memory.
摘要:
A method, apparatus, system, and signal-bearing medium that, in an embodiment, set uncorrectable error indicators in logical memory blocks in response to detecting an uncorrectable error in memory pages associated with the logical memory blocks. If the logical memory block is allocated to a hypervisor, the memory page may be deallocated in response to detection of the uncorrectable error. When an IPL of a partition is subsequently performed, a determination is made whether a logical memory block allocated to the partition previously encountered the uncorrectable error via the uncorrectable error indicator. If the logical memory block did previously encounter the uncorrectable error, the logical memory block is deallocated from the partition. In an embodiment, if spare memory exists, the logical memory block with the previously encountered uncorrectable error is replaced with the spare memory and the IPL of the partition is continued with the spare memory. If spare memory does not exist, the IPL of the partition is continued without the logical memory block that previously encountered the uncorrectable error. This allows a partition to IPL if it had not been able to because of a persistent uncorrectable error in its IPL path.
摘要:
The present invention provides an improved method, an system, and a set of computer implemented instructions for handling a cache containing multiple single-bit hard errors on multiple addresses within a data processing system. Such handles will prevent any down time by logging in the parts to be replaced by an operator when certain level of bit errors is reached. When a hard error exists on a cache address for the first time, serviceable first hard error, that cache line is deleted. Thus the damaged memory device is no longer used by the system. As a result, the system is running with “N−x” lines wherein “N” constitutes the total number of existing lines and “x” is less than “N”. An alternative method is to exchange the damaged memory device to a spare memory device. In order to provide such services, the system must first differentiate whether an error is a soft or hard error.
摘要:
A method and apparatus for coordinating dynamic memory page deallocation with a redundant bit line steering mechanism are provided. With the method and apparatus, memory scrubbing and redundant bit line steering operations are performed in parallel with handling of notifications of runtime correctable errors. When a correctable error is encountered during runtime, and the correctable error is determined to be persistent, then dynamic memory page deallocation is requested of a hypervisor. The determination of persistence is based on a history CE table that is populated by the operation of the memory scrubbing and redundant bit line steering mechanism of a service processor. Thus, only those correctable errors that persist for longer than one memory scrubbing cycle are subject to memory page deallocation.
摘要:
A method, apparatus and program product improve computer reliability by, in part, identifying a plurality of error occurrences from Error Correction Codes. It may then be determined if the plurality of error occurrences are associated with a single bit of a bus. The determined, single bit may correspond to a faulty component of the bus. This level of identification efficiently addresses problems. For instance, a corrective algorithm may be applied if the plurality of error occurrences are associated with the single bit. Alternatively, the bus may be disabled if the plurality of error occurrences are not associated with the single bit of the bus. In this manner, implementations may detect, identify and act in response to multiple failure modes.
摘要:
A method, apparatus and program product improve computer reliability by, in part, identifying a plurality of error occurrences from Error Correction Codes. It may then be determined if the plurality of error occurrences are associated with a single bit of a bus. The determined, single bit may correspond to a faulty component of the bus. This level of identification efficiently addresses problems. For instance, a corrective algorithm may be applied if the plurality of error occurrences are associated with the single bit. Alternatively, the bus may be disabled if the plurality of error occurrences are not associated with the single bit of the bus. In this manner, implementations may detect, identify and act in response to multiple failure modes.
摘要:
A method, apparatus and computer program product are provided for implementing uncorrectable error isolation in a computer system while the system continues to run. A memory controller performs data fetching from a system memory, capturing error information, and responsive to detecting an uncorrectable error, generates a predefined attention to a service processor. The service processor utilizing a processor runtime diagnostic (PRD) program, reads the captured error data and identifies a memory extent with the uncorrectable error. Then the memory controller performs accelerated scrubbing of the identified memory extent with the uncorrectable error, capturing error information and responsive to a scrub correctable error threshold being exceeded, sends a predefined scrub threshold exceeded attention to the service processor. The service processor reads the captured error data and identifies a failed memory chip.