Abstract:
Provided are a method, system, and program for processing complexes to access shared devices. A lock to a plurality of shared devices is maintained and is accessible to first and second processing complexes. The first processing complex determines a first delay time and the second processing complex determines a second delay time. The first processing complex issues a request for the lock in response to expiration of the first delay time, and the second processing complex issues a request for the lock in response to expiration of the second delay time.
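The delayed lock requests described above can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions: the abstract does not say how each delay time is determined, so the delays are passed in directly, and the function name `simulate_lock_requests` and the complex names are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple


def simulate_lock_requests(delays: Dict[str, float]) -> Tuple[Optional[str], List[str]]:
    """Simulate complexes that each request a shared lock when their delay expires.

    `delays` maps a (hypothetical) complex name to its delay time.
    Returns the complex that obtains the lock and the order of the requests.
    """
    # Each event is (expiration time, complex name); the heap pops the earliest first.
    events = [(delay, name) for name, delay in delays.items()]
    heapq.heapify(events)

    owner: Optional[str] = None
    request_order: List[str] = []
    while events:
        _, name = heapq.heappop(events)
        request_order.append(name)  # the complex issues its lock request now
        if owner is None:
            owner = name            # the lock is free, so this request succeeds
    return owner, request_order
```

With distinct delay times the two complexes do not request the lock at the same instant, so the complex with the shorter delay obtains it first.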
Abstract:
A method is disclosed to adjust error thresholds in a data storage and retrieval system. The method supplies a data storage and retrieval system comprising memory and microcode, wherein that microcode comprises one or more default error thresholds. The method determines if the memory comprises one or more operational error thresholds. If the method determines that the memory comprises one or more operational error thresholds, then the method operates the data storage and retrieval system using those one or more operational error thresholds. Alternatively, if the method determines that the memory does not comprise one or more operational error thresholds, then the method sets the one or more default error thresholds as the one or more operational error thresholds.
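The fallback from operational to default thresholds might look like the following minimal sketch. The dictionary standing in for memory, the key `operational_thresholds`, and the threshold names and values are all illustrative assumptions, not details from the abstract.

```python
from typing import Dict

# Hypothetical default thresholds standing in for the values carried in microcode.
DEFAULT_THRESHOLDS: Dict[str, int] = {"read_errors": 10, "write_errors": 5}


def select_operational_thresholds(memory: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return the thresholds to operate with.

    If memory already holds operational thresholds, use them. Otherwise copy
    the defaults into memory so they become the operational thresholds.
    """
    operational = memory.get("operational_thresholds")
    if operational:
        return operational
    memory["operational_thresholds"] = dict(DEFAULT_THRESHOLDS)
    return memory["operational_thresholds"]
```

Copying the defaults, rather than sharing the same dictionary, keeps later adjustments to the operational thresholds from altering the defaults.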
Abstract:
An apparatus, system, and method are disclosed for autonomously overriding a global resource lock. The apparatus includes a determination module, an override module, and an assertion module. The determination module determines whether a global resource lock is owned by a peer resource controller and, if so, whether the peer resource controller is offline. The override module atomically overrides ownership of the global resource lock from the peer resource controller. The assertion module asserts active ownership of the global resource lock. The apparatus, system, and method provide an autonomous override of the global resource lock, minimizing system downtime and user intervention.
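The three steps (determine, override, assert) can be sketched as follows. This is an illustration only: the class `GlobalLock`, the function `try_override`, and the `peer_is_offline` callback are hypothetical names, and a `threading.Lock` is used here to emulate the atomic compare-and-swap that real hardware or firmware would provide.

```python
import threading
from typing import Callable, Optional


class GlobalLock:
    """A global resource lock whose owner can be swapped atomically."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # emulates atomicity for this sketch
        self.owner: Optional[str] = None

    def compare_and_swap(self, expected: Optional[str], new: str) -> bool:
        """Set the owner to `new` only if it is currently `expected`."""
        with self._guard:
            if self.owner == expected:
                self.owner = new
                return True
            return False


def try_override(lock: GlobalLock, me: str, peer: str,
                 peer_is_offline: Callable[[str], bool]) -> bool:
    # Determination: the peer must own the lock and must be offline.
    if lock.owner != peer or not peer_is_offline(peer):
        return False
    # Override: take ownership atomically; fails if the owner changed meanwhile.
    if not lock.compare_and_swap(peer, me):
        return False
    # Assertion: confirm that this controller is now the active owner.
    return lock.owner == me
```

The compare-and-swap guards against the case where ownership changes between the determination step and the override step.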
Abstract:
A computer system including an error recovery system establishes an error threshold inversely proportional to the number of system resources of a like kind, such as host adapters. When a host adapter is initialized or deactivated, a software subcomponent of a processing device calculates a new threshold number and writes it to a memory location associated with each host adapter. When the number of errors exceeds the threshold number, the host adapter is reset, quiesced for repair, or fenced for replacement.
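A threshold that is inversely proportional to the adapter count can be sketched as a constant divided by the number of active adapters. The constant `budget`, its value of 120, the floor of 1, and the function names below are assumptions made for illustration; the abstract does not give the actual formula.

```python
from typing import Dict


def error_threshold(active_adapters: int, budget: int = 120) -> int:
    """Return a threshold inversely proportional to the number of active adapters.

    `budget` is a hypothetical proportionality constant. The result is floored
    at 1 so that a threshold always exists.
    """
    if active_adapters < 1:
        raise ValueError("at least one active adapter is required")
    return max(1, budget // active_adapters)


def recalculate_thresholds(adapter_memory: Dict[str, Dict[str, int]]) -> None:
    """Write the new threshold to the memory location of each adapter.

    Intended to be called whenever an adapter is initialized or deactivated.
    """
    threshold = error_threshold(len(adapter_memory))
    for location in adapter_memory.values():
        location["threshold"] = threshold
```

Under this sketch, a system with few adapters tolerates more errors per adapter before taking one out of service, while a system with many adapters can afford a lower threshold.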
Abstract:
A computer system including a communication fabric initiates a forced diagnostic to isolate and identify genuine error conditions, which are distinguished from sympathetic error conditions. Error counters are incremented only for genuine error conditions, precluding the need to set error counter thresholds artificially high. Recovery events are logged in a recovery table, and recovery actions are initiated only after the diagnostic process is complete. This prevents duplication of recovery actions and the unnecessary implementation of low-level recovery actions when they will be followed by higher-level recovery actions.
Abstract:
An apparatus, method, and system associate an identifier with a data packet. The identifier uniquely identifies a communication module, such as a host interface card, within a data storage system. In operation, a computer host sends a data packet to a server. The communication module receives the data packet and associates an identifier, unique to the communication module, with the data packet. The data packet is stored in a disk array, such as a Redundant Array of Independent Disks (RAID) system. When the computer host later requests the stored data packet, a validation module, which may be implemented within a PCI adapter such as a host interface card, retrieves the data packet and determines whether the data packet is corrupt. If the data packet is corrupt, the validation module uses the unique identifier associated with the data packet to identify which host interface card corrupted the data. The faulty communication module may then be removed from operation in the data storage system.
Abstract:
An apparatus, system, and method are disclosed for facilitating monitoring and responding to error events. An apparatus may include a set of counters associated with a processing system resource, each counter associated with an error event and having attributes defining a count value, counter thresholds directly related to time, and empirical status information for the error event related to time. A user may adjust counter thresholds indirectly to set an error tolerance. An update module may update counters within the set based on an error event for the processing system resource. A management module persists and maintains a life cycle for counters based on counter attributes. Each counter may be one of two types: a fixed counter that counts error events from a start time for a defined duration, or a sliding counter that counts error events up to a predefined number of error events within a window of time.
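The two counter types can be contrasted in a short sketch. The class names, the `record` method, and the use of plain numbers as timestamps are assumptions for illustration; the abstract does not define the counters' attributes at this level of detail.

```python
from collections import deque


class FixedCounter:
    """Counts error events from a start time for a defined duration."""

    def __init__(self, start: float, duration: float, threshold: int) -> None:
        self.start = start
        self.duration = duration
        self.threshold = threshold
        self.count = 0

    def record(self, timestamp: float) -> bool:
        """Count the event if it falls inside the fixed interval; report if tripped."""
        if self.start <= timestamp < self.start + self.duration:
            self.count += 1
        return self.count >= self.threshold


class SlidingCounter:
    """Trips when a predefined number of events occur within a window of time."""

    def __init__(self, window: float, max_events: int) -> None:
        self.window = window
        self.max_events = max_events
        # Only the most recent `max_events` timestamps are retained.
        self.timestamps = deque(maxlen=max_events)

    def record(self, timestamp: float) -> bool:
        """Record the event; report whether the last `max_events` fit in the window."""
        self.timestamps.append(timestamp)
        full = len(self.timestamps) == self.max_events
        return full and timestamp - self.timestamps[0] <= self.window
```

The fixed counter ignores events outside its interval, whereas the sliding counter has no fixed start and reacts to any sufficiently dense burst of events.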
Abstract:
A computer system includes a communication adapter that connects a plurality of virtualized servers to one or more support system devices. The communication adapter includes a master lock register, a processing device, a queue, and a multitude of adapter access registers. Upon initialization, a virtual server asserts ownership over the communication adapter by writing its identification into the master lock register, if the register is empty. Service requests by images are transmitted to the communication adapter with an origination identification (“ID”). This ID is placed in one of the adapter access registers and the service request is placed in the queue. When a support system device responds to the service request, the response is married to the ID and broadcast back to all connected virtualized servers.
Abstract:
An apparatus, system, and method are disclosed for data tracking and, in particular, for facilitating failure management within an electronic data communication system. The apparatus includes a tracking module and an error analysis module. The tracking module stores an adapter identifier in a tracking array. The adapter identifier corresponds to a source adapter from which data is received. The error analysis module determines a source of a data failure in response to recognition of the data failure. The data failure may occur on a host adapter, a device adapter, a communication fabric, a multi-processor, or another communication device. The apparatus, system, and method may be implemented in place of or in addition to hardware-assisted data integrity checking within a data storage system.
Abstract:
A computer system includes a support system that reports events, faults, and failures to a master virtual server. While the support system may be accessed and used by a multitude of virtual servers, only the master virtual server can manage the support system. The support system includes a master lock register, a heartbeat timer, and a digital processing device (“processor”). Upon initialization, and if the master lock register is empty, a virtual server asserts ownership over the support system by writing its identification into the master lock register, becoming the master virtual server. The master virtual server transmits periodic heartbeats to the support system to communicate that it is still viable and in control. If the heartbeat timer expires without communication from the master virtual server, the processor clears the master lock register and transmits a broadcast message inviting all connected virtual servers to attempt to assert control.
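The interaction between the master lock register and the heartbeat timer can be sketched as a small state machine. The class and method names, the explicit `now` arguments (used instead of a real clock so the behavior is reproducible), and the broadcast text are all assumptions for illustration.

```python
from typing import List, Optional


class SupportSystem:
    """Models a master lock register guarded by a heartbeat timer."""

    def __init__(self, heartbeat_timeout: float) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.master_lock: Optional[str] = None  # the master lock register
        self.last_heartbeat: float = 0.0
        self.broadcasts: List[str] = []          # messages sent to all servers

    def assert_ownership(self, server_id: str, now: float) -> bool:
        """A server becomes master only if the register is empty."""
        if self.master_lock is None:
            self.master_lock = server_id
            self.last_heartbeat = now
            return True
        return False

    def heartbeat(self, server_id: str, now: float) -> bool:
        """Only the current master can restart the heartbeat timer."""
        if server_id == self.master_lock:
            self.last_heartbeat = now
            return True
        return False

    def tick(self, now: float) -> None:
        """Clear the register and invite new masters if the timer has expired."""
        expired = now - self.last_heartbeat > self.heartbeat_timeout
        if self.master_lock is not None and expired:
            self.master_lock = None
            self.broadcasts.append("master lock cleared; servers may assert control")
```

After the register is cleared, the first virtual server to call `assert_ownership` becomes the new master, mirroring the initialization path.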