Abstract:
An embodiment of a method of checkpointing parallel processes in execution within a plurality of process domains begins with a step of setting communication rules to stop communication between the process domains. Each process domain comprises an execution environment at a user level for at least one of the parallel processes. The method continues with a step of checkpointing each process domain and any in-transit messages. The method concludes with a step of resetting the communication rules to allow the communication between the process domains.
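The three steps above (stop communication, checkpoint each domain together with in-transit messages, re-enable communication) can be sketched as a small simulation. This is only an illustrative model: `ProcessDomain`, `Network`, and `checkpoint_all` are hypothetical names, and the single `blocked` flag stands in for whatever communication rules an actual embodiment would set.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessDomain:
    """Hypothetical stand-in for a user-level execution environment."""
    name: str
    state: dict = field(default_factory=dict)
    inbox: list = field(default_factory=list)


class Network:
    """Toy message fabric; `blocked` models the communication rules (assumption)."""

    def __init__(self):
        self.blocked = False
        self.in_transit = []  # (destination name, payload) pairs not yet delivered

    def send(self, dst, payload):
        self.in_transit.append((dst.name, payload))

    def deliver(self, domains):
        if self.blocked:
            return  # rules in force: nothing crosses domain boundaries
        by_name = {d.name: d for d in domains}
        while self.in_transit:
            dst, payload = self.in_transit.pop(0)
            by_name[dst].inbox.append(payload)


def checkpoint_all(network, domains):
    # Step 1: set the rules so that communication between domains stops.
    network.blocked = True
    try:
        # Step 2: checkpoint every domain and any in-transit messages.
        snapshot = {
            "domains": {
                d.name: {"state": dict(d.state), "inbox": list(d.inbox)}
                for d in domains
            },
            "in_transit": list(network.in_transit),
        }
    finally:
        # Step 3: reset the rules so that communication resumes.
        network.blocked = False
    return snapshot
```

Because the messages still queued in `network.in_transit` are copied into the snapshot, they are not lost if the domains are later restored from it.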
Abstract:
A method of checkpointing and restarting processes that share an open file begins with a step of assigning a priority to one of the processes that share the file, which identifies a priority process. The method concludes with a step of reopening the file when restoring the priority process.
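A minimal sketch of the idea follows. The abstract states only that one sharing process is given priority and that the file is reopened when that process is restored. Several details here are assumptions made for illustration: the lowest-pid selection rule, the fact that the other processes reuse the descriptor rather than reopening the file, and the function names `assign_priority` and `restore`.

```python
import os


def assign_priority(pids):
    """Pick the priority process. Choosing the lowest pid is an arbitrary, assumed rule."""
    return min(pids)


def restore(pids, priority_pid, path, offset):
    """Return a pid -> file descriptor table for the restored processes.

    The file is opened exactly once, while restoring the priority process.
    Every other process is handed that same descriptor (assumption).
    """
    shared_fd = None
    table = {}
    # Restore the priority process first so the descriptor exists for the rest.
    for pid in sorted(pids, key=lambda p: p != priority_pid):
        if pid == priority_pid:
            shared_fd = os.open(path, os.O_RDWR)
            os.lseek(shared_fd, offset, os.SEEK_SET)  # recover the saved file offset
        table[pid] = shared_fd
    return table
```

Opening the file only once keeps a single shared offset, which mirrors how processes that inherit one open file description behave before the checkpoint.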
Abstract:
Provided are a method, system, and article of manufacture for monitoring patterns of processes accessing addresses in a storage device to determine access parameters to apply. Processes accessing addresses of data in a storage device are monitored. The processes are granted access to the addresses according to first access parameters that indicate how to arbitrate access by processes to the addresses. A condition occurring in response to a pattern of processes accessing addresses is detected. A determination is made of one of the processes in the pattern and the address accessed by the determined process. Indication is made that second access parameters apply for the determined address. The second access parameters are used to grant access to the determined address for subsequent accesses of the indicated address.
Abstract:
Provided are a method, system, and article of manufacture for checkpointing and restoring user space data structures used by an application accessing a data structure maintained by an operating system for an executing application. Information in the accessed data structure is saved with checkpoint information for the application. An operation to restore the application from the checkpoint information is initialized. A restored data structure is generated to include the saved information in the accessed data structure saved in the checkpoint information in response to restoring the application. An initialization routine of the application is modified to bypass initializing the data structure as part of the application initialization routine to restore the application.
Abstract:
Provided are a method, system, and article of manufacture for recovery of application faults in a mirrored application environment. Application events are recorded at a primary system executing an instruction for an application. The recorded events are transferred to a buffer. The recorded events are transferred from the buffer to a secondary system, wherein the secondary system implements processes indicated in the recorded events to execute the instructions indicated in the events. An error is detected at the primary system. A determination is made of a primary order in which the events are executed by processes in the primary system. A determination is made of a modified order of the execution of the events comprising a different order of executing the events than the primary order in response to detecting the error. The secondary system processes execute the instructions indicated in the recorded events according to the modified order.
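One way to picture the "modified order" is as an alternative interleaving of the recorded events. The sketch below is an assumption, not the claimed method: it represents each event as a `(process_id, instruction)` tuple and derives a different order by swapping the last adjacent pair of events that belong to different processes, which leaves each process's own event sequence unchanged. The name `modified_order` is hypothetical.

```python
def modified_order(primary_order):
    """Return an interleaving that differs from the primary order.

    `primary_order` is a list of (process_id, instruction) tuples in the
    order the primary system executed them. Swapping two adjacent events
    from different processes changes the global order while keeping each
    process's own events in their original relative order.
    """
    order = list(primary_order)
    for i in range(len(order) - 1, 0, -1):
        if order[i][0] != order[i - 1][0]:
            order[i - 1], order[i] = order[i], order[i - 1]
            return order
    return order  # only one process recorded: no alternative interleaving exists
```

For example, if the primary system failed after two processes acquired two locks in conflicting order, the secondary system could replay the same events under the swapped interleaving instead.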
Abstract:
An embodiment of a method of restoring a communication state of a process includes creating a new socket for a socket saved as part of a checkpoint of the communication state. The new socket is initialized with an adjusted transmission control protocol state saved as part of the checkpoint. The adjusted transmission control protocol state indicates that a send buffer and a receive buffer are empty. Send data saved as part of the checkpoint is written into the new socket. Receive data saved as part of the checkpoint is written into a restart buffer. While at least a portion of the receive data remains in the restart buffer, a socket read system call for the new socket is redirected to read the receive data that remains in the restart buffer.
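The read redirection described above can be sketched in user space as follows. This is an illustrative model only: `RestoredSocket` is a hypothetical wrapper, and `live_recv` is an assumed callable standing in for the real socket read system call on the new socket.

```python
class RestoredSocket:
    """Serve checkpointed receive data before any data from the new socket."""

    def __init__(self, receive_data, live_recv):
        # Receive data saved in the checkpoint goes into the restart buffer.
        self._restart_buffer = bytearray(receive_data)
        # Assumed callable(n) -> bytes that reads from the new socket.
        self._live_recv = live_recv

    def recv(self, n):
        # While any saved data remains, the read is redirected to the
        # restart buffer. A short read is possible, as on a stream socket.
        if self._restart_buffer:
            chunk = bytes(self._restart_buffer[:n])
            del self._restart_buffer[:n]
            return chunk
        # Restart buffer drained: fall through to the new socket.
        return self._live_recv(n)
```

Draining the restart buffer first preserves byte order: the application sees the data it had not yet consumed at checkpoint time before anything that arrives after the restart.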
Abstract:
A method, computer program, and system are provided for controlling accesses to memory by threads created by a process executing on a multiprocessor computer. A page table structure is allocated for each new thread and copied from the existing threads. Page access is controlled by a present bit and a writable bit. Upon a page fault, access is provided to one thread. The kernel handles the creation of new page entries and sets the page present bits to zero, which causes page faults. In a second embodiment, two page table structures are created: one for the thread that has access to the address space, and another shared by all the other threads that do not have access to the address space.
Abstract:
A method for replicating a program and data storage according to one embodiment comprises sending program replication data from a first program to a second program, the second program having an application program that is a replica of an application program of the first program; sending data storage requests from the first program to a first storage system; and replicating data stored in the first storage system in a second storage system. Additional methods, systems, and computer program products are disclosed.
Abstract:
A system and method for replication of network state for transparent recovery of network connections are provided. The system and method avoid having to identify and intercept the internal non-deterministic events of a network stack by adopting a state-capture approach. This state-capture approach views the network state of the primary and replica application instances from the viewpoint of an external client. In this way, only network state changes of the primary application instance that are communicated to an external client need to be replicated at the replica application instance. Other network state changes, e.g., internal network state changes, that are not communicated to the external client need not be replicated at the replica application instance. In other words, the illustrative embodiments permit differences in internal network state for those network states that are not made public to the external world, i.e. outside the application instance.
Abstract:
An embodiment of a method of checkpointing a virtual memory for a process comprises: accessing a page table that correlates logical addresses for the process to physical locations; saving memory resident pages identified for the process from the page table; and saving disk swap pages identified for the process from the page table, the step of saving disk swap pages being performed after the step of saving the memory resident pages.
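The ordering constraint above (memory-resident pages first, disk swap pages afterwards) can be sketched as follows. The `PageTableEntry` layout and the `read_frame` and `read_swap` callables are assumptions introduced for illustration; a real page table and its accessors would look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageTableEntry:
    """Hypothetical page table entry mapping a logical page to a physical location."""
    logical_page: int
    resident: bool   # True if the page is in memory, False if it is in swap
    location: int    # frame number when resident, swap slot otherwise


def checkpoint(page_table, read_frame, read_swap):
    """Save resident pages first, then swapped pages.

    `read_frame` and `read_swap` are assumed callables that return the
    page contents for a frame number or a swap slot, respectively.
    """
    saved = []
    # First pass: pages the page table identifies as memory resident.
    for entry in page_table:
        if entry.resident:
            saved.append((entry.logical_page, read_frame(entry.location)))
    # Second pass: pages the page table identifies as swapped to disk.
    for entry in page_table:
        if not entry.resident:
            saved.append((entry.logical_page, read_swap(entry.location)))
    return saved
```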