Abstract:
The notion of controlling, using and monitoring remote resources in a distributed data processing system through the use of proxy resource managers and agents is extended to provide failover capability so that resource coverage is preserved and maintained even in the event of either temporary or longer duration node failure. Mechanisms are provided for consistent determination of resource status. Mechanisms are also provided which facilitate the joining of nodes to a group of nodes while still preserving remote resource operations. Additional mechanisms are also provided for the return of remote resource management to the control of a previously failed, but now recovered node, even if the failure had resulted in a node reset.
Abstract:
In a multi-node information processing system, a method for scheduling jobs includes the steps of: determining node-related performance parameters for a plurality of nodes; determining a ranking for each node based on its node-related performance parameters; and ordering the nodes by ranking for job scheduling.
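The three steps above (collect per-node parameters, compute a ranking, order the nodes) can be sketched as follows. This is a minimal illustration, not the claimed method: the abstract does not name the performance parameters or the ranking formula, so the `Node` fields, the weights in `rank`, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_load: float     # hypothetical parameter: fraction in [0, 1], lower is better
    free_mem_gb: float  # hypothetical parameter: higher is better
    failures: int       # hypothetical parameter: recent failure count, lower is better


def rank(node: Node) -> float:
    """Hypothetical weighted score; a higher score means a more attractive node."""
    return (1.0 - node.cpu_load) * 10.0 + node.free_mem_gb * 0.5 - node.failures * 2.0


def order_nodes(nodes: list[Node]) -> list[Node]:
    """Order nodes from best to worst ranking for the job scheduler to consume."""
    return sorted(nodes, key=rank, reverse=True)
```

A scheduler built on this sketch would hand the next job to the first node in the returned list; any real system would substitute its own measured parameters and weighting.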
Abstract:
Techniques for generating a target process are provided. The techniques include identifying at least one of one or more steps and one or more artifacts within a target process and one or more other processes, pre-fetching the at least one of one or more atomic steps, one or more decision steps and splits and one or more merges to be used in the target process from the one or more other processes, and associating the at least one of one or more atomic steps, one or more decision steps and splits and one or more merges to be used in the target process at one or more decision points to generate the target process.
Abstract:
A method, apparatus and computer program product are configured to perform computer monitoring activities; to collect information regarding computer system status during the computer monitoring activities; to detect a problem in dependence on the information collected during the computer monitoring activities; and to determine whether to launch a diagnostic probe when the problem is detected. The monitoring activities may be performed on a periodic or event-driven basis. The determination whether to launch a diagnostic probe is based on a rule included in a hierarchy of rules. The hierarchy of rules is based on problem tickets, system logs, and computer system configuration information.
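One way to picture the probe decision is a tree of rules in which a more specific child rule overrides its parent. The sketch below is an assumption about how such a hierarchy might be evaluated; the abstract does not describe the rule format, so `Rule`, `decide`, and the status keys are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over collected status information
    launch_probe: bool               # decision if this is the most specific match
    children: list = field(default_factory=list)


def decide(rule: Rule, status: dict) -> Optional[bool]:
    """Return the decision of the most specific matching rule, or None if none match."""
    if not rule.matches(status):
        return None
    for child in rule.children:
        result = decide(child, status)
        if result is not None:
            return result  # a matching child overrides its parent
    return rule.launch_probe


# Hypothetical hierarchy: probe on high CPU, except while a backup is running.
root = Rule(
    "cpu_high",
    lambda s: s.get("cpu", 0.0) > 0.9,
    True,
    [Rule("backup_window", lambda s: s.get("backup_running", False), False)],
)
```

In this sketch the rules are written by hand; the abstract states that the hierarchy is derived from problem tickets, system logs, and configuration information, which is not modeled here.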
Abstract:
A system for predicting an occurrence of a critical event in a computer cluster includes a control system that includes an event log, a system parameter log, a memory for storing information related to occurrences of critical events, and a processor. The processor implements a hybrid prediction system; loads the information from the event log and the system performance log into a Bayesian network model; uses the Bayesian network model to predict a future critical event; makes future scheduling and current data migration selections; and adapts the Bayesian network model by feeding the scheduling and data migration selections.
Abstract:
A plurality of base templates are generated. Each of the base templates models a corresponding process. A plurality of instances of each of the base templates are instantiated. Each of the plurality of instances corresponds to an application of the corresponding process to a particular environment. Each of the instances of each of the base templates is annotated, based, in each case, upon observation of functioning of the instance in the particular environment.
Abstract:
A hybrid method of predicting the occurrence of future critical events in a computer cluster having a series of nodes records system performance parameters and the occurrence of past critical events. A data filter filters the logged data to eliminate redundancies and decrease the data storage requirements of the system. Time-series models and rule-based classification schemes are used to associate various system parameters with the past occurrence of critical events and predict the occurrence of future critical events. Ongoing processing jobs are migrated to nodes for which no critical events are predicted and future jobs are routed to more robust nodes.
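Two of the steps above, filtering redundant log entries and migrating jobs away from at-risk nodes, can be sketched as follows. The prediction step itself (time-series models and rule-based classification) is not shown; the `at_risk` set is assumed to come from such a predictor. The function names, the event tuple layout, and the 60-second window are hypothetical choices for illustration.

```python
from itertools import cycle


def filter_redundant(events, window=60):
    """Drop an event if the same (node, kind) was already kept less than `window` seconds earlier.

    `events` is an iterable of (timestamp_seconds, node, kind) tuples.
    """
    last_kept = {}
    kept = []
    for ts, node, kind in sorted(events):
        key = (node, kind)
        if key in last_kept and ts - last_kept[key] < window:
            continue  # redundant repeat of a recently logged event
        last_kept[key] = ts
        kept.append((ts, node, kind))
    return kept


def migrate(placement, nodes, at_risk):
    """Reassign jobs on at-risk nodes, round-robin, to nodes with no predicted critical event."""
    safe = [n for n in nodes if n not in at_risk]
    if not safe:
        return dict(placement)  # nowhere safer to go; leave jobs in place
    targets = cycle(safe)
    return {
        job: (next(targets) if node in at_risk else node)
        for job, node in placement.items()
    }
```

Round-robin reassignment is only the simplest possible policy; the abstract's routing of future jobs to "more robust nodes" would need a robustness measure that this sketch does not define.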