Abstract:
An improved method and apparatus for time stamping events occurring on a large-scale distributed network uses a local counter associated with each processor of the distributed network. All counters are reset at the same time globally so that all events are recorded with respect to a particular time. A counter is stopped when a critical event is detected. Events are masked or filtered, either online or offline, so that non-critical events do not trigger a collection by the system monitor or service/host processor. The masking can be done dynamically through the use of an event history logger. The central system may poll the remote processor periodically to receive the accurate counter value from the local counter and device control register. Remedial action can be taken when conditional probability calculations performed on the historical information indicate that a critical event is about to occur.
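The conditional probability calculation mentioned above could be sketched as follows. This is a minimal illustration, not the patented method: the function name, the `(counter_value, event_type)` history format, and the fixed look-ahead window are all assumptions introduced for the example.

```python
def conditional_probability(history, precursor, critical, window):
    """Estimate P(critical event within `window` counter ticks | precursor event).

    `history` is a list of (counter_value, event_type) tuples sorted by
    counter value. The format and the window are assumptions for this sketch.
    """
    precursor_count = 0
    followed_count = 0
    for i, (t, kind) in enumerate(history):
        if kind != precursor:
            continue
        precursor_count += 1
        # Did a critical event occur within the window after this precursor?
        if any(k == critical and t < t2 <= t + window
               for t2, k in history[i + 1:]):
            followed_count += 1
    if precursor_count == 0:
        return 0.0  # no evidence in the history
    return followed_count / precursor_count
```

A monitor could compare the returned estimate against a threshold to decide whether remedial action is warranted; the threshold itself would have to be chosen per system.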
Abstract:
In a multi-node information processing system, a method for scheduling jobs includes the steps of: determining node-related performance parameters for a plurality of nodes; determining a ranking for each node based on its node-related performance parameters; and ordering the nodes by ranking for job scheduling.
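The three steps above can be sketched in a few lines. The abstract does not say which performance parameters are used or how they are combined, so the fields of `NodeStats` and the scoring formula in `rank_score` are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    failure_rate: float    # hypothetical parameter: failures per day
    cpu_load: float        # hypothetical parameter: utilisation in [0, 1]
    free_memory_gb: float  # hypothetical parameter


def rank_score(node):
    """Hypothetical ranking: more free capacity and fewer failures score higher."""
    return node.free_memory_gb * (1.0 - node.cpu_load) / (1.0 + node.failure_rate)


def order_nodes_for_scheduling(nodes):
    """Return the nodes sorted from best-ranked to worst-ranked."""
    return sorted(nodes, key=rank_score, reverse=True)
```

A scheduler would then assign incoming jobs to nodes from the front of the returned list.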
Abstract:
Techniques for generating a target process are provided. The techniques include identifying at least one of one or more steps and one or more artifacts within a target process and one or more other processes, pre-fetching the at least one of one or more atomic steps, one or more decision steps and splits and one or more merges to be used in the target process from the one or more other processes, and associating the at least one of one or more atomic steps, one or more decision steps and splits and one or more merges to be used in the target process at one or more decision points to generate the target process.
Abstract:
Method, apparatus and computer program product are configured to perform computer monitoring activities; to collect information regarding computer system status during the computer monitoring activities; to detect a problem in dependence on the information collected during the computer monitoring activities; and to determine whether to launch a diagnostic probe when the problem is detected. The monitoring activities may be performed on a periodic or event-driven basis. The determination whether to launch a diagnostic probe is based on a rule included in a hierarchy of rules. The hierarchy of rules is based on problem tickets; system logs; and computer system configuration information.
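One way to picture the rule hierarchy is as a tree in which a more specific matching rule overrides its parent. The sketch below assumes that structure; the `Rule` class, the `select_probe` function, the example `RULES` tree, and the probe names are all hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]       # predicate over collected status
    probe: Optional[str] = None           # probe to launch, if any
    children: List["Rule"] = field(default_factory=list)


def select_probe(rule, status):
    """Return the probe of the most specific matching rule, or None."""
    if not rule.matches(status):
        return None
    for child in rule.children:
        found = select_probe(child, status)
        if found is not None:
            return found
    return rule.probe


# Hypothetical hierarchy: a generic high-CPU rule refined for database hosts.
RULES = Rule(
    name="high_cpu",
    matches=lambda s: s.get("cpu", 0.0) > 0.9,
    probe="process_snapshot",
    children=[
        Rule(
            name="high_cpu_on_db",
            matches=lambda s: s.get("role") == "db",
            probe="db_lock_probe",
        )
    ],
)
```

In the system described, such rules would be derived from problem tickets, system logs, and configuration information rather than written by hand as they are here.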
Abstract:
The notion of controlling, using and monitoring remote resources in a distributed data processing system through the use of proxy resource managers and agents is extended to provide failover capability so that resource coverage is preserved and maintained even in the event of either temporary or longer duration node failure. Mechanisms are provided for consistent determination of resource status. Mechanisms are also provided which facilitate the joining of nodes to a group of nodes while still preserving remote resource operations. Additional mechanisms are also provided for the return of remote resource management to the control of a previously failed, but now recovered node, even if the failure had resulted in a node reset.
Abstract:
A system for predicting an occurrence of a critical event in a computer cluster includes a control system that includes an event log, a system parameter log, a memory for storing information related to occurrences of critical events, and a processor. The processor implements a hybrid prediction system; loads the information from the event log and the system performance log into a Bayesian network model; uses the Bayesian network model to predict a future critical event; makes future scheduling and current data migration selections; and adapts the Bayesian network model by feeding the scheduling and data migration selections.
Abstract:
A plurality of base templates are generated. Each of the base templates models a corresponding process. A plurality of instances of each of the base templates are instantiated. Each of the plurality of instances corresponds to an application of the corresponding process to a particular environment. Each of the instances of each of the base templates is annotated, based, in each case, upon observation of functioning of the instance in the particular environment.
Abstract:
A hybrid method of predicting the occurrence of future critical events in a computer cluster having a series of nodes records system performance parameters and the occurrence of past critical events. A data filter filters the logged data to eliminate redundancies and decrease the data storage requirements of the system. Time-series models and rule-based classification schemes are used to associate various system parameters with the past occurrence of critical events and predict the occurrence of future critical events. Ongoing processing jobs are migrated to nodes for which no critical events are predicted, and future jobs are routed to more robust nodes.
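The data filtering step might look like the sketch below, which drops a log entry when the same node reported the same message shortly before. The abstract does not define what counts as redundant, so the `(timestamp, node, message)` record format and the fixed time window are assumptions made for illustration.

```python
def filter_redundant(events, window=60):
    """Drop repeats of the same (node, message) within `window` seconds.

    `events` is a list of (timestamp, node, message) tuples sorted by
    timestamp. An entry is kept only if at least `window` seconds have
    passed since the last *kept* entry with the same node and message.
    """
    last_kept = {}
    kept = []
    for ts, node, msg in events:
        key = (node, msg)
        if key in last_kept and ts - last_kept[key] < window:
            continue  # redundant repeat; skip it
        last_kept[key] = ts
        kept.append((ts, node, msg))
    return kept
```

The filtered log is what the time-series and rule-based models would consume; those models are outside the scope of this sketch.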
Abstract:
Briefly, according to the invention, in an information processing system including a plurality of information processing nodes, a request for checkpointing by an application includes node health criteria (or parameters). The system has the authority to grant or deny the checkpointing request depending on the system health or availability. This scheme significantly improves not only the system performance but also the application running time. By skipping a checkpoint, the application can spend that time running instead of checkpointing.
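A grant-or-deny decision of this kind could be sketched as follows. The abstract does not specify the health criteria or the decision rule, so this example assumes the request carries a maximum tolerated failure probability and that the system holds a predicted failure probability per node; all names are hypothetical.

```python
def decide_checkpoint(request_nodes, max_failure_probability, predicted_failure):
    """Return True to grant the checkpoint request, False to deny (skip) it.

    Assumed rule: a checkpoint is worth taking only if at least one node
    used by the application has a predicted failure probability above the
    threshold supplied with the request.
    """
    return any(
        predicted_failure.get(node, 0.0) > max_failure_probability
        for node in request_nodes
    )
```

When the request is denied, the application simply continues computing and retries at its next checkpoint interval.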
Abstract:
A plurality of equivalent representations of a process are identified. The process has a plurality of tasks. Each of the representations specifies a different order of the tasks. The plurality of equivalent representations are consolidated into a single representation. The single representation captures, in at least one flexible order grouping, at least two of the tasks that may be performed in more than one order. At least one constraint is specified for the at least one flexible order grouping. Techniques for merging two or more flexible representations are also provided.
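As a rough illustration of consolidation, the sketch below keeps only the precedence relations shared by every given task ordering and reports the remaining task pairs as candidates for a flexible order grouping. It is a simplification under stated assumptions, not the technique of the abstract: each representation is assumed to be a plain sequence of the same tasks, and the resulting partial order may admit orderings beyond those supplied.

```python
from itertools import combinations


def common_precedences(orderings):
    """Return the (before, after) pairs that hold in every ordering."""
    def pairs(order):
        return {(a, b) for i, a in enumerate(order) for b in order[i + 1:]}

    common = pairs(orderings[0])
    for order in orderings[1:]:
        common &= pairs(order)
    return common


def flexible_pairs(orderings):
    """Return task pairs whose relative order differs between orderings."""
    tasks = sorted(orderings[0])
    common = common_precedences(orderings)
    return {
        (a, b)
        for a, b in combinations(tasks, 2)
        if (a, b) not in common and (b, a) not in common
    }
```

For the two orderings A, B, C, D and A, C, B, D, tasks B and C form the only flexible pair, while A still precedes everything and D still follows everything.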