Abstract:
A method and apparatus are provided for determining the most probable cause of a problem observed in a complex multi-host system. The approach relies on a probabilistic model to represent causes and effects in a complex computing system. Complex systems, however, include a multitude of independently operating components that can cause temporary anomalous states. To reduce the resources required to perform root cause analysis on each transient failure, and to raise confidence in the most probable cause of a failure identified by the model, inputs to the probabilistic model are aggregated over a sliding window of values from the recent past.
Abstract:
A method and apparatus are provided for determining that problems have occurred within a complex multi-host system and for identifying, for each problem, a sequence of causes and effects, called a fault cause path, that starts with a root cause. A probabilistic model representing the cause/effect relationships among potential system problems identifies the probability that a problem occurred in the system. Such failure probabilities may be determined by aggregating, over a recent time interval, the probability of failure values produced by the probabilistic model. Each fault cause path may have an associated probability of accuracy value reflecting its expected accuracy relative to other fault cause paths. When more than one fault cause path is identified, the number of fault cause paths displayed, and the order in which they are displayed, may be based on their probability of accuracy values.
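The ranking step described above can be illustrated with a minimal Python sketch. It assumes each fault cause path already carries a probability of accuracy value; how that value is computed is not described in the abstract. The names `FaultCausePath`, `rank_paths`, `max_shown`, and `min_accuracy` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FaultCausePath:
    nodes: list      # root cause first, observed problem last
    accuracy: float  # probability of accuracy value (assumed to be given)


def rank_paths(paths, max_shown=3, min_accuracy=0.0):
    """Return the paths to display, highest probability of accuracy first.

    max_shown and min_accuracy are hypothetical knobs that control how
    many paths are displayed.
    """
    eligible = [p for p in paths if p.accuracy >= min_accuracy]
    eligible.sort(key=lambda p: p.accuracy, reverse=True)
    return eligible[:max_shown]


paths = [
    FaultCausePath(["disk full", "db write error", "app timeout"], 0.20),
    FaultCausePath(["network loss", "app timeout"], 0.70),
    FaultCausePath(["memory leak", "gc thrash", "app timeout"], 0.05),
]
shown = rank_paths(paths, max_shown=2, min_accuracy=0.1)
```

Here both the number of paths displayed and their order depend on the accuracy values: the lowest-scoring path is filtered out and the remaining two are listed from most to least likely.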
Abstract:
A method and apparatus are provided for performing cross-host root cause diagnosis within a complex multi-host environment. In a multi-host environment, a system failure on one host may cause problems on another host within the same environment. A probabilistic model is used to represent failures that can occur within each host in the environment. The cause and effect relationships among these failures, together with measurement values, are used to generate a probability that each potential failure occurred in each host. When a problem is observed on one host without detecting a corresponding root cause within the same host, a cross-host failure diagnosis is performed. The probabilistic models for other hosts in the environment are used to determine the most likely cause of the failure.
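A rough Python sketch of the fallback from local to cross-host diagnosis follows. It assumes the per-host models have already produced a probability for each candidate root cause; the function name `diagnose`, the dictionary layout, and the `threshold` parameter are hypothetical and not taken from the abstract.

```python
def diagnose(problem_host, root_cause_probs, threshold=0.5):
    """Pick the most likely root cause for a problem seen on problem_host.

    root_cause_probs maps host -> {root_cause_name: probability}; these
    values stand in for the output of each host's probabilistic model.
    Returns (host, root_cause, probability), or None if nothing qualifies.
    """
    def best(host):
        causes = root_cause_probs.get(host, {})
        if not causes:
            return None
        name = max(causes, key=causes.get)
        return host, name, causes[name]

    # First look for a convincing root cause on the host itself.
    local = best(problem_host)
    if local is not None and local[2] >= threshold:
        return local

    # No local root cause detected: consult the other hosts' models.
    candidates = [best(h) for h in root_cause_probs if h != problem_host]
    candidates = [c for c in candidates if c is not None and c[2] >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[2])


probs = {
    "web1": {"cpu_saturation": 0.10, "disk_full": 0.05},
    "db1": {"lock_contention": 0.80, "disk_full": 0.20},
    "cache1": {"memory_exhausted": 0.30},
}
result = diagnose("web1", probs)
```

In this example `web1` has no local root cause above the threshold, so the diagnosis falls through to `db1`, whose lock contention is the most probable cause among the other hosts.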
Abstract:
A method and apparatus are provided for determining the probability that one or more problems have occurred within a complex multi-host system. A probabilistic model representing the cause/effect relationships among potential system problems identifies the probability that a problem occurred in the system based at least on system measure states that are input into the probabilistic model. System measure states may be determined based on an aggregation of system measurement values taken periodically. Aggregating system measurement values may be performed over system measurement values that were taken during a recent time interval. A rolling count aggregation function may be used for this purpose. A rolling count function counts the number of system measurement values taken within the recent time interval that lie within a particular range of values. A system measure state may be determined based on whether the rolling count exceeds a threshold associated with the system measure.
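The rolling count function described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class name `RollingCount`, the use of timestamps in seconds, and the state labels `"abnormal"` and `"normal"` are hypothetical and do not come from the abstract.

```python
from collections import deque


class RollingCount:
    """Count recent measurements whose value lies inside [low, high]."""

    def __init__(self, window_seconds, low, high, threshold):
        self.window_seconds = window_seconds
        self.low = low
        self.high = high
        self.threshold = threshold
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop samples that have aged out of the recent time interval.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def count(self):
        return sum(1 for _, v in self.samples if self.low <= v <= self.high)

    def state(self):
        # The measure state flips only when the count exceeds the threshold.
        return "abnormal" if self.count() > self.threshold else "normal"


# Hypothetical measure: CPU utilisation sampled every 10 seconds,
# flagged when more than 2 of the last 60 seconds' samples are in 90..100.
rc = RollingCount(window_seconds=60, low=90, high=100, threshold=2)
for t, v in [(0, 95), (10, 50), (20, 97), (30, 99)]:
    rc.add(t, v)
state_before = rc.state()   # three in-range samples in the window
rc.add(70, 40)              # the samples at t=0 and t=10 age out
state_after = rc.state()
```

Because a single out-of-range reading cannot push the count past the threshold on its own, a transient spike does not change the measure state that is fed to the probabilistic model.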
Abstract:
An approach is disclosed for implementing failover and resume when using ordered sequences in a multi-instance database environment. The approach commences by instantiating a first database instance to serve initially as the active instance, then instantiating a second database instance to serve as one of one or more passive instances. The active instance establishes mastership over a sequence and then processes requests for the ‘next’ symbol by accessing a shared sequence cache only after accessing a first instance semaphore. The active and passive instances perform a protocol such that, when a passive instance detects a failure of the active instance, one of the passive instances takes over mastership of the sequence cache and then proceeds to satisfy sequence value requests. The sequence order is maintained in spite of the failure.
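A simplified Python sketch of the takeover idea is shown below. The abstract does not spell out the protocol, so the sketch makes several assumptions: failure detection is reduced to an `alive` flag, the shared sequence cache is an in-process object, and the names `SharedSequenceCache`, `Instance`, and `failover` are hypothetical.

```python
import threading


class SharedSequenceCache:
    """State visible to every instance: next value and current master."""

    def __init__(self, start=1):
        self.next_value = start
        self.master = None
        self.semaphore = threading.Lock()  # stands in for the instance semaphore


class Instance:
    def __init__(self, name, cache):
        self.name = name
        self.cache = cache
        self.alive = True  # assumption: failure is modelled as a flag

    def become_master(self):
        with self.cache.semaphore:
            self.cache.master = self.name

    def next(self):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        # The cache is touched only while holding the semaphore.
        with self.cache.semaphore:
            if self.cache.master != self.name:
                raise RuntimeError(f"{self.name} is not the sequence master")
            value = self.cache.next_value
            self.cache.next_value += 1
            return value


def failover(active, passives):
    """If the active instance has failed, promote the first live passive."""
    if active.alive:
        return active
    for candidate in passives:
        if candidate.alive:
            candidate.become_master()
            return candidate
    raise RuntimeError("no live passive instance")


cache = SharedSequenceCache()
a, b = Instance("A", cache), Instance("B", cache)
a.become_master()
before = [a.next(), a.next()]
a.alive = False                 # simulate failure of the active instance
master = failover(a, [b])
after = master.next()
```

Because the next value lives in the shared cache rather than in the failed instance, the new master resumes exactly where the old one stopped, which is how the order survives the failure in this sketch.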
Abstract:
A method, system, and computer program product are disclosed for generating an ordered sequence from a predetermined sequence of symbols using protected interleaved caches, such as semaphore-protected interleaved caches. The approach commences by dividing the predetermined sequence of symbols into two or more interleaved caches, then mapping each of the two or more interleaved caches to a particular semaphore of a group of semaphores. The group of semaphores is organized into bytes or machine words for storage in a shared memory that is accessible by a plurality of session processes. Protected (serialized) access by the session processes is provided by granting access to one of the two or more interleaved caches only after one of the plurality of session processes performs a semaphore-altering read-modify-write operation (e.g., a CAS) on the particular semaphore. The interleaved caches are assigned values successively from the predetermined sequence using a round-robin assignment technique.
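The round-robin interleaving and the semaphore-guarded access can be sketched in Python as follows. Python exposes no user-level CAS instruction, so `compare_and_swap` is simulated here with an internal lock purely for illustration; on real hardware it would be a single atomic instruction on a word in shared memory. The names `SemaphoreWord`, `interleave`, and `InterleavedSequence` are hypothetical, and the choice of which cache a session uses is left to the caller.

```python
import threading


class SemaphoreWord:
    """One integer whose bits are the per-cache semaphores."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()  # only simulates CAS atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def interleave(symbols, n_caches):
    """Deal symbols round-robin: cache i gets symbols i, i+n, i+2n, ..."""
    return [list(symbols[i::n_caches]) for i in range(n_caches)]


class InterleavedSequence:
    def __init__(self, symbols, n_caches):
        self.caches = interleave(symbols, n_caches)
        self.word = SemaphoreWord()

    def _acquire(self, i):
        bit = 1 << i
        while True:
            old = self.word.load()
            if old & bit:
                continue  # semaphore held by another session; spin
            if self.word.compare_and_swap(old, old | bit):
                return

    def _release(self, i):
        bit = 1 << i
        while True:
            old = self.word.load()
            if self.word.compare_and_swap(old, old & ~bit):
                return

    def take(self, i):
        """Pop the next symbol from cache i, or None if it is exhausted."""
        self._acquire(i)
        try:
            return self.caches[i].pop(0) if self.caches[i] else None
        finally:
            self._release(i)


seq = InterleavedSequence(list(range(1, 9)), n_caches=3)
layout = [list(c) for c in seq.caches]
taken = [seq.take(0), seq.take(0), seq.take(2)]
```

Packing the semaphores into one word means a session can claim a cache with a single read-modify-write, and sessions working on different caches touch different bits, so they do not serialize against each other.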
Abstract:
In a database system having a plurality of concurrently executing session processes, a method commences by establishing a master list of sequences, the master list comprising a plurality of sequence objects, each of which defines a sequence of values used for numbering and other identification within the database system. To reduce sequence cache latch contention, multiple tiers of latches are provided. A first tier has a single “global” latch that serializes access to the master list. A second tier has multiple latches, each of which serializes access to a corresponding allocated sequence of values, so that at any point in time only one of the concurrently executing session processes is granted access to that allocated sequence.
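The two-tier latch arrangement might look like the following Python sketch, which uses threads in place of session processes and `threading.Lock` in place of database latches. The class names `SequenceObject` and `SequenceRegistry`, and the create-on-first-use behaviour of `lookup`, are assumptions made for illustration.

```python
import threading


class SequenceObject:
    def __init__(self, start, step=1):
        self.next_value = start
        self.step = step
        self.latch = threading.Lock()  # second-tier latch, one per sequence


class SequenceRegistry:
    def __init__(self):
        self._global_latch = threading.Lock()  # first-tier "global" latch
        self._master_list = {}

    def lookup(self, name, start=1):
        # Hold the global latch only long enough to find or create the entry.
        with self._global_latch:
            seq = self._master_list.get(name)
            if seq is None:
                seq = self._master_list[name] = SequenceObject(start)
            return seq

    def next_value(self, name):
        seq = self.lookup(name)
        # Sessions using different sequences contend only on their own latch.
        with seq.latch:
            value = seq.next_value
            seq.next_value += seq.step
            return value


reg = SequenceRegistry()
first = [reg.next_value("orders"), reg.next_value("orders")]
other = reg.next_value("invoices")

results = []

def worker():
    for _ in range(100):
        results.append(reg.next_value("orders"))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The global latch is held only for the brief master-list lookup, while the longer-lived work of handing out values is serialized per sequence, which is where the reduction in contention comes from.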