Abstract:
Optimizing collective operations, including: receiving an instruction to perform a collective operation type; selecting an optimized collective operation for the collective operation type; performing the selected optimized collective operation; determining whether a resource needed by one or more nodes to perform the collective operation is unavailable; and, if the resource is unavailable: notifying the other nodes that the resource is not available; selecting a next optimized collective operation; and performing the next optimized collective operation.
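The fallback behavior in this abstract can be illustrated with a minimal Python sketch. Everything named here (`ResourceUnavailable`, `hardware_allreduce`, `software_allreduce`, `perform_collective`, and the `notify` callback) is hypothetical and is not taken from the patent; the sketch only assumes that candidate operations are tried in order of preference and that an unavailable resource is reported before the next candidate runs.

```python
class ResourceUnavailable(Exception):
    """Hypothetical signal that a resource needed by a node is missing."""


def hardware_allreduce(values):
    # Assumption for illustration: the preferred path needs a hardware
    # unit that happens to be busy, so it cannot run.
    raise ResourceUnavailable("collective hardware unit is busy")


def software_allreduce(values):
    # Fallback path that needs no special resource.
    return sum(values)


def perform_collective(candidates, values, notify):
    """Try each candidate operation in order of preference.

    When a candidate reports a missing resource, tell the other nodes
    (modeled by the notify callback) and move on to the next candidate.
    """
    for name, operation in candidates:
        try:
            return name, operation(values)
        except ResourceUnavailable as reason:
            notify(f"{name} unavailable: {reason}")
    raise RuntimeError("no candidate operation could be performed")


notifications = []
chosen, result = perform_collective(
    [("hardware", hardware_allreduce), ("software", software_allreduce)],
    [1, 2, 3],
    notifications.append,
)
```

In this run the hardware path fails, one notification is recorded, and the software path produces the result.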
Abstract:
Methods, apparatuses, and computer program products for performing collective operations on a hybrid distributed processing system that includes a plurality of compute nodes and a plurality of tasks, in which each task is assigned a unique rank and each compute node is coupled for data communications by at least two different networking topologies. At least one of the two networking topologies is a tiered tree topology having a root task and at least two child tasks, and the at least two child tasks are peers of one another in the same tier. Embodiments include, for each task, sending at least a portion of data corresponding to the task to all child tasks of the task through the tree topology; and sending at least a portion of the data corresponding to the task to all peers of the task at the same tier in the tree topology through the second topology.
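As a rough illustration of the two kinds of sends, the sketch below plans which transfers each task would make. It assumes the tree is given as a hypothetical `children` mapping from task rank to child ranks, and it treats every task in the same tier as a peer; the function name `plan_sends` and the `"tree"`/`"peer"` labels are invented for this example.

```python
from collections import defaultdict, deque


def plan_sends(children, root):
    """Return (topology, sender, receiver) tuples for every task."""
    # Breadth-first walk from the root to find each task's tier.
    tier_of = {root: 0}
    order = []
    queue = deque([root])
    while queue:
        task = queue.popleft()
        order.append(task)
        for child in children.get(task, []):
            tier_of[child] = tier_of[task] + 1
            queue.append(child)

    # Group tasks by tier so peers can be looked up.
    tiers = defaultdict(list)
    for task in order:
        tiers[tier_of[task]].append(task)

    sends = []
    for task in order:
        # Data goes to every child through the tree topology.
        for child in children.get(task, []):
            sends.append(("tree", task, child))
        # Data goes to every same-tier peer through the second topology.
        for peer in tiers[tier_of[task]]:
            if peer != task:
                sends.append(("peer", task, peer))
    return sends


sends = plan_sends({0: [1, 2], 1: [3], 2: [4]}, root=0)
```

With this five-task tree, the root only sends down the tree, while tasks 1 and 2 each send to one child and to each other.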
Abstract:
Methods, apparatuses, and computer program products for processing unexpected messages at a compute node of a parallel computer are provided. Embodiments include receiving, by the compute node, a portion of a message from another compute node of the parallel computer, the message comprising a plurality of separate portions; in response to receiving the portion of the message, determining, by the compute node, whether one of the applications executing on the compute node has indicated that the message is expected; if one of the applications executing on the compute node has not indicated that the message is expected, storing, by the compute node, the portion of the message in an unexpected message buffer within the compute node; and if one of the applications executing on the compute node has indicated that the message is expected, storing the portion of the message at a storage destination indicated by the message.
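The expected/unexpected decision can be sketched in a few lines of Python. The `Portion` and `ComputeNode` classes, their field names, and the `post_receive` method are assumptions made for illustration; in particular, a message is modeled as "expected" once an application has registered its ID.

```python
from dataclasses import dataclass, field


@dataclass
class Portion:
    message_id: int
    offset: int
    payload: bytes
    destination: str  # storage destination named by the message itself


@dataclass
class ComputeNode:
    expected_ids: set = field(default_factory=set)
    unexpected_buffer: list = field(default_factory=list)
    storage: dict = field(default_factory=dict)

    def post_receive(self, message_id):
        # An application indicates that it expects this message.
        self.expected_ids.add(message_id)

    def receive(self, portion):
        if portion.message_id in self.expected_ids:
            # Expected: store at the destination the message indicates.
            slots = self.storage.setdefault(portion.destination, {})
            slots[portion.offset] = portion.payload
            return "stored"
        # Not expected: park the portion in the unexpected message buffer.
        self.unexpected_buffer.append(portion)
        return "buffered"


node = ComputeNode()
node.post_receive(7)
first = node.receive(Portion(7, 0, b"ab", "app_buf"))
second = node.receive(Portion(9, 0, b"zz", "other_buf"))
```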
Abstract:
Methods, systems, and computer program products for configurable alert delivery in a distributed processing system are provided. Embodiments include for each alert generated by an incident analyzer, applying active alert filters to the alert; wherein applying the active alert filters to the alert includes: creating a list of all active alert filters and a set of all active listeners; and for each active alert filter, running the active alert filter; if the active alert filter indicates that the alert should not go to one or more of the active listeners, removing the one or more active listeners from the set of all active listeners; if the active listeners set is empty, stopping processing of the alert; and if the active listeners set is not empty, selecting the next active alert filter from the active alert filter list.
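The filter loop described above can be sketched as follows. This is only an illustration under stated assumptions: a filter is modeled as a hypothetical callable that returns the listeners that should not receive the alert, and the names `apply_alert_filters`, `mute_pager_for_low_severity`, and `mute_everyone_for_test_alerts` are invented here.

```python
def apply_alert_filters(alert, active_filters, active_listeners):
    """Return the listeners that should still receive the alert."""
    filter_list = list(active_filters)   # list of all active alert filters
    listeners = set(active_listeners)    # set of all active listeners
    for alert_filter in filter_list:
        # Run the filter and drop whichever listeners it rules out.
        blocked = alert_filter(alert, listeners)
        listeners -= set(blocked)
        if not listeners:
            # Nobody is left to notify, so stop processing this alert.
            break
    return listeners


def mute_pager_for_low_severity(alert, listeners):
    # Hypothetical filter: low-severity alerts never page anyone.
    return {"pager"} if alert["severity"] < 3 else set()


def mute_everyone_for_test_alerts(alert, listeners):
    # Hypothetical filter: test alerts go to no listener at all.
    return set(listeners) if alert.get("test") else set()


filters = [mute_pager_for_low_severity, mute_everyone_for_test_alerts]
low = apply_alert_filters({"severity": 1}, filters, {"pager", "email"})
test_alert = apply_alert_filters({"severity": 5, "test": True}, filters, {"pager", "email"})
high = apply_alert_filters({"severity": 5}, filters, {"pager", "email"})
```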
Abstract:
Methods, apparatuses, and computer program products for selected alert delivery in a distributed processing system are provided. Embodiments include receiving, by an incident analyzer, one or more events from one or more resources, each event identifying a location of the resource producing the event; creating, by the incident analyzer, potential alerts in dependence upon a location of the resource producing the event and location scoping rules; selecting for consolidation, by the incident analyzer, one or more of the potential alerts based on consolidation rules; and creating, by the incident analyzer, a consolidated alert based on the consolidation rules and the selected one or more potential alerts.
Abstract:
Administering incident pools including receiving, by an incident analyzer from an incident queue, a plurality of incidents from one or more components of the distributed processing system; assigning, by the incident analyzer, each received incident to a pool of incidents; assigning, by the incident analyzer, to each incident a particular combined minimum time for inclusion in one or more pools, each particular combined minimum time corresponding to a particular incident; in response to the pool closing, determining, by the incident analyzer, for each incident in the pool whether the incident has met its combined minimum time for inclusion in one or more pools; and if the incident has been in the pool for its combined minimum time, including, by the incident analyzer, the incident in the closed pool; and if the incident has not been in the pool for its combined minimum time, including the incident in a next pool.
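A minimal sketch of the pool-closing step, assuming each incident records when it entered the pool and how long it must stay. The `Incident` fields and the `close_pool` function are hypothetical names, and time is modeled as plain numbers rather than a real clock.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    arrived_at: float    # time the incident was assigned to the pool
    minimum_time: float  # combined minimum time for inclusion


def close_pool(pool, closing_time):
    """Split a closing pool into incidents kept and incidents carried over."""
    closed, next_pool = [], []
    for incident in pool:
        time_in_pool = closing_time - incident.arrived_at
        if time_in_pool >= incident.minimum_time:
            # The incident met its minimum time: it stays in the closed pool.
            closed.append(incident)
        else:
            # Too recent: include it in the next pool instead.
            next_pool.append(incident)
    return closed, next_pool


pool = [Incident("disk-error", 0.0, 5.0), Incident("link-flap", 8.0, 5.0)]
closed, next_pool = close_pool(pool, closing_time=10.0)
```

Here the first incident has spent 10 time units in the pool and is kept, while the second has spent only 2 and moves to the next pool.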
Abstract:
Methods, systems and products are provided for dynamic administration of component event reporting in a distributed processing system including receiving, by an events analyzer from an events queue, a plurality of events from one or more components of the distributed processing system; determining, by the events analyzer in dependence upon the received events and one or more event analysis rules, to change the event reporting rules of one or more components; and instructing, by the events analyzer, the one or more components to change the event reporting rules.
Abstract:
Reducing remote reads of memory in a hybrid computing environment by maintaining remote memory values locally, the hybrid computing environment including a host computer and a plurality of accelerators, the host computer and the accelerators each having local memory shared remotely with the other, including writing to the shared memory of the host computer packets of data representing changes in accelerator memory values, incrementing, in local memory and in remote shared memory on the host computer, a counter value representing the total number of packets written to the host computer, reading by the host computer from the shared memory in the host computer the written data packets, moving the read data to application memory, and incrementing, in both local memory and in remote shared memory on the accelerator, a counter value representing the total number of packets read by the host computer.
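The mirrored counters can be modeled with a small single-process sketch. It is a simplification under assumptions: `Host` and `Accelerator` are hypothetical classes, a remote write is modeled as a plain attribute update on the other object, and no real shared memory or synchronization is involved. The point it illustrates is that each side can compare counters held locally instead of reading the other side's memory.

```python
class Host:
    def __init__(self):
        self.shared_memory = []       # packets written by the accelerator
        self.application_memory = []  # where read data ends up
        self.written_count = 0        # copy of the packets-written counter
        self.read_count = 0           # packets read by the host


class Accelerator:
    def __init__(self):
        self.written_count = 0  # packets written to the host
        self.read_count = 0     # copy of the host's packets-read counter


def accelerator_write(accelerator, host, packet):
    # Write the packet into the host's shared memory, then bump the
    # written counter both locally and in the host's memory.
    host.shared_memory.append(packet)
    accelerator.written_count += 1
    host.written_count += 1


def host_drain(host, accelerator):
    # The host checks only its own counters to see what is unread.
    while host.read_count < host.written_count:
        packet = host.shared_memory[host.read_count]
        host.application_memory.append(packet)
        # Bump the read counter locally and in the accelerator's memory.
        host.read_count += 1
        accelerator.read_count += 1


def accelerator_pending(accelerator):
    # Answered from local memory, with no remote read.
    return accelerator.written_count - accelerator.read_count


host, accelerator = Host(), Accelerator()
accelerator_write(accelerator, host, {"addr": 16, "value": 1})
accelerator_write(accelerator, host, {"addr": 32, "value": 2})
pending_before = accelerator_pending(accelerator)
host_drain(host, accelerator)
pending_after = accelerator_pending(accelerator)
```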
Abstract:
Initiating a collective operation in a parallel computer that includes compute nodes coupled for data communications and organized in an operational group for collective operations with one compute node assigned as a root node, including: identifying, by a non-root compute node, a collective operation to execute in the operational group of compute nodes; initiating, by the non-root compute node, execution of the collective operation amongst the compute nodes of the operational group including: sending, by the non-root compute node to one or more of the other compute nodes in the operational group, an active message, the active message including information configured to initiate execution of the collective operation amongst the compute nodes of the operational group; and executing, by the compute nodes of the operational group, the collective operation.
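A toy simulation of a non-root node starting a collective with an active message is shown below. The `ActiveMessage` and `Node` classes, the `initiate` and `execute_sum` functions, and the choice of a sum as the collective are all assumptions for illustration; real active messages would travel over a network and trigger a handler on arrival.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveMessage:
    operation: str  # which collective the receivers should start
    initiator: int  # rank of the node that sent the message


@dataclass
class Node:
    rank: int
    value: int
    is_root: bool = False
    started: list = field(default_factory=list)

    def on_active_message(self, message):
        # Handler run on receipt: this node joins the named collective.
        self.started.append(message.operation)


def initiate(initiator, group, operation):
    """Have one node (here, a non-root one) start the collective."""
    message = ActiveMessage(operation, initiator.rank)
    initiator.started.append(operation)
    for node in group:
        if node is not initiator:
            node.on_active_message(message)


def execute_sum(group, operation):
    # Run the collective only once every node has been told to start.
    if not all(operation in node.started for node in group):
        raise RuntimeError("collective was not initiated on every node")
    return sum(node.value for node in group)


group = [Node(0, 10, is_root=True), Node(1, 20), Node(2, 30)]
initiate(group[2], group, "sum")
total = execute_sum(group, "sum")
```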
Abstract:
Compressing result data for a compute node in a parallel computer, the parallel computer including a collection of compute nodes organized as a tree, including: initiating a collective gather operation by a logical root of the collection of compute nodes, including adding result data of the logical root to a gather buffer; for each compute node in the collection of compute nodes, determining whether result data of the compute node is already written in the gather buffer; and if the result data of the compute node is already written in the gather buffer, incrementing a counter assigned to that result data already written in the gather buffer; and if the result data of the compute node is not already written in the gather buffer, writing the result data of the compute node as new result data in the gather buffer, incrementing a counter assigned to that new result data, and writing in the gather buffer a node ID.
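The counting scheme in this abstract can be sketched as a dictionary-based gather buffer. This is an illustration under assumptions: the tree traversal is replaced by a simple loop that visits the logical root first, result data is assumed to be hashable, and the `compressed_gather` name and the entry layout are invented for the example.

```python
def compressed_gather(results, root):
    """Gather results, storing each distinct value once with a counter.

    results maps node ID to that node's result data.
    """
    gather_buffer = {}
    # The logical root adds its own result first, then the other nodes.
    order = [root] + [node_id for node_id in results if node_id != root]
    for node_id in order:
        data = results[node_id]
        if data in gather_buffer:
            # Already written: only increment the counter for that data.
            gather_buffer[data]["count"] += 1
        else:
            # New data: write it, set its counter to one, record the node ID.
            gather_buffer[data] = {"count": 1, "node_id": node_id}
    return gather_buffer


gathered = compressed_gather({0: "ok", 1: "ok", 2: "fail", 3: "ok"}, root=0)
```

Four results collapse into two buffer entries, which is where the compression comes from when many nodes report identical data.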