摘要:
Optimizing collective operations including receiving an instruction to perform a collective operation type; selecting an optimized collective operation for the collective operation type; performing the selected optimized collective operation; determining whether a resource needed by the one or more nodes to perform the collective operation is not available; if a resource needed by the one or more nodes to perform the collective operation is not available: notifying the other nodes that the resource is not available; selecting a next optimized collective operation; and performing the next optimized collective operation.
摘要:
Methods, apparatuses, and computer program products for performing collective operations on a hybrid distributed processing system that includes a plurality of compute nodes and a plurality of tasks, each task is assigned a unique rank, and each compute node is coupled for data communications by at least two different networking topologies. At least one of the two networking topologies is a tiered tree topology having a root task and at least two child tasks and the at least two child tasks are peers of one another in the same tier. Embodiments include for each task, sending at least a portion of data corresponding to the task to all child tasks of the task through the tree topology; and sending at least a portion of the data corresponding to the task to all peers of the task at the same tier in the tree topology through the second topology.
摘要:
Methods, systems and products are provided for dynamic administration of component event reporting in a distributed processing system including receiving, by an events analyzer from an events queue, a plurality of events from one or more components of the distributed processing system; determining, by the events analyzer in dependence upon the received events and one or more event analysis rules, to change the event reporting rules of one or more components; and instructing, by the events analyzer, the one or more components to change the event reporting rules.
摘要:
Reducing remote reads of memory in a hybrid computing environment by maintaining remote memory values locally, the hybrid computing environment including a host computer and a plurality of accelerators, the host computer and the accelerators each having local memory shared remotely with the other, including writing to the shared memory of the host computer packets of data representing changes in accelerator memory values, incrementing, in local memory and in remote shared memory on the host computer, a counter value representing the total number of packets written to the host computer, reading by the host computer from the shared memory in the host computer the written data packets, moving the read data to application memory, and incrementing, in both local memory and in remote shared memory on the accelerator, a counter value representing the total number of packets read by the host computer.
摘要:
Compressing result data for a compute node in a parallel computer, the parallel computer including a collection of compute nodes organized as a tree, including: initiating a collective gather operation by a logical root of the collection of compute nodes, including adding result data of the logical root to a gather buffer; for each compute node in the collection of compute nodes, determining whether result data of the compute node is already written in the gather buffer; and if the result data of the compute node is already written in the gather buffer, incrementing a counter assigned to that result data already written in the gather buffer; and if the result data of the compute node is not already written in the gather buffer, writing the result data of the compute node as new result data in the gather buffer, incrementing a counter assigned to that new result data, and writing in the gather buffer a node ID.
摘要:
Dynamic administration of event pools for relevant event and alert analysis during event storms including receiving, by an events analyzer from an events queue, a plurality of events from one or more components of the distributed processing system, each event including an occurred time and a logged time; creating, by the event analyzer, an events pool; determining whether an arrival rate of the events from the components of the distributed processing system is greater than a predetermined threshold; if the arrival rate is greater than the predetermined threshold, assigning, by the events analyzer, a plurality of events to the events pool in dependence upon their occurred time; and if the arrival rate is not greater than the predetermined threshold, assigning, by the events analyzer, a plurality of events to the events pool in dependence upon their logged time.
摘要:
Terminating an accelerator application program in a hybrid computing environment that includes a host computer having a host computer architecture and an accelerator having an accelerator architecture, where the host computer and the accelerator are adapted to one another for data communications by a system level message passing module (‘SLMPM’), and terminating an accelerator application program in a hybrid computing environment includes receiving, by the SLMPM from a host application executing on the host computer, a request to terminate an accelerator application program executing on the accelerator; terminating, by the SLMPM, execution of the accelerator application program; returning, by the SLMPM to the host application, a signal indicating that execution of the accelerator application program was terminated; and performing, by the SLMPM, a cleanup of the execution environment associated with the terminated accelerator application program.
摘要:
Sending, by a node requesting information regarding a resource to one or more nodes in a distributed computing system, an active message to perform a collective operation; contributing, by each node not having a resource, a value of zero to the collective operation; contributing, by a node having the resource, the node's rank; storing the result of the collective operation in a buffer of the requesting node; and identifying, in dependence upon the result of the collective operation, the rank of the node having the resource.
摘要:
Methods, systems, and computer program products for flexible event data content management for relevant event and alert analysis within a distributed processing system are provided. Embodiments include capturing, by an interface connector, an event from a resource of the distributed processing system; inserting, by the interface connector, the event into an event database; receiving from the interface connector, by a notifier, a notification of insertion of the event into the event database; based on the received notification, tracking, by the notifier, the number of events indicated as inserted into the event database; receiving from the notifier, by a monitor, a cumulative notification indicating the number of events that have been inserted into the event database; in response to receiving the cumulative notification, retrieving, by the monitor, from the event database, events inserted into the event database; and processing, by the monitor, the retrieved events.
摘要:
Creating, by a parent master process of a parent communicator, a child communicator, including configuring the child communicator with a child master process, wherein a communicator includes a collection of one or more processes executing on compute nodes of a distributed computing system; determining, by the parent master process, whether a unique identifier is available to assign to the child communicator; if a unique identifier is available to assign to the child communicator, assigning, by the parent master process, the available unique identifier to the child communicator; and if a unique identifier is not available to assign to the child communicator: retrieving, by the parent master process, an available unique identifier from a master process of another communicator in a tree of communicators and assigning the retrieved unique identifier to the child communicator.