摘要:
The present invention provides a system and method for extracting elements from distributed arrays on a parallel processing system. The system includes a module that populates a result array with globally largest elements from input arrays, a module that generates a partition element, a module that counts the number of local elements greater than the partition element, and a module that determines the globally largest elements. The method for extracting elements from distributed arrays on a parallel processing system includes populating a result array with globally largest elements from input arrays, generating a partition element, counting the number of local elements greater than the partition and determining the globally largest elements.
摘要:
Routing data communications packets in a parallel computer that includes compute nodes organized for collective operations. Each compute node including an operating system kernel and a system-level messaging module that is a module of automated computing machinery that exposes a messaging interface to applications. Each compute node including a routing table that specifies, for each of a multiplicity of route identifiers, a data communications path through the compute node. Including to carry out the steps of: receiving in a compute node a data communications packet that includes a route identifier value; retrieving from the routing table a specification of a data communications path through the compute node; and routing, by the compute node, the data communications packet according to the data communications path identified by the compute node's routing table entry for the data communications packet's route identifier value.
摘要:
Fencing direct memory access (‘DMA’) data transfers in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
摘要:
A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes, MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.
摘要:
Methods, apparatuses, and computer program products for utilizing a kernel administration hardware thread of a multi-threaded, multi-core compute node of a parallel computer are provided. Embodiments include a kernel assigning a memory space of a hardware thread of an application processing core to a kernel administration hardware thread of a kernel processing core. A kernel administration hardware thread is configured to advance the hardware thread to a next memory space associated with the hardware thread in response to the assignment of the kernel administration hardware thread to the memory space of the hardware thread. Embodiments also include the kernel administration hardware thread executing an instruction within the assigned memory space.
摘要:
Compiling software for a hierarchical distributed processing system including providing to one or more compiling nodes software to be compiled, wherein at least a portion of the software to be compiled is to be executed by one or more other nodes; compiling, by the compiling node, the software; maintaining, by the compiling node, any compiled software to be executed on the compiling node; selecting, by the compiling node, one or more nodes in a next tier of the hierarchy of the distributed processing system in dependence upon whether any compiled software is for the selected node or the selected node's descendants; sending to the selected node only the compiled software to be executed by the selected node or selected node's descendant.
摘要:
Methods, apparatus, and products are disclosed for generating an executable version of an application using a distributed compiler operating on a plurality of compute nodes that include: receiving, by each compute node, a portion of source code for an application; compiling, in parallel by each compute node, the portion of the source code received by that compute node into a portion of object code for the application; performing, in parallel by each compute node, inter-procedural analysis on the portion of the object code of the application for that compute node, including sharing results of the inter-procedural analysis among the compute nodes; optimizing, in parallel by each compute node, the portion of the object code of the application for that compute node using the shared results of the inter-procedural analysis; and generating the executable version of the application in dependence upon the optimized portions of the object code of the application.
摘要:
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
摘要:
Performing a local barrier operation with parallel tasks executing on a compute node including, for each task: retrieving a present value of a counter; calculating, in dependence upon the present value of the counter and a total number of tasks performing the local barrier operation, a base value of the counter, the base value representing the counter's value prior to any task joining the local barrier; calculating, in dependence upon the base value and the total number of tasks performing the local barrier operation, a target value of the counter, the target value representing the counter's value when all tasks have joined the local barrier; joining the local barrier, including atomically incrementing the value of the counter; and repetitively, until the present value of the counter is no less than the target value of the counter: retrieving the present value of the counter and determining whether the present value equals the target value.
摘要:
A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.