摘要:
Methods, compute nodes, and computer program products are provided for direct memory access (‘DMA’) transfer completion notification. Embodiments include determining, by an origin DMA engine on an origin compute node, whether a data descriptor for an application message to be sent to a target compute node is currently in an injection first-in-first-out (‘FIFO’) buffer in dependence upon a sequence number previously associated with the data descriptor, the total number of descriptors currently in the injection FIFO buffer, and the current sequence number for the newest data descriptor stored in the injection FIFO buffer; and notifying a processor core on the origin DMA engine that the message has been sent if the data descriptor for the message is not currently in the injection FIFO buffer.
摘要:
Methods, compute nodes, and computer program products are provided for direct memory access (‘DMA’) transfer completion notification. Embodiments include determining, by an origin DMA engine on an origin compute node, whether a data descriptor for an application message to be sent to a target compute node is currently in an injection first-in-first-out (‘FIFO’) buffer in dependence upon a sequence number previously associated with the data descriptor, the total number of descriptors currently in the injection FIFO buffer, and the current sequence number for the newest data descriptor stored in the injection FIFO buffer; and notifying a processor core on the origin DMA engine that the message has been sent if the data descriptor for the message is not currently in the injection FIFO buffer.
摘要:
Methods and apparatus perform fault isolation in multiple node computing systems using commutative error detection values for—example, checksums—to identify and to isolate faulty nodes. When information associated with a reproducible portion of a computer program is injected into a network by a node, a commutative error detection value is calculated. At intervals, node fault detection apparatus associated with the multiple node computer system retrieve commutative error detection values associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created and stored in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in values indicate a possible faulty node.
摘要:
A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.
摘要:
A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices ate included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.
摘要:
A system and method for enabling high-speed, low-latency global tree network communications among processing nodes interconnected according to a tree network structure. The global tree network enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the virtual tree and sub-tree structures. The global operations performed include one or more of: broadcast operations downstream from a root node to leaf nodes of a virtual tree, reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node. The global tree network is configurable to provide global barrier and interrupt functionality in asynchronous or synchronized manner, and, is physically and logically partitionable.
摘要:
Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
摘要:
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
摘要:
A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.
摘要:
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple perfecting for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefect rather than some other predictive algorithm. This enables hardware to effectively prefect memory access patterns that are non-contiguous, but repetitive.