Abstract:
In a multinode data processing system in which nodes exchange information over a network or through a switch, a structure and mechanism are provided that enable data packets to be sent and received in any order. Normally, if in-order transmission and receipt are required, transmission over a single path is essential to ensure proper reassembly. The present mechanism avoids this requirement and permits Remote Direct Memory Access (RDMA) operations to be carried out simultaneously over multiple paths. This provides a data striping mode of operation in which data transfers complete much faster, because the packets of one or more RDMA messages can be partitioned and transferred over several paths simultaneously, making the full available system bandwidth usable.
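A minimal Python sketch of why order-independent receipt makes striping possible, assuming each packet carries its message ID and byte offset. The `Packet` fields, the round-robin `stripe` helper and the `Receiver` class are hypothetical illustrations, not the mechanism described above. Because every payload is written at its own offset, the receiver never needs to know which path delivered a packet or in what order packets arrived.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Packet:
    msg_id: int    # hypothetical header field: which RDMA message this packet belongs to
    offset: int    # hypothetical header field: byte offset of the payload in that message
    payload: bytes


def stripe(msg_id: int, data: bytes, num_paths: int, mtu: int) -> list[list[Packet]]:
    """Partition one message into MTU-sized packets, assigned round-robin to paths."""
    paths: list[list[Packet]] = [[] for _ in range(num_paths)]
    for i, off in enumerate(range(0, len(data), mtu)):
        paths[i % num_paths].append(Packet(msg_id, off, data[off:off + mtu]))
    return paths


class Receiver:
    """Places each payload at its own offset, so arrival order and path do not matter."""

    def __init__(self) -> None:
        self.buffers: dict[int, bytearray] = {}
        self.received: dict[int, int] = {}

    def expect(self, msg_id: int, length: int) -> None:
        self.buffers[msg_id] = bytearray(length)
        self.received[msg_id] = 0

    def deliver(self, pkt: Packet) -> bool:
        """Copy the payload into place; return True when the message is complete.
        Byte counting assumes each packet is delivered exactly once."""
        buf = self.buffers[pkt.msg_id]
        buf[pkt.offset:pkt.offset + len(pkt.payload)] = pkt.payload
        self.received[pkt.msg_id] += len(pkt.payload)
        return self.received[pkt.msg_id] == len(buf)


if __name__ == "__main__":
    data = bytes(range(256)) * 4
    rx = Receiver()
    rx.expect(7, len(data))
    arrivals = [p for path in stripe(7, data, num_paths=4, mtu=64) for p in path]
    random.Random(0).shuffle(arrivals)   # arbitrary interleaving across the four paths
    done = [rx.deliver(p) for p in arrivals]
    print(done[-1], bytes(rx.buffers[7]) == data)
```

The same `Receiver` works whether the packets of several messages are spread over one path or many, since reassembly depends only on the header fields.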
Abstract:
In a multinode data processing system in which nodes exchange information over a network or through a switch, a structure and mechanism are provided, within the realm of Remote Direct Memory Access (RDMA) operations, in which DMA is used on one side of the transfer but not on the other. On the non-DMA side, the transfer is carried out under program control, in contrast to the DMA side, where it is characteristically carried out in hardware. This combination is useful in programming situations where RDMA is carried out to or from contiguous memory locations on one side while the memory locations on the other side are noncontiguous. This split mode of transfer is provided for both read and write operations.
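The split mode can be illustrated with a small sketch in which the software side performs a gather (for a write) or a scatter (for a read) over a list of noncontiguous `(offset, length)` segments, while a hypothetical `dma_copy` stands in for the hardware engine moving one contiguous block. All function names and the segment-list representation are assumptions made for illustration.

```python
from __future__ import annotations


def _check(memory: bytearray, segments: list[tuple[int, int]]) -> None:
    for off, length in segments:
        if off < 0 or length < 0 or off + length > len(memory):
            raise IndexError(f"segment ({off}, {length}) is outside the buffer")


def gather(memory: bytearray, segments: list[tuple[int, int]]) -> bytes:
    """Software side of a split-mode write: copy noncontiguous (offset, length)
    segments, in order, into one contiguous staging buffer."""
    _check(memory, segments)
    staged = bytearray()
    for off, length in segments:
        staged += memory[off:off + length]
    return bytes(staged)


def scatter(memory: bytearray, segments: list[tuple[int, int]], data: bytes) -> None:
    """Software side of a split-mode read: copy contiguous data out into
    noncontiguous segments."""
    _check(memory, segments)
    if sum(length for _, length in segments) != len(data):
        raise ValueError("segment lengths must add up to the data length")
    pos = 0
    for off, length in segments:
        memory[off:off + length] = data[pos:pos + length]
        pos += length


def dma_copy(dst: bytearray, dst_off: int, src: bytes) -> None:
    """Stand-in for the hardware DMA engine on the contiguous side: one block copy."""
    dst[dst_off:dst_off + len(src)] = src


if __name__ == "__main__":
    local = bytearray(b"AAAA....BBBB....CCCC")   # noncontiguous data on the local side
    segs = [(0, 4), (8, 4), (16, 4)]
    remote = bytearray(16)                         # contiguous region on the remote side
    dma_copy(remote, 2, gather(local, segs))       # split-mode write
    print(bytes(remote[2:14]))
```

The hardware side only ever sees one contiguous block; all knowledge of the noncontiguous layout stays in the program-controlled side.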
Abstract:
In remote direct memory access transfers in a multinode data processing system in which the nodes communicate with one another through communication adapters coupled to a switch or network, failures in the nodes or in the communication adapters can produce the phenomenon known as trickle traffic: stale data received from the switch or network that may nevertheless carry all the signatures of valid packet data. The present invention addresses trickle traffic in two situations: node failure and adapter failure. In the node failure situation, randomly generated keys are used to reestablish connections to the adapter while providing a mechanism for recognizing stale packets. In the adapter failure situation, a round-robin context allocation approach is used, with the adapter state contexts carrying state information that helps identify stale packets. In another approach to the adapter failure situation, counts are assigned that give the node an adapter failure number which will not match the corresponding number in a context field in the adapter, thus enabling stale packets to be identified.
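One way to picture the stale-packet checks is sketched below, under the assumption that every packet carries a context ID, the connection's random key and the adapter failure number in effect when the connection was opened. The `Adapter` and `Context` classes and the 32-bit key width are hypothetical. For brevity the sketch combines the key check, round-robin context allocation and failure-number check in one class, whereas the abstract describes them as separate approaches.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass


@dataclass
class Context:
    key: int = 0             # random key chosen each time the connection is (re)established
    failure_count: int = 0   # adapter failure number recorded when the context was set up
    in_use: bool = False


class Adapter:
    def __init__(self, num_contexts: int) -> None:
        self.contexts = [Context() for _ in range(num_contexts)]
        self.next_ctx = 0        # round-robin allocation cursor
        self.failure_count = 0   # incremented on every adapter failure and recovery

    def open(self) -> tuple[int, int, int]:
        """Allocate the next free context in round-robin order, so a just-released
        context is reused as late as possible. Returns (ctx_id, key, failure_count),
        which the node stamps on every packet it sends on this connection."""
        n = len(self.contexts)
        for step in range(n):
            cid = (self.next_ctx + step) % n
            ctx = self.contexts[cid]
            if ctx.in_use:
                continue
            new_key = secrets.randbits(32)
            while new_key == ctx.key:            # never reissue the immediately previous key
                new_key = secrets.randbits(32)
            ctx.key, ctx.failure_count, ctx.in_use = new_key, self.failure_count, True
            self.next_ctx = (cid + 1) % n
            return cid, ctx.key, ctx.failure_count
        raise RuntimeError("no free context")

    def close(self, cid: int) -> None:
        self.contexts[cid].in_use = False

    def fail_and_recover(self) -> None:
        """Adapter failure: bump the failure number and drop every connection."""
        self.failure_count += 1
        for ctx in self.contexts:
            ctx.in_use = False

    def accept(self, cid: int, key: int, failure_count: int) -> bool:
        """Accept a packet only if it matches a live context's key and failure number;
        anything else is treated as stale trickle traffic and dropped."""
        ctx = self.contexts[cid]
        return ctx.in_use and ctx.key == key and ctx.failure_count == failure_count
```

A packet stamped with an outdated key (after a node reconnects) or an outdated failure number (after an adapter recovers) fails `accept` even though its other fields look valid.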
Abstract:
In remote direct memory access (RDMA) transfers in a multinode data processing system in which the nodes communicate with one another through communication adapters coupled to a switch or network, the system needs efficient memory protection mechanisms across jobs. A method is therefore desired for addressing virtual memory on local and remote servers that is independent of the process ID on the local and/or remote node. This problem is solved by global Translation Control Entry (TCE) tables that are accessed and owned by RDMA jobs and managed by a device driver, used in conjunction with a Protocol Virtual Offset (PVO) address format.
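A minimal sketch of process-ID-independent addressing, assuming a PVO is simply a TCE table index packed above a byte offset. The 40-bit offset field, the 4 KiB page size, the job-ownership check and the `DeviceDriver` class are all hypothetical choices for illustration; the actual PVO format is not given in this abstract.

```python
from __future__ import annotations

PAGE_SIZE = 4096   # assumed page size
OFFSET_BITS = 40   # assumed PVO layout: [ table index | byte offset (40 bits) ]


def make_pvo(table_index: int, offset: int) -> int:
    """Pack a hypothetical PVO; it contains no process ID, only a table and an offset."""
    if table_index < 0 or not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("field out of range")
    return (table_index << OFFSET_BITS) | offset


class TceTable:
    """One global TCE table: page-granular map from a registered region to real pages."""

    def __init__(self, job_id: int, real_pages: list[int]) -> None:
        self.job_id = job_id          # owning RDMA job
        self.real_pages = real_pages  # real (physical) address of each page


class DeviceDriver:
    """Hypothetical driver that owns all global TCE tables and enforces protection."""

    def __init__(self) -> None:
        self.tables: list[TceTable] = []

    def register(self, job_id: int, real_pages: list[int]) -> int:
        """Create a TCE table for a job; the returned index goes into PVOs."""
        self.tables.append(TceTable(job_id, real_pages))
        return len(self.tables) - 1

    def translate(self, job_id: int, pvo: int) -> int:
        """Resolve a PVO to a real address, rejecting access from any other job."""
        index, offset = pvo >> OFFSET_BITS, pvo & ((1 << OFFSET_BITS) - 1)
        if index >= len(self.tables) or self.tables[index].job_id != job_id:
            raise PermissionError("TCE table not owned by this job")
        page, within = divmod(offset, PAGE_SIZE)
        pages = self.tables[index].real_pages
        if page >= len(pages):
            raise PermissionError("offset outside the registered region")
        return pages[page] + within
```

Because the protection check keys on the job that owns the table rather than on a process ID, any process of that job, on either node, can use the same PVO.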
Abstract:
A method and system for transferring noncontiguous message groups. The sending node assembles a set of data into a series of transmission packets, packages a description of the layout of the transmission packets into description packets, places each description packet into a local buffer while maintaining a count of the number of description packets, transfers each description packet into a transmit buffer for transmission to at least one receiving node, identifies the data packets, and forwards each data packet to the transmit buffer for transmission to the at least one receiving node. The receiving node receives the transmission packets, identifies each packet as a description packet or a data packet, places the description packets in a local buffer for storage until the description is complete, places each description packet into a user data buffer, stores data packets in a local queue until the description is complete, and then transfers the data packets to the user buffer.
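The send and receive flow can be sketched as follows, assuming each description packet carries its own sequence number and the sender's total count, and each data packet carries its position in the data stream. The packet classes, `build_packets`, `place` and `Receiver` are hypothetical, and the layout is modeled as `(offset, length)` segments in the receiver's user buffer.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DescPacket:
    seq: int                         # index of this description packet
    total: int                       # sender's count of description packets
    segments: list[tuple[int, int]]  # (offset, length) pieces of the destination layout


@dataclass
class DataPacket:
    seq: int       # position in the data stream, in units of mtu
    payload: bytes


def build_packets(data: bytes, layout: list[tuple[int, int]],
                  segs_per_desc: int, mtu: int) -> list:
    """Sender: description packets first (each carrying the count), then data packets."""
    chunks = [layout[i:i + segs_per_desc] for i in range(0, len(layout), segs_per_desc)]
    tx: list = [DescPacket(i, len(chunks), c) for i, c in enumerate(chunks)]
    tx += [DataPacket(i, data[off:off + mtu])
           for i, off in enumerate(range(0, len(data), mtu))]
    return tx


def place(buf: bytearray, layout: list[tuple[int, int]], pos: int, payload: bytes) -> None:
    """Copy payload, which starts at stream position pos, into the noncontiguous layout."""
    seg_start = 0
    for off, length in layout:
        lo = max(pos, seg_start)
        hi = min(pos + len(payload), seg_start + length)
        if lo < hi:
            buf[off + lo - seg_start:off + hi - seg_start] = payload[lo - pos:hi - pos]
        seg_start += length


class Receiver:
    def __init__(self, user_buf: bytearray, mtu: int) -> None:
        self.user_buf, self.mtu = user_buf, mtu
        self.descs: dict[int, DescPacket] = {}   # local buffer for description packets
        self.pending: list[DataPacket] = []      # local queue for early data packets
        self.layout: list[tuple[int, int]] | None = None

    def on_packet(self, pkt: DescPacket | DataPacket) -> None:
        if isinstance(pkt, DescPacket):          # identify the packet kind
            self.descs[pkt.seq] = pkt
            if len(self.descs) == pkt.total:     # description is now complete
                self.layout = [s for i in range(pkt.total) for s in self.descs[i].segments]
                for d in self.pending:           # drain data that arrived early
                    place(self.user_buf, self.layout, d.seq * self.mtu, d.payload)
                self.pending.clear()
        elif self.layout is None:
            self.pending.append(pkt)
        else:
            place(self.user_buf, self.layout, pkt.seq * self.mtu, pkt.payload)
```

Queuing data until the layout is known lets the receiver tolerate data packets that overtake their description packets.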
Abstract:
A protocol interface is provided for an active message protocol of a computing environment and a client process employing the active message protocol. The protocol interface includes an interface to a header handler function associated with the client process. This interface defines parameters that the active message protocol passes to the header handler function, and a parameter that the function returns to the protocol, when a message received through the active message protocol is processed. The passed parameters include current message state information and current message type information for the received message; they let the header handler function make message-specific decisions about how the active message protocol should process the message's data. The returned parameter instructs the active message protocol how to process the received message, not merely where to store it.
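The shape of such a header-handler interface can be sketched as a callback type: the protocol passes the header plus the current message state and type, and the handler returns an instruction rather than only a buffer address. The enum values, `Decision`, `dispatch` and `example_handler` below are hypothetical illustrations, not the interface defined by the source.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class MsgState(Enum):    # hypothetical "current message state" values
    FIRST_PACKET = auto()
    CONTINUATION = auto()


class MsgType(Enum):     # hypothetical "current message type" values
    SHORT = auto()       # all data fits in the header
    LONG = auto()        # data follows in a payload


class Action(Enum):      # hypothetical instructions returned to the protocol
    COPY_TO_BUFFER = auto()
    RUN_INLINE = auto()  # handler consumed the header itself; store nothing
    DISCARD = auto()


@dataclass
class Decision:
    action: Action
    buffer: Optional[bytearray] = None   # only meaningful for COPY_TO_BUFFER


# The handler interface: (header, state, type) in, a Decision out.
HeaderHandler = Callable[[bytes, MsgState, MsgType], Decision]


def dispatch(handler: HeaderHandler, header: bytes, payload: bytes,
             state: MsgState, mtype: MsgType) -> Optional[bytearray]:
    """Protocol side: ask the client's header handler, then follow its instruction."""
    decision = handler(header, state, mtype)
    if decision.action is Action.COPY_TO_BUFFER:
        if decision.buffer is None or len(decision.buffer) < len(payload):
            raise ValueError("COPY_TO_BUFFER needs a buffer large enough for the payload")
        decision.buffer[:len(payload)] = payload
        return decision.buffer
    return None  # RUN_INLINE or DISCARD: the protocol stores nothing


def example_handler(header: bytes, state: MsgState, mtype: MsgType) -> Decision:
    """A client handler deciding per message; `state` is available but unused here."""
    if mtype is MsgType.SHORT:
        return Decision(Action.RUN_INLINE)
    length = int.from_bytes(header, "big")   # hypothetical: header carries the payload length
    if length == 0:
        return Decision(Action.DISCARD)
    return Decision(Action.COPY_TO_BUFFER, bytearray(length))
```

Returning an `Action` instead of a bare buffer pointer is what lets the client tell the protocol to skip storage entirely for messages it can handle from the header alone.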
Abstract:
Shared locks are employed to control a thread that extends across more than one protocol layer in a data processing system. A counter, kept as part of a data structure, makes it possible to implement shared locks across multiple layers. Using shared locks avoids the processing overhead usually associated with lock acquisition and release. The controlled thread may be initiated either in an upper-layer protocol or in a lower layer.
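A minimal sketch of a counter-based lock shared across layers, under the assumption that "shared" means the upper and lower layers use the same lock object and a nested layer on the same thread only bumps a counter. Python's standard `threading.RLock` already provides this reentrant behavior; the class is written out here only to make the counter visible. `SharedLayerLock`, `upper_layer_send` and `lower_layer_send` are hypothetical names.

```python
from __future__ import annotations

import threading
from typing import Optional


class SharedLayerLock:
    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._owner: Optional[int] = None   # thread id of the current holder
        self._count = 0                     # how many layers on that thread hold it

    @property
    def depth(self) -> int:
        return self._count

    def acquire(self) -> None:
        me = threading.get_ident()
        if self._owner == me:      # an outer layer on this thread already holds it
            self._count += 1       # nested layer: bump the counter, no real acquire
            return
        self._mutex.acquire()      # outermost layer: the only real acquisition
        self._owner, self._count = me, 1

    def release(self) -> None:
        if self._owner != threading.get_ident():
            raise RuntimeError("lock released by a thread that does not hold it")
        self._count -= 1
        if self._count == 0:       # outermost layer is leaving: the only real release
            self._owner = None
            self._mutex.release()

    def __enter__(self) -> "SharedLayerLock":
        self.acquire()
        return self

    def __exit__(self, *exc: object) -> None:
        self.release()


lock = SharedLayerLock()
log: list[tuple[str, int]] = []


def lower_layer_send(data: bytes) -> None:
    with lock:                     # nested when called from the upper layer
        log.append(("lower", lock.depth))


def upper_layer_send(data: bytes) -> None:
    with lock:                     # outermost: performs the real acquire/release
        log.append(("upper", lock.depth))
        lower_layer_send(data)
```

Calling `lower_layer_send` directly (a thread initiated in the lower layer) takes the real acquisition path with a depth of 1; called through `upper_layer_send`, it only raises the depth to 2.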
Abstract:
In a multinode data processing system in which data is transferred, via direct memory access (DMA) or remote direct memory access (RDMA), from a source node to at least one destination node through communication adapters coupling each node to a network or switch, a method is provided in which interrupt handling is overlapped with data transfer, so that interrupt processing overhead at the destination node runs in parallel with the movement of data, improving performance. The method also applies to situations involving multiple interrupt levels corresponding to multithreaded handling capabilities.
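The overlap can be sketched with two threads: the receive loop raises the "interrupt" (here, starting a handler thread) after a chosen packet instead of after the last one, and the handler's fixed overhead proceeds while the remaining packets are still being copied. Only the final notification waits for the transfer to complete. `receive`, `interrupt_handler` and the `early_at` parameter are hypothetical; real adapters raise hardware interrupts rather than starting threads.

```python
from __future__ import annotations

import threading


def interrupt_handler(data_done: threading.Event, events: list) -> None:
    """Hypothetical destination-side handler. Its fixed overhead runs while data is
    still arriving; only the final notification waits for the transfer to finish."""
    events.append(("intr_start",))
    # ... interrupt overhead (dispatch, locating the waiting thread, etc.) goes here ...
    data_done.wait()
    events.append(("notify",))   # wake the receiving thread


def receive(packets: list[bytes], early_at: int, events: list) -> bytes:
    """Move packets into the user buffer, raising the interrupt after packet
    `early_at` rather than after the last one."""
    if not 0 <= early_at < len(packets):
        raise ValueError("early_at must index one of the packets")
    buf = bytearray()
    data_done = threading.Event()
    handler = None
    for i, pkt in enumerate(packets):
        buf += pkt                   # stand-in for the adapter's DMA into memory
        events.append(("dma", i))
        if i == early_at:            # raise the interrupt early
            handler = threading.Thread(target=interrupt_handler, args=(data_done, events))
            handler.start()
    data_done.set()                  # data movement complete
    handler.join()
    return bytes(buf)
```

Raising the interrupt at packet 0 gives the most overlap; raising it later leaves less overhead hidden but also less time for the handler to sit waiting on `data_done`.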
Abstract:
A method is provided for operating a communications adapter in a multinode data processing system in a way that enhances the performance of remote direct memory access data transfers. The system is provided with pointers and a table that are used to determine whether an address supplied for the transfer has already been mapped to a real address at the source or destination node. The table is also preferably provided with counters that can be incremented or decremented, enabling least recently used (LRU) mechanisms at the upper-level protocol layers to control the setting and resetting of table entries more efficiently.
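A sketch of the lookup table, assuming entries are keyed by virtual page, each holds the real address and an in-use counter, and the upper layer evicts the least recently used entry whose counter has dropped to zero. `MappingCache` and the `pin` callback (standing in for the expensive map-and-pin step) are hypothetical, and the pointer structures mentioned in the abstract are not modeled.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Callable


class MappingCache:
    """Hypothetical adapter-side table: virtual page -> [real address, in-use count]."""

    def __init__(self, capacity: int, pin: Callable[[int], int]) -> None:
        self.capacity = capacity
        self.pin = pin                           # the expensive map-and-pin step
        self.table: OrderedDict[int, list[int]] = OrderedDict()  # oldest use first
        self.pin_calls = 0

    def acquire(self, vpage: int) -> int:
        """Return the real address for vpage, mapping it only on a miss."""
        entry = self.table.get(vpage)
        if entry is None:                        # not yet mapped: take the slow path
            self._evict_if_full()
            entry = [self.pin(vpage), 0]
            self.pin_calls += 1
            self.table[vpage] = entry
        self.table.move_to_end(vpage)            # now the most recently used entry
        entry[1] += 1                            # one more transfer using this mapping
        return entry[0]

    def release(self, vpage: int) -> None:
        self.table[vpage][1] -= 1                # a transfer finished with this mapping

    def _evict_if_full(self) -> None:
        if len(self.table) < self.capacity:
            return
        # Least recently used entry that no in-flight transfer is still using.
        victim = next((v for v, (_, n) in self.table.items() if n == 0), None)
        if victim is None:
            raise RuntimeError("every entry is in use")
        del self.table[victim]
```

The counter keeps a mapping alive while any transfer still uses it, and the LRU order decides which idle mapping to drop when the table is full.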
Abstract:
A deterministic, non-deadlocking technique for achieving distributed consensus in a multithreaded multiprocessing computing environment is provided. A communicator is established across multiple processes in the multithreaded computing environment even though multiple groups of threads may be trying to establish communicators simultaneously. The technique includes communicating across the multiple processes to establish a candidate identifier for the communicator for a group of participating threads of the multiple processes; and communicating across the multiple processes to check, at each participating thread, whether the candidate identifier can be claimed at its process and, if so, claiming the candidate identifier as the new identifier, thereby establishing the communicator. As one example, the technique can be implemented via a subroutine call within a Message Passing Interface (MPI) library.
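The two communication rounds can be simulated in a single Python process, with one participating thread per process and plain functions standing in for collective reductions (in an MPI library these would plausibly be `MPI_Allreduce` calls using `MPI_MAX` and `MPI_LAND`). The `Process` class, the lowest-free-ID proposal rule and the roll-back-and-retry step are assumptions chosen for illustration; they show one way such a protocol can stay deterministic and avoid deadlock, not the claimed method.

```python
from __future__ import annotations

import threading
from typing import Iterable, Optional


class Process:
    """Per-process registry of communicator IDs already in use there."""

    def __init__(self, used: Optional[Iterable[int]] = None) -> None:
        self.used = set(used or ())
        self.lock = threading.Lock()   # serializes claims by concurrent thread groups

    def lowest_free(self, start: int) -> int:
        with self.lock:
            cid = start
            while cid in self.used:
                cid += 1
            return cid

    def try_claim(self, cid: int) -> bool:
        with self.lock:
            if cid in self.used:
                return False
            self.used.add(cid)
            return True

    def unclaim(self, cid: int) -> None:
        with self.lock:
            self.used.discard(cid)


def allreduce_max(values: list[int]) -> int:   # stand-in for a MAX reduction
    return max(values)


def allreduce_and(flags: list[bool]) -> bool:  # stand-in for a logical-AND reduction
    return all(flags)


def agree_on_id(procs: list[Process]) -> int:
    """Two collective rounds per attempt; one participating thread per process."""
    start = 0
    while True:
        # Round 1: each participant proposes its lowest locally free ID; take the max.
        candidate = allreduce_max([p.lowest_free(start) for p in procs])
        # Round 2: each participant tries to claim the candidate at its own process.
        claimed = [p.try_claim(candidate) for p in procs]
        if allreduce_and(claimed):
            return candidate
        # Some process could not claim it: undo partial claims, search above it.
        for p, ok in zip(procs, claimed):
            if ok:
                p.unclaim(candidate)
        start = candidate + 1
```

Because `start` moves strictly upward after each failed attempt, the loop cannot cycle, and given the same per-process state every run picks the same identifier.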