摘要:
A target task ensures complete delivery of a global shared memory (GSM) message from an originating task to the target task. The target task's HFI receives a first of multiple GSM packets generated from a single GSM message sent from the originating task. The HFI logic assigns a sequence number and corresponding tuple to track receipt of the complete GSM message. The sequence number is unique relative to other sequence numbers assigned to GSM messages that have not been completely received from the initiating task. The HFI updates a count value within the tuple, which comprises the sequence number and the count value for the first GSM packet and for each subsequent GSM packet received for the GSM message. The HFI determines when receipt of the GSM message is complete by comparing the count value with a count total retrieved from the packet header.
摘要:
A method and data processing system for tracking global shared memory (GSM) operations to and from a local node configured with a host fabric interface (HFI) coupled to a network fabric. During task/job initialization, the system OS assigns HFI window(s) to handle the GSM packet generation and GSM packet receipt and processing for each local task. HFI processing logic automatically tags each GSM packet generated by the HFI window with a global job identifier (ID) of the job to which the local task is affiliated. The job ID is embedded within each GSM packet placed on the network fabric. On receipt of a GSM packet from the network fabric, the HFI logic retrieves the embedded job ID and compares the embedded job ID with the ID within the HFI window(s). GSM packets are forwarded to an HFI window only when the embedded job ID matches the HFI window's job ID.
摘要:
In a global shared memory (GSM) environment, an initiating task at a first node with a host fabric interface (HFI) uses epochs to provide reliability of transmission of packets via a network fabric to a target task. The HFI generates a packet for the initiating task addressed to the target task, and automatically inserts a current epoch of the initiating task into the packet. A copy of the current epoch is maintained by the target task, which accepts for processing only packets having the correct epoch, unless the packet is tagged for guaranteed-once delivery. When a packet delivery is accepted, the target task sends a notification to the initiating task. If the initiating task does not receive the notification of delivery for the issued packet, the initiating task updates the epoch at both the target node and the initiating node and re-transmits the packet.
摘要:
A method and data processing system for tracking global shared memory (GSM) operations to and from a local node configured with a host fabric interface (HFI) coupled to a network fabric. During task/job initialization, the system OS assigns HFI window(s) to handle the GSM packet generation and GSM packet receipt and processing for each local task. HFI processing logic automatically tags each GSM packet generated by the HFI window with a global job identifier (ID) of the job to which the local task is affiliated. The job ID is embedded within each GSM packet placed on the network fabric. On receipt of a GSM packet from the network fabric, the HFI logic retrieves the embedded job ID and compares the embedded job ID with the ID within the HFI window(s). GSM packets are forwarded to an HFI window only when the embedded job ID matches the HFI window's job ID.
摘要:
A method for issuing global shared memory (GSM) operations from an originating task on a first node coupled to a network fabric of a distributed network via a host fabric interface (HFI). The originating task generates a GSM command within an effective address (EA) space. The task then places the GSM command within a send FIFO. The send FIFO is a portion of real memory having real addresses (RA) that are memory mapped to EAs of a globally executing job. The originating task maintains a local EA-to-RA mapping of only a portion of the real address space of the globally executing job. The task enables the HFI to retrieve the GSM command from the send FIFO into an HFI window allocated to the originating task. The HFI window generates a corresponding GSM packet containing GSM operations and/or data, and the HFI window issues the GSM packet to the network fabric.
摘要:
A method and a data processing system for completing checkpoint processing of a distributed job with local tasks communicating with other remote tasks via a host fabric interface (HFI) and assigned HFI window. Each HFI window has a send count and a receive count, which tracks GSM messages that are sent from and received at the HFI window. When a checkpoint is initiated by a master task, each local task forwards the send count and the receive count to the master task. The master task sums the respective counts and then compares the totals to each other. When the send count total is equal to the receive count total, the tasks are permitted to continue processing. However, when the send count total is not equal to the receive count total, the master task notifies each task of the job to rollback to a previous checkpoint or kill the job execution.
摘要:
A host fabric interface (HFI) enables debugging of global shared memory (GSM) operations received at a local node from a network fabric. The local node has a memory management unit (MMU), which provides an effective address to real address (EA-to-RA) translation table that is utilized by the HFI to evaluate when EAs of GSM operations/data from a received GSM packet is memory-mapped to RAs of the local memory. The HFI retrieves the EA associated with a GSM operation/data within a received GSM packet. The HFI forwards the EA to the MMU, which determines when the EA is mapped to RAs within the local memory for the local task. The HFI processing logic enables processing of the GSM packet only when the EA of the GSM operation/data within the GSM packet is an EA that has a local RA translation. Non-matching EAs result in an error condition that requires debugging.
摘要:
A method for issuing global shared memory (GSM) operations from an originating task on a first node coupled to a network fabric of a distributed network via a host fabric interface (HFI). The originating task generates a GSM command within an effective address (EA) space. The task then places the GSM command within a send FIFO. The send FIFO is a portion of real memory having real addresses (RA) that are memory mapped to EAs of a globally executing job. The originating task maintains a local EA-to-RA mapping of only a portion of the real address space of the globally executing job. The task enables the HFI to retrieve the GSM command from the send FIFO into an HFI window allocated to the originating task. The HFI window generates a corresponding GSM packet containing GSM operations and/or data, and the HFI window issues the GSM packet to the network fabric.
摘要:
A target task ensures complete delivery of a global shared memory (GSM) message from an originating task to the target task. The target task's HFI receives a first of multiple GSM packets generated from a single GSM message sent from the originating task. The HFI logic assigns a sequence number and corresponding tuple to track receipt of the complete GSM message. The sequence number is unique relative to other sequence numbers assigned to GSM messages that have not been completely received from the initiating task. The HFI updates a count value within the tuple, which comprises the sequence number and the count value for the first GSM packet and for each subsequent GSM packet received for the GSM message. The HFI determines when receipt of the GSM message is complete by comparing the count value with a count total retrieved from the packet header.
摘要:
In a global shared memory (GSM) environment, an initiating task at a first node with a host fabric interface (HFI) uses epochs to provide reliability of transmission of packets via a network fabric to a target task. The HFI generates a packet for the initiating task addressed to the target task, and automatically inserts a current epoch of the initiating task into the packet. A copy of the current epoch is maintained by the target task, which accepts for processing only packets having the correct epoch, unless the packet is tagged for guaranteed-once delivery. When a packet delivery is accepted, the target task sends a notification to the initiating task. If the initiating task does not receive the notification of delivery for the issued packet, the initiating task updates the epoch at both the target node and the initiating node and re-transmits the packet.