Abstract:
Systems and methods for supporting heterogeneous and asymmetric dual rail fabric configurations in a high performance computing environment. A method can provide, at one or more computers each including one or more microprocessors, a plurality of hosts, each of the plurality of hosts comprising at least one dual port adapter, a private fabric comprising two or more switches, and a public fabric comprising a cloud fabric. A workload can be provisioned at a host of the plurality of hosts. A placement policy can be assigned to the provisioned workload. Then, network traffic between peer nodes of the provisioned workload can be assigned to one or more of the private fabric and the public fabric in accordance with the placement policy.
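
The policy-driven assignment of peer traffic to a fabric can be illustrated with a minimal sketch. The policy names, the `fabrics_for` function, and the `peer_on_private` flag are hypothetical and chosen only for illustration; the abstract does not enumerate concrete policies.

```python
from enum import Enum


class Fabric(Enum):
    PRIVATE = "private"
    PUBLIC = "public"


class PlacementPolicy(Enum):
    # Hypothetical policy names; the source does not list specific policies.
    PRIVATE_ONLY = "private_only"
    PUBLIC_ONLY = "public_only"
    PREFER_PRIVATE = "prefer_private"


def fabrics_for(policy: PlacementPolicy, peer_on_private: bool) -> list:
    """Return the fabrics, in preference order, usable for traffic to one peer."""
    if policy is PlacementPolicy.PRIVATE_ONLY:
        # No fallback: traffic stays off the public fabric entirely.
        return [Fabric.PRIVATE] if peer_on_private else []
    if policy is PlacementPolicy.PUBLIC_ONLY:
        return [Fabric.PUBLIC]
    # PREFER_PRIVATE: use the private fabric when the peer is attached to it,
    # and keep the public fabric as the second choice.
    if peer_on_private:
        return [Fabric.PRIVATE, Fabric.PUBLIC]
    return [Fabric.PUBLIC]
```

An empty list signals that the policy permits no path to that peer, which a caller would have to treat as a placement error.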
Abstract:
Systems and methods for multicast send duplication instead of replication in a high performance computing environment. A method can provide a plurality of switches and a plurality of hosts, the plurality of hosts being interconnected via the plurality of switches, wherein a host of the plurality of hosts comprises a multicast sender node, the sender node comprising a system image generation module and a current message sequence module. The method can organize the plurality of switches into two or more rails, the two or more rails providing redundant connectivity between the plurality of hosts. The method can send two or more duplicate multicast packets on different rails. When a receiving node receives at least two versions of the same multicast packet, only one version is delivered to the communication stack/clients above the layer that handles the encapsulation header.
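
The receiver-side filtering can be sketched as follows, assuming the encapsulation header carries a sender identifier, the sender's system image generation, and a message sequence number. The `Header` and `DuplicateFilter` names and the field layout are hypothetical; the abstract does not specify a header format or a filtering algorithm.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Header:
    # Hypothetical encapsulation header fields (names are assumptions).
    sender: str
    generation: int  # system image generation of the sender
    sequence: int    # message sequence number within that generation


@dataclass
class DuplicateFilter:
    # Per sender: (current generation, sequence numbers already delivered).
    _seen: dict = field(default_factory=dict)

    def accept(self, hdr: Header) -> bool:
        """Return True if the packet should be passed up the stack."""
        generation, sequences = self._seen.get(hdr.sender, (-1, set()))
        if hdr.generation < generation:
            return False  # stale packet from an older system image
        if hdr.generation > generation:
            # Sender restarted with a new system image: start a fresh record.
            generation, sequences = hdr.generation, set()
        if hdr.sequence in sequences:
            return False  # a copy already arrived on another rail
        sequences.add(hdr.sequence)
        self._seen[hdr.sender] = (generation, sequences)
        return True
```

Tracking a set of delivered sequence numbers, rather than only the highest one seen, lets a packet that was lost on one rail still be delivered when its copy arrives late on another. The set grows without bound in this sketch; a real receiver would need some form of windowing.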
Abstract:
A system and method can support network management in a network environment. The network environment can include a plurality of configuration daemons (CDs), wherein each CD resides on a switch in the network environment. The CD operates to receive a configuration file that includes a list of known management key (M_Key) values. Furthermore, the CD operates to store the configuration file, and make the configuration file available to a local subnet manager (SM) on the switch, wherein the local SM is associated with a currently used M_Key value. Then, the CD operates to update the local SM with a new M_Key, after receiving an instruction from a master CD that is associated with a master SM in the network environment.
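
The configuration daemon behavior can be sketched in a few lines. The class and method names are hypothetical, M_Key values are modeled as plain integers, and the rule that a new M_Key must appear in the stored list is an assumption rather than something the abstract states.

```python
from dataclasses import dataclass, field


@dataclass
class LocalSM:
    current_m_key: int  # the M_Key the local subnet manager currently uses


@dataclass
class ConfigDaemon:
    local_sm: LocalSM
    known_m_keys: list = field(default_factory=list)

    def receive_config(self, known_m_keys) -> None:
        """Store the configuration, i.e. the list of known M_Key values."""
        self.known_m_keys = list(known_m_keys)

    def on_master_instruction(self, new_m_key: int) -> bool:
        """Update the local SM when the master CD instructs a key change."""
        if new_m_key not in self.known_m_keys:
            # Assumption: keys absent from the stored list are refused.
            return False
        self.local_sm.current_m_key = new_m_key
        return True
```

Separating `receive_config` from `on_master_instruction` mirrors the two phases in the abstract: the list of known keys is distributed first, and the local SM changes its key only on the master's instruction.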
Abstract:
A system and method can ensure Internet Protocol (IP) address and node name consistency when performing remote transactions via multiple unrelated IP addresses for the same remote peer. The system can ensure that all cooperating peer nodes are in full agreement on the names and IP addresses at any point in time. In particular, when network configurations can be updated dynamically, the system can ensure that such updates do not lead to inconsistent or failed transactions because a peer node has a stale view of what addresses to use. Furthermore, the peer node that initiates the transaction can verify that all the other peer nodes have exactly the same view of the overall system configuration, in order to ensure that each distributed transaction is carried out using consistent address information.
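
One way the initiating peer could verify that all peers share its view is to compare digests of the name-to-address configuration. This is a sketch under that assumption; the abstract does not say how the views are compared, and the function names are hypothetical.

```python
import hashlib
import json


def config_digest(config: dict) -> str:
    """Digest of a name -> IP-address-list mapping, independent of ordering."""
    canonical = json.dumps(
        {name: sorted(addrs) for name, addrs in config.items()},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def peers_with_stale_view(initiator_config: dict, peer_digests: dict) -> list:
    """Return the names of peers whose reported digest differs from the initiator's."""
    expected = config_digest(initiator_config)
    return sorted(
        name for name, digest in peer_digests.items() if digest != expected
    )
```

The initiator would start the transaction only when `peers_with_stale_view` returns an empty list. Sorting addresses and keys before hashing keeps two peers that hold the same data in a different order from being flagged as inconsistent.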
Abstract:
Systems and methods for supporting redundant independent networks in a high performance computing environment. A method can provide, at a computer comprising one or more microprocessors, one or more switches, one or more racks, each of the one or more racks comprising a set of the one or more switches, each set of the one or more switches comprising at least a leaf switch, a plurality of host channel adapters, at least one of the plurality of host channel adapters comprising a firmware and a processor, and a plurality of hosts. The method can provision two or more rails, the two or more rails providing redundant connectivity between the plurality of hosts. The method can isolate data traffic between the plurality of hosts to a rail of the two or more rails.
Abstract:
A system and method can implement highly available Internet Protocol (IP) based communication across multiple independent communication paths. The system can have different IP addresses associated with different interfaces and communication paths and can implement communication fail-over as part of the communication layers above the IP layer, e.g. at the application level. The system can provide a balance between an average fail-over time and implementation complexity, and can achieve simplicity and robustness while providing high communication performance.
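
Fail-over above the IP layer can be sketched as an application-level loop over the peer's addresses, one per independent path. The `connect_with_failover` name and the injectable `connect` parameter are hypothetical; the ordered try-next strategy is one simple choice and is not taken from the abstract.

```python
import socket
from typing import Callable, Iterable, Tuple

Address = Tuple[str, int]


def connect_with_failover(
    addresses: Iterable[Address],
    timeout: float = 1.0,
    connect: Callable = socket.create_connection,
):
    """Try each (host, port) in order; return (address, connection) for the first that works."""
    last_error = None
    for address in addresses:
        try:
            return address, connect(address, timeout)
        except OSError as exc:
            # This path is unavailable; fail over to the next address.
            last_error = exc
    raise ConnectionError("no communication path available") from last_error
```

The per-attempt `timeout` is where the trade-off named in the abstract shows up: a shorter timeout lowers the average fail-over time but risks abandoning a path that is merely slow.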