Abstract:
Cluster management software comprises a plurality of cluster agents, each associated with an HPC node that includes an integrated fabric and each operable to determine a status of its associated HPC node. The software further includes a cluster management engine communicably coupled with the plurality of HPC nodes and operable to execute an HPC job using a dynamically allocated subset of the plurality of HPC nodes, based on the determined status of those nodes.
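The relationship between the cluster agents and the cluster management engine can be illustrated with a minimal sketch. Everything named here (`NodeStatus`, `ClusterAgent`, `ClusterManagementEngine`, `allocate`) is hypothetical and chosen only for illustration; the abstract does not specify status values, an allocation policy, or any API. The sketch assumes three status values and a lowest-id-first allocation.

```python
from dataclasses import dataclass
from enum import Enum


class NodeStatus(Enum):
    # Assumed status values; the abstract only says a status is determined.
    AVAILABLE = "available"
    BUSY = "busy"
    FAILED = "failed"


@dataclass
class ClusterAgent:
    """Hypothetical per-node agent that reports the status of its HPC node."""
    node_id: int
    status: NodeStatus = NodeStatus.AVAILABLE

    def report_status(self) -> NodeStatus:
        return self.status


class ClusterManagementEngine:
    """Hypothetical engine that allocates nodes based on agent-reported status."""

    def __init__(self, agents):
        self.agents = {agent.node_id: agent for agent in agents}

    def allocate(self, count):
        """Return the ids of `count` available nodes and mark them busy.

        Returns None when fewer than `count` nodes report AVAILABLE.
        """
        available = [
            node_id
            for node_id, agent in sorted(self.agents.items())
            if agent.report_status() is NodeStatus.AVAILABLE
        ]
        if len(available) < count:
            return None
        chosen = available[:count]  # assumed policy: lowest node ids first
        for node_id in chosen:
            self.agents[node_id].status = NodeStatus.BUSY
        return chosen
```

In this sketch the subset is "dynamically allocated" only in the sense that it is recomputed from the current agent statuses on every call; a real scheduler would likely also weigh topology and job shape.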
Abstract:
A method for managing HPC node failure includes determining that one of a plurality of HPC nodes has failed, with each HPC node comprising an integrated fabric. The failed node is then removed from a virtual list of HPC nodes, with the virtual list comprising one logical entry for each of the plurality of HPC nodes.
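As a rough illustration of the virtual list described above, the following sketch keeps one logical entry per node and drops the entry of a node that has been determined to have failed. The class and method names (`VirtualNodeList`, `remove_failed`, `node_ids`) are assumptions made for this example, and how failure is detected is left outside the sketch.

```python
class VirtualNodeList:
    """Hypothetical virtual list holding one logical entry per HPC node."""

    def __init__(self, node_ids):
        # One logical entry per node, keyed by node id.
        self._entries = {node_id: {"node_id": node_id} for node_id in node_ids}

    def remove_failed(self, node_id):
        """Remove the entry for a failed node; return True if it was present."""
        return self._entries.pop(node_id, None) is not None

    def node_ids(self):
        """Return the ids of the nodes still present in the virtual list."""
        return sorted(self._entries)
```

For example, after `VirtualNodeList(range(4)).remove_failed(2)`, the remaining logical entries would be those for nodes 0, 1, and 3.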
Abstract:
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
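The job management steps above can be sketched as follows. This is only an illustration under stated assumptions: jobs are hypothetical `(name, nodes_needed)` tuples, the job at the head of the queue is the one selected (the abstract does not say how a job is chosen), and `dispatch_next_job` is an invented helper name.

```python
from collections import deque


def dispatch_next_job(all_nodes, allocated, job_queue):
    """Select the head job and assign it part of the unallocated node subset.

    all_nodes: iterable of node ids
    allocated: set of node ids already in use (updated in place)
    job_queue: deque of (name, nodes_needed) tuples, an assumed job format
    Returns (name, assigned_node_ids), or None if no job can be started.
    """
    # Step 1: determine the unallocated subset of nodes.
    unallocated = sorted(set(all_nodes) - set(allocated))
    if not job_queue:
        return None
    # Step 2: select a job from the queue (assumed policy: first in, first out).
    name, nodes_needed = job_queue[0]
    if nodes_needed > len(unallocated):
        return None  # not enough free nodes; leave the job queued
    job_queue.popleft()
    # Step 3: run the job on at least a portion of the unallocated subset.
    assigned = unallocated[:nodes_needed]
    allocated.update(assigned)
    return name, assigned
```

Leaving an oversized job at the head of the queue is a simplification; a scheduler could equally skip ahead to a smaller job, which the abstract neither requires nor rules out.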
Abstract:
A High Performance Computing (HPC) node comprises a motherboard, a switch comprising eight or more ports integrated on the motherboard, and at least two processors operable to execute an HPC job, with each processor communicably coupled to the integrated switch and integrated on the motherboard.