Workload management system and process

    Publication number: US10521260B2

    Publication date: 2019-12-31

    Application number: US15650357

    Filing date: 2017-07-14

    Abstract: A high performance computing (HPC) system has an architecture that separates data paths used by compute nodes exchanging computational data from the data paths used by compute nodes to obtain computational work units and save completed computations. The system enables an improved method of saving checkpoint data, and an improved method of using an analysis of the saved data to assign particular computational work units to particular compute nodes. The system includes a compute fabric and compute nodes that cooperatively perform a computation by mutual communication using the compute fabric. The system also includes a local data fabric that is coupled to the compute nodes, a memory, and a data node. The data node is configured to retrieve data for the computation from an external bulk data storage, and to store its work units in the memory for access by the compute nodes.
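
    A minimal sketch of the staging pattern the abstract describes, assuming a shared-memory work-unit queue: a "data node" thread stands in for the node that pulls work from bulk storage into memory, and "compute node" threads pull work units and record checkpoint results over that local data path rather than the compute fabric. The queue layout, thread model, and every identifier below are hypothetical illustrations, not the patented mechanism.

    /* Hypothetical producer/consumer sketch (C, pthreads); all names invented. */
    #include <pthread.h>
    #include <stdio.h>

    #define QUEUE_CAP    8
    #define NUM_UNITS    16
    #define NUM_COMPUTE  4

    typedef struct { int id; double payload; } work_unit;   /* hypothetical unit */

    static work_unit queue[QUEUE_CAP];
    static int head, tail, count, done;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

    /* Data node: stages work from (simulated) bulk storage into shared memory. */
    static void *data_node(void *arg) {
        (void)arg;
        for (int i = 0; i < NUM_UNITS; i++) {
            work_unit wu = { .id = i, .payload = i * 1.5 };  /* stand-in for bulk I/O */
            pthread_mutex_lock(&lock);
            while (count == QUEUE_CAP) pthread_cond_wait(&not_full, &lock);
            queue[tail] = wu; tail = (tail + 1) % QUEUE_CAP; count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
        }
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_broadcast(&not_empty);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Compute node: obtains a work unit over the data path, "computes",
     * and records a checkpoint result without touching the compute fabric. */
    static void *compute_node(void *arg) {
        long rank = (long)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == 0 && !done) pthread_cond_wait(&not_empty, &lock);
            if (count == 0 && done) { pthread_mutex_unlock(&lock); break; }
            work_unit wu = queue[head]; head = (head + 1) % QUEUE_CAP; count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);

            double result = wu.payload * wu.payload;          /* the "computation" */
            printf("node %ld: checkpoint unit %d -> %.2f\n", rank, wu.id, result);
        }
        return NULL;
    }

    int main(void) {
        pthread_t dn, cn[NUM_COMPUTE];
        pthread_create(&dn, NULL, data_node, NULL);
        for (long r = 0; r < NUM_COMPUTE; r++)
            pthread_create(&cn[r], NULL, compute_node, (void *)r);
        pthread_join(dn, NULL);
        for (int r = 0; r < NUM_COMPUTE; r++) pthread_join(cn[r], NULL);
        return 0;
    }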

    Caching network fabric for high performance computing

    Publication number: US10404800B2

    Publication date: 2019-09-03

    Application number: US15650296

    Filing date: 2017-07-14

    Abstract: An apparatus and method exchange data between two nodes of a high performance computing (HPC) system using a data communication link. The apparatus has one or more processing cores, RDMA engines, cache coherence engines, and multiplexers. The multiplexers may be programmed by a user application, for example through an API, to selectively couple either the RDMA engines, cache coherence engines, or a mix of these to the data communication link. Bulk data transfer to the nodes of the HPC system may be performed using paged RDMA during initialization. Then, during computation proper, random access to remote data may be performed using a coherence protocol (e.g. MESI) that operates on much smaller cache lines.
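
    A rough sketch of the host-side flow the abstract implies: a multiplexer is programmed, per remote node, to route traffic through either an RDMA engine or a cache coherence engine; bulk data moves page-by-page over RDMA during initialization, then fine-grained random access goes through a MESI-style coherence path at cache-line granularity. The functions fabric_select_path, rdma_read_pages, and coherent_read_line, the enum values, and the page and cache-line sizes are all assumptions standing in for whatever driver or user-level API a real system would expose.

    /* Hypothetical path-selection sketch in C; all APIs below are invented stubs. */
    #include <stdio.h>
    #include <string.h>

    enum fabric_path { PATH_RDMA, PATH_COHERENT };           /* hypothetical */

    #define PAGE_SIZE       4096
    #define CACHE_LINE_SIZE 64

    /* Stub: program the multiplexer for a given remote node (real hardware
     * would be configured through a driver or user-level API). */
    static void fabric_select_path(int remote_node, enum fabric_path p) {
        printf("node %d: mux set to %s\n", remote_node,
               p == PATH_RDMA ? "RDMA engine" : "coherence engine");
    }

    /* Stub: paged RDMA read of len bytes from a remote region into dst. */
    static void rdma_read_pages(int remote_node, size_t offset, void *dst, size_t len) {
        memset(dst, 0, len);                                  /* stand-in for DMA */
        printf("node %d: paged RDMA read of %zu bytes at offset %zu (%zu pages)\n",
               remote_node, len, offset, (len + PAGE_SIZE - 1) / PAGE_SIZE);
    }

    /* Stub: coherent load of one remote cache line; a MESI-style protocol
     * would fetch the line and track its state transparently. */
    static void coherent_read_line(int remote_node, size_t offset, void *dst) {
        memset(dst, 0, CACHE_LINE_SIZE);
        printf("node %d: coherent read of cache line at offset %zu\n",
               remote_node, offset);
    }

    int main(void) {
        static char bulk[8 * PAGE_SIZE];
        char line[CACHE_LINE_SIZE];
        int peer = 1;

        /* Initialization: bulk transfer over the RDMA engine, page at a time. */
        fabric_select_path(peer, PATH_RDMA);
        rdma_read_pages(peer, 0, bulk, sizeof bulk);

        /* Computation proper: fine-grained random access through coherence. */
        fabric_select_path(peer, PATH_COHERENT);
        coherent_read_line(peer, 3 * CACHE_LINE_SIZE, line);
        return 0;
    }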
