-
Publication number: US20180018242A1
Publication date: 2018-01-18
Application number: US15207943
Filing date: 2016-07-12
Applicant: Advanced Micro Devices, Inc.
Inventor: Sergey Blagodurov, Taniya Siddiqua, Vilas Sridharan
CPC classification number: G06F11/1471, G06F2201/805, G06F2201/84
Abstract: Methods and apparatus presented herein provide distributed checkpointing in a multi-node system, such as a network of servers in a data center. When checkpointing of application state data is needed in a node, the methods and apparatus determine whether checkpoint memory space is available in the node for checkpointing the application state data. If not enough checkpoint memory space is available in the node, the methods and apparatus request and find additional checkpoint memory space from other nodes in the system. In this manner, the methods and apparatus can checkpoint the application state data into available checkpoint memory spaces distributed among a plurality of nodes. This allows for high bandwidth and low latency checkpointing operations in the multi-node system.
-
Publication number: US10073746B2
Publication date: 2018-09-11
Application number: US15207943
Filing date: 2016-07-12
Applicant: Advanced Micro Devices, Inc.
Inventor: Sergey Blagodurov, Taniya Siddiqua, Vilas Sridharan
CPC classification number: G06F3/0604, G06F3/0631, G06F3/067, G06F11/2058, G06F11/2069, G06F2201/84
Abstract: Methods and apparatus presented herein provide distributed checkpointing in a multi-node system, such as a network of servers in a data center. When checkpointing of application state data is needed in a node, the methods and apparatus determine whether checkpoint memory space is available in the node for checkpointing the application state data. If not enough checkpoint memory space is available in the node, the methods and apparatus request and find additional checkpoint memory space from other nodes in the system. In this manner, the methods and apparatus can checkpoint the application state data into available checkpoint memory spaces distributed among a plurality of nodes. This allows for high bandwidth and low latency checkpointing operations in the multi-node system.
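Illustrative sketch: the placement logic described in the abstract (use local checkpoint memory first, then request additional space from other nodes) might look roughly like the Python below. This is not taken from the patent. The `Node` class, the `checkpoint` function, the byte-level splitting, and the peer ordering are all assumptions made for illustration; the abstract does not say how remote space is discovered or how data is divided.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node with a fixed pool of checkpoint memory (in bytes)."""
    name: str
    capacity: int
    used: int = 0
    chunks: dict = field(default_factory=dict)  # checkpoint id -> stored bytes

    def free(self) -> int:
        return self.capacity - self.used

    def store(self, ckpt_id: str, data: bytes) -> None:
        if len(data) > self.free():
            raise ValueError("not enough checkpoint memory")
        self.chunks[ckpt_id] = self.chunks.get(ckpt_id, b"") + data
        self.used += len(data)


def checkpoint(local: Node, peers: list, ckpt_id: str, state: bytes) -> list:
    """Place `state` locally first, then spill the remainder to peers.

    Returns a placement list of (node name, byte count) in storage order.
    Raises MemoryError, without storing anything, if the nodes together
    do not have enough free checkpoint memory.
    """
    candidates = [local] + [p for p in peers if p is not local]
    if sum(n.free() for n in candidates) < len(state):
        raise MemoryError("insufficient checkpoint memory across all nodes")

    placement = []
    offset = 0
    for node in candidates:
        if offset == len(state):
            break
        take = min(node.free(), len(state) - offset)
        if take == 0:
            continue  # this node has no free checkpoint memory
        node.store(ckpt_id, state[offset:offset + take])
        placement.append((node.name, take))
        offset += take
    return placement
```

In this sketch the placement list is what a restore step would need in order to reassemble the state from the participating nodes. A real system would also have to handle network transfer, concurrent requests, and node failure, none of which is modeled here.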
-