Fault management in a distributed computer system

    Publication No.: US11966292B2

    Publication date: 2024-04-23

    Application No.: US17804392

    Filing date: 2022-05-27

    Abstract: In some examples, a distributed computer system includes a plurality of computer nodes, where the plurality of computer nodes include respective programs that cooperate to perform a workload. A first computer node includes a communication proxy between the program of the first computer node and a communication library that supports communications between the program of the first computer node and the programs of other computer nodes of the plurality of computer nodes, and a fault management service to monitor the health of the other computer nodes and, in response to detecting a fault of a second computer node of the plurality of computer nodes, relaunch the communication proxy. The relaunched communication proxy selects, from a plurality of states, a common state to which the programs are to roll back.
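
The abstract does not say how the common rollback state is chosen. As a minimal sketch, assume each program reports the set of checkpoint ids it can restore, and that the common state is the newest checkpoint held by every program. The function name `select_common_state` and the integer checkpoint ids are hypothetical, not taken from the patent.

```python
from typing import Dict, Optional, Set


def select_common_state(checkpoints: Dict[str, Set[int]]) -> Optional[int]:
    """Return the newest checkpoint id held by every node, or None.

    `checkpoints` maps a node name to the checkpoint ids that node's
    program can restore (an assumed representation of its "states").
    """
    if not checkpoints:
        return None
    # Only checkpoints present on every node are safe rollback targets.
    common = set.intersection(*checkpoints.values())
    return max(common) if common else None


if __name__ == "__main__":
    states = {"node1": {1, 2, 3}, "node2": {1, 2}, "node3": {2, 3}}
    print(select_common_state(states))
```

Choosing the newest shared checkpoint minimizes lost work under this assumption; a real system could apply a different selection rule.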

    Distributed network address discovery in non-uniform networks

    Publication No.: US11909816B2

    Publication date: 2024-02-20

    Application No.: US17648645

    Filing date: 2022-01-21

    Abstract: Distributed network address discovery can be performed in non-uniform node networks. For a client request for a service, a network management component (NMC) can determine a network address space associated with the client based on a network identifier associated with the client or a node identifier. The NMC can determine a group of candidate nodes (CN group) from a group of nodes based on the network address space and the network addresses associated with the nodes of the node group. From the CN group, the NMC can determine a group of available candidate nodes (ACN group) that are available and able to process the request and perform the service, based on operational statuses associated with the nodes of the CN group or services associated with those nodes. From the ACN group, the NMC can determine a ranked list of network addresses associated with available nodes that can process the request, based on defined service performance criteria.
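
The three filtering stages in the abstract (node group to CN group to ACN group to ranked list) can be sketched as follows. This is an illustration under stated assumptions: the address space is a CIDR block, the operational status is a boolean, and the "defined service performance criteria" is a single latency figure. The `Node` class and `rank_addresses` function are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    address: str                      # network address of the node
    up: bool                          # assumed operational status
    services: Set[str] = field(default_factory=set)
    latency_ms: float = 0.0           # assumed performance criterion


def rank_addresses(nodes: List[Node], client_space: str, service: str) -> List[str]:
    """Return addresses of usable nodes, best (lowest latency) first."""
    space = ipaddress.ip_network(client_space)
    # CN group: nodes whose address falls in the client's address space.
    candidates = [n for n in nodes if ipaddress.ip_address(n.address) in space]
    # ACN group: candidates that are up and offer the requested service.
    available = [n for n in candidates if n.up and service in n.services]
    # Ranked list ordered by the assumed performance criterion.
    return [n.address for n in sorted(available, key=lambda n: n.latency_ms)]


if __name__ == "__main__":
    nodes = [
        Node("10.0.0.5", True, {"dns"}, 12.0),
        Node("10.0.0.6", True, {"dns"}, 3.0),
        Node("10.0.0.7", False, {"dns"}, 1.0),
        Node("192.168.1.2", True, {"dns"}, 0.5),
    ]
    print(rank_addresses(nodes, "10.0.0.0/24", "dns"))
```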

    Intelligent multi-path call home
    Granted invention patent

    Publication No.: US11799944B1

    Publication date: 2023-10-24

    Application No.: US18082068

    Filing date: 2022-12-15

    CPC classification: H04L67/025 H04L67/1029

    Abstract: A method for an intelligent multi-path call home includes detecting, at a BMC, an error in a computing device managed by the BMC and sending a call home message to a management server. The computing device is one of a plurality of computing devices each with a BMC in communication with the management server. The management server is programmed to relay the call home message to a call home destination remote from the computing devices and management server. The method includes determining that the management server failed to receive the call home message and/or failed to successfully relay the call home message to the call home destination, and transmitting, from the BMC, the call home message to the call home destination in response to determining that the management server failed to receive the call home message and/or failed to successfully relay the call home message to the call home destination.
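
The fallback logic described in the abstract can be sketched as a two-path send. This is a minimal illustration, not the patented method: the transports are passed in as plain callables, a `False` return or an `OSError` stands in for "the management server failed to receive or relay the message", and the name `call_home` is hypothetical.

```python
from typing import Callable


def call_home(message: str,
              send_via_server: Callable[[str], bool],
              send_direct: Callable[[str], bool]) -> str:
    """Try the management-server path first, then the direct path.

    Each callable returns True on confirmed delivery (an assumed
    contract). Returns "relayed", "direct", or "failed".
    """
    try:
        relayed = send_via_server(message)
    except OSError:
        # Connection and timeout errors count as a failed relay.
        relayed = False
    if relayed:
        return "relayed"
    # Second path: the BMC contacts the call home destination itself.
    return "direct" if send_direct(message) else "failed"


if __name__ == "__main__":
    def server_down(msg: str) -> bool:
        raise ConnectionError("management server unreachable")

    print(call_home("fan failure", server_down, lambda msg: True))
```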

    Computer system and scale-up management method

    Publication No.: US11778020B2

    Publication date: 2023-10-03

    Application No.: US17902116

    Filing date: 2022-09-02

    Applicant: Hitachi, Ltd.

    Abstract: The aim is to make it possible to readily and rapidly scale up the servers that execute one application.
    In a computer system that includes one or more compute servers, each having an application container (the execution unit) that executes the one application, and a management server that manages the compute servers, the management server is configured to, when increasing the number of compute servers that each have an execution unit executing the one application, identify the logic unit that stores the data unit used by the execution unit of an existing compute server when executing the application, and, when the execution unit of a newly added compute server executes the application, configure the newly added compute server to refer to the identified logic unit.
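
The scale-up step can be sketched as a lookup followed by a configuration change: find the logic unit an existing server already uses for the application, then point the new server at the same unit. The classes `ComputeServer` and `ManagementServer`, the `logic_units` mapping, and the string ids are all hypothetical and only illustrate the idea in the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComputeServer:
    name: str
    # Maps an application name to the id of the logic unit holding its data.
    logic_units: Dict[str, str] = field(default_factory=dict)


class ManagementServer:
    def __init__(self) -> None:
        self.servers: List[ComputeServer] = []

    def scale_up(self, app: str, new_server: ComputeServer) -> str:
        """Attach new_server to the logic unit already used for app."""
        # Identify the logic unit from any existing server running the app.
        unit = next(
            (s.logic_units[app] for s in self.servers if app in s.logic_units),
            None,
        )
        if unit is None:
            raise LookupError(f"no existing server runs {app!r}")
        # Configure the new server to refer to the same logic unit.
        new_server.logic_units[app] = unit
        self.servers.append(new_server)
        return unit


if __name__ == "__main__":
    mgmt = ManagementServer()
    mgmt.servers.append(ComputeServer("cs1", {"db": "lu-7"}))
    print(mgmt.scale_up("db", ComputeServer("cs2")))
```

Referring to the existing unit rather than copying its data is what makes the scale-up fast in this sketch; how concurrent access to the shared unit is handled is outside what the abstract states.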