METHOD AND APPARATUS FOR CONTROLLING DISTRIBUTED OPERATION SYSTEM, AND DEVICE, MEDIUM AND PROGRAM PRODUCT

    Publication No.: EP4224317A1

    Publication date: 2023-08-09

    Application No.: EP22847523.2

    Application date: 2022-06-07

    IPC classification: G06F9/455

    Abstract: The present disclosure provides a method and an apparatus for controlling a distributed operation system, as well as a device, a medium and a program product, relating to the field of computer application technology and in particular to distributed operation technology. A specific implementation includes: for a first container carrying a first process, determining a current fault type of a failure in the first container in response to detecting that the first process has been triggered to terminate by that failure; and, in response to determining that the current fault type is consistent with a target fault type, reconstructing the first container and restarting the first process based on the reconstructed first container. In the present disclosure, a container is reconstructed when its fault type allows reconstruction to succeed, and is not reconstructed when its fault type does not, so as to save system operation costs and meet operation requirements.
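
    The decision described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the fault categories, the membership of the target set, and the names `FaultType`, `Container` and `handle_process_termination` are all assumptions, since the abstract names none of them.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FaultType(Enum):
    # Hypothetical fault categories; the abstract does not enumerate any.
    OUT_OF_MEMORY = auto()
    NODE_LOST = auto()
    USER_CODE_ERROR = auto()


# Assumed "target" fault types, i.e. faults after which a rebuilt container
# is expected to run successfully.
TARGET_FAULT_TYPES = frozenset({FaultType.OUT_OF_MEMORY, FaultType.NODE_LOST})


@dataclass
class Container:
    name: str
    generation: int = 0           # incremented each time the container is rebuilt
    process_running: bool = True  # state of the process carried by the container


def handle_process_termination(container: Container, fault: FaultType) -> bool:
    """Handle a process that terminated because of a container failure.

    Returns True if the container was reconstructed and the process
    restarted, False if the fault type is not a target fault type.
    """
    container.process_running = False
    if fault not in TARGET_FAULT_TYPES:
        # Rebuilding would not help, so skip it and avoid the cost.
        return False
    container.generation += 1         # stands in for reconstructing the container
    container.process_running = True  # stands in for restarting the process
    return True
```

    In this sketch a container hit by `FaultType.NODE_LOST` is rebuilt and its process restarted, while one hit by `FaultType.USER_CODE_ERROR` is left stopped, which mirrors the cost-saving rationale stated in the abstract.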

METHOD AND APPARATUS FOR PROCESSING DEVELOPMENT MACHINE OPERATION TASK, DEVICE AND STORAGE MEDIUM

    Publication No.: EP3869336A1

    Publication date: 2021-08-25

    Application No.: EP21161072.0

    Application date: 2021-03-05

    IPC classification: G06F9/50

    Abstract: The present application discloses a method and an apparatus for processing a development machine operation task, a device and a storage medium, and relates to the field of deep learning in artificial intelligence. The specific implementation is: receiving a task creation request initiated by a client; generating the development machine operation task according to the task creation request; allocating, for the development machine operation task, a target graphics processing unit (GPU) required to execute it; and sending a development machine operation task request to a master node among the cluster nodes, where the task request asks for the development machine operation task to be executed on the target GPU. Compared with the prior art, the present application executes the development machine operation task on the GPU in a Docker container, which can directly use the operating system of the local host, thereby improving the hardware utilization of the physical machine.
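
    The four steps in the abstract (receive request, generate task, allocate GPU, notify master node) can be sketched as below. This is an illustrative outline under stated assumptions, not the applicant's implementation: the request fields, the first-free allocation policy, and the names `Task`, `GpuPool`, `handle_create_request` and `send_to_master` are hypothetical, and the master node is stood in for by a plain callable.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, Iterable, Optional

_task_ids = count(1)  # simple in-process task ID generator (assumption)


@dataclass
class Task:
    task_id: int
    image: str                    # container image for the development machine
    gpu_id: Optional[int] = None  # target GPU, filled in on allocation


class GpuPool:
    """Tracks free GPUs; hands out the first free one (assumed policy)."""

    def __init__(self, gpu_ids: Iterable[int]):
        self._free = list(gpu_ids)

    def allocate(self) -> int:
        if not self._free:
            raise RuntimeError("no free GPU available")
        return self._free.pop(0)


def handle_create_request(
    request: dict,
    pool: GpuPool,
    send_to_master: Callable[[dict], None],
) -> Task:
    # 1. Generate the task from the client's creation request.
    task = Task(task_id=next(_task_ids), image=request["image"])
    # 2. Allocate the target GPU required to execute the task.
    task.gpu_id = pool.allocate()
    # 3. Ask the master node to execute the task on that GPU.
    send_to_master(
        {"task_id": task.task_id, "image": task.image, "gpu_id": task.gpu_id}
    )
    return task
```

    Passing `send_to_master` as a callable keeps the sketch independent of any particular cluster API; in a real deployment it would wrap whatever transport the master node exposes.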