Abstract:
A system for managing communications adds a first Remote Direct Memory Access (RDMA) link between a TCP server and a TCP client. The first RDMA link references a first remote memory buffer (RMB) and a second RMB, and is further based on a first RDMA network interface card (RNIC) associated with the TCP server and a second RNIC associated with the TCP client. The system determines whether a third RNIC is enabled and, responsive to a determination that the third RNIC is enabled, adds a second RDMA link. The system detects a failure of an RDMA link. Responsive to detecting the failure, the system reconfigures the first RDMA link to carry at least one TCP message of a connection formerly assigned to the failed RDMA link. The system then communicates at least one message of that connection on the first RDMA link.
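The failover behavior described in this abstract can be illustrated with a minimal sketch. It is a toy in-memory model, not the patented implementation: the names `RdmaLink`, `LinkGroup`, `maybe_add_link`, `fail_over`, and `send` are hypothetical, and no real RDMA verbs or RNIC APIs are involved.

```python
from dataclasses import dataclass, field


@dataclass
class RdmaLink:
    """Hypothetical stand-in for one RDMA link between a server RNIC and a client RNIC."""
    name: str
    healthy: bool = True
    connections: set = field(default_factory=set)  # TCP connection ids carried by this link


class LinkGroup:
    """Toy model of a group of RDMA links between one TCP server and one TCP client."""

    def __init__(self, first_link):
        self.links = [first_link]

    def maybe_add_link(self, name, rnic_enabled):
        # Add another link only when the additional RNIC is reported as enabled.
        if not rnic_enabled:
            return None
        link = RdmaLink(name)
        self.links.append(link)
        return link

    def fail_over(self, failed):
        # Mark the link as failed and move its connections to a surviving link.
        failed.healthy = False
        survivors = [link for link in self.links if link.healthy]
        if not survivors:
            raise RuntimeError("no healthy RDMA link left in the group")
        target = survivors[0]
        target.connections |= failed.connections
        failed.connections = set()
        return target

    def send(self, conn_id, message):
        # Report which healthy link carries the message for this connection.
        for link in self.links:
            if link.healthy and conn_id in link.connections:
                return f"{link.name}:{conn_id}:{message}"
        raise KeyError(conn_id)
```

In this sketch, a connection that was assigned to a failed link keeps working because `send` finds it on the surviving link after `fail_over` has run.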
Abstract:
A method, system, and computer program product for maintaining reliability in a computer system. In an example embodiment, the method includes managing workloads on a first processor with a first processor architecture by an agent process executing on a second processor with a second processor architecture. The method proceeds by activating redundant computation on the second processor by the agent process. The method continues by performing a same computation from a workload of the workloads at least twice. Finally, the method includes comparing results of the same computation. In this embodiment, the first processor is coupled to the second processor by a network, and the first processor architecture and the second processor architecture are different architectures.
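The core idea of performing the same computation at least twice and comparing the results can be sketched as follows. This is a single-process illustration only; the function name `run_redundantly` and its parameters are hypothetical, and the agent process, the network, and the two processor architectures from the abstract are not modeled.

```python
def run_redundantly(computation, args, runs=2):
    """Run the same computation `runs` times and report whether all results agree.

    Returns a pair (first_result, agree). A disagreement suggests that at
    least one run was affected by a fault.
    """
    if runs < 2:
        raise ValueError("redundant execution needs at least two runs")
    results = [computation(*args) for _ in range(runs)]
    agree = all(result == results[0] for result in results[1:])
    return results[0], agree
```

A caller might retry or escalate when `agree` is false; that policy is left out here because the abstract does not specify one.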
Abstract:
Embodiments of the present invention provide high-throughput computing in a hybrid processing system. A set of high-throughput computing service level agreements (SLAs) is analyzed. The set of high-throughput computing SLAs is associated with a hybrid processing system. The hybrid processing system includes at least one server system that includes a first computing architecture and a set of accelerator systems, each including a second computing architecture that is different from the first computing architecture. A first set of resources at the server system and a second set of resources at the set of accelerator systems are monitored. A set of data-parallel workload tasks is dynamically scheduled across at least one resource in the first set of resources and at least one resource in the second set of resources. The dynamic scheduling of the set of data-parallel workload tasks substantially satisfies the set of high-throughput computing SLAs.
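One simple way to picture scheduling data-parallel tasks across server and accelerator resources against an SLA is a greedy earliest-finish assignment. The sketch below swaps in that generic heuristic as an assumption; the abstract does not name a scheduling algorithm. The function `schedule`, the resource names, and the reduction of the SLA to a single deadline are all hypothetical.

```python
def schedule(task_costs, resource_speeds, sla_deadline):
    """Greedily assign each task to the resource that would finish it soonest.

    task_costs: work units per task.
    resource_speeds: mapping of resource name to work units per second.
    sla_deadline: latest acceptable completion time in seconds (assumed SLA form).
    Returns (assignment, makespan, sla_met).
    """
    finish = {name: 0.0 for name in resource_speeds}
    assignment = {name: [] for name in resource_speeds}
    # Place the largest tasks first so small tasks can fill the remaining gaps.
    for index, cost in sorted(enumerate(task_costs), key=lambda item: -item[1]):
        best = min(resource_speeds,
                   key=lambda name: finish[name] + cost / resource_speeds[name])
        finish[best] += cost / resource_speeds[best]
        assignment[best].append(index)
    makespan = max(finish.values())
    return assignment, makespan, makespan <= sla_deadline
```

For example, with task costs `[8, 4, 2]`, a server resource at 1 unit per second, and an accelerator at 4 units per second, the two largest tasks go to the accelerator and the smallest to the server.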
Abstract:
Embodiments of the present invention manage workloads in a high-throughput computing environment for a hybrid processing system. A set of high-throughput computing service level agreements (SLAs) is retrieved. The set of SLAs is associated with a hybrid processing system including a server system and a set of accelerator systems, where each system has a different architecture. A first set of data-parallel workload tasks scheduled on the server system and a second set of data-parallel workload tasks scheduled with the set of accelerator systems are identified. At least a portion of one of the first set of data-parallel workload tasks and the second set of data-parallel workload tasks is dynamically rescheduled on a second one of the server system and the set of accelerator systems. The dynamic rescheduling substantially satisfies the set of high-throughput computing SLAs.
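The rescheduling step can be pictured as moving queued tasks away from whichever side would otherwise miss the SLA. The following is a minimal sketch under stated assumptions: the SLA is reduced to one completion-time bound, task sizes and processing rates are known, and the name `rebalance` and its parameters are hypothetical rather than taken from the source.

```python
def rebalance(server_tasks, accel_tasks, server_rate, accel_rate, sla_seconds):
    """Move queued tasks off the side that would miss the SLA, one task at a time.

    Task lists hold work units; rates are work units per second.
    Returns the new (server_tasks, accel_tasks) queues.
    """
    server, accel = list(server_tasks), list(accel_tasks)

    def eta(tasks, rate):
        # Estimated time to drain a queue at the given rate.
        return sum(tasks) / rate

    # Shift work from the server to the accelerators while that helps.
    while server and eta(server, server_rate) > sla_seconds:
        if eta(accel + [server[-1]], accel_rate) > sla_seconds:
            break  # the accelerators cannot absorb more without missing the SLA
        accel.append(server.pop())
    # Shift work in the other direction if the accelerators are overloaded.
    while accel and eta(accel, accel_rate) > sla_seconds:
        if eta(server + [accel[-1]], server_rate) > sla_seconds:
            break
        server.append(accel.pop())
    return server, accel
```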
Abstract:
A method, system, and computer program product for maintaining reliability in a computer system. In an example embodiment, the method includes performing a first data computation by a first set of processors, the first set of processors having a first computer processor architecture. The method continues by performing a second data computation by a second processor coupled to the first set of processors, the second processor having a second computer processor architecture, the first computer processor architecture being different from the second computer processor architecture. Finally, the method includes dynamically allocating computational resources of the first set of processors and the second processor based on at least one metric while the first set of processors and the second processor are in operation, such that the accuracy and processing speed of the first data computation and the second data computation are optimized.
Abstract:
A computer-implemented program product and data processing system for receiving data at a targeted logical partition. A computer locates a buffer element in reliance on a connection status bit array. The computer copies control information to the targeted logical partition's local storage. The computer updates the targeted logical partition's local producer cursor based on the control information. The computer copies data to an application receive buffer. The computer determines that an application has completed a receive operation. Responsive to a determination that the application completed the receive operation, the computer updates the targeted logical partition's local consumer cursor to match the targeted logical partition's producer cursor.
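The receive sequence in this abstract can be walked through with a small sketch. The data layout is entirely hypothetical: the dictionaries, the key names, the `receive` function, and the step that clears the status bit are assumptions made for illustration, and the application's receive is treated as complete as soon as the data has been copied.

```python
def receive(status_bits, buffer_elements, local):
    """One receive pass for a targeted logical partition (hypothetical layout).

    status_bits: list of booleans, True where a buffer element has pending data.
    buffer_elements: list of dicts with "control" and "data" entries.
    local: the partition's local storage, holding both cursors.
    Raises ValueError if no status bit is set.
    """
    # 1. Locate the buffer element using the connection status bit array.
    index = status_bits.index(True)
    element = buffer_elements[index]
    # 2. Copy the control information into the partition's local storage.
    local["control"] = dict(element["control"])
    # 3. Update the local producer cursor from the control information.
    local["producer_cursor"] = local["control"]["producer_cursor"]
    # 4. Copy the unread data to the application receive buffer.
    app_buffer = element["data"][local["consumer_cursor"]:local["producer_cursor"]]
    # 5. Receive complete: the consumer cursor catches up with the producer cursor.
    local["consumer_cursor"] = local["producer_cursor"]
    status_bits[index] = False  # assumption: the bit is cleared once data is consumed
    return app_buffer
```

After a pass, the two cursors are equal, which in this model means the partition has no unread data left in that element.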