Redundancy and load balancing in remote direct memory access communications
    1.
    Type: granted invention patent
    Status: in force

    Publication No.: US08880935B2

    Publication date: 2014-11-04

    Application No.: US13494831

    Filing date: 2012-06-12

    IPC classification: G06F11/00 G06F11/20

    Abstract: A system manages communications by adding a first Remote Direct Memory Access (RDMA) link between a TCP server and a TCP client, where the first RDMA link references a first remote memory buffer (RMB) and a second RMB, and is further based on a first remote direct memory access network interface card (RNIC) associated with the TCP server and a second RNIC associated with the TCP client. The system determines whether a third RNIC is enabled. The system adds a second RDMA link, responsive to a determination that the third RNIC is enabled. The system detects a failure in a failed RDMA link. The system reconfigures the first RDMA link to carry at least one TCP message of a connection formerly assigned to the failed RDMA link, responsive to detecting the failure. The system communicates at least one message of that connection on the first RDMA link.
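
    The link handling described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class names `RdmaLink` and `LinkGroup`, the RNIC labels, and the "fewest connections" balancing rule are all assumptions introduced here, since the abstract does not say how connections are distributed.

```python
from dataclasses import dataclass, field


@dataclass
class RdmaLink:
    """One RDMA link between a local and a remote RNIC (hypothetical model)."""
    name: str
    local_rnic: str
    remote_rnic: str
    active: bool = True
    connections: set = field(default_factory=set)


class LinkGroup:
    """Tracks the RDMA links that carry TCP connections between two peers."""

    def __init__(self):
        self.links = []

    def add_link(self, name, local_rnic, remote_rnic):
        link = RdmaLink(name, local_rnic, remote_rnic)
        self.links.append(link)
        return link

    def add_redundant_link(self, name, third_rnic, remote_rnic, enabled):
        # The second link is added only when the third RNIC is enabled.
        if not enabled:
            return None
        return self.add_link(name, third_rnic, remote_rnic)

    def assign(self, conn_id):
        # Assumed balancing rule: pick the active link with the fewest connections.
        candidates = [link for link in self.links if link.active]
        if not candidates:
            raise RuntimeError("no active RDMA link")
        link = min(candidates, key=lambda l: len(l.connections))
        link.connections.add(conn_id)
        return link

    def fail(self, link):
        # Failover: mark the link failed and move its connections to survivors.
        link.active = False
        orphaned = sorted(link.connections)
        link.connections.clear()
        for conn_id in orphaned:
            self.assign(conn_id)
```

    With two links in the group, calling `fail()` on one of them leaves every connection on the surviving link, which mirrors the reconfiguration step in the abstract.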


    Unified, workload-optimized, adaptive RAS for hybrid systems
    2.
    Type: granted invention patent
    Status: in force

    Publication No.: US08806269B2

    Publication date: 2014-08-12

    Application No.: US13170590

    Filing date: 2011-06-28

    IPC classification: G06F11/00

    Abstract: A method, system, and computer program product for maintaining reliability in a computer system. In an example embodiment, the method includes managing workloads on a first processor with a first processor architecture by an agent process executing on a second processor with a second processor architecture. The method proceeds by activating redundant computation on the second processor by the agent process. The method continues by performing a same computation from a workload of the workloads at least twice. Finally, the method includes comparing results of the same computation. In this embodiment, the first processor is coupled to the second processor by a network, and the first processor architecture and second processor architecture are different architectures.
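
    The "perform the same computation at least twice and compare" step can be illustrated with a short sketch. The function name `run_redundantly` and the majority-vote tie handling are assumptions made for illustration; the abstract only states that results are compared, and this sketch ignores the two-architecture setup entirely.

```python
from collections import Counter


def run_redundantly(computation, args, runs=2):
    """Run the same computation `runs` times and compare the results.

    Returns (value, agreed): the most common result and whether every
    run produced it. Results must be hashable for this simple comparison.
    """
    if runs < 2:
        raise ValueError("redundant computation needs at least two runs")
    results = [computation(*args) for _ in range(runs)]
    value, votes = Counter(results).most_common(1)[0]
    return value, votes == runs
```

    A caller could treat `agreed == False` as a signal to rerun the work or to raise the redundancy level.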


    High-throughput-computing in a hybrid computing environment
    3.
    Type: granted invention patent
    Status: in force

    Publication No.: US08739171B2

    Publication date: 2014-05-27

    Application No.: US12872761

    Filing date: 2010-08-31

    IPC classification: G06F9/46

    Abstract: Embodiments of the present invention provide high-throughput computing in a hybrid processing system. A set of high-throughput computing service level agreements (SLAs) is analyzed. The set of high-throughput computing SLAs is associated with a hybrid processing system. The hybrid processing system includes at least one server system that includes a first computing architecture and a set of accelerator systems each including a second computing architecture that is different from the first computing architecture. A first set of resources at the server system and a second set of resources at the set of accelerator systems are monitored. A set of data-parallel workload tasks is dynamically scheduled across at least one resource in the first set of resources and at least one resource in the second set of resources. The dynamic scheduling of the set of data-parallel workload tasks substantially satisfies the set of high-throughput computing SLAs.
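
    One way to picture scheduling data-parallel tasks across server and accelerator resources against an SLA is the greedy sketch below. Everything specific here is an assumption for illustration: the `Resource` fields, the "work units per second" speed metric, the earliest-finish heuristic, and the reduction of the SLA to a single completion deadline. The abstract does not describe the scheduling policy.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str              # "server" or "accelerator"
    speed: float           # work units per second (hypothetical metric)
    busy_until: float = 0.0


def schedule(tasks, resources, sla_deadline):
    """Greedily place each task on the resource that would finish it earliest.

    `tasks` maps a task name to its work units. Returns (placement, sla_met),
    where sla_met is True if all work finishes by `sla_deadline` seconds.
    """
    placement = {}
    # Largest tasks first, a common greedy heuristic.
    for name, work in sorted(tasks.items(), key=lambda t: -t[1]):
        best = min(resources, key=lambda r: r.busy_until + work / r.speed)
        best.busy_until += work / best.speed
        placement[name] = best.name
    makespan = max(r.busy_until for r in resources)
    return placement, makespan <= sla_deadline
```

    Note that `schedule` updates `busy_until` on the resources it is given, so a fresh resource list is needed for each independent run.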


    RESCHEDULING WORKLOAD IN A HYBRID COMPUTING ENVIRONMENT
    5.
    Type: invention patent application
    Status: in force

    Publication No.: US20120054771A1

    Publication date: 2012-03-01

    Application No.: US12872793

    Filing date: 2010-08-31

    IPC classification: G06F9/46

    Abstract: Embodiments of the present invention manage workloads in a high-throughput computing environment for a hybrid processing system. A set of high-throughput computing service level agreements (SLAs) is retrieved. The set of SLAs is associated with a hybrid processing system including a server system and a set of accelerator systems, where each system has a different architecture. A first set of data-parallel workload tasks scheduled on the server system and a second set of data-parallel workload tasks scheduled with the set of accelerator systems are identified. At least a portion of one of the first set of data-parallel workload tasks and the second set of data-parallel workload tasks is dynamically rescheduled on a second one of the server system and the set of accelerator systems. The dynamic rescheduling substantially satisfies the set of high-throughput computing SLAs.
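
    The rescheduling idea, moving part of one system's queued tasks to the other system until the SLA is met, might look like the sketch below. The function `reschedule`, the queue and rate representation, and the stopping rule are hypothetical choices made for this example; the abstract does not specify how the portion to move is chosen.

```python
def reschedule(server_queue, accel_queue, server_rate, accel_rate, sla_seconds):
    """Move queued tasks between the two systems until both meet the SLA.

    Queues are lists of positive work units; rates are work units per second.
    Returns new (server_queue, accel_queue) lists; the inputs are not modified.
    """
    server, accel = list(server_queue), list(accel_queue)

    while True:
        s = sum(server) / server_rate   # projected finish time on the server
        a = sum(accel) / accel_rate     # projected finish time on accelerators
        worst = max(s, a)
        if worst <= sla_seconds:
            break
        # Move work away from whichever system is projected to finish later.
        if s > a:
            src, dst, dst_rate = server, accel, accel_rate
        else:
            src, dst, dst_rate = accel, server, server_rate
        if not src:
            break
        # Stop if moving the next task would not improve the worst finish time.
        if (sum(dst) + src[-1]) / dst_rate >= worst:
            break
        dst.append(src.pop())
    return server, accel
```

    For example, with a slow server holding three tasks of 4 units each and a fast accelerator holding one task of 2 units, two tasks migrate to the accelerator before both projected finish times fall under a 6-second SLA.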


    Unified, workload-optimized, adaptive RAS for hybrid systems
    6.
    Type: granted invention patent
    Status: in force

    Publication No.: US08788871B2

    Publication date: 2014-07-22

    Application No.: US13170453

    Filing date: 2011-06-28

    IPC classification: G06F11/00

    Abstract: A method, system, and computer program product for maintaining reliability in a computer system. In an example embodiment, the method includes performing a first data computation by a first set of processors, the first set of processors having a first computer processor architecture. The method continues by performing a second data computation by a second processor coupled to the first set of processors, the second processor having a second computer processor architecture, the first computer processor architecture being different than the second computer processor architecture. Finally, the method includes dynamically allocating computational resources of the first set of processors and the second processor based on at least one metric while the first set of processors and the second processor are in operation such that the accuracy and processing speed of the first data computation and the second data computation are optimized.
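
    As a loose illustration of "allocating computational resources based on at least one metric", the sketch below splits a batch of work units in proportion to one score per processor group. Proportional splitting, the function name `allocate`, and the group names in the usage note are assumptions; the abstract does not name the metric or the allocation rule.

```python
def allocate(total_units, metrics):
    """Split `total_units` of work in proportion to each group's metric.

    `metrics` maps a processor-group name to a non-negative score, for
    example a measured throughput. Units left over after rounding down
    go to the highest-scoring groups, one each.
    """
    total_score = sum(metrics.values())
    if total_score <= 0:
        raise ValueError("at least one metric must be positive")
    shares = {g: int(total_units * m / total_score) for g, m in metrics.items()}
    leftover = total_units - sum(shares.values())
    for group in sorted(metrics, key=metrics.get, reverse=True)[:leftover]:
        shares[group] += 1
    return shares
```

    Calling `allocate` again with refreshed metrics while the system runs would give the "dynamic" behaviour; `allocate(10, {"cpu_set": 1.0, "accelerator": 3.0})` assigns 2 units to `cpu_set` and 8 to `accelerator`.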


    REDUNDANCY AND LOAD BALANCING IN REMOTE DIRECT MEMORY ACCESS COMMUNICATIONS
    7.
    Type: invention patent application
    Status: in force

    Publication No.: US20130332767A1

    Publication date: 2013-12-12

    Application No.: US13494831

    Filing date: 2012-06-12

    IPC classification: G06F11/07 G06F15/167

    Abstract: A system manages communications by adding a first Remote Direct Memory Access (RDMA) link between a TCP server and a TCP client, where the first RDMA link references a first remote memory buffer (RMB) and a second RMB, and is further based on a first remote direct memory access network interface card (RNIC) associated with the TCP server and a second RNIC associated with the TCP client. The system determines whether a third RNIC is enabled. The system adds a second RDMA link, responsive to a determination that the third RNIC is enabled. The system detects a failure in a failed RDMA link. The system reconfigures the first RDMA link to carry at least one TCP message of a connection formerly assigned to the failed RDMA link, responsive to detecting the failure. The system communicates at least one message of that connection on the first RDMA link.


    Rescheduling workload in a hybrid computing environment
    8.
    Type: granted invention patent
    Status: in force

    Publication No.: US08914805B2

    Publication date: 2014-12-16

    Application No.: US12872793

    Filing date: 2010-08-31

    IPC classification: G06F9/46 G06F9/50 G06F9/48

    Abstract: Embodiments of the present invention manage workloads in a high-throughput computing environment for a hybrid processing system. A set of high-throughput computing service level agreements (SLAs) is retrieved. The set of SLAs is associated with a hybrid processing system including a server system and a set of accelerator systems, where each system has a different architecture. A first set of data-parallel workload tasks scheduled on the server system and a second set of data-parallel workload tasks scheduled with the set of accelerator systems are identified. At least a portion of one of the first set of data-parallel workload tasks and the second set of data-parallel workload tasks is dynamically rescheduled on a second one of the server system and the set of accelerator systems. The dynamic rescheduling substantially satisfies the set of high-throughput computing SLAs.


    UNIFIED, WORKLOAD-OPTIMIZED, ADAPTIVE RAS FOR HYBRID SYSTEMS

    Publication No.: US20130007759A1

    Publication date: 2013-01-03

    Application No.: US13170453

    Filing date: 2011-06-28

    IPC classification: G06F9/46

    Abstract: A method, system, and computer program product for maintaining reliability in a computer system. In an example embodiment, the method includes performing a first data computation by a first set of processors, the first set of processors having a first computer processor architecture. The method continues by performing a second data computation by a second processor coupled to the first set of processors, the second processor having a second computer processor architecture, the first computer processor architecture being different than the second computer processor architecture. Finally, the method includes dynamically allocating computational resources of the first set of processors and the second processor based on at least one metric while the first set of processors and the second processor are in operation such that the accuracy and processing speed of the first data computation and the second data computation are optimized.