Locating hardware faults in a parallel computer
    121.
    Granted invention patent
    Locating hardware faults in a parallel computer (Lapsed)

    Publication number: US07697443B2

    Publication date: 2010-04-13

    Application number: US11279592

    Filing date: 2006-04-13

    Abstract: Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.
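The partitioning described in this abstract can be illustrated with a small sketch. It assumes a complete binary tree stored in heap order (node 0 is the root, node i has children 2i+1 and 2i+2) and test levels of exactly two tiers; neither assumption comes from the patent, and the function names are hypothetical. The uplink and downlink tests themselves are not modeled; the sketch only checks that the two sets of test levels together cover every link.

```python
def build_test_level_sets(num_tiers):
    """Return two sets of two-tier test levels, each level a (top_tier, bottom_tier) pair.

    Levels inside one set share no tier, so they do not overlap.
    """
    set_a = [(t, t + 1) for t in range(0, num_tiers - 1, 2)]
    set_b = [(t, t + 1) for t in range(1, num_tiers - 1, 2)]
    return [set_a, set_b]


def cells_in_level(level):
    """Return the test cells of a level as (subtree_root, [descendants within the level])."""
    top_tier, _ = level
    first = 2 ** top_tier - 1  # heap index of the first node in the top tier
    return [(root, [2 * root + 1, 2 * root + 2])
            for root in range(first, first + 2 ** top_tier)]


def covered_links(num_tiers):
    """Collect every (parent, child) link that belongs to some test cell."""
    links = set()
    for level_set in build_test_level_sets(num_tiers):
        for level in level_set:
            for root, children in cells_in_level(level):
                links.update((root, child) for child in children)
    return links


NUM_TIERS = 4  # illustrative tree: 15 nodes, 14 links
level_sets = build_test_level_sets(NUM_TIERS)
links = covered_links(NUM_TIERS)
all_links = {((n - 1) // 2, n) for n in range(1, 2 ** NUM_TIERS - 1)}
```

With four tiers, the first set holds the levels (0, 1) and (2, 3) and the second holds (1, 2), so each link falls in exactly one test cell of one set.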

    Performing an Allreduce Operation Using Shared Memory
    122.
    Invention patent application
    Performing an Allreduce Operation Using Shared Memory (In force)

    Publication number: US20080301683A1

    Publication date: 2008-12-04

    Application number: US11754782

    Filing date: 2007-05-29

    IPC classification: G06F9/46

    CPC classification: G06F9/4843 G06F9/52 G06F9/546

    Abstract: Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
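The work-unit scheme in this abstract can be sketched roughly as follows. Threads stand in for processing cores, a plain list stands in for shared memory, and the reduction is assumed to be an element-wise sum; the `JobStatus` class, the chunking into index ranges, and all other names are hypothetical illustrations rather than the patent's design.

```python
import threading


class JobStatus:
    """Hypothetical job status object: lists work units and tracks the next unclaimed one."""

    def __init__(self, work_units):
        self.work_units = work_units
        self._next = 0
        self._lock = threading.Lock()

    def claim_next(self):
        """Return the next unclaimed work unit, or None when all are taken."""
        with self._lock:
            if self._next >= len(self.work_units):
                return None
            unit = self.work_units[self._next]
            self._next += 1
            return unit


def run_allreduce(buffers, chunk_size=2, num_cores=4):
    """Sum the per-core buffers element-wise; each work unit covers one index range."""
    length = len(buffers[0])
    result = [0] * length  # stand-in for the shared-memory result buffer
    units = [(start, min(start + chunk_size, length))
             for start in range(0, length, chunk_size)]
    job = JobStatus(units)

    def core_loop():
        # An available core repeatedly claims and performs the next work unit.
        while (unit := job.claim_next()) is not None:
            start, stop = unit
            for i in range(start, stop):
                result[i] = sum(buf[i] for buf in buffers)

    cores = [threading.Thread(target=core_loop) for _ in range(num_cores)]
    for core in cores:
        core.start()
    for core in cores:
        core.join()
    return result


buffers = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
total = run_allreduce(buffers)
```

Because each work unit writes a disjoint index range, the only shared state that needs a lock in this sketch is the counter inside the job status object.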

    Performing an allreduce operation using shared memory
    123.
    Granted invention patent
    Performing an allreduce operation using shared memory (Lapsed)

    Publication number: US08752051B2

    Publication date: 2014-06-10

    Application number: US13427057

    Filing date: 2012-03-22

    IPC classification: G06F9/46 G06F9/48 G06F9/52

    CPC classification: G06F9/4843 G06F9/52 G06F9/546

    Abstract: Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.

    PIPELINING PROTOCOLS IN MISALIGNED BUFFER CASES
    124.
    Invention patent application
    PIPELINING PROTOCOLS IN MISALIGNED BUFFER CASES (In force)

    Publication number: US20110271006A1

    Publication date: 2011-11-03

    Application number: US12769972

    Filing date: 2010-04-29

    IPC classification: G06F15/16

    CPC classification: G06F15/17318

    Abstract: Systems, methods and articles of manufacture are disclosed for effecting a desired collective operation on a parallel computing system that includes multiple compute nodes. The compute nodes may pipeline multiple collective operations to effect the desired collective operation. To select protocols suitable for the multiple collective operations, the compute nodes may also perform additional collective operations. The compute nodes may pipeline the multiple collective operations and/or the additional collective operations to effect the desired collective operation more efficiently.

    Configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks
    125.
    Granted invention patent
    Configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks (Lapsed)

    Publication number: US07673011B2

    Publication date: 2010-03-02

    Application number: US11837015

    Filing date: 2007-08-10

    Abstract: Methods, apparatus, and products are disclosed for configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks, the compute nodes in the operational group connected together for data communications through a global combining network, that include: partitioning the compute nodes in the operational group into a plurality of non-overlapping subgroups; designating one compute node from each of the non-overlapping subgroups as a master node; and assigning, to the compute nodes in each of the non-overlapping subgroups, class routing instructions that organize the compute nodes in that non-overlapping subgroup as a collective network such that the master node is a physical root.
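The three steps in this abstract (partition, designate a master, assign class routing) can be sketched as below. The contiguous split, the choice of the lowest rank as master, and the dictionary used to represent a class routing instruction are all assumptions made for illustration; the abstract does not specify any of them, and real class routes would be hardware register settings rather than Python dictionaries.

```python
def partition_into_subgroups(ranks, num_subgroups):
    """Split ranks into contiguous, non-overlapping subgroups of near-equal size."""
    size, extra = divmod(len(ranks), num_subgroups)
    subgroups, start = [], 0
    for i in range(num_subgroups):
        stop = start + size + (1 if i < extra else 0)
        subgroups.append(ranks[start:stop])
        start = stop
    return subgroups


def configure_collective_networks(ranks, num_subgroups):
    """Map each rank to a hypothetical class-routing record for its subgroup.

    Assumes num_subgroups <= len(ranks), so no subgroup is empty.
    """
    routing = {}
    for class_id, members in enumerate(partition_into_subgroups(ranks, num_subgroups)):
        master = members[0]  # assumption: the lowest rank acts as master and root
        for rank in members:
            routing[rank] = {
                "class_route": class_id,
                "root": master,
                "is_root": rank == master,
            }
    return routing


subgroups = partition_into_subgroups(list(range(8)), 3)
routing = configure_collective_networks(list(range(8)), 3)
```

Here eight ranks split into the subgroups [0, 1, 2], [3, 4, 5], and [6, 7], with ranks 0, 3, and 6 acting as the roots of three independent collective networks.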

    Effecting a broadcast with an allreduce operation on a parallel computer
    126.
    Granted invention patent
    Effecting a broadcast with an allreduce operation on a parallel computer (Lapsed)

    Publication number: US07827385B2

    Publication date: 2010-11-02

    Application number: US11832918

    Filing date: 2007-08-02

    IPC classification: G06F15/76

    CPC classification: G06F9/542 G06F2209/543

    Abstract: A parallel computer comprises a plurality of compute nodes organized into at least one operational group for collective parallel operations. Each compute node is assigned a unique rank and is coupled for data communications through a global combining network. One compute node is assigned to be a logical root. A send buffer and a receive buffer is configured. Each element of a contribution of the logical root in the send buffer is contributed. One or more zeros corresponding to a size of the element are injected. An allreduce operation with a bitwise OR using the element and the injected zeros is performed. And the result for the allreduce operation is determined and stored in each receive buffer.
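The idea rests on the identity x OR 0 = x: if only the logical root contributes data and every other rank contributes zeros, a bitwise-OR allreduce delivers the root's data to everyone. A minimal single-process simulation follows; the function names are hypothetical, and plain Python lists stand in for the send and receive buffers and for the combining network.

```python
from functools import reduce
from operator import or_


def allreduce_or(contributions):
    """Simulate a bitwise-OR allreduce: every rank receives the element-wise OR."""
    combined = [reduce(or_, column) for column in zip(*contributions)]
    return [list(combined) for _ in contributions]


def broadcast_via_allreduce(send_buffer, root, num_ranks):
    """Broadcast the root's send buffer by having all other ranks inject zeros."""
    contributions = [
        list(send_buffer) if rank == root else [0] * len(send_buffer)
        for rank in range(num_ranks)
    ]
    return allreduce_or(contributions)


receive_buffers = broadcast_via_allreduce([0x12, 0xFF, 7], root=2, num_ranks=4)
```

After the call, each of the four receive buffers holds the root's three elements unchanged.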

    Parallel computing system using coordinator and master nodes for load balancing and distributing work
    127.
    Granted invention patent
    Parallel computing system using coordinator and master nodes for load balancing and distributing work (In force)

    Publication number: US07647590B2

    Publication date: 2010-01-12

    Application number: US11469107

    Filing date: 2006-08-31

    IPC classification: G06F9/46

    CPC classification: G06F9/5027 G06F2209/5011

    Abstract: Embodiments of the invention provide a method, system and article of manufacture for parallel application load balancing and distributed work management. In one embodiment, a hierarchy of master nodes may be used to coordinate the actions of pools of worker nodes. Further, the activity of the master nodes may be controlled by a “coordinator” node. A coordinator node may be configured to distribute work unit descriptions to the collection of master nodes. If needed, embodiments of the invention may be scaled to deeper hierarchies.
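The coordinator, master, and worker hierarchy can be pictured with a deliberately simple sketch. It uses static round-robin assignment at both levels, which is an assumption chosen for brevity and not the load-balancing policy of the patent; the function name and the (master, worker) tuple used to identify a worker are likewise hypothetical.

```python
from collections import deque


def coordinate(work_units, num_masters, workers_per_master):
    """Distribute work-unit descriptions from a coordinator to masters, then to workers."""
    # Coordinator level: hand descriptions to the master nodes in round-robin order.
    master_queues = [deque() for _ in range(num_masters)]
    for i, unit in enumerate(work_units):
        master_queues[i % num_masters].append(unit)

    # Master level: each master spreads its queue over its own pool of workers.
    assignments = {}
    for master, queue in enumerate(master_queues):
        for j, unit in enumerate(queue):
            worker = (master, j % workers_per_master)
            assignments.setdefault(worker, []).append(unit)
    return assignments


units = [f"unit-{i}" for i in range(7)]
assignments = coordinate(units, num_masters=2, workers_per_master=2)
```

A deeper hierarchy, as the abstract mentions, would repeat the master level: each master would treat a group of lower-level masters as its pool.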

    Configuring Compute Nodes of a Parallel Computer in an Operational Group into a Plurality of Independent Non-Overlapping Collective Networks
    128.
    Invention patent application
    Configuring Compute Nodes of a Parallel Computer in an Operational Group into a Plurality of Independent Non-Overlapping Collective Networks (Lapsed)

    Publication number: US20090043988A1

    Publication date: 2009-02-12

    Application number: US11837015

    Filing date: 2007-08-10

    IPC classification: G06F9/38

    Abstract: Methods, apparatus, and products are disclosed for configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks, the compute nodes in the operational group connected together for data communications through a global combining network, that include: partitioning the compute nodes in the operational group into a plurality of non-overlapping subgroups; designating one compute node from each of the non-overlapping subgroups as a master node; and assigning, to the compute nodes in each of the non-overlapping subgroups, class routing instructions that organize the compute nodes in that non-overlapping subgroup as a collective network such that the master node is a physical root.

    PARALLEL APPLICATION LOAD BALANCING AND DISTRIBUTED WORK MANAGEMENT
    129.
    Invention patent application
    PARALLEL APPLICATION LOAD BALANCING AND DISTRIBUTED WORK MANAGEMENT (In force)

    Publication number: US20080059555A1

    Publication date: 2008-03-06

    Application number: US11469107

    Filing date: 2006-08-31

    IPC classification: G06F15/16

    CPC classification: G06F9/5027 G06F2209/5011

    Abstract: Embodiments of the invention provide a method, system and article of manufacture for parallel application load balancing and distributed work management. In one embodiment, a hierarchy of master nodes may be used to coordinate the actions of pools of worker nodes. Further, the activity of the master nodes may be controlled by a “coordinator” node. A coordinator node may be configured to distribute work unit descriptions to the collection of master nodes. If needed, embodiments of the invention may be scaled to deeper hierarchies.

    Performing An Allreduce Operation Using Shared Memory
    130.
    Invention patent application
    Performing An Allreduce Operation Using Shared Memory (Lapsed)

    Publication number: US20120179881A1

    Publication date: 2012-07-12

    Application number: US13427057

    Filing date: 2012-03-22

    IPC classification: G06F12/02

    CPC classification: G06F9/4843 G06F9/52 G06F9/546

    Abstract: Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
