Effecting a broadcast with an allreduce operation on a parallel computer
    1.
    Type: Granted patent
    Status: Lapsed

    Publication No.: US07827385B2

    Publication date: 2010-11-02

    Application No.: US11832918

    Filing date: 2007-08-02

    IPC classification: G06F15/76

    CPC classification: G06F9/542 G06F2209/543

    Abstract: A parallel computer comprises a plurality of compute nodes organized into at least one operational group for collective parallel operations. Each compute node is assigned a unique rank and is coupled for data communications through a global combining network. One compute node is assigned to be a logical root. A send buffer and a receive buffer are configured. For each element of the logical root's contribution in the send buffer, the element is contributed, one or more zeros corresponding to the size of the element are injected, and an allreduce operation with a bitwise OR is performed using the element and the injected zeros. The result of the allreduce operation is determined and stored in each receive buffer.
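
The technique in this abstract relies on the identity x OR 0 = x: if only the logical root contributes real data and every other node contributes zeros of the same size, the bitwise-OR allreduce leaves the root's element on every node. The following is a minimal single-process sketch of that idea, not the patented implementation. The function names, the use of `bytes` for elements, and the dictionary of receive buffers are assumptions made for illustration; a real system would perform the reduction on the global combining network.

```python
from functools import reduce


def bitwise_or(a: bytes, b: bytes) -> bytes:
    """Combine two equal-length byte strings with a bitwise OR."""
    return bytes(x | y for x, y in zip(a, b))


def broadcast_via_allreduce(send_buffer, num_nodes, root):
    """Simulate a broadcast from `root` using a bitwise-OR allreduce.

    send_buffer: list of bytes objects (the root's contribution).
    Returns a dict mapping each non-root rank to its receive buffer.
    """
    receive_buffers = {rank: [] for rank in range(num_nodes) if rank != root}
    for element in send_buffer:
        # The root contributes the element; every other rank injects
        # zeros matching the element's size.
        contributions = [
            element if rank == root else bytes(len(element))
            for rank in range(num_nodes)
        ]
        # Allreduce: every rank ends up with the same OR-combined result.
        result = reduce(bitwise_or, contributions)
        for rank in receive_buffers:
            receive_buffers[rank].append(result)
    return receive_buffers
```

Calling `broadcast_via_allreduce([b"\x12\x34", b"\xff"], num_nodes=4, root=2)` gives ranks 0, 1, and 3 a copy of the root's two elements.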

    Effecting a Broadcast with an Allreduce Operation on a Parallel Computer
    2.
    Type: Patent application
    Status: Lapsed

    Publication No.: US20090037511A1

    Publication date: 2009-02-05

    Application No.: US11832918

    Filing date: 2007-08-02

    IPC classification: G06F15/16

    CPC classification: G06F9/542 G06F2209/543

    Abstract: Methods, parallel computers, and computer program products are disclosed for effecting a broadcast with an allreduce operation on a parallel computer, the parallel computer comprising a plurality of compute nodes, the compute nodes organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer, each compute node in the operational group assigned a unique rank, the compute nodes of the operational group coupled for data communications through a global combining network; and one compute node assigned to be a logical root. Embodiments include configuring, by the logical root node, a send buffer having a contribution to be broadcast to each ranked node in the operational group; configuring, by all ranked nodes other than the logical root, a receive buffer for receiving the contribution from the logical root; and repeatedly for each element of the contribution of the logical root in the send buffer: contributing, by the logical root, the element of the contribution in the send buffer; injecting, by all ranked nodes other than the logical root, one or more zeros corresponding to a size of the element; performing, by all the compute nodes of the operational group, an allreduce operation with a bitwise OR using the element and the injected zeros, yielding a result for the allreduce operation; and storing in each receive buffer, by all ranked nodes other than the logical root, the result of the allreduce.

    Administering Communications Schedules for Data Communications Among Compute Nodes in a Data Communications Network of a Parallel Computer
    3.
    Type: Patent application
    Status: Under examination (published)

    Publication No.: US20090113308A1

    Publication date: 2009-04-30

    Application No.: US11924934

    Filing date: 2007-10-26

    IPC classification: G06F17/00 G06F3/048

    Abstract: Methods, apparatus, and products are disclosed for creating and administering communications schedules for data communications among compute nodes in a data communications network of a parallel computer that include: receiving a communications schedule specifying data communications steps in a message passing operation performed by the compute nodes in the data communications network of the parallel computer; parsing the communications schedule to identify the data communications steps; and generating a graphical representation of the communications schedule, including graphing the data communications steps for the message passing operation.
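
The receive, parse, and graph steps can be illustrated with a small sketch. The abstract does not specify a schedule format or a graphing tool, so both are assumptions here: the schedule is taken to be plain text with one `step source destination` triple per line, and the graphical representation is emitted as Graphviz DOT text. The function names are hypothetical.

```python
def parse_schedule(text):
    """Parse lines of 'step source destination' into integer tuples.

    The line format is an assumption made for this sketch.
    """
    steps = []
    for line in text.strip().splitlines():
        step, source, destination = line.split()
        steps.append((int(step), int(source), int(destination)))
    return steps


def to_dot(steps):
    """Render the parsed steps as a Graphviz DOT digraph, one edge per step."""
    lines = ["digraph schedule {"]
    for step, source, destination in steps:
        lines.append(f'  n{source} -> n{destination} [label="step {step}"];')
    lines.append("}")
    return "\n".join(lines)
```

For example, `to_dot(parse_schedule("0 0 1\n1 0 2\n1 1 3"))` yields a three-edge graph in which each edge is labelled with the step at which the transfer occurs.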

    Dispatching packets on a global combining network of a parallel computer
    4.
    Type: Granted patent
    Status: Lapsed

    Publication No.: US07984450B2

    Publication date: 2011-07-19

    Application No.: US11946136

    Filing date: 2007-11-28

    IPC classification: G06F13/00

    CPC classification: G06F13/387

    Abstract: Methods, apparatus, and products are disclosed for dispatching packets on a global combining network of a parallel computer comprising a plurality of nodes connected for data communications using the network capable of performing collective operations and point-to-point operations that include: receiving, by an origin system messaging module on an origin node from an origin application messaging module on the origin node, a storage identifier and an operation identifier, the storage identifier specifying storage containing an application message for transmission to a target node, and the operation identifier specifying a message passing operation; packetizing, by the origin system messaging module, the application message into network packets for transmission to the target node, each network packet specifying the operation identifier and an operation type for the message passing operation specified by the operation identifier; and transmitting, by the origin system messaging module, the network packets to the target node.
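
The packetizing step can be sketched as follows: an application message is split into fixed-size payloads, and every packet carries the operation identifier and operation type so the target can dispatch it without further context. The `Packet` fields, the `packetize` name, and the 256-byte default payload size are assumptions for illustration; the abstract does not give a packet layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    operation_id: int      # identifies the message passing operation
    operation_type: str    # e.g. "collective" or "point-to-point"
    payload: bytes


def packetize(message, operation_id, operation_type, payload_size=256):
    """Split `message` into packets that each carry the operation metadata."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [
        Packet(operation_id, operation_type, message[i:i + payload_size])
        for i in range(0, len(message), payload_size)
    ]
```

In this sketch an empty message produces no packets, and the last packet may be shorter than `payload_size`.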

    Dispatching Packets on a Global Combining Network of a Parallel Computer
    5.
    Type: Patent application
    Status: Lapsed

    Publication No.: US20090138892A1

    Publication date: 2009-05-28

    Application No.: US11946136

    Filing date: 2007-11-28

    IPC classification: G06F13/38

    CPC classification: G06F13/387

    Abstract: Methods, apparatus, and products are disclosed for dispatching packets on a global combining network of a parallel computer comprising a plurality of nodes connected for data communications using the network capable of performing collective operations and point-to-point operations that include: receiving, by an origin system messaging module on an origin node from an origin application messaging module on the origin node, a storage identifier and an operation identifier, the storage identifier specifying storage containing an application message for transmission to a target node, and the operation identifier specifying a message passing operation; packetizing, by the origin system messaging module, the application message into network packets for transmission to the target node, each network packet specifying the operation identifier and an operation type for the message passing operation specified by the operation identifier; and transmitting, by the origin system messaging module, the network packets to the target node.

    Executing application function calls in response to an interrupt
    6.
    Type: Granted patent
    Status: In force

    Publication No.: US07716407B2

    Publication date: 2010-05-11

    Application No.: US11968720

    Filing date: 2008-01-03

    IPC classification: G06F13/24

    Abstract: Executing application function calls in response to an interrupt, including: creating a thread; receiving an interrupt having an interrupt type; determining whether a value of a semaphore represents that interrupts are disabled; if the value of the semaphore represents that interrupts are not disabled: calling, by the thread, one or more preconfigured functions in dependence upon the interrupt type of the interrupt; yielding the thread; and if the value of the semaphore represents that interrupts are disabled: setting the value of the semaphore to represent to a kernel that interrupts are hard-disabled; and hard-disabling interrupts at the kernel.
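
The branch on the semaphore value can be modelled in a few lines. This is a single-threaded simulation under stated assumptions, not kernel code: the three semaphore states, the `InterruptDispatcher` class, and the `kernel_interrupts_enabled` flag are hypothetical names, and thread creation and yielding are omitted.

```python
ENABLED, DISABLED, HARD_DISABLED = 0, 1, 2


class InterruptDispatcher:
    def __init__(self, handlers):
        # handlers maps an interrupt type to its preconfigured functions.
        self.handlers = handlers
        self.semaphore = ENABLED
        self.kernel_interrupts_enabled = True

    def on_interrupt(self, interrupt_type):
        """Return True if the preconfigured functions were called."""
        if self.semaphore == ENABLED:
            # Interrupts are not disabled: call the functions
            # registered for this interrupt type.
            for function in self.handlers.get(interrupt_type, []):
                function()
            return True
        # Interrupts are disabled: record that fact for the kernel
        # and hard-disable interrupts there.
        self.semaphore = HARD_DISABLED
        self.kernel_interrupts_enabled = False
        return False
```

An application would set `semaphore` to `DISABLED` around a critical section; an interrupt arriving in that window is not dispatched and instead escalates to the hard-disabled state.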

    Executing Application Function Calls in Response to an Interrupt
    7.
    Type: Patent application
    Status: In force

    Publication No.: US20090177828A1

    Publication date: 2009-07-09

    Application No.: US11968720

    Filing date: 2008-01-03

    IPC classification: G06F13/24

    Abstract: Executing application function calls in response to an interrupt, including: creating a thread; receiving an interrupt having an interrupt type; determining whether a value of a semaphore represents that interrupts are disabled; if the value of the semaphore represents that interrupts are not disabled: calling, by the thread, one or more preconfigured functions in dependence upon the interrupt type of the interrupt; yielding the thread; and if the value of the semaphore represents that interrupts are disabled: setting the value of the semaphore to represent to a kernel that interrupts are hard-disabled; and hard-disabling interrupts at the kernel.

    MECHANISM TO SUPPORT GENERIC COLLECTIVE COMMUNICATION ACROSS A VARIETY OF PROGRAMMING MODELS
    8.
    Type: Patent application
    Status: Lapsed

    Publication No.: US20090006810A1

    Publication date: 2009-01-01

    Application No.: US11768669

    Filing date: 2007-06-26

    IPC classification: G06F15/00

    CPC classification: G06F9/54

    Abstract: A system and method for supporting collective communications on a plurality of processors that use different parallel programming paradigms, in one aspect, may comprise a schedule defining one or more tasks in a collective operation, an executor that executes the task, a multisend module to perform one or more data transfer functions associated with the tasks, and a connection manager that controls one or more connections and identifies an available connection. The multisend module uses the available connection in performing the one or more data transfer functions. A plurality of processors that use different parallel programming paradigms can use a common implementation of the schedule module, the executor module, the connection manager and the multisend module via a language adaptor specific to a parallel programming paradigm implemented on a processor.
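
One way to picture how these components fit together is the sketch below. All class and function names are hypothetical, the schedule is reduced to a list of `(source, destinations)` tasks, and the multisend module only records transfers instead of moving data. The `mpi_style_bcast` function stands in for a language adaptor: it translates one paradigm's broadcast call into the shared schedule and executor.

```python
class ConnectionManager:
    """Tracks connections and hands out one that is currently available."""

    def __init__(self, connection_ids):
        self._available = list(connection_ids)

    def acquire(self):
        if not self._available:
            raise RuntimeError("no connection available")
        return self._available.pop(0)

    def release(self, connection_id):
        self._available.append(connection_id)


class Multisend:
    """Stand-in for the data transfer layer; it only records transfers."""

    def __init__(self):
        self.transfers = []

    def send(self, connection_id, source, destinations, payload):
        self.transfers.append((connection_id, source, tuple(destinations), payload))


class Executor:
    """Runs each task of a schedule over an available connection."""

    def __init__(self, multisend, connections):
        self.multisend = multisend
        self.connections = connections

    def run(self, schedule, payload):
        for source, destinations in schedule:
            connection_id = self.connections.acquire()
            try:
                self.multisend.send(connection_id, source, destinations, payload)
            finally:
                self.connections.release(connection_id)


def mpi_style_bcast(executor, payload, root, num_ranks):
    """Hypothetical language adaptor: maps a broadcast call onto a schedule."""
    schedule = [(root, [rank for rank in range(num_ranks) if rank != root])]
    executor.run(schedule, payload)
```

A second adaptor for another programming model could reuse the same `Executor`, `Multisend`, and `ConnectionManager` instances and differ only in how it builds the schedule.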

    Methods and apparatus using commutative error detection values for fault isolation in multiple node computers
    9.
    Type: Patent application
    Status: Lapsed

    Publication No.: US20060248370A1

    Publication date: 2006-11-02

    Application No.: US11106069

    Filing date: 2005-04-14

    IPC classification: G06F11/00

    CPC classification: G06F11/1633

    Abstract: The present invention concerns methods and apparatus for performing fault isolation in multiple node computing systems using commutative error detection values (for example, checksums) to identify and to isolate faulty nodes. In the present invention, nodes forming the multiple node computing system are networked together and, during program execution, communicate with one another by transmitting information through the network. When information associated with a reproducible portion of a computer program is injected into the network by a node, a commutative error detection value is calculated and stored in commutative error detection apparatus associated with the node. At intervals, node fault detection apparatus associated with the multiple node computer system retrieves the commutative error detection values saved in the commutative error detection apparatus associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created; the node fault detection apparatus retrieves them and stores them in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in commutative error detection values indicate that the node may be faulty.
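
A commutative error detection value is one that does not depend on the order in which packets are injected, so two correct runs produce the same value even if packet ordering differs. The sketch below uses a byte sum modulo 2^32 as an assumed example of such a value; the abstract names checksums only as one possibility, and the function names and the per-node dictionaries are hypothetical.

```python
MODULUS = 2 ** 32


def commutative_checksum(packets):
    """Order-independent checksum: sum of all payload bytes modulo 2**32."""
    return sum(sum(packet) for packet in packets) % MODULUS


def find_suspect_nodes(run_a, run_b):
    """Compare per-node checksums from two runs of the same program.

    run_a, run_b: dicts mapping node id to the list of packets (bytes)
    that the node injected during a reproducible portion of the program.
    Returns the sorted ids of nodes whose checksums differ between runs.
    """
    return sorted(
        node
        for node in run_a
        if commutative_checksum(run_a[node])
        != commutative_checksum(run_b.get(node, []))
    )
```

A mismatch only marks a node as a candidate for closer inspection, in line with the abstract's statement that the node may be faulty. A simple sum can also miss errors that cancel out.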

    Mechanism to support generic collective communication across a variety of programming models
    10.
    Type: Granted patent
    Status: Lapsed

    Publication No.: US07984448B2

    Publication date: 2011-07-19

    Application No.: US11768669

    Filing date: 2007-06-26

    IPC classification: G06F9/44 G06F9/46 G06F15/76

    CPC classification: G06F9/54

    Abstract: A system and method for supporting collective communications on a plurality of processors that use different parallel programming paradigms, in one aspect, may comprise a schedule defining one or more tasks in a collective operation, an executor that executes the task, a multisend module to perform one or more data transfer functions associated with the tasks, and a connection manager that controls one or more connections and identifies an available connection. The multisend module uses the available connection in performing the one or more data transfer functions. A plurality of processors that use different parallel programming paradigms can use a common implementation of the schedule module, the executor module, the connection manager and the multisend module via a language adaptor specific to a parallel programming paradigm implemented on a processor.
