Adaptive recovery for parallel reactive power throttling
    1.
    Granted invention patent
    Adaptive recovery for parallel reactive power throttling (status: in force)

    Publication No.: US08799694B2

    Publication date: 2014-08-05

    Application No.: US13327100

    Filing date: 2011-12-15

    IPC classification: G06F1/32

    Abstract: Power throttling may be used to conserve power and reduce heat in a parallel computing environment. Compute nodes in the parallel computing environment may be organized into groups based on, for example, whether they execute tasks of the same job or receive power from the same converter. Once one of the compute nodes in the group detects that a parameter (e.g., temperature, current, or power consumption) has exceeded a first threshold, power throttling on all the nodes in the group may be activated. However, before deactivating power throttling, a plurality of parameters associated with the group of compute nodes may be monitored to ensure they are all below a second threshold. If so, the power throttling for all of the compute nodes is deactivated.
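
The activate/deactivate rule in this abstract behaves like a hysteresis loop, which can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class name `ThrottleGroup`, the single scalar reading per node, and the choice of a release threshold lower than the activation threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ThrottleGroup:
    """Hypothetical model of one group of compute nodes."""
    activate_threshold: float  # the abstract's "first threshold"
    release_threshold: float   # the abstract's "second threshold" (assumed lower)
    throttled: bool = False

    def update(self, readings):
        """Take {node_id: latest reading} and return the group's throttle state."""
        values = list(readings.values())
        if not self.throttled:
            # A single node above the first threshold throttles the whole group.
            if any(v > self.activate_threshold for v in values):
                self.throttled = True
        else:
            # Release only when every monitored reading is below the second threshold.
            if values and all(v < self.release_threshold for v in values):
                self.throttled = False
        return self.throttled
```

With thresholds of 85 and 70, a group that throttles at a reading of 90 stays throttled at 80 and is released only once every node reports below 70.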

    Shared address collectives using counter mechanisms
    2.
    Granted invention patent
    Shared address collectives using counter mechanisms (status: lapsed)

    Publication No.: US08655962B2

    Publication date: 2014-02-18

    Application No.: US12568115

    Filing date: 2009-09-28

    IPC classification: G06F15/16 G06F15/167

    CPC classification: G06F9/544

    Abstract: A shared address space on a compute node stores data received from a network and data to transmit to the network. The shared address space includes an application buffer that can be directly operated upon by a plurality of processes, for instance, running on different cores on the compute node. A shared counter is used for one or more of signaling arrival of the data across the plurality of processes running on the compute node, signaling completion of an operation performed by one or more of the plurality of processes, obtaining reservation slots by one or more of the plurality of processes, or combinations thereof.
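
Two of the counter uses named in this abstract, obtaining reservation slots and signaling completion, can be sketched as below. This is an illustrative assumption-laden sketch: Python threads stand in for processes on different cores, a lock-protected integer stands in for a hardware or shared-memory atomic counter, and the names `SharedCounter` and `run_workers` are hypothetical.

```python
import threading


class SharedCounter:
    """Lock-protected counter standing in for an atomic fetch-and-add."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def fetch_add(self, n=1):
        # Return the old value and advance the counter in one atomic step.
        with self._lock:
            old = self._value
            self._value += n
            return old

    def value(self):
        with self._lock:
            return self._value


def run_workers(num_workers, chunk):
    buffer = [None] * (num_workers * chunk)  # shared application buffer
    slots = SharedCounter()  # hands out reservation slots
    done = SharedCounter()   # counts workers that have finished

    def worker(worker_id):
        start = slots.fetch_add(chunk)  # reserve a disjoint range of the buffer
        for i in range(start, start + chunk):
            buffer[i] = worker_id
        done.fetch_add(1)  # signal completion

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffer, done.value()
```

Because each worker reserves its range with a single fetch-and-add, no two workers write to the same buffer slot, and the completion counter reaching `num_workers` tells any observer that the buffer is fully populated.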

    Collectively Loading An Application In A Parallel Computer
    3.
    Invention patent application
    Collectively Loading An Application In A Parallel Computer (status: in force)

    Publication No.: US20130263138A1

    Publication date: 2013-10-03

    Application No.: US13431248

    Filing date: 2012-03-27

    IPC classification: G06F9/46

    CPC classification: G06F9/5072 G06F2209/549

    Abstract: Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
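
The four steps in this abstract (identify a subset, select a leader, retrieve once, broadcast) can be simulated in a few lines. The sketch below is only a toy model: the function name `collective_load`, the rule of taking the lowest-numbered free nodes, and the rule of picking the lowest node id as job leader are assumptions, and the broadcast is modeled as a plain copy rather than a real collective network operation.

```python
def collective_load(free_nodes, nodes_needed, storage, app_name):
    """Simulate loading one application image onto a subset of nodes."""
    if nodes_needed < 1 or nodes_needed > len(free_nodes):
        raise ValueError("not enough free nodes for this job")

    # Step 1: the control system identifies the subset that runs the job
    # (assumption: the lowest-numbered free nodes).
    subset = sorted(free_nodes)[:nodes_needed]

    # Step 2: the control system selects a job leader
    # (assumption: the lowest node id in the subset).
    leader = subset[0]

    # Step 3: only the leader retrieves the application from storage.
    image = storage[app_name]
    storage_reads = 1

    # Step 4: the leader broadcasts the image to every node in the subset.
    node_memory = {node: image for node in subset}
    return leader, node_memory, storage_reads
```

The point of the scheme shows up in `storage_reads`: storage is read once regardless of how many nodes run the job, instead of once per node.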

    Debugging a high performance computing program
    4.
    Granted invention patent
    Debugging a high performance computing program (status: in force)

    Publication No.: US08516444B2

    Publication date: 2013-08-20

    Application No.: US11360346

    Filing date: 2006-02-23

    Applicant: Thomas M. Gooding

    Inventor: Thomas M. Gooding

    IPC classification: G06F9/44

    CPC classification: G06F11/3636 G06F11/3664

    Abstract: Methods, apparatus, and computer program products are disclosed for debugging a high performance computing program by gathering lists of addresses of calling instructions for a plurality of threads of execution of the program, assigning the threads to groups in dependence upon the addresses, and displaying the groups to identify defective threads.
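
The grouping step in this abstract can be illustrated as follows. This is a sketch under stated assumptions: threads are grouped by exact equality of their call-address lists, and the groups are sorted smallest first on the guess that outlier threads are the likely defective ones. The abstract does not specify either choice, and the function name `group_threads_by_stack` is hypothetical.

```python
from collections import defaultdict


def group_threads_by_stack(stacks):
    """Group thread ids by their list of calling-instruction addresses.

    stacks maps a thread id to its list of return addresses, outermost first.
    Returns (address_tuple, [thread ids]) pairs, smallest group first.
    """
    groups = defaultdict(list)
    for thread_id, addresses in stacks.items():
        groups[tuple(addresses)].append(thread_id)
    # Assumption: small groups are shown first because a thread stuck somewhere
    # unusual tends to be alone in its group.
    return sorted(groups.items(), key=lambda item: (len(item[1]), item[0]))
```

For example, if three threads report the addresses `[0x400, 0x520]` and one reports `[0x400, 0x9F0]`, the lone thread is listed first, which narrows the search from thousands of threads to a handful of distinct call paths.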

    Messaging in a parallel computer using remote direct memory access (‘RDMA’)

    Publication No.: US08490113B2

    Publication date: 2013-07-16

    Application No.: US13167911

    Filing date: 2011-06-24

    IPC classification: G06F13/00

    CPC classification: G06F15/167 G06F15/17331

    Abstract: Messaging in a parallel computer using remote direct memory access (‘RDMA’), including: receiving a send work request; responsive to the send work request: translating a local virtual address on the first node from which data is to be transferred to a physical address on the first node from which data is to be transferred; creating a local RDMA object that includes a counter set to the size of a messaging acknowledgment field; sending, from a messaging unit in the first node to a messaging unit in a second node, a message that includes an RDMA read operation request, the physical address of the local RDMA object, and the physical address on the first node from which data is to be transferred; and receiving, by the first node responsive to the second node's execution of the RDMA read operation request, acknowledgment data in the local RDMA object.

    SHARED ADDRESS COLLECTIVES USING COUNTER MECHANISMS
    7.
    Invention patent application
    SHARED ADDRESS COLLECTIVES USING COUNTER MECHANISMS (status: lapsed)

    Publication No.: US20110078249A1

    Publication date: 2011-03-31

    Application No.: US12568115

    Filing date: 2009-09-28

    IPC classification: G06F15/16

    CPC classification: G06F9/544

    Abstract: A shared address space on a compute node stores data received from a network and data to transmit to the network. The shared address space includes an application buffer that can be directly operated upon by a plurality of processes, for instance, running on different cores on the compute node. A shared counter is used for one or more of signaling arrival of the data across the plurality of processes running on the compute node, signaling completion of an operation performed by one or more of the plurality of processes, obtaining reservation slots by one or more of the plurality of processes, or combinations thereof.

    Temperature Threshold Application Signal Trigger for Real-Time Relocation of Process
    8.
    Invention patent application
    Temperature Threshold Application Signal Trigger for Real-Time Relocation of Process (status: lapsed)

    Publication No.: US20090271608A1

    Publication date: 2009-10-29

    Application No.: US12109579

    Filing date: 2008-04-25

    IPC classification: G06F1/24

    CPC classification: G06F1/206

    Abstract: A method of managing a process relocation operation in a computing system is provided and includes determining respective operating temperatures of first, second and additional nodes of the system, where the first node has an elevated operating temperature and the second node has a normal operating temperature, notifying first and second kernels, respectively associated with the first and second nodes, of a swapping condition, initially managing the first and second kernels to swap an application between the first and the second nodes while the swapping condition is in effect, and secondarily managing the first and second kernels to perform a barrier operation to end the swapping condition.

    Managing Power in a Parallel Computer
    9.
    Invention patent application
    Managing Power in a Parallel Computer (status: in force)

    Publication No.: US20090049317A1

    Publication date: 2009-02-19

    Application No.: US11840743

    Filing date: 2007-08-17

    IPC classification: G06F1/32

    CPC classification: G06F1/263 G06F1/3203

    Abstract: Managing power in a parallel computer, the parallel computer including a power supply and a plurality of compute nodes, the plurality of compute nodes powered by the power supply through a plurality of DC-DC converters, each DC-DC converter supplying current to an assigned group of compute nodes, each DC-DC converter having a current sensor. Embodiments include monitoring, by the current sensor, an amount of current supplied by that DC-DC converter to its assigned group of compute nodes; determining, by at least one DC-DC converter, that the amount of current supplied is greater than a predefined threshold value; sending, by the at least one DC-DC converter to the plurality of compute nodes, a global interrupt, including notifying the plurality of compute nodes to reduce power consumption; and reducing, by the plurality of compute nodes in accordance with power consumption ratios, power consumption of the compute nodes.
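
The monitor, interrupt, and reduce sequence in this abstract can be sketched as below. This is a rough model rather than the claimed method: the function names are hypothetical, and a "power consumption ratio" is assumed here to mean the fraction of its current power that each node keeps after the interrupt, which the abstract does not define.

```python
def converters_over_threshold(currents, threshold):
    """Return ids of DC-DC converters whose sensed current exceeds the threshold."""
    return [cid for cid, amps in currents.items() if amps > threshold]


def reduce_power(node_power, ratios):
    """Scale each node's power by its ratio (assumed: fraction of power retained)."""
    return {node: node_power[node] * ratios[node] for node in node_power}


def manage_power(currents, threshold, node_power, ratios):
    """Model one monitoring pass over all converters and compute nodes."""
    if converters_over_threshold(currents, threshold):
        # Any single converter over the limit raises a global interrupt,
        # so every compute node reduces power, not just that converter's group.
        return reduce_power(node_power, ratios)
    return dict(node_power)
```

Note the asymmetry the abstract describes: detection is local to one converter, but the response is global across the plurality of compute nodes.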

    Collectively loading an application in a parallel computer
    10.
    Granted invention patent
    Collectively loading an application in a parallel computer (status: in force)

    Publication No.: US09229782B2

    Publication date: 2016-01-05

    Application No.: US13431248

    Filing date: 2012-03-27

    IPC classification: G06F9/46 G06F9/50

    CPC classification: G06F9/5072 G06F2209/549

    Abstract: Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
