Optimizing layout of an application on a massively parallel supercomputer
    81.
    Type: Invention application
    Legal status: Lapsed

    Publication No.: US20060101104A1

    Publication date: 2006-05-11

    Application No.: US10963101

    Filing date: 2004-10-12

    IPC classification: G06F1/16

    CPC classification: G06F9/5066

    Abstract: A general computer-implemented method and apparatus for optimizing problem layout on a massively parallel supercomputer is described. The method takes as input the communication matrix of an arbitrary problem in the form of an array whose entries C(i, j) are the amount of data communicated from domain i to domain j. Given C(i, j), a heuristic map is first implemented which attempts sequentially to map a domain and its communication neighbors either to the same supercomputer node or to near-neighbor nodes on the supercomputer torus, while keeping the number of domains mapped to each supercomputer node constant (as much as possible). Next, a Markov chain of maps is generated from the initial map using Monte Carlo simulation with free energy (cost function) F = Σ_{i,j} C(i, j) H(i, j), where H(i, j) is the smallest number of hops on the supercomputer torus between domain i and domain j. In the cases tested, the method was found to produce good mappings, and it has the potential to be used as a general layout optimization tool for parallel codes. At the moment, the serial code implemented to test the method is unoptimized, so the computation time to find the optimum map can be several hours on a typical PC. A production implementation would require good parallel code for the algorithm, which could itself be run on a supercomputer.
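
The cost function and the Markov chain of maps described in this abstract can be illustrated with a small sketch. It is not the patented implementation: the Metropolis acceptance rule, the geometric cooling schedule, the swap move, and all names (`torus_hops`, `free_energy`, `anneal`) are assumptions chosen for illustration, and the sketch places exactly one domain on each node.

```python
import math
import random


def torus_hops(a, b, dims):
    """Smallest hop count H between node coordinates a and b on a torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def free_energy(C, placement, dims):
    """F = sum over i, j of C[i][j] * H(i, j) for a given placement."""
    n = len(C)
    return sum(
        C[i][j] * torus_hops(placement[i], placement[j], dims)
        for i in range(n)
        for j in range(n)
        if C[i][j]
    )


def anneal(C, placement, dims, steps=2000, temperature=1.0, cooling=0.995, seed=0):
    """Walk a chain of maps by swapping two domains per step.

    Assumed acceptance rule (Metropolis): always accept a lower F,
    otherwise accept with probability exp(-(F_new - F_cur) / T).
    Swapping keeps the number of domains per node constant.
    """
    rng = random.Random(seed)
    current = list(placement)
    f_cur = free_energy(C, current, dims)
    best, f_best = list(current), f_cur
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        f_new = free_energy(C, current, dims)
        if f_new <= f_cur or rng.random() < math.exp((f_cur - f_new) / temperature):
            f_cur = f_new
            if f_cur < f_best:
                best, f_best = list(current), f_cur
        else:
            # Rejected move: undo the swap.
            current[i], current[j] = current[j], current[i]
        temperature = max(temperature * cooling, 1e-9)
    return best, f_best
```

Here `placement[i]` is the torus coordinate of the node that hosts domain i. Recomputing F from scratch after every swap keeps the sketch short; a faster version would update only the terms that involve the two swapped domains.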

    Using DMA for copying performance counter data to memory
    82.
    Type: Invention grant
    Legal status: Lapsed

    Publication No.: US08621167B2

    Publication date: 2013-12-31

    Application No.: US13446467

    Filing date: 2012-04-13

    IPC classification: G06F12/00

    Abstract: A device for copying performance counter data includes a hardware path that connects a direct memory access (DMA) unit to a plurality of hardware performance counters and a memory device. Software prepares an injection packet for the DMA unit to perform the copying, leaving the software free to perform other tasks. In one aspect, the software that prepares the injection packet runs on a processing core other than the core that gathers the hardware performance counter data.

    Efficiency of static core turn-off in a system-on-a-chip with variation
    83.
    Type: Invention grant
    Legal status: Lapsed

    Publication No.: US08571847B2

    Publication date: 2013-10-29

    Application No.: US12727984

    Filing date: 2010-03-19

    IPC classification: G06G7/75

    Abstract: A processor-implemented method for improving efficiency of a static core turn-off in a multi-core processor with variation, the method comprising: conducting, via a simulation, a turn-off analysis of the multi-core processor at the multi-core processor's design stage, wherein the design-stage turn-off analysis includes a first output corresponding to a first multi-core processor core to turn off; conducting a turn-off analysis of the multi-core processor at the multi-core processor's testing stage, wherein the testing-stage turn-off analysis includes a second output corresponding to a second multi-core processor core to turn off; comparing the first output and the second output to determine whether the first output refers to the same core to turn off as the second output; and outputting a third output corresponding to the first multi-core processor core if the first output and the second output both refer to the same core to turn off.
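
The comparison step in this abstract reduces to a small decision function. The sketch below is only illustrative: the function name and the use of integer core indices are assumptions, and returning `None` on disagreement is a placeholder, because the abstract does not say what happens when the two analyses differ.

```python
from typing import Optional


def confirm_turn_off(design_stage_core: int, test_stage_core: int) -> Optional[int]:
    """Return the core to turn off only if both analyses picked the same core.

    design_stage_core: first output, from the simulation at the design stage.
    test_stage_core: second output, from the analysis at the testing stage.
    """
    if design_stage_core == test_stage_core:
        return design_stage_core  # third output: the agreed core
    return None  # assumption: the abstract leaves this case open
```

For example, `confirm_turn_off(3, 3)` yields core 3, while `confirm_turn_off(3, 5)` yields `None`.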

    USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY
    85.
    Type: Invention application
    Legal status: Lapsed

    Publication No.: US20110173403A1

    Publication date: 2011-07-14

    Application No.: US12684367

    Filing date: 2010-01-08

    IPC classification: G06F12/16 G06F13/28 G06F3/00

    Abstract: A device for copying performance counter data includes a hardware path that connects a direct memory access (DMA) unit to a plurality of hardware performance counters and a memory device. Software prepares an injection packet for the DMA unit to perform the copying, leaving the software free to perform other tasks. In one aspect, the software that prepares the injection packet runs on a processing core other than the core that gathers the hardware performance counter data.

    Managing power in a parallel computer
    86.
    Type: Invention grant
    Legal status: In force

    Publication No.: US07877620B2

    Publication date: 2011-01-25

    Application No.: US11840743

    Filing date: 2007-08-17

    IPC classification: G06F1/26

    CPC classification: G06F1/263 G06F1/3203

    Abstract: Managing power in a parallel computer, the parallel computer including a power supply and a plurality of compute nodes, the plurality of compute nodes powered by the power supply through a plurality of DC-DC converters, each DC-DC converter supplying current to an assigned group of compute nodes, each DC-DC converter having a current sensor. Embodiments include monitoring, by the current sensor, an amount of current supplied by that DC-DC converter to its assigned group of compute nodes; determining, by at least one DC-DC converter, that the amount of current supplied is greater than a predefined threshold value; sending, by the at least one DC-DC converter to the plurality of compute nodes, a global interrupt, including notifying the plurality of compute nodes to reduce power consumption; and reducing, by the plurality of compute nodes in accordance with power consumption ratios, power consumption of the compute nodes.
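
The monitor, interrupt, and reduce sequence in this abstract can be modeled in a few lines. This is a toy simulation under stated assumptions, not the patented design: the class names, the fixed supply voltage, and the reading of a "power consumption ratio" as the fraction of power a node keeps after the interrupt are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    power_watts: float
    reduction_ratio: float  # assumed meaning: fraction of power kept after an interrupt

    def on_global_interrupt(self):
        """Reduce this node's power according to its ratio."""
        self.power_watts *= self.reduction_ratio


@dataclass
class DCDCConverter:
    group: list            # compute nodes assigned to this converter
    threshold_amps: float  # predefined threshold value
    volts: float = 1.0     # hypothetical fixed output voltage

    def sensed_current(self):
        """Current sensor reading: total current drawn by the assigned group."""
        return sum(node.power_watts for node in self.group) / self.volts

    def over_threshold(self):
        return self.sensed_current() > self.threshold_amps


def manage_power(converters):
    """Send a global interrupt to every node if any converter is over threshold."""
    if any(c.over_threshold() for c in converters):
        for converter in converters:
            for node in converter.group:
                node.on_global_interrupt()
        return True
    return False
```

Note that the interrupt is global: a single overloaded converter causes nodes in every group to reduce power, which matches the abstract's wording that the interrupt goes to the plurality of compute nodes.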

    Method and apparatus for filtering snoop requests using a scoreboard
    87.
    Type: Invention application
    Legal status: Lapsed

    Publication No.: US20060224840A1

    Publication date: 2006-10-05

    Application No.: US11093160

    Filing date: 2005-03-29

    IPC classification: G06F13/28

    Abstract: An apparatus for implementing snooping cache coherence locally reduces the number of snoop requests presented to each cache in a multiprocessor system. A snoop filter device associated with a single processor includes one or more “scoreboard” data structures that make snoop determinations, i.e., determine for each snoop request from another processor whether the request is to be forwarded to the processor or discarded. At least one scoreboard is active, and at least one scoreboard is determined to be historic at any point in time. A snoop determination of the queue indicates that an entry may be in the cache, but does not indicate its actual residence status. In addition, the snoop filter block implementing the scoreboard data structures is operatively coupled with a cache wrap detection logic means whereby, upon detection of a cache wrap condition, the content of the active scoreboard is copied into a historic scoreboard and the content of at least one active scoreboard is reset.
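
The interplay of the active and historic scoreboards can be sketched as follows. This is a simplified software model, not the hardware described in the filing: the class and method names are invented for illustration, the scoreboards are modeled as plain sets, and cache wrap detection is approximated by counting fills up to an assumed cache size.

```python
class ScoreboardSnoopFilter:
    """Toy snoop filter with one active and one historic scoreboard."""

    def __init__(self, cache_lines):
        self.cache_lines = cache_lines  # assumption: fills needed to wrap the cache
        self.active = set()
        self.historic = set()
        self.fills_since_wrap = 0

    def record_fill(self, line_addr):
        """Note that the local processor loaded a line into its cache."""
        self.active.add(line_addr)
        self.fills_since_wrap += 1
        if self.fills_since_wrap >= self.cache_lines:
            # Simplified cache wrap: copy active into historic, reset active.
            self.historic = self.active
            self.active = set()
            self.fills_since_wrap = 0

    def should_forward(self, line_addr):
        """Forward a snoop if the line may be cached; otherwise discard it.

        A hit means only that the line may be present, not that it is.
        """
        return line_addr in self.active or line_addr in self.historic
```

Keeping the historic scoreboard across one wrap makes the filter conservative: a line recorded before the wrap is still forwarded until a second wrap has passed.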

    STATE RECOVERY AND LOCKSTEP EXECUTION RESTART IN A SYSTEM WITH MULTIPROCESSOR PAIRING
    88.
    Type: Invention application
    Legal status: Lapsed

    Publication No.: US20120210162A1

    Publication date: 2012-08-16

    Application No.: US13027932

    Filing date: 2011-02-15

    IPC classification: G06F11/08 G06F11/00

    Abstract: A system, method and computer program product enable a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessors or processor cores to provide one highly reliable thread (or thread group). The paired microprocessors or processor cores that provide one highly reliable thread connect with system components such as a memory “nest” (or memory hierarchy), an optional system controller, an optional interrupt controller, and optional I/O or peripheral devices. The memory nest is attached to the selective pairing facility via a switch or a bus. Each selectively paired processor core includes a transactional execution facility, wherein the system is configured to enable processor rollback to a previous state and to reinitialize lockstep execution in order to recover from an incorrect execution when an incorrect execution has been detected by the selective pairing facility.
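
The rollback and lockstep restart described here can be mimicked in software. The sketch below is a loose analogy rather than the patented mechanism: cores are modeled as plain functions, the checkpoint is a deep copy of the state, and the function name, the retry limit, and the error raised on a persistent mismatch are all assumptions.

```python
import copy


def run_paired(steps, state, core_a, core_b, max_retries=3):
    """Execute each step on two paired cores and compare their results.

    On a mismatch, roll back to the checkpoint taken before the step and
    re-execute it on both cores (a lockstep restart).
    """
    for step in steps:
        for _attempt in range(max_retries + 1):
            checkpoint = copy.deepcopy(state)
            out_a = core_a(step, copy.deepcopy(checkpoint))
            out_b = core_b(step, copy.deepcopy(checkpoint))
            if out_a == out_b:
                state = out_a  # results agree: commit the step
                break
            state = checkpoint  # mismatch detected: roll back and retry
        else:
            raise RuntimeError("persistent mismatch between paired cores")
    return state
```

A transient fault on one core costs one re-execution of the affected step; a fault that repeats on every attempt exhausts the retries and is reported instead of being committed.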

    USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY
    89.
    Type: Invention application
    Legal status: Lapsed

    Publication No.: US20120198118A1

    Publication date: 2012-08-02

    Application No.: US13446467

    Filing date: 2012-04-13

    IPC classification: G06F13/36

    Abstract: A device for copying performance counter data includes a hardware path that connects a direct memory access (DMA) unit to a plurality of hardware performance counters and a memory device. Software prepares an injection packet for the DMA unit to perform the copying, leaving the software free to perform other tasks. In one aspect, the software that prepares the injection packet runs on a processing core other than the core that gathers the hardware performance counter data.

    DISTRIBUTED PERFORMANCE COUNTERS
    90.
    Type: Invention application
    Legal status: Lapsed

    Publication No.: US20110172968A1

    Publication date: 2011-07-14

    Application No.: US12684738

    Filing date: 2010-01-08

    IPC classification: G21C17/00

    Abstract: A plurality of first performance counter modules is coupled to a plurality of processing cores. The plurality of first performance counter modules is operable to collect performance data associated with the plurality of processing cores respectively. A plurality of second performance counter modules is coupled to a plurality of L2 cache units, and the plurality of second performance counter modules is operable to collect performance data associated with the plurality of L2 cache units respectively. A central performance counter module may be operable to coordinate counter data from the plurality of first performance counter modules and the plurality of second performance counter modules, with the central performance counter module, the plurality of first performance counter modules, and the plurality of second performance counter modules connected by a daisy chain connection.
