Re-utilizing partially failed resources as network resources
    1.
    Granted patent
    Status: No longer in force

    Publication No.: US07620841B2

    Publication date: 2009-11-17

    Application No.: US11335784

    Filing date: 2006-01-19

    IPC classification: G06F11/00

    CPC classification: G06F11/0793 G06F11/0724

    Abstract: A method and apparatus for re-utilizing partially failed compute resources in a massively parallel supercomputer system. In the preferred embodiments, each compute node comprises a number of clock domains that can be enabled separately. When an error is detected in a compute node and the failure is not in the network communication blocks, a clock enable circuit enables the clocks to the network communication blocks only, so that the partially failed compute node can be re-utilized as a network resource. The computer system can then continue to operate with only slightly diminished performance, thereby improving performance and perceived overall reliability.
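
The clock-enable decision described in the abstract can be sketched as a small function. This is only an illustration: the domain names (`torus`, `collective`, `cpu0`, `cpu1`, `memory`) and the function name are hypothetical, and the abstract does not say what happens when the fault is in a network block, so returning no enabled clocks in that case is an assumption.

```python
# Hypothetical clock-domain names. The abstract only states that a compute
# node has several separately enabled clock domains, some of which drive
# the network communication blocks.
NETWORK_DOMAINS = frozenset({"torus", "collective"})
ALL_DOMAINS = NETWORK_DOMAINS | {"cpu0", "cpu1", "memory"}


def clocks_after_error(failed_domain: str) -> frozenset:
    """Return the clock domains to keep enabled after a fault in failed_domain."""
    if failed_domain not in ALL_DOMAINS:
        raise ValueError(f"unknown clock domain: {failed_domain}")
    if failed_domain in NETWORK_DOMAINS:
        # Assumption: a fault in a network block means the node cannot be
        # reused for routing, so no clocks stay enabled.
        return frozenset()
    # The fault is outside the network blocks: clock only those blocks so
    # the node can keep forwarding traffic as a network resource.
    return NETWORK_DOMAINS
```

For example, `clocks_after_error("cpu0")` keeps only the two network domains clocked, while `clocks_after_error("torus")` keeps none.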

    LOW LATENCY MEMORY ACCESS AND SYNCHRONIZATION
    3.
    Patent application
    Status: No longer in force

    Publication No.: US20070204112A1

    Publication date: 2007-08-30

    Application No.: US11617276

    Filing date: 2006-12-28

    IPC classification: G06F12/14

    Abstract: Low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by a store, such that the processor only performs a read operation and the hardware locking device, rather than the processor, performs the subsequent write operation. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers, rather than some other predictive algorithm, are used to determine which memory line to prefetch. This enables hardware to effectively prefetch memory access patterns that are non-contiguous but repetitive.
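
The single-load lock acquisition can be modelled in software as follows. This is a minimal sketch, not the patented hardware: the class and method names are hypothetical, the internal `threading.Lock` merely stands in for the device serializing accesses, and the `release` operation is an assumption since the abstract does not describe how a lock is given up.

```python
import threading


class LockDevice:
    """Software model of a hardware locking device (hypothetical API)."""

    def __init__(self, num_locks: int) -> None:
        self._held = [False] * num_locks
        # Stands in for the hardware handling one access at a time.
        self._serialize = threading.Lock()

    def load(self, lock_id: int) -> bool:
        """Model a single read of the lock; True means the caller now owns it."""
        with self._serialize:
            was_held = self._held[lock_id]
            # The device, not the processor, performs the follow-up write.
            self._held[lock_id] = True
            return not was_held

    def release(self, lock_id: int) -> None:
        """Assumed release operation; not described in the abstract."""
        with self._serialize:
            self._held[lock_id] = False
```

A processor that sees `load(i)` return `False` does not own lock `i` and would retry later; no separate store is issued by the processor in either case.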

    TLB EXCLUSION RANGE
    7.
    Patent application
    Status: In force

    Publication No.: US20110173411A1

    Publication date: 2011-07-14

    Application No.: US12684642

    Filing date: 2010-01-08

    IPC classification: G06F12/10 G06F12/00 G06F12/08

    Abstract: A system and method for accessing memory are provided. The system comprises a lookup buffer for storing one or more page table entries, wherein each of the one or more page table entries comprises at least a virtual page number and a physical page number; and a logic circuit for receiving a virtual address from a processor, said logic circuit matching the virtual address to the virtual page number in one of the page table entries to select the physical page number in the same page table entry, said page table entry having one or more bits set to exclude a memory range from a page.
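
A lookup with an exclusion range might look like the sketch below. The abstract only says that bits in the entry exclude a memory range from the page; the 4 KiB page size, the `excl_start`/`excl_end` offset encoding, and all names here are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

PAGE_BITS = 12  # assumed 4 KiB pages


@dataclass(frozen=True)
class TlbEntry:
    vpn: int  # virtual page number
    ppn: int  # physical page number
    # Hypothetical encoding: offsets in [excl_start, excl_end) are excluded.
    excl_start: int = 0
    excl_end: int = 0


def translate(tlb: Iterable[TlbEntry], vaddr: int) -> Optional[int]:
    """Return the physical address for vaddr, or None when no entry matches."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    for entry in tlb:
        if entry.vpn != vpn:
            continue
        if entry.excl_start <= offset < entry.excl_end:
            continue  # address falls in the excluded range of this entry
        return (entry.ppn << PAGE_BITS) | offset
    return None
```

With a single entry mapping virtual page `0x10` to physical page `0x80` and excluding offsets below `0x100`, address `0x10200` translates normally while `0x10080` is treated as a miss.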

    Deterministic error recovery protocol
    8.
    Patent application
    Status: No longer in force

    Publication No.: US20050081078A1

    Publication date: 2005-04-14

    Application No.: US10674952

    Filing date: 2003-09-30

    Abstract: Disclosed are an error recovery method and system for use with a communication system having first and second nodes, each of said nodes having a receiver and a sender, the sender of the first node being connected to the receiver of the second node by a first cable, and the sender of the second node being connected to the receiver of the first node by a second cable. The method comprises the step of, after one of the nodes detects an error, both of the nodes entering the same defined state. In particular, the receiver of the first node enters an error state, stays in the error state for a defined period of time T, and, after said defined period of time T, enters a wait state. Also, the sender of the first node sends to the receiver of the second node an error message for a defined period of time Te, and after the defined period of time Te, the sender of the first node enters an idle state.
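
The timed transitions of the first node's receiver and sender can be sketched as two pure functions of the time elapsed since the error was detected. The enum and function names are hypothetical, and treating the boundary instant (`elapsed == T` or `elapsed == Te`) as already past the period is an assumption.

```python
from enum import Enum


class RxState(Enum):
    ERROR = "error"
    WAIT = "wait"


class TxState(Enum):
    SENDING_ERROR = "sending_error"
    IDLE = "idle"


def receiver_state(elapsed: float, t: float) -> RxState:
    """Receiver stays in the error state for period T, then waits."""
    return RxState.ERROR if elapsed < t else RxState.WAIT


def sender_state(elapsed: float, te: float) -> TxState:
    """Sender transmits error messages for period Te, then goes idle."""
    return TxState.SENDING_ERROR if elapsed < te else TxState.IDLE
```

Because both transitions depend only on elapsed time and the fixed periods, each side ends up in a known state without further negotiation, which is the deterministic property the abstract emphasizes.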

    Local rollback for fault-tolerance in parallel computing systems
    10.
    Granted patent
    Status: In force

    Publication No.: US08103910B2

    Publication date: 2012-01-24

    Application No.: US12696780

    Filing date: 2010-01-29

    IPC classification: G06F11/00

    CPC classification: G06F15/17381 G06F9/30072

    Abstract: A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.
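
The restart rule in the abstract can be sketched as a retry loop. The callback shape, the `max_attempts` bound, and the `"escalate"` outcome are assumptions: the abstract only states that the interval is restarted when an error occurs without an unrecoverable condition, and does not say what follows otherwise.

```python
from typing import Callable, Tuple


def run_with_local_rollback(
    run_interval: Callable[[], Tuple[bool, bool]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Run one rollback interval, restarting it after recoverable errors.

    run_interval() is a hypothetical callback that executes the interval's
    instructions and returns (error_occurred, unrecoverable_condition).
    """
    for attempt in range(1, max_attempts + 1):
        error, unrecoverable = run_interval()
        if not error:
            return ("committed", attempt)
        if unrecoverable:
            # Local rollback cannot help; hand off to some other mechanism.
            return ("escalate", attempt)
        # Error with no unrecoverable condition: restart the interval.
    return ("escalate", max_attempts)
```

An interval that fails once and then succeeds yields `("committed", 2)`, while an unrecoverable condition on the first run yields `("escalate", 1)`.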
