Managing coherence via put/get windows
    71.
    发明授权
    Managing coherence via put/get windows 失效
    通过put / get窗口管理一致性

    公开(公告)号:US07870343B2

    公开(公告)日:2011-01-11

    申请号:US10468995

    申请日:2002-02-25

    摘要: A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.

    摘要翻译: 一种用于管理多处理器计算机系统的两个处理器节点的两个处理器之间的相干性的方法和装置。 通常,本发明涉及一种软件算法,其简化并显着加速了传送并行计算机的消息中的高速缓存一致性的管理以及辅助该高速缓存一致性算法的硬件设备。 软件算法使用put / get窗口的打开和关闭来协调激活的所需要的,以实现缓存一致性。 硬件设备可以是硬件地址解码的扩展,其在节点的物理存储器地址空间中创建(a)实际不存在的虚拟存储器的区域,并且(b)因此能够立即响应 从处理元素读取和写入请求。

    Low latency memory access and synchronization
    72.
    发明授权
    Low latency memory access and synchronization 失效
    低延迟内存访问和同步

    公开(公告)号:US07818514B2

    公开(公告)日:2010-10-19

    申请号:US12196796

    申请日:2008-08-22

    IPC分类号: G06F12/06

    摘要: A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Bach processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.

    摘要翻译: 与弱有序的多处理器系统相关联地提供低延迟存储器系统访问。 多处理器中的Bach处理器共享资源,并且每个共享资源在锁定设备内具有关联的锁,其提供对多处理器中的多个处理器之间的同步的支持以及资源的有序共享。 当处理器拥有与该资源相关联的锁定时,处理器仅具有访问资源的权限,并且处理器拥有锁的尝试仅需要单个加载操作,而不是传统的原子负载后跟存储,使得处理器 只执行读取操作,并且硬件锁定装置执行后续的写入操作而不是处理器。 还公开了用于非连续数据结构的简单预取。 重新定义存储器线,使得除了正常的物理存储器数据之外,每行包括足够大的指针以指向存储器中的任何其他行,其中指针用于确定要预取的存储器行而不是一些其它预测 算法。 这使得硬件能够有效地预取不连续但重复的存储器访问模式。

    Multiple node remote messaging
    73.
    发明授权
    Multiple node remote messaging 有权
    多节点远程消息传递

    公开(公告)号:US07788334B2

    公开(公告)日:2010-08-31

    申请号:US11768784

    申请日:2007-06-26

    IPC分类号: G06F15/167 G06F13/28

    CPC分类号: G06F15/16

    摘要: A method for passing remote messages in a parallel computer system formed as a network of interconnected compute nodes includes that a first compute node (A) sends a single remote message to a remote second compute node (B) in order to control the remote second compute node (B) to send at least one remote message. The method includes various steps including controlling a DMA engine at first compute node (A) to prepare the single remote message to include a first message descriptor and at least one remote message descriptor for controlling the remote second compute node (B) to send at least one remote message, including putting the first message descriptor into an injection FIFO at the first compute node (A) and sending the single remote message and the at least one remote message descriptor to the second compute node (B).

    摘要翻译: 在形成为互连的计算节点的网络的并行计算机系统中传递远程消息的方法包括:第一计算节点(A)将单个远程消息发送到远程第二计算节点(B),以便控制远程第二计算 节点(B)发送至少一个远程消息。 该方法包括各种步骤,包括在第一计算节点(A)处控制DMA引擎以准备单个远程消息以包括第一消息描述符和至少一个远程消息描述符,用于控制远程第二计算节点(B)至少发送 一个远程消息,包括将第一消息描述符放在第一计算节点(A)的注入FIFO中,并将单个远程消息和至少一个远程消息描述符发送到第二计算节点(B)。

    NOVEL MASSIVELY PARALLEL SUPERCOMPUTER
    74.
    发明申请
    NOVEL MASSIVELY PARALLEL SUPERCOMPUTER 有权
    新的大型并行超级计算机

    公开(公告)号:US20090259713A1

    公开(公告)日:2009-10-15

    申请号:US12492799

    申请日:2009-06-26

    摘要: A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC). Within each ASIC node is a plurality of processing elements each of which consists of a central processing unit (CPU) and plurality of floating point processors to enable optimal balance of computational performance, packaging density, low cost, and power and cooling requirements. The plurality of processors within a single node may be used individually or simultaneously to work on any combination of computation or communication as required by the particular algorithm being solved or executed at any point in time. The system-on-a-chip ASIC nodes are interconnected by multiple independent networks that optimally maximizes packet communications throughput and minimizes latency. In the preferred embodiment, the multiple networks include three high-speed networks for parallel algorithm message passing including a Torus, Global Tree, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. For particular classes of parallel algorithms, or parts of parallel calculations, this architecture exhibits exceptional computational performance, and may be enabled to perform calculations for new classes of parallel algorithms. Additional networks are provided for external connectivity and used for Input/Output, System Management and Configuration, and Debug and Monitoring functions. Special node packaging techniques implementing midplane and other hardware devices facilitates partitioning of the supercomputer in multiple networks for optimizing supercomputing resources.

    摘要翻译: 数百个teraOPS级别的新型大规模并行超级计算机包括基于片上系统技术的节点架构,即每个处理节点包括单个专用集成电路(ASIC)。 在每个ASIC节点内是多个处理元件,每个处理元件由中央处理单元(CPU)和多个浮点处理器组成,以实现计算性能,封装密度,低成本以及功率和冷却​​要求的最佳平衡。 单个节点内的多个处理器可以单独使用或同时使用,以在任何时间点解决或执行的特定算法所要求的任何计算或通信组合上工作。 片上系统ASIC节点通过多个独立网络互连,从而最大限度地最大限度地提高了分组通信吞吐量并最大限度地减少了延迟。 在优选实施例中,多个网络包括用于并行算法消息传递的三个高速网络,包括提供全局障碍和通知功能的环形,全局树和全球异步网络。 这些多个独立网络可以根据用于优化算法处理性能的算法的需求或阶段来协同或独立地利用。 对于特定类别的并行算法或并行计算的部分,该架构具有出色的计算性能,并且可以启用对新类并行算法执行计算。 为外部连接提供附加网络,用于输入/输出,系统管理和配置以及调试和监控功能。 实现中平面和其他硬件设备的特殊节点打包技术有助于在多个网络中划分超级计算机,以优化超级计算资源。

    MESSAGE PASSING WITH A LIMITED NUMBER OF DMA BYTE COUNTERS
    75.
    发明申请
    MESSAGE PASSING WITH A LIMITED NUMBER OF DMA BYTE COUNTERS 失效
    消息传递与有限数量的DMA字节计数器

    公开(公告)号:US20090007141A1

    公开(公告)日:2009-01-01

    申请号:US11768813

    申请日:2007-06-26

    IPC分类号: G06F9/44

    CPC分类号: G06F15/17356 G06F9/546

    摘要: A method for passing messages in a parallel computer system constructed as a plurality of compute nodes interconnected as a network where each compute node includes a DMA engine but includes only a limited number of byte counters for tracking a number of bytes that are sent or received by the DMA engine, where the byte counters may be used in shared counter or exclusive counter modes of operation. The method includes using rendezvous protocol, a source compute node deterministically sending a request to send (RTS) message with a single RTS descriptor using an exclusive injection counter to track both the RTS message and message data to be sent in association with the RTS message, to a destination compute node such that the RTS descriptor indicates to the destination compute node that the message data will be adaptively routed to the destination node. Using one DMA FIFO at the source compute node, the RTS descriptors are maintained for rendezvous messages destined for the destination compute node to ensure proper message data ordering thereat. Using a reception counter at a DMA engine, the destination compute node tracks reception of the RTS and associated message data and sends a clear to send (CTS) message to the source node in a rendezvous protocol form of a remote get to accept the RTS message and message data and processing the remote get (CTS) by the source compute node DMA engine to provide the message data to be sent.

    摘要翻译: 一种在并行计算机系统中传送消息的方法,该并行计算机系统被构造为作为网络互连的多个计算节点,其中每个计算节点包括DMA引擎,但是仅包括有限数量的字节计数器,用于跟踪由 DMA引擎,其中可以在共享计数器或专用计数器操作模式中使用字节计数器。 该方法包括使用会合协议,源计算节点使用专用注入计数器确定性地发送具有单个RTS描述符的请求(RTS)消息以跟踪要与RTS消息相关联地发送的RTS消息和消息数据, 到目的地计算节点,使得RTS描述符向目标计算节点指示消息数据将自适应地路由到目的地节点。 在源计算节点使用一个DMA FIFO,将为发往目的地计算节点的会合消息保留RTS描述符,以确保正确的消息数据顺序。 在DMA引擎上使用接收计数器,目的地计算节点跟踪RTS和相关联的消息数据的接收,并以远程获取的会合协议形式向源节点发送明确发送(CTS)消息以接受RTS消息 和消息数据,并由源计算节点DMA引擎处理远程获取(CTS)以提供要发送的消息数据。

    NOVEL SNOOP FILTER FOR FILTERING SNOOP REQUESTS
    76.
    发明申请
    NOVEL SNOOP FILTER FOR FILTERING SNOOP REQUESTS 失效
    用于过滤SNOOP要求的新SNOOP过滤器

    公开(公告)号:US20090006770A1

    公开(公告)日:2009-01-01

    申请号:US12113262

    申请日:2008-05-01

    IPC分类号: G06F12/08

    摘要: A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having one or more local cache memories associated and operatively connected therewith. The method comprises providing a snoop filter device associated with each processing unit, each snoop filter device having a plurality of dedicated input ports for receiving snoop requests from dedicated memory writing sources in the multiprocessor computing environment. Each snoop filter device includes a plurality of parallel operating port snoop filters in correspondence with the plurality of dedicated input ports, each port snoop filter implementing one or more parallel operating sub-filter elements that are adapted to concurrently filter snoop requests received from respective dedicated memory writing sources and forward a subset of those requests to its associated processing unit.

    摘要翻译: 一种用于在具有多个处理单元的多处理器计算环境中支持高速缓存一致性的方法和装置,每个处理单元具有与其相关联并与之可操作地相连的一个或多个本地高速缓冲存储器。 该方法包括提供与每个处理单元相关联的窥探过滤器设备,每个窥探过滤器设备具有多个专用输入端口,用于从多处理器计算环境中的专用存储器写入源接收窥探请求。 每个窥探过滤器装置包括与多个专用输入端口相对应的多个并行操作端口窥探滤波器,每个端口窥探滤波器实现一个或多个并行操作子滤波器元件,其适于同时滤除从相应专用存储器接收的窥探请求 写入源并将这些请求的子集转发到其相关联的处理单元。

    SYSTEM AND METHOD FOR PROGRAMMABLE BANK SELECTION FOR BANKED MEMORY SUBSYSTEMS
    77.
    发明申请
    SYSTEM AND METHOD FOR PROGRAMMABLE BANK SELECTION FOR BANKED MEMORY SUBSYSTEMS 有权
    用于银行存储器子系统的可编程银行选择的系统和方法

    公开(公告)号:US20090006718A1

    公开(公告)日:2009-01-01

    申请号:US11768805

    申请日:2007-06-26

    IPC分类号: G06F12/02

    摘要: A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.

    摘要翻译: 一种用于使得一个或多个处理器设备能够访问计算环境中的共享存储器的可编程存储器系统和方法,所述共享存储器包括具有用于存储数据的可寻址位置的一个或多个存储器存储结构。 该系统包括:与相应的一个或多个处理器设备相关联的一个或多个第一逻辑设备,用于接收物理存储器地址信号的每个第一逻辑设备,并且可编程以在接收到预定地址位值时产生相应的存储器存储结构选择信号 在选定的物理存储器地址位置; 以及响应于每个相应选择信号的第二逻辑设备,用于产生用于选择用于处理器访问的存储器存储结构的地址信号。 因此,该系统使每个处理器设备能够分布在一个或多个存储器结构上的计算环境存储器存储访问。

    LOW LATENCY MEMORY ACCESS AND SYNCHRONIZATION
    78.
    发明申请
    LOW LATENCY MEMORY ACCESS AND SYNCHRONIZATION 失效
    低延迟存储器访问和同步

    公开(公告)号:US20080313408A1

    公开(公告)日:2008-12-18

    申请号:US12196796

    申请日:2008-08-22

    IPC分类号: G06F12/08

    摘要: A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Bach processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.

    摘要翻译: 与弱有序的多处理器系统相关联地提供低延迟存储器系统访问。 多处理器中的Bach处理器共享资源,并且每个共享资源在锁定设备内具有关联的锁,其提供对多处理器中的多个处理器之间的同步的支持以及资源的有序共享。 当处理器拥有与该资源相关联的锁定时,处理器仅具有访问资源的权限,并且处理器拥有锁的尝试仅需要单个加载操作,而不是传统的原子负载后跟存储,使得处理器 只执行读取操作,并且硬件锁定装置执行后续的写入操作而不是处理器。 还公开了用于非连续数据结构的简单预取。 重新定义存储器线,使得除了正常的物理存储器数据之外,每行包括足够大的指针以指向存储器中的任何其他行,其中指针用于确定要预取的存储器行而不是一些其它预测 算法。 这使得硬件能够有效地预取不连续但重复的存储器访问模式。

    SNOOP FILTERING SYSTEM IN A MULTIPROCESSOR SYSTEM
    80.
    发明申请
    SNOOP FILTERING SYSTEM IN A MULTIPROCESSOR SYSTEM 有权
    SNOOP过滤系统在多处理器系统中的应用

    公开(公告)号:US20080222364A1

    公开(公告)日:2008-09-11

    申请号:US12126674

    申请日:2008-05-23

    IPC分类号: G06F12/08

    摘要: A system and method for supporting cache coherency in a computing environment having multiple processing units, each unit having an associated cache memory system operatively coupled therewith. The system includes a plurality of interconnected snoop filter units, each snoop filter unit corresponding to and in communication with a respective processing unit, with each snoop filter unit comprising a plurality of devices for receiving asynchronous snoop requests from respective memory writing sources in the computing environment; and a point-to-point interconnect comprising communication links for directly connecting memory writing sources to corresponding receiving devices; and, a plurality of parallel operating filter devices coupled in one-to-one correspondence with each receiving device for processing snoop requests received thereat and one of forwarding requests or preventing forwarding of requests to its associated processing unit. Each of the plurality of parallel operating filter devices comprises parallel operating sub-filter elements, each simultaneously receiving an identical snoop request and implementing one or more different snoop filter algorithms for determining those snoop requests for data that are determined not cached locally at the associated processing unit and preventing forwarding of those requests to the processor unit. In this manner, a number of snoop requests forwarded to a processing unit is reduced thereby increasing performance of the computing environment.

    摘要翻译: 一种用于在具有多个处理单元的计算环境中支持高速缓存一致性的系统和方法,每个单元具有与其可操作耦合的相关联的高速缓存存储器系统 该系统包括多个互连的窥探过滤器单元,每个窥探过滤器单元对应于相应处理单元并与其通信,每个窥探过滤器单元包括用于在计算环境中从相应存储器写入源接收异步窥探请求的多个设备 ; 以及包括用于将存储器写入源直接连接到对应的接收设备的通信链路的点对点互连; 以及与每个接收设备一一对应地耦合的多个并行操作过滤器设备,用于处理在其上接收的窥探请求,并且转发请求之一或者阻止将请求转发到其相关联的处理单元。 多个并行操作过滤器装置中的每一个包括并行操作子滤波器元件,每个并行操作子滤波器元件同时接收相同的窥探请求,并且实现一个或多个不同的窥探滤波器算法,用于确定对于在相关处理中本地未被缓存的数据被确定的窥探请求 并且防止将这些请求转发到处理器单元。 以这种方式,减少了转发到处理单元的多个窥探请求,从而增加了计算环境的性能。