摘要:
A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a bather algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.
摘要:
A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a bather algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.
摘要:
A shared address space on a compute node stores data received from a network and data to transmit to the network. The shared address space includes an application buffer that can be directly operated upon by a plurality of processes, for instance, running on different cores on the compute node. A shared counter is used for one or more of signaling arrival of the data across the plurality of processes running on the compute node, signaling completion of an operation performed by one or more of the plurality of processes, obtaining reservation slots by one or more of the plurality of processes, or combinations thereof.
摘要:
A shared address space on a compute node stores data received from a network and data to transmit to the network. The shared address space includes an application buffer that can be directly operated upon by a plurality of processes, for instance, running on different cores on the compute node. A shared counter is used for one or more of signaling arrival of the data across the plurality of processes running on the compute node, signaling completion of an operation performed by one or more of the plurality of processes, obtaining reservation slots by one or more of the plurality of processes, or combinations thereof.
摘要:
Processing posted receive commands in a parallel computer, including: posting, by a parallel process of a compute node, a receive command, the receive command including a set of parameters excluding the receive command from being directed among parallel posted receive queues; flattening the parallel unexpected message queues into a single unexpected message queue; determining whether the posted receive command is satisfied by an entry in the single unexpected message queue; if the posted receive command is satisfied by an entry in the single unexpected message queue, processing the posted receive command; if the posted receive command is not satisfied by an entry in the single unexpected message queue: flattening the parallel posted receive queues into a single posted receive queue; and storing the posted receive command in the single posted receive queue.
摘要:
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency.
摘要翻译:具有100 petaOPS规模计算的多Petascale高效并行超级计算机,其成本,功耗和占地面积都在降低,并且允许从互连角度来看处理节点的最大封装密度。 超级计算机利用了VLSI的技术进步,实现了许多处理器可以集成到单个专用集成电路(ASIC)中的计算模型。 每个ASIC计算节点包括利用集成到一个管芯中的四个或更多个处理器的片上系统ASIC,每个处理器具有对所有系统资源的完全访问,并且使得处理器能够对诸如计算或消息传递I / O 并且优选地,根据应用内的各种算法阶段实现功能的自适应分割,或者如果I / O或其他处理器未被充分利用,则可以参与计算或通信节点通过五维环面网络互连 使用DMA来最大限度地最大化节点之间的分组通信的吞吐量并最小化等待时间。
摘要:
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency.
摘要翻译:具有100 petaOPS规模计算的多Petascale高效并行超级计算机,其成本,功耗和占地面积都在降低,并且允许从互连角度来看处理节点的最大封装密度。 超级计算机利用了VLSI的技术进步,实现了许多处理器可以集成到单个专用集成电路(ASIC)中的计算模型。 每个ASIC计算节点包括利用集成到一个管芯中的四个或更多个处理器的片上系统ASIC,每个处理器具有对所有系统资源的完全访问,并且使得处理器能够对诸如计算或消息传递I / O 并且优选地,根据应用内的各种算法阶段实现功能的自适应分割,或者如果I / O或其他处理器未被充分利用,则可以参与计算或通信节点通过五维环面网络互连 使用DMA来最大限度地最大化节点之间的分组通信的吞吐量并最小化等待时间。
摘要:
Processing posted receive commands in a parallel computer, including: posting, by a parallel process of a compute node, a receive command, the receive command including a set of parameters excluding the receive command from being directed among parallel posted receive queues; flattening the parallel unexpected message queues into a single unexpected message queue; determining whether the posted receive command is satisfied by an entry in the single unexpected message queue; if the posted receive command is satisfied by an entry in the single unexpected message queue, processing the posted receive command; if the posted receive command is not satisfied by an entry in the single unexpected message queue: flattening the parallel posted receive queues into a single posted receive queue; and storing the posted receive command in the single posted receive queue.
摘要:
Mechanism of efficient intra-die collective processing across the nodelets with separate shared memory coherency domains is provided. An integrated circuit die may include a hardware collective unit implemented on the integrated circuit die. A plurality of cores on the integrated circuit die is grouped into a plurality of shared memory coherence domains. Each of the plurality of shared memory coherence domains is connected to the collective unit for performing collective operations between the plurality of shared memory coherence domains.
摘要:
Point-to-point intra-nodelet messaging support for nodelets on a single chip that obey MPI semantics may be provided. In one aspect, a local buffering mechanism is employed that obeys standard communication protocols for the network communications between the nodelets integrated in a single chip. Sending messages from one nodelet to another nodelet on the same chip may be performed not via the network, but by exchanging messages in the point-to-point messaging buckets between the nodelets. The messaging buckets need not be part of the memory system of the nodelets. Specialized hardware controllers may be used for moving data between the nodelets and each messaging bucket, and ensuring correct operation of the network protocol.