摘要:
A general computer-implement method and apparatus to optimize problem layout on a massively parallel supercomputer is described. The method takes as input the communication matrix of an arbitrary problem in the form of an array whose entries C(i, j) are the amount to data communicated from domain i to domain j. Given C(i, j), first implement a heuristic map is implemented which attempts sequentially to map a domain and its communications neighbors either to the same supercomputer node or to near-neighbor nodes on the supercomputer torus while keeping the number of domains mapped to a supercomputer node constant (as much as possible). Next a Markov Chain of maps is generated from the initial map using Monte Carlo simulation with Free Energy (cost function) F=Σi,jC(i,j)H(i,j)—where H(i,j) is the smallest number of hops on the supercomputer torus between domain i and domain j. On the cases tested, found was that the method produces good mappings and has the potential to be used as a general layout optimization tool for parallel codes. At the moment, the serial code implemented to test the method is un-optimized so that computation time to find the optimum map can be several hours on a typical PC. For production implementation, good parallel code for our algorithm would be required which could itself be implemented on supercomputer.
摘要:
A device for copying performance counter data includes hardware path that connects a direct memory access (DMA) unit to a plurality of hardware performance counters and a memory device. Software prepares an injection packet for the DMA unit to perform copying, while the software can perform other tasks. In one aspect, the software that prepares the injection packet runs on a processing core other than the core that gathers the hardware performance counter data.
摘要:
A processor-implemented method for improving efficiency of a static core turn-off in a multi-core processor with variation, the method comprising: conducting via a simulation a turn-off analysis of the multi-core processor at the multi-core processor's design stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's design stage includes a first output corresponding to a first multi-core processor core to turn off; conducting a turn-off analysis of the multi-core processor at the multi-core processor's testing stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's testing stage includes a second output corresponding to a second multi-core processor core to turn off; comparing the first output and the second output to determine if the first output is referring to the same core to turn off as the second output; outputting a third output corresponding to the first multi-core processor core if the first output and the second output are both referring to the same core to turn off.
摘要:
A processor-implemented method for determining aging of a processing unit in a processor the method comprising: calculating an effective aging profile for the processing unit wherein the effective aging profile quantifies the effects of aging on the processing unit; combining the effective aging profile with process variation data, actual workload data and operating conditions data for the processing unit; and determining aging through an aging sensor of the processing unit using the effective aging profile, the process variation data, the actual workload data, architectural characteristics and redundancy data, and the operating conditions data for the processing unit.
摘要:
A device for copying performance counter data includes hardware path that connects a direct memory access (DMA) unit to a plurality of hardware performance counters and a memory device. Software prepares an injection packet for the DMA unit to perform copying, while the software can perform other tasks. In one aspect, the software that prepares the injection packet runs on a processing core other than the core that gathers the hardware performance counter data.
摘要:
Managing power in a parallel computer, the parallel computer including a power supply and a plurality of compute nodes, the plurality of compute nodes powered by the power supply through a plurality of DC-DC converters, each DC-DC converter supplying current to an assigned group of compute nodes, each DC-DC converter having a current sensor. Embodiments include monitoring, by the current sensor, an amount of current supplied by that DC-DC converter to its assigned group of compute nodes; determining, by at least one DC-DC converter, that the amount of current supplied is greater than a predefined threshold value; sending, by the at least one DC-DC converter to the plurality of compute nodes, a global interrupt, including notifying the plurality of compute nodes to reduce power consumption; and reducing, by the plurality of compute nodes in accordance with power consumption ratios, power consumption of the compute nodes.
摘要:
An apparatus for implementing snooping cache coherence that locally reduces the number of snoop requests presented to each cache in a multiprocessor system. A snoop filter device associated with a single processor includes one or more “scoreboard” data structures that make snoop determinations, i.e., for each snoop request from another processor, to determine if a request is to be forwarded to the processor or, discarded. At least one scoreboard is active, and at least one scoreboard is determined to be historic at any point in time. A snoop determination of the queue indicates that an entry may be in the cache, but does not indicate its actual residence status. In addition, the snoop filter block implementing scoreboard data structures is operatively coupled with a cache wrap detection logic means whereby, upon detection of a cache wrap condition, the content of the active scoreboard is copied into a historic scoreboard and the content of at least one active scoreboard is reset.
摘要:
System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each paired microprocessor or processor cores that provide one highly reliable thread for high-reliability connect with a system components such as a memory “nest” (or memory hierarchy), an optional system controller, and optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switch or a bus. Each selectively paired processor core is includes a transactional execution facility, wherein the system is configured to enable processor rollback to a previous state and reinitialize lockstep execution in order to recover from an incorrect execution when an incorrect execution has been detected by the selective pairing facility.
摘要:
A device for copying performance counter data includes hardware path that connects a direct memory access (DMA) unit to a plurality of hardware performance counters and a memory device. Software prepares an injection packet for the DMA unit to perform copying, while the software can perform other tasks. In one aspect, the software that prepares the injection packet runs on a processing core other than the core that gathers the hardware performance counter data.
摘要:
A plurality of first performance counter modules is coupled to a plurality of processing cores. The plurality of first performance counter modules is operable to collect performance data associated with the plurality of processing cores respectively. A plurality of second performance counter modules are coupled to a plurality of L2 cache units, and the plurality of second performance counter modules are operable to collect performance data associated with the plurality of L2 cache units respectively. A central performance counter module may be operable to coordinate counter data from the plurality of first performance counter modules and the plurality of second performance modules, the a central performance counter module, the plurality of first performance counter modules, and the plurality of second performance counter modules connected by a daisy chain connection.