Abstract:
A method for providing selective memory error protection responsive to a predictable failure notification associated with at least one portion of a memory in a computing system includes: obtaining an active error correcting code (ECC) configuration corresponding to the portion of the memory; determining whether the active ECC configuration is sufficient to correct at least one error in the portion of the memory affected by the predictable failure notification; when the active ECC configuration is insufficient to correct the error, determining whether data corruption can be tolerated by an application running on the computing system; when data corruption cannot be tolerated by the application, determining whether a stronger ECC level is available and, if a stronger ECC level is available, increasing a strength of the active ECC configuration; and when data corruption can be tolerated, performing page reassignment and aggregation of non-critical data.
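The decision flow described above can be sketched as a small function. This is only an illustration under stated assumptions: the names `Action` and `select_protection`, the bit-count inputs, and the `NO_STRONGER_ECC` outcome (the abstract does not say what happens when no stronger level exists) are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    KEEP_CURRENT_ECC = "keep current ECC"
    STRENGTHEN_ECC = "increase ECC strength"
    REASSIGN_PAGES = "reassign pages and aggregate non-critical data"
    NO_STRONGER_ECC = "no stronger ECC available"  # assumption: not covered by the abstract


def select_protection(correctable_bits, predicted_error_bits,
                      tolerates_corruption, stronger_level_available):
    """Pick a response to a predictable failure notification for one memory portion."""
    # Step 1: is the active ECC configuration sufficient for the predicted error?
    if correctable_bits >= predicted_error_bits:
        return Action.KEEP_CURRENT_ECC
    # Step 2: ECC is insufficient; can the application tolerate data corruption?
    if tolerates_corruption:
        return Action.REASSIGN_PAGES
    # Step 3: corruption is not tolerable; strengthen ECC if a stronger level exists.
    if stronger_level_available:
        return Action.STRENGTHEN_ECC
    return Action.NO_STRONGER_ECC
```

For example, `select_protection(1, 2, False, True)` selects the stronger ECC level, while the same call with `tolerates_corruption=True` selects page reassignment.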
Abstract:
According to an aspect, a method for triggering creation of a checkpoint in a computer system includes executing a task in a processing node and determining whether it is time to read a monitor associated with a metric of the task. The monitor is read to determine a value of the metric based on determining that it is time to read the monitor. A threshold for triggering creation of the checkpoint is determined based on the metric. A monitoring block size is determined for the checkpoint. A checkpoint interval is determined based on the monitoring block size, a checkpoint bandwidth, and a failure rate of the computer system. Based on determining that the value of the metric has crossed the threshold and the checkpoint interval has elapsed, the checkpoint including state data of the task is created to enable restarting execution of the task upon a restart operation.
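A minimal sketch of the trigger logic follows. The abstract does not state how the interval is computed from its three inputs, so this example assumes Young's first-order approximation, interval = sqrt(2 × checkpoint cost × mean time between failures), with checkpoint cost taken as block size divided by bandwidth. It also assumes that "crossing" the threshold means meeting or exceeding it. The function names are hypothetical.

```python
import math


def checkpoint_interval(block_size_bytes, bandwidth_bytes_per_s, failures_per_s):
    """Assumed formula (Young's approximation), not taken from the abstract."""
    checkpoint_cost_s = block_size_bytes / bandwidth_bytes_per_s
    mtbf_s = 1.0 / failures_per_s
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def should_checkpoint(metric_value, threshold, elapsed_s, interval_s):
    """Create a checkpoint only when both conditions from the abstract hold."""
    # Assumption: the metric crosses the threshold in the upward direction.
    return metric_value >= threshold and elapsed_s >= interval_s
```

With a 1 GB monitoring block, 1 GB/s of checkpoint bandwidth, and one failure every two hours, the assumed formula gives an interval of about 120 seconds.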
Abstract:
A method for managing a network queue memory includes receiving sensor information about the network queue memory, predicting a memory failure in the network queue memory based on the sensor information, and outputting a notification through a plurality of nodes forming a network and using the network queue memory, the notification configuring communications between the nodes.
Abstract:
An aspect includes optimizing an application workflow. The optimizing includes characterizing the application workflow by determining at least one baseline metric related to an operational control knob of an embedded system processor. The application workflow performs a real-time computational task encountered by at least one mobile embedded system of a wirelessly connected cluster of systems supported by a server system. The optimizing of the application workflow further includes performing an optimization operation on the at least one baseline metric of the application workflow while satisfying at least one runtime constraint. An annotated workflow that is the result of performing the optimization operation is output.
Abstract:
An aspect includes receiving a write request that includes a memory address and write data. Stored data is read from a memory location at the memory address. Based on determining that the memory location was not previously modified, the stored data is compared to the write data. Based on the stored data matching the write data, the write request is completed without writing the write data to the memory, and a corresponding silent store bit in a silent store bitmap is set. Based on the stored data not matching the write data, the write data is written to the memory location, the silent store bit is reset, and a corresponding modified bit is set. At least one of an application and an operating system is provided access to the silent store bitmap.
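The write path above can be modeled in software as follows. This is a sketch only: the class `SilentStoreMemory`, its list-based bitmaps, and the boolean return value are hypothetical. The abstract does not say what happens when the location was already modified, so the sketch assumes such writes always go to memory.

```python
class SilentStoreMemory:
    """Toy model of a memory with silent store and modified bitmaps."""

    def __init__(self, size):
        self.data = [0] * size
        self.silent = [False] * size    # silent store bitmap
        self.modified = [False] * size  # modified bitmap

    def write(self, addr, value):
        """Handle a write request; return True if memory was actually written."""
        stored = self.data[addr]  # read the stored data first
        if not self.modified[addr] and stored == value:
            # Silent store: complete the request without writing.
            self.silent[addr] = True
            return False
        # Mismatch (or, by assumption, an already-modified location): write.
        self.data[addr] = value
        self.silent[addr] = False
        self.modified[addr] = True
        return True
```

An application or operating system could then inspect `silent` to find locations whose writes never changed the stored value.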
Abstract:
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100-petaflop scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five-dimensional torus network that maximizes the throughput of packet communications between nodes and minimizes latency. The network implements a collective network and a global asynchronous network that provides global barrier and notification functions. The node design integrates a list-based prefetcher. The memory system implements transactional memory, thread-level speculation, and a multiversioning cache that also improves the soft error rate, and it supports DMA functionality allowing for parallel-processing message passing.
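To illustrate the five-dimensional torus topology, the sketch below lists the neighbors of a node, with wraparound in every dimension. The function name and the dimension sizes in the example are hypothetical; the abstract does not give the size of the torus.

```python
def torus_neighbors(coord, dims):
    """Return the neighbors of `coord` in a torus with the given dimension sizes."""
    neighbors = []
    for axis, size in enumerate(dims):
        for delta in (-1, 1):
            n = list(coord)
            # Modular arithmetic provides the wraparound link at each edge.
            n[axis] = (n[axis] + delta) % size
            neighbors.append(tuple(n))
    return neighbors
```

In a five-dimensional torus each node has two links per dimension, so `torus_neighbors((0, 0, 0, 0, 3), (4, 4, 4, 4, 4))` returns ten neighbors, including the wraparound nodes `(3, 0, 0, 0, 0)` and `(0, 0, 0, 0, 0)`.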
Abstract:
According to an aspect, a method for checkpointing in a hybrid computing node includes executing a task in a processing accelerator of the hybrid computing node. A checkpoint is created in a local memory of the processing accelerator. The checkpoint includes state data to restart execution of the task in the processing accelerator upon a restart operation. Execution of the task is resumed in the processing accelerator after creating the checkpoint. The state data of the checkpoint are transferred from the processing accelerator to a main processor of the hybrid computing node while the processing accelerator is executing the task.
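The overlap between task execution and checkpoint transfer can be sketched with a background thread. This is a simulation under stated assumptions, not accelerator code: a dictionary stands in for the task state, `copy.deepcopy` stands in for the checkpoint in accelerator-local memory, and a thread stands in for the transfer to the main processor. All names are hypothetical.

```python
import copy
import threading


def run_with_checkpoint(steps, checkpoint_at):
    """Run a toy task, checkpoint once, and transfer the checkpoint concurrently."""
    state = {"step": 0, "acc": 0}
    host_copy = {}  # stands in for main-processor memory
    transfer = None
    for i in range(steps):
        state["step"] = i + 1
        state["acc"] += i
        if i + 1 == checkpoint_at:
            # Create the checkpoint locally, then resume the task immediately.
            local_ckpt = copy.deepcopy(state)
            # Transfer the snapshot in the background while the loop continues.
            transfer = threading.Thread(target=host_copy.update, args=(local_ckpt,))
            transfer.start()
    if transfer is not None:
        transfer.join()
    return state, host_copy
```

Because the transfer works on a snapshot rather than the live state, the task can keep modifying `state` without corrupting the checkpoint being copied.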
Abstract:
A mechanism is provided for detecting malicious activity in a functional unit of a data processing system. A set of activity values associated with a set of functional units and a set of thermal levels associated with the set of functional units are monitored. For a current activity value associated with the functional unit in the set of functional units, a determination is made as to whether a thermal level associated with the functional unit differs from a verified thermal level beyond a predetermined threshold. Responsive to the thermal level associated with the functional unit differing from the verified thermal level beyond the predetermined threshold, an indication of suspected abnormal activity associated with the functional unit is sent.
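The comparison step can be sketched as follows. The abstract does not say how the verified thermal level is obtained, so this example assumes a hypothetical per-unit profile function that maps an activity value to the verified thermal level; the function and parameter names are also assumptions.

```python
def flag_suspect_units(readings, verified_profile, threshold):
    """Return the units whose thermal level deviates from the verified level.

    readings: unit name -> (current activity value, measured thermal level)
    verified_profile: unit name -> function(activity) giving the verified level
    """
    suspects = []
    for unit, (activity, thermal) in readings.items():
        expected = verified_profile[unit](activity)
        # Flag the unit when the deviation exceeds the predetermined threshold.
        if abs(thermal - expected) > threshold:
            suspects.append(unit)
    return suspects
```

A unit that runs hotter than its reported activity would explain is the kind of mismatch this check is meant to surface.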
Abstract:
A method for selective duplication of subtasks in a high-performance computing system includes: monitoring a health status of one or more nodes in a high-performance computing system, where one or more subtasks of a parallel task execute on the one or more nodes; identifying one or more nodes as having a likelihood of failure which exceeds a first prescribed threshold; selectively duplicating the one or more subtasks that execute on the one or more nodes having a likelihood of failure which exceeds the first prescribed threshold; and notifying a messaging library that one or more subtasks were duplicated.
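The four steps above can be sketched as a single function. The placement of replicas on spare nodes in round-robin order and the `notify` callback standing in for the messaging library are assumptions; the abstract does not specify either.

```python
def duplicate_at_risk_subtasks(placement, failure_likelihood, threshold,
                               spare_nodes, notify):
    """Duplicate subtasks on at-risk nodes and notify the messaging library.

    placement: subtask -> node it runs on
    failure_likelihood: node -> estimated likelihood of failure
    """
    # Identify nodes whose likelihood of failure exceeds the threshold.
    at_risk = {n for n, p in failure_likelihood.items() if p > threshold}
    targets = [n for n in spare_nodes if n not in at_risk]
    replicas = {}
    if not targets:
        return replicas
    # Selectively duplicate only the subtasks running on at-risk nodes.
    exposed = sorted(t for t, node in placement.items() if node in at_risk)
    for k, subtask in enumerate(exposed):
        replicas[subtask] = targets[k % len(targets)]  # assumed round-robin placement
    if replicas:
        notify(replicas)  # tell the messaging library which subtasks were duplicated
    return replicas
```

Subtasks on healthy nodes are left alone, which keeps the duplication overhead proportional to the number of at-risk nodes rather than to the size of the job.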