-
1.
Publication No.: US10141955B2
Publication date: 2018-11-27
Application No.: US14684368
Filing date: 2015-04-11
Abstract: A method for providing selective memory error protection responsive to a predictable failure notification associated with at least one portion of a memory in a computing system includes: obtaining an active error correcting code (ECC) configuration corresponding to the portion of the memory; determining whether the active ECC configuration is sufficient to correct at least one error in the portion of the memory affected by the predictable failure notification; when the active ECC configuration is insufficient to correct the error, determining whether data corruption can be tolerated by an application running on the computing system; when data corruption cannot be tolerated by the application, determining whether a stronger ECC level is available and, if a stronger ECC level is available, increasing a strength of the active ECC configuration; and when data corruption can be tolerated, performing page reassignment and aggregation of non-critical data.
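The decision flow in this abstract can be sketched as a small function. This is only an illustration under stated assumptions: the integer ECC scale, the `Region` fields, the one-step strength increase, and the `NO_STRONGER_ECC` outcome are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "keep current ECC"
    STRENGTHEN_ECC = "increase ECC strength"
    REASSIGN_PAGES = "reassign pages and aggregate non-critical data"
    NO_STRONGER_ECC = "no stronger ECC level available"  # placeholder outcome


@dataclass
class Region:
    ecc_level: int           # active ECC strength (hypothetical integer scale)
    max_ecc_level: int       # strongest level assumed to be available
    required_ecc_level: int  # level assumed necessary to correct the predicted error


def handle_failure_notification(region: Region, app_tolerates_corruption: bool) -> Action:
    # Step 1: is the active ECC configuration already sufficient?
    if region.ecc_level >= region.required_ecc_level:
        return Action.NONE
    # Step 2: if the application tolerates corruption, reassign pages instead.
    if app_tolerates_corruption:
        return Action.REASSIGN_PAGES
    # Step 3: otherwise strengthen ECC when a stronger level exists.
    if region.ecc_level < region.max_ecc_level:
        region.ecc_level += 1  # assumption: raise strength one level at a time
        return Action.STRENGTHEN_ECC
    return Action.NO_STRONGER_ECC
```

The abstract does not say what happens when no stronger ECC level exists, so the last branch only reports that condition.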
-
Publication No.: US10089181B2
Publication date: 2018-10-02
Application No.: US15194884
Filing date: 2016-06-28
Inventors: Chen-Yong Cher
Abstract: According to an aspect, a method for triggering creation of a checkpoint in a computer system includes executing a task in a processing node and determining whether it is time to read a monitor associated with a metric of the task. The monitor is read to determine a value of the metric based on determining that it is time to read the monitor. A threshold for triggering creation of the checkpoint is determined based on the metric. A monitoring block size is determined for the checkpoint. A checkpoint interval is determined based on the monitoring block size, a checkpoint bandwidth, and a failure rate of the computer system. Based on determining that the value of the metric has crossed the threshold and the checkpoint interval has elapsed, the checkpoint including state data of the task is created to enable restarting execution of the task upon a restart operation.
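The abstract names the inputs to the checkpoint interval (block size, checkpoint bandwidth, failure rate) but gives no formula. As a rough sketch, the example below substitutes Young's first-order approximation, which is an assumption and not the patent's method. The function names and the upward threshold crossing are likewise hypothetical.

```python
import math


def checkpoint_interval(block_size_bytes: float, bandwidth_bytes_per_s: float,
                        failures_per_s: float) -> float:
    """Young's first-order approximation: sqrt(2 * checkpoint_cost * MTBF)."""
    checkpoint_cost_s = block_size_bytes / bandwidth_bytes_per_s  # time to write one checkpoint
    mtbf_s = 1.0 / failures_per_s                                 # mean time between failures
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def should_checkpoint(metric_value: float, threshold: float,
                      seconds_since_last: float, interval_s: float) -> bool:
    # Both conditions from the abstract must hold: the metric has crossed the
    # threshold (assumed here to mean "reached or exceeded") and the interval
    # has elapsed since the previous checkpoint.
    return metric_value >= threshold and seconds_since_last >= interval_s
```

For instance, a 2 GB block written at 1 GB/s on a system with one failure per hour would give an interval of about 120 seconds under this approximation.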
-
Publication No.: US20180097712A1
Publication date: 2018-04-05
Application No.: US15817254
Filing date: 2017-11-19
CPC classification: H04L43/0817, H04L41/0663, H04L41/0668, H04L41/147
Abstract: A method for managing a network queue memory includes receiving sensor information about the network queue memory, predicting a memory failure in the network queue memory based on the sensor information, and outputting a notification through a plurality of nodes forming a network and using the network queue memory, the notification configuring communications between the nodes.
-
Publication No.: US20160378550A1
Publication date: 2016-12-29
Application No.: US14950934
Filing date: 2015-11-24
Inventors: Ramon Bertran Monfort, Pradip Bose, Alper Buyuktosunoglu, Chen-Yong Cher, Hans M. Jacobson, William J. Song, Karthik V. Swaminathan, Augusto J. Vega, Liang Wang
CPC classification: G06F8/443, G06F1/324, G06F1/329, G06F1/3296, G06F9/4843, G06F9/4887, G06F9/4893, G06F9/543, Y02D10/24
Abstract: An aspect includes optimizing an application workflow. The optimizing includes characterizing the application workflow by determining at least one baseline metric related to an operational control knob of an embedded system processor. The application workflow performs a real-time computational task encountered by at least one mobile embedded system of a wirelessly connected cluster of systems supported by a server system. The optimizing of the application workflow further includes performing an optimization operation on the at least one baseline metric of the application workflow while satisfying at least one runtime constraint. An annotated workflow that is the result of performing the optimization operation is output.
-
Publication No.: US20160378367A1
Publication date: 2016-12-29
Application No.: US14749680
Filing date: 2015-06-25
Inventors: Pradip Bose, Chen-Yong Cher, Ravi Nair
IPC classification: G06F3/06
CPC classification: G06F9/30043, G06F9/3863, G06F11/00, G06F11/30
Abstract: An aspect includes receiving a write request that includes a memory address and write data. Stored data is read from a memory location at the memory address. Based on determining that the memory location was not previously modified, the stored data is compared to the write data. Based on the stored data matching the write data, the write request is completed without writing the write data to the memory, and a corresponding silent store bit in a silent store bitmap is set. Based on the stored data not matching the write data, the write data is written to the memory location, the silent store bit is reset, and a corresponding modified bit is set. At least one of an application and an operating system is provided access to the silent store bitmap.
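The write path described in this abstract can be illustrated with a toy memory model. This is a minimal sketch, assuming one silent-store bit and one modified bit per location; the class name, the word-granular bitmaps, and the handling of already-modified locations (written normally, without a comparison) are assumptions that the abstract does not spell out.

```python
class SilentStoreMemory:
    """Toy memory that tracks silent stores (hypothetical model)."""

    def __init__(self, size: int) -> None:
        self.data = [0] * size
        self.silent = [False] * size    # silent store bitmap, readable by app/OS
        self.modified = [False] * size  # modified bitmap

    def write(self, addr: int, value: int) -> bool:
        """Handle a write request; return True if memory was actually written."""
        stored = self.data[addr]  # read the stored data first
        # Compare only when the location was not previously modified.
        if not self.modified[addr] and stored == value:
            self.silent[addr] = True  # silent store: complete without writing
            return False
        # Mismatch (or already modified): perform the write and update both bits.
        self.data[addr] = value
        self.silent[addr] = False
        self.modified[addr] = True
        return True
```

Software with access to `silent` could, for example, skip unmodified locations when saving state, although that use is not stated in the abstract.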
-
Publication No.: US20160011996A1
Publication date: 2016-01-14
Application No.: US14701371
Filing date: 2015-04-30
Inventors: Sameh Asaad, Ralph E. Bellofatto, Michael A. Blocksome, Matthias A. Blumrich, Peter Boyle, Jose R. Brunheroto, Dong Chen, Chen-Yong Cher, George L. Chiu, Norman Christ, Paul W. Coteus, Kristan D. Davis, Gabor J. Dozsa, Alexandre E. Eichenberger, Noel A. Eisley, Matthew R. Ellavsky, Kahn C. Evans, Bruce M. Fleischer, Thomas W. Fox, Alan Gara, Mark E. Giampapa, Thomas M. Gooding, Michael K. Gschwind, John A. Gunnels, Shawn A. Hall, Rudolf A. Haring, Philip Heidelberger, Todd A. Inglett, Brant L. Knudson, Gerard V. Kopcsay, Sameer Kumar, Amith R. Mamidala, James A. Marcella, Mark G. Megerian, Douglas R. Miller, Samuel J. Miller, Adam J. Muff, Michael B. Mundy, John K. O'Brien, Kathryn M. O'Brien, Martin Ohmacht, Jeffrey J. Parker, Ruth J. Poole, Joseph D. Ratterman, Valentina Salapura, David L. Satterfield, Robert M. Senger, Burkhard Steinmacher-Burow, William M. Stockdell, Craig B. Stunkel, Krishnan Sugavanam, Yutaka Sugawara, Todd E. Takken, Barry M. Trager, James L. Van Oosten, Charles D. Wait, Robert E. Walkup, Alfred T. Watson, Robert W. Wisniewski, Peng Wu
CPC classification: G06F13/287, G06F9/06, G06F9/3004, G06F9/30047, G06F9/3885, G06F12/0811, G06F12/0831, G06F12/0862, G06F12/0864, G06F12/1027, G06F15/17381, G06F15/17387, G06F15/76, G06F15/8069, G06F2212/1016, G06F2212/602, G06F2212/6022, G06F2212/6024, G06F2212/6032, Y02D10/13, Y02D10/14
Abstract: A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five-dimensional torus network that maximizes the throughput of packet communications between nodes and minimizes latency. The network implements a collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design is a list-based prefetcher. The memory system implements transactional memory, thread-level speculation, and a multiversioning cache that improves the soft error rate at the same time, and supports DMA functionality allowing for parallel-processing message passing.
-
Publication No.: US20150363225A1
Publication date: 2015-12-17
Application No.: US14302921
Filing date: 2014-06-12
Inventors: Chen-Yong Cher
IPC classification: G06F9/48
CPC classification: G06F9/4818, G06F9/461, G06F9/4881, G06F11/1438, G06F11/202
Abstract: According to an aspect, a method for checkpointing in a hybrid computing node includes executing a task in a processing accelerator of the hybrid computing node. A checkpoint is created in a local memory of the processing accelerator. The checkpoint includes state data to restart execution of the task in the processing accelerator upon a restart operation. Execution of the task is resumed in the processing accelerator after creating the checkpoint. The state data of the checkpoint are transferred from the processing accelerator to a main processor of the hybrid computing node while the processing accelerator is executing the task.
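The overlap between checkpoint transfer and continued execution can be mimicked in ordinary software. This is only an analogy, not the patented mechanism: a Python thread stands in for the accelerator-to-host transfer, and the `Accelerator` class, `transfer_to_host` function, and `host_store` list are hypothetical names.

```python
import copy
import threading


class Accelerator:
    """Stand-in for a processing accelerator with local memory (hypothetical)."""

    def __init__(self) -> None:
        self.state = {"step": 0}
        self.local_checkpoint = None

    def run_step(self) -> None:
        self.state["step"] += 1  # one unit of task progress

    def create_checkpoint(self) -> dict:
        # Snapshot the task state into accelerator-local memory.
        self.local_checkpoint = copy.deepcopy(self.state)
        return self.local_checkpoint


def transfer_to_host(snapshot: dict, host_store: list) -> None:
    # Copy the checkpoint's state data to the main processor's memory.
    host_store.append(dict(snapshot))


acc = Accelerator()
host_store: list = []
for _ in range(3):
    acc.run_step()

snapshot = acc.create_checkpoint()
worker = threading.Thread(target=transfer_to_host, args=(snapshot, host_store))
worker.start()       # transfer begins in the background
for _ in range(2):
    acc.run_step()   # the task resumes while the transfer is in flight
worker.join()
```

Because the snapshot is a separate copy, the resumed task can keep changing `acc.state` without affecting the data being transferred.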
-
Publication No.: US09172714B2
Publication date: 2015-10-27
Application No.: US14012237
Filing date: 2013-08-28
Inventors: Chen-Yong Cher, Eren Kursun, Haifeng Qian
CPC classification: H04L63/1416, G06F21/554, H04L63/14
Abstract: A mechanism is provided for detecting malicious activity in a functional unit of a data processing system. A set of activity values associated with a set of functional units and a set of thermal levels associated with the set of functional units are monitored. For a current activity value associated with the functional unit in the set of functional units, a determination is made as to whether a thermal level associated with the functional unit differs from a verified thermal level beyond a predetermined threshold. Responsive to the thermal level associated with the functional unit differing from the verified thermal level beyond the predetermined threshold, an indication of suspected abnormal activity associated with the functional unit is sent.
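The comparison step in this abstract amounts to checking each unit's measured thermal level against a verified level for its current activity. The sketch below assumes a lookup table of verified temperatures keyed by unit and activity level; the table layout, the unit names, and the function name are all hypothetical.

```python
def suspicious_units(readings: dict, verified_profile: dict, threshold_c: float) -> list:
    """Return units whose temperature deviates from the verified level.

    readings:         unit -> (activity_level, measured_temp_c)
    verified_profile: unit -> {activity_level: verified_temp_c}
    """
    flagged = []
    for unit, (activity, temp_c) in readings.items():
        expected = verified_profile[unit][activity]
        # Flag the unit when the deviation exceeds the predetermined threshold.
        if abs(temp_c - expected) > threshold_c:
            flagged.append(unit)
    return flagged
```

A unit running hot while reporting low activity is the kind of mismatch this check would surface.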
-
9.
Publication No.: US20150227426A1
Publication date: 2015-08-13
Application No.: US14176083
Filing date: 2014-02-08
CPC classification: G06F11/1415, G06F11/008, G06F11/0754, G06F11/1482, G06F11/1492, G06F11/2023, G06F11/203, G06F11/2048, G06F11/302, G06F11/3409, H04L67/1034
Abstract: A method for selective duplication of subtasks in a high-performance computing system includes: monitoring a health status of one or more nodes in a high-performance computing system, where one or more subtasks of a parallel task execute on the one or more nodes; identifying one or more nodes as having a likelihood of failure which exceeds a first prescribed threshold; selectively duplicating the one or more subtasks that execute on the one or more nodes having a likelihood of failure which exceeds the first prescribed threshold; and notifying a messaging library that one or more subtasks were duplicated.
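The selection and notification steps of this method can be sketched in a few lines. This is an illustration under assumptions: failure likelihoods are taken as already-computed probabilities per node, and `select_duplicates` and `MessagingLibrary` are hypothetical names that do not refer to any real messaging library's API.

```python
def select_duplicates(failure_likelihood: dict, placement: dict, threshold: float) -> list:
    """Return the subtasks that run on nodes whose failure likelihood exceeds the threshold.

    failure_likelihood: node -> estimated probability of failure
    placement:          subtask -> node it executes on
    """
    at_risk = {node for node, p in failure_likelihood.items() if p > threshold}
    return sorted(task for task, node in placement.items() if node in at_risk)


class MessagingLibrary:
    """Stand-in for the messaging layer that must learn about duplicates."""

    def __init__(self) -> None:
        self.duplicated = set()

    def notify_duplicated(self, subtasks: list) -> None:
        self.duplicated.update(subtasks)
```

Only subtasks on at-risk nodes are duplicated, which is what distinguishes this approach from duplicating every subtask.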
-
Publication No.: US11121951B2
Publication date: 2021-09-14
Application No.: US15817254
Filing date: 2017-11-19
Abstract: A method for managing a network queue memory includes receiving sensor information about the network queue memory, predicting a memory failure in the network queue memory based on the sensor information, and outputting a notification through a plurality of nodes forming a network and using the network queue memory, the notification configuring communications between the nodes.