RESILIENCY TO MEMORY FAILURES IN COMPUTER SYSTEMS
    51.
    Type: Invention patent application
    Status: Active

    Publication number: US20170068596A1

    Publication date: 2017-03-09

    Application number: US15357448

    Filing date: 2016-11-21

    Applicant: Cray Inc.

    Abstract: A resiliency system detects and corrects memory errors reported by a memory system of a computing system using previously stored error correction information. When a program stores data into a memory location, the resiliency system executing on the computing system generates and stores error correction information. When the program then executes a load instruction to retrieve the data from the memory location, the load instruction completes normally if there is no memory error. If, however, there is a memory error, the computing system passes control to the resiliency system (e.g., via a trap) to handle the memory error. The resiliency system retrieves the error correction information for the memory location and re-creates the data of the memory location. The resiliency system stores the data as if the load instruction had completed normally and passes control to the next instruction of the program.
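
The store/load/trap flow in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the abstract does not say what form the error correction information takes, so XOR parity over fixed-size blocks of words is used here as a stand-in, and `ResilientMemory`, `MemoryFault`, and `inject_fault` are hypothetical names. A Python exception plays the role of the hardware trap.

```python
class MemoryFault(Exception):
    """Stands in for the memory error that the hardware reports via a trap."""


class ResilientMemory:
    """Simulated word memory with per-block XOR parity (an assumed scheme)."""

    def __init__(self, size, block=4):
        self.words = [0] * size
        self.block = block
        self.parity = [0] * ((size + block - 1) // block)
        self.failed = set()  # addresses the simulated hardware reports as bad

    def store(self, addr, value):
        # Keep the parity word in step with every store.
        old = self._recover(addr) if addr in self.failed else self.words[addr]
        self.parity[addr // self.block] ^= old ^ value
        self.words[addr] = value
        self.failed.discard(addr)

    def _raw_load(self, addr):
        # The plain load: completes normally unless the location has failed.
        if addr in self.failed:
            raise MemoryFault(addr)
        return self.words[addr]

    def _recover(self, addr):
        # Re-create the word from the parity and the other words in its block.
        b = addr // self.block
        start = b * self.block
        value = self.parity[b]
        for a in range(start, min(start + self.block, len(self.words))):
            if a != addr:
                value ^= self._raw_load(a)  # a second fault here is unrecoverable
        return value

    def load(self, addr):
        try:
            return self._raw_load(addr)
        except MemoryFault:
            # "Trap handler": rebuild the data, write it back, then return it
            # as if the load had completed normally.
            value = self._recover(addr)
            self.words[addr] = value
            self.failed.discard(addr)
            return value

    def inject_fault(self, addr):
        # Test helper: lose the stored word and mark the location as failed.
        self.words[addr] = None
        self.failed.add(addr)
```

With this scheme a single failed word per block can be rebuilt; two failures in the same block surface as an unhandled `MemoryFault`.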

    Multi-threaded server control automation for disaster recovery
    52.
    Type: Granted invention patent
    Status: Active

    Publication number: US09582381B2

    Publication date: 2017-02-28

    Application number: US14302851

    Filing date: 2014-06-12

    CPC classification number: G06F11/2035 G06F11/1417 G06F11/142 G06F11/2023

    Abstract: Systems and methods for multi-threaded server control automation for disaster recovery are described. A method may include initiating a disaster recovery sequence on two or more processors, wherein the disaster recovery sequence comprises a plurality of subsequences. The method may also include implementing the disaster recovery sequence on the two or more processors in parallel, wherein one or more subsequences of the disaster recovery sequence are implemented on the two or more processors in parallel. Upon completion of the disaster recovery sequence, at least one server partition is repurposed from a first configuration, such as a test configuration, to a second configuration, such as a production configuration.
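
The idea of running subsequences of a recovery sequence in parallel and then repurposing a partition can be sketched as follows. This is a hedged illustration, not the patented method: worker threads from Python's `concurrent.futures` stand in for the "two or more processors", and `run_subsequence`, `run_recovery`, the step names, and the dictionary-based partition are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subsequence(name, steps, log):
    # Steps inside one subsequence run strictly in order.
    for step in steps:
        log.append((name, step))  # list.append is thread-safe in CPython
    return name


def run_recovery(subsequences, partition, workers=2):
    """Run every subsequence in parallel, then repurpose the partition."""
    log = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_subsequence, name, steps, log)
            for name, steps in subsequences.items()
        ]
        # result() re-raises any exception from a failed step, so the
        # partition is only switched after every subsequence succeeded.
        done = [f.result() for f in futures]
    partition["configuration"] = "production"
    return done, log
```

For example, calling `run_recovery({"network": [...], "storage": [...]}, partition)` on a partition whose `configuration` is `"test"` leaves it set to `"production"` once both subsequences finish.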

    Injecting faults at select execution points of distributed applications
    53.
    Type: Granted invention patent
    Status: Active

    Publication number: US09483383B2

    Publication date: 2016-11-01

    Application number: US14097713

    Filing date: 2013-12-05

    Abstract: Methods, systems, and articles of manufacture for injecting faults at select execution points of distributed applications are provided herein. A method includes monitoring a run-time state of each of multiple components of a distributed application to determine one or more sequence of events that triggers a fault injection point at one of the multiple components; defining a fault injection scenario in a specification based on said monitoring, wherein said fault injection scenario comprises a description of one or more sequence of events during which an intended fault is to be injected to a target component of the multiple components at one selected event; and executing the fault injection defined in the specification to perform injection of the intended fault during run-time of the distributed application.
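
A fault injection scenario of the kind described, a sequence of events that triggers a fault at a target component, might be modeled as below. This is a minimal sketch with assumptions: the abstract does not define the specification format, so the trigger is represented as a list of `(component, event)` pairs that must occur contiguously, and `FaultInjector` and its fields are hypothetical names. The fault is only recorded, not actually executed.

```python
from collections import deque


class FaultInjector:
    """Watches component events and fires when the trigger sequence occurs."""

    def __init__(self, trigger, target, fault):
        self.trigger = tuple(trigger)  # ordered (component, event) pairs
        self.target = target           # component that receives the fault
        self.fault = fault             # label of the fault to inject
        # Sliding window holding the most recent events, as long as the trigger.
        self.recent = deque(maxlen=len(self.trigger))
        self.injected = []

    def observe(self, component, event):
        """Record one run-time event; return True if the fault was injected."""
        self.recent.append((component, event))
        if tuple(self.recent) == self.trigger:
            self.injected.append((self.target, self.fault))
            return True
        return False
```

A scenario such as "crash the database right after the cache invalidates during a commit" would then be `FaultInjector([("db", "commit_start"), ("cache", "invalidate")], "db", "crash")`, fed by whatever monitoring hook reports component events.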

    Using location tracking of cluster nodes to avoid single points of failure
    55.
    Type: Granted invention patent
    Status: Active

    Publication number: US09454444B1

    Publication date: 2016-09-27

    Application number: US12407237

    Filing date: 2009-03-19

    CPC classification number: G06F11/2023 G06F11/1425 G06F11/1484 G06F11/20

    Abstract: Systems and methods are provided to track cluster nodes and provide high availability in a computing system. A computer system includes hosts, a cluster manager, and a cluster database. The cluster database includes entries corresponding to the hosts which identify the physical location of a corresponding host. The cluster manager uses the data to select at least two hosts and assign the selected hosts to a service group for executing an application. The cluster manager selects hosts via an algorithm that determines which hosts are least likely to share a single point of failure. The data includes a hierarchical group of location attributes describing two or more of a host's country, state, city, building, room, enclosure, and radio frequency identifier (RFID). The location-based algorithm identifies a group of selected hosts whose smallest shared location attribute is highest in the hierarchical group. The system updates the data whenever a physical location of a host changes.
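
One way to read the selection rule, choosing the hosts whose smallest shared location attribute is highest in the hierarchy, is to count how many leading levels two hosts have in common and prefer the group where that count is lowest. The sketch below follows that reading; it is an interpretation, not the patent's algorithm. `LEVELS` uses the first six attributes named in the abstract (RFID is left out), and `shared_depth`, `pick_hosts`, and the host records are hypothetical.

```python
from itertools import combinations

# Location hierarchy from broadest to narrowest.
LEVELS = ("country", "state", "city", "building", "room", "enclosure")


def shared_depth(a, b):
    """Number of leading hierarchy levels on which two hosts agree."""
    depth = 0
    for level in LEVELS:
        if a.get(level) is None or a.get(level) != b.get(level):
            break
        depth += 1
    return depth


def pick_hosts(hosts, k=2):
    """Pick k hosts (k >= 2) whose closest pair shares the fewest levels."""
    def worst(group):
        # The pair sharing the most levels is the likeliest single point of failure.
        return max(shared_depth(hosts[x], hosts[y]) for x, y in combinations(group, 2))
    return min(combinations(sorted(hosts), k), key=worst)
```

Two hosts in the same building share four levels (country, state, city, building), while hosts in different states share only one, so the second pairing is preferred. The exhaustive search over combinations is only practical for small clusters.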

    Network traffic routing
    56.
    Type: Granted invention patent

    Publication number: US09448898B2

    Publication date: 2016-09-20

    Application number: US13940001

    Filing date: 2013-07-11

    Abstract: A service appliance is installed between production servers running service applications and service users. The production servers and their service applications provide services to the service users. In the event that a production server is unable to provide its service to users, the service appliance can transparently intervene to maintain service availability. To maintain transparency to service users and service applications, service users are located on a first network and production servers are located on a second network. The service appliance assumes the addresses of the service users on the second network and the addresses of the production servers on the first network. Thus, the service appliance obtains all network traffic sent between the production server and service users. While the service application is operating correctly, the service appliance forwards network traffic between the two networks using various network layers.

    System and method for performing replica copying using a physical copy mechanism
    58.
    Type: Granted invention patent
    Status: Active

    Publication number: US09372911B2

    Publication date: 2016-06-21

    Application number: US14281508

    Filing date: 2014-05-19

    CPC classification number: G06F17/30584 G06F11/1471 G06F11/2023 G06F11/2094

    Abstract: A system that implements a data storage service may maintain tables in a data store on behalf of clients. The service may maintain table data in multiple replicas of partitions of the data that are stored on respective computing nodes in the system. In response to detecting a failure or fault condition, or receiving a service request from a client to move or copy a partition replica, the data store may copy a partition replica to another computing node using a physical copy mechanism. The physical copy mechanism may copy table data from physical storage locations in which it is stored to physical storage locations allocated to a destination replica on the other computing node. During copying, service requests to modify table data may be logged and applied to the replica being copied. A catch-up operation may be performed to apply modification requests received during copying to the destination replica.
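
The copy-then-catch-up pattern in this abstract can be sketched in a few lines. This is an illustrative model only: a dictionary copy stands in for the physical (storage-level) copy, the writes that arrive during copying are passed in explicitly to keep the example deterministic, and `PartitionReplica` and `physical_copy` are hypothetical names.

```python
class PartitionReplica:
    """In-memory stand-in for one replica of a table partition."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.copying = False
        self.pending = []  # modifications logged while a copy is in progress

    def write(self, key, value):
        # Apply the modification; also log it if a copy is under way.
        self.data[key] = value
        if self.copying:
            self.pending.append((key, value))


def physical_copy(source, writes_during_copy=()):
    """Copy a replica, then replay the writes that arrived during the copy."""
    source.copying = True
    source.pending = []
    snapshot = dict(source.data)  # stands in for the storage-level copy
    destination = PartitionReplica(snapshot)
    for key, value in writes_during_copy:  # simulated client requests mid-copy
        source.write(key, value)
    source.copying = False
    for key, value in source.pending:  # catch-up operation
        destination.write(key, value)
    source.pending = []
    return destination
```

After the catch-up loop the destination holds the same table data as the source, including modifications made while the snapshot was being taken.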
