Apparatus and method for building distributed fault-tolerant/high-availability computer applications
    Type: Invention patent application
    Legal status: In force

    Publication No.: US20050193229A1

    Publication date: 2005-09-01

    Application No.: US11050588

    Filing date: 2005-02-02

    IPC classification: G06F11/00, G06F13/00

    Abstract: A software architecture for developing distributed fault-tolerant systems independent of the underlying hardware architecture and operating system. Systems built from the architecture components are scalable and allow a set of computer applications to operate in fault-tolerant/high-availability mode, in distributed processing mode, or in any of many combinations of distributed and fault-tolerant modes within the same system, without any modification to the architecture components. The architecture defines modular system components that address problems in present systems. A System Controller controls system activation, initial load distribution, fault recovery, load redistribution, and system topology, and implements system maintenance procedures. An Application Distributed Fault-Tolerant/High-Availability Support Module (ADSM) enables one or more applications to operate in the various distributed fault-tolerant modes; the System Controller uses the ADSM's well-defined API to control the state of the application in these modes. A Router component provides transparent communication between applications during fault recovery and topology changes. An Application Load Distribution Module (ALDM) distributes incoming external events to the distributed application. The architecture allows for a Load Manager, which monitors the load on the various copies of the application and maximizes hardware usage through dynamic load balancing. It also allows for a Fault Manager, which performs fault detection, fault location, and fault isolation, and uses the System Controller's API to initiate fault recovery. These components can be combined to achieve a variety of distributed-processing, high-availability system configurations, reducing cost and development time.
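The relationship between the System Controller, the ADSM, and the Fault Manager described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record does not disclose the ADSM API, so the class names, the `set_role` and `recover` methods, the three role values, and the simple active/standby policy are all hypothetical and are not taken from the patent.

```python
from enum import Enum


class Role(Enum):
    # Hypothetical states; the abstract only says the controller
    # "controls the state of the application" through the ADSM API.
    ACTIVE = "active"
    STANDBY = "standby"
    OUT_OF_SERVICE = "out_of_service"


class ADSM:
    """Hypothetical stand-in for the support module attached to one application copy."""

    def __init__(self, node):
        self.node = node
        self.role = Role.OUT_OF_SERVICE

    def set_role(self, role):
        # Assumed API entry point used by the System Controller.
        self.role = role


class SystemController:
    """Hypothetical controller handling activation and fault recovery."""

    def __init__(self, copies):
        self.copies = list(copies)

    def activate(self):
        # Assumed policy: the first copy becomes active, the rest stand by.
        for index, copy in enumerate(self.copies):
            copy.set_role(Role.ACTIVE if index == 0 else Role.STANDBY)

    def recover(self, failed_node):
        """Entry point a Fault Manager might call after isolating a fault.

        Returns the node promoted to active, or None if no promotion happened.
        """
        failed = next(c for c in self.copies if c.node == failed_node)
        was_active = failed.role is Role.ACTIVE
        failed.set_role(Role.OUT_OF_SERVICE)
        if not was_active:
            return None
        for copy in self.copies:
            if copy.role is Role.STANDBY:
                copy.set_role(Role.ACTIVE)
                return copy.node
        return None
```

In this sketch a Fault Manager would call `SystemController.recover("node-a")` once it has located a fault on `node-a`; the controller takes the failed copy out of service and, if that copy was the active one, promotes the first standby copy. Load distribution (ALDM), routing, and load balancing are deliberately left out.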
