Abstract: | As the most popular runtime infrastructure for distributed systems, middleware can be viewed as a collection of common services. Because the development, deployment and maintenance of distributed systems rely largely on middleware services, a failure of these services has a significant impact on the reliability and availability of the whole system. Although recovery-based fault tolerance is an effective way to improve the reliability of middleware services, it is rarely used in practice, mainly because of the high complexity and cost of recovering from correlated failures between interdependent services. In this paper, a framework for detecting and recovering from correlated failures of middleware services in an automated way is presented. First, the problem is investigated from two perspectives, i.e., analyzing the role and impact of middleware services and illustrating a set of correlated failures in J2EE standard services as motivating examples. Then, a general coordinated recovery model is constructed with the elements necessary and sufficient for detecting and recovering from correlated failures in middleware services. The supporting framework is demonstrated on three J2EE application servers in turn, i.e., PKUAS, JBoss and JOnAS, without fundamental modifications. Finally, based on the three enhanced application servers, many cases involving J2EE common services, including the transaction service, database service, naming and directory service, security service and messaging service, are studied. The experimental results show the effectiveness and applicability of the framework presented in this paper. |