Similar Documents
20 similar documents found (search time: 31 ms)
1.
In Internet service fault management based on active probing, uncertainty and noise degrade the quality of fault management. To reduce their impact, this paper analyzes the challenges of Internet service fault management. A bipartite Bayesian network is chosen to model the dependency relationship between faults and probes, a binary symmetric channel is chosen to model noise, and a service fault management approach using active probing is proposed for such an environment. The approach consists of two phases: fault detection and fault diagnosis. In the first phase, we propose a greedy approximation probe selection algorithm (GAPSA), which selects a minimal set of probes while retaining a high probability of fault detection. In the second phase, we propose a fault diagnosis probe selection algorithm (FDPSA), which selects probes to obtain more system information based on the symptoms observed in the previous phase. To deal with the dynamic fault set caused by fault recovery mechanisms, we propose a hypothesis inference algorithm based on fault persistence time statistics (FPTS). Simulation results demonstrate the validity and efficiency of our approach.
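Selecting a minimal probe set that still covers all faults of interest is essentially a greedy set-cover problem. The sketch below illustrates that idea only; the probe names, fault sets, and function are hypothetical and not the paper's GAPSA implementation:

```python
# Greedy probe selection: repeatedly pick the probe that detects the
# most still-uncovered faults (classic greedy set-cover heuristic).
# Probe names and fault sets are illustrative only.

def select_probes(probes, faults):
    """probes: dict probe_name -> set of faults that probe can detect."""
    uncovered = set(faults)
    chosen = []
    while uncovered:
        # Probe covering the most still-uncovered faults.
        best = max(probes, key=lambda p: len(probes[p] & uncovered))
        if not probes[best] & uncovered:
            break  # remaining faults are undetectable by any probe
        chosen.append(best)
        uncovered -= probes[best]
    return chosen, uncovered

probes = {
    "p1": {"f1", "f2"},
    "p2": {"f2", "f3", "f4"},
    "p3": {"f4"},
}
chosen, missed = select_probes(probes, {"f1", "f2", "f3", "f4"})
```

Here two probes suffice: `p2` covers three faults, after which only `f1` remains and `p1` is added.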

2.
Software is critical for Internet service availability since an Internet service may become unavailable due to software faults or software maintenance. In this paper, we propose a framework to allow zero‐loss recovery and online maintenance for Internet services. The framework is based on the virtual machine (VM) technology and a connection migration technique called FT‐TCP. It can recover transient application/operating system faults and it allows fault recovery and online maintenance on a single host. The framework substantially enhances FT‐TCP so that it can be run efficiently in the VM environment. Specifically, we propose techniques to reduce the inter‐VM switches and communication. Moreover, we propose service‐specific optimizations to reduce the recovery time of FT‐TCP. Finally, the framework provides an interface for the service designers to implement more service‐specific optimizations. The framework was implemented by modifying an open source VM monitor, Xen, and the Linux kernel running on top of Xen. The effectiveness and efficiency of the framework were evaluated by running two Internet services, WWW proxy and FTP, on the proposed framework and measuring the impact on their performance. According to the experimental results, our approach causes only slight performance overhead (i.e. less than 4%) and memory overhead (i.e. less than 750 KB) for both the services. Copyright © 2007 John Wiley & Sons, Ltd.

3.
Detecting application-level failures in component-based Internet services
Most Internet services (e-commerce, search engines, etc.) suffer faults. Quickly detecting these faults can be the largest bottleneck in improving availability of the system. We present Pinpoint, a methodology for automating fault detection in Internet services by: 1) observing low-level internal structural behaviors of the service; 2) modeling the majority behavior of the system as correct; and 3) detecting anomalies in these behaviors as possible symptoms of failures. Without requiring any a priori application-specific information, Pinpoint correctly detected 89%-96% of major failures in our experiments, as compared with 20%-70% detected by current application-generic techniques.
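Pinpoint's core principle, treating the majority behavior as correct and flagging deviations, can be illustrated with a toy frequency-based check over observed request paths. This is a minimal sketch of the principle only, not the Pinpoint implementation; the path tuples and threshold are assumptions:

```python
from collections import Counter

def find_anomalies(observed_paths, threshold=0.1):
    """Flag request paths whose frequency falls below `threshold`
    times that of the most common path (majority = correct)."""
    counts = Counter(observed_paths)
    majority = counts.most_common(1)[0][1]
    return [p for p, c in counts.items() if c < threshold * majority]

# 95 requests follow the normal component path, 2 deviate.
paths = [("web", "auth", "db")] * 95 + [("web", "auth", "cache-stale")] * 2
anoms = find_anomalies(paths)
```

The rare path is flagged as a possible failure symptom without any application-specific knowledge.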

4.
Gu Jun, Luo Junzhou, Cao Jiuxin, Li Wei. Chinese Journal of Computers, 2012, 35(2): 2282-2297
Open distributed service platforms tend to cover richer management functions and support stronger decentralized interaction, which makes the difficulty and cost of software management and maintenance increasingly prominent. To address this, a self-managing service platform architectural reference model is introduced, with components as the carriers of functionality, services as the means of organizing functionality, and interactions as the way of extending functionality. A hierarchical-feedback autonomic control architecture is proposed, which takes service components and their interaction relationships as the control objects and an execution-capability model as the basis for decision-making. Markov processes, stochastic Petri nets, and queueing network models are applied to model the availability and performance of management services, taking link and node failure-repair mechanisms into account. Simulation results show that the queueing-Petri-net-based execution-capability model can reflect the impact of failure rates and repair times on platform performance and availability, and verify the positive effect of the autonomic control approach on improving the effectiveness of the service platform.

5.
Assuring high quality of web services, especially regarding service reliability, performance and availability of e-commerce systems (unified under the term performability), has turned into an imperative of the contemporary way of doing business on the Internet. Recognizing the fact that customers’ online shopping behavior is largely affecting the conduct of e-commerce systems, the paper promotes a customer-centric, holistic approach: customers are identified as the most essential “subsystem” with a number of important, but less well-understood behavioral factors. The proposed taxonomy of customers and the specification of operational profiles is a basis to building predictive models, usable for evaluating a range of performability measures. The hierarchical composition of sub-models utilizes the semantic power of deterministic and stochastic Petri nets, in conjunction with discrete-event simulation. A handful of variables are identified in order to turn performability measures into business-oriented performance metrics, as a cornerstone for conducting relevant server sizing activities.

6.
A key issue that needs to be addressed while performing fault diagnosis using black box models is that of robustness against abrupt changes in unknown inputs. A fundamental difficulty with the robust FDI design approaches available in the literature is that they require some a priori knowledge of the model for unmeasured disturbances or modeling uncertainty. In this work, we propose a novel approach for modeling abrupt changes in unmeasured disturbances when innovation form of state space model (i.e. black box observer) is used for fault diagnosis. A disturbance coupling matrix is developed using singular value decomposition of the extended observability matrix and further used to formulate a robust fault diagnosis scheme based on generalized likelihood ratio test. The proposed modeling approach does not require any a priori knowledge of how these faults affect the system dynamics. To isolate sensor and actuator biases from step jumps in unmeasured disturbances, a statistically rigorous method is developed for distinguishing between faults modeled using different number of parameters. Simulation studies on a heavy oil fractionator example show that the proposed FDI methodology based on identified models can be used to effectively distinguish between sensor biases, actuator biases and other soft faults caused by changes in unmeasured disturbance variables. The fault tolerant control scheme, which makes use of the proposed robust FDI methodology, gives significantly better control performance than conventional controllers when soft faults occur. The experimental evaluation of the proposed FDI methodology on a laboratory scale stirred tank temperature control set-up corroborates these conclusions.

7.
Demands on software reliability and availability have increased tremendously due to the nature of present day applications. We focus on the software aspect of high availability for application servers, since server unavailability more often originates from software faults than from hardware faults. The software rejuvenation technique has been widely used to avoid the occurrence of unplanned failures, mainly due to the phenomena of software aging or caused by transient failures. In this paper, we first present a new virtual-machine-based software rejuvenation approach, named VMSR, to offer high availability for application server systems. Second, we model a single physical server hosting multiple virtual machines (VMs) under the VMSR framework using stochastic modeling, and evaluate it through both numerical analysis and simulation with the SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) tool. The VMSR model is very general and can capture application server characteristics, failure behavior, and performability measures. Our results demonstrate that the VMSR approach is a practical way to ensure uninterrupted availability and to optimize performance for aging applications. This research was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) under Grant No. KRF2007-210-D00006.

8.
This paper proposes a novel methodology and an architectural framework for handling multiple classes of faults (namely, hardware-induced software errors in the application, process and/or host crashes or hangs, and errors in the persistent system stable storage) in a COTS and legacy-based application. The basic idea is to use an evidence-accruing fault tolerance manager to choose and carry out one of multiple fault recovery strategies, depending upon the perceived severity of the fault. The methodology and the framework have been applied to a case study system consisting of a legacy system, which makes use of a COTS DBMS for persistent storage facilities. A thorough performability analysis has also been conducted via combined use of direct measurements and analytical modeling. Experimental results demonstrate that effective fault treatment, consisting of careful diagnosis and damage assessment, plays a key role in leveraging the dependability of COTS and legacy-based applications.

9.
Computer systems reliability/availability modeling deals with the representation of changes in the structure of the system being modeled, which are generally due to faults, and how such changes affect the availability of the system. On the other hand, performance modeling involves representing the probabilistic nature of user demands and predicting the system capacity to perform useful work, under the assumption that the system structure remains constant. With the advent of degradable systems, the system may be restructured in response to faults and may continue to perform useful work, even though operating at lower capacity. Performability modeling considers the effect of structural changes and their impact on the overall performance of the system. The complexity of current computer systems and the variety of different problems to be analyzed, including the simultaneous evaluation of performance and availability, demonstrate the need for sophisticated tools that allow the specification of general classes of problems while incorporating powerful analytic and/or simulation techniques. Concerning model specification, a recently proposed object oriented modeling paradigm that accommodates a wide variety of applications is discussed and compared with other approaches. With respect to solution methods, a brief overview of past work on performability evaluation of Markov models is presented. Then it is shown that many performability related measures can be calculated using the uniformization or randomization technique by coloring distinguished states and/or transitions of the Markov model of the system being studied. Finally, the state space explosion problem is addressed and several techniques for dealing with the problem are discussed.
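The uniformization (randomization) technique mentioned above converts a continuous-time Markov chain with generator Q into a discrete-time chain plus a Poisson-weighted sum of its powers. A minimal numerical sketch, using an assumed two-state availability model rather than any example from the paper:

```python
import math

def transient_probs(Q, p0, t, tol=1e-10):
    """Transient state probabilities of a CTMC via uniformization.
    Q: generator matrix (list of lists), p0: initial distribution."""
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))   # uniformization rate
    # Embedded DTMC: P = I + Q / lam
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    term = list(p0)              # p0 * P^k, updated iteratively
    weight = math.exp(-lam * t)  # Poisson(k; lam*t) probability, k = 0
    result = [weight * x for x in term]
    k = 0
    while weight > tol or k < lam * t:
        k += 1
        term = [sum(term[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k    # next Poisson weight
        result = [r + weight * x for r, x in zip(result, term)]
    return result

# Toy availability model: state 0 = up (fails at rate 0.1),
# state 1 = down (repaired at rate 1.0).
Q = [[-0.1, 0.1], [1.0, -1.0]]
p = transient_probs(Q, [1.0, 0.0], t=10.0)
```

By t = 10 the chain is essentially at steady state, so the up-state probability approaches 1.0/1.1 ≈ 0.909.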

10.
Reliability is an important criterion to facilitate extensive deployment of web service technology for commercial business applications. Run-time monitoring and fault management of web services are essential to ensure uninterrupted and continuous availability of web services. This paper presents WISDOM (Web Service Diagnoser Model), a generic architecture for detecting faults during execution of web services. Policies have been proposed to describe the intended behavior of web services, and faulty behavior is detected as deviations or inconsistencies with respect to the specified behavior. The model proposes the use of monitoring components in service registries and service providers to detect run-time faults during publishing, discovery, binding and execution of web services. An independent fault diagnoser is proposed to coordinate the individual monitoring components and also act as a repository for the specified web service policies. The proposed model has been tested with a sample web service application and the results obtained are presented.

11.
Design and implementation of an agent-based high-performance directory service
Directory services can effectively solve the problem of managing the dynamic and diverse resources of virtual organizations. In a distributed system environment, the availability and security of the directory service must be guaranteed. This paper proposes an agent-based directory service model, set against the background of virtual-organization-based distributed business applications; while improving the directory service's capability to protect itself, the model also keeps the directory service stable and efficient.

12.
Gu Jun, Luo Junzhou, Cao Jiuxin, Li Wei. Journal of Software, 2013, 24(4): 696-714
Composite services running in the Internet environment are prone to failures caused by resource faults and component failures. Existing failure-recovery measures improve service availability but also have a negative impact on service performance. To quantify the performance of composite services under recoverable failures, this paper presents a performance analysis model that takes failure recovery into account, combining composite-service failure types with recovery strategies. Queueing Petri nets (QPNs) are used to describe failure occurrence and recovery handling in composite services, with a focus on service operation under the retry and substitution strategies. The internal structure of the failure-recovery-aware QPN models of service nodes and links is described in detail, and on this basis a performance model of decentralized composite-service execution is constructed through service interaction mechanisms. Finally, the QPME tool is used to simulate and compare the performance of the composite-service model under different failure rates, failure-type distributions, and recovery strategies. The results show that this method can quantitatively analyze the impact of failure recovery on composite-service performance, and can help guide the design of failure-recovery strategies for information service systems in uncertain network environments.

13.
With the rapid development of Internet services, distributed microservice applications have gradually replaced traditional monolithic applications as one of the main forms of Internet applications. While microservice applications offer scalability, fault tolerance, and high availability, they also bring challenges such as cumbersome construction, complex deployment, and difficult maintenance. Monitoring and operating microservices in cloud computing environments is an active research topic, but existing approaches still suffer from coarse granularity and inaccurate fault localization. To address these problems, this paper proposes a pattern-matching-based microservice fault diagnosis method. First, request traffic is forwarded through an injected proxy to collect and model microservice trace information. Then, state information is collected while the system runs normally, and known faults are injected to collect and characterize the application's runtime state after a fault occurs. Finally, the execution traces of unknown faults are matched against those of known faults, using string edit distance to measure similarity and diagnose the probable cause. Experimental results show that the method can effectively characterize the execution traces of request processing and accurately localize the causes of application faults at microservice granularity.
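The edit-distance matching step in the diagnosis method above can be sketched as follows. The trace encodings and fault names are hypothetical, and the Levenshtein routine is a standard textbook implementation, not the paper's code:

```python
def edit_distance(a, b):
    """Levenshtein distance between two trace strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def diagnose(trace, known_faults):
    """Match an unknown trace against known fault traces; return the
    fault whose recorded trace is most similar (smallest distance)."""
    return min(known_faults,
               key=lambda f: edit_distance(trace, known_faults[f]))

# Each letter encodes one service call in a request's trace.
known = {"db-timeout": "ABCX", "cache-miss": "ABYC", "normal": "ABC"}
fault = diagnose("ABCXX", known)
```

The unknown trace "ABCXX" is one edit away from the recorded "db-timeout" trace, so that fault is reported as the probable cause.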

14.
Cloud storage has become a fundamental technology for shared storage and data services on today's Internet. Cloud storage systems commonly use data replication to improve data availability, strengthen fault tolerance, and enhance system performance. This paper proposes a cluster-based data replication strategy for cloud storage systems, covering when to trigger replication, how many replicas to create, and where to place the resulting replicas. For replica placement, a cluster-based load-balancing placement method is designed. Simulation experiments show that the proposed cluster-based load-balancing placement method is feasible and performs well.
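One simple way to combine clustering with load-balanced placement is to spread replicas across clusters and, within each chosen cluster, pick the least-loaded node. This is an illustrative policy only, assumed for the sketch rather than taken from the paper:

```python
def place_replicas(clusters, k):
    """Place k replicas, at most one per cluster, each on the
    least-loaded node of its cluster (illustrative policy only).
    clusters: dict cluster_name -> dict node_name -> current load."""
    placements = []
    # Prefer clusters whose least-loaded node is lightest overall.
    ranked = sorted(clusters.items(),
                    key=lambda kv: min(kv[1].values()))
    for name, nodes in ranked[:k]:
        node = min(nodes, key=nodes.get)  # least-loaded node in cluster
        placements.append((name, node))
        nodes[node] += 1                  # account for the new replica
    return placements

clusters = {
    "c1": {"n1": 3, "n2": 1},
    "c2": {"n3": 0, "n4": 5},
    "c3": {"n5": 2},
}
p = place_replicas(clusters, 2)
```

Spreading replicas across clusters also improves fault tolerance, since a whole-cluster failure loses at most one replica.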

15.
The problem of fault tolerance in cooperative manipulators rigidly connected to an undeformable load is addressed in this paper. Four categories of faults are considered: free-swinging joint faults (FSJFs), locked joint faults (LJFs), incorrectly measured joint position faults (JPFs), and incorrectly measured joint velocity faults (JVFs). Free-swinging and locked joint faults are detected via artificial neural networks (ANNs). Incorrectly measured joint position and velocity faults are detected by considering the kinematic constraints of the cooperative system. When a fault is detected, the control system is reconfigured according to the nature of the isolated fault and the task is resumed to the largest extent possible. The fault tolerance framework is applied to an actual system composed of two cooperative robotic manipulators. The results presented demonstrate the feasibility and performance of the methodology.

16.
A fast and reliable fault-tolerance method for service composition in converged network environments
To address the inefficiency of traditional fault-tolerance methods for service composition in converged network environments, this paper proposes a fast and reliable fault-tolerance method for service composition. The method first applies fuzzy logic to retry services that suffer transient faults; it then applies multi-attribute decision theory to replicate services that suffer permanent faults; finally, it compensates for permanent faults through an improved particle swarm optimization algorithm. Experimental results on real data sets show that the proposed method outperforms other methods in fault removal rate, fault handling time, and optimality of the composition.

17.
As computing and communication systems become physically and logically more complex, their evaluation calls for continued innovation with regard to measure definition, model construction/solution, and tool development. In particular, the performance of such systems is often degradable, i.e., internal or external faults can reduce the quality of a delivered service even though that service, according to its specification, remains proper (failure-free). The need to accommodate this property, using model-based evaluation methods, was the raison d'être for the concept of performability. To set the stage for additional progress in its development, we present a retrospective of associated theory, techniques, and applications resulting from work in this area over the past decade and a half. Based on what has been learned, some pointers are made to future directions which might further enhance the effectiveness of these methods and broaden their scope of applicability.

18.

In recent years, data centers (DCs) have evolved considerably, driven by the advent of cloud computing, e-commerce, services aimed at social networks, and big data. Such architectures demand high availability, reliability, and performance at satisfactory service levels, yet these requirements are often neglected because of their high cost. In addition, techniques capable of promoting greater environmental sustainability are most often forgotten in the design phase of such architectures. Approaches to perform an integrated assessment of dependability attributes for DCs are, in general, not trivial. Thus, this work presents the dependability (availability and reliability), performability, and sustainability parameters that need special attention when implementing a cooling subsystem in DCs, one of the largest cost generators for these infrastructures. In this study, we use the hypothetical-deductive method through a quantitative and qualitative approach; as for the procedure, it is bibliographical research based on a review of scientific studies, and the research objectives are exploratory in nature. The results show that, among all the papers selected and analyzed in this systematic literature review (SLR), none jointly address performability, dependability, and sustainability in cooling systems for DCs. The main results of this work are presented through research questions, as they provide evidence of gaps to be addressed in the area.
The four research questions point out challenges in implementing cooling systems in DCs; present the techniques and methods most used to propose or analyze data center cooling infrastructures; address the essential sustainability requirements for cooling subsystems; and present open questions in the area of sustainable DC cooling, concerning both the data center's cooling itself and the difficulty of incorporating dependability attributes in the environmental context. Beyond these results, the present study actively contributes to the concept of a "green data center" for companies, which ranges from the choice of renewable energy sources to more efficient information technology equipment. This shows the relevance and originality of this SLR and its results.


19.
Recursive Distributed Rendezvous (ReDiR) is a service discovery mechanism for Distributed Hash Table (DHT) based Peer-to-Peer (P2P) overlay networks. One of the major P2P systems that has adopted ReDiR is Peer-to-Peer Session Initiation Protocol (P2PSIP), which is a distributed communication system being standardized in the P2PSIP working group of the Internet Engineering Task Force (IETF). In a P2PSIP overlay, ReDiR can be used for instance to discover Traversal Using Relays around NAT (TURN) relay servers needed by P2PSIP nodes located behind a Network Address Translator (NAT). In this paper, we study the performance of ReDiR in a P2PSIP overlay network. We focus on metrics such as service lookup and registration delays, failure rate, traffic load, and ReDiR’s ability to balance load between service providers and between nodes storing information about service providers.

20.
Novella. Computer Networks, 2006, 50(18): 3763-3783
Performance critical services over the Internet often rely on geographically distributed architectures of replicated servers. Content Delivery Networks (CDN) are a typical example where service is based on a distributed architecture of replica servers to guarantee resource availability and proximity to final users. In such distributed systems, network links are not dedicated, and may be subject to external traffic. This brings up the need to develop access control policies that adapt to changing network load conditions. Further, Internet services are mainly session based, thus an access control support must take into account a proper differentiation of requests and perform session based decisions while considering the dynamic availability of resources due to external traffic. In this paper we introduce a distributed architecture with access control capabilities at session aware access points. We consider two types of services characterized by different patterns of resource consumption and priorities. We formulate a Markov Modulated Poisson Decision Process for access control that captures the heterogeneity of multimedia services and the variable availability of resources due to external traffic. The proposed model is optimized by means of stochastic analysis, showing the impact of external traffic on service quality. The structural properties of the optimal solutions are studied and considered as the basis for the formulation of heuristics. The performance of the proposed heuristics is studied by means of simulations, showing that in some typical scenarios they perform close to the optimum.
