Similar literature
1.
This paper discusses several approaches to designing and implementing shared‐memory communication protocol modules for the message‐passing interface (MPI) libraries, colloquially called ‘shared‐memory devices’. The authors present a new taxonomy for classifying designs for shared‐memory MPI communication devices and formulate design evaluation criteria. Using these criteria, the authors compare three existing shared‐memory devices for MPICH and choose the best one. The authors also present experimental results that support their choice. The contributions of this paper are three‐fold. First, the authors present the taxonomy for shared‐memory communication devices. Second, they show advantages and potential problems of the devices that belong to different classes of their taxonomy using the formulated design criteria. Third, they analyze communication performance of existing MPICH shared‐memory devices, discuss optimizations of their performance, and show the performance gains that these optimizations yield. MPICH is used for comparison, since it is a widely used MPI implementation. Copyright © 2000 John Wiley & Sons, Ltd.

2.
As an object‐oriented programming language and a platform‐independent environment, Java has been attracting much attention. However, the trade‐off between portability and performance has not spared Java. The initial performance of Java programs has been poor, due to the interpretive nature of the environment. In this paper we present the communication performance results of three different types of message‐passing programs: native, Java and native communications, and pure Java. Despite concerns about performance and numerical issues, we believe the obtained results confirm that high‐performance parallel computing in Java is possible, as the technology matures and the approach is pragmatic. Copyright © 2000 John Wiley & Sons, Ltd.

3.
In a distributed computing system, message logging is widely used for providing nodes with recoverability. To reduce the piggyback overhead of traditional causal message logging, we present a zoning causal message logging approach in this paper. The crux of the approach is to control the propagation of dependency information: the nodes in the system are divided into zones, and by a message fragment mechanism, the dependency information of a node is only visible in the zone scope. Simulation results show that the piggyback overhead of the proposed approach is lower than that of traditional causal message logging.

4.
Parallel programs present features such as concurrency, communication and synchronization that make testing a challenging activity. Because of these characteristics, the direct application of traditional testing is not always possible, and adequate testing criteria and tools are necessary. In this paper we investigate the challenges of validating message‐passing parallel programs and present a set of specific testing criteria. We introduce a family of structural testing criteria based on a test model. The model captures control and data flow of the message‐passing programs, by considering their sequential and parallel aspects. The criteria provide a coverage measure that can be used for evaluating the progress of the testing activity and also provide guidelines for the generation of test data. We also describe a tool, called ValiPar, which supports the application of the proposed testing criteria. Currently, ValiPar is configured for parallel virtual machine (PVM) and message‐passing interface (MPI). Results of the application of the proposed criteria to MPI programs are also presented and analyzed. Copyright © 2008 John Wiley & Sons, Ltd.

5.
Developing high‐quality, error‐free message‐passing concurrent programs is not trivial. Although a number of different primitives with associated semantics are available to assist such development, they often increase the complexity of the testing process. In this paper, we extend our previous test model for message‐passing programs and present new structural testing criteria, taking into account additional features used in this paradigm, such as collective communication, non‐blocking sends, distinct semantics for non‐blocking receives, and persistent operations. Our new model also recognizes that sender primitives cannot always be matched with every receive primitive. This improvement allows us to remove statically a significant number of infeasible synchronization edges that would otherwise have to be analyzed later by the tester. In this paper, the test model is presented using the Message‐Passing Interface standard; however, our new model has been designed to be flexible, and it can be configured to support a range of different message‐passing environments or languages. We have carried out case studies showing the applicability of the new test model to represent message‐passing programs and also to reveal errors, mainly those errors related to inter‐process communication. In addition to increasing the number of features supported by the test model, we have also reduced the overall cost of testing significantly. Our case studies suggest that the number of synchronization edges can be reduced by up to 93%, mainly by eliminating infeasible edges between unmatchable communication primitives. The main contribution of the paper is to present a more flexible test model that provides improved coverage for message‐passing programs and at the same time reduces the cost of testing significantly. Copyright © 2012 John Wiley & Sons, Ltd.
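The idea of statically discarding synchronization edges between unmatchable primitives can be illustrated with a minimal sketch. It is not the authors' test model: it assumes only the standard MPI point-to-point matching rule (destination, source, and tag must agree, with wildcards allowed on the receive side) and ignores communicators, collectives, and ordering. The names `Send`, `Recv`, `can_match`, and `sync_edges` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ANY = None  # stands in for MPI_ANY_SOURCE / MPI_ANY_TAG in this sketch


@dataclass(frozen=True)
class Send:
    proc: int  # rank that issues the send
    dest: int  # destination rank
    tag: int


@dataclass(frozen=True)
class Recv:
    proc: int               # rank that issues the receive
    source: Optional[int]   # expected sender rank, or ANY
    tag: Optional[int]      # expected tag, or ANY


def can_match(s: Send, r: Recv) -> bool:
    """True if the send could ever be delivered to this receive."""
    return (
        s.dest == r.proc
        and r.source in (ANY, s.proc)
        and r.tag in (ANY, s.tag)
    )


def sync_edges(sends, recvs):
    """Keep only the feasible (send, receive) synchronization edges."""
    return [(s, r) for s in sends for r in recvs if can_match(s, r)]
```

A naive model would create one edge per (send, receive) pair; filtering with `can_match` leaves only the pairs a tester actually needs to consider.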

6.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.
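The protocol steps described in this abstract can be sketched as a toy, single-threaded simulation. This is an illustration under stated assumptions, not the authors' algorithm: the `Process` class and all its fields are hypothetical, the application state is just a count of processed messages, "stable storage" is an in-memory list, and the control messages mentioned in the abstract are omitted, so progress depends entirely on piggy-backed information.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    pid: int
    n: int                              # total number of processes
    sn: int = 0                         # sequence number of the latest checkpoint
    state: int = 0                      # toy state: number of processed messages
    tentative: Optional[dict] = None    # tentative checkpoint held in memory
    known: set = field(default_factory=set)    # pids known to have checkpointed at sn
    log: list = field(default_factory=list)    # in-memory message log
    stable: list = field(default_factory=list)  # stand-in for stable storage

    def _take_tentative(self, sn):
        self.sn = sn
        self.tentative = {"sn": sn, "state": self.state}
        self.known = {self.pid}
        self.log = []

    def initiate(self):
        """Independently start a new global checkpoint."""
        self._take_tentative(self.sn + 1)

    def send(self, payload):
        if self.tentative is not None:
            self.log.append(("sent", payload))
        # Piggy-back the sequence number and the known set on the message.
        return {"sn": self.sn, "known": set(self.known), "payload": payload}

    def receive(self, msg):
        self.state += 1                       # process the message first
        if msg["sn"] > self.sn:
            self._take_tentative(msg["sn"])   # checkpoint AFTER processing
        elif self.tentative is not None:
            self.log.append(("recv", msg["payload"]))
        if msg["sn"] == self.sn:
            self.known |= msg["known"]
        if self.tentative is not None and len(self.known) == self.n:
            # Everyone has a tentative checkpoint: flush checkpoint and log.
            self.stable.append((self.tentative, self.log))
            self.tentative, self.log = None, []
```

In this sketch, two processes that call `initiate()` concurrently from the same sequence number both move to `sn + 1`, so their tentative checkpoints end up in the same global checkpoint, mirroring the property the abstract describes.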

7.
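Design and Implementation of the Message Mechanism of a Distributed Real-Time Operating System   Total citations: 1 (self-citations: 1, citations by others: 0)
With the rapid development of digital signal processing technology, a high-performance distributed real-time operating system, the Tengfei distributed real-time operating system (TF-RTOS), was developed in-house for parallel digital signal processing (DSP) applications to meet user needs. The message mechanism, which is used for communication between threads, is an important part of an operating system. During the development of TF-RTOS, a direct message-passing mechanism was designed and implemented in terms of four aspects: message command packets, message queues, the message-passing procedure, and message primitives. This mechanism simplifies inter-thread communication, extends system functionality, and improves system performance.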

8.
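MPI is the most widely used programming environment on large-scale clusters and grid platforms, but its runtime environment often fails because of node or network faults, so a fault-tolerance mechanism for MPI programming is needed. This paper analyzes the key techniques for making MPI programs fault tolerant and, for Linux clusters running MPICH-P4, proposes a concrete implementation scheme for an MPI fault-tolerance system. The scheme uses checkpointing and message logging and is built by modifying and extending the P4 communication library underlying MPI.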

9.
We develop a mixed graph and optimal control theoretic formulation to design a robust cooperative control protocol for a large‐scale multiagent system with partially known interconnected first‐, second‐, or mixed first‐ and second‐order dynamics. In each case, we transform the control protocol design task to a robust communication graph design problem, which, from a cyber‐physical perspective, is interpreted as the control layer design problem for an interconnected system with unknown agent layer dynamics. According to this viewpoint, each state variable has its own control layer communication topology separate from the other state variable's communication topology and the unknown agent layer interconnection topologies. We prove that all cooperative, decentralized, and centralized tracking protocols can be treated as a single design problem and, by deriving closed‐form solutions for the robust control layer topologies, we further provide a simpler design procedure, which is based only on matrix manipulations. Aside from the linear implementation of the protocol and the connection of the proposed formulation to the well‐known rules of thumb in optimal control theory, this creates a higher potential to transfer ideas to industry. The modeling uncertainties tolerable by a given control layer topology are analyzed, and a preliminary performance‐oriented analysis and design approach for large‐scale interconnected systems is discussed. We show that exactly the same steps can be followed to design appropriate control layers for both tracking and stabilization.

10.
We discuss the parallelization of an efficient algorithm for the partial stabilization of large‐scale linear control systems in generalized state‐space form. The algorithm is composed of highly parallel iterative schemes that appear in the computation of certain matrix functions. Here we evaluate different approaches to exploit parallelism at two levels, based on threads and processes. Our experimental results on a cluster of symmetric multiprocessors and a CC‐NUMA platform show that the efficiency of the matrix operations underlying the iterative schemes carries over to the parallel implementation of the stabilization algorithm. Copyright © 2006 John Wiley & Sons, Ltd.

11.
Checkpointing has a crucial impact on systems' performance and fault‐tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response‐time and fault‐tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision‐making procedure. We adopt a simulation‐based approach for obtaining performance metrics that are afterwards used for determining a trade‐off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault‐tolerance mechanisms (replication‐based or message‐logging‐based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT‐CORBA), but it could also be applied in other (e.g. process‐based) computational models. Copyright © 2006 John Wiley & Sons, Ltd.

12.
With the increasing uniprocessor and symmetric multiprocessor computational power available today, interprocessor communication has become an important factor that limits the performance of clusters of workstations/multiprocessors. Many factors including communication hardware overhead, communication software overhead, and the user environment overhead (multithreading, multiuser) affect the performance of the communication subsystems in such systems. A significant portion of the software communication overhead is due to a number of message copying operations. Ideally, it is desirable to have a true zero‐copy protocol where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, because message‐passing applications at the send side do not know the final receive buffer addresses, early‐arriving messages have to be buffered in a temporary area. In this paper, we show that there is a message reception communication locality in message‐passing applications. We have utilized this communication locality and devised different message predictors at the receiver sides of communications. In essence, these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not yet been posted. The performance of these predictors, in terms of hit ratio, on some parallel applications is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies. We also show that the proposed predictors are not sensitive to the starting message reception call, and that they perform better than (or at least equal to) our previously proposed predictors. Copyright © 2002 John Wiley & Sons, Ltd.
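To make the notion of a receiver-side predictor and its hit ratio concrete, here is a minimal sketch. It does not reproduce any predictor from the paper, whose designs the abstract does not specify; it assumes a simple hypothetical "last successor" heuristic that predicts the next sender to be whichever sender followed the current one the last time it was seen. The function name `hit_ratio` and the use of sender ranks as the predicted quantity are assumptions.

```python
def hit_ratio(senders):
    """Fraction of correct predictions over a trace of sender ranks.

    A prediction is made only when the previous sender has been seen
    before; the predicted next sender is its most recent successor.
    """
    successor = {}          # sender -> sender that followed it last time
    hits = predictions = 0
    prev = None
    for s in senders:
        if prev is not None:
            if prev in successor:
                predictions += 1
                hits += successor[prev] == s
            successor[prev] = s   # learn from the observed transition
        prev = s
    return hits / predictions if predictions else 0.0
```

A trace with a repeating reception pattern, which is the kind of locality the abstract refers to, is predicted perfectly once the first cycle has been observed, while an irregular trace scores lower.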

13.
This paper describes key aspects of remote service invocation in federations of OSGi containers. It refers to the OSGi Remote Service Admin specification and describes its efficient implementation over message‐oriented middleware. Scalability problems of several different approaches to implementation are identified, and a solution in the form of an innovative Remote Service Admin model extension is proposed. The extension, named On‐demand Remote Service Admin, is analyzed and validated in the context of a motivating scenario. Validation includes performance and scalability evaluation, which confirms that all assumed requirements have been satisfied by the constructed prototype. Finally, the presented research is compared with related works. Copyright © 2011 John Wiley & Sons, Ltd.

14.
In this paper, an adaptive decentralized tracking control scheme is designed for large‐scale nonlinear systems with input quantization, actuator faults, and external disturbance. The nonlinearities, time‐varying actuator faults, and disturbance are assumed to have unknown upper and lower bounds. Then, an adaptive decentralized fault‐tolerant tracking control method is designed without using the backstepping technique or neural networks. In the proposed control scheme, adaptive mechanisms are used to compensate for the effects of unknown nonlinearities, input quantization, actuator faults, and disturbance. The designed adaptive control strategy can guarantee that all the signals of each subsystem are bounded and the tracking errors of all subsystems converge asymptotically to zero. Finally, simulation results are provided to illustrate the effectiveness of the designed approach.

15.
In this paper, we investigate the global decentralized sampled‐data output feedback stabilization problem for a class of large‐scale nonlinear systems with time‐varying sensor and actuator failures. The considered systems include unknown time‐varying control coefficients and inherently nonlinear terms. Firstly, coordinate transformations are introduced with suitable scaling gains. Next, a reduced‐order observer is designed to estimate unmeasured states. Then, a decentralized sampled‐data fault‐tolerant control scheme is developed with an allowable sampling period. By constructing an appropriate Lyapunov function, it can be shown that all states of the resulting closed‐loop system are globally uniformly ultimately bounded. Finally, the validity of the proposed control approach is verified by using two examples.

16.
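Wang Zhiyuan, Yang Xuejun, Zhou Yun. Journal of Software (《软件学报》), 2012, 23(4): 1022-1035
As system scale grows, the performance of parallel computing keeps improving, but reliability keeps declining, so some fault-tolerance mechanism is needed to tolerate or recover from hardware faults and data errors. The commonly used mechanisms, Checkpoint/Restart and multi-modular redundancy, both introduce extra overhead, which to some degree limits the scalability of parallel computing. With demand for high-performance computing growing steadily, designing scalable fault-tolerance mechanisms is therefore especially urgent and important. Taking triple modular redundancy (TMR) as a typical case, the paper describes how traditional TMR is implemented for large-scale MPI parallel computing, analyzes the practical problems the mechanism faces, and shows that traditional TMR limits the scaling of parallel computing. To address these problems, scalable triple modular redundancy (STMR) is designed, and its effectiveness and scalability are verified. STMR not only handles the fail-stop faults targeted by Checkpoint/Restart but also resolves the great majority of data errors that hardware cannot detect directly. Finally, a simulation using the system parameters of BlueGene/L predicts how the scalability of parallel computing changes as system scale increases under TMR and under STMR; the results further confirm that STMR is a scalable fault-tolerance mechanism.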
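The core of traditional triple modular redundancy is a majority vote over three replica results, which can be sketched as follows. This illustrates only the generic voting step, not the paper's MPI implementation or its STMR design; the function name `tmr_vote` is hypothetical, and results are assumed to be hashable values that compare equal when correct.

```python
from collections import Counter


def tmr_vote(results):
    """Return the value agreed on by at least two of three replicas.

    Raises RuntimeError when all three replicas disagree, since no
    majority exists and the error cannot be masked.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value
    raise RuntimeError("no majority among replica results")
```

A single corrupted replica is outvoted, which is how TMR masks data errors that the hardware does not report; the cost is running every computation three times, the overhead the abstract identifies as limiting scalability.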

17.
A variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long‐running applications that require a reliable and trustworthy system upon which they can be executed. Cluster systems provide an excellent environment upon which to run these applications because of their low cost to performance ratio; however, because they are built from commodity components, they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long‐running applications. Several fault tolerance approaches were investigated; however, it was found that rollback‐recovery provides a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback‐recovery: checkpointing and recovery. It was shown here that a multitude of work has been done for enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include: providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.

18.
In this paper, a general method is developed to generate a stable adaptive fuzzy semi‐decentralized control for a class of large‐scale interconnected nonlinear systems with unknown nonlinear subsystems and unknown nonlinear interconnections. In the developed control algorithms, fuzzy logic systems, using fuzzy basis functions (FBF), are employed to approximate the unknown subsystems and interconnection functions without imposing any constraints or assumptions about the interconnections. The proposed controller consists of primary and auxiliary parts: both direct and indirect adaptive approaches for the primary control part aim to maintain closed‐loop stability, whereas the auxiliary control part is designed to attenuate the fuzzy approximation errors. By using the Lyapunov stability method, the proposed semi‐decentralized adaptive fuzzy control system is proved to be globally stable, with tracking errors converging to a desired performance. Simulation examples are presented to illustrate the effectiveness of the proposed controller. Copyright © 2006 John Wiley & Sons, Ltd.

19.
This paper describes our work to improve the performance of distributed applications. We aim at certain application characteristics such as balancing load, allowing separately written applications to work better together, allowing a distributed application to adapt its behavior in more flexible ways, and so on. Our approach is to write application‐specific schedulers, which can access the global state of the application in making scheduling decisions. To achieve this goal, we extended our earlier work on CATAPULTS (Creating And Testing APplication‐specific User Level Thread Schedulers), a domain‐specific language for creating and testing application‐specific user‐level thread schedulers, to distributed applications by adding ‘master schedulers’ for dealing with the distributed parts of applications. This paper presents our design of, experimentation with, and implementation of distributed CATAPULTS. This paper presents several realistic examples to measure the feasibility of this approach, specifically: a website application, an embedded application, and load balancing. Each example has a scheduling goal for which we developed a customized scheduler. We measured the performance with and without the customized scheduler. The customized scheduler for each example was fairly straightforward to develop and each achieved its scheduling goal. Copyright © 2011 John Wiley & Sons, Ltd.

20.
Pierre Sens, Bertil Folliot. Software, 1998, 28(10): 1079-1099
This paper presents the design, implementation and performance evaluation of a software fault manager for distributed applications. Dubbed Star, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, Star implements non-blocking and incremental checkpointing to perform an efficient backup of process state. Moreover, Star is application independent and highly configurable. Star currently runs on top of SunOS and is easily portable to UNIX™-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can efficiently be implemented in a standard networked environment. © 1998 John Wiley & Sons, Ltd.
